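It has been a while since I last wrote a blog post, so let's get straight to the point.
We all use Google search a lot. How can the links in the search results be extracted?
To extract the result URLs from a Google results page, press F12, switch to the Console tab, paste the statements below, and press Enter.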
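// Log the first link inside every element with class "r"
// (the class this snippet relies on to find the result blocks).
var tag = document.getElementsByClassName('r');
for (var i = 0; i < tag.length; i++) {
    var a = tag[i].getElementsByTagName("a");
    if (a.length > 0) {  // skip blocks that contain no link
        console.log(a[0].href);
    }
}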
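If the results page has been saved to disk, the same extraction can be sketched in Python with the standard library. This is only an illustration under assumptions: the class name `ResultLinkParser` is made up for this example, it relies on the same class "r" as the console snippet, and it assumes reasonably well-formed markup (every non-void tag is closed).

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect the first <a href> inside each element whose class list contains 'r'."""

    # Tags that never have a closing tag, so they must not change the depth.
    VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "param", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.links = []
        self._depth = 0      # nesting depth inside the current 'r' element (0 = outside)
        self._found = False  # True once the current 'r' element has yielded a link

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        attrs = dict(attrs)
        if self._depth == 0:
            # Outside any result block: only an element with class 'r' opens one.
            if "r" in (attrs.get("class") or "").split():
                self._depth = 1
                self._found = False
            return
        self._depth += 1
        if tag == "a" and not self._found and attrs.get("href"):
            self.links.append(attrs["href"])
            self._found = True

    def handle_endtag(self, tag):
        if self._depth > 0 and tag not in self.VOID_TAGS:
            self._depth -= 1
```

To use it, create a `ResultLinkParser()`, pass the saved page's text to `feed()`, and read the collected URLs from `links`.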
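Save the extracted links. The list to be checked ends up in url.txt, one URL or domain per line, after duplicates and blank lines have been removed (the script below reads the raw list from oldurl.txt):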
import io

readPath = 'oldurl.txt'
writePath = 'url.txt'
lines_seen = set()
# 'w' starts url.txt fresh on every run; the with block closes both files.
with io.open(readPath, 'r', encoding='utf-8') as f, \
        io.open(writePath, 'w', encoding='utf-8') as outfile:
    for line in f:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        if line not in lines_seen:
            outfile.write(line + u'\n')
            lines_seen.add(line)
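The deduplication step can also be written as a small function that works on any list of lines, which makes it easy to try out without touching files. This is a minimal sketch; the name `dedupe_lines` is made up for this example.

```python
def dedupe_lines(lines):
    """Return stripped, non-empty lines in first-seen order, without duplicates."""
    seen = set()
    result = []
    for raw in lines:
        line = raw.strip()  # drop the trailing newline and surrounding spaces
        if not line or line in seen:
            continue
        seen.add(line)
        result.append(line)
    return result
```

Stripping before comparing means that "a.com" and "a.com" followed by a space or newline count as the same entry, and a final line without a newline is not treated as a new one.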
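Then run the batch check. The results go into two files:
ok.txt: domains that are fine
red.txt: domains and links that have already been blocked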
#!/usr/bin/env python3
# coding: utf-8
import time

import requests

strxx = '"Code":"102"'  # marker found in the response when the URL is fine
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

with open('url.txt', encoding='utf-8') as urls:
    for y in urls:
        y = y.strip()  # drop the trailing newline before building the request
        if not y:
            continue
        x = 'http://wx.rrbay.com/pro/wxUrlCheck.ashx?url=' + y
        try:
            response = requests.get(x, headers=headers, timeout=10)
            html = response.text
            time.sleep(3)  # pause between requests
        except Exception as e:
            # A failed request leaves html empty, so the URL is filed in red.txt.
            html = ''
            print(e)
        if strxx in html:
            print('ok:')
            print(x)
            with open('ok.txt', 'a', encoding='utf-8') as f:
                f.write(y + '\n')
        else:
            print('error:')
            print(y)
            with open('red.txt', 'a', encoding='utf-8') as f:
                f.write(y + '\n')
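Two small parts of the check can be pulled out and tried without any network access: building the request URL and deciding which file a response belongs in. The sketch below is an assumption-laden illustration, not the checker's documented API. The helper names `build_check_url` and `classify` are made up, and percent-encoding the target is my own precaution (so that a `&` or `#` in the target is not read as part of the checker's own query string); whether the service expects it is not confirmed by anything above.

```python
from urllib.parse import quote

CHECK_ENDPOINT = 'http://wx.rrbay.com/pro/wxUrlCheck.ashx?url='  # endpoint used in the script
OK_MARKER = '"Code":"102"'                                       # marker used in the script


def build_check_url(target):
    """Build the request URL, percent-encoding the target (an assumption, see above)."""
    return CHECK_ENDPOINT + quote(target.strip(), safe='')


def classify(body):
    """Return the result file for a response body: ok.txt if the marker is present."""
    return 'ok.txt' if OK_MARKER in body else 'red.txt'
```

An empty body, which is what the script uses after a failed request, is classified as red.txt, the same as in the loop.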
Source: https://blog.51cto.com/u