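0. Problem:
Dynamically recognize a website's captcha so that subsequent operations can proceed
1. Approach:
1.1. Obtain the captcha image
1.2. Recognize the captcha online with the Baidu OCR API
2. Implementation:
2.1. Obtain the captcha image
2.1.1 Use webdriver to simulate a browser and load the page
2.1.2 Crop the captcha image out of a page screenshot, using the position attributes of the captcha element, and save it
The code is as follows: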
from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.by import By


def verifycode():
    driver = webdriver.Chrome()
    driver.set_page_load_timeout(5)
    driver.set_script_timeout(5)
    try:
        driver.get("https://query.ruankao.org.cn/certificate/main")
    except Exception as e:
        print('time out in search page')
    # 1. Save a screenshot of the page. The file name must end in .png;
    #    other image formats trigger a warning.
    driver.save_screenshot("scr_img.png")
    # 2. Locate the captcha image element
    #    (find_element_by_id was removed in Selenium 4; use By.ID instead)
    # code_ele = driver.find_element(By.ID, "imgVerifyCode")
    code_ele = driver.find_element(By.ID, "pic")
    # 3. Element position, e.g. {'y': 478, 'x': 565}: the top-left corner of the image
    print(code_ele.location)
    # 4. Element size, e.g. {'height': 37, 'width': 135}
    print(code_ele.size)
    # 5. Compute the element's bounding box
    x0 = code_ele.location["x"]  # 565
    y0 = code_ele.location["y"]  # 478
    x1 = code_ele.size["width"] + x0
    y1 = code_ele.size["height"] + y0
    img = Image.open("scr_img.png")
    image = img.crop((x0, y0, x1, y1))  # left, top, right, bottom
    image.save("code_img.png")  # save the captcha image as code_img.png
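The crop above assumes that one CSS pixel in the page equals one pixel in the screenshot. On high-DPI displays that may not hold, and the crop can miss the captcha. The following is a minimal sketch of the box computation with an optional scale factor; `crop_box` and its `scale` parameter are hypothetical names introduced here, not part of Selenium or Pillow. The scale could, for instance, be read with `driver.execute_script("return window.devicePixelRatio")`.

```python
def crop_box(location, size, scale=1.0):
    """Return (left, top, right, bottom) in screenshot pixels.

    location and size are dicts shaped like Selenium's WebElement.location
    and WebElement.size (CSS pixels). scale is an assumed device pixel
    ratio; 1.0 reproduces the computation used in the code above.
    """
    left = round(location["x"] * scale)
    top = round(location["y"] * scale)
    right = round((location["x"] + size["width"]) * scale)
    bottom = round((location["y"] + size["height"]) * scale)
    return (left, top, right, bottom)


# Values taken from the comments in the code above.
box = crop_box({"x": 565, "y": 478}, {"width": 135, "height": 37})
print(box)  # (565, 478, 700, 515)
```

The resulting tuple can be passed straight to `img.crop(...)`. If cropping proves unreliable, Selenium's `WebElement.screenshot("code_img.png")` may be a simpler option, since it captures only the element.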
Alternatively, use XPath to locate the captcha's URL and download the captcha image directly, as follows:
import requests
from lxml import etree
from urllib.parse import urljoin


def verifycode():
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive',
        'Referer': 'https://query.ruankao.org.cn/certificate/main',
        'X-Requested-With': 'XMLHttpRequest',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36 Edg/88.0.705.74',
        'Cookie': 'PHPSESSID=trq1o40; acw_tc=784e8288dgh67f; SERVERID=f7154867dcfa|1618889640|1618887987'
    }
    # Request the captcha with a Cookie-bearing header first, so that the server
    # stores the cookie-to-captcha mapping and returns the captcha image
    xpath_str = '//img[@name="pic"]/@src'
    base_url = "https://query.ruankao.org.cn/certificate/main"
    html_res = requests.get(base_url, headers=headers).text
    dom = etree.HTML(html_res)
    items = dom.xpath(xpath_str)
    if len(items) > 0:
        # The src attribute may be relative, so resolve it against the page URL
        cap_url = urljoin(base_url, items[0])
        print(cap_url)
        cap = requests.get(cap_url, headers=headers)
        # The with block closes the file; no explicit close() is needed
        with open("cap.png", "wb") as f:
            f.write(cap.content)
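One detail worth checking in the download approach is whether the `src` attribute is an absolute URL; if it is relative, it has to be resolved against the page URL before it can be fetched. The sketch below illustrates that step using only the standard library (`html.parser` instead of lxml, purely so the example runs without extra packages). `CaptchaSrcParser`, `find_captcha_url` and the sample HTML, including its `/captcha/show?t=1` path, are hypothetical and not taken from the real site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CaptchaSrcParser(HTMLParser):
    """Remember the src of the first <img> whose name attribute matches."""

    def __init__(self, img_name):
        super().__init__()
        self.img_name = img_name
        self.src = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("name") == self.img_name and self.src is None:
            self.src = attrs.get("src")


def find_captcha_url(page_url, html, img_name="pic"):
    """Return the absolute captcha URL, or None if no matching <img> exists."""
    parser = CaptchaSrcParser(img_name)
    parser.feed(html)
    if parser.src is None:
        return None
    # urljoin leaves absolute URLs untouched and resolves relative ones
    return urljoin(page_url, parser.src)


# Hypothetical page fragment, used only to exercise the function.
sample_html = '<html><body><img name="pic" src="/captcha/show?t=1"></body></html>'
print(find_captcha_url("https://query.ruankao.org.cn/certificate/main", sample_html))
```

In a real run it is also likely more robust to let a `requests.Session()` carry the cookies from the page request to the captcha request than to hard-code a `Cookie` header, since the captcha is tied to the session that requested it.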
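2.2 Recognize the captcha online with the Baidu OCR API
2.2.1 Log in to Baidu AI Cloud (百度智能云), create an OCR application instance, and obtain the APP_ID and APP_KEY
Following the documentation step by step is enough to get this working. At the time of writing, the free quota is 1,000 calls/month for individually verified accounts and 2,000 calls/month for enterprise-verified accounts; once the free test quota is used up, calls are billed at Baidu's listed prices.
Once you have the APP_ID and APP_KEY, you can call the API to recognize images online; see the technical documentation for details.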
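The OCR request needs an `access_token`, which is obtained from Baidu's authentication endpoint first. The following is a minimal sketch of that step, assuming the application's credentials are an API Key and a Secret Key (the names used in Baidu's authentication documentation) and that the token is requested with the `client_credentials` grant. `build_token_url` and `extract_access_token` are hypothetical helper names, and the key values are placeholders.

```python
from urllib.parse import urlencode

TOKEN_ENDPOINT = "https://aip.baidubce.com/oauth/2.0/token"


def build_token_url(api_key, secret_key):
    """Build the token request URL for the client_credentials grant."""
    query = urlencode({
        "grant_type": "client_credentials",
        "client_id": api_key,         # the application's API Key
        "client_secret": secret_key,  # the application's Secret Key
    })
    return TOKEN_ENDPOINT + "?" + query


def extract_access_token(payload):
    """Return access_token from the decoded JSON reply, or raise if absent."""
    if "access_token" not in payload:
        raise ValueError("no access_token in reply: %r" % (payload,))
    return payload["access_token"]


# Typical use (not executed here, since it needs real credentials):
#   reply = requests.post(build_token_url("YOUR_API_KEY", "YOUR_SECRET_KEY")).json()
#   access_token = extract_access_token(reply)
print(build_token_url("YOUR_API_KEY", "YOUR_SECRET_KEY"))
```

With a token in hand, the documentation's sample request is as follows: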
# encoding:utf-8
import requests
import base64

'''
General text recognition (high-accuracy version)
'''
request_url = "https://aip.baidubce.com/rest/2.0/ocr/v1/accurate_basic"
# Open the image file in binary mode and base64-encode its contents
with open('[local file]', 'rb') as f:
    img = base64.b64encode(f.read())
params = {"image": img}
access_token = '[token obtained from the authentication API]'
request_url = request_url + "?access_token=" + access_token
headers = {'content-type': 'application/x-www-form-urlencoded'}
response = requests.post(request_url, data=params, headers=headers)
if response:
    print(response.json())
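The sample only prints the raw JSON. To use the captcha in later steps, the recognized text has to be pulled out of the reply. The sketch below assumes the reply carries a `words_result` list whose items each have a `words` field, and that failures are reported through `error_code` and `error_msg`; `captcha_text` is a hypothetical helper and the sample dictionary is made up for illustration, not a real API reply.

```python
def captcha_text(result):
    """Join all recognized lines and strip whitespace; raise on an API error."""
    if "error_code" in result:
        raise RuntimeError(
            "OCR failed: %s %s" % (result["error_code"], result.get("error_msg", ""))
        )
    words = [item["words"] for item in result.get("words_result", [])]
    # OCR often inserts spaces between captcha characters, so remove them
    return "".join("".join(words).split())


# Hypothetical reply, shaped like the assumed response format.
sample = {"words_result_num": 1, "words_result": [{"words": "a B 3 k"}]}
print(captcha_text(sample))  # aB3k
```

Calling `captcha_text(response.json())` after the request would then give a string that can be typed into the captcha field. Distorted captchas may still be misread, so a retry with a fresh captcha on failure is probably worth adding.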