My lab recently asked me to scrape some paper data. I ran into quite a few problems along the way, so I am recording them here.
Unresolved problem
https://chemistry-europe.onlinelibrary.wiley.com/doi/full/10.1002/cctc.202101625
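When I use requests to fetch the paper data from this page, the server returns a 503 error no matter how I set the headers and cookies, and I cannot tell which anti-scraping measure is responsible. I am posting my code here in the hope that someone more experienced can explain it.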
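Before the code itself, a side note on narrowing down the cause. A 503 that ignores headers and cookies is often a challenge page served by a CDN or bot-protection layer rather than a real server error. Whether that applies to this site is an assumption. The sketch below is a rough heuristic, not a definitive detector: `guess_block_reason` is a hypothetical helper, and the body markers it looks for are examples I chose, not an official list. It only inspects a status code, response headers, and body text, so it makes no network calls.

```python
from typing import Mapping


def guess_block_reason(status_code: int, headers: Mapping[str, str], body: str = "") -> str:
    """Heuristically classify a refused response from its metadata alone."""
    # Header names are case-insensitive, so normalise them before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}

    # Cloudflare responses usually carry a cf-ray header and/or "server: cloudflare".
    behind_cloudflare = (
        "cf-ray" in lowered or "cloudflare" in lowered.get("server", "").lower()
    )

    # Assumed markers of a challenge page; these are illustrative, not exhaustive.
    challenge_markers = ("just a moment", "challenge-platform", "cf-chl")
    looks_like_challenge = (
        lowered.get("cf-mitigated", "").lower() == "challenge"
        or any(marker in body.lower() for marker in challenge_markers)
    )

    refused = status_code in (403, 503)
    if refused and behind_cloudflare and looks_like_challenge:
        return "cloudflare challenge page"
    if refused and behind_cloudflare:
        return "blocked at a Cloudflare edge (reason unclear)"
    if status_code == 503:
        return "plain 503 (origin overloaded or other protection)"
    return "no block detected"
```

With requests, this could be called as `guess_block_reason(r.status_code, r.headers, r.text)` on the failing response. If the result points at a challenge page, changing headers alone is unlikely to help, because such challenges typically depend on running JavaScript or on the client's TLS fingerprint.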
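import requests
from hyper.contrib import HTTP20Adapter  # HTTP/2 transport adapter for requests

url = 'https://chemistry-europe.onlinelibrary.wiley.com/doi/full/10.1002/cctc.202101625'
session = requests.Session()
# Route requests whose URL starts with this prefix through HTTP/2
session.mount(url, HTTP20Adapter())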
headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'zh-CN,zh;q=0.9',
'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
'sec-ch-ua-platform': '"macOS"',
':authority': 'chemistry-europe.onlinelibrary.wiley.com',
':method': 'GET',
':path': '/doi/full/10.1002/cctc.202101625',
':scheme': 'https',
'cache-control': 'max-age=0',
'sec-ch-ua-mobile': '?0',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate'