
Notes on problems encountered when scraping by simulating a browser

My lab recently asked me to scrape some paper data, and I ran into quite a few problems along the way, so I am recording them here.

Unresolved problem

https://chemistry-europe.onlinelibrary.wiley.com/doi/full/10.1002/cctc.202101625
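When I use requests to fetch the paper data from the page above, I always get a 503 error no matter how I set the headers and cookies, and I do not know which anti-scraping measure is responsible. I am posting my code here in the hope that someone more experienced can explain it.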

import requests
from hyper.contrib import HTTP20Adapter

url = 'https://chemistry-europe.onlinelibrary.wiley.com/doi/full/10.1002/cctc.202101625'

# Mount hyper's HTTP/2 transport adapter so requests to this URL go over HTTP/2.
session = requests.session()
session.mount(url, HTTP20Adapter())

# Request headers mirroring what the browser sends for this page.
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
    'sec-ch-ua-platform': '"macOS"',
    # HTTP/2 pseudo-headers (names starting with ':') as shown by the browser.
    ':authority': 'chemistry-europe.onlinelibrary.wiley.com',
    ':method': 'GET',
    ':path': '/doi/full/10.1002/cctc.202101625',
    ':scheme': 'https',
    'cache-control': 'max-age=0',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
}
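
One detail in the headers above may be worth isolating while debugging: `:authority`, `:method`, `:path` and `:scheme` are HTTP/2 pseudo-headers, which an HTTP/2 client normally derives from the request URL itself instead of taking them from a user-supplied dict. The sketch below uses a hypothetical helper, `split_pseudo_headers` (my own name, not part of requests or hyper), to separate them so that the remaining headers can be tried on their own. It is only a way to narrow the problem down; I have not confirmed that it has anything to do with the 503.

```python
def split_pseudo_headers(headers):
    """Split a header dict into (pseudo, regular) parts.

    HTTP/2 pseudo-header names start with ':'; everything else is
    treated as a regular header. Hypothetical helper for debugging.
    """
    pseudo = {}
    regular = {}
    for name, value in headers.items():
        if name.startswith(':'):
            pseudo[name] = value
        else:
            regular[name] = value
    return pseudo, regular


# A shortened copy of the headers from the snippet above.
browser_headers = {
    'accept': 'text/html,application/xhtml+xml',
    'user-agent': 'Mozilla/5.0',
    ':authority': 'chemistry-europe.onlinelibrary.wiley.com',
    ':method': 'GET',
    ':path': '/doi/full/10.1002/cctc.202101625',
    ':scheme': 'https',
}

pseudo, regular = split_pseudo_headers(browser_headers)
print(sorted(pseudo))   # names of the pseudo-headers that were set aside
print(sorted(regular))  # names of the headers left for the request
```

The `regular` dict could then be passed as `headers=regular` to `session.get(url, ...)`, with and without the HTTP/2 adapter mounted, to compare the status codes.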