Two highly practical Scrapy middlewares, plus common settings
Middlewares (proxy, UA)
Custom proxy middleware
I maintain my own IP pool: the proxy IPs live in Redis, and a random proxy IP is pulled from Redis at request time.
(I run two proxy-pool environments, one production and one test. If you only have one, see "settings.py parameters for the custom proxy middleware" below.)
import random

import redis
from scrapy import signals


class ProxyMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        cls.connect_type = crawler.settings.get('CONNECT_TYPE')
        print('\033[3;31mIP pool connection mode: {}\033[0m\n'.format(cls.connect_type))
        if cls.connect_type == 'localhost':
            cls.REDIS_URL = crawler.settings.get('REDIS_HOST')
            cls.REDIS_PORT = crawler.settings.get('REDIS_PORT')
            cls.REDIS_DB = crawler.settings.get('REDIS_DATABASE')
            cls.REDIS_PASSWORD = crawler.settings.get('REDIS_PASSWORD')
            cls.REDIS_QUEUE_NAME = crawler.settings.get('REDIS_QUEUE_NAME')
        elif cls.connect_type == 'server':
            cls.REDIS_URL = crawler.settings.get('SERVER_REDIS_HOST')
            cls.REDIS_PORT = crawler.settings.get('SERVER_REDIS_PORT')
            cls.REDIS_DB = crawler.settings.get('SERVER_REDIS_DATABASE')
            cls.REDIS_PASSWORD = crawler.settings.get('SERVER_REDIS_PASSWORD')
            cls.REDIS_QUEUE_NAME = crawler.settings.get('SERVER_REDIS_QUEUE_NAME')
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        # Also hook spider_closed so the Redis connection gets cleaned up.
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s

    def spider_opened(self, spider):
        self.pika = redis.Redis(host=self.REDIS_URL, port=self.REDIS_PORT,
                                password=self.REDIS_PASSWORD, db=self.REDIS_DB,
                                decode_responses=True)
        print('============ IP pool connected: {} ============'.format(self.connect_type))

    def spider_closed(self, spider):
        self.pika.close()
        print("################################################")
        print("##############  IPpool_spider  #################")
        print("################################################")

    def getIP(self):
        # The pool is a Redis hash; hvals() returns every stored "ip:port" value.
        proxies_list = self.pika.hvals(self.REDIS_QUEUE_NAME)
        if proxies_list:
            ip = random.choice(proxies_list)
            proxy = 'http://{}'.format(ip)
            return proxy
        else:
            print('\033[3;31m<<< Proxy pool is empty >>>\033[0m\n')

    def process_request(self, request, spider):
        spiderNames = spider.settings.get('SPIDERNAMES', [])
        if spider.name in spiderNames:
            proxies = self.getIP()
            if proxies:
                print('Proxy obtained: {}'.format(proxies))
                request.meta['proxy'] = proxies
CONNECT_TYPE: which IP pool to connect to; 'localhost' selects the test-environment pool, 'server' the production pool.
REDIS_HOST, REDIS_PORT, REDIS_DATABASE, REDIS_PASSWORD and REDIS_QUEUE_NAME are the Redis connection parameters (defined in settings.py).
SPIDERNAMES: the names of the spiders that should use a proxy (defined in settings.py). Some pages can be fetched without a proxy, so there is no need to route every spider through one.
(spider.settings.get reads a parameter from settings.py.)
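As a quick illustration of what the middleware expects to find in Redis: getIP() reads the pool with hvals(), so the pool is a Redis hash whose values are "ip:port" strings. A minimal seeding sketch (the field names and addresses are made up):

import redis

r = redis.Redis(host='127.0.0.1', port=6379, db=0, decode_responses=True)
# Each hash field holds one proxy as "ip:port"; hvals() returns all of them.
r.hset('ippool', 'proxy_1', '10.0.0.1:8888')
r.hset('ippool', 'proxy_2', '10.0.0.2:8888')
print(r.hvals('ippool'))  # ['10.0.0.1:8888', '10.0.0.2:8888']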
settings.py parameters for the custom proxy middleware
To use the proxy middleware above, define a few parameters in settings.py, as follows:
SPIDERNAMES = []  # names of the spiders that should use a proxy

# Redis configuration: fill in your own proxy-pool connection parameters
# production (server)
SERVER_REDIS_HOST = 'xxx.xxx.xxx.xx'
SERVER_REDIS_PORT = xxx
SERVER_REDIS_DATABASE = x
SERVER_REDIS_PASSWORD = 'xxxx'
SERVER_REDIS_QUEUE_NAME = 'xxxx'

# local / test environment
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
REDIS_DATABASE = 0
REDIS_PASSWORD = ''
REDIS_QUEUE_NAME = 'ippool'

CONNECT_TYPE = 'server'  # which environment to run against
If you only have one environment, just set CONNECT_TYPE to whichever one you filled in. For example, with a single environment I would fill in REDIS_HOST, REDIS_PORT, REDIS_DATABASE, REDIS_PASSWORD and REDIS_QUEUE_NAME and set CONNECT_TYPE to 'localhost'; alternatively, simplify the parameter-reading logic inside ProxyMiddleware's from_crawler, as sketched below.
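A minimal single-environment from_crawler, replacing the one in ProxyMiddleware above (a sketch, assuming only the local REDIS_* settings exist):

    @classmethod
    def from_crawler(cls, crawler):
        # Single environment: read the local REDIS_* settings unconditionally.
        cls.connect_type = 'localhost'  # kept so spider_opened's log line still works
        cls.REDIS_URL = crawler.settings.get('REDIS_HOST')
        cls.REDIS_PORT = crawler.settings.get('REDIS_PORT')
        cls.REDIS_DB = crawler.settings.get('REDIS_DATABASE')
        cls.REDIS_PASSWORD = crawler.settings.get('REDIS_PASSWORD')
        cls.REDIS_QUEUE_NAME = crawler.settings.get('REDIS_QUEUE_NAME')
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s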
Custom UA middleware
This one simply checks whether the request already has a manually set User-Agent: if it does, it is left untouched; if not, a random UA is added. The UA middleware needs no extra parameters.
class UAMiddleware(object):
    user_agent_list = [
        "Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
        "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
        "Opera/9.80 (Windows NT 5.1; U; zh-cn) Presto/2.9.168 Version/11.50",
        "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
        "Mozilla/5.0 (Windows NT 5.2) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.122 Safari/534.30",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
        "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12",
    ]

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        # Scrapy's Headers lookups are case-insensitive, so one check is enough.
        if request.headers.get('User-Agent'):
            print('UA already set, leaving it unchanged')
        else:
            request.headers['User-Agent'] = ua
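To see the "leave it alone" branch in action, a spider can set its own UA per request; the middleware will keep it. A quick sketch (the spider name, URL, and UA string are placeholders):

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'

    def start_requests(self):
        # This UA survives UAMiddleware untouched because the header is already set.
        yield scrapy.Request(
            'https://example.com',
            headers={'User-Agent': 'Mozilla/5.0 (custom)'},
        )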
Enabling the proxy and UA middlewares
Find DOWNLOADER_MIDDLEWARES in settings.py and add the two middlewares. Lower numbers sit closer to the engine, so process_request runs for the proxy middleware (200) before the UA middleware (350):
DOWNLOADER_MIDDLEWARES = {
    'zmnProject.middlewares.ZmnprojectDownloaderMiddleware': 543,
    'your_project.middlewares.UAMiddleware': 350,
    'your_project.middlewares.ProxyMiddleware': 200,
}
Common settings.py parameters
DOWNLOAD_DELAY: download delay (seconds to wait between requests)
ROBOTSTXT_OBEY: whether to obey the robots.txt protocol
CONCURRENT_REQUESTS: maximum number of concurrent requests
CONCURRENT_ITEMS: maximum number of items processed concurrently in the item pipelines
An example follows this list.
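A short settings.py sketch with illustrative values only (tune these per site):

DOWNLOAD_DELAY = 1        # wait 1 second between requests
ROBOTSTXT_OBEY = False    # do not obey robots.txt
CONCURRENT_REQUESTS = 16  # up to 16 requests in flight at once
CONCURRENT_ITEMS = 100    # up to 100 items processed concurrently per response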