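Web Crawlers from Beginner to Expert (8) | High-Concurrency Crawlers: Building Crawlers with Multithreading, Multiprocessing, and Coroutines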


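Contents

I. Introduction to multiprocessing and multithreading
II. A plain crawler
III. Multithreaded crawler
- 1. Calling a plain function
- 2. Using a thread class
IV. Multiprocess crawler
- 1. Calling a plain function
- 2. Using a process class
V. gevent coroutine crawler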

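I. Introduction to multiprocessing and multithreading

II. A plain crawler

Here is a simple script that measures how long 100 sequential requests to Baidu take.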

# coding: utf-8
import time

import requests


def get_response():
    try:
        url = 'https://www.baidu.com/'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3883.400 QQBrowser/10.8.4559.400',
        }
        response = requests.get(url, headers=headers, timeout=2)
        print(response.status_code)

    except Exception as e:
        print(e)


if __name__ == '__main__':
    a = time.time()
    for i in range(100):
        get_response()
    print(time.time() - a)

Would fetching the pages concurrently, with multiple threads or multiple processes, be faster?

III. Multithreaded crawler

1. Calling a plain function

# coding: utf-8
import time
import threading
import requests


def get_response():
    try:
        url = 'https://www.baidu.com/'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3883.400 QQBrowser/10.8.4559.400',
        }
        response = requests.get(url, headers=headers, timeout=2)
        print(response.status_code)

    except Exception as e:
        print(e)


def fun():
    for i in range(10):
        get_response()


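if __name__ == '__main__':
    start = time.time()
    threads = [threading.Thread(target=fun) for _ in range(10)]
    for t in threads:
        t.start()
    # Wait for every thread to finish before reading the clock
    for t in threads:
        t.join()
    print(time.time() - start)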

On Windows, 100 requests spread across 10 threads took about 7 seconds.
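To see why threads help with this kind of workload without hitting a real site, here is a minimal sketch. It assumes that time.sleep() is a fair stand-in for the network wait inside requests.get(); the names fake_request, run_sequential and run_threaded are hypothetical and exist only for this illustration.

```python
import threading
import time


def fake_request(delay=0.02):
    """Assumed stand-in for requests.get(): it only waits, like a network round trip."""
    time.sleep(delay)


def run_sequential(n):
    """Issue n fake requests one after another and return the elapsed time."""
    start = time.perf_counter()
    for _ in range(n):
        fake_request()
    return time.perf_counter() - start


def run_threaded(n_threads, per_thread):
    """Issue n_threads * per_thread fake requests from n_threads threads."""
    def worker():
        for _ in range(per_thread):
            fake_request()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for all workers before stopping the clock
    return time.perf_counter() - start


sequential = run_sequential(20)
threaded = run_threaded(4, 5)  # same 20 requests, split over 4 threads
print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```

While one thread is waiting, the others can proceed, so the total time is driven by the longest worker rather than by the sum of all waits. Real timings depend on the network and the target server.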

2. Using a thread class

# coding: utf-8
import time
import threading
import requests


class Spider(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def get_response(self):
        try:
            url = 'https://www.baidu.com/'
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3883.400 QQBrowser/10.8.4559.400',
            }
            response = requests.get(url, headers=headers, timeout=2)
            print(response.status_code)

        except Exception as e:
            print(e)

    def run(self):
        for i in range(10):
            self.get_response()


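if __name__ == '__main__':
    start = time.time()
    spiders = [Spider() for _ in range(10)]
    for s in spiders:
        # start() executes run() in a new thread; calling run() directly
        # would execute it in the main thread, one spider after another
        s.start()
    for s in spiders:
        s.join()
    print(time.time() - start)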
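A thread subclass only runs concurrently when it is launched with start(); calling run() is an ordinary method call that executes in the calling thread. The sketch below shows the difference by recording which thread executed run(). The Worker class and its ran_in attribute are hypothetical names used only for this demonstration.

```python
import threading


class Worker(threading.Thread):
    def __init__(self):
        super().__init__()
        self.ran_in = None  # identifier of the thread that executed run()

    def run(self):
        self.ran_in = threading.get_ident()


main_id = threading.get_ident()

direct = Worker()
direct.run()  # plain method call: executes in the main thread

started = Worker()
started.start()  # spawns a new thread, which then calls run()
started.join()

print("run() stayed in the main thread:", direct.ran_in == main_id)
print("start() used a different thread:", started.ran_in != main_id)
```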

IV. Multiprocess crawler

1. Calling a plain function

# coding: utf-8

import requests
import multiprocessing


def get_response():
    try:
        url = 'https://www.baidu.com/'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3883.400 QQBrowser/10.8.4559.400',
        }
        response = requests.get(url, headers=headers, timeout=2)
        print(response.status_code)

    except Exception as e:
        print(e)


def fun():
    for i in range(25):
        get_response()


if __name__ == '__main__':
    for i in range(4):
        multiprocessing.Process(target=fun).start()

On Windows, 100 requests spread across 4 concurrent processes took about 12 seconds.

2. Using a process class

# coding: utf-8
import requests
import multiprocessing


class Spider(multiprocessing.Process):
    def get_response(self):
        try:
            url = 'https://www.baidu.com/'
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3883.400 QQBrowser/10.8.4559.400',
            }
            response = requests.get(url, headers=headers, timeout=2)
            print(response.status_code)

        except Exception as e:
            print(e)

    def run(self):
        for i in range(25):
            self.get_response()


if __name__ == '__main__':
    for i in range(4):
        s = Spider()
        s.start()

V. gevent coroutine crawler

1. Introduction to the gevent module

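Python's generators (yield) provide basic but incomplete support for coroutines. The third-party gevent library offers more complete coroutine support.
gevent implements coroutines with greenlet. The basic idea: when a greenlet reaches an I/O operation, such as a network request, gevent automatically switches to another greenlet, then switches back at a suitable point once the I/O has completed. Because I/O is slow and often leaves a program waiting, this automatic switching keeps some greenlet running instead of idling on I/O.
In short: gevent uses coroutines to work around blocking network calls and achieve concurrency.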
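The switching idea can be illustrated with plain generators. This is only a sketch of cooperative scheduling, not how gevent is implemented (gevent relies on greenlet and an event loop). Each yield marks the point where a task would wait on I/O, and the names task and run_round_robin are hypothetical.

```python
from collections import deque


def task(name, steps, log):
    """A toy coroutine: does one unit of work, then hands control back."""
    for i in range(steps):
        log.append(f"{name} step {i}")
        yield  # pretend we are waiting on I/O here


def run_round_robin(tasks):
    """Resume each task in turn until all of them have finished."""
    queue = deque(tasks)
    while queue:
        current = queue.popleft()
        try:
            next(current)  # run the task up to its next yield
        except StopIteration:
            continue  # task finished, do not reschedule it
        queue.append(current)


log = []
run_round_robin([task("A", 2, log), task("B", 2, log)])
print(log)
```

The two tasks interleave instead of running back to back, which is the effect gevent achieves automatically whenever a greenlet blocks on I/O.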

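Note: gevent can deliver very high concurrency. It has historically been best supported on Unix/Linux; older releases were not guaranteed to install or run correctly on Windows, although recent releases also provide Windows builds.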

2. Installation and dependencies

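Depends on greenlet and an event-loop library (libev or libuv)
Supports Python 2.6+ and Python 3.3+ (for the gevent release current when this article was written; recent releases support only newer Python 3 versions)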

pip install gevent

3. gevent coroutine crawler example

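# coding: utf-8
# Import the monkey module and call monkey.patch_all() before importing any other library or module, so that the patch is applied first.
from gevent import monkey  # monkey replaces blocking standard-library calls with cooperative gevent versions

monkey.patch_all()  # Known as "monkey patching": it swaps in cooperative versions of socket and other blocking modules at runtime, so ordinary synchronous code such as requests yields to other greenlets while it waits on I/O.
import gevent
import requests
import time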


def get_response(url):  # fetch one URL and print its status code
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}

    res = requests.get(url, headers=headers)  # send the request
    print(res.status_code)


if __name__ == '__main__':
    start = time.time()  # start time
    # Build 100 request tasks
    url_list = []
    for i in range(100):
        url = 'https://www.baidu.com/'
        url_list.append(url)
    # Run them as coroutines
    tasks_list = []
    for url in url_list:
        # gevent.spawn() creates a task that calls get_response(url): the first argument is the function, followed by that function's arguments in order
        task = gevent.spawn(get_response, url)
        tasks_list.append(task)  # add the task to the list
    # gevent.joinall() blocks until every task in tasks_list has finished
    gevent.joinall(tasks_list)

    end = time.time()  # end time
    print(end - start)

We can also combine multiple processes with coroutines.
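One possible way to combine the two, offered as a sketch rather than a definitive implementation: split the URL list into one chunk per process, and let each process fetch its chunk with gevent greenlets. The helper names chunk, crawl_chunk and main are hypothetical, the process count of 4 and the 2-second timeout are carried over from the earlier examples, and gevent is patched inside the worker on the assumption that requests has not yet been imported in that process.

```python
import multiprocessing


def chunk(items, n):
    """Split items into n roughly equal lists by dealing them out in turn."""
    return [items[i::n] for i in range(n)]


def crawl_chunk(urls):
    """Runs in a worker process: fetch every URL in urls with greenlets."""
    # Patch first, then import requests, so its sockets are cooperative.
    from gevent import monkey
    monkey.patch_all()
    import gevent
    import requests

    def fetch(url):
        try:
            return requests.get(url, timeout=2).status_code
        except Exception as exc:
            return exc

    jobs = [gevent.spawn(fetch, url) for url in urls]
    gevent.joinall(jobs)  # block until every greenlet has finished
    print([job.value for job in jobs])


def main(n_processes=4):
    """Start one process per chunk and wait for all of them."""
    urls = ['https://www.baidu.com/'] * 100
    workers = [
        multiprocessing.Process(target=crawl_chunk, args=(part,))
        for part in chunk(urls, n_processes)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

To run it, call main() under an if __name__ == '__main__': guard, which multiprocessing requires on Windows. gevent must be installed; it is imported only inside the worker function.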
