Web Crawler Basics
What Is a Crawler
- A crawler is, put simply, a program that automatically fetches information from the internet. If the internet is a giant web, the crawler program is the spider walking across it.
Basic Architecture of a Crawler
- Crawler client (scheduler)
  - Coordinates the work of the URL manager, the downloader, and the parser.
- URL manager
  - Keeps track of the URLs still to be crawled and the URLs already crawled, preventing duplicate and circular crawling.
  - A URL manager is usually implemented in one of three ways: in memory, in a database, or in a cache (a minimal in-memory sketch follows this list).
- Page downloader
  - Given a URL, downloads the page and returns its content as a string.
  - Common downloaders are the urllib module (standard library) and the requests module (third party).
- Page parser
  - Parses the downloaded page string and extracts the information we care about; parsing can also be done against the DOM tree.
  - Common parsers:
    - Regular expressions: intuitive; the page is treated as one big string and useful data is extracted by pattern matching, but this becomes very painful once the document gets complex.
    - html.parser: ships with Python.
    - BeautifulSoup: a third-party package; it can parse with Python's built-in html.parser or with lxml, and is more powerful than the other options.
    - lxml: a third-party package that can parse both XML and HTML.
    - html.parser, BeautifulSoup, and lxml all parse the page as a DOM tree.
- Data
  - The useful data extracted from the pages.
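As a minimal sketch of the memory-based URL manager mentioned above (the class and method names are illustrative, not taken from any library), two sets are enough to rule out duplicate and circular crawling:

import sys


class UrlManager:
    """A minimal in-memory URL manager backed by two sets."""

    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def add_new_url(self, url):
        # Only accept a URL we have never seen before
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move one URL from the pending set to the crawled set and hand it out
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url

A database- or cache-backed manager would expose the same four operations, only persisting the two sets elsewhere.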
Simple Example 1
- URL pattern:
https://tieba.baidu.com/p/5752826839?pn=1
https://tieba.baidu.com/p/5752826839?pn=2
https://tieba.baidu.com/p/5752826839?pn=3
- The <img> tag to parse:
<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=8be466fee7f81a4c2632ecc1e7286029/bcbb0d338744ebf89d9bb0b5d5f9d72a6259a7aa.jpg" size="350738" changedsize="true" width="560" height="995">
Analyzing the URL pattern and the image HTML above tells us exactly which pieces of information the crawler needs to extract.
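Before dropping the regular expression into the full script below, it can be sanity-checked against the sample tag above (a quick sketch; the tag string is copied from the analysis, not fetched live):

import re

# Sample <img> tag copied from the analysis above
sample = ('<img class="BDE_Image" '
          'src="https://imgsa.baidu.com/forum/w%3D580/sign=8be466fee7f81a4c2632ecc1e7286029/'
          'bcbb0d338744ebf89d9bb0b5d5f9d72a6259a7aa.jpg" '
          'size="350738" changedsize="true" width="560" height="995">')

# Same pattern as in parser_content() below
pattern = re.compile(r'<img class="BDE_Image".*? src="(https://.*?\.jpg)".*?">')
print(re.findall(pattern, sample))  # prints a one-element list containing the image URL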
import os
import re
import urllib.error
import urllib.request


def get_page(url):
    """
    Fetch the content of a page.
    """
    try:
        # Download the page's HTML
        response = urllib.request.urlopen(url)
    except urllib.error.URLError as e:
        print("Failed to fetch %s: %s" % (url, e.reason))
    else:
        # read() returns the page as raw bytes; it is decoded later in the parser
        content = response.read()
        return content
def parser_content(content):
    """
    Parse the page HTML and collect all image links.
    """
    # Decode the raw bytes and flatten newlines so the regex can match across lines
    content = content.decode('utf-8').replace('\n', ' ')
    # Pattern that captures the src attribute of the post's images
    pattern = re.compile(r'<img class="BDE_Image".*? src="(https://.*?\.jpg)".*?">')
    img_url_list = re.findall(pattern, content)
    return img_url_list
def get_page_img(page):
    """
    Download all images on the given page.
    """
    url = "https://tieba.baidu.com/p/5752826839?pn=%s" % page
    # Fetch the page HTML
    content = get_page(url)
    # Only continue if the page was fetched successfully
    if content:
        # Parse the page and collect the image links
        img_url_list = parser_content(content)
        for i in img_url_list:
            # Fetch each image in turn; skip it if the download failed
            img_content = get_page(i)
            if not img_content:
                continue
            img_name = i.split('/')[-1]
            with open('img/%s' % img_name, 'wb') as file:
                file.write(img_content)
                print("Downloaded image %s...." % img_name)
if __name__ == "__main__":
    # Make sure the output directory exists (the original assumes an img/ directory is already there)
    os.makedirs('img', exist_ok=True)
    for page in range(1, 5):
        print("Crawling images on page %s...." % page)
        get_page_img(page)
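The parser above relies on a regular expression. As noted in the architecture section, BeautifulSoup walks the DOM tree instead and tends to survive markup changes better. A hedged sketch of a drop-in replacement for parser_content (it assumes the third-party beautifulsoup4 package is installed; it is not part of the original script):

from bs4 import BeautifulSoup


def parser_content_bs(content):
    """DOM-based alternative to parser_content using BeautifulSoup."""
    soup = BeautifulSoup(content.decode('utf-8'), 'html.parser')
    # Select every <img> tag with class BDE_Image and collect its src attribute
    return [img['src'] for img in soup.find_all('img', class_='BDE_Image') if img.get('src')]

Restricting the result to .jpg links, as the regex does, only needs an extra endswith('.jpg') check.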
Simple Example 2
- URL pattern:
https://sc.chinaz.com/tupian/dadanrenti_2.html
https://sc.chinaz.com/tupian/dadanrenti_3.html
https://sc.chinaz.com/tupian/dadanrenti_4.html
- Image tag pattern:
<img src="//scpic.chinaz.net/files/pic/pic9/202008/apic27210.jpg" border="0" alt="">
<img alt="" src="//scpic.chinaz.net/files/pic/pic9/202009/apic27584.jpg" border="0">
import os
import re
import urllib.error
import urllib.request


def get_web(url):
    try:
        # Pretend to be a regular browser so the site does not reject the request
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
        req = urllib.request.Request(url, headers=headers)
        content = urllib.request.urlopen(req).read()
    except urllib.error.HTTPError as e:
        print("Status code:", e.code)
        print("Reason:", e.reason)
        print("Headers:", e.headers)
    else:
        return content
def parse_content(content):
    content = content.decode('utf-8')
    # Capture the src attribute of the gallery images shown above; the host may be
    # scpic.chinaz.net, scpic2.chinaz.net, etc., and the path case varies, hence \d? and [Ff]
    pattern = re.compile(r'<img[^>]*?src="(//scpic\d?\.chinaz\.net/[Ff]iles/pic/pic9/.*?\.jpg)"')
    img_list = re.findall(pattern, content)
    return img_list
def get_img(pages):
    url = 'https://sc.chinaz.com/tupian/dadanrenti_%s.html' % pages
    content = get_web(url)
    if content:
        img_list = parse_content(content)
        for j in img_list:
            # The links in the page are protocol-relative, so prepend the scheme
            img_url = 'https:%s' % j
            img_name = j.split('/')[-1]
            img_content = get_web(img_url)
            if not img_content:
                continue
            with open(os.path.join('img', img_name), 'wb') as file:
                file.write(img_content)
                print("Downloaded image %s...." % img_name)


if __name__ == '__main__':
    # Make sure the output directory exists
    os.makedirs('img', exist_ok=True)
    for pages in range(2, 10):
        print("Crawling images on page %s...\n" % pages)
        get_img(pages)
Anti-Crawling: Simulating a Browser
- Sometimes a page cannot be fetched because the site's anti-crawling rules only accept requests that look like they come from a browser, so we make our requests imitate one.
- The User-Agent value can be found among the request headers in the browser's developer tools ("inspect element").
import urllib.error
import urllib.request

url = "https://www.baidu.com/"
# Send a browser-like User-Agent header along with the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url, headers=headers)
content = urllib.request.urlopen(req).read().decode('utf-8')
print(content)
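The third-party requests module mentioned in the architecture notes does the same job with less ceremony. A minimal sketch (it assumes requests is installed; it is not used elsewhere in these notes):

import requests

url = "https://www.baidu.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}

# requests builds the request, sends the headers, and decodes the body for us
response = requests.get(url, headers=headers, timeout=5)
print(response.status_code)
print(response.text)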
Scraping Images from a Website
- Page URL pattern and image link pattern:
https://sc.chinaz.com/tupian/dadanrenti_2.html
https://sc.chinaz.com/tupian/dadanrenti_3.html
https://sc.chinaz.com/tupian/dadanrenti_4.html
<img src2="//scpic.chinaz.net/Files/pic/pic9/202008/apic27210_s.jpg" alt="狂野美女大胆人体艺术图片">
<img src2="//scpic.chinaz.net/Files/pic/pic9/202009/apic27584_s.jpg" alt="迷人美女大胆人体艺术图片">
<img src2="//scpic3.chinaz.net/Files/pic/pic9/202008/apic26939_s.jpg" alt="大胆欧美风人休艺术图片">
<img src2="//scpic.chinaz.net/Files/pic/pic9/201909/zzpic20215_s.jpg" alt="湿身大胆人体摄影图片">
- To keep a single browser identity from being blocked for requesting too often, we can rotate among several browser User-Agents (the code below picks one at random per request).
import os
import random
import re
import urllib.error
import urllib.request

# Pool of browser identities; one is picked at random for every request
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0',
    'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
]


def get_web(url):
    try:
        # A dict cannot hold three 'User-Agent' keys at once, so pick one agent per request instead
        headers = {'User-Agent': random.choice(USER_AGENTS)}
        req = urllib.request.Request(url, headers=headers)
        content = urllib.request.urlopen(req).read()
    except urllib.error.HTTPError as e:
        print("Status code:", e.code)
        print("Reason:", e.reason)
        print("Headers:", e.headers)
    else:
        return content
def parse_content(content):
    content = content.decode('utf-8')
    # Here the real image link sits in the src2 attribute, as shown in the samples above
    pattern = re.compile(r'<img src2="(//scpic[1-9]{0,1}\.chinaz\.net/Files/pic/pic9/.*?\.jpg).*?">')
    img_list = re.findall(pattern, content)
    return img_list
def get_img(pages):
    url = 'https://sc.chinaz.com/tupian/dadanrenti_%s.html' % pages
    content = get_web(url)
    if content:
        img_list = parse_content(content)
        for j in img_list:
            # The links are protocol-relative, so prepend the scheme
            img_url = 'https:%s' % j
            img_name = j.split('/')[-1]
            img_content = get_web(img_url)
            if not img_content:
                continue
            with open(os.path.join('img', img_name), 'wb') as file:
                file.write(img_content)
                print("Downloaded image %s...." % img_name)


if __name__ == '__main__':
    # Make sure the output directory exists
    os.makedirs('img', exist_ok=True)
    for pages in range(2, 10):
        print("Crawling images on page %s...\n" % pages)
        get_img(pages)
Anti-Crawling: Setting Up a Proxy
- Why use an IP proxy?
  - A proxy lets you get around some servers' anti-crawling mechanisms.
  - With a proxy, another IP address visits the page on your behalf instead of your own.
- How to keep your proxy IPs from being blocked (a rotation sketch follows the example below):
  - Use several proxy IPs.
  - Add a random delay, e.g. time.sleep(random.randint(1, 3)).
from urllib.request import ProxyHandler, build_opener, install_opener, urlopen


def use_proxy(proxies, url):
    # 1. Build a ProxyHandler from the proxy mapping
    proxy_support = ProxyHandler(proxies=proxies)
    # 2. Build an opener (it plays the same role as urlopen)
    opener = build_opener(proxy_support)
    # 3. Install the opener globally so plain urlopen() goes through the proxy
    install_opener(opener)
    # user_agent = "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0"
    user_agent = 'Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3'
    # Pretend to be a browser
    opener.addheaders = [('User-agent', user_agent)]
    urlObj = urlopen(url)
    content = urlObj.read().decode('utf-8')
    return content


if __name__ == '__main__':
    url = 'http://httpbin.org/get'
    proxies = {'https': "111.177.178.167:9999", 'http': '114.249.118.221:9000'}
    print(use_proxy(proxies, url))
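Putting the two tips above together, here is a hedged sketch of rotating through several proxies with a random pause between requests. The pool reuses the illustrative addresses from the example above, which are unlikely to still be alive:

import random
import time
from urllib.request import ProxyHandler, build_opener

# Illustrative proxy pool; a request is only proxied when its scheme ('http' here) matches a key
PROXY_POOL = [
    {'http': '114.249.118.221:9000'},
    {'http': '111.177.178.167:9999'},
]

URLS = ['http://httpbin.org/get'] * 3


def fetch_with_rotation(urls):
    for url in urls:
        # Pick a random proxy for each request instead of reusing one IP
        opener = build_opener(ProxyHandler(random.choice(PROXY_POOL)))
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]
        try:
            print(opener.open(url, timeout=10).read().decode('utf-8'))
        except OSError as e:
            print("Request to %s failed: %s" % (url, e))
        # Random delay so the requests do not arrive in a burst
        time.sleep(random.randint(1, 3))


if __name__ == '__main__':
    fetch_with_rotation(URLS)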
Exception Handling
- Common HTTP status code classes:
  - 2xx: success
  - 3xx: redirection
  - 4xx: client-side problem
  - 5xx: server-side problem
import urllib.error
import urllib.request

try:
    url = "http://www.baidu.com/hello.html"
    response = urllib.request.urlopen(url, timeout=0.1)
except urllib.error.HTTPError as e:
    # The server answered, but with an error status (e.g. 404 for this non-existent page)
    print(e.code, e.reason, e.headers)
except urllib.error.URLError as e:
    # No proper HTTP answer at all (DNS failure, connection timeout, ...)
    print(e.reason)
else:
    content = response.read()