
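Multi-page image scraping with a Python crawler: pretty-girl wallpapers

  Welcome back, everyone! Last time we used a crawler to download images, but only from a single page. What do we do when we want to scrape several pages? This time I, 粉兔, will walk you through scraping pretty-girl wallpapers across multiple pages.

  It is actually quite simple, so don't be scared. I only just learned it myself, which means I still remember where beginners get stuck and can answer your questions more fully. Follow along and let's learn it together.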

I. First, set up the crawler's basic scaffolding

1. Same site as before: 彼岸图网

彼岸图网 (pic.netbian.com): a gallery of 4K to 8K wallpapers for desktop and phone

import requests

url = 'https://pic.netbian.com/'
# Browser-like headers so the site serves the normal page
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36',
    'Referer': 'http://pic.netbian.com/e/search/result/?searchid=1224'
}
cont = requests.get(url=url, headers=headers)
# Let requests guess the charset from the body so the Chinese text decodes correctly
cont.encoding = cont.apparent_encoding

If anything here is unclear, have a look at the earlier crawler chapter I posted 😊

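II. The basic idea behind multi-page scraping

1. Click through to page 2

2. Find the girls

Notice that the address of page 2 is the original URL with index_2.html appended.

So page 2 is requested and parsed exactly like page 1; the only difference is the index_2.html suffix.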

url = 'https://pic.netbian.com/index_{}.html'.format(2)
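To make the pattern concrete, here is a minimal sketch that builds the address for any page number. The helper name page_url is my own invention, and it assumes that page 1 lives at the site root (the URL used in the first request) while later pages follow the index_N.html pattern seen on page 2.

```python
BASE = 'https://pic.netbian.com/'


def page_url(page):
    """Return the listing URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    if page == 1:
        # Assumption: the first page is the site root, not index_1.html
        return BASE
    return BASE + 'index_{}.html'.format(page)


for n in (1, 2, 3):
    print(page_url(n))
```

Keeping the URL logic in one small function means the download loop only has to deal with page numbers.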

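3. Round up the girls

Because we want several pages, we need a loop as the main body. We wrap it in a function of our own so the logic lives in one place and can be called again later.

Here I only scrape the first ten pages as a demonstration; adjust the range to suit your own needs.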

def win():
    for i in range(1, 11):  # pages 1 to 10
        # Page 1 is the site root; the index_{}.html form was seen from page 2 onward
        if i == 1:
            url = 'https://pic.netbian.com/'
        else:
            url = 'https://pic.netbian.com/index_{}.html'.format(i)
        headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36",
            'Referer': 'http://pic.netbian.com/e/search/result/?searchid=1224'
        }
        con = requests.get(url=url, headers=headers)
        con.encoding = con.apparent_encoding

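4. Peel off the wrapping! (regular-expression matching with re)

text = con.text
# Capture each image path that starts with /u and the alt text next to it;
# re.S lets "." match newlines as well
parser = re.compile(r'src="(/u.*?)" alt="(.*?)"', re.S)
rel = parser.findall(text)  # list of (link, name) tuples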
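To see what findall returns here, the sketch below runs the same pattern over a small hand-written HTML snippet. The snippet is made up for illustration and only imitates the shape of the tags the pattern targets; the real page markup may differ.

```python
import re

# Made-up sample: two thumbnails under /uploads plus one unrelated image
sample = '''
<li><a href="/tupian/1.html"><img src="/uploads/allimg/a.jpg" alt="美女 壁纸" /></a></li>
<li><a href="/tupian/2.html"><img src="/uploads/allimg/b.jpg" alt="风景 壁纸" /></a></li>
<img src="/static/logo.png" alt="logo" />
'''

parser = re.compile(r'src="(/u.*?)" alt="(.*?)"', re.S)
rel = parser.findall(sample)

# Each match is a (link, name) tuple, one per capture group
for link, name in rel:
    print(link, name)
```

Because the first group must begin with /u, the logo under /static is skipped, and every remaining match carries both the relative path and the title.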

5. Build a house for them (create a folder to store our "study material")

path = 'photo'
# Create the output folder on the first run
if not os.path.exists(path):
    os.mkdir(path)
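As an alternative sketch, os.makedirs with exist_ok=True does the check and the creation in one call. The example works inside a temporary directory (my own choice, so that running it leaves nothing behind in your project folder).

```python
import os
import tempfile

# Work inside a throwaway directory for the demonstration
base = tempfile.mkdtemp()
target = os.path.join(base, 'photo')

os.makedirs(target, exist_ok=True)  # creates the folder
os.makedirs(target, exist_ok=True)  # second call is a no-op, no exception

print(os.path.isdir(target))
```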

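6. Ask for their address and name (get each image's link and title from the page)

for img in rel:
    time.sleep(1)  # pause one second per image to go easy on the server
    link = img[0]  # relative image path
    name = img[1]  # alt text, used as the file name
    name = re.sub(r"\*", "", name)  # drop asterisks, which Windows file names cannot contain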
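The asterisk is not the only character that can break a file name. As a hedged sketch, the hypothetical helper safe_name below strips every character that Windows forbids in file names; whether the site's titles ever contain the others is an assumption I have not checked.

```python
import re


def safe_name(title):
    """Remove characters Windows does not allow in file names (hypothetical helper)."""
    # Forbidden set: \ / : * ? " < > |
    return re.sub(r'[\\/:*?"<>|]', '', title).strip()


print(safe_name('a*b?c'))
print(safe_name('4K/美女: 壁纸'))
```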

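7. Screen out impostors (store only when a condition holds)

Now that we have each title, it is time to save the files. We add one condition, though: only keep images whose title contains the character 女 ("female"). This matters!

    # Still inside the "for img in rel" loop from step 6
    if '女' in name:  # keep only titles that contain 女
        urlo = 'https://pic.netbian.com'
        res = requests.get(urlo + link)
        # The with block closes the file automatically
        with open(path + '/' + '{}.jpg'.format(name), 'wb') as f:
            f.write(res.content)
        print(name + ".jpg downloaded successfully")

if __name__ == '__main__':
    win()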
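The filter is easy to try out offline. This sketch uses made-up titles and a hypothetical helper wanted to show the membership test. It also counts how often a character-by-character comparison would fire for a title that contains 女 twice, which is why testing the whole title once is the safer choice.

```python
# Made-up titles for illustration
titles = ['美女 壁纸', '风景 壁纸', '女孩 女生 写真']


def wanted(name, keyword='女'):
    """Return True when the title contains the keyword (hypothetical helper)."""
    return keyword in name


kept = [t for t in titles if wanted(t)]
print(kept)

# Comparing character by character triggers once per occurrence,
# so a download placed in that loop would run this many times
hits = sum(1 for ch in titles[2] if ch == '女')
print(hits)
```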

III. Full code

import requests
import re
import time
import os


def win():
    for i in range(1, 11):  # pages 1 to 10
        # Page 1 is the site root; the index_{}.html form was seen from page 2 onward
        if i == 1:
            url = 'https://pic.netbian.com/'
        else:
            url = 'https://pic.netbian.com/index_{}.html'.format(i)
        headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36",
            'Referer': 'http://pic.netbian.com/e/search/result/?searchid=1224'
        }
        con = requests.get(url=url, headers=headers)
        con.encoding = con.apparent_encoding
        text = con.text
        # Capture (image path, alt text) pairs; re.S lets "." match newlines
        parser = re.compile(r'src="(/u.*?)" alt="(.*?)"', re.S)
        rel = parser.findall(text)

        # Create the output folder on the first run
        path = 'photo'
        if not os.path.exists(path):
            os.mkdir(path)
        for img in rel:
            time.sleep(1)  # pause one second per image to go easy on the server
            link = img[0]
            name = img[1]
            name = re.sub(r"\*", "", name)  # asterisks are not valid in Windows file names
            if '女' in name:  # keep only titles that contain 女
                urlo = 'https://pic.netbian.com'
                res = requests.get(urlo + link)
                # The with block closes the file automatically
                with open(path + '/' + '{}.jpg'.format(name), 'wb') as f:
                    f.write(res.content)
                print(name + ".jpg downloaded successfully")


if __name__ == '__main__':
    win()
