Hands-On Python Web Scraping with the requests Package

1. Technology Stack

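Core libraries:
- requests: sends HTTP requests. It is one of the most widely used HTTP libraries in Python and makes it easy to send GET, POST, and other requests the way a browser would, then retrieve the page content.
- BeautifulSoup (optional): parses HTML or XML. Use it when the page has to be parsed and specific data extracted from it.
- re (optional): Python's built-in regular expression module. It can assist with data extraction, especially when a specific text pattern has to be matched exactly.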

2. Objects and Crawler Usage

Sending a request:

Use requests.get() or requests.post() to send a request. For example, to fetch a page from the coupon site 领券网 at http://www.i075.com/:

import requests

url = "http://www.i075.com/"
response = requests.get(url)

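The response object holds the server's reply: the status code, the response headers, the response body, and so on.

Handling the response:

Check the status code to make sure the request succeeded: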

if response.status_code == 200:
    print("Request succeeded")
else:
    print(f"Request failed, status code: {response.status_code}")

Get the page content:

content = response.text
print(content)
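
The response attributes described above can be tried without touching a real site. The following is only a sketch under stated assumptions: it starts a throwaway local server (DemoHandler and inspect_response are hypothetical names, not part of requests) and reads status_code, headers, encoding, and text from the response. trust_env is switched off on the session so that proxy environment variables do not interfere with the local request.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that serves one fixed HTML page."""

    def do_GET(self):
        body = "<html><head><title>Demo</title></head><body>hello</body></html>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def inspect_response():
    """Fetch the demo page and return the parts of the response discussed above."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: the OS picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        with requests.Session() as session:
            session.trust_env = False  # ignore proxy settings for this local request
            response = session.get(f"http://127.0.0.1:{server.server_port}/", timeout=5)
        return (
            response.status_code,              # numeric HTTP status
            response.headers["Content-Type"],  # headers behave like a case-insensitive dict
            response.encoding,                 # encoding taken from the Content-Type charset
            response.text,                     # body decoded to str using that encoding
        )
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, content_type, encoding, text = inspect_response()
    print(status, content_type, encoding)
    print(text)
```

When a page comes back garbled, response.encoding is usually the first attribute worth checking, since response.text is decoded with it.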

For requests that need to log in or carry parameters, use the following:

# GET request with query parameters
params = {'key1': 'value1', 'key2': 'value2'}
response = requests.get(url, params=params)

# POST request with form data
data = {'username': 'user', 'password': 'pass'}
response = requests.post(url, data=data)
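
To see where params and data actually end up, the requests can be built without being sent. This is a minimal sketch that uses requests.Request(...).prepare(); the parameter names and values are the same placeholders as above, not fields the site is known to accept.

```python
import requests

url = "http://www.i075.com/"

# Build the requests without sending them, so nothing goes over the network.
get_request = requests.Request(
    "GET", url, params={"key1": "value1", "key2": "value2"}
).prepare()
post_request = requests.Request(
    "POST", url, data={"username": "user", "password": "pass"}
).prepare()

print(get_request.url)                       # params are appended to the URL as a query string
print(post_request.body)                     # data is form-encoded into the request body
print(post_request.headers["Content-Type"])  # and the matching Content-Type header is set
```

In short, params changes the URL while data changes the body, which is why login forms are normally submitted with data rather than params.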

Parsing the content (if you use BeautifulSoup):

First install BeautifulSoup: pip install beautifulsoup4
Then use it like this:

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')
# Find elements, for example all <a> tags
links = soup.find_all('a')
for link in links:
    print(link.get('href'))

For more complex lookups and data extraction, BeautifulSoup offers several selectors and methods, such as find(), find_all(), and select().
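
The difference between the three methods is easiest to see on a small document. The HTML below is made up for illustration and is not taken from the real site; the sketch only shows what each call returns.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = """
<html><body>
  <h1 id="headline">Today's coupons</h1>
  <ul class="deals">
    <li><a href="/deal/1">Deal one</a></li>
    <li><a href="/deal/2">Deal two</a></li>
  </ul>
  <a href="/about">About</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if nothing matches
first_link = soup.find("a")

# find_all() returns every matching tag as a list
all_hrefs = [a.get("href") for a in soup.find_all("a")]

# select() takes a CSS selector, handy for "links inside the deals list"
deal_titles = [a.get_text(strip=True) for a in soup.select("ul.deals a")]

# Keyword arguments filter on attributes, here the id
headline = soup.find("h1", id="headline").get_text()

print(first_link.get("href"), all_hrefs, deal_titles, headline)
```

Because find() can return None, check its result before calling methods on it when the page structure is not guaranteed.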

Extracting data with regular expressions (if you use re):

For example, to extract every email address from the page content:

import re

emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', content)
for email in emails:
    print(email)
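
The pattern can be checked against a known string before it is pointed at a real page. The sample text below is invented for the test. Note that inside a character class `|` is a literal character rather than "or", so the top-level domain class is written `[A-Za-z]`.

```python
import re

# Precompile the pattern when it will be reused across many pages
EMAIL_PATTERN = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")

# Hypothetical sample text used only for this illustration
sample = "Contact sales@example.com or support@mail.example.org; not-an-email@ is ignored."

emails = EMAIL_PATTERN.findall(sample)
for email in emails:
    print(email)
```

This pattern is a pragmatic filter, not a full validator: it will miss some valid addresses and accept some invalid ones.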

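3. Setting Up the Environment and Dependencies

Installing Python:

Make sure Python is installed. Python 3.x is recommended and can be downloaded from the official Python website.

Installing the requests library:

Open a terminal and run the following command: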

pip install requests

Installing other optional libraries (as needed):

If you plan to parse pages with BeautifulSoup, install beautifulsoup4:

pip install beautifulsoup4

4. Example Crawler Code
The following is a complete, simple crawler that fetches a page and extracts its title:

import requests
from bs4 import BeautifulSoup


def fetch_and_parse(url):
    try:
        # Send the request; the timeout keeps a stalled server from hanging the script
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            # Parse the content
            soup = BeautifulSoup(response.text, 'html.parser')
            # Look up the title; soup.title is None when the page has no <title> tag
            title = soup.title.string if soup.title else None
            print(f"Page title: {title}")
        else:
            print(f"Request failed, status code: {response.status_code}")
    except requests.RequestException as e:
        print(f"Request error: {e}")


if __name__ == "__main__":
    target_url = "http://www.i075.com"
    fetch_and_parse(target_url)

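In this example:

- requests.get() sends the request to the target URL.
- The status code is checked; if it is 200, BeautifulSoup parses the response content.
- soup.title.string locates the page's title element and extracts its text.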

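This outline gives you a basic framework for a Python crawler built on the requests package. You can extend and adapt it to handle more complex tasks, such as crawling multiple pages, managing sessions, and dealing with anti-crawler mechanisms.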
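
As one possible starting point for the multi-page and session cases, the sketch below reuses a single requests.Session so that headers and cookies are shared across requests. build_session and prepare_page_requests are hypothetical helpers, the User-Agent string is a placeholder, and the page query parameter is an assumption about how a site might paginate, not something the site above is known to support. The requests are only prepared here, not sent.

```python
import requests


def build_session(user_agent="example-crawler/0.1"):
    """Hypothetical helper: one Session that shares headers and cookies across requests."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def prepare_page_requests(session, base_url, pages):
    """Prepare (without sending) one GET request per page number."""
    return [
        # prepare_request() merges the session's headers and cookies into each request
        session.prepare_request(requests.Request("GET", base_url, params={"page": page}))
        for page in pages
    ]


session = build_session()
prepared = prepare_page_requests(session, "http://www.i075.com/", [1, 2, 3])
for request in prepared:
    print(request.url, request.headers["User-Agent"])
```

Each prepared request could then be sent with session.send(request, timeout=10). Pausing between requests and respecting the site's robots.txt keeps the crawler from overloading the server.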
