Python Web Scraping, Part 11: Sending GET and POST Requests

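Scraping data from web pages or interacting with an API usually comes down to a few key steps. This part shows how to send GET and POST requests with Python's requests library and how to parse the data that comes back.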

1. Install the required library

First make sure the requests library is installed. If it is not, install it with the following command:

pip install requests

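2. Sending a GET request

GET requests are typically used to retrieve information, for example fetching a word's definition from an API. The sample code below sends a GET request to a fictional API to get a word's definition: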

import requests

def get_word_definition(word):
    url = f"https://api.example.com/words/{word}"
    response = requests.get(url)
    
    if response.status_code == 200:
        data = response.json()
        return data['definition']
    else:
        return None

print(get_word_definition('example'))
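
One detail the function above skips: the word is dropped straight into the URL, so input containing spaces, slashes, or non-ASCII characters would produce a malformed or unintended path. Below is a minimal sketch of percent-encoding the word first, using only the standard library. build_word_url is a hypothetical helper name, and api.example.com is the same fictional API as above:

```python
from urllib.parse import quote

BASE_URL = "https://api.example.com/words/"  # fictional API from the example above

def build_word_url(word):
    """Return the lookup URL for `word`, with the word percent-encoded."""
    # safe="" also encodes "/", so a word can never add extra path segments
    return BASE_URL + quote(word, safe="")

print(build_word_url("ice cream"))  # https://api.example.com/words/ice%20cream
print(build_word_url("café"))       # https://api.example.com/words/caf%C3%A9
```

The returned string can be passed to requests.get() in place of the f-string URL.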

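3. Sending a POST request

POST requests are typically used to submit data to a server. For example, you can use a POST request to add a new word to your word management system. Assuming an API that allows this, the code looks like this: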

import requests

def add_word_to_system(word, definition):
    url = "https://api.example.com/words"
    payload = {'word': word, 'definition': definition}
    
    # json= serializes the payload and sets Content-Type: application/json
    response = requests.post(url, json=payload)
    
    if response.status_code == 201:
        print(f"Word '{word}' added successfully.")
    else:
        print("Failed to add the word.")

add_word_to_system('example', 'A sample or model.')
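
It can help to see what such a POST actually carries: a JSON-encoded body plus a Content-Type header. The sketch below builds both with the standard library only and sends nothing. build_json_request is a hypothetical helper written for illustration, not part of requests:

```python
import json

def build_json_request(word, definition):
    """Return (body, headers) for a JSON POST, without sending anything."""
    payload = {"word": word, "definition": definition}
    body = json.dumps(payload).encode("utf-8")  # bytes that would go on the wire
    headers = {"Content-Type": "application/json"}
    return body, headers

body, headers = build_json_request("example", "A sample or model.")
print(body.decode("utf-8"))  # {"word": "example", "definition": "A sample or model."}
print(headers["Content-Type"])  # application/json
```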

4. Parsing HTML pages (with BeautifulSoup)

If you need to scrape data from an HTML page, you can parse the HTML with the BeautifulSoup library. Install beautifulsoup4 first:

pip install beautifulsoup4

Then fetch the page content with requests and parse it with BeautifulSoup:

from bs4 import BeautifulSoup
import requests

def get_word_from_web(word):
    url = f"https://www.dictionary.com/browse/{word}"
    response = requests.get(url)
    
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # find() returns None when nothing matches, so check before calling get_text()
        element = soup.find('div', {'class': 'css-1o56a8i e1hk9ate4'})
        return element.get_text().strip() if element else None
    else:
        return None

print(get_word_from_web('example'))

Note that the real site's structure may not match the selector used above, so adjust it to the actual HTML of your target site.

5. Error handling

Network requests should always include error handling to cope with network problems, server errors, and data parsing failures.

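That covers the basics of sending GET and POST requests with Python and the requests library, and of parsing HTML pages. For a specific target site or API you will likely need to adapt this code.

The examples become more practical with error handling and logging added; a retry mechanism comes later. The improved versions follow:

GET request example - fetching a word definition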

import requests
import logging

logging.basicConfig(level=logging.INFO)

def get_word_definition(word):
    base_url = "https://api.example.com/words/"
    url = base_url + word
    
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raises an HTTPError for bad responses (4xx and 5xx)
        
        data = response.json()
        return data['definition']
    except requests.RequestException as e:
        logging.error(f"Request failed: {e}")
        return None
    except (KeyError, ValueError) as e:
        logging.error(f"Failed to parse response: {e}")
        return None

# Usage example
word = 'example'
definition = get_word_definition(word)
if definition:
    print(f"The definition of '{word}' is: {definition}")
else:
    print(f"Could not retrieve definition for '{word}'.")

POST request example - adding a word to the system

import requests
import logging

logging.basicConfig(level=logging.INFO)

def add_word_to_system(word, definition):
    url = "https://api.example.com/words"
    payload = {'word': word, 'definition': definition}
    
    try:
        # json= serializes the payload and sets Content-Type: application/json
        response = requests.post(url, json=payload)
        response.raise_for_status()
        
        print(f"Word '{word}' added successfully.")
    except requests.RequestException as e:
        logging.error(f"Request failed: {e}")

# Usage example
word = 'example'
definition = 'A sample or model.'
add_word_to_system(word, definition)

HTML parsing example - getting a word definition from a web page

from bs4 import BeautifulSoup
import requests
import logging

logging.basicConfig(level=logging.INFO)

def get_word_from_web(word):
    base_url = "https://www.dictionary.com/browse/"
    url = base_url + word
    
    try:
        response = requests.get(url)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.text, 'html.parser')
        definition = soup.find('div', {'class': 'css-1o56a8i e1hk9ate4'})
        
        if definition:
            return definition.get_text().strip()
        else:
            logging.warning("Definition not found on page.")
            return None
    except requests.RequestException as e:
        logging.error(f"Request failed: {e}")
        return None

# Usage example
word = 'example'
definition = get_word_from_web(word)
if definition:
    print(f"The definition of '{word}' from the web is: {definition}")
else:
    print(f"Could not retrieve definition for '{word}' from the web.")

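The code above adds basic logging, which helps with debugging when something goes wrong. It also uses response.raise_for_status() to detect HTTP errors automatically and deals with them through exception handling. For HTML parsing, it checks whether the expected element was found and logs a warning if not. These are all important practices when building scrapers or other networked applications.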

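More advanced features are worth considering as well, such as proxies, timeouts, and retries. The code below implements them:

Setting a proxy and a timeout

You may need to send requests through a proxy server or set a timeout on them. Here is the modified get_word_definition function: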

import requests
import logging
from requests.adapters import HTTPAdapter
# urllib3 is installed with requests; importing it directly avoids the legacy requests.packages alias
from urllib3.util.retry import Retry

logging.basicConfig(level=logging.INFO)

def get_word_definition(word, proxies=None, timeout=10):
    base_url = "https://api.example.com/words/"
    url = base_url + word
    
    session = requests.Session()
    retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    
    try:
        response = session.get(url, proxies=proxies, timeout=timeout)
        response.raise_for_status()
        
        data = response.json()
        return data['definition']
    except requests.RequestException as e:
        logging.error(f"Request failed: {e}")
        return None
    except (KeyError, ValueError) as e:
        logging.error(f"Failed to parse response: {e}")
        return None

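This example introduces requests' Session object and urllib3's Retry class to set the number of retries and the backoff between them. The function also accepts proxies and a timeout value, which are passed straight to session.get().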
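
To make total and backoff_factor less abstract, here is a standard-library sketch of the general idea: retry a failing call a limited number of times, waiting longer after each failure. It is not urllib3's implementation, and its delay schedule may differ from Retry's. call_with_retries and its parameters are made up for illustration:

```python
import time

def call_with_retries(func, retries=3, backoff_factor=0.5, sleep=time.sleep):
    """Call func(); on failure, retry up to `retries` times with growing delays."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception:  # real code should catch only the errors worth retrying
            if attempt >= retries:
                raise  # out of retries: let the last error propagate
            # Delay doubles on every failure: 0.5s, 1.0s, 2.0s, ...
            sleep(backoff_factor * (2 ** attempt))
            attempt += 1

# Demo: a function that fails twice and then succeeds.
# Delays are recorded instead of slept so the demo runs instantly.
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

recorded_delays = []
print(call_with_retries(flaky, sleep=recorded_delays.append))  # ok
print(recorded_delays)  # [0.5, 1.0]
```

Passing the sleep function as a parameter is a design choice that keeps the helper easy to test; in normal use the default time.sleep applies.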

Using a proxy

If you need to send requests through a proxy server, call the function like this:

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

word = 'example'
definition = get_word_definition(word, proxies=proxies)

Implementing retries

The example above already builds retries into get_word_definition, but if you want retries on every request, configure them at the session level:

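import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry_strategy = Retry(
    total=3,
    status_forcelist=[429, 500, 502, 503, 504],
    # allowed_methods replaced method_whitelist in urllib3 1.26; the old name was removed in 2.0
    allowed_methods=["HEAD", "GET", "OPTIONS"]
)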
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)
session.mount("https://", adapter)

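Every request made through this session, whether session.get() or session.post(), goes through the same adapter. Note that with the method list configured above, status-based retries apply only to HEAD, GET, and OPTIONS. POST is not retried on those status codes unless you add it to the list, which is only safe if the endpoint tolerates repeated submissions.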
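
The interplay between the method list and status_forcelist can be pictured with a small decision function. This is a simplified model written for illustration, not urllib3's actual code; should_retry_on_status and its defaults are hypothetical:

```python
RETRY_STATUSES = frozenset({429, 500, 502, 503, 504})
RETRY_METHODS = frozenset({"HEAD", "GET", "OPTIONS"})

def should_retry_on_status(method, status_code,
                           retry_methods=RETRY_METHODS,
                           retry_statuses=RETRY_STATUSES):
    """Return True if a response with this status should trigger a retry."""
    # Both conditions must hold: the method is considered safe to repeat
    # and the status code is one we treat as transient.
    return method.upper() in retry_methods and status_code in retry_statuses

print(should_retry_on_status("GET", 503))   # True
print(should_retry_on_status("POST", 503))  # False
print(should_retry_on_status("GET", 404))   # False
```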

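These are some common practices for building robust network requests. In a real deployment you may also need to consider authentication, certificate verification, concurrency control, and other details.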