Python Web Scraping, Lesson 15: CSS Selector Basics

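When scraping web pages with Python, CSS selectors are one of the most common ways to extract specific elements from an HTML document. A CSS selector locates elements by their position in the document structure and by their attributes. Combined with a library such as BeautifulSoup or PyQuery, selectors make it straightforward to parse a page and filter out the data you need.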

CSS Selector Basics

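Type (tag) selector:
Uses the element name as the selector, e.g. div or a.
Class selector:
Uses a dot prefix followed by the class name, e.g. .classname.
ID selector:
Uses a hash prefix followed by the ID, e.g. #idname.
Attribute selector:
Selects elements by attribute: [href] matches any element that has an href attribute, and [class="myclass"] matches elements whose class attribute is exactly myclass.
Child combinator:
Selects the direct children of an element, e.g. ul > li.
Descendant combinator:
Selects all descendants of an element at any depth, e.g. div p (every p inside a div).
Adjacent sibling combinator:
Selects the element that immediately follows another element with the same parent, e.g. h1 + p.
General sibling combinator:
Selects every later sibling of an element, not only the adjacent one, e.g. h1 ~ p (all p elements that come after an h1 and share its parent).
Selector list:
Separates several selectors with commas and matches elements that satisfy any of them, e.g. div, span.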
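
The selector types listed above can be tried side by side. The following is a minimal sketch using BeautifulSoup's select(); the sample markup and the texts() helper are made up for illustration and are not part of any library.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, chosen so every selector type has a match.
html = """
<div id="box" class="panel">
  <h1>Heading</h1>
  <p class="intro">First</p>
  <p>Second</p>
  <ul>
    <li><a href="/a">A</a></li>
    <li><span>no link</span></li>
  </ul>
</div>
<span class="note">Outside</span>
"""

soup = BeautifulSoup(html, "html.parser")


def texts(selector):
    """Return the stripped text of every element matching the selector."""
    return [el.get_text(strip=True) for el in soup.select(selector)]


results = {
    "p": texts("p"),                  # type selector
    ".intro": texts(".intro"),        # class selector
    "#box > h1": texts("#box > h1"),  # ID selector plus child combinator
    "[href]": texts("[href]"),        # attribute presence
    "div span": texts("div span"),    # descendant at any depth
    "h1 + p": texts("h1 + p"),        # only the immediately following sibling
    "h1 ~ p": texts("h1 ~ p"),        # every later sibling that is a <p>
    "h1, span": texts("h1, span"),    # selector list (union of both)
}

for selector, matched in results.items():
    print(f"{selector:10} -> {matched}")
```

Comparing the h1 + p and h1 ~ p rows shows the difference between the two sibling combinators: the first returns only the paragraph directly after the heading, the second returns both paragraphs.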

Using CSS Selectors in Python

Using BeautifulSoup

Suppose you have the following HTML:

<div id="content">
    <h1>My Title</h1>
    <p class="description">This is a description.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
    </ul>
</div>

Use BeautifulSoup to parse it and extract the data:

from bs4 import BeautifulSoup

html_doc = """
<div id="content">
    <h1>My Title</h1>
    <p class="description">This is a description.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
    </ul>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Get the title
title = soup.select_one('h1').text
print("Title:", title)

# Get the description
description = soup.select_one('.description').text
print("Description:", description)

# Get the list items
items = [item.text for item in soup.select('li')]
print("Items:", items)
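
One thing the example above relies on is that every selector finds a match. select_one() returns None when nothing matches, so calling .text on the result would raise AttributeError. A small sketch of a guard follows; first_text is a hypothetical helper name, not a BeautifulSoup function.

```python
from bs4 import BeautifulSoup

# This snippet deliberately has no element with class "description".
html = '<div id="content"><h1>My Title</h1></div>'
soup = BeautifulSoup(html, "html.parser")


def first_text(node, selector, default=None):
    """Hypothetical helper: text of the first match, or default if none."""
    match = node.select_one(selector)
    return match.get_text(strip=True) if match is not None else default


title = first_text(soup, "h1")
description = first_text(soup, ".description", default="(no description)")
missing = soup.select(".description")  # select() returns an empty list instead

print("Title:", title)
print("Description:", description)
print("Matches for .description:", missing)
```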

Using PyQuery

PyQuery's API is modeled on jQuery:

from pyquery import PyQuery as pq

html_doc = """
<div id="content">
    <h1>My Title</h1>
    <p class="description">This is a description.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
    </ul>
</div>
"""

doc = pq(html_doc)

# Get the title
title = doc('h1').text()
print("Title:", title)

# Get the description
description = doc('.description').text()
print("Description:", description)

# Get the list items
items = doc('li').map(lambda i, e: pq(e).text())
print("Items:", list(items))

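That covers the basics of using CSS selectors with Python to scrape data from web pages. With these tools you can extract the information you need from a page flexibly and precisely.

We can go further: handle more complex HTML structures, use more kinds of CSS selectors, and deal with errors that may occur. The following, more detailed example shows how to use BeautifulSoup and PyQuery on an HTML document with more elements and attributes.

Suppose we have the following HTML structure: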

<html>
<head>
    <title>Sample Page</title>
</head>
<body>
    <div id="header">
        <h1>Welcome to Our Site</h1>
    </div>
    <div id="content">
        <section class="main">
            <article class="post" data-id="1">
                <h2>Post Title 1</h2>
                <p>Some text here...</p>
                <a href="/post/1" class="read-more">Read more</a>
            </article>
            <article class="post" data-id="2">
                <h2>Post Title 2</h2>
                <p>Some other text here...</p>
                <a href="/post/2" class="read-more">Read more</a>
            </article>
        </section>
        <aside>
            <h3>Latest Comments</h3>
            <ul>
                <li>User 1 commented on Post 1</li>
                <li>User 2 commented on Post 2</li>
            </ul>
        </aside>
    </div>
    <footer>
        <p>Copyright © 2024</p>
    </footer>
</body>
</html>

We will use this HTML to demonstrate how to extract the post titles, text, and links.

Using BeautifulSoup

from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
    <title>Sample Page</title>
</head>
<body>
    <div id="header">
        <h1>Welcome to Our Site</h1>
    </div>
    <div id="content">
        <section class="main">
            <article class="post" data-id="1">
                <h2>Post Title 1</h2>
                <p>Some text here...</p>
                <a href="/post/1" class="read-more">Read more</a>
            </article>
            <article class="post" data-id="2">
                <h2>Post Title 2</h2>
                <p>Some other text here...</p>
                <a href="/post/2" class="read-more">Read more</a>
            </article>
        </section>
        <aside>
            <h3>Latest Comments</h3>
            <ul>
                <li>User 1 commented on Post 1</li>
                <li>User 2 commented on Post 2</li>
            </ul>
        </aside>
    </div>
    <footer>
        <p>Copyright © 2024</p>
    </footer>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

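# Extract the titles of all posts
titles = [post.h2.text for post in soup.select('.post')]
print("Titles:", titles)

# Extract the links of all posts
# The selector already returns the <a> elements; [href] keeps only those with an href
links = [a['href'] for a in soup.select('.post a.read-more[href]')]
print("Links:", links)

# Extract the text of the first post
# :first-of-type compares tag names, so this matches a .post that is the
# first element of its type among its siblings
first_post_text = soup.select_one('.post:first-of-type p').text
print("First Post Text:", first_post_text)

# Check whether a latest-comments list exists
latest_comments = soup.select_one('#content aside ul')
if latest_comments:
    print("Latest Comments Found!")
else:
    print("No latest comments found.")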
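
The posts in this document also carry a data-id attribute, which the example above does not use. As a rough sketch, attribute selectors and positional pseudo-classes can target individual posts; the markup below is a trimmed copy of the sample page, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample page: only the two posts are kept.
html = """
<section class="main">
  <article class="post" data-id="1">
    <h2>Post Title 1</h2>
    <a href="/post/1" class="read-more">Read more</a>
  </article>
  <article class="post" data-id="2">
    <h2>Post Title 2</h2>
    <a href="/post/2" class="read-more">Read more</a>
  </article>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact attribute match on data-id.
second_title = soup.select_one('article.post[data-id="2"] h2').get_text(strip=True)

# Prefix match: every link whose href starts with "/post/".
post_links = [a["href"] for a in soup.select('a[href^="/post/"]')]

# Positional pseudo-class: the second <article> among its siblings.
nth_title = soup.select_one("article:nth-of-type(2) h2").get_text(strip=True)

# Tag attributes can be read like a dict; .get() returns None when absent.
ids = [post.get("data-id") for post in soup.select("article.post")]

print(second_title, post_links, nth_title, ids)
```

Note that :nth-of-type and :first-of-type count siblings by tag name, not by class, so they identify "the first .post" only when no other article element precedes it under the same parent.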

Using PyQuery

from pyquery import PyQuery as pq

html_doc = """
<html>
<!-- HTML content here -->
</html>
"""

doc = pq(html_doc)

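# Extract the titles of all posts
titles = doc('.post h2').map(lambda i, e: pq(e).text())
print("Titles:", list(titles))

# Extract the links of all posts
links = doc('.post .read-more').map(lambda i, e: pq(e).attr('href'))
print("Links:", list(links))

# Extract the text of the first post
# eq(0) takes the first matched .post; cssselect, which PyQuery uses for
# selectors, does not support :first-of-type without a tag name
first_post_text = doc('.post').eq(0).find('p').text()
print("First Post Text:", first_post_text)

# Check whether a latest-comments list exists
latest_comments = doc('#content aside ul')
if latest_comments.length:
    print("Latest Comments Found!")
else:
    print("No latest comments found.")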

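The code above shows how to use CSS selectors with Python libraries to extract information from a more complex HTML document. In real-world use you will also have to deal with failed network requests, malformed HTML, and pages whose structure varies, so production code needs additional error checking and exception handling.

Next, we add exception handling so that the program deals gracefully with network errors, invalid HTML, and missing elements. We can also make the code more robust by using more specific CSS selectors, which reduces the chance of false matches, and by keeping performance in mind when processing large amounts of data.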
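
To illustrate the point about more specific selectors, here is a small sketch. The sidebar element with class "post" is an invented decoy, added only to show how a broad selector can pick up unintended matches.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the sidebar reuses the "post" class for a teaser.
html = """
<div id="content">
  <section class="main">
    <article class="post"><h2>Real post</h2></article>
  </section>
  <aside>
    <div class="post"><h2>Sidebar teaser</h2></div>
  </aside>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Broad selector: matches every h2 under anything with class "post".
broad = [h.get_text(strip=True) for h in soup.select(".post h2")]

# Scoped selector: spells out the expected path with child combinators.
scoped = [
    h.get_text(strip=True)
    for h in soup.select("#content > section.main > article.post > h2")
]

# Alternative: locate the container once, then query only inside it.
main = soup.select_one("#content > section.main")
within = [h.get_text(strip=True) for h in main.select("article.post > h2")]

print("broad: ", broad)
print("scoped:", scoped)
print("within:", within)
```

Locating the container first also limits later queries to that subtree, which may help when the same page is queried many times. The trade-off is that a very specific path breaks more easily when the site changes its layout.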

Below are improved versions of the extraction code for the HTML above, using BeautifulSoup and PyQuery:

Using BeautifulSoup

from bs4 import BeautifulSoup
import requests

def fetch_data(url):
    try:
        response = requests.get(url, timeout=10)  # avoid hanging indefinitely on a stalled server
        response.raise_for_status()  # raises HTTPError for 4xx and 5xx responses
        return response.text
    except requests.RequestException as e:
        print(f"Error fetching URL: {url}")
        print(e)
        return None

def parse_html(html):
    if html is None:
        return []
    
    soup = BeautifulSoup(html, 'html.parser')
    
    posts = []
    for post in soup.select('.post'):
        try:
            title = post.h2.text.strip()
            text = post.p.text.strip()
            link = post.find('a', class_='read-more')['href']
            posts.append({
                'title': title,
                'text': text,
                'link': link
            })
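        except (AttributeError, TypeError, KeyError):
            # AttributeError: missing h2 or p; TypeError: find() returned None; KeyError: no href
            print("Missing element in post, skipping...")
            continue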
    
    return posts

def main():
    url = "http://example.com"
    html = fetch_data(url)
    posts = parse_html(html)
    print(posts)

if __name__ == "__main__":
    main()
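
The missing-element handling can be checked without any network access by feeding the parser a string. The sketch below is a hypothetical variant, parse_posts, that tests for None explicitly instead of catching exceptions; the sample markup is invented, with two deliberately incomplete posts.

```python
from bs4 import BeautifulSoup


def parse_posts(html):
    """Hypothetical variant of parse_html: returns (posts, skipped_count)."""
    soup = BeautifulSoup(html, "html.parser")
    posts, skipped = [], 0
    for post in soup.select(".post"):
        title = post.select_one("h2")
        text = post.select_one("p")
        # [href] makes a link without an href count as missing.
        link = post.select_one("a.read-more[href]")
        if title is None or text is None or link is None:
            skipped += 1
            continue
        posts.append({
            "title": title.get_text(strip=True),
            "text": text.get_text(strip=True),
            "link": link["href"],
        })
    return posts, skipped


# Invented sample: one complete post, one without a link, one without href.
sample = """
<article class="post"><h2>Complete</h2><p>Body</p><a class="read-more" href="/post/1">Read more</a></article>
<article class="post"><h2>No link</h2><p>Body</p></article>
<article class="post"><h2>No href</h2><p>Body</p><a class="read-more">Read more</a></article>
"""

posts, skipped = parse_posts(sample)
print(posts)
print("Skipped:", skipped)
```

Explicit None checks keep the reason for skipping a post visible in the code, while the try/except style above is shorter; which one reads better is largely a matter of taste.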

Using PyQuery

from pyquery import PyQuery as pq
import requests

def fetch_data(url):
    try:
        response = requests.get(url, timeout=10)  # avoid hanging indefinitely on a stalled server
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        print(f"Error fetching URL: {url}")
        print(e)
        return None

def parse_html(html):
    if html is None:
        return []
    
    doc = pq(html)
    posts = []

    doc('.post').each(lambda i, e: 
        posts.append({
            'title': pq(e)('h2').text(),
            'text': pq(e)('p').text(),
            'link': pq(e)('a.read-more').attr('href')
        }) if pq(e)('h2') and pq(e)('p') and pq(e)('a.read-more') else None
    )
    
    return [post for post in posts if post is not None]

def main():
    url = "http://example.com"
    html = fetch_data(url)
    posts = parse_html(html)
    print(posts)

if __name__ == "__main__":
    main()

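In these two examples we made the following improvements:

Added a fetch_data function for the network request, which handles connection errors and HTTP error responses.
Added handling for missing elements in parse_html, so that one incomplete post does not crash the whole program.
Used strip() in the BeautifulSoup version to remove leading and trailing whitespace from the extracted text.
In the PyQuery version, .each() iterates over every .post element, and the conditional expression in the callback appends a post only when its h2, p, and a.read-more elements are all present. Because None is never appended, the final list comprehension that filters out None entries is purely defensive.

These improvements make the code more robust: it reports unexpected conditions instead of crashing.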