A first attempt at a simple crawler with BeautifulSoup

0x01 Installing dependencies and reference material
References
https://blog.csdn.net/qq_21933615/article/details/81171951
https://blog.csdn.net/m0_37623485/article/details/88324296

pip install beautifulsoup4
pip install lxml  # parser

# Modules to import before use
import requests
from bs4 import BeautifulSoup

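0x02 BeautifulSoup objects
Beautiful Soup turns a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
1. A Tag corresponds to a tag in the HTML document.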

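print(soup.title)
print(soup.head)
print(soup.a)  # only the first matching tag is returned
print(soup.a.name)
print(soup.attrs)  # attributes of the BeautifulSoup object itself, normally an empty dict
print(soup.p.attrs)  # all attributes of the first <p> tag, returned as a dict
print(soup.p['class'])  # fetch a single attribute
print(soup.p.get('class'))  # same as above; .get() returns None if the attribute is missing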
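
The attribute accessors above can be tried on a small document. This is a minimal sketch: the markup in `sample` is made up for illustration, and `html.parser` from the standard library is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this sketch.
sample = '<p class="title" id="intro"><b>Hello</b></p><p class="story">World</p>'
soup = BeautifulSoup(sample, "html.parser")

first_p = soup.p                 # attribute access returns only the first <p>
tag_name = first_p.name          # the tag's name as a string
all_attrs = first_p.attrs        # every attribute of the tag, as a dict
css_classes = first_p["class"]   # class is multi-valued, so this is a list
missing = first_p.get("href")    # .get() returns None instead of raising KeyError

print(tag_name, all_attrs, css_classes, missing)
```

Note that `class` comes back as a list because an element may carry several classes, while `id` is a plain string.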

2. NavigableString
Once you have a tag, use .string to get the text inside it, for example:

print(soup.p.string)
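
One caveat with `.string` is that it only works when a tag has a single child. The sketch below uses made-up markup to show that case and `get_text()` as an alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
soup = BeautifulSoup("<p><b>bold</b></p><div>one <i>two</i></div>", "html.parser")

single = soup.p.string           # <p> has exactly one child, so its string is found
mixed = soup.div.string          # <div> has two children, so .string is None
full_text = soup.div.get_text()  # get_text() joins all nested strings

print(repr(single), repr(mixed), repr(full_text))
```

When a tag may contain mixed content, `get_text()` is usually the safer choice.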

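3. BeautifulSoup
A BeautifulSoup object represents the whole document. Most of the time it can be treated as a special kind of Tag. When using bs4 you normally start by creating a BeautifulSoup instance:

soup = BeautifulSoup('an HTML document', 'lxml')  # the second argument is the parser; other parsers work too. If it is omitted, bs4 picks the best installed parser and emits a warning
# This parses the whole document and stores the resulting tree in soup

# Example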
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')  # pass the variable html, not the literal string 'html'

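4. Comment
A Comment object is a special type of NavigableString. When printed, its content does not include the comment markers.
For example:

soup = BeautifulSoup('<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>', 'lxml')
print(soup.a.string)
# result: Elsie (the spaces inside the comment are preserved)

The content of the a tag is actually a comment, but .string returns it with the comment markers stripped, which can cause trouble if it is mistaken for visible text.
Printing its type shows that it is a Comment, so it is best to check before using it:

import bs4
if type(soup.a.string) == bs4.element.Comment:
    print(soup.a.string)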
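
A common variant of this check uses `isinstance` to skip comments while collecting visible link text. This is a sketch on made-up markup, not the only way to do it.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup: the first link holds only a comment, the second holds text.
markup = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(markup, "html.parser")

visible = []
for a in soup.find_all("a"):
    s = a.string
    # Keep only real text; Comment is a subclass of NavigableString.
    if s is not None and not isinstance(s, Comment):
        visible.append(str(s))

is_comment = isinstance(soup.a.string, Comment)
print(visible, is_comment)
```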

0x03 Searching the document tree

find_all( name , attrs , recursive , text , **kwargs )

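find_all() searches all tag descendants of the current tag, tests each one against the filter, and returns the matches as a list.
The name parameter
Finds all tags whose name is name; string objects are ignored. If a regular expression is passed, Beautiful Soup matches tag names with the pattern's search() method. If a list is passed, Beautiful Soup returns anything that matches any element of the list.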

soup.find_all('b')
# [<b>The Dormouse's story</b>]

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
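
The list form of the name filter has no example above, so here is a small sketch on made-up markup. It also shows that a regular expression is applied with search(), so it can match in the middle of a tag name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
soup = BeautifulSoup('<body><b>bold</b><a href="#">link</a><i>it</i></body>', "html.parser")

# A list matches any of the given tag names, in document order.
list_names = [t.name for t in soup.find_all(["a", "b"])]

# "d" only occurs inside "body", so the pattern does not need to match at the start.
regex_names = [t.name for t in soup.find_all(re.compile("d"))]

print(list_names, regex_names)
```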

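Keyword arguments
If an argument's name is not one of the built-in search parameters, it is treated as a filter on the tag attribute with that name. For example, passing an argument named xx makes Beautiful Soup filter on each tag's xx attribute.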


soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
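
Two attribute names cannot be written directly as keyword arguments: `class` is a Python reserved word, and names such as `data-id` are not valid identifiers. The sketch below, on made-up markup, uses `class_` and the `attrs` dict for those cases.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
markup = '<a class="sister" data-id="1">A</a><a class="brother" data-id="2">B</a>'
soup = BeautifulSoup(markup, "html.parser")

# class is reserved in Python, so bs4 accepts class_ instead.
by_class = [a["data-id"] for a in soup.find_all("a", class_="sister")]

# Attribute names with a hyphen go through the attrs dict.
by_attr = [a.get_text() for a in soup.find_all(attrs={"data-id": "2"})]

print(by_class, by_attr)
```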

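The text parameter
The text parameter searches the string content of the document. Like name, it accepts a string, a regular expression, or a list. (Since Beautiful Soup 4.4.0 the preferred name is string; text still works.)

# The outputs below assume Elsie is plain text inside the first <a>, as in the official docs sample, not a comment
soup.find_all(text="Elsie")
# ['Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']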

The limit parameter
find_all() returns every match, which can be slow on a large document tree. If you do not need all the results, use limit to cap the number returned.
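
A minimal sketch of limit on made-up markup. The search stops as soon as the requested number of matches is found, and find() behaves like find_all() with limit=1 except that it returns the tag itself rather than a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
soup = BeautifulSoup("<ul><li>1</li><li>2</li><li>3</li></ul>", "html.parser")

first_two = [li.get_text() for li in soup.find_all("li", limit=2)]
first = soup.find("li").get_text()  # a single Tag, not a list

print(first_two, first)
```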

CSS selectors
Pass a selector string to .select() to search:

# Find by tag name
print(soup.select('a'))
'''
result:
[<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>]
'''

# Find by id
print(soup.select('#nav_logo'))
'''
result:
[<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>]
'''

# Find by class
print(soup.select('.qy-logo'))
'''
result:
[<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>]
'''

# Find by attribute value
print(soup.select('div[style="display:none;"]'))
'''
result:
[<div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>]
'''
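
The selectors above can be combined, and the attribute selector compares the attribute value as an exact string. The sketch below uses a reduced, made-up version of the markup shown in the results to illustrate both points.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the results above.
markup = (
    '<div id="nav_logo" class="qy-logo" style="display: none;">'
    '<a class="logo-channel" href="/zongyi/">channel</a></div>'
    '<div class="qy-player-head-list" style="display:none;"></div>'
)
soup = BeautifulSoup(markup, "html.parser")

# Descendant combinator: an <a class="logo-channel"> somewhere inside #nav_logo.
links = [a["href"] for a in soup.select("#nav_logo a.logo-channel")]

# select_one() returns the first match, or None when nothing matches.
first_div_id = soup.select_one("div.qy-logo")["id"]

# The value must match exactly: "display: none;" (with a space) is not selected.
hidden = [d["class"] for d in soup.select('div[style="display:none;"]')]

print(links, first_div_id, hidden)
```

This is why the attribute search above returns only the qy-player-head-list div and not the nav_logo div, whose style value contains a space.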

0x04 Getting an HTML document from a URL
This uses the requests module, for example:

import requests
from bs4 import BeautifulSoup
url = 'http://www.baidu.com'
r=requests.get(url)
html=r.content
html_doc=str(html,'utf-8')
soup = BeautifulSoup(html_doc,'lxml')
tag=soup.p
print(soup.prettify())

Spoofing a browser User-Agent
Add a headers dict when sending the request with requests:

url = 'xxxxxxxxxxxxxxxxx'
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:85.0) Gecko/20100101 Firefox/85.0', 'Cookie': 'xxx'}  # 'xxx' is a placeholder for your cookie string
r = requests.get(url=url, headers=header)
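
To check which headers would be sent without touching the network, one option is to build a prepared request and inspect it. This is only a sketch: the URL is a hypothetical placeholder, and nothing is actually sent.

```python
import requests

# Hypothetical target URL, used only to build the request object.
url = "http://example.com/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:85.0) Gecko/20100101 Firefox/85.0",
}

# Request(...).prepare() builds the outgoing request without sending it.
prepared = requests.Request("GET", url, headers=headers).prepare()
sent_user_agent = prepared.headers["User-Agent"]

print(prepared.method, prepared.url, sent_user_agent)
```

The same headers dict can then be passed to requests.get() as in the snippet above.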
