1. Find the novel site you want: https://www.ddxstxt8.com/5_5034/
Enter https://www.ddyueshu.com/5_5034/ in the browser address bar to open the site.
2. Analyze the site
Press F12 to open the developer tools and look for the content we need; in the Sources panel you can find the chapter links and chapter titles.
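A quick way to confirm what the chapter-list markup looks like is to fetch the index page and print one chapter entry. This is a minimal sketch; the <dd><a href ="..."> structure is an assumption based on what the extraction pattern in step 5 expects.
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
res = requests.get('https://www.ddxstxt8.com/5_5034/', headers=headers)
res.encoding = 'gbk'  # the site serves GBK-encoded pages
# print the first line that contains a chapter entry to see the pattern
for line in res.text.splitlines():
    if '<dd>' in line:
        print(line)
        break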
3. Open the first chapter and look for a pattern
Click the first chapter link to open its page, press F12 to view the page source, and you will find the chapter text in the Sources panel.
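To verify where the chapter text actually lives, fetch the first chapter page and check for the content container. A minimal sketch; the chapter URL is the one used in step 4 and the <div id="content"> marker is an assumption confirmed by the extraction regex there.
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
req = requests.get('https://www.ddxstxt8.com/4_5034/16258926.html', headers=headers)
req.encoding = 'GBK'
# True if the chapter text sits inside <div id="content">
print('<div id="content">' in req.text)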
4. Build the crawler, working from simple to more complete
import re
import requests
# spoof the request headers so the site sees a normal browser
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE'
}
url = "https://www.ddxstxt8.com/4_5034/16258926.html"
req = requests.get(url, headers=headers)
req.encoding = 'GBK'  # the page is GBK-encoded
# simplify the HTML: strip line-break markup, indentation entities and site boilerplate
b = re.sub(r'\r<br />\r<br /> ', '', req.text)
b = re.sub('&nbsp;', '', b)
b = re.sub('<br /><br /><script>chaptererror();</script><br />请记住本书首发域名:ddyueshu.com。顶点小说手机版阅读网址:m.ddyueshu.com</div>', '', b)
# extract the chapter text inside <div id="content">
result = re.findall(r'<div id="content"><br /><br />(.*)', b)
print(result)
# save to a text file
with open('../data/wushenzhuzai.txt', 'w', encoding='utf-8') as f:
    f.write(str(result))
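Since re.findall returns a list, f.write(str(result)) stores the Python list representation rather than plain text. An optional refinement, sketched here under the assumption that the first match holds the whole chapter, is to write the matched text itself with the remaining <br /> tags turned back into line breaks:
# optional: write readable text instead of the list repr
if result:
    text = result[0].replace('<br />', '\n')  # restore line breaks
    with open('../data/wushenzhuzai.txt', 'w', encoding='utf-8') as f:
        f.write(text)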
5. Scrape the table of contents and the chapter links
import requests
import re
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE'
}
url = "https://www.ddxstxt8.com/5_5034/"
res = requests.get(url, headers=headers)
res.encoding = 'gbk'  # the index page is GBK-encoded too
print(res.text)
# each chapter entry looks like <dd><a href ="link">title</a></dd>
pattern = '<dd><a href ="(.+)">(.+)</a></dd>'
list = re.findall(pattern, res.text)
print(list)
Looking at the list, there are 6 extra entries at the front, so slice them off:
list1 = list[6:]
list1
Then store the links and chapter titles in the lists hrefs and z_names and save them to a CSV file, as sketched below.
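A minimal sketch of this step, assuming list1 holds (link, title) tuples and that the crawler below reads ../data/wushenzhuzai_df.csv with the columns hrefs and z_names:
import pandas as pd

# split the (link, title) tuples into two lists
hrefs = [item[0] for item in list1]
z_names = [item[1] for item in list1]
# write both columns to CSV so the wrapped crawler can read them back
df = pd.DataFrame({'hrefs': hrefs, 'z_names': z_names})
df.to_csv('../data/wushenzhuzai_df.csv')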
Wrap the crawler into functions
import requests
import re
# wrap the single-chapter crawler in a function
def spyders(url_a):
    url = 'https://www.ddxstxt8.com/4_5034/' + url_a
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE'
    }
    res = requests.get(url, headers=headers)
    res.encoding = 'GBK'
    # same cleanup as before: strip line-break markup, indentation entities and boilerplate
    b = re.sub(r'\r<br />\r<br /> ', '', res.text)
    b = re.sub('&nbsp;', '', b)
    b = re.sub('<br /><br /><script>chaptererror();</script><br />请记住本书首发域名:ddyueshu.com。顶点小说手机版阅读网址:m.ddyueshu.com</div>', '', b)
    result = re.findall(r'<div id="content"><br /><br />(.*)', b)
    return result
# wrap the file-saving step in a function
def savef(a, name_a):
    with open('../data/' + str(name_a) + '.txt', 'w', encoding='utf-8') as f:
        f.write(str(a))
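A quick sanity check before running the full loop; a sketch, where the href '16258926.html' is the first-chapter page used earlier and the output name 'test' is arbitrary.
# fetch one chapter and save it to confirm the two functions work together
a = spyders('16258926.html')
savef(a, 'test')
print(a[:1])  # peek at the first matched block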
import pandas as pd
df = pd.read_csv('../data/wushenzhuzai_df.csv')
df.drop(columns='Unnamed: 0', inplace=True)  # drop the index column written by to_csv
def main():
    for i in range(len(df)):
        a = spyders(df['hrefs'][i])
        savef(a, df['z_names'][i])
        print('Downloaded chapter', i + 1)
if __name__ == '__main__':
    main()
Scraping complete.