3. Writing Web Crawlers

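We often need a scraper to traverse multiple pages, or even multiple websites. Such programs are called web crawlers because they crawl across the Web. At their core they are recursive: the crawler fetches a page, looks for URLs on it, fetches those pages, and repeats.
When using a web crawler, you must think carefully about how much bandwidth it consumes, and look for every way to reduce the load on the server you are scraping.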
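The recursive idea and the concern about server load can be sketched together without touching the network. In the sketch below, FAKE_SITE, crawl, delay, and max_depth are hypothetical names introduced only for illustration; a real crawler would download and parse each page instead of reading a dictionary.

```python
import time

# Hypothetical in-memory "site": page path -> links found on that page.
FAKE_SITE = {
    '/a': ['/b', '/c'],
    '/b': ['/a', '/d'],
    '/c': [],
    '/d': ['/c'],
}

def crawl(page, visited=None, delay=0.0, max_depth=10):
    """Recursively visit every page reachable from `page`, each one once."""
    if visited is None:
        visited = []
    # Stop on pages already seen (avoids loops) or when too deep.
    if page in visited or max_depth < 0:
        return visited
    visited.append(page)
    time.sleep(delay)  # pause between requests to limit server load
    for link in FAKE_SITE.get(page, []):
        crawl(link, visited, delay, max_depth - 1)
    return visited

order = crawl('/a')
print(order)  # ['/a', '/b', '/d', '/c']
```

The visited list keeps the recursion from looping between pages that link to each other, and a non-zero delay (for example delay=1.0) spaces the requests out.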
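1. Traversing a Single Domain
Six degrees of separation: the idea that two seemingly unrelated things can be connected through a chain of intermediate links.
Get every link on a page that points to another article: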

# -*- coding: GBK -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('http://en.wikipedia.org/wiki/kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')

# Article links sit inside the div with id "bodyContent", start with
# /wiki/, and contain no colon.
article_link = re.compile('^(/wiki/)((?!:).)*$')

body = bs.find('div', {'id': 'bodyContent'})
for link in body.find_all('a', href=article_link):
    if 'href' in link.attrs:
        print(link.attrs['href'])
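To see what the regular expression keeps and what it drops, it can be run against a few hrefs by hand. This is a minimal sketch: the hrefs in sample_hrefs are made-up examples of the kinds of links a Wikipedia page contains, not output from a real page, and ARTICLE_LINK is a name chosen here.

```python
import re

# Same pattern as in the listing above: must start with /wiki/ and
# contain no colon anywhere after it.
ARTICLE_LINK = re.compile('^(/wiki/)((?!:).)*$')

# Illustrative hrefs (assumed, not scraped).
sample_hrefs = [
    '/wiki/Kevin_Bacon',
    '/wiki/Category:American_actors',
    '/wiki/Talk:Kevin_Bacon',
    '//wikimediafoundation.org/',
    '#cite_note-1',
    '/wiki/Footloose_(1984_film)',
]

article_hrefs = [h for h in sample_hrefs if ARTICLE_LINK.match(h)]
print(article_hrefs)
# ['/wiki/Kevin_Bacon', '/wiki/Footloose_(1984_film)']
```

The negative lookahead (?!:) is what rejects namespaced pages such as Category: and Talk:, while the ^(/wiki/) prefix rejects external links and in-page anchors.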

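Starting from one URL, fetch the list of links on that page, pick a URL at random from the list, fetch that page's list of links in turn, and keep going until the program is stopped or a page has no links: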

# -*- coding: GBK -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

# Seed with the current time as a float; since Python 3.11,
# random.seed() raises TypeError when given a datetime object.
random.seed(datetime.datetime.now().timestamp())

def getLinks(articleUrl):
    # Return all article links found in the body of the given page.
    html = urlopen('http://en.wikipedia.org{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find('div', {'id': 'bodyContent'}).find_all(
        'a', href=re.compile('^(/wiki/)((?!:).)*$'))

links = getLinks('/wiki/Kevin_Bacon')
while len(links) > 0:
    # Follow one randomly chosen link, then repeat from the new page.
    newArticle = links[random.randint(0, len(links)-1)].attrs['href']
    print(newArticle)
    links = getLinks(newArticle)

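A pseudorandom number generator has to be given a seed value when it is initialized. The code above seeds it with the current system time, so each run follows a different path through the articles.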
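The effect of the seed is easiest to see with a fixed value. This is a small sketch rather than part of the crawler: sample_walk and the seed 42 are hypothetical, chosen only for illustration, and a separate random.Random instance is used so the global generator is left untouched.

```python
import random

def sample_walk(seed, steps=5):
    """Return `steps` pseudorandom indexes produced from the given seed."""
    rng = random.Random(seed)  # independent generator with its own seed
    return [rng.randint(0, 99) for _ in range(steps)]

first = sample_walk(42)
second = sample_walk(42)

# The same seed always reproduces the same sequence.
print(first == second)  # True
```

A fixed seed is handy when debugging a crawler, because the same sequence of random choices can be replayed; a time-based seed gives a new sequence on every run.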
