Notes on scraping 电影天堂 (dytt8.net), part 1

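Key point 1: using the map function

Description

map() applies the given function to every element of the specified sequence.

The first argument, function, is called once for each element of the sequence. In Python 2 the return values are collected into a new list; in Python 3 they are produced on demand by an iterator.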

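Syntax

The syntax of map() is:

map(function, iterable, ...)

Parameters

function -- the function to apply
iterable -- one or more sequences; with several, function receives one element from each per call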

Return value

Python 2.x returns a list.

Python 3.x returns an iterator; wrap the call in list() to get a list.
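
A small sketch of what "returns an iterator" means in practice under Python 3: the map object is not a list, and it can only be consumed once. The variable names are made up for illustration.

```python
squares = map(lambda x: x ** 2, [1, 2, 3])

# A map object is an iterator, not a list.
is_list = isinstance(squares, list)

first_pass = list(squares)    # consumes the iterator
second_pass = list(squares)   # nothing is left to yield

print(is_list)       # False
print(first_pass)    # [1, 4, 9]
print(second_pass)   # []
```

If the values are needed more than once, convert to a list once and reuse that list.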

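Examples

The following examples show how to use map() (Python 3, so each call is wrapped in list()):

>>> def square(x):  # compute the square of x
...     return x ** 2
...
>>> list(map(square, [1, 2, 3, 4, 5]))  # square each element of the list
[1, 4, 9, 16, 25]
>>> list(map(lambda x: x ** 2, [1, 2, 3, 4, 5]))  # the same with a lambda
[1, 4, 9, 16, 25]
>>> # Two lists: add the elements at matching positions
>>> list(map(lambda x, y: x + y, [1, 3, 5, 7, 9], [2, 4, 6, 8, 10]))
[3, 7, 11, 15, 19]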

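Case study:

BASE_DOMAIN = 'http://www.dytt8.net'

def abc(url):
    # Prefix a relative link with the site's domain
    return BASE_DOMAIN + url

# detail_urls holds the relative links extracted from the list page
for index, detail_url in enumerate(detail_urls):
    detail_urls[index] = abc(detail_url)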

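Pitfall: printing the bare result of map() shows <map object at 0x000001625DBE1BB0> rather than the URLs.

The loop above is equivalent to list(map(lambda url: BASE_DOMAIN + url, detail_urls)). In Python 3, map() returns an iterator instead of the list that Python 2 returned, so convert it with list().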
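
The equivalence can be checked offline with a short sketch. The two relative links below are hypothetical stand-ins for the XPath result. The urljoin variant is an alternative from the standard library, not something the original notes use.

```python
from urllib.parse import urljoin

BASE_DOMAIN = 'http://www.dytt8.net'

# Hypothetical relative links, standing in for the XPath result.
detail_urls = ['/html/gndy/dyzz/a.html', '/html/gndy/dyzz/b.html']

# 1. Explicit loop that builds a new list.
by_loop = []
for url in detail_urls:
    by_loop.append(BASE_DOMAIN + url)

# 2. map() wrapped in list(), as in the notes.
by_map = list(map(lambda url: BASE_DOMAIN + url, detail_urls))

# 3. List comprehension, often considered the more readable form.
by_comprehension = [BASE_DOMAIN + url for url in detail_urls]

# 4. urljoin resolves the relative link against the base URL.
by_urljoin = [urljoin(BASE_DOMAIN, url) for url in detail_urls]

print(by_map)
print(by_loop == by_map == by_comprehension == by_urljoin)
```

All four produce the same list for links that start with a slash. urljoin also copes with hrefs that are already absolute, which plain string concatenation does not.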

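Key point 2: etree.tostring()

tostring() outputs the HTML after it has been repaired; the text is parsed first with etree.HTML(). The result is bytes, so call decode() on it to get a str.

etree.HTML(): parses the HTML text, repairing it automatically, and returns an element that can be queried with XPath.

etree.tostring(): serializes the repaired result; the return type is bytes.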
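
A minimal sketch of the round trip, assuming lxml is installed. The broken_html string is made up for illustration: its paragraphs are never closed and it has no html or body wrapper.

```python
from lxml import etree

# Deliberately broken markup: unclosed <p> tags, no <html>/<body>.
broken_html = '<div><p>电影<p>天堂</div>'

root = etree.HTML(broken_html)                 # parse and repair
raw = etree.tostring(root, encoding='utf-8')   # serialized result, as bytes
text = raw.decode('utf-8')                     # bytes -> str

print(type(raw))
print(type(text))
print(text)
```

Passing encoding='utf-8' keeps the Chinese characters as real characters in the output; with the default encoding, tostring() writes non-ASCII characters as numeric character references instead.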

The full source code follows.

import requests
from lxml import etree

BASE_DOMAIN = 'http://www.dytt8.net'

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Referer': 'http://www.dytt8.net/html/gndy/dyzz/list_23_2.html'
}
def get_index(url):
    response=requests.get(url,headers=HEADERS)
    # The page is gbk-encoded. Decoding as gbk can still hit bad bytes; pass 'ignore' to skip them.
    # html = response.content.decode('gbk','ignore')  # content is the raw bytes
    html = response.text  # str: the decoded page source
    tree=etree.HTML(html)
    detail_urls = tree.xpath('//table[@class="tbspan"]//a/@href')
    # In Python 3, map() returns an iterator object, hence the list() call
    detail_url=list(map(lambda url:BASE_DOMAIN+url,detail_urls))
    return detail_url

def get_url_detail(url):
    movie={}
    response=requests.get(url,headers=HEADERS)
    # html = response.text  # comes out garbled: <font color="#07519a">2019Äê¸ß·Ö»ñ½±¾çÇé¡¶±»Í¿ÎÛµÄÄñ¡·BDÖÐ×Ö</font>
    html = response.content.decode('gbk', 'ignore')
    tree = etree.HTML(html)
    titles = tree.xpath('//h1/font[@color="#07519a"]/text()')
    download=tree.xpath('//td[@bgcolor="#fdfddf"]/a/text()')
    movie['download']=download
    for title in titles:
        movie['title']=title  # store the title in the dict
        # print(etree.tostring(x,encoding='utf-8').decode('utf-8'))  # tostring() on an element x outputs the repaired markup
        # <font color="#07519a">2019年高分获奖剧情《被涂污的鸟》BD中字</font>

    def parse_strip(info,content):
        return info.replace(content, '').strip()

    infos = tree.xpath('//div[@id="Zoom"]//text()')
    for index,info in enumerate(infos):
        if info.startswith("◎年　　代"):
            year = parse_strip(info,"◎年　　代")
            movie['year']=year

        if info.startswith("◎产　　地"):
            product = parse_strip(info,"◎产　　地")
            movie['product']=product

        if info.startswith("◎类　　别"):
            type = parse_strip(info,"◎类　　别")
            movie['type']=type

        if info.startswith("◎语　　言"):
            language = parse_strip(info,"◎语　　言")
            movie['language']=language

        if info.startswith("◎上映日期"):
            date = parse_strip(info,"◎上映日期")
            movie['date']=date

        if info.startswith("◎豆瓣评分"):
            douban_rating = parse_strip(info,"◎豆瓣评分")
            movie['douban_rating']=douban_rating

        if info.startswith("◎主　　演"):
            # Walk forward from the current position until the next "◎" label
            actors=[parse_strip(info,"◎主　　演")]
            for x in range(index+1,len(infos)):
                actor=infos[x]
                if actor.startswith("◎"):
                    break
                actors.append(actor.strip())
            actors=(','.join(actors))
            movie['actors']=actors

        if info.startswith("◎简　　介"):
            profiles=[]
            for x in range(index+1,len(infos)):
                profile=infos[x]
                if profile.startswith('◎获奖情况'):
                    break
                profiles.append(profile.strip())
            movie['profiles']=profiles
    return movie

def main():
    movies=[]
    base_url = 'http://www.dytt8.net/html/gndy/dyzz/list_23_{}.html'
    for i in range(1,8):
        url=base_url.format(i)
        detail_urls=get_index(url)
        # the returned list can be iterated over directly
        for detail_url in detail_urls:
            movie=get_url_detail(detail_url)
            movies.append(movie)
            print(movie)

if __name__ == '__main__':
    main()
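
The decoding step and the "◎" field parsing can be tried offline, without requesting the site. The sketch below is a simplified stand-in for get_url_detail(): parse_fields, the label constants and the sample lines are all made up for illustration, and latin-1 is chosen only as an example of a wrong codec.

```python
YEAR_LABEL = '◎年\u3000\u3000代'     # labels use full-width (U+3000) spaces
ACTOR_LABEL = '◎主\u3000\u3000演'


def parse_fields(lines):
    """Collect a few '◎' fields from a list of text nodes."""
    movie = {}
    for index, line in enumerate(lines):
        if line.startswith(YEAR_LABEL):
            movie['year'] = line.replace(YEAR_LABEL, '').strip()
        elif line.startswith(ACTOR_LABEL):
            # The first actor shares the label line; the others follow on
            # their own lines until the next '◎' label appears.
            actors = [line.replace(ACTOR_LABEL, '').strip()]
            for following in lines[index + 1:]:
                if following.startswith('◎'):
                    break
                actors.append(following.strip())
            movie['actors'] = ','.join(actors)
    return movie


# Made-up sample text nodes, joined and encoded as gbk like the real page.
sample_lines = [
    YEAR_LABEL + '\u30002019',
    ACTOR_LABEL + '\u3000演员甲',
    '\u3000\u3000\u3000演员乙',
    '◎简\u3000\u3000介',
]
sample_page = '\n'.join(sample_lines)
raw_bytes = sample_page.encode('gbk')

mojibake = raw_bytes.decode('latin-1')        # wrong codec: unreadable text
decoded = raw_bytes.decode('gbk', 'ignore')   # right codec: original text

movie = parse_fields(decoded.split('\n'))
print(movie)
```

str.strip() also removes the full-width spaces that pad the values, which is why stripping the label and calling strip() is enough to isolate each field.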
