Python script, data mining
This article details a Python script that scrapes the fiction text of any subsection of the fanfiction and fan works site Archive of Our Own. The scraper code and an example dataset (the top 200 Coffee Shop AU fanfictions) are available at the GitHub link.
What is this?
Archive of Our Own (AO3) is “a fan-created, fan-run, nonprofit, noncommercial archive for transformative fanworks, like fanfiction, fanart, fan videos, and podfic” run by the Organization for Transformative Works. It hosts more than 38,730 fandoms, 2,739,000 users, and 6,396,000 works.
It’s a major hub for writing and reading fanfiction, among other things, and has a well-categorized tagging system that lets users search for and filter the kind of fanfiction they want to read. Searching and filtering by some attribute leads to the following search results page:
On this page, you can move between pages of results using the orange box, refine the search with the blue sort and filter options, and see works that match the current filters, like the one highlighted in purple.
A work on Archive of Our Own looks like the following:
The yellow box marks the text within the work: the fiction.
This scraper does the following: given the URL of the first page of the search results (assuming you’ve already narrowed down what you want), the number of results pages you want to scrape, and the name of the output file you want to create, it collects the fiction text of the works within those bounds and writes it to a text file.
This text file can be used for many things, from training your own GPT-2 model to generate similar text, to general text and sentiment analysis. An example output is available at the GitHub link: the top 200 complete Coffee Shop AU fanfictions on AO3.
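As a small illustration of the "general text analysis" use, here is a minimal sketch that counts the most frequent words in a piece of text. The function name top_words and the simple lowercase tokenizer are assumptions made for this example; they are not part of the scraper.

```python
import re
from collections import Counter


def top_words(text, n=5):
    """Return the n most common words in text as (word, count) pairs.

    Hypothetical helper for illustration: words are runs of lowercase
    letters and apostrophes, which is a deliberately crude tokenizer.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


# Usage on the scraper's output (the file name is whatever you chose):
# with open("fanfiction_corpus.txt", encoding="utf-8") as f:
#     print(top_words(f.read(), n=20))
```

For real analysis you would likely filter out common stop words first, since words such as "the" and "and" dominate raw counts.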
How can you use this?
Once you’ve downloaded the .py file and opened it in your environment:
First: Narrow Down Search Results
On Archive of Our Own, use the sort and filter section to select the tags you want until the search results page shows what you’re looking for. Make sure you’re on the first page of results, then copy the resulting URL.
Enter this URL in the page variable at the top of the script.
Second: Specify Parameters
Each results page contains 20 works, and the scraper only handles fiction works successfully (no art, no podcasts, no custom-coded pages). With that in mind, set the NumberOfPages variable to the number of results pages you want to scrape.
Set the output file name in the nameOfFileCreated variable.
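To make the three settings concrete, here is a minimal sketch of how they might look at the top of a script, along with one way to derive the URL of each results page from the first one. The variable names page, NumberOfPages, and nameOfFileCreated come from the article; the example URL, the file name, the helper results_page_url, and the assumption that results are paginated through a page query parameter are illustrative and may differ from the actual script.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Placeholder values for the three settings named in the article.
page = "https://archiveofourown.org/works?tag_id=Example+Tag"  # hypothetical URL
NumberOfPages = 3
nameOfFileCreated = "fanfiction_corpus.txt"


def results_page_url(first_page_url, page_number):
    """Return first_page_url with its `page` query parameter set.

    Assumption: the site selects a results page via `?page=N`.
    Any existing `page` parameter is replaced, others are kept.
    """
    parts = urlsplit(first_page_url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "page"
    ]
    query.append(("page", str(page_number)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# One URL per results page, starting at page 1.
page_urls = [results_page_url(page, n) for n in range(1, NumberOfPages + 1)]
```

With 20 works per results page, NumberOfPages = 3 would cover up to 60 works.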
Third: Run the Script, and Wait
The scraper takes a while: about an hour for 200 fanfictions. When it finishes, it produces a text file with the name you specified.
Have fun with your fanfiction splice!
How does it work?
The scraping process is built on the beautifulsoup package, which stores the HTML of a page in an object and lets you find tags with specific attributes within it using the find() and find_all() functions. This is a good tutorial to learn how it works. The urllib package is used to safely request URLs.
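To show what find() and find_all() look like in practice, here is a minimal sketch that runs them on two small, made-up HTML snippets instead of live pages. The markup, the class and id names, and the helpers work_links and work_text are assumptions for illustration; the real AO3 pages are more elaborate, and the actual script may select different tags.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for a search results page.
SAMPLE_RESULTS_HTML = """
<ol class="work index group">
  <li class="work blurb group"><h4 class="heading"><a href="/works/111">First Story</a></h4></li>
  <li class="work blurb group"><h4 class="heading"><a href="/works/222">Second Story</a></h4></li>
</ol>
"""

# Hypothetical, simplified markup standing in for a single work's page.
SAMPLE_WORK_HTML = """
<div id="chapters"><div class="userstuff">
<p>It was a quiet morning.</p><p>The espresso machine hissed.</p>
</div></div>
"""


def work_links(results_html):
    """Collect the first link inside every result entry."""
    soup = BeautifulSoup(results_html, "html.parser")
    links = []
    # find_all() returns every matching tag; class_ matches one CSS class.
    for blurb in soup.find_all("li", class_="blurb"):
        anchor = blurb.find("a", href=True)  # find() returns the first match
        if anchor is not None:
            links.append(anchor["href"])
    return links


def work_text(work_html):
    """Join the paragraphs found inside the assumed text container."""
    soup = BeautifulSoup(work_html, "html.parser")
    body = soup.find("div", id="chapters")
    if body is None:  # e.g. an art-only or custom-coded page
        return ""
    return "\n".join(p.get_text() for p in body.find_all("p"))
```

Returning an empty string when the container is missing is one simple way to skip works that hold no fiction text.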
A note about a potential error you may encounter:
If you request too many URLs from the server in a short period of time, you might get a timeout error. To fix it, increase the number in the time.sleep() line of code to 5 or longer, which slows down the rate of requests.
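The pause-between-requests idea can be sketched as follows. This is not the script's own code: the helpers fetch_html and polite_fetch, the retry count, and the doubling of the delay after each failure are assumptions added for illustration. The script described here simply calls time.sleep() with a fixed number.

```python
import time
import urllib.request


def fetch_html(url, timeout=30):
    """Download a page and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def polite_fetch(url, delay=5.0, retries=3, fetch=fetch_html, sleep=time.sleep):
    """Wait `delay` seconds before each request; on failure, wait longer.

    Hypothetical helper: `fetch` and `sleep` are parameters so the
    function can be tried out without touching the network.
    """
    for attempt in range(retries + 1):
        sleep(delay)  # slow down the rate of requests
        try:
            return fetch(url)
        except OSError:
            # URLError, HTTPError and socket timeouts are all OSError subclasses.
            if attempt == retries:
                raise
            delay *= 2  # back off before the next attempt
```

Doubling the delay is one common choice; raising the fixed number passed to time.sleep(), as described above, achieves the same goal with less code.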