This article walks through extracting danmaku (bullet comments) from Bilibili and generating a word cloud of viewer comments on Reply 1988. The complete code is in the companion project 请回答1988弹幕词云.
Extracting Bilibili danmaku
First, find the episode of 木鱼水心's Reply 1988 recap series with the most danmaku. Its URL is:
https://www.bilibili.com/video/BV1g7411d7v7?from=search&seid=4258323448425596581
Then call Bilibili's pagelist API to fetch the video part information:
https://api.bilibili.com/x/player/pagelist?bvid=BV1g7411d7v7&jsonp=jsonp
The JSON response looks like this:
{
    "code": 0,
    "message": "0",
    "ttl": 1,
    "data": [
        {
            "cid": 166505309,
            "page": 1,
            "from": "vupload",
            "part": "请回答1988 P6修正",
            "duration": 4366,
            "vid": "",
            "weblink": "",
            "dimension": {
                "width": 1280,
                "height": 720,
                "rotate": 0
            }
        }
    ]
}
Take the cid value from the response above and pass it as the oid parameter of the danmaku API:
https://api.bilibili.com/x/v1/dm/list.so?oid=166505309
This call returns the danmaku as XML:
<i>
<chatserver>chat.bilibili.com</chatserver>
<chatid>166505309</chatid>
<mission>0</mission>
<maxlimit>8000</maxlimit>
<state>0</state>
<real_name>0</real_name>
<source>e-r</source>
<d p="3476.34000,1,25,16777215,1584938994,0,3db21a6b,30302178995863555">吸吸吸吸吸</d>
<d p="1885.60400,1,25,16777215,1584939023,0,571b43d9,30302194189729799">这算什么交往哈哈哈哈</d>
<d p="1919.24700,1,25,16777215,1584939057,0,571b43d9,30302211817865221">关二哥</d>
<d p="1936.11400,1,25,16777215,1584939084,0,571b43d9,30302226084265987">化妆了</d>
<d p="3444.53200,1,25,16777215,1584939135,0,c2a058d,30302253026902021">请务必这样做</d>
<d p="4342.22900,1,25,16777215,1584939534,0,74eee2ec,30302461728653317">三连催更</d>
</i>
At this point we have the raw danmaku data.
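As an alternative to regex extraction, the danmaku XML can be parsed with the standard library's xml.etree.ElementTree. A minimal sketch, reusing two of the <d> elements shown above, and assuming the first comma-separated field of the p attribute is the time (in seconds) at which the danmaku appears:

```python
import xml.etree.ElementTree as ET

# Two sample <d> elements taken from the XML response above
xml_text = '''<i>
<d p="3476.34000,1,25,16777215,1584938994,0,3db21a6b,30302178995863555">吸吸吸吸吸</d>
<d p="1885.60400,1,25,16777215,1584939023,0,571b43d9,30302194189729799">这算什么交往哈哈哈哈</d>
</i>'''

root = ET.fromstring(xml_text)
# Each <d> element holds one danmaku; its text is the comment itself
danmu = [(float(d.get('p').split(',')[0]), d.text) for d in root.iter('d')]
for appear_time, text in danmu:
    print(f'{appear_time:.1f}s  {text}')
```

This also gives structured access to the other p-attribute fields (color, send timestamp, etc.) if you ever need more than the comment text.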
Parsing the danmaku data
The following Python script parses the danmaku and saves it to a local txt file.
First, install the required packages, such as requests and chardet. I'm using PyCharm, so they can be installed directly from the project interpreter page in Settings.
Tip: add a mirror before installing; hitting the overseas index directly often times out. I added the Tsinghua mirror: https://pypi.tuna.tsinghua.edu.cn/simple
The implementation:
# Parse Bilibili danmaku and save them to a text file
import json
import re

import chardet
import requests


# 1. Look up the cid for the given bvid
def get_cid():
    url = 'https://api.bilibili.com/x/player/pagelist?bvid=BV1g7411d7v7&jsonp=jsonp'
    # Fetch and parse the JSON response
    res_text = requests.get(url).text
    json_obj = json.loads(res_text)
    return json_obj['data'][0]['cid']


# 2. Call the danmaku API with the cid and save the comments to a txt file
def save_danm(cid):
    url = 'https://api.bilibili.com/x/v1/dm/list.so?oid=' + str(cid)
    res = requests.get(url)
    # Detect the encoding from the raw bytes to avoid mojibake
    res.encoding = chardet.detect(res.content)['encoding']
    res_text = res.text
    # *? switches * from greedy to lazy matching, i.e. shortest match wins
    pattern = re.compile('<d.*?>(.*?)</d>')
    dan_mu_list = pattern.findall(res_text)
    with open('../resources/dan_mu.txt', mode='w', encoding='utf-8') as f:
        for d in dan_mu_list:
            # Write one danmaku per line
            f.write(d + '\n')
    print('dan_mu.txt done')


# Fetch the cid
cid = get_cid()
# Download and save the danmaku
save_danm(cid)
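To see why the lazy *? matters here: with a greedy .*, the match runs across tag boundaries and swallows several danmaku at once, while the lazy form stops at the first closing tag. A minimal sketch on made-up input:

```python
import re

xml = '<d p="1">aa</d><d p="2">bb</d>'

# Greedy: the single match spans both tags, so only one capture comes back
greedy = re.findall('<d.*>(.*)</d>', xml)
# Lazy: each <d>…</d> pair is matched separately
lazy = re.findall('<d.*?>(.*?)</d>', xml)

print(greedy)  # one match spanning both tags
print(lazy)    # one capture per danmaku: ['aa', 'bb']
```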
Generating the word cloud
Note: before installing the wordcloud package (on Windows), VC_redist.x64.exe must be installed first. The stop-word list used here is the Harbin Institute of Technology (HIT) version; see the references at the end.
The word cloud code:
# Turn the danmaku into a word cloud image
import jieba
import matplotlib.pyplot as plt
import pandas as pd
from imageio import imread
from wordcloud import WordCloud

# Read the danmaku file and tokenize it
with open('resources/dan_mu.txt', encoding='utf-8') as f:
    txt = f.read()
# Split the file content into a list, one danmaku per line
txt_list = txt.split()
# Tokenize each danmaku with jieba
txt_cut = [jieba.lcut(x) for x in txt_list]
# Load the stop-word list
with open('resources/stopwords.txt', encoding='utf-8') as f:
    stop_words = f.read()
stop_words = stop_words.split()
# Wrap the tokenized danmaku in a Series
s_txt_cut = pd.Series(txt_cut)
# Drop stop words from each token list
txt_last = s_txt_cut.apply(lambda x: [i for i in x if i not in stop_words])
# Count word frequencies
txt_stat = []
for i in txt_last:
    txt_stat.extend(i)
txt_count = pd.Series(txt_stat).value_counts()
# Draw the word cloud
back_img = imread(r"love.jpg")
wc_obj = WordCloud(font_path='resources/msyh.ttc', background_color='white', max_words=1800, mask=back_img,
                   max_font_size=200, random_state=42)
wc_fit = wc_obj.fit_words(txt_count)
plt.figure(figsize=(16, 9))
plt.imshow(wc_fit)
plt.axis('off')
plt.show()
wc_obj.to_file('word_cloud_1988.png')
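The flatten-and-count step above (building txt_stat, then calling value_counts) can also be done with the standard library's collections.Counter. A minimal sketch on made-up tokens:

```python
from collections import Counter

# Hypothetical tokenized danmaku after stop-word removal
txt_last = [['双门洞', '爸爸'], ['爸爸', '哈哈哈']]

# Flatten the nested lists and count each token in one pass
txt_count = Counter(token for line in txt_last for token in line)
print(txt_count.most_common(2))  # highest-frequency tokens first
```

Since fit_words accepts any dict-like mapping of word to frequency, a Counter can be passed to it in place of the pandas Series.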
Miscellaneous
Upgrading pip
Run PowerShell as administrator and execute the following commands:
python -m ensurepip
python -m pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pip -U
PyCharm
Main settings after installing PyCharm:
- Install the Save Actions plugin to auto-format code on save
- Adjust the editor font size
- Remap the delete-line shortcut from Ctrl+Y to Ctrl+D
References
- https://github.com/goto456/stopwords
- https://gitee.com/MarineJ/stopwords