Analyzing the Danmaku of Bilibili's 2020 New Year's Eve Gala


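My reaction, in just two characters

Yesterday, while browsing Bilibili, I saw that its New Year's Eve gala was rated 9.9. I decided to give it a try and ended up unable to stop myself from watching it twice. As the saying goes, you only regret not having read more when you need the words: the only summary I could come up with was the two characters "卧槽" (roughly, "holy crap").

Bilibili arranged the program brilliantly and hit the nostalgia points of different generations: Hong Kong and Taiwan pop for 1980-1990, Digimon and Detective Conan for the 1990s, World of Warcraft for 2000-2020. Everyone connects to a slightly different era, so this is only my own impression.

Beyond the program itself, the danmaku (弹幕, the viewer comments that scroll across the video) are a big part of the fun. The moments when a highlight arrives are exactly when the danmaku army floods the screen. I took screenshots of the parts I liked best and made them into a short video.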

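This time I collected two kinds of data from the gala:

Danmaku data: how users feel while they are watching
Comment data: how users feel after they have finished watching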

The crawler is covered in a video tutorial that has been uploaded to the Qianliao (千聊) course "Python网络爬虫与文本数据分析(学术)"; follow it if you are interested.

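The goal of this post

We only want to see how the danmaku volume is distributed along the timeline, and to show how the chart of that distribution is made:

The x-axis is the time in minutes
The y-axis is the number of danmaku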

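Data preparation (cleaning)

The gala is split into three chapters, and each chapter's video is roughly 60-70 min long. The dataset we collected has the following fields: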

Date: the collection date, 2020.01.03, so the data covers three days of danmaku
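Chapter: the chapter number; the gala has three chapters of about 60 min each
VideoTime: the playback position within the Chapter (seconds since the start of that chapter)
SenderId: the anonymous ID of the danmaku sender
DanMuContent: the text of the danmaku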
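To make the layout concrete, here is a minimal sketch of what rows with these fields might look like and how they parse. Every value in `sample_csv`, including the date format, is invented for illustration and is not taken from the real dataset.

```python
import csv
import io

# Hypothetical sample rows that follow the field list above; all values are invented
sample_csv = """Date,Chapter,VideoTime,SenderId,DanMuContent
2020-01-01,1,12.5,a1b2c3,干杯
2020-01-02,2,2400.0,d4e5f6,前方高能
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
for row in rows:
    # The csv module yields strings, so numeric fields need explicit conversion
    row['Chapter'] = int(row['Chapter'])
    row['VideoTime'] = float(row['VideoTime'])

print(rows[0])
```

The point of the sketch is only that numeric columns can arrive as text, which is why VideoTime gets converted before any arithmetic.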


import pandas as pd


df = pd.read_csv('data/弹幕new.csv')
# Drop duplicate rows
df.drop_duplicates(inplace=True)
# Check how many records remain
print(len(df))
# Show the first 5 rows
df.head()

VideoTime may well be stored as a string, so we first convert it to a float. Then we check the maximum playback time in each chapter.


def str2float(string):
    # Convert VideoTime from a string to a float; values that cannot be parsed become 0.0
    try:
        return float(string)
    except (TypeError, ValueError):
        return 0.0


df['VideoTime'] = df['VideoTime'].apply(str2float)
print('Chapter 1', df[df['Chapter']==1]['VideoTime'].max())
print('Chapter 2', df[df['Chapter']==2]['VideoTime'].max())
print('Chapter 3', df[df['Chapter']==3]['VideoTime'].max())


Chapter 1 4253.188
Chapter 2 4000.27
Chapter 3 4555.0
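One caveat with the conversion above: any value that fails to parse becomes 0.0, so malformed rows pile up at the very start of a chapter. Below is a minimal sketch of an alternative, assuming it is acceptable to drop such rows instead. `parse_seconds` and the sample values are hypothetical and not part of the original notebook.

```python
import math

def parse_seconds(value):
    """Return value as a float number of seconds, or None if it cannot be parsed."""
    try:
        seconds = float(value)
    except (TypeError, ValueError):
        return None
    # float() accepts 'nan' and 'inf'; treat those as invalid too
    if math.isnan(seconds) or math.isinf(seconds):
        return None
    return seconds

raw_times = ['12.5', '4253.188', 'abc', None, 'nan']  # made-up sample values
parsed = [parse_seconds(v) for v in raw_times]
valid = [s for s in parsed if s is not None]
print(valid)
```

In pandas, pd.to_numeric(df['VideoTime'], errors='coerce') followed by dropping the resulting NaN rows should give a similar effect in one vectorized step.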

We want the distribution of danmaku along a single timeline, so the three chapters have to be placed one after another in time.


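# .copy() makes independent frames, so the assignments below do not raise SettingWithCopyWarning
chapter1 = df[df['Chapter']==1].copy()
chapter2 = df[df['Chapter']==2].copy()
chapter3 = df[df['Chapter']==3].copy()


# Put all times on one timeline by adding the lengths of the preceding chapters
chapter2['VideoTime'] = chapter2['VideoTime']+ 4253.188
chapter3['VideoTime'] = chapter3['VideoTime']+ 4253.188 + 4000.27


# Merge chapter1, chapter2, chapter3
chapter = pd.concat([chapter1, chapter2, chapter3])
# Sort by VideoTime in ascending order
chapter.sort_values(by='VideoTime', ascending=True, inplace=True)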
chapter
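The offsets 4253.188 and 4000.27 above are typed in by hand from the printed maxima. A minimal sketch of deriving them instead, under the same assumption the post makes (the largest observed VideoTime stands in for the chapter length). `chapter_offsets` and `to_global_seconds` are hypothetical helper names.

```python
from itertools import accumulate

# Longest observed VideoTime per chapter, in seconds (the values printed earlier)
chapter_max = {1: 4253.188, 2: 4000.27, 3: 4555.0}

def chapter_offsets(max_by_chapter):
    """Map each chapter to the number of seconds that precede it on the merged timeline."""
    chapters = sorted(max_by_chapter)
    lengths = [max_by_chapter[c] for c in chapters]
    # Running total of the lengths of all earlier chapters: 0, len1, len1 + len2, ...
    starts = [0.0] + list(accumulate(lengths))[:-1]
    return dict(zip(chapters, starts))

offsets = chapter_offsets(chapter_max)

def to_global_seconds(chapter, video_time):
    """Shift a chapter-local time onto the merged timeline."""
    return video_time + offsets[chapter]

print(offsets)
```

This keeps the shift consistent if the data is re-collected and the maxima change.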

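Next we look at how the danmaku volume is distributed over time across the chapters, in units of one minute:

The x-axis is the time in minutes
The y-axis is the number of danmaku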


def second2minute(second):
    # Convert VideoTime from seconds to whole minutes (rounded down)
    try:
        return int(float(second)/60)
    except (TypeError, ValueError, OverflowError):
        return 0




chapter['VideoTime'] = chapter['VideoTime'].apply(second2minute)
chapter
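When counting per minute, a minute with no danmaku simply does not appear in the result, so a line chart would draw straight across the gap. Here is a minimal sketch of minute binning that fills such gaps with zeros. The sample timestamps are made up, and `minute_bucket` is a hypothetical helper.

```python
from collections import Counter

def minute_bucket(seconds):
    """Floor a non-negative timestamp in seconds to its minute index."""
    return int(seconds // 60)

sample_seconds = [3.2, 59.9, 60.0, 61.5, 125.0, 4253.188]  # made-up timestamps
per_minute = Counter(minute_bucket(s) for s in sample_seconds)

# Dense series from minute 0 to the last observed minute, with 0 for empty minutes
series = [per_minute.get(m, 0) for m in range(max(per_minute) + 1)]
print(series[:4])
```

With the real data the gaps are probably rare during the show, but the dense form also makes later steps such as thresholding simpler.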


import matplotlib.pyplot as plt
%matplotlib inline


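# Arial Unicode MS includes CJK glyphs, so any Chinese text in the figure renders correctly
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
# Number of danmaku in each minute
danmudf = chapter.groupby('VideoTime').agg({'DanMuContent': ['count']})
danmudf.plot(kind='line', figsize=(20, 10), legend=False)
plt.title("Danmaku volume over time, Bilibili 2020 New Year's Eve gala", fontweight='bold', fontsize=25)
plt.xlabel('Time (min)', fontweight='bold', fontsize=20)
plt.ylabel('Danmaku count', fontweight='bold', fontsize=20, rotation=0)
plt.show()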

From the chart above, the intervals in which users sent danmaku most densely are:

(37, 63)
(100, 120)
after that, only isolated peaks
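The intervals above were read off the chart by eye. A minimal sketch of pulling such intervals out of per-minute counts automatically, assuming a fixed threshold is a reasonable definition of "dense". The threshold and `toy_counts` are arbitrary illustrative values, and `busy_intervals` is a hypothetical helper.

```python
def busy_intervals(counts, threshold):
    """Return (start, end) minute pairs for runs of consecutive minutes with counts >= threshold."""
    intervals = []
    start = None
    for minute, count in enumerate(counts):
        if count >= threshold and start is None:
            start = minute  # a dense run begins
        elif count < threshold and start is not None:
            intervals.append((start, minute - 1))  # the run ended at the previous minute
            start = None
    if start is not None:
        # The series ended while still inside a dense run
        intervals.append((start, len(counts) - 1))
    return intervals

toy_counts = [5, 6, 40, 55, 60, 8, 7, 45, 4, 50, 52]  # made-up per-minute counts
print(busy_intervals(toy_counts, threshold=30))  # [(2, 4), (7, 7), (9, 10)]
```

A single-minute run such as (7, 7) corresponds to what the list above calls an isolated peak.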

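Minutes 37-63

This is the middle-to-late part of chapter one, and the stretch where the whole gala hits one climax after another. In order, there are Feng Timo (冯提莫), anime songs (such as Butterfly), and the "种花" (China) medley (such as 钢铁洪流进行曲): three different types of performance mixed together.

Feng Timo's sweetness

Youthful memories from the anime songs

There is a kind of home called "country"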

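Around minute 115

Minute 115 is around minute 40 of chapter two. The performance at that point:

骄傲的少年 (Proud Youth; the song from 《那年那兔那些事》)

The gentler small peaks later on also line up with a few big names, such as G.E.M. (邓紫棋), Kris Wu (吴亦凡), and Mayday (五月天); I won't paste the screenshots here.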
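The mapping from minute 115 on the merged timeline to roughly minute 40 of chapter two can be checked with a little arithmetic. A minimal sketch, assuming the chapter boundaries are the maxima printed earlier; `locate` is a hypothetical helper.

```python
from bisect import bisect_right

# Start of each chapter on the merged timeline, in seconds
chapter_starts = [0.0, 4253.188, 4253.188 + 4000.27]

def locate(global_minute):
    """Return (chapter number, minutes into that chapter) for a minute on the merged timeline."""
    seconds = global_minute * 60
    # Index of the last chapter that starts at or before this moment
    index = bisect_right(chapter_starts, seconds) - 1
    local_minutes = (seconds - chapter_starts[index]) / 60
    return index + 1, local_minutes

chapter_no, local = locate(115)
print(chapter_no, round(local, 1))  # 2 44.1
```

So minute 115 falls about 44 minutes into chapter two, which matches the rough "around minute 40" used in the text.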

Overall content analysis


import re
from collections import Counter

import jieba
from pyecharts import options as opts
from pyecharts.charts import WordCloud


# Concatenate all danmaku into one string (missing values are skipped)
text = ''.join(df['DanMuContent'].dropna().astype(str))
# Keep only Chinese characters; the backslashes in the \u escapes are required
text = ''.join(re.findall(r'[\u4e00-\u9fa5]+', text))
wordlist = jieba.lcut(text)
# Count the words that are longer than one character
wordcounts = Counter(w for w in wordlist if len(w)>1)
# Sort by frequency, highest first
wordfreq = wordcounts.most_common()


wordcloud =WordCloud()
wordcloud.add("",
              wordfreq,
              word_size_range=[20,100])
wordcloud.set_global_opts(title_opts=opts.TitleOpts(title="Bilibili 2020 New Year's Eve Gala"))
wordcloud.render('B站跨年.html')
wordcloud.render_notebook()
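The Chinese-only filter is easy to get subtly wrong: the character class must be written with \u escapes, otherwise it matches ASCII letters and digits instead of Chinese characters. A minimal sketch of checking the filter on a small string; the sample text is made up, and `keep_chinese` is a hypothetical helper.

```python
import re

# Matches runs of characters in the range U+4E00..U+9FA5 (common CJK ideographs)
CJK_RUN = re.compile(r'[\u4e00-\u9fa5]+')

def keep_chinese(text):
    """Drop everything except Chinese characters and join what is left."""
    return ''.join(CJK_RUN.findall(text))

sample = '2020!!! 哔哩哔哩 干杯 hahaha 233'  # made-up danmaku text
print(keep_chinese(sample))  # 哔哩哔哩干杯
```

Note that joining the runs removes the boundaries between separate danmaku, so a segmenter may occasionally form a word across two comments. For a rough word cloud that is probably acceptable.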