• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2020, Vol. 42 ›› Issue (07): 1302-1308. doi: 10.3969/j.issn.1007-130X.2020.07.020

• Data Mining and Artificial Intelligence •

Automatic construction of the garbage barrage shielding dictionary based on seed words and dataset

WANG Ge (汪舸)1,2, WU Fang-jun (吴方君)1,2

  1. School of Information Management, Jiangxi University of Finance and Economics, Nanchang 330013, China;

  2. Jiangxi Key Laboratory of Data and Knowledge Engineering, Jiangxi University of Finance and Economics, Nanchang 330013, China

  • Received: 2019-12-18 Revised: 2020-02-27 Accepted: 2020-07-25 Online: 2020-07-25 Published: 2020-07-27

Abstract: With the popularity of barrage (danmaku) videos, barrage comments have become a form of interactive communication among young people in the Internet age. As the number of barrage comments grows, however, blocking garbage barrage has become a problem. Building on the keyword shielding method offered by various video websites, this paper proposes two classes of methods for automatically constructing a shielding dictionary, based on seed words and on datasets respectively. The first class mainly uses Google's natural language processing tool word2vec and pointwise mutual information (PMI): words that are highly similar to the seed words, or that co-occur with them frequently, are added to the shielding dictionary. The second class mainly adopts TF-IDF (Term Frequency-Inverse Document Frequency), the LDA (Latent Dirichlet Allocation) topic model, and IG (Information Gain) to extract keywords from a garbage barrage dataset and build the shielding dictionary from them. Finally, the constructed shielding dictionaries are evaluated. The experimental results show that the shielding effect is best when the dictionary size is between 400 and 500. The influence of the number of LDA topics and of the dataset size on the shielding effect is also investigated.


Key words: barrage, keyword shielding, shielding dictionary, seed words
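
The seed-word approach summarized in the abstract can be illustrated with a small sketch of PMI-based expansion. The abstract does not give the exact PMI variant, the tokenizer, or any thresholds, so everything below is an assumption made for illustration: co-occurrence is counted at the level of a single comment, and the function name `pmi_expand`, its parameters `top_k` and `min_count`, and the sample comments are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_expand(comments, seeds, top_k=3, min_count=2):
    """Rank non-seed words by their highest PMI with any seed word.

    comments: list of already-tokenized comments (lists of words).
    seeds:    set of seed words known to appear in garbage comments.
    Assumption: two words co-occur when they appear in the same comment.
    """
    n = len(comments)
    word_df = Counter()  # number of comments containing each word
    pair_df = Counter()  # number of comments containing both words of a pair
    for tokens in comments:
        vocab = sorted(set(tokens))  # sorted so each pair has one canonical key
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    scores = {}
    for (a, b), joint in pair_df.items():
        if joint < min_count:
            continue  # rare pairs give unstable PMI estimates
        if (a in seeds) == (b in seeds):
            continue  # keep only pairs linking one seed to one candidate
        candidate = b if a in seeds else a
        # PMI = log2( p(a, b) / (p(a) * p(b)) ), with probabilities as counts / n
        pmi = math.log2(joint * n / (word_df[a] * word_df[b]))
        scores[candidate] = max(scores.get(candidate, float("-inf")), pmi)

    # Highest PMI first; ties broken alphabetically for a stable order
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


if __name__ == "__main__":
    # Hypothetical toy data: three spam-like and three ordinary comments
    sample = [
        ["join", "group", "free", "gift"],
        ["join", "group", "free", "coins"],
        ["free", "gift", "click", "link"],
        ["this", "episode", "is", "great"],
        ["great", "fight", "scene"],
        ["this", "scene", "is", "great"],
    ]
    for word, score in pmi_expand(sample, {"free"}):
        print(word, round(score, 3))
```

In this sketch the candidates returned for the seed word would be appended to the shielding dictionary. A word2vec-based variant would rank candidates by embedding similarity to the seeds instead of by co-occurrence.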