• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2020, Vol. 42 ›› Issue (07): 1302-1308. doi: 10.3969/j.issn.1007-130X.2020.07.020

Automatic construction of the garbage barrage shielding dictionary based on seed words and dataset

WANG Ge 1,2, WU Fang-jun 1,2

  (1. School of Information Management, Jiangxi University of Finance and Economics, Nanchang 330013;

  2. Jiangxi Key Laboratory of Data and Knowledge Engineering, Jiangxi University of Finance and Economics, Nanchang 330013, China)

  • Received: 2019-12-18 Revised: 2020-02-27 Accepted: 2020-07-25 Online: 2020-07-25 Published: 2020-07-27

Abstract: With the popularity of barrage video, barrage has become a form of interactive communication among young people in the Internet age. As the number of barrage comments grows, however, blocking garbage barrage has become a problem. Building on the keyword shielding method offered by various video websites, this paper proposes two methods for automatically constructing a shielding dictionary, one based on seed words and the other on a dataset. The first method mainly uses Google's natural language processing tool word2vec and pointwise mutual information (PMI): words that are highly similar to the seed words, or that co-occur with them frequently, are added to the shielding dictionary. The second method mainly uses TF-IDF (Term Frequency-Inverse Document Frequency), the LDA (Latent Dirichlet Allocation) topic model, and IG (Information Gain) to extract keywords from a garbage barrage dataset, and builds the shielding dictionary from those keywords. Finally, the constructed shielding dictionaries are evaluated. The experimental results show that the barrage shielding effect is best when the dictionary size is 400 to 500. The influence of the number of LDA topics and of the dataset size on the shielding effect is also investigated.
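
To make the co-occurrence step of the first method concrete, the following is a minimal sketch of seed-word expansion with PMI. It is an illustration only and not the authors' implementation: the function name `pmi_expand`, the toy comments, and the choice to count co-occurrence at the level of whole comments are all assumptions made here. Candidates are ranked by PMI(x, y) = log2(p(x, y) / (p(x) p(y))), with probabilities estimated as the fraction of comments containing the word or pair.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_expand(docs, seeds, top_k=3):
    """Rank non-seed words by their highest PMI with any seed word.

    docs  : list of tokenized comments (each a list of words)
    seeds : set of seed words already in the shielding dictionary
    (Hypothetical helper; comment-level co-occurrence is an assumption.)
    """
    n_docs = len(docs)
    word_df = Counter()  # number of comments containing each word
    pair_df = Counter()  # number of comments containing both words of a pair
    for tokens in docs:
        vocab = set(tokens)
        word_df.update(vocab)
        pair_df.update(frozenset(p) for p in combinations(sorted(vocab), 2))

    scores = {}
    for pair, joint in pair_df.items():
        a, b = tuple(pair)
        # Keep only pairs made of exactly one seed and one candidate.
        if (a in seeds) == (b in seeds):
            continue
        seed, cand = (a, b) if a in seeds else (b, a)
        # PMI with probabilities estimated from comment counts.
        pmi = math.log2(joint * n_docs / (word_df[seed] * word_df[cand]))
        scores[cand] = max(scores.get(cand, float("-inf")), pmi)

    # Highest PMI first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Toy, already-tokenized comments (invented for illustration).
docs = [
    ["free", "coins", "click", "link"],
    ["click", "link", "now"],
    ["great", "fight", "scene"],
    ["love", "this", "scene"],
    ["free", "coins", "now"],
]
print(pmi_expand(docs, {"free"}))
```

In this toy data "coins" appears in every comment that contains the seed "free", so it ranks first. A real pipeline would also need Chinese word segmentation and a frequency cutoff, since PMI tends to overrate rare words; neither is shown here.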


Key words: barrage, keyword shielding, shielding dictionary, seed words