Computer Engineering & Science ›› 2020, Vol. 42 ›› Issue (07): 1302-1308. doi: 10.3969/j.issn.1007-130X.2020.07.020
WANG Ge 1,2, WU Fang-jun 1,2
Abstract: With the popularity of barrage video, barrage comments have become a form of interactive communication among young people in the Internet age. As the number of barrage comments grows, however, blocking junk barrage has become a problem. Building on the keyword masking methods offered by video websites, this paper proposes two methods for automatically constructing a shielding dictionary, one based on seed words and one based on a dataset. The first method mainly uses word2vec, Google's natural language processing tool, and pointwise mutual information (PMI): words that are highly similar to the seed words, or that co-occur with them frequently, are added to the shielding dictionary. The second method mainly adopts TF-IDF (Term Frequency-Inverse Document Frequency), the LDA (Latent Dirichlet Allocation) topic model, and IG (Information Gain) to extract keywords from a garbage barrage dataset and build the shielding dictionary from them. Finally, the constructed shielding dictionaries are evaluated. The experimental results show that the shielding effect is best when the dictionary size is 400 to 500 entries. The influence of the number of LDA topics and of the dataset size on the shielding effect is also investigated.
Key words: barrage, keyword shielding, shielding dictionary, seed words
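The PMI step of the seed-word method can be illustrated with a small sketch. It is not the authors' implementation: the function names `pmi_scores` and `expand_dictionary`, the per-comment co-occurrence window, and the `top_k` cutoff are all assumptions made for illustration, and the word2vec similarity step is omitted.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(docs, seeds):
    """Score each non-seed word by its highest PMI with any seed word.

    docs  : list of token lists, one per comment (assumed already segmented).
    seeds : set of seed words.
    Co-occurrence is counted once per comment (an assumption of this sketch).
    """
    n = len(docs)
    word_df = Counter()  # number of comments containing each word
    pair_df = Counter()  # number of comments containing each word pair
    for tokens in docs:
        vocab = set(tokens)
        word_df.update(vocab)
        pair_df.update(frozenset(p) for p in combinations(sorted(vocab), 2))

    scores = {}
    for pair, count in pair_df.items():
        a, b = tuple(pair)
        for seed, cand in ((a, b), (b, a)):
            if seed in seeds and cand not in seeds:
                # PMI = log2( p(seed, cand) / (p(seed) * p(cand)) )
                pmi = math.log2(count * n / (word_df[seed] * word_df[cand]))
                scores[cand] = max(scores.get(cand, float("-inf")), pmi)
    return scores


def expand_dictionary(docs, seeds, top_k):
    """Return the seeds plus the top_k candidates ranked by PMI (ties by word)."""
    scores = pmi_scores(docs, seeds)
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return set(seeds) | set(ranked[:top_k])


# Hypothetical toy corpus, for illustration only.
docs = [
    ["buy", "cheap", "followers"],
    ["cheap", "followers", "now"],
    ["great", "video"],
    ["great", "song"],
]
seeds = {"cheap"}
scores = pmi_scores(docs, seeds)
dictionary = expand_dictionary(docs, seeds, top_k=3)
```

Words that never appear in the same comment as a seed receive no score and so never enter the dictionary. Because PMI tends to favor rare words, a practical version would likely add a minimum-frequency filter; that threshold is left out here.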
WANG Ge, WU Fang-jun. Automatic construction of the garbage barrage shielding dictionary based on seed words and dataset[J]. Computer Engineering & Science, 2020, 42(07): 1302-1308.
URL: http://joces.nudt.edu.cn/EN/10.3969/j.issn.1007-130X.2020.07.020
http://joces.nudt.edu.cn/EN/Y2020/V42/I07/1302