• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science


A cross-lingual document similarity calculation method based on bilingual LDA

CHENG Wei1,2, XIAN Yan-tuan1,2, ZHOU Lan-jiang1,2, YU Zheng-tao1,2, WANG Hong-bin1,2

  (1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500;
   2. Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500, China)
     
  • Received: 2015-12-29 Revised: 2016-02-23 Online: 2017-05-25 Published: 2017-05-25

Abstract:

Building on bilingual topic models, we analyze the similarity of bilingual documents and propose a cross-lingual document similarity calculation method based on bilingual LDA. We first train a bilingual LDA model on parallel bilingual documents and then use the trained model to infer the topic distributions of the documents in a new corpus, which maps the new corpus's documents in both languages into a shared topic vector space. The similarity of bilingual documents in the new corpus is then computed by applying cosine similarity to their topic distributions. We also improve the topic frequency-inverse document frequency method by taking into account the within-category and between-category dispersion of topic distributions, and use the improved method to calculate the weights of feature topics. Experimental results show that the improved weighting method increases recall, keeps the LDA-based similarity algorithm from being limited to particular categories, and is reliable.
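
The similarity step described in the abstract can be illustrated with a minimal sketch. It assumes that a trained bilingual LDA model has already produced one topic distribution per document, in a topic space shared by both languages. The function names `topic_idf` and `weighted_cosine`, the `threshold` parameter, and the smoothed inverse document frequency are illustrative assumptions; in particular, this is a plain topic-level IDF baseline and not the improved weighting based on within-category and between-category dispersion, whose formula is not given in the abstract.

```python
import math


def topic_idf(doc_topic_rows, threshold=0.1):
    """Baseline per-topic weights (assumption: NOT the paper's improved method).

    A topic counts as "present" in a document when its probability is at
    least `threshold` (a hypothetical cutoff). Topics present in fewer
    documents receive larger weights, by analogy with smoothed IDF.
    """
    n_docs = len(doc_topic_rows)
    n_topics = len(doc_topic_rows[0])
    weights = []
    for k in range(n_topics):
        doc_freq = sum(1 for row in doc_topic_rows if row[k] >= threshold)
        weights.append(math.log((n_docs + 1) / (doc_freq + 1)) + 1.0)
    return weights


def weighted_cosine(p, q, weights=None):
    """Cosine similarity between two topic distributions.

    `p` and `q` are topic distributions of two documents (for example one
    per language) over the same topics. If `weights` is given, each topic
    dimension is scaled by its weight before the cosine is computed.
    """
    if len(p) != len(q):
        raise ValueError("topic distributions must have the same length")
    if weights is None:
        weights = [1.0] * len(p)
    if len(weights) != len(p):
        raise ValueError("need exactly one weight per topic")
    wp = [w * x for w, x in zip(weights, p)]
    wq = [w * y for w, y in zip(weights, q)]
    dot = sum(a * b for a, b in zip(wp, wq))
    norm = math.sqrt(sum(a * a for a in wp)) * math.sqrt(sum(b * b for b in wq))
    return dot / norm if norm > 0 else 0.0


# Made-up topic distributions for three documents over three shared topics.
corpus_topics = [
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.05, 0.15, 0.80],
]
weights = topic_idf(corpus_topics, threshold=0.3)
similar = weighted_cosine(corpus_topics[0], corpus_topics[1], weights)
dissimilar = weighted_cosine(corpus_topics[0], corpus_topics[2], weights)
```

Because both documents are represented in the same topic space, no translation step is needed at comparison time; any alternative weighting scheme can be passed in through the `weights` argument.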

Key words: bilingual LDA, cross-lingual document similarity calculation, cosine similarity, topic frequency-inverse document frequency