• Journal of the China Computer Federation (CCF)
• Chinese Science and Technology Core Journal
• Chinese Core Journal

Computer Engineering & Science


Keyword extraction from Uyghur-Kazakh texts based on stem units

SARDAR Parhat, MIJIT Ablimit, ASKAR Hamdulla

  1. (College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China)
  • Received: 2019-08-04 Revised: 2019-10-23 Online: 2020-01-25 Published: 2020-01-25

Abstract:

A keyword extraction method for Uyghur and Kazakh (Uyghur-Kazakh) texts based on stem units is proposed. Uyghur and Kazakh are low-resource derivational languages. Morpheme structure analysis and stem extraction can effectively reduce vocabulary size and improve coverage for derivational languages. In this paper, Uyghur-Kazakh texts are downloaded from the Internet and segmented into morpheme sequences. word2vec is used to train stem vectors, which represent text content in a distributed way. The TF-IDF (Term Frequency-Inverse Document Frequency) algorithm is then used to weight the stem vectors. Keywords are extracted by measuring the similarity between the keyword vectors of the training set and the stem vectors of the test set. The experimental results show that morpheme segmentation and stem vector representation are important steps, and that the proposed method achieves better performance in extracting keywords from derivational languages such as Uyghur and Kazakh.
 
 

Key words: Uyghur-Kazakh, stem vector, keyword extraction, morphology
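
The scoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The stems are English placeholder glosses rather than real Uyghur or Kazakh stems, the 2-D vectors in STEM_VECTORS are hand-written stand-ins for trained word2vec vectors, and all function names are hypothetical. The abstract does not say how the TF-IDF weight is combined with the similarity, so the product used here is an assumption, as is the "+ 1" smoothing of the IDF term.

```python
import math
from collections import Counter

# Assumed toy stem vectors; in the paper these would be trained with word2vec.
STEM_VECTORS = {
    "school": (1.0, 0.0),
    "student": (0.9, 0.1),
    "read": (0.8, 0.2),
    "book": (0.7, 0.3),
    "market": (0.0, 1.0),
    "price": (0.1, 0.9),
}


def inverse_document_frequency(corpus):
    """Return a smoothed IDF value, log(N / df) + 1, for every stem in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(stem for doc in corpus for stem in set(doc))
    return {stem: math.log(n_docs / df) + 1.0 for stem, df in doc_freq.items()}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extract_keywords(doc, idf, keyword_vectors, top_k=2):
    """Rank the stems of one document by TF-IDF weight times best keyword similarity."""
    term_freq = Counter(doc)
    scores = {}
    for stem, count in term_freq.items():
        if stem not in STEM_VECTORS:
            continue  # stems without a vector cannot be compared
        weight = (count / len(doc)) * idf.get(stem, 1.0)
        similarity = max(
            (cosine(STEM_VECTORS[stem], kv) for kv in keyword_vectors),
            default=0.0,
        )
        scores[stem] = weight * similarity  # assumed combination rule
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Placeholder stem sequences standing in for morpheme-segmented training texts.
corpus = [
    ["school", "student", "read"],
    ["school", "read", "book"],
    ["market", "price", "market"],
]
idf = inverse_document_frequency(corpus)

# Vectors of keywords assumed to be annotated in the training set.
keyword_vectors = [STEM_VECTORS["school"]]

test_doc = ["student", "read", "market", "student"]
keywords = extract_keywords(test_doc, idf, keyword_vectors)
print(keywords)
```

In this toy setup the stems closest to the training keyword and most heavily weighted by TF-IDF rank first, while a frequent but unrelated stem such as "market" scores zero. Working on stems rather than full word forms means the many inflected variants of one stem share a single vector and a single frequency count.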