• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• Artificial Intelligence and Data Mining

Keyword extraction of Uyghur-Kazakh texts based on stem units

SARDAR Parhat (沙尔旦尔·帕尔哈提), MIJIT Ablimit (米吉提·阿不里米提), ASKAR Hamdulla (艾斯卡尔·艾木都拉)

  1. (College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China)
  • Received: 2019-08-04 Revised: 2019-10-23 Online: 2020-01-25 Published: 2020-01-25
  • Funding:

    National Natural Science Foundation of China (61662078)

Abstract:

A keyword extraction method for Uyghur and Kazakh (hereafter Uyghur-Kazakh) texts based on stem units is proposed. Uyghur and Kazakh are low-resource derivational languages, for which morpheme structure analysis and stem extraction can effectively reduce the size of the unit inventory and improve its coverage. Uyghur-Kazakh texts are downloaded from the Internet and segmented into morpheme sequences. word2vec is used to train stem vectors that give a distributed representation of the text content, and the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm is then used to weight the stem vectors. Keywords are extracted according to the similarity between the keyword stem vectors of the training set and the stem vectors of the test set. The experimental results show that morpheme segmentation and stem vector representation are an important step in keyword extraction for derivational languages such as Uyghur and Kazakh, and that this step improves the accuracy of keyword extraction.

Key words: Uyghur-Kazakh, stem vector, keyword extraction, morphology
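
The pipeline summarized in the abstract (stem sequences, stem vectors, TF-IDF weighting, similarity to training-set keyword stems) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several parts are assumptions made only for illustration: the stem vectors are tiny hand-written 2-D values rather than word2vec output, the English tokens stand in for already-segmented Uyghur-Kazakh stems, the smoothed IDF formula is one common variant, and the scoring rule (TF-IDF weight multiplied by the highest cosine similarity to any training keyword stem) is a guess at how weighting and similarity might be combined. The names `tf_idf`, `cosine`, and `extract_keywords` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(doc, train_docs):
    """TF-IDF weight of each stem in `doc`, with document frequencies from `train_docs`.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, so unseen stems get a finite weight.
    """
    n_docs = len(train_docs)
    doc_freq = Counter(stem for d in train_docs for stem in set(d))
    counts = Counter(doc)
    return {
        stem: (count / len(doc)) * (math.log((1 + n_docs) / (1 + doc_freq[stem])) + 1.0)
        for stem, count in counts.items()
    }


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extract_keywords(doc, train_docs, keyword_stems, vectors, top_k=2):
    """Rank stems of `doc` by TF-IDF weight times best similarity to a known keyword stem.

    Assumption: this scoring rule is illustrative, not taken from the paper.
    """
    weights = tf_idf(doc, train_docs)
    known = [vectors[k] for k in keyword_stems if k in vectors]
    scores = {}
    for stem, weight in weights.items():
        if stem not in vectors:
            continue  # no vector available for this stem
        best = max((cosine(vectors[stem], kv) for kv in known), default=0.0)
        scores[stem] = weight * best
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data (all hypothetical): placeholder stems and hand-written 2-D "stem vectors".
vectors = {
    "school": [0.9, 0.1], "teach": [0.8, 0.2], "student": [0.85, 0.15],
    "market": [0.1, 0.9], "price": [0.2, 0.8], "and": [0.5, 0.5],
}
train_docs = [
    ["school", "teach", "and", "student"],
    ["market", "price", "and", "price"],
    ["school", "and", "teach"],
]
keyword_stems = ["school"]  # keyword stems labelled in the training set
test_doc = ["student", "teach", "and", "market", "student"]

print(extract_keywords(test_doc, train_docs, keyword_stems, vectors))
# prints ['student', 'teach']
```

In a realistic setting the `vectors` dictionary would be filled from a word2vec model trained on the stem sequences, and the segmentation into stems would come from a morphological analyzer; both steps are outside the scope of this sketch.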