• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2012, Vol. 34 ›› Issue (12): 150-154.

• Article

基于SVM的维吾尔文文本分类研究

阿力木江·艾沙¹,², 吐尔根·依布拉音², 库尔班·吾布力², 艾山·吾买尔²

  (1. 新疆大学现代教育技术中心, 新疆 乌鲁木齐 830046;
    2. 新疆大学信息科学与工程学院, 新疆 乌鲁木齐 830046)
  • Received: 2011-12-30 Revised: 2012-03-05 Online: 2012-12-25 Published: 2012-12-25
  • Funding:

    National Natural Science Foundation of China (61063026, 61163028)

Research of Uyghur Language Text Categorization Based on SVM

Alimjan AYSA¹,², Turgun IBRAHIM², Kurban OBUL², Hasan OMAR²

  (1. Center of Modern Education Technology, Xinjiang University, Urumqi 830046;
    2. College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China)
  • Received: 2011-12-30 Revised: 2012-03-05 Online: 2012-12-25 Published: 2012-12-25

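Abstract:

Automatic text classification has important practical significance and broad application prospects for improving the effectiveness and accuracy with which text information is used. With the rapid growth of Uyghur-language information on the Internet, Uyghur text classification has become a key technology for processing and organizing this large volume of text data. This paper studies techniques and methods for Uyghur text classification. To address the high dimensionality of Uyghur text under the vector space model representation, it combines stemming with the χ² statistic to reduce the dimensionality of the representation space, and it uses the SVM algorithm to construct a Uyghur text classifier. Experiments on a Uyghur text classification corpus show that the SVM classifier reaches a MacroF1 value of 84.6%, clearly better than the kNN method.

Keywords: text classification, SVM, kNN, Uyghur language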

Abstract:

The automatic text categorization technique has important practical significance and broad application prospects in improving the validity and accuracy of the use of text information. With the rapid increase of Uyghur language text information on the Internet, Uyghur language text categorization has become a key technique for processing and organizing these text data. To address the high dimensionality of Uyghur language text under the vector space model representation, stemming is used together with the χ² statistic to reduce the dimensionality. A Uyghur language text categorizer is constructed based on SVM. The experimental results on a Uyghur language text corpus show that the MacroF1 value of the SVM categorizer reaches 84.6%, outperforming the kNN approach.

Key words: text categorization; SVM; kNN; Uyghur language
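
The abstract names two quantities without defining them: the χ² statistic used to select features after stemming, and the MacroF1 value used to compare the SVM and kNN categorizers. The following is a minimal sketch of both, assuming the common textbook definitions; it is not the authors' implementation. The function names (`chi2_score`, `select_top_terms`, `macro_f1`) and the toy English tokens are hypothetical, the documents are assumed to be already stemmed, and the Uyghur stemmer and the SVM training step are not shown.

```python
from collections import defaultdict


def chi2_score(docs, labels, term, category):
    """Chi-square statistic between one term and one category.

    docs   -- list of sets, each holding the (already stemmed) terms of a document
    labels -- list of category names, aligned with docs
    """
    n = len(docs)
    a = b = c = 0
    for terms, label in zip(docs, labels):
        present, in_cat = term in terms, label == category
        if present and in_cat:
            a += 1          # term present, document in the category
        elif present:
            b += 1          # term present, document in another category
        elif in_cat:
            c += 1          # term absent, document in the category
    d = n - a - b - c       # term absent, document in another category
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def select_top_terms(docs, labels, k):
    """Keep the k terms whose best chi-square score over all categories is highest."""
    vocabulary = set().union(*docs)
    categories = set(labels)
    best = {t: max(chi2_score(docs, labels, t, c) for c in categories)
            for t in vocabulary}
    # Sort by descending score, then alphabetically so that ties are deterministic.
    return sorted(best, key=lambda t: (-best[t], t))[:k]


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-category F1 scores."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    scores = []
    for c in set(y_true):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


# Toy corpus (placeholder English tokens standing in for Uyghur stems).
docs = [{"sport", "ball"}, {"sport", "team"}, {"bank", "money"}, {"bank", "team"}]
labels = ["sports", "sports", "finance", "finance"]
selected = select_top_terms(docs, labels, 2)
```

In this sketch a term that occurs equally often inside and outside a category (such as `team` above) scores zero and is dropped first, which is how χ² selection shrinks the vector space. MacroF1 averages the per-category F1 scores with equal weight, so small categories count as much as large ones.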