• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2012, Vol. 34 ›› Issue (6): 187-190.

• Papers •

基于最大熵分类器的藏文句子边界自动识别方法研究

才藏太   

  1. (青海师范大学计算机学院,青海 西宁 810008)
  • 收稿日期:2011-09-01 修回日期:2011-11-03 出版日期:2012-06-25 发布日期:2012-06-25

Research on the Automatic Identification of Tibetan Sentence Boundaries with Maximum Entropy Classifier

CAI Zangtai   

  1. (School of Computer,Qinghai Normal University,Xining 810008,China)
  • Received:2011-09-01 Revised:2011-11-03 Online:2012-06-25 Published:2012-06-25

Abstract:

Sentence boundary identification is a foundational task in Tibetan text analysis. It is a prerequisite for building sentence-level parallel corpora between Tibetan and other languages, and it underpins further work on Tibetan-Chinese machine translation. By analyzing the ways in which Tibetan sentences end and studying the rules that govern their boundaries, this paper proposes a method for identifying Tibetan sentence boundaries. The method first identifies sentences using special rules and a word list, and then applies a maximum entropy model to the sentences that remain ambiguous, thereby improving the boundary identification rate for Tibetan sentences.

Key words: Tibetan sentence; boundary identification; maximum entropy model
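
The two-stage procedure outlined in the abstract (rules and a word list first, a maximum entropy model for the ambiguous cases) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the word lists, the feature set (the tokens on either side of a candidate boundary), the placeholder tokens such as "fin" and "amb", and every function name are hypothetical. The maximum entropy model is realised here as a binary logistic regression trained by stochastic gradient ascent, which is the two-class form of a maximum entropy classifier.

```python
import math


def rule_decision(prev_token, final_words, nonfinal_words):
    """Stage 1: return True/False if a word list decides, else None (ambiguous)."""
    if prev_token in final_words:
        return True
    if prev_token in nonfinal_words:
        return False
    return None


def extract_features(prev_token, next_token):
    """Indicator features for one candidate boundary (an assumed feature set)."""
    return ["bias", "prev=" + prev_token, "next=" + next_token]


def boundary_probability(weights, feats):
    """P(boundary | features) under a two-class maximum entropy (logistic) model."""
    score = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-score))


def train_maxent(samples, epochs=200, learning_rate=0.5):
    """Fit feature weights by stochastic gradient ascent on the log-likelihood."""
    weights = {}
    for _ in range(epochs):
        for (prev_token, next_token), label in samples:
            feats = extract_features(prev_token, next_token)
            error = label - boundary_probability(weights, feats)
            for f in feats:
                weights[f] = weights.get(f, 0.0) + learning_rate * error
    return weights


def is_sentence_boundary(prev_token, next_token, final_words, nonfinal_words, weights):
    """Apply the rules first; fall back to the model only when they are silent."""
    decision = rule_decision(prev_token, final_words, nonfinal_words)
    if decision is not None:
        return decision
    feats = extract_features(prev_token, next_token)
    return boundary_probability(weights, feats) >= 0.5


# Toy data with placeholder tokens standing in for Tibetan words (hypothetical).
FINAL_WORDS = {"fin"}      # tokens assumed to always end a sentence
NONFINAL_WORDS = {"mid"}   # tokens assumed never to end a sentence
TOY_SAMPLES = [
    (("amb", "x"), 1),     # ambiguous token followed by "x": boundary
    (("amb", "x"), 1),
    (("amb", "y"), 0),     # ambiguous token followed by "y": not a boundary
    (("amb", "y"), 0),
]
WEIGHTS = train_maxent(TOY_SAMPLES)
```

Keeping the rule stage separate from the model means the classifier is trained and consulted only on the cases the word lists cannot settle, which matches the order of operations the abstract describes. A real system would need annotated Tibetan text and a richer feature set than the two neighbouring tokens used here.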