• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2011, Vol. 33 ›› Issue (5): 151-154.

• Articles •



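  1. (Key Laboratory of Tibetan Information Processing, Ministry of Education, jointly established by province and ministry, Qinghai Normal University, Xining, Qinghai 810008)
  • Received: 2010-06-10 Revised: 2010-08-29 Online: 2011-05-25 Published: 2011-05-25
  • About the authors: CAI Zhi Jie (b. 1970), male, from Ledu, Qinghai; master's degree, associate professor; research interest: Tibetan information processing. CAI Rang Zhuo Ma (b. 1970), female, from Ledu, Qinghai; master's degree, associate professor; research interest: Tibetan information processing.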


Design of a Tibetan Word Segmentation System

CAI Zhi Jie, CAI Rang Zhuo Ma

  1. (Tibetan Intellectual Information Processing Centre, Qinghai Normal University, Xining 810008, China)
  • Received: 2010-06-10 Revised: 2010-08-29 Online: 2011-05-25 Published: 2011-05-25



Keywords: Chinese information processing, corpus, Tibetan word segmentation


As fundamental linguistic knowledge bases, human-annotated corpora underpin many statistical natural language processing tasks. With the wide use of statistical methods in natural language processing, corpus construction has become an important research area. Word segmentation is a necessary prerequisite for syntactic parsing, and its performance largely determines parsing accuracy. Based on a statistical analysis of an 850,000-byte Tibetan corpus, we first investigate the distribution and syntactic function of Tibetan words, then introduce a dictionary-based Tibetan word segmentation model, and finally present the dictionary structure and the case-auxiliary blocking and restoring algorithms that Tibetan word segmentation requires. The Tibetan word segmentation system also facilitates research on Tibetan input methods, Tibetan electronic dictionary construction, Tibetan word frequency statistics, search engine design and implementation, machine translation system development, network information security, Tibetan corpus construction, and Tibetan semantic analysis.

Key words: Chinese information processing; corpus; Tibetan word segmentation
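
The abstract names a dictionary-based segmentation model with case-auxiliary blocking but does not give the algorithm. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that text is first cut into syllables at the tsheg mark (U+0F0B), that each case-auxiliary particle is emitted as its own token and splits the text into blocks, and that forward maximum matching against the dictionary is applied inside each block. The function names, the choice of forward maximum matching, and the toy dictionary entries are all illustrative assumptions. The restoring step is omitted because the abstract does not describe it.

```python
TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg)


def match_block(block, dictionary, max_len, sep):
    """Forward maximum matching over one block of syllables (assumed strategy)."""
    words, i = [], 0
    while i < len(block):
        # Try the longest candidate first; a single syllable always matches.
        for n in range(min(max_len, len(block) - i), 0, -1):
            candidate = sep.join(block[i:i + n])
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                i += n
                break
    return words


def segment(text, dictionary, particles, sep=TSHEG):
    """Split text into words, cutting blocks at case-auxiliary particles."""
    syllables = [s for s in text.split(sep) if s]
    # Longest dictionary entry, measured in syllables.
    max_len = max((w.count(sep) + 1 for w in dictionary), default=1)
    words, block = [], []
    for syllable in syllables:
        if syllable in particles:
            # A particle closes the current block and is emitted on its own.
            words.extend(match_block(block, dictionary, max_len, sep))
            words.append(syllable)
            block = []
        else:
            block.append(syllable)
    words.extend(match_block(block, dictionary, max_len, sep))
    return words


if __name__ == "__main__":
    # Toy entries chosen for illustration only; not taken from the paper.
    written_tibetan, school, genitive = "བོད་ཡིག", "སློབ་གྲྭ", "གི"
    text = TSHEG.join([written_tibetan, genitive, school])
    print(segment(text, {written_tibetan, school}, {genitive}))
```

A known limitation of this sketch is that a syllable spelled like a particle is always treated as a block boundary, even when it belongs to a longer dictionary word; a real system would need a rule to disambiguate such cases.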