J4 ›› 2011, Vol. 33 ›› Issue (5): 151-154.
• Articles •
CAI Zhi Jie, CAI Rang Zhuo Ma
Abstract:
As fundamental linguistic knowledge bases, human-annotated corpora underpin many statistical natural language processing tasks. With the wide use of statistical methods in natural language processing, corpus construction has become an important research area. Word segmentation is a necessary prerequisite of syntactic parsing; its performance determines parsing accuracy to a large degree. Through statistical analysis of an 850,000-byte Tibetan corpus, we first investigate the distribution and syntactic function of Tibetan words, then introduce a dictionary-based Tibetan word segmentation model, and finally present the dictionary structure and the case-auxiliary blocking and restoring algorithms that Tibetan word segmentation requires. The development of the Tibetan word segmentation system also facilitates research on Tibetan word input methods, Tibetan electronic dictionary construction, Tibetan word frequency statistics, search engine design and implementation, machine translation system development, network information security, Tibetan corpus construction, and Tibetan semantic analysis.
Key words: Chinese information processing; corpus; Tibetan word segmentation
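As an illustration of what dictionary-based segmentation can look like, the following is a minimal sketch of forward maximum matching over syllables. It is not the authors' model: the abstract does not specify the matching strategy, and the dictionary structure and the case-auxiliary blocking and restoring algorithms it mentions are not reproduced here. The names `split_syllables`, `segment`, `toy_dictionary`, and `max_len` are hypothetical, and the demo uses romanized placeholder syllables separated by "." rather than real Tibetan text, where syllables are delimited by the tsheg mark (U+0F0B).

```python
# Sketch only: forward maximum matching, assumed here as one common
# dictionary-based strategy. Not taken from the paper.

TSHEG = "\u0f0b"  # TIBETAN MARK INTERSYLLABIC TSHEG, the syllable delimiter


def split_syllables(text, sep=TSHEG):
    """Split text into syllables on the separator, dropping empty pieces."""
    return [s for s in text.split(sep) if s]


def segment(syllables, dictionary, max_len=4):
    """Greedily match the longest dictionary word starting at each position.

    `dictionary` is a set of syllable tuples. A single syllable that is not
    in the dictionary is emitted as a one-syllable word so that the loop
    always advances.
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink toward one syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = tuple(syllables[i:i + n])
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical toy dictionary of two-syllable words (romanized placeholders).
toy_dictionary = {("bod", "yig"), ("slob", "grwa")}
syllables = split_syllables("bod.yig.slob.grwa.la", sep=".")
result = segment(syllables, toy_dictionary)
print(result)
```

Representing words as syllable tuples keeps the lookup a plain set membership test; a production system would more likely use a trie or hashed index, which is presumably where the paper's dictionary structure comes in.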
CAI Zhi Jie, CAI Rang Zhuo Ma. Design of a Tibetan Word Segmentation System[J]. J4, 2011, 33(5): 151-154.
URL: http://joces.nudt.edu.cn/EN/
http://joces.nudt.edu.cn/EN/Y2011/V33/I5/151