• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2016, Vol. 38 ›› Issue (05): 1046-1051.

• Articles

Recognition of geographical names in Mongolian based on conditional random fields and dictionary

WU Jinxing1, LI Li1, YANG Zhenxin2

  (1. School of Mongolian Studies, Inner Mongolia University, Huhhot 010021, China;
   2. Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei 230031, China)
  • Received: 2015-10-29 Revised: 2015-12-10 Online: 2016-05-25 Published: 2016-05-25
  • Supported by:

    Inner Mongolia Autonomous Region Special Support Project for Mongolian Language and Script Informatization (2012339); National Natural Science Foundation of China (61070099); Inner Mongolia Autonomous Region Department of Education Project (NJZC16002)

Abstract:

Named entity recognition for Mongolian has so far addressed personal names, but no corresponding work has been done on geographical names. This paper presents the first Mongolian geographical name recognizer based on conditional random fields (CRF). Starting from the agglutinative nature of Mongolian, we analyze the forms in which geographical names occur in a Mongolian corpus and the characteristics of each type of name. Based on these characteristics, we add part-of-speech features to the lexical, indicator-word, and feature-word features. Geographical names that the CRF model fails to recognize are then recovered with a geographical name dictionary. Trained on the 3rd-level annotated corpus of about 1,000,000 words developed by Inner Mongolia University, the model achieves a precision of 94.68%, a recall of 84.40%, and an F-score of 89.24%.

Key words: Mongolian geographical name; recognition; CRF; feature; dictionary
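The pipeline summarized in the abstract (per-token features for a CRF tagger, followed by a dictionary pass over tokens the tagger missed) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the cue-word lists, the `LOC`/`O` label scheme, the function names, and the English placeholder tokens are all assumptions made for the example, and the CRF training step itself is omitted. The `f_score` helper assumes the reported F value is the balanced harmonic mean of precision and recall.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical cue lists for illustration only (not taken from the paper).
INDICATOR_WORDS: Set[str] = {"in", "from", "to"}        # context words that often precede a place name
FEATURE_WORDS: Set[str] = {"city", "river", "league"}   # words that often close a place name

Token = Tuple[str, str]  # (word, part-of-speech tag)


def token_features(sent: List[Token], i: int) -> Dict[str, object]:
    """Build lexical, indicator-word, feature-word, and POS features for token i."""
    word, pos = sent[i]
    feats: Dict[str, object] = {
        "word": word.lower(),                              # lexical feature
        "pos": pos,                                        # part-of-speech feature
        "is_feature_word": word.lower() in FEATURE_WORDS,  # feature-word feature
    }
    if i > 0:
        prev_word, prev_pos = sent[i - 1]
        feats["prev_word"] = prev_word.lower()
        feats["prev_pos"] = prev_pos
        feats["prev_is_indicator"] = prev_word.lower() in INDICATOR_WORDS
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        next_word, next_pos = sent[i + 1]
        feats["next_word"] = next_word.lower()
        feats["next_pos"] = next_pos
        feats["next_is_feature_word"] = next_word.lower() in FEATURE_WORDS
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def supplement_with_dictionary(
    words: List[str], labels: List[str], gazetteer: Set[str]
) -> List[str]:
    """Relabel tokens the tagger left as 'O' when they appear in the place-name dictionary."""
    return [
        "LOC" if label == "O" and word.lower() in gazetteer else label
        for word, label in zip(words, labels)
    ]


def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, `f_score(94.68, 84.40)` evaluates to about 89.24, which agrees with the reported F value and suggests the first figure is precision rather than overall accuracy. The dictionary pass only ever changes `O` labels, so it can raise recall without overriding decisions the tagger already made.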