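• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science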

• Artificial Intelligence and Data Mining

A context character feature based neural network model for Thai word segmentation

TAO Guang-feng, XIAN Yan-tuan, WANG Hong-bin, WANG Shu-juan

  1. (School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China)
  • Received: 2016-11-18 Revised: 2016-12-23 Online: 2018-05-25 Published: 2018-05-25
  • Funding:

    National Natural Science Foundation of China (61363044, 61462054); General Program of the Yunnan Provincial Department of Science and Technology (2015FB135); Scientific Research Fund of the Yunnan Provincial Department of Education (2014Z021)

Abstract:

Automatic word segmentation is a key foundational technology in natural language processing. To address the complex feature templates and large search space of traditional statistical Thai word segmentation methods, this paper proposes a neural network model for Thai word segmentation that incorporates context character information. Drawing on distributed word representation methods, the model trains Thai character representation vectors and uses a multi-layer neural network classifier to segment Thai text. Experimental results on the InterBEST 2009 Thai word segmentation evaluation corpus show that the proposed method outperforms the conditional random field (CRF) segmentation model, the Character-Cluster Hybrid segmentation model, and the GLR and N-gram segmentation model. Segmentation accuracy, recall, and F value reach 97.27%, 99.26%, and 98.26%, respectively, and segmentation speed is 112.78% higher than that of the CRF model.
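
To make the idea of classifying each character from its surrounding characters more concrete, here is a minimal, untrained sketch. It is not the authors' implementation: the window size `k`, the vector dimensions, the two-label scheme (1 = "a word ends here"), the `<pad>` symbol, and all function and class names are assumptions chosen for illustration, since the abstract does not specify them.

```python
import math
import random

PAD = "<pad>"  # hypothetical padding symbol for positions outside the text


def context_window(chars, i, k):
    """Return the 2k+1 characters centred on position i, padded at the edges."""
    return [chars[j] if 0 <= j < len(chars) else PAD
            for j in range(i - k, i + k + 1)]


class CharWindowClassifier:
    """Toy network: character vectors -> tanh hidden layer -> softmax over 2 labels.

    Weights are random and untrained, so predictions are arbitrary.
    """

    def __init__(self, vocab, k=2, dim=4, hidden=8, seed=0):
        rng = random.Random(seed)

        def rand(n):
            return [rng.uniform(-0.1, 0.1) for _ in range(n)]

        self.k = k
        # One small vector per character (assumed to be learned in a real system).
        self.emb = {c: rand(dim) for c in set(vocab) | {PAD}}
        in_dim = (2 * k + 1) * dim
        self.w1 = [rand(in_dim) for _ in range(hidden)]
        self.w2 = [rand(hidden) for _ in range(2)]  # label 1 = "a word ends here"

    def predict_proba(self, chars, i):
        # Concatenate the vectors of all characters in the context window.
        x = [v for c in context_window(chars, i, self.k) for v in self.emb[c]]
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.w1]
        z = [sum(w * hi for w, hi in zip(row, h)) for row in self.w2]
        m = max(z)
        e = [math.exp(v - m) for v in z]  # numerically stable softmax
        s = sum(e)
        return [v / s for v in e]


def segment(chars, end_labels):
    """Cut the character sequence after every position labelled 1."""
    words, cur = [], []
    for c, lab in zip(chars, end_labels):
        cur.append(c)
        if lab == 1:
            words.append("".join(cur))
            cur = []
    if cur:  # flush a trailing unfinished word
        words.append("".join(cur))
    return words


text = "ฉันรักภาษาไทย"
chars = list(text)  # splits by Unicode code point, including combining marks
model = CharWindowClassifier(chars)
labels = [max((0, 1), key=lambda y: model.predict_proba(chars, i)[y])
          for i in range(len(chars))]
words = segment(chars, labels)
```

Because each character is labelled independently from a fixed window, this kind of classifier needs no hand-written feature templates and no search over whole-sentence label sequences, which is the contrast with CRF-style models that the abstract draws. With random weights the resulting `words` are meaningless; only the data flow is illustrated.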
 

Key words: Thai word segmentation, neural network model, context character feature, character vector
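
As a rough illustration of how the figures in the abstract relate to each other, the sketch below computes word-level scores for one segmented sentence and checks the reported F value against the reported accuracy and recall. It assumes that the reported accuracy plays the role of precision and that the F value is the harmonic mean of the two; the abstract states neither explicitly, and the function names are hypothetical.

```python
def word_spans(words):
    """Map a list of words to the set of (start, end) character offsets they cover."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def harmonic_f(p, r):
    """Harmonic mean of precision p and recall r (0.0 if both are zero)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


def precision_recall_f(predicted, gold):
    """Word-level precision, recall and F value for one segmented sentence."""
    p_spans, g_spans = word_spans(predicted), word_spans(gold)
    if not p_spans or not g_spans:
        return 0.0, 0.0, 0.0
    correct = len(p_spans & g_spans)  # words with identical boundaries
    precision = correct / len(p_spans)
    recall = correct / len(g_spans)
    return precision, recall, harmonic_f(precision, recall)


# Toy example: the gold word "cd" was wrongly split into "c" and "d".
p, r, f = precision_recall_f(["ab", "c", "d", "e"], ["ab", "cd", "e"])

# Consistency check on the reported percentages (assumption: accuracy = precision).
f_from_abstract = harmonic_f(97.27, 99.26)
```

Under these assumptions, `f_from_abstract` comes out at about 98.25, within 0.01 of the reported 98.26%; the small gap is what one would expect when the inputs are themselves rounded to two decimals.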