• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• Artificial Intelligence and Data Mining

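A Lao word segmentation method based on a bidirectional long short-term memory neural network

HE Li, ZHOU Lanjiang, ZHOU Feng, GUO Jianyi

  1. (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China)
  • Received: 2018-07-18 Revised: 2018-11-08 Online: 2019-07-25 Published: 2019-07-25
  • Supported by:

    National Natural Science Foundation of China (61662040, 61562049)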

A Lao word segmentation method based on a
bidirectional long short-term memory neural network model

HE Li,ZHOU Lanjiang,ZHOU Feng,GUO Jianyi
 
  

  1. (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China)
  • Received: 2018-07-18 Revised: 2018-11-08 Online: 2019-07-25 Published: 2019-07-25

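Abstract:

Because words are the smallest meaningful units of language that can be used independently, segmenting continuous Lao text into words is necessary. This paper proposes a Lao word segmentation method based on a bidirectional long short-term memory (BLSTM) neural network model. The model is trained on a manually segmented corpus of 913,487 words, and the Lao word segmentation task is recast as a syllable-based sequence labeling task in which each Lao syllable receives one of four tags: word-initial (B), word-medial (M), word-final (E), or single-syllable word (S). Lao sentences are first split into syllables, which are trained into vectors. These vectors are then fed to the BLSTM model to estimate the tag of each syllable, and a sequence inference algorithm determines the final tags. Finally, experiments are run on the manually annotated segmentation corpus. The experiments show that the method reaches an accuracy of 87.48%, clearly better than earlier segmentation methods.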
 

Key words: neural network, syllable, bidirectional long short-term memory, Lao word segmentation

Abstract:

Continuous Lao text must be divided into words, which are the smallest independent and meaningful units of language. We propose a Lao word segmentation method based on a bidirectional long short-term memory (BLSTM) neural network model. The model is trained on a Lao corpus that contains 913,487 manually segmented words. In this model, Lao word segmentation is transformed into a syllable-based sequence tagging task, in which each Lao syllable is labeled with one of four tags: begin-word (B), middle-word (M), end-word (E), or single-word (S). First, Lao sentences are divided into syllables and the syllables are trained into vectors. Second, these vectors are fed to the BLSTM neural network model to predict the label of each syllable. Third, a sequence inference algorithm is used to determine the final label of each syllable. We carry out experiments on the manually labeled word-segmentation corpus. Experimental results show that the proposed method achieves an accuracy of 87.48%, which is clearly better than that of existing word segmentation methods.

Key words: neural network, syllable, bidirectional long short-term memory, Lao word segmentation
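
The syllable-level B/M/E/S tagging scheme and the final sequence inference step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the inference algorithm, so a Viterbi-style search restricted to legal tag transitions is assumed here, and the function names (`words_to_tags`, `tags_to_words`, `viterbi_decode`), the placeholder syllables, and the per-syllable scores (which in the paper's setting would come from the BLSTM) are all hypothetical.

```python
from typing import Dict, List, Sequence

TAGS = ("B", "M", "E", "S")

# Assumed legal transitions: a word opened by B must be closed by E
# before a new word (B or S) can start.
ALLOWED = {"B": ("M", "E"), "M": ("M", "E"), "E": ("B", "S"), "S": ("B", "S")}
START_TAGS = ("B", "S")  # a sentence can only open with B or S
END_TAGS = ("E", "S")    # a sentence can only close with E or S


def words_to_tags(words: Sequence[Sequence[str]]) -> List[str]:
    """Turn a segmented sentence (each word is a list of syllables) into B/M/E/S tags."""
    tags: List[str] = []
    for syllables in words:
        n = len(syllables)
        if n == 1:
            tags.append("S")
        elif n > 1:
            tags.extend(["B"] + ["M"] * (n - 2) + ["E"])
    return tags


def tags_to_words(syllables: Sequence[str], tags: Sequence[str]) -> List[List[str]]:
    """Rebuild words from a syllable sequence and its B/M/E/S tags."""
    words: List[List[str]] = []
    current: List[str] = []
    for syllable, tag in zip(syllables, tags):
        current.append(syllable)
        if tag in ("E", "S"):  # E and S both close a word
            words.append(current)
            current = []
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words


def viterbi_decode(scores: Sequence[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring legal tag sequence given per-syllable log-scores."""
    if not scores:
        return []
    neg_inf = float("-inf")
    best = [{t: (scores[0][t] if t in START_TAGS else neg_inf) for t in TAGS}]
    backpointers: List[Dict[str, str]] = []
    for step in scores[1:]:
        prev = best[-1]
        cur: Dict[str, float] = {}
        ptr: Dict[str, str] = {}
        for t in TAGS:
            # Only predecessors that may legally be followed by t are considered.
            candidates = [p for p in TAGS if t in ALLOWED[p]]
            p_best = max(candidates, key=lambda p: prev[p])
            cur[t] = prev[p_best] + step[t]
            ptr[t] = p_best
        best.append(cur)
        backpointers.append(ptr)
    last = max(END_TAGS, key=lambda t: best[-1][t])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Placeholder syllables stand in for real Lao syllables.
segmented = [["a", "b", "c"], ["d"], ["e", "f"]]
tags = words_to_tags(segmented)
restored = tags_to_words(["a", "b", "c", "d", "e", "f"], tags)

# Taking the best tag per syllable would give B then S, which is illegal;
# the constrained search returns a legal sequence instead.
toy_scores = [
    {"B": -0.1, "M": -5.0, "E": -5.0, "S": -3.0},
    {"B": -4.0, "M": -4.0, "E": -0.5, "S": -0.2},
]
decoded = viterbi_decode(toy_scores)
```

The toy scores show why a separate inference step is useful: choosing the top-scoring tag for each syllable independently can produce a sequence such as B followed by S that does not correspond to any segmentation, while the constrained search only returns sequences that map back to whole words.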