
Computer Engineering & Science ›› 2021, Vol. 43 ›› Issue (12): 2238-2242.

• Artificial Intelligence and Data Mining •

  • Supported by:
    National Natural Science Foundation of China (61363044, 61462054)

Thai sentence segmentation based on Siamese recurrent neural network

XIAN Yan-tuan1,2, ZHANG Zhi-ju1,2, WANG Hong-bin1,2, WEN Yong-hua1,2

  (1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500;

   2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China)


  • Received: 2020-07-28 Revised: 2020-11-04 Accepted: 2021-12-25 Online: 2021-12-25 Published: 2021-12-31

Abstract: Thai rarely uses punctuation, and there are no explicit delimiters between sentences, so sentence boundaries must be determined from semantics. This creates additional difficulties for natural language processing tasks such as Thai lexical analysis, syntactic analysis, and machine translation. To address the Thai sentence segmentation problem, this paper proposes an automatic sentence segmentation method based on a Siamese recurrent neural network. Unlike traditional Thai sentence segmentation methods, the proposed method requires no manually defined features. Instead, it uses a single shared recurrent neural network to encode the word sequence before and the word sequence after each candidate sentence boundary separately. The two encoding vectors are then combined as features to build the Thai sentence segmentation model. Experimental results on the ORCHID (Orchid97) Thai corpus show that the proposed method outperforms traditional Thai sentence segmentation methods.

Key words: Thai language, sentence segmentation, recurrent neural network
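
The encoding scheme described in the abstract can be illustrated with a small sketch: one recurrent encoder, with a single set of weights, is applied to the words before and the words after a candidate boundary, and the two resulting vectors are concatenated and passed to a classifier. This is only a minimal illustration under stated assumptions, not the authors' implementation. The plain tanh recurrent cell, the logistic output layer, the dimensions, the random untrained weights, the left-to-right reading of both contexts, and all names (`SharedRNNEncoder`, `boundary_probability`) are hypothetical choices made here for brevity.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights (untrained)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class SharedRNNEncoder:
    """Plain tanh recurrent encoder (an assumed stand-in for the paper's RNN).

    The same instance, and therefore the same weights, encodes both the
    left and the right context, which is what makes the setup Siamese.
    """

    def __init__(self, vocab_size, embed_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        self.embed = make_matrix(vocab_size, embed_dim, rng)
        self.w_in = make_matrix(hidden_dim, embed_dim, rng)
        self.w_rec = make_matrix(hidden_dim, hidden_dim, rng)

    def encode(self, word_ids):
        """Read the word ids in order and return the final hidden state."""
        h = [0.0] * self.hidden_dim
        for wid in word_ids:
            x = self.embed[wid]
            h = [
                math.tanh(
                    sum(w * xi for w, xi in zip(self.w_in[j], x))
                    + sum(w * hk for w, hk in zip(self.w_rec[j], h))
                )
                for j in range(self.hidden_dim)
            ]
        return h


def boundary_probability(encoder, out_weights, left_ids, right_ids):
    """Score a candidate boundary from the two context encodings."""
    left_vec = encoder.encode(left_ids)    # words before the candidate
    right_vec = encoder.encode(right_ids)  # words after the candidate
    features = left_vec + right_vec        # concatenate the two vectors
    score = sum(w * f for w, f in zip(out_weights, features))
    return 1.0 / (1.0 + math.exp(-score))  # logistic output in (0, 1)


rng = random.Random(0)
encoder = SharedRNNEncoder(vocab_size=10, embed_dim=4, hidden_dim=3, rng=rng)
out_weights = [rng.uniform(-0.5, 0.5) for _ in range(2 * encoder.hidden_dim)]

# Hypothetical word ids on either side of one candidate boundary.
p = boundary_probability(encoder, out_weights, [1, 2, 3], [4, 5])
```

In a trained model, `p` would be compared with a threshold to decide whether the candidate position ends a sentence. Here the weights are random, so the value only demonstrates the data flow.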