• 中国计算机学会会刊
  • 中国科技核心期刊
  • 中文核心期刊

J4 ›› 2013, Vol. 35 ›› Issue (9): 127-131.

• 论文 • 上一篇    下一篇

基于双数组Trie树的中文分词词典算法优化研究

杨文川,刘健,于淼   

  1. (北京邮电大学计算机学院,北京 100876)
  • 收稿日期:2013-04-11 修回日期:2013-07-24 出版日期:2013-09-25 发布日期:2013-09-25
  • 基金资助:

    北大方正集团有限公司数字出版技术国家重点实验室开放课题资助项目(2012072011)

Research of an improved algorithm for Chinese word
segmentation dictionary based on double-array Trie-tree        

YANG Wenchuan,LIU Jian,YU Miao   

  1. (School of Computer Science,Beijing University of Post and Telecommunication,Beijing 100876,China)
  • Received:2013-04-11 Revised:2013-07-24 Online:2013-09-25 Published:2013-09-25

摘要:

基于双数组Trie树的中文分词词典具有较高的查找效率,但其插入时间复杂度较高。为此提出了一种基于双数组Trie树结构的改进算法iDAT,在原始词典初始化时优先处理分支多的节点,并在初始化之后对base数组中的空序列的下标值做Hash,Hash表中存放空序列之前的所有空序列个数之和,而后运用iDAT算法进行插入。本算法借鉴了单模式匹配的Sunday算法中的跳跃思想,在适当增加空间开销的基础上,降低了Trie树在动态插入过程中的平均时间复杂度,在实际操作过程中有着良好的性能。

关键词: 双数组, Trie树, 时间复杂度, 分词词典

Abstract:

Chinese word segmentation dictionary based on the doublearray Trietree has higher search efficiency, but the dynamic insertion consumes a lot of time. Therefore, an improved algorithm (iDAT) based on doublearray Trietree for Chinese word segmentation dictionary is proposed. The nodes with more branches are handled while the original dictionary is being initialized. After the initialization, a Hash process is performed on the index values of empty sequence in base array. The final Hash table stores the sum of the empty sequences before the current empty sequence. After that, the iDAT is used to carry out the dynamic insertion process. This algorithm adopts Sunday jumps algorithm of single pattern matching. With the reasonable increasement of space, it reduces the the average time complexity of the dynamic insertion process in Trietree. Practical results show it has good operation performance.

Key words: doublearray;Trietree;time complexity;word segmentation dictionary