• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (8): 1591-1598.

• Articles •

A patent literature term extraction method based on boundary tag sets

DING Jie1,L Xueqiang1,LIU Kehui2   

  • (1. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research,
    Beijing Information Science and Technology University, Beijing 100101;
    2. Beijing Research Center of Urban System Engineering, Beijing 100035, China)
  • Received: 2014-04-21 Revised: 2014-08-21 Online: 2015-08-25 Published: 2015-08-25

Abstract:

Most current term boundary detection methods measure how tightly adjacent strings are bound by choosing a suitable statistical measure and setting a threshold on it. These methods, however, perform poorly on long terms. To address the low recall of long terms in term extraction, we propose a patent literature term extraction method based on boundary tag sets, developed from a study of a large number of patent documents. We first introduce the concept of a boundary tag set and then construct boundary tag sets from the characteristics of term boundaries in patent literature. We also propose a new seed-term weighting approach for extracting seed terms. Patent document terminology is compared with the Chinese Daily corpus to build a term component library, which improves the termhood of the candidate terms. Finally, the candidate terms are filtered by boundary entropy. Experimental results show that the proposed method achieves a precision of 81.67%, a recall of 71.92%, and an F value of 0.765, outperforming the other methods compared in this paper.
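The abstract does not give the paper's exact definition of boundary entropy. As a rough illustration only, the sketch below uses the common left/right branching-entropy formulation: a candidate whose neighbouring characters vary widely on both sides is more likely to be a complete term. The function names, the `threshold` value, and the toy corpus are assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def boundary_entropy(candidate, corpus, side="right"):
    """Shannon entropy (in bits) of the characters adjacent to `candidate`.

    `side` selects the character immediately to the "left" or "right" of
    each occurrence. This is an assumed, generic formulation.
    """
    neighbours = Counter()
    start = corpus.find(candidate)
    while start != -1:
        idx = start - 1 if side == "left" else start + len(candidate)
        if 0 <= idx < len(corpus):
            neighbours[corpus[idx]] += 1
        # Step by one character so overlapping occurrences are counted too.
        start = corpus.find(candidate, start + 1)
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in neighbours.values())


def filter_by_boundary_entropy(candidates, corpus, threshold=1.0):
    """Keep candidates whose smaller boundary entropy reaches `threshold`.

    `threshold` is a hypothetical tuning parameter, not a value from the paper.
    """
    return [
        c for c in candidates
        if min(boundary_entropy(c, corpus, "left"),
               boundary_entropy(c, corpus, "right")) >= threshold
    ]


# Toy corpus: "ab" is always followed and preceded by different characters,
# while "a" is always followed by "b", so "a" looks like a fragment.
toy_corpus = "xabyzabwqabv"
kept = filter_by_boundary_entropy(["ab", "a"], toy_corpus)
```

In this toy run the fragment "a" is rejected because its right-hand neighbour is always "b" (entropy 0), while "ab" is kept because three different characters occur on each side.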

Key words: boundary tag set; seed term; term component library; boundary entropy