• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2013, Vol. 35 ›› Issue (4): 136-143.

• Articles •

Picturetext webpage model and page element feature induction   

YU Long, WANG Jinlong

  1. (PLA University of Science and Technology, Nanjing 210007, China)
  • Received: 2012-03-27 Revised: 2012-05-03 Online: 2013-04-25 Published: 2013-04-25

Abstract:

Targeting web page information extraction centered on picture-text content, this paper formally proposes a theoretical model for analyzing page elements. By defining basic elements and transformation rules, the picture-text page model uses a tree structure to represent the text and graphic features of the elements within a page. Because elements in the picture-text page model carry many features, the paper defines a classification similarity between elements and, on that basis, proposes a method for obtaining the best classification feature set and the recognition threshold, together with an algorithm that implements it. Experimental results show that the picture-text page model reduces the scale of page elements, and that the feature set induced at a small learning cost achieves satisfactory classification accuracy.

Key words: web extraction; web page element; picture-text model; feature induction