• Official journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2014, Vol. 36 ›› Issue (08): 1505-1511.

• Papers •

Research on extracting text from MS-DOC files          

HUANG Bugen1,FU Juan2   

  1. (1.Department of Computer Information and Cyber Security,Jiangsu Police Institute,Nanjing 210012;2.Huaian Municipal Public Security Bureau,Huaian 223005,China)
  • Received: 2012-12-11 Revised: 2013-03-05 Online: 2014-08-25 Published: 2014-08-25

Abstract:

Keyword search is widely used in intelligence analysis, search engines, and computer forensics. However, a keyword search over MS-DOC files sometimes misses matches that are actually present; such misses are called false negatives. A Microsoft compound document consists of a series of streams stored in sectors, and the streams and sectors are managed through the directory and the sector allocation table. In an MS-DOC file the text is stored in the WordDocument stream, where it is not necessarily contiguous, and the Table stream records the block information for that text. A keyword may therefore be split across sectors that are not adjacent. Even when the sectors are adjacent, one part of the keyword may be stored compressed while the other part is not. Both situations cause false negatives. To avoid them, the text is first extracted from the WordDocument stream according to the block information in the Table stream and converted to a uniform encoding; the keyword search is then carried out on the extracted text.

Key words: compound document; text extraction; keyword; search; computer forensics
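
The two-step approach described in the abstract (extract and normalize the text, then search) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Piece` record, `extract_text`, and `build_demo` are hypothetical names introduced here, and each piece is assumed to carry an already-resolved byte offset, whereas a real parser would first read the block information from the Table stream. Compressed runs are assumed to decode as cp1252 and uncompressed runs as UTF-16LE.

```python
from typing import List, NamedTuple


class Piece(NamedTuple):
    """One text run; a simplified, hypothetical stand-in for a block entry."""
    offset: int       # byte offset of the run inside the WordDocument stream
    char_count: int   # number of characters in the run
    compressed: bool  # True: 1 byte per char; False: 2 bytes per char


def extract_text(stream: bytes, pieces: List[Piece]) -> str:
    """Step 1: join the pieces in logical order, decoded to one uniform str."""
    parts = []
    for p in pieces:
        if p.compressed:
            raw = stream[p.offset:p.offset + p.char_count]
            # Assumption: 8-bit runs are treated as cp1252 in this sketch.
            parts.append(raw.decode("cp1252", errors="replace"))
        else:
            raw = stream[p.offset:p.offset + 2 * p.char_count]
            parts.append(raw.decode("utf-16-le", errors="replace"))
    return "".join(parts)


def build_demo():
    """Build a toy stream in which the word "forensics" is split in two."""
    first = "foren".encode("utf-16-le")  # uncompressed run, 10 bytes
    second = b"sics report"              # compressed run, 11 bytes
    padding = b"\x00" * 6                # unrelated bytes between the runs
    # Physical order differs from logical order: the second run comes first.
    stream = second + padding + first
    pieces = [
        Piece(len(second) + len(padding), 5, False),
        Piece(0, len(second), True),
    ]
    return stream, pieces


if __name__ == "__main__":
    stream, pieces = build_demo()
    keyword = "forensics"
    # A raw byte search misses the keyword in both encodings (false negative).
    print(keyword.encode("cp1252") in stream)
    print(keyword.encode("utf-16-le") in stream)
    # Step 2: searching the extracted, uniformly encoded text finds it.
    print(keyword in extract_text(stream, pieces))
```

In this toy stream the raw byte search fails for both encodings because the keyword is split across non-adjacent runs with different storage widths, while the search over the extracted text succeeds.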