Research on extracting text from MS-DOC files

J4 ›› 2014, Vol. 36 ›› Issue (08): 1505-1511.

• Article •

HUANG Bugen1,FU Juan2

(1. Department of Computer Information and Cyber Security, Jiangsu Police Institute, Nanjing 210012; 2. Huaian Municipal Public Security Bureau, Huaian 223005, China)

Received: 2012-12-11 Revised: 2013-03-05 Online: 2014-08-25 Published: 2014-08-25

Abstract

Keyword search is widely used in intelligence analysis, search engines, and computer forensics. However, a keyword search over MS-DOC files sometimes misses matches that are actually present; such misses are called false negatives. A Microsoft compound document consists of a series of streams stored in sectors, and the streams and sectors are managed through the directory and the sector allocation table. In an MS-DOC file the text is stored in the WordDocument stream, but not necessarily contiguously, and the Table stream records the block information for that text. A keyword may therefore be split across sectors that are not adjacent. Even when the sectors are adjacent, one part of the keyword may be stored compressed while the other part is not. Both situations cause false negatives. To avoid them, the text is first extracted from the WordDocument stream according to the block information in the Table stream and converted to a uniform encoding; the keyword search is then carried out on the extracted text.
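The failure mode and the two-step remedy described in the abstract can be illustrated with a toy model. The sketch below is not the authors' implementation and does not parse a real MS-DOC file: the `Piece` class, the `extract_text` and `raw_search` helpers, and the hand-built `stream` are all hypothetical, and it assumes that compressed text can be decoded as cp1252 (one byte per character) while uncompressed text is UTF-16LE.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Piece:
    """Hypothetical, simplified stand-in for one block entry from the Table stream."""
    offset: int        # byte offset of the block inside the WordDocument stream
    char_count: int    # number of characters in the block
    compressed: bool   # True: 1 byte per character; False: 2 bytes (UTF-16LE)


def extract_text(stream: bytes, pieces: List[Piece]) -> str:
    """Decode every block, in logical order, into one uniformly encoded string."""
    parts = []
    for p in pieces:
        if p.compressed:
            raw = stream[p.offset:p.offset + p.char_count]
            # Assumption: compressed text is treated as cp1252 in this sketch.
            parts.append(raw.decode("cp1252", errors="replace"))
        else:
            raw = stream[p.offset:p.offset + 2 * p.char_count]
            parts.append(raw.decode("utf-16-le", errors="replace"))
    return "".join(parts)


def raw_search(stream: bytes, keyword: str) -> bool:
    """Naive search over raw bytes, trying both encodings of the keyword."""
    return (keyword.encode("cp1252") in stream
            or keyword.encode("utf-16-le") in stream)


# Toy stream: the logical text is "the keyword found", but its second half is
# stored first (uncompressed) and its first half is stored last (compressed).
tail = "word found".encode("utf-16-le")   # 10 characters, 20 bytes
head = "the key".encode("cp1252")         # 7 characters, 7 bytes
stream = tail + b"\x00" * 4 + head        # head starts at byte offset 24

pieces = [
    Piece(offset=24, char_count=7, compressed=True),
    Piece(offset=0, char_count=10, compressed=False),
]

text = extract_text(stream, pieces)
print(raw_search(stream, "keyword"))   # raw byte search misses the keyword
print("keyword" in text)               # search on the extracted text finds it
```

In this toy case the keyword is both stored out of order and split across a compressed and an uncompressed block, so no single byte pattern for it exists in the raw stream; only the search over the reassembled, uniformly decoded text can match.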

Key words: compound document; text extraction; keyword; search; computer forensics

HUANG Bugen1,FU Juan2. Research on extracting text from MS-DOC files [J]. J4, 2014, 36(08): 1505-1511.
