J4 ›› 2014, Vol. 36 ›› Issue (08): 1505-1511.
• Articles •
HUANG Bugen1, FU Juan2
Abstract:
Keyword search is widely used in intelligence analysis, search engines, and computer forensics. However, searching for keywords in MS-DOC files sometimes misses genuine matches; such misses are called false negatives. A Microsoft compound document consists of a series of streams stored in sectors, and the streams and sectors are managed through the directory and the sector allocation table. In an MS-DOC file the text is stored in the WordDocument stream, but not necessarily contiguously, and the Table stream records the block information. A keyword may therefore be split across sectors that are not adjacent. Even when the sectors are adjacent, one part of the keyword may be stored compressed while the other part is not. Both situations cause false negatives. To avoid them, the text is first extracted from the WordDocument stream according to the block information in the Table stream and converted to a uniform encoding; the keyword search is then carried out on the extracted text.
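The two-step idea can be illustrated with a minimal Python sketch. It is not the authors' implementation and it does not parse a real compound document: the `Piece` tuple, the function names, and the toy stream are hypothetical, and the block information is assumed to be already available as a list of (offset, length, compressed) entries. It also assumes that compressed text is 8-bit (approximated here with cp1252) and uncompressed text is UTF-16LE.

```python
from typing import List, Tuple

# Hypothetical block descriptor: (byte offset, byte length, is_compressed).
# In a real file this information would come from the Table stream.
Piece = Tuple[int, int, bool]


def extract_text(stream: bytes, pieces: List[Piece]) -> str:
    """Step 1: decode each block in logical order into one uniform string."""
    parts = []
    for offset, length, compressed in pieces:
        raw = stream[offset:offset + length]
        # Assumption: compressed blocks are 8-bit, others are UTF-16LE.
        encoding = "cp1252" if compressed else "utf-16-le"
        parts.append(raw.decode(encoding, errors="replace"))
    return "".join(parts)


def naive_search(stream: bytes, keyword: str) -> bool:
    """Raw byte search that ignores the block information."""
    return (keyword.encode("cp1252") in stream
            or keyword.encode("utf-16-le") in stream)


# Toy stream: the word "forensics" is split into two blocks that are
# neither adjacent nor stored with the same encoding.
first = "for".encode("utf-16-le")    # 6 bytes, uncompressed
second = "ensics".encode("cp1252")   # 6 bytes, compressed
stream = second + b"\x00" * 4 + first
pieces = [(10, 6, False), (0, 6, True)]  # logical order: "for", then "ensics"

# Step 2: search the uniformly encoded text instead of the raw bytes.
text = extract_text(stream, pieces)
found_naive = naive_search(stream, "forensics")
found_after_extraction = "forensics" in text
```

In this toy case the raw byte search misses the keyword, while the search over the extracted text finds it, which is the false negative the abstract describes.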
Key words: compound document; text extraction; keyword; search; computer forensics
HUANG Bugen1, FU Juan2. Research on extracting text from MS-DOC files [J]. J4, 2014, 36(08): 1505-1511.
URL: http://joces.nudt.edu.cn/EN/
http://joces.nudt.edu.cn/EN/Y2014/V36/I08/1505