• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2014, Vol. 36 ›› Issue (08): 1505-1511.

• Papers •

Research on extracting text from MS-DOC files

HUANG Bugen1, FU Juan2

  1. (1. Department of Computer Information and Cyber Security, Jiangsu Police Institute, Nanjing 210012, Jiangsu, China; 2. Huaian Municipal Public Security Bureau, Huaian 223005, Jiangsu, China)
  • Received: 2012-12-11 Revised: 2013-03-05 Online: 2014-08-25 Published: 2014-08-25
  • Supported by:

    National Social Science Fund of China (13BTQ046); Public Security Technology, special fund for key discipline construction in Jiangsu higher education institutions during the "Twelfth Five-Year Plan" period

Abstract:

Keyword search is widely used in intelligence analysis, search engines, and computer forensics. However, a keyword search over an MS-DOC file can produce false negatives: a keyword that is present in the document is not found. A Microsoft compound document consists of a series of streams stored in units of sectors; the streams and their storage space are managed through the directory structure and the sector allocation table. The text of an MS-DOC file is stored in the WordDocument stream, not necessarily contiguously, and the Table stream records how the text is divided into blocks. A keyword may span non-adjacent sectors, and even within adjacent sectors one part of a keyword may be stored compressed while the other part is stored uncompressed. These are the causes of the false negatives. Extracting the text from the WordDocument stream according to the block information in the Table stream, converting it to a uniform encoding, and only then running the keyword search avoids these false negatives.

Key words: compound document; text extraction; keyword; search; computer forensics
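
The failure mode described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not parse a real compound document: the `Piece` tuple is a hypothetical, simplified stand-in for the block information in the Table stream, the byte string stands in for the WordDocument stream, and decoding compressed text as `cp1252` is an assumption made for this toy data only.

```python
from typing import List, NamedTuple


class Piece(NamedTuple):
    """Hypothetical, simplified stand-in for one block entry of the Table stream."""
    offset: int       # byte offset of the block within the text stream
    char_count: int   # number of characters in the block
    compressed: bool  # True: 1 byte per character; False: UTF-16LE (2 bytes)


def extract_text(stream: bytes, pieces: List[Piece]) -> str:
    """Concatenate the blocks in logical order, decoding each to a common form."""
    parts = []
    for p in pieces:
        if p.compressed:
            raw = stream[p.offset:p.offset + p.char_count]
            parts.append(raw.decode("cp1252"))  # assumed 8-bit encoding
        else:
            raw = stream[p.offset:p.offset + 2 * p.char_count]
            parts.append(raw.decode("utf-16-le"))
    return "".join(parts)


# Toy stream: "evid" stored as 8-bit text, "ence" stored as UTF-16LE further on,
# with unrelated bytes in between (non-contiguous, mixed storage).
first = "evid".encode("cp1252")
gap = b"\x00" * 8
second = "ence".encode("utf-16-le")
stream = first + gap + second
pieces = [Piece(0, 4, True), Piece(len(first) + len(gap), 4, False)]

keyword = "evidence"

# Naive search over the raw bytes, trying both encodings of the keyword.
raw_hit = (keyword.encode("cp1252") in stream
           or keyword.encode("utf-16-le") in stream)

# Search after extracting the blocks and unifying the encoding.
text = extract_text(stream, pieces)
extracted_hit = keyword in text

print(raw_hit, extracted_hit)
```

In this toy case the raw byte search misses the keyword in either encoding, while the search over the extracted, uniformly encoded text finds it. A real tool would obtain the offsets, lengths, and compression flags from the Table stream rather than from a hand-built list.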