J4 ›› 2008, Vol. 30 ›› Issue (2): 1-4.
• Papers •
Abstract:
Webinfomall is a Chinese web archive developed at Peking University since 2001. To date, it has accumulated about three billion Chinese web pages since early 2002 and is growing at a rate of one to two million pages a day. Providing an effective information mining system over Webinfomall is a basic challenge that we would like to take on. In this article, we describe a pilot effort toward that challenge. In particular, we introduce a system framework, HisTrace, which aims at efficient extraction of reports about historical events. Because of the sheer amount of data in Webinfomall and the noisy nature of web pages, many engineering issues must be addressed, and this report analyzes some of the major ones. Finally, we briefly describe the implementation status of HisTrace.
Key words: web archive, text mining, link analysis, replica detection, information compression
URL: http://joces.nudt.edu.cn/EN/
http://joces.nudt.edu.cn/EN/Y2008/V30/I2/1