Abstract

Webinfomall is a Chinese web archive that has been under development at Peking University since 2001. To date, it has accumulated about three billion Chinese web pages since early 2002 and is growing at a rate of one to two million pages a day. Providing an effective information mining system over Webinfomall is a basic challenge that we would like to take on. In this article, we describe a pilot effort toward that challenge. In particular, we introduce a system framework, HisTrace, which aims at efficient extraction of reports about historical events. Because of the sheer amount of data in Webinfomall and the noisy nature of web pages, many engineering issues must be addressed. This report analyzes some of the major ones. Finally, we briefly describe the implementation status of HisTrace.

Key words: web archive, text mining, link analysis, replica detection, information compression
