• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science


A multidimension document representation
approach based on social features

CHEN Gong¹, HUANG Ruizhang¹,², ZHONG Wenliang¹

  1. College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
  2. Guizhou Provincial Key Laboratory of Public Big Data, Guiyang 550025, China
  • Received: 2016-07-03  Revised: 2016-09-01  Online: 2016-11-25  Published: 2016-11-25

Abstract:

 

For any web document analysis approach, finding a good representation of web documents
plays a fundamental role and greatly affects analysis performance. We propose a
multidimension representation scheme for web documents. In addition to extracting features
directly from document contents, as traditional document representation approaches
normally do, we also represent web documents with deeper features that can be learned
internally from the documents and externally from web document contexts. We exploit three
representation dimensions: the superficial dimension, the latent dimension, and the social
dimension. Features of the superficial and latent dimensions are extracted and discovered
internally from document contents, while social dimension features are captured externally
from the interaction behavior between users and web documents. The proposed multidimension
representation scheme can be applied to document analysis models. We conduct extensive
experiments to evaluate its effectiveness in terms of document clustering performance. Two
common document clustering algorithms, multidimension k-means and multidimension
hierarchical agglomerative clustering, are investigated. The experiments verify that the
proposed multidimension document representation scheme is effective. Moreover, we report
interesting observations on cross-dimension features discovered from the experimental
results.
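
To make the idea of clustering over three representation dimensions concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not state how the dimensions are combined, so the weighted sum of per-dimension cosine distances, the average-link merging rule, the toy documents, and all names (`cosine_distance`, `multidim_distance`, `agglomerative`, `weights`) are assumptions made purely for illustration.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def multidim_distance(doc_a, doc_b, weights):
    """Weighted sum of per-dimension cosine distances (assumed combination rule)."""
    return sum(w * cosine_distance(doc_a[dim], doc_b[dim])
               for dim, w in weights.items())


def agglomerative(docs, weights, k):
    """Average-link agglomerative clustering: merge until k clusters remain."""
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > k:
        def linkage(pair):
            ca, cb = clusters[pair[0]], clusters[pair[1]]
            total = sum(multidim_distance(docs[i], docs[j], weights)
                        for i in ca for j in cb)
            return total / (len(ca) * len(cb))

        # combinations yields index pairs with a < b, so deleting b is safe.
        a, b = min(combinations(range(len(clusters)), 2), key=linkage)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical toy documents. Each dimension is a sparse feature vector:
#   superficial: term counts taken from the document content
#   latent:      topic proportions (assumed to come from some topic model)
#   social:      users who interacted with the document
docs = [
    {"superficial": {"football": 2, "goal": 1},
     "latent": {"sports": 0.9, "tech": 0.1},
     "social": {"user_a": 1, "user_b": 1}},
    {"superficial": {"match": 1, "goal": 2},
     "latent": {"sports": 0.8, "tech": 0.2},
     "social": {"user_a": 1}},
    {"superficial": {"python": 2, "code": 1},
     "latent": {"sports": 0.1, "tech": 0.9},
     "social": {"user_c": 1}},
    {"superficial": {"compiler": 1, "code": 2},
     "latent": {"sports": 0.2, "tech": 0.8},
     "social": {"user_c": 1, "user_d": 1}},
]
weights = {"superficial": 1 / 3, "latent": 1 / 3, "social": 1 / 3}

clusters = agglomerative(docs, weights, k=2)
print(clusters)
```

In this toy setup the two sports documents and the two programming documents end up in separate clusters. Changing `weights` shifts how much each dimension contributes, for example setting the social weight to zero reduces the sketch to a purely content-based representation.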

Key words: document representation, document clustering, social feature