• Journal of the China Computer Federation
• Chinese Science and Technology Core Journal
• Chinese Core Journal

Computer Engineering & Science


A multidimensional text representation method based on social features

陈功1,黄瑞章1,2,钟文良1   

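  (1. College of Computer Science and Technology, Guizhou University, Guiyang, Guizhou 550025;
   2. Guizhou Provincial Key Laboratory of Public Big Data, Guiyang, Guizhou 550025)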
  • Received: 2016-07-03 Revised: 2016-09-01 Online: 2016-11-25 Published: 2016-11-25
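  • Funding:
    National Natural Science Foundation of China (61462011, 61202089); Specialized Research
    Fund for the Doctoral Program of Higher Education (20125201120006); Guizhou University
    Talent Introduction Research Project (2011015); Guizhou University Graduate Innovation
    Fund (研理工2016052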

A multidimension document representation
approach based on social features

CHEN Gong1,HUANG Ruizhang1,2,ZHONG Wenliang1   

  (1. College of Computer Science and Technology, Guizhou University, Guiyang 550025;
   2. Guizhou Provincial Key Laboratory of Public Big Data, Guiyang 550025, China)
  • Received: 2016-07-03 Revised: 2016-09-01 Online: 2016-11-25 Published: 2016-11-25

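Abstract (translated from the Chinese):

As the foundation of all web text analysis, web text representation has a far-reaching
effect on analysis results. This paper proposes a multidimensional representation method
for web text. Traditional text representation methods generally extract features from the
text content, but a document's deeper features and external features can also be used to
represent it. This paper studies three kinds of features: superficial, latent, and social.
Superficial and latent features can be extracted and learned from the text content, while
social features can be obtained by analyzing the interaction behavior between users and
documents. The proposed method is easy to use and can be applied to a variety of text
analysis models. In the experiments, two commonly used text clustering algorithms, Kmeans
and hierarchical clustering, are improved and named multidimensional Kmeans (MDKM) and
multidimensional hierarchical agglomerative clustering (MDHAC). Extensive experiments
demonstrate the effectiveness of the method. In addition, the experiments that combine the
different kinds of features yield some deeper findings.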

Keywords: text representation, text clustering, social features

Abstract:

 

For all web document analysis approaches, finding a good representation of web documents
plays a fundamental role and greatly affects analysis performance. We propose a
multidimensional representation scheme for web documents. In addition to extracting
features directly from document contents, as traditional document representation approaches
normally do, we represent web documents with deeper features that can be learned internally
from the documents and externally from their web contexts. We exploit three representation
dimensions: the superficial, latent, and social dimensions. Features of the superficial and
latent dimensions are extracted and discovered internally from document contents, while
social dimension features are captured externally from the interaction behavior between
users and web documents. The proposed scheme is easy to use and can be applied to a variety
of document analysis models. We conduct extensive experiments to evaluate its effectiveness
in terms of document clustering performance, extending two common document clustering
algorithms into multidimensional k-means (MDKM) and multidimensional hierarchical
agglomerative clustering (MDHAC). The experiments verify that the proposed representation
scheme is effective. Moreover, we report interesting observations on cross-dimension
features discovered from the experimental results.

Keywords: document representation, document clustering, social features
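
The idea of clustering documents over several feature dimensions can be illustrated with a small sketch. The abstract does not say how MDKM combines the superficial, latent, and social dimensions, so everything below is an assumption made for illustration: the per-dimension cosine distance, the weighted-sum combination rule, the weight values, the toy feature vectors, and the helper names `cosine_distance`, `multi_distance`, and `multi_kmeans` are all hypothetical and should not be read as the authors' algorithm.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def multi_distance(doc_a, doc_b, weights):
    """Weighted sum of per-dimension cosine distances (assumed combination rule)."""
    return sum(w * cosine_distance(doc_a[dim], doc_b[dim])
               for dim, w in weights.items())


def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def multi_kmeans(docs, k, weights, max_iterations=20):
    """Plain k-means over documents that carry several feature dimensions.

    Centroids are seeded with the first k documents so the run is deterministic.
    """
    centroids = [{dim: list(doc[dim]) for dim in weights} for doc in docs[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: nearest centroid under the combined distance.
        new_labels = [
            min(range(k), key=lambda c: multi_distance(doc, centroids[c], weights))
            for doc in docs
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: recompute each centroid per dimension.
        for c in range(k):
            members = [doc for doc, label in zip(docs, labels) if label == c]
            if members:
                centroids[c] = {dim: mean_vector([m[dim] for m in members])
                                for dim in weights}
    return labels


# Toy documents (invented values): term counts, topic proportions,
# and counts of user interactions per user group.
docs = [
    {"superficial": [3, 0, 1], "latent": [0.9, 0.1], "social": [5, 0]},
    {"superficial": [0, 4, 1], "latent": [0.2, 0.8], "social": [0, 3]},
    {"superficial": [2, 1, 1], "latent": [0.8, 0.2], "social": [4, 1]},
    {"superficial": [1, 3, 0], "latent": [0.1, 0.9], "social": [1, 6]},
]
weights = {"superficial": 0.4, "latent": 0.3, "social": 0.3}  # assumed weights

labels = multi_kmeans(docs, k=2, weights=weights)
print(labels)  # [0, 1, 0, 1]
```

Passing a weight dictionary with a single key, for example `{"social": 1.0}`, clusters on that dimension alone, which is one simple way to compare the contribution of each dimension on the same data.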