A multidimension document representation approach based on social features

Computer Engineering & Science

CHEN Gong1, HUANG Ruizhang1,2, ZHONG Wenliang1

(1. College of Computer Science and Technology, Guizhou University, Guiyang 550025;

2. Guizhou Provincial Key Laboratory of Public Big Data, Guiyang 550025, China)

Received: 2016-07-03 Revised: 2016-09-01 Online: 2016-11-25 Published: 2016-11-25

Funding:
National Natural Science Foundation of China (61462011, 61202089); Specialized Research Fund for the Doctoral Program of Higher Education (20125201120006); Guizhou University Introduced Talent Research Project (2011015); Guizhou University Graduate Innovation Fund (研理工2016052)

Abstract

Document representation underlies all web document analysis and strongly affects the quality of its results. We propose a multidimension representation scheme for web documents. Traditional document representation approaches usually extract features directly from document contents; we additionally represent web documents with deeper features learned internally from the documents and with external features drawn from their context. We study three representation dimensions: the superficial, latent, and social dimensions. Superficial and latent features are extracted and learned from document contents, while social features are obtained by analyzing the interaction behavior between users and web documents. The proposed scheme is easy to use and can be applied to a variety of document analysis models. To evaluate it in terms of document clustering performance, we adapt two common clustering algorithms, k-means and hierarchical agglomerative clustering, into multidimension k-means (MDKM) and multidimension hierarchical agglomerative clustering (MDHAC). Extensive experiments show that the proposed multidimension representation scheme is effective. We also report several deeper observations from experiments that combine features across dimensions.

Key words: document representation, document clustering, social feature
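The abstract does not say how the three dimensions are combined inside MDKM or MDHAC, so the following is only a minimal sketch of one plausible reading: each document carries one feature vector per dimension, and document similarity is a weighted average of per-dimension cosine similarities. The names `Doc`, `multidim_similarity`, `assign_clusters`, the weighting rule, and the toy vectors are all assumptions made for illustration and are not taken from the paper.

```python
import math
from typing import Dict, List

# Hypothetical representation: dimension name -> feature vector.
# The paper's actual features, weights, and similarity measure are not
# given in the abstract; everything here is an illustrative assumption.
Doc = Dict[str, List[float]]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def multidim_similarity(a: Doc, b: Doc, weights: Dict[str, float]) -> float:
    """Weighted average of per-dimension cosine similarities (assumed rule)."""
    total = sum(weights.values())
    return sum(w * cosine(a[dim], b[dim]) for dim, w in weights.items()) / total


def assign_clusters(docs: List[Doc], centroids: List[Doc],
                    weights: Dict[str, float]) -> List[int]:
    """One k-means-style assignment step: most similar centroid per document."""
    return [
        max(range(len(centroids)),
            key=lambda k: multidim_similarity(d, centroids[k], weights))
        for d in docs
    ]


# Toy data: doc2 shares vocabulary with doc3 but topics and users with doc1.
doc1 = {"superficial": [1.0, 0.0, 0.0], "latent": [1.0, 0.0], "social": [1.0, 0.0]}
doc2 = {"superficial": [0.0, 1.0, 0.0], "latent": [1.0, 0.0], "social": [1.0, 0.0]}
doc3 = {"superficial": [0.0, 1.0, 0.0], "latent": [0.0, 1.0], "social": [0.0, 1.0]}

all_dims = {"superficial": 1.0, "latent": 1.0, "social": 1.0}
content_only = {"superficial": 1.0}

docs = [doc1, doc2, doc3]
centroids = [doc1, doc3]
labels_all = assign_clusters(docs, centroids, all_dims)
labels_content = assign_clusters(docs, centroids, content_only)
print(labels_all, labels_content)
```

In this toy setting, using surface features alone groups `doc2` with `doc3`, whereas adding the latent and social dimensions moves it to the cluster of `doc1`. That is the kind of effect a multidimension representation is meant to capture, though the paper's own formulation may differ.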

CHEN Gong, HUANG Ruizhang, ZHONG Wenliang. A multidimension document representation approach based on social features [J]. Computer Engineering & Science.
