• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2012, Vol. 34 ›› Issue (1): 103-107.

• Articles •

A Text Similarity Matrix Operation-Based Classification Algorithm for Large-Scale Unstructured Complaint Data

LI Qing1, CHEN Yang2, XIE Haoran1, MENG Shengguang3

  (1. Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR 999077;
    2. China Mobile Corporation Guangxi Co. Ltd., Nanning 530000;
    3. Faster Software Technology Co. Ltd., Zhuhai 519080, China)
  • Received: 2010-05-20 Revised: 2010-10-26 Online: 2012-01-25 Published: 2012-01-25

Abstract:

With the rapid development of the Internet and information technology, the volume of unstructured data is growing exponentially, and the popularity of Web 2.0 online communities has accelerated this growth further. How to manage and organize massive unstructured data effectively, so that end users can access information easily, has therefore become an urgent research topic. In this paper, we model the text of unstructured data and compare text similarity in order to study a classification algorithm for large-scale unstructured data, and we apply the algorithm to a complaint data classification system at China Mobile. After the system was deployed, the efficiency of complaint data processing improved markedly, which supports the effectiveness of the proposed classification algorithm and system architecture.

Key words: text similarity; unstructured data; complaint data classification system
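
The abstract does not spell out the algorithm, so the following is only a minimal sketch of what classification by a text similarity matrix could look like. It assumes bag-of-words term-frequency vectors, cosine similarity, and nearest-labeled-example assignment; none of these choices is confirmed by the paper. The names `tokenize`, `similarity_matrix`, and `classify`, and the sample complaints, are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization. Chinese complaint text would
    # need a word segmenter instead.
    return text.lower().split()


def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(queries, references):
    # Rows: texts to classify; columns: labeled reference texts.
    q_vecs = [Counter(tokenize(t)) for t in queries]
    r_vecs = [Counter(tokenize(t)) for t in references]
    return [[cosine(q, r) for r in r_vecs] for q in q_vecs]


def classify(queries, labeled):
    # labeled: list of (text, category) pairs (hypothetical training data).
    texts = [t for t, _ in labeled]
    labels = [c for _, c in labeled]
    matrix = similarity_matrix(queries, texts)
    result = []
    for row in matrix:
        # Assign the category of the most similar labeled text.
        best = max(range(len(row)), key=row.__getitem__)
        result.append(labels[best])
    return result


# Hypothetical sample data, not taken from the paper.
labeled = [
    ("network signal weak dropped calls", "network"),
    ("wrong charge on monthly bill", "billing"),
]
queries = ["bill shows wrong charge", "calls dropped weak signal"]
predicted = classify(queries, labeled)
```

Each row of the matrix holds one incoming complaint's similarity to every labeled complaint, so classification reduces to a row-wise maximum. A production system would likely need term weighting and a similarity threshold for complaints that match no known category.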