• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2012, Vol. 34 ›› Issue (9): 128-134.

• Article

Identical Name Judgment Based on Web Social Network Search
(面向Web社会网络搜索的人名同一性判断)

ZHANG Xiaofang (张晓芳), LI Guohui (李国徽), PANG Yongjie (庞永杰)

  1. (School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, Hubei, China)
  • Received: 2011-07-25 Revised: 2011-10-12 Online: 2012-09-25 Published: 2012-09-25
  • Funding: National Natural Science Foundation of China (60873030, 61173049)

Abstract:

As people's activity on the Internet grows richer, online social behavior and relationships increasingly resemble traditional real-world social networks and can reflect the actual relationships between people. A real-world social network can therefore be constructed from the Internet by means of search. Social network search has become an active research topic, and one of its key techniques is name identity judgment: when several people share the same name, deciding for each Web page which of them it refers to. To extract accurate features from text and reduce the vector dimension, this paper presents a feature-vector weighting method based on C-value and inverse document frequency (IDF), and implements a similarity computation based on the cosine of the angle between vectors. After examining the hierarchical and partitioning approaches to text clustering, it proposes an improved hierarchical clustering algorithm for name identity judgment. To reduce the time complexity of clustering, a new method for computing cluster centroids is also presented. Tests on search-engine results for person-name queries show that the improved hierarchical clustering algorithm effectively improves the performance of name identity judgment.

Key words: social network; vector space model; identity judgment; hierarchical clustering
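
The abstract names three steps: term weighting, cosine similarity, and centroid-based hierarchical clustering. The following is a minimal Python sketch of how such steps might fit together, not the authors' implementation. Several things here are assumptions: raw term frequency stands in for the C-value score, whose combination with IDF the abstract does not specify; merging stops at a similarity `threshold`; and the function names `weight_vectors`, `cosine`, `centroid`, and `cluster_pages` are hypothetical.

```python
import math
from collections import Counter


def weight_vectors(pages):
    """Turn tokenized pages into sparse {term: weight} vectors.

    Assumption: raw term frequency stands in for the paper's C-value score.
    """
    n = len(pages)
    doc_freq = Counter(term for page in pages for term in set(page))
    vectors = []
    for page in pages:
        tf = Counter(page)
        # log(n / df) + 1 keeps terms found on every page at a small nonzero weight.
        vectors.append({t: tf[t] * (math.log(n / doc_freq[t]) + 1.0) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def cluster_pages(vectors, threshold):
    """Agglomerative clustering: merge the two most similar centroids
    until the best similarity falls below `threshold` (an assumed stop rule)."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        cents = [centroid([vectors[i] for i in c]) for c in clusters]
        best, a, b = max(
            (cosine(cents[i], cents[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    return sorted(sorted(c) for c in clusters)


# Hypothetical input: four pages returned for the same personal name.
pages = [
    ["zhang", "database", "query", "index"],
    ["zhang", "database", "index", "transaction"],
    ["zhang", "painter", "gallery", "oil"],
    ["zhang", "gallery", "painter", "exhibition"],
]
groups = cluster_pages(weight_vectors(pages), threshold=0.3)
```

Each resulting group holds the indices of pages judged to refer to the same person. This sketch recomputes every centroid on each pass for clarity; it does not attempt the faster centroid computation the abstract mentions.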