• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2006, Vol. 28 ›› Issue (12): 74-76.

• Papers •

A Cluster Analysis Algorithm with Randomized Statistical Testing and Its Network Implementation

张文军 张润杰 古德祥   

  • Online:2006-12-01 Published:2010-05-20

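Abstract (translated from the Chinese):

Cluster analysis is one of the most widely used mathematical methods, yet it is also regarded as mathematically non-rigorous, mainly because neither the clustering process nor its results are held to a statistical standard. This paper develops a cluster analysis algorithm with randomized statistical testing, which clusters a set of samples and marks the statistical significance of the result. The algorithm consists of three parts: calculation of distance measures, randomization testing, and hierarchical clustering. It offers 14 distance measures, three hierarchical clustering methods, and the option of weighting the indices or not. The distance between two samples is defined as 1 minus the P value of the randomization test. If the distance between two classes meets the P-value criterion, merging them into one class is statistically significant and acceptable; otherwise it is not significant and not acceptable. The algorithm has two distinguishing features. First, differences are tested for significance by randomization, so that many kinds of distance measure can be tested rigorously; randomization tests require no statistical preconditions or assumptions and therefore suit a wide range of statistical problems. Second, the randomization method used for significance testing requires the randomized values to be positive integers, which is too narrow a range of application; synchronous shifting and translation of the values extend it to the real numbers. The algorithm is implemented for the network in Java and comprises six classes and one HTML file, so it can be shared over the network through a variety of Java-compatible browsers. Using survey data on invertebrate diversity in rice fields, the paper carries out a comparative analysis of the algorithm and discusses some principles for choosing a distance measure and directions for further research.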

Keywords: cluster analysis, randomized statistical testing, distance measure, algorithm, network implementation

Abstract:

A problem with cluster analysis algorithms is that their results are not statistically tested. This paper develops a cluster analysis algorithm with randomized statistical testing. It consists of three parts: calculation of distance measures, randomized testing, and hierarchical clustering. In this algorithm the between-sample distance is defined as 1 - p_test, where the p_test value is calculated from the randomization procedure for the two samples. If the between-class distance meets the p_test criterion, it is statistically reasonable to combine the two classes into one class. Fourteen distance measures and three methods of hierarchical clustering are provided. The algorithm is implemented as a network program in Java, comprising 6 Java classes and an HTML file. The program can run in Java-enabled Web browsers. The algorithm is tested on survey data of rice invertebrate diversity. The criteria for choosing distance measures and the prospects for improving the algorithm are discussed.

Key words: cluster analysis, randomized statistical testing, distance measure, algorithm, network implementation
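
The abstract defines the between-sample distance as 1 - p_test but does not spell out the randomization procedure, so the following Java fragment is only a minimal sketch under stated assumptions. It assumes a pooled-shuffle permutation test with Euclidean distance as the underlying measure: the values of both samples are pooled, randomly redistributed into two new samples, and p_test is estimated as the share of shuffles whose distance is at least as large as the observed one. The class name `RandomizationDistance`, its methods, and the `rounds` and `seed` parameters are hypothetical and are not taken from the paper's six Java classes.

```java
import java.util.Random;

// Hypothetical illustration; not one of the six classes described in the paper.
class RandomizationDistance {

    /** Euclidean distance between two equal-length vectors (assumed measure). */
    static double euclidean(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Estimates p_test by a pooled-shuffle permutation test (an assumption):
     * the share of random reallocations whose distance is at least as large
     * as the observed distance between a and b.
     */
    static double pValue(double[] a, double[] b, int rounds, long seed) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("samples must have equal length");
        }
        int n = a.length;
        double observed = euclidean(a, b);

        // Pool the values of both samples.
        double[] pool = new double[2 * n];
        System.arraycopy(a, 0, pool, 0, n);
        System.arraycopy(b, 0, pool, n, n);

        Random rng = new Random(seed);
        double[] x = new double[n];
        double[] y = new double[n];
        int atLeastAsLarge = 0;

        for (int r = 0; r < rounds; r++) {
            // Fisher-Yates shuffle of the pooled values.
            for (int i = pool.length - 1; i > 0; i--) {
                int j = rng.nextInt(i + 1);
                double tmp = pool[i];
                pool[i] = pool[j];
                pool[j] = tmp;
            }
            // Split the shuffled pool into two pseudo-samples.
            System.arraycopy(pool, 0, x, 0, n);
            System.arraycopy(pool, n, y, 0, n);
            if (euclidean(x, y) >= observed) {
                atLeastAsLarge++;
            }
        }
        // Count the observed arrangement itself so that p is never zero.
        return (atLeastAsLarge + 1.0) / (rounds + 1.0);
    }

    /** Between-sample distance as defined in the abstract: 1 - p_test. */
    static double distance(double[] a, double[] b, int rounds, long seed) {
        return 1.0 - pValue(a, b, rounds, seed);
    }
}
```

Under these assumptions two identical samples get distance 0, because every shuffle is at least as far apart as the observed pair, while clearly separated samples get a distance close to 1. A hierarchical clustering step could then refuse any merge whose p_test falls below a chosen criterion such as 0.05; that threshold is likewise an illustrative choice, not a value stated in the abstract.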