• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2007, Vol. 29 ›› Issue (3): 54-56.

• Articles •

A Double k-Means Clustering Algorithm for Gene Sequences Based on Hidden Markov Models

Wu Junhao (吴君浩)[1], Luo Jiawei (骆嘉伟)[1], Wang Yan (王艳)[1], Yang Tao (杨涛)[1], Yang Xu (杨旭)[2]

  • Publication date: 2007-03-01  Release date: 2010-05-30

Abstract:

This paper proposes a double k-means clustering algorithm based on hidden Markov models (HMMs) and uses it to model and cluster gene sequence data. The algorithm first exploits a biological property, namely that the nucleotide ratios of homologous gene sequences tend to be consistent, to perform an initial k-means clustering of the sequences. It then uses the result of this first clustering to train HMMs that characterize the sequences, and finally clusters the data again with a model-based k-means method. Experimental results show that the algorithm is feasible and achieves fairly good clustering quality.

Key words: hidden Markov model (HMM); gene sequences; modeling; k-means clustering
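
The three stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: to stay short, a first-order Markov chain with add-one smoothing stands in for a trained HMM, the k-means seeding rule (farthest-point) is an assumption, and all function names (`composition`, `kmeans`, `train_chain`, `double_kmeans`) are hypothetical. Sequences are assumed to be non-empty uppercase strings over A, C, G, T.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def composition(seq):
    """Return the fractions of A, C, G and T in a sequence."""
    counts = Counter(seq)
    return [counts[b] / len(seq) for b in ALPHABET]

def kmeans(points, k, iters=20):
    """Plain Lloyd k-means with deterministic farthest-point seeding (an assumption)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels

def train_chain(seqs):
    """Fit log transition probabilities of a first-order Markov chain.

    This is a simplified stand-in for HMM training, not an HMM.
    """
    counts = {a: {b: 1.0 for b in ALPHABET} for a in ALPHABET}  # add-one smoothing
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    return {a: {b: math.log(counts[a][b] / sum(counts[a].values()))
                for b in ALPHABET} for a in ALPHABET}

def loglik(model, seq):
    """Log-likelihood of the transitions in seq under a chain model."""
    return sum(model[a][b] for a, b in zip(seq, seq[1:]))

def double_kmeans(seqs, k, iters=10):
    """Pass 1: k-means on nucleotide ratios. Pass 2: model-based k-means."""
    labels = kmeans([composition(s) for s in seqs], k)
    for _ in range(iters):
        # Train one model per cluster from the current assignment.
        models = [train_chain([s for s, l in zip(seqs, labels) if l == j])
                  for j in range(k)]
        # Reassign each sequence to the model that explains it best.
        new_labels = [max(range(k), key=lambda j: loglik(models[j], s))
                      for s in seqs]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

seqs = ["AAAAATAAAA", "AAAATAAAAA", "GCGCGCGCGC", "CGCGCGGCGC"]
labels = double_kmeans(seqs, k=2)
print(labels)
```

On this toy input the two AT-rich sequences and the two GC-rich sequences end up in separate clusters. Replacing `train_chain` and `loglik` with Baum-Welch training and forward-algorithm scoring would bring the sketch closer to the HMM-based method the abstract describes.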