• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2013, Vol. 35 ›› Issue (11): 27-33.

• Article

A clustering-based algorithm for large-scale haplotype phasing

潘玮华,陈波,徐云   

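  1. (1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, Anhui; 2. Anhui Province Key Laboratory of High Performance Computing, Hefei 230027, Anhui)
  • Received: 2013-08-12  Revised: 2013-10-11  Online: 2013-11-25  Published: 2013-11-25
  • Funding:

    General Program of the National Natural Science Foundation of China (60970085); National Natural Science Foundation of China (61033009)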

Clusteringbased largescale haplotype phasing algorithm 

PAN Weihua,CHEN Bo,XU Yun   

  1. (1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027;
    2. Anhui Province Key Laboratory of High Performance Computing, Hefei 230027, China)
  • Received: 2013-08-12  Revised: 2013-10-11  Online: 2013-11-25  Published: 2013-11-25

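Abstract (translated from the Chinese):

Large-scale haplotype phasing is an important fundamental problem in genetic analysis. To address the shortcomings of existing algorithms on large-scale haplotype phasing, this paper introduces clustering into the original WinHAP algorithm and proposes a clustering-based WinHAP algorithm. Without reducing the accuracy of the original algorithm, the new algorithm greatly increases computing speed and lowers memory consumption, and its memory requirement is independent of the number of sequences, which makes it particularly suitable for very large datasets. The algorithm is parallelized under the SIMD shared-memory model with a greedy strategy for assigning tasks to threads, and it achieves near-linear speedup.

Keywords: haplotype phasing, clustering, large-scale computing, parallel computing, bioinformatics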

Abstract:

Largescale haplotype phasing is an important fundamental problem in genetic analysis. To overcome the weakness of existing algorithms, we introduce the concept of clustering into original WinHAP algorithm and propose the Clutering based WinHAP algorithm. This algorithm improves original WinHAP in computing speed and memory without decreasing the precision, and its memory has nothing to do with the number of sequences. Thus, it is suited to very large datasets. The algorithm is parallelized under SIMD shared memory model and greedy task designing strategy is devised. The experiment reveals a nearlinear speedup with respect to the sequential algorithm.

Key words: haplotype phasing; clustering; large-scale computing; parallel computing; bioinformatics
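
The abstract names a greedy strategy for assigning tasks to threads but does not describe it. As a rough illustration of what such a strategy can look like, the sketch below uses the standard longest-processing-time-first heuristic: tasks are visited from most to least expensive, and each goes to the thread with the smallest load so far. This is an assumption made for illustration, not the authors' method; the names `greedy_assign` and `ideal_speedup` and the per-task cost estimates are hypothetical.

```python
import heapq


def greedy_assign(task_costs, num_threads):
    """Assign tasks to threads with the longest-processing-time-first heuristic.

    task_costs: estimated cost of each task (hypothetical input, e.g. one
    cost per window or cluster). Returns (assignment, loads), where
    assignment[t] lists the task indices given to thread t.
    """
    # Min-heap of (current load, thread id): the least-loaded thread is on top.
    heap = [(0, t) for t in range(num_threads)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_threads)]

    # Visit tasks from most to least expensive.
    order = sorted(range(len(task_costs)), key=lambda i: task_costs[i], reverse=True)
    for i in order:
        load, t = heapq.heappop(heap)
        assignment[t].append(i)
        heapq.heappush(heap, (load + task_costs[i], t))

    loads = [sum(task_costs[i] for i in tasks) for tasks in assignment]
    return assignment, loads


def ideal_speedup(task_costs, loads):
    """Upper bound on speedup for this assignment: total work / busiest thread."""
    return sum(task_costs) / max(loads)


if __name__ == "__main__":
    costs = [7, 5, 4, 3, 3, 2]
    assignment, loads = greedy_assign(costs, num_threads=3)
    print(assignment, loads, ideal_speedup(costs, loads))
```

The closer the busiest thread's load is to the average load, the closer this bound gets to the thread count, which is the sense in which a balanced assignment supports near-linear speedup. The bound ignores synchronization and memory-bandwidth effects.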