• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)

• High Performance Computing

一种基于低频种子的三代测序序列比对方法

宋思怡1,2,程昊宇1,2,徐云1,2   

  1. (1.中国科学技术大学计算机科学与技术学院,安徽 合肥 230027;2.安徽省高性能计算重点实验室,安徽 合肥 230027)
  • 收稿日期:2018-11-27 修回日期:2019-02-27 出版日期:2019-09-25 发布日期:2019-09-25
  • Funding:

    National Natural Science Foundation of China (61672480)

A novel sequence alignment method for third-generation sequencing based on low-frequency seeds

SONG Si-yi1,2, CHENG Hao-yu1,2, XU Yun1,2

  1. (1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027;
    2. Key Laboratory of High Performance Computing of Anhui Province, Hefei 230027, China)

  • Received: 2018-11-27 Revised: 2019-02-27 Online: 2019-09-25 Published: 2019-09-25

摘要:

随着测序技术的发展,三代测序已广泛应用于基因研究中。但是,由于三代测序序列具有平均长度长、错误率高的特性,如何快速、准确地将测序片段比对到参考基因组上成为严峻挑战。现有方法使用种子(从测序片段中挑选的短序列)来加速比对过程,但在挑选时未考虑频率特性,导致定位候选区域阶段时间消耗较大。因此,提出了一种基于低频种子的三代测序序列比对方法,该方法采用种子投票策略,使用低频种子进行投票,减少投票计数的时间消耗,并根据位置及票数关系对候选区域进行再过滤,进一步提高比对速度。实验结果表明,在确保敏感性和准确率的同时,本文方法比现有方法快3倍左右。

关键词: 三代测序, 单分子实时测序, 序列比对, 种子-扩展法

Abstract:

With the development of sequencing technology, third-generation sequencing has been widely used in genetic research. However, because third-generation reads are long on average and have a high error rate, aligning them to a reference genome quickly and accurately is a serious challenge. Existing methods use seeds (short subsequences selected from the reads) to speed up alignment, but they select seeds without considering seed frequency, which makes the stage that locates candidate regions time-consuming. We therefore propose a sequence alignment method for third-generation sequencing based on low-frequency seeds. The method adopts a seed-voting strategy in which only low-frequency seeds vote, reducing the time spent counting votes. It then re-filters the candidate regions according to the relationship between their positions and vote counts, further increasing alignment speed. Experimental results show that the method is about 3 times faster than existing methods while maintaining sensitivity and accuracy.

Key words: third-generation sequencing, single molecule real-time sequencing, sequence alignment, seed-and-extend method
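
The seed-voting idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`build_index`, `vote_candidates`), the parameters (`k`, `max_freq`, `bin_size`, `min_votes`), and the use of fixed-width bins are all assumptions made for illustration, and a plain vote threshold stands in for the position-and-vote re-filtering rule, which the abstract does not specify.

```python
from collections import Counter, defaultdict


def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def vote_candidates(read, index, k, max_freq, bin_size, min_votes):
    """Return (region_start, votes) pairs, best first, using low-frequency seeds only.

    Hypothetical parameters: max_freq is the largest number of reference
    occurrences a seed may have and still vote; bin_size is the width of a
    candidate region; min_votes is a simple stand-in for re-filtering.
    """
    votes = Counter()
    for offset in range(len(read) - k + 1):
        hits = index.get(read[offset:offset + k], ())
        # Skip seeds that are absent or occur too often in the reference:
        # high-frequency seeds cost many vote updates and locate little.
        if not hits or len(hits) > max_freq:
            continue
        for pos in hits:
            # Each hit votes for the reference position where the read would start.
            read_start = pos - offset
            votes[read_start // bin_size] += 1
    candidates = [(b * bin_size, v) for b, v in votes.items() if v >= min_votes]
    return sorted(candidates, key=lambda c: -c[1])


# Example: a unique segment flanked by a short tandem repeat.
reference = "ACGTACGTACGT" * 5 + "TTGACCAGGTCAATCGGATCCTAGCAAGTCCGATTGCA" + "ACGTACGTACGT" * 5
index = build_index(reference, 8)
candidates = vote_candidates(reference[60:98], index, k=8,
                             max_freq=2, bin_size=16, min_votes=5)
```

In this toy setting, seeds drawn from the repeated flanks exceed `max_freq` and cast no votes, so the vote-counting loop only touches seeds that can actually narrow down the location. A real long-read aligner would additionally need to tolerate the high error rate (for example through spaced or shorter seeds) and to extend each candidate region into a full alignment.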