• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2011, Vol. 33 ›› Issue (2): 153-158.doi: 10.3969/j.issn.1007130X.2011.


一种基于多阶畸变子模式相似性学习的序列识

邱德红,方少红,孙 蕾   

  1. (华中科技大学软件学院,湖北 武汉 430074)
  • 收稿日期:2010-03-30 修回日期:2010-06-28 出版日期:2011-02-25 发布日期:2011-02-25
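  • Corresponding author: QIU Dehong
  • Author biographies: QIU Dehong (born 1971), male, from Yongzhou, Hunan, Ph.D., associate professor; research interests: machine learning, software engineering, and digital media technology. FANG Shaohong (born 1968), female, from Jingzhou, Hubei, Ph.D., associate professor; research interests: image processing, software engineering, and digital media technology. SUN Lei (born 1985), female, from Wuhan, Hubei, master's student; research interests: machine learning and digital media technology.
  • Supported by:

    National Natural Science Foundation of China (60873031)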

An Approach to Sequence Recognition Based on the Similarity Learning of Multi-Degree Distortion Subsequence

QIU Dehong,FANG Shaohong,SUN Lei   

  1. (School of Software Engineering, Huazhong University of Science and Technology, Wuhan 430074, China)
  • Received: 2010-03-30 Revised: 2010-06-28 Online: 2011-02-25 Published: 2011-02-25

摘要:

序列识别研究对于诸多应用研究领域有重要的意义。在序列识别中,由于多种因素的影响,同一类别标记的序列往往不具有严格的相似性。变化序列相似性描述的尺度对序列的相似性进行描述有利于获得更准确的序列相似性描述结果,为此提出了基于多阶畸变序列子模式的序列识别方法。通过定义序列多阶畸变子模式特征空间及其核变换函数,设计线性开销算法有效实现了序列畸变子模式高维特征向量的计算,进而利用半定规划对多阶畸变序列子模式的核变换矩阵进行优化。基于多阶畸变子模式相似性描述优化结果,支持向量机生成的识别方法比较好地适应了序列之间的不同程度的相似性畸变,而且具有柔性边界特征。本方法在蛋白质基准数据SCOP 1.37 PDB90上进行了实验,普遍提高了该数据集上33个不同家族蛋白质序列的识别结果。

关键词: 畸变子模式, 相似性学习, 半定规划, 序列识别

Abstract:

Sequence recognition matters to many applied research fields. Because of many influencing factors, sequences that share a label are often not strictly similar. Measuring sequence similarity at several scales helps to obtain more accurate similarity measures, so this paper proposes a sequence recognition method based on multi-degree distorted subsequences. A feature space spanned by distorted subsequences of multiple degrees is defined together with its kernel function, and a linear-cost algorithm is designed to compute the resulting high-dimensional feature vectors efficiently. The kernel matrices for the different degrees of distortion are then combined and optimized through semidefinite programming (SDP). Combining the optimized kernel with a support vector machine (SVM) yields a classifier with a soft boundary that adapts to the different degrees of similarity distortion between sequences. Experiments on the SCOP 1.37 PDB90 protein benchmark show that the classifier generally improves recognition results for protein sequences in the benchmark's 33 families.

Key words: distorted subsequence; similarity learning; semidefinite programming; sequence recognition
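
The abstract does not give the paper's definitions, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes that a "distorted subsequence" of degree m is a length-k substring matched with at most m substitutions (in the spirit of a mismatch string kernel), builds one normalized kernel matrix per degree, and combines them with hand-picked weights. Those weights are a placeholder for the combination that the paper learns by SDP, and the brute-force feature enumeration here is not the linear-cost algorithm the paper describes. All function names and parameters are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations, product


def neighbors(kmer, m, alphabet):
    """Yield each string within Hamming distance m of kmer exactly once."""
    for d in range(m + 1):
        for positions in combinations(range(len(kmer)), d):
            # At every chosen position, substitute a *different* symbol.
            choices = [[c for c in alphabet if c != kmer[p]] for p in positions]
            for subs in product(*choices):
                chars = list(kmer)
                for p, c in zip(positions, subs):
                    chars[p] = c
                yield "".join(chars)


def features(seq, k, m, alphabet):
    """Sparse feature vector: for each k-mer g, the number of length-k
    substrings of seq that match g with at most m substitutions."""
    vec = Counter()
    for i in range(len(seq) - k + 1):
        vec.update(neighbors(seq[i:i + k], m, alphabet))
    return vec


def dot(f, g):
    """Inner product of two sparse feature vectors."""
    return sum(v * g[key] for key, v in f.items())


def combined_kernel_matrix(seqs, k, weights, alphabet):
    """Weighted sum of cosine-normalized kernels, one per distortion degree.

    weights maps a degree m to a non-negative weight. These are assumed,
    hand-picked values standing in for the SDP-learned combination.
    Every sequence must have length >= k so that its norm is non-zero.
    """
    n = len(seqs)
    total = [[0.0] * n for _ in range(n)]
    for m, w in weights.items():
        feats = [features(s, k, m, alphabet) for s in seqs]
        norms = [math.sqrt(dot(f, f)) for f in feats]
        for i in range(n):
            for j in range(n):
                total[i][j] += w * dot(feats[i], feats[j]) / (norms[i] * norms[j])
    return total


if __name__ == "__main__":
    amino_acids = "ACDEFGHIKLMNPQRSTVWY"
    toy_sequences = ["MKVLAAGIV", "MKVLSAGIV", "GGHHEEDDK"]  # made-up strings
    K = combined_kernel_matrix(toy_sequences, 3, {0: 0.5, 1: 0.5}, amino_acids)
    for row in K:
        print(["%.3f" % value for value in row])
```

Because each per-degree kernel is an explicit inner product, any non-negative weighting of them is again a valid kernel, so the resulting matrix could be handed to an SVM implementation that accepts a precomputed kernel.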