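• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• High Performance Computing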

Essential protein prediction based on complex participation degree and density

毛伊敏,刘银萍   

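  1. (School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000)
     
  • Received: 2018-12-21 Revised: 2019-05-02 Published: 2019-10-25 Online: 2019-10-25
  • Funding:

    National Natural Science Foundation of China (41562019); Natural Science Foundation of Jiangxi Province (GJJ161566)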

An essential protein prediction algorithm based on participation degree in protein complexes and density

MAO Yi-min, LIU Yin-ping

  1. (School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China)
  • Received: 2018-12-21 Revised: 2019-05-02 Online: 2019-10-25 Published: 2019-10-25

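Abstract:

Essential protein identification on protein-protein interaction (PPI) networks suffers from low accuracy and specificity for three reasons: existing methods consider only topological properties, PPI data contain a high proportion of false positives, and complex-based identification algorithms do not fully account for how node neighborhood information and complex mining affect the identification of essential proteins. To address these problems, PEC, an essential protein prediction algorithm based on complex participation degree and density, is proposed. First, GO annotation information and the edge clustering coefficient are fused to construct a weighted PPI network, which reduces the effect of false positives on the experimental results. A similarity matrix is built from the edge weights of the protein interactions; the maximum eigengap between eigenvalues is used to determine the number of partitions K automatically, and K initial cluster centers are selected according to the degrees of the protein nodes in the weighted network. Spectral clustering is then combined with the fuzzy C-means clustering algorithm to mine protein complexes, which improves clustering accuracy and reduces the dimensionality of the data. Second, an essentiality score for each node is designed from the protein's complex participation degree and the density of its neighborhood subgraph. On the DIP and Krogan datasets, PEC is compared with 10 classic algorithms: DC, BC, CC, SC, IC, PeC, WDC, LIDC, LBCC and UC. The experimental results show that PEC identifies more essential proteins and that its clustering results have higher accuracy and specificity.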
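
The network-weighting step relies on the edge clustering coefficient. The abstract does not give the formula, so the sketch below assumes a definition commonly used in the PPI literature: the number of triangles that contain an edge, divided by the largest number of triangles that edge could belong to, min(d_u - 1, d_v - 1). The function name and the input format are illustrative only, and the fusion with GO annotation similarity is not shown because the abstract does not say how the two are combined.

```python
from collections import defaultdict


def edge_clustering_coefficients(edges):
    """Return {frozenset({u, v}): ECC} for an undirected edge list.

    Assumed definition (not confirmed by the abstract):
    ECC(u, v) = triangles through (u, v) / min(deg(u) - 1, deg(v) - 1).
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-interactions
            neighbors[u].add(v)
            neighbors[v].add(u)

    ecc = {}
    for u, v in edges:
        if u == v:
            continue
        # Each common neighbor closes one triangle over the edge (u, v).
        triangles = len(neighbors[u] & neighbors[v])
        max_triangles = min(len(neighbors[u]) - 1, len(neighbors[v]) - 1)
        # An edge ending in a degree-1 node cannot lie in any triangle.
        ecc[frozenset((u, v))] = (
            triangles / max_triangles if max_triangles > 0 else 0.0
        )
    return ecc


# Toy network: a triangle A-B-C with a pendant protein D attached to C.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
toy_ecc = edge_clustering_coefficients(toy_edges)
```

On this toy network, edges inside the triangle get a high coefficient and the pendant edge C-D gets zero, which is the behavior that lets such a weight down-rank interactions that are likely false positives.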
 
 

Keywords: protein-protein interaction network, spectral clustering algorithm, protein complex, density, essential protein

Abstract:

Methods for identifying essential proteins in protein-protein interaction (PPI) networks tend to consider only the topological characteristics of nodes, PPI data contain a high proportion of false positives, and identification algorithms based on complex information do not fully consider how node neighborhood information and complex mining affect the identification of essential proteins. As a result, the accuracy and specificity of the identification results are low. To address these problems, an essential protein prediction algorithm based on participation degree in protein complexes and density (PEC) is proposed. Firstly, GO annotation information and the edge clustering coefficient are combined to construct a weighted PPI network, which reduces the influence of false positives on the experimental results. A similarity matrix is constructed from the edge weights of the protein interactions, and the maximum gap between eigenvalues is used to determine the partition number K automatically. Meanwhile, K initial clustering centers are selected according to the degrees of the protein nodes in the weighted network. Spectral clustering is then combined with the fuzzy C-means (FCM) clustering algorithm to mine protein complexes, which improves the clustering accuracy and reduces the data dimension. Secondly, an essentiality score for each protein is designed from its degree of participation in protein complexes and the density of its neighborhood subgraph. Experimental results on the DIP and Krogan datasets show that, compared with 10 classic algorithms (DC, BC, CC, SC, IC, PeC, WDC, LIDC, LBCC and UC), PEC correctly identifies more essential proteins, with higher accuracy and specificity.
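
Choosing the partition number K from the spectrum can be illustrated with the standard eigengap heuristic. The abstract does not say which matrix the eigenvalues come from, so this sketch assumes the symmetric normalized Laplacian of the similarity matrix and takes K as the position of the largest gap between consecutive eigenvalues. The function name and the `max_k` parameter are hypothetical, and the paper's own criterion may differ in detail.

```python
import numpy as np


def choose_k_by_eigengap(similarity, max_k=None):
    """Pick K at the largest gap between consecutive Laplacian eigenvalues.

    Assumption: eigenvalues of the symmetric normalized Laplacian
    L = I - D^{-1/2} W D^{-1/2} are used; the abstract does not specify this.
    """
    w = np.asarray(similarity, dtype=float)
    degree = w.sum(axis=1)
    # Isolated nodes (degree 0) get a zero scaling factor instead of 1/0.
    safe_degree = np.where(degree > 0, degree, 1.0)
    inv_sqrt = np.where(degree > 0, 1.0 / np.sqrt(safe_degree), 0.0)
    laplacian = np.eye(len(w)) - inv_sqrt[:, None] * w * inv_sqrt[None, :]

    eigenvalues = np.linalg.eigvalsh(laplacian)  # sorted in ascending order
    limit = len(eigenvalues) - 1
    if max_k is not None:
        limit = min(max_k, limit)
    # gaps[i] is the jump from the (i+1)-th to the (i+2)-th smallest eigenvalue.
    gaps = np.diff(eigenvalues[: limit + 1])
    return int(np.argmax(gaps)) + 1


# Toy similarity matrix: two fully connected groups of three proteins each.
toy_similarity = np.zeros((6, 6))
toy_similarity[:3, :3] = 1.0
toy_similarity[3:, 3:] = 1.0
np.fill_diagonal(toy_similarity, 0.0)
toy_k = choose_k_by_eigengap(toy_similarity)
```

For the toy matrix the Laplacian has two eigenvalues near zero followed by a jump, so the heuristic selects two clusters, matching the two disconnected groups.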
 

Key words: protein-protein interaction network, spectral clustering algorithm, protein complexes, density, essential proteins