• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2007, Vol. 29 ›› Issue (9): 84-90.

• Papers •

Research on Tumor Classification and Detection Algorithms Based on Principal Component Analysis

王树林[1,2] 王戟[1] 陈火旺[1] 张波云[1]   

  • Online:2007-09-01 Published:2010-06-02

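Abstract (translated from the Chinese):

Tumor diagnosis based on gene expression profiles is expected to become a fast and effective diagnostic method in clinical medicine. However, gene expression data are characterized by very high dimensionality, very small sample sizes, and heavy noise, which makes extracting tumor-related informative genes a challenging task. Building on an analysis of the methods currently used for tumor classification and detection, this paper therefore proposes a hybrid feature extraction method that combines gene feature scoring with principal component analysis. Experiments show that the method extracts classification features effectively and greatly reduces the dimensionality of gene expression data while maintaining high tumor recognition accuracy, which substantially improves classifier performance. Two tumor-related gene expression datasets were used to validate the hybrid feature extraction method. Classification experiments with support vector machines show that the proposed hybrid method not only achieves high cross-validation accuracy but also allows the classification results to be visualized. For the colon cancer tissue sample set, the cross-validation accuracy reaches 95.16%; for the acute leukemia tissue sample set, it reaches 100%.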

Keywords: support vector machine; gene expression profile; tumor classification; principal component analysis

Abstract:

Tumor diagnosis based on gene expression profiles is expected to develop into a fast and effective clinical method in the near future. Although DNA microarray experiments provide a huge amount of gene expression data, only a few genes are in fact related to tumors. Moreover, it is difficult to extract tumor-related genes from gene expression profiles because of characteristics such as high dimensionality, small sample size, and considerable noise and redundancy. In this paper we propose a novel feature extraction approach that projects high-dimensional data onto a lower-dimensional feature space, which improves SVM-based classification performance on gene expression data. To validate the proposed approach, we examined two gene expression datasets (a colon dataset and a leukemia dataset) using SVM classifiers with different parameters. Experimental results show that SVM achieves superior performance in classifying gene expression data when it uses the principal components extracted from the top-ranked genes selected by the gene ranking method. Using SVM classifiers, a cross-validation accuracy of 95.16% was achieved for the colon dataset and 100% for the leukemia dataset. Another advantage of the proposed method is that the sample classification results can be visualized as 2D or 3D scatter plots.

Key words: SVM; gene expression profile; tumor classification; principal component analysis
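
The two-stage feature extraction described in the abstract (rank genes by a score, then apply principal component analysis to the top-ranked genes) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' implementation: the abstract does not say which scoring function is used, so a signal-to-noise score stands in for it; the function names `snr_scores` and `select_and_project`, the parameters `n_genes` and `n_components`, and the synthetic data are hypothetical. The SVM classification and cross-validation steps are omitted.

```python
import numpy as np

def snr_scores(X, y):
    """Per-gene signal-to-noise score |mu1 - mu0| / (sd1 + sd0).
    Assumed scoring function; the abstract does not name the one used."""
    X0, X1 = X[y == 0], X[y == 1]
    eps = 1e-12  # avoids division by zero for constant genes
    diff = np.abs(X1.mean(axis=0) - X0.mean(axis=0))
    return diff / (X1.std(axis=0) + X0.std(axis=0) + eps)

def select_and_project(X, y, n_genes=50, n_components=3):
    """Keep the n_genes top-scoring genes, then project the samples
    onto the leading principal components of those genes."""
    top = np.argsort(snr_scores(X, y))[::-1][:n_genes]
    Xs = X[:, top]
    Xc = Xs - Xs.mean(axis=0)  # center each selected gene
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T, top

# Synthetic stand-in for an expression matrix (samples x genes); not real data.
rng = np.random.default_rng(0)
y = np.array([0] * 30 + [1] * 30)
X = rng.normal(size=(60, 1000))
X[y == 1, :20] += 3.0  # make the first 20 genes informative

Z, top = select_and_project(X, y)
print(Z.shape)
```

With `n_components` set to 2 or 3, the rows of `Z` can be drawn directly as a 2D or 3D scatter plot, which corresponds to the visualization the abstract mentions. In a real evaluation, the gene ranking and the PCA would need to be fitted on the training folds only, so that the cross-validation accuracy is not inflated by information from the held-out samples.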