• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (5): 940-950.

• Artificial Intelligence and Data Mining •

Heterogeneous ensemble learning with feature subspace augmentation for imbalanced data

CHEN Lifang1,2, BAI Yun1, SHI Yonghui1, DAI Qi1

  1. College of Science, North China University of Science and Technology, Tangshan 063210, Hebei, China
  2. Hebei Key Laboratory of Data Science and Application, Tangshan 063210, Hebei, China

  • Received: 2024-01-15 Revised: 2024-05-16 Online: 2025-05-25 Published: 2025-05-27
  • Funding:
    National Natural Science Foundation of China (52074126)

Abstract: On imbalanced data, traditional classifiers tend to preserve accuracy on the majority class at the expense of the minority class, which degrades overall performance. To address this problem, a heterogeneous ensemble learning algorithm with feature subspace augmentation for imbalanced data, HEL-FSA, is proposed. First, the XGBoost algorithm is used to learn feature importance and select the important features, which form a feature subspace of the dataset. Second, the SMOTE algorithm generates new samples within this feature subspace to obtain more balanced training data. Finally, five classifiers, namely logistic regression, decision tree, multi-layer perceptron, support vector machine, and XGBoost, are employed as base models, and these heterogeneous base models are fused using the if_any algorithm. Experimental results on nine imbalanced datasets verify the feasibility of the proposed algorithm. In addition, applying the algorithm to cervical cancer risk prediction improves the ability to understand and predict cervical cancer risk.

Key words: imbalanced data, feature selection, ensemble learning, synthetic minority over-sampling technique
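
The three steps named in the abstract can be illustrated with a minimal plain-Python sketch. It is not the authors' implementation. The importance scores are hypothetical stand-ins for what XGBoost would learn, `smote` is a simplified SMOTE-style interpolation between nearest minority neighbours, and the `if_any` rule shown here (label a sample as minority when at least one base model does) is an assumption, since the abstract does not define it. All function names and data are illustrative.

```python
import random

def top_k_features(importances, k):
    """Indices of the k highest-scoring features, in ascending index order."""
    ranked = sorted(range(len(importances)),
                    key=lambda j: importances[j], reverse=True)
    return sorted(ranked[:k])

def project(rows, cols):
    """Restrict every sample to the selected feature subspace."""
    return [[row[j] for j in cols] for row in rows]

def smote(minority, n_new, k=2, seed=0):
    """Simplified SMOTE: interpolate between a minority sample and one of
    its k nearest minority neighbours (squared Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

def if_any(votes):
    """Assumed fusion rule: predict minority (1) if any base model predicts 1.
    `votes` holds one list of 0/1 predictions per base model."""
    return [1 if any(sample_votes) else 0 for sample_votes in zip(*votes)]

# Hypothetical importance scores for four features (stand-in for XGBoost output).
importances = [0.05, 0.40, 0.10, 0.45]
# Hypothetical minority-class samples in the full feature space.
X_minority = [[9.0, 1.0, 7.0, 2.0],
              [8.0, 1.5, 3.0, 2.5],
              [7.0, 2.0, 5.0, 3.0]]

subspace = top_k_features(importances, k=2)   # features 1 and 3
X_sub = project(X_minority, subspace)
new_points = smote(X_sub, n_new=4)            # synthetic minority samples

# Hypothetical 0/1 predictions of three base models on four test samples.
votes = [[0, 0, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
fused = if_any(votes)                         # [0, 1, 1, 0]

print(subspace, len(new_points), fused)
```

Under this assumed rule the ensemble favours recall on the minority class: a single positive vote is enough, which matches the abstract's concern that ordinary classifiers sacrifice minority-class accuracy. In practice the importance scores and oversampling would come from the XGBoost and SMOTE implementations the paper names rather than from hand-written helpers.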