Heterogeneous ensemble learning with feature subspace augmentation for imbalanced data

Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (5): 940-950.

• Artificial Intelligence and Data Mining •

CHEN Lifang1,2, BAI Yun1, SHI Yonghui1, DAI Qi1

(1. College of Science, North China University of Science and Technology, Tangshan 063210;
2. Hebei Key Laboratory of Data Science and Application, Tangshan 063210, China)

Received: 2024-01-15 Revised: 2024-05-16 Online: 2025-05-25 Published: 2025-05-27
Funding: National Natural Science Foundation of China (52074126)

Abstract

For imbalanced data, traditional classifiers tend to preserve accuracy on the majority class at the expense of the minority class, which degrades overall performance. To address this issue, a heterogeneous ensemble learning algorithm with feature subspace augmentation for imbalanced data, HEL-FSA, is proposed. First, the XGBoost algorithm is used to learn feature importance, and the important features are selected to form a feature subspace of the dataset. Second, the SMOTE algorithm is used to generate new samples within this feature subspace, yielding more balanced training data. Finally, five classifiers (logistic regression, decision tree, multi-layer perceptron, support vector machine, and XGBoost) are employed as base models, and these heterogeneous base models are fused using the if_any algorithm. Experimental results on nine imbalanced datasets verify the feasibility of the proposed algorithm. In addition, the algorithm is applied to cervical cancer risk prediction, where it improves the ability to understand and predict cervical cancer risk.

Key words: imbalanced data, feature selection, ensemble learning, synthetic minority over-sampling technique
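
The three stages named in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation, and several points are assumptions: the importance scores are hypothetical placeholders standing in for scores learned by XGBoost, `smote_like` is a simplified SMOTE-style interpolation rather than the reference SMOTE algorithm, and `if_any` assumes that the fusion rule labels a sample as minority whenever at least one base model does, which the abstract does not spell out. All function names are illustrative.

```python
import random


def select_subspace(importances, k):
    """Return the sorted indices of the k highest-scoring features."""
    ranked = sorted(range(len(importances)), key=lambda j: (-importances[j], j))
    return sorted(ranked[:k])


def project(X, cols):
    """Keep only the selected feature columns of every sample."""
    return [[row[j] for j in cols] for row in X]


def smote_like(minority, n_new, k_neighbors=2, seed=0):
    """Simplified SMOTE-style oversampling (an assumption, not reference SMOTE).

    Each synthetic point is a random interpolation between a minority sample
    and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for idx, p in enumerate(minority) if idx != i]
        # Rank the remaining minority samples by squared Euclidean distance.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k_neighbors])
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def if_any(predictions, positive=1, negative=0):
    """Assumed fusion rule: positive if any base model predicts positive."""
    return [positive if any(p == positive for p in votes) else negative
            for votes in zip(*predictions)]


# Hypothetical importance scores for four features.
importances = [0.05, 0.40, 0.10, 0.45]
cols = select_subspace(importances, k=2)

# Three made-up minority samples, reduced to the selected subspace.
minority = project([[7, 1.0, 3, 2.0], [8, 1.2, 1, 2.4], [9, 0.8, 2, 1.6]], cols)
new_samples = smote_like(minority, n_new=4)

# Rows are base models, columns are samples.
votes = [[0, 1, 0], [0, 0, 0], [0, 0, 1]]
fused = if_any(votes)
```

Under the assumed rule, a single positive vote is enough to flag a sample, so the fusion step favors minority-class recall at the cost of more false positives. In practice the importance scores and the oversampling would more likely come from the xgboost and imbalanced-learn libraries than from hand-written helpers like these.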

CHEN Lifang, BAI Yun, SHI Yonghui, DAI Qi. Heterogeneous ensemble learning with feature subspace augmentation for imbalanced data[J]. Computer Engineering & Science, 2025, 47(5): 940-950.
