An improved density peak algorithm for micro-learning unit text clustering based on LSA model

Computer Engineering & Science

WU Guo-sheng, ZHANG Yue-qin

(College of Information and Computer Science, Taiyuan University of Technology, Jinzhong 030600, China)

Received: 2019-09-05 Revised: 2019-10-22 Online: 2020-04-25 Published: 2020-04-25
Funding:
Natural Science Foundation of Shanxi Province (201701D121057)

Abstract:

The explosive growth of micro-learning resources has produced a large volume of unorganized text, and the fragmented form in which these resources are presented makes them inconvenient for learners to use. To help learners find content suited to personalized learning among fragmented resources, micro-learning resources in text form need to be clustered. This paper therefore applies an improved density peak algorithm to micro-learning unit text clustering. In this domain the density peak algorithm suffers from a high-dimensional sparse vector space, insufficient global consistency, sensitivity to the cutoff distance, and the need for manual supervision when selecting density peak centers. To address these problems, the texts are modeled with Latent Semantic Analysis (LSA) and two improvements are proposed. First, local density is redefined according to the clustering requirements and a density-sensitive distance is introduced as the clustering criterion; resolving the sensitivity to the cutoff distance in turn resolves the global consistency problem in cluster assignment. Second, outliers found by linear fitting are used to locate the density peak centers automatically, so that centers are selected without manual supervision. Experimental results on real micro-learning unit data sets show that the proposed algorithm is better suited to micro-learning unit text clustering than the original density peak algorithm and other classical clustering algorithms.

Key words: micro-learning, text clustering, density-based clustering, LSA, density-sensitive distance, linear fitting
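The abstract names the steps but gives no formulas, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the original density peak quantities (a Gaussian-kernel local density `rho` and the distance `delta` to the nearest denser point, both on plain Euclidean distance) instead of the paper's redefined density and density-sensitive distance, which are not specified here. The center-selection rule is likewise hypothetical: it fits a least-squares line to the sorted products `rho * delta` and keeps points whose residual exceeds `k_sigma` standard deviations. The function names, `d_c`, `k_sigma`, and the toy 2-D points (standing in for LSA document vectors) are all illustrative assumptions.

```python
import math


def density_peaks(points, d_c):
    """Return (rho, delta, parent, order) for the original density peak method."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Gaussian-kernel local density: a smooth stand-in for counting neighbours within d_c.
    rho = [sum(math.exp(-(dist[i][j] / d_c) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    order = sorted(range(n), key=lambda i: rho[i], reverse=True)
    delta, parent = [0.0] * n, [-1] * n
    for pos, i in enumerate(order):
        if pos == 0:
            delta[i] = max(dist[i])  # densest point: use its largest distance
            continue
        j = min(order[:pos], key=lambda h: dist[i][h])  # nearest denser point
        delta[i], parent[i] = dist[i][j], j
    return rho, delta, parent, order


def pick_centers(rho, delta, k_sigma=1.5):
    """Hypothetical rule: fit a line to sorted gamma = rho * delta, keep high outliers."""
    gamma = [r * d for r, d in zip(rho, delta)]
    ranked = sorted(range(len(gamma)), key=lambda i: gamma[i], reverse=True)
    xs = list(range(len(ranked)))
    ys = [gamma[i] for i in ranked]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    resid = [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]
    sigma = math.sqrt(sum(r * r for r in resid) / n)
    return sorted(ranked[x] for x in xs if resid[x] > k_sigma * sigma)


def assign_labels(order, parent, centers):
    """Give each non-center point the label of its nearest denser neighbour."""
    labels = {c: k for k, c in enumerate(centers)}
    for i in order:  # decreasing density, so parent[i] is always labelled first
        if i not in labels:
            labels[i] = labels[parent[i]]
    return [labels[i] for i in range(len(order))]


# Toy data: two 3x3 grids of points centred on (0, 0) and (5, 5).
offsets = (0.0, 0.1, -0.1)
points = [(cx + dx, cy + dy) for cx, cy in ((0.0, 0.0), (5.0, 5.0))
          for dx in offsets for dy in offsets]
rho, delta, parent, order = density_peaks(points, d_c=0.5)
# The densest point has no denser neighbour, so it is always kept as a center.
centers = sorted(set(pick_centers(rho, delta)) | {order[0]})
labels = assign_labels(order, parent, centers)
print(centers)
```

On this toy input the two grid midpoints stand far above the fitted line and are picked as centers, and every other point inherits the label of its nearest denser neighbour. A threshold on residuals from a single global line is a crude choice; it is used here only because it is the simplest reading of "finding outliers by linear fitting".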

WU Guo-sheng, ZHANG Yue-qin. An improved density peak algorithm for micro-learning unit text clustering based on LSA model[J]. Computer Engineering & Science.
