• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)

• Artificial Intelligence and Data Mining

Micro-learning unit text clustering with an improved density peak algorithm based on the LSA model

武国胜,张月琴   

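  1. (College of Information and Computer Science, Taiyuan University of Technology, Jinzhong, Shanxi 030600)
  • Received: 2019-09-05 Revised: 2019-10-22 Publication date: 2020-04-25 Release date: 2020-04-25
  • Funding:

    Natural Science Foundation of Shanxi Province (201701D121057)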

An improved density peak algorithm for micro-learning unit text clustering based on LSA model

WU Guo-sheng, ZHANG Yue-qin

  1. (College of Information and Computer Science, Taiyuan University of Technology, Jinzhong 030600, China)

  • Received: 2019-09-05 Revised: 2019-10-22 Online: 2020-04-25 Published: 2020-04-25

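Abstract (translated from the Chinese):

The explosive growth of micro-learning resources has produced a large volume of unorganized text, and because these resources are presented in fragmented form they are very inconvenient for learners to use. To help learners find content suited to personalized learning among fragmented resources, it is necessary to cluster micro-learning resources that take the form of text. To this end, the paper applies an improved density peak algorithm to micro-learning unit text clustering. When the density peak algorithm is used for clustering in this domain, it suffers from a high-dimensional, sparse vector space, insufficient global consistency, sensitivity to the cutoff distance, and the need for manual supervision when selecting density peak centers. To address these problems, the texts are modeled with the latent semantic analysis (LSA) model and two improvements are proposed. First, local density is redefined according to the clustering requirements and a density-sensitive distance is introduced as the clustering criterion, so that resolving the sensitivity to the cutoff distance also resolves the global consistency problem in cluster assignment. Second, linear fitting is used to find outliers, which automatically identifies the density peak centers and removes the need for manual supervision in center selection. Experiments on real data sets of micro-learning units show that the proposed algorithm is better suited to micro-learning unit text clustering than the original density peak algorithm and other classical clustering algorithms.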
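
For orientation, the original density peak algorithm that the paper sets out to improve is usually stated in terms of two per-point quantities: a local density and a distance to the nearest point of higher density. The block below gives that standard formulation. The symbols (d_{ij} for the distance between points i and j, d_c for the cutoff distance) follow common usage in the literature and are not taken from this abstract; the paper's redefined local density is not given here and is not reproduced.

```latex
% Local density: number of points closer to point i than the cutoff distance d_c
\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c), \qquad
\chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \ge 0 \end{cases}

% Distance to the nearest point of higher density
\delta_i =
\begin{cases}
\min\limits_{j:\, \rho_j > \rho_i} d_{ij}, & \text{if some } \rho_j > \rho_i \\
\max\limits_{j} d_{ij}, & \text{otherwise}
\end{cases}
```

Points with both large \rho_i and large \delta_i are the candidate density peak centers. Because \rho_i depends directly on d_c, this formulation makes the cutoff-distance sensitivity mentioned in the abstract easy to see.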
 
 

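Keywords (translated from the Chinese): micro-learning, text clustering, density clustering, LSA, density-sensitive distance, linear fitting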

Abstract:

With the explosive growth of micro-learning resources, a large volume of unprocessed, fragmented text resources has become a great inconvenience to learners. To help learners find content suited to personalized learning among these fragmented resources, it is necessary to cluster micro-learning resources that take the form of text. This paper therefore applies an improved density peak algorithm to micro-learning unit text clustering. When the density peak algorithm performs clustering in this field, it suffers from a high-dimensional, sparse vector space, insufficient global consistency, sensitivity to the cutoff distance, and the need for manual supervision when selecting density peak centers. To address these problems, this paper models the texts with the Latent Semantic Analysis (LSA) model and proposes two improvements. First, local density is redefined according to the clustering requirements and a density-sensitive distance is used as the clustering criterion, so that resolving the cutoff distance sensitivity also resolves the global consistency problem in cluster assignment. Second, outliers are found by linear fitting in order to identify the density peak centers automatically, so that peak centers are selected without manual supervision. Experimental results on real data sets of micro-learning units show that the proposed algorithm is better suited to text clustering of micro-learning units than the original density peak algorithm and other classical clustering algorithms.
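
The abstract does not define its density-sensitive distance, so the following is only a minimal sketch of one formulation that is common in the density-sensitive clustering literature: each direct hop of Euclidean length d costs `flex ** d - 1`, and the distance between two points is the cheapest path through the data. The function name `density_sensitive_distances`, the parameter `flex`, and the toy points are assumptions made for illustration, not the paper's definitions.

```python
import math


def density_sensitive_distances(points, flex=2.0):
    """All-pairs density-sensitive distances (illustrative formulation).

    A direct hop of Euclidean length d costs flex ** d - 1, with flex > 1
    (a hypothetical tuning parameter). The distance between two points is
    the cheapest path over the complete graph, so many short hops through
    a dense region cost less than one long hop across a gap.
    """
    n = len(points)
    dist = [[flex ** math.dist(points[i], points[j]) - 1.0 for j in range(n)]
            for i in range(n)]
    # Floyd-Warshall all-pairs shortest paths over the hop costs.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via = dist[i][k] + dist[k][j]
                if via < dist[i][j]:
                    dist[i][j] = via
    return dist


# Toy data: a chain of five evenly spaced points, plus one isolated point.
# Points 4 and 5 are both at Euclidean distance 4 from point 0.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
D = density_sensitive_distances(points, flex=2.0)

print(D[0][4])  # 4.0  (four unit hops along the chain, each costing 1)
print(D[0][5])  # 15.0 (one direct hop across the gap: 2 ** 4 - 1)
```

In this toy case the two targets are equally far from point 0 in Euclidean terms, yet the one reachable through a dense chain is much closer under the density-sensitive distance. This is the kind of global consistency along a cluster that the abstract refers to.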
 

Key words: micro-learning, text clustering, density-based clustering, LSA, density-sensitive distance, linear fitting
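
The automatic center selection described in the abstract (finding outliers by linear fitting) is not specified in detail here, so the sketch below shows only one plausible reading. It computes the usual density peak quantities, forms gamma = rho * delta, fits a least-squares line to the lower bulk of the sorted gamma values, and treats points far above that line as centers. The Gaussian-kernel density, the helper names, and the parameters `sigma`, `fit_fraction`, and `k` are all assumptions made for illustration; the toy 2-D points stand in for LSA-reduced document vectors.

```python
import math


def density_peaks(points, sigma=1.0):
    """Return (rho, delta) for each point, using a Gaussian-kernel density."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    # Assumed smooth density: sum of Gaussian weights over all other points.
    rho = [sum(math.exp(-(d[i][j] / sigma) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [d[i][j] for j in range(n) if rho[j] > rho[i]]
        # A point with no denser neighbor gets its largest distance instead.
        delta.append(min(higher) if higher else max(d[i]))
    return rho, delta


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def select_centers(rho, delta, fit_fraction=0.8, k=3.0):
    """Pick centers as outliers above a line fitted to the bulk of gamma."""
    gamma = [r * d for r, d in zip(rho, delta)]
    order = sorted(range(len(gamma)), key=lambda i: gamma[i])  # ascending
    m = max(2, int(fit_fraction * len(order)))  # size of the bulk used for the fit
    ranks = list(range(m))
    bulk = [gamma[i] for i in order[:m]]
    slope, intercept = fit_line(ranks, bulk)
    resid = [y - (slope * x + intercept) for x, y in zip(ranks, bulk)]
    std = math.sqrt(sum(r * r for r in resid) / m)
    # A point is an outlier if it lies more than k residual deviations above the line.
    return sorted(i for rank, i in enumerate(order)
                  if gamma[i] - (slope * rank + intercept) > k * std)


# Toy data: two well-separated 3 x 3 grids of points.
grid = [(x, y) for x in range(3) for y in range(3)]
points = grid + [(x + 10, y + 10) for x, y in grid]

rho, delta = density_peaks(points, sigma=1.0)
centers = select_centers(rho, delta)
print(centers)  # [4, 13], the middle point of each grid
```

Fitting the line only to the lower bulk keeps the few very large gamma values from pulling the fit toward themselves, which would otherwise hide the outliers the procedure is meant to find. Whether the paper fits gamma, delta, or another quantity is not stated in the abstract.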