• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2026, Vol. 48 ›› Issue (2): 277-285.

• Artificial Intelligence and Data Mining •

ICBV: A semi-supervised intent clustering method based on BERT variational autoencoder

ZHAO Jinyue, GOU Zhinan, GAO Kai

  1. (1. School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018, China;
     2. School of Management Science and Information Engineering, Hebei University of Economics and Business, Shijiazhuang 050061, China)
  • Received: 2024-08-19  Revised: 2024-11-14  Online: 2026-02-25  Published: 2026-03-10
  • Supported by: Natural Science Foundation of Hebei Province (F2023207003)



Abstract: Intent clustering is a valuable task in natural language processing (NLP). When confronted with limited labeled data, existing methods often struggle to capture the complex semantic information embedded in discrete text representations. Moreover, unlabeled data frequently contains noise, and directly assigning pseudo-labels to it may harm model training. Effectively leveraging unlabeled data while mitigating noise is therefore a critical challenge. To address this, this paper proposes a semi-supervised clustering method named ICBV (intent clustering based on BERT variational autoencoder). The approach combines a small amount of labeled data with pre-trained representation learning using a BERT-encoded variational autoencoder (VAE), and then employs a centroid-guided strategy during the training phase. ICBV encodes input text and computes latent variables to capture the latent-space representation of the data. Compared with traditional clustering algorithms, ICBV also exploits deep learning to capture the complex structures and nonlinear relationships within the data more effectively. In experiments on the BANKING77 dataset under varying ratios of known classes, ICBV improves accuracy over state-of-the-art baselines, validating the effectiveness of the VAE-encoded latent variable representations and the robustness of the clustering algorithm. The method provides a solution to the problems of insufficient labeled data and noise in intent clustering within NLP.
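The abstract describes encoding input text and computing latent variables via a VAE head. The following is a minimal numpy sketch of that general mechanism, not the authors' implementation: a stand-in vector plays the role of a BERT sentence embedding, random projections stand in for trained encoder layers, and the reparameterization trick plus the standard KL regularizer produce the latent variable z used downstream for clustering. All dimensions and weights here are illustrative assumptions.

```python
# Hypothetical sketch of a VAE latent head over a sentence embedding.
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM, LATENT_DIM = 768, 32  # 768 mimics a BERT-base embedding size

# Random projections standing in for trained encoder layers.
W_mu = rng.normal(0.0, 0.02, (EMB_DIM, LATENT_DIM))
W_logvar = rng.normal(0.0, 0.02, (EMB_DIM, LATENT_DIM))

def encode(h):
    """Map an embedding h to the (mu, log_var) parameters of q(z|x)."""
    return h @ W_mu, h @ W_logvar

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps, keeping the sample differentiable in mu, sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    """KL(q(z|x) || N(0, I)), the VAE regularizer (always non-negative)."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

h = rng.standard_normal(EMB_DIM)   # stand-in for a BERT sentence embedding
mu, log_var = encode(h)
z = reparameterize(mu, log_var)    # latent representation used for clustering
kl = kl_divergence(mu, log_var)
print(z.shape)                     # → (32,)
```

In a trained model, `kl` would be added to the reconstruction loss; here it only illustrates the term's computation.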

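The abstract also mentions a centroid-guided strategy for exploiting unlabeled data while limiting pseudo-label noise. The paper's exact algorithm is not given here, so the sketch below shows one plausible form of the idea: centroids computed from the few labeled latent vectors guide nearest-centroid pseudo-labeling of unlabeled points, and points far from every centroid are rejected as likely noise. The distance threshold and toy 2-D data are illustrative assumptions.

```python
# Hypothetical sketch of centroid-guided pseudo-labeling with noise rejection.
import numpy as np

def labeled_centroids(z, y):
    """Mean latent vector per known intent class."""
    return np.stack([z[y == c].mean(axis=0) for c in np.unique(y)])

def guided_assign(z_unlabeled, centroids, max_dist):
    """Pseudo-label each point by its nearest centroid; -1 marks rejected noise."""
    d = np.linalg.norm(z_unlabeled[:, None, :] - centroids[None, :, :], axis=-1)
    labels = d.argmin(axis=1)
    labels[d.min(axis=1) > max_dist] = -1  # too far from every centroid
    return labels

# Two toy intent clusters in a 2-D latent space.
z_lab = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.0, 5.2]])
y_lab = np.array([0, 0, 1, 1])
cents = labeled_centroids(z_lab, y_lab)

z_unl = np.array([[0.1, 0.1], [4.9, 5.1], [50.0, -50.0]])
labels = guided_assign(z_unl, cents, max_dist=2.0)
print(labels)  # → [ 0  1 -1]
```

Rejecting far-away points (label -1) is one simple way to keep noisy unlabeled examples from corrupting training, which is the concern the abstract raises about direct pseudo-labeling.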

Key words: semi-supervised clustering, intent clustering, variational autoencoder (VAE)