• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2026, Vol. 48 ›› Issue (2): 277-285.

• Artificial Intelligence and Data Mining •

ICBV: A semi-supervised intent clustering method based on BERT variational autoencoder

ZHAO Jinyue, GOU Zhinan, GAO Kai

  (1. School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018;
  2. School of Management Science and Information Engineering, Hebei University of Economics and Business, Shijiazhuang 050061, China)
  • Received: 2024-08-19 Revised: 2024-11-14 Online: 2026-02-25 Published: 2026-03-10

Abstract: Intent clustering is a valuable task in natural language processing (NLP). When labeled data is limited, existing methods often struggle to capture the complex semantic information embedded in discrete representations. Moreover, unlabeled data frequently contains noise, so directly assigning pseudo-labels to it can harm model training. Effectively leveraging unlabeled data while mitigating noise is therefore a critical challenge. To address it, this paper proposes ICBV (intent clustering based on BERT variational autoencoder), a semi-supervised clustering method. ICBV combines a small amount of labeled data with representation pre-training based on a BERT-encoded variational autoencoder (VAE), and then applies a centroid-guided strategy during the training phase. It encodes the input text and computes latent variables to capture the latent-space representation of the data. Compared with traditional clustering algorithms, ICBV also draws on deep learning to capture complex structures and nonlinear relationships within the data more effectively. In experiments on the BANKING77 dataset under varying ratios of known classes, ICBV achieved higher accuracy than the state-of-the-art baselines, validating the effectiveness of the VAE-encoded latent variable representations and the robustness of the clustering algorithm. The method thus offers a solution to the problems of insufficient labeled data and noise in intent clustering.


Key words: semi-supervised clustering, intent clustering, variational autoencoder (VAE)
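
The abstract names two ingredients, VAE latent variables and a centroid-guided strategy, without giving their details. The NumPy sketch below illustrates the generic versions of both ideas and should not be read as the paper's implementation. The function names, the use of class-mean centroids, and the Euclidean nearest-centroid assignment are all assumptions made for illustration; the encoder outputs `mu` and `log_var` are assumed to come from some BERT-based encoder that is not shown.

```python
import numpy as np


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (standard VAE trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)) per example, summed over latent dims."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=1)


def labeled_centroids(z, labels):
    """Hypothetical helper: one centroid per known class (mean latent vector)."""
    classes = np.unique(labels)
    centroids = np.stack([z[labels == c].mean(axis=0) for c in classes])
    return classes, centroids


def assign_to_nearest(z, classes, centroids):
    """Hypothetical helper: label each latent vector by its nearest centroid."""
    # Pairwise Euclidean distances, shape (num_points, num_centroids).
    dists = np.linalg.norm(z[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]
```

In a pipeline of this kind, centroids estimated from the few labeled examples would guide how unlabeled latent vectors are grouped. How ICBV actually uses its centroids to limit pseudo-label noise is not stated in the abstract.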