• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (02): 370-380.

• Artificial Intelligence and Data Mining •

DCsR: An integrated error correction framework for Chinese text

CAO Jun-hang1,2, HUANG Rui-zhang1,2, BAI Rui-na1,2, ZHAO Jian-hui1,2

  (1. State Key Laboratory of Public Big Data, Guiyang 550025;
    2. College of Computer Science and Technology, Guizhou University, Guiyang 550025, China)
  • Received: 2022-09-27 Revised: 2022-10-18 Accepted: 2023-02-25 Online: 2023-02-25 Published: 2023-02-16

Abstract: Chinese text error correction is an important application of natural language processing. Because Chinese writing is flexible and varied, existing error correction models cannot cover every type of error, and selecting the top-1 candidate from the top-k candidates remains error-prone. This paper proposes DCsR (Detector-Correctors-Ranker), an integrated error correction framework for Chinese text. The framework abandons the previous approach of assuming that error types are known in advance and correcting with a single model. Instead, several strong error correction models are selected according to the scenario, and their outputs are integrated to recall a more comprehensive candidate set. A multi-strategy, scalable candidate sorting algorithm, built on the importance of customized features, then selects the more credible correction results. In this way the DCsR framework effectively addresses the problem of model bias and further improves the performance of Chinese spelling error correction. Experimental results show that, compared with the best-performing single model, the DCsR framework improves the error correction F1 value by 3.93% on the public dataset SIGHAN15. An ablation experiment on CGED2020 also demonstrates the effectiveness of the DCsR framework.
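The recall-then-rank idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the corrector interface, the two features (`votes`, `max_conf`), the weights, and the toy correctors are all assumptions made for illustration, and the detector stage is omitted.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a corrector maps a sentence to (candidate, confidence) pairs.
Corrector = Callable[[str], List[Tuple[str, float]]]


def recall_candidates(sentence: str, correctors: List[Corrector]) -> Dict[str, Dict[str, float]]:
    """Merge the candidates of all correctors and record simple features for each."""
    pool: Dict[str, Dict[str, float]] = defaultdict(lambda: {"votes": 0.0, "max_conf": 0.0})
    for corrector in correctors:
        for candidate, conf in corrector(sentence):
            feats = pool[candidate]
            feats["votes"] += 1.0
            feats["max_conf"] = max(feats["max_conf"], conf)
    # Normalise vote counts to the fraction of correctors that proposed the candidate.
    for feats in pool.values():
        feats["votes"] /= len(correctors)
    return dict(pool)


def rank_candidates(pool: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> List[str]:
    """Sort candidates by a weighted sum of their features (assumed scoring rule)."""
    def score(candidate: str) -> float:
        return sum(weights.get(name, 0.0) * value for name, value in pool[candidate].items())
    return sorted(pool, key=score, reverse=True)


# Toy correctors (hypothetical): two fix the misspelled character, one leaves it unchanged.
def corrector_a(sentence: str) -> List[Tuple[str, float]]:
    return [("我喜欢吃苹果", 0.80)]

def corrector_b(sentence: str) -> List[Tuple[str, float]]:
    return [("我喜欢吃苹果", 0.70)]

def corrector_c(sentence: str) -> List[Tuple[str, float]]:
    return [("我喜欢吃平果", 0.95)]


if __name__ == "__main__":
    pool = recall_candidates("我喜欢吃平果", [corrector_a, corrector_b, corrector_c])
    ranked = rank_candidates(pool, {"votes": 0.6, "max_conf": 0.4})
    print(ranked[0])
```

In this toy setup the single most confident model is the one that leaves the error in place, yet the candidate proposed by two of the three correctors is ranked first, which is the kind of single-model bias the integrated ranking is meant to offset. New features can be added by extending the feature dictionary and the weight table.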

Key words: Chinese text error correction, DCsR framework, integrated error correction, feature importance, candidate sorting algorithm