• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (02): 370-380.

• Artificial Intelligence and Data Mining •

DCsR:一种面向中文文本的集成式纠错框架

曹军航(1,2),黄瑞章(1,2),白瑞娜(1,2),赵建辉(1,2)

  1. (1.公共大数据国家重点实验室,贵州 贵阳 550025;2.贵州大学计算机科学与技术学院,贵州 贵阳 550025)
  • 收稿日期:2022-09-27 修回日期:2022-10-18 接受日期:2023-02-25 出版日期:2023-02-25 发布日期:2023-02-16

DCsR:An integrated error correction framework for Chinese text

CAO Jun-hang(1,2), HUANG Rui-zhang(1,2), BAI Rui-na(1,2), ZHAO Jian-hui(1,2)

  1. (1. State Key Laboratory of Public Big Data, Guiyang 550025; 2. College of Computer Science and Technology, Guizhou University, Guiyang 550025, China)
  • Received:2022-09-27 Revised:2022-10-18 Accepted:2023-02-25 Online:2023-02-25 Published:2023-02-16

摘要: 中文文本纠错技术在自然语言处理中有着非常重要的应用。针对书写灵活多变的中文文本,现有的纠错模型无法覆盖多种错误类型且存在从候选集合TOPK中挑选TOP1时出错概率极大的问题。提出了一种面向中文文本的集成式纠错框架——DCsR,摒弃以往建立在已知错误类型的假设上利用单一模型进行纠错的解决方案,根据不同场景选择添加多种表现优异的纠错模型分别进行纠错再集成召回更全面的候选集,同时根据自定义特征的重要程度建立了一个多策略、可拓展的候选排序算法,以挑选更具有公信力的修正结果。DCsR框架有效地解决了模型的偏向性问题,进一步全面提升了对中文文本拼写纠错的能力。实验结果表明,在公开数据集SIGHAN15上,对比现在的主流纠错模型,使用DCsR框架进行纠错的F1值比表现最优的单模型纠错高出了3.93%,进一步提升了对中文文本的纠错能力。针对CGED2020进行的消融实验也表明了DCsR框架的有效性。

关键词: 中文文本纠错, DCsR框架, 集成式纠错, 特征重要程度, 候选排序算法

Abstract: Chinese text error correction has important applications in natural language processing. Because written Chinese is flexible and highly variable, existing error correction models cannot cover the full range of error types, and they are highly error-prone when selecting the top-1 correction from the top-K candidate set. This paper proposes DCsR (Detector Correctors-Ranker), an integrated error correction framework for Chinese text. The framework abandons the earlier approach of applying a single model under the assumption that the error type is known. Instead, several well-performing correction models are chosen according to the scenario, each corrects the text independently, and their outputs are merged to recall a more comprehensive candidate set. In addition, a multi-strategy, extensible candidate sorting algorithm, built on the importance of custom features, selects the most credible correction. The DCsR framework effectively addresses the problem of model bias and improves Chinese spelling error correction overall. Experimental results on the public SIGHAN15 dataset show that, compared with current mainstream correction models, DCsR achieves an F1 score 3.93% higher than the best-performing single model. Ablation experiments on CGED2020 further demonstrate the effectiveness of the DCsR framework.

Key words: Chinese text error correction, DCsR framework, integrated error correction, feature importance, candidate sorting algorithm
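
The detect, correct, and rank flow summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy correctors, the substring detector, the three features (agreement between correctors, best model confidence, fraction of unchanged characters), and the weights are all hypothetical stand-ins chosen only to show how pooling candidates from several models and re-ranking them can overturn a single model's top-1 choice.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Candidate:
    text: str          # full corrected sentence proposed by one corrector
    source: str        # name of the corrector that proposed it
    confidence: float  # that corrector's own confidence in [0, 1]


Corrector = Callable[[str], List[Candidate]]

# Hypothetical detector: flags a sentence if it contains a known-suspicious string.
SUSPICIOUS = {"平果"}


def detect(sentence: str) -> bool:
    return any(s in sentence for s in SUSPICIOUS)


def confusion_corrector(sentence: str) -> List[Candidate]:
    """Toy corrector backed by a hand-written confusion table."""
    fixed = sentence.replace("平果", "苹果")
    return [Candidate(fixed, "confusion", 0.6)] if fixed != sentence else []


def phonetic_corrector(sentence: str) -> List[Candidate]:
    """Toy corrector whose most confident proposal is deliberately wrong."""
    if "平果" not in sentence:
        return []
    return [
        Candidate(sentence.replace("平果", "瓶果"), "phonetic", 0.8),
        Candidate(sentence.replace("平果", "苹果"), "phonetic", 0.7),
    ]


CORRECTORS: List[Corrector] = [confusion_corrector, phonetic_corrector]

# Hypothetical feature weights; in practice these would reflect feature importance.
WEIGHTS: Dict[str, float] = {"votes": 0.5, "confidence": 0.3, "unchanged": 0.2}


def rank(sentence: str, candidates: List[Candidate],
         n_correctors: int, weights: Dict[str, float]) -> List[str]:
    """Group identical proposals and sort them by a weighted feature score."""
    grouped: Dict[str, List[Candidate]] = defaultdict(list)
    for c in candidates:
        grouped[c.text].append(c)

    def score(text: str) -> float:
        group = grouped[text]
        same = sum(a == b for a, b in zip(sentence, text))
        features = {
            "votes": len({c.source for c in group}) / n_correctors,
            "confidence": max(c.confidence for c in group),
            "unchanged": same / max(len(sentence), len(text)),
        }
        return sum(weights[name] * value for name, value in features.items())

    return sorted(grouped, key=score, reverse=True)


def correct(sentence: str) -> str:
    """Detect, pool candidates from every corrector, then return the top-ranked one."""
    if not detect(sentence):
        return sentence
    pool = [c for corrector in CORRECTORS for c in corrector(sentence)]
    if not pool:
        return sentence
    return rank(sentence, pool, len(CORRECTORS), WEIGHTS)[0]
```

In this toy setup the phonetic corrector alone would pick the wrong candidate as its top-1, while the pooled and re-ranked result favors the candidate that both correctors agree on. Further correctors or features could be added by extending `CORRECTORS` and `WEIGHTS`.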