Computer Engineering & Science, 2022, Vol. 44, Issue (01): 138-148.
QIN Ying (秦颖)
Received: 2020-08-16
Revised: 2020-10-30
Accepted: 2022-01-25
Online: 2022-01-25
Published: 2022-01-13
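Abstract: How the quality of generated language is evaluated strongly shapes research on natural language generation, and it has become a bottleneck constraining the field. This paper surveys language quality evaluation methods across natural language generation tasks in the broad sense, including machine translation, automatic summarization, dialogue systems, image captioning, and machine writing. It introduces the characteristics, strengths and weaknesses, and open evaluation resources of human evaluation and automatic evaluation, and analyzes the evaluation perspectives and the scope of applicability for each task. The comparative analysis of evaluation methods can inform both the fusion of methods and the exploration of key problems. Overall, quality evaluation of machine-generated language is still limited to comparing surface linguistic form, and many challenges remain for deeper evaluation, such as the accuracy of semantic expression and cohesion and coherence. Drawing on the difficult problems in evaluation and the progress of existing research, the paper analyzes research trends in the quality evaluation of generated language.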
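As a rough illustration of what "comparing surface linguistic form" can mean in practice, the sketch below computes a clipped n-gram precision between a candidate sentence and a single reference, in the spirit of n-gram overlap metrics. The helper names `ngrams` and `clipped_ngram_precision` and the example sentences are assumptions made for illustration, not taken from the paper, and the sketch leaves out components that full metrics typically add, such as length penalties or averaging over several n-gram orders.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (count clipping). Hypothetical helper for
    illustration only; this is not a complete evaluation metric.
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total


# Illustrative sentences (assumed, not from the paper).
reference = "the cat sat on the mat"
literal = "the cat sat on the mat"
paraphrase = "a feline rested upon the rug"

print(clipped_ngram_precision(literal, reference))
print(clipped_ngram_precision(paraphrase, reference))
```

Under these assumptions the paraphrase scores far lower than the literal copy even though it conveys roughly the same meaning, because it shares only one surface token with the reference. This is the kind of gap between form-level comparison and semantic adequacy that the abstract points to.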
QIN Ying. A survey on quality evaluation of machine generated texts[J]. Computer Engineering & Science, 2022, 44(01): 138-148.