
Computer Engineering & Science ›› 2022, Vol. 44 ›› Issue (10): 1869-1876.

• Artificial Intelligence and Data Mining •


Converting sign language to emotional speech

WANG Wei-zhe1, GUO Wei-tong2,3, YANG Hong-wu1,2,3

  1. (1. College of Physics and Electronic Engineering, Northwest Normal University, Lanzhou 730070, China;
    2. School of Educational Technology, Northwest Normal University, Lanzhou 730070, China;
    3. National and Provincial Joint Engineering Laboratory of Learning Analysis Technology in Online Education, Lanzhou 730070, China)
  • Received: 2020-08-16  Revised: 2021-03-02  Accepted: 2022-10-25  Online: 2022-10-25  Published: 2022-10-28
  • Supported by:
    National Natural Science Foundation of China (62067008, 31860285); Natural Science Foundation of Gansu Province (21JR7RA117); Key Project of the 2020 Annual Program of the 13th Five-Year Plan for Education Science of Gansu Province (GS[2020]GHBZ190)


Abstract: To remove the communication barrier between speech-impaired and healthy people, a neural-network-based method for converting sign language into emotional speech is proposed. Firstly, a gesture corpus, a facial expression corpus, and an emotional speech corpus are established. Then, deep convolutional neural networks are used to recognize gestures and facial expressions. With Mandarin initials and finals as the synthesis units, a speaker-adaptive deep neural network acoustic model and a speaker-adaptive hybrid long short-term memory network acoustic model are trained for emotional speech. Finally, the context-dependent labels derived from the gesture semantics and the emotion labels corresponding to the facial expressions are fed into the emotional speech synthesis model to synthesize the corresponding emotional speech. Experimental results show that the gesture recognition rate and the facial expression recognition rate reach 95.86% and 92.42%, respectively, and the synthesized emotional speech achieves an emotional mean opinion score (EMOS) of 4.15, indicating a high degree of emotional expressiveness. The method can therefore support normal communication between speech-impaired and healthy people.
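
As a rough, hypothetical sketch of the recognition front end described in the abstract (not the authors' implementation), the Python/PyTorch snippet below shows how two convolutional classifiers, one for gestures and one for facial expressions, could produce the gesture word and emotion label that an emotional speech synthesis back end would then consume; all class names, label sets, and network sizes are illustrative assumptions.

    # Illustrative sketch only: stands in for the deep CNN recognizers described
    # in the paper; the actual networks, corpora, and label sets are not public here.
    import torch
    import torch.nn as nn

    class SimpleCNN(nn.Module):
        """Small convolutional classifier used as a placeholder recognizer."""
        def __init__(self, num_classes: int):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.classifier = nn.Linear(32 * 16 * 16, num_classes)  # for 64x64 input

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = self.features(x)                  # (B, 32, 16, 16)
            return self.classifier(h.flatten(1))  # class logits

    # Hypothetical label sets, for illustration only.
    GESTURES = ["hello", "thanks", "goodbye"]
    EMOTIONS = ["neutral", "happy", "sad", "angry"]

    gesture_net = SimpleCNN(num_classes=len(GESTURES))
    emotion_net = SimpleCNN(num_classes=len(EMOTIONS))

    def sign_to_speech_inputs(gesture_img: torch.Tensor, face_img: torch.Tensor):
        """Return the (gesture word, emotion label) pair a TTS back end would consume."""
        with torch.no_grad():
            gesture = GESTURES[gesture_net(gesture_img).argmax(dim=1).item()]
            emotion = EMOTIONS[emotion_net(face_img).argmax(dim=1).item()]
        # In the paper, the gesture semantics are expanded into context-dependent
        # (initial/final) labels and, together with the emotion label, drive the
        # speaker-adaptive DNN/LSTM acoustic models; here we simply return both.
        return gesture, emotion

    # Example usage with random 64x64 RGB tensors standing in for camera frames.
    word, emo = sign_to_speech_inputs(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
    print(word, emo)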


Key words: gesture recognition, facial expression recognition, emotional speech synthesis, neural network, sign language to speech conversion, speech-impaired people