• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (12): 2331-2338.

• Articles •

一个面向信息抽取的中英文平行语料库

惠浩添,李云建,钱龙华,周国栋   

  (1. 苏州大学自然语言处理实验室,江苏 苏州 215006;2. 苏州大学计算机科学与技术学院,江苏 苏州 215006)
  • 收稿日期:2015-08-26 修回日期:2015-10-21 出版日期:2015-12-25 发布日期:2015-12-25
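  • Supported by:

    National Natural Science Foundation of China (61373096, 90920004); Major Natural Science Research Project of Jiangsu Higher Education Institutions (11KJA520003)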

A ChineseEnglish parallel corpus for information extraction 

HUI Haotian,LI Yunjian,QIAN Longhua,ZHOU Guodong   

  (1. Natural Language Processing Lab, Soochow University, Suzhou 215006; 2. School of Computer Science & Technology, Soochow University, Suzhou 215006, China)
  • Received: 2015-08-26 Revised: 2015-10-21 Online: 2015-12-25 Published: 2015-12-25

摘要:

除了机器翻译,平行语料库对信息检索、信息抽取及知识获取等研究领域具有重要的作用,但是传统的平行语料库只是在句子级对齐,因而对跨语言自然语言处理研究的作用有限。鉴于此,以OntoNotes中英文平行语料库为基础,通过自动抽取、自动映射加人工标注相结合的方法,构建了一个面向信息抽取的高质量中英文平行语料库。该语料库不仅包含中英文实体及其相互关系,而且实现了中英文在实体和关系级别上的对齐。因此,该语料库将有助于中英文信息抽取的对比研究,揭示不同语言在语义表达上的差异,也为跨语言信息抽取的研究提供了一个有价值的平台。

关键词: 命名实体, 语义关系, 双语映射, 平行语料库

Abstract:

Beyond machine translation, parallel corpora play an important role in research areas such as information retrieval, information extraction, and knowledge acquisition. However, traditional parallel corpora are aligned only at the sentence level, which limits their usefulness for research on cross-language natural language processing. In view of this, we build on the OntoNotes Chinese-English parallel corpus and construct a high-quality Chinese-English parallel corpus for information extraction by combining automatic extraction, automatic mapping, and manual annotation. The corpus contains Chinese and English entities and the relations between them, and it aligns the two languages at both the entity and relation levels. The corpus can therefore facilitate comparative studies of information extraction in Chinese and English, reveal differences in semantic expression between the two languages, and provide a valuable platform for research on cross-language information extraction.

Key words: named entity; semantic relation; bilingual mapping; parallel corpus