• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (12): 2331-2338.

• Papers •

A ChineseEnglish parallel corpus for information extraction 

HUI Haotian,LI Yunjian,QIAN Longhua,ZHOU Guodong   

  (1. Natural Language Processing Lab, Soochow University, Suzhou 215006; 2. School of Computer Science & Technology, Soochow University, Suzhou 215006, China)
  • Received: 2015-08-26 Revised: 2015-10-21 Online: 2015-12-25 Published: 2015-12-25

Abstract:

Beyond machine translation, parallel corpora play an important role in information retrieval, information extraction, and knowledge acquisition. However, traditional parallel corpora are aligned only at the sentence level, which limits their value for research on cross-language natural language processing. To address this, we build on OntoNotes to construct a high-quality Chinese-English parallel corpus for information extraction by combining automatic extraction, automatic mapping, and manual annotation. The corpus annotates entities and the relations between them, and aligns Chinese and English at both the entity and relation levels. It can therefore support comparative studies of information extraction in Chinese and English, reveal differences in semantic expression between the two languages, and provide a valuable platform for research on cross-language information extraction.

Key words: named entity; semantic relation; bilingual mapping; parallel corpus