Coreference Resolution for Extracting Answers


This is the Groningen page for the project Coreference Resolution for Extracting Answers (COREA).

The project is part of the Stevin-initiative of the Dutch and Flemish governments and will be carried out in collaboration with the Language Technology Group of the University of Antwerp and with Language and Computing.

Project Summary

Coreference resolution is a key ingredient for the automatic interpretation of text. It has been studied mainly from a linguistic perspective, with an emphasis on establishing potential antecedents for pronouns. Practical applications, such as Information Extraction (IE), summarization and Question Answering (QA), require accurate identification of coreference relations between noun phrases in general. Computational systems that assign such relations automatically require a sufficient amount of annotated data for training and testing. For Dutch, annotated data is scarce and coreference resolution systems are lacking.
In this project, we aim to develop a robust system for assigning coreference relations automatically, and we will investigate how making coreference relations explicit affects the accuracy of systems for IE and QA. We will annotate a limited amount of application-specific corpus material, which is required for evaluating the coreference resolution system in the context of IE and QA.
The project contributes to the goals of Stevin by providing a robust coreference resolution system that can be used in a range of applications for Dutch, such as information extraction, question answering and summarization. In addition, general guidelines for coreference annotation will become available and a tool will be developed to support the annotation of coreference in text. Finally, a limited amount of data annotated with coreferential information, including spoken language data, will be produced.
The postdoc position in Groningen will be linked to local work on syntactic annotation within the Stevin-initiative and to the QA system which is being developed within the IMIX programme.
The full text of the scientific parts of the proposal can be found here.

Annotated Corpora

Annotated texts are provided as XML. A stylesheet supports visualization in the browser (tested in Firefox and Opera; highlighting does not work in Internet Explorer). Click on the _inline.xml files to view the texts. Annotated texts are available for two corpora:

DCOI (a subset of the DCOI-corpus)
CGN (a subset of the CGN-corpus)
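
As a rough illustration of how inline coreference annotation of this kind could be processed, the sketch below groups mentions into coreference chains. The element name `np`, the attributes `id` and `ref`, and the sample text are all assumptions made for this example; they are not taken from the actual _inline.xml files, whose schema may differ.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical inline annotation. The element name "np" and the attributes
# "id" and "ref" are invented for illustration, not the real COREA schema.
SAMPLE = """<text>
  <s><np id="m1">Jan</np> kocht <np id="m2">een fiets</np>.</s>
  <s><np id="m3" ref="m1">Hij</np> vond <np id="m4" ref="m2">de fiets</np> mooi.</s>
</text>"""


def coreference_chains(xml_string):
    """Group mention texts into chains by following ref links back to the first mention."""
    root = ET.fromstring(xml_string)
    antecedent = {}  # mention id -> id of the mention it refers to
    text = {}        # mention id -> surface text
    order = []       # mention ids in document order
    for np in root.iter("np"):
        mention_id = np.get("id")
        text[mention_id] = "".join(np.itertext()).strip()
        order.append(mention_id)
        if np.get("ref") is not None:
            antecedent[mention_id] = np.get("ref")

    def chain_head(mention_id):
        # Follow antecedent links; the seen set guards against cyclic links.
        seen = set()
        while mention_id in antecedent and mention_id not in seen:
            seen.add(mention_id)
            mention_id = antecedent[mention_id]
        return mention_id

    chains = defaultdict(list)
    for mention_id in order:
        chains[chain_head(mention_id)].append(text[mention_id])
    # Mentions without any coreferring partner are not reported as chains.
    return [chain for chain in chains.values() if len(chain) > 1]


if __name__ == "__main__":
    for chain in coreference_chains(SAMPLE):
        print(" <- ".join(chain))
```

For real files, the element and attribute names would have to be replaced with whatever the distributed XML actually uses.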