A Dutch coreference resolution system with quote attribution
Andreas van Cranenburgh


Coreference resolution is the task of identifying spans in text (mentions) that refer to the same entity. We present a rule-based coreference resolution system for Dutch, based on the Stanford deterministic multi-sieve architecture. The implementation is designed to support book-length documents. In addition to coreference, direct speech is attributed to its speaker and addressee using heuristic rules. Speakers are detected where they are explicitly mentioned, and this information is extended to the remaining quotes on the assumption that interlocutors alternate turns. Pronoun resolution exploits gender and animacy information on nouns from Cornetto, and the gender of first names from the Meertens Voornamenbank.
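The turn-taking heuristic for quote attribution can be illustrated with a minimal sketch. It is not the system's actual implementation: the function names, the restriction to two interlocutors per dialogue, and the input format (one optional speaker name per quote) are assumptions made for illustration only.

```python
from typing import List, Optional


def extrapolate_speakers(explicit: List[Optional[str]]) -> List[Optional[str]]:
    """Fill in speakers for the quotes of one dialogue.

    Hypothetical input: explicit[i] is the speaker of quote i when the text
    names one (e.g. '"Hallo," zei Anna'), otherwise None.
    Assumption: two interlocutors who strictly alternate turns.
    """
    speakers = list(explicit)
    # Forward pass: an unattributed quote inherits the speaker of two turns back.
    for i in range(2, len(speakers)):
        if speakers[i] is None:
            speakers[i] = speakers[i - 2]
    # Backward pass: cover quotes that precede the first explicit mentions.
    for i in range(len(speakers) - 3, -1, -1):
        if speakers[i] is None:
            speakers[i] = speakers[i + 2]
    return speakers


def assign_addressees(speakers: List[Optional[str]]) -> List[Optional[str]]:
    """Take the addressee of a quote to be the speaker of an adjacent turn,
    provided that speaker is known and differs from the quote's own speaker."""
    result: List[Optional[str]] = []
    for i, speaker in enumerate(speakers):
        neighbours = []
        if i > 0:
            neighbours.append(speakers[i - 1])
        if i + 1 < len(speakers):
            neighbours.append(speakers[i + 1])
        addressee = next(
            (n for n in neighbours if n is not None and n != speaker), None
        )
        result.append(addressee)
    return result


if __name__ == "__main__":
    quotes = ["Anna", "Bram", None, None, None]
    filled = extrapolate_speakers(quotes)
    print(filled)
    print(assign_addressees(filled))
```

In this sketch, a dialogue in which only the first two quotes have explicit speakers is completed by alternating those two names; quotes remain unattributed when no explicit speaker is available at the same parity of turns.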

Our system improves on a previous system with the same architecture, GroRef, when evaluated on the CLIN26 shared task development set (BLANC 33.12 vs 31.48). Evaluation on the SemEval 2010 Dutch development set suggests that differences in annotation guidelines prevent the system from attaining a reasonable score; in particular, the use of predicted mentions has a large effect on the score and on the system’s decisions. With gold mentions, our system does outperform the Dutch system in SemEval 2010 (BLANC 66.75 vs 65.3). We also annotated the first 100 sentences of 10 novels by manually correcting our system output. On these novels, our system obtains a BLANC score of 69.42.

Two directions for future work suggest themselves. First, a classifier could improve the detection of mentions and singletons (mentions that do not corefer). Second, since Sonar contains a 1-million-word coreference dataset, there is sufficient data to train a deep learning system.