Examining Existing Information Extraction Tools on Manually-Annotated Protest Events in Indian News
Berfu Büyüköz, Ali Hürriyetoğlu, Erdem Yörük and Deniz Yüret


This study is part of an ongoing project that aims to identify a new welfare regime in emerging market economies and to explain why it has emerged. To extract the needed information from vast news archives, we develop Information Extraction (IE) models.

Before building our models, we examine several prominent existing IE toolkits, including Stanford NER, NeuroNER, spaCy, PETRARCH, and an event extraction tool created by the BLENDER Lab. In this paper, we report a detailed evaluation of these tools.

India is one of the countries with an emerging market economy. Therefore, we use 116 manually annotated Times of India news texts as our test data. For comparison, we also test the tools on the ACE 2005 and CoNLL 2003 datasets.
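An evaluation of this kind typically scores the spans predicted by each tool against the gold annotations using precision, recall, and F1. The sketch below illustrates the idea with exact-match span scoring; the function and the example annotations are hypothetical, not taken from the datasets above:

```python
def span_f1(gold, pred):
    """Precision, recall, and F1 over exact-match annotation spans.

    gold, pred: sets of (start, end, label) tuples.
    """
    tp = len(gold & pred)  # spans that match gold exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical character-offset annotations for one news sentence
gold = {(0, 7, "ORG"), (25, 31, "LOC")}
pred = {(0, 7, "ORG"), (40, 45, "PER")}
p, r, f = span_f1(gold, pred)  # one true positive out of two on each side
```

In practice, evaluations often also report partial-match variants, since tools disagree on exact span boundaries more than on entity presence.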

The English used in Times of India news texts can be seen as a distinct variety of English, shaped by quite different events, people, and places. These variations inevitably introduce differences in the usage of English, as well as in context.

These experiments help us test the usability of the tools and compare different modeling approaches, such as rule-based and neural ones. Most importantly, we demonstrate how well existing tools perform on a variety of English rather than on the frequently used mainstream native English data, i.e., how generalizable they are. We believe that analyzing the strengths and weaknesses of what we have as an NLP community is the key to building better systems.