CLIN 29 in Groningen

The Tweet-NL Alarm Corpus
Martijn Bartelds, Tommaso Caselli, Hetty Wessel-Smit and Bassam Shoukri

This abstract reports on the creation of a Dutch language resource for the automatic identification and classification of alarm messages on Twitter, based on the four C2000 alarm sub-classes of ‘bezitsaantasting’ (property violation) used by the Dutch emergency services.

Using the C2000 sub-classes ‘diefstal’ (theft), ‘inbraak’ (break-in), ‘vernieling’ (destruction) and ‘overval’ (robbery), we generated a dictionary of 32 keywords with Cornetto. We then collected 3,079,588 tweets and expanded the dictionary to 130 keywords by selecting relevant co-occurrences using Pointwise Mutual Information.
We extracted 64,836 tweets using the expanded dictionary and conducted two rounds of manual annotation. The first, with trained annotators, resulted in 4,983 annotated tweets from the initial dataset; the second, with domain experts, resulted in 2,125 annotated tweets from the second dataset. The two datasets were then merged to form the Tweet-NL Alarm Corpus v1.0. The corpus contains 7,108 tweets, distributed as follows: diefstal (920), inbraak (382), overval (571), vernieling (394), and other (4,841).
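
The keyword expansion by Pointwise Mutual Information can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tweet-level co-occurrence window, the whitespace tokenisation, the thresholds `min_pmi` and `min_joint`, the function names and the toy tweets are all assumptions made for illustration. The abstract does not say how the relevant co-occurrences were selected, so the automatic threshold below merely stands in for that selection step.

```python
import math
from collections import Counter


def pmi(joint, count_a, count_b, n):
    """PMI = log2( P(a, b) / (P(a) * P(b)) ), estimated from tweet counts."""
    return math.log2(joint * n / (count_a * count_b))


def expand_keywords(tweets, seeds, min_pmi=1.0, min_joint=2):
    """Return the seeds plus words that co-occur with a seed strongly enough.

    Assumptions (not from the abstract): co-occurrence is counted per tweet,
    tokens are lowercased whitespace-separated strings, and both thresholds
    are arbitrary illustrative values.
    """
    n = len(tweets)
    word_df = Counter()  # number of tweets containing each word
    pair_df = Counter()  # number of tweets containing both a seed and a word
    for text in tweets:
        tokens = set(text.lower().split())
        word_df.update(tokens)
        for seed in seeds & tokens:
            for word in tokens - seeds:
                pair_df[(seed, word)] += 1

    expanded = set(seeds)
    for (seed, word), joint in pair_df.items():
        if joint < min_joint:
            continue  # PMI overrates rare pairs, so require a minimum count
        if pmi(joint, word_df[seed], word_df[word], n) >= min_pmi:
            expanded.add(word)
    return expanded


# Toy data, invented for this sketch only.
toy_tweets = [
    "inbraak in woning gemeld",
    "inbraak woning politie zoekt getuigen",
    "mooie dag in de stad",
    "fiets diefstal bij station",
    "diefstal fiets gemeld in centrum",
    "lekker weer in de stad",
]
toy_seeds = {"inbraak", "diefstal"}

print(sorted(expand_keywords(toy_tweets, toy_seeds)))
# prints ['diefstal', 'fiets', 'inbraak', 'woning']
```

In this toy example ‘woning’ and ‘fiets’ are added because each appears twice with a seed keyword and has a PMI of log2(3), while the frequent word ‘in’ is rejected because its PMI with either seed is negative.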

We are currently experimenting with automatic classifiers for alarm message identification and classification.