Tailoring Data Collection towards a Dutch Corpus for Hate Speech Detection
Juliet van Rosendaal and Malvina Nissim


The automatic detection of hate speech in online communities has become a pressing need. At recent major evaluation campaigns, shared tasks have been organised on this problem to promote the development of working systems and, at the same time, to provide a better understanding of the phenomenon. As a consequence, annotated datasets are now available for a series of languages, including English (tasks at SemEval), Italian (tasks at EVALITA), and Spanish (tasks at IberEval).

Such a resource, however, does not yet exist for Dutch. Manual annotation is time consuming, but judging by the sizes of existing datasets it is feasible, provided proper guidelines are in place. One outstanding problem, though, is data collection prior to annotation. While some language phenomena are widespread in any (social media) text we might collect, hate speech is not necessarily so: by randomly sampling from platforms such as Twitter and Reddit, one could end up having to go through a large amount of non-hateful messages before finding hateful ones, resulting in a highly skewed, and possibly not very useful, dataset. While realistic proportions are important, too little data for the phenomenon at hand does not allow for developing successful models.

In this contribution, we present a simple, completely data-driven method for the creation of a Dutch corpus for hate-speech annotation. It exploits cross-information from Twitter and Reddit, relying mainly on tf-idf and keyword matching. Through a series of progressive refinements, we show how even a qualitative analysis of the results highlights the benefits of our approach.
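To make the idea concrete, the general shape of such a tf-idf-plus-keyword-matching pipeline can be sketched as follows. This is a minimal illustration, not the authors' actual implementation: the scoring variant, the helper names, and the assumption that one collection (e.g. Reddit threads) serves as the target and another as a neutral background are all ours.

```python
import math
from collections import Counter

def top_tfidf_terms(target_docs, background_docs, k=5):
    """Rank terms that are frequent in the target collection but rare
    overall (a basic tf-idf variant), and return the top k.

    target_docs / background_docs: lists of token lists. In our sketch,
    target_docs could be messages from communities where hateful content
    is more likely, and background_docs a general sample (assumption).
    """
    all_docs = target_docs + background_docs
    n_docs = len(all_docs)
    # document frequency, computed over the combined collection
    df = Counter()
    for doc in all_docs:
        df.update(set(doc))
    # term frequency, computed over the target collection only
    tf = Counter(t for doc in target_docs for t in doc)
    scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

def matches_keywords(tokens, keywords):
    """Keep a message if it contains any of the extracted seed keywords."""
    return any(t in keywords for t in tokens)

# Toy usage: extract distinctive terms from one platform, then use them
# to filter candidate messages collected from the other.
target = [["foo", "bar"], ["foo", "baz"]]
background = [["bar", "qux"], ["qux", "baz"]]
keywords = set(top_tfidf_terms(target, background, k=1))
print(keywords)                                  # the distinctive term(s)
print(matches_keywords(["a", "foo"], keywords))  # candidate is retained
```

In practice the keyword list would be refined over several rounds, in line with the progressive refinements mentioned above, before the retained messages are handed to annotators.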