CLIN 29 in Groningen

Detecting Controversy in Dutch News
Daphne Groot and Tommaso Caselli

In this work, we investigate automatic controversy detection in Dutch news using a distantly supervised approach based on entropy.
We collected a total of 1859 news articles from the Facebook pages of six Dutch news providers (NOS, RTL Nieuws, de Volkskrant, het Parool, NRC and de Telegraaf), together with the reactions of Facebook users to each article (LIKE, LOVE, HAHA, WOW, SAD and ANGRY). We used the reactions as proxies for controversy, assuming that the higher the entropy of the reaction distribution, the more controversial the news item. A manual inspection of the top 10 and bottom 10 news items in the dataset, ranked by entropy, confirmed the validity of this intuition.
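The entropy-based proxy described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `reaction_entropy`, the base-2 logarithm, and the optional normalization by the maximum entropy are assumptions, and the reaction counts in the example are invented.

```python
import math

# The six Facebook reaction types named in the abstract.
REACTIONS = ("LIKE", "LOVE", "HAHA", "WOW", "SAD", "ANGRY")


def reaction_entropy(counts, normalize=False):
    """Shannon entropy (base 2) of a reaction-count distribution.

    `counts` maps reaction names to non-negative integers.
    With normalize=True the result is divided by log2(6), so it lies in [0, 1].
    Both the log base and the normalization are assumptions of this sketch.
    """
    total = sum(counts.get(r, 0) for r in REACTIONS)
    if total == 0:
        return 0.0  # no reactions: treat as zero entropy
    entropy = 0.0
    for r in REACTIONS:
        c = counts.get(r, 0)
        if c > 0:
            p = c / total
            entropy -= p * math.log2(p)
    if normalize:
        entropy /= math.log2(len(REACTIONS))
    return entropy


# Invented counts: one article with near-unanimous reactions, one with mixed ones.
one_sided = {"LIKE": 950, "LOVE": 50}
divided = {"LIKE": 300, "HAHA": 250, "SAD": 150, "ANGRY": 300}

print(reaction_entropy(one_sided), reaction_entropy(divided))
```

Under this definition an article whose reactions are spread over several types scores higher than one dominated by a single reaction, which is the ordering the controversy assumption relies on.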
We then developed a linear regression model to predict the controversy of a news item from token and character n-grams. As a baseline, we used a dummy regressor that always predicts the mean entropy. We investigated three experimental settings: (i) all-news, a 10-fold cross-validated model on the full corpus; (ii) in-source, a 10-fold cross-validated model on each news source separately; and (iii) across-source, where we trained on one news source and tested on the other five (e.g. train on NOS and test on het Parool). The model beat the baseline in all experimental settings. In all-news, the model obtained MSE = 0.033 (baseline MSE = 0.049); in in-source, the average model MSE = 0.036 (baseline MSE = 0.042); and in across-source, the average model MSE = 0.052 (baseline MSE = 0.59).
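A setup of this kind could be sketched with scikit-learn as shown below. This is only an illustration under stated assumptions: the abstract does not name a toolkit, a feature weighting, or n-gram ranges, so the use of TF-IDF, the ranges (1, 2) and (2, 4), and the helper names `build_model`, `cross_validated_mse` and `baseline_mse` are all hypothetical. The toy texts and target values are invented and the fold count is reduced to 3 to fit them.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import FeatureUnion, make_pipeline


def build_model():
    """Linear regression over token and character n-grams (ranges are assumed)."""
    features = FeatureUnion([
        ("tokens", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
        ("chars", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
    ])
    return make_pipeline(features, LinearRegression())


def cross_validated_mse(estimator, inputs, targets, n_splits=10, seed=0):
    """MSE of out-of-fold predictions from k-fold cross-validation."""
    targets = np.asarray(targets, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    predictions = cross_val_predict(estimator, inputs, targets, cv=folds)
    return mean_squared_error(targets, predictions)


def baseline_mse(targets, n_splits=10, seed=0):
    """Dummy baseline: always predict the mean target of the training folds."""
    # The dummy regressor ignores its inputs, so a numeric placeholder is enough.
    placeholder = np.zeros((len(targets), 1))
    dummy = DummyRegressor(strategy="mean")
    return cross_validated_mse(dummy, placeholder, targets, n_splits, seed)


# Invented toy data, for illustration only (not from the corpus).
toy_texts = [
    "kabinet verhoogt belasting op brandstof",
    "zonnig weekend verwacht in heel het land",
    "debat over migratie loopt hoog op",
    "nieuwe fietsbrug geopend in Groningen",
    "protest tegen sluiting ziekenhuis",
    "bibliotheek verlengt openingstijden",
]
toy_entropy = [0.62, 0.15, 0.71, 0.12, 0.58, 0.10]

print("model MSE:   ", cross_validated_mse(build_model(), toy_texts, toy_entropy, n_splits=3))
print("baseline MSE:", baseline_mse(toy_entropy, n_splits=3))
```

The across-source setting would instead fit `build_model()` on the articles of one provider and score it on the articles of the remaining providers, with the baseline predicting the mean entropy of the training provider.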

We extended the model to predict topics, the number of reactions and their type to form a complete pipeline.