Domain influence in language variety recognition on Dutch subtitles
Hans Van Halteren


This paper describes experiments in which I tried to distinguish between Flemish and Netherlandic Dutch subtitles, as originally proposed in the VarDial 2018 Dutch-Flemish Subtitle task. During VarDial, all data was used as a monolithic block, with train and test data drawn randomly from this same block. As specific programs tended to be subtitled in either Flanders or the Netherlands, language variety recognition was mixed with domain recognition.

To investigate how the relation between training and test domains influences recognition quality, I divided the data into two non-overlapping domains and repeated the train-test cycle for all possible combinations of training and test domain. I argue that the best impression of the recognizability of the language varieties is gained when training on one domain and testing on the other.
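
As an illustration only, the train-test scenarios over two domains can be enumerated as in the following minimal Python sketch. The domain names A and B and the function train_test_scenarios are hypothetical and not taken from the experiments. The sketch assumes training on a single domain at a time. In-domain scenarios would additionally need disjoint training and test portions within the domain, which is not shown here.

```python
from itertools import product

# Hypothetical placeholder names for the two non-overlapping domains.
DOMAINS = ("A", "B")


def train_test_scenarios(domains=DOMAINS):
    """Return every (train_domain, test_domain, kind) combination.

    kind is "in-domain" when training and test domain coincide,
    and "out-of-domain" otherwise.
    """
    scenarios = []
    for train, test in product(domains, repeat=2):
        kind = "in-domain" if train == test else "out-of-domain"
        scenarios.append((train, test, kind))
    return scenarios


if __name__ == "__main__":
    for train, test, kind in train_test_scenarios():
        print(f"train on {train}, test on {test}: {kind}")
```

With two domains this yields four scenarios, two in-domain and two out-of-domain. The out-of-domain ones are those in which domain cues cannot stand in for language variety cues.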

Apart from the quantitative results, I also provide a qualitative analysis, investigating in detail the most distinguishing features in the various scenarios. Here too, it is with out-of-domain recognition that some genuine differences between Flemish and Netherlandic Dutch emerge.