CLIN 29 in Groningen

How to optimize your Twitter collection: Dutch keyword sets for various research goals.
Tim Kreutz and Walter Daelemans

Methods for collecting Dutch tweets have always been subject to Twitter API constraints. Within these constraints, we compare two general approaches to tapping into the live Twitter stream: user-based and content-based collection. The content-based method, which selects tweets by up to 400 keywords, is shown to collect more tweets, but it trades off precision for recall. In this paper we optimize keyword sets for collecting Dutch tweets with the content-based method. We describe three research settings, each of which changes the definition of ‘optimal’. The first setting aims to collect as many Dutch tweets as possible (optimizing recall), so that the set can later be filtered by more sophisticated language identifiers. The second setting aims for the best balance between precision and recall, which in practice is shown to behave like a clean sample of Dutch tweets. In the third setting, we identify user locations in order to optimize keyword sets for Flemish and Dutch users specifically. Optimizing for these specific subsets of users yields slight but interesting differences in the keyword sets.
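
To make the precision/recall trade-off concrete, the sketch below scores a keyword set against a small labelled sample and grows the set greedily under the 400-keyword limit. It is an illustration only, not the procedure used in this work: the greedy search, the F-beta objective, the whitespace tokenization, the function names, and the toy tweets are all assumptions introduced here. A beta above 1 weights recall more heavily (closer to the first setting), while beta = 1 balances precision and recall (closer to the second).

```python
from typing import List, Set, Tuple

# Hypothetical labelled sample: (tweet text, is_dutch). Toy data for illustration.
Tweet = Tuple[str, bool]


def matches(text: str, keywords: Set[str]) -> bool:
    """A tweet is collected if any whitespace-separated token is a keyword."""
    return any(token in keywords for token in text.lower().split())


def precision_recall(tweets: List[Tweet], keywords: Set[str]) -> Tuple[float, float]:
    """Precision and recall of the keyword set for retrieving Dutch tweets."""
    matched = [is_dutch for text, is_dutch in tweets if matches(text, keywords)]
    total_dutch = sum(is_dutch for _, is_dutch in tweets)
    true_positives = sum(matched)
    precision = true_positives / len(matched) if matched else 0.0
    recall = true_positives / total_dutch if total_dutch else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta score; beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def greedy_keyword_set(tweets: List[Tweet], candidates: Set[str],
                       beta: float = 1.0, max_keywords: int = 400) -> Set[str]:
    """Add the keyword that most improves F-beta until nothing helps
    or the keyword limit is reached (assumed search strategy)."""
    selected: Set[str] = set()
    best_score = 0.0
    while len(selected) < max_keywords:
        best_keyword = None
        for keyword in sorted(candidates - selected):  # sorted for determinism
            p, r = precision_recall(tweets, selected | {keyword})
            score = f_beta(p, r, beta)
            if score > best_score:
                best_score, best_keyword = score, keyword
        if best_keyword is None:
            break
        selected.add(best_keyword)
    return selected


sample = [
    ("ik ga de stad in", True),
    ("de trein is weer te laat", True),
    ("het is mooi weer", True),
    ("in the end it is fine", False),
    ("la vida es bella", False),
]
candidates = {"de", "het", "in", "is"}

selected = greedy_keyword_set(sample, candidates, beta=1.0)
print(sorted(selected), precision_recall(sample, selected))
```

On this toy sample the ambiguous keywords "in" and "is" are left out because they also match the English tweet. For the third setting, the same scoring could be run on a sample restricted to users from one region, assuming such location labels are available.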