This page contains the datasets used in [1], i.e. training and test datasets for Dutch in retagged CoNLL format. The data was converted from Alpino XML into CoNLL format based on an adapted version of Erwin Marsi's conversion software [2], but PoS tags were replaced by automatically assigned Alpino tags.


The freely available data can be found at: data/
Training data (cdb), retagged with Alpino tags. Newspaper text, 7136 sentences.
CoNLL 2006 test data for Dutch, retagged with Alpino tags. Institutional brochure about youth health, 386 sentences.
A license from the Dutch TST-centrale is necessary to use the following data.
Wikipedia articles annotated during the LASSY project and converted into retagged CoNLL format. Various domains, 95 Wikipedia articles.


[1] Barbara Plank. Improved statistical measures to assess natural language parser performance across domains. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC2010), Valletta, Malta, May 2010.

[2] The conversion of Alpino Treebank XML to CoNLL format is based on Erwin Marsi's tool developed for CoNLL-X 2007, available at: However, instead of using MBT tags, we adapted the conversion scripts so that they use Alpino PoS tags. (The adapted scripts will be made available here.)

