Alpino2conll

Barbara Plank <b.plank AT rug.nl>

This page contains the datasets used in [1], i.e. training and test datasets for Dutch in retagged CoNLL format. The data was converted from Alpino XML into CoNLL format using an adapted version of Erwin Marsi's conversion software [2], with the PoS tags replaced by automatically assigned Alpino tags.

Data

The freely available data can be found at: data/
cdb.conll.utf8: Training data (cdb), retagged with Alpino tags. Newspaper text, 7136 sentences.
conll2006-test.conll: CoNLL 2006 test data for Dutch, retagged with Alpino tags. Institutional brochure about youth health, 386 sentences.
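
As a quick sanity check after downloading, the sentence counts given above can be compared against the files themselves. The following is a minimal sketch, assuming the files follow the usual CoNLL-X layout (one token per line, tab-separated columns beginning with ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, and a blank line between sentences) and are UTF-8 encoded. The names Token, read_sentences and count_sentences are illustrative only and are not part of the distributed data or scripts.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """First eight CoNLL-X columns of one token line (hypothetical helper type)."""
    id: str
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: str
    deprel: str


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence; blank lines separate sentences."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            raise ValueError(f"expected at least 8 tab-separated columns: {line!r}")
        sentence.append(Token(*cols[:8]))
    # The last sentence may not be followed by a blank line.
    if sentence:
        yield sentence


def count_sentences(path: str, encoding: str = "utf-8") -> int:
    """Count sentences in a CoNLL file (the encoding is an assumption)."""
    with open(path, encoding=encoding) as handle:
        return sum(1 for _ in read_sentences(handle))
```

Calling count_sentences("cdb.conll.utf8") on the downloaded training file should then give the number of sentences listed above, provided the assumptions about the layout hold.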
Other
A license from the Dutch TST-centrale is necessary to use the following data.
wikipedia-v0.1.tar.gz: Wikipedia articles annotated during the LASSY project and converted into retagged CoNLL format. Various domains, 95 Wikipedia articles.

References

[1] Barbara Plank. Improved statistical measures to assess natural language parser performance across domains. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), Valletta, Malta, May 2010.

[2] The conversion of Alpino Treebank XML to CoNLL format is based on Erwin Marsi's tool developed for CoNLL-X 2007, available at: http://nextens.uvt.nl/depparse-wiki/SharedTaskWebsite (instead of using MBT tags, we adapted the conversion scripts so that they use Alpino PoS tags; the adapted scripts will be made available here).

