This page provides access to a collection of resources that supply category information for named entities in Dutch. It integrates information obtained from three other resources: Wikipedia categories (nl), the Dutch wordnet (Cornetto), and category labels mined from large Dutch text corpora (TwNC and Wikipedia, see also LASSY).
All the Wikipedia data was collected using the June 2008 XML version of Dutch Wikipedia, provided by the University of Amsterdam. Wikipedia categories for entities were obtained by simply collecting the categories provided for each Wikipedia page. Supercategories were obtained by collecting the categories mentioned on each category page. Note that the category structure of Wikipedia has some drawbacks. Links are sometimes redundant (i.e. a page mentions a category as well as one of its supercategories, e.g. Louis Armstrong is in category Jazztrompettist and in Jazzmusicus, where the latter is a supercategory of the former), and sometimes even circular (Aërodynamica). No attempts have been made to clean the Wikipedia category labels and relationships.
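Since the category data is distributed uncleaned, users may want to detect redundant and circular links themselves. The following is a minimal sketch of one way to do so, not the procedure used to build the resource. It assumes the category relations have been loaded into a plain dictionary mapping each category to its set of direct supercategories; the function names, the `Musicus` supercategory, and the categories `X` and `Y` are hypothetical and serve only as illustration.

```python
def ancestors(cat, supercats):
    """Return every category reachable from `cat` via supercategory links."""
    seen = set()
    stack = list(supercats.get(cat, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(supercats.get(current, ()))
    return seen


def redundant_links(page_cats, supercats):
    """Return the categories of a page that are also (indirect)
    supercategories of another category listed on the same page."""
    page_cats = set(page_cats)
    redundant = set()
    for cat in page_cats:
        redundant |= ancestors(cat, supercats) & page_cats
    return redundant


def is_circular(cat, supercats):
    """A category is part of a cycle if it is its own ancestor."""
    return cat in ancestors(cat, supercats)


# Toy data: only the Jazztrompettist -> Jazzmusicus link comes from the text;
# "Musicus", "X" and "Y" are made-up entries for illustration.
supercats = {
    "Jazztrompettist": {"Jazzmusicus"},
    "Jazzmusicus": {"Musicus"},
    "X": {"Y"},
    "Y": {"X"},
}
page_cats = {"Jazztrompettist", "Jazzmusicus"}  # categories on the Louis Armstrong page

print(redundant_links(page_cats, supercats))
print(is_circular("X", supercats))
```

Because `ancestors` tracks the categories it has already visited, the traversal terminates even when the category graph contains cycles.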
Yago is a resource that links English Wikipedia category labels to WordNet. We provide similar links for Dutch, using Dutch Wikipedia and Cornetto, a Dutch wordnet. Links between category labels for pages in Wikipedia and word senses in Cornetto have been established automatically, and thus the data is noisy to some extent. Our approach to linking Wikipedia and Cornetto is described in more detail in Bouma 2009, but note that the paper discusses an experiment that used an earlier version of Wikipedia and EuroWordNet (the predecessor of Cornetto).
We collected isa-labels for named entities from an automatically parsed version of the Twente news corpus and a plain text version of Wikipedia. We extracted
We have used these resources in our work on question answering and GikiCLEF (an entity ranking task).
Note that we only provide limited information on Cornetto, as Cornetto is not in the public domain. In particular, links to synonyms or broader terms are not provided. These can be obtained easily, however, from the Cornetto sources.