
Dutch Ontology Resources

This page provides access to a collection of resources that supply category information for named entities in Dutch. It integrates information from three sources: Dutch Wikipedia categories (nl), the Dutch wordnet Cornetto, and category labels mined from large Dutch text corpora (TwNC and Wikipedia; see also LASSY).

Explore


Entity (thing, page) Category (concept, class)

About

All Wikipedia data was collected from the June 2008 XML version of Dutch Wikipedia, provided by the University of Amsterdam. Wikipedia categories for entities were obtained by collecting the categories listed on each Wikipedia page. Supercategories were obtained by collecting the categories mentioned on each category page. Note that the category structure of Wikipedia has some drawbacks. Links are sometimes redundant (i.e. a page mentions a category as well as one of its supercategories; e.g. Louis Armstrong is in both Jazztrompettist and Jazzmusicus, where the latter is a supercategory of the former) and sometimes even circular (Aërodynamica). No attempt has been made to clean the Wikipedia category labels and relationships.
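
Users who want to filter these links themselves could start from something like the following minimal Python sketch. It is an illustration only, not part of the distributed resources: the function names, the dictionary layout, and the category labels X and Y are assumptions made for the example, and the actual download format may differ. Only the Louis Armstrong categories are taken from the description above.

```python
from typing import Dict, Set

# Toy supercategory map (category -> categories mentioned on its category page).
# "X" and "Y" are hypothetical labels that stand in for a circular pair.
SUPERCATS: Dict[str, Set[str]] = {
    "Jazztrompettist": {"Jazzmusicus"},
    "Jazzmusicus": set(),
    "X": {"Y"},
    "Y": {"X"},
}


def ancestors(category: str, supercats: Dict[str, Set[str]]) -> Set[str]:
    """Return every category reachable by following supercategory links.

    The `seen` set guarantees termination even when the links are circular.
    """
    seen: Set[str] = set()
    stack = list(supercats.get(category, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(supercats.get(current, ()))
    return seen


def redundant_categories(page_cats: Set[str],
                         supercats: Dict[str, Set[str]]) -> Set[str]:
    """Return the page categories that are also a supercategory
    (direct or indirect) of another category on the same page."""
    redundant: Set[str] = set()
    for cat in page_cats:
        redundant |= (ancestors(cat, supercats) & page_cats) - {cat}
    return redundant


def is_circular(category: str, supercats: Dict[str, Set[str]]) -> bool:
    """A category is circular if it is reachable from itself."""
    return category in ancestors(category, supercats)


armstrong = {"Jazztrompettist", "Jazzmusicus"}
print(redundant_categories(armstrong, SUPERCATS))  # {'Jazzmusicus'}
print(is_circular("X", SUPERCATS))                 # True
```

Dropping the redundant categories keeps only the most specific label for each page, which may or may not be desirable depending on the application.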

Yago is a resource that links English Wikipedia category labels to WordNet. We provide similar links for Dutch, using Dutch Wikipedia and Cornetto, a Dutch wordnet. Links between category labels for Wikipedia pages and word senses in Cornetto were established automatically, so the data is noisy to some extent. Our approach to linking Wikipedia and Cornetto is described in more detail in Bouma 2009, but note that the paper discusses an experiment that used an earlier version of Wikipedia and EuroWordNet (the predecessor of Cornetto).

We collected isa-labels for named entities from an automatically parsed version of the Twente news corpus and a plain text version of Wikipedia. We extracted (adjective) noun name tuples parsed as part of a single NP (e.g. schiereiland Gallipoli or Noorse krant Aftenposten). This produces a large number of category labels for named entities. The results are rather noisy, however, because not all preceding nouns are proper category labels (e.g. naamgenoot) and because of parsing errors. We did attempt to filter irrelevant labels, but quite a bit of noise remains. Instead of storing the word forms of adjectives and nouns, we stored the stemmed and morphologically analyzed forms of these words (for use in a question answering system). The construction of this dataset is described in detail in van der Plas 2008 (chapter 6).
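
The extraction pattern can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the pipeline described in van der Plas 2008: it works on a flat sequence of tagged tokens rather than on parsed NPs, and the tag names (adj, noun, name), the token layout, the lemmas, and the one-word stoplist are all hypothetical choices made for the example.

```python
from typing import List, Optional, Tuple

# A token is (word form, stem/lemma, tag). The tag set is an assumption.
Token = Tuple[str, str, str]
IsaTuple = Tuple[Optional[str], str, str]  # (adjective stem, noun stem, name)

# Nouns that precede names but are not category labels (toy stoplist).
STOP_NOUNS = {"naamgenoot"}


def extract_isa_tuples(tokens: List[Token]) -> List[IsaTuple]:
    """Collect (adjective) noun name tuples from a tagged token sequence.

    A tuple is produced when a noun is directly followed by one or more
    name tokens; a directly preceding adjective is included if present.
    Stems are stored for the adjective and noun, the word form for the name.
    """
    results: List[IsaTuple] = []
    for i, (_, stem, tag) in enumerate(tokens):
        if tag != "noun" or stem in STOP_NOUNS:
            continue
        # Gather the (possibly multi-word) name that follows the noun.
        j = i + 1
        name_parts: List[str] = []
        while j < len(tokens) and tokens[j][2] == "name":
            name_parts.append(tokens[j][0])
            j += 1
        if not name_parts:
            continue
        adjective = tokens[i - 1][1] if i > 0 and tokens[i - 1][2] == "adj" else None
        results.append((adjective, stem, " ".join(name_parts)))
    return results


# Toy input: the two examples from the text, tagged by hand.
krant = [("de", "de", "det"), ("Noorse", "Noors", "adj"),
         ("krant", "krant", "noun"), ("Aftenposten", "Aftenposten", "name")]
eiland = [("het", "het", "det"), ("schiereiland", "schiereiland", "noun"),
          ("Gallipoli", "Gallipoli", "name")]

print(extract_isa_tuples(krant))   # [('Noors', 'krant', 'Aftenposten')]
print(extract_isa_tuples(eiland))  # [(None, 'schiereiland', 'Gallipoli')]
```

A flat pattern like this will over-generate compared with extraction from parsed NPs, which is one reason the sketch should be read as an illustration of the tuple format rather than as a way to reproduce the dataset.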

We have used these resources in our work on question answering and GikiCLEF (an entity ranking task).

Statistics

Download

Note that we only provide limited information on Cornetto, as Cornetto is not in the public domain. In particular, links to synonyms or broader terms are not provided. These can be obtained easily, however, from the Cornetto sources.

Contact

Remarks, Suggestions, Requests: Gosse Bouma