Geo-temporal named entity recognition in multilingual picture collections
Hans Paulussen and Dirk De Hertog
This talk describes the creation of a gold standard for Named Entity Recognition (NER), limited to spatio-temporal relations as applied to descriptions of multilingual picture collections. The gold standard will be used for training NER tools as part of the UGESCO project, which aims to develop geo-temporal (meta)data extraction routines and enrichment tools.
The challenges stem from the nature of the texts used for picture descriptions and from the types of enrichment required for geo-temporal named entity recognition. The main issues are the brevity and the structure of the photo descriptions. Texts describing pictures are usually short and lack co-textual information. Description fields often have a reduced syntactic structure, which requires specific training or adaptation of existing NLP tools. Brevity also matters in a multilingual context: the database of photo descriptions contains both Dutch and French text samples, and such short samples make automatic language identification difficult.
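The language identification problem can be illustrated with a minimal sketch. It is not the method used in the project: it is a toy function-word counter, the word lists are small hand-picked assumptions, and the name guess_language and the sample descriptions are invented for illustration. It shows that a description with a few function words can be classified, whereas one consisting only of names and dates offers no evidence at all.

```python
import re
from typing import Optional

# Small, hand-picked function-word lists (an assumption for illustration).
# Words shared by both languages, such as "de" and "en", are left out on purpose.
FUNCTION_WORDS = {
    "nl": {"het", "een", "van", "op", "bij", "met", "voor", "naar"},
    "fr": {"le", "la", "les", "un", "une", "du", "des", "dans", "sur", "avec", "et"},
}


def guess_language(text: str) -> Optional[str]:
    """Return "nl" or "fr", or None when the text gives no clear evidence."""
    # Keep alphabetic tokens only; digits and punctuation carry no language signal.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in FUNCTION_WORDS.items()
    }
    best = max(scores.values())
    winners = [lang for lang, score in scores.items() if score == best]
    # No function word found, or a tie between languages: refuse to guess.
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


if __name__ == "__main__":
    for description in (
        "Zicht op het stadhuis van Leuven",
        "Vue sur la gare du Midi",
        "Grand-Place, Bruxelles, 1936",
    ):
        print(description, "->", guess_language(description))
```

Returning None rather than forcing a decision keeps undecidable descriptions visible, so that they can be handled separately, for instance by manual inspection.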
Previous NER research has mainly focused on broad NE categories (i.e. persons, organisations and locations), mostly in journalistic texts. In this project, the focus is on subcategories of time and location. In other words, we restrict the set of general NE categories, but increase the granularity within time and location.
The granularity of locations influences the type of annotation used. Whereas general NE categories mainly cover proper nouns, the location subcategories require multi-word units that are not limited to proper nouns.
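One common way to annotate such multi-word units at token level is a BIO encoding, sketched below. The abstract does not state which annotation scheme or label set the gold standard uses, so the labels LOC-BUILDING and TIME-PERIOD, the function spans_to_bio and the sample description are all hypothetical.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, label)
Span = Tuple[int, int, str]


def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Convert token-level entity spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span ({start}, {end}) is out of range")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"span ({start}, {end}) overlaps another span")
        tags[start] = f"B-{label}"  # first token of the unit
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the same unit
    return tags


if __name__ == "__main__":
    # Invented Dutch description: "Tram in front of the town hall of Leuven, summer 1936".
    tokens = ["Tram", "voor", "het", "stadhuis", "van", "Leuven", ",", "zomer", "1936"]
    # Hypothetical subcategory labels; the location unit includes a common noun.
    spans = [(3, 6, "LOC-BUILDING"), (7, 9, "TIME-PERIOD")]
    for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
        print(f"{token}\t{tag}")
```

In this invented example the location unit "stadhuis van Leuven" spans three tokens, only one of which is a proper noun, which is the kind of unit a proper-noun-only annotation would miss.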