The diachronic Dutch Lexicon, TICCLAT style
Martin Reynaert, Patrick Bos and Janneke van der Zwaan


The TICCLAT project, currently underway at the Meertens Institute and the Netherlands eScience Center, is a wholesale, mainly statistically guided attempt to automatically chart the full lexicon of Dutch throughout its history on the basis of the corpora unified in the Nederlab project. Nederlab has now delivered a uniformly processed and linguistically enriched diachronic corpus of Dutch containing about 18.5 billion word tokens. TICCLAT starts from a solid basis composed of the available validated lexicons and name lists for Dutch and complements it with the full extent of the data collected from the Nederlab corpus: word type statistics, word dispersion information, document, time and locality references, and linguistic annotations.

The relevant modules of the Text-Induced Corpus Clean-Up system, or TICCL, will be employed to fill the morphological paradigms for the word forms collected and, given a particular lemma, to find attestations of its possible word forms in the available contemporary Dutch corpora, i.e. CGN for spoken and SoNaR for written Dutch. We will further expand the databases with more historical word types on the basis of manually digitized corpora such as the Database of Dutch Literature (DBNL) and the other corpora of early Dutch. Finally, the far larger corpora automatically digitized by the National Library of the Netherlands (KB) are to be incorporated, thereby completing the picture of real words with the OCR-produced, TICCL-linked non-word variants.

We will provide an overview of the statistics on the current state of affairs and demonstrate what the system already has to offer the user.