The availability of sufficient language and speech resources is of crucial importance for the development of language and speech technology for any given language. Resources include simple word lists (for spelling correction and hyphenation), advanced lexical databases with syntactic, semantic, and phonological information (for grammar development, disambiguation, text-to-speech, etc.), statistical information extracted from corpora, such as word-level and character-level N-gram statistics (for optical character and speech recognition tasks), tagged corpora (for training and testing part of speech taggers and grammars), parallel corpora (for automatic translation), and speech corpora of different sorts (individual words and connected speech, read and spontaneous utterances, optimal, office, and (mobile) telephone recordings). They also include various (language-specific) tools to be used as components of larger systems, such as stemmers for document indexing, and part of speech taggers and formal grammars for grammatical error correction, translation, and dialogue systems.
We investigated whether sufficient resources are currently available for Dutch and concluded that, while a number of corpora and lexical databases are available, the overall situation needs to be improved drastically.
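To make the notion of word-level and character-level N-gram statistics concrete, the following is a minimal sketch of how such counts could be extracted from raw text. The function names, the naive whitespace tokenisation, and the sample sentence are illustrative assumptions, not part of any resource discussed here.

```python
from collections import Counter


def ngrams(items, n):
    """Return all contiguous n-grams of a sequence as a list of tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def word_ngram_counts(text, n):
    """Count word-level n-grams (assumption: naive whitespace tokenisation)."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


def char_ngram_counts(text, n):
    """Count character-level n-grams over the lower-cased text, spaces included."""
    return Counter(ngrams(list(text.lower()), n))


# Hypothetical sample sentence, used only for illustration.
sample = "de kat zit op de mat en de kat slaapt"
word_bigrams = word_ngram_counts(sample, 2)
char_trigrams = char_ngram_counts(sample, 3)

print(word_bigrams[("de", "kat")])       # frequency of the bigram "de kat"
print(char_trigrams[("d", "e", " ")])    # frequency of the trigram "de "
```

Counts of this kind are only as reliable as the corpus they are drawn from, which is why the size and balance of the available text matter.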
The most important supplier of text corpora for Dutch is the Instituut voor Nederlandse Lexicografie (INL). This institute, whose prime task is the production of an exhaustive, corpus-based dictionary of Dutch, has collected several large corpora (from 5 to 50 million words) of modern Dutch, including newspaper, literary, and technical texts. The corpora are mostly labelled with part of speech information. It should be noted, however, that the labelling is the result of an automatic process, and thus contains a certain percentage of errors. Only a small fragment of the data (the 5 million word INL corpus distributed as part of the European Corpus Initiative Multilingual Corpus CD-ROM) is conveniently accessible for language technology research purposes. Access to the other corpora is restricted because of copyright issues.
A widely used text corpus is the so-called Eindhoven or Uit den Boogaart corpus. This 750.000 word corpus was collected in the sixties and early seventies, as part of a research project aiming at the collection of frequency data for written and spoken Dutch (Uit den Boogaart 75; de Jong 79). While the statistical results are still available (at least in paper form), the corpus itself is no longer actively maintained, and is therefore only available through informal channels. That it is still in use must be attributed to the fact that it is the only available medium-size corpus that was carefully tagged.
ELRA offers two multilingual text corpora that include Dutch: the Polylingual Document Collection (ELRA-W0007) and the Multilingual Parallel Corpus (ELRA-W0008). Several corpora have been created more recently: the ANNO corpus (a 640.000 word corpus automatically labelled with part of speech information, partly corrected, created at the Centre for Computational Linguistics, Leuven University [Schuurman 1997]), the OVIS corpus (a 10.000 utterance corpus with syntactic and semantic annotations, created in the context of the Dutch NWO priority programme for language and speech technology [Bonnema, Bod, and Scha 1997, Nederhof 1997]), and the PAROLE corpus (a 3 million word Dutch corpus with 250.000 words annotated with part of speech, currently under development at the INL). These corpora are quite domain and application specific (OVIS), of medium size, and not (yet) publicly available.
Text corpora for Dutch that are easily available and accessible are quite rare. For research which requires mostly raw text, this problem can be overcome relatively easily, as raw text is available from a number of sources (such as newsgroup archives, web sites, and CD-ROMs containing large quantities of text). It should be noted, however, that such corpora tend to be unbalanced (i.e. they do not contain a representative sample of the relevant texts), that the size of the corpora that can be constructed this way may still be too small for certain applications, and that there may be copyright problems. More problematic is the fact that annotated corpora are not sufficiently available, and that the corpora that do exist are annotated automatically (introducing errors) and hard to access. The creation of annotated corpora for English, such as the Penn Treebank [Marcus, Santorini, and Marcinkiewicz 1993], has had an enormous effect on the field. The development of comparable resources for Dutch, be it corpora labelled with part of speech, syntactic, or semantic information, is expensive, especially if manual correction of automatically labelled texts is included, and requires careful planning and coordination with potential users. Currently, we are not aware of any attempts to produce such text corpora on a larger scale.
The most important lexical resource for Dutch is the CELEX database [Baayen, Piepenbrock, and van Rijn 1993], available on CD-ROM and also accessible on the internet. This database (which also includes English and German data) contains detailed morpho-phonological information for approximately 125.000 lexemes and 400.000 word forms. The database provides information concerning, among other things, pronunciation, stress, syllabification, part of speech, and frequency (based on the INL corpora). Recently, the database has been brought into line with the new spelling rules for Dutch.
For spelling correction, several word lists (with hyphenation patterns) are available, including some public domain lists (http://www.iaf.nl/Users/Meridian/ words.htm). Eurodicautom ( http://www2.echo.lu/ edic/) is a large multilingual dictionary and terminology database, compiled by the Translation service of the EC.
The CELEX database is the only resource which is used widely in research on Dutch language and speech technology. It is an invaluable source of lexical, especially morpho-phonological, information. It is very difficult to obtain lexical information on areas that are not covered in CELEX. Syntactic valence information for a substantial number of predicates is a crucial ingredient of wide-coverage syntactic analyzers. Such information, comparable to what is provided for English in the COMLEX lexicon [Grishman, Macleod, and Meyers 1994], is currently not available for Dutch. Similarly, semantic information, in the form of concept networks, thesauri, etc., is not available. This is an obstacle for disambiguation, automatic translation, and information retrieval tasks. The EuroWordNet project (http://www.let.uva.nl/ewn/) has as its goal the construction of a multilingual semantic lexical database (with an estimated size of 50.000 entries per language), in the style of WordNet.
The development of speech recognition and synthesis is heavily dependent on the availability of corpora and (pronunciation) dictionaries. The earliest annotated corpora of spoken language are the Eindhoven and Groningen (available from ELRA) corpora. More recently, the POLYPHONE-NL corpus (telephone speech) was created (by SPEX) as part of a joint European effort, and the COGEN corpus was created at ELIS (Ghent University), as part of the Flemish short term programme on speech and language technology. Pronunciation information can be found in CELEX, and in FONILEX, a Flemish counterpart of CELEX that focuses on pronunciation only and was developed at Leuven University, also as part of the Flemish short term programme.
Recently, the Dutch and Flemish governments started a joint project (Corpus Gesproken Nederlands), aiming at the creation of a corpus of spoken Dutch with a size of approximately 10 million words. The corpus will be supplemented with detailed phonological as well as morphosyntactic (part of speech) labels, and a smaller part will also receive a syntactic annotation, making the corpus a valuable resource for the development of spoken language applications.
Resources for language and speech processing include not only lexical databases and corpora, but also standards for the annotation of corpora and the coding of dictionaries, tools to explore these databases, tools to perform (elementary) language and speech processing tasks, and evaluation standards. Insofar as these tools and standards are language specific, they are also a part of the infrastructure for language and speech technology for a given language. The situation for Dutch in this respect is extremely weak.
Standards for the phonetic encoding of pronunciation dictionaries appear to be well investigated. The CELEX database provides an encoding (DISC) which can automatically be translated into several standard encodings (such as SAMPA), and which has been used (with minor modifications) in related projects such as FONILEX. For part of speech tagging, several tag-sets are in use, sometimes product specific, sometimes developed in the context of European projects (such as PAROLE), and sometimes derived from the categories distinguished in the ANS (Algemene Nederlandse Spraakkunst, the standard reference grammar of Dutch). No consensus exists, a situation which appears to be a serious obstacle for the development of larger annotated corpora.
Part of speech disambiguators for Dutch exist mainly as research tools [Daelemans 1996, Berghmans 1994]. Xerox has developed a tagger for Dutch which is available for commercial and research purposes (www.rxrc.xerox.com/research/mltt/Tools/ pos.html).
Resources for the development of wide-coverage grammars are lacking. There are no explicit, formal grammars for Dutch, no test suites or tree-banks, no valence dictionaries, etc. Given this state of affairs, it is not surprising that there have been only a few attempts to develop wide-coverage, robust grammars for Dutch. One such grammar is the AMAZON parser, which uses CELEX as its dictionary [Oltmans 1994]. Since CELEX does not provide valence information, the resulting system can only be seen as a first attempt at a wide-coverage grammar.
A consequence of the lack of evaluation metrics, benchmarks, tree-banks, tagged corpora, etc., is that evaluation of existing systems is practically impossible. Within the current project, the evaluation of existing tools had to be restricted to the reactions of researchers who had experience in working with these tools, and no formal evaluation could be carried out. We believe that this state of affairs is not only a problem for determining the quality of existing systems, but also prevents the development of new and better tools.
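As an illustration of what a formal evaluation could look like once a hand-corrected benchmark corpus exists, the sketch below computes the per-token accuracy of a part of speech tagger against a gold standard. The function name, the tag labels, and the example tokens are hypothetical and do not correspond to any existing Dutch tag-set or tool.

```python
def tagging_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold-standard tag.

    Both arguments are lists of (word, tag) pairs over the same token sequence.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot evaluate on an empty corpus")
    correct = sum(1 for (_, g), (_, p) in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Hypothetical gold annotation and tagger output, for illustration only.
gold = [("de", "DET"), ("kat", "N"), ("slaapt", "V"), ("rustig", "ADV")]
predicted = [("de", "DET"), ("kat", "N"), ("slaapt", "V"), ("rustig", "ADJ")]

print(tagging_accuracy(gold, predicted))  # 3 of 4 tags agree
```

A measure of this kind presupposes both a manually verified corpus and an agreed tag-set, which are precisely the resources that are missing for Dutch.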
The language and speech infrastructure for Dutch, as far as resources are concerned, needs to be improved drastically in at least three areas. First, the available lexical databases need to be extended with fine-grained syntactic and semantic information. Second, publicly available medium-size and large corpora with part of speech, syntactic, or semantic information are currently virtually non-existent, yet such corpora are crucial for the development of robust natural language systems. Finally, the development of tools and standards needs to be stimulated, as these are hardly available for Dutch, and yet form an essential part of the language and speech infrastructure. By contrast, with the development of a large Dutch-Flemish corpus of spoken language, the state of affairs for spoken language resources is looking rather promising.