Development Plans


In [115], a detailed account of the syntactic coverage of the TST grammar is given. Although inspired by linguistic theory and designed as far as possible as a general grammar of Dutch, the current fragment is not a general, wide-coverage grammar of Dutch. In particular, lexical coverage is limited, and several grammatical constructions are either not taken into consideration (e.g. passives) or accounted for only to a certain extent (e.g. the grammar of Dutch verb clusters). The coverage of the grammar is nevertheless quite satisfactory for the TST application. For instance, when evaluating the grammar on a corpus of 1000 transcribed test sentences, we obtained a semantic concept accuracy of 95%. The evaluation results are presented in detail in [10].

Our development plans aim at building a general, wide-coverage grammar of Dutch, taking the TST grammar as a starting point. The inheritance-based set-up of the lexicon and the rule set facilitates such a development. The resulting grammar will be used as a concrete test case for experiments and evaluation in the grammar specialisation and grammar approximation projects.
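To illustrate why an inheritance-based lexicon eases extension, the following minimal sketch shows lexical types that inherit features from a parent type and override them where needed. It is not the formalism of the TST grammar: the class `LexicalType`, the feature names `cat` and `subcat`, and the type names are hypothetical and chosen for illustration only.

```python
class LexicalType:
    """A lexical type that inherits features from an optional parent type.

    Hypothetical illustration; not the actual TST grammar formalism.
    """

    def __init__(self, name, parent=None, **features):
        self.name = name
        self.parent = parent
        self.features = features

    def resolve(self):
        """Return all features, with local values overriding inherited ones."""
        inherited = self.parent.resolve() if self.parent else {}
        return {**inherited, **self.features}


# A small hierarchy: each subtype states only what differs from its parent.
verb = LexicalType("verb", cat="v")
transitive = LexicalType("transitive_verb", parent=verb, subcat=["np"])
ditransitive = LexicalType("ditransitive_verb", parent=transitive,
                           subcat=["np", "np"])

# Lexical entries simply point to a type (assumed toy entries).
lexicon = {"zien": transitive, "geven": ditransitive}

geven_features = lexicon["geven"].resolve()
zien_features = lexicon["zien"].resolve()
```

Under this set-up, adding a new verb class amounts to defining one further type that states only its differences from an existing one, while new words are added by assigning them to a type.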

Below, we give an overview of activities to be carried out as part of the grammar development effort.

Corpus Exploration

Grammar development can benefit enormously from the availability of (annotated) corpora [104]. An annotated corpus can be used for various kinds of testing (ensuring that coverage does not decrease from one version of the grammar to the next), debugging (spotting undesired derivations or rule interactions), and evaluation (measuring syntactic and lexical coverage). Raw corpora, corpora labelled with part-of-speech tags, and treebanks (corpora with syntactic annotation) can all be used for this purpose. For Dutch, only a limited number of corpora are available, such as the corpora of the Instituut voor Nederlandse Lexicografie, the Eindhoven (Uit den Boogaard) corpus [32], the TST corpus [101], and the Parole corpus [63,81].

Furthermore, we expect the corpora produced within the project for a corpus of spoken Dutch (Corpus Gesproken Nederlands) [66] to be very valuable in this respect. The project aims at the collection of 10 million words of spoken Dutch, annotated with, among other things, part-of-speech and lexical information. Some parts of the corpus will moreover be syntactically annotated. Collaboration with the syntactic and semantic annotation efforts in this corpus initiative will therefore be important. The first results of the Corpus Gesproken Nederlands project will be available in 1999. If these corpora are not suitable for our purposes, then steps will be taken to develop suitably annotated corpora in cooperation with other interested parties.

In addition to the exploration of such corpora, effort will be devoted to the systematic construction of a set of example sentences (a test suite). Test suites of considerable size already exist for English, German and French [65]. A test suite for Dutch will be a very useful evaluation tool, both for the proposed project and for other efforts aimed at the construction of Dutch grammars and/or Dutch language technology applications.
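The use of a corpus or test suite for regression testing, i.e. ensuring that coverage does not decrease from one version of the grammar to the next, can be sketched as follows. This is a minimal illustration under stated assumptions: a parser is modelled as a function that returns True if a sentence receives at least one analysis, and `make_toy_parser` is a hypothetical stand-in that accepts a sentence when all its words are known. The example sentences are invented.

```python
def coverage(parse, sentences):
    """Fraction of sentences for which `parse` finds at least one analysis."""
    sentences = list(sentences)
    if not sentences:
        return 0.0
    return sum(1 for s in sentences if parse(s)) / len(sentences)


def regressions(old_parse, new_parse, sentences):
    """Sentences covered by the old grammar version but lost in the new one."""
    return [s for s in sentences if old_parse(s) and not new_parse(s)]


def make_toy_parser(known_words):
    """Hypothetical stand-in for a grammar version: a sentence 'parses'
    if every word in it is in the vocabulary."""
    vocab = set(known_words)

    def parse(sentence):
        return all(word in vocab for word in sentence.lower().split())

    return parse


test_suite = [
    "ik wil naar Groningen",
    "ik wil morgen naar Utrecht",
    "wanneer vertrekt de trein",
]

old = make_toy_parser(
    ["ik", "wil", "naar", "groningen", "wanneer", "vertrekt", "de", "trein"])
new = make_toy_parser(
    ["ik", "wil", "naar", "groningen", "utrecht", "morgen", "de", "trein"])

old_coverage = coverage(old, test_suite)
new_coverage = coverage(new, test_suite)
lost = regressions(old, new, test_suite)
```

In this toy case both versions cover two of the three sentences, so the overall coverage figure hides the fact that one previously covered sentence is no longer analysed. Reporting the list of lost sentences alongside the coverage figure makes such regressions visible.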

Syntactic Coverage

It is clear that important syntactic constructions (such as passives and relative clauses) are missing from the grammar. Furthermore, evaluation on corpora is expected to reveal a number of other syntactic constructions that are still missing. These constructions will have to be incorporated in a linguistically motivated fashion, and in a way that is compatible with the overall architecture of the grammar.

Lexical Coverage

Because the TST grammar was used to interface with a speech recogniser (which, for the given task, can typically handle up to a few thousand words), the current lexicon is relatively small. To count as a wide-coverage grammar, it will be necessary to expand the lexicon dramatically. (The XTAG grammar for English, for instance, contains over 300,000 word forms.) Various resources can be used to facilitate lexicon development. The Dutch part of the Celex lexical database (CELEX) [3] contains morphosyntactic information for over 100,000 lemmas and over 300,000 word forms. Information about syntactic valence is not included in this database by default, but is available (R. Piepenbrock, p.c.), and is also provided by the lexica developed as part of the projects RBN (Referentiebestand Nederlands) and Parole [63]. Lexical semantic (conceptual) information is available to some extent in CELEX, and will be available in the EuroWordNet database [118].

Automated lexical acquisition

Apart from using existing lexical resources, we hope to experiment with techniques for acquiring lexical information automatically. Given a syntactically analysed corpus, it is relatively straightforward to collect data about the valence of verbs and other lexical items. However, syntactic annotation is not a prerequisite for obtaining this kind of knowledge. For instance, [12], [70], [13], [31], [21] and [22] investigate how one may obtain (statistical) lexical (valence) information from an untagged corpus, using no or only very coarse grammar rules. Automatically acquired lexical information is bound to be less precise than the information in manually constructed lexica. However, the approach also has two distinct advantages. First, acquisition can be done for a given domain or application area, thus opening up the possibility of automatically tuning the lexicon for that domain. Second, the lexical information obtained in this way is probabilistic, i.e., it not only provides us with information about the possible subcategorisation frames for a given verb, but also tells us which of these frames occurs most frequently. This could be an important aid for disambiguation.
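The probabilistic nature of automatically acquired valence information can be illustrated with a small sketch. It assumes that (verb, frame) observations have already been extracted from an analysed corpus, and estimates the probability of each frame by its relative frequency. This simple estimate is our own choice for illustration and is not the method of any of the cited works; the frame labels and the observations are invented.

```python
from collections import Counter, defaultdict


def frame_distribution(observations):
    """Map each verb to the relative frequencies of its observed frames.

    `observations` is an iterable of (verb, frame) pairs.
    """
    counts = defaultdict(Counter)
    for verb, frame in observations:
        counts[verb][frame] += 1
    return {
        verb: {frame: n / sum(c.values()) for frame, n in c.items()}
        for verb, c in counts.items()
    }


def most_frequent_frame(distribution, verb):
    """Return the frame with the highest relative frequency for `verb`."""
    frames = distribution[verb]
    return max(frames, key=frames.get)


# Invented observations with hypothetical frame labels.
observations = [
    ("geven", "np_np"),
    ("geven", "np_pp"),
    ("geven", "np_np"),
    ("geven", "np"),
    ("slapen", "intransitive"),
]

dist = frame_distribution(observations)
preferred = most_frequent_frame(dist, "geven")
```

A parser could consult such a distribution to prefer, among competing analyses, the one that uses the most frequent frame of the verb. Running the same procedure on a domain-specific corpus yields a distribution tuned to that domain.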

Linguistic Sophistication

The TST fragment contains an account of cross-serial dependency constructions, unbounded dependencies, and modifier attachment, but does not cover these phenomena in their full generality. Furthermore, the grammar format imposes rather strict conditions on the kind of rules that can be formulated. These limitations and constraints are partly a consequence of the application for which the grammar has been developed (which contains only relatively straightforward cases of the phenomena just mentioned) and partly a consequence of restrictions imposed by processing. In order to obtain a general, linguistically motivated grammar, these phenomena will have to be dealt with in their full generality, and, consequently, certain constraints that stem from processing considerations will have to be removed. This has obvious implications for processing efficiency. Thus, it will become important to investigate how efficiency may be restored, either by approximation or by specialisation. In other words, the Dutch grammar to be developed will provide an ideal test case for the other two research areas.


2000-07-10