The central question is to what extent it is possible to adapt a linguistically motivated, general grammar of Dutch to a particular domain. One important evaluation criterion is how adequate the resulting, specialised grammar is; another is how much effort the specialisation requires. For instance, many of the proposed specialisation techniques assume the existence of an annotated corpus of examples. In such cases, the evaluation should consider both the required detail of annotation and the required amount of corpus material.
Apart from the importance of a corpus-based methodology, another important conclusion to be drawn from previous work is the importance of the actual words. For this reason, we propose to apply disambiguation techniques to lexical dependency structures rather than to syntactic parse trees. The focus on the actual words is motivated as follows.
In speech-recognition systems, a language model is responsible for the prediction of the `next' word in an utterance. N-gram statistical models are almost exclusively used for this task. In such a model, the probability of the next word $w$ depends only on the $N-1$ words just seen. As a typical example, consider the case in which $N=3$. For such a trigram model, a large corpus is used to count the frequency of occurrence of all possible triples of words. If $w_i, w_j$ were the last two words seen so far, then the probability that the next word is $w_k$ is estimated as the frequency of the trigram divided by the frequency of the pair:

$$P(w_k \mid w_i, w_j) \approx \frac{f(w_i\, w_j\, w_k)}{f(w_i\, w_j)}$$

(for low-frequency counts, special arrangements are often necessary). In the simple sentence

I want to go home

the probability that the word home follows after I want to go is thus estimated by the number of times to go home occurs in the corpus, divided by the number of times to go occurs in the corpus.
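As a concrete illustration, the following is a minimal sketch of this relative-frequency trigram estimate. The function names (`train_counts`, `trigram_prob`) and the representation of the corpus as lists of tokens are our own illustrative choices, and the `special arrangements' (smoothing) for low-frequency counts are deliberately omitted.

```python
from collections import Counter

def train_counts(corpus):
    """Count all bigrams and trigrams in a corpus (a list of token lists)."""
    trigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        for i in range(len(sentence) - 1):
            bigrams[tuple(sentence[i:i + 2])] += 1
        for i in range(len(sentence) - 2):
            trigrams[tuple(sentence[i:i + 3])] += 1
    return trigrams, bigrams

def trigram_prob(trigrams, bigrams, w_i, w_j, w_k):
    """Estimate P(w_k | w_i, w_j) as f(w_i w_j w_k) / f(w_i w_j).
    A realistic model would smooth low-frequency counts."""
    pair_count = bigrams[(w_i, w_j)]
    return trigrams[(w_i, w_j, w_k)] / pair_count if pair_count else 0.0

corpus = [["i", "want", "to", "go", "home"]]
trigrams, bigrams = train_counts(corpus)
print(trigram_prob(trigrams, bigrams, "to", "go", "home"))  # 1.0 on this toy corpus
```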
In practice, trigrams are much more accurate for the purpose of predicting the probability of a sentence than, for instance, stochastic context-free grammars which aim to model syntactic regularities (an observation which usually surprises syntacticians). Yet, it is immediately clear that many linguistically significant dependencies cannot be captured by such simple models. The success of simple models such as the trigram model strongly suggests that the actual words are extremely important.
The disappointing results of stochastic context-free grammars can be explained by the fact that such grammars are typically unable to express (statistical) dependencies between words. In stochastic context-free grammars, grammar rules are augmented with probabilities. Such probabilities can be automatically derived from a corpus. Simplifying matters somewhat, the probability of a rule such as $vp \rightarrow v\;np$ is estimated by the number of times the rule $vp \rightarrow v\;np$ occurs in the corpus, divided by the number of times the category $vp$ occurs in the corpus (i.e. the proportion of cases in which this particular $vp$ rule was used to derive a $vp$).
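The corresponding relative-frequency estimate can be sketched as follows. The representation of rule occurrences as (lhs, rhs) pairs and the function name `estimate_rule_probs` are illustrative assumptions, not part of any particular system.

```python
from collections import Counter

def estimate_rule_probs(rule_occurrences):
    """Relative-frequency estimation of stochastic context-free rules:
    P(A -> alpha) = f(A -> alpha) / f(A), where f(A) counts every
    expansion of category A found in the corpus trees."""
    rule_counts = Counter(rule_occurrences)
    lhs_counts = Counter(lhs for lhs, _ in rule_occurrences)
    return {(lhs, rhs): n / lhs_counts[lhs]
            for (lhs, rhs), n in rule_counts.items()}

# Toy treebank: three vp expansions, two of which use vp -> v np.
occurrences = [("vp", ("v", "np")), ("vp", ("v", "np")), ("vp", ("v",))]
probs = estimate_rule_probs(occurrences)
print(probs[("vp", ("v", "np"))])  # 2/3
```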
Note, however, that the lack of expressiveness with respect to lexical dependencies is not an inherent property of stochastic context-free grammars, but rather a property of their typical use. For instance, it is quite possible to envision stochastic context-free grammars of lexical dependency structures, in which each of the non-terminal nodes relates to a word in the input sentence. Obviously, the relation between input sentences and tree structures is different in such an approach. In such a set-up, some other grammatical device produces a dependency structure, for which we can then compute the probability according to the stochastic context-free grammar (which defines all grammatical dependency structures and their associated probabilities). The example sentence I want to go home might, for instance, give rise to the dependency structure given in figure 2.6.
[Figure 2.6: lexical dependency structure for the example sentence I want to go home]
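To make the idea of assigning probabilities to dependency structures concrete, the following is a deliberately simplified sketch in which the probability of a tree is the product of the probabilities of its head-dependent links. The tree encoding, the `link_probs` table, and its numbers are hypothetical; a full stochastic context-free treatment would condition on considerably more context.

```python
def tree_prob(tree, link_probs):
    """Probability of a dependency tree as the product of the probabilities
    of its head-dependent links.  A tree is (head_word, [subtrees]);
    link_probs maps (head_word, dependent_word) to an estimated probability."""
    head, subtrees = tree
    p = 1.0
    for sub in subtrees:
        p *= link_probs.get((head, sub[0]), 0.0) * tree_prob(sub, link_probs)
    return p

# One plausible dependency structure for "I want to go home".
tree = ("want", [("i", []), ("go", [("home", [])])])
link_probs = {("want", "i"): 0.4, ("want", "go"): 0.2, ("go", "home"): 0.5}
print(tree_prob(tree, link_probs))  # 0.4 * 0.2 * 0.5 = 0.04
```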
A very promising line of research therefore consists of applying statistical techniques (such as stochastic context-free grammars) to lexical dependency structures, as opposed to traditional syntactic structures. We therefore propose to compare a number of probabilistic techniques which are sensitive to lexical dependencies.
In order to be able to do so, lexical dependency structures need to be defined. For instance, it should be decided whether, in examples such as the one in figure 2.6, there should be a link between the verb go and the matrix subject I. The construction of such lexical dependency structures is either performed explicitly by the grammar, or can be straightforwardly derived from the structures the grammar produces. All modern grammatical theories exploit the notion of a linguistic head in one way or another. Lexical dependency structures can be derived from such headed representations, as the sketch below illustrates.
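The following is a minimal sketch of such a derivation, assuming a toy encoding in which each node is either a word or a pair of a head-daughter index and a list of daughters; this encoding and the helper names `lexical_head` and `dependencies` are illustrative only.

```python
def lexical_head(node):
    """Return the head word of a (possibly nested) headed constituent.
    A node is either a word (str) or (head_index, [daughters])."""
    if isinstance(node, str):
        return node
    head_index, daughters = node
    return lexical_head(daughters[head_index])

def dependencies(node, links=None):
    """Collect head -> dependent word links from a headed tree."""
    if links is None:
        links = []
    if isinstance(node, str):
        return links
    head_index, daughters = node
    head = lexical_head(daughters[head_index])
    for i, daughter in enumerate(daughters):
        if i != head_index:
            links.append((head, lexical_head(daughter)))
        dependencies(daughter, links)
    return links

# A much simplified headed tree for "I want to go home";
# head_index picks the head daughter of each constituent.
tree = (1, ["i", (0, ["want", (1, ["to", (0, ["go", "home"])])])])
print(dependencies(tree))
# [('want', 'i'), ('want', 'go'), ('go', 'to'), ('go', 'home')]
```

Note that in this encoding I is linked to want but not to go; whether an additional link between go and the matrix subject should be introduced is exactly the kind of design question raised above.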
The following lists a number of approaches towards disambiguation:
For each of these approaches, an initial feasibility study will be conducted in order to find out how the approach can be combined with a given general grammar of Dutch. Based on this feasibility study, a number of disambiguation techniques will be selected from this list. The selected techniques will then be specified and implemented in detail, and carefully evaluated on an annotated corpus.
An important aspect of the study of disambiguation techniques concerns their success on a human-annotated test corpus. At the moment there is relatively little syntactically and/or semantically annotated corpus material available for Dutch. Hopefully, the recently initiated corpus initiative `Corpus Gesproken Nederlands' [66] will help to remedy this situation (cf. section 2.4.4). Furthermore, the test bank developed as an evaluation tool for the grammar development work (cf. section 2.4) can be used to perform a more qualitatively oriented evaluation.
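For such a quantitative evaluation, a simple per-link accuracy measure over the annotated material might look as follows; the representation of analyses as sets of head-dependent links is again an illustrative assumption.

```python
def dependency_accuracy(gold_links, predicted_links):
    """Fraction of the gold head-dependent links also found by the parser,
    a simple per-link score over an annotated test corpus."""
    gold, predicted = set(gold_links), set(predicted_links)
    return len(gold & predicted) / len(gold) if gold else 1.0

gold = [("want", "i"), ("want", "go"), ("go", "home")]
pred = [("want", "i"), ("go", "want"), ("go", "home")]
print(dependency_accuracy(gold, pred))  # 2/3
```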