The central question is to what extent it is possible to adapt a linguistically motivated, general grammar of Dutch to a particular domain. One important evaluation criterion is how adequate the resulting, specialised grammar is; another is how much effort the specialisation requires. For instance, many of the proposed specialisation techniques assume the existence of an annotated corpus of examples. In such cases, the evaluation should consider both the required detail of annotation and the required amount of corpus material.
Apart from the importance of a corpus-based methodology, another important conclusion to be drawn from previous work is the importance of the actual words. For this reason, we propose to apply disambiguation techniques to lexical dependency structures rather than to syntactic parse trees. The focus on the actual words is motivated as follows.
In speech-recognition systems, a language model is responsible for the prediction of the `next' word in an utterance. N-gram statistical models are almost exclusively used for this task. In such a model, the probability of the next word $w$ depends only on the $N-1$ words just seen. As a typical example, consider the case in which $N=3$. For such a trigram model, a large corpus is used to count the frequency of occurrence of all possible triples of words. If $w_i, w_j$ were the last two words seen so far, then the probability that the next word is $w_k$ is estimated as the frequency of the trigram divided by the frequency of the pair:

$$P(w_k \mid w_i, w_j) \approx \frac{f(w_i\, w_j\, w_k)}{f(w_i\, w_j)}$$

(for low-frequency counts, special arrangements are often necessary). In the simple sentence

I want to go home

the probability that the word home follows after I want to go is thus estimated by the number of times to go home occurs in the corpus, divided by the number of times to go occurs in the corpus.
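As a concrete illustration, the following is a minimal sketch of this relative-frequency trigram estimate. The function names (`train_counts`, `trigram_prob`) and the representation of the corpus as lists of tokens are our own illustrative choices, and the `special arrangements' (smoothing) for low-frequency counts are deliberately omitted.

```python
from collections import Counter

def train_counts(corpus):
    """Count all bigrams and trigrams in a corpus (a list of token lists)."""
    trigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        for i in range(len(sentence) - 1):
            bigrams[tuple(sentence[i:i + 2])] += 1
        for i in range(len(sentence) - 2):
            trigrams[tuple(sentence[i:i + 3])] += 1
    return trigrams, bigrams

def trigram_prob(trigrams, bigrams, w_i, w_j, w_k):
    """Estimate P(w_k | w_i, w_j) as f(w_i w_j w_k) / f(w_i w_j).
    A realistic model would smooth low-frequency counts."""
    pair_count = bigrams[(w_i, w_j)]
    return trigrams[(w_i, w_j, w_k)] / pair_count if pair_count else 0.0

corpus = [["i", "want", "to", "go", "home"]]
trigrams, bigrams = train_counts(corpus)
print(trigram_prob(trigrams, bigrams, "to", "go", "home"))  # 1.0 on this toy corpus
```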
In practice, trigrams are much more accurate for the purpose of predicting the probability of a sentence than, for instance, stochastic context-free grammars which aim to model syntactic regularities (an observation which usually surprises syntacticians). Yet, it is immediately clear that many linguistically significant dependencies cannot be captured by such simple models. The success of simple models such as the trigram model strongly suggests that the actual words are extremely important.
The disappointing results of stochastic context-free grammars can be explained by the fact that such grammars are typically unable to express (statistical) dependencies between words. In stochastic context-free grammars, grammar rules are augmented with probabilities. Such probabilities can be automatically derived from a corpus. Simplifying matters somewhat, the probability of a rule such as $vp \rightarrow v\;np$ is estimated by the number of times the rule $vp \rightarrow v\;np$ occurs in the corpus, divided by the number of times the category $vp$ occurs in the corpus (i.e. the proportion of cases in which this particular $vp$ rule was used to derive a $vp$).
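The corresponding relative-frequency estimate can be sketched as follows. The representation of rule occurrences as (lhs, rhs) pairs and the function name `estimate_rule_probs` are illustrative assumptions, not part of any particular system.

```python
from collections import Counter

def estimate_rule_probs(rule_occurrences):
    """Relative-frequency estimation of stochastic context-free rules:
    P(A -> alpha) = f(A -> alpha) / f(A), where f(A) counts every
    expansion of category A found in the corpus trees."""
    rule_counts = Counter(rule_occurrences)
    lhs_counts = Counter(lhs for lhs, _ in rule_occurrences)
    return {(lhs, rhs): n / lhs_counts[lhs]
            for (lhs, rhs), n in rule_counts.items()}

# Toy treebank: three vp expansions, two of which use vp -> v np.
occurrences = [("vp", ("v", "np")), ("vp", ("v", "np")), ("vp", ("v",))]
probs = estimate_rule_probs(occurrences)
print(probs[("vp", ("v", "np"))])  # 2/3
```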
Note, however, that the lack of expressiveness with respect to lexical dependencies is not an inherent property of stochastic context-free grammars, but rather a property of their typical use. For instance, it is quite possible to envision stochastic context-free grammars of lexical dependency structures, in which each of the non-terminal nodes relates to a word in the input sentence. Obviously, the relation between input sentences and tree structures is different in such an approach. In such a set-up, some other grammatical device produces a dependency structure, for which we can then compute the probability according to the stochastic context-free grammar (which defines all grammatical dependency structures and their associated probabilities). The example sentence I want to go home might, for instance, give rise to the dependency structure given in figure 2.6.
[Figure 2.6: lexical dependency structure for the example sentence I want to go home]
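To make the idea of assigning probabilities to dependency structures concrete, the following is a deliberately simplified sketch in which the probability of a tree is the product of the probabilities of its head-dependent links. The tree encoding, the `link_probs` table, and its numbers are hypothetical; a full stochastic context-free treatment would condition on considerably more context.

```python
def tree_prob(tree, link_probs):
    """Probability of a dependency tree as the product of the probabilities
    of its head-dependent links.  A tree is (head_word, [subtrees]);
    link_probs maps (head_word, dependent_word) to an estimated probability."""
    head, subtrees = tree
    p = 1.0
    for sub in subtrees:
        p *= link_probs.get((head, sub[0]), 0.0) * tree_prob(sub, link_probs)
    return p

# One plausible dependency structure for "I want to go home".
tree = ("want", [("i", []), ("go", [("home", [])])])
link_probs = {("want", "i"): 0.4, ("want", "go"): 0.2, ("go", "home"): 0.5}
print(tree_prob(tree, link_probs))  # 0.4 * 0.2 * 0.5 = 0.04
```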
A very promising line of research therefore consists of applying statistical techniques (such as stochastic context-free grammars) to lexical dependency structures, as opposed to traditional syntactic structures. We therefore propose to compare a number of probabilistic techniques which are sensitive to lexical dependencies.
In order to be able to do so, lexical dependency structures need to be defined. For instance, it should be decided whether, in examples such as the one in figure 2.6, there should be a link between the verb go and the matrix subject I. The construction of such lexical dependency structures is either performed explicitly by the grammar, or can be straightforwardly derived from the structures the grammar produces. All modern grammatical theories exploit the notion of a linguistic head in one way or another. Lexical dependency structures can be derived from such headed representations, as the sketch below illustrates.
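The following is a minimal sketch of such a derivation, assuming a toy encoding in which each node is either a word or a pair of a head-daughter index and a list of daughters; this encoding and the helper names `lexical_head` and `dependencies` are illustrative only.

```python
def lexical_head(node):
    """Return the head word of a (possibly nested) headed constituent.
    A node is either a word (str) or (head_index, [daughters])."""
    if isinstance(node, str):
        return node
    head_index, daughters = node
    return lexical_head(daughters[head_index])

def dependencies(node, links=None):
    """Collect head -> dependent word links from a headed tree."""
    if links is None:
        links = []
    if isinstance(node, str):
        return links
    head_index, daughters = node
    head = lexical_head(daughters[head_index])
    for i, daughter in enumerate(daughters):
        if i != head_index:
            links.append((head, lexical_head(daughter)))
        dependencies(daughter, links)
    return links

# A much simplified headed tree for "I want to go home";
# head_index picks the head daughter of each constituent.
tree = (1, ["i", (0, ["want", (1, ["to", (0, ["go", "home"])])])])
print(dependencies(tree))
# [('want', 'i'), ('want', 'go'), ('go', 'to'), ('go', 'home')]
```

Note that in this encoding I is linked to want but not to go; whether an additional link between go and the matrix subject should be introduced is exactly the kind of design question raised above.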
The following lists a number of approaches towards disambiguation:
For each of these approaches, an initial feasibility study will be conducted in order to find out how the approach can be combined with a given general grammar of Dutch. Based on this feasibility study, a number of disambiguation techniques will be selected from this list. The selected techniques will then be specified and implemented in detail, and carefully evaluated on an annotated corpus.
An important aspect of the study of disambiguation techniques concerns their success on a human-annotated test corpus. At the moment there is relatively little syntactically and/or semantically annotated corpus material available for Dutch. Hopefully, the recently initiated corpus initiative `Corpus Gesproken Nederlands' [66] will help to remedy this situation (cf. section 2.4.4). Furthermore, the test bank developed as an evaluation tool for the grammar development work (cf. section 2.4) can be used to perform a more qualitatively oriented evaluation.
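For such a quantitative evaluation, a simple per-link accuracy measure over the annotated material might look as follows; the representation of analyses as sets of head-dependent links is again an illustrative assumption.

```python
def dependency_accuracy(gold_links, predicted_links):
    """Fraction of the gold head-dependent links also found by the parser,
    a simple per-link score over an annotated test corpus."""
    gold, predicted = set(gold_links), set(predicted_links)
    return len(gold & predicted) / len(gold) if gold else 1.0

gold = [("want", "i"), ("want", "go"), ("go", "home")]
pred = [("want", "i"), ("go", "want"), ("go", "home")]
print(dependency_accuracy(gold, pred))  # 2/3
```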