

Evaluation

We present a number of results to indicate how well the NLP component currently performs. In the NWO Priority Programme, two alternative natural language processing modules are being developed in parallel: the `grammar-based' module described here, and a `data-oriented' (statistical, probabilistic, DOP) module. Both modules fit into the system architecture of OVIS. The DOP approach is documented in a number of publications [39,9,8].

In order to compare the two NLP modules, a formal evaluation has been carried out on 1000 new, unseen, representative word graphs (obtained using the latest version of the speech recognizer). Full details of the evaluation procedure and all evaluation results are described elsewhere [42,10]. For these word graphs, our project partners provided annotations consisting of the actual sentences ('test sentences') and the corresponding updates ('test updates').

The N-gram models used by our implementation were constructed on the basis of a corpus of almost 60K user utterances (almost 200K words).

Some indication of the difficulty of the test set of 1000 word-graphs is presented in Table 1, both for the input word-graphs and for the normalized word-graphs. The table lists the number of graphs, the number of transitions, the number of words of the actual utterances, the average number of transitions per word, the average number of words per graph, the average number of transitions per graph, and finally the maximum number of transitions per graph. The number of transitions per word in the normalized word-graph is an indication of the additional ambiguity that the parser encounters in comparison with parsing of ordinary strings.


Table 1: Characterization of test set (1).

               graphs  transitions  words   t/w   w/g   t/g   max(t/g)
  input          1000        48215   3229  14.9   3.2  48.2        793
  normalized     1000        73502   3229  22.8   3.2  73.5       2943
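The derived columns in Table 1 follow directly from the totals; for the input graphs, for example, the per-word and per-graph averages can be checked with a few lines of arithmetic (illustrative only):

```python
# Totals for the input word-graphs from Table 1.
graphs, transitions, words = 1000, 48215, 3229

t_per_w = round(transitions / words, 1)   # transitions per word  -> 14.9
w_per_g = round(words / graphs, 1)        # words per graph       -> 3.2
t_per_g = round(transitions / graphs, 1)  # transitions per graph -> 48.2

print(t_per_w, w_per_g, t_per_g)
```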

A further indication of the difficulty of this set of word-graphs is obtained by looking at the word and sentence accuracy achieved by a number of simple methods. The string comparison on which sentence accuracy and word accuracy are based is the Levenshtein distance d: the minimal number of substitutions, deletions and insertions required to turn the first string into the second. Word accuracy is then defined as $ 1 - \frac{d}{n} $, where n is the length of the actual utterance.
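The definitions above can be sketched as follows. This is an illustrative implementation (the function names `levenshtein` and `word_accuracy` are ours, not part of the evaluation software); it computes the edit distance over word sequences by standard dynamic programming:

```python
def levenshtein(ref, hyp):
    """Minimal number of substitutions, deletions and insertions
    required to turn the word sequence ref into hyp."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n]

def word_accuracy(ref, hyp):
    """WA = 1 - d/n, with n the length of the actual utterance."""
    return 1.0 - levenshtein(ref, hyp) / len(ref)
```

For example, if the recognizer drops one word of a four-word utterance, d = 1 and the word accuracy is 1 - 1/4 = 0.75.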

The speech method takes only the acoustic scores found in the word-graph into account. The possible method assumes an oracle which chooses a path such that it turns out to be the best possible path; this method can be seen as a natural upper bound on what can be achieved. The bigram (trigram) method uses only a bigram (trigram) language model. The speech_bigram (speech_trigram) method combines bigram (trigram) statistics with the speech score.

Table 2: Characterization of test set (2).

  method            WA    SA
  speech          69.8  56.0
  possible        90.4  83.7
  bigram          69.0  57.4
  trigram         73.1  61.8
  speech_bigram   81.1  73.6
  speech_trigram  83.9  76.2




2000-07-10