We present a number of results that indicate how well the NLP component currently performs. In the NWO Priority Programme, two alternative natural language processing modules are being developed in parallel: the `grammar-based' module described here, and a `data-oriented' (statistical, probabilistic, DOP) module. Both modules fit into the system architecture of OVIS. The DOP approach is documented in a number of publications [39,9,8].
To compare the two NLP modules, a formal evaluation has been carried out on 1000 new, unseen, representative word graphs (obtained using the latest version of the speech recognizer). Full details on the evaluation procedure, and all evaluation results, are described elsewhere [42,10]. For these word graphs, our project partners provided annotations consisting of the actual sentences ('test sentences') and updates ('test updates').
The N-gram models used by our implementation were constructed on the basis of a corpus of almost 60K user utterances (almost 200K words).
Some indication of the difficulty of the test set of 1000 word graphs is presented in Table 1, both for the input word graphs and for the normalised word graphs. The table lists the number of transitions, the number of words of the actual utterances, the average number of transitions per word, the average number of words per graph, the average number of transitions per graph, and finally the maximum number of transitions per graph. The number of transitions per word in the normalised word graphs indicates the additional ambiguity that the parser encounters compared with parsing ordinary strings.
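The quantities listed above are simple ratios over the test set. The following minimal sketch shows how they could be computed. The representation of a word graph as a list of `(source, target, word)` transitions, the function name `graph_set_statistics`, and the toy data are assumptions made for illustration and are not taken from the OVIS implementation.

```python
def graph_set_statistics(graphs, utterances):
    """Summarise a set of word graphs.

    graphs:     list of word graphs, each a list of (source, target, word)
                transitions (hypothetical representation).
    utterances: list of actual utterances, each a list of words,
                aligned one-to-one with `graphs`.
    """
    transitions_per_graph = [len(g) for g in graphs]
    total_transitions = sum(transitions_per_graph)
    total_words = sum(len(u) for u in utterances)
    return {
        "graphs": len(graphs),
        "transitions": total_transitions,
        "words": total_words,
        "transitions_per_word": total_transitions / total_words,
        "words_per_graph": total_words / len(graphs),
        "transitions_per_graph": total_transitions / len(graphs),
        "max_transitions_per_graph": max(transitions_per_graph),
    }


# Toy data: two small word graphs with their actual utterances.
example_graphs = [
    [(0, 1, "ik"), (0, 1, "in"), (1, 2, "wil"), (1, 2, "wel")],
    [(0, 1, "naar"), (0, 1, "na"), (1, 2, "den"), (1, 2, "de"),
     (2, 3, "haag"), (2, 3, "dag")],
]
example_utterances = [["ik", "wil"], ["naar", "den", "haag"]]

example_stats = graph_set_statistics(example_graphs, example_utterances)
```

In this toy set there are 10 transitions for 5 spoken words, so a parser sees on average two competing transitions for each word of the actual utterance; a plain string would have exactly one.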
A further indication of the difficulty of this set of word graphs
comes from the word accuracy and sentence accuracy achieved by a
number of simple methods.
The string comparison on which sentence accuracy and word accuracy are
based is defined by the minimal number of substitutions, deletions and
insertions required to turn the first string into the second
(Levenshtein distance $d$). Word accuracy is defined as
$1 - \frac{d}{n}$,
where $n$ is the length of the actual utterance.
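The Levenshtein distance d over words can be computed with standard dynamic programming. The sketch below does so and derives word accuracy from it, taking word accuracy to be 1 - d/n with n the length of the actual utterance. The function names and the example strings are illustrative assumptions, not part of the evaluation software.

```python
def levenshtein(hyp, ref):
    """Minimal number of substitutions, deletions and insertions
    needed to turn the word list `hyp` into the word list `ref`."""
    # prev[j] holds the distance between the hyp prefix processed so far
    # and the first j words of ref.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def word_accuracy(hyp, ref):
    """Word accuracy 1 - d/n, where n is the length of the actual utterance."""
    return 1 - levenshtein(hyp, ref) / len(ref)


# Toy example: the recognised string misses one word of the actual utterance.
actual = "ik wil graag naar amsterdam".split()
recognised = "ik wil naar amsterdam".split()
d = levenshtein(recognised, actual)
wa = word_accuracy(recognised, actual)
```

Here one insertion suffices, so d = 1 and n = 5. Note that under this definition word accuracy becomes negative when the hypothesis needs more than n edit operations, for example when it contains many spurious words.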
The method 'speech' takes into account only the acoustic scores
found in the word graph. The method 'possible'
assumes that there is an oracle which chooses the best possible path
through the word graph. This method can be seen as a
natural upper bound on what can be achieved. The methods 'bigram'
and 'trigram' use only a bigram or trigram
language model, respectively. The methods 'speech_bigram' and
'speech_trigram' combine bigram or trigram
statistics with the speech score.
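The difference between the score-based methods can be illustrated on a toy word graph. The sketch below is not the OVIS implementation: it assumes that scores are costs (lower is better, as with negative log probabilities), that acoustic and bigram costs are combined as a weighted sum with a hypothetical `weight` parameter, and that unseen bigrams receive a fixed penalty. It enumerates all paths, which is only feasible for very small acyclic graphs; the oracle method and the trigram variants are left out.

```python
from collections import defaultdict


def all_paths(transitions, start, final):
    """Yield (words, acoustic_cost) for every path from start to final.

    transitions: list of (source, target, word, acoustic_cost) tuples
    forming an acyclic word graph (hypothetical representation).
    """
    outgoing = defaultdict(list)
    for src, dst, word, cost in transitions:
        outgoing[src].append((dst, word, cost))

    def walk(state, words, acoustic):
        if state == final:
            yield words, acoustic
            return
        for dst, word, cost in outgoing[state]:
            yield from walk(dst, words + [word], acoustic + cost)

    yield from walk(start, [], 0.0)


def bigram_cost(words, bigram, unseen=10.0):
    """Sum of bigram costs over the sentence, padded with boundary markers.
    Unseen bigrams get the fixed penalty `unseen` (an assumption)."""
    padded = ["<s>"] + words + ["</s>"]
    return sum(bigram.get(pair, unseen) for pair in zip(padded, padded[1:]))


def best_path(transitions, start, final, bigram, method, weight=1.0):
    """Return the word sequence with the lowest cost under `method`."""
    def score(item):
        words, acoustic = item
        if method == "speech":
            return acoustic
        if method == "bigram":
            return bigram_cost(words, bigram)
        if method == "speech_bigram":
            return acoustic + weight * bigram_cost(words, bigram)
        raise ValueError(f"unknown method: {method}")

    words, _ = min(all_paths(transitions, start, final), key=score)
    return words


# Toy word graph: the recogniser slightly prefers "wel" over "wil".
toy_graph = [
    (0, 1, "ik", 1.0),
    (1, 2, "wil", 2.0),
    (1, 2, "wel", 1.5),
    (2, 3, "reizen", 1.0),
]
# Toy bigram costs: the language model strongly prefers "ik wil reizen".
toy_bigram = {
    ("<s>", "ik"): 1.0,
    ("ik", "wil"): 1.0,
    ("ik", "wel"): 3.0,
    ("wil", "reizen"): 1.0,
    ("wel", "reizen"): 4.0,
    ("reizen", "</s>"): 1.0,
}

by_speech = best_path(toy_graph, 0, 3, toy_bigram, "speech")
by_bigram = best_path(toy_graph, 0, 3, toy_bigram, "bigram")
by_both = best_path(toy_graph, 0, 3, toy_bigram, "speech_bigram")
```

With these toy costs, the acoustic scores alone select "ik wel reizen" (cost 3.5 against 4.0), whereas the bigram model, alone or combined with the acoustic scores, selects "ik wil reizen". A realistic implementation would replace the path enumeration with a best-first or dynamic programming search over the graph.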