An experimental version of the system has been available to the general public for almost a year. From a large set of more recent dialogues, a subset was randomly selected for testing. Many of the other dialogues were available for training purposes. Both the training and test dialogues are therefore dialogues with `normal' users.
In particular, a training set of 10K richly annotated word graphs was available. Each word graph in this 10K corpus is annotated with the actual user utterance, a syntactic tree and an update. This training set was used to train the DOP system. It was also used by the grammar-based component for grammar maintenance and grammar testing.
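To make the annotation format concrete, the sketch below shows what a single training instance might look like. The field names, the lattice representation and the example content are illustrative assumptions only; the actual corpus format is the one produced by the annotation tools of Bonnema (1996) and is not reproduced here.

```python
# Illustrative sketch only: field names, the lattice representation and the
# example content are assumptions, not the actual corpus format.
from dataclasses import dataclass
from typing import List

@dataclass
class Edge:
    start: int          # start state in the word graph
    end: int            # end state in the word graph
    word: str           # hypothesised word on this edge
    score: float        # recogniser score for this hypothesis (assumed)

@dataclass
class TrainingInstance:
    graph: List[Edge]   # recogniser output: a word graph (lattice)
    utterance: str      # the actual user utterance (reference transcription)
    tree: str           # syntactic tree, here as a bracketed string
    update: str         # update annotated for the utterance

# A minimal example instance (hypothetical content):
example = TrainingInstance(
    graph=[Edge(0, 1, "ik", -1.2), Edge(1, 2, "wil", -0.8),
           Edge(2, 3, "naar", -0.5), Edge(3, 4, "amsterdam", -2.1)],
    utterance="ik wil naar amsterdam",
    tree="(S (NP ik) (VP wil (PP naar (NP amsterdam))))",
    update="user.wants.travel.destination.amsterdam",
)
```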
A further training set of about 90K annotated user utterances was available as well. It was primarily used for constructing the N-gram models incorporated in the grammar-based component.
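As an illustration of this step, the following is a minimal sketch of how a bigram model could be estimated from such a set of transcribed utterances. The function name, the tokenisation and the unsmoothed relative-frequency estimate are assumptions made for exposition; the section does not specify how the N-gram models were actually built.

```python
# Minimal sketch of estimating a bigram model from transcribed utterances.
# Tokenisation and the unsmoothed relative-frequency estimate are assumptions;
# a real model would add smoothing and handle unseen words.
from collections import defaultdict

def train_bigram(utterances):
    counts = defaultdict(lambda: defaultdict(int))
    context_totals = defaultdict(int)
    for utt in utterances:
        tokens = ["<s>"] + utt.lower().split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
            context_totals[prev] += 1
    # Relative-frequency estimate P(cur | prev).
    return {prev: {w: c / context_totals[prev] for w, c in nexts.items()}
            for prev, nexts in counts.items()}

model = train_bigram(["ik wil naar amsterdam", "ik wil morgen naar utrecht"])
print(model["ik"])   # {'wil': 1.0}
```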
The NLP components of OVIS2 have been evaluated on 1000 unseen user utterances. The latest version of the speech recogniser produced 1000 word graphs from these user utterances. For these word graphs, annotations consisting of the actual sentence (the `test sentence') and an update (the `test update') were assigned semi-automatically, without taking into account the dialogue context in which the sentences were uttered. These annotations were unknown to both NLP groups. The annotation tools are described in Bonnema (1996).
After both NLP components had produced their results on the word graphs, the test sentences were made available. Both NLP components were then applied to these test utterances as well, to mimic a situation in which speech recognition is perfect.
The test updates were made available for inspection by the NLP groups only after both modules had finished processing the test material. A small number of errors was encountered in these test updates; these were corrected before the accuracy scores were computed. The accuracy scores presented below were all obtained using the same evaluation software.
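For concreteness, the sketch below shows the kind of accuracy computation referred to here: the fraction of test items for which the produced update matches the corrected test update. The exact-match criterion and the whitespace normalisation are assumptions; the actual evaluation software may use a more fine-grained comparison.

```python
# Sketch of an exact-match update accuracy; the matching criterion is an
# assumption, not a description of the actual evaluation software.
def update_accuracy(produced, reference):
    assert len(produced) == len(reference)
    correct = sum(1 for p, r in zip(produced, reference)
                  if p.strip() == r.strip())
    return correct / len(reference)

print(update_accuracy(["a.b.c", "x.y"], ["a.b.c", "x.z"]))  # 0.5
```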