The Robustness Component in OVIS

In an attempt to obtain a robust natural language understanding component we have experimented in OVIS with the techniques mentioned in the preceding paragraph. The top category (start symbol) of the OVIS grammar is defined as the category max(Sem). Moreover there are unary rules such as max(Sem) np(Sem,..) for NP, S, PP, ADVP.

In the first phase, the parser finds all occurrences of the top category in the input word-graph. Thus, we obtain items for all possible maximal projections anywhere in the input graph. To obtain the best analysis for the word-graph as a whole the following procedure is being used. This robustness procedure consists of a best-first search from the beginning of the graph to the end of the graph. A path in the input graph can be constructed by taking steps of the following two types. To move from position P to Q you can either:

1.: use a maximal projection from P to Q (as constructed by the parser)
2.: use a transition from P to Q. In this case we say that we `skip' that transition.

In order to compare paths in the best-first search method, we define a score function which includes:

1.: the number of skips. We prefer paths with a smaller number of such skips.
2.: the number of maximal projections. We prefer paths with a smaller number of such projections.
3.: the combined acoustic score as defined in the word-graph.

This best-first search method can be implemented in linear time; it is essentially the Viterbi algorithm for finding the most likely path through an HMM.

The resulting `best' path in general consists of a number of maximal projections (connected by skips). In the OVIS application these often are simple time or place expressions. The pragmatic module is able to deal with such pieces of information and will perform better if it is given such partial parse results (as compared with an approach in which parser failure does not produce any information).

In practice, the best-first search will not try to use all categories as constructed by the parser. For this reason I have implemented the second phase of the parser in a lazy fashion: the second phase is only applied for a certain category if that category is considered in the best-first search. A table is maintained to make sure that for each category the second phase is only applied once.

We are experimenting with other criteria to be included in the scoring function:

The semantic score. The resulting semantic representation is compared with the question asked previously (in the application all user utterances are reactions to the questions asked by the system). If the answer matches the question then a better score is obtained.
The bigram score.

Evaluation of the Robustness component

In order to evaluate the appropriate combination of the factors determining the scoring function, and to evaluate this approach with respect to other approaches, we use a corpus of word-graphs for which we know the corresponding actual utterances. We compare the sentence associated with the `best' path in the word-graph with the sentence that was actually spoken. Clearly if the robustness component more often uses the information that was actually uttered then we have more confidence in that component. The string comparison is defined by the minimal number of deletions and insertions that is required to turn the first string into the second (Levenshtein distance), although it may be worthwile to invest other measures.

For example, it seems likely that for our application it is less problematic to `miss' information, whereas ``hallucination'' is a more severe problem. If information is missed, then the dialogue might become somewhat longer than neccessary. But if information is assumed for which no evidence existed in the actual utterences then misunderstandings might easily lead to confusion. This intuition could be formalized by a scoring function in which insertion is cheaper than deletion.

Currently the string that is being compared with the actual utterance is defined as the best path through the word-graph, given the best-first search procedure defined previously. However note that this implies that `skips' are also part of that string. This seems undesirable because such skips have no effect on the semantic representations that result. Therefore, it may be better (at least for internal evaluation purposes) to count such skips as `deletions' even if those words were actually spoken.

A final issue concerns the evaluation of the evaluation criterion. Clearly the task of the NLP component in the current framework is not to predict which sentence was actually spoken, but to construct the best possible semantic representations of it. Therefore the current evaluation technique should only be used together with more sophisticated techniques which should take into account the actual semantic representations that are constructed.

Next: Some Results Up: Robust Parsing with the Previous: Head-corner Parsing and Robustness

Noord G.J.M. van
1998-09-25