
Accuracy

The results for word accuracy given above provide a measure of the extent to which linguistic processing contributes to speech recognition. However, since the main task of the linguistic component is to analyze utterances semantically, an equally important measure is concept accuracy, i.e. the extent to which the semantic analysis corresponds to the meaning of the utterance actually produced by the user.

For determining concept accuracy, we have used a semantically annotated corpus of 10K user responses. Each user response was annotated with an update representing the meaning of the utterance that was actually spoken. The annotations were made by our project partners in Amsterdam, in accordance with the existing guidelines [46].

Updates take the form described in section 2.5. An update is a logical formula which can be evaluated against an information state and which gives rise to a new, updated information state. The most straightforward method for evaluating concept accuracy in this setting is to compare (the normal form of) the update produced by the grammar with (the normal form of) the annotated update. A major obstacle to this approach, however, is that very fine-grained semantic distinctions can be made in the update language. While these distinctions are relevant semantically (i.e. in certain cases they may lead to slightly different updates of an information state), they can often be ignored by a dialogue manager. For instance, the two updates below are not semantically equivalent, as the ground-focus distinction differs slightly.

   userwants.travel.destination.([# place.town.leiden];[! place.town.abcoude])
   userwants.travel.destination.place.town.([# leiden];[! abcoude])        (39)

However, the dialogue manager will decide in both cases that this is a correction of the destination town.

Since the semantic analysis is the input to the dialogue manager, we have measured concept accuracy in terms of a simplified version of the update language. Inspired by a similar proposal in Boros et al. [11], we translate each update into a set of semantic units, where a unit in our case is a triple ⟨CommunicativeFunction, Slot, Value⟩. For instance, the two updates in (39) both translate as

⟨denial, destination_town, leiden⟩
⟨correction, destination_town, abcoude⟩

Both the updates in the annotated corpus and the updates produced by the system were translated into semantic units.
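
To make this representation concrete, the following Python sketch (ours, not part of the paper; the variable names are illustrative) models semantic units as triples and shows that the two updates in (39), once translated, yield identical unit sets. The translation function itself depends on the details of the update language and is not shown here.

from typing import FrozenSet, Tuple

# A semantic unit is a (CommunicativeFunction, Slot, Value) triple.
SemanticUnit = Tuple[str, str, str]

# Assumed results of translating the two updates in (39); the actual
# translation from updates to units is defined by the update language.
update_a: FrozenSet[SemanticUnit] = frozenset([
    ("denial", "destination_town", "leiden"),
    ("correction", "destination_town", "abcoude"),
])
update_b: FrozenSet[SemanticUnit] = frozenset([
    ("denial", "destination_town", "leiden"),
    ("correction", "destination_town", "abcoude"),
])

# Although the updates differ in their ground-focus structure,
# their semantic units coincide, so they count as equivalent.
assert update_a == update_b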

Semantic accuracy is given in table 5 according to four different definitions. Firstly, we list the proportion of utterances for which the corresponding semantic units exactly match the semantic units of the annotation (match). Furthermore, we calculate precision (the number of correct semantic units divided by the number of semantic units produced) and recall (the number of correct semantic units divided by the number of semantic units in the annotation). Finally, following Boros et al. [11], we also present concept accuracy as


\begin{displaymath}
CA = 100 \left( 1 - \frac{SU_S + SU_I + SU_D}{SU} \right) \%
\end{displaymath}

where $SU$ is the total number of semantic units in the translated corpus annotation, and $SU_S$, $SU_I$, and $SU_D$ are the number of substitutions, insertions, and deletions that are necessary to make the translated grammar update equivalent to the translation of the corpus update.
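
As an illustration (ours, not from the paper), the Python sketch below computes all four measures over a test corpus of semantic-unit sets. Since units are unordered, we approximate substitutions by pairing each spurious unit with a missing one; this pairing is our assumption, not the paper's alignment procedure.

from typing import FrozenSet, List, Tuple

SemanticUnit = Tuple[str, str, str]  # (CommunicativeFunction, Slot, Value)

def semantic_accuracy(reference: List[FrozenSet[SemanticUnit]],
                      produced: List[FrozenSet[SemanticUnit]]):
    """Match, precision, recall, and CA for aligned lists of unit sets."""
    exact = correct = n_prod = n_ref = errors = 0
    for ref, prod in zip(reference, produced):
        exact += (ref == prod)          # exact-match utterances
        correct += len(ref & prod)      # correct semantic units
        n_prod += len(prod)
        n_ref += len(ref)
        missing = len(ref - prod)       # deletion candidates
        spurious = len(prod - ref)      # insertion candidates
        # Assumption: each spurious unit paired with a missing unit
        # counts as one substitution; the remainder are insertions
        # or deletions (SU_S + SU_I + SU_D per utterance).
        subs = min(missing, spurious)
        errors += subs + (missing - subs) + (spurious - subs)
    match = 100.0 * exact / len(reference)
    precision = 100.0 * correct / n_prod
    recall = 100.0 * correct / n_ref
    ca = 100.0 * (1.0 - errors / n_ref)  # SU = total annotated units
    return match, precision, recall, ca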

We achieve the results listed in table 5 for the test set of 1000 word-graphs. String accuracy is presented in terms of word accuracy (WA) and sentence accuracy (SA).


Table 5: Accuracy.

Input           Method               String accuracy   Semantic accuracy
                                     WA     SA         match   precision   recall   CA
test sentence   data-oriented        N/A    N/A        93.0    94.0        92.5     91.6
test sentence   grammar-based        N/A    N/A        95.7    95.7        96.4     95.0
word-graph      data-oriented        76.8   69.3       74.9    80.1        78.8     75.5
word-graph      grammar-based, N=2   82.3   75.8       80.9    83.6        84.8     80.9
word-graph      grammar-based, N=3   84.2   76.6       82.0    85.0        86.0     82.6

