

Results of the Evaluation

This section presents the results for word graphs. Table 3 lists string accuracy, semantic accuracy, and the computational resources required to complete the test.


Table 3: Accuracy and Computational Resources for 1000 word graphs. String Accuracy and Semantic Accuracy are given as percentages; total and maximum CPU-time in seconds; maximum memory requirements in Megabytes.
                  String Acc.   Semantic Accuracy          CPU (s)     Mem (MB)
Method     Site   WA    SA      match  prec  recall  ca     total   max   max
d2         A'dam  76.8  69.3    74.9   80.1  78.8    75.5    7011   648   619
d4         A'dam  77.2  69.4    74.9   79.1  78.8    75.1   32798  2023   621
f(bi,50)   Gron   81.3  74.6    79.5   82.9  83.8    79.9     215    16    37
f(bi,100)  Gron   82.3  75.8    80.9   83.6  84.8    80.9     297    15    37
f(bi,125)  Gron   82.3  75.9    81.3   83.9  85.2    81.3     340    24    38
b(bi,1)    Gron   81.1  73.6    78.5   82.1  83.1    78.9     175    16    31
b(bi,2)    Gron   82.3  75.7    80.8   83.9  84.8    81.1     255    20    32
b(bi,4)    Gron   82.8  76.0    80.8   83.8  85.0    81.3     479   115    34
b(bi,8)    Gron   83.4  76.5    81.6   84.6  85.6    82.2     780   276    43
b(bi,16)   Gron   83.8  76.4    81.7   84.9  86.0    82.6    1659   757    60
f(tr,50)   Gron   83.9  76.2    81.8   84.9  85.9    82.5    1399   607    64
f(tr,100)  Gron   84.2  76.6    82.0   85.0  86.0    82.6    1614   690    64
f(tr,125)  Gron   84.2  76.5    82.1   85.3  86.3    82.8    1723   755    64
b(tr,1)    Gron   83.9  76.2    81.5   84.5  85.7    82.2    1420   603    64
b(tr,2)    Gron   84.1  76.4    81.8   85.3  86.4    83.0    2802  1405   101
b(tr,4)    Gron   84.3  76.4    82.0   85.4  86.4    83.0    5524  2791   177


The total amount of CPU-time is somewhat misleading: most word graphs can be processed very efficiently, and only a few require a large amount of CPU-time. Table 4 therefore gives the semantic accuracy (concept accuracy) obtained when a time-out is imposed. For a word graph that exceeds the time-out, we assume that the system does not provide an update.


Table 4: Concept accuracy for 1000 word graphs (percentages) when results that exceed a time-out of 100, 500, 1000, 5000 or 10000 milliseconds of CPU-time are disregarded. The last column repeats the results if no time-out is assumed.
                  Time-out (ms)
Method     Site   100    500    1000   5000   10000  none
d2         A'dam  37.0   53.0   58.1   68.1   70.4   75.5
d4         A'dam  24.6   34.5   38.2   50.4   57.3   75.1
f(bi,50)   Gron   46.0   73.7   76.9   80.3   80.3   79.9
f(bi,100)  Gron   44.4   67.9   75.3   81.1   81.2   80.9
f(bi,125)  Gron   44.6   64.9   73.3   81.3   81.7   81.3
b(bi,1)    Gron   58.2   73.1   76.6   79.3   79.2   78.9
b(bi,2)    Gron   54.7   74.1   77.6   81.1   81.5   81.1
b(bi,4)    Gron   49.6   72.3   75.6   80.3   80.5   81.3
b(bi,8)    Gron   45.9   70.2   74.4   80.9   81.5   82.2
b(bi,16)   Gron   42.2   65.5   72.5   78.0   81.0   82.6
f(tr,50)   Gron   45.5   71.2   75.4   81.0   81.7   82.6
f(tr,100)  Gron   44.5   64.2   71.9   80.5   81.8   82.6
f(tr,125)  Gron   44.1   62.2   70.2   80.6   81.9   82.8
b(tr,1)    Gron   52.7   70.9   74.7   80.7   81.2   82.2
b(tr,2)    Gron   49.6   68.8   72.7   79.1   81.4   83.0
b(tr,4)    Gron   48.0   66.6   71.6   78.2   79.5   83.0
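
The time-out scoring described above can be illustrated with a minimal Python sketch. It is not the evaluation code used for table 4. It assumes that concept accuracy is computed as 100 * (1 - errors / gold units), where errors are the substituted, inserted and deleted semantic units, and that a timed-out word graph yields an empty update, so that all of its gold units count as errors. The `Result` record, its field names and the strict `>` comparison are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Result:
    """Hypothetical per-word-graph record; field names are assumptions."""
    cpu_ms: float     # CPU-time spent on this word graph, in milliseconds
    gold_units: int   # semantic units in the reference update
    errors: int       # substitutions + insertions + deletions in the system update


def concept_accuracy(results: Iterable[Result],
                     timeout_ms: Optional[float] = None) -> float:
    """Concept accuracy in percent, optionally under a CPU time-out.

    Assumed definition: 100 * (1 - total errors / total gold units).
    A word graph whose CPU-time exceeds the time-out is scored as if
    no update was produced, i.e. every gold unit is counted as missed.
    """
    total_units = 0
    total_errors = 0
    for r in results:
        total_units += r.gold_units
        if timeout_ms is not None and r.cpu_ms > timeout_ms:
            total_errors += r.gold_units   # timed out: empty update
        else:
            total_errors += r.errors
    if total_units == 0:
        raise ValueError("no gold semantic units to score against")
    return 100.0 * (1.0 - total_errors / total_units)


# Toy data, not taken from the test set.
toy = [Result(50, 2, 0), Result(2000, 2, 1), Result(20000, 4, 0)]
for limit in (1000, 5000, None):
    print(limit, concept_accuracy(toy, limit))
```

In this toy example the slowest word graph is analysed without errors, yet it lowers the score at every finite time-out. That is the effect visible in table 4 for the more expensive methods, where accuracy at short time-outs falls well below the figure without a time-out.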


We also present the results for test sentences (rather than word graphs). Such a test indicates what the results would be if the speech recogniser performed perfectly. Obviously, it does not make sense to measure string accuracy in such a set-up. Semantic accuracy and computational resources are presented in table 5. Because the average sentence length is very small, we also present concept accuracy as a function of the length of the input sentence in table 6.


Table 5: Semantic Accuracy and Computational Resources for 1000 test sentences. Total and maximum CPU-time in seconds; memory in Megabytes.
                  Semantic Accuracy          CPU (s)     Mem (MB)
Method     Site   match  prec  recall  ca     total   max   max
d4         A'dam  92.2   93.8  91.2    90.4     856    14    21
group.d2   A'dam  93.0   94.0  92.5    91.6      91     9    14
group.d4   A'dam  92.7   93.8  91.8    91.0    1614   174    48
group.d5   A'dam  92.6   93.7  92.3    91.4    3159   337    78
nlp        Gron   95.7   95.7  96.4    95.0      27     1    31



Table 6: Concept Accuracy versus Sentence Length for 1000 test sentences. The third column repeats the results for the full test set. The remaining columns list the results for the subset of the test set containing the sentences with at least 2 (4, 6, 8, 10) words.
Method       Site   all    ≥ 2    ≥ 4    ≥ 6    ≥ 8    ≥ 10
# instances         1000   601    344    160    74     38
d4           A'dam  90.4   87.4   84.7   75.9   69.8   65.2
group.d2     A'dam  91.6   89.0   86.7   78.7   69.8   68.3
group.d4     A'dam  91.0   88.2   85.8   77.3   69.4   64.6
group.d5     A'dam  91.4   88.7   86.6   78.9   71.8   67.1
nlp          Gron   95.0   93.4   93.0   88.1   85.9   87.0
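
The length-based subsets in table 6 are cumulative: each column covers every sentence with at least the given number of words, so the subsets are nested. The following Python sketch shows how such instance counts could be derived. The function name, the whitespace tokenisation and the example sentences are assumptions for illustration; the source does not specify how words were counted.

```python
from typing import Dict, Iterable, Sequence, Union


def count_by_min_length(sentences: Iterable[str],
                        thresholds: Sequence[int] = (2, 4, 6, 8, 10)
                        ) -> Dict[Union[str, int], int]:
    """Count sentences with at least n words for each threshold n.

    Words are counted by splitting on whitespace, which is an
    assumption made for this sketch.
    """
    lengths = [len(s.split()) for s in sentences]
    counts: Dict[Union[str, int], int] = {"all": len(lengths)}
    for n in thresholds:
        counts[n] = sum(1 for k in lengths if k >= n)
    return counts


# Invented example sentences, not taken from the test set.
example = ["ja", "naar Amsterdam", "ik wil naar Groningen", "nee"]
print(count_by_min_length(example, thresholds=(2, 4, 6)))
```

Scoring each subset separately then gives one accuracy figure per column. The instance counts in table 6 show how short the test sentences are: only 601 of the 1000 sentences contain two or more words.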




2000-07-10