Results of the Evaluation

This section presents the results for word graphs. Table 3 lists string accuracy, semantic accuracy and the computational resources required to complete the test.

Table 3: Accuracy and Computational Resources for 1000 word graphs. String Accuracy and Semantic Accuracy are given as percentages; total and maximum CPU-time are in seconds and maximum memory requirements in Megabytes.

Method	Site	String Acc		Semantic Accuracy				CPU		Mem
		WA	SA	match	prec	recall	ca	total	max	max
d2	A'dam	76.8	69.3	74.9	80.1	78.8	75.5	7011	648	619
d4	A'dam	77.2	69.4	74.9	79.1	78.8	75.1	32798	2023	621
f(bi,50)	Gron	81.3	74.6	79.5	82.9	83.8	79.9	215	16	37
f(bi,100)	Gron	82.3	75.8	80.9	83.6	84.8	80.9	297	15	37
f(bi,125)	Gron	82.3	75.9	81.3	83.9	85.2	81.3	340	24	38
b(bi,1)	Gron	81.1	73.6	78.5	82.1	83.1	78.9	175	16	31
b(bi,2)	Gron	82.3	75.7	80.8	83.9	84.8	81.1	255	20	32
b(bi,4)	Gron	82.8	76.0	80.8	83.8	85.0	81.3	479	115	34
b(bi,8)	Gron	83.4	76.5	81.6	84.6	85.6	82.2	780	276	43
b(bi,16)	Gron	83.8	76.4	81.7	84.9	86.0	82.6	1659	757	60
f(tr,50)	Gron	83.9	76.2	81.8	84.9	85.9	82.5	1399	607	64
f(tr,100)	Gron	84.2	76.6	82.0	85.0	86.0	82.6	1614	690	64
f(tr,125)	Gron	84.2	76.5	82.1	85.3	86.3	82.8	1723	755	64
b(tr,1)	Gron	83.9	76.2	81.5	84.5	85.7	82.2	1420	603	64
b(tr,2)	Gron	84.1	76.4	81.8	85.3	86.4	83.0	2802	1405	101
b(tr,4)	Gron	84.3	76.4	82.0	85.4	86.4	83.0	5524	2791	177

The total amount of CPU-time is somewhat misleading, because typically many word graphs can be treated very efficiently, whereas only a few require a large amount of CPU-time. Table 4 gives the semantic accuracy (concept accuracy) that is obtained if a time-out is assumed; whenever the time-out is exceeded, we assume that the system does not provide an update.

Table 4: Concept accuracy (percentages) for 1000 word graphs when results are disregarded after a time-out of 100, 500, 1000, 5000 and 10000 milliseconds of CPU-time, respectively. The last column repeats the results if no time-out is assumed.

Method	Site	100	500	1000	5000	10000	>
d2	A'dam	37.0	53.0	58.1	68.1	70.4	75.5
d4	A'dam	24.6	34.5	38.2	50.4	57.3	75.1
f(bi,50)	Gron	46.0	73.7	76.9	80.3	80.3	79.9
f(bi,100)	Gron	44.4	67.9	75.3	81.1	81.2	80.9
f(bi,125)	Gron	44.6	64.9	73.3	81.3	81.7	81.3
b(bi,1)	Gron	58.2	73.1	76.6	79.3	79.2	78.9
b(bi,2)	Gron	54.7	74.1	77.6	81.1	81.5	81.1
b(bi,4)	Gron	49.6	72.3	75.6	80.3	80.5	81.3
b(bi,8)	Gron	45.9	70.2	74.4	80.9	81.5	82.2
b(bi,16)	Gron	42.2	65.5	72.5	78.0	81.0	82.6
f(tr,50)	Gron	45.5	71.2	75.4	81.0	81.7	82.6
f(tr,100)	Gron	44.5	64.2	71.9	80.5	81.8	82.6
f(tr,125)	Gron	44.1	62.2	70.2	80.6	81.9	82.8
b(tr,1)	Gron	52.7	70.9	74.7	80.7	81.2	82.2
b(tr,2)	Gron	49.6	68.8	72.7	79.1	81.4	83.0
b(tr,4)	Gron	48.0	66.6	71.6	78.2	79.5	83.0
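
Figures of the kind shown in table 4 could be derived from per-word-graph measurements roughly as follows. This is only a sketch under stated assumptions: the `Result` record, its field names and the function `concept_accuracy` are hypothetical, and the scoring assumes that concept accuracy is an error rate over semantic units and that a missing update counts every reference unit as an error. Neither assumption is spelled out in this section.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Result:
    """Hypothetical per-word-graph measurement (not taken from the evaluation)."""
    cpu_ms: float   # CPU-time spent on this word graph, in milliseconds
    ref_units: int  # semantic units in the reference annotation
    errors: int     # units counted as wrong when the analysis is used


def concept_accuracy(results: Iterable[Result],
                     timeout_ms: Optional[float] = None) -> float:
    """Concept accuracy in percent; timeout_ms=None means no time-out."""
    total_units = 0
    total_errors = 0
    for r in results:
        total_units += r.ref_units
        timed_out = timeout_ms is not None and r.cpu_ms > timeout_ms
        # Assumption: without an update, every reference unit is missed.
        total_errors += r.ref_units if timed_out else r.errors
    if total_units == 0:
        raise ValueError("no reference units to score")
    return 100.0 * (total_units - total_errors) / total_units
```

Calling `concept_accuracy(results, timeout_ms=None)` corresponds to the last column of the table, and one call per time-out value fills the remaining columns.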

We also present the results for test sentences (rather than word graphs). Such a test indicates what the results would be if the speech recogniser performed perfectly. Obviously, it does not make sense to measure string accuracy in such a set-up. Semantic accuracy and computational resources are presented in table 5. Because the average sentence length is very small, we also present concept accuracy as a function of the length of the input sentence in table 6.

Table 5: Semantic Accuracy and Computational Resources for 1000 test sentences. Total and maximum CPU-time in seconds; memory in Megabytes.

Method	Site	Semantic Accuracy				CPU		Mem
		match	prec	recall	ca	total	max	max
d4	A'dam	92.2	93.8	91.2	90.4	856	14	21
group.d2	A'dam	93.0	94.0	92.5	91.6	91	9	14
group.d4	A'dam	92.7	93.8	91.8	91.0	1614	174	48
group.d5	A'dam	92.6	93.7	92.3	91.4	3159	337	78
nlp	Gron	95.7	95.7	96.4	95.0	27	1	31

Table 6: Concept Accuracy versus Sentence Length for 1000 test sentences. The third column repeats the results for the full test set. The remaining columns list the results for the subset of the test set containing the sentences with at least 2 (4, 6, 8, 10) words.

Method	Site	all	$\geq 2$	$\geq 4$	$\geq 6$	$\geq 8$	$\geq 10$
# instances		1000	601	344	160	74	38
d4	A'dam	90.4	87.4	84.7	75.9	69.8	65.2
group.d2	A'dam	91.6	89.0	86.7	78.7	69.8	68.3
group.d4	A'dam	91.0	88.2	85.8	77.3	69.4	64.6
group.d5	A'dam	91.4	88.7	86.6	78.9	71.8	67.1
nlp	Gron	95.0	93.4	93.0	88.1	85.9	87.0
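
The columns of table 6 are nested subsets: every sentence with at least 4 words is also counted in the column for at least 2 words. A minimal sketch of that grouping is given below. The `Sentence` record and `accuracy_by_min_length` are hypothetical names, and scoring by an error rate over semantic units is an assumption, since this section does not define concept accuracy.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class Sentence:
    """Hypothetical per-sentence measurement (not taken from the evaluation)."""
    n_words: int    # length of the input sentence in words
    ref_units: int  # semantic units in the reference annotation
    errors: int     # units counted as wrong


def accuracy_by_min_length(
    sentences: Iterable[Sentence],
    thresholds: Iterable[int],
) -> Dict[int, Tuple[int, float]]:
    """Map each minimum length to (number of instances, accuracy in percent)."""
    items = list(sentences)
    table: Dict[int, Tuple[int, float]] = {}
    for t in thresholds:
        # Cumulative subset: all sentences with at least t words.
        subset = [s for s in items if s.n_words >= t]
        units = sum(s.ref_units for s in subset)
        errors = sum(s.errors for s in subset)
        ca = 100.0 * (units - errors) / units if units else float("nan")
        table[t] = (len(subset), ca)
    return table
```

With thresholds `[1, 2, 4, 6, 8, 10]`, the entry for 1 plays the role of the "all" column and the instance counts correspond to the "# instances" row.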