Practical Experience

There is no generally agreed-upon method for measuring the efficiency of parsers for grammars of the kind we assume here, i.e. constraint-based grammars for natural language understanding. Therefore, I present the results of the parser for the current version of the OVIS grammar in comparison with a number of other parsers that have been developed in the same project (by the author and his colleagues). A similar experiment was performed with two other grammars: the English MiMo2 grammar [30] and the English Alvey NL Tools grammar [6]. The results should not be taken as a formal evaluation; they are presented solely to give an impression of the practical feasibility of the parser, at least for its present purpose, and should be understood with these reservations in mind.

Other Parsers

In the experiments the head-corner parser (hc) was compared with a number of other parsers: a left-corner parser (lc), a bottom-up active chart parser (bu-active), a bottom-up inactive chart parser (bu-inactive), a bottom-up Earley-style parser (bu-earley), and an LR parser (lr); these abbreviations are also used in the tables below. The parsers are described in further detail in [31] and [29]. Two of these parsers were implemented by Mark-Jan Nederhof.

Note that we have experimented with a number of different versions of each of these parsers; we report only on the most efficient version of each. The experiments were performed on a 125 MHz HP-UX 735 machine with 240 megabytes of memory. Timings measure CPU-time and should therefore be independent of the load on the machine.
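As a minimal sketch of how such CPU-time measurements can be obtained, assuming a Prolog system that provides the statistics(runtime, [Msec,_]) built-in (as SICStus and SWI-Prolog do); the predicate names time_parse/2 and parse/1 are illustrative only and do not refer to the actual test harness:

    % Measure the CPU-time (in milliseconds) spent on a single parse.
    % Assumes statistics(runtime, [Msec,_]); time_parse/2 and parse/1
    % are illustrative names.
    time_parse(Input, Msec) :-
        statistics(runtime, [T0,_]),
        ( parse(Input) -> true ; true ),   % time the call whether it succeeds or not
        statistics(runtime, [T1,_]),
        Msec is T1 - T0.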

Experiment 1: OVIS

The OVIS grammar (for Dutch) contains about 1400 lexical entries (many of which are station and city names) and 66 rules (a substantial fraction of those rules is concerned with time and date expressions), including 7 epsilon rules. The most important epsilon rule is part of a gap threading implementation of verb-second. The grammar is documented in detail in [31]. The head-corner table contains 128 pairs, the lexical head-corner table 93 pairs, and the gap-head-corner table 14 pairs. The left-corner table contains 156 pairs, the lexical left-corner table 114 pairs, and the gap-left-corner table 20 pairs. The pre-compiled grammar (which is used by the chart parsers) contains 92 rules.

The input for the parser consists of a test-set of 5000 word-graphs, randomly taken from a corpus of more than 25000 word-graphs. These word-graphs are the latest that were available to us; they are 'real' output of the current version of the speech recognizer as developed by our project partners. In this application, typical utterances are short. As a consequence, the typical size of word-graphs is rather small too, as can be seen in table 1.
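For concreteness, a word-graph can be thought of as a directed acyclic graph whose transitions are labelled with a word and an acoustic score. The toy encoding below is only an illustrative sketch: the predicate names wordgraph_transition/4 and final_state/2, the words and the scores are all made up for this example and do not necessarily reflect the actual OVIS representation.

    % Illustrative encoding of a tiny word-graph as Prolog facts.
    % wordgraph_transition(FromState, ToState, Word, AcousticScore).
    wordgraph_transition(0, 1, ik,        153).
    wordgraph_transition(0, 1, in,        187).
    wordgraph_transition(1, 2, wil,        95).
    wordgraph_transition(2, 3, naar,       80).
    wordgraph_transition(3, 4, amsterdam, 210).
    % final_state(State, Score): states in which the utterance may end.
    final_state(4, 0).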


Table 1: The first table gives information concerning the number of transitions per word-graph of the test set for the OVIS grammar. As can be seen from this table, more than half of the corpus consists of word-graphs with at most five transitions. In the second table the number of words per utterance is given. Many utterances consist of fewer than five words.
# transitions # word-graphs
0-5 2825
6-10 850
11-15 408
16-20 246
21-30 237
31-40 146
41-50 83
51-75 112
76-100 44
101-150 36
151-200 12
263 1
   
# words # utterances
1-2 2465
3-4 1448
5-6 543
7-8 319
9-10 118
11-12 56
13-14 26
15-16 20
17-18 5

We report on three different experiments with the OVIS grammar and these word-graphs. In the first experiment, the system runs in best-1-mode: the best path is selected from the word-graph using bigram scores and the acoustic scores (present in the word-graph). This best path is then sent to the parser and robustness component. In the second experiment, the parser is given the utterance as it was actually spoken (to simulate a situation in which speech recognition is perfect). In the third experiment, the parser takes the full word-graph as its input. The results are then passed on to the robustness component. As explained in the previous section on robustness, each of the parsers finds all derivations of the start symbol anywhere in the input (this is the case in each of the OVIS experiments).

For the current version of the OVIS system, parsing on the basis of the best path in the word-graph gives results in terms of word-accuracy which are similar to the results obtained with full word-graphs. Results for concept-accuracy are not yet available. Details can be found in [29].

Parsing best path only

In table 2 the CPU-time requirements and the maximum space requirements of the different parsers are listed.
Table 2: Total and average CPU-time and maximal space requirements for a test-set of 5000 best paths through word-graphs (OVIS grammar).
parser   total (msec)   msec/sentence   maximum (msec)   maximum space (Kbytes)
hc 169370 34 530 163
lc 180160 36 530 171
bu-active 291870 58 4220 1627
bu-inactive 545060 109 13050 784
bu-earley 961760 192 24470 2526
lr 1088940 218 416000 4412

The table lists, respectively, the total number of milliseconds of CPU-time required for all 5000 word-graphs (timings include lexical lookup, parsing and the robustness component), the average number of milliseconds per word-graph, and the maximum number of milliseconds for a single word-graph. The final column lists the maximum space requirements (per word-graph, in Kbytes).

Parsing sentences

The differences in CPU-time for the corpus of 5000 word-graphs are similar to differences we have found for other test sets. The results are also very similar to the results we obtain if we parse the actually spoken utterances. Table 3 lists the results of parsing the set of 5000 utterances from which the word-graphs were derived.

Table 3: Total and average CPU-time and maximum space requirements for a test-set of 5000 utterances (OVIS grammar).
parser   total (msec)   msec/sentence   maximum (msec)   maximum space (Kbytes)
hc 126930 25 510 137
lc 137090 27 490 174
bu-active 257390 51 4030 1438
bu-inactive 546650 109 15170 1056
bu-earley 934810 187 25490 3558
lr 957980 192 417580 4435

Parsing word-graphs

Obviously, parsing word-graphs is more difficult than parsing only the best path through a word-graph, or parsing an ordinary sentence. In table 4 we list the results for the same set of 5000 word-graphs. This experiment could only be performed for the head-corner and the left-corner parser. The other parsers ran into memory problems for some very large word-graphs.


Table 4: Total and average CPU-time and maximum space requirements for a test-set of 5000 word-graphs (OVIS grammar).
parser   total (msec)   msec/word-graph   maximum (msec)   maximum space (Kbytes)
lc 410670 82 15360 4455
hc 435320 87 16230 4174

In order to compare the other parsers as well, I performed the experiment with a time-out of 5000 msec of CPU-time (the memory problems only occur for word-graphs that take longer to process). Table 5 lists, for each parser, the percentage of word-graphs that can be treated within a given amount of CPU-time.
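Such a time-out can be imposed with the time-out facilities of the Prolog system. The sketch below assumes SICStus-style time_out/3 from library(timeout); parse_word_graph/1 is an invented name standing for the parser plus robustness component, so this is an illustration of the idea rather than the actual test harness.

    :- use_module(library(timeout)).        % SICStus library; an assumption

    % parse_with_limit(+WordGraph, -Result): Result is bound to 'success'
    % if parse_word_graph/1 finishes within 5000 msec, and to 'time_out'
    % otherwise.
    parse_with_limit(WordGraph, Result) :-
        time_out(parse_word_graph(WordGraph), 5000, Result).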


Table 5: Percentage of word-graphs that can be treated within a given CPU-time limit (OVIS grammar). Column headers indicate the time limit in msec.
parser 500 1000 2000 3000 4000 5000 time-outs
lc 97.72 99.28 99.78 99.92 99.92 99.92 4
hc 97.42 98.94 99.60 99.84 99.92 99.92 4
lr 91.44 94.42 96.30 96.98 97.34 97.70 115
bu-active 91.84 94.76 96.04 96.84 97.30 97.60 120
bu-inactive 82.36 88.64 92.24 94.10 95.14 95.86 207
bu-earley 77.10 84.26 89.04 91.42 92.64 93.50 325

From the experiments with the OVIS grammar and corpus it can be concluded that the head-corner and left-corner parsers (implemented with selective memoization and goal-weakening) are much more efficient than the other parsers. In the case of word-graphs, the left-corner parser is about 5% faster than the head-corner parser; for strings, the head-corner parser is about 6 to 8% faster than the left-corner parser.

Experiment 2: MiMo2

Another experiment was carried out for the English grammar of the MiMo2 system. This grammar is a unification-based grammar which is compiled into a DCG. The grammar contains 525 lexical entries and 63 rules, 13 of which are gap (epsilon) rules. The head-corner relation contains 33 pairs and the lexical head-corner relation contains 18 pairs. The left-corner parser runs into hidden left-recursion problems on the original grammar, so it uses a version of the grammar in which left-most gaps are compiled out. This compiled grammar has 69 rules. The left-corner relation contains 80 pairs; the lexical left-corner relation contains 62 pairs. As a result, the left-corner parser only hypothesizes gaps for non-left-most daughters. Because the grammar never allows gaps as head, the head-corner parser can be optimized in a similar fashion. Both the left-corner and head-corner parser use a goal-weakening operator which leaves only the functor symbol. This simplifies the way in which the goal table is maintained.
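A goal-weakening operator of this kind replaces a fully specified category by the most general term with the same functor (and arity), so that goals can be compared and tabled on the functor alone. The following is a minimal sketch; the predicate name weaken_goal/2 is not taken from the actual implementation.

    % Keep only the functor (and arity) of a category; all arguments
    % are replaced by fresh variables.
    weaken_goal(Goal, Weakened) :-
        functor(Goal, Functor, Arity),
        functor(Weakened, Functor, Arity).

    % For example, weaken_goal(np(sg,nom,Agr), W) binds W to np(_,_,_).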

For this experiment we have no notion of typical input; instead we simply made up a set of 25 sentences of various lengths and degrees of difficulty, with a total of 338 readings. In order to complete the experiment, a time-out of 60 seconds of CPU-time was used. Timings include lexical lookup and parse tree recovery.

The original parser implemented in the MiMo2 system (a left-corner parser without packing) took 294 seconds of CPU-time to complete the experiment (with 3 time-outs). Because its test environment was (only slightly) different, these results are indicated in italics. Average CPU-time is only given for those parsers which completed each of the sentences within the time limit. The results are given in table 6.


Table 6: Total and average CPU-time and maximum space requirements for set of 25 sentences (MiMo2 grammar).
parser   total (msec)   msec/sentence   maximum space (Kbytes)   time-outs
hc 52670 2107 2062 0
bu-active 52990 2120 30392 0
lc 109750 4390 8570 0
mimo2-lc 294000     3
bu-earley 439050   12910 4
bu-inactive 498610   7236 5

The bottom-up active chart parser performs better on smaller sentences with a small number of readings. For longer and more ambiguous sentences the head-corner parser is (much) more efficient. The other parsers are consistently much less efficient.

Experiment 3: Alvey NL Tools

A final set of experiments was performed with the Alvey NL Tools grammar [6], similar to the experiments discussed in [5]. For a longer description of the grammar and the test sets we refer to that publication. The grammar contains 2363 lexical entries and 780 rules (8 of which are gap rules). The left-corner relation contains 440 pairs; the lexical left-corner relation contains 254 pairs. No gaps are possible as the left-most element of the right-hand-side of a rule.

In order to be able to use the head-corner parser we needed to determine for each of the rules which element on the right-hand-side constitutes the head of the rule. The head-corner relation contains 352 pairs; the lexical head-corner relation contains 180 pairs. We also report on experiments in which for each rule the left-most member of the right-hand-side was selected as the head. The goal-weakening operator used for the left-corner and head-corner parser removes all features (only leaving the functor symbol of each category); again this simplifies the maintenance of the goal table considerably.
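Once a head daughter has been designated for each rule, the head-corner relation can be computed as the reflexive and transitive closure of the relation 'H is the head daughter of a rule with mother M'. The sketch below illustrates the idea over atomic category symbols only; rule_head/2 and head_corner/2 are illustrative names, and the tables used by the parser are of course computed over the (weakened) categories of the grammar.

    % rule_head(Mother, HeadDaughter): for every rule, its mother category
    % and the category of its designated head daughter (toy examples).
    rule_head(s,  vp).
    rule_head(vp, v).
    rule_head(np, n).

    % head_corner(H, Goal): H is a possible head-corner of Goal, i.e. the
    % reflexive and transitive closure of rule_head/2.  For grammars in
    % which this relation is cyclic, tabling/memoization is needed to
    % guarantee termination.
    head_corner(Cat, Cat).
    head_corner(H, Goal) :-
        rule_head(Goal, Daughter),
        head_corner(H, Daughter).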

The bottom-up chart parsers use a version of the grammar in which all epsilon rules are compiled out. The resulting grammar has 1015 rules.

The first test set consists of 129 short sentences (mean length 6.7 words). Our results were obtained with a newer version of the Alvey NL Tools grammar. In the table below we list the results for the same grammar and test set for Carroll's bottom-up left-corner parser (BU-LC). Carroll performed this experiment on a SUN UltraSparc 1/140; he and the author estimated that this machine is about 1.62 times faster than the HP-UX 735 on which the other experiments were performed.

In the table below we have multiplied the 13.3 seconds of CPU-time obtained by Carroll by this factor in order to compare his results with ours. Clearly, these numbers should be taken with extreme caution, because many factors in the test environment differ (hardware, LISP versus Prolog). For this reason we use italics in table 7.


Table 7: Total and average CPU-time and maximum space requirements for set of 129 short sentences (Alvey NL Tools grammar). Italicized items are offered for cautious comparison.
parser   total (msec)   msec/sentence   maximum space (Kbytes)
bu-active 18250 141 1276
lc 21900 170 137
Carroll BU-LC 21500 167  
hc (lc mode) 23690 184 165
bu-earley 27670 214 758
hc 68880 534 140
bu-inactive 83690 649 170

The second test set consists of 100 longer and much more complex sentences, created by Carroll. The length of the sentences is distributed uniformly between 13 and 30 words. Many of the sentences are highly ambiguous: the maximum number of parses is 2736, for a 29-word sentence, and the average number of readings is about 100 per sentence.

Again, we list the results Carroll obtained with the BU-LC parser; it took 205.7 seconds on the SUN UltraSparc 1/140. The bottom-up active chart parser ran into memory problems for some very ambiguous sentences and was very slow on many of the other sentences (due to the lack of packing). The results are summarized in table 8.


Table 8: Total and average CPU-time and maximum space requirements for set of 100 longer sentences (Alvey NL Tools grammar). Italicized items are offered for cautious comparison.
parser   total (msec)   msec/sentence   maximum space (Kbytes)
lc 195850 1959 10955
hc (lc mode) 216180 2162 10969
Carroll BU-LC 333000 3330  
bu-earley 1219120 12191 18232
hc 3053910 30539 7915
bu-inactive 3578370 35784 16936
bu-active       > 65000

The implementation of the left-corner parser based on selective memoization and goal-weakening seems to be substantially more efficient than the chart-based implementation of Carroll. The head-corner parser running in left-corner mode is almost as fast as this specialized left-corner parser. This suggests that the use of selective memoization with goal-weakening is on the right track.

From these experiments it can be concluded that the head-corner parser is not suitable for the Alvey NL Tools grammar. The reason seems to be that for this grammar the amount of top-down information that is available through the head-corner table is of limited value. In order to parse a given goal category, too many different lexical head-corners are typically available. For example, in order to parse a sentence, possible head-corners include auxiliaries, verbs, adverbs, complementizers, pronouns, prepositions, determiners, nouns and conjunctions. In contrast, in the MiMo2 grammar only verbs can function as the head-corners of sentences. As a result, the prediction step introduces too much non-determinism. A related reason for the poor performance on this grammar might be the large amount of lexical ambiguity. The grammar and lexicon used in the experiment are compiled from a compact user notation. In the compiled format, all disjunctions are spelled out in different rules and lexical entries. As a result, many words have a large number of (only slightly different) readings. It may be that the head-corner parser is less suitable in such circumstances. This could also explain the fact that the head-corner parser performs better on strings than on word-graphs: in many respects the generalization to word-graphs is similar to an increase in lexical ambiguity. This suggests that further improvements in the design of the head-corner parser should be sought in the prediction step.

Availability of test material

The material used to perform the experiments with the MiMo2 grammar and the Alvey NL Tools grammar, including several versions of the head-corner parser, is available via anonymous ftp and the World Wide Web. The material is ready to be plugged into the Hdrug environment, which is available from the same site.

ftp://ftp.let.rug.nl/pub/prolog-app/CL97/
http://www.let.rug.nl/~vannoord/CL97/

