


Parsing

As explained before, a user communicates with the OVIS system by means of spoken language over a telephone line. The analysis of an utterance first requires automatic speech recognition. In the ideal case, this yields a single list of words uttered by the user. The next step is to parse this list of words, i.e. to recognize it as a correct sentence with a certain structure, the parse. Ideally, only one such parse can be assigned to the sentence. The semantics, i.e. the QLF, can then be obtained from the parse. This QLF is subsequently translated into an update and passed on to the other modules of the OVIS system.

There are three obstacles to a straightforward application of the syntactic analysis described above:

1. the speech recognizer does not deliver a single list of words, but a word graph representing several alternative word sequences;
2. utterances in spoken dialogue are often not grammatically well-formed, so the analysis must be robust;
3. sentences (and word graphs) may be ambiguous, i.e. allow more than one parse.

In the following, we will first explain what the parsing process looks like if we do not take these problems into account. Then we will discuss these problems, and the way we deal with them, in turn.

Introduction to parsing

In the ideal case, the input of the parser is a single list of words, such as

Ik wil om ongeveer vier uur vertrekken
I want to leave around four o'clock

and the intended output is a single parse, such as the one in figure 3, which has been constructed according to the grammar (discussed in the previous section).

There are many parsing algorithms which can be applied in order to find such parses, given an input and a grammar. A simple parsing algorithm is the CYK algorithm [11]. It is based on the divide-and-conquer strategy: the problem of finding a parse for some input is reduced to finding subparses for the substrings of the input first; these are then combined. More specifically, we start with finding subparses for the smallest units in the input (lexical entries), collections of these subparses are then combined to form larger parses for larger chunks of the input, until the parse for the complete input is found.

In the example, we start by labeling the words in the sentence with the categories NP, V, P, T-MOD, T-NUM, T-HOUR, V, by looking up the words in the lexicon. This result is then used to construct small subparses, such as the subparse covering the input `ongeveer vier uur' in figure 3, which has the category NP at its root. This subparse is later combined with a subparse covering the adjacent input `om' to construct the larger subparse covering the input `om ongeveer vier uur', which has PP as its root category, and so on, until the complete parse in figure 3 has been found. Such a combination of subparses is allowed only if there is a grammar rule that licenses it. In the running example we can combine P and NP because there is a grammar rule PP → P NP.
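To make the bottom-up strategy concrete, the following sketch implements a CYK-style recognizer in Python for a toy, binarized grammar fragment containing the rule PP → P NP. The lexicon and the rule NP → T-NUM T-HOUR are invented for illustration and do not reproduce the actual OVIS grammar.

# A minimal CYK-style recognizer, shown only to illustrate the bottom-up
# strategy described in the text.  The grammar fragment is a toy,
# binarized approximation; it is not the actual OVIS grammar.

LEXICON = {
    'om': {'P'},
    'vier': {'T-NUM'},
    'uur': {'T-HOUR'},
}

RULES = [
    ('NP', ('T-NUM', 'T-HOUR')),   # invented rule, e.g. for `vier uur'
    ('PP', ('P', 'NP')),           # the rule PP -> P NP from the text
]

def cyk(words):
    n = len(words)
    # chart[i][j] holds the categories found for words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    # step 1: label the words with lexical categories
    for i, w in enumerate(words):
        chart[i][i + 1] = set(LEXICON.get(w, ()))
    # step 2: combine adjacent subparses into larger ones
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                    # split point
                for lhs, (d1, d2) in RULES:
                    if d1 in chart[i][k] and d2 in chart[k][j]:
                        chart[i][j].add(lhs)
    return chart[0][n]

print(cyk('om vier uur'.split()))   # -> {'PP'}

The chart-filling loop mirrors the description above: lexical lookup first, then repeated combination of adjacent subparses, each combination licensed by a rule.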

CYK is a bottom-up algorithm. Other parsing algorithms, such as top-down parsing, are more selective: the kinds of subparse that are constructed from a certain input position onward are dependent on subparses already found to the left of that input position. An example is Earley's algorithm [9]. Other examples can be found in e.g. [14].

When complex categories are used (as is the case in OVIS), the parsing problem becomes even more complicated and further distinctions can be made between different top-down strategies and different bottom-up strategies.

A further distinction between parsing algorithms can be made according to the order in which the input string is traversed. Many algorithms proceed from left to right, but it has been observed that a bidirectional flow of control, as in head-driven parsers [12,22,20,15], has practical advantages [4].

Although the end result of parsing is always the same, the speed of different parsing algorithms may differ significantly. This is a relevant issue for a real-time system such as OVIS. For this reason we are currently developing implementations of a number of different parsing algorithms. These implementations are then compared in order to find out which parser works best for the OVIS grammar and for input that is typical of the application.

For the construction of parsers from the grammar, we use a parser generator that is capable of a detailed analysis of the grammar. For example, it separates grammar symbols into a part that is required for finding the correct parses, and a part that is merely required for computing the semantics of correct parses. During syntactic analysis, the latter part needs to be applied only to complete parses, and not to all subparses, some of which may turn out to be useless later. This may speed up the syntactic analysis.
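The following minimal sketch, assuming (hypothetically) that grammar symbols are flat feature sets with a dedicated `sem' feature, illustrates why this separation pays off: subparses that differ only in their semantics collapse into a single item during parsing.

# Schematic, hypothetical illustration of splitting grammar symbols into a
# part used during parsing and a part used only for semantics.  The feature
# names and values are invented; they are not the actual OVIS categories.

def syntactic_part(cat):
    # keep every feature except the semantic one
    return {k: v for k, v in cat.items() if k != 'sem'}

np_station = {'cat': 'NP', 'agr': 'sg', 'sem': 'station(delft)'}
np_time    = {'cat': 'NP', 'agr': 'sg', 'sem': 'time(16,0)'}

# During parsing both NPs behave identically, so the parser needs only one
# chart item for them; the semantic parts are evaluated only for subparses
# that end up in a complete parse.
print(syntactic_part(np_station) == syntactic_part(np_time))   # True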


Word graphs

In practice, speech recognition does not allow unique identification of the list of words that a person actually utters at the other end of a telephone line. There are a number of reasons for this. First, there are technical limitations in state-of-the-art speech recognition for connected speech, which mean that full recognition of all words cannot be attained for all speakers, regardless of dialect, foreign accent, pitch of voice, quality of articulation, etc. The second reason is the varying quality of telephone connections, which may blur the spoken text to some degree. Background noise, such as slamming doors, noisy pets, and conversations in the background, also constitutes a problem for speech recognizers. Finally, some sequences of sounds may actually be ambiguous, in the sense that they may represent multiple sequences of words. The reason that people are usually unaware of such ambiguity is that we unconsciously disambiguate by means of syntax and other kinds of non-phonetic information.

These considerations have led to the notion of a word graph [16]. A word graph, produced by an automatic speech recognition device, is a succinct representation of all alternative sequences of words considered by the device when it cannot identify a unique sequence. Most parsing algorithms for linear input can easily be generalized to deal with word graphs [3,13,23].

An example of such a graph is given in figure 6. The nodes represent points in time: the left-most node represents the point in time where a sentence may start, and the right-most node the point in time where the sentence may end.

Each path from the initial node to the final node represents one alternative sentence. The edges in the graph are labeled with words, but also with scores, which indicate the degree of certainty with which words are recognized. The lower the score, the likelier it is that the word represents the sounds perceived between the corresponding points in time. For a sequence of words, the scores can be summed. The result is a degree of likelihood that that sequence of words corresponds to the actual utterance.

One could try to find the path in the graph from the initial node to the final node which has the lowest score, and take this to be the intended sentence. However, as we have argued before, non-phonetic information such as syntax and world knowledge is often needed to find the correct sentence, and therefore selection of one particular sentence based on phonetic scores alone may be premature.
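A hedged sketch of this representation and of the lowest-score path computation is given below; the nodes, words and scores are invented and do not reproduce figure 6.

# A word graph as a set of scored edges between time points.  The edges and
# scores below are made up for illustration only.

EDGES = [
    # (from_node, to_node, word, score) -- lower scores are likelier
    (0, 1, 'ik', 5),
    (1, 2, 'wil', 8),
    (1, 2, 'willen', 6),
    (2, 3, 'naar', 4),
    (3, 4, 'Delft', 7),
]

def path_score(path):
    # scores along a path are summed
    return sum(score for (_, _, _, score) in path)

def best_path(edges, start, goal):
    # lowest-score path from start to goal; since nodes are ordered time
    # points and edges only go forward, a simple dynamic program suffices
    best = {start: (0, [])}
    for frm, to, word, score in sorted(edges):
        if frm in best:
            cand_score = best[frm][0] + score
            if to not in best or cand_score < best[to][0]:
                best[to] = (cand_score, best[frm][1] + [(frm, to, word, score)])
    return best.get(goal)

score, path = best_path(EDGES, 0, 4)
print(path_score(path), [w for (_, _, w, _) in path])
# -> 22 ['ik', 'willen', 'naar', 'Delft']

Note that in this made-up fragment the lowest-score path contains the ungrammatical `ik willen', which is exactly why selecting a path on phonetic scores alone may be premature.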

In the example of figure 6, the path corresponding to the `sentence'

Ik willen een Etten snelheid
I want(plural) a Etten speed

has the lowest score, yet it is unlikely that this is the sentence uttered by the user, for obvious reasons.

Therefore, the other sentences should not be discarded before syntactic and semantic considerations have been taken into account.

Figure 6: An example of a word graph.

In the running example there are 8 paths in the word graph. Only two are more or less syntactically correct:

Ik wil alleen met de sneltrein
I want only with the inter-city

Ik wil alleen met de snelheid
I want only with the speed

The second has the lowest score, viz. 5 + 8 + 4 + 8 + 6 + 5 = 36, and may be taken as the intended sentence, unless later phases of the analysis (e.g. pragmatic analysis) prefer the first, which is likely in this case.
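As noted above, parsers for linear input generalize readily to word graphs: chart items simply span graph nodes instead of string positions. A minimal sketch of this generalization, reusing the toy grammar from the earlier CYK example (repeated here so the fragment is self-contained, and again purely illustrative):

# Bottom-up recognition over a word graph.  Grammar, lexicon and graph are
# toy examples; scores are omitted for brevity.

LEXICON = {'om': {'P'}, 'vier': {'T-NUM'}, 'uur': {'T-HOUR'}, 'duur': {'ADJ'}}
RULES = [('NP', ('T-NUM', 'T-HOUR')), ('PP', ('P', 'NP'))]

# word-graph edges: (from_node, to_node, word)
GRAPH = [(0, 1, 'om'), (1, 2, 'vier'), (2, 3, 'uur'), (2, 3, 'duur')]

def parse_graph(graph):
    chart = set()                              # items are (category, i, j)
    for i, j, word in graph:                   # lexical lookup, per edge
        for cat in LEXICON.get(word, ()):
            chart.add((cat, i, j))
    changed = True
    while changed:                             # combine until nothing new
        changed = False
        for lhs, (d1, d2) in RULES:
            for (c1, i, k) in list(chart):
                for (c2, k2, j) in list(chart):
                    if c1 == d1 and c2 == d2 and k == k2 and (lhs, i, j) not in chart:
                        chart.add((lhs, i, j))
                        changed = True
    return chart

print(sorted(parse_graph(GRAPH)))
# includes ('PP', 0, 3): `om vier uur' is recognized along one path

In the full system the scores on the edges are of course carried along, so that grammaticality and phonetic likelihood can be weighed together.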

Robust parsing

In the previous section we suggested, as a first approximation, that the analysis of the input should restrict itself to those sentences in the word graph that can be processed by the syntactic analysis. Strictly speaking, syntactic analysis means assigning a structure to a syntactically correct sentence. In this section we refine this idea. The reason is that typical sentences occurring in spoken dialogue may not be syntactically correct. Since syntactic errors hardly ever disrupt the normal course of everyday human-human conversation, one would like the OVIS system to be equally robust against any kind of syntactic error in the sentences spoken by the user.

More specifically, if the grammar does not assign an analysis to a given input, then the system should still try to construct some form of parse which allows a semantic value to be computed. This semantic value should in the ideal case correspond with the intentions of the user, as expressed in the ungrammatical sentence (note that robustness concerns both ill-formed input and well-formed input which our grammar fails to recognize).

This idea can be realised by three mechanisms, which may be combined in order to obtain the desired result:

1. relaxation of grammatical constraints;
2. correction of the input by means of rewrite rules;
3. analysis of (partial) phrases rather than complete sentences only.

A typical application of the first mechanism is the treatment of mismatches of agreement. For example,

Ik wilt naar Amsterdam
I wants to go to Amsterdam

is a grammatically incorrect sentence, because there is no agreement between the subject and the verb. By relaxing this constraint, the analysis of the input can proceed as if the correct verb form had been used. A disadvantage of this approach is that unwanted analyses may now be derived as well.
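A minimal sketch of what such relaxation might look like, assuming agreement is encoded in flat feature sets; the feature names and the penalty value are invented for illustration.

# Hypothetical sketch of agreement relaxation during unification.

def unify(a, b, relax=()):
    # unify two flat feature sets; features listed in `relax' may mismatch,
    # at the cost of a penalty that is reported alongside the result
    result, penalty = dict(a), 0
    for feat, val in b.items():
        if feat in result and result[feat] != val:
            if feat in relax:
                penalty += 1            # remember the violation, but go on
                result[feat] = None     # leave the clashing value unresolved
            else:
                return None, None       # hard failure: no analysis
        else:
            result[feat] = val
    return result, penalty

subj_agr = {'per': 1, 'num': 'sg'}      # agreement features of `ik'
verb_agr = {'per': 3, 'num': 'sg'}      # agreement features of `wilt'

print(unify(subj_agr, verb_agr))                  # (None, None): strict analysis fails
print(unify(subj_agr, verb_agr, relax=('per',)))  # succeeds, with penalty 1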

Another typical source of ungrammaticality is the phenomenon of hesitation. An example is

Ik wil naar ...eh...naar Groningen
I want to go to...hm...to Groningen

The user hesitates before mentioning the place of destination, and when ready to continue the sentence, repeats some of the words mentioned before (`naar' in the example).

Such errors may be treated by changing the input. Techniques for accomplishing such repairs are proposed in [7] and [6]. Since the input for the syntactic analysis in our case is the word graph, we realise this by adding some edges. We give an example for the phenomenon of hesitations. The desired correction is described by the following rewrite rule.

w eh w → w

In words, this rewrite rule says that an occurrence of a string w, which consists of arbitrary words, followed by some sign of hesitation (represented by eh), followed by the same string w, can be changed into the string w. An example of the application of this rule is given in figure 7.

Figure 7: Adding an edge according to a rewrite rule.

Many other forms of ungrammaticality can be corrected by means of rewrite rules.
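A hedged sketch of how the single-word case of the hesitation rule can be applied to a word graph by adding edges; the edge format, scores, the marker `eh' and the penalty value are illustrative assumptions (the penalty anticipates the scoring discussion below).

# Applying `w eh w -> w' to a word graph by adding derived edges.
# Only the single-word case of the rule is handled in this sketch.

PENALTY = 10   # derived edges are penalized, as discussed further below

def repair_hesitations(edges):
    # edges are (from_node, to_node, word, score)
    new_edges = []
    for (i1, j1, w1, s1) in edges:
        for (i2, j2, w2, s2) in edges:
            if w2 == 'eh' and i2 == j1:              # hesitation right after w1
                for (i3, j3, w3, s3) in edges:
                    if i3 == j2 and w3 == w1:        # the same word repeated
                        score = s1 + s2 + s3 + PENALTY
                        new_edges.append((i1, j3, w1, score))
    return edges + new_edges

GRAPH = [
    (0, 1, 'ik', 4), (1, 2, 'wil', 5), (2, 3, 'naar', 3),
    (3, 4, 'eh', 2), (4, 5, 'naar', 3), (5, 6, 'Groningen', 6),
]
for edge in repair_hesitations(GRAPH):
    print(edge)
# adds (2, 5, 'naar', 18): a single `naar' now spans the hesitation region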

Some input from the user may not be intended as a complete sentence at all: it may consist of phrases or merely partial phrases. For example, if the system asks

Waar wilt u heen?
Where do you want to go to?

the user may answer with any of the following utterances, none of which is a sentence:

a. Delft
b. naar Delft
c. Delft om negen uur
d. naar Delft om negen uur
  to Delft at nine o'clock

It would be fruitless to try to turn these answers into full sentences in all possible ways. Instead, it suffices for the system to parse the answer as a phrase or partial phrase, compute its semantics, and then complete this semantics into the intention of the user, using the elements of the question.

In example (a) above, the syntactic analysis should merely recognize `Delft' as the name of a station and translate it into the internal name for that station. Subsequent combination of this information with the meaning of the question then yields semantics reflecting the intention of the user to go to Delft (and not to depart from Delft, for example).
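A schematic, hypothetical illustration of this completion step; the slot names, the station table and the result format are invented and do not reflect the actual OVIS update language.

# Completing the semantics of a bare phrase using the question context.

STATIONS = {'Delft': 'station_delft'}     # toy table of internal station names

def interpret_answer(phrase, question_slot):
    # map a bare answer such as `Delft' onto the slot the system asked about
    word = phrase.strip()
    if word in STATIONS:
        return {question_slot: STATIONS[word]}
    return {}

# the system asked for the destination, so `Delft' fills that slot
# (and not, say, the departure slot)
print(interpret_answer('Delft', 'destination'))   # {'destination': 'station_delft'}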

The above problem can also be solved by designing the grammar in such a way that the language that is described is not merely the language of all correct and complete Dutch sentences, but the language of all relevant answers to questions in the application domain, in colloquial Dutch, including (partial) phrases.

More interesting is the treatment of cases where the input consists mostly of uninterpretable subsequences of words, with only a few clear phrases, for example

de naar een vanuit Amsterdam uur
the to a from Amsterdam o'clock

The only information discernible in this sentence is that the user wants to depart from Amsterdam. The remainder of the intentions of the user may not become clear from the utterance as the speech recognizer has processed it (we assume other paths in the word graph are equally uninterpretable). The OVIS system may however use the partial information obtained thus far and continue by requesting the desired time of departure, the place of destination, etc.

In all cases where paths in the word graph cannot be recognized as grammatical sentences, the score of those paths should be increased. This is done under the assumption that the more we change the structure of the sentences as they are delivered by the speech recognizer, the larger the chance that the semantics we eventually derive is inconsistent with the intentions of the user.

For example, in figure 7 the newly added edge has a score that is not merely the sum of the scores of the edges it was derived from: an extra penalty is added, because this is a derived edge, not an original one.

Ambiguous sentences

As we have remarked before, word graphs may contain several sentences that are grammatical, and some that are ungrammatical but allow easy correction. In practice, additional ambiguity is introduced by the grammar. The use of underspecified semantic representations (QLF) reduces such ambiguity only to a limited extent.

A typical example is PP-attachment:

Ik wil naar Hengelo reizen om zes uur
I want to travel to Hengelo at six o'clock

The phrase `om zes uur' may refer to `wil' (the desire to go to Hengelo is restricted to six o'clock) or to `reizen' (the travelling is to take place at six); even more interpretations are possible.

The consequence of such ambiguity is that several parses need to be constructed, only some of which will prove to be interpretable in later phases of analysis. In the above example, the first reading of the sentence will probably not fit in any pragmatic framework, and will therefore be rejected.

At the moment it is not clear whether such ambiguity can and should be resolved already during syntactic analysis. One conceptual objection to doing so is that it undermines the modular structure of the system: the syntactic module would have to be contaminated with pragmatic information, and would then no longer be applicable in other application domains where different pragmatic considerations prevail.

