In developing computational models embodying this idea, we have so far focused on one particularly straightforward instantiation of it: our algorithms analyse new input by considering the various ways in which this input might be generated by a stochastic process that combines fragments of trees from an annotated corpus of previously encountered utterances. Formally, these models may be viewed as implementing extremely redundant Stochastic Tree Substitution Grammars (STSGs); the grammar rules are the subtrees extracted from the annotated corpus [2].
An important parameter of models of this sort is the way in which the subtree-substitution probabilities in the stochastic process are computed from the subtree frequencies in the annotated corpus. All current models follow [2] in defining the probability of substituting a subtree t on a specific node as the probability of selecting t among all subtrees in the corpus that could be substituted on that node, i.e., as the number of occurrences of t divided by the total number of occurrences of subtrees t' with the same root node label as t:

$$P(t) = \frac{|t|}{\sum_{t':\, \mathrm{root}(t') = \mathrm{root}(t)} |t'|} \qquad (1)$$
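As a minimal sketch of this relative-frequency estimate, the following Python fragment computes equation (1) from raw subtree counts. The string encoding of subtrees and the `root_of` accessor are illustrative assumptions, not part of any DOP implementation.

```python
from collections import Counter

def substitution_probabilities(subtree_counts, root_of):
    """Relative-frequency estimate of equation (1): the probability of
    substituting subtree t is its count divided by the total count of
    all subtrees sharing t's root label.

    subtree_counts: Counter mapping subtree -> occurrences in the corpus
    root_of:        function mapping a subtree to its root node label
    """
    # Total occurrences per root label.
    root_totals = Counter()
    for t, n in subtree_counts.items():
        root_totals[root_of(t)] += n
    return {t: n / root_totals[root_of(t)] for t, n in subtree_counts.items()}

# Toy example: subtrees encoded as strings, root label = part before ':'.
counts = Counter({"S:(NP VP)": 3, "S:(NP VP PP)": 1, "NP:(Det N)": 4})
probs = substitution_probabilities(counts, root_of=lambda t: t.split(":")[0])
assert abs(probs["S:(NP VP)"] - 0.75) < 1e-9
```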
Given these subtree substitution probabilities, the probability of a derivation $d = t_1 \circ t_2 \circ \cdots \circ t_n$ can be computed as the product of the probabilities of the substitutions that it consists of:

$$P(d) = \prod_{i=1}^{n} P(t_i) \qquad (2)$$

The probability of a parse tree $T$ is then the sum of the probabilities of all derivations that generate it:

$$P(T) = \sum_{d\ \mathrm{derives}\ T} P(d) \qquad (3)$$
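Equations (2) and (3) translate directly into code. The sketch below assumes the `probs` mapping from the previous fragment and represents a derivation simply as a sequence of subtrees.

```python
from math import prod  # Python 3.8+

def derivation_probability(derivation, probs):
    """Equation (2): the probability of a derivation is the product of
    the probabilities of the substitutions it consists of."""
    return prod(probs[t] for t in derivation)

def parse_probability(derivations_of_parse, probs):
    """Equation (3): the probability of a parse tree is the sum of the
    probabilities of all distinct derivations that generate it."""
    return sum(derivation_probability(d, probs) for d in derivations_of_parse)
```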
An efficient polynomial algorithm for computing the Most Probable Derivation is given in Sima'an (1996a). From a theoretical point of view we might expect the computation of the Most Probable Parse to yield better disambiguation accuracy than the Most Probable Derivation, and this expectation has been confirmed experimentally. However, Sima'an (1996b) has shown that computing the Most Probable Parse is NP-hard, and thus not solvable by a deterministic polynomial-time algorithm (unless P = NP). For time-complexity reasons, the most probable derivation (MPD) therefore remains the method of choice for a real-world application.
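The difference between the two criteria is easy to see with a brute-force enumeration over a toy set of derivations. This sketch is exponential and purely illustrative, not Sima'an's polynomial MPD algorithm; the (parse_id, subtrees) encoding of derivations is assumed here for clarity.

```python
from collections import defaultdict
from math import prod

def mpd_vs_mpp(derivations, probs):
    """Contrast the two disambiguation criteria by exhaustive enumeration.

    derivations: list of (parse_id, subtrees) pairs; the same parse_id
    may be produced by several distinct derivations.
    Returns (parse of the Most Probable Derivation, Most Probable Parse).
    """
    score = lambda d: prod(probs[t] for t in d)
    # MPD: the single derivation with the highest probability.
    mpd_parse, _ = max(derivations, key=lambda pd: score(pd[1]))
    # MPP: the parse whose derivations have the highest summed probability.
    mass = defaultdict(float)
    for parse_id, d in derivations:
        mass[parse_id] += score(d)
    mpp_parse = max(mass, key=mass.get)
    return mpd_parse, mpp_parse
```

The two answers can disagree: a parse generated by many medium-probability derivations may accumulate more probability mass than the parse containing the single best derivation, which is why the MPP can be the better disambiguator in principle.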
The algorithm presented in Sima'an (1996a) is implemented in the ``Data Oriented Parsing and DIsambiguation System'' (DOPDIS). It extends a well-known CFG parsing algorithm, the CKY algorithm, to handle STSGs. The extension exploits the fact that the paths in a parse tree generated by an STSG derivation for a given sentence form a regular set that can easily be computed. By computing such sets, DOPDIS generates exactly those trees that the STSG dictates. On top of this mechanism, DOPDIS computes both the most probable derivation and the probability of an input sentence. The construction runs in time cubic in sentence length and linear in STSG size, a good result for parsing tree grammars (existing tree-parsing techniques have complexity quadratic in grammar size).
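For orientation, here is a standard Viterbi CKY chart parser for a binarized PCFG. This is only the CFG backbone that DOPDIS extends, shown under the assumption of a binarized rule set; it omits the regular path-set machinery that handles the STSG subtrees. The triple loop over spans is what gives the cubic behaviour in sentence length.

```python
import math
from collections import defaultdict

def viterbi_cky(words, lexical, binary):
    """Viterbi CKY for a binarized PCFG.

    lexical: dict word -> list of (nonterminal, prob)
    binary:  list of (parent, left_child, right_child, prob)
    Returns a chart mapping (start, end) -> {nonterminal: best log-prob}.
    """
    n = len(words)
    chart = defaultdict(dict)
    # Initialise single-word spans from the lexicon.
    for i, w in enumerate(words):
        for nt, p in lexical.get(w, []):
            chart[i, i + 1][nt] = math.log(p)
    # Fill longer spans bottom-up: cubic in n, linear in the rule set.
    for span in range(2, n + 1):
        for start in range(n - span + 1):
            end = start + span
            for mid in range(start + 1, end):
                for parent, left, right, p in binary:
                    if left in chart[start, mid] and right in chart[mid, end]:
                        s = (math.log(p) + chart[start, mid][left]
                             + chart[mid, end][right])
                        if s > chart[start, end].get(parent, -math.inf):
                            chart[start, end][parent] = s
    return chart
```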
The extension to parsing and disambiguating word graphs maintains the same time and space complexity (with the number of nodes, i.e. states, of the word graph taking the place of sentence length). DOPDIS computes the so-called Most Probable intersection-Derivation of a word graph and the DOP STSG. An intersection-derivation (i-derivation) is a pair $\langle W, d \rangle$, where $W$ is the string of words on a path in the input word graph and $d$ is an STSG derivation for that string. The probability of an i-derivation is computed through a weighted product of the probabilities on the path in the word graph and the STSG-derivation probability. The probabilities of the word-graph paths are obtained from the speech recogniser's likelihoods by a normalisation heuristic; the probabilities resulting from this heuristic combine better with the DOP probabilities than raw speech-recogniser likelihoods do. How to scale the speech recogniser's likelihoods in a well-founded way is still under study. The current method divides the likelihood of every transition by a normalisation factor that is computed from the word graph.
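The text does not spell out the normalisation factor or the weighting scheme, so the following Python sketch is only one plausible reading: each transition's likelihood is divided by the total likelihood leaving its source state, and the path and derivation probabilities are combined as a weighted geometric product (equivalently, a weighted sum of logs) with a hypothetical interpolation weight `alpha`.

```python
import math
from collections import defaultdict

def normalise_word_graph(transitions):
    """Hypothetical normalisation heuristic: divide each transition's
    likelihood by the total likelihood leaving the same state, so that
    the outgoing probabilities of every state sum to 1.

    transitions: list of (from_state, to_state, word, likelihood)
    """
    out_mass = defaultdict(float)
    for src, _, _, lik in transitions:
        out_mass[src] += lik
    return [(src, dst, w, lik / out_mass[src])
            for src, dst, w, lik in transitions]

def i_derivation_score(path_prob, derivation_prob, alpha=0.5):
    """Weighted product of word-graph path probability and STSG
    derivation probability, computed in the log domain; alpha is a
    hypothetical interpolation weight, not a value from the text."""
    return alpha * math.log(path_prob) + (1 - alpha) * math.log(derivation_prob)
```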