In developing computational models embodying this idea, we have so far focused on one particularly straightforward instantiation of it: our algorithms analyse new input by considering the various ways in which this input might be generated by a stochastic process that combines fragments of trees from an annotated corpus of previously encountered utterances. Formally, these models may be viewed as implementing extremely redundant Stochastic Tree Substitution Grammars (STSGs); the grammar rules are the subtrees extracted from the annotated corpus [2].
An important parameter of models of this sort is the way in which the subtree-substitution probabilities in the stochastic process are computed from the subtree frequencies in the annotated corpus. All current models follow [2] in defining the probability of substituting a subtree t on a specific node as the probability of selecting t among all subtrees in the corpus that could be substituted on that node, i.e., as the number of occurrences of t divided by the total number of occurrences of subtrees t' with the same root node label as t:

$$P(t) = \frac{|t|}{\sum_{t':\, \mathrm{root}(t') = \mathrm{root}(t)} |t'|} \qquad (1)$$
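As a minimal sketch of this relative-frequency estimate, the following Python fragment computes equation (1) from raw subtree counts. The string encoding of subtrees and the `root_of` accessor are illustrative assumptions, not part of any DOP implementation.

```python
from collections import Counter

def substitution_probabilities(subtree_counts, root_of):
    """Relative-frequency estimate of equation (1): the probability of
    substituting subtree t is its count divided by the total count of
    all subtrees sharing t's root label.

    subtree_counts: Counter mapping subtree -> occurrences in the corpus
    root_of:        function mapping a subtree to its root node label
    """
    # Total occurrences per root label.
    root_totals = Counter()
    for t, n in subtree_counts.items():
        root_totals[root_of(t)] += n
    return {t: n / root_totals[root_of(t)] for t, n in subtree_counts.items()}

# Toy example: subtrees encoded as strings, root label = part before ':'.
counts = Counter({"S:(NP VP)": 3, "S:(NP VP PP)": 1, "NP:(Det N)": 4})
probs = substitution_probabilities(counts, root_of=lambda t: t.split(":")[0])
assert abs(probs["S:(NP VP)"] - 0.75) < 1e-9
```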
Given these subtree substitution probabilities, the probability of a derivation $d = t_1 \circ t_2 \circ \cdots \circ t_n$ can be computed as the product of the probabilities of the substitutions that it consists of:

$$P(d) = \prod_{i=1}^{n} P(t_i) \qquad (2)$$

The probability of a parse tree $T$ is then the sum of the probabilities of all derivations that generate it:

$$P(T) = \sum_{d\ \mathrm{derives}\ T} P(d) \qquad (3)$$
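Equations (2) and (3) translate directly into code. The sketch below assumes the `probs` mapping from the previous fragment and represents a derivation simply as a sequence of subtrees.

```python
from math import prod  # Python 3.8+

def derivation_probability(derivation, probs):
    """Equation (2): the probability of a derivation is the product of
    the probabilities of the substitutions it consists of."""
    return prod(probs[t] for t in derivation)

def parse_probability(derivations_of_parse, probs):
    """Equation (3): the probability of a parse tree is the sum of the
    probabilities of all distinct derivations that generate it."""
    return sum(derivation_probability(d, probs) for d in derivations_of_parse)
```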
An efficient polynomial algorithm for computing the Most Probable Derivation is given in Sima'an (1996a). From a theoretical point of view we might expect the computation of the Most Probable Parse to yield better disambiguation accuracy than the Most Probable Derivation, and this expectation has been confirmed experimentally. However, Sima'an (1996b) has shown that computing the Most Probable Parse is NP-hard, and thus not solvable by a deterministic polynomial-time algorithm (unless P = NP). For time-complexity reasons, the most probable derivation (MPD) therefore remains the method of choice for a real-world application.
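The difference between the two criteria is easy to see with a brute-force enumeration over a toy set of derivations. This sketch is exponential and purely illustrative, not Sima'an's polynomial MPD algorithm; the (parse_id, subtrees) encoding of derivations is assumed here for clarity.

```python
from collections import defaultdict
from math import prod

def mpd_vs_mpp(derivations, probs):
    """Contrast the two disambiguation criteria by exhaustive enumeration.

    derivations: list of (parse_id, subtrees) pairs; the same parse_id
    may be produced by several distinct derivations.
    Returns (parse of the Most Probable Derivation, Most Probable Parse).
    """
    score = lambda d: prod(probs[t] for t in d)
    # MPD: the single derivation with the highest probability.
    mpd_parse, _ = max(derivations, key=lambda pd: score(pd[1]))
    # MPP: the parse whose derivations have the highest summed probability.
    mass = defaultdict(float)
    for parse_id, d in derivations:
        mass[parse_id] += score(d)
    mpp_parse = max(mass, key=mass.get)
    return mpd_parse, mpp_parse
```

The two answers can disagree: a parse generated by many medium-probability derivations may accumulate more probability mass than the parse containing the single best derivation, which is why the MPP can be the better disambiguator in principle.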
The algorithm presented in Sima'an (1996a) is implemented in the ``Data Oriented Parsing and DIsambiguation System'' (DOPDIS). It extends a well-known CFG parsing algorithm, the CKY algorithm, to handle STSGs. The extension exploits the fact that the paths in a parse tree generated by an STSG derivation for a given sentence form a regular set that can easily be computed. By computing such sets, DOPDIS generates exactly those trees that the STSG dictates. On top of this mechanism, DOPDIS computes both the most probable derivation and the probability of an input sentence. The construction runs in time cubic in sentence length and linear in STSG size, a good result for parsing tree grammars (existing tree-parsing techniques have complexity quadratic in grammar size).
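For orientation, here is a standard Viterbi CKY chart parser for a binarized PCFG. This is only the CFG backbone that DOPDIS extends, shown under the assumption of a binarized rule set; it omits the regular path-set machinery that handles the STSG subtrees. The triple loop over spans is what gives the cubic behaviour in sentence length.

```python
import math
from collections import defaultdict

def viterbi_cky(words, lexical, binary):
    """Viterbi CKY for a binarized PCFG.

    lexical: dict word -> list of (nonterminal, prob)
    binary:  list of (parent, left_child, right_child, prob)
    Returns a chart mapping (start, end) -> {nonterminal: best log-prob}.
    """
    n = len(words)
    chart = defaultdict(dict)
    # Initialise single-word spans from the lexicon.
    for i, w in enumerate(words):
        for nt, p in lexical.get(w, []):
            chart[i, i + 1][nt] = math.log(p)
    # Fill longer spans bottom-up: cubic in n, linear in the rule set.
    for span in range(2, n + 1):
        for start in range(n - span + 1):
            end = start + span
            for mid in range(start + 1, end):
                for parent, left, right, p in binary:
                    if left in chart[start, mid] and right in chart[mid, end]:
                        s = (math.log(p) + chart[start, mid][left]
                             + chart[mid, end][right])
                        if s > chart[start, end].get(parent, -math.inf):
                            chart[start, end][parent] = s
    return chart
```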
The extension to parsing and disambiguating word graphs maintains the same time and space complexity (with the number of nodes, i.e. states, of the word graph taking the place of sentence length). DOPDIS computes the so-called Most Probable intersection-Derivation of a word graph and the DOP STSG. An intersection-derivation (i-derivation) is a pair $\langle W, d \rangle$, where $W$ is the string of words on a path in the input word graph and $d$ is an STSG derivation for that string. The probability of an i-derivation is computed through a weighted product of the probabilities on the path in the word graph and the STSG-derivation probability. The probabilities of the word-graph paths are obtained from the speech recogniser's likelihoods by a normalisation heuristic; the probabilities resulting from this heuristic combine better with the DOP probabilities than raw speech-recogniser likelihoods do. How to scale the speech recogniser's likelihoods in a well-founded way is still under study. The current method divides the likelihood of every transition by a normalisation factor that is computed from the word graph.
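The text does not spell out the normalisation factor or the weighting scheme, so the following Python sketch is only one plausible reading: each transition's likelihood is divided by the total likelihood leaving its source state, and the path and derivation probabilities are combined as a weighted geometric product (equivalently, a weighted sum of logs) with a hypothetical interpolation weight `alpha`.

```python
import math
from collections import defaultdict

def normalise_word_graph(transitions):
    """Hypothetical normalisation heuristic: divide each transition's
    likelihood by the total likelihood leaving the same state, so that
    the outgoing probabilities of every state sum to 1.

    transitions: list of (from_state, to_state, word, likelihood)
    """
    out_mass = defaultdict(float)
    for src, _, _, lik in transitions:
        out_mass[src] += lik
    return [(src, dst, w, lik / out_mass[src])
            for src, dst, w, lik in transitions]

def i_derivation_score(path_prob, derivation_prob, alpha=0.5):
    """Weighted product of word-graph path probability and STSG
    derivation probability, computed in the log domain; alpha is a
    hypothetical interpolation weight, not a value from the text."""
    return alpha * math.log(path_prob) + (1 - alpha) * math.log(derivation_prob)
```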