More specifically, I will assume that grammar rules are represented by the predicate headed_rule/4, in which the first argument is the head of the rule, the second argument is the mother node of the rule, the third argument is the list of daughters to the left of the head (in reverse order), and the fourth argument is the list of daughters to the right of the head.
As an example, a DCG rule such as

(1)     s(S) --> np(Subj), vp(Subj,S).

in which the vp daughter is taken to be the head, is represented as:

(2)     headed_rule(vp(Subj,S), s(S), [np(Subj)], []).

Here the list of daughters left of the head contains just np(Subj), and the list of daughters right of the head is empty.
Furthermore, I will assume that lexical lookup has already been performed by another module. This module has asserted clauses for the predicate lexical_analysis/3, where the first two arguments are the string positions and the third argument is the (lexical) category. For the input sentence `Time flies like an arrow' this module may produce, for instance, the following set of clauses (with simple atomic categories for the sake of illustration):

(3)     lexical_analysis(0,1,noun).     lexical_analysis(0,1,verb).      % time
(4)     lexical_analysis(1,2,noun).     lexical_analysis(1,2,verb).      % flies
(5)     lexical_analysis(2,3,preposition).                               % like
(6)     lexical_analysis(2,3,verb).                                      % like
(7)     lexical_analysis(3,4,determiner).                                % an
(8)     lexical_analysis(4,5,noun).                                      % arrow
Note that nothing would prevent us from memoizing other predicates as well. Experience suggests, however, that the cost of maintaining tables for e.g. the head_corner relation is (much) higher than the associated benefit. Memoizing only the parse/5 goals implies that the memory requirements of the head-corner parser, in terms of the number of items recorded, are much smaller than those of ordinary chart parsers. Not only do we refrain from asserting so-called active items, but we also refrain from asserting inactive items for non-maximal projections of heads. In practice the difference in space requirements is enormous (two orders of magnitude). This difference may in turn be a significant reason for the practical efficiency of the head-corner parser.
Depending on the properties of a particular grammar, it may for example be worthwhile to restrict a given category to its syntactic features before we attempt to solve the parse goal for that category. Shieber's restriction operator [6] can be used here. Thus we essentially throw some information away before an attempt is made to solve a (memoized) goal. For example, a category such as

(9)     np(singular,third,Case,SemRepresentation)

may be weakened, by replacing its (possibly very large) semantic representation with a fresh variable, to:

(10)    np(singular,third,Case,_)
Note that goal weakening is sound: an answer a to the weakening of a goal g is only considered an answer for g itself if a and g unify. Also note that goal weakening is complete, in the sense that for every answer a to a goal g there is always an answer a' to the weakening of g such that a' subsumes a.
For practical implementations the use of goal weakening can be extremely important. It is my experience that a well-chosen goal weakening operator may reduce parsing times by an order of magnitude.
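The weakening operation and the accompanying soundness filter can be sketched as follows. This is an illustrative Python rendering, not the Prolog implementation: categories are nested tuples, None plays the role of an anonymous Prolog variable, a one-sided match stands in for full unification, and the example category and depth bound are invented.

```python
# Illustrative sketch of goal weakening by depth restriction.

def weaken(cat, depth):
    """Replace every complex subterm below the given depth by a fresh
    variable (None); atomic values are kept."""
    if isinstance(cat, tuple):
        if depth == 0:
            return None
        return (cat[0],) + tuple(weaken(a, depth - 1) for a in cat[1:])
    return cat

def matches(goal, answer):
    """One-sided match: does the (ground) answer instantiate the goal?
    A stand-in for the unification used by the real soundness check."""
    if goal is None:
        return True
    if isinstance(goal, tuple):
        return (isinstance(answer, tuple) and len(goal) == len(answer)
                and all(matches(g, a) for g, a in zip(goal, answer)))
    return goal == answer

goal = ('np', 'sg', ('sem', ('loves', 'john', 'mary')))
weak = weaken(goal, 1)          # ('np', 'sg', None): semantics thrown away

# Both candidates answer the weakened goal, but only the first one is
# compatible with the original goal and survives the soundness filter.
candidates = [
    ('np', 'sg', ('sem', ('loves', 'john', 'mary'))),
    ('np', 'sg', ('sem', ('sleeps', 'john'))),
]
sound = [a for a in candidates if matches(goal, a)]
```

The filter step is exactly the soundness condition above: an answer to the weakened goal counts for the original goal only if the two are compatible.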
In this system the input for the parser is not a simple list of words, but rather a word-graph: a directed, acyclic graph in which the states are points in time, and the edges are labelled with word hypotheses and their corresponding probabilities. Thus, such word-graphs are acyclic weighted finite-state automata.
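As a concrete (and invented) example of this representation, a word-graph can be stored as a set of scored transitions between integer states. The sketch below, in Python rather than Prolog, also shows that acyclicity, with every transition going from a lower to a higher state, lets one compute the best path score in a single left-to-right sweep.

```python
# A toy word-graph: transitions (from_state, word, to_state, probability).
# States are points in time; the graph is acyclic with from_state < to_state.
TRANS = [
    (0, 'time',  1, 0.7),
    (0, 'thyme', 1, 0.3),    # competing hypothesis over the same time span
    (1, 'flies', 2, 0.9),
    (1, 'flees', 2, 0.1),
    (2, 'like',  3, 1.0),
]

def best_path_score(trans, start, final):
    """Probability of the best path from start to final.
    Processing transitions by increasing source state is a topological
    order, because every transition satisfies from_state < to_state."""
    best = {start: 1.0}
    for p0, _word, p, score in sorted(trans):
        if p0 in best:
            candidate = best[p0] * score
            if candidate > best.get(p, 0.0):
                best[p] = candidate
    return best.get(final, 0.0)
```

Here the best hypothesis sequence is "time flies like", with score 0.7 * 0.9 * 1.0.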
In certain approaches to ill-formed input processing the desire to generalize from input strings to input finite-state automata is also clearly present. For example, in [3] a framework for ill-formed input processing is described in which certain common errors are modelled as (weighted) finite-state transducers. The composition of an input sentence with these transducers produces a (weighted) finite-state automaton which is then input for the parser.
The generalization from strings to weighted automata introduces essentially two complications. Firstly, we can no longer use string indices. Secondly, we need to keep track of the probabilities of the words used in a given derivation.
Parsing on the basis of a finite-state automaton can be seen as the computation of the intersection of that automaton with the grammar. It can be shown that if the definite-clause grammar is off-line parsable, and if the finite-state automaton is acyclic, then this computation is guaranteed to terminate [7]. Moreover, existing techniques for parsing based on strings generalize easily: we simply use the names of states in the automaton instead of the usual string indices.
In the head-corner parser, this leads to a change in the definition of the predicate between/4. Rather than a simple integer comparison, we now need to check that a derivation from P0 to P can be extended to a derivation from E0 to E by checking that there are paths in the word-graph from E0 to P0 and from P to E.
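This path condition can be rendered in Python as follows; the example graph is invented, and the actual, memoized Prolog version is the one given below.

```python
# Illustrative rendering of the between/4 check over a word-graph.
# EDGES holds pairs (from_state, to_state); states are integers and every
# transition goes from a lower to a higher state (so the graph is acyclic).
from functools import lru_cache

EDGES = {(0, 1), (1, 2), (0, 2), (2, 3)}     # invented example word-graph

@lru_cache(maxsize=None)                     # memoize reachability queries
def reachable(p0, p):
    """Is there a (possibly empty) path from state p0 to state p?"""
    if p0 == p:
        return True
    if p0 > p:                               # acyclicity: edges only go upward
        return False
    return any(reachable(q, p) for (r, q) in EDGES if r == p0)

def between(e0, e, p0, p):
    """Can a derivation from p0 to p be extended to one from e0 to e?
    Yes iff the word-graph connects e0 to p0 and p to e."""
    return reachable(e0, p0) and reachable(p, e)
```

The lru_cache decorator plays the role of the memoization: each reachability query is computed at most once.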
The predicate between/4 is implemented using memoization as follows. It is assumed that state names are integers; to rule out cyclic word-graphs we also require that for each transition from P0 to P it holds that P0 < P. Transitions in the word-graph are represented by clauses of the form wordgraph:trans(P0,Sym,P,Score), indicating that there is a transition from state P0 to state P labelled with symbol Sym and probability Score.
(11)    between(E0,E,P0,P) :-
            connection(E0,P0),
            connection(P,E).

        %% connection(P0,P): there is a path in the word-graph from
        %% state P0 to state P; solved instances can be tabled by
        %% asserting them (the tabling is omitted here for readability).
        connection(P,P).
        connection(P0,P) :-
            P0 < P,
            wordgraph:trans(P0,_,P1,_),
            connection(P1,P).