Top-down generators and left recursion

Consider the `naive' top-down generation algorithm defined in chapter 2 and repeated here as figure 3.6.

**Figure 3.6:** Meta interpreter for $\cal {R}$ ( $\cal {L}$ )-grammars
$\begin{figure} \prn\pred \head{\mbox{\it refutation}(\mbox{\rm Goal}) \mbox{\tt ... ...\rm H}),} \body{ \mbox{\it refutations}(\mbox{\rm T}).} \epred\eprn \end{figure}$

$\pr \pred \head{\prologq \mbox{\it refutation}( \avm{ \mbox{\it cat}: \mbox{\rm ... ...\rm omdat(slaapt(arie))}\\ \mbox{\it v2}: \mbox{\rm no\_v2} }). } \epred\epr$
The task is to generate a saturated complementizer phrase. The binary complementizer rule 7 is the only candidate and may be applied. After the proper generation of the first daughter of this rule the generator attempts to generate a saturated verb phrase with semantics slaapt(arie). For this embedded generation task the verb phrase rule 1 is the appropriate candidate. The first daughter of this rule, however, is not sufficiently instantiated to guide the search any further: the algorithm may for example apply the same rule 1 for this node, and will go into an infinite loop because each time it chooses rule 1 for the first daughter of rule 1.

Therefore, the order in which nonterminals are expanded is very important, as was noticed by [17] and [108]. If, in the foregoing example, the generator first tries to generate the second daughter, then it may be possible to generate the first daughter without problems afterwards. In the approach of [17], the order of generation is defined by the rule-writer by annotating DCG rules. In the approach of [108], this ordering is achieved automatically (and dynamically), essentially by a version of the goal-freezing technique [13]. Put simply, the generator only generates a node if its feature structure is instantiated; otherwise the generator will try to generate other nodes first.

A specific version of this approach, is an approach where one of the daughters has a privileged status. This daughter, that we might call the `head' of the rule, will always be generated first. I will assume that for all rules the first daughter of the list of daughters represents the head. Note that this does not imply that this daughter is the leftmost daughter; the information about the left-to-right order of daughters is represented by the difference-list representation of strings in each node. Moreover note that this particular representation may be the result of some compilation step from a representation that is more convenient for the rule writer. For example, rule 1 was represented schematically as

$\begin{displaymath} \mbox{\it sign}(\mbox{\rm X}_{0}) \mbox{\tt :-} \mbox{\it sign}(\mbox{\rm X}_{1}), \mbox{\it sign}(\mbox{\rm X}_{2}), \phi . \end{displaymath}$
If we assume that the second daughter is the head, then this clause is simply written:

$\begin{displaymath} \mbox{\it sign}(\mbox{\rm X}_{0}) \mbox{\tt :-} \mbox{\it sign}(\mbox{\rm X}_{2}), \mbox{\it sign}(\mbox{\rm X}_{1}), \phi. \end{displaymath}$
Now without any modification to the original definition of refutations this change will imply that heads are generated first. Assume furthermore that the head of a rule is that daughter which shares its semantics with the mother node. In all (non-unit) rules of the Dutch example grammar it is possible to choose such a daughter.

The resulting simple top-down generation algorithm is equivalent to the approaches of [17] and [108] with respect to one major problem: the left-recursion problem. Consider what happens if we try to generate a sentence for the same semantic structure 17 as before, but this time assuming that heads are generated first. Again, the generator comes to the point where it tries to generate a saturated verb phrase with semantics slaapt(arie). As before the generator selects the binary verb phrase rule 1. This time the generator does not try to generate the argument of this rule, but immediately starts to generate its head. The resulting feature structure now looks as follows:

$\begin{displaymath}\avm{ \mbox{\it cat}: \mbox{\rm vp} \\ \mbox{\it sc}: \lang... ...\mbox{\rm slaapt(arie)} \\ \mbox{\it v2}: \mbox{\rm no\_v2} }\end{displaymath}$
where Obj₁ is shared with the argument that still needs to be generated after the generation of the head. This feature structure can be the input for the verb phrase rule 1 again, of which the head will then be instantiated into:

$\begin{displaymath}\avm{ \mbox{\it cat}: \mbox{\rm vp} \\ \mbox{\it sc}: \lang... ...\mbox{\rm slaapt(arie)} \\ \mbox{\it v2}: \mbox{\rm no\_v2} }\end{displaymath}$
Each time the same rule can be applied (predicting longer and longer subcategorization lists), resulting in non-termination, because of left recursion. It should be clear that this non-termination problem is not necessarily caused by the leftmost daughter of a rule; the problem is caused by the node that is generated first. Thus the English equivalent of the verb phrase rule where the order of the head and the argument is switched presents exactly the same problem.

In this particular case, the problem arises because there is no limit to the size of the subcategorization list. Although one might propose an ad hoc upper bound on the length of the subcategorization list for lexical entries, even this expedient may be insufficient. In analyses of Dutch cross-serial verb constructions [20,34] subcategorization lists such as these may be appended by syntactic rules [59,90,68,103] resulting in indefinitely long lists. Consider the Dutch sentence

$\begin{exam} \begin{flushleft} dat [Jan [Arie [Bob [de muizen [zag helpen loslat... ...e\\ {\it that Jan saw Arie help Bob release the mice} \end{flushleft}\end{exam}$
The string of verbs is analyzed by appending their subcategorization lists, as is illustrated in figure 3.7.

**Figure 3.7:** Analysis of cross-serial dependency, in which subcategorization lists of verbs are appended. The tree represents the verb-cluster of the sentence `dat Jan Arie Bob de muizen zag helpen loslaten.
$\begin{figure} \begin{center} \leavevmode \unitlength1pt \beginpicture \setplot... ...slaten}{release}}} [Bl] at 118.75 18.00 \endpicture \end{center}\par\end{figure}$

In summary, top-down generation algorithms, even if controlled by the instantiation status of goals, can fail to terminate on certain grammars. The case given above, reminiscent of analyses from HPSG, as well as analyses from categorial unification grammar are examples in which the well-foundedness of the generation process resides in lexical information unavailable to top-down regimes. The conclusion of this section therefore is that top-down generators have problems with some linguistically motivated left recursive analyses. The source of the problem is that information available in lexical entries is not used sufficiently to guide the search.