Next: The intersection of a Up: The intersection of Finite Previous: Introduction

The intersection of a CFG and a FSA

The calculation of the intersection of a CFG and a FSA is very simple [1]. The (context-free) grammar defining this intersection is simply constructed by keeping track of the state names in the non-terminal category symbols. For each rule X₀ $\rightarrow$ X₁...X_n there are rules $\langle$ X₀q₀q $\rangle$ $\rightarrow$ $\langle$ X₁q₀q₁ $\rangle$ $\langle$ X₂q₁q₂ $\rangle$ ... $\langle$ X_nq_{n - 1}q $\rangle$ , for all q₀...q_n. Furthermore for each transition $\delta$ (q_i, $\sigma$ ) = q_k we have a rule $\langle$ $\sigma$ q_iq_k $\rangle$ $\rightarrow$ $\sigma$ . Thus the intersection of a FSA and a CFG is a CFG that exactly derives all parse-trees. Such a grammar might be called the parse-forest grammar.

Although this construction shows that the intersection of a FSA and a CFG is itself a CFG, it is not of practical interest. The reason is that this construction typically yields an enormous amount of rules that are `useless'. In fact the (possibly enormously large) parse forest grammar might define an empty language (if the intersection was empty). Luckily `ordinary' recognizers/parsers for CFG can be easily generalized to construct this intersection yielding (in typical cases) a much smaller grammar. Checking whether the intersection is empty or not is then usually very simple as well: only in the latter case will the parser terminate succesfully.

To illustrate how a parser can be generalized to accept a FSA as input we present a simple top-down parser.

A context-free grammar is represented as a definite-clause specification as follows. We do not wish to define the sets of terminal and non-terminal symbols explicitly, these can be understood from the rules that are defined using the relation rule/2, and where symbols of the rhs are prefixed with `-' in the case of terminals and `+' in the case of non-terminals. The relation top/1 defines the start symbol. The language L' = aⁿbⁿ is defined as:

top(s).

rule(s,[-a,+s,-b]).  rule(s,[]).

In order to illustrate how ordinary parsers can be used to compute the intersection of a FSA and a CFG consider first the definite-clause specification of a top-down parser. This parser runs in polynomial time if implemented using Earley deduction or XOLDT resolution [14]. It is assumed that the input string is represented by the trans/3 predicate.

parse(P0,P) :- 
    top(Cat), parse(+Cat,P0,P).

parse(-Cat,P0,P) :- 
    trans(P0,Cat,P), 
    side_effect(p(Cat,P0,P) --> Cat).
parse(+Cat,P0,P) :- 
    rule(Cat,Ds), 
    parse_ds(Ds,P0,P,His),
    side_effect(p(Cat,P0,P) --> His).

parse_ds([],P,P,[]).  
parse_ds([H|T],P0,P,[p(H,P0,P1)|His]) :-
    parse(H,P0,P1), 
    parse_ds(T,P1,P,His).

The predicate side_effect is used to construct the parse forest grammar. The predicate always succeeds, and as a side-effect asserts that its argument is a rule of the parse forest grammar. For the sentence `a a b b' we obtain the parse forest grammar:

p(s,2,2) --> [].  
p(s,1,3) --> 
    [p(-a,1,2),p(+s,2,2),p(-b,2,3)].
p(s,0,4) --> 
    [p(-a,0,1),p(+s,1,3),p(-b,3,4)].  
p(a,1,2) --> a.  
p(a,0,1) --> a.  
p(b,2,3) --> b.  
p(b,3,4) --> b.

The reader easily verifies that indeed this grammar generates (a isomorphism of) the single parse tree of this example, assuming of course that the start symbol for this parse-forest grammar is p(s,0,4). In the parse-forest grammar, complex symbols are non-terminals, atomic symbols are terminals.

**Figure 1:** A parse-tree extracted from the parse forest grammar
$\begin{figure} \begin{center} \leavevmode \unitlength1pt \beginpicture \setplot... ...3 102.00 \put{\hbox{}} [Bl] at 0.00 165.00 \endpicture \end{center}\end{figure}$

Next consider the definite clause specification of a FSA. We define the transition relation using the relation trans/3. For start states, the relation start/1 should hold, and for final states the relation final/1 should hold. Thus the following FSA, defining the regular language L = (aa)^*b⁺ (i.e. an even number of a's followed by at least one b) is given as:

$\begin{pspicture}(-1,-1)(5.0,4.0) \psset{unit=0.025cm}\rput(0,60.0){\circlenode[... ...e{x}{a}} \nccircle[angle=-90]{<-}{2}{0.5cm} \trput{\rnode{x}{b}} \end{pspicture}$

Interestingly, nothing needs to be changed to use the same parser for the computation of the intersection of a FSA and a CFG. If our input `sentence' now is the definition of trans/3 as given above, we obtain the following parse forest grammar (where the start symbol is p(s,q0,q2)):

p(s,q0,q0) --> []. 
p(s,q1,q1) --> [].  
p(s,q1,q2) --> 
    [p(-a,q1,q0),p(+s,q0,q0),p(-b,q0,q2)].  
p(s,q0,q2) --> 
    [p(-a,q0,q1),p(+s,q1,q2),p(-b,q2,q2)].  
p(s,q1,q2) --> 
    [p(-a,q1,q0),p(+s,q0,q2),p(-b,q2,q2)].  
p(a,q0,q1) --> a.  
p(a,q1,q0) --> a.  
p(b,q0,q2) --> b.  
p(b,q2,q2) --> b.

Thus, even though we now use the same parser for an infinite set of input sentences (represented by the FSA) the parser still is able to come up with a parse forest grammar. A possible derivation for this grammar constructs the following (abbreviated) parse tree in figure 1. Note that the construction of Bar Hillel would have yielded a grammar with 88 rules.

Next: The intersection of a Up: The intersection of Finite Previous: Introduction

Noord G.J.M. van
1998-09-29