Next: Conclusions Up: Transducers from Rewrite Rules Previous: Implementation

Longest Match Capturing

As discussed in §1 the POSIX standard requires that multiple captures follow a longest match strategy. For multiple captures as in (3), one establishes first a longest match for domain(T₁)^....^.domain(T_n). Then we ensure that each of domain(T_i) in turn is required to match as long as possible, with each one having priority over its rightward neighbors. To implement this, we define a macro lm_concat(Ts) and use it as:

replace(lm_concat(Ts),Left,Right)

**Figure 2:** Definition of `lm_concat` operator.
$\begin{figure}\begin{tex2html_preform}\begin{verbatim}macro(lm_concat(Ts),mark_... ..._list(R,(SoFar o F),Composed).\end{verbatim}\end{tex2html_preform} \end{figure}$

Ensuring the longest overall match is delegated to the replace macro, so lm_concat(Ts) needs only ensure that each individual transducer within Ts gets its proper left-to-right longest matching priority. This problem is mostly solved by the same techniques used to ensure the longest match within the replace macro. The only complication here is that Ts can be of unbounded length. So it is not possible to have a single expression in the finite state calculus that applies to all possible lenghts. This means that we need something a little more powerful than mere macro expansion to construct the proper finite state calculus expression. The FSA Utilities provides a Prolog hook for this purpose. The resulting definition of lm_concat is given in figure 2.

Suppose (as in [3]), we want to match the following list of recognizers against the string topological and insert a marker in each boundary position. This reduces to applying:

lm_concat([
   [{[t,o],[t,o,p]},[]: '#'],
   [{o,[p,o,l,o]},[]: '#'],
   {[g,i,c,a,l],[o^,l,o,g,i,c,a,l]}
          ])

This expression transduces the string topological only to the string top#o#logical.⁵

Next: Conclusions Up: Transducers from Rewrite Rules Previous: Implementation

Noord G.J.M. van
1999-04-15