Next: The Algorithm Up: Transducers from Rewrite Rules Previous: Transducers from Rewrite Rules

Introduction

Context sensitive rewrite rules have been widely used in several areas of natural language processing. Johnson [4] has shown that such rewrite rules are equivalent to finite state transducers in the special case that they are not allowed to rewrite their own output. An algorithm for compilation into transducers was provided by [5]. Improvements and extensions to this algorithm have been provided by [7], [9], [8] and [12]. In this paper, the algorithm will be extended to provide a limited form of backreferencing. Backreferencing has been implicit in previous research, such as in the ``batch rules'' of [5], bracketing transducers for finite-state parsing [8], and the ``LocalExtension'' operation of [13]. The explicit use of backreferencing leads to more elegant and general solutions.

Backreferencing is widely used in editors, scripting languages and other tools employing regular expressions [3]. For example, Emacs uses the special brackets $ and $ to capture strings along with the notation \n to recall the nth such string. The expression \(a*\)b\1 matches strings of the form aⁿbaⁿ. Unrestricted use of backreferencing thus can introduce non-regular languages. For NLP finite state calculi [6,16] this is unacceptable. The form of backreferences introduced in this paper will therefore be restricted.

The central case of an allowable backreference is:

$\begin{displaymath} x\Rightarrow T(x)/\lambda\mbox{\_\_}\rho \end{displaymath}$ (1)

This says that each string x preceded by $\lambda$ and followed by $\rho$ is replaced by T(x), where $\lambda$ and $\rho$ are arbitrary regular expressions, and T is a transducer.¹ This contrasts sharply with the rewriting rules that follow the tradition of Kaplan & Kay:

$\begin{displaymath} \phi\Rightarrow \psi/\lambda\mbox{\_\_}\rho \end{displaymath}$ (2)

In this case, any string from the language $\phi$ is replaced by any string independently chosen from the language $\psi$ .

We also allow multiple (non-permuting) backreferences of the form:

$\begin{displaymath} x_1 x_2\ldots x_n \Rightarrow T_1(x_1) T_2(x_2) \ldots T_n(x_n)/\lambda\mbox{\_\_}\rho \end{displaymath}$ (3)

Since transducers are closed under concatenation, handling multiple backreferences reduces to the problem of handling a single backreference:

$\begin{displaymath}x\Rightarrow (T_1 \cdot T_2 \cdot \ldots \cdot T_n)(x)/\lambda\mbox{\_\_}\rho \end{displaymath}$ (4)

A problem arises if we want capturing to follow the POSIX standard requiring a longest-capture strategy. Friedl [3] (p. 117), for example, discusses matching the regular expression (to|top)(o|polo)?(gical|o?logical) against the word: topological. The desired result is that (once an overall match is established) the first set of parentheses should capture the longest string possible (top); the second set should then match the longest string possible from what's left (o), and so on. Such a left-most longest match concatenation operation is described in §3.

In the following section, we initially concentrate on the simple case in (1) and show how (1) may be compiled assuming left-to-right processing along with the overall longest match strategy described by [8].

The major components of the algorithm are not new, but straightforward modifications of components presented in [8] and [12]. We improve upon existing approaches because we solve a problem concerning the use of special marker symbols (§2.1.2). A further contribution is that all steps are implemented in a freely available system, the FSA Utilities of [16] (§2.1.1).

Next: The Algorithm Up: Transducers from Rewrite Rules Previous: Transducers from Rewrite Rules

Noord G.J.M. van
1999-04-15