Context sensitive rewrite rules have been widely used in several areas of natural language processing. Johnson [4] has shown that such rewrite rules are equivalent to finite state transducers in the special case that they are not allowed to rewrite their own output. An algorithm for compilation into transducers was provided by [5]. Improvements and extensions to this algorithm have been provided by [7], [9], [8] and [12]. In this paper, the algorithm will be extended to provide a limited form of backreferencing. Backreferencing has been implicit in previous research, such as in the ``batch rules'' of [5], bracketing transducers for finite-state parsing [8], and the ``LocalExtension'' operation of [13]. The explicit use of backreferencing leads to more elegant and general solutions.
Backreferencing is widely used in editors, scripting languages and
other tools employing regular expressions [3]. For
example, Emacs uses the special brackets \(
and \)
to
capture strings along with the notation \
n to recall the
nth such string. The expression \
(a*\
)b\
1
matches strings of the form anban. Unrestricted use of
backreferencing thus can introduce non-regular languages. For NLP
finite state calculi [6,16] this is
unacceptable. The form of backreferences introduced in this paper
will therefore be restricted.
The central case of an allowable backreference is:
![]() |
(1) |
This says that each string x preceded by
and
followed by
is replaced by T(x), where
and
are arbitrary regular expressions, and T is a
transducer.1 This contrasts sharply with the rewriting
rules that follow the tradition of Kaplan & Kay:
![]() |
(2) |
In this case, any string from the language
is replaced
by any string independently chosen from the language
.
We also allow multiple (non-permuting) backreferences of the form:
![]() |
(3) |
Since transducers are closed under concatenation, handling multiple backreferences reduces to the problem of handling a single backreference:
![]() |
(4) |
A problem arises if we want capturing to follow the POSIX standard requiring a longest-capture strategy. Friedl [3] (p. 117), for example, discusses matching the regular expression (to|top)(o|polo)?(gical|o?logical) against the word: topological. The desired result is that (once an overall match is established) the first set of parentheses should capture the longest string possible (top); the second set should then match the longest string possible from what's left (o), and so on. Such a left-most longest match concatenation operation is described in §3.
In the following section, we initially concentrate on the simple case in (1) and show how (1) may be compiled assuming left-to-right processing along with the overall longest match strategy described by [8].
The major components of the algorithm are not new, but straightforward modifications of components presented in [8] and [12]. We improve upon existing approaches because we solve a problem concerning the use of special marker symbols (§2.1.2). A further contribution is that all steps are implemented in a freely available system, the FSA Utilities of [16] (§2.1.1).