The algorithm presented here extends previous algorithms for rewrite rules by adding a limited version of backreferencing. This allows the output of rewriting to depend on the form of the strings that are rewritten. This new feature brings techniques used in Perl-like languages into the finite state calculus. Such an integration is needed in practical applications where simple text processing must be combined with more sophisticated computational linguistics techniques.
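For readers who know the Perl-like languages better than the finite state calculus, the following sketch (in Python, whose re module follows the Perl conventions; the strings are invented for illustration) shows the kind of backreferencing meant here: the output copies whatever string the pattern happened to match.

    import re

    # The backreference \1 copies the string matched by the group, so the
    # output depends on the form of the matched string: "abc" is rewritten
    # as "abc abc", "xy" as "xy xy". An ordinary rewrite rule, which maps a
    # regular language to a fixed replacement, cannot express this copying.
    print(re.sub(r"(\w+)", r"\1 \1", "abc xy"))  # prints: abc abc xy xy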
One particularly interesting example where backreferences are
essential is cascaded deterministic (longest match) finite state
parsing as described for example in Abney [2] and
various papers in [14]. Clearly, the standard
rewrite rules do not apply in this domain. If NP is an NP recognizer, it would not do to say NP → [NP NP ]NP, since nothing would force the string matched by the NP to the left of the arrow to be the same as the string matched by the NP to the right of the arrow.
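As a concrete sketch of why the backreference is essential here, consider the following Python fragment (the toy pattern Det N stands in for a real NP recognizer; the example is ours, not part of the algorithm):

    import re

    # Longest-match bracketing of NPs. The backreference \1 is what forces
    # the material between the brackets to be identical to the string the
    # NP recognizer matched; a rule with a fixed right-hand side cannot.
    NP = r"Det N"
    print(re.sub(r"(" + NP + r")", r"[NP \1 ]NP", "Det N V Det N"))
    # prints: [NP Det N ]NP V [NP Det N ]NP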
One advantage of using our algorithm for finite state parsing is that
the left and right contexts may be used to bring in top-down
filtering. An
often cited advantage of finite state parsing is robustness. A
constituent is found bottom up in an early level in the cascade even
if that constituent does not ultimately contribute to an S in a later
level of the cascade. While this is undoubtedly an advantage for
certain applications, our approach would allow the introduction of
some top-down filtering while maintaining the robustness of a
bottom-up approach.
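A minimal sketch of such top-down filtering, again in Python regular expressions (the lookbehind plays the role of the rule's left context; the pattern and context are invented for illustration):

    import re

    # Bracket "Det N" as an NP only where the left context licenses it,
    # here: only immediately after a verb. The first "Det N" is left
    # unbracketed because the context condition fails.
    print(re.sub(r"(?<=V )(Det N)", r"[NP \1 ]NP", "Det N V Det N"))
    # prints: Det N V [NP Det N ]NP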
A second advantage for robust finite state parsing is that bracketing could also include the notion of ``repair'' as in [1]. One might, for example, want a bracketing rule that passes the bracketed material through repair transducers, so that an NP could be parsed as a slightly malformed Det followed by a slightly malformed N. RepairDet and RepairN, in this example, could be doing a variety
of things such as: contextualized spelling correction, reordering of
function words, replacement of phrases by acronyms, or any other
operation implemented as a transducer.
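A toy sketch of what such a repairing parse step might look like (the bodies of repair_det and repair_n are invented; in the finite state setting each would be a transducer composed into the bracketing rule):

    import re

    def repair_det(s):
        # hypothetical contextualized spelling correction for determiners
        return re.sub(r"\bteh\b", "the", s)

    def repair_n(s):
        # hypothetical spelling correction for nouns
        return re.sub(r"\bdogz\b", "dogs", s)

    def parse_np(s):
        # repair first, then bracket the (now well-formed) Det N as an NP
        s = repair_n(repair_det(s))
        return re.sub(r"(the dogs)", r"[NP \1 ]NP", s)

    print(parse_np("teh dogz bark"))  # prints: [NP the dogs ]NP bark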
Finally, we should mention the problem of complexity. A critical reader might see the nine steps in our algorithm and conclude that the algorithm is overly complex. This would be a false conclusion. To begin with, the problem itself is complex: it is easy to construct examples where the transducer produced by any algorithm becomes unmanageably large. But there are strategies for keeping the transducers smaller. For example, it is not necessary for all nine steps to be composed; they can also be cascaded. In that case, different steps can be implemented by different strategies, e.g. by deterministic or non-deterministic transducers or by bimachines [15]. The range of possibilities leaves plenty of room for future research.
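To make the composed-versus-cascaded distinction concrete, the following sketch (ours, with invented toy rules) runs a cascade stage by stage, each stage consuming the output of the previous one, instead of compiling the stages into a single machine:

    import re

    def cascade(stages, s):
        # apply each stage to the output of the previous one; no single
        # composed machine is ever built
        for stage in stages:
            s = stage(s)
        return s

    stages = [
        lambda s: re.sub(r"(Det N)", r"[NP \1 ]NP", s),            # NP level
        lambda s: re.sub(r"(V \[NP .*? \]NP)", r"[VP \1 ]VP", s),  # VP level
    ]
    print(cascade(stages, "Det N V Det N"))
    # prints: [NP Det N ]NP [VP V [NP Det N ]NP ]VP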