The algorithm presented here extends previous algorithms for rewrite rules by adding a limited version of backreferencing. This allows the output of rewriting to depend on the form of the strings that are rewritten. This new feature brings techniques used in Perl-like languages into the finite state calculus. Such an integration is needed in practical applications where simple text processing must be combined with more sophisticated computational linguistics techniques.
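For readers who know the Perl-like languages better than the finite state calculus, the following sketch (in Python, whose re module follows the Perl conventions; the strings are invented for illustration) shows the kind of backreferencing meant here: the output copies whatever string the pattern happened to match.

    import re

    # The backreference \1 copies the string matched by the group, so the
    # output depends on the form of the matched string: "abc" is rewritten
    # as "abc abc", "xy" as "xy xy". An ordinary rewrite rule, which maps a
    # regular language to a fixed replacement, cannot express this copying.
    print(re.sub(r"(\w+)", r"\1 \1", "abc xy"))  # prints: abc abc xy xy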
One particularly interesting example where backreferences are
essential is cascaded deterministic (longest match) finite state
parsing as described for example in Abney [2] and
various papers in [14]. Clearly, the standard
rewrite rules do not apply in this domain. If NP is an NP recognizer, it would not do to say NP → [NP NP ]NP, since nothing would force the string matched by the NP to the left of the arrow to be the same as the string matched by the NP to the right of the arrow.
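As a concrete sketch of why the backreference is essential here, consider the following Python fragment (the toy pattern Det N stands in for a real NP recognizer; the example is ours, not part of the algorithm):

    import re

    # Longest-match bracketing of NPs. The backreference \1 is what forces
    # the material between the brackets to be identical to the string the
    # NP recognizer matched; a rule with a fixed right-hand side cannot.
    NP = r"Det N"
    print(re.sub(r"(" + NP + r")", r"[NP \1 ]NP", "Det N V Det N"))
    # prints: [NP Det N ]NP V [NP Det N ]NP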
One advantage of using our algorithm for finite state parsing is that
the left and right contexts may be used to bring in top-down
filtering. An
often cited advantage of finite state parsing is robustness. A
constituent is found bottom up in an early level in the cascade even
if that constituent does not ultimately contribute to an S in a later
level of the cascade. While this is undoubtedly an advantage for
certain applications, our approach would allow the introduction of
some top-down filtering while maintaining the robustness of a
bottom-up approach.
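A minimal sketch of such top-down filtering, again in Python regular expressions (the lookbehind plays the role of the rule's left context; the pattern and context are invented for illustration):

    import re

    # Bracket "Det N" as an NP only where the left context licenses it,
    # here: only immediately after a verb. The first "Det N" is left
    # unbracketed because the context condition fails.
    print(re.sub(r"(?<=V )(Det N)", r"[NP \1 ]NP", "Det N V Det N"))
    # prints: Det N V [NP Det N ]NP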
A second advantage for robust finite state parsing is that bracketing could also include the notion of ``repair'' as in [1]. One might, for example, want a bracketing rule that passes the bracketed material through repair transducers, so that an NP could be parsed as a slightly malformed Det followed by a slightly malformed N. RepairDet and RepairN, in this example, could be doing a variety
of things such as: contextualized spelling correction, reordering of
function words, replacement of phrases by acronyms, or any other
operation implemented as a transducer.
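A toy sketch of what such a repairing parse step might look like (the bodies of repair_det and repair_n are invented; in the finite state setting each would be a transducer composed into the bracketing rule):

    import re

    def repair_det(s):
        # hypothetical contextualized spelling correction for determiners
        return re.sub(r"\bteh\b", "the", s)

    def repair_n(s):
        # hypothetical spelling correction for nouns
        return re.sub(r"\bdogz\b", "dogs", s)

    def parse_np(s):
        # repair first, then bracket the (now well-formed) Det N as an NP
        s = repair_n(repair_det(s))
        return re.sub(r"(the dogs)", r"[NP \1 ]NP", s)

    print(parse_np("teh dogz bark"))  # prints: [NP the dogs ]NP bark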
Finally, we should mention the problem of complexity. A critical reader might see the nine steps in our algorithm and conclude that the algorithm is overly complex. This would be a false conclusion. To begin with, the problem itself is complex: it is easy to construct examples where the transducer produced by any algorithm becomes unmanageably large. But there are strategies for keeping the transducers smaller. For example, it is not necessary for all nine steps to be composed; they can also be cascaded. In that case, different steps can be implemented by different strategies, e.g. by deterministic or non-deterministic transducers or by bimachines [15]. The range of possibilities leaves plenty of room for future research.
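To make the composed-versus-cascaded distinction concrete, the following sketch (ours, with invented toy rules) runs a cascade stage by stage, each stage consuming the output of the previous one, instead of compiling the stages into a single machine:

    import re

    def cascade(stages, s):
        # apply each stage to the output of the previous one; no single
        # composed machine is ever built
        for stage in stages:
            s = stage(s)
        return s

    stages = [
        lambda s: re.sub(r"(Det N)", r"[NP \1 ]NP", s),            # NP level
        lambda s: re.sub(r"(V \[NP .*? \]NP)", r"[VP \1 ]VP", s),  # VP level
    ]
    print(cascade(stages, "Det N V Det N"))
    # prints: [NP Det N ]NP [VP V [NP Det N ]NP ]VP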