Next: Concluding Remarks Up: An Extendible Regular Expression Previous: Leftmost-longest contexted replacement.

Implementational Issues

The regular expression compiler is defined in SICStus Prolog. This choice was motivated because the FSA Utilities toolbox has been developed as a platform for experimenting with finite-state approaches in natural language processing. Prolog allows for rapid prototyping of new techniques and variations of known techniques. The drawback is that the CPU-time requirements increase in comparison with an implementation based on C or C++. In [26] it is shown that the implementation of the determinizer is typically about 2 to 5 times slower in FSA Utilities than in AT&T's fsm library ([18]); the FSA Utilities toolbox contains a variant of the determinization algorithm for input automata with large amounts of $\epsilon$ -moves. In such cases FSA Utilities is often much faster. The implementation of the minimization algorithm [9,2] is up to three times faster than the implementation described in [17], which was shown to be much faster than the corresponding implementations in Fire Lite [27] and Grail [21].

Regular expressions are read and parsed using the Prolog parser (i.e. regular expressions are read in as Prolog terms), exploiting the inherent flexibility of this parser (such as the possibility to declare that new operators may be written using operator syntax; therefore we can write E1 o E2 instead of o(E1,E2)). The constructed term is straightforwardly compiled into a corresponding finite automaton using a simple top-down recursive-descent procedure.

This mechanism implies that in order to construct an automaton for a regular expression such as [a*,b,c^,d+] automata are constructed for each of the sub-expressions. For regular expressions which are constructed solely using such simple operators, more efficient automaton construction algorithms are known. We have not implemented these algorithms because of the desire to be able to treat user-defined operators. A possible improvement could be to have the compiler identify which parts of an expression are simple enough to be treated by a more efficient specialized algorithm.

The compiler supports caching of sub-expressions. If the cache facility is switched on, then the result of each sub-expression that is encountered will be cached for later re-use. This can increase efficiency for the compilation of a single expression, but it is especially useful in an interactive session where the user gradually alters the regular expression; typically a large part of the expression remains the same and interactive response time can be much more attractive.

The caching facility can also be used selectively. The cache(Expr) operator can be used to cache the result of the compilation of a specific regular expression Expr. For instance, in example 20 the expression lev1(X) is defined as { subs(X), del(X), ins(X) }. If we write instead:

$\begin{displaymath}\begin{minipage}[t]{.9\textwidth}\begin{verbatim}macro(lev1... ...e(X)),del(cache(X)),ins(cache(X))})\end{verbatim}\end{minipage}\end{displaymath}$

(31)

then X will be compiled only once.

The compiler supports a number of other operators which have an effect on the underlying automata, but not on the corresponding language or relation. For instance, the operator determinize(Expr) can be used to ensure that the resulting automaton is determinized. Similar operators provide a simple interface to various minimization algorithms provided by FSA5.

Furthermore, certain operators can be used for the sole purpose of obtaining a side-effect. One example was the cache/1 operator discussed above. The operator spy(Expr), for instance, can be used to request that the compiler provides progress information on the compilation of the expression Expr (size of the result, and CPU-time required to obtain the result). Such progress information is crucial for a better understanding of the sources of complexity of particular expressions.

Next: Concluding Remarks Up: An Extendible Regular Expression Previous: Leftmost-longest contexted replacement.

2000-07-03