The regular expression compiler is defined in SICStus Prolog. This
choice was motivated because the FSA Utilities toolbox has been
developed as a platform for experimenting with finite-state approaches
in natural language processing. Prolog allows for rapid prototyping
of new techniques and variations of known techniques. The drawback is
that the CPU-time requirements increase in comparison with an
implementation based on C or C++. In [26] it is shown that
the implementation of the determinizer is typically about 2 to 5 times
slower in FSA Utilities than in AT&T's fsm library
([18]); the FSA Utilities toolbox contains a variant of the
determinization algorithm for input automata with large amounts of
-moves. In such cases FSA Utilities is often much faster.
The implementation of the minimization algorithm
[9,2] is up to three times faster than
the implementation described in [17], which was shown to
be much faster than the corresponding implementations in Fire Lite
[27] and Grail [21].
Regular expressions are read and parsed using the Prolog parser (i.e. regular expressions are read in as Prolog terms), exploiting the inherent flexibility of this parser (such as the possibility to declare that new operators may be written using operator syntax; therefore we can write E1 o E2 instead of o(E1,E2)). The constructed term is straightforwardly compiled into a corresponding finite automaton using a simple top-down recursive-descent procedure.
This mechanism implies that in order to construct an
automaton for a regular expression such as [a*,b,c^,d+]
automata are constructed for each of the sub-expressions. For regular
expressions which are constructed solely using such simple operators,
more efficient automaton construction algorithms are known. We have
not implemented these algorithms because of the desire to be able to
treat user-defined operators. A possible improvement could be to have
the compiler identify which parts of an expression are simple enough
to be treated by a more efficient specialized algorithm.
The compiler supports caching of sub-expressions. If the cache facility is switched on, then the result of each sub-expression that is encountered will be cached for later re-use. This can increase efficiency for the compilation of a single expression, but it is especially useful in an interactive session where the user gradually alters the regular expression; typically a large part of the expression remains the same and interactive response time can be much more attractive.
The caching facility can also be used selectively. The
cache(Expr) operator can be used to cache the result of the
compilation of a specific regular expression Expr. For instance,
in example 20 the expression lev1(X) is defined as
{ subs(X), del(X), ins(X) }. If we write instead:
![]() |
(31) |
The compiler supports a number of other operators which have an effect on the underlying automata, but not on the corresponding language or relation. For instance, the operator determinize(Expr) can be used to ensure that the resulting automaton is determinized. Similar operators provide a simple interface to various minimization algorithms provided by FSA5.
Furthermore, certain operators can be used for the sole purpose of obtaining a side-effect. One example was the cache/1 operator discussed above. The operator spy(Expr), for instance, can be used to request that the compiler provides progress information on the compilation of the expression Expr (size of the result, and CPU-time required to obtain the result). Such progress information is crucial for a better understanding of the sources of complexity of particular expressions.