3. Regular Expression Operators

This section lists the regular expression operators built-in in FSA.

3.1. ?

The set of one-symbol strings over the universal alphabet, ie. ? can be read as `any symbol whatsoever'. It uses the true/1 predicate_declaration from the current predicate module (cf the chapter Predicates on Symbols). If that declaration is not defined then no compilation for this operator is possible, and an error occurs.

3.2. Expr# m[inimize](Expr) mh(Expr) mb(Expr)

Applies minimization to the result of compiling Expr. There are a number of related expressions depending on which minimization algorithm is to be used.

mb uses the algorithm due to Brzozowski, mh uses the algorithm by Hopcroft (as described in Aho, Hopcroft and Ullman, 1974).

If Expr is a transducer then it is temporarily treated as a recognizer over pairs of symbols (using the fsa_frozen predicate module).

3.3. A! determinize(A) determinize(A,Algorithm)

Set of strings denoted by A, but moreover the subset construction determinization algorithm is applied to ensure that the automaton is deterministic. The algorithm can be specified as the second argument. There are several variants of the algorithm, which are different with respect to the treatment of epsilon transitions:

per_graph: first construct efree automaton (jumps taken into account on target side of transitions and on start states)
per_inverted_graph: first construct efree automaton (jumps taken into account on source side of transitions and and final states)
per_reachable_graph: as per_inverted_graph, but maintains accessibility
per_co_reachable_graph: as per_graph, but maintains co-accessibility
per_subset: compute transitive closure of jumps on the fly for each subset
per_state: compute transitive closure of jumps on the fly for each state

These variants and some interesting experimental observations are described in a paper I presented at the FSMNLP 98 workshop in Ankara. The paper is available from http://www.let.rug.nl/~vannoord/papers/. An improved version of the paper has been published in Computational Linguistics.

By default the algorithm is chosen by a simple heuristic based on the number of states and number of jumps of the input automaton. If A is a transducer then it is temporarily treated as a recognizer over atomic symbols, which happen to be pairs of predicates.

3.4. efree(E) reachable_efree(E) co_reachable_efree(E)

Constructs epsilon-free automaton for the automaton created for E. The first variant is faster, the second and third algorithms yield smaller automata by only taking into account states reachable from the start state, resp. from which a final state is reachable.

3.5. term_complement(E) `E

The term complement of E, i.e. the set of all single symbol strings minus those in E; this is equivalent to ? - E.

3.6. ~E complement(E)

The complement of the language denoted by E. E must be a recognizer.

3.7. A-B difference(A,B)

Set of strings denoted by A minus those given by B. A and B must be recognizers.

3.8. $E containment(E)

The language consisting of all strings that have an instance of E as a sub-string: [? *, E, ? *]. Note that the result is a minimal automaton. Since the definition of this operator depends on the ?/0 operator it is only defined if the current predicate module provides a definition of true/1. E can be both a recognizer or a transducer.

3.9. t_determinize(E)

The set of pairs denoted by E, but moreover the determinization algorithm for transducers by Mohri, cf. also Roche and Schabes, is applied to E. NB: this is only guaranteed to terminate if in fact E can be determinized in the appropriate sense. The implementation currently does not check for this. Refer to the Examples/RocheSchabes97 directory for an experimental implementation of that check and various related algorithms.

representation of sequential transducers

Note that in FSA subsequential transducers are represented as ordinary transducers. This implies in particular that instead of output symbols associated with final states, we have a separate transition over epsilon input and final output to a new final state. Similarly, automata which require an initial output to be associated with the start state will give rise to an extra transition from a new start state with epsilon input and the required start output.

The treatment of identity constraints over predicate pairs is especially tricky. This is mostly hidden in the predicate module declaration of t_determinize/2 preds. Funny things to watch for:

certain non-functional automata become t_deterministic:

t_minimize([ a x {b,d}, d*])

This is actually quite useful.

delayed identity constraints: predicates apply on the input side before the transition containing the target of the identity constraint is encountered. For example:

t_minimize({[a:b,?,?,?,?,?,b],[a:c,?,?,?,?,?,c]})

This is especially nasty if the number of question marks do not match up:

t_minimize({[a:b,?,?,?,?,?,b],[a:c,?,?,?,c]})

the opposite occurs as well: sometimes we have to output symbols satisfying a certain predicate which must be identical to an input symbol which is yet to be encountered! This currently works. Try for instance:

t_minimize([a:b,a..f])

I think this is quite spectacular.

3.10. t_minimize(E)

The set of pairs denoted by E, but moreover the minimization algorithm for transducers by Morhi is applied to E. Note that this uses the t_determinize operator - for further details check there.

3.11. w_determinize(E)

The set of pairs denoted by E, but moreover the determinization algorithm for string to weight transducers by Mohri is applied to E. Cf. the t_determinize operator for further details.

3.12. wt_determinize(E)

Denotes the weighted transducer given by E, but moreover the determinization algorithm for weighted transducers by Mohri is applied to E. Cf. the t_determinize operator for further details.

3.13. w_minimize(E)

The set of pairs denoted by E, but moreover the minimization algorithm for string to weight transducers by Mohri is applied to E. Thus, E must denote a weighted recognizer. Cf. the t_determinize operator for further details.

3.14. wt_minimize(E)

Denotes the weighted transducer given by E, but moreover the minimization algorithm for string to weight transducers by Mohri is applied to E. Cf. the t_minimize operator for further details.

3.15. words(ListOfAtoms)

Creates minimal automaton for each of the atoms in ListOfAtoms, where each atom is expanded out into a concatenation of the atoms corresponding to its characters; i.e.

words([john,peter,mary])

is equivalent to

{ [j,o,h,n],[p,e,t,e,r],[m,a,r,y] }

3.16. perfect_hash(ListOfAtoms)

Creates weighted recognizer implementing a (minimal) perfect hash for the words found in ListOfAtoms. For instance:

perfect_hash([john,peter,mary])

is equivalent to

w_minimize( {[j,o,h,n] :: 0,
[p,e,t,e,r] :: 2,
[m,a,r,y] :: 1 } )

3.17. A:B pair(A,B) A:B:Weight

A and B are symbols; this is a transducer mapping an A to a B. In addition, A:B:W defines a weighted transducer mapping A to B with associated weight W.

3.18. A::W

Defines a weighted recognizer where A is a recognizer, and W its weight; alternatively defines a weighted transducer where A is a transducer and W its weight.

3.19. A x B cross_product(A,B) A x B x W

The set of pairs (A0,B0) such that A0 is in A and B0 is in B. Both A and B must describe recognizers.

In addition, A x B x W defines a weighted transducer where strings from A are mapped to B, with weight W.

3.20. A xx B sl_cross_product(A,B)

The set of pairs (A0,B0) such that A0 is in A and B0 is in B, moreover the strings A0 and B0 are to be of the same length. Both A and B must be recognizers.

3.21. escape(Sym)

Sym is a symbol. This denotes the language consisting of that symbol. Can be used to overwrite special meaning of some symbols. For instance, escape(?) can be used to denote a literal question mark. Sym should be ground Prolog term, and it is passed through the predicate regex_notation_to_predicate of the current predicate module.

3.22. S..T

S and T are one-character atoms or integers. In the first case, denotes the set of symbols from S up to T in ASCII coding. For instance a..e is equivalent to {a,b,c,d,e}. If S and T are integers, represents the set of integers in that interval: for instance 8..11 is equivalent to {8,9,10,11}.

3.23. incomplete(A)

Ensures that all states in the automaton for A are co-accessible, i.e. for each state there is path to a final state.

3.24. coaccessible(A)

Ensures that all states in the automaton for A are co-accessible, i.e. for each state there is path to a final state.

3.25. reachable(A)

This operator ensures that for each state s in the automaton of A there is a path from a start state to s.

3.26. accessible(A)

This operator ensures that for each state s in the automaton of A there is a path from a start state to s.

3.27. complete(A)

Adds transitions and a sink state such that the transition table is total, i.e. there is a transition for every symbol from every state. If A is a transducer then it is temporarily treated as a recognizer over pairs of symbols. If A is a transducer then it is temporarily treated as a recognizer.

3.28. ignore(A,B)

Strings from A interspersed with substrings from B. For instance, ignore([a,a,a],c) contains all strings over the alphabet {a,c} which contain exactly three a's. Both A and B must be recognizers.

3.29. {}

{} denotes the empty language.

3.30. {E1,E2,..,En} union(E1,E2) set([E1,E2,..,En])

Union of the languages denoted by E1,..,En. As a special case, '{}' is the empty language, i.e. a language without any strings. Note that the result is a minimal automaton. E1 .. En can be both transducers or recognizers. If one of them is a transducer, then all of the others are coerced into transducers as well.

3.31. []

[] denotes the empty string (or equivalently the language solely consisting of the empty string).

3.32. [E1,E2,...En] concat(E1,E2)

The concatenation of the languages denoted by E1, E2, .. En. As a special case, [] is the language solely containing of the empty string. Note that the result is a minimal automaton. E1 .. En can be both recognizers and transducers. If one of them is a transducer then all of the others are coerced into transducers as well.

3.33. E* kleene_star(E)

Kleene closure (zero or more concatenations) of the language denoted by E. Note that the result is a minimal automaton. E can be both a recognizer or a transducer.

3.34. E+ [kleene_]plus(E)

Kleene plus (one or more concatentations) of the language denoted by E. Note that the result is a minimal automaton. E can be both a recognizer or a transducer.

3.35. option(E) E^

Union of E with the empty string, i.e. a string from E occurs optionally. The result is a minimal automaton. E can be both a recognizer or a transducer.

3.36. intersect[ion](A,B) A & B

The intersection of the languages denoted by A and B. Produces a minimal automaton. A and B must be recognizers.

3.37. intersect_list([E0,E1,...,En])

This is equivalent to the sequential intersection of E0, E1, .., En, i.e., to the expression

E0 & E1 & ... & En

3.38. E0 o E1 compose(E0,E1)

The set of pairs (A,C) such that (A,B) is in E0 and (B,C) is in E1. Both E0 and E1 are (coerced into) transducers. Note that the result is a minimal automaton.

Note that in case both E0 and E1 are not same-length transducers, then often the resulting transducer will give rise to `spurious' results in the sense that for a given input the same output is produced several times. See the paper by Pereira and Riley, 1996, for some suggestions to repair this. Obviously, in cases where you can determinize the transducer (with t_determinize) the spurious ambiguities will disappear as well.

3.39. compose_list([E0,E1,...,En])

This is equivalent to the sequential composition of E0, E1, .., En, i.e., to the expression

E0 o E1 o ... o En

3.40. reverse(E)

set of strings F such that the reversal of F is in E.

3.41. inversion(E) inverse(E) invert(E)

The set of pairs B:A such that A:B is in E. If E is a recognizer, then it is converted to its identity transducer.

3.42. id(E) identity(E)

The set of pairs A:A such that A is in E.

3.43. domain(E)

The set of strings A such A:B is in E.

3.44. weighted_domain(E)

The set of weighted strings A::B such A:C:B is in E.

3.45. range(E)

The set of strings B such that A:B is in E; if E is a weighted transducer then we obtain the set of strings B such that A:B:W in E.

3.46. weighted_range(E)

The set of weighted strings B:W such that A:B:W is in E.

3.47. weights(E)

A recognizer for the weights as defined in the weighed recognizer or weighted transducer E.

3.48. no_weights(E)

If E is a weighted recognizer, then this defines the recognizer obtained by ignoring all weights. If E is a weighted transducer, then this defines a transducer obtained by ignoring all weights.

3.49. cleanup(E)

The cleanup operator attempts to pack several transitions into one. For instance, assume there are two transitions from state p to q over the predicates p1 and p2 respectively. If p3 is a predicate which is true just in case either p1 or p2 is true, then we replace the two transitions by one transition over predicate p3. Note that if E is a sequence transducer, then cleanup does not attempt to create identities or combine one or more transitions involving identities (since that would require global analysis in order to know how identities on input and output side line up).

3.50. expand_non_overlapping_predicates(E)

This operator constructs an automaton M for the expression E in such a way that all predicates which occur in M have an empty intersection, i.e. the resulting predicates in M are non-overlapping. Note that in the case of a sequence transducer, M is first coerced into a (synchronized) letter transducer.

3.51. subs(E)

E is supposed to be a transducer, or weighted recognizer. The result will be all pairs allowed by E and furthermore all pairs (x,y) such that (x',y) is in E and x' can be formed by substituting one symbol in x.

3.52. del(E)

E is supposed to be a transducer, or weighted recognizer. The result will be all pairs allowed by E and furthermore all pairs (x,y) such that (x',y) is in E and x' can be formed by deleting one symbol in x.

3.53. ins(E)

E is supposed to be a transducer, or weighted recognizer. The result will be all pairs allowed by E and furthermore all pairs (x,y) such that (x',y) is in E and x' can be formed by inserting one symbol in x.

Denotes the string Atom, as a concatenation of its individual characters. For instance word(regular) is equivalent to [r,e,g,u,l,a,r].

3.55. convert_pred_module(NewModule,Expr) convert_pred_module(NewDomainMod,NewRangeMod,Expr)

Converts the automaton defined by Expr into an automaton using the pred_module declarations found in NewModule. Note that this is possible only in case the newer module is at least as expressive as the old one. For instance, you can convert an automaton with the fsa_frozen predicate module into an automaton with the fsa_preds predicate module, but not vice versa. The binary operator is for recognizers, the ternary operator for letter transducers.

3.56. fa(Fa)

Fa is already a finite automaton in appropriate format.

3.57. file(X)

This denotes the finite-automaton read from file X.

3.58. spy(Expr)

Equivalent to Expr, but sets spy-point on compilation of Expr. This implies that for debug level 1 or higher the CPU-time is reported required to compile Expr, as well as the size of the resulting automaton.

3.59. cache(Expr)

Equivalent to Expr, but caches result of compiling Expr, if the flag regex_cache is set to selective. If that flag has value off then there is no caching. If the value is on then the regular expression compiler caches every sub-computation.

3.60. random(NrStates,NrSymbols,Den,JDens[,FDens])

A random non-deterministic automaton is constructed consisting of the number of states specified in the first argument, number of symbols in the second argument. The desired density of the automaton is given in the third argument, whereas the fourth argument is the jump density. The final argument is a number between 0 and 1 indicating the likelihood that a state is final. If no fifth argument is specified, then all states are final.

For example, random(20,10,0.1,0.1) will be an automaton with 20 states, 10 symbols, approximately 400 transitions and 40 jumps. Automata generated this way always contain a single start state. They use the fsa_frozen predicate module; transition labels are integers counting from 0 upward.

If you need deterministic automata, consider the det_random/[4,5] regular expression operators.

3.61. det_random(NrStates,NrSymbols,Den,FDens)

A deterministic random automaton is constructed consisting of the number of states specified in the first argument, number of symbols in the second argument. The desired density of the automaton is given in the third argument, whereas the final argument is a number between 0 and 1 indicating the likelihood that a state is final. For instance, det_random(20,10,0.1,0.1) will be an automaton with 20 states, (at most) 10 symbols, approximately 20 transitions, and approximately 2 final states. Note that the generated automaton may contain states which are unreachable from the start state. Automata generated this way always contain a single start state. They use the fsa_frozen predicate module; transition labels are integers counting from 0 upward.

If you need automata in which all states are reachable, consider the reachable/1 regular expression operator.

3.62. pragma(ListOfExpressions,E)

Equivalent to E, but this expression instructs the regular expression compiler that it should first compile each of the expressions in the list ListOfExpressions, which are supposed to occur repetitively in E. This is typically used in macros which use sub-expressions more than once. For example, consider the following macro (adapted from Kaplan and Kay):

macro(p_iff_s(L1,L2),
if_p_then_s(L1,L2) & if_s_then_p(L1,L2)).

The pragma operator can be used in order to ensure that L1 and L2 are compiled only once:

macro(p_iff_s(L1,L2), pragma([V1-L1,V2-L2],
if_p_then_s(V1,V2) & if_s_then_p(V1,V2))).

The first argument of pragma thus is a list of pairs V-Term where V is a variable which occurs in the second argument (in typical cases it occurs at least twice). The Term associated with V is compiled only once. For every occurrence of V in the second argument, the result of that compilation is used.