Even if the finite automaton constructed from the regular expressions is determinized, the actual problem of tokenizing a given string need not be deterministic: the choice between accepting a token (and going on to find the next token) and extending the current token often cannot be made deterministically. To take an example from the Elex documentation, consider the regular expression for|forest|end. If the input is the string forend, then at the point where the scanner is considering the `e' symbol, it does not know whether to accept for as a token, using `e' as the first character of the next token, or whether it should try to complete forest.
A simple backtracking scheme is used in such cases. Elex provides an option which indicates the number of tokens of look-ahead the scanner should use; a larger number naturally results in a slower, but possibly more useful, scanner. The implementation of this backtracking scheme should be taken care of in the code generation part.
A few non-deterministic choices remain, however. For instance, consider the regular expression for|forest|est. For the input forest two analyses are available. In such cases the scanner operates in a greedy way: longer matches are preferred, so for this example only a single token will be returned. Again, this preference for longest matches should be implemented in the code generation part.
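To make this concrete with the for|end|forend scanner developed in the example below (a sketch, assuming that double-quoted strings denote lists of character codes and that <LOOKAHEAD> has been replaced by, say, 1):

?- tokenize("forend", Tokens).
Tokens = [w(forend)]

That is, the single-token analysis is preferred over w(for) followed by w(end).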
The top-level predicate tokenize/2 is defined as follows, where <LOOKAHEAD> is used as a meta-variable which will be replaced in the actual code by an integer:
%% tokenize(+Chars,-Tokens)
tokenize([],[]).
tokenize([Char|Chars0],Tokens0) :-
    (   tokenize(Tokens0,Tokens1,[Char|Chars0],Chars1,Goal),
        look_ahead(Chars1,<LOOKAHEAD>)
    ->  call(Goal),
        tokenize(Chars1,Tokens1)
    ;   error(Char,Chars0,Tokens0)
    ).

%% look_ahead(+Chars,+N): check that (up to) N further tokens can be
%% scanned from Chars; reaching the end of the input also counts as success.
look_ahead([],_).
look_ahead([H|T],N) :-
    look_aheadX(N,[H|T]).

look_aheadX(0,_) :- !.
look_aheadX(N0,[H|T]) :-
    tokenize(_,_,[H|T],Str,_),
    N is N0-1,
    look_ahead(Str,N).
The if-then-else construct in the second clause of the tokenize/2 predicate takes care of look-ahead and error handling. Moreover, since in the clauses defined below the clauses corresponding to transitions precede the clauses corresponding to final states, this construct also implements the preference for longest matches (cf. section 4 for a discussion of the implementation of look-ahead).
The remaining work is encoded by a number of clauses defining the predicate tokenize/5 and its auxiliaries; these clauses are generated from the scanner specification. As an example, consider the following specification:
scanner tw
begin
   on Forend "for|end|forend" <prolog
        atom_chars(W,__Token),
        __Tokens0=[w(W)|__Tokens].
   >
   on WhiteSpace "[ \t\n]+" <prolog
        __Tokens0=__Tokens.
   >
   on error <prolog
        __Tokens = [skip(__Char)|Ts],
        tokenize(__Chars,Ts).
   >
end
In the embedded Prolog fragments, __Token, __Tokens0, __Tokens, __Char and __Chars are variables supplied by the generated code: __Token is bound to the characters matched, __Tokens0 and __Tokens form the difference list of tokens, and __Char and __Chars denote the offending character and the remaining input in the error rule. The clauses needed to complete this example are given below. The intermediate finite automaton is given in figure 1.
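Reading the transitions off the generated clauses below, the automaton can be summarized as follows (a sketch only; trans/3 and final/2 are illustrative notation rather than part of the generated code, and 0 is the start state):

trans(0,0'f,6).  trans(6,0'o,7).  trans(7,0'r,8).   % f o r
trans(8,0'e,2).  trans(2,0'n,3).  trans(3,0'd,4).   % ... e n d  (forend)
trans(0,0'e,1).  trans(1,0'n,3).                    % e n d      (end)
trans(0,32,5).   trans(0,9,5).    trans(0,10,5).    % whitespace
trans(5,32,5).   trans(5,9,5).    trans(5,10,5).
final(8,'Forend').       % matches "for"
final(4,'Forend').       % matches "end" and "forend"
final(5,'WhiteSpace').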
tokenize(Ts0,Ts,S0,S,G) :-
    tokenize0(Ts0,Ts,S0,S,T,[],T,G).

tokenize0(Ts0,Ts,[C|S0],S,[C|Sy0],Sy,T,G) :- tokenize0(C,Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize1(Ts0,Ts,[C|S0],S,[C|Sy0],Sy,T,G) :- tokenize1(C,Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize2(Ts0,Ts,[C|S0],S,[C|Sy0],Sy,T,G) :- tokenize2(C,Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize3(Ts0,Ts,[C|S0],S,[C|Sy0],Sy,T,G) :- tokenize3(C,Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize5(Ts0,Ts,[C|S0],S,[C|Sy0],Sy,T,G) :- tokenize5(C,Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize6(Ts0,Ts,[C|S0],S,[C|Sy0],Sy,T,G) :- tokenize6(C,Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize7(Ts0,Ts,[C|S0],S,[C|Sy0],Sy,T,G) :- tokenize7(C,Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize8(Ts0,Ts,[C|S0],S,[C|Sy0],Sy,T,G) :- tokenize8(C,Ts0,Ts,S0,S,Sy0,Sy,T,G).

tokenize0(  9,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize5(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize0( 10,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize5(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize0( 32,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize5(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize0(101,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize1(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize0(102,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize6(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize1(110,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize3(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize2(110,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize3(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize3(100,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize4(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize5(  9,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize5(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize5( 10,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize5(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize5( 32,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize5(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize6(111,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize7(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize7(114,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize8(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize8(101,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize2(Ts0,Ts,S0,S,Sy0,Sy,T,G).

tokenize4(Ts0,Ts,S,S,Sy,Sy,T,'Forend'(T,Ts0,Ts)).
tokenize5(Ts0,Ts,S,S,Sy,Sy,T,'WhiteSpace'(T,Ts0,Ts)).
tokenize8(Ts0,Ts,S,S,Sy,Sy,T,'Forend'(T,Ts0,Ts)).

'Forend'(__Token,__Tokens0,__Tokens) :-
    atom_chars(W,__Token),
    __Tokens0=[w(W)|__Tokens].
'WhiteSpace'(__Token,__Tokens0,__Tokens) :-
    __Tokens0=__Tokens.
error(__Char,__Chars,__Tokens) :-
    __Tokens = [skip(__Char)|Ts],
    tokenize(__Chars,Ts).
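For illustration, a session with the generated scanner might look as follows (a sketch, assuming that <LOOKAHEAD> has been replaced by 1, that double-quoted strings denote lists of character codes, and that atom_chars/2 accepts a code list, as in the Prolog systems used for the measurements below):

?- tokenize("forend end", Tokens).
Tokens = [w(forend),w(end)]

The whitespace between the two words is consumed by the WhiteSpace rule, which adds nothing to the token list.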
In order to provide an indication of the trade-offs involved, we
compared a relatively simple tokenizer in compact and expanded form.
The expanded version took about 5.9 CPU-seconds to tokenize a 4 MB
list of character codes (size of tokenizer about 1400 lines of Prolog
code). The compact version consists of about 400 lines, and takes
about 8.6 seconds for tokenization (using SICStus Prolog 3 #6). For SWI-Prolog (version 2.8.6), the differences were larger: about 14.7 versus about 32.9 CPU-seconds.