Even if the finite automaton constructed from the regular expressions is determinized, the actual problem of tokenizing a given string need not be deterministic: the choice between accepting a token (and going on to find the next token) and extending the current token often cannot be made deterministically. To take an example from the Elex documentation, consider the regular expression for|forest|end. If the input is the string forend, then at the point where the scanner is considering the `e' symbol, it does not know whether to accept for as a token, using `e' as the first character of the next token, or whether it should try to complete forest.
A simple backtracking scheme is used in such cases. Elex provides an option which indicates the number of tokens of look-ahead the scanner should use; a larger number naturally results in a slower, but possibly more useful, scanner. The implementation of this backtracking scheme should be taken care of in the code generation part.
A few non-deterministic choices remain, however. For instance, consider the regular expression for|forest|est. For the input forest two analyses are available. In such cases the scanner operates in a greedy way: longer matches are preferred, so for this example only a single token will be returned. Again, this preference for longest matches should be implemented in the code generation part.
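To make this concrete with the for|end|forend scanner developed in the example below (a sketch, assuming that double-quoted strings denote lists of character codes and that <LOOKAHEAD> has been replaced by, say, 1):

?- tokenize("forend", Tokens).
Tokens = [w(forend)]

That is, the single-token analysis is preferred over w(for) followed by w(end).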
The top-level predicate tokenize/2 is defined as follows, where <LOOKAHEAD> is used as a meta-variable which will be replaced in the actual code by an integer:
%% tokenize(+Chars,-Tokens)
tokenize([],[]).
tokenize([Char|Chars0],Tokens0) :-
    (   tokenize(Tokens0,Tokens1,[Char|Chars0],Chars1,Goal),
        look_ahead(Chars1,<LOOKAHEAD>)
    ->  call(Goal),
        tokenize(Chars1,Tokens1)
    ;   error(Char,Chars0,Tokens0)
    ).

%% look_ahead(+Chars,+N): check that (up to) N further tokens can be
%% scanned from Chars; reaching the end of the input also counts as success.
look_ahead([],_).
look_ahead([H|T],N) :-
    look_aheadX(N,[H|T]).

look_aheadX(0,_) :- !.
look_aheadX(N0,[H|T]) :-
    tokenize(_,_,[H|T],Str,_),
    N is N0-1,
    look_ahead(Str,N).
The if-then-else construct in the second clause of the tokenize/2 predicate takes care of look-ahead and error handling. Moreover, since in the clauses defined below the clauses corresponding to transitions precede the clauses corresponding to final states, this construct also implements the preference for longest matches (cf. section 4 for a discussion of the implementation of look-ahead).
The remaining work is encoded by a number of clauses defining the predicate tokenize/5 and its auxiliaries; these clauses are generated from the scanner specification. As an example, consider the following specification:
scanner tw
begin
   on Forend "for|end|forend" <prolog
        atom_chars(W,__Token),
        __Tokens0=[w(W)|__Tokens].
   >
   on WhiteSpace "[ \t\n]+" <prolog
        __Tokens0=__Tokens.
   >
   on error <prolog
        __Tokens = [skip(__Char)|Ts],
        tokenize(__Chars,Ts).
   >
end
In the embedded Prolog fragments, __Token, __Tokens0, __Tokens, __Char and __Chars are variables supplied by the generated code: __Token is bound to the characters matched, __Tokens0 and __Tokens form the difference list of tokens, and __Char and __Chars denote the offending character and the remaining input in the error rule. The clauses needed to complete this example are given below. The intermediate finite automaton is given in figure 1.
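Reading the transitions off the generated clauses below, the automaton can be summarized as follows (a sketch only; trans/3 and final/2 are illustrative notation rather than part of the generated code, and 0 is the start state):

trans(0,0'f,6).  trans(6,0'o,7).  trans(7,0'r,8).   % f o r
trans(8,0'e,2).  trans(2,0'n,3).  trans(3,0'd,4).   % ... e n d  (forend)
trans(0,0'e,1).  trans(1,0'n,3).                    % e n d      (end)
trans(0,32,5).   trans(0,9,5).    trans(0,10,5).    % whitespace
trans(5,32,5).   trans(5,9,5).    trans(5,10,5).
final(8,'Forend').       % matches "for"
final(4,'Forend').       % matches "end" and "forend"
final(5,'WhiteSpace').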
tokenize(Ts0,Ts,S0,S,G) :-
    tokenize0(Ts0,Ts,S0,S,T,[],T,G).

tokenize0(Ts0,Ts,[C|S0],S,[C|Sy0],Sy,T,G) :- tokenize0(C,Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize1(Ts0,Ts,[C|S0],S,[C|Sy0],Sy,T,G) :- tokenize1(C,Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize2(Ts0,Ts,[C|S0],S,[C|Sy0],Sy,T,G) :- tokenize2(C,Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize3(Ts0,Ts,[C|S0],S,[C|Sy0],Sy,T,G) :- tokenize3(C,Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize5(Ts0,Ts,[C|S0],S,[C|Sy0],Sy,T,G) :- tokenize5(C,Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize6(Ts0,Ts,[C|S0],S,[C|Sy0],Sy,T,G) :- tokenize6(C,Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize7(Ts0,Ts,[C|S0],S,[C|Sy0],Sy,T,G) :- tokenize7(C,Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize8(Ts0,Ts,[C|S0],S,[C|Sy0],Sy,T,G) :- tokenize8(C,Ts0,Ts,S0,S,Sy0,Sy,T,G).

tokenize0(  9,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize5(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize0( 10,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize5(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize0( 32,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize5(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize0(101,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize1(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize0(102,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize6(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize1(110,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize3(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize2(110,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize3(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize3(100,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize4(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize5(  9,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize5(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize5( 10,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize5(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize5( 32,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize5(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize6(111,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize7(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize7(114,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize8(Ts0,Ts,S0,S,Sy0,Sy,T,G).
tokenize8(101,Ts0,Ts,S0,S,Sy0,Sy,T,G) :- tokenize2(Ts0,Ts,S0,S,Sy0,Sy,T,G).

tokenize4(Ts0,Ts,S,S,Sy,Sy,T,'Forend'(T,Ts0,Ts)).
tokenize5(Ts0,Ts,S,S,Sy,Sy,T,'WhiteSpace'(T,Ts0,Ts)).
tokenize8(Ts0,Ts,S,S,Sy,Sy,T,'Forend'(T,Ts0,Ts)).

'Forend'(__Token,__Tokens0,__Tokens) :-
    atom_chars(W,__Token),
    __Tokens0=[w(W)|__Tokens].
'WhiteSpace'(__Token,__Tokens0,__Tokens) :-
    __Tokens0=__Tokens.
error(__Char,__Chars,__Tokens) :-
    __Tokens = [skip(__Char)|Ts],
    tokenize(__Chars,Ts).
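For illustration, a session with the generated scanner might look as follows (a sketch, assuming that <LOOKAHEAD> has been replaced by 1, that double-quoted strings denote lists of character codes, and that atom_chars/2 accepts a code list, as in the Prolog systems used for the measurements below):

?- tokenize("forend end", Tokens).
Tokens = [w(forend),w(end)]

The whitespace between the two words is consumed by the WhiteSpace rule, which adds nothing to the token list.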
In order to provide an indication of the trade-offs involved, we
compared a relatively simple tokenizer in compact and expanded form.
The expanded version took about 5.9 CPU-seconds to tokenize a 4 MB
list of character codes (size of tokenizer about 1400 lines of Prolog
code). The compact version consists of about 400 lines, and takes
about 8.6 seconds for tokenization (using SICStus Prolog 3 #6). For SWI-Prolog (version 2.8.6), the differences were larger: about 14.7 versus about 32.9 CPU-seconds.