scanner text
symbols Number Word
defines
    number = "[0-9]+"
begin
on Number "<number>(\.<number>)?([eE][\+\-]?<number>)?"
    <prolog number_chars(Nr,__Token),
            __Tokens0=[number(Nr)|__Tokens]. >
    <cpp    return Number; >
on Word "[a-zA-Z][a-zA-Z\-]*"
    <prolog atom_chars(Word,__Token),
            __Tokens0=[w(Word)|__Tokens]. >
    <cpp    return Word; >
on WhiteSpace "[ \t\n]+"
    <prolog __Tokens0=__Tokens. >
    <cpp    return SymNULL; >
on error
    <prolog __Tokens = [skip(__Char)|Ts],
            tokenize(__Chars,Ts). >
    <cpp    return SymERROR; >
end
This scanner defines the lexical categories number and word. It provides output code in Prolog (the parts written between <prolog and >) and C++ (the parts written between <cpp and >).
The first line of the file assigns a name to the scanner (after the keyword scanner). This name is used as the name of the module in which the scanner will be defined. The section after the keyword symbols is not used for Prolog: it defines a number of symbols to be included in the header of the C++ output file.

The section after the keyword defines consists of a number of abbreviations. Regular expressions which occur frequently can be defined here and given a name; the name can then be used in other definitions instead of the full regular expression. In the example, the abbreviation number is defined as a sequence of one or more occurrences of the characters 0..9.

The section between the keywords begin and end consists of a number of rules. Each rule consists of:

- the keyword on, followed by the name of a lexical category (or the keyword error);
- a regular expression describing the category (absent for the error rule);
- a code fragment for each output language (here <prolog ... > and <cpp ... >).
In the case of Prolog, the code fragment will be used as the body of a clause in the resulting scanner. Three special variables can be used in the Prolog fragments. __Token is the sequence of character codes which matches the regular expression. __Tokens0 and __Tokens are used as a difference list representation of the resulting list of tokens.
As a prototypical example, consider the rule for Number. If the scanner indeed found a sequence of character codes matching the regular expression, then __Token will be bound to the list of these character codes. The predicate number_chars (built-in in e.g. SICStus Prolog) will convert this list to a Prolog number. The token which is unified with the first element of __Tokens0 is a unary term indicating the nature of the token (number in this case) and its actual value.
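To make this concrete, a clause generated from the Number rule might look roughly as follows. This is only a sketch: the clause shape and the helper match_number/3 are assumptions for the sake of illustration, and the generator's actual output may differ; only the two goals taken from the <prolog ... > fragment come from the rule itself.

    % Hypothetical sketch of a generated clause for the Number rule.
    % match_number/3 is an invented helper that consumes a maximal
    % match of the regular expression from the input.
    tokenize(Chars, Tokens0) :-
        match_number(Chars, Token, Rest),   % Token: matched character codes
        number_chars(Nr, Token),            % convert the codes to a number
        Tokens0 = [number(Nr)|Tokens],      % emit the token number(Nr)
        tokenize(Rest, Tokens).             % continue with the rest of the input

Under this reading, __Token corresponds to Token, and __Tokens0/__Tokens correspond to Tokens0/Tokens in the sketch.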
The reason that we manipulate the difference list encoding of the resulting list of tokens explicitly is that it provides for greater flexibility. This way, it is easy to assign multiple tokens to a single regular expression, or to skip material completely. An example of the latter is provided by the WhiteSpace rule. This rule matches a sequence of one or more occurrences of space, tab and newline. Because the variables __Tokens0 and __Tokens are simply unified in the code part, such white space contributes no tokens to the result.
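The flexibility of the difference-list convention can be illustrated with three possible code fragments (the token names here are invented for illustration): a single match may contribute several tokens, exactly one, or none at all.

    % Sketch: three code-fragment shapes under the difference-list
    % convention; __Tokens0/__Tokens are written Tokens0/Tokens.
    emit_two(Tokens0, Tokens)  :- Tokens0 = [open, quote|Tokens].  % two tokens
    emit_one(Tokens0, Tokens)  :- Tokens0 = [comma|Tokens].        % one token
    emit_none(Tokens0, Tokens) :- Tokens0 = Tokens.                % none (white space)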
The error rule is special: it is applied when a position in the sequence of character codes is encountered for which no regular expression is applicable. In this case the special variables are different too, namely __Char, __Chars, and __Tokens. At the moment the error occurs, __Char is bound to the first character of the remaining list of character codes, __Chars to the rest of the remaining characters, and __Tokens to the list of tokens to be associated with these remaining character codes. In the example, the lexical category skip(__Char) is assigned to the offending character, and the predicate tokenize/2 is used to tokenize the remaining characters (note that this predicate is the actual result of the scanner generator). Obviously, other definitions of the error rule can be given, for example to abort the computation, to raise an exception, or to silently ignore the character and tokenize the remaining ones.
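For instance, the following alternative error fragments are sketches of the two last-mentioned behaviours (they are not taken from the generator's documentation; the exception term lexical_error/1 is invented for illustration):

    % Alternative 1: abort with an ISO exception.
    on error
        <prolog throw(lexical_error(__Char)). >

    % Alternative 2: silently drop __Char and continue.
    on error
        <prolog tokenize(__Chars,__Tokens). >

Only one error rule would appear in an actual scanner specification; the two are shown together here purely for comparison.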