


The argument for developing Elex(Prolog) consists of three steps:

Tokenization is a separate programming task (separate from reading input on the one hand, and parsing on the other hand).
Scanner generators are useful to create such tokenizers automatically on the basis of a number of regular expressions.
Multilingual scanner generators have a number of advantages over language specific tools.

In many (Prolog) programs, the problem arises of analysing input in some format. In some cases, the input may be specified as Prolog terms, in which case the Prolog built-in read/1 can be used. In most cases, however, the format is different. In such cases, it is important to distinguish between the following sub-tasks, distinctions which are sometimes blurred in actual implementations:

reading a sequence of characters into a list of character codes
tokenizing such a list of character codes into a list of tokens
parsing the list of tokens
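
The three sub-tasks can be sketched as separate predicates. This is a minimal, hypothetical sketch: the names process/2, read_codes/2, tokenize/2 and parse/2 stand in for whatever a concrete program uses.

```prolog
% A minimal sketch of the three sub-tasks as separate predicates.
% process/2, tokenize/2 and parse/2 are hypothetical names here.
process(Stream, Tree) :-
    read_codes(Stream, Codes),   % 1. reading
    tokenize(Codes, Tokens),     % 2. tokenizing
    parse(Tokens, Tree).         % 3. parsing

% Reading: collect all character codes from a stream;
% get_code/2 is a standard built-in that yields -1 at end of file.
read_codes(Stream, Codes) :-
    get_code(Stream, C),
    (   C =:= -1
    ->  Codes = []
    ;   Codes = [C|Rest],
        read_codes(Stream, Rest)
    ).
```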

Reading vs. tokenizing.

The reason for distinguishing reading and tokenizing (contrary to the approach illustrated in chapter 10 of [4]) is the observation that there can be several ways in which input is provided to a program. As an example, consider a program which uses socket-IO to obtain its input. For development purposes it is useful if this program can also run in a mode in which it uses standard input interactively. A further disadvantage of integrating reading and tokenizing is that such an integrated approach must take great care to reset the input channel properly when the tokenizer encounters an error. Similar problems occur if the tokenizer uses backtracking, whereas backtracking in a scanner which works on a list of character codes is of course trivial.

Note that the size of the input is no longer an important practical issue for most applications. For example, in SICStus Prolog we have successfully created lists of megabytes of character codes without any problems whatsoever.
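Because the tokenizer sees only a list of character codes, the same tokenizer can serve any input source. A hypothetical sketch (tokens_from_file/2 and tokens_from_atom/2 are illustrative names; read_file_to_codes/3 is a SWI-Prolog library predicate, atom_codes/2 is standard):

```prolog
% The same generated tokenize/2 works regardless of where
% the list of character codes came from.
tokens_from_file(File, Tokens) :-
    read_file_to_codes(File, Codes, []),  % SWI-Prolog library(readutil)
    tokenize(Codes, Tokens).

tokens_from_atom(Atom, Tokens) :-
    atom_codes(Atom, Codes),              % standard built-in
    tokenize(Codes, Tokens).
```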

Tokenizing vs. parsing.

We also distinguish tokenization and parsing. Although this distinction is commonly accepted in other areas of computer science, practical Prolog implementations sometimes neglect it. In such implementations, Definite Clause Grammars are used directly on lists of character codes. The DCG then contains special clauses to deal with lexical issues such as the recognition of real numbers. Such an approach can complicate the DCG considerably, resulting in increased costs of development and maintenance.
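With tokenization factored out, the DCG ranges over tokens rather than raw character codes, so lexical details such as whitespace and number formats never clutter the grammar. A hypothetical sketch, assuming tokens of the form number(N) and punct(C):

```prolog
% A grammar over tokens: lexical issues are the tokenizer's problem,
% so the grammar rules stay small and readable.
expr(plus(L,R)) --> term(L), [punct(+)], expr(R).
expr(E)         --> term(E).
term(num(N))    --> [number(N)].
```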

Automatic construction of tokenizers.

Not only do we argue for considering tokenization as a separate task, we also argue that there are good reasons to construct the tokenizer automatically, by means of a specialized program, on the basis of a number of regular expressions. The best-known example of such a program (often called a scanner generator) is lex [2] [1]. A scanner generator takes as its input a number of regular expressions (one for each lexical category). Its output is a program which implements the tokenizer. In the case of Elex(Prolog), this program defines the predicate tokenize/2, which tokenizes a list of character codes into a list of tokens. The use of such a tool can be motivated as follows:

It can be argued that a regular expression defining, for instance, floating point numbers is more declarative and easier to understand than a DCG over character codes defining the same floating point numbers.
If a scanner generator is used, the resulting scanner is guaranteed to be efficient, because standard determinization techniques are exploited. Such efficiency is much harder to obtain `by hand'.
The use of a scanner generator allows for much more flexibility. A change to the scanner only requires a change in a regular expression. For a DCG, in contrast, such a change may be more intricate, especially if the DCG was carefully written to ensure efficiency.
Ease of development. The declarativeness and flexibility of scanner generators imply that programs can be developed and maintained more easily.
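The first point can be illustrated with floating point numbers. A lex-style regular expression such as [0-9]+\.[0-9]+ states the format directly, whereas an equivalent DCG over character codes (a hypothetical sketch) spells the same information out rule by rule:

```prolog
% A DCG over character codes recognizing simple floats,
% equivalent to the regular expression [0-9]+\.[0-9]+ .
float  --> digits, [0'.], digits.
digits --> digit.
digits --> digit, digits.
digit  --> [C], { C >= 0'0, C =< 0'9 }.
```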

Multilingual Scanner Generators.

Well-known examples of scanner generators are lex and flex, which produce output in C. Prolog programmers may be familiar with plex [6], a scanner generator written in Prolog which produces output in Prolog. Elex is a similar program, but it extends existing scanner generators in that it aims at providing multiple output languages. The advantages of a single tool for multiple languages are obvious.

Currently, Elex is under development, and the number of supported languages is still rather limited. Recently, we have added Prolog support to Elex. In this paper we describe how Elex can be used to produce Prolog output, and we discuss some issues concerning the implementation of Elex(Prolog).

Noord G.J.M. van