The argumentation for developing Elex(Prolog) consists of three steps:
1. Tokenization is a separate programming task (separate from reading input on the one hand, and parsing on the other hand).
2. Scanner generators are useful to create such tokenizers automatically on the basis of a number of regular expressions.
3. Multilingual scanner generators have a number of advantages over language-specific tools.
In many (Prolog) programs, the problem arises of analysing input in some format. In some cases the input may be specified as Prolog terms, in which case the Prolog built-in read/1 can be used. In most cases, however, the format is different. In such cases it is important to distinguish between the following sub-tasks (distinctions which are sometimes blurred in actual implementations):
1. reading a sequence of characters into a list of character codes
2. tokenizing such a list of character codes into a list of tokens
3. parsing the list of tokens
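The division into three sub-tasks can be sketched as a pipeline of Prolog goals. Only tokenize/2 is a predicate actually provided by Elex(Prolog); read_codes/1 and parse/2 are hypothetical names used here for illustration:

```prolog
% Sketch of the three separate sub-tasks as a Prolog pipeline.
% read_codes/1 and parse/2 are hypothetical names; tokenize/2 is the
% predicate generated by Elex(Prolog).
process(Tree) :-
    read_codes(Codes),        % 1. reading: input -> list of char codes
    tokenize(Codes, Tokens),  % 2. tokenizing: codes -> list of tokens
    parse(Tokens, Tree).      % 3. parsing: tokens -> parse tree

% One possible definition of read_codes/1: collect all character
% codes from the current input stream until end of file (code -1).
read_codes(Codes) :-
    get_code(C),
    (   C =:= -1
    ->  Codes = []
    ;   Codes = [C|Codes0],
        read_codes(Codes0)
    ).
```

Because reading is isolated in a predicate such as read_codes/1, the same tokenizer and parser can be reused unchanged when input arrives over a socket instead of standard input.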
The reason for distinguishing reading and tokenizing (contrary to the approach illustrated in chapter 10 of [4]) is the observation that there can be several ways in which input is provided to a program. As an example, consider a program which uses socket I/O to obtain its input. For development purposes it is useful if this program can also run in a mode in which it uses standard input interactively. A further disadvantage of integrating reading and tokenizing is that such an integrated approach must take extreme care that the input channel is properly reset if the tokenizer encounters an error. Similar problems occur if the tokenizer uses backtracking, whereas backtracking in a scanner which works on a list of character codes is of course trivial.
Note that the size of the input is no longer an important practical issue for most applications. For example, in SICStus Prolog we have successfully created character-code lists of several megabytes without any problems whatsoever.
We also distinguish tokenization and parsing. Although this distinction is commonly accepted in other areas of computer science, practical Prolog implementations sometimes neglect it. In such implementations, Definite Clause Grammars (DCGs) are used directly on lists of character codes. The DCG then contains special clauses to deal with lexical issues such as the recognition of real numbers. Such an approach may complicate the DCG considerably, resulting in increased costs of development and maintenance.
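The contrast can be illustrated with a small, hypothetical arithmetic grammar (the token names num/1 and punct/1 are assumptions, not taken from any particular system). With a separate tokenizer, the DCG ranges over tokens, and the lexical details of number syntax stay out of the grammar:

```prolog
% DCG over a token list such as [num(1.5), punct(+), num(2)].
% All lexical detail (e.g. floating point syntax) is handled by the
% tokenizer, not by the grammar.
expr(plus(A,B)) --> term(A), [punct(+)], expr(B).
expr(A)         --> term(A).
term(num(N))    --> [num(N)].

% Without a tokenizer, the grammar itself would have to recognise
% numbers on the level of character codes, e.g. (incomplete fragment):
%   term(num(N)) --> digits(Is), ".", digits(Fs), { ... }.
```

A query such as phrase(expr(T), [num(1.5), punct(+), num(2)]) then succeeds with T = plus(num(1.5), num(2)).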
Not only do we argue for treating tokenization as a separate task, we also argue that there are good reasons to implement the tokenizer by means of a specialized program which constructs it on the basis of a number of regular expressions. The best-known example of such a program (often called a scanner generator) is lex [2] [1]. Such a scanner generator takes as its input a number of regular expressions (one for each lexical category). Its output is a program which implements the tokenizer. In the case of Elex(Prolog), this program defines the predicate tokenize/2, which can be used for tokenizing a list of character codes into a list of tokens. The use of such a tool can be motivated as follows:
- Declarativeness: It can be argued that a regular expression defining, for instance, floating point numbers is more declarative and easier to understand than a DCG on character codes defining the same floating point numbers.
- Efficiency: If a scanner generator is used, the result is guaranteed to be efficient, because standard determinization techniques are exploited. This efficiency is much harder to obtain `by hand'.
- Flexibility: The use of a scanner generator allows for much more flexibility. A change of the scanner only requires a change in a regular expression. For a DCG, in contrast, such a change may be more intricate, especially if the DCG was written carefully to ensure efficiency.
- Ease of Development: The declarativeness and flexibility of scanner generators imply that programs can be more easily developed and maintained.
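The following hand-written sketch suggests what a generated tokenize/2 looks like from the outside; the code actually produced by Elex(Prolog) is derived from regular expressions via determinization and will differ in its internals:

```prolog
% Hand-written sketch of the tokenize/2 interface; the real output of
% Elex(Prolog) is DFA-based and differs in detail.
tokenize([], []).
tokenize([C|Cs], Tokens) :-                 % skip layout characters
    ( C =:= 32 ; C =:= 10 ; C =:= 9 ),      % space, newline, tab
    !,
    tokenize(Cs, Tokens).
tokenize(Codes, [num(N)|Tokens]) :-         % integer tokens
    digits(Codes, Ds, Rest),
    Ds = [_|_],
    !,
    number_codes(N, Ds),
    tokenize(Rest, Tokens).
tokenize([C|Cs], [punct(A)|Tokens]) :-      % anything else: punctuation
    atom_codes(A, [C]),
    tokenize(Cs, Tokens).

% digits(Codes, Digits, Rest): split off a maximal prefix of digits.
digits([C|Cs], [C|Ds], Rest) :-
    C >= 0'0, C =< 0'9, !,
    digits(Cs, Ds, Rest).
digits(Cs, [], Cs).
```

For instance, in systems where a double-quoted string denotes a list of character codes, tokenize("12+3", Tokens) yields Tokens = [num(12), punct(+), num(3)].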
Well-known examples of scanner generators are lex and flex, which produce output in C. Prolog programmers may be familiar with plex [6], a scanner generator written in Prolog which produces Prolog output. Elex is a program similar to these, but it goes beyond existing scanner generators in that it aims at providing multiple output languages. The advantages of a single tool for multiple languages are obvious:
- Programmers who use a variety of languages need to learn only a single scanner generator tool, rather than a separate tool for each language they use.
- In larger systems, modules may be implemented in a variety of programming languages, and yet these modules may need similar scanners. For example, if the modules communicate by means of some interface language, then each module requires a scanner for this interface language. A multilingual scanner generator not only reduces the amount of work required to construct these scanners but, more importantly, also immediately guarantees consistency.
Currently, Elex is under development, and the number of supported languages is still rather limited. Recently, we have added Prolog support to Elex. In this paper we describe how Elex can be used to produce Prolog output, and we discuss some issues concerning the implementation of Elex(Prolog).
Noord G.J.M. van
1998-09-25