In order to experiment with finite-state techniques, it is important to have available an implementation of the finite-state calculus, i.e., implementations of all the important finite-state operations such as union, concatenation, Kleene closure, difference, intersection, complementation, etc. An operation such as union takes two finite-state languages l_{1} and l_{2} (expressed as regular expressions or as finite-state automata) and produces the union of the sentences in these languages (defined as {s | s ∈ l_{1} or s ∈ l_{2}}). Such regular operations are crucial building blocks of approximation algorithms and other finite-state language processing algorithms.
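To make the flavour of such operations concrete, the following sketch (our own illustration in Python, not the implementation of any of the toolkits discussed below) computes the union of two nondeterministic finite-state automata, represented simply as tuples of states, start states, final states and a transition table:

```python
# Minimal sketch of the union operation on NFAs. An automaton is a tuple
# (states, starts, finals, delta), where states are integers and delta
# maps (state, symbol) to a set of successor states.

def nfa_union(m1, m2):
    """Union of two NFAs: rename m2's states and merge the components."""
    states1, starts1, finals1, delta1 = m1
    states2, starts2, finals2, delta2 = m2
    offset = max(states1) + 1              # rename m2's states to avoid clashes
    ren = lambda q: q + offset
    states = set(states1) | {ren(q) for q in states2}
    starts = set(starts1) | {ren(q) for q in starts2}
    finals = set(finals1) | {ren(q) for q in finals2}
    delta = dict(delta1)
    for (q, a), qs in delta2.items():
        delta[(ren(q), a)] = {ren(p) for p in qs}
    return states, starts, finals, delta

def accepts(m, word):
    """Run an NFA (with a set of start states) over a word."""
    states, starts, finals, delta = m
    current = set(starts)
    for a in word:
        current = set().union(*(delta.get((q, a), set()) for q in current))
    return bool(current & finals)
```

The union of an automaton accepting only `a' and one accepting only `b' then accepts both strings; concatenation, Kleene closure and the other regular operations can be implemented along similar lines.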
Furthermore, such an implementation should make available various tools for performing operations such as determinisation and minimisation of automata. A number of implementations exist [56,58,37,76]; however, the sources of these implementations are not available. Since we expect that fruitful research in this area will lead to extensions and/or alternative implementations, it is important that an implementation of the finite-state calculus is available that can be easily extended.
Therefore it is proposed that the FSA Utilities toolkit [111,112] be adopted, and actively maintained and developed as part of this project. The toolkit consists of a collection of utilities to manipulate regular expressions, finite-state automata and finite-state transducers. Manipulations include automata construction from regular expressions, determinisation (both for finite-state acceptors and finite-state transducers), minimisation, composition, complementation, intersection, Kleene closure, etc. Furthermore, various visualisation tools are available to browse finite-state automata. An illustration of the toolbox is given in figure 2.3. Some of the functionality of the toolkit is available through the World Wide Web. A first demonstration version was developed by the University of Karlsruhe (http://i12www.ira.uka.de/Visualisierung.endlicher.Automaten/). Peter Kleiweg has developed a demonstration version of some of the regular expression features of the toolkit (http://www.let.rug.nl/~vannoord/fsademo/).

Many of the existing finite-state approximation techniques are defined on the basis of a specific parsing algorithm, restricted with some mechanism that limits its recognising power to finite-state languages. For instance, whereas the approach of e.g. [85] is based on an LR parser, [51] proposes a finite-state approximation technique based on a left-corner parser. The result can be seen as an explicit compilation of the parser of [93] into a finite-state automaton, obtained by imposing a maximum depth on the stack. An alternative approach is described in [36].
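The idea of bounding the stack can be illustrated with a small sketch (a toy of our own, not the algorithm of [93] or [51]): a top-down pushdown recogniser whose stack is never allowed to grow beyond a fixed depth. Since there are only finitely many stacks of bounded depth, the bounded recogniser has a finite configuration space and therefore accepts a regular (subset) approximation of the context-free language:

```python
# Toy bounded-stack recogniser. The configuration space, pairs of a stack
# of depth <= max_depth and an input position, is finite, so this device
# is equivalent to a finite-state automaton.

MAX_DEPTH = 4  # arbitrary bound for this sketch

def bounded_recognise(grammar, start, word, max_depth=MAX_DEPTH):
    """grammar: dict mapping a nonterminal to a list of right-hand sides
    (lists of symbols); terminals are single characters."""
    seen = set()
    agenda = [((start,), 0)]
    while agenda:
        stack, i = agenda.pop()
        if (stack, i) in seen or len(stack) > max_depth:
            continue                            # prune over-deep stacks
        seen.add((stack, i))
        if not stack:
            if i == len(word):
                return True
            continue
        top, rest = stack[0], stack[1:]
        if top in grammar:                      # nonterminal: expand
            for rhs in grammar[top]:
                agenda.append((tuple(rhs) + rest, i))
        elif i < len(word) and word[i] == top:  # terminal: match
            agenda.append((rest, i + 1))
    return False
```

For the centre-embedding grammar S → a S b | ε, with a bound of 4 the recogniser still accepts ab and aabb, but no longer aaabbb: the approximation is a proper subset of the original language.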
As far as we know, the finite-state approximation approaches mentioned above have not yet been applied in practice; on the other hand, the finite-state parsing approaches known as chunking have already been applied successfully. In these latter approaches, exemplified in publications such as [1], [23], [39] and [94], a grammar is defined by a sequence of levels (or stages). Each level consists of a number of `grammar rules' which apply to a given string. The result of a level is then passed on to the next level. Lower levels are used to create phrases; these phrases are then combined into more extensive analyses by later levels. Since rules do not apply recursively within a level, and there is only a finite number of levels, the resulting cascade is finite-state.
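A minimal cascade of this kind can be sketched as follows (a toy illustration in Python; the levels and rules are invented for expository purposes and are not taken from any of the cited systems):

```python
import re

# A toy finite-state cascade over part-of-speech tag strings: each level
# is a list of (pattern, replacement) rules applied once, non-recursively,
# and the output of a level is the input of the next.

LEVELS = [
    # level 1: group determiner-noun sequences into noun phrases
    [(r'\bDET N\b', 'NP')],
    # level 2: group a verb plus noun phrase into a verb phrase
    [(r'\bV NP\b', 'VP')],
    # level 3: group a noun phrase plus verb phrase into a clause
    [(r'\bNP VP\b', 'S')],
]

def run_cascade(tags, levels=LEVELS):
    """Apply each level's rules once, in order, to a tag string."""
    for level in levels:
        for pattern, replacement in level:
            tags = re.sub(pattern, replacement, tags)
    return tags
```

On the tag string DET N V DET N the three levels successively yield NP V NP, then NP VP, and finally S; since each rule applies only once per level and the number of levels is fixed, no recursion arises.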
The drawback, from our perspective, of the finite-state chunking method is that the finite-state grammar rules are defined directly, rather than derived in some way from a more powerful grammatical model. We will therefore investigate whether it is possible to obtain finite-state approximations by automatically dividing the rules of a given grammar into such a finite-state cascade approximating the original grammar. To this end, static grammar analysis techniques will be employed, in combination with certain limits on recursive rule application.
Finite-state approximation techniques can be evaluated by comparing the language of the underlying grammar with the language defined by the approximation. Some techniques always produce a superset of the language of the input grammar (such techniques are useful in applications where the finite-state approximation is used as a preprocessor); other techniques always yield a subset of the language of the input grammar (for instance in approaches which are psycholinguistically motivated). For some techniques neither characterisation holds.
Furthermore, techniques can be characterised by evaluating their results for certain subclasses. For instance, an interesting result would be if a certain approximation technique were exact, i.e., always found an equivalent finite-state automaton, for cases in which the input describes a regular language. However, it turns out that this is impossible: it is an unsolvable problem in general to transform a given context-free grammar generating a regular language into an equivalent finite-state automaton [108].
Other subcases are, for instance, the left-linear and right-linear grammars. The technique described in [85] is exact in these cases. The case of left-linear and right-linear grammars is generalised somewhat in [77] to the strongly regular grammars; that method is exact for such grammars.
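For the right-linear case exactness is easy to see, since the standard textbook construction maps such a grammar directly onto a finite-state automaton: nonterminals become states, and each rule becomes a transition. The sketch below is this classical construction (our own rendering, not the cited method itself):

```python
# Classical construction: a right-linear grammar with rules of the form
# A -> a B or A -> a becomes an NFA whose states are the nonterminals
# plus one accepting state. FINAL is a hypothetical name assumed not to
# clash with any nonterminal.

FINAL = '$'

def right_linear_to_nfa(grammar, start):
    """grammar: dict nonterminal -> list of rules, each a pair
    (terminal, nonterminal-or-None)."""
    delta = {}
    for lhs, rules in grammar.items():
        for terminal, rhs in rules:
            target = rhs if rhs is not None else FINAL
            delta.setdefault((lhs, terminal), set()).add(target)
    return start, delta

def nfa_accepts(nfa, word):
    """The automaton accepts exactly the strings the grammar derives."""
    start, delta = nfa
    current = {start}
    for a in word:
        current = set().union(*(delta.get((q, a), set()) for q in current))
    return FINAL in current
```

For instance, the grammar S → a B | a, B → b S yields an automaton accepting exactly a(ba)*.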
Another interesting question is how a given finite-state approximation compares with the psycholinguistic observations discussed in the previous chapter. For instance, what kind of centre embedding is allowed in the approximation? How does this compare with human abilities? Similar questions can be asked with respect to other types of construction which are difficult for humans, such as crossing dependencies. Obviously, constructions with which humans have no difficulty should be treated without problems.
Apart from a qualitative evaluation of finite-state approximation techniques, it is worth evaluating how such approximations work out in practice, i.e., for realistic grammars on realistic corpora. As far as we know, such an empirical evaluation has not been performed before. Therefore, we propose to set up experiments for a number of different approximation techniques, in order to compare their effects.
Such an empirical, quantitative evaluation will start by selecting an appropriate source grammar and an appropriate corpus. Hopefully some of the subcorpora developed in the corpus initiative `Corpus Gesproken Nederlands' will be useful for this purpose (cf. section 2.4.4). Implementations of a number of existing approximation techniques will be required. These techniques can then be applied to the grammar, and the resulting finite-state automata can be compared. Such a comparison should take into account not only the loss of accuracy that results from the approximation, but also computational issues, such as the size of the resulting automaton, the difficulty of producing the automaton, and other properties of the automaton.
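The accuracy part of such a comparison could be organised along the following lines (a hypothetical harness of our own; the metric names are invented): given recognisers for the original grammar and for its approximation, count their disagreements over a corpus sample:

```python
# Hypothetical evaluation harness: original and approx are predicates on
# sentences; corpus is a list of sentences. For a superset approximation
# only overgeneration can occur, for a subset only undergeneration.

def compare_on_corpus(original, approx, corpus):
    overgeneration = sum(1 for s in corpus
                         if approx(s) and not original(s))
    undergeneration = sum(1 for s in corpus
                          if original(s) and not approx(s))
    agreement = sum(1 for s in corpus if approx(s) == original(s))
    return {'overgeneration': overgeneration,
            'undergeneration': undergeneration,
            'agreement': agreement}
```

The size and construction cost of the automaton would have to be measured separately; this harness only addresses the loss of accuracy.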
Most finite-state approximation techniques assume that the input grammar is a context-free grammar. Preliminary experiments with feature-based grammars are described in [6]. Furthermore, certain approximation techniques generalise to feature-based grammars in a natural way [51]. In other cases the assumption is that a constraint-based grammar is first approximated by a context-free grammar, which can then be approximated by a finite-state grammar in turn.
These methods have in common that distinct complex categories in the input grammar are mapped to the same category in the approximation. This leads to an approximation which allows sentences not allowed by the input grammar (the superset case). One important research question in the proposed project is how approximations of constraint-based grammars can be defined which yield an interesting subset of the sentences allowed by the input grammar.
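The source of the superset effect can be made concrete in a small sketch (our own illustration, not a cited method): stripping the feature constraints from a rule leaves only the atomic backbone, so that agreeing and non-agreeing daughters become indistinguishable:

```python
# Illustrative sketch: a complex category is a pair of a major category
# and a feature dictionary; the approximation keeps only the former.

def backbone(category):
    """Map a (major-category, feature-dict) pair to its major category."""
    major, _features = category
    return major

def approximate_rule(rule):
    """Strip the features from a rule (lhs, [rhs categories])."""
    lhs, rhs = rule
    return backbone(lhs), [backbone(c) for c in rhs]
```

An agreement rule such as s → np[num=sg] vp[num=sg] is thus collapsed to s → np vp, which also licenses the disagreeing combination np[num=sg] vp[num=pl]; hence the approximation overgenerates.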
At first sight, the concept of finite-state approximation of e.g. context-free grammars seems to ignore an important aspect of such grammars. Context-free grammars not only characterise the grammatical sentences of a language, but also assign structural descriptions to such sentences. These structural descriptions are crucial for the interpretation of sentences.
Different approaches can be identified with respect to this issue. It has been suggested [85] that the original grammar should still be used for the purpose of interpretation. In such a setup, the finite-state approximation is used as a quick filter to rule out (possibly many) impossible analyses. For the few analyses that survive the filter, the original grammar is then applied to obtain interpretations.
In [78] this approach is extended by letting the finite-state approximation produce a finite-state transducer rather than a recogniser. The transducer produces, in linear time, a table which can then be used (in a second phase) to recover all parse trees with respect to the original grammar.
The drawback of these methods is that the original grammar remains crucial for language understanding. We envision a more elegant approach in which a finite-state transducer is constructed as the finite-state approximation; this transducer should directly produce structures which can be used for language interpretation, so that there is no need to refer to the input grammar in the semantic interpretation component. Some interesting ideas are presented in [62].
Finite-state approximation is a technique which has so far been applied exclusively to language understanding. An interesting question is whether it is fruitful to regard language generation as a finite-state process as well. If it is a fundamental property of human language that its processing is limited by finite-state means, then a natural question to ask is whether language production can be characterised by limitations of a similar nature.
Deterministic finite-state automata (DFA) constitute only one formalisation of regular languages. The advantages of this particular formalisation are that it can be implemented extremely efficiently (in time linear in the length of the input, independent of the size of the automaton), and that for each given deterministic finite-state automaton an equivalent but minimal automaton can be computed automatically. The drawback of deterministic finite-state automata as a formalisation of regular languages is that such DFA are in general not very compact. In fact, this is the reason that many implementations of large DFA choose a somewhat less efficient but much more compact format [30,61].
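The linear-time behaviour follows from the fact that DFA recognition requires exactly one table lookup per input symbol, however large the automaton is. A minimal sketch, with a hypothetical transition table for the language (ab)*:

```python
# DFA recognition: one lookup per symbol, so the running time is linear
# in the input length and independent of the number of states.

def dfa_accepts(delta, start, finals, word):
    """delta: dict (state, symbol) -> state; a missing entry rejects."""
    state = start
    for symbol in word:
        state = delta.get((state, symbol))
        if state is None:          # implicit dead state: reject at once
            return False
    return state in finals

# Example: a two-state DFA for the language (ab)*, with start/final state 0.
delta = {(0, 'a'): 1, (1, 'b'): 0}
```

Here dfa_accepts(delta, 0, {0}, "abab") succeeds while "aba" is rejected; note that the table has an entry for every reachable (state, symbol) pair that should not reject, which is what makes large DFA space-hungry.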
Many other formalisations of regular languages exist. For instance, regular expressions are another very popular device for presenting regular languages. Such regular expressions can be implemented fairly efficiently as well (for instance, the regular expression matching in Perl is done without compilation to DFA), but there is no concept of a `minimal' regular expression.
In [5] finite-state automata are augmented with a finite number of registers. This allows for a much more compact representation. Moreover, efficiency of implementation is hardly affected. The drawback, however, is that little is known about the construction of a minimal representative for such augmented finite-state devices.
In [61] binary automata are introduced as a representation for regular languages. Again, very compact and fairly efficient representations are possible. The authors discuss a few heuristics to minimise such binary automata; no general algorithms are known, however.
An important question, therefore, is whether there exists a representation of finite automata which combines efficiency and compactness with the possibility of constructing minimal representatives automatically.