

Research Questions in Finite-state Language Processing

Finite-State Calculus

In order to experiment with finite-state techniques, it is essential to have an implementation of the finite-state calculus available, i.e., implementations of all the important finite-state operations such as union, concatenation, Kleene closure, difference, intersection, complementation, etc. An operation such as union takes two finite-state languages $l_1$ and $l_2$ (expressed as regular expressions or as finite-state automata) and produces the language containing the sentences of both, defined as $l_1 \cup l_2 = \{ s \mid s \in l_1 \vee s \in l_2 \}$. Such regular operations are crucial building blocks of approximation algorithms and other finite-state language processing algorithms.
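
To make the idea concrete, the following Python sketch shows one of these operations, union, computed on two deterministic automata via the standard product construction; the toy automata and the dictionary-based representation are illustrative assumptions, not the implementation of any of the toolkits discussed below.

```python
# A minimal sketch of the union operation via the product construction:
# the product automaton accepts whenever either component accepts.
# The two toy automata below are invented examples.

def product_union(d1, d2):
    """Each automaton is (delta, start, finals); delta maps (state, symbol) -> state."""
    (t1, s1, f1), (t2, s2, f2) = d1, d2
    symbols = {a for (_, a) in t1} | {a for (_, a) in t2}
    delta, agenda, seen = {}, [(s1, s2)], {(s1, s2)}
    while agenda:
        p, q = agenda.pop()
        for a in symbols:
            nxt = (t1.get((p, a)), t2.get((q, a)))   # None acts as a sink component
            delta[((p, q), a)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                agenda.append(nxt)
    finals = {(p, q) for (p, q) in seen if p in f1 or q in f2}
    return delta, (s1, s2), finals

def accepts(dfa, word):
    delta, state, finals = dfa
    for a in word:
        state = delta.get((state, a))                # missing transition: reject
        if state is None:
            return False
    return state in finals

# d1: even number of a's; d2: words ending in b.
d1 = ({(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}, 0, {0})
d2 = ({(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}, 0, {1})
u = product_union(d1, d2)
print(accepts(u, "ab"))   # True  (ends in b)
print(accepts(u, "a"))    # False (odd number of a's and does not end in b)
```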

Furthermore, such an implementation should provide tools for operations such as determinisation and minimisation of automata. A number of implementations exist [56,58,37,76]; however, their sources are not available. Since we expect that fruitful research in this area will lead to extensions and/or alternative implementations, it is important to have an implementation of the finite-state calculus that can be easily extended.
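
As an illustration of what such a tool does, the following sketch implements the classical subset construction that underlies determinisation; the toy NFA (which has no epsilon moves, for brevity) is an invented example.

```python
# Subset construction: each state of the resulting DFA is the set of NFA
# states reachable on the same input.  The toy NFA accepts strings over
# {a, b} ending in "ab".

from collections import deque

NFA = {                       # (state, symbol) -> set of successor states
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
ALPHABET, START, FINALS = {"a", "b"}, 0, {2}

def determinise():
    start = frozenset({START})
    delta, seen, agenda = {}, {start}, deque([start])
    while agenda:
        subset = agenda.popleft()
        for sym in ALPHABET:
            target = frozenset(q for s in subset for q in NFA.get((s, sym), set()))
            delta[(subset, sym)] = target
            if target not in seen:
                seen.add(target)
                agenda.append(target)
    finals = {s for s in seen if s & FINALS}
    return delta, start, finals

delta, start, finals = determinise()
print(len({s for (s, _) in delta}), "deterministic states")   # 3
```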

Therefore it is proposed that the FSA Utilities toolkit [111,112] is adopted, and actively maintained and developed as part of this project. The toolkit consists of a collection of utilities to manipulate regular expressions, finite-state automata and finite-state transducers. Manipulations include construction of automata from regular expressions, determinisation (both for finite-state acceptors and finite-state transducers), minimisation, composition, complementation, intersection, Kleene closure, etc. Furthermore, various visualisation tools are available to browse finite-state automata. An illustration of the toolkit is given in figure 2.3. Some of the functionality of the toolkit is available through the World Wide Web. A first demonstration version was developed by the University of Karlsruhe (http://i12www.ira.uka.de/Visualisierung.endlicher.Automaten/). Peter Kleiweg has developed a demonstration version of some of the regular expression features of the toolkit (http://www.let.rug.nl/~vannoord/fsademo/).

Figure: Illustration of the FSA Utilities toolkit. In this example the regular expression operator `have_ons' and its associated finite-state automaton are displayed. The `have_ons' operator is part of an implementation in finite-state calculus (based on [57]) of the syllabification analysis in Optimality Theory [87]. The `have_ons' operator represents the constraint that syllables must have onsets.

Methods of Finite-state Approximation

Many of the existing finite-state approximation techniques are defined on the basis of a specific parsing algorithm, restricted by some mechanism that limits its recognising power to finite-state languages. For instance, whereas the approach of [85] is based on an LR parser, [51] proposes a finite-state approximation technique based on a left-corner parser. The result can be seen as an explicit compilation of the parser of [93] into a finite-state automaton, obtained by imposing a maximum depth on the stack. An alternative approach is described in [36].
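
The following Python sketch illustrates the general idea of such bounded-stack compilations; it is not the algorithm of [85], [51] or [93], and the toy grammar, the depth bound and the truncation strategy are our own illustrative assumptions. Cutting off the prediction stack leaves only finitely many configurations, and these configurations become the states of an (over-generating) automaton.

```python
# A top-down recogniser is simulated over all reachable prediction stacks;
# stacks are cut off at MAX_DEPTH, so the set of configurations is finite.
# Truncation drops predicted material, so the result is a superset
# approximation of the original context-free language.

from collections import deque

RULES = {"S": [("a", "S", "b"), ("c",)]}   # S -> a S b | c  (not regular)
TERMINALS = {"a", "b", "c"}
MAX_DEPTH = 3                              # larger bound: tighter but bigger automaton

def build_approximation():
    """Return (states, transitions, start, finals) of an approximating NFA."""
    start = ("S",)                         # the initial prediction stack
    states, trans, finals = {start}, set(), set()
    agenda = deque([start])
    while agenda:
        stack = agenda.popleft()
        if not stack:                      # empty stack: accepting state
            finals.add(stack)
            continue
        top, rest = stack[0], stack[1:]
        if top in TERMINALS:               # scan the terminal on top of the stack
            successors = [(top, rest)]
        else:                              # expand the nonterminal (epsilon moves)
            successors = [("", (rhs + rest)[:MAX_DEPTH]) for rhs in RULES[top]]
        for label, nxt in successors:
            trans.add((stack, label, nxt))
            if nxt not in states:
                states.add(nxt)
                agenda.append(nxt)
    return states, trans, start, finals

states, trans, start, finals = build_approximation()
print(len(states), "states,", len(trans), "transitions")
```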

As far as we know, the finite-state approximation approaches mentioned above have not yet been applied in practice; on the other hand, the finite-state parsing approaches known as chunking have already been applied successfully. In these latter approaches, exemplified in publications such as [1], [23], [39] and [94], a grammar is defined as a sequence of levels (or stages). Each level consists of a number of `grammar rules' which apply to a given string. The result of a level is then passed on to the next level. Lower levels are used to create phrases; later levels combine these phrases into more extensive analyses. Since rules do not apply recursively within a level, and there is only a finite number of levels, the resulting cascade is finite-state.
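
The following sketch shows a toy two-level cascade of this kind, assuming the input has already been reduced to a string of part-of-speech tags; the tag set and the rules are invented and do not come from the cited systems.

```python
# A two-level chunking cascade: each level rewrites the output of the
# previous one, and rules never apply recursively within a level.

import re

LEVELS = [
    # level 1: group determiner (adjective)* noun sequences into NP chunks
    [(re.compile(r"DET( ADJ)* NOUN"), "NP")],
    # level 2: group preposition + NP sequences into PP chunks
    [(re.compile(r"PREP NP"), "PP")],
]

def chunk(tags):
    """Apply each level once, feeding its output to the next level."""
    for level in LEVELS:
        for pattern, label in level:
            tags = pattern.sub(label, tags)
    return tags

print(chunk("DET ADJ NOUN VERB PREP DET NOUN"))   # -> "NP VERB PP"
```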

The drawback, from our perspective, of the finite-state chunking method is that such finite-state grammar rules are stipulated directly, rather than derived from a more powerful grammatical model. We will therefore investigate whether finite-state approximations can be obtained by automatically dividing the rules of a given grammar over such a finite-state cascade that approximates the original grammar. To this end, static grammar analysis techniques will be employed, in combination with certain limits on recursive rule application.
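
One simple instance of such static analysis is the computation of the recursive nonterminals of a grammar, since only the recursive parts of the grammar require a cut-off when its rules are distributed over a finite cascade; the sketch below, with an invented toy grammar, illustrates this.

```python
# Static grammar analysis: which nonterminals can (directly or indirectly)
# reach themselves?  Non-recursive parts can be compiled into a single
# finite-state level; recursive parts need a depth limit.

RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["det", "N"], ["NP", "PP"]],    # NP -> NP PP is recursive
    "PP": [["p", "NP"]],
    "VP": [["v", "NP"]],
    "N":  [["n"]],
}

def recursive_nonterminals(rules):
    # direct[a] = nonterminals occurring in some right-hand side of a
    direct = {a: {b for rhs in rhss for b in rhs if b in rules}
              for a, rhss in rules.items()}
    reach = {a: set(bs) for a, bs in direct.items()}
    changed = True
    while changed:                          # transitive closure of 'direct'
        changed = False
        for a in reach:
            for b in list(reach[a]):
                if not direct[b] <= reach[a]:
                    reach[a] |= direct[b]
                    changed = True
    return {a for a in reach if a in reach[a]}

print(recursive_nonterminals(RULES))        # NP and PP are recursive; S, VP, N are not
```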

Qualitative Evaluation of Finite-state Approximation

Finite-state approximation techniques can be evaluated by comparing the language of the underlying grammar with the language defined by the approximation. Some techniques always produce a superset of the language of the input grammar (such techniques are useful in applications where the finite-state approximation is used as a pre-processor); other techniques always yield a subset of the language of the input grammar (for instance in approaches which are psycholinguistically motivated). For some techniques neither characterisation holds.

Furthermore, techniques can be characterised by evaluating their results for certain sub-classes of grammars. For instance, an interesting result would be if a certain approximation technique were exact, i.e., always found an equivalent finite-state automaton, for cases in which the input grammar describes a regular language. However, it turns out that this is impossible: there is no general algorithm that transforms a given context-free grammar generating a regular language into an equivalent finite-state automaton [108].

Other sub-classes are, for instance, the left-linear and right-linear grammars. The technique described in [85] is exact in these cases. The class of left-linear and right-linear grammars is generalised somewhat in [77] to the strongly regular grammars; that method is exact for such grammars.
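
The right-linear case is exact because of a direct textbook construction: every rule A -> a B becomes a transition from state A to state B over a, and every rule A -> a leads to a single final state. The following sketch, with an invented toy grammar generating a*bb, illustrates this.

```python
# Exact conversion of a (simple) right-linear grammar into an NFA.
# Rules of the form A -> a B and A -> a only, for brevity.

RULES = [               # (lhs, terminal, successor nonterminal or None)
    ("S", "a", "S"),
    ("S", "b", "T"),
    ("T", "b", None),
]
START, FINAL = "S", "#final"

def to_nfa(rules):
    return {(lhs, t, rhs if rhs is not None else FINAL) for lhs, t, rhs in rules}

def accepts(transitions, word):
    states = {START}
    for sym in word:
        states = {q2 for (q1, s, q2) in transitions if q1 in states and s == sym}
    return FINAL in states

nfa = to_nfa(RULES)
print(accepts(nfa, "aabb"))   # True
print(accepts(nfa, "ab"))     # False
```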

Another interesting question is how a given finite-state approximation compares with the psycholinguistic observations discussed in the previous chapter. For instance, what kind of center embedding is allowed in the approximation? How does this compare with human abilities? Similar questions can be asked with respect to other types of construction that are difficult for humans, such as crossing dependencies. Obviously, constructions that humans have no difficulty with should be treated without problems.

Experimental Evaluation of Finite-state Approximation

Apart from a qualitative evaluation of finite-state approximation techniques, it is worth evaluating how such approximations work out in practice, i.e. for realistic grammars on realistic corpora. As far as we know, such an empirical evaluation has not been performed before. Therefore, we propose to set up experiments for a number of different approximation techniques, in order to compare their effects.

Such an empirical, quantitative evaluation will start out by selecting an appropriate source grammar and an appropriate corpus. Hopefully some of the sub-corpora developed in the corpus initiative `Corpus Gesproken Nederlands' will be useful for this purpose (cf. section 2.4.4). Implementations of a number of existing approximation techniques will be required. These techniques can then be applied to the grammar, and the resulting finite-state automata can be compared. Such a comparison should take into account not only the loss of accuracy that results from the approximation, but also computational issues, such as the size of the resulting automaton, the difficulty of constructing the automaton, and other properties of the automaton.

Approximation of Constraint-based Grammars

Most finite-state approximation techniques assume that the input grammar is a context-free grammar. Preliminary experiments with feature-based grammars are described in [6]. Furthermore, certain approximation techniques generalise to feature-based grammars in a natural way [51]. In other cases the assumption is that a constraint-based grammar is first approximated by a context-free grammar, which can then be approximated by a finite-state grammar in turn.

These methods have in common that distinct complex categories in the input grammar are mapped to the same category in the approximation. This leads to an approximation which allows sentences not allowed by the input grammar (the superset case). One important research question in the proposed project is how approximations of constraint-based grammars can be defined which yield an interesting subset of the sentences allowed by the input grammar.
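
The following sketch illustrates the collapsing step in its simplest form: only an assumed `cat' feature of each complex category is retained, so that, for instance, number agreement encoded in the original rule is no longer enforced. The feature names and the example rule are invented.

```python
# Collapsing complex feature-based categories onto atomic symbols: distinct
# categories of the input grammar become indistinguishable, so the
# approximation accepts a superset of the original language.

def collapse(category):
    """Map a feature structure (a dict) to an atomic category symbol."""
    return category["cat"]               # agreement, subcategorisation, etc. are discarded

rule_lhs = {"cat": "s"}
rule_rhs = [{"cat": "np", "num": "sg"}, {"cat": "vp", "num": "sg"}]

print(collapse(rule_lhs), "->", " ".join(collapse(c) for c in rule_rhs))
# s -> np vp   (the number agreement of the original rule is no longer enforced)
```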

Finite-state Approximation and Interpretation

At first sight, the concept of finite-state approximation of e.g. context-free grammars seems to ignore an important aspect of such grammars. Context-free grammars not only characterise the grammatical sentences of a language, but also assign structural descriptions to such sentences. These structural descriptions are crucial for the interpretation of sentences.

Different approaches can be identified with respect to this issue. It has been suggested [85] that the original grammar should still be used for the purpose of interpretation. In such a set-up the finite-state approximation is used as a quick filter to rule out (possibly many) impossible analyses. For the few analyses that pass the filter, the original grammar is then applied to obtain interpretations.
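
A minimal sketch of this filtering set-up is given below; the regular expression standing in for the approximation and the placeholder full_parse function are invented for illustration and do not reflect the actual method of [85].

```python
# The finite-state approximation acts as a cheap filter; only strings that
# pass the filter are handed to the expensive parser with the original grammar.

import re

APPROX = re.compile(r"^NP( PP)* VP$")    # toy approximation over part-of-speech tags

def full_parse(tag_string):
    """Placeholder for the full parser using the original grammar."""
    return "parse(" + tag_string + ")"

def analyse(tag_string):
    if not APPROX.match(tag_string):     # quick rejection by the approximation
        return None
    return full_parse(tag_string)        # interpretation via the original grammar

print(analyse("NP PP VP"))   # passed on to the full parser
print(analyse("VP NP"))      # rejected by the finite-state filter: None
```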

In [78] this approach is extended by letting the finite-state approximation take the form of a finite-state transducer rather than a recogniser. The transducer produces, in linear time, a table which can then be used (in a second phase) to recover all parse trees with respect to the original grammar.

The drawback of these methods is that the original grammar is still crucial for language understanding. We envision a more elegant approach in which a finite-state transducer is constructed as a finite-state approximation; this transducer should directly produce structures which can be used for language interpretation. There should be no need to refer to the input grammar in the semantic interpretation component. Some interesting ideas are presented in [62].

Finite-state Approximation in Language Generation

Finite-state approximation is a technique which so far has been applied exclusively to language understanding. An interesting question is whether it is fruitful to regard language generation as a finite-state process as well. If it is a fundamental property of human language that its processing is limited by finite-state means, then a natural question to ask is whether language production can be characterised by limitations of a similar nature.

Compact Formats for Finite-state Approximations

Deterministic finite-state automata (DFA) constitute only one formalisation of regular languages. The advantages of this particular formalisation are that it can be implemented extremely efficiently (recognition takes time linear in the length of the input, independent of the size of the automaton), and that for each deterministic finite-state automaton an equivalent but minimal automaton can be computed automatically. The drawback of deterministic finite-state automata as a formalisation of regular languages is that such DFA are in general not very compact. In fact, this is the reason that in many implementations of large DFA a somewhat less efficient but much more compact format is chosen [30,61].
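
The linear-time behaviour is easy to see from the recognition loop itself: one table lookup per input symbol, regardless of how many states the automaton has. The following sketch, with an invented toy transition table, makes this explicit.

```python
# DFA recognition: exactly one transition-table lookup per input symbol.
# The toy table accepts strings over {a, b} that end in "ab".

DELTA = {                      # (state, symbol) -> state
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}
START, FINALS = 0, {2}

def recognise(word):
    state = START
    for sym in word:           # one lookup per symbol, whatever the size of the DFA
        state = DELTA[(state, sym)]
    return state in FINALS

print(recognise("baab"))   # True
print(recognise("aba"))    # False
```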

Many other formalisations of regular languages exist. For instance, regular expressions are another very popular device for representing regular languages. Such regular expressions can be implemented fairly efficiently as well (for instance, regular expression matching in Perl is done without compilation to a DFA), but there is no concept of a `minimal' regular expression.

In [5] finite-state automata are augmented with a finite number of registers. This allows for a much more compact representation, and efficiency of implementation is hardly affected. However, the drawback is that little is known about the construction of a minimal representative for such augmented finite-state devices.

In [61] binary automata are introduced as a representation of regular languages. Again, very compact and fairly efficient representations are possible. The authors discuss a few heuristics to minimise such binary automata; no general algorithms are known, however.

An important question, therefore, is whether there exists a representation of finite automata which combines efficiency and compactness with the possibility of constructing minimal representatives automatically.


