
Previous Work in Finite-state Language Processing

Finite-state techniques have been used extensively in language processing. For instance, it has been shown that phonological models consisting of sets of context-sensitive rewrite rules are, given some reasonable assumptions, finite-state [50,59,54]. Recently, a similar result has been obtained for more modern phonological models expressed in Optimality Theory [57].

Syntactic research has almost always followed Chomsky's arguments against finite-state grammars as a competence model, but there are some exceptions [62,43].

In linguistic applications the use of finite-state techniques is very widespread; the following overview lists only a small number of recent publications:

tokenization
[40,103] Tokenization is concerned with relatively superficial problems such as the determination of sentence boundaries and word boundaries in a given text (problems which are nevertheless challenging in languages whose writing systems mark such boundaries poorly, e.g., Japanese). The techniques are comparable to the lexical analysis phase in the compilation of programming languages (as provided by the UNIX tool lex): regular expressions are defined for words, sentences, and possibly some further special categories such as dates and numbers. A small sketch of this technique is given after this list.
lexicography
[74,30] Recent interest has focused on compact implementations of dictionaries using finite-state minimisation techniques; such techniques yield smaller results than the traditional trie data structure (a sketch of such a construction also follows this list). Other applications of finite-state techniques in lexicography concern the representation of the subcategorisation properties of words by means of finite-state patterns [42].
spell checking
Spell checking is naturally described as an application of finite-state automata [80,30]: checking a word amounts to a membership test in a dictionary automaton, as in the sketch below.
part-of-speech tagging
Part-of-speech tagging can be implemented by means of finite-state transducers. [95] describe a method to compile the rule set produced by Eric Brill's rule-based tagger [15,16] into a deterministic finite-state transducer. As a result, an extremely fast tagger is obtained.
speech recognition
Language modelling for speech recognition is almost always performed with finite-state models, in particular hidden Markov models [49]; a sketch of decoding with such a model follows this list. Further applications of finite-state techniques in speech recognition are described in [83]; [75].
phonology and morphology
The phonology and morphology of natural languages are often implemented by means of finite-state devices; for instance, the popular `two-level' system of [59] is finite-state. As already mentioned above, [54] show that the context-sensitive rule systems traditionally used in phonology are finite-state in expressive power as well. Further improvements in the compilation of such sets of context-sensitive rules into finite-state automata are described in [56] and [76].
syntax
A number of proposals exist for the treatment of natural language syntax by means of finite-state devices. The best-known example is probably the Constraint Grammar approach [120,55,119,107]. Other approaches can be found, for instance, in [53]; [1]; [23]; [39]; [94]. Finite-state approximation techniques are described below.
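
As promised above, here is a minimal sketch of the lex-style approach to tokenization: per-category regular expressions are combined into a single master pattern, where earlier alternatives take precedence. The token categories and patterns are illustrative assumptions, not a complete tokenizer.

    import re

    # Illustrative token categories; specific patterns (DATE) must precede
    # more general ones (NUMBER), since earlier alternatives win.
    TOKEN_SPEC = [
        ("DATE",   r"\d{1,2}/\d{1,2}/\d{2,4}"),   # e.g. 10/07/2000
        ("NUMBER", r"\d+(?:\.\d+)?"),
        ("WORD",   r"[A-Za-z]+(?:'[A-Za-z]+)?"),  # includes forms like "isn't"
        ("EOS",    r"[.!?]"),                     # sentence-boundary candidates
        ("PUNCT",  r"[,;:()]"),
        ("SKIP",   r"\s+"),                       # whitespace is discarded
    ]
    MASTER = re.compile("|".join("(?P<%s>%s)" % (n, p) for n, p in TOKEN_SPEC))

    def tokenize(text):
        """Yield (category, token) pairs for the input text."""
        for match in MASTER.finditer(text):
            if match.lastgroup != "SKIP":
                yield match.lastgroup, match.group()

    print(list(tokenize("On 10/07/2000 prices rose 3.5 percent.")))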
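
The dictionary techniques mentioned under lexicography and spell checking can be sketched as follows: the minimal acyclic automaton for a sorted list of distinct words is built incrementally, in the spirit of the algorithm of Daciuk and colleagues, by merging equivalent states that a trie would keep separate; spell checking is then a simple membership test. The word list is an invented example, and the code is a simplified sketch rather than the algorithm as published.

    class State:
        def __init__(self):
            self.edges = {}     # symbol -> State
            self.final = False
        def signature(self):
            # two states are interchangeable iff they agree on finality
            # and on their outgoing edges
            return (self.final,
                    tuple(sorted((c, id(t)) for c, t in self.edges.items())))

    def build_dawg(words):
        """Minimal acyclic automaton for a sorted list of distinct words."""
        register, root, previous, path = {}, State(), "", []
        def minimise(down_to):
            # merge states on the path of the previous word into the register
            while len(path) > down_to:
                parent, symbol, child = path.pop()
                sig = child.signature()
                if sig in register:
                    parent.edges[symbol] = register[sig]  # reuse equivalent state
                else:
                    register[sig] = child
        for word in words:
            common = 0      # length of the common prefix with the previous word
            while (common < min(len(word), len(previous))
                   and word[common] == previous[common]):
                common += 1
            minimise(common)
            node = path[-1][2] if path else root
            for symbol in word[common:]:
                new = State()
                node.edges[symbol] = new
                path.append((node, symbol, new))
                node = new
            node.final = True
            previous = word
        minimise(0)
        return root

    def check(root, word):
        """Spell check: is the word in the dictionary automaton?"""
        node = root
        for symbol in word:
            node = node.edges.get(symbol)
            if node is None:
                return False
        return node.final

    lexicon = build_dawg(sorted({"car", "card", "care", "cars", "ward", "wards"}))
    print(check(lexicon, "care"), check(lexicon, "cart"))   # True False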
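
For the language models used in speech recognition, the sketch below shows Viterbi decoding over a hidden Markov model, here dressed up as a toy part-of-speech model; the states and all probabilities are invented for illustration.

    import math

    # A toy hidden Markov model; every number below is an assumption.
    STATES = ["DET", "N", "V"]
    START  = {"DET": 0.6, "N": 0.3, "V": 0.1}
    TRANS  = {"DET": {"DET": 0.05, "N": 0.9, "V": 0.05},
              "N":   {"DET": 0.1,  "N": 0.3, "V": 0.6},
              "V":   {"DET": 0.5,  "N": 0.4, "V": 0.1}}
    EMIT   = {"DET": {"the": 0.9, "a": 0.1},
              "N":   {"dog": 0.5, "cat": 0.3, "walk": 0.2},
              "V":   {"walk": 0.6, "barks": 0.4}}

    def viterbi(words):
        """Most probable state sequence for the observed words (log domain)."""
        def lp(p):
            return math.log(p) if p > 0 else float("-inf")
        # delta[s]: best log-probability of any path ending in state s
        delta = {s: lp(START[s]) + lp(EMIT[s].get(words[0], 0)) for s in STATES}
        back = []
        for w in words[1:]:
            prev, delta, pointers = delta, {}, {}
            for s in STATES:
                best = max(STATES, key=lambda r: prev[r] + lp(TRANS[r][s]))
                delta[s] = prev[best] + lp(TRANS[best][s]) + lp(EMIT[s].get(w, 0))
                pointers[s] = best
            back.append(pointers)
        # follow the back-pointers from the best final state
        path = [max(STATES, key=lambda s: delta[s])]
        for pointers in reversed(back):
            path.append(pointers[path[-1]])
        return list(reversed(path))

    print(viterbi(["the", "dog", "walk"]))   # ['DET', 'N', 'V']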

If we restrict our attention to finite-state approaches to natural language syntax in which a distinction is maintained between a competence grammar (expressed in a formalism with at least context-free power) and language performance, then two different approaches can be identified for realising the link between competence and performance.

Firstly, some have proposed parsing algorithms which interpret such a grammar, where the parser is associated with a fixed maximal amount of memory. Publications such as [52]; [88]; [93] are examples of this interpreter approach. [93] shows how a left-corner parser [97], augmented with an operation of ``eager'' composition and formalised as a push-down automaton, leads to a parser which uses its stack only for center-embedding constructions. Left-recursive and right-recursive structures, on the other hand, can be processed without exploiting the stack. In contrast, a purely bottom-up or top-down parser formalised in the same fashion must use its stack for right-recursive (respectively left-recursive) structures.
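
To make the interpreter approach concrete, the sketch below runs a left-corner recogniser with a hard bound on embedding depth, i.e. with a fixed maximal amount of memory; the grammar and the bound are invented for illustration. Depth is spent only when descending into a non-leftmost daughter, so arbitrarily long left-recursive chains stay within a constant depth; unlike the push-down automaton of [93], however, this naive sketch has no eager composition, so right-recursive nesting does consume depth here.

    # A toy grammar; terminals are lower case, nonterminals upper case.
    RULES = [("S",  ["NP", "VP"]),
             ("NP", ["det", "n"]),
             ("NP", ["NP", "pp"]),    # left recursion
             ("VP", ["v", "NP"])]

    def recognise(goal, words, bound=3):
        """Left-corner recognition run with a fixed bound on embedding depth."""

        def parse(cat, i, depth):
            # recognise cat at position i: shift the next word and grow
            # it into cat by left-corner steps
            if depth > bound or i >= len(words):
                return
            yield from lc(words[i], cat, i + 1, depth)

        def lc(found, cat, i, depth):
            # connect the already recognised category `found' up to `cat'
            if found == cat:
                yield i
            for lhs, rhs in RULES:
                if rhs[0] == found:               # `found' is the left corner
                    for j in rest(rhs[1:], i, depth):
                        yield from lc(lhs, cat, j, depth)

        def rest(symbols, i, depth):
            # recognise the remaining daughters of a rule, left to right
            if not symbols:
                yield i
            else:
                for j in parse(symbols[0], i, depth + 1):
                    yield from rest(symbols[1:], j, depth)

        return any(j == len(words) for j in parse(goal, 0, 0))

    # arbitrarily deep left recursion is recognised within the fixed bound
    print(recognise("S", "det n pp pp pp pp v det n".split()))   # True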

In contrast, in finite-state approximation a finite-state grammar is explicitly constructed, often on the basis of a specific parsing algorithm restricted by some mechanism that limits its recognising power to finite-state languages. This approach, exemplified in [84]; [96]; [85]; [77] and [51], could then be called a compilation approach.
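
The compilation approach can be sketched in the same toy setting: simulate a top-down push-down automaton for the grammar, cut its stack off at a fixed bound, and let the finitely many bounded stacks serve as the states of an explicit finite-state automaton. The result is a subset approximation in which sentences with too much center-embedding are rejected; the construction is a simplified stand-in for the published methods, and the grammar and bound are again invented.

    from collections import deque

    # A toy grammar with center-embedding via relative clauses.
    RULES = [("S",    ["NP", "VP"]),
             ("NP",   ["det", "n"]),
             ("NP",   ["det", "n", "SBAR"]),   # relative clause
             ("SBAR", ["NP", "v"]),
             ("VP",   ["v"])]

    def closure(stack, bound):
        """All stacks reachable by expanding top nonterminals; expansions
        exceeding the bound are discarded (hence a subset approximation)."""
        seen, agenda = set(), [stack]
        while agenda:
            st = agenda.pop()
            if st in seen:
                continue
            seen.add(st)
            if st and st[0].isupper():
                for lhs, rhs in RULES:
                    if lhs == st[0] and len(rhs) + len(st) - 1 <= bound:
                        agenda.append(tuple(rhs) + st[1:])
        return seen

    def compile_fsa(start="S", bound=6):
        """An explicit automaton whose states are the bounded stacks."""
        initial = (start,)
        delta, seen, agenda = {}, {initial}, deque([initial])
        while agenda:
            st = agenda.popleft()
            for c in closure(st, bound):
                if c and c[0].islower():          # terminal on top: a transition
                    nxt = c[1:]
                    delta.setdefault((st, c[0]), set()).add(nxt)
                    if nxt not in seen:
                        seen.add(nxt)
                        agenda.append(nxt)
        return initial, delta

    def accepts(fsa, words):
        initial, delta = fsa
        current = {initial}
        for w in words:
            current = {n for st in current for n in delta.get((st, w), ())}
        return () in current                      # the empty stack accepts

    fsa = compile_fsa()
    print(accepts(fsa, "det n det n v v".split()))            # one embedding: True
    print(accepts(fsa, ("det n " * 5 + "v " * 5).split()))    # too deep: False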

Chomsky [25, footnote 10], discussing finite-state devices, quite rightly warns us that

The assumption that sentences are produced or recognised by a device of this sort tells us almost nothing about the method of processing.

In this context, it is worth mentioning the result by Stearns [106] that it is possible to determine whether a deterministic push-down machine recognises a regular language. Furthermore, Stearns shows that if the push-down machine recognises a regular language, then the number of states of a finite-state automaton recognising the same language may be bounded by an expression of the order $t^{q^{q^q}}$, where q is the number of states and t the size of the alphabet of the push-down automaton. Stearns concludes that

If this bound cannot be improved significantly, then it would appear profitable in some cases to maintain a pushdown design for a recogniser even if a finite-state design is possible.

In light of these observations, it may be difficult to choose between the interpreter and compiler approaches, or still other approaches that may be designed (for instance, finite-state automata augmented with a finite number of registers [62,5]). Minimised finite-state automata are an appropriate normal form for regular languages, but they are only `minimal' in the sense that there is no smaller automaton in that representation. For example, suppose we have a very complicated language X, and we want an automaton for the language $\mbox{union}(aXa,bXb)$. The minimal automaton for that language will essentially need to contain two copies of the automaton for X, as in the first automaton of figure 2.2. However, with a single global memory cell containing the value of the first letter read, only one copy of X is needed, as in the second automaton of figure 2.2. So even if the two representations are formally equivalent, their sizes may vary dramatically.

Figure 2.2: Two representations of the language $\mbox{union}(aXa,bXb)$, where X is some given language and M(X) is its corresponding automaton. In the classical representation, two copies of M(X) are required. If a register is used to keep track of the first input symbol, then a single copy of M(X) suffices. Here the operation set(x) sets the register, and get(x) checks its value.
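To make the size argument concrete, the sketch below assumes, purely for illustration, that X is the set of strings over {a,b} containing an even number of a's, standing in for some much more complicated language. The register automaton runs a single copy of (a recogniser for) M(X), whereas the classical automaton must pair every state of M(X) with the remembered first symbol, i.e. keep two copies.

    # a two-state automaton M(X) for the sample language X
    X_START = "even"
    X_FINAL = {"even"}
    X_DELTA = {("even", "a"): "odd",  ("odd", "a"): "even",
               ("even", "b"): "even", ("odd", "b"): "odd"}

    def accepts_with_register(word):
        """Recognise union(aXa, bXb) with one copy of M(X) plus a register."""
        if len(word) < 2 or word[0] not in "ab":
            return False
        register = word[0]                  # set(x): remember the first symbol
        state = X_START
        for symbol in word[1:-1]:           # run the single copy of M(X)
            state = X_DELTA.get((state, symbol))
            if state is None:
                return False
        # get(x): the last symbol must match the register
        return word[-1] == register and state in X_FINAL

    print(accepts_with_register("abba"), accepts_with_register("bab"))  # True False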

As a starting point, the minimal deterministic finite-state automata constructed in the compiler approach may provide a normal form: a representation that we can use to compare different approaches.

