In natural language processing, it is often more natural to think of symbols in terms of predicates or classes. The linguistic principle of Community dictates that similar segments behave similarly. Predicates are a means to express this similarity. In computational phonology it is thus more natural to talk about vowels and consonants rather than enumerate each of the phonemes in these classes. Phonological generalizations typically refer to predicates such as fricative, nasal, voiced and very seldomly to individual phonemes directly. Therefore, in finite state computational phonology, some have proposed finite state automata in which transitions are associated with sets of symbols [35,3,8,36].
As a further piece of motivation for the introduction of predicates,
consider the unknown symbol regular expression operator,
typically written ?, as it is available in some regular expression
compilers [18,34]. An obvious implementation
will expand the ? operator into a set of transitions for each of
the symbols in the alphabet . In our proposal, the ?
operator will be expanded into a single transition with an associated
predicate which is true for all symbols; this has the advantage that
need not be explicitly defined. As a consequence, there is no
need to assume that the alphabet is finite. Such considerations
become important for applications with large alphabets, such as the
Unicode alphabet. Even larger alphabets may surface in natural
language processing applications in which the symbols are words.
Typical electronic dictionaries have at least 200K words and even this
large size alphabet is not enough to handle unrestricted texts.
Realistically, robust syntactic parsing requires an infinite
alphabet.1
Below, we define predicate augmented finite state automata more precisely; for now it suffices to assume that such automata are similar to classical finite state automata, except that we have predicates instead of symbols.