FSA Tutorial

Gosse Bouma
Natural Language Processing II

Using FSA

You find the program FSA on hagen in

/users1/vannoord/bin/fsa

and on tcw2 in

/home/vannoord/Linux/bin/fsa

If you still haven't done this, add the path to these bin directories to your PATH (ask your local unix guru).

Starting FSA

Start FSA with the command

fsa tkconsol=on -tk

You should see a screen with a box 'Regex' where you can type in regular expressions, and a box 'string' where you can test which strings are accepted (or not) by the regular expressions. The biggest part of the screen shows the automaton compiled on the basis of the regular expression previously typed in 'Regex'. The bottom part of the screen displays messages from the program, these are relevant for error-correction (debugging?). Basically, these messages report whether the word typed inside the box 'string' is accepted by the automaton or not.

There is a manual for FSA where the syntax of regular expressions is described as well as the (many) extra options that the system supports.

In case you don't have access to FSA, you can use the on-line demo version. Together with the demo version there is an alternative tutorial available.

Examples of regular expressions

[a*,b*] defines the language that consists of an arbitrary number of a's, followed by an arbitrary number of b's. The strings aaabb, aab, aaaaa, bbb are all recognized by this regular expression, but the strings aba, bbaa, acb, cab are not.
[a,b]* defines the language that consists of an arbitrary number of repetitions of the string ab: ab, ababab, but not abba, aabb, or bb.
{a, b} an a or a b.
? any symbol
?* any series of symbols
?- q any single symbol except q
a..z, 'A'..'Z', '0'..'9' the letter a or b or c ...or z, the letter A or B or C ... or Z, the digit 0 or 1 or 2.... or 9.
a ^ an optional a.
$ q a string containing a q (equivalent to [? *, q, ? *])
~ $ q a string that does not contain a q (equivalent (? - q )* )
$ q & $ x a string containing a q and an x (in any order).

Exercises

Introductory

Give a regular expression for `words', i.e. all strings that consist of letters only,
Give a regular expression for recognizing US ZIP Codes (i.e. MI 48824, CA 94301, MA 02101).
Give a regular expression for `numeric words', i.e. all strings that include at least one digit in addition to letters (Windows95, dvi2ps, 2K, ...).
Give a regular expression for 'complex words', i.e. all strings that include some other alphanumeric character type in addition to letters (except a blank space)(hand-made, windows3.1)
Give a regular expression that accepts email adresses (gosse@let.rug.nl) : strings made out of letters, the period ('.'), and exactly 1 '@' symbol.

N.b. Numbers and uppercase letters in regular expressions should be in single quotes: '0', 'A'. (In strings you want to check they should not be quoted.) N.b.b. Sometimes two operators need to be seperated by a space. In particular, between negation and contains: ~ $.

Advanced

Give a regular expression that recognizes only inputs satisfying the following condition:

the string contains only a's and b's and the total number of letters is even,
the string contains only a's and b's and the total number of a's is even,
the string contains only a's and b's and the number of a's is even and the number of b's is even,
the string contains only a's and b's and each a is preceded by a b.

Macro's and auxiliary files

In order to define complex regular expressions you can use macros. Macros need to be defined in a separate (Prolog) file, for example my_macros.pl:

     :- multifile macro/2.

     macro(vowel,{a,e,i,o,u,y}).
     macro(consonant, {b,c,d,f,g,h,j,k,l,m,n,p,q,r,s,t,v,w,x,z}).
     macro(syllable, [consonant *, vowel +, consonant *]).

The first line is necessary so Prolog knows there are macro definitions in this file. In Prolog, every statement ends with a period ('.'). Macro takes two arguments, a name and a regular expression. Names must start with a lower-case letter. After saving the file my_macros.pl, you can load it in going to the menu File and choosing LoadAux or Reconsult Aux. Select the corresponding filename (my_macros.pl) in the resulting box. After that you can use the macros you have just defined as if they were a regular expression. Thus, you can type 'vowel' in the RegExp line. This expression will be translated as the regular expression seen above. Macros can call other macros, as in at the definition of 'syllable' above.

Projects

Determining the Category of Words

The file words_pos.txt contains an alphabetic list of nouns, adjectives, and verbs. The file words_pos_rev.txt contains the same list, but now sorted alphabetically on the end of the word, instead of the beginning. The file suffix_stats.txt contains statistics for the most frequent 5 letter suffixes derived from a list of approx. 75K nouns, verbs, and adjectives.

Formulate a regular expression which recognizes (to some extent) adjectives. You may want to use an auxiliary (macros) file for the definition.
Use the data mentioned above for inspiration.
Choose an arbitrary document, and collect the first 10 (or even better 20 or 50) adjectives. How many of these would be accepted by your definition)?
Now collect the first 10 (20, 50, ...) verbs and nouns. How many of these are accepted by your automaton?
On the basis of this experiment, discuss how difficult it is to predict that a word is an adjective.
As an alternative for step 3 - 5, yo may use the scripts in this Makefile, to test the accuracy of your regular expression on the data in word_pos.txt. In theory, after placing the Makefile in your working directory, the command
```
make adj_results
```
should give statistics for the number of falsely unrecognized adjectives, and falsely recognized N's or V's. The files adj_X_errors (where X = A, N, or V) contain the errors themselves.

Syllable

The file monosyll.txt contains words extracted from an English word list of the form:

[consonant*, vowel+, consonant*]

Thus, these are mostly mono-syllabic words. However, the list also contains some words which are in fact bi-syllabic or whose spelling seems to deviate from the normal English spelling.

Write a regular expression which recognizes English monosyllabic words more accurately. You may approach the problem by defining macros for onset, nucleus, and coda, and defining a macro for syllable as

[onset, nucleus, coda]

Test your expression on some of the examples in the data file.
Test whether the bi-syllabic words and words with an idiosyncratic (foreign or otherwise) spelling in the list are rejected by your definition.
Alternatively, steps 2-3 can be tested on all data in monosyll.txt by placing this Makefile in your working directory, and running the command
```
make monosyll.err
```
This gives a list of all words rejected by your reg ex.
Alternatively, you can do the Dutch version of this exercise.