FSA Tutorial
Gosse Bouma
Natural Language Processing II
Using FSA
You find the program
FSA on hagen in /users1/vannoord/bin/fsa
and on tcw2
in
/home/vannoord/Linux/bin/fsa
If you still haven't done this, add the path to these
bin directories to your PATH (ask your local unix
guru).
Starting FSA
Start FSA with the
command fsa tkconsol=on -tk
You should see a
screen with a box 'Regex' where you can type in regular
expressions, and a box 'string' where you can test which
strings are accepted (or not) by the regular expressions. The biggest
part of the screen shows the automaton compiled on the basis of the
regular expression previously typed in 'Regex'. The bottom
part of the screen displays messages from the program, these are
relevant for error-correction (debugging?). Basically, these messages
report whether the word typed inside the box 'string' is accepted by
the automaton or not.
There is a
manual for FSA where the syntax of regular expressions is
described as well as the (many) extra options that the system
supports.
In case you don't have access to FSA, you can use the
on-line
demo version. Together with the demo version there is an
alternative tutorial available.
Examples of regular expressions
-
[a*,b*] defines the language that consists
of an arbitrary number of a's, followed by an arbitrary number of b's.
The strings aaabb, aab, aaaaa, bbb are all
recognized by this regular expression, but the strings aba,
bbaa, acb, cab are not.
-
[a,b]* defines the language that consists
of an arbitrary number of repetitions of the string ab: ab,
ababab, but not abba, aabb, or bb.
-
{a, b} an a or
a b.
-
? any symbol
-
?* any series of symbols
-
?- q any single symbol except q
-
a..z, 'A'..'Z', '0'..'9' the letter a
or b or c ...or
z, the letter A
or B or C ...
or Z, the digit 0
or 1 or 2....
or 9.
-
a ^ an optional a.
-
$ q a string containing a q
(equivalent to [? *, q, ? *])
-
~ $ q a string that does not contain a q
(equivalent (? - q )* )
-
$ q & $ x a string containing a q
and an x (in any order).
Exercises
Introductory
-
Give a regular expression for `words', i.e. all strings that consist of
letters only,
-
Give a regular expression for recognizing US ZIP Codes (i.e. MI
48824, CA 94301, MA 02101).
-
Give a regular expression for `numeric words', i.e. all strings that include
at least one digit in addition to letters (Windows95,
dvi2ps, 2K, ...).
-
Give a regular expression for 'complex words', i.e. all strings that include
some other alphanumeric character type in addition to letters (except a
blank space)(hand-made, windows3.1)
-
Give a regular expression that accepts email adresses (gosse@let.rug.nl)
: strings made out of letters, the period ('.'),
and exactly 1 '@' symbol.
N.b. Numbers and uppercase letters in regular expressions should be in
single quotes: '0', 'A'. (In strings you want
to check they should not be quoted.) N.b.b. Sometimes two operators need
to be seperated by a space. In particular, between negation and contains:
~ $.
Advanced
Give a regular expression that recognizes only inputs satisfying the following
condition:
-
the string contains only a's and b's
and the total number of letters is even,
-
the string contains only a's and b's
and the total number of a's is even,
-
the string contains only a's and b's
and the number of a's is even and the number of
b's is even,
-
the string contains only a's and b's
and each a is preceded by a b.
Macro's and auxiliary files
In order to define complex regular expressions you can use macros. Macros
need to be defined in a separate (Prolog) file, for example my_macros.pl:
:- multifile macro/2.
macro(vowel,{a,e,i,o,u,y}).
macro(consonant, {b,c,d,f,g,h,j,k,l,m,n,p,q,r,s,t,v,w,x,z}).
macro(syllable, [consonant *, vowel +, consonant *]).
The first line is necessary so Prolog knows there are macro definitions
in this file. In Prolog, every statement ends with a period ('.'). Macro
takes two arguments, a name and a regular expression. Names must start
with a lower-case letter. After saving the file my_macros.pl, you can load
it in going to the menu File and choosing
LoadAux or Reconsult
Aux. Select the corresponding filename (my_macros.pl) in the resulting
box. After that you can use the macros you have just defined as if they
were a regular expression. Thus, you can type 'vowel' in the RegExp line.
This expression will be translated as the regular expression seen above.
Macros can call other macros, as in at the definition of 'syllable' above.
Projects
Determining the Category of Words
The file words_pos.txt contains an alphabetic
list of nouns, adjectives, and verbs. The file words_pos_rev.txt
contains the same list, but now sorted alphabetically on the end of the
word, instead of the beginning. The file suffix_stats.txt
contains statistics for the most frequent 5 letter suffixes derived
from a list of approx. 75K nouns, verbs, and adjectives.
-
Formulate a regular expression which recognizes (to some extent) adjectives.
You may want to use an auxiliary (macros) file for the definition.
-
Use the data mentioned above for inspiration.
-
Choose an arbitrary document, and collect the first 10 (or even better
20 or 50) adjectives. How many of these would be accepted by your definition)?
-
Now collect the first 10 (20, 50, ...) verbs and nouns. How many of these
are accepted by your automaton?
-
On the basis of this experiment, discuss how difficult it is to predict
that a word is an adjective.
- As an alternative for step 3 - 5, yo may use the scripts in this Makefile, to test
the accuracy of your regular expression on the data in word_pos.txt. In theory, after placing the Makefile in your
working directory, the command
make adj_results
should give statistics for the number of falsely unrecognized
adjectives, and falsely recognized N's or V's. The files adj_X_errors (where X = A, N, or V) contain the errors
themselves.
Syllable
The file monosyll.txt contains words extracted from an English word list
of the form:
[consonant*, vowel+, consonant*]
Thus, these are mostly mono-syllabic words. However, the list also contains
some words which are in fact bi-syllabic or whose spelling seems to deviate
from the normal English spelling.
-
Write a regular expression which recognizes English monosyllabic words
more accurately. You may approach the problem by defining macros for onset,
nucleus, and coda, and defining a macro for syllable as
[onset, nucleus, coda]
-
Test your expression on some of the examples in the data file.
-
Test whether the bi-syllabic words and words with an idiosyncratic (foreign
or otherwise) spelling in the list are rejected by your definition.
- Alternatively, steps 2-3 can be tested on all data in monosyll.txt by placing this Makefile in your
working directory, and running the command
make monosyll.err
This gives a list of all words rejected by your reg ex.
- Alternatively, you can do the Dutch version of this exercise.