FSA Tutorial

1. Use of FSA

1.1. Foreword

You find the program FSA on hagen in

/users1/vannoord/bin/fsa

and on tcw2 in

/home/vannoord/Linux/bin/fsa

If you still haven't done this, add the path to these bin directories to your PATH (ask your local unix guru).

N.b. FSA does not work on the tcwxN machines in the practicum lab 229 (Munting). In order to be able to use FSA you must first log in to tcw2. To do this, preferably use ssh:

ssh tcw2

1.2. Start FSA

Start FSA with the command

fsa tkconsol=on -tk

You should see a screen with a box 'Regex' where you can type in regular expressions, and a box 'string' where you can test which strings are accepted (or not) by the regular expressions. The biggest part of the screen shows the automaton compiled on the basis of the regular expression previously typed in 'Regex'. The bottom part of the screen displays messages from the program, these are relevant for error-correction (debugging?). Basically, these messages report whether the word typed inside the box 'string' is accepted by the automaton or not.

There is a manual for FSA where the syntax of regular expressions is described as well as the (many) extra options that the system supports.

In case you don't have access to FSA, you can use the on-line demo version. Together with the demo version there is an alternative tutorial available.

2. A couple of simple examples of regular expressions:

[a*,b*] : defines the language that consists of an arbitrary number of 'a's, followed by an arbitrary number of 'b's. The strings aaabb, aab, aaaaa, bbb are all recognized by this regular expression, but the strings aba, bbaa, acb, cab are not.
[a,b]* specifies the language that consists of an arbitrary number of repetitions of the string ab: ab, ababab, but not abba, aabb, or bb.
{a, b} : an 'a' or a 'b'.
? : any symbol
?*: any series of symbols
?- q : all strings of 1 symbol except the q
a..z, 'A'..'Z', '0'..'9': the letter a or b or c ...or z, the letter A or B or C.... or Z, the digit 0 or 1 or 2.... or 9.
a ^ : optional one a.
$ q : all strings that include one q somewhere (equivalent to [? *, q, ? *])
~ $ q : all strings where the q does not appear(equivalent (? - q )* )
$ q & $ x : all strings that contain a q as well as an x (in any order).

3. Exercises

3.1. Simple

Give a regular expression for `words', i.e. all strings that consist of letters only
Give a regular expression for `numeric words', i.e. all strings that include at least one digit in addition to letters (Windows95, dvi2ps, 2K, ...).
Give a regular expression for 'complex words', i.e. all strings that include some other alphanumeric character type in addition to letters (except a blank space)(zwart-wit, windows3.1)
Give a regular expression that accepts email adresses (gosse@let.rug.nl) : strings made out of letters, the period ('.'), and exactly 1 '@' symbol.

N.b. Numbers and uppercase letters should be written in between single quotes: '0', 'A'. In the string this is not necessary.
Sometimes there must be a blank space in between two operators, in concrete between ~ en $.

3.2. Abbreviations

Sometimes it is important to recognize the existing abbreviations in a text. A text to speech conversion system is a system that is able to articulate written text automatically. In this context the recognition of abbreviations is crucial because these expressions must often be uttered. For example, if the word EEG appears in the text, then the utterance must sound ee - ee - gee, en obviously not like eeg. The same applies to an abbreviation such as t.a.v. that in such system is perhaps best uttered as ten aanzien van (with respect to); in any case, the utterance should not rhyme with daf. Regular expressions turn out to be relevant also in text to speech conversion systems. With regular expressions one can always define precisely which letter combinations should be assigned to the abbreviations. In a first attempt you can continue as follows. An abbreviation is a word built out of uppercase letters:

A..Z+

or contains a period:

$ '.'

Combining the two posibilities:

{ A..Z+ , $ '.' }

Define a regular expression to recognize abbreviations. Ensure that examples like the following are correctly accepted:

t.a.v.                       KPN
NAVO                         CIA
KGB                          mr.
jl.                          C.I.D.
o.a.                         PSP'er
NOSstudio                    NAVO-generaals
N.O.-I                       N.H.M.-tarieven
SALT-perscentrum             oud-ASVA-voorzitter
N.S.B.ers                    S.S.'ers

At the same time make sure that examples like these are not accepted:

Gertjan                      boterhammenzakje
15.30                        4.49.6
...                          U

3.3 Difficult

Give a regular expression that recognizes only inputs consisting of the letters a en b, and in addition the following holds true:

that the total number of letters is even.
that the total number of as is even.
that the number of as is even, and also the number of bs.
that exactly and only once three as one after the other appear.
each a is preceded by a b
each a is preceded by an a or a b

4. Macro's and auxiliary files

In order to define complex regular expressions you can use macros. Macros need to be defined in a separate (Prolog) file, for example my_macros.pl:

macro(klinker,{a,e,i,o,u,y}).
macro(medeklinker, {b,c,d,f,g,h,j,k,l,m,n,p,q,r,s,t,v,w,x,z}).
macro(lettergreep, [medeklinker *, klinker +, medeklinker *]).

After saving the file my_macros.pl, you can load it in going to the menu File and choosing LoadAux or Reconsult Aux. Select the corresponding filename (my_macros.pl) in the resulting box.

After that you can use the macros you have just defined as if they were a regular expression. Thus, you can type 'klinker' in the Regexp line. This expression will be translated as the regular expression seen above.

Macros can call other macros (look at the definition of 'lettergreep' (syllable) above).