FSA Tutorial
1. Use of FSA
1.1. Foreword
You find the program
FSA on hagen in /users1/vannoord/bin/fsa
and on tcw2
in
/home/vannoord/Linux/bin/fsa
If you still haven't done this, add the path to these
bin directories to your PATH (ask your local unix
guru).
N.b. FSA does not work on the tcwxN machines in the practicum lab 229 (Munting).
In order to be able to use FSA you must first log in to tcw2. To do this,
preferably use ssh:
ssh tcw2
1.2. Start FSA
Start FSA with the
command fsa tkconsol=on -tk
You should see a
screen with a box 'Regex' where you can type in regular
expressions, and a box 'string' where you can test which
strings are accepted (or not) by the regular expressions. The biggest
part of the screen shows the automaton compiled on the basis of the
regular expression previously typed in 'Regex'. The bottom
part of the screen displays messages from the program, these are
relevant for error-correction (debugging?). Basically, these messages
report whether the word typed inside the box 'string' is accepted by
the automaton or not.
There is a
manual for FSA where the syntax of regular expressions is
described as well as the (many) extra options that the system
supports.
In case you don't have access to FSA, you can use the
on-line
demo version. Together with the demo version there is an
alternative tutorial available.
2. A couple of simple examples of regular expressions:
-
[a*,b*] : defines the language that consists of an arbitrary
number of 'a's, followed by an arbitrary number of 'b's. The
strings aaabb, aab, aaaaa, bbb are all recognized by this
regular expression, but the strings aba, bbaa, acb, cab are
not.
-
[a,b]* specifies the language that consists of an arbitrary number
of repetitions of the string ab: ab, ababab, but not
abba, aabb, or bb.
-
{a, b} : an 'a' or a 'b'.
-
? : any symbol
-
?*: any series of symbols
-
?- q : all strings of 1 symbol except the q
-
a..z, 'A'..'Z', '0'..'9': the letter a or b or c ...or z, the letter A or
B or C.... or Z, the digit 0 or 1 or 2.... or 9.
-
a ^ : optional one a.
-
$ q : all strings that include one q somewhere
(equivalent to [? *, q, ? *])
-
~ $ q : all strings where the q does not appear(equivalent (? -
q )* )
-
$ q & $ x : all strings that contain a q as well as an x (in any
order).
3. Exercises
3.1. Simple
-
Give a regular expression for `words', i.e. all strings that consist of
letters only
-
Give a regular expression for `numeric words', i.e. all strings
that include at least one digit in addition to letters (Windows95,
dvi2ps, 2K, ...).
-
Give a regular expression for 'complex words', i.e. all strings
that include some other alphanumeric character type in addition to letters
(except a blank space)(zwart-wit, windows3.1)
-
Give a regular expression that accepts email adresses (gosse@let.rug.nl)
: strings made out of letters, the period ('.'), and exactly 1 '@' symbol.
N.b. Numbers and uppercase letters should be written in between
single quotes: '0', 'A'. In the string this is not necessary.
Sometimes there must be a blank space in between two operators, in
concrete between ~ en $.
3.2. Abbreviations
Sometimes it is important to recognize the existing
abbreviations in a text. A text to speech conversion
system is a system that is able to articulate written text
automatically. In this context the recognition of abbreviations is
crucial because these expressions must often be uttered. For example,
if the word EEG appears in the text, then the utterance must
sound ee - ee - gee, en obviously not like eeg. The
same applies to an abbreviation such as t.a.v. that in such
system is perhaps best uttered as ten aanzien van (with
respect to); in any case, the utterance should not rhyme with
daf.
Regular expressions turn out to be relevant also in text to speech conversion
systems. With regular expressions one can always define precisely which letter
combinations should be assigned to the abbreviations. In a first attempt
you can continue as follows. An abbreviation is a word built out of
uppercase letters:
A..Z+
or contains a period:
$ '.'
Combining the two posibilities:
{ A..Z+ , $ '.' }
Define a regular expression to recognize abbreviations. Ensure
that examples like the following are correctly accepted:
t.a.v. KPN
NAVO CIA
KGB mr.
jl. C.I.D.
o.a. PSP'er
NOSstudio NAVO-generaals
N.O.-I N.H.M.-tarieven
SALT-perscentrum oud-ASVA-voorzitter
N.S.B.ers S.S.'ers
At the same time make sure that examples like these are not accepted:
Gertjan boterhammenzakje
15.30 4.49.6
... U
3.3 Difficult
Give a regular expression that recognizes only inputs consisting
of the letters a en b, and in addition the
following holds true:
- that the total number of letters is even.
- that the total number of as is even.
- that the number of as is even, and also
the number of bs.
- that exactly and only once three as one after the other appear.
- each a is preceded by a b
- each a is preceded by an a or a b
4. Macro's and auxiliary files
In order to define complex regular expressions you can use macros.
Macros need to be defined in a separate (Prolog) file, for example
my_macros.pl:
macro(klinker,{a,e,i,o,u,y}).
macro(medeklinker, {b,c,d,f,g,h,j,k,l,m,n,p,q,r,s,t,v,w,x,z}).
macro(lettergreep, [medeklinker *, klinker +, medeklinker *]).
After saving the file my_macros.pl, you can load it in going to
the menu File and choosing LoadAux or Reconsult Aux. Select
the corresponding filename (my_macros.pl) in the resulting box.
After that you can use the macros you have just defined
as if they were a regular expression. Thus, you can type 'klinker'
in the Regexp line. This expression will be translated as the
regular expression seen above.
Macros can call other macros (look
at the definition of 'lettergreep' (syllable) above).