Exercise 1
General remarks
- Exercises are done individually or in pairs. In the latter case, submit results only once, with
both names.
- All material required for this exercise can be found in the directory
~gosse/NLP/exercise1 on hagen.
Exercise
Produce an FSA macro file containing macros for
- syllable (a possible syllable of English)
- onset (a possible beginning of an English syllable: a sequence of consonants)
- nucleus (a possible center of an English syllable: vowels)
- coda (a possible ending of an English syllable: consonants)
A simple example is syllable.pl. You can take this
as a starting point and replace the definitions with something more adequate.
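The structure that the four macros are meant to capture can be illustrated outside fsa with a small Python sketch. It composes a syllable pattern from an optional onset, an obligatory nucleus and an optional coda. The inventories below are toy placeholders invented for this illustration, and the code does not use FSA macro syntax.

```python
import re

# Toy inventories: hypothetical placeholders, far too small for real English.
ONSET = r"(?:str|st|tr|pl|bl|[bcdfghjklmnprstvw])"
NUCLEUS = r"(?:ee|oo|ea|ou|[aeiou])"
CODA = r"(?:nd|st|ng|ck|[bdgklmnprstx])"

# syllable = optional onset, obligatory nucleus, optional coda
SYLLABLE = re.compile(rf"{ONSET}?{NUCLEUS}{CODA}?")


def is_syllable(word: str) -> bool:
    """Return True if the whole word matches the toy syllable pattern."""
    return SYLLABLE.fullmatch(word) is not None
```

With these inventories, is_syllable("street") and is_syllable("and") hold, while is_syllable("blues") does not, because nothing in the toy coda covers the final "es".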
Macros and auxiliary files
Macros can be loaded by starting fsa (fsa tkconsol=on -tk), then
going to the File menu and choosing LoadAux or Reconsult Aux. Select the
file containing your macros (for example my_macros.pl) in the resulting box.
After that you can use the macros you have just defined as if they
were regular expressions. Thus, if a macro 'vowel' is defined, you can type 'vowel' in the Regexp
line. This expression will be expanded to the regular expression
in the definition of the macro.
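The idea of replacing a macro name by its definition can be sketched in Python as follows. This shows textual substitution only and is not a description of how fsa works internally. The macro table and the function name expand are made up for the example, and the sketch assumes that no macro refers to itself.

```python
import re

# Hypothetical macro table: a definition may mention other macro names.
MACROS = {
    "vowel": "[aeiou]",
    "consonant": "[bcdfghjklmnpqrstvwxyz]",
    "syllable": "consonant*vowel+consonant*",
}


def expand(expr: str, macros: dict[str, str]) -> str:
    """Replace each macro name in expr by its recursively expanded definition."""
    names = "|".join(re.escape(name) for name in macros)
    pattern = re.compile(rf"\b(?:{names})\b")

    def substitute(match: re.Match) -> str:
        # Wrap in a non-capturing group so that * and + apply to the whole macro.
        return "(?:" + expand(macros[match.group(0)], macros) + ")"

    return pattern.sub(substitute, expr)
```

Here expand("syllable", MACROS) yields an ordinary Python regular expression that matches "strand" as a whole but not "banana".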
TESTING
After making the definitions, and checking them in fsa, you can test your
work in two ways. This requires the following files:
- Makefile: definitions of the make commands
- eow.stem: list of simple words
- eow.syll: hyphenation for these words
- hyphenate.pl: Prolog file for producing hyphenation patterns
- monosyll: list of words of the form [consonant*, vowel+, consonant*]
- replace.pl: Prolog file with macros for hyphenate.pl
- syllable.pl: a file with macros for syllable, onset, nucleus, and coda
Test 1: Recognize `foreign' words.
The file 'monosyll' consists of a list of 740 words of the form
[consonant*, vowel+, consonant*]. The Unix command
make not_accepted
produces a file `not_accepted' which contains all words not recognized
by 'syllable'. This list should only contain words which consist of more
than a single syllable (aaien, beiaard,...) and non-native words (back, blues,...).
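What this test computes can be sketched in Python: keep every word that the syllable recognizer rejects. The pattern below is a hypothetical stand-in for your 'syllable' macro, with a made-up nucleus inventory. The actual make target may work differently.

```python
import re

CONSONANT = "[b-df-hj-np-tv-z]"
# Hypothetical stand-in for the 'syllable' macro: one nucleus between consonants.
SYLLABLE = re.compile(rf"{CONSONANT}*(?:aa|ee|oo|ie|[aeiou]){CONSONANT}*")


def not_accepted(words):
    """Return the words that are not recognized as a single syllable."""
    return [w for w in words if SYLLABLE.fullmatch(w) is None]
```

For the input ["kat", "boom", "aaien", "beiaard"] this returns ["aaien", "beiaard"], the two words with more than one syllable.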
Test 2: Hyphenating simple words
The file eow.stem contains a list of 11,382
words. The command
make hyphen_errors
produces a file hyphen_errors which contains all wrongly hyphenated words
(1st column) alongside the correct patterns (2nd column), and gives the percentage of
correctly hyphenated words.
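The kind of comparison reported by this test can be sketched in Python. The function name, the hyphen-separated notation and the sample words are assumptions made for the illustration. The real file formats of eow.syll and hyphen_errors may differ.

```python
def hyphen_errors(predicted, gold):
    """Compare two parallel lists of hyphenated words.

    Returns (errors, percent_correct), where errors is a list of
    (produced pattern, correct pattern) pairs.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    errors = [(p, g) for p, g in zip(predicted, gold) if p != g]
    correct = len(gold) - len(errors)
    percent = 100.0 * correct / len(gold) if gold else 0.0
    return errors, percent


# Made-up sample data, one wrongly hyphenated word out of four.
errors, percent = hyphen_errors(
    ["ta-ble", "wi-ndow", "gar-den", "let-ter"],
    ["ta-ble", "win-dow", "gar-den", "let-ter"],
)
```

On this sample, errors holds the single pair ("wi-ndow", "win-dow") and percent is 75.0.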
NOTA BENE
- Check your definitions before testing, for instance by loading them in
fsa and trying out some examples.
- The Unix command make produces files as defined in a 'Makefile'. Rerunning
a test sometimes leads to the message
'File up to date'. In those cases, just remove File (the file named in the message) and run make again.
If you want to start all over, do 'make clean': this removes all files
made by make.
Reporting Results
- Mail the file syllable.pl and a brief report to your lab assistant.
- In the report you should give the results of test 1 (how many words are not recognized,
which kind of words are not recognized?) and of test 2 (how many mistakes? what
kind of mistakes?)
- Send your results to m.b.villada@let.rug.nl
Deadline: Thursday, April 17
Good luck!
Gosse.
n.b. This approach to hyphenation works less well for English, as the spelling
of syllables is more irregular than in Dutch, and morphological structure plays a more
prominent role. It is still interesting to find out to what extent it can be made to work....