Exercise 1
General remarks
- Exercises are done individually or in pairs. In the latter case, submit results only once, with
both names.
- All material required for this exercise can be found in the directory
~gosse/NLP/exercise1 on hagen.
Exercise
Produce an FSA macro-file containing macros for
- syllable (a possible syllable of Dutch)
- onset (a possible beginning (sequence of consonants) of a Dutch syllable)
- nucleus (a possible center (vowels) of a Dutch syllable)
- coda (a possible ending (consonants) of a Dutch syllable)
A simple example is syllable.pl. You can take this
as a starting point and replace the definitions with something more adequate.
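To make the onset/nucleus/coda decomposition concrete, here is a minimal sketch in Python rather than in FSA macro syntax. It is only an analogy: the inventories below are tiny, hand-picked assumptions and are nowhere near an adequate description of Dutch, and the names (ONSETS, NUCLEI, CODAS, is_syllable) are hypothetical.

```python
import re

# Toy inventories, chosen only for illustration (assumptions, not a solution).
ONSETS = ["", "b", "d", "k", "st", "str", "sch"]
NUCLEI = ["a", "e", "i", "o", "u", "aa", "ee", "oo", "ui", "ij"]
CODAS = ["", "k", "l", "n", "t", "nt", "rk"]

def alternation(options):
    """Build a non-capturing group that matches any one of the given strings."""
    ordered = sorted(options, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(o) for o in ordered) + ")"

# A syllable is an onset, followed by a nucleus, followed by a coda.
# Onset and coda may be empty because "" is in their inventories.
SYLLABLE = re.compile(alternation(ONSETS) + alternation(NUCLEI) + alternation(CODAS))

def is_syllable(word):
    """True if the whole word is a single syllable under the toy inventories."""
    return SYLLABLE.fullmatch(word) is not None

for w in ["straat", "kat", "boek", "aaien"]:
    print(w, is_syllable(w))
```

With these inventories, "straat" and "kat" are accepted, while "boek" is rejected only because "oe" is missing from the toy nucleus list. Gaps of this kind are what the exercise asks you to repair in your own macro definitions.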
Macros and auxiliary files
Macros can be loaded by starting fsa (fsa tkconsol=on -tk) and
going to the menu File and choosing LoadAux or Reconsult Aux. Select the
corresponding filename (my_macros.pl) in the resulting box.
After that you can use the macros you have just defined as if they
were regular expressions. Thus, if a macro 'vowel' is defined, you can type 'vowel' in the Regexp
line. This expression will be translated into the regular expression
given in the definition of the macro.
TESTING
After making the definitions, and checking them in fsa, you can test your
work in two ways. This requires the following files:
Test 1: Recognize `foreign' words.
The file 'monosyll' consists of a list of 5890 words of the form
[consonant*, vowel+, consonant*]. The Unix command
make not_accepted
produces a file `not_accepted' which contains all words not recognized
by 'syllable'. This list should only contain words which consist of more
than a single syllable (aaien, beiaard,...) and non-native words (back, blues,...).
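The idea behind Test 1 can be sketched in Python as follows. This is not what the Makefile actually runs (its contents are not shown here). The acceptor toy_accepts is a deliberately crude, hypothetical stand-in for your 'syllable' macro, and the sample word list is made up for illustration.

```python
import re

# Hypothetical stand-in for the 'syllable' macro: up to three consonants,
# one or two vowels, up to two consonants. Far too permissive for real Dutch.
CONS = "[bcdfghjklmnpqrstvwxz]"
TOY = re.compile(CONS + "{0,3}[aeiou]{1,2}" + CONS + "{0,2}")

def toy_accepts(word):
    """True if the whole word matches the toy syllable pattern."""
    return TOY.fullmatch(word) is not None

def not_accepted(words, accepts):
    """Return the words the acceptor rejects, in their original order."""
    return [w for w in words if not accepts(w)]

def rejection_rate(words, accepts):
    """Percentage of words that the acceptor rejects."""
    if not words:
        return 0.0
    return 100.0 * len(not_accepted(words, accepts)) / len(words)

sample = ["kat", "boek", "strand", "aaien", "beiaard"]
print(not_accepted(sample, toy_accepts))
print(rejection_rate(sample, toy_accepts))
```

On this sample only the polysyllabic words "aaien" and "beiaard" are rejected. Note that such a loose pattern would also accept non-native words like "blues", so a low rejection rate alone does not show that the definitions are good.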
Test 2: Hyphenating simple words
The file dol.mono.stem contains a list of 12628 mono-morphemic (non-compound)
words. The command
make hyphen_errors
produces a file hyphen_errors which contains all wrongly hyphenated
words (1st column: the wrong hyphenation, 2nd column: the correct pattern), and reports the percentage of
correctly hyphenated words.
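The scoring described for Test 2 amounts to comparing each produced hyphenation with the correct one. A minimal Python sketch follows, assuming both are available as parallel lists of strings with '-' marking syllable boundaries. The function name score_hyphenation and the example data are hypothetical, and the actual Makefile may compute its figures differently.

```python
def score_hyphenation(predicted, gold):
    """Compare predicted hyphenations with the correct ones.

    Returns (errors, pct_correct): errors is a list of (predicted, correct)
    pairs that differ, pct_correct is the percentage of exact matches.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    errors = [(p, g) for p, g in zip(predicted, gold) if p != g]
    total = len(gold)
    pct_correct = 100.0 * (total - len(errors)) / total if total else 0.0
    return errors, pct_correct

# Made-up example data, for illustration only.
predicted = ["wa-ter", "ta-fel", "her-fst", "ap-pel"]
gold = ["wa-ter", "ta-fel", "herfst", "ap-pel"]

errors, pct = score_hyphenation(predicted, gold)
for wrong, right in errors:
    print(wrong, right)  # two columns: wrong hyphenation, correct pattern
print(pct)
```

Looking at the error pairs rather than only at the percentage is what lets you answer the question of what kind of mistakes your definitions make.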
NOTA BENE
- Check your definitions before testing, for instance by loading them in
fsa and trying out some examples.
- The Unix command make produces files as defined in a 'Makefile'. Rerunning
a test sometimes leads to the message
'File up to date'. In those cases, just remove File and run make again.
If you want to start all over, run 'make clean': this removes all files
made by make.
Reporting Results
- Mail the file syllable.pl and a brief report to your lab assistant.
- In the report you should give the results of test 1 (how many words are not recognized,
which kind of words are not recognized?) and of test 2 (how many mistakes? what
kind of mistakes?)
- Send your results to m.b.villada@let.rug.nl
Deadline: Thursday, April 17
Good luck!
Gosse.
P.S. A first try gave 22% unaccepted words for test 1, and
10% errors for test 2...