required: Daniel Jurafsky and James H. Martin, Speech and Language
Processing. An introduction to Natural Language Processing,
Computational Linguistics and Speech Recognition.
Second Edition. Prentice Hall 2008.
Relevant chapters:
 chapters 1, 2, 3
 sections 4.1, 4.2, 4.3, 4.4; from the remainder of chapter 4, read only the motivating and
concluding parts, and skip over the technical parts
 chapter 5
 chapters 12, 13
 chapter 14, excluding 14.6, 14.8, 14.9
 chapter 16 (excluding the details of the pumping lemma in 16.2.1)
Overview
The course provides a fundamental introduction to Natural Language
Processing. It treats a number of important fundamental topics,
including Regular Expressions, Finite Automata, Computational
Morphology, N-grams, Hidden Markov Models, Part-of-speech Tagging,
Context-Free Grammars, Parsing, and Statistical Parsing.
Credits
In order to obtain the 5 ECTS for this course, you need to
obtain a sufficient grade (5.5) for the exam.
In order to be allowed to take the exam, you need to obtain a
sufficient grade (5.5) for your portfolio. The portfolio contains
detailed reports / solutions for all the exercises in the course. The
exercises will be announced on this webpage. At the end of the
course, you need to submit your portfolio electronically (deadline:
March 24, 13:00). Only submissions through Nestor in PDF format are
taken into consideration. A random selection of the exercises will be
graded. Missing exercises are graded as 0.
Course Plan

week 1. Introduction, Finite automata, Finite Transducers
[ch 2, ch 3]. Sheets: pages 1-34 and 69-84 of ./sapporo.pdf
Exercises
 In this exercise, the alphabet consists of the characters a, b, c and d. Create a finite automaton (recognizer) which accepts:
 all strings which start with a b
 all strings which end with a b
 all strings which start with an a and which end with a c
 all strings with an odd number of c's
 all strings in which every a is followed by a b
 all strings such that a b is never preceded by a c
For each of the automata, explain whether the automaton is deterministic or not.
 For a given non-deterministic finite state recognizer, there
always exists an equivalent deterministic finite state
recognizer. A determinization algorithm takes a possibly
non-deterministic FSA and writes out a deterministic FSA. A typical
algorithm does this by considering, as states in the deterministic
automaton, all subsets of states of the non-deterministic
automaton. Loosely describe such an algorithm. What is the complexity
of the algorithm?
 2.10, 2.11
 In this exercise, the alphabet consists of the characters a, b, c and d. Create finite state transducers for each of the following mappings:
 a mapping in which each string is copied, except that a b
is introduced after every a.
 a mapping where the output is the single symbol string a if the length of
the input is even, and b if the length of the input is odd.
 a mapping in which each string is copied, except that every b
is deleted.
 a mapping in which each string is copied, except that every b
is replaced by a c if it is preceded by a d.
 a mapping in which each string is copied, except that every b
is replaced by a c if it is followed by a d.
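The determinization exercise above can be sketched in code. Below is a minimal subset construction, under the assumption that the input automaton has no epsilon transitions and that delta maps a (state, symbol) pair to a set of target states. Since in the worst case every subset of NFA states becomes reachable, the deterministic automaton may have up to 2^n states for an n-state input, so the algorithm is exponential in the worst case.

```python
def determinize(states, alphabet, delta, start, finals):
    """Subset construction: turn an NFA into an equivalent DFA whose
    states are frozensets of NFA states. Only subsets reachable from
    the start state are actually built."""
    start_set = frozenset([start])
    dfa_delta = {}
    dfa_finals = set()
    todo = [start_set]
    seen = {start_set}
    while todo:
        subset = todo.pop()
        # A subset is final if it contains at least one NFA final state.
        if subset & finals:
            dfa_finals.add(subset)
        for a in alphabet:
            # The successor subset is the union of all NFA transitions.
            target = frozenset(s2 for s in subset
                                  for s2 in delta.get((s, a), ()))
            dfa_delta[(subset, a)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return start_set, dfa_delta, dfa_finals
```

A classic input on which this blows up is the language "the n-th symbol from the end is an a", whose NFA has n+1 states but whose minimal DFA has 2^n states.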

week 2. Regular expressions.
edition2: [ch 2, ch 3]
Sheets: pages 35-57 and 79-91 of ./sapporo.pdf
Exercises
 3.5 but instead of a single transducer, provide a
cascade of simple transducers
 Define a regular expression (using the notation introduced in
class) for the following languages:
 All strings containing two vowels, ending in a consonant,
with an odd number of letters.
 All strings which end in a voiceless consonant
 All strings in which every a is preceded by b
 All strings in which the number of a's is larger than the number of b's
 All strings in which every a is preceded by b, and the number of b's is divisible by 5
 All palindromes (words that read the same backwards, e.g. anna)
 All strings in which every a is preceded by b, and every b is preceded by a
 Define regular expressions for the following transductions:
 Define a transducer which maps every consonant to the letter
c, every vowel to v and every other symbol to #.
 Define a transducer which maps every c to either s
or k, depending on its pronunciation (in Dutch). Also treat
the case of ch, mapping it to g.
 Define a transducer which maps strings of the form
[a*,b*] to the symbol yes if
there are more a's than b's, and to the symbol
no otherwise.
 Define a weighted acceptor that maps strings of the form
[a*,b*] to a positive integer if
there are more a's than b's, to 0 if the number of a's
and b's is equal, and to a negative integer
otherwise.
 Define a transducer which maps every Dutch word into its
MYFE FETEMSHAPELYKE SPELYMH (by Rudy
Kousbroek). On April 1, 1977, Rudy Kousbroek
proposed the following spelling changes in the Dutch newspaper
NRC Handelsblad. The piece contained a number of comments,
mostly positive, by a variety of people (all written by
Adriaan van Dis). In the MYFE FETEMSHAPELYKE SPELYMH
only the letters a, e, f, h, k, l, m, p, r, s, t, u and y
remain. The rules are as follows:
 no distinction between lower case and upper case,
everything is written in upper case.
 c is replaced by s or
k; q is replaced by ku;
x is replaced by ks
 every letter repetition is replaced by a single
occurrence of that letter
 v and w disappear and are
replaced by f
 z disappears and is replaced by s
 y takes the place of i, j, ij and
ie
 b is replaced by p
 d is replaced by t
 g and ch are replaced by
h
 o,oo,oe,ou are replaced by u
 n is replaced by m

In Dutch, the regular form of the past tense is formed by the suffix
te or de, depending on whether the previous sound is
a voiceless consonant or not. In addition, this previous consonant is
devoiced in front of such a suffix too. In Dutch spelling, this
devoicing is visible only for the letters v and z. Thus we have:
dub+Te dubde
wend+Te wendde
bof+Te bofte
leeg+Te leegde
kuch+Te kuchte
push+Te pushte
gooi+Te gooide
roetsj+Te roetsjte
buk+Te bukte
ril+Te rilde
ram+Te ramde
ren+Te rende
rep+Te repte
hoor+Te hoorde
pas+Te paste
zet+Te zette
leev+Te leefde
geeuw+Te geeuwde
fax+Te faxte
soez+Te soesde
Define a regular expression for a transducer which maps strings of the
type listed in the
first column to their correct spelling, as exemplified in the second column.
 (Potentially difficult.)
 Some examples of nondeterministic finite state machines which
display exponential blowup if they are determinized could be found,
some time ago, at
http://www.cs.cmu.edu/~cdm/bilder.html. Can you come up with such
an example yourself?
Here is an example of a finite state acceptor
which illustrates exponential blow-up if the automaton is determinized.
 Can you describe, in words, the language accepted by this automaton?
 Can you provide an equivalent regular expression (in FSA notation)?
 Use the FSA utilities to prove that your regular expression is
indeed equivalent.

week 3.
Exercises
 3.10
 3.11 Since this is not a programming course, here's a program in Perl. Adapt it to alternative
scoring schemes.
 Extend the minimum edit distance algorithm to compute the minimal
edit distance with transposition. Assume that transposition,
substitution, deletion and
insertion are all associated with a cost of 1. Note that
transpositions are allowed only between characters that are adjacent
in the original string. (This distance is known as the Damerau
distance. See, for instance, Wikipedia, for an explanation of why it is
not straightforward to allow transpositions more generally.)
 Give an example of a Dutch (English, French, Spanish...) sentence
which, if you remove the word boundaries, can be parsed as a
sequence of words in more than one way. Example from the sheets:
"thetabledownthere" can be parsed as "the table down there" and
"theta bled own there".
 4.1
A Perl script to compute word N-gram frequencies is available as make_ngram. You can use this for the next
exercises.
 4.3 You may want to use the following
corpora: traditional annual speeches by the Queen at the opening of the parliamentary year.
Can you establish interesting differences between the oldest and the newest speeches?
 N-gram models can be understood as weighted finite-state
acceptors. For unigram, bigram, and trigram models respectively,
describe how a given N-gram model can be implemented as a weighted
finite-state acceptor.
 4.4 Describe in detail how you could use a word
trigram language model to generate random sentences. The idea here is
that the choice of sentences reflects the trigram model.
 Suffix arrays. What is the most frequent 12-gram in the Dutch
Wikipedia corpus?
A 2011 version of the Dutch Wikipedia corpus is /net/corpora/nlwiki/20110804/nlwiki20110804.sents.gz
Use the implementation
of suffix arrays (with perfect hash keys for words) from www.let.rug.nl/~vannoord/software.html
If you work on the LWP, you don't need to install the software, but you can use
it directly at /net/aps/64/src/Alpino/SuffixArrays/. In order to use
the binary, you may need to set
export LD_LIBRARY_PATH=/net/aps/64/lib
PATH=/net/aps/64/bin:$PATH
 Try to fool the language guesser. Think up a perfectly
grammatical Dutch (English, German, French, Spanish, Italian) sentence
of at least 30 characters which is classified as English (German,
French, Italian, Spanish ...) by the language guesser. Try to make your
sentence as long and natural as possible.
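The transposition exercise above can be solved with one extra case in the standard minimum edit distance recursion. The sketch below implements the "optimal string alignment" variant, which transposes only characters that are adjacent in the original string, exactly as the exercise requires; all four operations cost 1.

```python
def damerau_distance(s, t):
    """Minimal edit distance with adjacent transposition (optimal
    string alignment variant). d[i][j] is the distance between the
    prefixes s[:i] and t[:j]."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i           # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j           # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i-1] == t[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,        # deletion
                          d[i][j-1] + 1,        # insertion
                          d[i-1][j-1] + cost)   # substitution / match
            # Extra case: swap two adjacent characters for cost 1.
            if i > 1 and j > 1 and s[i-1] == t[j-2] and s[i-2] == t[j-1]:
                d[i][j] = min(d[i][j], d[i-2][j-2] + 1)
    return d[m][n]
```

Note that this restricted variant is not the same as unrestricted Damerau-Levenshtein distance, which is the subtlety the Wikipedia reference in the exercise points at.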
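For exercise 4.4 above, one way to generate random sentences is to sample word by word: given the two previous words, draw the next word with probability proportional to the trigram count. A minimal sketch (the boundary markers <s> and </s> are an implementation choice of this sketch, not prescribed by the exercise):

```python
import random
from collections import defaultdict

def train_trigrams(sentences):
    """Collect trigram counts; two <s> markers pad the start of each
    sentence and </s> marks its end."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for a, b, c in zip(toks, toks[1:], toks[2:]):
            counts[(a, b)][c] += 1
    return counts

def generate(counts, rng=random):
    """Sample a sentence: repeatedly draw the next word from the
    distribution conditioned on the two most recent words."""
    h1, h2 = "<s>", "<s>"
    out = []
    while True:
        choices = counts[(h1, h2)]
        word = rng.choices(list(choices), weights=list(choices.values()))[0]
        if word == "</s>":
            return out
        out.append(word)
        h1, h2 = h2, word
```

Because every sampling step conditions on the previous two words, the distribution over generated sentences reflects the trigram model, which is what the exercise asks for.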

week 4.

N-grams. [ch 4] perplexity. sheets Martin lecture 7.

N-grams. smoothing. sheets Martin lecture 7.

Introduction to Part-of-speech Tagging. [ch 5] In particular: HMM Tagging, Viterbi
Learning. sheets Martin lecture 9.
Exercises
 What is the most frequent word trigram in Dutch?
Give some estimates of the number of word bigram / trigram /
four-gram types for various corpus sizes. In particular,
study the number of N-gram types in a corpus of
10 million words.
 Describe in detail how you compute the
perplexity of a test corpus given a character bigram language model.
 Estimate the perplexity of a subpart of the nlwiki corpus, according to
a character trigram model trained on a different subpart of the nlwiki corpus.
 treatment of unknown words (ch 4.3.2 in edition2). What is wrong
with the suggested approach for unigram modeling? Hint: what happens
in the special case where *all* words in training data are not in the
vocabulary?
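The perplexity exercises above can be checked against a small implementation. The sketch below trains a character bigram model with maximum-likelihood estimates; note that such an unsmoothed model assigns probability 0 to unseen bigrams, so on real held-out data (such as the nlwiki subpart) you would need smoothing first. The start marker # is an assumption of this sketch.

```python
import math
from collections import Counter

def train_bigram(text):
    """MLE character bigram model. The marker '#' (assumed absent
    from the data) represents the start of the string."""
    padded = "#" + text
    bigrams = Counter(zip(padded, padded[1:]))
    contexts = Counter(padded[:-1])
    # P(c2 | c1) = count(c1 c2) / count(c1 as a context)
    return {bg: c / contexts[bg[0]] for bg, c in bigrams.items()}

def perplexity(model, text):
    """Perplexity = 2 ** (-1/N * sum_i log2 P(c_i | c_{i-1}))."""
    padded = "#" + text
    logprob = 0.0
    for bg in zip(padded, padded[1:]):
        # Raises KeyError on unseen bigrams: smoothing needed in practice.
        logprob += math.log2(model[bg])
    return 2 ** (-logprob / len(text))
```

As a sanity check, a model evaluated on its own fully predictable training text (every conditional probability 1) has perplexity 1, the theoretical minimum.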

Split (part of) the nlwiki corpus randomly in two parts (how?). The first, training,
part should contain 90% of the sentences, and the test part should
contain 10% of the sentences. In this exercise we want to
investigate how often an Ngram occurs in the test data that was
unseen in the training data.
 How many unigrams in the test data are not present in the training data?
 How many bigrams in the test data are not present in the training data?
 How many trigrams in the test data are not present in the training data?
 How many four-grams in the test data are not present in the training data?
 5.1
 5.6 (international edition: exercise 5.5). This is the question
about the implementation of a baseline POS tagger.
Describe such an implementation in detail, but you don't need to
actually implement it.
 Consider a bigram HMM tagger. Work out in detail the steps of the
Viterbi algorithm in tagging the sentence het was het
slapen where we have the following probabilities (here I use
start and end to represent the start state and final state respectively):
P(D|start) = 0.1
P(N|start) = 0.4
P(V|start) = 0.2
P(PRON|start) = 0.1
P(het|D) = 0.25
P(het|PRON) = 0.05
P(was|N) = 0.0001
P(was|V) = 0.01
P(slapen|V) = 0.0001
P(slapen|N) = 0.0001
P(V|D) = 0.00001
P(N|D) = 0.4
P(V|PRON) = 0.05
P(N|PRON) = 0.001
P(D|V) = 0.01
P(PRON|V) = 0.01
P(PRON|N) = 0.001
P(D|N) = 0.001
P(end|D) = 0.25
P(end|PRON) = 0.25
P(end|N) = 0.25
P(end|V) = 0.25
 Give a specification of a Viterbi algorithm for finding the best
tag sequence given a trigram HMM.
 What is the complexity of the Viterbi algorithm for tagging with bigrams,
in terms of the length of the sentence n, and the number of tags T.
 Idem, for trigram tagger.
 Indicate, for the HMM tagger, what problem(s) is/are caused by
the occurrence of unknown words. Provide some ideas for solving those
problems.
 The Viterbi algorithm can be used to provide a single tag sequence
for a given sentence. Indicate how you might adapt the tagger to be
able to return multiple good tag sequences; for instance the best 10
tag sequences.
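The bigram Viterbi exercise above can be checked against a small implementation. The probability tables below are taken directly from the exercise; the function itself is a generic sketch (missing probabilities are treated as 0).

```python
def viterbi(words, tags, start_p, trans_p, emit_p, end_p):
    """Bigram HMM Viterbi. V[t][tag] is the probability of the best tag
    sequence for words[:t+1] ending in tag; back[t][tag] records the
    previous tag on that best path."""
    V = [{tag: start_p.get(tag, 0) * emit_p.get((words[0], tag), 0)
          for tag in tags}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for tag in tags:
            prev, p = max(((q, V[t-1][q] * trans_p.get((tag, q), 0))
                           for q in tags), key=lambda x: x[1])
            V[t][tag] = p * emit_p.get((words[t], tag), 0)
            back[t][tag] = prev
    last, prob = max(((tag, V[-1][tag] * end_p.get(tag, 0))
                      for tag in tags), key=lambda x: x[1])
    seq = [last]
    for t in range(len(words) - 1, 0, -1):
        seq.append(back[t][seq[-1]])
    return list(reversed(seq)), prob

# Probabilities from the exercise; keys are (tag, previous_tag),
# i.e. trans_p[("V", "D")] is P(V|D).
tags = ["D", "N", "V", "PRON"]
start_p = {"D": 0.1, "N": 0.4, "V": 0.2, "PRON": 0.1}
end_p = {"D": 0.25, "N": 0.25, "V": 0.25, "PRON": 0.25}
emit_p = {("het", "D"): 0.25, ("het", "PRON"): 0.05,
          ("was", "N"): 0.0001, ("was", "V"): 0.01,
          ("slapen", "V"): 0.0001, ("slapen", "N"): 0.0001}
trans_p = {("V", "D"): 0.00001, ("N", "D"): 0.4,
           ("V", "PRON"): 0.05, ("N", "PRON"): 0.001,
           ("D", "V"): 0.01, ("PRON", "V"): 0.01,
           ("PRON", "N"): 0.001, ("D", "N"): 0.001}

seq, prob = viterbi("het was het slapen".split(),
                    tags, start_p, trans_p, emit_p, end_p)
```

With these numbers the best path comes out as het/PRON was/V het/D slapen/N, with probability 6.25e-14 after multiplying in the end transition; working the trellis out by hand should give the same columns.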

week 5.
 Formal grammars of English [ch 12], sheets Martin lecture 12 (up to slide 60)
 Syntactic Parsing [ch 13], sheets Martin lectures 13 and 14
 Earley
Exercises
 12.1
 12.2
 12.3
 12.5
 Apply, on paper, the Earley algorithm to the sentence "Does the flight serve a meal", with the L1 grammar
from figure 13.8 in the book (on the left).
 13.4

week 6.
 CKY [ch 13]
 Probabilistic Context Free Grammar [ch 14], sheets Martin lecture 16 (13-28)
 Probabilistic Parsing with CKY [ch 14]
 Two tricks to make PCFG work better, sheets Martin lecture 17
 parent annotation
 head annotation
 Chomsky hierarchy [ch 16.1]
 Is natural language context free? [ch 16.3]
Exercises
 12.11
 Apply, on paper, the CKY algorithm to the sentence "Does the flight serve a meal", using the
"L1 in CNF" grammar from Figure 13.8 (on the right).
 14.4: "fill out the rest of the chart in Fig. 14.4"
 14.2: "Modify the algorithm for conversion to CNF from chapter 13 to handle
rule probabilities"
 16.1
 16.4
 Make sure you have read all relevant material from the book
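A minimal CKY recognizer may help when checking the paper-and-pencil CKY exercise above. The grammar representation (a lexical map and a binary-rule map) and the toy grammar below are assumptions of this sketch; they are not the book's L1 grammar.

```python
def cky_recognize(words, lexical, binary, start="S"):
    """CKY recognizer for a grammar in Chomsky normal form.
    lexical maps a word to the set of nonterminals deriving it;
    binary maps a pair (B, C) to the set of nonterminals A with a
    rule A -> B C. table[i][j] holds the nonterminals deriving
    words[i:j]."""
    n = len(words)
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        table[i][i + 1] = set(lexical.get(w, ()))
    for span in range(2, n + 1):           # fill shorter spans first
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):      # split point
                for B in table[i][k]:
                    for C in table[k][j]:
                        table[i][j] |= binary.get((B, C), set())
    return start in table[0][n]

# Toy CNF grammar, purely for illustration:
lexical = {"the": {"Det"}, "flight": {"N"}, "meal": {"N"}, "serves": {"V"}}
binary = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
```

The three nested loops over span, start position, and split point give the familiar O(n^3) behaviour in the sentence length (times a grammar-dependent factor).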

week 7.