required: Daniel Jurafsky and James H. Martin, Speech and Language
Processing. An introduction to Natural Language Processing,
Computational Linguistics and Speech Recognition.
Second Edition. Prentice Hall 2008.
- chapters 1, 2, 3,
- chapter 4: sections 4.1, 4.2, 4.3, 4.4. For the remainder of chapter 4, read only the
  motivating and concluding parts; skip over the technical parts.
- chapter 5
- chapter 12, 13
- chapter 14, excluding 14.6, 14.8, 14.9
- chapter 16 (excluding details of pumping lemma in 16.2.1)
The course provides a fundamental introduction to Natural Language
Processing. Various important fundamental topics are treated,
including Regular Expressions, Finite Automata, Computational
Morphology, N-grams, Hidden Markov Models, Part-of-speech Tagging,
Context Free Grammar, Parsing, and Statistical Parsing.
In order to obtain the 5 ECTS for this course, you need to
obtain a sufficient (5.5) grade for the exam.
In order to be allowed to take the exam, you need to obtain a
sufficient (5.5) grade for your portfolio. The portfolio contains
detailed reports / solutions of all the exercises in the course. The
exercises will be announced on this web page. At the end of the
course, you need to submit your portfolio electronically (deadline:
March 24, 13:00). Only submissions through Nestor in PDF format are
taken into consideration. A random selection of the exercises will be
graded. Missing exercises are graded as 0.
week 1. Introduction, Finite automata, Finite Transducers
[ch 2, ch 3]. Sheets: page 1-34 and 69-84 ./sapporo.pdf
- In this example, the alphabet consists of the characters a, b, c and d. Create a finite automaton (recognizer) which accepts each of the following languages.
For each of the automata, explain whether the automaton is deterministic or not.
- all strings which start with a b
- all strings which end with a b
- all strings which start with an a and which end with a c
- all strings with an odd number of c's
- all strings in which every a is followed by a b
- all strings such that a b is never preceded by a c
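As an illustration of what simulating such an automaton looks like, here is the odd-number-of-c's language as a two-state machine in Python (a sketch of mine, not part of the required material); the state records the parity of the c's seen so far:

```python
def accepts_odd_c(s):
    """Simulate a 2-state DFA over {a,b,c,d} accepting strings
    with an odd number of c's."""
    state = 0                  # 0 = even number of c's (start), 1 = odd (accepting)
    for ch in s:
        if ch not in "abcd":
            raise ValueError("symbol outside the alphabet: " + ch)
        if ch == "c":
            state = 1 - state  # a, b and d loop on the current state
    return state == 1

# accepts_odd_c("abcd") is True; accepts_odd_c("cc") is False
```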
- For a given non-deterministic finite state recognizer, there
always exists an equivalent deterministic finite state
recognizer. A determinization algorithm takes a possibly
non-deterministic fsa and writes out a deterministic fsa. A typical
algorithm does this by considering as states in the deterministic
automaton all subsets of states of the non-deterministic
automaton. Loosely describe such an algorithm. What is the complexity
of the algorithm?
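A minimal Python sketch of the subset construction (the function names and the example NFA below are mine, not from the book):

```python
from collections import deque

def determinize(delta, start, finals):
    """Subset construction: delta maps (state, symbol) to a set of
    NFA states; the DFA states are the reachable sets of NFA states."""
    symbols = {sym for (_, sym) in delta}
    start_set = frozenset([start])
    dfa, seen, todo = {}, {start_set}, deque([start_set])
    while todo:
        S = todo.popleft()
        for sym in symbols:
            T = frozenset(q for s in S for q in delta.get((s, sym), ()))
            dfa[(S, sym)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    return dfa, start_set, {S for S in seen if S & finals}

# NFA for "the second-to-last symbol is an a", over {a,b}:
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "a"): {2}, (1, "b"): {2}}
dfa, q0, finals = determinize(nfa, 0, {2})
```

Since in principle every subset of NFA states can become a DFA state, the construction can produce 2^n states for an n-state input, which also answers the complexity question.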
- 2.10, 2.11
- In this example, the alphabet consists of the characters a, b, c and d. Create finite state transducers for each of the following mappings:
- a mapping in which each string is copied, except that a b
is introduced after every a.
- a mapping where the output is the single symbol string a if the length of
the input is even, and b if the length of the input is odd.
- a mapping in which each string is copied, except that every b
is replaced by a c if it is preceded by a d.
- a mapping in which each string is copied, except that every b
is replaced by a c if it is followed by a d.
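To illustrate how such a transducer operates, here is the "b replaced by c if preceded by d" mapping simulated in Python (a sketch of mine); the boolean prev_d plays the role of the two states:

```python
def replace_b_after_d(s):
    """Two-state transducer: copy the input, but write c for a b
    that is immediately preceded by a d."""
    out, prev_d = [], False
    for ch in s:
        out.append("c" if ch == "b" and prev_d else ch)
        prev_d = (ch == "d")
    return "".join(out)

# replace_b_after_d("dba") == "dca"; replace_b_after_d("abd") == "abd"
```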
week 2. Regular expressions;
edition2: [ch 2, ch 3]
Sheets: page 35-57, 79-91 ./sapporo.pdf,
- 3.5 but instead of a single transducer, provide a
cascade of simple transducers
- Define a regular expression (using the notation introduced in
class) for the following languages:
- All strings containing two vowels, ending in a consonant,
with an odd number of letters.
- All strings which end in a voiceless consonant
- All strings in which every a is preceded by b
- All strings in which the number of a's is larger than the number of b's
- All strings in which every a is preceded by b, and the number of b's is divisible by 5
- All palindromes (words that read the same backwards, e.g. anna)
- All strings in which every a is preceded by b, and every b is preceded by a
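For one of these languages a candidate answer can be checked mechanically. Assuming the alphabet {a,b,c,d} from week 1, the language in which every a is preceded by b corresponds to the expression (ba|[bcd])*; in Python regular expression notation (a sketch of mine):

```python
import re

# Every a must be immediately preceded by a b, so an a can only be
# "consumed" as part of the two-symbol block "ba".
pattern = re.compile(r"(ba|[bcd])*")

def every_a_preceded_by_b(s):
    return pattern.fullmatch(s) is not None

# every_a_preceded_by_b("cbad") is True; every_a_preceded_by_b("ab") is False
```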
- Define regular expressions for the following transductions:
Define a regular expression for a transducer which maps strings of the
type listed in the
first column to their correct spelling, as exemplified in the second column.
- Define a transducer which maps every consonant to the letter
c, every vowel to v and every other symbol to #.
- Define a transducer which maps every c to either s
or k, depending on its pronunciation (in Dutch). Also treat
the case of ch, mapping it to g.
- Define a transducer which maps strings of the form
[a*,b*] to the symbol yes if
there are more a's than b's, and to the symbol no otherwise.
- Define a weighted acceptor that maps strings of the form
[a*,b*] to a positive integer if
there are more a's than b's, to 0 if the number of a's
and b's is equal, and to a negative integer otherwise.
- Define a transducer which maps every Dutch word into its
MYFE FETEMSHAPELYKE SPELYMH (by Rudy
Kousbroek). On the first of April, 1977, Rudy Kousbroek
proposed the following spelling changes in the Dutch newspaper
NRC Handelsblad. The piece contained a number of comments,
mostly positive, by a variety of people (all written by
Adriaan van Dis). In the MYFE FETEMSHAPELYKE SPELYMH
only the letters a, e, f, h, k, l, m, p, r, s, t, u and y
remain. The rules are as follows:
- no distinction between lower case and upper case,
everything is written in upper case.
- c is replaced by s or
k; q is replaced by ku;
x is replaced by ks
- every letter repetition is replaced by a single
occurrence of that letter
- v and w are replaced by f
- z is replaced by s
- y takes the place of i, j ij and
- b is replaced by p
- d is replaced by t
- g and ch are replaced by h
- o,oo,oe,ou are replaced by u
- n is replaced by m
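The rules above can be approximated by a cascade of string replacements. The sketch below is mine and simplifies in at least two places: c is always mapped to k (the real rule depends on pronunciation), and the order in which the rules are applied is an assumption:

```python
import re

RULES = [("ch", "h"), ("c", "k"),        # simplification: c always becomes k
         ("q", "ku"), ("x", "ks"),
         ("v", "f"), ("w", "f"), ("z", "s"),
         ("ij", "y"), ("i", "y"), ("j", "y"),
         ("b", "p"), ("d", "t"),
         ("oo", "u"), ("oe", "u"), ("ou", "u"), ("o", "u"),
         ("g", "h"), ("n", "m")]

def nieuwe_spelling(word):
    w = word.lower()                     # no lower/upper case distinction
    for old, new in RULES:               # longer patterns (ch, ij, oo...) first
        w = w.replace(old, new)
    w = re.sub(r"(.)\1+", r"\1", w)      # collapse letter repetitions
    return w.upper()

# nieuwe_spelling("spelling") == "SPELYMH"
```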
In Dutch, the regular form of the past tense is formed by the suffix
te or de, depending on whether the preceding sound is
a voiceless consonant or not. In addition, this preceding consonant is
devoiced in front of such a suffix too. In Dutch spelling, this
devoicing is visible only for the letters v, z. Thus we have:
- (Potentially difficult.)
- Some examples of non-deterministic finite state machines which
display exponential blow-up if they are determinized could be found,
some time ago, at
http://www.cs.cmu.edu/~cdm/bilder.html. Can you come up with such
an example yourself?
Here is an example of a finite state acceptor
which illustrates exponential blow-up if the automaton is determinized.
- Can you describe - in words - the language accepted by this automaton?
- Can you provide an equivalent regular expression (in FSA notation)?
- Use the FSA utilities to prove that your regular expression is equivalent to the automaton.
- During class (2008), a student (Eamon N) asked about the transitive closure of
regular relations, R<*>. It turns out that you can show that there exist
regular relations T such that T<*> is not regular. A proof can be found
on page 19 of the
Licentiate Thesis of Marcus Nilsson, thanks to Tim Fernando for
pointing this out to me.
The example can be re-phrased as follows, using FSA Utilities notation. Consider the regular
relation swap. This swaps a pair of a,b to b,a, while copying the rest of the
string (which also contains a's and b's). Let us write tr_cl_swap for
the transitive closure of swap, with respect to composition, i.e., swap o swap o
swap o .... o swap.
Consider the language
[a,b]* o tr_cl_swap
Furthermore, consider the intersection of this language, with the
Describe this language. Why does this show that tr_cl_swap is not a regular relation?
- 3.11 Since this is not a programming course, here's a program in Perl. Adapt it to an alternative language if you prefer.
- Extend the minimum edit distance algorithm to compute the minimal
edit distance with transposition. Assume that transposition,
substitution, deletion and
insertion all are associated with a cost of 1. Note that
transpositions are allowed only between characters that are adjacent
in the original string. (This distance is known as Damerau
distance. See, for instance, Wikipedia, for an explanation why it is
not straightforward to allow transpositions more generally).
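A sketch of the adapted dynamic programming table in Python (this implements the "optimal string alignment" variant of Damerau distance, with all four operations at cost 1):

```python
def damerau(s, t):
    """Minimum edit distance with insertions, deletions, substitutions
    and adjacent transpositions, all at cost 1 (optimal string alignment)."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / copy
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

# damerau("ab", "ba") == 1, against 2 for plain minimum edit distance
```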
- Give an example of a Dutch (English, French, Spanish...) sentence,
which - if you remove the word boundaries - can be parsed as a
sequence of words in more than one way. Example from the sheets:
"thetabledownthere" can be parsed as "the table down there" and
"theta bled own there".
A Perl script to compute word Ngram frequencies is available as make_ngram. You can use this for the next exercises.
- 4.3 You may want to use the following
corpora: traditional annual speeches by the Queen at the opening of the parliamentary year.
Can you establish interesting differences between the oldest and the newest speeches?
- Ngram models can be understood as weighted finite-state
acceptors. For unigram-, bigram-, and trigram models respectively,
describe how a given Ngram model can be implemented as a weighted
finite-state acceptor.
- 4.4 Describe in detail how you could use a word
trigram language model to generate random sentences. The idea here is
that the choice of sentences reflects the trigram model.
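One way of making the idea concrete (a sketch; the toy corpus and all names are mine): store, for every two-word history, the list of observed successors, and repeatedly sample from that list, so that each continuation is chosen proportionally to its trigram count:

```python
import random
from collections import defaultdict

def train(sentences):
    model = defaultdict(list)
    for sent in sentences:
        tokens = ["<s>", "<s>"] + sent + ["</s>"]
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            model[(a, b)].append(c)      # duplicates encode the counts
    return model

def generate(model, rng):
    h1, h2, out = "<s>", "<s>", []
    while True:
        w = rng.choice(model[(h1, h2)])  # sample proportional to counts
        if w == "</s>":
            return out
        out.append(w)
        h1, h2 = h2, w

corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
model = train(corpus)
sentence = generate(model, random.Random(1))
```

With this two-sentence toy corpus only the training sentences can come out; with a real corpus the generator freely recombines trigrams into new sentences.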
- Suffix arrays. What is the most frequent 12-gram in the Dutch
A 2011 version of the Dutch Wikipedia corpus is /net/corpora/nlwiki/20110804/nlwiki-20110804.sents.gz
Use the implementation
of suffix arrays (with perfect hash keys for words) from www.let.rug.nl/~vannoord/software.html
If you work on the LWP, you don't need to install the software, but you can use
it directly at /net/aps/64/src/Alpino/SuffixArrays/. In order to use
the binary, you may need to set
- Try to fool the language guesser. Think up a perfectly
grammatical Dutch (English, German, French, Spanish, Italian) sentence
of at least 30 characters which is classified as English (German,
French, Italian, Spanish ...) by the language guesser. Try to make your
sentence as long and natural as possible.
N-grams. [ch 4] perplexity sheets Martin lecture 7.
N-grams. smoothing. sheets Martin lecture 7.
Introduction to Part-of-speech Tagging. [ch 5] In particular: HMM Tagging, Viterbi
Learning. sheets Martin lecture 9.
- What is the most frequent word trigram in Dutch?
Give some estimates of the number of word bigram / trigram /
fourgram types for various corpus sizes. In particular,
study the number of N-gram types of a corpus of
10 million words.
- Describe in detail how you compute the
perplexity of a test corpus given a character bigram language model.
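A sketch of such a computation for an unsmoothed maximum-likelihood model (the padding and smoothing choices below are mine; a real answer has to motivate its own choices):

```python
import math
from collections import Counter

def train_char_bigram(text):
    """MLE character bigram model: P(c2 | c1) = count(c1 c2) / count(c1)."""
    pairs = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    return {bg: c / contexts[bg[0]] for bg, c in pairs.items()}

def perplexity(model, text):
    """2 ** cross-entropy, conditioning every character (except the
    first) on its predecessor; no padding, no smoothing."""
    logsum, n = 0.0, 0
    for bg in zip(text, text[1:]):
        p = model.get(bg, 0.0)
        if p == 0.0:
            return float("inf")   # unseen bigram in an unsmoothed model
        logsum += math.log2(p)
        n += 1
    return 2 ** (-logsum / n)

# A completely predictable text has perplexity 1:
# perplexity(train_char_bigram("abababab"), "abab") == 1.0
```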
- Estimate the perplexity of a sub-part of the nlwiki corpus, according to
a character trigram model trained on a different sub-part of the nlwiki corpus.
- treatment of unknown words (ch 4.3.2 in edition2). What is wrong
with the suggested approach for unigram modeling? Hint: what happens
in the special case where *all* words in training data are not in the vocabulary?
Split (part of) the nlwiki corpus randomly into two parts (how?). The first, training,
part should contain 90% of the sentences, and the test part should
contain 10% of the sentences. In this exercise we want to
investigate how often an N-gram occurs in the test data that was
unseen in the training data.
- How many unigrams in the test-data are not present in the training data?
- How many bigrams in the test-data are not present in the training data?
- How many trigrams in the test-data are not present in the training data?
- How many fourgrams in the test-data are not present in the training data?
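A sketch of the bookkeeping (the random split answers the "(how?)" hint by shuffling sentences with a seeded random generator; the functions count n-gram types, counting tokens is a small variation):

```python
import random

def split_corpus(sentences, rng, test_fraction=0.1):
    """Random 90/10 split at the sentence level."""
    sents = list(sentences)
    rng.shuffle(sents)
    k = int(len(sents) * test_fraction)
    return sents[k:], sents[:k]          # (train, test)

def unseen_ngrams(train_sents, test_sents, n):
    """How many n-gram types of the test data never occur in training?"""
    def types(sents):
        return {tuple(s[i:i + n]) for s in sents for i in range(len(s) - n + 1)}
    test = types(test_sents)
    return len(test - types(train_sents)), len(test)
```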
- 5.6 (international edition: exercise 5.5). This is the question
about the implementation of a baseline POS-tagger.
Describe such an implementation in detail, but you don't need to
actually implement it.
- Consider a bigram HMM tagger. Work out in detail the steps of the
Viterbi algorithm in tagging the sentence "het was het
slapen", where we have the following probabilities (here I use
start and end to represent the start-state and final-state respectively):
P(D|start) = 0.1
P(N|start) = 0.4
P(V|start) = 0.2
P(PRON|start) = 0.1
P(het|D) = 0.25
P(het|PRON) = 0.05
P(was|N) = 0.0001
P(was|V) = 0.01
P(slapen|V) = 0.0001
P(slapen|N) = 0.0001
P(V|D) = 0.00001
P(N|D) = 0.4
P(V|PRON) = 0.05
P(N|PRON) = 0.001
P(D|V) = 0.01
P(PRON|V) = 0.01
P(PRON|N) = 0.001
P(D|N) = 0.001
P(end|D) = 0.25
P(end|PRON) = 0.25
P(end|N) = 0.25
P(end|V) = 0.25
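The hand computation can be checked with a short script (a sketch; transitions and emissions that are not listed above are taken to have probability 0):

```python
trans = {("start","D"): 0.1, ("start","N"): 0.4, ("start","V"): 0.2,
         ("start","PRON"): 0.1,
         ("D","V"): 0.00001, ("D","N"): 0.4,
         ("PRON","V"): 0.05, ("PRON","N"): 0.001,
         ("V","D"): 0.01, ("V","PRON"): 0.01,
         ("N","PRON"): 0.001, ("N","D"): 0.001,
         ("D","end"): 0.25, ("PRON","end"): 0.25,
         ("N","end"): 0.25, ("V","end"): 0.25}
emit = {("D","het"): 0.25, ("PRON","het"): 0.05,
        ("N","was"): 0.0001, ("V","was"): 0.01,
        ("V","slapen"): 0.0001, ("N","slapen"): 0.0001}
TAGS = ["D", "N", "V", "PRON"]

def viterbi(words):
    # chart[t] maps a tag to (best path probability, backpointer)
    chart = [{"start": (1.0, None)}]
    for w in words:
        column = {}
        for tag in TAGS:
            e = emit.get((tag, w), 0.0)
            p, prev = max(((q * trans.get((pr, tag), 0.0) * e, pr)
                           for pr, (q, _) in chart[-1].items()),
                          key=lambda x: x[0])
            if p > 0.0:
                column[tag] = (p, prev)
        chart.append(column)
    prob, last = max(((q * trans.get((tag, "end"), 0.0), tag)
                      for tag, (q, _) in chart[-1].items()),
                     key=lambda x: x[0])
    path = [last]                         # follow the backpointers
    for column in reversed(chart[2:]):
        path.append(column[path[-1]][1])
    return path[::-1], prob
```

For "het was het slapen" this yields the path ['PRON', 'V', 'D', 'N'] with probability approximately 6.25e-14.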
- Give a specification of a Viterbi algorithm for finding the best
tag sequence given a trigram HMM.
- What is the complexity of the Viterbi algorithm for tagging with bigrams,
in terms of the length of the sentence n, and the number of tags T.
- Idem, for trigram tagger.
- Indicate, for the HMM tagger, what problem(s) is/are caused by
the occurrence of unknown words. Provide some ideas for solving those problems.
- The Viterbi algorithm can be used to provide a single tag sequence
for a given sentence. Indicate how you might adapt the tagger to be
able to return multiple good tag sequences, for instance the best 10 sequences.
- Formal grammars of English [ch 12], sheets Martin lecture 12 (up to and including 60)
- Syntactic Parsing [ch 13], sheets Martin lecture 13 and 14
- CKY [ch 13]
- Probabilistic Context Free Grammar [ch 14], sheets Martin lecture 16 (13 - 28)
- Probabilistic Parsing with CKY [ch 14]
- Two tricks to make PCFG work better, sheets Martin lecture 17
- parent annotation
- head annotation
- Chomsky hierarchy [ch 16.1]
- Is natural language context free? [ch 16.3]
- Apply - on paper - the Earley algorithm on the sentence "Does the flight serve a meal", with the L1 grammar
from figure 13.8 in the book (on the left).
- Apply - on paper - the CKY algorithm on the sentence "Does the flight serve a meal", using the
"L1 in CNF" grammar from Figure 13.8 (on the right).
- 14.4: "fill out the rest of the chart in Fig. 14.4"
- 14.2: "Modify the algorithm for conversion to CNF from chapter 13 to handle
- Make sure you have read all relevant material from the book.