Lab 4: Markov Models and Part-of-Speech Tagging
NOTE: You may work in groups of up to 3 people in this assignment!
IMPORTANT: For the People working at home or somewhere else: You need
my the modified HMM package which is based
on the
one
by Tapas Kanungo and the
files in tagging for this exercise. You need to compile
the HMM source which might be tricky on Windows. You save a lot of trouble if
you do this exercise on a linux machine especially on hagen where you get the
compiled version installed!
Introduction
In this exercise we use Markov Models for automatic part-of-speech tagging and
word generation. We will use a
modified HMM package
(based on
this one
developed by Tapas Kanungo).
We will also use some additional scripts and data
provided by Philip Resnik for his
exercise in a course on computational linguistics at the
University of Maryland. Don't copy the
files from there because we will use additional data and modified software!
Hand-in: A report with solutions and answers to the questions and assignments
below.
The task of a part-of-speech tagger is to assign wordclass labels to words in
running text. Here is a little example of "English-like" text:
the/DT plane/NN can/MD fly/VB ./.
the/DT typical/JJ plane/NN can/MD see/VB the/DT plane/NN ./.
a/DT typical/JJ fly/NN can/MD see/VB ./.
The tagset is a small subset of the
WSJ tagset including the following basic wordclasses: DT (determiner), JJ
(adjective), NN (noun), VB (verb), WP (Wh question word), . (punctuation) and
MD (modal).
We will
- use tagged training data for supervised learning of a "visible Markov
model"
- use unlabeled training data to train a Hidden Markov Model
- combine both techniques
- apply learned models for tagging of unseen data and for word
generation
The HMM package and some additional scripts are installed on hagen at
~tiedeman/ML06/HMM
and the data we will use is in
~tiedeman/ML06/tagging
. Make a copy of the data files into your own
working directory.
Markov Models for part-of-speech tagging are used as follows: Words in the text
are used as the observed data items in the sequence they appear in the
text. They are generated from a sequence of states that correspond to the
part-of-speech labels assigned to these words. Hence we have probabilistic
transitions from a part-of-speech label to another in the model. The states in
the model produce discrete output namely the words in the text. We will only
consider the simplest model, the first-order Markov model in which each
part-of-speech label is dependent on the previous one and conditionally
independent of other labels.
Training a Visible Markov Model
First, we will build a Visible Markov Model for part-of-speech tagging using
supervised learning from
labeled training data. The training data is in
~tiedeman/ML06/tagging/example.train.tagged in the format as described above. For the
model we have to estimate 3 kinds of probabilities (model parameters):
- initial probabilities pi_i, the probability to start in a certain state
- transition probabilities a_ij, the probability to go from state i to state
j (wordclass i to wordclass j)
- emission probabilities b_ik, the probability to generate word k in state i
We do simple maximum likelihood estimation (MLE) with and without smoothing to
estimate our parameters (look at the end of
this file
for some additional information about MLE). Using MLE, probabilities can be
estimated from frequency counts like this:
P(x)=count(x)/N, where count(x) is the frequency of item x
in the data collection D and N is the total number of items in D
P(x|y)=count(x,y)/count(y), where count(x,y) is the
frequency of items x and y occurring together in D
Often it is necessary to "smooth" the probability estimations based on MLE in
order to be able to handle unseen events. A simple smoothing technique is the
"add-one discounting" estimation. Using this technique, probabilities are
estimated as follows:
P(x)=(count(x)+1)/(N+k), where k is the number of distinct
elements x in our data
P(x|y)=(count(x,y)+1)/(count(y)+m), where m is the number of
distinct elements x in the data (not 'y' as it was written earlier ...)
The HMM package works with numbers only. Therefore we have to convert our
data into numeric values. There are keyfiles for converting words into digits
(~tiedeman/ML06/tagging/example.words) and for converting labels into state
numbers (~tiedeman/ML06/tagging/example.states). For convenience, I already
converted the training data into numeric values
(~tiedeman/ML06/example.train.seq).
Now the assignments:
- Estimate the parameters of a Visible Markov Model from the data in
~tiedeman/ML06/tagging/example.train.seq and store them in the format used by
the HMM package (look at ~tiedeman/ML06/HMM/README to see the
requirements). Start with unsmoothed MLE estimates. The best way to do this is
to write a little script in your favorite scripting language to do the counts
for you and for generating the HMM file as required. Explain your solution
(include code if necessary) and show the resulting HMM in your report.
- You can view the HMM in a more readable form using the viewhmm.pl
script:
~tiedeman/ML06/HMM/viewhmm.pl supervised.hmm ~tiedeman/ML06/tagging/example.words 0 0 | less
where "supervised.hmm" is the file in which you saved your HMM. The last two
parameters are thresholds for displaying the transition and the emission
probabilities. Using the information you see, draw your model as a weighted
finite state automaton (you can do that by hand) and comment on the model and
its limitations.
- Markov models can be used to predict the next item in a sequence. Hence,
they can be used to generate text using the probabilistic transition and
emission distributions in the model.
Using your transition diagram from above and the output of viewhmm.pl, start at
the the most likely start state, and write down the most likely symbol to be
emitted there. (Break ties at random.) Then follow the most likely arc to the
next state, and write down the most likely symbol to be emitted there. Continue
in this fashion until you emit a punctuation mark, or until you get
bored. Write down your sequence.
- The HMM package includes a program for generating sequences at random
according to the model. Run, e.g.
~tiedeman/ML06/HMM/genseq -T 15 supervised.hmm
to generate a random sequence of 15 words from your model
(supervised.hmm). genseq includes a random number generator to alternate the
output. You can set the seed for this generator with the -S flag. In order to
see the actual words coming out of the model you can pipe the output through a
little script in the HMM package directory:
~tiedeman/ML06/HMM/genseq -T 15 supervised.hmm | ~tiedeman/ML06/HMM/ints2words.pl ~tiedeman/ML06/tagging/example.words
You can play around with this a little bit if you like. Give at least one
sequence generated by the model. Comment on the quality of the output (compared
to the input in the training data). Do they look ok and do you obtain sequences
not included in the training data?
- The Viterbi algorithm is used to find the optimal state sequence for a
given observation. This is used to tag a text and you can try it with your
model with the following command:
~tiedeman/ML06/HMM/testvit supervised.hmm ~tiedeman/ML06/tagging/example.test.wordseq
The input sequence above (example.test.wordseq) is generated from
an example text used for testing out
models (~tiedeman/ML06/tagging/example.test). You can also make
your own input for tagging with the words2seq.pl script:
echo "a plane can fly ." | ~tiedeman/ML06/HMM/words2seq.pl ~tiedeman/ML06/tagging/example.words > my_text
There is a tagged version of the example text used above in
~tiedeman/ML06/tagging/example.test.tagged that you can use as
gold standard. The transformation into state numbers is in
~tiedeman/ML06/tagging/example.test.tagseq. Tag the example text
with your model and compute the accuracy of the tagger for this test data. Look
at the tags produced by your model. Can you see a reason for the tagging
errors?
- Build another Visible Markov Model similar to the one above but with
smoothed probability distributions. Use the simple "add-one discounting"
approach as described earlier. Describe your solution and include the HMM with
its parameters in
your report. Tag the test data again with your new model and
compute the accuracy. Comment on the results and the effect of smoothing.
Training a Hidden Markov Model
In this exercise we train a Hidden Markov Model for part-of-speech tagging on
unlabeled data. For this we use the well-known Baum-Welch algorithm for EM
training of HMMs. The training data is in the following file:
~tiedeman/ML06/tagging/example.train. The transformation of this
file into numeric values is in ~tiedeman/ML06/tagging/example.seq
using the same transformation table as before
(~tiedeman/ML06/tagging/example.words).
- Train the HMM with the following command
~tiedeman/ML06/HMM/esthmm -M nr_of_symbols -N nr_of_states ~tiedeman/ML06/tagging/example.seq > unsupervised.hmm
You have to replace nr_of_symbols with the number of distinct
output symbols in your data (number of distinct words) and
nr_of_states with the number of states you wish to have in your
model. We know that we want to represent part-of-speech labels with the states
in the model and therefore we need one state for each label!
- Look at the model with
~tiedeman/ML06/HMM/viewhmm.pl unsupervised.hmm ~tiedeman/ML06/tagging/example.words 0 0 | less
Try to label each state with a distinct part-of-speech label according to the
probability distributions in the model (look especially at the emission
probabilities for the labeling). How good is the match between your
intuitions and the decisions the model made automatically?
Tag the test data as in exercise 5 on visible Markov models but now with
your HMM trained on unlabeled data. What is the accuracy of your model? Note
that states in your current model do not correspond to the states listed in
~tiedeman/ML06/tagging/example.states! You have to map the
resulting state sequence to a part-of-speech sequence using the labels assigned
to each state in the previous assignment. Comment on the tagging results!
- The training procedure as implemented in esthmm starts with a random
initial model. For this, it uses a random number generator. You can see that
the initial model changes each time you run esthmm (add the flag -v when
calling esthmm to see the initial model). What influence has the initial model
on the training procedure and the final model found? (Run the training several
times with the "-v" flag set and look at the number of re-estimation iterations
and the probabilities of the observed training data given the models)
- You can start with a specific initial model by setting a seed value for the
random number generator using the '-S' argument for esthmm. Select a seed and
run the training procedure once more. If you repeat you will see that you
always start with the same initial model and you always obtain the same final
model. Now you can give another argument to control the number of iterations to
be run. For example, to run 3 iterations run the following command (with seed
10):
~tiedeman/ML06/HMM/esthmm -v -S 10 -i 3 -M nr_of_symbols -N nr_of_states ~tiedeman/ML06/tagging/example.seq
Run the steps from 1 to 10 iterations and record the probability of the data
given the model reported by esthmm. You may plot this probability during
training if you like. Do you see a general tendency?
Combining supervised and unsupervised learning
In the previous exercise we could see that the final model learned from
unlabeled data depends on the initial model the training procedure starts
with. Supervised learning is usually more exact than unsupervised
learning. However, there is often too little training data available and
labeling additional data is expensive. A common strategy is therefore to
combine supervised with unsupervised learning. We can easily do that with HMMs
using the 2 strategies tested earlier. We start with supervised learning from
labeled training data to estimate the initial model and continue with
unsupervised training on unlabeled data.
- Use the Visible Markov Model with smoothed probability estimations produced
in the first assignment as initial
model to train an HMM on the unlabeled training data given in
~tiedeman/ML06/tagging/example.seq. Use the '-I' argument to
specify the file with the initial model to start with. Leave out the -M and -N
arguments! These parameters will be taken from the initial model. You can still
use "-v" and "-i" if you like. Compare the model to the ones produced earlier
(look for example at the probability of the data given the model).
- Why is it necessary to start the Baum-Welch training with smoothed
probability estimates?
- Tag the test data
~tiedeman/ML06/tagging/example.test.wordseq
again and compute accuracy. Compare the results with the ones from applying the
visible Markov models and the hidden Markov model.
- Look at the current model with viewhmm.pl in the same way as above and
comment on the model parameters. Do you see any striking differences to the
earlier models?
Bonus Assignment
Visualize parameter changes during unsupervised
training with animated plots. Decide yourself how to represent parameters and
their values. The nicest solution would be to have a little script that
generates the visual representation from given data and a training procedure.