Learning From Data LIX016M05
Hands-on introduction to machine learning for applications in
language and communication.
- lecturers: Gertjan van Noord and Simon Suster
- credits: 5
- part of the MA programme CIW/Information Science
- suggested literature: Manning, Raghavan, Schütze, Introduction to Information Retrieval. Cambridge University Press 2008.
- lecture on Monday, 13:00-14:45, room Oude Boteringestraat 23.001
- lab on Friday, 11:00-12:45, room Harmony Building 1312.0119A
- your grade is determined on the basis of the exercises - there is no exam
- exercises are to be submitted via Nestor - the deadline is usually the next Monday, at noon
- suggested literature: chapter 13 of Manning, Raghavan, Schütze; sheets 24-45.
- Introduction. Who are we. Purpose of the course. Requirements.
- Learning from data
- classification / regression
- supervised / unsupervised
- static / sequences
- Boolean classification with naive Bayes.
- Avoiding zero: smoothing
- Exercise set 1
- Classification using perceptrons. WEKA toolbox. sheets weka; sheets perceptron;
- Feature selection. K-nearest neighbors.
- discussion of the results of exercise set 1
- feature selection: mutual information
- vector space classification
- k-nearest neighbor (sheets 25 - 47)
- distance metrics
- Unsupervised learning: K-means clustering (sheets);
clustering in dialectology
- Unsupervised learning: Clustering of similar words. Brown clustering
- Linear Regression (Sheets)
- Progress meeting on Final project
- Final project: Presentation (January 23, 11am)
- Boolean classification by means of Naive Bayes. For this
exercise, you need to implement the classifier "from scratch": you
may not use machine learning toolboxes. Suggested
programming language: Python3. Task: classify a given tweet as
either "written in Dutch" or "not written in Dutch". Training data
is provided on the LWP machines as:
Secret test data is available to the teachers. Your grade for this exercise
will also be determined by the accuracy obtained on the test data!
You must submit one program, but you may submit a second
program. The first program should fulfill the minimum requirements
for this exercise. The second program can be submitted for a bonus
grade, awarded if you obtain the highest accuracy overall.
Additional remark: it is very interesting to compare your results with the results
reported in this paper.
- [required] Algorithm: (multinomial) Naive Bayes. Features: character counts.
- [optional] Algorithm: (multinomial) Naive Bayes. You can use whatever features you want - for instance words, character N-grams, etc.
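For the required variant, a minimal sketch of a multinomial Naive Bayes classifier over character counts with add-one smoothing (function names and data layout are my own choices; Python3, as suggested):

```python
from collections import Counter
import math

def train_nb(docs, labels):
    """Train multinomial Naive Bayes with add-one smoothing.
    docs: list of strings; labels: parallel list of class names."""
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for text, c in zip(docs, labels):
        counts[c].update(text)                 # character counts as features
    vocab = {ch for cnt in counts.values() for ch in cnt}
    loglik = {}
    for c in classes:
        total = sum(counts[c].values())
        loglik[c] = {ch: math.log((counts[c][ch] + 1) / (total + len(vocab)))
                     for ch in vocab}
    return prior, loglik, vocab

def classify(text, prior, loglik, vocab):
    """Return the class with the highest posterior log-probability;
    characters unseen in training are simply skipped."""
    best, best_score = None, None
    for c in prior:
        score = math.log(prior[c])
        score += sum(loglik[c][ch] for ch in text if ch in vocab)
        if best_score is None or score > best_score:
            best, best_score = c, score
    return best
```

Working in log space avoids numerical underflow on long inputs; the add-one term is exactly the smoothing discussed in the lecture.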
- Weka. The Perceptron. In the first few exercises, we work with the
same Naive Bayes classifier that you implemented last week, but
this time using Weka. We apply the classifiers once again to the
language identification task, with the same data sets as last week.
- First, you need to convert the Language Identification
dataset to the ARFF format used by Weka. As part of the conversion, you
create a list of attributes (features). This week, you are free to
use attributes of your choice. These can be characters, character
sequences, words or whichever other representation you may think
of. We prefer attribute sets which lead to better accuracy. The
conversion script should first create a "vocabulary" (an explicit
enumeration of the set of attributes, each associated with the unique
integer that is used in the ARFF file) and save it in a separate
file. This set of attributes and the corresponding integers must
remain fixed when your conversion script is run on unseen data
in the future. It is important that the conversion does not
introduce new attributes when run on new data, as this would cause
dataset compatibility issues in Weka.
For writing a conversion script, you may use the file
/net/shared/simsuster/NL_OTHER_part.arff as an example of a
converted file. Note that this file was obtained from only 20
instances of the training data. When implementing, you should pay
special attention to two facts. Firstly, some characters need to be
escaped in order to be properly read by Weka. This is best achieved by
enclosing all attributes in single quotes in the ARFF file, and
further replacing the single quote (') which might occur as an
attribute with another symbol (such as ';quote;'). Similar replacement
should be carried out for the backslash (\) sign. Of course, you may
also decide simply to exclude these characters from your vocabulary.
Secondly, the attribute indices in actual instances in section
@data should be ordered (as can be seen in the example file).
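A sketch of the writing step in Python3, illustrating both points above: the escaping of quotes and backslashes, and the ordered indices in the sparse @data section (the relation name, class values and replacement tokens are example choices, not requirements):

```python
def escape(attr):
    # replace characters that Weka cannot read inside quoted attribute names
    return attr.replace("\\", ";bslash;").replace("'", ";quote;")

def write_arff(instances, labels, vocab, path):
    """instances: list of {feature: count} dicts; vocab: feature -> 0-based
    integer index, fixed in advance and reused unchanged for unseen data."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("@relation langid\n\n")
        for feat, _ in sorted(vocab.items(), key=lambda kv: kv[1]):
            f.write("@attribute '%s' numeric\n" % escape(feat))
        f.write("@attribute class {NL,OTHER}\n\n@data\n")
        cls_idx = len(vocab)                       # class is the last attribute
        for counts, label in zip(instances, labels):
            pairs = sorted((vocab[ft], n) for ft, n in counts.items()
                           if ft in vocab)         # indices must be increasing
            parts = ["%d %d" % p for p in pairs] + ["%d %s" % (cls_idx, label)]
            f.write("{" + ",".join(parts) + "}\n")
```

The sparse format `{index value, ...}` keeps the file small when most attribute counts are zero.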
- Once you have implemented the conversion, report on the number of
attributes in your dataset. What is the simplest possible baseline
that you can think of? What accuracy does it give on the dataset?
- Now, run the Naive Bayes classifier
(weka.classifiers.bayes.NaiveBayesMultinomial). Since a separate test
set is not provided for you (we keep it secret!), you can use
cross-validation, as a means to judge the performance of your
classifier. Cross-validation is a technique that gives an estimate of
the expected accuracy on unseen data. The data is split into k parts
(also called folds), and the model is trained on all parts except one,
which is used for validation (testing). This is done k times, so that
each part of the data gets evaluated once. Finally, the average accuracy is
taken. By default, Weka sets k to 10. What is the accuracy of the
classifier? Is it roughly the same as with your own script from week
1? If the difference is substantial (>2%), what do you think the
reason could be?
What is the bias of the classifier, i.e. which of the classes (Dutch
or Other languages) gets misclassified more often?
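The cross-validation procedure described above can be sketched in plain Python3 as follows (note that Weka additionally stratifies the folds so that class proportions are preserved, which this sketch does not):

```python
def cross_validate(xs, ys, k, train, predict):
    """Estimate accuracy by k-fold cross-validation: hold out each fold
    in turn, train on the rest, and pool correctness over all folds."""
    correct = 0
    for i in range(k):
        held_out = set(range(i, len(xs), k))      # every k-th item is fold i
        train_x = [x for j, x in enumerate(xs) if j not in held_out]
        train_y = [y for j, y in enumerate(ys) if j not in held_out]
        model = train(train_x, train_y)
        correct += sum(predict(model, xs[j]) == ys[j] for j in held_out)
    return correct / len(xs)
```

Any pair of train/predict functions can be plugged in, e.g. a majority-class baseline to sanity-check your numbers.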
- Next, run the Voted Perceptron
(weka.classifiers.functions.VotedPerceptron) and report the accuracy
together with the number of iterations used. Is the bias of the
classifier still in favor of one of the classes (and by the same
amount as before)? Note that perceptron training takes considerably longer
than training the Naive Bayes classifier.
- (bonus) Implement the "vanilla" perceptron classifier
(i.e. the perceptron without any extensions) in Python3 and submit the code.
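For reference, a sketch of what the vanilla algorithm amounts to, using sparse feature dicts and labels in {+1, -1} (the representation and function names are my own choices):

```python
def train_perceptron(xs, ys, epochs=10):
    """Vanilla perceptron. xs: list of {feature: value} dicts,
    ys: parallel list of +1/-1 labels."""
    w, b = {}, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            score = b + sum(w.get(f, 0.0) * v for f, v in x.items())
            if y * score <= 0:                # misclassified: update weights
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + y * v
                b += y
    return w, b

def predict(w, b, x):
    """Sign of the linear score decides the class."""
    score = b + sum(w.get(f, 0.0) * v for f, v in x.items())
    return 1 if score > 0 else -1
```

Unlike the Voted Perceptron in Weka, this version keeps only the final weight vector.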
To sum up, you are asked to submit the following:
- conversion script from text to ARFF (for Naive Bayes, language identification)
- file with the vocabulary (for Naive Bayes, language identification)
- file with the model constructed by Weka (for Naive Bayes, language identification)
- text file with your answers to the questions above
- (bonus) perceptron code
- Feature Selection. kNN classification.
- Consider, once again, the language identification data of week 1 (and week 2).
As features, we use characters. List (in order) the 25 highest scoring features according to mutual information. Only consider
features which occur at least fifty times.
- Consider, once again, the language identification data of week 1 (and week 2).
As features, we use words. List (in order) the 25 highest scoring features according to mutual information. Only consider
features which occur at least five times.
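Mutual information between a binary feature and a binary class can be computed from four document counts, following the formula in chapter 13 of Manning, Raghavan, Schütze. A sketch (here n11 is the number of documents that contain the feature and belong to the class, n10 contains the feature but not the class, and so on):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI (in bits) between feature presence and class membership,
    computed from the 2x2 document-count table."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # each term: joint count, feature-marginal count, class-marginal count
    for n_tc, n_t, n_c in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if n_tc > 0:                     # 0 * log(0) is taken to be 0
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi
```

Score every feature this way (after applying the frequency threshold), sort descending, and take the top 25.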
- Use WEKA to figure out how well kNN classification works on a named entity classification task for Dutch. In this task,
named entities are to be classified as one of four categories: ORG (organization), PER (person), LOC (location) or MISC.
Apart from the named entity itself, the context of the named entity is given (the two words left of the entity and the
two words right of the entity). The training data is given as
Here are the first few lines of the data file:
ORG Floralux communicatiebureau dat inhuurde .
MISC BPA met een dat het
ORG Floralux Vandaag is dus met
MISC BPA maar het waarmee die
PER Christiane Vandenbussche omdat zaakvoerster haar schepenambt
PER Vandenbussche aanleg werd begin de
Each line represents an occurrence of a named entity in context. Fields are separated by TAB. The fields
represent respectively the category, the named entity, word-2, word-1, word+1 and word+2.
You are free to use a feature set of your choice.
Please upload your solution, in such a way that we can apply your solution to a hidden test set.
Tips. In my initial experiments, I found that only the word immediately to the left of the named entity provides
much relevant information (as opposed to the other words in the context). Furthermore, using Weka, I get best results using
the lazy.IBk implementation with K=1. For some reason, using lazy.IB1 makes the classifier hang. Finally, my best result using
10-fold cross-validation on the training data is 85.3% correct classifications, using 8639 features; or 87.2% with over 40K features.
As last time, in my approach I first generate a dictionary: a file which contains all the features that I want to use. This file is
generated on the basis of the training data. Using the dictionary and the training data, a conversion script then creates the ARFF file.
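Reading the data file is straightforward; a sketch of a line parser that, following the tip above, keeps only the entity itself and the word immediately to its left (the feature names are my own invention):

```python
def parse_ne_line(line):
    """Split one TAB-separated line into (category, feature dict).
    Fields: category, entity, word-2, word-1, word+1, word+2."""
    cat, entity, wm2, wm1, wp1, wp2 = line.rstrip("\n").split("\t")
    feats = {"entity=" + entity: 1,
             "w-1=" + wm1: 1}      # word-1 carries most of the signal
    return cat, feats
```

From these feature dicts you can build the vocabulary and the ARFF file exactly as in the language identification exercise.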
- Preparing the arff file. In this assignment, you will be
working on text categorization using Reuters texts classified in 6
categories. The file is available as /net/shared/simsuster/reuters-allcats.csv
(comma-separated, class label then a list of words). Convert the csv
file to arff. This can be done in Weka or by implementing your own
conversion script. Note that converting with Weka might first give you
a single string attribute, which then needs to be further processed
into numeric attributes (as in Exercise 2). You can use the
"weka.filters.unsupervised.attribute.StringToWordVector". (We prefer
representations leading to better clustering results.) Please document
all processing steps that you take.
Report the number of documents in each class.
- Run K-means. K-means clustering is available in Weka as
"SimpleKMeans". Use "classes to clusters" evaluation. This ignores
the class attribute during clustering, and afterwards assigns a class
to each cluster, based on the majority value of the class attribute
within that cluster. The classification error is then computed
based on this assignment. Report the overall classification error.
Report the accuracy (purity) for each of the clusters. Which
clusters are most often confused?
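What Weka computes in this evaluation can be sketched as follows (a helper of my own, useful for checking your understanding of the reported numbers):

```python
from collections import Counter

def classes_to_clusters(cluster_ids, true_labels):
    """Label each cluster with its majority class; return overall
    accuracy and per-cluster purity, as in Weka's evaluation."""
    by_cluster = {}
    for cid, lab in zip(cluster_ids, true_labels):
        by_cluster.setdefault(cid, Counter())[lab] += 1
    correct = sum(cnt.most_common(1)[0][1] for cnt in by_cluster.values())
    purity = {cid: cnt.most_common(1)[0][1] / sum(cnt.values())
              for cid, cnt in by_cluster.items()}
    return correct / len(cluster_ids), purity
```

The classification error Weka reports is one minus the accuracy returned here.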
- In the dialectology example, pronunciations of a specific word are compared.
In many cases, the differences are small and of a phonetic/phonological nature. In some cases, however,
the difference is lexical. For instance, in the data for the word "duivel" (devil), there are many
variants of "duivel" but also variants of "lucifer".
This data file is available as
Your task is to convert this data file to ARFF and use Weka to come up with a clustering of the data, in such a
way that all variants of "duivel" will end up in a cluster, and all variants of "lucifer" will end up in a
different cluster, etc.
Lines with a single '|' can be ignored, as well as lines which contain two variants separated by '/'.
Bonus-points if you can manage to get a third cluster with the lexeme "satan".
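A sketch of the preprocessing, under my reading of the filter rule above (drop lines that consist of just '|', and lines containing two variants joined by '/'), with character bigram counts as one possible featurization for clustering:

```python
from collections import Counter

def keep(line):
    """Filter rule from the assignment, as I read it: skip lines that
    are a single '|' and lines with two variants separated by '/'."""
    s = line.strip()
    return s != "|" and "/" not in s

def char_bigrams(s):
    # one possible featurization of a pronunciation variant
    return Counter(s[i:i + 2] for i in range(len(s) - 1))
```

Bigram counts keep "duivel"-like variants close to each other while separating them clearly from "lucifer"-like variants.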
- Clustering with Cluto. For this exercise, we use the cluto toolbox.
A manual can easily be found on-line. The command-line programs are available in directory /opt/netapps/bin.
The data set that you work with in this example is generated on the basis of automatically parsed Dutch newspaper texts. It extracts,
for many sentences, pairs of the type "Head Name", where the Name is an apposition of the Head. Examples include "president Bush", "cyclist Joop Zoetemelk", etc.
We want to use this data to cluster named entities. Each named entity (given various frequency thresholds) is represented as a vector where each of
the dimensions are counts of head words (such as "president"). The data file is given as
The first line of the file documents each of the dimensions. The rest of the file gives, for a named entity, the frequency of each of the dimensions.
- Write a script to convert this data file to the cluto format, for use with the vcluster command. You have a choice whether to use the
sparse or the dense matrix representation. The latter may be easier. Also take into account the option to generate a file with the labels.
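A sketch of the dense variant as a string-building function, so you can write the result wherever you like (the row-label file can then be passed to vcluster with the -rlabelfile option, if I read the manual correctly):

```python
def cluto_dense(vectors):
    """Dense cluto matrix format: first line '<nrows> <ncols>', then one
    whitespace-separated row of values per line."""
    rows = ["%d %d" % (len(vectors), len(vectors[0]))]
    for vec in vectors:
        rows.append(" ".join(str(v) for v in vec))
    return "\n".join(rows) + "\n"
```

The row labels (the named entities) go in a separate file, one label per line, in the same order as the matrix rows.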
- Apply the vcluster program to cluster this data, using agglomerative clustering with the cosine similarity score. One possible command-line invocation:
vcluster -showtree -sim=cos -clmethod=agglo -plottree=plot.ps vectors.cluto 2
In that case, the file plot.ps will contain a graphical picture of the tree, in postscript. This can be displayed with the "gv" command. Alternatively, you
can convert postscript to pdf with the pstopdf command. Ensure that the tree looks reasonable, indicating that the clustering actually did well.
A larger data set is/will be available as
Try out the clustering on this data set as well.
- Is it possible to assign, manually, reasonable class labels to the ten topmost clusters in the final tree? Rather than looking in the
tree, you may want to specify the number of clusters, 10, as the final argument to the vcluster command, and inspect the clustering in the FILE.clustering.10 output file
(combine it with the row labels).
- Is it possible to assign, manually, the labels MISC, ORG, LOC and PER to the ten topmost clusters in the final tree, in a reasonable way?
- Final project: Predicting opening-weekend revenue for movies from critic reviews.
From this dataset, you are only allowed to use the training and development part. The data
can be found in /net/shared/simsuster/movies-data-v1.0/domains-train-dev.tl
Shortly, you will also find in
the same data-set
These resources may be useful for some of the features that you may wish to try for this problem.
We focus on the prediction of the overall revenue (not the per-screen revenues).
The task, data set and results are described in this paper.
You can use Weka, for instance the SimpleLinearRegression method. Weka will also produce the standard evaluation
results (mean absolute error, and correlation).
- segmented with Splitta
- POS-tagged with Citar
- Dependency parsed with MSTParser
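The two evaluation figures Weka reports for regression can also be reproduced by hand, which is useful for checking your setup. A sketch:

```python
import math

def mean_absolute_error(gold, pred):
    """Average absolute difference between gold and predicted revenues."""
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)

def pearson(gold, pred):
    """Pearson correlation between gold and predicted values."""
    n = len(gold)
    mg, mp = sum(gold) / n, sum(pred) / n
    cov = sum((g - mg) * (p - mp) for g, p in zip(gold, pred))
    sg = math.sqrt(sum((g - mg) ** 2 for g in gold))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (sg * sp)
```

A correlation of 1.0 means the predictions are a perfect linear function of the gold revenues, even if the absolute error is large.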
You need to submit, by January 21:
- A written report (about 2 pages), clarifying how you tried to solve the task, which features
you used, whatever else you tried, and how well you solved the task on the development set
- All scripts, data files and model files that we need in order to apply your solution to the test set
- A README file which contains very clear instructions (the precise UNIX commands) so that we
can apply your solution to the test data
On January 23, 11am, we have a final meeting for this class where you present (5 minutes) your solution. Room:
Note that this exercise set counts for 2/7 of the final grade. Each of the other five exercise sets counts for 1/7 of
the final grade.