Learning From Data LIX016M05

Hands-on introduction to machine learning for applications in language and communication.



    • Suggested literature: Chapter 13 of Manning, Raghavan & Schütze; sheets 24-45.
    • Introduction. Who we are. Purpose of the course. Requirements.
    • Learning from data
      • classification / regression
      • supervised / unsupervised
      • static / sequences
    • Boolean classification with naive Bayes.
    • Avoiding zero: smoothing
    • Exercise set 1
  1. Classification using perceptrons. WEKA toolbox. Sheets: weka; sheets: perceptron.
  2. Feature selection. K-nearest neighbors.
    • discussion of the results of exercise set 1
    • feature selection: mutual information
    • vector space classification
    • k-nearest neighbor (sheets 25 - 47)
    • distance metrics
      • dice
      • cosine
  3. Unsupervised learning: K-means clustering (sheets); clustering in dialectology
  4. Unsupervised learning: Clustering of similar words. Brown clustering
  5. Linear Regression (Sheets)
  6. Progress meeting on Final project
  7. Final project: Presentation (January 23, 11am)

Exercise sets

  1. Boolean classification by means of Naive Bayes. For this exercise, you must implement the classifier "from scratch": you may not use machine-learning toolboxes. Suggested programming language: Python3. Task: classify a given tweet as either "written in Dutch" or "not written in Dutch". Training data is provided on the LWP machines as:
    Secret test data is available to the teachers. The grade of your exercise will also be determined by the accuracy obtained on the test data! You must submit one program, but you may submit a second one as well. The first program should fulfill the minimum requirement for this exercise. The second program can be submitted for a bonus grade, awarded if you obtain the highest accuracy overall.
    1. [required] Algorithm: (multinomial) Naive Bayes. Features: character counts.
    2. [optional] Algorithm: (multinomial) Naive Bayes. You can use whatever features you want - for instance words, character N-grams, etc.
    Additional remark: it is very interesting to compare your results with the results reported in this paper.
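    To make the required algorithm concrete, the core of a multinomial Naive Bayes classifier over character counts can be sketched as below. This is only an illustration, not the required solution: the input representation (a list of strings with a parallel list of labels) and the add-one (Laplace) smoothing choice are my own assumptions.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Train a multinomial Naive Bayes model on character counts.
    docs: list of strings; labels: parallel list of class labels."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, lab in zip(docs, labels):
        counts[lab].update(doc)          # per-class character counts
    vocab = set(ch for c in classes for ch in counts[c])
    loglik = {}
    for c in classes:
        total = sum(counts[c].values())
        # add-one (Laplace) smoothing avoids zero probabilities
        loglik[c] = {ch: math.log((counts[c][ch] + 1) / (total + len(vocab)))
                     for ch in vocab}
    return prior, loglik, vocab

def classify(doc, prior, loglik, vocab):
    """Return the class with the highest posterior log-probability."""
    scores = {}
    for c in prior:
        s = math.log(prior[c])
        for ch in doc:
            if ch in vocab:              # ignore characters never seen in training
                s += loglik[c][ch]
        scores[c] = s
    return max(scores, key=scores.get)
```

    Note that ignoring unseen characters at test time is one of several reasonable options; you could also reserve smoothed mass for them.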
  2. Weka. The Perceptron. In the first few exercises, we work with the same Naive Bayes classifier that you implemented last week, but this time using Weka. We apply the classifiers once again to the language identification task, with the same datasets as last week.
    1. First, you need to convert the Language Identification dataset to the ARFF format used by Weka. As part of the conversion, you create a list of attributes (features). This week, you are free to use attributes of your choice. These can be characters, character sequences, words or whatever other representation you can think of. We prefer attribute sets which lead to better accuracy. The conversion script should first create a "vocabulary" (an explicit enumeration of the set of attributes, each associated with the unique integer that is used in the ARFF file) and save it in a separate file. This set of attributes and the corresponding integers must remain fixed when your conversion script is run on unseen data in the future. It is important that the conversion does not introduce new attributes when run on new data, as this would cause dataset compatibility issues in Weka. For writing a conversion script, you may use the file /net/shared/simsuster/NL_OTHER_part.arff as an example of a converted file. Note that this file was obtained from only 20 instances of the training data. When implementing, you should pay special attention to two facts. Firstly, some characters need to be escaped in order to be properly read by Weka. This is best achieved by enclosing all attributes in single quotes in the ARFF file, and further replacing the single quote (') which might occur as an attribute with another symbol (such as ';quote;'). A similar replacement should be carried out for the backslash (\) sign. Of course, you may also decide simply to exclude these characters from your vocabulary. Secondly, the attribute indices in actual instances in the @data section should be ordered (as can be seen in the example file).
    2. Once you have implemented the conversion, report on the number of attributes in your dataset. What is the simplest possible baseline that you can think of? What accuracy does it give on the dataset?
    3. Now, run the Naive Bayes classifier (weka.classifiers.bayes.NaiveBayesMultinomial). Since a separate test set is not provided to you (we keep it secret!), you can use cross-validation as a means to judge the performance of your classifier. Cross-validation is a technique that gives an estimate of the expected accuracy on unseen data. The data is split into k parts (also called folds), and the model is trained on all parts except one, which is used for validation (testing). This is repeated k times, so that each part of the data is evaluated once. Finally, the average accuracy is taken. By default, Weka sets k to 10. What is the accuracy of the classifier? Is it roughly the same as with your own script from week 1? If the difference is substantial (>2%), what do you think the reason could be? What is the bias of the classifier, i.e. which of the classes (Dutch or Other languages) gets misclassified more often?
    4. Next, run the Voted Perceptron (weka.classifiers.functions.VotedPerceptron) and report the accuracy together with the number of iterations used. Is the bias of the classifier still in favor of one of the classes (and by the same amount as before)? Note that training the perceptron takes longer than training the Naive Bayes classifier.
    5. (bonus) Implement the "vanilla" perceptron classifier (i.e. perceptron without any extensions) in Python3 and submit the code. To sum up, you are asked to submit the following:
      • conversion script from text to ARFF (for Naive Bayes, language identification)
      • file with the vocabulary (for Naive Bayes, language identification)
      • file with the model constructed by Weka (for Naive Bayes, language identification)
      • text file with your answers to the questions above
      • (bonus) perceptron code
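    For the bonus item, the "vanilla" perceptron amounts to the sketch below. The sparse feature-dictionary representation, the bias term and the fixed number of epochs are my own choices, not requirements of the exercise.

```python
def perceptron_train(data, epochs=10):
    """Vanilla perceptron for binary classification.
    data: list of (features, y) pairs, where features is a dict
    mapping attribute -> value and y is +1 or -1."""
    w = {}   # sparse weight vector
    b = 0.0  # bias term
    for _ in range(epochs):
        mistakes = 0
        for feats, y in data:
            score = b + sum(w.get(f, 0.0) * v for f, v in feats.items())
            if y * score <= 0:           # misclassified: additive update
                for f, v in feats.items():
                    w[f] = w.get(f, 0.0) + y * v
                b += y
                mistakes += 1
        if mistakes == 0:                # converged on the training data
            break
    return w, b

def perceptron_predict(feats, w, b):
    score = b + sum(w.get(f, 0.0) * v for f, v in feats.items())
    return 1 if score > 0 else -1
```

    Unlike the Voted Perceptron above, this version keeps only the final weight vector.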
  3. Feature Selection. kNN classification.
    1. Consider, once again, the language identification data of week 1 (and week 2). As features, we use characters. List (in order) the 25 highest scoring features according to mutual information. Only consider features which occur at least fifty times.
    2. Consider, once again, the language identification data of week 1 (and week 2). As features, we use words. List (in order) the 25 highest scoring features according to mutual information. Only consider features which occur at least five times.
    3. Use WEKA to figure out how well kNN classification works on a named entity classification task for Dutch. In this task, named entities are to be classified as one of four categories: ORG (organization), PER (person), LOC (location) or MISC. Apart from the named entity itself, the context of the named entity is given (the two words left of the entity and the two words right of the entity). The training data is given as
      Here are the first few lines of the data file:
      ORG     Floralux        communicatiebureau      dat     inhuurde        .
      MISC    BPA     met     een     dat     het
      ORG     Floralux        Vandaag is      dus     met
      MISC    BPA     maar    het     waarmee die
      PER     Christiane Vandenbussche        omdat   zaakvoerster    haar    schepenambt
      PER     Vandenbussche   aanleg  werd    begin   de
      Each line represents an occurrence of a named entity in context. Fields are separated by TAB. The fields represent respectively the category, the named entity, word-2, word-1, word+1 and word+2. You are free to use a feature set of your choice. Please upload your solution, in such a way that we can apply your solution to a hidden test set.

      Tips. In my initial experiments, I found that only the word immediately to the left of the named entity provides much relevant information (as opposed to the other words in the context). Furthermore, using Weka, I get the best results with the lazy.IBk implementation with K=1. For some reason, using lazy.IB1 makes the classifier fall asleep. Finally, my best result using 10-fold cross-validation on the training data is 85.3% correct classifications, using 8639 features, or 87.2% with over 40K features.

      As last time, in my approach I first generate a dictionary: a file which contains all the features that I want to use. This file is generated on the basis of the training data. Using the dictionary and the training data, a conversion script then creates the arff file.
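    The mutual information scoring used in the first two parts can be sketched as follows, using the N11/N10/N01/N00 contingency-table formulation of Manning, Raghavan & Schütze (chapter 13). Two simplifications are my own: each feature is treated as a binary document-level event, and the frequency threshold is applied to document frequency rather than raw occurrence counts.

```python
import math
from collections import Counter

def mutual_information(docs, labels, target_class, min_count=5):
    """Rank features by mutual information with a target class.
    docs: list of feature lists (e.g. tokenized tweets);
    labels: parallel list of class labels."""
    N = len(docs)
    df = Counter()                        # document frequency per feature
    df_pos = Counter()                    # ... within the target class
    n_pos = sum(1 for lab in labels if lab == target_class)
    for feats, lab in zip(docs, labels):
        for f in set(feats):
            df[f] += 1
            if lab == target_class:
                df_pos[f] += 1
    scores = {}
    for f, n in df.items():
        if n < min_count:                 # frequency threshold
            continue
        # contingency counts: feature present/absent x class yes/no
        n11 = df_pos[f]
        n10 = n - n11
        n01 = n_pos - n11
        n00 = N - n11 - n10 - n01
        mi = 0.0
        for nxy, nx, ny in ((n11, n, n_pos), (n10, n, N - n_pos),
                            (n01, N - n, n_pos), (n00, N - n, N - n_pos)):
            if nxy > 0:
                mi += nxy / N * math.log2(N * nxy / (nx * ny))
        scores[f] = mi
    return sorted(scores, key=scores.get, reverse=True)
```

    Note that mutual information is symmetric: a feature perfectly anti-correlated with the target class scores just as highly as one perfectly correlated with it.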

  4. Clustering.
    1. Preparing the arff file. In this assignment, you will be working on text categorization using Reuters texts classified into 6 categories. The file is available as /net/shared/simsuster/reuters-allcats.csv (comma-separated: class label, then a list of words). Convert the csv file to arff. This can be done in Weka or by implementing your own conversion script. Note that converting with Weka might first give you a text string as an attribute, which then needs to be further processed to obtain numeric attributes (as in Exercise 2). You can use the filtering tool "weka.filters.unsupervised.attribute.StringToWordVector". (We prefer representations leading to better clustering results.) Please document all processing steps that you take. Report the number of documents in each class.
    2. Run K-means. K-means clustering is available in Weka as "SimpleKMeans". Use "classes to clusters" evaluation. This ignores the class attribute during clustering; afterwards, it assigns a class to each cluster, based on the majority value of the class attribute within that cluster. The classification error is then computed based on this assignment. Report the overall classification error. Report the accuracy (purity) for each of the clusters. Which clusters are most often confused?
    3. In the dialectology example, pronunciations of a specific word are compared. In many cases, the differences are small and of a phonetic/phonological nature. In some cases, however, the difference is lexical. For instance, in the data for the word "duivel" (devil), there are many variants of "duivel" but also variants of "lucifer". This data file is available as http://www.let.rug.nl/~heeringa/dialectology/atlas/rnd/words/unicode/035.txt Your task is to convert this data file to ARFF and use Weka to come up with a clustering of the data, in such a way that all variants of "duivel" end up in one cluster, and all variants of "lucifer" end up in a different cluster, etc. Lines with a single '|' can be ignored, as well as lines which contain two variants separated by '/'. Bonus points if you can manage to get a third cluster with the lexeme "satan".
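    To make the "classes to clusters" evaluation above concrete, it can be reproduced by hand as below. The function name and the input representation (parallel lists of cluster ids and class labels) are hypothetical; they are not part of Weka's interface.

```python
from collections import Counter

def classes_to_clusters(cluster_ids, class_labels):
    """Mimic Weka's "classes to clusters" evaluation: map each cluster
    to its majority class, then count the remaining mismatches.
    Returns (overall error rate, per-cluster purity)."""
    by_cluster = {}
    for cl, lab in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cl, Counter())[lab] += 1
    errors = 0
    purity = {}
    for cl, counts in by_cluster.items():
        majority = counts.most_common(1)[0][1]   # size of the majority class
        purity[cl] = majority / sum(counts.values())
        errors += sum(counts.values()) - majority
    return errors / len(class_labels), purity
```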
  5. Clustering with Cluto. For this exercise, we use the cluto toolbox. A manual can easily be found on-line. The command-line programs are available in the directory /opt/netapps/bin. The data set that you work with in this exercise is generated on the basis of automatically parsed Dutch newspaper texts. For many sentences, it extracts pairs of the type "Head Name", where Name is in apposition to Head. Examples include "president Bush", "cyclist Joop Zoetemelk", etc. We want to use this data to cluster named entities. Each named entity (subject to various frequency thresholds) is represented as a vector where each of the dimensions is a count of a head word (such as "president"). The data file is given as
    The first line of the file documents each of the dimensions. The rest of the file gives, for a named entity, the frequency of each of the dimensions.
    1. Write a script to convert this data file to the cluto format, for use with the vcluster command. You have a choice whether to use the sparse or the dense matrix representation. The latter may be easier. Also take into account the option to generate a file with the labels.
    2. Apply the vcluster program to cluster this data, using agglomerative clustering with the cosine similarity score. One possible command-line is:
      vcluster -showtree  -sim=cos -clmethod=agglo -plottree=plot.ps vectors.cluto 2
      In that case, the file plot.ps will contain a plot of the tree, in PostScript. This can be displayed with the "gv" command. Alternatively, you can convert PostScript to pdf with the pstopdf command. Check that the tree looks reasonable, indicating that the clustering actually did well.
    3. A larger data set is/will be available as
      Try out the clustering on this data set as well.
    4. Is it possible to assign, manually, reasonable class labels to the ten highest clusters in the final tree? Rather than looking in the tree, you may want to specify the number of clusters, 10, as the final argument to the vcluster command, and inspect the clustering in the FILE.clustering.10 output file (combine it with the row labels).
    5. Is it possible to assign, manually, the labels MISC, ORG, LOC, PER to the ten highest clusters in the final tree, in a reasonable way?
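    A minimal conversion script for step 1 might look as follows. It assumes cluto's dense matrix format (a header line "nrows ncols", then one row of values per line) and a tab-separated input whose first line names the dimensions; check both assumptions against the cluto manual and the actual data file before relying on this sketch.

```python
def to_cluto_dense(infile, matfile, labelfile):
    """Convert the head-word frequency table to cluto's dense matrix
    format, writing the row labels (named entities) to a separate file
    for use with vcluster's label option.
    Assumed input: a tab-separated header line naming the dimensions,
    then one line per named entity: label, then one count per dimension."""
    with open(infile) as f:
        lines = [ln.rstrip("\n") for ln in f if ln.strip()]
    dims = lines[0].split("\t")
    rows, labels = [], []
    for ln in lines[1:]:
        fields = ln.split("\t")
        labels.append(fields[0])
        rows.append(fields[1:1 + len(dims)])
    with open(matfile, "w") as f:
        f.write(f"{len(rows)} {len(dims)}\n")   # dense header: nrows ncols
        for row in rows:
            f.write(" ".join(row) + "\n")
    with open(labelfile, "w") as f:
        f.write("\n".join(labels) + "\n")
```

    The sparse format would be more compact for this data, but as the exercise notes, the dense representation is easier to generate.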
  6. Final project: Predicting opening-weekend revenue for movies from critic reviews. Dataset: www.ark.cs.cmu.edu/movie$-data/ From this dataset, you are only allowed to use the training and development parts. The data can be found in /net/shared/simsuster/movies-data-v1.0/domains-train-dev.tl Shortly, you will also find in
    the same dataset:
    • segmented with Splitta
    • POS-tagged with Citar
    • Dependency parsed with MSTParser
    These resources may be useful for some of the features that you may wish to try for this problem. We focus on predicting the overall revenue (not the per-screen revenue). The task, data set and results are described in this paper. You can use Weka, for instance the SimpleLinearRegression method. Weka will also produce the standard evaluation results (mean absolute error and correlation).
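    For intuition, a single-feature least-squares fit together with the two evaluation measures mentioned above (mean absolute error and correlation) can be sketched in plain Python. This illustrates the model class behind SimpleLinearRegression, which picks the single best attribute; it is not a substitute for running Weka.

```python
import math

def simple_linear_regression(xs, ys):
    """Least-squares fit of y = a*x + b for a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def evaluate(xs, ys, a, b):
    """Mean absolute error and Pearson correlation of the predictions."""
    preds = [a * x + b for x in xs]
    mae = sum(abs(p - y) for p, y in zip(preds, ys)) / len(ys)
    mp, my = sum(preds) / len(preds), sum(ys) / len(ys)
    cov = sum((p - mp) * (y - my) for p, y in zip(preds, ys))
    corr = cov / math.sqrt(sum((p - mp) ** 2 for p in preds) *
                           sum((y - my) ** 2 for y in ys))
    return mae, corr
```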

    You need to submit, by January 21:

    • A written report (about 2 pages) explaining how you tried to solve the task, which features you used, what else you tried, and how well your solution performs on the development set
    • All scripts, data files and model files that we need in order to apply your solution to the test set
    • A README file which contains very clear instructions (the precise UNIX commands) so that we can apply your solution to the test data
    On January 23, 11am, we have a final meeting for this class where you present (5 minutes) your solution. Room: Harmony 1312.025.

    Note that this exercise set counts for 2/7 of the final grade. Each of the other five exercise sets counts for 1/7 of the final grade.