Here I try to collect additional information/hints for assignment 2 .... general comments ---------------- - look at the data first and how they are organized before starting to implement something - remember that you have to do 10-fold cross validation --> think of some convenient way to split the data set into 10 partitions (they should be of approximatly same size and should contain data for each speaker in the set) - you don't have to do all processing in one single program (for example you can do the pre-processing before learning the naive bayes model and before doing the classificaion. you may use, for example, Unix command line tools to do simple pre-processing) - you can do learning and classification in one go (meaning that you don't have to store the model parameters on disk somehow) - try to be careful about using internal memory! don't store the entire text collection in memory for learning your model! run sequentially through the collection, text by text, to count the necessary frequencies training -------- - don't store the entire text collection in memory - run sequentially through your training data, text by text, and count the frequencies you need (you have to do some tokenization, i.e. split the text into words and other tokens) - you can either store the frequency counts or estimated probabilities on disk - you can also do training and classification in one go --> run the training (do the counts and estimations) --> run the classification --> calculate accuracy classification -------------- - read the model parameters if you saved them to disk - read texts from your evaluation set one by one and classify them according to your model (of course you need to do the same pre-processing as for training) - if you experiment with different kinds of pre-processing make sure that you use the same one for training and evaluation - it's probably a good idea to store your parameters in some hash structures to easily get access to the ones you need for classification - multiplying small numbers may cause a number underflow --> use log for the maximization in your classification (addition is also faster!) v-best = argmax_v P(v) prod P(w|v) = argmax_v log [ P(v) prod_w P(w|v) ] = argmax_v log P(v) + sum_w [ log P(w|v) ] (using logs is possible because a logarithmic fuction is monotonic!) pre-processing -------------- simple tokenization with Unix command-line tools: - use tr for 'translating' certain characters into others, for example tr -s ' ' '\n' < input.txt > output.txt + this replaces every space character with a new-line character + you still have the punctuations attached to words! --> you could use sed to separate punctuations from words before running 'tr' tr -cs A-Za-z '\n' input.txt > output.txt + this replaces every character that is NOT [A-Za-z] with a new-line character + but you will loose all characters not being [A-Za-z]! --> you could define a different character set to keep What do you get from this? --> one token per line --> now you can read tokens line by line and count frequencies This can, of course, be done in a much more sophisticated way! counting -------- There are also Unix commands to do some counting if you don't want to do this inside of your program If you have your input with one token per line you can run the following command: sort input_one_token_per_line.txt | uniq -c > counts.txt estimating probabilities ------------------------ We do simple MAXIMUM LIKELIHOOD ESTIMATION (MLE) to get our probabilities: That means: if we assume that words w are outcomes of a random process and that they are independent of each other we can estimate the probability P(w) = freq(w)/N where freq(w) is the frequency of the word w and N is the total number of words (actually it is the sum of frequencies for all words in the text which is equal to the total number of words) If we condition on the class v we count frequencies within texts with this classification. P(w|v) = freq(wv)/Nv where freq(wv) is the frequency of word w in texts with classification v and Nv is the total frequency of words in texts with classification v. The problem with MLE is that it doesn't "reserve" probability mass for events that do not occur in the training data. That means that such probabilities will be 0 (impossible). However, you might see such events later in data instances that you want to process. However, the probability of such an event is 0 in your model meaning that it should be impossible! This ususally crashes your model (making the whole data instance impossible) A solution is to use prior knowledge to reserve part of the probability mass for events that do not occur in the training data. This is where the m-estimate comes into play: P(w) = ( freq(w) + m*p ) / ( N + m ) where p is the prior estimate of your P(w) and m is called the "equivalent sample size". You can think of m "virtual examples" which are distributed according to te prior p If you don't know more about the distribution of your data --> assume a uniform distribution (all w are equally likely)! p = 1/k where k is the number of different words (=vocabulary) (which are possible outcomes of the random process) Now you still have to decide on the number of "virtual examples" you like to add. The simplest solution is to add one virtual example per outcome: m = k That's how you get the add-one discounting estimate: P(w) = ( freq(w) + 1 ) / ( N + k ) Note that even add-one estimates use some kind of prior knowledge, namely that each outcome is possible! So this estimate is actually not MLE anymore, strictly speaking! In general this is called smoothing and there are many more (advanced) ways of doing it (for example Good Turing smoothing). Maximum Likelihood Estimation ----------------------------- Why is it called MLE? MLE actually means that you try to estimate your parameters f by maximizing the likelihood of some data D given your parameters f. In general we look for parameters f that maximize P(f|D) f-best = argmax_f P(f|D) = argmax_f P(D|f) P(f) As you know, in MLE we assume that all f are uniformly distributed, meaning that P(f) is a constant that can be dismissed in the maximization: f-MLE = argmax_f P(D|f) These are the parameters f that maximize the likelihood of our data, i.e. that best "explain" the data set D. Now we have a bag-of-words approach where we need the probabilities of each word in our vocabulary. In other words our parameters f are the P(w) for each word w. We also assume that all words are independent of each other (which is not true but anyway ...). It can be shown that the following estimations are the ones that maximize P(D|f): P(w) = freq(w) / N (as discussed above)