Jerom F. Janssen (2007)
Diachronical Text Classification
A study of text properties and their changes over time
Master's thesis, Rijksuniversiteit Groningen.
[ Paper (PDF, 1983 kb) ]

Introduction

Automatic text classification is the automatic assignment of a text to a predefined class, based on features of both that text and the class or classes it could belong to. It is a dynamic field of research and a relatively young one: one of its earliest papers, “Automatic Indexing: An Experimental Inquiry”, dates from 1961 (Maron, 1961). Text classification has many applications, such as fraud detection,[1] spam email detection[2] and authorship attribution (Chaski, 2005). Several challenges remain, for example in authorship attribution, i.e. establishing who authored certain texts. Seminal, extensive research on the disputed authorship of several of the Federalist Papers was published by Mosteller and Wallace (Mosteller & Wallace, 1964). They found that James Madison was statistically the most likely author of all of the essays whose authorship was disputed, and that probably neither Alexander Hamilton nor John Jay had written these texts. This was also the conclusion of the majority of historians who had looked into the problem. The 1964 study thus served as verification of the historians' opinions, even though Mosteller and Wallace had looked at the evidence in an entirely different way. Whereas the historians had relied on semantics and their knowledge of history to reach their conclusions, Mosteller and Wallace had used a carefully designed statistical analysis of texts by Jay, Hamilton and Madison. In other words, the historians relied on external evidence, whereas Mosteller and Wallace used internal evidence as a basis for their conclusions.[3]

Several topics related to text classification appear frequently in the media, such as automatic indexing to establish the subject of a text, authorship identification, and the attribution of texts to a certain category. An example of the last of these is the ongoing battle against spammers who send out unsolicited bulk email, or spam email.[4] Many applications in the field of text classification have been developed and refined over the years, owing both to progressive insight into existing problems and to the introduction of new problems along the way. The detection of spam email is an example of such a new problem: it was novel in the sense that the texts to be classified were very small, yet existing techniques such as Naïve Bayes based algorithms could be applied and fine-tuned to address it. In the field of authorship identification, applications have taken the form of software that identifies fraudulent students who hand in coursework or even theses written by others.[5] The exposure of authors writing under a pseudonym has also become a possibility (Zundert, 2002). Another application of automated text analysis is the automatic recognition of a text's language. Microsoft makes use of this technology in its word processor Word 2003, which can invoke the appropriate spelling and grammar checkers by analyzing (parts of) the text of a document.

The subject of this thesis is diachronic text classification, and it is written with several goals in mind. The study, which is exploratory in nature (as opposed to hypothesis-testing), aims to find features of English, as present in published texts, that are to some degree characteristic of the age in which those texts were published. The study should also show to what extent these features can be used with machine learning techniques to accurately attribute texts to a certain period in time. Two machine learning algorithms were chosen: Naïve Bayes and k-Nearest Neighbor. The Naïve Bayes algorithm has a track record of good performance in text classification research and is used in many applications, which almost allows it to function as a benchmark against which other machine learning algorithms can be compared. It performs well when the data largely follow regular patterns, but its performance drops when the data contain many exceptions to those patterns. The k-Nearest Neighbor algorithm was included in this research because it is more robust with regard to exceptions.

One possible use of diachronic text classification is as a verification tool, either by itself or alongside authorship verification. Discussion amongst scholars might give rise to such use. There are long-standing discussions concerning Shakespeare's authorship of certain texts, as can be read in articles such as “The Man Who Shakespeare Was Not (and Who He Was)” (Ogburn, 1974) and “Shakespeare as Shakespeare” (Evans & Levin, 1975), published as a response to that article. The publication of the book “The Truth Will Out: Unmasking the Real Shakespeare” shows that the debate is not over yet (James & Rubinstein, 2005). Another possible use of diachronic text classification is to aid scholars in dating texts that cannot be linked with enough confidence to an author, and therefore cannot be assigned to a period by that form of inference. Information on the time in which a text was written might exclude certain candidate authors, thus narrowing down the collection of possible authors.

Although the controversy concerning the authorship of several works usually attributed to Shakespeare initially sparked my interest in text classification, the focus is not on Shakespeare's works in particular. His works alone cannot give enough insight into the changes that the English language has gone through over time. The reasons for this are twofold. First, Shakespeare's works simply do not span enough time to allow for the detection of significant changes in language over time. Second, when looking at one author, one is more likely to find changes in that author's style than changes over time in general. A larger corpus is therefore needed. The corpus used in this research is a subset of the collection of digitized texts, or e-Texts, distributed by Project Gutenberg.[6] It provides ample data in the form of plain-text e-Texts, stemming from many different authors and representing many centuries of literature, plays, magazines, poems, etc., potentially allowing for the discovery of language features that are specific to a time rather than to one or more authors.

In this research three different approaches are evaluated in order to study the diachronic information stored inside a text. Some aspects of this diachronic information, which tells us something about the time in which the text originated, are discussed and evaluated. Each approach requires a different type of feature, but each uses the same data set: a hand-picked selection of files from the Project Gutenberg corpus. The motives for hand-picking are both theoretical and practical, and they are strongly linked. Control over the “cleanliness” of the files that were to function as data was desired, so that unsuitable Project Gutenberg files could be excluded. An example of an unsuitable file is one that lists the first one million decimals of π. A more practical reason is that extra information needed to be linked to the files that were deemed suitable, such as who the author was and when the text was published. For this latter step, each file had to be processed manually.

The first approach consists of looking at linguistic features, focusing on entire words, as they are present in the data set. The presence (or absence) of certain so-called “marker” words, such as pronouns and verb forms like “thou”, “hast” and “y'all” and nouns like “automobile”, “car”, “AIDS” and “email”, as well as words from other syntactic categories, is considered a linguistic feature here. No syntactic parsing is used in this research, nor is a distinction made between words occurring inside or outside quotation marks. Punctuation is also ignored, for reasons clarified in Chapter Three.
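
As an illustration of this first approach, the presence or absence of marker words could be computed along the following lines. This is a minimal sketch and not the procedure used in the thesis: the function names, the tokenization rule and the marker list (taken from the examples above) are assumptions made for illustration. Lowercasing is a simplification that, for instance, conflates the noun “AIDS” with the verb “aids”.

```python
import re

# Hypothetical marker list, copied from the examples in the text.
MARKER_WORDS = ["thou", "hast", "y'all", "automobile", "car", "aids", "email"]

def tokenize(text):
    """Lowercase the text and return its words. Punctuation is dropped,
    but word-internal apostrophes (as in "y'all") are kept."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def marker_features(text, markers=MARKER_WORDS):
    """Map each marker word to True if it occurs in the text, else False."""
    present = set(tokenize(text))
    return {word: word in present for word in markers}

sample = "Thou hast no car, and 'email' was unknown."
features = marker_features(sample)
# "thou", "hast", "car" and "email" are present; the other markers are not.
```

Each text then yields one boolean per marker word, which can serve as a feature vector for a classifier.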

The second approach in the search for linguistic features that can function as indicators of the diachronic origin of texts is to look at the morphology of words rather than at entire words. For instance, changes over time in the relative frequencies of words ending in “-st”, “-ly”, “-ed” and “-ing” are taken into account, as are morphemes with Latin and Greek roots.
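
The relative frequency of words with a given ending could be sketched as follows. The function name and the tokenization are assumptions made for illustration, and matching on the final letters of a word is only a rough stand-in for morphological analysis (“red” ends in “-ed” without being a past tense).

```python
import re

# Endings named in the text; plain string matching is a simplification.
SUFFIXES = ["st", "ly", "ed", "ing"]

def suffix_frequencies(text, suffixes=SUFFIXES):
    """Return, for each suffix, the fraction of word tokens ending in it."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return {s: 0.0 for s in suffixes}
    return {s: sum(w.endswith(s) for w in words) / len(words) for s in suffixes}

sample = "Thou didst walk slowly; he walked quickly, singing."
freqs = suffix_frequencies(sample)
# The sample has 8 words: one ends in "-st", two in "-ly",
# one in "-ed" and one in "-ing".
```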

The third approach moves away from more conventional linguistic features such as word or morpheme frequencies. It focuses instead on features like the average word length, the vowel-consonant ratio, the uppercase-lowercase ratio and unigrams on the character level.
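
The features of this third approach are simple enough to sketch directly. The code below is an illustration under stated assumptions, not the thesis implementation: only the letters a-z are counted, “y” is treated as a consonant, and the function names are hypothetical.

```python
import re
from collections import Counter

VOWELS = set("aeiou")  # assumption: "y" is counted as a consonant

def surface_features(text):
    """Average word length, vowel-consonant ratio and uppercase-lowercase ratio."""
    words = re.findall(r"[A-Za-z]+", text)
    letters = [ch for ch in text if ch.isascii() and ch.isalpha()]
    vowels = sum(ch.lower() in VOWELS for ch in letters)
    consonants = len(letters) - vowels
    upper = sum(ch.isupper() for ch in letters)
    lower = len(letters) - upper
    return {
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "vowel_consonant_ratio": vowels / consonants if consonants else 0.0,
        "upper_lower_ratio": upper / lower if lower else 0.0,
    }

def char_unigram_frequencies(text):
    """Relative frequency of each letter (case-insensitive) in the text."""
    letters = [ch.lower() for ch in text if ch.isascii() and ch.isalpha()]
    total = len(letters)
    return {ch: n / total for ch, n in Counter(letters).items()} if total else {}

stats = surface_features("The Cat sat.")
unigrams = char_unigram_frequencies("The Cat sat.")
```

For the sample sentence, the three words each have three letters, three of the nine letters are vowels, and two are uppercase.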

The above three categories are closely related: if words are looked at, then, in a sense, morphemes are taken into account as well. For instance, the meaning “before” can be expressed through the Latin morphemes “pre” (e.g. in the word “prefix”) and “ante” (e.g. in the word “antemeridian”), but also through the morpheme “pro” (e.g. in the word “prologue”), which is of Greek origin. The reasons for selecting morphemes as linguistic features were twofold. On the one hand, it could be interesting to see whether there were measurable shifts in the use of Latin or Greek morphemes over time. On the other hand, selecting a few hundred words and tracking their frequency over time might be less effective than selecting morphemes, as many different words share the same set of morphemes. Selecting just a few dozen morphemes might therefore yield usable data spanning hundreds if not thousands of words. In addition, it seemed conceivable that a steady decrease or increase in the frequency of words of Greek or Latin origin would be reflected in a decrease or increase of words with Greek or Latin morphemes, perhaps resulting in a measurable change over time in average word length or in the variance of word length.

These three angles are used in conjunction with two different machine learning techniques: Naïve Bayes and k-Nearest Neighbor. The two techniques are compared on their accuracy in assigning texts to certain points in time, i.e. the estimated date of publication of these texts.
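
To make the notion of accuracy concrete, the sketch below implements a small k-Nearest Neighbor classifier and measures the fraction of test texts assigned to the correct period. It is an illustration only: the thesis relies on WEKA rather than on code like this, the feature vectors and period labels are made up, and the function names are hypothetical. A Naïve Bayes classifier could be scored with the same kind of accuracy function.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Return the majority label among the k training vectors closest to
    the query, using Euclidean distance (math.dist needs Python 3.8+)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k=3):
    """Fraction of test items whose predicted label equals the true label."""
    correct = sum(knn_predict(train, vec, k) == label for vec, label in test)
    return correct / len(test)

# Made-up feature vectors (e.g. two relative frequencies) with period labels.
train = [
    ((0.10, 0.02), "1600s"), ((0.12, 0.03), "1600s"), ((0.11, 0.01), "1600s"),
    ((0.02, 0.10), "1900s"), ((0.03, 0.12), "1900s"), ((0.01, 0.11), "1900s"),
]
test = [((0.09, 0.02), "1600s"), ((0.02, 0.09), "1900s")]
score = accuracy(train, test, k=3)
```

With these well-separated toy vectors both test items are classified correctly; real corpus data would be far less clear-cut.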

This research will not try to discover as many diachronic-information-rich features as possible, nor is it an extensive comparison between several machine-learning techniques. The study aims to test the possibility of diachronic text classification using digitized texts.

Chapter Two of this thesis discusses the Naïve Bayes and k-Nearest Neighbor algorithms in detail. Chapter Three covers the corpus used, both the automatic and the manual editing of the corpus data, the selection process that determined which e-Texts were used, and how metadata in terms of word occurrence/absence and descriptive statistics was extracted from the manipulated corpus data. Chapter Four describes how this metadata was transformed to make it suitable for use with the software package WEKA (Frank & Witten, 2005), which performs the data analysis with the aforementioned machine learning techniques. It also provides a description of the feature tests used in this research. Chapter Five discusses the results. It shows that, for our corpus size and feature selection, the k-Nearest Neighbor algorithm usually outperforms the Naïve Bayes algorithm. It briefly discusses which feature tests do not work, and then turns to the morpheme feature tests, which showed potential but need tuning before they can perform well. The discussion focuses, however, on the feature tests that seemed to work quite well: the unigram test on the character level and the feature test that takes into account the relative frequencies of the Top N most frequent words of the entire corpus. The circumstances in which they performed well are discussed, and their different strengths and weaknesses with respect to this research are highlighted.
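
The Top N feature test mentioned above can be pictured as follows: the N most frequent words are determined over the whole corpus, and each text is then described by the relative frequencies of those words. This is a minimal sketch under assumptions; the function names, the tokenization and the toy corpus are hypothetical and do not come from the thesis.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its words, ignoring punctuation."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def top_n_words(corpus_texts, n):
    """Return the n most frequent words over all texts in the corpus."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(n)]

def relative_frequencies(text, vocabulary):
    """Relative frequency in one text of each word in the vocabulary."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total if total else 0.0 for w in vocabulary]

# Toy corpus of two "texts"; a real corpus would contain whole e-Texts.
corpus = ["the cat and the dog", "the bird and the fish and the cat"]
vocab = top_n_words(corpus, 2)
vector = relative_frequencies(corpus[0], vocab)
```

Here the two most frequent words are “the” and “and”, and the first text is represented by their shares of its five tokens.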


  1. See http://turnitin.com/static/, last accessed on 29 January 2007, for an example of a commercial application in the field of academic fraud detection
  2. See http://spamassassin.apache.org, last accessed on 29 January 2007, for an example of text classification software widely adopted in commercial settings. Many companies make use of this software to limit the amount of spam email sent across the Internet. Also see the Volkskrant newspaper article online at http://www.volkskrant.nl/technologie/article390447.ece/Spammer_wordt_slimmer_ndash_en_zijn_bestrijder_ook, last accessed on 29 January 2007
  3. For a discussion on external and internal evidence in authorship attribution, see Foster, 1989
  4. A historical note: it is believed that a DEC marketing representative sent the first spam email in 1978 to all users of email on the west coast of the United States at that time, just seven years after Ray Tomlinson sent the first email over a network in 1971. For more information, see http://thelongestlistofthelongeststuffatthelongestdomainnameatlonglast.com/first96.html, last accessed on 8 September 2006, and http://openmap.bbn.com/~tomlinso/ray/firstemailframe.html, last accessed on 8 September 2006
  5. Again, see http://turnitin.com/static/, last accessed on 29 January 2007, for an example of a commercial application in the field of academic fraud detection
  6. See http://www.gutenberg.org/wiki/Main_Page, last accessed on 10 September 2006