Bas van Driel (2001)
The use of Stemming in an Information Retrieval System
Master's thesis, Rijksuniversiteit Groningen.
[ Paper (PDF, 163 kb) ]

1. Introduction

Information retrieval is now one of the most important technologies for finding the textual information we need within a structure that we can search and analyze. Since the explosion of digital text on the World Wide Web, this information is stored in and spread over billions of documents, which increases the demand for 'smart' retrieval systems. This 'smartness' can be achieved by numerous techniques. These techniques should interpret data in such a way that the returned documents are more than just the result of a simple data comparison between the query and the document collection. The idea is that a query represents a need for information rather than merely data, so the answer to the query has to reflect this need.

In this thesis I will explain several techniques through which the retrieval of data can be enhanced towards the retrieval of information. My main focus, however, will be on the specific technique of stemming. This technique makes use of the fact that most words in a text are derived from a base stem. A document related to a query should be retrieved even when the exact word searched for, for example 'connections', does not occur explicitly in its text. To achieve this, the system stems both the query words and the words in the vocabulary, so that query and vocabulary contain only stems. In this example the query is transformed to 'connect', and all words derived from 'connect' (connections, connectivity, connecting) are represented by the stem 'connect' in the vocabulary. With this method a query retrieves a larger number of documents. Additionally, a form of semantic clustering takes place, since most words derived from a base stem can be considered semantically related to that stem.
Of course this method comes at a cost, and that cost is paid in precision. If one generalizes specific words into base stems, it becomes harder to give a precise answer to a query. When searching for 'connections', a document that contains the word 'connectivity' twice will score the same as a document that contains the search word 'connections' twice. In this way, precision is lost with respect to what was actually searched for: one would expect the document that explicitly contains the word 'connections' to score higher, which it does not in this case. I am interested in the specific advantages and disadvantages that this tradeoff brings about.
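
The tradeoff described above can be made concrete with a small sketch. The suffix list and the functions `toy_stem` and `score` below are hypothetical and chosen only for this illustration; they are not the stemming algorithm evaluated in this thesis, and the two example documents are invented. The sketch shows that with stemming both documents score the same for the query 'connections', whereas without stemming only the document that literally contains the query word is found.

```python
# Toy suffix stripper for illustration only. This is NOT the stemmer
# used in the thesis; the suffix list is a hypothetical assumption.
SUFFIXES = ("ivity", "ions", "ion", "ing", "ed", "s")  # longer suffixes first


def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def score(query, document, stem=True):
    """Count occurrences of the (optionally stemmed) query term in the document."""
    normalise = toy_stem if stem else str.lower
    target = normalise(query)
    return sum(1 for token in document.split() if normalise(token) == target)


# Two invented documents: one uses the query word, the other a related word.
doc_a = "connections between hosts and connections between networks"
doc_b = "connectivity of hosts and connectivity of networks"

stems = {w: toy_stem(w) for w in ("connections", "connectivity", "connecting")}
with_stemming = (score("connections", doc_a), score("connections", doc_b))
without_stemming = (
    score("connections", doc_a, stem=False),
    score("connections", doc_b, stem=False),
)

print(stems)
print("with stemming:   ", with_stemming)
print("without stemming:", without_stemming)
```

With stemming, the second document is retrieved at all (a gain in the number of retrieved documents), but it can no longer be ranked below the first (a loss in precision).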

To test the use of stemming, an implementation of a basic search engine based on inverted files will be discussed. The document collection that will be searched is that of the USA-Web, located at http://www.let.rug.nl/~usa/. This website contains HTML documents with information on the history of the United States of America from the colonial period until the present and is, at the moment of writing, a relatively small collection of about 3500 documents in English.
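
For readers unfamiliar with the term, the general idea of an inverted file can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation discussed in chapter 4: the function names `build_inverted_file` and `search`, the whitespace tokenisation, and the three miniature documents are all hypothetical. The `normalise` parameter marks the point where a stemmer could be plugged in for both the vocabulary and the query.

```python
from collections import defaultdict


def build_inverted_file(documents, normalise=str.lower):
    """Map each normalised term to {document id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for token in text.split():
            term = normalise(token)
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query, normalise=str.lower):
    """Return (document id, frequency) pairs for one query term, best first."""
    postings = index.get(normalise(query), {})
    return sorted(postings.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical miniature collection; the real collection consists of HTML pages.
documents = {
    "d1": "the colonial period and the colonial assemblies",
    "d2": "the constitution after the colonial period",
    "d3": "the civil war",
}

index = build_inverted_file(documents)
results = search(index, "Colonial")
print(results)
```

Because the index is keyed by term, answering a query only requires looking up the postings of the query term instead of scanning every document.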

First (chapter 2), I will describe several distinct qualities of an IR system and provide a definition of Information Retrieval.
After this description, in chapter 3, a summary of the basic concepts and ideas within the field of Information Retrieval is given. This provides the background needed to follow the aspects discussed in this thesis. Information on different techniques is also given in order to explain my choice of an inverted-file-based IR system as opposed to vector-based or probabilistic matching.
In chapter 4, the implementation will be discussed, followed by an evaluation of results obtained through this implementation (chapter 5).
Finally, I will give a conclusion (chapter 6) that should provide a satisfying answer to the question "What advantages and disadvantages can be seen with respect to the use of a stemming algorithm in an Information Retrieval system applied to a relatively small single-language document collection?"