In this thesis I will explain several techniques through which the
retrieval of data can be enhanced towards the retrieval of information.
My main focus, however, will be on one specific technique: stemming. This
technique makes use of the fact that most of the words in a text are
derived from a base stem. A document related to a query should be
retrieved even when the exact word searched for, for example
'connections', does not explicitly occur in its text. To achieve this,
the system stems both the query words and the words in the vocabulary,
so that query and vocabulary contain only stems of words. In this
example the query would be transformed to 'connect', and all words
derived from connect (connections, connectivity, connecting) would be
represented in the vocabulary by the stem 'connect'. In this way more
documents are retrieved for a given query.
Additionally, a form of semantic clustering takes place, since most
words derived from a base stem can be considered semantically related
to that stem.
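The idea of mapping derived words onto a shared stem can be illustrated with a minimal sketch. It assumes a toy suffix-stripping rule; the suffix list and the name `toy_stem` are hypothetical and chosen only to cover the 'connect' example, and this is not the stemming algorithm evaluated in this thesis.

```python
# Toy suffix stripper for illustration only. The suffix list is a
# hypothetical assumption, ordered so that longer suffixes are tried first.
SUFFIXES = ("ivity", "ions", "ion", "ing", "ed", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["connections", "connectivity", "connecting", "connect"]
stems = {w: toy_stem(w) for w in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

All four words end up under the single vocabulary entry 'connect'. A rule set this crude would mis-stem many other English words, which is why a real system relies on a more carefully designed algorithm.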
Of course this method has a drawback, which concerns precision. If one generalizes specific words into base stems, it becomes harder to give a precise answer to a query. For example, when searching for 'connections', a document that contains the word 'connectivity' twice will score the same as a document that contains the search word 'connections' twice. In this way, precision is lost with respect to what is actually searched for: one would expect the document that explicitly contains the word 'connections' to score higher, which it does not in this case. I am interested in the specific advantages and disadvantages this tradeoff brings about.
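The loss of precision described above can be made concrete with a small sketch. It assumes that a document's score is simply the number of times the query term occurs in it, and it uses a hypothetical lookup table (`STEMS`) in place of a real stemming algorithm; both documents are made-up examples.

```python
from collections import Counter

# Hypothetical stem lookup standing in for a real stemming algorithm.
STEMS = {
    "connections": "connect",
    "connectivity": "connect",
    "connecting": "connect",
}


def stem(word: str) -> str:
    """Return the stem of a word, or the word itself if it is not in the table."""
    return STEMS.get(word, word)


def term_frequency(query_word: str, text: str, use_stemming: bool) -> int:
    """Count occurrences of the query word in the text, optionally after stemming."""
    tokens = text.lower().split()
    if use_stemming:
        tokens = [stem(t) for t in tokens]
        query_word = stem(query_word)
    return Counter(tokens)[query_word]


# Made-up documents: one uses 'connectivity' twice, the other 'connections' twice.
doc_a = "connectivity matters and connectivity grows"
doc_b = "connections matter and connections grow"

for use_stemming in (False, True):
    score_a = term_frequency("connections", doc_a, use_stemming)
    score_b = term_frequency("connections", doc_b, use_stemming)
    print(f"stemming={use_stemming}: doc_a={score_a}, doc_b={score_b}")
```

Without stemming only the second document matches the query 'connections'; with stemming both documents receive the same score, so the exact match can no longer be ranked above the related one.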
To test the use of stemming, an implementation of a basic search engine based on inverted files will be discussed. The document collection that will be searched is that of the USA-Web, located at http://www.let.rug.nl/~usa/. This website contains HTML documents on the history of the United States of America from the colonial period until the present, and is at the moment of writing a relatively small collection of about 3500 documents in the English language.
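As a rough illustration of what an inverted file is, the following sketch maps each term to the documents in which it occurs, together with an occurrence count. The function name `build_inverted_file`, the whitespace tokenization, and the two sample documents are assumptions made for this example; the actual implementation is the subject of chapter 4.

```python
from collections import defaultdict


def build_inverted_file(documents: dict) -> dict:
    """Map each term to a dictionary {document id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Assumption: terms are lowercased and separated by whitespace.
        for token in text.lower().split():
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index


# Made-up sample documents, not taken from the USA-Web collection.
documents = {
    "d1": "the colonial period",
    "d2": "the colonial colonial history",
}

index = build_inverted_file(documents)

# Looking up a query term yields the matching documents without scanning the texts.
postings = index.get("colonial", {})
print(postings)
```

With stemming enabled, the same structure would be built over stems instead of the original words, so that the query stem and the vocabulary entries can be compared directly.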
First (chapter 2), I will describe several distinct qualities of an IR
system and provide a definition of Information Retrieval.
After this description, chapter 3 gives a summary of the basic concepts and ideas within the field of Information Retrieval. This provides background on the field, so that the reader will feel comfortable with the aspects discussed in this thesis. Information on different techniques is also given, in order to explain my choice of an inverted file based IR system as opposed to vector based or probabilistic matching.
In chapter 4, the implementation will be discussed, followed by an evaluation of results obtained through this implementation (chapter 5).
Finally, I will draw a conclusion (chapter 6) that should give a satisfactory answer to the question "What advantages and disadvantages can be seen with respect to the use of a stemming-algorithm in an Information Retrieval system applied to a relatively small single-language document-collection?".