Ismail Fahmi (2004)
Examining Learning Algorithms for Text Classification In Digital Libraries
Master's thesis, Rijksuniversiteit Groningen.
[ Paper (PDF, 390 kb) ]


Information presentation in a digital library plays an important role, especially in improving the usability of collections and helping users get started with a collection. One approach is to provide an overview through large topical category hierarchies associated with the documents of a collection. But as the amount of information grows, this manual classification becomes a new problem for users: navigating the hierarchy can be a time-consuming and frustrating process.

In this master's thesis, we examine the performance of machine learning algorithms for automatic text classification. We examine three learning algorithms, namely ID3, Instance-Based Learning, and Naive Bayes, to classify documents according to their category hierarchies. We focus on effectiveness measures such as recall, precision, the F1-measure, error, and the learning curve in learning a manually classified metadata collection from the Indonesian Digital Library Network (IndonesiaDLN), and we compare the results with an examination of the Reuters-21578 dataset. We summarize which algorithm is most suitable for the digital library collection and how the algorithms perform on these datasets.
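
As an illustration of the effectiveness measures named above, the following is a minimal sketch of how precision, recall, and the F1-measure can be computed for a single category from true and predicted labels. It is not taken from the thesis: the function names, the toy labels, and the definition of error as the fraction of misclassified documents are assumptions made for this example, and the thesis may define or average these measures differently.

```python
def precision_recall_f1(true_labels, predicted_labels, category):
    """Per-category precision, recall, and F1 (hypothetical helper)."""
    pairs = list(zip(true_labels, predicted_labels))
    # Count true positives, false positives, and false negatives for the category.
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    # Guard against division by zero when a category is never predicted or never occurs.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def error_rate(true_labels, predicted_labels):
    """Fraction of misclassified documents (assumed definition of error)."""
    wrong = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return wrong / len(true_labels)


# Toy labels, invented for illustration only.
true_labels = ["a", "a", "b", "b", "a"]
predicted_labels = ["a", "b", "b", "a", "a"]

precision, recall, f1 = precision_recall_f1(true_labels, predicted_labels, "a")
error = error_rate(true_labels, predicted_labels)
```

For category "a" in this toy data there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3, and two of the five documents are misclassified.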