Project


I have finished my Ph.D. thesis, "Linguistic Knowledge and Word Sense Disambiguation", and successfully defended it on November 1, 2004. For more information, see the publications page. My Ph.D. project was part of the PIONIER project Algorithms for Linguistic Processing, and my supervisors were Gertjan van Noord and John Nerbonne.

The main goal of my Ph.D. project was to investigate Word Sense Disambiguation (WSD) for Dutch using statistical methods in combination with different sources of linguistic information. I developed a system that makes as much use as possible of linguistic information. The main research questions are:

The types of linguistic knowledge I have been looking at so far include morphological information (lemmas), syntactic information (part of speech, dependency relations), and semantic information (EuroWordNet). Pragmatic information (topic) will be left for future work, I'm afraid.

During the first year of my Ph.D. studentship, I conducted experiments with a naive Bayes classifier on a 3 million word corpus of Dutch. Since there is very little disambiguated material available (without which evaluation of results is not possible), I circumvented this problem by artificially creating such data using pseudowords. The method of pseudowords consists of introducing a form of artificial ambiguity in (untagged) corpora: all occurrences of two or more distinct words are replaced by a single conflated token, and the original word serves as the correct "sense" of each occurrence. In particular, I investigated whether disambiguating pseudowords is comparable to disambiguating real ambiguous words, and came to the conclusion that the two tasks are not interchangeable.
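To make the pseudoword idea concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not the code used in the experiments: the function name make_pseudoword_corpus, the toy sentences, and the word pair "bank"/"deur" are all hypothetical.

```python
def make_pseudoword_corpus(sentences, word_a, word_b):
    """Conflate word_a and word_b into one artificial ambiguous token.

    Returns a list of (tokens, answers) pairs, where answers holds
    (position, original_word) for every replaced occurrence. The original
    word plays the role of the gold-standard "sense".
    """
    pseudoword = f"{word_a}_{word_b}"
    result = []
    for tokens in sentences:
        new_tokens = []
        answers = []
        for position, token in enumerate(tokens):
            if token in (word_a, word_b):
                new_tokens.append(pseudoword)
                answers.append((position, token))
            else:
                new_tokens.append(token)
        result.append((new_tokens, answers))
    return result


# Hypothetical toy corpus of tokenized Dutch sentences.
corpus = [
    ["de", "bank", "is", "gesloten"],
    ["de", "deur", "is", "open"],
    ["hij", "loopt", "naar", "huis"],
]
tagged = make_pseudoword_corpus(corpus, "bank", "deur")
```

A classifier is then trained to recover the original word from the context of each pseudoword occurrence, so it can be evaluated without any hand-tagged data.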

In my current research, I am systematically investigating which sources of linguistic knowledge work best for WSD and in what combination. Using a Maximum Entropy classifier, the different linguistic sources are first tested individually on their value for WSD. In a second step, the ways of combining the acquired information are looked at.
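The two-step setup described above (each source on its own first, then combinations) can be sketched roughly as follows. This is a hypothetical illustration, not the thesis implementation: the source names, the helpers source_subsets and select_features, and the example instance are my assumptions, and the training of the Maximum Entropy classifier itself is left out.

```python
from itertools import combinations

# Assumed labels for the knowledge sources mentioned on this page.
SOURCES = ["lemma", "pos", "dependency", "eurowordnet"]


def source_subsets(sources):
    """Yield every single source first, then all larger combinations."""
    for size in range(1, len(sources) + 1):
        for subset in combinations(sources, size):
            yield subset


def select_features(instance, subset):
    """Keep only the features that come from the sources in subset.

    instance maps a source name to the list of feature values it provides;
    the result is a flat list of "source:value" strings for a classifier.
    """
    return [
        f"{source}:{value}"
        for source in subset
        for value in instance.get(source, [])
    ]


# Hypothetical instance for one occurrence of an ambiguous word.
instance = {
    "lemma": ["geld", "storten"],
    "pos": ["noun"],
    "dependency": ["obj_of:storten"],
}
subsets = list(source_subsets(SOURCES))
features = select_features(instance, ("lemma", "pos"))
```

Each subset would be used to train and evaluate one classifier, so that the contribution of a single source can be compared with that of the combinations containing it.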

Here's the summary (ps or pdf) of my PhD thesis.


Research Interests


