BA Theses in Information Science

BA Information Science Thesis Class 2006-2007

Schedule

Thu 19 April: Presentations by Tristan and Daniel
Thu 26 April: submit thesis proposal (by e-mail)
- problem statement
- literature
- approach
- evaluation
Thu 10 May: Presentations by the remaining participants

Possible Topics

1. Reading Aid

When reading scientific (legal, medical) text, you encounter a lot of terminology. Definitions of such terms can be found in Wikipedia or in specialized dictionaries. In this project you develop software that lets users easily look up information about a word or term while reading (e.g. via pop-ups), such as definitions, synonyms, expansions of abbreviations, translations into another language, ...
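
One piece of such a reading aid, expanding abbreviations, can be sketched as follows. This is a simplified illustration in the spirit of the Schwartz and Hearst approach listed under Literature, not their published software: it looks for "long form (SF)" occurrences and matches the characters of the short form from right to left in the preceding words. The function names, the regular expression, and the window size are assumptions made for this sketch.

```python
import re

def best_long_form(short_form, candidate):
    """Match the characters of short_form right to left inside candidate.

    Returns the matched stretch, extended to the start of its first word,
    or None when the characters cannot all be matched.
    """
    s_index = len(short_form) - 1
    l_index = len(candidate) - 1
    while s_index >= 0:
        ch = short_form[s_index].lower()
        if not ch.isalnum():          # skip hyphens etc. in the short form
            s_index -= 1
            continue
        # Move left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while l_index >= 0 and (
            candidate[l_index].lower() != ch
            or (s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum())
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
        s_index -= 1
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]

def find_abbreviations(text):
    """Collect {short form: long form} pairs of the shape 'long form (SF)'."""
    pairs = {}
    for m in re.finditer(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)", text):
        short_form = m.group(1)
        words = text[:m.start()].split()
        # Assumed search window: only a few words before the parenthesis.
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words[-max_words:])
        long_form = best_long_form(short_form, candidate)
        if long_form:
            pairs[short_form] = long_form
    return pairs

text = "We train a hidden Markov model (HMM) on the corpus."
print(find_abbreviations(text))  # {'HMM': 'hidden Markov model'}
```

A pop-up for "HMM" could then show the long form stored in the resulting dictionary; definitions, synonyms, and translations would need other sources.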

Literature

Ismail Fahmi and Gosse Bouma. Learning to Identify Definitions using Syntactic Features. Proceedings of the EACL 2006 Workshop on Learning Structured Information in Natural Language Applications.
Ariel Schwartz and Marti Hearst. A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text. Proceedings of the Pacific Symposium on Biocomputing, 2003. (Java software available for download.)
Manabu Torii, Hongfang Liu, Zhangzhi Hu, and Cathy Wu. A Comparison Study of Biomedical Short Form Definition Detection Algorithms. Proceedings of the 1st International Workshop on Text Mining in Bioinformatics, 2006.
Manuel Zahariev. Efficient Acronym-Expansion Matching for Automatic Acronym Acquisition. Department of Computing Sciences, Simon Fraser University, Burnaby, B.C., Canada.

2. ISA Relations

WordNet is an electronic dictionary in which words are defined by, among other things, ISA relations (hypernym relations): a dog ISA pet, a BMW ISA car brand, etc. Many NLP applications (such as question answering) make use of WordNet. However, WordNet's coverage is limited, particularly for languages other than English. It is therefore interesting to learn ISA relations automatically.

The classic approach is due to Hearst. More recently, researchers have tried to do this using the Web, since many times more information is available there.

Project 2.1: Make use of lists and list titles on web pages. See Shinzato and Torisawa.
Project 2.2: Generate Hearst patterns, send them to Google (or another search engine), and analyze the results. See Tjong Kim Sang.
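
For Project 2.2, the two steps (generating pattern queries and analyzing the returned text) might look roughly like the sketch below. The templates, the function names, and the restriction to single-word list items are assumptions made for illustration; no search engine is actually called, and the snippet is an invented example.

```python
import re

# A few Hearst-style templates; {hyper} is filled with a plural hypernym.
TEMPLATES = ["{hyper} such as", "{hyper} including", "{hyper} especially"]

def make_queries(hypernym):
    """Build quoted phrase queries that could be sent to a search engine."""
    return ['"' + t.format(hyper=hypernym) + '"' for t in TEMPLATES]

def extract_isa_pairs(hypernym, snippet):
    """Extract (hyponym, hypernym) pairs from text matching a pattern.

    Simplification: every list item is assumed to be a single word.
    """
    item = r"\w+"
    item_list = rf"({item}(?:, (?!and\b|or\b){item})*(?:,? (?:and|or) {item})?)"
    pattern = re.compile(
        re.escape(hypernym) + r",? (?:such as|including|especially) " + item_list,
        re.IGNORECASE,
    )
    pairs = []
    for m in pattern.finditer(snippet):
        # Split the matched list on commas and on "and" / "or".
        items = re.split(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+", m.group(1))
        pairs.extend((i, hypernym) for i in items if i)
    return pairs

print(make_queries("car brands")[0])
# "car brands such as"
print(extract_isa_pairs("car brands",
                        "Car brands such as BMW, Audi and Volvo are popular."))
# [('BMW', 'car brands'), ('Audi', 'car brands'), ('Volvo', 'car brands')]
```

Counting how often each pair is extracted across many snippets would give a simple confidence score; multi-word hyponyms need a noun-phrase chunker rather than `\w+`.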

Literature

Marti Hearst. Automatic Acquisition of Hyponyms from Large Text Corpora. Proceedings of the Fourteenth International Conference on Computational Linguistics, Nantes, France, July 1992.
Keiji Shinzato and Kentaro Torisawa. Acquiring Hyponymy Relations from Web Documents. HLT-NAACL 2004, pp. 73-80.
Erik Tjong Kim Sang,....

3. Disambiguating Personal Names

Like location names, personal names can be ambiguous. In this project you try to detect when the same name refers to different people.

Literature

Razvan Bunescu and Marius Pasca. Using Encyclopedic Knowledge for Named Entity Disambiguation. Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), pp. 9-16, Trento, Italy, April 2006.
Gideon S. Mann and David Yarowsky. Unsupervised Personal Name Disambiguation. CoNLL, Edmonton, Alberta, 2003.

4. Web-based Question Answering

Question answering is the task of finding the answer to a user's question in a document collection or on the web. The advantage of using the web is that so much information is available that the answer to a question is often literally present, and that the correct answer is often also the most frequent answer on the web. Much of this research has been done for English. The Dutch-language web is large, but not as large as the English one.

Investigate whether this approach also works for Dutch.
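
The idea that the correct answer tends to be the most frequent one can be sketched as a simple vote over retrieved text snippets. The snippets below are invented stand-ins for search results on the question "When was Willem van Oranje born?", and the function name and the year pattern are assumptions made for this example.

```python
import re
from collections import Counter

def most_frequent_answer(snippets, candidate_pattern):
    """Return the candidate answer that occurs in the most snippets, or None."""
    counts = Counter()
    for snippet in snippets:
        # Count each candidate at most once per snippet.
        counts.update(set(re.findall(candidate_pattern, snippet)))
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical Dutch snippets; a real system would fetch these from the web.
snippets = [
    "Willem van Oranje werd in 1533 geboren.",
    "Hij werd geboren op 24 april 1533 in Dillenburg.",
    "Willem van Oranje (1533-1584) was de leider van de opstand.",
    "Hij werd in 1584 in Delft vermoord.",
]
# For a "when" question, four-digit years are taken as candidate answers.
print(most_frequent_answer(snippets, r"\b(1[0-9]{3})\b"))  # 1533
```

With fewer Dutch pages available, the vote rests on fewer snippets, which is exactly the effect the project would need to measure.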

Literature

Eric Brill, Jimmy Lin, Michele Banko, Susan Dumais, and Andrew Ng. Data-Intensive Question Answering. Proceedings of the Tenth Text REtrieval Conference (TREC 2001), pages 393-400, November 2001, Gaithersburg, Maryland.
Dragomir Radev, Weiguo Fan, Hong Qi, Harris Wu, and Amardeep Grewal. Probabilistic Question Answering on the Web. In Proc. of the Int. WWW Conf., 2002.