LOT Winter School 2009

Course title

Computational Linguistics

Teacher

Gosse Bouma

Postal Address: Information Science, Faculty of Arts, University of Groningen, PO Box 716, 9700 AS Groningen

Homepage: www.let.rug.nl/~gosse

Course Level:

Intermediate

The course is intended for linguists with an interest in computational applications and corpus linguistics.
It is assumed that the participants have a background in linguistics, but no specific knowledge of corpus or computational linguistics is presupposed.

Course Description

In recent years, ever larger text corpora have been created, which are of interest both for linguistic research and as data sources for practical applications. In this course, we give an overview of methods for automatically enriching such corpora with linguistic information, in particular with syntactic structure. We will discuss examples of linguistic research that is based on information obtained from such corpora. We will also show how syntactic structure can be used to extract various kinds of information from large text collections, and how such information can be used in applications.

Day-to-day Program

Monday: Creating, Annotating, and Searching Corpora

This is a brief introduction to corpus linguistics. We discuss issues such as size, representativeness, tokenization, part-of-speech tagging, syntactic information, methods for manual and automatic annotation, formats, and search tools. We argue that properly annotated corpora are of interest for linguistic research as well as for applications in the field of information search and extraction.
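
As a rough illustration of the first annotation step mentioned above, the following sketch tokenizes a short text and counts token frequencies. The regular expression and the function names are assumptions made for this example only; real corpora typically need a more careful tokenizer (for abbreviations, clitics, and so on).

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens and single punctuation tokens.

    Deliberately naive: every run of word characters is a token, and every
    other non-space character is a token of its own.
    """
    return re.findall(r"\w+|[^\w\s]", text.lower())

def token_frequencies(text):
    """Count how often each token occurs in the text."""
    return Counter(tokenize(text))

# Toy input, invented for this example.
sample = "The parser reads the corpus. The corpus is large."
freqs = token_frequencies(sample)

for token, count in freqs.most_common(3):
    print(token, count)
```

Note that this tokenizer splits a form such as "Don't" into three tokens, which is one of the reasons tokenization is treated as a design decision rather than a solved preprocessing step.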

Lecture Notes (4 on 1 A4)
Stanford Parser
BNC
Gosse Bouma and Begona Villada (2001), Corpus-based acquisition of collocational prepositional phrases
Grefenstette and Nioche (2000), Estimation of English and non-English language use on the WWW
Oostendorp and van der Wouden, Corpus Internet
Veronis, Google's counts faked?
Liberman, Google-sampling: avoiding pseudo-text in cyberspace

Tuesday: Statistical Parsing

Automatic syntactic annotation requires accurate and robust methods for automatic analysis of text. We discuss various approaches to creating large grammars, ranging from systems that are mainly created by hand to systems that are induced automatically from corpus fragments.
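
One simple way to think about inducing a statistical grammar from annotated data is relative-frequency estimation of rewrite rules. The sketch below is only an illustration under that assumption, not necessarily the method covered in the lecture; the rule list is hypothetical toy data standing in for rules read off a treebank.

```python
from collections import Counter

# Hypothetical rule occurrences, as if read off a small treebank.
# Each item is (left-hand side, right-hand side).
observed_rules = [
    ("S", ("NP", "VP")),
    ("S", ("NP", "VP")),
    ("NP", ("Det", "N")),
    ("NP", ("N",)),
    ("NP", ("Det", "N")),
    ("VP", ("V", "NP")),
    ("VP", ("V",)),
]

def estimate_rule_probabilities(rules):
    """Estimate P(rhs | lhs) as count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(rules)
    lhs_counts = Counter(lhs for lhs, _ in rules)
    return {
        rule: count / lhs_counts[rule[0]]
        for rule, count in rule_counts.items()
    }

probs = estimate_rule_probabilities(observed_rules)

for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}: {p:.2f}")
```

By construction, the probabilities of all rules sharing a left-hand side sum to one, which is the property a probabilistic context-free grammar requires.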

Wednesday: Syntactic Patterns in Large Corpora

Syntactically annotated corpora make it possible to investigate the frequency of syntactic constructions, and of lexical items in specific syntactic constructions. We give an overview of corpus linguistic and psycholinguistic research in which such information has been used.
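
To make the idea of counting lexical items in specific syntactic constructions concrete, here is a minimal sketch. It assumes the annotated corpus has already been reduced to (head, relation, dependent) triples; the triples, the relation labels, and the function name are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy data: (head, relation, dependent) triples, as might be
# extracted from a syntactically annotated corpus.
triples = [
    ("eat", "obj", "apple"),
    ("eat", "obj", "bread"),
    ("eat", "subj", "child"),
    ("read", "obj", "book"),
    ("eat", "obj", "apple"),
]

def relation_counts(triples, relation):
    """Count (head, dependent) pairs that occur in the given relation."""
    return Counter(
        (head, dep) for head, rel, dep in triples if rel == relation
    )

obj_counts = relation_counts(triples, "obj")

for (head, dep), count in obj_counts.most_common():
    print(head, dep, count)
```

The same counting scheme works for any construction that can be expressed as a pattern over the annotation, which is what makes syntactic annotation useful for frequency-based research.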

Lecture Notes (4 on 1 A4)
Bouma and Spenader (2009), The Distribution of Weak and Strong Object Reflexives in Dutch
Bouma, Hendriks, and Hoeksema, Focus Particles inside Prepositional Phrases: A Comparison between Dutch, English, and German
Bastiaanse, Bouma, Post, Linguistic Complexity and Frequency in Agrammatic Speech Production (draft)
Dissertation by Gerlof Bouma on Word Order in Dutch

Thursday: Mining Syntactically Annotated Corpora

Various text mining applications may benefit from access to syntactic information. We will discuss the role of syntactic annotation in lexical acquisition, information extraction, and question answering.

Friday: Coreference Resolution

Many of the research questions in corpus linguistics require semantic distinctions (e.g. thematic roles, information about the animacy of arguments, word sense) that are difficult to obtain using syntactic information only. Many practical applications can benefit from semantic information as well (e.g. named-entity classes, co-reference resolution, word sense, discourse relations).