LOT Winter School 2009

Course title

Computational Linguistics


Gosse Bouma


Course Level:


The course is intended for linguists with an interest in computational applications and corpus linguistics.
It is assumed that the participants have a background in linguistics, but no specific knowledge of corpus or computational linguistics is presupposed.


Course Description

In recent years, ever larger text corpora have been created, which are of interest both for linguistic research and as data sources for practical applications. In this course, we give an overview of methods for automatically enriching such corpora with linguistic information, in particular with syntactic structure. We will discuss examples of linguistic research that is based on information obtained from such corpora. We will also show how syntactic structure can be used to extract various kinds of information from large text collections, and how such information can be used in applications.


Day-to-day Program

Monday: Creating, Annotating, and Searching Corpora

This is a brief introduction to corpus linguistics. We discuss issues such as size, representativeness, tokenization, part-of-speech tagging, syntactic information, methods for manual and automatic annotation, formats, and search tools. We argue that properly annotated corpora are of interest for linguistic research as well as for applications in the field of information search and extraction.
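To give a flavour of the first annotation step mentioned above, tokenization can be sketched in a few lines of Python. This is only a minimal illustration, not the method used in the course: the pattern `TOKEN_PATTERN` and the function `tokenize` are hypothetical names, and real tokenizers handle many more cases (abbreviations, URLs, multiword units).

```python
import re

# Hypothetical pattern, chosen for illustration only:
# a word (optionally with internal hyphens or apostrophes),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def tokenize(text):
    """Split raw text into a list of word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("Corpora aren't small: some hold 500 million words.")
print(tokens)
```

Each token produced this way could then be passed on to later annotation steps such as part-of-speech tagging.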


Tuesday: Statistical Parsing

Automatic syntactic annotation requires accurate and robust methods for automatic analysis of text. We discuss various approaches to creating large grammars, ranging from systems that are mainly created by hand to systems that are induced automatically from corpus fragments.


Wednesday: Syntactic Patterns in Large Corpora

Syntactically annotated corpora make it possible to investigate the frequency of syntactic constructions, and of lexical items in specific syntactic constructions. We give an overview of corpus linguistic and psycholinguistic research in which such information has been used.
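As a rough sketch of what counting lexical items in a specific syntactic construction can look like, the Python fragment below tallies verb-object pairs. It assumes the corpus has already been parsed and reduced to (head, relation, dependent) triples; the triples, the relation labels `obj` and `subj`, and the function `count_relation` are invented for illustration and do not come from any particular treebank.

```python
from collections import Counter

# Toy, invented dependency triples: (head, relation, dependent).
# In practice these would be read from a syntactically annotated corpus.
TRIPLES = [
    ("read", "obj", "book"),
    ("read", "obj", "paper"),
    ("read", "subj", "student"),
    ("write", "obj", "paper"),
    ("read", "obj", "book"),
]

def count_relation(triples, relation):
    """Count (head, dependent) pairs that occur in the given relation."""
    return Counter((head, dep) for head, rel, dep in triples if rel == relation)

object_counts = count_relation(TRIPLES, "obj")
for (verb, noun), freq in object_counts.most_common():
    print(verb, noun, freq)
```

The same counting scheme applies to any other relation label, so frequencies of subjects, modifiers, or complements can be compared in the same way.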


Thursday: Mining Syntactically Annotated Corpora

Various text mining applications may benefit from access to syntactic information. We will discuss the role of syntactic annotation in lexical acquisition, information extraction, and question answering.


Friday: Coreference Resolution

Many of the research questions in corpus linguistics require semantic distinctions (e.g. thematic roles, information about the animacy of arguments, word sense) that are difficult to obtain using syntactic information only. Many practical applications can benefit from semantic information as well (e.g. named-entity classes, coreference resolution, word sense, discourse relations).


Background and preparatory readings: 

Further readings: