LOT Winter School 2009

Course title

Computational Linguistics


Gosse Bouma


E-mail: g.bouma@rug.nl

Postal Address:  Information Science, Fac of Arts, University of Groningen, PO Box 716, 9700 AS Groningen

Homepage: www.let.rug.nl/~gosse

Course Level:


The course is intended for linguists with an interest in computational applications and corpus linguistics.
It is assumed that the participants have a background in linguistics, but no specific knowledge of corpus or computational linguistics is presupposed.


Course Description

In recent years, ever larger text corpora have been created, which are of interest both for linguistic research and as data sources for practical applications. In this course, we give an overview of methods for automatically enriching such corpora with linguistic information, in particular with syntactic structure. We will discuss examples of linguistic research that is based on information obtained from such corpora. We will also show how syntactic structure can be used to extract various kinds of information from large text collections, and how such information can be used in applications.


Day-to-day Program

Monday: Creating, Annotating, and Searching Corpora

This is a brief introduction to corpus linguistics. We discuss issues such as size, representativeness, tokenization, part-of-speech tagging, syntactic information, methods for manual and automatic annotation, formats, and search tools. We argue that properly annotated corpora are of interest for linguistic research as well as for applications in the field of information search and extraction.


Tuesday: Statistical Parsing

Automatic syntactic annotation requires accurate and robust methods for automatic analysis of text. We discuss various approaches to creating large grammars, ranging from systems that are mainly created manually to systems that are induced automatically from corpus fragments.


Wednesday: Syntactic Patterns in Large Corpora

Syntactically annotated corpora make it possible to investigate the frequency of syntactic constructions, and of lexical items in specific syntactic constructions. We give an overview of corpus linguistic and psycholinguistic research in which such information has been used.


Thursday: Mining Syntactically Annotated Corpora

Various text mining applications may benefit from access to syntactic information. We will discuss the role of syntactic annotation in lexical acquisition, information extraction, and question answering.


Friday: Coreference Resolution

Many of the research questions in corpus linguistics require semantic distinctions (e.g. thematic roles, information about the animacy of arguments, word sense) that are difficult to obtain using syntactic information only. Many practical applications can benefit from semantic information as well (e.g. named-entity classes, coreference resolution, word sense, discourse relations).


Background and preparatory readings: 

Course readings:

  1. Douglas Roland, Frederic Dick, Jeffrey L Elman. Frequency of basic English grammatical structures: A corpus analysis. Journal of Memory and Language, Vol. 57, No. 3 (October 2007), pp. 348-379.
  2. Gertjan van Noord (2006), At Last Parsing is Now Operational. In Piet Mertens and Cedrick Fairon and Anne Dister and Patrick Watrin (eds.), TALN06. Verbum Ex Machina. Actes de la 13e conference sur le traitement automatique des langues naturelles.
  3. Joan Bresnan, Anna Cueni, Tatiana Nikitina, and Harald Baayen. 2007. "Predicting the Dative Alternation." In Cognitive Foundations of Interpretation, ed. by Gerlof Bouma, I. Kraemer, and J. Zwarts. Amsterdam: Royal Netherlands Academy of Science, pp. 69--94. 33 pages.
  4. Gosse Bouma, Ismail Fahmi, Jori Mur, Gertjan van Noord, Lonneke van der Plas, Jörg Tiedemann. Linguistic Knowledge and Question Answering. Traitement Automatique des Langues 46 (3) 2005. Pages 15--39. Appeared in 2007.

Further readings: