Postal Address: Information Science, Faculty of Arts, University of Groningen, PO Box 716, 9700 AS Groningen
The course is intended for linguists with an interest in computational
applications and corpus linguistics.
It is assumed that the participants have a background in linguistics, but no specific knowledge of corpus or computational linguistics is presupposed.
In recent years, ever larger text corpora have been created, which are of interest both for linguistic research and as data sources for practical applications. In this course, we give an overview of methods for automatically enriching such corpora with linguistic information, in particular with syntactic structure. We will discuss examples of linguistic research that is based on information obtained from such corpora. We will also show how syntactic structure can be used to extract various kinds of information from large text collections, and how such information can be used in applications.
Monday: Creating, Annotating, and Searching
This is a brief introduction to corpus linguistics. We
discuss issues such as size, representativeness, tokenization, part-of-speech
tagging, syntactic information, methods for manual and automatic annotation,
formats, and search tools. We argue that properly annotated corpora are of
interest for linguistic research as well as for applications in the field of
information search and extraction.
Tuesday: Statistical Parsing
Automatic syntactic annotation requires accurate and robust methods for automatic analysis of text. We discuss various approaches to creating large grammars, ranging from systems that are mainly created manually to systems that are induced automatically from corpus fragments.
Wednesday: Syntactic Patterns in Large Corpora
Syntactically annotated corpora make it possible to investigate the frequency of syntactic constructions, and of lexical items in specific syntactic constructions. We give an overview of corpus linguistic and psycholinguistic research in which such information has been used.
Various text mining applications may benefit from access to syntactic information. We will discuss the role of syntactic annotation in lexical acquisition, information extraction, and question answering.
Friday: Coreference Resolution
Many of the research questions in corpus linguistics
require semantic distinctions (e.g. thematic roles, information about the
animacy of arguments, word sense) that are difficult to obtain using syntactic
information only. Many practical applications can benefit from semantic
information as well (e.g. named-entity classes, coreference resolution, word
sense, discourse relations).
Background and preparatory readings: