LASSY

... he refused to be a dog just like Lassy was ...

LASSY (Large Scale Syntactic Annotation of written Dutch) was a STEVIN project. STEVIN was the Flemish-Dutch Language and Speech Processing Technology Programme launched by the Nederlandse Taalunie.

A large corpus of written Dutch texts (1,000,000 words) has been syntactically annotated, with manual correction, based on CGN and D-COI. In addition, a very large corpus (more than 700,000,000 words) has been syntactically annotated automatically. The project extends the available syntactically annotated corpora for Dutch both in size and in the range of text genres and topical domains covered. Various browse and search tools for syntactically annotated corpora have also been developed and made available. Their potential for applications in corpus linguistics and information extraction is illustrated and evaluated in a series of case studies.

Partners

Lassy was carried out by a consortium consisting of the University of Groningen and the Katholieke Universiteit Leuven. Researchers involved in the project include:


Erik Tjong Kim Sang
Gosse Bouma
Gertjan van Noord



Frank van Eynde
Ineke Schuurman
Vincent Vandeghinste

Lassy Initiatives

List of Resources

Descriptions of the project

Annotation Manuals

DTD for Lassy XML files

Tools for Lassy

Some annotated sentences

Two treebanks have been delivered in Lassy. Both can be obtained from the TST-Centrale.
  1. Lassy Small is a 1 million word corpus with manually verified syntactic annotations. Among other material, Lassy Small contains a subset of SONAR500; for historical reasons, the identifiers of some of the sentences differ. An overview is given here.
  2. Lassy Large is a 700 million word corpus with automatically assigned syntactic annotations. The Wikipedia part is available online as an example. Lassy Large contains the following corpora:
    • Eindhoven corpus. 40 thousand sentences, 713 thousand tokens.
    • EMEA corpus. Over 1 million sentences, 13 million tokens.
    • Europarl corpus. Over 1 million sentences, 37 million tokens.
    • Wikipedia dump of 2011. 9 million sentences, 145 million tokens.
    • Senseval corpus of Dutch. 12 thousand sentences, 156 thousand tokens.
    • SONAR500 corpus. 41 million sentences, 510 million tokens.
    • Small corpus including the annual "Troonrede" (Speech from the Throne) of Queen Beatrix since 1990.
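
As a quick consistency check, the token counts listed for the parts of Lassy Large can be added up and compared with the stated size of more than 700 million words. The sketch below is only illustrative: the figures are the rounded counts from the list above, the Troonrede corpus is left out because no size is given for it, and the names `TOKENS`, `total_tokens`, and `shares` are hypothetical helpers, not part of any Lassy tool.

```python
# Rounded token counts copied from the list of Lassy Large sub-corpora.
# The Troonrede corpus is omitted because the list gives no size for it.
TOKENS = {
    "Eindhoven": 713_000,
    "EMEA": 13_000_000,
    "Europarl": 37_000_000,
    "Wikipedia 2011": 145_000_000,
    "Senseval": 156_000,
    "SONAR500": 510_000_000,
}


def total_tokens(counts):
    """Return the sum of all sub-corpus token counts."""
    return sum(counts.values())


def shares(counts):
    """Return each sub-corpus's fraction of the total token count."""
    total = total_tokens(counts)
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    print(f"total: {total_tokens(TOKENS):,} tokens")
    # List the sub-corpora from largest to smallest share.
    for name, share in sorted(shares(TOKENS).items(), key=lambda kv: -kv[1]):
        print(f"{name:15s} {share:6.1%}")
```

With these rounded figures the sum comes to roughly 706 million tokens, in line with the stated size, and SONAR500 accounts for well over two thirds of the material.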

User Manuals

Deliverables

Internal stuff

Publications about Lassy

Research which makes use of the Lassy treebanks

Not listed below are the publications of related STEVIN projects that build upon the Lassy corpora, including SoNaR, Paco-MT, and DPC.