LASSY (Large Scale Syntactic Annotation of written Dutch) was a STEVIN project. STEVIN was the Flemish-Dutch Language and Speech Processing Technology Programme launched by the Nederlandse Taalunie (Dutch Language Union).

A large corpus of written Dutch text (1,000,000 words) has been syntactically annotated and manually corrected, based on CGN and D-COI. In addition, a very large corpus (more than 700,000,000 words) has been syntactically annotated automatically. The project extends the available syntactically annotated corpora for Dutch both in size and in the range of text genres and topical domains covered. Various browse and search tools for syntactically annotated corpora have also been developed and made available. Their potential for applications in corpus linguistics and information extraction is illustrated and evaluated in a series of case studies.


Lassy was carried out by a consortium consisting of the University of Groningen and the Katholieke Universiteit Leuven. Researchers involved in the project include:

Erik Tjong Kim Sang
Gosse Bouma
Gertjan van Noord

Frank van Eynde
Ineke Schuurman
Vincent Vandeghinste

The Lassy project delivered two treebanks, both of which can be obtained from the TST-Centrale.
  1. Lassy Small is a 1 million word corpus with manually verified syntactic annotations. Among other material, Lassy Small contains a subset of SONAR500; for historical reasons, the identifiers of some of the sentences differ between the two. An overview is given here.
  2. Lassy Large is a 700 million word corpus with automatically assigned syntactic annotations. The Wikipedia part is available on-line, as an example. Lassy Large contains the following corpora:
    • Eindhoven corpus. 40 thousand sentences, 713 thousand tokens.
    • EMEA corpus. Over 1 million sentences, 13 million tokens.
    • Europarl corpus. Over 1 million sentences, 37 million tokens.
    • Wikipedia dump of 2011. 9 million sentences, 145 million tokens.
    • Senseval corpus of Dutch. 12 thousand sentences, 156 thousand tokens.
    • SONAR500 corpus. 41 million sentences, 510 million tokens.
    • Small corpus including the annual "Troonrede" (Speech from the Throne) of Queen Beatrix since 1990.
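
As a quick consistency check, the token counts listed for the Lassy Large components can be added up and compared with the stated overall size of more than 700 million words. The sketch below uses only the rounded figures from the list above; the names `TOKENS`, `total_tokens`, and `shares` are illustrative and not part of any Lassy tool. The Troonrede corpus is left out because no size is given for it.

```python
# Rounded token counts copied from the Lassy Large component list.
# The Troonrede corpus is omitted: the list gives no size for it.
TOKENS = {
    "Eindhoven": 713_000,
    "EMEA": 13_000_000,
    "Europarl": 37_000_000,
    "Wikipedia 2011": 145_000_000,
    "Senseval": 156_000,
    "SONAR500": 510_000_000,
}


def total_tokens(counts):
    """Sum the token counts of all listed components."""
    return sum(counts.values())


def shares(counts):
    """Return each component's fraction of the listed total."""
    total = total_tokens(counts)
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    print(f"listed total: {total_tokens(TOKENS):,} tokens")
    by_size = sorted(shares(TOKENS).items(), key=lambda kv: kv[1], reverse=True)
    for name, share in by_size:
        print(f"{name:15s} {share:6.1%}")
```

With these rounded figures the listed components add up to roughly 706 million tokens, in line with the stated size, and SONAR500 accounts for most of the total.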

Below, we do not list the various publications of related STEVIN projects that build upon the Lassy corpora; these projects include SoNaR, Paco-MT, and DPC.