LASSY
... he refused to be a dog just like Lassy was ...
LASSY (Large Scale Syntactic Annotation of written Dutch) was a STEVIN
project. STEVIN was the Flemish-Dutch Language and Speech Processing
Technology Programme launched by the Nederlandse Taalunie.
A large corpus of written Dutch texts (1,000,000 words) has been
syntactically annotated (manually corrected), based on CGN and D-COI.
In addition, a very large corpus (more than 700,000,000
words) has been syntactically annotated automatically. The project
extends the available syntactically annotated corpora for Dutch
both in size and with respect to the various text genres and
topical domains. In addition, various browse and search tools for
syntactically annotated corpora have been developed and made
available. Their potential for applications in corpus linguistics and
information extraction is illustrated and evaluated in a series of case studies.
Partners
Lassy was carried out by a consortium consisting of the University of
Groningen and the Katholieke Universiteit Leuven. Researchers involved
in the project include:
Lassy Initiatives
- Lassy sponsored the invited lecture of Anette Frank at the ACL
2007 workshop Deep Linguistic Processing, June 28, 2007 in Prague.
Further information is available from the DLP website.
- Lassy initiated the local organization of TLT7, the Seventh
International Workshop on Treebanks and Linguistic Theories, held January
23-24, 2009 in Groningen. Further information can be obtained from
the TLT7 webpage.
- Lassy sponsored an invited keynote lecture by Ken Church
(Microsoft Research) at the 30th anniversary TaBu symposium on June 11
and 12, 2009 in Groningen. Further information is available from the
conference website.
- Lassy sponsored the Distributional Semantics Workshop on June 23,
2010 in Groningen. More information can be obtained from the
workshop webpage.
The sponsorship made possible invited presentations at the workshop by Yves Peirsman (Leuven), Sophia Katrenko (Amsterdam) and Diarmuid Ó Séaghdha (Cambridge).
List of Resources
Descriptions of the project
Annotation Manuals
DTD for Lassy XML files
Tools for Lassy
- DACT, an easy-to-use corpus tool for Lassy corpora, developed by Daniel de Kok, with help from Jelmer van der Linde.
- command-line tools with functionality similar to Dact, developed by Daniel de Kok, with help from Jelmer van der Linde, Lars Buitinck and Peter Kleiweg.
- GrETEL, another tool for querying Lassy treebanks, developed by Liesbeth Augustinus.
- Peter's version of Erik's Search Application, a web application for searching pairs of words, initially developed by Erik Tjong Kim Sang and further developed by Peter Kleiweg.
- Alpino parser
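The tools above all query treebanks that are stored as XML. As a rough illustration of that kind of query, the Python sketch below uses the limited XPath support in the standard library to pair each subject with the head of its clause. The sample sentence is invented for this example, and the element and attribute names (`alpino_ds`, `node`, `rel`, `word`, `lemma`) are assumptions about the Alpino-style format that should be checked against the Lassy DTD. It does not reflect how Dact or GrETEL are implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample in the style of Alpino/Lassy XML.
# Attribute names are assumptions; verify them against the DTD.
SAMPLE = """
<alpino_ds version="1.3">
  <node id="0" rel="top" cat="top" begin="0" end="3">
    <node id="1" rel="--" cat="smain" begin="0" end="3">
      <node id="2" rel="su" pos="noun" word="Lassy" lemma="Lassy" begin="0" end="1"/>
      <node id="3" rel="hd" pos="verb" word="bevat" lemma="bevatten" begin="1" end="2"/>
      <node id="4" rel="obj1" pos="noun" word="bomen" lemma="boom" begin="2" end="3"/>
    </node>
  </node>
  <sentence>Lassy bevat bomen</sentence>
</alpino_ds>
"""


def subject_head_pairs(xml_text):
    """Return (subject word, head lemma) pairs for sibling su/hd leaf nodes."""
    root = ET.fromstring(xml_text)
    pairs = []
    # Visit every node and look only at its direct children.
    for clause in root.iter("node"):
        subjects = clause.findall("node[@rel='su']")
        heads = clause.findall("node[@rel='hd']")
        for su in subjects:
            for hd in heads:
                # Skip phrasal nodes, which carry no word or lemma.
                if su.get("word") and hd.get("lemma"):
                    pairs.append((su.get("word"), hd.get("lemma")))
    return pairs


if __name__ == "__main__":
    print(subject_head_pairs(SAMPLE))
```

Only lexical (leaf) subjects are reported here; a phrasal subject would need an extra step to locate its own head word.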
Some annotated sentences
In Lassy two treebanks have been delivered. The treebanks can be obtained from
the TST-Centrale.
- Lassy Small is a 1 million word corpus with manually
verified syntactic annotations. Lassy Small contains, among other material, a subset
of SONAR500, but for historical reasons, the identifiers of some of
the sentences are different. An overview is given here.
- Lassy Large is a 700 million word corpus with automatically
assigned syntactic annotations.
Lassy Large contains the following corpora. The Wikipedia part
is available on-line, as an example.
- Eindhoven corpus. 40 thousand sentences, 713 thousand tokens.
- EMEA corpus. Over 1 million sentences, 13 million tokens.
- Europarl corpus. Over 1 million sentences, 37 million tokens.
- Wikipedia dump of 2011. 9 million sentences, 145 million tokens.
- Senseval corpus of Dutch. 12 thousand sentences, 156 thousand tokens.
- SONAR500 corpus. 41 million sentences, 510 million tokens.
- Small corpus including the annual "Troonrede" of Queen Beatrix since 1990.
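As a quick consistency check on the figures above, the short Python sketch below sums the listed token counts. The numbers are the rounded values from the list, the dictionary keys are just labels chosen for this example, and the Troonrede corpus is left out because no count is given for it.

```python
# Rounded token counts as listed above; the Troonrede corpus is omitted
# because the list gives no size for it.
TOKENS = {
    "Eindhoven": 713_000,
    "EMEA": 13_000_000,
    "Europarl": 37_000_000,
    "Wikipedia 2011": 145_000_000,
    "Senseval": 156_000,
    "SONAR500": 510_000_000,
}


def total_tokens(counts):
    """Sum the per-corpus token counts."""
    return sum(counts.values())


if __name__ == "__main__":
    print(f"{total_tokens(TOKENS):,} tokens")
```

The sum comes to roughly 706 million tokens, in line with the stated size of about 700 million words for Lassy Large.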
User Manuals
Deliverables
Internal stuff
Publications about Lassy
- Gertjan van Noord, Ineke Schuurman, Vincent Vandeghinste. Syntactic
Annotation of Large Corpora in STEVIN. In: LREC 2006.
[pdf]
- Gosse Bouma and Geert Kloosterman. Mining Syntactically Annotated
Corpora with XQuery. In: LAW 2007, Prague.
[pdf]
- Martijn Wieling, Mark-Jan Nederhof, Gertjan van Noord. Parsing Partially
Bracketed Input. In: CLIN 2005. Proceedings of the 16th Meeting of
Computational Linguistics in the Netherlands. Pages 1--16. 2007.
[pdf]
- Nelleke Oostdijk, Martin Reynaert, Paola Monachesi, Gertjan van Noord,
Roland Ordelman, Ineke Schuurman, Vincent Vandeghinste. From D-Coi to SoNaR: A
reference corpus for Dutch. In: LREC 2008. [pdf]
- Gertjan van Noord. Huge Parsed Corpora in LASSY. In: Frank van
Eynde, Anette Frank, Koenraad de Smedt, Gertjan van Noord (editors),
Proceedings of the Seventh International Workshop on Treebanks and
Linguistic Theories (TLT 7). January 23-24, 2009, Groningen, The
Netherlands. LOT Occasional Series.
[LOT site]
- Frank van Eynde, Anette Frank, Koenraad de Smedt, Gertjan van
Noord (editors), Proceedings of the Seventh International Workshop on
Treebanks and Linguistic Theories (TLT 7). January 23-24, 2009,
Groningen, The Netherlands. LOT Occasional Series.
[LOT site]
- Ineke Schuurman, Veronique Hoste and Paola Monachesi.
Cultivating Trees: Adding Several Semantic Layers to the Lassy
Treebank in SoNaR. In: Frank van
Eynde, Anette Frank, Koenraad de Smedt, Gertjan van Noord (editors),
Proceedings of the Seventh International Workshop on Treebanks and
Linguistic Theories (TLT 7). January 23-24, 2009, Groningen, The
Netherlands. LOT Occasional Series.
[LOT site]
- Gertjan van Noord and Gosse Bouma. Parsed Corpora for Linguistics.
In: Proceedings of EACL Workshop The Interaction between
Linguistics and Computational Linguistics: Virtuous, Vicious or
Vacuous? Athens, 2009. pp 33-39. [pdf]
- Gertjan van Noord, Gosse Bouma, Frank van Eynde, Daniel de Kok, Jelmer van der Linde, Ineke Schuurman, Erik Tjong Kim Sang, Vincent Vandeghinste. Large Scale Syntactic Annotation of Written Dutch: Lassy. In: STEVIN volume to be published by Springer.
Research which makes use of the Lassy treebanks
The list below does not include the various publications of related STEVIN projects that build upon the
Lassy corpora, such as the projects SoNaR, Paco-MT, and DPC.
- Gertjan van Noord. Using Self-Trained Bilexical Preferences to Improve
Disambiguation Accuracy. In: IWPT2007, Prague, 2007.
[pdf]
- Gosse Bouma, Jori Mur, Gertjan van Noord, Lonneke van der Plas,
Jörg Tiedemann. Question Answering with Joost at CLEF 2008. CLEF 2008
Working Notes. Aarhus Denmark, 2008.
[pdf]
- Barbara Plank and Gertjan van Noord. Exploring An Auxiliary
Distribution based approach to Domain Adaptation of a Syntactic
Disambiguation Model. In: Coling Workshop 'Cross Framework and
Cross Domain Parser Evaluation'. 2008.
[pdf]
- Gosse Bouma, Geert Kloosterman, Jori Mur, Gertjan van Noord,
Lonneke van der Plas, and Jörg Tiedemann. Question Answering with
Joost at CLEF 2007. In: Carol Peters, Valentin Jijkoun, Thomas Mandl,
Henning Mueller, Douglas W. Oard, Anselmo Penas, Vivien Petras, Diana
Santos (editors), Advances in Multilingual and Multimodal Information
Retrieval, 8th workshop of the Cross-Language Evaluation Forum, CLEF
2007, Budapest, Hungary, September 19-21, 2007, Revised Selected
Papers. Lecture Notes in Computer Science 5152, Springer 2008. pp 257-260.
- Erik Tjong Kim Sang. To Use a Treebank or Not - Which Is Better
for Hypernym Extraction? In: Frank van
Eynde, Anette Frank, Koenraad de Smedt, Gertjan van Noord (editors),
Proceedings of the Seventh International Workshop on Treebanks and
Linguistic Theories (TLT 7). January 23-24, 2009, Groningen, The
Netherlands. LOT Occasional Series.
[LOT site]
- Anna Lobanova, Jennifer Spenader, Tim van de Cruys,
Tom van der Kleij and Erik Tjong Kim Sang.
Automatic Relation Extraction - Can Synonym Extraction Benefit from
Antonym Knowledge? In: Proceedings of
WordNets and other Lexical Semantic Resources - between Lexical
Semantics, Lexicography, Terminology and Formal Ontologies
(NODALIDA2009 workshop), Odense, Denmark, May 2009. [pdf]
- Gosse Bouma and Jennifer Spenader. The Distribution of Weak and
Strong Object Reflexives in Dutch. In: Frank van
Eynde, Anette Frank, Koenraad de Smedt, Gertjan van Noord (editors),
Proceedings of the Seventh International Workshop on Treebanks and
Linguistic Theories (TLT 7). January 23-24, 2009, Groningen, The
Netherlands. LOT Occasional Series.
[LOT site]
- Erik Tjong Kim Sang and Katja Hofmann.
Lexical Patterns or Dependency Patterns: Which Is Better for Hypernym
Extraction? In: Proceedings
of CoNLL-2009, Boulder, CO, USA, June 2009.
[pdf]
- Gertjan van Noord, Learning Efficient Parsing. In: EACL 2009. The
12th Conference of the European Chapter of the Association for
Computational Linguistics. 30 March - 3 April 2009, Athens, Greece. pp
817-825. [pdf]
- Daniël de Kok, Jianqiang Ma and Gertjan van Noord, A generalized
method for iterative error mining in parsing results. In: ACL2009
Workshop Grammar Engineering Across Frameworks (GEAF), Singapore,
2009. [pdf]
- Vincent Vandeghinste. Tree-based target language modeling. In: 13th Annual Conference of the European Association for Machine Translation. Pages 152-159. Barcelona, 2009.
- Barbara Plank. Improved statistical measures to assess natural language parser performance
across domains. In: LREC 2010. [pdf]
- Kostadin Cholakov and Gertjan van Noord. Using Unknown Word Techniques To
Learn Known Words. In: EMNLP 2010.
[pdf]
- Barbara Plank and Gertjan van Noord. Grammar-driven versus data-driven: which parsing system is more affected by domain shifts? In: NLPLING '10. Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground. Pages 25-33. 2010. [web-page]
- Barbara Plank and Gertjan van Noord. Dutch Dependency Parser
Performance Across Domains. In: Proceedings of the 20th Meeting of
Computational Linguistics in the Netherlands.
[pdf]
- Kostadin Cholakov and Gertjan van Noord. Acquisition of Unknown
Word Paradigms for Large Scale Grammars. In: COLING 2010: Poster Volume,
pages 153-161. August 23-27, Beijing, China.
[pdf]
- Gertjan van Noord. Self-trained Bilexical Preferences to Improve
Disambiguation Accuracy. In: Harry Bunt, Paola Merlo and Joakim Nivre
(editors), Trends in Parsing Technology. Dependency Parsing, Domain
Adaptation, and Deep Parsing. Springer Verlag. pp 183-200. 2010.
[draft pdf; book page of publisher]
- Daniel de Kok and Barbara Plank and Gertjan van Noord. Reversible Stochastic Attribute-value Grammars. In: ACL 2011.
[pdf]
- Kostadin Cholakov, Gertjan van Noord, Valia Kordoni, Yi Zhang. An empirical comparison of Unknown Word Prediction Methods. In: IJCNLP 2011.
[pdf]
- Philip van Oosten, Véronique Hoste and Dries Tanghe, A Posteriori Agreement as a Quality Measure for Readability Prediction Systems. In: Computational Linguistics and Intelligent Text Processing.
Lecture Notes in Computer Science, 2011, Volume 6609/2011, 424-435, DOI: 10.1007/978-3-642-19437-5_35.
[web-page]
- Nick Ruiz and Edgar Weiffenbach. Using corpora tools to analyze gradable nouns in Dutch. In: Computational Linguistics in the Netherlands Journal 1 (2011) 41-59. [pdf]
- Philip van Oosten, Veronique Hoste. Readability annotation: replacing the expert by the crowd. In: IUNLPBEA '11 Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications. Pages 120-129.
[web-page]
- Daniel de Kok. Discriminative features in reversible stochastic attribute-value grammars. In: Proceedings of the UCNLG+Eval: Language Generation and
Evaluation Workshop. EMNLP 2011. [pdf]
- Barbara Plank. Domain Adaptation for Parsing. Ph.D. thesis, University of Groningen, 2011.
- Liesbeth Augustinus, Vincent Vandeghinste, Frank Van Eynde, Example-Based Treebank Querying.
In: LREC 2012. Istanbul, 2012. [pdf]