LASSY
... he refused to be a dog just like Lassy was ...
LASSY (Large Scale Syntactic Annotation of written Dutch) is a STEVIN
project. STEVIN is a Flemish-Dutch Language and Speech Processing
Technology Programme launched by de Nederlandse Taalunie. The STEVIN
programme office is run jointly by NWO Humanities Division and
SenterNovem.
A large corpus of written Dutch texts (1,000,000 words) has been
syntactically annotated (manually corrected), based on D-COI and its
successor. In addition, a very large corpus (about 1,500,000,000
words) has been syntactically annotated automatically. The project
extends the available syntactically annotated corpora for Dutch
both in size and with respect to the various text genres and
topical domains. In addition, various browse and search tools for
syntactically annotated corpora have been developed and made
available. Their potential for applications in corpus linguistics and
information extraction is illustrated and evaluated in a series of case studies.
Partners
Lassy is carried out by a consortium consisting of the University of
Groningen and the Katholieke Universiteit Leuven. Researchers involved
in the project include:
Lassy Initiatives
- Lassy sponsored the invited lecture of Anette Frank at the ACL
2007 workshop Deep Linguistic Processing, June 28, 2007 in Prague.
Further information is available from the DLP website.
- Lassy initiated the local organization of TLT7, the Seventh
International Conference on Treebanks and Linguistic Theories, held
January 23-24, 2009 in Groningen. Further information can be obtained from
the TLT7 webpage.
- Lassy sponsored an invited keynote lecture by Ken Church
(Microsoft Research) at the 30th anniversary TaBu symposium on June 11
and 12, 2009 in Groningen. Further information is available from the
conference website.
- Lassy sponsored the Distributional Semantics Workshop on June 23, 2010
in Groningen. More information can be obtained from the workshop webpage.
The sponsorship made possible invited presentations at the workshop by
Yves Peirsman (Leuven), Sophia Katrenko (Amsterdam) and Diarmuid Ó Séaghdha (Cambridge).
- Lassy sponsored an extra CLCG Linguistics Colloquium, with a
presentation by Mrs. Valia Kordoni (Saarbrücken and Berlin) entitled
"Enhancing Performance of Deep Linguistic Grammars". More information is
available from the CLCG website.
List of Resources
Descriptions of the project
- Project Proposal
- Short project description in DIXIT (in Dutch)
- A0 portrait poster (June 2008)
- A0 landscape poster (September 2008)
Annotation Manuals
- Lassy Syntactic Annotation Manual
- D-Coi POS-tagging and Lemmatization Manual
User Manuals for Software
- User Guide: How to use the Alpino/D-Coi/Lassy Treebank Tools
- User Guide: How to annotate with Alpino
(needs updating w.r.t. off-line annotation)
- User Guide: How to use Alpino
DTD for Lassy XML files
- Use your right mouse button to save the following link:
DTD for Lassy Dependency Structures
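For readers who want to inspect Lassy XML files programmatically, the following is a minimal Python sketch; it is not part of the Lassy tools. It assumes the layout commonly seen in Alpino output: an alpino_ds root element with nested node elements that carry rel, cat, pos, word, lemma and begin attributes. These element and attribute names, the sample sentence, and the helper names leaf_nodes and dependents_of_head are assumptions made for illustration and should be checked against the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the style of an Alpino/Lassy dependency structure.
# Element and attribute names are assumptions; verify them against the DTD.
SAMPLE = """<alpino_ds version="1.3">
  <node begin="0" end="3" id="0" rel="top" cat="top">
    <node begin="0" end="3" id="1" rel="--" cat="smain">
      <node begin="0" end="1" id="2" rel="su" pos="noun" word="Lassy" lemma="Lassy"/>
      <node begin="1" end="2" id="3" rel="hd" pos="verb" word="annoteert" lemma="annoteren"/>
      <node begin="2" end="3" id="4" rel="obj1" pos="noun" word="zinnen" lemma="zin"/>
    </node>
  </node>
  <sentence>Lassy annoteert zinnen</sentence>
</alpino_ds>"""


def leaf_nodes(root):
    """Return (begin, word, lemma, rel) for every word-bearing node, in sentence order."""
    leaves = [n for n in root.iter("node") if "word" in n.attrib]
    leaves.sort(key=lambda n: int(n.get("begin")))
    return [(int(n.get("begin")), n.get("word"), n.get("lemma"), n.get("rel"))
            for n in leaves]


def dependents_of_head(root, rel):
    """Collect words of leaf nodes with relation `rel` whose parent also has a head (hd) child."""
    found = []
    for parent in root.iter("node"):
        children = list(parent)
        if any(c.get("rel") == "hd" for c in children):
            found.extend(c.get("word") for c in children
                         if c.get("rel") == rel and "word" in c.attrib)
    return found


root = ET.fromstring(SAMPLE)
words = leaf_nodes(root)
subjects = dependents_of_head(root, "su")
print(words)
print(subjects)
```

For a file on disk, replacing ET.fromstring(SAMPLE) with ET.parse(path).getroot() should be enough, provided the file follows the assumed layout.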
External Links
- Alpino parser and related tools
- DACT, an easy-to-use corpus tool for Lassy corpora, developed by Daniel de Kok and Jelmer van der Linde.
Some annotated sentences
In Lassy two annotated corpora will be delivered.
- Lassy Small is a 1 million word corpus with manually
verified syntactic annotations.
- Lassy Large is a 1500 million word corpus with automatically
assigned syntactic annotations.
- Live version of the Lassy Large corpus, in development
You can browse the automatically assigned syntactic annotations with
an XHTML-aware browser such as Firefox.
The live version currently contains the following corpora.
Note that not all links will work, due to copyright restrictions.
- CLEF. Corpus that was used in the CLEF Question Answering track
2005. 4.2 million sentences, 78 million tokens.
- Eindhoven corpus. 40 thousand sentences, 713 thousand tokens.
- EMEA corpus. Over 1 million sentences, 13 million tokens.
- Europarl corpus. Over 1 million sentences, 37 million tokens.
- Mediargus corpus. 103 million sentences, 1396 million tokens.
- Wikipedia dump of 2010. 12 million sentences, 159 million tokens.
- Senseval corpus of Dutch. 12 thousand sentences, 156 thousand tokens.
- Sonar corpus, release 2. 17 million sentences, 246 million tokens.
Note: some overlap with other corpora in Lassy Large.
- Small corpus including the annual "Troonrede" of Queen Beatrix since 1990.
- Newspaper parts of the TwNC-05 corpus.
26 million sentences, 454 million tokens.
Deliverables
Internal stuff
Publications
- Gertjan van Noord, Ineke Schuurman, Vincent Vandeghinste. Syntactic
Annotation of Large Corpora in STEVIN. In: LREC 2006.
[pdf]
- Gosse Bouma and Geert Kloosterman. Mining Syntactically Annotated
Corpora with XQuery. In: LAW 2007, Prague.
[pdf]
- Gertjan van Noord. Using Self-Trained Bilexical Preferences to Improve
Disambiguation Accuracy. In: IWPT2007, Prague.
[pdf]
- Gosse Bouma, Jori Mur, Gertjan van Noord, Lonneke van der Plas,
Jörg Tiedemann. Question Answering with Joost at CLEF 2008. CLEF 2008
Working Notes. Aarhus Denmark.
[pdf]
- Barbara Plank and Gertjan van Noord. Exploring An Auxiliary
Distribution based approach to Domain Adaptation of a Syntactic
Disambiguation Model. In: Coling Workshop 'Cross Framework and
Cross Domain Parser Evaluation'. [pdf]
- Nelleke Oostdijk, Martin Reynaert, Paola Monachesi, Gertjan van Noord,
Roland Ordelman, Ineke Schuurman, Vincent Vandeghinste. From D-Coi to SoNaR: A
reference corpus for Dutch. In: LREC 2008. [pdf]
- Gosse Bouma, Geert Kloosterman, Jori Mur, Gertjan van Noord,
Lonneke van der Plas, and Jörg Tiedemann. Question Answering with
Joost at CLEF 2007. In: Carol Peters, Valentin Jijkoun, Thomas Mandl,
Henning Mueller, Douglas W. Oard, Anselmo Penas, Vivien Petras, Diana
Santos (editors), Advances in Multilingual and Multimodal Information
Retrieval, 8th workshop of the Cross-Language Evaluation Forum, CLEF
2007, Budapest, Hungary, September 19-21, 2007, Revised Selected
Papers. Lecture Notes in Computer Science 5152, Springer 2008. pp 257-260.
- Gertjan van Noord. Huge Parsed Corpora in LASSY. In: Frank van
Eynde, Anette Frank, Koenraad de Smedt, Gertjan van Noord (editors),
Proceedings of the Seventh International Workshop on Treebanks and
Linguistic Theories (TLT 7). January 23-24, 2009, Groningen, The
Netherlands. LOT Occasional Series.
[LOT site]
- Frank van Eynde, Anette Frank, Koenraad de Smedt, Gertjan van
Noord (editors), Proceedings of the Seventh International Workshop on
Treebanks and Linguistic Theories (TLT 7). January 23-24, 2009,
Groningen, The Netherlands. LOT Occasional Series.
[LOT site]
- Gertjan van Noord. Self-trained Bilexical Preferences to Improve
Disambiguation Accuracy. To appear in a book on
parsing technology, based on selected papers from the IWPT 2007,
CoNLL 2007, and IWPT 2005 workshops, edited by Harry Bunt, Paola Merlo
and Joakim Nivre, published by Springer. [draft pdf]
- Gosse Bouma and Jennifer Spenader. The Distribution of Weak and
Strong Object Reflexives in Dutch. In: Frank van
Eynde, Anette Frank, Koenraad de Smedt, Gertjan van Noord (editors),
Proceedings of the Seventh International Workshop on Treebanks and
Linguistic Theories (TLT 7). January 23-24, 2009, Groningen, The
Netherlands. LOT Occasional Series.
[LOT site]
- Erik Tjong Kim Sang. To Use a Treebank or Not - Which Is Better
for Hypernym Extraction? In: Frank van
Eynde, Anette Frank, Koenraad de Smedt, Gertjan van Noord (editors),
Proceedings of the Seventh International Workshop on Treebanks and
Linguistic Theories (TLT 7). January 23-24, 2009, Groningen, The
Netherlands. LOT Occasional Series.
[LOT site]
- Ineke Schuurman, Veronique Hoste and Paola Monachesi.
Cultivating Trees: Adding Several Semantic Layers to the Lassy
Treebank in SoNaR. In: Frank van
Eynde, Anette Frank, Koenraad de Smedt, Gertjan van Noord (editors),
Proceedings of the Seventh International Workshop on Treebanks and
Linguistic Theories (TLT 7). January 23-24, 2009, Groningen, The
Netherlands. LOT Occasional Series.
[LOT site]
- Gertjan van Noord and Gosse Bouma. Parsed Corpora for Linguistics.
In: Proceedings of EACL Workshop The Interaction between
Linguistics and Computational Linguistics: Virtuous, Vicious or
Vacuous? Athens, 2009. pp 33-39. [pdf]
- Gertjan van Noord. Learning Efficient Parsing. In: EACL 2009. The
12th Conference of the European Chapter of the Association for
Computational Linguistics. 30 March - 3 April 2009, Athens, Greece. pp
817-825. [pdf]
- Daniël de Kok, Jianqiang Ma and Gertjan van Noord. A generalized
method for iterative error mining in parsing results. In: ACL2009
Workshop Grammar Engineering Across Frameworks (GEAF), Singapore,
2009. [pdf]
- Anna Lobanova, Jennifer Spenader, Tim van de
Cruys, Tom van der Kleij and Erik Tjong Kim Sang.
Automatic Relation Extraction - Can Synonym Extraction Benefit from
Antonym Knowledge? In: Proceedings of
WordNets and other Lexical Semantic Resources - between Lexical
Semantics, Lexicography, Terminology and Formal Ontologies
(NODALIDA2009 workshop), Odense, Denmark, May 2009. [pdf]
- Erik Tjong Kim Sang and Katja Hofmann.
Lexical Patterns or Dependency Patterns: Which Is Better for Hypernym
Extraction? In: Proceedings
of CoNLL-2009, Boulder, CO, USA, June 2009.
[pdf]