Welcome to the Alpino Treebank website. The Alpino treebank contains
syntactically annotated Dutch sentences. The treebank (more than
150,000 words) includes the full cdbl (newspaper) part of the
Eindhoven corpus. You will also find here a number of tools
to browse and search the treebank.
Original CDROM Version
The content of the CDROM which appeared in November 2002.
Treebanks: hand-corrected
Treebanks: not corrected
Formats
- XML. The XML format is understood by a number of utilities
described below. This format is suitable for machine
processing. As of December 2005, for the bigger corpora the .xml files
are collected together in compressed format (dictzip). We have new
tools capable of efficiently retrieving individual .xml files from
these archives.
- PDF. The PDF format is for visual human inspection - note that in some
cases the leaf nodes are abbreviated.
- SVG. We now also provide the dependency structures in SVG format. If
your browser cannot display this format, you may need to install a
separate SVG viewer.
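As an illustration of how an individual .xml file can be pulled out of a compressed collection, here is a minimal sketch in Python. It is not the retrieval tool mentioned above. It relies only on the fact that dictzip output is a valid gzip stream, so the standard gzip module can read it. The index layout (a mapping from file name to offset and length in the uncompressed data), the function names and the toy documents are all assumptions made for this example. The sketch also decompresses sequentially up to the requested offset, whereas a real dictzip reader uses the chunk table in the gzip header to jump directly to the right place.

```python
import gzip
import os
import tempfile


def build_archive(archive_path, documents):
    """Concatenate documents into one gzip file.

    Returns a hypothetical index {name: (offset, length)} that refers to
    positions in the uncompressed data.
    """
    index = {}
    offset = 0
    with gzip.open(archive_path, "wb") as stream:
        for name, text in documents.items():
            data = text.encode("utf-8")
            stream.write(data)
            index[name] = (offset, len(data))
            offset += len(data)
    return index


def read_entry(archive_path, index, name):
    """Return the text of one document, located through the index."""
    offset, length = index[name]
    with gzip.open(archive_path, "rb") as stream:
        # GzipFile emulates seek() by decompressing forward to the offset.
        stream.seek(offset)
        return stream.read(length).decode("utf-8")


def demo():
    """Build a two-document toy archive and fetch the second document."""
    documents = {
        "1.xml": '<alpino_ds id="1"/>',
        "2.xml": '<alpino_ds id="2"/>',
    }
    with tempfile.TemporaryDirectory() as workdir:
        archive_path = os.path.join(workdir, "corpus.data.gz")
        index = build_archive(archive_path, documents)
        return read_entry(archive_path, index, "2.xml")
```

Calling demo() builds a small archive in a temporary directory and returns the text of the second toy document.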
Utilities
A number of scripts and programs are supplied
to browse and search through the treebanks.
For more recent versions and various new tools, please download the Alpino
distribution from http://www.let.rug.nl/~vannoord/alp/Alpino/
- dtview. Graphical display of CGN dependency structures.
This script takes file name arguments (typically with a .xml
extension) and displays the CGN structures found in those files. The
script assumes you have Tcl/Tk installed on your machine.
- dt_search. Simple wrapper around the xmlmatch
program. In order to use this script, you first need to compile and
install the xmlmatch program (there are installation instructions in
the source directory). The script enables efficient and powerful
search in CGN dependency structures. Examples are provided in a
separate readme file.
- dtv. Simple wrapper to combine the dt_search and dtview
programs. The script takes a query and a set of file names. The files
which match the given query are then displayed by means of dtview (the
part(s) of the tree that matched the query will be highlighted).
- Thistle is an editor for linguistic data structures by
Jo Calder. We have used Thistle for editing CGN dependency
structures. The files without an extension are in the SGML format
understood by Thistle. The file cgn_dt.spec contains the specification
of our implementation of the CGN structures - it is required by
Thistle. In order to use Thistle you need to install the wrapper
scripts (the program is in Java and already compiled). Please cd to
directory thistle-2-0-1 for further instructions. Note that this
version of Thistle contains a number of small improvements and bug
fixes based on the thistle-2-0-alpha release. See the
Thistle homepage.
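To give an impression of what searching a dependency structure in XML involves, here is a minimal sketch in Python. It is not xmlmatch or dt_search, and it uses only the small XPath subset supported by the standard xml.etree.ElementTree module. The sample structure assumes nested node elements carrying rel, cat, pos and word attributes; the sample sentence and the function names are made up for this example and may differ in detail from the files in the treebank.

```python
import xml.etree.ElementTree as ET

# Toy dependency structure; the element and attribute names are assumptions.
SAMPLE = """
<alpino_ds>
  <node rel="top" cat="top">
    <node rel="--" cat="smain">
      <node rel="su" pos="noun" word="Jan"/>
      <node rel="hd" pos="verb" word="leest"/>
      <node rel="obj1" cat="np">
        <node rel="det" pos="det" word="een"/>
        <node rel="hd" pos="noun" word="boek"/>
      </node>
    </node>
  </node>
</alpino_ds>
"""


def words_under(node):
    """Collect the words of all leaf nodes at or below node, in document order."""
    return [n.get("word") for n in node.iter("node") if n.get("word")]


def find_by_relation(xml_text, relation):
    """Return the word lists of all nodes with the given dependency relation."""
    root = ET.fromstring(xml_text.strip())
    matches = root.findall(f".//node[@rel='{relation}']")
    return [words_under(node) for node in matches]


if __name__ == "__main__":
    print(find_by_relation(SAMPLE, "obj1"))
```

The query for obj1 selects the direct object node and lists the words it dominates. A query for hd would return both head nodes in the sample, one per matching node.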
Documentation
An attempt to describe a number of differences between the CGN
annotation practice and ours is given in
this document, which is badly out of date. The
good news is that the number of differences has recently been reduced considerably.
Publications
- Robert Malouf, Gertjan van Noord. Wide Coverage Parsing with
Stochastic Attribute Value Grammars. In: IJCNLP-04 Workshop Beyond
Shallow Analyses - Formalisms and statistical modeling for deep
analyses. [pdf,web page]
- Chapter 5. The Alpino Dependency Treebank. In: Leonoor van der
Beek, Gosse Bouma, Jan Daciuk, Tanja Gaustad, Robert Malouf,
Gertjan van Noord, Robbert Prins, Begoña Villada,
Algorithms for Linguistic Processing NWO PIONIER Progress
Report. Groningen 2002.
[pdf]
- Leonoor van der Beek, Gosse Bouma, Robert Malouf, Gertjan van
Noord. The Alpino Dependency Treebank. In:
Computational Linguistics in the Netherlands CLIN 2001. Rodopi 2002.
[pdf]
- Leonoor van der Beek, Gosse Bouma, and Gertjan van Noord.
Een brede computationele grammatica voor het Nederlands.
Nederlandse Taalkunde, 2002.
[pdf]
- Gosse Bouma and Geert Kloosterman. Querying dependency treebanks in XML.
In Proceedings of the Third international conference on Language
Resources and Evaluation (LREC), Gran Canaria, 2002.
[pdf]
- Gosse Bouma, Gertjan van Noord, Robert Malouf. Alpino: Wide
Coverage Computational Analysis of Dutch. In: Computational
Linguistics in the Netherlands CLIN 2000. Rodopi 2001.
[pdf]
Alpino Demo
In the context of the treebanking efforts, we are constructing a
natural language understanding system for Dutch: Alpino. This
ever-growing system is built on top of Hdrug; it contains a
wide-coverage HPSG for Dutch, a large-scale lexicon, a parser, a
disambiguation component using a log-linear (maximum entropy) model,
etc. There is an experimental web-demo.
Who to blame
- Leonoor van der Beek (annotation)
- Gosse Bouma (annotation)
- Jan Daciuk (tools)
- Geert Kloosterman (tools)
- Robert Malouf (tools)
- Gertjan van Noord (annotation, tools)
- Robbert Prins (art work, tools)
Further information:
Algorithms for Linguistic Processing Homepage.
Feedback
Based on the number of errors that we have found ourselves during the
last few months, it is certain that there are still many errors in the
treebank. We appreciate your feedback if you find errors. Please
send a polite email to:
vannoord@let.rug.nl
Hugo Brandt Corstius was keynote speaker at the 13th CLIN meeting
(29 November 2002 in Groningen). After his presentation, the first
CDROM was officially handed to him.