Authors

Bart Cramer <B.Cramer@student.rug.nl>

Gertjan van Noord <vannoord@let.rug.nl>

Introduction

This document gives an overview of the Alpino system and related software. Together, these programs can be used to annotate a corpus of Dutch sentences syntactically. We will not discuss the specific rules of syntactic annotation itself; for this we refer to the CGN document. Alpino itself is a parser which uses linguistic knowledge and various heuristics to construct appropriate linguistic structures of Dutch sentences.

Starting Alpino

Alpino is the tool that we use for interactive annotation. Before using Alpino, make sure that Alpino is properly installed on your machine. Please consult a local wizard for this. The Alpino program itself can be started with the command:

Alpino

The command accepts a large number of options. For annotation, however, we normally use a specialized script called Annotate. It requires one argument, which indicates the suite (a named collection of sentences) that you want to annotate.

Annotate NameOfSuite

Annotate can take all the options that Alpino knows about, but these options are placed after the name of the suite.

The Annotate script makes some assumptions about the directory structure that you are working in. You are advised to create a new directory for the purpose of annotation, called DCOI, LASSY, ALPINO or some such. In this directory, there ought to be two sub-directories called Suites and Treebank. In the Suites directory you collect the files (called suites) that you need to annotate. The annotations will be placed in the Treebank directory. The Annotate script will create a sub-directory in Treebank for each suite that you annotate.
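The layout described above can of course be created by hand. As a minimal sketch, the following Python creates the two sub-directories that Annotate expects. The helper name make_annotation_dirs and the example root LASSY are illustrations only and are not part of the Alpino distribution.

```python
from pathlib import Path


def make_annotation_dirs(root):
    """Create root/Suites and root/Treebank and return both paths.

    Hypothetical helper, not part of the Alpino distribution.
    """
    root = Path(root)
    suites = root / "Suites"
    treebank = root / "Treebank"
    for directory in (suites, treebank):
        # exist_ok makes the call safe to repeat on an existing layout.
        directory.mkdir(parents=True, exist_ok=True)
    return suites, treebank
```

Calling make_annotation_dirs("LASSY") would give a directory LASSY with empty Suites and Treebank sub-directories. The per-suite sub-directories of Treebank are left to the Annotate script.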

File format of suites

In the directory Suites you place the sets of sentences that are to be annotated. Each such set is called a suite, and is stored in a file with the extension .sents. You are free to choose the name of a suite, but apart from the extension the file name should not contain dots: a name such as 1.1.1.2.Thesis.txt.tok.pl will confuse Alpino. For instance, you might create the file my_corpus.sents containing the following lines:

1|De zon komt op boven de stortplaats van het stadje Merlijn .
2|Er klinkt een zacht geluid .
3|Lila het vosje spitst de oren .
4|Ze probeert erachter te komen waar het geluid vandaan komt .
5|Nee , het is geen vogel .

Every line contains a key, a vertical bar, and the tokens of the sentence, separated by spaces. Note that the sentences are assumed to be tokenized already: punctuation symbols are treated as separate tokens and are therefore separated by spaces as well. There are various tools in the Alpino software suite for tokenizing texts into this format, but here we assume that you have been given files that are already in this format.

The keys (whatever is placed to the left of the vertical bar | on each line) can be arbitrary strings, and are used to identify each sentence. Later, the file name containing the annotation of each sentence will use this key.
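The line format just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Alpino code: the function names are hypothetical, only the first vertical bar is taken as the separator, and duplicate keys are rejected because the key later names the annotation file.

```python
def parse_suite_line(line):
    """Split one line of a .sents file into (key, tokens).

    Returns None for empty lines. Illustrative helper, not Alpino code.
    """
    line = line.rstrip("\n")
    if not line.strip():
        return None
    if "|" not in line:
        raise ValueError("no vertical bar in line: %r" % line)
    # Assumption: only the first bar separates the key from the sentence.
    key, sentence = line.split("|", 1)
    return key, sentence.split()


def read_suite(lines):
    """Parse all lines into a dict from key to token list."""
    suite = {}
    for line in lines:
        parsed = parse_suite_line(line)
        if parsed is None:
            continue
        key, tokens = parsed
        # Assumption: keys must be unique, since each key names one file.
        if key in suite:
            raise ValueError("duplicate key: %r" % key)
        suite[key] = tokens
    return suite
```

For the second line of the example suite, parse_suite_line returns the key 2 and the six tokens Er, klinkt, een, zacht, geluid and the final full stop.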

If you have the correct Makefile in the Suites directory, you can compile the corpus into a format readable by Alpino. We use the `make' program to create the compiled corpus. The `make' program requires a file called `Makefile' which should contain the following definition:

%.pl : %.sents
        echo ":- module(suite,[ sentence/2 ])." > $*.pl
        echo >> $*.pl
        cat $*.sents | grep .|\
        sed -e "s/| /|/" \
            -e "s/  *$$//" \
            -e "s/\\\\/\\\\\\\\/g" \
            -e "s/'/\\\\'/g" \
            -e "s/ /','/g" |\
        awk -F\| ' !/^(%|--)/ {\
        printf("sentence(%c%s%c,[%c%s%c]).\n",39,$$1,39,39,$$2,39); N++;}' |\
            sicstus -l $(ALPINO_HOME)/Suites/echo >>$*.pl

Don't worry about the details of this unreadable definition. It is meant to treat white space, quotes etc, in a way so as not to confuse Prolog (Alpino is implemented in Prolog). You can copy a file with this definition from the Alpino distribution:

cp $ALPINO_HOME/Makefile.suite Makefile

You can now use this Makefile in order to create the Prolog version of the suite:

make my_corpus.pl

If all goes well, you now have the file my_corpus.pl which looks as follows:

:- module(suite,[ sentence/2 ]).

sentence('1',['De',zon,komt,op,boven,de,stortplaats,van,het,stadje,'Merlijn','.']).
sentence('2',['Er',klinkt,een,zacht,geluid,'.']).
sentence('3',['Lila',het,vosje,spitst,de,oren,'.']).
sentence('4',['Ze',probeert,erachter,te,komen,waar,het,geluid,vandaan,komt,'.']).
sentence('5',['Nee',',',het,is,geen,vogel,'.']).

If you need to change the suites file, then you should repeat the make compilation step, in order that the changes are also propagated to the compiled version of the suite.
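For readers who want to see what the Makefile rule does without decoding the sed and awk calls, here is a rough Python sketch of the same conversion. It is not the Alpino implementation and the function names are hypothetical. One difference is deliberate: the sketch puts quotes around every token, whereas the compiled file shown above has quotes only where Prolog needs them (presumably the result of the final sicstus step). Prolog reads both forms as the same atoms.

```python
def quote_atom(text):
    """Render text as a single-quoted Prolog atom."""
    # Double backslashes first, then escape single quotes, as the sed step does.
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def to_prolog_fact(key, tokens):
    """Render one sentence/2 fact for the compiled suite."""
    atoms = ",".join(quote_atom(token) for token in tokens)
    return "sentence(%s,[%s])." % (quote_atom(key), atoms)


def compile_suite(lines):
    """Return the text of a compiled suite for 'key|sentence' lines."""
    out = [":- module(suite,[ sentence/2 ]).", ""]
    for line in lines:
        line = line.strip()
        # The Makefile drops empty lines and lines starting with % or --.
        if not line or line.startswith("%") or line.startswith("--"):
            continue
        key, sentence = line.split("|", 1)
        out.append(to_prolog_fact(key, sentence.split()))
    return "\n".join(out) + "\n"
```

In practice you should keep using make, so that the compiled suite is produced by the same tools that Alpino was tested with.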

File format of annotations

The annotations are placed in the Treebank subdirectory. Because each sentence gets its own file, a new subdirectory is needed for each suite. In there, the annotation of a sentence is saved as an XML file, for example Treebank/my_corpus/2.xml. (Of course, if you haven't done any annotations yet, there will not be any .xml files in your directory.) This file could look as follows:

<?xml version="1.0" encoding="ISO-8859-1"?>
<alpino_ds version="1.0">
  <node id="0" rel="top" cat="top" begin="0" end="6">
    <node id="1" rel="--" cat="smain" begin="0" end="5">
      <node id="2" rel="mod" pos="adv" begin="0" end="1" root="er" word="Er"/>
      <node id="3" rel="hd" pos="verb" begin="1" end="2" root="klink" word="klinkt"/>
      <node id="4" rel="su" cat="np" begin="2" end="5">
        <node id="5" rel="det" pos="det" begin="2" end="3" root="een" word="een"/>
        <node id="6" rel="mod" pos="adj" begin="3" end="4" root="zacht" word="zacht"/>
        <node id="7" rel="hd" pos="noun" begin="4" end="5" root="geluid" word="geluid"/>
      </node>
    </node>
    <node id="8" rel="--" pos="punct" begin="5" end="6" root="." word="."/>
  </node>
  <sentence>Er klinkt een zacht geluid .</sentence>
</alpino_ds>
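Because the annotation is plain XML, it can be inspected with any XML library. The sketch below uses Python's standard xml.etree.ElementTree module on a copy of the file shown above. It assumes, based on that example only, that the words of the sentence are the node elements carrying a word attribute and that begin gives their position in the sentence. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Copy of the annotation shown above (bytes, so the declared encoding is used).
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<alpino_ds version="1.0">
  <node id="0" rel="top" cat="top" begin="0" end="6">
    <node id="1" rel="--" cat="smain" begin="0" end="5">
      <node id="2" rel="mod" pos="adv" begin="0" end="1" root="er" word="Er"/>
      <node id="3" rel="hd" pos="verb" begin="1" end="2" root="klink" word="klinkt"/>
      <node id="4" rel="su" cat="np" begin="2" end="5">
        <node id="5" rel="det" pos="det" begin="2" end="3" root="een" word="een"/>
        <node id="6" rel="mod" pos="adj" begin="3" end="4" root="zacht" word="zacht"/>
        <node id="7" rel="hd" pos="noun" begin="4" end="5" root="geluid" word="geluid"/>
      </node>
    </node>
    <node id="8" rel="--" pos="punct" begin="5" end="6" root="." word="."/>
  </node>
  <sentence>Er klinkt een zacht geluid .</sentence>
</alpino_ds>
"""


def leaves(root):
    """Return the word-bearing nodes in sentence order."""
    words = [node for node in root.iter("node") if "word" in node.attrib]
    return sorted(words, key=lambda node: int(node.get("begin")))


def describe(root):
    """Yield (word, pos, rel) for every leaf."""
    for node in leaves(root):
        yield node.get("word"), node.get("pos"), node.get("rel")


root = ET.fromstring(SAMPLE)
for word, pos, rel in describe(root):
    print(word, pos, rel)
```

Joining the words returned by leaves with spaces gives back the content of the sentence element, which is a cheap sanity check on a file.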

There are various programs closely related to treebanks: dtview, dtedit (also known as Thistle), dtchecks, dtsearch and dttred.

You can use dtview to display annotated sentences graphically. The dtview program has a rather intuitive interface, but there are a few keyboard shortcuts that make going through a treebank a bit less damaging for your wrist: Page Up and Page Down for going to the next and previous tree, and the arrow keys for navigating in trees that don't fit on the screen. The t key triggers the same response as the TrEd button at the top, namely opening dttred for that particular tree. There, you can edit the tree.

dtchecks performs a superficial check on a treebank. Its syntax is dtchecks directory, to be run from the Treebank directory. It recognizes errors such as a node with only one child and incompatible sisters/children/parents. Annotators produce these flaws rather often, so you must run dtchecks before rounding off a corpus. Note, however, that dtchecks might also complain about valid annotations.
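To give an idea of what such a check amounts to, here is a small Python sketch of one of the errors mentioned above, a node with only one child. It is not how dtchecks is implemented and it covers none of its other checks. Whether dtchecks exempts particular nodes, such as the top node, is not covered here either. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def single_child_nodes(root):
    """Return the id of every node element that has exactly one node child."""
    flagged = []
    for node in root.iter("node"):
        # findall("node") looks at direct children only.
        children = node.findall("node")
        if len(children) == 1:
            flagged.append(node.get("id"))
    return flagged
```

The function expects the root element of one annotation file, for example the result of ET.parse("my_corpus/2.xml").getroot(), and returns an empty list if no node is flagged.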

dtsearch is a program that lets the user search the treebank for trees that meet certain criteria. By default, it searches recursively in all subdirectories. After having retrieved the matching XML files, dtsearch will show them in dtview (if you use the -v option). For example, if you would like to find all trees in which zoals occurs as a preposition, you can issue the command:

dtsearch -v '//node[@word="zoals" and @pos="prep"]' .

The dtsearch program uses a very powerful query language (known as XPath), which is documented elsewhere.
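If dtsearch is not at hand, a similar search can be approximated with Python's standard library. The sketch below is not dtsearch and the function name is hypothetical. Python's ElementTree supports only a subset of XPath and has no "and" operator, so the two attribute tests of the query above are written as two chained predicates.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Same two tests as the dtsearch example, as chained predicates.
QUERY = ".//node[@word='zoals'][@pos='prep']"


def search_treebank(directory, query=QUERY):
    """Return the sorted paths of all .xml files below directory that match query."""
    hits = []
    for path in sorted(Path(directory).rglob("*.xml")):
        root = ET.parse(str(path)).getroot()
        if root.find(query) is not None:
            hits.append(path)
    return hits
```

Run from the Treebank directory, search_treebank(".") lists the matching files of all suites. Unlike dtsearch with the -v option, it does not display them.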

A simple example of the annotation process

To see the system at work, let's walk through the procedure from start to finish. First, make a Suites and a Treebank directory, create or copy an example suite in your Suites directory, and compile that suite into Prolog format. Let's assume the suite is called my_corpus. Then, issue the command

Annotate my_corpus

from your current directory (that should now contain the Suites and Treebank sub-directories).

Alpino starts up its graphical environment. On the top you will see a series of menu buttons, which we will ignore for now (although it is perhaps useful to know that you can stop the system by selecting the File menu, and clicking the Halt item in that menu).

Below the menu buttons you see the Parse button on the left, with an empty long space to its right, followed by a shorter space at the right hand side. Once we have selected a sentence, the sentence will appear in this space, with its key displayed further to the right. You can try parsing a sentence by selecting it from the list box below the Parse-button, followed by pressing that Parse-button.

Before Alpino really starts parsing the sentence, it presents a new widget with the lexical analysis of the sentence. At this point, you can influence the lexical analysis phase of Alpino, but for now we simply click the Parse button (bottom left) on this new widget, to really get going. Below, the lexical analysis phase is explained in more detail (you can also switch off the interactive lexical assignment using the option interactive_lexical_analysis=off).

Alpino will now (finally!) parse the sentence syntactically, and it will show you the most probable analysis in the left pane. For each analysis, a green button with the number of the analysis is constructed as well. As a short-cut, clicking on such a numbered analysis button with your right mouse button will immediately display the corresponding dependency structure of that analysis. A left-click on the number will give you a menu, with which you can, for example, save your parse. If you select XML -> Save, Alpino will automatically create the needed file in the corresponding subdirectory of the Treebank folder. If the correct annotation is not available, you can select XML -> Save and Edit, to start the editor for this annotation. There are many more options, some of which are useful. Please experiment.
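Since every saved sentence ends up as Treebank/suite/key.xml, it is easy to list the sentences of a suite that still lack an annotation. The following Python sketch does this under the directory layout described earlier. The function name and the encoding parameter are assumptions made for the illustration, not part of Alpino.

```python
from pathlib import Path


def missing_annotations(root, suite, encoding="latin-1"):
    """Return the keys of suite that have no Treebank/<suite>/<key>.xml yet.

    Assumption: latin-1 is used only because it decodes any byte sequence,
    which is enough to read the keys; adjust it to your files.
    """
    root = Path(root)
    sents = root / "Suites" / (suite + ".sents")
    treebank = root / "Treebank" / suite
    missing = []
    with open(sents, encoding=encoding) as handle:
        for line in handle:
            if "|" not in line:
                continue  # skip empty or malformed lines
            key = line.split("|", 1)[0]
            if not (treebank / (key + ".xml")).is_file():
                missing.append(key)
    return missing
```

Called as missing_annotations(".", "my_corpus") from the directory that contains Suites and Treebank, it returns the keys in the order in which they occur in the suite.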

Options

In Alpino, there are a lot of options available. Some options have a quality/computational time trade-off. The following are the most important:

Tips and tricks

Besides the general way of parsing sentences, there are a number of manual tricks available, which can help a great deal in improving the quality of the parses or in speeding up the parsing process.

Hij fietst [ @pp op zijn gemakje ] [ door de straten van De Baarsjes ] .
FNB zet teksten van [ [ kranten en tijdschriften ] , boeken , [ studie- en vaklectuur ] , bladmuziek , folders , brochures ] om in gesproken vorm .
[ @np De conferentie die betrekking heeft op ondersteunende technologie voor gehandicapten in het algemeen ] , bood [ @np een goed platform [ om duidelijk te maken hoe uitgevers zelf toegankelijke structuren in hun informatie kunnen aanbrengen ] ] .
Hij heeft een beetje [ @postag adjective(no_e(adv)) curly ] haar .
Op een [ @postag adjective(e) mgooie ] dag gingen ze fietsen .
Mijn [ @postag noun(de,count,sg) body mass index ] laat te wensen over .
Ik wil [ @skip ??? ] naar huis
Ten opzichte [ @skip echter ] van deze bezwaren willen wij ....
Ik aanbad [ @phantom hem ] dagelijks in de kerk
Ik kocht boeken en Piet [ @phantom kocht ] platen
Ik heb [ @phantom meer ] boeken gezien dan hem

Limitations: a phantom bracketed string can only contain a single word. The technique does not work yet for words that are part of a multi-word unit.

Warning: the resulting dependency structure is most likely not well-formed and often needs manual editing.

1 |: mlex dreun
stem           string         his                 cat
dreun          [dreun]        normal              noun(de,count,sg)
dreun          [dreun]        normal              verb(hebben,sg1,intransitive)
dreun          [dreun]        normal              verb(hebben,sg1,sbar)
dreun_na       [dreun]        normal              verb(hebben,sg1,part_intransitive(na))
dreun_op       [dreun]        normal              verb(hebben,sg1,part_sbar(op))
dreun_op       [dreun]        normal              verb(hebben,sg1,part_transitive(op))
2 |: * De beer is los
top_features=grammar
parse: De beer is los
Lexical analysis: 4 words, 243 -> 8 -> 8 -> 8 tags, 42 signs, 170 msec
Parsed (70 msec)
created object: 1
[De beer is los]
Q#undefined|De beer is los|1|1|-0.05113238180000002
created object: 2
[De beer is los]
Q#undefined-2|De beer is los|1|1|-0.02903353880000002
cputime total 260 msec
Found 2 solution(s)
32 history items
             top=top
                |
            -- =smain
        _____________________
       |           |        |
     su=np      hd=verb predc=adj
    ________
   |       |       |        |
det=det hd=noun   ben      los
   |       |
  de     beer
3 |: quit
cramer@rana:~>
add_lex curly mooi

tells the Alpino lexicon that the word curly has the same lexical categories as the word mooi. It also works for multi-word units:

add_lex body mass index index

POS-tagging

One very important step in syntactic annotation is POS-tagging: determining which word group a word belongs to. Having precise knowledge about these categories can speed up the annotation process considerably, because then you are able to prune large parts of the search space manually.

Almost all words have several POS-tags. A word can be ambiguous because it belongs to different main categories (werk can be either a verb or a noun), but also because its readings differ in details (geven can be intransitive, transitive or ditransitive; some verbs take other complements, such as verb phrases; etc.).

In this document, the main categories and the most frequently used options will be reviewed, as these are most important for the normal user. The finer details of some exotic words will thus be left out.

Verbs

The tag of verbs takes at least three arguments:

For the latter, there is a large variety of possibilities. These are the basic ones:

Table: Verb's fourth argument, single forms
Tag Example
intransitive Ik schrijf
transitive Ik schrijf een brief
np_np Ik schrijf mijn moeder een brief
so_pp_np Ik schrijf een brief aan mijn moeder
so_np Ik schrijf mijn moeder
refl Ik schaam me
copula Ik ben een echte fietser
pred_np Ik vind dat leuk
pc_pp(over) Ik schrijf over de ramp
ld_pp Ik zwaai naar mijn moeder
er_er Er zijn er die voetballen leuk vinden
part_intransitive(mee) Ik schrijf mee
fixed([svp_pp(op,naam),acc],norm_passive) Ik schrijf die overwinning op mijn naam
vp Ik speel om te winnen
sbar Ik schrijf dat ik huil
cleft Het was op Piet dat hij een uur moest wachten
aan_het Ik ben aan het lopen
aci Ik laat hem fietsen
aux_psp_hebben Ik heb gefietst
aux_psp_zijn Ik ben nog een uurtje gebleven
aux(inf) Ik zal fietsen
aux(te) Ik blijk te blozen
passive Ik word/ben door mijn moeder uitgezwaaid
te_passive Ik ben te vinden in de kelder
dip_sbar_subj Ik denk , zei Piet , dat hij komt

With a bit of creativity one can easily see the meaning of related POS-tags. A few examples are:

Table: Verb's fourth argument, combined forms
Tag Example
ld_adv Ik ben daar
ld_dir Ik loop huiswaarts; Ik loop het bos in
np_ld_pp Ik schop de bal naar het doel
np_ld_dir Ik duw hem huiswaarts; Ik stuur hem het bos in
sbar_subj Het lijkt dat je geschikt bent
sbar_subj_so_np Het lijkt mij dat je geschikt bent
refl_sbar Ik schaam me dat ik dat gedaan heb
copula_sbar Het is leuk dat je fietst
copula_vp Het is leuk om te fietsen
so_copula Ik ben hem trouw
np_aan_het Ik heb de motor aan het lopen
np_sbar Ik schrijf mijn moeder dat ik huil
er_pp_sbar(achter) Ik ben erachter dat je me bedriegt
pred_np_sbar Ik vind het leuk dat je bloost
pred_np_vp Ik vind het leuk om te blozen
pred_pc_pp(van) Ik word verlegen van jouw aandacht
part_transitive(aan) Ik schrijf een fonds aan
part_np_np(voor) Ik schrijf hem een medicijn voor
fixed([[het,apezuur],refl],no_passive) Ik schrijf me het apezuur
fixed([{[acc(gesprek),pc(met)]}],no_passive) Ik heb een gesprek met de rector
aux_modifier(inf) Het moet raar lopen , wil hij komen
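The tags in these tables, like the full tags printed by mlex (for instance verb(hebben,sg1,part_transitive(op))), are Prolog-style terms: a name followed by comma-separated arguments, which may themselves be terms or lists. As a reading aid only, the following Python sketch splits such a tag into its name and its top-level arguments. It is not part of Alpino, the function names are hypothetical, and it does not handle quoted atoms that contain commas or brackets.

```python
def split_top_level(text):
    """Split text on commas that are not nested inside (), [] or {}."""
    parts, depth, current = [], 0, []
    for char in text:
        if char in "([{":
            depth += 1
        elif char in ")]}":
            depth -= 1
        if char == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    parts.append("".join(current).strip())
    return parts


def parse_tag(tag):
    """Return (name, arguments) for a tag such as verb(hebben,sg1,transitive)."""
    tag = tag.strip()
    if "(" not in tag:
        return tag, []  # a tag without arguments
    if not tag.endswith(")"):
        raise ValueError("unbalanced tag: %r" % tag)
    name, rest = tag.split("(", 1)
    return name, split_top_level(rest[:-1])
```

Applied to verb(hebben,sg1,part_transitive(op)), parse_tag returns the name verb and the three arguments hebben, sg1 and part_transitive(op). Applying it again to the last argument yields part_transitive with the single argument op.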

Particles

The particle tag has no attributes. Particles are used as separable verb prefixes, such as op in Hij belde ons op, and also as post-positions in a prepositional phrase, such as af in van het begin af had ik twijfels.

Fixed

The category fixed is used for the frozen part of verbal idioms. Examples of these are ter dood, in de koude kleren and uit het oog.

Nouns and related tags

Just like verb tags, all noun tags have three compulsory arguments:

The special value meas is used for singular measure nouns that combine nonetheless with plural numbers (dertien jaar, twintig meter).

The optional fourth attribute of nouns shows which complement(s) the noun takes:

Table: Noun's fourth argument
Tag Example
start_app_measure onder het motto : we zullen wel zien
app_measure de stof insuline
np_app_measure op camping de Mokerhei
measure een plakje worst
sbar het idee dat je kunt vliegen
vp het idee om te gaan vliegen
subject_sbar het is een feit dat hij komt
subject_vp het is een drama om te moeten wachten
pred_pp(van) dat is van groot belang

Some of these distinctions are somewhat subtle. The label app_measure is used if the following noun is understood somewhat like a name. In many cases, you can add the word genaamd (entitled): een stof genaamd insuline. On the other hand, the measure category is used for words such as plakje, kopje, hoeveelheid which often indicate a quantity. The np_app_measure is used for words that take a full NP modifier (typically including a determiner), often headed by a proper name.

The category pred_pp(Prep) is used for nouns which, if combined with the specified preposition Prep, form a predicative PP complement.

The subject_sbar and subject_vp tags indicate that the noun occurs as a predicative complement with a sentential subject (sbar or vp).

Verbal nouns

Verbal nouns are verbs that have the syntactical function of a noun (nominalizations). An example is: Ik heb het fietsen nog niet afgeleerd. They have the same complements as their corresponding verb entries in the lexicon. An example of this is: Het in Utrecht wonen bevalt hem where in Utrecht is an LD complement of wonen.

Determiner

Determiners are divided into different categories: gen_determiner, comp_determiner, name_determiners, tmp_determiner and regular determiners.

Regular determiners can be divided into a few types, each of which takes a different set of arguments. The type of the determiner is expressed in the first argument. An overview of all types can be found in the table below. Besides the first argument, determiners can have up to five additional arguments, depending on the type of the determiner:

Table: Minor determiner types
Tag Example
determiner(de) De jongen is flink
determiner(deze,nwh,nmod,pro,yparg) Deze jongen is flink
determiner(alle,nwh,mod,pro,nparg) Alle jongens zijn flink
determiner(geen,nwh,mod,pro,yparg,nwkpro,geen) Geeneen jongen is flink
determiner(wat) Tal van jongens zijn flink
determiner(wat,nwh,mod,pro,yparg) Genoeg jongens zijn flink
determiner(elke,nwh,mod) Iedere jongen is flink
determiner(pron) Jouw jongens zijn flink
determiner(sg_num) Menig jongen is flink
determiner(pl_num,nwh,nmod,pro,yparg) Meerdere jongens zijn flink
determiner(welke) Enige jongens zijn flink
determiner(zulke) Zulke jongens zijn flink

Pronouns

Pronouns have a lot of arguments in Alpino. Because most pronouns are not ambiguous, these arguments mainly matter to the underlying parsing engine, and are therefore less important for the end user of the system. Still, we would like to show a few types of pronouns in their usual contexts.

Table: Pronoun types
Tag Example
pronoun(nwh,thi,pl,de,both,indef) Beide gingen naar school
pronoun(ywh,thi,both,de,both,indef) Wie is dat ?
rel_pronoun(de,no_obl) De jongen die ik bekeek , keek naar me
reflexive(je,both) Waar houd je je mee bezig ?

Adjectives

The first argument indicates in which forms an adjective can occur:

Table: Adjective's first argument
Tag Example
no_e de auto is mooi
e de mooie auto
both het malafide bedrijf; het bedrijf is malafide
pred (only functions as predicative) de kast staat ondersteboven, but not: de onderstebovene kast
er, ere, st, ste (comparative forms) mooier, mooiere, mooist, mooiste
ende/end (gerund with/without e) een auto komende uit de richting Assen
postn de jongen afkomstig uit Palestina

The second argument shows whether the adjective can behave as an adverb as well:

Table: Adjective's further attributes
Tag Example
adv (can be used as vp modifier) De jongen loopt mooi
nonadv (cannot be used as vp mod) *De jongen loopt nederlandstalig
padv (can be used as predm, but not vp mod) Hij loopt herkenbaar over straat
both (can be used both as predm and vp mod) Gek van vreugde liep hij rustig weg; Hij heeft deze zaken gek aangepakt
locadv (can be used as locative vp mod) De jongen loopt ver
tmpadv (can be used as temporal vp mod) Hij heeft kort geaarzeld

A further (optional) argument shows which complements the adjective can have. These are very similar to the complements for verbs.

Prepositions

The preposition tag typically has one argument, which is a list (often the empty list). The list contains potential post-positions, with which the current preposition can combine to form a hd/obj1/hdf triple.

In some cases there is a further argument for prepositions, which indicates more specialized uses, as documented in this table:

Table: Types of prepositions
Tag Example
preposition(over,[]) Het leven gaat niet over rozen
preposition(achter,[aan,door,langs,om,uit]) achter de dief aan
preposition(op,[],pc_adv) Ik rekende op vanmiddag
preposition(ondanks,[],sbar) Ik kom ondanks dat ik me slecht voelde
preposition(ongeacht,[],of_sbar) Ik kom ongeacht of ik me slecht voel
preposition(van,[],pp) Een boek van voor de oorlog
preposition(van,[],loc_adv) Ik kom van boven
preposition(sinds,[],tmp_adv) Hij woont hier sinds gisteren
preposition(tegenin,[],extracted_np) Hij gaat er niet tegenin
preposition(te,[],nodet) Hij woont te Assen
preposition(als,[],pred) Dat beschouw ik als ongepast
preposition(voor,[],voor_pred) Ik houd het voor onmogelijk dat je komt
preposition(met,[],absolute) Met Piet achter het stuur zijn we verloren

Adverbs

Table: Types of adverbs
Tag Example
adverb(reeds) Op woensdag wist zij dat reeds
modal_adverb(reeds) Reeds op woensdag wist de minister dat
postnp_adverb(reeds) Deze week reeds wist de minister dat
postadv_adverb(reeds) Gisteren reeds wist de minister dat
post_wh_adverb([dan,ook]) Zij kunnen waar dan ook over de wetenschap praten
wh_adverb(hoezo) Hoezo wist de minister dat al ?
waar_adverb(waarover, over) Waarover praten jullie ? (Waarover becomes a pc)
er_adverb(daarmee, met) Hou daarmee op ! (Daarmee -> pc)
sentence_adverb(bovendien) Bovendien is de rente weer op een acceptabel niveau
loc_adverb(onderaan) Groningen bungelt onderaan (onderaan -> ld)
dir_adverb(bergop) De trend van de laatste jaren is duidelijk bergop (bergop -> ld)
tmp_adverb(gisteren) Gisteren wist de minister dat al
predm_adverb([en,masse]) Ze gingen en masse Berlusconi stemmen

The difference between a sentence_adverb and an ordinary adverb is that the distribution of sentence_adverbs is more limited.

Complementizers

Table: Types of complementizers
Tag Example
complementizer (will become comp) Omdat de fresco's mooi waren viel het bezoek niet tegen
complementizer(sat) (will become comp) Behalve dat de fresco's mooi waren , was er niet veel te zien; Behalve de mooie fresco's was er niet veel te zien
complementizer(root) (will become dlink) Maar het bezoek viel tegen; Maar omdat het bezoek tegenviel …

Conjunctions

Table: Types of conjunctions
Tag Example
conj(en) Ik zag hem en groette hem
conj(alias) Ik ontmoette Johan V. alias de Hakkelaar
left_conj(maar_ook), right_conj(maar_ook) Niet alleen de motoren maar ook de auto's gingen op de bon

UNKNOWN words

UNKNOWN words are words that cannot be placed in any context, and they are entirely disregarded in the parsing process. This typically occurs only for parts of multi-word units.

Proper names

Alpino contains long lists of names, as well as a number of heuristics to recognize names. The tag proper_name is used for these cases. The sub-features include agreement (often left unspecified), and a further named entity class such as PER (person names), LOC (geographical names) and ORG (organization names). For annotation purposes you can safely ignore this distinction.