This document describes the format of the dependency trees that are used as input for the Alpino chart generator. Dependency trees describe grammatical dependency relations between lexical items, as well as the constituents that dominate those lexical items. Since dependency trees for generation can contain less information than dependency trees that are produced as a side effect of parsing, we call them abstract dependency trees (ADTs).
Several input formalisms have been proposed for sentence realization in the past, such as minimal recursion semantics (MRS). We have chosen a different input format, one that describes the grammatical dependencies of the sentence to be generated. The rationale for this format is:
ADTs can be derived from (Alpino) dependency trees with ease.
Most other input formalisms would require rather extensive changes to the lexicon and grammar.
Prior work with Alpino dependency trees has shown that (abstract) dependency trees provide sufficient abstraction for tasks where a generation component is desired.
This document describes the format of ADTs, including the representation of ADTs as Prolog terms and XML documents. The procedure for deriving an ADT from a normal Alpino dependency tree is also described.
An abstract dependency tree is a directed acyclic graph that models the grammatical relations between lexical items and categories built from lexical items. The generator creates realizations for abstract dependency trees that describe at least one possible sentence. An abstract dependency tree can consist of four node types:
Category (interior) nodes.
Lexical (leaf) nodes.
Indexed nodes.
Reference nodes.
Edges between nodes have a dependency label, where hd indicates the head of a grammatical relation. Figure [dtree] shows a dependency tree for the sentence "De boeken kosten ons een klein fortuin". For interior nodes, the dependency label is shown as the first element, and the category as the second.
Possible dependency labels are listed in table [deplabels]. In the following sections, the node types will be described in more detail.
| Label | Description |
|---|---|
| BODY | body (with complementizer) |
| HDF | closing element of a circumposition |
| LD | locative or directional complement |
| ME | measure phrase complement |
| MWP | part of a multi-word-unit |
| NUCL | nucleus discourse unit |
| OBCOMP | object of comparative |
| OBJ2 | secondary object (indirect object, ...) |
| POBJ1 | provisional direct object |
| RHD | head of a relative clause |
| SAT | satellite discourse unit |
| SE | inherently reflexive complement |
| SVP | separable verb particle |
| WHD | head of wh-question |
Category nodes are interior tree nodes that describe a category. All possible categories are listed in table [categories].
| Category | Description |
|---|---|
| AHI | aan het-infinitive group |
| CP | phrase started by a subordinating conjunction |
| DETP | word group with a determiner as the head |
| INF | bare infinitive group |
| PPRES | present participle group |
| SMAIN | declarative sentence (verb at the second position) |
| SSUB | subordinate clause (verb final) |
| SV1 | verb-initial sentence (yes/no question, imperatives) |
| WHREL | relative clause with embedded antecedent |
Lexical nodes are leaf nodes that provide an abstracted representation of (surface) words. At a minimum, a lexical node should specify:
The word sense. The sense of a word is the root of a word, possibly with additional information to select for a specific reading.
An Alpino part of speech tag.
A set of attribute/value pairs.
Part of speech tags and possible attributes are discussed in more detail below.
Alpino part of speech tags
Table [postags] lists all possible Alpino part of speech tags for lexical items.
In addition to the Alpino part of speech tag, some additional information is normally required to determine the semantics embodied by an ADT. In particular, the number for nouns, and the tense and inflection for verbs should be known.
The number of a noun is specified with the num attribute, which can have one of the following values: sg, pl, both, meas, bare_meas, pl_num, sg_num
The tense of the verb is specified with the tense attribute, which can have one of the following values: present, past. For infinitive and imperative inflections, this attribute is omitted.
The inflection of a verb is specified with the infl attribute, which can have one of the following values: imp(sg1), imp(modal_u), inf, inf(no_e), inf_ipp, modal_not_u, modal_inv, pl, psp, sg_hebt, sg_heeft, sg, sg1, sg3, subjunctive
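The value lists above lend themselves to a simple sanity check. The following is a minimal sketch in Python, not part of Alpino: the function name check_attributes and the CHECKS table are hypothetical, and the sketch assumes that only nouns and verbs are checked. It does not enforce when tense must be present or absent.

```python
# Allowed values, copied from the attribute lists in the text.
NUM_VALUES = {"sg", "pl", "both", "meas", "bare_meas", "pl_num", "sg_num"}
TENSE_VALUES = {"present", "past"}
INFL_VALUES = {
    "imp(sg1)", "imp(modal_u)", "inf", "inf(no_e)", "inf_ipp",
    "modal_not_u", "modal_inv", "pl", "psp", "sg_hebt", "sg_heeft",
    "sg", "sg1", "sg3", "subjunctive",
}

# Assumption: only nouns and verbs are checked, following the text.
CHECKS = {
    "noun": {"num": NUM_VALUES},
    "verb": {"tense": TENSE_VALUES, "infl": INFL_VALUES},
}


def check_attributes(pos, attrs):
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    for name, allowed in CHECKS.get(pos, {}).items():
        value = attrs.get(name)
        # Attributes that are absent are not flagged, only unknown values.
        if value is not None and value not in allowed:
            problems.append(f"{pos}: unexpected {name}={value}")
    return problems


print(check_attributes("noun", {"gen": "de", "num": "sg"}))
print(check_attributes("verb", {"infl": "sg1", "tense": "future"}))
```

The first call reports no problems; the second flags the tense value, since future is not among the listed values.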
Indexed nodes wrap another node to give it an index that a reference node can refer to. For instance, in the sentence Ik heb de trein gemist, ik is the subject of both heb and gemist, while gemist is the head of the vc of heb. Representing both subject relations requires co-indexation.
Reference nodes refer to the index number of an indexed node. A reference node has no content of its own, since the content is provided by the indexed node.
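To make the four node types concrete, here is a minimal Python sketch of an in-memory model, with a helper that looks up the content a reference node stands for. All class and function names (Cat, Lex, Indexed, Ref, Node, index_table, content_of) are hypothetical and chosen for illustration; they are not part of Alpino.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Cat:
    cat: str                      # category, e.g. "smain"


@dataclass
class Lex:
    sense: str                    # word sense, e.g. "ik"
    pos: str                      # part of speech tag, e.g. "pron"
    attrs: Dict[str, str] = field(default_factory=dict)


@dataclass
class Indexed:
    index: int
    content: Union[Cat, Lex]      # the wrapped node content


@dataclass
class Ref:
    index: int                    # points at an Indexed node


@dataclass
class Node:
    rel: str                      # dependency label, e.g. "su"
    label: Union[Cat, Lex, Indexed, Ref]
    daughters: List["Node"] = field(default_factory=list)


def index_table(node, table=None):
    """Collect the content of every indexed node, keyed by index."""
    table = {} if table is None else table
    if isinstance(node.label, Indexed):
        table[node.label.index] = node.label.content
    for daughter in node.daughters:
        index_table(daughter, table)
    return table


def content_of(node, table):
    """Return the category or lexical content a node stands for."""
    if isinstance(node.label, Ref):
        return table[node.label.index]
    if isinstance(node.label, Indexed):
        return node.label.content
    return node.label


# Fragment of "Ik heb de trein gemist": ik is the subject of heb and of gemist.
ref = Node("su", Ref(1))
vc = Node("vc", Cat("ppart"), [ref, Node("hd", Lex("mis", "verb"))])
smain = Node("--", Cat("smain"), [
    Node("su", Indexed(1, Lex("ik", "pron"))),
    Node("hd", Lex("heb", "verb")),
    vc,
])
print(content_of(ref, index_table(smain)).sense)  # ik
```

Keeping the index separate from the content mirrors the description above: the reference node itself stays empty, and its content is found through the index.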
To allow for an abstract description of verbs that have a separable particle, particles with the svp relation may be omitted from an ADT. If such particles are not included as nodes in the ADT, the generator will attempt to generate sentences both with a separated and with a connected particle.
An ADT can be stored as a Prolog term that consists of recursive tree terms. The basic format is tree(Relation,Daughters),
where Relation is a relation term, and Daughters is a list of daughter nodes, or the empty list for leaf nodes. A relation has the form r(Type,Label).
Type indicates the type of relation, such as su, obj1, or mod. The label is one of the four node types described before: it can be a category, a lexical item, an index that is associated with a lexical item or category, or a reference to an index. Categories use a p/1 term; for instance, r(vc,p(ppart))
is a node of the category ppart with a vc relation. Lexical nodes use an adt_lex/4 term as their label: adt_lex(Root,Sense,PosTag,Attributes).
Here the root, sense, POS tag, and attributes of a lexical item are noted. The sense of a lexical item can contain additional information about a lexical item to select for a specific reading. For instance, for words with fixed parts, the fixed parts are often listed in the sense. E.g. the sense of rood aanlopen is rood-loop_aan. The root can be omitted by replacing it with an anonymous variable (_).
This is an example of a lexical node: r(hd,adt_lex(trein,trein,noun,[gen=de,num=sg])).
This describes a noun with the root trein and two attributes (gen=de,num=sg) with the head (hd) relation.
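It is often necessary to refer to a lexical or category node twice in a tree. The first occurrence of the node is wrapped in an i/2 term that pairs an index with the node, such as i(1,adt_lex(ik,ik,pron,[wh=nwh,per=fir,num=sg,gen=de,case=nom,def=def])).
Subsequent uses of the same node can then be marked with an i/1 term. For instance, we can refer to the node above with i(1).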
Combined, we can construct ADTs as Prolog terms for all grammatical sentences. E.g., the ADT term for the sample discussed above is:
tree(r(top,p(top)),[
  tree(r(--,p(smain)),[
    tree(r(su,i(1,adt_lex(ik,ik,pron,
        [wh=nwh,per=fir,num=sg,gen=de,case=nom,def=def]))),[]),
    tree(r(hd,adt_lex(heb,heb,verb,
        [sc=aux_psp_hebben,infl=sg1,tense=present])),[]),
    tree(r(vc,p(ppart)),[
      tree(r(su,i(1)),[]),
      tree(r(obj1,p(np)),[
        tree(r(det,adt_lex(de,de,det,[infl=de])),[]),
        tree(r(hd,adt_lex(trein,trein,noun,[gen=de,num=sg])),[])
      ]),
      tree(r(hd,adt_lex(mis,mis,verb,[sc=transitive,infl=psp])),[])
    ])
  ])
])
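When ADTs are assembled outside Prolog, the term syntax can be produced by simple string composition. The following Python sketch shows how the tree, r, p, i, and adt_lex terms nest. The helper functions are hypothetical, and the sketch assumes that all atoms can be written unquoted; atoms that need quoting in Prolog are not handled.

```python
def adt_lex(root, sense, pos, attrs):
    """Render a lexical label; attrs is a list of (name, value) pairs."""
    pairs = ",".join(f"{name}={value}" for name, value in attrs)
    return f"adt_lex({root},{sense},{pos},[{pairs}])"


def p(cat):
    """Render a category label."""
    return f"p({cat})"


def i(index, label=None):
    """Render an indexed label (with content) or a reference (without)."""
    return f"i({index})" if label is None else f"i({index},{label})"


def tree(rel, label, daughters=()):
    """Render a tree term; leaf nodes get the empty daughter list."""
    return f"tree(r({rel},{label}),[{','.join(daughters)}])"


# The noun phrase "de trein" with the obj1 relation.
np_term = tree("obj1", p("np"), [
    tree("det", adt_lex("de", "de", "det", [("infl", "de")])),
    tree("hd", adt_lex("trein", "trein", "noun", [("gen", "de"), ("num", "sg")])),
])
print(np_term)
```

The printed term is the obj1 subtree of the example sentence, written on a single line.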
ADTs can also be represented as XML documents to allow for easy querying and manipulation outside the Alpino environment.
The dependency tree is represented in XML as a recursive structure of node elements. Each node has an identifier (id) and relation (rel). Category nodes have a cat attribute that specifies the category. For instance:
<node cat="ppart" id="4" rel="vc"> ... </node>
Every lexical node has the root, sense, and pos attributes, which describe the word root, sense, and part of speech tag respectively. E.g.:
<node gen="de" id="8" num="sg" pos="noun" rel="hd" root="trein" sense="trein" type="adt_lex"/>
Lexical nodes can have other attributes, as described in the previous sections. For instance, the num and gen attributes listed here give the number and gender of the noun.
A node can be co-indexed by adding the index attribute to a lexical node:
<node id="2" index="1" root="ik" sense="ik" [...] />
The referring node also contains an index attribute, but no other information specific to a lexical node. For example:
<node id="5" index="1" rel="su"/>
As an example of a full ADT, we include an ADT for the same sentence used in the Prolog ADT:
<?xml version="1.0" encoding="ISO-8859-1"?>
<alpino_adt version="1.3">
  <node cat="top" id="0" rel="top">
    <node cat="smain" id="1" rel="--">
      <node case="nom" def="def" gen="de" id="2" index="1" num="sg" per="fir" pos="pron" rel="su" root="ik" sense="ik" wh="nwh"/>
      <node id="3" infl="sg1" pos="verb" rel="hd" root="heb" sc="aux_psp_hebben" sense="heb" tense="present"/>
      <node cat="ppart" id="4" rel="vc">
        <node id="5" index="1" rel="su"/>
        <node cat="np" id="6" rel="obj1">
          <node id="7" infl="de" pos="det" rel="det" root="de" sense="de"/>
          <node gen="de" id="8" num="sg" pos="noun" rel="hd" root="trein" sense="trein"/>
        </node>
        <node id="9" infl="psp" pos="verb" rel="hd" root="mis" sc="transitive" sense="mis"/>
      </node>
    </node>
  </node>
</alpino_adt>
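As an illustration of querying the XML representation outside Alpino, the following Python sketch uses the standard library's xml.etree.ElementTree to list the relation and root of every leaf node, following index references back to the indexed node. The function names are hypothetical. The XML declaration is left out of the embedded string only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

ADT_XML = """
<alpino_adt version="1.3">
  <node cat="top" id="0" rel="top">
    <node cat="smain" id="1" rel="--">
      <node case="nom" def="def" gen="de" id="2" index="1" num="sg"
            per="fir" pos="pron" rel="su" root="ik" sense="ik" wh="nwh"/>
      <node id="3" infl="sg1" pos="verb" rel="hd" root="heb"
            sc="aux_psp_hebben" sense="heb" tense="present"/>
      <node cat="ppart" id="4" rel="vc">
        <node id="5" index="1" rel="su"/>
        <node cat="np" id="6" rel="obj1">
          <node id="7" infl="de" pos="det" rel="det" root="de" sense="de"/>
          <node gen="de" id="8" num="sg" pos="noun" rel="hd" root="trein"
                sense="trein"/>
        </node>
        <node id="9" infl="psp" pos="verb" rel="hd" root="mis"
              sc="transitive" sense="mis"/>
      </node>
    </node>
  </node>
</alpino_adt>
"""


def index_table(adt):
    """Map each index to the node that carries the content for it."""
    table = {}
    for node in adt.iter("node"):
        index = node.get("index")
        # A reference node has an index but neither cat nor root.
        if index is not None and (node.get("cat") or node.get("root")):
            table[index] = node
    return table


def leaf_roots(adt):
    """Return (rel, root) for every non-category node, resolving references."""
    table = index_table(adt)
    result = []
    for node in adt.iter("node"):
        if node.get("cat") is not None:
            continue
        target = node
        if node.get("root") is None and node.get("index") is not None:
            target = table[node.get("index")]
        result.append((node.get("rel"), target.get("root")))
    return result


adt = ET.fromstring(ADT_XML)
print(leaf_roots(adt))
```

The reference node with id 5 is reported as a second su relation to ik, which is the co-indexation described earlier.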
DT to ADT conversion
Abstract dependency trees (ADTs) can be constructed from normal Alpino dependency trees (DTs) by removing certain information. A DT can be transformed to an ADT in the following manner:
Particles that have the svp relation and that are in the frame of their head in the dependency relation are removed.
Positions of lexical items (that determine adjacency) are removed.
Frames are converted to a part of speech tag and a set of attribute/value pairs. Some attributes can be underspecified.
Optional punctuation is removed.
Before conversion to an ADT, the lexical nodes in a DT contain frames, which represent the subcategorization information of those lexical items. For instance, the DT for the sentence Ik heb de trein gemist contains the following frames (in word order):
pronoun(nwh,fir,sg,de,nom,def)
verb(hebben,sg1,aux_psp_hebben)
determiner(de)
noun(de,count,sg)
verb(hebben,psp,transitive)
While frames provide accurate information about lexical items, they are not an ideal format for the manipulation or underspecification of lexical items. For this reason, the frame is replaced by a part of speech tag and a set of attribute/value pairs in ADTs. In the ADT created for the sentence above, the following POS tags and pairs will be used:
pron [wh=nwh,per=fir,num=sg,gen=de,case=nom,def=def]
verb [sc=aux_psp_hebben,infl=sg1,tense=present]
det [infl=de]
noun [gen=de,num=sg]
verb [sc=transitive,infl=psp]
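The correspondence between the frames and the attribute/value pairs in this example can be sketched as a positional mapping. The Python code below is an illustration only: it covers just the four frame shapes shown above, assumes frames without nested arguments, and uses a hypothetical rule for tense (the inflection sg1 is treated as present tense because that is what the example shows). It is not Alpino's actual conversion procedure.

```python
import re

# Matches functor(arg1,arg2,...); nested terms are not supported (assumption).
FRAME_RE = re.compile(r"^(\w+)\((.*)\)$")

# Hypothetical: inflections marked as present tense. Only sg1 is attested
# in the example; a real conversion would need a complete rule.
PRESENT_INFL = {"sg1"}


def frame_to_adt(frame):
    """Return (pos, [(attribute, value), ...]) for a supported frame."""
    match = FRAME_RE.match(frame)
    if match is None:
        raise ValueError(f"unsupported frame: {frame}")
    functor, args = match.group(1), match.group(2).split(",")
    if functor == "pronoun" and len(args) == 6:
        names = ["wh", "per", "num", "gen", "case", "def"]
        return "pron", list(zip(names, args))
    if functor == "determiner" and len(args) == 1:
        return "det", [("infl", args[0])]
    if functor == "noun" and len(args) == 3:
        gen, _unused, num = args      # second argument is not carried over
        return "noun", [("gen", gen), ("num", num)]
    if functor == "verb" and len(args) == 3:
        _unused, infl, sc = args      # first argument is not carried over
        attrs = [("sc", sc), ("infl", infl)]
        if infl in PRESENT_INFL:
            attrs.append(("tense", "present"))
        return "verb", attrs
    raise ValueError(f"unsupported frame: {frame}")


for frame in ["pronoun(nwh,fir,sg,de,nom,def)", "verb(hebben,sg1,aux_psp_hebben)",
              "determiner(de)", "noun(de,count,sg)", "verb(hebben,psp,transitive)"]:
    pos, attrs = frame_to_adt(frame)
    print(pos, "[" + ",".join(f"{k}={v}" for k, v in attrs) + "]")
```

Run on the five frames of the example sentence, the sketch prints the same five POS tag and attribute lists as given above.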
In addition to part of speech tags and attribute/value pairs, lexical nodes contain the root and sense of the word. The sense contains the root and possibly other lexical material that is required according to the subcategorization frame. The sense specifies which reading of a word is used.
[lassyann] Gertjan van Noord, Ineke Schuurman, and Gosse Bouma, Lassy Syntactische Annotatie, http://www.let.rug.nl/vannoord/Lassy/sa-man_lassy.pdf