Introduction

This document describes the format of the dependency trees that are used as input for the Alpino chart generator. Dependency trees describe grammatical dependency relations between lexical items, and the constituents that dominate lexical items. Since dependency trees for generation can contain less information than dependency trees that are produced as a side effect of parsing, we call them abstract dependency trees (ADTs).

Various input formalisms have been proposed for sentence realization in the past, such as minimal recursion semantics (MRS). We have instead chosen an input format that describes the grammatical dependencies of the sentence to be generated. The rationale for this format is:

  • ADTs can be derived from (Alpino) dependency trees with ease.

  • Most other input formalisms would require rather extensive changes to the lexicon and grammar.

  • Prior work with Alpino dependency trees has shown that (abstract) dependency trees provide sufficient abstraction for tasks where a generation component is desired.

This document describes the format of ADTs, including the representation of ADTs as Prolog terms and XML documents. The procedure for deriving an ADT from a normal Alpino dependency tree is also described.

ADT format

Introduction

An abstract dependency tree is a directed acyclic graph that models the grammatical relations between lexical items and categories built from lexical items. The generator creates realizations for abstract dependency trees that describe at least one possible sentence. An abstract dependency tree can contain four node types:

  • Category (interior) nodes.

  • Lexical (leaf) nodes.

  • Indexed nodes.

  • Reference nodes.

Edges between nodes have a dependency label, where hd indicates the head of a grammatical relation. Figure 1 shows a dependency tree for the sentence "De boeken kosten ons een klein fortuin". For interior nodes, the dependency label is shown as the first element, and the category as the second.

dtree.png
Figure 1: Dependency tree for "De boeken kosten ons een klein fortuin"

Possible dependency labels are listed in the dependency labels table below. In the following sections, the node types will be described in more detail.

Table 1: Dependency labels
Dependency label  Description
APP               apposition
BODY              body (with complementizer)
CMP               complementizer
CNJ               conjunct
CRD               coordinator
DET               determiner
DLINK             discourse-link
DP                discourse-part
HD                head
HDF               closing element of a circumposition
LD                locative or directional complement
ME                measure phrase complement
MOD               modifier
MWP               part of a multi-word-unit
NUCL              nucleus discourse unit
OBCOMP            object of comparative
OBJ1              direct object
OBJ2              secondary object (indirect object, ...)
PC                prepositional complement
POBJ1             provisional direct object
PREDC             predicative complement
PREDM             predicative modifier
RHD               head of a relative clause
SAT               satellite discourse unit
SE                inherently reflexive complement
SU                subject
SUP               provisional subject
SVP               separable verb particle
TAG               discourse tag
VC                verbal complement
WHD               head of wh-question

Category nodes

Category nodes are interior tree nodes that describe a category. All possible categories are listed in the category labels table below.

Table 2: Category labels
Category label  Description
AP              adjective phrase
ADVP            adverb phrase
AHI             aan het-infinitive group
CONJ            conjunction
CP              phrase started by a subordinating conjunction
DETP            word group with a determiner as the head
DU              discourse unit
INF             bare infinitive group
NP              noun phrase
OTI             om te-infinitive group
PPART           passive/perfect participle
PP              prepositional phrase
PPRES           present participle group
REL             relative clause
SMAIN           declarative sentence (verb at the second position)
SSUB            subordinate clause (verb final)
SVAN            van-sentence
SV1             verb-initial sentence (yes/no question, imperatives)
TI              te-infinitive group
WHREL           relative clause with embedded antecedent
WHSUB           embedded question
WHQ             wh-question

Lexical nodes

Lexical nodes are leaf nodes that provide an abstracted representation for (surface) words. At a minimum, a lexical node should specify:

  • The word sense. The sense of a word is the root of a word, possibly with additional information to select for a specific reading.

  • An Alpino part of speech tag.

  • A set of attribute/value pairs.

Part of speech tags and possible attributes are discussed in more detail below.

Alpino part of speech tags

The POS tags table below lists all possible Alpino part of speech tags for lexical items.

Table 3: POS tags
Tag          Meaning
adj          Adjective
adv          Adverbial
comp         Complementizer
comparative  Comparative
det          Determiner
fixed        Fixed parts
name         Named entity
noun         Noun
num          Numeral
part         Particle
pron         Pronoun
prep         Preposition
punct        Punctuation
verb         Verb
vg           Conjunction

Attributes

In addition to the Alpino part of speech tag, some additional information is normally required to determine the semantics embodied by an ADT. In particular, the number for nouns, and the tense and inflection for verbs should be known.

  • The number of a noun is specified with the num attribute, which can have one of the following values: sg, pl, both, meas, bare_meas, pl_num, sg_num

  • The tense of the verb is specified with the tense attribute, which can have one of the following values: present, past. For infinitive and imperative inflections, this attribute is omitted.

  • The inflection of a verb is specified with the infl attribute, which can have one of the following values: imp(sg1), imp(modal_u), inf, inf(no_e), inf_ipp, modal_not_u, modal_inv, pl, psp, sg_hebt, sg_heeft, sg, sg1, sg3, subjunctive
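
The value sets listed above can be checked mechanically before an ADT is passed to the generator. The following Python sketch is only an illustration: the names CHECKS and check_attributes are hypothetical and not part of Alpino, and only the three attributes described above are covered.

```python
# Value sets copied from the lists above. Only nouns and verbs are checked,
# because other parts of speech reuse attribute names (e.g. infl=de on det).
CHECKS = {
    "noun": {
        "num": {"sg", "pl", "both", "meas", "bare_meas", "pl_num", "sg_num"},
    },
    "verb": {
        "tense": {"present", "past"},
        "infl": {
            "imp(sg1)", "imp(modal_u)", "inf", "inf(no_e)", "inf_ipp",
            "modal_not_u", "modal_inv", "pl", "psp", "sg_hebt", "sg_heeft",
            "sg", "sg1", "sg3", "subjunctive",
        },
    },
}


def check_attributes(pos, attributes):
    """Return a list of problems found in the attributes of one lexical node."""
    problems = []
    for name, allowed in CHECKS.get(pos, {}).items():
        value = attributes.get(name)
        if value is not None and value not in allowed:
            problems.append(f"{pos}: {name}={value} is not a listed value")
    # The tense attribute is omitted for infinitive and imperative inflections.
    infl = attributes.get("infl", "")
    if pos == "verb" and infl.startswith(("inf", "imp")) and "tense" in attributes:
        problems.append(f"verb: tense should be omitted for infl={infl}")
    return problems


print(check_attributes("verb", {"sc": "transitive", "infl": "psp"}))
print(check_attributes("noun", {"gen": "de", "num": "dual"}))
```

The first call reports no problems; the second reports the value dual, which is not one of the listed values for num.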

Indexed nodes

Indexed nodes wrap another node to give it a specific index that can be referred to by a reference node. For instance, in the sentence "Ik heb de trein gemist", both ik-heb and ik-gemist have a subject-head relation, while gemist is the head of the vc of heb. To allow for such representations, co-indexation is required.

Reference nodes

Reference nodes refer to the index number of an indexed node. Reference nodes have no additional content, since the content is provided by the indexed node.
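
The four node types and the way a reference is resolved can be sketched as a small data model. This is an illustrative Python sketch, not Alpino code: the class names (Category, Lexical, Indexed, Reference, Edge) and the function collect_indexed are assumptions made for this example, and the attribute sets are abbreviated.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Category:
    cat: str                                   # e.g. "smain", "np"
    daughters: List["Edge"] = field(default_factory=list)


@dataclass
class Lexical:
    sense: str
    pos: str
    attrs: Dict[str, str] = field(default_factory=dict)


@dataclass
class Indexed:
    index: int
    node: Union[Category, Lexical]             # the wrapped node


@dataclass
class Reference:
    index: int                                 # no content of its own


Node = Union[Category, Lexical, Indexed, Reference]


@dataclass
class Edge:
    rel: str                                   # dependency label, e.g. "su"
    node: Node


def collect_indexed(node, table=None):
    """Map each index number to the node wrapped by its indexed node."""
    table = {} if table is None else table
    if isinstance(node, Indexed):
        table[node.index] = node.node
        node = node.node
    if isinstance(node, Category):
        for edge in node.daughters:
            collect_indexed(edge.node, table)
    return table


# "Ik heb de trein gemist": ik is the subject of both heb and gemist.
ik = Lexical("ik", "pron", {"num": "sg"})
tree = Category("smain", [
    Edge("su", Indexed(1, ik)),
    Edge("hd", Lexical("heb", "verb", {"infl": "sg1", "tense": "present"})),
    Edge("vc", Category("ppart", [
        Edge("su", Reference(1)),
        Edge("obj1", Category("np", [
            Edge("det", Lexical("de", "det", {"infl": "de"})),
            Edge("hd", Lexical("trein", "noun", {"gen": "de", "num": "sg"})),
        ])),
        Edge("hd", Lexical("mis", "verb", {"infl": "psp"})),
    ])),
])

table = collect_indexed(tree)
reference = tree.daughters[2].node.daughters[0].node
print(table[reference.index].sense)
```

Looking up the index of the reference node in the table yields the lexical node for ik, which is how the subject of gemist is recovered.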

Further notes

To allow for an abstract description of verbs that have a separable particle, particles with the svp relation may be omitted from an ADT. If such particles are not included as nodes in the ADT, the generator will attempt to generate sentences both with a separated particle and with a connected particle.

Input formats

Prolog

An ADT can be stored as a Prolog term that consists of recursive tree terms. The basic format is:

tree(Relation,Daughters)

Here, Relation is a relation term, and Daughters is a list of daughter nodes, or the empty list for leaf nodes. A relation has the following form:

r(Type,Label)

Type indicates the type of relation, such as su, obj1, or mod. Label is one of the four node types described before: a category, a lexical item, an indexed node (a lexical or category node wrapped with an index), or a reference to an index. Categories use a p/1 term, for instance,

tree(r(vc,p(ppart)),[...])

is a node of the category ppart with a vc relation. Lexical nodes use an adt_lex/4 term as their label:

adt_lex(Root,Sense,PosTag,Attributes)

Here the root, sense, POS tag, and attributes of a lexical item are noted. The sense of a lexical item can contain additional information about a lexical item to select for a specific reading. For instance, for words with fixed parts, the fixed parts are often listed in the sense. E.g. the sense of rood aanlopen is rood-loop_aan. The root can be omitted by replacing it with an anonymous variable (_).

This is an example of a lexical node:

tree(r(hd,adt_lex(trein,trein,noun,[gen=de,num=sg])),[])

This describes a noun with the root trein, two attributes (gen=de and num=sg), and the head (hd) relation.

It is often necessary to refer to a lexical or category node twice in a tree. The first occurrence of the node is wrapped in an i/2 term:

i(Number,Label)

Subsequent uses of the same node can then be marked with an i/1 term. For instance, we can refer to

i(1,adt_lex(ik,ik,pron,[wh=nwh,per=fir,num=sg,gen=de,case=nom,def=def]))

with:

i(1)

Combining these terms, we can construct ADTs as Prolog terms for all grammatical sentences. E.g., the ADT term for the sentence "Ik heb de trein gemist" discussed above is:

tree(r(top,p(top)),[
 tree(r(--,p(smain)),[
  tree(r(su,i(1,adt_lex(ik,ik,pron,
    [wh=nwh,per=fir,num=sg,gen=de,case=nom,def=def]))),[]),
  tree(r(hd,adt_lex(heb,heb,verb,
    [sc=aux_psp_hebben,infl=sg1,tense=present])),[]),
  tree(r(vc,p(ppart)),[
   tree(r(su,i(1)),[]),
   tree(r(obj1,p(np)),[
    tree(r(det,adt_lex(de,de,det,[infl=de])),[]),
    tree(r(hd,adt_lex(trein,trein,noun,[gen=de,num=sg])),[])
   ]),
   tree(r(hd,adt_lex(mis,mis,verb,[sc=transitive,infl=psp])),[])
  ])
 ])
])
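
When ADTs are produced by a program outside Alpino, the term text can be assembled from a few string helpers. The Python sketch below is a minimal illustration under stated assumptions: the helper names (lex, cat, indexed, ref, tree) are hypothetical, and no atom quoting is done, so roots or senses that are not plain lowercase Prolog atoms would need extra handling.

```python
def lex(root, sense, pos, attrs):
    """Render an adt_lex/4 label; attrs is a list of (name, value) pairs."""
    pairs = ",".join(f"{name}={value}" for name, value in attrs)
    return f"adt_lex({root},{sense},{pos},[{pairs}])"


def cat(name):
    """Render a category label as a p/1 term."""
    return f"p({name})"


def indexed(number, label):
    """Wrap the first occurrence of a node in an i/2 term."""
    return f"i({number},{label})"


def ref(number):
    """Render a later reference to an indexed node as an i/1 term."""
    return f"i({number})"


def tree(rel, label, daughters=()):
    """Render tree(r(Type,Label),Daughters); leaves have no daughters."""
    return f"tree(r({rel},{label}),[{','.join(daughters)}])"


# The obj1 noun phrase "de trein" from the term above.
np = tree("obj1", cat("np"), [
    tree("det", lex("de", "de", "det", [("infl", "de")])),
    tree("hd", lex("trein", "trein", "noun", [("gen", "de"), ("num", "sg")])),
])

# The vc subtree, including the reference to the co-indexed subject.
vc = tree("vc", cat("ppart"), [
    tree("su", ref(1)),
    np,
    tree("hd", lex("mis", "mis", "verb", [("sc", "transitive"), ("infl", "psp")])),
])
print(vc)
```

The output is the vc subtree of the term above written on a single line; Prolog ignores the layout differences.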

XML

ADTs can also be represented as XML documents to allow for easy querying and manipulation outside the Alpino environment.

The dependency tree is represented in XML as a recursive structure of node elements. Each node has an identifier (id) and relation (rel). Category nodes have a cat attribute that specifies the category. For instance:

<node cat="ppart" id="4" rel="vc">
 ...
</node>

Every lexical node has the root, sense, and pos attributes, which describe the word root, sense, and part of speech tag respectively. E.g.:

<node gen="de" id="8" num="sg" pos="noun" rel="hd" root="trein"
  sense="trein" type="adt_lex"/>

Lexical nodes have other attributes, as described in the previous sections. For instance, the node above also lists the num and gen attributes for the number and gender of the noun.

A node can be co-indexed by adding the index attribute to a lexical node:

<node id="2" index="1" root="ik" sense="ik" [...] />

The referring node also contains an index attribute, but no other information specific to a lexical node. For example:

<node id="5" index="1" rel="su"/>
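
Co-indexation in the XML format can be resolved with any XML library. The sketch below uses Python's standard xml.etree.ElementTree module on an abbreviated fragment. The function names are hypothetical, and the test for a referring node (an index attribute without cat or root) is an assumption based on the description above.

```python
import xml.etree.ElementTree as ET

# Abbreviated fragment: most lexical attributes are left out.
FRAGMENT = """
<node cat="smain" id="1" rel="--">
  <node id="2" index="1" pos="pron" rel="su" root="ik" sense="ik"/>
  <node cat="ppart" id="4" rel="vc">
    <node id="5" index="1" rel="su"/>
  </node>
</node>
"""


def is_reference(node):
    """Assumption: a referring node has an index but no cat or root."""
    return ("index" in node.attrib
            and "cat" not in node.attrib
            and "root" not in node.attrib)


def resolve_references(root):
    """Map the id of each referring node to the id of its indexed node."""
    antecedents = {
        node.get("index"): node
        for node in root.iter("node")
        if "index" in node.attrib and not is_reference(node)
    }
    return {
        node.get("id"): antecedents[node.get("index")].get("id")
        for node in root.iter("node")
        if is_reference(node)
    }


print(resolve_references(ET.fromstring(FRAGMENT)))
```

For this fragment the mapping links node 5 (the subject of the participle) to node 2 (ik).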

As an example of a full ADT, we include an ADT for the same sentence used in the Prolog ADT:

<?xml version="1.0" encoding="ISO-8859-1"?>
<alpino_adt version="1.3">
  <node cat="top" id="0" rel="top">
    <node cat="smain" id="1" rel="--">
      <node case="nom" def="def" gen="de" id="2" index="1" num="sg"
        per="fir" pos="pron" rel="su" root="ik" sense="ik" wh="nwh"/>
      <node id="3" infl="sg1" pos="verb" rel="hd" root="heb"
        sc="aux_psp_hebben" sense="heb" tense="present"/>
      <node cat="ppart" id="4" rel="vc">
        <node id="5" index="1" rel="su"/>
        <node cat="np" id="6" rel="obj1">
          <node id="7" infl="de" pos="det" rel="det" root="de"
            sense="de"/>
          <node gen="de" id="8" num="sg" pos="noun" rel="hd"
            root="trein" sense="trein"/>
        </node>
        <node id="9" infl="psp" pos="verb" rel="hd" root="mis"
          sc="transitive" sense="mis"/>
      </node>
    </node>
  </node>
</alpino_adt>
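
As an illustration of querying an XML ADT outside the Alpino environment, the following Python sketch prints an outline of the document above and collects the verb roots. It uses the standard xml.etree.ElementTree module; the function names describe and outline are hypothetical, and the XML declaration is left out of the inline string for brevity.

```python
import xml.etree.ElementTree as ET

ADT_XML = """
<alpino_adt version="1.3">
  <node cat="top" id="0" rel="top">
    <node cat="smain" id="1" rel="--">
      <node case="nom" def="def" gen="de" id="2" index="1" num="sg"
        per="fir" pos="pron" rel="su" root="ik" sense="ik" wh="nwh"/>
      <node id="3" infl="sg1" pos="verb" rel="hd" root="heb"
        sc="aux_psp_hebben" sense="heb" tense="present"/>
      <node cat="ppart" id="4" rel="vc">
        <node id="5" index="1" rel="su"/>
        <node cat="np" id="6" rel="obj1">
          <node id="7" infl="de" pos="det" rel="det" root="de"
            sense="de"/>
          <node gen="de" id="8" num="sg" pos="noun" rel="hd"
            root="trein" sense="trein"/>
        </node>
        <node id="9" infl="psp" pos="verb" rel="hd" root="mis"
          sc="transitive" sense="mis"/>
      </node>
    </node>
  </node>
</alpino_adt>
"""


def describe(node):
    """One-line summary of a category, lexical, or reference node."""
    rel = node.get("rel")
    if "cat" in node.attrib:
        text = f"{rel}/{node.get('cat')}"
    elif "pos" in node.attrib:
        text = f"{rel}: {node.get('sense')} ({node.get('pos')})"
    else:
        text = f"{rel}: reference"
    if "index" in node.attrib:
        text += f" [index {node.get('index')}]"
    return text


def outline(node, depth=0):
    """Return the indented outline of a node and all nodes below it."""
    lines = ["  " * depth + describe(node)]
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines


top = ET.fromstring(ADT_XML).find("node")
lines = outline(top)
verbs = [n.get("root") for n in top.iter("node") if n.get("pos") == "verb"]
print("\n".join(lines))
print(verbs)
```

The outline has one line per node element, so co-indexed and referring nodes are easy to spot by their index annotation.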

DT to ADT conversion

Abstract dependency trees (ADTs) can be constructed from normal Alpino dependency trees, by removing certain information. A dependency tree can be transformed to an ADT in the following manner:

  • Particles that have the svp relation and that are in the frame of their head in the dependency relation are removed.

  • Positions of lexical items (that determine adjacency) are removed.

  • Frames are converted to a part of speech tag and a set of attribute/value pairs. Some attributes can be underspecified.

  • Optional punctuation is removed.

Before conversion to an ADT, the lexical nodes in a DT contain frames, which represent the subcategorization information of those lexical items. For instance, the DT for the sentence Ik heb de trein gemist contains the following frames (in word order):

pronoun(nwh,fir,sg,de,nom,def)
verb(hebben,sg1,aux_psp_hebben)
determiner(de)
noun(de,count,sg)
verb(hebben,psp,transitive)

While frames provide accurate information about lexical items, they are not an ideal format for the manipulation or underspecification of lexical items. For this reason the frame is replaced by a part of speech tag and a set of attribute/value pairs in ADTs. In the ADT created for the sentence above, the following POS tags and pairs will be used:

pron [wh=nwh,per=fir,num=sg,gen=de,case=nom,def=def]
verb [sc=aux_psp_hebben,infl=sg1,tense=present]
det  [infl=de]
noun [gen=de,num=sg]
verb [sc=transitive,infl=psp]
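
The correspondence between the five frames and the five attribute sets above can be sketched in Python. This is not the conversion used by Alpino: the argument positions are inferred from this single example only, the names frame_to_adt and TENSE_BY_INFL are hypothetical, the tense table contains just the one case that the example shows, and frames with nested arguments are not handled.

```python
import re

FRAME_RE = re.compile(r"^(\w+)\((.*)\)$")

# Assumption: only the case visible in the example (sg1 gives present tense).
TENSE_BY_INFL = {"sg1": "present"}


def parse_frame(frame):
    """Split a flat frame such as noun(de,count,sg) into functor and args."""
    match = FRAME_RE.match(frame)
    if match is None:
        raise ValueError(f"not a flat frame: {frame}")
    functor, args = match.groups()
    return functor, args.split(",")


def frame_to_adt(frame):
    """Return (pos tag, attribute/value pairs) for the frames shown above."""
    functor, args = parse_frame(frame)
    if functor == "pronoun":
        names = ["wh", "per", "num", "gen", "case", "def"]
        return "pron", list(zip(names, args))
    if functor == "determiner":
        return "det", [("infl", args[0])]
    if functor == "noun":
        gen, _subtype, num = args      # the second argument is dropped
        return "noun", [("gen", gen), ("num", num)]
    if functor == "verb":
        _aux, infl, sc = args          # the first argument is dropped
        attrs = [("sc", sc), ("infl", infl)]
        if infl in TENSE_BY_INFL:
            attrs.append(("tense", TENSE_BY_INFL[infl]))
        return "verb", attrs
    raise ValueError(f"frame type not covered by this sketch: {functor}")


for frame in ["pronoun(nwh,fir,sg,de,nom,def)",
              "verb(hebben,sg1,aux_psp_hebben)",
              "determiner(de)",
              "noun(de,count,sg)",
              "verb(hebben,psp,transitive)"]:
    pos, attrs = frame_to_adt(frame)
    print(pos, "[" + ",".join(f"{k}={v}" for k, v in attrs) + "]")
```

A positional mapping like this makes it visible which frame arguments survive in the ADT and which are abstracted away.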

In addition to part of speech tags and attribute/value pairs, lexical nodes contain the root and sense of the word. The sense contains the root and possibly other lexical material that is required according to the subcategorization frame. The sense specifies which reading of a word is used.
