Querying the treebank

The results of the annotation process are stored in XML. XML is widely used for storing and distributing language resources, and a range of standards and software tools support the creation, modification, and search of XML documents. Both the Alpino parser and the Thistle editor output dependency trees encoded in XML.

As the treebank grows in size, it becomes increasingly interesting to explore it interactively. Queries to the treebank may be motivated by linguistic interest (e.g. which verbs take inherently reflexive objects?) but can also be a tool for quality control (e.g. find all PPs where the head is not a preposition).

The XPath standard⁴ defines a powerful query language for XML documents, which can be used to formulate queries over the treebank. XPath supports conjunction, disjunction, negation, and comparison of numeric values, and seems to have sufficient expressive power to support a range of linguistically relevant queries. Various tools support XPath and can be used to implement a query tool. Currently, we are using a C-based tool implemented on top of the LibXML library.⁵

The XML encoding of dependency trees used by Thistle (and, for compatibility, also by the parser) is not very compact, and contains various layers of structure that are not linguistically relevant. Searching such documents for linguistically interesting patterns is difficult, as queries tend to become verbose and require intimate knowledge of an XML structure that is mostly linguistically irrelevant. We therefore transform the original XML documents into a different XML format, which is much more compact (the average file size is reduced by 90%) and which provides maximal support for linguistic queries.

As XML documents are basically trees, consisting of elements which contain other elements, dependency trees can simply be represented as XML documents, where every node in the tree is represented by an element node. Properties are represented by attributes. Terminal nodes (leaves) are nodes which contain no daughter elements. The XML representation of (the top of) the dependency tree shown in figure 2 is given in figure 7.

**Figure 7:** XML encoding of dependency trees.
`<node rel="top" cat="s... ...d="6" hd="3"> .... </node> </node>`
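The tree-as-document idea can be illustrated with a minimal sketch using Python's standard `xml.etree.ElementTree` module. The toy tree below is hypothetical (it is not the tree of figure 2), and its attribute values are chosen for illustration only; it shows how leaves can be recognised as elements without daughter elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy tree for "Jan leest een boek"; values are illustrative only.
TREE = """
<node rel="top" cat="smain">
  <node rel="su" pos="noun" root="Jan"/>
  <node rel="hd" pos="verb" root="lees"/>
  <node rel="obj1" cat="np">
    <node rel="det" pos="det" root="een"/>
    <node rel="hd" pos="noun" root="boek"/>
  </node>
</node>
"""

root = ET.fromstring(TREE)

# Terminal nodes (leaves) are elements that contain no daughter elements.
leaves = [n for n in root.iter("node") if len(n) == 0]
print([n.get("root") for n in leaves])

# Phrasal nodes dominate other nodes and carry a cat attribute.
phrases = [n.get("cat") for n in root.iter("node") if len(n) > 0]
print(phrases)
```

Because each tree node is one element, walking the XML document in document order is the same as walking the dependency tree top-down.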

The transformation of dependency trees into the format given in figure 7 is not only used to eliminate linguistically irrelevant structure, but also to make explicit information which was only implicitly stored in the original XML encoding. The indices on root forms that were used to indicate their string position are removed and the corresponding information is added in the attributes start and end. Apart from the root form, the inflected form of the word as it appears in the annotated sentence is also added. Words are annotated with part of speech (pos) information, whereas phrases are annotated with category (cat) information. A drawback of this distinction is that it becomes impossible to find all NPs with a single (non-disjunctive) query, as phrasal NPs are cat="np" and lexical NPs are pos="noun". To overcome this problem, category information is added to non-projecting (i.e. non-head) leaves in the tree as well. Finally, the attribute hd encodes the string position of the lexical head of every phrase. The latter information is useful for queries involving discontinuous constituents. In those cases, the start and end positions may not be very informative, and it can be more interesting to be able to locate the position of the lexical head.
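The effect of adding cat to non-head leaves can be sketched as follows. The fragment is hypothetical: the attribute names `root` and `word` for the root form and the inflected form are assumptions, as is the convention that start and end count word boundaries; the category value on the determiner is illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding of "Jan leest een boek".
# Assumed: `root`/`word` attribute names; start/end count word boundaries.
TREE = """
<node rel="top" cat="smain" start="0" end="4">
  <node rel="su" pos="noun" cat="np" root="Jan" word="Jan" start="0" end="1"/>
  <node rel="hd" pos="verb" root="lees" word="leest" start="1" end="2"/>
  <node rel="obj1" cat="np" start="2" end="4">
    <node rel="det" pos="det" cat="detp" root="een" word="een" start="2" end="3"/>
    <node rel="hd" pos="noun" root="boek" word="boek" start="3" end="4"/>
  </node>
</node>
"""

root = ET.fromstring(TREE)

# One non-disjunctive query finds lexical and phrasal NPs alike,
# because the non-head leaf "Jan" carries cat="np" next to pos="noun".
nps = root.findall(".//node[@cat='np']")
spans = [(int(n.get("start")), int(n.get("end"))) for n in nps]
print(spans)
```

The head noun boek carries no cat attribute of its own, so the phrase it projects is matched once rather than twice.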

We now present a number of examples which illustrate how XPath can be used to formulate various types of linguistic queries. Examples involving the use of the hd attribute can be found in [2].

Objects of prepositions are usually of category NP. However, other categories are not completely excluded. The query in (2) finds the objects within PPs.

The double slash means we are looking for a matching element anywhere in the document (i.e. it is a descendant of the top element of the document), whereas the single slash means that the element following it must be an immediate daughter of the element preceding it. The @-sign selects attributes. Thus, we are looking for nodes with dependency relation obj1, immediately dominated by a node with category pp. In the current state of the dependency treebank, about 97% (5,892 of 6,062) of the matching nodes are regular NPs. The remainder consists of relative clauses (voor wie het werk goed kende, for who knew the work well), PPs (tot aan de waterkant, till on the waterfront), adverbial pronouns (see below), and phrasal complements (zonder dat het een cent kost, without that it a penny costs).
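An XPath expression consistent with this description is `//node[@cat="pp"]/node[@rel="obj1"]`; it is reconstructed here from the explanation and should be read as an assumption rather than a quotation. The sketch below runs the equivalent path with Python's standard `xml.etree.ElementTree` (not the LibXML-based tool mentioned earlier) on a hypothetical fragment and tallies the categories of the matching objects.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment: one PP with an NP object, one PP with a PP object.
DOC = """
<node rel="top" cat="smain">
  <node rel="mod" cat="pp">
    <node rel="hd" pos="prep" root="in"/>
    <node rel="obj1" pos="noun" cat="np" root="Delft"/>
  </node>
  <node rel="mod" cat="pp">
    <node rel="hd" pos="prep" root="tot"/>
    <node rel="obj1" cat="pp">
      <node rel="hd" pos="prep" root="aan"/>
      <node rel="obj1" cat="np">
        <node rel="det" pos="det" root="de"/>
        <node rel="hd" pos="noun" root="waterkant"/>
      </node>
    </node>
  </node>
</node>
"""

root = ET.fromstring(DOC)

# ".//" searches all descendants of the top element; "/" steps to a daughter.
matches = root.findall(".//node[@cat='pp']/node[@rel='obj1']")

# Tally the categories of the objects, as in the NP vs. non-NP comparison.
counts = Counter(m.get("cat") for m in matches)
print(counts)
```

Note that the embedded PP in tot aan de waterkant is matched both as an object (of tot) and as a PP with an object of its own.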

The CGN annotation guidelines distinguish between three possible dependency relations for PPs: complement, modifier, or 'locative or directional complement' (a more or less obligatory dependent containing a semantically meaningful preposition which is not fixed). Assigning the correct dependency relation is difficult, both for the computational parser and for human annotators. The following query finds the heads of PPs introducing locative dependents:

Here, the double dots allow us to refer to attributes of the dominating XML element. Thus, we are looking for a node with dependency relation hd, which is dominated by a PP with an ld dependency relation. Here, we exploit the fact that the mother node in the dependency tree corresponds to the immediately dominating element in the XML encoding as well.
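A query of this kind could be written in XPath as `//node[@rel="hd" and ../@rel="ld"]`; this form is an assumption reconstructed from the description. Python's standard `xml.etree.ElementTree` does not evaluate `..` inside a predicate, so the sketch below expresses the same mother-daughter constraint by stepping from the ld-labelled PP down to its head. The fragment and its values are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment: one locative (ld) PP and one modifier PP.
DOC = """
<node rel="top" cat="smain">
  <node rel="su" pos="noun" cat="np" root="Jan"/>
  <node rel="hd" pos="verb" root="ga"/>
  <node rel="ld" cat="pp">
    <node rel="hd" pos="prep" root="naar"/>
    <node rel="obj1" pos="noun" cat="np" root="huis"/>
  </node>
  <node rel="mod" cat="pp">
    <node rel="hd" pos="prep" root="met"/>
    <node rel="obj1" cat="np">
      <node rel="det" pos="det" root="de"/>
      <node rel="hd" pos="noun" root="fiets"/>
    </node>
  </node>
</node>
"""

root = ET.fromstring(DOC)

# Heads whose mother is a PP with dependency relation ld.
ld_heads = root.findall(".//node[@cat='pp'][@rel='ld']/node[@rel='hd']")
# Heads of all PPs, for comparison with a general frequency list.
all_heads = root.findall(".//node[@cat='pp']/node[@rel='hd']")

ld_counts = Counter(h.get("root") for h in ld_heads)
all_counts = Counter(h.get("root") for h in all_heads)

for prep, total in all_counts.items():
    share = ld_counts[prep] / total
    print(f"{prep}: {ld_counts[prep]} of {total} locative ({share:.0%})")
```

Counting both lists per preposition gives the kind of per-preposition share of locative uses that is compared against a general frequency list.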

Comparing the list of matching prepositions with a general frequency list reveals that about 6% of the PPs are locative dependents. The preposition naar (to, towards) typically introduces locative dependents (74 out of 151 occurrences, roughly 50%), whereas the most frequent preposition (i.e. van, of) introduces a locative in only 1% (15 out of 1496) of the cases.

In PPs containing an impersonal pronoun like er (there), the pronoun always precedes the preposition. The two are usually written as a single word (eraan, there-on). A further peculiarity is that pronoun and preposition need not be adjacent, as in In Delft wordt er nog over vergaderd (In Delft, one still talks about it). The following query finds such discontinuous phrases:

Here, the '<'-operator compares the value of the end position of the object of the PP with the start position of the head of the PP. If the first is strictly smaller than the second, the PP is discontinuous. The corpus contains 133 discontinuous PPs containing an impersonal pronoun vs. 322 continuous pronoun-preposition combinations realized as a single word, and 17 cases where these are realized as two words. This shows that in roughly 28% of the cases (133 of 472), the preposition + impersonal pronoun construction is discontinuous.
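The comparison can be sketched in Python as follows. In XPath the test might read `//node[@cat="pp" and node[@rel="obj1"]/@end < node[@rel="hd"]/@start]`, but that form is an assumption; the standard `xml.etree.ElementTree` module cannot evaluate numeric comparisons, so the sketch performs the test in a helper function (`is_discontinuous`, a hypothetical name). The fragment is a deliberately flattened, hypothetical encoding of the example sentence, with start and end assumed to count word boundaries.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified encoding of "In Delft wordt er nog over vergaderd".
# Assumed: start/end count word boundaries, so adjacent words share a boundary.
DOC = """
<node rel="top" cat="smain" start="0" end="7">
  <node rel="mod" cat="pp" start="0" end="2">
    <node rel="hd" pos="prep" root="in" start="0" end="1"/>
    <node rel="obj1" pos="noun" cat="np" root="Delft" start="1" end="2"/>
  </node>
  <node rel="hd" pos="verb" root="word" start="2" end="3"/>
  <node rel="mod" pos="adv" root="nog" start="4" end="5"/>
  <node rel="pc" cat="pp" start="3" end="6">
    <node rel="obj1" pos="pron" root="er" start="3" end="4"/>
    <node rel="hd" pos="prep" root="over" start="5" end="6"/>
  </node>
  <node rel="vc" pos="verb" root="vergader" start="6" end="7"/>
</node>
"""


def is_discontinuous(pp):
    """True if the object ends strictly before the head of the PP starts."""
    obj = pp.find("node[@rel='obj1']")
    hd = pp.find("node[@rel='hd']")
    if obj is None or hd is None:
        return False
    return int(obj.get("end")) < int(hd.get("start"))


root = ET.fromstring(DOC)
hits = [
    (pp.find("node[@rel='obj1']").get("root"), pp.find("node[@rel='hd']").get("root"))
    for pp in root.iter("node")
    if pp.get("cat") == "pp" and is_discontinuous(pp)
]
print(hits)
```

The ordinary PP in Delft is not flagged, since its object follows the head; under the assumed boundary convention, a two-word but adjacent combination would have the object's end equal to the head's start and would also fail the strict comparison.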