
Lecture Notes

 

Gosse Bouma

02-06-2010

These exercises are part of a corpus linguistics course taught at University of Groningen, The Netherlands.

If you are a course participant, please put the solutions to these exercises (both programs and program output) in the digital dropbox of Nestor by Friday 11 June 2010.

This week there is no Perl programming exercise. Instead, the exercise is about working with syntactically annotated corpora (treebanks), as described in the lecture notes. The treebanks and the software for searching them are installed on the student Linux network.

Exercise 6.1 is worth 30 points, 6.2 is worth 40 points, and 6.3 is worth 30 points.

Preparations

Setting the Linux environment variables

For browsing and searching the treebanks, you will need two programs, dtview and dtsearch, that require a bit of installation.

Append the following lines to your .bashrc file (located in your home directory):

# setup PATH and ALPINO_HOME for new 32/64 bit systems
arch=$(uname -m)
if [ "$arch" = x86_64 ]; then
    PREFIX=/storage/aps/64
else
    PREFIX=/storage/aps/32
fi
export PREFIX
PATH=$PREFIX/bin:$PATH
export ALPINO_HOME=$PREFIX/src/Alpino
PATH=$PATH:$ALPINO_HOME/bin

Now type the command

source ~/.bashrc

You should now be able to execute the commands dtview and dtsearch.
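If the commands are not found, it can help to check whether the shell can actually see them on the PATH. The following is a minimal sketch; the helper name check_tools is made up for this illustration and is not part of the course software.

```shell
# Report whether each named command can be found on the current PATH.
# check_tools is a hypothetical helper, not part of Alpino or the course setup.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Exit status 0 if everything was found, 1 otherwise.
    return "$missing"
}
```

After sourcing .bashrc, running check_tools dtview dtsearch should report both as found; if not, the PATH lines above were probably not picked up by the current shell.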

If you want these settings to be in effect each time you log in, also copy ~gosse/.profile to your own home directory:

# ~/.bash_profile: executed by bash(1) for login shells.
# see /usr/share/doc/bash/examples/startup-files for examples.
# the files are located in the bash-doc package.

# the default umask is set in /etc/login.defs
#umask 022

# include .bashrc if it exists
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

Treebank

The treebank we will be working with is the cdb treebank, the newspaper section of the so-called Eindhoven Corpus, dating from the early 1970s.

It is located in ~gosse/Alpino/Treebank/cdb. It is probably convenient to add a symbolic link to this directory in the directory where you will be working:

ln -s ~gosse/Alpino/Treebank/cdb .

You should be able to view trees, using commands such as

dtview cdb/1.xml
dtview cdb/11*.xml

The first is for showing a single file, the second lets you browse a set of files (using the next/previous buttons). (Of course, you can also use less to look at the xml code of a file.)

With dtsearch you can search for files containing a specific construction, for instance

dtsearch -s '//node[@root="werk" and @pos="noun"]' cdb/*.xml

This searches all (xml) files in cdb for sentences that contain a form of the noun _werk_, and returns each matching sentence, with brackets around the matching part of the sentence.

You can also produce counts for your search results using the -l flag:

dtsearch -l '//node[@root="werk" and @pos="noun"]' cdb/*.xml

(Alternatively, you can pipe the result of a dtsearch query to wc.)
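The wc route can be sketched as follows. This assumes that dtsearch -s prints one line per matching sentence, which you should check against the actual output before trusting the number. The helper name count_hits is made up for this illustration.

```shell
# Count the non-empty lines on standard input.
# Assumption: each hit is printed on its own line.
count_hits() {
    # grep -c prints the number of matching lines; it exits with status 1
    # when the count is 0, so "|| true" keeps the function from failing then.
    grep -c . || true
}
```

A possible use would be: dtsearch -s '//node[@root="werk" and @pos="noun"]' cdb/*.xml | count_hits. Unlike a plain wc -l, this ignores blank lines in the output.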

Exercises

Exercise 6.1

  1. How many sentences contain the word "werk"?

  2. How many sentences contain the word "werk" as a noun?

  3. How many sentences contain the word "werk" as a verb?

  4. How many sentences contain some form of the noun "werk"?

  5. How many sentences contain some form of the verb "werk"?

  6. Explain the difference in outcome between question 3 and question 5.

For questions 1-5, report the query that you used to answer the question, as well as the answer.

Exercise 6.2

Indirect objects (rel="obj2") in Dutch can be realized as a prepositional phrase (cat="pp"), as a noun phrase (cat="np"), or as just a noun (pos="noun").

  1. How many sentences contain an indirect object?

  2. How many sentences contain an indirect object that is a prepositional phrase?

  3. How many sentences contain an indirect object consisting of a noun phrase?

  4. How many sentences contain an indirect object consisting of a single noun? Which nouns occur most frequently as indirect objects?

  5. Produce a ranked list of the nouns you found in the previous question. Hint: use dtsearch with the -c flag to get only the matching part of a sentence, then use your Linux skills (as covered in the text manipulation class, or see the lecture notes) to produce a frequency list. Alternatively, write a Perl script that processes the output of dtsearch.

  6. How many sentences contain an indirect object consisting of just a preposition and a noun (e.g. _aan Mao, aan hem_)? Which nouns occur most frequently in this construction?

  7. Is there a difference between the results you got for questions 4 and 5 and those for question 6? If so, what is it?

For questions 1-3, report the query that you used to answer the question, as well as the answer. For questions 4, 5, and 6, give the commands you used, as well as the result you got.
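The "Linux skills" route for a ranked frequency list usually comes down to the standard sort and uniq tools. The following is a minimal sketch, assuming the input has one item (for example one noun) per line; the helper name freq_list is made up for this illustration.

```shell
# Turn a stream of items (one per line) into a ranked frequency list.
freq_list() {
    # sort groups identical lines together, uniq -c prefixes each distinct
    # line with its count, and sort -rn puts the highest counts first.
    sort | uniq -c | sort -rn
}
```

You would pipe the output of your own dtsearch -c query into freq_list, and add head to keep only the top of the list. Whether dtsearch -c prints exactly one match per line, and whether it adds anything else (such as file names) that has to be stripped first, is an assumption to verify on the real output.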

Exercise 6.3

Prepositional phrases (PPs) can be modifiers (i.e. rel="mod") of verbs as well as of nouns.

  1. Find PPs that modify a verb, i.e. that are modifiers and sisters of a verb

  2. Find PPs that modify a noun

  3. Produce a ranked frequency list of the prepositions that head PPs that modify a verb

  4. Produce a ranked frequency list of the prepositions that head PPs that modify a noun

  5. If you look at the 10 most frequent results for questions 3 and 4, are there prepositions that occur (relatively) more frequently with nouns than with verbs? And vice versa?

For questions 1 and 2, report the query you used. For questions 3 and 4, give the query and the other commands you used, as well as the 10 most frequent results.
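Raw counts from two lists of different sizes are hard to compare directly, so for the last question it may help to convert counts into percentages of each list's total. The following is a minimal sketch, assuming both frequency lists were saved to files in the "count item" format that uniq -c produces. The helper name compare_freq and the file names in the usage note are made up for this illustration.

```shell
# Print each item with its share (in percent) of the total count in two
# frequency lists. Both files must be in "count item" format (as from uniq -c).
compare_freq() {
    awk '
        # First file: remember the count per item and the overall total.
        NR == FNR { a[$2] = $1; ta += $1; seen[$2] = 1; next }
        # Second file: the same, in separate arrays.
                  { b[$2] = $1; tb += $1; seen[$2] = 1 }
        END {
            for (w in seen)
                printf "%s\t%.1f\t%.1f\n", w,
                       (ta ? 100 * a[w] / ta : 0),
                       (tb ? 100 * b[w] / tb : 0)
        }
    ' "$1" "$2" | sort
}
```

For example, compare_freq verbs.txt nouns.txt prints one line per preposition with its percentage in the first list and in the second list, separated by tabs. An item that is missing from one of the lists gets 0.0 in that column.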