Lecture Notes
Gosse Bouma
02-06-2010
These exercises are part of a corpus linguistics course taught at
the University of Groningen, The Netherlands.
If you are a course participant, please put the solutions to these
exercises (both programs and program output) in the digital dropbox of
Nestor by Friday 11 June 2010.
This week there is no Perl programming exercise. Instead, there is an
exercise about working with syntactically annotated corpora (treebanks),
as described in the lecture notes.
The treebanks and the software for searching them are
installed on the student Linux network.
Exercise 6.1 is worth 30 points, 6.2 is worth 40 points, and 6.3 is worth 30 points.
Setting the Linux environment variables
For browsing and searching the treebanks, you will need two programs,
dtview and dtsearch, that require a bit of installation.
Append the following lines to your .bashrc file (located in your home directory):
# setup PATH and ALPINO_HOME for new 32/64 bit systems
arch=$(uname -m)
if [ "$arch" = x86_64 ]; then
    PREFIX=/storage/aps/64
else
    PREFIX=/storage/aps/32
fi
export PREFIX
PATH=$PREFIX/bin:$PATH
export ALPINO_HOME=$PREFIX/src/Alpino
PATH=$PATH:$ALPINO_HOME/bin
Now reload the file in your current shell, for instance with source ~/.bashrc.
You should now be able to execute the commands dtview and dtsearch.
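A quick way to confirm that the PATH settings took effect is to ask the shell where it finds each program. The following is only a minimal sketch: check_tools is a hypothetical helper name, not something provided on the student network.

```shell
# Hypothetical helper: report whether each named command is on PATH.
# Returns 0 if all commands were found, 1 otherwise.
check_tools() {
    status=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found at $(command -v "$tool")"
        else
            echo "$tool: not found on PATH"
            status=1
        fi
    done
    return $status
}

# Usage (after reloading .bashrc):
# check_tools dtview dtsearch
```

If one of the two programs is reported as not found, check that the lines were appended to .bashrc exactly as given and that the file was reloaded.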
If you want these settings to be in effect each time you log in,
also copy ~gosse/.profile to your own directory:
# ~/.bash_profile: executed by bash(1) for login shells.
# see /usr/share/doc/bash/examples/startup-files for examples.
# the files are located in the bash-doc package.
# the default umask is set in /etc/login.defs
#umask 022
# include .bashrc if it exists
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi
The treebank we will be working with is the cdb treebank, the newspaper
section of the so-called Eindhoven Corpus, dating from the early 1970s.
It is located in ~gosse/Alpino/Treebank/cdb. It is probably
convenient to add a symbolic link to this directory in the directory
where you will be working:
ln -s ~gosse/Alpino/Treebank/cdb .
You should be able to view trees using the commands
dtview cdb/1.xml
dtview cdb/11*.xml
The first is for showing a single file, the second lets you
browse a set of files (using the next/previous buttons).
(Of course, you can also use less to look at the XML code
of a file.)
With dtsearch you can search for files containing a specific
construction, for instance
dtsearch -s '//node[@root="werk" and @pos="noun"]' cdb/*.xml
This searches all XML files in cdb for sentences that contain
a form of the noun _werk_, and returns each matching
sentence, with brackets around the matching part of the sentence.
You can also produce counts for your search results using the -l flag:
dtsearch -l '//node[@root="werk" and @pos="noun"]' cdb/*.xml
(Alternatively, you can pipe the result of a dtsearch query to wc.)
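The wc alternative can be sketched as follows. This assumes that dtsearch -s prints exactly one line per match, which is worth checking by eye on a small query first. count_hits is a hypothetical helper name, not a dtsearch feature.

```shell
# Hypothetical helper: count the lines arriving on standard input,
# printing a bare number (wc -l may pad its output with spaces).
count_hits() {
    awk 'END { print NR }'
}

# Usage, with the example query from above:
# dtsearch -s '//node[@root="werk" and @pos="noun"]' cdb/*.xml | count_hits
```

Note that this counts output lines, so it only equals the number of sentences if no sentence is printed more than once.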
Exercise 6.1
1. How many sentences contain the word "werk"?
2. How many sentences contain the word "werk" as a noun?
3. How many sentences contain the word "werk" as a verb?
4. How many sentences contain some form of the noun "werk"?
5. How many sentences contain some form of the verb "werk"?
6. Explain the difference in outcome between question 3 and
   question 5.
For questions 1-5, report the query that you used to answer the question, as well
as the answer.
Exercise 6.2
Indirect objects (rel="obj2") in Dutch can be realized as a prepositional phrase (cat="pp"),
as a noun phrase (cat="np"), or as just a noun (pos="noun").
1. How many sentences contain an indirect object?
2. How many sentences contain an indirect object that is a prepositional phrase?
3. How many sentences contain an indirect object consisting of a noun phrase?
4. How many sentences contain an indirect object consisting of a single noun?
   Which nouns occur most frequently as indirect objects?
5. Produce a ranked list of the nouns you found in the previous question.
   Hint: use dtsearch with the -c flag to get only the matching part of a sentence.
   Hint: use your Linux skills (as covered in the text manipulation class, or see the
   lecture notes) to produce a frequency list. Alternatively, write a Perl script that
   processes the output of dtsearch.
6. How many sentences contain an indirect object consisting of just a preposition and a
   noun (e.g. _aan Mao_, _aan hem_)? Which nouns occur most frequently in this construction?
7. Is there a difference between the results you got for questions 4 and 5 and those for
   question 6? If so, what is it?
For questions 1-3, report the query that you used to answer the question, as well
as the answer. For questions 4, 5, and 6, give the commands you used, as well as the
results you got.
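The frequency-list hint can be sketched with standard Unix tools: sort the items so that identical ones are adjacent, count each group, then rank by count. This is a minimal sketch that assumes dtsearch -c prints one matching part per line. freq_list is a hypothetical helper name, and QUERY stands for a query you write yourself.

```shell
# Hypothetical helper: turn a list of items (one per line on standard
# input) into a frequency list, most frequent item first.
freq_list() {
    grep -v '^[[:space:]]*$' |  # drop empty lines
    sort |                      # group identical items together
    uniq -c |                   # prefix each distinct item with its count
    sort -rn                    # rank by count, highest first
}

# Usage:
# dtsearch -c "$QUERY" cdb/*.xml | freq_list | head
```

The helper keeps upper and lower case distinct, so a capitalized and a lowercase spelling of the same noun are counted separately.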
Exercise 6.3
Prepositional phrases (PPs) can be modifiers (i.e. rel="mod") of verbs as well as of nouns.
1. Find PPs that modify a verb, i.e. that are modifiers and sisters of a verb.
2. Find PPs that modify a noun.
3. Produce a ranked frequency list of the prepositions that head PPs that modify a verb.
4. Produce a ranked frequency list of the prepositions that head PPs that modify a noun.
5. If you look at the 10 most frequent results for questions 3 and 4, are there prepositions
   that occur (relatively) more frequently with nouns than with verbs? And vice versa?
For questions 1 and 2, report the query you used. For questions 3 and 4, give the query and
the other commands you used, as well as the 10 most frequent results.
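Since the last question asks about relative frequency, raw counts from the two lists are not directly comparable when one list is much longer than the other. One possible approach is to add a percentage column to each list. The sketch below assumes one preposition per input line. rel_freq is a hypothetical helper name, and QUERY stands for your own query.

```shell
# Hypothetical helper: read one item per line and print, for each
# distinct item, its count, its percentage of all items, and the item
# itself (tab-separated), with the highest count first.
rel_freq() {
    awk '{ n[$0]++; total++ }
         END {
             for (w in n)
                 printf "%d\t%.1f\t%s\n", n[w], 100 * n[w] / total, w
         }' |
    sort -k1,1nr
}

# Usage:
# dtsearch -c "$QUERY" cdb/*.xml | rel_freq | head
```

Comparing the percentage columns of the verb list and the noun list then shows which prepositions are over-represented in one of the two.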