Corpus Analysis
Details for Autumn 2012; more later
- credits: 5. This course is part of the Research Master
in Linguistics.
- teacher: Gertjan van Noord; G.J.M.van.Noord@rug.nl
- the first part of the course is an introduction to Unix
utilities, and some more. Week 1-7.
Location:
Monday 16:00-17:45: Harmonie 13-0344 (tutorial);
Thursday 9:00-11:00: Harmonie 12-0119A (practical)
- the second part of the course is more theoretical and includes
student projects. Week 10-16.
- First Meeting: Monday September 3, 16:00, Harmonie 13-0344.
ROOM CHANGE:
- Monday Meetings: we are back in Harmonie 13-0344!
- Literature:
- I've started updating the sheets for this year. They are
available as one big PowerPoint file.
- FAQ: How do I access my files on the Linux computer if I am not sitting
at a Linux computer? Answer
Description
The lecture is a first introduction to corpus linguistics. This part
of the course introduces standard techniques to use corpora for
linguistic research. Various corpora are discussed, as well as corpus
annotations. Practical methods to search and count linguistic patterns
in corpora are part of the course. You will also be introduced to the
UNIX environment and the UNIX tools for text analysis. A number of
simple statistical measures are discussed which are often used in
corpus linguistic research. For this course you have to:
- hand in the assignment from the practicals and do a test (UNIX
Tools) at the end of the *fourth* practical.
- hand in the 2nd assignment (Corpus Linguistic 'experiment') at
the very end of the course,
- give an oral presentation about this 2nd assignment.
The grade will be calculated on the basis of these three points.
Assignments should be sent to Gertjan van Noord. Send them as plain
text, and include your name and student number.
overview of assignments completed
- week 1: Annerose, Steven, Amber, Charlotte, Stefan, Edit, Anne, Kim Heerdink, Marleen, Ruth, Noortje, Kim Heiligenstein, Yorqin
- week 2: Annerose, Steven, Noortje, Kim Heerdink, Anne, Amber, Kim Heiligenstein, Charlotte, Ruth, Marleen, Edit, Stefan, Yorqin
- week 3+4: Anne, Annerose, Steven, Edit, Amber, Charlotte, Noortje, Marleen, Ruth, Kim Heerdink, Stefan, Kim Heiligenstein, Yorqin
- TEST: Kim Heerdink, Amber, Edit, Kim Heiligenstein, Charlotte, Annerose, Noortje, Yorqin, Ann, Steven, Marleen, Ruth, Stefan
- week 5: Marleen, Anne, Annerose, Steven, Charlotte, Kim Heiligenstein, Amber, Edit, Kim Heerdink, Ruth, Stefan, Noortje
- research paper outline: Noortje, Annerose, Marleen, Amber, Ruth, Charlotte, Edit, Steven, Kim Heerdink, Stefan, Anne, Yorqin
- research paper: Annerose, Kim Heiligenstein, Edit, Amber, Charlotte, Steven, Noortje, Anne
- presentation: all
dates for the presentation:
- Monday, December 3, 15:00, Harmonie 13-0344: Ruth, Amber
- Thursday, December 6, 09:00, Turftorenstraat T14: Stefan, Noortje, Kim
- Monday, December 10, 15:00, Harmonie 13-0344: Annerose, Steven, Kim Heiligenstein, ?Marleen
- Thursday, December 13, 09:00, Turftorenstraat T14: Anne, Charlotte, Edit, Yorqin
- reserve date: Monday, December 17, 15:00, Harmonie 13-0344
Planning
week 1: Corpus Linguistics and UNIX tools
Exercises
week 2: Text Analysis
Exercises
week 3: More text analysis
Exercises
Follow up: Google Ngram counts:
week 4: Syntactically Annotated Corpora
DACT is available here!
Sheets
Exercises: finish exercises of week 3, and start with
the following exercises.
Exercises
My solution to the last question.
week 5: Review + the Test
Computing in Linux.
PS: we normally do not worry about the base of the logarithm, as long as you use
the same base consistently. Simply use the log that is easiest. If you want, you can always
convert using the change-of-base formula:
Changing the base: log_b(x) = log_k(x) / log_k(b)
How to compute? Anything goes...
- pencil and paper...
- your high school calculator
- xcalc
- bc (natural log)
% bc -l
l((29/1129388)/((452/1129388)*(2356/1129388)))
3.42607955643651944743
- awk (natural log)
% awk 'END {print log((29/1129388)/((452/1129388)*(2356/1129388)))}' < /dev/null
% awk '{ print log(($1/$4)/(($2/$4)*($3/$4)))}'
29 452 2356 1129388
3.42608
% awk '{ if (NF==4) T=$4; print log(($1/T)/(($2/T)*($3/T)))}'
29 452 2356 1129388
3.42608
280 344 365
7.83144
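The awk one-liners above can also be wrapped in a small shell function that reports the same value in two bases. This is only a sketch: the function name pmi is made up for illustration, and it assumes every input line has the four fields f(x,y), f(x), f(y) and N, as in the first awk example. It applies the change-of-base rule by dividing the natural log by log(2).

```shell
# pmi: read lines of the form "f(x,y) f(x) f(y) N" on stdin and print the
# log ratio twice: with the natural log, and converted to log base 2.
# (Hypothetical helper name; assumes four numeric fields per line.)
pmi() {
  awk '{
    nats = log(($1/$4) / (($2/$4) * ($3/$4)))  # awk log() is the natural log
    bits = nats / log(2)                       # change of base: log2(x) = ln(x)/ln(2)
    printf "%.4f %.4f\n", nats, bits
  }'
}

echo "29 452 2356 1129388" | pmi
# prints: 3.4261 4.9428
```

The first number agrees with the bc and awk results above; the second is the same quantity in base 2.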
week 6: Results of test
Discussion of the results of the test...
solutions
overview
PS: with respect to the "the"/"I" ratio question of last week: it turns out that
if you inspect the frequency of "de" and "ik" in Dutch tweets, you will find a
very peculiar pattern. These frequencies vary quite a bit over the day. Include
a reference to this paper by Erik Tjong Kim Sang (in Dutch). And:
- this variation seems to follow a pattern
- the variations of "de" and "ik" mirror each other
Use this website
to inspect frequency vs. time of day/week/month in Dutch tweets.
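To try a ratio of this kind on a text of your own, a rough sketch with the standard tools could look as follows. The function name ratio and the sample sentence are invented for illustration; this is not how the website computes its figures. It also assumes plain ASCII letters, so accented characters may need a suitable locale.

```shell
# ratio WORD1 WORD2: read text on stdin, print count(WORD1) / count(WORD2).
# (Hypothetical helper; words are lowercased before counting.)
ratio() {
  tr -sc '[:alpha:]' '\n' |        # put every word on its own line
  tr '[:upper:]' '[:lower:]' |     # fold upper case to lower case
  awk -v a="$1" -v b="$2" '
    $0 == a { na++ }
    $0 == b { nb++ }
    END { if (nb > 0) printf "%.2f\n", na / nb; else print "undefined" }'
}

printf 'De kat zit op de mat. Ik zie de kat en ik lach.\n' | ratio de ik
# prints: 1.50   ("de" occurs 3 times, "ik" 2 times)
```

Running this on texts from different times of day would give one ratio per sample, which can then be compared by hand.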
This picture displays the odd relationship between "de" and "ik":
Finish exercises of week 4.
week 7: Research project proposals
Homework for week 6: come up with an idea for your research project.
You need to present (5 mins) your idea orally at the next meeting,
October 15, 4pm.
week 10: Research project proposals
Homework for week 7: finish written version of project proposal. The proposal needs to be
submitted to the teacher before November 5, 3pm.
week 11: Class
After the break, the next class will be November 12, 3 pm. Purpose of the meeting:
- discuss progress wrt research projects
- make schedule for presentations
Deadline for reports: Friday December 21, 2012, 5:59 pm (local time).