Corpus Analysis

Details for Autumn 2012; more later

Description

This lecture is a first introduction to corpus linguistics. This part of the course introduces standard techniques for using corpora in linguistic research. Various corpora are discussed, as well as corpus annotations. Practical methods for searching and counting linguistic patterns in corpora are part of the course. You will also be introduced to the UNIX environment and the UNIX tools for text analysis. A number of simple statistical measures that are often used in corpus linguistic research are discussed.

For this course you have to:

The grade will be calculated on the basis of these three points.

Assignments should be sent to Gertjan van Noord. Send them as plain text, and include your name and student number.

overview of assignments completed

  1. week 1: Annerose, Steven, Amber, Charlotte, Stefan, Edit, Anne, Kim Heerdink, Marleen, Ruth, Noortje, Kim Heiligenstein, Yorqin
  2. week 2: Annerose, Steven, Noortje, Kim Heerdink, Anne, Amber, Kim Heiligenstein, Charlotte, Ruth, Marleen, Edit, Stefan, Yorqin
  3. week 3+4: Anne, Annerose, Steven, Edit, Amber, Charlotte, Noortje, Marleen, Ruth, Kim Heerdink, Stefan, Kim Heiligenstein, Yorqin
  4. TEST: Kim Heerdink, Amber, Edit, Kim Heiligenstein, Charlotte, Annerose, Noortje, Yorqin, Ann, Steven, Marleen, Ruth, Stefan
  5. week 5: Marleen, Anne, Annerose, Steven, Charlotte, Kim Heiligenstein, Amber, Edit, Kim Heerdink, Ruth, Stefan, Noortje
  6. research paper outline: Noortje, Annerose, Marleen, Amber, Ruth, Charlotte, Edit, Steven, Kim Heerdink, Stefan, Anne, Yorqin
  7. research paper: Annerose, Kim Heiligenstein, Edit, Amber, Charlotte, Steven, Noortje, Anne
  8. presentation: all

    dates for the presentation:

    1. Monday, December 3, 15:00, Harmonie 13-0344: Ruth, Amber
    2. Thursday, December 6, 09:00, Turftorenstraat T14: Stefan, Noortje, Kim
    3. Monday, December 10, 15:00, Harmonie 13-0344: AnneRose, Steven, Kim Heiligenstein, ?Marleen
    4. Thursday, December 13, 09:00, Turftorenstraat T14: Anne, Charlotte, Edit, Yorqin
    5. reserve date: Monday, December 17, 15:00, Harmonie 13-0344

Planning

  1. week 1: Corpus Linguistics and UNIX tools

    Exercises

  2. week 2: Text Analysis

    Exercises

  3. week 3: More text analysis

    Exercises

    Follow-up: Google Ngram counts

  4. week 4: Syntactically Annotated Corpora

    DACT is available here!

    Sheets

    Exercises: finish exercises of week 3, and start with the following exercises.

    Exercises

    My solution to the last question.

  5. week 5: Review + the Test

    Computing in Linux. P.S.: we normally do not worry about the base of the logarithm, as long as you use the same base consistently. Simply use whichever logarithm is easiest. If you want, you can always convert with the change-of-base formula:

    Changing the base: \log_b a = \frac{\log_d a}{\log_d b}

    How to compute? Anything goes...

    • pencil and paper...
    • your high school calculator
    • xcalc
    • bc (natural log)
      % bc -l
      l((29/1129388)/((452/1129388)*(2356/1129388)))
      3.42607955643651944743
        
    • awk (natural log)
      % awk 'END {print log((29/1129388)/((452/1129388)*(2356/1129388)))}' < /dev/null
      
      % awk '{ print log(($1/$4)/(($2/$4)*($3/$4)))}'
      29 452 2356 1129388
      3.42608
      
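      (here the total T is remembered from the last line that has four fields, so later lines can give just the three counts)
      % awk '{ if (NF==4) T=$4; print log(($1/T)/(($2/T)*($3/T)))}'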
      29 452 2356 1129388
      3.42608
      280 344 365
      7.83144
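
The change-of-base formula can be applied to the awk expression directly. The following is a minimal sketch, not part of the course material: the function name score_log2 is hypothetical, and it assumes the four arguments are the three counts and the total, in the same order as in the awk examples above. It computes the natural log as before and divides by log(2) to get the base-2 value.

```shell
# Hypothetical helper: same expression as in the awk examples above,
# but reported in base 2 via log_2 a = ln a / ln 2.
# Arguments: the three counts and the total, in the order used above.
score_log2() {
    awk -v a="$1" -v b="$2" -v c="$3" -v t="$4" 'BEGIN {
        nat = log((a / t) / ((b / t) * (c / t)))   # natural log, as before
        printf "%.5f\n", nat / log(2)              # change of base to 2
    }'
}

score_log2 29 452 2356 1129388
```

For the first example line this gives roughly 4.94 instead of 3.43. Only the unit changes; because every score is divided by the same positive constant, the ranking of scores stays the same.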
      
  6. week 6: Results of test

    Discussion of the results of the test...

    solutions

    overview

    P.S. on the the/I ratio question of last week: if you inspect the frequency of "de" and "ik" in Dutch tweets, you will find a very peculiar pattern. These frequencies vary quite a bit over the day. Include a reference to this paper by Erik Tjong Kim Sang (in Dutch). And:

    • this variation seems to follow a pattern
    • the variations of "de" and "ik" mirror each other

    Use this website to inspect frequency vs. time of day/week/month in Dutch tweets. This picture displays the odd relationship between "de" and "ik".

    Finish exercises of week 4.
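
The the/I ratio question and the "de"/"ik" observation above both come down to counting two words and dividing. A minimal sketch with standard UNIX tools follows. The function name de_ik_ratio and the sample sentence are made up for illustration, and the tokenisation is deliberately crude: any non-letter splits a word, and accented letters are not handled.

```shell
# Hypothetical helper: read text from standard input and print the
# count of "de", the count of "ik", and the de/ik ratio (NA if ik = 0).
de_ik_ratio() {
    tr -cs '[:alpha:]' '\n' |        # put one word on each line
    tr '[:upper:]' '[:lower:]' |     # fold upper case to lower case
    awk '$0 == "de" { de++ }
         $0 == "ik" { ik++ }
         END { ratio = (ik > 0) ? sprintf("%.2f", de / ik) : "NA"
               printf "%d %d %s\n", de, ik, ratio }'
}

# Made-up sample sentence, only to show the call.
printf 'De man zag de hond. Ik zag de kat en ik lachte.\n' | de_ik_ratio
```

Running the same function on text from different times of day would be one way to look at the variation described above, assuming you have such text available as plain files.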

  7. week 7: Research project proposals

    Homework for week 6: come up with an idea for your research project. You need to present your idea orally (5 minutes) at the next meeting, October 15, 4 pm.

  8. week 10: Research project proposals

    Homework for week 7: finish the written version of your project proposal. The proposal must be submitted to the teacher before November 5, 3 pm.

  9. week 11: Class

    After the break, the next class will be November 12, 3 pm. Purpose of the meeting:
    • discuss progress on the research projects
    • make schedule for presentations

    Deadline for reports: Friday December 21, 2012, 5:59 pm (local time).