Corpus linguistics LIX011PO5

2006: The homepage of this year's course is here!

Corpus linguistics is one of the oldest and at the same time one of the youngest areas of linguistics. Thanks to the computer, it is now possible to use automatic analysis of text collections for many purposes. Corpora (collections of texts) are currently used in many areas, including historical linguistics, psycholinguistics, applied linguistics, computational linguistics, and theoretical linguistics. Computers not only allow fast analysis but, via the Internet, also give direct access to a vast and quickly growing body of text. This course presents the foundations of corpus linguistics and the use of computers in the field, and takes the form of lectures combined with a computer practicum. Some familiarity with UNIX and regular expressions is assumed, roughly equivalent to the course Tekstmanipulatie.

NOTE: In order to do the exercises for this course, you'll need an account on hagen, the Unix server for students in the faculty of arts. If you don't already have an account, you'll need to get one before the first lab.

NOTE: There is no obligatory literature that students must purchase. However, a book about programming in Perl is highly recommended. The classic is Programming Perl from O'Reilly. Learning Perl is also a good introduction to this scripting language. Here is also a brief Perl reference guide that summarizes Perl syntax and built-in functions.

NEWS: No good news - hagen is still down! This means that teaching (and taking) this course is extremely difficult. Is it possible for everybody to work on other machines (at home or here at the University)? Perl can be installed on almost every platform (Windows, Mac, etc.). I will change deadlines and assignments, and hopefully we can manage. (This is really a catastrophe! Please complain as much as possible to the people responsible for this mess ... it's not me ...)

NEWS: Suggestions for final assignments! This is a preliminary list and may change at any time. Discuss your assignment with me before starting to work on it! For many assignments you will need access to corpus data. Hopefully hagen will be up again soon and I can provide you with the data!

NEWS: Hagen is back! Deadlines apply as stated below! Results of assignments will be here!
Note that I mark submissions with 'L' if they came in late. I will be more critical of those than of the ones submitted on time! Extra time means that I expect better solutions!


Week | Lecture | Links/Literature | Perl | Assignment | Deadline
6 April | Introduction | [CL-ch1] [CC-2.1&2.4] | Variable types and control structures | Assignment 1 | 13 April
13 April | Compiling corpora | [CL-ch2] [G94] [CC-2.2] | Regular expressions I | Assignment 2 | 29 April
20 April | Corpus based methods I | [CL-3] | Regular expressions II | Assignment 3 | 11 May
27 April | Statistics | [MS-5] | Subroutines & simple file I/O | Assignment 4 | 18 May
... | No classes this week! Look at the suggestions for final assignments and try to decide what you want to do! Start making a plan for your final assignment!
11 May | Corpus based methods II | | Text manipulation | Assignment 5 | 25 May
18 May | Data-driven NLP I | | Advanced data structures | Assignment 6 | 01 June
25 May | Data-driven NLP II | | Start working on the final assignment: planning and discussion | |
 | | | | Final Assignment | 29 June

The schedule is likely to change during the course!
Perl on other platforms:
Literature online:
Online resources:
About Perl: