2006: The homepage of this year's course is here!
Corpus linguistics is one of the oldest and at the same time one of the youngest areas of linguistics. Thanks to the computer, it is now possible to use automatic analysis of text collections for a number of purposes. Corpora (collections of texts) are currently used in many areas, including historical linguistics, psycholinguistics, applied linguistics, computational linguistics, and theoretical linguistics. Computers not only allow fast analysis but, via the Internet, also give direct access to a vast and quickly growing body of text. This course presents the foundations of corpus linguistics and the use of computers in the field, and takes the form of lectures along with a computer practicum. Some familiarity with UNIX and regular expressions is assumed, roughly equivalent to the course Tekstmanipulatie.
NOTE: In order to do the exercises for this course, you'll need an account on hagen, the Unix server for students in the Faculty of Arts. If you don't already have an account, you'll need to get one before the first lab.
NOTE: There is no obligatory literature that students must purchase. However, a book about programming in Perl is highly recommended. The classic is Programming Perl from O'Reilly. Learning Perl is also a good introduction to this scripting language. Here is also a brief Perl reference guide, a summary of Perl syntax and built-in functions.
NEWS: No good news - hagen is still down! This means that teaching (and taking) this course is extremely difficult. Is it possible for everybody to work on other machines (at home or here at the University)? Perl can be installed on almost every platform (Windows, Mac, etc.). I will change deadlines and assignments, and hopefully we can manage. (This is really a catastrophe! Please complain as much as possible to the people responsible for this mess ... it's not me ...)
NEWS: Suggestions for final assignments! This is a preliminary list and may change at any time. Discuss your assignment with me before starting to work on it! For many assignments you will need access to corpus data. Hopefully hagen will be up again soon and I can provide you with the data!
NEWS: Hagen is back! Deadlines apply as stated below! Results of assignments will be here!
Note that I mark submissions with 'L' if they came in late. I will be more critical of those than of the ones submitted on time! Extra time means that I expect better solutions!
| Week | Lecture | Links/Literature | Perl | Assignment | Deadline |
|---|---|---|---|---|---|
| 6 April | Introduction | [CL-ch1] [CC-2.1&2.4] | variable types and control structures | Assignment 1 | 13 April |
| 13 April | Compiling corpora | [CL-ch2] [G94] [CC-2.2] | Regular expressions I | Assignment 2 | 29 April |
| 20 April | Corpus-based methods I | [CL-3] | Regular expressions II | Assignment 3 | 11 May |
| 27 April | Statistics | [MS-5] | Subroutines & simple file I/O | Assignment 4 | 18 May |
| ... no classes this week! Look at suggestions for final assignments and try to decide what you want to do! Start to make a plan for your final assignment! | | | | | |
| 11 May | Corpus-based methods II | | Text manipulation | Assignment 5 | 25 May |
| 18 May | Data-driven NLP I | | Advanced data structures | Assignment 6 | 01 June |
| 25 May | Data-driven NLP II | | | Start working on the final assignment: planning and discussion | |
| | | | | Final Assignment | 29 June |