Seminar in Methodology and Statistics
For students in the Linguistics Research Master's Program and
Linguistics PhD students
Course under development--more or less permanently!
Spring, 2011
When: Tues. 9:15-11
Where: Harmonie H15.0036
Instructor: John Nerbonne
Announcements
- Papers due on June 30th. See below
for requirements. Extensions possible, but no later
than Aug. 15th. Please send me a mail if you wish to turn in your
paper after June 30th.
- Discussion with Tony Mullen: Tony Mullen is an Erasmus Mundus
visiting scholar in Language and Communication Technology who will
give a talk on his research on Tues. June 7 (right after the
talks by Jet and Vincenzo). He is also interested in talking
to students, especially those in LCT but also others, at 11 am
that same day.
- More Exercises: Create some quiz questions to stimulate
students to learn regression models. Details here.
- Extra Session: Tues. Apr. 12, 9:15-11,
Harmonie 13.15.036 (usual room). Purpose: identify topics for
presentations and papers for those who have not yet done so.
- R Lab on Mixed Models: Thursday 31 Mar. 9:15-11,
Harmonie 13.12.0107AC.
Martijn Wieling will lead the mixed models lab.
- Exercises available: Try out the quiz questions your colleagues
suggested, and criticize them a bit. See quizzes. Criticize others' questions, again working as a group. See
Exercise 2.
- Exercises: Create some quiz questions to stimulate
students to learn ANOVA. Details here.
- R Labs: Thursdays 9:15-11, beginning Thurs. March 3 (for five weeks),
in the Harmonie building: 13.12.0107AC.
Erik Tjong Kim Sang will lead the labs, and he's put the labs online here.
- Sheets from Stat. II class available here.
Description
The structure of the course revolves around seminar presentations by
participants. Presentations primarily concern statistical or
methodological issues in the research of the participants. See below for some of the topics presented and
discussed in 2004 through 2008.
Topics include a selection from permutation tests; bootstrapping;
analysis of variance and analysis of covariance; regression including
multiple regression and hierarchical (multi-level) regression;
dimension reduction techniques including factor analysis, principal
component analysis, multidimensional scaling and/or latent semantic
analysis; analysis of nominal data including association strength,
Cohen's kappa, binomial or multinomial models, Fisher's exact test,
odds ratios, entropy, or logistic regression. Other topics have
regularly been presented, mostly at the request of the participants.
Prerequisites
Participants in the course should have completed a basic course in
statistics covering topics such as descriptive statistics but also
basic hypothesis testing using z-tests, t-tests, and χ².
Participation in the first semester Research Master course on
statistics and corpus linguistics (Wander Lowie and Gertjan van Noord)
is strongly recommended for anyone who's never taken statistics. It is
required that you know the statistics from that course, so if
you've never taken such a course, that's a good basis. The statistics
course given in the European Master's in Clinical Linguistics is
also good.
Requirements
Students taking the course for credit in the research master must
present at least one hour-long session (30 min. presentation plus 15
min. discussion and question session) on a statistical analysis
technique. In addition they must turn in an 8-10 pp. (2,000-2,500
wd.) paper, which may be on the same statistical analysis technique, but
which may also be on another. In the paper it is important to embed
the discussion in the analysis of concrete data, to explain how the
analysis works, under what conditions it may be applied, and what its
shortcomings are. Graphical presentations of the data attempting to
illustrate the tendency that is to be proven or disproven are
definitely valuable.
It is fine with me if you turn in a paper reporting on work for
another course as long as the paper I receive focuses on the
statistical analysis. If you want to turn in the same paper for two
courses, make sure that both instructors know this.
Ph.D. candidates from BCN or the Graduate School in the Humanities
have received credit for this course in the past if they presented one
topic (45 min.: 30 + 15) and participated regularly. I assume that
that will continue to be the case.
Books
In general, we will try to use the following:
- David S. Moore and George McCabe (1993)
Introduction to the Practice of Statistics 5th edition.
Freeman: New York.
We assume the materials in chapters 1-9, 12, and 14 (subject
of Introductory Statistics).
More advanced chapters such as those on permutation tests,
bootstrapping, or regression models might be subjects of
presentations and discussion.
A nice alternative seems to be Alan Agresti & Barbara
Finlay's Statistical Methods for the Social Sciences
4th ed. Pearson: Upper Saddle River, NJ, 2009. I haven't
used it yet, but it has a good selection of material.
- Toni Rietveld and Roeland van Hout (1993) Statistical
Techniques for the Study of Language and Language Behavior.
Mouton De Gruyter: Berlin.
For many years, the text for statistics in linguistics,
and still excellent. But see below.
Other references that have also been found useful are the following:
- Alan Agresti (1996) An Introduction to Categorical Data
Analysis. Wiley: New York.
- Barbara Tabachnick and Linda Fidell (2001) Using Multivariate
Statistics, Pearson: Needham Heights, MA.
Comprehensive, and aimed at analysts, as opposed to those
interested in mathematical underpinnings or those interested
in developing statistical theory further.
- Chris Manning and Hinrich Schütze (1999) Foundations of
Statistical Natural Language Processing, MIT Press: Cambridge, MA.
Focus on computational linguistics, naturally, but lots on
appropriate stats, including statistical modeling, information theory.
- Chris Manning, Prabhakar Raghavan and Hinrich Schütze (2008)
Introduction to Information Retrieval, Cambridge
University Press: Cambridge, UK.
Focus on IR, CL, naturally,
but lots on stats, including singular value decomposition, latent
semantic indexing.
Many also find the more "how-to" books useful. The following books
have been found especially valuable as they provide very practical
instructions for doing analysis in SPSS or R.
- Andy Field (2000) Discovering Statistics using SPSS
for Windows. Sage: London.
Good for SPSS tips, covers basics well, informal (wordy) style.
- Dennis Howitt and Duncan Cramer (2008) Introduction to SPSS
in Psychology For Version 16 and earlier. 4th ed. Pearson: Essex.
Excellent continuation for topics too difficult for the Field book.
- Harald Baayen (2008) Analyzing Linguistic Data. A Practical
Introduction to Statistics using R. Cambridge University Press:
Cambridge. There is also
an online version available.
The book by Baayen may be the best book ever written
on linguistic statistics. Especially if you are using large
data sets (corpus frequencies), R is the way to go.
- Keith Johnson (2008) Quantitative Methods in Linguistics.
There is an online version as well.
I confess that I still haven't read this (6/2008), but based on Johnson's
work in general, I expect it to be good. Like Baayen's, this book
is R-based.
Articles may also be used from time to time. The books are on
reserve, most at the Letteren library, reserve shelf. Please note
that books on reserve at the Letteren library that normally belong
there are not moved to the reserve shelf. Instead, they're
kept at their normal places (use the catalogue), but are on reserve
and cannot be loaned out.
Schedule 2011 (Tentative!)
The schedule for Spring 2011. Meetings
are Tues. 9-11 in H1315.036
| Week | Date | Theme | Readings | Leader |
|---|---|---|---|---|
| 1 | 15 Feb. | Organizational | | John Nerbonne |
| 2-3 | 22 Feb.- 1 Mar. | ANOVA, Factorial ANOVA | M&M Chap.12-13 | John Nerbonne |
| 4 | 8 Mar. | Repeated Measures | Rietveld/van Hout, Ch.4.6; Field 13 | John Nerbonne |
| 5-6 | 15-22 Mar. | Regression, Mult. Regr. | M&M 10-11 | John Nerbonne |
| 7 | 29 Mar. | Mixed Models | Baayen, Ch.7 | Martijn Wieling |
| 8 | 5 April | no class - break | | |
| 9 | 12 Apr. | Identify topics | Optional | John Nerbonne |
| 10 | 19 Apr. | Logistic Regression | M&M Chap.15 | John Nerbonne |
| 11 | 26 Apr. | Association Strength, PMI | Wiechmann slides | Simon Šuster |
| | | Association Strength, Odds Ratios | Agresti Chap. 2.3 | Laura Handojo |
| | | Ass. Strength & Fisher's Exact | Pedersen 1996 | Jelke Bloem |
| | | Association Strength, Minimum Sensitivity | Wiechmann paper | Igor Tytyk |
| 12 | 3 May | May Break | | |
| 13 | 10 May | Association Strength | | Stefan Evert |
| 14 | 17 May | Repeated Measures | Field 13; Rietveld & van Hout Ch. 4.6 | Laura Bos |
| | | | | Connie Lahmann |
| | | ... vs. Mixed Models | | Ruggero Montalto |
| | | | | Oscar Strik |
| 15 | 24 May | Clustering | Johnson, Ch.6.1-6.4 | Martin Boroš |
| | | Principal Components | Tabachnick & Fidell, Ch.13 | Ke Tranh |
| | | Multi-Dimensional Scaling | Johnson, Ch.6.5 | Ljubomir Žlatkov |
| | | Correspondence Analysis | | Kristel Uiboaed |
| Extra | 26 May | Validation | | Kaitlin Mignella |
| | | | | Mona Zimmermeister |
| | | minF | | Jurriën Schuurman |
| 16 | 31 May | No class | | JN in Kampala |
| 17 | 7 June | Logistic Regression | M&M Ch.15 | Vincenzo Tabacco |
| | | Cochran Q | Field, 15.6.3 | Jet Vonk |
| | Guest Lecture | Sentiment in Texts | Clustering, PMI, Naive Bayes | Tony Mullen |
Click on the lecture (etc.) title to see more.
Course Materials 2011
- John Nerbonne's lectures on various ANOVA & regression models (weeks 2-6).
- Martijn Wieling on Mixed Models
- J. Nerbonne on Entropy and Information Theory and the Conditional Entropy of the phoneme mapping (in Scandinavian "semi-communication").
- Stefan Evert's page on statistics and software for measuring collocation strength.
- Association Strength talks
  - Simon Šuster on Mutual Information and Collocations
  - Laura Handojo Odds Ratios and Collocations
  - Jelke Bloem on Fisher's Exact Test to Detect Animacy
  - Igor Tytyk on Minimum Sensitivity and Collostructions
  - Stefan Evert on statistical association strength and multi-word expressions.
- Repeated Measures vs. Mixed Models
  - Connie Lahmann on 'Higher Language Cognition' and Grammaticality Verification
  - Laura Bos Repeated Measures ANOVA & Permutation Statistics
  - Ruggero Montalto on Repeated Measures vs. Mixed Models
  - Oscar Strik on the Akaike Information Criterion
- Dimension Reduction
  - Martin Boros Cluster Analysis and Silhouette Width
  - Ke Tran on Principal Component Analysis (and Face Recognition!)
  - Lubomir Zlatkov on Multi-Dimensional Scaling
  - Kristel Uiboaed on Correspondence Analysis
- Mona Timmermeister & Caitlin Mignella on Validating a Pronunciation Difference Measure
- Jurriën Schuurman on Min F in Psycholinguistics
- Jet Vonk on Cochran's Q
Course Materials 2010
- Eliza Magaretha on regression used to evaluate the quality of inducing pronunciation distances from empirical data.
- Nynke van der Vliet on Cohen's κ used to measure the agreement between annotators of hierarchically structured material.
- Edgar Weiffenbach on Log Linear Models of Contingency used to analyse corpus frequencies.
- Nick Ruiz on Logistic Regression used to analyse corpus frequencies.
- Rahmad Mahendra on Cross Entropy used to measure model quality in computational analyses.
- Nadine Glas on (Log) Odds Ratios used to test the statistical independence of categorical variables (with a comparison to χ²).
- Seid Tvica on Ordinal Regression used to measure comprehensibility of foreign speech.
Course Materials 2009
- J. Nerbonne on Entropy and Information Theory and the Conditional Entropy of the phoneme mapping (in Scandinavian "semi-communication").
- Thomas Zastrow Entropy in Dialectometry and P. Nabende on Cross Entropy and Model Comparison
- Xuchen Yao Bayesian vs. Frequentist Approaches to Statistics
- Çagri Çöltekin Hierarchical Bayesian Networks as Learning Models.
  For background reading see Wagenmakers, E.-J., Lee, M. D.,
  Lodewyckx, T., & Iverson, G. (2008).
  Bayesian versus frequentist inference.
  In H. Hoijtink, I. Klugkist, and P. A. Boelen (Eds.),
  Bayesian Evaluation of Informative Hypotheses, pp. 181-207.
  Springer: New York.
- Ma Jianqiang Permutation Tests and Monte Carlo Sampling
- Jelena Prokic Clustering and the Bootstrap
- Arjen Versloot Using Late Medieval Sources for Linguistic Reconstructions (and regression)
- Gulsen Yilmaz Using Multiple Regression to Understand Language Attrition
- Harwintha Anjarningsih Repeated Measures ANOVA applied to ERP data.
- Natalia Ergorova Repeated Measures ANOVA applied to ERP data, Example II.
- Anja Schüppert (Binary) Logistic Regression applied to foreign comprehension data.
- Ankelien Schippers (Multinomial) Logistic Regression applied to historical syntax.
- Karin Beijering Loglinear Analysis of Contingency Tables applied to historical syntax.
- Ildikó Berzlanovich Intercoder Agreement in Discourse Analysis (Cohen's κ)
- Myrte Faber Annotating Turn Competition in Multi-Party Conversations (Cohen's κ)
- Martijn Wieling Bipartite Spectral Graph Clustering (applied to dialectal variation)
- Proscovia Olango Naive Bayes (applied to disambiguation)
Course Materials 2008
- J. Nerbonne on Entropy and Information Theory and the Conditional Entropy of the phoneme mapping (in Scandinavian "semi-communication").
- E. Rossi on Normal Distributions and Sampling and on hypothesis testing and t-tests
- V. Koukoulioti on Nonparametric Fallback Tests
- E. Rossi on Single ANOVA
- Th. Mehotcheva on Kruskal-Wallis
- H. Loerts on Multivariate ANOVA and Repeated Measures
- H. Ahmed on χ² and Fisher's Exact Test
- A. Lobanova on (Log) Odds Ratios and Word Order Studies
- J. Nerbonne on regression and multiple regression
- N. Haque on Principal Component Analysis
- S. van Ommen on Clustering
Course Materials 2007
- J. Nerbonne on regression and multiple regression
- B. Szmrecsányi "Language users as creatures of habit: a corpus-linguistic analysis of persistence in spoken English" Corpus Linguistics and Linguistic Theory 1(1): 113-150.
- L. Stowe on Analysis of Variance, incl. Multiple Analysis of Variance.
- A. Banga and Tam Ho on Repeated Measures (ANOVA)
- S. Berends on Assumptions of ANOVA
- T. Caspi on Windowing, Correlations, and Dynamic Systems Theory
- Th. Leinonen on Regression in Phonetics and Computational Modeling
- J. Nerbonne on Multiple Regression Models.
- V. Baaijen on Applying Nonparametric Statistics to Analyse Writing (Comparing Think-Aloud Protocols and Keystroke Logging)
- M. Knippers and R. Montalto on Dealing with Nonnormal Distributions in a Repeated Measures Design
- M. Spruit on Search for Associations among Variables
- G. Korfiatis on Principal Components Analysis
- T. Van de Cruys on Dimensionality Reduction for Similarity Detection (Singular Value Decomposition, Non-Negative Matrix Factorization)
Course Materials 2006
- Nerbonne on Factor Analysis
- Wiersma on Permutation Tests
- Zinger on n-gram models
- Ruffle on syntactic differences in Old English
- Van de Cruys on Latent Semantic Analysis
- Vasishth on Mixed Effects Models
- Moberg on conditional entropy used to model comprehensibility.
- Mur on binomial models, esp. the paired sign test.
- Heeringa on bootstrapping.
- Ruffle on Log Odds
- Xiaoyan Xu on Multivariate nature of Language Attrition
- Kwant on "Delphi" techniques for identifying variables
Course Materials 2005
- Nerbonne on χ²
- Villada on Association Statistics for Recognizing Multi-Word Units
- Featherston on Magnitude Estimation. (Various papers, of which Featherston's "Decathlon Model" is perhaps the best starting point.)
- Donkers on ANOVA, repeated measures.
- Ruffle and Trofimova on Fisher's Exact Test.
- Bouma on Corpora and Counting.
- Van Noord on Search in Automatically Analysed Corpora.
- Smits and Rossi on Binomial Chances.
- Kremers on Log Odds Ratios.
- Nerbonne on Logistic Regression.
- Deunk on Analysing Qualitative Results via Multi-Level Regression.
- van der Plas on Clustering.
- van der Beek on Entropy as Measure of Syntactic Influence.
- Fahmi on Identifying Terminology.
Course Materials 2004
- Nerbonne on Logistic Regression
- Siedle on Hierarchical Cluster Analysis
- Hopp on Magnitude Estimation
- Lichte on Association Measures
- Kootstra on Exploratory Factor Analysis
- Rossi on Odds Ratios in Aphasiology
Student Projects
- Gerrit Jan Kootstra's 2004 project on exploratory Factor Analysis applied to
foreign language learning.