Seminar in Methodology and Statistics
For students in the Linguistics Research Master's Program and
Linguistics PhD students (& others by agreement)
Course under development--more or less permanently!
Spring, 2013
Lecture/Seminar: Thurs. 9:00-10:45, Harmonie H1315.0031
R Lab w. Çağrı Çöltekin: Fri. 9:00-10:45, Harmonie
H1312.0107A, beginning Feb.27!
Instructor: John Nerbonne (see site for
email, phone, geographic coordinates and for office hours)
Announcements
- Sheets for the first half of the course (multivariate
techniques) available
here
- Lab Sessions begin Fri. Feb.15. See
the labs' web site for an idea of what the labs will involve.
Description
The first half of the course will be an introduction to multivariate
statistical techniques commonly used in linguistics and
communications. It will be accompanied by lab
sessions using the statistical package R.
The rest of the course revolves around seminar presentations by
participants. Presentations primarily concern statistical or
methodological issues in the research of the participants. See below for some of the topics presented and
discussed from 2004 through 2011.
Topics for the second half include a selection from permutation
tests; bootstrapping; analysis of variance and analysis of covariance;
regression including multiple regression and hierarchical
(multi-level) regression; dimension reduction techniques including
factor analysis, principal component analysis, multidimensional
scaling and/or latent semantic analysis; analysis of nominal data
including association strength, Cohen's kappa, binomial or multinomial
models, Fisher's exact test, odds ratios, information-theoretically
inspired measures such as pointwise mutual information, or logistic
regression. Other topics have regularly been presented, mostly at the
request of the participants.
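To give a flavor of one of the topics listed above, here is a minimal sketch of a two-sided permutation test for a difference in means. It is only an illustration: Python is used for convenience (the labs use R), and the function name, the parameters and the reaction-time data are all made up for this example.

```python
import random


def permutation_test(a, b, n_perm=10000, seed=0):
    """Approximate two-sided permutation p-value for a difference in means.

    a, b   : two samples (lists of numbers)
    n_perm : number of random relabelings (hypothetical default)
    seed   : fixed seed so that the sketch is reproducible
    """
    rng = random.Random(seed)

    def mean(xs):
        return sum(xs) / len(xs)

    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        # Reassign the group labels at random by shuffling the pooled data.
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        # Small tolerance guards against floating-point rounding.
        if abs(mean(perm_a) - mean(perm_b)) >= observed - 1e-12:
            hits += 1
    # Count the observed labeling itself, so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical reaction times (ms) for two groups of speakers.
group_1 = [512, 498, 530, 541, 507, 519, 525, 533]
group_2 = [560, 572, 549, 581, 566, 590, 558, 575]
p_value = permutation_test(group_1, group_2, n_perm=2000)
print(p_value)
```

The idea is that, if group membership made no difference, shuffling the labels should produce differences as large as the observed one fairly often; the p-value estimates how often that happens.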
Prerequisites
Participants in the course should have completed a basic course in
statistics covering topics such as descriptive statistics but also
basic hypothesis testing using z-tests, t-tests, and χ².
Participation in the first semester Research Master course on
statistics and corpus linguistics (Wander Lowie and Gertjan van Noord)
is strongly recommended for anyone who's never taken statistics. It is
required that you know the statistics from that course, so if
you've never taken such a course, that's a good basis. The statistics
course given in the European Master's in Clinical Linguistics is
also good.
Requirements
All participants, including auditors not taking the course for
credit, must present at least one hour-long session (30
min. presentation plus 15 min. discussion and question session) on a
statistical analysis technique. In addition students taking the
course for credit in the research master must turn in an 8-10
pp. (2,000-2,500 wd.) paper, which may be on the same statistical
analysis technique, but which may also be on another. In the paper it
is important to embed the discussion in the analysis of concrete data,
to explain how the analysis works, under what conditions it may be
applied, and what its shortcomings are. Graphical presentations of
the data attempting to illustrate the tendency that is to be proven or
disproven are definitely valuable.
It is fine with me if you turn in a paper reporting on work for
another course as long as the paper I receive focuses on the
statistical analysis. If you want to turn in the same paper for two
courses, make sure that both instructors know this and agree to it.
Ph.D. candidates from BCN or the Graduate School in the Humanities
have received credit for this course in the past if they presented one
topic (45 min. -- (30 + 15)) and participated regularly. I assume that
that will continue to be the case.
Books
In general, we will try to use the following:
- David S. Moore and George McCabe (1993)
Introduction to the Practice of Statistics 5th edition.
Freeman: New York.
We assume the materials in chapters 1-9, 12, and 14 (subject
of Introductory Statistics).
More advanced chapters such as those on permutation tests,
bootstrapping, or regression models might be subjects of
presentations and discussion.
A nice alternative seems to be Alan Agresti & Barbara
Finlay's Statistical Methods for the Social Sciences
4th ed. Pearson: Upper Saddle River, NJ, 2009. I haven't
used it yet, but it has a good selection of material.
- Toni Rietveld and Roeland van Hout (1993) Statistical
Techniques for the Study of Language and Language Behavior.
Mouton De Gruyter: Berlin.
For many years, the text for statistics in linguistics,
and still excellent. But see below.
Other references that have also been found useful are the following:
- Alan Agresti (1996) An Introduction to Categorical Data
Analysis. Wiley: New York.
- Barbara Tabachnick and Linda Fidell (2001) Using Multivariate
Statistics, Pearson: Needham Heights, MA.
Comprehensive, and aimed at those doing analysis, as opposed to those
interested in mathematical underpinnings or those interested
in developing statistical theory further.
- Chris Manning and Hinrich Schütze (1999) Foundations of
Statistical Natural Language Processing, MIT Press: Cambridge, MA.
Focus on computational linguistics, naturally, but lots on
appropriate stats, including statistical modeling, information theory.
- Chris Manning, Prabhakar Raghavan and Hinrich Schütze (2008)
Introduction to Information Retrieval, Cambridge
University Press: Cambridge, UK
Focus on IR, CL, naturally,
but lots on stats, including singular value decomposition, latent
semantic indexing.
Many also find the more "how-to" books useful. The following books
have been found especially valuable as they provide very practical
instructions for doing analysis in SPSS or R.
- Andy Field (2000) Discovering Statistics using SPSS
for Windows. Sage: London.
Good for SPSS tips, covers basics well, informal (wordy) style.
- Dennis Howitt and Duncan Cramer (2008) Introduction to SPSS
in Psychology For Version 16 and earlier. 4th ed. Pearson: Essex.
Excellent continuation for topics too difficult for the Field book.
- Harald Baayen (2008) Analyzing Linguistic Data. A Practical
Introduction to Statistics using R. Cambridge University Press:
Cambridge. There is also
an online version available.
The book by Baayen may be the best book ever written
on linguistic statistics. Especially if you are using large
data sets (corpus frequencies), R is the way to go.
- Keith Johnson (2008) Quantitative Methods in Linguistics.
There is an online version as well.
I confess that I still haven't read this (6/2008), but based on Johnson's
work in general, I expect it to be good. Like Baayen's, this book
is R-based.
Articles may also be used from time to time. The books are on
reserve, most at the Letteren library, reserve shelf. Please note
that books on reserve at the Letteren library that normally belong
there are not moved to the reserve shelf. Instead, they're
kept at their normal places (use the catalogue), but are on reserve
and cannot be loaned out.
Schedule for Seminars/Lectures 2013 (Tentative!)
The schedule for Spring 2013. Meetings
are Thurs. 9:00-10:45 in H1315.0031
| Week | Date | Theme | Readings | Leader |
| 1 | 14 Feb. | Organizational | | John Nerbonne |
| 2-3 | 21-28 Feb. | ANOVA, Factorial ANOVA | M&M Chap.12-13 | John Nerbonne |
| 4 | 7 Mar. | Repeated Measures | Rietveld/van Hout, Ch.4.6; Field 13 | John Nerbonne |
| 5-6 | 14-21 Mar. | Regression, Mult. Regr. | M&M 10-11 | John Nerbonne |
| 7 | 28 Mar. | Multilevel Regression | Baayen, Ch.7; Field, Ch.19 | Çağrı Çöltekin |
| | 4, 11 April | no class - exams | | |
| 8 | 18 Apr. | Multilevel Regression, cont. | Baayen, Ch.7; Field, Ch.19 | Çağrı Çöltekin |
| 9 | 25 Apr. | Permutation Statistics; Syntactic substrates | M&M Chap.14 | John Nerbonne |
| | | Independent Component Analysis; Authorship | Wikipedia Article | Carmen Klaußner |
| | 2 May | May vacation (Meivakantie)! | | |
| | 9 May | Ascension Thursday | | |
| 10 | 16 May | Meta-analysis | Wikipedia Article | Jay van Cleef |
| | | ANOVA & Multi-level regr., pitch & emotion | see above | Stephen Gilbers |
| 11 | 23 May | ANOVA & Multi-level regr.; Pronoun Int. | see above | Franziska Köder |
| | | ANOVA & Multi-level regr.; Eyetracking | | Magreet Vogelzang |
| | | ANOVA & Multi-level regr.; Dyslexia | | Rui Qin |
| 12 | 30 May | ANOVA & Multi-level regr.; Comprehensibility | see above | Josephine Kurvers |
| | | Code Switching | | Ramon Kezer |
| | | Conditional Entropy; Comprehensibility | Manning & Schütze, Ch.2.2 | Kim Heiligstein |
| 13 | 6 June | Comprehensibility | ANOVA & Multi-level regr.; Comprehensibility | Jelena Golubovic |
| | | Tagging | Wikipedia, Part-of-speech Tagging | Mets Visser |
| | | Simulation of Diffusion | Wikipedia, Cellular Automaton | Jaap Nanninga |
Click on the lecture (etc.) title to see more.
Course Materials 2013
- John Nerbonne's lectures
on various ANOVA & regression models (weeks 2-7).
- Çağrı Çöltekin on Multilevel Regression.
- John Nerbonne on
Permutation tests.
- Jay van Cleef on Re-analysis for ERP.
- Stephen Gilbers on ANOVA & Emotional Speech in bearers of Cochlear Implants.
- Franziska Köder on ANOVA and Pronoun Interpretation.
- Magreet Vogelzang on Mixed
Models and Eyetracking (and ideas on GAMs).
- Rui Qin on Multi-Level
Regression and Early Detection of Dyslexia.
- Ramon Kezer on Factor Analysis and Code Switching.
- Kim Heiligstein on Conditional Entropy and Comprehensibility.
Course Materials 2012
- John Nerbonne's lectures
on various ANOVA & regression models (weeks 2-7).
- Çağrı Çöltekin on Bayesian vs.
Frequentist Statistics.
- Lotte Schott on Repeated
Measures Anova for ERP (EEG) Data
- Martijn Wieling on Mixed Model Regression
for analyzing linguistic variation and for analyzing eye-tracking
- Gökhan Akçapınar on Information
Gain as used in constructing decision trees
- HuiPing Chan on Multiple
Regression for Analysing Second Language Vocabulary Learning with
Attention to the AIC and to Cook's Distance
- Matthew Smith on Conditional
Entropy and Mutual Intelligibility
- Lili Szábo on Predicting Vowel Harmony using
Pointwise Mutual Information
- Melanie Hof on Measurement
Reliability
- Marjoleine Sloos on Linear Discriminant
Analysis
- Caroline Morris on
Association Strength used to gauge Language Change
Course Materials 2011
- John Nerbonne's lectures
on various ANOVA & regression models (weeks 2-6).
- Martijn Wieling on
Mixed Models
- J. Nerbonne on Entropy and
Information Theory and the Conditional Entropy
of the phoneme mapping (in Scandinavian)
"semi-communication".
- Stefan Evert's page on statistics and software for
measuring
collocation strength.
- Association Strength talks
- Simon Šuster on
Mutual Information and Collocations
- Laura Handojo
Odds Ratios and Collocations
- Jelke Bloem on
Fisher's Exact Test to Detect Animacy
- Igor Tytyk on
Minimum Sensitivity and Collostructions
- Stefan Evert on
statistical association strength and multi-word expressions.
- Repeated Measures vs. Mixed Models
- Connie Lahmann on
'Higher Language Cognition' and Grammaticality Verification
- Laura Bos
Repeated Measures ANOVA & Permutation Statistics
- Ruggero Montalto on
Repeated Measures vs. Mixed Models
- Oscar Strik on the
Akaike Information Criterion
- Dimension Reduction
- Martin Boros
Cluster Analysis and Silhouette Width
- Ke Tran on
Principal
Component Analysis (and Face Recognition!)
- Lubomir Zlatkov on
Multi-Dimensional Scaling
- Kristel Uiboaed on
Correspondence Analysis
- Mona Timmermeister & Caitlin Mignella on
Validating a Pronunciation Difference Measure
- Jurriën Schuurman on
Min F in Psycholinguistics
- Jet Vonk on
Cochran's Q
Course Materials 2010
- Eliza Magaretha on
regression
used to evaluate the quality of inducing pronunciation
distances from empirical data.
- Nynke van der Vliet on
Cohen's κ
used to measure the agreement between annotators of hierarchically
structured material.
- Edgar Weiffenbach on
Log
Linear Models of Contingency used to analyse corpus frequencies.
- Nick Ruiz on
Logistic
Regression used to analyse corpus frequencies.
- Rahmad Mahendra on
Cross Entropy
used to measure model quality in computational analyses.
- Nadine Glas on
(Log) Odds Ratios
used to assess the statistical independence of categorical variables (with a
comparison to χ²).
- Seid Tvica on
Ordinal
Regression used to measure comprehensibility of foreign speech.
Course Materials 2009
- J. Nerbonne on Entropy and
Information Theory and the Conditional Entropy
of the phoneme mapping (in Scandinavian)
"semi-communication".
- Thomas Zastrow
Entropy in Dialectometry and P. Nabende on
Cross Entropy and
Model Comparison
- Xuchen Yao
Bayesian vs. Frequentist Approaches to Statistics
- Çağrı Çöltekin
Hierarchical Bayesian Networks as Learning Models
for background reading see Wagenmakers, E.-J., Lee, M. D.,
Lodewyckx, T., & Iverson, G. (2008).
Bayesian versus frequentist inference.
In H. Hoijtink, I. Klugkist, and P. A. Boelen (Eds.),
Bayesian Evaluation of Informative Hypotheses, pp. 181-207.
Springer: New York.
- Ma Jianqiang
Permutation Tests and Monte Carlo Sampling
- Jelena Prokic
Clustering and the Bootstrap
- Arjen Versloot
Using Late Medieval Sources for Linguistic Reconstructions
(and regression)
- Gulsen Yilmaz
Using Multiple Regression to Understand Language Attrition
- Harwintha Anjarningsih
Repeated Measures ANOVA applied to ERP data.
- Natalia Ergorova
Repeated Measures ANOVA applied to ERP data, Example II.
- Anja Schüppert
(Binary) Logistic Regression applied to foreign comprehension data.
- Ankelien Schippers
(Multinomial) Logistic Regression applied to historical syntax.
- Karin Beijering
Loglinear Analysis of Contingency Tables applied to historical syntax.
- Ildikó Berzlanovich
Intercoder Agreement in Discourse Analysis (Cohen's κ)
- Myrte Faber
Annotating Turn Competition in Multi-Party Conversations
(Cohen's κ)
- Martijn Wieling
Bipartite Spectral Graph Clustering (applied to dialectal variation)
- Proscovia Olango
Naive Bayes (applied to disambiguation)
Course Materials 2008
- J. Nerbonne on Entropy and
Information Theory and the
Conditional Entropy
of the phoneme mapping (in Scandinavian) "semi-communication".
- E. Rossi on
Normal Distributions and Sampling and on
hypothesis testing and
t-tests
- V. Koukoulioti on
Nonparametric Fallback Tests
- E. Rossi on
Single ANOVA
- Th. Mehotcheva on
Kruskal-Wallis
- H. Loerts on
Multivariate ANOVA and Repeated Measures
- H. Ahmed on
χ² and Fisher's Exact Test
- A. Lobanova on
(Log) Odds Ratios and
Word Order Studies
- J. Nerbonne on regression
and multiple
regression
- N. Haque on Principal
Component Analysis
- S. van Ommen on
Clustering
Course Materials 2007
- J. Nerbonne on regression
and multiple
regression
- B. Szmrecsányi
"Language users as creatures of habit: a corpus-linguistic analysis
of persistence in spoken English"
Corpus Linguistics and Linguistic Theory 1(1): 113-150.
- L. Stowe on Analysis of
Variance, incl. Multiple Analysis of Variance.
- A. Banga and Tam Ho on
Repeated Measures, (ANOVA)
- S. Berends on
Assumptions of ANOVA
- T. Caspi on Windowing,
Correlations, and Dynamic Systems Theory
- Th. Leinonen on
Regression in Phonetics and Computational Modeling
- J. Nerbonne on Multiple
Regression Models.
- V. Baaijen on
Applying Nonparametric Statistics to Analyse Writing (Comparing
Think-Aloud Protocols and Keystroke Logging)
- M. Knippers and R. Montalto on
Dealing with Nonnormal Distributions in a Repeated Measures Design
- M. Spruit on
Search for Associations among Variables
- G. Korfiatis on
Principal Components Analysis
- T. Van de Cruys on
Dimensionality Reduction for Similarity Detection
(Singular Value Decomposition, Non-Negative Matrix Factorization)
Course Materials 2006
- Nerbonne on
Factor Analysis
- Wiersma on Permutation Tests
- Zinger on n-gram
models
- Ruffle on
syntactic
differences in Old English
- Van de Cruys on
Latent Semantic
Analysis
- Vasishth on
Mixed
Effects Models
- Moberg on
conditional entropy
used to model comprehensibility.
- Mur on
binomial models, esp. the paired sign test.
- Heeringa on
bootstrapping.
- Ruffle on
Log Odds
- Xiaoyan Xu on
Multivariate
nature of Language Attrition
- Kwant on
"Delphi"
techniques for identifying variables
Course Materials 2005
- Nerbonne on χ²
- Villada on Association Statistics for
Recognizing Multi-Word Units
- Featherston on Magnitude
Estimation. (Various papers, of which Featherston's "Decathlon
Model" is perhaps the best starting point.)
- Donkers on ANOVA, repeated measures.
- Ruffle and Trofimova on Fisher's Exact Test.
- Bouma on Corpora and Counting.
- Van Noord on Search in Automatically Analysed Corpora.
- Smits and Rossi on Binomial Chances.
- Kremers on Log Odds Ratios.
- Nerbonne on Logistic Regression.
- Deunk on Analysing Qualitative Results via
Multi-Level Regression.
- van der Plas on Clustering.
- van der Beek on Entropy as Measure of Syntactic Influence.
- Fahmi on Identifying Terminology.
Course Materials 2004
- Nerbonne on Logistic Regression
- Siedle on Hierarchical Cluster Analysis
- Hopp on Magnitude Estimation
- Lichte on Association Measures
- Kootstra on Exploratory Factor Analysis
- Rossi on Odds Ratios in Aphasiology
Student Projects
- Melanie Hof's 2012 paper
"Questionnaire Evaluation with Factor Analysis
and Cronbach's Alpha"
- Gerrit Jan Kootstra's 2004 project on exploratory Factor Analysis applied to
foreign language learning.