Seminar in Methodology and Statistics
For students in the Linguistics Research Master's Program and
Linguistics PhD students (& others by agreement)
Course under development, more or less permanently!
Spring, 2014 (under construction)
Lecture/Seminar: Wed. 15:00-16:45, Turftorenstraat, room 12.
(This is the little building just south of the Harmonie building.)
R Lab w. Çağrı Çöltekin: Fri. 9:00-10:45, Harmonie
H1312.0107A, beginning Feb. 7!
Instructor: John Nerbonne (see site for
email, phone, geographic coordinates and for office hours)
Announcements
 Sheets for the first half of the course (multivariate
techniques) available
here
 Lab Sessions begin Fri. Feb.15. See
the labs' web site for an idea of what the labs will involve.
Description
The first half of the course will be an introduction to multivariate
statistical techniques commonly used in linguistics and
communications. It will be accompanied by lab
sessions using the statistical package R.
The rest of the course revolves around seminar presentations by
participants. Presentations primarily concern statistical or
methodological issues in the research of the participants. See below for some of the topics presented and
discussed from 2004 through 2011.
Topics for the second half include a selection from permutation
tests; bootstrapping; analysis of variance and analysis of covariance;
regression including multiple regression and hierarchical
(multilevel) regression; dimension reduction techniques including
factor analysis, principal component analysis, multidimensional
scaling and/or latent semantic analysis; analysis of nominal data
including association strength, Cohen's kappa, binomial or multinomial
models, Fisher's exact test, odds ratios, information-theoretically
inspired measures such as pointwise mutual information, or logistic
regression. Other topics have regularly been presented, mostly at the
request of the participants.
Prerequisites
Participants in the course should have completed a basic course in
statistics covering topics such as descriptive statistics but also
basic hypothesis testing using z-tests, t-tests, and χ².
Participation in the first semester Research Master course on
statistics and corpus linguistics (Wander Lowie and Gertjan van Noord)
is strongly recommended for anyone who's never taken statistics. It is
required that you know the statistics from that course, so if
you've never taken such a course, that's a good basis. The statistics
course given in the European Master's in Clinical Linguistics is
also good.
All participants, including auditors not taking the course for
credit, must present at least one hour-long session (30
min. presentation plus 15 min. discussion and question session) on a
statistical analysis technique. In addition students taking the
course for credit in the research master must turn in an 8-10
pp. (2,000-2,500 wd.) paper, which may be on the same statistical
analysis technique, but which may also be on another. In the paper it
is important to embed the discussion in the analysis of concrete data,
to explain how the analysis works, under what conditions it may be
applied, and what its shortcomings are. Graphical presentations of
the data attempting to illustrate the tendency that is to be proven or
disproven are definitely valuable.
It is fine with me if you turn in a paper reporting on work for
another course as long as the paper I receive focuses on the
statistical analysis. If you want to turn in the same paper for two
courses, make sure that both instructors know this and agree to it.
Ph.D. candidates from BCN or the Graduate School in the Humanities
have received credit for this course in the past if they presented one
topic (45 min.  (30 + 15)) and participated regularly. I assume that
that will continue to be the case.
Books
In general, we will try to use the following:
 David S. Moore and George McCabe (1993)
Introduction to the Practice of Statistics 5th edition.
Freeman: New York.
We assume the materials in chapters 1-9, 12, and 14 (subject
of Introductory Statistics).
More advanced chapters such as those on permutation tests,
bootstrapping, or regression models might be subjects of
presentations and discussion.
A nice alternative seems to be Alan Agresti & Barbara
Finlay's Statistical Methods for the Social Sciences
4th ed. Pearson: Upper Saddle River, NJ, 2009. I haven't
used it yet, but it has a good selection of material.
 Toni Rietveld and Roeland van Hout (1993) Statistical
Techniques for the Study of Language and Language Behavior.
Mouton De Gruyter: Berlin.
For many years, the text for statistics in linguistics,
and still excellent. But see below.
Other references that have also been found useful are the following:
 Alan Agresti (1996) An Introduction to Categorical Data
Analysis. Wiley: New York.
 Barbara Tabachnick and Linda Fidell (2001) Using Multivariate
Statistics, Pearson: Needham Heights, MA.
Comprehensive, and aimed at analysis, as opposed to those
interested in mathematical underpinnings or those interested
in developing statistical theory further.
 Chris Manning and Hinrich Schütze (1999) Foundations of
Statistical Natural Language Processing, MIT Press: Cambridge, MA.
Focus on computational linguistics, naturally, but lots on
appropriate stats, including statistical modeling, information theory.
 Chris Manning, Prabhakar Raghavan and Hinrich Schütze (2008)
Introduction to Information Retrieval, Cambridge
University Press: Cambridge, UK
Focus on IR, CL, naturally,
but lots on stats, including singular value decomposition, latent
semantic indexing.
Many also find the more "how-to" books useful. The following books
have been found especially valuable as they provide very practical
instructions for doing analysis in SPSS or R.
 Andy Field (2000) Discovering Statistics using SPSS
for Windows. Sage: London.
Good for SPSS tips, covers basics well, informal (wordy) style.
 Dennis Howitt and Duncan Cramer (2008) Introduction to SPSS
in Psychology For Version 16 and earlier. 4th ed. Pearson: Essex.
Excellent continuation for topics too difficult for the Field book.
 Harald Baayen (2008) Analyzing Linguistic Data. A Practical
Introduction to Statistics using R. Cambridge University Press:
Cambridge. There is also
an online version available.
The book by Baayen may be the best book ever written
on linguistic statistics. Especially if you are using large
data sets (corpus frequencies), R is the way to go.
 Keith Johnson (2008) Quantitative Methods in Linguistics.
There is an online version as well.
I confess that I still haven't read this (6/2008), but based on Johnson's
work in general, I expect it to be good. Like Baayen's, this book
is R-based.
Articles may also be used from time to time. The books are on
reserve, most at the Letteren library, reserve shelf. Please note
that books on reserve at the Letteren library that normally belong
there are not moved to the reserve shelf. Instead, they're
kept at their normal places (use the catalogue), but are on reserve
and cannot be loaned out.
Schedule for Seminars/Lectures 2014 (Tentative!)
The schedule for Spring 2014. Meetings
are Wed. 3-5 pm in Turftorenstr., room 12.
| Week | Date | Theme | Readings | Leader |
|---|---|---|---|---|
| 1 | 5 Feb. | Organizational | | John Nerbonne |
| 2-3 | 12-19 Feb. | ANOVA, Factorial ANOVA | M&M Chap. 12-13 | John Nerbonne |
| 4 | 26 Feb. | Repeated Measures | Rietveld/van Hout, Ch. 4.6; Field 13 | John Nerbonne |
| 5-6 | 5-12 Mar. | Regression, Mult. Regr. | M&M 10-11 | John Nerbonne |
| 7 | 19 Mar. | Logistic Regression | Baayen, Ch. 7; Field, Ch. 19 | John Nerbonne |
| | 26 Mar.-9 Apr. | no class (exams) | | |
| 8 | 16 Apr. | Multilevel Regression | Baayen, Ch. 7; Field, Ch. 19 | Çağrı Çöltekin |
| 9 | 23 Apr. | Mixed Effects Models | Baayen, Ch. 7; Field, Ch. 19 | Çağrı Çöltekin |
| 10 | 30 Apr. | Generalized Additive Models | S. Wood '06, GAMs: Intro w. R | Martijn Wieling |
| 11 | 7 May | Information Theory / Morphology | | Lena Rampula |
| | | Multilevel Regression / 2nd Lg. Learning | | Sabrina Sun |
| 12 | 14 May | Mixed Models / Multilingualism | | Kristie James |
| | | Regression / Gender | | Bich Ngoc Do |
| 13 | 21 May | Mixed Models / Focus Contrast | | Amelie la Roi |
| 14 | 28 May | CANCELLED! | | |
| 15 | 4 June | Aphasiology | | Inna Skrynnikova |
| | | Eyetracking | | Jidde Jacobi |
| | | Bilingualism | | Anna Saarloos |
Click on the lecture (etc.) title to see more.
Course Materials 2014
 John Nerbonne's lectures
on various ANOVA & regression models (weeks 2-7).
 Çağrı Çöltekin on
Multilevel Regression.
 Martijn Wieling on
Generalized Additive Models for EEGs.
 Lena Rampula on
Identifying Semitic Roots with Machine Learning.
 Sabrina Sun on
Mixed-Effect Models for predicting 2nd Lg. Learning Success.
 Kristie James on
Errors in English as a Lingua Franca analyzed using Mixed Effects Regression.
 Bich Ngoc Do on
Zero-Inflated Models for Epicene Pronouns.
 Amelia La Roi on
Mixed-Effects Models for Analyzing Focus Stress.
 Anna Saarloos on
Regression Models for Analyzing Influences on Vocabulary Size.
Course Materials 2013
 John Nerbonne's lectures
on various ANOVA & regression models (weeks 2-7).
 Çağrı Çöltekin on Multilevel Regression.
 John Nerbonne on
Permutation tests.
 Jay van Cleef on Reanalysis for ERP.
 Stephen Gilbers on ANOVA & Emotional Speech in bearers of Cochlear Implants.
 Franziska Köder on ANOVA and Pronoun Interpretation.
 Margreet Vogelzang on Mixed
Models and Eyetracking (and ideas on GAMs).
 Rui Qin on Multi-Level
Regression and Early Detection of Dyslexia.
 Ramon Kezer on Factor Analysis and Code Switching.
 Kim Heiligstein on Conditional Entropy and Comprehensibility.
Course Materials 2012
 John Nerbonne's lectures
on various ANOVA & regression models (weeks 2-7).
 Çağrı Çöltekin on Bayesian vs.
Frequentist Statistics.
 Lotte Schott on Repeated
Measures Anova for ERP (EEG) Data
 Martijn Wieling on Mixed Model Regression
for analyzing linguistic variation and for analyzing eyetracking
 Gökhan Akçapınar on Information
Gain as used in constructing decision trees
 HuiPing Chan on Multiple
Regression for Analysing Second Language Vocabulary Learning with
Attention to the AIC and to Cook's Distance
 Matthew Smith on Conditional
Entropy and Mutual Intelligibility
 Lili Szábo on Predicting Vowel Harmony using
Pointwise Mutual Information
 Melanie Hof on Measurement
Reliability
 Marjoleine Sloos on Linear Discriminant
Analysis
 Caroline Morris on
Association Strength used to gauge Language Change
Course Materials 2011
 John Nerbonne's lectures
on various ANOVA & regression models (weeks 2-6).
 Martijn Wieling on
Mixed Models
 J. Nerbonne on Entropy and
Information Theory and the Conditional Entropy
of the phoneme mapping (in Scandinavian)
"semicommunication".
 Stefan Evert's page on statistics and software for
measuring
collocation strength.
 Association Strength talks
 Simon Šuster on
Mutual Information and Collocations
 Laura Handojo
Odds Ratios and Collocations
 Jelke Bloem on
Fisher's Exact Test to Detect Animacy
 Igor Tytyk on
Minimum Sensitivity and Collostructions
 Stefan Evert on
statistical association strength and multiword expressions.
 Repeated Measures vs. Mixed Models
 Connie Lahmann on
'Higher Language Cognition' and Grammaticality Verification
 Laura Bos
Repeated Measures ANOVA & Permutation Statistics
 Ruggero Montalto on
Repeated Measures vs. Mixed Models
 Oscar Strik on the
Akaike Information Criterion
 Dimension Reduction
 Martin Boros
Cluster Analysis and Silhouette Width
 Ke Tran on
Principal
Component Analysis (and Face Recognition!)
 Lubomir Zlatkov on
Multi-Dimensional Scaling
 Kristel Uiboaed on
Correspondence Analysis
 Mona Timmermeister & Caitlin Mignella on
Validating a Pronunciation Difference Measure
 Jurriën Schuurman on
Min F in Psycholinguistics
 Jet Vonk on
Cochran's Q
Course Materials 2010
 Eliza Magaretha on
regression
used to evaluate the quality of inducing pronunciation
distances from empirical data.
 Nynke van der Vliet on
Cohen's κ
used to measure the agreement between annotators of hierarchically
structured material.
 Edgar Weiffenbach on
Log
Linear Models of Contingency used to analyse corpus frequencies.
 Nick Ruiz on
Logistic
Regression used to analyse corpus frequencies.
 Rahmad Mahendra on
Cross Entropy
used to measure model quality in computational analyses.
 Nadine Glas on
(Log) Odds Ratios
used to test the statistical independence of categorical variables (with a
comparison to χ²).
 Seid Tvica on
Ordinal
Regression used to measure comprehensibility of foreign speech.
Course Materials 2009
 J. Nerbonne on Entropy and
Information Theory and the Conditional Entropy
of the phoneme mapping (in Scandinavian)
"semicommunication".

 Thomas Zastrow
Entropy in Dialectometry and P. Nabende on
Cross Entropy and
Model Comparison
 Xuchen Yao
Bayesian vs. Frequentist Approaches to Statistics
 Çağrı Çöltekin
Hierarchical Bayesian Networks as Learning Models
for background reading see Wagenmakers, E.J., Lee, M. D.,
Lodewyckx, T., & Iverson, G. (2008).
Bayesian versus frequentist inference.
In H. Hoijtink, I. Klugkist, and P. A. Boelen (Eds.),
Bayesian Evaluation of Informative Hypotheses, pp. 181-207.
Springer: New York.
 Ma Jianqiang
Permutation Tests and Monte Carlo Sampling
 Jelena Prokic
Clustering and the Bootstrap
 Arjen Versloot
Using Late Medieval Sources for Linguistic Reconstructions
(and regression)
 Gulsen Yilmaz
Using Multiple Regression to Understand Language Attrition
 Harwintha Anjarningsih
Repeated Measures ANOVA applied to ERP data.
 Natalia Ergorova
Repeated Measures ANOVA applied to ERP data, Example II.
 Anja Schüppert
(Binary) Logistic Regression applied to foreign comprehension data.
 Ankelien Schippers
(Multinomial) Logistic Regression applied to historical syntax.
 Karin Beijering
Loglinear Analysis of Contingency Tables applied to historical syntax.
 Ildikó Berzlanovich
Intercoder Agreement in Discourse Analysis (Cohen's κ)
 Myrte Faber
Annotating Turn Competition in Multi-Party Conversations
(Cohen's κ)
 Martijn Wieling
Bipartite Spectral Graph Clustering (applied to dialectal variation)
 Proscovia Olango
Naive Bayes (applied to disambiguation)
Course Materials 2008
 J. Nerbonne on Entropy and
Information Theory and the
Conditional Entropy
of the phoneme mapping (in Scandinavian) "semicommunication".
 E. Rossi on
Normal Distributions and Sampling and on
hypothesis testing and
t-tests
 V. Koukoulioti on
Nonparametric Fallback Tests
 E. Rossi on
Single ANOVA
 Th. Mehotcheva on
Kruskal-Wallis
 H. Loerts on
Multivariate ANOVA and Repeated Measures
 H. Ahmed on
χ² and Fisher's Exact Test
 A. Lobanova on
(Log) Odds Ratios and
Word Order Studies
 J. Nerbonne on regression
and multiple
regression
 N. Haque on Principal
Component Analysis
 S. van Ommen on
Clustering
Course Materials 2007
 J. Nerbonne on regression
and multiple
regression
 B. Szmrecsányi
"Language users as creatures of habit: a corpus-linguistic analysis
of persistence in spoken English"
Corpus Linguistics and Linguistic Theory 1(1): 113-150.
 L. Stowe on Analysis of
Variance, incl. Multiple Analysis of Variance.
 A. Banga and Tam Ho on
Repeated Measures, (ANOVA)
 S. Berends on
Assumptions of ANOVA
 T. Caspi on Windowing,
Correlations, and Dynamic Systems Theory
 Th. Leinonen on
Regression in Phonetics and Computational Modeling
 J. Nerbonne on Multiple
Regression Models.
 V. Baaijen on
Applying Nonparametric Statistics to Analyse Writing (Comparing
Think-Aloud Protocols and Keystroke Logging)
 M. Knippers and R. Montalto on
Dealing with Non-normal Distributions in a Repeated Measures Design
 M. Spruit on
Search for Associations among Variables
 G. Korfiatis on
Principal Components Analysis
 T. Van de Cruys on
Dimensionality Reduction for Similarity Detection
(Singular Value Decomposition, Non-Negative Matrix Factorization)
Course Materials 2006
 Nerbonne on
Factor Analysis
 Wiersma on Permutation Tests
 Zinger on n-gram
models
 Ruffle on
syntactic
differences in Old English
 Van de Cruys on
Latent Semantic
Analysis
 Vasishth on
Mixed
Effects Models
 Moberg on
conditional entropy
used to model comprehensibility.
 Mur on
binomial models, esp. the paired sign test.
 Heeringa on
bootstrapping.
 Ruffle on
Log Odds
 Xiaoyan Xu on
Multivariate
nature of Language Attrition
 Kwant on
"Delphi"
techniques for identifying variables
Course Materials 2005

Nerbonne on χ²

Villada on Association Statistics for
Recognizing Multi-Word Units
 Featherston on Magnitude
Estimation. (Various papers, of which Featherston's "Decathlon
Model" is perhaps the best starting point.)
 Donkers on ANOVA, repeated measures.
 Ruffle and Trofimova on Fisher's Exact Test.
 Bouma on Corpora and Counting.
 Van Noord on Search in Automatically Analysed Corpora.
 Smits and Rossi on Binomial Chances.
 Kremers on Log Odds Ratios.
 Nerbonne on Logistic Regression.
 Deunk on Analysing Qualitative Results via
Multi-Level Regression.
 van der Plas on Clustering.
 van der Beek on Entropy as Measure of Syntactic Influence.
 Fahmi on Identifying Terminology.
Course Materials 2004

Nerbonne on Logistic Regression

Siedle on Hierarchical Cluster Analysis

Hopp on Magnitude Estimation

Lichte on Association Measures

Kootstra on Exploratory Factor Analysis

Rossi on Odds Ratios in Aphasiology
Student Projects
 Melanie Hof's 2012 paper
"Questionnaire Evaluation with Factor Analysis
and Cronbach's Alpha"

Gerrit Jan Kootstra's 2004 project on exploratory Factor Analysis applied to
foreign language learning.