Seminar in Methodology and Statistics
For students in the Linguistics Research Master's Program and
Linguistics PhD students (& others by agreement)
Course under development, more or less permanently!
Spring, 2016 (under construction)
Lecture/Seminar: Tues. 16:00-17:45, Room A902K (little bldg at corner of
Broerstraat & Oude Kijk in Jatstraat,
just west of the Academiegebouw.)
R Lab w. Annelot de Rechteren van Hemert: Fri. 9:00-10:45, Let 1313 Multimediazaal
1, beginning Feb. 12! anne.recht@gmail.com
Instructor: John Nerbonne (see site for
email, phone, geographic coordinates and for office hours)
Announcements
Description
The first half of the course will be an introduction to multivariate
statistical techniques commonly used in linguistics and
communications. It will be accompanied by lab
sessions using the statistical package R.
The rest of the course revolves around seminar presentations by
participants. Presentations primarily concern statistical or
methodological issues in the research of the participants. See below for some of the topics presented and
discussed from 2004 through 2014.
Topics for the second half include a selection from permutation
tests; bootstrapping; analysis of variance and analysis of covariance;
regression including multiple regression and hierarchical
(multilevel) regression; dimension reduction techniques including
factor analysis, principal component analysis, multidimensional
scaling and/or latent semantic analysis; analysis of nominal data
including association strength, Cohen's kappa, binomial or multinomial
models, Fisher's exact test, odds ratios, information-theoretically
inspired measures such as pointwise mutual information, or logistic
regression. Other topics have regularly been presented, mostly at the
request of the participants.
Prerequisites
Participants in the course should have completed a basic course in
statistics covering topics such as descriptive statistics but also
basic hypothesis testing using z-tests, t-tests, and χ².
Participation in the first semester Research Master course on
statistics and corpus linguistics (Wander Lowie and Gertjan van Noord)
is strongly recommended for anyone who's never taken statistics. It is
required that you know the statistics from that course, so if
you've never taken such a course, that's a good basis. The statistics
course given in the European Master's in Clinical Linguistics is
also good.
All participants, including auditors not taking the course for
credit, must present at least one hour-long session (30
min. presentation plus 15 min. discussion and question session) on a
statistical analysis technique. In addition students taking the
course for credit in the research master must turn in an 8-10
pp. (2,000-2,500 wd.) paper, which may be on the same statistical
analysis technique, but which may also be on another. In the paper
and presentation it is important to embed the discussion in the
analysis of concrete data, to explain how the analysis works, under
what conditions it may be applied, and what its shortcomings are. The
emphasis is on the statistical technique, but the research question
should be explained along with the background theory. Graphical
presentations of the data attempting to illustrate the tendency that
is to be proven or disproven are definitely valuable.
It is fine with me if you turn in a paper and/or presentation
reporting on work for another course as long as the paper turned in
focuses on the statistical analysis. If you want to turn in the
same paper for two courses, make sure that both instructors know this
and agree to it.
Ph.D. candidates from BCN or the Graduate School in the Humanities
have received credit for this course in the past if they presented one
topic (45 min.: 30 + 15) and participated regularly. I assume that
that will continue to be the case.
Books
In general, we will try to use the following:
 Natalia Levshina (2015)
How to do Linguistics with R: Data exploration and
statistical analysis. John Benjamins: Amsterdam.
Good for R tips, R Studio, nice focus on linguistic
problems.
 Andy Field, Jeremy Miles & Zoë Field (2012)
Discovering Statistics using R. Sage: London.
Good for R tips, covers basics OK, informal (wordy) style.
 David S. Moore and George McCabe (1993)
Introduction to the Practice of Statistics 5th edition.
Freeman: New York.
We assume the materials in chapters 1-9, 12, and 14 (subject
of Introductory Statistics).
More advanced chapters such as those on permutation tests,
bootstrapping, or regression models might be subjects of
presentations and discussion. Excellent introduction!
The Moore & McCabe book is in the
library of the faculty of Behavioral and Social Sciences
(Grote Kruisstr. 2/1). At least one copy is kept there and is not lent but
must be used there. Filed under usoc 014D 073 ex.5
 A nice alternative seems to be Alan Agresti & Barbara
Finlay's Statistical Methods for the Social Sciences
4th ed. Pearson: Upper Saddle River, NJ, 2009. I haven't
used it yet, but it has a good selection of material.
 Toni Rietveld and Roeland van Hout (1993) Statistical
Techniques for the Study of Language and Language Behavior.
Mouton De Gruyter: Berlin.
For many years, the text for statistics in linguistics,
and still excellent. But see below.
Available electronically from the University Library in
Groningen!
Other references that have also been found useful are the following:
 Alan Agresti (1996) An Introduction to Categorical Data
Analysis. Wiley: New York.
 Barbara Tabachnick and Linda Fidell (2001) Using Multivariate
Statistics, Pearson: Needham Heights, MA.
Comprehensive, and aimed at those doing analysis, as opposed to those
interested in mathematical underpinnings or in developing
statistical theory further.
 Chris Manning and Hinrich Schütze (1999) Foundations of
Statistical Natural Language Processing, MIT Press: Cambridge, MA.
Focus on computational linguistics, naturally, but lots on
appropriate stats, including statistical modeling, information theory.
 Chris Manning, Prabhakar Raghavan and Hinrich Schütze (2008)
Introduction to Information Retrieval, Cambridge
University Press: Cambridge, UK
Focus on IR, CL, naturally,
but lots on stats, including singular value decomposition, latent
semantic indexing.
Many also find the more "how-to" books useful. The following books
have been found especially valuable as they provide very practical
instructions for doing analysis in SPSS or R.
 Dennis Howitt and Duncan Cramer (2008) Introduction to SPSS
in Psychology For Version 16 and earlier. 4th ed. Pearson: Essex.
Excellent continuation for topics too difficult for the Field book.
 Harald Baayen (2008) Analyzing Linguistic Data. A Practical
Introduction to Statistics using R. Cambridge University Press:
Cambridge. There is also an online version available.
The book by Baayen may be the best book ever written
on linguistic statistics. Especially if you are using large
data sets (corpus frequencies), R is the way to go.
 Keith Johnson (2008) Quantitative Methods in Linguistics.
There is an online version as well.
I confess that I still haven't read this (1/2015) through, but I've
read sections, and based on those and on Johnson's
work in general, I expect it to be good. Like Baayen's, this book
is R-based.
The books are on reserve, most at the Letteren library, reserve shelf.
Please note that books on reserve at the Letteren library that
normally belong there are not moved to the reserve shelf.
Instead, they're kept at their normal places (use the catalogue), but
are on reserve and cannot be loaned out.
Schedule for Seminars/Lectures 2016 (Tentative!)
The schedule for Spring 2016. Meetings
are Tues. 4-6 pm in A902
Week  Date       Theme                      Readings                                       Leader
1     9 Feb.     Organizational                                                            John Nerbonne
2     16 Feb.    ANOVA, Factorial ANOVA     Levshina, Ch. 8.2-3                            John Nerbonne
3     23 Feb.    Repeated Measures          Levshina, Ch. 8.4; Rietveld/van Hout, Ch. 4.6  John Nerbonne
4     1 Mar.     Simple Linear Regression   Levshina, Ch. 6 or Field 6
5     8 Mar.     Mult. Regr.                Levshina, Ch. 7 or Field 7                     John Nerbonne
6     15 Mar.    Logistic Regression        Levshina, Ch. 12; Field, Ch. 8                 John Nerbonne
      21/3-8/4   Exam period                no meetings
7     12 Apr.    Mixed effects Models       Baayen, Ch. 7                                  John Nerbonne
8     19 Apr.    Quantifier Interpretation  Mixed Effects Log. Regr.                       Isolde van Dorst
9     26 Apr.    Final Voicing in Whisper   Rep. Meas. ANOVA                               Marita Everhardt
                 Final Voicing, cont.       Mixed Effects Regr.                            Liqin Zhang
10    3 May      Permutation Tests          Moore & McCabe, Ch. 14                         John Nerbonne
                 Bootstrap Sampling         Moore & McCabe, Ch. 14                         John Nerbonne
11    10 May     Instructor absent          PhD defense, Freiburg
12    17 May     Code Switching             Logistic Regression                            Masha Medvedeva
13    24 May     Chat-like dialogues        Logistic Regression                            Lotte Verheijen
13    24 May     Chat-like dialogue         Mixed Effects Log. Regr.                       Guanghao You
Click on the lecture (etc.) title to see more.
Course Materials 2016
 John Nerbonne's lectures
on various ANOVA & regression models (weeks 2-7).
 John Nerbonne presenting Martijn Wieling's sheets on
Mixed
Effects Regression.
 Isolde van Dorst on analyzing
Quantifier Interpretations using Mixed Effect Regression
 Marita Everhardt on
Analyzing final voicing in whispered speech using Factorial ANOVA
 Liqin Zhang on
Mixed Effects Modeling of Final Devoicing in Whispered Speech (with
the same data as in Marita Everhardt's presentation; see above).
 John Nerbonne on
Permutation
Testing.
 John Nerbonne presenting
Bootstrap clustering and
noisy clustering,
also using Jelena Prokić's sheets on
Clustering and the Bootstrap.
 Masha Medvedeva on
Predicting Code
Switches in Udmurt/Russian (using logistic regression).
 Guanghao You on
Mixed Effects Logistic Regression to analyze chat-like dialogues.
Course Materials 2015
 John Nerbonne's lectures
on various ANOVA & regression models (weeks 2-7).
 Martijn Wieling on
Mixed
Effects Regression
 Marieke Engbrenghof on
Mixed Design
Model for the Acquisition of English Vocabulary
 Ingemarie Donker on
Predictors of
Disfluency Markers in First Language Attrition
 Nienke Hoeksema on
Repeated
Measures ANOVA for Reaction Time and Accuracy Data
 Esther van der Berg on
Logistic Regression
to Analyze Language Change
 Elena Badmaeva on
Logistic Regression
to Analyze the Machine Learning of Russian Diminutive Formation
 J. Nerbonne on Entropy and
Information Theory and the Conditional Entropy
of the phoneme mapping (in Scandinavian)
"semi-communication".
 Alicia Krebs on
Corpus
Linguistics: Analysing Word Frequencies
 Annelot de Rechteren van Hemert on
Support Vector Machines: Eye movement classification in L1/L2
Syntactic Processing
Course Materials 2014
 John Nerbonne's lectures
on various ANOVA & regression models (weeks 2-7).
 Çagrı Çöltekin on
Multilevel Regression.
 Martijn Wieling on
Generalized Additive Models for EEGs.
 Lena Rampula on
Identifying Semitic Roots with Machine Learning.
 Sabrina Sun on
Mixed-Effect Models for Predicting 2nd Lg. Learning Success.
 Kristie James on
Errors in English as a Lingua Franca Analyzed using Mixed Effects Regression.
 Bich Ngoc Do on
Zero-Inflated Models for Epicene Pronouns.
 Amelia La Roi on
Mixed-Effects Models for Analyzing Focus Stress.
 Anna Saarloos on
Regression Models for Analyzing Influences on Vocabulary Size.
Course Materials 2013
 John Nerbonne's lectures
on various ANOVA & regression models (weeks 2-7).
 Çagrı Çöltekin on Multilevel Regression.
 John Nerbonne on
Permutation tests.
 Jay van Cleef on Reanalysis for ERP.
 Stephen Gilbers on ANOVA & Emotional Speech in bearers of Cochlear Implants.
 Franziska Köder on ANOVA and Pronoun Interpretation.
 Margreet Vogelzang on Mixed
Models and Eyetracking (and ideas on GAMs).
 Rui Qin on Multi-Level
Regression and Early Detection of Dyslexia.
 Ramon Kezer on Factor Analysis and Code Switching.
 Kim Heiligstein on Conditional Entropy and Comprehensibility.
Course Materials 2012
 John Nerbonne's lectures
on various ANOVA & regression models (weeks 2-7).
 Çagrı Çöltekin on Bayesian vs.
Frequentist Statistics.
 Lotte Schott on Repeated
Measures Anova for ERP (EEG) Data
 Martijn Wieling on Mixed Model Regression
for analyzing linguistic variation and for analyzing eyetracking
 Gökhan Akçapınar on Information
Gain as used in constructing decision trees
 Hui-Ping Chan on Multiple
Regression for Analysing Second Language Vocabulary Learning with
Attention to the AIC and to Cook's Distance
 Matthew Smith on Conditional
Entropy and Mutual Intelligibility
 Lili Szábo on Predicting Vowel Harmony using
Pointwise Mutual Information
 Melanie Hof on Measurement
Reliability
 Marjoleine Sloos on Linear Discriminant
Analysis
 Caroline Morris on
Association Strength used to gauge Language Change
Course Materials 2011
 John Nerbonne's lectures
on various ANOVA & regression models (weeks 2-6).
 Martijn Wieling on
Mixed Models
 J. Nerbonne on Entropy and
Information Theory and the Conditional Entropy
of the phoneme mapping (in Scandinavian)
"semi-communication".
 Stefan Evert's page on statistics and software for
measuring
collocation strength.
 Association Strength talks
 Simon Šuster on
Mutual Information and Collocations
 Laura Handojo
Odds Ratios and Collocations
 Jelke Bloem on
Fisher's Exact Test to Detect Animacy
 Igor Tytyk on
Minimum Sensitivity and Collostructions
 Stefan Evert on
statistical association strength and multiword expressions.
 Repeated Measures vs. Mixed Models
 Connie Lahmann on
'Higher Language Cognition' and Grammaticality Verification
 Laura Bos
Repeated Measures ANOVA & Permutation Statistics
 Ruggero Montalto on
Repeated Measures vs. Mixed Models
 Oscar Strik on the
Akaike Information Criterion
 Dimension Reduction
 Martin Boros
Cluster Analysis and Silhouette Width
 Ke Tran on
Principal
Component Analysis (and Face Recognition!)
 Lubomir Zlatkov on
Multi-Dimensional Scaling
 Kristel Uiboaed on
Correspondence Analysis
 Mona Timmermeister & Caitlin Mignella on
Validating a Pronunciation Difference Measure
 Jurriën Schuurman on
Min F in Psycholinguistics
 Jet Vonk on
Cochran's Q
Course Materials 2010
 Eliza Margaretha on
regression
used to evaluate the quality of inducing pronunciation
distances from empirical data.
 Nynke van der Vliet on
Cohen's κ
used to measure the agreement between annotators of hierarchically
structured material.
 Edgar Weiffenbach on
Log
Linear Models of Contingency used to analyse corpus frequencies.
 Nick Ruiz on
Logistic
Regression used to analyse corpus frequencies.
 Rahmad Mahendra on
Cross Entropy
used to measure model quality in computational analyses.
 Nadine Glas on
(Log) Odds Ratios
used to test the statistical independence of categorical variables (with a
comparison to χ²).
 Seid Tvica on
Ordinal
Regression used to measure comprehensibility of foreign speech.
Course Materials 2009
 J. Nerbonne on Entropy and
Information Theory and the Conditional Entropy
of the phoneme mapping (in Scandinavian)
"semi-communication".

 Thomas Zastrow
Entropy in Dialectometry and P. Nabende on
Cross Entropy and
Model Comparison
 Xuchen Yao
Bayesian vs. Frequentist Approaches to Statistics
 Çagrı Çöltekin
Hierarchical Bayesian Networks as Learning Models
for background reading see Wagenmakers, E.-J., Lee, M. D.,
Lodewyckx, T., & Iverson, G. (2008).
Bayesian versus frequentist inference.
In H. Hoijtink, I. Klugkist, and P. A. Boelen (Eds.),
Bayesian Evaluation of Informative Hypotheses, pp. 181-207.
Springer: New York.
 Ma Jianqiang
Permutation Tests and Monte Carlo Sampling
 Jelena Prokić
Clustering and the Bootstrap
 Arjen Versloot
Using Late Medieval Sources for Linguistic Reconstructions
(and regression)
 Gulsen Yilmaz
Using Multiple Regression to Understand Language Attrition
 Harwintha Anjarningsih
Repeated Measures ANOVA applied to ERP data.
 Natalia Ergorova
Repeated Measures ANOVA applied to ERP data, Example II.
 Anja Schüppert
(Binary) Logistic Regression applied to foreign comprehension data.
 Ankelien Schippers
(Multinomial) Logistic Regression applied to historical syntax.
 Karin Beijering
Loglinear Analysis of Contingency Tables applied to historical syntax.
 Ildikó Berzlanovich
Intercoder Agreement in Discourse Analysis (Cohen's κ)
 Myrte Faber
Annotating Turn Competition in Multi-Party Conversations
(Cohen's κ)
 Martijn Wieling
Bipartite Spectral Graph Clustering (applied to dialectal variation)
 Proscovia Olango
Naive Bayes (applied to disambiguation)
Course Materials 2008
 J. Nerbonne on Entropy and
Information Theory and the
Conditional Entropy
of the phoneme mapping (in Scandinavian) "semi-communication".
 E. Rossi on
Normal Distributions and Sampling and on
hypothesis testing and
t-tests
 V. Koukoulioti on
Nonparametric Fallback Tests
 E. Rossi on
Single ANOVA
 Th. Mehotcheva on
Kruskal-Wallis
 H. Loerts on
Multivariate ANOVA and Repeated Measures
 H. Ahmed on
χ² and Fisher's Exact Test
 A. Lobanova on
(Log) Odds Ratios and
Word Order Studies
 J. Nerbonne on regression
and multiple
regression
 N. Haque on Principal
Component Analysis
 S. van Ommen on
Clustering
Course Materials 2007
 J. Nerbonne on regression
and multiple
regression
 B. Szmrecsányi
"Language users as creatures of habit: a corpus-linguistic analysis
of persistence in spoken English"
Corpus Linguistics and Linguistic Theory 1(1): 113-150.
 L. Stowe on Analysis of
Variance, incl. Multiple Analysis of Variance.
 A. Banga and Tam Ho on
Repeated Measures (ANOVA)
 S. Berends on
Assumptions of ANOVA
 T. Caspi on Windowing,
Correlations, and Dynamic Systems Theory
 Th. Leinonen on
Regression in Phonetics and Computational Modeling
 J. Nerbonne on Multiple
Regression Models.
 V. Baaijen on
Applying Nonparametric Statistics to Analyse Writing (Comparing
Think-Aloud Protocols and Keystroke Logging)
 M. Knippers and R. Montalto on
Dealing with Non-normal Distributions in a Repeated Measures Design
 M. Spruit on
Search for Associations among Variables
 G. Korfiatis on
Principal Components Analysis
 T. Van de Cruys on
Dimensionality Reduction for Similarity Detection
(Singular Value Decomposition, Non-Negative Matrix Factorization)
Course Materials 2006
 Nerbonne on
Factor Analysis
 Wiersma on Permutation Tests
 Zinger on n-gram
models
 Ruffle on
syntactic
differences in Old English
 Van de Cruys on
Latent Semantic
Analysis
 Vasishth on
Mixed
Effects Models
 Moberg on
conditional entropy
used to model comprehensibility.
 Mur on
binomial models, esp. the paired sign test.
 Heeringa on
bootstrapping.
 Ruffle on
Log Odds
 Xiaoyan Xu on
Multivariate
nature of Language Attrition
 Kwant on
"Delphi"
techniques for identifying variables
Course Materials 2005

Nerbonne on χ²

Villada on Association Statistics for
Recognizing Multi-Word Units
 Featherston on Magnitude
Estimation. (Various papers, of which Featherston's "Decathlon
Model" is perhaps the best starting point.)
 Donkers on ANOVA, repeated measures.
 Ruffle and Trofimova on Fisher's Exact Test.
 Bouma on Corpora and Counting.
 Van Noord on Search in Automatically Analysed Corpora.
 Smits and Rossi on Binomial Chances.
 Kremers on Log Odds Ratios.
 Nerbonne on Logistic Regression.
 Deunk on Analysing Qualitative Results via
Multi-Level Regression.
 van der Plas on Clustering.
 van der Beek on Entropy as Measure of Syntactic Influence.
 Fahmi on Identifying Terminology.
Course Materials 2004

Nerbonne on Logistic Regression

Siedle on Hierarchical Cluster Analysis

Hopp on Magnitude Estimation

Lichte on Association Measures

Kootstra on Exploratory Factor Analysis

Rossi on Odds Ratios in Aphasiology
Student Projects
 Melanie Hof's 2012 paper
"Questionnaire Evaluation with Factor Analysis
and Cronbach's Alpha"

Gerrit Jan Kootstra's 2004 project on exploratory Factor Analysis applied to
foreign language learning.