Seminar in Methodology and Statistics
For students in the Linguistics Research Master's Program and
Linguistics PhD students (& others by agreement)
Course under development--more or less permanently!
Spring, 2016 (under construction)
Lecture/Seminar: Tues. 16:00-17:45, Room A902K (little bldg at corner of
Broerstraat & Oude Kijk in 't Jatstraat,
just west of the Academiegebouw.)
R Lab w. Annelot de Rechteren van Hemert: Fri. 9:00-10:45, Let 1313 Multimediazaal
1, beginning Feb. 12! anne.recht@gmail.com
Instructor: John Nerbonne (see site for
email, phone, geographic coordinates and for office hours)
Announcements
Description
The first half of the course will be an introduction to multivariate
statistical techniques commonly used in linguistics and
communications. It will be accompanied by lab
sessions using the statistical package R.
The rest of the course revolves around seminar presentations by
participants. Presentations primarily concern statistical or
methodological issues in the research of the participants. See below for some of the topics presented and
discussed from 2004 through 2014.
Topics for the second half include a selection from permutation
tests; bootstrapping; analysis of variance and analysis of covariance;
regression including multiple regression and hierarchical
(multi-level) regression; dimension reduction techniques including
factor analysis, principal component analysis, multidimensional
scaling and/or latent semantic analysis; analysis of nominal data
including association strength, Cohen's kappa, binomial or multinomial
models, Fisher's exact test, odds ratios, information-theoretically
inspired measures such as pointwise mutual information, or logistic
regression. Other topics have regularly been presented, mostly at the
request of the participants.
Prerequisites
Participants in the course should have completed a basic course in
statistics covering topics such as descriptive statistics and
basic hypothesis testing using z-tests, t-tests, and χ² tests.
Participation in the first semester Research Master course on
statistics and corpus linguistics (Wander Lowie and Gertjan van Noord)
is strongly recommended for anyone who has never taken statistics. You
are required to know the statistics from that course, so if
you have never taken such a course, it is a good basis. The statistics
course given in the European Master's in Clinical Linguistics is
also good.
Requirements
All participants, including auditors not taking the course for
credit, must present at least one 45-minute session (30
min. presentation plus 15 min. discussion and question session) on a
statistical analysis technique. In addition, students taking the
course for credit in the research master must turn in an 8-10
pp. (2,000-2,500 wd.) paper, which may be on the same statistical
analysis technique, but which may also be on another. In the paper
and presentation it is important to embed the discussion in the
analysis of concrete data, to explain how the analysis works, under
what conditions it may be applied, and what its shortcomings are. The
emphasis is on the statistical technique, but the research question
should be explained along with the background theory. Graphical
presentations of the data attempting to illustrate the tendency that
is to be proven or disproven are definitely valuable.
It is fine with me if you turn in a paper and/or presentation
reporting on work for another course as long as the paper turned in
focuses on the statistical analysis. If you want to turn in the
same paper for two courses, make sure that both instructors know this
and agree to it.
Ph.D. candidates from BCN or the Graduate School in the Humanities
have received credit for this course in the past if they presented one
topic (45 min.: 30 + 15) and participated regularly. I assume that
that will continue to be the case.
Books
In general, we will try to use the following:
- Natalia Levshina (2015)
How to do Linguistics with R: Data exploration and
statistical analysis. John Benjamins: Amsterdam.
Good for R tips, R Studio, nice focus on linguistic
problems.
- Andy Field, Jeremy Miles & Zoë Field (2012)
Discovering Statistics using R. Sage: London.
Good for R tips, covers basics OK, informal (wordy) style.
- David S. Moore and George McCabe (1993)
Introduction to the Practice of Statistics 5th edition.
Freeman: New York.
We assume the materials in chapters 1-9, 12, and 14 (subject
of Introductory Statistics).
More advanced chapters such as those on permutation tests,
bootstrapping, or regression models might be subjects of
presentations and discussion. Excellent introduction!
The Moore & McCabe book is in the
library of the faculty of Behavioral and Social Sciences
(Grote Kruisstr. 2/1). At least one copy is kept there and is not lent but
must be used there. Filed under usoc 014D 073 ex.5
- A nice alternative seems to be Alan Agresti & Barbara
Finlay's Statistical Methods for the Social Sciences
4th ed. Pearson: Upper Saddle River, NJ, 2009. I haven't
used it yet, but it has a good selection of material.
- Toni Rietveld and Roeland van Hout (1993) Statistical
Techniques for the Study of Language and Language Behavior.
Mouton De Gruyter: Berlin.
For many years, the text for statistics in linguistics,
and still excellent. But see below.
Available electronically from the University Library in
Groningen!
Other references that have also been found useful are the following:
- Alan Agresti (1996) An Introduction to Categorical Data
Analysis. Wiley: New York.
- Barbara Tabachnick and Linda Fidell (2001) Using Multivariate
Statistics, Pearson: Needham Heights, MA.
Comprehensive, and aimed at those doing analysis, as opposed to those
interested in mathematical underpinnings or in developing
statistical theory further.
- Chris Manning and Hinrich Schütze (1999) Foundations of
Statistical Natural Language Processing, MIT Press: Cambridge, MA.
Focus on computational linguistics, naturally, but lots on
appropriate stats, including statistical modeling, information theory.
- Chris Manning, Prabhakar Raghavan and Hinrich Schütze (2008)
Introduction to Information Retrieval, Cambridge
University Press: Cambridge, UK.
Focus on IR, CL, naturally,
but lots on stats, including singular value decomposition, latent
semantic indexing.
Many also find the more "how-to" books useful. The following books
have been found especially valuable as they provide very practical
instructions for doing analysis in SPSS or R.
- Dennis Howitt and Duncan Cramer (2008) Introduction to SPSS
in Psychology For Version 16 and earlier. 4th ed. Pearson: Essex.
Excellent continuation for topics too difficult for the Field book.
- Harald Baayen (2008) Analyzing Linguistic Data. A Practical
Introduction to Statistics using R. Cambridge University Press:
Cambridge. There is also
an online version available.
The book by Baayen may be the best book ever written
on linguistic statistics. Especially if you are using large
data sets (corpus frequencies), R is the way to go.
- Keith Johnson (2008) Quantitative Methods in Linguistics.
There is an online version as well.
I confess that I still haven't read this (1/2015) through, but I've
read sections, and based on those and on Johnson's
work in general, I expect it to be good. Like Baayen's, this book
is R-based.
The books are on reserve, most at the Letteren library, reserve shelf.
Please note that books on reserve at the Letteren library that
normally belong there are not moved to the reserve shelf.
Instead, they're kept at their normal places (use the catalogue), but
are on reserve and cannot be loaned out.
Schedule for Seminars/Lectures 2016 (Tentative!)
The schedule for Spring 2016. Meetings
are Tues. 4-6 pm in A902
Week | Date | Theme | Readings | Leader |
1 | 9 Feb. | Organizational | | John Nerbonne |
2 | 16 Feb. | ANOVA, Factorial ANOVA | Levshina, Ch.8.2-3 | John Nerbonne |
3 | 23 Feb. | Repeated Measures | Levshina, Ch.8.4; Rietveld/van Hout, Ch.4.6 | John Nerbonne |
4 | 1 Mar. | Simple Linear Regression | Levshina, Ch.6 or Field 6 | |
5 | 8 Mar. | Mult. Regr. | Levshina, Ch.7 or Field 7 | John Nerbonne |
6 | 15 Mar. | Logistic Regression | Levshina, Ch.12; Field, Ch.8 | John Nerbonne |
 | 21/3-8/4 | Exam period | no meetings | |
7 | 12 Apr. | Mixed effects Models | Baayen, Ch.7 | John Nerbonne |
8 | 19 Apr. | Quantifier Interpretation | Mixed Effects Log. Regr. | Isolde van Dorst |
9 | 26 Apr. | Final Voicing in Whisper | Rep. Meas. ANOVA | Marita Everhardt |
 | | Final Voicing, cont. | Mixed Effects Regr. | Liqin Zhang |
10 | 3 May | Permutation Tests | Moore & McCabe, Ch. 14 | John Nerbonne |
 | | Bootstrap Sampling | Moore & McCabe, Ch. 14 | John Nerbonne |
11 | 10 May | Instructor absent | PhD defense, Freiburg | |
12 | 17 May | Code Switching | Logistic Regression | Masha Medvedeva |
13 | 24 May | Chat-like dialogues | Logistic Regression | Lotte Verheijen |
13 | 24 May | Chat-like dialogue | Mixed Effects Log. Regr. | Guanghao You |
Click on the lecture (etc.) title to see more.
Course Materials 2016
- John Nerbonne's lectures
on various ANOVA & regression models (weeks 2-7).
- John Nerbonne presenting Martijn Wieling's sheets on
Mixed
Effects Regression.
- Isolde van Dorst on analyzing
Quantifier Interpretations using Mixed Effect Regression
- Marita Everhardt on
Analyzing final voicing in whispered speech using Factorial ANOVA
- Liqin Zhang on
Mixed Effects Modeling of Final Devoicing in Whispered Speech (with the
same data as in Marita Everhardt's presentation; see above).
- John Nerbonne on
Permutation
Testing.
- John Nerbonne presenting
Bootstrap clustering and
noisy clustering,
also using Jelena Prokić's sheets on
Clustering and the Bootstrap.
- Masha Medvedeva on
Predicting Code
Switches in Udmurt/Russian (using logistic regression).
- Lotte Verheijen on
Logistic Regression to analyze chat-like dialogues.
- Guanghao You on
Mixed Effects Logistic Regression to analyze chat-like dialogues.
Course Materials 2015
- John Nerbonne's lectures
on various ANOVA & regression models (weeks 2-7).
- Martijn Wieling on
Mixed
Effects Regression
- Marieke Engbrenghof on
Mixed Design
Model for the Acquisition of English Vocabulary
- Ingemarie Donker on
Predictors of
Disfluency Markers in First Language Attrition
- Nienke Hoeksema on
Repeated
Measures ANOVA for Reaction Time and Accuracy Data
- Esther van der Berg on
Logistic Regression
to Analyze Language Change
- Elena Badmaeva on
Logistic Regression
to Analyze the Machine Learning of Russian Diminutive Formation
- J. Nerbonne on Entropy and
Information Theory and the Conditional Entropy
of the phoneme mapping (in Scandinavian)
"semi-communication".
- Alicia Krebs on
Corpus
Linguistics: Analysing Word Frequencies
- Annelot de Rechteren van Hemert on
Support Vector Machines: Eye movement classification in L1/L2
Syntactic Processing
Course Materials 2014
- John Nerbonne's lectures
on various ANOVA & regression models (weeks 2-7).
- Çagrı Çöltekin on
Multilevel Regression.
- Martijn Wieling on
Generalized Additive Models for EEGs.
- Lena Rampula on
Identifying Semitic Roots with Machine Learning.
- Sabrina Sun on
Mixed-Effect Models for Predicting 2nd Lg. Learning Success.
- Kristie James on
Errors in English as a Lingua Franca Analyzed using Mixed Effects Regression.
- Bich Ngoc Do on
Zero-Inflated Models for Epicene Pronouns.
- Amelia La Roi on
Mixed-Effects Models for Analyzing Focus Stress.
- Anna Saarloos on
Regression Models for Analyzing Influences on Vocabulary Size.
Course Materials 2013
- John Nerbonne's lectures
on various ANOVA & regression models (weeks 2-7).
- Çagrı Çöltekin on Multilevel Regression.
- John Nerbonne on
Permutation tests.
- Jay van Cleef on Re-analysis for ERP.
- Stephen Gilbers on ANOVA & Emotional Speech in bearers of Cochlear Implants.
- Franziska Köder on ANOVA and Pronoun Interpretation.
- Margreet Vogelzang on Mixed
Models and Eyetracking (and ideas on GAMs).
- Rui Qin on Multi-Level
Regression and Early Detection of Dyslexia.
- Ramon Kezer on Factor Analysis and Code Switching.
- Kim Heiligstein on Conditional Entropy and Comprehensibility.
Course Materials 2012
- John Nerbonne's lectures
on various ANOVA & regression models (weeks 2-7).
- Çagrı Çöltekin on Bayesian vs.
Frequentist Statistics.
- Lotte Schott on Repeated
Measures Anova for ERP (EEG) Data
- Martijn Wieling on Mixed Model Regression
for analyzing linguistic variation and for analyzing eye-tracking
- Gökhan Akçapınar on Information
Gain as used in constructing decision trees
- HuiPing Chan on Multiple
Regression for Analysing Second Language Vocabulary Learning with
Attention to the AIC and to Cook's Distance
- Matthew Smith on Conditional
Entropy and Mutual Intelligibility
- Lili Szábo on Predicting Vowel Harmony using
Pointwise Mutual Information
- Melanie Hof on Measurement
Reliability
- Marjoleine Sloos on Linear Discriminant
Analysis
- Caroline Morris on
Association Strength used to gauge Language Change
Course Materials 2011
- John Nerbonne's lectures
on various ANOVA & regression models (weeks 2-6).
- Martijn Wieling on
Mixed Models
- J. Nerbonne on Entropy and
Information Theory and the Conditional Entropy
of the phoneme mapping (in Scandinavian)
"semi-communication".
- Stefan Evert's page on statistics and software for
measuring
collocation strength.
- Association Strength talks
- Simon Šuster on
Mutual Information and Collocations
- Laura Handojo
Odds Ratios and Collocations
- Jelke Bloem on
Fisher's Exact Test to Detect Animacy
- Igor Tytyk on
Minimum Sensitivity and Collostructions
- Stefan Evert on
statistical association strength and multi-word expressions.
- Repeated Measures vs. Mixed Models
- Connie Lahmann on
'Higher Language Cognition' and Grammaticality Verification
- Laura Bos
Repeated Measures ANOVA & Permutation Statistics
- Ruggero Montalto on
Repeated Measures vs. Mixed Models
- Oscar Strik on the
Akaike Information Criterion
- Dimension Reduction
- Martin Boros
Cluster Analysis and Silhouette Width
- Ke Tran on
Principal
Component Analysis (and Face Recognition!)
- Lubomir Zlatkov on
Multi-Dimensional Scaling
- Kristel Uiboaed on
Correspondence Analysis
- Mona Timmermeister & Caitlin Mignella on
Validating a Pronunciation Difference Measure
- Jurriën Schuurman on
Min F in Psycholinguistics
- Jet Vonk on
Cochran's Q
Course Materials 2010
- Eliza Magaretha on
regression
used to evaluate the quality of inducing pronunciation
distances from empirical data.
- Nynke van der Vliet on
Cohen's κ
used to measure the agreement between annotators of hierarchically
structured material.
- Edgar Weiffenbach on
Log
Linear Models of Contingency used to analyse corpus frequencies.
- Nick Ruiz on
Logistic
Regression used to analyse corpus frequencies.
- Rahmad Mahendra on
Cross Entropy
used to measure model quality in computational analyses.
- Nadine Glas on
(Log) Odds Ratios
used to test the statistical independence of categorical variables (with a
comparison to χ²).
- Seid Tvica on
Ordinal
Regression used to measure comprehensibility of foreign speech.
Course Materials 2009
- J. Nerbonne on Entropy and
Information Theory and the Conditional Entropy
of the phoneme mapping (in Scandinavian)
"semi-communication".
- Thomas Zastrow
Entropy in Dialectometry and P. Nabende on
Cross Entropy and
Model Comparison
- Xuchen Yao
Bayesian vs. Frequentist Approaches to Statistics
- Çagri Çöltekin
Hierarchical Bayesian Networks as Learning Models
for background reading see Wagenmakers, E.-J., Lee, M. D.,
Lodewyckx, T., & Iverson, G. (2008).
Bayesian versus frequentist inference.
In H. Hoijtink, I. Klugkist, and P. A. Boelen (Eds.),
Bayesian Evaluation of Informative Hypotheses, pp. 181-207.
Springer: New York.
- Ma Jianqiang
Permutation Tests and Monte Carlo Sampling
- Jelena Prokic
Clustering and the Bootstrap
- Arjen Versloot
Using Late Medieval Sources for Linguistic Reconstructions
(and regression)
- Gulsen Yilmaz
Using Multiple Regression to Understand Language Attrition
- Harwintha Anjarningsih
Repeated Measures ANOVA applied to ERP data.
- Natalia Ergorova
Repeated Measures ANOVA applied to ERP data, Example II.
- Anja Schüppert
(Binary) Logistic Regression applied to foreign comprehension data.
- Ankelien Schippers
(Multinomial) Logistic Regression applied to historical syntax.
- Karin Beijering
Loglinear Analysis of Contingency Tables applied to historical syntax.
- Ildikó Berzlanovich
Intercoder Agreement in Discourse Analysis (Cohen's κ)
- Myrte Faber
Annotating Turn Competition in Multi-Party Conversations
(Cohen's κ)
- Martijn Wieling
Bipartite Spectral Graph Clustering (applied to dialectal variation)
- Proscovia Olango
Naive Bayes (applied to disambiguation)
Course Materials 2008
- J. Nerbonne on Entropy and
Information Theory and the
Conditional Entropy
of the phoneme mapping (in Scandinavian) "semi-communication".
- E. Rossi on
Normal Distributions and Sampling and on
hypothesis testing and
t-tests
- V. Koukoulioti on
Nonparametric Fallback Tests
- E. Rossi on
Single ANOVA
- Th. Mehotcheva on
Kruskal-Wallis
- H. Loerts on
Multivariate ANOVA and Repeated Measures
- H. Ahmed on
χ² and Fisher's Exact Test
- A. Lobanova on
(Log) Odds Ratios and
Word Order Studies
- J. Nerbonne on regression
and multiple
regression
- N. Haque on Principal
Component Analysis
- S. van Ommen on
Clustering
Course Materials 2007
- J. Nerbonne on regression
and multiple
regression
- B. Szmrecsányi
"Language users as creatures of habit: a corpus-linguistic analysis
of persistence in spoken English"
Corpus Linguistics and Linguistic Theory 1(1): 113-150.
- L. Stowe on Analysis of
Variance, incl. Multiple Analysis of Variance.
- A. Banga and Tam Ho on
Repeated Measures, (ANOVA)
- S. Berends on
Assumptions of ANOVA
- T. Caspi on Windowing,
Correlations, and Dynamic Systems Theory
- Th. Leinonen on
Regression in Phonetics and Computational Modeling
- J. Nerbonne on Multiple
Regression Models.
- V. Baaijen on
Applying Nonparametric Statistics to Analyse Writing (Comparing
Think-Aloud Protocols and Keystroke Logging)
- M. Knippers and R. Montalto on
Dealing with Nonnormal Distributions in a Repeated Measures Design
- M. Spruit on
Search for Associations among Variables
- G. Korfiatis on
Principal Components Analysis
- T. Van de Cruys on
Dimensionality Reduction for Similarity Detection
(Singular Value Decomposition, Non-Negative Matrix Factorization)
Course Materials 2006
- Nerbonne on
Factor Analysis
- Wiersma on Permutation Tests
- Zinger on n-gram
models
- Ruffle on
syntactic
differences in Old English
- Van de Cruys on
Latent Semantic
Analysis
- Vasishth on
Mixed
Effects Models
- Moberg on
conditional entropy
used to model comprehensibility.
- Mur on
binomial models, esp. the paired sign test.
- Heeringa on
bootstrapping.
- Ruffle on
Log Odds
- Xiaoyan Xu on
Multivariate
nature of Language Attrition
- Kwant on
"Delphi"
techniques for identifying variables
Course Materials 2005
- Nerbonne on χ²
- Villada on Association Statistics for
Recognizing Multi-Word Units
- Featherston on Magnitude
Estimation. (Various papers, of which Featherston's "Decathlon
Model" is perhaps the best starting point.)
- Donkers on ANOVA, repeated measures.
- Ruffle and Trofimova on Fisher's Exact Test.
- Bouma on Corpora and Counting.
- Van Noord on Search in Automatically Analysed Corpora.
- Smits and Rossi on Binomial Chances.
- Kremers on Log Odds Ratios.
- Nerbonne on Logistic Regression.
- Deunk on Analysing Qualitative Results via
Multi-Level Regression.
- van der Plas on Clustering.
- van der Beek on Entropy as Measure of Syntactic Influence.
- Fahmi on Identifying Terminology.
Course Materials 2004
- Nerbonne on Logistic Regression
- Siedle on Hierarchical Cluster Analysis
- Hopp on Magnitude Estimation
- Lichte on Association Measures
- Kootstra on Exploratory Factor Analysis
- Rossi on Odds Ratios in Aphasiology
Student Projects
- Melanie Hof's 2012 paper
"Questionnaire Evaluation with Factor Analysis
and Cronbach's Alpha"
- Gerrit Jan Kootstra's 2004 project on exploratory Factor Analysis applied to
foreign language learning.