Seminar in Methodology and Statistics

For students in the Linguistics Research Master's Program and Linguistics PhD students (& others by agreement)

Course under development--more or less permanently!

Spring, 2016 (under construction)

Lecture/Seminar: Tues. 16:00-17:45, Room A902K (little bldg at corner of Broerstraat & Oude Kijk in Jatstraat, just west of the Academiegebouw.)
R Lab with Annelot de Rechteren van Hemert: Fri. 9:00-10:45, Let 1313 Multimediazaal 1, beginning Feb. 12!

Instructor: John Nerbonne (see site for email, phone, geographic coordinates and for office hours)



The first half of the course will be an introduction to multivariate statistical techniques commonly used in linguistics and communications. It will be accompanied by lab sessions using the statistical package R. The rest of the course revolves around seminar presentations by participants. Presentations primarily concern statistical or methodological issues in the research of the participants. See below for some of the topics presented and discussed from 2004 through 2015.

Topics for the second half include a selection from permutation tests; bootstrapping; analysis of variance and analysis of covariance; regression including multiple regression and hierarchical (multi-level) regression; dimension reduction techniques including factor analysis, principal component analysis, multidimensional scaling and/or latent semantic analysis; analysis of nominal data including association strength, Cohen's kappa, binomial or multinomial models, Fisher's exact test, odds ratios, information-theoretically inspired measures such as pointwise mutual information, or logistic regression. Other topics have regularly been presented, mostly at the request of the participants.


Participants in the course should have completed a basic course in statistics covering topics such as descriptive statistics and basic hypothesis testing using z-tests, t-tests, and χ². Participation in the first-semester Research Master course on statistics and corpus linguistics (Wander Lowie and Gertjan van Noord) is strongly recommended for anyone who has never taken statistics. You are required to know the statistics from that course, so if you have never taken such a course, it is a good basis. The statistics course given in the European Master's in Clinical Linguistics is also good.


All participants, including auditors not taking the course for credit, must present at least one hour-long session (30 min. presentation plus 15 min. discussion and question session) on a statistical analysis technique. In addition, students taking the course for credit in the research master must turn in an 8-10 pp. (2,000-2,500 wd.) paper, which may be on the same statistical analysis technique, but which may also be on another. In the paper and presentation it is important to embed the discussion in the analysis of concrete data, to explain how the analysis works, under what conditions it may be applied, and what its shortcomings are. The emphasis is on the statistical technique, but the research question should be explained along with the background theory. Graphical presentations of the data attempting to illustrate the tendency that is to be proven or disproven are definitely valuable.

It is fine with me if you turn in a paper and/or presentation reporting on work for another course as long as the paper turned in focuses on the statistical analysis. If you want to turn in the same paper for two courses, make sure that both instructors know this and agree to it.

Ph.D. candidates from BCN or the Graduate School in the Humanities have received credit for this course in the past if they presented one topic (45 min.: 30 + 15) and participated regularly. I assume that this will continue to be the case.


In general, we will try to use the following:
Other references that have also been found useful are the following:

Many also find the more "how-to" books useful. The following books have been found especially valuable as they provide very practical instructions for doing analysis in SPSS or R.

The books are on reserve, most at the Letteren library, reserve shelf. Please note that books on reserve at the Letteren library that normally belong there are not moved to the reserve shelf. Instead, they're kept at their normal places (use the catalogue), but are on reserve and cannot be loaned out.

Schedule for Seminars/Lectures 2016 (Tentative!)

The schedule for Spring 2016. Meetings are Tues. 4-6 pm in A902
| Week | Date | Theme | Readings | Leader |
|---|---|---|---|---|
| 1 | 9 Feb. | Organizational | | John Nerbonne |
| 2 | 16 Feb. | ANOVA, Factorial ANOVA | Levshina, Ch.8.2-3 | John Nerbonne |
| 3 | 23 Feb. | Repeated Measures | Levshina, Ch.8.4; Rietveld/van Hout, Ch.4.6 | John Nerbonne |
| 4 | 1 Mar. | Simple Linear Regression | Levshina, Ch.6 or Field 6 | |
| 5 | 8 Mar. | Multiple Regression | Levshina, Ch.7 or Field 7 | John Nerbonne |
| 6 | 15 Mar. | Logistic Regression | Levshina, Ch.12; Field, Ch.8 | John Nerbonne |
| | 21/3-8/4 | Exam period, no meetings | | |
| 7 | 12 Apr. | Mixed Effects Models | Baayen, Ch.7 | John Nerbonne |
| 8 | 19 Apr. | Student presentations | | |
| 9 | 26 Apr. | Student presentations | | |
| 10 | 3 May | Student presentations | | |
| 11 | 10 May | | | |
| 12 | 17 May | Student presentations | | |
| 13 | 24 May | Student presentations | | |



Course Materials 2015

  1. John Nerbonne's lectures on various ANOVA & regression models (weeks 2-7).
  2. Martijn Wieling on Mixed Effects Regression
  3. Marieke Engbrenghof on Mixed Design Model for the Acquisition of English Vocabulary
  4. Ingemarie Donker on Predictors of Disfluency Markers in First Language Attrition
  5. Nienke Hoeksema on Repeated Measures ANOVA for Reaction Time and Accuracy Data
  6. Esther van der Berg on Logistic Regression to Analyze Language Change
  7. Elena Badmaeva on Logistic Regression to Analyze the Machine Learning of Russian Diminutive Formation
  8. J. Nerbonne on Entropy and Information Theory and the Conditional Entropy of the phoneme mapping (in Scandinavian) "semi-communication".
  9. Alicia Krebs on Corpus Linguistics: Analysing Word Frequencies
  10. Annelot de Rechteren van Hemert on Support Vector Machines: Eye movement classification in L1/L2 Syntactic Processing

Course Materials 2014

  1. John Nerbonne's lectures on various ANOVA & regression models (weeks 2-7).
  2. Çagrı Çöltekin on Multilevel Regression.
  3. Martijn Wieling on Generalized Additive Models for EEGs.
  4. Lena Rampula on Identifying Semitic Roots with Machine Learning.
  5. Sabrina Sun on Mixed-Effect Models for Predicting 2nd Lg. Learning Success.
  6. Kristie James on Errors in English as a Lingua Franca Analyzed using Mixed Effects Regression.
  7. Bich Ngoc Do on Zero-Inflated Models for Epicene Pronouns.
  8. Amelia La Roi on Mixed-Effects Models for Analyzing Focus Stress.
  9. Anna Saarloos on Regression Models for Analyzing Influences on Vocabulary Size.

Course Materials 2013

  1. John Nerbonne's lectures on various ANOVA & regression models (weeks 2-7).
  2. Çagrı Çöltekin on Multilevel Regression.
  3. John Nerbonne on Permutation tests.
  4. Jay van Cleef on Re-analysis for ERP.
  5. Stephen Gilbers on ANOVA & Emotional Speech in bearers of Cochlear Implants.
  6. Franziska Köder on ANOVA and Pronoun Interpretation.
  7. Magreet Vogelzang on Mixed Models and Eyetracking (and ideas on GAMs).
  8. Rui Qin on Multi-Level Regression and Early Detection of Dyslexia.
  9. Ramon Kezer on Factor Analysis and Code Switching.
  10. Kim Heiligstein on Conditional Entropy and Comprehensibility.

Course Materials 2012

  1. John Nerbonne's lectures on various ANOVA & regression models (weeks 2-7).
  2. Çagrı Çöltekin on Bayesian vs. Frequentist Statistics.
  3. Lotte Schott on Repeated Measures ANOVA for ERP (EEG) Data
  4. Martijn Wieling on Mixed Model Regression for analyzing linguistic variation and for analyzing eye-tracking
  5. Gökhan Akçapınar on Information Gain as used in constructing decision trees
  6. HuiPing Chan on Multiple Regression for Analysing Second Language Vocabulary Learning with Attention to the AIC and to Cook's Distance
  7. Matthew Smith on Conditional Entropy and Mutual Intelligibility
  8. Lili Szábo on Predicting Vowel Harmony using Pointwise Mutual Information
  9. Melanie Hof on Measurement Reliability
  10. Marjoleine Sloos on Linear Discriminant Analysis
  11. Caroline Morris on Association Strength used to gauge Language Change

Course Materials 2011

  1. John Nerbonne's lectures on various ANOVA & regression models (weeks 2-6).
  2. Martijn Wieling on Mixed Models
  3. J. Nerbonne on Entropy and Information Theory and the Conditional Entropy of the phoneme mapping (in Scandinavian) "semi-communication".
  4. Stefan Evert's page on statistics and software for measuring collocation strength.
  5. Association Strength talks
    1. Simon Šuster on Mutual Information and Collocations
    2. Laura Handojo on Odds Ratios and Collocations
    3. Jelke Bloem on Fisher's Exact Test to Detect Animacy
    4. Igor Tytyk on Minimum Sensitivity and Collostructions
  6. Stefan Evert on statistical association strength and multi-word expressions.
  7. Repeated Measures vs. Mixed Models
    1. Connie Lahmann on 'Higher Language Cognition' and Grammaticality Verification
    2. Laura Bos on Repeated Measures ANOVA & Permutation Statistics
    3. Ruggero Montalto on Repeated Measures vs. Mixed Models
    4. Oscar Strik on the Akaike Information Criterion
  8. Dimension Reduction
    1. Martin Boros on Cluster Analysis and Silhouette Width
    2. Ke Tran on Principal Component Analysis (and Face Recognition!)
    3. Lubomir Zlatkov on Multi-Dimensional Scaling
    4. Kristel Uiboaed on Correspondence Analysis
  9. Mona Timmermeister & Caitlin Mignella on Validating a Pronunciation Difference Measure
  10. Jurriën Schuurman on Min F in Psycholinguistics
  11. Jet Vonk on Cochran's Q

Course Materials 2010

  1. Eliza Magaretha on regression used to evaluate the quality of inducing pronunciation distances from empirical data.
  2. Nynke van der Vliet on Cohen's κ used to measure the agreement between annotators of hierarchically structured material.
  3. Edgar Weiffenbach on Log Linear Models of Contingency used to analyse corpus frequencies.
  4. Nick Ruiz on Logistic Regression used to analyse corpus frequencies.
  5. Rahmad Mahendra on Cross Entropy used to measure model quality in computational analyses.
  6. Nadine Glas on (Log) Odds Ratios used to assess the statistical independence of categorical variables (with a comparison to χ²).
  7. Seid Tvica on Ordinal Regression used to measure comprehensibility of foreign speech.

Course Materials 2009

  1. J. Nerbonne on Entropy and Information Theory and the Conditional Entropy of the phoneme mapping (in Scandinavian) "semi-communication".
  2. Thomas Zastrow on Entropy in Dialectometry and P. Nabende on Cross Entropy and Model Comparison
  3. Xuchen Yao Bayesian vs. Frequentist Approaches to Statistics
  4. Çagri Çöltekin Hierarchical Bayesian Networks as Learning Models. For background reading, see Wagenmakers, E.-J., Lee, M. D., Lodewyckx, T., & Iverson, G. (2008). Bayesian versus frequentist inference. In H. Hoijtink, I. Klugkist, and P. A. Boelen (Eds.), Bayesian Evaluation of Informative Hypotheses, pp. 181-207. Springer: New York.
  5. Ma Jianqiang Permutation Tests and Monte Carlo Sampling
  6. Jelena Prokic Clustering and the Bootstrap
  7. Arjen Versloot Using Late Medieval Sources for Linguistic Reconstructions (and regression)
  8. Gulsen Yilmaz Using Multiple Regression to Understand Language Attrition
  9. Harwintha Anjarningsih Repeated Measures ANOVA applied to ERP data.
  10. Natalia Ergorova Repeated Measures ANOVA applied to ERP data, Example II.
  11. Anja Schüppert (Binary) Logistic Regression applied to foreign comprehension data.
  12. Ankelien Schippers (Multinomial) Logistic Regression applied to historical syntax.
  13. Karin Beijering Loglinear Analysis of Contingency Tables applied to historical syntax.
  14. Ildikó Berzlanovich Intercoder Agreement in Discourse Analysis (Cohen's κ)
  15. Myrte Faber Annotating Turn Competition in Multi-Party Conversations (Cohen's κ)
  16. Martijn Wieling Bipartite Spectral Graph Clustering (applied to dialectal variation)
  17. Proscovia Olango Naive Bayes (applied to disambiguation)

Course Materials 2008

  1. J. Nerbonne on Entropy and Information Theory and the Conditional Entropy of the phoneme mapping (in Scandinavian) "semi-communication".
  2. E. Rossi on Normal Distributions and Sampling and on hypothesis testing and t-tests
  3. V. Koukoulioti on Nonparametric Fallback Tests
  4. E. Rossi on Single ANOVA
  5. Th. Mehotcheva on Kruskal-Wallis
  6. H. Loerts on Multivariate ANOVA and Repeated Measures
  7. H. Ahmed on χ² and Fisher's Exact Test
  8. A. Lobanova on (Log) Odds Ratios and Word Order Studies
  9. J. Nerbonne on regression and multiple regression
  10. N. Haque on Principal Component Analysis
  11. S. van Ommen on Clustering

Course Materials 2007

  1. J. Nerbonne on regression and multiple regression
  2. B. Szmrecsányi "Language users as creatures of habit: a corpus-linguistic analysis of persistence in spoken English" Corpus Linguistics and Linguistic Theory 1(1): 113-150.
  3. L. Stowe on Analysis of Variance, incl. Multiple Analysis of Variance.
  4. A. Banga and Tam Ho on Repeated Measures (ANOVA)
  5. S. Berends on Assumptions of ANOVA
  6. T. Caspi on Windowing, Correlations, and Dynamic Systems Theory
  7. Th. Leinonen on Regression in Phonetics and Computational Modeling
  8. J. Nerbonne on Multiple Regression Models.
  9. V. Baaijen on Applying Nonparametric Statistics to Analyse Writing (Comparing Think-Aloud Protocols and Keystroke Logging)
  10. M. Knippers and R. Montalto on Dealing with Nonnormal Distributions in a Repeated Measures Design
  11. M. Spruit on Search for Associations among Variables
  12. G. Korfiatis on Principal Components Analysis
  13. T. Van de Cruys on Dimensionality Reduction for Similarity Detection (Singular Value Decomposition, Non-Negative Matrix Factorization)

Course Materials 2006

  1. Nerbonne on Factor Analysis
  2. Wiersma on Permutation Tests
  3. Zinger on n-gram models
  4. Ruffle on syntactic differences in Old English
  5. Van de Cruys on Latent Semantic Analysis
  6. Vasishth on Mixed Effects Models
  7. Moberg on conditional entropy used to model comprehensibility.
  8. Mur on binomial models, esp. the paired sign test.
  9. Heeringa on bootstrapping.
  10. Ruffle on Log Odds
  11. Xiaoyan Xu on Multivariate nature of Language Attrition
  12. Kwant on "Delphi" techniques for identifying variables

Course Materials 2005

  1. Nerbonne on χ²
  2. Villada on Association Statistics for Recognizing Multi-Word Units
  3. Featherston on Magnitude Estimation. (Various papers, of which Featherston's "Decathlon Model" is perhaps the best starting point.)
  4. Donkers on ANOVA, repeated measures.
  5. Ruffle and Trofimova on Fisher's Exact Test.
  6. Bouma on Corpora and Counting.
  7. Van Noord on Search in Automatically Analysed Corpora.
  8. Smits and Rossi on Binomial Chances.
  9. Kremers on Log Odds Ratios.
  10. Nerbonne on Logistic Regression.
  11. Deunk on Analysing Qualitative Results via Multi-Level Regression.
  12. van der Plas on Clustering.
  13. van der Beek on Entropy as Measure of Syntactic Influence.
  14. Fahmi on Identifying Terminology.

Course Materials 2004

  1. Nerbonne on Logistic Regression
  2. Siedle on Hierarchical Cluster Analysis
  3. Hopp on Magnitude Estimation
  4. Lichte on Association Measures
  5. Kootstra on Exploratory Factor Analysis
  6. Rossi on Odds Ratios in Aphasiology

Student Projects

  1. Melanie Hof's 2012 paper "Questionnaire Evaluation with Factor Analysis and Cronbach's Alpha"
  2. Gerrit Jan Kootstra's 2004 project on exploratory Factor Analysis applied to foreign language learning.