Multiple Regression Assignment
Assignment 4
Introduction
In this assignment you are given the phonetic distances between 52
Dutch settlements, as determined by linguistic measurements. The
assignment is to investigate whether the travel distances predict the
phonetic distances well, and also to investigate the influence of
population size.
We dwell a moment on how population might be expected to influence
pronunciation. Peter Trudgill introduced the "gravity model" of
dialect differentiation in the 1970s. This model predicts that
larger settlements (cities and towns), i.e., where many people live,
influence their neighboring settlements more than smaller ones.
People visit larger settlements more often, and social contact with them
is therefore greater. In this assignment we will investigate whether
the dialect differences can also be partially explained by the population
of the settlements. In order to make this concrete, we associate with
each settlement pair the product of their populations, using the 1815
census as a basis.
The pronunciation differences were calculated by W. Heeringa (2004),
and the travel costs were calculated in an M.A. thesis by I. van
Gemert (2002). Finally, the population products come from
J.C. Ramaer, Geschiedkundige atlas van Nederland; Het koninkrijk der
Nederlanden 1815-1931 (Den Haag 1931).
Data
The data have the following form:
    pronunciation     travel   pop. product
 1           7.18   36742.64        2801772
 2          16.31   46541.63        3446019
 3          16.16   67355.32        4220721
 4          15.48   41677.66        2414421
 5          16.18   23813.70       55842768
 6          18.03   53006.08        3821328
 7          14.39   44020.81        3620628
 8          14.28   38677.67        3620628
 9          16.22   43177.66        6781653
10          17.59   57627.40       12720366
The data are available at:
data/multi-regr-dialect-data.txt
Read this ASCII file in, define three variables, and give them
sensible names.
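As a rough illustration of the reading step, the sketch below parses rows in the layout shown above using only the Python standard library. It assumes the full file looks like the excerpt (a header line, then a row number followed by three numeric fields); the function name read_dialect_table and the variable names are illustrative choices, not part of the assignment.

```python
# First rows copied from the excerpt above; the full file is ASSUMED to share this layout.
SAMPLE = """\
pronunciation travel pop. product
1 7.18 36742.64 2801772
2 16.31 46541.63 3446019
3 16.16 67355.32 4220721
"""


def read_dialect_table(text):
    """Parse rows of the form: row number, pronunciation, travel, pop. product."""
    pronunciation, travel, pop_product = [], [], []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the header (its first field is not a row number).
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        pronunciation.append(float(fields[1]))
        travel.append(float(fields[2]))
        pop_product.append(float(fields[3]))
    return pronunciation, travel, pop_product


pronunciation, travel, pop_product = read_dialect_table(SAMPLE)
print(len(pronunciation), pronunciation[0], travel[0], pop_product[0])
```

To use it on the real data, pass in the contents of data/multi-regr-dialect-data.txt (for example via open(...).read()) instead of SAMPLE, after checking that the file really has this layout.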
Analysis
a. First define a new variable equal to the square root of the travel
costs. We calculate this in keeping with Trudgill's gravity idea,
because we wish to check whether the linguistic difference between
settlements increases more slowly as settlements lie further apart
(just as the force due to gravity decreases with the square of the
distance).
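For step a, the transformation itself is a one-liner. The sketch below applies it to four travel costs copied from the excerpt in the Data section; the name root_travel is an illustrative choice.

```python
import math

# Travel costs copied from the first four rows of the excerpt in the Data section.
travel = [36742.64, 46541.63, 67355.32, 41677.66]

# Square-root transform: large travel costs are compressed more than small ones.
root_travel = [math.sqrt(t) for t in travel]

for t, r in zip(travel, root_travel):
    print(f"{t:10.2f} -> {r:7.2f}")
```

The transform keeps the ordering of the settlement pairs but shrinks the gaps between the largest costs, which is what lets a straight line in root travel cost describe a relation that flattens out at large distances.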
b. Examine two regression models: one where linguistic distance is
explained on the basis of the travel cost, and one where it's explained
on the basis of root travel cost. The gravity model predicts a positive
correlation between travel costs and linguistic distance. Is this
confirmed? How much of the linguistic distance is explained by the
travel distance? Create two scatterplots, each with a least-squares
regression line. Which model fits better?
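For step b, the two single-predictor models can be compared along the following lines. This is a minimal sketch, not a prescribed solution: it uses only the ten rows from the excerpt in the Data section, so its numbers will differ from those for the full data set, and fit_line is a hypothetical helper written for this illustration. Plotting is left out to keep the sketch free of third-party packages.

```python
import math

# Ten rows copied from the excerpt in the Data section (NOT the full data set).
pronunciation = [7.18, 16.31, 16.16, 15.48, 16.18, 18.03, 14.39, 14.28, 16.22, 17.59]
travel = [36742.64, 46541.63, 67355.32, 41677.66, 23813.70,
          53006.08, 44020.81, 38677.67, 43177.66, 57627.40]


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # share of variance in y explained by x
    return intercept, slope, r_squared


root_travel = [math.sqrt(t) for t in travel]
for label, x in [("travel", travel), ("sqrt(travel)", root_travel)]:
    intercept, slope, r2 = fit_line(x, pronunciation)
    print(f"{label:13s} intercept={intercept:.3f} slope={slope:.6f} R^2={r2:.3f}")
```

The sign of the slope answers the question about the direction of the correlation, and the model with the larger R^2 fits better in the least-squares sense.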
c. Examine now the single regression model which uses the population product
to explain the pronunciation distances. The "gravity model" predicts
a negative correlation between pronunciation distance and population
product. Is this confirmed? How much of the pronunciation difference
might be explained by population? Sketch the relation in a scatterplot
with regression line.
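For step c, the direction of the relation can be read off the Pearson correlation, and in a one-predictor regression the share of variance explained is simply its square. The sketch below is an illustration on the ten excerpt rows only; pearson_r is a hypothetical helper name.

```python
import math

# Ten rows copied from the excerpt in the Data section (NOT the full data set).
pronunciation = [7.18, 16.31, 16.16, 15.48, 16.18, 18.03, 14.39, 14.28, 16.22, 17.59]
pop_product = [2801772, 3446019, 4220721, 2414421, 55842768,
               3821328, 3620628, 3620628, 6781653, 12720366]


def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(pop_product, pronunciation)
direction = "negative" if r < 0 else "non-negative"
print(f"r = {r:.3f} ({direction}), r^2 = {r * r:.3f}")
```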
d. Examine now the multiple regression model in which one attempts to
explain pronunciation differences using both travel costs and population
product. Does the combined model explain the pronunciation difference data
more satisfactorily?
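For step d, a two-predictor least-squares fit can be written out directly from centred sums of squares and cross-products. The sketch below is one way to do it without a statistics package, again on the ten excerpt rows only; fit_two_predictors is a hypothetical helper, and the adjusted R^2 is included because plain R^2 cannot decrease when a predictor is added.

```python
# Ten rows copied from the excerpt in the Data section (NOT the full data set).
pronunciation = [7.18, 16.31, 16.16, 15.48, 16.18, 18.03, 14.39, 14.28, 16.22, 17.59]
travel = [36742.64, 46541.63, 67355.32, 41677.66, 23813.70,
          53006.08, 44020.81, 38677.67, 43177.66, 57627.40]
pop_product = [2801772, 3446019, 4220721, 2414421, 55842768,
               3821328, 3620628, 3620628, 6781653, 12720366]


def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2.

    Returns ((b0, b1, b2), r_squared, adjusted_r_squared).
    """
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    syy = sum(c * c for c in dy)
    det = s11 * s22 - s12 * s12  # zero when the predictors are perfectly collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    r_squared = (b1 * s1y + b2 * s2y) / syy
    adjusted = 1 - (1 - r_squared) * (n - 1) / (n - 3)
    return (b0, b1, b2), r_squared, adjusted


coefficients, r2, adj_r2 = fit_two_predictors(travel, pop_product, pronunciation)
print("b0, b1, b2 =", coefficients)
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

Comparing the adjusted R^2 here with the R^2 of the single-predictor models gives a first answer to whether the combined model is more satisfactory; root travel cost can be substituted for travel cost in the same call.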
e. Examine the residuals of the model and check them against a normal quantile
plot.
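For step e, a normal quantile plot pairs each sorted residual with the standard-normal quantile for its rank; points close to a straight line suggest roughly normal residuals. The sketch below only computes those pairs. It is a simplified illustration: for brevity it takes residuals from the one-predictor model on travel cost, fitted to the ten excerpt rows, whereas the same normal_qq_points function (a hypothetical helper) applies unchanged to residuals of the combined model. The plotting positions (i - 0.5) / n are one common convention among several.

```python
from statistics import NormalDist

# Ten rows copied from the excerpt in the Data section (NOT the full data set).
pronunciation = [7.18, 16.31, 16.16, 15.48, 16.18, 18.03, 14.39, 14.28, 16.22, 17.59]
travel = [36742.64, 46541.63, 67355.32, 41677.66, 23813.70,
          53006.08, 44020.81, 38677.67, 43177.66, 57627.40]


def residuals_of_line(x, y):
    """Residuals (observed minus fitted) of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def normal_qq_points(residuals):
    """Pair each sorted residual with the standard-normal quantile for its rank."""
    n = len(residuals)
    ordered = sorted(residuals)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


residuals = residuals_of_line(travel, pronunciation)
for quantile, residual in normal_qq_points(residuals):
    print(f"{quantile:6.2f} {residual:7.2f}")
```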
f. Check for collinearity between the variables.
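For step f, with only two predictors the collinearity check reduces to their mutual correlation: the variance inflation factor of either predictor is 1 / (1 - r^2), where r is the correlation between the two. The sketch below computes both on the ten excerpt rows; the helper names are illustrative, and the choice of travel cost and population product as the pair to check is an assumption about which predictors end up in the model.

```python
import math

# Ten rows copied from the excerpt in the Data section (NOT the full data set).
travel = [36742.64, 46541.63, 67355.32, 41677.66, 23813.70,
          53006.08, 44020.81, 38677.67, 43177.66, 57627.40]
pop_product = [2801772, 3446019, 4220721, 2414421, 55842768,
               3821328, 3620628, 3620628, 6781653, 12720366]


def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def variance_inflation(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2).

    Raises ZeroDivisionError if the predictors are perfectly collinear.
    """
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


r12 = pearson_r(travel, pop_product)
vif = variance_inflation(travel, pop_product)
print(f"correlation between predictors = {r12:.3f}, VIF = {vif:.3f}")
```

A VIF close to 1 means the predictors carry largely independent information; the larger it grows, the less reliably the separate coefficients can be interpreted.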
John Nerbonne
Last modified: Fri May 6 17:45:29 CEST 2005