Multiple Regression Assignment

Assignment 4

Introduction

In this assignment one is given the phonetic distances between 52 Dutch settlements, as determined by linguistic measurements. The assignment is to investigate whether the travel distances predict the phonetic distances well, and in addition to investigate the influence of population size.

We dwell a moment on how population might be expected to influence pronunciation. Peter Trudgill introduced the "gravity model" of dialect differentiation in the 1970s. This model predicts that larger settlements (cities and towns), i.e., those where many people live, influence their neighboring settlements more than smaller ones do. People visit larger settlements more often, and social contact with them is therefore greater. In this assignment we will investigate whether the dialect differences can also be partially explained by the population of the settlements. In order to make this concrete, we associate with each settlement pair the product of their populations, using the 1815 census as a basis.

The pronunciation differences were calculated by W. Heeringa (2004), and the travel costs were calculated in an M.A. thesis by I. van Gemert (2002). Finally, the population products come from J.C. Ramaer, Geschiedkundige atlas van Nederland; Het koninkrijk der Nederlanden 1815-1931 (Den Haag 1931).

Data

The data has the following form:

      pronunciation      travel   pop. product
  1            7.18    36742.64        2801772
  2           16.31    46541.63        3446019
  3           16.16    67355.32        4220721
  4           15.48    41677.66        2414421
  5           16.18    23813.70       55842768
  6           18.03    53006.08        3821328
  7           14.39    44020.81        3620628
  8           14.28    38677.67        3620628
  9           16.22    43177.66        6781653
 10           17.59    57627.40       12720366

The data are available at: data/multi-regr-dialect-data.txt

Read this ASCII file in, define three variables, and give them sensible names.

Analysis

a. First define a new variable for the root of the travel costs. This should be equal to the square root of the travel costs. We calculate this in conformance with Trudgill's gravity idea, because we wish to check whether the linguistic difference between settlements increases more slowly as settlements lie further apart (just as the force due to gravity decreases with the square of the distance).

b. Examine two regression models: one where linguistic distance is explained on the basis of the travel cost, and one where it is explained on the basis of root travel cost. The gravity model predicts a positive correlation between travel costs and linguistic distance. Is this confirmed? How much of the linguistic distance is explained by the travel distance? Create two scatterplots, each with a least-squares regression line. Which model fits better?

c. Examine now the simple regression model which uses the population product to explain the pronunciation distances. The "gravity model" predicts a negative correlation between pronunciation distance and population product. Is this confirmed? How much of the pronunciation difference might be explained by population? Sketch the relation in a scatterplot with regression line.

d. Examine now the multiple regression model in which one attempts to explain pronunciation differences using both travel costs and population product. Does the combined model explain the pronunciation difference data more satisfactorily?

e. Examine the residuals of the model and check them against a normal quantile plot.

f. Check for collinearity between the variables.
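
The numerical parts of steps a to d and f can be sketched in Python as follows. This is a minimal sketch, not the method prescribed for the course. It uses only the ten example rows shown in the Data section rather than the full data file, so its results are purely illustrative. All variable and function names are hypothetical, and the scatterplots and the normal quantile plot of step e are omitted. The two-predictor R squared is computed from the pairwise correlations, an identity that holds for exactly two predictors.

```python
import math

# The ten example rows from the Data section:
# (pronunciation distance, travel cost, population product).
# Assumption: the full file has the same three columns.
ROWS = [
    (7.18, 36742.64, 2801772),
    (16.31, 46541.63, 3446019),
    (16.16, 67355.32, 4220721),
    (15.48, 41677.66, 2414421),
    (16.18, 23813.70, 55842768),
    (18.03, 53006.08, 3821328),
    (14.39, 44020.81, 3620628),
    (14.28, 38677.67, 3620628),
    (16.22, 43177.66, 6781653),
    (17.59, 57627.40, 12720366),
]

pronunciation = [row[0] for row in ROWS]
travel = [row[1] for row in ROWS]
pop_product = [float(row[2]) for row in ROWS]

# Step a: square root of the travel costs.
root_travel = [math.sqrt(t) for t in travel]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simple_fit(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2_two_predictors(y, x1, x2):
    """R squared of y regressed on x1 and x2, from pairwise correlations."""
    r1, r2, r12 = pearson_r(x1, y), pearson_r(x2, y), pearson_r(x1, x2)
    return (r1 * r1 + r2 * r2 - 2 * r1 * r2 * r12) / (1 - r12 * r12)


# Steps b and c: in simple regression, R squared is the squared correlation;
# the sign of r shows the direction of the relation.
r_travel = pearson_r(travel, pronunciation)
r_root = pearson_r(root_travel, pronunciation)
r_pop = pearson_r(pop_product, pronunciation)
r2_travel, r2_root, r2_pop = r_travel ** 2, r_root ** 2, r_pop ** 2

# Step d: multiple regression on travel cost and population product.
r2_multiple = r2_two_predictors(pronunciation, travel, pop_product)

# Step f: collinearity between the two predictors, with the variance
# inflation factor (for two predictors, VIF = 1 / (1 - r12^2)).
r_predictors = pearson_r(travel, pop_product)
vif = 1.0 / (1.0 - r_predictors ** 2)

print("travel:      r =", r_travel, " R^2 =", r2_travel)
print("root travel: r =", r_root, " R^2 =", r2_root)
print("pop product: r =", r_pop, " R^2 =", r2_pop)
print("travel + pop product: R^2 =", r2_multiple)
print("predictor correlation =", r_predictors, " VIF =", vif)
```

The coefficients for the regression lines in the scatterplots can be obtained with `simple_fit(travel, pronunciation)` and the analogous calls for the other predictors. On the full data set, a statistics package would additionally report significance tests, which this sketch leaves out.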

John Nerbonne
Last modified: Fri May 6 17:45:29 CEST 2005