# Exercises in Aggregate Variation

Instructor: John Nerbonne (course under development)
Course Number: LSA.107
Mon. & Wed. 10:10-11:50, June 27-July 13
2005 Linguistics Institute (Harvard/MIT)

Students who wish to receive credit for the course should write up four pages in total, covering at least three of the exercises given below.

## Exercises

1. Isoglosses. Using the mapping programs at the Georgia LAMSAS, find two or three lexical isoglosses that overlap well, and two or three that give contrasting information.

2. Needing aggregation. In class I argued that we need to view dialect differences at an aggregate level in three ways:
• We need to rise above the level of exceptions and counterindications, without, however, simply ignoring data.
• At the more abstract level general relations can be stated, so aggregation enables the statement of general laws.
• Rather than beg the question of the degree to which regular linguistic relations are responsible for the sensitivities of dialect speakers and ultimately, for the existence of dialect areas and dialect continua, we can address it.
The task is to write 200-400 wds. criticizing one or more of these arguments. Feel free to reject them, or, if you can think of an additional argument, to introduce it and elaborate it a bit.

3. (Technical) Examine first the speculation in Nerbonne & Kleiweg's (2003) paper on lexical variation in LAMSAS that the differing number of responses elicited by Lowman on the one hand and McDavid on the other causes the measurements between their data to go wrong. Second, study the treatment of multiple responses in that paper.

The task is then to try to come up with an alternative treatment of multiple responses which would be less sensitive to differing numbers of responses. If you believe you have a good candidate, you might demonstrate its value by applying it to an artificial situation in which responses come from a fixed set in a given order but in which the size of the response set is a stochastic variable subject to noise. A good measure should be stable on average as response sets grow.
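The artificial situation described above could be set up roughly as follows. This is only a minimal sketch under stated assumptions: the inventory `RESPONSES`, the uniform integer noise model, and the function names are all hypothetical, and `jaccard_distance` is merely a stand-in for whatever candidate measure you want to test. It is not the treatment used in Nerbonne & Kleiweg (2003).

```python
import random
from statistics import mean

# Hypothetical fixed, ordered inventory of possible responses (an assumption).
RESPONSES = [f"variant_{i}" for i in range(20)]


def draw_response_set(mean_size, noise, rng):
    """Return the first k responses, where k is mean_size plus integer noise."""
    k = mean_size + rng.randint(-noise, noise)
    k = max(1, min(len(RESPONSES), k))  # keep k within the inventory
    return set(RESPONSES[:k])


def jaccard_distance(a, b):
    """Stand-in candidate measure: 1 - |a & b| / |a | b|."""
    return 1.0 - len(a & b) / len(a | b)


def average_distance(measure, mean_size, noise=2, trials=2000, seed=0):
    """Average the measure over many pairs of simulated response sets."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    return mean(
        measure(draw_response_set(mean_size, noise, rng),
                draw_response_set(mean_size, noise, rng))
        for _ in range(trials)
    )


if __name__ == "__main__":
    # Both sets in a pair come from the same ordered inventory, so a measure
    # that is robust to response-set size should give roughly the same
    # average at every mean size.
    for size in (3, 6, 9, 12, 15):
        print(size, round(average_distance(jaccard_distance, size), 3))
```

To try your own proposal, pass a different function of two sets in place of `jaccard_distance` and compare how its average behaves as `mean_size` grows.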

4. In the second lecture on the fourth day, available here, slide 21 contains a graph comparing five different ways of measuring lexical overlap, in which the two measures involving inverse frequency are seen to be superior. If one compares these two measures to the worst two, in which stemming may be seen to lead to superior results (compared to simple string identity), we can see a puzzle about the value of stemming. Explain what this puzzle is and try to explain what's going on.

5. The lectures on the third and fourth day, taken together, constitute a puzzle. On the one hand, we are confident as linguists that we can describe and measure segment similarity more sensitively than simply alike/non-alike, while on the other hand, it has repeatedly been the experience of research that using feature-based measures of phonetic similarity actually leads to a deterioration in the quality of the results. I speculated in class that this might be due to the fact that the data is noisy and that the sensitivity is misplaced, so that we might be trying to measure travel distances with a micrometer. But I'm not entirely satisfied with this excuse.

Other possibilities might be (i) that features are simply very poor at probing the sorts of similarity dialect speakers are sensitive to; (ii) that feature systems mesh poorly with the Levenshtein sequence measure (but recall Heeringa's results, available on p. 31 of the first lecture on validation, where Heeringa concludes that features likewise depress the performance of simple frequency measures); (iii) that it's simply very difficult to design the right feature system; or even (iv) that dialect speakers react to any difference in pronunciation as a sign of different provenance, and that that is sufficient for them.
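To make the contrast between alike/non-alike and feature-based segment comparison concrete, here is a minimal sketch of a Levenshtein distance with a pluggable substitution cost. The feature table `FEATURES` is a toy invented for illustration (three crude features per segment); it is not the feature system used in the lectures or in Heeringa's work, and the unit insertion/deletion cost is likewise an assumption.

```python
def levenshtein(a, b, sub_cost, indel_cost=1.0):
    """Weighted edit distance between two sequences of segments."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,           # delete x
                curr[j - 1] + indel_cost,       # insert y
                prev[j - 1] + sub_cost(x, y),   # substitute x by y
            ))
        prev = curr
    return prev[-1]


def binary_cost(x, y):
    """Alike/non-alike comparison: any difference costs 1."""
    return 0.0 if x == y else 1.0


# Toy, hypothetical feature table: (voicing, place, manner).
FEATURES = {
    "p": ("voiceless", "labial", "stop"),
    "b": ("voiced", "labial", "stop"),
    "t": ("voiceless", "alveolar", "stop"),
    "d": ("voiced", "alveolar", "stop"),
    "s": ("voiceless", "alveolar", "fricative"),
    "a": ("voiced", "low", "vowel"),
}


def feature_cost(x, y):
    """Graded comparison: the fraction of features on which x and y differ."""
    fx, fy = FEATURES[x], FEATURES[y]
    return sum(p != q for p, q in zip(fx, fy)) / len(fx)


if __name__ == "__main__":
    print(levenshtein("pat", "bad", binary_cost))   # two full substitutions
    print(levenshtein("pat", "bad", feature_cost))  # two voicing-only differences
```

Under the binary cost, "pat" and "bad" are as far apart as "pat" and "sas" would be on the consonants, while the toy feature cost treats the voicing-only differences as smaller. Whether that extra sensitivity helps or hurts in aggregate is exactly the question posed here.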

Take a stand on this issue, using one of the explanations above, or another you suggest, in a brief sketch of 200-500 wd.

6. On the fifth day of lectures, I sketch some applications of the work, all of which exploit the fact that we can employ an aggregate measure of linguistic distance. In 200-500 wd., sketch one novel proposal to identify the determinants of dialectal variation, taking care to note where data might be available, but emphasizing the logic of the sort of comparison you propose. You might think of investigating language contact (on a social level), second language acquisition (on the level of the individual psyche), or perhaps the influence of some extralinguistic variable on language variety.

John Nerbonne