Exercises in Aggregate Variation
Instructor: John Nerbonne (course under development)
Course Number: LSA.107
Mon. & Wed. 10:10-11:50, June 27-July 13
2005 Linguistics Institute (Harvard/MIT)
Students who wish to receive credit for the course should write up four pages
total on at least three different exercises given below.
Last modified: Sat July 9 14:06:53 CEST 2005
- Isoglosses. Using the mapping programs at the Georgia
two or three lexical isoglosses that overlap well, and two or three
that give contrasting information.
- Needing aggregation. In class I argued that we need to view dialect
differences at an aggregate level in three ways:
  - We need to rise above the level of exceptions and
    counterindications, without, however, simply ignoring data.
  - At the more abstract level general relations can be stated,
    so aggregation enables the statement of general laws.
  - Rather than beg the question of the degree to which regular
    linguistic relations are responsible for the sensitivities of
    dialect speakers and ultimately, for the existence of dialect
    areas and dialect continua, we can address it.
The task is to write 200-400 wds. criticizing one or more of these
arguments. Feel free to reject them, or, if you can think of an
additional argument, to introduce it and elaborate it a bit.
- (Technical) Examine first the speculation in Nerbonne &
Kleiweg's (2003) paper on lexical variation in LAMSAS that the
differing number of responses elicited by Lowman on the one hand and
McDavid on the other causes the measurements between their data to
go wrong. Second, study the treatment of multiple responses in
The task is then to try to come up with an alternative treatment
of multiple responses which would be less sensitive to differing
numbers of responses. If you believe you have a good candidate,
you might demonstrate its value by applying it to an artificial
situation in which responses come from a fixed set in a given order
but in which the size of the response set is a stochastic variable
subject to noise. A good measure should be stable on average
as response sets grow.
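The artificial situation described above could be set up roughly as in the following Python sketch. Everything in it is an assumption made for illustration: the ten-item vocabulary `VOCAB`, the Gaussian noise on the size of the response set, and the two candidate measures (Jaccard distance and a distance based on the overlap coefficient). Neither measure is claimed to be the treatment used in the paper; they are simply two set comparisons whose behavior is easy to contrast.

```python
import random

# Hypothetical fixed response set in a fixed order (most readily given first).
VOCAB = ["w%d" % i for i in range(10)]

def sample_responses(mean_size, rng):
    """Return the first k items of VOCAB, where k is mean_size plus Gaussian noise."""
    k = round(rng.gauss(mean_size, 1.0))
    k = max(1, min(len(VOCAB), k))  # keep k within 1..len(VOCAB)
    return set(VOCAB[:k])

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|: grows when one set is much larger than the other."""
    return 1.0 - len(a & b) / len(a | b)

def overlap_distance(a, b):
    """1 - |A & B| / min(|A|, |B|): normalizes by the smaller response set."""
    return 1.0 - len(a & b) / min(len(a), len(b))

def mean_distance(measure, mean_a, mean_b, trials=2000, seed=0):
    """Average distance between two simulated sites with the given mean set sizes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += measure(sample_responses(mean_a, rng),
                         sample_responses(mean_b, rng))
    return total / trials

# One "fieldworker" elicits about 2 responses; the other elicits more and more.
for mean_b in (2, 4, 6, 8):
    print(mean_b,
          round(mean_distance(jaccard_distance, 2, mean_b), 3),
          round(mean_distance(overlap_distance, 2, mean_b), 3))
```

Because both simulated sites draw from the same ordered set, any distance reported between them is an artifact of differing response counts. In this setup the Jaccard distance climbs as the second set grows, while the overlap-based distance stays at zero. Whether that stability survives once sites genuinely differ in vocabulary is a separate question the simulation would have to be extended to address.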
- In the second lecture on the fourth day, available
here, slide 21 contains a graph
comparing five different ways of measuring lexical overlap, where
the two measures involving inverse-frequency are seen to be superior.
Comparing these two measures with the worst two, where stemming
may be seen to lead to better results than simple string identity,
reveals a puzzle about the value of stemming. Explain what the puzzle is
and try to explain what's going on.
- The lectures on the third and fourth day, taken together, constitute a
puzzle. On the one hand we are confident as linguists that we can
describe and measure segment similarity more sensitively than simply
alike/non-alike, while on the other hand, it has repeatedly been the
experience of research that using the feature-based measures of
phonetic similarity actually leads to deterioration in the quality
of the results. I speculated in class that this might be due to the
fact that the data is noisy and that the sensitivity is misplaced--that
we might be trying to measure travel distances with a micrometer.
But I'm not entirely satisfied with this excuse.
Other possibilities might be that (i) features are simply very
poor at probing the sorts of similarity dialect speakers are sensitive
to; (ii) that feature systems mesh poorly with the Levenshtein
sequence measure (but recall Heeringa's results, available at
on p.31 of the first lecture
on validation, where Heeringa concludes that features likewise
depress the performance of simple frequency measures); (iii) that it's
simply very difficult to design the right feature system; or even
(iv) that dialect speakers react to any difference in pronunciations
as a sign of different provenance, and that that is sufficient for them.
Take a stand on this issue, using one of the explanations above,
or another you suggest, in a brief sketch of 200-500 wd.
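To make the contrast between alike/non-alike comparison and feature-based comparison concrete, here is a minimal Python sketch of Levenshtein distance with a pluggable substitution cost. The feature table is a toy invented for this illustration (five binary features for six segments) and is not any feature system from the lectures or from Heeringa's work. The fixed insertion and deletion cost of 1 is likewise an assumption.

```python
def levenshtein(a, b, sub_cost):
    """Edit distance between segment sequences a and b.

    Insertions and deletions cost 1; substitutions cost sub_cost(x, y).
    """
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        cur = [float(i)]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1.0,                   # delete x
                           cur[j - 1] + 1.0,                # insert y
                           prev[j - 1] + sub_cost(x, y)))   # substitute x by y
        prev = cur
    return prev[-1]

def binary_cost(x, y):
    """Alike/non-alike: any two different segments are maximally different."""
    return 0.0 if x == y else 1.0

# Toy, hypothetical features: (voiced, nasal, labial, high, back).
FEATURES = {
    "p": (0, 0, 1, 0, 0),
    "b": (1, 0, 1, 0, 0),
    "m": (1, 1, 1, 0, 0),
    "i": (1, 0, 0, 1, 0),
    "u": (1, 0, 0, 1, 1),
    "a": (1, 0, 0, 0, 1),
}

def feature_cost(x, y):
    """Fraction of toy features on which the two segments differ."""
    fx, fy = FEATURES[x], FEATURES[y]
    return sum(f != g for f, g in zip(fx, fy)) / len(fx)

for pair in (("pa", "ba"), ("pa", "ma"), ("pi", "pu")):
    print(pair,
          levenshtein(*pair, binary_cost),
          round(levenshtein(*pair, feature_cost), 2))
```

Under the binary cost all three pairs are equally distant, whereas the toy feature cost ranks "pa"/"ma" as further apart than "pa"/"ba". The sketch only shows where the extra sensitivity enters the computation; it says nothing about which of the explanations above is right.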
- In the fifth day of lectures, I sketch some applications of the work,
all of which exploit the fact that we can employ an aggregate measure
of linguistic distance. In 200-500 wd., sketch one
novel proposal to identify the determinants of dialectal variation,
taking care to note where data might be available,
but emphasizing the logic of the sort of comparison you propose.
You might think of investigating language contact (on a social level),
second language acquisition (on the level of the individual
psyche), or perhaps the influence of some extralinguistic variable
on language variety.