Three talks on modeling language acquisition and evolution
December 8, 2011, 9:30 -- 11:30
Zernikezaal, Academiegebouw

In connection with the public defense of the dissertation titled Catching words in a stream of speech, there will be three talks by researchers visiting Groningen for the occasion on Thursday, December 8, 2011. The talks will be held in the Zernikezaal, Academiegebouw. The (tentative) schedule and the abstracts can be found below.

09:30 - 10:10 From sounds to words: a Bayesian approach to modeling early language acquisition
Sharon Goldwater, University of Edinburgh
10:10 - 10:50 Evolved properties of language through language learning
Padraic Monaghan, Lancaster University
10:50 - 11:30 A computational approach to early language bootstrapping
Emmanuel Dupoux, LSCP, Paris

From sounds to words: a Bayesian approach to modeling early language acquisition
Sharon Goldwater, University of Edinburgh

The child learning language is faced with a difficult problem: given a set of specific linguistic observations, the learner must infer some abstract representation (a grammar) that generalizes correctly to novel observations and productions. In this talk, I argue that Bayesian computational models provide a principled way to examine the kinds of representations, biases, and sources of information that lead to successful learning. As an example, I discuss my work on modeling word segmentation. I first present a computational study exploring the effects of context on statistical word segmentation. In this study, a model that assumes words are statistically independent (as in the stimuli used in many human experiments) is compared to a model that defines words as units that help to predict following words. I show that the context-independent model undersegments the data, while the contextual model yields much more accurate segmentations, outperforming previous models on realistic corpus data. This difference suggests the need to consider contextual effects in infant word segmentation.

Simulations using corpus data provide insight into the kinds of information that are useful for learning, but it is also important to address the question of whether model predictions are consistent with human learning patterns. In the second part of this talk, I present results from a project designed to evaluate the predictions of various word segmentation models. The human data is based on experiments similar to those of Saffran et al. (1996), but several parameters of the stimuli were varied between subjects to modify the difficulty of the task. The Bayesian model described above correlates better with human patterns of difficulty than any other model tested, suggesting that this model does indeed capture important properties of human segmentation.
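As a toy illustration of the Saffran-style statistical segmentation paradigm mentioned above (not Goldwater's Bayesian model itself), the sketch below segments a syllable stream by placing word boundaries at dips in transitional probability. The function names and the threshold value are illustrative assumptions, not part of any published implementation.

```python
# Hypothetical sketch: transitional-probability segmentation of a syllable
# stream, in the spirit of Saffran et al. (1996). Not Goldwater's model.
from collections import Counter

def transitional_probs(syllables):
    """Estimate P(next | current) from bigram counts over the stream."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(a, b): c / first_counts[a] for (a, b), c in pair_counts.items()}

def segment(syllables, threshold=0.75):
    """Insert a word boundary wherever the transitional probability between
    adjacent syllables dips below the threshold; return the resulting words."""
    tps = transitional_probs(syllables)
    words, current = [], [syllables[0]]
    for a, b in zip(syllables, syllables[1:]):
        if tps[(a, b)] < threshold:
            words.append("".join(current))
            current = []
        current.append(b)
    words.append("".join(current))
    return words

# Toy "language" of three trisyllabic words, concatenated without pauses.
# Within-word transitions always co-occur (TP = 1.0); between-word
# transitions vary (TP = 0.5), so boundaries fall between words.
stream = "tu pi ro go la bu bi da ku tu pi ro bi da ku go la bu tu pi ro".split()
print(segment(stream))
# → ['tupiro', 'golabu', 'bidaku', 'tupiro', 'bidaku', 'golabu', 'tupiro']
```

Because the words here are statistically independent of each other, this stream is exactly the kind of context-free stimulus on which a unigram model suffices; on natural corpus data, where words help predict their neighbors, such a model undersegments, which is the contrast the abstract above draws.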

Evolved properties of language through language learning
Padraic Monaghan, Lancaster University

Hockett defined a set of "design features" of language that are universal properties of communication systems, such as the arbitrary relationship between the sounds and meanings of words. He described these as definitional properties of language, but they can instead be interpreted as efficient solutions to the task of transmitting complex information. I demonstrate through corpus analyses, computational modelling, and artificial language learning experiments how such design features of language facilitate language learning and how they may have evolved to be part of our communicative systems.

A computational approach to early language bootstrapping
Emmanuel Dupoux, Laboratoire de Sciences Cognitives et Psycholinguistique, Paris

Human infants spontaneously and effortlessly learn the language(s) spoken in their environment, despite the extraordinary complexity of the task. In the past 30 years, tremendous progress has been made in the empirical investigation of the linguistic achievements of infants during their first two years of life. In that short period, infants learn in an essentially unsupervised fashion the basic building blocks of the phonetic, phonological, lexical, and syntactic organization of their native language (see Jusczyk, 1987). Yet, little is known about the mechanisms responsible for these acquisitions. Do infants rely on general statistical inference principles? Do they rely on specialized algorithms devoted to language?

Here, I will present an overview of the early phases of language acquisition and focus on one area where a modeling approach is currently being conducted, using tools of signal processing and automatic speech recognition: the unsupervised acquisition of phonetic categories. It is known that during the first year of life, before they are able to talk, infants construct a detailed representation of the phonemes of their native language and lose the ability to distinguish nonnative phonemic contrasts. The main mechanism that has been proposed so far, unsupervised statistical clustering (Maye et al., 2002), does not converge on phonemes, but rather on contextual allophonic units that are smaller than the phoneme. Such units can be grouped into phonemes using several sources of information: the statistical distribution of their contexts, the phonetic plausibility of the grouping, and the existence of lexical minimal pairs (Peperkamp et al., 2006; Martin et al., submitted). It is shown that no single source of information is sufficient without presupposing the others, but that they need to be combined to arrive at good performance. Modeling results and experiments in human infants will be presented.