abstract_12062015_tejaswini

Title: Generalising Strongly-Lexicalised Parsers Tejaswini Deoskar

Abstract: Due to the Zipfian nature of language, statistical models of language learnt from labeled data fail to capture the wide variety of linguistic phenomena that fall in the long Zipfian tail. In this talk, I will present two ideas aimed at 'parser-generalisation': the problem of extending a supervised grammar and parsing model, using additional unlabeled data, so that it accurately models a wider variety of linguistic data than has been seen in the labeled data.

The first idea concerns the use of the Expectation Maximisation (EM) algorithm for semi-supervised learning of parsing models. While it has long been thought that EM is unsuitable for semi-supervised learning of structured models (Merialdo 1994; Elworthy 1994), I will present experiments under two grammar formalisms (PCFG and CCG) in which we have successfully used EM for semi-supervised learning of generative parsers. These two formalisms share the property of being 'strongly lexicalised': they have complex lexical categories and a small set of simple grammar rules that combine them. I will argue that strong lexicalisation makes such grammars more suitable for learning from unlabeled data than grammars which are not lexicalised in this way.
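To make the semi-supervised EM setting concrete, here is a minimal sketch (not the talk's actual parsing models) of the general recipe on a toy lexical-association task: supervised counts from labeled word-category pairs are held fixed, while EM adds fractional expected counts for unlabeled words. All data, categories, and smoothing choices below are made up for illustration.

```python
from collections import defaultdict

# Labeled data: (word, category) pairs -> fixed supervised counts.
labeled = [("the", "DET"), ("dog", "N"), ("barks", "V"),
           ("the", "DET"), ("cat", "N")]

# Unlabeled data: words whose category associations are unknown.
unlabeled = ["the", "dog", "sleeps", "the", "cat", "sleeps"]

categories = sorted({c for _, c in labeled})
vocab = sorted({w for w, _ in labeled} | set(unlabeled))
prior = 1.0 / len(categories)  # uniform P(category), for simplicity

def estimate(counts):
    # M-step: re-estimate P(word | category) with add-one smoothing,
    # so words unseen in the labeled data get non-zero mass.
    probs = {}
    for c in categories:
        total = sum(counts[(c, w)] for w in vocab) + len(vocab)
        probs[c] = {w: (counts[(c, w)] + 1) / total for w in vocab}
    return probs

sup_counts = defaultdict(float)
for w, c in labeled:
    sup_counts[(c, w)] += 1.0

probs = estimate(sup_counts)
for _ in range(10):  # EM iterations
    # E-step: expected counts on unlabeled words via the posterior P(c | w);
    # the supervised counts stay fixed across iterations.
    exp_counts = defaultdict(float, sup_counts)
    for w in unlabeled:
        z = sum(prior * probs[c][w] for c in categories)
        for c in categories:
            exp_counts[(c, w)] += prior * probs[c][w] / z
    probs = estimate(exp_counts)  # M-step on combined counts

best = max(categories, key=lambda c: probs[c]["sleeps"])
print("most likely category for 'sleeps':", best)
```

In the parsing case, the E-step instead computes expected counts of rule and lexical-category uses over parse forests of unlabeled sentences, but the overall supervised-plus-expected-counts loop has the same shape.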

In the above work, I assume that all lexical category types in the language are *known* from the supervised part of the data (a reasonable assumption when the supervised data is large), and that only the associations between words and category types are unknown. In the second part of the talk, I will discuss ongoing work in which we generate *unseen* category types on the basis of seen ones. Focusing on CCG in this case, we model the internal structure of the CCG category types seen in a CCG treebank using latent-variable PCFG models, and use these models to generate new category types, under the assumption that CCG lexical categories have a hidden internal structure which such models can uncover.
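The idea of generating unseen category types rests on the fact that a CCG lexical category is itself a small tree: an atomic category, or a functor built from a result, a slash, and an argument. As a rough illustration only, the sketch below samples new category types from a hand-written grammar over this internal structure; the talk's actual models are latent-variable PCFGs estimated from a treebank, whereas the atoms, rules, and probabilities here are invented.

```python
import random

random.seed(0)

# Toy grammar over category-internal structure (probabilities invented):
#   Cat -> Atom                  with probability 1 - P_COMPLEX
#   Cat -> (Cat Slash Cat)       with probability P_COMPLEX
P_COMPLEX = 0.4
ATOMS = ["S", "NP", "N", "PP"]
SLASHES = ["/", "\\"]

def sample_cat(depth=0, max_depth=3):
    # Recursively sample a category; the depth cap guarantees termination.
    if depth >= max_depth or random.random() > P_COMPLEX:
        return random.choice(ATOMS)
    result = sample_cat(depth + 1, max_depth)
    argument = sample_cat(depth + 1, max_depth)
    slash = random.choice(SLASHES)
    return f"({result}{slash}{argument})"

# Toy 'seen' inventory; generated types outside it are the novel ones.
seen = {"NP", "N", "(S\\NP)", "((S\\NP)/NP)"}
new_types = {sample_cat() for _ in range(50)} - seen
print(sorted(new_types)[:5])
```

A latent-variable version would additionally split the `Cat` nonterminal into hidden subcategories learned from the seen types, so that the model prefers new categories that resemble attested ones rather than arbitrary well-formed strings.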

Biography: Tejaswini Deoskar is a Research Associate at the School of Informatics, University of Edinburgh. She completed a PhD in Linguistics at Cornell University in 2009. Before coming to Edinburgh, she worked as a docent and post-doc at the University of Amsterdam. Her research interests lie in building computational models of natural language syntax and semantics, with a particular focus on models with fine-grained, linguistically sophisticated representations.
