
Parsing Algorithms for Uncertain Input

Welcome to the webpage of the Parsing Algorithms for Uncertain Input project.

This project is carried out by Rob van der Goot, supervised by Gertjan van Noord and funded by the Nuance Foundation.

Project Description

The automated analysis of natural language is an important ingredient for future applications that require the ability to understand natural language. For carefully edited texts, current algorithms obtain good results. However, for user-generated content such as tweets and contributions to Internet fora, these methods are not adequate, for a variety of reasons including spelling mistakes, grammatical mistakes, unusual tokenization, partial utterances, and interruptions. Likewise, the analysis of spoken language faces enormous challenges. One important respect in which current methods break down is that they take the input very literally. Disfluencies, small mistakes, or unexpected interruptions in the input often lead to serious problems. In contrast, humans understand such utterances without difficulty and are often not even aware of a spelling mistake or a grammatical mistake in the input.

We propose to study a model of language analysis in which the purpose of the parser is to provide the analysis of the 'intended' utterance, which is closely related to the observed input but might be slightly different. The relation between the observed sentence and the intended sentence is modeled by a kernel function on input string pairs. Such a kernel function accounts for different kinds of noise: it might model errors such as disfluencies, false starts, word swaps, etc. More concretely, this kernel function can be thought of as a weighted finite-state transducer, mapping an observed input to a weighted finite-state automaton representing a probability distribution over possible intended inputs. The parser is then supposed to pick the best parse out of the set of parses of all possible inputs, taking into account the various probabilities. Note that there is an obvious similarity with parsing word graphs (word lattices) as output of a speech recognizer, as well as with some earlier techniques in ill-formed input parsing. The current model combines and generalizes these ideas. The study will focus on questions of the following types: can we efficiently compute such an analysis (taking into account a variety of possible formalizations), and what types of disfluencies, noise, mistakes, etc., in the input can be effectively modeled in this approach?
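As a rough illustration of this idea, the Python sketch below combines a noise model with a parser score and picks the intended sentence and analysis with the highest joint probability. It is only a toy under stated assumptions: the candidates are enumerated explicitly instead of being represented by a weighted finite-state transducer, and the parser is replaced by a trivial stand-in. The names `CONFUSIONS`, `GOOD_BIGRAMS`, `noise_candidates`, `toy_parse`, and `best_analysis`, as well as all probabilities, are hypothetical and are not part of the project.

```python
import itertools
import math
from typing import Dict, List, Set, Tuple

Sentence = Tuple[str, ...]

# Hypothetical noise model: P(intended token | observed token).
# Observed tokens that are not listed are assumed to be intended as-is.
CONFUSIONS: Dict[str, Dict[str, float]] = {
    "teh": {"the": 0.9, "teh": 0.1},
    "barks": {"barks": 0.8, "bark": 0.2},
}

# Hypothetical stand-in for a grammar: word pairs the toy parser likes.
GOOD_BIGRAMS: Set[Tuple[str, str]] = {("the", "dog"), ("dog", "barks")}


def noise_candidates(observed: Sentence) -> List[Tuple[Sentence, float]]:
    """Enumerate possible intended sentences with their log-probabilities.

    A real system would keep these in a weighted automaton (a lattice)
    rather than listing every path explicitly.
    """
    per_token = [
        list(CONFUSIONS.get(tok, {tok: 1.0}).items()) for tok in observed
    ]
    candidates = []
    for combo in itertools.product(*per_token):
        sentence = tuple(tok for tok, _ in combo)
        log_p = sum(math.log(p) for _, p in combo)
        candidates.append((sentence, log_p))
    return candidates


def toy_parse(sentence: Sentence) -> Tuple[str, float]:
    """Stand-in for a parser: return (analysis, log-probability)."""
    log_p = 0.0
    for pair in zip(sentence, sentence[1:]):
        log_p += math.log(0.9 if pair in GOOD_BIGRAMS else 0.1)
    analysis = "(S " + " ".join(sentence) + ")"
    return analysis, log_p


def best_analysis(observed: Sentence) -> Tuple[Sentence, str, float]:
    """Pick the intended sentence and parse maximizing noise + parse score."""
    scored = []
    for sentence, noise_log_p in noise_candidates(observed):
        analysis, parse_log_p = toy_parse(sentence)
        scored.append((sentence, analysis, noise_log_p + parse_log_p))
    return max(scored, key=lambda item: item[2])


if __name__ == "__main__":
    intended, analysis, log_p = best_analysis(("teh", "dog", "barks"))
    print(intended, analysis, round(math.exp(log_p), 4))
```

Explicit enumeration grows exponentially with sentence length, which is exactly why the efficiency question raised above matters: a practical approach would need to parse the weighted automaton directly instead of each candidate string separately.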