Parsing Algorithms for Uncertain Input
Welcome to the webpage of the Parsing Algorithms for Uncertain Input project.
The automated analysis of natural language is an important ingredient for future applications which require the ability to understand natural language. For carefully edited texts, current algorithms obtain good results. However, for user-generated content such as tweets and contributions to Internet fora, these methods are not adequate, for a variety of reasons including spelling mistakes, grammatical mistakes, unusual tokenization, partial utterances, and interruptions. Likewise, the analysis of spoken language faces enormous challenges. One important aspect in which current methods break down is that they take the input very literally. Disfluencies, small mistakes or unexpected interruptions in the input often lead to serious problems. In contrast, humans understand such utterances without problems and are often not even aware of a spelling mistake or a grammatical mistake in the input.
We propose to study a model of language analysis in which the purpose of the parser is to provide the analysis of the 'intended' utterance, which obviously is closely related to the observed input, but might be slightly different. The relation between the observed sentence and the intended sentence is modeled by a kernel function on input string pairs. Such a kernel function accounts for different kinds of noise: it might model errors such as disfluencies, false starts, word swaps, etc. More concretely, this kernel function can be thought of as a weighted finite-state transducer, mapping an observed input to a weighted finite-state automaton that represents a probability distribution over possible intended inputs. The parser is then supposed to pick the best parse out of the set of parses of all possible inputs, taking into account the various probabilities. Note that there is an obvious similarity with parsing word graphs (word lattices) as output of a speech recognizer, as well as with some earlier techniques in ill-formed input parsing. The current model combines and generalizes these ideas. The study will focus on questions of the following types: can we efficiently compute such an analysis (taking into account a variety of possible formalizations), and what type of disfluencies, noise, mistakes, etc., in the input can be effectively modeled in this approach?
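As a rough illustration of the selection step described above, the following Python sketch enumerates candidate intended sentences for an observed input, weights each by a noise model, and keeps the candidate whose combined noise and parse score is highest. It is only a toy under stated assumptions: the names NOISE_MODEL, KNOWN_WORDS, parse_score, candidates and best_analysis are hypothetical, all probabilities are invented, the noise model is a per-token table rather than a real weighted finite-state transducer, and the "parser" is a stand-in that merely penalizes unknown words. Brute-force enumeration is used for clarity; it is not the efficient computation the project asks about.

```python
from itertools import product
from math import prod

# Hypothetical noise model: for each observed token, the possible intended
# tokens with P(intended | observed). All values are invented for illustration.
NOISE_MODEL = {
    "u": {"you": 0.7, "u": 0.3},
    "r": {"are": 0.6, "r": 0.4},
    "gr8": {"great": 0.9, "gr8": 0.1},
}

# Hypothetical vocabulary used by the stand-in parser below.
KNOWN_WORDS = {"you", "are", "great"}


def parse_score(tokens):
    """Toy stand-in for a parser: each unknown token costs a factor of 0.1."""
    return prod(1.0 if tok in KNOWN_WORDS else 0.1 for tok in tokens)


def candidates(observed):
    """Yield (intended_tokens, noise_weight) for every path in the lattice."""
    # Tokens missing from the noise model are assumed to be intended as-is.
    options = [NOISE_MODEL.get(tok, {tok: 1.0}).items() for tok in observed]
    for path in product(*options):
        tokens = tuple(tok for tok, _ in path)
        weight = prod(p for _, p in path)
        yield tokens, weight


def best_analysis(observed):
    """Return the (intended_tokens, score) pair with the highest combined score."""
    scored = (
        (tokens, weight * parse_score(tokens))
        for tokens, weight in candidates(observed)
    )
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    tokens, score = best_analysis(["u", "r", "gr8"])
    print(" ".join(tokens), score)
```

In this sketch the candidate set plays the role of the word lattice, and the product of the two scores plays the role of combining the kernel weight with the parse probability. A realistic system would replace the enumeration with lattice parsing so that the number of candidates does not grow exponentially with sentence length.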
- Rob van der Goot & Gertjan van Noord. 2017. MoNoise: Modeling Noise Using a Modular Normalization System. To appear in Computational Linguistics in the Netherlands Journal.
[paper (draft) | slides | code]
- Rob van der Goot, Barbara Plank & Malvina Nissim. 2017. To Normalize, or Not to Normalize: The Impact of Normalization on Part-of-Speech Tagging. In Proceedings of the 3rd Workshop on Noisy User-generated Text.
[paper | slides | code | bib]
- Rob van der Goot & Gertjan van Noord. 2017. Parser Adaptation for Social Media by Integrating Normalization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.
[paper | poster | slides | code | bib]
- Joachim Daiber & Rob van der Goot. 2016. The Denoised Web Treebank: Evaluating Dependency Parsing under Noisy Input Conditions. In Proceedings of the Tenth International Conference on Language Resources and Evaluation.
[paper | poster | data & code | bib]
- Rob van der Goot. 2016. Normalizing Social Media Texts by Combining Word Embeddings and Edit Distances in a Random Forest Regressor. In Normalisation and Analysis of Social Media Texts Workshop.
[paper | slides | code | bib]
- Rob van der Goot & Gertjan van Noord. 2015. ROB: Using Semantic Meaning to Recognize Paraphrases. In Proceedings of the 9th International Workshop on Semantic Evaluation.
[paper | poster | code | bib]