What we need are Natural Language Processing (NLP) models that are more robust: models that
work better on unexpected input (such as new domains or new languages) and that can be trained
from semi-automatically annotated or weakly annotated data from a variety of sources.
My research focuses on bringing NLP one step closer to this goal by combining fortuitous data with appropriate machine learning algorithms to enable robust language technology.
I am interested in learning under sample selection bias (domain adaptation, transfer learning), learning under annotation bias (embracing annotator disagreement in learning) and, more generally, in semi-supervised and weakly supervised machine learning applied to cross-domain and cross-language natural language processing.
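To make the sample selection bias setting concrete, here is a minimal sketch of one standard recipe for it: importance weighting via a domain discriminator. This is an illustration on assumed synthetic data, not the method of any particular paper of mine; all variable names are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Illustrative synthetic data: source and target differ in input distribution
# (covariate shift) but share the same labeling function.
X_source = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
y_source = (X_source[:, 0] + X_source[:, 1] > 0).astype(int)
X_target = rng.normal(loc=1.0, scale=1.0, size=(500, 2))  # shifted inputs, no labels

# Step 1: train a domain discriminator (source = 0 vs. target = 1).
X_domain = np.vstack([X_source, X_target])
d_domain = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
discriminator = LogisticRegression().fit(X_domain, d_domain)

# Step 2: turn discriminator probabilities into importance weights,
# w(x) = p(target|x) / p(source|x), an estimate of p_target(x) / p_source(x).
p_target = discriminator.predict_proba(X_source)[:, 1]
weights = p_target / (1.0 - p_target)

# Step 3: train the task classifier on source data, reweighted toward the target.
clf = LogisticRegression().fit(X_source, y_source, sample_weight=weights)
```

Estimating the density ratio through a discriminator avoids modeling either input density directly, which matters once the features are high-dimensional, as they typically are in NLP.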
Ultimately, NLP should be able to handle any language and any domain. However, there is still a long way to go! Our models need training data,
but annotated data is biased and scarce.
One way to address this problem of training data scarcity is to leverage data that has so far been neglected or rests in non-obvious places. Such fortuitous data includes hyperlinks that help build more robust part-of-speech taggers and named-entity recognizers, annotator disagreement used as a learning signal (see the sketch after the reading list below), and behavioral data such as gaze or keystrokes that can inform NLP models. Read more:
- Barbara Plank. What to do about non-standard (or non-canonical) language in NLP. In KONVENS 2016. [arXiv]
- Barbara Plank. Keystroke dynamics as signal for shallow syntactic parsing. In COLING 2016, the 26th International Conference on Computational Linguistics. Osaka, Japan. [arXiv]
- Barbara Plank, Anders Johannsen and Željko Agić. Improving language technology with fortuitous data. ESSLLI 2016 summer school.
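As a toy illustration of the annotator disagreement idea mentioned above, one simple option (my sketch here, not the exact method of any of the papers listed) is to keep the full distribution of annotator votes per item and train against those soft labels, rather than collapsing them to a single majority label. The data and names below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented example: 200 items, 5 features, 3 classes, 5 annotator votes per item.
n, d, k = 200, 5, 3
X = rng.normal(size=(n, d))
votes = rng.multinomial(5, [0.6, 0.3, 0.1], size=n)  # per-item annotator votes
soft = votes / votes.sum(axis=1, keepdims=True)      # vote distribution, not argmax

# Softmax regression trained with cross-entropy against the soft labels.
W = np.zeros((d, k))
for _ in range(500):
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # The cross-entropy gradient keeps its familiar form (probs - targets),
    # with the soft vote distribution playing the role of the one-hot target.
    W -= 0.5 * (X.T @ (probs - soft)) / n
```

Items the annotators disagree on thus contribute softer, lower-confidence targets, preserving information that hard majority voting throws away.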