tobias

Robust Part-of-Speech Tagging of Social Media Text

While much of the research on improving PoS taggers focuses on better in-corpus performance, robustness is a somewhat neglected property. Building robust taggers is especially important in the context of social media, where non-standard writing and code-switching are frequently observed. We focus on robustness with respect to three main issues: text domains, languages, and long-tail phenomena. Based on a comprehensive evaluation of the cross-domain performance of publicly available English and German taggers, we introduce a new tagging approach that is significantly more robust on social media than a comparable baseline system. Our approach breaks tagging down into two separate steps: it first assigns a coarse-grained tag and then refines it to the final fine-grained tag. Regarding cross-language robustness, we replicate a recent influential paper that found LSTMs with an auxiliary loss to outperform other approaches. Finally, we analyze the ability of taggers to reliably detect phenomena that may occur only a few times in the training data but are often of special interest for linguistic studies. We explore strategies for targeting taggers towards a specific PoS tag, e.g., by boosting the signal with additional training data.
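The two-step idea can be illustrated with a minimal sketch. This is not the tagger described above: the hand-written rules stand in for trained models, and the tag inventory, the `COARSE_TO_FINE` mapping, and all function names are hypothetical, chosen only for illustration. The point it shows is the control flow: a coarse tag is assigned first, and the fine-grained decision is then restricted to the tags compatible with that coarse tag.

```python
# Hypothetical mapping from each coarse tag to the fine-grained tags it may be refined into.
COARSE_TO_FINE = {
    "NOUN": ["NN", "NNS", "NNP"],
    "VERB": ["VB", "VBD", "VBZ"],
    "X": ["URL", "HT"],  # social-media specific tokens: links and hashtags
}

# Toy verb lexicon; a real system would learn this from training data.
VERB_FORMS = {"go": "VB", "went": "VBD", "was": "VBD", "is": "VBZ"}


def tag_coarse(token):
    """Step 1: assign a coarse-grained tag (toy rules standing in for a trained model)."""
    lower = token.lower()
    if token.startswith("#") or lower.startswith("http"):
        return "X"
    if lower in VERB_FORMS:
        return "VERB"
    return "NOUN"


def refine(token, coarse):
    """Step 2: pick a fine-grained tag, restricted to those allowed by the coarse tag."""
    candidates = COARSE_TO_FINE[coarse]
    if coarse == "X":
        guess = "HT" if token.startswith("#") else "URL"
    elif coarse == "VERB":
        guess = VERB_FORMS[token.lower()]
    elif token[0].isupper():
        guess = "NNP"
    elif token.endswith("s"):
        guess = "NNS"
    else:
        guess = "NN"
    # Fall back to the first candidate if the guess is not licensed by the coarse tag.
    return guess if guess in candidates else candidates[0]


def tag(tokens):
    """Run both steps and return (token, fine_tag) pairs."""
    return [(tok, refine(tok, tag_coarse(tok))) for tok in tokens]


if __name__ == "__main__":
    print(tag(["Berlin", "was", "#fun"]))
```

Restricting the second step to the candidates licensed by the first shrinks the decision space for the fine-grained choice. Whether that is the source of the robustness gain reported above is not something this sketch can establish.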

tobias.txt · Last modified: 2019/02/06 16:03 (external edit)