Lexical normalization of health-related social media text
Anne Dirkson, Suzan Verberne, Gerard van Oortmerssen and Wessel Kraaij


In the biomedical domain, open knowledge discovery from text has largely been limited to semi-structured data, such as electronic health records, and to biomedical literature. Patient forums, however, contain a wealth of unexploited knowledge: the unprompted experiences of the patients themselves. The aim of our research is to develop an approach for systematically mining patient experiences from patient forums to generate novel clinical hypotheses. These hypotheses could subsequently drive clinical research and thereby improve quality of life for patients.

However, the extraction of this knowledge is complicated by the noisy nature of user-generated social media data, which is plagued by colloquial language use and spelling mistakes. The complexity of the medical domain only aggravates this challenge. Despite the increasing use of such data as a complementary knowledge source to scientific literature, lexical normalization of this text has received little attention. We introduce an unsupervised, sequential pipeline for lexical normalization of domain-specific abbreviations and spelling mistakes in health-related social media data. We will present an evaluation of our approach on two cancer-related forums and several benchmark datasets. Our approach mainly targets medical concepts, which are highly relevant for downstream NLP tasks. Nonetheless, our unsupervised spelling correction may also be of interest for user-generated content in other highly specific and noisy domains. Future work will include extending the pipeline with modules for named entity recognition and automated relation annotation in medical forum posts.
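To make the idea of unsupervised, corpus-driven spelling correction concrete, the sketch below replaces an out-of-vocabulary token with its closest in-vocabulary neighbor under edit distance, breaking ties by corpus frequency. This is a minimal illustration under assumed design choices (plain Levenshtein distance, a frequency list built from the forum corpus itself, a fixed distance threshold), not the actual pipeline or candidate-ranking scheme evaluated in this work.

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Standard edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def correct(token: str, freq: Counter, max_dist: int = 1) -> str:
    """Map an out-of-vocabulary token to its closest in-vocabulary
    candidate; ties are broken by descending corpus frequency."""
    if token in freq:
        return token
    dist, _, best = min((levenshtein(token, w), -c, w)
                        for w, c in freq.items())
    return best if dist <= max_dist else token

# Toy "corpus": in practice the vocabulary would be derived from the
# forum data itself, which is what makes the approach unsupervised.
corpus = "the patient was prescribed imatinib for chronic nausea".split()
freq = Counter(corpus)
print(correct("imatnib", freq))  # -> imatinib
print(correct("nause", freq))    # -> nausea
```

Note that a domain-derived vocabulary is essential here: a general-purpose dictionary would flag drug names like "imatinib" as errors rather than as correction targets.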