Varying Background Corpora for SMT-based Text Normalization
Claudia Matos Veliz, Orphée De Clercq and Veronique Hoste


One main characteristic of social media is the use of non-standard language by its users. Since NLP tools are typically trained on traditional text material, this has led to an increased interest in the task of text normalization.
In our work we applied text normalization to English and Dutch and experimented on noisy text from three social media genres: text messages, message board posts and tweets. When applying SMT to the normalization task, it is crucial to select a language model (LM) trained on a background corpus that is close to the standard. One could thus expect that, depending on the level of noise in the data, varying the corpus used to train the LM would lead to better results.
We relied on three different background corpora for constructing our LMs. For English we used the OPUS corpus, Europarl, and the combination of both. Similarly, for Dutch we used an in-house subtitles dataset, Europarl, and the combination of both. We trained LMs at the character (unigram and bigram) and token level. The best results were achieved at the token level for all genres. Regarding the different corpora used to construct the LM, we found that Europarl gave the best results for the least noisy genre (tweets); the same holds for the noisiest genre (text messages). Considering our results, it seems important to vary the background data used to build the LM depending on the amount of noise and the vocabulary present in the social media genre.
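To illustrate the difference between the LM units mentioned above, the sketch below shows how the same noisy sentence would be segmented for token-level, character-unigram and character-bigram LM training. This is only an illustration of the unit choice, not the authors' actual pipeline; the function names and the use of "_" as an explicit space symbol are our own assumptions.

```python
# Hypothetical sketch: segmenting one noisy sentence into the three
# kinds of LM training units discussed in the abstract.

def token_level(sentence):
    """Token-level LM input: whitespace-separated tokens."""
    return sentence.split()

def char_unigrams(sentence):
    """Character-unigram LM input: one symbol per character,
    with '_' marking spaces so word boundaries survive."""
    return [c if c != " " else "_" for c in sentence]

def char_bigrams(sentence):
    """Character-bigram LM input: overlapping pairs of characters."""
    chars = char_unigrams(sentence)
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]

if __name__ == "__main__":
    noisy = "c u l8er"
    print(token_level(noisy))    # ['c', 'u', 'l8er']
    print(char_unigrams(noisy))  # ['c', '_', 'u', '_', 'l', '8', 'e', 'r']
    print(char_bigrams(noisy))   # ['c_', '_u', 'u_', '_l', 'l8', '8e', 'er']
```

In a real setup, each of these segmentations would be written to a file and passed to a standard LM toolkit trained on the chosen background corpus.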