Een baardjuffrouw en haar borstengeknoei. The influence of non-existing words in machine translation output on sentence comprehension.
Laura Van Brussel, Joke Daems and Lieve Macken


The quality of machine translation (MT) has improved enormously thanks to neural machine translation (NMT). Whereas statistical MT systems struggled most with grammar, NMT often produces grammatically correct sentences. However, odd lexical mistakes sometimes appear in NMT output, such as non-existing words. 'Non-existing' refers to words that are coined by the NMT system and that are not part of the target-language lexicon. These mistakes can often be attributed to the fact that NMT operates at sub-word level, e.g. by using byte pair encoding (Sennrich, Haddow, & Birch, 2016) to deal with infrequent words. When MT is used for gisting, non-existing words in the output can cause comprehension problems because the reader may not be able to recover the intended source meaning. Within the framework of the ArisToCAT project (Assessing the Comprehensibility of Automatic Translations), this study investigates whether, and to what extent, non-existing words in English-to-Dutch MT output impair comprehension.
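To illustrate the sub-word mechanism mentioned above, the following is a minimal sketch of byte pair encoding in the spirit of Sennrich, Haddow, & Birch (2016). It is not the segmentation used by any system in this study: the toy word frequencies, the function names and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a {word: frequency} dict."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken alphabetically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(s, best): f for s, f in vocab.items()}
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word into sub-word units using learned merges."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Hypothetical toy frequencies, invented for this example only.
toy_freqs = {"baard": 5, "juffrouw": 4, "baas": 3}
merges = learn_bpe(toy_freqs, num_merges=10)
units = segment("baardjuffrouw", merges)
print(units)
```

Because a decoder emits such units one at a time, nothing guarantees that their concatenation is a word of the target-language lexicon, which is one plausible route to non-existing words.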
Participants are presented with MT sentences that each contain a non-existing word. For each sentence, they first assign a global comprehensibility score. Second, their comprehension is tested with the sentence verification technique (Marchant, Royer, & Greene, 1988): a paraphrase of the MT sentence, with or without new information, is presented, and participants judge whether new information has been added. Third, participants are asked to describe the meaning of the non-existing word.
The comprehensibility scores together with the results of the sentence verification technique and the correctness of the descriptions enable us to determine to what extent non-existing words affect sentence comprehension.
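As a rough sketch of how the three measures could be combined per sentence, the snippet below aggregates hypothetical participant responses. The record layout, the field names and the rating scale are assumptions; the abstract does not specify how the data are stored or analysed.

```python
from dataclasses import dataclass


@dataclass
class Response:
    sentence_id: str
    comprehensibility: int          # global score; the scale is an assumption
    paraphrase_has_new_info: bool   # ground truth for the paraphrase shown
    judged_new_info: bool           # participant's verification judgement
    description_correct: bool       # was the word's meaning described correctly?


def summarise(responses):
    """Return per-sentence means of the three measures."""
    by_sentence = {}
    for r in responses:
        by_sentence.setdefault(r.sentence_id, []).append(r)

    summary = {}
    for sid, rs in by_sentence.items():
        n = len(rs)
        summary[sid] = {
            "mean_comprehensibility": sum(r.comprehensibility for r in rs) / n,
            # A verification is correct when the judgement matches the truth.
            "verification_accuracy": sum(
                r.judged_new_info == r.paraphrase_has_new_info for r in rs
            ) / n,
            "description_accuracy": sum(r.description_correct for r in rs) / n,
        }
    return summary


# Invented example data, not results from the study.
demo = [
    Response("s1", 4, True, True, True),
    Response("s1", 2, True, False, False),
]
result = summarise(demo)
print(result)
```

Low values on all three measures for the same sentence would suggest that the non-existing word hampers comprehension, whereas a high verification accuracy despite an incorrect description would suggest that readers can work around it.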