CLIN 29 in Groningen

Novel Compound Predictor - Learning to Distinguish the Plausible from the Implausible
Prajit Dhar and Lonneke van der Plas

Novel compounds are created on a daily basis. For example, the compounds mobile phone and Microsoft update were unknown a few decades ago but are commonplace now. Compounding is a productive and flexible mechanism for conceptual combination that results in complex lexemes composed of several atomic lexemes.

In this talk, we introduce temporally-aware compositional models (both neural and non-neural) for the novel task of predicting unseen but plausible noun-noun compounds. The compositional models are trained on observed compounds, represented as the composed distributed representations of their constituents (modifier and head) across a time-stamped corpus, as well as on non-attested compounds (negative evidence). To generate this negative evidence, we corrupt the observed compounds by replacing either the head or the modifier with a random word.
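
The corruption step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name corrupt_compounds, the 50/50 choice between replacing the head and the modifier, the retry limit, and the toy vocabulary are all assumptions made for illustration. The only behaviour taken from the abstract is that one constituent of an observed compound is replaced with a random word; discarding candidates that happen to be attested is an added assumption.

```python
import random


def corrupt_compounds(observed, vocabulary, seed=0, max_tries=100):
    """Build one non-attested compound per observed (modifier, head) pair.

    Hypothetical helper: replaces either the modifier or the head with a
    random vocabulary word and keeps the result only if it is not attested.
    """
    rng = random.Random(seed)
    attested = set(observed)
    vocab = sorted(vocabulary)  # sorted so results are reproducible
    negatives = []
    for modifier, head in observed:
        for _ in range(max_tries):
            word = rng.choice(vocab)
            # Assumption: head and modifier are corrupted with equal probability.
            if rng.random() < 0.5:
                candidate = (word, head)      # corrupt the modifier
            else:
                candidate = (modifier, word)  # corrupt the head
            if candidate not in attested:
                negatives.append(candidate)
                break
    return negatives


observed = [("mobile", "phone"), ("coffee", "cup")]
vocabulary = {"mobile", "phone", "coffee", "cup", "river", "idea"}
negatives = corrupt_compounds(observed, vocabulary)
```

Each negative example shares exactly one constituent slot with the observed compound it was derived from, so a classifier trained on both sets has to learn which pairings are plausible rather than which words are frequent.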

The model captures generalizations over this data and learns which combinations give rise to plausible compounds and which do not. After training, we query the model for the plausibility of automatically generated novel combinations and verify whether the classifications are accurate by scanning unseen future data. For our best model, we find that in 85% of cases, the novel compounds generated are attested in previously unseen data.
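
The verification against future data can be sketched as follows. This is one possible reading of the evaluation, not a description of how the reported 85% was computed: the function name attested_rate and the toy compounds are hypothetical, and the sketch assumes the score is the share of predicted-plausible novel compounds that appear in a later, held-out portion of the corpus.

```python
def attested_rate(predicted_plausible, seen_compounds, future_compounds):
    """Fraction of novel predicted compounds that occur in future data.

    Hypothetical helper: compounds already seen in training are excluded,
    since they are not novel.
    """
    novel = [c for c in predicted_plausible if c not in seen_compounds]
    if not novel:
        return 0.0
    hits = sum(1 for c in novel if c in future_compounds)
    return hits / len(novel)


# Toy data, invented for illustration only.
seen = {("mobile", "phone")}
future = {("smart", "phone"), ("cloud", "storage")}
predicted = [
    ("smart", "phone"),
    ("cloud", "storage"),
    ("river", "idea"),
    ("mobile", "phone"),  # already seen, so not counted as novel
]
rate = attested_rate(predicted, seen, future)
```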