A Generative Model of Lexical Paraphrase Representation
Miguel Rios, Wilker Aziz and Khalil Sima'an


In this work we induce monolingual word and sentence representations by exploiting automatically generated sentence paraphrases as parallel English-English data. Our model learns by generating both streams of the data one word at a time while marginalising lexical alignments from original to paraphrased English. Automatically generated paraphrase data poses two challenges. The first is how to deal with paraphrasing operations that introduce words with no semantic equivalent in the original sentence (e.g. due to syntactic transformations). For this, our model stochastically alternates between two components: one that maps words from original to paraphrased English, and another that inserts words based on the paraphrased context. The second is how to deal with the noise inherent in automatically generated data. Here we embed words as Gaussian distributions and impose a sentence-level prior. We evaluate our model on standard semantic benchmarks and show that it induces representations suitable for a range of semantically relevant problems.
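To make the generative story concrete, the following is a minimal sketch, not the authors' implementation, of the per-word likelihood the abstract describes: each paraphrase word is generated either by a translation component that marginalises a lexical alignment to the original sentence (here with a uniform alignment prior, an assumption) or by an insertion component conditioned on paraphrased context, with a learned gate mixing the two. All names (`ParaphraseModel`, `translate`, `insert`, `gate`) and the uniform prior are hypothetical choices for illustration.

```python
import math
import torch
import torch.nn as nn

class ParaphraseModel(nn.Module):
    """Sketch: mixture of a translation component (alignments marginalised)
    and an insertion component, per paraphrase word."""

    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.translate = nn.Linear(dim, vocab_size)  # P(y | aligned original word x_i)
        self.insert = nn.Linear(dim, vocab_size)     # P(y | paraphrased context)
        self.gate = nn.Linear(dim, 1)                # P(translation component | context)

    def log_likelihood(self, x, y, context):
        # x: (m,) original word ids; y: (n,) paraphrase word ids;
        # context: (n, dim) stand-in for a learned encoding of paraphrased context.
        m, n = x.size(0), y.size(0)
        trans_logp = torch.log_softmax(self.translate(self.embed(x)), dim=-1)  # (m, V)
        # Marginalise the alignment a_j of each paraphrase word uniformly over x.
        align_logp = torch.logsumexp(trans_logp[:, y], dim=0) - math.log(m)    # (n,)
        ins_logp = torch.log_softmax(self.insert(context), dim=-1)[torch.arange(n), y]
        g = torch.sigmoid(self.gate(context)).squeeze(-1)                      # (n,)
        # Per-word mixture of the two components, in log space.
        word_logp = torch.logaddexp(torch.log(g) + align_logp,
                                    torch.log1p(-g) + ins_logp)
        return word_logp.sum()

# Toy usage with arbitrary ids and a random context encoding.
model = ParaphraseModel(vocab_size=1000, dim=64)
x = torch.tensor([3, 17, 42])        # original sentence
y = torch.tensor([3, 99, 42, 7])     # paraphrase
print(model.log_likelihood(x, y, torch.randn(4, 64)))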
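The second ingredient, embedding words as Gaussian distributions under a prior, could look roughly as follows. This sketch uses a standard Normal prior with a per-word KL penalty in place of the sentence-level prior the abstract refers to, and the class name `GaussianEmbedding` is hypothetical.

```python
import torch
import torch.nn as nn

class GaussianEmbedding(nn.Module):
    """Sketch: each word is a diagonal Gaussian N(mu_w, sigma_w^2);
    the variance can absorb noise in the automatically generated data."""

    def __init__(self, vocab_size, dim):
        super().__init__()
        self.mu = nn.Embedding(vocab_size, dim)
        self.log_var = nn.Embedding(vocab_size, dim)
        nn.init.zeros_(self.log_var.weight)  # start at unit variance

    def forward(self, words):
        mu, log_var = self.mu(words), self.log_var(words)
        # Reparameterised sample z ~ N(mu, sigma^2).
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        # KL( N(mu, sigma^2) || N(0, I) ), summed over embedding dimensions;
        # a standard Normal prior stands in for the paper's sentence-level prior.
        kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(-1)
        return z, kl
```

In training, the KL term would be added to the word-level likelihood above, yielding a variational objective in which noisy paraphrase pairs pull less strongly on the means.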