Best practices for learning domain-specific cross-lingual embeddings
Lena Shakurova, Beata Nyari, Chao Li and Mihai Rotaru
Cross-lingual embeddings aim to represent words in multiple languages in a shared vector space by capturing semantic similarities across languages. They are a crucial component for scaling applications to multiple languages by transferring knowledge from languages with rich resources to low-resource languages. We investigate the best practices for learning cross-lingual embeddings for the sequence labeling task of multilingual CV (resume) parsing.
We experiment with three factors that affect the quality of cross-lingual embeddings. First, we observed that quality is sensitive to the choice of anchor words in the bilingual lexicon. Because our application targets a specific domain (human resources), we explore adding domain-specific terms to the bilingual lexicons. We evaluate their effect using both intrinsic embedding metrics (e.g., word translation, multiQVEC) and downstream model performance (segmentation of sections in CVs). Second, we experiment with various approaches to learning the transformation between the language-specific embedding spaces. We consider simple linear projections (e.g., canonical correlation analysis (CCA)) as well as recently introduced methods such as NORMA or other non-linear projections. Third, motivated by previous research showing that embedding quality drops when aligning distant languages, we experiment with different combinations of languages to measure this effect and to establish best practices for handling languages from different language families.
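To make the second factor concrete, the following is a minimal sketch of how a linear projection between two language-specific embedding spaces can be learned from anchor word pairs and then scored with a word translation metric. It is an illustration under stated assumptions, not the method evaluated in this work: it uses orthogonal Procrustes (a common linear baseline) instead of CCA or NORMA, synthetic random vectors stand in for real embeddings and a real bilingual lexicon, and all function and variable names are hypothetical.

```python
import numpy as np


def learn_orthogonal_map(src_anchors, tgt_anchors):
    """Orthogonal Procrustes: find W with W^T W = I minimizing ||src W - tgt||_F.

    Row i of src_anchors and row i of tgt_anchors are assumed to be the
    embeddings of one translation pair from the bilingual lexicon.
    """
    u, _, vt = np.linalg.svd(src_anchors.T @ tgt_anchors)
    return u @ vt


def precision_at_1(mapped_src, tgt_vocab, gold_indices):
    """Share of source words whose cosine nearest neighbour in the target
    vocabulary is the gold translation."""
    a = mapped_src / np.linalg.norm(mapped_src, axis=1, keepdims=True)
    b = tgt_vocab / np.linalg.norm(tgt_vocab, axis=1, keepdims=True)
    predicted = (a @ b.T).argmax(axis=1)
    return float((predicted == gold_indices).mean())


# Synthetic stand-in data (assumption): the "target language" space is a
# rotated, slightly noisy copy of the "source language" space.
rng = np.random.default_rng(0)
dim, n_words, n_anchors = 5, 30, 20
src = rng.normal(size=(n_words, dim))
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
tgt = src @ true_rotation + 0.01 * rng.normal(size=(n_words, dim))

# Learn the map on the anchor pairs only, then evaluate on held-out words.
W = learn_orthogonal_map(src[:n_anchors], tgt[:n_anchors])
held_out = np.arange(n_anchors, n_words)
score = precision_at_1(src[held_out] @ W, tgt, held_out)
print(f"word translation precision@1 on held-out pairs: {score:.2f}")
```

In this setup, changing which rows serve as anchors (for example, swapping general-vocabulary pairs for domain-specific ones) changes the learned map and therefore the held-out score, which is the kind of sensitivity the first factor refers to.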