Word Symbolization with a Character-based Statistical Machine Translation Approach

The Parallel Meaning Bank is a multilingual corpus comprising millions of words in four languages, accompanied by cross-lingual semantic annotation. We represent meaning not only with lexical symbols but also with non-lexical symbols, which can be inferred from the input sequence. For example, the transformation consistently turns “seventy-six” into “76” and “Mr.” into “mister” throughout the corpus. This process is referred to as symbolization, and it shares similarities with standardization, lemmatization and error correction. We employed statistical machine translation at the character level to build a symbolizer and integrated it as a component of the Parallel Meaning Bank.
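
To illustrate what "character level" can mean in practice, here is a minimal sketch of one common way to prepare word pairs for a statistical machine translation system: each character becomes its own token, so the system learns character-to-character mappings rather than word-to-word ones. The abstract does not name a toolkit or data format, so the function names, the `<space>` placeholder and the `|||` separator below are assumptions made for illustration only, not the authors' actual pipeline.

```python
# Hypothetical placeholder standing in for a literal space inside a sequence.
SPACE = "<space>"


def to_char_tokens(text: str) -> str:
    """Split text into space-separated characters so each character is one token."""
    return " ".join(SPACE if ch == " " else ch for ch in text)


def from_char_tokens(tokens: str) -> str:
    """Invert to_char_tokens: join the tokens and restore literal spaces."""
    return "".join(" " if tok == SPACE else tok for tok in tokens.split(" "))


# The two example pairs come from the abstract; a real training set would be far larger.
pairs = [("seventy-six", "76"), ("Mr.", "mister")]

for source, target in pairs:
    # One training pair per line, source and target sides separated by "|||".
    print(to_char_tokens(source), "|||", to_char_tokens(target))
```

A symbolizer trained on such pairs would receive `to_char_tokens(word)` as input, and its output would be passed through `from_char_tokens` to recover the symbol.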

duy.txt · Last modified: 2019/02/06 16:03 (external edit)