Gotta catch 'em all: A wide-coverage corpus of idiomatic expressions in English
Hessel Haagsma, Malvina Nissim and Johan Bos
Idiomatic expressions like 'out of the woods' and 'up the ante' present a range of difficulties for natural language processing applications. We present work on the annotation of a corpus of what we term potentially idiomatic expressions (PIEs), a subclass of multiword expressions covering both literal and non-literal uses of idiomatic expressions. Existing corpora are small and cover only a limited range of expression types, which hampers research. To make further progress on the automatic understanding of idiomatic expressions, larger corpora of PIEs are required. In addition, larger corpora are a potential source of valuable linguistic insights into idioms and their variability.
We present a work-in-progress report on creating a large corpus of idiomatic expressions in English. Our corpus distinguishes itself from existing corpora by the number of idiomatic expressions included (~2000), the inclusion of as many syntactic and morphological variants of those idioms as possible, the use of automatic pre-extraction tools, and the use of crowdsourcing for joint PIE/non-PIE and sense annotation. We combine multiple existing idiom dictionaries to acquire our set of expressions and use the FigureEight crowdsourcing platform to collect the annotations. In this presentation, we will show preliminary results from pilot rounds of annotation and discuss the decisions we have made. In addition, we would welcome audience input on what we should (not) include in the annotation and which pitfalls to be aware of during corpus creation.
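To make the idea of combining dictionaries and pre-extracting candidates concrete, the following is a minimal sketch and not the tooling used for the corpus. All function names are hypothetical, and the regular-expression matching of inflected forms and intervening words is an assumption made for illustration; a real pre-extraction tool would more likely rely on lemmatization or parsing.

```python
import re

def normalize(idiom: str) -> str:
    """Lowercase and collapse whitespace so entries from different dictionaries compare equal."""
    return " ".join(idiom.lower().split())

def merge_dictionaries(*dictionaries):
    """Return the deduplicated, sorted union of several idiom lists (hypothetical helper)."""
    merged = set()
    for entries in dictionaries:
        merged.update(normalize(entry) for entry in entries)
    return sorted(merged)

def candidate_pattern(idiom: str) -> re.Pattern:
    """Build a deliberately permissive regex for one idiom.

    Assumption: each word may carry a simple inflectional suffix (optionally
    with a doubled final consonant, e.g. 'upped'), and up to two extra words
    may intervene between the idiom's words.
    """
    words = []
    for word in idiom.split():
        last = re.escape(word[-1])
        words.append(re.escape(word) + rf"(?:s|es|d|{last}?(?:ed|ing))?")
    gap = r"(?:\s+\w+){0,2}?\s+"
    return re.compile(r"\b" + gap.join(words) + r"\b", re.IGNORECASE)

def pre_extract(sentences, idioms):
    """Return (idiom, sentence) pairs for every sentence containing a candidate PIE."""
    patterns = {idiom: candidate_pattern(idiom) for idiom in idioms}
    hits = []
    for sentence in sentences:
        for idiom, pattern in patterns.items():
            if pattern.search(sentence):
                hits.append((idiom, sentence))
    return hits

idioms = merge_dictionaries(
    ["Out of the woods", "up the ante"],   # entries from one dictionary
    ["out of  the woods"],                 # same idiom, different formatting
)
sentences = [
    "They upped the ante again.",
    "He walked out of the woods at dusk.",  # literal use, still a PIE candidate
    "The weather was fine.",
]
for idiom, sentence in pre_extract(sentences, idioms):
    print(f"{idiom!r}: {sentence}")
```

The sketch over-generates on purpose: it returns literal uses as well as idiomatic ones, since deciding between the two is the annotators' task and not the extractor's.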