Most wanted resources: imperfect written corpora
Ineke Schuurman, Vincent Vandeghinste and Leen Sevens


In our Picto tools for people with an intellectual disability, we use corpora to detect, for example, which spelling mistakes are commonly made (phonetic spelling in particular), what the typical vocabulary is, and how our users tend to compose sentences. It is essential that these corpora contain the *original* texts, including all types of 'mistakes': spelling, word order, punctuation, and so on. In that sense, we need 'imperfect' corpora.

We are even more interested in such imperfect corpora written by other groups of people who are, to a large extent, functionally illiterate in the language they have to use, in this case Dutch. These could be migrants (both beginning and more advanced NT2 students, i.e. learners of Dutch as a second language), deaf people (whose mother tongue is a sign language), younger children, functionally illiterate native speakers of Dutch, elderly people, and others.
They may compose sentences in a way that is strongly influenced by their mother tongue and culture, both in word order and in the way they phrase their thoughts (Dutch people, for example, tend to be more direct than their southern neighbours). Their spelling may also be influenced by their mother tongue.

We are curious to see what already exists but is hard to find, and how more data can be collected in such a way that they are reusable by all of us (with the GDPR in mind).