Discourse features for author profiling

Document representations for author profiling experiments are currently mostly limited to word-based features, sometimes supplemented with syntactic information. We are investigating whether discourse characteristics, used as features, can improve the document representation. We hypothesise that groups of people who share a sociological or psychological factor (e.g. gender) organise discourse in similar ways, e.g. by using similar discourse structures, similar connectives and similar ways of structuring text in space and time.

We began this research by investigating low-level approaches to discourse, starting with the creation of lexicons of Dutch discourse adverbs and discourse connectives by mining the Dutch Wiktionary. In the lexicon of adverbs, words are categorised according to the adverb types described in the ANS Dutch grammar. The conjunctions are categorised according to the Penn Discourse Treebank categories. A word can belong to more than one category, to take polysemy and ambiguity into account. The lexicons will be made publicly available. We propose this representation as a first approximation of discourse structure, and we present an analysis of how these features perform on the classification of author gender in two Dutch corpora (reviews from the CSI Corpus, and blogs).
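As a rough illustration of how such a lexicon could be turned into document features, the sketch below counts, per document, how often each discourse category occurs relative to document length. The lexicon entries, the category labels (modelled on the four top-level Penn Discourse Treebank classes) and the function names are all assumptions made for this example; they are not the actual lexicons or the feature extraction used in the study.

```python
import re
from collections import Counter

# Hypothetical toy lexicon: word -> set of categories.
# Entries and labels are illustrative only, not taken from the real lexicons.
LEXICON = {
    "maar": {"COMPARISON"},                  # "but"
    "omdat": {"CONTINGENCY"},                # "because"
    "terwijl": {"TEMPORAL", "COMPARISON"},   # "while": ambiguous, so two categories
    "en": {"EXPANSION"},                     # "and"
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def category_features(text, lexicon=LEXICON):
    """Return the relative frequency of each discourse category in the text.

    A word listed under several categories contributes to each of them,
    which mirrors the choice to keep ambiguous words in more than one class.
    """
    tokens = tokenize(text)
    counts = Counter()
    for token in tokens:
        for category in lexicon.get(token, ()):
            counts[category] += 1
    n_tokens = len(tokens) or 1  # avoid division by zero on empty input
    return {category: count / n_tokens for category, count in counts.items()}


features = category_features(
    "Ik bleef thuis omdat het regende, terwijl zij ging wandelen."
)
```

The resulting dictionary can be passed to any standard classifier as a small, dense feature vector. Counting an ambiguous word in every category it belongs to is only one possible design; splitting its weight across categories would be another.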

As a side study, we also compare the actual usage of discourse words (extracted from a collection of corpora) with the discourse words listed in Wiktionary. This allows us to evaluate how representative the Wiktionary content is.
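One simple way to quantify such a comparison is to measure how much of the corpus-observed discourse vocabulary is covered by the Wiktionary-derived lexicon, both by distinct words (types) and by occurrences (tokens). The sketch below assumes the corpus side is already available as a word-to-frequency mapping; the function name, the data and the choice of coverage measures are hypothetical and not necessarily those used in the study.

```python
def lexicon_coverage(corpus_counts, lexicon_words):
    """Compare corpus-observed discourse words with a lexicon.

    corpus_counts: dict mapping a discourse word to its corpus frequency.
    lexicon_words: set of words present in the lexicon.
    Returns (type coverage, token coverage, sorted list of missing words).
    """
    covered = {word for word in corpus_counts if word in lexicon_words}
    total_tokens = sum(corpus_counts.values())
    type_cov = len(covered) / len(corpus_counts) if corpus_counts else 0.0
    token_cov = (
        sum(corpus_counts[word] for word in covered) / total_tokens
        if total_tokens
        else 0.0
    )
    missing = sorted(set(corpus_counts) - covered)
    return type_cov, token_cov, missing


# Made-up frequencies, for illustration only.
observed = {"maar": 6, "omdat": 3, "dus": 1}
wiktionary_words = {"maar", "omdat", "terwijl"}
type_cov, token_cov, missing = lexicon_coverage(observed, wiktionary_words)
```

Reporting both measures is useful because a lexicon can miss many rare words (low type coverage) while still accounting for most actual usage (high token coverage).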

ben.txt · Last modified: 2019/02/06 16:03 (external edit)