Wat gaat er boven Groningen? Retrieving similar and related questions on GoeieVraag.
Florian Kunneman, Thiago Castro Ferreira, Antal van den Bosch and Emiel Krahmer


We present work on detecting open-domain questions, asked on the Dutch community question answering platform 'GoeieVraag', that are either similar or related to any given question. Detecting similar questions helps to prevent duplicates and gives users direct access to the information they are looking for. Related questions can be offered as additional content alongside a queried or newly asked question, to assist visitors who are likely interested in the question's broader topic.

Question search is a popular task in the field of community question answering. We implemented several question similarity metrics that were proposed in recent work, such as machine translation, tree kernels and the soft cosine metric based on deep contextualized word representations, and applied them to the English 'Qatar Living' and 'CQADupStack' benchmarks as well as to a manually annotated subset of GoeieVraag. The aim of our study is to decide whether any two questions are semantically similar (e.g., asking for the same information), related (e.g., referring to the same topic) or unrelated. To this end, we trained a machine learning classifier that assigns one of these three labels to a pair of questions, based on the similarity scores produced by the different metrics.
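
To illustrate one of the metrics named above, the following is a minimal sketch of the soft cosine measure between two tokenized questions. It is not the implementation used in the study: the term-term similarity function `toy_sim` and its single hand-written entry are hypothetical stand-ins for similarities that would in practice be derived from word representations, and questions are represented by simple token counts.

```python
import math
from collections import Counter


def soft_cosine(tokens_a, tokens_b, sim):
    """Soft cosine between two token lists.

    `sim(s, t)` returns a similarity in [0, 1] for two terms; with an
    exact-match `sim` this reduces to the ordinary cosine on token counts.
    """
    a, b = Counter(tokens_a), Counter(tokens_b)

    def bilinear(x, y):
        # x^T S y, where S[s][t] = sim(s, t)
        return sum(x[s] * y[t] * sim(s, t) for s in x for t in y)

    denom = math.sqrt(bilinear(a, a)) * math.sqrt(bilinear(b, b))
    return bilinear(a, b) / denom if denom else 0.0


def exact_sim(s, t):
    """Terms are similar only if identical (plain cosine)."""
    return 1.0 if s == t else 0.0


# Hypothetical toy similarities, for illustration only.
TOY_SIM = {frozenset(("bike", "bicycle")): 0.8}


def toy_sim(s, t):
    """Exact match, plus a hand-written similarity for one word pair."""
    if s == t:
        return 1.0
    return TOY_SIM.get(frozenset((s, t)), 0.0)
```

With `exact_sim`, two questions that share no tokens score zero, whereas a similarity function that relates different words can still give them a non-zero score. Scores from several such metrics could then be collected into a feature vector for the three-way classifier.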

Our study contributes insight into the influence of both preprocessing steps (such as the removal of stopwords and punctuation) and the choice of machine learning algorithm. In addition, we investigate whether separate modules for extracting the answer type and the main topic of a question are beneficial to detecting related questions.
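
The preprocessing steps mentioned above can be sketched as a function with switchable options, so that their effect on downstream similarity scores can be compared. This is an illustrative sketch only: the stopword set `STOPWORDS` is a tiny hypothetical list, and whitespace tokenization and lowercasing are assumptions, not the study's actual pipeline.

```python
import string

# Tiny hypothetical Dutch stopword list, for illustration only.
STOPWORDS = {"de", "het", "een", "is", "er"}


def preprocess(question, remove_stopwords=True, remove_punctuation=True):
    """Lowercase and tokenize a question, optionally dropping
    ASCII punctuation and stopwords."""
    text = question.lower()
    if remove_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens
```

Calling `preprocess` with different flag combinations on the same question yields the alternative token sequences whose similarity scores can then be compared.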