De-identification of Dutch Mental Health Data
Erik Tjong Kim Sang, Ben de Vries, Wouter Smink, Bernard Veldkamp, Gerben Westerhof and Anneke Sools


De-identification, or anonymization, is the process of removing textual information related to the personal identity of the authors or subjects of a text. It is a key processing step preceding the release of sensitive documents for scientific study. De-identification is a complex task because it is difficult to balance privacy against the usability of the text for scientific study. The quality of data anonymization is extremely important, but recent work for English (Stubbs et al., 2017) and Dutch (Menger et al., 2017) shows that the current state of the art is insufficient for relying on automatic de-identification.

In this paper we present our approach to the de-identification of Dutch mental health texts. The unavailability of gold standard data prevented a machine-learning approach, so we employ a rule-based method that relies on a public-domain entity tagger. False positives produced by the tagger are handled by storing the entity words in a cache, which is manually checked before the final run. Words linked to personal identity are replaced by fillers indicating their entity type. In rare predefined cases, a numeric id is appended to the filler token to allow for linking information between documents.
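
The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The gazetteer-based `toy_tagger` stands in for the entity tagger, and the filler format (`<PER>`, `<PER-1>`), the entity labels, and all function names are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Tagger = Callable[[List[str]], List[Tuple[str, Optional[str]]]]


def toy_tagger(tokens: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Hypothetical stand-in for an entity tagger: a simple gazetteer lookup."""
    gazetteer = {"Jan": "PER", "Utrecht": "LOC", "Zon": "PER"}
    return [(tok, gazetteer.get(tok)) for tok in tokens]


def collect_candidates(docs: List[List[str]], tagger: Tagger) -> Dict[str, str]:
    """First pass: store every word the tagger marks as an entity in a cache."""
    cache: Dict[str, str] = {}
    for tokens in docs:
        for tok, label in tagger(tokens):
            if label is not None:
                cache.setdefault(tok, label)
    return cache


def deidentify(
    tokens: List[str],
    approved: Dict[str, str],
    linked_types: Set[str],
    ids: Dict[Tuple[str, str], int],
) -> List[str]:
    """Final run: replace approved entity words by fillers naming their type.

    For entity types in `linked_types` a numeric id is appended, so the same
    word receives the same filler in every document that shares `ids`.
    """
    out = []
    for tok in tokens:
        label = approved.get(tok)
        if label is None:
            out.append(tok)
        elif label in linked_types:
            number = ids.setdefault((label, tok), len(ids) + 1)
            out.append(f"<{label}-{number}>")
        else:
            out.append(f"<{label}>")
    return out


docs = [["Jan", "woont", "in", "Utrecht"], ["De", "Zon", "schijnt", "op", "Jan"]]
cache = collect_candidates(docs, toy_tagger)

# Manual check of the cache: a reviewer rejects "Zon" as a false positive.
approved = {word: label for word, label in cache.items() if word != "Zon"}

shared_ids: Dict[Tuple[str, str], int] = {}
result = [deidentify(doc, approved, {"PER"}, shared_ids) for doc in docs]
```

In this sketch the manual check is modeled by deleting rejected words from the cache before the final run. The shared `shared_ids` dictionary is what lets "Jan" map to the same filler in both documents.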

We have applied the de-identification process to two collections of Dutch mental health records. Evaluating the process is a challenge because we do not have access to the original documents. We are currently building a corpus of Dutch online biographies which, after an egofication process (changing he/she to I), will serve as evaluation data for our de-identification method.