Introducing HAMLET: Hybrid Adaptable Machine Learning approach to Extract Terminology
Ayla Rigouts Terryn, Véronique Hoste and Els Lefever


Automatic Term Extraction (ATE) has evolved from purely statistical or linguistic systems to hybrid approaches. More recently, machine learning has been applied to the task of identifying domain-specific vocabulary in specialised texts. Whereas ATE previously faced the data acquisition bottleneck only for evaluation, the shift to machine learning means that training data are now required as well. Constructing datasets by manually annotating terms is time-consuming and labour-intensive, often results in low inter-annotator agreement, and is not supported by universally recognised guidelines. Therefore, few large datasets are available, and even fewer cover different languages and domains annotated according to the same principles. During the first part of this project, such a dataset was created, with terms annotated in three languages and four domains, resulting in over 100k annotations. This dataset is used to train and test a machine learning approach to term extraction. The system is built along the same principles as traditional hybrid systems: term candidates are identified by their part-of-speech patterns (patterns found in the training data) and then filtered and sorted based on other features. To this end, we combine linguistic features (e.g. part-of-speech), shape features (e.g. length, special characters, capitalisation) and statistical features (e.g. termhood and unithood measures). The main focus is to find out whether a supervised machine learning approach can be used to train an adaptable ATE system that generalises to new, unseen data in another domain, another language or even a differently annotated dataset.
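
The two-stage design described above (candidate identification by part-of-speech patterns seen in the training data, followed by feature-based filtering) can be illustrated with a minimal sketch. This is not the HAMLET implementation: the function names, the tag labels, the `max_len` limit and the small feature set are all assumptions chosen for illustration, and the statistical features (termhood, unithood) and the classifier itself are omitted.

```python
from collections import Counter

def pos_pattern(span):
    """Return the part-of-speech sequence of a list of (token, pos) pairs."""
    return tuple(pos for _, pos in span)

def learn_patterns(tagged_sentences, gold_terms, max_len=4):
    """Collect the POS patterns of all annotated terms in the training data.

    gold_terms is a set of token tuples (hypothetical annotation format).
    """
    patterns = set()
    for sent in tagged_sentences:
        for i in range(len(sent)):
            for j in range(i + 1, min(i + max_len, len(sent)) + 1):
                span = sent[i:j]
                if tuple(tok for tok, _ in span) in gold_terms:
                    patterns.add(pos_pattern(span))
    return patterns

def extract_candidates(tagged_sentences, patterns, max_len=4):
    """Count every span in unseen text whose POS pattern was seen in training."""
    counts = Counter()
    for sent in tagged_sentences:
        for i in range(len(sent)):
            for j in range(i + 1, min(i + max_len, len(sent)) + 1):
                span = sent[i:j]
                if pos_pattern(span) in patterns:
                    counts[tuple(tok for tok, _ in span)] += 1
    return counts

def shape_features(candidate, freq):
    """Illustrative shape features plus raw frequency for one candidate."""
    text = " ".join(candidate)
    return {
        "n_tokens": len(candidate),
        "n_chars": len(text),
        "has_special": any(not (c.isalnum() or c.isspace()) for c in text),
        "capitalised": text[0].isupper(),
        "freq": freq,
    }

if __name__ == "__main__":
    # Toy training data in one domain, toy test data in another.
    train = [[("heart", "NOUN"), ("failure", "NOUN"), ("is", "VERB"), ("common", "ADJ")]]
    gold = {("heart", "failure")}
    test = [[("The", "DET"), ("wind", "NOUN"), ("turbine", "NOUN"), ("spins", "VERB")]]

    patterns = learn_patterns(train, gold)
    for cand, freq in extract_candidates(test, patterns).items():
        print(cand, shape_features(cand, freq))
```

In a full system, the feature dictionaries would be extended with statistical measures and passed to a supervised classifier that decides which candidates to keep; the sketch only shows how patterns learnt in one domain can propose candidates in another.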