CLIN 29 in Groningen

Extracting footnotes from OCRed text
Hugo de Vos, Suzan Verberne and Bernard Steunenberg

Footnotes in texts may contain relevant information, for example in scientific literature, historical annotations, and legal documents. In this project, we study the policy-making process of the European Union (EU) by analyzing annotated versions of law texts. In these documents, the annotations take the form of footnotes. Many of the EU documents have been stored as non-machine-readable PDFs and need to be OCRed before they can be processed.
Documents that are scanned and OCRed are stored as a sequence of pages. When creating a corpus from scanned documents, we are not interested in the unit of ‘page’. Instead, we want to represent the document structure in the form of chapters and sections. However, footnotes impose a page structure, as they interrupt the main text flow at the end of every page. Footnotes can be processed in different ways: they can be ignored, included in the vector space model for the text as a whole, or analyzed in a more structured way. In each case, the footnotes first have to be recognized.
We present a rule-based algorithm for extracting footnotes from plain text (OCR output). It recognizes footnotes without using content features or typographical features such as font size. The algorithm was designed for policy documents from the Council of the EU, where it achieves a recall of 95%. We provide a full evaluation of the algorithm on those documents and report how well it works on other types of texts with footnotes.
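
To make the idea of rule-based footnote extraction from plain text concrete, the following is a minimal Python sketch. It is not the algorithm presented in this work, whose rules are not given in this abstract. The rule used here is an assumption chosen for illustration: a footnote block is taken to be the trailing part of a page that starts at a line with a small numeric marker, where the markers found at line starts form a consecutive ascending run. The names `MARKER` and `split_page` are hypothetical.

```python
import re
from typing import List, Tuple

# Assumed marker shape (illustrative only): a line that starts with a
# 1-3 digit number, optionally in parentheses, followed by a dot or
# whitespace and then some text, e.g. "1 OJ L 123" or "(2) Ibid."
MARKER = re.compile(r"^\s*\(?(\d{1,3})\)?[.\s]\s*\S")


def split_page(lines: List[str]) -> Tuple[List[str], List[str]]:
    """Split the lines of one OCRed page into (body, footnotes).

    Walks backwards over the marked lines, starting from the last one,
    for as long as the markers decrease by exactly one. Everything from
    the earliest line in that run to the end of the page is returned as
    the footnote block; unmarked lines inside it are continuation lines.
    """
    candidates = [
        (i, int(m.group(1)))
        for i, line in enumerate(lines)
        if (m := MARKER.match(line))
    ]
    if not candidates:
        return lines, []

    start, expected = candidates[-1]
    for i, number in reversed(candidates[:-1]):
        if number != expected - 1:
            break  # the run of consecutive markers ends here
        start, expected = i, number

    return lines[:start], lines[start:]


page = [
    "The Council adopted the proposal1 after",
    "consulting the Parliament2 in 2004.",
    "1 OJ L 123, p. 4.",
    "2 Opinion of 5 May 2004,",
    "not yet published.",
]
body, footnotes = split_page(page)
```

This sketch uses only line position and marker numbering, not font size or the wording of the notes. It will misfire on body lines that happen to start with a number (for example numbered list items), so a usable system would need additional rules and an evaluation like the one described above.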