CLIN 29 in Groningen

Text classification of financial annual reports
Matej Martinc, Aljoša Valentinčič, Martin Žnidarsič and Senja Pollak

In our analysis, we investigate the annual reports of the 30 companies in the Dow Jones Industrial Average (DJIA). We are only interested in Item 7 and Item 7A from Part II, as they cover the less regulated parts of the reports. Each report is matched with company-specific financial information, represented as a set of three binary indicators. The first two, an increase or decrease in earnings and positive or negative cash flow from operations, are based on financial statements, while the third indicates whether or not the firm pays dividends. On the one hand, we aim to determine whether the value of a financial indicator for a specific year can be estimated from the textual content of that year's annual report; on the other, we investigate whether the current year's report can be used to predict the indicator for the following year. We consider two distinct text classification approaches appropriate for small datasets. The first is an SVM-based approach with extensive feature engineering, and the second is the ULMFiT neural approach proposed by Howard and Ruder (2018), which uses transfer learning to compensate for the limited contextual information available in small datasets. One challenge we encountered is the class imbalance of the indicators, which we tackled with two different oversampling techniques. Results show that the SVM-based technique outperforms the ULMFiT neural approach. Even though we improve significantly over the baseline for two out of three financial indicators, the performance of the classifiers is still not fully satisfactory.
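The two prediction settings and the class-imbalance step can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The function names, the toy data, and the choice of simple random oversampling are assumptions made for illustration only, since the abstract does not name the two oversampling techniques that were used.

```python
import random
from collections import Counter


def pair_reports_with_labels(reports, indicators, horizon=0):
    """Pair each (company, year) report with the indicator for year + horizon.

    horizon=0 gives the same-year setting, horizon=1 the next-year setting.
    Reports without a matching indicator are skipped.
    """
    pairs = []
    for (company, year), text in sorted(reports.items()):
        label = indicators.get((company, year + horizon))
        if label is not None:
            pairs.append((text, label))
    return pairs


def random_oversample(pairs, seed=0):
    """Duplicate randomly chosen examples of the smaller classes until
    every class has as many examples as the largest one.

    Assumption: plain random oversampling, used here only as an example.
    """
    rng = random.Random(seed)
    by_label = {}
    for text, label in pairs:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        balanced.extend(rng.choice(items) for _ in range(target - len(items)))
    return balanced


# Hypothetical toy data: report text keyed by (company, year), and one
# binary indicator (for example, 1 = the firm pays dividends).
reports = {
    ("AAA", 2015): "item 7 text for AAA, 2015",
    ("AAA", 2016): "item 7 text for AAA, 2016",
    ("AAA", 2017): "item 7 text for AAA, 2017",
    ("BBB", 2015): "item 7 text for BBB, 2015",
    ("BBB", 2016): "item 7 text for BBB, 2016",
}
indicators = {
    ("AAA", 2015): 1,
    ("AAA", 2016): 1,
    ("AAA", 2017): 0,
    ("BBB", 2015): 1,
    ("BBB", 2016): 1,
}

same_year = pair_reports_with_labels(reports, indicators, horizon=0)
next_year = pair_reports_with_labels(reports, indicators, horizon=1)
balanced = random_oversample(same_year)
```

In a setup like this, oversampling would normally be applied to the training portion of the data only, so that duplicated reports do not leak into the evaluation set.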