Daniël de Kok (2008)
Headline generation for Dutch newspaper articles through transformation-based learning
Master's thesis, Rijksuniversiteit Groningen.
[ Paper (PDF, 384 kb) ]

Introduction

The digital age has given us access to an ever-increasing amount of information: historical documents are being digitized, and new information is produced at an astonishing rate. This flood of information can lead to a state that is sometimes referred to as information overload. Text summarization can provide a partial solution to this problem. Given short but accurate summaries, a reader can quickly select the texts that are interesting and capture the relevant information they contain.

An important area of research within text summarization is sentence compression. The goal of sentence compression is to shorten a given sentence to a prespecified length. Sentence compression is useful for various tasks such as the extraction of the core semantics of a sentence, or for generating short headlines from salient sentences. While headline generation can be useful on its own, it's also a well-defined task for researching and improving sentence compression: there is a wealth of training material available in the form of newspaper or magazine headlines, and it is a task that can relatively easily be verified by humans.

Various methods have been proposed to generate headlines for newspaper articles. Virtually all of the methods developed in recent years make use of parse trees of newspaper article sentences and newspaper headlines. Usually, the first sentence of an article is selected, and words or constituents that are deemed unimportant are removed. This process is often referred to as trimming. Both statistical and rule-based approaches have been tried for this task.

This thesis proposes a trimming method that uses transformation-based learning (TBL). TBL algorithms learn rules based on corpus statistics. As such, they produce a human-readable list of rules, but do not require the work involved in manual rule creation. TBL has been employed successfully in other tasks, such as part-of-speech tagging and chunking. The TBL system described in this thesis learns rules that are conditioned on dependency relations and node characteristics within dependency trees produced by the Dutch Alpino parser and grammar. When all the conditions of a rule are satisfied, the action associated with the rule (such as node deletion or movement) is invoked. As a result, the system can make informed decisions about which structures and nodes (and thus words) can be removed safely to trim a sentence.
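As an illustration of how such a rule might fire, the following minimal sketch applies one deletion rule to a small hand-built dependency tree. The `Node` and `DeleteRule` classes, the helper names, and the example sentence are hypothetical and are not taken from the thesis; the relation and category labels are only loosely modeled on Alpino-style output, and only the deletion action is shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    rel: str                      # dependency relation to the parent
    cat: str                      # phrase category or part-of-speech tag
    word: Optional[str] = None    # surface form, set on leaves only
    index: int = -1               # word position, set on leaves only
    children: List["Node"] = field(default_factory=list)


@dataclass
class DeleteRule:
    """Hypothetical rule: delete a node when all three conditions hold."""
    rel: str
    cat: str
    parent_cat: str

    def matches(self, node: Node, parent: Node) -> bool:
        return (node.rel == self.rel
                and node.cat == self.cat
                and parent.cat == self.parent_cat)


def apply_rule(rule: DeleteRule, node: Node) -> None:
    """Remove every matching child subtree, top-down, in place."""
    node.children = [c for c in node.children if not rule.matches(c, node)]
    for child in node.children:
        apply_rule(rule, child)


def words(node: Node) -> List[str]:
    """Return the words under a node in sentence order."""
    leaves = []

    def collect(n: Node) -> None:
        if n.word is not None:
            leaves.append((n.index, n.word))
        for c in n.children:
            collect(c)

    collect(node)
    return [w for _, w in sorted(leaves)]


def leaf(rel: str, cat: str, word: str, index: int) -> Node:
    return Node(rel, cat, word, index)


# "de minister bezoekt het ziekenhuis in Groningen"
tree = Node("top", "smain", children=[
    Node("su", "np", children=[
        leaf("det", "det", "de", 0),
        leaf("hd", "noun", "minister", 1),
    ]),
    leaf("hd", "verb", "bezoekt", 2),
    Node("obj1", "np", children=[
        leaf("det", "det", "het", 3),
        leaf("hd", "noun", "ziekenhuis", 4),
        Node("mod", "pp", children=[
            leaf("hd", "prep", "in", 5),
            leaf("obj1", "name", "Groningen", 6),
        ]),
    ]),
])

before = words(tree)
# Assumed rule: drop prepositional-phrase modifiers inside noun phrases.
apply_rule(DeleteRule(rel="mod", cat="pp", parent_cat="np"), tree)
after = words(tree)
print(" ".join(after))  # de minister bezoekt het ziekenhuis
```

Because the rule is a plain record of conditions plus an action, a learned rule list stays readable in the sense described above: each entry states which relation, category, and context trigger a deletion.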

The rules that the system produces are instantiated through so-called rule templates, in which conditions are only partially specified. A template is instantiated from examples in the training data, which fill in the missing values and so fully specify the conditions. Consequently, an additional area of interest for this thesis was finding out which rule templates are useful for trimming. Two sets of templates were created: a set that can generate the linguistically motivated rules described by Dorr et al. (2003), and a set that extends the first with additional linguistically motivated rule templates. It is shown that the first set captures most of the useful rules, and as such mirrors the insights of Dorr and others.
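The instantiation step can be sketched as follows, under simplifying assumptions. A template is represented here as a tuple of feature names, a rule as the same tuple with values filled in from a training example, and candidate rules are scored by correct minus incorrect deletions. The feature names, the toy training data, and the scoring function are hypothetical; the sketch shows a single greedy selection step rather than the full iterative TBL loop, and it is not the thesis implementation.

```python
# Templates name the features a rule may test; values are left open.
TEMPLATES = [
    ("rel",),
    ("rel", "cat"),
    ("rel", "cat", "parent_cat"),
]

# Toy training data: node features, and whether the node is absent
# from the reference headline (True means it should be deleted).
examples = [
    ({"rel": "mod", "cat": "pp", "parent_cat": "np"}, True),
    ({"rel": "mod", "cat": "pp", "parent_cat": "np"}, True),
    ({"rel": "mod", "cat": "pp", "parent_cat": "smain"}, False),
    ({"rel": "mod", "cat": "adv", "parent_cat": "smain"}, True),
    ({"rel": "mod", "cat": "adv", "parent_cat": "smain"}, False),
    ({"rel": "su", "cat": "np", "parent_cat": "smain"}, False),
]


def instantiate(template, features):
    """Fill a template's open slots with the values of one example."""
    return tuple((name, features[name]) for name in template)


def matches(rule, features):
    return all(features[name] == value for name, value in rule)


def score(rule, data):
    """Correct deletions minus incorrect deletions over the data."""
    good = sum(1 for f, delete in data if matches(rule, f) and delete)
    bad = sum(1 for f, delete in data if matches(rule, f) and not delete)
    return good - bad


def best_rule(templates, data):
    # Candidates come only from nodes that should be deleted.
    candidates = {
        instantiate(t, f)
        for f, delete in data if delete
        for t in templates
    }
    # Sorting makes the choice deterministic if scores tie.
    return max(sorted(candidates), key=lambda r: score(r, data))


best = best_rule(TEMPLATES, examples)
print(best, score(best, examples))
```

On this toy data the most specific template wins, because the broader rules also delete nodes that the reference headlines keep. A full TBL learner would apply the selected rule to the corpus, re-instantiate candidates on the remaining errors, and repeat until no rule scores above a threshold.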

Finally, the system was evaluated both with the ROUGE measure for text summarization and by human judges, who rated the grammaticality and salience of the generated headlines. It is shown that the system performs comparably to competing systems, and as such can provide the elegance of readable trimming rules without the human work involved in rule writing.
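For readers unfamiliar with ROUGE, the sketch below computes a simple n-gram recall between a generated headline and a reference headline. This text does not state which ROUGE variant or configuration was used, so the function is only an illustration of the general idea, with whitespace tokenization and lowercasing as simplifying assumptions; the example headlines are invented.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also occur in the candidate.

    Counts are clipped, so a repeated candidate n-gram cannot match
    more reference occurrences than actually exist.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


generated = "de minister bezoekt het ziekenhuis"
reference = "minister bezoekt ziekenhuis in Groningen"

print(rouge_n_recall(generated, reference, n=1))  # 0.6
print(rouge_n_recall(generated, reference, n=2))  # 0.25
```

An overlap score of this kind says nothing about whether a headline is grammatical, which is one reason to pair it with human judgments as described above.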