PTB/PDTB files belonging to different genres

I. Sets used in Genre distinctions for discourse in the Penn TreeBank (ACL-IJCNLP, Singapore 2009)

Features of these sets that motivate their treatment as separate genres are given in the above paper.

II. Corresponding genre sets based on the meta-data found in ACL/DCI corpus

Modifications/Notes added by Barbara Plank (in red+boldface)

It is well known that the individual files in the Penn TreeBank Wall Street Journal Corpus carry no meta-data. However, meta-data can be found in the corresponding articles in the ACL/DCI corpus available from the LDC, in particular in the ACL/DCI Wall Street Journal (wsj) corpus from 1989. Each article in that corpus has a unique document number (DOCNO). The alignment between the two corpora is given in a downloadable TAR file (pennTB_tipster_wsj_map.tar) that appears in the section headed DATA in the PennTreeBank entry in the LDC catalogue. It aligns the PTB filenames with the corresponding WSJ DOCNO strings.
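The alignment can be loaded into a simple lookup table. The following is a minimal sketch only: it assumes each line of a mapping table holds a PTB file name followed by one or more DOCNO strings, separated by whitespace. The actual layout of the files inside pennTB_tipster_wsj_map.tar may differ, and the function names and the wsj_9999 entry are hypothetical.

```python
from collections import defaultdict


def parse_alignment(lines):
    """Map each PTB file name to the list of WSJ DOCNO strings aligned with it.

    Assumed line layout (not confirmed by the source): "<ptb_name> <docno> [<docno> ...]".
    """
    mapping = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or line.lstrip().startswith("#"):
            continue  # skip blank, malformed, or comment lines
        mapping[fields[0]].extend(fields[1:])
    return dict(mapping)


def dual_files(mapping):
    """PTB files aligned with more than one DOCNO (candidates for an a/b split)."""
    return sorted(name for name, docnos in mapping.items() if len(docnos) > 1)


def shared_docnos(mapping):
    """DOCNOs that more than one PTB file points to."""
    by_docno = defaultdict(list)
    for name, docnos in mapping.items():
        for docno in docnos:
            by_docno[docno].append(name)
    return {d: sorted(names) for d, names in by_docno.items() if len(names) > 1}


sample = [
    "wsj_1259 891024-0126",
    "wsj_1862 891024-0126",
    "wsj_9999 000000-0001 000000-0002",  # made-up entry, for illustration only
]
alignment = parse_alignment(sample)
```

With this sample, `dual_files(alignment)` reports the made-up two-article file, and `shared_docnos(alignment)` reports the DOCNO that two PTB files share.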

But even here, one will not find an explicit classification of the articles in the ACL/DCI Wall Street Journal corpus into genres. Rather, there is meta-data in the headline (HL) and IN fields that are included for many (but not all) articles. This meta-data can be used in classifying articles into different genres. I chose to consider the same set of genres as I used above, but other sub-divisions of the corpus into genres are possible as well.
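A classification based on the HL and IN fields could be sketched as a list of keyword rules. The rules below are illustrative assumptions, not the criteria actually used to build the genre sets in this listing; the function name, the rule table, and the sample headlines are hypothetical. Articles that match no rule, or that lack both fields, are left unclassified.

```python
# Illustrative keyword rules only: these strings are assumptions, not the
# criteria used to produce the genre sets listed in this document.
HYPOTHETICAL_RULES = [
    ("letters", "letters to the editor"),
    ("errata", "corrections & amplifications"),
]


def guess_genre(hl="", in_field="", rules=HYPOTHETICAL_RULES):
    """Return the first genre whose keyword occurs in the HL or IN text.

    Returns None when no rule matches, so the article stays unclassified
    rather than being forced into a default genre.
    """
    text = f"{hl} {in_field}".lower()
    for genre, keyword in rules:
        if keyword in text:
            return genre
    return None


example = guess_genre(hl="Letters to the Editor: A Sample Headline")
```

Keeping the rules in a plain list makes it easy to try other sub-divisions of the corpus by swapping in a different rule table.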

In the listing below (unlike in the ACL paper), I have split each PTB/PDTB file that incorrectly contains two adjacent, concatenated articles from the ACL/DCI 1989 WSJ corpus into two files, a and b. The original PTB/PDTB file and the ACL/DCI alignment should make clear where the split belongs. (In the above-mentioned TAR file, these errors are listed in ptb_dual_tip.tbl.)

Using the alignment of the PTB corpus with the ACL/DCI 1989 WSJ files and culling the meta-data from the HL and IN fields, one can produce a more accurate classification of PTB/PDTB files into the same set of genres as above. (However, the above set isn't that far off.)

Note: file wsj_1809 should be split in two, even though it is not mentioned in the "dual" mapping file. The second part (b) contains just one sentence and can be ignored (the original tipster file contained a list of books, which was left out of the annotation). Mapping wsj/tipster DOCNO: wsj_1809 891019-0135 + 891019-0134.

Overview files:

II.errata: 25
II.essays: 166
II.highlights: 41
II.letters: 59
II.news: 1855
wit and short verse: 14 (+ 1; not in PDTB)
quarterly progress reports: 11 (+1; not in PDTB)
notable & quotable: 6
---
total genre annotated: 2146 + 6 + 14 + 11 = 2177 + 2 (not in PDTB)

total PTB wsj files (original mrg files): 2312
total PTB wsj files after split in a/b files: 2331 (19 files split)
rest: 2331 - 2177 = 154 files (remaining files that are not in the PDTB and hence not annotated); 154 - 2 (the two non-PDTB files annotated above) = 152

Note: wsj_1259 and wsj_1862 point to the same tipster doc, 891024-0126. However, one of them contains one more line, and their syntactic annotations differ. Hence, ignore them. This leaves 2175 genre-annotated files. Moreover, if we ignore the 10 copies of file wsj_0190 (quarterly profit reports), we are left with 2165 genre-annotated files.
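The counts in this overview can be re-derived step by step. The sketch below only restates the figures given above; the variable names are my own.

```python
# Per-genre counts from the overview (section II sets).
core_sets = {"errata": 25, "essays": 166, "highlights": 41, "letters": 59, "news": 1855}
# Additional small sets, counting only files that are in the PDTB.
small_sets = {"wit and short verse": 14, "quarterly reports": 11, "notable & quotable": 6}

core_total = sum(core_sets.values())                     # 2146
annotated = core_total + sum(small_sets.values())        # 2177
extra_not_in_pdtb = 2                                    # the two "+ 1" files

original_files = 2312
files_split = 19
files_after_split = original_files + files_split         # 2331

rest = files_after_split - annotated                     # 154 files not in the PDTB
rest_unannotated = rest - extra_not_in_pdtb              # 152

# Drop wsj_1259 and wsj_1862 (same tipster doc), then the 10 copies of wsj_0190.
without_duplicates = annotated - 2                       # 2175
without_wsj_0190_copies = without_duplicates - 10        # 2165
```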

Those 154 (153 + 1) files are not in the Penn Discourse Treebank (see list here: files-not-in-pdtb.txt): the 153 files mentioned by Bonnie in the ACL 2009 paper plus one file, wsj_1809b, the strange one-sentence file mentioned above.

remaining-files.txt (now all annotated)
Makefile to split the 19 WSJ files into a/b parts.