We have parsed the complete text of the English Wikipedia (dump file of 11 Oct 2010). The text was extracted from the Wikimedia dump file with the WikiExtractor script.

We used the Stanford Parser to produce phrase structure trees and dependency trees. Note that both are stemmed, i.e. the leaves of the trees and the nodes in the dependency graph are root forms, not inflected words.

(ROOT
  (S
    (NP (DT the) (NN response))
    (VP (VBZ include)
      (NP
        (NP (DT a) (JJ numeric) (NN result) (NN code))
        (SBAR
          (WHNP (WDT which))
          (S
            (VP (VBZ indicate)
              (NP
                (NP (NN success))
                (, ,)
                (NP (DT some) (NN error) (NN condition))
                (CC or)
                (NP (DT some) (JJ other) (JJ special) (NNS case))))))))
    (. .)))
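As an illustration of how the bracketed trees can be read, here is a minimal Python sketch. It assumes each tree is a well-formed bracketing in the style shown above; the function names `parse_tree` and `leaves` are our own and are not part of the corpus or of the Stanford Parser.

```python
import re

def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaves are returned as plain strings (the stemmed word forms).
    """
    # Tokens are "(", ")", or any run of characters without spaces/brackets.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            leaf = tokens[pos]
            pos += 1
            return leaf
        pos += 1                     # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1                     # skip ")"
        return (label, children)

    return read()

def leaves(node):
    """Return the words at the leaves of a parsed tree, left to right."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [word for child in children for word in leaves(child)]

tree = parse_tree("(NP (DT the) (NN response))")
print(tree)
print(leaves(tree))
```

Because the trees are stemmed, the leaves are root forms such as "include" rather than the inflected words of the original sentence.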

det(response-2, the-1)
nsubj(include-3, response-2)
det(code-7, a-4)
amod(code-7, numeric-5)
nn(code-7, result-6)
dobj(include-3, code-7)
nsubj(indicate-9, code-7)
rcmod(code-7, indicate-9)
dobj(indicate-9, success-10)
det(condition-14, some-12)
nn(condition-14, error-13)
dobj(indicate-9, condition-14)
conj_or(success-10, condition-14)
det(case-19, some-16)
amod(case-19, other-17)
amod(case-19, special-18)
dobj(indicate-9, case-19)
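The dependency lines can be split into their parts with a regular expression. The sketch below assumes every line has exactly the shape `relation(head-index, dependent-index)` shown above; the `Dependency` type and `parse_dependency` function are hypothetical names introduced here for illustration.

```python
import re
from typing import NamedTuple

class Dependency(NamedTuple):
    relation: str
    head: str
    head_index: int
    dependent: str
    dependent_index: int

# relation(head-N, dependent-M); words may themselves contain hyphens,
# so the index is taken to be the digits after the last hyphen.
DEP_RE = re.compile(r"^(\w+)\((.+)-(\d+), (.+)-(\d+)\)$")

def parse_dependency(line):
    """Parse one dependency line; raise ValueError if it does not match."""
    match = DEP_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a dependency line: {line!r}")
    relation, head, head_index, dependent, dependent_index = match.groups()
    return Dependency(relation, head, int(head_index),
                      dependent, int(dependent_index))

dep = parse_dependency("nsubj(include-3, response-2)")
print(dep.relation, dep.head, dep.dependent)
```

The indices are word positions in the sentence, so they distinguish repeated words such as the two occurrences of "some" in the example.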

Note that the dependency relations form a graph that may contain cycles, not a tree. In the example above, for instance, we have both nsubj(indicate-9, code-7) and rcmod(code-7, indicate-9).
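Code that walks the dependencies therefore should not assume a tree. As a rough sketch, the following Python function checks a list of (head, dependent) edges for a directed cycle with a depth-first search; the edge list is a hand-copied subset of the example above, and `has_cycle` is a hypothetical helper name.

```python
from collections import defaultdict

# (head, dependent) pairs copied from the example sentence.
edges = [
    ("include-3", "response-2"),   # nsubj
    ("include-3", "code-7"),       # dobj
    ("indicate-9", "code-7"),      # nsubj
    ("code-7", "indicate-9"),      # rcmod
    ("indicate-9", "success-10"),  # dobj
]

def has_cycle(edges):
    """Return True if the directed graph given by the edges has a cycle."""
    graph = defaultdict(list)
    for head, dependent in edges:
        graph[head].append(dependent)

    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = defaultdict(int)       # every node starts as UNSEEN

    def visit(node):
        state[node] = ACTIVE
        for nxt in graph.get(node, ()):
            if state[nxt] == ACTIVE:          # back edge: cycle found
                return True
            if state[nxt] == UNSEEN and visit(nxt):
                return True
        state[node] = DONE
        return False

    return any(state[node] == UNSEEN and visit(node) for node in list(graph))

print(has_cycle(edges))
```

A traversal that follows head-to-dependent edges should keep a set of visited nodes for the same reason.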

The following script was used to call the parser:

java -mx1000m -cp "$scriptdir/stanford-parser.jar:" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
-outputFormat "penn,typedDependencies" -outputFormatOptions "stem"  \
-maxLength 50 $scriptdir/englishPCFG.ser.gz $

The corpus was parsed on the high-performance cluster of the University of Groningen, in just over 24 hours. The parsed portion of the corpus (sentences longer than 50 words were skipped) consists of 1.1 billion words (1,122,536,262) and 57 million sentences (57,948,417). Note that sentence splitting was performed as part of the parsing process.

This is a sample of the data. The complete data can be downloaded here (10GB zipped tar file, 50GB unzipped).

Frequency counts of the dependency triples are also available (77MB zipped, 400MB unzipped; only triples occurring at least 5 times are included):

      6 dobj(abandon, abbey)
      9 dobj(abandon, abstraction)
     17 dobj(abandon, action)
     33 dobj(abandon, activity)
      8 dobj(abandon, administration)
     13 dobj(abandon, advance)
      5 dobj(abandon, advocacy)
      8 dobj(abandon, affair)
      9 dobj(abandon, affiliation)
      5 dobj(abandon, agenda)
      9 dobj(abandon, agreement)
     10 dobj(abandon, agriculture)
     12 dobj(abandon, aim)
     32 dobj(abandon, aircraft)

Note that the WaCky corpus site has also made available a version of Wikipedia (2009) parsed with the MALT parser.