/net/corpora almost full
/net/shared/corpora
Rumble is a partial implementation of JSONiq, a query language for data in JSON format.
Rumble runs on Spark. Spark usually runs on Hadoop, although it can also run standalone.
With Rumble you can query data stored locally, on a Hadoop cluster (Hadoop Distributed File System), or in the cloud on Amazon S3 or Microsoft Azure.
Spark has its own query facilities: Spark SQL, and GraphX for graph processing.
If the goal is to run queries on JSON data locally, you can use SQLite or PostgreSQL, or NoSQL databases such as MongoDB or Elasticsearch (or one of the many others). Using Rumble only for local data does not seem efficient.
If the goal is to work with distributed data, you may be better off starting with Hadoop, Spark, or Lucene. But we do not have a network of servers.
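As an illustration of the local option, here is a minimal sketch of querying JSON data with SQLite from Python. The records and field names (`id`, `lang`, `tokens`) and the helper `ids_for_language` are made up for this example, and it assumes the SQLite build behind Python's `sqlite3` module includes the JSON functions (built in by default since SQLite 3.38, and commonly compiled in before that).

```python
import json
import sqlite3

# Hypothetical sample records; the field names are invented for illustration.
records = [
    {"id": 1, "lang": "nl", "tokens": 12},
    {"id": 2, "lang": "en", "tokens": 7},
    {"id": 3, "lang": "nl", "tokens": 30},
]

# Store each JSON document as text in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (body TEXT)")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [(json.dumps(record),) for record in records],
)


def ids_for_language(connection, lang):
    """Return ids of documents in the given language, largest document first."""
    rows = connection.execute(
        "SELECT json_extract(body, '$.id') FROM docs "
        "WHERE json_extract(body, '$.lang') = ? "
        "ORDER BY json_extract(body, '$.tokens') DESC",
        (lang,),
    )
    return [row[0] for row in rows]


nl_ids = ids_for_language(conn, "nl")
print(nl_ids)
```

For a handful of local files this needs no cluster and no JVM, which is the point of the comparison with Rumble.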
p209327@haytabo:/net/corpora$ du -h --summarize *
3,9G 110kDBRD
4,0G 40twene_nl
25G 5-gram
125M AMR
57M AlpinoTestData
13M Atranos
3,7G BNC
9,4G BNC-XML
15G BasiLex-corpus_1.0
59G BasiLex-corpus_1.0-extracted
7,9G BasiScriptCorpus1_0
39G BasiScriptCorpus1_0-extracted
176M CCGBANK-ZH
711M CCGbanks
277M CELEX
3,2G CGN-in-alpino-xml
1,3G CGN-with-metadata
121G CGN_2.0.3
2,4G CGN_ANN_V1.0.tgz
4,4G CGN_ANN_V2
12M CGN_ANN_V2.0.1
105M CGN_LEXICON
321M CGN_cmdi
530M CLEF
7,8M CLIN-mv
880K CLIN15-Factuality
429M CONDIV
19M Childes_dutch
18M Childes_dutch_cmdi
92M Childes_dutch_xml
18M ChineseCCG.fid
142M Chinese_Treebank_9.0_LDC
du: cannot access 'CoNLL_shared_task/2011': Permission denied
du: cannot access 'CoNLL_shared_task/2012': Permission denied
du: cannot access 'CoNLL_shared_task/README': Permission denied
4,0K CoNLL_shared_task
1,4G CommonVoices-Frysk
745M Corea
255M Cornetto
377M Cornetto2.0
75M Corpus_vanReenen_Mulder
1,4M Coster
921M DAESO1.0
76M DAISY
1,6G DBNL
1,2G DGT_TU_1.0
du: cannot read directory 'DUOMAN/Latest/D.4.3 Corpus annotated with coreferential relations/Extra_set/MMAX': Permission denied
du: cannot read directory 'DUOMAN/Latest/D.4.3 Corpus annotated with coreferential relations/Documentation_anno': Permission denied
90M DUOMAN
90G Delpher
12G DutchBooksCorpus
4,4G DutchWebCorpus
2,0M EANS
674M ECI
18G EUbookshop
72M Eindhoven
177M Elex
359M EuroWordNet
0 Europarl7-NL-Parsed
68G FAME
3,0M Federalist
92M Germanet_v12
29G Google
du: cannot read directory 'Gysseling/Documenten': Permission denied
112K Gysseling
5,3M HofstadLyceum
11M INTERSECT
6,5G ITWAC
72M KDE4
4,2M KorpusGesprokeAfrikaans
3,9M LDC2009T23.tar.gz
2,7M LOB
du: cannot read directory 'LassyDevelop/Archive/DPC/.mutt0UQZSU': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.muttc5CK2R': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.muttCq7aSE': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.muttvWU4jp': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.muttrrkoqz': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.muttkdbSQO': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.muttjwPeBb': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.muttkt634E': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.muttZDjf8n': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.muttkFZLMv': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.muttHhS4bX': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.mutt394tLL': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.mutt1a9xeM': Permission denied
19G LassyDevelop
21G LassyDevelopMod
921G LassyLarge
475G LassyLargeExtra
6,4M LassyLargeMod
0 LassySmall
3,7G LassySmall3
3,7G LassySmall4
4,1G LassySmall5
7,0G LassySmall6
7,9M LassySmallMeta
7,0G Mediargus
2,9M MedischeWP
40G NLCOW
2,6G NewsCommentary9.1
172M Oersetter
20G OpenSubtitles2012
5,1G Opus
3,3G PMB_corpora.tar
43M ParCor
89M Parole
690M PennTreebank
551M PragueArabicTreebank
619M PragueTreebank
1,6G Reuters
7,2M SICK
19M SPOD
13M SentiWordNet
714G SoNaRCorpus_NC_1.2
30G SoNaRNewMediaCorpus_1.0.1
63M SpectrumEncyclopedie
2,9G StackOverflow.tar.gz
2,6M Susanne
du: cannot read directory 'TWITA/Code/lib/python2.7': Permission denied
94G TWITA
2,0G TueBa-DZ
5,1G TwNC-0.2
7,3G UKWAC
12G UKWAC_plain
du: cannot read directory 'UniversalDependencies1.2/universal-dependencies-1.2': Permission denied
du: cannot access 'UniversalDependencies1.2/ud-tools-v1.2/example-data/tanl.conll': Permission denied
224M UniversalDependencies1.2
1002M UniversalDependencies1.3
753M VU-DNC
395M Vlaams
du: cannot read directory 'Volkskrant97/tokenized': Permission denied
48M Volkskrant97
5,1M WPSpel
1,9G WashingtonPost.v3.tar.gz
3,4G Wikipedia_dumps_English
312K admin
du: cannot read directory 'bibles': Permission denied
4,0K bibles
64G blacklab
16G blacklab-input
1,1G bliip_87_89_wsj
95M chords
1,7M cornetto-tools
504M dcoi-final.tgz
1016M dpc1.0p2
4,0K du.txt
11G dutchsemcor
4,0K dutchsemcor.EMAIL
du: cannot read directory 'eng_web_tbk': Permission denied
4,0K eng_web_tbk
5,6G europarl7
22M factbank_v1
23M googlebooks
1,7M kde.nl.gz
5,3G kyoto1-full
61M lingspam_public
du: cannot read directory 'lost+found': Permission denied
16K lost+found
11M ner
8,0K nlcow
2,1G nltk_data
12G nlwiki
23M novelsample_correctedMod
du: cannot read directory 'nxt-switchboard-annotations': Permission denied
4,0K nxt-switchboard-annotations
2,3G ontonotes-release-5.0
581G paqu
37M pdtb_v2
34M rst_discourse_treebank
177M semeval_parsing_time_normalizations
46G spinn3r_corpus
539G sufarr
du: cannot read directory 'tmp-sensymbolizer/tmp_017': Permission denied
du: cannot read directory 'tmp-sensymbolizer/models': Permission denied
8,7G tmp-sensymbolizer
1,1M troonrede
8,1G twisty
du: cannot read directory 'twitter/201205/OldFiles': Permission denied
du: cannot read directory 'twitter/201211': Permission denied
du: cannot read directory 'twitter/000RAW/201208/OldFiles': Permission denied
du: cannot read directory 'twitter/000RAW/201209/OldFiles': Permission denied
du: cannot read directory 'twitter/000RAW/201207/OldFiles': Permission denied
1,2T twitter
1,4T twitter2
837G twitter2_en
42G twitter_en
240M ud-documentation-v2.0
552M ud-documentation-v2.5
612M ud-documentation-v2.7
du: cannot read directory 'ud-test-v2.0-conll2017/input/conll17-ud-development-2017-03-19': Permission denied
du: cannot read directory 'ud-test-v2.0-conll2017/input/conll17-ud-trial-2017-03-19': Permission denied
du: cannot read directory 'ud-test-v2.0-conll2017/gold/conll17-ud-development-2017-03-19': Permission denied
du: cannot read directory 'ud-test-v2.0-conll2017/gold/conll17-ud-trial-2017-03-19': Permission denied
247M ud-test-v2.0-conll2017
2,1M ud-tools-v2.5
2,7M ud-tools-v2.7
976M ud-treebanks-conll2017
2,0G ud-treebanks-v2.5
2,3G ud-treebanks-v2.7
4,8G wablieft
89M wikicsv
3,7G word2vec
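
To see which directories fill the disk, the `du -h` listing can be parsed and sorted by size. The following is a rough sketch, not an existing tool: the helper names `parse_du_line` and `largest` are made up here, and it assumes one entry per line, sizes with a decimal comma or point, and binary unit suffixes (K, M, G, T as powers of 1024). The sample lines are copied from the listing.

```python
import re

# Binary multipliers, as used by `du -h` (assumption: no larger units occur).
UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

# A size (decimal comma or point), an optional unit letter, whitespace, a name.
SIZE_LINE = re.compile(r"^(\d+(?:[.,]\d+)?)([KMGT]?)\s+(.+)$")


def parse_du_line(line):
    """Return (name, size_in_bytes), or None for lines such as du error messages."""
    match = SIZE_LINE.match(line.strip())
    if match is None:
        return None
    number, unit, name = match.groups()
    size = float(number.replace(",", ".")) * UNITS[unit]
    return name, int(size)


def largest(lines, count=3):
    """Return the `count` largest entries, biggest first."""
    parsed = [entry for entry in map(parse_du_line, lines) if entry is not None]
    return sorted(parsed, key=lambda item: item[1], reverse=True)[:count]


sample = [
    "1,2T twitter",
    "1,4T twitter2",
    "921G LassyLarge",
    "du: cannot read directory 'bibles': Permission denied",
    "4,0K bibles",
    "0 LassySmall",
]

top = largest(sample, count=2)
for name, size in top:
    print(name, size)
```

Because `du -h` rounds its figures, the byte counts are approximate; running `du` with `--block-size=1` instead would give exact values to sort on.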