26 May 2021

  1. haytabo, libs:
    • git
    • lamachine
  2. alud
    • TODOs in alud docs
    • issues
      1. misplaced (or not) heads in conjunctions, see here
      2. inserted words in Enhanced UD
        • xpath: //dep[@elided] or //*[@ud="enhanced" and contains(@id,".")]
    • release 2.8, early May
      • Alpino Treebank
        • PaQu
          • Docker
        • AlpinoGraph
      • evaluation by Anouk B.
  3. XPath, tools
    • do something with compiled XPath expressions in Go
  4. /net/corpora almost full
    • move to /net/shared/corpora
      • sufarr
      • twitter
  5. Rumble
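
The two XPath expressions under agenda item 2 can be tried out with a small sketch like the one below. It is only an illustration: the XML fragment is made up and is not actual alud output, and the element name `tok` is hypothetical. Python's `xml.etree.ElementTree` supports only a subset of XPath (no `and`, no `contains()`), so the second expression is approximated by an attribute predicate plus a filter in Python.

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration only; not actual alud output.
SAMPLE = """
<sentence>
  <dep id="1" head="2" deprel="nsubj"/>
  <dep id="2" head="0" deprel="root"/>
  <dep id="2.1" head="2" deprel="conj" elided="true"/>
  <tok ud="enhanced" id="2.1" form="_"/>
  <tok ud="enhanced" id="3" form="koffie"/>
  <tok ud="basic" id="3" form="koffie"/>
</sentence>
"""


def elided_deps(root):
    """Equivalent of //dep[@elided]: dep elements that carry an elided attribute."""
    return root.findall(".//dep[@elided]")


def enhanced_with_dotted_id(root):
    """Approximation of //*[@ud="enhanced" and contains(@id,".")].

    ElementTree handles the attribute test; the contains() part is done in Python.
    """
    return [
        el
        for el in root.findall(".//*[@ud='enhanced']")
        if "." in el.get("id", "")
    ]


root = ET.fromstring(SAMPLE)
elided_ids = [d.get("id") for d in elided_deps(root)]
dotted_ids = [e.get("id") for e in enhanced_with_dotted_id(root)]
print(elided_ids)
print(dotted_ids)
```

A full XPath engine would accept both expressions unchanged; the Python filter is only needed because of the limited XPath subset in the standard library.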

Rumble is a partial implementation of JSONiq, a query language for data in JSON format.

Rumble runs on Spark. Spark can run on a Hadoop cluster, but also in standalone mode.

With Rumble you can query data that is stored locally, on a Hadoop cluster (Hadoop Distributed File System, HDFS), or in the cloud on Amazon S3 or Microsoft Azure.

Spark has its own query interface, Spark SQL, and its own graph-processing API, GraphX.

If the goal is to run queries on JSON data locally, you can use SQLite or PostgreSQL, or NoSQL databases such as MongoDB or Elasticsearch (or one of the many others). Using Rumble only for local data does not seem efficient.
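
As a rough illustration of the local option, the sketch below stores JSON documents in an in-memory SQLite database and filters them with `json_extract`. The records and field names (`id`, `lang`, `tokens`) are invented for the example, and it assumes the SQLite library bundled with Python includes the JSON functions, which is the default in recent SQLite releases.

```python
import json
import sqlite3

# Invented records; the field names are assumptions for this example.
records = [
    {"id": 1, "lang": "nl", "tokens": 12},
    {"id": 2, "lang": "en", "tokens": 7},
    {"id": 3, "lang": "nl", "tokens": 30},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (body TEXT)")
# Each JSON document is stored as text in a single column.
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [(json.dumps(r),) for r in records],
)


def ids_by_lang(conn, lang):
    """Return the ids of all documents whose 'lang' field equals lang."""
    rows = conn.execute(
        "SELECT json_extract(body, '$.id') FROM docs "
        "WHERE json_extract(body, '$.lang') = ? ORDER BY 1",
        (lang,),
    )
    return [row[0] for row in rows]


print(ids_by_lang(conn, "nl"))
```

For data that fits on one machine this needs no cluster and no extra services, which is the point of the comparison with Rumble.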

If the goal is to work with distributed data, it may be more useful to start with Hadoop, Spark, or Lucene. But we do not have a network of servers.


p209327@haytabo:/net/corpora$ du -h --summarize *
3,9G    110kDBRD
4,0G    40twene_nl
25G     5-gram
125M    AMR
57M     AlpinoTestData
13M     Atranos
3,7G    BNC
9,4G    BNC-XML
15G     BasiLex-corpus_1.0
59G     BasiLex-corpus_1.0-extracted
7,9G    BasiScriptCorpus1_0
39G     BasiScriptCorpus1_0-extracted
176M    CCGBANK-ZH
711M    CCGbanks
277M    CELEX
3,2G    CGN-in-alpino-xml
1,3G    CGN-with-metadata
121G    CGN_2.0.3
2,4G    CGN_ANN_V1.0.tgz
4,4G    CGN_ANN_V2
12M     CGN_ANN_V2.0.1
105M    CGN_LEXICON
321M    CGN_cmdi
530M    CLEF
7,8M    CLIN-mv
880K    CLIN15-Factuality
429M    CONDIV
19M     Childes_dutch
18M     Childes_dutch_cmdi
92M     Childes_dutch_xml
18M     ChineseCCG.fid
142M    Chinese_Treebank_9.0_LDC
du: cannot access 'CoNLL_shared_task/2011': Permission denied
du: cannot access 'CoNLL_shared_task/2012': Permission denied
du: cannot access 'CoNLL_shared_task/README': Permission denied
4,0K    CoNLL_shared_task
1,4G    CommonVoices-Frysk
745M    Corea
255M    Cornetto
377M    Cornetto2.0
75M     Corpus_vanReenen_Mulder
1,4M    Coster
921M    DAESO1.0
76M     DAISY
1,6G    DBNL
1,2G    DGT_TU_1.0
du: cannot read directory 'DUOMAN/Latest/D.4.3 Corpus annotated with coreferential relations/Extra_set/MMAX': Permission denied
du: cannot read directory 'DUOMAN/Latest/D.4.3 Corpus annotated with coreferential relations/Documentation_anno': Permission denied
90M     DUOMAN
90G     Delpher
12G     DutchBooksCorpus
4,4G    DutchWebCorpus
2,0M    EANS
674M    ECI
18G     EUbookshop
72M     Eindhoven
177M    Elex
359M    EuroWordNet
0       Europarl7-NL-Parsed
68G     FAME
3,0M    Federalist
92M     Germanet_v12
29G     Google
du: cannot read directory 'Gysseling/Documenten': Permission denied
112K    Gysseling
5,3M    HofstadLyceum
11M     INTERSECT
6,5G    ITWAC
72M     KDE4
4,2M    KorpusGesprokeAfrikaans
3,9M    LDC2009T23.tar.gz
2,7M    LOB
du: cannot read directory 'LassyDevelop/Archive/DPC/.mutt0UQZSU': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.muttc5CK2R': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.muttCq7aSE': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.muttvWU4jp': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.muttrrkoqz': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.muttkdbSQO': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.muttjwPeBb': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.muttkt634E': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.muttZDjf8n': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.muttkFZLMv': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.muttHhS4bX': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.mutt394tLL': Permission denied
du: cannot read directory 'LassyDevelop/Archive/DPC/.mutt1a9xeM': Permission denied
19G     LassyDevelop
21G     LassyDevelopMod
921G    LassyLarge
475G    LassyLargeExtra
6,4M    LassyLargeMod
0       LassySmall
3,7G    LassySmall3
3,7G    LassySmall4
4,1G    LassySmall5
7,0G    LassySmall6
7,9M    LassySmallMeta
7,0G    Mediargus
2,9M    MedischeWP
40G     NLCOW
2,6G    NewsCommentary9.1
172M    Oersetter
20G     OpenSubtitles2012
5,1G    Opus
3,3G    PMB_corpora.tar
43M     ParCor
89M     Parole
690M    PennTreebank
551M    PragueArabicTreebank
619M    PragueTreebank
1,6G    Reuters
7,2M    SICK
19M     SPOD
13M     SentiWordNet
714G    SoNaRCorpus_NC_1.2
30G     SoNaRNewMediaCorpus_1.0.1
63M     SpectrumEncyclopedie
2,9G    StackOverflow.tar.gz
2,6M    Susanne
du: cannot read directory 'TWITA/Code/lib/python2.7': Permission denied
94G     TWITA
2,0G    TueBa-DZ
5,1G    TwNC-0.2
7,3G    UKWAC
12G     UKWAC_plain
du: cannot read directory 'UniversalDependencies1.2/universal-dependencies-1.2': Permission denied
du: cannot access 'UniversalDependencies1.2/ud-tools-v1.2/example-data/tanl.conll': Permission denied
224M    UniversalDependencies1.2
1002M   UniversalDependencies1.3
753M    VU-DNC
395M    Vlaams
du: cannot read directory 'Volkskrant97/tokenized': Permission denied
48M     Volkskrant97
5,1M    WPSpel
1,9G    WashingtonPost.v3.tar.gz
3,4G    Wikipedia_dumps_English
312K    admin
du: cannot read directory 'bibles': Permission denied
4,0K    bibles
64G     blacklab
16G     blacklab-input
1,1G    bliip_87_89_wsj
95M     chords
1,7M    cornetto-tools
504M    dcoi-final.tgz
1016M   dpc1.0p2
4,0K    du.txt
11G     dutchsemcor
4,0K    dutchsemcor.EMAIL
du: cannot read directory 'eng_web_tbk': Permission denied
4,0K    eng_web_tbk
5,6G    europarl7
22M     factbank_v1
23M     googlebooks
1,7M    kde.nl.gz
5,3G    kyoto1-full
61M     lingspam_public
du: cannot read directory 'lost+found': Permission denied
16K     lost+found
11M     ner
8,0K    nlcow
2,1G    nltk_data
12G     nlwiki
23M     novelsample_correctedMod
du: cannot read directory 'nxt-switchboard-annotations': Permission denied
4,0K    nxt-switchboard-annotations
2,3G    ontonotes-release-5.0
581G    paqu
37M     pdtb_v2
34M     rst_discourse_treebank
177M    semeval_parsing_time_normalizations
46G     spinn3r_corpus
539G    sufarr
du: cannot read directory 'tmp-sensymbolizer/tmp_017': Permission denied
du: cannot read directory 'tmp-sensymbolizer/models': Permission denied
8,7G    tmp-sensymbolizer
1,1M    troonrede
8,1G    twisty
du: cannot read directory 'twitter/201205/OldFiles': Permission denied
du: cannot read directory 'twitter/201211': Permission denied
du: cannot read directory 'twitter/000RAW/201208/OldFiles': Permission denied
du: cannot read directory 'twitter/000RAW/201209/OldFiles': Permission denied
du: cannot read directory 'twitter/000RAW/201207/OldFiles': Permission denied
1,2T    twitter
1,4T    twitter2
837G    twitter2_en
42G     twitter_en
240M    ud-documentation-v2.0
552M    ud-documentation-v2.5
612M    ud-documentation-v2.7
du: cannot read directory 'ud-test-v2.0-conll2017/input/conll17-ud-development-2017-03-19': Permission denied
du: cannot read directory 'ud-test-v2.0-conll2017/input/conll17-ud-trial-2017-03-19': Permission denied
du: cannot read directory 'ud-test-v2.0-conll2017/gold/conll17-ud-development-2017-03-19': Permission denied
du: cannot read directory 'ud-test-v2.0-conll2017/gold/conll17-ud-trial-2017-03-19': Permission denied
247M    ud-test-v2.0-conll2017
2,1M    ud-tools-v2.5
2,7M    ud-tools-v2.7
976M    ud-treebanks-conll2017
2,0G    ud-treebanks-v2.5
2,3G    ud-treebanks-v2.7
4,8G    wablieft
89M     wikicsv
3,7G    word2vec
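
To see which directories are the best candidates for moving, a listing like the one above can be parsed and sorted by size. The sketch below is one possible way to do that, not an existing tool: it assumes `du -h` output with binary units (K, M, G, T), accepts the decimal comma used in this locale, and skips the `du: cannot ...` diagnostics. The sample lines are copied from the listing.

```python
import re

# du -h reports sizes in powers of 1024.
UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
# A size (with "," or "." as decimal separator), an optional unit, then the name.
LINE = re.compile(r"^(\d+(?:[.,]\d+)?)([KMGT]?)\s+(.+)$")


def parse_du(text):
    """Return (name, size_in_bytes) pairs for every size line in du output."""
    entries = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # diagnostics such as 'du: cannot read directory ...'
        number, unit, name = m.groups()
        size = float(number.replace(",", ".")) * UNITS[unit]
        entries.append((name, int(size)))
    return entries


def largest(entries, n=5):
    """Return the n largest entries, biggest first."""
    return sorted(entries, key=lambda e: e[1], reverse=True)[:n]


SAMPLE = """\
du: cannot read directory 'twitter/201211': Permission denied
1,2T    twitter
1,4T    twitter2
837G    twitter2_en
921G    LassyLarge
0       LassySmall
"""

entries = parse_du(SAMPLE)
for name, size in largest(entries, 3):
    print(f"{size / 1024**3:8.0f} GiB  {name}")
```

Because `du -h` rounds its figures, the byte counts are approximate; running `du` with `--block-size=1` would give exact values to sort on.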