Download elephant

This page provides the current release of elephant. The distribution includes models for sentence and word boundary detection in English, Dutch and Italian. The English model is a snapshot, taken on 6 September 2013, of the model used to tokenise the Groningen Meaning Bank. The Dutch model is trained on the same data as presented in the paper, but has been improved with additional n-gram features. The Italian model is the best-performing model according to the experiments presented in the paper.

download elephant-1.1.zip

Installation instructions

To install elephant, simply type

$ make ; make install

This compiles the external tools wapiti and elman and copies the executable files to /usr/local/bin. To change the destination directory, edit the PREFIX variable in the Makefile. After installation, elephant is invoked as in these examples.

PTB-style output:
$ echo 'Good morning Mr. President.' | elephant -m models/english
IOB output format:
$ echo 'Good morning Mr. President.' | elephant -m models/english -f iob

It is also possible to run elephant from the source directory without installing it, by just typing
$ make
and invoking the executable from the current directory, e.g.
$ echo 'Good morning Mr. President.' | ./elephant -m models/english/
Good morning Mr. President .
The -f iob option makes elephant output a two-column format. Each line represents one character: the first column is its Unicode code point and the second is its assigned label.
$ echo 'Good morning Mr. President.' | ./elephant -m models/english/ -f iob
71	S
111	I
111	I
100	I
32	O
109	T
111	I
114	I
110	I
105	I
110	I
103	I
32	O
77	T
114	I
46	I
32	O
80	T
114	I
101	I
115	I
105	I
100	I
101	I
110	I
116	I
46	T
10	O
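
The two-column output can be turned back into sentences and tokens with a few lines of Python. The sketch below is not part of elephant, and the function name decode_iob is hypothetical. It assumes that S marks the first character of a sentence-initial token, T the first character of any other token, I a character inside a token, and O a character outside any token. That reading is consistent with the example above but is not spelled out on this page.

```python
def decode_iob(lines):
    """Group 'codepoint<TAB>label' rows into sentences of tokens.

    Assumed label meanings (not stated by the elephant documentation here):
    S = token start that also opens a sentence, T = token start,
    I = inside a token, O = outside any token.
    """
    sentences = []   # list of sentences, each a list of character lists
    token = None     # character list of the token being built, if any
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed rows
        char, label = chr(int(fields[0])), fields[1]
        if label == "O":
            token = None  # this character closes any open token
            continue
        if label == "S" or not sentences:
            sentences.append([])  # open a new sentence
        if label in ("S", "T") or token is None:
            token = []  # open a new token in the current sentence
            sentences[-1].append(token)
        token.append(char)  # any other label is treated like I
    return [["".join(chars) for chars in sent] for sent in sentences]


sample = "72\tS\n105\tI\n32\tO\n121\tT\n111\tI\n117\tI\n46\tT\n10\tO\n"
print(decode_iob(sample.splitlines()))  # [['Hi', 'you', '.']]
```

Applied to the rows listed above, the same function should yield a single sentence with the tokens Good, morning, Mr., President and a final full stop, matching the PTB-style output.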

Prerequisites

Elephant makes use of the wapiti sequence labelling toolkit. The source code of wapiti is included in the elephant distribution and is compiled automatically. It is also possible to compile only wapiti by typing

$ make wapiti

Models

A statistical model for elephant is a directory containing two files named wapiti and elman, respectively. The current release of elephant is bundled with three ready-to-use models. We selected the best-performing models according to our experiments, that is, Cat-Code-7-SRN for English and Dutch and Cat-Code-11-SRN for Italian. Additional details can be found in the paper.
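
Because a model is just a directory with two expected file names, its layout can be checked before it is passed to elephant with -m. The following is a minimal sketch under that assumption; missing_model_files is a hypothetical helper, not something shipped with elephant, and it only checks that the files exist, not that their contents are valid.

```python
from pathlib import Path


def missing_model_files(model_dir, required=("wapiti", "elman")):
    """Return the names of expected model files absent from model_dir.

    The file names come from the description of a model directory;
    the helper itself is illustrative only.
    """
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


# Example: report problems for the bundled English model path.
missing = missing_model_files("models/english")
if missing:
    print("model directory is missing:", ", ".join(missing))
```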

Training new models

The bundle includes a script that facilitates the training of new models. The script takes as input tokenized text in IOB format and a wapiti pattern file.

$ ./elephant-train 
usage: elephant-train [-h] -m MODEL_DIR [-e ELMAN_MODEL] -w
                      WAPITI_PATTERN_FILE -i INPUT_IOB_FILE
                      [-d DEVEL_IOB_FILE]
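
Training data in IOB format can be derived from raw text paired with its gold tokenisation. The sketch below assumes that the training input uses the same two-column layout as the -f iob output (code point, tab, label) and the same S/T/I/O label meanings as in the example output; neither assumption is confirmed on this page. The function name encode_iob is hypothetical.

```python
def encode_iob(text, sentences):
    """Label every character of text given its tokenised sentences.

    text      -- the raw, untokenised string
    sentences -- list of sentences, each a list of token strings that
                 occur in text in the given order
    Returns a list of 'codepoint<TAB>label' rows (assumed training layout).
    """
    labels = ["O"] * len(text)  # characters outside any token stay O
    pos = 0
    for sentence in sentences:
        for i, tok in enumerate(sentence):
            if not tok:
                raise ValueError("empty token")
            start = text.index(tok, pos)  # ValueError if token not found
            labels[start] = "S" if i == 0 else "T"
            for j in range(start + 1, start + len(tok)):
                labels[j] = "I"
            pos = start + len(tok)
    return ["%d\t%s" % (ord(c), l) for c, l in zip(text, labels)]


rows = encode_iob("Hi you.\n", [["Hi", "you", "."]])
print("\n".join(rows))
```

Writing the returned rows to a file, one per line, gives a candidate file for the -i INPUT_IOB_FILE argument, provided the layout assumption holds.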

Licence

Elephant is licensed under the terms of the two-clause BSD Licence:

Copyright (c) 2009-2013  CNRS
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.