Annotation Manual

Part-of-speech tags

We use the part-of-speech tagset of CCGbank, which is a slight variant of the Penn Treebank tagset as listed in Ann Taylor, Mitchell Marcus and Beatrice Santorini (2003): The Penn Treebank: An Overview, Section 1.1. We hope to provide a list of differences in CCGbank soon.

Named-entity tags

The annotation scheme for named entities is based on the one provided by ACE (see the annotation guidelines). We adopt three of their basic categories, and add four other categories (inspired by Satoshi Sekine's Extended Named Entity Hierarchy). This results in the following classification:

  • Person (PER) - Person entities are limited to individuals that are human or have human characteristics, such as divine entities.
  • Organization (ORG) - Organization entities are limited to corporations, agencies, and other groups of people defined by an established organizational structure.
  • Location (LOC) - Location entities are limited to geographical entities such as geographical areas and landmasses, bodies of water, and geological formations.
  • Artifact (ART) - Artifacts are limited to manmade objects, structures and abstract entities, including buildings, facilities, art and scientific theories.
  • Natural Object (NAT) - Natural objects are entities that occur naturally and are not manmade, such as diseases, biological entities and other living things.
  • Event (EVE) - Events are incidents and occasions that occur during a particular time.
  • Time (TIM) - Time entities are limited to references to temporal entities that have a name, such as the days of the week and the months of the year. For all other temporal expressions the tagging layer timex is used (see below).

These seven basic categories are intended to cover all named entities. Currently, every entity receives exactly one named-entity tag, but some cases are notoriously ambiguous. To reduce this ambiguity, we add an extra category for geo-political entities, which is interpreted as a hybrid of Location and Organization:

  • Geo-political Entity (GPE) - GPE entities are geographical regions defined by political and/or social groups. A GPE entity subsumes and does not distinguish between a city, a nation, its region, its government, or its people (LOC•ORG).

In named entities consisting of multiple tokens, each token is tagged (e.g., New|LOC York|LOC). In nested named entities, each token is tagged with the tag appropriate for the outermost named entity that it is part of (e.g., New|ORG York|ORG Yankees|ORG). In general, only tokens with part-of-speech tag NNP or NNPS receive a named-entity tag. Exceptions to this rule are demonyms, which have POS tag JJ (e.g., American|ORG), and long names consisting of tokens with different POS tags (e.g., Paradise|ART By|ART The|ART Dashboard|ART Light|ART).
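
The outermost-entity rule for nested names can be sketched in a few lines of Python. This is only an illustration, not part of the annotation tooling: the Span record, the function names and the assumption that spans are properly nested (so the widest span is the outermost one) are all hypothetical.

```python
from typing import List, NamedTuple, Optional


class Span(NamedTuple):
    """Hypothetical entity span over token indices."""
    start: int  # index of the first token (inclusive)
    end: int    # index one past the last token (exclusive)
    tag: str    # named-entity tag, e.g. "LOC" or "ORG"


def tag_tokens(tokens: List[str], spans: List[Span]) -> List[Optional[str]]:
    """Give each token the tag of the widest span that covers it."""
    tags: List[Optional[str]] = [None] * len(tokens)
    # Apply narrow spans first so that wider, enclosing spans overwrite them.
    for span in sorted(spans, key=lambda s: s.end - s.start):
        for i in range(span.start, span.end):
            tags[i] = span.tag
    return tags


def render(tokens: List[str], tags: List[Optional[str]]) -> str:
    """Write tagged tokens in the token|TAG notation used in this manual."""
    return " ".join(t if g is None else f"{t}|{g}" for t, g in zip(tokens, tags))


tokens = ["The", "New", "York", "Yankees", "won"]
# "New York" is a location nested inside the organization "New York Yankees".
spans = [Span(1, 3, "LOC"), Span(1, 4, "ORG")]
print(render(tokens, tag_tokens(tokens, spans)))
# prints: The New|ORG York|ORG Yankees|ORG won
```

Sorting by span width is one simple way to let the enclosing entity win; it would not be appropriate for spans that overlap without nesting.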

Time expressions and numerical expressions are tagged on separate layers: timex and numex (Palmer and Day, 1997). The timex layer divides time expressions into Date and Time. The numex layer classifies numerical expressions as Percentage or Money.
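
As a rough illustration of how the tag inventory and the separate layers fit together, a token can be modelled as a record with one optional field per layer. The Token class, its field names and the idea of storing the class labels as plain strings are assumptions made for this sketch; they do not describe the actual storage format.

```python
from dataclasses import dataclass
from typing import Optional

# Label inventories as listed in this manual.
NE_TAGS = {"PER", "ORG", "LOC", "ART", "NAT", "EVE", "TIM", "GPE"}
TIMEX_CLASSES = {"Date", "Time"}
NUMEX_CLASSES = {"Percentage", "Money"}


@dataclass
class Token:
    """Hypothetical token record with one optional value per tagging layer."""
    form: str
    pos: str
    ne: Optional[str] = None      # named-entity layer
    timex: Optional[str] = None   # time-expression layer
    numex: Optional[str] = None   # numerical-expression layer

    def __post_init__(self) -> None:
        # Reject labels that are not in the inventory of their layer.
        for value, allowed, layer in (
            (self.ne, NE_TAGS, "named-entity"),
            (self.timex, TIMEX_CLASSES, "timex"),
            (self.numex, NUMEX_CLASSES, "numex"),
        ):
            if value is not None and value not in allowed:
                raise ValueError(f"unknown {layer} label: {value!r}")


# A named day of the week is tagged on the named-entity layer ...
monday = Token("Monday", "NNP", ne="TIM")
# ... while other temporal expressions go on the timex layer.
year = Token("1997", "CD", timex="Date")
```

Keeping the layers in separate fields mirrors the point made above: a temporal expression without a name carries a timex label and no named-entity tag.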

Syntactic Analysis

We use Combinatory Categorial Grammar (CCG; Steedman, 2001) for describing syntactic structure.

We use the C&C parser (Clark and Curran, 2004) trained on CCGbank (Hockenmaier and Steedman, 2007) for automatic annotation. Where the parser makes mistakes, manual correction is not as straightforward as it is for tags, because a tree structure rather than individual word tags needs to be fixed. Currently there is no interface for directly editing the tree structure.

However, many parsing errors are due to wrong part-of-speech tags. In these cases, the tree can be corrected simply by correcting the part-of-speech annotation.

If this does not suffice, then many parsing errors can be fixed by correcting the CCG categories (also called supertags) of individual tokens. The category of a token (or larger phrase) determines in what ways it can combine syntactically with other phrases and therefore constrains the set of possible derivations. GMB Explorer allows users to edit token categories and creates corresponding bits of wisdom. These are then sent back to the parser to influence the derivation it will produce.

There are some limitations to this way of correcting syntactic annotation:

  1. It requires knowledge of CCG, in particular of CCGbank's flavor of it. Notably, this flavor includes feature annotations on some atomic categories, which are appended to them in square brackets.
  2. Due to the nature of the C&C parser, category bits of wisdom are currently treated as hints rather than hard constraints. The parser will sometimes appear to ignore them.
  3. Arbitrary CCG categories cannot currently be used. The parser's statistical model is limited to a finite inventory. Although Explorer allows the input of arbitrary strings as categories, it displays suggestions as you type. You should choose categories from these suggestions only.
  4. Lexical categories cannot resolve every attachment decision. For example, the CCG category of a preposition determines whether it attaches to an NP or to a VP, but if more than one NP is available as an attachment site, there is currently no way to choose between them.
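
To make the role of categories more concrete, the following Python sketch splits CCGbank-style category strings and applies forward and backward application to them. It is a simplified illustration written for this manual, not code from the C&C parser or GMB Explorer: all function names are hypothetical, arguments are compared by exact string match (a real parser also handles features and further combinatory rules), and the two preposition categories shown are the usual NP-modifying and VP-modifying ones.

```python
import re
from typing import Optional, Tuple


def strip_outer(cat: str) -> str:
    """Drop one pair of parentheses if it encloses the whole category."""
    if not (cat.startswith("(") and cat.endswith(")")):
        return cat
    depth = 0
    for i, ch in enumerate(cat):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0 and i < len(cat) - 1:
                return cat  # the first "(" closes early: not an outer pair
    return cat[1:-1]


def split_functor(cat: str) -> Optional[Tuple[str, str, str]]:
    """Split X/Y or X\\Y into (result, slash, argument); None if atomic."""
    depth = 0
    for i in range(len(cat) - 1, -1, -1):  # find the rightmost top-level slash
        ch = cat[i]
        if ch == ")":
            depth += 1
        elif ch == "(":
            depth -= 1
        elif ch in "/\\" and depth == 0:
            return strip_outer(cat[:i]), ch, strip_outer(cat[i + 1:])
    return None


def feature(atom: str) -> Optional[str]:
    """Return the bracketed feature of an atomic category, if it has one."""
    match = re.fullmatch(r"[^\[\]]+\[(\w+)\]", atom)
    return match.group(1) if match else None


def combine(left: str, right: str) -> Optional[str]:
    """Forward (X/Y Y => X) or backward (Y X\\Y => X) application."""
    fwd = split_functor(left)
    if fwd and fwd[1] == "/" and fwd[2] == right:
        return fwd[0]
    bwd = split_functor(right)
    if bwd and bwd[1] == "\\" and bwd[2] == left:
        return bwd[0]
    return None


# A transitive verb with a feature on its result category.
print(split_functor("(S[dcl]\\NP)/NP"))  # ('S[dcl]\\NP', '/', 'NP')
print(feature("S[dcl]"))                 # dcl

# The category of the preposition decides what the phrase can modify.
np_modifier = combine("(NP\\NP)/NP", "NP")            # NP\NP
vp_modifier = combine("((S\\NP)\\(S\\NP))/NP", "NP")  # (S\NP)\(S\NP)
print(combine("NP", np_modifier))      # NP
print(combine("S\\NP", vp_modifier))   # S\NP
```

The last two lines also show the limitation described in point 4: the NP-modifying category combines with any NP to its left, so the category alone does not say which of several candidate NPs the phrase attaches to.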


For a more extensive description of the resource, check the publications.