Saturday, September 14, 2013

August 2013: The E600 curation month

Despite high July and August temperatures for some team members, visits to associated labs for others, and the need to fit the work around normal lab and office duties for the rest, the most tedious and time-consuming part of this project has now been completed.

Environments-600 (E600), a corpus comprising 600 EOL Taxa pages, was evenly and randomly distributed among the 6 curators (4 graduate students and 2 postdocs; see June’s post).

To maximize environment type coverage, the 600 EOL documents were species pages randomly picked from the following eight taxa: Actinopterygii, Annelida, Arthropoda, Aves, Chlorophyta, Mammalia, Mollusca, Streptophyta. These taxa are either associated with environments that differ from one another, or known to occur in a diverse range of environments.

Each curator had 45 days to annotate 120 documents, i.e. their part of the corpus (600/6 = 100 documents each) plus 20 documents (i.e. 20% of 100) that are common with other curators. The ‘20% overlap’ is an important part of the curation process: it supports the calculation of the inter-annotator agreement (IAA), based on pairwise calculations of Cohen's kappa coefficient.
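As a rough illustration of the pairwise calculation, the sketch below computes Cohen's kappa for two curators. It assumes that the annotations on the shared documents have already been aligned into two equal-length lists of categorical labels, one label per compared item. That alignment step, the function name cohen_kappa and the example labels are assumptions made for illustration, not a description of the project's actual pipeline.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    labels_a and labels_b are equal-length sequences of categorical labels;
    position i in both lists refers to the same item.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal proportions, summed over labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four items on the shared documents.
curator_1 = ["habitat", "habitat", "biome", "none"]
curator_2 = ["habitat", "biome", "biome", "none"]
kappa = cohen_kappa(curator_1, curator_2)
```

With 6 curators, such a function would be called once per pair of curators who share documents, and the pairwise values then summarized.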

Each curator had access to his/her own documents only. No information on the shared documents had been disclosed.

All curators were instructed to evaluate all document substrings and map the recognized environment descriptors to the corresponding EnvO terms.

Following EnvO (envo-basic.obo, version date: 14 June 2013), such environment descriptors included habitats, biomes, environmental features, conditions and materials (EnvO high-level terms: 00002036, 00000428, 00002297, 01000203 and 00010483, respectively).

All recognized mentions were to be listed (including repetitions) in the order of appearance in the text. To facilitate EnvO term search and ontology browsing, OBO-Edit was employed.

When an environment descriptor could refer to more than one EnvO term, multiple mappings were allowed (e.g. mapping “forest” to ENVO:00000111, “forest” (environmental feature), and ENVO:01000174, “forest biome”).

In the case of “nested” environment descriptors, a “leftmost-longest”-like matching approach was applied. If, for example, “sandy sediment” is encountered in the text, it is mapped to ENVO:01000118, “sandy sediment” (and not to the nested terms: sand, sediment).
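The curation itself was done by hand, but the matching rules above can be sketched in code to make them concrete. The example below is a minimal illustration, assuming a small hypothetical lexicon that maps lowercase phrases to tuples of term identifiers. The three ENVO identifiers are those quoted in this post; the entries for “sand” and “sediment” use placeholder strings rather than real identifiers, and the function name annotate is likewise an assumption.

```python
import re


def annotate(text, lexicon):
    """List environment mentions in order of appearance, repetitions included.

    At each token position the longest lexicon phrase starting there wins,
    so nested shorter terms are not reported separately.
    """
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]
    max_len = max((len(phrase.split()) for phrase in lexicon), default=0)

    mentions = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tok[0] for tok in tokens[i:i + n])
            if phrase in lexicon:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                mentions.append((text[start:end], lexicon[phrase]))
                i += n  # skip past the match so nested terms are ignored
                matched = True
                break
        if not matched:
            i += 1
    return mentions


# Hypothetical mini-lexicon; a descriptor may map to more than one term.
lexicon = {
    "sandy sediment": ("ENVO:01000118",),
    "forest": ("ENVO:00000111", "ENVO:01000174"),
    "sand": ("PLACEHOLDER:sand",),          # placeholder, not a real ID
    "sediment": ("PLACEHOLDER:sediment",),  # placeholder, not a real ID
}

mentions = annotate("Found on sandy sediment near the forest; sand is rare.", lexicon)
```

Here “sandy sediment” is reported as a single mention, “forest” carries both of its mappings, and the later standalone “sand” is still listed because it is not nested inside a longer match.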

During the curation, a range of special cases was encountered. Cases such as misspellings, missing EnvO term synonyms and enumerations were indicated as such. Environment descriptors that did not correspond to an existing EnvO term were also marked as such.

Finally, environment descriptive terms that were part of geographical locations and/or common taxon names (e.g. Steppe Eagle, Aquila nipalensis, shown in the Figure) were flagged as such to allow for downstream analysis.

Calculating the IAA and merging the annotated documents into a single corpus are now ongoing, paving the way for the ENVIRONMENT’s accuracy benchmark. Stay tuned!

Steppe Eagle, Aquila nipalensis, a common species name that includes a reference to an environment. Such cases encountered during the curation have been flagged for follow-up analysis (Image License: CC BY NC SA, © Tarique Sani, Source: Flickr: EOL Images)