Thursday, July 4, 2013

June 2013: The “dry-run” curation month

Generating a gold standard corpus comprises steps such as document collection/selection, manual document annotation, annotation result collection, and statistical analysis.

The first and last steps can be computationally assisted and partially automated. This is not the case for manual document annotation. Also called “curation”, it involves manually scanning the document text to identify environment descriptive terms and mapping them to unique identifiers from a community resource (in this case, the Environment Ontology, EnvO).

The tediousness and time demands of such a process call for a collaborative effort. An international group of six researchers has undertaken this task: Lucia Fanini, Sarah Faulwetter, Evangelos Pafilis, Christina Pavloudi, Julia Schnetzer, and Katerina Vasileiadou (in alphabetical order).

Coming from a diverse range of scientific backgrounds (such as ecology, computational biology, molecular biology, and systematics), they bring different mindsets to scanning pieces of text and, in a way, represent different EOL readers.

Such pluralism is a desired feature for the corpus curation. However, a common understanding among team members has to be established.

This was one of the main aims of the test curation (“dry run”) that took place during June 2013. A small set of documents (text sections from EOL species pages, see post) was delivered to all curators. While manually annotating these documents, curators collected as many questions as possible about unclear and/or problematic annotation cases. Some examples of the latter are: terms and/or synonyms missing from EnvO, words that could be mapped to multiple EnvO terms, location names, and nested environment descriptive terms.

A strategy employing a set of flags to indicate such cases is now in place. The previously generated curation guideline document (see post) has been updated accordingly, and the production-level curation may now start.
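
To make the flagging idea concrete, here is a minimal sketch of how a flagged annotation could be recorded. It is not the project's actual format: the flag names, the Annotation fields, the needs_review helper, and the placeholder EnvO identifiers are all hypothetical, chosen only to mirror the four problem cases listed above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Flag(Enum):
    # Hypothetical flag names, one per problem case mentioned in the post.
    MISSING_TERM = "term or synonym missing from EnvO"
    AMBIGUOUS = "word maps to multiple EnvO terms"
    LOCATION = "location name"
    NESTED = "nested environment descriptive term"


@dataclass
class Annotation:
    # All field names are assumptions made for this illustration.
    document_id: str
    start: int                    # character offset where the term begins
    end: int                      # character offset just past the term
    text: str                     # the annotated span itself
    envo_ids: list = field(default_factory=list)  # mapped identifiers, if any
    flags: set = field(default_factory=set)       # empty when unproblematic


def needs_review(annotations):
    """Return the annotations that carry at least one flag."""
    return [a for a in annotations if a.flags]


# Made-up example text; the identifiers are placeholders, not real EnvO IDs.
doc = "Found in coastal sand dunes."
annotations = [
    Annotation("doc-1", 9, 27, "coastal sand dunes",
               ["ENVO:PLACEHOLDER_A"], {Flag.NESTED}),
    Annotation("doc-1", 17, 27, "sand dunes", ["ENVO:PLACEHOLDER_B"]),
]

flagged = needs_review(annotations)
```

Here only the first annotation ends up in flagged, since it contains the shorter term "sand dunes" nested inside it; the second carries no flag and would pass straight through.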

The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text @ PLOS ONE

The sister projects SPECIES and ORGANISMS are now published in PLOS ONE, as part of the PLOS Text Mining Collection.

The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. Pafilis E, Frankild SP, Fanini L, Faulwetter S, Pavloudi C, et al. (2013) PLoS ONE 8(6): e65390. doi:10.1371/journal.pone.0065390

The knowledge, skills and know-how gained through this work paved the way for ENVIRONMENTS.

A big thank you to the team, Evangelos