Saturday, March 30, 2013

March 2013: paving the way on several project aspects


March 2013 has largely been a preparatory month, paving the way for follow-up tasks.

The main task of ENVIRONMENTS-EOL is the identification of environment descriptive terms, such as terrestrial, aquatic, lagoon, coral reef, in EOL Pages.

Achieving this aim requires two things: the text sections of the EOL Taxon Pages that contain environmental context information that could be mined, and a piece of software capable of identifying environment descriptors in those sections.

For the former, scripts have been written that use the EOL API to retrieve sections (“subjects” in the EOL terminology) of every taxon, such as TaxonBiology, Description, Biology, Distribution, Habitat and more. EOL’s adherence to standards (e.g. the Species Profile Model) has significantly assisted this procedure. The text retrieval will be optimized further in active collaboration with the EOL Developer Team.
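The section selection described above can be sketched as follows. This is only an illustration and not the project's retrieval scripts: it makes no network calls, and the record layout (dictionaries with "subject" and "text" keys) and the helper names subject_name and collect_text are assumptions made for the example.

```python
import re
from html import unescape

# Subjects named in this post; the real list may be longer.
WANTED_SUBJECTS = {"TaxonBiology", "Description", "Biology", "Distribution", "Habitat"}

def subject_name(subject):
    """Reduce a subject given as a URI or a bare name to its final label."""
    return re.split(r"[#/]", subject)[-1]

def collect_text(records, wanted=WANTED_SUBJECTS):
    """Group plain text by subject, keeping only the wanted sections.

    `records` is a hypothetical list of dicts with "subject" and "text" keys,
    standing in for whatever the retrieval step returns.
    """
    sections = {}
    for record in records:
        name = subject_name(record.get("subject", ""))
        if name in wanted and record.get("text"):
            # Drop markup, decode HTML entities, and collapse whitespace.
            plain = unescape(re.sub(r"<[^>]+>", " ", record["text"]))
            sections.setdefault(name, []).append(" ".join(plain.split()))
    return sections

if __name__ == "__main__":
    demo = [
        {"subject": "http://example.org/voc#Habitat",
         "text": "<p>Shallow  coral reefs &amp; lagoons.</p>"},
        {"subject": "Uses", "text": "Food"},
    ]
    print(collect_text(demo))
```

Keeping the text grouped by subject also makes it possible to assess later how much each page section contributes to the mining.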

For the latter, a prototype tagger, ENVIRONMENTS, has been compiled. ENVIRONMENTS is based on SPECIES, a tagger capable of identifying organism names in text using a dictionary-based approach (main developers: Lars, Sune).

ENVIRONMENTS is capable of identifying environment descriptive terms by looking up words in the text against a dictionary of environment descriptors. A prototype dictionary has been created according to the naming information available in the Environment Ontology (EnvO).
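To illustrate the general idea of dictionary-based lookup, here is a minimal sketch. It is not the ENVIRONMENTS implementation: the function tag_environments, the longest-match-first strategy, the three-word window and the placeholder identifiers are all assumptions made for the example.

```python
import re

# Hypothetical mini-dictionary: lowercase name -> identifier.
# The identifiers are placeholders, not real EnvO accessions.
DICTIONARY = {
    "coral reef": "ENV:0001",
    "lagoon": "ENV:0002",
    "reef": "ENV:0003",
}

def tag_environments(text, dictionary, max_words=3):
    """Return (start, end, matched_text, identifier) tuples.

    At each word position the longest dictionary entry (up to `max_words`
    words) is preferred, so "coral reef" wins over "reef".
    """
    tokens = [(m.start(), m.end()) for m in re.finditer(r"[A-Za-z]+", text)]
    matches = []
    i = 0
    while i < len(tokens):
        found = False
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            candidate = text[start:end]
            # Normalize case and inner whitespace before the lookup.
            key = " ".join(candidate.lower().split())
            if key in dictionary:
                matches.append((start, end, candidate, dictionary[key]))
                i += n
                found = True
                break
        if not found:
            i += 1
    return matches

if __name__ == "__main__":
    for match in tag_environments("Found in the lagoon near a coral reef.", DICTIONARY):
        print(match)
```

Reporting character offsets alongside the identifier makes it straightforward to compare the tagger output with manual annotations later on.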

EnvO is a community resource offering a controlled, structured vocabulary for ecosystem types (“biomes”), environmental materials, and environmental features (e.g. habitats).

The different types and sources of EnvO term names and synonyms have been explored and the more precise ones have been selected.

Further steps include actions that will improve the match between the way terms are written in the text and the way they appear in EnvO, e.g. by automatically adding the plural forms of the terms to the dictionary.

Another important aspect of the ENVIRONMENTS-EOL project is the evaluation of the accuracy of the environment descriptive term identification. To this end, the creation of a manually annotated corpus (gold standard) is necessary.

Such a corpus comprises a set of documents in which environment descriptors have been manually identified and mapped to unique identifiers in community resources (e.g. the Environment Ontology terms).

Once such a gold standard corpus is in place, its manually annotated tags can be compared with those predicted by named entity recognition software. In this way the accuracy of the latter can be calculated.
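One common way to make this comparison concrete is to compute precision, recall and F1 over exact matches. The sketch below assumes that each annotation is a (document id, start, end, identifier) tuple and that only exact matches count; both are assumptions made for illustration, not the project's settled evaluation protocol.

```python
def evaluate(gold, predicted):
    """Compare predicted tags against gold-standard tags.

    Both arguments are iterables of hashable annotations, e.g.
    (document_id, start, end, identifier) tuples. Only exact matches
    count as true positives.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

if __name__ == "__main__":
    # Placeholder document ids and identifiers.
    gold = [("doc1", 13, 19, "ENV:0002"), ("doc1", 27, 37, "ENV:0001")]
    predicted = [("doc1", 13, 19, "ENV:0002"), ("doc1", 33, 37, "ENV:0003")]
    print(evaluate(gold, predicted))
```

Precision is the fraction of predicted tags that are correct, and recall is the fraction of gold tags that were found. A looser scheme, such as counting overlapping spans as matches, would give different numbers.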

Drawing on the experience gained from the creation of a manually annotated corpus of taxonomic mentions (the S800 corpus) and from the pilot annotation of environment descriptive terms in PubMed abstracts (thanks to Christina for her support), a guideline document is now in place.

This document will provide the curator team (Aikaterini, Christina, Evangelos, Julia, Lucia, Sarah) with examples of documents in which environment descriptors have been manually identified and mapped to the corresponding EnvO terms.

Additionally, this guide elaborates on the main categories employed by EnvO, presents web-search tools dedicated to EnvO and text editors that facilitate the annotation task, discusses issues already spotted (e.g. how to handle environmental descriptors currently missing from EnvO), and lists hints and tips that could ease the tedious task of manual annotation.

Welcome to ENVIRONMENTS-EOL, a few words on the project


Large-scale biological questions such as retrieving all species belonging to a specific group (e.g. Invertebrates), associated with a particular environment (e.g. coral reefs) and occurring in a specific region (e.g. Indo-Pacific Ocean) require the combinatorial analysis of information available in a diverse range of resources.

Taxonomy information along with species occurrence data (stored in centralized biodiversity resources) can be combined to this end. To fill in the missing pieces of the puzzle, however, input based on knowledge existing in the scientific literature is required.

The Encyclopedia of Life (http://eol.org), by collecting the available information about a given taxon, is a one-stop shop that greatly facilitates answering such questions.

The identification of environment descriptive terms, such as terrestrial, aquatic, lagoon, coral reef, in EOL text can drive the mining of species environmental context information.

ENVIRONMENTS is an open source tool supporting such identification. It does so by looking up words in the text against a dictionary of environment descriptors.

The Environment Ontology (http://environmentontology.org/), a community resource offering a controlled, structured vocabulary for ecosystem types (“biomes”), environmental materials, and environmental features (e.g. habitats), serves as the source of names and synonyms for the creation of such a dictionary.

While the environment descriptive term identification is the core of the project, tasks such as:
  • the evaluation of the accuracy of the method (via the creation of a manually annotated, gold standard, corpus)
  • the assessment of the contribution of the different EOL page sections to the environmental context mining
  • the consideration of taxonomy and species occurrence information
  • the generation of summarizing visualizations supporting comparisons and biological inferences

are equally important in answering large-scale biological questions like the one at the beginning of this post.

What lies ahead is a challenging project comprising a diverse range of tasks. In response, a team of researchers with diverse backgrounds (molecular biology, microbial ecology, data analysis, text/literature mining, bioinformatics, statistics and more) has been put together.

Through the posts in this blog, we hope to keep you up to date with the project developments, provide more information on the tasks involved, and introduce you to the people contributing to the different tasks.

Stay tuned!