iEvoBio Keynote Talk 2010

Biodiversity Discovery and Documentation in the Information and Attention Age

Presented by: Rob Guralnick

Authors: Rob Guralnick and Andrew Hill

Contributors: Meredith Lane, Dan Janies, Walter Jetz, and lots of other folks.

Funding support: Global Biodiversity Information Facility, National Biological Information Infrastructure, Defense Advanced Research Projects Agency, National Science Foundation.

#ievobio

WHAT IS BIODIVERSITY DISCOVERY AND DOCUMENTATION?

Linnean shortfall (too few taxonomists, antiquated and laborious process)

Wallacean shortfall (very coarse resolution, scattered data, no integration)

Darwinian shortfall (trees scattered in literature, no “mother of all trees”)

Multiple repositories storing genetic and phenotypic data do not communicate well. Phenotypic knowledge-bases lag behind.

ACTION IMPEDIMENTS

Discovering and documenting new units of biodiversity

Discovering and documenting distributions of lineages

Discovering and documenting relationships among lineages

Discovering and documenting lineage traits from genomes to phenotype.

From the State of Observed Species report, http://species.asu.edu/files/SOS2010.pdf

Pace of Species Description and Documentation for 2008

Approx. 1.922 million named species (all taxa)

~4-30 million undiscovered

Assuming a relatively conservative number (e.g. 10 million undescribed species), it will take another 360 years to discover and document them at our current pace. Why is discovery and documentation so slow?

1. Taxonomists proceed in the same manner today as they did one hundred years ago.

2. Few products are generated along the way. This also means the process is vulnerable (from the loss of computers to the loss of taxonomists themselves).

3. Discovery and documentation are coupled.
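As a rough check on the 360-year figure, the pace it implies can be back-calculated from the two numbers given above. The sketch below is only that arithmetic; the function names are illustrative, and the 4 and 30 million backlogs are the bounds of the "undiscovered" range quoted earlier, run at the same implied pace.

```python
def implied_rate(undescribed, years):
    """Species documented per year implied by a backlog and a time span."""
    return undescribed / years


def years_to_document(undescribed, species_per_year):
    """Years needed to clear a backlog at a constant yearly pace."""
    return undescribed / species_per_year


# 10 million undescribed species over 360 years (figures from the slide).
rate = implied_rate(10_000_000, 360)
print(round(rate))  # 27778 species per year

# Same pace applied to the low and high ends of the 4-30 million range.
for backlog in (4_000_000, 30_000_000):
    print(backlog, round(years_to_document(backlog, rate)))
# 4000000 144
# 30000000 1080
```

At a constant pace the time needed scales linearly with the backlog, so the uncertainty in the number of undiscovered species carries straight through to the estimate.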

State and Scale of Knowledge in Environmental Sciences

[Figure: scale (grain) of available environmental and biodiversity data. Scale axis: World, Continents/Realms, Ecoregion, 200 km, 50 km, 1 km, 100 m, 1 m. Layers: Topography, Landcover (current), Landcover (future), Vertebrate distributions. Datasets labeled: 1996: GTOPO30; 2009: SRTM V4; 2006: WWF; 2005-9: IUCN, misc. atlas data, surveys; 2003: GLC2000; 2009: GlobCover; 1992: BIOME; 2001: Image 2.2; regional models.]

Knowledge Gap

Slide from Walter Jetz (thanks Walter)

90 meter resolution SRTM elevation data for a portion of Colorado

100 times as coarse

1000 times as coarse

A view of the world at different resolutions
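To make "times as coarse" concrete, here is a minimal sketch of one common way to coarsen a gridded layer: averaging non-overlapping square blocks of cells. This is an assumption for illustration, not necessarily how the images on the slide were produced, and the elevation values are invented.

```python
def coarsen(grid, factor):
    """Average non-overlapping factor x factor blocks of a 2-D grid."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            # Collect every cell in the current block, then take the mean.
            block = [grid[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


# Invented 4 x 4 elevation grid (meters), standing in for real SRTM cells.
elev = [
    [1000, 1200, 3000, 3200],
    [1100, 1300, 3100, 3300],
    [2000, 2000, 1500, 1500],
    [2000, 2000, 1700, 1700],
]
coarse = coarsen(elev, 2)
print(coarse)  # [[1150.0, 3150.0], [2000.0, 1600.0]]
```

With a factor of 100, 90 m cells would become 9 km cells, and each output value would stand in for 10,000 original measurements.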

Distribution Knowledge Is Scattered

• Points are from GBIF data portal

• Expert opinion range map from IUCN Red List

• IUCN also lists some habitat preferences (cropland, meadows, mountain valleys)

Microtus montanus

Documenting our biodiversity matters because it is under increasing threat.

“Overall, we are locked into a race. We must hurry to acquire the knowledge on which a wise policy of conservation and development can be based for centuries to come.”

- E. O. Wilson

HOW DO WE DO THIS?

DEVELOP KNOWLEDGEBASES OF SPECIES DISTRIBUTIONS AND SPECIES RELATIONSHIPS.

PROVIDE MEANS TO INTEGRATE ACROSS THESE KNOWLEDGE-BASES

PROVIDE TOOLS TO RAPIDLY AND EASILY EXPLORE THESE DATA ACROSS SPACE AND TIME

MAKE THIS A COMMUNITY EFFORT – LEVERAGE COMMUNITY SOURCING

Raw global data: lineage, occurrence, environmental

Initial Research Questions

Analytical Methods: means to summarize data & select hypotheses

Phylo-, Biodiversity and Ecological Informatics

New Research Questions

Processed global data: species, distributions, new envir. layers

Growing data and information repositories

Tools: encoding analytical methods

Application Services

automated workflow for biodiversity science

Growing Toolbox

Concepts and ideas

Concepts and ideas Growing Toolbox Growing data Repositories, formats

Tree of Life

Population Genetics

Paup, Phyml/Raxml, MrBayes, Beast, Mesquite, etc.

TCS/NCA, MsBayes, BayesSCC, Structure, etc.

GenBank, TreeBase

(Nexus/Newick/PhyloXML, etc.)

Earth Surface

Climate

Ecosystem fluxes

Satellites (MODIS/GOES/Landsat, etc.)

Satellites; historical, current in-situ, GCMs, etc.

Infrared Imaging Spectrometer, etc.

Instrument-based raw

Statistical/inferential

Inference-based

Satellite image repositories, Worldclim, PRISM, PMIP

(erdapp, netCDF, GIS formats)

Species named

Species traits

Species distributions

TaxonX, automated species name extraction

ITIS, Catalog of Life, Zoobank, Zookeys, etc.

Lucid, Ontologies, RDF

GIS, habitat suitability models, SDMs/ENMs, Survey Gap, etc.

Morphbank, TraitNet, etc.

GBIF, VertNet, OBIS (species occurrence), Map of Life, IUCN

Observations and model-based

From Peterson et al., in press, Systematics and Biodiversity

The Interconnected Nature of Biodiversity Ideas, Outputs, Repositories

DECOUPLING SPECIES DISCOVERY AND DOCUMENTATION (OR GET IT OUT THERE FOR OTHERS TO USE AND REPURPOSE)

(OR CLAIM NEW BIODIVERSITY, PROVISIONALLY, BEFORE FORMAL PUBLICATION)

Generate new data from specimens

Genbank

Morphbank

Treebase

Link new unit of biodiversity onto tree of life (claim discovery)

Formal publication (documentation)

Comparative analyses

Publish step 1 repositories

Publish step 2

Community sourcing

Scratch-pads

Life-desks

TAKE HOME MESSAGE 1:

We need to use the web as a collaborative work environment for biodiversity knowledge generation

We need to claim knowledge of the existence of new species before all of the formal steps to document it are complete

We need to publish new data about species soon after generation and prior to publication

Questions:

• How are drug resistant strains of H5N1 circulating around the globe?

• How did drug resistance arise in the H5N1 population?

• Are mutations that give rise to drug resistance in H5N1 under positive selection?

• Can we provide ways for researchers and the general public to track this spread in near real time?

Hosts and strains of avian influenza A

What about monitoring an evolving Earth System?

Tracking the spread of disease lineages with known important mutations through time & space

Viral structure

Methods:

• Collect public genome data for H5N1 avian influenza (676 full genomes).

• Use tools for more efficient alignment and phylogenetic analysis of data

• Test whether mutations on M2 gene (L26I, V27A/I, A30S, S31N) that provide resistance to adamantanes (a class of drugs used to treat influenza A) are under positive selection, purifying selection or are neutral (across the full sampled population of H5N1 inf. A)

• Make Google Earth™ visualizations available.

Global View of Spread of H5N1 (blue branches are lineages with mutation for higher transmissibility among mammals)

Resistant mutant found at position 31 of the M2 protein – colored red below

Altitude of node X = a + [(n − 1) × b]
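The slide does not define the symbols in this formula. The sketch below assumes that a is a base altitude, b is a fixed increment between successive levels of the tree, and n is the node's rank counted from 1, so a rank-1 node sits exactly at altitude a. The function name and the numeric values are hypothetical.

```python
def node_altitude(n, a, b):
    """Altitude of a tree node: a + (n - 1) * b.

    Assumed meanings (not stated on the slide):
      n -- rank of the node, counted from 1
      a -- base altitude for rank-1 nodes
      b -- altitude added for each further rank
    """
    if n < 1:
        raise ValueError("rank n must be at least 1")
    return a + (n - 1) * b


# Hypothetical base of 10,000 m and step of 50,000 m.
print([node_altitude(n, 10_000, 50_000) for n in (1, 2, 3)])
# [10000, 60000, 110000]
```

A linear rule like this keeps every level of the tree visually separated when the branches are drawn above the globe.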

dN/dS measurements across the M2 protein (high dN/dS ratios (>1) suggest that more non-synonymous substitutions are occurring than expected and are therefore likely being maintained in the population)
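The reading of the ratio described above can be written down directly. This is only a sketch of the interpretation rule (ratio above 1, below 1, or equal to 1), with invented rate values; it is not the analysis behind the slide, and real tests for selection assess statistical support rather than comparing a point estimate to 1.

```python
def dn_ds(dn, ds):
    """Ratio of non-synonymous (dN) to synonymous (dS) substitution rates."""
    if ds <= 0:
        raise ValueError("dS must be positive to form a ratio")
    return dn / ds


def selection_regime(omega, tolerance=0.0):
    """Classify a dN/dS ratio; tolerance is a hypothetical band around 1."""
    if omega > 1 + tolerance:
        return "positive"
    if omega < 1 - tolerance:
        return "purifying"
    return "neutral"


# Invented rates for two codon positions.
print(selection_regime(dn_ds(0.5, 0.25)))   # positive  (ratio 2.0)
print(selection_regime(dn_ds(0.125, 0.5)))  # purifying (ratio 0.25)
```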

Table 2. Amantadine use in chicken farms in Northern China in 1 year (from October 2004 to September 2005)

Farm  Total no. of chickens  Total days of medication in 1 year  Dosage (%; w/w)  Route of administration
A     8,300                  37                                  0.03             Feedstuff
B     15,600                 26                                  0.025            Feedstuff
C     10,400                 43                                  0.022            Feedstuff
D     26,100                 21                                  0.01             Drinking H2O
E     7,200                  64                                  0.015            Drinking H2O
F     13,300                 25                                  0.032            Feedstuff
G     4,300                  63                                  0.01             Drinking H2O
H     21,700                 38                                  0.012            Drinking H2O
I     5,400                  42                                  0.025            Feedstuff
J     14,700                 59                                  0.01             Drinking H2O

From He, 2007, Antiviral Research

So What Did We Find Out?

• Drug resistance to adamantanes is under positive selection for at least some mutations (S31N and V27A/I).

• Drug resistant lineages can spread quickly across the globe

• Emergence of drug resistance has been through mutation, not recombination and hitch-hiking (results not shown)

• Effectively treating a potential H5N1 pandemic depends on continued monitoring of the evolution and spread of resistance to adamantanes and oseltamivir (Tamiflu)

TAKE HOME MESSAGES 2:

• It is possible to develop observing systems not just of species but of evolving lineages.

• These monitoring or observing systems can provide a unique view into evolution, selection and adaptation.

• Such systems are essential for more accurate forecasting.

• Developing such a system means creating automated workflows.

http://geophylo.appspot.com/

Hill and Guralnick, in press, Ecography

Google App Engine application

WHAT ABOUT ALLOWING OTHERS TO MAKE THEIR OWN GEOPHYLOGENY?

GeoPhylo Engine - Written in Python, open source, and deployed on Google App Engine.

Advantages of cloud-based deployment:

• Scalable (near infinite computation resources)

• All versioning kept intact so developers can easily link to latest and greatest

• Storage of persistent KMLs for users who want to share and modify their KMLs.

• Easily deployable as a web service
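For readers unfamiliar with KML, the following is a minimal sketch of what a single tree branch could look like as a KML line drawn above the globe. It is not the GeoPhylo Engine's code: the function name, the branch name, and the coordinates are hypothetical, and only Python's standard library and basic KML 2.2 elements are used.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def branch_kml(name, parent, child):
    """Return a KML document with one branch as a LineString.

    parent and child are (longitude, latitude, altitude_m) tuples.
    """
    ET.register_namespace("", KML_NS)  # serialize without an ns0: prefix
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
    ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
    line = ET.SubElement(placemark, f"{{{KML_NS}}}LineString")
    # "absolute" places the line at the given altitudes, not on the ground.
    ET.SubElement(line, f"{{{KML_NS}}}altitudeMode").text = "absolute"
    ET.SubElement(line, f"{{{KML_NS}}}coordinates").text = " ".join(
        f"{lon},{lat},{alt}" for lon, lat, alt in (parent, child)
    )
    return ET.tostring(kml, encoding="unicode")


# Hypothetical branch from an internal node down to a sampled tip.
doc = branch_kml("hypothetical branch",
                 (114.2, 22.3, 60000), (105.8, 21.0, 10000))
print(doc)
```

Because the output is plain text, a file like this can be stored, shared, and edited by users, which is what makes persistent KML storage useful.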


Geophylogenies provide rich visualizations of multidimensional data that can be examined at multiple spatial (and temporal) scales

Such visualizations may appeal beyond our community of evolutionary biologists to the broader scientific and policy community

Automated approaches and workbench-oriented tools allow updated, community-driven content to be generated

Our ultimate goal should be an ever-growing “mother of all trees” from which we can attach new “twigs” as we discover them.

Can We Really Track Distributions of Lineages Through Space and Time?

Map of Life Will:

• Provide expert opinion range maps for almost all terrestrial vertebrates (and means to accumulate more maps for other taxa)

• Provide means for the community to annotate those maps

• Assemble point occurrences, habitat preference data and environmental data (e.g. climate, landcover, soil, etc.)

• Provide a modeling approach to generate much finer scale distribution models (on the order of a kilometer resolution)

Overlaying expert opinion maps and model outputs

Biodiversity encyclopedias

Point occurrences, Valid taxonomies

Range maps, Validation services

Species occurrence databases

Range maps

Species data

National biological data

Online conservation tools

Environmental data

ITIS

ITIS

Map of Life Connections

- Common data model for range maps

- Web-services based for sharing maps

- Focus on improvement through both modeling and community involvement

Integrating phylogenetic and distributional data in Google Earth™

Workflows combining phylogenetic approaches, conservation status and species occurrence


Map of Life fills a critical gap in our global biodiversity knowledge by integrating different sources of species distribution into high resolution range maps for community use.

The ultimate goal is to integrate such species distribution knowledge with knowledge about relationships among species and conservation knowledge

Such integration, at global scale, and across large taxonomic groups, is the next step forward

Patterns, Predictions, Relational Modeling

Community Sourcing and the Attention Age

At the heart of the message here today is also a challenge:

The vision here suggests that data publishing and “sharing” is as important as academic “kudos”

Can we act for the collective good of our community and, by so doing, see gains for all?

Let's change our model of credit!

Education
