DESIGNING THE MICROBIAL RESEARCH COMMONS: AN INTERNATIONAL SYMPOSIUM
NATIONAL ACADEMY OF SCIENCES, WASHINGTON, DC, 8-9 OCTOBER 2009
Paul Gilna, B.Sc., Ph.D.
California Institute for Telecommunications & Information Technology (Calit2)
University of California, San Diego
Large-Scale Microbial Ecology Cyberinfrastructure (CAMERA)
Global Scientific Research Cyber-Community
• 3100 users
• 70 countries
CAMERA 2.0 Objectives
• CAMERA serves as one representation of a specific research community’s need for a system to:
- Provide a metadata-rich family of scalable databases and make them available to the community
- Collect and reference increasing metadata relevant to environmental metagenome datasets
- Exploit the power of querying on metadata across multiple geospatial locations
- Provide a facility that allows a diversity of software tools to be easily integrated into the system (and sufficient compute resources to support these analyses)
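As a rough illustration of the third objective, querying on metadata across multiple geospatial locations, the Python sketch below filters sample records by a latitude/longitude bounding box and a temperature threshold. The record fields (`site`, `lat`, `lon`, `temp_c`) and all values are invented for illustration; they are not CAMERA's schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    site: str      # hypothetical site label
    lat: float     # degrees north
    lon: float     # degrees east
    temp_c: float  # water temperature at sampling, degrees Celsius


def query(samples, lat_range, lon_range, min_temp_c):
    """Return samples inside the bounding box with temperature >= min_temp_c."""
    (lat_lo, lat_hi), (lon_lo, lon_hi) = lat_range, lon_range
    return [
        s for s in samples
        if lat_lo <= s.lat <= lat_hi
        and lon_lo <= s.lon <= lon_hi
        and s.temp_c >= min_temp_c
    ]


# Made-up records, for illustration only.
SAMPLES = [
    Sample("site-A", 41.5, -71.4, 11.0),
    Sample("site-B", 36.0, -75.4, 18.2),
    Sample("site-C", 24.5, -83.1, 25.0),
]

warm_midlat = query(SAMPLES, (30.0, 45.0), (-80.0, -70.0), 15.0)
print([s.site for s in warm_midlat])
```

A production system would more likely push such filters into the database, but the shape of the query (location plus measured parameter) is the same.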
The Semantically Aware DB Schema
• Some key features of the semantically aware DB schema
- Environmental parameters: Modeled more generally, to accommodate any environment and any parameter within an environment
- Sequence: Separate “registries” for DNA, rRNA, mRNA, viral segments, reference genomes, etc. Sequence annotations are independently searchable.
- Workflow Connection: Every computed property is associated with the workflow instance that created it.
- Associated Data: Data not produced in CAMERA but often used for analysis and comparison
- Ontologies: All metadata and all measured and observed parameters are connected to ontologies whenever possible.
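A minimal sketch of how three of these features could look as data structures: a generic environmental parameter, an optional ontology link, and a computed property tied to the workflow instance that created it. The class names, field names, and identifiers are assumptions made for illustration and do not reflect the actual CAMERA schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Parameter:
    environment: str                     # any environment, e.g. "water" or "soil"
    name: str                            # any parameter within that environment
    value: float
    unit: str
    ontology_term: Optional[str] = None  # linked "whenever possible", so may be absent


@dataclass(frozen=True)
class ComputedProperty:
    sequence_id: str
    name: str
    value: str
    workflow_run_id: str                 # the workflow instance that produced the value


def properties_from_run(props, run_id):
    """Return the computed properties created by one workflow instance."""
    return [p for p in props if p.workflow_run_id == run_id]


# Hypothetical records, for illustration only.
params = [
    Parameter("water", "temperature", 18.2, "degC", ontology_term="hypothetical:0001"),
    Parameter("soil", "pH", 6.5, "pH"),
]
props = [
    ComputedProperty("seq-1", "cluster", "c17", workflow_run_id="run-42"),
    ComputedProperty("seq-2", "cluster", "c03", workflow_run_id="run-43"),
]
print([p.sequence_id for p in properties_from_run(props, "run-42")])
```

Keeping the environment and parameter name as data rather than as fixed columns is what lets one structure cover any environment.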
Integration of External Data
• Warehousing
- Reference genomes
- Homologs, COG clusters
- Raster data from slow/complex servers
• Remote Data
- KEGG pathways
- NASA MODIS data
- World Ocean Atlas
- Other data that come as “data sets” that do not conform to the schema
NASA Aqua-MODIS satellite data
Metadata: beyond data collected at sampling site
[Figure: MODIS images of sea surface temperature and chlorophyll covering GOS sites #8-12, mid-November 2003]
Integration of Enhanced Metadata
Integrate and browse additional sources of microbial data
CAMERA 2.0 (Data Submission)
Growing the CAMERA Community and Resource…
Investigator submits proposal to GBMF
Investigator submits metadata to CAMERA
CAMERA sends acknowledgement to Investigator, Seq. Group, GBMF
Seq. Group sends barcoded sample “kit” to investigators
Seq. Group uploads data to CAMERA (& Investigator)
Data & Metadata Released in six months
• Metadata now collected before sequence data: GSC-compliant
• Project-ID serves as acceptance-proof
• Sample is received and sequenced
Webb Miller and Stephan C. Schuster,
and Roche / 454 Genome Sequencer
GBMF Data Acquisition Pipeline: A New Data Submission Paradigm - Metadata First!
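The "metadata first" rule can be sketched as a small state machine in which a submission may only move to the next stage, so sequence data cannot be uploaded before metadata has been submitted. The stage names and their order are read off the slide's list and are an assumption where the original diagram's arrows are not recoverable; the project ID is hypothetical.

```python
from enum import IntEnum


class Stage(IntEnum):
    PROPOSAL_SUBMITTED = 1
    METADATA_SUBMITTED = 2
    ACKNOWLEDGED = 3
    KIT_SENT = 4
    SAMPLE_SEQUENCED = 5
    DATA_UPLOADED = 6
    RELEASED = 7


class Submission:
    def __init__(self, project_id):
        self.project_id = project_id
        self.stage = Stage.PROPOSAL_SUBMITTED

    def advance(self, target):
        """Move to the next stage; skipping or repeating a stage raises ValueError."""
        if target != self.stage + 1:
            raise ValueError(
                f"cannot go from {self.stage.name} to {Stage(target).name}"
            )
        self.stage = Stage(target)


sub = Submission("hypothetical-project-001")
try:
    sub.advance(Stage.DATA_UPLOADED)      # sequence data before metadata
except ValueError as err:
    print("rejected:", err)
sub.advance(Stage.METADATA_SUBMITTED)     # metadata first, as the pipeline requires
print(sub.stage.name)
```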
Data Standards
• Minimal Information for (Meta)Genomic Sequences: MIGS/MIMS
• A metadata standard developed by the Genomic Standards Consortium
- Controlled vocabularies, e.g. EnvO, PATO
- Common language: GCDML
• Submissions shall comply with a MIMS/MIGS core, but any metadata can be entered via keywords and free text
• Different metadata submission forms for different habitats (water, soil, air, hosts)
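The combination of a required core, habitat-specific forms, and free-form extras can be sketched as a simple completeness check. The field names below are hypothetical placeholders; the real MIGS/MIMS checklist is defined by the GSC and is not reproduced here.

```python
# Hypothetical core fields; NOT the actual MIGS/MIMS checklist.
CORE_FIELDS = {"project_name", "collection_date", "latitude", "longitude", "habitat"}

# Hypothetical habitat-specific fields, mirroring "different forms for different habitats".
HABITAT_FIELDS = {
    "water": {"depth_m"},
    "soil": {"depth_m"},
    "air": {"altitude_m"},
    "host": {"host_taxon"},
}


def missing_fields(record):
    """Return the sorted list of required fields that the record lacks."""
    required = CORE_FIELDS | HABITAT_FIELDS.get(record.get("habitat"), set())
    return sorted(f for f in required if record.get(f) in (None, ""))


record = {
    "project_name": "example",
    "collection_date": "2009-10-08",
    "latitude": 32.88,
    "longitude": -117.23,
    "habitat": "water",
    "keywords": ["free", "text"],  # extra metadata is accepted alongside the core
}
print(missing_fields(record))
```

Fields outside the required set, such as `keywords`, are simply ignored by the check, which matches the idea that any additional metadata may be supplied.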
User Friendly Compute Environment
CAMERA 2.0 (Computation)
From simple job submission to community developed and published workflows…
RAMMCAP – Rapid clustering and functional annotation for metagenomic sequences
RNA finding/filtering
DNA clustering
• Unique sequences
• Taxonomy / population analysis
ORF clustering
• ORF calling
• Unique sequences
• Protein families
ORF and cluster annotation
• Pfam, Tigrfam, COG, etc.
Features
• Very fast (10-100x) compared with BLAST-based methods
• Effective tools: CD-HIT, HMMERHEAD, meta_RNA, and RPS-BLAST
• Focused functional annotation via curated protein families
[Figure: RAMMCAP workflow diagram]
Data: metagenomic raw reads, DNA clusters, unique DNA sequences, tRNAs, rRNAs, ORFs, non-redundant ORFs, protein clusters, representative sequences
Steps and tools: CD-HIT-EST (95%); tRNA scan, rRNA scan, meta_RNA; ORF_finder, Metagene; CD-HIT (90-95%); CD-HIT (60 or 30%); HMMER, HMMERHEAD, RPS-BLAST
Reference databases: COG, Pfam, Tigrfam
Outputs: ORF annotation, cluster annotation, more in-depth analysis and further annotation
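RAMMCAP relies on CD-HIT for its clustering steps. The toy sketch below shows the general idea of greedy incremental clustering at an identity threshold: sequences are processed longest first, and each one joins the first cluster whose representative it matches at or above the threshold, otherwise it founds a new cluster. This is not CD-HIT itself. In particular, `difflib.SequenceMatcher.ratio()` is only a stand-in similarity score, not an alignment-based sequence identity, and the reads are made up.

```python
from difflib import SequenceMatcher


def greedy_cluster(seqs, threshold):
    """Greedy incremental clustering; each cluster lists its representative first."""
    clusters = []
    for seq in sorted(seqs, key=len, reverse=True):  # longest sequences first
        for cluster in clusters:
            rep = cluster[0]
            score = SequenceMatcher(None, rep, seq, autojunk=False).ratio()
            if score >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])                   # seq becomes a new representative
    return clusters


# Made-up reads, for illustration only.
reads = ["ACGTACGTACGT", "ACGTACGTACGA", "TTTTGGGGCCCC", "ACGTACGTAC"]
for cluster in greedy_cluster(reads, threshold=0.9):
    print(cluster)
```

Comparing each sequence only against cluster representatives, rather than against every other sequence, is what keeps this style of clustering cheap on large read sets.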
Annotation workflow
A green box is called an ‘actor’, which performs a task.
This special actor represents an annotation component, such as a BLAST search.
Workflow parameters, which can be specified by users in the portal, are passed to workflow components.
Data flow is divided.
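The callouts above describe actors that each perform one task, user-specified parameters that are passed to workflow components, and a data flow that is divided. A minimal sketch of that structure in plain Python follows; the actor names and the single parameter are hypothetical, and this is not the API of the workflow system shown on the slide.

```python
def length_filter(seqs, params):
    """Actor: keep sequences at least params['min_length'] long."""
    return [s for s in seqs if len(s) >= params["min_length"]]


def gc_content(seqs, params):
    """Actor: GC fraction per non-empty sequence (params unused, uniform signature)."""
    return {s: (s.count("G") + s.count("C")) / len(s) for s in seqs if s}


def count_sequences(seqs, params):
    """Actor: number of sequences received."""
    return len(seqs)


def run_workflow(seqs, params):
    """Run one upstream actor, then divide the data flow into two branches."""
    filtered = length_filter(seqs, params)
    return {
        "gc": gc_content(filtered, params),          # branch 1
        "count": count_sequences(filtered, params),  # branch 2
    }


user_params = {"min_length": 4}  # as if entered by a user in the portal
result = run_workflow(["ACGT", "GGCC", "AT"], user_params)
print(result)
```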
Provenance of Workflow Related Data
• Provenance: A concept from art history and library science
- Inputs, outputs, intermediate results, workflow design, workflow run
• Collected information
- Can be used in a number of ways: validation, reproducibility, fault tolerance, etc.
- Linked to the semantic database
- Viewable and searchable from CAMERA 2.0
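One way to picture a provenance record is as a wrapper that runs the workflow steps and stores the design (ordered step names) together with digests of the input, every intermediate result, and the output. The sketch below assumes JSON-serializable data and hypothetical step names; it is not CAMERA's provenance format.

```python
import hashlib
import json


def digest(obj):
    """Stable SHA-256 digest of a JSON-serializable object."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode("utf-8")).hexdigest()


def run_with_provenance(run_id, steps, data):
    """Run (name, function) steps in order, recording a digest of each result."""
    record = {
        "run_id": run_id,
        "design": [name for name, _ in steps],  # workflow design: ordered step names
        "input": digest(data),
        "intermediate": [],
    }
    for name, func in steps:
        data = func(data)
        record["intermediate"].append({"step": name, "digest": digest(data)})
    record["output"] = digest(data)
    return data, record


steps = [
    ("uppercase", lambda seqs: [s.upper() for s in seqs]),
    ("deduplicate", lambda seqs: sorted(set(seqs))),
]

out1, prov1 = run_with_provenance("run-1", steps, ["acgt", "ACGT", "ggcc"])
out2, prov2 = run_with_provenance("run-2", steps, ["acgt", "ACGT", "ggcc"])
print(out1)
print(prov1["output"] == prov2["output"])  # equal digests indicate a reproduced result
```

Comparing digests across runs supports the validation and reproducibility uses listed above, and the per-step digests show where two runs first diverge.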
http://camera.calit2.net