Optimizing Genomic Data Storage for Wide Accessibility

Joint Genome Institute (JGI), NERSC Center, Computational Research Division (LBNL)

SC’03, November 18, 2003

Collaborators

Nancy Meyer, NERSC - HPSS
Harvard Holmes, NERSC - HPSS
Jonathan Carter, NERSC - User Services
Horst Simon, NERSC Center Director

Susan Lucas, JGI-PGF - Head, Production Sequencing
Arthur Kobayashi, JGI-PGF - Production Informatics
Eddy Rubin, JGI Director

Arie Shoshani, LBNL Computational Research Division

Millions of Microbes Everywhere

• General Goals
• Genomic Data
   • Life after the Human Genome Project
• NERSC Storage Systems
• Data Management
• Future Directions

General Goals

1. Distribute, archive, and enhance access to the data generated at DOE’s Joint Genome Institute (JGI) Production Genomic Facility (PGF).

2. Serve as a resource for community access to these data.

3. Establish a long-term collaboration between the JGI and the NERSC Center.

• High Performance Storage System (HPSS)

Environmental Genomics: Carbon Cycle

Environmental Genomics

• < 1% of microbes are culturable

• Many unculturables live in interdependent consortia of considerable diversity

• Aim: to recover genome-scale sequences and reveal metabolic capabilities

• How can we understand the action of microbes at the molecular level?

• What is the structure of natural microbial populations? What is a microbial species?

Future environmental targets for JGI

Newman and Banfield, Science 2002

Whole metagenome shotgun sequencing and targeted fosmid-based methods can be used to recover useful draft genomes

JGI Microbial Program

JGI microbial sequencing targets a broad range of bacteria and archaea with relevance to:

• Bioremediation
• Carbon Sequestration
• Global Climate Change
• Biodiversity
• Biomass Conversion
• Energy Production
• Disease

[Figure: the domains Eucarya, Bacteria, and Archaea; labels "Plants, Animals, Fungi" and "Single origin of mitochondria?"]

JGI Microbial Program

Lactic acid bacteria
• Lactobacillus gasseri (Klaenhammer)
• Oenococcus oeni (Mills)

Complex polysaccharide degradation
• Clostridium thermocellum (Wu)
• Microbulbifer degradans (Weiner)
(complements white rot fungus sequence)

Phototrophic bacteria
• Rhodospirillum rubrum (Roberts) (complements Rhodopseudomonas palustris and Rhodobacter sphaeroides)

Toxic waste degradation and microbial ecology
• Desulfuromonas acetoxidans (Lovley)
• Desulfovibrio desulfuricans

Microbes in extreme environments
• Psychrobacter (Thomashow)
• Methanococcoides burtonii (Sowers, Cavicchioli)

Infectious diseases of plants and animals
• Ehrlichia chaffeensis (Yu)
• Pseudomonas syringae (Lindow)

Anaerobic methane oxidizing consortium “ball of bugs” (DeLong, Monterey Bay)
• one (or two?!) reverse methanogenic archaea in core plus sulfur reducing bacterium on surface

JGI - Then & Now

Then:
• Single project - Human Genome (chromosomes 5, 16, and 19)
• All data sent to NCBI/GenBank for storage and distribution
• Minimum local responsibility for data stewardship
• Relatively low production sequencing rate

Now:
• Dozens of whole genome projects (2 million to more than a billion bases each)
• Multiple species (microbial to vertebrates)
• Complex environmental genomic communities
• Full responsibility for data storage and distribution
• Limited storage capacity
• Production sequencing rate is increasing

JGI Monthly Production

[Figure: two charts of JGI monthly production in millions of bases (y-axis 0 to 2000): a 5-year history (May-99 to May-03) and the most recent 12 months (Jun-02 to May-03)]

  1 CAGGTCAACG GATCATCTGT TTCTGACCAT TCCTTCCCGT TCCTGACCCC AGGGAGTGCA
 61 GGGTGTCCTA GCCAAGCCGG CGTCCCTCCT AGTAGTACCG CTGCTCTCTA ACCTCAGGAC
121 GTCAAGGGCC TAGAGCGACA GATGTTTCCC AGCAGGGGGT TCTGAGGCTG TGCGCCCAGA
181 TCGCGAGAGA GGCAAGTGGG GTGACGAGGT CGTGCACTGA GGGTGGACGT AGAGGCCAGG
241 AGTAGCAGGC GGCCGGGGAA AAGAGGTGGA GAAAGGAAAA AAGAGGAGAA AAGTGGAGGA
301 GGGCGAGTAG GGGGGTGGGG CAGAGAGGGG CGGGCCCGAG TGCGCCCCCC GCCCCCAGCC
361 CCGCTCTGCC AGCTCCCTCC CAGCCCAGCC GGCTACATCT GGCGGCTGCC CTCCCTTGTT
421 TCCGCTGCAT CCAGACTTCC TCAGGCGGTG GCTGGAGGCT GCGCATCTGG GGCTTTAAAC
481 ATACAAAGGG ATTGCCAGGA CCTGCGGCGG CGGCGGCGGC GGCGGGGGCT GGGGCGCGGG
541 GGCCGGACCA TGAGCCGCTG AGCCGGGCAA ACCCCAGGCC ACCGAGCCAG CGGACCCTCG
601 GAGCGCAGCC CTGCGCCGCG GACCAGGCTC CAACCAGGCG GCGAGGCGGC CACACGCACC
661 GAGCCAGCGA CCCCCGGGCG ACGCGCGGGG CCAGGGAGCG CTACGATGGA GGCGCTAATG
721 GCCCGGGGCG CGCTCACGGG TCCCCTGAGG GCGCTCTGTC TCCTGGGCTG CCTGCTGAGC
781 CACGCCGCCG CCGCGCCGTC GCCCATCATC AAGTTCCCCG GCGATGTCGC CCCCAAAACG
841 GACAAAGAGT TGGCAGTGGT GAGTTGCT

This is Not Raw Data

Neither is This

These are the Raw Data

Genome Sequencing

Start with genomic DNA

Make sheared fragments

Sequence both ends of fragments

Reconstruct genome computationally

Provide genome and tools to community

High-throughput computational analysis
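
The shearing and end-sequencing steps on this slide can be illustrated with a toy simulation. This is a minimal sketch and not JGI's pipeline: the function names, the fixed fragment length, and the read length are all assumptions chosen for illustration. Real fragments have random lengths, and real reads contain errors.

```python
# Toy illustration of paired-end shotgun reads.
# All names and sizes here are hypothetical, chosen only for illustration.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def shear(genome: str, fragment_len: int, step: int) -> list[tuple[int, str]]:
    """Cut overlapping fixed-length fragments; returns (start, fragment) pairs."""
    return [
        (start, genome[start:start + fragment_len])
        for start in range(0, len(genome) - fragment_len + 1, step)
    ]


def paired_end_reads(fragment: str, read_len: int) -> tuple[str, str]:
    """Read both ends of a fragment: the forward read from the left end,
    and the reverse complement of the right end (the opposite strand)."""
    forward = fragment[:read_len]
    reverse = reverse_complement(fragment[-read_len:])
    return forward, reverse


if __name__ == "__main__":
    genome = "CAGGTCAACGGATCATCTGTTTCTGACCAT"
    for start, fragment in shear(genome, fragment_len=12, step=6):
        fwd, rev = paired_end_reads(fragment, read_len=4)
        print(start, fwd, rev)
```

Because the two reads of a pair come from a fragment of roughly known length, an assembler can use their expected separation as a constraint when reconstructing the genome. That reconstruction step is not modeled here.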

Paired Plasmid Sequencing

JGI Data Production

• Millions of files per month of raw trace data

• 100 assembled projects per month (50MB-250MB) and several large assembled projects per year

• More data are being generated than ever before

• Currently trace data are maintained online only while projects are in process.

• Whole completed projects are available to download. They are large and contain millions of files.

JGI Raw Data Organization

Project = Series of Libraries that define a genome
Library = Series of Plates
Plate = 384 Clones
Clone = 2 Lanes
1 Lane = ~1MB, distributed into 4 files:
   1 FASTA file = 1KB
   1 scf file = 50KB
   1 abd file = 250KB
   1 rsd/ab1 file = 650KB

In May-03, PGF ran 2.5 million successful lanes = 2.5TB/month; 10 million files
(0.75TB/month (9 TB/year) non-trace files)

This does not include any assembly, database or metadata!
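
The volumes quoted on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses decimal units and makes one assumption that the slide does not state: that "non-trace files" means every file except the rsd/ab1 file. That reading is used here only because it reproduces the 0.75TB/month figure.

```python
# Worked arithmetic for the May-03 figures (decimal units: 1 TB = 1e9 KB).

FILE_SIZES_KB = {"fasta": 1, "scf": 50, "abd": 250, "rsd/ab1": 650}
LANES_PER_MONTH = 2_500_000
KB_PER_TB = 1_000_000_000

per_lane_kb = sum(FILE_SIZES_KB.values())               # 951 KB, i.e. ~1 MB per lane
files_per_month = LANES_PER_MONTH * len(FILE_SIZES_KB)  # 4 files per lane
total_tb = LANES_PER_MONTH * per_lane_kb / KB_PER_TB    # ~2.4 TB; ~2.5 TB at 1 MB/lane

# Assumption: "non-trace" = everything except the rsd/ab1 file.
non_trace_kb = per_lane_kb - FILE_SIZES_KB["rsd/ab1"]   # 301 KB per lane
non_trace_tb_month = LANES_PER_MONTH * non_trace_kb / KB_PER_TB
non_trace_tb_year = 12 * non_trace_tb_month

# Plate = 384 clones, clone = 2 lanes.
LANES_PER_PLATE = 384 * 2
plates_per_month = LANES_PER_MONTH / LANES_PER_PLATE

print(f"{files_per_month:,} files/month, {total_tb:.2f} TB/month")
print(f"non-trace: {non_trace_tb_month:.2f} TB/month, {non_trace_tb_year:.1f} TB/year")
print(f"about {plates_per_month:,.0f} plates/month")
```

The file count is the harder constraint for storage design: 10 million small files per month stresses file-system metadata and archive catalogs far more than the same number of bytes held in a few large files.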

Current Access to JGI Data

•Access to these data is in demand by scientific fields that were not anticipated by the Human Genome Project

   • Microbiologists
   • Environmental Scientists
   • Evolutionary Scientists
   • GtL projects

•The computational sophistication of the user community is uneven, at best. Not everyone will want the same kind of files.

•GenBank is not capable of serving all of the JGI’s needs.

Current Access to JGI Data (cont.)

• The data are processed by researchers using iterative and pattern matching techniques, often requiring access to data that span several projects and genomes. This is different from the Human Genome Project.

• Currently, this requires downloading whole projects and then unpacking the project files to access the data. That means millions of files to unpack and slow transfers of whole project files.

• At best, the raw data used to generate the sequences in a project are very difficult to retrieve and interrogate.

NERSC Storage Systems

•DOE’s largest unclassified storage systems, with a current archival capacity of 8 PB

•Robust and available 24x7 with high reliability and excellent network connectivity

•Very configurable and currently provides good service for both large streaming data and concurrent direct access.

•Experienced and innovative staff are adding new capabilities and distributing storage as the NERSC Center data requirements change over time.

Distribute and Enhance Access

1. Initially, we plan to hold all the sequence data online or near-line.

We will prototype and select the best way to do this:
• distributed file systems
• local file systems
• cached web servers
• tools

2. Collaborate with JGI to organize and cluster the sequence data so they can be retrieved in meaningful pieces.

Distribute and Enhance Access (cont.)

3. Distribute the data between JGI and NERSC/HPSS:
• Develop tools and methodologies to move the data between JGI and NERSC/HPSS for timely access to sequence data as they are being generated.
• Incorporate this into regular site backups

4. Build a web interface to the data providing a consistent view of the data (allowing the data to be distributed underneath) with a link to the data at JGI for ease of access.

Data Organization Requirements

1. Metadata for the files being collected
-- schema definition development
-- the database system to support the metadata
-- query interfaces to query the metadata
-- possible rapid prototyping using the OPM tools

2. Data entry tools for the metadata
-- procedure to enforce metadata entry
-- checks on the correctness of the metadata entered

None of this was contemplated in the Human Genome Project
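
One way to picture the metadata requirements (a schema, a database behind it, a query interface, and correctness checks at entry time) is a small relational catalog. The sketch below uses Python's built-in `sqlite3` module. Every table and column name is hypothetical, as is the `fwd`/`rev` label for a clone's two lanes. It does not represent the OPM tools or any schema JGI or NERSC actually adopted.

```python
import sqlite3

# Hypothetical catalog schema following the Project > Library > Plate > Lane
# hierarchy. All table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE project (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE library (id INTEGER PRIMARY KEY,
                      project_id INTEGER NOT NULL REFERENCES project(id),
                      name TEXT NOT NULL);
CREATE TABLE plate   (id INTEGER PRIMARY KEY,
                      library_id INTEGER NOT NULL REFERENCES library(id),
                      barcode TEXT NOT NULL UNIQUE);
CREATE TABLE lane    (id INTEGER PRIMARY KEY,
                      plate_id INTEGER NOT NULL REFERENCES plate(id),
                      well INTEGER NOT NULL CHECK (well BETWEEN 1 AND 384),
                      direction TEXT NOT NULL CHECK (direction IN ('fwd', 'rev')),
                      UNIQUE (plate_id, well, direction));
"""


def open_catalog(path: str = ":memory:") -> sqlite3.Connection:
    """Create the catalog; constraints act as the 'correctness checks'."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
    return conn


def lanes_for_project(conn: sqlite3.Connection, project_name: str) -> list:
    """Query interface example: list every lane belonging to one project."""
    rows = conn.execute(
        """SELECT plate.barcode, lane.well, lane.direction
           FROM lane
           JOIN plate ON lane.plate_id = plate.id
           JOIN library ON plate.library_id = library.id
           JOIN project ON library.project_id = project.id
           WHERE project.name = ?
           ORDER BY plate.barcode, lane.well, lane.direction""",
        (project_name,),
    )
    return rows.fetchall()
```

Putting the checks in the schema (NOT NULL, CHECK, foreign keys) means an incomplete or inconsistent entry is rejected by the database itself, with `sqlite3.IntegrityError`, rather than relying on each data entry tool to validate it.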

Data Organization Requirements (cont.)

3. Robust massive file movement
-- from daily generated files into NERSC's HPSS
-- ensure correctness in spite of system, network, and HPSS transient failures
-- automated reporting of errors / failures
-- possible use of HRM technology

4. Managing annotations of genomic data
-- need to support history of annotation, perhaps by version hierarchy
-- need for a controlled vocabulary (an ontology) for searching the annotations
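
Requirement 3 (correctness despite transient failures, plus automated error reporting) follows a common pattern: checksum the source, transfer, verify the copy, and retry on failure. The sketch below is a generic illustration of that pattern, under the assumption that a local `shutil.copyfile` can stand in for the real archive transfer. It does not use HPSS or HRM interfaces, and the function names are hypothetical.

```python
import hashlib
import logging
import shutil
import time
from pathlib import Path

log = logging.getLogger("transfer")


def sha256_of(path: Path) -> str:
    """Checksum a file in 1 MiB chunks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(src: Path, dst: Path, attempts: int = 3, delay_s: float = 1.0) -> bool:
    """Copy src to dst, verify by checksum, and retry on transient errors.

    shutil.copyfile is a local stand-in (an assumption) for the real
    transfer into the archive. Returns True once a verified copy exists.
    """
    expected = sha256_of(src)
    for attempt in range(1, attempts + 1):
        try:
            shutil.copyfile(src, dst)
            if sha256_of(dst) == expected:
                return True
            log.error("checksum mismatch for %s (attempt %d)", src, attempt)
        except OSError as exc:
            log.error("transfer of %s failed (attempt %d): %s", src, attempt, exc)
        if attempt < attempts:
            time.sleep(delay_s)  # brief pause before retrying
    return False
```

With millions of files per month, the log records become the automated failure report: a daily job can collect every file for which `copy_verified` returned False and resubmit only those.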

Future Goals

1. Hold more partial and raw data online

2. Enhance searching these data using annotated databases.

3. Enhance current iterative processing of the data by moving some of this processing close to the data.
• For example, some programs could run on the web server with access to a local file system of data for matches and selections of data.
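
As an illustration of goal 3, a selection routine running next to the data could scan FASTA records for a pattern and return only the matching record names, so that whole projects never need to be downloaded. This is a minimal sketch under assumed names (`read_fasta`, `select_matching`). The slides do not describe the actual programs.

```python
import re
from typing import Iterable, Iterator


def read_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def select_matching(lines: Iterable[str], pattern: str) -> list[tuple[str, int]]:
    """Return (header, position of first match) for each record whose
    sequence matches the regular expression."""
    regex = re.compile(pattern)
    hits = []
    for header, sequence in read_fasta(lines):
        match = regex.search(sequence)
        if match:
            hits.append((header, match.start()))
    return hits
```

Only the short hit list crosses the network. A client can then request just the records it needs, which matches the earlier aim of retrieving the data in meaningful pieces.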

NERSC to become the repository of DOE genomic data focusing on microbial and environmental genomics