globus.org/genomics
Finding Needles in a Haystack – Big Data Management and Analysis using Globus
Ravi Madduri
JSM 2015, Seattle, Washington
Who We Are
• Globus Genomics is developed, operated, and supported by researchers, developers, and bioinformaticians at the Computation Institute – University of Chicago/Argonne National Lab
• We are a non-profit organization building solutions for non-profit researchers
• Our goal is to support the advancement of science by bringing together our strengths and capabilities to help meet the unique needs of researchers and research institutions
Finding needles in haystacks
[Diagram: discovery cycle] Pose question → Design experiment → Collect data → Analyze data → Identify patterns → Hypothesize explanation → Test hypothesis → Publish results
Imagine if a researcher, when tackling a problem, could easily:
• Assemble, integrate, and interpret all relevant data within a knowledge network
• Be informed of anomalies, patterns, gaps
• Formulate & apply computational models
• Outsource tasks if local expertise lacking
• Launch automated processes to test hypotheses, expand knowledge network
• Pay for all this by taking on other tasks
We will cover
• Accelerating Scientific Discovery Process by providing Science as a Service
– Research Data Management
– Analyzing Research Data
  • Interactive Analysis
  • Large-scale Analysis
– Publishing Results so others can
  • Discover
  • Validate
  • Reproduce/Use
90% of cancer patients carry a mutation that may be responsive to a known drug
Mark Rubin, Weill Cornell Medical College and NewYork-Presbyterian Hospital in New York in Nature, April, 2015
Trying to find a single causative gene for diseases with a complex genetic background is like looking for the proverbial needle in a haystack
– Nancy Cox (Vanderbilt)
Higgs discovery “only possible because of the extraordinary achievements of … grid computing”
– Rolf Heuer, CERN DG
10s of PB, 100s of institutions, 1000s of scientists, 100Ks of CPUs, Bs of tasks
How do we accelerate discovery without requiring that every lab acquire a haystack-sorting machine?
Clayton & Shuttleworth thresher, 1910: Museum Victoria, Australia
Managing big data with Globus
Diagram endpoints: Light Source, Compute Facility, Personal Computer, Publication Repository
1. PI initiates transfer request; or requested automatically by script, science gateway
2. Globus transfers files reliably, securely
3. PI selects files to share, selects user or group, and sets access permissions
4. Globus controls access to shared files on existing storage; no need to move files to cloud storage!
5. Researcher logs in to Globus and accesses shared files; no local account required; download via Globus
6. Researcher assembles data set; describes it using metadata (Dublin core and domain-specific)
7. Curator reviews and approves; data set published on campus or other system
8. Peers, collaborators search and discover datasets; transfer and share using Globus
• SaaS: only a web browser required
• Access using your campus credentials
• Globus monitors and informs throughout
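The sharing steps (select files, select a user or group, set permissions, then check access when a researcher logs in) can be illustrated with a toy in-memory model. This is only a sketch of the idea and is not the Globus API: the class names, the "r"/"rw" permission strings, and the user, group, and path names are all invented for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AccessRule:
    path_prefix: str   # folder the PI chose to share (hypothetical path)
    principal: str     # a user name or a group name
    permissions: str   # "r" (read) or "rw" (read and write)


@dataclass
class SharedFolder:
    rules: list = field(default_factory=list)
    groups: dict = field(default_factory=dict)  # group name -> set of user names

    def grant(self, path_prefix, principal, permissions="r"):
        """Record that `principal` gets `permissions` under `path_prefix`."""
        self.rules.append(AccessRule(path_prefix, principal, permissions))

    def can(self, user, path, action):
        """Return True if `user` may perform `action` ("r" or "w") on `path`."""
        for rule in self.rules:
            in_scope = path.startswith(rule.path_prefix)
            members = self.groups.get(rule.principal, set())
            is_principal = user == rule.principal or user in members
            if in_scope and is_principal and action in rule.permissions:
                return True
        return False


if __name__ == "__main__":
    share = SharedFolder(groups={"lab-group": {"alice", "bob"}})
    share.grant("/project/run1/", "lab-group", "r")   # read-only for the group
    share.grant("/project/run1/", "carol", "rw")      # read and write for one user
    print(share.can("alice", "/project/run1/sample.fastq", "r"))
    print(share.can("alice", "/project/run1/sample.fastq", "w"))
```

The point of the model is that the files never move: access is decided by rules attached to the existing storage location, which matches the "no need to move files to cloud storage" remark on the slide.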
Globus Platform-as-a-Service
• Identity, Group, Profile Management Services
• …
• Sharing Service
• Transfer Service
• Globus Toolkit
• Globus APIs
• Globus Connect
Globus Adoption and Usage
• 166,449 active Globus endpoints
• 27,961 users registered
• Biggest transfer: 500.42 TB
• Longest running transfer: 182 days
• Fastest transfer: 58.5 Gbps (average)
• 55 TB moved per day, on average, since the service was launched in November 2010
• Average throughput: 637.7 Mbps (since service launch)
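To put the usage figures on a common scale, the daily volume can be converted into a sustained rate. The sketch below is illustrative arithmetic only: it assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second), and the helper names are invented for this example.

```python
# Illustrative unit conversions for the usage figures quoted on the slide.
# Assumption: decimal units (1 TB = 1e12 bytes, 1 Gbps = 1e9 bits/s).

SECONDS_PER_DAY = 24 * 60 * 60


def tb_per_day_to_gbps(tb_per_day: float) -> float:
    """Convert a daily volume in terabytes to a sustained rate in Gbps."""
    bits_per_day = tb_per_day * 1e12 * 8
    return bits_per_day / SECONDS_PER_DAY / 1e9


def transfer_hours(size_tb: float, rate_gbps: float) -> float:
    """Hours needed to move size_tb terabytes at a constant rate_gbps."""
    seconds = size_tb * 1e12 * 8 / (rate_gbps * 1e9)
    return seconds / 3600


if __name__ == "__main__":
    # 55 TB/day is the average quoted since service launch.
    print(f"55 TB/day is about {tb_per_day_to_gbps(55):.2f} Gbps sustained")
    # Hypothetical combination: the biggest and the fastest transfers on the
    # slide are not stated to be the same transfer.
    hours = transfer_hours(500.42, 58.5)
    print(f"500.42 TB at 58.5 Gbps would take about {hours:.1f} hours")
```

Under these assumptions, 55 TB per day works out to roughly 5.1 Gbps sustained across the whole service, which is a different quantity from the 637.7 Mbps average throughput quoted alongside it.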
Analyzing Big Data using Globus Galaxies
Data management
• Data sources and endpoints: Sequencing Centers, Public Data, Storage, Local Cluster/Cloud, Seq Center, Research Lab
• Globus provides for high-performance, fault-tolerant, secure file transfer between all data-endpoints
• Other transfer protocols shown: FTP, SCP, HTTP, others
Data analysis
• Galaxy-based workflow management in Globus Genomics
• Workflow shown: Fastq, Ref Genome → Alignment → Variant Calling (tools shown: Picard, GATK)
• Galaxy Data Libraries
• Globus Genomics on Amazon EC2
• Analytical tools are automatically run on the scalable compute resources when possible
• Globus integrated within Galaxy
• Web-based UI
• Drag-Drop workflow creations
• Easily modify workflows with new tools
Our Science Stack
• Galaxy (SaaS)
– Interactive execution, iPython, R
– Creation, Execution, Sharing, Discovering Workflows
• Globus (PaaS)
– Data management
– Identity Management
• AWS (IaaS)
– HTCondor, Chef, EC2, EBS, S3, SNS
– Spot, Route 53, Cloud Formation
Examples of what researchers have done
Cox lab, UChicago
• 134 samples and 4 workflows
• 4 TB data initially
• 2200 core hours in 6 days
Consensus Caller
Figure panels:
• Rediscovery of previously observed variants
• Transition/Transversion Ratio
• Genotype Mendel Error Rate
• Distributions of Mendel Error Counts per Trio
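Two of the quality metrics named above have simple textbook definitions, sketched here for reference. This is a minimal illustration, not the code behind the figures: the input formats ((ref, alt) base pairs, and genotypes as pairs of allele indices) and the function names are assumptions made for this example.

```python
# Sketch of two variant-call QC metrics, using their standard definitions.
# Input formats are assumptions for illustration only.

BASES = set("ACGT")
# Transitions swap purine with purine (A<->G) or pyrimidine with pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snvs):
    """Return transitions / transversions for an iterable of (ref, alt) bases."""
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # skip indels, missing alleles, and non-variants
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ti / tv


def is_mendel_consistent(child, mother, father):
    """True if the child's two alleles can be split one per parent."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


if __name__ == "__main__":
    toy_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
    print(ti_tv_ratio(toy_snvs))
    # Genotypes as allele-index pairs: 0 = reference allele, 1 = alternate allele.
    print(is_mendel_consistent((0, 1), (0, 0), (1, 1)))
    print(is_mendel_consistent((1, 1), (0, 0), (0, 1)))
```

A Mendel error rate for a set of trios would then be the fraction of genotype calls for which `is_mendel_consistent` returns False.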
Contaminated Samples
Olopade lab, UChicago
A profile of inherited predisposition to breast cancer among Nigerian women
Y. Zheng, T. Walsh, F. Yoshimatsu, M. Lee, S. Gulsuner, S. Casadei, A. Rodriguez, T. Ogundiran, C. Babalola, O. Ojengbede, D. Sighoko, R. Madduri, M.-C. King, O. Olopade
• 200 targeted exomes
• 200 GB data initially
• 76,920 core hours in 1.25 days
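The core-hour and wall-clock figures quoted for the Cox lab and Olopade lab runs imply an average level of parallelism. The sketch below shows that arithmetic; it assumes the quoted days are continuous wall-clock time, which the slides do not state, and the function name is invented for this example.

```python
def average_cores(core_hours: float, wall_days: float) -> float:
    """Average number of cores busy at once, assuming continuous wall-clock time."""
    return core_hours / (wall_days * 24)


if __name__ == "__main__":
    # Figures taken from the two example slides.
    print(f"Cox lab:     {average_cores(2200, 6):.1f} cores on average")
    print(f"Olopade lab: {average_cores(76920, 1.25):.0f} cores on average")
```

On that assumption the Cox lab run averages about 15 cores, while the Olopade lab run averages about 2,564 cores, which illustrates how far the same platform scales out on elastic cloud resources.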
Expanding Consensus Genotyper – SNVs, Indels, SVs
RAW FASTQs → variant callers (each producing a VCF) → Consensus Genotyper → VCF
Callers shown:
• GATK Pipeline/HC
• FreeBayes
• SAMtools mpileup
• GATK Pipeline/UG
• Atlas2
• Delly/Contra
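The slides do not describe how the Consensus Genotyper combines the per-caller VCFs. One simple way to think about consensus calling is a vote: keep a variant only if at least a minimum number of callers report it. The sketch below shows that voting idea only and should not be read as the tool's actual method; the data layout, the dictionary keys, and the toy variant coordinates are all invented for illustration.

```python
from collections import Counter


def consensus_calls(callsets, min_callers):
    """Keep variants reported by at least `min_callers` callers.

    callsets: dict mapping a caller label to a set of
              (chrom, pos, ref, alt) tuples.
    """
    counts = Counter()
    for variants in callsets.values():
        counts.update(set(variants))  # each caller votes once per variant
    return {variant for variant, votes in counts.items() if votes >= min_callers}


if __name__ == "__main__":
    # Toy call sets with made-up coordinates.
    callsets = {
        "gatk_hc": {("chr17", 100, "A", "G"), ("chr13", 200, "C", "T")},
        "freebayes": {("chr17", 100, "A", "G")},
        "samtools": {("chr17", 100, "A", "G"), ("chr13", 200, "C", "T"),
                     ("chr2", 300, "G", "A")},
    }
    for variant in sorted(consensus_calls(callsets, min_callers=2)):
        print(variant)
```

Raising `min_callers` trades sensitivity for confidence: requiring every caller to agree gives the smallest, most conservative call set.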
Preliminary Results are very encouraging
14 deleterious SNVs and 11 damaging Indels (BRCA1: 15, BRCA2: 4, PALB2: 2, BRIP1: 1, CHEK2: 1, NBN: 1, TP53: 1) were found in 29 subjects, and they were all confidently detected among 5 callers. Identified SNVs and Indels were all confirmed by Sanger sequencing.
Combining Data management and Analysis
1. Query and discover data
2. Transfer bags
3. Execute parallel alignment workflow on dynamically provisioned cloud resources
3. Publish bags
4. Discover published data and execute comparison workflow
Diagram labels: PPMI, ADNI, Adenocarcinoma, Adrenal, Brain; QC, Alignment, AlignmentQC, Feature count, Alignment Files, Differential expression; ERMrest; BDDS Collection
Links shown: http://bit.ly/1M0h6Yx, http://bit.ly/A10R89y
Gene Expression Results
Globus Genomics at a glance
• 30 institutions, groups, labs
• 10s million core hours
• 2 PBs raw sequences analyzed
• >1500 analysis tools
• 1000s genomes processed
• >50 workflows
• 99% uptime over the past two years
• 1 PB largest single transfer to do
• 5 days longest running workflow
• 100s different species
Other Globus Genomics users
• Dobyns Lab
• Cox Lab
• Volchenboum Lab
• Olopade Lab
• Nagarajan Lab
Costs are remarkably low
Pricing includes
• Estimated compute
• Storage (one month)
• Globus Genomics platform usage
• Support
Globus Genomics – Making it routine to find needles in NGS haystacks
www.globus.org/genomics
Other Examples of Science as a Service
• PDACS - Portal for data analysis services for cosmological simulations
• CVRG Galaxy – Large-scale ECG Data Analysis
• Globus Proteomics
• eMatter – Material Science Simulations
• FACE-IT - Framework to Advance Climate, Economic, and Impact Investigations with Information Technology (usefaceit.org)
• More information on Globus Genomics: www.globus.org/genomics
• More information on Globus: www.globus.org
Our work is supported by: U.S. Department of Energy
Thank you!
@madduri