C O M P U T E | S T O R E | A N A L Y Z E
High Performance Computing Technology and Methodology Applied to Next-Generation Sequencing Workflows
Ted Slater
Bio-IT World Conference & Expo 2015
20 April 2015
About Cray
Copyright 2015 Cray Inc.
Seymour Cray founded Cray Research in 1972
• 1972-1996, Cray Research grew to leadership in Supercomputing
• 1996-2000, Cray was a subsidiary of SGI
• 2000-present, Cray Inc. growing to $525M in revenue in 2013
• Cray Inc. formed in April 2000
Cray Inc.
• NASDAQ: CRAY
• Over 1,000 employees across 30 countries
• Headquartered in Seattle, WA
Three Focus Areas
• Computation
• Storage
• Analytics
Seven Major Development Sites:
• Austin, TX
• Chippewa Falls, WI
• Pleasanton, CA
• St. Paul, MN
• San Jose, CA
• Seattle, WA
• Bristol, UK
Our Vision
Modeling The World
Fusing Supercomputing and Big & Fast Data
[Diagram labels: Compute, Store, Analyze; Data Models; Math Models; Data-Intensive Processing]
The Life Sciences/Healthcare Communities
Market and Technology Drivers
• Precision Medicine: The race to understand individual patients, diseases and treatments at the molecular level
• Pace of Technology: Organizations struggling to keep compute infrastructures up to date with rapidly changing life sciences technologies
• Cluster Sprawl: Ad-hoc cluster infrastructures exacerbating complexity, reliability and usability challenges
• Rise of High Performance Analytics: Convergence of analytics and supercomputing opening new opportunities to meet the pace of discovery
• Data Science: New data sources and emerging analytical approaches to enable predictive modeling and knowledge discovery
The Quest for In-Time Analytics
[Figure: Response time frames]
• Time axis: <30ms, 30ms, 10min, >10min
• Range: Low-Latency to Batch
• Data: Streaming data, Stationary data
• Users: Business analysts accustomed to interactive time frames; few data scientists who wrangle data
Low-latency applications require performance optimizations
• Memory-storage hierarchies
• Fast interconnects
Multi-step Analytics Pipelines
Analytics Pipeline
• Data Prep/ETL
• Stream Processing
• Data Mining
• Interactive Queries
• Actionable Insight
Performance, Productivity
Convergence of Analytics and Supercomputing
High Performance Computing
• Finance: portfolio optimization, pricing, risk
• Energy: seismic modeling
• Life sciences: genomics, drug discovery
• Scientific: simulation, weather forecasting
Traditional Big Data
• Batch analytics
• Undifferentiated systems
“Simulation is the original Big Data Market” – IDC
High Performance Big Data Analytics
• Low-latency analytics
• Next-generation architecture
Analytics Solutions
Powered By
Extreme Analytics Platform
• Turnkey Advanced Analytics Platform
• Next-Generation System Architecture
• Engineered for Performance
Graph Discovery Appliance
• Discover Unknown & Hidden Relationships in Big Data
• Real-time Data Discovery
• Realize Rapid Time-to-Value
Cray’s Next-Generation Sequencing Solution: Accelerated Time to Discovery
[Diagram: NGS workflow]
• Next-Generation Sequencers
• Genome Assembly: Base Calling, Assembly, Variant Analysis, QC, Annotation
• High-Throughput NGS Storage and Archive Environment
• Bioinformatics Analytics: Personalized Medicine, Pathway Modeling, Hypothesis Generation, Alternative Indications, Biomarker Prediction, Patient Selection
Manage all aspects of NGS pipeline in one environment
• Address data transfer and compute bottlenecks
• Speed up whole-genome resequencing analysis
• Fast short-read alignment
• Calculate differential gene expression from large RNA-Seq datasets
• “Single pane of glass” management interface
Enterprise Benefits
• Open architecture
• Reduced footprint
• Eliminates cluster sprawl
• Out-of-the-box performance with flexibility to meet evolving needs
• Pay-as-you-grow storage/archival performance and capacity
• Minimal management burden and lower TCO
Three NGS Challenges: Sequence Assembly, Bioinformatics and Data Management
Sequence Assembly is a Bottleneck
Sequencing costs are down and sequence volumes are up. The huge volume makes assembly the challenge. The rate at which genotypic variation can be characterized is now limited by computational tools, not by sequencing technology.
NGS Bioinformatics is Complex
Complexity is High – Interpreting the meaning of NGS sequence data involves annotation, integration, visualization and collaboration, requiring diverse expertise.
Performance – Post-sequencing analytics is computationally demanding, in both performance and scale.
NGS Data Management is Overwhelming
Production – Huge data volumes and excessive data movement are pushing the limits of many storage and networking infrastructures.
Archive – NGS workflows generate huge volumes of data, which are both tedious and costly to retain.
Next-Generation Sequencing: Urika-XA platform for all aspects of NGS bioinformatics
[Diagram: Next-generation sequencers feed the Urika-XA platform, which simplifies the NGS workflow]
Manage all aspects of NGS pipeline in one environment
• Address data transfer and compute bottlenecks
• Speed up whole-genome resequencing analysis
• Fast short-read alignment
• Calculate differential gene expression from large RNA-Seq datasets
• “Single pane of glass” management interface
Eliminate cluster sprawl
Reduce data movement
…add a Scalable Archive Strategy to NGS
● Cray Tiered Adaptive Storage (TAS) for active data use and archiving
● Policy-based data movement
● Performs at scale
● NGS generates enormous amounts of data
● Once data is processed, much of it is no longer needed but must be saved
● A proper archive strategy will eliminate bottlenecks, improve performance and reduce costs
[Diagram: Next-generation sequencers feed the Urika-XA platform, which simplifies workflow]
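As a rough illustration of what policy-based data movement between an active tier and an archive tier can look like, here is a minimal Python sketch. It is not the Cray TAS policy engine or its configuration syntax. The `FileRecord` fields, the `choose_tier` and `plan_moves` names, and the 30-day threshold are all hypothetical, chosen only to show the idea that processed data which is no longer in use can be moved by rule instead of by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical metadata a tiering policy might inspect."""
    path: str
    size_bytes: int
    days_since_access: int
    processed: bool  # True once the pipeline has finished with the file


def choose_tier(rec, archive_after_days=30):
    """Return 'archive' for processed files idle past the threshold, else 'active'."""
    if rec.processed and rec.days_since_access >= archive_after_days:
        return "archive"
    return "active"


def plan_moves(records, archive_after_days=30):
    """List the paths the policy would move to the archive tier."""
    return [r.path for r in records
            if choose_tier(r, archive_after_days) == "archive"]


records = [
    FileRecord("run1/sample.bam", 90_000_000_000, 45, True),
    FileRecord("run2/sample.fastq", 120_000_000_000, 45, False),
    FileRecord("run3/sample.vcf", 500_000_000, 3, True),
]
print(plan_moves(records))
```

Only the first file qualifies here: the second has not been processed yet and the third was accessed recently. A real policy would likely also weigh file size, project and retention rules.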
Halvade – Intel® Wrapper for Hadoop®
● Key observations
  ● Read mapping is parallel by read; variant calling is parallel by chromosomal region
  ● Map phase: read mapping
  ● Reduce phase: variant calling
● Leveraging Hadoop improved throughput ~40X
  ● BWA and GATK – single node = 5 days
  ● Hadoop single node = 2.5 days
  ● Hadoop 50 nodes < 3 hours
● Urika-XA ~2 hours
  ● An additional 20% in performance
● Follow-on analytics can be done on the same platform
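The ~40X figure can be checked from the runtimes on this slide. The short Python sketch below converts days to hours and divides by the single-node BWA and GATK baseline. The 50-node time is given only as an upper bound (< 3 hours) and the Urika-XA time as approximate (~2 hours), so the resulting ratios are a lower bound and an estimate respectively. The dictionary labels are ours.

```python
HOURS_PER_DAY = 24

# Runtimes as quoted on the slide, converted to hours.
runtimes_hours = {
    "BWA and GATK, single node": 5 * HOURS_PER_DAY,
    "Hadoop, single node": 2.5 * HOURS_PER_DAY,
    "Hadoop, 50 nodes (upper bound)": 3,
    "Urika-XA (approximate)": 2,
}


def speedup(baseline_hours, hours):
    """Ratio of baseline runtime to the given runtime."""
    return baseline_hours / hours


baseline = runtimes_hours["BWA and GATK, single node"]
speedups = {name: speedup(baseline, hours)
            for name, hours in runtimes_hours.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.0f}x")
```

Five days is 120 hours, so finishing in under 3 hours corresponds to a speedup of at least 40X, which matches the slide.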
Reference: Decap et al., Bioinformatics 2015 Mar 26. pii: btv179.
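The map/reduce split described on this slide can be sketched in a few lines of plain Python. This is a toy illustration of the pattern, not Halvade's code and not the Hadoop API. The `map_read` stand-in assumes each read already carries its chromosome and position (a real pipeline would run an aligner such as BWA there), `reduce_region` only counts reads where a real pipeline would call variants, and `REGION_SIZE` is an arbitrary value picked for the example.

```python
from collections import defaultdict

REGION_SIZE = 1000  # hypothetical region width in bases, for the toy example only


def map_read(read):
    """Map phase: handle one read independently and emit (region key, value).

    Stand-in for read mapping: the read tuple already holds its position.
    """
    chrom, pos, seq = read
    return (chrom, pos // REGION_SIZE), (pos, seq)


def shuffle(pairs):
    """Group mapped reads by chromosomal region, as a framework shuffle would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_region(key, aligned_reads):
    """Reduce phase: handle one region independently.

    Stand-in for variant calling: sort reads by position and count them.
    """
    ordered = sorted(aligned_reads)
    return key, len(ordered)


def run(reads):
    pairs = [map_read(r) for r in reads]          # parallel by read
    groups = shuffle(pairs)
    return dict(reduce_region(k, v) for k, v in groups.items())  # parallel by region


reads = [
    ("chr1", 10, "ACGT"),
    ("chr1", 1500, "GGCA"),
    ("chr1", 20, "TTAA"),
    ("chr2", 5, "CCGG"),
]
result = run(reads)
print(result)
```

Each call to `map_read` depends on one read only, and each call to `reduce_region` depends on one region only, which is what lets both phases be spread across nodes.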
Genomic Analysis with Hadoop
[Diagram: Genetic Data, Clinical Trial Records, Patient Records and Social Media Data, in life sciences-specific data formats, flow into analysis on the Urika-XA platform, which produces life sciences-specific results]
http://www.biodatomics.com
Lumenogix NGS
50x Whole Human Genome
http://www.lumenogix.com
Lumenogix and Cray Performance Details
● AWS Cluster using Halvade
● Urika-XA platform using Lumenogix
[Chart: Time to process 50x Whole Human Genome, in minutes (axis 0 to 180), comparing AWS and Urika-XA]
Process               Time
BWA                   17 minutes
Tag & Shuffle Reads   2 minutes
Sort and Compress     1 minute
Mark Duplicates       1 minute
Realignment           6 minutes
Genotyping            18 minutes
Total                 45 minutes
Genome split into 4MB sections
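The stage times in the table above add up to the stated total, and a quick calculation shows where the time goes. The Python sketch below uses only the numbers from the table; the variable names are ours.

```python
# Stage times in minutes, copied from the table.
stage_minutes = {
    "BWA": 17,
    "Tag & Shuffle Reads": 2,
    "Sort and Compress": 1,
    "Mark Duplicates": 1,
    "Realignment": 6,
    "Genotyping": 18,
}

total_minutes = sum(stage_minutes.values())
shares = {stage: minutes / total_minutes
          for stage, minutes in stage_minutes.items()}

# Print stages from largest to smallest share of the total.
for stage, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stage:<20} {stage_minutes[stage]:>2} min  {share:5.1%}")
print(f"{'Total':<20} {total_minutes:>2} min")
```

Alignment (BWA) and genotyping together take 35 of the 45 minutes, so those two stages dominate the run.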
GenomeNext
Churchill: Kelly et al., Genome Biology 2015, 16:6 doi:10.1186/s13059-014-0577-x
• Churchill uses a novel regional parallelization strategy that is deterministic, balanced and highly scalable
• Enables computationally efficient whole genome sequencing data analysis in less than 2 hours
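To make the idea of deterministic regional parallelization concrete, here is a minimal Python sketch that cuts each chromosome into fixed-size, non-overlapping regions. It is not Churchill's algorithm, which among other things has to deal with reads that span region boundaries. The `split_into_regions` function, the toy chromosome names and the lengths are all hypothetical.

```python
def split_into_regions(chrom_lengths, region_size):
    """Yield (chrom, start, end) half-open intervals covering each chromosome.

    The same inputs always give the same regions in the same order,
    so work can be assigned to nodes reproducibly.
    """
    if region_size <= 0:
        raise ValueError("region_size must be positive")
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, region_size):
            yield chrom, start, min(start + region_size, length)


# Toy lengths for illustration; real chromosomes are millions of bases long.
toy_lengths = {"chrA": 10, "chrB": 4}
regions = list(split_into_regions(toy_lengths, region_size=4))
print(regions)
```

Because every region except possibly the last one on each chromosome has the same size, the work per region is roughly even, and each region can be processed independently of the others.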
Stay tuned for Urika-XA system results!
http://www.genomenext.com
Urika-XA Extreme Analytics Platform for NGS
Pre-integrated, open platform for high performance Hadoop and Spark™ analytics
Save months standing up a Hadoop cluster
• Run a 48-node Hadoop cluster out of the box
• Cloudera Hadoop and Apache Spark factory installed
Replace 3 standard racks with a single Urika-XA system rack
• High-density compute powered by Intel® Xeon® processors
• Consolidate wide range of analytics onto single platform
Future-proof your big data environment
• Next-gen architecture leveraging SSDs and InfiniBand
• Designed for low-latency, in-memory processing
Thank You
Dave Anstey: [email protected]
http://www.cray.com
Legal Disclaimer
Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document.
Cray Inc. may make changes to specifications and product descriptions at any time, without notice.
All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Cray uses codenames internally to identify products that are in development and not yet publicly announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user.
Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION and URIKA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks are trademarks of Cray Inc.: CS, XC, XE, XK and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis.
Other names and brands may be claimed as the property of others. Other product and service names mentioned herein are the trademarks of their respective owners.