Revolutionizing the Datacenter
Join the Conversation #OpenPOWERSummit
Accelerating Genome Assembly with POWER8
Seung-Jong Park, Ph.D.
School of EECS, CCT, Louisiana State University
Agenda
• The Genome Assembly Problem
• Accelerating Graph Construction with POWER8
• Accelerating Graph Simplification with IBM CAPI® Flash and the Redis NoSQL database
5/8/2016
The Genome Assembly Problem
Challenges for Genome Assemblers
• NGS technologies have outpaced Moore's Law
• Software with extreme scalability
• HPC platform
  • More compute cycles
  • Extreme I/O performance
  • Huge storage space
[Figure: Genome → NGS reads (TBs) → HPC → re-constructed genome (MBs/GBs); the pipeline is both data- and compute-intensive]
MapReduce-based Graph Construction
[Figure: Map tasks slide a window over each read (e.g. TAGTCGAGGCT, GGCTTTAGATC, CTGAGGCTTTAG, TTTAGAGACAG, GGATCCGATGAG) and emit (k-mer : next-base) pairs, with N marking a read end; Reduce tasks group the pairs by k-mer and merge the extensions into a sorted table:
TAGA:G,T  TAGT:C  TCCG:A  TCGA:G  TGAG:G  TTAG:A  TTTA:G  ACAG:N  AGAC:A
AGAG:A  AGAT:C  AGGC:T  AGTC:G  ATCC:G  ATGA:G  CCGA:T  CGAG:G  CGAT:G
CTTT:A  GACA:G  GAGA:C  GAGG:C  GATC:C  GATG:A  GCTT:T  GGCT:T  GTCG:A]
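The Map and Reduce steps above can be sketched in a few lines of Python. This is a toy, single-process illustration of the logic, not the Hadoop implementation; `mapper`, `reducer`, and k = 4 follow the slide's example.

```python
from collections import defaultdict

K = 4  # k-mer length used in the slide's example

def mapper(read):
    """Emit (k-mer, next-base) pairs for one read; the last k-mer of a
    read has no following base, so it is paired with 'N'."""
    pairs = []
    for i in range(len(read) - K + 1):
        kmer = read[i:i + K]
        ext = read[i + K] if i + K < len(read) else "N"
        pairs.append((kmer, ext))
    return pairs

def reducer(pairs):
    """Group the pairs by k-mer and merge distinct extensions,
    producing the sorted k-mer table that feeds Stage 2."""
    groups = defaultdict(set)
    for kmer, ext in pairs:
        groups[kmer].add(ext)
    return {k: ",".join(sorted(v)) for k, v in sorted(groups.items())}
```

For example, `mapper("TAGTCG")` yields `[("TAGT", "C"), ("AGTC", "G"), ("GTCG", "N")]`; feeding every read's pairs into `reducer` gives the merged table shown on the slide.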
Accelerating Graph Construction with POWER8
Experimental Test Beds
System Type                            IBM PKY Cluster                     LSU SuperMikeII
Processor                              Two 10-core IBM POWER8              Two 8-core Intel Sandy Bridge Xeon
Max #nodes used in experiments         40                                  120
#Physical cores/node                   20 (8-way SMT)                      16 (Hyper-Threading disabled)
#vcores/node                           160                                 16
RAM/node (GB)                          256                                 32
#Disks/node                            5                                   3
#Disks/node used for shuffled data     3                                   1
Storage/node used for shuffled data (TB) 1.8                               0.5
Network                                56 Gbps InfiniBand (non-blocking)   40 Gbps InfiniBand (2:1 blocking)
Datasets
Genome data set     Input size   Shuffle data size   Output size
Rice genome         12 GB        70 GB               50 GB
Bumble bee genome   90 GB        600 GB              95 GB
Metagenome          3.2 TB       20 TB               8.6 TB
The output is the input data set to the Stage 2 key-value stores, built with Redis NoSQL and IBM POWER8 CAPI Flash.
Hadoop Configurations
Hadoop Parameter                        IBM POWER8   SuperMikeII
yarn.nodemanager.resource.cpu-vcores    120          16
yarn.nodemanager.resource.memory-mb     231000       29000
mapreduce.map/reduce.cpu.vcores         4            2
mapreduce.map/reduce.memory.mb          7000         3500
mapreduce.map/reduce.java.opts          6500m        3000m
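On the cluster these settings correspond to entries in yarn-site.xml and mapred-site.xml. A sketch for the POWER8 nodes follows, using stock Hadoop 2.x property names; the map/reduce rows above expand into separate mapreduce.map.* and mapreduce.reduce.* properties (only the map variants are shown):

```xml
<!-- yarn-site.xml (POWER8 nodes) -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>120</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>231000</value>
</property>

<!-- mapred-site.xml (add analogous mapreduce.reduce.* entries) -->
<property>
  <name>mapreduce.map.cpu.vcores</name>
  <value>4</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>7000</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx6500m</value>
</property>
```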
Hadoop Scalability with POWER8 SMTs
• Tested with the small rice genome data set on two nodes
• Almost linear scalability with increasing SMT levels
Rice Genome
Analyzing small (12 GB) data
• Eliminates the impact of network and disk I/O
• 7.5x performance improvement per server
Bumble Bee Genome
Analyzing the medium-size (90 GB) bumble bee genome
• 7.5x improvement in performance per server
Metagenome Stage 1
Analyzing huge (3.2 TB) metagenome data
• Only 6.5 hours on a 40-node IBM POWER8 cluster
• More than 9x improvement in performance per server
IBM Data Engine for NoSQL: Performance and Value
Stage 2 requires large-memory access that isn't readily available via traditional compute processing.
POWER8 CAPI (Coherent Accelerator Processor Interface)
[Figure: the POWER8 processor's CAPP, connected over the on-chip coherence bus and PCIe Gen 3 (transport for encapsulated messages) to an FPGA or ASIC hosting the PSL and a customizable hardware application accelerator]
Customizable Hardware Application Accelerator
• Specific system SW, middleware, or user application
• Written to a durable interface provided by the PSL
Processor Service Layer (PSL)
• Presents robust, durable interfaces to applications
• Offloads complexity/content from the CAPP
Virtual Addressing
• The accelerator works with the same memory addresses the processors use
• Pointers are de-referenced the same as in the host application
• Removes OS and device-driver overhead
Hardware-Managed Cache Coherence
• Enables the accelerator to participate in "locks" as a normal thread
• Lowers latency over an I/O communication model
Redis Labs Exploits the IBM Data Engine for NoSQL
Redis stores key-value pairs
• Key-value pairs may be variable size, in any format (text, document, JPEG, video, etc.)
• Basic operations are SET and GET:
  > SET 100001 "CAPI is Fast"
  > GET 100001
  "CAPI is Fast"
Database characteristics
• 90 GB max capacity: up to 10 GB RAM and 80 GB Flash
• Key-value pairs are 1,000 bytes of random data
• DB filled with ~50 GB of data (42.5 million keys)
Client characteristics
• 288 clients, randomly issuing Redis GETs or SETs
• ~50% of keys served from RAM, ~50% from CAPI-accelerated Flash
Demo system
• IBM Power System S812L
• 1 POWER8 socket
• 2 IBM Data Engine for NoSQL CAPI accelerators
• 1 FlashSystem 840
• Ubuntu 14.10
• Redis Labs Enterprise Cluster (beta)
[Figure: Demonstration platform (POWER8 + CAPI Flash): WWW clients send "Set Key = Value" / "Retrieve Key" requests over 10 Gb uplinks to a POWER8 server attached to a flash array with up to 56 TB]
Infrastructure attributes
• Up to 192 threads in a 2U server drawer
• Up to 56 TB of memory-based Flash per 2U drawer
• Shared memory and cache for dynamic tuning
OpenPOWER partner Redis Labs' highly differentiated product offering built on CAPI is available today.
Demo Link
IBM Data Engine for NoSQL + Redis Labs Value
Built on open APIs
• Leverages the IBM Data Engine for NoSQL APIs
Redis Labs Enterprise Cluster provides near the speed of RAM with the capacity of Flash
• Leverages the IBM Data Engine for NoSQL CAPI accelerator for a high-speed, low-latency link to Flash
Controls use of memory, Flash, and cost
• Hot data maintained in RAM
• Provides ISPs and MSPs up to 72% cost savings when 80% of data is in Flash
Redis Labs Enterprise Cluster lets the user select the ratio of RAM and Flash with a simple slider when using POWER8 with the IBM Data Engine for NoSQL.
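The hot-in-RAM, cold-in-Flash split can be pictured with a toy two-tier store. This is an illustration only: `TieredStore`, its FIFO eviction, and promote-on-read are simplifications for exposition, not the Redis Labs / IBM Data Engine implementation.

```python
class TieredStore:
    """Toy model of a RAM + Flash key-value store: a bounded hot tier
    backed by a larger cold tier (dicts stand in for DRAM and CAPI Flash)."""

    def __init__(self, ram_capacity):
        self.ram_capacity = ram_capacity
        self.ram = {}    # hot tier: bounded, fastest access
        self.flash = {}  # cold tier: overflow lands here

    def set(self, key, value):
        self.ram[key] = value
        self._evict()

    def get(self, key):
        if key in self.ram:
            return self.ram[key]
        value = self.flash.get(key)
        if value is not None:        # promote cold keys on access
            self.flash.pop(key)
            self.ram[key] = value
            self._evict()
        return value

    def _evict(self):
        # FIFO eviction for simplicity (dicts preserve insertion order)
        while len(self.ram) > self.ram_capacity:
            oldest = next(iter(self.ram))
            self.flash[oldest] = self.ram.pop(oldest)
```

Moving the RAM/Flash slider corresponds to changing `ram_capacity`: a smaller hot tier pushes more keys to Flash, trading latency for cost.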
[Figure: Differentiated NoSQL (POWER8 + FlashSystem with CAPI): WWW traffic enters over 10 Gb uplinks through a load balancer to 500 GB cache nodes plus backup nodes, with the POWER8 server attached to a flash array with up to 56 TB]
Infrastructure attributes
• 192 threads in a 4U server drawer
• 56 TB of flash per 2U drawer
• Shared memory and cache for dynamic tuning
• Elimination of I/O and network overhead
• Cluster solution in a box
Today's NoSQL in memory (x86) infrastructure requirements
• Large distributed scale-out
• Large memory per node
• Networking bandwidth needs
• Load balancing
A Power CAPI-attached FlashSystem for NoSQL regains infrastructure control and reins in the cost of delivering services.
What CAPI Means for NoSQL Solutions
Big Redis w/ CAPI Flash Offers New Performance / Cost Points
Users pick the performance/cost point that meets their solution needs, be it IOPS rate or latency requirements.
[Chart: performance/cost points by DRAM/Flash ratio (typical workload)]
DRAM/Flash ratio           100%   80%    50%    20%    10%
% implementation savings   0%     18%    45%    72%    81%
Average latency (ms)       1      5      8      9      10
IOPS at 1 ms latency       2.5M   382K   208K   188K   175K
IOPS at max throughput: 366-750K, 1.35M, 483-950K, 671-1250K
Stage 2 Graph Simplification with Distributed NoSQL
[Figure: the sorted k-mer table from Stage 1 (TAGA:G,T, TAGT:C, TCCG:A, TCGA:G, TGAG:G, TTAG:A, TTTA:G, ACAG:N, AGAC:A, AGAG:A, AGAT:C, AGGC:T, AGTC:G, ATCC:G, ATGA:G, GACA:G, GAGA:C, GAGG:C, GATC:C, GATG:A, GCTT:T, GGCT:T, GTCG:A, CCGA:T, CGAG:G, CGAT:G, CTTT:A) is loaded into the distributed key-value store and traversed to yield the contigs TAGTCGAG and GAGGCTTTAGA]
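The traversal can be sketched in a few lines of Python. In-memory dicts stand in for the distributed NoSQL store here (in the actual system each lookup is a Redis GET against CAPI Flash), and the function names are illustrative. A contig is cut wherever the graph branches, in or out, which is what splits this example into TAGTCGAG and GAGGCTTTAGA:

```python
def build_predecessors(kmers):
    """Count, for each k-mer, how many k-mers extend into it
    (a count above 1 means an in-branch in the graph)."""
    preds = {}
    for kmer, ext in kmers.items():
        for base in ext:
            if base == "N":          # 'N' marks a read end, not a base
                continue
            nxt = kmer[1:] + base
            preds[nxt] = preds.get(nxt, 0) + 1
    return preds

def extend_contig(start, kmers, preds):
    """Walk the k-mer table from `start`, appending the unique extension
    base until the walk hits a dead end, an out-branch (multiple
    extensions), an in-branch, or a cycle."""
    contig = start
    kmer = start
    seen = {start}
    while True:
        ext = kmers.get(kmer, "N")
        if ext == "N" or len(ext) > 1:
            break                    # dead end or out-branch
        nxt = kmer[1:] + ext
        if preds.get(nxt, 0) > 1 or nxt in seen:
            break                    # in-branch or cycle
        contig += ext
        kmer = nxt
        seen.add(nxt)
    return contig
```

Walking the slide's table from TAGT stops where both CGAG and TGAG feed into GAGG, yielding TAGTCGAG; restarting at the branch point GAGG yields GAGGCTTTAGA.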
Accelerating Simplification with IBM CAPI Flash
[Charts: NoSQL I/O throughput (keys/sec) and CAPI Flash I/O throughput (bytes/sec)]
Only 20 POWER8 cores + CAPI: 500 GB graph traversal in 7.5 hours
LSU Project Contributors
• Arghya Kusum Das, PhD student
• Sayan Goswami, PhD student
• Richard Platania, PhD student
• Terry Leatherland, IBM Systems Architect
Fall/Winter 2015 project