Enterprise Data Trends for Hybrid Multi-clouds
The semiconductor crisis, data growth, and new opportunities with cloud, AI, and models
© 2020 NetApp, Inc. All rights reserved.
Naresh Patel
Vice President & Chief Architect
NetApp
November 3, 2020
Overview
[Diagram: the four layers (Tech Trends, Global Data, Servers/Storage, CPU complex) viewed through the Process / Move / Store lenses and the Scale Cube ("Layers × Lenses"), framing the semiconductor CRISIS]
• Data-focused interplay of the three things you can do with data:
  • Process it
  • Move it
  • Store it
• AI-infused decision making using real-time data vs. a HW/SW designer making choices up front
• Opportunities for performance/energy models and metrics
Lenses: Process, Move, Store
Layers:
• Technology trends
• Global data / storage scaling and properties
• Server and storage scaling
• CPU / memory complex
How data gets processed, moved, and stored at the component level
[Diagram: Process / Move / Store triangle]
New data can be processed, moved, and stored; semiconductors play a key role in each.
[Diagram: the media behind each lens at every layer: semiconductor for processing; semiconductor, copper, or optical links for moving; semiconductor, magnetic, or optical media for storing]

• Edge to core to multi-cloud
• Multiple servers and storage
• CPU-memory complex
• Fundamental building blocks
[Chart: #transistors (M per sq mm) vs. years, 2012-2022, on a 0-250 scale]
• Moore's Law "demands" (idealized form sketched below):
  1. 2X transistors per wafer every 2 years
  2. Cost per million transistors goes down ~50%
  3. Less power from smaller transistors
• But costs and power are not coming down as fast with process shrinks
  • Wafer fabrication costs are higher
  • Higher resistance of thinner traces reduces the power benefits
  • Chips ship with "dark" silicon to handle physical limits
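Demands 1 and 2 amount to an idealized exponential decay in cost per transistor; a minimal worked form (consistent with the 2^(-t/2) reference curve on the chart that follows) is:

```latex
% Idealized Moore's Law cost trend: cost per transistor halves every
% 2 years, i.e. roughly a 29% drop per year while the trend held.
\[
  c(t) \propto 2^{-t/2},
  \qquad
  \frac{c(t+2)}{c(t)} = \frac{1}{2},
  \qquad
  \frac{c(t+1)}{c(t)} = 2^{-1/2} \approx 0.71 .
\]
```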
Microprocessor Cost Trends
Costs per transistor are no longer going down in the leading-edge semiconductor process
[Chart: relative cost/transistor (log scale, 0.005 to 5) vs. years, 2005-2020, against a 1000·2^(-t/2) reference curve; the historical > 50% p.a. drop has flattened to < 10% p.a. drop. Annotated: Apple A14 @ 5nm; 3nm? Source: Linley Group]
Wright's Law – or "Experience" Power Law
As Moore's Law slows down, Wright's Law may provide a better fit

• Wright's Law: for every doubling of cumulative units produced, the cost to manufacture one unit falls by a constant percentage (a factor of 2^{-b}):

  f(x) = x^{-b}  ⇒  f(2x) = 2^{-b} · f(x),  where x is cumulative units produced
[Charts: relative cost per unit vs. years (∝ 1/2^(t/2) if unit production grows exponentially, ∝ 1/t^b if unit production grows linearly), and relative cost per unit vs. cumulative number of units produced]
• Unit costs depend on the unit production curve over time u(t)
• When unit production grows exponentially over time, the unit cost under Wright's Law is the same over time as under Moore's Law
• When unit production grows linearly, unit cost follows a power law over time (compare both regimes in the sketch below)
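A minimal sketch of this comparison (the Wright exponent and production volumes are illustrative assumptions, not figures from the talk):

```python
import numpy as np

b = 0.3                      # assumed Wright's Law exponent (illustrative)
t = np.arange(1, 13)         # years

# Cumulative units produced under the two growth regimes.
linear_units = 1000.0 * t              # production grows linearly
expo_units = 1000.0 * 2 ** (t / 2)     # production doubles every 2 years

# Wright's Law: relative unit cost falls as (cumulative units)^(-b).
cost_linear = linear_units ** -b       # power law in time, ~ 1 / t^b
cost_expo = expo_units ** -b           # exponential in time, ~ 2^(-b*t/2)

for yr, cl, ce in zip(t, cost_linear / cost_linear[0],
                      cost_expo / cost_expo[0]):
    print(f"year {yr:2d}: linear-growth cost {cl:.3f}, "
          f"exponential-growth cost {ce:.3f}")
```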
• Implications
  • Chip designs leverage the existing process node
  • Creates opportunities for scaling via new computer architectures
  • New breakthrough?
Emerging CPU and Memory Trends
General purpose CPUs are being augmented for speed and efficiency

• Moore's Law scaling slows
  • Manufacturing improvements continue
  • Better packaging: a chiplet approach to overcome single monolithic die challenges for 3D scaling beyond 5nm
• Novel domain-specific architectures, especially for AI, as the general purpose CPU becomes the system bottleneck
  • Composable architectures (including disaggregated resources) via SW APIs
  • Scale-out architectures for public and private clouds
  • Scale-up with HW accelerators and domain-specific architectures: GPUs, AI/ML, …
• Persistent memory
  • Memory-based computing provides synchronous access for simpler / more efficient programming
  • Multi-TB scale
• Processing inside memory
Emerging storage media trends
Ongoing Flash $/GB reduction by adding layers, but a new breakthrough is needed to reduce cost per Flash cell
• New approaches for differentiated workloads
  • Bring compute closer to media to take advantage of high internal BW
  • Wider adoption of AI/ML techniques for large immutable data sets
• Cost/GB reduction
  • Flash cost reduction via stacking layers continues
  • Cost reduction from QLC, with an endurance trade-off
  • Breakthrough media needed to challenge NL-SAS HDDs in $/GB
• Low latency media (better latency than TLC NAND Flash)
  • As a read cache
  • As a "write absorber"
  • As a network-attached replacement for server-side SSDs
Data Access in Memory / Storage Hierarchy
Storage hierarchy choices are getting crowded
[Figure: the data access continuum across the memory/storage hierarchy, down to tape. Source: P. Faraboschi, "The Data Access Continuum", SC'19 MCHPC]
Emerging Networking Trends
Network scaling continues its upward trend
• Link speeds continue the upward trend beyond 100 GbE
  • Switch ports going to 400 GbE, using 4 lanes of 100G SerDes
  • Public clouds creating faster transitions to the next generation
  • Next step is 800 GbE
• Smart NICs with processing capability
  • Offer HW offload capability and relatively easy programmability
  • Direct placement benefits of RDMA on the client and storage side
  • Security features
• NVMe-oF for storage connectivity, scaling well to 1000s of nodes
  • RoCEv2 and TCP/IP as the most robust transport layers
How data gets processed, moved, and stored across the edge, core, and multi-cloud
[Diagram: Process / Move / Store triangle]
How might we find the "best home" for a workload across edge, core, and multi-cloud?
• Most data created at the edge (endpoints, cloudlets); core for sovereignty; increased cloud storage for consumers
• Data fabric to allow movement across edge, core, and multi-cloud: move the right data, to the right place, at the right time!
• Apps at the edge for real-time; apps in the core DC for security/compliance; apps in the cloud for scale and cloud-native; containers and serverless FaaS
Edge to core to multi-cloud
Multiple servers and storage
CPU-memory complex
Fundamental building blocks
Global Data Creation and Installed Storage
Data growth continues in the amount created and the amount stored
[Chart: all data created per year, 2008-2024, with the "Data Arrival Rate" increasing; "Queue Length" = data in the installed base, split across HDD, SSD, NVM, optical, and tape]
Post-processing the evolution of drives provides useful insights across the population. Real-time analysis of the collective drives' lifetime of experience enables better prediction and decision making (a fleet-analysis sketch follows the charts).

Example: evolution of "data reads per day" on a set of drives
[Charts: SSD drive reads per day vs. power-on years for the drive (quarter-year buckets out to roughly 4.5 and 5.5 years), for 18 drive groups]
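A minimal sketch of the population-level post-processing described above, assuming a hypothetical telemetry schema of (drive_id, power_on_years, reads_per_day) records; all field names and values are illustrative:

```python
from collections import defaultdict

# Hypothetical per-drive telemetry samples: (drive_id, power_on_years,
# reads_per_day). In practice these come from fleet telemetry feeds.
records = [
    ("d1", 0.25, 1.9), ("d1", 0.50, 2.1), ("d1", 0.75, 2.4),
    ("d2", 0.25, 0.8), ("d2", 0.50, 1.0), ("d2", 0.75, 1.1),
]

# Bin reads-per-day by power-on age across the population, mirroring
# the quarter-year x-axis of the charts above.
buckets = defaultdict(list)
for _, age, rpd in records:
    buckets[round(age * 4) / 4].append(rpd)

for age in sorted(buckets):
    vals = buckets[age]
    print(f"power-on years {age:4.2f}: mean reads/day "
          f"{sum(vals) / len(vals):.2f} (n={len(vals)})")
```

The same per-bucket statistics, computed continuously instead of after the fact, are what enable the real-time prediction mentioned above.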
How data gets processed, moved, and stored across server nodes
[Diagram: Process / Move / Store triangle]
How might we find the right balance of processing, movement, and storage for a workload?
• Rapid shift to SSDs for high IOPS; unlock internal memory/SSD BW
• Data movement getting faster (PCIe Gen4/5 or Ethernet); less movement saves energy
• Processing is the bottleneck for many workloads: process in memory / storage / network; distribute processing across servers
Edge to core to multi-cloud
Multiple servers and storage
CPU-memory complex
Fundamental building blocks
What Type of Processing is being Offloaded from the x86 CPU?
Pensando Distributed Services Card: 100 GbE networking and easily programmable offloads, but with ASIC-like speeds
[Diagram: a storage controller's x86 CPU (processes executing x86 instructions, shared memory) moving bulk data & transforms with data integrity to a Pensando PCIe card with ARM cores, P4 engines, HW engines, and 100 GbE ports]
[Chart: three curves for k = 1, 2, 3; axis labels lost in extraction]
How does the Processing Offload to Pensando help Performance?
Moving bulk data & transforms from main memory to the Pensando card reduces CPU usage and latency
[Diagram: bulk data & transforms flowing among storage CPU cores, LLC, and main memory, and over PCIe to the Pensando card and the 100 GbE network]
Offloading bulk data transforms:
• Saves CPU cycles directly
• Reduces memory BW
• Helps maintain consistent latency (see the queueing sketch below)
[Chart: relative latency vs. load, marking the half-latency knee; higher variability leads to higher latency]
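The half-latency knee and the variability effect can be illustrated with the textbook M/G/1 mean response time (the Pollaczek-Khinchine formula). This is a minimal sketch of the latency-vs-load curve shape, not the model behind the chart; `scv` is the squared coefficient of variation of the service time (0 = deterministic, 1 = exponential, larger = more variable):

```python
# M/G/1 mean response time via Pollaczek-Khinchine: illustrates why
# higher service-time variability (scv) raises the whole latency curve.
def mg1_response_time(load: float, service: float = 1.0,
                      scv: float = 1.0) -> float:
    """Mean response time at utilization 0 <= load < 1."""
    assert 0.0 <= load < 1.0
    wait = load * service * (1.0 + scv) / (2.0 * (1.0 - load))
    return service + wait

for scv in (0.0, 1.0, 4.0):            # increasing variability
    pts = [mg1_response_time(rho, scv=scv) for rho in (0.3, 0.6, 0.9)]
    print(f"scv={scv}: latency at 30/60/90% load = "
          + ", ".join(f"{t:.2f}" for t in pts))
```

Offloading transforms both lowers the effective load and trims variability, which pushes the knee to the right.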
Y-axis, Vertical Scaling: Offload “Bulk Data & Transforms” for Storage
• Storage efficiency
  • Compression, compaction
  • Deduplication / hash
• Data protection
  • Checksums
  • RAID operations
• Security computation
  • Encrypt data at rest
  • Encrypt data in motion
• Networking
  • NIC, RDMA
"Data touch" / CPU-intensive functions move to specialized HW on the "fabric"
⇒ Pensando's card with P4+ engines, HW engines, and ARM cores
Benefits compound via ONTAP software:
⇒ Primary data snapshots, clones on box
⇒ Preserve storage efficiency for secondary data services
⇒ Tiering data to public clouds
Z-Axis, Scale-out for High-Tech Industry Applications
FlexGroup on A400 with Pensando for clustering: a scalable, high-performance data container

• Linear scale for performance and capacity
• Operational simplicity
  • Single mount point with automated load and space distribution
• Consistent high performance
  • Predictable, consistent low latency
  • All-flash containers
• High resiliency
  • NetApp® ONTAP® nondisruptive operations
Single namespace
[Diagram: sharded data containers (d1-d4) along the z-axis]
Apps for electronic design automation, high-tech, oil and gas, media and entertainment
X-axis, Horizontal Scaling – Accelerate Your Global Namespace
FlexCache volumes distribute hot data over Pensando cluster ports
NetApp® FlexCache® volumes
• Sparsely populated volumes that can be cached on the same cluster or a different cluster than the origin volumes, to accelerate data access

Performance acceleration for hot volumes
• Cache reads and metadata for CPU-intensive workloads
• Provides different mount points to avoid hot volumes
• Cache data within the cluster (intra-cluster)

Cross data center data distribution
• Cache across multiple data centers to reduce WAN latencies
• Bring data closer to compute and users
• Between NetApp AFF, FAS, or ONTAP Select systems
Scale in All 3 Dimensions – System & Data
Scalability = ability to handle a growing amount of work (CPU and data)
• One system & data: monolithic architecture, no splits
• x-axis, Horizontal Technical Duplication: scale by cloning pieces of data (technical architecture layering)
  • Many nodes, each a clone, load balanced
  • Data reads from replicas; writes go to one node
• y-axis, Functional Decomposition: scale by splitting into different things
  • Components: modules/microservices
  • Split by function, service, data affinity
• z-axis, Data Partitioning: scale by splitting into similar contexts / shards
  • Split by users, large modules, hash, ...
• All three together: near infinite scale (a routing sketch follows below)

Source: the Scale Cube, http://theartofscalability.com/
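As a concrete illustration of how the three axes can show up in request routing (every service and shard name here is hypothetical, not a NetApp component): the y-axis picks a service pool by function, the z-axis picks a shard from a key hash, and the x-axis load-balances across clones.

```python
import hashlib

# Hypothetical cluster layout, for illustration only.
SERVICES = {"read": ["r1", "r2", "r3"],   # y-axis: split by function
            "write": ["w1", "w2"]}
NUM_SHARDS = 4                            # z-axis: partition by key

def route(op: str, key: str, seq: int) -> tuple:
    """y-axis: choose the pool by function; z-axis: choose the data
    shard from the key; x-axis: round-robin across cloned nodes."""
    shard = int(hashlib.sha256(key.encode()).hexdigest(), 16) % NUM_SHARDS
    clones = SERVICES[op]
    return clones[seq % len(clones)], shard  # x-axis load balancing

print(route("read", "user42", seq=7))  # e.g. ('r2', <shard id>)
```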
How data gets processed, moved, and stored inside the CPU-memory complex
[Diagram: Process / Move / Store triangle]
How might we get more performance and energy efficiency out of the CPU and memory?
• Memory bound by latency; persistent memory for capacity; CXL for large memory spaces
• Need an energy focus: data hops are "hidden"; performance and energy metrics
• Processing is the bottleneck for many workloads: decentralize processing; computational memory
Edge to core to multi-cloud
Multiple servers and storage
CPU-memory complex
Fundamental building blocks
• Many data center workloads have low instructions per cycle (IPC)
  • Back-end bound → waiting to fill caches / memory latency
  • Dependent data accesses → serial memory accesses
• Characteristics of data intensive workloads
  • Request and response over the network
  • Simultaneously access many contexts (fan-in)
  • Areas of "tax": protocol buffers, RPCs, memory moves, compression, memory allocation, hashing
• Memory latencies going higher with bigger sockets
• Moving these "data intensive" functions away from the CPU to Ethernet fabric-attached offload HW
• Move processing into memory for better latency and energy efficiency
• Symmetric multi-threaded cores

"100% CPU utilization" hides the underlying bottlenecks: the cache hierarchy and memory latency. The CPU pipeline is stalled for >80% of the time for many workloads! (See the sketch below.)
Source: S. Kanev et al., "Profiling a warehouse-scale computer", ISCA'15
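A back-of-the-envelope sketch of how such stall fractions map to IPC, using the first level of the top-down method with made-up slot fractions (illustrative numbers, not data from Kanev et al.):

```python
# Top-down level 1: issue slots divide into retiring, front-end bound,
# bad speculation, and back-end bound. The fractions below are made up.
WIDTH = 4                                        # issue slots per cycle
retiring, frontend, bad_spec = 0.10, 0.05, 0.05  # fractions of all slots
backend_bound = 1.0 - (retiring + frontend + bad_spec)  # -> 0.80

# On this idealized core, retired slots per cycle approximate IPC.
ipc = WIDTH * retiring
print(f"back-end bound: {backend_bound:.0%}, IPC ~ {ipc:.2f}")  # 80%, 0.40
```

With 80% of slots back-end bound, even a 4-wide core retires well under one instruction per cycle: the "100% utilized but stalled" picture above.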
M^B/G/1 Energy-Performance Queue with Batched Arrivals
[Diagram: a queue with batched arrivals, depth-dependent service times, and power-up/power-down states]

• Random arrivals with rate λ of batched requests; specified batch size distribution (Pr(1), Pr(2), Pr(3), …)
• Specify n general service time distributions
  • Different for each queue depth (1, 2, 3, …, n)
  • Same service time distribution for queue lengths n, n+1, …
• System powers down when the queue is empty (power-down time distribution)
• System powers up on arrival of a new item to an empty queue (power-up time distribution)

MODEL INPUT: arrival rate, 1 discrete distribution, n+2 continuous distributions (n service times, power-up, and power-down times), and power usage for ON, OFF.
MODEL OUTPUT: response time (latency) distribution (mean, variance, and higher moments), energy usage.
Example: an EP queue + reinforcement learning to make power usage choices (and service time choices) under energy constraints; a simulation sketch of the underlying queue follows.
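A crude event-driven sketch of the model's mechanics: all distributions, rates, and power numbers below are illustrative assumptions (exponential where the model allows general distributions), and the arriving batch size stands in for queue depth when choosing a service time.

```python
import random

random.seed(0)

LAM = 0.5                 # batch arrival rate (illustrative)
P_ON, P_OFF = 10.0, 1.0   # power draw when ON vs. powered down

def batch_size():
    # Discrete batch size distribution: Pr(1), Pr(2), Pr(3)
    return random.choices([1, 2, 3], weights=[0.6, 0.3, 0.1])[0]

def service_time(depth):
    # Depth-dependent service: faster at higher depth, capped at n = 4
    return random.expovariate(min(depth, 4))

def power_up_time():
    return random.expovariate(5.0)

def simulate(n_batches=100_000):
    t = 0.0            # simulated clock (batch arrival times)
    free_at = 0.0      # time at which the server next goes idle
    on_time = idle_time = total_resp = 0.0
    jobs = 0
    for _ in range(n_batches):
        t += random.expovariate(LAM)   # next batch arrival
        if t >= free_at:               # empty queue: was powered down
            idle_time += t - free_at   # time spent powered down
            pu = power_up_time()       # power-up delay on arrival
            on_time += pu
            free_at = t + pu
        depth = batch_size()
        for _ in range(depth):         # serve the whole batch FIFO
            s = service_time(depth)
            free_at += s
            on_time += s
            total_resp += free_at - t  # this job's response time
            jobs += 1
    energy = on_time * P_ON + idle_time * P_OFF
    return total_resp / jobs, energy / t

mean_latency, avg_power = simulate()
print(f"mean response time {mean_latency:.2f}, average power {avg_power:.2f}")
```

An RL agent layered on such a simulator could then choose power-down aggressiveness or service rates to trade latency against energy, as the example above suggests.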
Summary
[Diagram: the Layers × Lenses view revisited (Tech Trends, Global Data, Servers/Storage, CPU complex; Process / Move / Store; Scale Cube), moving from semiconductor CRISIS to AI-infusion, Renaissance, and Performance & Energy metrics]