View
1
Download
0
Category
Preview:
Citation preview
Building High-Performance Networked Systems with Innovative Hardware and Software Techniques
Kai Zhang University of Science and Technology of China
The Growth Trend of Business Data
2
The volume of business data worldwide is expected to double every 2 years
Source: Oracle
The Growth Trend of Network Speed
3
Source: IEEE 802.3 Higher Speed Study
Group - Tutorial
Linux, FreeBSD,
…
Solutions for Next Generation Networked Systems
4
NETWORKED SYSTEM
OPERATING SYSTEM
HARDWARE
Networked Intrusion Detection System
Firewall
IPSec Gateway
Load BalancerRouter
Key-Value Store
… …
TCP/IP Stack
NIC Driver
Socket
CPUs
DPDK, Netmap, …New drivers, bypass the OS
How to utilize?
GPUs CPUs with Integrated GPUs Xeon Phi
Solutions for Next Generation Networked Systems
• Demonstration of Two Networked Systems
• Mega-KV• A key-value store system with the highest throughput• 100x higher throughput than Memcached
• Snort with DPDK• Enhance the efficiency of network I/O of Snort with DPDK• A cooperation between USTC and Intel for educational
purpose• To be a course lab for Advanced Computer Networks
5
Mega-KV: A Case for GPUs to Maximize the Throughput of In-Memory Key-Value Stores
6
Key-Value Stores๏ A simple but effective method to manage data where a data record
(or a value) is stored and retrieved with its associated key • variable type and length of record (value)• simple or no schema• easy software development for many applications
๏ Key-value stores have been widely deployed in data processing production systems:
7
Simple and Easy Interfaces of Key-Value Stores
• set (key, value)• value = get (key)
Index
38john_age
…
Variable-length keys & values
Client
key
key-value store
GET Key: john_age Value: 38
8
Key-Value Stores: Examples
Keys Values
Amazon Customer ID Customer profile (e.g., credit card, buying history)
Facebook, Twiter User ID User profile (e.g., friends, photos, posts)
iCloud/iTunes Movie/song name Movie, Song
Distributed file Systems Block ID Block
Workflow of a Typical In-Memory Key-Value Store
Network Processing
Memory Management
Index Operations
Access Value
10
Workflow of a Typical In-Memory Key-Value Store
GET
TCP/IP Processing
Query ParsingNetwork Processing
MemoryFull
MemoryNot Full
Evict Allocate Memory Management
Delete from Index
Insert into Index
Search in Index Index Operations
Read & Send Value Access Value
11
SETDELETE
Where does time go in KV-Store MICA [NSDI’14]
Exec
utio
n Ti
me
Perc
enta
ge
0
0.25
0.5
0.75
1
Four Data Sets
1) 128B Key 1024B Value
2) 32B Key 512B Value
3) 16B Key 64B Value
4) 8B Key 8B Value
Access Value
Index Operations
Network Processing (w/ DPDK) & Memory
Management
Index operation becomes one of the major bottlenecks12
Random Memory Accesses in In-Memory Key-Value Stores
QueryNetwork Processing & Memory Management
Access Value
Random Memory Accesses
Index Operation
Hash Table
13
Random Memory Accesses of Indexing are ExpensiveTi
me
(nan
osec
ond)
0
60
120
180
240
Number of Memory Accesses1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Sequential memory accessRandom memory access
72
163
CPU: Intel Xeon E5-2650v2
Memory: 1600 MHz DDR3
14
Inabilities for CPUs to Accelerate Random Memory Accesses
2. Prefetch
1. Cache
3. Multithreading
• Not easy to predict next memory address
• Working set is large (~100 GB), CPU cache is small (~10 MB)
• Limited number of hardware threads• Limited number of Miss Status Holding Registers (MSHRs)
15
CPU spends a large portion of its time idling, waiting for data
Mega-KV addresses two issues: large number of queries and random memory access delay
Network Processing
Memory Management
Access Value
Index Operation To accelerate it by GPUs
DPDK, Multiget, UDP
Bitmap, Optimistic concurrent access
Prefetch
16
The Goal of Mega-KV
๏ Throughput is the critical issue in big data environment• Throughput measures the capability of a key-value store
system to process a growing amount of queries on an increasingly large data set
๏ Acceptable in-memory key-value store latency• < 1 millisecond, e.g. Facebook, Amazon, …
๏ Our goal: Maximize throughput subject to an acceptable latency
17
CPU vs. GPU
Intel Xeon E5-2650v2:2.3 billion Transistors
8 Cores59.7 GB/s memory bandwidth
Nvidia GTX 780:7 billion Transistors
2,304 Cores (12 SMXs)288.4 GB/s memory bandwidth
Massive ALUs
ControlCache
18
Two Advantages of GPUs for Key-Value Stores
GPU Core
…
…
…
cache miss
cache miss
Thread A
Thread B
Thread C
Instruction Buffer memory request issued, switch to another threadmemory request issued, switch to another thread
19
1. Massive Processing Units to Address Large Number of Concurrent Queries
• KV Store — simple independent memory access operations• GPU — thousands of cores for parallel processing
2. Massively Hiding Memory Access Latency • KV Store — random memory accesses in index operations• GPUs can effectively hide memory access latency with massive hardware
threads and zero-overhead thread scheduling (a GPU hardware support)
Basic Design of Mega-KV
Pre-Processing
GPU Processing
Post-Processing
Index OperationsNetwork processing,Memory management Read & Send Value
20
RXDPDK
TXDPDK
Basic Design
Pre-Processing
Network processing,Memory management
21
Basic Design
Pre-Processing
Network processing,Memory management
Batch
22
Basic Design
Pre-Processing
Network processing,Memory management
Parallel Processing in GPUs
23
Basic Design
Pre-Processing
Network processing,Memory management
Parallel Processing in GPUs
24
Post-Processing
Basic Design
Pre-Processing
Network processing,Memory management
Parallel Processing in GPUs
Read & Send Value
25
Post-Processing
Basic Design
Pre-Processing
Network processing,Memory management
Parallel Processing in GPUs
Read & Send Value
26
Challenges of Offloading Index Operations to GPUs
1. GPUs’ memory capacity is small: ~10 GB• Working set may be hundreds of gigabytes
2. Low PCIe bandwidth• PCIe is generally the bottleneck of GPUs if large bulk of data
needs to be transferred
3. Handling variable-length data is inefficient for GPUs• Imbalance load between GPU cores
27
Our Approach
C C C C
Input data (Keys) Index
key
GPU optimized cuckoo hash table thatstores key signatures and value locations
C C C C
Compress
Compressed fixed-length signatures
28
Address challenges 2, 3 (PCIe bandwidth and variable length data)
Address challenges 1, 3(GPU memory capacity and variable length data)
GPU Cuckoo Hash Table
Key
Signaturecompress
Our Approach: Search Index
29
key comparison
Send to client
ValueKeyKV object
Evaluation - Hardware Setup
CPU:
Intel Xeon E5-2650v2 octa-core, 2.6GHz
Total 16 CPU cores
GPU:
Nvidia GTX 780, 2304 cores, 863MHz
Total 4608 cores
NIC:
Intel dual-port 10Gbps NIC
Total 40 Gbps
30
Reaching a Record High ThroughputTh
roug
hput
(MO
PS)
020406080
100120140160180
Data Sets
8B key8B value
16B key64B value
32B key512B value
128B key1024B value
Fastest CPU-based KV store Mega-KV
2.1x1.9x
2.8x2.1x
31
LatencyC
DF
0
0.25
0.5
0.75
1
Round Trip Time (microsecond)
0 60 120 180 240 300 360 420 480 540 600
160 MOPS
95th: 390
50th: 256
32
Compared withFacebook
1,200 (95th)
300 (50th)
Accelerating the Network I/O of Snort with DPDK
33
Snort
• Snort is a multi-mode packet analysis tool• Sniffer• Packet Logger• Forensic Data Analysis tool• Network Intrusion Detection System
• Snort is able to perform network traffic analysis both in real-time and for forensic post processing
• Snort “Metrics” • Fast (High probability of detection for an attack on high speed networks)• Configurable (Easy rules language, many reporting/logging options)
34
Snort Architecture
35
Packet I/O
Packet Decoder
Preprocessor
Detection Engine
Output
Incapability in Handling 10Gbps Network Traffic
36
Cycles Needed in Snort
1,200 400 - 25,000…+
Packet I/O Detection, etc.
Your Budget
1,400
10Gbps, min-sized packets, quad-core 2.66GHz CPUs
(in x86, cycle numbers are from RouteBricks [Dobrescu09] and PacketShader[Han10])
Incapability of Snort for High Speed Networks
• Snort was designed to detect attacks on 100Mbps links• Current network speed reaches 10Gbps and 40Gbps• Snort becomes incapable of detecting intrusions on current backbone
and data center network
• Accelerate network I/O with DPDK• Snort 2.9 introduces the Data Acquisition library (DAQ) for packet I/O• Current supported: pcap, AF_PACKET, netmap, ipfw• Our work: add DPDK support in DAQ
37
Opportunities from DPDK
38
Cycles Needed 1,200 400 - 25,000…+
Packet I/O Detection, etc.
80
DPDK
EXPERIMENTAL SETUP
• Hardware• CPU: Intel(R) Xeon(R) CPU E5-2650 v3 • NIC: 82599ES 10-Gigabit SFI/SFP+ Network Cards
• Software • Linux 3.19.0 • Snort 2.9.8.0 • DPDK 2.1.0
39
Snort Modes
• Two Snort modes: Passive Mode and Inline Mode• Snort is bypassed in the Passive Mode• Snort filters packets in the Inline Mode
40
Snort
PassiveMode
SnortInlineMode
Snort Throughput (Inline Mode w/o DPI)
41
Throughput Stability of DPDK and Netmap
Thro
ughp
ut (M
pps)
13
14
15
1 2 3 4 5 6
14.86 14.88 14.86 14.88 14.86 14.88
14.25
13.90
13.49
14.20
13.7213.90
netmap dpdk
Latency (Inline Mode w/o DPI)
42
Latency Stability of DPDK and Netmap
Late
ncy
(ms)
0.0
0.1
0.2
0.3
1 2 3 4 5 6
0.15 0.16 0.15 0.15 0.15 0.14
0.26 0.25
0.34
0.25
0.34 0.33
netmap dpdk
Snort Throughput (Passive Mode w/ DPI)
43
DPDK
netmap
AF_PACKET
pcap
Throughput (Mpps)
0 1.25 2.5 3.75 5
1.36
1.56
4.24
4.28
Summary
๏ Mega-KV provides the highest throughput• Fastest in-memory key-value store on commodity processors• More than 100x faster than Memcached• Open source at http://kay21s.github.io/megakv/
๏ Integrating DPDK into Snort• Improving the latency and throughput of Snort• For educational purpose: To be a course assignment in USTC
44
Thanks!
45
Recommended