Trumpet: Timely and Precise Triggers in Data Centers Masoud Moshref Minlan Yu Ramesh Govindan Amin Vahdat

Source: minlanyu.seas.harvard.edu/talk/sigcomm16.pdf

Page 1

Trumpet: Timely and Precise Triggers in Data Centers

Masoud Moshref, Minlan Yu

Ramesh Govindan, Amin Vahdat

Page 2

The Problem

Human-in-the-loop failure assessment and repair

Long failure repair times in large networks

Evolve or Die, SIGCOMM 2016

Page 3

Humans in the Loop

Detect

Examine ongoing network events to find possible root-causes

Operators examine dashboards

Page 4

Our Focus

Humans

A framework for programmed detection of events in large datacenters

Programs

Page 5

Events

Link failure

DDoS

Traffic surge

Packet delay

Lost packet

Packet burst

Switch failure

Incast

Load imbalance

Blackhole

Congestion

Traffic hijack

Loop

Middlebox failure

Burst Loss

❖ Availability
❖ Performance
❖ Security

Page 6

Fine Timescale Events

40 ms burst

Timeouts lasting several hundred ms

Detecting Transient Congestion

Page 7

Alternative Approach

Detect

Aggregated, often sampled measures of network health

Dashboard data is insufficient

Page 8

Fine Timescale Events

Did this tenant see a sudden increase in traffic over the last few milliseconds?

Detecting Attack Onset

Data center

Page 9

Inspect Every Packet

Link failure

DDoS

Traffic surge

Packet delay

Lost packet

Packet burst

Switch failure

Incast

Load imbalance

Blackhole

Congestion

Traffic hijack

Loop

Middlebox failure

Burst Loss

Some events may require inspecting every packet

Page 10

Eventing Framework Requirements

Expressivity

★ Set of possible events not known a priori

Fine timescale eventing

★ Capture transient and onset events

Per-packet processing

★ Precise event determination

Because data centers will require high availability and high utilization

Page 11

A Key Architectural Question

Where do we place eventing functionality?

Switches, Hosts, NICs

❖ Are programmable
❖ Have processing power for fine-timescale eventing
❖ Already inspect every packet

Page 12

Trumpet is a host-based eventing framework

Page 13

Research Questions

What eventing architecture permits programmability and visibility?

How can we achieve precise eventing at fine timescales?

What is the performance envelope of such an eventing framework?

Page 14

Research Questions

What eventing architecture permits programmability and visibility?

How can we achieve precise eventing at fine timescales?

What is the performance envelope of such an eventing framework?

Trumpet has a logically centralized event manager that aggregates local events from per-host packet monitors

Page 15

Event Definition

For each packet matching [Filter], group by [Flow-granularity], and report every [Time-interval] each group that satisfies [Predicate]

Flow volumes, loss rate, loss pattern (bursts), delay

Page 16

Event Example

For each packet matching [Service IP Prefix], group by [5-tuple], and report every [10ms] any flow whose [sum (is_lost & is_burst) > 10%]

Is there any flow sourced by a service that sees a burst of losses in a small interval?

Page 17

Event Example

For each packet matching [Cluster IP Prefix and Port], group by [Job IP Prefix], and report every [10ms] any job whose [sum (volume) > 100MB]

Is there a job in a cluster that sees abnormal traffic volumes in a small interval?
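The four-part event definition above (filter, flow granularity, time interval, predicate) can be sketched in code. This is a minimal illustration under stated assumptions, not Trumpet's API: the names `Packet`, `Trigger`, `filter_fn`, `group_key`, and `evaluate` are hypothetical, the addresses are made up, and the volume threshold is scaled down from 100MB to 3000 bytes so the example stays small.

```python
from collections import defaultdict
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Callable, Hashable, List


@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int
    size: int        # bytes
    time_ms: float   # arrival time


@dataclass
class Trigger:  # hypothetical structure, one field per part of the definition
    filter_fn: Callable[[Packet], bool]         # which packets match
    group_key: Callable[[Packet], Hashable]     # flow granularity
    interval_ms: float                          # time interval
    predicate: Callable[[List[Packet]], bool]   # condition checked per group


def evaluate(trigger, packets):
    """Return {interval index: [group keys whose packets satisfy the predicate]}."""
    buckets = defaultdict(lambda: defaultdict(list))
    for p in packets:
        if trigger.filter_fn(p):
            interval = int(p.time_ms // trigger.interval_ms)
            buckets[interval][trigger.group_key(p)].append(p)
    return {
        i: [k for k, pkts in groups.items() if trigger.predicate(pkts)]
        for i, groups in sorted(buckets.items())
    }


def prefix24(ip):
    return str(ip_network(f"{ip}/24", strict=False))


# Analogue of the job-volume example: filter on a cluster prefix and port,
# group by a /24 "job" prefix, report every 10 ms, predicate on total bytes.
volume_trigger = Trigger(
    filter_fn=lambda p: ip_address(p.dst_ip) in ip_network("10.0.0.0/16")
    and p.dst_port == 80,
    group_key=lambda p: prefix24(p.src_ip),
    interval_ms=10,
    predicate=lambda pkts: sum(p.size for p in pkts) > 3000,
)

packets = [
    Packet("10.0.1.5", "10.0.9.9", 1234, 80, 6, 1500, 1.0),
    Packet("10.0.1.6", "10.0.9.9", 1235, 80, 6, 1500, 4.0),
    Packet("10.0.1.7", "10.0.9.9", 1236, 80, 6, 1500, 8.0),
    Packet("10.0.2.5", "10.0.9.9", 1237, 80, 6, 1500, 9.0),
    Packet("10.0.1.5", "10.0.9.9", 1234, 80, 6, 1500, 12.0),
]
result = evaluate(volume_trigger, packets)
```

In the first 10 ms interval only the 10.0.1.0/24 group exceeds the threshold; in the second interval no group does. This sketch buffers whole packets for clarity, which a real monitor running at line rate could not afford.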

Page 18

Trumpet Design

[Figure: architecture diagram. Labels: Controller, Trumpet Event Manager, Server, VM, Hypervisor, Software switch, Trumpet Packet Monitor, Event, Report, Triggers, Trigger Reports]

Page 19

Trumpet Event Manager

Congestion?

Congestion Triggers

Contains event attributes, detects local events

Page 20

Trumpet Event Manager

Page 21

Trumpet Event Manager

Large flow?

Large Flow Triggers

Trumpet can be used by programs to drill down to potential root causes
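The slides describe a logically centralized event manager that installs triggers on hosts and aggregates their local reports. The sketch below shows one simple way such aggregation could work, assuming each host reports a numeric value per event and interval and the network-wide event fires when the sum crosses a threshold. The class `EventManager`, its methods, and the host names are hypothetical; the slides do not specify the actual protocol.

```python
from collections import defaultdict


class EventManager:
    """Hypothetical aggregator for per-host trigger reports."""

    def __init__(self, thresholds):
        self.thresholds = thresholds       # event name -> network-wide threshold
        self.partial = defaultdict(dict)   # (event, interval) -> {host: value}

    def on_trigger_report(self, host, event, interval, value):
        # Keep the latest local value reported by each host.
        self.partial[(event, interval)][host] = value

    def detect(self, event, interval):
        # The network-wide event holds if the summed local values cross the threshold.
        total = sum(self.partial[(event, interval)].values())
        return total > self.thresholds[event]


mgr = EventManager({"large_flow": 100})           # threshold in MB (assumed unit)
mgr.on_trigger_report("host-a", "large_flow", 0, 60)
mgr.on_trigger_report("host-b", "large_flow", 0, 50)
mgr.on_trigger_report("host-a", "large_flow", 1, 60)
fired = [mgr.detect("large_flow", i) for i in (0, 1)]
```

Neither host alone crosses the threshold in interval 0, but together they do, which is the reason for aggregating local events centrally.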

Page 22

Research Questions

What eventing architecture permits programmability and visibility?

How can we achieve precise eventing at fine timescales?

What is the performance envelope of such an eventing framework?

The monitor optimizes packet processing to inspect every packet and evaluate predicates at fine timescales

Page 23

The Packet Monitor

[Figure: Server with VMs, Hypervisor, Software switch, and Trumpet Packet Monitor]

Page 24

A Key Design Decision

[Figure: Server with VMs, Hypervisor, Software switch, and Trumpet Packet Monitor]

Run monitor on CPU core used by software switch

❖ Conserves CPU resources
❖ Avoids inter-core synchronization

Page 25

Can a single CPU core monitor thousands of triggers at full packet rate (14.8 Mpps) on a 10G NIC?
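The 14.8 Mpps figure follows from Ethernet framing: each 64-byte frame occupies an extra 20 bytes on the wire (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap). A short calculation, with the helper name `packets_per_second` chosen here for illustration, also gives the per-packet time budget on a single core.

```python
def packets_per_second(link_bps, frame_bytes, overhead_bytes=20):
    """Maximum frame rate on an Ethernet link.

    overhead_bytes covers the preamble, start-of-frame delimiter,
    and inter-frame gap that accompany every frame on the wire.
    """
    return link_bps / ((frame_bytes + overhead_bytes) * 8)


pps_64 = packets_per_second(10e9, 64)     # minimum-size frames at 10G
budget_ns = 1e9 / pps_64                  # time available per packet
pps_650 = packets_per_second(40e9, 650)   # 650-byte frames at 4x10G

print(f"{pps_64 / 1e6:.2f} Mpps, {budget_ns:.1f} ns per packet")
print(f"{pps_650 / 1e6:.2f} Mpps at 4x10G with 650-byte frames")
```

At about 14.88 Mpps, the core has roughly 67 ns per packet for both forwarding and monitoring.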

Page 26

Two Obvious Tricks

Use kernel bypass

❖ Avoid kernel stack overhead

Use polling to have tighter scheduling

❖ Trigger time intervals at 10ms

Necessary, but far from sufficient…

Page 27

Monitor Design

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

Filter: Source IP = 10.1.1.0/24, Source IP = 20.2.2.0/24

Flow granularity: Service IP prefix, 5-tuple

Time interval: 10ms, 100ms

Predicate: Sum(loss) > 10%, Sum(size) < 10MB

With 1000s of triggers

Page 28

Design Challenges

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

Which of these should be performed

❖ On-path
❖ Off-path

Page 29

Design Challenges

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

Which operations to do on-path?

❖ 70ns to forward and inspect packet

Page 30

Design Challenges

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

How to schedule off-path operations?

❖ Off-path on same core, can delay packets
❖ Bound delay to a few μs

Page 31

Strawman Design

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

Packet History

On-Path

Off-Path

Doesn’t scale to large numbers of triggers

Page 32

Strawman Design

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

On-Path

Off-Path

Still cannot reach goal

❖ Memory subsystem becomes a bottleneck

Page 33

Trumpet Monitor Design

On-Path: Packet → Match filters → Update statistics at 5-tuple granularity

Off-Path: Gather statistics at flow granularity → Check predicate at time-interval

5-tuple granularity can be different from flow-granularity
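The split above keeps per-packet work to a single 5-tuple counter update and defers grouping and predicate checks to off-path code. A minimal sketch of that division, assuming byte counts as the only statistic; `on_path`, `off_path`, and the dictionary-based packets are illustrative names and shapes, not Trumpet's data structures.

```python
from collections import defaultdict

five_tuple_bytes = defaultdict(int)   # on-path table, keyed by 5-tuple


def on_path(pkt):
    """Per-packet work: one table update at 5-tuple granularity."""
    key = (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"])
    five_tuple_bytes[key] += pkt["size"]


def off_path(group_key, predicate):
    """Periodic work: gather 5-tuple counters at the trigger's flow
    granularity, then check the predicate on each group."""
    groups = defaultdict(int)
    for key, nbytes in five_tuple_bytes.items():
        groups[group_key(key)] += nbytes
    return sorted(g for g, total in groups.items() if predicate(total))


for pkt in [
    {"src": "10.1.1.1", "dst": "10.9.9.9", "sport": 1, "dport": 80, "proto": 6, "size": 1500},
    {"src": "10.1.1.1", "dst": "10.9.9.9", "sport": 2, "dport": 80, "proto": 6, "size": 1500},
    {"src": "10.1.1.2", "dst": "10.9.9.9", "sport": 3, "dport": 80, "proto": 6, "size": 500},
]:
    on_path(pkt)

# Flow granularity here is the source IP, coarser than the 5-tuple.
heavy = off_path(group_key=lambda k: k[0], predicate=lambda total: total > 2000)
```

Two separate 5-tuples from 10.1.1.1 are combined off-path into one source-IP group, which is the only group over the threshold.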

Page 34

Optimizations

On-Path: Packet → Match filters → Update statistics at 5-tuple granularity

❖ Use tuple-space search for matching
❖ Match on first packet, cache match
❖ Lay out tables to enable cache prefetch
❖ Use TLB huge pages for tables
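The first two on-path optimizations can be illustrated together. In tuple-space search, filters that share the same prefix lengths go into one hash table, so a lookup costs one hash probe per distinct length combination rather than one comparison per filter. The sketch below is a simplified IPv4 version over source and destination prefixes only; the class `TupleSpace`, its methods, and the trigger ids are hypothetical, and Trumpet's actual tables are laid out differently for cache efficiency.

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network


def masked(ip_int, prefix_len):
    """Keep only the top prefix_len bits of a 32-bit IPv4 address."""
    return ip_int >> (32 - prefix_len)


class TupleSpace:
    def __init__(self):
        # (src_len, dst_len) -> {(masked src, masked dst): [trigger ids]}
        self.tables = defaultdict(lambda: defaultdict(list))
        self.cache = {}     # 5-tuple -> matched trigger ids
        self.lookups = 0    # full searches performed (for illustration)

    def add_filter(self, trigger_id, src_prefix, dst_prefix):
        s, d = ip_network(src_prefix), ip_network(dst_prefix)
        key = (masked(int(s.network_address), s.prefixlen),
               masked(int(d.network_address), d.prefixlen))
        self.tables[(s.prefixlen, d.prefixlen)][key].append(trigger_id)
        self.cache.clear()  # cached matches may now be stale

    def match(self, five_tuple):
        # Match on the first packet of a flow, then reuse the cached result.
        if five_tuple in self.cache:
            return self.cache[five_tuple]
        self.lookups += 1
        src = int(ip_address(five_tuple[0]))
        dst = int(ip_address(five_tuple[1]))
        matched = []
        for (slen, dlen), table in self.tables.items():
            matched.extend(table.get((masked(src, slen), masked(dst, dlen)), []))
        self.cache[five_tuple] = sorted(matched)
        return self.cache[five_tuple]


ts = TupleSpace()
ts.add_filter("t1", "10.1.1.0/24", "0.0.0.0/0")
ts.add_filter("t2", "10.1.0.0/16", "10.2.0.0/16")
ts.add_filter("t3", "20.2.2.0/24", "0.0.0.0/0")

flow = ("10.1.1.7", "10.2.3.4", 1234, 80, 6)
first = ts.match(flow)    # full tuple-space search
second = ts.match(flow)   # served from the match cache
```

Three filters collapse into two tables here because t1 and t3 share the (24, 0) prefix-length tuple, and the second packet of the flow skips the search entirely.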

Page 35

Optimizations

Off-Path: Gather statistics at flow granularity → Check predicate at time-interval

❖ Lazy cleanup of statistics across intervals
❖ Lay out tables to enable cache prefetch
❖ Bounded-delay cooperative scheduling
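One plausible reading of "lazy cleanup of statistics across intervals" is to avoid sweeping the whole table at each interval boundary and instead reset an entry the first time it is touched in a new interval. The sketch below implements that idea with an epoch stamp per entry. It is an assumption about the technique, not a description of Trumpet's implementation, and `LazyStats` is a hypothetical name.

```python
class LazyStats:
    """Per-flow byte counters that reset lazily at interval boundaries."""

    def __init__(self):
        self.epoch = 0
        self.table = {}   # key -> [epoch of last write, byte count]

    def next_interval(self):
        # Constant-time boundary: no sweep over the table.
        self.epoch += 1

    def add(self, key, nbytes):
        entry = self.table.get(key)
        if entry is None or entry[0] != self.epoch:
            # Stale or missing entry: reset on first touch in this interval.
            self.table[key] = [self.epoch, nbytes]
        else:
            entry[1] += nbytes

    def read(self, key):
        entry = self.table.get(key)
        return entry[1] if entry and entry[0] == self.epoch else 0


stats = LazyStats()
stats.add("flow-a", 100)
stats.add("flow-a", 50)
before = stats.read("flow-a")     # total within the current interval
stats.next_interval()
after = stats.read("flow-a")      # stale entry reads as zero
stats.add("flow-a", 10)
restarted = stats.read("flow-a")  # counter restarted, not accumulated
```

The cost of resetting is spread over later accesses, which keeps the interval boundary itself from stalling packet processing.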

Page 36

Bounded Delay Cooperative Scheduling

[Figure labels: Off-Path, On-Path, Bounded Delay]

Bound delay to a few μs
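Because off-path work shares a core with packet forwarding, it has to give the core back before packets are delayed too long. A minimal sketch of bounded-delay cooperative scheduling uses a generator that yields after each unit of work and a runner that stops once a time budget is spent. The names `sweep` and `run_off_path` are hypothetical, and the demo uses a fake tick counter instead of a real clock so the behavior is deterministic.

```python
import time


def sweep(flow_table, predicate, satisfied):
    """Off-path work written as a generator: yield after every entry so
    the caller can stop between entries and resume later."""
    for key, value in flow_table.items():
        if predicate(value):
            satisfied.append(key)
        yield


def run_off_path(task, budget, clock=time.perf_counter):
    """Run steps of task until the budget is spent.

    Returns True once the task has finished, False if it was paused."""
    deadline = clock() + budget
    for _ in task:
        if clock() >= deadline:
            return False   # hand the core back to packet processing
    return True


# Deterministic demo: each call to the fake clock advances time by one tick.
ticks = iter(range(1000))
satisfied = []
task = sweep({i: i for i in range(10)}, lambda v: v > 5, satisfied)

rounds = 1
while not run_off_path(task, budget=3, clock=lambda: next(ticks)):
    rounds += 1   # a real loop would poll and forward packets here
```

With a budget of three ticks the ten-entry sweep is spread over four rounds, so no single pause in packet processing exceeds the budget by more than one step.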

Page 37

Research Questions

What eventing architecture permits programmability and visibility?

How can we achieve precise eventing at fine timescales?

What is the performance envelope of such an eventing framework?

Trumpet can monitor thousands of triggers at full packet rate on a 10G NIC

Page 38

Evaluation

Trumpet is expressive

❖ Transient congestion
❖ Burst loss
❖ Attack onset

Trumpet scales to thousands of triggers

Trumpet is DoS-Resilient

Page 39

Detecting Transient Congestion

[Figure labels: Congestion, Large Flow, 40 ms]

Trumpet can detect millisecond-scale congestion events

Page 40

Scalability

Trumpet can process❉ 14.8 Mpps

❖ 64 byte packets at 10G
❖ 650 byte packets at 4x10G

… while evaluating 4K triggers at 10ms granularity

❉ Xeon E5-2650, 10-core 2.3 GHz, Intel 82599 10G NIC

Page 41

Performance Envelope

Triggers matched by each flow

How often each predicate is checked

Above this rate, Trumpet would miss events

Page 42

Performance Envelope

At moderate packet rates, can detect events at 1ms

Number of <trigger, flow> pairs increases statistics gathering overhead

Page 43

Performance Envelope

Need to profile and provision Trumpet deployment

Above 10ms, CPU can sustain full packet rate

Page 44

Conclusion

Future datacenters will need fast and precise eventing

❖ Trumpet is an expressive system for host-based eventing

Trumpet can process 16K triggers at full packet rate

❖ … without delaying packets by more than 10 μs

Future work: scale to 40G NICs

❖ … perhaps with NIC or switch support

https://github.com/USC-NSL/Trumpet

Page 45

A Big Discrepancy

Outage budget for five 9s availability

24 seconds per month

99.999% uptime

Long failure durations due to time to root-cause failures
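The outage budget can be checked with a short calculation. The function name `outage_budget_seconds` is chosen here for illustration; the result depends on how long a month is assumed to be, and a four-week month gives roughly the 24 seconds quoted above.

```python
def outage_budget_seconds(nines, days):
    """Allowed downtime for an availability of `nines` nines over `days` days."""
    unavailability = 10 ** -nines        # five nines -> 0.00001
    return unavailability * days * 86_400


for days in (28, 30, 31):
    print(days, "days:", round(outage_budget_seconds(5, days), 2), "seconds")
```

For a 30-day or 31-day month the same availability target allows closer to 26 or 27 seconds.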

Page 46

Every optimization is necessary❉

❉Details in the paper

Page 47

Humans in the Loop

Detect

Locate

Inspect

Fix

Page 48

Programs in the Loop

Detect

Locate

Inspect

Fix

Programs in the loop