Continuous Distributed Monitoring in the
Evolved Packet Core
Industry Experience Report
Romaric Duvignau¹, Marina Papatriantafilou¹, Konstantinos Peratinos³, Eric Nordstrom², Patrik Nyman²
DEBS 2019, Darmstadt (June 26).
¹Chalmers University of Technology, ²Ericsson, ³Chalmers student and Ericsson intern.
Introduction
Context: Monitoring the Evolved Packet Core (EPC) in 4G

The Evolved Packet Core

[Figure: EPC architecture. User Equipment (UE) connects over LTE to a base station, which tunnels traffic over GTP (identified by TEIDs) to the Evolved Packet Gateway (EPG, the Packet Gateway). The EPG is split into a Control Plane (CP), annotated "MME, QoS, billing, ...", and a User Plane (UP) of workers W1, W2, ..., WN; CP and UP communicate over the Sxa/Sxb interfaces via PFCP, and the UP forwards traffic to servers in the PDN.]

• Large-scale, distributed, performance-critical system.
• Strong need to continuously monitor the EPC: e.g. detection of under- or over-used subcomponents.
Continuous Distributed Monitoring
Continuous Distributed Monitoring (CDM) Model

[Figure: k sites observe streams S1, S2, ..., Sk and communicate with a coordinator that continuously tracks f(S1, S2, ..., Sk).]

• Instant computation & communication
• f depends on ∪Si

There exist variants (unidirectional communication, relay nodes, etc.). A minimal sketch of the site/coordinator pattern follows.
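To make the CDM pattern concrete, here is a minimal Python sketch (our illustration; all names are ours, not the paper's, and f is taken to be a global sum): each site pushes a fresh count only when its local value drifts past a slack, and the coordinator aggregates the last reported values.

# Minimal sketch of the CDM pattern: k sites observe local streams; the
# coordinator continuously tracks f(S1, ..., Sk), here the global event count,
# from infrequent updates.

class Coordinator:
    def __init__(self, k):
        self.last_reported = [0] * k          # last count received per site

    def receive(self, site_id, count):
        self.last_reported[site_id] = count

    def estimate(self):
        return sum(self.last_reported)        # f depends on the union of streams

class Site:
    def __init__(self, site_id, coordinator, slack):
        self.site_id, self.coord, self.slack = site_id, coordinator, slack
        self.count = self.last_sent = 0

    def observe(self, n_events):
        self.count += n_events
        # Communicate only when the local count drifted beyond the slack.
        if abs(self.count - self.last_sent) > self.slack:
            self.coord.receive(self.site_id, self.count)
            self.last_sent = self.count

# Example: 3 sites, slack of 10 events each.
coord = Coordinator(k=3)
sites = [Site(i, coord, slack=10) for i in range(3)]
for s in sites:
    s.observe(25)
print(coord.estimate())   # 75; the estimate stays within k*slack of the truth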
System Architecture
System Architecture Overview

[Figure: a Load Balancer spreads Incoming Traffic over the workers; Agg1 oversees workers w^1_1, w^1_2, ..., w^1_ℓ and Agg2 oversees w^2_1, w^2_2, ..., w^2_ℓ, and so on; each aggregator fetches statistics from its workers and sends monitoring messages up to the coordinator C, whose view is displayed to analysts.]

Differences with CDM models
• Sites' identity matters, performance statistics ≠ "events", etc.
• Need to account for computation and communication delays!
[Timeline: each Monitoring Period consists of a series of Fetches feeding a Sliding Window.]

→ At the Agg: monitoring decisions, then 1 monitoring message.
Monitoring Algorithms
Selected CDM Algorithms for Counting problems

Basic Mode: Exact Monitoring
• Send an update if the last value sent differs from the measured value
• Keep an exact sliding window of the last n values

Approximation Mode: Relative Error of ε
• Uses Exponential Histograms for approximate counting
• Send the approximate count when it is beyond some error bound from the last value sent
• Requires in all O(log(nε)/ε) words

A minimal sketch of the Basic mode follows; the approximation mode is detailed in the appendix slides.
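A minimal Python sketch of the Basic mode (our illustration; the class name and measure() interface are ours, not Ericsson's code):

from collections import deque

class BasicMonitor:
    """Basic mode: exact sliding window, update on every change."""

    def __init__(self, n):
        self.window = deque(maxlen=n)   # exact sliding window of last n values
        self.last_sent = None

    def measure(self, value):
        """Return the update to send, or None if the value did not change."""
        self.window.append(value)
        if value != self.last_sent:     # send only when it differs
            self.last_sent = value
            return value
        return None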
Results
Experimental setup

• EPG setup: 2 aggregators, 72 workers per aggregator
• 2 phases: increasing load (20 min) then stable load (15 min)

[Figure: CPU utilization (%), fetched at 1000 fetches/s (high precision), and packet rate (packets/s, 1M-3M), fetched at 1 fetch/s (low precision), over 0-2000 s; curves show Min, p5, Median, p95, Max.]
No. of Monitoring Updates per Round

• 5-10% of data sent for packet processing rate; 30-70% for CPU.

[Figure: updates relative to Basic mode over 0-2000 s, for CPU (top) and packet rate (bottom), with bounds 5%, 10%, 20% and 5%W60.]

• Max relative error < 5ε/9 and average < ε/5. A quick numeric check follows.
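To make these bounds concrete, our own arithmetic for the ε = 5% configuration:

\[
\text{max rel. error} < \frac{5\varepsilon}{9} = \frac{5 \cdot 0.05}{9} \approx 2.8\%,
\qquad
\text{avg rel. error} < \frac{\varepsilon}{5} = \frac{0.05}{5} = 1\%.
\]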
Monitoring Availability

• 8 runs (ca. 4 h of data) with monitoring round = 1 s

[Figure: update time (s, MA300) and availability (MA300) over 250-2000 s, for the Basic and approximate configurations (B, B5%, B10%, B20%, 5%, B5%W60) on Agg1/2.]
Conclusion
Conclusions
• Adjusted state-of-the-art CDM implementations to the EPC
• Identified keys to popularizing CDM within a production-level system
• From experiments, only 6% of data sent for 1.6% avg error
• Useful for the upcoming transition to the 5G architecture
Thank you!
Error Analysis

• Max relative error is always close to 5ε/9
• Larger window influences absolute error on CPU

[Figure: relative errors (top) and absolute errors (bottom) over 0-2000 s for the 5% and 5%W60 configurations; curves show Min, p10, Median, p90, Max.]
Comparison with Simple Approximation

• Simple Approximation: keep an exact window and send updates when the last count is beyond some predefined relative bound (a sketch follows below)

[Figure: number of updates (pkt) and relative errors (%, pkt) for configurations B, 5%, 10%, 20%, 5%W60.]

• The ε-approximate algorithm presents tradeoffs similar to the simple approximation with bound 5ε/9
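A minimal Python sketch of this baseline (our illustration; names are ours, and we assume counts over the window are the monitored values):

from collections import deque

class SimpleApproxMonitor:
    """Simple Approximation: exact window, relative send bound."""

    def __init__(self, n, bound):
        self.window = deque(maxlen=n)   # exact sliding window (no sketching)
        self.bound = bound              # predefined relative bound, e.g. 5*eps/9
        self.last_sent = None

    def observe(self, value):
        self.window.append(value)
        count = sum(self.window)        # exact count over the window
        if (self.last_sent is None
                or abs(count - self.last_sent) > self.bound * self.last_sent):
            self.last_sent = count      # deviation too large: send an update
            return count
        return None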
CDM approaches

Simple approaches
• Flooding: does not scale!
• Polling: hard to choose the right polling interval!
• Sampling: does not capture scarce under-/over-used components!

Solutions
• Communication-optimal algorithms
• Geometric Monitoring → efficient network-wide aggregates.
• Tailored algorithms for particular tasks → e.g. computing the frequency of items or the most popular ones.
• Heuristics → e.g. adaptive filters.
• Compromises: Magpie, Dapper, Ganglia...
Proposed Monitoring Solutions

[Timeline: each Monitoring Period consists of a series of Fetches feeding a Sliding Window.]

Monitoring Logic for each monitored value
• Implemented as part of the aggregator nodes
• Once all fetches have been collected, a monitoring decision is taken on whether to propagate an update
• Aggregation of all monitoring updates: sending of (up to) a single monitoring message per aggregator (see the round sketch below)
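A sketch of one monitoring round at an aggregator (our illustration, reusing the measure() interface of the BasicMonitor sketch shown earlier; send_to_coordinator is a hypothetical transport callback):

# After all fetches of the period are collected, the monitoring logic runs per
# monitored value; triggered updates are batched into (at most) one message.

def monitoring_round(fetched_stats, monitors, send_to_coordinator):
    """fetched_stats: {value_id: measurement}; monitors: {value_id: monitor}."""
    updates = {}
    for value_id, measurement in fetched_stats.items():
        update = monitors[value_id].measure(measurement)   # monitoring decision
        if update is not None:                             # update triggered
            updates[value_id] = update
    if updates:
        send_to_coordinator(updates)   # single monitoring message per round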
Selected CDM Algorithms

Basic Mode
• Send an update if the last value sent is different
• Keep an exact sliding window of length n

ε-Approximation Mode
• Maintains an ε/9-approximate Exponential Histogram for counting the approximate sum c of items over a sliding window of the last n events
• Whenever c > (1 + 4ε/9)·ĉ or c < (1 − 4ε/9)·ĉ, send an update, where ĉ is the last value sent
• Requires in all O(log(nε)/ε) words of memory

A simplified sketch of this mode follows.
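A simplified Python sketch of this mode (our illustration: the Exponential Histogram below follows Datar et al.'s scheme in spirit, but it is not the deployed implementation):

class ExpHistogram:
    def __init__(self, n, eps):
        self.n = n                          # sliding-window length (time steps)
        self.k = max(1, round(1 / eps))     # allow at most k+1 buckets per size
        self.buckets = []                   # (timestamp, size), oldest first
        self.now = 0

    def add(self, event):
        self.now += 1
        # Expire buckets whose timestamp fell out of the window.
        while self.buckets and self.buckets[0][0] <= self.now - self.n:
            self.buckets.pop(0)
        if event:
            self.buckets.append((self.now, 1))
            self._cascade()

    def _cascade(self):
        size = 1
        while True:
            idxs = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(idxs) <= self.k + 1:
                return
            i, j = idxs[0], idxs[1]          # merge the two oldest of this size
            self.buckets[j] = (self.buckets[j][0], 2 * size)
            del self.buckets[i]
            size *= 2

    def estimate(self):
        # Standard EH estimate: all buckets minus half of the oldest one.
        total = sum(s for _, s in self.buckets)
        return total - self.buckets[0][1] // 2 if self.buckets else 0

class ApproxMonitor:
    def __init__(self, n, eps):
        self.eh = ExpHistogram(n, eps / 9)   # ε/9-approximate counting
        self.band = 4 * eps / 9              # update band from the slide above
        self.last_sent = None

    def observe(self, event):
        self.eh.add(event)
        c = self.eh.estimate()
        if (self.last_sent is None
                or c > (1 + self.band) * self.last_sent
                or c < (1 - self.band) * self.last_sent):
            self.last_sent = c
            return c                         # send an update
        return None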
Measuring Metrics of Interest: 2 modes

With high granularity: CPU usage
1. P fetches of CPU usage (each covering the past 1 ms) within one monitoring period
2. Frequency chart (histogram of F bins) built from the P fetches
3. Sliding windows are updated: each bin is monitored
4. For each changed (basic) or out-of-bounds (approx) value, a monitoring update is sent
5. Upon receiving an update, C updates its frequency counts for the respective observer and CPU bin, and may then display the average CPU over the window as ∑1≤i≤F i·fi / ∑1≤i≤F fi (see the sketch below)
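A minimal Python sketch of step 5's displayed average (our illustration; the bin index serves as the bin's representative value):

def average_cpu(freq):
    """freq[i-1] holds f_i, the count of fetches that fell into CPU bin i."""
    total = sum(freq)                                     # sum of f_i
    if total == 0:
        return 0.0
    return sum(i * f for i, f in enumerate(freq, 1)) / total

# Example with F = 4 bins: most fetches landed in bin 3.
print(average_cpu([1, 2, 10, 3]))   # 2.9375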
With low granularity: Packet Processing Rate
• Only the number of processed packets per monitoring period is tracked