Performance Evaluation Between Checkpoint Services in Multi-tier Stateful Applications
Demis Gomes
Advisor: Glauco Gonçalves
Co-Advisor: Patricia Endo
2
INTRODUCTION
3
Introduction
• Platform-as-a-Service (PaaS)
(Diagram: the Developer deploys an Application to the PaaS; the User accesses it through the PaaS Provider)
4
Introduction
• Multi-tier stateful applications
5
Introduction
• It is important to keep an application in a PaaS running as long as possible
• Downtime causes significant financial losses
6
Introduction
• The average cost of a critical application failure per hour is $500,000 to $1 million.
Source: https://devops.com/2015/02/11/real-cost-downtime/ . Last accessed 11 Oct. 2016
Checkpoint Services!
7
Introduction
(Diagram: the Checkpoint Service connects Developers, Users, and PaaS Providers)
8
Background
• A checkpoint service is divided into three mechanisms:
– Checkpoint saving
– Failure detection
– Failover
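As a rough illustration, the three mechanisms can be sketched in a few lines of Python (class and attribute names, and the heartbeat logic, are assumptions for the sketch, not the actual implementation):

```python
import time

class CheckpointService:
    """Toy sketch of the three mechanisms: checkpoint saving,
    failure detection (heartbeat timeout), and failover."""

    def __init__(self, heartbeat_timeout=3.0):
        self.heartbeat_timeout = heartbeat_timeout
        self.last_heartbeat = time.monotonic()
        self.saved_state = None
        self.active = "instance-A"
        self.standby = "instance-B"

    def save_checkpoint(self, app_state):
        # Checkpoint saving: persist the active instance's state.
        self.saved_state = app_state

    def heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def failed(self):
        # Failure detection: no heartbeat within the timeout window.
        return time.monotonic() - self.last_heartbeat > self.heartbeat_timeout

    def failover(self):
        # Failover: promote the standby and hand it the saved state.
        self.active, self.standby = self.standby, self.active
        return self.saved_state

cs = CheckpointService(heartbeat_timeout=0.01)
cs.save_checkpoint({"messages": ["hi"]})
time.sleep(0.05)            # simulate a silent (failed) active instance
if cs.failed():
    restored = cs.failover()
```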
9
Background
• Checkpoint Service
(Diagram: the Checkpoint Service copies the App state from the Active to the Standby instance via Checkpoint Saving, monitors the App via Failure Detection, and activates the Standby via Failover)
10
Background
• Service Availability Forum (SAF)
• Three different implementations:
– Non-collocated
– Collocated warm
– Collocated hot
11
Background
12
Checkpoint Services
(Diagram: the application-level CS runs an Agent alongside a state-aware application, while the system-level CS runs an Agent alongside a container holding an HA-agnostic application; each reports to a Checkpoint Manager)
13
Motivation
• Prior works presented either application-level [1] or system-level [2] checkpoint services
• There is a lack of consistent comparisons between these services
• No implementation follows the SAF standard
14
Motivation
• Carry out a performance evaluation between system-level and application-level checkpoint services, where both models follow the SAF standard, and evaluate the impact of different recovery modes on time and resource consumption
15
Answer three questions
• System-level ~= App-level?
• What is the impact of changing from non-collocated to collocated?
• What are the bottlenecks of the system-level and application-level services?
16
CHECKPOINT SERVICES
17
Application
• State-aware application
• A multi-tier stateful chat:
– Frontend: provides the interface and saves users' data
– Backend: saves room messages
– Database: stores information related to rooms and users
(Diagram: the Agent sends GET /state to the App and receives 200 OK)
18
Application
• State is provided via JSON (backend)
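As an illustration, a backend state serialized to JSON might look like this (the field names are hypothetical; the slides do not show the actual schema):

```python
import json

# Hypothetical backend state: room messages, roughly what a
# GET /state response body might carry (field names are illustrative).
backend_state = {
    "rooms": {
        "room-1": {"messages": ["hello", "anyone here?"]},
        "room-2": {"messages": []},
    }
}

payload = json.dumps(backend_state)   # what the agent would receive
restored = json.loads(payload)        # what the standby would load
```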
19
CS System-level
• We used well-known tools:
– LXC as the container runtime
– NFS as the shared file system
– rsync to transfer files between instances
– CRIU to checkpoint and restore containers
CS: Checkpoint Service! :D
20
CS System-level
• We did not implement collocated hot because CRIU does not allow restoring into a running instance
21
CS System-level
• Checkpoint in non-collocated
(Diagram: the Agent on the Active instance checkpoints the container and the state is stored with the Checkpoint Manager)
22
CS System-level
• Checkpoint in collocated warm
(Diagram: the Agent on the Active instance checkpoints the container and transfers the state directly to the Standby instance via rsync)
23
CS System-level
• Failover in non-collocated
(Diagram: on failure, the Checkpoint Manager restores the container on the Standby instance from the stored state)
24
CS System-level
• Failover in collocated warm
(Diagram: on failure, the Standby instance restores the container from the state previously received via rsync)
25
CS App-level
• The application-level CS was developed from scratch for this work
• REST resources
Remember, CS: Checkpoint Service! :D
GET http://{manager_ip}:{manager_port}/config
RESPONSE 200 OK Content-type: application/json
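The /config resource above can be sketched with the standard library (the payload fields are hypothetical; the slides only show the resource path and the response type):

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical /config payload; the real service's fields are not
# shown in the slides.
CONFIG = {"mode": "non-collocated", "checkpoint_interval_s": 30}

class ManagerHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/config":
            body = json.dumps(CONFIG).encode()
            self.send_response(200)
            self.send_header("Content-type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep output quiet
        pass

# Serve on an ephemeral port and fetch the resource once.
server = HTTPServer(("127.0.0.1", 0), ManagerHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/config"
with urllib.request.urlopen(url) as resp:
    config = json.loads(resp.read())
server.shutdown()
```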
26
CS App-level
• Checkpoint at application-level
(Diagram: the Agent collects the state of the state-aware App on the Active instance and sends it to the Checkpoint Manager (non-collocated) or directly to the Standby instance (collocated warm and collocated hot))
27
CS App-level
• Failover in non-collocated
(Diagram: the Checkpoint Manager sends the saved state to the Agent on the Standby instance, which loads it into the App)
28
CS App-level
• Failover in collocated warm
(Diagram: the Agent on the Standby instance loads the previously received state into the App)
29
CS App-level
• Failover in collocated hot
(Diagram: the App on the Standby instance already holds the state and takes over immediately)
30
EVALUATION
31
Evaluation
• Two evaluations were conducted:
– Evaluation I: failover time comparison
– Evaluation II: checkpoint time and resource consumption comparison
32
Evaluation
Physical Machines: 16 GB RAM, 8 cores, Gigabit Interface
33
Evaluation I
• Methodology
– Backend with 1, 5, 10, 15, 20, and 25 MB state sizes
– The Experiment Manager starts the experiment and generates a failure alert
– The failover process is executed
– The failover time is collected
34
Failover time – Non collocated
Application-level has a greater failover time
The growth is linear
35
Failover time – Non collocated
We estimated the failover time as the state size increases up to 100 MB
Application-level would be 66% faster
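The 100 MB estimate comes from fitting a line to the measured points; a minimal least-squares sketch, using illustrative numbers rather than the measured data:

```python
# Least-squares line fit and extrapolation to 100 MB of state.
# The (state_mb, failover_s) points below are ILLUSTRATIVE, not the
# measurements from the experiment.
state_mb = [1, 5, 10, 15, 20, 25]
failover_s = [0.8, 1.4, 2.0, 2.7, 3.3, 3.9]

n = len(state_mb)
mean_x = sum(state_mb) / n
mean_y = sum(failover_s) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(state_mb, failover_s))
    / sum((x - mean_x) ** 2 for x in state_mb)
)
intercept = mean_y - slope * mean_x

predicted_100mb = slope * 100 + intercept   # extrapolated failover time
```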
36
Failover time – Collocated
Application-level collocated warm is greatly impacted by the increase in state size
The values for application-level collocated hot and system-level collocated warm are very similar
37
Failover time – Collocated
Linear regression shows:
High increase for application-level collocated warm
Slight increase for system-level collocated warm
Constant values for collocated hot
38
Evaluation II
• Methodology
– As in the previous experiment, states are saved at the same state sizes
– The Experiment Manager triggers a checkpoint process
– The checkpoint time is collected
– Resource consumption is evaluated
39
Evaluation II
• Methodology
– Resource consumption metrics

Metric                  Measured in
Checkpoint Time         s
CPU Load                %
Memory Occupation       %
Network I/O Throughput  Mbps
Disk I/O Throughput     b/s
40
Evaluation II
Checkpoint times
41
Evaluation II – Active Instance

At 25 MB                 CPU     Memory   Network (I/O)   Disk (W)
Sys-lvl collocated warm  6.8%    9.4%     0/59.8 Mbps     1300 b/s
App-lvl collocated warm  2.7%    9.1%     0/8.8 Mbps      9220 b/s
App-lvl collocated hot   2.53%   9.5%     0/8.64 Mbps     8340 b/s

At 25 MB                 CPU     Memory   Network (I/O)   Disk (W)
Sys-lvl non-collocated   6%      9.1%     0/81 Mbps       1780 b/s
App-lvl non-collocated   2%      8.92%    0/11.6 Mbps     2410 b/s
42
Evaluation II – Standby Instance

At 25 MB                 CPU     Memory   Network (I/O)    Disk (W)
Sys-lvl collocated warm  1.8%    10.3%    5.1/0 Mbps       12500 b/s
App-lvl collocated warm  2.5%    11.9%    8.5/8.5 Mbps     7280 b/s
App-lvl collocated hot   4.1%    12.4%    8.35/8.35 Mbps   6900 b/s

At 25 MB                 CPU     Memory   Network (I/O)    Disk (W)
Sys-lvl non-collocated   0.16%   9.8%     0/0 Mbps         800 b/s
App-lvl non-collocated   0.2%    11.4%    0/0 Mbps         2600 b/s
43
Discussion
• Availability analysis over a year
• Mean Time To Recovery (MTTR) taken as the failover time
• Mean Time To Failure (MTTF) taken as that of an Apache Server (788.4 h/year) [3]
• Assuming that the failover time is 50 times greater
• High Availability (HA) = 99.999% (five nines)
44
Discussion
Availability analysis (25 MB)

                                   MTTR at     MTTR with      MTTF (s)   Availability with
                                   25 MB (s)   factor 50 (s)             factor 50 (%)
System-level collocated warm       0.38636     19.318         2838240    99.9993
Application-level collocated warm  1.27823     63.9115        2838240    99.997
Application-level collocated hot   0.25802     12.901         2838240    99.9995
System-level non-collocated        3.5441      177.205        2838240    99.9937
Application-level non-collocated   1.38795     69.3975        2838240    99.997
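The availability column follows from A = MTTF / (MTTF + MTTR); checking the first row of the 25 MB table:

```python
# Availability = MTTF / (MTTF + MTTR), checked against one row of the
# 25 MB table (system-level collocated warm, failover time x 50).
mttf_s = 2838240            # Apache MTTF: 788.4 h/year in seconds
mttr_s = 0.38636 * 50       # measured failover time scaled by factor 50

availability = mttf_s / (mttf_s + mttr_s) * 100
# availability is approximately 99.9993%, matching the table
```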
45
Discussion
Availability analysis (prediction up to 100 MB)

                                   MTTR at      MTTR with      MTTF (s)   Availability with
                                   100 MB (s)   factor 50 (s)             factor 50 (%)
System-level collocated warm       0.5902       29.51          2838240    99.9989
Application-level collocated warm  3.8621       193.1          2838240    99.993
Application-level collocated hot   0.2677       13.385         2838240    99.9995
System-level non-collocated        9.7999       489.995        2838240    99.9824
Application-level non-collocated   4.321        216.05         2838240    99.9923
46
CONCLUSIONS AND FUTURE WORK
47
Conclusions
Answering the questions
• System-level ~= App-level?
Yes! In collocated warm
48
Conclusions
• Impact of changing from non-collocated to collocated?
– Failover: great decrease
– Checkpoint: great increase
– Resource consumption: similar, except for CPU and disk (greater in collocated)
49
Conclusions
• Bottlenecks of the system-level and application-level?
– App: disk, CPU on standby (hot), and development time
– Sys: CPU, network, and NFS
50
Conclusions
• CS Application-level
– Private PaaS
– Apps with large state sizes and high checkpoint rates (massive online applications)
51
Conclusions
• CS System-level
– PaaS with legacy applications
– Apps with smaller state sizes and longer checkpoint intervals
52
Conclusions
• PaaS Business Model
– Non-collocated: free plans
– Collocated: premium plans
53
Contributions
• Short paper approved with results of Experiment I, entitled:
“Failover Time Evaluation Between Checkpoint Services in Multi-tier Stateful Applications”
IM-2017, Exp. Session (Qualis B1)
54
Future Work
As future work, we will study:
• Scalability of the services
• Resource consumption on the Experiment Instance
55
Acknowledgments
• Thanks!
#CatãoEterno
57
References
• [1] KANSO, Ali; LEMIEUX, Yves. Achieving high availability at the application level in the cloud. In: 2013 IEEE Sixth International Conference on Cloud Computing. IEEE, 2013. p. 778-785.
• [2] LI, Wubin; KANSO, Ali; GHERBI, Abdelouahed. Leveraging Linux containers to achieve high availability for cloud services. In: Cloud Engineering (IC2E), 2015 IEEE International Conference on. IEEE, 2015. p. 76-83.
• [3] MELO, R. M. D. et al. Redundant VoD streaming service in a private cloud: availability modeling and sensitivity analysis. Mathematical Problems in Engineering, Hindawi Publishing Corporation, v. 2014, 2014.
58
BACKUP
59
Agenda
• Introduction
• Checkpoint Services
• Evaluation
– Experiment I
– Experiment II
• Conclusion and Future Work
• Acknowledgments
60
Introduction
• PaaS poses several challenges, one of which is the availability of its services
• Multi-tier stateful applications
61
Introduction
• Many PaaS platforms do not have a mechanism that handles application failures
• Some offer a backup, but it is not transparent
62
Introduction
Tsuru only restarts the application, without saving its last state
63
VM vs. Container
• VMs
• Containerization
64
Objectives
• General
– Carry out a consistent comparison between checkpointing at system and application levels
• Specific
– Develop the two modes following the SAF standard
– Compare the services using the following metrics:
• Failover time
• Checkpoint time
• Load generated on the application
65
Application
• The application generates new base states if:
– a threshold defined by the developer has been reached
– a time limit has been reached

App: 20 new messages!
App: 120 seconds without updates!
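The two triggers can be sketched as a simple predicate (the thresholds are taken from the slide; the function name is illustrative):

```python
import time

# Illustrative trigger logic: generate a new base state when either
# 20 new messages arrive or 120 s pass without a checkpoint
# (thresholds from the slide; names are hypothetical).
MESSAGE_THRESHOLD = 20
TIME_LIMIT_S = 120

def should_checkpoint(new_messages, last_checkpoint, now=None):
    now = time.monotonic() if now is None else now
    return (new_messages >= MESSAGE_THRESHOLD
            or now - last_checkpoint >= TIME_LIMIT_S)
```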
66
CS System-level
67
CS System-level
• Checkpoint/Restore In Userspace (CRIU)
• Saves the memory context
• Freezes processes while reading memory
• Restores processes on machines with the same filesystem
68
CS System-level
• Phoenix!
69
Checkpoint Services Implementation
• URLs implemented by the chat
70
Checkpoint Services
• CS Application-level
(Diagram: a state-aware App and Agent run in each VM/Container on the Active and Standby instances; the Agent sends the state to the Checkpoint Manager (non-collocated) or directly to the Standby (collocated warm and hot))
71
Checkpoint Services
• CS System-level
(Diagram: an HA-agnostic App runs in a VM/Container on the Active and Standby instances; the Agent checkpoints the container in non-collocated, collocated warm, and collocated hot modes)
72
CS System-level
• LXC must be configured to allow CRIU to checkpoint and restore containers
73
Evaluation II
• Methodology
– Checkpoint time is presented as means with a 95% Confidence Interval (CI)
– Resource consumption values are means with 95% CI for the active and standby instances
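A mean with a 95% CI can be computed as mean ± z·s/√n; a minimal sketch using the normal approximation (z = 1.96), though the experiment may have used Student's t for small samples:

```python
import math
import statistics

# 95% CI for a sample mean, normal approximation (z = 1.96).
# The sample values below are illustrative, not experimental data.
def mean_ci95(samples):
    n = len(samples)
    mean = statistics.mean(samples)
    half = 1.96 * statistics.stdev(samples) / math.sqrt(n)
    return mean, mean - half, mean + half

mean, lo, hi = mean_ci95([0.38, 0.40, 0.37, 0.39, 0.41])
```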
74
CS System-level
• The checkpoint process in non-collocated:
– save the container via CRIU and store its memory context in a file system shared between the Manager and the Agent
• In collocated:
– save the container via CRIU and send the state via rsync to all standby instances
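The collocated step (CRIU dump plus rsync push) can be sketched as command construction; the container name, paths, and hosts are illustrative, and the snippet only builds the commands rather than running them:

```python
# Hypothetical command lines for the collocated checkpoint step:
# dump the container with CRIU (via lxc-checkpoint) and push the
# image directory to each standby with rsync. Illustrative only.
def checkpoint_commands(container, image_dir, standbys):
    dump = ["lxc-checkpoint", "-n", container, "-D", image_dir]
    pushes = [
        ["rsync", "-az", image_dir + "/", f"{host}:{image_dir}/"]
        for host in standbys
    ]
    return dump, pushes

dump, pushes = checkpoint_commands("chat-backend", "/tmp/ckpt", ["standby-1"])
```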
75
CS System-level
76
CS System-level
• Failover process (non-collocated)
77
CS System-level
• Failover process (collocated warm)
78
CS App-level
79
CS App-level
• Failover process (non-collocated)
80
CS App-level
• Failover process (collocated warm)
81
CS App-level
• Failover process (collocated hot)
82
Evaluation I
• T-test between application-level collocated hot and system-level collocated warm
83
Evaluation II
Network received (collocated modes)
84
Evaluation II
Network received (non-collocated)
85
Evaluation II
CPU Load (collocated modes)
86
Evaluation II
CPU Load (non-collocated)
87
Evaluation II
Memory occupation (collocated modes)
88
Evaluation II
Memory occupation (non-collocated)
89
Evaluation II
Network sent (collocated modes)
90
Evaluation II
Network sent (non-collocated)
91
Evaluation II
Disk written (collocated modes)
92
Evaluation II
Disk written (non-collocated)
93
Acknowledgments
• Family
• Friends
• Creators
• UFRPE
• Advisors (the best)
• CNPq and FACEPE