Live migrating a container: pros, cons and gotchas

Live migrating a container:pros, cons and gotchas

Pavel EmelyanovPrincipal engineer @ Virtuozzo

AgendaAgenda

• Why you might want to live migrate a container

• Why (and how) to avoid live migration

• Why is container live migration so complex

Migration in a nutshelMigration in a nutshel

• Save state

• Copy state

• Restore from state

Why you might want to live migrate a containerWhy you might want to live migrate a container

• Spectacular

• Load balancing

• Updating kernel

– Can avoid live migration, just C/R

• Updaring or replacing hardware

Why to avoid live migrationWhy to avoid live migration

How to avoid live migrationHow to avoid live migration

• Balance network traffic

• Microservices

• Crash-driven updates

• Planned downtime

Making live migration liveMaking live migration live

• State saving, transfering and restoring happens with tasks frozen

• (Big) memory transfer should not be done at that time

• Memory pre-copy

• Memory post-copy

Pre-copyPre-copy

• Track memory changes,

copy memory while tasks are running, goto again

• Pros:

– Safe: once migrated, source node can disappear

• Cons:

– Unpredictable: iterations may take long

– Non-guaranteed: “dirty” memory next round may remain big

Post-copyPost-copy

• Migrate all but memory, turn on “network swap” on destination

• Pros:

– Predictable: time to migrate can be well estimated

• Cons:

– Unsafe: src node death means death of container on destination

Live migration at lengthLive migration at length

• Memory pre-copy (iteratively, optional)

• Freeze + Save state

• Copy state

• Restore from state + Unfreeze and resume

• Memory post-copy (optional)

GotchasGotchas

Things to work withThings to work with

• VM

– Environment: virtual hardware, paravirt

– CPU

– Memory

• Container

– Environment: cgroups, namespaces

– Processes and other animals

– Memory

Memory pre-copyMemory pre-copy

• VM

– All memory at hands

– Plain address space

• Container

– Memory

● is scatered over the processes

● can be (or can be not) shared

● can be (or can be not) mapped to disk files

Save stateSave state

• VM

– Hardware state

● Tree of ~100 objects

● Fixed amount of data per each

• Container

– State of all objects

● Graph of up to ~1000 objects

● All have different amount of data, different reading API

Restore from stateRestore from state

• VM

– Copy memory in place, write state into devices

• Container

– Creation of many small objects

– Not all have sane API for creation

● Creation sequence can be non-trivial

Memory post-copyMemory post-copy

• UserfaultFD from Andrea Archangeli

• VM

– Merged into 4.2

• Container

– Non-cooperative work of uffd monitor and client,

need further patching

And we also need this, this and this!And we also need this, this and this!

• Check for CPUs compatibility

• Check and load necessary kernel modules (iptables, filesystems)

• Non-shared filesystem should be copied

• Roll-back on source node if something fails in between

– Keep tasks frozen after dump, kill after restore

ImplementationImplementation

• CRIU

– Save & restore state

– Memory pre/post copy

• P.Haul

– Checks

– Orchestrate all C/R steps

– Deal with filesystem

P.Haul goalsP.Haul goals

• Provide engine for containers live miration using CRIU

• Perform necessary pre-checks (e.g. CPU compatibility)

• Organize memory pre-copy and/or post-copy

• Take care of file-system migration (if needed)

Under the hoodUnder the hood

CRIU CRIUp.haul p.hauldocker -d docker -dmigrate

src dst

check (CPUs, kernels)

pre-dumpmemory

other images

restore

memorylazy mem

FS copy

pre-cop ypost-co py

freezetime

More infoMore info

• http://criu.org

• http://criu.org/P.Haul

• criu@openvz.org

• +CriuOrg / @__criu__

• https://github.com/xemul/(criu|p.haul)

Thank you!Pavel Emelyanov@__criu__xemul@openvz.org

Live migrating a container: pros, cons and gotchas

Technology

Angular Gotchas - Windows · PDF file · 2016-01-26Angular Gotchas Best of Web ... $scope.lastName = ‘Emo’ fullName (Directive) $scope.age=31 ... app.directive("fullName", function(){

Cognos Analytics Implementation Tips, Tricks & Gotchas

The Top Most Common SystemVerilog Constrained Random Gotchas · Contribution •Illustrates top most common SystemVerilog and UVM constrained random gotchas •Helps –Eliminate/reduce

Verilog and SystemVerilog Gotchas - download.e-bookshelf.de · Verilog and SystemVerilog Gotchas 101 Common CodingErrorsand How to Avoid Them. Stuart Sutherland DonMills Verilog and

Potential gotchas in making a backbone app

Ruby Gotchas

HIDDEN “GOTCHAS,” AND MISUNDERSTOOD PROVISIONS...CRACKING THE CCPA CODE: COMPLYING IN THE MIDST OF ITS AMBIGUITIES, HIDDEN “GOTCHAS,” AND MISUNDERSTOOD PROVISIONS Kristen Mathews

C++ Gotchas - pearsoncmg.comptgmedia.pearsoncmg.com/images/9780321125187/samplepages/... · C++ Gotchas Avoiding Common Problems in Coding and Design Stephen C. Dewhurst Boston •San

Structural Authorizations in SAP HR With Gotchas

Kln Tokyo · 2012-04-06 · Schema Design Gotchas in MySQL 131 Normalization and Denormalization 133 Pros and Cons of a Normalized Schema 134 Pros and Cons of a Denormalized Schema

Certain Non-UCC and UCC 'Gotchas' for the Unwary …apps.americanbar.org/buslaw/committees/CL190016pub/... · CERTAIN NON-UCC AND UCC "GOTCHAS" FOR THE ... Characterization of the

Top 10 Enterprise Mobile App Development Gotchas

Condor Submissions: Gotchas

How to avoid Go gotchas - divan's blog · How to avoid Go gotchas by learning internals Ivan Danyliuk, Codemotion Milano 26 Nov 2016

Java™ Platform Concurrency Gotchas - Oracle · Java™ Platform Concurrency Gotchas Alex Miller Terracotta. Questions to answer > What are common concurrency problems? > Why are

Neglecting Fax When Migrating to IPhosteddocs.ittoolbox.com › eFax Neglecting Fax In VoIP... · 2014-01-07 · Neglecting Fax When Migrating to IP. ... the pros and cons of this

Concurrency gotchas

Usability and Accessibility CSS Gotchas

JavaScript gotchas

Plugins 2.0 & OSGi Gotchas - Atlassian Summit 2010