MesosCon Stateful Services

What Building Multiple Scalable DC/OS Deployments Taught Me about Running Stateful Services on DC/OSNathan Shimek - VP of Client Solutions at New ContextDinesh Israin – Senior Software Engineer at Portworx

Challenge Domains

• Platform Availability– Build in Resiliency– Monitoring and Metrics– Testing– Limit the Blast Radius

• Within the Cluster– Platform Security– Isolation– Maintenance

Challenge Domains

• Outside the Cluster– Routing– Load Balancing– Service Discovery

• Organizational– Adoption and User Experience– Rules and Controls– Training– Fostering the Right Skillsets

Platform Availability

Platform Availability, or Lack Thereof• There are lots of ways to negatively impact availability

(this isn’t comprehensive by any means)– Host failure – Zone outage– Loss of subnet connectivity / Network segmentation– Failed volume– Loss of storage connectivity– Storage driver failure– Runaway application– Unresponsive tasks– Orphaned tasks– Runaway job/app launching– Unresponsive app/job scheduler– Escaped bugs

Addressing Platform Availability

• Build in Resiliency – Go for HA right off the bat

• Multi-Master, Multi-Zone, separate subnets– Scalable architecture informs other automation

and tool choices• Platform is more resilient and availability leads to

adoption and happy users– Automated cluster build / re-build

• Resiliency continued– Ability to add and remove masters and

workers independently and easily• Operators should be safe in terminating at least one

node at a time– Execute a single command to recreate any

missing nodes• Newly created workers and masters should initialize

and join the cluster with no additional human intervention

• Test fault recovery features– Cause real world outages – In a production like environment

• Monitor for failure scenarios– Aligned with failure scenarios

• Infrastructure is multi-disciplinary, DC/OS is no exception

• Limit the blast radius– Isolation

• User applications from each other• Platform services from users• Platform services from each other

– Effective controls to enforce isolation• Be especially careful with inter-service

dependencies

Within the Cluster

• Platform Security– All of this is new so attacks are evolving rapidly– Engage the security team– Areas to Review:

• Marathon app and metronome job config• Privilege escalations• Sandbox escapes• Docker file

Within the Cluster

• Isolation– Limit damage users can cause to each other– Also limit damage users can cause to platform

services– Isolate platform services from each other to

avoid cascading failures

Within the Cluster

• Maintenance– Remember how you started with an HA

deployment? If you didn’t, now is the time to frown.

– Metrics and monitoring should detect these situations

– Alerts and pre-built workplans are necessary– Automated clean up jobs are ideal

Outside the Cluster

• Routing– Cross-app reverse proxies

• Public agents pre-populated with shared proxies • Public agents auto-scaled to maintain space for

spikes• Some high-traffic apps may still need their own

– External Port Management• In addition to IPAM, some means of managing ports

on external load balancers may be necessary

Outside the Cluster

• Service Discovery and Load Balancing– Synchronization with app events– Tune DNS cache times / service discovering

polling interval• Several common tools available for this

including:– mesosphere/marathon-lb– containous/traefik– gliderlabs/registrator– AVI Vantage

Organizational

• Adoption and User Experience– Treat dev like prod– Devs are first users– Integrations make or break the experience

Organizational

• Training– Formal training– Find experienced advocates– Developer training is a must

Organizational

• Fostering the Right Skillsets– Pairing with experienced engineers– Operations playground– Fail fast and fail often– Hack days

Organizational

• Rules and Controls– Fundamentally a shared resource– Beta user program– Organizational controls and ground rules

MesosCon Stateful Services

Documents

Stateful Breakpoints - Bodden

Stateful Functions @ BOSS

Performance evaluation between checkpoint services in multi tier stateful

Modeling Stateful Resources with Web ServicesModeling Stateful Resources with Web Services 3 Resource approach to declaring and implementing the association between a Web service and

Stateful Distributed Dataflow Graphs - SICSictlabs-summer-school.sics.se/2015/slides/stateful...Peter R. Pietzuch prp@doc.ic.ac.uk Stateful Distributed Dataflow Graphs: Imperative

Windows Server & System Center Futures—Bring Azure to your ...marketing.netrixllc.com/acton/attachment/29364/f... · presentation services stateless services. ... Stateful Services

MesosCon 2016: Frameworks in Harmony

SNAP: Stateful Network-Wide Abstractions for Packet Processingarashloo/SNAP-SIGCOMM16.pdf · SNAP: Stateful Network-Wide Abstractions for Packet Processing ... stateful operations

MesosCon 2016 - minimesos, the experimentation and testing tool for Apache Mesos

Building Event Driven Services with Stateful Streams

Microsoft brand templateazurebootcampdk.com/presentations/Bootcamp Cloud Patterns.pdf · Presentation services Stateful services Stateless services with related databases Model/Database

Stateful Connection Tracking & Stateful NAT - Open …openvswitch.org/support/ovscon2014/17/1030-conntrack_nat.pdfStateful Connection Tracking & Stateful NAT Justin Pettit VMware Thomas

Stateful services & service orchestration

DockerCon EU 2015: Persistent, stateful services with docker cluster, namespaces and docker volume magic

MesosCon Stateful Services · Dinesh Israin–Senior Software Engineer at Portworx ... –Maintenance. Challenge Domains • Outside the Cluster –Routing –Load Balancing –Service

Stateful(beans( - UniTrento

Stateful Data Types

Linux 2.4 stateful firewall designcarfield.com.hk/document/security/Linux+Stateful+Firewall.pdf · Linux 2.4 stateful firewall design ... Defining rules 6 4. Stateful firewalls 8

MesosCon 2015 Aug 20th Netflix Senior Software Engineer ... · Reactive stream processing, Mantis Cloud native Lightweight, dynamic jobs Stateful, multi-stage Real time, anomaly detection,

Apache Flink and More @ MesosCon Asia 2017