Software Architecture for Cloud Infrastructure · ohar/materiaali2016/vierailuluento_rautonen_2016.pdf

@trautonen Tapio Rautonen

Software Architecture for Cloud Infrastructure


Tapio Rautonen, software architect

over 5 years of experience with different cloud platforms

co-created AWS course for developers

Heroku workshops for students


Cloud computing characteristics

● On-demand self-service
– Consumer can provision computing capabilities without requiring human interaction
● Broad network access
– Capabilities are available over the network and accessible by heterogeneous clients
● Resource pooling
– Provider's computing resources are pooled to serve multiple consumers dynamically
● Rapid elasticity
– Capabilities can be elastically provisioned and appear unlimited to the consumer
● Measured service
– Resources are automatically controlled and optimized by metering capabilities


Software architecture principles

● Intentional architecture with emergent design
● High modularity
– high cohesion, loose coupling
– low algorithmic complexity
● Well described elements
– expressive and meaningful names and APIs
– clean code
● Passes all defined tests or acceptance criteria
● Lightweight documentation


Distributed computing fallacies

1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous

Peter Deutsch, Sun Microsystems


SaaS architecture methodology

● Declarative formats for setup and runtime automation
● Clean contract with infrastructure for maximum portability
● Cloud platform deployments, obviating the need for ops
● Tooling, architecture and dev practices support scaling

Modern software is delivered from the cloud to heterogeneous clients on-demand


AWS reference architecture


Real world architecture

“Complex” monolith becomes distributed pile of crap


Service discovery

● Services need to know about each other
– no centralized service bus
– smart endpoints and client-side load balancing
● Service registry is the new single point of failure?
– value availability over consistency
● Provides a limited set of well defined features
– services notify each other of their availability and status
– easy integration with standard protocols like HTTP or DNS
– notifications on services starting and stopping
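
The registry and client-side load balancing described above could be sketched as follows. This is a minimal in-memory illustration, not a real registry product: the class names `ServiceRegistry` and `RoundRobinClient`, the heartbeat TTL, and the injectable `clock` are all assumptions made for the example.

```python
import itertools
import time


class ServiceRegistry:
    """In-memory registry: instances send heartbeats and expire after a TTL."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_seen = {}  # (service, address) -> time of last heartbeat

    def heartbeat(self, service, address):
        # Registering and renewing are the same operation.
        self._last_seen[(service, address)] = self._clock()

    def deregister(self, service, address):
        self._last_seen.pop((service, address), None)

    def lookup(self, service):
        # Only instances with a recent heartbeat count as available.
        now = self._clock()
        return sorted(
            addr for (name, addr), seen in self._last_seen.items()
            if name == service and now - seen <= self._ttl
        )


class RoundRobinClient:
    """Client-side load balancing: the caller picks the instance itself."""

    def __init__(self, registry, service):
        self._registry = registry
        self._service = service
        self._counter = itertools.count()

    def next_address(self):
        instances = self._registry.lookup(self._service)
        if not instances:
            raise LookupError(f"no live instance of {self._service}")
        return instances[next(self._counter) % len(instances)]
```

Expiring silent instances instead of requiring an explicit deregistration is one way to favor availability: a crashed service simply drops out of lookups once its heartbeats stop.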


Autoscaling

● Adapting to changing workloads
– optimize capacity and operational cost
– increase failure resilience
● Requires capturing key performance metrics
– response times, queue sizes, CPU and memory utilization
● Decision logic based on scaling metrics
– when to scale up and down, prevent scaling oscillation
● Application must be designed for scaling
– stateless, immutable, automatically provisioned
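
One simple form of the decision logic above is a pair of thresholds plus a cooldown period that dampens oscillation. The sketch below assumes CPU utilization as the only metric; the `Autoscaler` class, its thresholds, and the cooldown length are illustrative values, not recommendations.

```python
import time


class Autoscaler:
    """Threshold-based scaling with a cooldown to dampen oscillation."""

    def __init__(self, min_instances=1, max_instances=10,
                 scale_up_above=0.75, scale_down_below=0.30,
                 cooldown_seconds=300.0, clock=time.monotonic):
        self.instances = min_instances
        self._min, self._max = min_instances, max_instances
        self._up, self._down = scale_up_above, scale_down_below
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._last_change = None  # time of the most recent scaling action

    def observe(self, cpu_utilization):
        """Feed one averaged CPU sample (0.0 to 1.0); return the desired instance count."""
        now = self._clock()
        in_cooldown = (self._last_change is not None
                       and now - self._last_change < self._cooldown)
        if in_cooldown:
            return self.instances  # ignore samples until the last change settles
        if cpu_utilization > self._up and self.instances < self._max:
            self.instances += 1
            self._last_change = now
        elif cpu_utilization < self._down and self.instances > self._min:
            self.instances -= 1
            self._last_change = now
        return self.instances
```

The gap between the two thresholds and the cooldown both serve the same purpose: a single noisy sample should not trigger a scale-up that is immediately undone.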


Ephemeral runtime environments

● Short lifetime of an application runtime environment
– scaling, testing, materializing ideas
– requires highly automated infrastructure
● Nothing can be stored in the runtime environment
– logs, file uploads, database storage files, configuration
● Results in stateless services
– optimal for horizontal scaling
– integrates with State as a Service
● Must be repeatable and automatically provisioned


Asynchronous messaging

● Key strategy for services to communicate and coordinate
– decouple consumer process from the implementing service
– enables scalability and improves resilience
● Basic messaging patterns
– one-way message
– request and response
– broadcast message
● Numerous implementation concerns
– ordering, grouping, repeating, expiration, idempotency and scheduling
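
Two of the implementation concerns above, repeated delivery and expiration, can be handled on the consumer side. The following is a sketch under assumptions: the `Message` envelope fields and the `IdempotentConsumer` class are hypothetical, and the set of seen IDs is kept in memory only to keep the example short.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    message_id: str    # unique ID, used to detect redelivery
    body: str
    expires_at: float  # absolute time after which the message is stale


class IdempotentConsumer:
    """Processes each message at most once, even if it is delivered again."""

    def __init__(self, handler, clock=time.time):
        self._handler = handler
        self._clock = clock
        self._seen = set()  # a real service would keep this in a durable store

    def receive(self, message):
        if message.expires_at <= self._clock():
            return "expired"
        if message.message_id in self._seen:
            return "duplicate"
        self._handler(message.body)
        self._seen.add(message.message_id)
        return "processed"
```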


Data consistency

● All instances of the application see the exact same data
– strong consistency
● An application instance might see data from an operation still in flight
– eventual consistency
● Distributed data stores are subject to the CAP theorem
– consistency, availability, partition tolerance
– only two of the three properties can be guaranteed at the same time
● Recovering from failures of eventually consistent data
– compensating logic with idempotent commands


Metrics and logging

● Ephemeral and elastic systems
– require central awareness of state
● Gain an understanding of how the services are used
– plan for future requirements
– gather scaling metrics
– bill customers for usage (pay-per-use)
– detect faulty behavior
● Balance between value provided and cost of collecting
– robustness of the metering system affects profitability
– collect end-to-end scenarios rather than operational factors


Configuration management

● Externalize configuration out of the runtime environment
– repeatable, versioned
● Local configuration pitfalls
– limited to a single application, hard to manage across multiple instances
● Runtime reconfiguration
– application can be reconfigured without redeployment or restart
– minimize downtime, enable feature flags, help debugging
– thread safety and performance are concerns
– prepare for rollbacks and unavailability of the configuration store
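
Runtime reconfiguration with a fallback for an unavailable configuration store might look like the sketch below. The `RuntimeConfig` class and its `fetch` callable are assumptions for illustration; `fetch` stands in for whatever client reads the external store.

```python
import threading


class RuntimeConfig:
    """Holds the last successfully loaded configuration and swaps it atomically."""

    def __init__(self, fetch, defaults):
        self._fetch = fetch            # callable returning a dict; may raise
        self._lock = threading.Lock()
        self._current = dict(defaults)

    def refresh(self):
        """Reload from the store; keep the previous values if the store is down."""
        try:
            fresh = dict(self._fetch())
        except Exception:
            return False               # store unavailable: keep last known good values
        with self._lock:
            self._current = fresh      # readers see either the old or the new dict
        return True

    def get(self, key, default=None):
        with self._lock:
            return self._current.get(key, default)
```

Replacing the whole dictionary under a lock, rather than mutating it key by key, means a reader never observes a half-applied update, and a failed refresh leaves the running application untouched.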


Software erosion

● Slow deterioration of software leading to faulty behavior
● Fighting erosion is more expensive than usually admitted
● Erosion-resistance comes from separation of concerns
– application
– infrastructure
● Clear contract of services provided by infrastructure
– change in infrastructure does not break the contract
– application can change within its respective realm
● Solutions against erosion
– Platform as a Service, container virtualization


Design for failure


New era of design patterns

● Cache-Aside
● Circuit Breaker
● Compensating Transaction
● Command and Query Responsibility Segregation (CQRS)
● Event Sourcing
● Queue-Based Load Leveling
● Sharding
● Throttling


Cache-Aside pattern

● Aggregated search combining multiple services
– requires additional search cache (Solr, ElasticSearch, ...)
● Improve performance of frequently read data
● Local caching results in inconsistent state between instances
● Consistency of data stores and cache is really hard to maintain
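
The basic read and write paths of cache-aside can be sketched with two dict-like objects. The `CacheAsideRepository` name is hypothetical, and the plain dictionaries stand in for a real data store and a shared cache.

```python
class CacheAsideRepository:
    """Reads go through the cache; writes update the store and invalidate the cache."""

    def __init__(self, store, cache):
        self._store = store  # system of record, any dict-like object
        self._cache = cache  # shared cache, any dict-like object

    def read(self, key):
        if key in self._cache:
            return self._cache[key]     # cache hit
        value = self._store[key]        # cache miss: load from the store
        self._cache[key] = value        # populate for later readers
        return value

    def write(self, key, value):
        self._store[key] = value
        self._cache.pop(key, None)      # invalidate rather than update
```

Even this small version shows where consistency gets hard: between the store write and the invalidation, another instance can still read the stale cached value.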


Circuit Breaker pattern

● Systems fail in ways beyond imagination
– prevent failures from cascading
– allow system to operate in degraded mode
● Suitable for large microservice architectures
– creates routing complexity and overhead
– potential single point of failure must be highly available
● Enables central logging and metrics
– dashboards and central state awareness
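
The core of a circuit breaker is a small state machine: closed, open, and half-open. The sketch below is a generic illustration and not how Hystrix or any specific library implements it; `CircuitBreaker`, `CircuitOpenError`, and the default thresholds are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("failing fast, dependency is unhealthy")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result
```

A caller that catches `CircuitOpenError` can serve a cached or default response instead, which is what lets the system keep operating in degraded mode.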


Netflix Hystrix Dashboard


Compensating Transaction pattern

● Irrecoverable failures in distributed systems are hard
– eventual consistency, rollbacks are impossible
● Distributed transactions (XA)
– difficult and complex to implement, and still not bulletproof
– not usable for generic REST services
● Undo the effects of the original operation
– defines eventually consistent steps for a reverse operation
– compensation logic may be difficult to generate
– operations should be idempotent to prevent further catastrophe
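
A compensating transaction can be sketched as a list of steps, each paired with the operation that undoes it. The helper name `run_with_compensation` is hypothetical, and the sketch assumes every compensation is idempotent and does not itself fail.

```python
def run_with_compensation(steps):
    """Run (action, compensation) pairs; on failure, undo completed steps in reverse.

    Compensations are assumed to be idempotent, so retrying an undo is safe.
    """
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception:
        # Only steps that finished are compensated; the failed step is not.
        for compensation in reversed(completed):
            compensation()
        return False
    return True
```

For example, if a seat reservation and a hotel booking succeed but the payment step raises, the hotel is cancelled first and the seat released second. A production version would also have to persist its progress so that compensation can resume after a crash.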


CQRS pattern

● Command and Query Responsibility Segregation
– segregates read and write operations with separate interfaces
– allows maximizing performance, scalability and security
● Introduces flexibility at the cost of complexity
– traditionally the same DTO is used for read and write operations
– different data model for read (query) and write (command)
– supports different read and write data stores
– not suitable for simple business rules where CRUD is sufficient
● Often used together with the event sourcing pattern
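
The split between a command interface and a query interface can be shown in a few lines. All names here (`RenameProduct`, `ProductCommandHandler`, `ProductQueries`) are hypothetical, and two dictionaries stand in for separate write and read stores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RenameProduct:  # command: expresses an intent to change state
    product_id: int
    new_name: str


class ProductCommandHandler:
    """Write side: validates commands and updates the write store."""

    def __init__(self, write_store, read_view):
        self._write_store = write_store
        self._read_view = read_view

    def handle(self, command):
        if not command.new_name.strip():
            raise ValueError("name must not be empty")
        self._write_store[command.product_id] = {"name": command.new_name}
        # Project the change into the denormalized read view. In a distributed
        # setup this would typically happen asynchronously.
        self._read_view[command.product_id] = f"#{command.product_id} {command.new_name}"


class ProductQueries:
    """Read side: no business rules, only lookups against the read view."""

    def __init__(self, read_view):
        self._read_view = read_view

    def display_name(self, product_id):
        return self._read_view.get(product_id)
```

The read view holds data in the shape the queries need, which is a different model from the one the write side validates and stores.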


Event Sourcing pattern

● Append-only store of events that describe actions for data
– simplifies tasks in complex domains by avoiding synchronization
– improves performance, scalability and consistency for transactional data
● Maintains full audit trail and history
– enables compensation actions
– supports playback to any point in time
● Events are simple, but the operation logic is not
– updates and deletes must be implemented with compensation
– “at least once” publication requires idempotent consumers
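
An append-only event log with playback to an earlier point might be sketched like this. The bank-account domain, the `AccountEventStore` class, and the event kinds are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    sequence: int  # position in the append-only log
    kind: str      # "deposited" or "withdrawn"
    amount: int


class AccountEventStore:
    """Append-only log; the balance is derived by replaying events."""

    def __init__(self):
        self._events = []

    def append(self, kind, amount):
        event = Event(len(self._events) + 1, kind, amount)
        self._events.append(event)
        return event

    def balance(self, up_to_sequence=None):
        """Replay events, optionally only up to a past point in the log."""
        total = 0
        for event in self._events:
            if up_to_sequence is not None and event.sequence > up_to_sequence:
                break
            total += event.amount if event.kind == "deposited" else -event.amount
        return total
```

Nothing is ever updated or deleted: a mistaken withdrawal is corrected by appending a compensating deposit, and the earlier state remains available through `up_to_sequence`.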


Queue-Based Load Leveling pattern

● Buffer between task and service
– minimizes the impact of workload peaks
– a flood of tasks may make the service unresponsive or cause it to fail
● Task provider and service run asynchronously
– queue decouples tasks from the service
– service can handle tasks at its own optimal pace
– requires a mechanism for responses if the task expects a reply
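
Within a single process, the buffering idea can be demonstrated with the standard library's `queue.Queue`. This is only an analogy for a real message queue between services; `start_worker`, the queue size, and the squaring handler are assumptions for the example.

```python
import queue
import threading


def start_worker(tasks, results, handle):
    """Drain the queue at the service's own pace on a background thread."""
    def run():
        while True:
            task = tasks.get()
            if task is None:      # sentinel: stop the worker
                tasks.task_done()
                break
            results.append(handle(task))
            tasks.task_done()
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


# A bounded queue absorbs bursts; producers block when it is full.
tasks = queue.Queue(maxsize=100)
results = []
worker = start_worker(tasks, results, handle=lambda n: n * n)

for n in range(10):               # a burst of work arrives at once
    tasks.put(n)
tasks.put(None)
tasks.join()                      # wait until every queued task is handled
```

The shared `results` list is a crude stand-in for the reply mechanism mentioned above; between services that would usually be a separate reply queue or a callback.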


Sharding pattern

● Divide data store into multiple horizontal partitions
● Overcomes limitations of a single-server data store
– finite storage space
– computing resources for a large number of concurrent users
– performance governed by network bandwidth, or geographically constrained storage
● Strategy defines the sharding key and data distribution
– wrong sharding strategy results in bad performance
– referential integrity and consistency are hard to maintain
● Configuring and managing a big set of shards is a challenge
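
One common sharding strategy is to hash the sharding key and take the result modulo the number of shards. The sketch below assumes that strategy; `shard_for` and `ShardedStore` are hypothetical names, and in-memory dictionaries stand in for separate database servers.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a sharding key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    def __init__(self, shard_count):
        self._shards = [{} for _ in range(shard_count)]

    def put(self, key, value):
        self._shards[shard_for(key, len(self._shards))][key] = value

    def get(self, key):
        return self._shards[shard_for(key, len(self._shards))].get(key)
```

A cryptographic hash is used instead of Python's built-in `hash()` because the latter is randomized per process for strings, so instances would disagree about where a key lives. Note that plain modulo hashing remaps most keys when the shard count changes, which is part of why managing shards is a challenge.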


Throttling pattern

● Controls the consumption of resources used by a service
– allows the system to function and meet its SLA under extreme load
● Throttle after a soft limit of resource usage is exceeded
– reject requests from users who exceed the soft limits
– disable or degrade functionality of nonessential services
– queue-based load leveling with priority queues
● Throttling is an architectural decision
– must be detected and performed very quickly
– services should return a specific error code for clients
– can be used as an interim measure while autoscaling
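
Per-user request rejection is often implemented with a token bucket. The sketch below assumes that technique and uses HTTP 429 (Too Many Requests) as the specific error code; `TokenBucket` and `handle_request` are hypothetical names.

```python
import time


class TokenBucket:
    """Allows `rate` requests per second per user, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._buckets = {}  # user -> (tokens, time of last refill)

    def allow(self, user):
        now = self._clock()
        tokens, last = self._buckets.get(user, (self._capacity, now))
        # Refill in proportion to elapsed time, never above capacity.
        tokens = min(self._capacity, tokens + (now - last) * self._rate)
        if tokens >= 1:
            self._buckets[user] = (tokens - 1, now)
            return True
        self._buckets[user] = (tokens, now)
        return False


def handle_request(limiter, user):
    # 429 tells the client it was throttled, not that the service is broken.
    return 200 if limiter.allow(user) else 429
```

The check is a dictionary lookup and a little arithmetic, which matters because throttling has to be decided very quickly, before the request consumes the resources it is meant to protect.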


Cloud architecture pitfalls

● Failures do cascade
– even without a single point of failure
● Multi-service search is hard to get right
– cache-aside issues
● Never rely on unreliable message delivery
– use asynchronous messaging with persistent stores
● Monolith has one big problem
– microservices will generate a lot of small (and big) problems


Reach for the skies

● Distributed systems are hard to build
– no silver bullet exists (sorry to disappoint again)
● Cloud infrastructure drives towards microservices
– learn new design patterns during the journey
– an automated system requires less ops and offers more resilience
● Do you think Netflix did it right the first time?
– learn from failure
– design for failure
● Cloud native applications are the future


Thank you

Always looking for great talent
https://gofore.com/liity-joukkoon/
