Upload
others
View
10
Download
0
Embed Size (px)
Citation preview
© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Adrian Cockcroft’s architecture trends and topics for 2021
Adrian Cockcroft
VP Cloud Architecture Strategy
AWS
A R C 2 1 3
© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Serverless first
Chaos engineering and failing over without falling over
Wardley Mapping for technology evolution
Hardware trends
Sustainability
Agenda
© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Serverless first
Trends: Microservices
MicroXchg Berlin event where a group of speakers including @adrianco adopted the term
Start with serverless microservices
Definition from a talk by @adrianco in 2014
© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
We chose microservices to get speed and agility . . .
So how long does it really take to build a new application from scratch?
AWS re:InventN O N - P R O F I T H A C K A T H O N F O R S O C I A L G O O D
SkyTruth
Best Friends Animal Society
Several teams
of ~four people
Non-profit organization
with a problem
Monday, 8 AM
Here’s a set of
problems. Pick
one and build a
solution by 8 PM
this evening.
“ “+ =
“ “SkyTruth
Best Friends Animal Society
Several teams
of ~four people
Non-profit organization
with a problem
Monday, 8 AM
Here’s a set of
problems. Pick
one and build a
solution by 8 PM
this evening.+ =
AWS re:InventN O N - P R O F I T H A C K A T H O N F O R S O C I A L G O O D
Where would you start?
Every team used
AWS Lambda 1 Less than
a day total2
Extremely functional
and scalable prototypes4Many team
members’ first time
using AWS Lambda
3
2016 AWS re:Invent HackathonA L I G H T B U L B M O M E N T
© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Note: See re:Invent 2019, SVS343
So . . . why doesn’t everyone use serverless first?
Language support
Scalability (new: 10 GB/6 vCPU)
Startup and network latency
Databases / storage interfacing
Security
State handling, event processing
Run duration
Complex configs (new: AWS Proton)
Portability (new: container support) Getting started
See my Serverless-First Function talk for a deep dive https://aws.amazon.com/blogs/compute/icymi-serverless-first-function/
Common concerns
Update S E R V E R L E S S S C A L E
AWS Lambda per-function
maximum disk space
Local
512 MB
Amazon EFS
Petabytes
47.9 TiB per file
June 2020 introduction of support for NFS mounting an
Amazon EFS file system – up to 25,000 concurrent connectionshttps://aws.amazon.com/blogs/compute/using-amazon-efs-for-aws-lambda-in-your-serverless-applications/
UpdateS E R V E R L E S S - F I R S T C O N T A I N E R S
New support for
common build pipelines
delivering containers for
both runtimes
Serverless AWS Fargate
re:Invent 2020 introduction of AWS Lambda support for container packaging
and AWS Proton delivery tooling makes serverless first an even better choice!
Easier transition from spiky
workloads to Always On capacity
Bottom line
Serverless is the fastest way
to build a modern application
If you object, let us know what
we should work on next!
© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Chaos engineering and failing over without falling over
Chaos engineering definition(Adrian’s concise version)
Experiment to ensure that the impact of failures is mitigated
Trends – chaos engineering – 10 years on . . .
2010 – First Netflix blog post about Chaos Monkey
Past Present Future
?
Disasterrecovery
Chaosengineering
Continuousresilience
Resilience
We build redundancy into systems so that if
something fails, we can fail over to an alternative
However, our ability to fail over is complex and
hard to test, so often the whole system falls over
How can we do better?
The last strand that breaks is not the cause of a failure!
Build resilient systems like a rope, not a chain, but make sure you know how
much margin you have and how “frayed” your system is
Infrastructure
Switching
Application
People
Chaos
engineering
team
To
ols
Security
red
team
To
ols
Four layers, two teams, an attitude
Experiment to ensure that the effect of failures is mitigated
Chaos architecture
New! AWS Fault Injection Simulator
Run experiments to ensure that both availability and performance impact of failures are mitigated
“You can’t legislateagainst failure; focus on
fast detection and response.”
—Chris Pinkham
Observability Control
Let’s see what we can learn from experts who have been
working on controlling safety critical systems for decades
Engineering a Safer WorldSystems Thinking Applied to Safety – 2012
Professor Nancy G. Leveson – MIT
STPA – Systems Theoretic Process Analysis
STAMP – Systems Theoretic Accident Model & Processes
http://psas.scripts.mit.edu for handbook and talks
STPA model
control diagramUnderstand hazards that
could disrupt successful
application processing
Observabilityand control
Data plane
Throughput
Control plane
Human controller
Model of automation
Model of controlled process
DisplaysControls
Control algorithmModel of
controlled process
Web service
SensorsActuators
Control action generation
Environmental inputs
Written / trained procedures
Completed actions
Customerrequests
Disturbances
Data plane
Throughput
Control plane
Human controller
Model of automation
Model of controlled process
DisplaysControls
Control algorithmModel of
controlled process
Web service
SensorsActuators
Control action generation
Environmental inputs
Written / trained procedures
Completed actions
Customerrequests
Disturbances
What happens
if there is a
big enough
disturbance
to break the
web service?
Large-scale failure
Application is “out of control,” many possibilities
Automation has failed
Network partition, no route connecting application to customers
Application crashed or corrupted, not easily restartable
Anything else you didn’t think of
Trigger failover to alternative system
Large-scale failures
Human controllers
Router control plane
Routing service
Control plane A
Web service A
Availability Zone A
Control plane B
Web service B
Availability Zone B
Control plane C
Web service C
Availability Zone C
Symmetry
and assertions• Services and data
are consistent across
three zones
• Zone failure modes
are independent
• Application should
work normally with
any zone offline
• Routing service
manages failover
ScenarioA W S A V A I L A B I L I T Y Z O N E S
Human controllers
Routing service
Control plane A
Web service A
Availability Zone A
Control plane B
Web service B
Availability Zone B
• Router control plane
detects offline AZ, stops
routing traffic
to it, and retries requests
on the online AZs
• Automated response;
what could go wrong?
Router control plane
ScenarioA W S A V A I L A B I L I T Y Z O N E S
Human controllers
Router control plane
Routing service
Control plane A
Web service A
Availability Zone A
Control plane B
Web service B
Availability Zone B
• Missing updates
• Zeroed
• Overflowed
• Corrupted
• Out of order
• Updates too rapid
• Updates infrequent
• Updates delayed
• Coordination problems
• Degradation over time
STPA hazardsS E N S O R M E T R I C S C H E C K L I S T
Routing control
plane doesn’t clearly
inform humans that
everything is taken
care of, and offline
zone delays and
breaks other metrics
with a flood of errors
Human controllers
Router control plane
Routing service
Control plane A
Web service A
Availability Zone A
Control plane B
Web service B
Availability Zone B
• Model mismatch
• Missing inputs
• Missing updates
• Updates too rapid
• Updates infrequent
• Updates delayed
• Coordination problems
• Degradation over time
STPA hazardsM O D E L P R O B L E M S C H E C K L I S T
Confused human
controllers disagree
among themselves
about whether
they need to do
something or not,
with floods of errors,
displays that lag
reality by several
minutes, and out-of-
date runbooks
Human controllers
Router control plane
Routing service
Control plane A
Web service A
Availability Zone A
Control plane B
Web service B
Availability Zone B
• Not provided
• Unsafe action
• Safe but too early
• Safe but too late
• Wrong sequence
• Stopped too soon
• Applied too long
• Conflicts
• Coordination problems
• Degradation over time
STPA hazardsH U M A N C O N T R O L A C T I O N C H E C K L I S T
Human controllers
should not need
to do anything!
However, they
are confused and
working separately,
trying to fix different
problems – some of
their tools don’t get
used often, and
are broken or
misconfigured to do
the wrong thing
Human controllers
Instead of failing
over, system
falls over
Most likelyresult
In-rush of extra
traffic from failed
zone, and extra work
from a cross zone
request retry storm
causes zones A and B
to struggle and
triggers a complete
failure of the
application –
meanwhile, the
routing service also
has a retry storm
and is impacted
How to fail over without falling over
Alert correlation
Floods of alerts need to be reduced to actionable insights (new: Amazon DevOps Guru)
Observability system needs to cope with floods without failing
Run regular chaos engineering experiments (new: AWS Fault Injection Simulator)
Retry storms – prevent work amplification
Reduce retries to zero except at subsystem entry and exit points
Reduce timeouts to drop orphaned requests
Route calls within the same zone
Symmetries
High level of automation, consistent configuration as code
Consistent instance types, services, versions, zones, and Regions
Principles
If it can be the same, make it look and act identically
If it’s different, make that clearly visible
Test your assumptions continuously
What about multi-Region?
Get AZ failover solid
before attempting
multi-Region
Use STPA to
analyze multi-Region-
specific hazards
Follow
Well-Architected
guide patterns
As data centers migrate to the cloud, fragile and manual disaster recovery processes have been standardized and automated
Testing failure mitigation has moved from a scary annual experience to automated
continuous resilience
© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Wardley Maps
Early adopters are using Wardley Maps
Maps always allow you to define position and movement
What is a Wardley Map?
Wardley Maps help define strategic context, barriers, and actions
Examples
• Evolution of technology from custom to utility• Technology stack migrations from data center to cloud• Make vs. buy decisions• Open-source product strategy
Mapping open source for Kubernetes
Too many
projects in the cloud-
native landscape
All in
different stages
of evolution
So let’s try and
put some of them
on a map!
CNCF landscapeH T T P S : / / L A N D S C A P E . C N C F . I O – N O T A M A P , A N E Y E C H A R T !
Projects outside CNCF CNCF projects Services
Mapping value chain against evolutionT H I S I S A M A P ! C L O U D - N A T I V E K U B E R N E T E S C O R E P R O J E C T S
Projects outside CNCF CNCF projects Services
New! AWS Grafana and Prometheus services R E M E M B E R : A L L M A P S A R E W R O N G , B U T S O M E M A P S A R E U S E F U L – U S E T O F A C I L I T A T E D I S C U S S I O N
Results from mapping
Successful products and services due to better:
Shared understanding of moves
Situational awareness
Application of doctrines
Strategic decision-making
© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hardware trends
Data that’s too big to process on one machine
What is “big data”?
But the machines keep getting bigger
EC2 UltraClusters4 , 0 0 0 + A 1 0 0 G P U S O N 4 0 0 G B I T L I N K S I N T O P E T A B I T - S C A L E F A B R I C
Gigabits
bandwidth
10,000
100
10
320 GB 40 GB1,100 GB24,000 GB4,400,000 GB
1,000
1
Memory capacity vs. bandwidth
Engulf your data
in memory to
reduce overhead
Internal memory bandwidth of A100
Terabyte+/second
Network bandwidth per node
comparable to Intel Cascade
Lake main memory bandwidth
EC2 UltraCluster
UltraCluster 4,000+ P4d
400 Gbit Ethernet
Network bandwidth
per node
AWS u24-tb1
8x Intel
Cascade Lake
main memory
Single
NVIDIA A100
1.2 TB/s
GPU memory
AWS P4d
8x NVIDIA A100
NVLINK 600 GB/s
GPU memory
AWS P4d
2x Intel Cascade Lake
main memory
Memory capacity
How can we take advantage of more memory?
In-memory analytics
(e.g., SAP HANA) is
mainstream today
Very large
graph databases(it’s hard to partition graphs)
Use the memory
interconnect as a low-
latency network?
Put all the pods
and instances on
the same node?
Fast replication to
another node, in a
different AZ, for
persistence?
Use shared
memory to avoid
serialization overhead?
Interesting work to watch
Twizzler
UC Santa Cruz / Daniel
Bittman – data-oriented
operating system
Brytlyt
Database that runs
completely inside the GPU
MemVerge
Distributed
memory objects
SQream
GPU accelerated
data warehouse
Zenotech
GPU-based CFD simulation
© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Sustainability
The Climate Pledge
Reach the Paris Agreement
BY 2040
Climate Pledge fund
$2 BILLION
Signatories
https://www.theclimatepledge.com
Renewable Energy
https://sustainability.aboutamazon.com/environment/sustainable-operations/renewable-energy
Reducing carbon by moving to AWS
https://sustainability.aboutamazon.com/environment/the-cloud
Amazon Sustainability Data Initiative significantly reduces the
cost, time, and technical barriers
associated with analyzing large datasets to
generate sustainability insights
Data Insights
Information Actionable knowledge
We
ath
er
& c
lim
ate
Re
mo
te s
en
sin
g
CMIP6 climate projections
LOCA downscaled climate projections
Atmospheric Models from Météo-France
NOAA’s Global Surface Summary of Day
HIRLAM Weather Model
NOAA Global Ensemble Forecast System (GEFS)
NOAA Global Forecast System (GFS)
NOAA High-Resolution Rapid Refresh (HRRR) Model
…
JMA Himawari-8
Landsat 8
NOAA GOES 16 & 17
Sentinel-1
MODIS MYD13A1, MOD13A1, MYD11A1, MOD11A1, MCD43A4
Terra Fusion Data Sampler
CBERS
…
Enabling easy access to petabytes of foundational data
Hosted free of charge in public Amazon S3 buckets
Managed and updated by data owners
Amazon Sustainability Data Initiative – ASDI
OS-Climate is an open-source platform to enable access to critical
data and models to support climate risk assessment
https://www.os-climate.org/
Climate risk: OS-Climate
CALL TO ACTION
Use AWS services and open data programs to architect solutions and achieve your sustainability goals
Look for opportunities to optimize your architecture
Share the solutions you build to support sustainability goals and collaborate with others in your industry and cross-sector
Resources
Related re:Invent 2020 talks
ARC207 – Architecting resiliency, recovery, and reliability: Capital One
ARC307 – Going global in minutes: Multi-Region expansion simplified (Autodesk)
ARC316 – Testing resiliency using chaos engineering
Related blog posts
“Failing over without falling over”:https://stackoverflow.blog/2020/10/23/adrian-cockcroft-aws-failover-chaos-engineering-fault-tolerance-distaster-recovery
Full Serverless-First talks: https://aws.amazon.com/blogs/compute/icymi-serverless-first-function
Response time variation: https://dev.to/aws/why-are-services-slow-sometimes-mn3
Retries and timeouts: https://dev.to/aws/if-at-first-you-don-t-get-an-answer-3e85
Paper
“Building Mission-Critical Financial Services Applications on AWS,” by Pawan Agnihotri with contributions by Adrian Cockcroft: https://d1.awsstatic.com/Financial%20Services/Resilient%20Applications%20on%20AWS%20for%20Financial%20Services.pdf
Book recommendations
Reading list at: http://a.co/79CGMfB
Adrian Cockcroft@adrianco
© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Well-Architected Framework
The official best practices for architecting in the AWS Cloud
aws.amazon.com/architecture/well-architected
AWS Well-Architected Labs
Hands-on labs to help you learn, measure, and build using architectural best practices
wellarchitectedlabs.com
AWS Architecture Center
Official AWS repository for all architecture resources
aws.amazon.com/architecture
AWS Solutions Library
Vetted reference implementations and Well-Architected patterns
aws.amazon.com/solutions
Architecture resources
Thank you!
© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Adrian Cockcroft
@adrianco
© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.