© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Alexander Filipchik – Principal Engineer, Sony Interactive Entertainment
Dustin Pham – Principal Engineer, Sony Interactive Entertainment
David Green – Enterprise Solutions Architect, Amazon Web Services
Moving Mission-Critical Apps from One
Region to Multi-Region active/active
November 30, 2016
ARC309
What to expect from the session
• Architecture Background
• AWS global infrastructure
• Single vs Multi-Region?
• Multi-Region AWS Services
• Case Study: Sony’s Multi-Region Active/Active Journey
• Design approach
• Lessons learned
• Migrating without downtime
AWS Global Infrastructure
AWS worldwide locations
Region (14)
Coming Soon (4)
AWS worldwide locations
Region topology
[Diagram: redundant transit centers connecting the Availability Zones within a region; each AZ links to both transit centers]
Single region high-availability approach
• Leverage multiple Availability Zones (AZs)
Availability Zone A Availability Zone B Availability Zone C
us-east-1
Reminder: Region-wide AWS services
• Amazon Simple Storage Service (Amazon S3)
• Amazon Elastic File System (Amazon EFS)
• Amazon Relational Database Services (RDS)
• Amazon DynamoDB
• And many more…
OK … should I use Multi-Region?
Good Reasons for Multi-Region
• Lower latency to a subset of customers
• Legal and regulatory compliance (e.g., data sovereignty)
• Satisfy disaster recovery requirements
AWS Multi-Region services
Multi-Region services
• Amazon Route 53 (Managed DNS)
• S3 with cross-region replication
• RDS multi-region database replication
• And many more…
• EBS snapshots
• AMI
Amazon Route 53
• Health checks
• Send traffic to healthy infrastructure
• Latency-based routing
• Geo DNS
• Weighted Round Robin
• Global footprint via 60+ POPs
• Supports AWS and non-AWS resources
prod-1 prod-2
95% 5%
example.net
health
health
+
weight
Example: Weighted with failover
prod.examp.net
examp-fail.s3-website
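The weighted-with-failover pattern above can be sketched programmatically. This is a minimal sketch that only builds the change-batch payload a Route 53 API call (e.g., boto3's change_resource_record_sets) would accept; the record names, weights, and health check IDs mirror the slide's prod-1/prod-2 example and are placeholders, not real resources.

```python
# Build the Route 53 change batch for weighted records with health checks.
# Nothing here calls AWS; it only constructs the request payload.

def weighted_records(name, endpoints):
    """endpoints: list of (set_id, dns_target, weight, health_check_id)."""
    changes = []
    for set_id, target, weight, health_check in endpoints:
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "CNAME",
                "SetIdentifier": set_id,        # distinguishes the weighted records
                "Weight": weight,               # relative share of traffic
                "HealthCheckId": health_check,  # record is pulled when unhealthy
                "TTL": 60,
                "ResourceRecords": [{"Value": target}],
            },
        })
    return {"Changes": changes}

# 95/5 split between two production stacks, as on the slide:
batch = weighted_records("prod.examp.net", [
    ("prod-1", "prod-1.examp.net", 95, "hc-prod-1"),
    ("prod-2", "prod-2.examp.net", 5, "hc-prod-2"),
])
```

When prod-1's health check fails, Route 53 stops returning its record and the remaining weight (prod-2) absorbs the traffic.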
S3 – cross-region replication
Automated, fast, and reliable asynchronous replication of data across AWS regions
• Only replicates new PUTs. Once S3 is configured, all new uploads into a source bucket will be replicated
• Entire bucket or prefix based
• 1:1 replication between any 2 regions / storage classes
• Transition S3 ownership from primary account to sub-account
Use cases:
• Compliance—store data hundreds of miles apart
• Lower latency—distribute data to regional customers
• Security—create remote replicas managed by separate AWS accounts
Source
(Virginia)
Destination
(Oregon)
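The Virginia-to-Oregon setup above is configured on the source bucket. As a sketch, this builds the replication-configuration dict that the S3 API (e.g., boto3's put_bucket_replication) expects; the role ARN and destination bucket are placeholders.

```python
# Construct an S3 cross-region replication configuration.
# An empty prefix replicates the entire bucket; a non-empty prefix
# limits replication to matching keys.

def crr_config(role_arn, dest_bucket_arn, prefix=""):
    return {
        "Role": role_arn,  # IAM role S3 assumes to copy objects
        "Rules": [{
            "ID": "replicate-" + (prefix or "all"),
            "Prefix": prefix,
            "Status": "Enabled",
            "Destination": {
                "Bucket": dest_bucket_arn,
                "StorageClass": "STANDARD",
            },
        }],
    }

cfg = crr_config(
    "arn:aws:iam::123456789012:role/crr-role",   # placeholder role
    "arn:aws:s3:::my-bucket-oregon",             # placeholder destination
)
```

Remember the slide's caveat: only objects PUT after the configuration is applied are replicated; existing objects must be copied separately.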
RDS cross-region replication
• Move data closer to customers
• Satisfy disaster recovery requirements
• Relieve pressure on database master
• Promote read-replica to master
• AWS managed service
RDS cross-region replication
Leverage existing resources
Many resources exist
AWS Reference Architecture Implementation Guides
What to expect from the session
• Architecture Background
• AWS global infrastructure
• Single vs Multi-Region?
• Enabling AWS services
• Case Study: Sony Multi-Region Active/Active
• Design approach
• Lessons learned
• Migrating without downtime
Who is talking?
Alexander Filipchik (PSN: LaserToy) – Principal Software Engineer at Sony Interactive Entertainment
Dustin Pham – Principal Software Engineer at Sony Interactive Entertainment
Our active/active story
Small team, large responsibility
• Service team ran like a startup
• Less than 10 core people working on new PS3 store
services
• PSN’s user base was already several hundred million users
• Relied on quick iterations of architecture on AWS
Social
Video
Commerce
The Year of VR
Multiple new virtual reality platform launches of varying experience level (e.g., Cardboard)
Transforming the store
Delivered new store
• Great job, now onto the PS4
• PS4 launch – 1 million users at once on Day 1, Hour 1
• Designing for many different use cases at scale
Architecture phases
Proof of Concept → Scale → Make Highly Available → Optimize
Next step: make highly available
• Highly available for us: multiregion active/active
• Raising key questions:
• How does one move a large set of critical apps with
hundreds of terabytes of live data?
• How did we architect every aspect to allow for multiregional,
active-active?
• How do we turn on active-active without user impact?
• User impact includes hardware (PS3/PS4/etc.) and game partners!
• Where do we even begin?
Starting with applications
Applications
• First question to answer: What does it mean to be
multiregional?
• Different people had different answers:
• Active/stand-by vs. active/active
• Full data replication vs. partial
• Automatic failover vs. manual
• Etc.
After some healthy discussions
Agreement
• “You should be able to lose 1 of anything” approach.
• Which means we should be able to survive, without any visible impact, the loss of:
• 1 server
• 1 Availability Zone
• 1 region
Starting with uncertainty
• Multiple macro and micro services
• Stateless and stateful services
• They depend on multiple technologies
• Some are multiregional and some are not
• Documentation was, as always, out of date
Inventory of dependencies
[Chart: percentage of applications depending on each technology, y-axis 0–100%]
What is multiregional by design?
With some customizations
Stages of grief
• Denial – can’t be true, let’s check again
• Anger – we told everyone to be active/active ready!!!
• Bargaining – active/stand-by?
• Depression – we can’t do it
• Acceptance – let’s work to fix it, we have 6 months…
What it tells us
• We can’t just put things in two regions and expect them
to work
• We will need to do some work to:
• Migrate services to technology which is multiregional by
design
• Somehow make underlying technology multiregional
Scheduling/optimization problem
• There is work that should be done on both apps and
infrastructure side
• We need to schedule it so we can get results faster
and minimize waits
• And we wanted a machine to help us
The world’s leading graph database
That can store a graph of 30B nodes
Here to help us deal with our problem
Why Neo4J
• Graph engine and we are dealing with a graph
• Query language that is very powerful
• Can be populated programmatically
• Can show us something we didn’t expect
How to use it?
• Model
• Identify nodes and relations
• Tracing
• Code analyzer
• Talking to people
• Generate the graph
• Run queries
Model example
• Nodes
• Users
• Technology: (Cassandra, Redis)
• multiregional: true/false
• Service (applications)
• stateless: true/false
• Edges
• Usage patterns (read, write)
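The model above can be sketched without Neo4j at all. This toy version uses plain dicts for nodes and edges and runs the query we cared about most: which services are ready to go multiregional because all of their technology dependencies are. The service and technology names are illustrative, not our real inventory.

```python
# Technology nodes carry the multiregional flag; service nodes list the
# technologies they use (the "usage pattern" edges from the model).
technologies = {
    "cassandra": {"multiregional": True},
    "redis":     {"multiregional": False},
}
services = {
    "store-api": {"stateless": True,  "uses": ["cassandra"]},
    "cart":      {"stateless": False, "uses": ["cassandra", "redis"]},
}

def ready_to_go(services, technologies):
    """Services whose every technology dependency is multiregional."""
    return sorted(
        name for name, svc in services.items()
        if all(technologies[t]["multiregional"] for t in svc["uses"])
    )

print(ready_to_go(services, technologies))  # ['store-api']
```

In Neo4j the same question is a short Cypher match over the usage edges; the point is that once the graph exists, "what is ready to go" becomes a query instead of a spreadsheet exercise.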
Graph definition example
Graph example
Can be enriched with:
• Load balancers
• Security groups
• VPCs
• NATs
• Etc.
Ours looked more like
And running some Neo4j magic
This one is important
Shows you what is ready to go
What to do next
• Validate multiregional technologies do actually work
• Figure out what to do with non-multiregional technologies
• Move services in the following order:
Validating our main DB (Cassandra)
A lot of unknowns:
• Will it work?
• Will performance degrade?
• How eventual is multiregional eventual consistency?
• Will we hit any roadblocks?
• Well, how many roadblocks will we hit?
What did we know?
Netflix is doing it on AWS, and they actually tested it:
• They wrote 1M records in one region of a multiregion cluster
• 500 ms later, reads were initiated in the other regions
• All records were successfully read
Well…
Some questions to answer:
• Should we just trust Netflix’s results, replicate the data, and see what happens?
• Is their experiment applicable to our situation?
• Can we do better?
[Meme: "How to get an engineer's attention" – break something, free coffee, or say "there's gotta be a better way to do this"]
Cassandra validation strategy
• Use production load/data
• Simulate disruptions
• Track replication latencies
• Track lost mutations
• Cassandra modifications were required
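The latency-tracking part of the strategy can be sketched as follows, under the assumption that the loader stamps each record with its write time in region 1 and the ingester in region 2 records when it first sees the same key. Percentile math here uses the simple nearest-rank method; the timestamps are made up for illustration.

```python
# Compute replication-lag percentiles and detect lost mutations from
# per-key write timestamps (source region) and read timestamps (target).

def lag_percentiles(write_ts, read_ts, percentiles=(95, 99, 99.9)):
    lags = sorted(read_ts[k] - write_ts[k] for k in write_ts if k in read_ts)
    lost = [k for k in write_ts if k not in read_ts]  # never replicated
    out = {}
    for p in percentiles:
        # nearest-rank percentile, clamped to valid indices
        idx = min(len(lags) - 1, max(0, int(round(p / 100.0 * len(lags))) - 1))
        out["p%g" % p] = lags[idx]
    return out, lost

# Synthetic data: 1000 writes at t=0; key 999 never arrives in region 2.
writes = {k: 0.0 for k in range(1000)}
reads = {k: 0.05 + 0.0005 * k for k in range(999)}
stats, lost = lag_percentiles(writes, reads)
```

Plotting these percentiles over time while cutting the inter-DC link is exactly what produced the logarithmic-scale lag chart shown later.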
Preparation
Exporter
Region 1
Region 2
Ingester
Ingester
Test
Read/Write
Loader
Region 1
Read/Write
Loader
Region 2
Analysis
Sample results (usw1-usw2)
[Chart: two-DC connection cut-off and recovery, replication latency on a logarithmic scale, showing Pct95, Pct99, Pct999, and MaxLag over time]
Things that are not multiregional by design
We gave teams 2 options:
• Redesign if it is critical to the user’s experience
• If not in the critical path (e.g., batch jobs):
• active/passive
• master/slave
• Use Kafka as a replication backbone (recommended)
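The recommended Kafka-backbone pattern works like the Solr example that follows: every region consumes the same stream of mutations and applies them to its local copy of the non-multiregional technology. This is a minimal sketch of that fan-out pattern with in-memory queues standing in for a Kafka topic; a real deployment would use mirrored Kafka clusters and per-region consumer groups.

```python
import queue

class Topic:
    """Toy stand-in for a Kafka topic fanned out to per-region consumers."""
    def __init__(self, regions):
        self.partitions = {r: queue.Queue() for r in regions}

    def publish(self, record):
        # Every region gets its own copy of each mutation.
        for q in self.partitions.values():
            q.put(record)

def drain_into_index(topic, region, index):
    """Region-local 'indexer': apply queued mutations to the local replica."""
    q = topic.partitions[region]
    while not q.empty():
        doc_id, doc = q.get()
        index[doc_id] = doc

topic = Topic(["region-1", "region-2"])
topic.publish(("doc-1", {"title": "Example"}))

index_r1, index_r2 = {}, {}
drain_into_index(topic, "region-1", index_r1)
drain_into_index(topic, "region-2", index_r2)
# Both regional indexes now contain doc-1.
```

The design win is that the regions never talk to each other's Solr directly; the stream is the only cross-region dependency.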
Solr example (pre active/active)
Indexer
Master
App1
App2
Replicator
Replicator
Read Replicas
Read Replicas
Solr example (easy active/active)
Indexer
Master
Replicators
Read Replicas
Apps
Replicators
Read Replicas
Apps
Region 1 Region 2
Solr example (Kafka active/active)
Indexer
Read Replicas
Apps
Region 1
Solr Indexer
Indexer
Read Replicas
Apps
Region 2
Solr Indexer
Are we missing anything?
Yes, infrastructure
Decompose and recompose
Breaking up the system into moveable parts
App + caching tier
Data tier
Inbound tier
Outbound tier
Clients
Phase 1: Infrastructure
Private Subnet
Public Subnet
ELBs
Inbound tier
Outbound tier
Infrastructure to build/move:
• VPCs
• Subnets
• ACLs
• ELBs
• IGW
• NAT
• Egress
Phase 1: Infrastructure key points
• Building infrastructure in new region must be fully
automated (Infrastructure as Code)
• Regional communication decisions
• VPNs?
• Over Internet?
• Do infrastructures have to match exactly?
• 1st region evolved organically
• 2nd region should be blueprint for all new region DCs
Phase 2: Data
Public subnet
ELBs
Data tier
Inbound tier
Outbound tier
Phase 2: Data option 1 replication over VPN
Public Subnet
ELBs
Data tier
Inbound tier
Outbound tier
Region 2
VPN
Phase 2: Data option 1 replication over VPN
• Pros
• Setting up a VPN with the current network architecture would be easier on the data tier
• Secure
• Managing data node intercommunication is straightforward and has lower operational overhead
• Cons
• Limit on throughput
• Data set is large and can quickly saturate VPN
• Scaling more applications in future will be complicated!
Phase 2: Data option 2 replication over ENIs with public IPs
Private subnet
Public subnet
ELBs
Data tier
Inbound tier
Outbound tier
Region 2
SSL
SSL
Phase 2: Data option 2 replication over ENIs with public IPs
• Pros
• Not network constrained
• Able to add more applications + data without needing to build new infrastructure to support them
• Cons
• Operationally, more orchestration (Cassandra, for example,
needs to know other node Elastic IPs)
• Internode data transfer security is a must
Phase 3: App tier + cache strategy
Outbound Tier
Region 2
Phase 3: App tier + cache strategy
• Applications communicate within a region only
• Applications do not call another region’s databases,
caches, or applications
• Isolation creates predictable failure cases and clearly defines failure domains
• Monitoring and alerting are greatly simplified in this
model
Phase 4: Client routing
Region 1 Region 2
DNS
Phase 4: Client routing
• Predictable “sticky” routing via geo-routing to avoid bouncing users between regions
• Data replication manages cross region state
• Allows for routing to stateless services
• Ability to do % based routing to manage different failure
scenarios
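The sticky, percentage-based routing described above can be sketched as a deterministic hash of the user ID into weight buckets, so the same user always lands in the same region while the split matches the DNS weights (95/5 here, echoing the earlier Route 53 example). The region names and weights are illustrative.

```python
import hashlib

def route(user_id, regions=(("region-1", 95), ("region-2", 5))):
    """Deterministically map a user to a region by weighted hash bucket."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    cumulative = 0
    for region, weight in regions:
        cumulative += weight
        if bucket < cumulative:
            return region
    return regions[-1][0]

# The same user is always routed identically (no bouncing between regions):
assert route("user-42") == route("user-42")
```

Shifting the weights (e.g., 50/50, or 0/100 during a failure) changes the split without breaking stickiness for users whose bucket stays inside their region's range.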
Putting it all together
Software design for multiregion deployments
• Typical software architecture
APIs
Business Logic
Data Access
Cross
Cutting
Config
Software design for multiregion deployments
Region 1 Region 2
Remember when we mentioned that application-tier call patterns should be isolated within a region? How do we achieve this simply?
Software configuration approaches
• An application config to connect to a database could look like:
cassandra.seeds=10.0.1.16,10.0.1.17
• A naïve approach would be to have multiple configs per deployable depending on its region:
cassandra.seeds.region1=10.0.1.16,10.0.1.17
cassandra.seeds.region2=10.0.2.16,10.0.2.17
• This, of course, results in an app config management nightmare, especially now with 2 regions
Software configuration approaches
• What if we implemented a basic "central" way of configuration?
[Diagram: in each region, the app asks a local DB "Where are my C* seeds?"; cassandra.seeds=cass-seed1,cass-seed2 resolves to that region's IPs]
Simplified software configuration (context)
• Context is made available to application which contains:
• Data Center/region
• Endpoint short-name resolution
• Environment (Dev, QA, Prod, A/B)
• Database connection details
• Context is the responsibility of the infrastructure itself
and is provided through build automation, AWS tagging,
etc.
• App is responsible for behaving correctly off of context
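The context idea above can be sketched as a small lookup layer: the app ships one config line with short names, and the infrastructure-provided context resolves them per region. The region names and IPs below are illustrative, matching the earlier cassandra.seeds example.

```python
# Per-region context, provided by the infrastructure (build automation,
# AWS tagging, etc.), not baked into the application deployable.
CONTEXT = {
    "region-1": {"env": "prod", "endpoints": {
        "cass-seed1": "10.0.1.16", "cass-seed2": "10.0.1.17"}},
    "region-2": {"env": "prod", "endpoints": {
        "cass-seed1": "10.0.2.16", "cass-seed2": "10.0.2.17"}},
}

def resolve_seeds(config_value, region):
    """Turn 'cass-seed1,cass-seed2' into region-local IPs via the context."""
    endpoints = CONTEXT[region]["endpoints"]
    return [endpoints[name.strip()] for name in config_value.split(",")]

# One deployable, one config line, correct seeds in every region:
seeds = resolve_seeds("cass-seed1,cass-seed2", "region-2")
```

This is also what keeps call patterns region-isolated: the app can only resolve names from its own region's context, so it never accidentally dials another region's database.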
Infrastructure as code
• New regions must be built through automation
• Specification of services to Terraform
• An internal tool and DSL were built to manage domain-specific needs
• Example:
• Specify an app requires Cassandra and SNS
• Generates Terraform to create security groups for ports 9160,
7199-7999, build SNS, build ELB for app, etc.
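The spec-to-Terraform example above can be sketched as a tiny generator. The real tool and its DSL are internal to Sony; this only illustrates the approach of turning "app X needs Cassandra" into generated Terraform text opening the ports the slide mentions (9160, 7199-7999).

```python
# Map a dependency to the port ranges its security group must open.
PORTS = {"cassandra": [(9160, 9160), (7199, 7999)]}

def security_group_tf(app, dependency):
    """Render a Terraform security-group resource for one app dependency."""
    rules = "\n".join(
        '  ingress {{ from_port = {0} to_port = {1} protocol = "tcp" }}'
        .format(lo, hi) for lo, hi in PORTS[dependency])
    return ('resource "aws_security_group" "{0}_{1}" {{\n{2}\n}}'
            .format(app, dependency, rules))

tf = security_group_tf("storeapp", "cassandra")
```

Because the second region is built entirely from generated code like this, it becomes the blueprint the earlier slide called for, rather than another organically grown snowflake.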
Database automation
• Ansible runs assist in building Cassandra in the public subnet and associating EIPs to every new node
• Manages network rules (whitelisting)
• Manages certificates and SSL
[Diagram: Region 2 with private and public subnets, ELBs, outbound tier, and SSL-secured internode links]
Monitoring multiregional deployments
Monitoring through proper tagging
• Part of the “Context” applications are aware of is the
region
• Adds “region” to any app logs
• Region tags are then added to metrics and can be surfaced in Grafana or any monitoring tool of your choice
• Cross-regional monitoring key metrics and alerting
• Data replication (hints in Cassandra, seconds behind master
in MySQL, etc.)
• Data in/out
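Region tagging from context can be sketched in a few lines: every metric (or log line) carries the region from the application's context, so cross-regional dashboards can compare replication health side by side. The metric name below (Cassandra hints pending) is illustrative.

```python
def tagged_metric(name, value, context):
    """Attach region/env tags from the app's context to a metric point."""
    return {"name": name, "value": value,
            "tags": {"region": context["region"], "env": context["env"]}}

# A replication-health metric emitted from region 1:
m = tagged_metric("cassandra.hints_pending", 12,
                  {"region": "region-1", "env": "prod"})
```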
Putting it all together
Region 1 Region 2
Create
infrastructure
Replicate
DNS
Lessons learned
Lessons learned
• Data synchronization is super critical, so build the dependency map from the data technologies first.
• Always run your own benchmarking.
• Do not allow legacy to control the other region’s design. Find a healthy transition and balance between old and new.
• Applications must be context-driven.
• Depending on your data load, cross-regional VPNs may not make sense.
PlayStation is hiring in SF:
Find us at hackitects.com
Thank you!
Remember to complete
your evaluations!
Related Sessions