
©"2015,"Amazon"Web"Services," Inc."or"its"Affiliates."All"rights"reserved.

Sailesh"KrishnamurthySenior"Engineering"Manager

Amazon"Web"Services

Amazon'Aurora:'A'Database'for'the'Cloud

Rapidly"growing"global"footprint

Over"1"million"active customers"across"190"countries

800+"government"agencies

3,000+"educational" institutions

12"regions

33"availability"zones

52"edge"locations

Everyday,"AWS"adds"enough"new"server"capacity"to"support"Amazon.com"when"it"was"a"$7"billion"global"enterprise.

Moving"from"a"world"where"we"design"for"sharing"scarce"system"resources" to"one"where"the"central"challenge"is"taking"advantage"of"their"abundance

How"is"the"Cloud"changing"the"World"?

Designing"Databases"for"the"Cloud

Scarcity"to"Abundance

Monolithic"to"Service"Oriented

Single"cluster"to"a"Fleet"of"clusters

What"is"Amazon"Aurora"?

MySQLVcompatible"relational"database

Performance"and availability(of"commercial"databases

Simplicity and costVeffectiveness of"open"source"databases

Delivered"as"a"managed"service

Agenda

Motivation

Aurora"Architecture

Performance

Availability

Usability

Beyond"Benchmarks

Current"DB"Architectures"are"Monolithic

[Diagram: the monolithic stack — SQL, Transactions, Caching, Logging — sits on Storage, with the Application on top]

Even when you scale it out, you're still replicating the same stack

[Diagram: scaling out replicates the same SQL / Transactions / Caching / Logging stack per instance, each with its own Application and Storage]

Aurora"Architecture:"ReVimagined"for"the"CloudMoved"the"logging"and"storage"layer"into"a"multitenant,"scaleVout"storage"service"optimized"for"OLTP"database"workloads

Leverage"existing"AWS"services:"Amazon"EC2,"Amazon"VPC,"Amazon"DynamoDB,"Amazon"SWF,"and"Amazon"S3

Maintain"compatibility"with"MySQL"–customers"can"migrate"their"MySQL"applications"asVis,"use"all"MySQL"tools.

[Diagram: data plane — SQL, Transactions, and Caching on the database node over a Logging + Storage service backed by Amazon S3; control plane — Amazon DynamoDB, Amazon SWF, and Amazon Route 53]

Aurora"Storage:"A"ServiceVOriented"Architecture• ScaleVout,"multiVtenant,"SSD"storage• Seamless" storage"scalability• Up"to"64"TB"database"size• Only" pay"for"what"you"use

• LogVstructured"storage• Many"small" segments," each"with"their"own"redo"logs• Redo"logs"used" to"generate"data"pages"on"demand• Eliminates" chatter"between"database"and"storage

• Highly"available/durable"by"default• 6Vway"replication" across"3"AZs• 4"of" 6"write"quorum• Automatic"fallback" to"3"of"4"if"an"Availability" Zone"(AZ)"is"unavailable

• 3"of" 6"read"quorum• Continuous" backup" to"S3

AZ"1 AZ"2 AZ"3

Amazon" S3

Aurora:"Database/Storage"Interaction

[Diagram: database node (SQL, Transactions, Caching) writing to storage nodes across AZ 1, AZ 2, and AZ 3, with continuous backup to Amazon S3]

• New API model (sketched below)
  • Reads are block-based (read pages)
  • Writes are delta-based (write redo logs)

• Distributed quorum-based writes
  • Ordered log-stream in a single LSN space
  • Database writes the log-stream to 6 nodes in 3 AZs
  • Transaction commit only after write quorum established

• Continuous state exchange protocol
  • Segments can have holes (lost log records)
  • Read at an LSN directed to the right segment
  • Storage segments know when to coalesce redo logs

Agenda

Motivation

Aurora"Architecture

Performance

Availability

Usability

Beyond"Benchmarks

5x"faster"than"RDS"MySQL"5.6"&"5.7

[Charts: write performance and read performance — MySQL SysBench results on R3.8XL (32 cores / 244 GB RAM)]

Five times higher throughput than stock MySQL, based on industry-standard benchmarks.

How"did"we"achieve"this"?

Do"fewer"IOs

Minimize"network"packets

Cache"prior"results

Offload"the"database"engine

DO'LESS'WORK

Process"asynchronously

Reduce" latency"path

Use"lockVfree"data"structures

Batch"operations"together

BE'MORE'EFFICIENT

DATABASES ARE ALL ABOUT I/O

NETWORK-ATTACHED STORAGE IS ALL ABOUT PACKETS/SECOND

HIGH-THROUGHPUT PROCESSING DOES NOT ALLOW CONTEXT SWITCHES

IO"Traffic"in"RDS"MySQL

BINLOG DATA DOUBLEVWRITEREDO"LOG FRM"FILES

T YPE ' O F 'WRI T E

MYSQL'WITH'REPLICA

EBS"mirrorEBS"mirror

AZ"1 AZ"2

Amazon" S3

EBSAmazon"Elastic"Block"

Store"(EBS)

PrimaryInstance

ReplicaInstance

1

2

3

4

5

Issue"write"to"EBS"– EBS"issues"to"mirror,"ackwhen"both"doneStage"write"to"standby"instance"through"DRBDIssue"write"to"EBS"on"standby"instance

IO'FLOW

Steps"1,"3,"5"are"sequential"and"synchronousThis"amplifies"both"latency"and"jitterMany"types"of"writes"for"each"user"operationHave"to"write"data"blocks"twice"to"avoid"torn"writes

OBSERVATIONS

780K"transactions7,388K"I/Os per"million"txns (excludes"mirroring,"standby)Average"7.4"I/Os per"transaction

PERFORMANCE

30"minute" SysBench" writeonly workload," 100GB" dataset,"RDS'MultiAZ," 30K" PIOPS

IO"Traffic"in"Aurora

BINLOG DATA DOUBLEVWRITEREDO"LOG FRM"FILES

T YPE ' O F 'WRI T E

AZ"1 AZ"3

PrimaryInstance

Amazon S3

AZ"2

ReplicaInstance

AMAZON& AURORA

ASYNC4/6"QUORUM

DISTRIBUTED"WRITES

IO&FLOW

Only"write"redo"log"records;"all"steps"asynchronousNo"data"block"writes"(checkpoint,"cache"replacement)6Xmore log"writes,"but"9X less network"trafficTolerant"of"network"and"storage"outlier"latency

OBSERVATIONS

27,378K"transactions" 35X MORE

950K"I/Os per"1M"txns (6X"amplification)Average"0.95"I/O"per"txn 7.7X LESS

PERFORMANCE

Boxcar"redo"log"records"– fully"ordered"by"LSNShuffle"to"appropriate"segments"– partially"orderedBoxcar"to"storage"nodes"and"issue"writesReplica

Instance

30"minute" SysBench" writeonly workload," 100GB" dataset

LOG" RECORDS

Primary"Instance

INCOMING"QUEUE

STORAGE'NODE

S3"BACKUP

1

2

3

4

5

6

7

8

UPDATE"QUEUE

ACK

HOTLOG

DATABLOCKS

POINT"IN"TIMESNAPSHOT

GC

SCRUBCOALESCE

SORTGROUP

PEER" TO"PEER" GOSSIPPeerStorageNodes

All"steps"are"asynchronousOnly"steps"1"and"2"are"in"foreground"latency"pathInput"queue"is"46X less than"MySQL"(unamplified,"per"node)Favor"latencyVsensitive"operationsUse"disk"space"to"buffer"against"spikes"in"activity

OBSERVATIONS

IO&FLOW

① Receive"record"and"add"to"inVmemory"queue② Persist"record"and"ACK"③ Organize"records"and"identify"gaps"in"log④ Gossip"with"peers"to"fill"in"holes⑤ Coalesce"log"records"into"new"data"block"versions⑥ Periodically"stage"log"and"new"block"versions"to"S3⑦ Periodically"garbage"collect"old"versions⑧ Periodically"validate"CRC"codes"on"blocks

IO"Traffic"in"Aurora"(Storage"Node)

IO"Traffic"in"Aurora"Read"Replicas

PAGE" CACHE"UPDATE"SELECTIVE" LOG" APPLY

Aurora' Master

30%"Read

70%"Write

Aurora' Replica

100%"New"Reads

Shared' MultiOAZ' Storage

MySQL' Master

30%"Read

70%"Write

MySQL' Replica

30%"New" Reads

70%"Write

SINGLEVTHREADEDBINLOG" APPLY

Data' Volume Data' Volume

• Logical: Ship"SQL"statements"to"Replica

• Write"workload"similar"on"both"instances

• Independent"storage

• Can"result"in"data"drift"between"Master"and"Replica

Physical: Ship"redo"from"Master"to"Replica

Replica"shares"storage."No"writes"performed

Cached"pages"have"redo"applied

Advance"read"view"when"all"commits"seen

MYSQL'READ'SCALING AMAZON'AURORA'READ'SCALING
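A sketch of the Aurora-side behavior above: the replica never writes; it applies incoming redo only to pages already in its cache and reads everything else from the shared volume at its current read view. Class and field names are illustrative, and the string-append "redo apply" is a toy stand-in.

```python
def apply_redo(page, rec):
    return page + rec.delta            # toy stand-in for applying a redo record to a page image

class AuroraReadReplica:
    def __init__(self, shared_storage):
        self.storage = shared_storage  # the same storage volume the master writes to
        self.cache = {}                # page_no -> cached page image
        self.read_view_lsn = 0         # highest LSN whose commits are all visible here

    def on_redo(self, rec):
        """Selective log apply: only pages already cached are updated; others are ignored."""
        if rec.page_no in self.cache:
            self.cache[rec.page_no] = apply_redo(self.cache[rec.page_no], rec)

    def on_commits_seen(self, lsn):
        """Advance the read view once all commits up to `lsn` have been seen."""
        self.read_view_lsn = max(self.read_view_lsn, lsn)

    def read_page(self, page_no):
        if page_no not in self.cache:  # cache miss: shared storage materializes the page at the read view
            self.cache[page_no] = self.storage.read_page(page_no, self.read_view_lsn)
        return self.cache[page_no]
```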

Adaptive"Thread"Pool

• ReVentrant"connections"multiplexed"to"active"threads

• KernelVspace"epoll()"inserts"into"latchVfree"event"queue

• Dynamically"size"thread"pool"

• Gracefully"handles"5000+"concurrent"client"sessions"on"r3.8xl

Standard"MySQL"– one"thread"per"connection

Doesn’t"scale"with"connection"count

MySQL"EE"– connections"assigned"to"thread"group

Requires"careful"stall"threshold"tuning

CLIENT

"CONN

ECTION

CLIENT

"CONN

ECTION LATCH" FREE

TASK"QUEUE

epoll()

MYSQL'THREAD'MODEL AURORA'THREAD'MODEL
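A toy sketch of this thread model: readiness notification (epoll on Linux, via Python's selectors module) feeds a queue that a small, fixed pool of workers drains, instead of one OS thread per connection. This is not MySQL or Aurora code; the "request handling" is a one-line placeholder.

```python
import queue, selectors, socket, threading

sel = selectors.DefaultSelector()          # uses epoll on Linux
tasks = queue.SimpleQueue()                # stands in for the latch-free task queue

def worker():
    while True:
        conn, data = tasks.get()           # a ready connection with a request to run
        conn.sendall(b"OK " + data)        # placeholder for executing the statement

def event_loop(listener):
    sel.register(listener, selectors.EVENT_READ)
    while True:
        for key, _ in sel.select():        # the kernel reports which sockets are ready
            if key.fileobj is listener:
                conn, _ = listener.accept()
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ)
            else:
                data = key.fileobj.recv(4096)
                if data:
                    tasks.put((key.fileobj, data))   # hand off to the pool; never block the loop
                else:
                    sel.unregister(key.fileobj)
                    key.fileobj.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen()
for _ in range(8):                         # pool size is independent of connection count
    threading.Thread(target=worker, daemon=True).start()
# event_loop(listener)                     # run the (blocking) event loop
```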

Asynchronous"Group"Commits

Read

Write

Commit

Read

Read

T1

Commit " ( T1 )

Commit " ( T2 )

Commit' (T3)

L SN " 1 0

LSN " 1 2

LSN ' 22

L SN " 5 0

LSN ' 30'

LSN ' 34

LSN ' 41

LSN ' 47

LSN"20

LSN"49

Commit' (T4)

Commit' (T5)

Commit' (T6)

Commit' (T7)

Commit " ( T8 )

LSN"GROWTHDurable"LSN"at"headVnode"

COMMIT"QUEUEPending"commits"in"LSN"order

TIME

GROUPCOMMIT

TRANSACTIONS

Read

Write

Commit

Read

Read

T1

Read

Write

Commit

Read

Read

Tn

• TRADITIONAL'APPROACH AMAZON&AURORAMaintain( a(buffer(of(log(records(to(write(out(to(disk

Issue(write( when( buffer(full(or(time(out(waiting( for(writes

First(writer( has(latency(penalty(when(write( rate(is(low

Request( I/O(with( first(write,(fill(buffer(till(write( picked(up

Individual( write(durable( when( 4(of(6(storage(nodes(ACK

Advance(DB(Durable( point(up(to(earliest(pending( ACK
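A sketch of the commit side of this scheme: commits are queued by commit LSN and acknowledged only after the database durable point has advanced past them, so no worker thread blocks on an individual log write. The class, the ack_commit callback, and the way the durable LSN arrives are illustrative assumptions.

```python
import heapq

class CommitQueue:
    def __init__(self):
        self.durable_lsn = 0    # highest LSN known durable on a 4-of-6 quorum
        self.pending = []       # min-heap of (commit_lsn, seq, txn) awaiting durability
        self._seq = 0           # tie-breaker so equal LSNs never compare transactions

    def commit(self, txn, commit_lsn):
        """Called when a transaction commits; queues it without blocking the worker thread."""
        self._seq += 1
        heapq.heappush(self.pending, (commit_lsn, self._seq, txn))

    def on_durable_advance(self, new_durable_lsn):
        """Called as storage ACKs move the durable point forward."""
        self.durable_lsn = max(self.durable_lsn, new_durable_lsn)
        while self.pending and self.pending[0][0] <= self.durable_lsn:
            _, _, txn = heapq.heappop(self.pending)
            txn.ack_commit()    # group-acknowledge every commit at or below the durable LSN
```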

Agenda

Motivation

Aurora"Architecture

Performance

Availability

Usability

Beyond Benchmarks

What"about"availability?

"Performance only matters if your database is up"

More"Replicas• Aurora"cluster"contains"primary"node"and"up"to"fifteen"secondary"nodes

• Failing"database"nodes"are"automatically"detected"and"replaced

• Failing"database"processes"are"automatically"detected"and"recycled

• Secondary"nodes"automatically"promoted"on"persistent"outage,"no"single"point"of"failure

• Customer"application"may"scaleVout"read"traffic"across"secondary"nodes

AZ"1 AZ"3AZ"2

PrimaryNodePrimaryNodePrimaryNode

PrimaryNodePrimaryNodeSecondaryNode

PrimaryNodePrimaryNodeSecondaryNode

! Customer"specifiable" failVover"order

! Read"balancing"across"read"replicas

Storage"Durability• Storage"volume"automatically"grows"up"to"64"TB

• Quorum"system"for"read/write;"latency"tolerant

• Peer"to"peer"gossip"replication"to"fill"in"holes

• Continuous"backup"to"S3"(built"for"11"9s"durability)

• Continuous"monitoring"of"nodes"and"disks"for"repair"

• 10GB"segments"as"unit"of"repair"or"hotspot"rebalance

• Quorum"membership"changes"do"not"stall"writes

AZ"1 AZ"2 AZ"3

Amazon" S3

Continuous"BackupSegment"snapshot Log"records

Recovery"point

Segment'1

Segment'2

Segment'3

Time

• Take periodic"snapshot"of"each"segment"in"parallel;"stream"the"redo"logs"to"Amazon"S3

• Backup"happens"continuously"without"performance"or"availability"impact

• At"restore,"retrieve"the"appropriate"segment"snapshots"and"log"streams"to"storage"nodes

• Apply"log"streams"to"segment"snapshots"in"parallel"and"asynchronously

Survivable"Buffer"Caches• We"moved"the"cache"out"of"the"database"process

• Cache"remains"warm"in"the"event"of"database"restart

• Lets"you"resume"fully"loaded"operations"much"faster

• Instant"crash"recovery"+ survivable"cache"='quick"and"easy"recovery"from"DB"failures

SQL

Transactions

Caching

SQL

Transactions

Caching

SQL

Transactions

Caching

Caching"process"is"outside"the"DB"process"and"remains"warm"across"a"database"restar t

Instant"Crash"Recovery• Traditional"Databases

• Have"to"replay"logs"since"the"last"checkpoint

• Typically"5"minutes"between"checkpoints

• SingleVthreaded"in"MySQL;"requires"a"large"number"of"disk"accesses

• Amazon"Aurora

• Underlying"storage"replays"redo"records"on"demand"as"part"of"a"disk"read

• Parallel,"distributed,"asynchronous

• No"replay"for"startup

Checkpointed"Data Redo"Log

Crash" at"T0 requiresa"reVapplication" of"theSQL"in" the"redo" log"sincelast"checkpoint

T0 T0

Crash" at"T0 will" result" in" redo"logs"being" applied" to"each"segment" on"demand," in"parallel,"asynchronously

Fast"FailVOver

AppRunningFailure"Detection DNS"Propagation

Recovery Recovery

DBFailure

MYSQL

AppRunning

Failure"Detection DNS"Propagation

Recovery

DBFailure

AURORA"WITH"MARIADB"DRIVER

1 5 O 2 0 's e c

3 O 2 0 ' s e c

RealVlife"data"V failVover"time

“In"RDS"MySQL,"it"took"minutes"or"sometimes"tens"of"minutes"to"failover."It’s"pretty"awesome"that"you"can"failover/restart"within"less"than"a"minute.”

Simulate"failures"using"SQL

ALTER"SYSTEM"CRASH"[{INSTANCE"|"DISPATCHER"|"NODE}]

ALTER"SYSTEM"SIMULATE"percent_failure DISK"failure_type IN"[DISK"index"|"NODE"index]"FOR"INTERVAL"interval

ALTER"SYSTEM"SIMULATE"percent_failure NETWORK"failure_type[TO"{ALL"|"read_replica |"availability_zone}]"FOR"INTERVAL"interval

• To"cause"the"failure"of"a"component"at"the"database"node:

• To"simulate"the"failure"of"disks:

• To"simulate"the"failure"of"networking:

Agenda

Motivation

Aurora"Architecture

Performance

Availability

Usability

Beyond"Benchmarks

What"about"usability"?

"Focus on the application and not managing the system"

Simplify"Database"Management• Create"a"database"in"minutes

• Automated"patching

• PushVbutton"scale"compute

• Continuous"backups"to"Amazon"S3

• Automatic"failure"detection"and"failover

Amazon RDS

Simplify"Storage"Management• Read"replicas"are"available"as"failover"targets—no"data"loss

• Instantly"create"user"snapshots—no"performance"impact

• Continuous,"incremental"backups"to"Amazon"S3

• Automatic"storage"scaling"up"to"64"TB—no"performance"or"availability"impact

• Automatic"restriping,"mirror"repair,"hot"spot"management,"encryption

Simplify"Data"Security• Encryption"to"secure"data"at"rest• AESV256;"hardware"accelerated• All"blocks"on"disk"and"in"Amazon"S3"are"encrypted• Key"management"via"AWS"KMS

• SSL"to"secure"data"in"transit

• Network"isolation"via"Amazon"VPC"by"default

• No"direct"access"to"nodes

• Supports"industry"standard"security"and"data"protection"certifications


What about tools and the ecosystem? All MySQL tools work as-is …

Well"established"MySQL"ecosystem

Business Intelligence · Data Integration · Query and Monitoring · SI and Consulting

Source: Amazon

"We ran our compatibility test suites against Amazon Aurora and everything just worked." – Dan Jewett, Vice President of Product Management at Tableau

Integration"with"3rd"Party"Tools

Monitor"Aurora"with"Datadog

• Just"add"readVonly"AWS"credentials"and"select"the"services"you"wish"to"monitor"(e.g."RDS)

Ready to move? We made it easy to migrate …

Simplify"migration"from"RDS"MySQL

• 1."Establish"baseline

a. RDS"MySQL"to"Aurora"DB"snapshot"migration

b. MySQL"dump/import

• 2."CatchVup"changesApplication'Users

MySQL Aurora

Network

Migration"from"EC2"&"onVpremise MySQL• Data(migration(service

• Logical(data(replication(from(onLpremise(or(EC2• Code(&(schema(conversion(across(engines

• S3(integration• Load(partial(datasets(directly(from(/(to(S3• Ingest(large(database(snapshots((>2TB)

• Snowball(integration• Ingest(huge(database(snapshots((>10TB)• Send(us(your(data(in(a(suitcase!

Migration"from"nonVMySQL"Databases

AWS(Database(Migration(Service

" Move"data"to"the"same"or"different"database"engine"

" Keep"your"apps"running"during"the"migration

" Start"your"first"migration"in"10"minutes"or"less

" Replicate"within,"to,"or"from"Amazon"EC2"or"RDS

Agenda

Motivation

Aurora"Architecture

Performance

Availability

Usability

Beyond"Benchmarks

Beyond"Benchmarks

• If only real-world applications saw benchmark performance

• POSSIBLE DISTORTIONS
  • Real-world requests contend with each other
  • Real-world metadata rarely fits in the data dictionary cache
  • Real-world data rarely fits in the buffer cache
  • Real-world production databases need to run HA

Scaling User Connections

• SysBench OLTP workload
• 250 tables

Connections    Amazon Aurora    RDS MySQL w/ 30K IOPS
50             40,000           10,000
500            71,000           21,000
5,000          110,000          13,000

UP TO 8x FASTER

Scaling Table Count

• SysBench write-only workload
• Measuring writes per second
• 1,000 connections

Tables    Amazon Aurora    MySQL I2.8XL (local SSD)    MySQL I2.8XL (RAM disk)    RDS MySQL w/ 30K IOPS (single AZ)
10        60,000           18,000                      22,000                     25,000
100       66,000           19,000                      24,000                     23,000
1,000     64,000           7,000                       18,000                     8,000
10,000    54,000           4,000                       8,000                      5,000

UP TO 11x FASTER

DB&Size Amazon& AuroraRDS& MySQLw/&30K IOPS

1GB 107,000 8,400

10GB 107,000 2,400

100GB" 101,000 1,500

1TB 26,000 1,200

67xUP&TO

FASTER

• SYSBENCH'WRITEOONLY

DB&Size Amazon& AuroraRDS& MySQLw/&30K IOPS

80GB 12,582 585

800GB 9,406 69

CLOUDHARMONY&TPCVC

136xUP&TO

FASTER

Scaling"Data"Set

Aurora"3X"faster

RealVlife"data"– gaming"workloadAurora"vs"RDS"MySQL"– r3.4XL,"MAZ

Scaling With Replicas

• SysBench write-only workload
• 250 tables

Updates per second    Amazon Aurora (replica lag)    RDS MySQL, 30K IOPS, single AZ (replica lag)
1,000                 2.62 ms                        0 s
2,000                 3.42 ms                        1 s
5,000                 3.94 ms                        60 s
10,000                5.38 ms                        300 s

UP TO 500x LOWER LAG

“In"RDS"MySQL,"we"saw"replica"lag"spike"to"almost"12"minutes"which"is"almost"absurd"from"an"application’s"perspective."The"maximum"read"replica"lag"across"4"replicas"never"exceeded"beyond"20"ms.”

RealVlife"data"V read"replica"latency

Questions?

Thank you!

P.S. We're hiring! Email me at: [email protected]

http://aws.amazon.com/rds/aurora