The Prisoner’s Dilemma & SOAs Lessons from Amazon, Google, and Lucidchart By Derrick Isaacson

The Prisoner’s Dilemma & SOAs

Lessons from Amazon, Google, and Lucidchart

By Derrick Isaacson

http://theinspirationroom.com/daily/print/2009/1/wrigleys_orbit_interrogation.jpg

“Can I get

that without

the bacon?”

- no one ever

http://www.food.com/photo-finder/all/bacon?photog=1072593

http://baconipsum.com/?paras=1&type=all-meat&start-with-lorem=1

https://qzprod.files.wordpress.com/2013/03/costco-retail-web.jpg?w=1600

“Wow, that was a cheap trip

to Costco” - no one ever

http://www.someecards.com/usercards/viewcard/MjAxMi03YWZiMjJiMTg3NDFhYTUy

I can’t remember if that getter function takes 100ns or 100ms.

- no one ever

• Should I try to abstract away this service request as a “remote procedure call”?

• 6 orders of magnitude difference!

My front-side bus only fails for 1 second every 17 minutes!

- no one ever

• 99.9% availability

Our internet only supports .NET.

- no one ever

• Do your clients rely on an SDK?

Distributed System Architectures

Does it have to be “Service-oriented”?

http://upload.wikimedia.org/wikipedia/commons/d/da/KL_CoreMemory.jpg

Distributed Memory

RPC

<I’m> <not> <making> <a> <service> <request>

<I’m> <just> <calling> <a> <procedure>

Distributed File System

mount -t nfs -o proto=tcp,port=2049 nfs-server:/ /mnt

Distributed Data Stores

• Replicated MySQL

• Mongo

• S3

• RDS

• BigTable

• Cassandra

P2P

Streaming Media

Service-oriented Architecture Attempt

Social Bookmarking App

GET /profiles/123

GET /users/123

Calculate something

GET /users/123/permissions

If user can’t view profile, send 403

POST /eventFeed {new profile view}

GET /users/123/friends

GET /bookmarks?userId=123

GET /catalog/books?ids=1,3,10

Calculate something else

GET /bookmarks/trending

Send HTML

Failure calling a service?

Simple SOA Availability

<98.7%

99.5%

99.8%

99.6%

.995 * .998 * .998 * .996 = 0.987
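The multiplication above can be sketched in a few lines of Java. This is only an illustration: the class and method names are made up for this example, and it assumes the services fail independently and that a request needs every one of them to succeed.

```java
// Sketch: availability of a request that calls several services in series.
// SerialAvailability and combined() are hypothetical names for illustration.
final class SerialAvailability {

    /** Multiplies per-service availabilities, assuming independent failures. */
    static double combined(double... availabilities) {
        double product = 1.0;
        for (double a : availabilities) {
            product *= a;
        }
        return product;
    }

    public static void main(String[] args) {
        // Figures from the slide: .995 * .998 * .998 * .996
        double total = combined(0.995, 0.998, 0.998, 0.996);
        System.out.println("combined availability = " + total);
    }
}
```

Every extra serial dependency can only lower the product, which is why the page as a whole is less available than its weakest service.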

Early days Lucidchart by Status Code

96.5% 2xx or 3xx

Early days Lucidchart 1 second Latencies

10.8% > 1s

What Happened?!?

How can a SOA make my app better!

"A distributed system is at best a necessary evil, evil because of the extra complexity...

or perhaps better put, a sensible engineering decision given the trade-offs involved."

-David Cheriton, Distributed Systems Lecture Notes, ch. 1

The CAP Theorem

http://learnyousomeerlang.com/distribunomicon

The CAP Theorem [1]

• Safety – nothing bad ever happens

• Liveness – good things happen

• Unreliability – network disconnection, crash failures, message loss, Byzantine failures, slowdown, etc.

• Consistency – every response sent to a client is correct

• Availability – every request gets a response

• Partition tolerance – operating in the face of arbitrary failures

Consistency:

Nothing Bad Happens

Assumption: Failures Happen

Availability Consistency

GET /profiles/123

GET /users/123

Calculate something

GET /users/123/permissions

If user can’t view profile, send 403

POST /eventFeed {new profile view}

GET /users/123/friends

GET /bookmarks?userId=123

GET /catalog/books?ids=1,3,10

Calculate something else

GET /bookmarks/trending

Send response

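    import java.io.IOException;
    import org.apache.http.HttpEntity;
    import org.apache.http.HttpResponse;
    import org.apache.http.client.ClientProtocolException;
    import org.apache.http.client.ResponseHandler;
    import org.apache.http.client.methods.HttpGet;

    // Parser and User are the application's own types; httpclient is an existing HttpClient.
    ResponseHandler<User> handler = new ResponseHandler<User>() {
        @Override
        public User handleResponse(final HttpResponse response)
                throws ClientProtocolException, IOException {
            int status = response.getStatusLine().getStatusCode();
            if (status >= 200 && status < 300) {
                HttpEntity entity = response.getEntity();
                return entity != null ? Parser.parse(entity) : null;
            } else {
                // The slide leaves this branch empty: what should happen when the call fails?
                // Throwing, as the Apache example linked below does, is one option.
                throw new ClientProtocolException("Unexpected response status: " + status);
            }
        }
    };

    HttpGet userGet = new HttpGet("http://example.com/users/123");
    User user = httpclient.execute(userGet, handler);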

https://hc.apache.org/httpcomponents-client-4.3.x/examples.html

Works great to calculate a user!

Best Effort Availability - Guaranteed Consistency

Best Effort Consistency - Guaranteed Availability

Amazon Checkout

http://highscalability.com/amazon-architecture

“WOW

I really regret

sacrificing consistency for

availability”

- said no Amazon ever

That’s $74 Billion

Google File System: relaxed consistency model

Throughput

Latency

Hang Consistency!

Add:

• Caching

• Timeouts

• Retries

• Guessing

• Anything!

Anti-entropy

Added energy to combat failure

Tip 1: HTTP Caching

Availability/Performance Consistency

Tip 2: HTTP Caching as Fallback
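One way to picture caching as a fallback is a wrapper that always tries the origin service first and only serves the last good copy when that call fails, which is roughly the behaviour HTTP's `stale-if-error` Cache-Control extension describes. The sketch below is not taken from the talk: `FallbackCache` and its methods are hypothetical names, it uses plain JDK collections rather than a real HTTP cache, and it never expires entries.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch: serve the last good response when the origin call fails.
// FallbackCache is a hypothetical helper, not part of any HTTP library.
final class FallbackCache<K, V> {
    private final Map<K, V> lastGood = new ConcurrentHashMap<>();
    private final Function<K, V> origin;

    FallbackCache(Function<K, V> origin) {
        this.origin = origin;
    }

    V get(K key) {
        try {
            V fresh = origin.apply(key);      // best effort consistency: try the origin first
            if (fresh != null) {
                lastGood.put(key, fresh);     // remember the latest successful response
            }
            return fresh;
        } catch (RuntimeException failure) {
            V stale = lastGood.get(key);      // availability: fall back to a possibly stale copy
            if (stale == null) {
                throw failure;                // nothing cached, so the failure surfaces
            }
            return stale;
        }
    }
}
```

A caller would wrap each service request, for example `new FallbackCache<String, String>(url -> fetch(url))`, where `fetch` stands in for whatever HTTP call the application already makes.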

Tip 3: Retries

• Exponential backoffs & max retries
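A minimal sketch of retries with exponential backoff and a retry cap, using only the JDK. The names `Retry` and `withBackoff` and the doubling factor are assumptions for illustration; production code would usually add jitter and retry only errors known to be transient.

```java
import java.util.concurrent.Callable;

// Sketch: retry a call with exponential backoff and a maximum number of retries.
// Retry.withBackoff is a hypothetical helper, not a library API.
final class Retry {

    static <T> T withBackoff(Callable<T> call, int maxRetries, long initialDelayMillis)
            throws Exception {
        long delay = initialDelayMillis;
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (attempt >= maxRetries) {
                    throw e;                  // retries exhausted: give up and report the failure
                }
                Thread.sleep(delay);          // wait before the next attempt
                delay *= 2;                   // double the wait each time
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated flaky service: fails twice, then succeeds.
        String result = withBackoff(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("simulated failure");
            }
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

The cap matters as much as the backoff: without `maxRetries`, a dead dependency would keep a thread blocked indefinitely.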

Tip 3: HTTP Caching Technologies

• Apache HttpComponents – HttpClient Cache

• Ehcache

• Redis

• Memcached

• CloudFront

• Akamai

• Berkeley DB

• AWS SNS (for notifying cache components of changes)

Segmenting CAP

A is highly available and B is highly consistent

Segmenting Consistency and Availability

1. Data Partitioning

Shopping Cart

Warehouse Inventory DB

Segmenting

2. Operation Partitioning

Reads

Writes

Dynamo & PNUTS

Segmenting

3. Functional Partitioning

User Service, Document Snapshots

Document Service

Segmenting

4. Hierarchical Partitioning

Leaves

Root

http://www.slashgear.com/google-data-center-hd-photos-hit-where-the-internet-lives-gallery-17252451/

Data Driven Design 1

Timeouts & Retries

Final Tip: Data driven design

• Max I/O wait time = # of threads * (CONNECT_TIMEOUT + READ_TIMEOUT)

• 9 front end servers received 1900 requests in 60 seconds, 300 of them for Flickr resources (16%).

• About 35 Flickr requests per server per minute (300 / 9, rounded up)

• Max 100 threads, => 6,000 thread seconds in one minute

• Goal: ensure < 10% of thread seconds spent blocked on Flickr I/O

• 35 requests * (CONNECT_TIMEOUT + READ_TIMEOUT) < 600 thread seconds (10% of 6,000)

• CONNECT_TIMEOUT + READ_TIMEOUT < 17 seconds
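The arithmetic in the bullets above can be written as a small calculation. The sketch uses the slide's numbers (100 threads, a 60 second window, a 10% budget, about 35 Flickr requests per server per minute); the class and method names are invented for this example, and it assumes the worst case where every request blocks for the full timeout.

```java
// Sketch: derive a timeout budget from measured traffic.
// TimeoutBudget and maxTimeoutSeconds are hypothetical names for illustration.
final class TimeoutBudget {

    /**
     * Largest CONNECT_TIMEOUT + READ_TIMEOUT (in seconds) that keeps blocked
     * thread time within the given fraction, if every request times out.
     */
    static double maxTimeoutSeconds(int threads, int windowSeconds,
                                    double blockedFraction, int requestsPerWindow) {
        double threadSeconds = (double) threads * windowSeconds; // 100 * 60 = 6,000
        double budget = threadSeconds * blockedFraction;         // 10% of 6,000 = 600
        return budget / requestsPerWindow;                       // 600 / 35
    }

    public static void main(String[] args) {
        double limit = maxTimeoutSeconds(100, 60, 0.10, 35);
        System.out.println("CONNECT_TIMEOUT + READ_TIMEOUT must stay below " + limit + " s");
    }
}
```

Re-running the calculation with fresh traffic numbers shows how the budget shrinks as the dependency's share of requests grows.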

TCP Connect -> Send Request -> Block on socket read -> Read response

CONNECT_TIMEOUT

READ_TIMEOUT
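As a rough illustration of where the two timeouts are set, the sketch below uses the JDK's `HttpURLConnection` rather than the Apache HttpClient shown earlier in the deck. The helper name `TimeoutConfig.open` and the 2 s / 15 s split of the 17 second budget are assumptions made for this example, not values from the talk.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Sketch: apply CONNECT_TIMEOUT and READ_TIMEOUT to an outgoing request.
// TimeoutConfig.open is a hypothetical helper built on the JDK's HttpURLConnection.
final class TimeoutConfig {

    static HttpURLConnection open(String url, int connectTimeoutMillis, int readTimeoutMillis)
            throws IOException {
        // openConnection() only creates the object; no network I/O happens yet.
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(connectTimeoutMillis); // bounds the TCP connect phase
        conn.setReadTimeout(readTimeoutMillis);       // bounds each blocking socket read
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Assumed split of the budget: 2 s to connect, 15 s to read.
        HttpURLConnection conn = open("http://example.com/users/123", 2_000, 15_000);
        System.out.println(conn.getConnectTimeout() + " ms connect, "
                + conn.getReadTimeout() + " ms read");
    }
}
```

With both values set, a slow dependency can hold a thread for at most their sum per attempt, which is the quantity the budget calculation constrains.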

Data Driven Design 2

Caching

Best Effort Consistency System

99.9%

99.5%

99.8%

99.6%

Wow, my

pizza has too

many

toppings

- no one ever

http://upload.wikimedia.org/wikipedia/commons/6/60/Pizza_Hut_Meat_Lover's_pizza_3.JPG

“WOW

My system has

too much availability.”

- no one ever

Questions?

golucid.co

http://www.slideshare.net/DerrickIsaacson