Python, Go, and the Cost of Concurrency in the Cloud

Chris Erway Chief Architect, AppNeta


Goal of this talk

• Introduce Go for Python (& Ruby) programmers

• Explain key differences between Go and the other “no semicolons” languages

• Show an example application illustrating why those key differences matter for your app’s bottom line

About me

• Daily Python hacker

• came from C/C++, UNIX systems hacking background

• PhD on P2P/crypto research, more Python & C++

• Two months experience in Go

• co-founder, Tracelytics (now AppNeta)

Things I like about Python

• Fun to program — “Zen of Python”

• Builtin maps, sets, arrays, tuples

• Good library support

• Simple duck typing (as opposed to strict OO)

• A little code goes a long way

The Zen of Python (PEP 20)

Things I don’t like about Python

• Performance: not too slow, but not too fast either

• Dependencies can be a pain (virtualenv, pip, etc)

• The dreaded Global Interpreter Lock (GIL)

• Lack of typed function signatures can make reading code difficult

Go

• Announced 2009

• Creators: Ken Thompson (B, Plan 9 from Bell Labs), Rob Pike (Plan 9), Robert Griesemer (V8 engine)

• “all three of us had to be talked into every feature in the language, so there was no extraneous garbage put into the language for any reason”

• Statically typed, garbage-collected

• Fast compilation, static linking

Go is simple

• Basic builtin data types
  • boolean, int (int, int32, int64, byte, rune, …), float, complex, string

• Complex builtin types
  • pointer - a typed reference to a value
  • array - fixed length sequence of typed elements
  • slice - window into part of an array
  • map - typed key/value
  • channel - typed, optionally directional and/or buffered
  • struct - sequence of named and typed elements

• All of Go’s keywords:

    break     default      func    interface  select
    case      defer        go      map        struct
    chan      else         goto    package    switch
    const     fallthrough  if      range      type
    continue  for          import  return     var

Similarities between Go and Python

• Easy to read, no semicolons

• Built-in maps, arrays, strings

• Both support calling into C code when necessary

• Interfaces based on duck typing

• No virtual inheritance

• Go is statically typed, but type inference and interfaces give it a “dynamic feel”

Differences between Go and Python

• Go is compiled to native machine code
  • Fast compiler, single static binary

• Go is fast; memory usage depends on size of structs: no per-object dictionaries, as in Python

• Go has concurrency features built into the language: goroutines, channels, runtime scheduler

• Go has curly braces

Go examples

• Lots of resources online to help you learn Go!

• Following slides from “Go for Pythonistas” by Francesc Campoy Flores, Google and “Go and the Zen of Python” by Andrew Gerrand, Google

• Also interesting viewing: “Go for Python Programmers” by Brian Dorsey, Google

• See also: The Go Programming Language Blog, blog.golang.org

Go methods & objects

"Simple is better than complex."

Methods are just functions (no special location)

There's no this or self - the receiver is like any other function argument

    type Vector struct {
        X, Y float64
    }

    func (v Vector) Abs() float64 {
        return math.Sqrt(v.X*v.X + v.Y*v.Y)
    }


Methods can be declared on any named type (no classes)

    type Scalar float64

    func (s Scalar) Abs() float64 {
        if s < 0 {
            return float64(-s)
        }
        return float64(s)
    }


Interfaces are just methods (no data)

Interfaces are implicit (no implements declaration)

    type Abser interface {
        Abs() float64
    }

(Both Vector and Scalar implement Abser, even though they don't know that Abser exists.)
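To see the implicit interface in action, here is a minimal sketch that passes both types to one function. The Vector, Scalar, and Abser definitions are the ones from the slides; sumAbs is a hypothetical helper added only for illustration.

```go
package main

import (
	"fmt"
	"math"
)

type Vector struct{ X, Y float64 }

func (v Vector) Abs() float64 { return math.Sqrt(v.X*v.X + v.Y*v.Y) }

type Scalar float64

func (s Scalar) Abs() float64 {
	if s < 0 {
		return float64(-s)
	}
	return float64(s)
}

type Abser interface {
	Abs() float64
}

// sumAbs is a hypothetical helper (not from the talk). It accepts any value
// that has an Abs() float64 method, without the types declaring anything.
func sumAbs(items ...Abser) float64 {
	total := 0.0
	for _, it := range items {
		total += it.Abs()
	}
	return total
}

func main() {
	// Vector{3, 4} has magnitude 5 and Scalar(-2) has magnitude 2.
	fmt.Println(sumAbs(Vector{3, 4}, Scalar(-2))) // prints 7
}
```

Neither type mentions Abser; having the right method set is enough, which is the "duck typing" feel with a compile-time check.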


Identifier case sets visibility.

If a name begins with a capital, it is visible outside its package:

    package foo

    type Foo struct { // exported type
        bar int // unexported field
    }

    func (f Foo) Bar() {} // exported method

    func (f Foo) quux() {} // unexported method

Only code inside the package can see unexported ("private") names.

Go control flow

• Just a few keywords:
  • if
  • for
  • switch
  • select (like switch for channels)

• But without:
  • Ternary operator (Python: X if COND else Y)
  • List comprehensions, crazy Python-style one-liners

Go syntax and simplicity

"Readability counts."

Go was designed for teams of hundreds/thousands of programmers. Readability is of paramount importance.

The gofmt tool enforces "one true style." (No more stupid arguments.)

Type inference saves a lot of typing, but not at the cost of readability. Types are still required where they help readability (function declarations, for example).

Many other language design decisions were made in the name of readability (case-based name visibility, for example).

Example: Go URL-fetcher

    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    func main() {
        urls := []string{"http://google.com/", "http://bing.com/"}
        start := time.Now()
        done := make(chan string)
        for _, u := range urls {
            // Fetch each URL in its own goroutine.
            go func(u string) {
                resp, err := http.Get(u)
                if err != nil {
                    done <- u + " " + err.Error()
                    return
                }
                resp.Body.Close() // release the connection
                done <- u + " " + resp.Status
            }(u)
        }
        // Print one result per URL as each one arrives.
        for range urls {
            fmt.Println(<-done, time.Since(start))
        }
    }

Example: Go waitgroups
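The code for this slide did not survive in the text. As a rough idea of what a WaitGroup version of the fetcher could look like, here is a minimal sketch. It is an assumption, not the talk's code: fetchStatus is a hypothetical stand-in that sleeps instead of calling http.Get, and fetchAll is a made-up name.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchStatus is a hypothetical stand-in for http.Get: it sleeps briefly to
// simulate network latency and returns a canned status string.
func fetchStatus(url string) string {
	time.Sleep(10 * time.Millisecond)
	return url + " 200 OK"
}

// fetchAll fetches every URL in its own goroutine and uses a sync.WaitGroup
// to block until all of them have finished.
func fetchAll(urls []string) []string {
	results := make([]string, len(urls))
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			results[i] = fetchStatus(u) // each goroutine writes only its own slot
		}(i, u)
	}
	wg.Wait()
	return results
}

func main() {
	for _, r := range fetchAll([]string{"http://google.com/", "http://bing.com/"}) {
		fmt.Println(r)
	}
}
```

Compared with the channel version, the WaitGroup only signals completion; results come back in input order because each goroutine fills a fixed slot.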


Have you ever heard of Fibonacci?

    def fib(n):
        a, b = 0, 1
        for i in range(n):
            a, b = b, a + b
        return b

    def fib_rec(n):
        if n <= 1:
            return 1
        else:
            return fib_rec(n-1) + fib_rec(n-2)

    for x in range(10):
        print(fib(x), fib_rec(x))


Something familiar?

    func fib(n int) int {
        a, b := 0, 1
        for i := 0; i < n; i++ {
            a, b = b, a+b
        }
        return b
    }

    func fibRec(n int) int {
        if n <= 1 {
            return 1
        }
        return fibRec(n-1) + fibRec(n-2)
    }

    func main() {
        for i := 0; i < 10; i++ {
            fmt.Println(fib(i), fibRec(i))
        }
    }

Fibonacci without generators? What?

Python generators are awesome.

    def fib(n):
        a, b = 0, 1
        for i in range(n):
            a, b = b, a + b
            yield a

Mechanically complex.

    f = fib(10)
    try:
        while True:
            print(next(f))
    except StopIteration:
        print('done')

But very easy to use.

    for x in fib(10):
        print(x)
    print('done')

Python generators

Note the generator executes concurrently. Hmm... I like concurrency.

Go concurrency

Based on goroutines and channels.

Goroutines: very light processing actors (the gophers).

Channels: typed, synchronized, thread-safe pipes (the arrows).

"Generator" goroutines

Uses a channel send instead of yield.

    func fib(c chan int, n int) {
        a, b := 0, 1
        for i := 0; i < n; i++ {
            a, b = b, a+b
            c <- a
        }
        close(c)
    }

    func main() {
        c := make(chan int)
        go fib(c, 10)

        for x := range c {
            fmt.Println(x)
        }
    }

"Generator" goroutines

A more generator-like style:

    func fib(n int) chan int {
        c := make(chan int)
        go func() {
            a, b := 0, 1
            for i := 0; i < n; i++ {
                a, b = b, a+b
                c <- a
            }
            close(c)
        }()
        return c
    }

    func main() {
        for x := range fib(10) {
            fmt.Println(x)
        }
    }

Language comparison

                           Python  Ruby  Node.js  C/C++  Java  Go
semicolons                 N       N     Y        Y      Y     N
curly braces               N       N*    Y        Y      Y     Y
static types               N       N     N        Y      Y     Y
easy-to-use concurrency    N       N     Y        N      N     Y
multi-core concurrency     N       N     N        Y      Y     Y
compiled                   N       N     N        Y      Y     Y
OO: classes, inheritance   Y       Y     Y        Y      Y     N*

So who cares?

• You do: concurrency is important in the modern computing environment

• Programming for “the cloud” or for “SOA” or “microservices” is fundamentally different than writing a LAMP/MEAN/Rails app

• Assumptions on latency, throughput, scale all change

• The language you pick can cost you time & money

Cloud vs. self-managed

• Managed

• “Infinite” scale

• HTTP-based RPC

• Usage-based pricing

• Hard to overprovision


• Self-hosted

• As scalable as you can make it (e.g. Redis vs. Cassandra)

• Connection-oriented services

• Instance-based pricing

• Some overprovisioning necessary

Cloud environments demand concurrency

• Self-hosted systems and databases generally use pools of long-living connections

• RabbitMQ vs SQS

• HTTP-based APIs can have high latency

• DynamoDB 5-10ms latency

• Kinesis PutRecords, S3, SQS 10-100ms latency

What about my async code for Python, Ruby, Node?

• Async I/O makes network, disk reads & writes asynchronous

• Used by Python’s gevent, Tornado, Twisted

• Ruby EventMachine, Celluloid

• Node.js, libuv, libev, libevent

• Allows interpreter to switch to another execution context/greenlet/thread while I/O is pending

• Go: blocking I/O is OK when you have multiple goroutines
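A minimal sketch of that last point: each goroutine makes a plain blocking call, and the runtime overlaps the waits. The names slowCall and runConcurrently are hypothetical, and time.Sleep stands in for a real network request.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// slowCall is a hypothetical stand-in for a blocking API call (for example an
// HTTP request) that takes latencyMs milliseconds to return.
func slowCall(latencyMs int) {
	time.Sleep(time.Duration(latencyMs) * time.Millisecond)
}

// runConcurrently issues n blocking calls, one per goroutine, and returns the
// total wall-clock time until all of them have returned.
func runConcurrently(n, latencyMs int) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			slowCall(latencyMs) // blocks this goroutine only
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	// Run serially, 100 calls at 20ms each would take about 2 seconds.
	fmt.Println("100 blocking calls took", runConcurrently(100, 20))
}
```

The code reads like ordinary synchronous code; no callbacks or explicit event loop are needed to overlap the waits.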

Node.js async example

Gevent async example

Cloud APIs require compute-heavy RPCs

• HTTP-based APIs with authenticated JSON/XML

• Encryption: TLS/SSL key exchange, negotiation

• Authentication: AWS, Google request signature schemes

• Serialization: Convert data to JSON, base64, etc

• Not as simple as binary data over raw sockets

• Not pure disk/network I/O — not as easy to use async I/O
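To make the CPU work concrete, here is a sketch of the serialize-and-sign steps a client might perform before each request. It is illustrative only and is not the actual AWS or Google signing scheme; the item type, the signedBody function, and the key are all hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// item is a hypothetical payload type used only for this sketch.
type item struct {
	ID   string `json:"id"`
	Data []byte `json:"data"` // encoding/json emits []byte as base64 text
}

// signedBody serializes the item to JSON and computes an HMAC-SHA256
// signature over the body. Both steps are CPU work, not I/O.
func signedBody(it item, secret []byte) (body []byte, signature string, err error) {
	body, err = json.Marshal(it)
	if err != nil {
		return nil, "", err
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	signature = base64.StdEncoding.EncodeToString(mac.Sum(nil))
	return body, signature, nil
}

func main() {
	body, sig, err := signedBody(item{ID: "a1", Data: []byte("hello")}, []byte("not-a-real-key"))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
	fmt.Println(sig)
}
```

In a single-threaded async runtime this work runs on the one event-loop thread, so it delays every other pending request; TLS adds further CPU cost on top.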

Increasing prevalence of multi-core architectures

• Dual-core, quad-core, 8-core, 16-core, 32-core …

• How will you use all those CPUs?

• Strong opinion: Docker, containerization is a crutch for horizontally scaling single-threaded services

Motivating example

[Diagram: ten Analysis workers, each writing directly to S3 + DynamoDB]

• ~700 items analyzed each second

• ~700 S3 PUTs/sec, ~70KB each

• ~700 DynamoDB item writes/sec

Cloud storage, queue, and log costs

Service    Write cost                                     Max object size       Storage $/GB-month        Read cost
S3         $5/million PUTs                                5GB                   $0.03 ($0.01 Glacier)     $0.40/million GETs
GCS        $10/million PUTs                               5TB                   $0.026 ($0.01 Nearline)   $1/million GETs
SQS        $0.50/million API requests                     192KB                 -                         $0.50/million API requests
Kinesis    $0.028/million PUT records                     50KB                  $11 per shard-month       (each shard reads 2MB/sec)
DynamoDB   … Hz-month items (half off with reservation)   400KB (all Ks & Vs)   $0.25 (structured)        … Hz-month items (consistent)


Batch S3 writes for fewer S3 PUTs

• Read data objects from SQS

• Batch into larger files and store in S3

Batch S3 Writer

[Diagram: Analysis workers → SQS → Batch Writer → S3 and DynamoDB]

• Write S3 URL and offsets to DynamoDB

SQS + S3 + DynamoDB
no batching

Monthly Cost   Usage Rate   Batch Size   Service
$13,100        1000 Hz      1            S3 PUT
$474           1000 Hz      1            DynamoDB item writes
$13,574        1000 Hz                   TOTAL

SQS + S3 + DynamoDB
S3_BATCH_SZ=10

Monthly Cost   Usage Rate   Batch Size   Service
$1310          1000 Hz      1            SQS SendMessage
$131           100 Hz       10           SQS ReceiveMessage
$131           100 Hz       10           SQS DeleteMessage
$1310          100 Hz       10           S3 PUT
$474           100 Hz       10           DynamoDB item writes
$3356          1000 Hz                   TOTAL (24.7%)
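The per-request rows can be roughly reproduced from the list prices ($5 per million S3 PUTs, $0.50 per million SQS requests). A minimal sketch of the arithmetic follows. The month length of 365/12 days is an assumption, since the slides do not say which one was used, so the results only match the slide figures to within rounding; monthlyCost is a hypothetical helper name.

```go
package main

import "fmt"

// secondsPerMonth assumes an average month of 365/12 days. This is an
// assumption; the slides do not state the month length they used.
const secondsPerMonth = 365.0 / 12 * 24 * 3600 // 2,628,000 seconds

// monthlyCost returns the dollars per month for a steady rate of rateHz
// requests per second, priced at pricePerMillion dollars per million requests.
func monthlyCost(rateHz, pricePerMillion float64) float64 {
	return rateHz * secondsPerMonth / 1e6 * pricePerMillion
}

func main() {
	fmt.Printf("S3 PUT at 1000 Hz (no batching):  $%.0f\n", monthlyCost(1000, 5.00))
	fmt.Printf("S3 PUT at 100 Hz (batches of 10): $%.0f\n", monthlyCost(100, 5.00))
	fmt.Printf("SQS SendMessage at 1000 Hz:       $%.0f\n", monthlyCost(1000, 0.50))
}
```

Because cost is linear in request rate, every 10x increase in batch size divides the S3 PUT line by 10, while the per-message SQS SendMessage cost stays fixed.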




SQS + S3 + DynamoDB
S3_BATCH_SZ=100

Monthly Cost   Usage Rate   Batch Size   Service
$1310          1000 Hz      1            SQS SendMessage
$131           100 Hz       10           SQS ReceiveMessage
$131           100 Hz       10           SQS DeleteMessage
$131           10 Hz        100          S3 PUT
$474           10 Hz        100          DynamoDB item writes
$2177          1000 Hz                   TOTAL (16%)




SQS + Google Cloud Storage + DynamoDB

Monthly Cost   Usage Rate   Batch Size   Service
$1310          1000 Hz      1            SQS SendMessage
$131           100 Hz       10           SQS ReceiveMessage
$131           100 Hz       10           SQS DeleteMessage
$263           10 Hz        100          GCS PUT
$474           10 Hz        100          DynamoDB item writes
$2300          1000 Hz                   TOTAL (16%)



SQS + S3 + DynamoDB
S3_BATCH_SZ=1000

Monthly Cost   Usage Rate   Batch Size   Service
$1310          1000 Hz      1            SQS SendMessage
$131           100 Hz       10           SQS ReceiveMessage
$131           100 Hz       10           SQS DeleteMessage
$13            1 Hz         1000         S3 PUT
$474           1 Hz         1000         DynamoDB item writes
$2059          1000 Hz                   TOTAL (15%)



RabbitMQ + S3 + DynamoDB
self-managed queue instances

Monthly Cost   Usage Rate   Batch Size   Service
$4091          1000 Hz                   2x r3.8xlarge (244GB RAM each): store 70KB items for ~1hr without exceeding 50% RAM
$13            1 Hz         1000         S3 PUT
$474           1000 Hz      1000         DynamoDB item writes
>$4578         1000 Hz                   TOTAL (34%)

Why 2x? Double the instances to allow for spikes and robustness to failure.



Basic algorithm

• Get ≤10 messages from SQS

• Batch size reached or flush timer fired? If not, get more messages

• Build batch file & offset map

• PUT batch file

• Write offset map

• Delete messages from SQS
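The "build batch file & offset map" step could be sketched as follows. This is an assumed layout, not the talk's actual format: message bodies are simply concatenated, and the hypothetical span type records where each message landed so a reader can later fetch just that byte range.

```go
package main

import (
	"bytes"
	"fmt"
)

// span locates one message inside a batch file. It is a hypothetical type;
// the talk does not describe the real offset-map format.
type span struct {
	Offset, Length int
}

// buildBatch concatenates message bodies into one batch file and records the
// byte range of each message, keyed by message ID. It assumes ids and bodies
// have the same length.
func buildBatch(ids []string, bodies [][]byte) ([]byte, map[string]span) {
	var buf bytes.Buffer
	offsets := make(map[string]span, len(ids))
	for i, id := range ids {
		offsets[id] = span{Offset: buf.Len(), Length: len(bodies[i])}
		buf.Write(bodies[i])
	}
	return buf.Bytes(), offsets
}

func main() {
	file, offsets := buildBatch(
		[]string{"m1", "m2"},
		[][]byte{[]byte("hello"), []byte("world!")},
	)
	s := offsets["m2"]
	// Recover the second message from the batch using only its span.
	fmt.Println(len(file), s.Offset, s.Length, string(file[s.Offset:s.Offset+s.Length]))
	// prints: 11 5 6 world!
}
```

The batch file would then go to S3 in one PUT, and the offset map (plus the S3 URL) to DynamoDB, so one PUT is shared by every message in the batch.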

Implementation difficulties

• Get ≤10 messages. Average latency: ~20ms (50/sec/thread)

• Batch size reached or flush timer fired?

• Build batch file & offset map

• PUT batch file. Latency: 20-200ms (size-dependent)

• Write offset map. Latency: <10ms

Single Python process
~50 messages/sec

[Diagram: one process runs the basic algorithm serially, from the SQS get through the SQS delete; average latency ~20ms]

Multiple Python processes
4 procs = 200 messages/sec

[Diagram: four independent processes, each running its own copy of the full loop: get ≤10 messages, check batch size or flush timer, build batch file & offset map, PUT batch file, write offset map, delete messages]

Process-based scaling leads to suboptimal cost performance

• Impossible to scale number of SQS pollers and S3 writers independently

• One batch buffer per process: smaller batches than optimal, hard to “max out” S3 batch size before timeout

• Hard to “max out” 10 messages per SQS read

• Hard to detect when system is falling behind, problematic if write latency > read latency



Go implementation

[Diagram: several "Get ≤10 messages" pollers feed an SQS message channel; a "batch size reached or flush timer fired?" check feeds a batch channel; several writers each build the batch file & offset map, PUT the batch file to S3, and write the offset map to DynamoDB]
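The shape of that design can be sketched with two channels. This is not the talk's implementation: the pollers and writers below are stand-ins that make no AWS calls, and batcher and runPipeline are hypothetical names. The point is that the number of pollers and the number of writers are set independently while a single batcher fills batches.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// batcher groups messages from in into slices of up to size messages. A
// partial batch is flushed on every tick of the flush interval and when in
// is closed.
func batcher(in <-chan string, out chan<- []string, size int, flush time.Duration) {
	defer close(out)
	var batch []string
	ticker := time.NewTicker(flush)
	defer ticker.Stop()
	emit := func() {
		if len(batch) > 0 {
			out <- batch
			batch = nil
		}
	}
	for {
		select {
		case m, ok := <-in:
			if !ok {
				emit()
				return
			}
			batch = append(batch, m)
			if len(batch) >= size {
				emit()
			}
		case <-ticker.C:
			emit()
		}
	}
}

// runPipeline wires pollers -> message channel -> batcher -> batch channel ->
// writers and returns the size of every batch the writers received.
func runPipeline(pollers, writers, perPoller, size int) []int {
	msgs := make(chan string)      // the "SQS message channel"
	batches := make(chan []string) // the "batch channel"

	var pw sync.WaitGroup
	for p := 0; p < pollers; p++ {
		pw.Add(1)
		go func(p int) { // stand-in for an SQS poller
			defer pw.Done()
			for i := 0; i < perPoller; i++ {
				msgs <- fmt.Sprintf("poller%d-msg%d", p, i)
			}
		}(p)
	}
	go func() { pw.Wait(); close(msgs) }()
	go batcher(msgs, batches, size, time.Second)

	var mu sync.Mutex
	var sizes []int
	var ww sync.WaitGroup
	for w := 0; w < writers; w++ {
		ww.Add(1)
		go func() { // stand-in for an S3 + DynamoDB writer
			defer ww.Done()
			for b := range batches {
				mu.Lock()
				sizes = append(sizes, len(b))
				mu.Unlock()
			}
		}()
	}
	ww.Wait()
	return sizes
}

func main() {
	sizes := runPipeline(4, 2, 25, 10)
	fmt.Println(len(sizes), "batches written")
}
```

Because all pollers feed one batcher, batches fill at the combined message rate rather than at each process's share of it.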

Concurrency costs money

This matters most when your code combines lots of high-latency API calls / RPCs

• Ruby, Python, Node.js all require lots of concurrent processes to achieve good concurrency

• Result: over-provisioning, over-polling, IPC when you don’t need to

• Result: suboptimal cost when using usage-priced APIs

Does this apply to me?

• More cores, more cloud, all the time

• SOA, “microservices”

• Do you have code that calls multiple independent services serially?

• Why?

Couldn't I just use {C, C++, Java, Scala, Clojure, Erlang, Haskell} to achieve multi-core concurrency?

• Yes, but —

• it may still be such a pain to spawn new threads in your language that you don’t do it enough (e.g. Java, C/C++) vs. just typing “go func()”

• C/C++ and Java have pretty heavyweight thread sizes, typically can only support 1K-10K threads

• Go (and Erlang?) have very lightweight threads; Go can support millions of goroutines
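A small sketch of how cheap goroutines are to start: the program below keeps 100,000 of them alive at once. The function name spawnAndWait is hypothetical, and the count is an arbitrary choice for the demo.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// spawnAndWait starts n goroutines that all exist at the same time (each one
// blocks on release until every goroutine has been started), then returns
// how many of them ran.
func spawnAndWait(n int) int64 {
	var ran int64
	var wg sync.WaitGroup
	release := make(chan struct{})
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-release // wait here until all goroutines have been created
			atomic.AddInt64(&ran, 1)
		}()
	}
	close(release) // all n goroutines have been created at this point
	wg.Wait()
	return ran
}

func main() {
	fmt.Println(spawnAndWait(100000)) // prints 100000
}
```

Doing the same with one OS thread per task would run into the thread-count limits mentioned above.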

Thank you!

• Hope this was useful and interesting!

