Golang Performance: microbenchmarks, profilers, and a war story

The fastest NoSQL database!

Talking about Go performance!

Try it while I blab: github.com/aerospike/aerospike-server
github.com/aerospike/aerospike-client-go

Who am I?

Brian Bulkowski
brian@bulkowski.org
brian@aerospike.com
@bbulkow

TRS-80, PC, Apple II, Vax 11/70, Wang

First product: light pen university teaching kiosk, Palo Alto High School ('85)

Liberate / NetComputer through the boom: $10B market cap in 1999, employee #32

2003-2007 "time off" (startups)

Citrusleaf / Aerospike history: 42-year-old first-time CEO (me), 2008 prototype, 2010 first sales, "get the band back together", 2011+ three rounds of funding (Draper, ALP, NEA, CNTP), 70 employees, 2 offices

Does Brian know performance?


Undergrad project: image converter - single-pass arbitrary scale and rotate with Nyquist filters

Novell: fastest AppleTalk server + router available

Starlight Networks: 150 Mb/sec video server on a P133

Liberate: HTML technology for embedded systems

Aggregate Knowledge: real-time recommendations, 2x faster in the first week

Aerospike: 10x faster than existing NoSQL, 100x faster than RDBMSs

Internet Technology Stack

[Stack diagram: millions of consumers and billions of devices hit app servers, which write real-time context to and read recent content from an in-memory NoSQL layer, which in turn feeds a data warehouse for insights. Roles shown: profile store (cookies, email, deviceID, IP address, location, segments, clicks, likes, tweets, search terms...), real-time analytics (best sellers, top scores, trending tweets), and batch analytics (discover patterns, segment data: location patterns, audience affinity).]

Who uses Aerospike?

[Customer logos: theTradeDesk … to name a few!]

Aerospike is High Performance

[Benchmark chart: throughput in transactions per second (0 to 1,700,000) for balanced and read-heavy workloads, comparing Aerospike 3 (in-memory), Aerospike 3 (persistent), Aerospike 2, Cassandra, MongoDB, Couchbase 1.8, and Couchbase 2.0.]

Easy Clients (better than JSON)

Go, Python

Also, analytics: http://www.aerospike.com/community/labs/

If it is so good, why haven't I heard of it?

Established in 2009 (newer than most)

Used in Advertising – ad exchanges, data exchanges, targeting, real-time bidding, real-time attribution.

Open Sourced in June 2014

When should I use Aerospike?

Redis, but with scale & flash
Cassandra, but fast

User data, session data, behavior, fraud…

API billing ~ retail actions ~ recommendations

Up and running in 10 minutes! (Vagrant, EC2 …)

Why does Aerospike care about Go?

It's cool!

Promises performance along with expressiveness (as an old C guy, Go is aimed at me)

Our customers are diving in and deploying it

What about (other versions of other languages)… (sure, they're cool too!)

Go! Let's talk about…

Some old microbenchmarks

Profilers, and how to run them

War story: optimizing our Go client

(sure, we know Go isn't JUST about performance)

Old Microbenchmark

On Nov 22, 2009, I posted to golang-nuts.

Old Microbenchmark - seconds (Nov 2009)

1.1 - Python (CPython 2.6.2, the distro release with no tweaks)
4.6 - Go (current hg release)
4.2 - Ruby 1.8 (distro release)
1.1 - Ruby 1.9 (distro release)

Pike said: "I suspect the great majority of the time in your benchmark is due to Go's current rudimentary garbage collector. Tests like this generate a lot of garbage that is collected slowly. From experiments I've done, a better implementation can make a huge difference. Profiling this test shows at least 50% of the time is in the allocator and collector, as opposed to about 5% printing the string and less than 15% in the map code. A better allocator and collector would make a dramatic change. The short answer: the Go runtime is new and completely untuned. The libraries need work too."

Microbenchmark "T1"

    for i := 0; i < 1000000; i++ {
        x = (2 * x) + x + 1
    }

Python: 1.96 s (big integer only)
Go: 1.04 ms (2.17 s with big.Int)
Java: 5 ms (2.15 s with BigNum)

Good news: Go is right in the hunt, but easier to code.

Amazon m3.xlarge (4 core E3 @ 2.5 GHz), Python 2.6.9, Go 1.3.3, Java 1.7.0_71, Amazon Linux (3.16)
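For reference, a minimal sketch of timing the T1 loop with native ints; this harness is illustrative, not the code behind the slide's numbers. Note that Go's native int wraps on overflow, while Python's integers auto-promote to arbitrary precision, which is presumably why the slide marks the Python figure "big integer only".

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        start := time.Now()
        x := 0
        for i := 0; i < 1000000; i++ {
            x = (2 * x) + x + 1 // wraps silently once it exceeds the int range
        }
        fmt.Printf("native int: %v (x = %d)\n", time.Since(start), x)
    }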

Microbenchmarks

T5 - the 2009 benchmark:

Python: 12.5 sec
Go: 12.56 sec
Java: 2.56 sec

Good news: not slower than Python! Bad news: Holy Crap compared to Java.

Amazon m3.xlarge (4 core E3 @ 2.5 GHz), Python 2.6.9, Go 1.3.3, Java 1.7.0_71, Amazon Linux (3.16)

Microbenchmarks - the old code

T5 - the 2009 benchmark (slower CPU):

    for x := 0; x < 1000000; x++ {
        a := make(map[int]string)
        for a1 := 0; a1 < 50; a1++ {
            a[a1] = strconv.Itoa(a1)
        }
    }

12.56 seconds

Amazon m3.xlarge (4 core E3 @ 2.5 GHz), Python 2.6.9, Go 1.3.3, Java 1.7.0_71, Amazon Linux (3.16)

Microbenchmarks - tune the map

T5 - the 2009 benchmark:

    for x := 0; x < 1000000; x++ {
        a := make(map[int]string, 50)
        for a1 := 0; a1 < 50; a1++ {
            a[a1] = strconv.Itoa(a1)
        }
    }

7.80 seconds

Amazon m3.xlarge (4 core E3 @ 2.5 GHz), Python 2.6.9, Go 1.3.3, Java 1.7.0_71, Amazon Linux (3.16)

Microbenchmarks - remove the Itoa

T5 - the 2009 benchmark:

    for x := 0; x < 1000000; x++ {
        a := make(map[int]string, 50)
        for a1 := 0; a1 < 50; a1++ {
            a[a1] = "123456"
        }
    }

5.45 seconds

Amazon m3.xlarge (4 core E3 @ 2.5 GHz), Python 2.6.9, Go 1.3.3, Java 1.7.0_71, Amazon Linux (3.16)

Microbenchmarks - singleton map

T5 - the 2009 benchmark:

    a := make(map[int]string, 50)
    for x := 0; x < 1000000; x++ {
        // a := make(map[int]string, 50)
        for a1 := 0; a1 < 50; a1++ {
            a[a1] = "123456"
        }
    }

2.03 seconds! Finally better than Java!

Amazon m3.xlarge (4 core E3 @ 2.5 GHz), Python 2.6.9, Go 1.3.3, Java 1.7.0_71, Amazon Linux (3.16)
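As an aside, these one-off loops can also be written as standard Go benchmarks, which lets the testing package pick the iteration count and report allocations. A sketch of the tuned-map variant under that assumption (file and function names are illustrative):

    // map_bench_test.go
    package main

    import (
        "strconv"
        "testing"
    )

    // BenchmarkMapFill mirrors the T5 inner loop: build a 50-entry
    // map[int]string with pre-sized capacity, converting ints to strings.
    func BenchmarkMapFill(b *testing.B) {
        for i := 0; i < b.N; i++ {
            a := make(map[int]string, 50)
            for k := 0; k < 50; k++ {
                a[k] = strconv.Itoa(k)
            }
        }
    }

Run with go test -bench=. -benchmem to get per-iteration time and allocation counts, which makes the make-inside-the-loop versus singleton-map difference easy to see.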

Microbenchmarks - Java

T5 - the 2009 benchmark:

    for (int x = 0; x < 1000000; x++) {
        HashMap<Integer, String> a = new HashMap<Integer, String>();
        for (int a1 = 0; a1 < 50; a1++) {
            a.put(a1, Integer.toString(a1));
        }
    }

2.56 seconds

Amazon m3.xlarge (4 core E3 @ 2.5 GHz), Python 2.6.9, Go 1.3.3, Java 1.7.0_71, Amazon Linux (3.16)

Any ideas?

( I haven’t figured it out yet )

Next microbenchmarks!

Float, String

Go channels vs Java futures … couldn't code the Java part in time!

Simple TCP echo, but with transactions

Log processing

Ruby 2.1, Go 1.4 …

Your votes?

Profilers

pprof is pretty great!

Import it in all your mains; it does not seem to hurt:

    import _ "net/http/pprof"

Add the HTTP listener (only behind a flag):

    // launch http pprof listener if in profile mode
    if *profileMode {
        go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()
    }
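Putting the two pieces together, a minimal self-contained sketch of a program with the pprof endpoint enabled behind a flag; the variable name and port match the snippet above, while the command-line flag name and the placeholder workload are illustrative:

    package main

    import (
        "flag"
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof handlers on the default mux
        "time"
    )

    var profileMode = flag.Bool("profile", false, "expose pprof on localhost:6060")

    func main() {
        flag.Parse()

        // launch http pprof listener if in profile mode
        if *profileMode {
            go func() {
                log.Println(http.ListenAndServe("localhost:6060", nil))
            }()
        }

        // ... the real workload goes here; for this sketch, just stay alive
        // long enough to be profiled.
        time.Sleep(10 * time.Minute)
    }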

Profilers

Take a 30-second snapshot:

    go tool pprof http://localhost:6060/debug/pprof/profile?seconds=xx

At the pprof prompt, 'top 10':

    (pprof) top 10
    Total: 3852 samples
    1187  30.8%  30.8%  1254  32.6%  syscall.Syscall
     304   7.9%  38.7%   304   7.9%  ExternalCode
     172   4.5%  43.2%   175   4.5%  github.com/aerospike/aerospike-client-go/pkg/ripemd160._Block
     137   3.6%  46.7%   233   6.0%  runtime.mallocgc
      98   2.5%  49.3%    98   2.5%  runtime.futex
      79   2.1%  51.3%    86   2.2%  runtime.MSpan_Sweep
      77   2.0%  53.3%    77   2.0%  scanblock
      68   1.8%  55.1%    68   1.8%  runtime.xchg
      46   1.2%  56.3%    46   1.2%  runtime.epollwait

Profilers

    (pprof) web

opens a call-graph visualization in the browser (requires graphviz).

Profilers

Good old 'oprofile', let's not forget it (especially if you can get kernel symbols, which is hard).

Install:

    sudo yum -y install oprofile

Start capturing:

    sudo opcontrol --reset
    sudo opcontrol --no-vmlinux
    sudo opcontrol --start

Run your program, then stop:

    sudo opcontrol --dump
    sudo opcontrol --shutdown

Dump your results:

    sudo opreport -l --demangle=smart --debug-info

Cheat Sheet http://www.bonsai.com/wiki/howtos/tuning/oprofile/

Profilers

opreport output:

    samples  %        linenr info                image name  app name   symbol name
    28106    56.5877  (no location information)  no-vmlinux  no-vmlinux /no-vmlinux
    6216     12.5151  rand.go:76                 benchmark   benchmark  math/rand.(*Rand).Int31n
    3940      7.9327  rng.go:232                 benchmark   benchmark  math/rand.(*rngSource).Int63
    1987      4.0006  benchmark.go:255           benchmark   benchmark  main.randString
    1584      3.1892  rand.go:43                 benchmark   benchmark  math/rand.(*Rand).Int63
    1465      2.9496  rand.go:93                 benchmark   benchmark  math/rand.(*Rand).Intn
    1421      2.8610  rand.go:49                 benchmark   benchmark  math/rand.(*Rand).Int31
    354       0.7127  ripemd160block.go:45       benchmark   benchmark  github.com/aerospike/aerospike-client-go/pkg/ripemd160._Block
    349       0.7027  mgc0.c:720                 benchmark   benchmark  scanblock
    307       0.6181  malloc.goc:40              benchmark   benchmark  runtime.mallocgc
    205       0.4127  mgc0.c:1783                benchmark   benchmark  runtime.MSpan_Sweep
    138       0.2778  memmove_amd64.s:33         benchmark   benchmark  runtime.memmove
    131       0.2638  asm_amd64.s:600            benchmark   benchmark  runtime.xchg

Tuning the Aerospike Client

What does the client do?

Maintain the DHT state
Keep a connection pool
Make requests to the right servers
Box / unbox to wire protocol…

SIMPLE

Tuning the Aerospike Client

Attempt 1: run pprof

The usual dance of making life easy for the garbage collector (just like Java).

pprof worked: the hot objects showed up.

Cache easily with sized channels (a sketch follows).
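A minimal sketch of the "sized channel" trick: a buffered channel used as a free list, so hot buffers are reused instead of handed to the garbage collector. The type, names, and sizes are illustrative, not the client's actual code; sync.Pool serves a similar purpose.

    // bufPool caches up to 1024 byte slices for reuse.
    var bufPool = make(chan []byte, 1024)

    // getBuf returns a cached buffer if one is available,
    // otherwise allocates a fresh one.
    func getBuf() []byte {
        select {
        case b := <-bufPool:
            return b
        default:
            return make([]byte, 0, 4096)
        }
    }

    // putBuf returns a buffer to the cache, dropping it if the cache is full.
    func putBuf(b []byte) {
        select {
        case bufPool <- b[:0]:
        default: // cache full; let the GC take it
        }
    }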

Tuning the Aerospike Client

Attempt 2: oprofile

oprofile found rand() taking time.
Optimization gave nothing … not sure why not.
Currently happy with throughput.

Tuning the Aerospike Client

Latency problem at a customer site:

A user was validating a server install with a quick Go client.
"17 ms average latency @ 20K TPS" -- terrible!

Server measured at 0.4 ms @ 40K TPS:
-- ping ok
-- it's the client!

Where's the latency source? GC? Green threads? Network?
-- Profile shows low GC load
-- Hard to measure thread latency
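One cheap way to see where client-side latency is going is to bucket per-request times directly in the benchmark client. A generic sketch under that assumption; doRequest is a hypothetical stand-in for the real client call:

    package main

    import (
        "fmt"
        "sync/atomic"
        "time"
    )

    // Latency buckets: <1ms, <2ms, <4ms, ... doubling per slot,
    // so a long tail shows up at a glance. Updated atomically so
    // multiple goroutines can share them.
    var buckets [16]uint64

    func record(d time.Duration) {
        ms := int64(d / time.Millisecond)
        i := 0
        for ms > 0 && i < len(buckets)-1 {
            ms >>= 1
            i++
        }
        atomic.AddUint64(&buckets[i], 1)
    }

    // doRequest is a stand-in for the real client call being measured.
    func doRequest() error {
        time.Sleep(500 * time.Microsecond)
        return nil
    }

    func main() {
        for i := 0; i < 10000; i++ {
            start := time.Now()
            doRequest()
            record(time.Since(start))
        }
        for i, c := range buckets {
            fmt.Printf("< %d ms: %d\n", 1<<uint(i), c)
        }
    }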

EC2 m3.xlarge ($0.05/hr)
4 core E5-2670 @ 2.5 GHz
Bare metal vs virtual
CentOS 6 vs latest kernel
Intel SSDs vs RAM

Tuning the Aerospike Client

[Latency results charts: Go client, Java client]

What happened?

•  Not sure what happened at the deployment (yet; suspect an old kernel)

•  A week lost by developers using MacOS laptops (MacOS is showing bad latency)

•  C code is running slower -- we think it's the random fill of the buffer

•  Lesson: just switch to Linux 3.12-ish kernels

•  Lesson: fewer lines -- ~11k Go vs 17k Java

•  Lesson: for network / IO, these languages are THE SAME!