Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ

Rick Copeland (@rick446), [email protected]

© 2011 Geeknet Inc


DESCRIPTION

With over 180,000 projects and over 2 million users, SourceForge has tons of data about people developing and downloading open source projects. Until recently, however, that data didn't translate into usable information, so Zarkov was born. Zarkov is a system that captures user events, logs them to a MongoDB collection, and aggregates them into useful data about user behavior and project statistics. This talk will discuss the components of Zarkov, including its use of Gevent asynchronous programming, ZeroMQ sockets, and the pymongo/bson driver.


Page 1: Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ

© 2011 Geeknet Inc

Realtime Analytics using MongoDB, Python, Gevent,

and ZeroMQ

Rick Copeland (@rick446)

[email protected]

Page 2

SourceForge ♥ MongoDB

- Tried CouchDB – liked the dev model, not so much the performance

- Migrated consumer-facing pages (summary, browse, download) to MongoDB and it worked great (on MongoDB 0.8 no less!)

- Built an entirely new tool platform around MongoDB (Allura)

Page 3

The Problem We’re Trying to Solve

- We have lots of users (good)

- We have lots of projects (good)

- We don’t know what those users and projects are doing (not so good)

- We have tons of code in PHP, Perl, and Python (not so good)

Page 4

Introducing Zarkov 0.0.1

- Asynchronous TCP server for event logging with gevent

- Turn OFF "safe" writes, turn OFF Ming validation (or do it in the client)

- Incrementally calculate aggregate stats based on the event log, using mapreduce with {'out': 'reduce'}
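The 'reduce' output mode means each incremental run's output is re-reduced against what is already in the output collection instead of replacing it. A toy pure-Python sketch of that merge step (names are illustrative, not Zarkov's):

```python
def merge_counts(existing, new_results):
    """Re-reduce a new batch of map/reduce output into an existing
    aggregate, the way MongoDB's reduce output mode re-runs the
    reduce function on (old value, new value) pairs."""
    out = dict(existing)
    for key, value in new_results.items():
        if key in out:
            out[key] = out[key] + value   # the "reduce" step
        else:
            out[key] = value
    return out
```

Because only the new events need mapping on each run, the aggregate stays current without rescanning the whole event log.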

Page 5

Zarkov Architecture

[Architecture diagram] Events arrive as BSON over ZeroMQ; a Journal Greenlet appends them to a write-ahead log; a Commit Greenlet replays the write-ahead log into MongoDB; an Aggregation Greenlet computes the aggregates.

Page 6

Technologies

- MongoDB: fast (10k+ inserts/s single-threaded)

- ZeroMQ: built-in buffering; PUSH/PULL sockets (push never blocks, easy to distribute work)

- BSON: fast Python/C implementation; more types than JSON

- Gevent: "green threads" for Python

Page 7

"Wow, it's really fast; can it replace…"

- Download statistics?
- Google Analytics?
- Project realtime statistics?

"Probably, but it'll take some work…"

Page 8

Moving towards production....

- MongoDB MapReduce: convenient, but not so fast
  - Global JS interpreter lock per mongod
  - Lots of writing to temp collections (high lock %)
  - Javascript without libraries (ick!)

- Hadoop? Painful to configure, high latency, non-seamless integration with MongoDB

Page 9

Zarkov’s already doing a lot…

So we added a lightweight map/reduce framework

- Write your map/reduce jobs in Python
- Input/output is MongoDB
- Intermediate files are local .bson files
- Use ZeroMQ for job distribution
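A toy single-process sketch of that phase structure, with json files standing in for the local .bson intermediates (all names here are illustrative, not Zarkov's API):

```python
import heapq
import itertools
import json
import operator
import os
import tempfile

def disk_map_reduce(records, map_fn, reduce_fn, chunk_size=2):
    """Map each input chunk to a sorted intermediate file, merge the
    sorted runs, then group by key and reduce each group."""
    # Map phase: one sorted intermediate file per input chunk.
    tmpdir = tempfile.mkdtemp()
    paths = []
    for i in range(0, len(records), chunk_size):
        kv_pairs = sorted(map_fn(records[i:i + chunk_size]))
        path = os.path.join(tmpdir, 'map_out_%d.json' % len(paths))
        with open(path, 'w') as f:
            json.dump(kv_pairs, f)
        paths.append(path)
    # Sort phase: merge the already-sorted runs (a "big mergesort").
    runs = []
    for path in paths:
        with open(path) as f:
            runs.append(json.load(f))
    merged = heapq.merge(*runs)
    # Reduce phase: group equal keys and reduce each group.
    return {key: reduce_fn(key, [v for _, v in group])
            for key, group in itertools.groupby(merged,
                                                operator.itemgetter(0))}
```

In the real system the map and reduce phases run on remote workers; only the merge is local.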

Page 10

Quick Map/reduce Refresher

import itertools
import operator

def map_reduce(input_collection, query, output_collection,
               map, reduce):
    objects = input_collection.find(query)
    # map yields (key, value) pairs
    map_results = list(map(objects))
    map_results.sort(key=operator.itemgetter(0))
    for key, kv_pairs in itertools.groupby(
            map_results, operator.itemgetter(0)):
        value = reduce(key, [v for k, v in kv_pairs])
        output_collection.save(
            {"_id": key, "value": value})
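To see the refresher in action, here is a self-contained version with a minimal in-memory stand-in for a MongoDB collection (ToyCollection is illustrative only, not part of pymongo or Zarkov):

```python
import itertools
import operator

class ToyCollection(object):
    """Minimal in-memory stand-in for a MongoDB collection."""
    def __init__(self, docs=None):
        self.docs = list(docs or [])
    def find(self, query=None):
        return iter(self.docs)
    def save(self, doc):
        self.docs.append(doc)

def map_reduce(input_collection, query, output_collection,
               map, reduce):
    objects = input_collection.find(query)
    map_results = list(map(objects))          # (key, value) pairs
    map_results.sort(key=operator.itemgetter(0))
    for key, kv_pairs in itertools.groupby(
            map_results, operator.itemgetter(0)):
        value = reduce(key, [v for k, v in kv_pairs])
        output_collection.save({"_id": key, "value": value})

# Count events by type
events = ToyCollection([{'type': 'download'}, {'type': 'hit'},
                        {'type': 'download'}])
stats = ToyCollection()
map_reduce(events, {}, stats,
           map=lambda objs: [(o['type'], 1) for o in objs],
           reduce=lambda key, values: sum(values))
# stats.docs → [{'_id': 'download', 'value': 2}, {'_id': 'hit', 'value': 1}]
```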

Page 11

Quick Map/reduce Refresher (repeated)

The same code as the previous slide, annotated to show that the map and reduce calls are the phases that can run in parallel across workers.

Page 12

Zarkov Map/Reduce Architecture

[Pipeline diagram] The JobMgr drives the pipeline: Query → Map → Sort → Reduce → Commit, with local intermediate files map_in_#.bson, map_out_#.bson, and reduce_in.bson between the phases.

Page 13

Zarkov Map/Reduce

- Phases managed by greenlets

- Map and reduce jobs parceled out to remote workers via zmq PUSH/PULL

- Adaptive timeout/retry to support dead workers

- Sort phase is local (big mergesort) but still done in worker processes
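Zarkov's actual retry logic lives in its zmq job manager; the requeue-on-timeout idea can be sketched synchronously in pure Python (run_jobs and its names are illustrative, not Zarkov's API):

```python
import queue

def run_jobs(jobs, workers, max_retries=3):
    """Parcel jobs out to workers; when a worker 'dies' (raises
    TimeoutError), requeue the job so another worker picks it up,
    giving up after max_retries attempts."""
    pending = queue.Queue()
    for job in jobs:
        pending.put((job, 0))           # (job, attempts so far)
    results = {}
    while not pending.empty():
        job, attempts = pending.get()
        worker = workers[attempts % len(workers)]  # naive reassignment
        try:
            results[job] = worker(job)
        except TimeoutError:
            if attempts + 1 >= max_retries:
                raise RuntimeError('job %r kept timing out' % (job,))
            pending.put((job, attempts + 1))
    return results
```

The real system additionally adapts the timeout to observed job latency instead of using a fixed retry count.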

Page 14

Zarkov Web Service

- We’ve got the data in, now how do we get it out?

- Zarkov includes a tiny HTTP server

$ curl -d foo='{"c":"sfweb", "b":"date/2011-07-01/", "e":"date/2011-07-04"}' http://localhost:8081/q

{"foo": {"sflogo": [[1309579200000.0, 12774], [1309665600000.0, 13458], [1309752000000.0, 13967]], "hits": [[1309579200000.0, 69357], [1309665600000.0, 68514], [1309752000000.0, 68494]]}}

- Values come out tweaked for use in flot
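Each series in the response is a list of [timestamp-in-milliseconds, value] pairs, which flot plots directly. A sketch of that conversion (assuming UTC midnights; the sample values above use a local-time offset):

```python
import calendar
from datetime import datetime

def flot_point(day, value):
    """Convert a datetime + value into flot's [epoch-ms, value] pair.
    calendar.timegm treats the naive datetime as UTC."""
    return [calendar.timegm(day.timetuple()) * 1000.0, value]

# One point of a daily "hits" series
point = flot_point(datetime(2011, 7, 1), 69357)
```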

Page 15

Zarkov Deployment at SF.net

Page 16

© 2011 Geeknet Inc

Lessons learned at SourceForge

Page 17

MongoDB Tricks

- Autoincrement integers are harder than in MySQL but not impossible

- Use unsafe writes; inserts are faster than updates

class IdGen(object):
    @classmethod
    def get_ids(cls, inc=1):
        obj = cls.query.find_and_modify(
            query={'_id': 0},
            update={'$inc': dict(inc=inc)},
            upsert=True,
            new=True)
        return range(obj.inc - inc, obj.inc)
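The find_and_modify call reserves a whole block of ids in one atomic server-side operation. A toy in-memory model of those semantics (FakeCounter is illustrative only; the real atomicity comes from MongoDB, not Python):

```python
class FakeCounter(object):
    """In-memory model of the find_and_modify pattern above:
    add `inc` to a counter document and hand back the
    freshly-reserved id range."""
    def __init__(self):
        self.doc = None
    def get_ids(self, inc=1):
        if self.doc is None:               # upsert=True creates the doc
            self.doc = {'_id': 0, 'inc': 0}
        self.doc['inc'] += inc             # {'$inc': {'inc': inc}}
        new_val = self.doc['inc']          # new=True returns the updated doc
        return range(new_val - inc, new_val)
```

Reserving inc ids per round trip amortizes the cost when inserting in batches.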

Page 18

MongoDB Pitfalls

- $addToSet is nice but nothing beats an integer range query

- Avoid Javascript like the plague (mapreduce, group, $where)

- Indexing is nice, but slows things down; use _id when you can

- mongorestore is fast, but locks a lot

Page 19

Open Source

- Ming: http://sf.net/projects/merciless/ (MIT License)

- Allura: http://sf.net/p/allura/ (Apache License)

- Zarkov: http://sf.net/p/zarkov/ (Apache License)

Page 20

Future Work

- Remove the SPoF

- Better way of expressing aggregates (suggestions?)

- Better web integration (WebSockets/Socket.io)

- Maybe trigger aggregations based on event activity?

Page 21

© 2011 Geeknet Inc

Rick Copeland (@rick446)

[email protected]