OLAP QUERIES IN SECONDS ON PETABYTE DATASETS
Distributing Petabucket data using CephFS
Milosz Tanski, CTO @[email protected]
October 2014
©AdFin. All Rights Reserved
Outline
Who/what is AdFin?
What is Petabucket?
Petabucket on CephFS
Contributing FSCache support to CephFS
About AdFin
= Ad-Tech + Finance-Tech
Creating tools that bring buying intelligence to programmatic media.
Advertising is bought and sold in real time via RTB (since 2008)
Bringing transparency to the Ad markets.
The Bloomberg, S&P, Markit… for Ad markets.
We Deliver… Pretty Analytics
What’s the problem?
Market is ~500 Billion impressions a day; it’s growing.
Each impression is unique.
Each is worth a small fraction of a penny.
An order of magnitude more than the number of trades in the financial markets.
There are an order of magnitude more bids for those impressions.
That’s a lot of data to process, store, analyze.
Petabucket
Distributed, time series, relational, OLAP database
Relational query language (but not SQL)
Query is broken up into many smaller chunks
Great single node performance: tens of millions of rows a second.
Vectorized query processing, vectorized compressed bitmap indexes.
Responses in real-time. Goal is low single digit seconds (uncached)
Why? Because we’re a bit crazy.
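A minimal sketch of the "query is broken up into many smaller chunks" idea, assuming time series rows that split naturally by day. The row layout, the function names, and the use of a thread pool are illustrative assumptions, not Petabucket's actual design: each chunk is aggregated on its own and the partial results are merged at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rows: (day, exchange, impressions). Not AdFin's real schema.
ROWS = [
    (1, "A", 10), (1, "B", 5),
    (2, "A", 7), (2, "B", 3),
    (3, "A", 1), (3, "C", 4),
]

def split_by_day(rows):
    """Partition rows into per-day chunks."""
    chunks = {}
    for row in rows:
        chunks.setdefault(row[0], []).append(row)
    return list(chunks.values())

def partial_sum(chunk):
    """Aggregate one chunk independently: impressions per exchange."""
    totals = Counter()
    for _day, exchange, impressions in chunk:
        totals[exchange] += impressions
    return totals

def run_query(rows, workers=4):
    """Fan the chunks out to workers, then merge the partial results."""
    merged = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(partial_sum, split_by_day(rows)):
            merged.update(partial)  # Counter.update adds counts together
    return dict(merged)

result = run_query(ROWS)
```

Because a sum can be merged from partial sums, the chunks can run on any node in any order; aggregates such as averages would need to carry (sum, count) pairs instead.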
Queries easy for humans / machines
High Level System Diagram
Time series bulk import
Petabucket and CephFS
CephFS as a single namespace storage for nodes
Why?
Scalable storage (speed / size)
Separate storage from computation
No SPOF
DFS performance
Client (kernel) performance
High Level System Diagram, part 2
CephFS is not production ready?
Again, we’re a bit crazy?
Started in early 2013.
When we started client and MDS were not ready.
We found and reported a lot of bugs.
Yan Zheng fixed a lot of bugs. Thanks, Yan.
Today we’re happy and in production.
Processed multiple PB of data since then.
FSCache for kclient
We decided to add local persistent caching support to the kclient.
Our access pattern:
Working set larger than node memory (page cache)
Append-only data (time series)
Most recent month or quarter of data is accessed 100x more often
Benefits:
Reduce the latency / throughput penalty of moving to a non-local filesystem
Reduce Ceph network traffic and OSD utilization
Cheap local SSD drives get 500MB/s read performance
Not re-inventing the wheel
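To illustrate why a local cache pays off for this access pattern, here is a small simulation of a read-through cache in front of remote storage. It is only a sketch: the class, the LRU eviction policy, and the block IDs are assumptions made for illustration and do not describe how FSCache or the Ceph kernel client are implemented.

```python
from collections import OrderedDict

class ReadThroughCache:
    """Tiny LRU read-through cache standing in for a local SSD cache."""

    def __init__(self, fetch_remote, capacity):
        self.fetch_remote = fetch_remote  # called only on a cache miss
        self.capacity = capacity          # max number of cached blocks
        self.blocks = OrderedDict()       # block_id -> data, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            self.hits += 1
            return self.blocks[block_id]
        self.misses += 1
        data = self.fetch_remote(block_id)     # goes over the network
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict least recently used
        return data

def fake_remote(block_id):
    """Hypothetical remote read; a real one would hit the cluster."""
    return f"block-{block_id}".encode()

cache = ReadThroughCache(fake_remote, capacity=3)
# Skewed pattern: the newest blocks (8, 9) are read often, old ones rarely.
pattern = [9, 8, 9, 9, 8, 1, 9, 8, 9, 2, 9, 8]
for block in pattern:
    cache.read(block)
```

With this pattern only 4 of the 12 reads reach the remote store; the hot, recent blocks stay local even though the cache holds just three blocks.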
Kernel programming is hard
Have to understand Ceph, kernel, concurrency.
An error in the kernel hangs or oopses your machine.
Bugs in other parts of the kernel? (CacheFS).
Prototype working in two weeks
First submission 2 months later.
In kernel 5 months later.
Number one problem: concurrency.
Ceph with FSCache Status
In since: 3.13
… Works well since: 3.15
… All bugs fixed: 3.17
Speed… as fast as your caching disk
Tested single client performance 1200MB/s
Next steps…
Contributing to Ceph & kernel is addicting:
Ceph performance work. Improving latency / IOPS.
Kernel work: readv2() syscall for file serving applications.
http://lwn.net/Articles/612483/
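The pattern a file server wants from a non-blocking read can be sketched as follows: try to serve from the page cache without blocking, and only fall back to a blocking read when that fails. This sketch is an assumption-laden stand-in, not the readv2() interface from the talk: it uses `os.preadv` with the `os.RWF_NOWAIT` flag that Python exposes on recent Linux kernels, and the helper name `read_fast_or_fallback` is hypothetical.

```python
import os
import tempfile

def read_fast_or_fallback(fd, length, offset):
    """Try a non-blocking cached read first, then a normal blocking read.

    Returns (data, used_fast_path).
    """
    nowait = getattr(os, "RWF_NOWAIT", None)  # missing on older platforms
    if nowait is not None and hasattr(os, "preadv"):
        buf = bytearray(length)
        try:
            n = os.preadv(fd, [buf], offset, nowait)
            if n == length:
                return bytes(buf), True
        except OSError:
            # Data not cached, or the flag is unsupported here.
            pass
    # Slow path: a real server would hand this to a worker thread
    # so the event loop is never blocked on disk or network I/O.
    return os.pread(fd, length, offset), False

# Usage: read 5 bytes at offset 6 from a freshly written temp file.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"hello petabucket")
    f.flush()
    data, used_fast_path = read_fast_or_fallback(f.fileno(), 5, 6)
```

Whether the fast path is taken depends on the kernel, the filesystem, and what is in the page cache, so callers should treat `used_fast_path` as a hint only.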
Thank You!
Let’s Get in Touch
[email protected] E. 34th Street, 15th Floor
New York, New York 10016
Milosz Tanski, CTO
linkedin.com/company/AdFin
twitter.com/AdFin