Using Cassandra for RTB systems

Real Time Bidding with Apache Cassandra

Introducing RTB

RTB @ Kenshoo:

- Concepts
- Architecture
- Challenges

Real Time Bidding (RTB)

● Real-time bidding is a dynamic auction process in which each impression is bid for in (near) real time, as opposed to a static auction

● Kenshoo is engaged in Facebook Exchange (FBX)

● In FBX, each bid has a lifetime of 120ms. All transactions have to complete within that period, and the winning ad is presented to the user.

● Kenshoo employs ad re-targeting, where search engine campaigns are extended to the social network, giving our customers a much higher ROI
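The 120ms bid lifetime means every lookup on the bid path needs a hard time budget. The following is a minimal sketch of that idea, not Kenshoo's implementation: `lookup_segments`, `decide_bid`, and the returned dictionary shape are hypothetical names chosen for illustration, and the lookup is a stand-in for whatever store holds the cookie-to-segment mapping.

```python
import concurrent.futures
import time

BID_DEADLINE_S = 0.120  # FBX bid lifetime quoted on the slide

# A shared pool, so that a slow lookup never blocks the caller on shutdown.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def lookup_segments(cookie_id):
    """Hypothetical stand-in for a cookie-to-segment lookup."""
    time.sleep(0.005)  # simulated storage latency
    return ["segment-a"] if cookie_id else []


def decide_bid(cookie_id, lookup=lookup_segments, budget_s=BID_DEADLINE_S):
    """Return a bid dict, or None (no bid) if the lookup misses the budget."""
    future = _POOL.submit(lookup, cookie_id)
    try:
        segments = future.result(timeout=budget_s)
    except concurrent.futures.TimeoutError:
        return None  # too late to bid: skip this auction
    if not segments:
        return None
    return {"cookie": cookie_id, "segments": segments}
```

In practice the budget given to storage would be well under 120ms, since network latency to and from the exchange is spent from the same allowance.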

Flow

[Flow diagram; surviving label: WebSite]

RTB Logical Architecture

[Architecture diagram; surviving labels:]

- RTB Front: Bidder, Win, Error, Opt Out, Pixel Matcher
- RTB Backend, RTB Brain, RTB Reporter
- Cassandra: Cookie to Segment(s), Bid Decision Trees, Campaigns Metadata

https://github.com/kenshoo/rtb-front

https://github.com/kenshoo/rtb-brain

Focus on RTB Cassandra

RTB @ Kenshoo:

- Architecture
- Challenges

Requirements

● Handle 25K+ requests within the 120ms bid time-frame including network latencies

● Ability to scale up to 1M requests per minute while keeping the current latency

● Handle ~10K writes/second with low latency

● Multi-DC configuration; all nodes must be synced in real time

● Seamless Operations: Compactions and Repairs

● High Security
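To put the scaling target in per-second terms, the short calculation below converts 1M requests per minute to a rate and applies Little's law (requests in flight = arrival rate × time in system) with the 120ms bid window as an upper bound on time in system. The function names are illustrative; only the 1M/minute, 120ms, and ~10K writes/second figures come from the slides.

```python
def per_second(requests_per_minute):
    """Convert a per-minute request rate to a per-second rate."""
    return requests_per_minute / 60.0


def in_flight(rate_per_second, latency_seconds):
    """Little's law: average number of requests in the system at once."""
    return rate_per_second * latency_seconds


target_rate = per_second(1_000_000)
print(round(target_rate))                    # 16667 requests/second
print(round(in_flight(target_rate, 0.120)))  # 2000 requests in flight at most
print(10_000 * 86_400)                       # 864000000 writes/day at ~10K/s
```

The in-flight figure is a ceiling: if the system answers faster than 120ms, proportionally fewer requests are outstanding at any moment.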

C* Physical Architecture

[Physical architecture diagram; surviving labels:]

- (US) West Region: App nodes, VPN
- (US) East Region: App nodes, VPN
- Internet
- FBX WEST, FBX EAST
- GRE

C* Cluster Information

● Cassandra version 1.2.6

● Oracle Java 7

● Manual tokens; vnodes are coming soon

● Multi-DC configuration

● Network Topology

● DC connectivity between VPCs via Linux GRE

● Amazon c3.2xlarge instance type

● Ubuntu 13.10 with ext4

● SSD (ephemeral)
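With manual tokens, each node's `initial_token` has to be computed by hand so that the ring is evenly divided. The sketch below shows one common way to do that, assuming the Murmur3Partitioner (the default for new clusters in Cassandra 1.2; the slides do not say which partitioner is in use). The per-DC `offset` parameter is also an assumption: it reflects the usual practice of computing tokens per data center and shifting the second DC slightly so that no two nodes share a token.

```python
MURMUR3_MIN = -2**63   # lowest token in the Murmur3Partitioner range
MURMUR3_SPAN = 2**64   # total number of tokens in the range


def initial_tokens(node_count, offset=0):
    """Evenly spaced tokens for one data center, shifted by `offset`."""
    step = MURMUR3_SPAN // node_count
    return [MURMUR3_MIN + i * step + offset for i in range(node_count)]


# Example: 4 nodes per DC, second DC offset by 100 to avoid collisions.
for dc, offset in (("west", 0), ("east", 100)):
    for node, token in enumerate(initial_tokens(4, offset)):
        print(f"{dc} node {node}: initial_token: {token}")
```

For the RandomPartitioner the range would instead be 0 to 2**127 - 1, so the constants above would need to change.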

The Ring

C* Cluster Network Between Sites

● For security reasons, we:

○ Do not use Ec2Snitch or Ec2MultiRegionSnitch

○ Connect the nodes via VPN (Linux GRE)

● Linux GRE is fast and reliable, and provides high throughput (~1Gb/s)

C* Cluster Storage

● We started with Amazon EBS:

○ With a small number of nodes (up to 4): you want persistent storage, to avoid running repairs if you lose a node

○ 4 x EBS devices in a RAID10 configuration: provide up to 1000 IOPS, with bursts of up to 2000 IOPS

○ Cheap in AWS

● 8 nodes with ephemeral devices:

○ Lower risk: if you lose a node, recovery isn't as heavy on the whole cluster

○ We used RAID0

○ Higher performance (double that of EBS)

○ Free, bundled with the instances

C* Cluster Storage continued

● 16 nodes with ephemeral devices:

○ When the load became heavy we grew to 16 nodes

○ Compactions and repairs harmed the cluster latency

○ We had to use Provisioned IOPS devices for C* maintenance

● C3 instance type with SSD:

○ Came just in time, providing ephemeral SSD storage

○ They solved our performance problems and enabled seamless compactions and repairs

○ Amazon currently has a scarce deployment of this hardware, and the nodes are not stable

○ Not yet available in all regions

○ Deploying C3 nodes is not always possible due to AWS capacity issues

○ Amazon promised to resolve the C3 issues next month

C* Cluster Performance

Monitoring

● We rely heavily on DataStax OpsCenter

● We export OpsCenter metrics for graphing

● We wrote our own read/write speed test, run against a separate dedicated keyspace on each node, to detect bottlenecks and problematic nodes

● We sample the data separately from the application to determine whether a problem originates in C* or in the application
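A per-node read/write speed test of the kind described above can be sketched as a small timing harness. This is an illustration under stated assumptions, not the tool from the talk: `speed_test` and `slow_nodes` are hypothetical names, and `write_fn` / `read_fn` are placeholders for calls that would write to and read from the dedicated test keyspace on one node. Here a plain dictionary stands in for the node so the sketch runs on its own.

```python
import statistics
import time


def speed_test(write_fn, read_fn, n=100, payload=b"x" * 64):
    """Time n write/read round trips and summarise latencies in ms."""
    write_ms, read_ms = [], []
    for i in range(n):
        key = f"probe-{i}"
        start = time.perf_counter()
        write_fn(key, payload)
        write_ms.append((time.perf_counter() - start) * 1000)
        start = time.perf_counter()
        read_fn(key)
        read_ms.append((time.perf_counter() - start) * 1000)
    return {
        "write_p50_ms": statistics.median(write_ms),
        "write_max_ms": max(write_ms),
        "read_p50_ms": statistics.median(read_ms),
        "read_max_ms": max(read_ms),
    }


def slow_nodes(results_by_node, threshold_ms):
    """Names of nodes whose median read or write exceeds the threshold."""
    return sorted(
        node for node, r in results_by_node.items()
        if r["read_p50_ms"] > threshold_ms or r["write_p50_ms"] > threshold_ms
    )


# Stand-in "node": an in-memory dict instead of a real test keyspace.
store = {}
print(speed_test(store.__setitem__, store.__getitem__, n=10))
```

Running the same probe against every node and comparing the summaries is what separates a single problematic node from a cluster-wide slowdown, and comparing the probe with application-side timings helps attribute the problem to C* or to the application.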

What have we learned

● Storage:

○ Use SSD:

■ It provides high and stable disk performance

■ Neutralizes the effects of compaction and repair on the cluster

■ Worth the money

● Network:

○ Use the highest-bandwidth VPN possible

○ GRE is great (lacks encryption, but provides the best bandwidth)

● Maintenance:

○ Run compaction daily: it does wonders for performance under heavy load

○ If you are not on SSD, disable Thrift on the node before running compaction

○ Do compactions in sequence, node by node

○ On high-load systems, avoid repair where possible; it's better to decommission and recommission a node than to run repair!

○ If you have to repair, always use the "-pr" flag and, if possible, use the incremental repair option (requires heavy scripting)

● Monitoring:

○ Write a sampler and speed tester for each node to detect bottlenecks and the sources of performance issues
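The maintenance advice above (compact node by node, disabling Thrift first when not on SSD) can be sketched as a small script that builds and runs the `nodetool` commands in order. This is an assumption-laden sketch rather than the team's actual tooling: the host names are hypothetical, and it assumes `nodetool` is on the PATH and that `nodetool compact` returns only once the compaction has finished.

```python
import subprocess


def compaction_plan(hosts, on_ssd):
    """Ordered nodetool commands for a rolling, node-by-node compaction."""
    plan = []
    for host in hosts:
        if not on_ssd:
            # Take the node out of client traffic while it compacts.
            plan.append(["nodetool", "-h", host, "disablethrift"])
        plan.append(["nodetool", "-h", host, "compact"])
        if not on_ssd:
            plan.append(["nodetool", "-h", host, "enablethrift"])
    return plan


def repair_command(host):
    """Primary-range repair for one node, per the "-pr" advice."""
    return ["nodetool", "-h", host, "repair", "-pr"]


def run_plan(plan, runner=subprocess.check_call):
    """Run commands strictly one after another; stop on the first failure."""
    for command in plan:
        runner(command)


# Dry run with hypothetical host names: print instead of executing.
run_plan(compaction_plan(["cass-west-1", "cass-west-2"], on_ssd=False),
         runner=lambda command: print(" ".join(command)))
```

Because `run_plan` stops on the first failure, a failed compaction on a non-SSD node would leave Thrift disabled there; a production script would want to re-enable it in a cleanup step.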

Thank you
