Spark War Stories
Our War Story
“A good plan violently executed now is better than a perfect plan executed next week”
George S. Patton
Our Data Requirements
• Lots of incoming traffic (100K requests/sec)
• Data:
  – Personalized served recommendations – per user, per page view
  – Events – what the user actually read and what they did
• The data needs to be joined and processed in real time:
  – Campaigns management
  – Recommendations
  – Billing
  – Reports
  – Etc.
• The data needs to be available for offline research
Challenges
• We care about sessions – a chain of page views and events for a specific user
  – Length can be hours or even days
• We care about users – a chain of sessions across sites
  – Length can be days or even months
• Stateless application – a single user's data is sent from multiple data centers and multiple servers
  – No deterministic affinity to a server or DC
  – Order isn't guaranteed
  – Must be robust and automatically deal with late arrivals
  – "Exactly once" semantics
Challenges Cont.
• Many streams of data that need to be joined (user, session, page view, widgets, recommendations, events, actions)
• 5+ TB of daily data
• Data analysis requires pre-joining the streams and looking at the data across time
Naïve / Brute Force Solution
• Join some streams in the FE server
  – De-normalization is done as early as possible
  – Everything that isn't an event or action is joined
  – However, we cannot assume a single PV happens on a single server
• Join the above with events and actions in Spark memory
  – Minutes of data – OK
  – 2+ hours of data – slow (30+ minutes of processing)
  – Days of data – #Fail
Why Did it Fail?
• Incoming data is received by data class (i.e. Request, Event, etc.) and by incoming timestamp
  – Separate RDD per class
  – The RDDs contain randomly (hash-partitioned) incoming data
• The join key is the session and page view ids
Why Did it Fail?
• To join the data:
  – First, remap the incoming data to a PairRDD and add the join key (needs to be done individually, per RDD class)
  – Second, cogroup the PairRDDs – a shuffle must be performed on all participating RDDs
• The initial data is distributed randomly across many nodes and multiple RDDs
  – Small data sets → small shuffles
  – Huge data sets → unmanageable shuffles
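The two join steps above can be sketched in plain Python. This is not the actual Spark code from the talk – the record fields are hypothetical – but the shape is the same: each class of data is remapped to (join key, record) pairs, and the cogroup step pulls every record with the same key together, which in Spark means shuffling all participating RDDs across the cluster.

```python
from collections import defaultdict

# Each "RDD" here is a plain list of records, one list per data class.
def to_pairs(records):
    # Step 1: remap to (join_key, record) pairs; the join key is the
    # (session_id, view_id) pair, as in the deck.
    return [((r["session_id"], r["view_id"]), r) for r in records]

def cogroup(*pair_lists):
    # Step 2: group every dataset by key. In Spark this is the shuffle --
    # records for the same key must be moved to the same node, from
    # wherever the random hash partitioning originally put them.
    grouped = defaultdict(lambda: [[] for _ in pair_lists])
    for i, pairs in enumerate(pair_lists):
        for key, rec in pairs:
            grouped[key][i].append(rec)
    return dict(grouped)

# Hypothetical sample data: requests and events arrive as separate streams.
requests = [{"session_id": "s1", "view_id": "v1", "widget": "w9"}]
events = [{"session_id": "s1", "view_id": "v1", "action": "click"}]

joined = cogroup(to_pairs(requests), to_pairs(events))
# All records for session s1 / view v1 are now grouped together.
```

With gigabytes of data the dict above becomes a cluster-wide data movement: every record travels, because nothing about the initial partitioning matches the join key.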
See the Shuffle
The Solution
Avoid Them Shuffles
The Solution
• Designed to avoid the initial / heaviest shuffle
• Go through an intermediary phase before reading the data for analysis
• As streamed data is being received, save each message to Cassandra
  – All classes saved together in a single table
  – The table is partitioned by the read key
Table Model in C*
• Partition key – session start hour + user bucket (0-9,999)
• Clustering key – publisher_id, user_id, session_id, view_id, data_type, data_hash
• Data type – MULTI_REQUEST, USER_EVENT, ACTION_CONVERSION, …
• Data – blobs of protobuf
• Results:
  – All the data of a single session is in one place, regardless of time of arrival
  – Idempotent process – if the same message is received twice, it overwrites the previous arrival due to the same hash id
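A minimal sketch of this table model, with a plain dict standing in for the Cassandra table. The deck only states the key layout; the concrete hash and bucketing functions below are assumptions for illustration. The point is that the partition key co-locates a session's data and the data_hash in the clustering key makes duplicate deliveries overwrite rather than accumulate.

```python
import hashlib

NUM_BUCKETS = 10_000  # user buckets 0-9,999, as on the slide

def partition_key(session_start_epoch, user_id):
    # Partition key = session start hour + user bucket. The bucketing
    # scheme (md5 of the user id) is an assumed stand-in.
    hour = session_start_epoch // 3600
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    return (hour, bucket)

def clustering_key(msg):
    # Clustering key as on the slide; data_hash is derived from the
    # payload, so an identical message always maps to the same row.
    data_hash = hashlib.md5(msg["payload"]).hexdigest()
    return (msg["publisher_id"], msg["user_id"], msg["session_id"],
            msg["view_id"], msg["data_type"], data_hash)

table = {}  # stand-in for the single Cassandra table

def save(msg):
    # An upsert: late or duplicate arrivals land on the same key.
    key = (partition_key(msg["session_start"], msg["user_id"]),
           clustering_key(msg))
    table[key] = msg["payload"]

msg = {"session_start": 1_700_000_000, "user_id": "u1",
       "publisher_id": "p1", "session_id": "s1", "view_id": "v1",
       "data_type": "USER_EVENT", "payload": b"protobuf-bytes"}
save(msg)
save(msg)  # duplicate delivery: idempotent, still one row
```

Because the write path already groups by the read key, the later Spark job reads each session's data pre-joined and the heavy shuffle never happens.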
Result - No Shuffle
Result
• Week of data (~35 TB) – 2 hours to analyze and report
• Analyzing a 1% sample of the users reduces this linearly (partition key)
• Analyzing a single publisher that is 1% of the data reduces this almost linearly (clustering key)
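The linear speedup from user sampling follows from the bucket being part of the partition key: a 1% sample is just the first 100 of the 10,000 buckets, so only those partitions are ever read. A small sketch, with the same assumed md5 bucketing as above:

```python
import hashlib

NUM_BUCKETS = 10_000  # user buckets 0-9,999

def bucket_of(user_id):
    # Assumed bucketing: any stable hash of the user id works.
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_BUCKETS

def in_sample(user_id, percent=1):
    # A percent% sample = the first (percent% of NUM_BUCKETS) buckets.
    # Since the bucket is in the partition key, the scan is pruned to
    # exactly those partitions.
    return bucket_of(user_id) < NUM_BUCKETS * percent // 100

sampled = [u for u in (f"user-{i}" for i in range(100_000)) if in_sample(u)]
# Roughly 1% of users survive the filter.
```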
Good, but not good enough
• We used Cassandra because we had it as an available resource
• However, Cassandra:
  – Isn't columnar – cannot read partial rows (specific columns)
  – Is eventually consistent – not accurate enough
  – Suffers from memory issues under heavy loads
  – Cross-DC replication isn't reliable under heavy load
• Now working on the next-gen solution
  – See you in a future meetup…
Some More Tips
• Avoid cogroup and use broadcasts when one of the RDDs is small enough
• Whenever possible use map() instead of mapPartitions()
  – Memory and processing efficiency gained
  – Unless setup is expensive
• G1GC – we have had a very good experience with it in tight-memory situations
  – Does not work well out of the box; requires some tweaking
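The broadcast tip can be sketched in plain Python (the dataset names are hypothetical). When one side of a join, such as a publisher lookup table, fits in memory, Spark can ship it whole to every worker (sc.broadcast) and each record does a local lookup inside a map(), so the large side is never shuffled:

```python
# Small side: would be wrapped in sc.broadcast(...) in Spark and shipped
# once to every worker.
publishers = {"p1": "Site A", "p2": "Site B"}

# Large side: stays partitioned wherever it already lives.
events = [
    {"publisher_id": "p1", "action": "click"},
    {"publisher_id": "p2", "action": "view"},
]

# Map-side join: each record looks up the broadcast table locally,
# instead of cogrouping both sides and shuffling everything by key.
joined = [dict(e, publisher_name=publishers[e["publisher_id"]])
          for e in events]
```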
Thank You!
[email protected]@taboola.com