Spark War Stories
Our War Story
“A good plan violently executed now is better than a perfect plan executed next week”
George S. Patton
Our Data Requirements
• Lots of incoming traffic (100K requests/sec)
• Data:
  – Personalized served recommendations – per user, per page view
  – Events – what the user actually read and what they did
• The data needs to be joined and processed in real time:
  – Campaigns management
  – Recommendations
  – Billing
  – Reports
  – Etc.
• The data needs to be available for offline research
Challenges
• We care about sessions – a chain of page views and events for a specific user
  – Length can be hours or even days
• We care about users – a chain of sessions across sites
  – Length can be days or even months
• Stateless application – a single user's data is sent from multiple data centers and multiple servers
  – No deterministic affinity to a server or DC
  – Order isn't guaranteed
  – Must be robust and automatically deal with late arrivals
  – "Exactly once" semantics
Challenges Cont.
• Many streams of data that need to be joined (user, session, page view, widgets, recommendations, events, actions)
• 5+ TB of daily data
• Data analysis requires pre-joining the streams and looking at the data across time
Naïve / Brute Force Solution
• Join some streams in the FE server
  – De-normalization is done as early as possible
  – Everything that isn't an event or action is joined
  – However, we cannot assume a single PV happens on a single server
• Join the above with events and actions in Spark memory
  – Minutes of data – OK
  – 2+ hours of data – slow (30+ minutes of processing)
  – Days of data – #Fail
Why Did it Fail?
• Incoming data is received by data class (i.e. Request, Event, etc.) and by incoming timestamp
  – Separate RDD per class
  – The RDDs contain randomly (hash-partitioned) incoming data
• The join key is the session and page view ids
Why Did it Fail?
• To join the data:
  – First, remap the incoming data to a PairRDD and add the join key (needs to be done individually, per RDD class)
  – Second, cogroup the PairRDDs – a shuffle must be performed on all participating RDDs
• The initial data is distributed randomly across many nodes and multiple RDDs
  – Small data sets → small shuffles
  – Huge data sets → unmanageable shuffles
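The two join steps above can be sketched in plain Python. This is not the actual Spark code from the talk – the record fields are hypothetical – but the shape is the same: each class of data is remapped to (join key, record) pairs, and the cogroup step pulls every record with the same key together, which in Spark means shuffling all participating RDDs across the cluster.

```python
from collections import defaultdict

# Each "RDD" here is a plain list of records, one list per data class.
def to_pairs(records):
    # Step 1: remap to (join_key, record) pairs; the join key is the
    # (session_id, view_id) pair, as in the deck.
    return [((r["session_id"], r["view_id"]), r) for r in records]

def cogroup(*pair_lists):
    # Step 2: group every dataset by key. In Spark this is the shuffle --
    # records for the same key must be moved to the same node, from
    # wherever the random hash partitioning originally put them.
    grouped = defaultdict(lambda: [[] for _ in pair_lists])
    for i, pairs in enumerate(pair_lists):
        for key, rec in pairs:
            grouped[key][i].append(rec)
    return dict(grouped)

# Hypothetical sample data: requests and events arrive as separate streams.
requests = [{"session_id": "s1", "view_id": "v1", "widget": "w9"}]
events = [{"session_id": "s1", "view_id": "v1", "action": "click"}]

joined = cogroup(to_pairs(requests), to_pairs(events))
# All records for session s1 / view v1 are now grouped together.
```

With gigabytes of data the dict above becomes a cluster-wide data movement: every record travels, because nothing about the initial partitioning matches the join key.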
See the Shuffle
The Solution
Avoid Them Shuffles
The Solution
• Designed to avoid the initial / heaviest shuffle
• Go through an intermediary phase before reading the data for analysis
• As streamed data is being received, save each message to Cassandra
  – All classes saved together in a single table
  – The table is partitioned by the read key
Table Model in C*
• Partition key – session start hour + user bucket (0-9,999)
• Clustering key – publisher_id, user_id, session_id, view_id, data_type, data_hash
• Data type – MULTI_REQUEST, USER_EVENT, ACTION_CONVERSION, …
• Data – blobs of protobuf
• Results:
  – All the data of a single session is in one place, regardless of time of arrival
  – Idempotent process – if the same message is received twice, it overwrites the previous arrival due to the same hash id
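A minimal sketch of this table model, with a plain dict standing in for the Cassandra table. The deck only states the key layout; the concrete hash and bucketing functions below are assumptions for illustration. The point is that the partition key co-locates a session's data and the data_hash in the clustering key makes duplicate deliveries overwrite rather than accumulate.

```python
import hashlib

NUM_BUCKETS = 10_000  # user buckets 0-9,999, as on the slide

def partition_key(session_start_epoch, user_id):
    # Partition key = session start hour + user bucket. The bucketing
    # scheme (md5 of the user id) is an assumed stand-in.
    hour = session_start_epoch // 3600
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    return (hour, bucket)

def clustering_key(msg):
    # Clustering key as on the slide; data_hash is derived from the
    # payload, so an identical message always maps to the same row.
    data_hash = hashlib.md5(msg["payload"]).hexdigest()
    return (msg["publisher_id"], msg["user_id"], msg["session_id"],
            msg["view_id"], msg["data_type"], data_hash)

table = {}  # stand-in for the single Cassandra table

def save(msg):
    # An upsert: late or duplicate arrivals land on the same key.
    key = (partition_key(msg["session_start"], msg["user_id"]),
           clustering_key(msg))
    table[key] = msg["payload"]

msg = {"session_start": 1_700_000_000, "user_id": "u1",
       "publisher_id": "p1", "session_id": "s1", "view_id": "v1",
       "data_type": "USER_EVENT", "payload": b"protobuf-bytes"}
save(msg)
save(msg)  # duplicate delivery: idempotent, still one row
```

Because the write path already groups by the read key, the later Spark job reads each session's data pre-joined and the heavy shuffle never happens.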
Result - No Shuffle
Result
• Week of data (~35 TB) – 2 hours to analyze and report
• Analyzing a 1% sample of the users reduces this linearly (partition key)
• Analyzing a single publisher that is 1% of the data reduces this almost linearly (clustering key)
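The linear speedup from user sampling follows from the bucket being part of the partition key: a 1% sample is just the first 100 of the 10,000 buckets, so only those partitions are ever read. A small sketch, with the same assumed md5 bucketing as above:

```python
import hashlib

NUM_BUCKETS = 10_000  # user buckets 0-9,999

def bucket_of(user_id):
    # Assumed bucketing: any stable hash of the user id works.
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_BUCKETS

def in_sample(user_id, percent=1):
    # A percent% sample = the first (percent% of NUM_BUCKETS) buckets.
    # Since the bucket is in the partition key, the scan is pruned to
    # exactly those partitions.
    return bucket_of(user_id) < NUM_BUCKETS * percent // 100

sampled = [u for u in (f"user-{i}" for i in range(100_000)) if in_sample(u)]
# Roughly 1% of users survive the filter.
```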
Good, but not good enough
• We used Cassandra because we had it as an available resource
• However, Cassandra:
  – Isn't columnar – cannot read partial rows (specific columns)
  – Is eventually consistent – not accurate enough
  – Suffers from memory issues under heavy loads
  – Cross-DC replication isn't reliable under heavy load
• Now working on the next-gen solution
  – See you in a future meetup…
Some More Tips
• Avoid cogroup and use broadcasts when one of the RDDs is small enough
• Whenever possible use map() instead of mapPartitions()
  – Memory and processing efficiency gained
  – Unless setup is expensive
• G1GC – we have had a very good experience with it in tight-memory situations
  – Does not work well out of the box; requires some tweaking
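The broadcast tip can be sketched in plain Python (the dataset names are hypothetical). When one side of a join, such as a publisher lookup table, fits in memory, Spark can ship it whole to every worker (sc.broadcast) and each record does a local lookup inside a map(), so the large side is never shuffled:

```python
# Small side: would be wrapped in sc.broadcast(...) in Spark and shipped
# once to every worker.
publishers = {"p1": "Site A", "p2": "Site B"}

# Large side: stays partitioned wherever it already lives.
events = [
    {"publisher_id": "p1", "action": "click"},
    {"publisher_id": "p2", "action": "view"},
]

# Map-side join: each record looks up the broadcast table locally,
# instead of cogrouping both sides and shuffling everything by key.
joined = [dict(e, publisher_name=publishers[e["publisher_id"]])
          for e in events]
```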
Thank You!
[email protected]@taboola.com