Multidimensional Aggregations using Spark and DataFrames


Multidimensional Aggregations using Spark and DataFrames
2015-11-10
Romi Kuntsman, Senior Big Data Engineer

Comment (Ron Sher): how to learn what to fix and how to fix it
Comment (Ron Sher): maybe start with largest data first, or maybe with the least data first
Comment (Ron Sher): why is everything in bold?

About me

• Leading adoption of Apache Spark in Totango
• Working with Spark for 1.5 years, from version 1.0
• Passionate about actionable big data analytics
• Working with web scale and cloud since 2008
• Previously: Outbrain, Foresight, RockeTier, Mamram
• B.Sc. in Bioinformatics from the Open University
• LinkedIn: https://il.linkedin.com/in/romik
• Email: romi@totango.com

Agenda

• Totango Data Flow Overview

• Apache Spark DataFrames Introduction

• Merging Multiple Results Efficiently

• Open issues and questions

Data Flow Overview

“Numbers have an important story to tell. They rely on you to give them a voice.”

– Stephen Few

Let's talk about aggregations

You've all done this...

SELECT module, count(*)
FROM activities
GROUP BY module

Aggregations with big data

You've probably done or seen this before as well...

Life isn't so simple

Multiple levels of calculations

Comment (Ron Sher): add explanation about differences in complexity - either scale complexity or computational complexity

Different points of view

• First level aggregations (across the last 7, 14, 30 days etc) - see the sketch after this list:
  – Counts (per account, activity, module, user etc)
  – Distinct counts (unique users in a module etc)
  – Sessions (multiple activities grouped by time proximity)
  – Activity days (how many days had any activity)

• Higher level analytics:
  – Engagement Score (overall activity compared to others)
  – Change Metrics (how activity changes over time)
  – Account Health (good, average or poor)

• And more...
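As a concrete illustration, here is a minimal sketch of two of the first-level aggregations above, written in Java against the Spark 1.5-era DataFrame API. This is not the code from the slides; the events DataFrame and its column names (account, module, date) are assumptions made for the example.

// Hedged sketch of first-level aggregations; `events` and its columns are assumed.
import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.countDistinct;

public class FirstLevelAggregations {

    // Events per account and module, over whatever time window `events` covers.
    public static DataFrame countsPerAccountAndModule(DataFrame events) {
        return events.groupBy("account", "module").count();
    }

    // "Activity days": how many distinct days had any activity, per account.
    public static DataFrame activityDaysPerAccount(DataFrame events) {
        return events.groupBy("account")
                     .agg(countDistinct("date").alias("activity_days"));
    }
}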

What do we need?

• Easy way to develop new aggregations
• No boilerplate code, just business logic
• Scalable and distributed
• Accurate results (often underestimated)
• Fast (short batch, but not realtime in this case)
• Idempotent (same results on every run on the same input)
• Multi-tenant (same computations on isolated datasets)

Spark DataFrames

“Simple things should be simple, complex things should be possible.”
– Alan Kay

Spark DataFrames

• Table-like abstraction on top of Big Data
• Able to scale from kilobytes to petabytes, node to cluster
• Transformations available in code or SQL
• User defined functions can add columns
• Actively developed optimizer
• Spark 1.3 (March 2015) - initially released
• Spark 1.4 (June 2015) - mature and usable
• Spark 1.5 (September 2015) - performance optimized
• Spark 1.6 (not yet released) - more optimizations

Look ma, no map reduce!

• module counts:
  events.groupBy(module).count

• module unique users:
  events.groupBy(module, user).count.groupBy(module).count

Comment (Ron Sher): maybe better to have a code sample to show how simple it is
Comment (Ron Sher): what are your plans here?
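Since the comment above asks for a code sample, here is a minimal end-to-end sketch of the two aggregations in Java against the Spark 1.5-era API. The input path, the SQLContext setup and the column names are assumptions for illustration, not taken from the slides.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class ModuleCounts {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("module-counts"));
        SQLContext sqlContext = new SQLContext(sc);

        // Assumed input: activity events with at least `module` and `user` columns.
        DataFrame events = sqlContext.read().parquet("hdfs:///events/");

        // Events per module.
        DataFrame moduleCounts = events.groupBy("module").count();

        // Unique users per module: collapse to (module, user) pairs, then count per module.
        DataFrame moduleUniqueUsers = events.groupBy("module", "user").count()
                                            .groupBy("module").count();

        moduleCounts.show();
        moduleUniqueUsers.show();
        sc.stop();
    }
}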

User defined function

• activity days:
  udfRegistration.register("date_to_days", new DateToDays())
  eventsWithDate = sqlContext.sql("select *, date_to_days(date) as day from events")
  eventsWithDate.groupBy(module, day).count
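A runnable version of the UDF pattern above could look roughly like the following Java sketch (Spark 1.5-era API). The UDF body, the temp table name and the column types are assumptions; the slides only show the outline.

import java.sql.Date;
import java.util.concurrent.TimeUnit;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class ActivityDaysUdf {

    public static DataFrame perModulePerDay(SQLContext sqlContext, DataFrame events) {
        // Register a UDF that turns a date into a whole number of days since the epoch.
        sqlContext.udf().register("date_to_days", new UDF1<Date, Integer>() {
            @Override
            public Integer call(Date date) {
                return (int) TimeUnit.MILLISECONDS.toDays(date.getTime());
            }
        }, DataTypes.IntegerType);

        // Expose the DataFrame to SQL and add the derived `day` column.
        events.registerTempTable("events");
        DataFrame eventsWithDay =
                sqlContext.sql("select *, date_to_days(date) as day from events");

        // Events per module per day; a distinct count of `day` would give "activity days".
        return eventsWithDay.groupBy("module", "day").count();
    }
}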

RDDs interoperate with DataFrames

Note: sometimes we do need to go from DataFrame to Java and back to accomplish some things:

JavaRDD<FooBar> myRdd = dataframe.toJavaRDD().map(...).groupBy(...)

DataFrame newDataFrame = sqlContext.createDataFrame(myRdd, FooBar.class)

Advantages: speed, ease of development

Disadvantages: less flexible, limited aggregations, strict simple schema

When going from DataFrame to RDD: toJavaRDD forces computation; we lose the Catalyst optimizer in the transition.

Future: maybe this can be replaced by a UDAF (user defined aggregate function) in upcoming Spark releases.
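A fuller sketch of the round trip described above, in Java against the Spark 1.5-era API. The ModuleEvent bean and the trivial mapping logic are placeholders for whatever the DataFrame API cannot express; they are not Totango's actual code.

import java.io.Serializable;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class RddRoundTrip {

    // Simple bean - this is the "strict simple schema" the new DataFrame will get.
    public static class ModuleEvent implements Serializable {
        private String module;
        private long count;
        public String getModule() { return module; }
        public void setModule(String module) { this.module = module; }
        public long getCount() { return count; }
        public void setCount(long count) { this.count = count; }
    }

    // Expects a DataFrame shaped like events.groupBy("module").count().
    public static DataFrame roundTrip(SQLContext sqlContext, DataFrame moduleCounts) {
        // Drop down to the RDD level for logic the DataFrame API cannot express...
        JavaRDD<ModuleEvent> rdd = moduleCounts.toJavaRDD().map(new Function<Row, ModuleEvent>() {
            @Override
            public ModuleEvent call(Row row) {
                ModuleEvent e = new ModuleEvent();
                e.setModule(row.getString(0));
                e.setCount(row.getLong(1));
                return e;
            }
        });
        // ...then come back to a DataFrame, losing the Catalyst optimizer in between.
        return sqlContext.createDataFrame(rdd, ModuleEvent.class);
    }
}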

DataFrames vs. RDDs

Merge Multiple Results

Merge the results

We've calculated aggregations across various dimensions. Now it's time to collect them grouped by entity (account, user, etc).

Partitioning scheme

• RDD<Value> - not partitioned by key (there is no key…)
  → Union of many RDD results will shuffle everything

• DataFrames are not partitioned by column (to be fixed…)
  → Union of many DataFrame results will shuffle everything

• PairRDD<Key,Value> with partitionBy(partitioner) is partitioned
  → Union of many PairRDDs which used the same partitioner will be partitioned together! (see the sketch below)

Partitioner interface (the default HashPartitioner fits most cases):

  int getPartition(key)
  int numPartitions
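To make the last point concrete, here is a hedged sketch in Java: two per-account result sets are partitioned with the same HashPartitioner, so grouping them by key is partitioner-aware and avoids reshuffling everything. The key and value types are assumptions for illustration.

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;

import scala.Tuple2;

public class MergeResults {

    public static JavaPairRDD<String, Tuple2<Iterable<Long>, Iterable<Double>>> merge(
            JavaPairRDD<String, Long> activityCounts,      // account id -> count
            JavaPairRDD<String, Double> engagementScores,  // account id -> score
            int numPartitions) {

        HashPartitioner partitioner = new HashPartitioner(numPartitions);

        // Partition both result sets with the *same* partitioner parameters...
        JavaPairRDD<String, Long> counts = activityCounts.partitionBy(partitioner);
        JavaPairRDD<String, Double> scores = engagementScores.partitionBy(partitioner);

        // ...so the cogroup reuses the existing partitioning instead of doing a full shuffle.
        return counts.cogroup(scores);
    }
}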

Number of partitions

• Processing always happens in chunks that have to fit into one executor's memory

• Too few partitions - some may not fit and you get an OOM

• Too many partitions - many small steps and an overall long run time

• In a multi-tenant environment - you have to find a formula by input size that works for everyone, from the smallest to the largest (a possible sizing heuristic is sketched after this list)

• When re-partitioning, take note of data being reshuffled

• No magic formula for the optimal number of partitions :-(
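By way of illustration only, a "formula by input size" could look like the hedged sketch below; the 128 MB target per partition and the clamping bounds are assumptions, not numbers from the talk.

public class PartitionSizing {

    private static final long TARGET_BYTES_PER_PARTITION = 128L * 1024 * 1024;

    // Clamp so small tenants still get some parallelism and large tenants don't explode.
    public static int partitionsFor(long estimatedInputBytes, int min, int max) {
        long wanted = estimatedInputBytes / TARGET_BYTES_PER_PARTITION + 1;
        return (int) Math.max(min, Math.min(max, wanted));
    }
}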

Name your stages

• Stages can be named: sparkContext.setCallSite (per thread)
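A minimal sketch of how this looks in Java; the stage name and the toy job are made up for illustration.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class NamedStages {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("named-stages").setMaster("local[*]"));

        sc.setCallSite("toy-count");   // jobs submitted from this thread get this label in the UI
        long n = sc.parallelize(Arrays.asList(1, 2, 3)).count();
        sc.clearCallSite();            // back to the default call-site naming

        System.out.println(n);
        sc.stop();
    }
}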

To cache or not to cache

• With RDDs you cache at every intersection

• With DataFrames, it's best to cache the input and then let the optimizer plan

• Cache when dividing the input into sub-sections (like time slices)

• For caching a DataFrame you need to cause computation, otherwise only the LogicalPlan is cached and the optimizer decides what to do (for example when we cache a time-slice subset of the data)
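A small sketch of the last point: caching a time slice of the input and forcing it to materialize. The `day` column and the filter are assumptions for illustration.

import org.apache.spark.sql.DataFrame;

public class CacheTimeSlice {

    // Cache one sub-section of the input (e.g. a time slice) and make sure it is computed.
    public static DataFrame cacheSlice(DataFrame events, int firstDay) {
        DataFrame slice = events.filter(events.col("day").geq(firstDay));
        slice.cache();   // only marks the plan for caching
        slice.count();   // an action forces computation, so the cache is actually populated
        return slice;
    }
}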

More Spark gotchas...

• When loading from Parquet, you can't partition by column hash, only by column value

• Use Kryo for serialization (register all classes)

• Use the standalone shuffle service to avoid losing shuffles when a worker crashes (like in an OutOfMemory)
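The serialization and shuffle-service gotchas boil down to configuration; a hedged sketch with Spark 1.5-era settings is below. ModuleEvent (from the earlier round-trip sketch) stands in for whatever classes actually go through serialization in your job.

import org.apache.spark.SparkConf;

public class GotchaConf {

    public static SparkConf build() {
        return new SparkConf()
                .setAppName("aggregations")
                // Kryo serialization, with the job's classes registered explicitly.
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .registerKryoClasses(new Class<?>[] { RddRoundTrip.ModuleEvent.class })
                // Serve shuffle files via the external shuffle service,
                // so they survive a worker/executor crash (like an OutOfMemory).
                .set("spark.shuffle.service.enabled", "true");
    }
}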

We'll upload separate posts about these and others on our blog

http://labs.totango.com/

• Check out our blog: http://labs.totango.com/

• We're hiring! http://www.totango.com/jobs/
  – Backend / Big Data Engineers
  – DevOps
  – Application / FrontEnd

• Stay in touch
  – romi@totango.com
  – https://il.linkedin.com/in/romik

Questions?

Comment (Ron Sher): Daily vs ad hoc calculations; dynamic partitioning; auto scaling; is it really linearly scalable?
