Inneractive - Spark Meetup 2


Richard Grossman | System Architect

Processing Billions of Daily Events

What we do…

[Diagram: advertisers buy traffic through RTB, SAPI, Video, and RAPI networks; ~2M requests/min, 250ms responses]

Numbers…

>Incoming requests ==> 1.5 to 2 M / minute

>Events generated ==> 20 to 30 M / minute

>Generate 5+ TB / day of raw data (CSV + Parquet)

>Storing 550 days of aggregated data

>Storing years of raw data
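A quick back-of-envelope check on these figures (the ~25M events/min midpoint is our assumption, taken from the 20–30M range above):

```scala
// Back-of-envelope check on the numbers above.
// Assumption (ours): ~25M events/min sustained, 5 TB/day of raw data.
val eventsPerMinute = 25e6
val eventsPerDay    = eventsPerMinute * 60 * 24   // ≈ 3.6e10 events/day
val bytesPerDay     = 5e12                        // 5 TB of raw CSV + Parquet
val bytesPerEvent   = bytesPerDay / eventsPerDay  // ≈ 139 bytes per raw event
```

So the stated volumes imply roughly 100–150 bytes of raw data per event, which is plausible for a compact CSV/Parquet record.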

The Past

>Company traffic increased by 200% over the last year

>Writing directly to a relational DB is no longer an option...

Concerns…

>Solution must support both hot and cold data

>Lambda architecture

>Cost-effective

Our Solution

>Streaming data with Kafka

>Handle real time data with Spark Streaming

>Handle raw data with Spark Jobs over Parquet DB

>Data Scientist friendly environment using DataBricks

>Super Cost Effective
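The hot/cold split behind this solution can be sketched as two consumers of the same event batch (all names here are illustrative, not from the deck):

```scala
// Minimal sketch of the Lambda-style split described above: each micro-batch
// feeds a hot path (real-time aggregates) and a cold path (raw retention).
case class Event(gender: String, age: Int)

var hotCounts = Map.empty[(String, Int), Long]                     // serving-layer state
val coldStore = scala.collection.mutable.ArrayBuffer.empty[Event]  // raw events

def handleBatch(batch: Seq[Event]): Unit = {
  // Hot path: incremental aggregation; in the real system this is pushed to MySQL.
  batch.foreach { e =>
    val k = (e.gender, e.age)
    hotCounts = hotCounts.updated(k, hotCounts.getOrElse(k, 0L) + 1)
  }
  // Cold path: append raw events; in production these land in Parquet on S3.
  coldStore ++= batch
}
```

In the actual pipeline both paths read from the same Kafka topics, so the hot aggregates can always be recomputed from the cold raw data.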

Architecture

DStream (Discretized Stream)

Code Sample

implicit val ssc = new StreamingContext(sparkConfiguration, batchInterval)

val topicMap = Map("Topic" -> 5)

>Define Streaming Context

val stream = FixedKafkaInputDStream[String, Event, KeyDecoder, ValueDecoder](ssc, KafkaParams, topicMap, StorageLevel.MEMORY_ONLY)

>Define DStream on Kafka

val mapped = stream map { event => (event.gender, event.age) -> 1 }

val reduced = mapped.reduceByKey { _ + _ }

>Aggregate the Data (In our case reduceByKey)
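The aggregation step above (a pair-map followed by reduceByKey) can be checked on plain Scala collections; within each micro-batch Spark computes the same result. The Event fields here are assumptions:

```scala
// Plain-Scala equivalent of the per-batch map + reduceByKey aggregation.
case class Event(gender: String, age: Int)

val batch = Seq(Event("F", 25), Event("M", 30), Event("F", 25))

val reduced: Map[(String, Int), Int] =
  batch
    .map(e => (e.gender, e.age) -> 1)       // key each event by (gender, age)
    .groupBy(_._1)                          // group pairs by key
    .map { case (k, pairs) => k -> pairs.map(_._2).sum }  // sum the 1s per key

// reduced == Map(("F", 25) -> 2, ("M", 30) -> 1)
```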

Code Sample

reduced foreachRDD { rdd =>
  rdd.collect() foreach { aggregatedRecord =>
    val key   = aggregatedRecord._1
    val count = aggregatedRecord._2
    // INSERT INTO MYTABLE VALUES(key.age, key.gender, count)
    //   ON DUPLICATE KEY UPDATE ….
  }
}

>Working now on the aggregated RDD: collect the records, then insert into MySQL
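The collect-then-upsert step can be sketched with a plain JDBC statement; the table and column names below are assumptions, not from the deck:

```scala
// Hypothetical MySQL upsert for one aggregated record. Assumed schema:
// MYTABLE(age INT, gender VARCHAR, cnt BIGINT, UNIQUE KEY (age, gender)).
def upsertSql(table: String): String =
  s"""INSERT INTO $table (age, gender, cnt) VALUES (?, ?, ?)
     |ON DUPLICATE KEY UPDATE cnt = cnt + VALUES(cnt)""".stripMargin

// Driver-side usage with plain JDBC, after rdd.collect():
// val stmt = connection.prepareStatement(upsertSql("MYTABLE"))
// stmt.setInt(1, key._2); stmt.setString(2, key._1); stmt.setLong(3, count)
// stmt.executeUpdate()
```

Adding the incoming count to the stored one (rather than overwriting it) keeps the row correct when the same key reappears in later micro-batches.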

Architecture Part 2

>100 ~ 200 servers stream events to Kafka

>Spark Streaming cluster handles events in real time (~30M/min)

>Updating MySQL at a rate of 1,500 updates/second

>Generating Parquet files at ~1 GB/hour

>Parquet DB accessible via the "DataBricks" cluster for ad hoc queries
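One plausible layout for that hourly Parquet output on S3 (the path scheme and bucket name are assumptions, not from the deck):

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Hypothetical hourly partition path for the ~1 GB/hour Parquet output.
// Partitioning by hour keeps each file near that size and lets ad hoc
// queries prune to the time range they need.
def hourlyPath(bucket: String, ts: LocalDateTime): String = {
  val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH")
  s"s3://$bucket/events/parquet/${ts.format(fmt)}/"
}

// e.g. events.write.parquet(hourlyPath("my-bucket", LocalDateTime.now))
```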

Infrastructure

>Running on Amazon EC2

>Kafka cluster (4 Brokers, 3 Zookeepers)

>Spark Streaming cluster (1 Master, 5 Slaves)

>“DataBricks” clusters (On Demand & Spot Instance)

>Storage on Amazon S3 & Glacier

{Thanks}
