

Page 1: Inneractive - Spark meetup2

Richard Grossman | System Architect

Processing Billions of Daily Events

Page 2: Inneractive - Spark meetup2

What we do…

[Diagram: multiple Advertisers connect through RTB, SAPI, Video, and RAPI networks; ~2M requests/min, 250ms]

Page 3: Inneractive - Spark meetup2

Numbers…

>Incoming requests ==> 1.5 to 2 M / minute

>Events generated ==> 20 to 30 M / minute

>Generating 5+ TB / day of raw data (CSV + Parquet)

>Storing 550 days of aggregated data

>Storing years of raw data

Page 4: Inneractive - Spark meetup2

The Past

Page 5: Inneractive - Spark meetup2

Concerns…

>Company traffic increased +200% over the last year

>Writing directly to a relational DB is no longer an option...

>The solution should support both hot and cold data

>Lambda architecture

>Cost effective

Page 6: Inneractive - Spark meetup2

Our Solution

>Streaming data with Kafka

>Handle real time data with Spark Streaming

>Handle raw data with Spark jobs over the Parquet DB (see the sketch after this list)

>Data Scientist friendly environment using DataBricks

>Super Cost Effective
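
As a rough sketch of that batch path (third bullet above): a minimal Spark SQL job over the Parquet data. The S3 path, column names, and the sc SparkContext are illustrative assumptions, not the actual production job.

import org.apache.spark.sql.SQLContext

// Cold path: ad hoc / batch queries straight over the raw Parquet files
// (the S3 path and column names below are illustrative assumptions)
val sqlContext = new SQLContext(sc)
val events = sqlContext.read.parquet("s3n://my-bucket/raw-events/")
events.registerTempTable("events")

sqlContext.sql("SELECT gender, age, COUNT(*) AS cnt FROM events GROUP BY gender, age").show()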

Page 7: Inneractive - Spark meetup2

Architecture

Page 8: Inneractive - Spark meetup2

DStream (Discretized Stream)
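
A DStream is a sequence of RDDs, one per batch interval. As a tiny sketch of that model (assuming the Kafka stream defined on the next slide), each batch is handled with the normal RDD API:

stream foreachRDD { (rdd, time) =>
  // one RDD arrives per batch interval
  println(s"Batch at $time contains ${rdd.count()} events")
}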

Page 9: Inneractive - Spark meetup2

Code Sample

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext

implicit val ssc = new StreamingContext(sparkConfiguration, batchInterval)

val topicMap = Map("Topic" -> 5)

>Define the Streaming Context

val stream = FixedKafkaInputDStream[String, Event, KeyDecoder, ValueDecoder](ssc, KafkaParams, topicMap, StorageLevel.MEMORY_ONLY)

>Define the DStream on Kafka

val mapped = stream map { event => ((event.gender, event.age), 1) }

val reduced = mapped.reduceByKey(_ + _)

>Aggregate the data (in our case, reduceByKey)

Page 10: Inneractive - Spark meetup2

Code Sample

reduced foreachRDD { rdd =>
  rdd.collect() foreach { aggregatedRecord =>
    val key = aggregatedRecord._1
    val count = aggregatedRecord._2
    // execute the upsert, e.g.: INSERT INTO MYTABLE VALUES(key.age, key.gender, count) ON DUPLICATE KEY UPDATE ...
  }
}

>Working on the aggregated RDD: collect the records, then insert them into MySQL
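
For completeness, a hedged sketch of that insert step using plain JDBC: the connection URL, credentials, the MYTABLE columns, and the gender/age types (String/Int) are illustrative assumptions, since the slide elides the actual statement.

import java.sql.DriverManager

// Same foreachRDD as above, fleshed out with a batched JDBC upsert
reduced foreachRDD { rdd =>
  val records = rdd.collect()                    // the aggregated data is small
  if (records.nonEmpty) {
    val conn = DriverManager.getConnection("jdbc:mysql://db-host/reports", "user", "pass")  // hypothetical
    try {
      val stmt = conn.prepareStatement(
        "INSERT INTO MYTABLE (gender, age, cnt) VALUES (?, ?, ?) " +
        "ON DUPLICATE KEY UPDATE cnt = cnt + VALUES(cnt)")
      records foreach { case ((gender, age), count) =>
        stmt.setString(1, gender)
        stmt.setInt(2, age)
        stmt.setLong(3, count.toLong)
        stmt.addBatch()
      }
      stmt.executeBatch()
      stmt.close()
    } finally {
      conn.close()
    }
  }
}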

Page 11: Inneractive - Spark meetup2

Architecture Part 2

>100 ~ 200 servers stream events to Kafka

>Spark Streaming cluster handles events in real time (~30M/min)

>Updating MySQL at a rate of 1,500 updates/second

>Generating Parquet files at ~1 GB/hour (see the sketch after this list)

>Parquet DB accessible using the “DataBricks” cluster for ad hoc queries
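
A hedged sketch of how those hourly Parquet files might be written from the same stream; the output path layout, the EventRecord case class, and its fields are illustrative assumptions rather than the exact production job.

import org.apache.spark.sql.SQLContext

// Hypothetical flat record for the raw-data sink
case class EventRecord(gender: String, age: Int)

// Append each micro-batch into an hourly Parquet bucket on S3
stream foreachRDD { (rdd, time) =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._

  val hour = time.milliseconds / (60 * 60 * 1000)        // coarse hourly bucket
  rdd.map(e => EventRecord(e.gender, e.age))
     .toDF()
     .write
     .mode("append")
     .parquet(s"s3n://my-bucket/raw-events/hour=$hour")
}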

Page 12: Inneractive - Spark meetup2

Infrastructure

>Running on Amazon EC2

>Kafka cluster (4 Brokers, 3 Zookeepers)

>Spark Streaming cluster (1 Master, 5 Slaves)

>“DataBricks” clusters (On-Demand & Spot Instances)

>Storage on Amazon S3 & Glacier

Page 13: Inneractive - Spark meetup2

{Thanks}