Inneractive - Spark Meetup 2


Richard Grossman | System Architect

Processing Billions of Daily Events

What we do…

[Diagram: advertisers buy traffic through RTB, SAPI, Video, and RAPI networks; ~2M requests/min, 250ms responses]

Numbers…

>Incoming requests ==> 1.5 to 2 M / minute

>Events generated ==> 20 to 30 M / minute

>Generate 5+ TB / day of raw data (CSV + Parquet)

>Storing 550 days of aggregated data

>Storing years of raw data
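A quick back-of-envelope check on these figures (the ~25M events/min midpoint is our assumption, taken from the 20–30M range above):

```scala
// Back-of-envelope check on the numbers above.
// Assumption (ours): ~25M events/min sustained, 5 TB/day of raw data.
val eventsPerMinute = 25e6
val eventsPerDay    = eventsPerMinute * 60 * 24   // ≈ 3.6e10 events/day
val bytesPerDay     = 5e12                        // 5 TB of raw CSV + Parquet
val bytesPerEvent   = bytesPerDay / eventsPerDay  // ≈ 139 bytes per raw event
```

So the stated volumes imply roughly 100–150 bytes of raw data per event, which is plausible for a compact CSV/Parquet record.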

The Past

>Company traffic increased by 200% over the last year

>Writing directly to a relational DB is no longer an option...

Concerns…

>Solution must support both hot and cold data

>Lambda architecture

>Cost-effective

Our Solution

>Streaming data with Kafka

>Handle real time data with Spark Streaming

>Handle raw data with Spark Jobs over Parquet DB

>Data Scientist friendly environment using DataBricks

>Super Cost Effective
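The hot/cold split behind this solution can be sketched as two consumers of the same event batch (all names here are illustrative, not from the deck):

```scala
// Minimal sketch of the Lambda-style split described above: each micro-batch
// feeds a hot path (real-time aggregates) and a cold path (raw retention).
case class Event(gender: String, age: Int)

var hotCounts = Map.empty[(String, Int), Long]                     // serving-layer state
val coldStore = scala.collection.mutable.ArrayBuffer.empty[Event]  // raw events

def handleBatch(batch: Seq[Event]): Unit = {
  // Hot path: incremental aggregation; in the real system this is pushed to MySQL.
  batch.foreach { e =>
    val k = (e.gender, e.age)
    hotCounts = hotCounts.updated(k, hotCounts.getOrElse(k, 0L) + 1)
  }
  // Cold path: append raw events; in production these land in Parquet on S3.
  coldStore ++= batch
}
```

In the actual pipeline both paths read from the same Kafka topics, so the hot aggregates can always be recomputed from the cold raw data.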

Architecture

DStream (Discretized Stream)

Code Sample

implicit val ssc = new StreamingContext(sparkConfiguration, batchInterval)

val topicMap = Map("Topic" -> 5)

>Define Streaming Context

val stream = FixedKafkaInputDStream[String, Event, KeyDecoder, ValueDecoder](ssc, KafkaParams, topicMap, StorageLevel.MEMORY_ONLY)

>Define DStream on Kafka

val mapped = stream map { event => (event.gender, event.age) -> 1 }

val reduced = mapped.reduceByKey { _ + _ }

>Aggregate the Data (In our case reduceByKey)
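The aggregation step above (a pair-map followed by reduceByKey) can be checked on plain Scala collections; within each micro-batch Spark computes the same result. The Event fields here are assumptions:

```scala
// Plain-Scala equivalent of the per-batch map + reduceByKey aggregation.
case class Event(gender: String, age: Int)

val batch = Seq(Event("F", 25), Event("M", 30), Event("F", 25))

val reduced: Map[(String, Int), Int] =
  batch
    .map(e => (e.gender, e.age) -> 1)       // key each event by (gender, age)
    .groupBy(_._1)                          // group pairs by key
    .map { case (k, pairs) => k -> pairs.map(_._2).sum }  // sum the 1s per key

// reduced == Map(("F", 25) -> 2, ("M", 30) -> 1)
```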

Code Sample

reduced foreachRDD { rdd =>
  rdd.collect() foreach { aggregatedRecord =>
    val key   = aggregatedRecord._1
    val count = aggregatedRecord._2
    // INSERT INTO MYTABLE VALUES(key.age, key.gender, count)
    //   ON DUPLICATE KEY UPDATE ….
  }
}

>Working now on the aggregated RDD: collect the records, then insert into MySQL
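The collect-then-upsert step can be sketched with a plain JDBC statement; the table and column names below are assumptions, not from the deck:

```scala
// Hypothetical MySQL upsert for one aggregated record. Assumed schema:
// MYTABLE(age INT, gender VARCHAR, cnt BIGINT, UNIQUE KEY (age, gender)).
def upsertSql(table: String): String =
  s"""INSERT INTO $table (age, gender, cnt) VALUES (?, ?, ?)
     |ON DUPLICATE KEY UPDATE cnt = cnt + VALUES(cnt)""".stripMargin

// Driver-side usage with plain JDBC, after rdd.collect():
// val stmt = connection.prepareStatement(upsertSql("MYTABLE"))
// stmt.setInt(1, key._2); stmt.setString(2, key._1); stmt.setLong(3, count)
// stmt.executeUpdate()
```

Adding the incoming count to the stored one (rather than overwriting it) keeps the row correct when the same key reappears in later micro-batches.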

Architecture Part 2

>100 ~ 200 servers stream events to Kafka

>Spark Streaming cluster handles events in real time (~30M/min)

>Updating MySQL at a rate of 1,500 updates/second

>Generating Parquet files at ~1 GB/hour

>Parquet DB accessible via the "DataBricks" cluster for ad hoc queries
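One plausible layout for that hourly Parquet output on S3 (the path scheme and bucket name are assumptions, not from the deck):

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Hypothetical hourly partition path for the ~1 GB/hour Parquet output.
// Partitioning by hour keeps each file near that size and lets ad hoc
// queries prune to the time range they need.
def hourlyPath(bucket: String, ts: LocalDateTime): String = {
  val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH")
  s"s3://$bucket/events/parquet/${ts.format(fmt)}/"
}

// e.g. events.write.parquet(hourlyPath("my-bucket", LocalDateTime.now))
```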

Infrastructure

>Running on Amazon EC2

>Kafka cluster (4 Brokers, 3 Zookeepers)

>Spark Streaming cluster (1 Master, 5 Slaves)

>“DataBricks” clusters (On Demand & Spot Instance)

>Storage on Amazon S3 & Glacier

{Thanks}
