Richard Grossman | System Architect
Processing Billions of Daily Events
What we do…
[Diagram: advertisers feeding RTB, RIR, SAPI, Video, and RAPI networks into the platform — ~2M requests/min, answered within 250ms]
Numbers…
>Incoming requests: 1.5 to 2 M / minute
>Events generated: 20 to 30 M / minute
>Generating 5+ TB / day of raw data (CSV + Parquet)
>Storing 550 days of aggregated data
>Storing years of raw data
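The headline numbers are easy to sanity-check. A quick back-of-the-envelope in Scala — the 25 M/min midpoint and the derived per-event size are my assumptions, not figures from the deck:

```scala
// Back-of-the-envelope check of the deck's headline numbers.
// 25 M/min is the assumed midpoint of the quoted 20-30 M/min range.
val eventsPerMinute = 25000000L
val eventsPerDay = eventsPerMinute * 60 * 24          // 36,000,000,000 -> "billions of daily events"
val rawBytesPerDay = 5L * 1024 * 1024 * 1024 * 1024   // 5 TB/day of CSV + Parquet
val bytesPerEvent = rawBytesPerDay / eventsPerDay     // on the order of 150 raw bytes per event
```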
The Past: Concerns…
>Company traffic increased 200% over the last year
>Writing directly to a relational DB is no longer an option
>The solution should support both hot and cold data
>Lambda architecture
>Must be cost effective
Our Solution
>Streaming data with Kafka
>Handle real-time data with Spark Streaming
>Handle raw data with Spark jobs over a Parquet DB
>Data-scientist-friendly environment using Databricks
>Super cost effective
Architecture
DStream (Discretized Stream)
Code Sample
implicit val ssc = new StreamingContext(sparkConfiguration, batchInterval)
val topicMap = Map("Topic" -> 5)
>Define the Streaming Context
val stream = FixedKafkaInputDStream[String, Event, KeyDecoder, ValueDecoder](ssc, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY)
>Define a DStream over Kafka
val mapped = stream map { event => (event.gender, event.age) -> 1 }
val reduced = mapped.reduceByKey { _ + _ }
>Aggregate the data (in our case, reduceByKey)
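What the map/reduceByKey pair does to a single micro-batch can be sketched with plain Scala collections. The Event case class here is a hypothetical stand-in for the deck's real event type:

```scala
// Hypothetical stand-in for the deck's Event type.
case class Event(gender: String, age: Int)

// One micro-batch, as the DStream's map stage would see it.
val batch = Seq(Event("F", 25), Event("M", 30), Event("F", 25))

// map: key each event by (gender, age) with a count of 1.
val mapped = batch.map(e => (e.gender, e.age) -> 1)

// reduceByKey equivalent on plain collections: sum the counts per key.
val reduced = mapped.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
// reduced == Map(("F", 25) -> 2, ("M", 30) -> 1)
```

In Spark the same reduction runs partition-local first, so only one partial count per key crosses the network per batch.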
Code Sample
reduced foreachRDD { rdd =>
  rdd.collect() foreach { aggregatedRecord =>
    val key = aggregatedRecord._1
    val count = aggregatedRecord._2
    // INSERT INTO MYTABLE VALUES (key.age, key.gender, count)
    //   ON DUPLICATE KEY UPDATE ....
  }
}
>Working on the aggregated RDD: collect the records, then insert into MySQL
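The insert step maps naturally onto MySQL's upsert syntax. A sketch with a hypothetical helper — the column names (age, gender, cnt) are assumptions, and the real job would bind values through a JDBC PreparedStatement rather than run raw strings:

```scala
// Hypothetical helper building the upsert statement the slide describes.
// Column names (age, gender, cnt) are assumptions; values would be bound
// to the ? placeholders through a JDBC PreparedStatement.
def upsertSql(table: String): String =
  s"INSERT INTO $table (age, gender, cnt) VALUES (?, ?, ?) " +
    "ON DUPLICATE KEY UPDATE cnt = cnt + VALUES(cnt)"
```

ON DUPLICATE KEY UPDATE makes each batch's write idempotent per key: the first batch inserts the row, later batches add to the running count.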
Architecture Part 2
>100-200 servers stream events to Kafka
>Spark Streaming cluster handles events in real time (~30M/min)
>Updating MySQL at a frequency of 1,500 updates/second
>Generating Parquet files at ~1 GB/hour
>Parquet DB accessible via a Databricks cluster for ad hoc queries
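The gap between the event rate and the MySQL write rate is the point of the pre-aggregation stage; a small sanity calculation on the slides' approximate figures:

```scala
// Events arrive far faster than MySQL is written to: reduceByKey collapses
// ~500,000 raw events/second into ~1,500 row upserts/second.
val eventsPerSecond = 30000000L / 60          // ~30 M/min from the slides
val updatesPerSecond = 1500L                  // MySQL write rate from the slides
val reductionFactor = eventsPerSecond / updatesPerSecond  // ~333x fewer writes
```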
Infrastructure
>Running on Amazon EC2
>Kafka cluster (4 Brokers, 3 Zookeepers)
>Spark Streaming cluster (1 Master, 5 Slaves)
>Databricks clusters (On Demand & Spot Instances)
>Storage on Amazon S3 & Glacier
{Thanks}