BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ...

ARCHITECTURE & LESSONS LEARNED

BARTOSZ ŁOŚ

REAL-TIME DATA PROCESSING AT RTB HOUSEREAL-TIME DATA PROCESSING AT RTB HOUSE

BIG DATA TECHNOLOGY WARSAW SUMMIT 2017FEBRUARY 9, 2017

TABLE OF CONTENTS

Agenda:- real-time bidding- the first iteration: mutable structures- the second iteration: data-flow- the third iteration: immutable streams of events

REAL-TIME BIDDING

REAL-TIME BIDDING: RTB PLATFORM

Processing bid requests(350K/s, ~30 SSP networks, <50-100ms)

REAL-TIME BIDDING: DATA & MACHINE LEARNING

Impressions:~ 150M events / day~ 4TB data / day

Clicks:~ 1M events / day~ 35GB data / day

Conversions: ~ 450K events / day ~ 25GB data / day

THE FIRST ITERATION

THE 1ST ITERATION: MUTABLE IMPRESSIONS 07/23

THE 1ST ITERATION: DRAWBACKS

Issues:- long, overloading data migrations (30 days back)- complex servlets' logic, inability to reprocess- inflexible, various schemas- single-DC- inconsistencies

THE SECOND ITERATION: DATA-FLOW

THE 2ND ITERATION: THE 1ST DATA-FLOW ARCHITECTURE 10/23

THE 2ND ITERATION: DISTRIBUTED LOG

Why Apache Kafka:- distributed log- topics partitioning- partition replication- log retention- stateless- efficient data consuming

THE 2ND ITERATION: BATCH LOADING

Why Apache Camus:- "Kafka to HDFS" pipeline- map-reduce jobs, batches- storing offsets in log files- data partitioning

THE 2ND ITERATION: AVRO & SCHEMA VERSIONING

Why Apache Avro:- data serialization framework- rich data structures- self-describing container files- reader & writer schemas- binary data format- schema registry

THE 2ND ITERATION: ACCURATE STATISTICS

Why Apache Storm:- real-time processing- streams of tuples, topologies- fault-tolerance

Why Trident:- transactions, exactly-once processing- microbatches (latency & throughput)

THE 2ND ITERATION: STATS-COUNTER TOPOLOGY 15/23

THE 2ND ITERATION: DRAWBACKS

Hybrid architecture:- aggregates (real-time)- raw events (2-hour batches)- joined events (end-of-day batch jobs)

Other issues:- Hive joins- mutable events- servlets' complex logic

THE THIRD ITERATION: NEW APPROACH

THE 3RD ITERATION: NEW APPROACH

{ "IMPRESSION”: "URL”, "TIME”, "CREATIVE”, ... "CLICKS”, "CONVERSIONS”}

{ "CLICK”: "TIME”, "IMPRESSION_ID”, ... "IMPRESSION”}

{ "CONVERSION”: "TIME”, "CLICK_ID”, ... "CLICK”}

New approach:- real-time processing- publishing light events- immutable streams of events

THE 3RD ITERATION: HIGH-LEVEL ARCHITECTURE 19/23

THE 3RD ITERATION: DATA-FLOW TOPOLOGY 20/23

THE 3RD ITERATION: EVENTS MERGE 21/23

SUMMARY

What we have achieved:- multi-DC architecture- HDFS & BigQuery streaming- platform monitoring- much more stable platform- higher quality of data processing- better data-flow monitoring, deployment & maintenance

THANK YOUFOR YOUR ATTENTION

BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ...

Documents

Addressing Challenges of Data-Intensive Research · Data Science Warsaw “Data Science Warsaw is a community of data scientists based in Warsaw. We are a non-profit professional

TE Summit 23-24.10.2013.-Dr. Mara Jakobsone From Warsaw to Malta

Dr. Fatih BIROL IEA Chief Economist Warsaw, 8 December 2014 · Dr. Fatih BIROL IEA Chief Economist Warsaw, 8 December 2014 ... Mixed signals in run-up to crucial climate summit in

KEY OUTCOMES OF THE SUMMIT OF NATO HEADS OF STATE … · The previous NATO Summit took place in Warsaw in July 2016. 2. The Brussels Summit was the first full Summit held at NATO’s

Adobe Summit - Data Storytelling

Data Summit, February 2008, Chet Wayland Data Summit Challenge

Data Gathering and Financial Regulation Mark Allen CASE Research, Warsaw

NATO's Warsaw Summit: In Brief - Federation of … · NATO’s Warsaw Summit: In Brief Congressional Research Service ... government since 2014, when Russia annexed Crimea and began

Big data summit

Data Summit Brussels: Introduction

TETRA Congress Warsaw, 13.-14.6.2006 TETRA Data Services & Applications Hannu Vilppunen Warsaw June 2006 Presented by Risto Toikkanen

Fried data summit data quality data analytics together

Euromoney Magazine Survey completed - Bank Handlowy · Euromoney Magazine Survey completed NATO summit in Warsaw on 7-9 July 2016 – useful information Due to the NATO summit, which

Big Data Innovation Summit

WARSAW BUSINESS JOURNAL Rondo 1, Warsaw ......IN BIG DATA POLAND’S LOGISTICS SUCCESS HOME ate a space which, in APPLIANCES EXPORTS WBJ MAY 2017 63 EY Rondo 1, Warsaw Sophistication,

Mini Summit VI: Anti Corruption Case Study · Mini Summit VI: Anti‐Corruption Case Study Contracting with Third Parties ... PCF - Mini Summit # VI - Warsaw - 11.05.2016 | 11. 2

Oracle Big Data Summit

Warsaw University of Life Sciences - TEMPUStempus-prj.onma.edu.ua/dlzone/qantus/varshava2014/... · Warsaw University of Life Sciences Basic data „Qualifications Frameworks for

DATA SUMMIT - csuci.edu

CE EMC Test Report - Summit Datasummitdata.com/Certification/SDC-MSD30AG/EN_301489_MSD30AG.pdfBrand Name : Summit Data Communications Applicant : Summit Data Communications lnc. Address