CEU - Data @enbrite.ly

Data @enbrite.ly

Mate Gulyas

CTO & Co-Founder

GULYÁS MÁTÉ

@gulyasm

$171B

170 000 000 000

$171B

HUNGARY GDP 125%

39%

Oct. 27, 1994: Web Gives Birth to Banner Ads

THERE IS FRAUD ON THE INTERNET

PROTOTYPE

WHY DO WE DO IT?

HOW DO WE DO IT?

DATA COLLECTION

ANALYZE

DATA PROCESSING

ANTI FRAUD

VIEWABILITY

BRAND SAFETY

REPORT + API

What do we do?

Product placeholder

Spark TOOLS

● 0.5-4TB data processed daily (1-10B rows)

● Ad-hoc batch queries over 20TB of data

● 20+ node cluster

● Spent 4 months optimizing it

UNDER THE HOOD

DATA COLLECTION

The way to the access log

1. Click event attributes (created by JS tracker)

{ "session_id": "spark_meetup_jsmmmoq", "timestamp": 1456080915621, "type": "click"}

2. Base64-encoded event

eyJzZXNzaW9uX2lkIjoic3BhcmtfbWVldHVwX2pzbW1tb3EiLCJ0aW1lc3RhbXAiOjE0NTYwODA5MTU2MjEsInR5cGUiOiAiY2xpY2sifQo=

3. Access log format

TS CLIENT_IP STATUS "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."
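
The path from click event to access log can be sketched in Python, one of the tools listed in this deck. This is a minimal illustration, not the enbrite.ly tracker or parser: the helper names, the example client IP, and the assumption that the tracker percent-encodes the `event` query value are all hypothetical. Only the sample event, the Base64 token, and the log layout come from the slide.

```python
import base64
import json
from urllib.parse import parse_qs, quote, urlsplit

# Base64 token exactly as shown on the slide.
SLIDE_TOKEN = (
    "eyJzZXNzaW9uX2lkIjoic3BhcmtfbWVldHVwX2pzbW1tb3EiLCJ0aW1lc3RhbXAi"
    "OjE0NTYwODA5MTU2MjEsInR5cGUiOiAiY2xpY2sifQo="
)


def encode_event(event):
    """Serialize an event dict to JSON and Base64-encode it."""
    raw = json.dumps(event).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_event(token):
    """Reverse of encode_event: Base64 token -> event dict."""
    return json.loads(base64.b64decode(token))


def build_log_line(ts, client_ip, status, token):
    """Hypothetical helper: format a line like the slide's access log.

    Assumption: the token is percent-encoded so that '+', '/' and '='
    survive the trip through the query string.
    """
    url = "https://api.endpoint?event=" + quote(token, safe="")
    return f'{ts} {client_ip} {status} "GET {url}"'


def event_from_log_line(line):
    """Pull the event back out of one access log line."""
    request = line.split('"')[1]          # the quoted 'GET <url>' part
    _method, url = request.split(" ", 1)
    query = parse_qs(urlsplit(url).query)  # also undoes percent-encoding
    return decode_event(query["event"][0])


if __name__ == "__main__":
    event = decode_event(SLIDE_TOKEN)
    line = build_log_line(event["timestamp"], "203.0.113.7", 200,
                          encode_event(event))
    print(line)
    print(event_from_log_line(line))
```

Carrying the event inside the request URL means the web server's ordinary access log doubles as the raw event store, so collection needs no separate write path.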

DATA COLLECTION

DATA PROCESSING

● AWS

● Apache Spark

● Apache Hadoop

● Luigi

● Python (pandas, scikit)

● Go

● PagerDuty

● Graphite, Grafana

● Ganglia

● Ansible

● HashiCorp stack

TOOLS

AWS TOOLS

INFRASTRUCTURE PROVIDER

AWS TOOLS

● 16 services

● 110+ machines

● 1-4 EMR clusters (1-30 node)

● 100TB+ on S3

● Each client has separate infrastructure

Spark TOOLS

INFRASTRUCTURE PROVIDER

Luigi TOOLS

Luigi + enbrite.ly extensions = Gabo Luigi

WORKFLOW ENGINE
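
What a workflow engine does can be shown with a small standard-library sketch. It is loosely modeled on the requires/output/run/complete pattern of Luigi tasks, but it is neither Luigi's API nor Gabo Luigi; the class names, file names, and the `build` function are hypothetical.

```python
import os
import tempfile


class Task:
    """Toy task: declares dependencies, an output path, and the work to do."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        # A task counts as done once its output file exists.
        return os.path.exists(self.output())


class CollectLogs(Task):
    def output(self):
        return os.path.join(self.workdir, "raw_events.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("click\nview\nclick\n")  # made-up sample data


class CountClicks(Task):
    def requires(self):
        return [CollectLogs(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "click_count.txt")

    def run(self):
        with open(self.requires()[0].output()) as f:
            clicks = sum(1 for row in f if row.strip() == "click")
        with open(self.output(), "w") as f:
            f.write(str(clicks))


def build(task, executed=None):
    """Run dependencies first; skip anything whose output already exists."""
    executed = [] if executed is None else executed
    if task.complete():
        return executed
    for dep in task.requires():
        build(dep, executed)
    task.run()
    executed.append(type(task).__name__)
    return executed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(build(CountClicks(workdir)))  # first pass runs both tasks
        print(build(CountClicks(workdir)))  # second pass finds nothing to do
```

Because completion is defined by the existence of the output, a failed pipeline can be rerun and only the missing steps execute.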

Tools we created GABO LUIGI

HOT MAP DETECTION

BOT TRAFFIC DETECTION

LESSONS LEARNED

Automate EVERYTHING

OPTIMIZATION takes a LOT OF TIME

Data is NEVER clean
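
One way to act on this lesson is to make event decoding defensive, so that a malformed record is counted and skipped instead of failing the whole batch. The sketch below is an assumption-laden illustration, not the enbrite.ly pipeline: the function names and rejection reasons are hypothetical, and the required fields are simply those of the sample click event shown earlier in this deck.

```python
import base64
import json
from collections import Counter

# Fields present in the sample click event from the data collection slide.
REQUIRED_FIELDS = ("session_id", "timestamp", "type")


def parse_event(token):
    """Return (event, None) on success or (None, reason) on failure."""
    try:
        # validate=True rejects characters outside the Base64 alphabet.
        raw = base64.b64decode(token, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return None, "bad_base64"
    try:
        event = json.loads(raw)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return None, "bad_json"
    if not isinstance(event, dict):
        return None, "not_an_object"
    if any(field not in event for field in REQUIRED_FIELDS):
        return None, "missing_field"
    return event, None


def clean_events(tokens):
    """Split a batch into usable events and a tally of rejection reasons."""
    events, rejected = [], Counter()
    for token in tokens:
        event, reason = parse_event(token)
        if event is None:
            rejected[reason] += 1
        else:
            events.append(event)
    return events, rejected
```

Keeping the tally of rejection reasons makes dirty data visible, so a sudden rise in one reason can be investigated rather than silently dropped.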

THE NEXT BIG THING....

PRIVACY

MATE GULYAS

gulyasm@enbrite.ly

@gulyasm

@enbritely

THANK YOU!