Data @enbrite.ly Mate Gulyas

CEU - Data @enbrite.ly

Page 1: CEU - Data @enbrite.ly

Data @enbrite.ly

Mate Gulyas

Page 2: CEU - Data @enbrite.ly

GULYÁS MÁTÉ

CTO & Co-Founder

@gulyasm

Page 3: CEU - Data @enbrite.ly

$171B

170 000 000 000

Page 4: CEU - Data @enbrite.ly

$171B

125% OF HUNGARY'S GDP

Page 5: CEU - Data @enbrite.ly

39%


Page 6: CEU - Data @enbrite.ly

Oct. 27, 1994: Web Gives Birth to Banner Ads

Page 7: CEU - Data @enbrite.ly
Page 8: CEU - Data @enbrite.ly

THERE IS FRAUD ON THE INTERNET

Page 9: CEU - Data @enbrite.ly
Page 10: CEU - Data @enbrite.ly

PROTOTYPE

Page 11: CEU - Data @enbrite.ly
Page 12: CEU - Data @enbrite.ly
Page 13: CEU - Data @enbrite.ly

WHY DO WE DO IT?

Page 14: CEU - Data @enbrite.ly

HOW DO WE DO IT?

Page 15: CEU - Data @enbrite.ly

What do we do?

● DATA COLLECTION

● DATA PROCESSING

● ANALYZE

● ANTI FRAUD

● VIEWABILITY

● BRAND SAFETY

● REPORT + API

Page 16: CEU - Data @enbrite.ly

Product placeholder

Page 17: CEU - Data @enbrite.ly
Page 18: CEU - Data @enbrite.ly

Spark TOOLS

● 0.5-4 TB of data processed daily (1-10B rows)

● Ad-hoc batch queries over 20 TB of data

● 20+ node cluster

● Spent 4 months optimizing it
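
To put the daily volumes above into perspective, the row counts can be converted into an average sustained rate. This is only a back-of-the-envelope sketch: it assumes the load is spread evenly over 24 hours, which real ad traffic is not, and the helper name `rows_per_second` is made up for the illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def rows_per_second(rows_per_day):
    """Average sustained rate needed to keep up with a daily row count."""
    return rows_per_day / SECONDS_PER_DAY


low = rows_per_second(1_000_000_000)    # 1B rows per day
high = rows_per_second(10_000_000_000)  # 10B rows per day
print(f"{low:,.0f} to {high:,.0f} rows per second on average")
# prints: 11,574 to 115,741 rows per second on average
```

Peak rates will be higher than these averages, so the numbers are a lower bound for capacity planning rather than a target.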

Page 19: CEU - Data @enbrite.ly

UNDER THE HOOD

Page 20: CEU - Data @enbrite.ly

DATA COLLECTION

Page 21: CEU - Data @enbrite.ly

The way to the access log

1. Click event attributes (created by JS tracker)

{ "session_id": "spark_meetup_jsmmmoq", "timestamp": 1456080915621, "type": "click"}

2. Base64-encoded event

eyJzZXNzaW9uX2lkIjoic3BhcmtfbWVldHVwX2pzbW1tb3EiLCJ0aW1lc3RhbXAiOjE0NTYwODA5MTU2MjEsInR5cGUiOiAiY2xpY2sifQo=

3. Access log format

TS CLIENT_IP STATUS "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."
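
The path from tracker event to access log can be reversed on the processing side. The sketch below is a minimal illustration, not enbrite.ly's actual parser: it assumes the log line has exactly the four fields shown on the slide, separated by single spaces, and that the payload is standard Base64 in the `event` query parameter. The function name, the sample timestamp, the client IP and the status code are made up for the example.

```python
import base64
import json
import re
from urllib.parse import parse_qs, urlsplit

# Assumed layout, following the slide: TS CLIENT_IP STATUS "METHOD URL"
LOG_LINE = re.compile(r'^(\S+) (\S+) (\d{3}) "(\S+) (\S+)"$')


def parse_access_log_line(line):
    """Return the log fields plus the decoded tracker event of one line."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        raise ValueError("unexpected access log layout")
    ts, client_ip, status, _method, url = match.groups()
    # The tracker carries the Base64-encoded JSON in the `event` parameter.
    encoded = parse_qs(urlsplit(url).query)["event"][0]
    event = json.loads(base64.b64decode(encoded))
    return {"ts": ts, "client_ip": client_ip, "status": int(status), "event": event}


# Payload copied from the slide; the other field values are placeholders.
payload = (
    "eyJzZXNzaW9uX2lkIjoic3BhcmtfbWVldHVwX2pzbW1tb3EiLCJ0aW1lc3RhbXAi"
    "OjE0NTYwODA5MTU2MjEsInR5cGUiOiAiY2xpY2sifQo="
)
line = '1456080915 203.0.113.7 200 "GET https://api.endpoint?event=' + payload + '"'

record = parse_access_log_line(line)
print(record["event"]["type"])  # prints: click
```

One caveat of this sketch: `parse_qs` turns a literal `+` into a space, so a tracker that does not percent-encode its payload would need URL-safe Base64 or an extra repair step.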

Page 22: CEU - Data @enbrite.ly

DATA COLLECTION

Page 23: CEU - Data @enbrite.ly

DATA PROCESSING

Page 24: CEU - Data @enbrite.ly

DATA PROCESSING

● AWS

● Apache Spark

● Apache Hadoop

● Luigi

● Python (pandas, scikit)

● Go

● PagerDuty

● Graphite, Grafana

● Ganglia

● Ansible

● HashiCorp stack

TOOLS

Page 25: CEU - Data @enbrite.ly

AWS TOOLS

INFRASTRUCTURE PROVIDER

Page 26: CEU - Data @enbrite.ly

AWS TOOLS

● 16 services

● 110+ machines

● 1-4 EMR clusters (1-30 nodes)

● 100TB+ on S3

● Every client has separate infrastructure

Page 27: CEU - Data @enbrite.ly

Spark TOOLS

INFRASTRUCTURE PROVIDER

Page 28: CEU - Data @enbrite.ly

Luigi TOOLS

Luigi + enbrite.ly extensions = Gabo Luigi

WORKFLOW ENGINE

Page 29: CEU - Data @enbrite.ly
Page 30: CEU - Data @enbrite.ly

Tools we created GABO LUIGI

Page 31: CEU - Data @enbrite.ly

HOT MAP DETECTION

Page 32: CEU - Data @enbrite.ly

BOT TRAFFIC DETECTION

Page 33: CEU - Data @enbrite.ly
Page 34: CEU - Data @enbrite.ly
Page 35: CEU - Data @enbrite.ly
Page 36: CEU - Data @enbrite.ly
Page 37: CEU - Data @enbrite.ly

LESSONS LEARNED

Page 38: CEU - Data @enbrite.ly

LESSONS LEARNED

Automate EVERYTHING

Page 39: CEU - Data @enbrite.ly

LESSONS LEARNED

OPTIMIZATION takes a LOT OF TIME

Page 40: CEU - Data @enbrite.ly

LESSONS LEARNED

Data is NEVER clean
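
For a collection pipeline that receives Base64-encoded JSON events, one way to act on this lesson is to make the decoder tolerant: bad records are skipped and counted instead of crashing the whole batch. The sketch below is an illustration only. The function name `decode_event` and the rule that a usable event must carry `session_id`, `timestamp` and `type` are assumptions based on the sample event shown earlier in the deck.

```python
import base64
import binascii
import json

# Hypothetical sanity rule: keys a usable event must carry.
REQUIRED_KEYS = {"session_id", "timestamp", "type"}


def decode_event(encoded):
    """Decode one Base64 JSON event; return None instead of raising on bad input."""
    try:
        event = json.loads(base64.b64decode(encoded, validate=True))
    except (binascii.Error, ValueError):
        # Truncated Base64, non-UTF-8 bytes and invalid JSON all end up here.
        return None
    if not isinstance(event, dict) or not REQUIRED_KEYS.issubset(event):
        return None
    return event


def encode(obj):
    """Helper for the demo: build a payload the way a tracker might."""
    return base64.b64encode(json.dumps(obj).encode()).decode()


samples = [
    encode({"session_id": "s1", "timestamp": 1, "type": "click"}),  # clean
    "eyJzZXNzaW9uX2lkIj",                                           # truncated
    encode([1, 2]),                                                 # not an object
    encode({"session_id": "s2"}),                                   # missing keys
]
decoded = [decode_event(s) for s in samples]
dropped = sum(1 for d in decoded if d is None)
print(f"kept {len(samples) - dropped}, dropped {dropped}")  # prints: kept 1, dropped 3
```

Counting the dropped records, rather than silently discarding them, keeps a sudden rise in dirty data visible in monitoring.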

Page 41: CEU - Data @enbrite.ly
Page 42: CEU - Data @enbrite.ly
Page 44: CEU - Data @enbrite.ly

THE NEXT BIG THING....

PRIVACY

Page 45: CEU - Data @enbrite.ly

MATE GULYAS

@gulyasm

@enbritely

THANK YOU!