Data Platform Evolution

Page 1: Data platform evolution

Data Platform Evolution

Page 2: Data platform evolution

About One by Aol.

Page 3: Data platform evolution

About Our Team

Page 4: Data platform evolution

About Our Data

Video Tracking

Ad Tracking

User Tracking

Page 5: Data platform evolution

LEGACY PLATFORM

Page 6: Data platform evolution

Legacy System (architecture diagram)

Components: DWH Cluster, SSIS Manager, External Data Providers, Event Collector, Caching, Reporting, DWH, Application Servers

Page 7: Data platform evolution

Legacy Scale

500 TB Storage

40K Events Processed per Second

3.5B Events Processed Daily

Daily Processing

20 GB Data Daily

Page 8: Data platform evolution

The Need To Change

Cost

Processing Time

Scale

Development ROI

Testability

Accessibility

Page 9: Data platform evolution

NEXT STEPS

Page 10: Data platform evolution

Next Steps

3 Stages

Outcome

Component Description

Examples

Page 12: Data platform evolution

First Stage (architecture diagram)

Components: Data warehouse, Data Collection, Data Distribution, DWH API, External Data Providers, Event Collector, Analytics, Reporting, Monitoring, Legacy DWH

Transfer channels: sFTP, FTP

Repeated box label: Servers

Page 13: Data platform evolution

First Stage Summary

Full Redundancy

Comparison Legacy vs. Batch

Linear Scale

Partial Test Coverage

Raw Level Data Access

CD

Page 15: Data platform evolution

Second Stage (architecture diagram)

Components: Data warehouse, Data Collection, Data Distribution, DWH API, External Data Providers, Event Collector, Scheduling, Reporting, Monitoring, Real Time DWH, Analytics

Transfer channels: S3, Azure, sFTP, FTP

Repeated box label: Servers

Page 16: Data platform evolution

Second Stage Summary

Near Real-Time Processing

Comparison Batch vs. Real Time

Full Monitoring

Full Test Coverage

“Product” Event/Report Definition

DevOps Automation

Page 17: Data platform evolution

MORE DETAILS

Page 18: Data platform evolution

Batch Event Processing

Diagram components: Hadoop Cluster, Hadoop Monitoring, Aggregated data exporter, Processed data aggregator, Error Processing, Data Archiver, Data Collection Cluster, Raw data processing (Map-Reduce), Vertica, External/Internal DWH Clusters

Raw data files pushed to Hadoop (WebHDFS)

Legend: Data flow direction, Monitoring data
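The diagram shows raw data files being pushed to Hadoop over WebHDFS. In the WebHDFS REST API, creating a file starts with a PUT to the NameNode using op=CREATE, which answers with a redirect to a DataNode that receives the file body. A minimal sketch of building that first URL follows; the host name, port, and HDFS path are hypothetical and do not come from the slides.

```python
from urllib.parse import quote, urlencode


def webhdfs_create_url(host: str, port: int, hdfs_path: str, overwrite: bool = False) -> str:
    """Build the first-step WebHDFS CREATE URL for an absolute HDFS path."""
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be absolute")
    # op=CREATE and overwrite are documented WebHDFS query parameters.
    query = urlencode({"op": "CREATE", "overwrite": str(overwrite).lower()})
    # quote() keeps "/" and percent-encodes characters such as spaces.
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# Hypothetical NameNode address and target path, for illustration only.
url = webhdfs_create_url("namenode.example.internal", 9870, "/raw/events/part 1.log")
print(url)
```

The sketch stops at URL construction so it runs without a cluster; a real uploader would follow the redirect and send the file content in a second PUT.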

Raw data processing
1. Cleaning/Transformation/Enrichment/Validation of data from main data sources with Map-Reduce
2. Month history
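The cleaning, validation, and enrichment step can be pictured as a map phase followed by a reduce phase. The sketch below runs in plain Python rather than on Hadoop, and the line format (timestamp, event type, user id), the set of valid event types, and the hour enrichment are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Iterator, Optional, Tuple

# Hypothetical event types, loosely mirroring the three tracking sources.
VALID_TYPES = {"video", "ad", "user"}


def clean(line: str) -> Optional[dict]:
    """Clean and validate one raw line; return None when it is invalid."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3:
        return None
    ts, event_type, user_id = parts
    event_type = event_type.lower()
    if not ts.isdigit() or event_type not in VALID_TYPES or not user_id:
        return None
    # Enrichment: derive the hour bucket from the timestamp in seconds.
    return {"ts": int(ts), "type": event_type, "user": user_id, "hour": int(ts) // 3600}


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[Hashable, int]]:
    """Emit (key, 1) pairs; invalid lines are counted under a separate key."""
    for line in lines:
        event = clean(line)
        if event is None:
            yield ("invalid", 1)
        else:
            yield ((event["type"], event["hour"]), 1)


def reduce_phase(pairs: Iterable[Tuple[Hashable, int]]) -> Dict[Hashable, int]:
    """Sum the values emitted for each key."""
    totals: Dict[Hashable, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


raw = ["3600, Video ,u1", "3700,ad,u2", "bad line", "7300,video,u1", "3650,video,u3"]
counts = reduce_phase(map_phase(raw))
print(counts)
```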

Aggregator Process
1. DSL for defining new kinds of aggregation
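The slides do not show the aggregation DSL itself. As one possible reading, an aggregation can be declared as data (what to group by, which measure, which operation) and handed to a generic runner. Every name below, including `Aggregation`, `run`, and the sample fields, is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

# Supported reducers for this sketch; each maps a list of values to one number.
OPS: Dict[str, Callable[[List[float]], float]] = {"sum": sum, "count": len, "max": max}


@dataclass(frozen=True)
class Aggregation:
    """Declarative description of one aggregation."""
    name: str
    group_by: Tuple[str, ...]
    measure: str
    op: str


def run(spec: Aggregation, rows: Iterable[dict]) -> Dict[tuple, float]:
    """Group rows by the spec's key fields and reduce the measure per group."""
    groups: Dict[tuple, List[float]] = {}
    for row in rows:
        key = tuple(row[field] for field in spec.group_by)
        groups.setdefault(key, []).append(row[spec.measure])
    reducer = OPS[spec.op]
    return {key: reducer(values) for key, values in groups.items()}


rows = [
    {"source": "video", "country": "US", "views": 3},
    {"source": "video", "country": "US", "views": 2},
    {"source": "ad", "country": "DE", "views": 7},
]
views_by_source = Aggregation("views_by_source", ("source",), "views", "sum")
result = run(views_by_source, rows)
print(result)
```

Keeping the definition declarative means a new kind of aggregation is a new `Aggregation` value rather than new processing code, which is the usual motivation for a DSL of this sort.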

Data exporter
1. Export aggregated data
2. Export processed data

Processed/Aggregated data

Logging Framework Elastic Search

Logs will be exposed through Kibana to monitor data flow

Monitoring

Monitoring of data flow inside and outside of Event Processing Cluster

Hadoop monitoring data

Error Processing
1. Automatic error re-processing with time window

S3
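The slide mentions automatic error re-processing with a time window but does not define the window. One plausible interpretation, sketched below, retries a failed batch only while it is younger than the window and sets older failures aside. `FailedBatch`, `reprocess`, and the one-hour window are assumptions, not details from the source.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class FailedBatch:
    batch_id: str
    first_failed_at: float  # seconds since epoch


def reprocess(
    failed: List[FailedBatch],
    process: Callable[[str], bool],
    now: float,
    window_seconds: float,
) -> Tuple[List[str], List[str], List[str]]:
    """Retry failed batches inside the window; return (recovered, pending, expired) ids."""
    recovered: List[str] = []
    pending: List[str] = []
    expired: List[str] = []
    for batch in failed:
        if now - batch.first_failed_at > window_seconds:
            expired.append(batch.batch_id)    # too old: no automatic retry
        elif process(batch.batch_id):
            recovered.append(batch.batch_id)  # retry succeeded
        else:
            pending.append(batch.batch_id)    # retry again on the next pass
    return recovered, pending, expired


failed = [FailedBatch("a", 9000.0), FailedBatch("b", 9500.0), FailedBatch("c", 1000.0)]
# Hypothetical processor: batch "b" still fails, everything else succeeds.
outcome = reprocess(failed, lambda batch_id: batch_id != "b", now=10000.0, window_seconds=3600.0)
print(outcome)
```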

Page 19: Data platform evolution

Examples Event Processing

Page 23: Data platform evolution

Data Collection

Diagram components: Data Collection Cluster (Servers), Video Tracking, Ad Tracking, User Tracking, 3rd Party Ad Tracking, SQL Server

CSV data received every hour via FTP. Raw Events and Dimensions.

Text files received every five minutes from Public and Private Cloud. Raw Events.

Logging Framework Elastic Search

Hadoop Processing Cluster

Data about received files/events is reported with the logging framework
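Reporting data about received files through a logging framework usually means emitting one structured record per file, which a search index such as Elasticsearch can then store and Kibana can chart. The sketch below only builds and serializes such a record; the field names and sample values are hypothetical, and no Elasticsearch client is used.

```python
import json
from datetime import datetime, timezone


def file_received_record(source: str, file_name: str, event_count: int, received_at: datetime) -> dict:
    """Build one structured log record describing a received file."""
    return {
        "@timestamp": received_at.astimezone(timezone.utc).isoformat(),
        "stage": "data_collection",
        "source": source,
        "file": file_name,
        "events": event_count,
    }


# Hypothetical file and counts, for illustration only.
record = file_received_record(
    "ad_tracking",
    "events_0001.txt",
    1200,
    datetime(2020, 1, 1, 0, 5, tzinfo=timezone.utc),
)
line = json.dumps(record, sort_keys=True)
print(line)
```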

Raw data files pushed to Hadoop (WebHDFS)

Dimension tables

Servers to acquire
Stage 1: A .NET application will pull from FTP and the SQL DWH server for loggers, and use SQL Replication for dimension data
Stage 2: Consider moving to a more appropriate technology such as Akka

Data flow direction

Logs will be exposed through Kibana to monitor data flow

Monitoring data

Monitoring

Monitoring of data flow inside and outside of Data Collection Cluster

MongoDB

Page 24: Data platform evolution

Data Distribution

Data Distribution Cluster

Hive

Vertica

MongoDB

Report Distributor

Logging Framework Elastic Search

Reporting Platform

Data flow direction

Logs will be exposed through Kibana to monitor data flow

Monitoring data

Monitoring

Monitoring of data flow inside and outside of Data Distribution Cluster

Report S3 Storage

Page 25: Data platform evolution

Examples Data Collection & Distribution

Page 29: Data platform evolution

Reporting Platform

Vertica

Hive

SQL Server

1. Distributed
2. Encapsulate Repository
3. Versioning
4. Smart query execution
5. Testable

MongoDB

Reporting Platform

Report Designer

Report Provider

Report Distributor

Reporting API

Statistics Provider

S3 Report Storage

Data sources of the Reporting Platform are in Private and Public Cloud

Application Servers

Page 30: Data platform evolution

Examples Applications

Page 34: Data platform evolution

Monitoring

Monitoring Cluster

Cloudera Manager

Elastic Search Cluster

Vertica Management

Kibana

Zabbix

Applications

Vertica

Hadoop

MongoDB

Page 35: Data platform evolution

Examples Monitoring & Alerting

Page 41: Data platform evolution

Migration Outcome

15% Cost Reduction

Linear Scale

90% Unit Test Coverage

x280 Processing Time

x50 Development ROI

Page 42: Data platform evolution

Current Scale

86B Events Processed Daily

120 TB Data Daily

1M Events Processed per Second

Near Real-Time Processing, Minimum Interval: 5 min

15+ Event Sources

4.5 PB Hadoop

70 TB Vertica
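A 5-minute minimum interval implies that events are grouped into fixed 5-minute windows before processing. The sketch below shows one common way to assign timestamps to such windows. The function names and sample timestamps are hypothetical, and the slides do not state how the platform actually forms its windows.

```python
from collections import Counter
from typing import Dict, Iterable

INTERVAL_SECONDS = 5 * 60  # the 5-minute minimum interval from the slide


def window_start(ts: int) -> int:
    """Return the start (in seconds) of the 5-minute window containing ts."""
    return ts - ts % INTERVAL_SECONDS


def count_per_window(timestamps: Iterable[int]) -> Dict[int, int]:
    """Count events per 5-minute window, keyed by window start."""
    return dict(Counter(window_start(t) for t in timestamps))


# Hypothetical event timestamps in seconds.
counts = count_per_window([0, 299, 300, 301, 899, 900])
print(counts)
```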

Page 43: Data platform evolution

Scale Growth

x15 Events Processed Daily

x6000 Daily Processed Data

x25 Events Processed per Second

x280 Processing Time

Page 45: Data platform evolution

Third Stage (architecture diagram)

Components: Data warehouse, Data Collection, Data Distribution, DWH API, External Data Providers, Event Collector, Scheduling, Reporting, Monitoring, Real Time DWH, Analytics

Transfer channels: S3, Azure, sFTP, FTP

Repeated box label: Servers

Page 46: Data platform evolution

THANK YOU