APACHE BIG DATA CONFERENCE
How to transform data into money using Big Data technologies
After almost a decade developing Big Data projects at Paradigma, its R&D department was an early adopter of Spark, which led to the creation of Stratio
THE FIRST SPARK-BASED BIG DATA PLATFORM RELEASED
INTRO
JORGE LOPEZ-MALLA
After working with traditional
processing methods, I started to
do some R&D Big Data projects
and fell in love with the Big Data
world. Currently I'm doing some
awesome Big Data projects at
Stratio
MY PROFILE
SKILLS
ALBERTO RODRÍGUEZ DE LEMA
After graduating I've been
programming for more than 10 years.
I've built high-performance,
scalable web applications for
companies such as Indra Systems,
Prudential and Springer Verlag Ltd.
MY PROFILE
@ardlema
SKILLS
GO TO SPACE
STRATIO
OPEN-SOURCE SOLUTIONS
Our enterprise solutions are based on open source technologies
PURE SPARK
The only pure Spark platform,
the only global solution
ENTERPRISE SPARK
On-premise & cloud, our platform is
geared towards helping companies
SPARK-BASED BIG DATA PLATFORM
The first Spark-based big data platform released
OUR CLIENT
MIDDLE EAST TELCO COMPANY
• 9,500 mil. daily events processed
• 9.2 mil. clients
USE CASES
1 MANAGEMENT & NORMALIZATION OF DATA SOURCES
2 NETWORK COVERAGE IMPROVEMENT
3 PEOPLE GATHERING
4 DATA MONETIZATION
TECHNICAL CHALLENGES
1 Huge volume of data
2 Huge size of data
3 Distributed processing
4 Hard to read
5 Pattern recognition
1 HUGE VOLUME OF DATA
SOLUTION
APACHE HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
1 HUGE VOLUME OF DATA
9,500 mil. CSV records per day -> circa 16 GB
Requirements:
• High availability
• Concurrent file reads
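The deck doesn't include code, but a minimal sketch of reading one day of raw CSV events from HDFS with Spark might look like this (paths and options are assumptions):

```scala
import org.apache.spark.sql.SparkSession

object DailyEventsReader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-events-reader")
      .getOrCreate()

    // Hypothetical HDFS location of one day of raw CSV events.
    // HDFS block replication provides the high availability listed
    // above, and many executors reading the same files in parallel
    // covers the concurrent-read requirement.
    val events = spark.read
      .option("header", "true")
      .csv("hdfs:///data/telco/events/2016-05-01/*.csv")

    println(s"Daily events: ${events.count()}")
    spark.stop()
  }
}
```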
2 HUGE SIZE OF DATA
SOLUTION
APACHE PARQUET
2 HUGE SIZE OF DATA
16.5 GB of daily event information stored as CSV text in HDFS
4.3 GB of daily event information stored as Parquet files in HDFS
STORAGE IMPROVEMENT: circa 70%
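A sketch of the kind of CSV-to-Parquet conversion behind these figures (the deck doesn't show the job itself; paths are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object CsvToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-to-parquet")
      .getOrCreate()

    val rawEvents = spark.read
      .option("header", "true")
      .csv("hdfs:///data/telco/events/2016-05-01/*.csv")

    // Parquet stores the data column-wise and compressed, which is
    // where the circa 70% storage saving reported above comes from.
    rawEvents.write
      .mode("overwrite")
      .parquet("hdfs:///data/telco/events_parquet/2016-05-01")

    spark.stop()
  }
}
```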
2 HUGE SIZE OF DATA
Time to count daily CSV events -> 6.2 minutes
Time to count daily Parquet events -> 1 minute
READ PERFORMANCE IMPROVEMENT: circa 80%
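The comparison could be reproduced in spark-shell roughly as follows (a sketch with hypothetical paths; `spark` is the session spark-shell provides):

```scala
// Time a count over each format; a Parquet count() can be served
// largely from row-group metadata, hence the speed-up reported above.
def time[A](label: String)(f: => A): A = {
  val t0 = System.nanoTime()
  val result = f
  println(f"$label took ${(System.nanoTime() - t0) / 1e9}%.1f s")
  result
}

time("CSV count") {
  spark.read.option("header", "true")
    .csv("hdfs:///data/telco/events/2016-05-01/*.csv").count()
}
time("Parquet count") {
  spark.read.parquet("hdfs:///data/telco/events_parquet/2016-05-01").count()
}
```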
3 DISTRIBUTED PROCESSING
SOLUTION
APACHE SPARK
3 DISTRIBUTED PROCESSING - REQUIREMENTS
Complex algorithms with the minimum amount of resources
Reducing processing time so that results arrive while the data is still useful
3 DISTRIBUTED PROCESSING - REQUIREMENTS
Sharing the cluster with legacy processes (as sketched below)
Consuming the outputs of legacy processes without changing them
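A sketch of how a Spark job might be configured to share a YARN cluster with legacy workloads; the queue name and resource caps are assumptions, not values from the talk:

```scala
import org.apache.spark.sql.SparkSession

// Cap what Spark asks for and submit to a dedicated YARN queue so
// legacy jobs keep their share of the cluster.
val spark = SparkSession.builder()
  .appName("shared-cluster-job")
  .config("spark.executor.memory", "4g")
  .config("spark.executor.cores", "2")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.maxExecutors", "8")
  .config("spark.shuffle.service.enabled", "true") // required by dynamic allocation
  .config("spark.yarn.queue", "batch")              // hypothetical queue name
  .getOrCreate()
```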
4 HARD TO READ
SOLUTION
SCALA + APACHE SPARK
4 HARD TO READ
• Reduced development time
• Lines of code dramatically reduced
• Number of classes dramatically reduced
• Improved test and application readability
• DSLs make our lives easier
• Spark makes MapReduce jobs even simpler (see the word-count sketch below)
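As an illustration of those last two points (not code from the talk), the classic Hadoop MapReduce word count, which needs a mapper class, a reducer class and a driver, collapses to a few lines in Spark's Scala DSL:

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("word-count").getOrCreate()

    // map + shuffle + reduce in three transformations
    // (the input path is hypothetical).
    val counts = spark.sparkContext
      .textFile("hdfs:///data/sample.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```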
5 PATTERN RECOGNITION
SOLUTION
APACHE SPARK MLLIB
Millions of records processed to build mathematical models
Complex mathematical algorithms applied to derive accurate weekly behavior patterns
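The deck doesn't show the model code; as a sketch of the kind of MLlib pipeline this describes, one could cluster per-client weekly usage features, where the column names, paths and the choice of KMeans are all assumptions:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object WeeklyBehaviorModel {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("weekly-behavior").getOrCreate()

    // Hypothetical per-client weekly aggregates.
    val weekly = spark.read.parquet("hdfs:///data/telco/weekly_aggregates")

    // Assemble numeric columns into the feature vector MLlib expects.
    val features = new VectorAssembler()
      .setInputCols(Array("calls", "data_mb", "sms"))
      .setOutputCol("features")
      .transform(weekly)

    // Group clients into behavioral clusters; k = 8 is an assumption.
    val model = new KMeans().setK(8).setSeed(42L).fit(features)
    model.transform(features).select("client_id", "prediction").show(5)

    spark.stop()
  }
}
```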
THANK YOU
UNITED STATES
Tel: (+1) 408 5998830
EUROPE
Tel: (+34) 91 828 64 73
www.stratio.com