APACHE BIG DATA CONFERENCE
How to transform data into money using Big Data technologies
After almost a decade developing Big Data projects at Paradigma, its R&D department was an early adopter of Spark, which led to the creation of Stratio
THE FIRST SPARK-BASED BIG DATA PLATFORM RELEASED
INTRO
JORGE LOPEZ-MALLA
After working with traditional
processing methods, I started to
do some R&D Big Data projects
and fell in love with the Big Data
world. Currently I'm doing some
awesome Big Data projects at
Stratio
MY PROFILE
SKILLS
ALBERTO RODRÍGUEZ DE LEMA
After graduating I've been
programming for more than 10 years.
I've built high-performance,
scalable web applications for
companies such as Indra Systems,
Prudential and Springer Verlag Ltd.
MY PROFILE
@ardlema
SKILLS
GO TO SPACE
STRATIO
OPEN-SOURCE SOLUTIONS
Our enterprise solutions are based on open source technologies
PURE SPARK
The only pure Spark platform,
the only global solution
ENTERPRISE SPARK
On-premise & cloud, our platform is
geared towards helping companies
SPARK-BASED BIG DATA PLATFORM
The first Spark-based big data platform released
OUR CLIENT
MIDDLE EAST TELCO COMPANY
• 9,500 mil. daily events processed
• 9.2 mil. clients
USE CASES
1 MANAGEMENT & NORMALIZATION OF DATA SOURCES
2 NETWORK COVERAGE IMPROVEMENT
3 PEOPLE GATHERING
4 DATA MONETIZATION
TECHNICAL CHALLENGES
1 Huge volume of data
2 Huge size of data
3 Distributed processing
4 Hard to read
5 Pattern recognition
1 HUGE VOLUME OF DATA
SOLUTION
APACHE HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
1 HUGE VOLUME OF DATA
9,500 mil. CSV records per day -> circa 16 GB
Requirements:
• High availability
• Concurrent file reads
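The deck doesn't include code, but a minimal sketch of reading one day of raw CSV events from HDFS with Spark might look like this (paths and options are assumptions):

```scala
import org.apache.spark.sql.SparkSession

object DailyEventsReader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-events-reader")
      .getOrCreate()

    // Hypothetical HDFS location of one day of raw CSV events.
    // HDFS block replication provides the high availability listed
    // above, and many executors reading the same files in parallel
    // covers the concurrent-read requirement.
    val events = spark.read
      .option("header", "true")
      .csv("hdfs:///data/telco/events/2016-05-01/*.csv")

    println(s"Daily events: ${events.count()}")
    spark.stop()
  }
}
```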
2 HUGE SIZE OF DATA
SOLUTION
APACHE PARQUET
2 HUGE SIZE OF DATA
16.5 GB of daily event information stored as CSV text in HDFS
4.3 GB of daily event information stored as Parquet files in HDFS
STORAGE IMPROVEMENT: circa 70%
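A sketch of the kind of CSV-to-Parquet conversion behind these figures (the deck doesn't show the job itself; paths are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object CsvToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-to-parquet")
      .getOrCreate()

    val rawEvents = spark.read
      .option("header", "true")
      .csv("hdfs:///data/telco/events/2016-05-01/*.csv")

    // Parquet stores the data column-wise and compressed, which is
    // where the circa 70% storage saving reported above comes from.
    rawEvents.write
      .mode("overwrite")
      .parquet("hdfs:///data/telco/events_parquet/2016-05-01")

    spark.stop()
  }
}
```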
2 HUGE SIZE OF DATA
Time to count daily CSV events -> 6.2 minutes
Time to count daily Parquet events -> 1 minute
READ PERFORMANCE IMPROVEMENT: circa 80%
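The comparison could be reproduced in spark-shell roughly as follows (a sketch with hypothetical paths; `spark` is the session spark-shell provides):

```scala
// Time a count over each format; a Parquet count() can be served
// largely from row-group metadata, hence the speed-up reported above.
def time[A](label: String)(f: => A): A = {
  val t0 = System.nanoTime()
  val result = f
  println(f"$label took ${(System.nanoTime() - t0) / 1e9}%.1f s")
  result
}

time("CSV count") {
  spark.read.option("header", "true")
    .csv("hdfs:///data/telco/events/2016-05-01/*.csv").count()
}
time("Parquet count") {
  spark.read.parquet("hdfs:///data/telco/events_parquet/2016-05-01").count()
}
```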
3 DISTRIBUTED PROCESSING
SOLUTION
APACHE SPARK
3 DISTRIBUTED PROCESSING - REQUIREMENTS
Complex algorithms with the minimum amount of resources
Reducing processing time so that results arrive while the data is still useful
3 DISTRIBUTED PROCESSING - REQUIREMENTS
Sharing the cluster with legacy processes (as sketched below)
Consuming the outputs of legacy processes without changing them
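A sketch of how a Spark job might be configured to share a YARN cluster with legacy workloads; the queue name and resource caps are assumptions, not values from the talk:

```scala
import org.apache.spark.sql.SparkSession

// Cap what Spark asks for and submit to a dedicated YARN queue so
// legacy jobs keep their share of the cluster.
val spark = SparkSession.builder()
  .appName("shared-cluster-job")
  .config("spark.executor.memory", "4g")
  .config("spark.executor.cores", "2")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.maxExecutors", "8")
  .config("spark.shuffle.service.enabled", "true") // required by dynamic allocation
  .config("spark.yarn.queue", "batch")              // hypothetical queue name
  .getOrCreate()
```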
4 HARD TO READ
SOLUTION
SCALA + APACHE SPARK
4 HARD TO READ
• Reduced development time
• Lines of code dramatically reduced
• Number of classes dramatically reduced
• Improved test and application readability
• DSLs make our lives easier
• Spark makes MapReduce jobs even simpler (see the word-count sketch below)
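As an illustration of those last two points (not code from the talk), the classic Hadoop MapReduce word count, which needs a mapper class, a reducer class and a driver, collapses to a few lines in Spark's Scala DSL:

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("word-count").getOrCreate()

    // map + shuffle + reduce in three transformations
    // (the input path is hypothetical).
    val counts = spark.sparkContext
      .textFile("hdfs:///data/sample.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```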
5 PATTERN RECOGNITION
SOLUTION
APACHE SPARK MLLIB
Millions of records processed to build mathematical models
Complex mathematical algorithms applied to derive accurate weekly behavior patterns
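The deck doesn't show the model code; as a sketch of the kind of MLlib pipeline this describes, one could cluster per-client weekly usage features, where the column names, paths and the choice of KMeans are all assumptions:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object WeeklyBehaviorModel {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("weekly-behavior").getOrCreate()

    // Hypothetical per-client weekly aggregates.
    val weekly = spark.read.parquet("hdfs:///data/telco/weekly_aggregates")

    // Assemble numeric columns into the feature vector MLlib expects.
    val features = new VectorAssembler()
      .setInputCols(Array("calls", "data_mb", "sms"))
      .setOutputCol("features")
      .transform(weekly)

    // Group clients into behavioral clusters; k = 8 is an assumption.
    val model = new KMeans().setK(8).setSeed(42L).fit(features)
    model.transform(features).select("client_id", "prediction").show(5)

    spark.stop()
  }
}
```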
THANK YOU
UNITED STATES
Tel: (+1) 408 5998830
EUROPE
Tel: (+34) 91 828 64 73
www.stratio.com