Handling not so big data

YAPC::Asia 2014 Day 2, 2014/08/30, @tagomoris

TAGOMORI Satoshi (@tagomoris), LINE Corporation

Analytics Platform Team

Data Analytics overview

collect

parse / clean up

process

visualize

process / store

Consider data size

Stored size?

Total?

Per day?

Throughput?

Daily average?

Peak time?

Structured?

Compressed?

DO NOT consider exact data size.

It will increase/decrease dramatically!

Consider rough data size

Data size per query

Sub GigaBytes

From GigaBytes to TeraBytes

PetaBytes or More

Sub GB

Use RDBMS!

PB or More

Use Hadoooooooooop! and Storm!
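The three size tiers above can be written down as a tiny lookup. This is only a sketch of the rule of thumb on the slides: the function name `suggest_engine_class` is hypothetical, and the exact cut-offs (1 GB and 1 PB, decimal units) are assumptions, since the talk explicitly asks for rough sizes rather than exact ones.

```python
GB = 10**9
PB = 10**15

def suggest_engine_class(bytes_per_query: int) -> str:
    """Map a rough per-query data size to one of the three tiers."""
    if bytes_per_query < GB:
        # Sub GB: a single RDBMS is enough.
        return "RDBMS"
    if bytes_per_query < PB:
        # GB to TB: the "not so big" range this talk is about.
        return "GB to TB: choose by what you want to do"
    # PB or more: large-scale distributed batch and stream processing.
    return "Hadoop and Storm"

if __name__ == "__main__":
    for size in (200 * 10**6, 50 * GB, 3 * PB):
        print(size, "->", suggest_engine_class(size))
```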

From GB to TB: How large?

Main target for many service providers

Too large

For “a” disk, for “a” memory space

Appropriate for many disks

Not so large for many memory spaces

From GB to TB: For what?

Not so small: We cannot do everything

Data analytics methods

search, aggregate, recommend, detect anomalies

Consider “what you want to do” first

It determines what you need to consider

Types of data processing

Data size and I/O throughput intensive:

search, aggregation

CPU power and memory size intensive:

machine learning, graph processing

Select appropriate processing framework/middleware

In memory only? With spilling to disk?

Architecture of distributed processing systems

distributed file system

resource management

job management

framework

domain specific language

query processing subsystem

monolithic query engine middleware



Short break: Languages and DSLs

framework: Java, Scala, ...

SQL: Hive, Impala, Drill, Presto, ...

Others: Pig, Cascading, ...

What this talk is about: resource management, job management, framework

A tour of distributed processing frameworks and query engines

MapReduce

Hadoop MapReduce:

Map + Combine + Shuffle + Reduce

Intermediate output is written on disk

Shuffle always requires sync

[Diagram: four Map tasks, each followed by a Combine, feed three Reduce tasks through the shuffle]
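The Map, Combine, Shuffle and Reduce phases can be sketched as a word count in plain Python. This is an in-memory illustration of the data flow only, not Hadoop's API: all function names are hypothetical, and where Hadoop writes intermediate output to disk, this sketch simply keeps lists in memory.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit (word, 1) for every word in one input record.
    return [(word, 1) for word in line.split()]

def combine_phase(pairs):
    # Combine: pre-aggregate a single mapper's output before the shuffle.
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())

def shuffle_phase(combined_outputs, num_reducers):
    # Shuffle: route each key to exactly one reducer. It needs the output
    # of every mapper, which is why it acts as a synchronization point.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in chain.from_iterable(combined_outputs):
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions

def reduce_phase(partition):
    # Reduce: fold all values collected for each key.
    return {key: sum(values) for key, values in partition.items()}

def word_count(splits, num_reducers=3):
    # One mapper (plus combiner) per input split.
    combined = [
        combine_phase(chain.from_iterable(map_phase(line) for line in split))
        for split in splits
    ]
    result = {}
    for partition in shuffle_phase(combined, num_reducers):
        result.update(reduce_phase(partition))
    return result

if __name__ == "__main__":
    print(word_count([["a b a"], ["b c"]]))
```

Because each key lands in exactly one partition, merging the reducer outputs with `update` never overwrites a count.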


MRv1 vs MRv2 on Hadoop

MRv1:

Resource Management

Job Management

Framework

MRv2:

Job Management

Framework

Forget this.

Apache Spark

“Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.”

Directed Acyclic Graph (DAG)

Resilient Distributed Datasets (RDDs)

Pros:

Batch, Machine learning, Graph


Apache Tez

“The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data.”

Directed Acyclic Graph (DAG)

Pros:

Big MR, Multiple Aggregations


MR, Spark, Tez


DAG (directed acyclic graph)

from: http://tez.apache.org/
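To make the DAG idea concrete, here is a minimal sketch that groups a job's tasks into stages that could run in parallel. It uses Python's standard `graphlib` module (Python 3.9 or later), not the Tez or Spark APIs; the function name `execution_stages` and the task names in `plan` are made up for illustration.

```python
from graphlib import TopologicalSorter

def execution_stages(dependencies):
    """Group tasks into stages; every task in a stage has all inputs ready."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks with no unmet dependency
        stages.append(ready)
        sorter.done(*ready)  # mark them finished to unlock their successors
    return stages

# Hypothetical job: each task maps to the set of tasks it depends on.
plan = {
    "scan_logs": set(),
    "scan_users": set(),
    "join": {"scan_logs", "scan_users"},
    "aggregate_by_day": {"join"},
    "aggregate_by_user": {"join"},
}

if __name__ == "__main__":
    for number, stage in enumerate(execution_stages(plan), start=1):
        print(number, stage)
```

In this plan the two aggregations share a single join, the kind of job that plain MapReduce would have to express as a chain of separate jobs.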

Variations of engines

Make jobs faster than MapReduce

Especially for memory-intensive, complex jobs

Hive can switch its backend from MR to Spark/Tez

MR’s stability is VERY important

Alternatives are under development

MPP Engines

Apache Drill, Cloudera Impala, Facebook Presto

Massively Parallel Processing: MPP

DSL(SQL) + job management

Data sources are external datastores

Very low latency: using many threads

Lower availability, and less tolerant of large memory requirements


Stream processing

Without any storage

Process data for specified windows

every X events, per Y minutes, for unique values, ...

In-memory processing

Ultra low latency

There are too many things to consider...
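Two of the window types named above, "every X events" and "per Y minutes", can be sketched with plain generators. This is a single-process illustration, not Storm or Norikra code: the function names are hypothetical, timestamps are assumed to be in seconds and in order, and both windows are tumbling (non-overlapping).

```python
from collections import Counter

def count_windows(events, size):
    """Emit one summary every `size` events; a trailing partial window is dropped."""
    window = []
    for event in events:
        window.append(event)
        if len(window) == size:
            yield Counter(window)
            window = []  # nothing is kept once the window is emitted

def time_windows(timed_events, seconds):
    """Emit (window_start, counts) per `seconds`; input is (timestamp, value)."""
    current_bucket, counts = None, Counter()
    for timestamp, value in timed_events:
        bucket = int(timestamp // seconds)
        if current_bucket is not None and bucket != current_bucket:
            yield current_bucket * seconds, counts
            counts = Counter()
        current_bucket = bucket
        counts[value] += 1
    if counts:
        # Flush the last window when a finite stream ends.
        yield current_bucket * seconds, counts

if __name__ == "__main__":
    print(list(count_windows(["a", "b", "a", "c"], size=2)))
    print(list(time_windows([(0, "x"), (30, "x"), (61, "y")], seconds=60)))
```

Only the current window lives in memory, which is what keeps latency low. It also means the state is lost if the process dies, one of the many things left to consider.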

Stream processing: more

Twitter Storm

Distributed stream processing platform

Processing w/ Java or JVM languages

For super high throughput data (not for minimal data)

Norikra

Non-distributed stream processing platform

Processing w/ SQL, but not distributed...

For low-middle-high throughput data

What I won’t cover today...

What is Hadoop?

Original Hadoop:

HDFS

MapReduce v1

Hadoop v2:

HDFS

ResourceManager + MapReduce v2

Engines added alongside, one per slide:

Apache Spark

Apache Tez

Twitter Storm (Apache incubator)

Apache Drill

Hive, Pig, ...

What Hadoop is ....

ALL YOUR ENGINES ARE BELONG TO US.

What is Hadoop?

The whole BigData platform is called “Hadoop”

like “Linux”: not only the kernel, but also the distribution

CORE:

distributed file systems

data flow

“BigData as a Service” by @naoya_ito

AWS EMR/Redshift, Google BigQuery, Treasure Data, ...

They have their own architecture

and their storages and data flow

Data flow is always the most important

Perl?

BigData world is dominated by JVM

many contributors from many companies

We should not write our own distributed processing software

Stand on the shoulders of giants!

Connect the Perl world with JVM systems

by CPAN modules

1. What do you want?

2. How large is your data?

3. Choose an architecture!

SHARE software, know-how & concerns!

Thank you!
