Handling not so big data. YAPC::Asia 2014 Day 2 2014/08/30 @tagomoris


Talk at YAPC::Asia Tokyo 2014


Page 1: Handling not so big data

Handling not so big data.

YAPC::Asia 2014 Day 2

2014/08/30

@tagomoris

Page 2: Handling not so big data

TAGOMORI Satoshi (@tagomoris)

LINE Corporation

Analytics Platform Team

Page 3: Handling not so big data
Page 4: Handling not so big data
Page 5: Handling not so big data

Data Analytics overview

collect

parse / clean up

process

visualize

process / store

Page 6: Handling not so big data


Page 7: Handling not so big data

Consider data size

Stored size?

Total?

Per day?

Throughput?

Daily average?

Peak time?

Structured?

Compressed?
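The relation between "per day" size and "daily average" throughput is simple arithmetic. The sketch below works it through in Python; the 1 TB/day volume and the peak factor of 3 are hypothetical numbers chosen only for illustration, not figures from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_throughput_mb_per_sec(bytes_per_day: float) -> float:
    """Convert a daily data volume into an average rate in MB/s (decimal MB)."""
    return bytes_per_day / SECONDS_PER_DAY / 1e6


# Hypothetical input: 1 TB of logs per day.
daily_bytes = 1e12
average = average_throughput_mb_per_sec(daily_bytes)

# Hypothetical assumption: peak time carries 3x the daily average.
peak = average * 3

print(f"average: {average:.1f} MB/s, peak: {peak:.1f} MB/s")
```

Only the order of magnitude matters here; the same questions (compressed or not, structured or not) change the result by a constant factor.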

Page 8: Handling not so big data

DO NOT consider exact data size.

It will increase/decrease dramatically!

Page 9: Handling not so big data

Consider rough data size

Data size per query

Sub GigaBytes

From GigaBytes to TeraBytes

PetaBytes or More

Page 10: Handling not so big data

Sub GB

Use RDBMS!

Page 11: Handling not so big data

PB or More

Use Hadoooooooooop! and Storm!
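The rough size classes can be written down as a small lookup. This is a minimal sketch in Python: the function name `suggest_platform` and the exact cut-offs (decimal 1 GB and 1 PB) are assumptions, since the slides only name the classes and deliberately avoid exact sizes.

```python
GB = 10**9   # assumed cut-off for "sub gigabytes"
PB = 10**15  # assumed cut-off for "petabytes or more"


def suggest_platform(bytes_per_query: int) -> str:
    """Map a rough data size per query to the class of tool named on the slides."""
    if bytes_per_query < GB:
        return "RDBMS"
    if bytes_per_query >= PB:
        return "Hadoop and Storm"
    # Everything in between is the "not so big data" range of this talk.
    return "GB to TB: decide by what you want to do"


for size in (200 * 10**6, 50 * GB, 3 * PB):
    print(size, "->", suggest_platform(size))
```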

Page 12: Handling not so big data

From GB to TB: How large?

Main target for many service providers

Too large

For “a” disk, for “a” memory space

Appropriate for many disks

Not so large for many memory spaces

Page 13: Handling not so big data

From GB to TB: For what?

Not so small: We cannot do everything

Data analytics methods

search, aggregate, recommend, detect anomalies

Consider “what you want to do” first

It determines what you need to consider

Page 14: Handling not so big data

Types of data processing

Data size and I/O throughput intensive:

search, aggregation

CPU power and memory size intensive:

machine learning, graph processing

Select appropriate processing framework/middleware

In memory only? With spilling?
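To make the "with spilling" option concrete, here is a minimal Python sketch of a count aggregation that keeps a bounded number of keys in memory and spills sorted partial results to temporary files, then merges them. It illustrates the general technique only; it is not how any particular framework implements it, and the names `count_with_spilling` and `max_keys_in_memory` are made up for this example.

```python
import heapq
import itertools
import json
import os
import tempfile
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def _write_run(counts: Counter, directory: str) -> str:
    """Spill one batch of partial counts to disk, sorted by key."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".jsonl")
    with os.fdopen(fd, "w") as f:
        for key in sorted(counts):
            f.write(json.dumps([key, counts[key]]) + "\n")
    return path


def _read_run(path: str) -> Iterator[Tuple[str, int]]:
    """Stream one spilled run back, one (key, partial count) at a time."""
    with open(path) as f:
        for line in f:
            key, n = json.loads(line)
            yield key, n


def count_with_spilling(
    keys: Iterable[str], max_keys_in_memory: int
) -> Iterator[Tuple[str, int]]:
    """Yield (key, total) in key order, holding at most max_keys_in_memory keys."""
    counts: Counter = Counter()
    with tempfile.TemporaryDirectory() as directory:
        runs: List[str] = []
        for key in keys:
            counts[key] += 1
            if len(counts) >= max_keys_in_memory:
                runs.append(_write_run(counts, directory))
                counts = Counter()
        if counts:
            runs.append(_write_run(counts, directory))
        # Every run is sorted, so a streaming merge groups equal keys together.
        merged = heapq.merge(*(_read_run(path) for path in runs))
        for key, group in itertools.groupby(merged, key=lambda pair: pair[0]):
            yield key, sum(n for _, n in group)


print(dict(count_with_spilling(["a", "b", "a", "c", "b", "a"], max_keys_in_memory=2)))
```

An in-memory-only engine skips the disk round trip and is faster, but fails once the working set no longer fits; spilling trades speed for the ability to finish.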

Page 15: Handling not so big data

Architecture of distributed processing systems

distributed file system

resource management

job management

framework

domain specific language

query processing subsystem

monolithic query engine middleware

Page 16: Handling not so big data


Page 17: Handling not so big data

Short break: Languages and DSLs

framework: Java, Scala, ...

domain specific language:

SQL: Hive, Impala, Drill, Presto, ...

Others: Pig, Cascading, ...

Page 18: Handling not so big data

Architecture of distributed processing systems

[Layer diagram repeated, with an annotation marking what this talk is about]

Page 19: Handling not so big data

A tour of distributed processing frameworks and query engines

Page 20: Handling not so big data

MapReduce

Hadoop MapReduce:

Map + Combine + Shuffle + Reduce

Intermediate output is written to disk

Shuffle always requires sync

[Diagram: four Map tasks, each followed by a Combine, connected to three Reduce tasks through the shuffle]
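The Map + Combine + Shuffle + Reduce pipeline can be sketched in a single process. The Python below is a toy word count that only mirrors the phase structure: it is not Hadoop's API, it keeps the intermediate output in memory instead of writing it to disk, and the function names and the CRC32-based partitioning are choices made for this example.

```python
import zlib
from collections import Counter, defaultdict
from typing import DefaultDict, Dict, List, Tuple

Pair = Tuple[str, int]


def map_phase(split: str) -> List[Pair]:
    """Map: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in split.split()]


def combine(pairs: List[Pair]) -> List[Pair]:
    """Combine: pre-aggregate on the mapper side to shrink the shuffle."""
    local: Counter = Counter()
    for word, n in pairs:
        local[word] += n
    return sorted(local.items())


def shuffle(map_outputs: List[List[Pair]], num_reducers: int) -> List[DefaultDict[str, List[int]]]:
    """Shuffle: route every key to exactly one reducer.

    This step needs the output of all mappers, which is the sync point.
    """
    partitions: List[DefaultDict[str, List[int]]] = [
        defaultdict(list) for _ in range(num_reducers)
    ]
    for output in map_outputs:
        for word, n in output:
            index = zlib.crc32(word.encode("utf-8")) % num_reducers
            partitions[index][word].append(n)
    return partitions


def reduce_phase(partition: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: sum the partial counts of each key."""
    return {word: sum(values) for word, values in partition.items()}


def word_count(splits: List[str], num_reducers: int = 3) -> Dict[str, int]:
    map_outputs = [combine(map_phase(split)) for split in splits]
    result: Dict[str, int] = {}
    for partition in shuffle(map_outputs, num_reducers):
        result.update(reduce_phase(partition))
    return result


print(word_count(["a b a", "b c", "a"]))
```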

Page 21: Handling not so big data

MRv1 vs MRv2 on Hadoop

MRv1:

Resource Management

Job Management

Framework

MRv2:

Job Management

Framework

Page 22: Handling not so big data

MRv1 vs MRv2 on Hadoop

MRv1:

Resource Management

Job Management

Framework

MRv2:

Job Management

Framework

forget this

Page 23: Handling not so big data

Apache Spark

“Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.”

Directed Acyclic Graph (DAG)

Resilient Distributed Datasets (RDDs)

Pros:

Batch, Machine learning, Graph


Page 24: Handling not so big data

Apache Tez

“The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data.”

Directed Acyclic Graph (DAG)

Pros:

Big MR, Multiple Aggregations


Page 25: Handling not so big data

MR, Spark, Tez


Page 26: Handling not so big data

from: http://tez.apache.org/

DAG (directed acyclic graph)
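A DAG of tasks can be scheduled in "waves": every task whose inputs are finished may run in parallel. The Python sketch below uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) to show the idea. The task names are hypothetical and this is not the Tez or Spark API.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical job: each task maps to the set of tasks it depends on.
dag: Dict[str, Set[str]] = {
    "scan_logs": set(),
    "scan_users": set(),
    "join": {"scan_logs", "scan_users"},
    "aggregate_by_day": {"join"},
    "aggregate_by_user": {"join"},
    "report": {"aggregate_by_day", "aggregate_by_user"},
}


def execution_waves(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into waves; tasks inside one wave have no mutual dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(execution_waves(dag), start=1):
    print(number, wave)
```

Here one join feeds two aggregations without rereading the input, the kind of multi-aggregation job that would take several chained MapReduce jobs.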

Page 27: Handling not so big data

Variations of engines

Make jobs faster than MapReduce

Especially for memory-intensive, complex jobs

Hive can switch its backend from MR to Spark/Tez

MR’s stability is VERY important

Alternatives are under development

Page 28: Handling not so big data

MPP Engines

Apache Drill, Cloudera Impala, Facebook Presto

Massively Parallel Processing: MPP

DSL(SQL) + job management

Data sources are external datastores

Very low latency: using many threads

Lower availability, and less tolerance for large memory requirements


Page 29: Handling not so big data

Stream processing

Without any storage

Process data for specified windows

every X events, per Y minutes, for unique values, ...

In-memory processing

Ultra low latency

There are too many things to consider...
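Two of the window types listed above, "every X events" and "per Y minutes", can be sketched as Python generators. This is a simplified illustration that assumes events arrive in timestamp order. It is not the API of Storm or Norikra, and the function names are invented for this example.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def count_windows(events: Iterable[T], size: int) -> Iterator[List[T]]:
    """Emit one window per `size` events; a trailing partial window is dropped."""
    window: List[T] = []
    for event in events:
        window.append(event)
        if len(window) == size:
            yield window
            window = []


def time_windows(
    events: Iterable[Tuple[int, str]], seconds: int
) -> Iterator[Tuple[int, Counter]]:
    """Count values per fixed time window; events are (timestamp, value) pairs.

    Assumes timestamps never go backwards. Only the current window is in memory.
    """
    current_start: Optional[int] = None
    counts: Counter = Counter()
    for timestamp, value in events:
        start = timestamp - (timestamp % seconds)
        if current_start is None:
            current_start = start
        if start != current_start:
            yield current_start, counts
            current_start, counts = start, Counter()
        counts[value] += 1
    if current_start is not None:
        yield current_start, counts


stream = [(0, "a"), (30, "b"), (61, "a"), (119, "a"), (120, "c")]
for start, counts in time_windows(stream, seconds=60):
    print(start, dict(counts))
```

Late or out-of-order events, state recovery after a crash, and windows that outgrow memory are among the things this sketch leaves out.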

Page 30: Handling not so big data

Stream processing: more

Twitter Storm

Distributed stream processing platform

Processing w/ Java or JVM languages

For super high throughput data (not for minimal data)

Norikra

Non-distributed stream processing platform

Processing w/ SQL, but not distributed...

For low-middle-high throughput data

Page 31: Handling not so big data
Page 32: Handling not so big data

What I won't talk about today...

Page 33: Handling not so big data

What is Hadoop?

Original Hadoop

HDFS

MapReduce v1

Hadoop v2

HDFS

ResourceManager + MapReduce v2

Page 34: Handling not so big data

What is Hadoop?

Hadoop v2

HDFS

ResourceManager + MapReduce v2

Spark

Page 35: Handling not so big data

What is Hadoop?

Hadoop v2

HDFS

ResourceManager + MapReduce v2

Apache Spark

Apache Tez

Page 36: Handling not so big data

What is Hadoop?

Hadoop v2

HDFS

ResourceManager + MapReduce v2

Apache Spark

Apache Tez

Twitter Storm (Apache incubator)

Page 37: Handling not so big data

What is Hadoop?

Hadoop v2

HDFS

ResourceManager + MapReduce v2

Apache Spark

Apache Tez

Twitter Storm (Apache incubator)

Apache Drill

Page 38: Handling not so big data

What is Hadoop?

Hadoop v2

HDFS

ResourceManager + MapReduce v2

Apache Spark

Apache Tez

Twitter Storm (Apache incubator)

Apache Drill

Hive, Pig, ...

Page 39: Handling not so big data

What is Hadoop?

Hadoop v2

HDFS

ResourceManager + MapReduce v2

Apache Spark

Apache Tez

Twitter Storm (Apache incubator)

Apache Drill

What Hadoop is ....

Page 40: Handling not so big data

ALL YOUR ENGINES ARE BELONG TO US.

Page 41: Handling not so big data

What is Hadoop?

The whole big data platform is called “Hadoop”

like “Linux”: not only the kernel, but also the distribution

CORE:

distributed file systems

data flow

Page 42: Handling not so big data

“BigData as a Service” by @naoya_ito

AWS EMR/RedShift, Google BigQuery, Treasure Data, ...

They have their own architecture

and their own storage and data flow

Data flow is always the most important

Page 43: Handling not so big data

Perl ?

The big data world is dominated by the JVM

many contributors from many companies

We should not build our own distributed processing software

Stand on the shoulders of giants!

Connect the Perl world with JVM systems

by CPAN modules

Page 44: Handling not so big data

1. What do you want?

2. How large is your data?

3. Choose an architecture!

Page 45: Handling not so big data

SHARE software, know-how & concerns!

Thank you!