Handling not so big data
YAPC::Asia 2014 Day 2, 2014/08/30
@tagomoris
TAGOMORI Satoshi (@tagomoris)
LINE Corporation
Analytics Platform Team
Data Analytics overview
collect
parse, clean up
process
visualize
process, store
Consider data size
Stored size?
Total?
Per day?
Throughput?
Daily average?
Peak time?
Structured?
Compressed?
DO NOT consider exact data size.
It will increase/decrease dramatically!
Consider rough data size
Data size per query
Sub GigaBytes
From GigaBytes to TeraBytes
PetaBytes or More
Sub GB
Use RDBMS!
PB or More
Use Hadoooooooooop! and Storm!
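The three tiers above can be sketched as a small lookup. This is only an illustration: the function name `suggest_platform`, the decimal byte thresholds, and the choice to treat everything between 1 GB and 1 PB as the middle tier are assumptions, since the slides name the tiers without giving exact boundaries.

```python
GB = 10**9   # assumed decimal units; the slides only say "GigaBytes"
PB = 10**15


def suggest_platform(bytes_per_query: int) -> str:
    """Map a rough per-query data size to the tier named in the slides."""
    if bytes_per_query < 0:
        raise ValueError("size must be non-negative")
    if bytes_per_query < GB:
        return "sub GB: use an RDBMS"
    if bytes_per_query < PB:
        # Assumption: anything below 1 PB counts as the middle tier.
        return "GB to TB: not so big data"
    return "PB or more: use Hadoop and Storm"


if __name__ == "__main__":
    for size in (200 * 10**6, 50 * GB, 3 * PB):
        print(size, "->", suggest_platform(size))
```

Only the order of magnitude matters here, which matches the advice to consider rough rather than exact data size.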
From GB to TB: How large?
Main target for many service providers
Too large
For “a” disk, for “a” memory space
Appropriate for many disks
Not so large for many memory spaces
From GB to TB: For what?
Not so small: We cannot do everything
Data analytics methods
search, aggregate, recommend, anomaly detect
Consider “what you want to do” first
That determines what you need to consider
Types of data processing
Data size and I/O throughput intensive:
search, aggregation
CPU power and memory size intensive:
machine learning, graph processing
Select appropriate processing framework/middleware
In memory only? With spilling to disk?
Architecture of distributed processing systems
distributed file system
resource management
job management
framework
domain specific language
query processing subsystem
monolithic query engine middleware
Short break: Languages and DSLs
framework: Java, Scala, ...
domain specific language:
SQL: Hive, Impala, Drill, Presto, ...
Others: Pig, Cascading, ...
What this talk is about:
A tour of distributed processing frameworks and query engines
MapReduce
Hadoop MapReduce:
Map + Combine + Shuffle + Reduce
Intermediate output is written to disk
Shuffle always requires sync
[Figure: four Map tasks, each followed by a Combine, then a shuffle into three Reduce tasks]
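The Map + Combine + Shuffle + Reduce flow can be sketched in a single process with a word count. This is a toy model under stated assumptions, not Hadoop's API: the function names and the partitioner are made up for illustration, and intermediate output stays in memory here, whereas Hadoop MapReduce writes it to disk.

```python
from collections import Counter, defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def combine_phase(pairs):
    """Combine: pre-aggregate pairs locally, on the mapper side."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())


def shuffle_phase(combined_outputs, num_reducers):
    """Shuffle: route every key to exactly one reducer partition.

    It needs the output of all mappers, which mirrors the
    synchronization point mentioned in the slide.
    """
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in combined_outputs:
        for word, n in pairs:
            # Deterministic toy partitioner (an assumption, not Hadoop's).
            idx = sum(word.encode("utf-8")) % num_reducers
            partitions[idx][word].append(n)
    return partitions


def reduce_phase(partition):
    """Reduce: sum the partial counts for each key in one partition."""
    return {word: sum(values) for word, values in partition.items()}


def word_count(splits, num_reducers=3):
    """Run the four phases over `splits`, one list of lines per mapper."""
    combined = [
        combine_phase([pair for line in split for pair in map_phase(line)])
        for split in splits
    ]
    result = {}
    for partition in shuffle_phase(combined, num_reducers):
        result.update(reduce_phase(partition))
    return result


if __name__ == "__main__":
    splits = [["a b a"], ["b c"], ["a c c"], ["d"]]  # four mappers
    print(sorted(word_count(splits).items()))
```

The combiner is what keeps the shuffle small: each mapper sends one pair per distinct word instead of one pair per occurrence.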
MRv1 vs MRv2 on Hadoop
MRv1:
Resource Management
Job Management
Framework
MRv2:
Job Management
Framework
Forget this.
Apache Spark
“Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.”
Directed Acyclic Graph (DAG)
Resilient Distributed Datasets (RDDs)
Pros:
Batch, Machine learning, Graph
Apache Tez
“The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data.”
Directed Acyclic Graph (DAG)
Pros:
Big MR, Multiple Aggregations
MR, Spark, Tez
from: http://tez.apache.org/
DAG (directed acyclic graph)
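To make the DAG idea concrete, here is a minimal single-process sketch that runs tasks in dependency order with Python's standard `graphlib` module. It does not show how Spark or Tez schedule work internally; `run_dag` and the task names are hypothetical. The example performs two aggregations over one loaded input, the kind of job that would otherwise take several chained MapReduce jobs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, dependencies):
    """Run tasks in dependency order and return {task name: result}.

    tasks:        {name: function taking a dict of upstream results}
    dependencies: {name: set of task names that must finish first}
    """
    results = {}
    sorter = TopologicalSorter(dependencies)
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    for name in sorter.static_order():
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
    return results


if __name__ == "__main__":
    tasks = {
        "load": lambda up: [3, 1, 2],
        "total": lambda up: sum(up["load"]),
        "maximum": lambda up: max(up["load"]),
        "report": lambda up: (up["total"], up["maximum"]),
    }
    dependencies = {
        "load": set(),
        "total": {"load"},
        "maximum": {"load"},
        "report": {"total", "maximum"},
    }
    print(run_dag(tasks, dependencies)["report"])
```

The input is loaded once and both aggregations read it from memory, which is the main advantage a DAG engine has over a chain of separate jobs.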
Variations of engines
Make jobs faster than MapReduce
Especially for memory-intensive, complex jobs
Hive can switch its backend from MR to Spark/Tez
MR’s stability is VERY important
Alternatives are under development
MPP Engines
Apache Drill, Cloudera Impala, Facebook Presto
Massively Parallel Processing: MPP
DSL(SQL) + job management
Data sources are external datastores
Very low latency: using many threads
Low availability, and little tolerance for queries that need a lot of memory
Stream processing
Without any storage
Process data for specified windows
every X events, per Y minutes, for unique values, ....
In-memory processing
Ultra low latency
There are too many things to consider...
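Two of the window types listed above ("every X events" and "per Y minutes") can be sketched with plain Python generators. This is an in-memory toy under stated assumptions, not the API of Storm or Norikra: the function names are hypothetical, and timestamps are assumed to arrive in order as plain seconds.

```python
from collections import Counter


def count_windows(events, size):
    """Yield one Counter per full window of `size` events ("every X events").

    A trailing partial window is dropped.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    window, seen = Counter(), 0
    for event in events:
        window[event] += 1
        seen += 1
        if seen == size:
            yield window
            window, seen = Counter(), 0


def time_windows(timed_events, seconds):
    """Yield (window_start, Counter) per time window ("per Y minutes").

    Assumes (timestamp, value) pairs arrive in timestamp order.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    current_start, window = None, Counter()
    for ts, value in timed_events:
        start = ts - (ts % seconds)
        if current_start is not None and start != current_start:
            yield current_start, window
            window = Counter()
        current_start = start
        window[value] += 1
    if current_start is not None:
        yield current_start, window  # flush the last, possibly partial window


if __name__ == "__main__":
    print(list(count_windows("aabbc", 2)))
    stream = [(0, "x"), (30, "y"), (61, "x"), (119, "x"), (120, "y")]
    print(list(time_windows(stream, 60)))
```

Only the current window is held in memory, so nothing needs to be stored. Out-of-order events, late data, and state recovery are among the many things a real system still has to consider.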
Stream processing: more
Twitter Storm
Distributed stream processing platform
Processing w/ Java or JVM languages
For super high throughput data (not for minimal data)
Norikra
Non-distributed stream processing platform
Processing w/ SQL, but not distributed...
For low-middle-high throughput data
What I won't cover today...
What is Hadoop?
Original Hadoop
HDFS
MapReduce v1
Hadoop v2
HDFS
ResourceManager + MapReduce v2
Apache Spark
Apache Tez
Twitter Storm (Apache incubator)
Apache Drill
Hive, Pig, ...
What Hadoop is ....
ALL YOUR ENGINES ARE BELONG TO US.
What is Hadoop?
The BigData platform as a whole is called “Hadoop”
like “Linux”: not only the kernel, but also the distribution
CORE:
distributed file systems
data flow
“BigData as a Service” by @naoya_ito
AWS EMR/RedShift, Google BigQuery, Treasure Data, ...
They have their own architectures
and their own storage and data flow
Data flow is always the most important
Perl?
The BigData world is dominated by the JVM
many contributors from many companies
We should not build our own distributed processing software
Stand on the shoulders of giants!
Connect the Perl world with JVM systems
by CPAN modules
1. What do you want?
2. How large is your data?
3. Choose an architecture!
SHARE software, know-how & concerns!
Thank you!