MapReduce: Limitations, Optimizations and Open Issues
The 11th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA--13)
Vasiliki Kalavri, Vladimir Vlassov
{kalavri, vladv}@kth.se
17 July 2013, Melbourne, Australia
Outline
● MapReduce / Hadoop
○ background
○ current state
● Limitations and Existing Optimizations
○ performance
○ programming model
○ configuration and automation
● Trends, Open Issues, Future Directions
Motivation and Goal
● Numerous Hadoop variations and enhancements over the past few years
○ each branching out from vanilla Hadoop
○ hard to choose the appropriate tool
○ no categorization / classification exists
● In our survey
○ overview existing variations
○ classify the optimizations
○ identify trends and open issues
MapReduce
● Key-Value Pairs
○ Partitioning functions
● Second-Order Functions
○ User-Defined Map and Reduce
● Input / Output
○ Distributed Fault-Tolerant File System
● Data-Centric Computation
○ Move the computation to the data
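A minimal sketch of the model above, using the classic word-count example: user-defined map and reduce are second-order functions applied over key-value pairs, and a partitioning function assigns each intermediate key to a reducer. This simulates the semantics in plain Python; it is not the Hadoop API.

```python
from collections import defaultdict

# User-defined map: one input record -> list of (key, value) pairs
def map_fn(line):
    return [(word, 1) for word in line.split()]

# User-defined reduce: one key plus all its values -> output pairs
def reduce_fn(word, counts):
    return [(word, sum(counts))]

# Default-style partitioning function: hash(key) mod number of reducers
def partition(key, num_reducers):
    return hash(key) % num_reducers

def run_mapreduce(records, num_reducers=2):
    # Shuffle phase: group map output by key, one group table per partition
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for record in records:
        for key, value in map_fn(record):
            partitions[partition(key, num_reducers)][key].append(value)
    # Reduce phase: each reducer processes only its own partition
    output = []
    for part in partitions:
        for key, values in sorted(part.items()):
            output.extend(reduce_fn(key, values))
    return dict(output)

print(run_mapreduce(["to be or not to be"]))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2} (dict order may vary)
```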
Hadoop MapReduce 1.0
Limitations
● Scalability
● Cluster Utilization
● No support for non-MR applications
YARN (MapReduce v.2)
● JobTracker => Resource Manager and Application Master
● Map/Reduce Slots => Resource Container
MapReduce Limitations
● Performance
○ initialization, scheduling, coordination
○ data materialization - intensive disk I/O
● Programming Model
○ single-input operators
○ fixed processing pipeline - job chaining
○ no support for iterations
● Configuration and Automation
○ sensitive to configuration parameters
○ complicated tuning
Programming Model Issues (1)
Single-Input Operators
○ hard to Join / Cross datasets
[Diagram: Input A and Input B are tagged during pre-processing and merged into a single input]
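The tagging workaround can be sketched as a reduce-side join: each input is tagged with its origin during pre-processing, the tagged records are merged into one input, and the single-input reduce separates the tags to pair records. The relations and record formats below are illustrative.

```python
from collections import defaultdict
from itertools import product

# Two relations to join on their first field (illustrative data)
users = [("u1", "alice"), ("u2", "bob")]
orders = [("u1", "book"), ("u1", "pen"), ("u2", "mug")]

# Pre-processing/map: tag each record with its source relation,
# so the single-input reduce can tell the two sides apart
def map_tagged(tag, records):
    return [(key, (tag, value)) for key, value in records]

# Reduce: split one key's values by tag, then emit the cross product
def reduce_join(key, tagged_values):
    left = [v for t, v in tagged_values if t == "A"]
    right = [v for t, v in tagged_values if t == "B"]
    return [(key, l, r) for l, r in product(left, right)]

# Shuffle: merge both tagged inputs and group by join key
groups = defaultdict(list)
for key, tagged in map_tagged("A", users) + map_tagged("B", orders):
    groups[key].append(tagged)

joined = []
for key in sorted(groups):
    joined.extend(reduce_join(key, groups[key]))
print(joined)
# [('u1', 'alice', 'book'), ('u1', 'alice', 'pen'), ('u2', 'bob', 'mug')]
```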
Programming Model Issues (2)
Fixed, Static Processing Pipeline
Job Chaining
No Support for Iterations
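Because the pipeline is fixed, iterative algorithms must be driven from outside: a driver loop chains independent jobs, with each round's output materialized before the next round reads it. A toy sketch of that pattern (the job and convergence functions are illustrative, not Hadoop API):

```python
# Toy driver loop: each "job" stands for a full map-shuffle-reduce pass
# whose output would be written to the DFS before the next job reads it.
def run_job(values):
    # illustrative per-round computation: halve every value
    return [v / 2.0 for v in values]

def converged(old, new, eps=1e-3):
    return max(abs(a - b) for a, b in zip(old, new)) < eps

data = [8.0, 4.0, 2.0]
rounds = 0
while True:
    new_data = run_job(data)        # one full MapReduce job per iteration
    rounds += 1
    if converged(data, new_data):   # driver checks convergence between jobs
        break
    data = new_data                 # "materialized" output feeds the next job
print(rounds)
```

The cost the slide points at is visible here: every iteration pays job startup, scheduling, and materialization, even when the per-round work is tiny.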
Performance Optimizations
● Operator Pipelining
● Approximate Results
● Indexing and Sorting
● Work Sharing
● Data Reuse
● Skew Mitigation
● Data Colocation
Programming Model Extensions
● High-Level Languages
○ Declarative, SQL-like
○ Semi-structured JSON data
○ Java / Scala libraries for complex processing flows
● Domain-Specific Systems
○ Iterations
○ Incremental Computations
Configuration and Automation
● Self-Tuning
○ dynamic configuration based on workload
○ learn performance models
○ data-flow sharing
● Disk I/O Minimization
○ dynamically setting the number of reducers
○ handle skew and batch I/O operations
● Data-Aware Optimizations
○ static code analysis
○ index creation and selective input scans
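As a toy illustration of one such knob, a driver might derive the number of reducers from the input size instead of a fixed cluster-wide default. The heuristic and its constants below are illustrative assumptions, not taken from any of the surveyed systems:

```python
# Illustrative heuristic: size each reducer's share of the shuffle data
# rather than using a fixed default reducer count.
TARGET_BYTES_PER_REDUCER = 1 * 1024**3   # assumed target: ~1 GiB per reducer
MAX_REDUCERS = 200                       # assumed cluster-wide cap

def choose_num_reducers(input_bytes):
    wanted = -(-input_bytes // TARGET_BYTES_PER_REDUCER)  # ceiling division
    return max(1, min(MAX_REDUCERS, wanted))

print(choose_num_reducers(10 * 1024**3))  # 10 GiB input
# 10
```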
Trends
● In-memory processing
○ minimize disk I/O and communication
● Traditional database techniques
○ organize and structure data, indexing
● Caching
○ reuse of previous computations
● Relaxation of fault-tolerance
○ materialize less often
System          Major Contribution                   Open-Source, Available?   Transparent
MR Online       Pipelining, online aggregation       yes                       yes
EARL            Fast approximate results             yes                       no
Hadoop++, HAIL  Improve relational operations        no                        yes / no
MRShare         Concurrent work sharing              no                        no
ReStore         Reuse of previous computations       no                        yes
SkewTune        Automatic skew mitigation            no                        yes
CoHadoop        Data colocation                      no                        no
HaLoop          Iterations support                   yes                       no
Incoop          Incremental processing               no                        no
Starfish        Dynamic self-tuning                  no                        yes
Sailfish        I/O minimization, automatic tuning   no                        yes
Manimal         Automatic data-aware optimizations   no                        yes
Open Issues
● No standard benchmark
● No "typical" MapReduce workload
● Each system is evaluated using different
○ datasets
○ applications
○ deployments
■ impossible to compare systems with one another, or comparison only against vanilla Hadoop
● Application transparency
Future Directions
● Fault-tolerance adjustment mechanisms
● Standardize workloads and comparison metrics
● Support for interactive analysis
○ query optimization techniques
○ data reuse
○ fast approximate results
Conclusions
● MapReduce and Hadoop are very useful, successful and interesting tools
● There is still a lot of room for optimizations and research
● But MapReduce might not always be the right tool for the job
○ more flexible data-flows
○ relational operations
○ graph processing
○ machine learning