© Hortonworks Inc. 2013 Page 1
Apache Tez : Accelerating Hadoop Query Processing
Bikas Saha (@bikassaha)
Tez – Introduction
Page 2
• Distributed execution framework targeted towards data-processing applications.
• Based on expressing a computation as a dataflow graph.
• Highly customizable to meet a broad spectrum of use cases.
• Built on top of YARN – the resource management framework for Hadoop.
• Open source Apache incubator project and Apache licensed.
Tez – Design Themes
Page 3
• Empowering End Users
• Execution Performance
Tez – Empowering End Users
• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying deployment
Page 4
Tez – Empowering End Users
• Expressive dataflow definition APIs
– Enable definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical plan at runtime.
– Targeted towards data processing applications like Hive/Pig but not limited to them. Hive/Pig query plans naturally map to Tez dataflow graphs with no translation impedance.
Page 5
[Figure: multi-stage pipeline – Preprocessor, Partition and Aggregate stages, each running multiple parallel tasks (TaskA through TaskE), connected as one dataflow graph]
Tez – Empowering End Users
• Expressive dataflow definition API’s
Page 6
[Figure: distributed sort – a Sampler stage produces samples, partition ranges are computed from them, and downstream tasks partition and sort by range]
Tez – Empowering End Users
• Flexible Input-Processor-Output runtime model
– Construct physical runtime executors dynamically by connecting different inputs, processors and outputs.
– End goal is to have a library of inputs, outputs and processors that can be programmatically composed to generate useful tasks.
Page 7
[Figure: example task compositions –
Mapper = HDFSInput → MapProcessor → FileSortedOutput
FinalReduce = ShuffleInput → ReduceProcessor → HDFSOutput
IntermediateJoiner = Input1 + Input2 → JoinProcessor → FileSortedOutput]
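The composition idea above can be sketched in Java. This is a hypothetical illustration only – the interface and class names below are assumptions, not the real Tez runtime API:

```java
import java.util.List;

// Hypothetical sketch: a task is just a processor wired to pluggable
// inputs and outputs, mirroring the Input-Processor-Output model.
interface Input { List<String> fetch(); }                 // e.g. HDFSInput, ShuffleInput
interface Output { void write(List<String> data); }       // e.g. FileSortedOutput
interface Processor { List<String> run(List<Input> in); } // e.g. MapProcessor

// A "Mapper" task is then a MapProcessor composed with an HDFSInput and a
// FileSortedOutput; a "FinalReduce" swaps in ShuffleInput and HDFSOutput.
final class ComposedTask {
    private final List<Input> inputs;
    private final Processor processor;
    private final Output output;

    ComposedTask(List<Input> inputs, Processor processor, Output output) {
        this.inputs = inputs;
        this.processor = processor;
        this.output = output;
    }

    // Execute the task: run the processor over the inputs, write the result.
    void execute() { output.write(processor.run(inputs)); }
}
```

The point of the design is that the same processor can be reused against different inputs and outputs without code changes.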
Tez – Empowering End Users
• Data type agnostic
– Tez is only concerned with the movement of data: files and streams of bytes.
– Does not impose any data format on the user application. An MR application can use Key-Value pairs on top of Tez. Hive and Pig can use tuple oriented formats that are natural and native to them.
Page 8
[Figure: a Tez Task moves bytes as files or streams; the user code above it sees Key-Value pairs or tuples]
Tez – Empowering End Users
• Simplifying deployment
– Tez is a completely client side application.
– No deployments to do. Simply upload to any accessible FileSystem and change local Tez configuration to point to that.
– Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
– Leverages YARN local resources.
Page 9
[Figure: two client machines running different Tez library versions (Tez Lib 1, Tez Lib 2) uploaded to HDFS, each TezClient launching TezTasks via NodeManagers]
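The "upload and point" step maps to a single client-side setting. A minimal sketch, using the tez.lib.uris key from the Apache Tez configuration (the path and version below are placeholders):

```xml
<property>
  <name>tez.lib.uris</name>
  <value>hdfs:///apps/tez-0.x/tez.tar.gz</value>
</property>
```

Two clients pointing at different paths run different Tez versions side by side on the same cluster.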
Tez – Empowering End Users
• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying usage

With great power APIs come great responsibilities
Tez is a framework on which end user applications can be built
Page 10
Tez – Execution Performance
• Performance gains over Map Reduce
• Optimal resource management
• Plan reconfiguration at runtime
• Dynamic physical data flow decisions
Page 11
Tez – Execution Performance
• Performance gains over Map Reduce
– Eliminate replicated write barrier between successive computations.
– Eliminate job launch overhead of workflow jobs.
– Eliminate extra stage of map reads in every workflow job.
– Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.
Page 12
[Figure: job timelines compared – Pig/Hive on MR vs. Pig/Hive on Tez]
Tez – Execution Performance
• Plan reconfiguration at runtime
– Dynamic runtime concurrency control based on data size, user operator resources, available cluster resources and locality.
– Advanced changes in dataflow graph structure.
– Progressive graph construction in concert with user optimizer.
Page 13
[Figure: Stage 1 runs 50 maps over HDFS blocks and YARN resources, producing 100 partitions; Stage 2 is planned with 100 reducers but, with only 10 GB of data observed at runtime, is reconfigured to 10 reducers]
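The figure's 100-to-10 reducer cut amounts to a simple sizing rule. A minimal sketch, assuming a hypothetical bytes-per-reducer target (this is an illustration, not Tez's actual heuristic):

```java
// Illustrative only: pick a reducer count from observed intermediate data
// size, capped by the originally planned parallelism.
final class ReducerCount {
    static int decide(long observedBytes, long bytesPerReducer, int plannedReducers) {
        // Ceiling division: how many reducers the observed data actually needs.
        int needed = (int) Math.max(1, (observedBytes + bytesPerReducer - 1) / bytesPerReducer);
        return Math.min(plannedReducers, needed);
    }
}
```

With 10 GB observed and a 1 GB-per-reducer target, the planned 100 reducers shrink to 10, matching the figure.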
Tez – Execution Performance
• Optimal resource management
– Reuse YARN containers to launch new tasks.
– Reuse YARN containers to enable shared objects across tasks.
Page 14
[Figure: a YARN container hosts TezTask1 and then TezTask2 with shared objects across tasks; the Tez Application Master drives Start Task / Task Done cycles to reuse the container]
Tez – Execution Performance
• Dynamic physical data flow decisions
– Decide the type of physical byte movement and storage on the fly.
– Store intermediate data on distributed store, local store or in-memory.
– Transfer bytes via blocking files or streaming and the spectrum in between.
Page 15
[Figure: at runtime, a small producer output is handed to its consumer in-memory, while a larger producer output goes through a local file]
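The in-memory vs. local-file choice can be modeled as a threshold rule. The names and the budget parameter below are illustrative assumptions, not Tez code:

```java
// The two physical transports shown in the figure.
enum Transport { IN_MEMORY, LOCAL_FILE }

// Illustrative decision rule: small producer outputs stay in memory,
// larger ones spill to a local file.
final class TransportChooser {
    static Transport choose(long outputBytes, long memoryBudgetBytes) {
        return outputBytes <= memoryBudgetBytes ? Transport.IN_MEMORY : Transport.LOCAL_FILE;
    }
}
```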
Tez – Deep Dive
• DAG API
• Runtime API and Event Model
• Dynamic Graph Reconfiguration
• Tez Session
Page 16
Tez – Deep Dive – DAG API
DAG dag = new DAG();
Vertex map1 = new Vertex(MapProcessor.class);
Vertex map2 = new Vertex(MapProcessor.class);
Vertex reduce1 = new Vertex(ReduceProcessor.class);
Vertex reduce2 = new Vertex(ReduceProcessor.class);
Vertex join1 = new Vertex(JoinProcessor.class);
…
Edge edge1 = new Edge(map1, reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge2 = new Edge(map2, reduce2, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge3 = new Edge(reduce1, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge4 = new Edge(reduce2, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
…
dag.addVertex(map1).addVertex(map2).addVertex(reduce1).addVertex(reduce2).addVertex(join1)
   .addEdge(edge1).addEdge(edge2).addEdge(edge3).addEdge(edge4);
Page 17
[Figure: the resulting DAG – map1 → reduce1 and map2 → reduce2 over Scatter_Gather (bipartite, sequential) edges; reduce1 and reduce2 feed join1]
Simple DAG definition API
Tez – Deep Dive – DAG API
Page 18
• Data movement – Defines routing of data between tasks
– One-To-One : Data from the ith producer task routes to the ith consumer task.
– Broadcast : Data from a producer task routes to all consumer tasks.
– Scatter-Gather : Producer tasks scatter data into shards and consumer tasks gather the data. The ith shard from all producer tasks routes to the ith consumer task.
• Scheduling – Defines when a consumer task is scheduled
– Sequential : Consumer task may be scheduled after a producer task completes.
– Concurrent : Consumer task must be co-scheduled with a producer task.
• Data source – Defines the lifetime/reliability of a task output
– Persisted : Output will be available after the task exits. Output may be lost later on.
– Persisted-Reliable : Output is reliably stored and will always be available.
– Ephemeral : Output is available only while the producer task is running.
Edge properties define the connection between producer and consumer vertices in the DAG
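The three data movement patterns can be written down as index routing rules. This is a hypothetical helper for illustration, not the Tez API; each int[] pair is {producer index, shard index} read by a given consumer:

```java
import java.util.ArrayList;
import java.util.List;

final class Routing {
    // One-To-One: consumer i reads producer i's single output shard.
    static List<int[]> oneToOne(int producers, int consumer) {
        return List.of(new int[] { consumer, 0 });
    }

    // Broadcast: consumer i reads every producer's single output shard.
    static List<int[]> broadcast(int producers, int consumer) {
        List<int[]> sources = new ArrayList<>();
        for (int p = 0; p < producers; p++) sources.add(new int[] { p, 0 });
        return sources;
    }

    // Scatter-Gather: consumer i gathers shard i from every producer.
    static List<int[]> scatterGather(int producers, int consumer) {
        List<int[]> sources = new ArrayList<>();
        for (int p = 0; p < producers; p++) sources.add(new int[] { p, consumer });
        return sources;
    }
}
```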
Tez – Deep Dive – DAG API
Page 19
[Figure: the same example DAG – map1 and map2 feeding reduce1 and reduce2, which join at join1]
Tez Deep Dive – Runtime API
Page 20
Tez – Deep Dive – Task Execution
Page 21
[Figure: the AM holds a logical Task Attempt and starts the real container with env, command line and resources; the Tez Task JVM gets the task, wires up Input, Processor and Output, and exchanges data and control events with the AM]
• Start task shell with user specified env, resources etc.
• Fetch and instantiate Input, Processor, Output objects
• Receive (incremental) input information and process the input
• Provide output information
• Provide control/error events
Tez Deep Dive – Runtime Events
Page 22
[Figure: Map Task 1 and Map Task 2, each with Output1–Output3, feed Reduce Task 2 (Input1, Input2) over a Scatter-Gather edge; data movement and error events are routed through the AM]
• Events used to communicate between the tasks and between task and ApplicationMaster (AM)
• Data Movement Event used by producer task to inform the consumer task about data location, size etc.
• Input Error event sent by task to AM to inform about errors in reading input. AM then takes action by re-generating the input
• Other events to send task completion notification, data statistics and other control plane information
Tez – Deep Dive – Core Engine
Page 25
[Figure: the AM starts vertices (e.g. map1, reduce1) and their tasks via the Vertex Manager, asks the DAGScheduler for task priority, and gets containers from the TaskScheduler]

• Vertex Manager – Determines task parallelism and when tasks in a vertex can start.
• DAG Scheduler – Determines priority of tasks.
• Task Scheduler – Allocates containers from YARN and assigns them to tasks.
Tez – Automatic Reduce Parallelism
Page 26
[Figure (animation across slides 26–28): Map Vertex tasks send Data Size Statistics events to the Reduce Vertex Manager in the App Master; the manager sets parallelism on the Vertex State Machine, which cancels surplus tasks and re-routes data]
• Event Model – Map tasks send data statistics events to the Reduce Vertex Manager.
• Vertex Manager – Pluggable user logic that understands the data statistics and can formulate the correct parallelism. Advises the vertex controller on parallelism.
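The pluggable logic can be sketched as a toy model (this is not the real VertexManager plugin API): collect per-map output-size statistics events and extrapolate a reducer count once enough maps have reported.

```java
// Toy model of statistics-driven reduce parallelism.
final class StatsBasedParallelism {
    private long reportedBytes = 0;
    private int reportedMaps = 0;
    private final int totalMaps;
    private final long bytesPerReducer;

    StatsBasedParallelism(int totalMaps, long bytesPerReducer) {
        this.totalMaps = totalMaps;
        this.bytesPerReducer = bytesPerReducer;
    }

    // Called for each data-size statistics event from a completed map.
    void onStatistics(long mapOutputBytes) {
        reportedBytes += mapOutputBytes;
        reportedMaps++;
    }

    // Extrapolate total output from the maps seen so far and propose a
    // reducer count; -1 means "too few maps have reported to decide".
    int proposeParallelism(double minReportedFraction) {
        if (reportedMaps < totalMaps * minReportedFraction) return -1;
        long projected = reportedBytes * totalMaps / reportedMaps;
        return (int) Math.max(1, (projected + bytesPerReducer - 1) / bytesPerReducer);
    }
}
```

The vertex controller would then cancel the surplus reduce tasks and re-route their input shards, as the figure shows.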
Tez – Reduce Slow Start/Pre-launch
Page 29
[Figure (animation across slides 29–31): Task Completed events from the Map Vertex reach the Reduce Vertex Manager, which tells the Vertex State Machine to start reduce tasks before all maps have finished]
• Event Model – Map completion events sent to the Reduce Vertex Manager.
• Vertex Manager – Pluggable user logic that understands the data size. Advises the vertex controller to launch the reducers before all maps have completed so that shuffle can start.
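The pre-launch decision reduces to a completion-fraction threshold, similar in spirit to MapReduce's mapreduce.job.reduce.slowstart.completedmaps setting. The code below is an illustration, not Tez's implementation:

```java
// Illustrative slow-start rule: start reducers once a fraction of maps
// have completed, so shuffle overlaps the tail of the map stage.
final class SlowStart {
    static boolean shouldStartReducers(int completedMaps, int totalMaps, double threshold) {
        return totalMaps > 0 && (double) completedMaps / totalMaps >= threshold;
    }
}
```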
Tez – Automatic Map Parallelism
Page 32
[Figure (animation across slides 32–34): the Map Vertex input manager gets block locations from HDFS and sets parallelism; 1-1 edges propagate the decision to downstream vertices]
• Input vertex manager gets block locations and estimates the number of mappers based on data size, cluster capacity and map data limits. Groups blocks by locality.
• Consumer vertex parallelism gets recursively determined through the chain of consumer vertices.
Tez – Sessions
Page 35
[Figure: the Client starts a session and submits DAGs to the Application Master; its Task Scheduler maintains a Container Pool of pre-warmed JVMs with a Shared Object Registry]
• Key for interactive queries
• Analogous to database sessions and represents a connection between the user and the cluster
• Run multiple DAGs/queries in the same session
• Maintains a pool of reusable containers for low latency execution of tasks within and across queries
• Takes care of data locality and releasing resources when idle
• Session cache in the Application Master and in the container pool reduce re-computation and re-initialization
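Container reuse within a session can be modeled as a simple pool. This is a toy model of the idea, not the Tez implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of session container reuse: a warm container is taken from
// the pool if one exists, otherwise a new one is "launched"; finished
// tasks return their container to the pool for the next query.
final class ContainerPool {
    private final Deque<String> warm = new ArrayDeque<>();
    private int launched = 0;

    String acquire() {
        if (!warm.isEmpty()) return warm.pop(); // reuse: no JVM startup cost
        launched++;
        return "container-" + launched;         // cold start
    }

    void release(String container) { warm.push(container); }

    int coldStarts() { return launched; }
}
```

Across many short queries in one session, almost every acquire() hits the warm pool, which is where the low-latency benefit comes from.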
Tez – Now and Next
Page 36
Tez – Bridge the Data Spectrum
Page 37
[Figure: typical pattern in a TPC-DS query – a Fact Table joined with Dimension Tables 1–3, producing Result Tables 1–3, via a mix of Shuffle Joins and Broadcast Joins; broadcast join is used for small data sets]

Based on data size, the query optimizer can run either plan as a single Tez job
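The optimizer's choice can be sketched as a size threshold. The names and the limit below are illustrative assumptions, not Hive's actual rule:

```java
// The two join strategies from the figure.
enum JoinStrategy { BROADCAST_JOIN, SHUFFLE_JOIN }

// Illustrative planner rule: broadcast the dimension table when it is
// small enough to ship to every task, otherwise shuffle both sides.
final class JoinPlanner {
    static JoinStrategy choose(long dimensionTableBytes, long broadcastLimitBytes) {
        return dimensionTableBytes <= broadcastLimitBytes
                ? JoinStrategy.BROADCAST_JOIN
                : JoinStrategy.SHUFFLE_JOIN;
    }
}
```

Because both plans are expressible in one DAG, the decision can even be deferred to runtime, per the plan-reconfiguration theme above.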
Tez – Benchmark Performance
Page 38
Significant (but not all) speedups due to Tez:
• DAG support and runtime graph reconfiguration enable utilizing the parallelism of the cluster
• Tez Session and container reuse enable efficient and low latency execution
Tez – Performance Analysis
Page 39
TPCDS – Query-27 with Hive on Tez:
• Tez Session populates container pool
• Dimension table calculation and HDFS split generation in parallel
• Dimension tables broadcasted to Hive MapJoin tasks
• Final Reducer pre-launched and fetches completed inputs
Tez – Current status
• Apache Incubator Project
– Rapid development. Over 600 jiras opened. Over 400 resolved.
– Growing community of contributors and users.
• Focus on stability
– Testing and quality are highest priority.
– Code ready and deployed on multi-node environments.
• Support for a vast topology of DAGs
– Already functionally equivalent to Map Reduce. Existing Map Reduce jobs can be executed on Tez with few or no changes.
– Hive retargeted to use Tez for execution of queries (HIVE-4660).
– Work started on Pig to use Tez for execution of scripts (PIG-3446).
Page 40
Tez – Roadmap
• Richer DAG support
– Support for co-scheduling and streaming
– Better fault tolerance with checkpoints
• Performance optimizations
– More efficiencies in transfer of data
– Improve session performance
• Usability
– Stability and testability
– Recovery and history
– Tools for performance analysis and debugging
Page 41
Tez – Community
• Early adopters and code contributors welcome
– Adopters to drive more scenarios. Contributors to make them happen.
– Hive and Pig communities are on-board and making great progress – HIVE-4660 and PIG-3446
• Tez meetup for developers and users
– http://www.meetup.com/Apache-Tez-User-Group
• Technical blog series
– http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing/ (will soon be available on the Apache Wiki)
• Useful links
– Work tracking: https://issues.apache.org/jira/browse/TEZ
– Code: https://github.com/apache/incubator-tez
– Developer list: [email protected]
– User list: [email protected]
– Issues list: [email protected]
Page 42
Tez – Takeaways
• Distributed execution framework that works on computations represented as dataflow graphs
• Naturally maps to execution plans produced by query optimizers
• Customizable execution architecture designed to enable dynamic performance optimizations at runtime
• Works out of the box with the platform figuring out the hard stuff
• Spans the spectrum from interactive latency to batch
• Open source Apache project – your use-cases and code are welcome
• It works and is already being used by Hive and Pig
Page 43
Tez
Thanks for your time and attention!
Questions?
@bikassaha
Page 44