View MR Design Patterns course details at www.edureka.co/mapreduce-design-patterns
Application of Join Pattern
MapReduce Design Patterns
Slide 2 www.edureka.co/mapreduce-design-patterns
Objectives
At the end of this module, you will be able to understand:
Why design patterns are used in MR
Who should know MapReduce design patterns
Which design patterns are available in MR
The Join pattern
Slide 3 www.edureka.co/mapreduce-design-patterns
Why Design Patterns in MR?
General reusable, optimized solutions to most common problems
Template to solve problems used in different situations
Speed up the development process
Tried and tested design principles
An initial guideline to solve most common problems in MR
Help build sophisticated, high-quality solutions
Slide 4 www.edureka.co/mapreduce-design-patterns
Who should know MR Design Patterns?
A Java developer who wants to explore the world of Big Data
A MapReduce programmer who wants to deepen their MR expertise
One who aims to become a Hadoop Architect
Slide 5 www.edureka.co/mapreduce-design-patterns
Available Design Patterns in MR
Summarization Pattern
Filtering Pattern
Data Organization Pattern
Join Pattern
Meta Pattern
Input & Output Pattern
Slide 6 www.edureka.co/mapreduce-design-patterns
Join Patterns – What are they?
Data is generally spread across multiple sources
Deriving its full value requires merging those sources together
Join Patterns are used for this purpose
Performing joins on the fly on Big Data can be costly in terms of time
Example: Joining StackOverflow data from Comments & Posts on UserId
Slide 7 www.edureka.co/mapreduce-design-patterns
Join Patterns – What are they? (Contd.)
The join patterns we will talk about are:
» Reduce Side Join/Repartition Join
» Reduce Side Join with Bloom Filter
» Replicated Join
» Composite Join
» Cartesian Product
Slide 8 www.edureka.co/mapreduce-design-patterns
Join – Refresher
Inner Join
Outer Join
» Left Outer Join
» Right Outer Join
» Full Outer Join
Anti Join
Cartesian Product
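The join types listed above can be illustrated with a minimal plain-Python sketch. The `users` and `comments` tables are made-up data, and the `join` and `cartesian` helpers are hypothetical names introduced only for this illustration. Two simplifying assumptions: each key appears at most once per table, and "anti join" is taken to mean the full outer join minus the inner join (some SQL engines instead use a one-sided definition).

```python
# Two tiny, made-up tables keyed by user ID (illustrative data only).
users = {1: "Alice", 2: "Bob", 3: "Carol"}
comments = {2: "Nice post", 3: "Thanks", 4: "Spam?"}


def join(left, right, kind):
    """Return {key: (left_value, right_value)} for the given join kind."""
    if kind == "inner":
        keys = left.keys() & right.keys()
    elif kind == "left":
        keys = left.keys()
    elif kind == "right":
        keys = right.keys()
    elif kind == "full":
        keys = left.keys() | right.keys()
    elif kind == "anti":
        # Assumption: anti join = keys present in exactly one table.
        keys = left.keys() ^ right.keys()
    else:
        raise ValueError(f"unknown join kind: {kind}")
    # A missing side is represented by None, like NULL in SQL.
    return {k: (left.get(k), right.get(k)) for k in sorted(keys)}


def cartesian(left, right):
    """Pair every left record with every right record, ignoring keys."""
    return [(lv, rv) for lv in left.values() for rv in right.values()]
```

Calling `join(users, comments, "left")` keeps every user and pads users without a comment with `None`, while `cartesian(users, comments)` ignores the keys entirely.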
Slide 9 www.edureka.co/mapreduce-design-patterns
Reduce Side Join – Description
Easiest to implement, but can take the longest to execute
Supports all types of join operation
Can join multiple data sources, but expensive in terms of network resources & time
All data is transferred across the network
Example: Join PostLinks table data in StackOverflow to Posts data
Slide 10 www.edureka.co/mapreduce-design-patterns
Reduce Side Join – Description (Contd.)
Applicability – Use it when
» Multiple large data sets need to be joined
» (If one of the data sources is small, consider a replicated join instead)
» Different data sources are linked by a foreign key
» You want all join operations to be supported
Slide 12 www.edureka.co/mapreduce-design-patterns
Reduce Side Join – Structure (Contd.)
Mapper
» Output key should reflect the foreign key
» Value can be the whole record plus a tag identifying its source
» Use projection and output only the required fields
Combiner
» Not required; it provides no additional benefit because there is nothing to aggregate on the map side
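The mapper structure described above can be sketched in plain Python as a rough illustration. This is not Hadoop API code: `make_mapper`, the tags `"P"` and `"L"`, and the `Title` and `Body` fields are assumptions made for the example. Only the `Id`, `PostId`, and `RelatedPostId` field names come from the StackOverflow example in this deck.

```python
def make_mapper(source_tag, key_field, keep_fields):
    """Build a map function for one input source.

    source_tag  -- short label identifying the source (e.g. "P" for posts)
    key_field   -- name of the foreign-key field emitted as the output key
    keep_fields -- projection: the only fields copied into the output value
    """
    def mapper(record):
        key = record.get(key_field)
        if key is None:
            return  # record has no join key, so nothing is emitted
        # Projection: drop every field the join does not need.
        projected = {f: record[f] for f in keep_fields if f in record}
        # Output key = foreign key; value = (source tag, projected record).
        yield key, (source_tag, projected)
    return mapper


# One mapper per source; the tags let the reducer tell the sources apart.
post_mapper = make_mapper("P", "Id", ["Id", "Title"])
link_mapper = make_mapper("L", "PostId", ["PostId", "RelatedPostId"])
```

For instance, `list(post_mapper({"Id": "7", "Title": "Hi", "Body": "long text"}))` emits a single pair keyed by `"7"` whose value carries the tag `"P"` and only the projected fields.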
Slide 13 www.edureka.co/mapreduce-design-patterns
Reduce Side Join – Structure (Contd.)
Partitioner
» Use a custom partitioner if required
Reducer
» Reducer logic is based on the type of join required
» Reducer receives the data from all the different sources per key
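As a hedged sketch of the partitioner and reducer roles, the plain-Python simulation below hash-partitions keys (Hadoop's default partitioner also hashes the key) and then joins the values for one key according to the requested join type. The function names and the source tags `"A"` and `"B"` are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from zlib import crc32


def partition(key, num_reducers):
    """Hash-partition a key; crc32 keeps the result stable across runs."""
    return crc32(str(key).encode("utf-8")) % num_reducers


def shuffle(map_output, num_reducers):
    """Group (key, tagged_value) pairs by reducer, then by key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, tagged in map_output:
        reducers[partition(key, num_reducers)][key].append(tagged)
    return reducers


def reduce_join(key, tagged_values, kind="inner"):
    """Join all values sharing one key; 'kind' selects the join type."""
    left = [v for tag, v in tagged_values if tag == "A"]
    right = [v for tag, v in tagged_values if tag == "B"]
    # Outer joins pad the missing side with None instead of dropping the key.
    if kind in ("left", "full") and not right:
        right = [None]
    if kind in ("right", "full") and not left:
        left = [None]
    # For an inner join an empty side yields an empty product, i.e. no output.
    return [(key, l, r) for l in left for r in right]
```

Because every record with the same key is sent to the same reducer, the reducer sees both sources side by side and only the padding rule changes between join types.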
Slide 14 www.edureka.co/mapreduce-design-patterns
Reduce Side Join – Analogy
Resemblances
» SQL
  SELECT users.ID, users.Location, comments.upVotes
  FROM users [INNER|LEFT|RIGHT] JOIN comments ON users.ID = comments.UserID
» Pig
  » Supports inner & outer joins
  » Inner Join
    A = JOIN comments BY userID, users BY userID;
  » Outer Join
    A = JOIN comments BY userID [LEFT|RIGHT|FULL] OUTER, users BY userID;
Slide 15 www.edureka.co/mapreduce-design-patterns
Reduce Side Join – Performance
Performance
» All of the data moves across the network to the reducers
» You can optimize by using projection and sending only the required fields
» The number of reducers is typically higher than for other jobs
» If you can use any other Join type for your problem, use that instead
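To make the projection point concrete, the sketch below estimates how many bytes a set of records would occupy when serialized, with and without projection. It is only an approximation under stated assumptions: JSON stands in for Hadoop's own serialization, the `posts` records are synthetic, and `shuffle_bytes` is a hypothetical helper.

```python
import json


def shuffle_bytes(records, keep_fields=None):
    """Bytes needed to serialize records as JSON, optionally projected."""
    total = 0
    for rec in records:
        if keep_fields is not None:
            # Projection: keep only the fields the join actually needs.
            rec = {f: rec[f] for f in keep_fields if f in rec}
        total += len(json.dumps(rec).encode("utf-8"))
    return total


# Synthetic posts with a large Body field that the join does not need.
posts = [
    {"Id": str(i), "Title": f"Question {i}", "Body": "x" * 500}
    for i in range(100)
]

full = shuffle_bytes(posts)                        # whole records
projected = shuffle_bytes(posts, ["Id", "Title"])  # projected records
```

Comparing `full` with `projected` shows how much of the shuffled volume comes from fields the join never reads.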
Slide 16 www.edureka.co/mapreduce-design-patterns
Reduce Side Join – Use Cases
Join tweets with user personal information for Behavioral Analysis
Join PostLinks and Posts tables from StackOverflow to have all related posts in one place
Slide 17 www.edureka.co/mapreduce-design-patterns
Reduce Side Join Example – Problem
Your dataset is the StackOverflow dataset. Look at the PostLinks.xml & Posts.xml files. Join the two tables based on PostId in PostLinks & Id in Posts
» Use the MultipleInputs class
» Projection on PostLinks to output only PostId & RelatedPostId fields
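One possible shape for a solution is sketched below as a plain-Python simulation rather than a Hadoop job. It assumes each data line in the dump is a self-closing `<row ... />` element whose attributes hold the fields. The `Title` attribute, the `"POST"` and `"LINK"` tags, and all function names are illustrative assumptions. In a real job, Hadoop's `MultipleInputs` class would bind one mapper to each input path; here `run_join` mimics that by taking one `(lines, mapper)` pair per source.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def parse_row(line):
    """Parse one '<row ... />' line into a dict of its attributes."""
    try:
        return dict(ET.fromstring(line.strip()).attrib)
    except ET.ParseError:
        return {}  # XML header, wrapper tags, or malformed lines are skipped


def posts_mapper(line):
    row = parse_row(line)
    if "Id" in row:
        yield row["Id"], ("POST", row)


def postlinks_mapper(line):
    row = parse_row(line)
    if "PostId" in row and "RelatedPostId" in row:
        # Projection: keep only PostId and RelatedPostId.
        projected = {"PostId": row["PostId"],
                     "RelatedPostId": row["RelatedPostId"]}
        yield row["PostId"], ("LINK", projected)


def run_join(inputs):
    """inputs: list of (lines, mapper) pairs, one per data source."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for lines, mapper in inputs:
        for line in lines:
            for key, tagged in mapper(line):
                groups[key].append(tagged)
    joined = []
    for key in sorted(groups):  # reduce phase: one call per key
        posts = [v for tag, v in groups[key] if tag == "POST"]
        links = [v for tag, v in groups[key] if tag == "LINK"]
        # Inner join: emit only keys present in both sources.
        joined.extend((key, p, l) for p in posts for l in links)
    return joined
```

Switching to an outer join would only change the reduce step, by padding the missing side instead of dropping the key.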