Optimal Chain Matrix Multiplication Big Data Perspective
Presented by Pollab Kumar Roy
STUDY AND REPORT
Presentation Outline
• Introduction
• Big Data Overview: Definition, The three Vs, Application
• Introduction to Hadoop: Architecture, How it works, Advantage
• MapReduce: What is MapReduce?, The Algorithm, Example Scenario
• HDFS
• Matrix Multiplication
• Multi Way Join
• Proposed Work
• Conclusions
Dept. of ICT, MBSTU
Introduction
Matrix multiplication is widely used in graph algorithms, such as those that compute the transitive closure. MapReduce is well suited to implementing multi-way join operations on very large graphs and matrices.
This presentation gives an overview of Big Data, shows how matrix multiplication is represented in a database, and describes the parallel multi-way matrix join in a database, with its benefits and limitations.
It closes with a proposal for making chain multiplication closer to optimal with a raw join key.
Big Data Overview
Big data is a term for data sets whose size, complexity, and rate of growth make them difficult to capture, manage, and process with conventional technologies.
Big Data sources:
• Stock exchange data
• Social media data
• Black box data
The three dimensions of Big Data
• Volume: all data created up to 2003 amounted to about 5 billion GB. The same amount was created every two days in 2011, and every ten minutes in 2013.
• Variety: structured (relational data), semi-structured (XML data), and unstructured (Word, PDF, text, media logs).
• Velocity: the pace at which data flows in from sources and human interaction.
Big Data Application Segments
• Analytics: Predictive Modeling, Decision Processing, Behavior Analysis, Demographics
• Data Warehouse: Hosting, Digitization/archive, Backup, Web 2.0
• Engineering: Collaborating, Design Optimization, Process Flow, Fluid Dynamics, 3D Modeling
Introduction to Hadoop
Hadoop: an Apache open-source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models.
The name comes from a toy that belonged to Doug Cutting's son.
Hadoop architecture: two major layers.
• Processing layer: MapReduce (distributed computation)
• Storage layer: Hadoop Distributed File System, HDFS (distributed storage)
Alongside these two layers, the architecture includes the YARN framework and the common utilities.
Introduction to Hadoop (cont.)
How Hadoop works: core tasks across a cluster of computers
• Data is divided into directories and files, and files into blocks of 128 MB or 64 MB.
• Files are then distributed across various cluster nodes.
• HDFS supervises the processing.
• Blocks are replicated.
• A sort is performed between the map and reduce stages.
• The sorted data is sent to a certain computer.
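The block splitting and replication steps above can be sketched in a few lines of Python. This is only an illustration: the replication factor of 3, the node names, and the round-robin placement are assumptions made for the example, and real HDFS uses a rack-aware placement policy rather than this one.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, one of the block sizes named on the slide
REPLICATION = 3                 # assumed replication factor, for illustration only


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block that a file of file_size bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical policy; it needs replication <= len(nodes).
    """
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
print(len(blocks), placement)
```

A 300 MB file occupies two full 128 MB blocks plus one 44 MB block, and every block ends up on three different nodes.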
Advantages:
• Low-cost alternative to building bigger servers.
• Fault tolerance and high availability.
• Dynamic clustering.
• Automatic data distribution, and open source.
MapReduce
What is MapReduce: a processing technique and a programming model for distributed computing, based on Java.
• Mapper
• Shuffle
• Reducer
• Java based
• Key-value pairs
MapReduce (cont.)
The algorithm: Mapper, Reducer, key-value pairs
MapReduce (cont.)
Word Count Example :
Input files:
   File 1: Apple Orange Mango / Orange Grapes Plum
   File 2: Apple Plum Mango / Apple Apple Plum
Each line goes to an individual mapper:
   Apple Orange Mango
   Orange Grapes Plum
   Apple Plum Mango
   Apple Apple Plum
Map (key-value splitting):
   Apple,1 Orange,1 Mango,1
   Orange,1 Grapes,1 Plum,1
   Apple,1 Plum,1 Mango,1
   Apple,1 Apple,1 Plum,1
Sort and shuffle:
   Apple,1 Apple,1 Apple,1 Apple,1
   Grapes,1
   Mango,1 Mango,1
   Orange,1 Orange,1
   Plum,1 Plum,1 Plum,1
Reduce (produce key-value pairs):
   Apple,4
   Grapes,1
   Mango,2
   Orange,2
   Plum,3
Final output:
   Apple,4 Grapes,1 Mango,2 Orange,2 Plum,3
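The word count stages can be imitated in plain Python to make the data flow concrete. This is a single-process sketch, not Hadoop code: the function names mapper, shuffle, and reducer are chosen for the example, and a real job would run them on different machines.

```python
from collections import defaultdict

# The four input lines from the example.
lines = [
    "Apple Orange Mango",
    "Orange Grapes Plum",
    "Apple Plum Mango",
    "Apple Apple Plum",
]


def mapper(line):
    """Emit a (word, 1) pair for every word on the line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group the values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reducer(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


mapped = [pair for line in lines for pair in mapper(line)]
grouped = shuffle(mapped)
counts = dict(reducer(key, values) for key, values in grouped.items())
print(counts)
```

The resulting counts match the final output of the example: Apple 4, Grapes 1, Mango 2, Orange 2, Plum 3.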
Hadoop Distributed File System(HDFS)
HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework.
Features:
• Distributed storage and processing
• Name Node
• Data Node
• Interface in Hadoop
• Streaming access
• Cluster status check
Hadoop Distributed File System(cont.)
Architecture : Data Node, Name Node, Block
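Fig : HDFS architecture. The Client sends metadata operations to the Name Node, which holds the metadata (name, replica…; e.g. /home/foo/data, 3…), and performs block operations (read, write) on the Datanodes in Rack 1 and Rack 2. Blocks are replicated across Datanodes.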
Matrix Multiplication (Via multi-way join)
Usage: widely used in many graph algorithms
• Transitive closure
• N-hop neighbors
Join operation:
• Matrices A [p×q] and B [q×r]
• C [p×r] = A × B
• Each (i,k)th element of C is $c_{ik} = \sum_{j=1}^{q} a_{ij} b_{jk}$
• A and B are represented by relations A and B in the database, with attributes {row, col, val}
• In terms of SQL:
   SELECT A.row, B.col, SUM(A.val * B.val)
   FROM A, B
   WHERE A.col = B.row
   GROUP BY A.row, B.col
Fig : Social Network (nodes User_1 to User_7)
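The join formulation can be checked on a small example. The sketch below stores each matrix as a list of (row, col, val) tuples and applies the join condition, the grouping, and the sum by hand. The helper names to_relation and join_multiply and the two 2×2 matrices are made up for the illustration.

```python
from collections import defaultdict


def to_relation(matrix):
    """Convert a dense matrix into (row, col, val) tuples, skipping zero entries."""
    return [
        (i, j, v)
        for i, row in enumerate(matrix)
        for j, v in enumerate(row)
        if v != 0
    ]


def join_multiply(rel_a, rel_b):
    """Multiply two matrices given as relations, mirroring the join query."""
    # Index B by row so that each A tuple finds its partners (A.col = B.row).
    b_by_row = defaultdict(list)
    for row, col, val in rel_b:
        b_by_row[row].append((col, val))
    # GROUP BY (A.row, B.col) with SUM(A.val * B.val).
    sums = defaultdict(int)
    for a_row, a_col, a_val in rel_a:
        for b_col, b_val in b_by_row[a_col]:
            sums[(a_row, b_col)] += a_val * b_val
    return sorted((i, k, v) for (i, k), v in sums.items())


A = [[1, 2], [0, 3]]
B = [[4, 0], [5, 6]]
C = join_multiply(to_relation(A), to_relation(B))
print(C)
```

Storing only the non-zero entries is what makes this representation attractive for the sparse adjacency matrices of large graphs.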
Matrix Multiplication (cont.)
Fig : Database representation
Matrix Multiplication (cont.)
Chain-way join:
• Eq. (1): serial two-way join (S2), the typical method. Each two-way join runs as a separate MR job, so only intra-operation parallelism is available.
• Eq. (2): parallel two-way join (P2). Inter-operation parallelism: A * B and C * D are computed simultaneously.
• Eq. (3): parallel m-way join (PM).
   A * B * C * D = (((A * B) * C) * D)   (1)
   A * B * C * D = ((A * B) * (C * D))   (2)
   A * B * C * D = (A * B * C * D)   (3)
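To see how the S2 and P2 groupings differ in the number of MR jobs, the following sketch builds the job schedule for each strategy as strings. It only models which operands are joined in which job, not the multiplication itself, and the function names s2_schedule and p2_schedule are invented for the example.

```python
def s2_schedule(names):
    """Serial two-way join: one two-way join per MR job, left to right."""
    jobs, left = [], names[0]
    for right in names[1:]:
        left = f"({left} * {right})"
        jobs.append([left])
    return jobs


def p2_schedule(names):
    """Parallel two-way join: each MR job joins all adjacent pairs at once."""
    jobs, current = [], list(names)
    while len(current) > 1:
        paired = [
            f"({current[i]} * {current[i + 1]})"
            for i in range(0, len(current) - 1, 2)
        ]
        if len(current) % 2:  # an unpaired last operand is carried to the next job
            paired.append(current[-1])
        jobs.append(paired)
        current = paired
    return jobs


for job in s2_schedule("ABCD"):
    print("S2:", job)
for job in p2_schedule("ABCD"):
    print("P2:", job)
```

For the four matrices A to D, S2 needs three jobs while P2 needs two, because A * B and C * D share the first job.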
Matrix Multiplication (cont.)
Parallel M-way join :
Number of MR jobs for the example with n = 5 matrices and m = 3:
• S2: n − 1 = 4
• P2: $\lceil \log_2 n \rceil$ = 3
• PM: $\lceil \log_m n \rceil$ = 2

Input: relations M1, M2, …, Mn representing matrices
LIST_Mnext <= M1, M2, …, Mn
while |LIST_Mnext| > 1 do
   for i = 1 to |LIST_Mnext| do
      if (i mod m) == 1 then
         add Mi to LIST_Mleft
         Mleft = Mi
      else
         add Mi to LIST_Mright(Mleft)
      end if
   end for
   LIST_Mnext = doMR-PM(LIST_Mleft, LIST_Mright)
end while
Fig : Algorithm for PM join

Fig : Example of parallel 3-way join. The 1st MR job joins M1, M2, M3 and M4, M5; the 2nd MR job joins the two intermediate results into the final result.
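The grouping logic of the PM algorithm can be traced with a short Python sketch. It follows the same "i mod m == 1" rule to start a new group, but replaces doMR-PM with a stand-in that only concatenates operand names, so it shows the number of MR jobs rather than performing any multiplication. The function name pm_rounds is an assumption of this example, and m is assumed to be at least 2.

```python
def pm_rounds(names, m):
    """Return the operand list produced by each MR job of a parallel m-way join."""
    rounds, current = [], list(names)
    while len(current) > 1:
        groups = {}  # maps each Mleft to its right-hand operands (LIST_Mright)
        left = None
        for i, name in enumerate(current, start=1):
            if i % m == 1:  # this operand starts a new m-way group (LIST_Mleft)
                left = name
                groups[left] = []
            else:
                groups[left].append(name)
        # Stand-in for doMR-PM: one MR job joins every group at the same time.
        current = [
            "(" + " * ".join([lhs] + rhs) + ")" if rhs else lhs
            for lhs, rhs in groups.items()
        ]
        rounds.append(current)
    return rounds


chain = ["M1", "M2", "M3", "M4", "M5"]
for number, result in enumerate(pm_rounds(chain, 3), start=1):
    print(f"MR job {number}: {result}")
```

With five matrices and m = 3 the loop runs twice, which agrees with the PM count of 2. Calling pm_rounds(chain, 2) gives the three jobs of P2.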
Matrix Multiplication (cont.)
Efficiency of m-way join:
• Fewer MR job iterations
• Shorter execution time
Limitations:
• Number of join keys
• Greater network and sorting overhead
Fig : PM Join key
Future study and Proposed Work
Future study:
• Amazon EC2
• Apache Whirr tools
• Larger graph datasets converted to matrices
• Hadoop, more papers
Proposed work:
• PM with the raw key.
• This improvement should reduce the number of duplications and increase the diversity of the join key.
• A MapReduce framework that does not perform sort operations in mappers.
Conclusion
In this presentation, I explained how the multiplication of matrices can be expressed as a multi-way join operation, and the implementation of three types of algorithm: S2, P2, and PM.
A parallel m-way join operation can improve the performance of the matrix chain multiplication process.
However, using the composite key introduces a number of disadvantages, such as greater network and sorting overhead.
Finally, I propose a parallel m-way join operation with a raw key to make it optimal.
References
• Apache Hadoop. Website. http://hadoop.apache.org
• http://www.sas.com/en_us/insights/big-data/hadoop.html
• Zikopoulos, P. C., Eaton, C., DeRoos, D., Deutsch, T., & Lapis, G. (2012). Understanding big data. New York et al: McGraw-Hill.
• Myung, J., & Lee, S. G. (2012, February). Matrix chain multiplication via multi-way join algorithms in MapReduce. In Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication (p. 53). ACM.
• Dean, J., & Ghemawat, S. MapReduce: Simplified data processing on large clusters.