Optimal Chain Matrix Multiplication: A Big Data Perspective

Presented by Pollab Kumar Roy

STUDY AND REPORT

Presentation Outline

Introduction

Big Data Overview
• Definition
• Three Vs
• Application

Introduction to Hadoop
• Architecture
• How it works
• Advantage

MapReduce
• What is MapReduce?
• The Algorithm
• Example Scenario

HDFS

Matrix Multiplication

Multi-Way Join

Proposed Work

Conclusions

Dept. of ICT, MBSTU

Introduction

Matrix multiplication is widely used in graph algorithms, such as those that calculate the transitive closure. MapReduce is well suited to implementing multi-way join operations on very large graphs and matrices.

This presentation gives an overview of Big Data, shows how matrix multiplication is represented in a database, and describes parallel multi-way matrix joins along with their benefits and limitations.

It closes with a proposal for making chain multiplication more efficient by using a raw join key.

Big Data Overview

Big data is a term for data sets whose size, complexity, and rate of growth make them difficult to capture, manage, and process with conventional technologies.

Big Data sources:

• Stock exchange data
• Social media data
• Black box data

The three dimensions of Big Data

Volume: About 5 billion GB of data had been produced up to 2003. The same amount was produced every two days in 2011 and every ten minutes in 2013.

Variety:
• Structured: relational data.
• Semi-structured: XML data.
• Unstructured: Word, PDF, text, media logs.

Velocity: The pace at which data flows in from sources and human interaction.

Big Data Application Segments

Analytics:
• Predictive modeling
• Decision processing
• Behavior analysis
• Demographics

Data Warehouse:
• Hosting
• Digitization/archive
• Backup
• Web 2.0

Engineering:
• Collaborating
• Design optimization
• Process flow
• Fluid dynamics
• 3D modeling

Introduction to Hadoop

Hadoop: An Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.

The name comes from a toy belonging to Doug Cutting's son.

Hadoop Architecture:

Two major layers.
• Processing layer: MapReduce (distributed computation)
• Storage layer: Hadoop Distributed File System, HDFS (distributed storage)

Other modules: YARN framework and Common utilities.

Introduction to Hadoop (cont.)

How Hadoop works: Core tasks across a cluster of computers
• Data is divided into directories and files, and files are split into blocks (128 MB or 64 MB).
• Files are then distributed across various cluster nodes.
• HDFS supervises the processing.
• Blocks are replicated.
• A sort is performed between the map and reduce stages.
• The sorted data is sent to a certain computer.
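The first few steps above (splitting a file into fixed-size blocks, distributing them, and replicating them) can be illustrated with a small sketch. The helper names `split_into_blocks` and `place_replicas`, the node names, and the replication factor of 3 are assumptions made for this example. The round-robin placement is a simplification and is not the placement policy HDFS actually uses.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes (in bytes) of the blocks a file is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Assign every block to `replication` distinct nodes, round-robin.

    Hypothetical placement for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(300 * MB)
    print([size // MB for size in blocks])
    print(place_replicas(len(blocks), ["node1", "node2", "node3", "node4"]))
```

A 300 MB file yields two full 128 MB blocks and one 44 MB block. Every block is stored on three different nodes, so losing one node does not lose data.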

Advantages:
• Low-cost alternative to building bigger servers.
• Fault tolerance and high availability.
• Dynamic clustering.
• Automatic data distribution and open source.

MapReduce

What is MapReduce: A processing technique and a programming model for distributed computing, based on Java.
• Mapper
• Shuffle
• Reducer
• Java based
• Key-value pairs

MapReduce (cont.)

The algorithm: the Mapper turns its input into key-value pairs, and the Reducer combines the values that share a key.

MapReduce (cont.)

Word Count Example:

Input files:
• Apple Orange Mango / Orange Grapes Plum
• Apple Plum Mango / Apple Apple Plum

Each line goes to an individual mapper:
• Apple Orange Mango
• Orange Grapes Plum
• Apple Plum Mango
• Apple Apple Plum

Map (key-value splitting):
• Apple,1 Orange,1 Mango,1
• Orange,1 Grapes,1 Plum,1
• Apple,1 Plum,1 Mango,1
• Apple,1 Apple,1 Plum,1

Sort and shuffle:
• Apple,1 Apple,1 Apple,1 Apple,1
• Grapes,1
• Mango,1 Mango,1
• Orange,1 Orange,1
• Plum,1 Plum,1 Plum,1

Reduce (produce key-value pairs):
• Apple,4
• Grapes,1
• Mango,2
• Orange,2
• Plum,3

Final output: Apple,4 Grapes,1 Mango,2 Orange,2 Plum,3
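The word count stages can be imitated in a few lines of plain Python. This is a single-process sketch of the map, shuffle, and reduce steps, not Hadoop code. The function names `map_line`, `shuffle`, `reduce_counts`, and `word_count` are chosen for this example only.

```python
from collections import defaultdict


def map_line(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Sort and shuffle: group the emitted values by key, in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_counts(key, values):
    """Reducer: sum the values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce one after another on a list of lines."""
    mapped = [pair for line in lines for pair in map_line(line)]
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


LINES = [
    "Apple Orange Mango",
    "Orange Grapes Plum",
    "Apple Plum Mango",
    "Apple Apple Plum",
]

if __name__ == "__main__":
    print(word_count(LINES))
```

In a real cluster each line would go to a separate mapper and each key to a reducer on possibly different machines. Here the same data flow runs sequentially.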

Hadoop Distributed File System (HDFS)

HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework.

Features:
• Distributed storage and processing
• Name Node
• Data Node
• Interface in Hadoop
• Streaming access
• Cluster status check

Hadoop Distributed File System (cont.)

Architecture: Data Node, Name Node, Block

Fig: HDFS architecture. Labels: Client; Name Node holding metadata (name, replica, ...; for example /home/foo/data, 3, ...); metadata ops; block ops; read; write; blocks; replication; Data Nodes in Rack 1 and Rack 2.

Matrix Multiplication (Via multi-way join)

Usage: Widely used in many graph algorithms
• Transitive closure
• N-hop neighbors

Fig: Social network (nodes User_1 to User_7)

Join operation:
• Matrices A [p×q] and B [q×r]
• C [p×r] = A × B
• Each (i,k)-th element of C is $c_{ik} = \sum_{j=1}^{q} a_{ij} b_{jk}$
• A and B are represented by relations A and B in the database, each with attributes {row, col, val}
• C in terms of SQL:

SELECT A.row, B.col, SUM(A.val * B.val)
FROM A, B
WHERE A.col = B.row
GROUP BY A.row, B.col
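The join formulation can be tried out on a small example with Python's built-in `sqlite3` module. This is a minimal sketch: the table names `A` and `B` and the helper `matmul_sql` are assumptions for the example, and the column names are quoted because `row` is a keyword in SQLite. Only non-zero elements need to be stored as (row, col, val) tuples.

```python
import sqlite3


def matmul_sql(a_rows, b_rows):
    """Multiply two matrices stored as (row, col, val) tuples via a SQL join."""
    conn = sqlite3.connect(":memory:")
    for name, rows in (("A", a_rows), ("B", b_rows)):
        conn.execute(
            f'CREATE TABLE {name} ("row" INTEGER, "col" INTEGER, val REAL)'
        )
        conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)
    result = conn.execute(
        'SELECT A."row", B."col", SUM(A.val * B.val) '
        "FROM A, B "
        'WHERE A."col" = B."row" '
        'GROUP BY A."row", B."col" '
        'ORDER BY A."row", B."col"'
    ).fetchall()
    conn.close()
    return result


# A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] as relations.
A_ROWS = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
B_ROWS = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]

if __name__ == "__main__":
    for row, col, val in matmul_sql(A_ROWS, B_ROWS):
        print(row, col, val)
```

The join condition pairs every element of A with the elements of B it has to be multiplied with. The GROUP BY then sums those products for each output cell.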

Matrix Multiplication (cont.)

Fig : Database representation


Chain join:

• Eq. (1): typical method, serial two-way join (S2). Each join is a separate MR job, with intra-operation parallelism only.
• Eq. (2): parallel two-way join (P2). Inter-operation parallelism: A * B and C * D are computed simultaneously.
• Eq. (3): parallel m-way join (PM).

A * B * C * D = (((A * B) * C) * D)    (1)
A * B * C * D = ((A * B) * (C * D))    (2)
A * B * C * D = (A * B * C * D)    (3)
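The groupings in Eq. (1) and Eq. (2) give the same product and differ only in which joins can run at the same time. The sketch below checks this on small sparse matrices stored as `{(row, col): val}` dictionaries. It runs in a single process, so it shows the order of the joins rather than real parallel MR jobs, and the names `join_multiply`, `serial_two_way`, and `parallel_two_way` are assumptions for this example.

```python
from collections import defaultdict


def join_multiply(left, right):
    """Two-way join: match left.col with right.row, then sum per (row, col)."""
    right_by_row = defaultdict(list)
    for (j, k), b in right.items():
        right_by_row[j].append((k, b))
    product = defaultdict(int)
    for (i, j), a in left.items():
        for k, b in right_by_row[j]:
            product[(i, k)] += a * b
    return dict(product)


def serial_two_way(matrices):
    """Eq. (1): fold the chain left to right, one join after another."""
    result = matrices[0]
    for matrix in matrices[1:]:
        result = join_multiply(result, matrix)
    return result


def parallel_two_way(matrices):
    """Eq. (2): join neighbouring pairs; the joins of one round are independent."""
    level = list(matrices)
    while len(level) > 1:
        joined = [
            join_multiply(level[i], level[i + 1])
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2 == 1:
            joined.append(level[-1])  # an unpaired matrix waits for the next round
        level = joined
    return level[0]


A = {(0, 0): 1, (0, 1): 2, (1, 1): 3}
B = {(0, 0): 4, (1, 0): 5, (1, 1): 6}
C = {(0, 1): 1, (1, 0): 2}
D = {(0, 0): 1, (1, 1): 1}

if __name__ == "__main__":
    print(serial_two_way([A, B, C, D]))
    print(parallel_two_way([A, B, C, D]))
```

Both functions return the same matrix because matrix multiplication is associative. The parallel variant needs fewer rounds because the joins inside a round do not depend on each other.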


Parallel m-way join:

Number of MR jobs for n = 5 matrices and m = 3:
• S2: n - 1 = 4
• P2: $\lceil \log_2 n \rceil$ = 3
• PM: $\lceil \log_m n \rceil$ = 2

Input: Relations M1, M2, ..., Mn representing matrices

LIST_Mnext <= M1, M2, ..., Mn
while |LIST_Mnext| > 1 do
    for i = 1 to |LIST_Mnext| do
        if (i mod m) == 1 then
            add Mi to LIST_Mleft
            Mleft = Mi
        else
            add Mi to LIST_Mright(Mleft)
        end if
    end for
    LIST_Mnext = doMR-PM(LIST_Mleft, LIST_Mright)
end while

Fig: Algorithm for PM join

Fig: Example of parallel 3-way join. The 1st MR job reduces M1, M2, M3, M4, M5 to the intermediate results M1 and M4; the 2nd MR job joins them into the final result M1.
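The grouping loop of the PM algorithm can be traced with a short sketch that only tracks matrix names and counts rounds; no actual join is executed. The function names `pm_rounds` and `s2_jobs` are assumptions for this example, and each intermediate result keeps the name of its leftmost input, as in the 3-way figure.

```python
def pm_rounds(n, m):
    """Return the groups joined in each MR job of a parallel m-way join."""
    if m < 2:
        raise ValueError("m must be at least 2")
    names = [f"M{i}" for i in range(1, n + 1)]
    rounds = []
    while len(names) > 1:
        # Every m-th matrix starts a new group (the "left" matrix); the ones
        # after it form that group's "right" list.
        groups = [names[i:i + m] for i in range(0, len(names), m)]
        rounds.append(groups)
        names = [group[0] for group in groups]  # result keeps the left name
    return rounds


def s2_jobs(n):
    """Serial two-way join: one MR job per multiplication."""
    return n - 1


if __name__ == "__main__":
    print("S2 jobs:", s2_jobs(5))
    print("P2 jobs:", len(pm_rounds(5, 2)))
    print("PM jobs:", len(pm_rounds(5, 3)))
    for number, groups in enumerate(pm_rounds(5, 3), start=1):
        print(f"MR job {number}:", groups)
```

Setting m = 2 reproduces the parallel two-way case, so the same function covers both P2 and PM.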


Efficiency of m-way join:
• Fewer MR job iterations
• Less time

Limitation:
• Number of join keys
• Greater network and sorting overhead

Fig: PM join key

Future study and Proposed Work

Future study:
• Amazon EC2
• Apache Whirr tools
• Converting larger graph datasets to matrices
• Hadoop and more papers

Proposed work:
• PM with the raw key.
• This improvement should reduce the number of duplications and increase the diversity of the join key.
• A MapReduce framework that does not perform sort operations in mappers.

Conclusion

In this presentation, I explained how matrix multiplication can be expressed as a multi-way join operation, and covered the implementation of three types of algorithms: S2, P2, and PM.

The parallel m-way join operation can improve the performance of the matrix chain multiplication process.

However, using the composite key introduces a number of disadvantages, such as greater network and sorting overhead.

Finally, I propose a parallel m-way join operation with the raw key to make it optimal.

References

• Apache Hadoop. Website. http://hadoop.apache.org
• http://www.sas.com/en_us/insights/big-data/hadoop.html
• Zikopoulos, P. C., Eaton, C., DeRoos, D., Deutsch, T., & Lapis, G. (2012). Understanding big data. New York et al: McGraw-Hill.
• Myung, J., & Lee, S. G. (2012, February). Matrix chain multiplication via multi-way join algorithms in MapReduce. In Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication (p. 53). ACM.
• Dean, J., & Ghemawat, S. MapReduce: Simplified data processing on large clusters.
