
Study of Map Reduce and related technologies

Applied management Research Project

(January 2011 - April 2011)

Vinod Gupta School of Management

IIT Kharagpur

Submitted By

Renjith Peediackal

Roll no. 09BM8040

MBA, Batch of 2009-11

Under the guidance of

Prof. Prithwis Mukherjee

Project Guide

VGSOM, IIT Kharagpur


Certificate of Examination

This is to certify that we have examined the project report on “Study of Map Reduce and related technologies” and accept it in partial fulfillment of the MBA degree for which it has been submitted. This approval does not endorse or accept every statement made, opinion expressed or conclusion drawn, as recorded in this report; it is limited to the acceptance of the project report for the applicable purpose.

Examiner (External) ____________________________________________________

Additional Examiner ____________________________________________________

Supervisor (Academic) __________________________________________________

Date: ________________________________________________________________


UNDERTAKING REGARDING PLAGIARISM

This declaration is for the purpose of the Applied Management Research Project on Study of Map Reduce and related technologies, which is in partial fulfillment of the degree of Master of Business Administration.

I hereby declare that the work in this report is authentic and genuine to the best of my knowledge, and I would be liable to appropriate punitive action if found otherwise.

Renjith Peediackal

Roll No. 09BM8040

Vinod Gupta School of Management

IIT Kharagpur


ACKNOWLEDGEMENT

I am highly thankful to Professor Prithwis Mukherjee of Vinod Gupta School of Management, IIT Kharagpur, for providing me the chance to carry out my work on map reduce technology as a review paper for this project. I am also thankful to him for the time-to-time interactions and guidance that he provided me in the course of this project. I also thank my classmates in the course ‘IT for BI’ for giving me support and feedback on my work.

Renjith Peediackal,

VGSOM, IIT Kharagpur


Contents

UNDERTAKING REGARDING PLAGIARISM

ACKNOWLEDGEMENT

EXECUTIVE SUMMARY

INTRODUCTION

THE CASE FOR MAP REDUCE

The complexity of input data

Information is sparse in the humungous data

Appetite for personalization

EXAMPLE OF A PROGRAMME USING INTERNET DATA

Recommendation systems

Some of the problems with recommendation systems

Popularity

Integration of expert opinion

Integration of internet data

WHAT IS DATA IN FLIGHT?

Map and reduce

HOW DOES A MAP REDUCE PROGRAM WORK?

Map

Shuffle or sort

Reduce

How is this new way helpful to us in accommodating complex internet data for our recommendation system?

Brute power

Fault tolerance

Criticisms

Why is it still valuable?

An example

Sequential web access-based recommendation system

PARALLEL DB VS MAPREDUCE


SUMMARY OF ADVANTAGES OF MR

WHAT IS ‘HADOOP’?

HYBRID OF MAP REDUCE AND RELATIONAL DATABASE

PIG

WHAT DOES HIVE DO?

NEW MODELS: CLOUD

PATENT CONCERNS

SCOPE OF FUTURE WORK

REFERENCES


Executive Summary

The data explosion that has happened on the internet has forced technology wizards to redesign the tools and techniques used to handle that data. ‘Map Reduce’, ‘Hadoop’ and ‘data in flight’ have been buzzwords in the last few years. Much of the information available about these technologies is either purely technology oriented or comes from the marketing documents of companies which produce tools based on the technology. While the technological language is difficult for a normal business user to understand, the marketing documents tend to be superficial. We therefore see the need for an article which explains the technology in a lucid and simple way, so that business users can understand the importance and future prospects of this movement; the author adopted this mission as part of the Applied Management Research Project in the 4th semester.

Web data available today is mostly dynamic, sparse and unstructured, and processing it with existing techniques like parallel relational database management systems is nearly impossible. The programming effort needed to do this the older way is humungous, and whether anybody would be able to do it practically is an open question. New technologies like ‘Hadoop’ create a platform for doing parallel data processing in an easier and more cost-effective way. In this research paper, the author tries to explain ‘Hadoop’ and related technologies in a simple way. In future, the author can continue similar efforts to demystify many of the tools and techniques coming from the Hadoop community and make them friendly for the business community.


Introduction

Web data available today is mostly dynamic, sparse and unstructured, and most of the people who generate or use this content are customers or prospects of business organizations. Therefore the management community, especially marketers, is eagerly looking forward to analyzing the information coming out of this internet data, so as to personalize their products and services. Processing this data can be vital for governments and social organizations too.

Earlier systems like parallel relational databases were not efficient at processing web data. ‘Data in flight’ refers to the technology of capturing and processing dynamic data as and when it is produced and consumed. Google introduced the map reduce paradigm for efficiently processing web data, and the Apache organization, following in the footsteps of map reduce, gave birth to an open source movement, ‘Hadoop’, to develop programs and tools to process internet data.

In this review paper, a humble attempt has been made to explain this technology in a simple and clear manner. I hope the readers can appreciate the effort.


The case for Map Reduce

The complexity of input data

In today’s content-rich world, the enormity and complexity of the data available to decision makers is huge. When we add the internet to the picture, the volume explodes. And this data is unstructured, produced and consumed by random people in different parts of the globe.

Information is sparse in the humungous data

The information is actually hidden in the sea of data produced and consumed.

Appetite for personalization

The perpetual problem of the marketer is personalizing products to attract and retain customers. Marketers look forward to information that can be derived from the internet to understand their customers better.

Example of a programme using internet data

Recommendation systems

Take the example of a recommendation system. Customer Y buys product X from an e-commerce site after going through a number of products X1, X2, X3, X4. And this behavior is repeated by 100 customers! And there are customers who dropped out of shopping after going through X1 to X3.


Should we have suggested X to those people before they exited the site? But if we recommend the wrong product, how will it affect the buying decision?

Questions like these also arise: “Why am I selling X number of figurines, and not Y [a higher number] number of figurines? What pages are the people who buy things reading? If I’m running promotions, which Web sites are sending me traffic with people who make purchases?” [Avinash Kaushik, analytics expert]. And also: What kind of content do I need to develop on my site so as to attract the right set of people? On what kinds of sites should your URL be present so that you get the maximum number of referrals? How many visitors quit after seeing the homepage? What different kind of design could make them go forward? Are users clicking on the right links in the right fashion on your website (site overlay)? What is the bounce rate? How can we save money on PPC schemes? And how can we get data regarding competitors’ marketing activities and how they influence customers?

Many people are using the net to sell, or at least market, their products and services. How can we analyze user behavior in such cases to benefit our organization? An example: using the information from something like a Facebook shopping page.

And the most critical question for an analytics professional is: can we make robust algorithms to predict such behavior? How much data do we need to process? Is point-of-sale history enough?

Some of the problems with recommendation systems


Popularity

Recommendation systems have a basic logic of processing historical data on customer behaviour and predicting future behavior. So naturally the most “popular” products top the list of recommendations. Here popular means previously favored the most by the same segment of customers, for a particular product category. This is dangerous to companies in many ways:

1. Customers need not be satisfied perpetually by the same products.

2. Companies have to create niche products and up-sell and cross-sell them to customers to satisfy them, retain them and thus be successful in the market. A popularity-based system ruins these possibilities of exploration!

3. The opportunity of selling a product is lost!

4. Lack of personalization leads to broken relations.


Integration of expert opinion

How do we integrate the inferences given by experts in a particular business domain into the process of statistical analysis? This is a mix of art with science, and nobody knows the right blend. Will this undermine the basic concept of data-driven decision making?

Integration of internet data

To attack the current deficiencies, one approach is to use more data to increase the scope and relevance of the analysis. The users of mail, social networking groups, or niche groups for specific purposes are all consumers and potential customers. Bloggers and information providers can also play the role of opinion leaders. So if we are able to integrate the information scattered over the internet into our decision-making models, we can better understand the different segments of customers and their current and future needs.

But the fact is that the data available on the internet is not at all friendly to statistical analysis using current DBMS technology at a normal level of cost. It is dynamic, sparse and mostly unstructured. Also, the question of who could program such a system to handle web data using current technology is difficult to answer.


Growth of data: mindboggling.

– Published content: 3-4 Gb/day

– Professional web content: 2 Gb/day

– User generated content: 5-10 Gb/day

– Private text content: ~2 Tb/day (200x more)

(Ref: Raghu Ramakrishnan, http://www.cs.umbc.edu/~hillol/NGDM07/abstracts/slides/Ramakrishnan_ngdm07.pdf)

Questions asked of this data, by very demanding business users:

– Can we do Analytics over Web Data / User Generated Content?

– Can we handle TB of text data / GB of new data getting added to the internet each day?


– Structured Queries, Search Queries for unstructured data?

– Can we expect the tools to give answers at “Google-Speed”?

That gives us a strong case for adopting the new technology of data in flight. ‘Map Reduce’ is a technology developed by Google for similar purposes. These technologies are explained one by one in the following paragraphs.

What is Data in flight?

Earlier data was at ‘rest’!

In the normal concept of a DBMS, data is at rest, and the queries hit this static data and fetch results.

Now data is just flying in!

The new concept of ‘data in flight’ envisages the already prepared query as static, collecting dynamic data as and when it is produced and consumed. But this needs more powerful tools to handle the enormity, complexity and sparse nature of the information contained in the data.

Map and reduce.

A map operation is needed to translate the sparse information available in numerous formats into a form which can be processed easily by an analytical tool. Once the information is in a simpler and structured form, it can be reduced to the required results.


A standard example:

Word count!

Given a document, how many of each word are there?

But in the real world it can be:

• Given our search logs, how many people click on result 1?

• Given our Flickr photos, how many cat photos are there by users in each geographic region?

• Given our web crawl, what are the 10 most popular words?

Word count and Twitter

• Tweets can be used to get early warnings of epidemics like swine flu.

• Tweets can be used to understand the ‘mood’ of people in a region, which can be used for different purposes, even subliminal marketing.

• The software created by Dr Peter Dodds and Dr Chris Danforth of the University of Vermont collects sentences from blogs and ‘tweets’, zeroing in on the happiest and saddest days of the last few years.

• Can it prevent social crises?

How does a map reduce program work?

The programmer has to specify two methods, map and reduce; the rest is taken care of by the platform.

Map

map (k, v) -> <k', v'>*

1. Specify a map function that takes a key (k) / value (v) pair, e.g. key = document URL, value = document contents:

– “document1”, “to be or not to be”

2. The output of map is (potentially many) key/value pairs: <k', v'>*

3. In our case, output (word, “1”) once per word in the document:

– “to”, “1”

– “be”, “1”


– “or”, “1”

– “to”, “1”

– “not”, “1”

– “be”, “1”

Shuffle or sort

Items which are logically near are brought physically near to each other.

– “to”, “1”

– “to”, “1”

– “be”, “1”

– “be”, “1”

– “not”, “1”

– “or”, “1”

Reduce

reduce (k', <v'>*) -> <k', v'>*

• The reduce function combines the values for a key


– “be”, “2”

– “not”, “1”

– “or”, “1”

– “to”, “2”
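The whole pipeline can be sketched in a few lines of plain Python. This is only an illustration of the idea and not Hadoop code: the helper names (map_doc, shuffle, reduce_counts) are invented for this sketch, and a real job would be written against the Hadoop API.

    from collections import defaultdict

    def map_doc(key, value):
        # Map: emit (word, 1) once per word in the document.
        for word in value.split():
            yield (word, 1)

    def shuffle(pairs):
        # Shuffle/sort: bring all values for the same key together.
        grouped = defaultdict(list)
        for k, v in pairs:
            grouped[k].append(v)
        return grouped

    def reduce_counts(key, values):
        # Reduce: combine the values for a key.
        return (key, sum(values))

    pairs = map_doc("document1", "to be or not to be")
    grouped = shuffle(pairs)
    print(sorted(reduce_counts(k, vs) for k, vs in grouped.items()))
    # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]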

For different use cases the functions within map and reduce differ, but the architecture and the supporting platform remain the same. For example, our recommendation system can use a map function that turns any meaningful association between two product names X and Y in any online discussion into a (product pair, count) record. Later, in the reduce step, based on the count of such combinations, the information can be reduced into a suggestion of X given Y. This is a very simplified explanation of the algorithms; refer to papers on algorithms such as parallel affinity propagation to understand the complexity of such a process. This is a new way of writing algorithms. Let us see how it helps to accommodate humungous internet data.
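A toy sketch of that pair-counting idea in plain Python (the sample data and helper names are invented for illustration; this is not the parallel affinity propagation algorithm itself):

    from collections import defaultdict
    from itertools import combinations

    # Invented sample input: products mentioned together in each online discussion.
    discussions = [{"X", "Y", "Z"}, {"X", "Y"}, {"Y", "Z"}]

    def map_discussion(products):
        # Map: emit ((product, product), 1) for every pair mentioned together.
        for pair in combinations(sorted(products), 2):
            yield (pair, 1)

    # Shuffle + reduce: sum the counts for each pair.
    counts = defaultdict(int)
    for d in discussions:
        for pair, one in map_discussion(d):
            counts[pair] += one

    # Reduce the counts into suggestions: "given Y, suggest X".
    for (x, y), n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(f"mentioned together {n} times: given {y}, suggest {x}")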

How is this new way helpful to us in accommodating complex internet data for our recommendation system?

Brute power

One view could be that map reduce algorithms are effective because they:

– use the brute power of many machines to map the huge chunk of sparse data into a small table of dense data;

– do the complex and time-consuming part of the “task” on the new, small and dense data in the reduce step;

– that is, they separate the huge data from the time-consuming part of the algorithm, albeit a lot of disk space is utilized.

Fault tolerance

There are two aspects to fault tolerance:

1. Transactions should not be affected by machine failure. One approach is to make the machine less prone to failure, and to ask the user to restart the transaction from scratch in case of failure. Saving transaction data permanently somewhere (to save some of the steps) would reduce the speed of the transaction. Normal DB operation gives emphasis to this kind of fault tolerance in its implementation.

Fault tolerance: RDBMS school of thought


2. Complex query execution. Here the only point of fault tolerance is that the same query should be processed again from whichever state it can be restarted. Saving data in intermediate forms can be desirable. Map Reduce gives emphasis to this kind of tolerance in its implementation; the extra time taken to save intermediate results permanently is offset by parallelization.

Fault tolerance: MR school of thought

Hierarchy of Parallelism:


Effectively, the Map Reduce philosophy of fault tolerance can be summarized in the following diagram.

Cycle of brute force fault tolerance

Criticisms

1. A giant step backward in the programming paradigm for large-scale data-intensive applications

2. A sub-optimal implementation, in that it uses brute force instead of indexing

3. Not novel at all: it represents a specific implementation of well-known techniques developed 25 years ago

4. Missing most of the features in current DBMSs

5. Incompatible with all of the tools DBMS users have come to depend on

Why is it still valuable?

The biggest reason is that it helps programmers to logically manage the complexity of the data.

Also, the intermediate permanent writes magically enable two wonderful features we need for our recommendation system:

1. It raises fault tolerance to such a level that we can employ millions of cheap computers to get our work done. Searching entire social networks to find any mention of our product by different segments of customers can be assigned to numerous computers and done fast enough.

2. It brings dynamism and load balancing. If one of the social networks has denser data and our set of machines takes a very long time to process it, we can reassign the task to many more free machines to speed up the process. This is essential when we don’t know the nature of the data we are going to process.

So the parallelism achieved by a parallel DBMS (normally done by dividing an execution plan across a limited number of processors) stands to be nothing in front of what is achieved by map reduce.


• At large scales, super-fancy reliable hardware still fails, albeit less often.

• Software still needs to be fault-tolerant.

• Commodity machines without fancy hardware give better perf/$.

• Using more memory to speed up querying has its own implications for tolerance and cost.

An example

This example invites you to the complex world of the sequential web access-based recommendation system. Its working is explained in the following diagram:


Sequential web access-based recommendation system

It goes through web server logs, mines the patterns in the access sequences and then creates a pattern tree. The pattern tree is continuously modified using data from different servers. [Zhou et al]

System architecture of sequential web access system [Zhou et al]

The target is to create a pattern tree.

And when a particular user has to be served a suggestion:


– his access pattern is compared with the entire tree of patterns;

– the portions of the tree most similar to the user’s pattern are selected; and

– the branches of those nodes are suggested.

Some details of the algorithm

• Let E be a set of unique access events, which represent the web resources accessed by users, i.e. web pages, URLs, topics or categories.

• A web access sequence S = e1e2...en is an ordered collection (sequence) of access events.

Suppose we have a set of web access sequences over the set of events E = {a, b, c, d, e, f}. A sample database will look like this:

Session ID Web access sequence

1 abdac

2 eaebcac

3 babfae

4 abbacfc


Access events can be classified into frequent and infrequent, based on whether their frequency crosses a threshold level, and a tree consisting of the frequent access events can be created. With a support threshold of 3 sessions, the frequent sequential patterns in the sample database are:

Length of sequence    Sequential web access patterns with support

1                     a:4, b:4, c:3

2                     aa:4, ab:4, ac:3, ba:4, bc:3

3                     aac:3, aba:4, abc:3, bac:3

4                     abac:3
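These support counts can be verified with a small brute-force sketch in plain Python (purely illustrative; a real miner such as the one described by [Zhou et al] avoids enumerating every subsequence):

    from itertools import combinations

    sessions = ["abdac", "eaebcac", "babfae", "abbacfc"]
    MIN_SUPPORT = 3  # a pattern is "frequent" if it occurs in >= 3 sessions

    def contains(session, pattern):
        # True if pattern occurs in session as an ordered subsequence.
        it = iter(session)
        return all(event in it for event in pattern)

    # Enumerate every subsequence of every session, then keep the frequent ones.
    candidates = set()
    for s in sessions:
        for length in range(1, len(s) + 1):
            for idx in combinations(range(len(s)), length):
                candidates.add("".join(s[i] for i in idx))

    for p in sorted(candidates, key=lambda c: (len(c), c)):
        support = sum(contains(s, p) for s in sessions)
        if support >= MIN_SUPPORT:
            print(f"{p}:{support}")
    # Prints a:4, b:4, c:3, aa:4, ab:4, ac:3, ba:4, bc:3, ..., abac:3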

Finally the tree is formed:


The Map and reduce

1. A map job can be designed to process the logs and create the pattern tree.

2. The task is divided among thousands of cheap machines using the Map Reduce platform.

3. The dynamic-data, static-query model of data in flight is very helpful for modifying the main tree.

4. The tree structure can be stored efficiently by altering the physical storage through sorting and partitioning.

5. Then, based on the user’s access pattern, we have to select a few parts of the tree. This can be designed as a reduce job which runs across the tree data.

DBMS for the same case?

• Map

– A huge database of access logs would have to be uploaded to a DB, and then updated at regular intervals to reflect the changes in site usage.

– Then a query would have to be written to get a tree-like data structure out of this data behemoth, which changes shape continuously!

– An execution plan, simplistic and non-dynamic in nature, would have to be made. Ineffective.

– It would have to be divided among many parallel engines, and this requires expertise in parallel programming.

• Reduce

– During the reduce phase, the entire tree has to be searched for the existence of resembling patterns.

– This also will be ineffective in an execution-plan-driven model, as explained above.


• And with the explosion of data, and the increased need for personalization in recommendations, map reduce becomes the most suitable pattern.

Parallel DB vs MapReduce

• RDBMS is good when the application is query-intensive, whether the data is semi-structured or rigidly structured.

• MR is effective for:

– ETL and “read once” data sets

– Complex analytics

– Semi-structured and unstructured data

– Quick-and-dirty analyses

– Limited-budget operations

Summary of advantages of MR

• Storage system independence

• Automatic parallelization

• Load balancing

• Network and disk transfer optimization

• Handling of machine failures

• Robustness

• Improvements to the core library benefit all users of the library!

• Ease of use for programmers!

What is ‘Hadoop’?

Based on the Map Reduce paradigm, the Apache foundation has given rise to a program for developing tools and techniques on an open source platform. This program and the resultant technology are referred to as ‘Hadoop’.

The icon of Hadoop

Hybrid of Map Reduce and Relational Database


As pointed out earlier, RDBMS technology is suitable when the data to be processed is structured. Many of the intermediate forms produced while processing web data can also be structured. For that reason, many researchers have come up with hybrid systems, where the unstructured data is taken care of by the map reduce platform and the structured part is handled by an RDBMS. HadoopDB is one such proposed system.

Pig

If we have established a Hadoop infrastructure, can we use it for processing repetitive tasks? How can we use it as efficiently as a relational database handles repetitive tasks? The answer is a programming language called ‘Pig’. Pig helps us to write an execution plan, just like the ones we have in a relational database.


• Pig allows the programmer to control the flow of data through prewritten code. It is suitable when the tasks are repetitive and the plans can be envisaged early on.

What does Hive do?

Users of databases are often not technology masters. They may be familiar with existing platforms, and these platforms tend to generate SQL-like queries. We need a program to convert these traditional SQL queries into MapReduce jobs, and the one created by the Hadoop movement is Hive. So from the outside it looks like an SQL query, or a button click which generates an SQL query, but the underlying processing is done the map reduce way.
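A toy demonstration of the idea in Python, with SQLite standing in for the SQL side (illustrative only; Hive’s actual compiler is far more sophisticated than this):

    import sqlite3
    from collections import defaultdict

    words = ["to", "be", "or", "not", "to", "be"]

    # What the business user writes: a plain SQL group-by query.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (word TEXT)")
    conn.executemany("INSERT INTO documents VALUES (?)", [(w,) for w in words])
    sql_result = conn.execute(
        "SELECT word, COUNT(*) FROM documents GROUP BY word ORDER BY word"
    ).fetchall()

    # What would run underneath in Hive's world: the word-count map/reduce pair.
    counts = defaultdict(int)
    for w in words:      # map: emit (word, 1); the shuffle groups by word
        counts[w] += 1   # reduce: sum the values for each key
    mr_result = sorted(counts.items())

    assert sql_result == mr_result   # same answer, two execution models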


New models: cloud

1. Many services are available on the cloud, like Amazon Web Services (Amazon Elastic Compute Cloud: http://aws.amazon.com/ec2/).

2. The user gets MR services by entering the input text or site name, the required output, etc., without going into the technical details.

3. Almost infinite scalability.

4. New business models which are efficient.

Patent Concerns

Excerpts from a Slashdot comment on Jan 19, 2011:

“But the very public complaints didn’t stop Google from demanding a patent for MapReduce; nor did it stop the USPTO from granting Google’s request (after four rejections). On Tuesday, the USPTO issued U.S. Patent No. 7,650,331 to Google for inventing Efficient Large-Scale Data Processing.”


• Will Google enforce the patent?

• If it does, it will hamper the growth of the Hadoop community.

Scope of future work

Hadoop is an open source movement, so the author can contribute to the Hadoop user community by studying more about the tools and techniques and sharing the findings on the internet. Implementing Hadoop in a lab and understanding many detailed aspects of the map reduce environment can also be done in the future; this is essential to get a complete picture of these new technologies. New use cases can also be developed to utilize the power of Hadoop to solve business problems.

References

1. Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin, “MapReduce and Parallel DBMSs: Friends or Foes?”, Communications of the ACM, January 2010.

2. Xin Tan, David C. Yen, Xiang Fang, “Web warehousing: Web technology meets data warehousing”.

3. “Clouds, big data, and smart assets: Ten tech-enabled business trends to watch”, McKinsey Quarterly.

4. Wenhua Wang, Hanwang Zhang, Fei Wu, and Yueting Zhuang, “Large Scale of E-learning Resources Clustering with Parallel Affinity Propagation”, City University of Hong Kong, Hong Kong, August 2008.

5. Jeff Dean (Google, Inc.), “Experiences with MapReduce, an Abstraction for Large-Scale Computation”.

6. Azza Abouzeid, Kamil Bajda-Pawlikowski, et al., “HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads”, August 2009.

7. Kyriacos Talattinis, Aikaterini Sidiropoulou, Konstantinos Chalkias, and George Stephanides, “Parallel Collection of Live Data Using Hadoop”, Department of Applied Informatics, University of Macedonia, Thessaloniki, Greece.

8. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu and Raghotham Murthy, “Hive – A Petabyte Scale Data Warehouse Using Hadoop”.

9. Ullas Nambiar, Rajeev Gupta, Himanshu Gupta and Mukesh Mohania, “Massive Structured Data Management Solution”, IBM Research – India.

10. Alexander Löser, Fabian Hueske, and Volker Markl, “Situational Business Intelligence”, Database Systems and Information Management Group, TU Berlin.

11. Alexander Löser, “An Intelligent Recommender System using Sequential Web Access Patterns”, http://user.cs.tu-berlin.de/~aloeser

12. http://hadoop.apache.org/

13. http://en.wikipedia.org

14. http://cloudera.com

15. Slashdot.org

16. Amazon.com