EXPOSE GOOGLE APP ENGINE AS TASKTRACKER NODES AND DATA NODES


INTRODUCTION
MOTIVATION
IMPLEMENTATION
  Core Logic (Map-Reduce Framework)
  Job Scheduling
  Load Balancing
HADOOP & GOOGLE APP ENGINE
CHALLENGES & ISSUES
PERFORMANCE ANALYSIS & RESULTS
QUESTIONS


GOOGLE APP ENGINE?
  PaaS (Platform as a Service)
  A platform for hosting web applications
  Virtualizes applications across multiple servers and Google-managed data centers


Project Description
  Distribute computation across multiple servers and share the load across them
  Use multiple accounts on App Engine
  Task Tracker runs on each account
  Job Tracker runs on a stand-alone machine


WHY GOOGLE APP ENGINE?
  WRITE THE CORE LOGIC OF THE APP & DEPLOY IT
  NO NEED TO WORRY ABOUT DATA CENTERS
  AUTOMATIC SCALING
  FREE UP TO A CERTAIN LIMIT
  PAY AS YOU GO BEYOND THAT LIMIT


WHAT WE DID?
  BUILT APPLICATIONS (INVERTED INDEX, WORDCOUNT, MOVIE RATINGS)
  BUILT MAP-REDUCE FUNCTIONS FOR THESE APPLICATIONS
  DEPLOYED THESE MAP/REDUCE FUNCTIONS ON TASK TRACKERS
  A JOB TRACKER, ACTING AS A MASTER, DISTRIBUTES DATA THROUGH URLFETCH
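
As an illustration of the map and reduce functions mentioned above, here is a minimal word-count sketch. It is written in Python for brevity and is not the project's code. The function names are hypothetical, words are assumed to be split on whitespace and lowercased, and the job tracker's URLFetch calls to the task trackers are replaced by local function calls.

```python
from collections import defaultdict


def map_word_count(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_word_count(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def run_word_count(chunks):
    """Stand-in for the job tracker.

    In the project each chunk would be sent to a task tracker over the
    network; here the map function is simply called locally (assumption).
    """
    intermediate = []
    for chunk in chunks:
        intermediate.extend(map_word_count(chunk))
    return reduce_word_count(intermediate)
```

Calling `run_word_count(["a b a", "b c"])` counts each word across both chunks, which mirrors how per-tracker map output is merged in the reduce step.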


PROVIDED A UI TO ENABLE THE USER TO UPLOAD INPUT DATA ON GOOGLE’S PERSISTENT STORAGE - BIGTABLE

LIBRARIES USED TO CONNECT TO THE PERSISTENT STORAGE : JDO/JPA

USER CAN CHOOSE THE APPLICATION TO BE RUN


JOB IS SUBMITTED TO JOB TRACKER
JOB TRACKER MAINTAINS A QUEUE OF JOBS
SCHEDULER
PRIORITY SCHEDULER
  THE USER CAN SPECIFY THE PRIORITY FOR THE JOB
  BASED ON IT, THE JOB WILL BE INSERTED INTO THE QUEUE
  USED WHEN THE USER SPECIFIES A PRIORITY


FIFO SCHEDULER
  THE SUBMITTED JOB IS INSERTED AT THE BACK OF THE QUEUE
  A JOB IS PICKED FROM THE FRONT, THUS RUNNING IN A FIFO FASHION
  DEFAULT SCHEDULER
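
The two scheduling policies described above (FIFO and priority) can be sketched as two small queue classes. This is a hedged illustration in Python, not the project's implementation. The class and method names are hypothetical, and it assumes that a lower number means a higher priority and that jobs with equal priority run in submission order.

```python
import heapq
from collections import deque
from itertools import count


class FifoScheduler:
    """Default policy: jobs join the back of the queue and leave from the front."""

    def __init__(self):
        self._queue = deque()

    def submit(self, job):
        self._queue.append(job)

    def next_job(self):
        return self._queue.popleft() if self._queue else None


class PriorityScheduler:
    """Policy used when the user supplies a priority for the job."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker: keeps equal priorities in FIFO order

    def submit(self, job, priority):
        # Assumption: a lower number means a higher priority.
        heapq.heappush(self._heap, (priority, next(self._order), job))

    def next_job(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```

The submission counter keeps the priority queue stable, so two jobs with the same priority never depend on how the job objects themselves compare.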


RESOURCE           | DAILY LIMIT (FREE) | MAX RATE (FREE)    | DAILY LIMIT (BILLED)                    | MAX RATE (BILLED)
REQUESTS           | 1,300,000 REQUESTS | 7,400 REQUESTS/MIN | 43,000,000 REQUESTS                     | 30,000 REQUESTS/MIN
OUTGOING BANDWIDTH | 1 GB               | 56 MB/MIN          | 1 GB FREE; 1,046 GB MAX                 | 740 MB/MIN
INCOMING BANDWIDTH | 1 GB               | 56 MB/MIN          | 1 GB FREE; 1,046 GB MAX                 | 740 MB/MIN
CPU TIME           | 6.5 CPU HOURS      | 15 CPU-MIN/MIN     | 6.5 CPU HOURS FREE; 1,729 CPU HOURS MAX | 72 CPU-MIN/MIN


WHY?
  EVERY ACCOUNT HAS A FIXED QUOTA
  DISTRIBUTION OF DATA ACROSS MULTIPLE TASK TRACKERS TO STAY WITHIN THE QUOTA
COST MODEL FOR LOAD BALANCING
  COST IS PROPORTIONAL TO THE AMOUNT OF DATA PROCESSED BY A TASK TRACKER
  DATA DIVIDED INTO EQUAL-SIZED CHUNKS AND SENT TO THE TASK TRACKER’S MAP FUNCTION
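
The chunking and cost model above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the project's code. The function names are hypothetical, the input is assumed to be a list of text records, and "amount of data" is assumed to mean UTF-8 bytes.

```python
def split_into_chunks(records, num_trackers):
    """Split records into num_trackers chunks whose sizes differ by at most one."""
    if num_trackers <= 0:
        raise ValueError("num_trackers must be positive")
    base, extra = divmod(len(records), num_trackers)
    chunks, start = [], 0
    for i in range(num_trackers):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(records[start:start + size])
        start += size
    return chunks


def tracker_costs(chunks):
    """Cost per task tracker, taken as the number of bytes it has to process."""
    return [sum(len(record.encode("utf-8")) for record in chunk) for chunk in chunks]
```

Comparing the values returned by `tracker_costs` against an account's remaining quota is one way a job tracker could decide whether the current number of task trackers is enough.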


HANDLING HUGE DATA SETS
  DATA DIVIDED INTO CHUNKS
  WHAT IF THE CHUNK SIZE IS HUGE?
  AT LEAST ONE OF THE TASK TRACKERS WILL FAIL, NO MATTER WHICH LOAD BALANCING ALGORITHM IS USED
  SOLUTION: DYNAMICALLY INCREASE THE NUMBER OF TASK TRACKERS IF ONE OF THEM FAILS AFTER A FIXED NUMBER OF TRIALS
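
The solution above can be sketched as a retry loop that adds a task tracker whenever a chunk keeps failing. This is a hedged Python sketch, not the project's implementation. All names are hypothetical, a failing task tracker is modelled as an exception, and the data is assumed to be re-split evenly each time the tracker count grows.

```python
class TaskTrackerFailure(Exception):
    """Raised by a (hypothetical) task tracker call that cannot handle its chunk."""


def split_evenly(records, parts):
    """Split records into `parts` chunks whose sizes differ by at most one."""
    base, extra = divmod(len(records), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks


def run_chunk_with_trials(process_chunk, chunk, max_trials):
    """Try one chunk up to max_trials times; re-raise the last failure."""
    for attempt in range(max_trials):
        try:
            return process_chunk(chunk)
        except TaskTrackerFailure:
            if attempt == max_trials - 1:
                raise


def run_job_with_scaling(records, process_chunk, trackers, max_trackers, max_trials=3):
    """Increase the number of task trackers until every chunk succeeds."""
    while trackers <= max_trackers:
        chunks = split_evenly(records, trackers)
        try:
            results = [run_chunk_with_trials(process_chunk, c, max_trials) for c in chunks]
            return trackers, results
        except TaskTrackerFailure:
            trackers += 1  # smaller chunks on the next round
    raise RuntimeError("job failed even with the maximum number of task trackers")
```

Each added tracker shrinks every chunk, so a chunk that exceeded one account's limits eventually fits, provided enough accounts are available.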


LIMITED CONTROL ON GOOGLE APP ENGINE
  NO SPAWNING OF THREADS
  INABILITY TO WRITE ON THE FILESYSTEM OF GOOGLE’S SERVER
  NO CONTROL ON DATA LOCALITY
    THE MACHINE ON WHICH DATA IS STORED IS DYNAMICALLY ALLOCATED BY GOOGLE
IN HADOOP, THREADS AND FILE I/O CAN BE USED
IMPLEMENTING HADOOP USING GOOGLE APP ENGINE IS DIFFICULT


DATA RETRIEVAL IS NOT IN THE SAME ORDER AS DATA STORAGE BECAUSE OF GOOGLE’S STORAGE ARCHITECTURE

NO CONTROL ON USAGE OF NETWORK BANDWIDTH BETWEEN THE JOB TRACKER AND TASK TRACKERS

JOIN AND UNION OPERATIONS ARE EXPENSIVE WHEN THE NUMBER OF TABLES INVOLVED IS LARGE.


RESULTS ARE THE SAME AS WHEN RUNNING THE APPLICATION ON HADOOP.

TESTED THE WORDCOUNT APPLICATION ON A DATA SET CONSISTING OF 10,000 WORDS USING 3 TASK TRACKERS.

NETWORK BANDWIDTH IS A BOTTLENECK IN THE RUNTIME OF THE APPLICATION, AS DATA HAS TO BE TRANSFERRED FROM THE TASK TRACKERS TO THE JOB TRACKER AND VICE VERSA.
