Investigating Distributed Caching Mechanisms for Hadoop
Gurmeet Singh, Puneet Chandra, Rashid Tahir
GOAL
• Explore the feasibility of a distributed caching mechanism inside Hadoop
Presentation Overview
• Motivation
• Design
• Experimental Results
• Future Work
Motivation
• Disk access times are a bottleneck in cluster computing
• A large amount of data is read from disk
• Related systems: DARE, RAMClouds, PACMan (coordinated cache replacement)
We want to strike a balance between RAM and Disk Storage
Our Approach
• Integrate Memcached with Hadoop
• Used Quickcached and Spymemcached
• Reserve a portion of the main memory at each node to serve as a local cache
• Local caches aggregate to form a distributed caching mechanism governed by Memcached
• Greedy caching strategy
• Least Recently Used (LRU) cache eviction policy
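The greedy caching strategy and LRU eviction policy above can be sketched with a plain `LinkedHashMap` in access order; the class name `BlockCache` and the block-count capacity are illustrative stand-ins, not part of the actual implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative per-node block cache: greedy admission (every block read
// is cached) with LRU eviction once the reserved memory budget is full.
public class BlockCache {
    private final int capacityBlocks; // reserved RAM budget, in blocks (assumed unit)
    private final LinkedHashMap<String, byte[]> cache;

    public BlockCache(int capacityBlocks) {
        this.capacityBlocks = capacityBlocks;
        // accessOrder = true makes iteration order follow recency of use,
        // so the eldest entry is always the least recently used one.
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > BlockCache.this.capacityBlocks;
            }
        };
    }

    // Greedy policy: unconditionally cache every block that is read.
    public void put(String blockId, byte[] data) { cache.put(blockId, data); }

    public byte[] get(String blockId) { return cache.get(blockId); }

    public boolean contains(String blockId) { return cache.containsKey(blockId); }
}
```

Touching a block via `get` refreshes its recency, so a full cache evicts the least recently accessed block on the next insert.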
Design Overview
Memcached
Design Choice 1
• Simultaneous requests to the Namenode and Memcached
• Minimizes access latency at the cost of additional network overhead
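Design Choice 1 can be sketched as two lookups issued concurrently, preferring the cache answer on a hit. This is an assumed illustration: the `Supplier` arguments stand in for the Memcached and Namenode RPCs, which the original system performs over the network.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Illustrative sketch of Design Choice 1: fire the cache lookup and the
// Namenode lookup at the same time, then use the cache result on a hit.
public class ParallelLookup {
    public static String locateBlock(Supplier<Optional<String>> cacheLookup,
                                     Supplier<String> namenodeLookup) {
        CompletableFuture<Optional<String>> fromCache =
                CompletableFuture.supplyAsync(cacheLookup);
        CompletableFuture<String> fromNamenode =
                CompletableFuture.supplyAsync(namenodeLookup);
        // Both requests are already in flight; preferring the cache answer
        // hides metadata latency at the cost of one extra (possibly wasted) RPC.
        return fromCache.join().orElseGet(fromNamenode::join);
    }
}
```

On a cache hit the Namenode request is redundant, which is exactly the extra network overhead this design trades for latency.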
Design Choice 2
• Send a request to the Namenode only in the case of a cache miss
• Minimizes network overhead at the cost of increased latency
Design Choice 3
• Datanodes send requests only to Memcached
• Memcached checks for cached blocks
• If a cache miss occurs, it contacts the Namenode and returns the replicas’ addresses to the datanodes
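The Design Choice 3 flow above can be sketched as a cache-side resolver: datanodes query only the cache layer, which consults the Namenode itself on a miss and remembers the answer. This is a simplified illustration; the `HashMap` stands in for Memcached and the `Function` for the Namenode RPC.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch of Design Choice 3: the cache layer resolves block
// locations itself, contacting the Namenode only on a miss and caching
// the returned replica addresses for subsequent requests.
public class CacheSideResolver {
    private final Map<String, String> locations = new HashMap<>();
    private final Function<String, String> namenode; // blockId -> replica addresses
    private int misses = 0;

    public CacheSideResolver(Function<String, String> namenode) {
        this.namenode = namenode;
    }

    public String resolve(String blockId) {
        String addr = locations.get(blockId);
        if (addr == null) {          // cache miss: go to the Namenode once
            misses++;
            addr = namenode.apply(blockId);
            locations.put(blockId, addr);
        }
        return addr;
    }

    public int missCount() { return misses; }
}
```

Repeated lookups for the same block hit the cache, so the Namenode sees at most one request per block.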
Global Cache Replacement
• LRU-based global cache eviction scheme
Prefetching
Simulation Results
• Test data ranging from 2 GB to 24 GB
• Workloads: Word Count and Grep
[Figure: Network Overhead (%) vs Cache Size (GB) for Word Count]
[Figure: Cache Hit Ratio vs Cache Size (GB) for Word Count]
[Figure: Network Overhead (%) vs Cache Size (GB) for Grep]
[Figure: Cache Hit Ratio vs Cache Size (GB) for Grep]
Future Work
• Implement a pre-fetching mechanism
• Customized caching policies based on access patterns
• Compare and contrast caching with locality-aware scheduling
Conclusion
• Caching can improve the performance of cluster-based systems, depending on the access patterns of the workload being executed