Investigating Distributed Caching Mechanisms for Hadoop
Gurmeet Singh, Puneet Chandra, Rashid Tahir
GOAL
• Explore the feasibility of a distributed caching mechanism inside Hadoop
Presentation Overview
• Motivation
• Design
• Experimental Results
• Future Work
Motivation
• Disk access times are a bottleneck in cluster computing
• A large amount of data is read from disk
• Related systems: DARE, RAMClouds, PACMan (coordinated cache replacement)
We want to strike a balance between RAM and Disk Storage
Our Approach
• Integrate Memcached with Hadoop
• Used Quickcached and Spymemcached
• Reserve a portion of the main memory at each node to serve as a local cache
• Local caches aggregate to form a distributed caching mechanism governed by Memcached
• Greedy caching strategy
• Least Recently Used (LRU) cache eviction policy
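The greedy caching strategy and LRU eviction policy above can be sketched with a plain `LinkedHashMap` in access order; the class name `BlockCache` and the block-count capacity are illustrative stand-ins, not part of the actual implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative per-node block cache: greedy admission (every block read
// is cached) with LRU eviction once the reserved memory budget is full.
public class BlockCache {
    private final int capacityBlocks; // reserved RAM budget, in blocks (assumed unit)
    private final LinkedHashMap<String, byte[]> cache;

    public BlockCache(int capacityBlocks) {
        this.capacityBlocks = capacityBlocks;
        // accessOrder = true makes iteration order follow recency of use,
        // so the eldest entry is always the least recently used one.
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > BlockCache.this.capacityBlocks;
            }
        };
    }

    // Greedy policy: unconditionally cache every block that is read.
    public void put(String blockId, byte[] data) { cache.put(blockId, data); }

    public byte[] get(String blockId) { return cache.get(blockId); }

    public boolean contains(String blockId) { return cache.containsKey(blockId); }
}
```

Touching a block via `get` refreshes its recency, so a full cache evicts the least recently accessed block on the next insert.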
Design Overview
Memcached
Design Choice 1
• Simultaneous requests to the Namenode and Memcached
• Minimizes access latency at the cost of additional network overhead
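Design Choice 1 can be sketched as two lookups issued concurrently, preferring the cache answer on a hit. This is an assumed illustration: the `Supplier` arguments stand in for the Memcached and Namenode RPCs, which the original system performs over the network.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Illustrative sketch of Design Choice 1: fire the cache lookup and the
// Namenode lookup at the same time, then use the cache result on a hit.
public class ParallelLookup {
    public static String locateBlock(Supplier<Optional<String>> cacheLookup,
                                     Supplier<String> namenodeLookup) {
        CompletableFuture<Optional<String>> fromCache =
                CompletableFuture.supplyAsync(cacheLookup);
        CompletableFuture<String> fromNamenode =
                CompletableFuture.supplyAsync(namenodeLookup);
        // Both requests are already in flight; preferring the cache answer
        // hides metadata latency at the cost of one extra (possibly wasted) RPC.
        return fromCache.join().orElseGet(fromNamenode::join);
    }
}
```

On a cache hit the Namenode request is redundant, which is exactly the extra network overhead this design trades for latency.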
Design Choice 2
• Send a request to the Namenode only in the case of a cache miss
• Minimizes network overhead at the cost of increased latency
Design Choice 3
• Datanodes send requests only to Memcached
• Memcached checks for cached blocks
• If a cache miss occurs, it contacts the Namenode and returns the replicas’ addresses to the datanodes
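The Design Choice 3 flow above can be sketched as a cache-side resolver: datanodes query only the cache layer, which consults the Namenode itself on a miss and remembers the answer. This is a simplified illustration; the `HashMap` stands in for Memcached and the `Function` for the Namenode RPC.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch of Design Choice 3: the cache layer resolves block
// locations itself, contacting the Namenode only on a miss and caching
// the returned replica addresses for subsequent requests.
public class CacheSideResolver {
    private final Map<String, String> locations = new HashMap<>();
    private final Function<String, String> namenode; // blockId -> replica addresses
    private int misses = 0;

    public CacheSideResolver(Function<String, String> namenode) {
        this.namenode = namenode;
    }

    public String resolve(String blockId) {
        String addr = locations.get(blockId);
        if (addr == null) {          // cache miss: go to the Namenode once
            misses++;
            addr = namenode.apply(blockId);
            locations.put(blockId, addr);
        }
        return addr;
    }

    public int missCount() { return misses; }
}
```

Repeated lookups for the same block hit the cache, so the Namenode sees at most one request per block.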
Global Cache Replacement
• LRU-based global cache eviction scheme
Prefetching
Simulation Results
• Test data ranging from 2 GB to 24 GB
• Workloads: Word Count and Grep
[Figure: Network Overhead (%) vs Cache Size (GB) for Word Count]
[Figure: Cache Hit Ratio vs Cache Size (GB) for Word Count]
[Figure: Network Overhead (%) vs Cache Size (GB) for Grep]
[Figure: Cache Hit Ratio vs Cache Size (GB) for Grep]
Future Work
• Implement a pre-fetching mechanism
• Customized caching policies based on access patterns
• Compare and contrast caching with locality-aware scheduling
Conclusion
• Caching can improve the performance of cluster-based systems, depending on the access patterns of the workload being executed