Twister4Azure: Iterative MapReduce for Azure Cloud
CCA 2011, April 12 – 13, 2011
Thilina Gunarathne, Judy Qiu, Geoffrey Fox {tgunarat, xqiu, gcf}@indiana.edu
MapReduceRoles for Azure
Familiar MapReduce programming model
Built using highly available and scalable Azure cloud services
Co-exists with the eventual consistency and high latency of cloud services
Decentralized control
No single point of failure
Supports dynamic scaling up and down of the compute resources
MapReduce fault tolerance
Twister for Azure
[Figure: iterative MapReduce flow. Job Start, then Map/Combine tasks (reading from the Data Cache), then Reduce, then Merge, then the decision "Add Iteration?". Yes: hybrid scheduling of the new iteration. No: Job Finish.]
Merge step
In-memory caching of static data
Cache-aware hybrid scheduling using queues as well as a bulletin board (a special table)
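The flow above (Map, Combine, Reduce, Merge, then a decision on whether to add an iteration, with static data cached in memory) can be sketched as follows. This is a minimal single-process illustration and not the Twister4Azure API: the function names, the 1-D KMeans example, and the dictionary standing in for the in-memory cache are all assumptions made for this sketch.

```python
from collections import defaultdict

def kmeans_map(points, centroids):
    """Map + Combine: per-partition partial [sum, count] keyed by nearest centroid."""
    partial = defaultdict(lambda: [0.0, 0])
    for x in points:
        idx = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        partial[idx][0] += x
        partial[idx][1] += 1
    return dict(partial)

def kmeans_reduce(idx, partials):
    """Reduce: combine the partial sums of one centroid into its new position."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return idx, total / count

def merge(reduce_outputs, old_centroids, tol=1e-6):
    """Merge: gather all reduce outputs and decide whether to add an iteration."""
    new = list(old_centroids)
    for idx, value in reduce_outputs:
        new[idx] = value
    shift = max(abs(a - b) for a, b in zip(new, old_centroids))
    return new, shift >= tol  # second value answers "Add Iteration?"

def run_iterative_job(partitions, centroids, max_iters=50):
    cache = {}   # hypothetical stand-in for the in-memory cache of static data
    loads = 0    # counts how often a partition had to be fetched from storage

    def load(pid):
        nonlocal loads
        if pid not in cache:
            loads += 1
            cache[pid] = list(partitions[pid])  # stands in for a storage download
        return cache[pid]

    iterations = 0
    while iterations < max_iters:
        iterations += 1
        map_outputs = [kmeans_map(load(pid), centroids) for pid in partitions]
        grouped = defaultdict(list)
        for out in map_outputs:
            for idx, value in out.items():
                grouped[idx].append(value)
        reduced = [kmeans_reduce(idx, vals) for idx, vals in grouped.items()]
        centroids, add_iteration = merge(reduced, centroids)
        if not add_iteration:
            break
    return centroids, iterations, loads
```

In this sketch each partition is fetched once and then reused from the cache in every later iteration, which is the effect the in-memory caching of static data is meant to achieve.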
[Figure: Twister for Azure architecture. Components shown: Worker Role; Map Workers (Map 1, Map 2, ... Map n); Reduce Workers (Red 1, Red 2, ... Red n); In-Memory Data Cache; Task Monitoring; Role Monitoring; Map Task Table (MapID ……. Status); Job Bulletin Board (MapID ……. Status); Scheduling Queue.]
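One way to picture how a scheduling queue and a job bulletin board could combine into cache-aware hybrid scheduling is sketched below. It is an in-process model under stated assumptions, not the actual implementation: the `HybridScheduler` class, its method names, and the use of a Python `deque`, `dict`, and `set` in place of the Azure queue and tables are all hypothetical.

```python
from collections import deque

class HybridScheduler:
    """Toy model: queue for first-iteration tasks, bulletin board for later ones."""

    def __init__(self):
        self.queue = deque()   # stands in for the scheduling queue
        self.board = {}        # stands in for the job bulletin board: task_id -> data_id
        self.claimed = set()   # stands in for the status column of the task table

    def submit_first_iteration(self, tasks):
        # No worker has cached data yet, so tasks go through the global queue.
        for task_id, data_id in tasks:
            self.queue.append((task_id, data_id))

    def announce_iteration(self, tasks):
        # Tasks of a new iteration are posted where every worker can see them.
        for task_id, data_id in tasks:
            self.board[task_id] = data_id

    def next_task(self, worker_cache):
        # 1. Cache-aware pick: an unclaimed board task whose data this worker holds.
        for task_id, data_id in self.board.items():
            if task_id not in self.claimed and data_id in worker_cache:
                self.claimed.add(task_id)
                return task_id, data_id, "cache"
        # 2. Otherwise take work from the queue and cache its data.
        while self.queue:
            task_id, data_id = self.queue.popleft()
            if task_id not in self.claimed:
                self.claimed.add(task_id)
                worker_cache.add(data_id)
                return task_id, data_id, "queue"
        # 3. Fallback: any board task nobody has claimed (e.g. its cache owner is gone).
        for task_id, data_id in self.board.items():
            if task_id not in self.claimed:
                self.claimed.add(task_id)
                worker_cache.add(data_id)
                return task_id, data_id, "fallback"
        return None
```

A worker would call `next_task` with its own set of cached data identifiers; after the first iteration, tasks tend to return to the worker that already holds their input, while the fallback path keeps the job moving if that worker disappears.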
Performance – Kmeans Clustering
[Figure: KMeans clustering scaling. X-axis: Num Instances X Num Data Points (8 X 16M, 16 X 32M, 32 X 64M, 48 X 96M, 64 X 128M). Y-axes: Relative Parallel Efficiency (0% to 160%) and Time (s) (0 to 1600).]
[Figure panels: Performance with/without data caching; Speedup gained using data cache; Scaling speedup; Increasing number of iterations.]
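The slides do not define "relative parallel efficiency". A common reading, assumed here, compares the instance-seconds spent per data point against a baseline run, which for a workload that grows in proportion to the instance count reduces to baseline time divided by measured time. The function name and the timings in the usage note are hypothetical and are not measurements from this work.

```python
def relative_parallel_efficiency(base, run):
    """Efficiency of `run` relative to `base`.

    Each argument is a tuple (instances, data_points, seconds).
    Assumed definition: ratio of instance-seconds per data point.
    """
    p0, w0, t0 = base
    p, w, t = run
    base_cost = p0 * t0 / w0   # instance-seconds per data point, baseline
    run_cost = p * t / w       # instance-seconds per data point, scaled run
    return base_cost / run_cost
```

For example, with made-up timings, `relative_parallel_efficiency((8, 16_000_000, 100.0), (16, 32_000_000, 125.0))` evaluates to approximately 0.8, that is, 80% relative efficiency when both the instances and the data are doubled.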
Performance Comparisons
[Figure: Parallel Efficiency (0% to 100%) vs. Number of Query Files (128 to 728). Series: Twister4Azure, Hadoop-Blast, DryadLINQ-Blast.]
[Figure: Adjusted Time (s) (0 to 3000) vs. Num. of Cores * Num. of Blocks. Series: Twister4Azure, Amazon EMR, Apache Hadoop.]
[Figure: Parallel Efficiency (50% to 100%) vs. Num. of Cores * Num. of Files. Series: Twister4Azure, Amazon EMR, Apache Hadoop.]
Panel titles: BLAST Sequence Search; Cap3 Sequence Assembly; Smith-Waterman Sequence Alignment
Conclusion
Enables users to easily and efficiently perform large-scale iterative data analysis and scientific computations on the Azure cloud.
Utilizes a novel hybrid scheduling mechanism to support caching of static data across iterations.
Utilizes cloud infrastructure services effectively to deliver robust and efficient applications.
http://salsahpc.indiana.edu/twister4azure