Overview
High Throughput Computing
Motivation
All things distributed: Peer-to-peer
– Non-structured overlays
– Structured overlays
P2P Computing
Cassandra
HTC over Cassandra
Eventual consistency
Experiments
Future Work
Conclusions
High Throughput Computing
Concept introduced by the Condor team in 1996
In contrast to HPC, it optimizes the execution of a set of applications
Figure of merit: the number of computational tasks completed per time unit
Tasks are independent
Examples: Condor, Oracle Grid Engine (Kalimero), BOINC
Functioning
N worker nodes
One master node
Users interact with the master node
The master manages pending tasks and idle workers using a queuing system
Tasks are (usually) executed in FIFO order
Motivations
Limitations of this model:
– The master node may become a scalability bottleneck
– Failures in the master affect the whole system
Is it possible to distribute the capabilities of the master node among all system nodes?
How? (which technology can help?)
All things distributed: peer-to-peer
Distributed systems in which all nodes have the same role
Nodes are interconnected, defining an application-level virtual network: an overlay network
This overlay is used to locate other nodes and the information they hold
Two types of overlays: structured and non-structured
Non-structured overlays
Nodes are interconnected randomly
Searches in the overlay are made by flooding
Efficient search of popular contents
Cannot guarantee that every point of the system is reachable
Not efficient in terms of number of messages
Structured overlays
Nodes are interconnected using some kind of (regular) structure
Each node has a unique ID of N bits, defining a 2^N keyspace
This keyspace is divided among the nodes
Structured overlays (II)
Each object in the system has an ID and a position in the key space
A distance-based routing protocol is used
This permits reaching any point with O(log n) messages
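The keyspace partition described above can be sketched with a toy ring, assuming a Chord-style successor rule where each key is owned by the first node ID at or after it; the node IDs and the 8-bit keyspace are illustrative assumptions, not the actual protocol of any particular overlay:

```python
N_BITS = 8                      # each node/object ID has N bits...
KEYSPACE = 2 ** N_BITS          # ...defining a 2^N keyspace

def responsible_node(key, node_ids):
    """Return the node that owns `key`: the first node ID at or
    after the key on the ring, wrapping around at the end."""
    candidates = sorted(node_ids)
    for node in candidates:
        if node >= key % KEYSPACE:
            return node
    return candidates[0]        # wrap around the ring

nodes = [10, 80, 150, 220]
print(responsible_node(100, nodes))  # 150
print(responsible_node(230, nodes))  # 10 (wraps around)
```

A real overlay would not scan all nodes; each node keeps O(log n) routing entries and forwards the lookup toward the owner, which is what yields the O(log n) message bound.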
Distributed Hash Tables
Provide a hash-like user API:
– Put(ID, Object)
– Get(ID)
Fast access to distributed information
Used to distribute files, communicate users, VoIP, video streaming
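The Put/Get API above can be sketched by hashing each key onto the keyspace and storing the object at the owning node. This is a hypothetical in-memory toy, not the API of Cassandra or any real DHT:

```python
import hashlib

N_BITS = 8

def key_id(key):
    """Map an arbitrary string key onto the 2^N keyspace."""
    digest = hashlib.sha1(key.encode()).digest()
    return digest[0] % (2 ** N_BITS)

class ToyDHT:
    def __init__(self, node_ids):
        # node ID -> that node's local (key, object) store
        self.nodes = {n: {} for n in sorted(node_ids)}

    def _owner(self, key):
        kid = key_id(key)
        for node in self.nodes:        # nodes were inserted sorted
            if node >= kid:
                return node
        return next(iter(self.nodes))  # wrap around the ring

    def put(self, key, obj):
        self.nodes[self._owner(key)][key] = obj

    def get(self, key):
        return self.nodes[self._owner(key)].get(key)

dht = ToyDHT([10, 80, 150, 220])
dht.put("job-1", {"name": "Task1", "owner": "User1"})
print(dht.get("job-1"))  # {'name': 'Task1', 'owner': 'User1'}
```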
P2P Computing
Must be seen by the user as a single resource pool
Users should be able to submit jobs from any node in the system
The system stores job information, permitting progress even when the user is not connected
A FIFO order should be guaranteed
DHTs are suitable for this purpose
DHTs for P2P Computing
Must provide scalability in adverse conditions
Must provide persistence (using replication)
Replicas are synchronized by consensus algorithms
Load balancing algorithms are also needed
DHTs for P2P Computing (II)
In 2007 Amazon presented Dynamo, a DHT-based P2P system with persistence, scalability, O(1) access, and eventual consistency
From Dynamo, many alternatives have been proposed: Riak, Scalaris, Memcached,...
Facebook proposed Cassandra in 2009 with the same Dynamo capabilities and Google's BigTable data model
Cassandra
Developed by Facebook and Twitter since 2009
Has been released to the Apache Foundation
Developed in Java with multilanguage client libraries
Pros: Fault tolerant, decentralized, scalable, durable
Cons: Eventual consistency
Cassandra’s Data Model
DHTs store (key, value) pairs
Cassandra stores (key, (values...)) tuples across different tables
The different tables are named ColumnFamilies or SuperColumnFamilies
CFs are 4-dimensional tables; SCFs are 5-dimensional tables
Column Families
WaitingQueue ColumnFamily

JobID   Name    Owner   Binary
1       Task1   User1   URL
2       Task2   User2   URL
3       Task3   User1   URL
N       TaskN   User3   URL
SuperColumn Families
Queues SuperColumnFamily

Waiting:  Job1 -> (Task1, User1)   Job2 -> (Task2, User2)   ...   JobN -> (TaskN, UserN)
Running:  Job1 -> (Task1, User1)   Job2 -> (Task2, User2)   ...   JobN -> (TaskN, UserN)
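The two table shapes above can be modeled as nested dicts to make the extra dimension of a SuperColumnFamily concrete. The names below mirror the example tables; the dict representation itself is an illustrative assumption, not Cassandra's storage format:

```python
# A ColumnFamily addresses a value by (ColumnFamily, row key, column):
column_family = {
    "WaitingQueue": {
        1: {"Name": "Task1", "Owner": "User1", "Binary": "URL"},
        2: {"Name": "Task2", "Owner": "User2", "Binary": "URL"},
    }
}

# A SuperColumnFamily adds one more level between row and column,
# the supercolumn: (SCF, row key, supercolumn, column):
super_column_family = {
    "Queues": {
        "Waiting": {
            "Job1": {"Name": "Task1", "Owner": "User1"},
            "Job2": {"Name": "Task2", "Owner": "User2"},
        },
        "Running": {
            "Job1": {"Name": "Task1", "Owner": "User1"},
        },
    }
}

print(column_family["WaitingQueue"][1]["Owner"])                 # User1
print(super_column_family["Queues"]["Waiting"]["Job2"]["Name"])  # Task2
```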
HTC over Cassandra
A batch queue system has been implemented over Cassandra’s data model
This permits idle workers to decide which task to run, in FIFO order
Users can:
– Submit jobs
– Check jobs’ status
– Retrieve jobs’ results
The use of Cassandra as the underlying data storage allows for disconnected operation
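A sketch of how an idle worker could pick the next task in FIFO order from the shared waiting queue. This uses an in-memory dict as a hypothetical stand-in for the Cassandra-backed queue, assuming JobIDs are assigned in submission order:

```python
# JobID -> task name; JobIDs grow with submission order
waiting_queue = {3: "Task3", 1: "Task1", 2: "Task2"}

def next_task(queue):
    """FIFO order: the waiting job with the lowest JobID runs first.
    Removes the job from the queue and returns (job_id, task)."""
    if not queue:
        return None
    job_id = min(queue)
    return job_id, queue.pop(job_id)

print(next_task(waiting_queue))  # (1, 'Task1')
print(next_task(waiting_queue))  # (2, 'Task2')
```

In the real system the pop is not atomic, which is exactly where the collisions discussed under Eventual Consistency come from.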
HTC over Cassandra (II)
The system stores:
Job information
– Name
– Owner
– Binaries
User information
Queue information
The system is totally reconfigurable at run time, permitting the use of an unlimited number of queues with different policies
Eventual Consistency
All changes to an object eventually reach all of its replicas
The CAP theorem implies that it is not possible to have these three properties at the same time:
– Consistency
– Availability
– Partition tolerance
Cassandra has selected availability and partition tolerance over consistency
In a failure-free scenario, Cassandra provides low latency
Eventual Consistency (II)
This scenario implies the impossibility of atomic operations in Cassandra
In our HTC system, collisions may happen when several nodes try to execute the same task
We have implemented partial solutions that reduce the probability of a collision:
– QUORUM consistency for all I/O operations
– An extra queue where idle nodes compete for the waiting task
These measures reduce the collision probability from 30% to 4%
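The intuition behind QUORUM can be sketched numerically: with N replicas, a quorum is floor(N/2) + 1, and whenever the read set size R plus the write set size W exceeds N, every read overlaps the latest quorum write. The helper below only illustrates this counting argument, with made-up replica counts:

```python
def quorum(n_replicas):
    """Size of a majority quorum over n_replicas."""
    return n_replicas // 2 + 1

def read_sees_latest_write(n_replicas, r, w):
    """True if every read set of size r must intersect every
    write set of size w (pigeonhole: r + w > n_replicas)."""
    return r + w > n_replicas

N = 3
R = W = quorum(N)                       # 2 of 3 replicas
print(R, read_sees_latest_write(N, R, W))   # 2 True
print(read_sees_latest_write(N, 1, 1))      # False: ONE/ONE can miss updates
```

This guarantees that readers observe the newest quorum write, but it still does not make read-modify-write atomic, which is why collisions are only reduced, not eliminated.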
Experiments
We have performed some experiments to evaluate our system
A 20-node cluster has been used for this purpose
– Each node has a P4 processor with hyperthreading
– 1.5 – 2 GB of RAM
Each node represents one user in the system
We have used a workload generator to produce a work list for each user
Metrics
Bounded Slowdown: waiting time plus running time of a job, normalized by its running time, with a 10-second threshold for very short jobs
System utilization
Scheduling time: time used by idle nodes to schedule a waiting job
Collisions detected

bsd = max(1, (w + r) / max(10, r))

where w is the job’s waiting time and r its running time, in seconds
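The bounded slowdown metric can be checked with a tiny helper, assuming the standard definition bsd = max(1, (w + r) / max(10, r)) with w and r in seconds; the example values are illustrative:

```python
def bounded_slowdown(w, r, threshold=10):
    """Bounded slowdown of a job with waiting time w and running
    time r; the threshold damps the slowdown of very short jobs."""
    return max(1.0, (w + r) / max(threshold, r))

print(bounded_slowdown(90, 10))   # (90+10)/10 = 10.0
print(bounded_slowdown(0, 100))   # 1.0: no waiting, no slowdown
print(bounded_slowdown(5, 1))     # 6/10 = 0.6, clamped up to 1.0
```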
Results (figures): System Load, Bounded Slowdown, Scheduling Time, Collisions
Future Work
Find a viable solution to the eventual consistency problem
Develop a workflow system with MapReduce tasks
Add reputation systems in order to classify node behavior
Conclusions
HTC over P2P is possible
A prototype has been developed
Some preliminary experiments have been done, obtaining good performance levels
QUESTIONS?