Fault tolerance in BlobSeer
Bogdan Nicolae, University of Rennes [email protected]
Jesús Montes Sánchez, CesViMa – Universidad Politécnica de [email protected]
Data Storage and Access Face New Challenges
Infrastructures
◦ Grids, clouds, petascale computing infrastructures, desktop grids?
Access pattern: distributed apps with high throughput under concurrency
◦ Huge data size, fast data generation rates
    PB-scale storage is necessary to cope with size
    Order of TB/week more and more common
◦ Mutable data
    Poor support in massive storage systems: HadoopFS
◦ Heavy access concurrency: synchronization and consistency
    Thousands of clients accessing data simultaneously
◦ Versioning
    Support for rollback, access to historic data
Our Approach: BlobSeer
Manipulates lightweight huge files: blobs
Simple API: read/write/append
Blob is fragmented into pages
◦ Allows huge data amounts to be distributed among machines
◦ Avoids contention for simultaneous accesses to disjoint parts of the data block
Metadata: locate pages that make up a given blob
◦ Distributed in a fine-grain manner
Versioning
◦ Write/append: generate new pages rather than overwrite any existing data
◦ Metadata is extended to incorporate the update
◦ Both the old and the new version of the blob are accessible as if they were independent blobs
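The versioning idea above (new pages are generated, nothing is overwritten, every version stays readable) can be illustrated with a small in-memory sketch. This is not BlobSeer's API: the class `ToyBlob`, its method names and the tiny `PAGE_SIZE` are assumptions made for the example, and the per-version page list stands in for the distributed metadata described later.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4  # bytes per page; tiny on purpose so the example is easy to follow


@dataclass
class ToyBlob:
    """Hypothetical in-memory blob: immutable pages plus one page map per version."""
    pages: dict = field(default_factory=dict)             # page id -> bytes, never overwritten
    versions: list = field(default_factory=lambda: [[]])  # version -> ordered list of page ids
    next_id: int = 0

    def _store(self, data: bytes) -> int:
        pid = self.next_id
        self.pages[pid] = data
        self.next_id += 1
        return pid

    def write(self, offset: int, data: bytes) -> int:
        """Page-aligned write on top of the latest version; returns the new version number."""
        assert offset % PAGE_SIZE == 0 and len(data) % PAGE_SIZE == 0
        page_map = list(self.versions[-1])  # copy the latest map; older maps stay intact
        first = offset // PAGE_SIZE
        assert first <= len(page_map), "this sketch does not support holes"
        for i in range(len(data) // PAGE_SIZE):
            pid = self._store(data[i * PAGE_SIZE:(i + 1) * PAGE_SIZE])
            if first + i < len(page_map):
                page_map[first + i] = pid   # new page replaces the old one in this version only
            else:
                page_map.append(pid)        # the blob grows
        self.versions.append(page_map)
        return len(self.versions) - 1

    def append(self, data: bytes) -> int:
        return self.write(len(self.versions[-1]) * PAGE_SIZE, data)

    def read(self, version: int, offset: int, size: int) -> bytes:
        whole = b"".join(self.pages[p] for p in self.versions[version])
        return whole[offset:offset + size]


blob = ToyBlob()
v1 = blob.append(b"aaaabbbb")
v2 = blob.write(4, b"XXXX")
print(blob.read(v1, 0, 8), blob.read(v2, 0, 8))
```

After the write, version `v1` still reads back its original bytes while `v2` sees the update, which is the "independent blobs" behavior the slide describes.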
Architecture
Clients
◦ Perform fine grain blob accesses
Providers
◦ Store the pages of the blob
Provider manager
◦ Monitors the providers
◦ Favors data load balancing
Metadata providers
◦ Store information about page location
Version manager
◦ Ensures concurrency control
http://blobseer.gforge.inria.fr
How Do Writes Work?
Pages are written concurrently by the clients (no sync needed)
Versions are assigned
Metadata is written concurrently by the clients (no sync needed)
Versions are published in the order they were assigned
[Figure: Client #1 and Client #2 each interact with the providers, the metadata providers and the version manager; each write ends with a Publish step]
How Does a Read Work?
I. Ask the version manager for the latest published version (optional)
II. Fetch the corresponding metadata from the metadata providers
III. Contact providers in parallel and fetch the pages into the local buffer
Full R/R, R/W concurrency
[Figure: a client performs steps I, II and III against the version manager, the metadata providers and the providers]
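The three read steps can be sketched as follows. The module-level dictionaries are hypothetical stand-ins for the version manager, the metadata providers and the data providers; only the control flow (optional version lookup, metadata fetch, parallel page fetch) follows the slide.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the three services a reader talks to.
LATEST_PUBLISHED = 2                                        # version manager state
METADATA = {1: ["p0", "p1"], 2: ["p0", "p2"]}               # version -> page ids
PROVIDERS = {"p0": b"aaaa", "p1": b"bbbb", "p2": b"XXXX"}   # page id -> page content


def fetch_page(page_id):
    """Pretend remote call to the provider that stores this page."""
    return PROVIDERS[page_id]


def read_blob(version=None):
    if version is None:
        version = LATEST_PUBLISHED            # step I (skipped if the caller names a version)
    page_ids = METADATA[version]              # step II
    with ThreadPoolExecutor(max_workers=4) as pool:
        pages = list(pool.map(fetch_page, page_ids))  # step III; map() keeps input order
    return b"".join(pages)                    # assemble the local buffer


print(read_blob(), read_blob(1))
```

Because published versions are never modified, a reader of version 1 is unaffected by whoever produced version 2, which is why reads need no locking against writers in this sketch.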
Metadata (1)
Organized as a segment tree
Each node covers a range of the blob identified by (offset, size)
The first/second half of the range is covered by the left/right child
Each leaf corresponds to a page and holds information about its location
Example tree for a blob of 4 pages:
[0, 4]
[0, 2] [2, 2]
[0, 1] [1, 1] [2, 1] [3, 1]
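A minimal sketch of how a read range maps onto such a tree, assuming offsets and sizes are counted in pages and the root size is a power of two. The function name `covering_leaves` is made up for the example.

```python
def covering_leaves(node_offset, node_size, read_offset, read_size):
    """Return the (offset, size) leaves under a node that intersect the read range."""
    node_end = node_offset + node_size
    read_end = read_offset + read_size
    if read_offset >= node_end or read_end <= node_offset:
        return []                          # this subtree is not needed for the read
    if node_size == 1:
        return [(node_offset, 1)]          # a leaf: one page
    half = node_size // 2
    # The left child covers the first half of the range, the right child the second half.
    return (covering_leaves(node_offset, half, read_offset, read_size)
            + covering_leaves(node_offset + half, half, read_offset, read_size))


# Reading 2 pages starting at page 1 of the 4-page tree with root [0, 4].
print(covering_leaves(0, 4, 1, 2))
```

Subtrees that do not intersect the requested range are pruned at the first comparison, so a small read touches only a few nodes on the way down.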
Metadata (2)
Each node holds versioning information
Write/Append
◦ Add leaves and build subtree up to the root
◦ The tree may grow one level
Read: descend from the root towards the leaves
Tree nodes are distributed among metadata providers
Clients can fetch multiple nodes in parallel
[Figure: segment trees for successive versions. Initial tree: [0, 4]; [0, 2], [2, 2]; [0, 1], [1, 1], [2, 1], [3, 1]. Nodes added by later updates: [0, 4], [0, 2], [2, 2], [1, 1], [2, 1]; then [0, 8], [4, 4], [4, 2], [4, 1].]
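The write/append rule (add leaves, build the subtree up to the root, reuse everything else) can be sketched with a dictionary that plays the role of the metadata providers. The key layout `(version, offset, size)` and the function names are assumptions for this example, not BlobSeer's actual format.

```python
NODES = {}  # (version, offset, size) -> page location (leaf) or (left_key, right_key)
ROOTS = {}  # version -> key of that version's root node


def build(version, offset, size, new_pages, old_key, old_root):
    """Create this version's node for [offset, offset + size), sharing untouched subtrees."""
    if old_key is None and old_root is not None and (offset, size) == old_root[1:]:
        old_key = old_root                  # the tree grew: the old root is now a subtree
    if not any(offset <= p < offset + size for p in new_pages):
        return old_key                      # untouched: reuse the older node (may be None)
    key = (version, offset, size)
    if size == 1:
        NODES[key] = new_pages[offset]      # new leaf points at the new page
        return key
    half = size // 2
    old_left, old_right = NODES[old_key] if old_key is not None else (None, None)
    NODES[key] = (build(version, offset, half, new_pages, old_left, old_root),
                  build(version, offset + half, half, new_pages, old_right, old_root))
    return key


def write(version, new_pages, prev_version=None):
    """new_pages maps page index -> location for the pages written by this version."""
    old_root = ROOTS.get(prev_version)
    size = old_root[2] if old_root else 1
    while size <= max(new_pages):
        size *= 2                           # grow the tree until it covers the new pages
    ROOTS[version] = build(version, 0, size, new_pages, None, old_root)


def read(version, offset, size):
    """Descend from the version's root and return page locations in offset order."""
    found, stack = [], [ROOTS[version]]
    while stack:
        key = stack.pop()
        if key is None:
            continue                        # range never written
        _, o, s = key
        if offset >= o + s or offset + size <= o:
            continue
        if s == 1:
            found.append((o, NODES[key]))
        else:
            stack.extend(NODES[key])
    return [loc for _, loc in sorted(found)]


write(1, {0: "a0", 1: "a1", 2: "a2", 3: "a3"})
write(2, {1: "b1", 2: "b2"}, prev_version=1)   # overwrite pages 1 and 2
write(3, {4: "c4"}, prev_version=2)            # append page 4: root becomes [0, 8]
print(read(1, 0, 4), read(2, 0, 4), read(3, 0, 8))
```

In this run, version 2 creates only five nodes and version 3 only four, with the previous root reused as the left child of the new `[0, 8]` root. Nodes of older versions are never modified, so older roots remain valid entry points for readers.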
How Concurrent Writes Work: Example
Initial version: v = 1
2 concurrent writers: gray and black
Both write their pages independently
Gray is first: it is enqueued on the version manager and assigned version v2; black gets v3
Both write the metadata tree nodes independently: black is faster and links to the not-yet-created node B2
First to finish is black; it is marked ready
Next is gray: its root gets published and it is dequeued
Finally black becomes first in the queue and is published
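The ordering rule in this example (versions are assigned in arrival order and published in that same order, even when a later writer finishes first) can be sketched as a tiny state machine. `ToyVersionManager` and its methods are hypothetical names; a real version manager would also handle failures and remote calls.

```python
class ToyVersionManager:
    """Hypothetical version manager: assigns versions in arrival order, publishes in that order."""

    def __init__(self, initial_version=1):
        self.published = initial_version        # latest version visible to readers
        self.next_version = initial_version + 1
        self.ready = set()                      # versions whose metadata is fully written

    def assign(self):
        """Enqueue a writer and hand it the next version number."""
        version = self.next_version
        self.next_version += 1
        return version

    def mark_ready(self, version):
        """Record that a writer finished; publish every version that is now first in line."""
        self.ready.add(version)
        while self.published + 1 in self.ready:
            self.published += 1
            self.ready.remove(self.published)
        return self.published


vm = ToyVersionManager()
gray = vm.assign()      # gray arrives first
black = vm.assign()     # black arrives second
print(vm.mark_ready(black))  # black finishes first: nothing new is published yet
print(vm.mark_ready(gray))   # gray finishes: gray is published, then black
```

Readers only ever see `published`, so they never observe black's version before gray's, even though black completed its metadata earlier.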
Impact of Metadata Distribution Under Heavy Concurrency
Metric
◦ Aggregated bandwidth
Configuration
◦ 90 data providers
◦ Fixed number of metadata providers
◦ Up to 90 clients
◦ 4 writers per client
◦ Each writer outputs 8 MB
◦ Page size: 128 KB
To be presented at Euro-Par 2009
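To give a sense of scale, the configuration above implies the following workload at the largest client count. This is a back-of-the-envelope reading, assuming binary units (1 MB = 1024 KB) and that each writer outputs its 8 MB once; the slide does not state either assumption.

```latex
\begin{align*}
\text{pages per writer} &= \frac{8\ \text{MB}}{128\ \text{KB}} = \frac{8192\ \text{KB}}{128\ \text{KB}} = 64,\\
\text{writers at full load} &= 90 \times 4 = 360,\\
\text{data written at full load} &= 360 \times 8\ \text{MB} = 2880\ \text{MB},\\
\text{pages written at full load} &= 360 \times 64 = 23040.
\end{align*}
```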
BlobSeer: How About Fault Tolerance?
Metadata?
◦ Distributed in a DHT, already benefits from some DHT-inherent FT
Centralized entities?
◦ Version manager, provider manager
◦ First idea: Paxos-like, consensus-based solutions
Data?
◦ Simple replication policies not enough
◦ FT needs to be adapted both to access pattern and running environment
Fault tolerance in grid computing
Two theoretical visions of the grid:
◦ Multiple entities: the grid is a set of computational resources
◦ Single entity: the grid is a "black box" that provides a set of services
Two types of fault tolerance:
◦ Resource level: dependability issues in the grid resources
◦ Global level: dependability issues related to the whole grid (the services provided)
Global level fault tolerance
Improving dependability of the services provided
◦ Low-level approach (multiple entity point of view)
◦ High-level approach (single entity point of view)
Is the "single entity" view possible?
◦ Maybe the grid is too large and complex to be understood as just one entity...
◦ ...or maybe it is just a matter of perspective.
The key: Abstraction
Improving Fault Tolerance through Global Behavior Modeling
The Grid seen as a single entity: global-level fault tolerance
GloBeM: Global Behavior Modeling
◦ Global QoS rather than dependability of individual resources
J. Montes, A. Sanchez, J. J. Valdes, M. S. Perez, and P. Herrero, "The grid as a single entity: Towards a behavior model of the whole grid" in OTM Conferences (1), ser. Lecture Notes in Computer Science, R. Meersman and Z. Tari, Eds., vol. 5331. Springer, 2008, pp. 886-897.
GloBeM: Zooming In
Based on historical monitoring information
Uses knowledge discovery techniques (Data Mining…)
Generates a behavior model in the form of a Finite State Machine
Using GloBeM to Improve Fault Tolerance
Extracted from Jesus Montes, Alberto Sanchez and Maria S. Perez, "Improving grid fault tolerance by means of global behavior modelling". Submitted to Grid'2009.
Applications to BlobSeer
Use GloBeM modeling techniques to improve BlobSeer's dependability and QoS
◦ Model behavior patterns
◦ Implement adaptive strategies (e.g. reactive and/or proactive fault tolerance)
Applications to BlobSeer (2)
Steps:
◦ Define relevant metrics to monitor
    Storage usage, effective bandwidth, resource failure rate
◦ Model BlobSeer using GloBeM techniques
◦ Analyze states to understand behavior
    Each state may correspond to a certain level of QoS
◦ Define adapted fault-tolerance policies based on BlobSeer dynamic changes
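As a purely illustrative sketch of the last step, a state-dependent policy could look like the code below. Everything in it is an assumption: the state names, the failure-rate thresholds and the replication levels are invented, and the hand-written `classify` function stands in for a behavior model that GloBeM would derive from monitoring data.

```python
# Hypothetical policy table: state names and replication levels are illustrative only.
POLICY = {"healthy": 1, "degraded": 2, "critical": 3}   # state -> replicas per page


def classify(failure_rate):
    """Map a monitored resource failure rate (failures per hour) to a model state.

    The thresholds are made up; a real model would be learned from historical data.
    """
    if failure_rate < 1.0:
        return "healthy"
    if failure_rate < 5.0:
        return "degraded"
    return "critical"


def replication_for(failure_rate):
    """Pick the replication level attached to the current state."""
    return POLICY[classify(failure_rate)]


for rate in (0.2, 3.0, 9.0):
    print(rate, classify(rate), replication_for(rate))
```

The point of the sketch is the indirection: policies are attached to states of the global model rather than to individual resources, so the storage layer reacts to the overall behavior it is currently in.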
Questions?