Fault tolerance in BlobSeer

Bogdan Nicolae
University of Rennes 1
[email protected]

Jesús Montes Sánchez
CesViMa – Universidad Politécnica de Madrid


Data Storage and Access Face New Challenges

Infrastructures
◦ Grids, clouds, petascale computing infrastructures, desktop grids?

Access pattern: distributed apps with high throughput under concurrency
◦ Huge data size, fast data generation rates
  PB-scale storage is necessary to cope with size
  Order of TB/week more and more common
◦ Mutable data
  Poor support in massive storage systems: HadoopFS
◦ Heavy access concurrency: synchronization and consistency
  Thousands of clients accessing data simultaneously
◦ Versioning
  Support for rollback, access to historic data

Our Approach: BlobSeer

Manipulates lightweight huge files: blobs
Simple API: read/write/append
Blob is fragmented into pages
◦ Allows huge data amounts to be distributed among machines
◦ Avoids contention for simultaneous accesses to disjoint parts of the data block

Metadata: locate pages that make up a given blob
◦ Distributed in a fine-grain manner

Versioning
◦ Write/append: generate new pages rather than overwrite any existing data
◦ Metadata is extended to incorporate the update
◦ Both the old and the new version of the blob are accessible as if they were independent blobs
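The paging and versioning ideas above can be illustrated with a toy, single-process sketch. Everything here is an assumption made for illustration: the class name `ToyBlob`, the tiny `PAGE_SIZE`, and the flat per-version page list (the slides describe distributed tree metadata, not a list). It is not BlobSeer's API.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4  # bytes per page; deliberately tiny, hypothetical value


@dataclass
class ToyBlob:
    """Toy in-memory blob: pages are never overwritten, versions are page-id lists."""
    pages: dict = field(default_factory=dict)              # page id -> bytes
    versions: list = field(default_factory=lambda: [[]])   # version -> list of page ids

    def write(self, offset: int, data: bytes) -> int:
        """Page-aligned write; returns the number of the newly created version."""
        if offset % PAGE_SIZE or len(data) % PAGE_SIZE:
            raise ValueError("this sketch only supports page-aligned writes")
        layout = list(self.versions[-1])       # copy the page list, not the pages
        first = offset // PAGE_SIZE
        if first > len(layout):
            raise ValueError("write would leave a hole in the blob")
        for i in range(len(data) // PAGE_SIZE):
            page_id = len(self.pages)          # always a fresh page id
            self.pages[page_id] = data[i * PAGE_SIZE:(i + 1) * PAGE_SIZE]
            if first + i < len(layout):
                layout[first + i] = page_id    # new version points to the new page
            else:
                layout.append(page_id)
        self.versions.append(layout)
        return len(self.versions) - 1

    def append(self, data: bytes) -> int:
        return self.write(len(self.versions[-1]) * PAGE_SIZE, data)

    def read(self, version: int) -> bytes:
        return b"".join(self.pages[p] for p in self.versions[version])


blob = ToyBlob()
v1 = blob.append(b"aaaabbbb")   # two pages
v2 = blob.write(4, b"XXXX")     # replaces the second page in a new version
```

After these calls, `blob.read(v1)` still returns the original bytes while `blob.read(v2)` reflects the update, and only one extra page was stored for the second version.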

Architecture

Clients
◦ Perform fine grain blob accesses

Providers
◦ Store the pages of the blob

Provider manager
◦ Monitors the providers
◦ Favors data load balancing

Metadata providers
◦ Store information about page location

Version manager
◦ Ensures concurrency control

Figure: architecture diagram showing the clients, providers, metadata providers, provider manager, and version manager.

http://blobseer.gforge.inria.fr

How Do Writes Work?

Pages are written concurrently by the clients (no sync needed)

Versions are assigned

Metadata is written concurrently by the clients (no sync needed)

Versions are published in the order they were assigned

Figure: sequence diagram of Client #1 and Client #2 interacting with the providers, metadata providers, and version manager; each write ends with a Publish step.

How Does a Read Work?

I. Ask the version manager for the latest published version (optional)

II. Fetch the corresponding metadata from the metadata providers

III. Contact providers in parallel and fetch the pages in the local buffer

Figure: sequence diagram of steps I, II, and III between the client, version manager, metadata providers, and providers.

Full R/R, R/W concurrency
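The three read steps can be sketched as follows. The dictionaries standing in for the version manager, metadata providers, and data providers are hypothetical in-memory placeholders, and `read_blob` and `fetch_page` are illustrative names, not BlobSeer functions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the three remote services.
latest_published = 2                                  # version manager state
metadata = {                                          # version -> [(provider, page key)]
    1: [("p0", "a1"), ("p1", "b1")],
    2: [("p0", "a1"), ("p1", "b2")],
}
providers = {"p0": {"a1": b"AAAA"}, "p1": {"b1": b"BBBB", "b2": b"bbbb"}}


def fetch_page(location):
    """Fetch one page from the provider that stores it."""
    provider_id, page_key = location
    return providers[provider_id][page_key]


def read_blob(version=None):
    if version is None:                    # step I (optional): latest published version
        version = latest_published
    locations = metadata[version]          # step II: page locations for that version
    with ThreadPoolExecutor(max_workers=4) as pool:
        pages = list(pool.map(fetch_page, locations))   # step III: parallel fetch, order kept
    return b"".join(pages)
```

Because a published version is never modified, a reader of version 1 is unaffected by a writer producing version 2, which is one way to read the "full R/R, R/W concurrency" claim.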

Metadata (1)

Organized as a segment tree

Each node covers a range of the blob identified by (offset, size)

The first/second half of the range is covered by the left/right child

Each leaf corresponds to a page and holds information about its location

Figure: segment tree over a 4-page blob with root [0, 4], children [0, 2] and [2, 2], and leaves [0, 1], [1, 1], [2, 1], [3, 1].
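As a minimal sketch of how such a tree is traversed, the function below descends from a node identified by (offset, size) and returns the leaves that intersect a requested range. It assumes ranges are measured in pages and that the root size is a power of two; the name `covering_leaves` is hypothetical.

```python
def covering_leaves(node, query):
    """Return the leaf ranges (offset, size) under `node` that intersect `query`."""
    (off, size), (q_off, q_size) = node, query
    if q_off >= off + size or off >= q_off + q_size:
        return []                                   # no overlap: prune this subtree
    if size == 1:
        return [node]                               # a leaf corresponds to one page
    half = size // 2
    left = covering_leaves((off, half), query)          # first half of the range
    right = covering_leaves((off + half, half), query)  # second half of the range
    return left + right


leaves = covering_leaves((0, 4), (1, 2))   # pages 1 and 2 of a 4-page blob
```

For the 4-page tree, a request for pages 1 and 2 visits both children of the root and ends at the leaves (1, 1) and (2, 1).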

Metadata (2)

Each node holds versioning information

Write/Append
◦ Add leaves and build subtree up to the root
◦ The tree may grow one level

Read: descend from the root towards the leaves

Tree nodes are distributed among metadata providers

Clients can fetch multiple nodes in parallel

Figure: versioned segment tree with an initial tree rooted at [0, 4], a second set of nodes [1, 1], [2, 1], [0, 2], [2, 2], [0, 4], and a third set [4, 1], [4, 2], [4, 4] under a new root [0, 8].
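One way to picture "add leaves and build subtree up to the root" is path copying: a write creates new nodes only along the paths to the modified leaves and shares every untouched subtree with the previous version. The sketch below is a local, single-process illustration under that assumption; `Node`, `write`, `update`, and `read` are hypothetical names, and the real nodes are spread across metadata providers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    offset: int
    size: int
    version: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    page: Optional[str] = None           # set on leaves only


def write(old, offset, size, lo, pages, version):
    """Build the node for (offset, size) in `version`; pages[i] replaces page lo + i."""
    hi = lo + len(pages)
    if hi <= offset or lo >= offset + size:
        return old                        # untouched subtree: share it (may be None)
    if size == 1:
        return Node(offset, 1, version, page=pages[offset - lo])
    half = size // 2
    old_left = old.left if old is not None else None
    old_right = old.right if old is not None else None
    return Node(offset, size, version,
                write(old_left, offset, half, lo, pages, version),
                write(old_right, offset + half, half, lo, pages, version))


def update(root, lo, pages, version):
    """Apply a write or append to `root`, growing the tree when the range does not fit."""
    size = root.size
    while lo + len(pages) > size:
        root = Node(0, 2 * size, root.version, left=root)   # wider root, empty right half
        size *= 2
    return write(root, 0, size, lo, pages, version)


def read(node):
    """Descend from a root towards the leaves and collect the pages in order."""
    if node is None:
        return []
    if node.size == 1:
        return [node.page]
    return read(node.left) + read(node.right)


root1 = write(None, 0, 4, 0, ["a", "b", "c", "d"], 1)   # initial 4-page blob
root2 = update(root1, 1, ["B", "C"], 2)                  # overwrite pages 1 and 2
root3 = update(root2, 4, ["e"], 3)                       # append one page: tree grows
```

Each root gives access to a complete version: reading from `root1` still yields the original pages, while `root3` reuses the whole tree of `root2` as its left child.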

How Concurrent Writes Work: Example

Initial version: v1

2 concurrent writers: gray and black

Both write their pages independently

Gray is first: it is enqueued on the version manager and assigned version v2; black gets v3

Both write the metadata tree nodes independently: black is faster and links to the not-yet-created node B2

Black finishes first and is marked ready

Gray finishes next: its root is published and it is dequeued

Finally, black reaches the head of the queue and is published
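The ordering rule in this example can be sketched with a toy version manager that assigns version numbers on request and publishes a version only once all earlier versions are published. The class and method names are hypothetical, and the sketch is single-threaded; a real service would need locking and failure handling.

```python
class ToyVersionManager:
    """Assigns versions in arrival order and publishes them in that same order."""

    def __init__(self, initial=1):
        self.published = initial          # latest published version
        self.next_version = initial + 1
        self.ready = set()                # finished writers waiting to be published

    def assign(self):
        version = self.next_version
        self.next_version += 1
        return version

    def mark_ready(self, version):
        """Record that a writer finished its metadata; publish whatever is now in order."""
        self.ready.add(version)
        while self.published + 1 in self.ready:
            self.published += 1
            self.ready.remove(self.published)
        return self.published


vm = ToyVersionManager()
gray = vm.assign()                        # gray arrives first
black = vm.assign()                       # black arrives second
after_black = vm.mark_ready(black)        # black finishes first but must wait
after_gray = vm.mark_ready(gray)          # gray is published, then black
```

Here `after_black` is still the initial version, and `after_gray` is black's version, since publishing gray unblocks black immediately.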


Impact of Metadata Distribution Under Heavy Concurrency

Metric
◦ Aggregated bandwidth

Configuration
◦ 90 data providers
◦ Fixed number of metadata providers
◦ Up to 90 clients
◦ 4 writers per client
◦ Each writer outputs 8 MB
◦ Page size: 128 KB

To be presented at Euro-Par 2009
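To give a sense of scale, the configuration above can be turned into a quick back-of-the-envelope calculation. This assumes binary units (1 MB = 1024 KB) and the maximum of 90 clients; the slides give no measured timings, so `aggregated_bandwidth_mb_s` is only a hypothetical helper showing how the metric would be computed.

```python
KB, MB = 1024, 1024 * 1024

clients, writers_per_client = 90, 4
bytes_per_writer, page_size = 8 * MB, 128 * KB

pages_per_writer = bytes_per_writer // page_size    # pages produced by one writer
writers = clients * writers_per_client              # concurrent writers at full load
total_pages = writers * pages_per_writer            # pages written in one run
total_mb = writers * bytes_per_writer // MB         # data written in one run, in MB


def aggregated_bandwidth_mb_s(total_bytes, elapsed_seconds):
    """Aggregated bandwidth: all bytes written by all writers divided by wall-clock time."""
    return total_bytes / MB / elapsed_seconds
```

Under these assumptions, each writer produces 64 pages, and a full run with 360 writers stores 23040 pages, or 2880 MB.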

BlobSeer: How About Fault Tolerance?

Metadata?
◦ Distributed in a DHT, already benefits from some DHT-inherent FT

Centralized entities?
◦ Version manager, provider manager
◦ First idea: PAXOS-like, consensus-based solutions

Data?
◦ Simple replication policies not enough
◦ FT needs to be adapted both to access pattern and running environment

Fault tolerance in grid computing

Two theoretical visions of the grid:
◦ Multiple entities: The grid is a set of computational resources
◦ Single entity: The grid is a “black box” that provides a set of services

Two types of fault tolerance:
◦ Resource level: Dependability issues in the grid resources
◦ Global level: Dependability issues related to the whole grid (the services provided)

Global level fault tolerance

Improving dependability of the services provided
◦ Low-level approach (multiple entity point of view)
◦ High-level approach (single entity point of view)

Is the “single entity” view possible?
◦ Maybe the grid is too large and complex to be understood as just one entity...
◦ ...or maybe it is just a matter of perspective.

The key: Abstraction

Improving Fault Tolerance through Global Behavior Modeling

The Grid seen as a single entity: global-level fault tolerance

GloBeM: Global Behavior Modeling
◦ Global QoS rather than dependability of individual resources

J. Montes, A. Sanchez, J. J. Valdes, M. S. Perez, and P. Herrero, "The grid as a single entity: Towards a behavior model of the whole grid" in OTM Conferences (1), ser. Lecture Notes in Computer Science, R. Meersman and Z. Tari, Eds., vol. 5331. Springer, 2008, pp. 886-897.

GloBeM: Zooming In

Based on historical monitoring information

Uses knowledge discovery techniques (Data Mining…)

Generates a behavior model in the form of a Finite State Machine
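The slides do not detail GloBeM's knowledge discovery step, so the following is only a rough illustration of the general idea of deriving a state machine from monitoring history. It swaps in a much simpler technique: each sample is mapped to a state by a fixed threshold, and transitions are estimated by counting consecutive pairs. The metric (fraction of failed resources), the state names, and the function names are all assumptions.

```python
from collections import Counter, defaultdict


def label(sample, threshold=0.5):
    """Map one monitoring sample (assumed: fraction of failed resources) to a state."""
    return "degraded" if sample >= threshold else "healthy"


def build_fsm(history, threshold=0.5):
    """Return {state: {next_state: estimated probability}} from consecutive samples."""
    states = [label(s, threshold) for s in history]
    counts = defaultdict(Counter)
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    return {state: {nxt: n / sum(row.values()) for nxt, n in row.items()}
            for state, row in counts.items()}


history = [0.1, 0.2, 0.7, 0.8, 0.3, 0.1]   # made-up monitoring samples
fsm = build_fsm(history)
```

With this made-up history the model has two states, and the transition table says, for example, that a degraded system was observed to stay degraded half of the time.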

Using GloBeM to Improve Fault Tolerance

Extracted from Jesus Montes, Alberto Sanchez and Maria S. Perez, « Improving grid fault tolerance by means of global behavior modelling ». Submitted to Grid’2009.

Applications to BlobSeer

Use GloBeM modeling techniques to improve BlobSeer’s dependability and QoS
◦ Model behavior patterns
◦ Implement adaptive strategies (e.g. reactive and/or proactive fault tolerance)

Applications to BlobSeer (2)

Steps:
◦ Define relevant metrics to monitor
  Storage usage, effective bandwidth, resource failure rate
◦ Model BlobSeer using GloBeM techniques
◦ Analyze states to understand behavior
  Each state may correspond to a certain level of QoS
◦ Define adapted fault-tolerance policies based on BlobSeer dynamic changes
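As a hedged illustration of the last step, a state-dependent policy could be as simple as a table from the current global state to a replication degree. The state names, replica counts, and placement rule below are invented for this sketch; the slides do not specify any concrete policy.

```python
# Hypothetical mapping from a modeled global state to a page replication degree.
POLICY = {"healthy": 1, "unstable": 2, "degraded": 3}


def replicas_for(state, default=3):
    """Replication degree for a state; unknown states fall back to the safest value."""
    return POLICY.get(state, default)


def choose_providers(page_id, providers, state):
    """Pick distinct providers for a page, starting from a position derived from its id."""
    count = min(replicas_for(state), len(providers))
    start = sum(page_id.encode()) % len(providers)
    return [providers[(start + i) % len(providers)] for i in range(count)]


available = ["p0", "p1", "p2", "p3"]
calm = choose_providers("page-7", available, "healthy")
stressed = choose_providers("page-7", available, "degraded")
```

The point of the sketch is the shape of the decision: the same write stores more copies when the model reports a state associated with lower QoS.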

Questions?
