Pyramid: A large-scale array-oriented active storage system Viet-Trung TRAN, Nicolae Bogdan, Gabriel Antoniu, Luc Bougé KerData Team Inria, Rennes, France 02 09 2011


DESCRIPTION

The recent explosion in data sizes manipulated by distributed scientific applications has prompted the need to develop specialized storage systems capable of dealing with specific access patterns in a scalable fashion. In this context, a large class of applications focuses on parallel array processing: small parts of huge multi-dimensional arrays are concurrently accessed by a large number of clients, both for reading and writing. A specialized storage system that deals with such an access pattern faces several challenges at the level of data/metadata management. We introduce Pyramid, an active array-oriented storage system that addresses these challenges and shows promising results in our initial evaluation.


Pyramid: A large-scale array-oriented active storage system

Viet-Trung TRAN, Nicolae Bogdan, Gabriel Antoniu, Luc Bougé

KerData Team

Inria, Rennes, France 02 09 2011


Outline

1. Motivation

2. Architecture

3. Preliminary evaluation

4. Conclusion

1. Motivation: Why array-oriented storage?

Context: Data-intensive large-scale HPC simulations

• The scalability of data management is becoming a critical issue
• Mismatch between storage model and application data model
• Application data model
- Multidimensional typed arrays, images, etc.
• Storage model
- Parallel file systems: simple and flat I/O model (a sequence of bytes)
- Relational model: ill-suited for scientific data
• Additional layers are needed to map the application model to the storage model

[M. Stonebraker] The one-storage-fits-all-needs approach has reached its limits

• Parallel I/O stack:
- Performance of non-contiguous I/O vs. data atomicity
• Relational data model:
- Simulating arrays on top of tables performs poorly
- Scalability issues with join queries
• Need to specialize the I/O stack to match application requirements
- Array-oriented storage for the array data model
• Example: SciDB with ArrayStore.

[Figure: the parallel I/O stack, top to bottom: Application (Visit, Tornado simulation); Data model (HDF5, NetCDF); MPI-IO middleware; Parallel file systems]


Our approach

• Multi-dimensional aware chunking

• Lock-free, distributed chunk indexing

• Array versioning

• Active storage support

• Versioning array-oriented access interface


Multi-dimensional aware chunking

• Split the array into equal-sized chunks and distribute them over the storage elements

- Simplifies load balancing among storage elements

- Keeps neighboring cells in the same chunk

• Shared nothing architecture

- Easier to handle data consistency
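The chunking idea can be illustrated with a minimal Python sketch. The function names `chunk_of` and `server_of` are hypothetical, and the round-robin placement is only an assumption for illustration; in Pyramid the storage manager decides where chunks go.

```python
from typing import Tuple

Coords = Tuple[int, ...]

def chunk_of(cell: Coords, chunk_shape: Coords) -> Coords:
    """Grid coordinates of the chunk that contains `cell`."""
    return tuple(c // s for c, s in zip(cell, chunk_shape))

def server_of(chunk: Coords, grid_shape: Coords, n_servers: int) -> int:
    """Assumed placement: row-major chunk index, then round-robin."""
    linear = 0
    for coord, extent in zip(chunk, grid_shape):
        linear = linear * extent + coord
    return linear % n_servers

# A 4096 x 4096 array cut into 1024 x 1024 chunks gives a 4 x 4 chunk grid.
chunk_shape = (1024, 1024)
grid_shape = (4, 4)

left = chunk_of((1500, 2046), chunk_shape)    # neighboring cells ...
right = chunk_of((1500, 2047), chunk_shape)   # ... share a chunk
across = chunk_of((1500, 2048), chunk_shape)  # next chunk along the second axis
```

Because the chunks are equal in size, any placement that spreads chunk indices evenly also spreads the bytes evenly.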


Lock-free, distributed chunk indexing

• Indexing multi-dimensional information

- R-tree, XD-tree, Quad-tree, etc

- Designed and optimized for centralized management

• A centralized metadata management scheme may not scale

- Bottleneck under high concurrency

• Our approach:

- Porting quad-tree-like structures to a distributed environment

- Using a shadowing technique on the quad-tree to enable lock-free concurrent updates


Array versioning

• Scientific applications need array versioning (VLDB 2009)

- Checkpointing

- Cloning

- Provenance

• Keep data and metadata immutable

- Updating a chunk is handled at the metadata level using the shadowing technique


Active storage support

• Move data computation to storage elements

- Conserve bandwidth

- Better workload parallelization

• Allow users to send user-defined handlers to the storage servers
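As a rough illustration of why this conserves bandwidth, the sketch below runs a handler next to the data and returns only its result. The class `ActiveStorageServer` and the handler `chunk_max` are hypothetical names, and everything lives in one process; no network or Pyramid API is modeled.

```python
from typing import Callable, Dict, List, Tuple

Chunk = List[List[float]]

class ActiveStorageServer:
    """Toy server: stores chunks and executes handlers on them locally."""

    def __init__(self) -> None:
        self.chunks: Dict[Tuple[int, int], Chunk] = {}

    def put(self, key: Tuple[int, int], data: Chunk) -> None:
        self.chunks[key] = data

    def run(self, key: Tuple[int, int], handler: Callable[[Chunk], float]) -> float:
        # The handler runs where the chunk is stored; only its result leaves.
        return handler(self.chunks[key])

def chunk_max(chunk: Chunk) -> float:
    """Example user-defined handler: largest value in the chunk."""
    return max(max(row) for row in chunk)

server = ActiveStorageServer()
server.put((0, 0), [[1.0, 7.5], [3.0, 2.0]])
result = server.run((0, 0), chunk_max)
```

Here the client receives a single number instead of the whole chunk, and handlers on different servers could run in parallel.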


Versioning array-oriented access interface

• Basic primitives

- id = CREATE(n, sizes[], defval)

- READ(id, v, offsets[], sizes[], buffer)

- w = WRITE(id, offsets[], sizes[], buffer)

- w = SEND_COMPUTATION(id, v, offsets[], sizes[], f)

• Other primitives, such as cloning and filtering, can mostly be implemented on top of the primitives above
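To make the semantics of the four primitives concrete, here is a single-process toy in Python. It is not the Pyramid implementation: `ToyArrayStore` is a hypothetical name, each version is stored as a full copy rather than through shadowing, and the reading of SEND_COMPUTATION (apply `f` to every cell of the sub-domain of version `v` and store the result as a new version) is an assumption.

```python
import itertools

def _cells(offsets, sizes):
    """All cell coordinates of the sub-domain, in row-major order."""
    return itertools.product(*(range(o, o + s) for o, s in zip(offsets, sizes)))

class ToyArrayStore:
    def __init__(self):
        self._arrays = {}

    def create(self, n, sizes, defval):
        assert n == len(sizes)
        array_id = len(self._arrays)
        # Version 0 is the empty array: every cell reads as defval.
        self._arrays[array_id] = {"defval": defval, "versions": [{}]}
        return array_id

    def read(self, array_id, v, offsets, sizes, buffer):
        arr = self._arrays[array_id]
        snapshot = arr["versions"][v]
        buffer.clear()
        buffer.extend(snapshot.get(c, arr["defval"]) for c in _cells(offsets, sizes))

    def write(self, array_id, offsets, sizes, buffer):
        arr = self._arrays[array_id]
        snapshot = dict(arr["versions"][-1])  # full copy, for simplicity only
        for cell, value in zip(_cells(offsets, sizes), buffer):
            snapshot[cell] = value
        arr["versions"].append(snapshot)      # older versions stay readable
        return len(arr["versions"]) - 1

    def send_computation(self, array_id, v, offsets, sizes, f):
        buffer = []
        self.read(array_id, v, offsets, sizes, buffer)
        return self.write(array_id, offsets, sizes, [f(x) for x in buffer])

store = ToyArrayStore()
aid = store.create(2, [4, 4], 0)
w1 = store.write(aid, [1, 1], [2, 2], [1, 2, 3, 4])
w2 = store.send_computation(aid, w1, [1, 1], [2, 2], lambda x: x * 10)
old, new = [], []
store.read(aid, w1, [1, 1], [2, 2], old)
store.read(aid, w2, [1, 1], [2, 2], new)
```

Reading version `w1` after `w2` exists still returns the earlier values, which is the property checkpointing and provenance rely on.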

2. Pyramid: Architecture


Architecture

• Pyramid is inspired by our previous work: BlobSeer [JPDC 2011]

• Version managers

- Ensure concurrency control

• Metadata managers

- Store index tree nodes

• Storage manager

- Monitor the storage servers

- Enforce a load-balancing strategy for chunks among the storage servers

• Active storage servers

- Store chunks and execute handlers on chunks

• Clients

- Perform I/O accesses


Read

• I: optionally ask the version manager for the latest published version
• II: fetch the corresponding metadata from the metadata managers
• III: contact storage servers in parallel and fetch the chunks into the local buffer

[Figure: interactions I to III between the client, the version managers, the metadata managers, and the storage servers]
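The three read steps can be sketched in Python as follows. The dictionaries standing in for the version manager, metadata managers, and storage servers are hypothetical in-memory placeholders; a thread pool plays the role of the parallel fetch.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the three services.
latest_published = {"array-0": 2}                 # version manager
metadata = {                                      # metadata managers:
    ("array-0", 2): {                             # chunk coords -> (server, chunk key)
        (0, 0): ("s1", "k-a"),
        (0, 1): ("s2", "k-b"),
    }
}
servers = {"s1": {"k-a": b"AAAA"}, "s2": {"k-b": b"BBBB"}}  # storage servers

def read(array_id, chunk_coords, version=None):
    # I: optionally ask the version manager for the latest published version.
    if version is None:
        version = latest_published[array_id]
    # II: fetch the chunk locations for that version from the metadata.
    index = metadata[(array_id, version)]
    locations = {coords: index[coords] for coords in chunk_coords}

    # III: contact the storage servers in parallel and fill the local buffer.
    def fetch(item):
        coords, (server, key) = item
        return coords, servers[server][key]

    with ThreadPoolExecutor(max_workers=4) as pool:
        return dict(pool.map(fetch, locations.items()))

buffer = read("array-0", [(0, 0), (0, 1)])
```

Step I is skipped when the caller already knows which version it wants, which is why the slide marks it as optional.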


Write

• I: get a list of storage servers that are able to store the chunks, one for each chunk
• II: contact storage servers in parallel and write the chunks to the corresponding servers
• III: get a version number for the update
• IV: add new metadata to consolidate the new version
• V: report that the new version is ready for publication

[Figure: interactions I to V between the client, the storage manager, the storage servers, the version manager, and the metadata managers]
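The five write steps can be walked through with a single-client Python sketch. All names are hypothetical, chunks are written sequentially rather than in parallel, and step IV copies the previous version's entries into a flat dictionary instead of building a shared tree, so this only illustrates the ordering of the steps.

```python
import itertools
import uuid

servers = {"s0": {}, "s1": {}, "s2": {}}    # storage servers: chunk key -> bytes
metadata = {}                                # (version, chunk coords) -> (server, key)
versions = {"assigned": 0, "published": 0}   # version manager state
_round_robin = itertools.cycle(sorted(servers))

def write(chunks):
    # I: the storage manager picks one server per chunk.
    placement = {coords: next(_round_robin) for coords in chunks}
    # II: write each chunk under a fresh key; existing chunks are never overwritten.
    keys = {}
    for coords, data in chunks.items():
        keys[coords] = uuid.uuid4().hex
        servers[placement[coords]][keys[coords]] = data
    # III: only now ask the version manager for a version number.
    versions["assigned"] += 1
    version = versions["assigned"]
    # IV: add metadata; untouched chunks keep pointing at their old location.
    inherited = {c: loc for (v, c), loc in metadata.items() if v == version - 1}
    for coords, loc in inherited.items():
        metadata[(version, coords)] = loc
    for coords in chunks:
        metadata[(version, coords)] = (placement[coords], keys[coords])
    # V: report that the version is ready; with one client it is published at once.
    versions["published"] = version
    return version

def read_chunk(version, coords):
    server, key = metadata[(version, coords)]
    return servers[server][key]

v1 = write({(0, 0): b"a", (0, 1): b"b"})
v2 = write({(0, 1): b"c"})
```

Writing the data before requesting a version number keeps the slow part of the update outside the version manager's critical path.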


Lock-free, distributed chunk indexing

• Organized as a quad-tree to index 2D arrays

• Each tree node has at most 4 children, each covering one of the four quadrants

• The root node covers the whole array

• Each leaf corresponds to a chunk and holds information about its location

• Tree nodes are immutable, uniquely identified by the version number and the sub-domain they cover

• A DHT is used to distribute tree nodes over the metadata managers
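A small Python sketch of such an index follows. It assumes a square chunk grid whose side is a power of two, identifies each node by `(version, (x, y, size))` in chunk units, and uses a CRC32 hash over three dictionaries as a stand-in for the DHT; all of these choices and names are illustrative assumptions.

```python
import zlib

N_MANAGERS = 3
dht = [dict() for _ in range(N_MANAGERS)]  # one dict per metadata manager

def manager_of(key):
    """Pick the metadata manager responsible for a node key."""
    return zlib.crc32(repr(key).encode()) % N_MANAGERS

def put(key, node):
    dht[manager_of(key)][key] = node

def get(key):
    return dht[manager_of(key)][key]

def quadrants(x, y, size):
    h = size // 2
    return [(x, y, h), (x + h, y, h), (x, y + h, h), (x + h, y + h, h)]

def build(version, x, y, size, location_of):
    """Store the subtree covering the sub-domain and return its node key."""
    key = (version, (x, y, size))
    if size == 1:
        put(key, {"leaf": True, "location": location_of(x, y)})
    else:
        children = [build(version, qx, qy, qs, location_of)
                    for qx, qy, qs in quadrants(x, y, size)]
        put(key, {"leaf": False, "children": children})
    return key

def lookup(root, cx, cy):
    """Walk from the root to the leaf of chunk (cx, cy); return its location."""
    key = root
    while True:
        node = get(key)
        if node["leaf"]:
            return node["location"]
        for child in node["children"]:
            _, (x, y, size) = child
            if x <= cx < x + size and y <= cy < y + size:
                key = child
                break
        else:
            raise KeyError((cx, cy))

root = build(1, 0, 0, 4, lambda x, y: f"server-{(x + y) % 2}")
where = lookup(root, 3, 2)
```

Since a node key is fully determined by the version and the sub-domain, a client can compute which manager holds a node without asking any central directory.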


Tree shadowing to update

• Write newly created chunks to storage servers

• Build the quad-tree associated with the new snapshot in a bottom-up fashion

- Write the leaves to the DHT

- Inner nodes may point to nodes of previous snapshots (which implies synchronizing the quad-tree generation)

- Avoid this synchronization by giving each writer additional information about the other concurrent updaters (possible thanks to the computable IDs of tree nodes)
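The sharing of unchanged subtrees can be seen in a short sequential Python sketch. It assumes the previous snapshot's tree is already complete, so it shows only the shadowing itself and not the synchronization-free variant for concurrent writers; the names `store`, `build_initial`, and `shadow` are hypothetical.

```python
store = {}  # node key (version, (x, y, size)) -> node; stands in for the DHT

def quadrants(x, y, size):
    h = size // 2
    return [(x, y, h), (x + h, y, h), (x, y + h, h), (x + h, y + h, h)]

def build_initial(version, x, y, size):
    """Full tree for the first snapshot of a size x size chunk grid."""
    key = (version, (x, y, size))
    if size == 1:
        store[key] = ("leaf", f"chunk@v{version}")
    else:
        store[key] = ("inner", [build_initial(version, *q)
                                for q in quadrants(x, y, size)])
    return key

def shadow(prev_key, version, updated):
    """Key of the node that covers prev_key's sub-domain in the new snapshot."""
    _, (x, y, size) = prev_key
    touched = any(x <= ux < x + size and y <= uy < y + size for ux, uy in updated)
    if not touched:
        return prev_key  # share the old subtree as is; nothing is copied
    key = (version, (x, y, size))
    if size == 1:
        store[key] = ("leaf", f"chunk@v{version}")
    else:
        _, children = store[prev_key]
        # Children are resolved first, so each inner node is stored after
        # the nodes below it: the new tree is written bottom-up.
        store[key] = ("inner", [shadow(c, version, updated) for c in children])
    return key

root1 = build_initial(1, 0, 0, 4)
nodes_before = len(store)
root2 = shadow(root1, 2, {(3, 3)})
nodes_added = len(store) - nodes_before
```

Updating one chunk of a 4 x 4 grid adds only the nodes on the path from that leaf to the root, and no existing node is modified.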


Efficient parallel updating

• Chunks are written concurrently
• Versions are assigned in the order the clients finish writing
• Clients get additional information about the other concurrent writers
• Tree nodes are written in a lock-free manner
• Versions are published in the order they were assigned

[Figure: two clients (#1, #2) interacting with the storage servers, the metadata managers, and the version manager; each update ends with a Publish step]
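The ordering rules for versions can be sketched as a tiny version manager in Python. The class and method names are hypothetical and the code is not thread-safe; it only shows that a version becomes visible once all earlier versions are visible too.

```python
class VersionManager:
    def __init__(self):
        self.assigned = 0     # last version number handed out
        self.published = 0    # last version visible to readers
        self._ready = set()   # versions whose metadata is complete

    def assign(self):
        """Called by a client once its chunks are written."""
        self.assigned += 1
        return self.assigned

    def report_ready(self, version):
        """Called by a client once its tree nodes are written."""
        self._ready.add(version)
        # Publish strictly in assignment order, never skipping a version.
        while self.published + 1 in self._ready:
            self.published += 1
            self._ready.remove(self.published)
        return self.published

vm = VersionManager()
first = vm.assign()    # client #1 finished writing its chunks first
second = vm.assign()   # client #2 finished next
after_second_ready = vm.report_ready(second)  # client #2's metadata is done first
after_first_ready = vm.report_ready(first)    # now both can be published
```

Client #2 does not block while waiting: its version simply stays unpublished until client #1 reports in.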


Some more I/O primitives

• Easily implemented thanks to immutable data and metadata blocks

• Cheap I/O operators

• Clone a sub-domain

- Follow the metadata tree of a specific snapshot

- Create a new metadata tree and publish it as a newly created array

• Filtering and compression can be done locally, in parallel, at the active storage servers by introducing user-defined handlers

3. Preliminary evaluation: experiments on G5K (www.grid5000.fr)


Experimental setup

Simulate a common access pattern exhibited by scientific applications: array dicing

• Using at most 130 nodes of the Graphene cluster on G5K

- 1 Gbps Ethernet interconnect

- 49 nodes are used to deploy Pyramid and the competitor system, PVFS

• Array dicing

- Each client accesses a dedicated sub-array

- 1 GB per client, consisting of 32x32 chunks (chunk size: 1024x1024 bytes)

- Concurrent reading/writing

• Measure the performance and scalability
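The per-client data volume follows directly from the chunk figures above; the short Python check below works through the arithmetic. The variable names are illustrative only.

```python
chunk_bytes = 1024 * 1024         # one chunk: 1024 x 1024 bytes
chunks_per_client = 32 * 32       # each client's sub-array: 32 x 32 chunks
bytes_per_client = chunk_bytes * chunks_per_client

gib_per_client = bytes_per_client / 2**30
side_bytes = 32 * 1024            # side of one client's sub-array, in bytes
```

So each client's sub-array is a 32768 x 32768 byte square, which is exactly 2^30 bytes.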


Aggregated throughput achieved under concurrency

• PVFS suffers from the non-contiguous access pattern due to serialization to a flat file
• Pyramid
- Throughput increases steadily
- Promising scalability for both data and metadata organization

4. Conclusion


Conclusion

• Pyramid is an array-oriented active storage system

• Proposed a system offering support for

- Parallel array processing for both read and write workloads

- Versioning data

- Distributed metadata management, shadowing to reflect updates

• Preliminary evaluation shows promising scalability

• Future work

- Planned integration with HDF5

- Pyramid as a storage engine for SciDB?

- Investigate keeping data at quad-tree nodes, which could be used to store arrays at different resolutions (map applications)

Thank you

INRIA – KerData Research Team

www.irisa.fr/kerdata