DESCRIPTION
The recent explosion in data sizes manipulated by distributed scientific applications has prompted the need to develop specialized storage systems capable of dealing with specific access patterns in a scalable fashion. In this context, a large class of applications focuses on parallel array processing: small parts of huge multi-dimensional arrays are concurrently accessed by a large number of clients, both for reading and writing. A specialized storage system that deals with such an access pattern faces several challenges at the level of data/metadata management. We introduce Pyramid, an active array-oriented storage system that addresses these challenges and shows promising results in our initial evaluation.
Pyramid: A large-scale array-oriented active storage system
Viet-Trung Tran, Bogdan Nicolae, Gabriel Antoniu, Luc Bougé
KerData Team
Inria, Rennes, France, 02 09 2011
Outline
1. Motivation
2. Architecture
3. Preliminary evaluation
4. Conclusion
1. Motivation: Why array-oriented storage?
Context: data-intensive, large-scale HPC simulations
• The scalability of data management is becoming a critical issue
• Mismatch between the storage model and the application data model
• Application data model
  - Multidimensional typed arrays, images, etc.
• Storage model
  - Parallel file systems: a simple, flat I/O model (a sequence of bytes)
  - Relational model: ill-suited for scientific data
• Additional layers are needed to map the application model onto the storage model
[M. Stonebraker] "One size fits all" storage has reached its limits
• Parallel I/O stack:
  - Performance of non-contiguous I/O vs. data atomicity
• Relational data model:
  - Simulating arrays on top of tables performs poorly
  - Join queries limit scalability
• Need to specialize the I/O stack to match application requirements
  - Array-oriented storage for the array data model
• Example: SciDB with ArrayStore
[Figure: the parallel I/O stack]
• Application (VisIt, tornado simulation)
• Data model (HDF5, NetCDF)
• MPI-IO middleware
• Parallel file systems
Our approach
• Multi-dimensional aware chunking
• Lock-free, distributed chunk indexing
• Array versioning
• Active storage support
• Versioning array-oriented access interface
Multi-dimensional aware chunking
• Split the array into equal-sized chunks and distribute them over storage elements
  - Simplifies load balancing among storage elements
  - Keeps neighboring cells in the same chunk
• Shared-nothing architecture
  - Easier to handle data consistency
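The chunking scheme above can be illustrated with a small sketch. All names (`chunk_coords`, `server_for`) and the hash-based placement are illustrative assumptions, not Pyramid's actual implementation:

```python
# A minimal sketch of multi-dimensional aware chunking: a 2D array is
# split into equal-sized chunks, and each chunk is mapped to a storage
# server by hashing its chunk coordinates.

def chunk_coords(offsets, sizes, chunk_shape):
    """Yield the (row, col) coordinates of every chunk that intersects
    the sub-array starting at `offsets` with extent `sizes`."""
    (r0, c0), (nr, nc) = offsets, sizes
    (cr, cc) = chunk_shape
    for r in range(r0 // cr, (r0 + nr - 1) // cr + 1):
        for c in range(c0 // cc, (c0 + nc - 1) // cc + 1):
            yield (r, c)

def server_for(coord, num_servers):
    """Map a chunk coordinate to one of `num_servers` storage servers."""
    return hash(coord) % num_servers

# A 1024x1024 sub-array starting at (0, 0), with 256x256 chunks,
# touches a 4x4 grid of chunks:
touched = list(chunk_coords((0, 0), (1024, 1024), (256, 256)))
assert len(touched) == 16
```

Because chunks are equal-sized and placement is a pure function of the chunk coordinates, any client can locate a chunk's server without a central lookup, which fits the shared-nothing design.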
Lock-free, distributed chunk indexing
• Indexing multi-dimensional information
  - R-trees, XD-trees, quad-trees, etc.
  - Designed and optimized for centralized management
• A centralized metadata management scheme may not scale
  - Bottleneck under high concurrency
• Our approach:
  - Port quad-tree-like structures to a distributed environment
  - Use a shadowing technique on the quad-tree to enable lock-free concurrent updates
Array versioning
• Scientific applications need array versioning (VLDB 2009)
  - Checkpointing
  - Cloning
  - Provenance
• Keep data and metadata immutable
  - Updating a chunk is handled at the metadata level using a shadowing technique
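The immutable-snapshot idea can be sketched as a copy-on-write chunk map; `snapshots`, `create`, and `write` are illustrative stand-ins, not Pyramid's API:

```python
# A minimal copy-on-write sketch of array versioning: data and metadata
# are immutable, so updating a chunk creates a new snapshot whose map
# points to the new chunk, while older snapshots stay readable.

snapshots = []          # snapshots[v] maps chunk coordinate -> chunk data

def create(initial_chunks):
    snapshots.append(dict(initial_chunks))
    return len(snapshots) - 1

def write(version, coord, data):
    """Shadow the parent snapshot: copy its chunk map, replace one entry."""
    new_map = dict(snapshots[version])
    new_map[coord] = data
    snapshots.append(new_map)
    return len(snapshots) - 1

v0 = create({(0, 0): b"old"})
v1 = write(v0, (0, 0), b"new")
assert snapshots[v0][(0, 0)] == b"old"   # the old version is intact
assert snapshots[v1][(0, 0)] == b"new"
```

Checkpointing and provenance fall out naturally: every past version remains addressable, and cloning is just starting a new lineage from an existing snapshot.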
Active storage support
• Move computation to the storage elements
  - Conserves bandwidth
  - Better workload parallelization
• Allow users to send user-defined handlers to the storage servers
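The bandwidth argument can be made concrete with a toy model; `send_computation` and the dict-per-server representation are illustrative assumptions:

```python
# A minimal sketch of active storage: instead of shipping chunks to the
# client, the client ships a handler that each storage server applies
# locally to its chunks; only the (small) results cross the network.

def send_computation(servers, handler):
    """Apply `handler` to every chunk each server holds and return the
    per-chunk results, keyed by chunk coordinate."""
    return {coord: handler(chunk)
            for server in servers
            for coord, chunk in server.items()}

# Two servers, each holding one chunk of integers:
servers = [{(0, 0): [1, 2, 3]}, {(0, 1): [4, 5, 6]}]
sums = send_computation(servers, sum)     # user-defined handler: sum
assert sums == {(0, 0): 6, (0, 1): 15}
```

Each server processes its own chunks independently, so the computation parallelizes across as many servers as hold data.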
Versioning array-oriented access interface
• Basic primitives
  - id = CREATE(n, sizes[], defval)
  - READ(id, v, offsets[], sizes[], buffer)
  - w = WRITE(id, offsets[], sizes[], buffer)
  - w = SEND_COMPUTATION(id, v, offsets[], sizes[], f)
• Other primitives, such as cloning and filtering, can mostly be implemented on top of the primitives above
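The primitives above can be exercised against a toy in-memory model. The sketch below assumes a 1D array and illustrative versioning semantics (the real interface is n-dimensional); the `Array` class is not Pyramid's actual client library:

```python
# A minimal in-memory sketch of the versioning access interface for a
# 1D array: CREATE sets a default value, WRITE returns a new version,
# and READ addresses an explicit version v.

class Array:
    def __init__(self, size, defval):
        self.versions = [[defval] * size]    # version 0

    def read(self, v, offset, size, buffer):
        buffer[:size] = self.versions[v][offset:offset + size]

    def write(self, offset, size, buffer):
        snap = list(self.versions[-1])       # shadow the latest version
        snap[offset:offset + size] = buffer[:size]
        self.versions.append(snap)
        return len(self.versions) - 1        # new version number

a = Array(4, 0)
w = a.write(1, 2, [7, 8])
buf = [None] * 2
a.read(w, 1, 2, buf)
assert buf == [7, 8]
assert a.versions[0] == [0, 0, 0, 0]         # version 0 unchanged
```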
2. Pyramid: Architecture
Architecture
• Pyramid is inspired by our previous work: BlobSeer [JPDC 2011]
• Version managers
- Ensure concurrency control
• Metadata managers
- Store index tree nodes
• Storage manager
  - Monitors the storage servers
  - Ensures a load-balancing strategy for chunks among the storage servers
• Active storage servers
- Store chunks and perform handlers on chunks
• Clients
- Perform I/O accesses
Read
• I: optionally ask the version manager for the latest published version
• II: fetch the corresponding metadata from the metadata managers
• III: contact the storage servers in parallel and fetch the chunks into the local buffer
[Figure: the client contacts the version manager (I), the metadata managers (II), and the storage servers (III)]
Write
• I: get a list of storage servers able to store the chunks, one per chunk
• II: contact the storage servers in parallel and write the chunks to the corresponding providers
• III: get a version number for the update
• IV: add new metadata to consolidate the new version
• V: report that the new version is ready for publication
[Figure: the client contacts the storage manager (I), the storage servers (II), the version manager (III, V), and the metadata managers (IV)]
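The five write steps can be sketched with in-memory stand-ins for the managers; all names, the hash-based placement, and the version counter are illustrative assumptions:

```python
# A minimal sketch of the write path: place chunks (I), write them in
# parallel (sequential here) (II), obtain a version (III), consolidate
# metadata (IV), and report the version ready for publication (V).

storage_servers = [dict() for _ in range(4)]
metadata = {}                  # (version, chunk coordinate) -> server index
published = []                 # versions, in assignment order
next_version = [1]

def write(chunks):
    # I: ask the storage manager for one server per chunk
    placement = {coord: hash(coord) % len(storage_servers) for coord in chunks}
    # II: write the chunks to those servers
    for coord, data in chunks.items():
        storage_servers[placement[coord]][coord] = data
    # III: get a version number for the update
    v = next_version[0]
    next_version[0] += 1
    # IV: add new metadata consolidating the new version
    for coord in chunks:
        metadata[(v, coord)] = placement[coord]
    # V: report the new version as ready for publication
    published.append(v)
    return v

v = write({(0, 0): b"a", (0, 1): b"b"})
assert v == 1 and published == [1]
```

Note that chunk data lands on the servers before any version is assigned, so a slow writer never blocks readers of already-published versions.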
Lock-free, distributed chunk indexing
• Organized as a quad-tree to index 2D arrays
• Each tree node has at most 4 children, each covering one of the four quadrants
• The root covers the whole array
• Each leaf corresponds to a chunk and holds information about its location
• Tree nodes are immutable, uniquely identified by their version number and the sub-domain they cover
• A DHT distributes the tree nodes over the metadata managers
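The (version, sub-domain) identification scheme can be sketched as follows; `node_key`, `children`, and the (row, col, rows, cols) tuple encoding are illustrative assumptions:

```python
# A minimal sketch of the distributed quad-tree index: each node is
# immutable and identified by (version, sub-domain), and that identifier
# keys a DHT lookup across the metadata managers.

def node_key(version, subdomain):
    """Compute the DHT key of a tree node from its version and the
    (row, col, rows, cols) sub-domain it covers."""
    return hash((version,) + subdomain)

def children(subdomain):
    """Split a sub-domain into its four quadrants."""
    r, c, h, w = subdomain
    hh, hw = h // 2, w // 2
    return [(r, c, hh, hw), (r, c + hw, hh, hw),
            (r + hh, c, hh, hw), (r + hh, c + hw, hh, hw)]

root = (0, 0, 1024, 1024)
quads = children(root)
assert len(quads) == 4 and quads[3] == (512, 512, 512, 512)
```

Because the key is computable from the version and sub-domain alone, any client can locate any tree node with one DHT lookup, without consulting a central index.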
Tree shadowing to update
• Write the newly created chunks to the storage servers
• Build the quad-tree associated with the new snapshot in a bottom-up fashion
  - Write the leaves to the DHT
  - Inner nodes may point to nodes of previous snapshots (which would imply synchronizing quad-tree generation)
  - Avoid this synchronization by feeding clients additional information about the other concurrent updaters (thanks to the computable IDs of tree nodes)
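The node-sharing effect of shadowing can be sketched as below; the DHT dictionary and helper names are illustrative, and for brevity nodes are keyed directly by (version, sub-domain) rather than a hash:

```python
# A minimal sketch of tree shadowing: a new snapshot's quad-tree is
# built bottom-up, writing only the nodes that changed; inner nodes over
# untouched quadrants reference the previous snapshot's nodes.

dht = {}   # (version, subdomain) -> chunk location, or list of child keys

def put_leaf(version, subdomain, location):
    dht[(version, subdomain)] = location
    return (version, subdomain)

def put_inner(version, subdomain, child_keys):
    dht[(version, subdomain)] = child_keys
    return (version, subdomain)

# Snapshot 1 wrote all four quadrants of an 8x8 array; snapshot 2
# rewrote only the top-left quadrant, so its root reuses three
# version-1 leaves without copying them:
q = [(0, 0, 4, 4), (0, 4, 4, 4), (4, 0, 4, 4), (4, 4, 4, 4)]
v1_leaves = [put_leaf(1, s, f"server-{i}") for i, s in enumerate(q)]
v2_leaf = put_leaf(2, q[0], "server-9")
root2 = put_inner(2, (0, 0, 8, 8), [v2_leaf] + v1_leaves[1:])
assert dht[root2][1] == (1, q[1])   # shared node from snapshot 1
```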
Efficient parallel updating
• Chunks are written concurrently
• Versions are assigned in the order clients finish writing
• Clients get additional information about the other concurrent writers
• Tree nodes are written in a lock-free manner
• Versions are published in the order they were assigned
[Figure: two clients write concurrently to the storage servers and metadata managers; the version manager publishes their versions in order]
Some more I/O primitives
• Easily implemented thanks to immutable data and metadata blocks
• Cheap I/O operations
• Clone a sub-domain
  - Follow the metadata tree of a specific snapshot
  - Create a new metadata tree and publish it as a newly created array
• Filtering and compression can be done locally, in parallel, at the active storage servers by introducing user-defined handlers
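Why cloning is cheap can be sketched in a few lines; the `dht` layout and the `clone` function are illustrative assumptions, not Pyramid's actual metadata format:

```python
# A minimal sketch of the cheap clone operation: since metadata nodes
# are immutable, cloning a sub-domain only publishes a new root that
# references the existing sub-tree; no chunk data is copied.

dht = {
    (1, (0, 0, 4, 4)): "server-0",   # snapshot-1 leaf, top-left quadrant
    (1, (0, 4, 4, 4)): "server-1",
}

def clone(version, subdomain, new_array_id):
    """Publish an existing sub-tree as version 1 of a new array by
    pointing the new array's root at the old metadata node."""
    dht[(new_array_id, 1)] = (version, subdomain)
    return new_array_id

clone(1, (0, 0, 4, 4), "arrayB")
old_key = dht[("arrayB", 1)]
assert dht[old_key] == "server-0"    # same chunk, no data copied
```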
3. Preliminary evaluation
Experiments on Grid'5000 (www.grid5000.fr)
Experimental setup
Simulates a common access pattern of scientific applications: array dicing
• Up to 130 nodes of the Graphene cluster on Grid'5000
  - 1 Gbps Ethernet interconnect
  - 49 nodes deployed for Pyramid and for the competitor system, PVFS
• Array dicing
  - Each client accesses a dedicated sub-array
  - 1 GB per client, consisting of 32x32 chunks (1024x1024 bytes per chunk)
  - Concurrent reading/writing
• Measure performance and scalability
Aggregated throughput achieved under concurrency
• PVFS suffers from the non-contiguous access pattern, due to serialization into a flat file
• Pyramid
  - Throughput increases steadily
  - Promising scalability of both the data and metadata organization
4. Conclusion
Conclusion
• Pyramid is an array-oriented active storage system
• Proposed a system offering support for
  - Parallel array processing for both read and write workloads
  - Data versioning
  - Distributed metadata management, with shadowing to reflect updates
• Preliminary evaluation shows promising scalability
• Future work
  - Plan to integrate with HDF5
  - Pyramid as a storage engine for SciDB?
  - Investigate keeping data at inner quad-tree nodes: this could be used to store the array at different resolutions (map applications)
Thank you
Inria – KerData Research Team
www.irisa.fr/kerdata