Pyramid: A large-scale array-oriented active storage system

Viet-Trung Tran, Bogdan Nicolae, Gabriel Antoniu, Luc Bougé
KerData Team
Inria, Rennes, France
September 2, 2011
Outline
1. Motivation
2. Architecture
3. Preliminary evaluation
4. Conclusion
1. Motivation: Why array-oriented storage?

Context: Data-intensive large-scale HPC simulations
• The scalability of data management is becoming a critical issue
• Mismatch between the storage model and the application data model
• Application data model
- Multidimensional typed arrays, images, etc.
• Storage model
- Parallel file systems: a simple, flat I/O model (a sequence of bytes)
- Relational model: ill-suited for scientific data
• Additional layers are needed to map the application model onto the storage model
• [M. Stonebraker] The "one storage system fits all needs" approach has reached its limits
• Parallel I/O stack:
- Performance of non-contiguous I/O vs. data atomicity
• Relational data model:
- Simulating arrays on top of tables yields poor performance
- Limited scalability of join queries
• Need to specialize the I/O stack to match the application's requirements
- Array-oriented storage for the array data model
• Example: SciDB with ArrayStore.
[Figure: the standard parallel I/O stack, from top to bottom: application (VisIt, Tornado simulation), data model library (HDF5, NetCDF), MPI-IO middleware, parallel file systems]
Our approach
• Multi-dimensional aware chunking
• Lock-free, distributed chunk indexing
• Array versioning
• Active storage support
• Versioning array-oriented access interface
Multi-dimensional aware chunking
• Split the array into equal-sized chunks and distribute them over the storage elements (see the addressing sketch below)
- Simplifies load balancing among storage elements
- Keeps neighboring cells in the same chunk
• Shared-nothing architecture
- Easier to handle data consistency
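To make the chunking concrete, here is a minimal C++ sketch (not taken from Pyramid; the function name, the row-major linearization, and the modulo server-selection rule are assumptions) of how a cell coordinate maps to the chunk that holds it when chunks are equal-sized in every dimension.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical helper: compute the linearized id of the chunk holding a cell,
// assuming equal-sized chunks along every dimension (row-major linearization).
std::uint64_t chunk_id_for_cell(const std::vector<std::uint64_t>& cell,
                                const std::vector<std::uint64_t>& chunk_dims,
                                const std::vector<std::uint64_t>& chunks_per_dim) {
    assert(cell.size() == chunk_dims.size() && cell.size() == chunks_per_dim.size());
    std::uint64_t id = 0;
    for (std::size_t d = 0; d < cell.size(); ++d) {
        std::uint64_t c = cell[d] / chunk_dims[d];   // chunk coordinate along dimension d
        id = id * chunks_per_dim[d] + c;
    }
    return id;   // e.g. assigned to a server as id % num_servers (assumption)
}

int main() {
    // 2D array made of 32x32 chunks of 1024x1024 cells (the layout used in the evaluation).
    std::uint64_t id = chunk_id_for_cell({5000, 70}, {1024, 1024}, {32, 32});
    std::printf("cell (5000, 70) lives in chunk %llu\n", static_cast<unsigned long long>(id)); // 128
    return 0;
}
```

Because neighboring cells along each dimension fall into the same chunk up to a chunk boundary, a sub-array request only touches the chunks that intersect it.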
Lock-free, distributed chunk indexing
• Indexing multi-dimensional information
- R-tree, XD-tree, quad-tree, etc.
- Designed and optimized for centralized management
• A centralized metadata management scheme may not scale
- Bottleneck under high concurrency
• Our approach:
- Port quad-tree-like structures to a distributed environment
- Use a shadowing technique on the quad-tree to enable lock-free concurrent updates
Array versioning
• Scientific applications need array versioning (VLDB 2009)
- Checkpointing
- Cloning
- Provenance
• Keep data and metadata immutable
- Updating a chunk is handled at the metadata level using a shadowing technique
Active storage support
• Move computation on the data to the storage elements
- Conserves bandwidth
- Better workload parallelization
• Allow users to send user-defined handlers to the storage servers (see the handler sketch below)
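As an illustration of a user-defined handler, here is a minimal C++ sketch; the handler signature is an assumption (the slides do not specify Pyramid's actual handler interface), and the filter itself is just an example of a computation a storage server could run in place on a chunk.

```cpp
#include <cstddef>

// Assumed handler signature: runs on one chunk held in the server's memory.
using chunk_handler_t = void (*)(void* chunk_data, std::size_t chunk_bytes);

// Example handler: zero out every cell below a threshold, directly on the
// storage server, so only the (possibly reduced) result crosses the network.
void threshold_filter(void* chunk_data, std::size_t chunk_bytes) {
    float* cells = static_cast<float*>(chunk_data);
    std::size_t n = chunk_bytes / sizeof(float);
    for (std::size_t i = 0; i < n; ++i)
        if (cells[i] < 0.5f) cells[i] = 0.0f;
}

int main() {
    float chunk[4] = {0.1f, 0.7f, 0.3f, 0.9f};
    chunk_handler_t handler = &threshold_filter;
    handler(chunk, sizeof(chunk));   // in Pyramid this would run on the storage server
    return 0;
}
```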
Versioning array-oriented access interface
• Basic primitives
- id = CREATE(n, sizes[], defval)
- READ(id, v, offsets[], sizes[], buffer)
- w = WRITE(id, offsets[], sizes[], buffer)
- w = SEND_COMPUTATION(id, v, offsets[], sizes[], f)
• Other primitives, such as cloning and filtering, can mostly be implemented on top of these basic primitives (a usage sketch follows below)
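A hypothetical usage sketch of these primitives in C++; the signatures are assumptions inferred from the bullet list above (the real Pyramid interface may differ), and the bodies are local no-op stubs only so that the snippet compiles on its own.

```cpp
#include <cstdint>
#include <vector>

using array_id = std::uint64_t;
using version  = std::uint64_t;

// Assumed signatures mirroring the slide's primitives; stubbed locally.
array_id CREATE(int, const std::uint64_t[], double)                                   { return 1; }
void     READ (array_id, version, const std::uint64_t[], const std::uint64_t[], void*) {}
version  WRITE(array_id, const std::uint64_t[], const std::uint64_t[], const void*)    { return 1; }

int main() {
    const std::uint64_t dims[2]    = {4096, 4096};   // whole 2D array
    const std::uint64_t offsets[2] = {1024, 1024};   // sub-array origin
    const std::uint64_t sizes[2]   = {1024, 1024};   // sub-array extent
    std::vector<double> buf(1024 * 1024, 1.0);

    array_id id = CREATE(2, dims, 0.0);                   // new 4096x4096 array, default value 0
    version  w  = WRITE(id, offsets, sizes, buf.data());  // the write produces a new snapshot w
    READ(id, w, offsets, sizes, buf.data());              // read the same region back from snapshot w
    return 0;
}
```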
2. Pyramid: Architecture
Architecture
• Pyramid is inspired by our previous work, BlobSeer [JPDC 2011]
• Version managers
- Ensure concurrency control
• Metadata managers
- Store index tree nodes
• Storage manager
- Monitors the storage servers
- Ensures a load-balancing strategy for chunks among the storage servers
• Active storage servers
- Store chunks and run handlers on them
• Clients
- Perform I/O accesses
Read
• I: optionally ask the version manager for the latest published version
• II: fetch the corresponding metadata from the metadata managers
• III: contact the storage servers in parallel and fetch the chunks into the local buffer (see the sketch below)
[Figure: read protocol between the client, the version managers, the metadata managers, and the storage servers (steps I-III)]
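A minimal, self-contained C++ sketch of the three read steps; every helper below is a local stub standing in for an RPC to the corresponding Pyramid entity, so all names and signatures are assumptions rather than the actual protocol code.

```cpp
#include <cstdint>
#include <vector>

struct Region { std::uint64_t off[2], size[2]; };         // requested 2D sub-array
struct Leaf   { int server; std::uint64_t chunk_id; };    // quad-tree leaf: where a chunk lives

std::uint64_t version_manager_latest()                                { return 3; }        // step I (stub)
std::vector<Leaf> metadata_fetch_leaves(std::uint64_t, const Region&) { return {{0, 7}}; } // step II (stub)
void storage_fetch_chunk(const Leaf&, std::vector<char>&)             {}                   // step III (stub)

void read_subarray(const Region& region, std::vector<char>& buffer) {
    std::uint64_t v = version_manager_latest();        // I: latest published version (optional)
    auto leaves = metadata_fetch_leaves(v, region);     // II: metadata lookup on the metadata managers
    for (const Leaf& leaf : leaves)                     // III: done in parallel in Pyramid
        storage_fetch_chunk(leaf, buffer);
}

int main() {
    Region r{{0, 0}, {1024, 1024}};
    std::vector<char> buf(1024ull * 1024);
    read_subarray(r, buf);
    return 0;
}
```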
Write
• I: get a list of storage servers able to store the chunks, one for each chunk
• II: contact the storage servers in parallel and write the chunks to the corresponding servers
• III: get a version number for the update
• IV: add new metadata to consolidate the new version
• V: report that the new version is ready for publication (see the sketch below)
[Figure: write protocol between the client, the storage manager, the storage servers, the version manager, and the metadata managers (steps I-V)]
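The matching sketch for the five write steps, again with local stubs standing in for RPCs to the storage manager, the storage servers, the version manager, and the metadata managers; all names and signatures are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Chunk { std::uint64_t id; std::vector<char> data; };

std::vector<int> storage_manager_allocate(std::size_t n)           { return std::vector<int>(n, 0); } // step I (stub)
void write_chunk(int, const Chunk&)                                {}                                 // step II (stub)
std::uint64_t version_manager_assign()                             { return 4; }                      // step III (stub)
void metadata_build_tree(std::uint64_t, const std::vector<Chunk>&) {}                                 // step IV (stub)
void version_manager_publish(std::uint64_t)                        {}                                 // step V (stub)

std::uint64_t write_subarray(const std::vector<Chunk>& chunks) {
    auto servers = storage_manager_allocate(chunks.size());   // I: one storage server per chunk
    for (std::size_t i = 0; i < chunks.size(); ++i)           // II: done in parallel in Pyramid
        write_chunk(servers[i], chunks[i]);
    std::uint64_t v = version_manager_assign();               // III: version number for this update
    metadata_build_tree(v, chunks);                           // IV: new quad-tree nodes (shadowing)
    version_manager_publish(v);                               // V: snapshot v is ready for readers
    return v;
}

int main() {
    std::vector<Chunk> chunks(4);
    write_subarray(chunks);
    return 0;
}
```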
Lock-free, distributed chunk indexing
• Organized as a quad-tree to index 2D arrays
• Each tree node has at most 4 children, each covering one of the four quadrants
• The root covers the whole array
• Each leaf corresponds to a chunk and holds information about its location
• Tree nodes are immutable and uniquely identified by the version number and the sub-domain they cover (see the node sketch below)
• A DHT distributes the tree nodes over the metadata managers
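A minimal C++ sketch (assumed layout, not Pyramid's actual data structures) of an immutable quad-tree node keyed by the (version, sub-domain) pair, so that it can be stored under that key in the DHT spanning the metadata managers.

```cpp
#include <array>
#include <cstdint>
#include <string>

// Key of an immutable tree node: the snapshot version plus the 2D sub-domain
// it covers; hashing this key selects the metadata manager in the DHT.
struct NodeKey {
    std::uint64_t version;
    std::uint64_t off[2], size[2];
    std::string to_dht_key() const {
        return std::to_string(version) + ":" +
               std::to_string(off[0]) + "," + std::to_string(off[1]) + ":" +
               std::to_string(size[0]) + "x" + std::to_string(size[1]);
    }
};

struct TreeNode {
    NodeKey key;
    bool is_leaf;
    std::uint64_t chunk_id;               // leaf only: which chunk this is...
    int storage_server;                   // ...and which server stores it
    std::array<NodeKey, 4> children;      // inner node only: keys of the quadrant children,
                                          // possibly belonging to older snapshots (shadowing)
};
```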
Tree shadowing for updates
• Write the newly created chunks to the storage servers
• Build the quad-tree associated with the new snapshot in a bottom-up fashion (see the sketch below)
- Write the leaves to the DHT
- Inner nodes may point to nodes of previous snapshots (which would otherwise imply synchronizing the quad-tree generation)
- Avoid this synchronization by feeding writers additional information about the other concurrent updaters (thanks to the fact that tree node IDs can be computed)
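A minimal C++ sketch of the shadowing step, with a std::map standing in for the DHT (all structures and names are assumptions): a new snapshot writes fresh nodes only for the quadrants it touched and reuses the previous snapshot's nodes elsewhere, so existing metadata is never modified.

```cpp
#include <cstdint>
#include <map>
#include <string>

struct Domain { std::uint64_t off[2], size[2]; };

static std::string key(std::uint64_t v, const Domain& d) {
    return std::to_string(v) + ":" + std::to_string(d.off[0]) + "," + std::to_string(d.off[1]) +
           ":" + std::to_string(d.size[0]) + "x" + std::to_string(d.size[1]);
}

struct Node { std::string child[4]; };   // keys of the four quadrant children
std::map<std::string, Node> dht;         // local stand-in for the distributed hash table

// Publish an inner node for snapshot v: quadrants touched by this update point
// to freshly written children (version v); untouched quadrants keep pointing to
// the children of the previous snapshot prev_v (shadowing). Nodes stay immutable.
void publish_inner(std::uint64_t v, std::uint64_t prev_v,
                   const Domain& covered, const Domain quadrant[4], const bool touched[4]) {
    Node n;
    for (int q = 0; q < 4; ++q)
        n.child[q] = key(touched[q] ? v : prev_v, quadrant[q]);
    dht[key(v, covered)] = n;
}

int main() {
    Domain whole{{0, 0}, {2048, 2048}};
    Domain quads[4] = {{{0, 0}, {1024, 1024}},    {{1024, 0}, {1024, 1024}},
                       {{0, 1024}, {1024, 1024}}, {{1024, 1024}, {1024, 1024}}};
    bool touched[4] = {true, false, false, false};   // only the first quadrant was updated
    publish_inner(/*v=*/2, /*prev_v=*/1, whole, quads, touched);
    return 0;
}
```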
Efficient parallel updating
• Chunks are written concurrently
• Versions are assigned in the order in which the clients finish writing
• Clients get additional information about the other concurrent writers
• Tree nodes are written in a lock-free manner
• Versions are published in the order in which they were assigned
[Figure: two clients (#1 and #2) writing concurrently through the storage servers, the metadata managers, and the version manager, then publishing their versions in order]
Some more I/O primitives
• Easily implemented thanks to immutable data and metadata blocks
• Cheap I/O operations
• Clone a sub-domain
- Follow the metadata tree of a specific snapshot
- Create a new metadata tree and publish it as a newly created array
• Filtering and compression can be done locally, in parallel, at the active storage servers by introducing user-defined handlers
3. Preliminary evaluation (experiments run on Grid'5000, www.grid5000.fr)
Experimental setup
Simulate a common access pattern exhibited by scientific applications: array dicing
• Up to 130 nodes of the Graphene cluster on Grid'5000
- 1 Gbps Ethernet interconnect
- 49 nodes used to deploy Pyramid and the competitor system, PVFS
• Array dicing
- Each client accesses a dedicated sub-array
- 1 GB per client, consisting of 32x32 chunks (1024x1024 bytes, i.e. 1 MiB, per chunk)
- Concurrent reading/writing
• Measure performance and scalability
Aggregated throughput achieved under concurrency
• PVFS suffers under the non-contiguous access pattern, due to serialization into a flat file
• Pyramid
- Throughput increases steadily
- Promising scalability of both the data and the metadata organization
4. Conclusion
Conclusion
• Pyramid is an array-oriented active storage system
• Proposed a system offering support for
- Parallel array processing for both read and write workloads
- Data versioning
- Distributed metadata management, with shadowing to reflect updates
• Preliminary evaluation shows promising scalability
• Future work
- Planned integration with HDF5
- Pyramid as a storage engine for SciDB?
- Investigate keeping data at quad-tree inner nodes
Could be used to store the array at different resolutions (map applications)
Thank you

Inria – KerData Research Team
www.irisa.fr/kerdata