DESCRIPTION
The recent explosion in data sizes manipulated by distributed scientific applications has prompted the need to develop specialized storage systems capable of dealing with specific access patterns in a scalable fashion. In this context, a large class of applications focuses on parallel array processing: small parts of huge multi-dimensional arrays are concurrently accessed by a large number of clients, both for reading and writing. A specialized storage system that deals with such an access pattern faces several challenges at the level of data/metadata management. We introduce Pyramid, an active array-oriented storage system that addresses these challenges and shows promising results in our initial evaluation.
Pyramid: A large-scale array-oriented active storage system
Viet-Trung Tran, Bogdan Nicolae, Gabriel Antoniu, Luc Bougé
KerData Team
Inria, Rennes, France, 02 09 2011
Outline
1. Motivation
2. Architecture
3. Preliminary evaluation
4. Conclusion
1. Motivation: Why array-oriented storage?
Context: data-intensive, large-scale HPC simulations
• The scalability of data management is becoming a critical issue
• Mismatch between the storage model and the application data model
• Application data model
  - Multidimensional typed arrays, images, etc.
• Storage model
  - Parallel file systems: a simple, flat I/O model (a sequence of bytes)
  - Relational model: ill-suited for scientific data
• Additional layers are needed to map the application model onto the storage model
[M. Stonebraker] "One size fits all" storage has reached its limits
• Parallel I/O stack:
  - Performance of non-contiguous I/O vs. data atomicity
• Relational data model:
  - Simulating arrays on top of tables performs poorly
  - Join queries limit scalability
• Need to specialize the I/O stack to match application requirements
  - Array-oriented storage for the array data model
• Example: SciDB with ArrayStore
[Figure: the parallel I/O stack]
• Application (VisIt, tornado simulation)
• Data model (HDF5, NetCDF)
• MPI-IO middleware
• Parallel file systems
Our approach
• Multi-dimensional aware chunking
• Lock-free, distributed chunk indexing
• Array versioning
• Active storage support
• Versioning array-oriented access interface
Multi-dimensional aware chunking
• Split the array into equal-sized chunks and distribute them over storage elements
  - Simplifies load balancing among storage elements
  - Keeps neighboring cells in the same chunk
• Shared-nothing architecture
  - Easier to handle data consistency
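The chunking scheme above can be illustrated with a small sketch. All names (`chunk_coords`, `server_for`) and the hash-based placement are illustrative assumptions, not Pyramid's actual implementation:

```python
# A minimal sketch of multi-dimensional aware chunking: a 2D array is
# split into equal-sized chunks, and each chunk is mapped to a storage
# server by hashing its chunk coordinates.

def chunk_coords(offsets, sizes, chunk_shape):
    """Yield the (row, col) coordinates of every chunk that intersects
    the sub-array starting at `offsets` with extent `sizes`."""
    (r0, c0), (nr, nc) = offsets, sizes
    (cr, cc) = chunk_shape
    for r in range(r0 // cr, (r0 + nr - 1) // cr + 1):
        for c in range(c0 // cc, (c0 + nc - 1) // cc + 1):
            yield (r, c)

def server_for(coord, num_servers):
    """Map a chunk coordinate to one of `num_servers` storage servers."""
    return hash(coord) % num_servers

# A 1024x1024 sub-array starting at (0, 0), with 256x256 chunks,
# touches a 4x4 grid of chunks:
touched = list(chunk_coords((0, 0), (1024, 1024), (256, 256)))
assert len(touched) == 16
```

Because chunks are equal-sized and placement is a pure function of the chunk coordinates, any client can locate a chunk's server without a central lookup, which fits the shared-nothing design.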
Lock-free, distributed chunk indexing
• Indexing multi-dimensional information
  - R-trees, XD-trees, quad-trees, etc.
  - Designed and optimized for centralized management
• A centralized metadata management scheme may not scale
  - Bottleneck under high concurrency
• Our approach:
  - Port quad-tree-like structures to a distributed environment
  - Use a shadowing technique on the quad-tree to enable lock-free concurrent updates
Array versioning
• Scientific applications need array versioning (VLDB 2009)
  - Checkpointing
  - Cloning
  - Provenance
• Keep data and metadata immutable
  - Updating a chunk is handled at the metadata level using a shadowing technique
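The immutable-snapshot idea can be sketched as a copy-on-write chunk map; `snapshots`, `create`, and `write` are illustrative stand-ins, not Pyramid's API:

```python
# A minimal copy-on-write sketch of array versioning: data and metadata
# are immutable, so updating a chunk creates a new snapshot whose map
# points to the new chunk, while older snapshots stay readable.

snapshots = []          # snapshots[v] maps chunk coordinate -> chunk data

def create(initial_chunks):
    snapshots.append(dict(initial_chunks))
    return len(snapshots) - 1

def write(version, coord, data):
    """Shadow the parent snapshot: copy its chunk map, replace one entry."""
    new_map = dict(snapshots[version])
    new_map[coord] = data
    snapshots.append(new_map)
    return len(snapshots) - 1

v0 = create({(0, 0): b"old"})
v1 = write(v0, (0, 0), b"new")
assert snapshots[v0][(0, 0)] == b"old"   # the old version is intact
assert snapshots[v1][(0, 0)] == b"new"
```

Checkpointing and provenance fall out naturally: every past version remains addressable, and cloning is just starting a new lineage from an existing snapshot.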
Active storage support
• Move computation to the storage elements
  - Conserves bandwidth
  - Better workload parallelization
• Allow users to send user-defined handlers to the storage servers
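The bandwidth argument can be made concrete with a toy model; `send_computation` and the dict-per-server representation are illustrative assumptions:

```python
# A minimal sketch of active storage: instead of shipping chunks to the
# client, the client ships a handler that each storage server applies
# locally to its chunks; only the (small) results cross the network.

def send_computation(servers, handler):
    """Apply `handler` to every chunk each server holds and return the
    per-chunk results, keyed by chunk coordinate."""
    return {coord: handler(chunk)
            for server in servers
            for coord, chunk in server.items()}

# Two servers, each holding one chunk of integers:
servers = [{(0, 0): [1, 2, 3]}, {(0, 1): [4, 5, 6]}]
sums = send_computation(servers, sum)     # user-defined handler: sum
assert sums == {(0, 0): 6, (0, 1): 15}
```

Each server processes its own chunks independently, so the computation parallelizes across as many servers as hold data.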
Versioning array-oriented access interface
• Basic primitives
  - id = CREATE(n, sizes[], defval)
  - READ(id, v, offsets[], sizes[], buffer)
  - w = WRITE(id, offsets[], sizes[], buffer)
  - w = SEND_COMPUTATION(id, v, offsets[], sizes[], f)
• Other primitives, such as cloning and filtering, can mostly be implemented on top of the primitives above
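The primitives above can be exercised against a toy in-memory model. The sketch below assumes a 1D array and illustrative versioning semantics (the real interface is n-dimensional); the `Array` class is not Pyramid's actual client library:

```python
# A minimal in-memory sketch of the versioning access interface for a
# 1D array: CREATE sets a default value, WRITE returns a new version,
# and READ addresses an explicit version v.

class Array:
    def __init__(self, size, defval):
        self.versions = [[defval] * size]    # version 0

    def read(self, v, offset, size, buffer):
        buffer[:size] = self.versions[v][offset:offset + size]

    def write(self, offset, size, buffer):
        snap = list(self.versions[-1])       # shadow the latest version
        snap[offset:offset + size] = buffer[:size]
        self.versions.append(snap)
        return len(self.versions) - 1        # new version number

a = Array(4, 0)
w = a.write(1, 2, [7, 8])
buf = [None] * 2
a.read(w, 1, 2, buf)
assert buf == [7, 8]
assert a.versions[0] == [0, 0, 0, 0]         # version 0 unchanged
```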
2. Pyramid: Architecture
Architecture
• Pyramid is inspired by our previous work: BlobSeer [JPDC 2011]
• Version managers
- Ensure concurrency control
• Metadata managers
- Store index tree nodes
• Storage manager
  - Monitors the storage servers
  - Ensures a load-balancing strategy for chunks among the storage servers
• Active storage servers
- Store chunks and perform handlers on chunks
• Clients
- Perform I/O accesses
Read
• I: optionally ask the version manager for the latest published version
• II: fetch the corresponding metadata from the metadata managers
• III: contact the storage servers in parallel and fetch the chunks into the local buffer
[Figure: the client contacts the version manager (I), the metadata managers (II), and the storage servers (III)]
Write
• I: get a list of storage servers able to store the chunks, one per chunk
• II: contact the storage servers in parallel and write the chunks to the corresponding providers
• III: get a version number for the update
• IV: add new metadata to consolidate the new version
• V: report that the new version is ready for publication
[Figure: the client contacts the storage manager (I), the storage servers (II), the version manager (III, V), and the metadata managers (IV)]
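The five write steps can be sketched with in-memory stand-ins for the managers; all names, the hash-based placement, and the version counter are illustrative assumptions:

```python
# A minimal sketch of the write path: place chunks (I), write them in
# parallel (sequential here) (II), obtain a version (III), consolidate
# metadata (IV), and report the version ready for publication (V).

storage_servers = [dict() for _ in range(4)]
metadata = {}                  # (version, chunk coordinate) -> server index
published = []                 # versions, in assignment order
next_version = [1]

def write(chunks):
    # I: ask the storage manager for one server per chunk
    placement = {coord: hash(coord) % len(storage_servers) for coord in chunks}
    # II: write the chunks to those servers
    for coord, data in chunks.items():
        storage_servers[placement[coord]][coord] = data
    # III: get a version number for the update
    v = next_version[0]
    next_version[0] += 1
    # IV: add new metadata consolidating the new version
    for coord in chunks:
        metadata[(v, coord)] = placement[coord]
    # V: report the new version as ready for publication
    published.append(v)
    return v

v = write({(0, 0): b"a", (0, 1): b"b"})
assert v == 1 and published == [1]
```

Note that chunk data lands on the servers before any version is assigned, so a slow writer never blocks readers of already-published versions.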
Lock-free, distributed chunk indexing
• Organized as a quad-tree to index 2D arrays
• Each tree node has at most 4 children, each covering one of the four quadrants
• The root covers the whole array
• Each leaf corresponds to a chunk and holds information about its location
• Tree nodes are immutable, uniquely identified by their version number and the sub-domain they cover
• A DHT distributes the tree nodes over the metadata managers
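The (version, sub-domain) identification scheme can be sketched as follows; `node_key`, `children`, and the (row, col, rows, cols) tuple encoding are illustrative assumptions:

```python
# A minimal sketch of the distributed quad-tree index: each node is
# immutable and identified by (version, sub-domain), and that identifier
# keys a DHT lookup across the metadata managers.

def node_key(version, subdomain):
    """Compute the DHT key of a tree node from its version and the
    (row, col, rows, cols) sub-domain it covers."""
    return hash((version,) + subdomain)

def children(subdomain):
    """Split a sub-domain into its four quadrants."""
    r, c, h, w = subdomain
    hh, hw = h // 2, w // 2
    return [(r, c, hh, hw), (r, c + hw, hh, hw),
            (r + hh, c, hh, hw), (r + hh, c + hw, hh, hw)]

root = (0, 0, 1024, 1024)
quads = children(root)
assert len(quads) == 4 and quads[3] == (512, 512, 512, 512)
```

Because the key is computable from the version and sub-domain alone, any client can locate any tree node with one DHT lookup, without consulting a central index.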
Tree shadowing to update
• Write the newly created chunks to the storage servers
• Build the quad-tree associated with the new snapshot in a bottom-up fashion
  - Write the leaves to the DHT
  - Inner nodes may point to nodes of previous snapshots (which would imply synchronizing quad-tree generation)
  - Avoid this synchronization by feeding clients additional information about the other concurrent updaters (thanks to the computable IDs of tree nodes)
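The node-sharing effect of shadowing can be sketched as below; the DHT dictionary and helper names are illustrative, and for brevity nodes are keyed directly by (version, sub-domain) rather than a hash:

```python
# A minimal sketch of tree shadowing: a new snapshot's quad-tree is
# built bottom-up, writing only the nodes that changed; inner nodes over
# untouched quadrants reference the previous snapshot's nodes.

dht = {}   # (version, subdomain) -> chunk location, or list of child keys

def put_leaf(version, subdomain, location):
    dht[(version, subdomain)] = location
    return (version, subdomain)

def put_inner(version, subdomain, child_keys):
    dht[(version, subdomain)] = child_keys
    return (version, subdomain)

# Snapshot 1 wrote all four quadrants of an 8x8 array; snapshot 2
# rewrote only the top-left quadrant, so its root reuses three
# version-1 leaves without copying them:
q = [(0, 0, 4, 4), (0, 4, 4, 4), (4, 0, 4, 4), (4, 4, 4, 4)]
v1_leaves = [put_leaf(1, s, f"server-{i}") for i, s in enumerate(q)]
v2_leaf = put_leaf(2, q[0], "server-9")
root2 = put_inner(2, (0, 0, 8, 8), [v2_leaf] + v1_leaves[1:])
assert dht[root2][1] == (1, q[1])   # shared node from snapshot 1
```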
Efficient parallel updating
• Chunks are written concurrently
• Versions are assigned in the order clients finish writing
• Clients get additional information about the other concurrent writers
• Tree nodes are written in a lock-free manner
• Versions are published in the order they were assigned
[Figure: two clients write concurrently to the storage servers and metadata managers; the version manager publishes their versions in order]
Some more I/O primitives
• Easily implemented thanks to immutable data and metadata blocks
• Cheap I/O operations
• Clone a sub-domain
  - Follow the metadata tree of a specific snapshot
  - Create a new metadata tree and publish it as a newly created array
• Filtering and compression can be done locally, in parallel, at the active storage servers by introducing user-defined handlers
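Why cloning is cheap can be sketched in a few lines; the `dht` layout and the `clone` function are illustrative assumptions, not Pyramid's actual metadata format:

```python
# A minimal sketch of the cheap clone operation: since metadata nodes
# are immutable, cloning a sub-domain only publishes a new root that
# references the existing sub-tree; no chunk data is copied.

dht = {
    (1, (0, 0, 4, 4)): "server-0",   # snapshot-1 leaf, top-left quadrant
    (1, (0, 4, 4, 4)): "server-1",
}

def clone(version, subdomain, new_array_id):
    """Publish an existing sub-tree as version 1 of a new array by
    pointing the new array's root at the old metadata node."""
    dht[(new_array_id, 1)] = (version, subdomain)
    return new_array_id

clone(1, (0, 0, 4, 4), "arrayB")
old_key = dht[("arrayB", 1)]
assert dht[old_key] == "server-0"    # same chunk, no data copied
```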
3. Preliminary evaluation
Experiments on Grid'5000 (www.grid5000.fr)
Experimental setup
Simulates a common access pattern of scientific applications: array dicing
• Up to 130 nodes of the Graphene cluster on Grid'5000
  - 1 Gbps Ethernet interconnect
  - 49 nodes deployed for Pyramid and for the competitor system, PVFS
• Array dicing
  - Each client accesses a dedicated sub-array
  - 1 GB per client, consisting of 32x32 chunks (1024x1024 bytes per chunk)
  - Concurrent reading/writing
• Measure performance and scalability
Aggregated throughput achieved under concurrency
• PVFS suffers from the non-contiguous access pattern, due to serialization into a flat file
• Pyramid
  - Throughput increases steadily
  - Promising scalability of both the data and metadata organization
4. Conclusion
Conclusion
• Pyramid is an array-oriented active storage system
• Proposed a system offering support for
  - Parallel array processing for both read and write workloads
  - Data versioning
  - Distributed metadata management, with shadowing to reflect updates
• Preliminary evaluation shows promising scalability
• Future work
  - Plan to integrate with HDF5
  - Pyramid as a storage engine for SciDB?
  - Investigate keeping data at inner quad-tree nodes: this could be used to store the array at different resolutions (map applications)
Thank you
Inria – KerData Research Team
www.irisa.fr/kerdata