DISTRIBUTED DATA FLOW WEB-SERVICES FOR ACCESSING AND PROCESSING OF BIG DATA SETS IN EARTH SCIENCES
A.A. Poyda (1), M.N. Zhizhin (1), D.P. Medvedev (2), D.Y. Mishin (3)
(1) NRC "Kurchatov Institute", Moscow, Russia
(2) Geophysical Center RAS, Moscow, Russia
(3) Johns Hopkins University, Baltimore, USA
The Big Data problem in Earth sciences
[Figure: Current estimate of NOAA NESDIS data archive volume projections (under CLASS environment, 2-site concept), August 2006. Horizontal axis: year, 2004 to 2020. Vertical axis: petabytes, 0 to 160. Series, sorted by year 2020 volumes: Model Data, NEXRAD, NPOESS, NPP, GOES, NASA EOS (MODIS), METOP, Ocean Related Data, DMSP, In-situ (Weather, Climate & CORS), POES, Misc.]
Big Data problem in Earth sciences
• Storage problem: remote access is required.
• Data request problem: timeout or insufficient memory when requesting big data blocks.
• Data processing problem: processing of big data volumes may lead to disk swapping resulting in dramatic performance decrease.
• Optimization of data access and processing is required.
Vis5D time-space-parameter animation
Data model for Earth sciences
Data access and processing optimizations in Earth sciences
• Data access parallelization
• Migration to data-flow / block-stream data access
• Data store optimization
• Migration to distributed data-flow processing
Data access parallelization
OpenStack Swift
Fault-tolerant, distributed object or blob storage with continuity support
• Works as a data container
• Supports fault-tolerance and data replication
• Data backup
• Scalability
• RESTful S3-like interface
• Supports user authorization and authentication (swauth, keystone)
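Parallel access to an object store can be illustrated by splitting one large object into byte ranges and fetching the ranges concurrently. The following is a minimal sketch, not the authors' implementation: `fetch_range` is a hypothetical stand-in that reads from an in-memory bytes object, where a real client would issue a ranged HTTP GET against the storage service.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total_size, part_size):
    """Return inclusive (start, end) byte ranges covering total_size bytes."""
    return [(start, min(start + part_size, total_size) - 1)
            for start in range(0, total_size, part_size)]


def parallel_read(fetch_range, total_size, part_size, workers=4):
    """Fetch all ranges concurrently and reassemble them in order."""
    ranges = split_ranges(total_size, part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so the parts line up on join
        parts = list(pool.map(lambda r: fetch_range(*r), ranges))
    return b"".join(parts)


# Hypothetical stand-in for a remote object (assumption, for illustration only)
OBJECT = bytes(range(256)) * 40


def fetch_range(start, end):
    """Pretend ranged read; a real client would send a ranged GET request."""
    return OBJECT[start:end + 1]


data = parallel_read(fetch_range, len(OBJECT), part_size=1000)
```

The part size and worker count are free parameters here; the values shown are arbitrary and not taken from the measurements in the slides.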
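OpenStack Swift performance
[Figure: data rate in MB/s (vertical axis, 0 to 250) plotted against a horizontal axis running from 0 to 16; the horizontal axis label is not recoverable.]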
Data-flow / block-stream data access
Scientific data arrays
• Arrays are widely used in environmental sciences to store modelling results, satellite observations, raster maps, etc.
• Datasets can be quite large, up to several terabytes.
• Most data are stored as file collections in proprietary formats or universally adopted formats like netCDF, GRIB, HDF5.
• File access can be problematic:
– Scientists need to know about too many file formats
– Usually files must be completely downloaded before they can be used
– Thousands of files can be processed in one data request; only a small portion of their contents appears in the result set
• Currently available database solutions do not have convenient array storage capabilities.
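The idea of block-stream access, as opposed to downloading whole files, can be sketched with a generator that hands out one block at a time. This is an illustration under stated assumptions: `read_block` and the in-memory `series` are hypothetical placeholders for a remote data service.

```python
def block_stream(read_block, n_time, block_len):
    """Yield (t_start, block) pairs covering the time axis in order."""
    for t0 in range(0, n_time, block_len):
        t1 = min(t0 + block_len, n_time)
        yield t0, read_block(t0, t1)


# Hypothetical source: one value per time step (assumption)
series = [float(t) for t in range(10)]


def read_block(t0, t1):
    """Stand-in for a remote request returning time steps t0..t1-1."""
    return series[t0:t1]


# Mean computed block by block, never holding more than one block at once
total, count = 0.0, 0
for t0, block in block_stream(read_block, len(series), block_len=4):
    total += sum(block)
    count += len(block)
mean = total / count
```

Because the consumer only ever sees one block, memory use is bounded by the block length rather than by the dataset size.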
Data store optimization. Cloud-based Active Storage for multidimensional arrays.
• Active Storage is a new approach to database design for storing multi-dimensional numeric arrays, such as space and terrestrial weather data archives and large-scale images.
• Special features of Active Storage are:
– Universal architecture capable of storing different data types in one system.
– Efficient index creation for large data (tens to hundreds of TB).
– Can do basic data transformations directly on storage nodes (arithmetic operations, statistical operations, linear convolution).
– Metadata integrated with data.
– Can distribute data automatically across several computer nodes (and can also distribute computations).
– Can be used in Grid infrastructure using OGSA-DAI services.
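Running statistical operations on the storage nodes means each node reduces its own chunks and only a small partial result travels to the client. The sketch below illustrates that pattern in general terms; the function names and the two-node layout are assumptions and do not describe the actual Active Storage API.

```python
def node_partial_stats(chunks):
    """Run on a storage node: reduce local chunks to (sum, count, min, max)."""
    values = [v for chunk in chunks for v in chunk]
    return sum(values), len(values), min(values), max(values)


def combine(partials):
    """Run on the client: merge the per-node partial results."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return {
        "mean": total / count,
        "min": min(p[2] for p in partials),
        "max": max(p[3] for p in partials),
    }


# Hypothetical data layout: two nodes holding chunks of unequal size
node_a = [[1.0, 2.0], [3.0]]
node_b = [[4.0, 5.0, 6.0]]
stats = combine([node_partial_stats(node_a), node_partial_stats(node_b)])
```

Only four numbers per node cross the network, regardless of how many values each node holds.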
Splitting an array into chunks
[Figure: number of disk seeks needed to read sub-regions of a non-chunked array versus a chunked array. Panel labels: 1 seek, 8 seeks, 4 seeks, 4 seeks.]
• We store chunks in BLOB fields of a database table
• Chunks do not need to be the same size
chunk_key chunk
0 <Chunk0>
1 <Chunk1>
2 <Chunk2>
3 <Chunk3>
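A table of this shape can be sketched with SQLite from the Python standard library. The table name, the use of 8-byte floats, and the helper functions are assumptions made for illustration; the slides only specify the `chunk_key` and `chunk` columns.

```python
import sqlite3
from array import array

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (chunk_key INTEGER PRIMARY KEY, chunk BLOB)"
)


def put_chunk(key, values):
    """Serialize a chunk of doubles and store it in the BLOB column."""
    blob = array("d", values).tobytes()
    conn.execute("INSERT INTO chunks VALUES (?, ?)", (key, blob))


def get_chunk(key):
    """Load one chunk by key and decode it back into a list of floats."""
    row = conn.execute(
        "SELECT chunk FROM chunks WHERE chunk_key = ?", (key,)
    ).fetchone()
    out = array("d")
    out.frombytes(row[0])
    return list(out)


# Chunks do not need to be the same size
put_chunk(0, [1.0, 2.0, 3.0])
put_chunk(1, [4.0])
```

Reading a sub-region then amounts to fetching only the rows whose keys intersect the request.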
ActiveStorage performance
Request number    Request form (time x latitude x longitude)
1                 8 x 64 x 128
2                 32 x 32 x 64
3                 128 x 16 x 32
4                 512 x 8 x 16
5                 2048 x 4 x 8
6                 8192 x 2 x 4
7                 32768 x 1 x 2
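All seven request forms cover the same number of array elements, so they differ only in shape. How many chunks a request touches depends on that shape. The sketch below checks the element counts and counts touched chunks for a hypothetical chunk shape of 32 x 16 x 16, assuming requests aligned to the chunk grid origin; the actual chunk shape used in the measurements is not given in the slides.

```python
from math import ceil, prod

# Request forms from the table (time, latitude, longitude)
REQUESTS = [
    (8, 64, 128), (32, 32, 64), (128, 16, 32), (512, 8, 16),
    (2048, 4, 8), (8192, 2, 4), (32768, 1, 2),
]


def chunks_touched(request, chunk_shape):
    """Chunks intersected by a request that starts at the chunk grid origin."""
    return prod(ceil(n / c) for n, c in zip(request, chunk_shape))


CHUNK = (32, 16, 16)  # hypothetical chunk shape (assumption)

sizes = [prod(r) for r in REQUESTS]          # elements per request
touched = [chunks_touched(r, CHUNK) for r in REQUESTS]
```

With this assumed chunk shape, long thin requests along the time axis touch far more chunks than compact ones, even though every request returns the same number of values.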
Distributed data-flow processing
Problems in organizing distributed data-flow processing:
• Data communication support between activities;
• Load balancing and parallelization management;
• Fault-tolerance and error processing support;
• Activity management.
At present, several frameworks for distributed data-flow processing exist: Yahoo S4, Twitter Storm, Taverna, Kepler, OGSA-DAI.
Twitter Storm
Wind speed calculation workflow example
[Diagram: four parallel branches, each with a GetData (U-component) activity and a GetData (V-component) activity feeding a processing activity. Other nodes shown: Output Block, RESTful data service.]
Wind speed calculation: $\sqrt{U^2 + V^2}$
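Wind speed is the magnitude of the horizontal wind vector with components U and V. A processing activity for one pair of blocks can be sketched as follows; the function name and the flat-list block representation are assumptions for illustration.

```python
from math import hypot


def wind_speed_block(u_block, v_block):
    """Element-wise wind speed sqrt(u^2 + v^2) for two equally sized blocks."""
    if len(u_block) != len(v_block):
        raise ValueError("U and V blocks must have the same length")
    return [hypot(u, v) for u, v in zip(u_block, v_block)]


speeds = wind_speed_block([3.0, 0.0, -6.0], [4.0, 2.0, 8.0])
```

Each processing activity needs only its own U and V blocks, which is what allows the branches of the workflow to run in parallel.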
Dependence of data-flow processing time on data volume
Problems that are not solved by frameworks
• Automatic partitioning of source data space.
• Flooding and synchronization management in case of data flow merging.
• Data flow routing in case of parallel processing activity and data flow merging.
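The synchronization and flooding issues that arise when two flows merge can be illustrated with a small pairing buffer: blocks from the U and V streams may arrive in any order, must be matched by block id, and the number of unmatched blocks has to stay bounded. This is a simplified sketch with hypothetical names, not a Storm component.

```python
class PairMerger:
    """Pair blocks from streams "U" and "V" by block id, with a bounded buffer."""

    def __init__(self, max_pending):
        self.max_pending = max_pending
        self.pending = {}  # block_id -> (stream_name, block)

    def offer(self, stream, block_id, block):
        """Return a matched (u, v) pair, or None if the partner is still missing."""
        if block_id in self.pending:
            other_stream, other = self.pending[block_id]
            if other_stream == stream:
                raise ValueError("duplicate block on the same stream")
            del self.pending[block_id]
            return (other, block) if other_stream == "U" else (block, other)
        if len(self.pending) >= self.max_pending:
            # Flooding: one stream is running ahead, so upstream should pause
            raise BufferError("merge buffer full")
        self.pending[block_id] = (stream, block)
        return None


merger = PairMerger(max_pending=2)
arrivals = [("U", 0, [1.0]), ("U", 1, [2.0]), ("V", 1, [20.0]), ("V", 0, [10.0])]
pairs = []
for stream, block_id, block in arrivals:
    matched = merger.offer(stream, block_id, block)
    if matched is not None:
        pairs.append(matched)
```

Raising an error on a full buffer is the simplest possible policy; a real system would more likely apply back-pressure to the faster stream.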
Current work
• Twitter Storm data request block-stream activity supporting block geometry and array priority direction properties, and automatic partitioning of source data space.
• Twitter Storm data processing activity supporting automatic data flow merging, generalized array processing language, and flooding management.
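Automatic partitioning of the source data space can be sketched as tiling the requested region with blocks of a given geometry and emitting them in an order governed by a priority direction. The interpretation used here, that the block position along the priority axis changes fastest, is an assumption, as are all names in the code.

```python
from itertools import product


def partition(shape, block, priority_axis):
    """Yield per-axis (start, stop) tuples tiling `shape` with `block`-sized pieces.

    Blocks are ordered so that the position along `priority_axis` changes fastest.
    """
    starts = [range(0, n, b) for n, b in zip(shape, block)]
    # Put the priority axis last in the product so that it varies fastest
    order = [a for a in range(len(shape)) if a != priority_axis] + [priority_axis]
    for combo in product(*(starts[a] for a in order)):
        origin = dict(zip(order, combo))
        # Edge blocks are clipped to the array bounds
        yield tuple(
            (origin[a], min(origin[a] + block[a], shape[a]))
            for a in range(len(shape))
        )


blocks = list(partition(shape=(4, 3), block=(2, 2), priority_axis=0))
```

Each emitted tuple describes one data request, so the list of blocks doubles as the schedule of requests sent to the data service.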
Results
A framework has been developed, having the following features:
• cloud storage with data reservation and access acceleration;
• designed for large multidimensional data arrays;
• request shape flexibility;
• flow-based system for access and processing;
• high scalability.
Applications
• High-resolution 3D models of the Earth based on a large number of observations.
• Climate modeling and analysis tasks.
• Multispectral satellite and geological imagery processing.