DISTRIBUTED DATA FLOW WEB-SERVICES FOR ACCESSING AND PROCESSING OF BIG DATA SETS IN EARTH SCIENCES

DISTRIBUTED DATA FLOW WEB-SERVICES FOR ACCESSING AND PROCESSING OF BIG

DATA SETS IN EARTH SCIENCES

A.A. Poyda1, M.N. Zhizhin1, D.P. Medvedev2, D.Y. Mishin3

1NRC "Kurchatov Institute", Moscow, Russia

2Geophysical Center RAS, Moscow, Russia3Johns Hopkins University, Baltimore, USA

The Big Data problem in Earth sciences

Current Estimate of NOAA NESDIS DATA ARCHIVE VOLUME PROJECTIONS

(under CLASS Environment - 2 site concept) August 2006

0

20

40

60

80

100

120

140

160

2004

2005

2006

2007

2008

2009

2010

2011

2012

2013

2014

2015

2016

2017

2018

2019

2020

YEAR

PETA

BYT

ES

Model Data

NEXRAD

NPOESS

NPP

GOES

)NASA EOS )MODIS

METOP

Ocean Related Data

DMSP

& IN-SITU )Weather)ClimateCORS

POES

.Misc

Sorted by year 2020 volumes

Big Data problem in Earth sciences

• Storage problem: remote access is required.• Data request problem: timeout or insufficient

memory when requesting big data blocks.• Data processing problem: processing of big

data volumes may lead to disk swapping resulting in dramatic performance decrease.

• Optimization of data access and processing is required.

Vis5D time-space-parameter animation

Data model for Earth sciences

Data access and processing optimizations in Earth sciences

• Data access parallelization• Migration to data-flow / block-stream data

access• Data store optimization• Migration to distributed data-flow processing

Data access parallelizationOpenStack Swift

Fault-tolerant, distributed object or blob storage with continuity support

• Works as data container• Supports fault-tolerance and data

replication• Data backup• Scalability• RESTful S3-like interface• Supports users authorization and

authentication (swauth, keystone)

Openstack SWIFT performance

0 2 4 6 8 10 12 14 160

50

100

150

200

250Rate (MB/s)

Data-flow / block-stream data access

Scientific data arrays• Arrays are widely used in environmental sciences to store modelling results,

satellite observations, raster maps, etc.

• Datasets can be quite large, up to several terabytes.

• Most data are stored as file collections in proprietary formats or universally adopted formats like netCDF, GRIB, HDF5.

• File access can be problematic:

Scientists need to know about too many file formats Usually files must be completely downloaded before they can be used Thousands of files can be processed in one data request; only a small

portion of their contents appears in the result set

• Currently available database solutions do not have convenient array storage capabilities.

Data store optimization. Cloud-based Active Storage for multidimensional arrays.

• Active Storage is a new way in database design used for storing multi-dimensional numeric arrays containing space, terrestrial weather data archives and large scaled images.

• Special features of Active Storage are:– Universal architecture capable to store different data types in one

system.– Effective index creating for large data (tens and hundreds Tb).– Can do basic data transformations directly on storage nodes

(arithmetic operations, statistical operations, linear convolution).– Metadata integrated with data.– Can distribute data automatically on several computer nodes (also

can distribute computations).– Can be used in Grid infrastructure using OGSA-DAI services.

Splitting an array into chunks

1 seek 8 seeks 4 seeks 4 seeks

Chunked arrayNon-chunked array

• We store chunks in BLOB fields of a database table

• Chunks do not need to be the same size

chunk_key chunk

0 <Chunk0>

1 <Chunk1>

2 <Chunk2>

3 <Chunk3>

ActiveStorage performance

Request number

Request form (time Х

latitude Х longitude)

1 8 х 64 х 1282 32 х 32 х 643 128 х 16 х 324 512 х 8 х165 2048 х 4 х86 8192 х 2 х 47 32768 х 1 х 2

Distributed data-flow processing

Distributed data-flow processing organization problems:• Data communication support between activity; • Load balancing and parallelization management; • Fault-tolerance and error processing support; • Activity management.At present, several frameworks of distributed data-flow processing exist: Yahoo S4, Twitter Storm, Taverna, Kepler, OGSA DAI.

Twitter Storm

Wind speed calculation workflow example

GetData )U-component)

GetData )V-component)







processing

processing

processing

processing

Output Block

RESTful data service

22 VU Wind speed calculation:

Dependence of data-flow processing time from data volume

Problems that are not solved by frameworks

• Automatic partitioning of source data space.• Flooding and synchronization management in

case of data flow merging.• Data flow routing in case of parallel processing

activity and data flow merging.

Current work

• Twitter Storm data request block-stream activity supporting block geometry and array priority direction properties, and automatic partitioning of source data space.• Twitter Storm data processing activity supporting automatic data flow merging, generalized array processing language, and flooding management.

Results

A framework has been developed, having the following features:

• cloud storage with data reservation and access acceleration;

• designed for large multidimentional data arrays;• request shape flexibility;• flow-based system for access and processing;• high scalability.

Applications

• High-resolution 3D models of the Earth based on large number of observations.

• Climate modeling and analysis tasks.• Multispectral satellite and geological imagery

processing.

Documents

DISTRIBUTED DATA FLOW WEB-SERVICES FOR ACCESSING AND PROCESSING OF BIG DATA SETS IN EARTH SCIENCES