Lossy compression of structured scientific data sets -Shreya Mittapalli New Jersey Institute of Technology Friday Jul 31, 2015 NCAR Supervisors: John Clyne and Alan Nortan HSS, CISL-NCAR


Lossy compression of structured scientific data sets

-Shreya Mittapalli
New Jersey Institute of Technology

Friday Jul 31, 2015
NCAR

Supervisors: John Clyne and Alan Nortan

HSS, CISL-NCAR


Problem we are trying to solve:

• Advances in technology mean that supercomputers, satellites, and similar sources now collect very large amounts of data. There are two problems with Big Data:

1) The hard disk that stores the data might not have enough space.

2) The speed at which the data can be read might be much lower than the required speed. For example:


To tackle this problem, we compress the data.

One way to compress the data is to use wavelets. Because of their multi-resolution and information compaction properties, wavelets are widely used for lossy compression in numerous consumer multimedia applications (e.g. images, music, and video). For example:


The parrot image is compressed at a ratio of 1:35 and the rose image at 1:18 using wavelets.

Source: http://arxiv.org/ftp/arxiv/papers/1004/1004.3276.pdf


What is Lossy Compression?

Lossy compression is the class of data encoding methods that use inexact approximations (or partial data discarding) to represent the content. These techniques are used to reduce data size for storing, handling, and transmitting content. Source: Wikipedia
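To make "partial data discarding" concrete, here is a minimal sketch, not the method used in this project. It assumes a 1D signal whose length is a power of two and uses the simple Haar wavelet as a stand-in for the biorthogonal wavelets discussed later. All function names are hypothetical. For a compression ratio r, it keeps only the N/r largest-magnitude wavelet coefficients and zeroes the rest.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_forward(signal):
    """Multi-level orthonormal Haar transform (length must be a power of two)."""
    coeffs = list(signal)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        avg = [(coeffs[2 * i] + coeffs[2 * i + 1]) / SQRT2 for i in range(half)]
        diff = [(coeffs[2 * i] - coeffs[2 * i + 1]) / SQRT2 for i in range(half)]
        coeffs[:n] = avg + diff  # averages first, then details
        n = half
    return coeffs

def haar_inverse(coeffs):
    """Inverse of haar_forward."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged.append((a + d) / SQRT2)
            merged.append((a - d) / SQRT2)
        out[:2 * n] = merged
        n *= 2
    return out

def compress(signal, ratio):
    """Keep the len(signal) // ratio largest-magnitude coefficients, zero the rest."""
    coeffs = haar_forward(signal)
    keep = max(1, len(coeffs) // ratio)
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:keep])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]

def rmse_at(signal, ratio):
    """Root-mean-square error after compressing and reconstructing."""
    approx = haar_inverse(compress(signal, ratio))
    return math.sqrt(sum((s - a) ** 2 for s, a in zip(signal, approx)) / len(signal))

signal = [math.sin(2 * math.pi * i / 64) for i in range(64)]
for ratio in (1, 2, 4, 8):
    print(f"ratio {ratio}:1  rmse = {rmse_at(signal, ratio):.4f}")
```

The discarded coefficients cannot be recovered, which is why the reconstruction is only an approximation of the original signal.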


In lossy compression:

Advantage: Compressed data can fit on the hard disk, and it also saves a lot of computation time.

Disadvantage: When the data is reconstructed, some of it is permanently lost.


Project Goal

• To determine compression parameters that:

1) minimize distortion for a desired output file size.

2) reduce the computation time and give the best possible outcome.


Experiments done

• To achieve the project goal, we have been attempting to experimentally determine the optimal parameter choices for compressing numerical simulation data using wavelets.

• For this we experimented on three large data sets: two WRF hurricane data sets (Katrina and Sandy) and one turbulence data set (Taylor-Green, TG).


Sandy
• Grid resolution: 5320 x 5000 x 149 (= 16 Gigabytes / 3D variable)
• Number of 3D variables: 15
• Time steps: ~100
• Total data set size: ~24 Terabytes

Katrina
• Grid resolution: 316 x 310 x 35 (= 10 Megabytes / 3D variable)
• Number of 3D variables: 12
• Time steps: ~60
• Total data set size: ~9 Gigabytes

TG
• Grid resolution: 1024^3 (= 4 Gigabytes / 3D variable)
• Number of 3D variables: 6
• Time steps: ~100
• Total data set size: ~2.5 Terabytes
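The per-variable sizes can be checked with a little arithmetic. The sketch below assumes each grid point is stored as a 4-byte (32-bit) floating point value, which the slides do not state. Under that assumption the Sandy and TG figures come out close to the quoted sizes, while Katrina comes out near 14 Megabytes, so the quoted 10 Megabytes is best read as a rough figure.

```python
BYTES_PER_VALUE = 4  # assumption: 32-bit floats (not stated in the slides)

# Grid resolutions as quoted above.
GRIDS = {
    "Sandy": (5320, 5000, 149),
    "Katrina": (316, 310, 35),
    "TG": (1024, 1024, 1024),
}

def variable_bytes(shape, bytes_per_value=BYTES_PER_VALUE):
    """Size in bytes of one 3D variable on a grid of the given shape."""
    nx, ny, nz = shape
    return nx * ny * nz * bytes_per_value

for name, shape in GRIDS.items():
    size = variable_bytes(shape)
    print(f"{name}: {size / 1e9:.4f} GB per 3D variable")
```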


Images of Hurricane Katrina which occurred on 29th August, 2005.


Images of Hurricane Sandy which occurred on October 25, 2012


Image of vortex iso-surfaces in a viscous flow starting from Taylor-Green initial conditions. Source : http://www.galcit.caltech.edu/research/highlights

• We constructed a Python framework that allowed us to change various compression parameters, such as wavelet type and block size, for each run.

• Measurements: Lmax, RMSE, time

• Compression ratios: 1, 2, 4, 16, 32, 64, 128, 256
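The two error measurements can be sketched as follows. This assumes the usual definitions: Lmax is the largest absolute pointwise difference between the original and the reconstructed data, and RMSE is the root of the mean squared difference. The slides do not say whether the errors were normalized, so no normalization is applied here, and the function names are hypothetical.

```python
import math

def lmax_error(original, reconstructed):
    """Largest absolute pointwise difference (L-infinity norm of the error)."""
    return max(abs(o - r) for o, r in zip(original, reconstructed))

def rmse(original, reconstructed):
    """Root-mean-square error between two equal-length sequences."""
    total = sum((o - r) ** 2 for o, r in zip(original, reconstructed))
    return math.sqrt(total / len(original))

# Tiny made-up example, purely for illustration.
original = [1.0, 2.0, 3.0, 4.0]
reconstructed = [1.0, 2.5, 3.0, 3.0]
print("Lmax:", lmax_error(original, reconstructed))
print("RMSE:", rmse(original, reconstructed))
```

Lmax captures the single worst point, while RMSE summarizes the error over the whole data set, so the two can rank parameter choices differently.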


Compression parameters we wanted to explore:

1) Compare wavelet types Bior3.3 and Bior4.4

• The wavelet Bior4.4 is also called the CDF 9/7 wavelet, which is widely used in digital signal processing and image compression.

• The wavelet Bior3.3 is traditionally used in the VAPOR software.

• Goal: Determine whether Bior4.4 is better than Bior3.3.


Compression parameters we wanted to explore:

2) Compare block size 64x64x64 with other block sizes.

[Figure: a block of size 64 compared with a block of size 256]


a) Determine whether smaller blocks are better than larger blocks. There are two competing considerations:

i) Smaller blocks are more computationally efficient than larger blocks.

ii) Larger blocks introduce fewer artefacts than smaller blocks.


b) If the grid dimensions are not integral multiples of the block size (64), some extra data is introduced to fill the gap. This is called padding.

• The problem with padding is that, while we are trying to compress the data, extra data is introduced.

• For TG data, there is no padding, but for Katrina and Sandy data we have 50% and 30% padding respectively.

• Goal: Determine whether the aligned data and the padded data have comparable errors.

Example to illustrate padding:


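[Figure: a dimension of length 149 covered by three blocks of 64 versus three blocks of 50 (total 150). Labels in the figure: 64, 64, 64, 196, 149, 150, 50, 50, 50.]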
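The amount of padding for a given grid and block size can be estimated with the sketch below. It assumes every dimension is rounded up to the next multiple of the block size and reports padding as a fraction of the padded volume. The slides do not say how the quoted percentages were computed, so these numbers need not match them exactly. The function names are hypothetical.

```python
import math

def padded_shape(shape, block):
    """Round each grid dimension up to the next multiple of the block size."""
    return tuple(math.ceil(n / b) * b for n, b in zip(shape, block))

def padding_fraction(shape, block):
    """Fraction of the padded volume that is padding rather than real data."""
    return 1.0 - math.prod(shape) / math.prod(padded_shape(shape, block))

def block_count(shape, block):
    """Number of blocks needed to cover the grid."""
    return math.prod(math.ceil(n / b) for n, b in zip(shape, block))

cases = [
    ("TG", (1024, 1024, 1024), (64, 64, 64)),
    ("Katrina", (316, 310, 35), (64, 64, 64)),
    ("Katrina", (316, 310, 35), (64, 64, 35)),
    ("Sandy", (5320, 5000, 149), (64, 64, 64)),
    ("Sandy", (5320, 5000, 149), (64, 64, 50)),
]
for name, shape, block in cases:
    print(name, block, padded_shape(shape, block),
          f"padding = {padding_fraction(shape, block):.1%}",
          f"blocks = {block_count(shape, block)}")
```

Choosing a block size whose last dimension matches the grid (for example 64x64x35 for Katrina) removes most of the padding in that dimension, which is the aligned case compared in the experiments.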


We did the following three experiments:

• We compared the wavelet types Bior3.3 and Bior4.4 for all three data sets.

• We compared larger blocks with smaller blocks. For TG: 64x64x64 vs 128x128x128 vs 256x256x256

• We compared padded data with aligned data.

a) For Katrina: 64x64x64 vs 64x64x35

b) For Sandy: 64x64x64 vs 64x64x50
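One way such a parameter sweep could be organized is sketched below. This is a hypothetical layout, not the project's actual framework: the names are made up, and it enumerates a full cross product of wavelet, block size, and compression ratio per data set, which the slides do not claim was run in full.

```python
from itertools import product

WAVELETS = ["bior3.3", "bior4.4"]
RATIOS = [1, 2, 4, 16, 32, 64, 128, 256]  # ratios listed in the slides

# Block sizes compared for each data set, as listed above.
BLOCK_SIZES = {
    "TG": [(64, 64, 64), (128, 128, 128), (256, 256, 256)],
    "Katrina": [(64, 64, 64), (64, 64, 35)],
    "Sandy": [(64, 64, 64), (64, 64, 50)],
}

def configurations():
    """Yield one dictionary per (data set, wavelet, block size, ratio) combination."""
    for dataset, blocks in BLOCK_SIZES.items():
        for wavelet, block, ratio in product(WAVELETS, blocks, RATIOS):
            yield {"dataset": dataset, "wavelet": wavelet,
                   "block": block, "ratio": ratio}

runs = list(configurations())
print(len(runs), "configurations")
```

Each configuration would then be passed to the compression step, with Lmax, RMSE, and time recorded for every run.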


BIOR3.3 VS BIOR4.4

The plots for Katrina data illustrating Experiment 1.


ALIGNED DATA VS PADDED DATA

The plot for Sandy data illustrating Experiment 2.


BIGGER BLOCKS VS SMALLER BLOCKS

The plots for TG data illustrating Experiment 3.


Lmax error for the wx variable of the TG data set for the block sizes 64x64x64, 128x128x128 and 256x256x256


RMSE error for the wx variable of the TG data set for the block sizes 64x64x64, 128x128x128 and 256x256x256


WHEN USING A LARGER BLOCK SIZE (256^3 VS 64^3) FOR THE VX COMPONENT OF THE TG DATA SET (THE DATA IS COMPRESSED 512:1), WE SEE IMPROVED COMPRESSION QUALITY AS ILLUSTRATED ABOVE:

Source: Pablo Mininni, U. of Buenos Aires.


Time taken to construct the raw data for the wx variable of the TG data set, for the block sizes 64x64x64, 128x128x128 and 256x256x256


Conclusion:

1) Bior4.4 is in some cases better than Bior3.3.

2) Surprisingly, a larger block (say 256x256x256) is better than 64x64x64 in terms of both computation time and error.

3) The errors of the aligned data and the padded data are comparable.


Acknowledgements:

• My supervisors John Clyne and Alan Nortan for their continued support.

• Dongliang Chu, Samuel Li and Kim Zhang.

• Delilah

• Gail Rutledge

• NCAR