Data Management and Data Processing Support on Array-Based Scientific Data
Yi Wang
Advisor: Gagan Agrawal
Candidacy Examination


Slide 1: Data Management and Data Processing Support on Array-Based Scientific Data (Yi Wang; Advisor: Gagan Agrawal; Candidacy Examination)

Slide 2: Big Data Is Often Big Arrays
- Array data is everywhere:
  - Molecular simulation: molecular data
  - Life science: DNA sequencing data (microarray)
  - Earth science: ocean and climate data
  - Space science: astronomy data

Slide 3: Inherent Limitations of Current Tools and Paradigms
- Most scientific data management and data processing tools are too heavyweight
  - Hard to cope with different data formats and physical structures (variety)
  - Data transformation and data transfer are often prohibitively expensive (volume)
- Prominent examples:
  - RDBMSs: not suited for array data
  - Array DBMSs: costly data ingestion
  - MapReduce: requires a specialized file system

Slide 4: Mismatch Between Scientific Data and DBMSs
- Scientific (array) datasets: very large but processed infrequently; read/append only; no resources for reloading data; popular formats are NetCDF and HDF5
- Database technologies: designed for read-write data; ACID guarantees; assume data reloading/reformatting is feasible

Slide 5: Example Array Data Format - HDF5
- HDF5 (Hierarchical Data Format)

Slide 6: The Upfront Cost of Using SciDB
- The high-level data flow requires data ingestion
- Data ingestion steps: convert raw files (e.g., HDF5) to CSV, then load the CSV files into SciDB
- Reference: "EarthDB: scalable analysis of MODIS data using SciDB", G. Planthaber et al.

Slide 7: Thesis Statement
- Native data can be queried and/or processed efficiently using popular abstractions:
  - Process data stored in the native format, e.g., NetCDF and HDF5
  - Support SQL-like operators, e.g., selection and aggregation
  - Support array operations, e.g., structural aggregations
  - Support a MapReduce-like processing API

Slide 8: Outline
- Data Management Support
  - Supporting a Light-Weight Data Management Layer over HDF5
  - SAGA: Array Storage as a DB with Support for Structural Aggregations
  - Approximate Aggregations Using Novel Bitmap Indices
- Data Processing Support
  - SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats
- Future Work

Slide 9: Overall Idea
- An SQL implementation over HDF5
- Ease of use: a declarative language instead of a low-level programming language plus the HDF5 API
- Abstraction: provides a virtual relational view
- High efficiency: load data on demand (lazy loading), parallel query processing, and server-side aggregation

Slide 10: Functionality
- Query based on dimension index values (Type 1): index-based conditions, also supported by the HDF5 API
- Query based on dimension scales (Type 2): coordinate-based conditions that use the coordinate system instead of the physical layout (array subscripts)
- Query based on data values (Type 3): content-based conditions over simple and compound datatypes
- Aggregate queries: SUM, COUNT, AVG, MIN, and MAX, with server-side aggregation to minimize data transfer

Slide 11: Execution Overview
- Conditions are organized as a 1-D AND-logic condition list, a 2-D OR-logic condition list, and a 1-D OR-logic condition list sharing the same content-based condition

Slide 12: Experimental Setup
- Datasets: 4 GB (sequential experiments) and 16 GB (parallel experiments); 4 dimensions: time, cols, rows, and layers
- Compared with baseline performance and OPeNDAP
  - Baseline performance: no query parsing
  - OPeNDAP: translates HDF5 into a specialized data format

Slide 13: Sequential Comparison with OPeNDAP (Type 2 and Type 3 Queries)

Slide 14: Parallel Query Processing for Type 2 and Type 3 Queries
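To make the Type 2 and Type 3 query functionality above concrete, the following is a minimal Python sketch of what such a layer does under the hood: it translates a coordinate-based (dimension-scale) condition into a hyperslab selection, applies a value-based condition, and then aggregates. This is an illustration only, not the system's implementation; the file name, dataset names, and threshold values are hypothetical, and a 4-D (time, cols, rows, layers) layout like the experimental dataset is assumed.

```python
import h5py
import numpy as np

# Hypothetical file and dataset names; assumes a 4-D array (time, cols, rows, layers)
# with a 1-D "time" dataset holding the dimension-scale (coordinate) values.
with h5py.File("salinity.h5", "r") as f:
    time_scale = f["time"][:]             # coordinate values of the first dimension
    data = f["salinity"]                  # 4-D dataset, not loaded into memory yet

    # Type 2 (coordinate-based) condition: 100.0 <= time <= 200.0,
    # translated into an index range, i.e., a hyperslab on the first dimension.
    idx = np.where((time_scale >= 100.0) & (time_scale <= 200.0))[0]
    slab = data[idx.min():idx.max() + 1, :, :, :]     # lazy, partial read

    # Type 3 (value-based) condition: salinity > 35.0.
    mask = slab > 35.0

    # Aggregation computed where the data resides (server side in the real system).
    print("COUNT:", int(mask.sum()))
    print("SUM:  ", float(slab[mask].sum()))
```

In the actual layer, the condition lists shown on the Execution Overview slide drive which hyperslabs are read, and only the aggregated result is transferred back, in line with the server-side aggregation design above.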
Slide 15: Outline (repeated; next part: SAGA: Array Storage as a DB with Support for Structural Aggregations)

Slide 16: Array Storage as a DB
- A paradigm similar to NoDB: DB functionality is still maintained, but there is no data ingestion
- DB and array storage as a DB: friends or foes?
  - Use a DB when data is loaded once and queried frequently
  - Use array storage directly when data is queried infrequently, to avoid the loading cost
- Our system focuses on a set of special array operations: structural aggregations

Slide 17: Structural Aggregation Types
- Non-overlapping aggregation
- Overlapping aggregation

Slide 18: Grid Aggregation
- Parallelization is easy after partitioning
- Considerations: data contiguity (which affects I/O performance), communication cost, and load balancing for skewed data
- Partitioning strategies: coarse-grained, fine-grained, hybrid, and auto-grained

Slide 19: Partitioning Strategy Decider
- Cost model: analyze the loading cost and the computation cost separately
  - Loading cost: loading factor x data amount
  - Computation cost
- Exception - auto-grained: takes loading cost and computation cost as a whole

Slide 20: Overlapping Aggregation
- I/O cost: reuse the data already in memory and reduce disk I/O to improve I/O performance
- Memory accesses: reuse the data already in the cache and reduce cache misses to accelerate the computation
- Aggregation approaches: naive, data-reuse, and all-reuse

Slide 21: Example: Hierarchical Aggregation
- Aggregate 3 grids in a 6 x 6 array: the innermost 2 x 2 grid, the middle 4 x 4 grid, and the outermost 6 x 6 grid
- (Parallel) sliding aggregation is much more complicated

Slide 22: Naive Approach
1. Load the innermost grid
2. Aggregate the innermost grid
3. Load the middle grid
4. Aggregate the middle grid
5. Load the outermost grid
6. Aggregate the outermost grid
- For N grids: N loads + N aggregations

Slide 23: Data-Reuse Approach
1. Load the outermost grid
2. Aggregate the outermost grid
3. Aggregate the middle grid
4. Aggregate the innermost grid
- For N grids: 1 load + N aggregations

Slide 24: All-Reuse Approach (see the sketch below)
1. Load the outermost grid
2. Once an element is accessed, accumulatively update every aggregation result it contributes to
- Elements in the outer ring update only the outermost aggregation result; elements in the middle ring update both the outermost and the middle results; elements in the innermost grid update all 3 results
- For N grids: 1 load + 1 aggregation pass
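The all-reuse approach lends itself to a very short sketch. The NumPy code below, a minimal illustration rather than the SAGA implementation, performs the hierarchical SUM aggregation of the 6 x 6 example above in a single pass: each element is read once and added to the running aggregate of every nested grid that contains it.

```python
import numpy as np

array = np.arange(36, dtype=float).reshape(6, 6)   # example 6 x 6 array
n = array.shape[0]
num_grids = n // 2                                  # 3 nested grids: 6x6, 4x4, 2x2
sums = np.zeros(num_grids)                          # sums[0] = outermost, sums[2] = innermost

# One load and one aggregation pass: every element updates all grids containing it.
for i in range(n):
    for j in range(n):
        depth = min(i, j, n - 1 - i, n - 1 - j)     # ring index, 0 = outer ring
        innermost = min(depth, num_grids - 1)       # deepest nested grid containing (i, j)
        for g in range(innermost + 1):
            sums[g] += array[i, j]

print("outermost 6 x 6 sum:", sums[0])
print("middle    4 x 4 sum:", sums[1])
print("innermost 2 x 2 sum:", sums[2])
```

For N nested grids, the naive approach needs N loads and N aggregations and the data-reuse approach 1 load and N aggregations, whereas this pattern needs 1 load and a single aggregation pass.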
Slide 25: Sequential Performance Comparison
- Array slab size to data size (8 GB) ratio: from 12.5% to 100%
- Coarse-grained partitioning for the grid aggregation; the all-reuse approach for the sliding aggregation
- SciDB stores "chunked" arrays and can even support overlapping chunking to accelerate the sliding aggregation

Slide 26: Parallel Sliding Aggregation Performance
- Number of nodes: from 1 to 16; 8 GB data; sliding grid size: from 3 x 3 to 6 x 6

Slide 27: Outline (repeated; next part: Approximate Aggregations Using Novel Bitmap Indices)

Slide 28: Approximate Aggregations over Array Data
- Challenges:
  - Flexible aggregation over any subset: dimension-based, value-based, or combined predicates
  - Aggregation accuracy: both the spatial distribution and the value distribution matter
  - Aggregation without data reorganization: reorganization is prohibitively expensive
- Existing techniques are all problematic for array data:
  - Sampling: unable to capture both distributions
  - Histograms: no spatial distribution
  - Wavelets: no value distribution
- New data synopses: bitmap indices

Slide 29: Bitmap Indexing and Pre-Aggregation
- Bitmap indices plus pre-aggregation statistics

Slide 30: Approximate Aggregation Workflow

Slide 31: Running Example
- Query: SELECT SUM(Array) WHERE Value > 3 AND ID < 4;
- Predicate bitvector: 11110000; bin bitvectors i1: 01000000 and i2: 10010000; matched counts Count1 = 1 and Count2 = 2
- Estimated sum: 7 x 1/2 + 16 x 2/3 = 14.167; precise sum: 14

Slide 32: A Novel Binning Strategy
- Conventional binning strategies (equi-width and equi-depth) are not designed for aggregation
- V-optimized binning strategy, inspired by the V-optimal histogram; goal: approximately minimize the sum of squared errors (SSE)
- Unbiased V-optimized binning: assumes the data is queried uniformly at random
- Weighted V-optimized binning: assumes the frequently queried subarea is known in advance

Slide 33: Unbiased V-Optimized Binning (see the first sketch below)
- 3 steps: 1) initial binning using equi-depth binning; 2) iterative refinement of the bin boundaries; 3) bitvector generation to mark spatial positions

Slide 34: Weighted V-Optimized Binning
- Difference: minimizes the WSSE instead of the SSE, with a similar binning algorithm
- Major modification: the representative value of each bin is no longer the mean value
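As a rough illustration of steps 1 and 2 of unbiased V-optimized binning, the sketch below starts from equi-depth boundaries and then iteratively refines them with a Lloyd-style rule (each boundary moves to the midpoint of the neighboring bin means), which locally reduces the SSE of values against their bin representatives. This is only an approximation of the idea under that assumption, not the algorithm from the work itself; the function name and parameters are hypothetical, and step 3 (bitvector generation marking spatial positions) is omitted.

```python
import numpy as np

def v_optimized_bins(values, num_bins=4, iters=20):
    """Approximate V-optimized binning: equi-depth start plus iterative boundary refinement."""
    v = np.sort(np.asarray(values).ravel())
    # Step 1: initial binning (equi-depth): boundaries at equally spaced quantiles.
    bounds = np.quantile(v, np.linspace(0, 1, num_bins + 1)[1:-1])
    for _ in range(iters):
        bin_ids = np.searchsorted(bounds, v, side="right")        # bin of each value
        if np.bincount(bin_ids, minlength=num_bins).min() == 0:   # degenerate case: stop
            break
        means = np.array([v[bin_ids == b].mean() for b in range(num_bins)])
        # Step 2: refinement: move each boundary to the midpoint of adjacent bin means,
        # which lowers the sum of squared errors (SSE) of values to their bin means.
        new_bounds = (means[:-1] + means[1:]) / 2.0
        if np.allclose(new_bounds, bounds):
            break
        bounds = new_bounds
    return bounds

skewed = np.random.default_rng(0).lognormal(size=10_000)   # skewed synthetic values
print("bin boundaries:", v_optimized_bins(skewed))
```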
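Once bins, bitvectors, and pre-aggregation statistics exist, an approximate SUM can be answered without touching the raw data, as in the running example on the earlier slide: for each bin, scale its pre-aggregated sum by the fraction of its bits that fall inside the predicate bitvector. The sketch below mirrors that computation (7 x 1/2 + 16 x 2/3 = 14.167); the exact bit positions are illustrative assumptions chosen to be consistent with the per-bin counts.

```python
import numpy as np

# Hypothetical index: one bitvector per value bin, plus pre-aggregated count and sum.
bins = [
    {"bits": np.array([0, 1, 0, 0, 0, 0, 1, 0], dtype=bool), "count": 2, "sum": 7.0},
    {"bits": np.array([1, 0, 0, 1, 0, 1, 0, 0], dtype=bool), "count": 3, "sum": 16.0},
]

# Predicate bitvector, e.g., from the dimension-based part of "... AND ID < 4".
predicate = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)

estimate = 0.0
for b in bins:
    matched = np.count_nonzero(b["bits"] & predicate)   # this bin's bits inside the predicate
    if matched:
        # Scale the pre-aggregated sum by the fraction of the bin satisfying the predicate.
        estimate += b["sum"] * matched / b["count"]

print("estimated SUM:", round(estimate, 3))   # 7 * 1/2 + 16 * 2/3 = 14.167
```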
Slide 35: Experimental Setup
- Data skew: the dense range covers less than 5% of the space but over 90% of the data; the sparse range covers over 95% of the space but less than 10% of the data
- 5 types of queries: 1) DB: dimension-based predicates; 2) VBD: value-based predicates over the dense range; 3) VBS: value-based predicates over the sparse range; 4) CD: combined predicates over the dense range; 5) CS: combined predicates over the sparse range
- Ratio of querying probabilities: 10 : 1; 50% of the synthetic data and 25% of the real-world data is frequently queried

Slide 36: SUM Aggregation Accuracy of Different Binning Strategies on the Synthetic Dataset
- Compared: Equi-Width, Equi-Depth, Unbiased V-Optimized, and Weighted V-Optimized

Slide 37: SUM Aggregation Accuracy of Different Methods on the Real-World Dataset
- Compared: Sampling_2%, Sampling_20%, MD-Histogram, Equi-Depth, Unbiased V-Optimized, and Weighted V-Optimized

Slide 38: Outline (repeated; next part: Data Processing Support - SciMATE)

Slide 39: Scientific Data Analysis Today
- Store-first-analyze-after: reload data into another file system (e.g., from PVFS to HDFS) or into another data format (e.g., NetCDF/HDF5 data into a specialized format)
- Problems: long data migration/transformation time; stresses the network and disks

Slide 40: System Overview
- Key feature: a scientific data processing module

Slide 41: Scientific Data Processing Module

Slide 42: Parallel Data Processing Times on 16 GB Datasets (KNN and K-Means)

Slide 43: Future Work Outline
- Data Management Support
  - SciSD: Novel Subgroup Discovery over Scientific Datasets Using Bitmap Indices
  - SciCSM: Novel Contrast Set Mining over Scientific Datasets Using Bitmap Indices
- Data Processing Support
  - StreamingMATE: A Novel MapReduce-Like Framework over Scientific Data Streams

Slide 44: SciSD
- Subgroup discovery goal: identify all the subsets that are significantly different from the entire dataset (the general population) with respect to a target variable
- Can be widely used in scientific knowledge discovery
- Novelty: subsets can involve dimensional and/or value ranges; all attributes are numeric; high efficiency through frequent bitmap-based approximate aggregations

Slide 45: Running Example

Slide 46: SciCSM
- "Sometimes it's good to contrast what you like with something else. It makes you appreciate it even more." - Darby Conley, Get Fuzzy, 2001
- Contrast set mining goal: identify all the filters that can generate significantly different subsets
- Common filters: time periods, spatial areas, etc.
- Usage: classifier design, change detection, disaster prediction, etc.

Slide 47: Running Example

Slide 48: StreamingMATE
- Extends the precursor system SciMATE to process scientific data streams
- Generalized reduction: reduce the data stream to a reduction object, with no shuffling or sorting (see the sketch below)
- Focus on load-balancing issues: the input data volume can be highly variable; topology updates add/remove/update streaming operators
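To illustrate the generalized-reduction style of processing behind SciMATE and StreamingMATE, the sketch below runs one k-means iteration (one of the evaluated applications) in that style: each data chunk is reduced locally into a small reduction object, and the objects are merged afterwards, so no shuffling or sorting is needed. This is a paradigm illustration with hypothetical names, not the systems' actual API.

```python
import numpy as np

def local_reduction(chunk, centroids):
    """Reduce one chunk of points to a small reduction object: per-cluster (sum, count)."""
    k, dim = centroids.shape
    obj = {"sum": np.zeros((k, dim)), "count": np.zeros(k, dtype=int)}
    dists = np.linalg.norm(chunk[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)                      # nearest centroid of each point
    for c in range(k):
        members = chunk[nearest == c]
        if len(members):
            obj["sum"][c] += members.sum(axis=0)
        obj["count"][c] += len(members)
    return obj

def merge(objects):
    """Global combination step: element-wise merge of the reduction objects."""
    return {"sum": sum(o["sum"] for o in objects),
            "count": sum(o["count"] for o in objects)}

# Chunks could come from an HDF5/NetCDF file or arrive as a stream.
rng = np.random.default_rng(0)
chunks = [rng.normal(size=(1000, 2)) + offset for offset in (0.0, 5.0)]
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])

merged = merge([local_reduction(c, centroids) for c in chunks])
new_centroids = merged["sum"] / np.maximum(merged["count"], 1)[:, None]
print(new_centroids)
```

In the streaming setting, the same small reduction objects can be updated continuously, which is where the load-balancing and topology-update concerns above come in.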
Slide 49: StreamingMATE Overview

Slide 50: (blank)

Slide 51: Hyperslab Selector
- 4-D salinity dataset: dim1: time [0, 1023]; dim2: cols [0, 166]; dim3: rows [0, 62]; dim4: layers [0, 33]
- False: nullify the condition list; True: nullify the elementary condition
- Fill in all the index boundary values

Slide 52: Type 2 and Type 3 Query Examples

Slide 53: Aggregation Query Examples
- AG1: simple global aggregation
- AG2: GROUP BY clause + HAVING clause
- AG3: GROUP BY clause

Slide 54: Sequential and Parallel Performance of Aggregation Queries

Slide 55: Array Databases
- Examples: SciDB, RasDaMan, and MonetDB
- Arrays are first-class citizens: everything is defined in the array dialect
- Lightweight or no ACID maintenance: with no write conflicts, ACID is inherently guaranteed
- Other desired functionality: structural aggregations, array join, provenance

Slide 56: Structural Aggregations
- Aggregate elements based on positional relationships, aggregating the elements in the same square at a time
- E.g., a moving average computes the average of each 2 x 2 square from left to right: the input array 1 2 3 4 / 5 6 7 8 yields the aggregation result 3.5 4.5 5.5 (a sketch of this example appears after the last slide)

Slide 57: Coarse-Grained Partitioning
- Pros: low I/O cost; low communication cost
- Cons: workload imbalance for skewed data

Slide 58: Fine-Grained Partitioning
- Pros: excellent workload balance for skewed data
- Cons: relatively high I/O cost; high communication cost

Slide 59: Hybrid Partitioning
- Pros: low communication cost; good workload balance for skewed data
- Cons: high I/O cost

Slide 60: Auto-Grained Partitioning
- Step 1: estimate the grid density (after filtering) by sampling, and from it estimate the computation cost (based on the time complexity); for each grid, total processing cost = constant loading cost + varying computation cost
- Step 2: partition the cost array - balanced contiguous multi-way partitioning, via dynamic programming (small number of grids) or a greedy method (large number of grids); a sketch of a greedy variant appears after the last slide

Slide 61: Auto-Grained Partitioning (Cont'd)
- Pros: low I/O cost; low communication cost; great workload balance for skewed data
- Cons: overhead of sampling and runtime partitioning

Slide 62: Partitioning Strategy Summary
Strategy       | I/O Performance | Workload Balance | Scalability | Additional Cost
Coarse-Grained | Excellent       | Poor             | Excellent   | None
Fine-Grained   | Poor            | Excellent        | Poor        | None
Hybrid         | Poor            | Good             | Good        | None
Auto-Grained   | Great           | Great            | Great       | Nontrivial
- Our partitioning strategy decider can help choose the best strategy

Slide 63: All-Reuse Approach (Cont'd)
- Key insight: the number of aggregates is typically far smaller than the number of queried elements, so it is more computationally efficient to iterate over the elements and update the aggregates each element contributes to
- More benefits: load balance (for hierarchical/circular aggregations); more speedup for compound array elements, since the data type of an aggregate is usually primitive, but this is not always true for an array element

Slide 64: Parallel Grid Aggregation Performance
- 4 processors on a real-life dataset of 8 GB
- User-defined aggregation: k-means; the number of iterations is varied to vary the amount of computation

Slide 65: Data Access Strategies and Patterns
- Full read: probably too expensive for reading a small data subset
- Partial read: strided pattern, column pattern, or discrete-point pattern

Slide 66: Indexing Cost of Different Binning Strategies with Varying # of Bins on the Synthetic Dataset

Slide 67: SUM Aggregation of Equi-Width Binning with Varying # of Bins on the Synthetic Dataset

Slide 68: SUM Aggregation of Equi-Depth Binning with Varying # of Bins on the Synthetic Dataset

Slide 69: SUM Aggregation of V-Optimized Binning with Varying # of Bins on the Synthetic Dataset

Slide 70: Average Relative Error (%) of MAX Aggregation of Different Methods on the Real-World Dataset

Slide 71: SUM Aggregation Times of Different Methods on the Real-World Dataset (DB)

Slide 72: SUM Aggregation Times of Different Methods on the Real-World Dataset (VBD)

Slide 73: SUM Aggregation Times of Different Methods on the Real-World Dataset (VBS)

Slide 74: SUM Aggregation Times of Different Methods on the Real-World Dataset (CD)

Slide 75: SUM Aggregation Times of Different Methods on the Real-World Dataset (CS)

Slide 76: SD vs. Classification
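Two small sketches follow as an appendix to the backup slides. The first makes the moving-average example on the Structural Aggregations slide (Slide 56) concrete: a 2 x 2 window slides from left to right over the 2 x 4 input array and reproduces the result 3.5, 4.5, 5.5. It is an illustration of the operation, not code from the system.

```python
import numpy as np

array = np.array([[1, 2, 3, 4],
                  [5, 6, 7, 8]], dtype=float)

def sliding_average(a, win=2):
    """Average of each win x win square, sliding one column at a time, left to right."""
    rows, cols = a.shape
    out = []
    for j in range(cols - win + 1):
        for i in range(rows - win + 1):
            out.append(a[i:i + win, j:j + win].mean())
    return np.array(out)

print(sliding_average(array))   # [3.5 4.5 5.5]
```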
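The second sketch relates to auto-grained partitioning (Slide 60), which balances a per-grid cost array with contiguous multi-way partitioning, using dynamic programming for a small number of grids and a greedy method for a large number. Below is a minimal greedy variant under the assumption that the goal is contiguous groups of roughly equal total cost; it is an illustration, not the exact algorithm used in the work, and all names and numbers are hypothetical.

```python
def greedy_contiguous_partition(costs, num_parts):
    """Split a cost array into contiguous groups whose sums are close to the average load."""
    target = sum(costs) / num_parts              # ideal cost per partition
    parts, current, acc = [], [], 0.0
    remaining = num_parts
    for i, c in enumerate(costs):
        current.append(c)
        acc += c
        # Close the current partition once the target is reached, as long as enough
        # items remain to fill the partitions that are still open.
        if acc >= target and remaining > 1 and len(costs) - i - 1 >= remaining - 1:
            parts.append(current)
            current, acc = [], 0.0
            remaining -= 1
    parts.append(current)
    return parts

# Example: skewed per-grid costs (loading + estimated computation) split across 3 workers.
costs = [1, 1, 8, 1, 2, 6, 1, 1, 3]
for part in greedy_contiguous_partition(costs, 3):
    print(part, "total =", sum(part))
```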