Parallel Geospatial Data Management for Multi-Scale Environmental Data Analysis on GPUs
Visiting Faculty: Jianting Zhang, The City College of New York (CCNY), [email protected]
Host: Dali Wang, Climate Change Science Institute (CCSI), [email protected]
Overview

As the spatial and temporal resolutions of Earth observatory data and Earth system simulation outputs get higher, in-situ and/or post-processing of such large amounts of geospatial data increasingly becomes a bottleneck in scientific inquiries into Earth systems and their human impacts. Existing geospatial techniques based on outdated computing models (e.g., serial algorithms and disk-resident systems), as implemented in many commercial and open-source packages, are incapable of processing large-scale geospatial data at the desired level of performance. Partially supported by DOE's Visiting Faculty Program (VFP), we are investigating a set of parallel data structures and algorithms that can exploit the massively data-parallel computing power of commodity Graphics Processing Units (GPUs). We have designed and implemented a popular geospatial technique called Zonal Statistics, i.e., deriving histograms of points or raster cells within polygonal zones. Applying our GPU-based parallel Zonal Statistics technique to 3,000+ US counties over 20+ billion NASA SRTM 30-meter-resolution Digital Elevation Model (DEM) raster cells achieved impressive end-to-end runtimes: 101 seconds (cold cache) and 46 seconds (hot cache) on a low-end workstation equipped with an Nvidia GTX Titan GPU; 60-70 seconds using a single OLCF Titan compute node; and 10-15 seconds using 8 nodes.
[Figure: workflow overview. High-resolution satellite imagery, in-situ observation sensor data, and global/regional climate model outputs feed data assimilation and Zonal Statistics over ecological, environmental, and administrative zones (ROIs), producing temporal trends on a high-end computing facility; GPU thread blocks process regions of interest in parallel.]
Background & Motivation
Zonal Statistics on NASA Shuttle Radar Topography Mission (SRTM) Data
SQL: SELECT COUNT(*) FROM T_O, T_Z WHERE ST_Within(T_O.the_geom, T_Z.the_geom) GROUP BY T_Z.z_id;
Point-in-Polygon Test
For each county, derive its histogram of elevations from raster cells that are in the polygon.
• SRTM: 20×10⁹ (billion) raster cells (~40 GB raw, ~15 GB compressed TIFF)
• Zones: 3,141 counties, 87,097 vertices
Brute-force point-in-polygon test
• RT = (# of points) × (# of vertices) × (# of ops per p-in-p test) / (# of ops per second) = 20×10⁹ × 87,097 × 20 / (10×10⁹) ≈ 3.5×10⁶ seconds ≈ 40 days
• Using all of Titan's 18,688 nodes: ~200 seconds
• Flops utilization is typically low: can be <1% for data-intensive applications (typically <10% in HPC)
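The cost estimate above can be reproduced in a few lines (constants as in the poster: 20 ops per point-in-polygon test, 10 GFLOPS sustained per node):

```python
# Back-of-the-envelope cost of the brute-force approach: 20 billion cells,
# 87,097 polygon vertices, ~20 ops per point-in-polygon edge test,
# and 10 GFLOPS sustained on a single node.
cells = 20e9
vertices = 87_097
ops_per_test = 20
ops_per_second = 10e9

seconds_one_node = cells * vertices * ops_per_test / ops_per_second
days_one_node = seconds_one_node / 86_400
seconds_full_titan = seconds_one_node / 18_688  # all Titan nodes

print(f"{seconds_one_node:.2e} s ≈ {days_one_node:.0f} days on one node")
print(f"≈ {seconds_full_titan:.0f} s on all 18,688 Titan nodes")
```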
Hybrid Spatial Databases + HPC Approach

Observation: only points/cells that are close enough to polygons need to be tested.
Question: how do we pair neighboring points/cells with polygons?
Minimum Bounding Boxes (MBRs)
Step 1: divide a raster tile into blocks and generate per-block histograms.
Step 2: derive polygon MBRs and pair MBRs with blocks through a box-in-polygon test (inside/intersect).
Step 3: aggregate per-block histograms into per-polygon histograms for blocks that are within polygons.
Step 4: for each intersecting polygon/block pair, perform a point(cell)-in-polygon test for all raster cells in the block and update the respective polygon histogram.
(A) Intersect (B) Inside (C) Outside
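The pairing idea in Step 2 starts from a cheap filter: blocks whose bounding box does not overlap a polygon's MBR can be skipped outright. A minimal sketch (axis-aligned boxes only; all names are illustrative, not the poster's implementation):

```python
# Toy MBR-overlap filter: pair raster blocks with polygons whose minimum
# bounding rectangles overlap; non-overlapping pairs need no further tests.
# Boxes are (xmin, ymin, xmax, ymax).

def mbr_overlap(a, b):
    """True if two axis-aligned boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def candidate_pairs(blocks, polygon_mbrs):
    """Return (block_id, polygon_id) pairs that survive the MBR filter."""
    return [(i, j)
            for i, blk in enumerate(blocks)
            for j, mbr in enumerate(polygon_mbrs)
            if mbr_overlap(blk, mbr)]

blocks = [(0, 0, 1, 1), (2, 2, 3, 3)]
polys = [(0.5, 0.5, 2.5, 2.5)]
print(candidate_pairs(blocks, polys))  # [(0, 0), (1, 0)]
```

Surviving pairs then go through the finer inside/intersect/outside classification shown in the figure.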
GPU Implementations
Identifying parallelisms and mapping to hardware:
(1) Deriving per-block histograms
(2) Block-in-polygon test
(3) Aggregating histograms for "within" blocks
(4) Point-in-polygon test for individual cells and histogram update
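As a serial stand-in for step (1): on the GPU, each raster block maps to a thread block and bin updates use atomicAdd; the Python sketch below models the same per-block histogramming sequentially (parameters are illustrative, not the poster's 360×360 blocks):

```python
# Per-block histogramming: each raster block gets its own histogram of
# cell values. On the GPU, one thread block handles one raster block and
# threads increment shared-memory bins with atomicAdd; here the increments
# are plain serial additions.

def block_histograms(raster, block_size, num_bins):
    """Return {(block_row, block_col): [bin counts]} for a 2-D raster."""
    rows, cols = len(raster), len(raster[0])
    hists = {}
    for r in range(0, rows, block_size):
        for c in range(0, cols, block_size):
            hist = [0] * num_bins
            for i in range(r, min(r + block_size, rows)):
                for j in range(c, min(c + block_size, cols)):
                    hist[raster[i][j]] += 1  # atomicAdd on the GPU
            hists[(r // block_size, c // block_size)] = hist
    return hists

raster = [[0, 1], [1, 1]]
print(block_histograms(raster, block_size=2, num_bins=2))  # {(0, 0): [1, 3]}
```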
[Figure: mapping of GPU thread blocks to raster blocks; threads within a thread block update per-block histogram bins via AtomicAdd, with matched polygon/block pairs (M1/C1, M1/C2, M2/C3, ...) distributed across thread blocks.]
• Point-in-polygon test for each of the cell's 4 corners
• All-pair edge intersection tests between the polygon and the cell
• Perfectly coalesced memory accesses
• Utilizing GPU floating-point power
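A point-in-polygon test of the kind used here can be sketched with the standard ray-crossing (even-odd) algorithm; this is a minimal serial version, not the poster's GPU kernel:

```python
# Ray-crossing point-in-polygon test: cast a horizontal ray from the point
# and count the polygon edges it crosses; an odd count means the point is
# inside. Each edge test costs a handful of flops, which is where the
# ~20 ops-per-test figure in the cost model comes from.

def point_in_polygon(x, y, vertices):
    """vertices: list of (x, y) tuples describing a closed polygon."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))  # True
print(point_in_polygon(5, 2, square))  # False
```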
BPQ-Tree-based raster compression to save disk I/O
• Idea: chop an M-bit raster into M binary bitmaps and then build a quadtree for each bitmap (Zhang et al. 2011)
• BPQ-Tree achieves competitive compression ratios but is much more parallelization-friendly on GPUs
• Advantage 1: compressed data is streamed from disk to GPU without requiring decompression on CPUs, reducing CPU memory footprint and data transfer time
• Advantage 2: can be used in conjunction with CPU-based compression to further improve the compression ratio
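The bit-plane decomposition behind the BPQ-Tree idea can be sketched as follows (decomposition step only; the per-bitmap quadtree construction of Zhang et al. 2011 is omitted):

```python
# Decompose an M-bit raster into M binary bitmaps (bit planes). A BPQ-Tree
# then encodes each bitmap with a quadtree; uniform regions within a bit
# plane collapse into single quadtree nodes, which is what compresses well.

def bit_planes(raster, m_bits):
    """Return m_bits binary bitmaps; plane 0 is the least significant bit."""
    return [[[(v >> b) & 1 for v in row] for row in raster]
            for b in range(m_bits)]

def reassemble(planes):
    """Inverse of bit_planes: recombine planes into the original raster."""
    rows, cols = len(planes[0]), len(planes[0][0])
    return [[sum(planes[b][i][j] << b for b in range(len(planes)))
             for j in range(cols)] for i in range(rows)]

raster = [[5, 2], [7, 0]]            # 3-bit values
planes = bit_planes(raster, 3)
assert reassemble(planes) == raster  # lossless round trip
print(planes[0])                     # LSB plane: [[1, 0], [1, 0]]
```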
Experiments and Evaluations

Tile #   Dimension       Partition schema
1        54000×43200     2×2
2        50400×43200     2×2
3        50400×43200     2×2
4        82800×36000     2×2
5        61200×46800     2×2
6        68400×111600    4×4
Total    20,165,760,000 cells    36 partitions
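The totals in the table are internally consistent, which is quick to verify:

```python
# Verify the table's totals: summed tile cell counts and partition counts.
tiles = [
    (54000, 43200, 2 * 2),
    (50400, 43200, 2 * 2),
    (50400, 43200, 2 * 2),
    (82800, 36000, 2 * 2),
    (61200, 46800, 2 * 2),
    (68400, 111600, 4 * 4),
]
total_cells = sum(w * h for w, h, _ in tiles)
total_partitions = sum(p for _, _, p in tiles)
print(total_cells, total_partitions)  # 20165760000 36
```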
Data format                   Volume (GB)
Original (raw)                38
TIFF compression              15
gzip compression              8.3
BPQ-Tree compression          7.3
BPQ-Tree + gzip compression   5.5
Single-node configuration 1: Dell T5400 workstation
• Intel Xeon E5405 dual quad-core processor (2.00 GHz), 16 GB RAM, PCI-E Gen2, 3×500 GB 7200 RPM disks with 32 MB cache ($5,000)
• Nvidia Quadro 6000 GPU, 448 Fermi cores (574 MHz), 6 GB, 144 GB/s ($4,500)

Single-node configuration 2: do-it-yourself workstation
• Intel Core i5-650 dual-core, 8 GB RAM, PCI-E Gen3, 500 GB 7200 RPM disk with 32 MB cache (recycled) ($1,000)
• Nvidia GTX Titan GPU, 2688 Kepler cores (837 MHz), 6 GB, 288 GB/s ($1,000)
Parameters:
• Raster chunk size (coding): 4096×4096
• Thread block size: 256
• Maximum histogram bins: 5000
• Raster block size: 0.1×0.1 degree = 360×360 cells (resolution is 1×1 arc-second)
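The 360×360 block size follows directly from the units (0.1 degree at 1 arc-second per cell):

```python
# Cells per block side: 0.1 degree * 3600 arc-seconds/degree,
# at one cell per arc-second.
arcsec_per_degree = 3600
block_degrees = 0.1
cells_per_side = round(block_degrees * arcsec_per_degree)
print(cells_per_side)  # 360
```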
GPU Cluster : OLCF Titan
Results

Single node   Cold cache   Hot cache
Config 1      180 s        78 s
Config 2      101 s        46 s
Runtimes (seconds)                                Quadro 6000   GTX Titan   8-core CPU
(Step 0) Raster decompression                     16.2          8.30        131.9
Step 1: Per-block histogramming                   21.5          13.4        /
Step 2: Block-in-polygon test                     0.11          0.07        /
Step 3: "Within-block" histogram aggregation      0.14          0.11        /
Step 4: Cell-in-polygon test + histogram update   29.7          11.4        /
Total, major steps                                67.7          33.3        /
Wall-clock, end-to-end                            85            46          /
Titan
# of nodes    1      2      4      8      16
Runtime (s)   60.7   31.3   17.9   10.2   7.6
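From the scaling numbers above, speedup and parallel efficiency relative to a single node follow directly:

```python
# Speedup and parallel efficiency from the multi-node Titan runtimes.
runtimes = {1: 60.7, 2: 31.3, 4: 17.9, 8: 10.2, 16: 7.6}
base = runtimes[1]
for nodes, t in runtimes.items():
    speedup = base / t
    efficiency = speedup / nodes
    print(f"{nodes:2d} nodes: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Scaling is near-linear up to 8 nodes and tapers at 16, consistent with the growing share of fixed I/O and setup costs.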
Data and Pre-processing