51
High Performance Spatial Query Processing for Large scale Scientific Data Ablimit Aji Adviser: Fusheng Wang Math & CS Seminar Nov 16, 2012

High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

High Performance Spatial Query Processing for Large scale Scientific

Data

Ablimit Aji Adviser: Fusheng Wang

Math & CS Seminar Nov 16, 2012

Page 2: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging
Page 3: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Spatial Applications

Map

Page 4: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Pathology Analytical Imaging

Big Spatial Data

Satellite Imagery & Remote Sensing

Page 5: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Integrative Multi-Scale Biomedical Informatics

• Reproducible anatomic/functional characterization at gross level (Radiology) and fine level (Pathology).

• Integration with multiple types of “omic”s (genomics, proteomics, metabolomics) information.

• Create categories of jointly

classified data to describe

pathophysiology, predict prognosis

and response to treatment.

Page 6: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Emory In Silico Brain Tumor Research Center

Integrative in-silico study of GBM brain tumors through digital pathology, radiology, “omics” data, and patient outcome.

Emory In Silico Brain Tumor Research Center

Page 7: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Pathology Imaging

• High-resolution whole slide images provide rich information about morphological and functional characteristics of biological systems. (Ex1)

• Such information has tremendous potential for providing insights regarding the underlying mechanisms of disease onset and progression.

Page 8: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Introduction Distinguishing Characteristics in Pathology Images

… Molecular data

Clinical data

Page 9: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Systematic Image Algorithm Evaluation

• High quality image analysis algorithms are essential to support biomedical research and diagnosis

– Validate algorithms with human annotations

– Compare and consolidate different algorithm results

– Sensitivity study on algorithms’ parameters

• Example: What are the distances and overlap ratios between markup boundaries from two algorithms ?

Algorithm one v.s. Algorithm two

Cross match / join two spatial data sets

Page 10: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Spatial Centric Queries

WINDOW

NEAREST NEIGHBOR

CONTAINMENT POINT

SPATIAL JOIN DENSITY

Need: Manage, Query and Compare Spatially Derived Information

Page 11: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

PAIS Data Management Architecture

UML Model XML Schema

Dat

a M

app

ing

Pathology

Image Database

Data Analysis Applications

(MATLAB, SAS…)

Parallel Database

PAIS Data Management

Web APIs

Image Viewer

Clinical Data Molecular & Genetic Annotation

Application Server

Image Data Management

Job

Man

agem

ent

Dat

a St

agin

g

PAIS Document Generator

Database Schema Loading Manager

Zip

ped

PA

IS D

ocu

men

ts

PIDB

Image Analysis Results

Human Annotations

Page 12: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Spatial Database Engine

• Extended on RDBMS to support multi-dimensional spatial data types and queries

– Extended spatial data types

– Spatial functions and procedures

– Spatial access methods

• space oriented – e.g. Grid-File

• data oriented – e.g. R*-Tree

Page 13: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Partitioned, Shared-Nothing Parallel DBMS

• Most often I/O is the main bottleneck

• Partition data for parallel data access

• Shared-nothing architecture is widely used in practice as a scalable database solution

Page 14: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Parallel Spatial DB Summary

• Comprehensive data model to represent data and provenance.

• Expressive query interface (Spatial SQL).

• Can be scaled with a partitioned parallel DBMS.

• Can be quickly setup, and support a set of queries for small/medium scale data.

Page 15: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Questions ?

Page 16: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

• Data loading takes very long time

• Limited scalability: possible but with high cost

• Expensive software/hardware license

• Limited query support

• Maintaining & tuning is complex

– most often needs a DBA

Limitations of Parallel Database Approach

F. Wang et al., High Performance Analytical Pathology Imaging Database for Algorithm

Evaluation, In MICCAI-DCI 2011

Page 17: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

• 10 billion pixels for each WSI: 105x105

• 1M derived spatial objects, 100 M features per image

• Thousands of patients, tens of slides /patient

• Many algorithms with varying parameters

• Moderate size healthcare operations can generate Terabytes of data per day easily

• If full WSI would be adopted for clinical environment, petabytes of data per day is likely

Large Volumes of Derived Data

Demands High-Throughput

Page 18: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

High Complexity of Geometric Computation

0

200

400

600

800

1000

1200

Cross-comparing query

Tim

e (s

)

Area_Of_Intersection 37.4 %

Area_Of_Union 36.7 %

ST_Intersects 21.8 %

Search 2.7%

Index building 1.5%

95.9%

SELECT AVG(ratio) FROM (

SELECT

ST_Area(ST_Intersection(p.the_geom, q.the_geom))/

ST_Area(ST_Union(p.the_geom, q.the_geom)) AS ratio

FROM oligoastroiii_1_1 AS p, oligoastroiii_1_2 AS q

WHERE ST_Intersects(p.the_geom, q.the_geom) );

Requires High Performance

Page 19: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

High Performance Queries with MapReduce

• MapReduce is a parallel computing framework widely used for large-scale data analysis and queries

– Widely used in major internet applications open source version: Hadoop

– Two simple UDFs for data processing : map & reduce

– Very easy to setup develop scalable applications

– Implicit parallelization by data partitioning

• Our approach:

– Take advantage of MapReduce to run queries

– Native spatial query support with a spatial engine

Page 20: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Major Gaps

• Spatial Data Partitioning and Index Support

– Spatial data is multidimensional

– Hard to preserve data locality

– Supporting and managing multidimensional index on MapReduce is not easy

Page 21: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Major Gaps

• MapReduce only provides map() and reduce()

– Need to map spatial query processing algorithms to map and reduce environment

– Need to revise and rethink spatial query processing

• An easy to use query interface is required

– MapReduce == assembly language for MPP

– Industrial examples – Hive, Pig, Scope ..

– e.g. in Facebook, 70% of MapReduce workload are submitted through Hive

Page 22: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Spatial Join Processing with MapReduce

• Staging: – images are tiled into regular

grids

– tiles redistributed across HDFS as file blocks

– small tiles are merged and metadata added into records

• Map:

– identify records of same tiles to form tasks

• Reduce: – execute queries with the

spatial query engine

– aggregate query results

Tile

Boundary files

Merged boundary file

record(imageID, tileID, boundary)

Copied to

HDFS

HDFS

Result of Algorithm 1

Result of Algorithm 2

Algorithm

results

Image

Data Scan Data Scan Data Scan

MAP

(one reducer per tile)

Spatial Join

REDUCE

Page 23: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Boundary File1

Boundary File2

Spatial Join Algorithm

Geometry Refinement

Spatial Measure

R*-Tree File1

R*-Tree File2

Result File

Real-time spatial querying engine (RESQUE)

Bulk R*-Tree Building

Bulk R*-Tree Building

Example: Two-way Spatial Join

• Index building on demand (low overhead)

• Query pipelines to combine multiple steps of query processing

• Decoupled parallel spatial query processing

• Support of spatial access methods and algorithms – spatial selection, multi-way spatial join, nearest neighbor, and highest

density queries, and extensible for new ones

Real-Time Spatial Query Engine (RESQUE)

Page 24: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Nearest Neighbor Query Processing Workflow

Query e.g. : for each cell find the closest blood vessel and return distance to that blood vessel. Access methods can vary (R*-Tree, Voronoi, etc..)

Map Reduce Reduce

Page 25: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

System Architecture — Hadoop-GIS

Goal: provide end-to-end support (from storage and indexing

to query execution to query composition) for exploration of

spatial scientific datasets at big data scales.

• SQL query interface for ease of use

• Queries are translated into MapReduce with YSmart-Spatial

• Hadoop executes the translated MapReduce code

• Relies on the RESQUE for spatial query processing

Page 26: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Hadoop-GIS

• HiveQL –> HiveSP

– Pros: Widely used as EDW, Tightly integrated with Hadoop – Mature query processing pipeline is already in place – Cons: Married to Hive, less freedom for optimization

• YSmart –> YSmart-Spatial – Pros: Good query optimization framework – Better control of customized optimization (e.g. GPU) – Cons: Not widely used, lots of development

Page 27: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Hadoop-GIS: HiveSP

Page 28: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Declarative Query Interface

Schema Creation: CREATE TABLE tcga_markups

( markup_id BIGINT, provenance STRING, hand_marked

BOOLEAN, center ST_POINT, polygon ST_POLYGON)

PARTITIONED BY TILE(polygon ST_POLYGON, 4096, 4096)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t‘ STORED AS

TEXTFILE ;

Spatial Join Query: SELECT

ST_AREA(ST_INTERSECTION(ta.polygon,tb.polygon))/

ST_AREA(ST_UNION(ta.polygon,tb.polygon)) AS ratio,

ST_DISTANCE(ST_CENTROID(tb.polygon),ST_CENTROID(ta.polygon))

AS distance

FROM tcga_markups ta JOIN tcga_markups ON

(ST_INTERSECTS(ta.polygon, tb.polygon) = TRUE)

WHERE ta.provenance='Algorithm1' AND

tb.provenance='Algorithm2' ;

Page 29: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Query Processing Pipeline

• Parsing

• Validation

SQL

• Semantic Analysis

• Optimization

• Plan Generation

AST • Physical Plan

• Optimization

Operator Tree

• Hadoop

Execution

Page 30: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Two-way Join Query Plan

Page 31: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Spatial Query Engine Performance

Querying time for a single tile Query response time

for a single image

• R*-Tree building ≈ 16% R*-Tree join ≈ 84%

• Indexing Scalability: IndexTime (Single Image) ≈ IndexTime(Tile) * tile#

• Storage: apply chain coding at R*-Tree leaves ≈ 42% storage reduction

Page 32: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

System Scalability Experiments

• Experimental Setup

– in-house cluster with 10 physical nodes (192 cores)

– join query: a set of 18 images (0.5 M nuclei/image) • 39GBs of data with 6 different results

– nearest neighbor query: 50 images from TCGA (size) • 42 GBs

• Hadoop Setup

– Cloudera Hadoop 0.20.2

– Replication factor = 3

– File split size = 128 MB

– Individual task tracker memory = 512MB

Page 33: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Query Setup

• Multi-way Spatial Join Query

star join clique join

• Join Algorithms

– R*-Tree join (Breadth First Tree Traversal)

– Partition based spatial merge join (PBSM) 1. refine with approximate processing

2. filter candidates with precise algorithm

Page 34: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Multi-Way Spatial Join: Star Join

PBSM R*-Tree

Page 35: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Multi-Way Spatial Join: Clique Join

PBSM R*-Tree

Page 36: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Nearest Neighbor Query

Voronoi Diagram R*-Tree

Page 37: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Data Skew in Spatial Query Processing

• Skew may arise for many reasons • Longest running task becomes bottleneck

Page 38: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Stragglers

Page 39: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Skew Mitigation

• Hadoop generates tasks using hash partitioning (default) • Hash partitioning is not skew resistant • We overwrite the default hashing partition function

Page 40: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Cost Estimation

6 2

4 10

Page 41: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Repartition (hash)

6 2

4 10

16 6

Page 42: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Repartition (optimal)

6 2

4 10

12

10

Page 43: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Query Optimization

• Partition problem is NP-Complete

• Greedy Approach ≈ 4/3 OPT ( not bad )

Page 44: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Query Optimization

Page 45: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Summary

• Effective management of big spatial data is one of the pressing challenges for next generation integrative biomedical research.

• We propose and provide a high performance MapReduce based querying system for large scale spatial data.

• We empirically test the concept with analytical medical imaging as an example application.

• The system is efficient, cost effective, scalable, and provides expressive query language.

Page 46: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Next Steps (I)

• Density aware partition methods

– partition according to the local spatial density

– dynamically adjust partition granularity

• Index and pipeline selection

– one spatial index does not fit all (R*-Tree v.s Voronoi etc..)

– select index based on query type data set properties

– pipeline optimization

Spatial Partitioning and Optimization Methods for MapReduce

Page 47: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Next Steps (II)

Spatial Query Optimization SELECT AVG(ratio) FROM (

SELECT

ST_Area(ST_Intersection(p.geom, q.geom))/

ST_Area(ST_Union(p.geom, q.geom)) AS ratio

FROM oligoastroiii_1_1 AS p, oligoastroiii_1_2 AS q

WHERE ST_Intersects(p.geom, q.geom) );

SELECT AVG(ratio) FROM (

SELECT area_intersect/ (area_p + area_q –area_intersect) AS ratio

FROM (

SELECT ST_Area(ST_Intersection(p.geom, q.geom)) AS area_intersect,

ST_Area(p.geom) AS area_p, ST_Area(q.geom) AS area_q

FROM oligoastroiii_1_1 AS p, oligoastroiii_1_2 AS q

WHERE p.geom && q.geom) AS temp1 )

WHERE area_intersect >0 ) AS temp2 ;

Page 48: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Next Steps (III)

0

200

400

600

800

1000

1200

1400

1600

1800

Apr-06 Apr-06 Apr-06 Apr-06 Apr-06 Apr-06 May-06 May-06 May-06 May-06 May-06 May-06 May-06 May-06 May-06 Jun-06 Jun-06 Jun-06 Jun-06 Jun-06 Jun-06 Jun-06 Jun-06 Jun-06 Jul-06 Jul-06 Jul-06 Jul-06 Jul-06 Jul-06 Jul-06 Jul-06 Jul-06 Aug-06 Aug-06 Aug-06 Aug-06 Aug-06 Aug-06 Aug-06 Aug-06 Aug-06 Sep-06 Sep-06 Sep-06 Sep-06 Sep-06 Sep-06 Sep-06 Sep-06 Oct-06 Oct-06 Oct-06 Oct-06 Oct-06 Oct-06 Oct-06 Oct-06 Oct-06 Nov-06 Nov-06 Nov-06 Nov-06 Nov-06 Nov-06 Nov-06 Nov-06 Nov-06 Dec-06 Dec-06 Dec-06 Dec-06 Dec-06 Dec-06 Dec-06 Dec-06 Dec-06 Jan-07 Jan-07 Jan-07 Jan-07 Jan-07 Jan-07 Jan-07 Jan-07 Jan-07 Feb-07 Feb-07 Feb-07 Feb-07 Feb-07 Feb-07 Feb-07 Feb-07 Mar-07 Mar-07 Mar-07 Mar-07 Mar-07 Mar-07 Mar-07 Mar-07 Mar-07 Apr-07 Apr-07 Apr-07 Apr-07 Apr-07 Apr-07 Apr-07 Apr-07 May-07 May-07 May-07 May-07 May-07 May-07 May-07 May-07 May-07 Jun-07 Jun-07 Jun-07 Jun-07 Jun-07 Jun-07 Jun-07 Jun-07 Jun-07 Jul-07 Jul-07 Jul-07 Jul-07 Jul-07 Jul-07 Jul-07 Jul-07 Jul-07 Aug-07 Aug-07 Aug-07 Aug-07 Aug-07 Aug-07 Aug-07 Aug-07 Aug-07 Sep-07 Sep-07 Sep-07 Sep-07 Sep-07 Sep-07 Sep-07 Sep-07 Oct-07 Oct-07 Oct-07 Oct-07 Oct-07 Oct-07 Oct-07 Oct-07 Oct-07 Nov-07 Nov-07 Nov-07 Nov-07 Nov-07 Nov-07 Nov-07 Nov-07 Nov-07 Dec-07 Dec-07 Dec-07 Dec-07 Dec-07 Dec-07 Dec-07 Dec-07 Dec-07 Jan-08 Jan-08 Jan-08 Jan-08 Jan-08 Jan-08 Jan-08 Jan-08 Feb-08 Feb-08 Feb-08 Feb-08 Feb-08 Feb-08 Feb-08 Feb-08 Feb-08 Mar-08 Mar-08 Mar-08 Mar-08 Mar-08 Mar-08 Mar-08 Mar-08 Mar-08 Apr-08 Apr-08 Apr-08 Apr-08 Apr-08 Apr-08 Apr-08 Apr-08 May-08 May-08 May-08 May-08 May-08 May-08 May-08 May-08 May-08 Jun-08 Jun-08 Jun-08 Jun-08 Jun-08 Jun-08 Jun-08 Jun-08 Jun-08 Jul-08 Jul-08 Jul-08 Jul-08 Jul-08 Jul-08 Jul-08 Jul-08 Jul-08 Aug-08 Aug-08 Aug-08 Aug-08 Aug-08 Aug-08 Aug-08 Aug-08 Aug-08 Sep-08 Sep-08 Sep-08 Sep-08 Sep-08 Sep-08 Sep-08 Sep-08 Oct-08 Oct-08 Oct-08 Oct-08 Oct-08 Oct-08 Oct-08 Oct-08 Oct-08 Nov-08 Nov-08 Nov-08 Nov-08 Nov-08 Nov-08 Nov-08 Nov-08 Nov-08 Dec-08 Dec-08 Dec-08 Dec-08 Dec-08 Dec-08 Dec-08 Dec-08 Dec-08 Jan-09 Jan-09 Jan-09 Jan-09 Jan-09 Jan-09 Jan-09 Jan-09 Feb-09 Feb-09 Feb-09 Feb-09 Feb-09 Feb-09 Feb-09 Feb-09 Mar-09 Mar-09 Mar-09 Mar-09 Mar-09 Mar-09 Mar-09 Mar-09 Mar-09 Apr-09 Apr-09 Apr-09 Apr-09 Apr-09 Apr-09 Apr-09 Apr-09 Apr-09 May-09 May-09 May-09 May-09 May-09 May-09 May-09 May-09 May-09 Jun-09 Jun-09 Jun-09 Jun-09 Jun-09 Jun-09 Jun-09 Jun-09 Jul-09 Jul-09 Jul-09 Jul-09 Jul-09 Jul-09 Jul-09 Jul-09 Jul-09 Aug-09 Aug-09 Aug-09 Aug-09 Aug-09 Aug-09 Aug-09 Aug-09 Aug-09 Sep-09 Sep-09 Sep-09 Sep-09 Sep-09 Sep-09 Sep-09 Sep-09 Sep-09 Oct-09 Oct-09 Oct-09 Oct-09 Oct-09 Oct-09 Oct-09 Oct-09 Oct-09 Nov-09 Nov-09 Nov-09 Nov-09 Nov-09 Nov-09 Nov-09 Nov-09 Dec-09 Dec-09 Dec-09 Dec-09 Dec-09 Dec-09 Dec-09 Dec-09 Dec-09 Jan-10 Jan-10 Jan-10 Jan-10 Jan-10 Jan-10 Jan-10 Jan-10 Jan-10 Feb-10 Feb-10 Feb-10 Feb-10 Feb-10 Feb-10 Feb-10 Feb-10 Mar-10 Mar-10 Mar-10 Mar-10 Mar-10 Mar-10 Mar-10 Mar-10 Mar-10 Apr-10 Apr-10 Apr-10 Apr-10 Apr-10 Apr-10 Apr-10 Apr-10 Apr-10 May-10 May-10 May-10 May-10 May-10 May-10 May-10 May-10 May-10 Jun-10 Jun-10 Jun-10 Jun-10 Jun-10 Jun-10 Jun-10 Jun-10 Jul-10 Jul-10 Jul-10 Jul-10 Jul-10 Jul-10 Jul-10 Jul-10 Jul-10 Aug-10 Aug-10 Aug-10 Aug-10 Aug-10 Aug-10 Aug-10 Aug-10 Aug-10 Sep-10 Sep-10 Sep-10 Sep-10 Sep-10 Sep-10 Sep-10 Sep-10 Sep-10 Oct-10 Oct-10 Oct-10 Oct-10 Oct-10 Oct-10 Oct-10 Oct-10 Oct-10 Nov-10 Nov-10 Nov-10 Nov-10 Nov-10 Nov-10 Nov-10 Nov-10 Dec-10 Dec-10 Dec-10 Dec-10 Dec-10 Dec-10 Dec-10 Dec-10 Dec-10 Jan-11 Jan-11 Jan-11 Jan-11 Jan-11 Jan-11

Peak

Perf

orm

an

ce

(GF

LO

PS

)

G…

0

200

400

600

800

1000

1200

1400

1600

1800

Jul-06 Jul-07 Jul-08 Jul-09 Jul-10

Peak

Perf

orm

an

ce

(GF

LO

PS

)

C…

7900 GTX

8800 GTX 9800 GTX

GTX 280

GTX 285 GTX 480

GTX 580

E4300 E6850 Q9650 X7460 980 XE

All cores on the stream

multiprocessor (SM) execute the

same instruction on different data

E.g., NVIDIA GTX 580 has 512

cores (16 SMs, 32 cores each)

Turbo Charge Hadoop-GIS with GPU

ST_Intersection on set of

polygons

Page 49: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Reduce Unnecessary Computation

p q p q p q p q p q

K. Wang et al., Accelerating Pathology Image Data Cross-Comparison on CPU-GPU Hybrid Systems, In PVLDB 2012.

Page 50: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Next Steps (IV)

• Database

– A mature storage engine

– Good single node performance (40 years of research)

– But not scalable

• MapReduce

– Highly scalable and fault tolerant

– But low single node throughput

• Best of the two worlds ?

– current efforts: HadoopDB, Secondo

– how to merge the two for big spatial data ?

MapReduce + DB ≈ Hybrid System

Page 51: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging

Questions?

Hadoop-GIS:

https://web.cci.emory.edu/confluence/display/HadoopGIS