1 Spatial Big Data (SBD) Challenges Intersecting Mobile and Cloud Computing Shashi Shekhar McKnight Distinguished University Professor Department of Computer Science and Engineering University of Minnesota www.cs.umn.edu/~shekhar

1

Spatial Big Data (SBD) Challenges Intersecting Mobile and Cloud Computing

Shashi Shekhar McKnight Distinguished University Professor

Department of Computer Science and Engineering

University of Minnesota

www.cs.umn.edu/~shekhar

2

Spatial Big Data (SBD)

• SBD are important to society

– Ex. Eco-routing, Public Safety & Security, Understanding Climate Change

• SBD exceed capacity of current computing systems

• DBMS Challenges

– Eco-Routing: Lagrangian frame, Non-Stationary Ranking

– Privacy vs. Utility Trade-offs

• Data Analytics Opportunities

– Post Markov Assumption – Estimate Neighbor Relationship from SBD

– Place based Ensemble Models to address spatial heterogeneity

– Bigger the spatial data, simpler may be the spatial models

– Online Spatial Data Analytics

• Platform Challenges

– Map-reduce – expensive reduce not suitable for iterative computations

– Load balancing is harder for maps with polygons and line-strings

– Spatial Hadoop ?

3

Relational to Spatial DBMS to SBD Management

• 1980s: Relational DBMS

• Relational Algebra, B+Tree index

• Query Processing, e.g. sort-merge equi-join algorithms, …

• Spatial customer (e.g. NASA, USPS) faced challenges

• Semantic Gap

• Verbose description for distance, direction, overlap

• Shortest path is Transitive closure

• Performance challenge due to linearity assumption

• Are Sorting & B+ tree appropriate for geographic data?

• New ideas emerged in 1990s

• Spatial data types and operations (e.g. OGIS Simple Features)

• R-tree, Spatial-Join-Index, space partitioning, …

• SBD may require new thinking for

• Temporally detailed roadmaps

• Eco-routing queries

• Privacy vs. Utility Trade-off

4

Data Mining to Spatial Data Mining to SBD Analytics

• 1990s: Data Mining

• Scale up traditional models (e.g., Regression) to large relational databases (460 Tbytes)

• New pattern families: Associations : Which items are bought together? (Ex. Diaper, beer)

• Spatial customers

• Walmart: Which items are bought just before/after events, e.g. hurricanes?

• Where is a pattern (e.g., diaper-beer) prevalent?

• Global climate change: tele-connections

• But faced challenges

• Independent Identical Distribution assumption not reasonable for spatial data

• Transactions, i.e. disjoint partitioning of data, not natural for continuous space

• This led to Spatial Data Mining (last decade)

• SBD raise new questions

• May SBD address open questions, e.g. estimate spatial neighborhood (e.g., W matrix)?

• Does SBD facilitate better spatial models, e.g., place based ensembles beyond GWR?

• (When) Does bigger spatial data lead to simpler models, e.g. database as a model ?

• On-line Spatio-temporal Data Analytics

5

Parallelizing Spatial Big Data on Cloud Computing

• Case 1: Compute Spatial-Autocorrelation Simpler to Parallelize

– Map-reduce is okay

– Should it provide spatial de-clustering services?

– Can query-compiler generate map-reduce parallel code?

• Case 2: Harder : Parallelize Range Query on Polygon Maps

– Need dynamic load balancing beyond map-reduce

– MPI or OpenMP is better!

• Case 3: Estimate Spatial Auto-Regression Parameters, Routing

– Map-reduce is inefficient for iterative computations due to expensive “reduce”!

– MPI, OpenMP, Pregel or Spatial Hadoop is essential!

– Ex. Golden section search, Determinant of large matrix

– Ex. Eco-routing algorithms, Evacuation route planning

9

Motivation for Spatial Big Data (SBD)

• Societal:

• Google Earth, Google Maps, Navigation, location-based service

• Global Challenges facing humanity – many are geo-spatial!

• Many may benefit from Big Spatial Data

10

Emerging SBD: Geo-social Media, Device2Device

11

Outline

• Motivation

• What is Spatial Big Data (SBD)?

• Definitions

• Examples & Use Cases

• SBD Infrastructure

• SBD Analytics

• Conclusions

12

Spatial Big Data Definitions

• Spatial datasets exceeding capacity of current computing systems

• To manage, process, or analyze the data with reasonable effort

• Due to Volume, Velocity, Variety, …

• SBD History

• Data-intensive Computing: Cloud Computing, Map-Reduce, Pregel

• Middleware

• Big-Data including data mining, machine learning, …

• SBD may challenge

• DBMS

• Conceptual Data Model, e.g., Frames of Reference

• Query Processing, e.g., Non-Stationary Ranking of candidates

• Data Analytics

• Lack of transactions

• Lack of independent samples and identical distributions

• Cloud Platform, e.g., map-reduce

• Expensive reduce not suitable for iterative computation

• Load balancing is harder for polygons and line-strings

13

Traditional Spatial Data

• Spatial attribute:

– Neighborhood and extent

– Geo-Reference: longitude, latitude, elevation

• Spatial data genre

– Raster: geo-images e.g., Google Earth

– Vector: point, line, polygons

– Graph, e.g., roadmap: node, edge, path

Raster Data for UMN Campus

Courtesy: UMN

Vector Data for UMN Campus

Courtesy: MapQuest

Graph Data for UMN Campus

Courtesy: Bing

14

Raster SBD

LiDAR & Urban Terrain

Change Detection Feature Extraction

Average Monthly Temperature

(Courtesy: Prof. V. Kumar)

• Data Sets >> Google Earth

– Geo-videos from UAVs, security cameras

– Satellite Imagery (periodic scan), LiDAR, …

– Climate simulation outputs for next century

• Example use cases

– Change detection, Feature extraction, Urban terrain

15

Vector SBD from Geo-Social Media

• Vector data sub-genre

– Point: location of a tweet, Ushahidi report, checkin, …

– Line-strings, Polygons: roads in openStreetMap

• Use cases: Persistent Surveillance

– Outbreaks of disease, Disaster, Unrest, Crime, …

– Hot-spots, emerging hot-spots

– Spatial Correlations: co-location, teleconnection

16

Persistent Surveillance at American Red Cross

• Even before cable news outlets began reporting the tornadoes that ripped through Texas on

Tuesday, a map of the state began blinking red on a screen in the Red Cross' new social

media monitoring center, alerting weather watchers that something was happening in the

hard-hit area. (AP, April 16th, 2012)

17

Emerging SBD: Geo-Sensor Networks

• Geo-Sensor Network Examples

– Urban roads: Loop detectors, Cameras

– Electricity distribution grids, …

– Environmental sensors for air quality, …

– Robot with sensors, …

• Sensor Network sub-genre

– Fixed reasonable resource: traffic sensors

– Ad-hoc, resource poor: wireless sensor networks

• Use cases:

– Monitor events, anomalies, e.g., accidents, congestion, hotspots

– Feed-back control, Predictive planning

– Environmental Health

18

Graphs SBDs: Temporally Detailed

• Spatial Graphs, e.g., Roadmaps, Electric grid, Supply Chains, …

– Temporally detailed roadmaps [Navteq]

• Use cases: Best start time, Best route at different start-times

19

Emerging SBD: Mobile Device2Device

• Mobile Device

– Cell-phones, cars, trucks, airplanes, …

– RFID-tags, bar-codes, GPS-collars, …

• Trajectory & Measurements sub-genre

– Receiver: GPS tracks, …

– System: Cameras, RFID readers, …

• Use cases:

– Tracking, Tracing,

• Improve service, deter theft …

– Geo-fencing, Identify nearby friends

– Eco-routing

20

Emerging Use-Case: Eco-Routing

U.P.S. Embraces High-Tech Delivery Methods (July 12, 2007)

By “The research at U.P.S. is paying off. ……..— saving roughly three million

gallons of fuel in good part by mapping routes that minimize left turns.”

• Minimize fuel consumption and greenhouse gas (GHG) emissions

– rather than proxies, e.g. distance, travel-time

– avoid congestion, idling at red-lights, turns and elevation changes, etc.

21

Eco-Routing Questions

• What are the expected fuel savings from use of GPS devices with static roadmaps?

• What is the value-added by historical traffic and congestion information?

• How much additional value is added by real-time traffic information?

• What are the impacts of the following on fuel savings and greenhouse gas emissions?

– traffic management systems (e.g. traffic light timing policies),

– vehicles (e.g. weight, engine size, energy-source),

– driver behavior (e.g. gentle acceleration/braking), environment (e.g. weather)

• What is the computational structure of the Eco-Routing problem?

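[Figure: problem space organized by predictable vs. unpredictable future and by stationary vs. non-stationary ranking of alternate paths. Stationary ranking of alternate paths: Dijkstra’s, A*, Dynamic Programming, … Non-stationary ranking of paths: New Computer Science?]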

22

Time Dependence of Shortest Path

INPUT:

Source: University of Minnesota

Destination: MSP Airport

Time Interval 7:00am ---12:00noon

OUTPUT:

23

Routing Algorithm Challenges

• Non-stationary ranking of paths

– Violates the stationarity assumption of dynamic programming

• Non-FIFO behavior

– Violates the no-wait assumption of Dijkstra/A*

* Flight schedule between Minneapolis and Austin (TX)

24

Routing Challenges: Lagrangian Frame of Reference

Question: What is the cost of path <A,C,D> with start time t = 2? Is it 4 or 5?

Path      T = 0   T = 1   T = 2   T = 3
<A,C,D>   4       3       5       4
<A,B,D>   6       5       4       3

Lagrangian Graph

Snapshots of a Graph
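The difference between reading a path cost off a single snapshot and following the traveler through time (the Lagrangian frame) can be sketched in a few lines. This is only an illustration: the edge travel times in `TRAVEL_TIME` are hypothetical, chosen so that a snapshot reading of <A,C,D> at t = 2 gives 4 while the Lagrangian reading gives 5, and the function names are assumptions rather than anything defined in the slides.

```python
# Hypothetical time-dependent travel times: TRAVEL_TIME[edge][t] is the time
# needed to traverse `edge` when departing its tail node at time t.
TRAVEL_TIME = {
    ("A", "C"): [1, 1, 2, 2, 2, 2],
    ("C", "D"): [3, 2, 2, 2, 3, 3],
}


def edge_time(edge, t):
    """Travel time of `edge` at departure time t (last value reused past the horizon)."""
    series = TRAVEL_TIME[edge]
    return series[min(t, len(series) - 1)]


def snapshot_cost(path, t):
    """Cost read from the single graph snapshot at time t (ignores elapsed travel time)."""
    return sum(edge_time(edge, t) for edge in zip(path, path[1:]))


def lagrangian_cost(path, start):
    """Cost experienced by a traveler: each edge is evaluated at the arrival time at its tail."""
    clock = start
    for edge in zip(path, path[1:]):
        clock += edge_time(edge, clock)
    return clock - start


def best_start_time(path, start_times):
    """Start time (earliest on ties) that minimizes the Lagrangian path cost."""
    return min(start_times, key=lambda t: lagrangian_cost(path, t))


if __name__ == "__main__":
    path = ["A", "C", "D"]
    print(snapshot_cost(path, 2), lagrangian_cost(path, 2))
    print(best_start_time(path, range(4)))
```

With these made-up weights the traveler leaves A at t = 2, reaches C at t = 4, and pays the C-D weight in effect at t = 4, not the one shown in the t = 2 snapshot.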

25

Challenges: New Routing Questions

Example Time-Variant Network Questions

New Routing Questions

Best start time to minimize time spent on network

Account for delays at signals, rush hour, etc.

26

• Weekday GPS track for 3 months

– Patterns of life

– Usual places and visits

– Rare places, Rare visits

[Figure: visited places: Work, Home, Club, Farm]

Columns: Place, Morning (7am – 12noon), Afternoon (12noon – 5pm), Evening (5pm – 12midnight), Midnight (12midnight – 7am), Total

Home 10 2 15 29 54

Work 19 20 10 1 50

Club 4 5 4 15

Farm 1 1

Total 30 30 30 30 120

SBD Challenge: Privacy vs. Utility Trade-off

27

• Checkin Risks: Stalking, GeoSlavery, …

• Ex: Girls Around me App (3/2012), Lacy Peterson [2008]

• Others know that you are not home!

The Girls of Girls Around Me. It's doubtful any of these girls even know they are being tracked. Their names and locations have been obscured for privacy reasons. (Source: Cult of Mac, March 30, 2012)

SBD Challenge: Privacy vs. Utility Trade-off

28

Relational to Spatial DBMS to SBD Management

• 1980s: Relational DBMS

• Relational Algebra, B+Tree index

• Query Processing, e.g. sort-merge equi-join algorithms, …

• Spatial customer (e.g. NASA, USPS) faced challenges

• Semantic Gap

• Verbose description for distance, direction, overlap

• Shortest path is Transitive closure

• Performance challenge due to linearity assumption

• Are Sorting & B+ tree appropriate for geographic data?

• New ideas emerged in 1990s

• Spatial data types and operations (e.g. OGIS Simple Features)

• R-tree, Spatial-Join-Index, space partitioning, …

• SBD may require new thinking for

• Temporally detailed roadmaps

• Eco-routing queries

• Privacy vs. Utility Trade-off

29

Outline

• Motivation

• SBD Definition and Examples

• SBD Infrastructure

• Parallelizing Spatial Computations

• Implications for Cloud Platforms

• SBD Analytics

• Conclusions

30

Parallelizing Spatial Big Data on Cloud Computing

• Case 1: Compute Spatial-Autocorrelation Simpler to Parallelize

– Map-reduce is okay

– Should it provide spatial de-clustering services?

– Can query-compiler generate map-reduce parallel code?

• Case 2: Harder : Parallelize Range Query on Polygon Maps

– Need dynamic load balancing beyond map-reduce

– MPI or OpenMP is better!

• Case 3: Estimate Spatial Auto-Regression Parameters, Routing

– Map-reduce is inefficient for iterative computations due to expensive “reduce”!

– MPI, OpenMP, Pregel or Spatial Hadoop is essential!

– Ex. Golden section search, Determinant of large matrix

– Ex. Eco-routing algorithms, Evacuation route planning

31

Simpler: Land-cover Classification

• Multiscale Multigranular Image Classification into land-cover categories

Inputs Output at 2 Scales

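$\widehat{\mathrm{Model}} = \arg\max_{\mathrm{Model}} \{\mathrm{quality}(\mathrm{Model})\}$,

where $\mathrm{quality}(M) = \mathrm{likelihood}(\mathrm{observation} \mid M) - 2\,\mathrm{penalty}(M)$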

32

Parallelization Choice

Initialize parameters and memory
for each Spatial Scale
    for each Quad
        for each Class
            Calculate Quality Measure
        end for Class
    end for Quad
end for Spatial Scale
Post-processing
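The two parallelization choices compared on this slide differ only in which loop is handed to the workers. The following is a minimal sketch of that difference, not the UPC code that was measured: `quality` is a hypothetical stand-in for the real quality measure, the quad layout per scale is assumed, and Python threads are used purely to show the task structure.

```python
from concurrent.futures import ThreadPoolExecutor

SCALES = [1, 2]
CLASSES = ["All", "Woodland", "Vegetated", "Suburban"]


def quads_at(scale):
    """Assumed layout: a scale-s decomposition has 2^s x 2^s quads."""
    side = 2 ** scale
    return [(r, c) for r in range(side) for c in range(side)]


def quality(scale, quad, cls):
    """Hypothetical, deterministic stand-in for the quality measure."""
    r, c = quad
    return (scale * 31 + r * 7 + c * 3 + len(cls)) % 11


def sequential():
    """The triple loop from the pseudocode, run on one worker."""
    return {(s, q, k): quality(s, q, k)
            for s in SCALES for q in quads_at(s) for k in CLASSES}


def quad_level(workers=4):
    """One task per quad; each task loops over all classes (coarser tasks)."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for s in SCALES:
            quads = quads_at(s)
            per_quad = lambda q, s=s: [quality(s, q, k) for k in CLASSES]
            for q, values in zip(quads, pool.map(per_quad, quads)):
                for k, v in zip(CLASSES, values):
                    results[(s, q, k)] = v
    return results


def class_level(workers=4):
    """One task per class inside each quad (finer tasks, at most len(CLASSES) in flight)."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for s in SCALES:
            for q in quads_at(s):
                values = pool.map(lambda k, s=s, q=q: quality(s, q, k), CLASSES)
                for k, v in zip(CLASSES, values):
                    results[(s, q, k)] = v
    return results


if __name__ == "__main__":
    print(quad_level() == sequential() == class_level())
```

Class-level tasks can keep at most as many workers busy as there are classes (four here), whereas quad-level tasks grow with the number of quads at each scale.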

[Figures: efficiency (0 to 1) versus number of processors (1, 2, 4, 8), and speedup (0 to 7) versus number of processors (2, 4, 8), each with class-level and quad-level curves]

Input: 64 x 64 image (Plymouth County, MA); 4 classes (All, Woodland, Vegetated, Suburban)

Language: UPC

Platform: Cray X1 (1-8 processors)

33

Harder: Parallelizing Vector GIS

• (1/30) second response time constraint on Range Query

• Parallel processing necessary since best sequential computer cannot meet requirement

• Blue rectangle = a range query; polygon colors show processor assignment

[Figure: system diagram with components Local Terrain Database, Remote Terrain Databases, High Performance GIS Component, and Display Graphics Engine; labels: Set of Polygons, 30 Hz View Graphics, 2 Hz, 8 Km x 8 Km Bounding Box, 25 Km x 25 Km Bounding Box]

34

Data-Partitioning Approach

• Initial Static Partitioning

• Run-Time dynamic load-balancing (DLB)

• Platforms: Cray T3D (Distributed), SGI Challenge (Shared Memory)

35

DLB Pool-Size Choice is Challenging!

36

Hardest – Location Prediction

Nest locations Distance to open water

Vegetation durability Water depth

37

Ex. 3: Hardest to Parallelize

Classical Linear Regression: $y = x\beta + \varepsilon$

Spatial Auto-Regression: $y = \rho W y + x\beta + \varepsilon$

$\rho$: the spatial auto-regression (auto-correlation) parameter

$W$: n-by-n neighborhood matrix over spatial framework

• Maximum Likelihood Estimation

$\ln(L) = \ln|I - \rho W| - \frac{n \ln(2\pi)}{2} - \frac{n \ln(\sigma^2)}{2} - SSE$

• Need cloud computing to scale up to large spatial dataset.

• However,

• Map reduce is too slow for iterative computations!

• computing determinant of large matrix is an open problem!

38

Parallelizing Spatial Big Data on Cloud Computing

• Case 1: Compute Spatial-Autocorrelation Simpler to Parallelize

– Map-reduce is okay

– Should it provide spatial de-clustering services?

– Can query-compiler generate map-reduce parallel code?

• Case 2: Harder : Parallelize Range Query on Polygon Maps

– Need dynamic load balancing beyond map-reduce

– MPI or OpenMP is better!

• Case 3: Estimate Spatial Auto-Regression Parameters, Routing

– Map-reduce is inefficient for iterative computations due to expensive “reduce”!

– MPI, OpenMP, Pregel or Spatial Hadoop is essential!

– Ex. Golden section search, Determinant of large matrix

– Ex. Eco-routing algorithms, Evacuation route planning

39

Outline

• Motivation

• SBD Definitions & Examples

• SBD Infrastructure

• SBD Analytics

• Conclusions

40

Data Mining to Spatial Data Mining to SBD Analytics

• 1990s: Data Mining

• Scale up traditional models (e.g., Regression) to large relational databases (460 Tbytes)

• New pattern families: Associations : Which items are bought together? (Ex. Diaper, beer)

• Spatial customers

• Walmart: Which items are bought just before/after events, e.g. hurricanes?

• Where is a pattern (e.g., diaper-beer) prevalent?

• Global climate change: tele-connections

• But faced challenges

• Independent Identical Distribution assumption not reasonable for spatial data

• Transactions, i.e. disjoint partitioning of data, not natural for continuous space

• This led to Spatial Data Mining (last decade)

• SBD raise new questions

• May SBD address open questions, e.g. estimate spatial neighborhood (e.g., W matrix)?

• Does SBD facilitate better spatial models, e.g., place based ensembles beyond GWR?

• (When) Does bigger spatial data lead to simpler models, e.g. database as a model ?

• On-line Spatio-temporal Data Analytics

41

Spatial Data Mining Example 1: Spatial Prediction

Nest locations Distance to open water

Vegetation durability Water depth

42

Spatial Autocorrelation (SA)

• First Law of Geography

– “All things are related, but nearby things are more related than distant things.” [Tobler, 1970]

• Autocorrelation

– Traditional i.i.d. assumption is not valid

– Measures: K-function, Moran’s I, Variogram, …

Pixel property with independent identical distribution (i.i.d.)

Vegetation Durability with SA
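Of the measures listed above, Moran's I is the simplest to compute directly. The sketch below assumes a raster grid with binary 4-neighbor (rook) weights; the helper names and the toy rasters are illustrative and not taken from the slides.

```python
import numpy as np


def rook_weights(rows, cols):
    """Binary 4-neighbor (rook) weight matrix for a rows x cols grid, row-major cell order."""
    n = rows * cols
    w = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    w[i, rr * cols + cc] = 1.0
    return w


def morans_i(values, w):
    """Moran's I = (n / sum(w)) * (z' W z) / (z' z), with z the mean-centered values."""
    z = np.asarray(values, dtype=float).ravel()
    z = z - z.mean()
    return (len(z) / w.sum()) * (z @ w @ z) / (z @ z)


if __name__ == "__main__":
    w = rook_weights(4, 4)
    checkerboard = np.indices((4, 4)).sum(axis=0) % 2  # neighbors always differ
    halves = np.zeros((4, 4))
    halves[:, 2:] = 1.0                                # left half 0, right half 1
    print(morans_i(checkerboard, w), morans_i(halves, w))
```

A checkerboard, where every cell differs from all of its neighbors, gives a strongly negative value, while the two-block raster gives a positive one; values near zero are what an i.i.d. pixel property would be expected to produce.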

43

Parameter Estimation for Spatial Auto-regression

Classical Linear Regression: $y = x\beta + \varepsilon$

Spatial Auto-Regression: $y = \rho W y + x\beta + \varepsilon$

$\rho$: the spatial auto-regression (auto-correlation) parameter

$W$: n-by-n neighborhood matrix over spatial framework

• Maximum Likelihood Estimation

$\ln(L) = \ln|I - \rho W| - \frac{n \ln(2\pi)}{2} - \frac{n \ln(\sigma^2)}{2} - SSE$

• Computationally Expensive

• Determinant of a large matrix

• Iterative Computation

• Golden Section Search for $\rho$
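The iterative structure named above (a golden section search over the auto-regression parameter, with a log-determinant evaluated at every step) can be sketched with dense NumPy linear algebra. This is an illustration under stated assumptions, not the implementation behind the slides: it uses the standard concentrated log-likelihood, in which $\beta$ and $\sigma^2$ are profiled out for each candidate $\rho$ so the SSE term reduces to $n/2$; the grid neighborhood matrix, the simulated data, and all function names are hypothetical.

```python
import numpy as np


def grid_w(rows, cols):
    """Row-normalized 4-neighbor neighborhood matrix W for a rows x cols grid."""
    n = rows * cols
    w = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    w[i, rr * cols + cc] = 1.0
    return w / w.sum(axis=1, keepdims=True)


def sar_log_likelihood(rho, y, x, w):
    """Concentrated log-likelihood of y = rho*W*y + x*beta + eps for a given rho."""
    n = len(y)
    a = np.eye(n) - rho * w
    _, logdet = np.linalg.slogdet(a)                 # ln|I - rho W|, the costly term
    ay = a @ y
    beta, *_ = np.linalg.lstsq(x, ay, rcond=None)    # least-squares beta for this rho
    resid = ay - x @ beta
    sigma2 = resid @ resid / n
    return logdet - 0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(sigma2) - 0.5 * n


def golden_section_max(f, lo, hi, tol=1e-6):
    """Maximize a unimodal function f on [lo, hi] by golden section search."""
    inv_phi = (np.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:                                  # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                                        # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2.0


def simulate(rows=8, cols=8, rho=0.5, seed=0):
    """Hypothetical data drawn from the SAR model on a small grid."""
    rng = np.random.default_rng(seed)
    w = grid_w(rows, cols)
    n = rows * cols
    x = np.column_stack([np.ones(n), rng.normal(size=n)])
    eps = 0.1 * rng.normal(size=n)
    y = np.linalg.solve(np.eye(n) - rho * w, x @ np.array([1.0, 2.0]) + eps)
    return y, x, w


if __name__ == "__main__":
    y, x, w = simulate()
    print(golden_section_max(lambda r: sar_log_likelihood(r, y, x, w), -0.9, 0.9))
```

Every step of the search needs a fresh log-determinant of an n-by-n matrix, and each step depends on the previous one, which is the pattern the slides describe as a poor fit for map-reduce.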

44

Spatial Data Mining Example 2: Clustering

• Clustering: Find groups of tuples

• Statistical Significance

– Complete spatial randomness, cluster, and de-cluster

Inputs:

Complete Spatial Random (CSR),

Cluster,

Decluster

Classical Clustering

(K-means always finds clusters)

Spatial Clustering begs to differ!

Leftmost: Clusters not significant

Rightmost – declustered!

45

Spatial Data Mining Example 3: Spatial Outliers

• Spatial Outliers

– Locations that differ from their neighbors

– Ex. Sensor 9

– Source: Traffic Data Cities

• Spatial Join Based Tests
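One common way to make "different from its neighbors" testable is to compare each location's value with the average over its spatial neighbors and standardize the differences. The sketch below assumes that test; the sensor readings, the chain-shaped neighbor relation, and the threshold of 2 are hypothetical and are not the traffic data referred to on the slide.

```python
import numpy as np


def spatial_outliers(values, neighbors, threshold=2.0):
    """Return ids whose difference from their neighbor average is extreme.

    values:    dict id -> measurement
    neighbors: dict id -> list of neighboring ids (e.g., produced by a spatial join)
    """
    ids = sorted(values)
    # Difference between each value and the mean of its neighbors' values.
    diffs = np.array([
        values[i] - np.mean([values[j] for j in neighbors[i]]) for i in ids
    ])
    z = (diffs - diffs.mean()) / diffs.std()
    return [i for i, score in zip(ids, z) if abs(score) > threshold]


def chain_neighbors(ids):
    """Hypothetical neighbor relation: sensors placed consecutively along one road."""
    return {
        i: [j for j in (i - 1, i + 1) if j in ids]
        for i in ids
    }


if __name__ == "__main__":
    readings = dict(enumerate([10, 11, 10, 12, 11, 10, 11, 12, 40, 11, 10, 12], start=1))
    print(spatial_outliers(readings, chain_neighbors(set(readings))))
```

The neighbor lists are the expensive part on real data: they come from a spatial join between the sensor locations and themselves, which is why such tests are described as spatial-join based.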

46

Association Patterns

• Association rule e.g. (Diaper in T => Beer in T)

– Support: probability (Diaper and Beer in T) = 2/5

– Confidence: probability (Beer in T | Diaper in T) = 2/2

• Algorithm Apriori [Agrawal, Srikant, VLDB94]

– Support based pruning using monotonicity

• Note: Transaction is a core concept!

Transaction Items Bought

1 {socks, , milk, , beef, egg, …}

2 {pillow, , toothbrush, ice-cream, muffin, …}

3 { , , pacifier, formula, blanket, …}

… …

n {battery, juice, beef, egg, chicken, …}
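Support and confidence are simple counts over transactions. The sketch below uses five hypothetical transactions, made up so that the numbers match the 2/5 and 2/2 quoted above; the function names are illustrative.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent in T | antecedent in T)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical transactions (not the rows of the table above).
TRANSACTIONS = [
    {"socks", "diaper", "milk", "beer"},
    {"pillow", "toothbrush", "ice-cream"},
    {"diaper", "beer", "pacifier", "formula"},
    {"milk", "egg"},
    {"battery", "juice", "beef", "egg"},
]

if __name__ == "__main__":
    print(support(TRANSACTIONS, {"diaper", "beer"}))
    print(confidence(TRANSACTIONS, {"diaper"}, {"beer"}))
```

Both measures are defined per transaction, which is why the lack of natural transactions in continuous space becomes a problem for co-location mining.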

47

Spatial Data Mining Example 4: Co-locations/Co-occurrence

• Given: A collection of different types of spatial events

• Find: Co-located subsets of event types

• Challenge: No Transactions

• New Approaches

– Spatial Join Based

– One join per candidate is computationally expensive!

48

SBD Opportunity 1: Estimate Spatial Neighbor Relationship

• SDM Limitation 1: Neighbor relationship is End-users’ burden !

– Colocation mining, hotspot detection, spatial outlier detection, …

– Example: W matrix in spatial auto-regression

– Reason: W is quadratic in the number of locations

– Reliable estimation of W needs a very large number of data samples

• SBD Opportunity 1: Post-Markov Assumption

– SBD may be large enough to provide reliable estimate of W

– This will relieve user burden and may improve model accuracy

– One may not have to assume

• Limited interaction length, e.g. Markov assumption

• Spatially invariant neighbor relationships, e.g., 8-neighbor

• Tele-connections are derived from short-distance relationships

Classical Linear Regression: $y = x\beta + \varepsilon$

Spatial Auto-Regression: $y = \rho W y + x\beta + \varepsilon$

49

SBD Opportunity 2: Place Based Ensemble of Models

• SDM Limitation 2: Modeling of Spatial Heterogeneity is rare

– Spatial Heterogeneity: No two places on Earth are identical

– Yet, Astro-Physics tradition focused on place-independent models

– Was it due to paucity of data ?

– Exception: Geographically Weighted Regression or GWR [Fotheringham et al.]

– GWR provides an ensemble of linear regression models, one per place of interest

• Opportunity 2: SBD may support Place based ensemble of models beyond GWR

– Example: Place based ensemble of Decision Trees for Land-cover Classification

– Example: Place based ensemble of Spatial Auto-Regression Models

– Computational Challenge:

• Naïve approach may run a learning algorithm for each place.

• Is it possible to reduce computation cost by exploiting spatial auto-correlation ?

50

SBD Opportunity 3: Bigger the SBD, Simpler the Model

• SDM Limitation 3: Complexity of Spatial Models

– Spatial models are usually computationally more expensive than traditional models

– Example: Spatial Auto-regression vs. Linear Regression

– Example: Geographically Weighted Regression vs. Regression

– Example: Colocation Pattern Mining vs. Association Rule Mining

– Confidence Estimation adds more costs, e.g. M.C.M.C. simulations

• SBD Opportunity 3: Bigger the SBD, Simpler the spatial models

– Sometimes the complexity is due to paucity of data at individual places

• It forces one to leverage data at nearby places via spatial auto-correlation and spatial join!

– SBD may provide plenty of data at each place!

• This may allow place-based divide and conquer

• Build one model per place using local data and simpler model

– Challenges:

• Compare place-based ensembles of simpler models with current spatial models

• When does bigger data lead to simpler models?

• What is SBD from analytics perspective ?

– e.g., ratio of samples to number of parameters

51

SBD Opportunity 4: On-line Spatio-temporal Data Analytics

• SDM Limitation 4: Off-line Batch Processing

– Spatial models are usually learned in off-line batch manner

• Example: Spatial Auto-regression, Colocation Pattern Mining, Hotspot Detection

– However, SBD include streaming data such as event reports, sensor measurements

– SBD use cases include monitoring and surveillance requiring On-Line algorithms

• Example: Timely Detection of Outbreak of disease, crime, unrest, adverse events

• Example: Displacement or spread of a hotspot to neighboring geographies

• Example: Abrupt or rapid change detection in land-cover, forest-fire, etc. for quick response

• SBD Opportunity 4: On-line Spatio-temporal Data Analytics

– Purely local models may leverage time-series data analytics models

– Regional and global models are harder

• Spatial interactions (e.g., colocation, tele-connections) with time-lags

• Q? Can these be computed exactly in on-line manner?

• Q? If not, what are on-line approximations ?

52

Spatial Big Data (SBD)

• SBD Definitions

• SBD Applications

• SBD Infrastructure

• SBD Analytics

• Conclusions

53

Spatial Big Data (SBD)

• SBD are important to society

– Ex. Eco-routing, Public Safety & Security, Understanding Climate Change

• SBD exceed capacity of current computing systems

• DBMS Challenges

– Eco-Routing: Lagrangian frame, Non-Stationary Ranking

– Privacy vs. Utility Trade-offs

• Data Analytics Opportunities

– Post Markov Assumption – Estimate Neighbor Relationship from SBD

– Place based Ensemble Models to address spatial heterogeneity

– Bigger the spatial data, simpler may be the spatial models

– Online Spatial Data Analytics

• Platform Challenges

– Map-reduce – expensive reduce not suitable for iterative computations

– Load balancing is harder for maps with polygons and line-strings

– Spatial Hadoop ?

54

Acknowledgments

• HPC Resources, Research Grants

– Army High Performance Computing Research Center-AHPCRC

– Minnesota Supercomputing Institute - MSI

• Spatial Database Group Members

– Mete Celik, Sanjay Chawla, Vijay Gandhi, Betsy George, James Kang,

Baris M. Kazar, QingSong Lu, Sangho Kim, Sivakumar Ravada

• USDOD

– Douglas Chubb, Greg Turner, Dale Shires, Jim Shine, Jim Rodgers

– Richard Welsh (NCS, AHPCRC), Greg Smith

• Academic Colleagues

– Vipin Kumar

– Kelley Pace, James LeSage

– Junchang Ju, Eric D. Kolaczyk, Sucharita Gopal