32
Naphtali David Rishe Director, High Performance Database Research Center Director, NSF Industry-University Cooperative Research Center http://HPDRC FIU edu http://TerraFly FIU edu Indexing Geospatial Data with http://HPDRC.FIU.edu http://TerraFly .FIU.edu Indexing Geospatial Data with MapReduce Naphtali Rishe + , Vagelis Hristidis + , Raju Rangaswami + , Ouri Wolfson * , Howard Ho ** , Ariel Cary + , Zhengguo Sun + , Lester Melendes + + Sh l f C i d I f i S i Fl id I i l U i i + School of Computing and Information Sciences, Flor ida International University * University of Illinois at Chicago ** IBM Almaden Research Center Sponsored by: NSF Cluster Exploratory (CluE) Sponsored by: NSF Cluster Exploratory (CluE)

Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

Naphtali David RisheDirector, High Performance Database Research CenterDirector, NSF Industry-University Cooperative Research Centerhttp://HPDRC FIU edu http://TerraFly FIU edu

Indexing Geospatial Data with

http://HPDRC.FIU.edu http://TerraFly.FIU.edu

Indexing Geospatial Data with MapReduce

Naphtali Rishe+, Vagelis Hristidis+, Raju Rangaswami+,Ouri Wolfson*, Howard Ho**, Ariel Cary+, Zhengguo Sun+, Lester Melendes+

+S h l f C i d I f i S i Fl id I i l U i i+School of Computing and Information Sciences, Florida International University*University of Illinois at Chicago ** IBM Almaden Research Center

Sponsored by: NSF Cluster Exploratory (CluE)Sponsored by: NSF Cluster Exploratory (CluE)

Page 2: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

ContentContent1. Introduction

2. TerraFly

3. Solving Spatial Problems in MapReduce• R-Tree Index Construction• Aerial Image Processing• Midas Government Domain Application• Midas Government Domain Application

4. Experimental Results

5 Conclusion5. Conclusion

2 Florida International University

Page 3: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

IntroductionIntroductionSpatial databases mainly store:

Raster data (satellite/aerial digital images), andVector data (points, lines, polygons).

T diti l ti l ti d l t k i Traditional sequential computing models may take excessive time to process large and complex spatial repositories.

Emerging parallel computing models such as MapReduce Emerging parallel computing models, such as MapReduce, provide a potential for scaling data processing in spatial applications.

3 Florida International University

Page 4: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

Introduction (cont )Introduction (cont.)MapReduce is an emerging massively parallel computing model (Google) composed of two functions:

Map: takes a key/value pair, executes some computation, and emits a set of intermediate key/value pairs as output emits a set of intermediate key/value pairs as output. Reduce: merges its intermediate values, executes some computation on them, and emits the final output.

Here we present our experiences in applying the MapReduce model to:

B lk t t R T ( t ) Bulk-construct R-Trees (vector) Compute aerial image quality (raster)Extract United States Governmental Organizational Hierarchiesg

4 Florida International University

Page 5: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

2. TerraFly

Geospatial Applications Suite

Developed and Maintained byDeveloped and Maintained by

The High Performance Database Research Center (HPDRC)( )

Florida International University (FIU)

Page 6: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

FIU HPDRC ExpertiseFIU-HPDRC Expertise

Database aspectsDatabase aspects:Data visualization

Spatial databases

I di ib d h d bInternet-distributed heterogeneous databases

Database design methodologies

Information analysis

Page 7: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

TerraFlyTerraFly

Geospatial mapping solution

Web-Based

Customized to Industry needsy

Drill down to local information

Page 8: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

TerraFlyTerraFlyGIS Solutions Based on New Generation Technologygy

PlatformGIS-like Internet visualization

h d dOpen architecture, GIS-oriented API provider

40 TB database of aerial imagery and spatial data

NSF and NASA funded technologygy

ServiceProfessionally customized to domain requirements

Comprehensive and expert serviceComprehensive and expert service

Page 9: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

M i g S l tiMapping Solutions

Customized to industry needs Sample applications: The Hydrology application shows the mean water level over time over water bodies

Customized to industry needs. Sample applications:

The Real Estate application facilitates visua;ization of listings and allows complex queries via user-friendly i t f interface.

Page 10: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

H d l g A li tiHydrology ApplicationAverage of Surrounding Stations' Mean Daily Stage for Selected Water Body. Average of Surrounding Stations' Mean Daily Stage for Selected Water Body.

Address searchable

Selectable Lakes

Stage date Time diagram and data

Surrounding water body stations

Page 11: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

R l E t tReal Estate

Searchable for any parameter

Detailed information

y p

Real Estate Listings in the

Vicinity

Surrounding Realtor

information

Page 12: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

2 Solving Spatial Problems in 2. Solving Spatial Problems in MapReduce

• R-Tree Index Construction• Aerial Image ProcessingAerial Image Processing• Midas Government Domain Application

Page 13: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

M R d (MR) R T C t tiMapReduce (MR) R-Tree ConstructionR-Tree Bulk-Construction

Every object o in database D has two attributes:o.id - the object’s unique identifier.

o.P - the object’s location in some spatial domain.j p

The goal is to build an R-Tree index on D.

MapReduce Algorithm1. Database partitioning function computation (MR).2. A small R-Tree is created for each partition (MR).3. The small R-Trees are merged into the final R-Tree.

13 Florida International University

Page 14: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

Phase 1 Partitioning FunctionPhase 1 – Partitioning Function– Goal: compute f to assign objects of D into one of Rp f g j

possible partitions s.t.:– R (ideally) equally-sized partitions are generated

l bl(minimal variance is acceptable).– Objects close in the spatial domain are placed within

the same partitionthe same partition.– Proposed solution:

– Use Z-order space-filling curve to map spatial Use Z-order space filling curve to map spatial coordinate samples (3%) into an sorted sequence.

– Collect splitting points that partition the sequence in Rp g p p qranges.

14 Florida International University

Page 15: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

Phase 1 Partitioning FunctionPhase 1 – Partitioning Function

Map and Reduce inputs/outputs in computing partitioning function f.

Function Input: (Key, Value) Output: (Key, Value) Map (o id o P) (C U(o P))Map (o.id, o.P) (C, U(o.P))

Reduce (C, list(ui, i=1, .., L)) S´ Where:

• o is an spatial object in the database.o s a spat a object t e atabase.• C which is a constant that helps in sending Mappers' outputs to a single

Reducer.• U is a space-filling curve, e.g. Z-order value.• S' is an array containing R 1 splitting points• S is an array containing R-1 splitting points.

15 Florida International University

Page 16: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

Ph 2 R T C t ti i MRPhase 2 - R-Tree Construction in MR

MapReduce functions in constructing R-Trees.

F i I (K V l ) O (K V l )Function Input: (Key, Value) Output: (Key, Value)Map (o.id, o.P) (f(o.P), o)

Reduce (f(o.P), list(oi, i=1, .., A)) tree.root

Where:Where:• o is an spatial object in the database.• f is the partitioning function computed in Phase 1.• Tree root is the R Tree root node• Tree.root is the R-Tree root node.

16 Florida International University

Page 17: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

Phase 3 R Tree ConsolidationPhase 3 - R-Tree ConsolidationSequential process

17 Florida International University

Page 18: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

Aerial Image Processing

• R-Tree Index Construction• Aerial Image ProcessingAerial Image Processing• Midas Government Domain Application

Page 19: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

Image Processing in MapReduceImage Processing in MapReduceAerial Image Quality Computation

d b h f d l h h OQQ f l d Let d be an orthorectified aerial photography (DOQQ) file and t be a tile inside d, d.name is d’s file name and t.q is the quality information of tile t. The goal is to compute a quality bitmap for d.

MapReduce AlgorithmA customized InputFormatter partitions each DOQQ file d into A customized InputFormatter partitions each DOQQ file d into several splits containing multiple tiles.The Mappers compute the quality bitmap for each tile inside a

lit split. The Reducers merge all the bitmaps that belongs to a file d and write them to an output file.

19 Florida International University

Page 20: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

Image Processing in MapReduceImage Processing in MapReduce

Input and output of map and reduce functions

Function Input: (Key, Value) Output: (Key, Value) Map (d.name+t.id, t) (d.name, (t.id,t.q))

Reduce (d.name, list(t.id,t.q)) Quality-bitmap of d

Where:• d is a DOQQ file.• t is a tile in d.• t.q is the quality bitmap of t.

20 Florida International University

Page 21: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

Midas Government Domain Midas Government Domain Application

• R-Tree Index Construction• Aerial Image ProcessingAerial Image Processing• Midas Government Domain Application

Page 22: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

Linking USASpending.gov with OMB E kEarmarks

Organizational/agency hierarchy must be established in order to properly tt ib t di d k i tiattribute spending and earmark appropriation

Midas Earmark records do not contain explicit agency information but, web records do.

Midas Earmark records contain a 6 digit TAS -Treasury Account Symbol

First 2 digits = Agency

Last 4 digits = Account

22

Page 23: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

Linking USASpending.gov with OMBE kOMBEarmarks

Common attributes for linking:

Congressional DistrictsCongressional Districts

Treasury Account Symbols (TAS)

We were able to extract semantic agency information from OMB Earmarks we can f ili li kfacilitate linkage.

This allowed us to obtain finer grained results and use FPDS codes and hierarchical information found in NIST SP 800-87.

23

Page 24: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

Midas for USASpending.gov

Integrate Spending, Agency, Person, and Earmark (JAQL)

Scrub – Normalize Attribute Values (JAQL+Java)

Scrub – Normalize Attribute Values (JAQL+Java)

E t t JSON (J )

Hierarchy Extraction (systemT)

Export to JSON (JAQL)

Export to JSON (Java)Manual Export to Plain Text

24USASpending Hierarchical Documentation

Page 25: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

Example Queries:Example Queries:What percentage of Defense Department spending comes from Earmarks?What type of account does the Department of Agriculture spend the least from?Are there any vendors who do the majority of their work outside of their own congressional district?congressional district?What percentage of the Department of the Interior’s spending goes towards food and food services? Is there a congressman who sponsors bills that lead to a particular kind of spending? 

I.e. Congressman Smith sponsored 100 million in earmarks, of which 90% went to agencies who’s primary function is related to national defense.

25

Page 26: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

3. Experimental Results

Page 27: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

Experimental Results: SettingExperimental Results: SettingData Set Table 4. Spatial data sets used in experiments*.p p

Problem Data set Objects Data size

(GB) Description

R-Tree FLD 11.4 M 5 Points of properties in the state of Florida.

YPD 37 M 5 3 Yellow pages directory of points of businesses mostlyYPD 37 M 5.3 Yellow pages directory of points of businesses mostly in the United States but also in other countries.

Image Quality

Miami-Dade 482 files 52 Aerial imagery of Miami-Dade county, FL (3-inch

resolution) * Data sets supplied by the High Performance Database Research Center at Florida International University

EnvironmentThe cluster was provided by the Google and IBM Academic Cluster Computing Initiative.The cluster contains around 480 computers running Hadoop - open source MapReduce.p

27 Florida International University

Page 28: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

Experimental Results: R TreeExperimental Results: R-TreeR-Tree Construction Performance Metrics

50.00

60.00

25.00

30.00

20 00

30.00

40.00

Tim

e (m

in)

MR2

MR1

10 00

15.00

20.00

Tim

e (m

in)

MR2

MR1

0.00

10.00

20.00

4 8 16 32 640.00

5.00

10.00

2 4 8 16 32 64

T

(a) FLD data set (b) YPD data set

MapReduce job completion times for various number of reducers in phase-2 (MR2).

ReducersReducers

28 Florida International University

Page 29: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

Experimental Results: R TreeExperimental Results: R-TreeMapReduce R-Trees vs. Single Process (SP)

Objects per Reducer Consolidated R-Tree

Data set R Average Stdev Nodes HeightFLD 2 5,690,419 12,183 172,776 4

4 2,845,210 6,347 172,624 4

8 1,422,605 2,235 173,141 416 711,379 2,533 162,518 432 355 651 2 379 173 273 332 355,651 2,379 173,273 364 177,826 1,816 173,445 3

SP 11,382,185 0 172,681 4

YPD 4 9,257,188 22,137 568,854 4

8 4,628,594 9,413 568,716 4

16 2,314,297 7,634 568,232 4

32 1,157,149 6,043 567,550 464 578,574 2,982 566,199 4

SP 37,034,126 0 587,353 5

29 Florida International University

Page 30: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

Experimental Results: ImageryExperimental Results: ImageryTile Quality Computation

20

25

ReduceMap 3

3.5

4ReduceMap

10

15

Tim

e (m

in) Map

1.5

2

2.5

3

Tim

e (m

in)

p

0

5

4 8 16 32 64 128 256 5120

0.5

1

2 4 8 16

(a) Fixed data size, variable Reducers (b) Variable data size, fixed Reducers Fig. 9. MapReduce job completion time for tile quality computation

Reducers Size of data (GB)

30 Florida International University

Page 31: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

5. Conclusion

Page 32: Indexing Geospatial Data with MapReducecake.fiu.edu/...IG.Indexing_Geospatial_Data_with_Map_Reduce_downl… · Aerial imagery of Miami-Dade county, FL (3-inch resolution) * Data sets

ConclusionConclusionWe employed the MapReduce model to solve two spatial problems on the Google&IBM cluster:

(a) Bulk-construction of R-Trees and

(b) Aerial image quality computation( ) g q y p

And a document processing for linkage extraction project:(b) MIDAS Government domain application

MapReduce can dramatically improve task completion times. Our experiments show close to linear scalability.

Our experience in this work shows MapReduce has the Our experience in this work shows MapReduce has the potential to be applicable to more complex spatial problems.

32 Florida International University