Searching All The Web’s Spatial Data

StephenMcDonald@cga.harvard.edu January 21, 2015

Thanks

• CGA

• Ben Lewis

• Dave Strohschein

Layer Level Search Of Spatial Resources

A Spatial Resource

<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Harvard</name>
      <description>You Are Here</description>
      <Point>
        <coordinates>-71.1169,42.3774,0</coordinates>
      </Point>
    </Placemark>
  </Document>
</kml>

Not A Spatial Resource

Web Services

• Individual Layer Level Search

• OGC - GetCapabilities, WMS

• ESRI REST

Anchor Link Signatures

• Anchor Links To Spatial Resources

• ?request=GetCapabilities

• /ArcGIS/rest/service

• *.kml and *.kmz

• */shape/*.zip
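The signatures above can be checked with plain string tests on each anchor's href. The following is a minimal sketch in Python, not the code used for the talk: the function name `classify_link`, the category labels, and the choice to match case-insensitively are all assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import urlparse


def classify_link(href: str) -> Optional[str]:
    """Return a label if href matches one of the slide's signatures, else None."""
    lowered = href.lower()          # assumption: signatures are matched case-insensitively
    path = urlparse(lowered).path   # path only, so query strings do not hide a file extension

    if "request=getcapabilities" in lowered:
        return "ogc"
    if "/arcgis/rest/service" in lowered:
        return "esri_rest"
    if path.endswith((".kml", ".kmz")):
        return "kml"
    if "/shape/" in path and path.endswith(".zip"):
        return "shapefile"
    return None


if __name__ == "__main__":
    for url in (
        "http://basemap.nationalmap.gov/ArcGIS/rest/services/USGSTopo/MapServer",
        "http://example.com/data/harvard.kml",   # hypothetical URL
        "http://example.com/index.html",         # hypothetical URL
    ):
        print(url, "->", classify_link(url))
```

Substring tests like these are cheap enough to run on every anchor in a crawl; any link they flag still has to be fetched later to confirm it really is a spatial resource.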

Not JavaScript Code

<script>
L.esri.tiledMapLayer(
  "http://basemap.nationalmap.gov/ArcGIS/rest/services/USGSTopo/MapServer",
  {opacity: 0.50, zIndex: 2}).addTo(map);
</script>

Not HTML Tags

<body>
Please use my base layer:
<blink>
http://basemap.nationalmap.gov/ArcGIS/rest/services/USGSTopo/MapServer
</blink>
</body>
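The two "Not" slides say that only anchor links count: URLs inside script code or plain page text are ignored. A minimal sketch of that rule, assuming Python's standard `html.parser` as a stand-in for the jsoup parsing mentioned later in the talk; the class name, function name, and sample page are hypothetical.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags only; script bodies and text are ignored."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def anchor_hrefs(html):
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Hypothetical page: one URL in a script, one in text, one in an anchor.
PAGE = """
<html><body>
<script>
L.esri.tiledMapLayer("http://basemap.nationalmap.gov/ArcGIS/rest/services/USGSTopo/MapServer");
</script>
<blink>http://basemap.nationalmap.gov/ArcGIS/rest/services/USGSTopo/MapServer</blink>
<a href="http://example.com/layers.kml">Layers</a>
</body></html>
"""

if __name__ == "__main__":
    print(anchor_hrefs(PAGE))
```

Only the anchor's href is returned; the map service URL in the script and in the blink text is skipped, which is the trade-off the slides describe.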

Google Advanced Search

• What Will A Crawl Discover?

• allinanchor:, allinurl:, filetype:

• Follow Terms Of Service

Limited Crawl

• Crawl A Couple Sites

• JCrawler: Provide Two Functions

• Follow This Link?

• Process Page

• Run On Localhost, Obey robots.txt
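The "two functions" design can be sketched generically: the crawler owns the queue and calls back into user code to decide which links to follow and what to do with each page. This is not the JCrawler API, which the slides do not show; every name below is hypothetical, pages come from an in-memory dictionary instead of the network, and robots.txt handling is left out.

```python
from collections import deque


def crawl(start, fetch_links, follow_link, process_page, max_pages=100):
    """Breadth-first crawl driven by two user callbacks.

    fetch_links(url)         -> list of links on the page (stand-in for HTTP + parsing)
    follow_link(link)        -> True if the crawler should visit the link
    process_page(url, links) -> user code run once per visited page
    """
    queue = deque([start])
    seen = {start}
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        links = fetch_links(url)
        process_page(url, links)
        visited += 1
        for link in links:
            if link not in seen and follow_link(link):
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical two-page site used in place of real HTTP requests.
SITE = {
    "http://a.example/": ["http://a.example/maps", "http://other.example/"],
    "http://a.example/maps": ["http://a.example/data/roads.kml", "http://a.example/"],
}

found = []


def follow_link(link):
    # Stay on one site and do not fetch the spatial files themselves.
    return link.startswith("http://a.example/") and not link.endswith(".kml")


def process_page(url, links):
    found.extend(link for link in links if link.endswith(".kml"))


pages_visited = crawl("http://a.example/", lambda u: SITE.get(u, []), follow_link, process_page)
```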

Find ALL Spatial Resources

• Not With A Cluster Running Nutch

• Too Hard!

CommonCrawl.org

• Monthly Crawl, 2-3 Billion Web Pages

• 55,000 WARC Files On Amazon East

• Hadoop Sample Code

• Add jsoup And Several Hundred Lines Of Code

CommonCrawl Blog

Easier Hadoop

One Complete “Crawl”

• 25 Slaves, 3 Full Days

• $1400

• R3.XLarge - Lots Of Memory

Sample Crawl Output

http://www.ga.gov.au/gis/services/earth_science/GA_Surface_Geology_of_Australia/MapServer/WMSServer?request=GetCapabilities&service=WMS 306
http://maps.ngdc.noaa.gov/soap/web_mercator/dem_extents/MapServer/WMSServer?request=GetCapabilities%26service=WMS 169
http://www.ga.gov.au/gis/services/earth_science/Geoscience_Australia_Seismic_Surveys/MapServer/WMSServer?request=GetCapabilities&service=WMS 144
http://gis.ngdc.noaa.gov/arcgis/services/dem_hillshades/ImageServer/WMSServer?request=GetCapabilities%26service=WMS 132
http://www.ga.gov.au/gis/services/topography/Australian_Topography/MapServer/WMSServer?request=GetCapabilities&service=WMS 108
http://www.ga.gov.au/gis/services/earth_science/Crustal_Elements_of_Australia/MapServer/WMSServer?request=GetCapabilities&service=WMS 108

Sample Crawl Output

http://www.ga.gov.au/data-pubs/web-services/replacement-services-for-the-national-geoscience-datasets-wms|||http://www.ga.gov.au/gis/services/earth_science/GA_Surface_Geology_of_Australia/MapServer/WMSServer?request=GetCapabilities&service=WMS -306
http://www.ga.gov.au/data-pubs/web-services/replacement-services-for-the-national-geoscience-datasets-wms|||http://www.ga.gov.au/gis/services/earth_science/Geoscience_Australia_Seismic_Surveys/MapServer/WMSServer?request=GetCapabilities&service=WMS -144
http://www.ga.gov.au/data-pubs/web-services/replacement-services-for-the-national-geoscience-datasets-wms|||http://www.ga.gov.au/gis/services/earth_science/Crustal_Elements_of_Australia/MapServer/WMSServer?request=GetCapabilities&service=WMS -108
http://www.ga.gov.au/data-pubs/web-services/replacement-services-for-the-national-geoscience-datasets-wms|||http://www.ga.gov.au/gis/services/topography/Australian_Topography/MapServer/WMSServer?request=GetCapabilities&service=WMS -108
http://www.ga.gov.au/data-pubs/web-services/replacement-services-for-the-national-geoscience-datasets-wms|||http://www.ga.gov.au/gis/services/earth_science/Geoscience_Australia_Airborne_Geophysics/MapServer/WMSServer?request=GetCapabilities&service=WMS -90

Key / Value Pairs

• String / Integer Pairs

• Value > 0: URL To Resource / Frequency Count

• Value < 0: Page Found On + “|||” + URL To Resource

• Use Unix Commands To Split, Sort File
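The split-and-sort step can also be sketched in a few lines of Python. This is an illustrative alternative to the Unix commands, not the author's tooling. It assumes each line is a key and an integer separated by whitespace, and it follows the sample output, where the page URL comes before "|||" and the resource URL after it; the function name is hypothetical.

```python
from collections import defaultdict


def split_crawl_output(lines):
    """Separate frequency records (value > 0) from provenance records (value < 0)."""
    counts = {}                     # resource URL -> frequency count
    found_on = defaultdict(list)    # resource URL -> pages that link to it
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, value = line.rsplit(None, 1)   # split on the last run of whitespace
        number = int(value)
        if number > 0:
            counts[key] = number
        elif number < 0:
            page, _, resource = key.partition("|||")
            found_on[resource].append(page)
    return counts, dict(found_on)


# Two records copied from the sample output slides.
SAMPLE = [
    "http://www.ga.gov.au/gis/services/topography/Australian_Topography/MapServer/WMSServer?request=GetCapabilities&service=WMS 108",
    "http://www.ga.gov.au/data-pubs/web-services/replacement-services-for-the-national-geoscience-datasets-wms|||http://www.ga.gov.au/gis/services/topography/Australian_Topography/MapServer/WMSServer?request=GetCapabilities&service=WMS -108",
]

counts, found_on = split_crawl_output(SAMPLE)

# Most frequently linked resources first, like `sort` on the count column.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
```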

Harvester

• Input: List Of Spatial Resources

• Processing:

• Obtain Metadata On Each Layer

• Periodically Re-visit

• Output: Solr Records, Report
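The "obtain metadata on each layer" step might look like the following for a WMS 1.3.0 GetCapabilities response. This is a sketch under stated assumptions: the capabilities document is a small hypothetical sample embedded in the code, not a real server's response, the function name and record keys are invented for illustration, and a real harvester would also need to fetch the URL, handle WMS 1.1.1 and ESRI REST responses, and cope with errors.

```python
import xml.etree.ElementTree as ET

WMS_NS = {"wms": "http://www.opengis.net/wms"}  # namespace used by WMS 1.3.0

# Hypothetical, heavily trimmed capabilities document.
SAMPLE_CAPABILITIES = """
<WMS_Capabilities xmlns="http://www.opengis.net/wms" version="1.3.0">
  <Capability>
    <Layer>
      <Title>Example Server</Title>
      <Layer>
        <Name>roads</Name>
        <Title>Roads</Title>
        <EX_GeographicBoundingBox>
          <westBoundLongitude>-74.093</westBoundLongitude>
          <eastBoundLongitude>-69.347</eastBoundLongitude>
          <southBoundLatitude>41.042</southBoundLatitude>
          <northBoundLatitude>44.558</northBoundLatitude>
        </EX_GeographicBoundingBox>
      </Layer>
    </Layer>
  </Capability>
</WMS_Capabilities>
""".strip()


def layer_records(capabilities_xml):
    """Return one dict per named layer, with its title and bounding box if present."""
    root = ET.fromstring(capabilities_xml)
    records = []
    for layer in root.iter("{http://www.opengis.net/wms}Layer"):
        name = layer.findtext("wms:Name", namespaces=WMS_NS)
        if name is None:
            continue  # container layers without a Name cannot be requested
        record = {
            "name": name,
            "title": layer.findtext("wms:Title", default="", namespaces=WMS_NS),
        }
        box = layer.find("wms:EX_GeographicBoundingBox", WMS_NS)
        if box is not None:
            record["min_x"] = float(box.findtext("wms:westBoundLongitude", namespaces=WMS_NS))
            record["max_x"] = float(box.findtext("wms:eastBoundLongitude", namespaces=WMS_NS))
            record["min_y"] = float(box.findtext("wms:southBoundLatitude", namespaces=WMS_NS))
            record["max_y"] = float(box.findtext("wms:northBoundLatitude", namespaces=WMS_NS))
        records.append(record)
    return records


if __name__ == "__main__":
    print(layer_records(SAMPLE_CAPABILITIES))
```

One record per layer, rather than one per service, is what makes the layer-level search in the rest of the talk possible.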

Layer Level Search Of Spatial Resources

Search

• People Expect Good Results

• Always Too Many Results For Human Review

• Ranking / Scoring Results Is Key

Some Layers Not Relevant

Layer Within Map

Similar Center

Similar Area

Spatial Solr

• Old Style: Floats

• New Style: Rectangle, Polygons

Solr Schema

• Define Fields To Support Search

• Pre-compute Intermediate Result

• Data Type = Search Options

• Or Schema-less

Solr Schema

MinX, MaxX, CenterX

MinY, MaxY, CenterY

HalfWidth

HalfHeight

Area

tdouble Field Types
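The derived fields can be pre-computed from a layer's bounding box at index time, so queries never have to recompute them. A minimal sketch; the function name is hypothetical, and treating Area as width times height in square degrees is an assumption, though it is consistent with the numbers in the old-style query from this talk, which are reused in the example.

```python
def bbox_fields(min_x, min_y, max_x, max_y):
    """Derive the schema's pre-computed fields from a bounding box in degrees."""
    width = max_x - min_x
    height = max_y - min_y
    return {
        "MinX": min_x,
        "MaxX": max_x,
        "CenterX": (min_x + max_x) / 2.0,
        "MinY": min_y,
        "MaxY": max_y,
        "CenterY": (min_y + max_y) / 2.0,
        "HalfWidth": width / 2.0,
        "HalfHeight": height / 2.0,
        "Area": width * height,  # assumption: square degrees, no projection
    }


# Extent taken from the old-style Solr query in this talk.
fields = bbox_fields(-71.143160023987, 42.385170824958,
                     -71.096038976013, 42.428266055761)

if __name__ == "__main__":
    for name, value in fields.items():
        print(name, value)
```

Storing the center and half-extents alongside the corners is what lets the old-style query express overlap and similarity with simple arithmetic functions.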

Old Style Solr Query

http://geodata.tufts.edu/solr/select?q=_val_:%22product(10.0,map(sum(map(MinX,-71.143160023987,-71.096038976013,1,0),map(MaxX,-71.143160023987,-71.096038976013,1,0),map(MinY,42.385170824958,42.428266055761,1,0),map(MaxY,42.385170824958,42.428266055761,1,0)),4,4,1,0)))%22_val_:%22product(15.0,recip(sum(abs(sub(Area,0.002030692438118123)),.01),1,1000,1000))%22_val_:%22product(3.0,recip(abs(sub(product(sum(MaxX,MinX),.5),-71.11959949999999)),1,1000,1000))%22_val_:%22product(3.0,recip(abs(sub(product(sum(MaxY,MinY),.5),42.4067184403595)),1,1000,1000))%22+AND+%28LayerDisplayName:water^3+OR+ThemeKeywords:water^2+OR+PlaceKeywords:water^2%29+AND+%28ThemeKeywords:geoscientificinformation^4%29&&fq={!frange+l%3D1+u%3D10}product(2.0,map(sum(map(sub(abs(sub(-71.11959949999999,CenterX)),sum(0.023560523986994042,HalfWidth)),0,400000,1,0),map(sub(abs(sub(42.4067184403595,CenterY)),sum(0.021547615401498632,HalfHeight)),0,400000,1,0)),0,0,1,0))&wt=json&fl=Name,CollectionId,Institution,Access,DataType,Availability,LayerDisplayName,Publisher,GeoReferenced,Originator,Location,MinX,MaxX,MinY,MaxY,ContentDate,LayerId,score,WorkspaceName,SrsProjectionCode&rows=27&start=0&sort=score+desc&fq=ContentDate:[1950-01-01T01:01:01Z+TO+2012-01-01T01:01:01Z]&fq=DataType%3APoint&fq=Institution%3ATufts+OR+Institution%3AHarvard&fq=Institution:Tufts+OR+Access:Public&json.wrf=jQuery16408675794449108286_1331937717696&_=1331941365233
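The scoring terms in this query are easier to read when re-expressed outside Solr. The sketch below is one reading of the URL, not the author's code. It mirrors Solr's `map(x,min,max,target,default)` and `recip(x,m,a,b) = a/(m*x+b)` function queries in Python and reproduces the three kinds of boost that appear to correspond to the "Layer Within Map", "Similar Area", and "Similar Center" slides. The Python function names are hypothetical; the weights 10.0, 15.0, and 3.0 are copied from the URL.

```python
def solr_map(x, lo, hi, target, default):
    """Mirror of Solr map(): target when lo <= x <= hi, otherwise default."""
    return target if lo <= x <= hi else default


def solr_recip(x, m, a, b):
    """Mirror of Solr recip(): a / (m*x + b)."""
    return a / (m * x + b)


def within_boost(layer, view):
    """10.0 when all four layer edges fall inside the view's extent, else 0."""
    edges_inside = (
        solr_map(layer["MinX"], view["MinX"], view["MaxX"], 1, 0)
        + solr_map(layer["MaxX"], view["MinX"], view["MaxX"], 1, 0)
        + solr_map(layer["MinY"], view["MinY"], view["MaxY"], 1, 0)
        + solr_map(layer["MaxY"], view["MinY"], view["MaxY"], 1, 0)
    )
    return 10.0 * solr_map(edges_inside, 4, 4, 1, 0)


def area_boost(layer_area, view_area):
    """Largest when the layer's area is closest to the view's area."""
    return 15.0 * solr_recip(abs(layer_area - view_area) + 0.01, 1, 1000, 1000)


def center_boost(layer_center, view_center):
    """Applied once per axis; largest when the centers coincide."""
    return 3.0 * solr_recip(abs(layer_center - view_center), 1, 1000, 1000)


# View extent from the query; the two layers are hypothetical.
VIEW = {"MinX": -71.143160023987, "MaxX": -71.096038976013,
        "MinY": 42.385170824958, "MaxY": 42.428266055761}
INSIDE = {"MinX": -71.13, "MaxX": -71.10, "MinY": 42.39, "MaxY": 42.42}
SPILLS_OVER = {"MinX": -71.20, "MaxX": -71.10, "MinY": 42.39, "MaxY": 42.42}
```

Each term is added to the keyword score, so a layer that fits the current map view and roughly matches its size and center ranks above one that merely mentions the search term.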

New Style Spatial

• A Lat-Lon rectangle: minX minY maxX maxY

• <field name="geo">-74.093 41.042 -69.347 44.558</field>

• Units: Degrees

• Distance Calc: Haversine or Euclidean, etc.

Spatial Functions

• fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"

• fq=geo:"IsWithin(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))) distErrPct=0"

• HeatMaps: Coming Soon
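A spatial filter like the Intersects example above has to be URL-encoded before it goes into a request. A small sketch using Python's standard library; the field name `geo` and the rectangle syntax come from the slides, while the helper name and the match-all `q=*:*` are assumptions made for the example.

```python
from urllib.parse import urlencode


def intersects_filter(field, min_x, min_y, max_x, max_y):
    """Build an fq value in the slide's rectangle syntax: minX minY maxX maxY."""
    return f'{field}:"Intersects({min_x} {min_y} {max_x} {max_y})"'


params = [
    ("q", "*:*"),  # assumption: match everything, let the filter do the work
    ("fq", intersects_filter("geo", -74.093, 41.042, -69.347, 44.558)),
    ("wt", "json"),
]
query_string = urlencode(params)

if __name__ == "__main__":
    print(query_string)
```

Compared with the old-style query, the whole bounding box test collapses into a single filter clause on one field.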

Also Search By

Date

Keywords

DataType

Institution

Solr Filter Clause

Future Is Browser Centric

• Client-Side Rendering

• Canvas, GPU, Actual Data

• Client-Side Analysis

• GPU, BYOD

• Apps With PhoneGap

Web Mapping Terms

• Map Servers

• ESRI REST, OGC / GetCapabilities

• Convert Spatial Data To Map Tiles

Separating Axis

Diff CenterXs > Sum Half Widths: Rectangles Do Not Overlap

[Figure: two rectangles, labeled Half Width and Center X]
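The separating axis test for two axis-aligned rectangles can be written directly from the pre-computed center and half-extent fields. A minimal sketch; the function name and dictionary layout are assumptions, and rectangles that only touch along an edge are treated as overlapping.

```python
def boxes_overlap(a, b):
    """Separating axis test for axis-aligned rectangles.

    The rectangles are disjoint if, on either axis, the distance between
    their centers exceeds the sum of their half-extents.
    """
    separated_x = abs(a["CenterX"] - b["CenterX"]) > a["HalfWidth"] + b["HalfWidth"]
    separated_y = abs(a["CenterY"] - b["CenterY"]) > a["HalfHeight"] + b["HalfHeight"]
    return not (separated_x or separated_y)


def box(center_x, center_y, half_width=1.0, half_height=1.0):
    """Hypothetical helper that builds a rectangle record for the examples."""
    return {"CenterX": center_x, "CenterY": center_y,
            "HalfWidth": half_width, "HalfHeight": half_height}


if __name__ == "__main__":
    origin = box(0.0, 0.0)
    print(boxes_overlap(origin, box(1.5, 0.0)))  # centers 1.5 apart, half widths sum to 2
    print(boxes_overlap(origin, box(3.0, 0.0)))  # centers 3 apart, half widths sum to 2
```

Because the test needs only subtraction, absolute value, and comparison, it can be expressed with Solr function queries over the CenterX, CenterY, HalfWidth, and HalfHeight fields.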