Searching All The Web’s Spatial Data

StephenMcDonald@cga.harvard.edu January 21, 2015

Thanks

• CGA

• Ben Lewis

• Dave Strohschein

Layer Level Search Of Spatial Resources

A Spatial Resource

<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Harvard</name>
      <description>You Are Here</description>
      <Point>
        <coordinates>-71.1169,42.3774,0</coordinates>
      </Point>
    </Placemark>
  </Document>
</kml>

Not A Spatial Resource

Web Services

• Individual Layer Level Search

• OGC: GetCapabilities, WMS

• ESRI REST

Anchor Link Signatures

• Anchor Links To Spatial Resources

• ?request=GetCapabilities

• /ArcGIS/rest/services

• *.kml and *.kmz

• */shape/*.zip
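The signatures above can be checked with a few regular expressions. This is a minimal sketch, not the code used in the talk: the `SIGNATURES` table, the label names, and `classify_link` are hypothetical, and the ESRI pattern deliberately matches both `/rest/service` and `/rest/services`.

```python
import re
from urllib.parse import urlsplit

# Hypothetical signature table; each pattern mirrors one bullet above.
# The second tuple element says which part of the link the pattern is tested on.
SIGNATURES = [
    ("ogc", "url", re.compile(r"[?&]request=GetCapabilities", re.IGNORECASE)),
    ("esri_rest", "path", re.compile(r"/arcgis/rest/service", re.IGNORECASE)),
    ("kml", "path", re.compile(r"\.km[lz]$", re.IGNORECASE)),
    ("shapefile", "path", re.compile(r"/shape/.*\.zip$", re.IGNORECASE)),
]


def classify_link(href):
    """Return the label of the first signature the link matches, or None."""
    path = urlsplit(href).path
    for label, part, pattern in SIGNATURES:
        target = href if part == "url" else path
        if pattern.search(target):
            return label
    return None
```

Only the `href` of an anchor link would be passed to `classify_link`; URLs that merely appear in script or body text are not candidates.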

Not JavaScript Code

<script>
L.esri.tiledMapLayer(
  "http://basemap.nationalmap.gov/ArcGIS/rest/services/USGSTopo/MapServer",
  {opacity: 0.50, zIndex: 2}).addTo(map);
</script>

Not HTML Tags

<body>
  Please use my base layer:
  <blink>
    http://basemap.nationalmap.gov/ArcGIS/rest/services/USGSTopo/MapServer
  </blink>
</body>

Google Advanced Search

• What Will A Crawl Discover?

• allinanchor:, allinurl:, filetype:

• Follow Terms Of Service

Limited Crawl

• Crawl A Couple Sites

• JCrawler: Provide Two Functions

• Follow This Link?

• Process Page

• Run On Localhost, Obey robots.txt
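The two-function design can be sketched in Python. JCrawler's actual API is not shown in the slides, so everything here is an assumption: `crawl`, `should_follow`, `process_page`, and the injectable `fetch` are hypothetical names that only illustrate the shape of "follow this link?" and "process page", with robots.txt checked through the standard library's `RobotFileParser`.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class _AnchorParser(HTMLParser):
    """Collects href values from <a> tags only."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.hrefs.append(value)


def crawl(seeds, fetch, should_follow, process_page, robots, max_pages=100):
    """Breadth-first crawl driven by two caller-supplied functions.

    fetch(url) returns the page HTML; it is passed in so the sketch
    can run against canned pages instead of the network.
    """
    queue, seen, visited = deque(seeds), set(seeds), 0
    while queue and visited < max_pages:
        url = queue.popleft()
        if not robots.can_fetch("*", url):
            continue  # obey robots.txt
        html = fetch(url)
        visited += 1
        process_page(url, html)
        parser = _AnchorParser()
        parser.feed(html)
        for href in parser.hrefs:
            link = urljoin(url, href)
            if link not in seen and should_follow(link):
                seen.add(link)
                queue.append(link)
    return visited
```

Restricting `should_follow` to one or two hosts keeps the crawl limited to a couple of sites.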

Find ALL Spatial Resources

• Not With A Cluster Running Nutch

• Too Hard!

CommonCrawl.org

• Monthly Crawl, 2-3 Billion Web Pages

• 55,000 WARC Files On Amazon East

• Hadoop Sample Code

• Add jsoup And Several Hundred Lines Of Code
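The per-page work added on top of the Hadoop sample code can be pictured as a map step and a reduce step. The talk used Java with jsoup; the sketch below is a Python stand-in using the standard library's `html.parser`, and `map_page`, `reduce_counts`, and the `SIGNATURES` tuple are hypothetical names. Reading WARC records is left out.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin

# Lower-cased substrings based on the anchor-link signatures in the talk.
SIGNATURES = ("request=getcapabilities", "/arcgis/rest/service")


class AnchorCollector(HTMLParser):
    """Keeps only href values of <a> tags; script and body text are ignored."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def map_page(page_url, html):
    """Map step: emit (resource_url, 1) for every matching anchor on one page."""
    collector = AnchorCollector()
    collector.feed(html)
    for href in collector.hrefs:
        link = urljoin(page_url, href)
        if any(sig in link.lower() for sig in SIGNATURES):
            yield link, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each resource URL."""
    totals = Counter()
    for url, count in pairs:
        totals[url] += count
    return totals
```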

CommonCrawl Blog

Easier Hadoop

One Complete “Crawl”

• 25 Slaves, 3 Full Days

• $1400

• R3.XLarge - Lots Of Memory

Sample Crawl Output

http://www.ga.gov.au/gis/services/earth_science/GA_Surface_Geology_of_Australia/MapServer/WMSServer?request=GetCapabilities&service=WMS 306
http://maps.ngdc.noaa.gov/soap/web_mercator/dem_extents/MapServer/WMSServer?request=GetCapabilities%26service=WMS 169
http://www.ga.gov.au/gis/services/earth_science/Geoscience_Australia_Seismic_Surveys/MapServer/WMSServer?request=GetCapabilities&service=WMS 144
http://gis.ngdc.noaa.gov/arcgis/services/dem_hillshades/ImageServer/WMSServer?request=GetCapabilities%26service=WMS 132
http://www.ga.gov.au/gis/services/topography/Australian_Topography/MapServer/WMSServer?request=GetCapabilities&service=WMS 108
http://www.ga.gov.au/gis/services/earth_science/Crustal_Elements_of_Australia/MapServer/WMSServer?request=GetCapabilities&service=WMS 108

Sample Crawl Output

http://www.ga.gov.au/data-pubs/web-services/replacement-services-for-the-national-geoscience-datasets-wms|||http://www.ga.gov.au/gis/services/earth_science/GA_Surface_Geology_of_Australia/MapServer/WMSServer?request=GetCapabilities&service=WMS -306
http://www.ga.gov.au/data-pubs/web-services/replacement-services-for-the-national-geoscience-datasets-wms|||http://www.ga.gov.au/gis/services/earth_science/Geoscience_Australia_Seismic_Surveys/MapServer/WMSServer?request=GetCapabilities&service=WMS -144
http://www.ga.gov.au/data-pubs/web-services/replacement-services-for-the-national-geoscience-datasets-wms|||http://www.ga.gov.au/gis/services/earth_science/Crustal_Elements_of_Australia/MapServer/WMSServer?request=GetCapabilities&service=WMS -108
http://www.ga.gov.au/data-pubs/web-services/replacement-services-for-the-national-geoscience-datasets-wms|||http://www.ga.gov.au/gis/services/topography/Australian_Topography/MapServer/WMSServer?request=GetCapabilities&service=WMS -108
http://www.ga.gov.au/data-pubs/web-services/replacement-services-for-the-national-geoscience-datasets-wms|||http://www.ga.gov.au/gis/services/earth_science/Geoscience_Australia_Airborne_Geophysics/MapServer/WMSServer?request=GetCapabilities&service=WMS -90

Key / Value Pairs

• String / Integer Pairs

• Value > 0

• URL To Resource / Frequency Count

• Value < 0

• Page Found On + “|||” + URL To Resource

• Use Unix Commands To Split, Sort File
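As an alternative to Unix commands, the same split and sort can be sketched in Python. The function name `split_crawl_output` is hypothetical, and the sketch assumes each line is a key and an integer separated by whitespace. Following the sample output, the page URL is taken to precede the `|||` separator in records with a negative value.

```python
def split_crawl_output(lines):
    """Separate frequency records (value > 0) from provenance records (value < 0)."""
    counts = {}    # resource URL -> number of anchor links seen
    found_on = {}  # resource URL -> pages that link to it
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, raw_value = line.rsplit(None, 1)  # split off the trailing integer
        value = int(raw_value)
        if value > 0:
            counts[key] = value
        elif value < 0:
            page, _, resource = key.partition("|||")
            found_on.setdefault(resource, []).append(page)
    return counts, found_on


def most_linked(counts):
    """Resource URLs ordered by descending frequency count."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)
```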

Harvester

• Input: List Of Spatial Resources

• Processing:

• Obtain Metadata On Each Layer

• Periodically Re-visit

• Output: Solr Records, Report
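The "obtain metadata on each layer" step might look like the following for a WMS GetCapabilities response. This is a minimal sketch rather than the harvester from the talk: `layers_from_capabilities` is a hypothetical name, only `Name`, `Title`, and `Abstract` are read, and XML namespaces are stripped so the same code can read documents with or without a default namespace.

```python
import xml.etree.ElementTree as ET

WANTED = ("Name", "Title", "Abstract")


def local(tag):
    """Drop a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def layers_from_capabilities(xml_text):
    """Return one dict per named <Layer> in a GetCapabilities document."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for element in root.iter():
        if local(element.tag) != "Layer":
            continue
        fields = {
            local(child.tag): (child.text or "").strip()
            for child in element
            if local(child.tag) in WANTED
        }
        if "Name" in fields:  # layers without a Name are grouping containers
            records.append(fields)
    return records
```

Each returned dict could then be turned into one Solr record, with the bounding box added from the layer's extent elements.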

Layer Level Search Of Spatial Resources

Search

• People Expect Good Results

• Always Too Many Results For Human Review

• Ranking / Scoring Results Is Key

Some Layers Not Relevant

Layer Within Map

Similar Center

Similar Area

Spatial Solr

• Old Style: Floats

• New Style: Rectangle, Polygons

Solr Schema

• Define Fields To Support Search

• Pre-compute Intermediate Result

• Data Type = Search Options

• Or Schema-less

Solr Schema

MinX, MaxX, CenterX

MinY, MaxY, CenterY

HalfWidth

HalfHeight

Area

tdouble Field Types
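The pre-computed fields listed above follow directly from a layer's bounding box. A minimal sketch, assuming the box is given in degrees and does not cross the antimeridian; `derived_fields` is a hypothetical helper name, and Area is in square degrees.

```python
def derived_fields(min_x, min_y, max_x, max_y):
    """Compute the schema fields for one layer from its bounding box (degrees)."""
    width = max_x - min_x
    height = max_y - min_y
    return {
        "MinX": min_x,
        "MaxX": max_x,
        "CenterX": (min_x + max_x) / 2.0,
        "MinY": min_y,
        "MaxY": max_y,
        "CenterY": (min_y + max_y) / 2.0,
        "HalfWidth": width / 2.0,
        "HalfHeight": height / 2.0,
        "Area": width * height,  # square degrees, not a true surface area
    }
```

Storing these values at index time means a query only has to compare numbers instead of recomputing them for every layer.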

Old Style Solr Query

http://geodata.tufts.edu/solr/select?q=_val_:%22product(10.0,map(sum(map(MinX,-71.143160023987,-71.096038976013,1,0),map(MaxX,-71.143160023987,-71.096038976013,1,0),map(MinY,42.385170824958,42.428266055761,1,0),map(MaxY,42.385170824958,42.428266055761,1,0)),4,4,1,0)))%22_val_:%22product(15.0,recip(sum(abs(sub(Area,0.002030692438118123)),.01),1,1000,1000))%22_val_:%22product(3.0,recip(abs(sub(product(sum(MaxX,MinX),.5),-71.11959949999999)),1,1000,1000))%22_val_:%22product(3.0,recip(abs(sub(product(sum(MaxY,MinY),.5),42.4067184403595)),1,1000,1000))%22+AND+%28LayerDisplayName:water^3+OR+ThemeKeywords:water^2+OR+PlaceKeywords:water^2%29+AND+%28ThemeKeywords:geoscientificinformation^4%29&&fq={!frange+l%3D1+u%3D10}product(2.0,map(sum(map(sub(abs(sub(-71.11959949999999,CenterX)),sum(0.023560523986994042,HalfWidth)),0,400000,1,0),map(sub(abs(sub(42.4067184403595,CenterY)),sum(0.021547615401498632,HalfHeight)),0,400000,1,0)),0,0,1,0))&wt=json&fl=Name,CollectionId,Institution,Access,DataType,Availability,LayerDisplayName,Publisher,GeoReferenced,Originator,Location,MinX,MaxX,MinY,MaxY,ContentDate,LayerId,score,WorkspaceName,SrsProjectionCode&rows=27&start=0&sort=score+desc&fq=ContentDate:[1950-01-01T01:01:01Z+TO+2012-01-01T01:01:01Z]&fq=DataType%3APoint&fq=Institution%3ATufts+OR+Institution%3AHarvard&fq=Institution:Tufts+OR+Access:Public&json.wrf=jQuery16408675794449108286_1331937717696&_=1331941365233
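The four `_val_` clauses in this query are easier to read when re-expressed outside Solr. The sketch below mirrors Solr's `map(x,min,max,target,default)` and `recip(x,m,a,b) = a/(m*x+b)` functions and evaluates each boost term for one layer against the current map view. `boost_terms` and its key names are hypothetical, and how Solr combines the terms with the keyword clauses into a final score is not reproduced here.

```python
def solr_map(x, lo, hi, target, default):
    """Solr map(x,min,max,target,default): target when lo <= x <= hi."""
    return target if lo <= x <= hi else default


def solr_recip(x, m, a, b):
    """Solr recip(x,m,a,b) = a / (m*x + b)."""
    return a / (m * x + b)


def _area(box):
    return (box["MaxX"] - box["MinX"]) * (box["MaxY"] - box["MinY"])


def _center(box, lo, hi):
    return (box[hi] + box[lo]) * 0.5


def boost_terms(layer, view):
    """Evaluate the four spatial boosts for one layer against the map view."""
    inside = (
        solr_map(layer["MinX"], view["MinX"], view["MaxX"], 1, 0)
        + solr_map(layer["MaxX"], view["MinX"], view["MaxX"], 1, 0)
        + solr_map(layer["MinY"], view["MinY"], view["MaxY"], 1, 0)
        + solr_map(layer["MaxY"], view["MinY"], view["MaxY"], 1, 0)
    )
    dx = abs(_center(layer, "MinX", "MaxX") - _center(view, "MinX", "MaxX"))
    dy = abs(_center(layer, "MinY", "MaxY") - _center(view, "MinY", "MaxY"))
    return {
        # 10 only when all four layer bounds fall inside the view
        "within_map": 10.0 * solr_map(inside, 4, 4, 1, 0),
        # approaches 15 as the layer area approaches the view area
        "similar_area": 15.0 * solr_recip(abs(_area(layer) - _area(view)) + 0.01, 1, 1000, 1000),
        # 3 when the centers coincide on that axis
        "similar_center_x": 3.0 * solr_recip(dx, 1, 1000, 1000),
        "similar_center_y": 3.0 * solr_recip(dy, 1, 1000, 1000),
    }
```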

New Style Spatial

• A Lat-Lon rectangle: minX minY maxX maxY

• <field name="geo">-74.093 41.042 -69.347 44.558</field>

• Units: Degrees

• Distance Calc: Haversine or Euclidean, etc.
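The difference between the two distance calculations can be illustrated directly. This sketch is independent of Solr's implementation: `haversine_km` uses the standard haversine formula with an assumed mean Earth radius of 6371 km, and `euclidean_degrees` treats longitude and latitude as plane coordinates.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def euclidean_degrees(lon1, lat1, lon2, lat2):
    """Straight-line distance in degrees, ignoring the Earth's curvature."""
    return math.hypot(lon2 - lon1, lat2 - lat1)
```

Euclidean distance is cheaper and adequate for ranking nearby layers; haversine matters when distances must be meaningful in real-world units.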

Spatial Functions

• fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"

• fq=geo:"IsWithin(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))) distErrPct=0"

• HeatMaps: Coming Soon
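A client would normally build these filter strings in code and let a URL library handle the escaping. A minimal sketch: the helper names are hypothetical, and the base URL passed to `select_url` is whatever Solr endpoint is in use (the `localhost` address in the usage note is an assumption).

```python
from urllib.parse import urlencode


def intersects_filter(field, min_x, min_y, max_x, max_y):
    """Build a new-style spatial filter: minX minY maxX maxY, in degrees."""
    return f'{field}:"Intersects({min_x} {min_y} {max_x} {max_y})"'


def select_url(base_url, query, filters, rows=10):
    """Assemble a select URL with one fq parameter per filter clause."""
    params = [("q", query), ("wt", "json"), ("rows", str(rows))]
    params += [("fq", fq) for fq in filters]
    return base_url + "?" + urlencode(params)
```

For example, `select_url("http://localhost:8983/solr/layers/select", "water", [intersects_filter("geo", -74.093, 41.042, -69.347, 44.558)])` yields a URL whose `fq` parameter decodes back to the `Intersects` clause shown above.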

Also Search By

Date

Keywords

DataType

Institution

Solr Filter Clause

Future Is Browser Centric

• Client-Side Rendering

• Canvas, GPU, Actual Data

• Client-Side Analysis

• GPU, BYOD

• Apps With Phone Gap

Thank You

StephenMcDonald@cga.harvard.edu

Web Mapping Terms

• Map Servers

• ESRI REST, OGC / GetCapabilities

• Convert Spatial Data To Map Tiles

Separating Axis

Diff CenterXs > Sum Half Widths

Half Width

Center X
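The separating-axis test on this slide reduces to two comparisons per pair of rectangles. A minimal sketch using the CenterX, CenterY, HalfWidth, and HalfHeight field names from the schema; `boxes_overlap` is a hypothetical name, and rectangles that only touch along an edge are counted as overlapping here.

```python
def boxes_overlap(a, b):
    """True unless the two axis-aligned rectangles are separated on X or on Y."""
    # Separated on X: the centers are farther apart than the half widths combined.
    separated_x = abs(a["CenterX"] - b["CenterX"]) > a["HalfWidth"] + b["HalfWidth"]
    # Separated on Y: the same test with centers and half heights.
    separated_y = abs(a["CenterY"] - b["CenterY"]) > a["HalfHeight"] + b["HalfHeight"]
    return not (separated_x or separated_y)
```

Because the test needs only pre-computed centers and half extents, it can be written as a plain function query over numeric fields.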
