A Tour of Geospatial Data Analysis Tools in SAS
Timothy B. Gravelle, Independent Statistical Consultant


Why think about space and place?

• Implicit in most data (e.g., a survey of Ontario voters, bank branch IDs, etc.).
• Location and proximity/distance also have explanatory power.
• E.g., proximity to/distance from:
  – Geographic features (e.g., borders)
  – Various sites (e.g., retail locations, energy infrastructure).
• Business data, survey data, administrative data come from somewhere…

… but we tend to do this…


… when we should be thinking like this

Source: Timothy B. Gravelle (2014) “Partisanship, Border Proximity, and Canadian Attitudes toward North American Integration.” International Journal of Public Opinion Research, vol. 26 (forthcoming).

Challenges

• Many data analysis projects do not start out with spatial analysis in mind (e.g., secondary data analyses).
• Existing data may not contain precise location data.
• But I’m not a GIS analyst! I don’t have/can’t afford/don’t have time to learn GIS software!

Meeting these challenges – with SAS

• Obtaining and using spatial data
• Creating maps
• Geocoding business and survey data (that may not have been intended to be geocoded in the first place)
• Performing distance calculations

Spatial data everywhere

• Statistics Canada Census Cartographic Boundary Files (CBFs) – provinces, CMAs, federal electoral districts, tracts, dissemination areas:
  https://www12.statcan.gc.ca/census-recensement/2011/geo/index-eng.cfm
• US Census Bureau TIGER/Line files – states, CBSAs, counties, tracts, blocks, ZCTAs:
  www.census.gov/geo/maps-data/data/tiger-line.html
• SAS also includes some basic maps.

Reading in spatial data: PROC MAPIMPORT

• Reads in the main types of map shapefiles used by GIS packages, both polygon and line shapefiles.
• Ex.: reading in the Statistics Canada 2011 Census forward sortation area (FSA) polygon shapefile:

PROC MAPIMPORT DATAFILE="C:\Census11\gfsa000b11a_e.shp"
  OUT=map_0;
RUN;

Google Earth can be a source of spatial data – e.g., the Northern Gateway (Canada) and Keystone XL (US) pipelines (publicized by NGOs)

Reading in spatial data: Google Earth (.kml) files

• As popular as these kinds of mapping tools have become, there is no automatic way to import their data into SAS.
• Coordinates are stored as in-stream data, so they can be extracted with some clever DATA step programming.

Reading in spatial data: Google Earth (.kml) files

• Ex.: reading in the coordinates for the proposed route of the Northern Gateway pipeline (AB and BC):

FILENAME ng "C:\NG\Northern Gateway path.kml";
DATA ng_0;
  /* Treat the file as one byte stream; commas and blanks both delimit values */
  INFILE ng DSD DLM=', ' LRECL=32767 RECFM=n;
  FORMAT LON LAT 20.16;
  /* Scan the first 1024 bytes for the <coordinates> tag
     (assumes the tag occurs within that range) */
  INPUT x $char1024. @;
  strt=index(x,'<coordinates>')+14;
  /* Move the pointer to the start of the coordinate list */
  INPUT @strt @;
  /* Read lon, lat, elevation triplets until a non-numeric value is hit */
  DO UNTIL(lon=.);
    INPUT LON ?? LAT ?? ELEVATION ??;
    IF LON ^= . THEN OUTPUT;
  END;
  KEEP LON LAT;
  STOP;
RUN;

Creating maps: PROC GMAP


Creating maps: PROC GMAP

• SAS makes it easy to create heat maps (choropleth maps), plotting different levels of a variable for different geographic areas.
• Ex.: a map of Boston showing the spatial distribution of English language ability (Statistics Without Borders project).
• A map was created by merging the census tract TIGER/Line shapefile for Boston and American Community Survey data:

Creating maps: PROC GMAP

FILENAME mapct "C:\SWB\tl_2010_25025_tract10.shp";
FILENAME langct "C:\SWB\ACS_10_5YR_B16001_with_ann.csv";

PROC MAPIMPORT DATAFILE=mapct OUT=map_0;
RUN;

DATA map_1 (DROP=COUNTY: STATE: NAME: INT: FUNC: MTFCC: );
  SET map_0;
  CT=INPUT(SUBSTR(GEOID10,6,6), 7.2); FORMAT CT 7.2;
  /* DELETE CTs COMPRISING CHELSEA, REVERE & WINTHROP */
  IF (1601.01<=CT<=1606.99) OR (1701.00<=CT<=1708.99)
    OR CT=9815.02 OR (1801.00<=CT<=1805.99) THEN DELETE;
  /* DELETE CTs WITH NO LAND AREA */
  IF ALAND10>0;
RUN;

PROC SORT DATA=map_1;
  BY CT;
RUN;

Creating maps: PROC GMAP

DATA lang_0;
  INFILE langct FIRSTOBS=7 DSD DLM="," MISSOVER LRECL=32767;
  LENGTH GEO_ID $ 20;
  INPUT GEO_ID $ [...LOTS OF OTHER VARIABLES];
  INFORMAT GEO_ID $20. VD1--mVD119 best12.;
  FORMAT GEO_ID $20. VD1--mVD119 best12.;
  DROP m: GEO_ID2 GEO_ID_DISPLAY;
  CT=INPUT(SUBSTR(GEO_ID,15,6), 7.2); FORMAT CT 7.2;
  /* DELETE CTs COMPRISING CHELSEA, REVERE & WINTHROP */
  IF (1601.01<=CT<=1606.99) OR (1701.00<=CT<=1708.99)
    OR CT=9815.02 OR (1801.00<=CT<=1805.99) THEN DELETE;
RUN;

DATA lang_1;
  SET lang_0;
  /* GET THE TOTAL COUNT OF "LESS THAN VERY WELL"
     ENGLISH SPEAKERS, ALL LANGUAGES */
  TOTAL_LTVW=SUM(VD5,VD8, [...] ,VD116,VD119);
RUN;

Creating maps: PROC GMAP

PROC SORT DATA=lang_1;
  BY CT;
RUN;

FILENAME gout "C:\SWB\Boston Map 2012 08 14.png";
/* DEVICE must match the .png output file */
GOPTIONS RESET=ALL DEVICE=png GSFNAME=gout YMAX=7.5in;
/* Define the legend after RESET=ALL and reference it in CHORO */
LEGEND1 ACROSS=3 DOWN=2;
PROC GMAP MAP=map_1 DATA=lang_1;
  ID CT;
  CHORO TOTAL_LTVW / LEVELS=6 CDEFAULT=DARKGRAY LEGEND=LEGEND1;
  LABEL TOTAL_LTVW="Count, Speak English 'Less than Very Well'";
RUN; QUIT;

Geocoding: PROC GEOCODE

• Geocoding refers to appending location (latitude-longitude) information to records.
• SAS has well-developed built-in tools for US data: ZIP, ZIP+4, address-based geocoding.
• Canadian and British geocoding was introduced in SAS 9.4.

Geocoding: PROC GEOCODE

• Example – US survey data with ZIP codes in dataset:

PROC GEOCODE DATA=data_1
  METHOD=ZIP
  OUT=data_2
  LOOKUP=sashelp.zipcode
  ADDRESSSTATEVAR=STATE
  ADDRESSZIPVAR=ZIP
  LOOKUPSTATEVAR=STATECODE
  LOOKUPZIPVAR=ZIP
  LOOKUPXVAR=X
  LOOKUPYVAR=Y;
RUN;

Geocoding: getting creative

• You may need to go beyond the built-in capabilities of PROC GEOCODE – e.g., geocoding Canadian data using FSAs:

PROC MAPIMPORT DATAFILE="C:\Census11\gfsa000b11a_e.shp"
  OUT=map_0;
RUN;

DATA map_1 (RENAME=(CFSAUID=FSA));
  SET map_0;
RUN;

PROC SORT DATA=map_1;
  BY FSA;
RUN;

Geocoding: getting creative

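/* Load the SAS/GRAPH annotate macros, then compute one centroid per FSA */
%ANNOMAC;
%CENTROID(map_1, fsa_1, FSA, segonly=1);

/* Attach centroid coordinates to the survey records;
   data_1 must already be sorted by FSA for the MERGE */
DATA data_2;
  LENGTH _MATCHED_ $ 50;
  MERGE data_1 (IN=A) fsa_1;
  BY FSA;
  /* Keep only records that come from data_1 */
  IF A=1;
  IF X~=. AND Y~=. THEN _MATCHED_="Census 2011 FSA shapefile";
  ELSE _MATCHED_="None";
RUN;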

Dealing with map projections: PROC GPROJECT

Source: Statistics Canada, http://www12.statcan.gc.ca/census-recensement/2011/ref/dict/figures/figure14-dict-eng.cfm

Dealing with map projections: PROC GPROJECT

• One wrinkle is that SAS has no facility for telling you if a shapefile is projected or not (or which projection is being used).
• There are functions in the sp and rgdal packages in R to get this information.
• Ex.: “unprojecting” a map of the Kinder Morgan Trans Mountain oil pipeline expansion (originally in BC Albers projection):

PROC MAPIMPORT DATAFILE="C:\KM TM\KM_Pipeline_Expansion.shp"
  OUT=tm_0b;
RUN;

Dealing with map projections: PROC GPROJECT

PROC GPROJECT DATA=tm_0b OUT=tm_1b
  FROM="+proj=aea +lat_1=50 +lat_2=58.5 +lat_0=45 +lon_0=-126 +x_0=1000000 +y_0=0 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0"
  TO="+proj=longlat +datum=WGS84 +no_defs";
  ID SEGMENT;
RUN;

Dealing with map projections: stepping out (momentarily) to R

library(sp)
library(rgdal)
setwd("C:/KM TM/")
map.1 <- readOGR(dsn = ".", "KM_Pipeline_Expansion")
map.1@proj4string
map.2 <- spTransform(map.1, CRS("+proj=longlat +datum=WGS84"))
map.2@proj4string

Calculating distances

• Knowing the distance between two locations (latitude-longitude coordinates) is often useful (as a proxy for travel time, or with proximity/distance as a predictor variable).
• The GEODIST function in SAS makes distance calculations very easy.
• Ex.: distance to the Queenston-Lewiston bridge:

DATA data_2;
  SET data_1;
  /* 'DK': coordinates in degrees, result in kilometres */
  QL_DISTANCE=GEODIST(LAT, LON, 43.15306, -79.04445, 'DK');
RUN;
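To see how the unit options of GEODIST behave, here is a minimal, self-contained sketch that computes the distance from a single point to the Queenston-Lewiston bridge in both kilometres and miles. The first pair of coordinates is a hypothetical respondent location (roughly downtown Toronto) chosen only for illustration; it does not come from the survey data discussed here.

```sas
/* Illustrative only: lat/lon below are an assumed respondent location */
DATA _NULL_;
  lat = 43.65;
  lon = -79.38;
  /* 'D' = inputs in degrees; 'K' = kilometres, 'M' = miles */
  km    = GEODIST(lat, lon, 43.15306, -79.04445, 'DK');
  miles = GEODIST(lat, lon, 43.15306, -79.04445, 'DM');
  /* Write both results to the log */
  PUT km= 8.2 miles= 8.2;
RUN;
```

Degrees and kilometres are the defaults, so the options argument mainly matters when miles or radians are wanted.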

Calculating distances

• Calculating distances between large datasets (e.g., a large survey dataset and a pipeline route) is a little trickier.
• This can be done efficiently in PROC SQL using a cross join (many-to-many join) and then aggregating the data using record (respondent) ID values.
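The cross-join-then-aggregate idea can be sketched on toy data in a single query. This is a compact variant under stated assumptions, not the three-step code used in the talk: the datasets resp and route, and all coordinates in them, are hypothetical.

```sas
/* Toy data: two hypothetical respondents */
DATA resp;
  INPUT CASEID LAT LON;
  DATALINES;
1 54.0 -125.0
2 53.5 -113.5
;
RUN;

/* Toy data: three hypothetical points along a route */
DATA route;
  INPUT LAT LON;
  DATALINES;
54.1 -128.6
54.0 -124.0
53.8 -113.2
;
RUN;

PROC SQL;
  /* Pair every respondent with every route point (cross join), then keep
     the smallest distance per respondent. GEODIST defaults to degrees in,
     kilometres out. */
  CREATE TABLE nearest AS
  SELECT R.CASEID,
         MIN(GEODIST(R.LAT, R.LON, P.LAT, P.LON)) AS DISTANCE_ROUTE
  FROM resp AS R CROSS JOIN route AS P
  GROUP BY R.CASEID;
QUIT;
```

The result is the distance to the nearest stored point on the route. It approximates the distance to the route itself only when the route points are closely spaced.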

Calculating distances

PROC SQL;
  CREATE TABLE data_3a AS
  SELECT D2.CASEID, D2.LAT, D2.LON,
         GEODIST(D2.LAT, D2.LON, NG.LAT, NG.LON)
           AS DISTANCE_PIPELINE,
         LOG(GEODIST(D2.LAT, D2.LON, NG.LAT, NG.LON)+1)
           AS LN_DISTANCE_PIPELINE
  FROM data_2 AS D2
  CROSS JOIN ng_1 AS NG
  ORDER BY D2.CASEID, CALCULATED DISTANCE_PIPELINE
  ;

Calculating distances

  CREATE TABLE data_3b AS
  SELECT D3A.CASEID,
         MEAN(D3A.LAT) AS LAT,
         MEAN(D3A.LON) AS LON,
         MIN(D3A.DISTANCE_PIPELINE)
           AS DISTANCE_PIPELINE,
         MIN(D3A.LN_DISTANCE_PIPELINE)
           AS LN_DISTANCE_PIPELINE
  FROM data_3a AS D3A
  GROUP BY CASEID
  ORDER BY CASEID
  ;

Calculating distances

  CREATE TABLE data_3c AS
  SELECT D2.*,
         D3B.DISTANCE_PIPELINE,
         D3B.LN_DISTANCE_PIPELINE
  FROM data_2 AS D2
  LEFT JOIN data_3b AS D3B
    ON D2.CASEID=D3B.CASEID
  ORDER BY D2.CASEID
  ;
QUIT;

Straight-line vs. road distances

• One might think that road distances/travel time would be a better measure than straight-line/“as the crow flies” distance (as calculated using the GEODIST function).
• Empirical research comparing the two has found them to be very strongly correlated (r² = 0.94) (Boscoe et al. 2012). They are thus practically interchangeable.

Source: Boscoe, Francis P., Kevin A. Henry and Michael S. Zdeb. 2012. “A Nationwide Comparison of Driving Distance Versus Straight-Line Distance to Hospitals.” Professional Geographer 64(2): 188–96.

Straight-line vs. road distances

• An alternative is to make repeated calls to a map service (e.g., Google Maps) and extract the travel distance/time from the result (see Mike Zdeb’s TASS presentation, June 2012).
• This offers the prospect of greater accuracy but becomes impractical very quickly with thousands of records/respondents and thousands of destination points.
• Run time is seconds/minutes for the PROC SQL method vs. hours/days for repeated map service calls.

Wrap-up

• There are many ways to bring together different sources of data with spatial data to answer interesting questions.
• The world is awash in spatial data – much of it free.
• You don’t need to be a trained GIS analyst to get started (but it helps to be friends with one).
• Nor do you need a full-fledged GIS platform – base SAS and SAS/GRAPH have many useful facilities.

Wouldn’t it be nice...?

• If SAS could read in Google Earth .kml files, either directly or via the XML Mapper?
• If PROC GMAP could handle multiple map layers?
• If PROC MAPIMPORT or PROC GPROJECT had the ability to output (to the output window, log, or a SAS dataset) information on map projection?

Thank you!

Timothy B. Gravelle, Independent Statistical Consultant
