SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or Trademarks of their respective companies.

Using XML Mapper and XMLMAP to Read Data Documented by Data Documentation Initiative (DDI) Files

Larry Hoyle

Policy Research Institute

University of Kansas

Overview

A SAS program reads an XML metadata file and writes a SAS program to read the raw data file described by the metadata file.

makeReader.sas reads 2825.xml (through DDI.map) and writes Read2825.sas.

Read2825.sas reads Da2825.txt and creates:

Work.ICPSR2825Household

Work.ICPSR2825Family

Work.ICPSR2825Person

DDI

“an international effort to establish a standard for technical documentation describing social science data” - http://www.icpsr.umich.edu/DDI/index.html

DDI Files

XML

• DTD - http://www.icpsr.umich.edu/DDI/users/dtd/index.html

Metadata about:

• The DDI file itself

• The study that collected the data

• The data file

• Variables within the data file

• Other Material

The Minimal DDI File

<?xml version="1.0"?>

<codeBook>

<stdyDscr>

<citation>

<titlStmt>

<titl>Howdy World: Valid but Useless Metadata</titl>

</titlStmt>

</citation>

</stdyDscr>

</codeBook>

Real Example: ICPSR 6084 Raw Data File

100001 161132146211115555299 9991199 99911219

200001 49992

30000102534 000325222641942 3834101202

100002 12213112421111212112222221121 2122 12

200002 12221

30000202574 000756221622052 4261103202

ICPSR 6084 – About The Study

<citation>

<titlStmt>

<titl>CBS News Monthly Poll #2, August 1992</titl>

ICPSR 6084 – About The File

<dimensns>

<caseQnty>1,546</caseQnty>

<varQnty>70</varQnty>

<logRecL>80</logRecL>

<recPrCas>3</recPrCas>

ICPSR 6084 – Reading The File With SAS

infile 'C:\DDRIVE\data\icpsr\data\6084\da6084.txt' LRECL=80 PAD;

<dimensns>
  <caseQnty>1,546</caseQnty>
  <varQnty>70</varQnty>
  <logRecL>80</logRecL>
  <recPrCas>3</recPrCas>
  <recNumTot>4,638</recNumTot>
</dimensns>
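The dimensions recorded in the DDI file can set LRECL and also serve as a sanity check on the metadata. The following is a minimal sketch, assuming the values have already been pulled into SAS variables. The variable names and the `rawdata` fileref are hypothetical; the values are copied from the `<dimensns>` element above.

```sas
/* Sketch only: variable names and the rawdata fileref are      */
/* hypothetical. Values are copied from <dimensns> in ICPSR 6084. */
data _null_;
  length caseQnty recNumTot $ 12;
  caseQnty  = '1,546';
  recNumTot = '4,638';
  logRecL   = 80;
  recPrCas  = 3;

  /* The counts are stored as text with commas, so read them with COMMA. */
  nCases   = input(caseQnty,  comma12.);
  nRecords = input(recNumTot, comma12.);

  /* Cases times records per case should equal the total record count. */
  if nCases * recPrCas = nRecords then
    put 'NOTE: record counts in the metadata are consistent.';
  else
    put 'WARNING: record counts in the metadata are inconsistent.';

  /* The logical record length feeds the LRECL= option. */
  put 'infile rawdata LRECL=' logRecL 'PAD;';
run;
```

For these values the check passes, since 1,546 cases at 3 records per case gives 4,638 records.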

More From ICPSR 6084 – first variable

More From ICPSR 6084 – Reading the first variable

input

#1 cardno 1-1

More From ICPSR 6084 – another variable

#3 respno 2-6
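Put together, the two INPUT fragments might form a DATA step like the one below. This is a sketch rather than the generated Read6084 program: the data set name is hypothetical, and the data lines are copied from the raw-data slide, where spacing beyond column 6 may not match the real file.

```sas
/* Sketch only: hypothetical data set name. Data lines are copied */
/* from the raw-data slide; only columns 1-6 are read here.       */
data work.sketch6084;
  infile datalines truncover;
  /* Three records per case: cardno from record 1, respno from record 3. */
  input #1 cardno 1-1
        #3 respno 2-6;
  datalines;
100001 161132146211115555299 9991199 99911219
200001 49992
30000102534 000325222641942 3834101202
100002 12213112421111212112222221121 2122 12
200002 12221
30000202574 000756221622052 4261103202
;
run;

proc print data=work.sketch6084;
run;
```

The `#n` line pointers make SAS consume three records per observation, which matches `<recPrCas>3</recPrCas>` in the metadata.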

The Tasks

Pull the necessary information from a hierarchical XML file into SAS as tables

• Use XML libname engine with an XMLMAP file

Use that information in SAS to read the raw data file
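The first task can be sketched end to end with the minimal DDI file shown earlier. This is an illustration under stated assumptions, not the DDI.map used by makeReader.sas: the filerefs, the map name, and the table and column names are hypothetical, and the map uses XMLMap 1.2 syntax as accepted by the `xml` engine (newer SAS releases may prefer `xmlv2` with a 2.1 map).

```sas
/* Sketch only: filerefs, map name, table and column names are   */
/* hypothetical. The XML is the minimal DDI file from the slides. */
filename ddixml temp;
filename ddimap temp;

/* Write the minimal DDI file to a temporary location. */
data _null_;
  file ddixml;
  input;
  put _infile_;
  datalines4;
<?xml version="1.0"?>
<codeBook>
 <stdyDscr>
  <citation>
   <titlStmt>
    <titl>Howdy World: Valid but Useless Metadata</titl>
   </titlStmt>
  </citation>
 </stdyDscr>
</codeBook>
;;;;
run;

/* Write an XMLMAP: titlStmt defines the rows, titl defines a column. */
data _null_;
  file ddimap;
  input;
  put _infile_;
  datalines4;
<?xml version="1.0"?>
<SXLEMAP version="1.2" name="DDIsketch">
 <TABLE name="titlStmt">
  <TABLE-PATH syntax="XPath">/codeBook/stdyDscr/citation/titlStmt</TABLE-PATH>
  <COLUMN name="titl">
   <PATH syntax="XPath">/codeBook/stdyDscr/citation/titlStmt/titl</PATH>
   <TYPE>character</TYPE>
   <DATATYPE>string</DATATYPE>
   <LENGTH>200</LENGTH>
  </COLUMN>
 </TABLE>
</SXLEMAP>
;;;;
run;

/* Read the XML file through the map; each TABLE becomes a data set. */
libname DDIfile xml xmlfileref=ddixml xmlmap=ddimap access=readonly;

proc print data=DDIfile.titlStmt;
run;
```

Once the libref is assigned, the mapped tables can be used in a SET statement like any other SAS data set, which is what the second task relies on.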

Making the XMLMAP File – SAS XML Mapper

Defining Tables – What Defines Rows

Drag the element that defines rows to the root of the XMLMAP structure

Defining Tables – Row Defined

Defining Tables – What Defines Columns

Drag an element that defines a column to the root of the table

Defining Tables – Column Defined

Viewing The XMLMap File

Viewing The XMLMap File – Row Path

Viewing The XMLMap File - Column Path

Viewing Sample SAS Code

Previewing the Table

XML Mapper Limitations

A single XML file will not necessarily contain every element that files of its type can contain.

• Use XML Schema instead of XML file

An XML Schema file may not work

• XML file type defined by DTD

• XML Schema too complex for XML Mapper

What Then?

You can use XML Mapper to start and then hand-edit the XMLMAP file.

Lots of Tables From DDI Mostly for Comments

DATADSCR_VAR
DATADSCR_VARGRP
DATADSCR_VAR_CATGRY
DATADSCR_VAR_INVALRNG
DATADSCR_VAR_INVALRNG_ITEM
DATADSCR_VAR_INVALRNG_RANGE
DATADSCR_VAR_VALRNG_ITEM
DATADSCR_VAR_VALRNG_RANGE
DOCDSCR_CITATION__AUTHENTY
DOCDSCR_CITATION__COPYRIGHT
DOCDSCR_CITATION__IDNO
DOCDSCR_CITATION__OTHID
DOCDSCR_CITATION__PRODDATE
DOCDSCR_CITATION__PRODUCER
DOCDSCR_CITATION__TITL
FILEDSCR_FILETXT
FILEDSCR_FILETXT_RECGRP
STDYDSCR_CITATION_BIBLCIT
STDYDSCR_CITATION_TITLSTMT
STDYDSCR_CITATION_VERSTMT
STDYDSCR_CITATION__AUTHENTY
STDYDSCR_CITATION__COPYRIGHT
STDYDSCR_CITATION__DISTRBTR
STDYDSCR_CITATION__FUNDAG
STDYDSCR_CITATION__GRANTNO
STDYDSCR_CITATION__PRODDATE
STDYDSCR_CITATION__PRODUCER
STDYDSCR_CITATION__SOFTWARE
STDYDSCR_METHOD__COLLMODE
STDYDSCR_METHOD__DATACOLLECTOR
STDYDSCR_METHOD__FREQUENC
STDYDSCR_METHOD__RESINSTRU
STDYDSCR_METHOD__SAMPPROC
STDYDSCR_METHOD__TIMEMETH
STDYDSCR_METHOD__WEIGHT
STDYDSCR_STDYINFO_ABSTRACT
STDYDSCR_STDYINFO__ANLYUNIT
STDYDSCR_STDYINFO__COLLDATE
STDYDSCR_STDYINFO__DATAKIND
STDYDSCR_STDYINFO__GEOGCOVER
STDYDSCR_STDYINFO__KEYWORD
STDYDSCR_STDYINFO__NATION
STDYDSCR_STDYINFO__TIMEPRD
STDYDSCR_STDYINFO__TOPCCLAS
STDYDSCR_STDYINFO__UNIVERSE

Write a SAS Program – Metadata Comment

data _null_;
  file reader lrecl=1024;
  length vEdited $ 2000;
  set DDIfile.stdyDscr_citation_titlStmt;
  if _n_=1 then put '/*' / ' SAS program to read ' agency ' ' IDNo;
  stdyDscrTitl = compbl(tranwrd(translate(stdyDscrTitl, ' ', '09'x), '*/', '*_/'));
  put 'Study Title' _n_ ': ' stdyDscrTitl;
  altTitl = compbl(tranwrd(translate(altTitl, ' ', '09'x), '*/', '*_/'));
  put ' ' altTitl;

/*
 SAS program to read ICPSR 6084
Study Title1 : CBS News Monthly Poll #2, August 1992
 August National Poll II, Republican National Convention
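The `translate`, `tranwrd` and `compbl` calls in that step exist because the titles are written inside a block comment in the generated program, so a comment terminator or a tab in the metadata would corrupt it. A small sketch of the cleanup in isolation, using an invented title (not from any DDI file):

```sas
/* Sketch only: the title text is invented to exercise the cleanup. */
data _null_;
  length titl $ 200;
  titl = 'Odd   title */ with' || '09'x || 'a tab';

  /* Tab to blank, neutralise the comment terminator, collapse blanks. */
  titl = compbl(tranwrd(translate(titl, ' ', '09'x), '*/', '*_/'));

  put titl=;
run;
```

The log should show `titl=Odd title *_/ with a tab`, which is safe to write between the opening and closing comment delimiters of the generated program.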

Cntlin file for Formats

data makeTheFormats;
  input fmtname $ 1-7 type $ 9-9 start $ 11-26 default 28-35
      / label :&$512.;
datalines;
V00006f N 1                1
Yes
V00006f N 3                1
Converted Refusal
;
run;
proc format cntlin=makeTheFormats;
run;

Input

Logic to produce different input statements for:

• Fixed column data

• Delimited data
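One way such logic could look is sketched below. It is not taken from makeReader.sas: the `varMeta` table, its columns, and the `isDelimited` macro variable are hypothetical stand-ins for the metadata tables read through the XMLMAP, and the generated text goes to the log rather than to a program file. The two rows reuse the ICPSR 6084 variables shown earlier.

```sas
/* Sketch only: varMeta, its columns and isDelimited are hypothetical. */
data varMeta;
  input name $ recno startpos endpos;
  datalines;
cardno 1 1 1
respno 3 2 6
;
run;

%let isDelimited = 0;   /* set to 1 to generate list input instead */

data _null_;
  set varMeta end=last;
  file log;
  if _n_ = 1 then put 'input';
  /* Delimited data: variable names only (the INFILE would need DLM=). */
  if &isDelimited then put '  ' name;
  /* Fixed columns: record pointer, name, and start-end columns. */
  else put '  #' recno name startpos +(-1) '-' endpos;
  if last then put ';';
run;
```

With `isDelimited` set to 0 this writes lines of the form `#1 cardno 1-1`, the same shape as the INPUT fragments shown for ICPSR 6084. The `+(-1)` pointer control removes the blank that list-style PUT leaves after a variable.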

Output Multiple Datasets if There Are Different Record Types

• Separate keep dataset options

• Logic to output to appropriate dataset

if left(_RecordSetIdentifier) eq left("2 ") then DO;

output ICPSR2825FAMILY ;

END;

if "&fileStructureType" eq "hierarchical" then do;
  put 'if left(_RecordSetIdentifier) eq left("' catValu '") then DO;'
    / " output " safeAgency +(-1) safeIDNo +(-1) rectype ';'
    / 'END;' //;
end;
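A generated reader for a hierarchical file might then take roughly the following shape. This is a sketch, not output from makeReader.sas: the data set names, variables, record layout, and the record-type codes 1 and 3 are hypothetical, and only the code 2 for family records comes from the slide.

```sas
/* Sketch only: data set names, variables, layout and record-type   */
/* codes 1 and 3 are hypothetical. Code 2 = family is from the slide. */
data work.household (keep=hhid region)
     work.family    (keep=hhid famid)
     work.person    (keep=hhid famid age);
  infile datalines truncover;
  length _RecordSetIdentifier $ 2;
  /* Read the record type and hold the record for a second INPUT. */
  input _RecordSetIdentifier $ 1-2 @;

  if left(_RecordSetIdentifier) eq left("1 ") then do;
    input hhid region;
    output work.household;
  end;
  else if left(_RecordSetIdentifier) eq left("2 ") then do;
    input hhid famid;
    output work.family;
  end;
  else if left(_RecordSetIdentifier) eq left("3 ") then do;
    input hhid famid age;
    output work.person;
  end;
  datalines;
1 0001 3
2 0001 01
3 0001 01 34
3 0001 01 31
;
run;
```

The separate KEEP= data set options mean each output table carries only the variables that belong to its record type, even though all of them exist in the same DATA step.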

Some of the Limitations

DDI can describe nCubes, geographic coverage, variable groups

• Current makeReader.sas can’t handle these

DDI definition includes recursive elements

• E.g. recGrps within recGrps

• Current makeReader.sas would not find nested elements

Questions?

About the Speaker

Larry Hoyle

Associate Scientist

Policy Research Institute,

University of Kansas

1541 Lilac Lane

Lawrence, KS 66044-3177

[email protected]