35
Reading Microsoft Word XML files with SAS August 25, 2005 Larry Hoyle -- Policy Research Institute University of Kansas revised 8/18/2005

Reading Microsoft Word XML files with SAS August 25, 2005

Embed Size (px)

DESCRIPTION

Reading Microsoft Word XML files with SAS August 25, 2005. Larry Hoyle -- Policy Research Institute University of Kansas. revised 8/18/2005. 3 scenarios. Extracting text along with associated properties (styles and attributes) Extracting all data from tables - PowerPoint PPT Presentation

Citation preview

Page 1: Reading Microsoft Word XML files with SAS  August 25, 2005

Reading Microsoft Word XML files with SAS

August 25, 2005

Larry Hoyle -- Policy Research Institute

University of Kansas

revised 8/18/2005

Page 2: Reading Microsoft Word XML files with SAS  August 25, 2005

3 scenarios

• Extracting text along with associated properties (styles and attributes)

• Extracting all data from tables

• Extracting coordinates of objects in drawings

Page 3: Reading Microsoft Word XML files with SAS  August 25, 2005

XML - syntax<?xml version="1.0" ?>

<LarryRootTag>

<EmptyTag/><nestedTag>

Some content

</nestedTag >

<nestedTag anAttribute="wha">

Other content

</nestedTag >

</LarryRootTag>

Must begin with this prolog tag

Paired tags, must have 1 root tag

case sensitive

Empty tags end with />

Tags and content called "element"

Tags can be Qualified by

attributes

Elements can be nested,Start and end in same parent

Page 4: Reading Microsoft Word XML files with SAS  August 25, 2005

Word XML

Page 5: Reading Microsoft Word XML files with SAS  August 25, 2005

Word XML

Page 6: Reading Microsoft Word XML files with SAS  August 25, 2005

Extracting text and properties

• SAS XML Engine

• Needs XMLMAP file

• Can use XML Mapper to generate XMLMAP

• Only needs to be generated once for

each type of extract

Page 7: Reading Microsoft Word XML files with SAS  August 25, 2005

Example DocumentI have never been so humiliated in my life. That was very rude treatment.What a pleasant experience. Your staff was both quick and pleasant.It took about the time I expected to reach someone.I have nothing to say. The sky is blue and the sea is green.You are the worst organization in the world.I love you guys.

Page 8: Reading Microsoft Word XML files with SAS  August 25, 2005

XML - Example DocumentI have never been so humiliated in my life. That was very rude treatment.What a pleasant experience. Your staff was both quick and pleasant.It took about the time I expected to reach someone.I have nothing to say. The sky is blue and the sea is green.You are the worst organization in the world.I love you guys.

Paragraph property:/w:wordDocument/w:body /wx:sect/w:p/w:pPr

Run property:/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr.

Page 9: Reading Microsoft Word XML files with SAS  August 25, 2005

Rows

• The XMLMap has to describe a path that delineates rows:

• In this case it’s each text element in a run (in a paragraph…)

<TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>

Page 10: Reading Microsoft Word XML files with SAS  August 25, 2005

Columns – the text

• The XMLMap has to describe a path that delineates each column:

• The text itself is:

<COLUMN name="t">

<PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>

Page 11: Reading Microsoft Word XML files with SAS  August 25, 2005

Columns – the text element number

• A sequential number for the text element is:

<COLUMN name="tNum"

ordinal="YES" retain="YES"> <INCREMENT-PATH beginend="BEGIN"

syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</INCREMENT-PATH>

Page 12: Reading Microsoft Word XML files with SAS  August 25, 2005

Columns – the paragraph number

• A sequential number for the paragraph is:

<COLUMN name="pNum" ordinal="YES" retain="YES">

<INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>

Page 13: Reading Microsoft Word XML files with SAS  August 25, 2005

Columns –paragraph color

<COLUMN name="PColorVal" retain="YES">

<PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val</PATH>

Page 14: Reading Microsoft Word XML files with SAS  August 25, 2005

Columns – run color

<COLUMN name="RColorVal" retain="YES">

<PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val</PATH>

Page 15: Reading Microsoft Word XML files with SAS  August 25, 2005

Our dataset

Page 16: Reading Microsoft Word XML files with SAS  August 25, 2005

Tables

Page 17: Reading Microsoft Word XML files with SAS  August 25, 2005

All Tables Into One Dataset

Page 18: Reading Microsoft Word XML files with SAS  August 25, 2005

Tables – Word XML

Page 19: Reading Microsoft Word XML files with SAS  August 25, 2005

Tables - DataSet Rows

<TABLE-PATH syntax="XPath">

/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>

Page 20: Reading Microsoft Word XML files with SAS  August 25, 2005

Tables – Table Number

<COLUMN name="tblNum" ordinal="YES" retain="YES">

<INCREMENT-PATH beginend="BEGIN" syntax="XPath">

/w:wordDocument/w:body/wx:sect/w:tbl

</INCREMENT-PATH>

Page 21: Reading Microsoft Word XML files with SAS  August 25, 2005

Tables – Row Number

<COLUMN name="trNum" ordinal="YES" retain="YES">

<INCREMENT-PATH beginend="BEGIN" syntax="XPath">

/w:wordDocument/w:body/wx:sect/w:tbl/w:tr

</INCREMENT-PATH>

Page 22: Reading Microsoft Word XML files with SAS  August 25, 2005

We Could Add Properties if Needed

Page 23: Reading Microsoft Word XML files with SAS  August 25, 2005

Nested tables

Page 24: Reading Microsoft Word XML files with SAS  August 25, 2005

Nested Tables – Absolute Path for Rows

<TABLE-PATH syntax="XPath">

/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>

Page 25: Reading Microsoft Word XML files with SAS  August 25, 2005

Nested Tables – Rootless Path for Rows

<TABLE-PATH syntax="XPath">

w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>

Page 26: Reading Microsoft Word XML files with SAS  August 25, 2005

Drawing ObjectsVML – Vector Markup Language

• Drawings in Word get stored as XML also

• We’ll just look at lines

Page 27: Reading Microsoft Word XML files with SAS  August 25, 2005

VML – Vector Markup Language

Page 28: Reading Microsoft Word XML files with SAS  August 25, 2005

Dataset – One Row for Each Line

<TABLE-PATH syntax="XPath">

/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line</TABLE-PATH>

Page 29: Reading Microsoft Word XML files with SAS  August 25, 2005

Dataset – Column: From

<COLUMN name="from"> <PATH syntax="XPath">

/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from</PATH>

Page 30: Reading Microsoft Word XML files with SAS  August 25, 2005

Dataset – Column: To

<COLUMN name="from"> <PATH syntax="XPath">

/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@to</PATH>

Page 31: Reading Microsoft Word XML files with SAS  August 25, 2005

Dataset – Column: StrokeColor

<COLUMN name="from"> <PATH syntax="XPath">

/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@strokecolor</PATH>

Page 32: Reading Microsoft Word XML files with SAS  August 25, 2005

The Dataset

Page 33: Reading Microsoft Word XML files with SAS  August 25, 2005

Usage Example: Annotate dataset

if prxmatch(xyPattern, from) then do;

function='move';

x= input(PRXPOSN (xyPattern, 1, from),10.);

if prxmatch('/flip:y/',style) then

y= -1* input(PRXPOSN (xyPattern, 2, to),10.);

else

y= -1* input(PRXPOSN (xyPattern, 2, from),10.);

output;

Page 34: Reading Microsoft Word XML files with SAS  August 25, 2005

Plotted in SAS

Page 35: Reading Microsoft Word XML files with SAS  August 25, 2005

Contact Information

Larry HoylePolicy Research Institute, University of Kansas

[email protected]

http://www.ku.edu/pri/ksdata/sashttp/sugi31