Reading Data from Other Sources

Source: people.uncw.edu/blumj/stt305/ppt/Reading data from other sources.pdf

Other Data Sources

SAS can read data from a variety of sources:

Plain text files, including delimited and fixed-column files

Spreadsheets, such as Excel

Databases

XML

Others

Text Files

Text files of various types can be read via the data step.

Two essential statements for this operation are:

infile file-reference <options>;

Directs the data step to a file.

The file reference can be a path (in quotes) or a filename reference.

input variable list;

Lists variables to read, with some instructions for how to read them.

Basic Input

With no options assigned, the infile statement will assume data values are space delimited.

In this case, the most basic form of the input statement is:

input variable_1 <$> … variable_n <$>;

$ indicates that the preceding variable is character.

Each variable name must follow SAS naming conventions.
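As a minimal sketch of this basic form, the step below reads in-stream records with datalines instead of an external file, so it can be run without any raw data. The data set name list_demo, the variable names, and the two records are all made up for illustration.

```sas
/* Hypothetical example: space-delimited list input from in-stream data */
data list_demo;
  infile datalines;          /* no options, so blanks delimit the values */
  input id name $ score;     /* $ marks name as a character variable */
  datalines;
1 Ann 90
2 Bob 85
;
run;

proc print data=list_demo;
run;
```

Replacing datalines with a quoted path or a fileref gives the same behavior for an external file.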

Simple Example

Read the data file “flights.prn” from the raw data subfolder of the data sets folder.

Code:

data test;
  infile '\\seashare\blumj\SAS Programming Data\raw data\flights.prn';
  input flight_no date $ destination$ first_class economy;
run;

The infile statement includes a full-path reference to the file to be used.

List (legal) variable names in the order corresponding to the columns in the data.

Each character variable must be followed by the $. It can be appended to the name, or a space can be placed between them.

Results

Something is not quite right in the date column.

Process

data test;
  infile '\\seashare\blumj\SAS Programming Data\raw data\flights.prn';
  input flight_no date $ destination$ first_class economy;
run;

During compilation, the PDV (program data vector) is set up based on the specifications in the input statement.

PDV
flight_no  date  destination  first_class  economy

During execution, data is loaded into an input buffer and parsed as specified by the input and infile statements.

Input Buffer
439 12/11/2000 LAX 20 137

Spaces are taken as delimiters between values. Multiple consecutive spaces are seen as a single delimiter.

PDV
flight_no  date      destination  first_class  economy
439        12/11/20  LAX          20           137

The default length of character variables is 8, so the date is truncated to 12/11/20 in the PDV.

A length statement (before input) would be useful.
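One way to apply that suggestion is sketched below, using the single flights record from this walkthrough read in-stream rather than from flights.prn. The data set name test_len is hypothetical.

```sas
/* Sketch: declare the character length before input so the date is not cut at 8 */
data test_len;
  length date $ 10;          /* must come before input to take effect */
  infile datalines;
  input flight_no date $ destination $ first_class economy;
  datalines;
439 12/11/2000 LAX 20 137
;
run;
```

Because the length statement comes first, date becomes the first variable in the PDV and in the output data set.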

This record is output to the data set, and the process continues until the end of the raw file is reached.

The FILENAME Statement

The filename statement looks similar to the libname statement:

filename fileref 'path';

The rules for the fileref are the same as those for a libref.

The path can point directly to an individual file or a folder.

One method:

filename myfile '\\seashare\blumj\SAS Programming Data\raw data\flights.prn';

data test;
  infile myfile;
  input flight_no date $ destination$ first_class economy;
run;

A more flexible method:

filename rawdata '\\seashare\blumj\SAS Programming Data\raw data';

data test;
  infile rawdata('flights.prn');
  input flight_no date $ destination$ first_class economy;
run;

Since the filename points to a folder, individual files must be requested when the fileref is used.

Pitfalls

Consider the data file “flights2.prn”, which is only slightly different from the previous one (a zero in the 9th record is now a blank):

Process at 9th Record

data test2;
  infile rawdata('flights2.prn');
  length date $ 10;
  input flight_no date $ destination$ first_class economy;
run;

The multiple spaces are seen as a single delimiter (not as missing). What happens with economy?

PDV
date        flight_no  destination  first_class  economy
12/15/2000  114        LAX          187

Input Buffer
114 12/15/2000 LAX    187

SAS wants to fill in a value, so it gets more information from the raw file: the next record.

PDV
date        flight_no  destination  first_class  economy
12/15/2000  114        LAX          187          982

Input Buffer

9 8 2 1 2 / 1 5 / 2 0 0 0 D F W 1 4 3 1

Results

Only 9 records are produced, and the 9th is incorrect.
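The behavior can be reproduced with a small in-stream sketch. The first record mimics the 9th record of flights2.prn; the last two values on the second record (5 and 85) and the data set name flow_demo are made up for illustration.

```sas
/* Sketch: a blank value in space-delimited data makes input read into the next record */
data flow_demo;
  infile datalines;
  length date $ 10;
  input flight_no date $ destination $ first_class economy;
  datalines;
114 12/15/2000 LAX    187
982 12/15/2000 DFW 5 85
;
run;

proc print data=flow_demo;
run;
```

Under the default behavior this step should produce one observation rather than two: first_class takes 187, economy takes 982 from the second record, and the rest of that record is discarded.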

As a general rule, spaces are lousy delimiters. Consider “flights3.prn”, which adds the pilot’s name as the first column. What will happen with this?

Other Delimiters

Instructions for which delimiter is present are set in the infile statement.

Consider “flights.csv”:

The dlm= option allows for specification of the delimiter, which can be a literal keyboard character.

Code:

data test3;
  infile rawdata('flights.csv') dlm=',';
  length date $ 10;
  input flight_no date $ destination$ first_class economy;
run;

More Infile Options

The 9th record is still problematic, because dlm= alone still treats consecutive delimiters as a single delimiter.

DSD (delimiter-sensitive data):

Ignores delimiters inside quoted values.

Treats consecutive delimiters as containing a missing value.

Automatically changes the default delimiter to a comma.

Update:

data test3;
  infile rawdata('flights.csv') dsd;
  length date $ 10;
  input flight_no date $ destination$ first_class economy;
run;

Results

9th record now has appropriate missing first class value and 10th record reads correctly.
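A self-contained sketch of the same idea, with in-stream records standing in for flights.csv. The values 5 and 85 and the data set name dsd_demo are hypothetical.

```sas
/* Sketch: with dsd, two consecutive commas are read as a missing value */
data dsd_demo;
  infile datalines dsd;      /* dsd also makes the comma the default delimiter */
  length date $ 10;
  input flight_no date $ destination $ first_class economy;
  datalines;
114,12/15/2000,LAX,,187
982,12/15/2000,DFW,5,85
;
run;

proc print data=dsd_demo;
run;
```

Here both records should be kept, with first_class missing on the first one.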

Consider “flights2.csv”:

Here the 9th record is incomplete. With the current code the result is:

The first class value on the 9th record comes from the flight number on the 10th record.

The economy value is missing since SAS tries to read the date (12/15/2000) there, which is not valid as a number.

Once again, SAS tries to fill in information from the next record.

MISSOVER: Forces all values that cannot be filled with the information in the current input buffer to be set to missing.

Update:

data test4;
  infile rawdata('flights2.csv') dsd missover;
  length date $ 10;
  input flight_no date $ destination$ first_class economy;
run;

Results

9th record now has appropriate missing first class and economy values and 10th record reads correctly.
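The effect of missover can be sketched with in-stream records, where the first record stops after the destination. The values 5 and 85 and the data set name miss_demo are hypothetical.

```sas
/* Sketch: missover sets unread variables to missing instead of reading the next record */
data miss_demo;
  infile datalines dsd missover;
  length date $ 10;
  input flight_no date $ destination $ first_class economy;
  datalines;
114,12/15/2000,LAX
982,12/15/2000,DFW,5,85
;
run;

proc print data=miss_demo;
run;
```

Removing missover from the infile statement shows the earlier problem again: the short record borrows values from the record that follows it.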

In “flights3.csv”, column headers are preserved:

The firstobs= option selects the first row to read from the raw data.

Firstobs=
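A minimal sketch of firstobs=, assuming a single header row. The header text, the data set name first_demo, and the in-stream records are hypothetical stand-ins for flights3.csv.

```sas
/* Sketch: firstobs=2 starts reading at the second row, skipping the header */
data first_demo;
  infile datalines dsd firstobs=2;
  length date $ 10;
  input flight_no date $ destination $ first_class economy;
  datalines;
flight_no,date,destination,first_class,economy
439,12/11/2000,LAX,20,137
;
run;
```

For an external file, the same option would go on the infile statement that names the file.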

TAB as a Delimiter

TAB is not a standard keyboard character; in the SAS editor it is just a series of spaces.

For “flights.txt” (which is tab delimited), what is the correct form of the dlm= option?

ASCII hexadecimal codes can be given in the form: '##'x

The hex code for TAB is 09.

data test6;
  infile rawdata('flights.txt') dlm='09'x;
  length date $ 10;
  input flight_no date $ destination$ first_class economy;
run;

See if you can read in “flights2.txt”, “flights3.txt” and “flights4.txt”.

Informats

In the previous examples the date was read as character instead of what SAS recognizes as a date.

Many formats can be used in a dual role as informats, which provide instructions for how to read raw data.

Code:

data test7;
  infile rawdata('flights.txt') dlm='09'x;
  input flight_no date :mmddyy10. destination$ first_class economy;
run;

The colon (:) is a format modifier that attaches an informat to a variable; the value is still read up to the next delimiter.

The dates are converted, but no format is applied. Use a format statement to pick the format desired.
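A small sketch of adding a format statement, using one in-stream flights record instead of flights.txt. The data set name fmt_demo is hypothetical, and date9. is only one of several date formats that could be chosen.

```sas
/* Sketch: the informat reads the date, the format controls how it is displayed */
data fmt_demo;
  infile datalines;
  input flight_no date :mmddyy10. destination $ first_class economy;
  format date date9.;        /* 12/11/2000 is displayed as 11DEC2000 */
  datalines;
439 12/11/2000 LAX 20 137
;
run;

proc print data=fmt_demo;
run;
```

Without the format statement, proc print would show the underlying numeric date value instead.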

Fixed Column Files

In “flights.dat”, data are in fixed positions:

Flight number, columns 1-3

Date, columns 4-11

Destination, columns 12-14

First Class, columns 15-17 (or just 16 & 17)

Economy, columns 18-20

Here the form of the input statement is quite different.

Two potential forms can be used on each variable:

variable <$> start-stop

start is the starting column position, stop is the ending column position.

$ is used to denote character variables.

This form cannot use informats.

@n variable informat.

n is the starting column position.

The informat includes the total width.

The informat determines if the variable is character or numeric.

First version:

data test8;
  infile rawdata('flights.dat');
  input flight_no 1-3 date$ 4-11 destination$ 12-14 first_class 16-17 economy 18-20;
run;

In this case, information in the input buffer is parsed into pieces corresponding to the specified columns.

Second version:

data test9;
  infile rawdata('flights.dat');
  input @1 flight_no 3. @4 date mmddyy8. @12 destination $3. @16 first_class 2. @18 economy 3.;
  format date date9.;
run;

In this case, information in the input buffer is parsed into pieces corresponding to the start column and informat width.

One method or the other can be chosen for each variable, but different variables can use different methods in the same input statement:

data test10;
  infile rawdata('flights.dat');
  input flight_no 1-3 @4 date mmddyy8. destination$ 12-14 first_class 16-17 economy 18-20;
  format date date9.;
run;

Date is the only variable that needs an informat, so the others can be read using the column span.

Exercise 1

Read the “projects.txt” data, which has the columns of: state, job id, date, region, equipment cost, personnel cost and pollution code.

Compute a variable for total job cost and another for pollution type using the encoding:

1 = TSP
2 = LEAD
3 = CO
4 = SO2
5 = O3

Exercise 2

Read the “employee list.csv” data, which has the columns as noted in the table below.

In addition to the variables present, you should also compute a bonus and final salary for each employee. The bonuses are 9% for all pilots, 7.5% for all mechanics, and 6.75% for all others.

Description of Field    Data Type
Employee’s Last Name    Character
Employee’s First Name   Character
Country                 Character
City                    Character
Phone Number            Character
Employee ID             Character
Job Code                Character
Salary                  Currency ($xxx,xxx)

Exercise 3

Read the “delay.dat” data, which has the columns as noted in the table below.

The airline classifies departure times in one of four ways: midnight to 7 a.m. as early morning, 7 a.m. to noon as morning, noon to 6 p.m. as daytime and 6 p.m. to midnight as evening. Create an additional variable to reflect these classifications.

Additionally, using the scheduled time of departure and the delay, define a variable that contains the actual departure time.

Description of Field           Columns  Data Type
Flight number                  1-3      Character
Departure date                 4-10     ddMONyy
Departure time                 11-15    Time/Character hh:mm
Destination                    16-18    Character (3-letter airport code)
Flight distance (mi)           19-22    Numeric
Ticketed passengers boarding   23-25    Numeric
Transfers from other flights   26-28    Numeric
Non-revenue passengers         29-31    Numeric
Aircraft Capacity              32-34    Numeric
Departure Delay (min)          35-37    Numeric (negative indicates an early departure)