Other Data Sources
SAS can read data from a variety of sources:
Plain text files, including delimited and fixed-column files
Spreadsheets, such as Excel
Databases
XML
Others
Text Files
Text files of various types can be read via the data step.
Two essential statements for this operation are:
infile file-reference <options>;
Directs the data step to a file.
The file reference can be a path (in quotes) or a filename reference.
input variable list;
Lists the variables to read, with some instructions for how to read them.
Basic Input
With no options assigned, the infile statement will assume data values are space delimited.
In this case, the most basic form of the input statement is:
input variable_1 <$> … variable_n <$>;
$ indicates that the variable is character.
Each variable name must follow SAS naming conventions.
Simple Example
Read the data file “flights.prn” from the raw data subfolder of the data sets folder:
Code:
data test;
infile '\\seashare\blumj\SAS Programming Data\raw data\flights.prn';
input flight_no date $ destination$ first_class economy;
run;
The infile statement includes a full-path reference to the file to be used.
List (legal) variable names in the order corresponding to the columns in the data.
Each character variable must be followed by a $. It can be appended to the name, or a space can be placed between them.
Results
Something is not quite right in the date column.
Process
data test;
infile '\\seashare\blumj\SAS Programming Data\raw data\flights.prn';
input flight_no date $ destination$ first_class economy;
run;
During compilation, the PDV is set up based on the specifications in the input statement.
PDV flight_no date destination first_class economy
During execution, data is loaded into an input buffer and parsed as specified by the input and infile statements.
Input Buffer
439 12/11/2000 LAX 20 137
Spaces are taken as delimiters between values. Multiple consecutive spaces are seen as a single delimiter.
PDV flight_no date destination first_class economy
439 12/11/20 LAX 20 137
Input Buffer
439 12/11/2000 LAX 20 137
The default length of character variables is 8, so the date 12/11/2000 is stored as 12/11/20.
A length statement (before input) would be useful.
This record is output to the data set, and the process continues until the end of the raw file is reached.
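One way to act on the note about the length statement is sketched below. It assumes the same file path as the example above and a width of 10, which is enough for a date written as mm/dd/yyyy. Because length appears before input, date also becomes the first variable in the PDV.

```sas
data test;
  /* Declare date as a 10-character variable before input, so values   */
  /* such as 12/11/2000 are not truncated to the default 8 characters. */
  length date $ 10;
  infile '\\seashare\blumj\SAS Programming Data\raw data\flights.prn';
  input flight_no date $ destination $ first_class economy;
run;
```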
The FILENAME Statement
The filename statement looks similar to the libname statement:
filename fileref 'path';
The rules for the fileref are the same as those for a libref.
The path can point directly to an individual file or a folder.
One method:
filename myfile '\\seashare\blumj\SAS Programming Data\raw data\flights.prn';
data test;
infile myfile;
input flight_no date $ destination$ first_class economy;
run;
A more flexible method:
filename rawdata '\\seashare\blumj\SAS Programming Data\raw data';
data test;
infile rawdata('flights.prn');
input flight_no date $ destination$ first_class economy;
run;
Since the filename points to a folder, individual files must be requested when the fileref is used.
Pitfalls
Consider the data file “flights2.prn”, which is only slightly different from the previous one (the zero in the 9th record is now a blank):
Process at 9th Record
data test2;
infile rawdata ('flights2.prn');
length date $ 10;
input flight_no date $ destination$ first_class economy;
run;
The multiple spaces are seen as a single delimiter (not as missing). What happens with economy?
PDV date flight_no destination first_class economy
12/15/2000 114 LAX 187
Input Buffer
114 12/15/2000 LAX    187
SAS wants to fill in a value, so it gets more information from the raw file: the next record.
PDV date flight_no destination first_class economy
12/15/2000 114 LAX 187 982
Input Buffer
9 8 2 1 2 / 1 5 / 2 0 0 0 D F W 1 4 3 1
Results
Only 9 records, and the 9th is incorrect.
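The behavior described above can be reproduced without the raw file. The sketch below is only an illustration: it places two of the records quoted on these slides in-line with datalines, in an order chosen for the demonstration, and the data set name flowover_demo is made up. The first record has blanks where first_class belongs.

```sas
data flowover_demo;
  length date $ 10;
  input flight_no date $ destination $ first_class economy;
  /* The first record supplies only four values, so list input moves */
  /* on to the next record to find a value for economy.              */
  datalines;
114 12/15/2000 LAX    187
439 12/11/2000 LAX 20 137
;
run;

proc print data=flowover_demo;
run;
```

If this runs as expected, the log notes that SAS went to a new line, and the output holds a single observation in which first_class is 187 and economy is 439, the flight number of the second record.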
As a general rule, spaces are lousy delimiters. Consider “flights3.prn”, which adds the pilot’s name as the first column. What will happen with this?
Other Delimiters
Instructions for which delimiter is present are set in the infile statement.
Consider “flights.csv”:
The dlm= option allows for specification of the delimiter, which can be a literal keyboard character.
Code:
data test3;
infile rawdata('flights.csv') dlm=',';
length date $ 10;
input flight_no date $ destination$ first_class economy;
run;
More Infile Options
The 9th record is still problematic.
DSD (delimiter-sensitive data):
Ignores delimiters inside quoted values.
Treats consecutive delimiters as containing a missing value.
Automatically changes the default delimiter to a comma.
Update:
data test3;
infile rawdata('flights.csv') dsd;
length date $ 10;
input flight_no date $ destination$ first_class economy;
run;
Results
9th record now has appropriate missing first class value and 10th record reads correctly.
Consider “flights2.csv”:
Here the 9th record is incomplete. With the current code, the result is:
The first class value comes from the flight number on the 10th record.
The economy value is missing since SAS tries to read the date (12/15/2000) here, which is not valid as a number.
Once again, SAS tries to fill in information from the next record.
MISSOVER: Forces all values that cannot be filled with the information in the current input buffer to be set to missing.
Update:
data test4;
infile rawdata('flights2.csv') dsd missover;
length date $ 10;
input flight_no date $ destination$ first_class economy;
run;
Results
9th record now has appropriate missing first class and economy values and 10th record reads correctly.
In “flights3.csv”, column headers are preserved.
The firstobs= option selects the first row to read from the raw data.
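A minimal sketch of firstobs= in use follows. It assumes the headers in “flights3.csv” occupy only the first row, that the remaining columns match those of “flights.csv”, and that the rawdata fileref defined earlier is still assigned. The data set name test5 is a placeholder.

```sas
data test5;  /* hypothetical data set name */
  /* firstobs=2 starts reading at the second row, skipping the header row */
  infile rawdata('flights3.csv') dsd missover firstobs=2;
  length date $ 10;
  input flight_no date $ destination $ first_class economy;
run;
```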
TAB as a Delimiter
TAB is not a standard keyboard character; in the SAS editor it is just a series of spaces.
For “flights.txt” (which is tab delimited), what is the correct form of the dlm= option?
ASCII hexadecimal codes can be given in the form: '##'x
The hex code for TAB is 09.
data test6;
infile rawdata('flights.txt') dlm='09'x;
length date $ 10;
input flight_no date $ destination$ first_class economy;
run;
See if you can read in “flights2.txt”, “flights3.txt” and “flights4.txt”.
Informats
In the previous examples the date was read as character instead of what SAS recognizes as a date.
Many formats can be used in a dual role as informats, which provide instructions for how to read raw data.
Code:
data test7;
infile rawdata('flights.txt') dlm='09'x;
input flight_no date:mmddyy10. destination$ first_class economy;
run;
The colon (:) is used as an operator to attach an informat to a variable.
The dates are converted, but no format is applied. Use a format statement to pick the format desired.
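As a sketch of that last step, the version below adds a format statement to the previous code. The choice of date9. is only one possibility; any date format could be substituted.

```sas
data test7;
  infile rawdata('flights.txt') dlm='09'x;
  input flight_no date :mmddyy10. destination $ first_class economy;
  /* Display the stored date value in a readable form, e.g. 11DEC2000 */
  format date date9.;
run;
```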
Fixed Column Files
In “flights.dat”, data are in fixed positions:
Flight number, columns 1-3
Date, columns 4-11
Destination, columns 12-14
First Class, columns 15-17 (or just 16 & 17)
Economy, columns 18-20
Here the form of the input statement is quite different.
Two potential forms can be used on each variable:
variable <$> start-stop
start is the starting column position, stop is the ending column position.
$ is used to denote character variables.
This form cannot use informats.
@n variable informat.
n is the starting column position.
The informat includes the total width.
The informat determines if the variable is character or numeric.
First version:
data test8;
infile rawdata('flights.dat');
input flight_no 1-3 date$ 4-11 destination$ 12-14 first_class 16-17 economy 18-20;
run;
In this case, information in the input buffer is parsed into pieces corresponding to the specified columns.
Result
Second version:
data test9;
infile rawdata('flights.dat');
input @1 flight_no 3. @4 date mmddyy8. @12 destination $3. @16 first_class 2. @18 economy 3.;
format date date9.;
run;
In this case, information in the input buffer is parsed into pieces corresponding to the start column and informat width.
Result
One method or the other can be chosen for each variable, but different variables can use different methods in the same input statement:
data test10;
infile rawdata('flights.dat');
input flight_no 1-3 @4 date mmddyy8. destination$ 12-14 first_class 16-17 economy 18-20;
format date date9.;
run;
Date is the only variable that needs an informat, so the others can be read using the column span.
Exercise 1
Read the “projects.txt” data, which has the columns: state, job id, date, region, equipment cost, personnel cost and pollution code.
Compute a variable for total job cost and another for pollution type using the encoding:
1 = TSP
2 = LEAD
3 = CO
4 = SO2
5 = O3
Exercise 2
Read the “employee list.csv” data, which has the columns noted in the table below.
In addition to the variables present, you should also compute a bonus and final salary for each employee. The bonuses are 9% for all pilots, 7.5% for all mechanics, and 6.75% for all others.

Description of Field      Data Type
Employee’s Last Name      Character
Employee’s First Name     Character
Country                   Character
City                      Character
Phone Number              Character
Employee ID               Character
Job Code                  Character
Salary                    Currency ($xxx,xxx)
Exercise 3
Read the “delay.dat” data, which has the columns noted in the table below.
The airline classifies departure times in one of four ways: midnight to 7 a.m. as early morning, 7 a.m. to noon as morning, noon to 6 p.m. as daytime, and 6 p.m. to midnight as evening. Create an additional variable to reflect these classifications.
Additionally, using the scheduled time of departure and the delay, define a variable that contains the actual departure time.

Description of Field              Columns   Data Type
Flight number                     1-3       Character
Departure date                    4-10      ddMONyy
Departure time                    11-15     Time/Character hh:mm
Destination                       16-18     Character (3-letter airport code)
Flight distance (mi)              19-22     Numeric
Ticketed passengers boarding      23-25     Numeric
Transfers from other flights      26-28     Numeric
Non-revenue passengers            29-31     Numeric
Aircraft Capacity                 32-34     Numeric
Departure Delay (min; negative    35-37     Numeric
indicates an early departure)