ISQS 6347, Data & Text Mining 1
ISQS 6339, Data Management & Business Intelligence
Data Preparation for Analytics Using SAS
Zhangxi Lin
Texas Tech University
Outline
- An overview of data preparation for analytics
- SAS programming essentials
  - Running SAS programs
  - Mastering fundamental concepts
  - SAS program debugging
- Making use of SAS Enterprise Guide for programming
Structure and Components of Business Intelligence
Overview: From Data Warehousing to Data Analysis
Previous major topics in data warehousing (using SQL Server 2008)
- Dimensional model design
- ETL
- Cube design and OLAP
Data analysis topics (using SAS)
- Data preparation
- Analytic business questions
- Data format and data conversion
- Data cleansing
- Data exploration
- Data analysis
- Data visualization
US Car Theft
The number of U.S. motor vehicle thefts decreased by 1.9 percent from 2003 to 2004, the first decrease since 1999. In 2004, the value of stolen motor vehicles was $7.6 billion, down from $8.6 billion in 2003. The average value of a motor vehicle reported stolen in 2004 was $6,143, compared with $6,797 in 2003.
2004 Theft Statistics
Every 26 seconds, a motor vehicle is stolen in the United States. The odds of a vehicle being stolen were 1 in 190 in 2003. The odds are highest in urban areas.
U.S. motor vehicle thefts fell 1.9 percent in 2004 from 2003, according to the FBI's Uniform Crime Reports. In 2004, 1,237,114 motor vehicles were reported stolen.
The West was the only region with an increase in motor vehicle thefts from 2003 to 2004, up 3.2 percent. Thefts fell 9.7 percent in the Northeast, 4.4 percent in the Midwest and 2.9 percent in the South.
Nationwide, the 2004 motor vehicle theft rate per 100,000 people was 421.3, down 2.9 percent from 433.7 in 2003.
Only 13.0 percent of thefts were cleared by arrests in 2004.
Carjackings occur most frequently in urban areas. They account for only 3.0 percent of all motor vehicle thefts.
The average comprehensive insurance premium in the U.S. rose 11.2 percent from 1999 to 2003.
Business Question
If used Honda Accords rank first in auto thefts, should the insurance premium for a Honda Accord be higher than for other makes of car? Should insurance for a used Honda cost more than for a brand-new Honda?
Why?
Analytic Business Questions
- How do factors such as time, branch, promotion, and price influence the sales of a soft drink?
- Which customers have a high cancellation risk in the next month?
- How can customers be segmented based on their purchase behavior?
- Statistics show that an online recommendation system may increase sales by 20%, and the accuracy rate of the system is 40%. A newer algorithm can increase the accuracy rate to 50%. Should the expected sales increase therefore be 20% * 125% = 25%?
- Airlines are considering overbooking seats because a certain percentage of customers cancel their flights at the last minute. If the average cancellation rate is 10%, should the overbooking rate be 10% as well? If a cancellation is charged 5% of the fare, how much should the penalty be for a sold-out situation caused by overbooking?
Analysis Process
- Select an analysis method
- Identify the data source
- Prepare the data (collecting, cleansing, reorganizing, extracting, transforming, loading)
- Execute the analysis
- Interpret the analysis
- Automate data preparation and execution of the analysis if the business question has to be answered more than once
  - ETL
  - Stored procedures
The above steps can also be iterated; they are not necessarily performed in sequential order.
We focus on the data preparation step.
Characteristics of Analytic Business Questions
- Analysis complexity: real analysis or reporting
- Analysis paradigm: statistics or data mining
- Data preparation paradigm: as much data as possible or business knowledge first
- Analysis method: supervised or unsupervised analysis
- Scoring needed: yes/no
- Periodicity of analysis: one-shot or re-run
- Historic data needed: yes/no
- Data structure: one row or multiple rows per subject
- Complexity of the analysis team
Components of the SAS System
[Figure: components of the SAS System: Base SAS, Reporting and Graphics, Data Access and Management, User Interface, Analytical, Application Development, Visualization and Discovery, Business Solutions, Web Enablement]
SAS Programming Essentials
Find more information from http://support.sas.com
Data-driven Tasks
The functionality of the SAS System is built around four data-driven tasks common to virtually any application:
- Data access
- Data management
- Data analysis
- Data presentation
Turning Data into Information
Process of delivering meaningful information:
- 80% data-related
  - Access
  - Scrub
  - Transform
  - Manage
  - Store and retrieve
- 20% analysis
Turning Data into Information
[Figure: Data -> DATA Step -> SAS Data Sets -> PROC Steps -> Information]
Design of the SAS System
MultiVendor Architecture
[Figure: platforms PC, Workstation, Servers/Midrange, Mainframe, Super Computer; 90% independent, 10% dependent]
Design of the SAS System
MultiEngine Architecture
[Figure: data sources Teradata, SYBASE, Microsoft Excel, ORACLE, dBase, SAP, DB2]
SAS Programming – Level I
- Fundamentals (ch1-3)
- Producing list reports (ch4)
- Enhancing output (ch5)
- Creating data sets (ch6)
- Data step programming (ch7)
  - Reading data
  - Creating variables
  - Conditional processing
  - Keeping and dropping variables
  - Reading Excel files
- Combining SAS data sets (ch8)
- Producing summary reports (ch9)
- SAS graphing (ch10)
Course Scenario
In this course, you work with business data from International Airlines (IA). The various kinds of data that IA maintains are listed below:
- flight data
- passenger data
- cargo data
- employee data
- revenue data
Course Scenario
The following are some tasks that you will perform:
- importing data
- creating a list of employees
- producing a frequency table of job codes
- summarizing data
- creating a report of salary information
SAS Programs
A SAS program is a sequence of steps that the user submits for execution.
[Figure: Raw Data -> DATA Step -> SAS Data Set -> PROC Step -> Report]
DATA steps are typically used to create SAS data sets.
PROC steps are typically used to process SAS data sets (that is, generate reports and graphs, edit data, and sort data).
SAS Programs
DATA step:
data work.staff;
    infile 'raw-data-file';
    input LastName $ 1-20 FirstName $ 21-30
          JobTitle $ 36-43 Salary 54-59;
run;
PROC steps:
proc print data=work.staff;
run;
proc means data=work.staff;
    class JobTitle;
    var Salary;
run;
Step Boundaries
SAS steps begin with either of the following:
- DATA statement
- PROC statement
SAS detects the end of a step when it encounters one of the following:
- a RUN statement (for most steps)
- a QUIT statement (for some procedures)
- the beginning of another step (DATA statement or PROC statement)
Step Boundaries
data work.staff;
    infile 'raw-data-file';
    input LastName $ 1-20 FirstName $ 21-30
          JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff;
proc means data=work.staff;
    class JobTitle;
    var Salary;
run;
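The program above covers two of the three step boundaries: an explicit RUN statement, and the start of another step (PROC PRINT has no RUN and ends where PROC MEANS begins). The QUIT case is not illustrated in the slides. A minimal sketch, assuming work.staff has already been created as above; PROC SQL is used here only as one example of a procedure that ends with QUIT, and AvgSalary is a hypothetical column name:

```sas
/* PROC SQL executes each statement as soon as it is submitted and  */
/* stays active until SAS encounters QUIT (or the start of another  */
/* DATA or PROC step).                                               */
proc sql;
    select JobTitle,
           mean(Salary) as AvgSalary   /* hypothetical column name */
    from work.staff
    group by JobTitle;
quit;
```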
Running a SAS Program
You can invoke SAS in the following ways:
- interactive windowing mode (SAS windowing environment)
- interactive menu-driven mode (SAS Enterprise Guide, SAS/ASSIST, SAS/AF, or SAS/EIS software)
- batch mode
- noninteractive mode
Preparation for SAS Programming
- Data sets: \SAS-Programming
- Create a user-defined library reference
Statement:
LIBNAME libref 'SAS-data-library' <options>;
Example:
LIBNAME ia 'c:\workshop\winsas\prog1';
- Two-level SAS file names: libref.filename
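The following sketch shows how a libref and two-level names might be used together. It assumes the folder from the slide exists and that work.staff was created in an earlier step; ia.staff is a hypothetical data set name used only for illustration.

```sas
/* Assign the libref ia to the course folder (path taken from the slide). */
libname ia 'c:\workshop\winsas\prog1';

/* List the SAS files in the library; NODS suppresses the */
/* descriptor details of the individual data sets.        */
proc contents data=ia._all_ nods;
run;

/* Two-level names: work.staff is temporary, while ia.staff      */
/* (hypothetical name) is stored permanently in the ia library.  */
data ia.staff;
    set work.staff;
run;
```

Data sets in the work library are deleted when the SAS session ends, whereas data sets written to a user-defined library such as ia remain on disk.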
SAS Programming Essentials
Demo: c02s2d1
Exercise: c02ex1
Browsing the Descriptor Portion
General form of the CONTENTS procedure:
PROC CONTENTS DATA=SAS-data-set;
RUN;
Example:
proc contents data=work.staff;
run;
c02s3d1
SAS Data Sets: Data Portion
The data portion of a SAS data set is a rectangular table of character and/or numeric data values.
Variable names are part of the descriptor portion, not the data portion.
LastName   FirstName  JobTitle  Salary
TORRES     JAN        Pilot     50000
LANGKAMM   SARAH      Mechanic  80000
SMITH      MICHAEL    Mechanic  40000
WAGSCHAL   NADJA      Pilot     77500
TOERMOEN   JOCHEN     Pilot     65000
The header row shows the variable names; the rows below it hold the variable values. LastName, FirstName, and JobTitle hold character values; Salary holds numeric values.
SAS Variable Values
There are two types of variables:
- Character variables contain any value: letters, numbers, special characters, and blanks. Character values are stored with a length of 1 to 32,767 bytes. One byte equals one character.
- Numeric variables are stored as floating-point numbers in 8 bytes of storage by default. Eight bytes of floating-point storage provide space for 16 or 17 significant digits. You are not restricted to 8 digits.
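A short sketch of the two variable types. The data set name work.types and the variable names are hypothetical; the point is only that the 8 bytes of a numeric variable are storage, not a limit of 8 digits.

```sas
data work.types;
    length Name $ 20;             /* character variable, 20 bytes     */
    Name      = 'TORRES';
    Salary    = 50000;            /* numeric, 8 bytes by default      */
    BigNumber = 123456789012345;  /* 15 digits still fit in 8 bytes   */
run;

/* The descriptor portion lists each variable's type and length. */
proc contents data=work.types;
run;
```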
SAS Data Set and Variable Names
SAS names have these characteristics:
- can be up to 32 characters long
- can be uppercase, lowercase, or mixed-case
- are not case sensitive
- must start with a letter or underscore; subsequent characters can be letters, underscores, or numerals
Valid SAS Names
Select the valid default SAS names.
- data5mon (valid)
- 5monthsdata (invalid: starts with a numeral)
- five months data (invalid: contains blanks)
- data#5 (invalid: contains the special character #)
- fivemonthsdata (valid)
- FiveMonthsData (valid)
Missing Data Values
LastName   FirstName  JobTitle  Salary
TORRES     JAN        Pilot     50000
LANGKAMM   SARAH      Mechanic  80000
SMITH      MICHAEL    Mechanic      .
WAGSCHAL   NADJA      Pilot     77500
TOERMOEN   JOCHEN               65000
A value must exist for every variable for each observation.
Missing values are valid values.
A numeric missing value is displayed as a period.
A character missing value is displayed as a blank.
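One way missing values might be detected in a DATA step is sketched below. It assumes work.staff exists with the variables shown in the table; work.staff_check and Note are hypothetical names introduced for illustration.

```sas
data work.staff_check;
    set work.staff;
    length Note $ 20;
    /* MISSING() works for both numeric and character arguments.     */
    /* The comparisons Salary = . and JobTitle = ' ' are equivalent. */
    if missing(Salary) then Note = 'Salary missing';
    else if missing(JobTitle) then Note = 'JobTitle missing';
run;

proc print data=work.staff_check;
run;
```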
Browsing the Data Portion
The PRINT procedure displays the data portion of a SAS data set.
By default, PROC PRINT displays the following:
- all observations
- all variables
- an Obs column on the left side
Browsing the Data Portion
General form of the PRINT procedure:
PROC PRINT DATA=SAS-data-set;
RUN;
Example:
proc print data=work.staff;
run;
c02s3d1
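The defaults of PROC PRINT can be overridden. A minimal sketch, assuming work.staff exists with the variables used earlier: the NOOBS option suppresses the Obs column, and the VAR statement selects and orders the variables to display.

```sas
proc print data=work.staff noobs;
    var LastName JobTitle Salary;   /* only these variables, in this order */
run;
```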
SAS Data Set Terminology
SAS documentation and text in the SAS windowing environment use the following terms interchangeably:
SAS Data Set   SAS Table
Variable       Column
Observation    Row
SAS Syntax Rules
SAS statements have these characteristics:
- usually begin with an identifying keyword
- always end with a semicolon
data work.staff;
    infile 'raw-data-file';
    input LastName $ 1-20 FirstName $ 21-30
          JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff;
run;
proc means data=work.staff;
    class JobTitle;
    var Salary;
run;
SAS Syntax Rules
SAS statements are free-format.
- One or more blanks or special characters can be used to separate words.
- They can begin and end in any column.
- A single statement can span multiple lines.
- Several statements can be on the same line.
Unconventional Spacing
data work.staff; infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run; proc means data=work.staff; class JobTitle; var Salary;
run;
SAS Syntax Rules
Good spacing makes the program easier to read.
Conventional Spacing
data work.staff;
    infile 'raw-data-file';
    input LastName $ 1-20 FirstName $ 21-30
          JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff;
run;
proc means data=work.staff;
    class JobTitle;
    var Salary;
run;
SAS Comments
Type /* to begin a comment. Type your comment text. Type */ to end the comment.
/* Create work.staff data set */
data work.staff;
    infile 'raw-data-file';
    input LastName $ 1-20 FirstName $ 21-30
          JobTitle $ 36-43 Salary 54-59;
run;
/* Produce listing report of work.staff */
proc print data=work.staff;
run;
c02s3d2
Syntax Errors
Syntax errors include the following:
- misspelled keywords
- missing or invalid punctuation
- invalid options
Example program containing these errors:
daat work.staff;
    infile 'raw-data-file';
    input LastName $ 1-20 FirstName $ 21-30
          JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff
run;
proc means data=work.staff average max;
    class JobTitle;
    var Salary;
run;
Debugging a SAS Program
This demonstration illustrates how to submit a SAS program that contains errors, diagnose the errors, correct the errors, and save the corrected program.
c02s4d1.sas userid.prog1.sascode(c02s4d1)
c02s4d2.sas userid.prog1.sascode(c02s4d2)
Recall a Submitted Program
Program statements accumulate in a recall buffer each time you issue a SUBMIT command.
Submit Number 1:
daat work.staff;
    infile 'raw-data-file';
    input LastName $ 1-20 FirstName $ 21-30
          JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff
run;
proc means data=work.staff average max;
    class JobTitle;
    var Salary;
run;
Submit Number 2:
data work.staff;
    infile 'raw-data-file';
    input LastName $ 1-20 FirstName $ 21-30
          JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff;
run;
proc means data=work.staff mean max;
    class JobTitle;
    var Salary;
run;
Issue the RECALL command once to recall the most recently submitted program: the Submit Number 2 statements are recalled.
Issue the RECALL command again to recall the Submit Number 1 statements.
Exercise 8: Basic SAS Programming
- Define the libraries IA and Out.
- Go through all SAS programs in Chapters 2-5.
- Write a SAS program to read a data set you created yourself, or simply use Person0.txt in \\TechShare\coba\d\ISQS3358\OtherDatasets\ .
- Output the data set to your library Out.
- Try to apply whatever SAS features from Chapter 5 of Prog-I you can to generate a nice-looking report.
- Go through all exercises for Ch 2, 3, 4, 5, 6 (answer keys are available, so there is no need to submit the results).
Hands-on exercise
Write a SAS program to calculate the number of days passed in 2012 up to 3/3/2012. The input is in the format date9.:
01JAN2012 03MAR2012
Answer: 62 days
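One possible approach to the hands-on exercise is sketched below; it is not necessarily the intended solution. SAS stores dates as the number of days since 01JAN1960, so subtracting two date values gives the number of days between them. The names work.days, StartDate, EndDate, and DaysPassed are hypothetical.

```sas
data work.days;
    /* The colon modifier reads each blank-delimited field with the date9. informat. */
    input StartDate :date9. EndDate :date9.;
    DaysPassed = EndDate - StartDate;   /* 31 (Jan) + 29 (Feb 2012) + 2 = 62 */
    format StartDate EndDate date9.;
    datalines;
01JAN2012 03MAR2012
;
run;

proc print data=work.days noobs;
run;
```

Because 2012 is a leap year, February contributes 29 days, which is why the result is 62 rather than 61.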
Making Use of SAS Enterprise Guide Code
- Import a text file
  Example: Orders.txt
- Import an Excel file
  Example: SupplyInfo.xls
Learn from Examples
SAS Help Contents -> Learning to use SAS -> Sample SAS Programs -> Base SAS
“Base Usage Guide Examples”, Chapters 3 and 4
Import an Excel Sheet
proc import out=work.commrex
    datafile="C:\Lin\Shared\ISQS6339\Commrex_3358.xls"
    dbms=excel replace;
    sheet="Company";
    getnames=yes;
    mixed=no;
    scantext=yes;
    usedate=yes;
    scantime=yes;
run;
proc print data=work.commrex;
run;
Excel SAS/ACCESS LIBNAME Engine
libname xlsdata 'C:\Lin\Shared\ISQS6339\Commrex_3358.xls';
proc print data=xlsdata.New1;
run;
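While a libref points at a workbook, SAS typically keeps the file open, so it is common to copy the data into a SAS data set and then clear the libref. A minimal sketch, assuming SAS/ACCESS Interface to PC Files is available and that New1 (the name used on the slide) exists in the workbook; work.new1 is a hypothetical name.

```sas
libname xlsdata 'C:\Lin\Shared\ISQS6339\Commrex_3358.xls';

/* Copy New1 from the workbook into the temporary WORK library. */
data work.new1;
    set xlsdata.New1;
run;

/* Release the libref so that the workbook is no longer held by SAS. */
libname xlsdata clear;
```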
Exercise 9: SAS Data Step Programming http://zlin.ba.ttu.edu/6339/ExerciseInstructions9.htm