ISQS 6347, Data & Text Mining
ISQS 3358, Business Intelligence

Data Preparation for Analytics Using SAS

Zhangxi Lin
Texas Tech University

Outline

- Exercise 9
- An overview of data preparation for analytics
- SAS Programming Essentials
  - Running SAS programs
  - Mastering fundamental concepts
  - SAS program debugging
- Exercise 10

Exercise 9: Creating Summarized Output

Chapter 5 exercises: pp. A-15 to A-20
1. Producing summary statistics
2. Creating a custom format
3. Producing a summary table
4. Optional: Displaying percentage in a table
5. Producing a bar chart

Deliverables:
- Screenshot of the Project Designer
- Screenshots of the above HTML outputs (not necessary to show the complete output)

Structure and Components of Business Intelligence

Overview: From Data Warehousing to Data Analysis

Previous major topics in data warehousing (using SQL Server 2005)
- Dimensional model design
- ETL
- Cube design and OLAP

Data analysis topics (using SAS)
- Data preparation
  - Analytic business questions
  - Data format and data conversion
  - Data quality
- Data exploration
- Data analysis

US Car Theft

The number of U.S. motor vehicle thefts decreased by 1.9 percent from 2003 to 2004, the first decrease since 1999. In 2004, the value of stolen motor vehicles was $7.6 billion, down from $8.6 billion in 2003. The average value of a motor vehicle reported stolen in 2004 was $6,143, compared with $6,797 in 2003.

2004 Theft Statistics

- Every 26 seconds, a motor vehicle is stolen in the United States.
- The odds of a vehicle being stolen were 1 in 190 in 2003. The odds are highest in urban areas.
- U.S. motor vehicle thefts fell 1.9 percent in 2004 from 2003, according to the FBI's Uniform Crime Reports. In 2004, 1,237,114 motor vehicles were reported stolen.
- The West was the only region with an increase in motor vehicle thefts from 2003 to 2004, up 3.2 percent. Thefts fell 9.7 percent in the Northeast, 4.4 percent in the Midwest and 2.9 percent in the South.
- Nationwide, the 2004 motor vehicle theft rate per 100,000 people was 421.3, down 2.9 percent from 433.7 in 2003.
- Only 13.0 percent of thefts were cleared by arrests in 2004.
- Carjackings occur most frequently in urban areas. They account for only 3.0 percent of all motor vehicle thefts.
- The average comprehensive insurance premium in the U.S. rose 11.2 percent from 1999 to 2003.
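
Several of the 2004 figures quoted above can be cross-checked against one another. The DATA step below is a minimal sketch of that arithmetic using only numbers from the slides. The variable names are illustrative assumptions, and the 366-day year assumes 2004 is counted as a leap year.

```sas
data _null_;
   thefts_2004 = 1237114;   /* vehicles reported stolen in 2004 */
   value_2004  = 7.6e9;     /* total value of stolen vehicles, in dollars */

   /* Average value per stolen vehicle (the slides quote $6,143) */
   avg_value = value_2004 / thefts_2004;

   /* Relative change in the theft rate per 100,000 people
      (the slides quote a 2.9 percent decrease) */
   rate_change = (421.3 - 433.7) / 433.7;

   /* Seconds between thefts over a 366-day year
      (the slides quote roughly 26 seconds) */
   seconds_per_theft = 366 * 24 * 60 * 60 / thefts_2004;

   /* Write the results to the SAS log */
   put avg_value= dollar12.2 rate_change= percentn8.1 seconds_per_theft= 8.1;
run;
```

The values written to the log should land close to the quoted figures, which suggests the slide numbers are internally consistent.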

Business Question

If used Honda Accords rank first in auto thefts, should the insurance premium for a Honda Accord be higher than for other makes of car? Should the insurance for a used Honda be higher than for a brand-new Honda?

Why?

Other Analytic Business Questions

- How do factors such as time, branch, promotion, and price influence the sales of a soft drink?
- Which customers have a high cancellation risk in the next month?
- How can customers be segmented based on their purchase behavior?
- Statistics showed that an online recommendation system may increase sales by 20%, and the accuracy rate of the system is 40%. A newer algorithm can increase the accuracy rate to 50%. Should the expected sales increase be raised to 20% * 125% = 25%?
- Airlines are considering overbooking seats because a certain percentage of customers cancel their flights at the last minute. If the average cancellation rate is 10%, should the overbooking rate be 10% as well? If a cancellation is charged 5% of the fare, how large should the penalty be for a sold-out flight caused by overbooking?
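
The overbooking question can be explored with a simple probability model. The sketch below assumes a hypothetical 100-seat flight, 110 tickets sold (a 10% overbooking rate), and ticket holders who cancel independently with probability 10%. None of these modeling choices come from the slides. It uses the binomial distribution to estimate how often more passengers show up than there are seats.

```sas
data _null_;
   seats   = 100;    /* hypothetical cabin capacity */
   tickets = 110;    /* capacity plus 10% overbooking */
   p_show  = 0.90;   /* 1 minus the 10% cancellation rate */

   /* Expected number of ticket holders who show up */
   expected_show = tickets * p_show;

   /* Probability that more than 100 of the 110 ticket holders show up:
      1 - P(X <= seats) for X ~ Binomial(tickets, p_show) */
   p_oversold = 1 - cdf('BINOMIAL', seats, p_show, tickets);

   /* Write the results to the SAS log */
   put expected_show= p_oversold= percent8.1;
run;
```

Under these assumptions the expected number of arrivals is 99, just under capacity, yet the chance of an oversold flight is clearly not negligible. Matching the overbooking rate to the average cancellation rate is therefore not automatically safe, and the sold-out penalty has to be weighed against that risk.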

Analysis Process

- Select an analysis method
- Identify the data source
- Prepare the data (collecting, cleansing, reorganizing, extracting, transforming, loading)
- Execute the analysis
- Interpret the analysis
- Automate data preparation and execution of the analysis, if the business question has to be answered more than once
  - ETL
  - Stored procedures

The above steps can also be iterated, not necessarily performed in sequential order.

We focus on the data preparation step.

Characteristics of Analytic Business Questions

- Analysis complexity: real analysis or reporting
- Analysis paradigm: statistics or data mining
- Data preparation paradigm: as much data as possible or business knowledge first
- Analysis method: supervised or unsupervised analysis
- Scoring needed: yes/no
- Periodicity of analysis: one-shot or re-run
- Historic data needed: yes/no
- Data structure: one row or multiple rows per subject
- Complexity of the analysis team

Components of the SAS System

[Figure: components of the SAS System]

- Base SAS
- Reporting and Graphics
- Data Access and Management
- User Interface
- Analytical
- Application Development
- Visualization and Discovery
- Business Solutions
- Web Enablement

SAS Programming Essentials

Find more information at http://support.sas.com

Data-driven Tasks

The functionality of the SAS System is built around four data-driven tasks common to virtually any application:
- Data access
- Data management
- Data analysis
- Data presentation

Turning Data into Information

Process of delivering meaningful information
- 80% data-related
  - Access
  - Scrub
  - Transform
  - Manage
  - Store and retrieve
- 20% analysis

SAS Programs

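[Figure: raw data or an existing SAS data set is read by a DATA step, which creates a SAS data set; a PROC step then processes that data set to produce a report]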

DATA steps are typically used to create SAS data sets

PROC steps are typically used to process SAS data sets

A SAS program is a sequence of steps that the user submits for execution
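
A minimal sketch of such a program, with one DATA step followed by one PROC step, might look like the following. The raw data lines reuse the regional 2003 to 2004 theft changes quoted earlier in the slides. The data set name `work.theft_change` and the variable names are illustrative assumptions.

```sas
/* DATA step: create a SAS data set from raw data lines */
data work.theft_change;
   length region $ 12;          /* room for the longest region name */
   input region $ pct_change;   /* list input: one name, one number */
   datalines;
West 3.2
Northeast -9.7
Midwest -4.4
South -2.9
;
run;

/* PROC step: process the data set and produce a list report */
proc print data=work.theft_change;
run;
```

Each step ends at its RUN statement, so SAS compiles and executes the DATA step before it starts the PROC step.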

Preparation of SAS Programming

Data sets: \SAS-Programming

Create a user-defined library reference

Statement:
    LIBNAME libref 'SAS-data-library' <options>;

Example:
    LIBNAME ia 'c:\workshop\winsas\prog1';

Two-level SAS file names: libref.filename
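
As a hedged illustration of how a libref is typically used once assigned, the sketch below reuses the example path from the slide and lists whatever the library contains. It assumes that folder exists on the machine running SAS. The data set name mentioned in the comment is hypothetical.

```sas
/* Assign the libref ia to a folder; the path must exist on your machine */
libname ia 'c:\workshop\winsas\prog1';

/* List the members of the library without per-data-set details */
proc contents data=ia._all_ nods;
run;

/* A data set stored in that folder is referenced with a two-level name, */
/* for example ia.mydata (hypothetical data set name).                   */

/* Release the libref when it is no longer needed */
libname ia clear;
```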

SAS Programming Essentials

Demo: c02s2d1
Exercise: c02ex1

SAS Data Sets

SAS data sets have a descriptor portion and a data portion.

The descriptor portion contains general information about the SAS data set and its variable attributes.

The CONTENTS procedure displays the descriptor portion of a SAS data set:

    PROC CONTENTS DATA=SAS-data-set;
    RUN;
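
For a concrete run of this syntax, one option is the small sample data set SASHELP.CLASS, which is normally installed with SAS. The slides do not use it, so treat the choice of data set as an assumption.

```sas
/* Show the descriptor portion: data set information and variable attributes */
proc contents data=sashelp.class;
run;
```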

SAS Data Sets: Data Portion

The PRINT procedure displays the data portion of a SAS data set:

    PROC PRINT DATA=SAS-data-set;
    RUN;
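
The same sample data set, SASHELP.CLASS (an assumption, not part of the slides), can illustrate the data portion. This sketch also shows two common refinements: the OBS= data set option to limit the rows and a VAR statement to choose the columns.

```sas
/* Print only the first five observations and two chosen variables */
proc print data=sashelp.class (obs=5);
   var name age;
run;
```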

SAS Data Set Terminology

- SAS data set = SAS table
- Variable = Column
- Observation = Row

SAS Comments

/* begins a comment
*/ ends a comment
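
A short sketch of comments in context follows. Besides the block form described above, it shows the statement-style comment, which starts with an asterisk and ends with a semicolon. That second form is not on the slide and is included here as an addition. SASHELP.CLASS is again an assumed sample data set.

```sas
/* This block comment can span
   several lines. */
proc print data=sashelp.class;   /* a comment can follow a statement */
run;

* A statement-style comment starts with an asterisk and ends with a semicolon;
```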

SAS Program Debugging

Demo: c02s4d1
Exercise: c02ex7

Outline of SAS Programming – Level I

- Fundamentals (ch1-3)
- Producing list reports (ch4)
- Enhancing output (ch5)
- Creating data sets (ch6)
- Data step programming (ch7)
  - Reading data
  - Creating variables
  - Conditional processing
  - Keeping and dropping variables
  - Reading Excel files
- Combining SAS data sets (ch8)
- Producing summary reports (ch9)
- SAS graphing (ch10)

Making Use of SAS Enterprise Guide Code

- Import a text file. Example: Orders.txt
- Import an Excel file. Example: SupplyInfo.xls
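
The slides do not show the contents of Orders.txt or the code that Enterprise Guide generates for it, so the following is only a self-contained sketch of importing a delimited text file with PROC IMPORT. It writes a tiny comma-delimited file to a temporary location first. The column names, values, and output data set name are all made up for illustration, and Enterprise Guide's generated code may look quite different.

```sas
/* Write a tiny comma-delimited file to a temporary location */
filename demo temp;

data _null_;
   file demo;
   put 'order_id,amount';
   put '1001,25.50';
   put '1002,14.00';
run;

/* Import the delimited text file into a SAS data set */
proc import datafile=demo
            out=work.orders_demo
            dbms=csv
            replace;
   getnames=yes;   /* take variable names from the first line */
run;

proc print data=work.orders_demo;
run;
```

For a real file, the fileref would be replaced by the quoted path to the file, and the DBMS= value would need to match its actual delimiter.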

Learn from Examples

SAS Help Contents -> Learning to use SAS -> Sample SAS Programs -> Base SAS

"Base Usage Guide Examples": Chapters 3 and 4

Import an Excel Sheet

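    /* Read the "Company" sheet of the workbook into work.commrex */
    proc import out=work.commrex
                datafile="C:\Lin\Shared\ISQS6339\Commrex_3358.xls"
                dbms=excel replace;
       sheet="Company";
       getnames=yes;
       mixed=no;
       scantext=yes;
       usedate=yes;
       scantime=yes;
    run;

    /* Print the imported data set */
    proc print data=work.commrex;
    run;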

Excel SAS/ACCESS LIBNAME Engine

libname xlsdata 'C:\Lin\Shared\ISQS6339\Commrex_3358.xls';

proc print data=xlsdata.New1;
run;

Exercise 10: SAS Data Step Programming

- Import and print the text file Userlog00.txt
- Import and print the Excel sheet customers in Afs_customers.xls
- Send the Word file containing the screenshots of the two outputs to [email protected]