Upload
juliet-stafford
View
219
Download
0
Embed Size (px)
Citation preview
Chapter 9 Cleansing and Augmenting
the Data
Xiaogang Su
Department of Statistics
University of Central Florida
Section 9.1 Cleansing and Augmenting the Data
Objectives
• Modify the data using the functions
— UPCASE
— COMPBL
— TRANWRD.
Scenario
The secretary at Statistics Department maintains some faculty information. The data set contains information about all faculty in Fall 2001.
...
UPCASE Function
The UPCASE function converts all of the letters in the data value into uppercase.
var=UPCASE(var);
country=upcase(country);COUNTRY
france
BEFORE
COUNTRY
FRANCE
AFTER
Example:
...
COMPBL Function
The COMPBL function compresses multiple consecutive blanks in a data value into one blank. Since the length of a variable is set at compilation, the resulting data value is padded with blanks.
Example:
var=COMPBL(var);
name=compbl(name);NAME
DE PABLOS
BEFORE
NAME
DE PABLOS
AFTER
...
TRANWRD Function
The TRANWRD function replaces all occurrences of a pattern of characters in a data value with another pattern of characters.
Example:
var=TRANWRD(var, target, replacement);
name=tranwrd(name,’Miss’, ’Ms’);NAME
Miss. Joy Ho
BEFORE
Ms. Joy Ho
AFTER
NAME
...
Cleansing Existing Data
This demonstration illustrates
• changing Country to be always uppercase for USA and only the first letter be upper case for other counties
• removing multiple blanks from a last name
Program: pg2-ch9-ex01.sas
Section 9.2 Creating New Character Variables
Objectives
• Modify character data values.
• Specifically, use the
SCAN function
TRIM function
concatenation
operator
SUBSTR function
...
[email protected]@[email protected]@pegasus.cc.ucf.edu
ScenarioUniversity has two email systems “pegasus” and “mail”. The e-mail address structures are in the form of
<Firstinitial>[email protected]
Or <Firstinitial>[email protected]
...
Refresher Quiz!
1. What are the two phases of the DATA step?
Compilation and Execution
2. What are the three required attributes of all variables?
Name, Type, and Length
3. In what phase are these attributes assigned?
Compilation
...
SUBSTR Function
The SUBSTR function extracts a portion of the character data value based on how many characters are designated for retrieval.
var1=SUBSTR(var, start ,<number of chars>);
NAME
Dorothy Ename1 = substr(name,1,3); NAME1
Dor
...
Retrieve First Initial
The first step is to retrieve the initial letter of an employee’s first name.
first_initial = substr(emp_firstname,1,1);
EMP_FIRSTNAME
Dorothy E
FIRST_INITIAL
D
...
Retrieve Middle Initial
The second step is to retrieve the middle initial.
Problems:
• Not all middle initials are in the same location, so you cannot use the SUBSTR function.
• Not all employees have a middle initial.
lastname firstname email MILLS DOROTHY E [email protected] BOWER EILEEN A. [email protected] READING TONY R. [email protected] JUDD CAROL A. [email protected] ANDERSON CHRISTOPHER [email protected]
...
SCAN Function
The SCAN function extracts a portion of the character data value based on what word-number to retrieve.
Example:
var1=SCAN(var, word-number, <delimiter(s)>);
NAME
Dorothy Ename1=scan(name,2,' ');
NAME1
E
...
Retrieve Middle Initial
middle_initial=substr(scan(emp_firstname,2,' '),1,1);
Example:
emp_firstname = Dorothy E
Second word = 'E'
...
Retrieve Middle Initial
middle_initial=substr(scan(emp_firstname,2,' '),1,1);
Example:
emp_firstname = Dorothy E
1 character = E
Second word = 'E'
Retrieve Middle Initial
middle_initial=substr(scan(emp_firstname,2,' '),1,1);
Example:
emp_firstname = Eileen Alma
Second word = 'Alma'
...
Retrieve Middle Initial
middle_initial=substr(scan(emp_firstname,2,' '),1,1);
Example:
emp_firstname = Eileen Alma
1 character = A
Second word = 'Alma'
...
Retrieve Middle Initial
middle_initial=substr(scan(emp_firstname,2,' '),1,1);
Example:
emp_firstname = Hanna
Second word = missing
...
When you put the e-mail address together, separate assignments must be made for those with middle initials and for those without.
Retrieve Middle Initial
middle_initial=substr(scan(emp_firstname,2,' '),1,1);
Example:
emp_firstname = Hanna
1 character = missing
...
Concatenation Operator
The third step is to put the first initial, middle initial, and last name together.
The concatenation operator joins character data values together.
var2 = var !! var1;
Concatenation Operator
Example: newname=name !! name1;
Compilation
NAME1
$
6
NEWNAME
$
15+
NAME
$
9
...
Concatenation Operator
Example: newname=name !! name1;
Dorothy E
2 spaces
NEWNAME
$
15
Dorothy E
Execution
NAME
$
9
NAME1
$
6
+
...
TRIM Function
The TRIM function removes trailing blanks from a character data value during execution.
var2=TRIM(var) !! var1;
TRIM Function
Example: newname=trim(name) !! name1;
NEWNAME
$
15
NAME1
$
6
NAME
$
9
Compilation
+
...
TRIM Function
Example: newname=trim(name) !! name1;
0 spaces
7 spaces
NEWNAME
$
15
DorothyEDorothy E
Execution
NAME
$
9
NAME1
$
6
+
...
TRIM Function
Example: newname=trim(name) !! ‘ ‘ !! name1;
1 space
7 spaces
NEWNAME
$
16
Dorothy EDorothy E
Execution
NAME
$
9
NAME1
$
6
+ +space
...
Pegasus, Mail, or Other email system
This demonstration illustrates putting all the pieces together.
Program: pg2-ch9-ex02.sas
Section 9.3 Creating New Numeric Variables
Objectives
• Modify numeric data values.
• Specifically, use the
SUM Function
ROUND Function
INT Function
MEAN Function
Scenario
International Airlines will provide a retroactive, prorated bonus to employees for their years of service up to and including a maximum of ten years of past company service.
sum_ avg_
EMP_ID bonus bonus
E00002 18956.98 2708
E00003 60211.06 12042
E00004 4218.99 4219
E00005 17142.96 1905
E00007 20361.21 2909
The variable names that containthe bonuses areBONUS1, BONUS2,and so on.
Implied Variable Lists
Because the variable names have the same prefix word followed by a numbered suffix, you can use an implied implied variable listvariable list to refer to a group of variables.
The syntax may vary slightly based on the statement in which the implied variable list is used, but the basic syntax is
varA-varN
where A is a starting number, and N is an ending number. The numbers must be consecutive.
The SUM Function
The SUM function
• adds variable values together
• ignores missing values.
var=SUM(var1, var2, …, varN);
var=sum(var1, var2, var3);
Example:
VAR
18
VAR1
12
VAR2
.
VAR3
6
...
The SUM Function
When you use an implied variable list, use the keyword OF in front of the first variable name to prevent subtraction from occurring.
Example:
var=SUM(OF var1-varN);
var=sum(of var1-var3);VAR1
12
VAR2
.
VAR3
6
VAR
18
The MEAN Function
The MEAN function returns the arithmetic mean (average) and ignores missing values.
Example:
var=MEAN(var1, var2,…,varN);
var=mean(of var1-var3);VAR
9
VAR1
12
VAR2
.
VAR3
6
...
The ROUND Function
The ROUND function returns a value rounded to the nearest round-off unit. If round-off unit is not provided, the variable is rounded to the nearest integer.
Any number or fractional value can be used as a round-off unit.
var=ROUND(var1,<round-off unit>);
The ROUND Function
Examples:
VAR1A
12
VAR2A
43
VAR3A
6.5
VAR4A
-6.5
VAR1
12.12
VAR2
42.54
VAR3
6.472
VAR4
-6.47
var1a = round(var1);var2a = round(var2);var3a = round(var3,.1);var4a = round(var4,.1);
The INT Function
The INT function returns the integer portion of an argument.
var=INT(var1);
The INT Function
var1a = int(var1);var2a = int(var2);var3a = int(var3);var4a = int(var4);
Examples:
VAR1
12.12
VAR2
42.54
VAR3
6.472
VAR4
-6.47
VAR1A
12
VAR2A
42
VAR3A
6
VAR4A
-6
Section 9.4 Creating New Numerical Variables Based on Dates
Objectives
• Modify numeric date values.
• Specifically, use
SAS time and datetime constants
the INTNX function
the TODAY function
the YEAR function
...
Scenario
New employees are eligible for benefits on the first of the month following their start dates.
benefit_ emp_hire_ benefit_Obs date date LASTNAME FIRSTNAME days 1 05/01/1999 01APR1999 ANDERSON SANDRA 3342 06/01/1999 10MAY1999 BARBOUR DAVID 3033 02/01/2000 01JAN2000 BELL MICHAEL 584 08/01/1999 18JUL1999 BENTZ ALICE 2425 02/01/2000 03JAN2000 BOLCH AMY 58
Refresher Quiz!
1. What date is the starting point from which all SAS dates are based?
2. What option takes effect if only a two-digit year is specified?
January 1, 1960
YEARCUTOFF=
3. How do you specify a SAS date constant for September 12, 2008? '12sep2008'd
...
SAS Times
Just as SAS has a starting point for dates, it also has a starting point for times.
Time is measured as the number of seconds since midnight.
'hh:mm<:ss.s>'T
SAS Datetimes
SAS datetimes are a combination of dates and times, and are measured as the number of seconds since January 1, 1960.
'ddmmmyyyy:hh:mm<:ss.s>'DT
The INTNX Function
The INTNX function advances a date, time, or datetime value by a given interval, and returns a date, time, or datetime value.
Example:
var1=intnx('year',var,1);VAR
0
SAS date for 01JAN1960
VAR1
366
SAS date for 01JAN1961
var=INTNX('interval',start-from,increment);
...
The TODAY Function
The TODAY function returns the current date as a SAS date.
Example:
Assume today’s date is September 12, 2008.
var1=today(); VAR1
17787
SAS date for 12SEP2008
var=TODAY();
...
Other DATE Functions
There are other functions that return values from SAS dates.
var=YEAR(SAS-date);
var=MONTH(SAS-date);
var=DAY(SAS-date);
Other DATE Functions
Example:
Assume today’s date is September 12, 2008.
THISYEAR=YEAR(TODAY());
THISDAY
12
THISMONTH=MONTH(TODAY());
THISDAY=DAY(TODAY());
THISMONTH
9
THISYEAR
2008
...
Calculating Eligibility Date
This demonstration illustrates putting all the pieces together.
Program: pg2-ch9-ex03.sas