Sort and Accumulating Totals

Last Updated: 29 June 2004

Center of Excellence

Data Warehousing


Objectives

Understand how the SAS System initializes the value of a variable in the PDV.

Prevent reinitialization of a variable in the PDV.

Create an accumulating variable.

Creating an Accumulating Variable

The SAS data set prog2.daysales contains daily sales data for a retail store. There is one observation for each day in April showing the date (SaleDate) and the total receipts for that day (SaleAmt).

SaleDate     SaleAmt
01APR2001     498.49
02APR2001     946.50
03APR2001     994.97
04APR2001     564.59
05APR2001     783.01
06APR2001     228.82
07APR2001     930.57
08APR2001     211.47
09APR2001     156.23
10APR2001     117.69
11APR2001     374.73
12APR2001     252.73

The store manager also wants to see a running total of sales for the month as of each day.

Partial Output

SaleDate     SaleAmt    Mth2Dte
01APR2001     498.49     498.49
02APR2001     946.50    1444.99
03APR2001     994.97    2439.96
04APR2001     564.59    3004.55
05APR2001     783.01    3787.56

Creating Mth2Dte

An accumulating variable must retain its value from one observation to the next.

   Mth2Dte=Mth2Dte+SaleAmt;

By default, variables created with an assignment statement are initialized to missing at the top of the DATA step. On its own, this statement therefore adds SaleAmt to a missing value on every iteration, and the result is always missing.

The RETAIN Statement

General form of the RETAIN statement:

   RETAIN variable-name <initial-value> ... ;

The RETAIN statement prevents SAS from reinitializing the values of new variables at the top of the DATA step.

Previous values of retained variables are available for processing across iterations of the DATA step.

The RETAIN statement
- retains the value of the variable in the PDV across iterations of the DATA step
- initializes the retained variable to missing before the first execution of the DATA step if an initial value is not specified
- is a compile-time-only statement.

Retain Mth2Dte and Set an Initial Value

If you do not supply an initial value, all the values of Mth2Dte will be missing.

   retain Mth2Dte 0;

   data mnthtot;
      set prog2.daysales;
      retain Mth2Dte 0;
      Mth2Dte=Mth2Dte+SaleAmt;
   run;

Creating an Accumulating Variable

   data mnthtot;
      set prog2.daysales;
      retain Mth2Dte 0;
      Mth2Dte=Mth2Dte+SaleAmt;
   run;

Input data (SaleDate shown as an unformatted SAS date value):

SaleDate    SaleAmt
15066        498.49
15067        946.50
15068        994.97
15069        564.59
15070        783.01

Compile: the PDV is created with the variables SALEDATE, SALEAMT, and MTH2DTE. MTH2DTE is flagged as retained (R).

Execute: contents of the PDV at each step.

Step                                    SALEDATE    SALEAMT    MTH2DTE (R)
Initialize PDV                          .           .          0
Iteration 1: SET reads observation      15066       498.49     0
Iteration 1: 0+498.49                   15066       498.49     498.49
Iteration 1: implicit output            15066       498.49     498.49
Iteration 1: implicit return            15066       498.49     498.49
Iteration 2: SET reads observation      15067       946.50     498.49
Iteration 2: 498.49+946.50              15067       946.50     1444.99
Iteration 2: implicit output            15067       946.50     1444.99
Iteration 2: implicit return            15067       946.50     1444.99
Iteration 3: SET reads observation      15068       994.97     1444.99
Iteration 3: 1444.99+994.97             15068       994.97     2439.96
Iteration 3: implicit output            15068       994.97     2439.96
Iteration 3: implicit return            15068       994.97     2439.96

At each implicit output, the observation is written out to mnthtot. At each implicit return, MTH2DTE keeps its value instead of being reset to missing.

Continue processing until end of SAS data set.

   proc print data=mnthtot noobs;
      format SaleDate date9.;
   run;

Partial PROC PRINT Output

SaleDate     SaleAmt    Mth2Dte
01APR2001     498.49     498.49
02APR2001     946.50    1444.99
03APR2001     994.97    2439.96
04APR2001     564.59    3004.55
05APR2001     783.01    3787.56
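The iteration-by-iteration behavior of a retained variable can also be observed in the SAS log. The following is a minimal sketch and not part of the original course files. It assumes a small WORK data set named daysales (a hypothetical stand-in for prog2.daysales, built from the first five rows of the listing) and uses PUTLOG to write the value of Mth2Dte before and after the assignment.

```sas
/* Hypothetical stand-in for prog2.daysales: first five rows of the listing */
data daysales;
   input SaleDate :date9. SaleAmt;
   format SaleDate date9.;
   datalines;
01APR2001 498.49
02APR2001 946.50
03APR2001 994.97
04APR2001 564.59
05APR2001 783.01
;
run;

data mnthtot;
   set daysales;
   retain Mth2Dte 0;
   /* Value carried over from the previous iteration (0 on the first one) */
   putlog 'Before: ' _n_= Mth2Dte=;
   Mth2Dte=Mth2Dte+SaleAmt;
   /* Value that the implicit output writes for this observation */
   putlog 'After:  ' _n_= SaleAmt= Mth2Dte=;
run;
```

On the second iteration the "Before" line should show the total carried over from the first observation (498.49), which is the retained value at work. Removing the RETAIN statement and rerunning should show Mth2Dte as missing on every line.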

Accumulating Totals: Missing Values

What happens if there are missing values for SaleAmt?

   data mnthtot;
      set prog2.daysales;
      retain Mth2Dte 0;
      Mth2Dte=Mth2Dte+SaleAmt;
   run;

Undesirable Output

SaleDate     SaleAmt    Mth2Dte
01APR2001     498.49     498.49
02APR2001          .          .    (missing value)
03APR2001     994.97          .
04APR2001     564.59          .
05APR2001     783.01          .

Adding a missing value with the + operator produces a missing result, and that missing result is then retained. All subsequent values of Mth2Dte are missing.
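One way to keep the RETAIN approach and still survive a missing SaleAmt is to replace the + operator with the SUM function, which ignores missing arguments. This is a swapped-in variant offered as a sketch. The data set daysales2 below is a hypothetical three-row stand-in with a missing value on 02APR2001, and the output name mnthtot_sumfn is likewise made up for illustration.

```sas
/* Hypothetical stand-in: SaleAmt is missing on 02APR2001 */
data daysales2;
   input SaleDate :date9. SaleAmt;
   format SaleDate date9.;
   datalines;
01APR2001 498.49
02APR2001 .
03APR2001 994.97
;
run;

data mnthtot_sumfn;
   set daysales2;
   retain Mth2Dte 0;
   /* SUM() skips missing arguments, so a missing SaleAmt */
   /* leaves the running total unchanged                   */
   Mth2Dte=sum(Mth2Dte, SaleAmt);
run;
```

With these three rows, Mth2Dte should be 498.49, 498.49, and 1493.46.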

The Sum Statement

When creating an accumulating variable, an alternative to the RETAIN statement is the sum statement.

General form of the sum statement:

   variable + expression;

The sum statement
- creates the variable on the left side of the plus sign if it does not already exist
- initializes the variable to zero before the first iteration of the DATA step
- automatically retains the variable
- adds the value of the expression to the variable at execution
- ignores missing values.

Accumulating Totals: Missing Values

   data mnthtot2;
      set prog2.daysales2;
      Mth2Dte+SaleAmt;
   run;

   proc print data=mnthtot2 noobs;
      format SaleDate date9.;
   run;

Partial PROC PRINT Output

SaleDate     SaleAmt    Mth2Dte
01APR2001     498.49     498.49
02APR2001          .     498.49
03APR2001     994.97    1493.46
04APR2001     564.59    2058.05
05APR2001     783.01    2841.06
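The expression on the right side of a sum statement does not have to be a data set variable. It can be a constant or a function call, which makes the same statement useful for counters. The sketch below assumes a small hypothetical data set (daysales2_demo) and two illustrative variable names (DaysSeen, DaysMissing) that do not appear in the course data.

```sas
/* Hypothetical three-row sample with one missing SaleAmt */
data daysales2_demo;
   input SaleDate :date9. SaleAmt;
   format SaleDate date9.;
   datalines;
01APR2001 498.49
02APR2001 .
03APR2001 994.97
;
run;

data mnthtot2_demo;
   set daysales2_demo;
   Mth2Dte+SaleAmt;                 /* running total; missing SaleAmt is ignored   */
   DaysSeen+1;                      /* counts observations read so far             */
   DaysMissing+missing(SaleAmt);    /* MISSING() returns 1 when SaleAmt is missing */
run;

proc print data=mnthtot2_demo noobs;
run;
```

For these rows, DaysSeen should be 1, 2, 3 and DaysMissing should be 0, 1, 1, while Mth2Dte stays at 498.49 on the second row and reaches 1493.46 on the third.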

c03s1d1.sas

Objectives

Define First. and Last. processing.

Calculate an accumulating total for groups of data.

Use a subsetting IF statement to output selected observations.

Accumulating Totals for Groups

The SAS data set prog2.empsals contains each employee’s identification number (EmpID), salary (Salary), and division (Div). There is one observation for each employee.

EmpID     Salary    Div
E00004     42000    HUMRES
E00009     34000    FINACE
E00011     27000    FLTOPS
E00036     20000    FINACE
E00037     19000    FINACE
E00048     19000    FLTOPS
E00077     27000    APTOPS
E00097     20000    APTOPS
E00107     31000    FINACE
E00123     20000    APTOPS
E00155     27000    APTOPS
E00171     44000    SALES

Desired Output

Human resources wants a new data set that shows total salary paid for each division.

Div       DivSal
APTOPS    410000
FINACE    163000
FLTOPS    318000
HUMRES    181000
SALES     373000

Grouping the Data

You must group the data in the SAS data set before you can process it by group.

Review of the SORT Procedure

You can rearrange the observations into groups using the SORT procedure.

General form of a PROC SORT step:

   PROC SORT DATA=input-SAS-data-set <OUT=output-SAS-data-set>;
      BY <DESCENDING> BY-variable ...;
   RUN;

The SORT Procedure

The SORT procedure
- rearranges the observations in a SAS data set
- can sort on multiple variables
- creates a SAS data set that is a sorted copy of the input SAS data set when OUT= is specified
- replaces the input data set by default, that is, when OUT= is omitted.

Sorting by Div

   proc sort data=prog2.empsals out=salsort;
      by Div;
   run;
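The general form allows several BY variables, each optionally preceded by DESCENDING. As a small sketch of that, the step below sorts by division and, within each division, by salary from highest to lowest. The data set names emps and emps_sorted are hypothetical. The rows are the first six from the prog2.empsals listing.

```sas
/* Hypothetical sample built from the first six rows of the listing */
data emps;
   input EmpID $ Salary Div $;
   datalines;
E00004 42000 HUMRES
E00009 34000 FINACE
E00011 27000 FLTOPS
E00036 20000 FINACE
E00037 19000 FINACE
E00048 19000 FLTOPS
;
run;

/* DESCENDING applies only to the variable that follows it (Salary) */
proc sort data=emps out=emps_sorted;
   by Div descending Salary;
run;

proc print data=emps_sorted noobs;
run;
```

The sorted copy should list FINACE first (34000, 20000, 19000), then FLTOPS (27000, 19000), then HUMRES (42000). Because OUT= is given, emps itself is left in its original order.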

Processing Data in Groups

Div       Salary    DivSal
APTOPS     20000
APTOPS    100000
APTOPS     50000    170000
FINACE     25000
FINACE     20000
FINACE     23000
FINACE     27000     95000
SALES      10000
SALES      12000     22000

BY-Group Processing

The BY statement in the DATA step enables you to process your data in groups.

General form of a BY statement used with the SET statement:

   DATA output-SAS-data-set;
      SET input-SAS-data-set;
      BY BY-variable ... ;
      <additional SAS statements>
   RUN;

   data divsals(keep=Div DivSal);
      set salsort;
      by Div;
      additional SAS statements
   run;

A BY statement in a DATA step creates temporary variables for each variable listed in the BY statement.

General form of the names of BY variables in a DATA step:

   First.BY-variable
   Last.BY-variable

First. and Last. Values

The First. variable has a value of 1 for the first observation in a BY group; otherwise, it equals 0.

The Last. variable has a value of 1 for the last observation in a BY group; otherwise, it equals 0.

Use these temporary variables to conditionally process sorted, grouped, or indexed data.

First. / Last. Example

To set Last.Div, SAS looks ahead to the next observation to see whether the value of Div changes.

Div       Salary    First.Div    Last.Div
APTOPS     20000    1            0
APTOPS    100000    0            0
APTOPS     50000    0            1
FINACE     25000    1            0
FINACE     20000    0            0
FINACE     23000    0            0
FINACE     27000    0            1
SALES      10000    1            0
SALES      12000    0            1
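Because First. and Last. variables are temporary, they are not written to the output data set. A simple way to inspect them is to copy them into ordinary variables. The sketch below does this for the nine-row example. The data set names divdemo and flags and the variables FirstDiv and LastDiv are hypothetical.

```sas
/* Hypothetical sample: the nine example rows, already in Div order */
data divdemo;
   input Div $ Salary;
   datalines;
APTOPS 20000
APTOPS 100000
APTOPS 50000
FINACE 25000
FINACE 20000
FINACE 23000
FINACE 27000
SALES 10000
SALES 12000
;
run;

data flags;
   set divdemo;
   by Div;
   /* First.Div and Last.Div exist only during the DATA step, */
   /* so copy them into variables that are kept in the output */
   FirstDiv=First.Div;
   LastDiv=Last.Div;
run;

proc print data=flags noobs;
run;
```

The printed FirstDiv and LastDiv columns should be 1 only on the first and last row of each division, respectively.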

What Must Happen When?

There is a three-step process for accumulating totals:

1. Set the accumulating variable to 0 at the start of each BY group.

2. Increment the accumulating variable with a sum statement (automatically retains).

3. Output only the last observation of each BY group.

Accumulating Totals for Groups

1. Set the accumulating variable to 0 at the start of each BY group.

   data divsals(keep=Div DivSal);
      set salsort;
      by Div;
      if First.Div then DivSal=0;
      additional SAS statements
   run;

2. Increment the accumulating variable with a sum statement (automatically retains).

   data divsals(keep=Div DivSal);
      set salsort;
      by Div;
      if First.Div then DivSal=0;
      DivSal+Salary;
      additional SAS statements
   run;

First. / Last. Example

Div       Salary    DivSal
APTOPS     20000     20000
APTOPS    100000    120000
APTOPS     50000    170000
FINACE     25000     25000
FINACE     20000     45000
FINACE     23000     68000
FINACE     27000     95000
SALES      10000     10000
SALES      12000     22000

Subsetting IF Statement

The subsetting IF defines a condition that the observation must meet to be further processed by the DATA step.

General form of the subsetting IF statement:

   IF expression;

If the expression is true, the DATA step continues processing the current observation.

If the expression is false, SAS returns to the top of the DATA step without writing the current observation to the output data set.

Accumulating Totals for Groups

3. Output only the last observation of each BY group.

   data divsals(keep=Div DivSal);
      set salsort;
      by Div;
      if First.Div then DivSal=0;
      DivSal+Salary;
      if Last.Div;
   run;
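The three steps can be tried end to end on the nine-row example. This is a sketch under stated assumptions: salsort_demo is a hypothetical stand-in for salsort that is typed in already sorted by Div, and divsals_demo is a made-up output name.

```sas
/* Hypothetical stand-in for salsort: nine example rows, sorted by Div */
data salsort_demo;
   input Div $ Salary;
   datalines;
APTOPS 20000
APTOPS 100000
APTOPS 50000
FINACE 25000
FINACE 20000
FINACE 23000
FINACE 27000
SALES 10000
SALES 12000
;
run;

data divsals_demo(keep=Div DivSal);
   set salsort_demo;
   by Div;
   if First.Div then DivSal=0;   /* step 1: reset at the start of each group */
   DivSal+Salary;                /* step 2: accumulate (retained)            */
   if Last.Div;                  /* step 3: keep only the group's last row   */
run;

proc print data=divsals_demo noobs;
run;
```

The result should have three observations: APTOPS 170000, FINACE 95000, and SALES 22000. If the input were not sorted by Div, the DATA step would stop with an error that the BY variables are not properly sorted.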

Subsetting IF Statement (Review)

Initialize PDV.

Execute program statements.

IF condition; Is the condition true?

   NO: return to the top of the DATA step (Initialize PDV).

   YES: execute additional program statements, output the observation to the SAS data set, and then return to the top of the DATA step.
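The flow above can be compared with the DELETE statement, which gives the same result when the condition is reversed. The sketch below uses a hypothetical four-row data set (nums) and made-up variable names to show that statements after the subsetting IF run only for observations that pass.

```sas
/* Hypothetical sample data */
data nums;
   input x;
   datalines;
1
2
3
4
;
run;

data evens_if;
   set nums;
   if mod(x, 2)=0;      /* subsetting IF: continue only when x is even  */
   y=x*10;              /* executed only for observations that passed   */
run;

data evens_delete;
   set nums;
   if mod(x, 2) ne 0 then delete;   /* reversed condition with DELETE */
   y=x*10;
run;
```

Both data sets should contain two observations, x=2 with y=20 and x=4 with y=40.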

Accumulating Totals for Groups

Partial Log

NOTE: There were 39 observations read from the data set WORK.SALSORT.
NOTE: The data set WORK.DIVSALS has 5 observations and 2 variables.
NOTE: DATA statement used:
      real time           0.74 seconds
      cpu time            0.33 seconds

   proc print data=divsals noobs;
   run;

PROC PRINT Output

Div       DivSal
APTOPS    410000
FINACE    163000
FLTOPS    318000
HUMRES    181000
SALES     373000

c03s2d1.sas

Input Data

The SAS data set prog2.regsals contains each employee’s ID number (EmpID), salary (Salary), region (Region), and division (Div). There is one observation for each employee.

EmpID     Salary    Region    Div
E00004     42000    E         HUMRES
E00009     34000    W         FINACE
E00011     27000    W         FLTOPS
E00036     20000    W         FINACE
E00037     19000    E         FINACE
E00077     27000    C         APTOPS
E00097     20000    E         APTOPS
E00107     31000    E         FINACE
E00123     20000    NC        APTOPS
E00155     27000    W         APTOPS
E00171     44000    W         SALES
E00188     37000    W         HUMRES
E00196     43000    C         APTOPS
E00210     31000    E         APTOPS
E00222    250000    NC        SALES
E00236     41000    W         APTOPS

Desired Output

Human resources wants a new data set that shows the total salary paid and the total number of employees for each division in each region.

Partial Output

Region    Div       DivSal    NumEmps
C         APTOPS     70000    2
E         APTOPS     83000    3
E         FINACE    109000    4
E         FLTOPS    122000    3
E         HUMRES    178000    5
NC        APTOPS     37000    2
NC        FLTOPS     28000    1

Sorting by Region and Div

The data must be sorted by Region and Div. Region is the primary sort variable. Div is the secondary sort variable.

   proc sort data=prog2.regsals out=regsort;
      by Region Div;
   run;

   proc print data=regsort noobs;
   run;

Partial PROC PRINT Output

Region    Div       Salary
C         APTOPS     27000
C         APTOPS     43000
E         APTOPS     20000
E         APTOPS     31000
E         APTOPS     32000
E         FINACE     19000
E         FINACE     31000

Multiple BY Variables

   data regdivsals;
      set regsort;
      by Region Div;
      additional SAS statements
   run;

Multiple BY Variables: Example

SAS looks ahead to the next observation to set Last.Region and Last.Div.

Region    Div       First.Region    Last.Region    First.Div    Last.Div
C         APTOPS    1               0              1            0
C         APTOPS    0               0              0            0
C         APTOPS    0               1              0            1
E         APTOPS    1               0              1            1
E         FINACE    0               0              1            0
E         FINACE    0               1              0            1
NC        FINACE    1               0              1            1
NC        SALES     0               0              1            0
NC        SALES     0               0              0            0
NC        SALES     0               0              0            0
NC        SALES     0               1              0            1

Multiple BY Variables

When you use more than one variable in the BY statement, a change in the primary variable forces Last.BY-variable=1 for the secondary variable.

Region    Div       First.Region    Last.Region    First.Div    Last.Div
C         APTOPS    1               0              1            0
C         APTOPS    0               1              0            1
E         APTOPS    1               0              1            0
E         APTOPS    0               0              0            0
E         APTOPS    0               0              0            1
E         FINACE    0               0              1            0

   /* Summarize salaries by division */
   data regdivsals(keep=Region Div DivSal NumEmps);
      set regsort;
      by Region Div;
      if First.Div then do;
         DivSal=0;
         NumEmps=0;
      end;
      DivSal+Salary;
      NumEmps+1;
      if Last.Div;
   run;

Partial Log

NOTE: There were 39 observations read from the data set WORK.REGSORT.
NOTE: The data set WORK.REGDIVSALS has 14 observations and 4 variables.
NOTE: DATA statement used:
      real time           0.07 seconds
      cpu time            0.07 seconds

   proc print data=regdivsals noobs;
   run;

Partial PROC PRINT Output

Region    Div       DivSal
C         APTOPS     70000
E         APTOPS     83000
E         FINACE    109000
E         FLTOPS    122000
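A quick way to cross-check group totals produced by First./Last. processing is to compute the same summary with PROC SQL. This is an alternative technique shown only as a sanity check, not the method these notes teach. The sketch assumes a hypothetical data set regsort_demo holding the seven rows from the partial regsort listing, so the E FINACE group is incomplete here and will not match the full-data total.

```sas
/* Hypothetical sample: the seven rows of the partial regsort listing */
data regsort_demo;
   input Region $ Div $ Salary;
   datalines;
C APTOPS 27000
C APTOPS 43000
E APTOPS 20000
E APTOPS 31000
E APTOPS 32000
E FINACE 19000
E FINACE 31000
;
run;

proc sql;
   create table regdivsals_sql as
   select Region,
          Div,
          sum(Salary) as DivSal,    /* total salary per Region and Div */
          count(*)    as NumEmps    /* number of rows in the group     */
   from regsort_demo
   group by Region, Div
   order by Region, Div;
quit;

proc print data=regdivsals_sql noobs;
run;
```

For these rows the C APTOPS group should total 70000 with 2 employees and the E APTOPS group 83000 with 3 employees. The E FINACE row reflects only the two observations typed in above. PROC SQL does not require the input to be sorted, whereas the DATA step with a BY statement does.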

c03s2d2.sas

Questions