Chapter 1: Creating SAS Data Sets – The Basic

Contents
    Introduction
    Create Simple SAS Data Set
    Reading raw data from external data files
    Reading Data Fields With Special Format
    Permanent SAS Data Set

Introduction

SAS programs consist of SAS statements. SAS statements can be broadly categorized into three groups:
1) System statements group
2) Data steps group
3) Procedure steps group (Chapter 2)

A SAS statement has two important characteristics:

• It usually begins with a SAS keyword.
• It ends with a semicolon (;).

Sample Program Code              Statements (each begins with a keyword)

    options nodate;              ← OPTIONS statement
    data eg1;                    ← DATA statement
    infile 'D:\abc.txt';         ← INFILE statement
    input name $ id age;         ← INPUT statement
    run;                         ← RUN statement
    proc print data=eg1;         ← PROC PRINT statement
    run;                         ← RUN statement

Some Basic System Statements

All SAS system options have default settings. For example, page numbers are displayed automatically unless you change this default setting with an OPTIONS statement.

To modify system options, you can submit an OPTIONS statement.

[Figure labels: System statements group; Data steps group; Procedure steps group]

[Figure labels: SAS Libraries; default SAS Work library (storing temporary data sets); storing permanent data sets]

OPTIONS statement

    OPTIONS option1 <=v1> option2 <=v2> … ;
    RUN;

Commonly used options

    CENTER | NOCENTER
        Controls whether output is centered or left-justified.
        Default: CENTER

    DATE | NODATE
        Controls whether or not today's date appears at the top of each page of output.
        Default: DATE

    LINESIZE = n
        Controls the maximum length of output lines.
        Possible values for n are 64 to 256.

    NUMBER | NONUMBER
        Controls whether or not page numbers appear on each page of SAS output.
        Default: NUMBER

    PAGESIZE = n
        Controls the maximum number of lines per page of output.
        Possible values for n are 15 to 32767.

    YEARCUTOFF = yyyy
        Specifies the first year in a hundred-year span for interpreting two-digit years.
        Default: 1920

Warning: The settings of the OPTIONS statement remain in effect until you modify them, or until you end your SAS session.
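The behaviour described in the warning can be sketched as follows. This is a minimal illustration only: the data set name demo and its values are made up for the example, and the option values are arbitrary choices within the ranges listed above.

```sas
/* Change several system options in one statement.               */
/* They stay in effect for every later step in this SAS session. */
options nodate nonumber linesize=80 pagesize=60;

/* Hypothetical in-line data so that PROC PRINT has something to show */
data demo;
  input Brand $ Wear;
  datalines;
Acme 43
Ajax 34
Atlas .
;
run;

/* Printed without date and page number, lines at most 80 characters wide */
proc print data=demo;
run;

/* Switch the date and page number back on (the documented defaults) */
options date number;
```

Any output produced between the two OPTIONS statements uses the modified settings; output produced after the second one shows the date and page number again.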

Question: What does the following OPTIONS statement do?

    options linesize=110 nodate;

a. suppresses the date and limits the horizontal page size for text output
b. suppresses the date and limits the vertical page size for text output
c. suppresses the date and limits the vertical page size for the log
d. suppresses the date and limits the horizontal page size for the log

Answer: a
LINESIZE= sets the maximum length of output lines, so it limits the horizontal page size. The vertical page size is controlled by PAGESIZE=.

Create Simple SAS Data Set

Raw data file

Raw data is a set of data that has not yet been processed by SAS (not in SAS format)


Raw data exist in many forms, such as text files (*.txt) and Excel files (*.xls).

A typical raw data set

• The title row may or may not be part of a raw data file.
• Each field is either separated by at least one space (blank delimiter) or arranged in fixed columns.

Note: SAS can read files with other delimiters such as commas or tabs.

SAS data set

SAS processes the raw data set and creates a SAS data set. A typical SAS data set has the following properties:

• One observation per row.
• By default, a missing value is represented by
    – a period (.) for a numeric variable
    – a blank string (" ") for a character variable
• By default, values of character variables align on the left and values of numeric variables align on the right.

Question: What type of variable is the variable AcctNum in the data set below?

a. numeric
b. character
c. can be either character or numeric
d. can't tell from the data shown

Correct answer: b
It must be a character variable, because the values contain letters and underscores, which are not valid characters for numeric values. Besides, it aligns on the left.

Question: What type of variable is the variable Wear in the data set below?

a. numeric
b. character
c. can be either character or numeric
d. can't tell from the data shown

Correct answer: a
It must be a numeric variable, because the missing value is indicated by a period rather than by a blank. Besides, it aligns on the right.

Reading raw data from external data files

Data steps statements --- List input method

Two list input methods

• Enter raw data directly into the SAS system
• Read raw data from an external data file

If the values in your raw data file are all separated by at least one space (or other delimiters), then using list input (also called free-formatted input) to read the data may be appropriate.

1) Enter Raw Data Directly into SAS system

Syntax

    DATA data_set_name;
    INPUT varname1 <$> varname2 <$> . . . ;
    <Other DATA step statements>
    DATALINES;
    . . . data . . .
    . . . data . . .
    ;
    RUN;

Example

    data ex1;                ← (1)
    input Brand $ Wear;      ← (2)
    datalines;
    Acme 43
    Ajax 34                  ← (3)
    Atlas .
    ;
    run;

Description

(1) The DATA statement initiates the data input process and defines the name assigned to the SAS data set being created. The DATA statement must be the first statement of a DATA step.

Rules for SAS data set names

• A SAS name can contain from 1 to 32 characters.
• The first character must be a letter or an underscore (_).
• Subsequent characters must be letters, numbers, or underscores.
• Blanks cannot appear in SAS names.

Question: Which of the following variable names is valid?

a. 4BirthDate
b. $Cost
c. _Items_
d. Tax-Rate

Correct answer: c
Variable names follow the same rules as SAS data set names. They can be 1 to 32 characters long, must begin with a letter (A–Z, either uppercase or lowercase) or an underscore, and can continue with any combination of numbers, letters, or underscores.

(2) The INPUT statement defines the list of variables contained in the data set. The $ after Brand indicates that it is a character variable, whereas Wear is a numeric variable.
Note: The rules for SAS variable names are the same as the rules for SAS data set names.

(3) Each line following the DATALINES statement is treated as a data record until a line that contains only a semicolon (;) is reached. A missing value must be represented by a period (.).

Add labels to SAS variables

To improve readability, a label can be attached to a variable when the SAS data set is created, using the LABEL statement:

    LABEL var1 = 'Label1' var2 = 'Label2' ... ;

• A label can be any text up to 256 characters long.
• You may put as many variable names and labels as you like into one LABEL statement.

Example:

    data exer;
    input Brand $ Wear;
    label Brand = 'Brand Name' Wear = 'Wear Size';
    datalines;
    Acme 43
    Ajax 34
    Atlas .
    ;
    run;

2) Reading external raw data text file

    DATA data_set_name ;
    INFILE 'source_file_name' <options>;
    INPUT varname1 <$> varname2 <$> . . . ;
    <Other DATA step statements>
    RUN;

Example

    data Q7;
    infile 'C:\sas\ex1\Q7.txt';
    input Size $ Colour $ Price Cost;
    run;

• 'source_file_name' is a quoted string containing the path and the full name, with extension, of the text file.
• <options> describes the input file's characteristics and specifies how it is to be read with the INFILE statement (INFILE options are discussed further in a later section).

Restrictions for list input method

• Each value is separated by a delimiter (one blank space).
• Specify each variable in the INPUT statement in the order in which it appears in the records of raw data.
• By default, a missing value must be represented by a placeholder such as a period.
• By default, data must be in standard character or numeric format.
• Character values cannot contain embedded blanks if a blank space is used as the delimiter.
• The default maximum length of character variables is 8.

Example

The maximum length of var1 is 8 by default, so only the first 8 characters of a longer value are kept (“abcdefgh”).

Solution: LENGTH statement

    LENGTH varname1 $ n1 varname2 $ n2 … varnameX $ nX ;

• Put the LENGTH statement before the INPUT statement.
• nX sets the length of varnameX in the SAS data set (the default length is 8).
• By default, SAS still reads the raw data record from delimiter to delimiter, regardless of the length.

With a LENGTH statement that sets the length of var1 to 14, var1 can hold values of up to 14 characters.
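A minimal sketch of the LENGTH solution, assuming a made-up data set len_demo with a 14-character value for var1 (the names and values are illustrative, not taken from the original example):

```sas
data len_demo;
  length var1 $ 14;      /* declared before INPUT, so var1 can hold 14 characters */
  input var1 $ var2;     /* list input still reads from delimiter to delimiter    */
  datalines;
abcdefghijklmn 1
short 2
;
run;

proc print data=len_demo;
run;
```

Without the LENGTH statement, the same program would store only the first 8 characters of the first value.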

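If you put the LENGTH statement after the INPUT statement, the length of the variable has already been set by INPUT: SAS writes a warning to the log and the LENGTH statement has no effect on that variable.

Column input method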

Used for reading raw data in which each field is arranged in fixed columns (aligned). In the syntax below, bc is the beginning column and ec is the ending column of a field.

Syntax:

    DATA data_set_name;
    INPUT varname1 <$> bc<-ec> varname2 <$> bc<-ec> . . . ;
    DATALINES;
    data . . .
    ;
    RUN;

    DATA data_set_name;
    INFILE 'source_file_name' ;
    INPUT varname1 <$> bc<-ec> varname2 <$> bc<-ec> . . . ;
    RUN;

Example:

    data quiz1b_q1;
    infile 'd:\temp\quiz1b_data1.txt';
    input id 21-27 name $ 1-20 age 37-38 ;
    run;

Comparison between List input method and Column input method

    List input method                                    | Column input method
    -----------------------------------------------------+-----------------------------------------------------
    Each value is separated by a delimiter (one blank)   | No delimiter is required
    Variables must be listed in INPUT in the order in    | Can read variables in any order
      which they appear in the raw data records          |
    By default, a missing value must be represented by   | A placeholder is not required to indicate a
      a placeholder such as a period                     |   missing value
    Character values cannot contain embedded blanks if   | Allows embedded blanks in character values
      a blank space is used as the delimiter             |   (since no delimiter is required)
    Default maximum length of character variables is 8   | Character values longer than 8 are allowed
    Raw data need not be within fixed columns            | Raw data must be contained within fixed columns
    Data must be in standard character or numeric        | Data must be in standard character or numeric
      format by default                                  |   format by default

Note: Both input methods have the same restriction: data must be in standard format by default.

INFILE Statement Options

1. DELIMITER = 'list-of-delimiting-characters'

Specifies an alternate delimiter to be used for the list input method.
Commonly used delimiters: ',' '!' '&' '09'X (tab)

If a non-blank delimiter is used:

• A character value may contain embedded spaces.
• The default maximum length of character variables is still 8.
• Consecutive delimiters are treated as a single delimiter.
• A blank field between two non-consecutive delimiters is read as a missing value.

Example:

    data school;
    infile datalines delimiter=',';            ← (a)
    length district $ 12;                      ← (b)
    input district teachers_no students_no;
    datalines;
    North Point,,,, 1000 , 30000               ← (c)
    Central, 520 , 16000
    Wan Chai, , 2500                           ← (d)
    ;
    run;

(a) Uses a comma (,) as the delimiter. "datalines" must be named in the INFILE statement when raw data are entered directly (DATALINES statement).
(b) The maximum length of district has to be set to 12.
(c) Consecutive delimiters are treated as a single delimiter, and a character value may contain an embedded space.
(d) A blank field between two delimiters is read as a missing value.

[Output: listing of data set school]

2. DSD

• Uses a comma as the delimiter (you can change the delimiter with the DELIMITER= option).
• Treats consecutive non-blank delimiters as a missing value.
• Removes quotation marks from character values.
• Reads a character value that contains a delimiter within a quoted string.
• The default maximum length of character variables is still 8.

Example:

    data school;
    infile 'D:\district.txt' dsd delimiter='!';     ← use "!" as the delimiter
    length district $ 12;
    input district teachers_no students_no;
    run;

[Output: listing of data set school, showing quotation marks removed from character values, consecutive non-blank delimiters read as a missing value, and a character value that contains a delimiter within a quoted string]

3. MISSOVER

• Add the MISSOVER option if any data are missing at the end of a data line.
• It tells SAS that if it runs out of data on a line, it should not go to the next data line to continue reading. The remaining variables are set to missing.

Example:

Without MISSOVER:

    data case4a;
    input var1 var2 var3;
    datalines;
    1 2
    4 5 7
    1 8 9
    ;
    run;

With MISSOVER:

    data case4b;
    infile datalines missover;
    input var1 var2 var3;
    datalines;
    1 2
    4 5 7
    1 8 9
    ;
    run;

Other INFILE statement options

LRECL = logical-record-length
SAS assumes external files have a record length of 256 or less. If your data lines are long, and it looks like SAS is not reading all your data, then use the LRECL= option in the INFILE statement to specify a record length at least as long as the longest record in your data file.

    INFILE 'C:\MyRawData\President.txt' LRECL=2000;

FIRSTOBS = n
Tells SAS at which line of the raw data file to begin reading data.

OBS = n
Tells SAS to stop reading after line n of the raw data file.

Example (reads lines 3 through 5 of the file):

    DATA icecream;
    INFILE 'D:\Sales.txt' FIRSTOBS = 3 OBS=5;
    INPUT Flavor $ 1-9 Location BoxesSold;
    RUN;

Reading Data Fields With Special Format

Formatted Input Method

Use formatted input for raw data records with special formats that cannot be handled by the basic list input method or the basic column input method alone.

1) Format input for character variables

    INPUT varname1 : $w. varname2 : $w. … ;

• The informat $w. tells SAS to read exactly w columns of characters immediately after the last encountered delimiter.
• It also sets the length of this character variable in the SAS data set to w.
• The informat modifier ':' (colon) tells SAS to use the informat supplied, but to stop reading the value for this variable when a delimiter or the end of the line is encountered.

Example:

    data Qaaa;
    infile datalines delimiter=',' missover;
    input var1 var2 var3 : $9. var4;
    datalines;
    1, ,HELLO,7
    2,4,TEXT,8
    9, , ,6
    21,31,SHORT
    100,200,LAST LINE,999
    ;
    run;

Error: missing informat modifier ':' (colon)

    data Qaaa2;
    infile datalines delimiter=',' missover;
    input var1 var2 var3 $9. var4;
    datalines;
    1, ,HELLO,7
    2,4,TEXT,8
    9, , ,6
    21,31,SHORT
    100,200,LAST LINE,999
    ;
    run;

Note: If the ':' (colon) is missing, SAS reads exactly 9 columns of characters for var3 even if a delimiter is encountered.

2) Format input for numeric variables

    INPUT varname1 : DOLLARw. … ;

• The informat DOLLARw. tells SAS to read exactly w columns of numeric values with special characters immediately after the last encountered delimiter.
• It removes embedded blanks (with non-blank delimiters), thousands commas (with non-comma delimiters), dollar signs ($), and right and left parentheses (which are converted to a minus sign).
• It should always be used together with the informat modifier ':' (colon).

Example:

    data case6;
    infile datalines delimiter=',';
    input age profit : dollar12. ;      ← informat DOLLARw.
    datalines;
    21, $6 750.55                       ← removes embedded blanks and '$'
    19,$1 0000 00
    22, ($3 000)                        ← removes right and left parentheses
    ;
    run;

3) Format input for date fields

It is convenient to store a date as a numeric value so that it can be used in calculations.
    Example: calculating the number of days (weeks, months, or years) between two dates.

A date can be expressed in many different forms.
    Examples: 1/31/90, 31/1/90, 31Jan1990, 90-1-31, 90-31-1

Use SAS date informats to read date values:

    Informat     Date format in raw data                      Width
    DATEw.       17Jan03, 17/Jan/03, 17-Jan-03, 17 Jan 03     7 <= w <= 32
    MMDDYYw.     011703, 01/17/03, 01-17-03, 01 17 03         6 <= w <= 32
    DDMMYYw.     170103, 17/01/03, 17-01-03, 17 01 03         6 <= w <= 32
    YYMMDDw.     030117, 03/01/17, 03-01-17, 03 01 17         6 <= w <= 32
    MONYYw.      Jan03, Jan/03, Jan-03, Jan 03                5 <= w <= 32
    YYMMNw.      0301                                         4 <= w <= 6

Example:

    data date;
    infile datalines delimiter=',';
    input date1 : date9. date2 : ddmmyy8. date3 : mmddyy10.;
    datalines;
    31Dec59, 01011960, 01172003
    31Dec1959, 01-01-60, 01-17-03
    31DEC59, 01/01/60, 01/17/03
    31DEC1959, 01 01 60, 01 17 03
    ;
    run;

How does SAS convert calendar dates to SAS date values?

• A SAS date value is the number of days between 1 Jan 1960 and the specified date.
• Dates before 1 Jan 1960 are negative values; dates after it are positive values.

    Calendar date    SAS date value
    1-Jan-1959       -365
    1-Jan-1960       0
    1-Jan-1961       366
    1-Jan-1962       731

How does SAS know which century a two-digit year belongs to?

If you use two-digit year values in your data lines or external files, you should consider the YEARCUTOFF= option. This option specifies which 100-year span is used to interpret two-digit year values.

The default value of YEARCUTOFF= is 1920:

• Two-digit years 20-99 are assumed to be 1920-1999.
• Two-digit years 00-19 are assumed to be 2000-2019.

    Date Expression    Interpreted As
    12/07/41           12/07/1941
    18Dec15            18Dec2015
    04/15/30           04/15/1930
    15Apr95            15Apr1995

To change the cutoff year value:

    OPTIONS YEARCUTOFF = cutoffyear ;

For example, if you specify YEARCUTOFF=1950, then the 100-year span will be from 1950 to 2049.

    options yearcutoff=1950;

Using YEARCUTOFF=1950, dates are interpreted as shown below:

    Date Expression    Interpreted As
    12/07/41           12/07/2041
    18Dec15            18Dec2015
    04/15/30           04/15/2030
    15Apr95            15Apr1995
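The effect of YEARCUTOFF= on two-digit years can be tried with a small program such as the one below. It is only a sketch: the data set name date_demo and the variable names are invented for illustration, and the two raw values are taken from the table above.

```sas
/* Two-digit years are now interpreted within 1950-2049 */
options yearcutoff=1950;

data date_demo;
  /* the colon modifier stops reading at the blank delimiter */
  input d1 : mmddyy8. d2 : date7.;
  /* SAS date values are plain numbers, so dates can be subtracted */
  days_between = d2 - d1;
  datalines;
12/07/41 15Apr95
;
run;

/* DATE9. shows the four-digit year that SAS actually stored */
proc print data=date_demo;
  format d1 d2 date9.;
run;
```

With this setting, 41 is read as 2041 and 95 as 1995, so days_between is negative. The option stays in effect for the rest of the session unless it is changed again.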

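Question: SAS date values are the number of days since which date?

a) January 1, 1901
b) January 1, 1950
c) January 1, 1960
d) January 1, 2001

Correct answer: c
A SAS date value is the number of days between January 1, 1960 and the specified date.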

Question: In order for the date values 05May1955 and 04Mar2046 to be read correctly, what value must the YEARCUTOFF= option have?

a. a value between 1947 and 1954, inclusive
b. 1955 or higher
c. 1946 or higher
d. any value

Correct answer: d
As long as you specify an informat (e.g. date9.) with the correct field width for reading the entire date value, the YEARCUTOFF= option doesn't affect date values that have four-digit years.

Question: Which time span is used to interpret two-digit year values if the YEARCUTOFF= option is set to 1950?

a. 1950-2049
b. 1950-2050
c. 1949-2050
d. 1950-2000

Correct answer: a
The YEARCUTOFF= option specifies which 100-year span is used to interpret two-digit year values. The default value of YEARCUTOFF= is 1920. However, you can override the default and change the value of YEARCUTOFF= to the first year of another 100-year span. If you specify YEARCUTOFF=1950, then the 100-year span will be from 1950 to 2049.

Permanent SAS Data Set

The SAS data sets created so far are temporary (i.e., they are deleted when you close the SAS window). A permanent SAS data set stays on your computer after the SAS session ends.

• A SAS data set is temporary if it is stored in the SAS Work library. All SAS data sets in the Work library are deleted automatically when the SAS session ends.
• A SAS data set is permanent if it is stored in a SAS library other than Work.

Creating a SAS data library -- using LIBNAME statement

    LIBNAME libref 'SAS-data-library';

where
• libref is 1 to 8 characters long, begins with a letter or underscore, and contains only letters, numbers, or underscores.
• SAS-data-library is the name of a SAS data library in which SAS data files are stored.

The LIBNAME statement below assigns the libref Mysaslib to the SAS data library D:\.

    libname Mysaslib 'd:\';

Creating permanent SAS Data Set

Suppose a SAS library Mysaslib has been created. A permanent SAS data set is created by a DATA statement with a two-level name (two names separated by a period).

Syntax

    DATA libref.data-set-name ;
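As a short sketch of the two-level name in use, the program below stores a small data set in the Mysaslib library assigned earlier. The data set name ex1 and its values are reused from the list input example purely for illustration, and the folder d:\ is assumed to exist and be writable.

```sas
/* Assign the libref; the folder must already exist */
libname Mysaslib 'd:\';

/* Two-level name: the data set is stored in Mysaslib, not in Work */
data Mysaslib.ex1;
  input Brand $ Wear;
  datalines;
Acme 43
Ajax 34
Atlas .
;
run;

/* The same two-level name is used to read the data set back */
proc print data=Mysaslib.ex1;
run;
```

In a later SAS session, the data set is available again once the LIBNAME statement has been resubmitted.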

• A SAS library can be deleted.
    – In Explorer, select the library icon, right-click, and select Delete.
    – This only removes the connection between the physical storage location and SAS. All SAS data sets remain in the storage folder.
• If a SAS data set is deleted from a SAS library, it is deleted permanently from the storage folder.
• A SAS data set can be copied or moved from one SAS library to another.
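The same operations can also be carried out with statements rather than through the Explorer window. The sketch below is one possible way, assuming the Mysaslib libref from above and a made-up Work data set ex1; PROC COPY, PROC DATASETS, and LIBNAME ... CLEAR are standard SAS, but their use here is an illustration rather than part of the original notes.

```sas
libname Mysaslib 'd:\';

/* Hypothetical temporary data set in the Work library */
data work.ex1;
  input Brand $ Wear;
  datalines;
Acme 43
Ajax 34
;
run;

/* Copy ex1 from Work to Mysaslib */
proc copy in=work out=Mysaslib;
  select ex1;
run;

/* Delete ex1 from Mysaslib: the file is removed from the storage folder */
proc datasets library=Mysaslib nolist;
  delete ex1;
run;
quit;

/* Remove the libref only: files that remain in d:\ are not touched */
libname Mysaslib clear;
```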

Question: Which one of the following statements is false?

a. LIBNAME statements can be stored with a SAS program to reference the SAS library automatically when you submit the program.
b. When you delete a libref, SAS no longer has access to the files in the library. However, the contents of the library still exist on your operating system.
c. Librefs can last from one SAS session to another.
d. You can access files that were created with other vendors' software by submitting a LIBNAME statement.

Correct answer: c
The LIBNAME statement is global, which means that librefs remain in effect until you modify them, cancel them, or end your SAS session. Therefore, the LIBNAME statement assigns the libref for the current SAS session only. You must assign a libref before accessing SAS files that are stored in a permanent SAS data library.

Chapter 2: Simple SAS Reports

Contents
    PRINT Procedure – PROC PRINT
    Producing Frequency Tables - PROC FREQ
    Computing Statistics -- PROC MEANS
    Defining Custom Formats -- PROC FORMAT

PRINT Procedure – PROC PRINT

Syntax:

    PROC PRINT DATA = data_set_name <(data-set-options)> <options> ;
    <VAR variable-list ;>
    <SUM variable-list ;>
    <BY variable-list ;>
    <WHERE where-expression ;>
    <TITLEn 'title statement' ;>
    <LABEL variable-name1 = 'label-string1' ... ;>
    <FORMAT variable-name1 format1 variable-name2 format2 … ;>
    RUN;

Basic Report

    proc print data=Q1;
    run;

Data Set Options

    (FIRSTOBS = n)
        Starts the printing from the nth observation of data-set-name.

    (OBS = m)
        Stops the printing at the mth observation of data-set-name.

Options

    N
        Prints the total number of observations in data-set-name.

    NOOBS
        Suppresses the Obs column in the output.

    OBS = 'column header'
        Specifies a header for the Obs column in the printout.

Example: The following step shows only the 3rd to 8th observations and prints the total number of observations in the data set Q1.

    proc print data=Q1 (firstobs=3 obs=8) n;
    run;

Example: The following step suppresses the Obs column.

    proc print data=Q1 noobs ;
    run;

Example: The following step specifies 'Observation Number' as the header for the Obs column.

    proc print data=Q1 obs= 'Observation Number';
    run;

Note: If you include both NOOBS and OBS = 'column header' in your statement, SAS suppresses the Obs column (OBS = 'column header' does not take effect).

    proc print data=Q1 obs= 'Observation Number' noobs ;
    run;

Selected Variables

VAR statement

You can choose the observations and variables that appear in your report. The VAR statement selects the variables to print:

    proc print data=Q1;
    var gender cost;
    run;

Selected Observations

WHERE statement

Prints only the observations that meet certain conditions.

    Definition                  Mnemonic     Symbol
    Equal to                    EQ           =
    Not equal to                NE           ^=
    Greater than                GT           >
    Less than                   LT           <
    Greater than or equal to    GE           >=
    Less than or equal to       LE           <=
    Equal to one of a list      IN
    Specified substring         CONTAINS     ?
    Logical and                 AND          &
    Logical or                  OR           |
    Logical not                 NOT          ^

    proc print data=Q1;
    var gender cost;
    where cost>50;
    run;
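The operators in the table can be combined in one WHERE expression. The following sketch uses a made-up data set where_demo (its name, the variable item, and all values are assumptions for illustration) to show IN, CONTAINS, and the logical operators.

```sas
/* Hypothetical data: gender and item are character, cost is numeric */
data where_demo;
  input gender $ item $ cost;
  datalines;
F Soap 35
M Shampoo 62
F Perfume 120
M Soap 48
;
run;

/* Comparison operators joined by AND */
proc print data=where_demo;
  var gender item cost;
  where gender = 'F' and cost ge 50;
run;

/* IN tests against a list; CONTAINS looks for a substring */
proc print data=where_demo;
  where item in ('Soap', 'Perfume') or item contains 'poo';
run;
```

Character constants in a WHERE expression are case-sensitive, so 'soap' would not match the stored value Soap.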

Column Totals

You can produce column totals for numeric variables within your report, using the SUM statement.

Example:

    proc print data=Q1;
    var gender cost;
    sum cost;
    run;

Specifying TITLE

TITLE statement: To make your report more meaningful, you can specify up to 10 titles by using TITLE statements in your output.

    TITLEn 'title' ;

• n is a number from 1 to 10 that specifies the line number of the title.
• Skipping some values of n indicates that those lines are blank.
• Titles are centered by default.
• SAS uses the same title for all subsequent output until you cancel it or define a new title.
• To cancel a title, specify a blank TITLEn statement, e.g. TITLE1;

Example:

    proc print data=Q1;
    var gender cost;
    sum cost;
    title 'ABC Company';
    title3 'Transaction records';
    run;

    proc print data=Q2;
    run;

Note 1: 'title' is the same as 'title1'.
Note 2: Skipping 'title2' indicates that the second line is blank.
Note 3: SAS uses the same titles when printing data set Q2.
Note 4: If you do not want the same titles to appear in the second output, you can specify a blank TITLE statement:

    proc print data=Q2;
    title;
    run;

Temporarily Assigning Labels to Variables

You can enhance your PROC PRINT report by labeling columns with more descriptive text.

LABEL statement:

    PROC PRINT DATA = data_set_name LABEL ;
    LABEL variable-name1 = 'label-string1' variable-name2 = 'label-string2' … ;

• A label-string can be up to 256 characters long, including blanks, and must be enclosed in quotation marks.
• If you assigned labels (permanent labels) when you created the SAS data set, you can omit the LABEL statement from the PRINT procedure.

Example

    proc print data=Q1 label;
    var gender cost;
    sum cost;
    title 'ABC Company';
    title3 'Transaction records';
    label cost='Transaction cost';
    run;

Temporarily Assigning Formats to Variables

In your SAS reports, formats control how the data values are displayed. Formats affect only how the data values appear in output, not the actual data values as they are stored in the SAS data set.

FORMAT statement

    FORMAT variable-name1 format1 variable-name2 format2 … ;

• Possible system formats include: COMMAw.d, DOLLARw.d (d specifies the number of decimal places), DATEw., DDMMYYw., MMDDYYw.
• Make sure the width (w) is large enough to contain the largest value, including special characters such as commas and dollar signs.
• If a permanent format (see later section) was assigned when the SAS data set was created, the FORMAT statement in the PRINT procedure can be omitted.

Some commonly used formats

    Format        Specifies these values                                     Example
    COMMAw.d      that contain commas and decimal places                     comma8.2
    DOLLARw.d     that contain dollar signs, commas, and decimal places      dollar6.2
    MMDDYYw.      as date values of the form 09/12/97 (MMDDYY8.)             mmddyy10.
                  or 09/12/1997 (MMDDYY10.)
    DDMMYYw.      as date values of the form 12/09/97 (DDMMYY8.)             ddmmyy10.
                  or 12/09/1997 (DDMMYY10.)
    DATEw.        as date values of the form 16OCT99 (DATE7.)                date9.
                  or 16OCT1999 (DATE9.)
    WORDDATEw.    as date values of the form Apr 12, 1999                    worddate32.
    w.d           rounded to d decimal places in w spaces                    8.2
    $w.           as character values in w spaces                            $12.

Example

    proc print data=Mylib.year_sales label noobs;
    var units amountsold;
    where salesrep= 'Garcia' and quarter='1';
    sum amountsold;
    label units ='Units sold' amountsold='Amount sold';
    format units comma7. amountsold dollar12.2;
    title1 'Sales in first quarter by Garcia';
    run;

Example

    This FORMAT statement                      To display values as
    format date mmddyy8.;                      06/05/03
    format net comma5.0 gross comma8.2;        1,234    5,678.90
    format net gross dollar9.2;                $1,234.00    $5,678.90
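The FORMAT statements in the table can be tried out with a one-observation data set. This is a sketch only: the data set name fmt_demo is invented, and the stored values are chosen so that they match the displayed values shown in the table.

```sas
/* Hypothetical data: one date and two amounts, stored unformatted */
data fmt_demo;
  date  = '05JUN2003'd;   /* date literal, stored as a number of days */
  net   = 1234;
  gross = 5678.9;
run;

/* Same stored values, first set of formats from the table */
proc print data=fmt_demo noobs;
  format date mmddyy8. net comma5.0 gross comma8.2;
run;

/* One format applied to two variables at once */
proc print data=fmt_demo noobs;
  format date date9. net gross dollar9.2;
run;
```

Because the formats are assigned inside PROC PRINT, they apply to that report only; the values stored in fmt_demo are unchanged.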

Creating a Customized Layout with BY Groups

The BY statement produces a separate section of the report for each BY group of observations.

BY statement

    BY variable-list ;

• If the data are not already sorted by the same variable-list, you must add a PROC SORT step before the PROC PRINT step.
• The same variables cannot appear in both the VAR and BY statements.

PROC SORT

    PROC SORT DATA = datain <OUT = dataout> ;
    BY <DESCENDING> variable-list ;

• If dataout is not specified, datain is replaced by the sorted data set.
• By default, observations are sorted in ascending order of the specified variable-list.
    * Missing values always sort low.
    * Sorted order for character variables:
      . (missing) < symbol < 0 < 1 < 11 < 2 < A < B < a

Example

    proc sort data=ex1 out=sort_ex1;
    by month year;
    run;

    proc print data=sort_ex1 n;
    var year telephone;
    by month;
    run;
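The DESCENDING keyword applies only to the variable that follows it. A small sketch, using a made-up data set sort_demo whose variable names mirror the example above (the values are invented):

```sas
/* Hypothetical data; one telephone value is missing */
data sort_demo;
  input month year telephone;
  datalines;
2 2003 150
1 2004 120
1 2003 .
2 2004 95
;
run;

/* month ascending, and within each month year descending */
proc sort data=sort_demo out=sorted_demo;
  by month descending year;
run;

/* BY month works because sorted_demo is ordered by month */
proc print data=sorted_demo;
  by month;
  var year telephone;
run;
```

Because OUT= is specified, sort_demo itself keeps its original order.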

Question: What happens if you submit the following program?

    proc sort data=clinic.diabetes;
    run;
    proc print data=clinic.diabetes;
       var age height weight pulse;
       where sex='F';
    run;

a. The PROC PRINT step runs successfully, printing observations in their sorted order.
b. The PROC SORT step permanently sorts the input data set.
c. The PROC SORT step generates errors and stops processing, but the PROC PRINT step runs successfully, printing observations in their original (unsorted) order.
d. The PROC SORT step runs successfully, but the PROC PRINT step generates errors and stops processing.

Correct answer: c
The BY statement is required in PROC SORT. Without it, the PROC SORT step fails. However, the PROC PRINT step prints the original data set as requested.

Question: What does PROC PRINT display by default?

a. PROC PRINT does not create a default report; you must specify the rows and columns to be displayed.
b. PROC PRINT displays all observations and variables in the data set. If you want an additional column for observation numbers, you can request it.
c. PROC PRINT displays columns in the following order: a column for observation numbers, all character variables, and all numeric variables.
d. PROC PRINT displays all observations and variables in the data set, a column for observation numbers on the far left, and variables in the order in which they occur in the data set.

Correct answer: d

Producing Frequency Tables - PROC FREQ

The FREQ procedure is both a descriptive procedure and a statistical procedure. It produces one-way and n-way frequency tables: it counts how many observations have each value and provides percentages and cumulative statistics.

PROC FREQ

    PROC FREQ DATA = sas-data-set <options>;
    <TABLE variable-list < / options> ;>
    <BY by-variables ;>
    <WHERE where-expression ;>
    <FORMAT variable-format ;>
    <TITLE 'title-text' ;>
    <LABEL variable-name1 = 'label-string1' ... ;>
    RUN;

General form: Basic FREQ Procedure

    PROC FREQ DATA = sas-data-set <options>;
    RUN;

By default, PROC FREQ creates a one-way table with the frequency, percent, cumulative frequency, and cumulative percent of every value of all variables in a data set.

Example:

    proc print data=Q1;
    run;

    proc freq data=Q1;
    run;

PROC FREQ statement options

1. NLEVELS: displays the number of levels for all variables in the TABLE statement

    proc freq data=Q1 nlevels;
    run;

2. ORDER = : specifies the order for listing the variable values

• FREQ: orders values by descending frequency count

    proc freq data=Q1 order=freq;
    table strength_of_fragrance;
    run;

• FORMATTED: orders values by their ascending formatted values
• INTERNAL (default): orders values by their ascending unformatted values
• DATA: orders values according to their order in the data set

    data eg1;
    input var1 $;
    datalines;
    d          ← 1
    f          ← 2
    e          ← 3
    a          ← 4
    b          ← 5
    b
    c          ← 6
    e
    ;
    proc freq data=eg1 order=data;
    run;

Specifying Variables in PROC FREQ

TABLE statement:

    TABLE variable-list < / options> ;

variable-list specifies the variables included in the report.

• For one-way tables, specify the variable name. For more than one variable, separate the variable names by spaces.
• For a two-way table, separate the paired variables by *. For more than one pair of variables, separate each pair by a space.

One-way tables

    proc freq data=crew;
    table jobcode location;
    run;

In this example, two one-way tables are produced.

PROC FREQ produces one-way tables with cells that contain
• frequency
• percent
• cumulative frequency
• cumulative percent

Two-way table

    proc freq data=crew;
    table location*jobcode;
    run;

In this example, a two-way table is produced.

PROC FREQ produces two-way tables with cells that contain
• cell frequency
• cell percent of total frequency
• cell percent of row frequency
• cell percent of column frequency

TABLE statement options

TABLE statement:

    TABLE variable-list < / options> ;

Commonly used options in all FREQ tables:

• MISSING – includes missing values in frequency statistics, i.e. treats a missing value as a valid value
• NOPRINT – suppresses display of the table
• OUT = out-data-set – writes the frequencies to the SAS data set out-data-set

Example

    proc freq data=crew;
    table location jobcode / missing out=result;
    run;

Commonly used options in one-way tables:

• NOCUM – suppresses display of cumulative frequencies and percentages
• NOPERCENT – suppresses display of percentages
• OUTCUM – includes the cumulative frequency and cumulative percentage in the output data set

Commonly used options in two-way tables:

• LIST – prints cross-tabulations in list format rather than as a grid
• CROSSLIST – prints cross-tabulations in cross-list format
• NOCUM – suppresses display of cumulative frequencies and cumulative percentages in list format
• NOCOL – suppresses display of the column percentage for each cell
• NOROW – suppresses display of the row percentage for each cell
• NOFREQ – suppresses display of the frequency count for each cell
• OUTPCT – includes the percentage of column frequency, row frequency, and two-way table frequency in the output data set

Commonly used options in one-way tables:

Example
proc freq data=Mylib.car;
table size / nopercent;
run;

Example
proc freq data=Mylib.car;
table size / nocum;
run;

Example
proc freq data=Mylib.car;
table size / noprint out=Q13 outcum;
run;

Commonly used options in two-way tables:

Example
proc freq data=crew;
table location*jobcode / list;
run;

Example
proc freq data=crew;
table location*jobcode / crosslist;
run;

Example
proc freq data=crew;
table location*jobcode / out=result2output;
run;
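The effect of OUTCUM and OUTPCT is easiest to see by printing the output data sets. The following is a minimal sketch, not part of the original notes: the data set crew_demo and its values are made up as a stand-in for crew, and the output data set names are hypothetical.

```sas
/* Hypothetical stand-in for the crew data set (values are illustrative) */
data crew_demo;
    input location $ jobcode $;
    datalines;
HK PILOT
HK PILOT
HK FLTAT
LA PILOT
LA FLTAT
LA FLTAT
;
run;

/* One-way table: OUTCUM adds CUM_FREQ and CUM_PCT to the output data set */
proc freq data=crew_demo;
    table location / noprint out=loc_counts outcum;
run;

/* Two-way table: OUTPCT adds PCT_ROW and PCT_COL to the output data set */
proc freq data=crew_demo;
    table location*jobcode / noprint out=cell_counts outpct;
run;

proc print data=loc_counts;
run;

proc print data=cell_counts;
run;
```

Both output data sets also contain the variables COUNT and PERCENT, which OUT= writes by default.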

BY statement
• Used to obtain separate analyses on observations in groups defined by the BY variables
• If the data set is not sorted in ascending order, sort the data using the SORT procedure with a similar BY statement

Example
proc sort data=crew;
by location;

proc freq data=crew;
table jobcode / missing;
by location;
run;

Computing Statistics -- PROC MEANS

PROC MEANS - Produces a report on variables in a SAS data set
– Computes summary statistics such as maximum, minimum, mean, standard deviation, etc.
– Applies only to numeric variables; missing values are excluded from the statistical calculations

PROC MEANS DATA = sas_data_set <requested-statistics> <options>;
<VAR variable-list ;>
<BY by-variables ;>
<CLASS class-variables ;>
<OUTPUT OUT = sas-data-set <output-statistic = output-label> ;>
<WHERE where-expression ;>
<TITLE 'title-text' ;>
<LABEL variable-name1 = 'label-string1' ... ;>
RUN;

VAR statement

VAR variable-list ;

• Reports on every numeric variable in sas-data-set if the VAR statement is not included
• For more than one variable, separate the variable names with spaces
• Default reported statistics are N, MEAN, STD, MIN, MAX

proc means data=Mylib.Car;
run;

Note: The above report shows all numeric variables (only mileage and reliability are numeric variables)

proc means data=Mylib.Car;
var mileage;
run;

Requested statistics

Other statistics include: RANGE, MEDIAN, SUM, NMISS, SKEWNESS, VAR, Q1, Q3, P1, P5, P10, P90, P95, P99, etc.
If you list any statistics in requested-statistics, PROC MEANS no longer produces the default statistics. Any default statistic you still want must be requested explicitly.

proc means data=Mylib.Car n mean;
var MILEAGE;
run;

PROC MEANS statement options

MAXDEC= : specifies the number of decimal places for the statistics

proc means data=Mylib.Car n mean std maxdec=3;
var MILEAGE;
run;

NOPRINT : suppresses all displayed output

Group processing

Group Processing Using the CLASS Statement

PROC MEANS DATA = sas_data_set <requested-statistics> <options>;
<VAR variable-list ;>
CLASS class-variables ;
RUN;

CLASS statement options:
– ORDER= : specifies the order for listing the values of the CLASS variable
– MISSING : treats missing values as a valid value of the CLASS variable

Note: You do not need to use PROC SORT when using the CLASS statement.

Example
proc means data=Mylib.Car mean median order=freq;
var mileage;
class size;
run;

Group Processing Using the BY Statement

PROC MEANS DATA = sas_data_set <requested-statistics>;
<VAR variable-list ;>
BY by-variables ;
RUN;

Note: If the data set is not sorted in ascending order, sort the data using PROC SORT with a similar BY statement.

Example
proc sort data=Mylib.Car out=sort_Car;
by size;
run;

proc means data=sort_Car mean median;
var mileage;
by size;
run;

Creating a Summarized Data Set -- OUTPUT statement

OUTPUT OUT = sas-data-set
<output-statistic1 = output-name1a output-name1b …
output-statistic2 = output-name2a output-name2b … … > ;

• Using the OUTPUT statement without any output-statistic= specification produces the default statistics (N, MIN, MAX, MEAN, STD) for all of the variables specified in the VAR statement.
• output-statistic= specifies the summary statistic to be written out. It need not be identical to the requested-statistics in the PROC MEANS statement.
• output-names specify the names of the variables that will be created to contain the values of the summary statistics. The output-names must be listed in the same order as the variables in the VAR statement.

Example
proc means data=Mylib.Car noprint;
var RELIABILITY MILEAGE;
output out=car_average
mean=MEAN_REL MEAN_MILE
nmiss=nm_rel nm_mile;
run;

Note: The values of _TYPE_ indicate which combinations of CLASS variables were used to compute the statistics.
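To see how _TYPE_ appears in an output data set, the sketch below combines the CLASS and OUTPUT statements. The data set car_demo, its values, and the output names are assumptions made for illustration and are not taken from Mylib.Car.

```sas
/* Hypothetical miniature of the car data (values are illustrative) */
data car_demo;
    input size $ mileage;
    datalines;
Small 33
Small 37
Medium 26
Medium 24
Large 18
;
run;

/* NOPRINT suppresses the report; the summary goes to size_summary only */
proc means data=car_demo noprint;
    var mileage;
    class size;
    output out=size_summary mean=mean_mile n=n_mile;
run;

proc print data=size_summary;
run;
```

With a single CLASS variable, the output data set holds one observation with _TYPE_=0 (the overall statistics across all observations) and one observation with _TYPE_=1 for each value of size. The automatic variable _FREQ_ gives the number of observations in each group.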

Question.

The default statistics produced by the MEANS procedure are n-count, mean, minimum, maximum, and

a. median.
b. range.
c. standard deviation.
d. standard error of the mean.

Correct answer: c

Question.

Which statement will limit a PROC MEANS analysis to the variables Boarded, Transfer, and Deplane?

a. by boarded transfer deplane;
b. class boarded transfer deplane;
c. output boarded transfer deplane;
d. var boarded transfer deplane;

Correct answer: d
To specify the variables that PROC MEANS analyzes, add a VAR statement and list the variable names.

Question.

Which of the following statements is true regarding BY-group processing?

a. BY variables must be either indexed or sorted.
b. Summary statistics are computed for BY variables.
c. BY-group processing is preferred when you are categorizing data that contains few variables.
d. BY-group processing overwrites your data set with the newly grouped observations.

Correct answer: a
Unlike CLASS processing, BY-group processing requires that your data already be indexed or sorted in the order of the BY variables. You might need to run the SORT procedure before using PROC MEANS with a BY group.

Defining Custom Formats -- PROC FORMAT

• You can use the FORMAT procedure to define your own custom formats for displaying values of variables.
• A custom format does not affect the internal data values that are stored in the SAS data set.
• Once defined, a custom format is used like a SAS system format.

PROC FORMAT <LIBRARY = libref>;
VALUE <$>format-name1 range1a = 'label1a' range1b = 'label1b' …;
VALUE <$>format-name2 range2a = 'label2a' range2b = 'label2b' …;
. . .
RUN;

Temporary custom format (default)
– A custom format is stored in a format catalog in the WORK library (so the format is stored temporarily).
– You only need to submit the PROC FORMAT step once during a session, but you need to re-run it when you re-open the SAS software (start a new session).

Permanent custom format
– The option LIBRARY = libref specifies the name of a permanent SAS data library in which the format catalog will be stored.
– You need to tell SAS where to find the defined format before using it, but you do not need to re-run the procedure.

format-name names the format that you are creating
• Must begin with a $ sign if the format applies to character values
• Cannot be longer than eight characters
• Cannot be the name of an existing SAS format
• Cannot end with a number
• Does not end in a period

range specifies one or more values to be grouped
• Values in different ranges should not overlap

label is a text string enclosed in quotation marks (' ')

Note: In a single PROC FORMAT step, you can use several VALUE statements to define several formats.

Specifying VALUE ranges

Range          Description
1 - 10         1 to 10 inclusive (1 ≤ x ≤ 10)
1 <- 10        Greater than 1 up through 10 (1 < x ≤ 10)
1 -< 10        1 up to but not including 10 (1 ≤ x < 10)
1 - 10, 15     1 through 10 and the value 15 (1 ≤ x ≤ 10 or x = 15)
1, 3, 5        Values 1, 3, and 5
Low - 10       Lowest non-missing value through 10 (x ≤ 10)
10 - High      10 through the highest non-missing value (x ≥ 10)
'a' - 'g'      First character of data value matches any letter from a through g, case sensitive
'a', 'd'       First character of data value matches a or d, case sensitive
Low - 'g'      Any first character of a non-missing value through g, case sensitive
'g' - High     g through any first character of a non-missing value, case sensitive
Other          Any value not specified elsewhere
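The boundary behaviour of the range operators can be checked with a small test. The sketch below is an illustration only: the format name scoregrp, the cut-offs, and the data values are hypothetical. It uses the PUT function to apply the format and store the resulting label in a new character variable.

```sas
proc format;
    value scoregrp
        low -< 50 = 'Fail'      /* below 50, 50 itself excluded */
        50  -< 70 = 'Pass'      /* 50 up to but not including 70 */
        70 - high = 'Merit'     /* 70 and above */
        .         = 'Missing';  /* LOW does not cover missing values */
run;

data scores;
    input score @@;
    /* PUT returns the formatted value as a character string */
    group = put(score, scoregrp.);
    datalines;
49.9 50 69 70 100 .
;
run;

proc print data=scores;
run;
```

The values 49.9, 50, 69, 70, 100 and the missing value are labelled Fail, Pass, Pass, Merit, Merit and Missing respectively, which shows that each boundary value falls into exactly one range.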


Associating User-Defined Formats with Variables

Example - Creating Temporary custom format

Without using format

data eg;
input age sex income colour $;
datalines;
19 1 14000 Y
45 1 65000 G
72 2 35000 B
. 1 44000 Y
58 2 83000 W
;
run;

proc print data=eg;
run;

proc freq data=eg;
table age / missing;
run;

proc means data=eg mean maxdec=0 missing;
var income;
class age;
run;

Creating Temporary custom format

proc format;
value gender 1='Male' 2='Female';
value agegroup low-18='Teen' 19-<65='Adult' 65-high='Elder' .='Missing';
value $col 'W'='White' 'B'='Blue' 'Y'='Yellow' 'G'='Green';
run;

proc print data=eg;
format age agegroup. sex gender. colour $col. income dollar8.;
run;

proc freq data=eg;
table age / missing;
format age agegroup.;
run;

proc means data=eg mean maxdec=0 missing;
var income;
class age;
format age agegroup.;
run;

Example - Creating a SAS data set using custom format

Without using format

data eg;
input age sex income colour $;
datalines;
19 1 14000 Y
45 1 65000 G
72 2 35000 B
. 1 44000 Y
58 2 83000 W
;
run;

Creating a SAS data set using custom format

proc format;
value gender 1='Male' 2='Female';
value agegroup low-18='Teen' 19-<65='Adult' 65-high='Elder' .='Missing';
value $col 'W'='White' 'B'='Blue' 'Y'='Yellow' 'G'='Green';
run;

data eg;
input age sex income colour $;
format age agegroup. sex gender. colour $col. income dollar8.;
datalines;
19 1 14000 Y
45 1 65000 G
72 2 35000 B
. 1 44000 Y
58 2 83000 W
;
run;

Note: The user-defined format must be created before the DATA step that uses it.

Permanent custom format

Example
libname mylib 'd\temp';
options fmtsearch=(mylib); /* tells SAS to search for formats in this library */

proc format library=mylib;
value gender 1='Male' 2='Female';
value agegroup low-18='Teen' 19-<65='Adult' 65-high='Elder' .='Missing';
value $col 'W'='White' 'B'='Blue' 'Y'='Yellow' 'G'='Green';
run;

data mylib.eg;
input age sex income colour $;
format age agegroup. sex gender. colour $col. income dollar8.;
datalines;
19 1 14000 Y
45 1 65000 G
72 2 35000 B
. 1 44000 Y
58 2 83000 W
;
run;

Question.

If you don't specify the LIBRARY= option, your formats are stored in Work.Formats, and they exist

a. only for the current procedure.
b. only for the current DATA step.
c. only for the current SAS session.
d. permanently.

Correct answer: c
If you do not specify the LIBRARY= option, formats are stored in a default format catalog named Work.Formats. As the libref Work implies, any format that is stored in Work.Formats is a temporary format that exists only for the current SAS session.

Question.

Which of the following statements will store your formats in a permanent catalog?

a. libname library 'c:\sas\formats\lib';
   proc format library=library ...;
b. libname library 'c:\sas\formats\lib';
   format library =library ...;
c. library='c:\sas\formats\lib';
   proc format library ...;
d. library='c:\sas\formats\lib';
   proc library ...;

Correct answer: a
To store formats in a permanent catalog, you first write a LIBNAME statement to associate the libref with the SAS data library in which the catalog will be stored. Then add the LIBRARY= option to the PROC FORMAT statement, specifying the name of the catalog.

Question.

When creating a format with the VALUE statement, the new format's name
• cannot end with a number
• cannot end with a period
• cannot be the name of a SAS format, and

a. cannot be the name of a data set variable.
b. must be at least two characters long.
c. must be at least eight characters long.
d. must begin with a dollar sign ($) if used with a character variable.

Correct answer: d
The name of a format that is created with a VALUE statement must begin with a dollar sign ($) if it applies to a character variable.

Question.

Which of the following FORMAT procedures is written correctly?

a. proc format library=library
   value colorfmt;
   1='Red' 2='Green' 3='Blue'
   run;
b. proc format library=library;
   value colorfmt 1='Red' 2='Green' 3='Blue';
   run;
c. proc format library=library;
   value colorfmt;
   1='Red' 2='Green' 3='Blue'
   run;
d. proc format library=library;
   value colorfmt 1='Red'; 2='Green'; 3='Blue';
   run;

Correct answer: b
A semicolon is needed after the PROC FORMAT statement. The VALUE statement begins with the keyword VALUE and ends with a semicolon after all the labels have been defined.

Question.

Which of these is false? Ranges in the VALUE statement can specify

a. a single value, such as 24 or 'S'.
b. a range of numeric values, such as 0-1500.
c. a range of character values, such as 'A'-'M'.
d. a list of numeric and character values separated by commas, such as 90,'B',180,'D',270.

Correct answer: d
You can list values separated by commas, but the list must contain either all numeric values or all character values. Data set variables are either numeric or character.

Question.

How many characters can be used in a label?

a. 40
b. 96
c. 200
d. 256

Correct answer: d
When specifying a label, enclose it in quotation marks and limit the label to 256 characters.

Question.

Which keyword can be used to label missing values as well as any values that are not specified in a range?

a. LOW
b. MISS
c. MISSING
d. OTHER

Correct answer: d
MISS and MISSING are invalid keywords, and LOW does not include missing values. The keyword OTHER can be used in the VALUE statement to label missing values as well as any values that are not specifically included in a range.

Question.

You can place the FORMAT statement in either a DATA step or a PROC step. What happens when you place the FORMAT statement in a DATA step?

a. You temporarily associate the formats with variables.
b. You permanently associate the formats with variables.
c. You replace the original data with the format labels.
d. You make the formats available to other data sets.

Correct answer: b
By placing the FORMAT statement in a DATA step, you permanently associate the defined formats with variables.

Question.

The format JOBFMT was created in a FORMAT procedure. Which FORMAT statement will apply it to the variable JobTitle in the program output?

a. format jobtitle jobfmt;
b. format jobtitle jobfmt.;
c. format jobtitle=jobfmt;
d. format jobtitle='jobfmt';

Correct answer: b
To associate a user-defined format with a variable, place a period at the end of the format name when it is used in the FORMAT statement.

Chapter 3: Basic Programming

Contents
Understanding DATA Step Processing
Debugging In DATA Step
Single Observation From Multiple Records
Creating Variables - Assignment statements
Conditional Logic Statements
Processing Group of Variables
Selecting Variables And Observations
Calculations Across Observations
Reading Mixed Record Types
Reading Fixed Number of Repeating Fields
Reading Varying Number of Repeating Fields
Reading Hierarchical Raw Data Files
SAS Functions

Understanding DATA Step Processing

In Chapter 1, you learned how to write a DATA step to create a temporary or permanent SAS data set from raw data. When you submit a DATA step, SAS processes the DATA step and then creates a new SAS data set. This section describes in more detail how SAS processes the DATA step.

A SAS DATA step is processed in two phases:
• During the compilation phase, each statement is scanned for syntax errors. Most syntax errors prevent further processing of the DATA step. When the compilation phase is complete, the descriptor portion of the new data set is created.
• If the DATA step compiles successfully, then the execution phase begins. During the execution phase, the DATA step reads and processes the input data. The DATA step executes once for each record in the input file, unless otherwise directed.

Compilation Phase

1. Input Buffer

At the beginning of the compilation phase, the input buffer (an area of memory) is created to hold a record from the external file.

(Figure: Input Buffer)

2. Program Data Vector

After the input buffer is created, the program data vector (PDV) is created. The PDV is the area of memory where SAS builds a data set, one observation at a time.

The program data vector contains two automatic variables that can be used for processing but which are not written to the data set as part of an observation.
• _N_ counts the number of times that the DATA step begins to execute.
• _ERROR_ signals the occurrence of an error that is caused by the data during execution. The default value is 0, which means there is no error. _ERROR_ is set to 1 when one or more errors occur.

PDV
_N_   _ERROR_

Question

Suppose you run a program that causes three DATA step errors. What is the value of the automatic variable _ERROR_ when the observation that contains the third error is processed?

a. 0
b. 1
c. 2
d. 3

Correct answer: b

3. Syntax Checking

During the compilation phase, SAS also scans each statement in the DATA step, looking for syntax errors. Syntax errors include
• missing or misspelled keywords
• invalid variable names
• missing or invalid punctuation
• invalid options.

4. Data Set Variables

As the INPUT statement is compiled, every variable that appears in it is added to the PDV. Usually, variable attributes such as length and type are determined the first time a variable is encountered.

In the example below, the variable ID is defined as a character variable and is assigned the default length of 8. Income and Expense are defined as numeric variables and are assigned the default length of 8.

Moreover, any variables that are created with an assignment statement in the DATA step are also added to the program data vector. For example, the assignment statement below creates the variable NetProfit. The attributes of the variable are determined by the expression (NetProfit=Income-Expense) in the statement. Because the expression produces a numeric value, NetProfit is also defined as a numeric variable and is assigned the default length of 8.

Example:
data profit;
input ID $ Income Expense;
NetProfit=Income-Expense;
datalines;
001 1000 2000
002 300 150
003 888 777
;
run;

PDV
_N_   _ERROR_   ID   Income   Expense   NetProfit

5. Descriptor Portion of the SAS Data Set

At the bottom of the DATA step (in this example, when the RUN statement is encountered), the compilation phase is complete, and the descriptor portion of the new SAS data set is created. The descriptor portion of the data set includes
• the name of the data set
• the number of observations and variables
• the names and attributes of the variables.

At this point
• The example data set contains the four variables that are defined in the INPUT statement and in the assignment statement.
• _N_ and _ERROR_ are not written to the data set.
• There are no observations because the DATA step has not yet executed.

Execution Phase

After the DATA step is compiled, it is ready for execution. During the execution phase, the data portion of the data set is created. The data portion contains the data values.

Example:
data profit;
input ID $ Income Expense;
NetProfit=Income-Expense;
datalines;
001 1000 2000
002 300 150
003 888 777
;
run;

1. Set variables in the PDV to missing and update _N_ and _ERROR_ in the PDV

At the beginning of the execution phase, the value of _N_ is 1. Because there are no data errors, the value of _ERROR_ is 0. The remaining variables are initialized to missing. Missing numeric values are represented by periods, and missing character values are represented by blanks.

PDV
_N_   _ERROR_   ID    Income   Expense   NetProfit
1     0               .        .         .

2. Put a new record into the input buffer and read the data values into the PDV

Input Buffer
1---+----10---+----20
001 1000 2000

PDV
_N_   _ERROR_   ID    Income   Expense   NetProfit
1     0         001   1000     2000      .

3. Execute the additional executable statements in the DATA step

The assignment statement (NetProfit=Income-Expense;) executes.

PDV
_N_   _ERROR_   ID    Income   Expense   NetProfit
1     0         001   1000     2000      -1000

4. End of the DATA Step

At the end of the DATA step, several actions occur. First, the values in the PDV are written to the output data set as the first observation.

SAS Data Set profit
ID    Income   Expense   NetProfit
001   1000     2000      -1000

Next, the value of _N_ is set to 2 and control returns to the top of the DATA step. Finally, the variable values in the program data vector are re-set to missing. Notice that the automatic variable _ERROR_ retains its value.

PDV
_N_   _ERROR_   ID    Income   Expense   NetProfit
2     0               .        .         .

5. Iterations of the DATA Step

You can see that the DATA step works like a loop, repetitively executing statements to read data values and create observations one by one. Each loop (or cycle of execution) is called an iteration. At the beginning of the second iteration, the value of _N_ is set to 2, and _ERROR_ is still 0. The values from the second record are held in the input buffer and then read into the PDV.

Input Buffer
1---+----10---+----20
002 300 150

PDV
_N_   _ERROR_   ID    Income   Expense   NetProfit
2     0         002   300      150       .

After the assignment statement executes:

PDV
_N_   _ERROR_   ID    Income   Expense   NetProfit
2     0         002   300      150       150

SAS Data Set profit
ID    Income   Expense   NetProfit
001   1000     2000      -1000
002   300      150       150

6. End-of-File Marker

The execution phase continues until the end-of-file marker is reached in the raw data file. When there are no more records in the raw data file to be read, the data portion of the new data set is complete.

Final SAS Data Set profit
ID    Income   Expense   NetProfit
001   1000     2000      -1000
002   300      150       150
003   888      777       111
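The iterations described above can be observed directly by writing the contents of the PDV to the SAS log. The sketch below adds two PUT statements to the profit example. The PUT statements and their text labels are additions for illustration and are not part of the original program.

```sas
data profit;
    input ID $ Income Expense;
    /* _ALL_ writes every variable in the PDV, including _N_ and _ERROR_ */
    put 'After INPUT:      ' _all_;
    NetProfit = Income - Expense;
    put 'After assignment: ' _all_;
    datalines;
001 1000 2000
002 300 150
003 888 777
;
run;
```

For each record, the first log line should show NetProfit as missing (a period) and the second should show the computed value, with _N_ increasing by one on each iteration.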

Question.

Which of the following is not created during the compilation phase?

a. the data set descriptor
b. the first observation
c. the program data vector
d. the _N_ and _ERROR_ automatic variables

Correct answer: b
Observations are not written until the execution phase.

Question.

During the compilation phase, SAS scans each statement in the DATA step, looking for syntax errors. Which of the following is not considered a syntax error?

a. incorrect values and formats
b. invalid options or variable names
c. missing or invalid punctuation
d. missing or misspelled keywords

Correct answer: a

Question.

Unless otherwise directed, the DATA step executes

a. once for each compilation phase.
b. once for each DATA step statement.
c. once for each record in the input file.
d. once for each variable in the input file.

Correct answer: c

Question.

At the beginning of the execution phase, the value of _N_ is 1, the value of _ERROR_ is 0, and the values of the remaining variables are set to

a. 0
b. 1
c. undefined
d. missing

Correct answer: d

Question.

Suppose you run a program that causes three DATA step errors. What is the value of the automatic variable _ERROR_ when the observation that contains the third error is processed?

a. 0
b. 1
c. 2
d. 3

Correct answer: b
The default value of _ERROR_ is 0, which means there is no error. When an error occurs, whether it is one error or multiple errors, the value is set to 1.

Question.

Which of the following actions occurs at the end of the DATA step?

a. The automatic variables _N_ and _ERROR_ are incremented by one.
b. The DATA step stops execution.
c. The descriptor portion of the data set is written.
d. The values of variables created in programming statements are re-set to missing in the program data vector.

Correct answer: d

Debugging In DATA Step

Types of errors in SAS programming

1. Syntax error
Program statements do not conform to the rules of the SAS language. Syntax errors include:
• missing or misspelled keywords
• invalid variable names
• missing or invalid punctuation
• invalid options.

2. Data error
Some data values are not consistent with the data type specified in a program
• Such as reading a character value for a SAS numeric variable

3. Logic error
• Statements are free of syntax errors and data errors but do not produce the anticipated results
• Such as a '+' sign being used instead of a '-' sign in a formula

Note: SAS can detect and report all syntax errors and data errors but will not recognize logic errors.

When an error is detected by SAS:

In the SAS log, SAS
– displays the word ERROR
– identifies the possible location of the error
– gives an explanation of the error

SAS may or may not continue the execution of the statements, depending on the kind of error detected.

Some commonly made errors:
• Omitting a semicolon
• Specifying an incorrect variable type
• Specifying more variables in the INPUT statement than there are fields in the raw data
• Unbalanced quotation marks
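A data error can be reproduced with a few lines. In the sketch below (the data set name check and the data values are made up for illustration), the second record holds the character string abc where a numeric value is expected.

```sas
data check;
    input ID $ Income;
    /* Write the automatic variables to the log after each record is read */
    put _N_= _ERROR_=;
    datalines;
001 1000
002 abc
003 888
;
run;

proc print data=check;
run;
```

SAS does not stop on this kind of error. It should write a note about invalid data for Income to the log, show _ERROR_=1 for the iteration that reads the second record, and store a missing value for Income in that observation. The data set is still created with three observations.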


Question

What usually happens when a syntax error is detected?

a. SAS continues processing the step.
b. SAS continues to process the step, and the SAS log displays messages about the error.
c. SAS stops processing the step in which the error occurred, and the SAS log displays messages about the error.
d. SAS stops processing the step in which the error occurred, and the Output window displays messages about the error.

Correct answer: c
Syntax errors generally cause SAS to stop processing the step in which the error occurred. When a program that contains an error is submitted, messages regarding the problem also appear in the SAS log. When a syntax error is detected, the SAS log displays the word ERROR, identifies the possible location of the error, and gives an explanation of the error.

Question

A syntax error occurs when

a. some data values are not appropriate for the SAS statements that are specified in a program.
b. the form of the elements in a SAS statement is correct, but the elements are not valid for that usage.
c. program statements do not conform to the rules of the SAS language.
d. none of the above.

Correct answer: c

Question

How can you tell whether you have specified an invalid option in a SAS program?

a. A log message indicates an error in a statement that seems to be valid.
b. A log message indicates that an option is not valid or not recognized.
c. The message "PROC running" or "DATA step running" appears at the top of the active window.
d. You can't tell until you view the output from the program.

Correct answer: b
When you submit a SAS statement that contains an invalid option, a log message notifies you that the option is not valid or not recognized. You should recall the program, remove or replace the invalid option, check your statement syntax as needed, and resubmit the corrected program.

Multiple Observations From Single Record

Some raw data files may contain more than one observation per record.

@@ (double trailing @) line-hold specifier: typically used to read multiple SAS observations from a single data line

Syntax

INPUT varname1 … @@;

• It holds the data line in the input buffer across multiple executions of the DATA step
• It prevents SAS from loading a new record into the input buffer at each DATA step iteration unless
– the end of the record is detected, or
– another INPUT statement without a line-hold specifier is encountered

Example:
data profit;
input ID $ @@;
input Department $5.;

• It should not be used with the @ pointer control (discussed in a later section), with column input, nor with the MISSOVER option

Example:
data Q6;
input X Y @@;
datalines;
1 2 3 4 5 6 7 8
11 12 13 14
21 22 23 24 25 26 27 28
;
run;

Single Observation From Multiple Records

Some raw data files may contain more than one record per object.
Example: each observation consists of 3 records

Method 1: Multiple INPUT statements
• The number of INPUT statements equals the number of records for an object
• Works when every observation has the same number of records

Example
data Case8a;
infile 'd:\temp\list7.txt';
input id 1-4 name $ 6-16;
input gender $1;
input weight_before 1-4 weight_after 6-9;
run;

Method 2: / line-pointer control
• Forces a new record into the input buffer and starts reading from the beginning of that record
• Works when every observation has the same number of records

INPUT varname … / varname … / varname … ;

Example
data Case8b;
infile 'd:\temp\list7.txt';
input id 1-4 name $ 6-16
/ gender $1
/ weight_before 1-4 weight_after 6-9;
run;

Method 3: #n line-pointer control

Puts multiple records into the input buffer and assigns the records to the PDV in any specified order

INPUT #n1 varname … #n2 varname … #n3 varname … ;

nX represents the record number in the input buffer

Example
data Case8c;
infile 'd:\temp\list7.txt';
input #2 gender $1
#1 id 1-4 name $ 6-16
#3 weight_before 1-4 weight_after 6-9;
run;
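The examples above depend on an external file whose contents are not shown. As a self-contained sketch, the same #n technique can be tried with in-stream data. The data set name case8_demo and all data values below are invented for illustration, laid out to match the column positions used above.

```sas
data case8_demo;
    /* Three records per observation, read in the order 2, 1, 3 */
    input #2 gender $ 1
          #1 id 1-4 name $ 6-16
          #3 weight_before 1-4 weight_after 6-9;
    datalines;
1001 Chan Tai
M
72.5 70.1
1002 Lee Mei
F
58.0 57.2
;
run;

proc print data=case8_demo;
run;
```

The six data lines produce two observations. Because the highest record number used is #3, SAS reads three records in each iteration of the DATA step.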

Creating Variables - Assignment statements

To produce new information or to change the information from the original information

• New information can be added to a SAS data set by creating new variables with an assignment statement in a DATA step

Syntax

variable = expression;

– The left-hand side must be a variable name
– expression may contain combinations of numeric or non-numeric constants, variables, SAS functions, and mathematical operators
– When the expression contains character data, the data must be enclosed in a pair of single (or double) quotation marks

Mathematical Operators

Addition          +
Subtraction       -
Multiplication    *
Division          /
Exponentiation    **

• SAS performs exponentiation first, then multiplication and division, followed by addition and subtraction
• Parentheses can be used to override the order

Example:
var1 = 10 * 4 + 3 ** 2     gives var1 = 49
var1 = 10 * (4 + 3) ** 2   gives var1 = 490
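The order of evaluation can be confirmed in the SAS log with a DATA _NULL_ step, which runs the statements without creating a data set. This is a small sketch; the second variable name, var2, is introduced here only so that both results can be shown together.

```sas
data _null_;
    var1 = 10 * 4 + 3 ** 2;      /* 3**2 = 9, 10*4 = 40, 40 + 9 = 49 */
    var2 = 10 * (4 + 3) ** 2;    /* (4+3) = 7, 7**2 = 49, 10*49 = 490 */
    put var1= var2=;
run;
```

The log shows var1=49 var2=490.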

Example
data case1;
infile datalines delimiter=',';
input name $ tomato cucumber peas grapes;
zone=14;
type='Home';
cucumber=cucumber*10;
total= tomato + cucumber + peas + grapes;
tomato_percent = tomato / total*100;
datalines;
David,10,2,40,0
Mary,15,5,10,1000
Francis,50,10,15,50
Tom,20,0, . ,20
;
run;

Note:
• SAS executes each statement once during each iteration of the DATA step
• If a variable has already been assigned a value in the PDV, SAS replaces the original value with the new one
• The variable peas has a missing value in the last observation. Variables calculated from peas are also set to missing

Note:
The order of the assignment statements and the INPUT statement affects the assigned values. If the two statements below are written in this order, total is computed from the original value of cucumber, before it is multiplied by 10:

total= tomato + cucumber + peas + grapes;
cucumber=cucumber*10;
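A short sketch makes the effect of statement order visible. It uses the values from David's record in the example above. The variable names total_before and total_after are hypothetical and exist only to hold the two results side by side.

```sas
data _null_;
    tomato = 10; cucumber = 2; peas = 40; grapes = 0;
    /* Computed before the update: uses cucumber = 2 */
    total_before = tomato + cucumber + peas + grapes;
    cucumber = cucumber * 10;
    /* Computed after the update: uses cucumber = 20 */
    total_after = tomato + cucumber + peas + grapes;
    put total_before= total_after=;
run;
```

The log shows total_before=52 total_after=70.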

Conditional Logic Statements

IF-THEN statements

• The IF-THEN statement executes a SAS statement when the condition in the IF clause is true

IF condition THEN statement;

where
• condition is any valid SAS expression (e.g. VAR1 >= 10)
• statement is what SAS should do when the condition is true, often an assignment statement

Comparison Operators

Operator     Comparison Operation
= or eq      equal to
^= or ne     not equal to
> or gt      greater than
< or lt      less than
>= or ge     greater than or equal to
<= or le     less than or equal to
in           equal to one of a list

Example
if test<85 and time<=20 then Status='RETEST';
if region in ('NE','NW','SW') then Rate=fee-25;
if target>300 or sales<50000 then Bonus=salary*.05;

Logical Operators

Operator     Logical Operation
&            and
|            or
^ or ~       not

Use the AND operator to execute the THEN statement if both expressions that are linked by AND are true.

Example
if status='OK' and type=3 then Count+1;
if (age^=agecheck or time^=3) & error=1 then Test=1;

Use the OR operator to execute the THEN statement if either expression that is linked by OR is true.

Example
if status='S' or cond='E' then Control='Stop';

Use the NOT operator with other operators to reverse the logic of a comparison.

Example
if not(loghours<7500) then Schedule='Quarterly';
if region not in ('NE','SE') then Bonus=200;

Character values must be specified in the same case in which they appear in the data set and must be enclosed in quotation marks.

Example
if status='OK' and type=3 then Count+1;
if status='S' or cond='E' then Control='Stop';
if not(loghours<7500) then Schedule='Quarterly';
if region not in ('NE','SE') then Bonus=200;

Logical comparisons that are enclosed in parentheses are evaluated as true or false before they are compared to other expressions. For instance, in a condition of the form A and (B or C), the OR comparison in parentheses is evaluated first, and its result is then combined with A by the AND operator.
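The following sketch shows how parentheses can change the result of a compound condition. The variable names and values are assumptions chosen for illustration. Without parentheses, SAS evaluates AND before OR.

```sas
data _null_;
    region = 'NE'; status = 'X'; cond = 'E';

    /* No parentheses: read as (region='SW' and status='S') or cond='E' */
    if region = 'SW' and status = 'S' or cond = 'E'
        then put 'No parentheses: condition is true';
    else put 'No parentheses: condition is false';

    /* Parentheses: the OR comparison is evaluated first */
    if region = 'SW' and (status = 'S' or cond = 'E')
        then put 'With parentheses: condition is true';
    else put 'With parentheses: condition is false';
run;
```

The first condition is true because cond='E' alone satisfies the OR. The second is false because region is not 'SW', so the AND fails regardless of the result inside the parentheses.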

SAS sets the length of a character variable the first time the variable is encountered in the DATA step.

Example
data case2;
input var1 @@;
if var1>20 then var2='Big'; /* sets the length of var2 to 3 */
if 11<=var1<=20 then var2='Medium';
if var1<11 then var2='Small';
datalines;
5 15 25
;
run;

Example (continued)
data case2;
input var1 @@;
if 11<=var1<=20 then var2='Medium'; /* sets the length of var2 to 6 */
if var1>20 then var2='Big';
if var1<11 then var2='Small';
datalines;
5 15 25
;
run;

data case2;
input var1 @@;
length var2 $ 8; /* sets the length of var2 to 8 */
if var1>20 then var2='Big';
if 11<=var1<=20 then var2='Medium';
if var1<11 then var2='Small';
datalines;
5 15 25
;
run;

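A missing value of a numeric variable is treated as smaller than any number in a comparison. In the example below, the observation with a missing age therefore satisfies age <= 18 and is assigned to group A.

Example:
data case3;
input age @@;
if age <=18 then agroup='A';
if 18<age<30 then agroup ='B';
if 31<= age then agroup='C';
datalines;
14 . 25 19
;
run;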
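One way to keep missing values out of the lowest group is to test for them first. The sketch below is a possible variation on the example and is not part of the original notes: the data set name case3_checked, the label 'Unknown', and the cut-offs are illustrative. It uses the MISSING function, which returns 1 when its argument is missing.

```sas
data case3_checked;
    input age @@;
    length agroup $ 7;                       /* long enough for 'Unknown' */
    if missing(age) then agroup = 'Unknown'; /* handle missing values first */
    else if age <= 18 then agroup = 'A';
    else if age < 30 then agroup = 'B';
    else agroup = 'C';
    datalines;
14 . 25 19
;
run;

proc print data=case3_checked;
run;
```

The four observations are assigned A, Unknown, B and B.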

IF-THEN blocks

To execute more than one action when the condition is true

IF condition THEN DO;
statements;
statements;
END;

Example:
data case4;
input course $ @@;
if course='MS3215' then do;
 lecturer = 'AB Chan';
 class_size=45;
end;
if course='MS3216' then do;
 lecturer = 'CD Ma';
 class_size=30;
end;
datalines;
MS3215 MS3216 MS3217
;
run;

IF-THEN-ELSE statements / IF-THEN-ELSE blocks

To put a number of related IF-THEN statements / IF-THEN blocks together

IF-THEN-ELSE statement:
IF condition THEN statement;
ELSE IF condition THEN statement;
ELSE statement;

IF-THEN-ELSE block:
IF condition THEN DO;
    statements;
    statements;
END;
ELSE IF condition THEN DO;
    statements;
    statements;
END;

Example (IF-THEN-ELSE):
length var2 $8;    /* long enough for 'Very Big'; otherwise 'Unknown' would fix the length at 7 */
if var1=. then var2='Unknown';
else if var1<11 then var2='Small';
else if 11<=var1<=20 then var2='Medium';
else if 45>=var1>20 then var2='Big';
else var2='Very Big';

Example (IF-THEN-ELSE block):
data case4;
input course $ @@;
if course='MS3215' then do;
    lecturer='AB Chan';
    class_size=45;
end;
else if course='MS3216' then do;
    lecturer='CD Ma';
    class_size=30;
end;
else do;
    lecturer='Other';
    class_size=.;
end;
datalines;
MS3215 MS3216 MS3217
;
run;

Processing Group of Variables
In DATA step programming, you often need to perform the same action on more than one variable. Although you can process variables individually, it is easier to handle them as a group. You can do this by using array processing.

Array statement
- Defines a set of variables to be processed as a group
- Any variables can be grouped as an array as long as they are either all numeric or all character

Syntax
ARRAY arrayname[n] <$> variable_list;

- arrayname names the array; it must not be the name of a variable in the same DATA step
- arrayname is not a variable, and it will not appear in the PDV or in the created SAS data set
- n is the number of variables grouped in the array
- n must be surrounded by either ( ), { }, or [ ]
- $ is needed if the variables are character type and have not been defined before the ARRAY statement
- arrayname[n] in an assignment statement refers to the nth element of the array as defined in the ARRAY statement, n = 1, 2, ... In the example below, newarray[1] is var1, newarray[2] is var2 and newarray[3] is var3

Example: Use of array
data case5;
array newarray[3] var1 var2 var3;
newarray[1]=1;    /* var1 */
newarray[2]=2;    /* var2 */
newarray[3]=3;    /* var3 */
run;
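The $ in the syntax above matters when the ARRAY statement itself creates character variables. A minimal sketch follows; the names case5_char and grade1-grade3 are hypothetical.

```sas
data case5_char;
    /* $ 8 creates three new character variables, each of length 8 */
    array grade[3] $ 8 grade1-grade3;
    grade[1]='Pass';
    grade[2]='Fail';
    grade[3]='Merit';
run;
```

Without the $, SAS would create grade1-grade3 as numeric variables and the character assignments would fail to store the text.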

Question.

Which statement is false regarding an ARRAY statement?
a. It is an executable statement.
b. It can be used to create variables.
c. It must contain either all numeric or all character elements.
d. It must be used to define an array before the array name can be referenced.

Correct answer: a

Question.

What belongs within the braces of this ARRAY statement?
array contrib{?} qtr1-qtr4;
a. quarter
b. quarter*
c. 1-4
d. 4

Correct answer: d

Question.

For the program below, select an iterative DO statement to process all elements in the contrib array.
data work.contrib;
array contrib{4} qtr1-qtr4;
...
contrib{i}=contrib{i}*1.25;
end;
run;
a. do i=4;
b. do i=1 to 4;
c. do until i=4;
d. do while i le 4;

Correct answer: b

Question.

What is the value of the index variable that references Jul in the statements below?
array quarter{4} Jan Apr Jul Oct;
do i=1 to 4;
yeargoal=quarter{i}*1.2;
end;
a. 1
b. 2
c. 3
d. 4

Correct answer: c

DO loop - To process an array of variables iteratively
Syntax
DO index_variable = k TO m <BY increment_amount>;
    SAS statements
END;

- index_variable is a variable that changes value at each iteration of the loop
  - Iteration starts with the value k (k often equals 1)
- increment_amount is a numeric variable or constant that controls how the value of index_variable changes
  - Default value is 1
  - At END, index_variable changes by the amount of increment_amount
  - Iteration continues until the value of index_variable > m

Example:
data case5a;
array newarray[3] var1 var2 var3;
do i=1 to 3;
    newarray[i]=i;
end;
run;

Variable i can be dropped from the data set by including a DROP statement in the DATA step
data case5a;
array newarray[3] var1 var2 var3;
do i=1 to 3;
    newarray[i]=i;
end;
drop i;
run;
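To see what value the index variable holds once a loop ends, a small sketch with a negative increment can help. The data set name loop_demo is made up. With a negative BY value the loop stops once the index falls below the stop value.

```sas
data loop_demo;
    do i=10 to 1 by -3;
        put i=;                 /* writes i=10, i=7, i=4, i=1 to the log */
    end;
    put 'after loop: ' i=;      /* i=-2, the first value past the stop value */
run;
```

The same rule explains why, without a DROP statement, case5a stores i=4 rather than i=3.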

Abbreviated list of variable names
To replace a regular list of variable names

Numbered range lists
- Variables which start with the same characters and end with consecutive numbers
- The numbers can start and end anywhere as long as the number sequence between is complete

Example:
Regular variable list                     Abbreviated list
INPUT var6 var7 var8 var9;                INPUT var6 - var9;
ARRAY narray(4) var6 var7 var8 var9;      ARRAY narray(4) var6 - var9;
PROC PRINT DATA = data1;                  PROC PRINT DATA = data1;
VAR var6 var7 var8 var9;                  VAR var6 - var9;

Example:
Array ALL has 20 numeric elements. Write DO statements to refer to the following elements:
a. All elements
b. Even-numbered elements
c. Every third element, beginning with 1 (i.e. 1, 4, 7, ...)

(a)
data case_a;
array ALL[20] var1-var20;
do k=1 to 20;
    ALL[k]=k;
end;
run;

(b)
data case_b;
array ALL[20] var1-var20;
do k=2 to 20 by 2;
    ALL[k]=k;
end;
run;

(c)
data case_c;
array ALL[20] var1-var20;
do k=1 to 20 by 3;
    ALL[k]=k;
end;
run;

Selecting Variables And Observations
Selecting variables - Controls which variables in the PDV are written to the SAS data set
Sometimes you might need to read and process fields that you don't want to keep in your data set. In this case, you can use the DROP statement or the KEEP statement to specify the variables that you want to drop or keep.

DROP statement specifies a list of variables not to write to output data sets.
DROP variable_list;
where variable_list identifies the variables to drop.

KEEP statement specifies a list of variables to write to output data sets.
KEEP variable_list;
where variable_list identifies the variables to keep.

Example:
data case6;
infile datalines delimiter=',';
input stud_id $ quiz1-quiz5;
array quiz[5] quiz1-quiz5;
quiz_sum=0;
do i=1 to 5;
    quiz_sum=quiz_sum + quiz[i];
end;
quiz_mean=quiz_sum/5;
keep stud_id quiz_sum;    /* or: drop quiz1-quiz5 quiz_mean i; */
datalines;
S1,45,33,60,75,80
S2,67,58,75,69,55
;
run;

Selecting observations

By default, SAS writes an observation to the SAS data set at the end of each DATA step iteration. Using an OUTPUT statement in an IF-THEN statement makes SAS output an observation only when a condition is true.

Example:
data case7;
infile datalines delimiter=',';
input age sex $ @@;
if sex='f' then output;
datalines;
25, m, 18, f, 19, m, 20, m, 21, f
;
run;

Note: If the value set by an assignment statement is to be kept in the SAS data set, the assignment must be placed before the OUTPUT statement.

Example: newvar is assigned before OUTPUT, so its value 1 is written
data case7;
infile datalines delimiter=',';
input age sex $ @@;
if sex='f' then do;
    newvar=1;
    output;
end;
datalines;
25, m, 18, f, 19, m, 20, m, 21, f
;
run;

Example: newvar is assigned after OUTPUT, so it is missing in the data set
data case7;
infile datalines delimiter=',';
input age sex $ @@;
if sex='f' then do;
    output;
    newvar=1;
end;
datalines;
25, m, 18, f, 19, m, 20, m, 21, f
;
run;

Writing observations to multiple data sets

- To write observations to a selected SAS data set, specify the SAS data set name in the OUTPUT statement
- Any SAS data set name that appears in the OUTPUT statement must already appear in the DATA statement

Example:
data Q45am Q45pm;
input group :$10. class :$10. enclosure $ fedtime $;
if fedtime='am' then output Q45am;
else if fedtime='pm' then output Q45pm;
else if fedtime='both' then output Q45am Q45pm;
datalines;
bears Mammalia E2 both
elephants Mammalia W3 am
flamingos Aves W1 pm
frogs Amphibia S2 pm
kangaroos Mammalia N4 am
lions Mammalia W6 pm
snakes Reptilia S1 pm
tigers Mammalia W2 both
zebras Mammalia W2 am
;
run;

Calculations Across Observations

Retaining the Values of Variables
RETAIN statement - Stops SAS from resetting the listed variables to missing in the PDV at the start of each iteration
RETAIN variable1 <init_value1> variable2 <init_value2> ... ;

- A RETAIN statement can specify both numeric and character variables
- <init_valueN> optionally specifies the starting value of each variable

Example: Calculate the running total
data case8;
input month $ sales @@;
acc_sales=acc_sales + sales;
retain acc_sales 0;    /* starting value of acc_sales = 0 */
datalines;
Jan 3500 Feb 2888 Mar 887
Apr 698 May 6789 Jun 906
;
run;

Example: Carry values from one observation forward to later observations (EX2_DATA1.TXT)

Write a SAS DATA step to create a data set which contains the name and sales date in every observation.
data Q50;
infile 'F:\SAS\sas\ex3\Ex2_Data1.txt';
input name $1-15 @16 salesdate date11. salesamount 31-35;
if name^=' ' then do;
    oldname=name;
    oldsalesdate=salesdate;
end;
else if name=' ' then do;
    name=oldname;
    salesdate=oldsalesdate;
end;
retain oldname oldsalesdate;
drop oldname oldsalesdate;
format salesdate date9.;
run;

Effect of missing value on running totals
Missing values will be generated from operations performed on missing values

Example: The sales of April is missing
data case8;
input month $ sales @@;
acc_sales=acc_sales + sales;
retain acc_sales 0;    /* starting value of acc_sales = 0 */
datalines;
Jan 3500 Feb 2888 Mar 887
Apr . May 6789 Jun 906
;
run;

Because acc_sales becomes missing in April and that missing value is retained, acc_sales is also missing for May and June.

Solution:
data case8;
input month $ sales @@;
if sales^=. then acc_sales=acc_sales + sales;    /* adding an IF-THEN statement */
retain acc_sales 0;                              /* starting value of acc_sales = 0 */
datalines;
Jan 3500 Feb 2888 Mar 887
Apr . May 6789 Jun 906
;
run;

Sum statement
Retains values from the previous iteration of the DATA step in order to cumulatively add the value of a variable across observations

variable + expression;

- variable specifies the name of the accumulator variable, which must be numeric.
- variable is automatically set to 0 before the first observation is read.
- variable's value is retained from one DATA step execution to the next.
- expression contains the value to be added to the variable.
- expression can be a variable or a constant.

The Sum statement adds the result of the expression that is on the right side of the plus sign (+) to the numeric variable that is on the left side of the plus sign. At the beginning of the DATA step, the value of the numeric variable is not set to missing as it usually is when reading raw data. Instead, the variable retains the new value in the program data vector for use in processing the next observation.

Note: The Sum statement is one of the few SAS statements that doesn't begin with a keyword.
Note: If the expression produces a missing value, the Sum statement treats it like a zero. (By contrast, in an assignment statement, a missing value is assigned if the expression produces a missing value.)

Example:
data case8;
input month $ sales @@;
acc_sales + sales;
datalines;
Jan 3500 Feb 2888 Mar 887
Apr . May 6789 Jun 906
;
run;
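For comparison, the Sum statement behaves like a RETAIN statement combined with the SUM function. The step below is a sketch of that equivalence; the data set name case8_equiv is made up.

```sas
data case8_equiv;
    input month $ sales @@;
    retain acc_sales 0;                   /* same starting value the Sum statement uses */
    acc_sales = sum(acc_sales, sales);    /* SUM ignores the missing April value */
    datalines;
Jan 3500 Feb 2888 Mar 887
Apr . May 6789 Jun 906
;
run;
```

Writing it this way makes the two implicit behaviours of the Sum statement visible: the retained accumulator and the handling of missing values.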

Reading Mixed Record Types
A raw data file may have more than one type of record layout, e.g. variables with different formats in different records

Example: Records with different date formats
data case9;
infile datalines delimiter=',';
input salesid $ location $;
if location='USA' then input saledate : mmddyy10. amount;
if location='EUR' then input saledate : date9. amount;
datalines;
101, USA, 1-20-2008,3445
433,EUR,30Mar2008,432.3
102,USA,4-12-2008,5320
444,EUR,26Apr2008,3433.3
;
run;

Error! Each INPUT statement without a line-hold specifier reads a new record, so the second INPUT statement reads saledate and amount from the next record instead of the current one.

Solution: add the @ (single trailing @) line-hold specifier

@ (single trailing @) line-hold specifier
Holds the record in the input buffer until
- the last statement of the DATA step is executed, or
- another INPUT statement without a line-hold specifier is encountered

Note: The term trailing indicates that the @ must be the last item that is specified in the INPUT statement.
E.g. input salesid $ location $ @;

Example: Records with different date formats
data case9;
infile datalines delimiter=',';
input salesid $ location $ @;
if location='USA' then input saledate : mmddyy10. amount;
if location='EUR' then input saledate : date9. amount;
datalines;
101, USA, 1-20-2008,3445
433,EUR,30Mar2008,432.3
102,USA,4-12-2008,5320
444,EUR,26Apr2008,3433.3
;
run;

Reading Fixed Number of Repeating Fields
Example:

Each record in temp.txt consists of a group's ID followed by three experimental results

How to pair each group's ID with one result in a single observation so that three observations can be derived from each record?

data temp;
infile 'D:\temp.txt';
input id $ @;
input result @;
output;
input result @;
output;
input result @;
output;
run;

Alternative:
data temp;
infile 'D:\temp.txt';
input id $ @;
do i=1 to 3;
    input result @;
    output;
end;
drop i;
run;

Reading Varying Number of Repeating Fields
DO-WHILE loop statement

To execute a DO loop until a condition is reached or while a condition exists, without specifying the number of iterations required

DO WHILE (condition);
    SAS statements
END;

condition is a valid SAS condition enclosed in parentheses

Example:
In EXE2_DATA5.txt, the first field is the ID of the student and the second field is the number of examination scores for that record. Create a SAS data set which contains 2 variables only, namely the student ID and examination score. The number of observations in the SAS data set equals the number of examination scores for every student.

data Q64;
infile 'F:\SAS\sas\ex3\Exe2_Data5.txt' missover;
input id $ no score @;
do while (score^=.);    /* score is numeric, so compare it with a numeric missing value */
    output;
    input score @;
end;
drop no;
run;

Alternative:
data Q64;
infile 'F:\SAS\sas\ex3\Exe2_Data5.txt';
input id $ no @;
do i=1 to no;
    input score @;
    output;
end;
keep id score;
run;

Reading Hierarchical Raw Data FilesIntroduction

Raw data files can be hierarchical in structure, consisting of a header record and one or more detail records.Typically, each record contains a field that identifies the record type.

Here, Employee indicates a header record that contains an employee's first name and last name. Dependent indicates a detail record that contains the name, relationship and age of an employee's dependant.

Raw Data File - LIST2_3.TXT
Employee,Adams,Cheung       <- header record
Dependent,Machael,C,15      <- detail record
Dependent,Machael,C,13      <- detail record
Employee,Thomas,Leung       <- header record
Dependent,Susan,S,26        <- detail record
Employee,Lewis,Chan         <- header record
Dependent,Richard,C,8       <- detail record
Employee,Dansky,Wong        <- header record
Employee,Nicholls,Tsang     <- header record
Dependent,Robert,C,12       <- detail record
Employee,Mary,Fong          <- header record
Dependent,John,S,40         <- detail record

You can build a SAS data set from a hierarchical file by creating one observation per detail record and storing each header record as part of the observation.

[Figure: SAS data set, one observation per detail record]

You can also build a SAS data set from a hierarchical file by creating one observation per header record and combining the information from detail records into summary variables.

[Figure: SAS data set, one observation per header record]

In this section, you learn how to read from a hierarchical file and create a SAS data set that contains either one observation for each detail record or one observation for each header record.

Creating One Observation Per Detail Record

Step 1. Retaining the Values of Variables
As you write the DATA step to read this file, remember that you want to keep the header record as a part of each observation until the next header record is encountered. To do this, you need to use a RETAIN statement to retain the values for empfname and emplname across iterations of the DATA step.
Next, you need to read the first field in each record, which identifies the record's type. You also need to use the single trailing @ line-hold specifier to hold the current record so that the other values in the record can be read.

data case12;
infile 'd:\LIST2_3.TXT' delimiter=',';
input type : $9. @;
retain empfname emplname;

Step 2. Conditionally Executing SAS Statements
You can use the value of type to identify each record. If type is Employee, execute an INPUT statement to read the values for first name (empfname) and last name (emplname).
However, if type is Dependent, then execute an INPUT statement to read the values for first name (depfname), relation, and age.
You can tell SAS to perform a given task based on a specific condition by using an IF-THEN statement.

data case12;
infile 'd:\LIST2_3.TXT' delimiter=',';
input type : $9. @;
if type='Employee' then input empfname : $15. emplname : $15.;
else if type='Dependent' then do;
    input depfname : $15. relation $ age;
end;
retain empfname emplname;

Step 3. Reading a Detail Record
Now think about what needs to happen when a detail record is read. Remember, you want to write an observation to the data set only when the value of type is Dependent.
Placing an OUTPUT statement inside the IF-THEN block makes SAS output an observation only when the condition is true (i.e. type is Dependent).

data case12;
infile 'd:\LIST2_3.TXT' delimiter=',';
input type : $9. @;
if type='Employee' then input empfname : $15. emplname : $15.;
else if type='Dependent' then do;
    input depfname : $15. relation $ age;
    output;
end;
retain empfname emplname;

Step 4. Dropping Variables and Final SAS Data Set
Because type is useful only for identifying a record's type, drop the variable from the data set.

data case12;
infile 'd:\LIST2_3.TXT' delimiter=',';
input type : $9. @;
if type='Employee' then input empfname : $15. emplname : $15.;
else if type='Dependent' then do;
    input depfname : $15. relation $ age;
    output;
end;
retain empfname emplname;
drop type;
run;

[Figure: SAS data set, one observation per detail record]

Creating One Observation Per Header Record
Refer to LIST2_3.TXT. Suppose you want to generate a SAS data set that contains a list of all employees and their monthly payroll deduction for insurance such that

- Insurance is free for the employee
- Each employee pays $100 per month for a spouse's (S) insurance if applicable
- Each employee pays $60 per month for a child's (C) insurance if applicable

Step 1. Retaining the Values of Variables
As you write the DATA step to read this file, you need to think about performing several tasks. First, the values of empfname and emplname must be retained as detail records are read and summarized.
Next, the value of type must be read in order to determine whether the current record is a header record or a detail record. Add a single trailing at sign (@) to hold the record so that another INPUT statement can read the remaining values.

data case12;
infile 'd:\LIST2_3.TXT' delimiter=',';
input type : $9. @;
retain empfname emplname;

Step 2. DO Group Actions for Header Records
To execute multiple SAS statements based on the value of a variable, you can use a simple DO group with an IF-THEN statement. When the condition type='Employee' is true, you need to execute several statements.

data case12;
infile 'd:\LIST2_3.TXT' delimiter=',';
input type : $9. @;
if type='Employee' then do;

First, you need to determine whether this is the first header record in the external file. You do not want the first header record to be written as an observation until the related detail records have been read and summarized.

_N_ is an automatic variable whose value is the number of times the DATA step has begun to execute. The expression _n_^=1 defines a condition where the DATA step has executed more than once.

Use this expression in conjunction with the previous IF-THEN statement to check for these two conditions: when the conditions type='Employee' and _n_^=1 are both true, an OUTPUT statement is executed. Thus, each header record except for the first one causes an observation to be written to the data set.

data case12;
infile 'd:\LIST2_3.TXT' delimiter=',';
input type : $9. @;
if type='Employee' then do;
    if _n_^=1 then output;
    input empfname : $15. emplname : $15.;
    insurance_cost=0;
end;

An INPUT statement reads the values of empfname and emplname. An assignment statement creates the summary variable insurance_cost and sets its value to 0.

Step 3. Reading Detail Records

When the value of type is not Employee, you need to define an alternative action. You can do this by adding an ELSE statement to the IF-THEN statement. If its value is 'Dependent' then continue to read the values of the first name, relation, and age.

You want to add the insurance charge for each person who is represented by a detail record and store the accumulated value in the summary variable insurance_cost. You have already initialized the value of insurance_cost to 0 each time a header record is read.

Now, as each detail record is read, you can increment the value of insurance_cost by using a Sum statement. If relation = 'S', add 100 to the cost of insurance. If relation = 'C', add 60 to the cost of insurance.

data case12;
infile 'd:\LIST2_3.TXT' delimiter=',';
input type : $9. @;
if type='Employee' then do;
    if _n_^=1 then output;
    input empfname : $15. emplname : $15.;
    insurance_cost=0;
end;
else if type='Dependent' then do;
    input depfname : $15. relation $ age;
    if relation='S' then insurance_cost+100;
    if relation='C' then insurance_cost+60;
end;
retain empfname emplname;
keep empfname emplname insurance_cost;    /* keep the summary variable as well */
run;

Step 4. Determining the End of the External File and Final SAS Data Set
Your program writes an observation to the data set only when another header record is read and the DATA step has executed more than once. But after the last detail record is read, there are no more header records to cause the last observation to be written to the data set.
You need to determine when the last record in the file is read so that you can then execute another explicit OUTPUT statement. You can determine when the current record is the last record in an external file by specifying the END= option in the INFILE statement.

INFILE 'file-name' END = variable_name;

variable_name is any valid SAS variable name that is not included in the INPUT statement or other assignment statements in the same DATA step
- Equals 1 if it is the last record in the raw data file; 0 otherwise
- Remains 0 until SAS processes the last data record
- Appears in the PDV but is not written to the SAS data set

data case12;
infile 'd:\LIST2_3.TXT' delimiter=',' end=eofile;
input type : $9. @;
if type='Employee' then do;
    if _n_^=1 then output;
    input empfname : $15. emplname : $15.;
    insurance_cost=0;
end;
else if type='Dependent' then do;
    input depfname : $15. relation $ age;
    if relation='S' then insurance_cost+100;
    if relation='C' then insurance_cost+60;
end;
if eofile=1 then output;
retain empfname emplname;
keep empfname emplname insurance_cost;    /* keep the summary variable as well */
run;

[Figure: SAS data set, one observation per header record]

SAS Functions
A SAS function performs a computation on one or more variables over the same observation and returns a value. SAS functions include mathematical functions, statistical functions, date functions, character functions, and others.

SAS function syntax
Function_name(<argument1> <, ..., argumentn>)
Function_name(OF abbreviated_variable_list)
Function_name(OF array_name[*])

- Function_name must be followed by a pair of parentheses
- If used in an assignment statement, the function must be placed on the right-hand side
- The parentheses may contain one argument, more than one argument, or no argument (i.e. empty parentheses)
  - The argument can be a variable name, a constant, another SAS function, or a valid SAS expression
  - Multiple arguments are separated by commas

Mathematical functions
Function name      Description
ABS(argument)      Returns a nonnegative number that is equal in magnitude to that of the argument
EXP(argument)      Returns the value of the exponential function
LOG(argument)      Returns the natural (base e) logarithm
LOG10(argument)    Returns the logarithm to the base 10
SQRT(argument)     Returns the square root of a value

Example:
data test;
input quantity @@;
abs_quantity=abs(quantity);
log_quantity=log(abs_quantity);
sqrt_quantity=sqrt(abs_quantity);
datalines;
1244 -1898 34232 10 242
;
run;

Truncation functions

Function name                     Description
INT(argument)                     Returns the integer portion of the argument
ROUND(argument)                   Returns the nearest integer to the argument
ROUND(argument, rounding_unit)    Rounds the first argument to a value that is very close to a multiple of the second argument

Example:
data test;
x1=int(10.499);
x2=int(10.599);
x3=round(10.49);
x4=round(10.5);
x5=round(10.51);
x6=round(10.449,0.01);
x7=round(10.501,0.01);
x8=round(10.504,0.05);
x9=round(13,2);
run;
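A few further cases, written to the log with PUT, may help to fix the behaviour of INT and ROUND. This is an illustrative sketch with made-up values; the comments show the results expected from the documented rules (INT drops the fractional part, and ROUND moves halfway cases to the multiple with the larger absolute value).

```sas
data _null_;
    a = int(-10.6);          /* -10: the fractional part is dropped */
    b = round(10.5);         /* 11 */
    c = round(-10.5);        /* -11 */
    d = round(13, 2);        /* 14: the nearest multiples of 2 are 12 and 14 */
    e = round(1234.5, 100);  /* 1200 */
    put a= b= c= d= e=;
run;
```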

Statistical functions
Function name                     Description
SUM(argument, argument,...)       Sum of the nonmissing values
MEAN(argument, argument,...)      Average of the nonmissing values
MIN(argument, argument,...)       Minimum value
MAX(argument, argument,...)       Maximum value
MEDIAN(argument, argument,...)    Median value
VAR(argument, argument,...)       Variance of the values
STD(argument, argument,...)       Standard deviation of the values
N(argument, argument,...)         The number of nonmissing values
NMISS(argument, argument,...)     The number of missing values

Example:
The following figure displays the first few records of a raw data set containing the student quiz scores. The first line is not part of the data set. If a student took all five quizzes, the lowest of the five quiz scores is dropped. Write a program that will compute the average quiz score based on this decision. If a student took fewer than five quizzes, compute the average of the non-missing quizzes.

data Q77;
input ID $ Q1-Q5;
nmiss=nmiss(of Q1-Q5);
if nmiss=0 then average=(sum(of Q1-Q5)-min(of Q1-Q5))/4;
/* if n=5 then average=(sum(of Q1-Q5)-min(of Q1-Q5))/4; */
else average=mean(of Q1-Q5);
drop nmiss;
datalines;
1 85 76 79 80 85
2 . 56 65 72 81
3 44 49 . . 54
;
run;
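The reason the program above uses SUM and MEAN rather than arithmetic operators is that the functions skip missing arguments. A short sketch with made-up values:

```sas
data _null_;
    a=1; b=.; c=3;
    total_fn = sum(a, b, c);    /* 4: the missing value is ignored */
    total_op = a + b + c;       /* missing: the + operator propagates missing values */
    avg      = mean(a, b, c);   /* 2: average of the two nonmissing values */
    count    = n(a, b, c);      /* 2 */
    miss     = nmiss(a, b, c);  /* 1 */
    put total_fn= total_op= avg= count= miss=;
run;
```

SAS also writes a note to the log that missing values were generated by the operation on total_op.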

Character functions
Function name                                Description
CAT(string-1 <, ... string-n>)               Concatenates character strings without removing leading or trailing blanks
CATS(string-1 <, ... string-n>)              Concatenates character strings and removes leading and trailing blanks
CATT(string-1 <, ... string-n>)              Concatenates character strings and removes trailing blanks
CATX(separator, string-1 <, ... string-n>)   Concatenates character strings, removes leading and trailing blanks, and inserts separators
COMPBL(source)                               Compresses multiple blanks in a character string into a single blank

Example:
Create a SAS data set that joins the two fields into a single variable for the full name in the form of "firstname lastname" such that there is only one blank space between the first name and the last name.

data Q78;
infile datalines delimiter=',';
input first $ last $;
name=catx(' ', first, last);
datalines;
Mary,Leung
John,Wong
Jonathan,Ng
;
run;

Character functions (continued)

Function name        Description
LEFT(argument)       Left aligns a SAS character expression
LENGTH(string)       Returns the length of a non-blank character string, excluding trailing blanks, and returns 1 for a blank character string
LENGTHC(string)      Returns the length of a character string, including trailing blanks
LENGTHN(string)      Returns the length of a non-blank character string, excluding trailing blanks, and returns 0 for a blank character string
LOWCASE(argument)    Converts all letters in an argument to lowercase
RIGHT(argument)      Right aligns a character expression
TRIM(argument)       Removes trailing blanks from character expressions and returns one blank if the expression is missing
TRIMN(argument)      Removes trailing blanks from character expressions and returns a null string (zero blanks) if the expression is missing
UPCASE(argument)     Converts all letters in an argument to uppercase
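The three length functions differ only in how they treat trailing blanks and blank strings. The sketch below uses two hypothetical variables, s and blank, both declared with a length of 5.

```sas
data _null_;
    length s blank $5;
    s='ab';
    blank=' ';
    len_s  = length(s);        /* 2: trailing blanks are not counted */
    lenc_s = lengthc(s);       /* 5: the declared length, blanks included */
    len_b  = length(blank);    /* 1 for a blank string */
    lenn_b = lengthn(blank);   /* 0 for a blank string */
    put len_s= lenc_s= len_b= lenn_b=;
run;
```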

Function name             Description
FIND(string,substring)    Searches string for the first occurrence of substring and returns its position (0 if it is not found)

- string specifies a character constant, variable, or expression that will be searched for substrings.
  Tip: Enclose a literal string of characters in quotation marks.
- substring is a character constant, variable, or expression that specifies the substring of characters to search for in string.
  Tip: Enclose a literal string of characters in quotation marks.

Function name                        Description
SUBSTR(string, position<,length>)    Extracts a substring from an argument

- string specifies any SAS character expression.
- position specifies a numeric expression that is the beginning character position.
- length specifies a numeric expression that is the length of the substring to extract.
  Tip: If you omit length, SAS extracts the remainder of the expression.

Example:
data test;
infile datalines delimiter=',';
input name :$20. sex $;
new_name=compbl(name);                   /* collapse repeated blanks */
blank_pos=find(new_name,' ');            /* position of the first blank */
name_len=length(new_name);
last_name=substr(new_name,blank_pos);    /* starts at the blank, so it keeps one leading blank */
first_name=substr(name,1,name_len - length(last_name));
sex=upcase(sex);
datalines;
Mary Chan, f
Tom Ng, M
David Wong,m
Betty Chung,F
;
run;

Character functions (continued)

Function name                      Description
SCAN(string, n<, delimiter(s)>)    Selects a given word from a character expression

- n specifies a numeric expression that produces the number of the word in the character string you want SCAN to select.
- delimiter specifies a character expression that produces characters that you want SCAN to use as a word separator in the character string.

Note: If you omit delimiter, SAS uses the following characters by default: blank . < ( + & ! $ * ) ; ^ - / , % |

Tip: If you specify delimiter as a literal, enclose it in quotation marks.

Example:
data test;
input name $20.;
surname=scan(name,1,' ');
givenname1=scan(name,2,' ');
givenname2=scan(name,3,' ');
givenname=catx(' ',givenname1,givenname2);
datalines;
Chan Wai Chiu
Yau Sen Hei
Yu Tang Fei
;
run;

Example:
data Q79;
infile datalines delimiter=' ' dsd;
input age 1-2 @4 name :$50.;
surname=scan(name,1,',');
firstname=scan(name,2,',');
drop name;
datalines;
18 "HO, Chun Kit"
17 "LO, Yu Yin"
20 "SUM, On Man"
;
run;
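SCAN can also count words from the right when n is negative, and it returns a blank value when the string has fewer than n words. A brief sketch using one of the names above (the variable names are illustrative):

```sas
data _null_;
    name='Chan Wai Chiu';
    first_word = scan(name, 1, ' ');    /* Chan */
    last_word  = scan(name, -1, ' ');   /* Chiu: a negative n counts from the right */
    no_word    = scan(name, 5, ' ');    /* blank: the string has only three words */
    put first_word= last_word= no_word=;
run;
```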

Character functions (continued)
Function name                                  Description
COMPRESS(<source><, chars><, modifiers>)       Removes specific characters from a character string

- source specifies a source string that contains characters to remove.
- chars specifies a character string that initializes a list of characters. By default, the characters in this list are removed from the source. If you specify the "K" modifier in the third argument, then only the characters in this list are kept in the result.
  Tip: You can add more characters to this list by using other modifiers in the third argument.
  Tip: Enclose a literal string of characters in quotation marks.
- modifiers specifies a character string in which each character modifies the action of the COMPRESS function. Blanks are ignored. These are the characters that can be used as modifiers:
  a or A - adds letters of the Latin alphabet (A - Z, a - z) to the list of characters.
  d or D - adds numerals to the list of characters.
  i or I - ignores the case of the characters to be kept or removed.
  k or K - keeps the characters in the list instead of removing them.
  p or P - adds punctuation marks to the list of characters.

Example:
data test;
input productcode :$10.;
product=compress(productcode, ,'ka');    /* keep letters only */
code=compress(productcode, ,'a');        /* remove letters */
datalines;
Aa235BXT32186798ZYV316X
;
run;
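The three arguments of COMPRESS can be combined in several ways. The sketch below applies them to a made-up phone number; the variable names are hypothetical.

```sas
data _null_;
    phone='(852) 2345-6789';
    digits_only = compress(phone, '()- ');    /* 85223456789: listed characters removed */
    kept_digits = compress(phone, , 'kd');    /* 85223456789: only numerals kept */
    no_blanks   = compress(phone);            /* (852)2345-6789: blanks removed by default */
    put digits_only= kept_digits= no_blanks=;
run;
```

The 'kd' form is often the more robust choice, because it does not depend on listing every unwanted character.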

Date functions
Function name                   Description
day(date)                       Extracts the day value from a SAS date value.
month(date)                     Extracts the month value from a SAS date value.
today()                         Returns the current date as a SAS date value. This function requires no arguments, but it must still be followed by parentheses.
week(date)                      Returns the week number value
weekday(date)                   Returns the day of the week from a SAS date value, where 1=Sunday, 2=Monday, ..., 7=Saturday
year(date)                      Extracts the year value from a SAS date value.
mdy(month,day,year)             Returns a SAS date value from numeric expressions of month, day, and year values
                                - month can be a variable that represents the month, or a number from 1-12
                                - day can be a variable that represents the day, or a number from 1-31
                                - year can be a variable that represents the year, or a number that has 2 or 4 digits
YRDIF(sdate,edate,'Actual')     Returns the difference in years between two dates. 'Actual' uses the actual number of days between dates in calculating the number of years.
DATDIF(sdate,edate,'Actual')    Returns the actual number of days between two dates

Example:
data test;
input id birthday birthmonth birthyear;
birthdate=mdy(birthmonth,birthday,birthyear);
birthweek=week(birthdate);
birthweekday=weekday(birthdate);
cutoffdate='1jan2004'd;
day_diff=datdif(cutoffdate,birthdate,'actual');
year_diff=yrdif(cutoffdate,birthdate,'actual');
format birthdate cutoffdate date9.;    /* display the date values as dates */
datalines;
1 31 12 2005
2 1 1 2006
3 28 2 2006
4 31 3 2006
;
run;

Date functions (continued)

Function name Description

INTCK('interval',from,to) - Returns the number of time intervals that occur in a given time span

where
'interval' specifies a character constant or variable. The value must be one of the following: DAY, WEEKDAY, WEEK, MONTH, HOUR, QTR, YEAR
from specifies a SAS date value that identifies the beginning of the time span
to specifies a SAS date value that identifies the end of the time span

The INTCK function counts intervals from fixed interval beginnings, not in multiples of an interval unit from the from value. Partial intervals are not counted. For example, WEEK intervals are counted by Sundays rather than seven-day multiples from the from argument. MONTH intervals are counted by day 1 of each month, and YEAR intervals are counted from 01JAN, not in 365-day multiples.
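The boundary-counting rule described above can be sketched with two made-up date pairs. The dates and variable names below are illustrative assumptions, chosen so that the elapsed days and the number of month boundaries crossed disagree.

```sas
data _null_;
  /* One day apart, but the 01FEB month boundary is crossed */
  months_a = intck('month', '31jan2001'd, '01feb2001'd);
  days_a   = '01feb2001'd - '31jan2001'd;

  /* 27 days apart, but no month boundary is crossed */
  months_b = intck('month', '01feb2001'd, '28feb2001'd);
  days_b   = '28feb2001'd - '01feb2001'd;

  put months_a= days_a= / months_b= days_b=;
  /* log: months_a=1 days_a=1
          months_b=0 days_b=27 */
run;
```

When elapsed time is what matters, subtracting the two dates (or DATDIF) is usually the better fit; INTCK answers "how many boundaries were crossed".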

SAS Statement                                           Value
Weeks = intck('week','31 dec 2000'd,'01jan2001'd);      0
Months = intck('month','31 dec 2000'd,'01jan2001'd);    1
Years = intck('year','31 dec 2000'd,'01jan2001'd);      1

Because December 31, 2000, is a Sunday, no WEEK interval is crossed between that day and January 1, 2001. However, both MONTH and YEAR intervals are crossed.

Date functions (continued)

Function name - Description
INTNX('interval',start-from,increment<,'alignment'>) - Increments a date value by a given interval or intervals, and returns a date value

where
'interval' specifies a character constant or variable. The value must be one of the following: DAY, WEEKDAY, WEEK, MONTH, HOUR, QTR, YEAR
start-from specifies a starting SAS date value
increment specifies a negative or positive integer that represents time intervals toward the past or future
'alignment' (optional) forces the alignment of the returned date to the beginning, middle, or end of the interval.

For example, the following statement creates the variable TargetYear and assigns it a SAS date value of 13515, which corresponds to January 1, 1997.

TargetYear=intnx('year','05feb94'd,3);

The optional alignment argument lets you specify whether the date value should be at the beginning, middle, or end of the interval. When specifying date alignment in the INTNX function, use the following arguments or their corresponding aliases:

Argument     Alias
BEGINNING    B
MIDDLE       M
END          E
SAMEDAY      S

The best way to understand the alignment argument is to see its effect on identical statements. The following table shows the results of three INTNX statements that differ only in the value of alignment.

SAS Statement                               Date Value
MonthX=intnx('month','01jan95'd,5,'b');     12935 (June 1, 1995)
MonthX=intnx('month','01jan95'd,5,'m');     12949 (June 15, 1995)
MonthX=intnx('month','01jan95'd,5,'e');     12964 (June 30, 1995)

These statements count five months from January, but the returned value depends on whether alignment specifies the beginning, middle, or end day of the resulting month. If alignment is not specified, the beginning day is returned by default.
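The SAMEDAY alias is listed above but not shown in the table. A minimal sketch, using a made-up start date of January 15, 1995 (an assumption for illustration), compares it with the default and END alignments:

```sas
data _null_;
  start = '15jan95'd;                              /* hypothetical start date */
  default_align = intnx('month', start, 1);        /* no alignment: beginning of the interval */
  same_day      = intnx('month', start, 1, 's');   /* same day of the month */
  end_align     = intnx('month', start, 1, 'e');   /* end of the interval */
  put default_align= date9. same_day= date9. end_align= date9.;
  /* log: default_align=01FEB1995 same_day=15FEB1995 end_align=28FEB1995 */
run;
```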

Special functions

Function name - Description
INPUT(source,informat) - Explicit character-to-numeric conversion

where
source indicates the character variable, constant, or expression to be converted to a numeric value
a numeric informat must also be specified, as in this example: input(payrate,2.)

Function name - Description
PUT(source,format) - Explicit numeric-to-character conversion

where
source indicates the numeric variable, constant, or expression to be converted to a character value
a format matching the data type of the source must also be specified, as in this example: put(site,2.)

Question

A typical value for the character variable Target is 123,456. Which statement correctly converts the values of Target to numeric values when creating the variable TargetNo?

a. TargetNo=input(target,comma6.);
b. TargetNo=input(target,comma7.);
c. TargetNo=put(target,comma6.);
d. TargetNo=put(target,comma7.);

Correct answer: b
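The reasoning behind the answer can be sketched in a short DATA step. The variable names too_short and back are introduced here for illustration only.

```sas
data _null_;
  target    = '123,456';
  targetno  = input(target, comma7.);  /* width 7 covers all seven characters */
  too_short = input(target, comma6.);  /* width 6 reads only '123,45' */
  back      = put(targetno, comma7.);  /* numeric back to character */
  put targetno= too_short= back=;
  /* log: targetno=123456 too_short=12345 back=123,456 */
run;
```

The width of the informat counts every character in the source value, including the comma, which is why COMMA6. silently loses the last digit.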

Chapter 4: Modifying and Combining SAS Data Sets

Contents
Reading Single SAS Data Set
Concatenating SAS data sets
Merging data sets

Reading Single SAS Data Set
It is often necessary to update an existing SAS data set, or to create a new SAS data set from an existing one, in order to:

select observations based on one or more conditions
keep or drop variables
rename variables
create new variables

To read an existing SAS data set, use the SET statement.

SET statement
DATA data_set_name <data_set_options>;
<Other DATA step statements>
SET sas_data_set <data_set_options> <options>;
<Other DATA step statements>
RUN;

data_set_name is the name of the SAS data set to be created
sas_data_set is the name of the SAS data set to be read
Any DATA step statements can be placed before or after the SET statement

How does it work?

1. Compilation phase

No input buffer is created; a tracking pointer points to the first observation of the SAS data set to be read
The PDV is created as usual; all variables contained in the SAS data set to be read are included by default

2. Execution phase

As the SET statement is executed, the values from the pointed observation are copied to the PDV
At the end of each iteration of the DATA step, the values in the PDV are written to the new data set
At the beginning of each iteration, the values of variables that were read from the SAS data set with the SET statement, or that were created by a SUM statement, are retained in the PDV; all other variable values are set to missing

Example:
Suppose a SAS data set Scores exists in the Mylib library

data case1;
set Mylib.scores;
run;
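The retain-or-reset behaviour described in the execution phase can be observed in the log. This is a minimal sketch: the data set scores, its values, and the variables total and running are made-up stand-ins, not the actual contents of Mylib.scores.

```sas
/* Hypothetical stand-in for Mylib.scores; values are made up */
data scores;
  input StudentID score1 score2 score3;
  datalines;
1 70 80 90
2 60 . 75
3 85 95 65
;
run;

data demo;
  /* Written to the log BEFORE the SET statement executes, so it shows
     the PDV as it stands at the beginning of each iteration */
  put _n_= StudentID= total= running=;
  set scores;
  total = sum(score1, score2, score3);  /* assignment: reset to missing each iteration */
  running + total;                      /* SUM statement: retained across iterations */
run;
```

On the second iteration the log line should still show StudentID=1 (read by SET, so retained) and running=240 (created by a SUM statement, so retained), while total is back to missing.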

SET statement - Dropping unwanted variables
Suppose the variables score2 and score3 of SCORES are not wanted anymore

DROP= data set option: these variables (score2 and score3) are not read into the PDV and cannot be used in the DATA step
Example:
data case1;
set Mylib.scores (drop=score2 score3);
run;

DROP statement: these variables are kept in the PDV but are not output to the new data set; they can still be used in the DATA step
Example:
data case2;
set Mylib.scores;
drop score2 score3;
run;

SET statement - Keeping selected variables only
Suppose only the variables StudentID and score3 are wanted

KEEP= data set option: only these variables are read into the PDV and output to the new data set
Example:
data case3;
set Mylib.scores (keep=StudentID score3);
run;

KEEP statement: all variables are kept in the PDV but only these variables are output to the new data set
Example:
data case3;
set Mylib.scores;
keep StudentID score3;
run;
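The practical difference between the data set option and the statement shows up when a dropped variable is used in a calculation. A minimal sketch, assuming a made-up data set scores with variables score1 to score3 (the data and the variable total are illustrative):

```sas
/* Hypothetical stand-in for Mylib.scores; values are made up */
data scores;
  input StudentID score1 score2 score3;
  datalines;
1 70 80 90
2 60 50 75
;
run;

/* DROP statement: score2 and score3 are in the PDV, so total is computed */
data with_statement;
  set scores;
  total = score1 + score2 + score3;
  drop score2 score3;
run;

/* DROP= option: score2 and score3 are never read, so the references below
   create new uninitialized variables and total is missing */
data with_option;
  set scores (drop=score2 score3);
  total = score1 + score2 + score3;
run;
```

In the second step the log should carry a note that score2 and score3 are uninitialized, which is a useful warning sign that a variable was dropped too early.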

SET statement - Rename variables
Suppose variable StudentID is to be renamed to SID and variable score3 is to be renamed to quiz3.
Example:
data case4;
set Mylib.scores (rename=(StudentID=SID score3=quiz3));
run;

Note: The renaming affects only the PDV and the new data set.

SET statement - Selecting the nth-mth observations
Example:
Suppose a SAS data set TEMP1 contains 500 observations. Write a SAS DATA step to create a SAS data set for each of the following:

a. The new data set contains only the first 100 observations of TEMP1.
b. The new data set contains only the last 100 observations of TEMP1.
c. The new data set contains the 101st - 300th observations of TEMP1.

a.
data Qa;
set Temp1 (obs=100);
run;

b.
data Qb;
set Temp1 (firstobs=401);
run;

c.
data Q12c;
set Temp1 (firstobs=101 obs=300);
run;

SET statement - Selecting observations conditionally
Example:
data Q16a Q16b;
set Mylib.Fltattnd;
IF JOBCODE='FLTAT1' THEN output Q16a;
IF JOBCODE='FLTAT2' THEN output Q16b;
run;

SET statement - Selecting an observation directly (direct access)
Use the POINT= option in the SET statement

point_variable = obs_number ;
SET data_set_name POINT = point_variable ;

point_variable specifies a temporary numeric variable
point_variable appears in the PDV but not in the final data set
obs_number is the observation number of the observation to be read; it must be assigned to point_variable before the SET statement executes

Example:

data case5;
obsnum=102;
set Mylib.booksales (keep=ID gender firstpurch) point=obsnum;
output;
stop;
run;

With the POINT= option SAS reads only the specified observations and never reads an end-of-file indicator, so the DATA step would loop forever
A STOP statement must be used to make SAS stop processing the current DATA step immediately
The DATA step normally writes an observation at the end of each iteration, but the STOP statement stops processing before the end of the DATA step is reached, so no observation is output automatically
Use an OUTPUT statement before the STOP statement to write the observation explicitly
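The same pattern extends to several observation numbers by wrapping the SET statement in a DO loop. A minimal sketch, using a made-up 20-observation data set demo_source instead of Mylib.booksales (both the data set and its variables are assumptions for illustration):

```sas
/* Hypothetical source data: 20 observations with ID = 1..20 */
data demo_source;
  do ID = 1 to 20;
    value = ID * 10;
    output;
  end;
run;

data picked;
  do obsnum = 3, 7, 12;          /* observation numbers to read */
    set demo_source point=obsnum;
    output;                      /* explicit output: one per observation read */
  end;
  stop;                          /* no end-of-file is ever read with POINT= */
run;
```

The resulting data set should contain the three observations with ID 3, 7 and 12; obsnum itself is not written because it is the POINT= variable.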

SET statement - Selecting every kth observation
Example:
Write a SAS DATA step to select a 1000-observation subset from the data set SALE2000 by reading every tenth observation, starting from observation number 10.
data case6;
do obs=10 to 10000 by 10;
  set Mylib.sale2000 point=obs;
  output;
end;
stop;
run;

Write a SAS DATA step to select every tenth observation of the observations in SALE2000.
Suppose you do not know the total number of observations in SALE2000.SAS7BDAT. You can use the NOBS= option, which creates a temporary variable that contains the total number of observations in the input data set. Note that the value of the NOBS= variable is assigned at compilation time, so the variable can be used in executable statements that appear before the SET statement.
data case6q;
do obs=10 to ttlobs by 10;
  set Mylib.sale2000 point=obs nobs=ttlobs;
  output;
end;
stop;
run;
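Because the NOBS= variable is available before the SET statement executes, it can also be used to compute a starting point. The sketch below reads the last five observations of a made-up data set demo_source whose size is treated as unknown; the data set and variable names are assumptions for illustration.

```sas
/* Hypothetical source data: 25 observations with ID = 1..25 */
data demo_source;
  do ID = 1 to 25;
    output;
  end;
run;

data last_five;
  /* ttlobs already holds the observation count when the loop bounds are evaluated */
  do obs = max(1, ttlobs - 4) to ttlobs;
    set demo_source point=obs nobs=ttlobs;
    output;
  end;
  stop;
run;
```

The MAX function guards against a starting observation number below 1 when the data set has fewer than five observations.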

SET statement - Creating a random sample with replacement

With replacement: observations can be selected more than once

The major steps:
First generate a random number, say k
Read the kth observation directly
Repeat the above two steps until the required number of observations are selected

Generate a random number

Function RANUNI(seed) returns a value between 0 and 1
seed must be an integer
seed = 0 uses the system clock time, resulting in different output each time

To get an integer between 1 and M, use function CEIL( ) as follows:
CEIL(RANUNI(seed) * M)

CEIL( ) function returns the smallest integer that is greater than or equal to the argument

Example:
data case7;
samplesize=100;
do i=1 to samplesize;
  sample_point=ceil(ranuni(0)*ttlobs);
  set Mylib.booksales (keep=ID gender firstpurch) point=sample_point nobs=ttlobs;
  output;
end;
stop;
drop samplesize i;
run;
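For a sample that can be reproduced from run to run, a positive seed can be used in place of 0. A minimal sketch on a made-up 20-observation data set; the seed 12345, the sample size of 5, and all names are arbitrary choices for illustration.

```sas
/* Hypothetical source data: 20 observations with ID = 1..20 */
data demo_source;
  do ID = 1 to 20;
    output;
  end;
run;

data sample5;
  do i = 1 to 5;
    pick = ceil(ranuni(12345) * ttlobs);  /* integer between 1 and ttlobs */
    set demo_source point=pick nobs=ttlobs;
    picked_obs = pick;  /* copy: the POINT= variable itself is not written out */
    output;
  end;
  stop;
  drop i;
run;
```

Because sampling is with replacement, the same ID may appear more than once in sample5; the copy in picked_obs makes it easy to see which observation numbers were drawn.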

BY-group processing - To group observations for processing

DATA data_set_name ;
SET sas_data_set <(data_set_options)> <options>;
BY variable1 <variable2 · · · >;

The data set in the SET statement must be sorted by the values of the BY variables
Two temporary variables are created for each BY variable

First.variable1: equals 1 for the first observation in a BY group; 0 otherwise
Last.variable1: equals 1 for the last observation in a BY group; 0 otherwise

Example:
Suppose you want to compute the total amount of money spent (M) on books by each MCODE level in BOOKSALES.SAS7BDAT
proc sort data=Mylib.booksales out=sort_booksales;
by mcode;
run;

data case8;
set sort_booksales (keep=mcode m);
by mcode;
if first.mcode=1 then total_spent=0;
total_spent+m;
if last.mcode=1 then output;
drop m;
run;
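The same FIRST./LAST. pattern can carry more than one accumulator. The sketch below adds a count and a mean per group; the data set purchases and its values are made up for illustration and are not taken from BOOKSALES.

```sas
/* Hypothetical data: amount spent (m) per purchase, by mcode */
data purchases;
  input mcode m;
  datalines;
2 30
1 10
2 5
1 20
1 15
;
run;

proc sort data=purchases out=sorted;
  by mcode;
run;

data summary;
  set sorted;
  by mcode;
  if first.mcode then do;   /* reset accumulators at the start of each group */
    n_obs = 0;
    total_spent = 0;
  end;
  n_obs + 1;                /* SUM statements: retained within the group */
  total_spent + m;
  if last.mcode then do;    /* one output observation per group */
    mean_spent = total_spent / n_obs;
    output;
  end;
  drop m;
run;
```

With this data, summary should hold two observations: mcode=1 with n_obs=3 and total_spent=45, and mcode=2 with n_obs=2 and total_spent=35.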

Behind the scenes - PDV

Using more than one variable in the BY statement
FIRST.BY-primary-variable = 1 forces FIRST.BY-secondary-variable = 1
LAST.BY-primary-variable = 1 forces LAST.BY-secondary-variable = 1

In the table below, MCODE is the BY-primary-variable and GENDER is the BY-secondary-variable:

MCODE  GENDER  FIRST.MCODE  LAST.MCODE  FIRST.GENDER  LAST.GENDER
1      0       1            0           1             0
1      0       0            0           0             1
1      1       0            0           1             0
1      1       0            1           0             1
2      1       1            0           1             0
2      1       0            0           0             0
2      1       0            1           0             1

Example:
Suppose you want to compute the total amount of money spent (M) on books by each gender in each MCODE level
proc sort data=Mylib.booksales out=sort_booksales;
by mcode gender;
run;

data case8;
set sort_booksales;
by mcode gender;
if first.gender=1 then total_spent=0;
total_spent+m;
if last.gender=1 then output;
keep mcode gender total_spent;
run;

Concatenating SAS data sets
Stacking data sets - To stack or concatenate SAS data sets one on top of the other

DATA data_set_name ;
SET sas_data_set1 <(data_set_options)>
    sas_data_set2 <(data_set_options)> … <options> ;
<Other DATA step statements>
RUN;

Can read any number of SAS data sets in one SET statement
Common variables must have the same data type
The new data set contains all of the variables and observations from all of the data sets listed in the SET statement

How does it work?

Similar to reading a single SAS data set
Observations from the first data set that is listed in the SET statement are read first
Then the observations from the second data set that is listed, and so on

Example:
data Jan;
input name $ 1-20 sales;
datalines;
Daivd Wong           4500
Francis Leung        6000
Joe Chan             3000
;
run;

data Feb;
input name $ 1-20 sales;
datalines;
Joe Chan             5000
Daivd Wong           6000
John Tai             4500
;
run;

data case9;
set Jan Feb;
run;

Missing values will be generated if the stacked data sets use different variable names
Example:
data Jan;
input name $ 1-20 sales1;
datalines;
Daivd Wong           4500
Francis Leung        6000
Joe Chan             3000
;
run;

data Feb;
input name $ 1-20 sales2;
datalines;
Joe Chan             5000
Daivd Wong           6000
John Tai             4500
;
run;

data case10;
set Jan Feb;
run;

Solution: Change to the same variable name
data case10a;
set Jan (rename=(sales1=sales)) Feb (rename=(sales2=sales));
run;

Use the IN= option to determine which data set contributed to the current observation

SET sas_data_set (IN = in_variable) … ;

in_variable is a temporary numeric variable that equals 1 when the data set contributed to the current observation, 0 otherwise

data Jan;
input name $ 1-20 sales;
datalines;
Daivd Wong           4500
Francis Leung        6000
Joe Chan             3000
;
run;

data Feb;
input name $ 1-20 sales;
datalines;
Joe Chan             5000
Daivd Wong           6000
John Tai             4500
;
run;

data case11;
set Jan (in=file1) Feb (in=file2);
if file1=1 then month='Jan';
if file2=1 then month='Feb';
run;

Merging data sets
To join corresponding observations from two or more SAS data sets

DATA data_set_name ;
MERGE sas_data_set1 <(data_set_options)>
      sas_data_set2 <(data_set_options)> · · · <options>;
BY variable1 <variable2 · · · >;
<Other DATA step statements>
RUN;

The data sets in the MERGE statement must be sorted by the values of the BY variables
Data set options (such as KEEP=, DROP=, RENAME= and IN=) are used in the same way as in the SET statement
If variables that have the same name appear in more than one data set, the value of the variable is the value in the last data set that contains it

How does it work?

Compilation phase - To prepare to merge data sets, SAS

1. reads the descriptor portions of the data sets that are listed in the MERGE statement
2. reads the rest of the DATA step program
3. creates the program data vector (PDV) for the merged data set
4. assigns a tracking pointer to each data set that is listed in the MERGE statement.

Execution phase

As the MERGE statement executes, SAS compares the pointed observation of each listed data set to see whether the BY values match

If yes, the observations are written to the PDV in the order in which the data sets appear in the MERGE statement
If no, SAS determines which of the values comes first and writes the observation that contains this value to the PDV

At the end of each iteration, SAS writes the observation to the data set and

Variables created by the DATA step are set to missing in the PDV
If neither data set contains any more observations in the BY group, variables that come from the listed data sets are set to missing in the PDV. Otherwise, their values are retained in the PDV

One-to-one with equal list matching
Example:
Suppose the marks of MS1111 and MS1112 for each student are stored in SAS data sets MS1111 and MS1112 respectively. To calculate the average mark for each student, the two data sets must be merged

data combinea;
merge ms1111 ms1112;
by id;
run;

In COMBINEA, MARK in MS1112 overwrites MARK in MS1111, so the variables must be renamed:

data combineb;
merge ms1111(rename=(mark=mark_ms1111)) ms1112(rename=(mark=mark_ms1112));
by id;
average_mark=(mark_ms1111+mark_ms1112)/2;
run;

One-to-one with unequal list matching
Some students took MS1111 but not MS1112, or vice versa

data combinec;
merge ms1111 ms1112;
by id;
run;

Use the IN= option to select observations that appear in both data sets
data combined;
merge ms1111(in=ms1 rename=(mark=mark1)) ms1112(in=ms2 rename=(mark=mark2));
by id;
if ms1=1 and ms2=1 then do;
  average_mark=(mark1+mark2)/2;
  output;
end;
run;
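The IN= variables can also route matches and non-matches to separate data sets in one pass. A minimal sketch with made-up marks; the data values and the output data set names both, only_1111 and only_1112 are assumptions for illustration.

```sas
/* Hypothetical marks; both data sets are already in id order */
data ms1111;
  input id mark;
  datalines;
1 70
2 80
3 65
;
run;

data ms1112;
  input id mark;
  datalines;
2 90
3 55
4 75
;
run;

data both only_1111 only_1112;
  merge ms1111(in=in1 rename=(mark=mark1))
        ms1112(in=in2 rename=(mark=mark2));
  by id;
  if in1 and in2 then output both;      /* took both courses */
  else if in1 then output only_1111;    /* MS1111 only */
  else output only_1112;                /* MS1112 only */
run;
```

With this data, both should contain ids 2 and 3, only_1111 should contain id 1, and only_1112 should contain id 4.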


One-to-many / Many-to-one matching

The order of the data sets in the MERGE statement does not matter to SAS
A one-to-many merge is the same as a many-to-one merge, although the order of the variables in the new data set is not the same

Example:
Suppose CUSTOMERID contains the profiles of customers and SALES contains the products purchased by each customer

data sale_profile;
merge customerid sales;
by id;
run;

The same merge written as a many-to-one merge; only the order of the variables in the new data set differs:

data sale_profile;
merge sales customerid;
by id;
run;
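How the "one" side is repeated across the "many" side can be seen on a small case. The sketch below uses made-up contents for CUSTOMERID and SALES; the variables gender, age and product and all values are assumptions for illustration.

```sas
/* Hypothetical customer profiles: one observation per id */
data customerid;
  input id gender $ age;
  datalines;
1 F 34
2 M 45
;
run;

/* Hypothetical purchases: possibly several observations per id */
data sales;
  input id product $;
  datalines;
1 Book
1 Pen
2 Lamp
;
run;

data sale_profile;
  merge customerid sales;
  by id;
run;
```

The merged data set should have three observations: the profile values for id 1 are retained in the PDV within the BY group, so they appear on both the Book and the Pen rows.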

Use the IN= option to identify the non-matches
Example:
Suppose SALESA contains the list of products purchased by some customers. You want to identify the group of customers who did not purchase any item at all

data sale_profile;
merge customerid (in=file1) salesa (in=file2);
by id;
if file1=1 and file2=0 then output;
keep id gender age;
run;