SAS Programming Notes 1 Further topics
SAS Programming Notes for
Data Mining and Exploration
Lecturer: Amos Storkey
School of Informatics
University of Edinburgh
Acknowledgements: These notes are extensively based on notes developed over a long period by the School of
Accounting, Economics & Statistics, Napier University. People who have worked on or contributed to these notes over that time include Amos Storkey, Ana Costa Da Silva, Phil Darby, Helen Storkey, Jeff Dodgson, Dorothy Currie, Kate Houston
and Kirsty Davidson. I am very grateful for permission to use and develop these
notes for the Data Mining and Exploration course.
First published September 2000
Updated September 2001 (SAS version 8) and February, July
September 2002
October 2004 (SAS version 8.1) and September 2005
December 2006 (SAS version 9.1.3)
January 2008 (SAS 9.2 and linux differences)
File: document.doc
Contents
1. GETTING STARTED
1.1 What is the SAS system?
1.2 The SAS Workspace
1.3 Creating and running a SAS program
1.4 Submitting and correcting your program
1.5 Saving files and clearing text from windows
1.6 Reading a saved program
1.7 A Data Analysis Flow Chart
1.8 Importing data using a wizard
1.9 Viewing a data set
1.10 Creating a SAS Program
1.11 Rules for entering SAS statements
1.12 Adding comments to a program
1.13 Including titles in your SAS output
1.14 Creating new variables
1.15 Printing and saving SAS output
2. DATA FILES AND SAS DATA SETS
2.1 Reading data files using the INFILE statement
2.2 LIBNAME and permanent SAS Data Sets
2.3 Referencing a permanent SAS data set
2.4 Contents of a file
2.5 Importing data from other packages
2.6 Missing values
2.7 The INPUT statement
3. SAS PROCEDURES
3.1 Structure of a SAS program
3.2 Sample program
4. SUMMARISING DATA
4.1 SAS System Options
4.2 HTML output
4.3 Summary Procedures
4.4 PROC SORT
4.5 PROC MEANS
4.6 PROC UNIVARIATE
4.7 PROC FREQ
4.8 General syntax for a procedure
4.9 Help
5. GRAPHS AND CHARTS
5.1 Graphics procedures
5.2 PROC PLOT
5.3 PROC CHART
6. CORRELATION AND REGRESSION
6.1 PROC CORR
6.2 PROC REG
7. EXPLORATORY DATA ANALYSIS
7.1 SAS/INSIGHT
7.2 Accessing SAS/INSIGHT
7.3 Features of SAS/INSIGHT
7.4 Using SAS/INSIGHT
Tools
8. MODIFYING DATA AND OUTPUT
8.1 Introduction
8.2 SET statement
8.3 DROP and KEEP
8.4 Labelling output
8.5 PROC PRINT
8.6 PROC FORMAT
8.7 Recoding data
8.8 Conditional statements
8.9 VALUE statement
8.10 OUTPUT
9. PROC TABULATE
10. FUNCTIONS AND FORMATS
10.1 MEAN function
10.2 NMISS function
10.3 N function
10.4 Functions to handle character variables
10.5 Date and Time Formats
11. ITERATIVE PROCESSING
11.1 Do loops and arrays
11.2 Reading data in repeated patterns
11.3 Arrays
11.4 Generating random numbers
11.5 Random numbers from a uniform distribution
11.6 Random numbers from a normal distribution
11.7 The SAS Program Data Vector
11.8 The RETAIN and Sum statements
12. FURTHER TOPICS
12.1 Combining Data Sets
12.2 Hints on Using Word with SAS and SAS/INSIGHT
Solutions to exercises
Various files are referred to in these notes. These can be found in a zip file on the Data Mining and Exploration web site
www.inf.ed.ac.uk/teaching/courses/dme/
1. Getting started
1.1. Introduction
The SAS system is a widely used resource for statistical analysis and data mining. It is rare to find a job advert for a data mining practitioner that does not ask for SAS skills. The main strengths of SAS are its ability to handle large files fairly transparently, the ease and comprehensiveness with which standard analyses can be done, the way analyses can be built interactively alongside a systematic programming environment, and its data handling capabilities. Its main weaknesses are its graphical capabilities, and the fact that adding your own extensions to the techniques, using macros and the interactive matrix language, is slightly more cumbersome than in other languages (e.g. MATLAB, R) or with more modern language constructs.
This tutorial will introduce you to the SAS System. This tutorial should be suitable for those working on either a Linux or Windows system. Interface tools in SAS for Windows are much better and so where there are differences these will also be mentioned.
SAS is, at its heart, a piece of software for data handling and storage, statistical and data analysis, data mining, decision support and report writing. It has been extended to a whole business intelligence package, but the best way of understanding SAS is from the inside out, and so this tutorial will teach the base SAS software to get you started. With base SAS software you can store data values and retrieve them, modify data, compute simple statistics, and create reports, all in one SAS session. The difference between SAS and most statistical packages is that SAS incorporates both a database management system and a high-level programming language. There is also SAS software which provides graphics, forecasting, data entry, and statistics. The SAS system also contains other sophisticated applications that are valuable to large enterprises. All are available in one system.
1.2. The SAS Workspace
To start SAS on a Linux system type SAS at the command prompt. On Windows, select SAS from the Start menu.
When you go into SAS, the first thing you see is a set of windows as shown in Figure 1. Your display may appear a little different since this has been adjusted to allow all the windows to be seen at once. There are five different windows shown in this figure. Two further windows are available in SAS version 9; you can switch between windows by clicking on the buttons at the bottom of the SAS window.
Figure 1 SAS window on opening in Windows
Figure 2 SAS window on opening in Linux
The five windows are:
the EDITOR window where you enter the SAS statements you wish to execute. The EDITOR has handy features like colour coding and expandable and collapsible sections.
the LOG window which contains information on your SAS run, e.g. date and time of run, a listing of your SAS statements as they are executed and any errors which have occurred during processing.
the OUTPUT window which displays the actual results of the program.
the EXPLORER window, which allows you to view and manage your SAS files and create shortcuts to non-SAS files. For example you can use this window to create new libraries or to open any SAS file.
the RESULTS window, which helps you navigate and manage output from SAS programs you submit. You can view, save, and print individual items of output. (By default, the Results window is positioned behind the Explorer window, but when you submit a SAS program that creates output it moves to the front of your display.)
The two windows not shown are:
the GRAPH window, which appears when graphical output is to be displayed.
a seventh window, which appears when HTML output is used. The output delivery system (ODS) can be turned on using programming code or by using the menu options.
You may turn a window on or off by using View from the main menu. Just choose the window you need (use this if you ‘lose’ a window).
Task 1
Resize the 3 windows on the right hand side so that you see the OUTPUT as well as the EDITOR and the LOG. Make the EDITOR the largest window.
You can activate any of the windows by
clicking on the window (Windows or Linux)
selecting Window from the menu, then the window you want (Windows)
selecting View from the menu (Windows or Linux)
1.3. Creating and running a SAS program
The following lines of code are a simple SAS program.
    data class1;
    input height weight sex $;
    datalines;
    152 45.4 F
    178 73.0 M
    178 68.8 M
    175 59.7 M
    157 44.5 F
    165 61.7 M
    175 74.1 M
    160 49.5 F
    run;

    proc print;
    run;
When they are typed into the editor window the words will become colour coded.
Reserved words appear blue (e.g. proc, print, input)
Comments appear green in Windows and black in Linux (see Section 1.12 for details of entering comments).
Errors appear red.
Task 2
Enter the SAS program in the EDITOR window.
1.4. Submitting and correcting your program
There are several methods of submitting your program.
1. Highlight the section of code you wish to run and press the running man icon (in Windows).
2. Ensure that your cursor is in the EDITOR window, then select Run -> Submit (in Windows or Linux).
3. You can also run just a few lines of code by selecting Run -> Submit top line or Submit N lines (in Windows or Linux).
4. Right click with the mouse and select Submit All, or press the running man icon.
An alternative to pressing the running man icon is to press the F3 key in Windows or the End key in Linux.
Examine your LOG window to check that there were no error messages: if all is well examine your output in the OUTPUT window.
If you have error messages in your log file you will need to correct the mistakes and resubmit the program.
After submitting your code you may find that it has disappeared from the editor window. To get it back, select Run -> Recall Last Submit.
1.5. Saving files and clearing text from windows
When you have succeeded in getting your program to run you can save it as filename.sas (SAS automatically gives it a .sas ending to remind you that it is a SAS program). Make sure your EDITOR window is active before doing File -> Save (or pressing the floppy disk icon). Otherwise you might be saving the contents of your log or output window instead of your program.
Save log files as filename.log and output files as filename.lst if you want to save them too. It is usually not necessary to save the log file.
Important: In order to avoid getting confused about which output and log refers to which program or version of a program, make it a habit to clear your windows before submitting a new program. Do this by selecting Edit -> Clear All.
Run -> Recall Last Submit returns the program you have just run to the EDITOR window. This is useful if you have cleared the program by mistake.
Task 3
Create a new folder in your personal disk space called MA71064 Statistical computing.
Submit the SAS program from the program editor. When it is working satisfactorily save the file as class1.sas in the folder you have just created.
1.6. Reading a saved program
A SAS program needs to be in an EDITOR window before it can run. To open a saved SAS program, activate the EDITOR window and use File -> Open. The program can then be submitted in the usual way.
You can have more than one EDITOR window open at the same time. However this can be confusing and it is easiest at first to have only one program open at a time.
1.7. A Data Analysis Flow Chart
Data analysis can be thought of in terms of a process flow. Actions proceed in a sequence. Often the output from one action leads to the input of another. A simple flow chart is given below.
Figure 3 A simple data analysis flow diagram: Start -> Read or create a data set (DATA step) -> Display the data (PROC PRINT) -> Calculate the averages (PROC MEANS) -> End
SAS programs can contain combinations of DATA steps and PROCEDURES. The SAS program you used above executed the first 2 blocks in the flow diagram. Quite quickly you will be producing more complicated programs that will have many DATA steps and PROCS.
1.8. Importing data using a wizard
The next example reads the Excel file Class0 into the temporary SAS library called Work. The format of the data is displayed and the summary statistics (count, average and standard deviation) of the height readings are calculated.
The simplest method of entering data into SAS is using the import wizard.
File -> Import Data will display the dialogue shown in Figure 4.
A. The source type default is Excel, but others are available from the pull-down menu. Press Next.
B. Locate the source file by pressing the Browse button. Press OK.
C. Select the appropriate worksheet. From the options ensure that ‘Use data in the first row as SAS names’ is ticked. Press OK, then Next.
D. Enter the Member as Class0. Press Finish.
Check the log window for errors.
Figure 4 Import data wizard
In Linux, there is naturally not the option of importing from Excel. However, there is the option of importing csv files. An Excel file can be opened using Open Office and can be saved as a Comma Separated File (csv). It can then be imported straight into SAS.
In Linux, steps B and C above are replaced by the dialogue in Figure 5. Similar options are available when pressing the respective buttons. The remainder of the dialogue is similar to that in Figure 4.
Figure 5 Import data wizard in Linux
The final step of the wizard, in both Linux and Windows, is optional and offers the possibility of saving the importation command in a specified file, which can be opened with the Program Editor. This can be copy-pasted into any program and be run, without need to follow the steps of the wizard again.
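The importation command saved by the wizard is normally a PROC IMPORT step. As a rough sketch of what such a step looks like for a comma separated file, assuming a hypothetical file called class0.csv in the current directory (the file name and location are placeholders, not files supplied with these notes):

```sas
/* Sketch only: the file name below is a placeholder. */
proc import datafile="class0.csv"   /* hypothetical CSV export of Class0  */
            out=work.class0          /* same member name as in step D      */
            dbms=csv
            replace;                 /* overwrite work.class0 if it exists */
  getnames=yes;                      /* first row holds the variable names */
run;

proc print data=work.class0;         /* check what was read */
run;
```

Keeping a step like this in your program means the import can be rerun without stepping through the wizard again.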
1.9. Viewing a data set
Once the data is in SAS format you can look at it in a variety of ways.
1 Submit the statements: proc print; run;
2 From the explorer window, double click on the libraries icon to reveal libraries that are present. These libraries are simply pointers to Windows XP folders where the data sets are stored. Double clicking on the work library reveals the data set Class0.
3 Double click on the data set to open it.
4 Right click on the data set to display a set of options. These include:
Open
View the Columns
View in Excel (only in Windows).
Task 4
Import the Excel file class0.xls into SAS using the import wizard, then display the imported data set using Excel.
1.10. Creating a SAS Program
You have already submitted a simple SAS program which created and then printed out a set of data. The following is an extension of that program. The line numbers have been included to help explain the structure of the program: they are not part of the program itself and should not be typed.
Line number  Program
001  data class2;
002  input height weight sex $ bends pulse1 pulse2;
003  datalines;
004  152 45.4 F 6 61 84
005  178 7.0 M 8 59 102
006  178 68.8 M 12 58 95
007  175 59.7 M 5 76 83
008  157 44.5 F 5 53 102
009  165 61.7 M 10 70 110
010  175 74.1 M 5 76 102
011  160 49.5 F 2 67 118
012  161 52.6 M 5 80 103
013  180 85.4 M 7 84 102
014  160 57.2 F 7 98 115
015  170 69.9 M 7 69 102
016  178 67.0 M 11 60 79
017  163 57.0 F 8 70 98
018  160 60.9 F 12 57 84
019  185 73.1 M 5 68 .
020  188 79.1 M 3 53 69
021  159 49.5 F 6 69 112
022  run;
023  proc print;
024  run;
025  proc means;
026  run;
Line number  Explanation
001  The DATA statement tells SAS to create a data set called class2.
002  The INPUT statement names the variables in the order they appear in the data lines. Variable names must start with a letter, be no more than 32 characters in length (eight characters in version 6) and must not contain blanks, commas and so on. To read data as characters, rather than numbers, a dollar sign is put after the variable name.
003  The DATALINES statement indicates that the next lines are data.
004 to 021  The data are entered with one or more spaces separating each item. The data must be in the same order as declared in the input statement. A new line is used for each record.
019  A full stop indicates a missing numerical value.
022  RUN tells SAS to execute the preceding statements.
023  PROC PRINT is a procedure to print data in the Output Window.
025 & 026  PROC MEANS is a procedure to calculate the mean and other statistics of all the numeric variables; RUN completes the procedure.
This example illustrates the basic structure of a SAS program:
A DATA step consisting of a DATA statement and other statements that form part of this step
SAS PROCEDURES, each of which begins with a PROC statement. The PROC statement may also be followed by other statements that are part of the procedure step, although there are none in these two examples
1.11. Rules for entering SAS statements
SAS statements:
usually begin with an identifying keyword
always end with a semicolon
(check carefully before you submit any program!)
can be in uppercase or lowercase letters
SAS statements are free format.
they can begin and end in any column
one statement can continue over several lines
several statements can be on one line
Readability is improved if you add comments and leave spaces between the DATA and PROC steps and perhaps also indent code within a DATA or PROC step. Develop your own style and stick with it.
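To see what free format means in practice, the sketch below writes the same DATA step in two layouts. The data set names demo1 and demo2 and the variables x, y and total are made up for illustration; both steps should produce data sets with identical contents.

```sas
/* Layout 1: one statement per line, indented within the step */
data demo1;
  input x y;
  total = x + y;
datalines;
1 2
3 4
;
run;

/* Layout 2: mixed case, two statements on one line,
   and one statement continued over two lines */
DATA demo2; Input x y;
  total = x
        + y;
datalines;
1 2
3 4
;
RUN;

proc print data=demo1; run;
proc print data=demo2; run;
```

Layout 1 is usually easier to read and to debug, which is the reason for the advice above.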
1.12. Adding comments to a program
There are two ways of writing comments in a SAS program:
begin the comment line with an asterisk and end with a semi-colon
e.g. *This program was developed by J Smith;
begin with a forward slash asterisk and end with an asterisk forward slash
e.g. /* J Smith February 2005 */
Inserting comments is essential if you are doing any serious programming.
The /* style */ is also useful for ‘commenting out’ blocks of a program when testing or debugging.
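A short sketch of both comment styles, including a block that has been commented out. It assumes the data set class1 from Section 1.3 still exists in the current session.

```sas
* Print the class1 data created in Section 1.3;
proc print data=class1;
run;

/* Temporarily disabled while testing:
proc means data=class1;
run;
*/
```

Removing the opening and closing comment markers around the PROC MEANS step brings it back into the program.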
Task 5
Read the file class1.sas into the program editor. Edit the program so that it is the same as the sample program in section 1.10 but with the addition of a comment which gives your name and today’s date. Submit the program and when it is working properly save it as class2.sas.
What information did the Proc Means procedure give you?
1.13. Including titles in your SAS output
The TITLE statement is used to provide titles on your output. The TITLE statement can appear anywhere in a program (an example of a global statement) and subsequently each page of output (and each graph) will have the title until it is reset. For example program class2.sas could be enhanced as follows:
    . . .
    proc print;
      title 'Information on Students in Class';
    run;
    proc means;
      title 'Summary Statistics of Students in Class';
    run;
However the final title will adorn all future output until it is reset with another title or ‘cancelled’ with
    title;
    run;
Task 6
Experiment with the TITLE statement in program class2.sas.
1.14. Creating new variables
If you need to analyse variables that are derived from the input variables, then you must create these variables in the DATA step. For example, if you want to use two new variables, ‘the difference in pulse rates’ and ‘the log of the number of bends’, then the statements defining them must appear in the DATA step after the INPUT statement and before the DATALINES statement. The rules about naming new variables are the same as for input variables.
    data class2;
      input height weight sex $ bends pulse1 pulse2;
      diff=pulse2-pulse1;
      lnbends=log(bends);
    datalines;
    ...
    run;
Some commonly used operators and functions are as follows:
Operator  Meaning          Function  Meaning
*         multiplication   log( )    natural log
/         division         exp( )    exponential
**        exponentiation   sqrt( )   square root
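The operators and functions in the table can be tried out on a tiny data set. In this sketch the data set name shapes and all variable names are invented for illustration.

```sas
data shapes;                  /* hypothetical data set name */
  input side;
  twice  = side*2;            /* multiplication            */
  half   = side/2;            /* division                  */
  area   = side**2;           /* exponentiation            */
  root   = sqrt(area);        /* square root of the area   */
  lnside = log(side);         /* natural log               */
  back   = exp(lnside);       /* exponential of the log    */
datalines;
1
2
4
;
run;

proc print data=shapes;
run;
```

Since sqrt undoes the squaring and exp undoes log, the columns root and back should reproduce side, apart from possible rounding in the last decimal places.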
1.15. Printing and saving SAS output
The contents of the OUTPUT window may be sent to a printer using the OUTPUT window print command. You can change the way the output looks using, for example, the LINESIZE and PAGESIZE options (see Section 4.1 or SAS Help).
The whole of the OUTPUT window listing may be saved as filename.lst using the OUTPUT window save command.
In WINDOWS, it is often convenient to copy all or part of the OUTPUT window into a Word document or other word-processing software. This can be achieved with the copy and paste operation. However the results might be disappointing. The appearance may be improved by
using a fixed space font such as SAS Monospace (available if SAS is running)
avoiding ‘wrap around’ by reducing the font size and avoiding unnecessary leading spaces.
See Section 12.2 for further advice on incorporating SAS numeric and graphical output into a Word document.
Task 7 (WINDOWS ONLY)
Create a Word document (using copy and paste) which consists of program class2.sas and the output it produces. Experiment with improving the layout of the document.
Exercises
1.1 The following table shows the heights and weights of 16 eleven-year-old girls.
ID no Height(cm) Weight(kg) ID no Height(cm) Weight(kg)
59 135 25 71 133 30
82 146 33 78 149 35
27 153 56 12 141 33
52 154 51 37 164 48
55 139 31 28 146 37
13 131 25 48 149 45
01 149 43 69 147 36
15 137 32 16 152 47
(a) Write the SAS statements to create a SAS data set called ELEVEN. The ID number should be stored as a character variable.
(b) Insert a comment statement to indicate that this data came from Exercise 1.1.
(c) Create a new variable which is the ratio of weight to height.
(d) Produce a printout of the data and a table which shows the mean, standard deviation, the maximum and the minimum values for the variables height, weight and the ratio of weight to height. The output should be suitably labelled.
(e) (WINDOWS ONLY) Copy and paste your program and its output into a Word document (edit to ensure an attractive appearance).
(f) How would the output have differed if you had input the ID number as a numeric variable?
1.2 Modify your program in 1.1 above in order to determine the body mass index (BMI = weight in kilograms / (height in metres)^2, or BMI = W/(H * H)).
1.3 Several measurements of water quality were taken at eight different sites along the Firth of Forth. The data are shown below.
Site  Salinity  Phosphate  Nitrogen  Chlorophyll  Faecal Coliforms
CR    30.11     0.068      0.297     1.693        2.917
WG    31.48     0.059      0.165     1.464        3.149
EG    31.79     0.068      0.144     1.100        3.196
SF    31.37     0.185      0.278     1.787        3.418
PB    31.50     0.116      0.223     2.099        3.049
JO    31.60     0.106      0.207     1.067        2.903
SS    30.50     0.047      0.162     1.563        2.895
FN    31.96     0.060      0.130     0.753        2.797
(a) Write the statements to create a SAS data set called FORTH.
(b) The units of phosphate are mg/litre. Create a new variable which gives phosphate in units of µg/litre, where 1 mg = 1000 µg (1 milligram = 1,000 micrograms).
(c) Produce a table which shows summary values for each variable.
(d) Save the program in a file called forth.sas.
2. Data files and SAS data sets
2.1. Reading data files using the INFILE statement
In the examples in the previous section you created temporary SAS data sets from data, which were included in the program, with a DATALINES statement. In practice, a large set of data is more likely to be available as a raw data file (known as an ASCII or text file) and it will be more convenient to read the external data directly into SAS.
To illustrate this we will create a small ASCII data set using the Notepad editor, read the data file into SAS and then print the contents.
Task 1
Open Notepad / text editor and type in the following data set. Save it as blood.txt on your floppy disk. (Note that Notepad automatically gives the extension .txt.)
    1 107 100
    2 110 114
    3 123 105
    4 129 112
    5 112 115
    6 111 116
    7 107 106
    8 112 102
    9 136 125
    10 102 104
The variables in the data set are patient number and blood pressure measurements before and after treatment.
The code required to input the data into SAS and get a printout in the Output Window is as follows.
    data blood;
      infile 'a:\blood.txt';
      input patient $ before after;
    run;

    proc print;
    run;
The only changes that are required to the previous method of data input are that:
the INPUT statement is preceded by an INFILE statement to tell SAS where to find the external data file.
the DATALINES statement and the lines of data are omitted.
Task 2
Type the above SAS program into the EDITOR window and save the program under a suitable name. Submit the program and confirm that the values of the variables together with variable names have been printed in the OUTPUT window.
You can verify that the data is stored in the correct location on your hard drive. An example is given in Figure 6.
Figure 6 The data set Blood in the SAS Work library and temporary directory
2.2. LIBNAME and permanent SAS Data Sets
In the programs you have written so far the data set used in any analysis has been created in the data step. Such a data set is described as temporary in the sense that it only exists during your current SAS session and will be deleted when you close SAS.
This kind of temporary file is stored in a SAS library called WORK. You can check what files you have created in the current session by going to the EXPLORER window and clicking on Libraries and then on the library WORK.
(Use View -> Up One Level or View -> Show Tree to navigate back to the original EXPLORER window.)
SAS files are given a two part name. The first part of the name is the library name in which the file is stored and the second part is the name of the particular file. You probably noticed that when you created the previous data sets, for example class2, that SAS referred to this file in the log window as WORK.CLASS2.
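As a small sketch of the two part name in use, the two steps below should print the same data set, because by default a one-level name is taken to refer to the WORK library. It assumes the temporary data set class2 from Section 1.10 exists in the current session.

```sas
/* Assumes class2 was created earlier in this session. */
proc print data=class2;        /* one-level name: SAS looks in WORK by default */
run;

proc print data=work.class2;   /* explicit two-level name for the same data set */
run;
```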
If you wanted to do further analyses on this type of data set in a different session you would need to recreate the data set by running the data step once again. This can be a time-consuming process especially if you have a large amount of data and have created many new variables, changed the format of variables and so on. The alternative approach is to create a permanent SAS data set. This is a special type of file, unique to SAS, which stores the data, variable names and other information such as formats.
You can set up a library to store your data sets and save them so that they can be used in another SAS session. The SAS LIBNAME statement defines the name of the library where the file is to be stored. For example, if you want to store your data on your own disk in drive A, then you need to give this a SAS libname using a statement like the first one below. The actual name of the library (myadisk, mydisk or myhome in these examples) is chosen by the programmer and is just a convenient name that can be referred to later in the program. It may be at most eight characters long.
    libname myadisk 'a:\';
    libname mydisk 'h:\MA71064 Statistical computing\';
    libname myhome 'c:\My documents\Napier\MA71064 Stat Comp\';
You can then save your SAS data sets in this library using a two level SAS name. The first part of the name is the libname and the second part is the name given to the SAS data set. So to create the permanent SAS data set called class2, on the H: drive, would require the following SAS code.
    libname mydisk 'h:\MA71064 Statistical computing\';
    * The data library called mydisk will;
    * be located on the H: drive;

    data mydisk.class2;   *Create the new data set class2;
      input height weight sex $ bends pulse1 pulse2;
    datalines;
    152 45.4 F 6 61 84
    178 53.0 M 8 59 102
    165 61.7 M 10 70 110
    ...................
    175 74.1 M 5 76 102
    160 49.5 F 2 67 118
    run;
Task 3
Modify your program, class2.sas, to create a permanent data set. (If you want the data set stored on a hard drive make sure you give the full path name of the required directory.) Check the messages in the LOG Window and check you can see the permanent data set in the EXPLORER window.
SAS version 8 puts an automatic SAS7BDAT extension on permanent data sets (version 6 uses SD2).
Task 4
Go to Windows Explorer and check that you have a file class2.SAS7BDAT in the appropriate directory.
2.3. Referencing a permanent SAS data set
Suppose that you have a permanent SAS data set stored in a particular directory. You may have created this yourself or possibly have downloaded it from the web. You may carry out procedures on the data set directly by using the DATA option in the procedure statement. All the details of variable names and so on will be held in the data set.
In the following example permanent SAS data set prac1 is stored in directory h:\sas\sasdata.
    libname xyz 'h:\sas\sasdata';
    proc print data = xyz.prac1;
    proc means data = xyz.prac1;
    run;
Note that the first part of the name given by LIBNAME is a pointer to a directory and does not have to be the same name as was used when the data set was created. It is the second part of the name that refers to the particular data set.
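To illustrate that the library name is only a pointer, the sketch below reads the permanent data set class2 from Section 2.2 through a different library name. The name stats is invented for this example, and the step assumes class2 was saved in the folder shown.

```sas
/* "stats" is a hypothetical second name for the folder that was
   called mydisk when class2 was created in Section 2.2. */
libname stats 'h:\MA71064 Statistical computing\';

proc means data = stats.class2;
run;
```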
2.4. Contents of a file
You can use the SAS procedure CONTENTS to get information about a data set and a list of the variables it contains. This procedure is useful for larger data sets that would be too long or have too many variables to list completely, and it gives you information about when and where the data set was last modified. For example,
    proc contents data = xyz.prac1;
    run;
Note: if you have already submitted a LIBNAME statement in the current SAS session, it is not necessary to do so again. You can simply refer to the two-level data set name.
Task 5
Get information about the SAS data set COMPANY stored in SASHELP. What information is given in each of the columns?
An alternative way of inspecting what variables are in a large file is to print out only the first few observations. This can be done using the OBS= data set option in the PROC PRINT statement.
    proc print data=sashelp.company (obs=6);
    run;
Remember you can also view data sets from the EXPLORER window.
2.5. Importing data from other packages
Software such as Excel, Minitab and SPSS stores data in file types unique to itself. Some packages can export to, or import from, other formats.
Use the import wizard or PROC IMPORT (see SAS Help).
A safe approach to importing data from such application software into SAS is to export from the other package into ASCII format and input the resulting file into SAS in a DATA step.
Large data sets from outside sources (other companies or organisations) are usually supplied in ASCII format since such data is often held in proprietary databases. Most software allows data to be written to an ASCII or raw data file. This approach is illustrated below:
Application data file -> (export) -> ASCII data file -> (INFILE statement) -> SAS data set
It is a good idea to check the ASCII file with an editor such as WordPad (and possibly ‘tidy up’ if necessary). The ASCII file can then be read into SAS using the INFILE statement assigning variable names with the INPUT statement (as explained in Section 2.1).
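If the exported file still has a single row of column headings, one alternative to deleting it by hand is the FIRSTOBS= option of the INFILE statement, which tells SAS which line to start reading from. The sketch below is illustrative only: the file path, the data set name and the variables are placeholders.

```sas
/* Sketch only: path, data set name and variables are placeholders. */
data imported;
  infile 'a:\export.txt' firstobs=2;   /* start reading at the second line */
  input height weight sex $;
run;

proc print data=imported (obs=6);      /* check the first few observations */
run;
```

Printing the first few observations is a quick check that the heading row was skipped and the values went into the intended variables.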
Advice on importing data into SAS from popular applications software is summarised below.
Excel (WINDOWS)
For a spreadsheet containing data only (values of variables in columns):
Right align columns (necessary for character data)
File -> Save As -> Formatted Text (Space delimited) to give ASCII file filename.prn.
Edit filename.prn with WordPad if column headings need deleting, missing values need replacing with ‘.’ etc.
Minitab
File -> Other Files -> Export Special Text
Specify columns (accept Period Decimal Separator).
Results in ASCII file filename.dat.
Note that Minitab’s missing value symbol is ‘*’. SAS will find this invalid and replace it by ‘.’.
SPSS
File -> Other Files -> Fixed ASCII
Results in filename.dat.
Note that SPSS’s missing value symbol is ‘.’. However, this will be blank in filename.dat and will cause SAS to misread the data set when using simple list input.
Alternative: SPSS allows data to be saved directly as a permanent SAS data set:
File -> Save As -> SASv7 Windows long extension
In recent versions of SAS (e.g. 9.2), SPSS files can be imported directly.
2.6. Missing values
Uncoded missing values present special problems for using list input. To provide some protection for the integrity of your output data set when input data contain uncoded missing input values, use the MISSOVER or STOPOVER options in the INFILE statement. Use the MISSOVER option to set all remaining variables in the INPUT statement to missing. Use the STOPOVER option to prevent an observation from being written to the data set when the input line does not contain a value for each variable in the INPUT statement and to stop the DATA step from further processing.
For example, the program
    data test1;
      input id $ var1 var2 var3 var4 var5;
    datalines;
    1001 115 45 65 83 78
    1002 86 27 55 86
    1004 93 52 63 76 88
    1015 73 35 43 112 108
    ;
    run;
would result in the following inaccurate data set
    obs  id    var1  var2  var3  var4  var5
    1    1001  115   45    65    83    78
    2    1002  86    27    55    86    1004
    3    1015  73    35    43    112   108
If we use the MISSOVER option i.e.data test1; infile cards missover; input id $ var1 var2 var3 var4 var5;
cards;1001 115 45 65 83 781002 86 27 55 861004 93 52 63 76 881015 73 35 43 112 108;run;
we will get the following data set
obs id var1 var2 var3 var4 var51 1001 115 45 65 83 782 1002 86 27 55 86 .3 1004 93 52 63 76 884 1015 73 35 43 112 108
Using the MISSOVER option prevents the uncoded missing value in the second data line from causing the third record to be read incorrectly as well. The second observation is still incorrect, but the errors have been restricted to one observation. The STOPOVER option would prevent observation 2 from being written to the data set at all. In order to read the data in properly, either column input or formatted input would have to be used. (See next section)
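As a sketch of how column input avoids the problem, the program below reads the same four records with each value placed in a fixed field. The column positions are assumptions made for this illustration, as is the choice of var4 as the value missing from the second line (the notes do not say which value was omitted).

```sas
data test1;
  /* Assumed layout: id in columns 1-4, then one right-aligned   */
  /* field per variable. A blank field is read as missing.       */
  input id $ 1-4 var1 6-8 var2 10-11 var3 13-14 var4 16-18 var5 20-22;
  datalines;
1001 115 45 65  83  78
1002  86 27 55      86
1004  93 52 63  76  88
1015  73 35 43 112 108
;
run;

proc print data=test1;
run;
```

Because SAS looks in the stated columns rather than counting values, the blank field on the second line gives a missing var4 and the remaining lines are unaffected.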
2.7. The INPUT statement
The INPUT statement names the variables being read in via a DATALINES or INFILE statement and tells SAS where on the DATALINES, or on the lines of INFILE, the values of the variables can be found. There are three main types of INPUT that you can use to describe a record’s values: LIST, COLUMN and FORMATTED. The choice of which type of input you use will depend on the type and arrangement of the incoming data. The $ symbol is placed after a variable name to indicate a character variable.
In the previous examples you have used only the simplest type of INPUT, LIST INPUT. List INPUT is seldom useful for large commercial or scientific work because it is too easy to get missing values or errors in big files. It is more common for real data to come in fixed column format, where the fields on each line are aligned in columns one under another.
2.7.1. LIST INPUT
- the values are separated by spaces
- missing values must be represented by full stops
- by default, character values cannot be longer than 8 characters
- character values cannot contain embedded blanks
- fields must be read in order
e.g.
data one;
  input height weight name $ age;
  datalines;
65 150 Chris 50
60 125 Kelly 35
68 180 Leslie 29
;
run;
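The 8-character default for character values can be raised by declaring the variable's length before the INPUT statement. The following is a minimal sketch; the data set name one_long, the width of 12 and the data values are all made up for illustration.

```sas
data one_long;
  /* Declaring the length first lets list input keep up to 12  */
  /* characters instead of truncating at the default of 8.     */
  length name $ 12;
  input height weight name $ age;
  datalines;
70 165 Christopher 41
60 125 Kelly 35
;
run;

proc print data=one_long;
run;
```

Note that name becomes the first variable in the data set because it is the first one SAS encounters; embedded blanks are still not allowed with plain list input.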
2.7.2. COLUMN INPUT
- data must be aligned within the column positions specified
- character values can contain embedded blanks
- input values can be read in any order
- character values can be from 1 to 32,767 characters long (the limit was 200 in SAS version 6)
- leading and trailing blanks within a field are ignored
e.g.
data two;
  input name $ 1-7 age 9-10 birthdate $ 11-22 sport $ 23-30;
  datalines;
Ronald  40Dec 3 1954  golf
Michael 37Jul 4 1957  fishing
Laurel  33Jun 23 1961 softball
;
run;
2.7.3. FORMATTED INPUT
- character values can be from 1 to 32,767 characters long (the limit was 200 in SAS version 6)
- a full stop is not needed for numeric missing values
- nonstandard data, such as dates or numbers with commas, can be read in
- with the use of pointer controls, values can be read in any order
This method of input uses pointer controls and informats for reading in nonstandard data from external data files. An informat is used for reading in data containing dates, numbers with commas, etc.
The informat w.d after a variable specifies the width w and the number of decimal places d to be used in reading in a number.
e.g. for the number 2346,
- the informat 4.2 would result in the number 23.46 being read in
- the informat 4. with no ‘d’ specified would result in the number 2346 being read in
The informat $w. after a variable specifies the length of a character variable.
Dates such as 21/10/89 can be read using the informat DDMMYY8. (Note the full stop at the end of the informat.)
Pointers indicate the position of a variable, e.g.
@n go to column n
+n move the pointer on n positions
e.g.
/* A line of place counters is often useful to put to help alignment
0000000001111111111222222222233333333334444444444555555
1234567890123456789012345678901234567890123456789012345
*/
data three;
  input @1 name $7. @10 age 2.0 @14 birthdate $11. @28 sport $8.
      / @9 gradyr 4.0 @16 numchild 1.0 @20 occupation $20.;
  datalines;
Ronald   40  Dec 3 1954    golf
        1973   2   masonry contractor
Michael  37  Jul 4 1957    fishing
        1975   2   bricklayer
Laurel   33  Jun 23 1961   softball
        1979   0   attorney
;
run;
/ tells the pointer to go to the next line. Once you go to the next line, you cannot move back to the previous line.
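The effect of the w.d and date informats described in this section can be checked with a small sketch such as the one below. The data set name informat_demo and the variable names are made up; the same four digits are read twice, using the @ pointer to return to column 1, so that the two numeric informats can be compared.

```sas
data informat_demo;
  input @1 whole  4.         /* read as 2346                        */
        @1 scaled 4.2        /* two implied decimals: read as 23.46 */
        @6 when   ddmmyy8.;  /* 21/10/89 stored as a SAS date value */
  format when date9.;        /* display the date in readable form   */
  datalines;
2346 21/10/89
;
run;

proc print data=informat_demo;
run;
```

The century assigned to the two-digit year depends on the YEARCUTOFF system option, so it is worth checking the printed date.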
Exercises
2.1 Create a permanent data set of the data given in Exercise 1.1. How are you going to retrieve this data without having to retype it? You should be able to modify the program you have saved.
2.2 (a) Download the Minitab version of the pulse data file from the web or WebCT (pulse.mtw, not pulse.prn). Open Minitab and load pulse.mtw using File > Open Worksheet (not Open Project). Use File > Other Files > Export Special Text (not File > Save Current Worksheet As) to export the PULSE file as an ASCII file. You have to highlight the variables to export, then press Select. Press OK. Enter a suitable file name. Change the file type to ANSI Text Files (*.TXT). Finally press Save.
(b) Create a permanent SAS data set of the data.
2.3 To illustrate the dangers of list format input, take the data file blood.dat and edit it with a text editor (notepad or Word). Make one or two mistakes in it by removing some of the entries in one or more lines. Now save it as a text file and use it to input and print a SAS data set. Examine your log file and output, to see what has gone wrong, and how you are warned.
2.4 (a) (WINDOWS ONLY) Create an Excel file containing the following data, where column 1 is size, column 2 is colour, column 3 is price and column 4 is transport cost. Save it as a formatted text space delimited file.
Large    Red     18.97  0.25
Medium   Blue    24.68  1.10
X-Large  Black   29.99  1.75
Small    Orange  15.89  0.90
(b) Write and submit a SAS program to read in the data using list input and print the variables colour, size and price in that order.
(c) Redo (b) using column input
(d) Redo (b) using formatted input
2.5 Copy the text file houses.dat from the web. The file contains the following five variables for each of the 120 houses in a survey of house prices. Examine it with an editor (WordPad or Word).
VARIABLE   CONTENTS              COLUMN LOCATION
style      Type of house         1
sqfeet     Floor area            3-6
bedroom    Number of bedrooms    8
baths      Number of bathrooms   10-12
Price      Price of house        14-19
Use column input to create a permanent SAS data set for the housing data and print the contents.
2.6 (a) Download the cars Excel file from the web. To create a file of raw data for reading into a SAS data set:
Open the file in Excel. Right align the columns. Delete the coding information about the origin variable (in column L). Save the data as a formatted text space delimited file (.prn extension), or as a csv file in Linux.
(b) Use this file to create a SAS data set (use column input). To identify the column location for each variable, open the .prn file in Notepad. Move the cursor along the row of data, taking a note of the column locations.
3. SAS procedures
3.1. Structure of a SAS program
Once you have got the data organised, a simple SAS program consists of a series of procedures. You have already used three of these procedures: PROC PRINT, PROC MEANS and PROC CONTENTS. Apart from specifying which data set to use, you had no control over the type of output that SAS produced. This may have given the impression that SAS is rather inflexible. However, this is far from the truth. Most procedures have several options which can be invoked, and in addition there are statements which can be incorporated into a program (which themselves have options). The procedures and subsequent statements determine the nature of the output produced. Most SAS procedures use the following syntax:
PROC PROCNAME options;
  STATEMENTS / statement options;
RUN;
A program will typically consist of several such blocks of code.
3.2. Sample program
libname unit3 'c:\sas\sasdata';
proc sort data=unit3.pulse out=sorted;
  by activity;
run;
proc print data=sorted noobs N; *NOOBS removes observation numbers;
  format height 6.0;
  title 'Pulse data from Minitab sorted by activity';
  var pulse1 pulse2 weight height;
  by activity;
run;
proc freq;
  tables ran smokes activity;
  tables sex/nocum nopercent;
  tables sex*smokes;
run;
proc means maxdec=2 mean std;
  title 'Pulse rates before and after exercise';
  var pulse1 pulse2;
run;
Task 1
Print the pulse data that you saved as a permanent SAS data set in Exercise 2.2. Now run the first two parts of the sample program (PROC SORT and PROC PRINT) and compare the output. Remember to specify an appropriate library.
One way of printing separate tables for different subgroups is to use a BY statement. In order to do this the data set must be already sorted by this BY variable. If you do not want to overwrite the original file then the sorted data must be stored in a new file. The statements:
proc sort data=unit3.pulse out=sorted; by activity; run;
sort the pulse data by activity level and store the sorted data in a new file called sorted.
Task 2
Look in the libraries to see where this file is stored. Is it a permanent or a temporary data set?
The option NOOBS suppresses the observation numbers and the option N allows the sample size to be printed at the end of each table. The format statement gives an instruction to print the values of height with a maximum of six characters and no decimal places.
Task 3 Type in the rest of the sample program and see if you can work out what the remaining statements and options are doing. Look carefully at the titles. What happens if no title statement is made in a procedure?
Individual procedures will be looked at in more detail in the next few sections. Information about the options available for individual procedures is given in the SAS help though it is not always very easy to follow!
It is not strictly necessary to have a run statement between each procedure. SAS recognises that a new procedure statement indicates that the previous statements refer to the preceding procedure. However, it is generally advisable to include additional run statements and it is essential to put a run statement at the end of the program.
4. Summarising data
4.1. SAS System Options
You have probably noticed that the date and a page number are included on all the output produced by SAS. The type of output produced by SAS is determined by the system but may be changed by making use of SAS System Options.
There are dozens of options available which deal with hardware and software interfacing, and the input and processing as well as just the output of jobs.
A list of the options may be found in help. The following are some commonly used options which may be used to change the output.
Option              Action
CENTRE/NOCENTRE     Output centred / left aligned
DATE/NODATE         Date shown / date not shown
NUMBER/NONUMBER     Pages numbered / not numbered
PAGESIZE=           Determines the number of lines per page
LINESIZE=           Determines the printer line width
FIRSTOBS=           Specifies the first observation to include from the data set
OBS=                Specifies the last observation to include. This is useful for testing code using large data sets.
OBS=max             Includes all observations
The following lines of code will produce a printout of observations 20 to 45 inclusive of the pulse data, with no page numbering, no date, left aligned and with 20 rows on the page.
options nonumber nodate nocentre pagesize=20 linesize=80 firstobs=20 obs=45;
libname unit4 'c:\sas\sasdata';
proc print data=unit4.pulse;
run;
options firstobs=1 obs=max; /* Uses all observations in any analysis that follows. */
SAS system options remain in place for the whole of a SAS session unless subsequently changed. If an OPTIONS statement is entered within a DATA or PROC step then it takes effect immediately. An OPTIONS statement entered outside of a step takes effect with the following step.
4.2. HTML output
HTML output can be turned on from the menu Tools > Options > Preferences. Select the Results tab and tick the Create HTML box. The dialogue windows are shown below, for both Linux and Windows. They include an option to write the output into a specified folder. If this option is not used, the output file is written into the folder specified for the work library.
Figure 7 Dialogue to turn on HTML output (in Windows / in Linux)
4.3. Summary Procedures
Four procedures, PROC SORT, PROC MEANS, PROC UNIVARIATE and PROC FREQ, may be used to summarise data. The most commonly used options and statements for these procedures, together with sample programmes, are given below. The complete set of options can be found in SAS help:
Help > SAS Help and Documentation
Choose SAS Products, Base SAS, SAS Procedures, then Procedures. From there you should click on the procedure you require.
4.4. PROC SORT
Options Description
DATA= Data set to be used, uses the last data set created by default
OUT= Specifies the name of file to store the sorted data. If no OUT option is used the original file will be overwritten.
Statements
BY <DESCENDING> A list of variables to sort by must be specified. DESCENDING placed before a variable name will sort the data in descending order for that variable.
options centre pagesize=50 firstobs=1 obs=92;
proc sort data=unit4.pulse out=sorted; by sex descending ran; run;
proc print; run;
The lines of code sort the pulse data by sex (males first, followed by females) and within sex by whether the students ran. Those who did not run (coded 2) are placed before those that did run (coded 1) because DESCENDING has been specified. Note that no DATA= option is used with PROC PRINT. SAS automatically uses the sorted data set because that was the last data set created.
Task 1
Using the first 50 observations only of the pulse data, create a data set sorted by smoking (non-smokers first) and by activity. Print out the sorted data set. Check carefully that the output is what you expect.
4.5. PROC MEANS
Options Description
DATA= Data set to be used, by default will use the last data set created
MAXDEC= Gives the maximum number of decimals to be used in the output (must be between 0 and 8)
NOPRINT Suppresses the printing if the procedure is only being used to send summary output to a file (see OUTPUT statement).
ALPHA= Gives the significance level used for confidence limits (ALPHA=0.05 for 95% C.I.)
statistic keyword list
By default PROC MEANS prints out the variable name, count, mean, std dev, min and max values. Particular statistics may be requested.
Procedure options
N Number of non-missing observations in a subgroup
NMISS Number of missing observations
MEAN Mean
STD Standard deviation
MIN Minimum value
MAX Maximum value
RANGE Range
STDERR Standard error
CLM Confidence limits for the mean
(For additional keywords see SAS Help)
Statements
VAR Specify a list of numeric variables for which statistics are required.
BY Specify a list of alphanumeric variables (data must be sorted by these variables). Descriptive statistics are given for each subgroup.
CLASS Specify a list of alphanumeric variables. Descriptive statistics are given for each subgroup. Uses more memory than the BY command but does not need the data to be sorted.
OUTPUT There are various ways of storing all or some of the summary statistics requested. Need to specify a file name using OUT=filename and which variables/statistics are required. See specimen programme for a simple example of how this can be done.
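As a small sketch of how the CLM keyword and the ALPHA= option fit together, the program below requests 95% confidence limits for a mean within subgroups. It uses the CLASS data set from the SASHELP library rather than the pulse data, so the variable names (sex, height) belong to that data set and not to the course files.

```sas
/* 95% confidence limits for mean height, separately for each sex. */
/* CLASS is used, so the data do not need to be sorted first.      */
proc means data=sashelp.class n mean clm alpha=0.05 maxdec=2;
  class sex;
  var height;
run;
```

Changing ALPHA=0.05 to ALPHA=0.01 would give 99% limits instead.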
The following lines of code may be used to get summary statistics (the means, standard deviations and standard errors) for pulse1 and pulse2 in subgroups defined by sex and whether the students ran. These summary values are stored in a file named summary.
proc means data=unit4.pulse maxdec=2 mean std stderr;
  var pulse1 pulse2;
  class sex ran;
  output out=summary mean = mean_p1 mean_p2
                     std = std_p1 std_p2
                     stderr = se_p1 se_p2;
run;
proc print data=summary;
run;
Task 2
Submit the previous sample program. (Remember you may need to change the library name and file name of the data set.) What data has been stored in the file ‘summary’? What does the _TYPE_ variable indicate in the print output?
4.6. PROC UNIVARIATE
Options Description
DATA= Data set to be used, uses the last data set created by default
PLOT Produces stem-and-leaf plots, boxplot and normal probability plots of the data.
NOPRINT Suppresses all printing.
Statements
VAR Specify a list of numeric variables for which statistics are required.
BY Specify a list of character variables (data must be sorted).
OUTPUT Need to specify a file name using OUT=filename and which statistics/variable names are required.
Task 3
Submit the following program and see how the printout differs from that produced by PROC MEANS. How does the output file containing summary values differ?
proc univariate data=sorted plot;
  var height weight;
  by sex;
  output out=summary mean = mean_ht mean_wt std = std_ht std_wt;
run;
proc print data=summary;
run;
4.7. PROC FREQ
Options Description
DATA= Data set to be used, uses the last data set created by default
Statements
TABLES Specify a list of alphanumeric variables for which tallies are required. Smaller subgroups may be defined by the use of an * e.g. sex*ran*activity
TABLES statement options
NOCOL Does not show column percentages
NOCUM Does not show cumulative frequencies or percentages
NOFREQ Does not show cell frequencies
NOPERCENT Does not show cell percentages
NOROW Does not show row percentages
CHISQ Gives results of chi-squared tests of independence
The following code produces frequency tables for sex and smoking habit separately and a two way table of sex and smoking habit. The output also includes the results of a chi-squared test of independence for these two variables.
proc freq data=unit4.pulse;
  tables sex smokes smokes*sex / nocol norow nocum chisq;
run;
4.7.1. Chi-square test
PROC FREQ is used to carry out a chi-square test for the association of two categorical variables. In this case the null hypothesis is that there is no association between smoking and sex. The same proportion of smokers should be found amongst males and females.
It is convenient to add the row percentage to the cross tabulation as an easy way to look for a possible association. This is achieved by removing the option “norow”. The options “nocol” and “nopercent” have been left in the statement to remove clutter from the output.
proc freq data=unit4.pulse;
  tables smokes*sex / chisq nocol nopercent;
run;
The output from SAS gives
The FREQ Procedure
Table of Smokes by Sex
Smokes     Sex
Frequency|
Row Pct  |       1|       2|  Total
---------+--------+--------+
       1 |     20 |      8 |     28
         |  71.43 |  28.57 |
---------+--------+--------+
       2 |     37 |     27 |     64
         |  57.81 |  42.19 |
---------+--------+--------+
Total          57       35       92

Statistics for Table of Smokes by Sex
Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     1      1.5321    0.2158
Likelihood Ratio Chi-Square    1      1.5699    0.2102
Continuity Adj. Chi-Square     1      1.0089    0.3152
Mantel-Haenszel Chi-Square     1      1.5154    0.2183
Phi Coefficient                       0.1290
Contingency Coefficient               0.1280
Cramer's V                            0.1290
If there were no association, the probability of the chi-square statistic being as large as 1.5321 by chance alone would be 0.2158. This is not small, so the sample provides no evidence of an association between sex and smoking.
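The chi-square value can be checked by hand from the four cell counts. For a 2 x 2 table with cells a, b (first row) and c, d (second row) and total n, the Pearson statistic has a standard shortcut form, shown here with the counts from the table of Smokes by Sex:

```latex
\chi^2 = \frac{n\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}
       = \frac{92\,(20 \times 27 - 8 \times 37)^2}{28 \times 64 \times 57 \times 35}
       = \frac{92 \times 244^2}{3\,575\,040}
       \approx 1.5321,
\qquad
\phi = \sqrt{\chi^2 / n} = \sqrt{1.5321 / 92} \approx 0.129 .
```

Both values agree with the Chi-Square and Phi Coefficient lines of the SAS output.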
Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        20
Left-sided Pr <= F          0.9310
Right-sided Pr >= F         0.1576

Table Probability (P)       0.0886
Two-sided Pr <= P           0.2502

Sample Size = 92
Sometimes your data may simply be the counts of each table cell. The data set must contain a variable called something like “count” which contains the number of observations in each cell. In this case use the weight statement in the procedure,
e.g. weight count;
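As an illustration, the cell counts from the table of Smokes by Sex shown earlier can be typed in directly and analysed with a WEIGHT statement. The data set name smoke_counts is made up for this sketch; the counts themselves are taken from the output above.

```sas
/* One line per table cell: smokes code, sex code, cell frequency. */
data smoke_counts;
  input smokes sex count;
  datalines;
1 1 20
1 2 8
2 1 37
2 2 27
;
run;

/* WEIGHT tells PROC FREQ that each line stands for count observations. */
proc freq data=smoke_counts;
  weight count;
  tables smokes*sex / chisq nocol nopercent;
run;
```

This should give the same cross tabulation and chi-square results as the analysis of the full pulse data.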
4.8. General syntax for a procedure
A general example of a procedure is given below. Each procedure uses only a certain combination of statements but the action of each statement is common across the procedures in which it can be used.
PROC PROCNAME DATA = lib1.data1 OUT = lib2.data2 noprint;
  WEIGHT WeightVar ;                 * Specifies which variable weights the analysis;
  FORMAT NumVar 8.0 CatVar $3. ;     * Specifies formats of variables;
  BY CatVar ;                        * Give 1 output per group, SORT needed;
  CLASS CatVar ;                     * Similar to BY statement, no SORTing needed. Makes a numeric variable act as a categorical variable;
  VAR NumVar ;                       * Restricts analysis to named variables;
  OUTPUT OUT = Summary keyword= DescriptiveStatistic;   * Named output dataset, specifies names;
  WHERE NumVar2 > 1000 ;             * Only uses certain cases;
  TABLE CatVar * NumVar ;            * Specifies an output table;
  FREQ NumVar3 ;                     * Variable giving the observation frequency;
  MODEL YVar= XVar + . .;            * Fits models;
  PLOT YVar * XVar ;                 * Plots a scatter plot;
RUN;
4.9. Help
Extensive documentation on each procedure can be found by using Help > SAS Help and Documentation. An example of the help screen for the Base SAS procedures is shown in Figure 8. Other useful help modules are SAS/STAT and SAS/GRAPH.
Figure 8 Base SAS Procedures help
Exercises
4.1 (a) For the pulse data, get a printout which shows the number of observations, the mean, standard deviation, maximum and minimum values of pulse2 in each of the four subgroups defined by whether the student smoked/did not smoke and ran/did not run. Make the printout left-aligned with the summary values shown to one decimal place.
(b) Obtain confidence limits for the four means produced in part (a). Does it appear that smoking or running on the spot had any effect on the second pulse rate?
(c) Obtain comparative boxplots of the second pulse rate in each of the four subgroups.
(d) What percentage of smokers were made to run on the spot? Get suitable SAS output to give you this information.
4.2 For the SAS data set RETAIL (From the explorer tab look in the library SASHELP) get a printout which shows the number of observations and the mean and standard deviation of retail sales in each year. Print the mean and standard deviation to two decimal places. Output the mean sales for each year to a new file called ‘summary’ and get a print out of this file
5. Graphs and charts
5.1. Graphics procedures
SAS can produce two types of graphics, high or low resolution. High resolution graphics are sent to a special graphics window where the graphs can be edited and copied into Word documents. GPLOT and GCHART are two procedures which produce high resolution graphics; the equivalent low resolution procedures are PLOT and CHART.
5.2. PROC PLOT
Options Description
DATA= Data set to be used, uses the last data set created by default
Statements
PLOT Specify yvariable*xvariable. Can produce several plots with a single statement by including a list of variables in parentheses e.g. (list of n yvariables)*(list of m xvariables) will produce nm separate plots.
BY Specify a list of character variables (data must be sorted) to produce separate graphs for subgroups.
PLOT options
=‘symbol’ Specify a symbol to be used for plotting
=variable Identifies each point by the value of another variable
The following code produces a plot of weight against height for all students, separate plots of weight against height for each sex and a single plot with a different symbol for males and females.
proc sort data=unit5.pulse out=sorted;
  by sex;
proc plot data=sorted;
  plot weight*height;
proc plot;
  plot weight*height='*';
  by sex;
run;
proc plot;
  plot weight*height=sex;
run;
Task 1
Input the program into SAS and examine the output. (Remember to assign a LIBNAME as the first statement.) Resubmit the program using PROC GPLOT instead of PROC PLOT. What differences in the output do you observe?
When it has run successfully in PROC GPLOT, you will find that you are in a graph window. The graph can be edited in SAS by clicking on the painting icon. To come out of editing the plot, click on file and then down to end. You can save the graph to a file or cut and paste it into Word where it can be further edited if required. The graph window must be closed down (by clicking on ) before another SAS program can be run.
5.3. PROC CHART
Options Description
DATA= Data set to be used, uses the last data set created by default
Statements
HBAR Specify variable to produce a frequency bar chart (horizontal bars)
VBAR Specify variable to produce a frequency bar chart (vertical bars)
PIE Specify variable to produce a pie chart.
BLOCK Specify variable to use on the x-axis. Used in conjunction with GROUP and SUMVAR options to produce three-dimensional bar charts.
BY Specify a list of character variables (data must be sorted) to produce separate charts for subgroups.
HBAR/VBAR/PIE/options
SUMVAR= Specify an analysis variable the sum of which is to be shown on the y-axis
TYPE= May be used on its own or in conjunction with SUMVAR to produce statistics other than the frequency or sum on the y-axis. The options for TYPE are FREQ (frequency counts), PCT (percentages), CFREQ (Cumulative frequencies), CPCT (Cumulative percentages), SUM (Totals), MEAN (Means)
The default is TYPE=SUM if SUMVAR is used otherwise the default is TYPE=FREQ.
LEVELS= Specifies the number of equal width classes for numeric variables.
MIDPOINTS= Specifies the midpoints of classes for numeric variables
MIDPOINTS=lower_limit TO upper_limit BY interval
DISCRETE Prevents SAS from dividing a discrete variable into inappropriate intervals e.g ensures a variable coded from 1 to 5 will produce 5 classes.
GROUP= Produces separate bar charts on the same graph for different discrete values of the GROUP variable.
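The DISCRETE, MIDPOINTS= and SUMVAR=/TYPE= options listed above can be tried out with a short sketch such as the following. It uses the CLASS data set from the SASHELP library, so the variables age, height, weight and sex belong to that data set; the midpoint values are chosen only to suit those heights.

```sas
proc chart data=sashelp.class;
  vbar age / discrete;                    /* one bar for each distinct age       */
  vbar height / midpoints=50 to 75 by 5;  /* classes centred on 50, 55, ..., 75  */
  hbar sex / sumvar=weight type=mean;     /* mean weight for each sex            */
run;
```

Removing DISCRETE from the first statement shows how SAS otherwise groups a numeric variable into intervals of its own choosing.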
The following program illustrates some of the features of PROC CHART.
proc chart data=sorted;
  hbar smokes;
  by sex;
run;
proc chart;
  vbar height / levels=6 group=sex sumvar=pulse2 type=mean;
run;
Task 2
Run this program and look carefully at the output obtained. Adapt the code to show other statistics on the y-axis of the vertical bar chart, a different number of bars and so on to familiarise yourself with the procedure.
Exercises
5.1 Plot weight against height for the data from Exercise 1.1. This should be stored somewhere as a permanent SAS data set. Use a plus sign as your plotting symbol.
5.2 For the pulse data:
(i) Plot the second pulse rate against weight, using a different symbol for males and females.
(ii) Produce a pie chart which shows the percentage of students who usually have particular levels of activity.
(iii) Produce a horizontal bar chart which shows the mean of the second pulse rate for those students that did and did not run.
(iv) Produce vertical bar charts side by side which show the percentage of students who have different levels of physical activity for males and females separately.
(v) Produce a three dimensional bar chart which shows the mean of the first pulse rate for subgroups defined by level of activity and smoking habit.
6. Correlation and Regression
Two procedures may be used to obtain information about the relationship between two or more continuous variables. PROC CORR determines the correlation coefficients between selected variables and PROC REG fits a regression model to data and allows output to be saved for further analyses. Remember that the statements and options given in these notes are only a very small subset of those that can be used with particular procedures. The help facility may be used to investigate further possibilities.
6.1. PROC CORR
Options Description
DATA= Data set to be used, uses the last data set created by default
SPEARMAN Calculates the Spearman rank correlation coefficient. By default the Pearson product moment correlation is calculated.
Statements
VAR Specify a variable list (essential)
WITH Specify a variable list to be used with VAR. The VAR variables are given at the top of the table of correlations and the WITH variables at the side. If WITH is not used a matrix of the correlations between all pairs of variables is produced.
BY Specify a list of character variables (data must be sorted) to produce separate tables of correlations.
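A short sketch of the SPEARMAN option combined with a WITH statement is given here. It uses the CLASS data set from the SASHELP library, so the variable names are those of that data set rather than of the course files.

```sas
/* Spearman rank correlations of age with height and with weight. */
/* VAR variables head the columns; the WITH variable labels the row. */
proc corr data=sashelp.class spearman;
  var height weight;
  with age;
run;
```

Adding the keyword PEARSON alongside SPEARMAN on the PROC CORR statement would print both sets of coefficients.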
An example of the use of PROC CORR is shown below using the data concerned with heights and weights of eleven-year-olds (Exercise 1.1).
proc corr data=unit6.eleven;
  var height weight;
run;
proc corr data=unit6.eleven;
  var height;
  with weight;
run;
Task 1
Try running this program and look at the difference in output produced when the WITH statement is included or not included. What do the values under the correlation coefficients indicate?
6.2. PROC REG
Options Description
DATA= Data set to be used, uses the last data set created by default
CORR Prints a correlation matrix for all variables listed in the MODEL statement.
Statements
MODEL For simple linear regression, specify a response variable and an independent variable, response=independent variable
OUTPUT Need to specify an output file using OUT=filename and a list of keywords(statistics)/names required. Commonly used keywords are p (predicted values), r (residuals), student (standardised residuals). See the specimen program for syntax.
PLOT Allows scatter plots to be produced using any variables in the model or keywords in the OUTPUT statement. Note the keywords in the OUTPUT statement must be followed by a full stop when used as variables in the PLOT statement.
BY Specify a list of character variables (data must be sorted) to produce a separate regression analysis for each subgroup.
PLOT options
OVERLAY Superimposes several scatter plots on the same graph.
The following code produces output which shows:
- the correlation matrix of weight and height
- output from a simple linear regression analysis of weight on height
- a plot of weight against height with the predicted values (line) superimposed
- a plot of the standardised residuals against height
- a printout of the output file
- a normal probability plot for each of the residuals and standardised residuals, together with the results of tests of normality.
proc reg data=unit6.eleven corr;
  model weight=height;
  output out=regout p=pred r=resid student=stdresid;
  plot weight*height='*' p.*height='P' / overlay;
  plot student.*height;
run;
proc print data=regout;
proc univariate plot normal data=regout;
  var resid stdresid;
run;
Task 2
Obtain the output from the previous program. Why is it important to produce residual plots and normal probability plots for the residuals from a linear regression model?
Exercises
6.1 Data set beetle, available as a permanent SAS data set on the web page, gives information on a sample of beetles and the damage they cause.
(a) Download the file and examine the data.
(b) (i) plot area against male beetle length
(ii) regress area against male beetle length
(iii) obtain a plot of the standardised residuals against length
(iv) obtain a normal probability plot of the standardised residuals.
(c) Why isn’t a simple linear regression model satisfactory? What might be a useful thing to try in order to improve the model?
(d) Similarly, investigate the relationship between the amount of frass produced and female beetle length.
7. Exploratory data analysis
7.1. SAS/INSIGHT
SAS Insight software is a tool for data exploration and analysis. It is interactive which means you can explore data through graphs and analyses linked across multiple windows. This facility can be used to identify outliers, highlight subgroups on a graph and so on. SAS/INSIGHT allows you to analyse univariate distributions, investigate multivariate distributions and fit explanatory models such as a simple linear regression model to your data.
7.2. Accessing SAS/INSIGHT
SAS/INSIGHT may be used with any SAS data sets that have been created previously. If you wish to investigate a permanent SAS data set make sure that you have set up an appropriate library using a LIBNAME statement before invoking SAS/INSIGHT. It is also possible to enter data directly into the SAS/INSIGHT data window.
Within SAS, select Solutions > Analyse > Interactive data analysis. The data of interest can then be accessed using Library > Data Set > Open, or a new data set can be created in the data window using New.
7.3. Creating a Scatter Plot
To investigate SAS/INSIGHT initially, choose a data set you are fairly familiar with and try out some of the features. This example used the Pulse data set. Analyse > Interactive Analysis
Figure 9 Opening the SAS/INSIGHT dialogue in Windows
Figure 10. Opening the SAS/INSIGHT dialogue in Linux
You can create charts using the Analyse menu. An example of creating a scatter plot is shown in Figure 11.
Analyse > Scatter Plot (X/Y)
Figure 11 The Scatter Plot ( X / Y ) dialogue
Figure 12 Two SAS/INSIGHT scatter plots with 3 points highlighted
A key feature of SAS/INSIGHT graphs is that they are interactive. Click on a point on one chart and all corresponding points become highlighted. An example of this interactivity is shown in Figure 12. The 3 points labelled with ‘1’ use a larger symbol and indicate the level of the variable ran (1 = yes). To select several points, hold down the ctrl key while clicking. To highlight an area of points, left click and hold in one corner of the region, then drag the cursor to the opposite corner. All points in the rectangle will be highlighted. You can then turn off the highlighting by clicking on a point outside the rectangle.
7.4. Features of SAS/INSIGHT
SAS/INSIGHT software provides an extensive range of tools for investigating data and carrying out analyses. Some of the activities that you can carry out using SAS/INSIGHT are shown below.
- enter data from the keyboard
- identify observations in plots
- examine all values for selected observations
- brush observations in graphs
- create overlaid line plots
- rotate data in three dimensional plots
- manipulate histograms to explore the distribution of data
- compare distributions in box plots and mosaic plots
- compute descriptive statistics
- fit parametric (normal, lognormal, exponential, Weibull) and kernel density estimates
- fit parametric cumulative distribution functions
- create quantile-quantile plots
- calculate correlations and principal components to find the structure of your data
- fit a general linear model
- create residual and leverage plots
- transform variables
- process data by groups for every analysis
7.5. Using SAS/INSIGHT
Once you have a data set in SAS/INSIGHT, manipulations and analyses are carried out by using either Edit or Analyse on the main menu.
Operations are also available from pop-up menus by:
- clicking the left mouse button in the corners of graphs and tables
- pressing the right mouse button over an appropriate object.
Variables to be analysed may either be selected before clicking on Analyse or entered as requested within the particular analysis window.
In the data set window:
- a variable is selected by clicking on the name
- several variables can be selected by holding down the left mouse button and dragging across the selection
- non-contiguous variables or observations may be selected by holding down the control button and clicking on individual names or row numbers.
In WINDOWS, any plots produced can be printed directly or copied and pasted into Word documents.
Tabular output can be saved into the normal SAS Output Window using File > Save > Tables.
Commands from your SAS/INSIGHT session can be recorded and later resubmitted. The FILE and INFILE options allow you to produce a file containing commands to document and reproduce a SAS/INSIGHT session. This is very useful for exploratory analysis that you need to interrupt or repeat on different sets of data.
Example

filename note 'h:\MA71064 Statistical Computing\insight.txt';
proc insight file=note;
run;
After doing your analysis and then exiting, your file will contain the commands that were used to create and close your Insight session. You can begin your second Insight session from where you left your first session with the following code:
filename note 'h:\MA71064 Statistical Computing\insight.txt';
proc insight infile=note;
run;
Alternatively, just code the FILE keyword without a filename specified and your commands will be recorded in the SAS Log window.
Task 1
To get a feel for what SAS/INSIGHT can do, work through the following exercises which are based on the pulse data.
(a) Obtain a histogram of height.
    Obtain a histogram of sex in a new window.
    Click on the bar representing males on the sex histogram.
    Look at the histogram of height. What has changed on this histogram?

(b) Obtain comparative boxplots of weight for each sex. (Input weight as the Y-variable and sex as the X-variable.)
    Click on the outlier for male weights. Which observation number is this?
    Double click on this outlier. What information do you get?

(c) Highlight observation numbers 1, 31 and 67 in the data window.
    Press the right mouse button and click ‘Label in plots’.
    Obtain a scatter plot using pulse2 as the Y-variable, pulse1 as the X-variable, sex as the group variable and ran as the label. What information is shown on these plots?
    Double click on one of the points. What information do you get?

(d) Highlight the variable names for pulse1, pulse2, height and weight in the data window.
    Obtain a scatter plot.
    What plots do you get? What are the values shown in each plot?

(e) Using the Fit option in the Analyse menu, input pulse2 as the Y-variable, weight as the X-variable and ran as the group variable.
    Look carefully at the output obtained.
    Redo the analysis to show Residual Normal QQ plots and store predicted values and standardised residuals in the data sheet.
7.6. Tools
Edit > Windows > Tools turns on a menu that allows data points to be coloured, or given different symbols, according to data value.
Figure 13 The tools dialogue labelling observations by sex
Press the coloured square from the tools window to set a particular set of points to that colour, e.g. colour red all those points representing people who smoke. Similarly, press one of the symbol buttons to select a given symbol for a set of observations.
Other features of SAS/INSIGHT can be investigated using the help facility.
Task 2
(a) Find out what is meant by ‘brushing observations’.
(b) Produce summary statistics and graphs for each of the continuous variables in the pulse data.
(c) Input, into the data sheet, the following data which are chloride content (mg/l) of waters draining from a particular type of rock.
6.0  5.0  0.5  0.5  0.6  10.0
0.4  6.0  1.2  0.2  0.7  0.3
0.2  0.8  0.2  1.7  0.5  6.0

(i) Produce a boxplot of the data and comment on the distribution.
(ii) Create a new variable which is the log of the chloride content.
(iii) Check whether the log values could reasonably be assumed to have come from a normal distribution.
8. Modifying data and output
8.1. Introduction
In the previous units you have discovered how to create permanent SAS data sets and produce output using a variety of procedures. The OPTIONS statement, which allows you to make some changes to the output produced, was introduced in the unit on Describing Data. Generally speaking, though, there has been very little flexibility in either changing the style of the output or modifying the data set that has been used. In this unit you will learn how to make changes to a permanent SAS data set and to customise some types of output.
8.2. SET statement
The SET statement is a very versatile statement which is used in a DATA step and enables a variety of tasks to be carried out depending on which options are used. One of its most common uses is reading observations and variables from existing SAS data sets so that further processing can take place. Another use is combining two or more data sets so that analyses can be carried out on a larger set of variables or observations.
The same operations can often be done in different ways because some SAS statements can be incorporated as options into either the DATA or SET statements. Have a look at the following examples of code which both achieve the same thing.
data unit8.beetles1 (drop=site);
  set unit8.beetles;
  where site='1';
run;

data unit8.beetles1 (drop=site);
  set unit8.beetles (where=(site='1'));
run;
Both pieces of code produce a new permanent SAS data set called beetles1 which has data from site one only and does not include site as a variable. The number 1 is shown in single quotes because site is a character variable in this data set. The DROP option allows variables not required to be omitted from the new data set. It can be included in either the DATA or SET statement but you have to be a bit careful. If DROP is used in the SET statement then the variables involved cannot be used for further processing. In this example using the DROP option with the variable site in the SET statement would result in an error message – try it!
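The difference between the two positions of DROP= can be seen in a minimal sketch. It assumes the libref unit8 and the beetles data set used above; the output data set names site1 and nosite are made up for illustration.

```sas
/* DROP= on the DATA statement: site is read in, so it can be   */
/* used to select observations, and is only left out when the   */
/* new data set is written.                                     */
data site1 (drop=site);
  set unit8.beetles;
  if site = '1';
run;

/* DROP= on the SET statement: site is never read in. Any later */
/* statement in this step that needs site (for example a WHERE  */
/* statement on site) would fail.                               */
data nosite;
  set unit8.beetles (drop=site);
run;
```

In short, an option on the SET statement acts on the data as they are read, while the same option on the DATA statement acts on the data as they are written.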
The following table shows some of the commonly used options in the DATA and SET statements.
Data Set Options Description
DROP= Specify one or more variables to exclude either from further processing or from the new data set
FIRSTOBS= Specify the first observation required for processing
KEEP= Specify one or more variables to include in further processing or in the new data set
OBS= Specify the last observation required for processing
LABEL= Specify a descriptive label for the data set (labels for variables are given with the LABEL statement, see section Labelling output)
RENAME= Specify new names for variables
WHERE= Specify a condition to select certain observations from a SAS data set
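As a rough illustration of several of these options used together, the sketch below builds a small temporary data set and then reads part of it. The data set names raw and middle, the variable id, the new name malelen and all the data values are invented for the example.

```sas
/* A small made-up data set with five observations */
data raw;
  input id mlength flength;
  cards;
1 10 12
2 11 13
3 12 14
4 13 15
5 14 16
;
run;

/* Read observations 2 to 4 only, keep two variables and       */
/* rename one of them. KEEP= is applied before RENAME=, so     */
/* KEEP= refers to the old name mlength.                       */
data middle;
  set raw (firstobs=2 obs=4 keep=id mlength rename=(mlength=malelen));
run;

proc print data=middle;
run;
```

The data set middle should contain three observations (id 2, 3 and 4) and the variables id and malelen.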
8.3. DROP and KEEP
The following code shows how the DROP and KEEP options may be used in a program.

data lengths (keep=height cond mlength flength)
     damage (drop=mlength flength);
  set unit8.beetles (drop=site);
run;

proc print data=lengths;

proc print data=damage;
run;
Note that more than one data set can be specified in a single DATA statement. In this example two temporary data sets are produced, lengths and damage.

Task 1
Look at the preceding code and see if you can work out which variables will be contained in each data set. Submit the code and see if you are correct. If you wanted to create permanent data sets what changes to the code would you have to make?
8.4. Labelling output
Variable names in SAS are restricted in length (eight characters in older versions of SAS, 32 characters from version 7 onwards), and by default these variable names are used as column headings. A LABEL statement may be used to associate a descriptive label with a variable. If the labels are required in the output then either a LABEL or SPLIT option must be used in the PROC PRINT statement. If a LABEL statement is used in a DATA step then the labels are permanently associated with the variables in that SAS data set.
An example of the use of labels follows.
proc sort data=unit8.beetles1 out=sorted;
  by cond height;

proc print data=sorted split='*';
  var height mlength area frass;
  by cond;
  label mlength='male*length'
        area='leaf area*consumed'
        frass='number of*frass*pellets';
  pageby cond;
  sum frass;
run;

Task 2
Submit the code in SAS and inspect the output. What effect have the PAGEBY and SUM statements had?
If you had used label names without the asterisks and used the LABEL option in the PROC PRINT statement then SAS would split the label names automatically at a suitable place but you have no control over the process.
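A sketch of this alternative is given below. It assumes the temporary data set sorted created by the PROC SORT step above; the label text is the same as before with the asterisks removed.

```sas
/* LABEL option instead of SPLIT=: the labels are used as      */
/* column headings and SAS decides where to break them.        */
proc print data=sorted label;
  var height mlength area frass;
  label mlength='male length'
        area='leaf area consumed'
        frass='number of frass pellets';
run;
```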
8.5. PROC PRINT
Options Description
DATA= Data set to be used, uses the last data set created by default
LABEL Ensures that column labels are used in the output
SPLIT= Specify a character in the label names which splits the column headings onto two or more lines
NOOBS Suppresses the printing of observation numbers in the output
Statements
VAR Specify variables to be printed
ID Specify variable to use as identification instead of the observation number
BY Specify a list of character variables (data must be sorted) to produce separate tables.
PAGEBY Used with BY statement to output each table on a separate page.
SUMBY Prints subtotals for the specified BY variable
SUM Specify numeric variables for which the sum of the values is required.
8.6. PROC FORMAT
The LABEL statement allows you to give longer names to variables so that any output is easier to interpret. It is also possible to assign names to individual categories for character variables and to save these names as permanent formats. This is done using PROC FORMAT. The permanent formats are saved in a location which is specified using the LIBNAME statement with the special libref name LIBRARY. For example,
libname library 'C:\sasdata';
will store the formats in a directory sasdata on the hard disk when the LIBRARY option is used in the FORMAT procedure.
Task 3
Assign a library called LIBRARY in a suitable location. (The location where you have stored your SAS permanent data sets is probably the most appropriate.)

The individual names are assigned using a VALUE statement. These formats are independent of any particular data set and if appropriate may be used with any variable. The following example shows suitable labels for the plant height and plant condition categories from the beetles data but the format $health, for example, could be used for any variable where 1, 2 and 3 represent poor, satisfactory and good respectively.

proc format library=library;
  value $height '1'='less than 10cm'
                '2'='10cm<20cm'
                '3'='20cm<30cm'
                '4'='30cm<40cm'
                '5'='40cm or more';
  value $health '1'='poor'
                '2'='satisfactory'
                '3'='good';
run;
When these labels are required in output the format names must be shown with a full-stop in the FORMAT statement.
proc print data=unit8.beetles;
  format height $height. cond $health.;
run;

Task 4
Create the above formats and print out the beetles data set to see the effect of using these formats. Check the library called LIBRARY. You should find a catalog called FORMATS. Double-clicking on FORMATS will give a list of all permanent formats you have created.
The permanent formats which have been created may be used at any time. If you want to use them in a future session remember to use a LIBNAME statement initially to assign the libref LIBRARY.
(To use more than one permanent format library use options fmtsearch - see HELP for details)
_____________________________________________________________
8.7. Recoding data
You may find that, when presenting results or carrying out an analysis of a set of data, you wish to code a continuous variable such as height into discrete categories, for example short, medium and tall. In some circumstances you may wish to combine categories for an analysis. This type of operation can be done either by using conditional statements or by making use of the VALUE statement in PROC FORMAT.
_____________________________________________________________
8.8. Conditional statements
Conditional statements can take several forms (see the SAS Help). Two commonly used in recoding are:
if expression then statement;

if expression then statement;
else statement;
The following code shows the use of the IF-THEN/ELSE statement for recoding pulse1 and pulse2 into four numeric categories.
data unit8.pulse2 (keep=pulse1 pulse2 ran);
  set unit8.pulse;
  if 40 <= pulse1 < 60 then pulse1=1;
  else if 60 <= pulse1 < 80 then pulse1=2;
  else if 80 <= pulse1 < 100 then pulse1=3;
  else pulse1=4;

  if 40 <= pulse2 < 60 then pulse2=1;
  else if 60 <= pulse2 < 80 then pulse2=2;
  else if 80 <= pulse2 < 100 then pulse2=3;
  else pulse2=4;
run;

proc freq data=unit8.pulse2;
  tables pulse1*pulse2 / nocol nocum nopercent norow;
  by ran;
run;
_____________________________________________________________
8.9. VALUE statement
An alternative way to code data is to create new formats for the required variables and use these formats when they are needed to produce particular output. For example putting the pulse rates into categories and creating a two way table could be done by creating a new format as follows.
libname library 'c:\sasdata';
run;
proc format library=library;
  value pulse 40-59=1
              60-79=2
              80-99=3
              100-200=4;
run;

proc freq data=unit8.pulse;
  tables pulse1*pulse2 / nocol norow nocum nopercent chisq;
  by ran;
  format pulse1 pulse2 pulse.;
run;

Task 5
Try out these alternative ways of recoding the data. What do you think are the advantages and disadvantages of each method? What do the resulting tables tell you about the relationship between the first and second pulse rates in each group?
The values that may be assigned to a particular code or description may be specified in the following ways.
Range specification in the VALUE statement
Description
value a single value
value1-valuen a range of values
value1, value2, …. a list of values
HIGH the highest possible value
LOW the lowest possible value
OTHER anything that does not fall into any range
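The single value, list and OTHER forms can be sketched as follows. The format name actfmt, the data set demo and the data values are all made up for illustration; the format is stored in the temporary WORK library because no LIBRARY= option is given.

```sas
/* A format using a single value, a list of values and OTHER */
proc format;
  value actfmt 1     = 'slight'
               2, 3  = 'moderate or a lot'
               other = 'not recorded';
run;

/* Five made-up values, including an unexpected code (9) */
/* and a missing value (.)                               */
data demo;
  input activity @@;
  cards;
1 2 3 9 .
;
run;

proc print data=demo;
  format activity actfmt.;
run;
```

The values 9 and missing are not covered by any of the listed values, so both should be printed as 'not recorded'.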
For example a format for age groups could be created as follows.
libname library 'c:\sasdata\';
run;
proc format library=library;
  value agegroup low-24='under 25'
                 25-49='25 or more but less than 50'
                 50-high='50 or over';
run;
____________________________________________________________________
8.10. OUTPUT
The OUTPUT statement is used in conjunction with the SET statement to create multiple SAS data sets. The IF statement is used with the OUTPUT statement to control which observations are output to which SAS data sets.
e.g.
data american japan british;
  set mydata.cars;
  if origin = 1 then output american;
  else if origin = 2 then output japan;
  else if origin = 3 then output british;
run;
Exercises
8.1 Create a new temporary data set, using the pulse data, named mpulse which contains data for males only. Omit the variables sex, height and weight in this set. Label pulse1 ‘First pulse rate’, pulse2 ‘Second pulse rate’ and print out the data set using these variable labels.
8.2 Create formats for the alphanumeric variables in the pulse data to give the following information:
ran       1=ran in place
          2=did not run in place

smokes    1=smokes regularly
          2=does not smoke regularly

sex       1=male
          2=female

activity  1=slight
          2=moderate
          3=a lot
Construct two way tables which show frequencies only for the following pairs of variables, using the formats you have created.
(i) sex and smokes
(ii) smokes and activity
(iii) sex and ran

8.3 In Task 5 there was a warning that the chi-squared test may not be valid because of small expected numbers. Create a new format for the pulse rates which has two categories only. (Choose the categories so that there are roughly equal numbers in each category for this particular set of data.) Repeat the chi-squared tests for independence for each of the two groups (ran/did not run) as in Task 5 and comment on the results.
9. PROC TABULATE
Proc tabulate displays descriptive statistics in tabular format. It computes many of the same statistics that are computed by other descriptive statistical procedures such as MEANS, FREQ, and SUMMARY, but incorporates more flexibility.
Statements
Description
CLASS     Identifies the categories on which calculations are carried out
          Class variables are either character or discrete numeric, e.g. a department code
          Supplies the values used in the structure of the table
          Must be present in PROC TABULATE statements

VAR       Contains values appropriate for calculating statistics
          Analysis variables are continuous numeric
          Supplies the values in the table cells
          Optional in PROC TABULATE statements

TABLE     element1 * element2, where the elements are class variables and, optionally, the ALL variable and/or various statistics
          Table operators:-
          comma - produces a multidimensional table
          asterisk - produces a hierarchical table
          blank - concatenates tables

BY        Specify a list of character variables (data must be sorted)

FORMAT    Specify variables with the formats wanted

FREQ      Specify the variable

KEYLABEL  Used to label the statistics available e.g. keylabel all='Grand Total';

LABEL     Specify a label for a variable

WEIGHT    Specify a variable to be used for weighting the entries in the table
PROC TABULATE is the only procedure that has a SAS manual of its own. It is worth understanding the format of the table statement that controls the position of the variables.
There are 3 operators that determine where the variables are positioned in the output. Notice that the variables must be categorical. If the variables are numeric then the CLASS statement is used to tell the SAS system to treat the numeric variables as if they were categorical.
9.1. The comma table operator
Determines page, row and column positions, i.e. cross tabulation.

Table <page variable>, <Row variable>, <Column variable> ;

e.g.
proc tabulate data = students;
  class faculty sex;
  table faculty, sex;
run;

              SEX
           Male   Female
           N      N
FACULTY
Science    450    360
Arts       350    550
9.2. The asterisk table operator
Nests two variables in the column or row.

Table <Column variable1> * <Column variable2> ;

e.g.
proc tabulate data = students;
  class faculty sex;
  table faculty*sex;
run;

              FACULTY
    Science           Arts
      SEX              SEX
Male    Female    Male    Female
N       N         N       N
450     360       350     550
9.3. Using a blank table operator
Places variables side by side.

Table <Row variable>, <Column variable1> <Column variable2> ;

e.g.
proc tabulate data = students;
  class faculty sex;
  table faculty sex;
run;

    FACULTY            SEX
Science   Arts     Male   Female
N         N        N      N
810       900      800    910
9.4. Using the ALL variable
e.g.
proc tabulate data = students;
  class faculty sex;
  table faculty all, sex all;
run;

                 SEX
           Male   Female   ALL
           N      N        N
FACULTY
Science    450    360      810
Arts       350    550      900
ALL        800    910      1710
9.5. Other Statistics
The statistics that can be requested in PROC TABULATE include the following
N      number of nonmissing observations
SUM    sum of the VAR variable for each class of the CLASS variables
NMISS  number of missing observations
MEAN   arithmetic mean
STD    standard deviation
MIN    minimum value
MAX    maximum value
e.g.
proc tabulate data = students;
  class faculty sex;
  var exammark;
  table faculty*sex*exammark*max;
run;

                   FACULTY
       Science                Arts
         SEX                   SEX
Male       Female       Male       Female
EXAMMARK   EXAMMARK     EXAMMARK   EXAMMARK
MAX        MAX          MAX        MAX
83         85           79         76

_____________________________________________________________
The operators can be mixed to produce any output you require. For example it is possible to nest 2 variables on the rows and columns of a table.

Table <Row var1>*<Row var2>, <Column var1>*<Column var2> ;
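As an illustration of mixing the operators, the sketch below nests sex within faculty on the rows and asks for two statistics of exammark in the columns. The students data set is re-created here with a handful of made-up records so that the example can be run on its own; these are not the data behind the tables shown above.

```sas
/* Made-up records using the variable names from the examples above */
data students;
  input faculty $ sex $ exammark;
  cards;
Science Male 83
Science Male 61
Science Female 85
Arts Male 79
Arts Female 76
Arts Female 58
;
run;

/* Rows: sex nested within faculty (asterisk)                    */
/* Columns: mean and maximum of exammark (asterisk, parentheses) */
/* The comma separates the row and column dimensions             */
proc tabulate data=students;
  class faculty sex;
  var exammark;
  table faculty*sex, exammark*(mean max);
run;
```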
Exercises
9.1 Use proc tabulate on the pulse data to produce
(a) the average of pulse1 in a table of activity by sex
(b) the average difference in pulse in a table showing whether they smoke by whether they ran or not

_____________________________________________________________
10. Functions and formats
Functions are a useful means of writing SAS code because they simplify the coding involved and often result in you having to write fewer lines of code. Over 120 functions are available within the SAS system. Some functions operate on numeric values, others on character values. Some are specialised and operate on specific types of values such as dates and times. All functions operate on arguments, which may be variable names or specific values.
_____________________________________________________________
10.1. MEAN function
Example

data myinfo;
  set info;
  m_val = mean (var1, var2, var3);
run;
If the info data set consisted of the following data,
var1  var2  var3
2.5   5     1.5
6.0   3     3.0
then myinfo would be as follows:-

var1  var2  var3  m_val
2.5   5     1.5   3
6.0   3     3.0   4

The mean function calculates the mean of the three variables listed. Alternatively we could have written the expression as

m_val = mean (of var1 - var3);

An important difference between the MEAN function and the expression

m_val = (var1 + var2 + var3)/3;

is that the MEAN function returns the mean of the nonmissing values. So if we had a missing value for var2, the function would return the mean of var1 and var3 whereas the expression above would return a missing value if any of the var values were missing.
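The difference between the MEAN function and the arithmetic expression can be checked with a small sketch. The data set name demo and the variable names m_func and m_expr are made up; the data are the values used above, with var2 set to missing in the first record.

```sas
data demo;
  input var1 var2 var3;
  m_func = mean(of var1-var3);        /* mean of the nonmissing values */
  m_expr = (var1 + var2 + var3)/3;    /* missing if any value missing  */
  cards;
2.5 . 1.5
6.0 3 3.0
;
run;

proc print data=demo;
run;
```

For the first record m_func is 2 (the mean of 2.5 and 1.5) while m_expr is missing; for the second record both are 4. SAS also writes a note to the log saying that missing values were generated by the arithmetic expression.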
10.2. NMISS function
The NMISS function returns the number of missing values in a list of variables. This can be useful if for example we want to exclude observations from a calculation where there are too many missing values e.g. suppose we have recorded 50 readings for each instrument and want to compute the mean of these 50 readings, but only for those instruments with at least 30 readings
data myavges;
  input x1-x50;
  /* at least 30 of the 50 readings present means at most 20 missing */
  if nmiss (of x1-x50) le 20 then ave = mean (of x1-x50);
  cards;
etc.
_____________________________________________________________
10.3. N function
The N function operates in a similar fashion, but returns the number of non-missing values.
_____________________________________________________________
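The N and NMISS functions can be compared side by side in a short sketch. The data set name counts, the variable names and the data values are made up, and five readings per record are used instead of 50 to keep the example small.

```sas
data counts;
  input x1-x5;
  n_present = n(of x1-x5);        /* number of nonmissing readings */
  n_missing = nmiss(of x1-x5);    /* number of missing readings    */
  /* only compute the average when at least 3 readings are present */
  if n_present ge 3 then ave = mean(of x1-x5);
  cards;
1 2 . 4 5
. . . 8 9
;
run;

proc print data=counts;
run;
```

The first record has 4 readings present and 1 missing, so ave is 3. The second record has only 2 readings present, so ave is left missing.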
10.4. String Handling Functions
The above functions are for numeric variables and do not work with strings of characters. Some of the most commonly used character functions are

SUBSTR (char_variable, starting_position, length)
which extracts a substring

INDEX (char_variable, index_string)
which returns the position of a substring

VERIFY (char_variable, verify_string)
which returns the position of the first character in char_variable that is not present in verify_string
Example
data dept;
  set mydir.jobs;
  tot = substr ('ABCDEFG', 3, 2);
  dept = substr (account, 4, 3);
  ind = index (account, 'tch');
  ver = verify (account, 'sth');
run;

If the data set mydir.jobs consisted of

account  code
spsmrk   m003
spsmrk   m005
spstch   t003

then the data set dept would consist of

account  code  tot  dept  ind  ver
spsmrk   m003  CD   mrk   0    2
spsmrk   m005  CD   mrk   0    2
spstch   t003  CD   tch   4    2
The substr function operates on character values to extract part of a value. The structure of the substr function is substr(argument, position, length). The argument may be a character value or a variable name, the position gives the position from which to start reading, and the length gives the number of characters to read.
Task
Investigate what functions are available in SAS by selecting Help > SAS Help and Documentation, then choose SAS Products, Base SAS, SAS Language Dictionary, Dictionary of Language Elements and Functions and CALL Routines.
10.5. Date and Time Formats
The SAS System processes calendar date values by converting dates to integers representing the number of days between January 1 1960, and a specified date.
For example, the following calendar date values represent the date July 26 1989:
072689     26JUL89     890726
07/26/89   26JUL1989   26 Jul 1989
The SAS date value representing July 26, 1989 is 10799.
The trick is to convert dates to numerics and back again. SAS has many date, time and datetime informats and formats. We read the data in with date/time informats and get them back out of SAS using date/time formats. Many of the date/time informats are more or less the inverse of formats of the same name.
The above dates would be read in using the following informats in the input statement e.g.

data test;
  input var1 MMDDYY6. +1 var2 DATE7. +1 var3 MMDDYY8. +1 var4 DATE9.;
  cards;
072689 26JUL89 07/26/89 26JUL1989
;
run;
To print them out all using different formats we reverse the process e.g.
proc print;
  format var1 DATE9. var2 MMDDYY6. var3 DATE7. var4 YYMMDD6.;
run;
The dot at the end of the informat (or format) name indicates that it is an informat (or format) and not a variable.
Details of the different formats and informats available in SAS can be found in the SAS System help.
Blanks and other special characters can be placed between day, month, and year values. Width values must allow space for blanks and special characters.
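Note: When yy is given as two digits, the century that SAS assumes depends on the YEARCUTOFF= system option, which sets the first year of the 100-year span used; with an early cutoff year, two-digit years are read as dates in the 1900s. Use the YEARCUTOFF= system option to override the system default and specify a date range of your choice.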
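The effect of YEARCUTOFF= can be seen in a short sketch. The cutoff year 1950 is chosen purely for illustration and the variable names d1 and d2 are made up; note that the OPTIONS statement changes the setting for the rest of the SAS session.

```sas
options yearcutoff=1950;   /* two-digit years now fall in 1950 to 2049 */

data _null_;
  d1 = input('01JAN49', date7.);   /* 49 is read as 2049 */
  d2 = input('01JAN50', date7.);   /* 50 is read as 1950 */
  put d1= date9. d2= date9.;       /* written to the SAS log */
run;
```

With this setting the log should show d1=01JAN2049 and d2=01JAN1950.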
Example
Data Lines     SAS Statement        Results
1jan1990       input day date9.;    10958
01 jan 90                           10958
1 jan 90                            10958
1-jan-1990                          10958
The TIMEw. informat reads time values in the form hh:mm:ss.ss, where hh and mm are integers representing the hour and minute, and ss.ss is an optional fractional field representing seconds and decimal fractions of seconds. If you do not enter a value for seconds, SAS assumes a value of 0.
Example
Data Line      SAS Statement          Result
14:22:25       input begin time8.;    51745

The DATETIMEw. informat reads date and time values, e.g. 8:30 p.m. on May 6 1989 could be represented as 6MAY89:20:30 using DATETIME12.
Another way to specify SAS date/time values is with special constants, e.g. 18 February 1951 is represented as '18FEB51'D, high noon as '12:00'T and a moment in date and time as '1OCT82:15:27:05'DT.
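A minimal sketch of these constants in use is given below. It reuses the date and time from the earlier examples (26 July 1989 and 14:22:25); the variable names d, t and dt are made up.

```sas
data _null_;
  d  = '26JUL1989'd;             /* date constant     */
  t  = '14:22:25't;              /* time constant     */
  dt = '26JUL1989:14:22:25'dt;   /* datetime constant */
  put d= t=;                     /* the stored numeric values */
  put d= date9. t= time8. dt= datetime18.;   /* the same values, formatted */
run;
```

The first PUT statement should write d=10799 t=51745 to the log, matching the values quoted above. A date value counts days and a time value counts seconds; a datetime value counts seconds from the start of 1 January 1960.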
Exercises
10.1 A character variable alphabet = ‘ABCDEFGHIJKLMNOPQRSTUVWXYZ’
Work out what you think the following functions will return, then check your answers with a SAS program.
(a) index(alphabet, 'FGHIJ')
(b) index(alphabet, 'JNP')
(c) substr(alphabet, 3, 3)
(d) verify(alphabet, 'ABDEI')
10.2 Write a SAS program to find out how old you were on 1 October 1999 (in years).
10.3 Download the SAS data set napier.SAS7BDAT from the web. This contains data on first year Napier students in 1993/4. There are 1872 records and 33 variables. The variables that you need to consider are
crsecd  Numeric variable giving course code
dob     Date of birth
Identify the oldest student. Which course is he/she in ?_________________________
Which course has the youngest students (on average)?_______________
Which courses have the largest numbers of students?__________________________
In order to answer these questions you will need to
(a) Create a variable giving the age in years.
(b) Sort the data set by age, storing your sorted data set in a temporary data set. Examine the data set from the Explorer window to find the oldest student and their course.
(c) Sort the data set by crsecd, storing your sorted data set in a temporary data set. Use proc means to make a new temporary data set that will contain the mean, minimum and maximum age of students on each course. Examine your data set of means to find the course with the youngest students.
(d) Sort the data set by the number of records for each course (_freq_) to find the course with the largest number of students.
11. Iterative processing
11.1. Do loops and arrays
The DO statement allows us to perform an iterative loop
e.g. do i = 1 to 5; (lines of SAS code) end;
would result in the lines of SAS code being repeated 5 times, with i taking on the values 1, 2, 3, 4 and 5. Here i is used as the counter, 1 is the start value and 5 is the end value. The default increment is 1. We can specify the increment
e.g. do i = 1 to 7 by 2; (lines of SAS code) end;
would result in the lines of code between the do and the end statements being repeated 4 times, with the value of i taking on the values of 1,3,5,7.
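A complete DATA step built around the second loop might look like the sketch below. The data set name squares and the variable sq are made up for illustration.

```sas
data squares;
  do i = 1 to 7 by 2;
    sq = i*i;     /* the code repeated on each pass through the loop */
    output;       /* write one observation per pass                  */
  end;
run;

proc print data=squares;
run;
```

The data set squares should contain four observations, with i equal to 1, 3, 5 and 7 and sq equal to 1, 9, 25 and 49.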
_____________________________________________________________
11.2. Reading data in repeated patterns
The quality control department takes 4 sample cans of oil from a production line and weighs them, every hour for 12 hours. Each record in the raw data contains the following fields:
   hour        : the hour in which the samples were taken
   weight 1-4  : weights of the four sample cans

The quality control department wants to analyse these data. The first step is to create a SAS data set so that it contains a single observation for each measurement taken. The DATA step must create four observations from each record, i.e.

first record of raw data:
   1 8.024 8.135 8.151 8.065

first four observations in the data set:
   HOUR  WEIGHT
   1     8.024
   1     8.135
   1     8.151
   1     8.065
The first INPUT statement reads a value from the first field and assigns it to HOUR. The value for HOUR is the same for all four observations to be created from the first record.

   data oil1;
      input hour @;
The single trailing @ sign is used to hold the current record, preventing the next INPUT statement from reading a new record.
The next step is to read each value for WEIGHT (four in each record) and write an observation after each is read. An iterative do loop enables us to write a single pair of INPUT and OUTPUT statements to read a value and write an observation multiple times.
data oil1 (drop=i);
   input hour @;
   do i = 1 to 4;
      input weight @;
      output;           * write one observation for each weight;
   end;
cards;
1 8.024 8.135 8.151 8.065
2 7.971 8.165 8.166 8.157
3 8.024 8.135 8.151 8.065
etc.
12 7.971 8.165 8.166 8.157
;
proc print;
run;
The results would be
   OBS  HOUR  WEIGHT
   1    1     8.024
   2    1     8.135
   3    1     8.151
   4    1     8.065
   5    2     7.971
   6    2     8.165
   7    2     8.166
   etc.
_____________________________________________________________
11.3. Arrays
Arrays in SAS are used as a shorthand way of processing many variables with a few statements. An array is an ordered list of variable names. It is often used along with a DO statement to carry out an action repeatedly on a sequence of variables. When defining an explicit array, the ARRAY statement must contain
   an array name
   a subscript that indicates the number of elements in the array
   a dollar sign if the array is of character variables
   a description of the elements (for example, the variables a, b, c, d and e)
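An ARRAY statement meeting these requirements might look like the following sketch. The data set name arraydemo, the array names x and nm, and the data values are made up for illustration.

```sas
data arraydemo;
   input a b c d e n1 $ n2 $;
   array x{5} a b c d e;     /* name x, 5 elements, numeric variables */
   array nm{2} $ n1 n2;      /* the dollar sign marks character variables */
   do i = 1 to 5;
      x{i} = x{i} * 10;      /* same action applied to a, b, c, d and e */
   end;
   do i = 1 to 2;
      nm{i} = upcase(nm{i}); /* same action applied to n1 and n2 */
   end;
   drop i;
   datalines;
1 2 3 4 5 ann bob
;
run;

proc print data=arraydemo;
run;
```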
If we do not know the number of elements in an array we can use an asterisk to define the array, e.g.

   array sc{*} score1 score2 score5 score8;
although this is a lot less efficient in processing time than specifying the actual array dimensions.
Multidimensional arrays can be specified similarly, e.g.

   array x{3,5} test1 - test15;

could specify an array where the first dimension is the class number and the second is the test number (i.e. 5 tests for each of 3 classes). The elements of this array are referred to by (for example) x{2,3}, which gives the element in the second row and third column of the array, here test8.
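The two-dimensional layout can be tried out with a sketch like the one below, which averages the 5 tests for each of the 3 classes. The names classmeans, cls, t, total and classmean, and the 15 data values, are invented for this example.

```sas
data classmeans;
   input test1-test15;
   array x{3,5} test1-test15;   /* row = class, column = test */
   do cls = 1 to 3;
      total = 0;
      do t = 1 to 5;
         total = total + x{cls, t};
      end;
      classmean = total / 5;
      output;                   /* one observation per class */
   end;
   keep cls classmean;
   datalines;
5 6 7 8 9 1 2 3 4 5 10 10 10 10 10
;
run;

proc print data=classmeans;
run;
```

The variables fill the array one row at a time, so test1 to test5 belong to class 1, test6 to test10 to class 2, and test11 to test15 to class 3.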
_____________________________________________________________
Example
The following program recodes missing scores in a test to 0
data results;
   infile class6;
   input id age score1-score5;
   if score1 = . then score1 = 0;
   if score2 = . then score2 = 0;
   if score3 = . then score3 = 0;
   if score4 = . then score4 = 0;
   if score5 = . then score5 = 0;
run;

This can be rewritten using arrays as follows

data results;
   infile class6;
   input id age score1-score5;
   array ss(5) score1-score5;
   do i = 1 to 5;
      if ss(i) = . then ss(i) = 0;
   end;
   drop i;
run;
The reduction in code is not very much with 5 scores but if we had 150 scores arrays would be much more efficient.
_____________________________________________________________
11.4. Generating random numbers
In the data step examples we have examined so far, the SAS system has read a single record during the data step and written it to the output data set at the end of the data step. The SAS instruction to write a record to a data set is

   OUTPUT;

and the variables that are written can be restricted with the statement

   KEEP vars;

where vars stands for the variables to be saved.
If you write a SAS data step that does not contain an OUTPUT statement, then SAS will assume that you want a record output at the end of the data step. If your program does contain an OUTPUT statement, then SAS will write a record at this point in the program and not at the end of the data step.
The OUTPUT statement can be used to write a program that will produce a series of 100 records, each containing a different uniform random number, with this code. The seed 2762 can be replaced with any positive whole number you like.

data random;
   do i = 1 to 100;
      x = ranuni(2762);
      output;
   end;
   keep x;
run;
The keep statement is optional; if it is not included all variables will be saved to the file.
In generating random numbers, the ones at the beginning of a sequence can sometimes not be very random for certain choices of seed (check this from your own sequence), but they are generally OK after 500 numbers or so. To make sure that your sequence will be OK, use the following code at the beginning of any program that generates random numbers.

   do i = 1 to 500;
      x = ranuni(1279);   * or any number as a seed;
   end;

This will give the random number generator a whirl to ensure it is running smoothly.
_____________________________________________________________
11.5. Random numbers from a uniform distribution
In order to be able to select a random sample or randomly assign subjects to groups we can use SAS functions to generate random numbers. The two functions UNIFORM(0) and RANUNI(0) generate uniform random numbers in the range from 0 to 1. Random number generators require an initial number, called a seed, which they use to calculate the first random number. This
number is then used to generate the next and so on. For both of these functions, a seed of zero will cause the function to use a seed derived from the time clock, thus generating a different series each time it is used. The RANUNI function can be seeded with any number, but the UNIFORM function must be seeded with a 5, 6 or 9 digit odd number. In either case, if you supply the seed, the function will generate the same series of random numbers each time. In order to generate a series of random numbers from 1 to 100, we could use
   X = 1 + 99*RANUNI(0);
_____________________________________________________________
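The uniform generator described in section 11.5 can be used for sampling and group assignment along the following lines. This is only a sketch: the data set name sample and the variables u, k and group are illustrative, and a seed of 0 means the values differ on every run.

```sas
data sample;
   do id = 1 to 10;
      u = ranuni(0);           /* uniform on (0,1), seeded from the clock */
      x = 1 + 99 * u;          /* continuous value between 1 and 100 */
      k = ceil(100 * u);       /* whole number from 1 to 100 */
      group = 1 + (u > 0.5);   /* random assignment to group 1 or 2 */
      output;
   end;
run;

proc print data=sample;
run;
```

Replacing the 0 with a fixed positive seed makes the run repeatable.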
11.6. Random numbers from a normal distribution
To generate random numbers from a normal distribution with a mean of 0 and a standard deviation of 1 we can use the RANNOR function, which works in a similar way to RANUNI.
_____________________________________________________________
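A standard normal value from RANNOR (section 11.6) can be shifted and scaled to give any other mean and standard deviation. The sketch below assumes a target mean of 50 and standard deviation of 10; these figures, the seed 4321 and the names normals, z and y are chosen only for illustration.

```sas
data normals;
   do i = 1 to 100;
      z = rannor(4321);   /* mean 0, standard deviation 1 */
      y = 50 + 10 * z;    /* mean 50, standard deviation 10 */
      output;
   end;
   drop i;
run;

proc means data=normals mean std;
   var z y;
run;
```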
11.7. The SAS Program Data Vector
While at work on a DATA step, the SAS System maintains a temporary data structure in computer memory called the SAS program data vector, or PDV. The program data vector represents one observation, a data row, and can be thought of as a linear set of boxes in which values of SAS variables can be contained. Unlike a SAS data set, which survives between steps, the PDV is a dynamic entity that is created during a DATA step, and goes away after the step has completed execution.

Constructing the PDV is one of the DATA step compiler’s first jobs as it passes through DATA step source code. The compiler looks at all the SAS statements in the step’s source to find out what variables are named - in INPUT, attribute, or other statements - and creates a PDV with space for each variable’s length.

There are a couple of automatic system variables also in the program data vector: the variables _N_ and _ERROR_. These are maintained by the compiler and may be accessed by the program, though they do not get written to the new data set(s). _N_ contains a count of how many times the DATA step has begun execution from the top (usually the number of records read so far), and _ERROR_ is set to 1 (true) when a data error occurs. When a DATA step is executing, each time it begins another iteration the values of variables to be created by INPUT or by assignment in the PDV are initialised to missing (unless a RETAIN or Sum statement has been used), _N_ is incremented, and _ERROR_ is set to zero. When the DATA step returns to the top for the next iteration, the PDV is reinitialised and the process repeats.
_____________________________________________________________
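The contents of the PDV described in section 11.7 can be inspected by writing them to the log with a PUT statement. The following is a minimal sketch; the data set name pdvdemo, the variables x, y and total, and the data lines (including the deliberately invalid value a) are made up for this example.

```sas
data pdvdemo;
   input x y;
   total = x + y;
   /* write the automatic variables and the PDV values to the log */
   put _n_= _error_= x= y= total=;
   datalines;
1 2
3 .
a 4
;
run;
```

The third record holds a non-numeric value for x, so the log should show _ERROR_ set to 1 for that iteration and x stored as missing.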
11.8. The RETAIN and Sum statements
Normally, variables in the PDV that are named with assignment or with INPUT statements are initialised to missing each time the DATA step begins a new iteration. The RETAIN statement,
RETAIN [<variables [value]> ....];
causes variables to keep their values from the previous iteration at initialisation. They can still be changed if INPUT reads a new value, or when an assignment statement (including the Sum statement) is executed. The RETAIN statement can only be applied to ‘new’ variables, that is, ones which are being created within the DATA step. If a constant value is specified after the variable list, that value is given at the first iteration; otherwise, retained variables start as missing, whereas a variable named in a Sum statement starts with a value of zero.
Example
Given the SAS data set history:

   Year  Month  No_sold
   1990  1      10
   1990  2      12
   1990  3      8
   1990  4      6
   1990  5      9

the program:

   data changes;
      retain no_last;
      set course.history;
      compare = no_sold - no_last;
      no_last = no_sold;
   run;

produces the PDV

   _N_  _ERROR_  YEAR  MONTH  NO_SOLD  COMPARE  NO_LAST
   1    0        1990  1      10       .        10
   2    0        1990  2      12       2        12
   3    0        1990  3      8        -4       8
   4    0        1990  4      6        -2       6
   5    0        1990  5      9        3        9
The Sum statement is a special type of assignment statement, provided as a convenience for incrementing variables during the DATA step. The statement

   a + 7;

is identical in action to the statements

   RETAIN A 0;
   A = SUM(A, 7);

i.e. 7 is added to the previous value of a, which starts at zero and is kept from one iteration to the next.
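A common use of the Sum statement is a running total. In the sketch below, the data set name runtot and the variable total are illustrative; the three values are the first three values of No_sold from the history example above.

```sas
data runtot;
   input no_sold;
   total + no_sold;   /* Sum statement: total starts at 0 and is retained */
   datalines;
10
12
8
;
run;

proc print data=runtot;
run;
```

Each observation then carries the cumulative number sold up to and including that record.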
Exercises
11.1 Results of a survey are recorded for 1996 and 1997. However an extra question was asked in 1997. Create a SAS data set from the following data showing year and the answers to the questions.
1996 4 8 3 5 6 5
1996 5 7 4 5 6 8
1997 3 5 4 7 4 5 6
1997 5 3 4 3 6 7 8
11.2 Rewrite the following program using arrays:-
data test;
   input a b c x1-x3 y1-y3;
   if a = 999 then a = .;
   if b = 999 then b = .;
   if c = 999 then c = .;

   if x1 = 999 then x1 = .;
   if x2 = 999 then x2 = .;
   if x3 = 999 then x3 = .;

   if y1 = 999 then y1 = .;
   if y2 = 999 then y2 = .;
   if y3 = 999 then y3 = .;
datalines;
3 5 2 7 5 999 2 5 999
2 9 4 7 2 4 999 4 999
3 8 3 0 3 2 999 7 1
;
run;
11.3 A chi-squared random variable with k degrees of freedom is the sum of the squares of k independent standard normal random variables. Generate and plot the frequency distributions of chi-squared variables with 5, 10 and 30 degrees of freedom.
12. Further topics
12.1. Combining Data Sets
12.1.1. Concatenating Data Sets
To concatenate data sets means to combine similar data sets into a single new data set. In its simplest form the original data sets contain the same variables and the combined data set will contain the original data sets ‘one on top of the other’.
Suppose SAS data set wood08 contains values for the five variables id (alpha-numeric), age, diameter, height and variety (alpha-numeric) obtained from a survey of trees in a wood having reference number 08. Suppose further that similar results are available in SAS data sets wood48 and wood69. The following code will combine the three data sets into a new SAS data set woodcom (in the order 08, followed by 48, followed by 69).
data woodcom;
   set wood08 wood48 wood69;
run;
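If the original data sets contain different variables then the combined data set will contain every variable from every data set, with missing values wherever an observation comes from a data set that does not have that variable.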
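The effect of concatenating data sets with different variables can be checked with a small sketch. The data sets parta, partb and both, and all the values in them, are invented for this example.

```sas
data parta;
   input id $ age;
   datalines;
T1 40
T2 55
;
run;

data partb;
   input id $ height;
   datalines;
T3 12.5
;
run;

data both;
   set parta partb;   /* parta rows first, then partb rows */
run;

proc print data=both;
run;
```

In the printed result, height should be missing for the rows from parta and age should be missing for the row from partb.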
12.1.2. Merging Data Sets
We merge data sets when we combine data sets containing different information. In its simplest form (match-merging) we combine the data sets on the basis of a common variable which typically identifies each case or row.
Consider again SAS data set wood08 containing values for the five variables id (alpha-numeric), age, diameter, height and variety (alpha-numeric). Suppose a related SAS data set woodrs08 contains values for the following variables: id (alpha-numeric) plus numerical scores for damage, bark_dep and condit.
We will merge the two data sets into a combined data set having variables id, age, diameter, height, variety, damage, bark_dep and condit. Matching will be by the variable id (termed the ‘BY variable’). However, we must sort both data sets by the BY variable. This is illustrated in the following code.
proc sort data=wood08;      * Sort the first data set by id ;
   by id;
run;

proc sort data=woodrs08;    * Sort the second data set by id ;
   by id;
run;

data woodall08;             * Merge the two data sets using id as a key ;
   merge wood08 woodrs08;
   by id;
run;
The MERGE operation will combine cases having the same values for id in data sets wood08 and woodrs08. If a value of id appears in only one of the two data sets then missing values will be inserted for the variables that come from the other data set.
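To see which cases were matched, the IN= data set option can be added to each data set in the MERGE statement. The sketch below uses two small invented data sets (trees and scores) rather than the wood data; the names intrees, inscores, allrows and matched are also illustrative.

```sas
data trees;
   input id $ age;
   datalines;
T1 40
T2 55
T3 12
;
run;

data scores;
   input id $ damage;
   datalines;
T2 3
T3 1
T4 2
;
run;

proc sort data=trees;  by id; run;
proc sort data=scores; by id; run;

data allrows matched;
   merge trees (in=intrees) scores (in=inscores);
   by id;
   output allrows;                                /* every id from either data set */
   if intrees and inscores then output matched;   /* ids present in both */
run;

proc print data=allrows; run;
proc print data=matched; run;
```

Here allrows keeps T1 and T4 with missing values for the variables they lack, while matched keeps only T2 and T3.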
Exercises
12.1. The ASCII data set mod273ft.txt contains results from a module taught to full-time students with values for student name (alpha-numeric), CW1, CW2, CW3, combined coursework and exam. Data set mod273pt.txt contains the same information for part-time students. The data sets are available on the module web page and should be examined with Notepad or WordPad.
(a) Write a program which reads in the two ASCII data sets and combines them into a single permanent SAS data set with suitable variable names. Confirm that the data set has been constructed correctly.
(b) Modify your program so that the combined data set contains a new variable indicating which group (full-time or part-time) each student belongs to.
12.2. The ASCII data set mod273ptss.txt gives background information on the statistical software that is used at work by the part-time students. Values are given for name, excel, sas, spss (all alpha-numeric, software results are Y or N).
Write a program which reads in the two data sets mod273pt.txt and mod273ptss.txt and merges them into a single permanent SAS data set. Confirm that the data set has been constructed correctly.
12.2. Hints on Using Word with SAS and SAS/INSIGHT
12.2.1. Copying a Selection from the Output Window (WINDOWS)
In SAS Output Window    Highlight text
                        Edit Copy

In Word document        Edit Paste

Note: In the event of difficulty use
   Ctrl/C for Copy
   Ctrl/V for Paste
12.2.2. Saving the Whole Output Window
In SAS Output Window    File Save As
                        Choose directory and file name
                        The automatic file extension is .lst

Word and other text editors have no difficulty opening or inserting a list file.
12.2.3. Choice of Font within Word Document
For tables of figures you are recommended to use a monospace font, i.e. one that has a constant width for all characters. Arial and Times New Roman are not monospace fonts.

Examples are SAS Monospace (you may need SAS running to get this font) and Courier New, each at 10 point or 12 point.
To avoid tables ‘wrapping round’ you could remove unnecessary spaces to the left or reduce the size of the font.
It can be effective to use different fonts for different parts of the documents. For example you might use a standard font like Arial or Times New Roman for text and a monospace font for tables. Monospace fonts might also be used for file names etc.
12.2.4. Copying from a Graphics Window (WINDOWS)
In SAS Output Window       Edit Copy

In Word document           Edit Paste Special
                           Choose Device Independent Bitmap

To reduce the file size    Edit Paste
                           followed by Edit Cut, then
                           Edit Paste Special
                           Choose png or jpeg
When you try to reposition the picture you may find that it jumps around the document in an uncontrollable manner. This can be prevented by adding dummy returns that will lie under the picture, and then choosing the ‘In front of text’ wrapping style from the layout tab of the picture format dialogue box (right click on the picture to select the format option). The dummy returns must be held together using paragraph formatting and the picture anchor locked onto the dummy returns. This is shown in Figure 14.
Figure 14 Picture control using ‘In front of text’ wrapping style.
[Diagram: the Format Picture dialogue and a picture placed over dummy returns. Enter returns, select all, then Format Paragraph, Keep lines together + Keep with next. The picture anchor is locked (Advanced setting).]
12.2.5. Copying and Pasting from SAS/INSIGHT
Tables
Tables can be saved in graphics form or as text (recommended).

In SAS/INSIGHT Analysis Window    Click on ‘arrow’ at top of table
                                  Choose Save

The table (as text) is put into the base SAS Output Window. Copy from Output Window as described above.

See below for saving table in graphics form.

Graphs (WINDOWS)
In SAS/INSIGHT Analysis Window    Click on border of graph (or table)
                                  Edit Copy

In Word document                  Edit Paste Special
                                  Choose Device Independent Bitmap

Note: If you have highlighted points on your graph hold down Ctrl when you click on the border.
12.2.6. Left Alignment of SAS Output
Copying and pasting from the SAS Output Window is easier if the output is already aligned to the left. The following option will ensure that all future output is left aligned:

options nocentre;
run;
13. Solutions to exercises

1.1

data ELEVEN;
   input ID $ height weight;
   ratio=weight/height;
datalines;
59 135 25
82 146 33
27 153 56
52 154 51
55 139 31
13 131 25
01 149 43
15 137 32
71 133 30
78 149 35
12 141 33
37 164 48
28 146 37
48 149 45
69 147 36
16 152 47
;
run;
/* Data came from Exercise 1.1*/
proc means;
run;
1.2
...bmi = 100*weight/height;…
1.3
data FORTH;
   input site $ salinity phos nitrogen chloro faecal_c;
   phos2=1000*phos;
datalines;
CR 30.11 0.068 0.297 1.693 2.917
WG 31.48 0.059 0.165 1.464 3.149
EG 31.79 0.068 0.144 1.100 3.196
SF 31.37 0.185 0.278 1.787 3.418
PB 31.50 0.116 0.223 2.099 3.049
JO 31.60 0.106 0.207 1.067 2.903
SS 30.50 0.047 0.162 1.563 2.895
FN 31.96 0.060 0.130 0.753 2.797
;
run;
proc means;
run;
2.1
libname week2 'c:\sas\sasdata';
data week2.eleven;
   input ID $ height weight;
   ratio=weight/height;
/*Program must be cut and pasted from word document*/
datalines;
59 135 25
82 146 33
27 153 56
52 154 51
55 139 31
13 131 25
01 149 43
15 137 32
71 133 30
78 149 35
12 141 33
37 164 48
28 146 37
48 149 45
69 147 36
16 152 47
;
run;
2.2 (b)
libname week2 'c:\sas\sasdata';
data week2.pulse;
   infile 'c:\sas\sasdata\pulse.dat';
   input pulse1 pulse2 ran $ smokes $ sex $ height weight activity $;
proc print;
run;
2.4 (b)
libname mydir 'c:\temp';
data mydir.ex2_4;
   infile 'c:\temp\ex2_4.prn';
   input size $ colour $ price cost;
run;

proc print data = mydir.ex2_4;
   var colour size price;
run;
2.4 (c)
libname mydir 'c:\temp';
data mydir.ex2_4;
   infile 'c:\temp\ex2_4.prn';
   input size $ 1-8 colour $ 9-19 price 20-24 cost 25-32;
run;

proc print data = mydir.ex2_4;
   var colour size price;
run;
2.4 (d)
libname mydir 'c:\temp';
data mydir.ex2_4;
   infile 'c:\temp\ex2_4.prn';
   input @1 size $8. @9 colour $11. @20 price 5.2 @29 cost 4.2;
run;

proc print data = mydir.ex2_4;
   var colour size price;
run;
2.5
libname mydir 'c:\temp';
data mydir.houses;
   infile 'c:\temp\houses.dat';
   input style 1 sqfeet 3-6 bedroom 8 baths 10-12 price 14-19;
run;

proc print;
run;
2.6
libname mscsas 'c:\kirsty\mscsas';
data cars;
   infile 'd:\kirsty\temp\cars.prn' firstobs = 2;
   input mpg 1-8 cylndrs 9-16 displace 17-24 hrsepwr 25-33 accel 34-41
         year 42-49 weight 50-57 origin 58-65 make $ 66-75 model $ 76-89
         price 90-93;
run;
4.1(a)
options nocentre;
libname unit4 'c:\sas\sasdata';
proc sort data=unit4.pulse out=smokes;
   by smokes ran;
proc means maxdec=1;
   var pulse2;
   by smokes ran;
run;
(b)
options centre;
proc means data=smokes alpha=0.05 maxdec=1 clm;
   var pulse2;
   by smokes ran;
run;
(c) and (d)
proc univariate data=smokes plot;
   var pulse2;
   by smokes ran;
proc freq data=smokes;
   tables smokes*ran;
run;
4.2
proc sort data = sashelp.retail out = sorted;
   by year;
proc means maxdec=2 N mean std;
   var sales;
   by year;
   output out=summary mean=mean_sls;
run;
proc print data = summary;
run;
5.1
proc plot data=unit5.eleven;
   plot weight*height='+';
run;
5.2
proc plot data=unit5.pulse;
   plot pulse2*weight=sex;
run;

proc chart data=unit5.pulse;
   pie activity;
run;

proc chart data=unit5.pulse;
   hbar ran/sumvar=pulse2 type=mean;
run;

proc chart data=unit5.pulse;
   vbar activity/type=pct group=sex;
run;

proc chart data=unit5.pulse;
   block activity/group=smokes sumvar=pulse1 type=mean;
run;
6.1
libname unit6 'c:\sasdata';
run;

proc plot data=unit6.beetles;
   plot area*mlength;
run;

proc reg data=unit6.beetles;
   model area=mlength;
   output out=regout student=stdresid;
   plot student.*mlength;
run;

proc univariate data=regout plot normal;
   var stdresid;
run;
8.1
libname unit8 'c:\sasdata';
run;
data mpulse (drop=sex height weight);
   set unit8.pulse (where=(sex='1'));
   label pulse1='First pulse rate'
         pulse2='Second pulse rate';
run;
proc print data=mpulse label;
run;
8.2
Note that the formats must correspond to the type of the variables in the data set pulse, i.e. numeric variables need numeric formats. Numeric variables do not use quotes around the variable values.

libname library 'c:\sasdata';
run;
proc format library=library;
   value $ran '1'='ran in place'
              '2'='did not run in place';
   value $smokes '1'='smokes regularly'
                 '2'='does not smoke regularly';
   value $sex '1'='male'
              '2'='female';
   value $activity '1'='slight'
                   '2'='moderate'
                   '3'='a lot';
run;

proc freq data=unit8.pulse;
   tables sex*smokes smokes*activity sex*ran/nocol norow nocum nopercent;
   format sex $sex. ran $ran. smokes $smokes. activity $activity.;
run;
8.3
proc format library=library;
   value pulsrate low-76=1
                  77-high=2;
run;

proc freq data=unit8.pulse;
   tables pulse1*pulse2/nocol norow nocum nopercent chisq;
   by ran;
   format pulse1 pulse2 pulsrate.;
run;
9.1 (a)

proc format;
   value sex 1 = 'male'
             2 = 'female';
   value activity 1 = 'slight'
                  2 = 'moderate'
                  3 = 'a lot';
run;

proc tabulate data = mydir.pulse;
   class sex activity;
   var pulse1;
   table activity,sex * pulse1*mean;
   format sex sex. activity activity.;
   title 'Average pulse without exercise';
run;
9.1(b)
proc format;
   value smokes 1 = 'smokes regularly'
                2 = 'does not smoke regularly';
   value ran 1 = 'ran in place'
             2 = 'did not run in place';
run;

data mydir.pulsev2;
   set mydir.pulse;
   pulsedif = pulse2 - pulse1;
run;

proc tabulate data = mydir.pulsev2;
   class smokes ran;
   var pulsedif;
   table smokes, ran * pulsedif*mean;
   format smokes smokes. ran ran.;
   title 'Average difference in pulse by whether they smoke and/or ran';
run;
10.1
data test;
   input alphabet $26.;
   pt1 = index(alphabet,'FGHIJ');
   pt2 = index(alphabet,'JNP');
   pt3 = substr(alphabet,3,3);
   pt4 = verify(alphabet,'ABDE');
datalines;
ABCDEFGHIJKLMNOPQRSTUVWXYZ
;
run;

proc print;
run;
10.2
data ageatoct;
   input dob YYMMDD6.;
   date = '01OCT99'D;
   ageatoct = (date-dob)/365;
datalines;
740625
;
run;
10.3
libname mydisc 'd:\kirsty\temp';
* first set up a variable for age;
data mydisc.napier2;
   set mydisc.napier;
   enddate = '31DEC93'D;   * sets up a variable as a date constant;
   age = (enddate - dob)/365.25;
run;

* create a temporary SAS data set called agesort, sorted by age and examine data set to identify oldest student;

proc sort data = mydisc.napier2 out = agesort;
   by age;
run;

* create a temporary SAS data set called napsort, sorted by course;

proc sort data = mydisc.napier2 out = napsort;
   by crsecd;
run;

* create a temporary SAS data set called means containing the mean, max and
  min ages by course. Examine the data set to find which course has the
  youngest students (on average) Anything strange?! ;

proc means data = napsort;
   var age;
   by crsecd;
   output out = means mean = mnage max = maxage min = minage;
run;

* sort by number of records for each course and examine data set to find the
  course with the largest number of students;

proc sort data = means;
   by _freq_;
run;
11.1
data survey;
   input year @;   * hold the line;
   if year = 1996 then input q1-q6;
   else if year = 1997 then input q1-q7;
datalines;
1996 4 8 3 5 6 5
1996 5 7 4 5 6 8
1997 3 5 4 7 4 5 6
1997 5 3 4 3 6 7 8
;
run;

proc print;
run;
11.2
data test;
   input a b c x1-x3 y1-y3;
   array tt (9) a b c x1-x3 y1-y3;
   do i = 1 to 9;
      if tt(i) = 999 then tt(i) = .;
   end;
   drop i;
datalines;
3 5 2 7 5 999 2 5 999
2 9 4 7 2 4 999 4 999
3 8 3 0 3 2 999 7 1
;
run;

proc print;
run;
11.3
data chisq;
   n = 5;                 * degrees of freedom required (repeat with 10 and 30);

   do i = 1 to 500;       * to ensure numbers are random;
      x = rannor(3059);
   end;

   do j = 1 to 500;       * generate 500 chi-squared values;
      chi = 0;
      do i = 1 to n;      * sum of n squared standard normal values;
         x = rannor(3059);
         chi = chi + x*x;
      end;
      output;
   end;
   keep chi;
run;

proc gchart;
   vbar chi;
run;
12.1.(b)
/* concatmod.sas */
libname xyz 'c:\Documents and Settings\mp12\My Documents\sasfiles\sasdata8';
data modft;
   infile 'c:\Documents and Settings\mp12\My Documents\data\mod273ft.txt';
   input name $ 1-16 cw1 cw2 cw3 com_cw exam;
   group = 'FT';
run;
data modpt;
   infile 'c:\Documents and Settings\mp12\My Documents\data\mod273pt.txt';
   input name $ 1-16 cw1 cw2 cw3 com_cw exam;
   group = 'PT';
run;
data xyz.mod273com;
   set modft modpt;
run;
proc print data=xyz.mod273com;
run;
12.2.
/* mergemod.sas */
libname abc 'c:\Documents and Settings\mp12\My Documents\sasfiles\sasdata8';
data modpt;
   infile 'c:\Documents and Settings\mp12\My Documents\data\mod273pt.txt';
   input name $ 1-16 cw1 cw2 cw3 com_cw exam;
run;
proc sort data=modpt;
   by name;
run;
data modptss;
   infile 'c:\Documents and Settings\mp12\My Documents\data\mod273ptss.txt';
   input name $ 1-16 excel $ sas $ spss $;
run;
proc sort data=modptss;
   by name;
run;
data abc.modptcom;
   merge modpt modptss;
   by name;
run;
proc print data=abc.modptcom;
run;
proc univariate data = pulse;
   var before;
   histogram before / midpoints = 0 to 200 by 10;
   title 'Histogram Pulse Data';
run;