SAS Programming Notes
For Data Mining and Exploration

Lecturer: Amos Storkey
School of Informatics
University of Edinburgh


Acknowledgements: These notes are extensively based on notes developed over a long period by the School of Accounting, Economics & Statistics, Napier University. People who have worked on or contributed to these notes over that time include Amos Storkey, Ana Costa Da Silva, Phil Darby, Helen Storkey, Jeff Dodgson, Dorothy Currie, Kate Houston and Kirsty Davidson. I am very grateful for permission to use and develop these notes for the Data Mining and Exploration course.

First published September 2000

Updated September 2001 (SAS version 8) and February, July, September 2002

October 2004 (SAS version 8.1) and September 2005

December 2006 (SAS version 9.1.3)

January 2008 (SAS 9.2 and linux differences)

File: document.doc

Contents

1. GETTING STARTED
1.1 What is the SAS system?
1.2 The SAS Workspace
1.3 Creating and running a SAS program
1.4 Submitting and correcting your program
1.5 Saving files and clearing text from windows
1.6 Reading a saved program
1.7 A Data Analysis Flow Chart
1.8 Importing data using a wizard
1.9 Viewing a data set
1.10 Creating a SAS Program
1.11 Rules for entering SAS statements
1.12 Adding comments to a program
1.13 Including titles in your SAS output
1.14 Creating new variables
1.15 Printing and saving SAS output

2. DATA FILES AND SAS DATA SETS
2.1 Reading data files using the INFILE statement
2.2 LIBNAME and permanent SAS Data Sets
2.3 Referencing a permanent SAS data set
2.4 Contents of a file
2.5 Importing data from other packages
2.6 Missing values
2.7 The INPUT statement

3. SAS PROCEDURES
3.1 Structure of a SAS program
3.2 Sample program

4. SUMMARISING DATA
4.1 SAS System Options
4.2 HTML output
4.3 Summary Procedures
4.4 PROC SORT
4.5 PROC MEANS
4.6 PROC UNIVARIATE
4.7 PROC FREQ
4.8 General syntax for a procedure
4.9 Help

5. GRAPHS AND CHARTS
5.1 Graphics procedures
5.2 PROC PLOT
5.3 PROC CHART

6. CORRELATION AND REGRESSION
6.1 PROC CORR
6.2 PROC REG

7. EXPLORATORY DATA ANALYSIS
7.1 SAS/INSIGHT
7.2 Accessing SAS/INSIGHT
7.3 Features of SAS/INSIGHT
7.4 Using SAS/INSIGHT
Tools

8. MODIFYING DATA AND OUTPUT
8.1 Introduction
8.2 SET statement
8.3 DROP and KEEP
8.4 Labelling output
8.5 PROC PRINT
8.6 PROC FORMAT
8.7 Recoding data
8.8 Conditional statements
8.9 VALUE statement
8.10 OUTPUT

9. PROC TABULATE

10. FUNCTIONS AND FORMATS
10.1 MEAN function
10.2 NMISS function
10.3 N function
10.4 Functions to handle character variables
10.5 Date and Time Formats

11. ITERATIVE PROCESSING
11.1 Do loops and arrays
11.2 Reading data in repeated patterns
11.3 Arrays
11.4 Generating random numbers
11.5 Random numbers from a uniform distribution
11.6 Random numbers from a normal distribution
11.7 The SAS Program Data Vector
11.8 The RETAIN and Sum statements

12. FURTHER TOPICS
12.1 Combining Data Sets
12.2 Hints on Using Word with SAS and SAS/INSIGHT

Solutions to exercises

Various files are referred to in these notes. These can be found in a zip file on the Data Mining and Exploration web site

www.inf.ed.ac.uk/teaching/courses/dme/


1. Getting started

1.1. Introduction

The SAS system is a widely used resource for statistical analysis and data mining. It is rare to find a job advert for a data mining practitioner that does not ask for SAS skills. The main positive points of SAS are its ability to handle large files fairly transparently, the ease and comprehensiveness with which standard analyses can be done, the way that analyses can be built interactively alongside a systematic programming environment, and its data handling capabilities. Its main negative points are its graphical capabilities, and that adding your own extensions to the techniques using macros and the Interactive Matrix Language is slightly more cumbersome than in other languages (e.g. MATLAB, R) and than with more modern language constructs.

This tutorial will introduce you to the SAS System. This tutorial should be suitable for those working on either a Linux or Windows system. Interface tools in SAS for Windows are much better and so where there are differences these will also be mentioned.

SAS is, at its heart, a piece of software for data handling and storage, statistical and data analysis, data mining, decision support and report writing. It has been extended to a whole business intelligence package, but the best way of understanding SAS is from the inside out, and so this tutorial will teach the base SAS software to get you started. With base SAS software you can store data values and retrieve them, modify data, compute simple statistics, and create reports, all in one SAS session. The difference between SAS and most statistical packages is that SAS incorporates both a database management system and a high-level programming language. There is also SAS software which provides graphics, forecasting, data entry, and statistics. The SAS system also contains other sophisticated applications that are valuable to large enterprises. All are available in one system.

1.2. The SAS Workspace

To start SAS on a Linux system, type SAS at the command prompt. On Windows, select SAS from the Start menu.

When you go into SAS, the first thing you see is a set of windows as shown in Figure 1. Your display may appear a little different since this has been adjusted to allow all the windows to be seen at once. There are five different windows shown in this figure. Two further windows are available in SAS version 9; you can switch between them by clicking on the buttons at the bottom of the SAS window.

Figure 1 SAS window on opening in Windows

Figure 2 SAS window on opening in Linux

The five windows are:

the EDITOR window where you enter the SAS statements you wish to execute. The EDITOR has handy features like colour coding and expandable and collapsible sections.


the LOG window which contains information on your SAS run, e.g. date and time of run, a listing of your SAS statements as they are executed and any errors which have occurred during processing.

the OUTPUT window which displays the actual results of the program.

the EXPLORER window, which allows you to view and manage your SAS files and create shortcuts to non-SAS files. For example you can use this window to create new libraries or to open any SAS file.

the RESULTS window, which helps you navigate and manage output from SAS programs you submit. You can view, save, and print individual items of output. (By default, the Results window is positioned behind the Explorer window, but when you submit a SAS program that creates output it moves to the front of your display.)

The two windows not shown are:

the GRAPH window, which appears when graphical output is to be displayed.

a seventh window, which appears when HTML output is used. The Output Delivery System (ODS) can be turned on using programming code or by using the menu options.
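As a rough sketch of the programming route, the ODS HTML destination can be opened before a procedure and closed afterwards. The data set name demo and the file name demo.html are placeholders chosen for this illustration, not names used elsewhere in these notes.

```sas
/* A tiny data set so that there is something to print. */
data demo;
    input x y;
    datalines;
1 2
3 4
;
run;

/* Open the HTML destination; demo.html is a placeholder file name. */
ods html file='demo.html';

proc print data=demo;
run;

/* Close the destination so that the HTML file is completed. */
ods html close;
```

Output produced between the two ODS statements is written to the HTML file; output produced after ODS HTML CLOSE is not.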

You may turn a window on or off by using View from the main menu. Just choose the window you need (use this if you ‘lose’ a window).

Task 1
Resize the 3 windows on the right hand side so that you see the OUTPUT as well as the EDITOR and the LOG. Make the EDITOR the largest window.

You can activate any of the windows by

clicking on the window (Windows or Linux)

selecting Window from the menu, then the window you want (Windows)

selecting View from the menu (Windows or Linux)

1.3. Creating and running a SAS program

The following lines of code are a simple SAS program.


When they are typed into the editor window the words will become colour coded.

Reserved words appear blue (e.g. proc, print, input)

Comments appear green in Windows and black in Linux (see below for details of entering comments).

Errors appear red.

data class1;
  input height weight sex $;
  datalines;
152 45.4 F
178 73.0 M
178 68.8 M
175 59.7 M
157 44.5 F
165 61.7 M
175 74.1 M
160 49.5 F
run;

proc print;
run;

Task 2
Enter the SAS program in the EDITOR window.

1.4. Submitting and correcting your program

There are several methods of submitting your program.

1. Highlight the section of code you wish to run and press the running man icon (in Windows).

2. Ensure that your cursor is in the EDITOR window, then select Run -> Submit (in Windows or Linux).

3. You can also run just a few lines of code by selecting Run -> Submit top line or Submit N lines (in Windows or Linux).

4. Right click with the mouse and select Submit All, or press the running man icon.

An alternative to pressing the running man icon is to press the F3 key in Windows or the End key in Linux.

Examine your LOG window to check that there were no error messages: if all is well examine your output in the OUTPUT window.

If you have error messages in your LOG window you will need to correct the mistakes and resubmit the program.

After submitting your code you may find that it has disappeared from the EDITOR window. To overcome this problem select Run -> Recall Last Submit.

1.5. Saving files and clearing text from windows

When you have succeeded in getting your program to run you can save it as filename.sas (SAS automatically gives it a .sas ending to remind you that it is a SAS program). Make sure your EDITOR window is active before doing File -> Save (or pressing the floppy disk icon). Otherwise you might be saving the contents of your log or output window instead of your program.

Save log files as filename.log and output files as filename.lst if you want to save them too. It is usually not necessary to save the log file.

Important: In order to avoid getting confused about which output and log refer to which program or version of a program, make it a habit to clear your windows before submitting a new program. Do this by selecting Edit -> Clear All.

Run -> Recall Last Submit returns the program you have just run to the EDITOR window. This is useful if you have cleared the program by mistake.

Task 3
Create a new folder in your personal disk space called MA71064 Statistical computing.

Submit the SAS program from the program editor. When it is working satisfactorily save the file as class1.sas in the folder you have just created.

1.6. Reading a saved program

A SAS program needs to be in an EDITOR window before it can run. To open a saved SAS program activate the EDITOR window and use File -> Open. The program can then be submitted in the usual way.

You can have more than one EDITOR window open at the same time. However this can be confusing and it is easiest at first to have only one program open at a time.

1.7. A Data Analysis Flow Chart

Data analysis can be thought of in terms of a process flow. Actions proceed in a sequence. Often the output from one action leads to the input of another. A simple flow chart is given below.

Start -> Read or create a data set (DATA step) -> Display the data (PROC PRINT) -> Calculate the averages (PROC MEANS) -> End

Figure 3 A simple data analysis flow diagram

SAS programs can contain combinations of DATA steps and PROCEDURES. The SAS program you used above executed the first two blocks in the flow diagram. Quite quickly you will be producing more complicated programs that will have many DATA steps and PROCS.

1.8. Importing data using a wizard

The next example reads the Excel file Class0 into the temporary SAS library called Work. The format of the data is displayed and the summary statistics (count, average and standard deviation) of the height readings are calculated.

The simplest method of entering data into SAS is using the import wizard.

File -> Import Data will display the dialogue shown in Figure 4.

A. The source type default is Excel but others are available from the pull-down menu. Press Next.

B. Locate the source file by pressing the Browse button. Press OK.

C. Select the appropriate worksheet. From the options ensure that ‘Use data in the first row as SAS names’ is ticked. Press OK, then Next.

D. Enter the Member as Class0. Press Finish.

Check the log window for errors.

Figure 4 Import data wizard

In Linux, there is naturally no option of importing from Excel. However, there is the option of importing CSV files. An Excel file can be opened using OpenOffice and saved as a comma-separated values (CSV) file. It can then be imported straight into SAS.

In Linux, steps B and C above are replaced by the dialogue in Figure 5. Similar options are available when pressing the respective button. The remainder of the dialogue is similar to that in Figure 4.

Figure 5 Import data wizard in Linux

The final step of the wizard, in both Linux and Windows, is optional and offers the possibility of saving the importation command in a specified file, which can be opened with the Program Editor. This can be copy-pasted into any program and be run, without need to follow the steps of the wizard again.
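The saved command is typically a PROC IMPORT step. The sketch below shows roughly what such a step might look like for a CSV file, followed by the display and summary steps described at the start of this section. The file name class0.csv and the column name height are assumptions made for illustration; the code the wizard actually writes may differ in detail.

```sas
/* Import a comma-separated file into the temporary Work library.  */
/* class0.csv is a placeholder; give the full path to your file.   */
proc import datafile='class0.csv'
            out=work.class0
            dbms=csv
            replace;
    getnames=yes;   /* use the first row as variable names */
run;

/* Display the imported data. */
proc print data=work.class0;
run;

/* Count, average and standard deviation of the height readings.   */
/* Assumes the file contains a numeric column called height.       */
proc means data=work.class0 n mean std;
    var height;
run;
```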

1.9. Viewing a data set

Once the data is in SAS format you can look at it in a variety of ways.

1 Submit the statements proc print; run; to list the data in the OUTPUT window.

2 From the EXPLORER window, double click on the Libraries icon to reveal the libraries that are present. These libraries are simply pointers to Windows XP folders where the data sets are stored. Double clicking on the Work library reveals the data set Class0.

3 Double click on the data set to open it.

4 Right click on the data set to display a set of options. These include:

Open

View the Columns

View in Excel (only in Windows)

Task 4
Import the Excel file class0.xls into SAS using the import wizard, then display the imported data set using Excel.

1.10. Creating a SAS Program

You have already submitted a simple SAS program which created and then printed out a set of data. The following is an extension of that program. The line numbers have been included to help explain the structure of the program: they are not part of the program itself and should not be typed.

Line number  Program
001  data class2;
002  input height weight sex $ bends pulse1 pulse2;
003  datalines;
004  152 45.4 F 6 61 84
005  178 7.0 M 8 59 102
006  178 68.8 M 12 58 95
007  175 59.7 M 5 76 83
008  157 44.5 F 5 53 102
009  165 61.7 M 10 70 110
010  175 74.1 M 5 76 102
011  160 49.5 F 2 67 118
012  161 52.6 M 5 80 103
013  180 85.4 M 7 84 102
014  160 57.2 F 7 98 115
015  170 69.9 M 7 69 102
016  178 67.0 M 11 60 79
017  163 57.0 F 8 70 98
018  160 60.9 F 12 57 84
019  185 73.1 M 5 68 .
020  188 79.1 M 3 53 69
021  159 49.5 F 6 69 112
022  run;
023  proc print;
024  run;
025  proc means;
026  run;

Line number  Explanation
001  The DATA statement tells SAS to create a data set called class2.
002  The INPUT statement names the variables in the order they appear in the data lines. Variable names must start with a letter, be no more than 32 characters in length (eight characters in version 6) and must not contain blanks, commas and so on. To read data as characters, rather than numbers, a dollar sign is put after the variable name.
003  The DATALINES statement indicates that the next lines are data.
004 to 021  The data are entered with one or more spaces separating each item. The data must be in the same order as declared in the INPUT statement. A new line is used for each record.
019  A full stop indicates a missing numerical value.
022  RUN tells SAS to execute the preceding statements.
023  PROC PRINT is a procedure to print data in the Output window.
025 & 026  PROC MEANS is a procedure to calculate the mean and other statistics of all the numeric variables; RUN completes the procedure.

This example illustrates the basic structure of a SAS program:

A DATA step consisting of a DATA statement and other statements that form part of this step.

SAS PROCEDURES, each beginning with a PROC statement. The PROC statement may also be followed by statements that are part of the procedure step, although there are none in these two examples.

1.11. Rules for entering SAS statements

SAS statements:

usually begin with an identifying keyword

always end with a semicolon

(check carefully before you submit any program!)

can be in uppercase or lowercase letters

SAS statements are free format.

they can begin and end in any column

one statement can continue over several lines

several statements can be on one line


Readability is improved if you add comments and leave spaces between the DATA and PROC steps and perhaps also indent code within a DATA or PROC step. Develop your own style and stick with it.
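To illustrate the free-format rules, the sketch below writes one small program twice: first crammed together, then laid out with spacing and indentation. Both versions should behave the same way. The data set name a and its values are made up for this illustration.

```sas
/* Legal, but hard to read: several statements on one line. */
data a; input x y; datalines;
1 2
3 4
;
run; proc print data=a; run;

/* The same program with one statement per line and indentation. */
data a;
    input x y;
    datalines;
1 2
3 4
;
run;

proc print data=a;
run;
```

Note that the data lines themselves are not free format in the same sense: each record still sits on its own line.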

1.12. Adding comments to a program

There are two ways of writing comments in a SAS program:

begin the comment line with an asterisk and end with a semi-colon

e.g. *This program was developed by J Smith;

begin with a forward slash asterisk and end with an asterisk forward slash

e.g. /* J Smith February 2005 */

Inserting comments is essential if you are doing any serious programming.

The /* style */ is also useful for ‘commenting out’ blocks of a program when testing or debugging.
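A minimal sketch showing both comment styles in one program, including a step that has been commented out. The data set demo and its values are invented for the example.

```sas
* Statement-style comment, ended by a semicolon;

/* Block-style comment: this form can span
   several lines. */
data demo;
    input x y;
    datalines;
1 2
3 4
;
run;

proc print data=demo;
run;

/* The step below is commented out while testing, so SAS skips it.
proc means data=demo;
run;
*/
```

To bring the PROC MEANS step back, delete the opening and closing comment markers around it.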

Task 5
Read the file class1.sas into the program editor. Edit the program so that it is the same as the sample program in section 1.10 but with the addition of a comment which gives your name and today’s date. Submit the program and when it is working properly save it as class2.sas.

What information did the Proc Means procedure give you?

1.13. Including titles in your SAS output

The TITLE statement is used to provide titles on your output. The TITLE statement can appear anywhere in a program (an example of a global statement) and subsequently each page of output (and each graph) will have the title until it is reset. For example program class2.sas could be enhanced as follows:

. . .
proc print;
  title 'Information on Students in Class';
run;
proc means;
  title 'Summary Statistics of Students in Class';
run;

However the final title will adorn all future output until it is reset with another title or ‘cancelled’ with

title;
run;

Task 6
Experiment with the TITLE statement in program class2.sas.

1.14. Creating new variables

If you need to analyse variables that are derived from the input variables, then you must create these variables in the DATA step. For example, if you want to use two new variables ‘the difference in pulse rates’ and ‘the log of the number of bends’ then these variables must be defined before the lines of data are read in. The rules about naming new variables are the same as for input variables.

data class2;
  input height weight sex $ bends pulse1 pulse2;
  diff=pulse2-pulse1;
  lnbends=log(bends);
datalines;
...
run;

Some commonly used operators and functions are as follows:

Operator  Meaning           Function  Meaning
*         multiplication    log( )    natural log
/         division          exp( )    exponential
**        exponentiation    sqrt( )   square root
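The operators and functions in the table can be tried out in a short DATA step. This is only a sketch: the data set name calc and the new variable names are invented here, the two data lines are taken from the class2 data with the sex column left out, and height is assumed to be recorded in centimetres.

```sas
data calc;
    input height weight bends pulse1 pulse2;
    height_m = height / 100;      /* division: centimetres to metres */
    twice    = 2 * bends;         /* multiplication                  */
    pulse_sq = pulse1 ** 2;       /* exponentiation                  */
    lnbends  = log(bends);        /* natural log                     */
    back     = exp(lnbends);      /* exponential, undoes the log     */
    rootwt   = sqrt(weight);      /* square root                     */
    datalines;
152 45.4 6 61 84
178 68.8 12 58 95
;
run;

proc print data=calc;
run;
```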

1.15. Printing and saving SAS output

The contents of the OUTPUT window may be sent to a printer using the OUTPUT window print command. You can change the way the output looks using, for example, the LINESIZE and PAGESIZE options (see Section 4.1 or SAS Help). The whole of the OUTPUT window listing may be saved as filename.lst using the OUTPUT window save command.
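As a small illustration of the LINESIZE and PAGESIZE options, an OPTIONS statement can be placed before the procedure whose listing you want to change. The values 80 and 60 and the data set demo are arbitrary choices for this sketch, not recommendations.

```sas
/* A tiny data set so that there is something to print. */
data demo;
    input x y;
    datalines;
1 2
3 4
;
run;

/* At most 80 characters per line and 60 lines per page of listing. */
options linesize=80 pagesize=60;

proc print data=demo;
run;
```

OPTIONS is a global statement, so the settings stay in force for later steps until they are changed again.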

In Windows, it is often convenient to copy all or part of the OUTPUT window into a Word document or other word-processing software. This can be achieved with copy and paste, although the results might be disappointing. The appearance may be improved by

using a fixed space font such as SAS Monospace (available if SAS is running)

avoiding ‘wrap around’ by reducing the font size and avoiding unnecessary leading spaces.

See Section 12.2 for further advice on incorporating SAS numeric and graphical output into a Word document.

Task 7 (WINDOWS ONLY)
Create a Word document (using copy and paste) which consists of program class2.sas and the output it produces. Experiment with improving the layout of the document.

Exercises

1.1 The following table shows the heights and weights of 16 eleven-year-old girls.

ID no  Height(cm)  Weight(kg)    ID no  Height(cm)  Weight(kg)
59     135         25            71     133         30
82     146         33            78     149         35
27     153         56            12     141         33
52     154         51            37     164         48
55     139         31            28     146         37
13     131         25            48     149         45
01     149         43            69     147         36
15     137         32            16     152         47

(a) Write the SAS statements to create a SAS data set called ELEVEN. The ID number should be stored as a character variable.

(b) Insert a comment statement to indicate that this data came from Exercise 1.1.

(c) Create a new variable which is the ratio of weight to height.

(d) Produce a printout of the data and a table which shows the mean, standard deviation, the maximum and the minimum values for the variables height, weight and the ratio of weight to height. The output should be suitably labelled.

(e) (WINDOWS ONLY) Copy and paste your program and its output into a Word document (edit to ensure an attractive appearance).

(f) How would the output have differed if you had input the ID number as a numeric variable?

1.2 Modify your program in 1.1 above in order to determine the body mass index (BMI = weight in kilograms / (height in metres)² or BMI = W/(H * H)).

1.3 Several measurements of water quality were taken at eight different sites along the Firth of Forth. The data are shown below.

Site  Salinity  Phosphate  Nitrogen  Chlorophyll  Faecal Coliforms
CR    30.11     0.068      0.297     1.693        2.917
WG    31.48     0.059      0.165     1.464        3.149
EG    31.79     0.068      0.144     1.100        3.196
SF    31.37     0.185      0.278     1.787        3.418
PB    31.50     0.116      0.223     2.099        3.049
JO    31.60     0.106      0.207     1.067        2.903
SS    30.50     0.047      0.162     1.563        2.895
FN    31.96     0.060      0.130     0.753        2.797

(a) Write the statements to create a SAS data set called FORTH.

(b) The units of phosphate are mg/litre. Create a new variable which gives phosphate in units of µg/litre, where 1 mg = 1000 µg (1 milligram = 1,000 micrograms).

(c) Produce a table which shows summary values for each variable.


(d) Save the program in a file called forth.sas.


2. Data files and SAS data sets

2.1. Reading data files using the INFILE statement

In the examples in the previous section you created temporary SAS data sets from data, which were included in the program, with a DATALINES statement. In practice, a large set of data is more likely to be available as a raw data file (known as an ASCII or text file) and it will be more convenient to read the external data directly into SAS.

To illustrate this we will create a small ASCII data set using the Notepad editor, read the data file into SAS and then print the contents.

Task 1
Open Notepad / text editor and type in the following data set. Save it as blood.txt on your floppy disk. (Note that Notepad automatically gives the extension .txt.)

1 107 100
2 110 114
3 123 105
4 129 112
5 112 115
6 111 116
7 107 106
8 112 102
9 136 125
10 102 104

The variables in the data set are patient number and blood pressure measurements before and after treatment.

The code required to input the data into SAS and get a printout in the Output Window is as follows.

data blood;
  infile 'a:\blood.txt';
  input patient $ before after;
run;

proc print;
run;

The only changes that are required to the previous method of data input are that:

the INPUT statement is preceded by an INFILE statement to tell SAS where to find the external data file.

the DATALINES statement and the lines of data are omitted.
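On Linux there is no drive letter, so the INFILE statement needs a path instead. The sketch below assumes the file from Task 1 was saved at the hypothetical location /home/yourname/blood.txt. It also uses a FILENAME statement to give the file a short nickname (a fileref, here called bp, a name chosen for this example), which is an alternative to quoting the path in the INFILE statement itself.

```sas
/* Hypothetical Linux location of the file typed in Task 1. */
filename bp '/home/yourname/blood.txt';

data blood;
    infile bp;                      /* read from the fileref defined above */
    input patient $ before after;
run;

proc print data=blood;
run;
```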

Task 2
Type the above SAS program into the EDITOR window and save the program under a suitable name. Submit the program and confirm that the values of the variables together with variable names have been printed in the OUTPUT window.

You can verify that the data is stored in the correct location on your hard drive. An example is given in Figure 6.

Figure 6 The data set Blood in the SAS Work library and temporary directory

2.2. LIBNAME and permanent SAS Data Sets

In the programs you have written so far the data set used in any analysis has been created in the data step. Such a data set is described as temporary in the sense that it only exists during your current SAS session and will be deleted when you close SAS.

This kind of temporary file is stored in a SAS library called WORK. You can check what files you have created in the current session by going to the EXPLORER window and clicking on Libraries and then on the library WORK.

(Use View -> Up One Level or View -> Show Tree to navigate back to the original EXPLORER window.)

SAS files are given a two part name. The first part of the name is the library name in which the file is stored and the second part is the name of the particular file. You probably noticed that when you created the previous data sets, for example class2, that SAS referred to this file in the log window as WORK.CLASS2.


If you wanted to do further analyses on this type of data set in a different session you would need to recreate the data set by running the data step once again. This can be a time-consuming process especially if you have a large amount of data and have created many new variables, changed the format of variables and so on. The alternative approach is to create a permanent SAS data set. This is a special type of file, unique to SAS, which stores the data, variable names and other information such as formats.

You can set up a library to store your data sets and save them so that they can be used in another SAS session. The SAS LIBNAME statement defines the name of the library where the file is to be stored. For example, if you want to store your data on your own disk in drive A, then you need to give this a SAS libname using a statement like the first one below. The actual name of the library, in this case myadisk, is chosen by the programmer and is just a convenient name that can be referred to later in the program.

LIBNAME myadisk 'a:\';
LIBNAME mydisk 'h:\MA71064 Statistical computing\';
LIBNAME myhomedisk 'c:\My documents\Napier\MA71064 Stat Comp\';

You can then save your SAS data sets in this library using a two level SAS name. The first part of the name is the libname and the second part is the name given to the SAS data set. So to create the permanent SAS data set called class2, on the H: drive, would require the following SAS code.

libname mydisk 'h:\MA71064 Statistical computing\';
* The data library called mydisk will be located on the H: drive;

data mydisk.class2;   *Create the new data set class2;
  input height weight sex $ bends pulse1 pulse2;
datalines;
152 45.4 F 6 61 84
178 53.0 M 8 59 102
165 61.7 M 10 70 110
...................
175 74.1 M 5 76 102
160 49.5 F 2 67 118
run;

Task 3

Modify your program, class2.sas, to create a permanent data set. (If you want the data set stored on a hard drive make sure you give the full path name of the required directory.) Check the messages in the LOG window and check you can see the permanent data set in the EXPLORER window.


SAS version 8 puts an automatic SAS7BDAT extension on permanent data sets (version 6 uses SD2).

Task 4

Go to Windows Explorer and check that you have a file class2.SAS7BDAT in the appropriate directory.

2.3. Referencing a permanent SAS data set

Suppose that you have a permanent SAS data set stored in a particular directory. You may have created this yourself or possibly have downloaded it from the web. You may carry out procedures on the data set directly by using the DATA option in the procedure statement. All the details of variable names and so on will be held in the data set.

In the following example the permanent SAS data set prac1 is stored in the directory h:\sas\sasdata.

libname xyz 'h:\sas\sasdata';

proc print data = xyz.prac1;

proc means data = xyz.prac1;
run;

Note that the first part of the name given by LIBNAME is a pointer to a directory and does not have to be the same name as was used when the data set was created. It is the second part of the name that refers to the particular data set.

2.4. Contents of a file

You can use the SAS procedure CONTENTS to get information about a data set and a list of the variables it contains. This procedure is useful for larger data sets that would be too long or have too many variables to list completely, and it gives you information about when and where the data set was last modified. For example,

proc contents data = xyz.prac1;
run;

Note: if you have already submitted a LIBNAME statement in the current SAS session, it is not necessary to do so again. You can simply refer to the two-level data set name.


Task 5

Get information about the SAS data set COMPANY stored in SASHELP. What information is given in each of the columns?

An alternative way of inspecting what variables are in a large file is to print out only the first few observations. This can be done using the OBS= data set option in the PROC PRINT statement.

proc print data=sashelp.company (obs=6);
run;

Remember you can also view data sets from the EXPLORER window.

2.5. Importing data from other packages

Software such as Excel, Minitab and SPSS stores data in file types unique to each package. Some packages are able to export to, or import from, other formats.

Use the import wizard or PROC IMPORT (see SAS Help).
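As a rough sketch of the PROC IMPORT route, the step below reads a comma-separated file into a temporary data set. The file path and the data set name cars are assumptions made purely for illustration, not files supplied with these notes; check SAS Help for the DBMS= values available in your installation.

```sas
* Hypothetical path and data set name, for illustration only;
proc import datafile='h:\sas\sasdata\cars.csv'
            out=work.cars
            dbms=csv
            replace;
   getnames=yes;     * take variable names from the first row;
run;

proc contents data=work.cars;
run;
```

Running PROC CONTENTS afterwards is a quick way to check which variable names and types the import produced.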

A safe approach to importing data from such application software into SAS is to export from the other package into ASCII format and input the resulting file into SAS in a DATA step.

Large data sets from outside sources (other companies or organisations) are usually supplied in ASCII format since such data is often held in proprietary databases. Most software allows data to be written to an ASCII or raw data file. This approach is illustrated below:

Application data file --(export)--> ASCII data file --(INFILE statement)--> SAS data set

It is a good idea to check the ASCII file with an editor such as WordPad (and possibly ‘tidy up’ if necessary). The ASCII file can then be read into SAS using the INFILE statement assigning variable names with the INPUT statement (as explained in Section 2.1).
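A minimal sketch of this last step is shown here. The file name, the presence of one heading line and the order of the variables are all assumptions for illustration; adjust them to match the ASCII file you actually exported.

```sas
* Hypothetical file and variable order, for illustration only;
data work.pulse;
   infile 'h:\sas\sasdata\pulse.dat' firstobs=2;   * skip one heading line;
   input pulse1 pulse2 ran smokes sex height weight activity;
run;

proc print data=work.pulse (obs=5);
run;
```

If the file has no heading line, drop FIRSTOBS=2 so that the first record is not skipped.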

Advice on importing data into SAS from popular applications software is summarised below.

Excel (WINDOWS)


For a spreadsheet containing data only (values of variables in columns):

Right align columns (necessary for character data)

File → Save As → Formatted Text (Space delimited), to give the ASCII file filename.prn.

Edit filename.prn with WordPad if column headings need deleting, missing values need replacing with ‘.’ etc.

Minitab

File → Other Files → Export Special Text

Specify columns (accept Period Decimal Separator).

Results in ASCII file filename.dat.

Note that Minitab’s missing value symbol is ‘*’. SAS will find this invalid and replace it with ‘.’.

SPSS

File → Other Files → Fixed ASCII

Results in filename.dat.

Note that SPSS’s missing value symbol is ‘.’. However, this will be blank in filename.dat and will cause SAS to misread the data set when using simple list input.

Alternative: SPSS allows data to be saved directly as a permanent SAS data set:

File → Save As → SAS v7 Windows long extension

In recent versions of SAS (e.g. 9.2), SPSS files can be imported directly.

2.6. Missing values

Uncoded missing values present special problems for using list input. To provide some protection for the integrity of your output data set when input data contain uncoded missing input values, use the MISSOVER or STOPOVER options in the INFILE statement. Use the MISSOVER option to set all remaining variables in the INPUT statement to missing. Use the STOPOVER option to prevent an observation from being written to the data set when the input line does not contain a value for each variable in the INPUT statement and to stop the DATA step from further processing.

For example, the program

data test1;
   input id $ var1 var2 var3 var4 var5;
   datalines;
1001 115 45 65 83 78
1002 86 27 55 86
1004 93 52 63 76 88
1015 73 35 43 112 108
;
run;

would result in the following inaccurate data set

obs   id     var1   var2   var3   var4   var5
1     1001   115    45     65     83     78
2     1002   86     27     55     86     1004
3     1015   73     35     43     112    108

If we use the MISSOVER option, i.e.

data test1;
   infile cards missover;
   input id $ var1 var2 var3 var4 var5;
   cards;
1001 115 45 65 83 78
1002 86 27 55 86
1004 93 52 63 76 88
1015 73 35 43 112 108
;
run;

we will get the following data set

obs   id     var1   var2   var3   var4   var5
1     1001   115    45     65     83     78
2     1002   86     27     55     86     .
3     1004   93     52     63     76     88
4     1015   73     35     43     112    108

Using the MISSOVER option prevents the uncoded missing value in the second data line from causing the third record to be read incorrectly as well. The second observation is still incorrect, because SAS cannot tell which of the five values was missing, but the errors have been restricted to one observation. The STOPOVER option would instead stop the DATA step at the second data line, so that neither observation 2 nor any later observation is written to the data set. In order to read the data in properly, either column input or formatted input would have to be used. (See next section.)
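As a sketch of how column input avoids the problem, the step below reads the same four records with each variable in fixed column positions. Two things are assumed here and are not given in the notes: that the file really is aligned in these columns, and that the missing value in record 1002 is var3.

```sas
* Sketch only: assumes fixed columns and that var3 is the missing value;
data test2;
   input id $ 1-4 var1 6-8 var2 10-11 var3 13-14 var4 16-18 var5 20-22;
   datalines;
1001 115 45 65  83  78
1002  86 27     55  86
1004  93 52 63  76  88
1015  73 35 43 112 108
;
run;

proc print data=test2;
run;
```

Because the blank field occupies its own columns, SAS sets var3 to missing for that record and the remaining values stay in the correct variables.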

2.7. The INPUT statement

The INPUT statement names the variables being read in via a DATALINES or INFILE statement and tells SAS where on the DATALINES, or on the lines of INFILE, the values of the variables can be found. There are three main types of INPUT that you can use to describe a record’s values: LIST, COLUMN and FORMATTED. The choice of which type of input you use will depend on the type and arrangement of the incoming data. The $ symbol is placed after a variable name to indicate a character variable.

In the previous examples you have used only the simplest type of INPUT, LIST INPUT. List input is seldom useful for large commercial or scientific work because it is too easy to get missing values or errors in big files. It is more common for real data to come in fixed-column format, where the fields on each line are aligned in columns one under another.

2.7.1. LIST INPUT

- the values are separated by spaces
- missing values must be represented by full stops
- by default, character values cannot be longer than 8 characters
- character values cannot contain embedded blanks
- fields must be read in order

e.g.
data one;
   input height weight name $ age;
   datalines;
65 150 Chris 50
60 125 Kelly 35
68 180 Leslie 29
;
run;
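Two of the rules above can be illustrated with a small sketch: a LENGTH statement placed before INPUT lifts the default 8-character limit, and a full stop marks a missing numeric value. The data set name one_b and the records are made up for illustration.

```sas
* Illustrative data; the names and values are made up;
data one_b;
   length name $ 12;                 * allow names up to 12 characters;
   input height weight name $ age;
   datalines;
65 150 Christopher 50
60 . Kelly 35
68 180 Leslie .
;
run;

proc print data=one_b;
run;
```

Note that declaring name in the LENGTH statement makes it the first variable in the data set, so the printed column order differs from the order in the INPUT statement.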

2.7.2. COLUMN INPUT

- data must be aligned within the column positions specified
- character values can contain embedded blanks
- input values can be read in any order
- character values can be of length 1 to 32,767 characters (200 in SAS version 6)
- leading and trailing blanks within a field are ignored

e.g.
data two;
   input name $ 1-7 age 9-10 birthdate $ 11-22 sport $ 23-30;
   datalines;
Ronald  40Dec 3 1954  golf
Michael 37Jul 4 1957  fishing
Laurel  33Jun 23 1961 softball
;
run;


2.7.3. FORMATTED INPUT

- character values can be of length 1 to 32,767 characters (200 in SAS version 6)
- a full stop is not needed for numeric missing values
- nonstandard data, such as dates or numbers with commas, can be read in
- with the use of pointer controls, values can be read in any order

This method of input uses pointer controls and informats for reading in nonstandard data from external data files. An informat is used for reading in data containing dates, numbers with commas, etc.

The informat w.d after a variable specifies the width w and the number of decimal places d to be used in reading in a number. For example, for the number 2346:

- the informat 4.2 would result in the number 23.46 being read in
- the informat 4. with no ‘d’ specified would result in the number 2346 being read in

The informat $w. after a variable specifies the length of a character variable.

Dates such as 21/10/89 can be read using the informat DDMMYY8. (Note the full stop at the end of the informat.)

Pointers indicate the position of a variable, e.g.

@n   go to column n
+n   move the pointer on n positions

e.g.
/* A line of place counters is often useful to help alignment
0000000001111111111222222222233333333334444444444555555
1234567890123456789012345678901234567890123456789012345
*/

data three;
   input @1 name $7. @10 age 2.0 @14 birthdate $11. @28 sport $8.
       / @9 gradyr 4.0 @16 numchild 1.0 @20 occupation $20.;
   datalines;
Ronald   40  Dec 3 1954    golf
        1973   2   masonry contractor
Michael  37  Jul 4 1957    fishing
        1975   2   bricklayer
Laurel   33  Jun 23 1961   softball
        1979   0   attorney
;
run;


/ tells the pointer to go to the next line. Once you go to the next line, you cannot move back to the previous line.
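The date informat mentioned above can be tried with a short sketch like the following. The data set name and records are made up for illustration, and the DATE9. format is added only so that the stored date values print readably; how two-digit years are interpreted depends on the YEARCUTOFF system option in your session.

```sas
* Illustrative records; names and dates are made up;
data dates;
   input @1 name $8. @10 dob ddmmyy8.;
   format dob date9.;        * display the stored date value readably;
   datalines;
Ann      21/10/89
Bob      03/02/91
;
run;

proc print data=dates;
run;
```

Without the FORMAT statement the dates would print as plain numbers, since SAS stores a date as a count of days.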


Exercises

2.1 Create a permanent data set of the data given in Exercise 1.1. How are you going to retrieve this data without having to retype it? You should be able to modify the program you have saved.

2.2 (a) Download the Minitab version of the pulse data file from the web or WebCT (pulse.mtw, not pulse.prn). Open Minitab and load pulse.mtw using File → Open Worksheet (not Open Project). Use File → Other Files → Export Special Text (not File → Save Current Worksheet As) to export the PULSE file as an ASCII file. You have to highlight the variables to export, then press Select. Press OK. Enter a suitable file name. Change the file type to ANSI Text Files (*.TXT). Finally press Save.

(b) Create a permanent SAS data set of the data.

2.3 To illustrate the dangers of list format input, take the data file blood.dat and edit it with a text editor (notepad or Word). Make one or two mistakes in it by removing some of the entries in one or more lines. Now save it as a text file and use it to input and print a SAS data set. Examine your log file and output, to see what has gone wrong, and how you are warned.

2.4 (a) (WINDOWS ONLY) Create an Excel file containing the following data where column 1 is size, column 2 is colour, column 3 is price and column 4 is transport cost. Save it as a formatted text space delimited file.

Large    Red     18.97  0.25
Medium   Blue    24.68  1.10
X-Large  Black   29.99  1.75
Small    Orange  15.89  0.90

(b) Write and submit a SAS program to read in the data using list input and print the variables colour, size and price in that order.

(c) Redo (b) using column input

(d) Redo (b) using formatted input


2.5 Copy the text file houses.dat from the web. The file contains the following five variables for each of the 120 houses in a survey of house prices. Examine it with an editor (Wordpad or Word).

VARIABLE   CONTENTS              COLUMN LOCATION
style      Type of house         1
sqfeet     Floor area            3-6
bedroom    Number of bedrooms    8
baths      Number of bathrooms   10-12
Price      Price of house        14-19

Use column input to create a permanent SAS data set for the housing data and print the contents.

2.6 (a) Download the cars Excel file from the web. To create a file of raw data for reading into a SAS data set:

Open the file up in Excel. Right align the columns. Delete the coding information about the origin variable (in column L). Save the data as a formatted text space delimited file (.prn extension), or as a csv file in Linux.

(b) Use this file to create a SAS data set (use column input). To identify the column location for each variable, open the .prn file up in Notepad. Move the cursor along the row of data, taking a note of the column locations.


3. SAS procedures

3.1. Structure of a SAS program

Once you have got the data organised, a simple SAS program consists of a series of procedures. You have already used three of these procedures: PROC PRINT, PROC MEANS and PROC CONTENTS. Apart from specifying which data set to use, you had no control over the type of output that SAS produced. This may have given the impression that SAS is rather inflexible. However, this is far from the truth. Most procedures have several options which can be invoked and in addition there are statements which can be incorporated into a program (which themselves have options). The procedures and subsequent statements determine the nature of the output produced. Most SAS procedures use the following syntax:

PROC PROCNAME options;
   STATEMENTS / statement options;
RUN;

A program will typically consist of several such blocks of code.

3.2. Sample program

libname unit3 'c:\sas\sasdata';

proc sort data=unit3.pulse out=sorted;
   by activity;
run;

proc print data=sorted noobs N;      *NOOBS removes observation numbers;
   format height 6.0;
   title 'Pulse data from Minitab sorted by activity';
   var pulse1 pulse2 weight height;
   by activity;
run;

proc freq;
   tables ran smokes activity;
   tables sex/nocum nopercent;
   tables sex*smokes;
run;

proc means maxdec=2 mean std;
   title 'Pulse rates before and after exercise';
   var pulse1 pulse2;
run;

Task 1


Print the pulse data that you saved as a permanent SAS data set in Exercise 2.2. Now run the first two parts of the sample program (PROC SORT and PROC PRINT) and compare the output. Remember to specify an appropriate library.

One way of printing separate tables for different subgroups is to use a BY statement. In order to do this the data set must be already sorted by this BY variable. If you do not want to overwrite the original file then the sorted data must be stored in a new file. The statements:

proc sort data=unit3.pulse out=sorted;
   by activity;
run;

sort the pulse data by activity level and store the sorted data in a new file called sorted.

Task 2

Look in the libraries to see where this file is stored. Is it a permanent or a temporary data set?

The option NOOBS suppresses the observation numbers and the option N allows the sample size to be printed at the end of each table. The format statement gives an instruction to print the values of height with a maximum of six characters and no decimal places.

Task 3

Type in the rest of the sample program and see if you can work out what the remaining statements and options are doing. Look carefully at the titles. What happens if no title statement is made in a procedure?

Individual procedures will be looked at in more detail in the next few sections. Information about the options available for individual procedures is given in the SAS help though it is not always very easy to follow!

It is not strictly necessary to have a run statement between each procedure. SAS recognises that a new procedure statement indicates that the previous statements refer to the preceding procedure. However, it is generally advisable to include additional run statements and it is essential to put a run statement at the end of the program.


4. Summarising data

4.1. SAS System Options

You have probably noticed that the date and a page number are included on all the output produced by SAS. The type of output produced by SAS is determined by the system but may be changed by making use of SAS System Options.

There are dozens of options available which deal with hardware and software interfacing, and the input and processing as well as just the output of jobs.

A list of the options may be found in Help. The following are some commonly used options which may be used to change the output.

Option Action

CENTRE/NOCENTRE Output centred / left aligned

DATE/NODATE Date shown / date not shown

NUMBER/NONUMBER Pages numbered / not numbered

PAGESIZE= Determines the number of lines per page

LINESIZE= Determines the printer line width

FIRSTOBS= Specifies the first observation to include from the data set

OBS = Specifies the last observation to include. This is useful for testing code using large data sets.

OBS = max Includes all observations

The following lines of code will produce a printout of observations 20 to 45 inclusive of the pulse data, with no page numbering, no date, left aligned and with 20 rows on the page.

options nonumber nodate nocentre pagesize=20 linesize=80 firstobs=20 obs=45;

libname unit4 'c:\sas\sasdata';

proc print data=unit4.pulse;
run;

options firstobs = 1 obs = max;   /* Uses all observations in any analysis that follows. */

SAS system options remain in place for the whole of a SAS session unless subsequently changed. If an OPTIONS statement is entered within a DATA or PROC step then it takes effect immediately. An OPTIONS statement entered outside of a step takes effect with the following step.
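As well as looking in Help, the current settings can be listed from within a program with PROC OPTIONS, which writes its results to the LOG window. This is a small sketch rather than part of the original sample programs.

```sas
* List every system option and its current setting in the LOG window;
proc options;
run;

* Show the setting of a single option;
proc options option=pagesize;
run;
```

Checking a single option in this way is a convenient test that an earlier OPTIONS statement has taken effect.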


4.2. HTML output

HTML output can be turned on from the menu Tools → Options → Preferences. Select the Results tab and select the Create HTML box. The dialogue windows are shown below, for both Linux and Windows. They include an option to write the output into a specified folder. If this option is not used the output file is written into the folder specified for the work library.

Figure 7 Dialogue to turn on HTML output (in Windows / in Linux)
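HTML output can also be requested from program code using the Output Delivery System (ODS), which may be more convenient than the menus when a program is run repeatedly. The sketch below assumes a hypothetical output path and that the unit4 library has already been assigned with a LIBNAME statement.

```sas
* Hypothetical output path; unit4.pulse is the data set used in this chapter;
ods html body='h:\sas\output\pulse.html';

proc print data=unit4.pulse (obs=10);
run;

ods html close;     * finish writing the HTML file;
```

The file is only complete once the ODS HTML destination has been closed.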

4.3. Summary Procedures

Four procedures, PROC SORT, PROC MEANS, PROC UNIVARIATE and PROC FREQ, may be used to summarise data. The most commonly used options and statements for these procedures, together with sample programmes, are given below. The complete set of options can be found in SAS Help:

Help → SAS Help and Documentation. Choose SAS Products, Base SAS, SAS Procedures, then Procedures. From there you should click on the procedure you require.


4.4. PROC SORT

Options Description

DATA= Data set to be used, uses the last data set created by default

OUT= Specifies the name of file to store the sorted data. If no OUT option is used the original file will be overwritten.

Statements

BY <DESCENDING> A list of variables to sort by must be specified. DESCENDING placed before a variable name will sort the data in descending order for that variable.

options centre pagesize=50 firstobs=1 obs=92;

proc sort data=unit4.pulse out=sorted;
   by sex descending ran;
run;

proc print;
run;

The lines of code sort the pulse data by sex (males first, followed by females) and within sex by whether the students ran. Those who did not run (coded 2) are placed before those that did run (coded 1) because DESCENDING has been specified. Note that no DATA= option is used with PROC PRINT. SAS automatically uses the sorted data set because that was the last data set created.

Task 1

Using the first 50 observations only of the pulse data, create a data set sorted by smoking (non-smokers first) and by activity. Print out the sorted data set. Check carefully that the output is what you expect.


4.5. PROC MEANS

Options Description

DATA= Data set to be used, by default will use the last data set created

MAXDEC= Gives the maximum number of decimals to be used in the output (must be between 0 and 8)

NOPRINT Suppresses the printing if the procedure is only being used to send summary output to a file (see OUTPUT statement).

ALPHA= Sets the significance level for confidence limits (ALPHA=0.05 gives a 95% C.I.)

statistic keyword list

By default PROC MEANS prints out the variable name, count, mean, std dev, min and max values. Particular statistics may be requested.

Procedure options

N Number of non-missing observations in a subgroup

NMISS Number of missing observations

MEAN Mean

STD Standard deviation

MIN Minimum value

MAX Maximum value

RANGE Range

STDERR Standard error

CLM Confidence limits for the mean

(For additional keywords see SAS Help)

Statements

VAR Specify a list of numeric variables for which statistics are required.

BY Specify a list of alphanumeric variables (data must be sorted by these variables). Descriptive statistics are given for each subgroup.

CLASS Specify a list of alphanumeric variables. Descriptive statistics are given for each subgroup. Uses more memory than the BY command but does not need the data to be sorted.

OUTPUT There are various ways of storing all or some of the summary statistics requested. Need to specify a file name using OUT=filename and which variables/statistics are required. See specimen programme for a simple example of how this can be done.

The following lines of code may be used to get summary statistics (the means, standard deviations and standard errors) for pulse1 and pulse2 in subgroups defined by sex and whether the students ran. These summary values are stored in a file named summary.

proc means data=unit4.pulse maxdec=2 mean std stderr;
   var pulse1 pulse2;
   class sex ran;
   output out=summary mean = mean_p1 mean_p2
                      std = std_p1 std_p2
                      stderr = se_p1 se_p2;
run;

proc print data=summary;
run;
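For comparison, the same subgroup statistics could be requested with a BY statement instead of CLASS. As a sketch (the data set name sorted is just a convenient temporary name), the data must first be sorted by the BY variables:

```sas
* BY-group version: the data must first be sorted by the BY variables;
proc sort data=unit4.pulse out=sorted;
   by sex ran;
run;

proc means data=sorted maxdec=2 mean std stderr;
   var pulse1 pulse2;
   by sex ran;
run;
```

The BY version prints a separate table for each subgroup, whereas CLASS gives a single table with one row per subgroup.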

Task 2

Submit the previous sample program. (Remember you may need to change the library name and file name of the data set.) What data has been stored in the file ‘summary’? What does the _TYPE_ variable indicate in the print output?

4.6. PROC UNIVARIATE

Options Description

DATA= Data set to be used, uses the last data set created by default

PLOT Produces stem-and-leaf plots, boxplot and normal probability plots of the data.

NOPRINT Suppresses all printing.

Statements

VAR Specify a list of numeric variables for which statistics are required.

BY Specify a list of character variables (data must be sorted).

OUTPUT Need to specify a file name using OUT=filename and which statistics/variable names are required.

Task 3

Submit the following program and see how the printout differs from that produced by PROC MEANS. How does the output file containing summary values differ?

proc univariate data=sorted plot;
   var height weight;
   by sex;
   output out=summary mean = mean_ht mean_wt
                      std = std_ht std_wt;
run;

proc print data=summary;
run;

4.7. PROC FREQ

Options Description

DATA= Data set to be used, uses the last data set created by default


Statements

TABLES Specify a list of alphanumeric variables for which tallies are required. Smaller subgroups may be defined by the use of an * e.g. sex*ran*activity

Table statement options

NOCOL Does not show column percentages

NOCUM Does not show cumulative frequencies or percentages

NOFREQ Does not show cell frequencies

NOPERCENT Does not show cell percentages

NOROW Does not show row percentages

CHISQ Gives results of chi-squared tests of independence

The following code produces frequency tables for sex and smoking habit separately and a two way table of sex and smoking habit. The output also includes the results of a chi-squared test of independence for these two variables.

proc freq data=unit4.pulse;
   tables sex smokes smokes*sex / nocol norow nocum chisq;
run;

4.7.1. Chi-square test

PROC FREQ is used to carry out a chi-square test for the association of two categorical variables. In this case the null hypothesis is that there is no association between smoking and sex. The same proportion of smokers should be found amongst males and females.

It is convenient to add the row percentage to the cross tabulation as an easy way to look for a possible association. This is achieved by removing the option “norow”. The options “nocol” and “nopercent” have been left in the statement to remove clutter from the output.

proc freq data=unit4.pulse;
   tables smokes*sex / chisq nocol nopercent;
run;

The output from SAS gives

The FREQ Procedure

Table of Smokes by Sex

Smokes     Sex

Frequency|
Row Pct  |       1|       2|  Total
---------+--------+--------+
        1|     20 |      8 |     28
         |  71.43 |  28.57 |
---------+--------+--------+
        2|     37 |     27 |     64
         |  57.81 |  42.19 |
---------+--------+--------+
Total          57       35       92

Statistics for Table of Smokes by Sex

Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     1      1.5321    0.2158
Likelihood Ratio Chi-Square    1      1.5699    0.2102
Continuity Adj. Chi-Square     1      1.0089    0.3152
Mantel-Haenszel Chi-Square     1      1.5154    0.2183
Phi Coefficient                       0.1290
Contingency Coefficient               0.1280
Cramer's V                            0.1290

If there were no association, the probability of the chi-square statistic being at least as large as 1.5321 by chance alone would be 0.2158. As this is well above 0.05, there is no significant evidence of an association between sex and smoking in this sample.

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        20
Left-sided Pr <= F          0.9310
Right-sided Pr >= F         0.1576

Table Probability (P)       0.0886
Two-sided Pr <= P           0.2502

Sample Size = 92
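As a check on where the Chi-Square line comes from, the statistic can be computed by hand in a DATA step from the four cell counts in the Smokes by Sex table. The sketch below uses the standard shortcut formula for a 2 x 2 table and the PROBCHI function; the data set and variable names are made up for illustration. The results should agree with the Chi-Square value and probability printed by PROC FREQ (1.5321 and 0.2158).

```sas
* Cell counts taken from the Smokes by Sex table above;
data chisq_check;
   a = 20; b = 8;      * smokes = 1: sex 1, sex 2;
   c = 37; d = 27;     * smokes = 2: sex 1, sex 2;
   n = a + b + c + d;
   * Shortcut formula for the Pearson chi-square of a 2 x 2 table;
   chisq = n * (a*d - b*c)**2 / ((a+b) * (c+d) * (a+c) * (b+d));
   p = 1 - probchi(chisq, 1);    * upper-tail probability, 1 d.f.;
run;

proc print data=chisq_check;
   var n chisq p;
   format chisq p 8.4;
run;
```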

Sometimes your data may simply be the counts in each table cell. The data set must then contain a variable, called something like “count”, which holds the number of observations in each cell. In this case use the WEIGHT statement in the procedure,

e.g. weight count;
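A sketch of this approach, using the four cell counts from the Smokes by Sex table shown earlier, might look as follows. The data set name counts is an assumption for illustration; the analysis should reproduce the same cross-tabulation and chi-square results as the run on the full pulse data.

```sas
* Cell counts copied from the Smokes by Sex table;
data counts;
   input smokes sex count;
   datalines;
1 1 20
1 2 8
2 1 37
2 2 27
;
run;

proc freq data=counts;
   weight count;
   tables smokes*sex / chisq nocol nopercent;
run;
```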

4.8. General syntax for a procedure

A general example of a procedure is given below. Each procedure uses only a certain combination of statements but the action of each statement is common across the procedures in which it can be used.


PROC PROCNAME DATA = lib1.data1 OUT = lib2.data2 noprint;
   WEIGHT WeightVar;               * Specifies which variable weights the analysis;
   FORMAT NumVar 8.0 CatVar $3.;   * Specifies formats of variables;
   BY CatVar;                      * Gives one output per group, SORT needed;
   CLASS CatVar;                   * Similar to BY statement, no SORTing needed.
                                     Makes a numeric variable act as a categorical variable;
   VAR NumVar;                     * Restricts analysis to named variables;
   OUTPUT OUT = Summary keyword = DescriptiveStatistic;
                                   * Named output dataset, specifies names;
   WHERE NumVar2 > 1000;           * Only uses certain cases;
   TABLE CatVar * NumVar;          * Specifies an output table;
   FREQ NumVar3;                   * Variable giving the observation frequency;
   MODEL YVar = XVar + . .;        * Fits models;
   PLOT YVar * XVar;               * Plots a scatter plot;
RUN;
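As one concrete instance of this template, the sketch below combines WHERE, CLASS and VAR in PROC MEANS. It assumes that the unit4 library has been assigned and that sex is stored as a numeric code (1 and 2), as the PROC FREQ output earlier in this chapter suggests; if sex is a character variable the WHERE condition would need quotes.

```sas
* Sketch: summary statistics for one subgroup selected with WHERE;
* Assumes sex is coded numerically (1 and 2);
proc means data=unit4.pulse maxdec=1 n mean std;
   where sex = 1;
   class ran;
   var pulse1 pulse2;
run;
```

Only the observations satisfying the WHERE condition are used, and CLASS then splits those observations by the values of ran.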


4.9. Help

Extensive documentation on each procedure can be found by using Help → SAS Help and Documentation. An example of the help screen for the Base SAS procedures is shown in Figure 8. Other useful help modules are SAS/STAT and SAS/GRAPH.

Figure 8 Base SAS Procedures help

Exercises

4.1 (a) For the pulse data, get a printout which shows the number of observations, the mean, standard deviation, maximum and minimum values of pulse2 in each of the four subgroups defined by whether the student smoked/did not smoke and ran/did not run. Make the printout left-aligned with the summary values shown to one decimal place.

(b) Obtain confidence limits for the four means produced in part (a). Does it appear that smoking or running on the spot had any effect on the second pulse rate?

(c) Obtain comparative boxplots of the second pulse rate in each of the four subgroups.


(d) What percentage of smokers were made to run on the spot? Get suitable SAS output to give you this information.

4.2 For the SAS data set RETAIL (from the Explorer tab, look in the library SASHELP) get a printout which shows the number of observations and the mean and standard deviation of retail sales in each year. Print the mean and standard deviation to two decimal places. Output the mean sales for each year to a new file called ‘summary’ and get a printout of this file.


5. Graphs and charts

5.1. Graphics procedures

SAS can produce two types of graphics, high or low resolution. High resolution graphics are sent to a special graphics window where the graphs can be edited and copied into Word documents. GPLOT and GCHART are two procedures which produce high resolution graphics; the equivalent low resolution procedures are PLOT and CHART.

5.2. PROC PLOT

Options Description

DATA= Data set to be used, uses the last data set created by default

Statements

PLOT Specify yvariable*xvariable. Can produce several plots with a single statement by including a list of variables in parentheses, e.g. (list of n yvariables)*(list of m xvariables) will produce n × m separate plots.

BY Specify a list of character variables (data must be sorted) to produce separate graphs for subgroups.

PLOT options

=‘symbol’ Specify a symbol to be used for plotting

=variable Identifies each point by the value of another variable
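As a minimal sketch of the list form of the PLOT statement and of the =variable option, the following uses SASHELP.CLASS, a sample data set supplied with SAS. Using that data set is an assumption made for illustration; it is not one of the data sets used in these notes, and the variable names age, height, weight and sex belong to it.

```sas
/* Illustration only: SASHELP.CLASS is a sample data set shipped with SAS, */
/* not one of the course data sets.                                        */
proc plot data=sashelp.class;
  plot (height weight)*age;   /* 2 y-variables by 1 x-variable gives 2 plots */
  plot weight*height=sex;     /* each point is marked using the value of sex */
run;
```

With two y-variables and one x-variable the first PLOT statement should give two separate plots, in line with the n × m rule described above.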

The following code produces a plot of weight against height for all students, separate plots of weight against height for each sex and a single plot with a different symbol for males and females.

proc sort data=unit5.pulse out=sorted;
  by sex;
proc plot data=sorted;
  plot weight*height;
proc plot;
  plot weight*height='*';
  by sex;
run;
proc plot;
  plot weight*height=sex;
run;


Task 1
Input the program into SAS and examine the output. (Remember to assign a LIBNAME as the first statement.) Resubmit the program using PROC GPLOT instead of PROC PLOT. What differences in the output do you observe?

When the program has run successfully with PROC GPLOT, you will find that you are in a graph window. The graph can be edited in SAS by clicking on the painting icon. To come out of editing the plot, click on File and then End. You can save the graph to a file or cut and paste it into Word where it can be further edited if required. The graph window must be closed (by clicking on its close button) before another SAS program can be run.

5.3. PROC CHART

Options Description

DATA= Data set to be used, uses the last data set created by default

Statements

HBAR Specify variable to produce a frequency bar chart (horizontal bars)

VBAR Specify variable to produce a frequency bar chart (vertical bars)

PIE Specify variable to produce a pie chart.

BLOCK Specify variable to use on the x-axis. Used in conjunction with GROUP and SUMVAR options to produce three-dimensional bar charts.

BY Specify a list of character variables (data must be sorted) to produce separate charts for subgroups.

HBAR/VBAR/PIE options

SUMVAR= Specify an analysis variable the sum of which is to be shown on the y-axis

TYPE= May be used on its own or in conjunction with SUMVAR to produce statistics other than the frequency or sum on the y-axis. The options for TYPE are FREQ (frequency counts), PCT (percentages), CFREQ (Cumulative frequencies), CPCT (Cumulative percentages), SUM (Totals), MEAN (Means)

The default is TYPE=SUM if SUMVAR is used otherwise the default is TYPE=FREQ.

LEVELS= Specifies the number of equal width classes for numeric variables.

MIDPOINTS= Specifies the midpoints of classes for numeric variables

MIDPOINTS=lower_limit TO upper_limit BY interval

DISCRETE Prevents SAS from dividing a discrete variable into inappropriate intervals, e.g. ensures that a variable coded from 1 to 5 will produce 5 classes.

GROUP= Produces separate bar charts on the same graph for different discrete values of the GROUP variable.
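The options in the table can be tried out with a short sketch such as the one below. It assumes the sample data set SASHELP.CLASS supplied with SAS (not a data set from these notes); the variables age, height, weight and sex and the midpoint values are chosen only for illustration.

```sas
/* Illustration only: SASHELP.CLASS and its variables are assumptions. */
proc chart data=sashelp.class;
  vbar age / discrete;                     /* one bar per distinct age            */
  vbar height / midpoints=50 to 75 by 5;   /* classes centred on 50, 55, ..., 75  */
  hbar sex / sumvar=weight type=mean;      /* mean weight shown for each sex      */
run;
```

Removing DISCRETE from the first statement lets you see how SAS would otherwise group the ages into intervals.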


The following program illustrates some of the features of PROC CHART.

proc chart data=sorted;
  hbar smokes;
  by sex;
run;
proc chart;
  vbar height/levels=6 group=sex sumvar=pulse2 type=mean;
run;

Task 2
Run this program and look carefully at the output obtained. Adapt the code to show other statistics on the y-axis of the vertical bar chart, a different number of bars and so on to familiarise yourself with the procedure.

Exercises

5.1 Plot weight against height for the data from Exercise 1.1. This should be stored somewhere as a permanent SAS data set. Use a plus sign as your plotting symbol.

5.2 For the pulse data:

(i) Plot the second pulse rate against weight, using a different symbol for males and females.

(ii) Produce a pie chart which shows the percentage of students who usually have particular levels of activity.

(iii) Produce a horizontal bar chart which shows the mean of the second pulse rate for those students that did and did not run.

(iv) Produce vertical bar charts side by side which show the percentage of students who have different levels of physical activity for males and females separately.

(v) Produce a three dimensional bar chart which shows the mean of the first pulse rate for subgroups defined by level of activity and smoking habit.


6. Correlation and Regression

Two procedures may be used to obtain information about the relationship between two or more continuous variables. PROC CORR determines the correlation coefficients between selected variables and PROC REG fits a regression model to data and allows output to be saved for further analyses. Remember that the statements and options given in these notes are only a very small subset of those that can be used with particular procedures. The help facility may be used to investigate further possibilities.

6.1. PROC CORR

Options Description

DATA= Data set to be used, uses the last data set created by default

SPEARMAN Calculates the Spearman rank correlation coefficient. By default the Pearson product moment correlation is calculated.

Statements

VAR Specify a variable list (essential)

WITH Specify a variable list to be used with VAR. The VAR variables are given at the top of the table of correlations and the WITH variables at the side. If WITH is not used a matrix of the correlations between all pairs of variables is produced.

BY Specify a list of character variables (data must be sorted) to produce separate tables of correlations.

An example of the use of PROC CORR is shown below using the data concerned with heights and weights of eleven-year-olds (Exercise 1.1).

proc corr data=unit6.eleven;
  var height weight;
run;
proc corr data=unit6.eleven;
  var height;
  with weight;
run;

Task 1
Try running this program and look at the difference in output produced when the WITH statement is included or not included. What do the values under the correlation coefficients indicate?
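The SPEARMAN option from the table can be combined with VAR and WITH in the same way. The sketch below is only an illustration and assumes the sample data set SASHELP.CLASS supplied with SAS, with its variables age, height and weight, rather than the course data.

```sas
/* Illustration only: SASHELP.CLASS is an assumed sample data set. */
proc corr data=sashelp.class spearman;
  var height weight;   /* columns of the correlation table */
  with age;            /* rows of the correlation table    */
run;
```

Because SPEARMAN is specified, rank correlations are reported instead of the default Pearson coefficients.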


6.2. PROC REG

Options Description

DATA= Data set to be used, uses the last data set created by default

CORR Prints a correlation matrix for all variables listed in the MODEL statement.

Statements

MODEL For simple linear regression, specify a response variable and an independent variable, response=independent variable

OUTPUT Need to specify an output file using OUT=filename and a list of keywords(statistics)/names required. Commonly used keywords are p (predicted values), r (residuals), student (standardised residuals). See the specimen program for syntax.

PLOT Allows scatter plots to be produced using any variables in the model or keywords in the OUTPUT statement. Note that the keywords in the OUTPUT statement must be followed by a full stop when used as variables in the PLOT statement.

BY Specify a list of character variables (data must be sorted) to produce a separate regression analysis for each subgroup.

PLOT options

OVERLAY Superimposes several scatter plots on the same graph.

The following code produces output which shows:
• the correlation matrix of weight and height
• output from a simple linear regression analysis of weight on height
• a plot of weight against height with the predicted values (line) superimposed
• a plot of the standardised residuals against height
• a printout of the output file
• a normal probability plot for each of the residuals and standardised residuals together with the results of a test of normality.

proc reg data=unit6.eleven corr;
  model weight=height;
  output out=regout p=pred r=resid student=stdresid;
  plot weight*height='*' p.*height='P'/overlay;
  plot student.*height;
run;

proc print data=regout;

proc univariate plot normal data=regout;
  var resid stdresid;
run;

Task 2
Obtain the output from the previous program. Why is it important to produce residual plots and normal probability plots for the residuals from a linear regression model?

Exercises

6.1 Data set beetle, available as a permanent SAS data set on the web page, gives information on a sample of beetles and the damage they cause.

(a) Download the file and examine the data.

(b) (i) plot area against male beetle length
    (ii) regress area against male beetle length
    (iii) obtain a plot of the standardised residuals against length
    (iv) obtain a normal probability plot of the standardised residuals.

(c) Why isn’t a simple linear regression model satisfactory? What might be a useful thing to try in order to improve the model?

(d) Similarly, investigate the relationship between the amount of frass produced and female beetle length.


7. Exploratory data analysis

7.1. SAS/INSIGHT

SAS/INSIGHT software is a tool for data exploration and analysis. It is interactive, which means you can explore data through graphs and analyses linked across multiple windows. This facility can be used to identify outliers, highlight subgroups on a graph and so on. SAS/INSIGHT allows you to analyse univariate distributions, investigate multivariate distributions and fit explanatory models such as a simple linear regression model to your data.

7.2. Accessing SAS/INSIGHT

SAS/INSIGHT may be used with any SAS data sets that have been created previously. If you wish to investigate a permanent SAS data set make sure that you have set up an appropriate library using a LIBNAME statement before invoking SAS/INSIGHT. It is also possible to enter data directly into the SAS/INSIGHT data window.

Within SAS, select Solutions > Analyse > Interactive data analysis. The data of interest can then be accessed using Library > Data Set > Open or a new data set created in the data window using New.

7.3. Creating a Scatter Plot

To investigate SAS/INSIGHT initially, choose a data set you are fairly familiar with and try out some of the features. This example uses the Pulse data set.

Analyse > Interactive Analysis

Figure 9 Opening the SAS/INSIGHT dialogue in Windows


Figure 10. Opening the SAS/INSIGHT dialogue in Linux

You can create charts using the Analyse menu. An example of creating a scatter plot is shown in Figure 11.

Analyse > Scatter Plot (X/Y)

Figure 11 The Scatter Plot ( X / Y ) dialogue


Figure 12 Two SAS/INSIGHT scatter plots with 3 points highlighted

A key feature of SAS/INSIGHT graphs is that they are interactive. Click on a point on one chart and all corresponding points become highlighted. An example of this interactivity is shown in Figure 12. The 3 points labelled with ‘1’ use a larger symbol and indicate the level of the variable ran (1 = yes). To select several points, hold down the Ctrl key while clicking. To highlight an area of points, left click and hold in one corner of the region, then drag the cursor to the opposite corner. All points in the rectangle will be highlighted. You can then turn off the highlighting by clicking on a point outside the rectangle.

7.4. Features of SAS/INSIGHT

SAS/INSIGHT software provides an extensive range of tools for investigating data and carrying out analyses. Some of the activities that you can carry out using SAS/INSIGHT are shown below.

• enter data from the keyboard
• identify observations in plots
• examine all values for selected observations
• brush observations in graphs
• create overlaid line plots
• rotate data in three dimensional plots
• manipulate histograms to explore the distribution of data
• compare distributions in box plots and mosaic plots
• compute descriptive statistics
• fit parametric (normal, lognormal, exponential, Weibull) and kernel density estimates
• fit parametric cumulative distribution functions
• create quantile-quantile plots
• calculate correlations and principal components to find the structure of your data
• fit a general linear model
• create residual and leverage plots
• transform variables
• process data by groups for every analysis

7.5. Using SAS/INSIGHT

Once you have a data set in SAS/INSIGHT, manipulations and analyses are carried out by using either Edit or Analyse on the main menu.

Operations are also available from pop-up menus by:
• clicking the left mouse button in the corners of graphs and tables
• pressing the right mouse button over an appropriate object.

Variables to be analysed may either be selected before clicking on Analyse or entered as requested within the particular analysis window.

In the data set window:
• a variable is selected by clicking on the name
• several variables can be selected by holding down the left mouse button and dragging across the selection
• non-contiguous variables or observations may be selected by holding down the control button and clicking on individual names or row numbers.

In WINDOWS, any plots produced can be printed directly or copied and pasted into Word documents.

Tabular output can be saved into the normal SAS Output Window using File > Save > Tables.

Commands from your SAS/INSIGHT session can be recorded and later resubmitted. The FILE and INFILE options allow you to produce a file containing commands to document and reproduce a SAS/INSIGHT session. This is very useful for exploratory analysis that you need to interrupt or repeat on different sets of data.

Example

filename note 'h:\MA71064 Statistical Computing\insight.txt';
proc insight file = note;
run;


After doing your analysis and then exiting, your file will contain the commands that were used to create and close your Insight session. You can begin your second Insight session from where you left your first session with the following code:

filename note 'h:\MA71064 Statistical Computing\insight.txt';
proc insight infile = note;
run;

Alternatively, just code the FILE keyword without a filename specified and your commands will be recorded in the SAS Log window.

Task 1

To get a feel for what SAS/INSIGHT can do, work through the following exercises which are based on the pulse data.

(a) Obtain a histogram of height. Obtain a histogram of sex in a new window. Click on the bar representing males on the sex histogram. Look at the histogram of height. What has changed on this histogram?

(b) Obtain comparative boxplots of weight for each sex. (Input weight as the Y-variable and sex as the X-variable.) Click on the outlier for male weights. Which observation number is this? Double click on this outlier. What information do you get?

(c) Highlight observation numbers 1, 31 and 67 in the data window. Press the right mouse button and click ‘Label in plots’. Obtain a scatter plot using pulse2 as the Y-variable, pulse1 as the X-variable, sex as the group variable and ran as the label. What information is shown on these plots? Double click on one of the points. What information do you get?

(d) Highlight the variable names for pulse1, pulse2, height and weight in the data window. Obtain a scatter plot. What plots do you get? What are the values shown in each plot?

(e) Using the Fit option in the Analyse menu, input pulse2 as the Y-variable, weight as the X-variable and ran as the group variable. Look carefully at the output obtained. Redo the analysis to show Residual Normal QQ plots and store predicted values and standardised residuals in the data sheet.


7.6. Tools

Edit > Windows > Tools turns on a menu that allows data points to be coloured or selects different symbols by data value.

Figure 13 The tools dialogue labelling observations by sex

Press the coloured square in the tools window to set a particular set of points to that colour, e.g. colour red all those points representing people who smoke. Similarly, press one of the symbol buttons to select a given symbol for a set of observations.

Other features of SAS/INSIGHT can be investigated using the help facility.

Task 2

(a) Find out what is meant by ‘brushing observations’.

(b) Produce summary statistics and graphs for each of the continuous variables in the pulse data.

(c) Input, into the data sheet, the following data which are chloride content  (mg/l) of waters draining from a particular type of rock.

6.0  5.0  0.5  0.5  0.6  10.0
0.4  6.0  1.2  0.2  0.7  0.3
0.2  0.8  0.2  1.7  0.5  6.0

(i) Produce a boxplot of the data and comment on the distribution.
(ii) Create a new variable which is the log of the chloride content.
(iii) Check whether the log values could reasonably be assumed to have come from a normal distribution.


8. Modifying data and output

8.1. Introduction

In the previous units you have discovered how to create permanent SAS data sets and produce output using a variety of procedures. The OPTIONS statement was introduced in the unit on Describing Data which allows you to make some changes to the output produced. Generally speaking though, there has been very little flexibility in either changing the style of the output or modifying the data set that has been used. In this unit you will learn how to make changes to a permanent SAS data set and to customise some types of output.

8.2. SET statement

The SET statement is a very versatile statement which is used in a DATA step and enables a variety of tasks to be carried out depending on which options are used. One of its most common uses is reading observations and variables from existing SAS data sets so that further processing can take place. Another use is combining two or more data sets so that analyses can be carried out on a larger set of variables or observations.

The same operations can often be done in different ways because some SAS statements can be incorporated as options into either the DATA or SET statements. Have a look at the following examples of code which both achieve the same thing.

data unit8.beetles1 (drop=site);
  set unit8.beetles;
  where site='1';
run;

data unit8.beetles1 (drop=site);
  set unit8.beetles (where=(site='1'));
run;

Both pieces of code produce a new permanent SAS data set called beetles1 which has data from site one only and does not include site as a variable. The number 1 is shown in single quotes because site is a character variable in this data set. The DROP option allows variables not required to be omitted from the new data set. It can be included in either the DATA or SET statement but you have to be a bit careful. If DROP is used in the SET statement then the variables involved cannot be used for further processing. In this example using the DROP option with the variable site in the SET statement would result in an error message – try it!
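The difference between dropping a variable on the DATA statement and on the SET statement can be seen in a small sketch. It assumes the sample data set SASHELP.CLASS supplied with SAS (not one of the course data sets); the new data set name ratio and the variable wt_per_ht are hypothetical names chosen for illustration.

```sas
/* Illustration only: SASHELP.CLASS, ratio and wt_per_ht are assumptions. */
data ratio (drop=height weight);   /* dropped only when the new data set is written */
  set sashelp.class (drop=age);    /* age is never read into the DATA step          */
  wt_per_ht = weight / height;     /* height and weight can still be used here      */
run;

proc print data=ratio;
run;
```

Any statement in this DATA step that referred to age would fail in the way described above, because age was dropped as the data were read.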

The following table shows some of the commonly used options in the DATA and SET statements.


Data Set Options Description

DROP= Specify one or more variables to exclude either from further processing or from the new data set

FIRSTOBS= Specify the first observation required for processing

KEEP= Specify one or more variables to include in further processing or in the new data set

OBS= Specify the last observation required for processing

LABEL= Specify a descriptive label for the data set (labels for variables are given with the LABEL statement, see section Labelling output)

RENAME= Specify new names for variables

WHERE= Specify a condition to select certain observations from a SAS data set
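A short sketch of FIRSTOBS=, OBS= and RENAME= used together on a SET statement is given below. It assumes the sample data set SASHELP.CLASS supplied with SAS; the data set name subset and the new variable name student are hypothetical.

```sas
/* Illustration only: SASHELP.CLASS, subset and student are assumptions. */
data subset;
  /* read observations 5 to 10 inclusive and rename name to student */
  set sashelp.class (firstobs=5 obs=10 rename=(name=student));
run;

proc print data=subset;
run;
```

Note that OBS= gives the number of the last observation to read, not a count, so this sketch should read six observations.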

8.3. DROP and KEEP

The following code shows how the DROP and KEEP options may be used in a program.

data lengths (keep=height cond mlength flength)
     damage (drop=mlength flength);
  set unit8.beetles (drop=site);
run;

proc print data=lengths;

proc print data=damage;
run;

Note that more than one data set can be specified in a single DATA statement. In this example two temporary data sets are produced: lengths and damage.

Task 1
Look at the preceding code and see if you can work out which variables will be contained in each data set. Submit the code and see if you are correct. If you wanted to create permanent data sets what changes to the code would you have to make?


8.4. Labelling output

Variable names in SAS are restricted to 32 characters in length (eight in versions before SAS 7), and by default these variable names are used as column headings. A LABEL statement may be used to associate a descriptive label with a variable. If the labels are required in the output then either the LABEL or the SPLIT= option must be used in the PROC PRINT statement. If a LABEL statement is placed in a DATA step then the labels are permanently associated with the variables in that SAS data set.

An example of the use of labels follows.

proc sort data=unit8.beetles1 out=sorted;
  by cond height;

proc print data=sorted split='*';
  var height mlength area frass;
  by cond;
  label mlength='male*length'
        area='leaf area*consumed'
        frass='number of*frass*pellets';
  pageby cond;
  sum frass;
run;

Task 2
Submit the code in SAS and inspect the output. What effect have the PAGEBY and SUM statements had?

If you had used label names without the asterisks and used the LABEL option in the PROC PRINT statement then SAS would split the label names automatically at a suitable place but you have no control over the process.
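As a minimal sketch of labels stored permanently with a data set, the following places the LABEL statement in a DATA step and then asks PROC PRINT to use the labels. It assumes the sample data set SASHELP.CLASS supplied with SAS; the data set name class_labelled and the label text are hypothetical.

```sas
/* Illustration only: SASHELP.CLASS and class_labelled are assumptions. */
data class_labelled;
  set sashelp.class;
  label height='Height (inches)'
        weight='Weight (pounds)';   /* labels are stored with the new data set */
run;

proc print data=class_labelled label noobs;   /* LABEL option: print labels as headings */
  var name height weight;
run;
```

Because the labels were assigned in the DATA step, no LABEL statement is needed in PROC PRINT itself, only the LABEL option.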


8.5. PROC PRINT

Options Description

DATA= Data set to be used, uses the last data set created by default

LABEL Ensures that column labels are used in the output

SPLIT= Specify a character in the label names which splits the column headings onto two or more lines

NOOBS Suppresses the printing of observation numbers in the output

Statements

VAR Specify variables to be printed

ID Specify variable to use as identification instead of the observation number

BY Specify a list of character variables (data must be sorted) to produce separate tables.

PAGEBY Used with BY statement to output each table on a separate page.

SUMBY Prints subtotals for the specified BY variable

SUM Specify numeric variables for which the sum of the values is required.

8.6. PROC FORMAT

The LABEL statement allows you to give longer names to variables so that any output is easier to interpret. It is also possible to assign names to individual categories for character variables and to save these names as permanent formats. This is done using PROC FORMAT. The permanent formats are saved in a location which is specified using the LIBNAME statement with the special libref name LIBRARY. For example,

libname library 'C:\sasdata';

will store the formats in a directory sasdata on the hard disk when the LIBRARY option is used in the FORMAT procedure.

Task 3
Assign a library called LIBRARY in a suitable location. (The location where you have stored your SAS permanent data sets is probably the most appropriate.)

The individual names are assigned using a VALUE statement. These formats are independent of any particular data set and if appropriate may be used with any variable. The following example shows suitable labels for the plant height and plant condition categories from the beetles data but the format $health, for example, could be used for any variable where 1, 2 and 3 represent poor, satisfactory and good respectively.

proc format library=library;
  value $height '1'='less than 10cm'
                '2'='10cm<20cm'
                '3'='20cm<30cm'
                '4'='30cm<40cm'
                '5'='40cm or more';
  value $health '1'='poor'
                '2'='satisfactory'
                '3'='good';
run;

When these labels are required in output the format names must be shown with a full-stop in the FORMAT statement.

proc print data=unit8.beetles;
  format height $height. cond $health.;
run;

Task 4
Create the above formats and print out the beetles data set to see the effect of using these formats. Check the library called LIBRARY. You should find a catalog called FORMATS. Double-clicking on FORMATS will give a list of all permanent formats you have created.

The permanent formats which have been created may be used at any time. If you want to use them in a future session remember to use a LIBNAME statement initially to assign the libref LIBRARY.

(To use more than one permanent format library use the FMTSEARCH= system option - see HELP for details)
_____________________________________________________________

8.7. Recoding data

You may find that when you are presenting results, or carrying out an analysis of a set of data, you wish to code a continuous variable such as height into discrete categories, for example, short, medium and tall. In some circumstances you may wish to combine categories for an analysis. This type of operation can be done either by using conditional statements or by making use of the VALUE statement in PROC FORMAT.

_____________________________________________________________

8.8. Conditional statements

Conditional statements can take several forms (see the SAS Help). Two commonly used in recoding are:


if expression then statement;

if expression then statement;
else statement;

The following code shows the use of the IF-THEN/ELSE statement for recoding pulse1 and pulse2 into four numeric categories.

data unit8.pulse2 (keep=pulse1 pulse2 ran);
  set unit8.pulse;
  if 40 <= pulse1 < 60 then pulse1=1;
  else if 60 <= pulse1 < 80 then pulse1=2;
  else if 80 <= pulse1 < 100 then pulse1=3;
  else pulse1=4;

  if 40 <= pulse2 < 60 then pulse2=1;
  else if 60 <= pulse2 < 80 then pulse2=2;
  else if 80 <= pulse2 < 100 then pulse2=3;
  else pulse2=4;
run;

proc freq data=unit8.pulse2;
  tables pulse1*pulse2/nocol nocum nopercent norow;
  by ran;
run;

_____________________________________________________________

8.9. VALUE statement

An alternative way to code data is to create new formats for the required variables and use these formats when they are needed to produce particular output. For example putting the pulse rates into categories and creating a two way table could be done by creating a new format as follows.

libname library 'c:\sasdata';
run;
proc format library=library;
  value pulse 40-59=1
              60-79=2
              80-99=3
              100-200=4;
run;

proc freq data=unit8.pulse;
  tables pulse1*pulse2/nocol norow nocum nopercent chisq;
  by ran;
  format pulse1 pulse2 pulse.;
run;

Task 5
Try out these alternative ways of recoding the data. What do you think are the advantages and disadvantages of each method? What do the resulting tables tell you about the relationship between the first and second pulse rates in each group?

The values that may be assigned to a particular code or description may be specified in the following ways.

Range specification in the VALUE statement

Description

value a single value

value1-valuen a range of values

value1, value2, …. a list of values

HIGH the highest possible value

LOW the lowest possible value

OTHER anything that does not fall into any range
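The range forms in the table can be combined in a single VALUE statement. The sketch below is an illustration under stated assumptions: it uses the sample data set SASHELP.CLASS supplied with SAS, and the format name agegrp and its category text are hypothetical. No LIBRARY= option is given, so the format is temporary and lasts only for the current session.

```sas
/* Illustration only: SASHELP.CLASS, agegrp and the labels are assumptions. */
proc format;
  value agegrp low-12  = '12 or under'      /* LOW: lowest possible value    */
               13, 14  = '13 or 14'         /* a list of single values       */
               15-high = '15 or over'       /* HIGH: highest possible value  */
               other   = 'not classified';  /* anything not covered above    */
run;

proc freq data=sashelp.class;
  tables age;
  format age agegrp.;   /* group the ages using the new format */
run;
```

The data themselves are unchanged; only the way age is grouped and displayed in the table is affected.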

For example a format for age groups could be created as follows.

libname library 'c:\sasdata\';
run;
proc format library=library;
  value agegroup low-24='under 25'
                 25-49='25 or more but less than 50'
                 50-high='50 or over';
run;
____________________________________________________________________

8.10. OUTPUT

The OUTPUT statement is used in conjunction with the SET statement to create multiple SAS data sets. The IF statement is used with the OUTPUT statement to control which observations are output to which SAS data sets.

e.g.
data american japan british;
  set mydata.cars;
  if origin = 1 then output american;
  else if origin = 2 then output japan;
  else if origin = 3 then output british;
run;

Exercises

8.1 Create a new temporary data set, using the pulse data, named mpulse which contains data for males only. Omit the variables sex, height and weight in this set. Label pulse1 ‘First pulse rate’, pulse2 ‘Second pulse rate’ and print out the data set using these variable labels.


8.2 Create formats for the alphanumeric variables in the pulse data to give the following information:

ran       1=ran in place
          2=did not run in place

smokes    1=smokes regularly
          2=does not smoke regularly

sex       1=male
          2=female

activity  1=slight
          2=moderate
          3=a lot

Construct two way tables which show frequencies only for the following pairs of variables, using the formats you have created.

(i) sex and smokes
(ii) smokes and activity
(iii) sex and ran

8.3 In Task 5 there was a warning that the chi-squared test may not be valid because of small expected numbers. Create a new format for the pulse rates which has two categories only. (Choose the categories so that there are roughly equal numbers in each category for this particular set of data.) Repeat the chi-squared tests for independence for each of the two groups (ran/did not run) as in Task 5 and comment on the results.


9. PROC TABULATE

Proc tabulate displays descriptive statistics in tabular format. It computes many of the same statistics that are computed by other descriptive statistical procedures such as MEANS, FREQ, and SUMMARY, but incorporates more flexibility.

Statement   Description

CLASS       Identifies the categories on which calculations are carried out
            Variables are either character or discrete numeric, e.g. a department code
            Supplies the values used in the structure of the table
            Must be present in PROC TABULATE statements

VAR         Contains values appropriate for calculating statistics
            Variables are continuous numeric
            Supplies the values in the table cells
            Optional in PROC TABULATE statements

TABLE       element1 * element2, where the elements are class variables and, optionally, the ALL variable and/or various statistics
            Table operators:
              comma    - produces a multidimensional table
              asterisk - produces a hierarchical table
              blank    - concatenates tables

BY          Specify a list of variables defining the BY groups (data must be sorted)

FORMAT      Specify variables with the formats wanted

FREQ        Specify the variable whose values give the frequency of each observation

KEYLABEL    Used to label the statistics available, e.g. keylabel all='Grand Total';

LABEL       Specify a label for a variable

WEIGHT      Specify a variable to be used for weighting the entries in the table

PROC TABULATE is the only procedure that has a SAS manual of its own. It is worth understanding the format of the table statement that controls the position of the variables.

There are 3 operators that determine where the variables are positioned in the output. Notice that the variables must be categorical. If the variables are numeric then the CLASS statement is used to tell the SAS system to treat the numeric variables as if they were categorical.

9.1. The comma table operator

Determines page, row and column positions, i.e. cross tabulation.

Table <page variable>, <Row variable>, <Column variable> ;

e.g.
proc tabulate data = students;
  class faculty sex;
  table faculty, sex;
run;

                 SEX
            Male    Female
              N        N
FACULTY
Science      450      360
Arts         350      550

9.2. The asterisk table operator

Nests two variables in the column or row

Table <Column variable1> * <Column variable2> ;

e.g.
proc tabulate data = students;
  class faculty sex;
  table faculty*sex;
run;

                FACULTY
      Science             Arts
        SEX               SEX
   Male   Female     Male   Female
     N       N         N       N
    450     360       350     550

9.3. Using a blank table operator

Places variables side by side

Table <Row variable>, <Column variable1> <Column variable2> ;

e.g.
proc tabulate data = students;
  class faculty sex;
  table faculty sex;
run;

     FACULTY              SEX
 Science   Arts      Male   Female
    N       N          N       N
   810     900        800     910

9.4. Using the ALL variable

e.g.
proc tabulate data = students;
  class faculty sex;
  table faculty all, sex all;
run;

                 SEX
            Male    Female     ALL
              N        N        N
FACULTY
Science      450      360      810
Arts         350      550      900
ALL          800      910     1710

9.5. Other Statistics

The statistics that can be requested in PROC TABULATE include the following

N      number of nonmissing observations
SUM    sum of the VAR variable for each class of the CLASS variables
NMISS  number of missing observations
MEAN   arithmetic mean
STD    standard deviation
MIN    minimum value
MAX    maximum value

e.g.
proc tabulate data = students;
  class faculty sex;
  var exammark;
  table faculty*sex*exammark*max;
run;

                      FACULTY
         Science                     Arts
           SEX                        SEX
     Male       Female          Male       Female
   EXAMMARK    EXAMMARK       EXAMMARK    EXAMMARK
     MAX         MAX            MAX         MAX
      83          85             79          76

_____________________________________________________________

The operators can be mixed to produce any output you require. For example it is possible to nest 2 variables on the rows and columns of a table.

Table <Row var1>*<Row var2>, <Column var1>*<Column var2> ;
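A minimal sketch of mixing the three operators is given here. It assumes a students data set with the class variables faculty and sex and the analysis variable exammark, as in the examples above; the few records typed in below are invented so that the program runs on its own.

```sas
/* Hypothetical records standing in for the students data set */
data students;
  input faculty $ sex $ exammark;
  datalines;
Science Male 62
Science Female 85
Arts Male 79
Arts Female 58
Arts Female 76
;
run;

proc tabulate data = students;
  class faculty sex;
  var exammark;
  /* comma:    rows are faculty, columns are sex                   */
  /* blank:    ALL is placed after the faculty rows as a total row */
  /* asterisk: N and MEAN of exammark are nested within each sex   */
  table faculty all, sex*exammark*(n mean);
run;
```

Parentheses group the statistics so that both are nested under each value of sex.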

Exercises

9.1 Use proc tabulate on the pulse data to produce
(a) the average of pulse 1 in a table of activity by sex
(b) the average difference in pulse in a table showing whether they smoke by whether they ran or not

_____________________________________________________________

10. Functions and formats

Functions are a useful means of writing SAS code because they simplify the coding involved and often result in you having to write fewer lines of code. Over 120 functions are available within the SAS system. Some functions operate on numeric values, others on character values. Some are specialised and operate on specific types of values such as dates and times. All functions operate on arguments which may be variable names or specific values.

_____________________________________________________________

10.1. MEAN function

Example

data myinfo;
  set info;
  m_val = mean (var1, var2, var3);
run;

If the info data set consisted of the following data,

var1  var2  var3
2.5   5     1.5
6.0   3     3.0

then myinfo would be as follows:-

var1  var2  var3  m_val
2.5   5     1.5   3
6.0   3     3.0   4

The mean function calculates the mean of the three variables listed. Alternatively we could have written the expression as

m_val = mean (of var1 - var3);

An important difference between the MEAN function and the expression

m_val = (var1 + var2 + var3)/3;

is that the MEAN function returns the mean of the nonmissing values. So if we had a missing value for var2, the function would return the mean of var1 and var3 whereas the expression above would return a missing value if any of the var values were missing.
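The difference can be checked with a small test program. This is only a sketch: the first data line repeats the values used above but with var2 set to missing, and the names m_fun and m_expr are invented for the comparison.

```sas
data info;
  input var1 var2 var3;
  datalines;
2.5 . 1.5
6.0 3 3.0
;
run;

data myinfo;
  set info;
  m_fun  = mean (var1, var2, var3);   /* mean of the nonmissing values */
  m_expr = (var1 + var2 + var3)/3;    /* missing if any value missing  */
run;

proc print data=myinfo;
run;
```

For the first observation m_fun is 2 (the mean of 2.5 and 1.5) while m_expr is missing; for the second observation both are 4.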


10.2. NMISS function

The NMISS function returns the number of missing values in a list of variables. This can be useful if for example we want to exclude observations from a calculation where there are too many missing values, e.g. suppose we have recorded 50 readings for each instrument and want to compute the mean of these 50 readings, but only for those instruments with at least 30 readings (that is, with no more than 20 missing values).

data myavges;
  input x1 - x50;
  if nmiss (of x1 - x50) le 20 then ave = mean (of x1 - x50);
  cards;
etc.
_____________________________________________________________

10.3. N function

The N function operates in a similar fashion, but returns the number of non-missing values.
_____________________________________________________________
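As a small sketch of the two functions side by side, the program below uses five readings per record instead of fifty and requires at least three readings before a mean is computed. The data set name readings, the variable names n_ok and n_miss and the data values are all invented for the illustration.

```sas
data readings;
  input x1-x5;
  n_ok   = n (of x1-x5);       /* number of non-missing readings */
  n_miss = nmiss (of x1-x5);   /* number of missing readings     */
  if n_ok ge 3 then ave = mean (of x1-x5);
  datalines;
1 2 . 4 5
. . . 4 5
;
run;

proc print data=readings;
run;
```

The first record has four readings, so ave is 3; the second has only two, so ave is left missing.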

10.4. String Handling Functions

The above functions are for numeric variables and do not work with strings of characters. Some of the most commonly used character functions are

SUBSTR (char_variable, starting_position, length)
which extracts a substring

INDEX (char_variable, index_string)
which returns the position of a substring (0 if the substring is not found)

VERIFY (char_variable, verify_string)
which returns the position of the first character in char_variable that is not present in verify_string (0 if every character is present)

Example

data dept;
  set mydir.jobs;
  tot = substr ('ABCDEFG', 3, 2);
  dept = substr (account, 4, 3);
  ind = index (account, 'tch');
  ver = verify (account, 'sth');
run;

If the data set mydir.jobs consisted of

account  code
spsmrk   m003
spsmrk   m005
spstch   t003

then the data set dept would consist of

account  code  tot  dept  ind  ver
spsmrk   m003  CD   mrk   0    2
spsmrk   m005  CD   mrk   0    2
spstch   t003  CD   tch   4    2

The substr function operates on character values to extract part of a value. The structure of the substr function is substr(argument, position, length). The argument may be a character value or a variable name, the position gives the position from which to start reading, and the length gives the number of characters to read.

Task

Investigate what functions are available in SAS by selecting Help > SAS Help and Documentation, then choose SAS Products, Base SAS, SAS Language Dictionary, Dictionary of Language Elements and Functions and CALL Routines.

10.5. Date and Time Formats

The SAS System processes calendar date values by converting dates to integers representing the number of days between January 1 1960, and a specified date.

For example, the following calendar date values represent the date July 26 1989:

072689     26JUL89     890726
07/26/89   26JUL1989   26 Jul 1989

The SAS date value representing July 26, 1989 is 10799.

The trick is to convert dates to numerics and back again. SAS has many date, time and datetime informats and formats. We read the data in with date/time informats and get them back out of SAS using date/time formats. Many of the date/time informats are more or less the inverse of formats of the same name.

The above dates would be read in using the following informats in the input statement e.g.

data test;
  input var1 MMDDYY6. +1 var2 DATE7. +1 var3 MMDDYY8. +1 var4 DATE9.;
  cards;
072689 26JUL89 07/26/89 26JUL1989
;
run;

To print them out all using different formats we reverse the process e.g.

proc print;
  format var1 DATE9. var2 MMDDYY6. var3 DATE7. var4 YYMMDD6.;
run;

The dot at the end of the informat (or format) indicates that it is the name of an informat (or format) and not the name of a variable.

Details of the different formats and informats available in SAS can be found in the SAS System help.

Blanks and other special characters can be placed between day, month, and year values. Width values must allow space for blanks and special characters.

Note: SAS defaults to a date in the 1900s if yy is two digits. Use the YEARCUTOFF= system option to override the system default and specify a date range of your choice.
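A hedged sketch of the YEARCUTOFF= option follows. The cutoff value 1950, the data set name cutoff and the two dates are chosen purely for illustration; with this setting two-digit years are taken to lie in the 100-year span 1950 to 2049.

```sas
options yearcutoff=1950;   /* two-digit years now mean 1950-2049 */

data cutoff;
  input d date7.;
  format d date9.;         /* print with a four-digit year */
  datalines;
26JUL89
01JAN20
;
run;

proc print data=cutoff;
run;
```

Under these assumptions the first date prints as 26JUL1989 and the second as 01JAN2020.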

Example

Data Lines     SAS Statement          Results
1jan1990       input day date10.;     10958
01 jan 90                             10958
1 jan 90                              10958
1-jan-1990                            10958

(A width of 10 is used so that the longest form, 1-jan-1990, fits within the informat width.)

The TIMEw. informat reads time values in the form hh:mm:ss.ss, where hh and mm are integers representing the hour and minute, and ss.ss is an optional fractional field representing seconds and decimal fractions of seconds. If you do not enter a value for seconds, SAS assumes a value of 0.

Example

Data Line     SAS Statement           Result
14:22:25      input begin time8.;     51745

The DATETIMEw. informat reads date and time values e.g. 8:30 p.m. of May 6 1989 could be represented as 6MAY89:20:30 using DATETIME12.

Another way to specify SAS date/time values is with special constants e.g. 18 February 1951 is represented as '18FEB51'D, high noon as '12:00'T and a moment in date and time as, for example, '1OCT82:15:27:05'DT.
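The constants can be tried out in a DATA step that writes to the log. This is a small sketch using the date and time from the earlier examples; the variable names d and t are arbitrary.

```sas
data _null_;
  d = '26JUL1989'd;          /* date constant */
  t = '14:22:25't;           /* time constant */
  put d= t=;                 /* the stored numeric values         */
  put d= date9. t= time8.;   /* the same values shown via formats */
run;
```

The first PUT statement should write d=10799 t=51745 to the log, matching the values quoted above, and the second should write d=26JUL1989 t=14:22:25.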

Exercises

10.1 A character variable alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

Work out what you think the following functions will return, then check your answers with a SAS program.
(a) index(alphabet, 'FGHIJ')
(b) index(alphabet, 'JNP')
(c) substr(alphabet, 3, 3)
(d) verify(alphabet, 'ABDEI')

10.2 Write a SAS program to find out how old you were on 1 October 1999 (in years).

10.3 Download the SAS data set napier.SAS7BDAT from the web. This contains data on first year Napier students in 1993/4. There are 1872 records and 33 variables. The variables that you need to consider are

crsecd   Numeric variable giving course code
dob      Date of birth

Identify the oldest student. Which course is he/she in ?_________________________

Which course has the youngest students (on average)?_______________

Which courses have the largest numbers of students?__________________________

In order to answer these questions you will need to
(a) Create a variable giving the age in years.
(b) Sort the data set by age, storing your sorted data set in a temporary data set. Examine the data set from the Explorer window to find the oldest student and their course.
(c) Sort the data set by crsecd, storing your sorted data set in a temporary data set. Use proc means to make a new temporary data set that will contain the mean, minimum and maximum age of students on each course. Examine your data set of means to find the course with the youngest students.
(d) Sort the data set by the number of records for each course (_freq_) to find the course with the largest number of students.

11. Iterative processing

11.1. Do loops and arrays

The DO statement allows us to perform an iterative loop

e.g.
do i = 1 to 5;
  (lines of SAS code)
end;

would result in the lines of SAS code being repeated 5 times, with the value of i taking on the values 1, 2, 3, 4 and 5. Here i is used as the counter, 1 is the start value and 5 is the end value. The default increment is 1. We can specify the increment

e.g.
do i = 1 to 7 by 2;
  (lines of SAS code)
end;

would result in the lines of code between the do and the end statements being repeated 4 times, with the value of i taking on the values 1, 3, 5 and 7.
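A complete DATA step built around the second loop might look like the sketch below. The data set name squares and the variable sq are invented for the illustration.

```sas
data squares;
  do i = 1 to 7 by 2;
    sq = i**2;     /* the lines of SAS code to be repeated */
    output;        /* write one observation per iteration  */
  end;
run;

proc print data=squares;
run;
```

The resulting data set should contain four observations, with i equal to 1, 3, 5 and 7 and sq equal to 1, 9, 25 and 49.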

_____________________________________________________________

11.2. Reading data in repeated patterns

The quality control department takes 4 sample cans of oil from a production line and weighs them, every hour for 12 hours. Each record in the raw data contains the following fields:

hour       : the hour in which the samples were taken
weight 1-4 : weights of the four sample cans

The quality control department wants to analyse these data. The first step is to create a SAS data set so that it contains a single observation for each measurement taken. The DATA step must create four observations from each record, i.e.

first record of raw data:-
1 8.024 8.135 8.151 8.065

first four observations in the data set:-
HOUR  WEIGHT
1     8.024
1     8.135
1     8.151
1     8.065

The first INPUT statement reads a value from the first field and assigns it to HOUR. The value for HOUR is the same for all four observations to be created from the first record.

data oil1;
  input hour @;

The single trailing @ sign is used to hold the current record, preventing the next INPUT statement from reading a new record.

The next step is to read each value for WEIGHT (four in each record) and write an observation after each is read. An iterative do loop enables us to write a single pair of INPUT and OUTPUT statements to read a value and write an observation multiple times.

data oil1 (drop=i);
  input hour @;
  do i = 1 to 4;
    input weight @;
    output;          * write one observation for each weight read;
  end;
  cards;
1 8.024 8.135 8.151 8.065
2 7.971 8.165 8.166 8.157
3 8.024 8.135 8.151 8.065
etc.
12 7.971 8.165 8.166 8.157
;

proc print; run;

The results would be

OBS  HOUR  WEIGHT
1    1     8.024
2    1     8.135
3    1     8.151
4    1     8.065
5    2     7.971
6    2     8.165
7    2     8.166
etc.
_____________________________________________________________

11.3. Arrays

Arrays in SAS are used as a shorthand way of processing many variables with a few statements. An array is an ordered list of variable names. It is often used along with a DO statement to carry out an action repeatedly on a sequence of variables. When defining an explicit array, the ARRAY statement must contain

  an array name
  a subscript that indicates the number of elements in the array
  a dollar sign if the array is of character variables
  a description of the elements (for example, the variables a, b, c, d and e)

If we do not know the number of elements in an array we can use an asterisk to define the array

e.g. array sc{*} score1 score2 score5 score8;

although this is a lot less efficient in processing time than specifying the actual array dimensions.

Multidimensional arrays can be specified similarly

e.g. array x{3,5} test1 - test15;

could specify an array where the first dimension is the class number and the second is the test number ( i.e. 5 tests for each of 3 classes). The elements of this array are referred to by (for example) x{2,3} which gives the second row element of the third column of the array.
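As a sketch of how such an array might be processed, the program below fills x{3,5} with nested DO loops. The loop counters cls and tst and the values assigned are invented; the point is only to show which variable each pair of subscripts refers to.

```sas
data tests;
  array x{3,5} test1 - test15;
  do cls = 1 to 3;                  /* first dimension: class number */
    do tst = 1 to 5;                /* second dimension: test number */
      x{cls, tst} = 10*cls + tst;   /* e.g. 23 for class 2, test 3   */
    end;
  end;
  drop cls tst;
run;

proc print data=tests;
run;
```

SAS assigns the variables to the array row by row, so x{2,3} is test8, which here holds the value 23.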

_____________________________________________________________

Example

The following program recodes missing scores in a test to 0

data results;
  infile class6;
  input id age score1-score5;
  if score1 = . then score1 = 0;
  if score2 = . then score2 = 0;
  if score3 = . then score3 = 0;
  if score4 = . then score4 = 0;
  if score5 = . then score5 = 0;
run;

This can be rewritten using arrays as follows

data results;
  infile class6;
  input id age score1 - score5;
  array ss(5) score1-score5;
  do I = 1 to 5;
    if ss(I) = . then ss(I) = 0;
  end;
  drop I;
run;

The reduction in code is not very much with 5 scores but if we had 150 scores arrays would be much more efficient.
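One way to avoid hard-coding the number of scores is to let SAS count the elements and to use the DIM function as the loop limit. The sketch below reads two invented records in-line rather than from the class6 file so that it can be run as it stands.

```sas
data results;
  input id age score1-score5;
  array ss{*} score1-score5;     /* SAS counts the elements itself  */
  do i = 1 to dim(ss);           /* dim(ss) is 5 here               */
    if ss{i} = . then ss{i} = 0;
  end;
  drop i;
  datalines;
1 20 5 . 7 . 9
2 21 . 6 6 6 .
;
run;

proc print data=results;
run;
```

With 150 scores only the variable lists would change, to score1-score150; the loop itself would stay the same.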


_____________________________________________________________

11.4. Generating random numbers

In the data step examples we have examined so far, the SAS system has read a single record during the data step and written it to the output data set at the end of the data step. The SAS instruction to write a record to a data set is

OUTPUT;

and the variables written can be restricted with the statement

KEEP vars;

where vars stands for the variables to be saved.

If you write a SAS data step that does not contain an OUTPUT statement, then SAS will assume that you want a record output at the end of the data step. If your program does contain an OUTPUT statement, then SAS will write a record at this point in the program and not at the end of the data step.

The OUTPUT statement can be used to write a program that will produce a series of 100 records each containing a different uniform random number with this code. The seed 2762 can be replaced with any number you like.

data random;
  do I = 1 to 100;
    x = ranuni (2762);
    output;
  end;
  keep x;
run;

The keep statement is optional; if it is not included all variables will be saved to the file.

In generating random numbers, the ones at the beginning of a sequence can sometimes not be very random for certain choices of seed (check this from your own sequence), but they are generally OK after 500 numbers or so. To make sure that your sequence will be OK, use the following code at the beginning of any program that generates random numbers.

do I = 1 to 500;
  x = ranuni (1279); * or any number as a seed;
end;

This will give the random number generator a whirl to ensure it is running smoothly.

_____________________________________________________________

11.5. Random numbers from a uniform distribution

In order to be able to select a random sample or randomly assign subjects to groups we can use SAS functions to generate random numbers. The two functions UNIFORM(0) and RANUNI(0) generate uniform random numbers in the range from 0 to 1. Random number generators require an initial number, called a seed, which they use to calculate the first random number. This number is then used to generate the next and so on. For both of these functions, a seed of zero will cause the function to use a seed derived from the time clock, thus generating a different series each time it is used. The RANUNI function can be seeded with any number, but the UNIFORM function must be seeded with a 5, 6 or 9 digit odd number. In either case, if you supply the seed, the function will generate the same series of random numbers each time. In order to generate a series of random numbers from 1 to 100, we could use

X = 1 + 99*RANUNI (0);
_____________________________________________________________

11.6. Random numbers from a normal distribution

To generate random numbers from a normal distribution with a mean of 0 and a standard deviation of 1 we can use the RANNOR function which works in a similar way to RANUNI.
_____________________________________________________________
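A short sketch of RANNOR in use is given below. The seed 2762 is arbitrary, and the mean of 50 and standard deviation of 10 used for the rescaled variable x are chosen only for illustration.

```sas
data normals;
  do i = 1 to 100;
    z = rannor (2762);   /* standard normal: mean 0, sd 1          */
    x = 50 + 10*z;       /* rescaled to mean 50, sd 10 (assumed)   */
    output;
  end;
  drop i;
run;

/* The sample means and standard deviations should be close to, */
/* but not exactly equal to, the theoretical values             */
proc means data=normals mean std;
  var z x;
run;
```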

11.7. The SAS Program Data Vector

While at work on a DATA step, the SAS System maintains a temporary data structure in computer memory called the SAS program data vector, or PDV. The program data vector represents one observation, a data row, and can be thought of as a linear set of boxes in which values of SAS variables can be contained. Unlike a SAS data set, which survives between steps, the PDV is a dynamic entity that is created during a DATA step, and goes away after the step has completed execution.

To construct the PDV is one of the DATA step compiler’s first jobs as it passes through DATA step source code. The compiler looks at all the SAS statements in the step’s source to find out what variables are named - in INPUT, attribute, or other statements - and creates a PDV with space for each variable’s length.

There are a couple of automatic system variables also in the program data vector: the variables _N_ and _ERROR_. These are maintained by the compiler and may be accessed by the program, though they do not get written to the new data set(s). _N_ contains a count of how many times the DATA step has begun execution from the top (i.e. the number of records), and _ERROR_ is set to 1 (true) when a data error occurs. When a DATA step is executing, each time it begins another iteration the values of variables to be created by INPUT or by assignment in the PDV are initialised to missing (unless a RETAIN or Sum statement has been used), _N_ is incremented, and _ERROR_ is set to zero. When the DATA step returns to the top for the next iteration, the PDV is reinitialised and the process repeats.
_____________________________________________________________

11.8. The RETAIN and Sum statements

Normally, variables in the PDV that are named with assignment or with INPUT statements are initialised to missing each time the DATA step begins a new iteration. The RETAIN statement,

RETAIN [<variables [value]> ....];

causes variables to keep their values from the previous iteration at initialisation. They can still be changed if INPUT reads a new observation, or when an assignment statement (including the Sum statement) is executed. The RETAIN statement can only be applied to ‘new’ variables, that is, ones which are being created within the DATA step. If a constant value is specified after the variable list, that value is given at the first iteration; otherwise, the variables start with a missing value.

Example

Given the SAS data set history :

Year  Month  No_sold
1990  1      10
1990  2      12
1990  3      8
1990  4      6
1990  5      9

the program :

data changes;
  retain no_last;
  set course.history;
  compare = no_sold - no_last;
  no_last = no_sold;
run;

produces the PDV

_N_  _ERROR_  YEAR  MONTH  NO_SOLD  COMPARE  NO_LAST
1    0        1990  1      10       .        10
2    0        1990  2      12       2        12
3    0        1990  3      8        -4       8
4    0        1990  4      6        -2       6
5    0        1990  5      9        3        9

The sum statement is a special type of assignment statement, provided as a convenience for incrementing variables during the DATA step. The statement

a + 7;

is equivalent in action to the statements

RETAIN A 0;
A = SUM (A, 7);

i.e. 7 is added to the previous value of a, which is retained from one iteration to the next and starts at zero.
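A common use of the sum statement is to build a running total. The sketch below uses the first three No_sold values from the history example; the data set name running and the variable total are invented.

```sas
data running;
  input no_sold;
  total + no_sold;   /* sum statement: total is retained and starts at 0 */
  datalines;
10
12
8
;
run;

proc print data=running;
run;
```

The variable total should take the values 10, 22 and 30.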

Exercises

11.1 Results of a survey are recorded for 1996 and 1997. However an extra question was asked in 1997. Create a SAS data set from the following data showing year and the answers to the questions.

1996 4 8 3 5 6 5
1996 5 7 4 5 6 8
1997 3 5 4 7 4 5 6
1997 5 3 4 3 6 7 8

11.2 Rewrite the following program using arrays:-

data test;
  input a b c x1-x3 y1-y3;
  if a = 999 then a = .;
  if b = 999 then b = .;
  if c = 999 then c = .;

  if x1 = 999 then x1 = .;
  if x2 = 999 then x2 = .;
  if x3 = 999 then x3 = .;

  if y1 = 999 then y1 = .;
  if y2 = 999 then y2 = .;
  if y3 = 999 then y3 = .;

  datalines;
3 5 2 7 5 999 2 5 999
2 9 4 7 2 4 999 4 999
3 8 3 0 3 2 999 7 1
;
run;

11.3 A chi-squared random variable with k degrees of freedom is the sum of the squares of k independent standard normal random variables. Generate and plot the frequency distributions of chi-squared variables with 5, 10 and 30 degrees of freedom.

12. Further topics

12.1. Combining Data Sets

12.1.1. Concatenating Data Sets

To concatenate data sets means to combine similar data sets into a single new data set. In its simplest form the original data sets contain the same variables and the combined data set will contain the original data sets ‘one on top of the other’.

Suppose SAS data set wood08 contains values for the five variables id (alpha-numeric), age, diameter, height and variety (alpha-numeric) obtained from a survey of trees in a wood having reference number 08. Suppose further that similar results are available in SAS data sets wood48 and wood69. The following code will combine the three data sets into a new SAS data set woodcom (in the order 08, followed by 48, followed by 69).

data woodcom;
  set wood08 wood48 wood69;
run;

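If the original data sets contain different variables then the combined data set will contain all of the variables, and observations that come from a data set lacking a particular variable will have missing values for that variable.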
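The effect can be seen with two very small data sets. The names wood_a, wood_b and wood_ab and the values in them are invented for this sketch.

```sas
data wood_a;              /* has id and age */
  input id $ age;
  datalines;
T01 35
T02 50
;
run;

data wood_b;              /* has id and height, but no age */
  input id $ height;
  datalines;
T03 12.5
;
run;

data wood_ab;
  set wood_a wood_b;      /* wood_a rows first, then wood_b */
run;

proc print data=wood_ab;
run;
```

The combined data set should have three observations and the variables id, age and height; height is missing for T01 and T02, and age is missing for T03.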

12.1.2. Merging Data Sets

We merge data sets when we combine data sets containing different information. In its simplest form (match-merging) we combine the data sets on the basis of a common variable which typically identifies each case or row.

Consider again SAS data set wood08 containing values for the five variables id (alpha-numeric), age, diameter, height and variety (alpha-numeric). Suppose a related SAS data set woodrs08 contains values for the following variables: id (alpha-numeric) plus numerical scores for damage, bark_dep and condit.

We will merge the two data sets into a combined data set having variables id, age, diameter, height, variety, damage, bark_dep and condit. Matching will be by the variable id (termed the ‘BY variable’). However, we must sort both data sets by the BY variable. This is illustrated in the following code.

proc sort data=wood08; * Sort the first data set by id ;
  by id;
run;

proc sort data=woodrs08; * Sort the second data set by id ;
  by id;
run;

data woodall08; * Merge the two data sets using id as a key ;
  merge wood08 woodrs08;
  by id;
run;

The MERGE operation will combine cases having the same values for id in data sets wood08 and woodrs08. If values of id exist that are not present in both data sets then missing values for the relevant variables will be inserted.
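The treatment of unmatched values can be checked with a small sketch. The data sets trees and scores, the combined data set and all the values below are invented; T02 appears only in the first data set and T04 only in the second.

```sas
data trees;
  input id $ age;
  datalines;
T02 50
T01 35
T03 20
;
run;

data scores;
  input id $ damage;
  datalines;
T01 2
T04 5
T03 1
;
run;

proc sort data=trees;     /* both data sets must be sorted by id */
  by id;
run;

proc sort data=scores;
  by id;
run;

data combined;
  merge trees scores;
  by id;
run;

proc print data=combined;
run;
```

The result should have four observations: T01 and T03 with both age and damage, T02 with damage missing, and T04 with age missing.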

Exercises

12.1. The ASCII data set mod273ft.txt contains results from a module taught to full-time students with values for student name (alpha-numeric), CW1, CW2, CW3, combined coursework and exam. Data set mod273pt.txt contains the same information for part-time students. The data sets are available on the module web page and should be examined with Notepad or WordPad.

(a) Write a program which reads in the two ASCII data sets and combines them into a single permanent SAS data set with suitable variable names. Confirm that the data set has been constructed correctly.

(b) Modify your program so that the combined data set contains a new variable indicating which group (full-time or part-time) each student belongs to.

12.2. The ASCII data set mod273ptsp.txt gives background information on the statistical software that is used at work by the part-time students. Values are given for name, excel, sas, spss (all alpha-numeric, software results are Y or N).

Write a program which reads in the two data sets mod273pt.txt and mod273ptss.txt and merges them into a single permanent SAS data set. Confirm that the data set has been constructed correctly.

12.2. Hints on Using Word with SAS and SAS/INSIGHT

12.2.1. Copying a Selection from the Output Window (WINDOWS)

In SAS Output Window: Highlight text, then Edit > Copy

In Word document: Edit > Paste

Note: In the event of difficulty use Ctrl/C for Copy and Ctrl/V for Paste.

12.2.2. Saving the Whole Output Window

In SAS Output Window: File > Save As
Choose directory and file name.
The automatic file extension is .lst

Word and other text editors have no difficulty opening or inserting a list file.

12.2.3. Choice of Font within Word Document

For tables of figures you are recommended to use a monospace font, i.e. one that has a constant width for all characters. Arial and Times New Roman are not monospace fonts.

Examples are:

SAS Monospace (10 point)    SAS Monospace (12 point)
1 2 3 4 5                   1 2 3 4 5
(You may need SAS running to get this font.)

Courier New (10 point)      Courier New (12 point)
1 2 3 4 5                   1 2 3 4 5

To avoid tables ‘wrapping round’ you could remove unnecessary spaces to the left or reduce the size of the font.

It can be effective to use different fonts for different parts of the document. For example you might use a standard font like Arial or Times New Roman for text and a monospace font for tables. Monospace fonts might also be used for file names etc.

12.2.4. Copying from a Graphics Window (WINDOWS)

In SAS Output Window: Edit > Copy

In Word document: Edit > Paste Special, then choose Device Independent Bitmap

To reduce the file size: Edit > Paste, followed by Edit > Cut; then Edit > Paste Special and choose png or jpeg

When you try to reposition the picture you may find that it jumps around the document in an uncontrollable manner. This can be eliminated by adding dummy returns that will lie under the picture, then choosing the ‘In front of text’ wrapping style from the Layout tab of the Format Picture dialogue box (right-click on the picture to select the Format option). The dummy returns must be held together using paragraph formatting, and the picture anchor locked onto the dummy returns. This is shown in the diagram below.

Figure 14 Picture control using ‘In front of text’ wrapping style.

[Diagram labels: Format Picture dialogue; Picture; Enter returns, select all, then Format > Paragraph > Keep lines together + Keep with next; Picture anchor locked (Advanced setting).]

12.2.5. Copying and Pasting from SAS/INSIGHT

Tables

Tables can be saved in graphics form or as text (recommended).

In SAS/INSIGHT Analysis Window: Click on ‘arrow’ at top of table, then choose Save

The table (as text) is put into the base SAS Output Window. Copy from the Output Window as described above.

See below for saving table in graphics form.

Graphs (WINDOWS)

In SAS/INSIGHT Analysis Window: Click on border of graph (or table), then Edit > Copy

In Word document: Edit > Paste Special, then choose Device Independent Bitmap

Note: If you have highlighted points on your graph hold down Ctrl when you click on the border.

12.2.6. Left Alignment of SAS Output

Copying and pasting from the SAS Output Window is easier if the output is already aligned to the left. The following option will ensure that all future output is left aligned:

options nocentre;
run;

13. Solutions to exercises

1.1

data ELEVEN;
  input ID $ height weight;
  ratio=weight/height;
datalines;
59 135 25
82 146 33
27 153 56
52 154 51
55 139 31
13 131 25
01 149 43
15 137 32
71 133 30
78 149 35
12 141 33
37 164 48
28 146 37
48 149 45
69 147 36
16 152 47
run;
/* Data came from Exercise 1.1*/
proc means;
run;

1.2

...bmi = 100*weight/height;…

1.3

data FORTH;
  input site $ salinity phos nitrogen chloro faecal_c;
  phos2=1000*phos;
datalines;
CR 30.11 0.068 0.297 1.693 2.917
WG 31.48 0.059 0.165 1.464 3.149
EG 31.79 0.068 0.144 1.100 3.196
SF 31.37 0.185 0.278 1.787 3.418
PB 31.50 0.116 0.223 2.099 3.049
JO 31.60 0.106 0.207 1.067 2.903
SS 30.50 0.047 0.162 1.563 2.895
FN 31.96 0.060 0.130 0.753 2.797
run;
proc means;
run;

2.1

libname week2 'c:\sas\sasdata';
data week2.eleven;
  input ID $ height weight;
  ratio=weight/height;
  /*Program must be cut and pasted from word document*/
datalines;
59 135 25
82 146 33
27 153 56
52 154 51
55 139 31
13 131 25
01 149 43
15 137 32
71 133 30
78 149 35
12 141 33
37 164 48
28 146 37
48 149 45
69 147 36
16 152 47
run;

2.2 (b)

libname week2 'c:\sas\sasdata';
data week2.pulse;
  infile 'c:\sas\sasdata\pulse.dat';
  input pulse1 pulse2 ran $ smokes $ sex $ height weight activity $;
proc print;
run;

2.4 (b)

libname mydir 'c:\temp';
data mydir.ex2_4;
  infile 'c:\temp\ex2_4.prn';
  input size $ colour $ price cost;
run;

proc print data = mydir.ex2_4;
  var colour size price;
run;

2.4 (c)

libname mydir 'c:\temp';
data mydir.ex2_4;
  infile 'c:\temp\ex2_4.prn';
  input size $ 1-8 colour $ 9-19 price 20-24 cost 25-32;
run;

proc print data = mydir.ex2_4;
  var colour size price;
run;

2.4 (d)

libname mydir 'c:\temp';
data mydir.ex2_4;
  infile 'c:\temp\ex2_4.prn';
  input @1 size $8. @9 colour $11. @20 price 5.2 @29 cost 4.2;
run;

proc print data = mydir.ex2_4;
  var colour size price;
run;

2.5

libname mydir 'c:\temp';
data mydir.houses;
  infile 'c:\temp\houses.dat';
  input style 1 sqfeet 3-6 bedroom 8 baths 10-12 price 14-19;
run;

proc print;
run;

2.6

libname mscsas 'c:\kirsty\mscsas';
data cars;
  infile 'd:\kirsty\temp\cars.prn' firstobs = 2;
  input mpg 1-8 cylndrs 9-16 displace 17-24 hrsepwr 25-33 accel 34-41
        year 42-49 weight 50-57 origin 58-65 make $ 66-75 model $ 76-89
        price 90-93;
run;

4.1(a)

options nocentre;
libname unit4 'c:\sas\sasdata';
proc sort data=unit4.pulse out=smokes;
  by smokes ran;
proc means maxdec=1;
  var pulse2;
  by smokes ran;
run;

(b)

options centre;
proc means data=smokes alpha=0.05 maxdec=1 clm;
  var pulse2;
  by smokes ran;
run;

(c) and (d)

proc univariate data=smokes plot;
  var pulse2;
  by smokes ran;
proc freq data=smokes;
  tables smokes*ran;
run;

4.2

proc sort data = sashelp.retail out = sorted;
  by year;
proc means maxdec=2 N mean std;
  var sales;
  by year;
  output out=summary
         mean=mean_sls;
run;
proc print data = summary;
run;

5.1

proc plot data=unit5.eleven;
  plot weight*height='+';
run;

5.2

proc plot data=unit5.pulse;
  plot pulse2*weight=sex;
run;

proc chart data=unit5.pulse;
  pie activity;
run;

proc chart data=unit5.pulse;
  hbar ran/sumvar=pulse2 type=mean;
run;

proc chart data=unit5.pulse;
  vbar activity/type=pct group=sex;
run;

proc chart data=unit5.pulse;
  block activity/group=smokes sumvar=pulse1 type=mean;
run;

6.1

libname unit6 'c:\sasdata';
run;

proc plot data=unit6.beetles;
  plot area*mlength;
run;

proc reg data=unit6.beetles;
  model area=mlength;
  output out=regout
         student=stdresid;
  plot student.*mlength;
run;

proc univariate data=regout plot normal;
  var stdresid;
run;

8.1

libname unit8 'c:\sasdata';
run;
data mpulse (drop=sex height weight);
  set unit8.pulse (where=(sex='1'));
  label pulse1='First pulse rate'
        pulse2='Second pulse rate';
run;
proc print data=mpulse label;
run;

8.2

Note that the type of each format must correspond to the type of the variable in the data set pulse, i.e. numeric variables need numeric formats and character variables need character formats (names beginning with $). Values of numeric variables are written without quotes; values of character variables are quoted.

libname library 'c:\sasdata';
run;
proc format library=library;
  value $ran '1'='ran in place'
             '2'='did not run in place';
  value $smokes '1'='smokes regularly'
                '2'='does not smoke regularly';
  value $sex '1'='male'
             '2'='female';
  value $activity '1'='slight'
                  '2'='moderate'
                  '3'='a lot';
run;

proc freq data=unit8.pulse;
  tables sex*smokes smokes*activity sex*ran/nocol norow nocum nopercent;
  format sex $sex. ran $ran. smokes $smokes. activity $activity.;
run;
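The distinction between numeric and character formats can be seen side by side in a small sketch. The data set demo, the variables flag_n and flag_c and the format names yesno_n and $yesno_c are hypothetical and are used here for illustration only.

```sas
* Hypothetical format and data set names for illustration only ;
proc format;
  value  yesno_n  1='yes'  0='no';     * numeric format so values are unquoted ;
  value $yesno_c 'Y'='yes' 'N'='no';   * character format so values are quoted ;
run;

data demo;
  input flag_n flag_c $;
datalines;
1 Y
0 N
;
run;

proc print data=demo;
  format flag_n yesno_n. flag_c $yesno_c.;
run;
```

Applying a character format to a numeric variable (or the reverse) in the FORMAT statement would normally produce an error in the log, so it is worth checking the variable types with PROC CONTENTS before defining formats.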

8.3

proc format library=library;
  value pulsrate low-76=1
                 77-high=2;
run;

proc freq data=unit8.pulse;
  tables pulse1*pulse2/nocol norow nocum nopercent chisq;
  by ran;
  format pulse1 pulse2 pulsrate.;
run;

9.1 (a)

proc format;
  value sex 1 = 'male'
            2 = 'female';
  value activity 1 = 'slight'
                 2 = 'moderate'
                 3 = 'a lot';
run;

proc tabulate data = mydir.pulse;
  class sex activity;
  var pulse1;
  table activity,sex * pulse1*mean;
  format sex sex. activity activity.;
  title 'Average pulse without exercise';
run;

9.1(b)

proc format;
  value smokes 1 = 'smokes regularly'
               2 = 'does not smoke regularly';
  value ran 1 = 'ran in place'
            2 = 'did not run in place';
run;

data mydir.pulsev2;
  set mydir.pulse;
  pulsedif = pulse2 - pulse1;
run;

proc tabulate data = mydir.pulsev2;
  class smokes ran;
  var pulsedif;
  table smokes, ran * pulsedif*mean;
  format smokes smokes. ran ran.;
  title 'Average difference in pulse by whether they smoke and/or ran';
run;

10.1

data test;
  input alphabet $26.;
  pt1 = index(alphabet,'FGHIJ');
  pt2 = index(alphabet,'JNP');
  pt3 = substr(alphabet,3,3);
  pt4 = verify(alphabet,'ABDE');
datalines;
ABCDEFGHIJKLMNOPQRSTUVWXYZ
run;

proc print;
run;

10.2

data ageatoct;
  input dob YYMMDD6.;
  date = '01OCT99'D;
  ageatoct = (date-dob)/365;
datalines;
740625
run;

10.3

libname mydisc 'd:\kirsty\temp';

* first set up a variable for age;

data mydisc.napier2;
  set mydisc.napier;
  enddate = '31DEC93'D;  * sets up a variable as a date constant;
  age = (enddate - dob)/365.25;
run;

* create a temporary SAS data set called agesort, sorted by age and examine data set to identify oldest student;

proc sort data = mydisc.napier2 out = agesort;
  by age;
run;

* create a temporary SAS data set called napsort, sorted by course;

proc sort data = mydisc.napier2 out = napsort;
  by crsecd;
run;

* create a temporary SAS data set called means containing the mean, max and
  min ages by course. Examine the data set to find which course has the
  youngest students (on average) Anything strange?! ;

proc means data = napsort;
  var age;
  by crsecd;
  output out = means mean = mnage max = maxage min = minage;
run;

* sort by number of records for each course and examine data set to find the
  course with the largest number of students;

proc sort data = means;
  by _freq_;
run;

11.1

data survey;
  input year @;  * hold the line;
  if year = 1996 then input q1-q6;
  else if year = 1997 then input q1-q7;
datalines;
1996 4 8 3 5 6 5
1996 5 7 4 5 6 8
1997 3 5 4 7 4 5 6
1997 5 3 4 3 6 7 8
;
run;

proc print;
run;

11.2

data test;
  input a b c x1-x3 y1-y3;
  array tt (9) a b c x1-x3 y1-y3;
  do i = 1 to 9;
    if tt(i) = 999 then tt(i) = .;
  end;
  drop i;
datalines;
3 5 2 7 5 999 2 5 999
2 9 4 7 2 4 999 4 999
3 8 3 0 3 2 999 7 1
run;

proc print;
run;

11.3

data chisq;
  n = 6;  * degrees of freedom required +1;

  do i = 1 to 500;  * to ensure numbers are random;
    x = rannor(3059);
  end;

  do j = 1 to 500;  *generate 500 chi-squared values;
    chi = 0;
    do i = 1 to n;
      x = rannor(3059);
      chi = chi + x*x;
      retain chi;
    end;
    output;
    keep chi;
  end;
run;

proc gchart;
  vbar chi;
run;

12.1.(b)

/* concatmod.sas */

libname xyz 'c:\Documents and Settings\mp12\My Documents\sasfiles\sasdata8';
data modft;
  infile 'c:\Documents and Settings\mp12\My Documents\data\mod273ft.txt';
  input name $ 1-16 cw1 cw2 cw3 com_cw exam;
  group = 'FT';
run;
data modpt;
  infile 'c:\Documents and Settings\mp12\My Documents\data\mod273pt.txt';
  input name $ 1-16 cw1 cw2 cw3 com_cw exam;
  group = 'PT';
run;
data xyz.mod273com;
  set modft modpt;
run;
proc print data=xyz.mod273com;
run;

12.2.

/* mergemod.sas */

libname abc 'c:\Documents and Settings\mp12\My Documents\sasfiles\sasdata8';
data modpt;
  infile 'c:\Documents and Settings\mp12\My Documents\data\mod273pt.txt';
  input name $ 1-16 cw1 cw2 cw3 com_cw exam;
run;
proc sort data=modpt;
  by name;
run;
data modptss;
  infile 'c:\Documents and Settings\mp12\My Documents\data\mod273ptss.txt';
  input name $ 1-16 excel $ sas $ spss $;
run;
proc sort data=modptss;
  by name;
run;
data abc.modptcom;
  merge modpt modptss;
  by name;
run;
proc print data=abc.modptcom;
run;

proc univariate data = pulse;
  var before;
  histogram before / midpoints = 0 to 200 by 10;
  title 'Histogram Pulse Data';
run;