Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)

Boot Camp R Fall 2013 PT651 Lab Component

Slides by Sherri Verdugo

•  Brief definition of statistics: The subject of statistics deals with techniques for collecting, analyzing, and drawing conclusions from data. In science, or any subject, even psychology, you must know how to read the literature and how to interpret data. Statistics is your key to knowing when you are given false data or, for example, whether that new medication really works. The definition is from Statistical Methods, Eighth Edition, by George W. Snedecor and William G. Cochran.

Statistics Brief description.

R Boot Camp and Descriptive Information

Agenda

Introduction to the Lab Component

Wrap Up

R: The Basics and RStudio: The Basics

Descriptive Statistics & R

Revisiting the Blog and Installation

Parameters, Statistics oh my….the magic behind the curtain

Measures of Variability & R


Installing on your desktop means that the environment is local on your machine!

Installing on the Cloud means you are portable and accessible!

• The Blog Installation tips - http://g2research.blogspot.com/2013_08_01_archive.html

✓ PC, Mac, Unix, etc.

• Desktop or Cloud R & RStudio
- R on the Desktop

✓ R on the Cloud

The Blog and Installation of R

Revisiting the Past

Desktop versus the cloud

R

R is available locally on your computer and depends on the size of your computer! R is available on the internet! Fast!

RStudio and R are located on the cloud.

Cloud

R is located on your local machine. Drawback: it is available in only one environment.

Desktop


A sample represents a population. Dependent Variables are cheaper and easier to change :)

The magic behind the curtain.

Parameters, Statistics, oh my!

Parameters: population characteristics.

Statistics: measures of sample data.

f = Frequency Distribution: rank order of data values for a variable.

Caveat:
• Class: grouping of scores with a unique range
• Independent Variables: costly or nearly impossible to change
• Dependent Variables: outcome variables
• Mutually Exclusive: no overlap in classes
• Mutually Exhaustive: each score fits into a class
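A minimal sketch of a frequency distribution in R, added here for illustration. The scores and the class width of 10 are made-up assumptions, not lab data. It shows classes that do not overlap (mutually exclusive) and that cover every score (each score fits into a class).

```r
# Hypothetical scores, invented for this illustration only.
scores <- c(52, 67, 71, 74, 78, 81, 85, 88, 93, 99)

# Class boundaries of width 10: [50,60), [60,70), ..., [90,100).
# right = FALSE closes each class on the left, so no score can
# fall into two classes (mutually exclusive).
breaks <- seq(50, 100, by = 10)
classes <- cut(scores, breaks = breaks, right = FALSE)

# Frequency (f) of scores in each class.
freq <- table(classes)
freq

# Every score landed in a class if there are no NA values.
all_classified <- !anyNA(classes)
all_classified
```

Note that with these boundaries a score of exactly 100 would not fit into any class, so the breaks have to be chosen to cover the full range of the data.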


Descriptive Statistics: ways to represent data using characterization of central tendency, shape and variability in the data.

Three Measures of Central Tendency

Measures of Central Tendency: a central or typical value for a probability distribution.

M3

Which score changes the most when you change the scores in the data? Frequencies tell a story as well!

Mean: the average of all scores. $\bar{x}$ is the sample mean and µ is the population mean.

Median: the middle number in a sequence. Place the numbers in order. For odd N, this is the exact midpoint. For even N, add the two middle numbers together and divide by two.

Mode: the most frequent score of the sequence.

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$$

$$\mu_x = \sum x \, P(x)$$
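A small sketch of the three measures of central tendency in R. The scores and the probabilities are hypothetical values picked so the arithmetic is easy to check by hand. Base R has no built-in function for the statistical mode, so the mode here is found by tabulating the scores, which is one possible approach.

```r
# Hypothetical scores, invented for this illustration only.
scores <- c(2, 4, 4, 5, 7, 9, 11)

# Mean: sum of the scores divided by the number of scores.
mean_manual <- sum(scores) / length(scores)
mean_builtin <- mean(scores)

# Median: middle value of the ordered scores (N = 7 is odd here).
median_value <- median(scores)

# Mode: tabulate the scores and take the most frequent one.
# With ties, which.max() returns the first of the tied values.
counts <- table(scores)
mode_value <- as.numeric(names(counts)[which.max(counts)])

# Population mean as an expected value: mu = sum of x * P(x).
# x and p are a made-up discrete distribution.
x <- c(0, 1, 2)
p <- c(0.25, 0.50, 0.25)
mu <- sum(x * p)

c(mean = mean_builtin, median = median_value, mode = mode_value, mu = mu)
```

Try changing the 11 to 110 and rerun it: the mean moves a lot, while the median and mode stay put.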

Variability = dispersion of scores.

Adding the deviations results in zero? That’s not what I need. NEW CONCEPT: Sum of Squares (SS).

Percentiles = the percentage of scores in the data set that fall at or below a given score. What percentile is a score in the data set? Quartiles: Q1=25%, Q2=50%, Q3=75%, Q4=100%

Standard Deviation = how much the scores typically vary from the mean, that is, from the value we expect.

Range = difference between highest and lowest scores.

Variance = how far a set of numbers are “dispersed” or spread out.

Coefficient of Variation = unitized risk expressed as a ratio.

Do you see any patterns?

Name five measures of variability

Variability


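Standard deviation:
$$s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Variance:
$$s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Coefficient of variation (as a ratio):
$$cv = \frac{\sigma}{\mu}$$

Sum of squares:
$$SS = \sum (x - \bar{x})^2$$

Range:
$$\text{Range} = X_{\text{highest score}} - X_{\text{lowest score}}$$

Coefficient of variation (as a percentage):
$$cv = \frac{\sigma}{\mu} \times 100$$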
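The formulas above can be checked by hand in R. This is a sketch on hypothetical scores. It assumes the sample versions (dividing by n − 1, and using the sample standard deviation and sample mean for the coefficient of variation), which is also what var() and sd() compute.

```r
# Hypothetical scores, invented for this illustration only.
scores <- c(2, 4, 4, 5, 7, 9, 11)
n <- length(scores)

# Deviations from the mean always add up to zero,
# which is why we square them first.
deviations <- scores - mean(scores)
sum_of_deviations <- sum(deviations)

# Sum of Squares (SS), variance, and standard deviation by hand.
ss <- sum(deviations^2)
variance_manual <- ss / (n - 1)
sd_manual <- sqrt(variance_manual)

# Range as a single number: highest score minus lowest score.
range_manual <- max(scores) - min(scores)

# Coefficient of variation as a percentage (sample sd / sample mean).
cv_percent <- sd_manual / mean(scores) * 100

# Compare the hand calculations with the built-in functions.
c(variance_manual = variance_manual, var_builtin = var(scores))
c(sd_manual = sd_manual, sd_builtin = sd(scores))
```

Do you see the pattern? SS feeds the variance, the variance feeds the standard deviation, and the standard deviation feeds the coefficient of variation.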

Tip: Remember help() and Google/Google Scholar. The > symbol is the command prompt.

Examples in R using the Cloud or Desktop

•  How can we find the descriptive statistics in R?

•  Look at R scripts for: 1) mean, 2) median, 3) mode, 4) range, 5) percentiles, 6) variance, 7) standard deviation, and 8) coefficient of variation

•  Class Exercise: 1_descriptives_general.R

Descriptives & R Continued

#Descriptive Statistics: General Entry Level Exercise
2+2; f=2+2; f #R is a calculator and you can store an answer
a=4; a; c=a*2; c #Solve an equation and store variables
fun1=((a+c+2)/2); fun1 #Result is 7: (4+8+2)=14 and 14/2=7
#Step 1: generate a few random numbers to look at
rnorm(10, mean=1.2, sd=3.4); s=rnorm(10, mean=1.2, sd=3.4); s
me=mean(s); me #Mean
med=median(s); med #Median
table(s); names(sort(-table(s)))[1] #Mode. Not interesting because we don't have any duplicates
ra=range(s); ra #range: low and high scores
mran=max(s); mran; miran=min(s); miran
ran=mran-miran; ran #we are computing the range from the extremes of the data (hi and lo)
qua=quantile(s); qua #quantiles
per=quantile(s, c(.32, .57, .98)); per #percentiles
varia=var(s); varia #variance
stds=sd(s); stds #standard deviation
stdsc=sqrt(varia); stdsc #double check the std. deviation :)
coefs=((stds/me)*100); coefs #coefficient of variation, sanity check
library(raster); cv(s, na.rm=TRUE) #double check with a package

Descriptive Statistics & R An Example Script.

Can we normalize scores? Of course we can.

Descriptive Statistics and R

Normal…we want normal for now :) Work through the tutorial :)

Yes!

1) Samples from a normal distribution = distribution of sample means are normal.
2) The mean of the distribution of sample means = the mean of the "parent population," the population from which the samples are drawn.
3) The higher the sample size that is drawn, the "narrower" will be the spread of the distribution of sample means.

Can we do this in R?
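Yes, with a small simulation. This is only a sketch: the skewed parent population (exponential with mean 1), the sample sizes 5 and 50, the 5000 repetitions, the seed, and the helper name sample_means are all assumptions chosen for illustration.

```r
set.seed(651)  # arbitrary seed so the run is repeatable

# Hypothetical helper: draw `reps` samples of size n from a skewed
# parent population (exponential, mean 1) and keep each sample mean.
sample_means <- function(n, reps = 5000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

means_small <- sample_means(5)    # small samples
means_large <- sample_means(50)   # larger samples

# Point 2: the mean of the sample means sits near the parent mean (1).
c(small = mean(means_small), large = mean(means_large))

# Point 3: the larger sample size gives the narrower spread.
c(small = sd(means_small), large = sd(means_large))
```

Running hist(means_large) afterwards should show a roughly bell-shaped picture even though the parent population is skewed.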


Central Limit Theorem

This is the point where you should see the pattern between the previous slides and the new concept of the CLT. If you feel like it doesn’t make sense, that is normal (Yes, I like that word a lot!)…. However, you can use this to work through statistical problems….and solve complex problems with relative ease. Parts of the puzzle have been given to you.

Sanity Check!

The distribution of many, many trials is approximately normal, even if the distribution of each trial is not normal! http://www.learner.org/courses/mathilluminated/units/7/textbook/06.php Why is this important? You can work with the law of large numbers and with approximations.


The population itself might not be normal. Under the Central Limit Theorem, however, the distribution of sample means can be approximately normal, and that means that about 68.26% of values will fall within ± 1 s.d. of the mean.
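The ± 1 s.d. figure can be checked in R with pnorm(), the cumulative probability function for the normal distribution. This is a quick sketch using the standard normal (mean 0, s.d. 1).

```r
# Area under the standard normal curve between -1 and +1 s.d.
within_one_sd <- pnorm(1) - pnorm(-1)

# Express it as a percentage.
within_one_sd * 100
```

R reports a value slightly above 68.26 because printed z tables round the area at each tail.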

Standard scores (i.e. z-scores) represent a score's distance from the mean, measured in standard deviations. When the mean and standard deviation are known, any score can be converted to a z-score. Think IQ tests, SAT, GRE, MCAT…etc.
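A short sketch of standardizing scores in R. The scores are hypothetical, and the sample mean and sample standard deviation are used here in place of known population values.

```r
# Hypothetical scores, invented for this illustration only.
scores <- c(2, 4, 4, 5, 7, 9, 11)

# z-score by hand: distance from the mean in standard deviation units.
z_manual <- (scores - mean(scores)) / sd(scores)

# scale() does the same centering and scaling; it returns a matrix,
# so as.numeric() turns the result back into a plain vector.
z_scale <- as.numeric(scale(scores))

# Standardized scores have mean 0 and standard deviation 1.
round(z_manual, 2)
c(mean = mean(z_manual), sd = sd(z_manual))
```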

To solve any mystery…you are looking for patterns. Statistics is a tool that gives you the ability to look for a pattern in a sample that represents a portion of the population.

Dispersions of the scores



Patterns.

Can we Standardize?

Is the Population Normal???

Looking at Real Data that is “Clean”


Nothing is impossible with R. If you can’t figure it out…think logically. Ask what it is that you want to do, and then revisit the steps that you have taken so far.

• Clean Data: has no missing values
- Homework examples

✓ Extremely rare in the real world….

• Real Data: has missing values
- Can be handled by R

✓ Big steps involved in cleaning the data set….
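A minimal sketch of how R handles missing values, which it stores as NA. The scores are hypothetical, and dropping the missing values is only one of several possible cleaning choices.

```r
# Hypothetical scores with two missing values (NA).
scores <- c(12, NA, 15, 9, NA, 14)

# Count the missing values.
n_missing <- sum(is.na(scores))

# By default, one NA makes the whole result NA.
mean_raw <- mean(scores)

# na.rm = TRUE tells the function to skip missing values.
mean_clean <- mean(scores, na.rm = TRUE)

# Keep only the complete observations for later steps.
complete_scores <- scores[!is.na(scores)]

c(n_missing = n_missing, mean_raw = mean_raw, mean_clean = mean_clean)
```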

Statistics…your guide to the universe

Every Number tells a story

Clean Data

Collect Data

Design a Study.

Analyze Data

Present Data

Conclusions :)


Wrap Up

These have to be loaded before you use them. If you have to use them, make sure you comment on them in the code so that you can use them again later.
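One way to load a package defensively, sketched with the raster package that the example script uses for cv(). The has_raster variable name and the sample scores are assumptions for illustration.

```r
# install.packages("raster")  # run once if the package is not installed

# Check that the package is available before loading it.
has_raster <- requireNamespace("raster", quietly = TRUE)

if (has_raster) {
  library(raster)  # needed for cv(), the coefficient of variation
  cv_value <- cv(c(2, 4, 4, 5, 7, 9, 11))  # hypothetical scores
  print(cv_value)
} else {
  message("Package 'raster' is not installed; skipping cv().")
}
```

The comment next to library() records why the package is loaded, so the script still makes sense when you come back to it later.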

Library or Packages

R Packages have templates ready for you to learn with…some packages are better than others.

Templates


Present the descriptive information for the data set. This is key when you are presenting inferential statistics in a journal, report, homework, or even understanding the data you are reading about.

Always…Always…Always


Have productivity anywhere you have internet.

Taking it to the Cloud

When you are in a clinical setting…time is of the essence. You are dealing with patients and the information you present must be accurate.

Population

Sample Information

Patient Information

THANK YOU!

That wasn’t so hard :) Remember you are now programming! Who needs Excel or SPSS now?

Free is sometimes better in more ways than one!

Education
