Stats 141/Bio 141: Lecture 1
Rajarshi Mukherjee
Stanford University
September 23, 2014
Stats 141/Bio 141: Lecture 1 September 23, 2014 1
Class Logistics
Overview
Instructor: Rajarshi Mukherjee
Office: Sequoia Hall 202
e-mail: [email protected]
Class Meetings: 2:15 pm – 3:30 pm, Tuesday and Thursday, 200-02.
Office Hours: Thursday 4:00 pm-6:00 pm, and by appointment. (Room TBD)
Who am I? Stein Fellow/Lecturer, Department of Statistics, Stanford University.
Definitely Not a Professor!
Teaching Assistants
Sam Gross. Office Hours: Thursday 10 am - 12 pm. Room: Sequoia 227. e-mail: [email protected]
Hera He. Office Hours: Friday 1 pm - 3 pm. Room: Sequoia 232. e-mail: [email protected]
Qingyuan Zhao. Office Hours: Monday 12 pm - 2 pm. Room: Sequoia 206. e-mail: [email protected]
Xiaoying Tian. Office Hours: Tuesday 10 am - 12 pm. Room: Sequoia 232. e-mail: [email protected]
Sections
Wed 3:40 PM - 4:30 PM at 380-380X (Listed as 3:15 pm - 4:30 pm)
Friday 3:15 PM - 4:05 PM at 380-380W
You can come to any of these sections.
Sections are not mandatory but recommended.
Sections will be used to review course materials, work through practice problems, and discuss computing.
Prerequisites
Some background in algebra will be helpful.
There will be somewhat more theoretical development than in Stats 60/Stats 160.
However, no prior knowledge of statistics will be assumed.
Homework and Exams
Homework: Assigned each week on Thursday. Due at the beginning of class on Thursday of the following week. The lowest of your homework scores will be dropped, so if you fail to turn one in, you will still be OK.
Quizzes: Quiz 1 Thursday October 9th; Quiz 2 December 2nd. (In class)
Midterm Exam: Wednesday, November 12th, in the evening (time and place to be announced). It will roughly cover material from Lectures 1 through 12.
Final Exam: Date, time and place are determined by the Registrar. Currently, we are listed as having our final on Thursday, December 11th from 7pm to 10pm. The location is TBA. The final will test knowledge from all sections of the course.
All exams are open book, but no computers will be allowed. You can use handheld scientific calculators.
Grading
Grades: Your class grade will be calculated as follows:
Homework: 20%
Quizzes: 10% each (total 20%)
Midterm exam: 25%
Final exam: 35%
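As a rough illustration of how these weights combine, the sketch below computes a class grade from component scores. The student's scores and the function names are made up for the example, and it assumes every component is reported on a 0-100 scale; dropping the lowest homework follows the homework policy stated on the previous slide.

```python
# Weights taken from the grading slide; they sum to 100%.
WEIGHTS = {"homework": 0.20, "quiz1": 0.10, "quiz2": 0.10, "midterm": 0.25, "final": 0.35}


def homework_average(scores):
    """Average homework score after dropping the single lowest score."""
    if len(scores) < 2:
        raise ValueError("need at least two homework scores to drop one")
    kept = sorted(scores)[1:]  # drop the lowest score
    return sum(kept) / len(kept)


def class_grade(homeworks, quiz1, quiz2, midterm, final):
    """Weighted class grade; all inputs are assumed to be on a 0-100 scale."""
    components = {
        "homework": homework_average(homeworks),
        "quiz1": quiz1,
        "quiz2": quiz2,
        "midterm": midterm,
        "final": final,
    }
    return sum(WEIGHTS[name] * score for name, score in components.items())


# Hypothetical student: one missed homework (score 0) is dropped.
grade = class_grade([90, 80, 0, 100], quiz1=70, quiz2=90, midterm=80, final=85)
print(round(grade, 2))
```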
Textbook and Class Notes
Textbook:
Required: Statistics for the Life Sciences: Samuels, Witmer and Schaffner, 4th Edition (Prentice Hall)
Suggested: Introductory Statistics with R: Dalgaard, 2nd Edition (Springer)
Using R for Introductory Statistics: Verzani, 1st Edition (Chapman & Hall/CRC)
Course Website URL: https://coursework.stanford.edu/portal/site/F14-BIO-141-01
Lecture Notes will be uploaded on the website before class and should be downloaded/printed out in advance and brought to class.
I will try to post/mention the textbook section(s) being covered.
Today: General Introduction, Sections 1.3 and 2.1, Types of Variables.
Computing
For statistical computation and programming we will use the R environment.
You can download it for free from http://cran.r-project.org/.
You can find extensive documentation for free on the Web, starting with the same site and the suggested textbooks.
It is also suggested that you download an editor for R. We suggest RStudio, which can be downloaded from http://www.rstudio.com/.
Weekly sections will talk about computing and data examples.
Nice Introduction: https://www.codeschool.com/courses/try-r.
Teaching Schedule
A tentative schedule of things to be covered is given on the website in the syllabus document.
I will not be teaching on October 16. Instead there will be a class-long TA section devoted to review and questions.
October 10 Fri, 5:00 pm: Last day to add or drop a class.
Compare classes
Stats 60 : Stats for social sciences (mostly surveys).
Stats 110: A much more “mathy” class.
Stats 141: Understanding Tools used in Statistics.
Stats 202: “Cutting edge” data mining and visuals. (akin to bioinformatics)
Stats 300A: Very theoretical. For PhD students with lots of math background.
Introduction
Let’s begin with the Following
1 Describe the range of applications of statistics.
2 Identify situations in which statistics can be misleading.
3 Define “Statistics”.
Range of applications of statistics
Statistics include numerical facts and figures.
Earth Sciences: The largest earthquake measured 9.2 on the Richter scale.
Social Sciences: Men are at least 10 times more likely than women to commit murder.
Public Health: One in every 8 South Africans is HIV positive.
Biology: ANGPTL 3, 4 and 5 genes might be associated with Hypertriglyceridemia in Europe.
Statistical Physics, History, Anthropology, ...
http://en.wikipedia.org/wiki/List_of_fields_of_application_of_statistics
Be Cautious against Misleading Statistics
The study of statistics involves math and relies upon calculations of numbers.
But it also relies heavily on how the numbers are chosen and how the statistics are interpreted.
For example, consider the following three scenarios and the interpretations based upon the presented statistics.
You will find that the numbers may be right, but the interpretation may be wrong.
Example 1
A new advertisement for Ben & Jerry’s ice cream introduced in late May of last year resulted in a 30% increase in ice cream sales for the following three months. Thus, the advertisement was effective.
A major flaw is that ice cream consumption generally increases in the months of June, July, and August regardless of advertisements.
This is an example where one interprets an outcome as the result of one variable (the Ben & Jerry’s advertisement) when another variable (time of the year) is actually responsible.
Example 2
The more churches in a city, the more crime there is. Thus, churches lead to crime.
A major flaw is that both the larger number of churches and the higher amount of crime can be explained by larger populations.
In bigger cities, there are both more churches and more crime.
A third variable (number of people) can cause both situations.
However, people erroneously believe that there is a causal relationship between the two primary variables rather than recognize that a third variable can cause both.
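The churches-and-crime argument can be sketched with a small simulation. Everything here is hypothetical: the city populations, the per-capita rates, and the noise levels are invented, and churches are constructed to have no effect on crime at all. Population alone drives both counts.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical cities: population drives both counts; churches never affect crime.
populations = [rng.uniform(10_000, 1_000_000) for _ in range(500)]
churches = [p / 2_000 * (1 + rng.uniform(-0.2, 0.2)) for p in populations]
crimes = [p / 100 * (1 + rng.uniform(-0.2, 0.2)) for p in populations]

# Raw counts move together because both scale with population.
r_counts = pearson(churches, crimes)

# Per-capita rates remove the third variable; the association largely disappears.
church_rate = [c / p for c, p in zip(churches, populations)]
crime_rate = [k / p for k, p in zip(crimes, populations)]
r_rates = pearson(church_rate, crime_rate)

print(f"correlation of counts: {r_counts:.2f}")
print(f"correlation of per-capita rates: {r_rates:.2f}")
```

The raw counts are strongly correlated even though neither causes the other; dividing by population is one simple way to account for the third variable.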
Example 3
75% more interracial marriages are occurring this year than 25 years ago. Thus, our society accepts interracial marriages.
A major flaw is that we don’t have the information that we need. What is the rate at which marriages are occurring?
Suppose only 1% of marriages 25 years ago were interracial, so that now 1.75% of marriages are interracial (1.75 is 75% higher than 1).
But this latter number is hardly evidence suggesting the acceptability of interracial marriages.
In addition, the statistic provided does not rule out the possibility that the number of interracial marriages has seen dramatic fluctuations over the years and this year is not the highest.
Again, there is simply not enough information to understand fully the impact of the statistics.
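The arithmetic behind this example can be checked directly. The sketch below uses the supposed shares from the slide (1% then, 1.75% now) and contrasts the relative increase with the absolute increase in percentage points; the function name is only for illustration.

```python
def relative_increase(old, new):
    """Relative change from old to new, as a fraction of old."""
    return (new - old) / old


# Supposed shares of all marriages that are interracial, in percent.
share_then = 1.0
share_now = 1.75

rel = relative_increase(share_then, share_now)  # relative change
abs_points = share_now - share_then             # absolute change in percentage points

print(f"relative increase: {rel:.0%}")
print(f"absolute increase: {abs_points:.2f} percentage points")
```

A 75% relative increase sounds large, yet here it corresponds to less than one percentage point, which is why the base rate is needed to interpret the claim.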
”Definition” of Statistics
As a whole, these examples show that statistics are not only facts and figures.
They are something more than that.
In the broadest sense, “statistics” refers to the “art” (techniques and procedures) of analyzing, interpreting, displaying, and making decisions based on data.
Why Study Statistics
Statistics Encountered in Daily Life
4 out of 5 dentists recommend Dentine.
Almost 85% of lung cancers in men and 45% in women are tobacco-related.
A surprising new study shows that eating egg whites can increase one’s lifespan.
People predict that it is very unlikely there will ever be another baseball player with a batting average over .400.
Why Study Statistics
Statistics are often presented in an effort to add credibility to an argument or advice.
Hopefully, studying statistics will make us intelligent consumers of statistical claims.
It can be a claim made by a biologist about collected data or a claim made by a company about its product.
Charles Frederick Mosteller:
“While it is easy to lie with statistics, it is even easier to lie without them.”
Population and Sample
Statistics is the art of dealing with “Data”.
However, as we said earlier, one of the most important things to understand is the context.
In statistics, we always want to study certain characteristics of individuals (people/animals/organisms/objects ...) in a population of interest.
It is important to identify the population of interest and the characteristics of interest/objectives of the study before embarking on statistical analysis.
Example
1 Height of Basketball Players: Population: All Basketball Players. Characteristic of interest: Height.
2 Rate of Cell Division of a Skin Cancer Tissue: Population: All Skin Cancer Tissues. Characteristic of interest: Rate of cell division.
3 Relationship between Work Hours and Blood Pressure in Texas: Population: All people in Texas. Characteristics of interest: Work Hours, Blood Pressure.
Population and Sample
However, most of the time the population is too big (it can even be infinite!), or impossible to census.
In that case we need to work with a sample.
A sample is a collection of selected individuals chosen from the population we want to study.
Data is the collection of measurements made on the selected individuals in the sample.
Hopefully, the sample is representative of the population.
Aim of Statistics: Summarize characteristics of the data (a Statistic) to give us an idea about the corresponding population summaries (a Parameter): a process called Inference.
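The distinction between a parameter and a statistic can be sketched with a simulated population. The population of heights below is entirely hypothetical; in practice the parameter is unknown, and only the statistic can be computed.

```python
import random

rng = random.Random(1)  # fixed seed for reproducibility

# Hypothetical population: heights (in metres) of 10,000 individuals.
population = [rng.gauss(1.70, 0.10) for _ in range(10_000)]
parameter = sum(population) / len(population)  # population mean (usually unknown)

# A simple random sample of 100 individuals, drawn without replacement.
sample = rng.sample(population, 100)
statistic = sum(sample) / len(sample)  # sample mean (what we can compute)

print(f"parameter (population mean): {parameter:.3f}")
print(f"statistic (sample mean):     {statistic:.3f}")
```

Inference is the step of using the statistic, computed from 100 individuals, to say something about the parameter of all 10,000.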
Importance of Representative Sample: Random Sample
It is very important that the sampled data is representative of the population we are trying to study.
Random Sample: Subjects are selected from a population so that each individual has an equal chance of being selected.
Random samples tend to be representative of the source population.
Non-random samples may not be representative. If the sample is not representative, the inference drawn is potentially false and misleading: a problem known as Sampling Bias.
Section 1.3 is a good read.
From now onwards, we will always assume that the data is a valid random sample from the population.
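Sampling bias can be illustrated by comparing a random sample with a deliberately non-random one. The population of heights is hypothetical, and the "tallest 200" selection rule is just one invented way a sample could fail to be representative.

```python
import random

rng = random.Random(2)  # fixed seed for reproducibility

# Hypothetical population of 5,000 heights in metres.
population = [rng.gauss(1.70, 0.10) for _ in range(5_000)]
pop_mean = sum(population) / len(population)

# Random sample: every individual has the same chance of selection.
random_sample = rng.sample(population, 200)

# Non-random sample: only the 200 tallest individuals are measured,
# e.g. recruiting exclusively from basketball teams.
biased_sample = sorted(population)[-200:]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


print(f"population mean:    {pop_mean:.3f}")
print(f"random sample mean: {mean(random_sample):.3f}")
print(f"biased sample mean: {mean(biased_sample):.3f}")
```

Both samples have the same size, but only the random one gives an estimate close to the population mean.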
Central Dogma of Statistics
In This Course...
1 Descriptive Statistics: Given data, how to perform initial description and visualization of it.
  Identifying different types of data.
  Tabular and Graphical Representation.
  Numerical summaries of the data.
2 Probability and Inferential Statistics:
  Data as a “sample” from a larger population of interest.
  Given population variability, how to quantify sample variability? (Probability)
  Given sample variability, how to quantify population variability? (Inferential Statistics)
  Estimation (Sampling Distribution, Means, Proportions, Confidence Intervals etc.)
  Hypothesis Testing (Testing for means, proportions, group homogeneity and equality etc.)
Take Home Messages
Statistics is about drawing inference from data about a population of interest.
Importance of Context: Ask the right questions to understand what the population of interest is and which characteristics of the population we are interested in.
Make sure the data is representative of the population of interest and measures the characteristics of the individuals of the population: Random Sample.
Statistics Does Not Prove Anything!
You only observe data once.
What if someone else does another study on the same population with another random sample drawn from the population?
Different random samples will include different subjects, with different observations.
Each new random sample will lead to (slightly) different conclusions, so the conclusions drawn are sometimes not very precise.
Absolute certainty cannot be expected, as conclusions are based on only a small part (the sample) of the total, possibly infinitely large, population.
Statistics is about quantifying this uncertainty from data.
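Sample-to-sample variation can be seen by repeating the same study many times on a simulated population. The population, the sample size of 30, and the count of 20 studies are all hypothetical choices for this sketch.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed for reproducibility

# Hypothetical population of 10,000 measurements.
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Twenty independent studies, each drawing its own random sample of 30.
sample_means = [statistics.mean(rng.sample(population, 30)) for _ in range(20)]

# The spread of the estimates across studies is one way to quantify uncertainty.
spread = statistics.stdev(sample_means)

print("smallest estimate:", round(min(sample_means), 2))
print("largest estimate: ", round(max(sample_means), 2))
print("spread (std. dev. of estimates):", round(spread, 2))
```

Each study gives a slightly different sample mean; none of them "proves" the population mean, but together they show how much an estimate typically moves from sample to sample.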
Data and Variables
Data and Variables
Data is a collection of characteristics/measurements of sampled units from an underlying population.
A variable is a characteristic/measurement of a sampled unit.
Usually data is stored and presented in a dataset, comprised of variables measured on sampled units.
Setting Up Data: Make a table with
One Sample unit per Row.
One Variable per Column.
Data with n sampled units and p variables measured for each sampled unit is therefore represented in an n × p table.
Patient Data from a City Hospital: n = 25, p = 6
Sampled Individual  Var 1   Var 2      Var 3      Var 4       Var 5   Var 6
Subject Index       Gender  Blood Gr.  Health     # Siblings  Ht.(m)  Wt.(Kg)
1                   M       A          Fair       5           1.53    63.83
2                   M       B          Excellent  4           1.16    92.98
3                   F       B          Very Good  3           1.85    65.19
4                   M       A          Poor       1           1.51    63.58
5                   M       B          Fair       1           1.94    66.37
6                   F       B          Very Good  1           1.62    61.69
7                   F       O          Poor       4           1.22    87.51
8                   F       AB         Very Good  3           1.09    64.83
9                   F       O          Excellent  5           1.52    93.13
10                  M       A          Excellent  0           1.09    87.91
11                  F       O          Excellent  0           1.66    64.72
12                  F       A          Good       2           1.4     64.22
13                  M       O          Very Good  6           1.18    93.71
14                  M       A          Good       3           1.48    89.22
15                  M       O          Fair       0           1.72    65.57
16                  F       A          Very Good  3           1.72    66.43
17                  M       O          Good       2           1.87    65.11
18                  F       AB         Poor       4           1.42    65.67
19                  M       B          Very Good  4           1.53    66.73
20                  M       A          Excellent  3           1.28    89.79
21                  M       O          Excellent  4           1.86    92.08
22                  M       B          Good       3           1.53    65.27
23                  M       O          Excellent  8           1.42    65.54
24                  M       O          Good       5           0.58    63.8
25                  M       O          Excellent  1           2.53    90.91
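The "one sampled unit per row, one variable per column" layout can be sketched in code with the patient data above. The names `COLUMNS`, `ROWS`, and `column` are choices made for this illustration; the subject index is treated as a row label rather than a variable, which matches p = 6.

```python
# Patient data from the slide: one sampled unit per row, one variable per column.
COLUMNS = ("Gender", "Blood Gr.", "Health", "# Siblings", "Ht.(m)", "Wt.(Kg)")

ROWS = [
    ("M", "A", "Fair", 5, 1.53, 63.83),
    ("M", "B", "Excellent", 4, 1.16, 92.98),
    ("F", "B", "Very Good", 3, 1.85, 65.19),
    ("M", "A", "Poor", 1, 1.51, 63.58),
    ("M", "B", "Fair", 1, 1.94, 66.37),
    ("F", "B", "Very Good", 1, 1.62, 61.69),
    ("F", "O", "Poor", 4, 1.22, 87.51),
    ("F", "AB", "Very Good", 3, 1.09, 64.83),
    ("F", "O", "Excellent", 5, 1.52, 93.13),
    ("M", "A", "Excellent", 0, 1.09, 87.91),
    ("F", "O", "Excellent", 0, 1.66, 64.72),
    ("F", "A", "Good", 2, 1.4, 64.22),
    ("M", "O", "Very Good", 6, 1.18, 93.71),
    ("M", "A", "Good", 3, 1.48, 89.22),
    ("M", "O", "Fair", 0, 1.72, 65.57),
    ("F", "A", "Very Good", 3, 1.72, 66.43),
    ("M", "O", "Good", 2, 1.87, 65.11),
    ("F", "AB", "Poor", 4, 1.42, 65.67),
    ("M", "B", "Very Good", 4, 1.53, 66.73),
    ("M", "A", "Excellent", 3, 1.28, 89.79),
    ("M", "O", "Excellent", 4, 1.86, 92.08),
    ("M", "B", "Good", 3, 1.53, 65.27),
    ("M", "O", "Excellent", 8, 1.42, 65.54),
    ("M", "O", "Good", 5, 0.58, 63.8),
    ("M", "O", "Excellent", 1, 2.53, 90.91),
]

n = len(ROWS)     # number of sampled units (rows)
p = len(COLUMNS)  # number of variables (columns)


def column(name):
    """Return all n observed values of one variable."""
    j = COLUMNS.index(name)
    return [row[j] for row in ROWS]


print(f"n = {n}, p = {p}")
print("Heights of the first three subjects:", column("Ht.(m)")[:3])
```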
Types of Variables
In order to describe and summarize different variables of observed data, it is important to first identify the different types of commonly appearing variables.
These are:
1 Categorical/Qualitative Variables
  Nominal variables
  Ordinal variables
2 Numeric Variables
  Discrete variables
  Continuous variables
We will now go through each of them and try to understand how to suitably summarize them.
Nominal Variables
The simplest type of variable is the nominal variable, in which the observed values fall into specific unordered categories/classes.
Examples
Gender: Male (M), Female (F).
Survival Status: alive, deceased.
Blood Group: O, A, B, AB.
Race/Ethnicity: Asian, Caucasian, African.
Cause of Death: natural, accident.
Political Affiliation: Republican, Democrat.
In the patient data table (n = 25, p = 6), Gender and Blood Gr. are nominal variables.
Properties of Nominal Variables
Numbers are often used to represent the categories for convenience, but these numbers are merely labels.
Examples:
- Gender: Male = 1 and Female = 0.
- Race/Ethnicity: Asian = 1, Caucasian = 2, African = 3.
Nominal variables that take on one of two distinct values are said to be dichotomous/binary. (Example: Survival Status: alive (= 1), deceased (= 0).)
Both the order and the magnitude of the numbers are unimportant.
Most arithmetic operations do not make sense for nominal data.
Example: It does not make sense to take the average of such labels.
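To see why averaging nominal labels is not meaningful, the sketch below uses the Blood Gr. column of the patient table. Counting categories gives the same answer however the groups are labelled, while the "average" changes with the labelling. The two numeric codings are arbitrary and invented for this illustration.

```python
from collections import Counter

# Blood groups of the 25 patients in the table (subjects 1 to 25).
blood = [
    "A", "B", "B", "A", "B", "B", "O", "AB", "O", "A", "O", "A", "O",
    "A", "O", "A", "O", "AB", "B", "A", "O", "B", "O", "O", "O",
]

# Frequencies and the most common category are meaningful summaries.
counts = Counter(blood)
mode = counts.most_common(1)[0][0]

# Two arbitrary numeric labelings of the same four categories.
coding_1 = {"A": 1, "B": 2, "AB": 3, "O": 4}
coding_2 = {"O": 1, "AB": 2, "A": 3, "B": 4}

mean_1 = sum(coding_1[b] for b in blood) / len(blood)
mean_2 = sum(coding_2[b] for b in blood) / len(blood)

print("counts:", dict(counts))
print("most common blood group:", mode)
print("average under coding 1:", mean_1)
print("average under coding 2:", mean_2)
```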
Ordinal Variables
Variables are ordinal if the observed values fall into specific categories/classes and the order among the categories is important.
Examples
patient satisfaction with care received in the hospital: poor, fair, good, very good, excellent
injury status: none, minor, moderate, severe, fatal
level of education: Bachelor, Masters, PhD.
In the patient data table (n = 25, p = 6), Health is an ordinal variable.
Properties of Ordinal Variables
As with nominal variables, categories might again be represented by numbers.
Example: injury status: none (0), minor (1), moderate (2), severe (3), fatal (4).
The order of the numbers is important for interpretation.
We are still not concerned with the magnitudes.
Example: injury status coded as none (0), minor (1), moderate (2), severe (3), fatal (4) is interpreted in the same way as none (5), minor (10), moderate (15), severe (20), fatal (50).
In particular, it does not make sense to perform most arithmetic operations, such as addition, subtraction, division, or averaging.
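The point that only the order of the codes matters can be checked on the Health column of the patient table. The sketch applies two increasing codings (0 to 4, and the 5, 10, 15, 20, 50 pattern from the injury example) to the same data; the middle category is used here as one order-based summary, which is an illustration rather than something the slides prescribe.

```python
import statistics

# Self-rated health of the 25 patients in the table (subjects 1 to 25).
health = [
    "Fair", "Excellent", "Very Good", "Poor", "Fair", "Very Good", "Poor",
    "Very Good", "Excellent", "Excellent", "Excellent", "Good", "Very Good",
    "Good", "Fair", "Very Good", "Good", "Poor", "Very Good", "Excellent",
    "Excellent", "Good", "Excellent", "Good", "Excellent",
]

ORDER = ["Poor", "Fair", "Good", "Very Good", "Excellent"]

coding_a = {level: i for i, level in enumerate(ORDER)}  # 0, 1, 2, 3, 4
coding_b = dict(zip(ORDER, [5, 10, 15, 20, 50]))        # same order, other magnitudes


def median_level(coding):
    """Middle category of the data, found through the numeric codes."""
    code = statistics.median_low(coding[h] for h in health)
    label_of = {value: level for level, value in coding.items()}
    return label_of[code]


def mean_code(coding):
    """Average of the numeric codes (depends on the arbitrary magnitudes)."""
    return sum(coding[h] for h in health) / len(health)


print("middle category, coding a:", median_level(coding_a))
print("middle category, coding b:", median_level(coding_b))
print("average code, coding a:", mean_code(coding_a))  # between Good and Very Good
print("average code, coding b:", mean_code(coding_b))  # between Very Good and Excellent
```

The middle category agrees under both codings because it depends only on order, whereas the average lands between different pairs of categories depending on the magnitudes chosen.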
Categorical Variables
Together, nominal and ordinal variables are called categorical variables.
Discrete Variables
A variable is discrete if it can take only a specific set of values, and both the order and the magnitude of those values matter.
Numbers are not merely labels; they are actual measurable quantities.
However, these quantities are restricted to taking on specified values only, usually integers or counts.
Examples
number of siblings a person has - (0, 1, 2, 3, . . .).
number of motor vehicle accidents in the city of Boston in a given week - (0, 1, 2, 3, . . .).
number of new cases of diabetes diagnosed in the United States over a one-year period - (0, 1, 2, 3, . . .).
number of days it rained in Palo Alto in the first week of November - (0, 1, 2, 3, 4, 5, 6, 7).
In the patient data table (n = 25, p = 6), # Siblings is a discrete variable.
Properties of Discrete Variables
Both order and magnitude of the values of the variable matter.
Arithmetic rules can be applied.
However, the result of the arithmetic operation might not be a valid value of the discrete variable itself.
Example: If person 1 has 2 siblings and person 2 has 3 siblings, then on average they have (2 + 3)/2 = 2.5 siblings, but 2.5 is not a valid value for the number of siblings.
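The siblings example can be reproduced in a couple of lines. Exact fractions are used here only so that the check "is the average a whole number?" is unambiguous; that choice is an assumption of this sketch.

```python
from fractions import Fraction

# Number of siblings for person 1 and person 2, as in the example.
siblings = [2, 3]

average = Fraction(sum(siblings), len(siblings))  # exact value 5/2
is_valid_count = average.denominator == 1         # True only for whole numbers

print("average number of siblings:", float(average))
print("is the average itself a possible sibling count?", is_valid_count)
```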
Continuous Variables
Variables are continuous when both order and magnitude are important and the quantities are not restricted to taking on specified values/counts.
Examples
Height.
Birth Weight.
Length of time a lung cancer patient survives after diagnosis.
Concentration of mercury in a particular fish.
In the patient data table (n = 25, p = 6), Ht.(m) and Wt.(Kg) are continuous variables.
Properties of Continuous Variables
Most arithmetic operations make sense.
The result of adding, subtracting, dividing etc. is again a valid value of a continuous variable.
The difference between any two values can be arbitrarily small.
Therefore fractional values are possible.
The accuracy of the measuring instrument is the only limiting factor.
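A small sketch of these properties, starting from two heights that appear in the patient table (1.51 m and 1.53 m). The "true" height of 1.5137 m and the instrument that reads to the nearest centimetre are hypothetical.

```python
# Two heights in metres taken from the patient table.
low, high = 1.51, 1.53

# Between any two heights there is always another valid height,
# so the gap between distinct values can be made as small as we like.
gaps = []
for _ in range(5):
    mid = (low + high) / 2
    gaps.append(high - mid)
    high = mid

print("successive gaps:", [round(g, 5) for g in gaps])

# A hypothetical instrument that reads to the nearest centimetre records
# a true height of 1.5137 m as 1.51 m: accuracy is the limiting factor.
recorded = round(1.5137, 2)
print("recorded height:", recorded)
```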
Numeric Variables
Together, discrete and continuous variables are called numeric variables.
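Putting the four types together, the classification of the six patient-data columns given on the slides can be written down as a small lookup. The dictionary and the `family` helper are illustrative names, not part of any library.

```python
# Variable type of each column in the patient data, as labelled on the slides.
VARIABLE_TYPES = {
    "Gender": "nominal",
    "Blood Gr.": "nominal",
    "Health": "ordinal",
    "# Siblings": "discrete",
    "Ht.(m)": "continuous",
    "Wt.(Kg)": "continuous",
}

CATEGORICAL = {"nominal", "ordinal"}
NUMERIC = {"discrete", "continuous"}


def family(column):
    """Return 'categorical' or 'numeric' for a column of the patient data."""
    kind = VARIABLE_TYPES[column]
    return "categorical" if kind in CATEGORICAL else "numeric"


for name, kind in VARIABLE_TYPES.items():
    print(f"{name:<12} {kind:<11} {family(name)}")
```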