Sampling and Inference

Sampling and Inference_Political_Science

DESCRIPTION

Sampling and Inference: Learn about the importance of random sampling in political research; learn why samples that seem small can yield accurate information about larger groups; learn how to figure out the margin of error of a sample; learn how to make inferences about the information in a sample.

Page 1: Sampling and Inference_Political_Science

Sampling and Inference

Page 2: Sampling and Inference_Political_Science

Learning Objectives

1) Why Random Sampling is of Fundamental Importance in Political Research.

2) Why Samples that Seem Small can Yield Accurate info About Much Larger Groups.

3) How to Find the Margin of Error in a Sample.

4) How to Use the Normal Curve to Make Inferences About the Info in a Sample.

Page 3: Sampling and Inference_Political_Science

We Continue to Hunger for Numbers and for the Certainty they Seem to Convey. “Which Candidate OR Ballot Measure is the Likely Choice among voting age adults?”

“By how large a Margin?”

Anyone interested in Politics, Society, or the Economy wants to understand the Attitudes, Beliefs, or Behavior of very large groups -- Large Aggregations of Units are populations.

Page 4: Sampling and Inference_Political_Science

Populations

Population: Generically Defined as the Universe of Subjects the Researcher Wants to Describe.

Example = Studying the Financial Activity of Political Action Committees in 2008. My Population would Include all PAC Contributions in 2008.

Population Parameter: A Characteristic of the Population:

– The Dollar Amount of the Average PAC Contribution,
– The Percentage of Voting Age Adults who are Republicans.

Census: Complete Access to the Population of Interest. A Census Allows the Researcher to Obtain Measurements for All Members of a Population.

Thus, the Researcher Does Not Need to Infer or Estimate any of the Population’s Parameters when Describing the Units of Analysis.

Page 5: Sampling and Inference_Political_Science

Sample

More often, Researchers are Unable to Examine a Population Directly, thus We Rely on a Sample.

Sample: A Number of Cases or Observations Drawn from a Population. (A Fixture of Life for Political Research).

– Population Characteristics are Frequently Hidden from View, thus, we Turn to Samples, which Yield Observable Sample Statistics.

Sample Statistic: Is an Estimate of a Population Parameter, Based on a Sample Drawn from the Population.

– A Pollster 1) Takes a Sample, 2) Elicits an Opinion, and 3) Then Infers or Estimates a Population Characteristic from this Sample Statistic.

A Sample Size of 1,500 is Typical, but Seems Small.

– Just How Accurately does a Sample Statistic Estimate a Population’s Parameter?

Page 6: Sampling and Inference_Political_Science

This CH Discusses 3 Factors that Determine How Closely a Sample Statistic Reflects a Population’s Parameter.

The 1st Two Factors Deal with a Sample:

1) The Procedure that We Use to Choose the Sample.
2) The Sample’s Size, the Number of Cases in the Sample.

The 3rd Factor Deals with a Population’s Parameter that We Want to Estimate:

3) The Amount of Variation in the Population.

Also, CH Shows How the Normal Distribution comes into play in Helping Researchers

1) Determine the Margin of Error of a Sample Estimate and

2) How this info is used for Making Inferences.

Page 7: Sampling and Inference_Political_Science

Factor 1- The Procedure that We Use to Choose the Sample: Random Sampling

For a Sample to Yield an Accurate Estimate of a Population Parameter, You Must Use a Random Sample = A Sample that has been Randomly Drawn from the Population.

A Random Sample Ensures that Every Member of the Population has An Equal Chance of Being Chosen for the Sample.

Sampling Frame: The Method for Defining the Population We Want to Study.

Poor Sampling Frames Give Life to the “Evil Twins” of Sampling – Selection Bias and Response Bias.

Selection Bias (or Sampling Bias): Occurs When Some Members of the Population Are More Likely to be Included in the Sample Than are Other Members of the Population.

Response Bias: Occurs When Some Cases in the Sample Are More Likely Than Others to be Measured.

Page 8: Sampling and Inference_Political_Science

Random Sampling Continued

“Sound Off” Polls – The Staple of Newspapers and Talk Radio – Suffer From Both Selection & Response Bias.

Only those Who Read the Newspaper are Sampled (Selection Bias), and Only those Readers Self-Motivated to Call are Measured (Response Bias).

We have Devised Sampling Procedures that Minimize Response Bias.

Page 9: Sampling and Inference_Political_Science

Random Sampling Continued

A Valid Sample is Based on Random Selection: It Occurs When Every Member of the Population has An Equal Chance of Being Included in the Sample.

Emp = If 100 members are in the Population, then the Probability that One Member is Chosen is 1 out of 100.

This is the Basic Procedure Used Regularly by Big-Time Polling Firms – Gallup, the CBS/New York Times Poll, the University of Michigan Institute for Social Research.

If a Sample is Not Randomly Selected, Then the Size of the Sample Does Not Matter.

In this CH, we Explore these Points using a Hypothetical Example of a Student Organization that Wants to Gauge Student GPAs. Since We Cannot Survey All Students, We Decide to Take a Sample of 100 Students.

1st Define Sampling Frame by assigning a unique sequential number to each student in the population – from 1st student listed in administration list to the last.
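
The numbering step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population size N = 20,000 is taken from the GPA example later in the deck, the names `sampling_frame` and `sample_ids` are hypothetical, and the fixed seed is there just to make the draw repeatable.

```python
import random

# Hypothetical sampling frame: students numbered 1..N in the order they
# appear on the administration list. N = 20,000 is an assumed value.
N = 20_000
sampling_frame = range(1, N + 1)

# Simple random sample without replacement: every student number has the
# same chance of being drawn, which is what guards against selection bias.
rng = random.Random(42)  # fixed seed so the illustration is reproducible
sample_ids = rng.sample(sampling_frame, k=100)

print(len(sample_ids), sorted(sample_ids)[:5])
```

Each drawn number would then be matched back to the student in that position on the list.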

Page 10: Sampling and Inference_Political_Science

Random Sampling Continued

Random Sampling is The Only Way to Guard Against Bias – Selection Bias and Response Bias.

Page 11: Sampling and Inference_Political_Science

Random Sampling Error

Important = In Eliminating Bias We Do Not Eliminate Error. In Fact, in Drawing a Random Sample, We are Consciously Introducing Random Sampling Error.

Random Sampling Error: Is Defined as The Extent to which a Sample Statistic Differs by Chance from a Population Parameter.

Random Sampling Error is Vastly Better than Bias because 1) We Know How It Affects a Sample Statistic, and 2) We Fully Understand How to Estimate its Magnitude.

Assuming Sample Bias to be 0, The Population Parameter will be Equal to the Sample Statistic, Plus Any Random Error that was Introduced by Taking the Sample:

Population Parameter = Sample Statistic + Random Sampling Error.

Example = An Instructor’s Measurements of Students’ Exam Scores will be Equal to Their True Scores Plus The Error Introduced by Haphazard or Random Occurrences.

Page 12: Sampling and Inference_Political_Science

Random Sampling Error Cont..

What makes Random Sampling Error a better kind of Error = We have the Statistical Tools for Figuring out How Much a Statistic is Affected by Random Sampling Error.

The Magnitude of Random Sampling Error Depends on Two Components:

1) The Size of the Sample.
2) The Amount of Variation in the Population Characteristic being Measured.

Sample Size has an Inverse Relationship with Random Sampling Error = As the Sample Size Goes Up, Random Sampling Error Goes Down.

The Variation Component and the Sample Size Component Are Not Separate and Independent – They Work Together in Determining the Size of Random Sampling Error.

Page 13: Sampling and Inference_Political_Science

Random Sampling Error Cont..

Conceptually, Random Sampling Error = (Variation / Sample Size). Variation is the Numerator (Reflecting its Direct Relationship with Random Sampling Error) & Sample Size is the Denominator (Depicting its Inverse Relationship with Random Sampling Error).

In the Hypothetical Example of a Student Organization that Wants to Gauge Student GPAs, Suppose there was a Great Deal of Variation in GPAs among Students (The Student Population is Widely Dispersed Across the Values of GPA) and you were Working with a Small-Sized Random Sample. Thus, the Variation is a “BIG” Number and the Sample Size is a “small” Number. Dividing the Large Variation by the Small Sample Size would Yield a Large Amount of Random Sampling Error.

Under These Circumstances We Cannot Be Very Confident that the Sample Mean Provides an Accurate Picture of the True Population Mean, Because the Estimate Contains so much Random Sampling Error. But if a Larger Sample were Taken, or if Students’ GPAs were Not So Spread Out, Random Sampling Error Would Diminish = We Gain Confidence in the Sample Statistic.

Page 14: Sampling and Inference_Political_Science

Factor 2: How Sample Size Affects Random Sampling Error

The Basic Effect of Sample Size on Random Sampling Error: As the Sample Size Increases, Error Decreases.

The Sample Size is Denoted by a Lowercase n. A Sample of n = 400 is Better Than n = 100, since a Larger Sample Provides a More Accurate Picture of what we are after.

A Catch = A Sample that is Four Times Larger Does Not Deliver a Fourfold Reduction in Random Sampling Error, Because the Inverse Relationship between Sample Size and Sampling Error is Curvilinear.

For Smaller Sample Sizes (Smaller Values of n), An Increase in Sample Size Removes a Lot of Error.

For Larger Sample Sizes (Larger Values of n), An Increase in the Sample Size has Only a Modest Effect on Error Reduction.

Page 15: Sampling and Inference_Political_Science

How Sample Size Affects Random Sampling Error Cont…

For Sampling, the Shape of the Curve Fits this Pattern = As Sample Size Increases, Random Sampling Error is Reduced by the Square Root of the Sample Size.

That is, The Sample Size Component of Random Sampling Error is Equal to the Square Root of the Sample Size, n:

Sample Size Component of Random Sampling Error = \( \sqrt{n} \)

Plugging this into our Conceptual Formula for Random Sampling Error:

Random Sampling Error = (Variation Component) / \( \sqrt{n} \)

Page 16: Sampling and Inference_Political_Science

How Sample Size Affects Random Sampling Error cont..

Curvilinear Relationship between Sample Size and Random Sampling Error.

Consider 3 Samples: n = 400, n = 1,600, n = 2,800.

The Sample Size Component of the Smallest Sample (of the Three Samples Above) is The Square Root of 400 = 20. Thus, for a Sample of this Size, we would Calculate Random Sampling Error by Dividing the Variation Component by 20.

By going from a Sample Size of 400 to a Sample Size of 1,600, We Increase the Sample Size Component of Random Sampling Error from 20 to 40 – Thus We Double the Denominator, the Square Root of n. This has a Beneficial Effect on Random Sampling Error – Cutting It In Half.

1st Jump in Sample Size: From 400 to 1,600 Delivered a Big Boost in the Sample Size Component, From 20 to 40. But 1,600 to 2,800 Gave us a More Modest Increase, From 40 to about 53. Thus, The Same 1,200 Case Increase in Sample Size Produces a Bigger Reduction in The Sampling Error for Smaller Values of n than for Larger Values of n.
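
The diminishing return from the three sample sizes can be checked with a short Python sketch. The function name `sample_size_component` is a label chosen here for illustration; the arithmetic is simply the square root described above.

```python
import math

def sample_size_component(n: int) -> float:
    """Denominator of the conceptual formula: the square root of n."""
    return math.sqrt(n)

# The same 1,200-case increase buys less the second time around.
for n in (400, 1_600, 2_800):
    print(n, round(sample_size_component(n), 1))

gain_first_jump = sample_size_component(1_600) - sample_size_component(400)
gain_second_jump = sample_size_component(2_800) - sample_size_component(1_600)
print(round(gain_first_jump, 1), round(gain_second_jump, 1))
```

The first jump adds 20 to the denominator, while the second adds only about 13.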

Sophisticated Sampling is Expensive – Pollsters Must Balance The Cost of Drawing Larger Samples Against the Payoff in Precision. For this Reason, A Sample Size in the 1,500 to 2,000 Range is an Acceptable Comfort Range for Estimating a Population Parameter.

Page 17: Sampling and Inference_Political_Science

How Random Sampling Error is Affected By n.

Campus Organization Collects a Sample (n = 100) and Computes a Sample Statistic, a Sample Mean GPA = 2.80.

• The Group wants to know How Much Random Sampling Error is Contained in This Estimate.

I. Part of the Sampling Error Depends on The Sample Size. The Sample Size Component is Equal to the Square Root of 100 = 10. What does the Sample Size Component of 10 have to do with the Accuracy of the Sample Mean of 2.80?

II. The Answer Depends on the Second Component of Random Sampling Error, The Amount of Variation in The Population Characteristic being Measured.

This Connection is Direct: As Variation in the Population Characteristic Goes Up, Random Sampling Error Goes Up.

Page 18: Sampling and Inference_Political_Science

How Random Sampling Error is Affected By n.

If Students’ GPAs are Tightly Clustered Around The Population Mean, then Variation is Low. Since Variation in the Population Characteristic is Low, the Variation Component of Random Sampling Error is Low. A Random Sample taken from the Population Would Produce a Sample Mean that is Close to The Population Mean.

Moreover, Repeated Sampling from the Same Population would Produce Sample Mean After Sample Mean that are Close to The Population Mean – and Close to Each Other.

If Students’ GPAs are More Widely Dispersed Around The Population Mean, with Large Numbers of Students at Each Value of GPA, From Lower to Higher, and only a Slight Amount of Clustering around the Population Mean of 2.80, then Variation is High. Since Variation in the Population Characteristic is High, The Variation Component of Random Sampling Error is High.

A Random Sample would Produce a Sample Mean that May or May Not be Close to the Population Mean – It Depends on which Cases were Randomly Selected. One Sample Might Pick Up a Few More Students who Reside Above The Population Mean and Produce a Sample Mean of 2.90.

Page 19: Sampling and Inference_Political_Science

How Random Sampling Error is Affected By n.

How to Determine the Amount of Variation in a Population Characteristic?

1st, Look at a Key Measurement of Variation – Standard Deviation – How Does it Affect Random Sampling Error? (Next Slide).

Page 20: Sampling and Inference_Political_Science

Variation Revisited: The Standard Deviation

The Amount of Variation in a Variable is Determined by the Dispersion of Cases Across the Values of the Variable.

If the Cases Tend to Fall in or close to the Modal or Median Value, the Variable has a Low Amount of Variation.

If the Cases are More Dispersed, the Variable has a High Amount of Variation.

For Interval-Level Variables, a More Precise Measure, the Standard Deviation, is Used.

If the Individual Cases in the Distribution Do Not Deviate Very Much from the Distribution’s Mean, then the Standard Deviation Is a Small Number.

Contrast – If Individual Cases Tend to Deviate a Great Deal From the Mean, then Large Differences Exist Between the Values of Individual Cases and the Mean of the Distribution, and the Standard Deviation is a Large Number.

Page 21: Sampling and Inference_Political_Science

Variation Revisited: The Standard Deviation Continued….

Calculation of the Standard Deviation – Necessary Symbols and Notation.

Example: Raw Data of Wages of 11 Members of a Fictional Population: $2.00; $5.00; $7.00; $8.00; $9.50; $10.00; $10.50; $12.00; $13.00; $15.00; $18.00

Sample Size is Denoted by Small n. Population Size is Denoted by Big N.

What is the Population Mean? By Dividing the Summation of All Wages, which Equals $110.00, by the Population Size, 11, We Arrive at the Mean Wage Rate, or the Central Tendency, for the Population: $10.00.

Population Parameters are Conventionally Symbolized by Greek Letters. The Population Mean is “mu” (µ).

Page 22: Sampling and Inference_Political_Science

Variation Revisited: The Standard Deviation continued….

How to Summarize Variation in Wages Among Members of this Hypothetical Population?

A Measure is Provided by The Range, Defined as The Maximum Value Minus the Minimum Value. In this Example, the Range is the Highest Wage, $18.00, Minus the Lowest Wage, $2.00 = A Range of $16.00.

In Gauging Variation in Interval-Level Variables, the Measure of Choice is the Standard Deviation – Greek “Sigma” (σ).

The Standard Deviation Measures Variation as a Function of Deviations from the Mean of a Distribution.

1st Step in Finding the Standard Deviation is to Express each Value as a Deviation from the Mean – to Subtract the Mean from Each Value.

(Individual Value – µ) = Deviation from the Mean.

Page 23: Sampling and Inference_Political_Science

Variation Revisited: The Standard Deviation Continued….

An Individual Wage Below the Pop Mean will have a “Negative” Deviation. An Individual Wage Above The Mean will have a “Positive” Deviation, and a Wage Equal to The Mean will have A Deviation of 0 -- “Deviations From The Mean.”

Example = Wage Earner 1, Making a Paltry $2.00 per Hour, has a Deviation of -$8.00 (neg 8), Below the Population Mean of $10.00.

All Measures of Variation in Interval-Level Variables are Based on the Square of the Deviations From the Mean of the Distribution.

“Squared Deviations From the Mean” Removes the Minus Sign on “Negative” Deviations for Wages Below the Population Mean.

Example = By The Logic of The Standard Deviation, a Deviation of -$8.00 and a Deviation of +$8.00 Both Square to 64, So Both Deviations Figure Equally in Determining the Dispersion of Wages Around the Mean.

Rest Assured that, when all is Said and Done, The Standard Deviation Will Provide the Info We Need to Distinguish Between Wage Earners Who Fall Below the Pop Mean and Those Who Fall Above the Pop Mean.

Smaller Deviations Make Smaller Contributions to the Variation in Wages.

Page 24: Sampling and Inference_Political_Science

Variation Revisited: The Standard Deviation continued….

The Average of The Squared Deviations is Known by a Statistical Name, The Variance: It “Looks At” The Overall Summary of Variation in the Distribution (the Sum of the Squared Deviations) and Then Computes the Mean of This Amount by Dividing by N.

Example = The Variance is the Summation of The Squared Deviations (204.50) Divided By The Population Size (N = 11), Which Yields a Variance of 18.59 (in Squared Dollars).

So the Contribution Each Individual Wage Earner Makes to the Overall Variation in Wages is, on Average, 18.59.

Notice, As with Any Mean, The Size of the Average of Squared Deviations is Sensitive to Values that Lie Far Away from The Mean.

Wage Earners Toward the Tails of the Distribution – On the Low End and the High End – Make Greater Contributions to the Variance than do Wage Earners Who Lie Closer to The Population Mean.

“As Deviations From The Mean Increase, Then The Variance Increases Too.”

The Standard Deviation is Based on The Variance. In Fact, the Standard Deviation Is The Square Root of The Variance.

The Standard Deviation (σ) for The Population of Wage Earners, then, is The Square Root of 18.59 = $4.31.
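
The steps from raw wages to standard deviation can be retraced with a small Python sketch. It uses the 11 wages listed earlier and divides by N, since the deck treats these 11 earners as the whole population; the variable names are chosen here for illustration.

```python
import math

# The 11 hourly wages of the fictional population from the example.
wages = [2.00, 5.00, 7.00, 8.00, 9.50, 10.00, 10.50, 12.00, 13.00, 15.00, 18.00]

N = len(wages)                            # population size
mu = sum(wages) / N                       # population mean
deviations = [w - mu for w in wages]      # step 1: deviation from the mean
sum_sq = sum(d ** 2 for d in deviations)  # step 2: sum of squared deviations
variance = sum_sq / N                     # step 3: average squared deviation
sigma = math.sqrt(variance)               # step 4: square root of the variance

print(mu, sum_sq, round(variance, 2), round(sigma, 2))
```

The standard library's `statistics.pstdev(wages)` should give the same population standard deviation in one call.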

Page 25: Sampling and Inference_Political_Science

Variation Revisited: The Standard Deviation continued…

The Standard Deviation has Two Important Applications:

1) When Combined With Some Workable Assumptions about a Variable’s Distribution, Knowledge of σ Permits Useful Inferences About a Single Case Drawn at Random From a Population.

2) When Combined with Knowledge of The Sample-Size Component of Random Sampling Error, The SD Allows the Researcher to Estimate The Accuracy of a Statistic From a Random Sample Drawn From a Population.

In Both Applications this Inferential Leverage is Rooted in the Known Properties of The Normal Distribution.

Page 26: Sampling and Inference_Political_Science

The Normal Distribution

Normal Distribution: Is a Bell Shaped Distribution Used to Describe Interval-Level Variables.

The Mean of the Normal Distribution is Marked by 0. The Hash Marks along the Horizontal Axis Represent The Number of Standard Deviations Relative to the Mean: 1 for The Point One Standard Dev Above The Mean, -1 for The Point One Standard Dev Below the Mean, and so on.

Example = If a Member of a Population has a Value on Some Characteristic that puts that Member One Standard Deviation Above The Population Mean, then the Member would Fall Under the Somewhat Shorter Part of The Curve, at The +1 Point on the Normal Distribution.

The Numbers Along the Horizontal Axis are Known as Standardized Scores – Z Scores.

A Z Score: Is Obtained for Any Value in a Population by Finding the Value’s Deviation From The Mean and Dividing by The Standard Deviation of The Distribution.

A Z Score Converts a Raw Deviation From The Mean into a “Standardized” Deviation from the Mean. It also Tells Us How Many Standard Deviations a Case Lies Above The Mean (A Positive Sign on Z) or Below The Mean (A Negative Sign on Z).

Page 27: Sampling and Inference_Political_Science

[Figure: NORMAL CURVE. Area = .6827: about 2/3 of the population lies within ± 1 standard deviation of the mean. Area = .95: 95% of the population lies within ± 1.96 standard deviations of the mean, with 2.5% of the population in each tail.]

Page 28: Sampling and Inference_Political_Science

The Normal Distribution Continued..

If a case has a value equal to the mean, its Z score is 0. So, for any value in a population,

Z Score = (Deviation from the Mean) / (Standard Deviation) = \( (x - \mu) / \sigma \), where x is the individual value.

Thus wage earner 1, with a raw deviation from the mean of -$8.00, gets a standardized Z score of -$8.00 / $4.31 = -1.86.

Wage earner 1’s $2.00 wage situates him 1.86 standard deviations below the mean. And the Z score for wage earner 11, at +1.86, affirms his highly paid status relative to everyone else in the population.

His hourly wage of $18.00 places him 1.86 standard units above the population mean.
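
The two Z scores in the wage example can be reproduced with a minimal Python sketch. The helper name `z_score` is hypothetical, and the rounded standard deviation of 4.31 from the example is used as given.

```python
def z_score(value: float, mean: float, sd: float) -> float:
    """Standardized deviation from the mean: (value - mean) / sd."""
    return (value - mean) / sd

mu, sigma = 10.00, 4.31  # population mean and standard deviation of wages

z_earner_1 = z_score(2.00, mu, sigma)    # lowest-paid wage earner
z_earner_11 = z_score(18.00, mu, sigma)  # highest-paid wage earner
z_at_mean = z_score(10.00, mu, sigma)    # a case sitting exactly at the mean

print(round(z_earner_1, 2), round(z_earner_11, 2), z_at_mean)
```

The sign carries the direction (below or above the mean) and the magnitude carries the distance in standard deviations.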

The arrow stretching between Z = -1 and Z = +1 bears the label “68%.” It means this: if a variable is normally distributed, then 68 % of the cases in the distribution will have Z scores between -1 (one standard deviation below the mean) and +1 (one standard deviation above the mean). Since the curve is perfectly symmetrical, half of these cases (34 %) fall on each side of the mean.

So the range between Z = -1 and Z = +1 is the fattest and tallest part of the curve, containing over two-thirds of the cases.

95 percent of the cases will have Z scores in the long interval between 1.96 standard deviations below the mean and 1.96 standard deviations above the mean. This long interval will contain just about all the cases.

The remaining 5 % of the cases lie outside this interval. Since the curve is symmetrical, half of this 5 %, or 2.5 %, will fall in the region below Z = -1.96, and the other 2.5 % will fall in the region above Z = 1.96.

Page 29: Sampling and Inference_Political_Science

The Normal Distribution Continued… Consider the normal distribution’s essential role in making probabilistic inferences about the individuals in a population.

A Probability: is defined as the likelihood of the occurrence of a single event.

If we know the mean and the standard deviation of a distribution, then we can calculate the Z score for any individual in the pop. Furthermore, if we assume the variable is normally distributed, we can make reasonable inferences about the Z score of any single case drawn at random from the distribution.

Randomly picking a single case, what is the probability it will have a Z score between -1 and +1? Thanks to the inferential leverage of the normal distribution, you know that 68 % of all the cases fall in this interval. Thus there is a 68 % probability that a single case drawn at random will have a Z score in the interval between Z = -1 and Z = +1. You can be “68 % confident” that the case will have a Z score in this interval.

There is a chance that the case will have a Z score outside the -1 to +1 region. How much of a chance? The remaining 32 % of cases must lie in the area above Z = +1 or below Z = -1. Thus there is a 32 % probability that a randomly chosen case will have a Z score outside the fat and tall part of the curve.
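
The 68 %, 95 %, and 32 % figures can be checked against the standard normal curve with Python's standard library. This is a sketch rather than part of the original slides; `area_between` is a hypothetical helper built on `statistics.NormalDist`.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

def area_between(lo: float, hi: float) -> float:
    """Share of cases with Z scores between lo and hi."""
    return std_normal.cdf(hi) - std_normal.cdf(lo)

within_1 = area_between(-1, 1)            # the fat, tall part of the curve
within_196 = area_between(-1.96, 1.96)    # the long interval
outside_1 = 1 - within_1                  # both regions beyond Z = -1 and Z = +1
lower_tail_196 = std_normal.cdf(-1.96)    # region below Z = -1.96

print(round(within_1, 2), round(within_196, 2),
      round(outside_1, 2), round(lower_tail_196, 3))
```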

Knowledge of the standard deviation and the normal distribution can be directly applied when making inferences about an unseen population based on samples randomly drawn from that population.

Page 30: Sampling and Inference_Political_Science

How the Standard Deviation Affects Random Sampling Error

How a sample statistic is affected by variation in the pop from which the sample is drawn.

Example = Student pollsters took a random sample of n = 100 from a student pop of N = 20,000. They calculated a sample statistic, the mean GPA in the sample, 2.80.

Now the group knows that their sample estimate (x bar) will be equal to the true pop mean (µ) plus the random sampling error they introduced by taking the sample.

They also know the magnitude of the sample size component of random sampling error. As sample size increases, random sampling error decreases as a function of the square root of n. For a sample size of 100, this component is the square root of 100 = 10.

The missing ingredient is the variation component of random sampling error. The connection is direct: as variation in a pop characteristic increases, so does random sampling error.

The variation component of random sampling error is determined by the standard deviation (σ).

For any given sample size n, as the population standard deviation increases, random sampling error increases (and vice versa).

When we combine this principle with the sample size component, we again can represent the partnership between the two components of random sampling error:

Random sampling error = Standard deviation / Square root of the sample size. Or, using symbols, Random sampling error = \( \sigma / \sqrt{n} \)

Page 31: Sampling and Inference_Political_Science

How the Standard Deviation Affects Random Sampling Error Contin…

The Formula captures the effects of the variation component and the sample size component.

If we use a very large sample to estimate a pop parameter that has a very small standard dev, our sample statistic will be quite accurate.

If the population standard dev is large and the sample is small, our estimate will have a bigger dose of random error and thus will less accurately mirror the pop parameter we are measuring.

Assuming that the pop standard dev is known, the procedure for figuring out the size of random sampling error is straightforward. Take the sample of 100 with a sample mean of 2.80, and suppose the standard dev of GPAs in the student pop is .40. The size of random sampling error would be: Random Sampling Error = σ / sq root of n = .40 / sq root of 100 = .40 / 10 = .04.

The student group can say their sample mean (x bar = 2.80) estimates the true pop mean GPA (µ) with random sampling error equal to .04.
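
The same calculation, written as a minimal Python sketch. The function name `random_sampling_error` is chosen here for illustration, and the population standard deviation of .40 is treated as known, as the example assumes.

```python
import math

def random_sampling_error(sigma: float, n: int) -> float:
    """Population standard deviation divided by the square root of n."""
    return sigma / math.sqrt(n)

error_n100 = random_sampling_error(0.40, 100)  # the student group's sample
error_n400 = random_sampling_error(0.40, 400)  # four times as many cases
error_wide = random_sampling_error(0.80, 100)  # twice the variation, same n

print(round(error_n100, 2), round(error_n400, 2), round(error_wide, 2))
```

Quadrupling the sample size only halves the error, while doubling the population's standard deviation doubles it.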

Look at how this info can be used to draw inferences about the pop (next page).

Page 32: Sampling and Inference_Political_Science

Inference Using The Normal Distribution

Researchers describing the random sampling error associated with a sample mean (x bar) refer to the standard error of the mean. Standard error shares an adjective with another term, standard dev. The diff between them =

Standard deviation (σ) is a measure of the dispersion of individual cases around a single mean.

Standard error is a measure of how closely a sample mean estimates a pop mean.

Example = We take a random sample, not a single case, from the pop, then calculate a sample mean. By knowing the size of the sample (n) and the amount of dispersion in the pop (σ), we can est how closely the sample mean reflects the pop mean.

In both applications, whether drawing a single case or drawing a sample from a distribution, the normal curve comes into play.

Page 33: Sampling and Inference_Political_Science

Inference Using The Normal Distribution Cont…

If the sample mean GPA is 2.80 from a sample of 100 and the pop standard dev is .40, then the standard error of the sample mean is .04. Our estimate of the true pop GPA is 2.80, plus or minus .04.

That is, the true mean probably lies in the interval from about 2.76, a standard error below the sample est, to 2.84, a standard error above the sample est. Here is where the normal distribution applies: there is a 68 % chance that the pop mean lies within + or – 1 standard error of the sample mean, and there is a 95 % chance that it lies within + or – 1.96 standard errors of the sample mean.

How do we know this? The central limit theorem is an established statistical rule that tells us that, if we were to take an infinite number of samples of size n from a pop of N subjects, the means of these samples would be normally distributed. Furthermore, they would be centered on the true pop mean and have a standard dev equal to σ divided by the square root of n.

Example = If we took a sample of size 100 from the student pop, wrote down the sample’s mean GPA, took a second sample, then a third, then a 4th, and so on into infinity, we would find that the mean of all those sample means is equal to the pop mean, that most (68 %) of the sample means in this infinite group cluster within one standard error of the pop mean, and that just about all (95 %) fall within 1.96 standard errors.
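
Repeated sampling cannot be carried on to infinity, but a simulation with a few thousand samples gives the flavor. The sketch below rests on assumptions that are not in the slides: the population is modeled as normally distributed GPAs with a true mean of 2.80 and a standard deviation of .40, cases are drawn independently (rather than without replacement from 20,000 students), and 2,000 samples stand in for "infinitely many."

```python
import random
import statistics

rng = random.Random(0)           # fixed seed so the run is repeatable
MU, SIGMA, n = 2.80, 0.40, 100   # assumed pop mean, pop sd, and sample size
NUM_SAMPLES = 2_000              # stand-in for "an infinite number of samples"

# Draw sample after sample of size n and record each sample's mean GPA.
sample_means = [
    statistics.fmean(rng.gauss(MU, SIGMA) for _ in range(n))
    for _ in range(NUM_SAMPLES)
]

standard_error = SIGMA / n ** 0.5                # sigma / sqrt(n) = .04
mean_of_means = statistics.fmean(sample_means)   # should sit near MU
sd_of_means = statistics.pstdev(sample_means)    # should sit near .04
share_within_1_se = sum(
    abs(m - MU) <= standard_error for m in sample_means
) / NUM_SAMPLES                                  # should sit near 68 %

print(round(mean_of_means, 3), round(sd_of_means, 3), round(share_within_1_se, 2))
```

The exact figures depend on the seed, so only their rough agreement with the theorem should be read from a run.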

Page 34: Sampling and Inference_Political_Science

Inference Using The Normal Distribution Cont…

So researchers never talk about certainty; instead they talk about confidence and probability.

95 percent confidence interval (the most common standard): defined as the interval within which 95 % of all possible sample est will fall by chance. The boundaries of the 95 % confidence interval are defined by the sample mean minus 1.96 standard errors at the lower end, and the sample mean plus 1.96 standard errors at the upper end.

Lower confidence boundary = sample mean - 1.96 standard errors = 2.80 – 1.96 (.04) = 2.7216

Upper confidence boundary = sample mean + 1.96 standard errors = 2.80 + 1.96 (.04) = 2.8784

Conclusion: 95 % of all possible random samples of n = 100 will yield sample means between 2.7216 and 2.8784. (It is customary to round off 1.96 to 2.0.)

To find the 95 % confidence interval for a sample mean, multiply the standard error by 2. Subtract this number from the sample mean to find the lower confidence boundary. Add this number to the sample mean to find the upper confidence boundary.

Applying this rule, students can be 95 % confident that the unobserved pop mean has a value of 2.80 + or – 2(.04), that is, between 2.72 and 2.88.
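
Both versions of the interval, with 1.96 and with the rounded multiplier of 2, can be computed in a short Python sketch. The function name `confidence_interval` and its `z` parameter are hypothetical labels for the rule stated above.

```python
def confidence_interval(sample_mean: float, standard_error: float, z: float = 1.96):
    """Return (lower, upper) boundaries: sample mean -/+ z standard errors."""
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

lower, upper = confidence_interval(2.80, 0.04)                 # using 1.96
lower_rule, upper_rule = confidence_interval(2.80, 0.04, z=2)  # rule of thumb

print(round(lower, 4), round(upper, 4))
print(round(lower_rule, 2), round(upper_rule, 2))
```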

Page 35: Sampling and Inference_Political_Science

Inference Using The Normal Distribution Cont…

Suppose the dean declares the sample mean to be “a bit off” and hypothesizes that the true mean is at least 2.90. This seems reasonable, since the fluke effect is always a possibility with random samples. Let’s assume the dean is correct and that the true population mean really is 2.92.

How often, by chance, would a random sample yield a sample mean of 2.80 with a standard error of .04? The dean’s hypothesized mean falls outside the 95% confidence interval (2.72 to 2.88), so a result like this would occur less than 5 percent of the time.

The normal distribution, however, allows for even greater precision.

First, assemble all the numbers:

Dean’s hypothesized population mean (μ) = 2.92

Observed sample mean (x-bar) = 2.80

Standard error of the sample mean = .04

How far apart are the dean’s hypothesized population mean, 2.92, and the observed sample mean, 2.80? To find this difference, subtract the sample mean (x-bar) from the hypothesized population mean (μ):

Hypothesized mean – sample mean = μ – x-bar = 2.92 – 2.80 = .12

Now the question becomes: how many standard errors lie between the hypothesized mean and the observed sample mean? We convert the disputed difference (.12) to a familiar standard unit, Z:

Z = (hypothesized mean – sample mean) / standard error = (2.92 – 2.80) / .04 = .12 / .04 = 3.0
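
The conversion to Z is a single division, sketched here as a small helper. The function name z_score is an illustrative choice; the inputs are the dean's hypothesized mean, the observed sample mean, and the standard error from the example.

```python
def z_score(hypothesized_mean, sample_mean, standard_error):
    """Number of standard errors between the hypothesized mean and the sample mean."""
    return (hypothesized_mean - sample_mean) / standard_error


z = z_score(hypothesized_mean=2.92, sample_mean=2.80, standard_error=0.04)
print(f"Z = {z:.1f}")
```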

Page 36: Sampling and Inference_Political_Science

Inference Using The Normal Distribution Cont…

The number of standard errors separating the hypothesized population mean and the observed sample mean, the value of Z, is 3 units. This is a large Z score. If the dean is right, what are the chances that the student group’s random sample produced a mean that is so far off the mark?

Picture a hash mark for Z = +3 on the horizontal axis of the normal curve. Suppose the dean is correct and the true population mean really is way out there at Z = +3. How often will the student pollsters get a sample mean of 2.80? What percentage of possible samples would produce such a result? The answer is contained in a normal distribution probability table, such as the one presented in Table 5-2.
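
Table 5-2 is not reproduced in these notes, but the same probability can be computed from the normal curve itself. The sketch below assumes the dean is right (true mean 2.92, standard error .04) and asks how often a sample mean of 2.80 or lower would turn up by chance; the variable names are illustrative.

```python
from statistics import NormalDist

# Sampling distribution of the mean if the dean's hypothesis were true.
sampling_distribution = NormalDist(mu=2.92, sigma=0.04)

# Probability of drawing a sample mean of 2.80 or lower, i.e. three or
# more standard errors below the hypothesized population mean.
prob_at_or_below = sampling_distribution.cdf(2.80)
print(f"probability: {prob_at_or_below:.5f}")
```

The result is a little over one-tenth of one percent, far below the 5 percent threshold.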

The student group took a random sample (realistic enough) and calculated a sample mean (also realistic). However, for the variation component of random sampling error, the standard deviation, it was assumed that the population standard deviation (σ) was a known quantity. This is not realistic. In practice, a researcher rarely knows any of the population’s parameters; if they were known, there would be no need to take a sample.

A different distribution, very similar to the normal distribution, can be applied to problems of inference when the population standard deviation is not a known quantity.

Page 37: Sampling and Inference_Political_Science

Inference Using The Student T-Distribution

In most realistic sampling situations the researcher has a random sample, and that’s it. The researcher uses this sample to calculate a sample mean and to determine the standard error of the mean: the degree to which the sample mean varies, by chance, from the population mean.

To find the standard error, we need to know the population standard deviation. If this parameter is unavailable, as it usually is, then we need an estimate of the population SD.

Simply calculate the SD of the sample, then use the sample standard deviation as a stand-in for σ in calculating the standard error. The standard error of the sample mean becomes: sample standard deviation / square root of the sample size. Or, using s to denote the sample standard deviation, standard error = s / √n.
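
The substitution of s for σ can be sketched as follows. The four GPAs are made-up values used only to show the arithmetic, and the sketch assumes the usual sample standard deviation (n – 1 in the denominator), which is what Python's statistics.stdev computes.

```python
from math import sqrt
from statistics import fmean, stdev


def standard_error_of_mean(sample):
    """Estimate the standard error as s / sqrt(n), with s the sample standard deviation."""
    return stdev(sample) / sqrt(len(sample))


# Hypothetical GPAs, invented for illustration; not data from the text.
hypothetical_gpas = [2.0, 3.0, 4.0, 3.0]

sample_mean = fmean(hypothetical_gpas)
standard_error = standard_error_of_mean(hypothetical_gpas)
print(f"sample mean = {sample_mean:.2f}, standard error = {standard_error:.3f}")
```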

When using smaller samples, the exact properties of the normal distribution may no longer be applied in making inferences. Instead, the Student’s t-distribution can be applied.

The shape of the Student’s t-distribution depends on the sample size. The boundaries of the 95% confidence interval are not fixed; they vary depending on how large a sample is being used for inference.

The logic here is that when the population standard deviation is not known and the sample size is small, the t-distribution sets wider boundaries on random sampling error and permits less confidence in the accuracy of a sample statistic.

When the sample size is large, the t-distribution adjusts these boundaries accordingly, narrowing the limits of random sampling error and allowing more confidence in the measurements made from the sample.

Page 38: Sampling and Inference_Political_Science

Inference Using The Student T-Distribution Cont…

The terminology used to describe the t-distribution differs from that used to describe the normal distribution, but the procedures for drawing inferences about a population parameter are essentially the same.

Consider the similarities and differences between the inferential properties of the Student’s t-distribution and the normal curve.

The sample size is n = 100 and the sample mean GPA is x-bar = 2.80. However, the student group does not know the population standard deviation. It must rely instead on the standard deviation of its sample, which it calculates to be .50, so s = .50. The standard error of the sample mean now becomes:

s / √n = .50 / √100 = .50 / 10 = .05

The campus group substitutes the sample standard deviation for the population standard deviation, does the math, and arrives at the standard error of the sample mean, .05.

To draw inferences from this standard error, the group turns to a Student’s t-distribution table (Table 5-3). The specific shape of the Student’s t-distribution depends on the sample size. In normal estimation, we do not worry about the size of the sample; we simply calculate a value of Z. Using Student’s t, the sample size determines the shape of the distribution.

Page 39: Sampling and Inference_Political_Science

Inference Using The Student T-Distribution Cont…

The left-hand column of Table 5-3 is labeled “degrees of freedom,” a statistical property of a large family of distributions, including the Student’s t-distribution.

The number of degrees of freedom is tied to sample size: it is equal to the sample size minus 1, or n – 1.

The signature of the Student’s t-distribution is that it adjusts the confidence interval depending on the size of the sample.

More degrees of freedom mean less random sampling error and, thus, more confidence in the sample statistic.

Find the 95% confidence interval of the student pollsters’ sample mean. First, determine the number of degrees of freedom, which is tied to the sample size (degrees of freedom = n – 1). With n = 100, degrees of freedom = 99. No row corresponds exactly to 99 degrees of freedom, so use the closest lower number, 90 degrees of freedom. Read across to the column labeled “.025.” The number there, 1.987, tells us that .025, or 2.5%, of the curve falls above t = 1.987. The Student’s t-distribution, like the normal distribution, is perfectly symmetrical, so 2.5% of the curve must also lie below t = –1.987.

These t-values give us the info we need to define the 95% confidence interval of the sample mean:

Lower confidence boundary = sample mean – 1.987 standard errors = 2.80 – 1.987(.05) = 2.70

Upper confidence boundary = sample mean + 1.987 standard errors = 2.80 + 1.987(.05) = 2.90
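
The whole procedure (degrees of freedom, table lookup, confidence boundaries) can be sketched in Python. Table 5-3 is not reproduced in these notes, so the small lookup table below is an assumption: it holds a handful of standard t-values for the .025 column, and the function names are illustrative. The lookup follows the rule of using the closest lower row when the exact degrees of freedom are not listed.

```python
from bisect import bisect_right
from math import sqrt

# A few standard t-values for the .025 column (2.5% of the curve above t).
# The choice of rows is an assumption; Table 5-3 itself is not shown here.
T_025 = {10: 2.228, 30: 2.042, 60: 2.000, 90: 1.987, 120: 1.980}


def critical_t(degrees_of_freedom):
    """Look up t in the closest tabulated row at or below the given degrees of freedom."""
    rows = sorted(T_025)
    index = bisect_right(rows, degrees_of_freedom) - 1
    if index < 0:
        raise ValueError("no tabulated row at or below this many degrees of freedom")
    return T_025[rows[index]]


def t_confidence_interval(sample_mean, sample_sd, n):
    """95% confidence interval using s / sqrt(n) and the tabulated t-value for n - 1."""
    standard_error = sample_sd / sqrt(n)
    t = critical_t(n - 1)
    return sample_mean - t * standard_error, sample_mean + t * standard_error


lower, upper = t_confidence_interval(sample_mean=2.80, sample_sd=0.50, n=100)
print(f"t = {critical_t(99)}, interval = {lower:.2f} to {upper:.2f}")
```

Because the t-values shrink as the degrees of freedom grow, the same sample mean and standard deviation would give a narrower interval with a larger sample.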