Bias! Anecdotes! Randomization! and more!

Learning Objectives

By the end of this lecture, you should be able to:

List and describe two common methods by which we obtain data.

Define anecdote, with example(s).

List two key goals in obtaining a reasonably good sample.

Recognize and identify sources of bias in a sample.

Define ‘population’ in terms of its relationship to ‘sample’.

Define: selection bias, response bias, non-response bias, wording bias.

Define statistical inference.

Identify and explain two key potential pitfalls in inference.

Clearly this is not a “numbers” oriented lecture. In terms of quizzes and exams, I would suggest you go through the lecture a few times until you really can answer these objectives in your own words.

Where does data come from?

Two main sources of data:

Available Data: As you would expect, this is data that is already available from some other source. For example, if you were trying to do an analysis of SAT scores, you could contact the testing service and ask for publicly available data. Similarly, the US government puts census data on the web. Other government agencies and many organizations make data available as well. In this day and age, there is more data available at the click of a mouse than one could possibly hope to make use of in a lifetime.

Samples: Sometimes we need to ask a question for which there is no available data. For example, suppose we wanted to find the average height of DePaul undergraduate women. Clearly it would be impossible (or at least time-consuming and costly) to measure every such student at DePaul. For this reason we take a random sample of DePaul undergraduate women and hope that, from that sample, we can infer information about all undergraduate women at DePaul.

The terms ‘infer’, ‘random sample’, and ‘population’ are very important. We will discuss each of them in more detail.

What is one TERRIBLE way to obtain data?

Answer: Anecdotes. That is, things that “we have heard”, or “are widely known”, or “are common knowledge”, as opposed to data that comes from evidence.

A new 4-letter (o.k., 8-letter) Word: Anecdote

Anecdotal evidence refers to people accepting as evidence individual stories/incidents that they have heard about. That is, the kinds of things we hear from friends, case-studies in the press, “crazy coincidences”, etc.

Anecdotes are based on selected, individual cases. Yet we tend to remember them, because they are often unusual in some way. This is because we don’t bother remembering the non-coincidences that happen tens of thousands of times every single day.

Humans are really, really, REALLY good at “spotting” patterns when, in fact, no pattern actually exists.

Key Point: Anecdotal “evidence” is NOT evidence!

Anecdote: Smoking doesn't cause lung cancer. "My grandmother lived to 95 and smoked like a chimney, and didn't die of lung cancer."

It is certainly true that not all smokers die of lung cancer. However, the vast majority of people who get lung cancer are smokers.

Anecdote: Homeopathy works! “My aunt had arthritis for 15 years. She went to 8 different specialists, none of whom could cure her. But then she tried Bryonia 5CH and within 3 days it had improved.”

People’s medical conditions do change with time. For every person whose condition improves right around the time they start taking an alternative therapy, there are thousands who do not.

As with plane crashes, we humans naturally talk only about the findings that are interesting to us. We aren’t lying, we’re just… human.

Remember that as humans, we are great at finding patterns – even when they are not real.

Anecdote: People wearing top hats live longer. Back in the day, this fact was supported by a great deal of anecdotal evidence. “My grandpa wore a top hat and lived until 97 years old!”

People who wore top hats were usually richer, and could therefore afford better food, shelter, sanitation, and medical care. A wider study that “controlled for” (important term!) people's income was easily able to show that the claim was false.

This is a great example of how people who do not know how to recognize a lack of causation are easily fooled! In fact, assuming causation where there is none is one of the most common ways in which statistics are abused.
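To make “controlled for” a little more concrete, here is a minimal simulation sketch in Python. Every number in it (the share of rich people, how often each group wears a top hat, the average lifespans) is invented for illustration and does not come from any real study. In this toy world the hat has no effect on lifespan at all; income drives both hat-wearing and longevity.

```python
import random
from statistics import mean

def simulate_people(n=20000, seed=1):
    """Toy population in which income drives both hat-wearing and lifespan.

    All numbers are made up. The hat itself has NO effect on lifespan here.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        rich = rng.random() < 0.2                      # assume 20% are rich
        hat = rng.random() < (0.7 if rich else 0.05)   # the rich wear top hats far more often
        lifespan = rng.gauss(75 if rich else 60, 8)    # the rich live longer on average
        people.append((rich, hat, lifespan))
    return people

def hat_gap(people):
    """Mean lifespan of hat wearers minus mean lifespan of non-wearers."""
    wearers = [life for _, hat, life in people if hat]
    others = [life for _, hat, life in people if not hat]
    return mean(wearers) - mean(others)

people = simulate_people()

# Naive comparison: lump everyone together.
naive_gap = hat_gap(people)

# "Controlling for" income: compare hat wearers with non-wearers
# separately within each income group.
rich_gap = hat_gap([p for p in people if p[0]])
poor_gap = hat_gap([p for p in people if not p[0]])

print(f"All people together: {naive_gap:+.1f} years")
print(f"Rich only:           {rich_gap:+.1f} years")
print(f"Not rich only:       {poor_gap:+.1f} years")
```

Lumped together, hat wearers appear to live much longer. Once we compare like with like (rich with rich, non-rich with non-rich), the apparent benefit of the hat shrinks to roughly nothing.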

For more information on this story, be sure to join us for the 6:00 news…

The plural of anecdote is anecdotes; it is not evidence!

Think about the internet “echo chamber”. People with similar interests constantly reinforce their beliefs by restating those beliefs to people who typically hold similar views. When is the last time any of us spent real time on websites reading political journals, blogs, etc. from people we don’t agree with?!

Where does data come from? Two main sources are: Available Data, and Samples.

SAMPLES: Frequently, we are interested in analysing a topic for which there is no available data. In this case, we take a “sample” of observations and hope that from that sample, we can infer information about the rest of the population. Making sure you get a proper sample is a HUGELY important issue when it comes to setting up a study. There are many issues to think about. For now we will concern ourselves with two in particular:

Sample Size: More is typically better. However it is not always feasible, and can often be very costly. If a sample size is very small, however, it can severely limit our ability to draw any meaningful conclusions.

Randomization: This is one of the most important and widely abused aspects of study design, and for this reason we will discuss it separately. For now, it is important to recognize that when choosing a sample, the people (or whatever is being observed) must be chosen at random and must be representative of the population you are interested in.

Sample Size

Researchers and statisticians love large sample sizes, as the larger the sample size (‘n’), the more confident we are in the results. However, larger samples are not always practical or even possible.

Suppose you want to test a new cancer drug and you wanted to enroll 500 patients. Now suppose that the drug costs $200,000 a year (which could happen). This study would almost certainly not be feasible.

Suppose you wish to investigate a very rare form of cancer. It is so rare that there are only 173 cases in the entire country, and only 14 of them are even remotely in your geographic area. Unless you can somehow make the study work remotely (many cannot), you are stuck with an n of 14.

Suppose you were okay with the previous study of 14 people, only to find out that 2 are unwilling to join your study because they can’t commit to the time requirements, and another 4 have other illnesses that prevent them from participating. You are now down to an n of 8.
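Why do we care so much about n? Here is a rough sketch in Python, under made-up assumptions: a hypothetical population of 5,000 heights (in inches), from which we repeatedly draw random samples of a given size and look at how much the sample averages bounce around. The sizes 8 and 14 echo the rare-cancer example (14 cases, minus the 6 who cannot take part); 100 and 500 are there for contrast.

```python
import random
from statistics import mean, stdev

def spread_of_sample_means(population, n, rng, repeats=500):
    """Draw `repeats` random samples of size n and return the standard
    deviation of their means (a smaller value means more stable results)."""
    means = [mean(rng.sample(population, n)) for _ in range(repeats)]
    return stdev(means)

rng = random.Random(42)

# Hypothetical population: 5,000 heights in inches (invented numbers).
population = [rng.gauss(64, 3) for _ in range(5000)]

spreads = {n: spread_of_sample_means(population, n, rng) for n in (8, 14, 100, 500)}

for n, s in spreads.items():
    print(f"n = {n:3d}: sample averages vary with a standard deviation of {s:.2f} inches")
```

The smaller the sample, the more the average jumps around from one sample to the next, which is exactly why a tiny n limits what we can conclude.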

Making sure your sample is random

Imagine that we are doing a relatively simple study in order to determine the average height of DePaul undergraduate women. In this case, it would not be very difficult to obtain a large sample. However, where would you obtain this sample?

At a basketball practice? At a gymnastics meet? Both of those places? Neither?

Answer: You would try to spread it out and avoid places where there is clearly a ‘bias’ in favor of a particular height pattern. So ideally, you would sample women from all campuses, at all times of day, and in all majors. At that point you might be reasonably confident that you have chosen a random and representative sample of undergraduate women at DePaul.
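The height example can be sketched in a few lines of Python. The roster below is entirely hypothetical (4,000 students, 15 of whom play basketball and tend to be taller), and the heights are invented. The sketch compares measuring everyone at basketball practice with drawing a simple random sample from the full roster.

```python
import random
from statistics import mean

rng = random.Random(7)

# Hypothetical roster (invented numbers): heights are in inches.
roster = [{"team": "none", "height": rng.gauss(64, 2.5)} for _ in range(3985)]
roster += [{"team": "basketball", "height": rng.gauss(71, 2.5)} for _ in range(15)]

# In real life we usually cannot compute this; it is here only for comparison.
true_mean = mean(s["height"] for s in roster)

# Biased approach: measure everyone who shows up at basketball practice.
practice = [s["height"] for s in roster if s["team"] == "basketball"]

# Simple random sample: every student on the roster is equally likely to be chosen.
srs = [s["height"] for s in rng.sample(roster, 100)]

print(f"Whole roster:                {true_mean:.1f}")
print(f"Basketball practice sample:  {mean(practice):.1f}")
print(f"Simple random sample of 100: {mean(srs):.1f}")
```

The practice sample lands well above the roster average, while the random sample lands close to it, even though neither sample is especially large.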

Random sampling:

Key point: Individuals are selected at random and no one group is over-represented.

Random sampling avoids several potential sources of bias.

A sample that is not random is essentially useless.

‘Nuff said.

Population versus Sample

Sample: The part of the population we actually examine.

A statistic is a number describing a characteristic of a sample.

Population: The entire group of individuals in which we are interested but can’t usually assess directly.

A parameter is a number describing a characteristic of the population.

Population versus Sample

Sample: The part of the population we actually examine.

Examples:

- We sample 200 working-age people in California

- We sample 30 DePaul undergraduate women

- We sample 150 male crickets

Population: The entire group of individuals in which we are interested but can’t usually assess directly.

Examples:

- Income of all working-age people in California

- Height of all DePaul undergraduate women

- Length of all male crickets
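One way to keep ‘parameter’ and ‘statistic’ straight is to see them side by side. The sketch below uses the cricket example with a hypothetical population of 10,000 male crickets whose lengths (in mm) are invented. The parameter is a single fixed number; each sample of 150 crickets produces its own statistic.

```python
import random
from statistics import mean

rng = random.Random(3)

# Hypothetical population: lengths (mm) of 10,000 male crickets.
# In a real study we could not measure all of them.
population = [rng.gauss(20, 2) for _ in range(10000)]

# Parameter: one fixed number describing the whole population.
parameter = mean(population)

# Statistics: each sample of 150 crickets gives its own sample mean.
statistics_seen = [mean(rng.sample(population, 150)) for _ in range(3)]

print(f"Parameter (population mean): {parameter:.2f} mm")
for i, stat in enumerate(statistics_seen, start=1):
    print(f"Statistic from sample {i}:     {stat:.2f} mm")
```

The three statistics differ a little from one another and from the parameter. In practice we only ever see one of them, and we use it to estimate the parameter we cannot see.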

Population vs Sample

1. A political scientist wants to know what percentage of college students consider themselves conservatives.

2. An automaker hires a market research firm to learn what % of adults 18-35 recall seeing TV ads for a new SUV.

3. Government economists want to know about average household income in Chicago.

It would be impossible to ask these questions of every single college student / adult / household. Instead, we ask a sample of college students / adults / households.

The population refers to the entire group that we want information about.

The sample is the small section of the population that we actually examine.

The GOAL of a study is to take the information we derive from the sample and to generalize it, i.e. to “infer” information about the entire population.

Identify the population for the three examples mentioned above:

1. All college students.

2. All adults aged 18-35 years old.

3. All households in Chicago. However, this is sloppy. Do we mean greater Chicago? Are we including both inner-city Chicago and the Gold Coast? If we do, we are basically looking at two very different groups! (Recall from our discussion on categorical variables in scatterplots: when you have different groups that are each likely to have their own distinct pattern of data, you should plot them separately.)

BIAS

It may not always be intentional, but it’s always there!

If you’re biased and you know it…

Biases are everywhere.

It is very important to be aware of the different types of bias and where they tend to show up.

Two examples of bias seen in sampling methods

1. Convenience sampling: Just ask whoever is around.

Example: “Man on the street” survey (cheap, convenient, often quite opinionated or emotional => now very popular with TV “journalism”)

Which men, and on which street?

Ask about gun control or legalizing marijuana “on the street” in Berkeley vs. rural Texas and you would get wildly different results.

Even within an area, answers would probably differ if you did the survey outside a high school or a country western bar.

Bias: Opinions limited to individuals who are present.

2. Voluntary Response Sampling:

Individuals choose to be involved. These samples are very susceptible to being biased because different people are motivated to respond or not. Often called “public opinion polls,” these are not considered valid or scientific.

Bias: Sample design systematically favors a particular outcome.

Bias present? Ann Landers summarizing responses of her readers:

“70% of (10,000) parents wrote in to say that having kids was not worth it: if they had to do it over again, they wouldn’t.”

Bias: Most letters to newspapers are written by disgruntled people. A later sample found the exact opposite result! Incidentally, it turned out that the later sample was also very flawed.
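To see how a write-in result can drift so far from reality, here is a small simulation sketch in Python. The rates are assumptions chosen purely for illustration (they are not Ann Landers’ data): suppose only 10% of parents regret having kids, but unhappy parents are twenty times more likely to bother writing a letter.

```python
import random

rng = random.Random(11)

# Made-up assumptions, for illustration only.
TRUE_REGRET_RATE = 0.10
WRITE_IN_RATE = {True: 0.20, False: 0.01}   # keyed by "regrets having kids?"

# Hypothetical population of 100,000 parents (True = regrets having kids).
parents = [rng.random() < TRUE_REGRET_RATE for _ in range(100000)]

# Voluntary response: each parent decides for themselves whether to write in.
letters = [regrets for regrets in parents if rng.random() < WRITE_IN_RATE[regrets]]

# Random sample: the pollster, not the respondent, decides who is asked.
random_sample = rng.sample(parents, 1000)

def share(answers):
    """Fraction of True values in a list of booleans."""
    return sum(answers) / len(answers)

print(f"True share who regret:      {share(parents):.0%}")
print(f"Share among letter writers: {share(letters):.0%}")
print(f"Share in random sample:     {share(random_sample):.0%}")
```

Nobody in this simulation lies. The letter writers’ share is far above the true share simply because of who chose to respond, while the random sample stays close to the truth.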

Online surveys: Is there a bias?

Answer: Voluntary response bias. People have to care enough about an issue to bother replying. This sample is probably a combination of people who hate “wasting the taxpayers’ money” and “animal lovers.”

Common biases you should be able to identify:

Nonresponse Bias: People who feel they have something to hide or who don’t like their privacy being invaded probably won’t answer. Yet they are absolutely part of the population under study!

Remember that the most important objective of a good sample is for that sample to accurately represent the population.

Response Bias: Fancy term for lying. This is particularly important when the questions are very personal (e.g., “How much do you drink?”)

Wording Bias (wording effects): Questions worded like “Do you agree that it is awful that…” are prompting you to give a particular response.

Selection Bias: an important one (upcoming slide…)

Etc., etc.: Bias can show up in all kinds of unexpected ways (and not all of them have names).

Selection Bias

This is a common form of bias. It occurs when the group that is sampled has something in common that relates to the issue under consideration.

Example: You are conducting a poll to determine whether taxpayer dollars should be used to improve Wrigley Field. The pollsters randomly sample people from ‘The Cubby Bear’ (a popular Cubs bar), and outside the White Sox convention at the Palmer House. In both cases, there will be a selection bias, albeit likely with different results.

Example: The majority of people who are asked about their experiences with psychics report positive results. This is a selection bias since the people asked are motivated to have their beliefs validated.

Example: There is a tendency of people who review products they have purchased online to give positive reviews. The reason for this is the same as in the example above.

Another sampling biggie: Undercoverage

Occurs when parts of the population are left out in the process of choosing the sample.

Because the U.S. Census goes “house to house,” homeless people are not represented. Illegal immigrants also avoid being counted. Geographical districts with a lack of coverage tend to be poor. Representatives from wealthy areas typically oppose statistical adjustment of the census.

Historically, many clinical trials avoided including women in their studies because of their periods and the chance of pregnancy. As a result, many medical treatments were not appropriately tested for women. This problem is slowly being recognized and addressed.

To assess the opinion of students at the Ohio State University about campus safety, a reporter interviews 15 students he meets walking on the campus late at night who are willing to give their opinion.

What is the sample here? What is the population? Is there significant bias present?

- All those students walking on campus late at night

- All students at universities with safety issues

- The 15 students interviewed

- All students approached by the reporter

Sample: The 15 students.

Target population: All Ohio State students.

Selection Bias: People who feel safe are more likely to walk out at night. People who don’t feel safe probably won’t do so as often. They would be under-represented in the sample.

Possible Non-Response Bias: Entirely possible that some people would hurry away or refuse to answer if someone approaches them with a question at night.

Others?

Example: An SRS (simple random sample) of 1200 adult Americans is selected and asked: “In light of the huge national deficit, should the government at this time spend additional money to establish a national system of health insurance?” Thirty-nine percent of those responding answered yes. What can you say about this survey?

If it is truly a random sample, then we are being told that the sampling process is relatively free from bias. However, in this case, the wording is biased. The results probably understate the percentage of people who do favor a system of national health insurance.

If you don’t like the results you find, however…

Selection Bias. This is an egregious example, in that the selection bias was intentionally created by the pollsters.

Assuming that this program truly did randomly sample ‘likely voters’ (as opposed to ‘likely voters who watch this particular program’), then this is a very reasonable poll.

Let’s play: Find the Bias

What toothpaste do people prefer?

Experiment: In order to determine which brand of toothpaste Americans prefer, researchers wait outside of Whole Foods Market and ask everyone who bought toothpaste which brand they preferred. What are some biases present in this experiment?

Potential Biases?

- Whole Foods is an upscale market. Many shoppers are from a higher income bracket and some will buy ‘boutique’ products (including toothpaste!)

- Colgate is on sale

- Crest just had an advertising blitz during the Super Bowl

- Oprah mentioned in an interview that she likes Aquafresh
