
Section 1.1: The Structure of Data Objectivesfaculty.montgomerycollege.edu/maronne/Math117A-LOCK...3) In Exercises 1.32 to 1.35, describe the sample and describe a reasonable population


Section 1.1: The Structure of Data Objectives

1) Statistics, what is it?
2) Cases and variables
3) Organization of tables
4) Categorical and quantitative variables
5) Explanatory and response variables
6) Exploring data sets from our book

Example 1: How has the enrolment at MC changed over the years?

year    enrolment
2010    26015
2011    26996
2012    27453
2013    26155
2014    25517
2015    25320
2016    23916
2017    22875

Example 2: We want to study characteristics of students in this M017/117 class.
Survey: what questions would you ask?
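Returning to Example 1, one way to explore the enrolment question is to store the table with one row per year and compute the change from each year to the next. The Python sketch below is only an illustration; the names `enrolment` and `yearly_changes` are made up for this example and are not part of the course materials.

```python
# Enrolment table from Example 1: one row per year,
# with the two columns "year" and "enrolment".
enrolment = [
    (2010, 26015),
    (2011, 26996),
    (2012, 27453),
    (2013, 26155),
    (2014, 25517),
    (2015, 25320),
    (2016, 23916),
    (2017, 22875),
]


def yearly_changes(rows):
    """Return (year, change from the previous year) for every year after the first."""
    changes = []
    # Pair each row with the row that follows it.
    for (_, prev_count), (year, count) in zip(rows, rows[1:]):
        changes.append((year, count - prev_count))
    return changes


for year, change in yearly_changes(enrolment):
    print(year, change)
```

A positive change means enrolment rose from the previous year; a negative change means it fell.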


Example 3: We would like to study the relationship between
a) The weight of a car and its mileage (miles per gallon).
b) The number of hours of study and the grade on the test.

Example 4: we would like to study


Read section 1.1, starting on page 4; then answer the questions.

1) What is statistics?

Statistics is the science of collecting, describing, analyzing and interpreting data in order to make decisions.

Read about CASES AND VARIABLES:

2) The subjects/objects that we obtain information from are called _________________________________

3) What is a variable?

A STUDENT SURVEY DATASET – First-day survey given to students in an introductory statistics class over many years.

4) Consider table 1.1 (page 4), from the STUDENT SURVEY dataset.
a. Each column of the dataset corresponds to a different _______________
b. Each row of the dataset corresponds to a different _______________

5) Use a complete sentence to explain what the first five columns of the second row of table 1.1 tell you about student 2.

2 F Yes Academy 4

6) The following table lists final exam grades for some of my Spring 2015 students for the different classes that I taught.

Grades
Male: 98 (Math 117A), 95 (Math 098), 75 (Math 117A)
Female: 99 (Math 093), 89 (Math 117A)

Display the information from the above table as a dataset with cases as rows and variables as columns. (Use table 1.1 as a guide.)


Read about CATEGORICAL AND QUANTITATIVE VARIABLES – summarize definitions

7) Give your own example of a categorical and a quantitative variable.

a. Categorical
b. Quantitative

APPLYING THE CONCEPTS - try some odd numbered problems from the end of the section

8) For the situations described in Exercises 1.1 to 1.6: (1) What are the cases? (2) What is the variable and is it quantitative or categorical?

a) 1.2 - Record the percentage change in the price of a stock for 100 stocks publicly traded on Wall Street.

b) 1.4 - Measure the shelf life of bunches of bananas (the number of days until the bananas go bad) for a large sample.

c) 1.6 - Record whether or not the literacy rate is over 75% for each country in the world.

d) 1.6 – again. What if we want to record the literacy rate for each country in the world?

9) Read the “Tagging Penguins” example on page 10.
a. The cases are:
b. List the variables and classify each as quantitative or categorical

c. Show what a table for the data looks like

A categorical variable divides cases into __________________________
A quantitative variable records a ______________________________ for each _______________


Read about EXPLANATORY AND RESPONSE VARIABLES – complete the following:

10) Back to the “Tagging Penguins” example:

a. The explanatory variable is

b. The response variables are

11) Read EXAMPLE 1.8 on page 11. In Example 1.4, we considered the following three questions about relationships between variables in the AllCountries dataset. Identify the explanatory variable and the response variable if it makes sense to do so. (Check answers in the book.)
(a) Do countries larger in area tend to have a more rural population?
(b) Is the birth rate higher in developed or undeveloped countries?
(c) What is the relationship, if any, between a country's government spending on the military and on health care?

APPLYING THE CONCEPTS - try some odd numbered problems at the end of the section in WileyPlus.

12) In Exercises 1.7 to 1.10, a relationship between two variables is described. In each case, we can think of one variable as helping to explain the other. Identify the explanatory variable and the response variable.

a) 1.8 - Amount of fertilizer used and the yield of a crop

b) 1.10 - Year and the world record time in a marathon

We use one variable, the ________________________________, to help us understand or predict values of another variable, the __________________________________________.
The ____________________________________________ helps explain the _____________________________.
The _____________________________________ responds to the ___________________________________.


13) Using DATA SETS from Wiley

HOW TO ACCESS DATA SETS: Go to WileyPlus, click on Modules, select section 1.1, click on EXPLORE (top right), and select DATA SET.

NOTE: To find details about the data, use the “Data Set Index” on my website, http://faculty.montgomerycollege.edu/maronne/ ; go to the Statistics page, fourth row.

1) Access the ACS data. Look in the Data Set Index for a description of the data.
a. What does ACS stand for?
b. Give a summary DETAIL of the data.
c. How many cases are listed?
d. List each variable, the units, the codes used, and classify each variable as categorical or quantitative.

2) From the BIKE COMMUTE data show a table with 4 cases and two variables, one categorical and one quantitative; include units. What questions may we ask about a relation between two variables?


Section 1.2: Sampling from a Population

Objectives:
1) Population and sample
2) Census
3) Descriptive and inferential statistics
4) Sampling bias
5) Simple random samples
6) Selecting a random sample with the calculator
7) Random versus biased sample


To do at home: Read section 1.2 starting on page 16; then answer the questions.

1) In the book, read SAMPLES FROM POPULATIONS.

a. What is a census?

b. How often (every how many years) is it conducted?

c. What are the cases?

d. Does it attempt to gather data from a population or from a sample?

e. According to the census, ________% of people 25 years old or older have at least a bachelor’s degree.

APPLYING THE CONCEPTS - also, try some odd numbered problems from the end of the section

2) In Exercises 1.28 to 1.31, state whether the data are best described as a population or a sample.

a) 1.28 - To estimate size of trout in a lake, an angler records the weight of 12 trout he catches over a weekend.

b) 1.30 - The US Department of Transportation announces that of the 250 million registered passenger vehicles in the US, 2.1% are electro-gas hybrids.

3) In Exercises 1.32 to 1.35, describe the sample and describe a reasonable population.

a) 1.32 - A sociologist conducting a survey at a mall interviews 120 people about their cell phone use

b) 1.34 - A cell phone carrier sends a satisfaction survey to 100 randomly selected customers.

POPULATIONS AND SAMPLES
A population includes all individuals or objects of interest. Data are collected from a sample, which is a subset of the population.


Read about Inferential Statistics

APPLYING THE CONCEPTS – also, try some odd numbered problems from the end of the section

4) In Exercises 1.36 to 1.39, a biased sampling situation is described. In each case, give: (1) The sample (2) The population of interest (3) A population we can generalize to given the sample.

a) 1.36 - To estimate the proportion of Americans who support changing the drinking age from 21 to 18, a random sample of 100 college students is asked the question “Would you support a measure to lower the drinking age from 21 to 18?”

b) To investigate the growth of the canine population in New York City, 100 dogs are randomly selected from a registry of licensed pets in the city, and it is found that 78 of them have been neutered.

5) We want to estimate the proportion of male students at Montgomery College. We observe that 35% of the students in our class, Math 117A, are male. We conclude that 35% of MC students are male.

a. Identify the population

b. Identify the sample

c. Write the descriptive statement (The one that describes the sample)

d. Write the inferential statement (The generalization to the population)

e. Do you trust this inferential statement? Yes/No, why?

INFERENTIAL STATISTICS
The process of using ____________________________ data to gain information about the ______________________________ is called ________________________


Section 1.2: Sampling from a Population – Biased Samples

6) Read about SAMPLING BIAS

7) Read example 1.11 “Dewey defeats Truman” on page 17. Answer question (c) from this example.

8) Wording Bias - A survey is to be conducted using a random sample of citizens in a town, asking them if they support raising taxes to increase funding for the public school.

Which of the following questions, (a), (b), or (c), will bias the results toward more YES answers?
Which will bias the results toward more NO answers?
Which is the best way to phrase the question?

(a) “Our taxes are far higher than they should be already and the schools are just wasting the money. Do you support a big tax hike to throw more money at them?”
(b) “Do you support raising taxes to increase funding for the public schools?”
(c) “Our schools do such a fantastic job and really need our help. Do you support raising taxes a small amount to increase funding for them?”

Sampling bias occurs when the method of selecting a ____________________________ causes the sample to ______________________________ from the ________________________________________ in some relevant ways. In this case, we can’t trust generalizations from the _____________________________ to the _____________________________________


Section 1.2: Sampling from a Population – Random Samples

9) Read about SIMPLE RANDOM SAMPLES

10) How would you select a simple random sample of 50 students from all the students taking an 8 a.m. class in the Science Center East building of Montgomery College?

Using the calculator to select simple random samples

11) To do in class: Use the calculator to select a simple random sample of 5 students from our class of 28 students.
Press MATH, arrow to PRB, and select 5:RandInt(1,28,5).
If some numbers repeat, run it again until you have 5 different random integers.

When choosing a simple random sample of n individuals, all groups of size n in the population have ____________________________ of becoming the sample.
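The calculator steps in question 11 can also be mimicked in software. The Python sketch below assumes the students are numbered 1 to 28. It uses `random.sample`, which draws without replacement, so repeats cannot occur and no re-running is needed. The function name `simple_random_sample` and its `seed` parameter are hypothetical, chosen only for this illustration.

```python
import random


def simple_random_sample(population_size, n, seed=None):
    """Pick n different ID numbers from 1..population_size, each group of n equally likely."""
    rng = random.Random(seed)  # pass a seed only if a repeatable draw is wanted
    ids = range(1, population_size + 1)
    return sorted(rng.sample(ids, n))  # sample() never returns the same ID twice


# Select 5 of the 28 students in the class.
print(simple_random_sample(28, 5))
```

Each run prints a different set of five ID numbers unless a seed is supplied.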


Section 1.2: Sampling from a Population – Biased versus Random Sampling Activity

12) OBJECTIVE - Estimate the average length of words in Lincoln’s Gettysburg Address

a) Biased Samples: The entire text of Lincoln’s Gettysburg Address is given below. Your task is to select a sample of 10 words to estimate the average length of words in this speech. Pick words that appear to be representative of the population of all the words. Circle the ten words you choose. For each of the selected words, count the number of letters and list them here:

b) Calculate the average number of letters per word for the 10 words in your sample; round to one decimal place.

Average number of letters per word rounded to one decimal place: SUM / 10 = _______________

c) Construct a dot plot with class results. [Dot plot axis: average number of letters per word, marked from 2 to 11]

Lincoln’s Gettysburg Address “Four score and seven years ago our fathers brought forth, on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate—we can not consecrate—we can not hallow—this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they here gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.”


Section 1.2: Sampling from a Population – Exploring RANDOM SAMPLING Activity

13) OBJECTIVE - Estimate the average length of words in Lincoln’s Gettysburg Address

Random Sampling: The entire text of Lincoln’s Gettysburg Address has 268 words. We will select 10 words at random.

a) Use your calculator to select 10 numbers at random from 1 to 268: MATH, PRB, 5:RandInt(1,268,10). If there is repetition, run it again until you have 10 different integers.

List the numbers here:

These numbers are the identifiers. Use the table given on the next page to identify and circle the selected words.

Write the number of letters of each of the ten selected words here:

Now, find the average number of letters per word, rounded to one decimal place: Sum / 10 = ______________________

b) Construct a dot plot with class results. [Dot plot axis: average number of letters per word, marked from 2 to 11]

14) The average number of letters per word in the entire text of Lincoln’s Gettysburg Address is 4.29. Which of the two sampling techniques (on this page and the previous page) produces a better estimate? Why do you think this is the case?
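The random sampling activity can be repeated many times in software to see how the sample averages behave. The Python sketch below is an illustration under stated assumptions. To stay short, it uses only the opening sentence of the address as a stand-in population, not all 268 words, so its numbers will not match the 4.29 quoted above. The helper name `random_sample_mean` and the seed value are made up for this example.

```python
import random
import re
from statistics import mean

# Stand-in population (assumption): only the opening sentence of the address.
TEXT = (
    "Four score and seven years ago our fathers brought forth, on this "
    "continent, a new nation, conceived in Liberty, and dedicated to the "
    "proposition that all men are created equal."
)

words = re.findall(r"[A-Za-z]+", TEXT)  # keep runs of letters, drop punctuation
lengths = [len(w) for w in words]       # number of letters in each word


def random_sample_mean(values, n, rng):
    """Average of a simple random sample of n values, drawn without repeats."""
    return mean(rng.sample(values, n))


rng = random.Random(2024)  # fixed seed (arbitrary) so the run is repeatable
sample_means = [random_sample_mean(lengths, 10, rng) for _ in range(1000)]

print("population mean:", round(mean(lengths), 2))
print("average of 1000 sample means:", round(mean(sample_means), 2))
```

Comparing the two printed values shows whether random samples tend to center on the population average. The same comparison can be made by hand with the two class dot plots.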


Section 1.2: Sampling from a Population – Random or not?

APPLYING THE CONCEPTS - try some odd numbered problems from the end of the section

15) In Exercises 1.40 to 1.45, state whether or not the sampling method described produces a random sample from the given population.

a) 1.40 - The population is incoming students at a particular university. The name of each incoming student is thrown into a hat, the names are mixed, and 20 names (each corresponding to a different student) are drawn from the hat.

b) 1.42 - The population is all employees at a company. All employees are emailed a link to a survey.

c) 1.44 - The population is all trees in a forest. We walk through the forest and pick out trees that appear to be representative of all the trees in the forest.

Section 1.3 - Observational Studies versus Designed experiments Objectives:

1) Observational studies and designed experiments: what is the difference?
2) What is a treatment? In which one does the scientist impose a treatment?
3) What is confounding? In which one do we observe confounding?
4) What do we mean by randomization? In which one is it used?
5) How can you accomplish randomization?
6) In which one can we talk about cause and effect?
7) Treatment and control groups
8) What do we mean by blinding? Double blinding? Placebo effect?

Example 1: Which diet is more effective to reduce weight; diet A or B?

• Scenario 1. Some patients follow diet A and others diet B. After a few months, data on weight loss is collected and the information is used to decide which diet is more effective.

• Scenario 2. The doctor divides the patients randomly into two groups, and assigns each group to one of the two diets A or B. After a few months, data on weight loss is collected and the information is used to decide which diet is more effective.

What do you think? Which is a better procedure to help us decide which diet is more effective? Why?


Example 2: Testing a medicine – Scenario 1

Suppose a pharmaceutical company has developed a new medicine to cure a disease. To see whether or not this medicine is effective in curing this disease, it will have to be tested on a group of humans. Suppose there are 100 persons who have this disease; 50 of them voluntarily decide to take this medicine, and the remaining 50 decide not to take it. The researcher then compares the cure rates for the two groups of patients.

A) Is this an example of a designed experiment or an observational study? Explain.
B) Can we conclude that the medicine is effective in curing the disease? Explain.
C) Name possible confounding variables.

Testing a medicine – Scenario 2

Reconsider the above example. Now, suppose that out of the 100 people who have this disease, 50 are selected at random. These 50 people make up one group, and the remaining 50 belong to the second group. One of these groups is the treatment group, and the second is the control group. The researcher then compares the cure rates for the two groups of patients.

A) Is this an example of a designed experiment or an observational study?
B) Can we conclude that the medicine is effective in curing the disease? Explain.
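The random split of the 100 patients described in Scenario 2 can be sketched in Python. This is an illustration only: the patient ID numbers 1 to 100, the function name `assign_groups`, and which half is labelled "treatment" are all assumptions made for the example.

```python
import random


def assign_groups(unit_ids, seed=None):
    """Shuffle the units and split them into two halves: (treatment, control)."""
    rng = random.Random(seed)  # pass a seed only if a repeatable split is wanted
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)      # every ordering of the units is equally likely
    half = len(shuffled) // 2
    return sorted(shuffled[:half]), sorted(shuffled[half:])


patients = range(1, 101)  # hypothetical ID numbers for the 100 patients
treatment, control = assign_groups(patients)

print("treatment group size:", len(treatment))
print("control group size:", len(control))
```

Because chance alone decides who lands in each group, no patient or researcher chooses who receives the medicine.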


Section 1.3: Experiments and Observational Studies

Example 1: A study found that people who own two cars live longer than people who own only one car. Does having more cars make you live longer?

Example 2: Does smoking cause lung cancer?

APPLYING THE CONCEPTS – try some odd numbered problems from the end of the section

1) In Exercises 1.66 to 1.71, we give a headline that recently appeared online or in print. State whether the claim is one of association and causation, association only, or neither association nor causation.

a) 1.66 - Daily exercise improves mental performance.

b) 1.68 - Cell phone radiation leads to deaths in honey bees.

c) 1.70 - Cat owners tend to be more educated than dog owners.

APPLYING THE CONCEPTS – try some odd numbered problems from the end of the section

2) Exercises 1.72 – 1.77 describe an association between two variables. Give a confounding variable that may help to account for this association.

a) 1.72 - More ice cream sales have been linked to more deaths by drowning.

b) 1.74 - People who own a yacht are more likely to buy a sports car.

c) 1.76 - Air pollution is higher in places with a higher proportion of paved ground relative to grassy ground.

ASSOCIATION
Two variables are associated if values of one variable tend to be related to the values of the other variable. This does not mean that changes in one variable cause changes in the other variable.

CAUSATION
Two variables are causally associated if changing the value of one variable influences the value of the other variable.

CONFOUNDING VARIABLE
A confounding variable, also known as a confounding factor or lurking variable, is a third variable that is associated with both the explanatory variable and the response variable. A confounding variable can offer a plausible explanation for an association between two variables of interest.


APPLYING THE CONCEPTS – try some odd numbered problems from the end of the section

3) In Exercises 1.78 to 1.81, we describe data collection methods to answer a question of interest. Are we describing an experiment or an observational study?

a) 1.78 - To examine whether eating brown rice affects metabolism, we ask a random sample of people whether they eat brown rice and we also measure their metabolism rate.

b) 1.80 - To examine whether planting trees reduces air pollution, we find a sample of city blocks with similar levels of air pollution and we then plant trees in half of the blocks in the sample. After waiting an appropriate amount of time, we measure air pollution levels.

4) Read your book to explore the following concepts. Then summarize below.

a. Treatment and Control (or placebo) group

b. Placebo effect

c. Blinding

d. Double blinding

e. What is the objective of using “blinding or double blinding” in experiments?

5) For the results of a study to be unbiased and valid, an experiment must be ___________________________

OBSERVATIONAL STUDIES AND EXPERIMENTS
An experiment is a study in which the researcher actively controls one or more of the explanatory variables. An observational study is a study in which the researcher does not actively control the value of any variable but simply observes the values as they naturally exist.

CAUSATION CAUTION
It is difficult to avoid confounding variables in observational studies. For this reason, observational studies can almost never be used to establish causality.


6) Can causal relationships be established from observational studies or from randomized experiments?

7) Are the following examples of “randomized comparative” or “matched pairs” experiments?

a) Testing if tutoring has an effect on the math scores.

Five students took a math test before and after tutoring. Their scores were as follows.

b) The effectiveness of a new headache medicine is tested by measuring the amount of time before the headache is cured for patients who use the medicine and another group of patients who use a placebo drug.

c) The effect of caffeine as an ingredient is tested with a sample of regular soda and another sample with decaffeinated soda.

d) The effectiveness of a headache medicine is tested by measuring the intensity of a headache in patients before and after drug treatment. The data consist of before and after intensities for each patient.

RANDOMIZED EXPERIMENT
In a randomized experiment, the value of the explanatory variable for each unit is determined randomly, before the response variable is measured. If a randomized experiment yields an association between the two variables, we can establish a causal relationship from the explanatory to the response variable.

TWO TYPES OF RANDOMIZED EXPERIMENTS
In a randomized comparative experiment, we randomly assign cases to different treatment groups and then compare results on the response variable(s).
In a matched pairs experiment, each case gets both treatments in random order (or cases get paired up in some other obvious way), and we examine individual differences in the response variable between the two treatments.


A WORD ABOUT RANDOMIZATION

• To generalize from a sample to a population we select at random the units (individuals, cases) in the sample.

• To establish causality, we assign at random the values of the explanatory variable to each unit in the sample.

Figure 1.3: Two fundamental questions about data collection