Tutorial Workshop Questions s1 2016

ECON 1203 Tutorial Workshop Questions Semester 1 2016

***This document will be periodically updated with questions to be discussed in succeeding tutorials, and re-posted to Moodle every fortnight.***

Weeks 1 and 2

1. (a) What is meant by a variable in a statistical sense? Distinguish between qualitative and quantitative statistical variables, and between continuous and discrete variables. Give examples.

(b) Distinguish between (i) a statistical population and a sample; (ii) a parameter and a statistic. Give examples.

2. In order to know the market better, the second-hand car dealership, Anzac Garage, wants to analyze the age of second-hand cars being sold. A sample of 20 advertisements for passenger cars is selected from the second-hand car advertising/listing website www.drive.com.au. The ages in years of the vehicles at time of advertisement are listed below:

5, 5, 6, 14, 6, 2, 6, 4, 5, 9, 4, 10, 11, 2, 3, 7, 6, 6, 24, 11

(a) Calculate the frequency, cumulative frequency and relative frequency distributions for the age data using the following bin classes:

More than 0 to less than or equal to 8 years
More than 8 to less than or equal to 16 years
More than 16 to less than or equal to 24 years

(b) Sketch a frequency histogram using the calculations in part (a). What can you say about the distribution of the age of these second-hand cars? Is there anything that concerns you about the frequency table and histogram? Specifically, is the choice of bin classes appropriate? What needs to be done differently?

(c) Halve the width of the bins (0 to 4, 4 to 8, etc) and recalculate the frequency, cumulative frequency and relative frequency distributions. Using the new distributions and histogram, what can you now say about the distribution of the age of second-hand cars?
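A hand tabulation for Question 2 can be checked with a minimal Python sketch such as the one below. The helper name `frequency_table` is illustrative and not part of the course materials; bins are treated as (lower, upper], matching the "more than ... to less than or equal to" wording.

```python
# Car ages (years) from Question 2.
ages = [5, 5, 6, 14, 6, 2, 6, 4, 5, 9, 4, 10, 11, 2, 3, 7, 6, 6, 24, 11]


def frequency_table(data, width, upper):
    """Tabulate equal-width bins (lower, upper] from 0 up to `upper`.

    Returns a list of (lower, upper, frequency, cumulative, relative) tuples.
    """
    rows = []
    cumulative = 0
    for lower in range(0, upper, width):
        freq = sum(1 for x in data if lower < x <= lower + width)
        cumulative += freq
        rows.append((lower, lower + width, freq, cumulative, freq / len(data)))
    return rows


wide = frequency_table(ages, width=8, upper=24)    # bins for part (a)
narrow = frequency_table(ages, width=4, upper=24)  # halved bins for part (c)

for label, table in (("width 8", wide), ("width 4", narrow)):
    print(label)
    for lower, upper, freq, cum, rel in table:
        print(f"  ({lower:2d}, {upper:2d}]  f={freq:2d}  cum={cum:2d}  rel={rel:.2f}")
```

Comparing the two printed tables side by side shows how much detail the wider bins hide.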

3. Health expenditure

A recent report by Access Economics provides a comparison of Australian expenditures on health with that of comparable OECD countries. Data from that report relating to the year 2005 have been used to reproduce their Figure 2.2 (below denoted as Figure 2.1).

(a) What are the key features of these data?

(b) While this is a bivariate scatter plot, there are three variables involved: health expenditure, GDP and population. Why account for population by expressing health expenditure and GDP in per capita terms?


4. Australian housing prices

Recent research by Dr Nigel Stapledon at the UNSW School of Economics provides an extensive analysis of Australian housing prices since 1880. In Figure 2.2 his data are used to provide a comparison of Sydney and Melbourne housing prices over time.

(a) What are the key features of these data?

(b) Why have prices been expressed in constant dollars?

[Figure 2.1 OECD Health Expenditure and GDP. Scatter plot; vertical axis: Health expenditure per capita (US$000), 0 to 7; horizontal axis: GDP per capita (US$000), 0 to 70.]

[Figure 2.2 Comparison of Sydney and Melbourne median house prices in constant 2007-08 Dollars. Vertical axis: Thousands of dollars, 0 to 600; horizontal axis: Year, 1860 to 2020; series: Sydney, Melbourne.]

5. Using the car data from Question 2:

(a) Calculate the mean, median and mode for this sample of data and use these statistics to further describe the distribution of car ages.

(b) If the largest observation were removed from this data set, how would the three measures of central tendency you have calculated change?
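For Question 5, the three measures of central tendency can be cross-checked with Python's standard `statistics` module. This is only a sketch for verifying hand calculations; the helper name `centre` is illustrative.

```python
import statistics

# Car ages (years) from Question 2.
ages = [5, 5, 6, 14, 6, 2, 6, 4, 5, 9, 4, 10, 11, 2, 3, 7, 6, 6, 24, 11]

# Part (b): the same data with the single largest observation dropped.
trimmed = sorted(ages)[:-1]


def centre(data):
    """Return (mean, median, mode) of a list of numbers."""
    return statistics.fmean(data), statistics.median(data), statistics.mode(data)


full_summary = centre(ages)
trimmed_summary = centre(trimmed)

print("all 20 cars:      mean=%.2f median=%s mode=%s" % full_summary)
print("largest removed:  mean=%.2f median=%s mode=%s" % trimmed_summary)
```

Only the mean moves noticeably when the largest observation is removed, which is worth relating to the shape of the histogram in Question 2.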

6. For the following statistical population, compute the mean, range, variance and standard deviation: 3, 3, 5, 12, 13, 14, 17, 20, 21, 21.

What would happen to each of the measures you have calculated if:

(a) …4 were added to each data point (observation)?

(b) …each data point were multiplied by 2?
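The effect of shifting and rescaling in Question 6 can be explored numerically with a short Python sketch. Because the data are a population, the population versions `pvariance` and `pstdev` (divisor N) are used; the helper name `describe` is illustrative.

```python
import statistics

population = [3, 3, 5, 12, 13, 14, 17, 20, 21, 21]


def describe(data):
    """Population mean, range, variance and standard deviation (divisor N)."""
    return {
        "mean": statistics.fmean(data),
        "range": max(data) - min(data),
        "variance": statistics.pvariance(data),
        "sd": statistics.pstdev(data),
    }


original = describe(population)
shifted = describe([x + 4 for x in population])  # part (a): add 4
scaled = describe([2 * x for x in population])   # part (b): multiply by 2

for name, summary in (("original", original), ("plus 4", shifted), ("times 2", scaled)):
    print(name, summary)
```

Adding a constant moves only the mean, while multiplying by 2 doubles the mean, range and standard deviation and quadruples the variance.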

7. Migrant wealth.

Suppose the Minister for Immigration is interested in research on the assimilation of migrant households (a household where the chief income-earner is foreign born). The Household, Income and Labour Dynamics in Australia (HILDA) survey is a representative survey of Australian households. Using 4,669 household observations for 2002 from HILDA, we find there are 3,567 households classified as Australian-born and 1,102 classified as migrants. One key consideration is how migrant households are doing in terms of wealth compared with Australian-born households. Using these data, we find the following:

Summary statistics for net household wealth ($A)

                   Mean      10th percentile   Median     90th percentile
Australian-born    236,064   1,545             123,020    560,006
Migrant            248,970   1,720             131,152    524,372

(a) What can you say about the distribution of net household wealth, for both Australian-born and migrant households, by looking at just the mean and the median figures?

(b) More generally, what can you say about the distribution of wealth for migrant households compared to that for Australian-born households? In particular, which type of household has greater variation in wealth?

(c) Suppose the minister has net household wealth of $600,000. What can you say about his or her financial circumstances relative to other Australian-born households?

8. Sydney housing prices.

Figure 3.2 depicts a scatter plot of Sydney-area housing prices versus distance from the CBD. The unit of observation is a suburb, price is the mean of the median price of houses sold in each suburb for two quarters (those ending in September and December 2002), and distance is measured in kilometers from downtown.

(a) What would you expect the correlation to be between price and distance?

(b) Does it appear that there is a linear relationship between the two variables?

(c) What other key features of these data can be determined from the plot?

9. Anzac Garage wants to develop guidelines for setting prices of cars according to the car’s age. They hire a business consultant who chooses a sample of 117 second-hand passenger car advertisements collected from www.drive.com.au and retrieves data on the age and price of the cars.

(a) The business consultant first calculates the correlation coefficient between age and price and finds it to be -0.278. Interpret this result.

(b) Sketch what you think the scatter diagram from which this correlation coefficient was calculated might look like. Suppose the business consultant constructs a simple linear regression model using price as the dependent variable, and age as the independent variable. What do you think the estimated regression line might look like here? (We will return to this particular example later in the course and address this question more formally.)

10. Big Data. Suppose you are sitting at the NSW Department of Health and have access to information on hospital admissions, diagnosis, private insurance coverage, sex, age, smoking status, and length of hospital stay for all patients at all NSW hospitals for 2000 through 2015. A team of statisticians in your department are available to analyse these data following your direction.

[Figure 3.2: House prices in Sydney suburbs versus distance to CBD. Scatter plot; vertical axis: Price $, 0 to 6,000,000; horizontal axis: Distance to CBD (kms), 0 to 80.]

(a) You get a phone call from the State treasurer wanting to know how much of your budget you spend on smokers and smoking-related health problems. You promise to get back to her, and put down the phone. What do you tell your team?

(b) You get a phone call from the Australian Council on Smoking and Health, asking about any evidence that the State has on the association between smoking and health outcomes. You promise to get back to them and put down the phone. What do you tell your team?

11. Work through problem 34 on page 165 of Sharpe (Chapter 4).

Weeks 3 and 4

1. (a) Explain what it means to say that two probabilistic events in a sample space are mutually exclusive of one another.

(b) Explain what it means to say that two probabilistic events in a sample space are independent of one another.

(c) Why can two events not at the same time be both mutually exclusive and independent of one another?

2. A department store wants to study the relationship between the way customers pay for an item and the price of the item. 250 transactions are recorded and the following table is formed.

Price category    Payment
                  Cash    Credit card    Debit card
Under $20         15      9              18
$20-$100          11      53             52
Over $100         6       38             48

Convert the table to a joint distribution. Express each of the following questions in terms of probability statements, and then solve:

(a) What is the probability that an item is under $20?

(b) What is the probability that an item with a price tag of $43 is paid for in cash?

(c) What is the probability that people pay for an item that is at least $20 by credit?

(d) If somebody used a debit card to pay for an item, what is the probability that the item was less than $100?

(e) Are price and means of payment independent?

3. In a small batch of 20 manufactured widgets, there are, in fact, 3 defective ones. You, as quality control officer for the company making the widgets, decide to examine a sample of 3 widgets, selected without replacement, to see how many defective ones are selected.

(a) Use a probability tree to evaluate the probability distribution of the number of defectives sampled.

(b) How would your answer change if the sampling were done with replacement?
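The probability tree in Question 3 can be checked against closed-form counts. The sketch below is an assumption-labelled cross-check, not the tree method the question asks for: it uses the hypergeometric count for sampling without replacement and the binomial formula for sampling with replacement. The variable names are illustrative.

```python
from fractions import Fraction
from math import comb

N, D, n = 20, 3, 3  # batch size, defectives in the batch, sample size

# Part (a), without replacement: P(X = k) = C(D, k) C(N - D, n - k) / C(N, n).
without = [
    Fraction(comb(D, k) * comb(N - D, n - k), comb(N, n)) for k in range(n + 1)
]

# Part (b), with replacement: binomial with p = D / N on every draw.
p = Fraction(D, N)
with_repl = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

for k in range(n + 1):
    print(k, float(without[k]), float(with_repl[k]))
```

Exact fractions are used so that each list can be confirmed to sum to one.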




7. The manager of a factory has determined from past experience that X, the number of repairs required to machines in her factory on any one day, has the following probability distribution:

x           0      1      2      3      4
P(X = x)    0.41   0.25   0.18   0.10   0.06

Calculate the following:

(a) P(1 < X < 4)

(b) P(0 ≤ X ≤ 3)

(c) E(X)

(d) Var(X)

(e) What is the conditional probability distribution of X, conditional on some positive number of repairs taking place?

(f) Describe at least one business decision the manager might face that would be impacted by the information in the original table of unconditional probabilities.

(g) Describe at least one business decision the manager might face that would be impacted by the information in the table of conditional probabilities.
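The numerical parts (a) to (e) of Question 7 follow directly from the probability table. A minimal Python sketch with exact fractions, using illustrative variable names, is:

```python
from fractions import Fraction

# P(X = x) for x = 0..4, taken from the table in Question 7.
pmf = {x: Fraction(p) for x, p in zip(range(5), ("0.41", "0.25", "0.18", "0.10", "0.06"))}

p_a = sum(p for x, p in pmf.items() if 1 < x < 4)    # (a) P(1 < X < 4)
p_b = sum(p for x, p in pmf.items() if 0 <= x <= 3)  # (b) P(0 <= X <= 3)
mean = sum(x * p for x, p in pmf.items())            # (c) E(X)
var = sum((x - mean) ** 2 * p for x, p in pmf.items())  # (d) Var(X)

# (e) Condition on X > 0: rescale the remaining probabilities by P(X > 0).
p_positive = 1 - pmf[0]
conditional = {x: p / p_positive for x, p in pmf.items() if x > 0}

print(float(p_a), float(p_b), float(mean), float(var))
print({x: round(float(p), 4) for x, p in conditional.items()})
```

The conditional distribution simply renormalises the probabilities for x = 1 to 4 so that they sum to one.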

8. Suppose that the daily number of errors a randomly-selected bank teller makes is denoted by X and follows the distribution given in the table below. A human resource manager records the daily numbers of errors of two randomly selected tellers. Denote the associated random variables by X1 and X2. As the selection is random, X1 and X2 are independent and follow the same distribution as X. The manager then computes the sample mean X̄ = (X1 + X2)/2, where the sample size is n = 2.

x           0     1     2
P(X = x)    0.6   0.2   0.2

a. Find the mean and variance of X1. Explain why we do not need to find the mean and variance of X2 once we know those of X1.

b. Since X1 and X2 are random, so is X̄. Find the mean and variance of the random variable X̄. Compare these with the result from (a) and comment. Hint: you will find it useful to note that Cov(X1, X2) = 0 because X1 and X2 are independent. This simplifies the evaluation of the variance of the random variable X̄.

c. Find the possible values that X̄ may take. Hence list the probability distribution of X̄ for samples of size 2. (This is known as the sampling distribution of X̄.)

d. Examine briefly what would happen if n = 3, 4, …? For this last sub-question, you will need to use the idea of a factorial of an integer n, labelled n!, which means n multiplied by every positive integer smaller than itself. So, for example, 3! = 3 × 2 × 1 = 6. Also recall the combinatorial formula for the number of ways of selecting x from n distinct objects (Sharpe page 193): C(n, x) = n!/((n − x)! x!).
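The sampling distribution in Question 8 can be built by brute-force enumeration of every possible sample. This is a sketch rather than the combinatorial route suggested in part d, and the helper names `mean_var` and `sampling_distribution` are illustrative.

```python
from fractions import Fraction
from itertools import product

# Distribution of daily errors X from the table in Question 8.
pmf = {0: Fraction(3, 5), 1: Fraction(1, 5), 2: Fraction(1, 5)}


def mean_var(dist):
    """Mean and variance of a discrete distribution given as {value: probability}."""
    m = sum(v * p for v, p in dist.items())
    return m, sum((v - m) ** 2 * p for v, p in dist.items())


def sampling_distribution(pmf, n):
    """Exact distribution of the sample mean of n independent draws from pmf."""
    dist = {}
    for draws in product(pmf, repeat=n):
        prob = Fraction(1)
        for x in draws:
            prob *= pmf[x]  # independence: multiply the marginal probabilities
        xbar = Fraction(sum(draws), n)
        dist[xbar] = dist.get(xbar, 0) + prob
    return dist


xbar2 = sampling_distribution(pmf, 2)
print("X:     mean, variance =", mean_var(pmf))
print("X-bar: mean, variance =", mean_var(xbar2))
for value in sorted(xbar2):
    print(float(value), xbar2[value])
```

Calling `sampling_distribution(pmf, 3)` or `sampling_distribution(pmf, 4)` gives a feel for part d: the mean stays put while the variance shrinks as n grows.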

9. A student has enrolled in three courses in this semester. Let’s call them courses A, B and C. Her chances of passing each course are 0.8, 0.65, and 0.5, respectively. Passing each course is assumed to be independent of passing other courses. Answer the following:

a. Define a random variable for each course outcome.

b. What is the probability that this student passes exactly two courses? Express this question in terms of probability statements, and then solve.

c. What is the probability that this student fails at least one course? Express this question in terms of probability statements, and then solve.

d. How reasonable is the assumption of independence?
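Parts b and c of Question 9 can be checked by enumerating all eight pass/fail patterns. The sketch below relies on the independence assumption stated in the question; the names `p_pass` and `prob_of` are illustrative.

```python
from itertools import product

# Pass probabilities for courses A, B and C from Question 9.
p_pass = {"A": 0.8, "B": 0.65, "C": 0.5}


def prob_of(outcome):
    """Probability of one pass/fail pattern such as (True, False, True),
    assuming the three course results are independent."""
    prob = 1.0
    for passed, p in zip(outcome, p_pass.values()):
        prob *= p if passed else 1 - p
    return prob


outcomes = list(product([True, False], repeat=3))

# b. exactly two passes; c. at least one fail = 1 - P(pass all three).
p_exactly_two = sum(prob_of(o) for o in outcomes if sum(o) == 2)
p_fail_at_least_one = 1 - prob_of((True, True, True))

print(round(p_exactly_two, 4), round(p_fail_at_least_one, 4))
```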

10. Let X be the number of heads in 4 tosses of a fair coin.

a. What is the probability distribution of X?

b. What are the mean and variance of X?

c. Consider a game where you win $5 for every head but lose $3 for every tail that appears in 4 tosses of a fair coin. Let the variable Y denote the winnings from this game. Formulate the probability distribution of Y based on the probability distribution of X.

d. What is the expected value of Y? Would you like to play this game? If so, why? If not, why not?
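For Question 10, the distribution of Y follows from that of X through the linear relation Y = 5X − 3(4 − X) = 8X − 12. A minimal Python sketch, with illustrative variable names, is:

```python
from fractions import Fraction
from math import comb

n = 4  # number of tosses of a fair coin

# a. X ~ Binomial(4, 1/2): P(X = k) = C(4, k) / 2^4.
pmf_x = {k: Fraction(comb(n, k), 2**n) for k in range(n + 1)}

# b. Mean and variance of X.
mean_x = sum(k * p for k, p in pmf_x.items())
var_x = sum((k - mean_x) ** 2 * p for k, p in pmf_x.items())

# c. Winnings: $5 per head, minus $3 per tail, so Y = 5X - 3(n - X).
pmf_y = {5 * k - 3 * (n - k): p for k, p in pmf_x.items()}

# d. Expected winnings.
mean_y = sum(y * p for y, p in pmf_y.items())

print(pmf_x, mean_x, var_x)
print(pmf_y, mean_y)
```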


Weeks 5 and 6

1. A random number generator is designed to draw numbers at random from within a specified range. We can consider any number in the range as a possible outcome.

(a) What type of distribution is the random number generator drawing from?

(b) Suppose we program a random number generator to generate a random number with a value falling in the interval [0, 2]. What is the height of the density of the distribution from which the random number generator is drawing? Draw a graph of the probability density function.

(c) What is the cumulative probability distribution of the random variable from which draws are being taken? Draw a graph of the cumulative probability distribution function.

(d) Find the following for this case: P(Y<0.6); P(Y≤0.6); P(0.5<Y<1.5), using both the density function and the cumulative probability function. Show that your answers match whichever you use.

2. From several years’ records, a fish market manager has determined that the weight of deep sea bream sold in the market (X) is approximately normally distributed with a mean of 450 grams and a standard deviation of 100 grams. Assuming this distribution will remain unchanged in the future, calculate the expected proportions of deep sea bream sold over the next year weighing…

a) between 300 and 400 grams.

b) between 400 and 600 grams.

c) more than 625 grams.
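Answers to Question 2 obtained from normal tables can be verified with `statistics.NormalDist` from the Python standard library. This is a sketch for checking only; table-based answers may differ in the last decimal place because of rounding of z.

```python
from statistics import NormalDist

# Weight of deep sea bream: mean 450 g, standard deviation 100 g.
weight = NormalDist(mu=450, sigma=100)

p_a = weight.cdf(400) - weight.cdf(300)  # a) between 300 and 400 grams
p_b = weight.cdf(600) - weight.cdf(400)  # b) between 400 and 600 grams
p_c = 1 - weight.cdf(625)                # c) more than 625 grams

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```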

3. In a certain large city, household annual incomes are considered approximately normally distributed with a mean of $40,000 and a standard deviation of $6,000. What proportion of households in the city have an annual income over $35,000? If a random sample of 120 households were selected, how many of these households would we expect to have annual incomes between $35,000 and $45,000?

4. What is the 75th percentile of the normal distribution N(10, 9)?

5. In a certain city, it is estimated that 60% of households have access to the internet. A company wishing to sell services to internet users randomly chooses 150 households in the city and sends them advertising material.

(a) Calculate the probability that fewer than 90 contacted households have internet access.

(b) Calculate the probability that between 60 and 100 (inclusive) contacted households have internet access.

(c) There is an 80% chance (probability of .8) that the number of contacted households with internet access equals or exceeds what value?

6. Using your personalized Course Project data:

(a) Calculate the sample averages of all variables. Which of these averages are meaningful? Express the meaning of each average in words that are understandable and effective for a layperson such as your client.

(b) Do you need to manipulate the raw data provided, before proceeding to statistical analyses, in order to address the client’s question? If so, how?

7. Work through problem 28 on page 264 of Sharpe (Chapter 7), referring to the 68-95-99.7 Rule explained on pages 239-240 of Sharpe.

8. UNSW wants to measure the attractiveness of its “brand” to potential students. The university performs an experiment by inviting 100 high school students from different public schools across New South Wales to browse a few websites related to different universities, and then to choose the one that they would prefer most.

(a) Is this a random sample? Can you think of any potential source of selection bias?

(b) Suppose that a perfectly random sample of students is drawn from the target population, and these students take part in the exercise described above. With reference to the brief discussion on page 732 of Sharpe (“Confounding and Lurking Variables”), can you think of any confounding factors – that is, factors that might lead to lack of confidence in using students’ expressed preferences, as measured in this exercise, as an indicator of their degree of overall attraction to the UNSW “brand”?

(c) Suppose that the exercise described in part (b) is conducted. The resulting data include each student’s high school, the selection of universities whose websites they browsed, and the one amongst those that they chose as their most-preferred university. Sketch on a piece of paper or in an Excel sheet what these data would look like once they are made ready for quantitative analysis.

(d) Add to the display in part (c) any additional variables that your answer to part (b) indicated you might like to have access to. Show these variables in a form that is analysis-ready.

(e) Suppose you had access to the expanded data set constructed in part (d). Describe what sort of analyses you could conduct that might help to shed light on UNSW’s core question about the attractiveness of its brand.

(f) Based on your analysis, what would you be able to tell UNSW leadership about the core drivers of its brand appeal – that is, what it is about UNSW that students are drawn to?





Weeks 7 and 8

1. Suppose a normally distributed random variable X has a mean of 50 and a variance of 100. Also suppose a sample of size 16 is drawn from this population. Calculate the following probabilities:

(a) P(40 < X < 55)

(b) P(40 < X̄ < 55)
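The contrast between a single observation and a sample mean in Question 1 comes down to the standard deviation used: σ for X and σ/√n for X̄. A minimal Python check, with illustrative variable names, is:

```python
from math import sqrt
from statistics import NormalDist

mu, variance, n = 50, 100, 16

x = NormalDist(mu, sqrt(variance))         # a single observation X
xbar = NormalDist(mu, sqrt(variance / n))  # the sample mean of n = 16 draws

p_x = x.cdf(55) - x.cdf(40)           # (a) P(40 < X < 55)
p_xbar = xbar.cdf(55) - xbar.cdf(40)  # (b) P(40 < X-bar < 55)

print(round(p_x, 4), round(p_xbar, 4))
```

The second probability is much larger because the sampling distribution of the mean is far more concentrated around 50.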

2. Recall the Anzac Garage data used previously. These data are available from the course website (in the “Tutorial Questions and Information” folder) in an Excel file called AnzacG.xls. Use these 117 observations on used passenger cars to find the 95% confidence interval for the population mean distance travelled by used passenger cars (this variable is labelled ‘odometer’ in the data set and is measured in kilometres). Assume the population standard deviation is 60,000 kms.

3. What would be the effects on the width of the confidence interval calculated in the previous question of:

(a) a decrease in the level of confidence used?

(b) an increase in sample size?

(c) an increase in the population standard deviation?

(d) an increase in the sample standard deviation?

(e) an increase in the value of X̄ found?

4. Again referring to the data in ‘odometer’ from AnzacG.xls and the population from which it is drawn, determine the sample size required to estimate the population mean to within 5,000 kms with 90% confidence. Again assume the population standard deviation is 60,000 kms.
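The sample size in Question 4 comes from solving z·σ/√n ≤ E for n, which gives n ≥ (z·σ/E)², rounded up to the next whole observation. A short Python sketch of that arithmetic, with illustrative variable names, is:

```python
from math import ceil
from statistics import NormalDist

sigma = 60_000     # assumed population standard deviation (kms)
margin = 5_000     # required half-width of the interval (kms)
confidence = 0.90

# Two-sided critical value: upper-tail area of (1 - confidence) / 2.
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Smallest n with z * sigma / sqrt(n) <= margin.
n_required = ceil((z * sigma / margin) ** 2)

print(round(z, 3), n_required)
```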

5. Perform the following hypothesis tests of the population mean. In each case, draw a picture to illustrate the rejection regions on both the Z and X̄ distributions, and calculate the p-value of the test.

(a) H0: μ = 50, H1: μ > 50, n = 100, x̄ = 55, σ = 10, α = 0.05

(b) H0: μ = 25, H1: μ < 25, n = 100, x̄ = 24, σ = 5, α = 0.1

(c) H0: μ = 80, H1: μ ≠ 80, n = 100, x̄ = 80.5, σ = 4, α = 0.05
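The test statistics and p-values for the three tests in Question 5 can be checked with a small helper. The function `z_test` and its `tail` argument are illustrative names introduced here; it assumes a known σ, so z = (x̄ − μ0)/(σ/√n).

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def z_test(mu0, xbar, sigma, n, alpha, tail):
    """z test of a mean with known sigma.

    `tail` is 'upper', 'lower' or 'two'. Returns (z, p_value, reject_h0).
    """
    z = (xbar - mu0) / (sigma / sqrt(n))
    if tail == "upper":
        p = 1 - Z.cdf(z)
    elif tail == "lower":
        p = Z.cdf(z)
    else:
        p = 2 * (1 - Z.cdf(abs(z)))
    return z, p, p < alpha


test_a = z_test(50, 55, 10, 100, 0.05, "upper")
test_b = z_test(25, 24, 5, 100, 0.10, "lower")
test_c = z_test(80, 80.5, 4, 100, 0.05, "two")

for label, result in (("a", test_a), ("b", test_b), ("c", test_c)):
    print(label, result)
```

The sketch does not replace the requested diagrams; it only confirms which side of the critical value each statistic falls on.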

6. A real estate expert claims the current mean value of houses in a particular area is more than $250,000. A random sample of 150 recent sales prices in the area yields a sample mean of $265,000. It is known that house values in the area are approximately normally distributed with a standard deviation of $50,000.

(a) Perform an upper tail test of the null hypothesis that the population mean house value in the area is $250,000. Use a 5% level of significance and state the rejection (critical) region in terms of both x̄ and z.

(b) Why is an upper tail test most appropriate in this case?

(c) What is the p-value associated with the test statistic used in the part (a) test? Interpret this value.

(d) Define in words the type I and II errors that could afflict the part (a) test.

7. What effect does increasing the sample size have on the outcome of a hypothesis test? Explain your answer using the example of a one-tail test concerning the mean of a normally distributed population with known variance.


Recalling Exercise 39:

Then, re-do the analysis with all settings the same except supposing that:

c) The professor’s students scored 108 points on the final exam, having used the software (and nothing else changed).

d) The number of students enrolled in the course decreased from 481 to 210 (and nothing else changed).

e) The standard deviation of the students’ scores increased from 6.3 to 25.2 points (and nothing else changed).

9. Project Review: For the course project, you are only expected to use statistical methods covered in lectures and tutorials up to and including those in Week 9. Thus you should now have sufficient material to complete the project in a timely fashion. What might be useful at this stage is to think about presentation. See the Examples of Statistical Reports section of the Project folder on Moodle for some ideas in general. As a directed exercise for this tutorial, compare and contrast the presentation of material in the NSW BOCSAR report on driving under the influence of cannabis (driving-cannabis.pdf) and Queensland Office of Economic and Statistical Research bulletin on computer and internet usage in Queensland (computer-internet-useage-qld-c01.pdf). You should be able to read these reports comfortably, although there are a few methods that may be unfamiliar in the cannabis report (although these methods will be covered later in the course).

Weeks 9 and 10

1. State whether the normal distribution, the t distribution, or neither would be the right type of sampling distribution to assume for the sample mean in order to test hypotheses regarding the population mean in the following situations:

(a) Population variable normally distributed, σ² unknown, sample size less than 30.

(b) Population variable normally distributed, σ² unknown, sample size greater than 30.

(c) Population variable normally distributed, σ² known, sample size less than 30.

(d) Population variable not normally distributed, σ² unknown, sample size greater than 30.

(e) Population variable not normally distributed, σ² unknown, sample size less than 30.

2. Reconsider the example used earlier in the course in which a real estate expert claimed the current mean value of houses in a particular area was more than $250,000. A random sample of 150 recent sales prices in the area yielded a sample mean of $265,000, and it is known that house values in the area are approximately normally distributed with a standard deviation of $50,000.

(a) If in fact the population mean house value in the area is $260,000, what is the probability of committing a type II error in performing an upper-tail test of the null hypothesis that the mean house value price in the area is $250,000, as was done in Part (a) of the prior week’s exercise? What is the power of the test in these circumstances? State in words what the power of the test means.

(b) Illustrate your answer to part (a) above by showing on a diagram the areas representing the probability of a type II error and the power of the test.
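The type II error probability in Question 2 is found in two steps: locate the critical value of x̄ under H0, then ask how likely a sample mean below that value is when the true mean is $260,000. A minimal Python sketch of the two steps, assuming the 5% significance level used in the earlier exercise and using illustrative variable names, is:

```python
from math import sqrt
from statistics import NormalDist

mu0, mu_true = 250_000, 260_000  # hypothesised and actual population means
sigma, n, alpha = 50_000, 150, 0.05

se = sigma / sqrt(n)  # standard error of the sample mean

# Step 1: reject H0 when the sample mean exceeds this critical value.
critical_xbar = mu0 + NormalDist().inv_cdf(1 - alpha) * se

# Step 2: beta = P(fail to reject H0 | mu = mu_true).
beta = NormalDist(mu_true, se).cdf(critical_xbar)
power = 1 - beta

print(round(critical_xbar), round(beta, 4), round(power, 4))
```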

3. A company running an urban rail service wishes to estimate its daily average number of late-running trains on weekdays. For 10 randomly selected weekdays, it finds the following numbers of late running trains:

32, 10, 9, 18, 25, 15, 14, 18, 22, 16

(a) Assuming the number of late running trains on a weekday is approximately normally distributed, calculate a 90% confidence interval for the mean number of late-running trains on a weekday.

(b) If we did not have the assumption of normality, could we still calculate a confidence interval in this example? If not, suggest a way of overcoming this problem.
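For part (a) of Question 3, the interval is x̄ ± t·s/√n with n − 1 = 9 degrees of freedom. The Python standard library has no t distribution, so the sketch below hard-codes the critical value 1.833 as read from a t table (upper-tail area 0.05, 9 d.f.); that value is an input assumed here, not something the code computes.

```python
from math import sqrt
import statistics

late = [32, 10, 9, 18, 25, 15, 14, 18, 22, 16]  # late-running trains per day

n = len(late)
xbar = statistics.fmean(late)
s = statistics.stdev(late)  # sample standard deviation (divisor n - 1)

t_crit = 1.833  # assumed: t table value for 9 d.f. and 90% confidence
margin = t_crit * s / sqrt(n)
interval = (xbar - margin, xbar + margin)

print(round(xbar, 2), round(s, 3), tuple(round(v, 2) for v in interval))
```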

4. Reconsider the question from a previous week that used the Anzac Garage data, available from the course website (in the “Tutorial Questions and Information” folder) in an Excel file called Anzacg.xls. Would normality be a good approximation for the population distribution of distance travelled by used passenger cars? (Hint: look at the summary statistics and a histogram.) Do you need to assume normality? Redo the 95% confidence interval for the population mean distance travelled by used passenger cars without assuming a known population standard deviation.

5. It is known that 80% of people suffering from a particular disease are cured by a certain standard medication. Test the claim of the developers of a new medication that their product is more effective than the standard medication in curing the disease, using a 5% significance level, given a random sample of 400 people with the disease of whom 330 are cured by using the new medication. (Hint: Use the normal approximation, and ignore the continuity correction.)
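The test in Question 5 uses the normal approximation to the sample proportion with the standard error computed under H0: p = 0.8, and no continuity correction, as the hint directs. A short Python sketch with illustrative variable names is:

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.80           # cure rate of the standard medication (H0)
n, cured = 400, 330
alpha = 0.05

p_hat = cured / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)  # standard error evaluated under H0
p_value = 1 - NormalDist().cdf(z)           # upper-tail test, H1: p > 0.8
reject = p_value < alpha

print(round(p_hat, 3), round(z, 2), round(p_value, 4), reject)
```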

6. Download the data “Credit_Card_Bank” from the MyStatLab website (available under the heading of Chapter 1: Data and Decisions). Using the variables “Offer Status” and “Spendlift Positive”, conduct the appropriate Chi-squared test to determine whether there is a relationship between the type of offer a customer was exposed to and whether a lift in spending was observed, assuming a significance level of 0.05. Interpret your results.

7. Use a calculator to compute the sample least squares regression line for the model y = β0 + β1x + ε, given the following six observations:

y    2    8    6    12    9     11
x    1    4    3    10    10    8
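A calculator result for Question 7 can be confirmed with the usual least squares formulas b1 = Sxy/Sxx and b0 = ȳ − b1·x̄. The sketch below writes them out directly; the variable names are illustrative.

```python
# Six observations from Question 7.
x = [1, 4, 3, 10, 10, 8]
y = [2, 8, 6, 12, 9, 11]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of cross-products and squares about the means.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = s_xy / s_xx         # estimated slope
b0 = y_bar - b1 * x_bar  # estimated intercept

print(f"y-hat = {b0:.4f} + {b1:.4f} x")
```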

8. Suppose the relationship between the dependent variable weekly household consumption expenditure in dollars (y) and the independent variable weekly household income in dollars (x) is represented by the simple regression model (i refers to the ith observation or household):

yi = β0 + β1xi + εi

Suppose a sample of observations yields least squares estimates of b0 = -32 and b1 = 0.82 for this model.

(a) What does εi represent in the model?

(b) State the basic (classical) assumptions made about the ε’s in this model. Explain in words what the assumptions mean.

(c) Does the estimate of b0 = -32 make sense? If not, does this necessarily invalidate the model? Explain your answer.

(d) Interpret both β1 and b1. What does the model predict would be the change in y following a $10 increase in x from some initial level?

(e) Suppose we measured y and x in cents rather than dollars. What effect would this have on the estimated coefficient of x? What effect would it have on the estimated intercept?

(f) Suppose y were measured in dollars but x were measured in cents. What effects would this have on the estimated coefficient of x?

(g) Distinguish between εi and ε̂i (the residual associated with observation i). Illustrate your answer with a diagram.

9. Work through problem 16 on pages 529-530 of Sharpe (Chapter 15).
