Objectives (IPS chapter 1.1) Displaying distributions with graphs Labels/Variables Two types of variables Ways to chart categorical data Bar graphs

Objectives (IPS chapter 1.1) Displaying distributions with graphs

Labels/Variables

Two types of variables

Ways to chart categorical data

Bar graphs

Pie charts

Ways to chart quantitative data

Line graphs: time plots

Scales matter

Histograms

Stemplots

Stemplots versus histograms

Interpreting histograms

Variables

In a study, we collect information—data—from individuals. Individuals

can be people, animals, plants, or any object of interest.

A variable is a characteristic that varies among individuals in a

population or in a sample (a subset of a population).

Example: age, height, blood pressure, ethnicity, leaf length, first language

The distribution of a variable tells us what values the variable takes

and how often it takes these values.

Two types of variables

Variables can be either quantitative (or numerical)…

Something that can be counted or measured for each individual and then

added, subtracted, averaged, etc. across individuals in the population.

Example: How tall you are, your age, your blood cholesterol level, the

number of credit cards you own

… or categorical.

Something that falls into one of several categories. What can be recorded is the count or proportion of individuals in each category.

Example: Your blood type (A, B, AB, O), your hair color, your ethnicity,

whether you paid income tax last tax year or not

Cases are the objects described by a set of data. Cases may be customers, companies, subjects in a study, or other objects.

A label is a special variable used in some data sets to distinguish different cases.

A variable is a characteristic of a case.

Different cases can have different values for the variables.

Example: (1) The cases are the individual students;

(2) The first three (Student identification number, last name, first name) are labels.

(3) Gender is a categorical variable;

(4) Test 1 to Final are numerical variables.

Cases, Labels, Variables, and Values

Example: How do you know if a variable is categorical or quantitative?

Ask: What are the n individuals/units in the sample (of size “n”)? What is being recorded about those n individuals/units? Is that a number (quantitative) or a statement (categorical)?

Individuals in sample   Diagnosis       Age at death
Patient A               Heart disease   56
Patient B               Stroke          70
Patient C               Stroke          75
Patient D               Lung cancer     60
Patient E               Heart disease   80
Patient F               Accident        73
Patient G               Diabetes        69

Quantitative: Each individual is attributed a numerical value.

Categorical: Each individual is assigned to one of several categories.

HW 1.14(b), 1.16

Label: Each individual is assigned to one label.

Definition: Quantitative data are discrete if the possible values are isolated points on the number line. Quantitative data are continuous if the set of possible values forms an entire interval on the number line.

Question: Are people’s heights continuous? What about ages? What about family size?

(1) For categorical data: i) Frequency / relative frequency distribution; ii) Bar graph; iii) Pie chart.

(2) For numerical data: i) Frequency distribution; ii) Stemplot (stem-and-leaf plot); iii) Histogram.

(3) For bivariate numerical data: Scatter plot.

Summary of variables

Ways to chart categorical data

Because the variable is categorical, the data in the graph can be ordered any way we want (alphabetical, by increasing value, by year, by personal preference, etc.).

Bar graphs: Each category is represented by a bar.

Pie charts: Peculiarity: The slices must represent the parts of one whole.

For frequency distribution, the most important definition is relative frequency, which is defined by

Example 1: For seven students with gender: M, F, M, M, F, M, F. (M=male; F=Female).

Q1: What is frequency distribution?

Q2: Find the relative frequency for Example 1.

Q3: Make a bar graph and pie chart for Example 1.

relative frequency = frequency / (# of obs in the data set)
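As a rough sketch of how Example 1 could be worked in R, the lines below compute the frequency and the relative frequency (frequency divided by the number of observations in the data set) for the seven students. The variable names gender, freq and rel_freq are illustrative assumptions; table(), prop.table(), barplot() and pie() are base R functions.

```r
# Gender of the seven students in Example 1 (M = male, F = female)
gender <- c("M", "F", "M", "M", "F", "M", "F")

# Frequency distribution: count of students in each category
freq <- table(gender)

# Relative frequency = frequency / number of observations
rel_freq <- prop.table(freq)

print(freq)
print(round(100 * rel_freq, 2))   # relative frequencies as percentages

# Bar graph and pie chart of the same distribution
barplot(freq, main = "Gender of seven students", ylab = "Frequency")
pie(freq, main = "Gender of seven students")
```

The relative frequencies always add up to 1 (100%), which is a quick check on the arithmetic.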

HW 1.21(a)

Example 2:

For the following frequency distribution:

Q1: Find the relative frequency. (Eg: in the format of 12.56%)

Q2: Make a bar graph for Example 2.

Q3: Make a pie chart for Example 2.


Grade       Freq   Relative Freq
Freshman    12     12/64 = 18.75%
Sophomore   25     25/64 = 39.06%
Junior      16     16/64 = 25%
Senior      10     10/64 = 15.63%
Others      1      1/64 = 1.56%
Total       64     1 = 100%
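The same arithmetic can be sketched in R directly from the counts, without reading a file. The vector names grade, freq and rel_freq are assumptions made for this illustration; only base R functions are used.

```r
# Counts taken from the frequency table for Example 2
grade <- c("Freshman", "Sophomore", "Junior", "Senior", "Others")
freq  <- c(12, 25, 16, 10, 1)

total    <- sum(freq)        # number of observations in the data set
rel_freq <- freq / total     # proportions that add up to 1

# Show each grade with its count and percentage
print(data.frame(Grade = grade, Freq = freq, Percent = 100 * rel_freq))

# Bar graph of the percentages and pie chart of the counts
barplot(100 * rel_freq, names.arg = grade,
        main = "Student percentage", ylab = "Percent")
pie(freq, labels = grade, main = "Student percentage")
```

Percentages that end in an exact half, such as 15.625%, may display as 15.62 or 15.63 depending on the rounding rule in use.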

Example 2: Students percentage (continued)

Example: Top 10 causes of death in the United States 2006

Rank  Cause of death        Count     % of top 10   % of total deaths
1     Heart disease         631,636   34%           26%
2     Cancer                559,888   30%           23%
3     Cerebrovascular       137,119   7%            6%
4     Chronic respiratory   124,583   7%            5%
5     Accidents             121,599   7%            5%
6     Diabetes mellitus     72,449    4%            3%
7     Alzheimer’s disease   72,432    4%            3%
8     Flu and pneumonia     56,326    3%            2%
9     Kidney disorders      45,344    2%            2%
10    Septicemia            34,234    2%            1%
      All other causes      570,654                 24%

For each individual who died in the United States in 2006, we record what was

the cause of death. The table above is a summary of that information.

Top 10 causes of deaths in the United States 2006

Bar graphs

Each category is represented by one bar. The bar’s height shows the count (or sometimes the percentage) for that particular category.

The number of individuals who died of an accident in

2006 is approximately 121,000.

HW 1.21(b, c)

Percent of people dying from top 10 causes of death in the United States in 2006

Pie charts

Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents.

HW 1.22(b)

Percent of deaths from top 10 causes

Percent of deaths from

all causes

Make sure your labels match

the data.

Make sure all percents

add up to 100.

Child poverty before and after government intervention—UNICEF, 2005

What does this chart tell you?

•The United States and Mexico have the highest rates of child poverty among OECD (Organization for Economic Cooperation and Development) nations (22% and 28% of those under 18).

•Their governments do the least, through taxes and subsidies, to remedy the problem (size of orange bars and percent difference between orange/blue bars).

Could you transform this bar graph to fit in 1 pie chart? In two pie charts? Why?

The poverty line is defined as 50% of national median income.

R program for Students percentage example

## The route (folder) of your data files

route <- "//bearsrv/classrooms/Math/cchen/STT215/LAB_Fall2011/"

## Read in the dataset (the code below uses its columns Students and Percent)

dat <- read.csv(paste0(route, "In_Class_exercises1.csv"), header = TRUE)

## To plot a pie chart

pie(dat$Percent, labels = dat$Students)

## To plot a pie chart with a title and make it colorful (one color per row)

pie(dat$Percent, labels = dat$Students, main = "Student percentage", col = rainbow(nrow(dat)))

## To create a bar graph

barplot(dat$Percent, names.arg = dat$Students, main = "Student percentage")

Overview: Ways to chart quantitative data

Stemplots

Also called a stem-and-leaf plot. Each observation is represented by a

stem, consisting of all digits except the final one, which is the leaf.

Histograms

A histogram breaks the range of values of a variable into classes and

displays only the count or percent of the observations that fall into each

class.

Line graphs: time plots

A time plot of a variable plots each observation against the time at

which it was measured.

Stem plots: Example 1

How to make a stemplot:

Separate each observation into a stem, consisting of all but the final (rightmost) digit,

and a leaf, which is that remaining final digit. Stems may have as many digits as

needed, but each leaf contains only a single digit.

Eg: how to find the stem and leaf for: 25, 135, 6.

Write the stems in a vertical column with the smallest value at the top, and draw a

vertical line at the right of this column.

Write each leaf in the row to the right of its stem, in increasing order out from the

stem.

Dataset: 9, 9, 22, 32, 33, 39, 39, 42, 49, 52, 58, 70.

Q: Make a stem-leaf-plot for this dataset.
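The stem and leaf split described in the steps above can be sketched in R with integer division and remainder. The variable names are illustrative assumptions; stem() is the base R function that prints a text stemplot.

```r
# Stem = all but the final digit; leaf = the final digit
eg <- c(25, 135, 6)
print(data.frame(value = eg, stem = eg %/% 10, leaf = eg %% 10))

# The data set from Example 1
x <- c(9, 9, 22, 32, 33, 39, 39, 42, 49, 52, 58, 70)
stems  <- x %/% 10
leaves <- x %% 10
print(data.frame(value = x, stem = stems, leaf = leaves))

# Base R draws a text stemplot directly
stem(x)
```

Note that stem() chooses its own stem spacing, so its display may group or split stems differently from a plot drawn by hand; its scale argument adjusts this.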

Stem plots: Example 2

How to make a stemplot: follow the same three steps as in Example 1 (separate each observation into a stem and a single-digit leaf, write the stems in a vertical column with the smallest at the top, then write each leaf to the right of its stem in increasing order).

Dataset: 1, 3, 3, 12, 15, 17, 21, 25, 45, 49, 62, 67, 69.

Q: Make a stem-leaf-plot for this dataset.

Stem and leaf notes:

To compare two related distributions, a back-to-back stemplot with common stems is useful.

Stemplots work best for small numbers of observations that are all greater than 0; they do not work well for large data sets.

Stemplots display the actual values of the observations.

Stem and leaf notes: trimming and splitting stems

Example: Make a stemplot for the data 115, 143, 162, 198, 267, 279, 302, trimming the numbers and splitting the stems.

Trimming a number means dropping its last digit.

Example: the original value 141 becomes 14 after dropping the last digit.

The original value 255 becomes 25 after dropping the last digit.

Splitting stems. Example: suppose your data set is:

1, 7, 10, 11, 12, 13, 15, 16, 17, 18, 19, 21, 22, 23, 25, 26, 28, 29.

Then splitting the stems gives:

0 | 1

0 | 7

1 | 0 1 2 3

1 | 5 6 7 8 9

2 | 1 2 3

2 | 5 6 8 9

Splitting stems means that, when the data set is of medium size, observations that share a stem are separated into two rows with the same stem: one row for leaves 0-4 and another for leaves 5-9.
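A minimal sketch of splitting stems in R, assuming the same data set as in the example: each stem gets two rows, one for leaves 0-4 and one for leaves 5-9. The names row_id, row_leaves and split_plot are hypothetical helpers for this illustration, not part of any package.

```r
x <- c(1, 7, 10, 11, 12, 13, 15, 16, 17, 18, 19,
       21, 22, 23, 25, 26, 28, 29)

stems  <- x %/% 10          # all but the final digit
leaves <- x %% 10           # the final digit

# Row index: each stem gets two rows, leaves 0-4 first, then leaves 5-9
row_id <- 2 * stems + (leaves >= 5)

# Collect the sorted leaves of each row into one string
# (rows that would contain no leaves are simply not produced)
row_leaves <- tapply(leaves, row_id,
                     function(l) paste(sort(l), collapse = " "))

# Rebuild the stem label from the row index and print "stem | leaves"
row_stem   <- as.integer(names(row_leaves)) %/% 2
split_plot <- paste(row_stem, "|", as.vector(row_leaves))
writeLines(split_plot)
```

The printed rows reproduce the six-row display shown in the example.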

Histogram

A histogram breaks the range of values of a variable into classes and displays only the count or percentage of the observations that fall into each class.

You can choose any convenient number of classes, but you should always choose classes of equal width.

Table 1.3, Introduction to the Practice of Statistics, Sixth Edition

© 2009 W.H. Freeman and Company

Steps to draw a histogram

Step 1: Divide the range of the data into classes of equal width. (Be sure to specify the classes precisely so that each individual falls into exactly one class.)

EX: IQ scores: 75 <= IQ < 85; 85 <= IQ < 95; 95 <= IQ < 105; 105 <= IQ < 115; 115 <= IQ < 125; 125 <= IQ < 135; 135 <= IQ < 145; 145 <= IQ < 155.

Step 2: Count the number of individuals in each class. The counts are called frequencies, and a table of frequencies for all classes is a frequency table.

Step 3: Draw the histogram. On the horizontal axis, mark the scale for the variable whose distribution you are displaying. The vertical axis contains the scale of counts. Each bar represents a class: its base covers the class and its height is the class count.

Class        Count
[75, 85)     2
[85, 95)     3
[95, 105)    10
[105, 115)   16
[115, 125)   13
[125, 135)   10
[135, 145)   5
[145, 155)   1

Q: What percent of the chosen fifth-grade students have IQ scores below 105?
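One way to work this question from the class counts is sketched below in R. Because the class [95, 105) excludes 105 itself, the counts can only answer the question for scores below 105. The vector names are assumptions for this illustration; since only the class counts are available, barplot() with space = 0 stands in for a histogram.

```r
classes <- c("[75,85)", "[85,95)", "[95,105)", "[105,115)",
             "[115,125)", "[125,135)", "[135,145)", "[145,155)")
counts  <- c(2, 3, 10, 16, 13, 10, 5, 1)

n       <- sum(counts)                # total number of students
cum_pct <- 100 * cumsum(counts) / n   # percent in this class or a lower one

# Percent of students in the first three classes, i.e. IQ below 105
print(cum_pct[3])

# Bars drawn with no gaps so the plot reads as a histogram
barplot(counts, names.arg = classes, space = 0,
        xlab = "IQ score", ylab = "Count")
```

The first three classes hold 2 + 3 + 10 = 15 of the 60 students, which is 25%.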

In Summary…

Important property of a density curve is that areas under the curve correspond to relative frequencies

Stemplots are quick and dirty histograms that can easily be done by

hand, and therefore are very convenient for back of the envelope

calculations. However, they are rarely found in scientific or laymen

publications.

Stemplots versus histograms

Once a graph of the variable is made, we can begin to understand its distribution by looking at the following:

•Look at the overall pattern in the graph and for striking deviations from that overall pattern. Peaks? Gaps? Symmetric? Skewed?

•Describe the overall pattern of the distribution by talking about its shape, center, and spread (or variation).

•Look for possible outliers in the distribution, i.e., those values of the variable that seem to fall outside the overall pattern you see.

These features will be important for all types of graphs…

Examining distributions of quantitative variables (continued)

1. One or several peaks (called modes)? A distribution with a single major peak is called unimodal.

2. Center/midpoint: the value with roughly half the observations taking smaller values and half taking larger values.

3. Symmetric or skewed to the right/left? A distribution is skewed to the right if the right tail (larger values) is much longer than the left tail (smaller values).

4. Outlier: a point that is clearly apart from the body of the data, not just the most extreme observation in a distribution.

Complex, multimodal distribution

Symmetric distribution

Skewed distribution

Alaska Florida

Outliers

An important kind of deviation is an outlier. Outliers are observations

that lie outside the overall pattern of a distribution. Always look for

outliers and try to explain them.

The overall pattern is fairly

symmetrical except for 2

states that clearly do not

belong to the main trend.

Alaska and Florida have

unusual representation of

the elderly in their

population.

A large gap in the

distribution is typically a

sign of an outlier.

Symmetric

Skewed to the right

Skewed to the left

Outlier

Double peaks

Q: 1.37 (p. 28)

HW Q: 1.25, 1.26, 1.27, 1.42

How to create a histogram

It is an iterative process – try and try again.

What bin size should you use?

Not too many bins with either 0 or 1 counts

Not so summarized that you lose all the information

Not so detailed that it is no longer a summary

Rule of thumb: start with 5 to 10 bins

Look at the distribution and refine your bins

(There isn’t a unique or “perfect” solution)
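The try-and-refine process can be sketched in R with hist(). The data here are an assumption made for illustration: the eruptions column of R's built-in faithful data set, not a data set from these slides.

```r
# Built-in example data: eruption durations (minutes) of Old Faithful
x <- faithful$eruptions

# Same data at three levels of summarization. A single number for
# 'breaks' is only a suggestion; hist() rounds it to "pretty" classes.
op <- par(mfrow = c(1, 3))
hist(x, breaks = 2,  main = "Too summarized")
hist(x, breaks = 8,  main = "About 5 to 10 bins")
hist(x, breaks = 60, main = "Not summarized enough")
par(op)

# Equal-width classes can also be given explicitly; right = FALSE
# makes each class include its left endpoint, as in 75 <= IQ < 85
h <- hist(x, breaks = seq(1.5, 5.5, by = 0.5), right = FALSE, plot = FALSE)
print(h$counts)
```

Giving the breaks explicitly is the only way to guarantee the exact classes; the single-number form leaves the final choice to hist().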

Not summarized enough

Too summarized

Same data set

Death rates from cancer (US, 1945-95)

[Figure: the same time plot drawn three times with differently stretched axes. x axis: Years (1940 to 2000); y axis: Death rate (per thousand), 0 to 250.]

A picture is worth a thousand words,

BUT

There is nothing like hard numbers.

Look at the scales.

Scales matter

How you stretch the axes and choose your scales can give a different impression.


[Figure: the same time plot with the vertical scale restricted to 120 to 220. x axis: Years (1940 to 2000); y axis: Death rate (per thousand).]

IMPORTANT NOTE:

Your data are the way they are.

Do not try to force them into a

particular shape.

It is a common misconception

that if you have a large enough

data set, the data will eventually

turn out nice and symmetrical.

Histogram of dry days in 1995

Line graphs: time plots

A trend is a rise or fall that persists over time, despite small irregularities.

In a time plot, time always goes on the horizontal, x axis.

We describe time series by looking for an overall pattern and for striking

deviations from that pattern. In a time series:

A pattern that repeats itself at regular intervals of time is

called seasonal variation.

Retail price of fresh oranges over time

This time plot shows a regular pattern of yearly variations. These are seasonal

variations in fresh orange pricing most likely due to similar seasonal variations in

the production of fresh oranges.

There is also an overall upward trend in pricing over time. It could simply be

reflecting inflation trends or a more fundamental change in this industry.

Time is on the horizontal, x axis.

The variable of interest—here

“retail price of fresh oranges”—

goes on the vertical, y axis.

1918 influenza epidemic

Date      # Cases   # Deaths
week 1    36        0
week 2    531       0
week 3    4233      130
week 4    8682      552
week 5    7164      738
week 6    2229      414
week 7    600       198
week 8    164       90
week 9    57        56
week 10   722       50
week 11   1517      71
week 12   1828      137
week 13   1539      178
week 14   2416      194
week 15   3148      290
week 16   3465      310
week 17   1440      149

[Figure: time plot by week of # cases diagnosed (left axis, 0 to 10000) and # deaths reported (right axis, 0 to 800).]

A time plot can be used to compare two or more

data sets covering the same time period.

The pattern over time for the number of flu diagnoses closely resembles that for the

number of deaths from the flu, indicating that about 8% to 10% of the people

diagnosed that year died shortly afterward from complications of the flu.
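As a sketch, the two series from the 1918 influenza table can be drawn on one time plot in R, with the overall ratio of deaths to cases computed alongside. The variable names are illustrative assumptions, and the two-axis layout is one possible way to draw it, not necessarily how the original figure was made.

```r
week   <- 1:17
cases  <- c(36, 531, 4233, 8682, 7164, 2229, 600, 164, 57,
            722, 1517, 1828, 1539, 2416, 3148, 3465, 1440)
deaths <- c(0, 0, 130, 552, 738, 414, 198, 90, 56,
            50, 71, 137, 178, 194, 290, 310, 149)

# Overall percent of diagnosed cases that ended in death
pct_died <- 100 * sum(deaths) / sum(cases)
print(pct_died)

# Time goes on the x axis; cases use the left scale
plot(week, cases, type = "l", xlab = "Week", ylab = "# cases diagnosed")

# Overlay deaths on their own scale, shown on the right axis
par(new = TRUE)
plot(week, deaths, type = "l", lty = 2, axes = FALSE, xlab = "", ylab = "")
axis(side = 4)
legend("topright", legend = c("# Cases", "# Deaths"), lty = c(1, 2))
```

Over the 17 weeks the totals are 3,557 deaths out of 39,771 cases, which falls inside the 8% to 10% range quoted above.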

Go to http://www.statcrunch.com/ and log in with your user name and password.

From tool bar, click on MyStatCrunch:

Then click on My Groups:

Then click on “join a group”. In the search box on the left panel, search for: STT215. Once you find it, select and join our UNCW-STT215 group.

How to run StatCrunch

Searchengines.xls (Categorical variable with summary of %).

Ruthmcgwire.xls (Numerical variable)

Beer.xls (Complex dataset)


After you make a graph, click on the “Options” icon.

If you select “COPY”, a new window will pop up. Right-click the image to copy it.

Now use “Ctrl+Alt+V” to paste the image. You can choose the bitmap format.

Otherwise you can save the image and cut and paste later on.

