PA 818 Professor Wallace Fall 2017 Descriptive Statistics with …gwallace/Papers/STATA... · 2017-09-22 · 1 . PA 818 . Professor Wallace . Fall 2017. Lecture: Descriptive Statistics


PA 818 Professor Wallace Fall 2017 Lecture: Descriptive Statistics with STATA

• Descriptive methods for nominal and ordinal data using STATA and Excel
    o Frequency distributions (Example 1)
    o Discrete histograms (Example 2)
    o Pie charts (Example 3)
    o Describing relationships between two or more categorical variables
        ▪ Cross tabular frequency distribution (Example 4)
        ▪ Graphical techniques (Example 5)

• Graphical descriptive techniques for interval and time series data
    o Histograms and frequency distributions (Examples 6, 7, and 8)
    o Stem and leaf display (Examples 9 and 10)
    o Ogives (Examples 11, 12, and 13)

• Numerical descriptive techniques for interval data
    o Measures of central tendency
        ▪ Sample mean (Example 14)
        ▪ Sample median (Example 15)
        ▪ Sample mode (Examples 16, 17, and 18)
        ▪ Mean vs. median (Example 19)
        ▪ Measures of central location for categorical data (Example 20)
    o Measures of dispersion
        ▪ Range (Example 21)
        ▪ Sample standard deviation and variance (Example 22)
        ▪ Coefficient of variation (Example 23)
    o Measures of relative standing and box plots
        ▪ Percentiles (Examples 24 and 25)
        ▪ Interquartile range (Example 26)
        ▪ Ratios of percentiles (Examples 27 and 28)
        ▪ Box plots (Examples 29 and 30)

• Wrap-up on descriptive statistics (Examples 31, 32, and 33)

Descriptive Techniques for Categorical Data

• Frequency distribution – a tabular description for nominal data that lists the number of units associated with each category

• Relative frequency distribution – a tabular description for nominal data that lists the fraction or percentage of units associated with each category

• Cumulative frequency distribution (ordinal only) – a tabular description for ordinal data that lists the cumulative (category and below) count, fraction, or percentage of units associated with each category

Example 1: Using the data in the updated CPS ORG file, provide a frequency, relative frequency, and cumulative frequency distribution for grouped educational attainment for men in 2007. The education level categories should be less than high school, high school, some college (no degree), and 4 or more years of college.

    gen ed_level=cond(ed<12,1,0)
    replace ed_level=cond(ed==12,2,ed_level)
    replace ed_level=cond(ed>12 & ed<16,3,ed_level)
    replace ed_level=cond(ed>15,4,ed_level)
    tab ed_level

       ed_level |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |      8,000       11.01       11.01
              2 |     24,602       33.86       44.87
              3 |     16,237       22.34       67.21
              4 |     23,827       32.79      100.00
    ------------+-----------------------------------
          Total |     72,666      100.00

Perhaps you are preparing this for your boss and want to provide more information about what the education levels actually mean. In this case we can make use of data labels of the type that already exist for the sex and race variables. Let’s create some for ed_level.

    label define ed_levell 1 "High school dropout" /*
        */ 2 "High school" /*
        */ 3 "Some college" /*
        */ 4 "4 or more years of college"
    label values ed_level ed_levell
    tab ed_level

                      ed_level |      Freq.     Percent        Cum.
    ---------------------------+-----------------------------------
           High school dropout |      8,000       11.01       11.01
                   High school |     24,602       33.86       44.87
                  Some college |     16,237       22.34       67.21
    4 or more years of college |     23,827       32.79      100.00
    ---------------------------+-----------------------------------
                         Total |     72,666      100.00
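The grouping and labeling steps can also be done in one command. The following is only a sketch on STATA's built-in auto data (not the CPS ORG file); the variable name rep_group and its three labels are made up for illustration.

```stata
* Minimal sketch using the built-in auto data, not the CPS ORG file.
* rep_group and its labels are hypothetical names chosen for illustration.
sysuse auto, clear

* Collapse the 1-5 repair record into three labeled groups in one step
recode rep78 (1/2 = 1 "Poor") (3 = 2 "Average") (4/5 = 3 "Good"), generate(rep_group)

* Freq., Percent, and Cum. columns, displayed with the value labels
tabulate rep_group

* The missing option adds a row for cars with no repair record
tabulate rep_group, missing
```

Whether recode or a series of gen/replace commands is clearer is a matter of taste; recode keeps the category definitions and their labels in the same place.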

• Discrete histogram (a type of bar graph) – a graphical representation of a frequency distribution or relative frequency distribution whereby bars are associated with categories and the height of each bar on the graph represents the frequencies or relative frequencies associated with its corresponding category

Example 2: Distribution of education level for male full-time, full-year workers in 2007

    hist ed_level, discrete percent

[Figure: discrete histogram titled "Distribution of Educational Attainment" (male full-time, full-year workers, 2007). Vertical axis: Percent (0 to 40). Horizontal axis categories: High school dropout, High school, Some college, 4 or more years of college.]

* The following edits were made in STATA’s graph editor to get to the graph shown above:

• Bar Properties
    o bar width set to 0.5
    o color set to black
• xaxis1 title – hidden in advanced tab
• xaxis1 properties – label properties
    o use value labels checked
    o angle set at 45 degrees
    o Under range/delta – minimum value set to 1, max value set to 4, and delta set to 1
• Title and Subtitle added
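Most of these graph editor edits can probably be reproduced with command options, which makes the graph easier to regenerate. The following is a rough sketch on the built-in auto data rather than the CPS ORG file; the title text is invented, and gap(50) (which shrinks each bar by 50 percent) is used here as a stand-in for the editor's bar width setting.

```stata
* Sketch only: built-in auto data, illustrative titles.
* The /// continuation works in a do-file, not in the command window.
sysuse auto, clear

histogram foreign, discrete percent gap(50) color(black)   ///
    xlabel(0 1, valuelabel angle(45)) xtitle("")           ///
    title("Distribution of Car Origin")                    ///
    subtitle("(1978 automobile data)")
```

Here valuelabel plays the role of "use value labels", angle(45) rotates the labels, and xtitle("") hides the axis title.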

• Pie chart – a graphical representation of the relative frequency distribution whereby a circle (or pie) is divided into slices with each slice representing a category and where the size of the slice is proportional to the relative frequency of its associated category.

Example 3: The same relative frequency distribution of educational attainment displayed in pie chart format.

    graph pie, over(ed_level) plabel(_all name) /*
        */ plabel(_all percent)

[Figure: pie chart titled "Distribution of Educational Attainment" (male full-time, full-year workers, 2007). Slices: High school dropout 11.01%, High school 33.86%, Some college 22.34%, 4 or more years of college 32.79%.]

* The following edits were made in STATA’s graph editor to get to the graph shown above:

• Legend – advanced tab – hide legend checked
• The percent and name labels were moved so that they don’t overlap
• Title and subtitle added
• Pielabel text changed to white

Frequency Distributions vs. Discrete Histograms vs. Pie Charts

o Graphs and charts versus tables
    ▪ Graphs and charts take up more room than the same information displayed in tabular format, but they may be easier for some audiences to interpret.
    ▪ Too many graphs or charts is generally a bad idea.
    ▪ Graphs may be better when providing descriptive statistics related to one or two data items, rather than many items in a data set.
    ▪ Tabular displays for nominal data can be integrated into tables which also provide basic descriptive statistics for interval data.

o Discrete histogram versus pie chart
    ▪ In general, discrete histograms are more efficient and flexible than pie charts, but pie charts are probably better at communicating relative frequencies to lay audiences. I hardly ever opt for the pie chart.
    ▪ Simple discrete histograms can be printed in black and white, whereas pie charts usually need to incorporate some color or texture.

• Describing relationships between two categorical variables
    o Cross tabular frequency distribution – a tabular display that shows the absolute or relative frequencies associated with all combinations of two nominal variables.
    o Stacked or Side-by-Side Discrete Histograms – a graphical display whereby multiple discrete histograms for different groups, but for the same categorical data, are shown stacked or side by side.

Example 4: How does educational attainment vary by race among men in 2007? Using a cross tabulated frequency distribution

A cross tabular frequency distribution of race and educational attainment would help us answer this question.

    tab ed_level race, col

    +-------------------+
    | Key               |
    |-------------------|
    |     frequency     |
    | column percentage |
    +-------------------+

                          | Equals 1 if non-Hisp white, 2 if non-Hisp
                          |   black, 3 if Hisp, and 4 if other race
                 ed_level | White, No  Black, No   Hispanic      Other |     Total
    ----------------------+--------------------------------------------+----------
      High school dropout |     3,369        511      3,790        330 |     8,000
                          |      6.43       8.91      38.12       7.19 |     11.01
    ----------------------+--------------------------------------------+----------
              High school |    17,490      2,373      3,488      1,251 |    24,602
                          |     33.38      41.38      35.09      27.27 |     33.86
    ----------------------+--------------------------------------------+----------
             Some college |    12,467      1,483      1,409        878 |    16,237
                          |     23.79      25.86      14.17      19.14 |     22.34
    ----------------------+--------------------------------------------+----------
    4 or more years of co |    19,078      1,367      1,254      2,128 |    23,827
                          |     36.41      23.84      12.61      46.39 |     32.79
    ----------------------+--------------------------------------------+----------
                    Total |    52,404      5,734      9,941      4,587 |    72,666
                          |    100.00     100.00     100.00     100.00 |    100.00

Another way

    tab race ed_level, row

    +----------------+
    | Key            |
    |----------------|
    |   frequency    |
    | row percentage |
    +----------------+

            Equals 1 if |
      non-Hisp white, 2 |
     if non-Hisp black, |
    3 if Hisp, and 4 if |                  ed_level
             other race | High scho  High scho  Some coll  4 or more |     Total
    --------------------+--------------------------------------------+----------
    White, Non-Hispanic |     3,369     17,490     12,467     19,078 |    52,404
                        |      6.43      33.38      23.79      36.41 |    100.00
    --------------------+--------------------------------------------+----------
    Black, Non-Hispanic |       511      2,373      1,483      1,367 |     5,734
                        |      8.91      41.38      25.86      23.84 |    100.00
    --------------------+--------------------------------------------+----------
               Hispanic |     3,790      3,488      1,409      1,254 |     9,941
                        |     38.12      35.09      14.17      12.61 |    100.00
    --------------------+--------------------------------------------+----------
                  Other |       330      1,251        878      2,128 |     4,587
                        |      7.19      27.27      19.14      46.39 |    100.00
    --------------------+--------------------------------------------+----------
                  Total |     8,000     24,602     16,237     23,827 |    72,666
                        |     11.01      33.86      22.34      32.79 |    100.00
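To experiment with row and column percentages without the CPS ORG file, something like the following should work on STATA's built-in auto data. It is only an illustrative sketch; rep78 and foreign simply stand in for ed_level and race.

```stata
* Sketch on the built-in auto data; rep78 and foreign stand in for
* ed_level and race from the examples above.
sysuse auto, clear

* Column percentages: distribution of repair record within each origin
tabulate rep78 foreign, column

* Row percentages only: origin mix within each repair record
* (nofreq suppresses the raw counts)
tabulate rep78 foreign, row nofreq
```

Which percentage to show depends on the question: column percentages answer "how does the row variable vary across groups defined by the column variable", and row percentages answer the reverse.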

Example 5: How does educational attainment vary by race? Using discrete histograms

Using stacked histograms

    hist new_ed, discrete percent by(race, col(1))

*** We always want to make sure the categories and Y-axis scales are the same.

[Figure: four stacked discrete histograms titled "Distribution of Educational Attainment By Race" (male full-time, full-year workers, 2007), with one panel each for White, Non-Hispanic; Black, Non-Hispanic; Hispanic; and Other. Vertical axis: Percent (0 to 50). Horizontal axis categories: High school dropout, High school, Some college, 4 or more years of college.]

* The following edits were made in STATA’s graph editor:

• Plotregion1 – plot1 – bar width set to 0.5 and color set to black
• Plotregion2 – plot1 – bar width set to 0.5 and color set to black
• xaxis1-xaxis4 – axis rule – range/delta checked and set to minimum value=1, maximum value=4, and delta=1
• xaxis3 – label properties – show labels checked, use value labels checked, angle set to 45 degrees
• yaxis1-yaxis4 – axis rule – range/delta checked and set to minimum value=0, maximum value=50, and delta=10 – labels set horizontal
• Title and subtitle added
• Bottom position title and note hidden

Using side-by-side histograms

    hist new_ed, discrete percent by(race,col(3))

* The following edits were made in STATA’s graph editor:

• Plotregion2 – plot1 – bar width set to 0.75 and color set to black
• xaxis1-xaxis3
    o axis rule – range/delta checked and set to minimum value=1, maximum value=4, and delta=1
• xaxis3
    o label properties – show labels checked, use value labels checked, angle set to 45 degrees
• Title and subtitle added
• Bottom position title (“new_ed”) and note hidden

[Figure: side-by-side discrete histograms titled "Distribution of Educational Attainment By Race" (male full-time, full-year workers, 2007), with panels for White, Non-Hispanic; Black, Non-Hispanic; Hispanic; and Other. Vertical axis: Percent (0 to 50). Horizontal axis categories in each panel: High school dropout, High school, Some college, 4 or more years of college.]

Graphical Techniques for Interval Data

• Histogram – graphical display which shows the absolute or relative frequency associated with particular classes (intervals) of equal width.
    o With a histogram we have to determine the class width and start value, or the number of bins (classes) and start value.
    o Start value – the value at which the first (left-most) class starts
    o Number of bins (classes) – the number of classes

        Number of observations    Number of classes
        <50                       5-7
        50-200                    7-9
        200-500                   9-10
        500-1,000                 10-11
        1,000-5,000               11-13
        5,000-50,000              13-17
        >50,000                   17-20

    o Class width – the width of each class

        $\text{Class width} = \dfrac{\text{largest value} - \text{smallest value}}{\text{number of classes}}$

Example 6: Use the CPS ORG data to create a histogram for male hourly wages in 2007

    sum wage

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
            wage |     72,666    25.18414    14.83449   4.077735    94.0815

The number of observations is larger than 50,000, so by the guidelines we should have 17-20 classes (bins). With 18 classes, the formula gives a class width of approximately 90/18, which is about 5. We will select a class width of 5 and start at 0.

    hist wage, start(0) width(5) percent
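The class width arithmetic can be checked with display. The first line below uses the Min and Max from the summarize output above; the second part is a sketch of the same calculation from stored results on the built-in auto data, where the choice of 7 classes is my assumption based on the 50-200 observation guideline.

```stata
* Class width for Example 6: (max - min) / number of classes
display (94.0815 - 4.077735)/18

* Same calculation from stored results, using the built-in auto data.
* 7 classes is an assumed choice from the 50-200 observation guideline.
sysuse auto, clear
quietly summarize price
display "n = " r(N) ", class width = " (r(max) - r(min))/7
```

In practice the computed width is rounded to a convenient number (5 in Example 6) so that class boundaries are easy to read.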

This is an example of a positively (or right) skewed histogram. The histogram could also be described as unimodal as it has one peak. In contrast a bimodal histogram has two peaks.

• Histogram with some grouped data – in the prior example the relative frequencies for wage categories above $70 are really low. One solution that would allow for more detail at lower wages, but less at higher wages, is to group the upper classes.

[Figure (Example 6): histogram titled "Distribution of Hourly Wages" (male, full-time, full-year workers, 2007). Vertical axis: Percent (0 to 20). Horizontal axis: Hourly Wage in $2014 (0 to 100 in steps of 5).]

Example 7: Create a histogram with a class width of 5 where wages above $70 are grouped into one catch-all category. This allows for more detail at the bottom of the distribution without near-zero frequency categories at the top.

    egen cwage=cut(wage), at(0(5)95)
    gen ncwage=cond(cwage>=70,70,cwage)

    label define ncwagel 0 "$0-$5" 5 "$5-$10" /*
        */ 10 "$10-$15" 15 "$15-$20" /*
        */ 20 "$20-$25" 25 "$25-$30" /*
        */ 30 "$30-$35" 35 "$35-$40" /*
        */ 40 "$40-$45" 45 "$45-$50" /*
        */ 50 "$50-$55" 55 "$55-$60" /*
        */ 60 "$60-$65" 65 "$65-$70" /*
        */ 70 "$70+"

    label values ncwage ncwagel
    tab ncwage
    hist ncwage, discrete percent

[Figure: discrete histogram titled "Distribution of Hourly Wages" (male, full-time, full-year workers, 2007). Vertical axis: Percent (0 to 20). Horizontal axis categories: $5-wide wage classes from $0-$5 through $65-$70, followed by $70+.]

• Using histograms to compare distributions of interval variables across groups

Example 8: Make a graph which allows us to compare the distribution of wages across education levels.

    hist ncwage, discrete percent /*
        */ by(ed_level, col(1))

*** Need to make sure the vertical scales and horizontal scales are the same

[Figure: four stacked discrete histograms of grouped hourly wages, with one panel each for High school dropout, High school, Some college, and 4 or more years of college. Vertical axis: Percent (0 to 40). Horizontal axis categories: $5-wide wage classes from $0-$5 through $65-$70, followed by $70+.]

Graphical Descriptive Techniques for Interval Data, cont.

• Stem and Leaf Displays
    o Similar to a histogram
    o Observations are split into 2 parts
        ▪ Stem – first part of the observation
        ▪ Leaf – second part of the observation
        ▪ Example: A wage of $15.53 might be split into a stem of 15 and a leaf of 5 (the next digit, rounded to the nearest 10th).
    o Each stem is then listed followed by all the leaves associated with the stem
        ▪ Because all leaves are listed, stem and leaf displays work better for data with a smaller number of elements.
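As a quick illustration of the stem/leaf split, the $15.53 example can be reproduced with display, and the stem command can be tried on a small data set. This is only a sketch; the auto data and the mpg variable are stand-ins, not part of the course data.

```stata
* Splitting 15.53 into a stem (whole dollars) and a leaf (next digit,
* rounded to the nearest 10th)
display "stem = " floor(15.53)
display "leaf = " round(10*(15.53 - floor(15.53)))

* Stem and leaf display on the built-in auto data (74 cars)
sysuse auto, clear
stem mpg

* lines(2) splits each stem across two rows, which spreads out the display
stem mpg, lines(2)
```

With only 74 observations every leaf fits on the page, which is the situation where this display works best.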

Example 9: Stem and leaf display for the 2007 male hourly wage data from CPS ORG. The stem is the wage rounded down to the nearest dollar and the leaf is the next digit in the wage rounded to the nearest 10th. Note that the first ten observations listed below all lie between 4.07 and 4.16, so they round to 4.1 or 4.2 and show up as leaves of 1 and 2 on stem 4.

    sort wage
    list wage in 1/10

         +----------+
         |     wage |
         |----------|
      1. | 4.077735 |
      2. | 4.077735 |
      3. | 4.077735 |
      4. | 4.077735 |
      5. | 4.081813 |
         |----------|
      6. | 4.102976 |
      7. | 4.110357 |
      8. | 4.116922 |
      9. | 4.138901 |
     10. | 4.151875 |
         +----------+

    stem wage

    Stem-and-leaf plot for wage (Hourly wage in $2012 using NBER recommenation - earnwke/uhourse)

    wage rounded to nearest multiple of .1
    plot in units of .1

       4* | 111111111
       4t | 2222233333333333333333333333
       4f | 44444444444455
       4s | 6666666666666666666666777777777
       4. | 8888888888999999999999999999999999
       5* | 000000001111111111
       5t | 222233333333333333333333333333333
       5f | 4444444444444444444444455555555555555555555555555
       5s | 66666666666666677777777777777777777777777777777777777777777777 ... (96)
       5. | 888999999999999999999999999999999999999
       6* | 000000000000000000000001111111111111111111111
       6t | 22222222222222223333333333333333333333333333333333333333333333 ... (72)
       6f | 444444444444444444444455555555555555555
       6s | 6666666666666666666666666666666666666666666666666666666666666 ... (127)
       6. | 8888888888999999999999999999999999999999999999999999999999999 ... (155)
       . . .
      93. | 8
      94* | 1
      82* |
      83* |
      84* |
      85* |
      86* |
      87* |
      88* | 4

Example 10: Stem and leaf display for a final exam I gave recently

    stem final

    Stem-and-leaf plot for final_

       3* | 7
       4* | 7
       5* |
       6* | 123458
       7* | 025668
       8* | 056688
       9* | 24568
      10* | 0

We can easily see the lowest scores were 37 and 47 and the highest scores were 98 and 100.

• Ogive – a plot of the cumulative frequency distribution.
    o Using the book method, the data are grouped and the cumulative frequencies are plotted at the group intervals
    o Using STATA, every point is plotted

Example 11: Generate an ogive for wage using the book method.

    tab cwage

    Hourly Wage |
       in $2014 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        122        0.17        0.17
              5 |      5,480        7.54        7.71
             10 |     14,566       20.05       27.75
             15 |     13,704       18.86       46.61
             20 |     10,439       14.37       60.98
             25 |      8,105       11.15       72.13
             30 |      5,560        7.65       79.78
             35 |      4,269        5.87       85.66
             40 |      2,974        4.09       89.75
             45 |      1,946        2.68       92.43
             50 |      1,961        2.70       95.13
             55 |        815        1.12       96.25
             60 |        497        0.68       96.93
             65 |      1,085        1.49       98.43
             70 |        368        0.51       98.93
             75 |        130        0.18       99.11
             80 |        627        0.86       99.98
             85 |          9        0.01       99.99
             90 |          9        0.01      100.00
    ------------+-----------------------------------
          Total |     72,666      100.00

Once we have the above table we can load it in Excel and make a nice looking line graph.

[Figure: line graph titled "Ogive for Wages" (male full-time, full-year workers, 2007). Vertical axis: Cumulative Percent (0 to 100).]
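If you would rather stay in STATA than move the table to Excel, a grouped ogive can probably be approximated as follows. This is a sketch on the built-in auto data, not the method used for the figure above; the variable names cmpg, upper, and cum_grouped are made up, and the plot omits the starting point at zero that a textbook ogive would include.

```stata
* Sketch of a grouped ("book method") ogive on the built-in auto data.
sysuse auto, clear

* Group mpg into classes of width 5; cmpg holds each class's lower bound
egen cmpg = cut(mpg), at(10(5)45)

* Upper limit of each class, where the cumulative share is plotted
generate upper = cmpg + 5

* Cumulative relative frequency; equal gives every observation in a
* class the same (end-of-class) cumulative value
cumul cmpg, generate(cum_grouped) equal

* Compare with the Cum. column of the frequency table, then plot
tabulate cmpg
line cum_grouped upper, sort
```

The result has one point per class joined by straight lines, in contrast to the every-point version that cumul produces on the ungrouped variable.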

Example 12: Generate an ogive using the 2007 male wage data using STATA

    cumul wage, gen(cum)
    label variable cum "Cumulative Relative Frequency"
    browse wage cum /* Not needed */
    line cum wage, sort

[Figure: line plot titled "Cumulative Distribution of Hourly Wages" (male, full-time, full-year workers, 2007). Vertical axis: Cumulative Relative Frequency (0 to 1). Horizontal axis: Hourly Wage in $2014 (0 to 100 in steps of 5).]

Example 13: Use an ogive to compare the distribution of wages across educational attainment levels

    cumul wage if ed_level==1, gen(cum1)
    cumul wage if ed_level==2, gen(cum2)
    cumul wage if ed_level==3, gen(cum3)
    cumul wage if ed_level==4, gen(cum4)
    label variable cum1 "High school dropout"
    label variable cum2 "High school"
    label variable cum3 "Some college"
    label variable cum4 "4 or more years of college"
    line cum1 cum2 cum3 cum4 wage, sort

[Figure: line plot titled "Cumulative Distribution of Hourly Wages by Education Level" (male, full-time, full-year workers, 2007), with one line each for High school dropout, High school, Some college, and 4 or more years of college. Vertical axis: Cumulative Relative Frequency (0 to 1). Horizontal axis: Hourly Wage in $2014 (0 to 100 in steps of 5).]

Numeric Descriptive Techniques for Interval Data

• Measures of Central Tendency
    o Sample mean (or average)

        $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$

      As we have already seen, the sum command produces a table that contains the sample mean.

Example 14: Using data from the CPS ORG, what are (sample) mean weekly earnings for men in 2007?

    sum earnwke

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
         earnwke |     72,666    1100.081    686.8093   142.7207   3293.549

    o Sample median (50th percentile) – The median is nothing more than the middle observation when the data is sorted
        ▪ In the case of an odd number of observations there is a true middle value
        ▪ In the case of an even number of observations we take the mid-point of the two middle observations

Example 15: Using data from the CPS ORG, what is the (sample) median of weekly earnings for men in 2007?

    sort earnwke
    browse earnwke

Note that in this case there is an even number of observations (72,666), so there will not be a true middle value. With the data sorted we would want to take the midpoint of the 36,333rd and 36,334th values. As it turns out, for this particular data the 36,333rd and 36,334th values are identical at 913.413.
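The sort-and-find-the-middle logic can be written out directly. The following is a sketch on the built-in auto data (74 cars, so an even number of observations); the scalar name med is made up, and the code assumes the variable has no missing values.

```stata
* Median by hand: sort, then take the middle value (odd n) or the
* midpoint of the two middle values (even n). Assumes no missing values.
sysuse auto, clear
sort price
local n = _N

if mod(`n', 2) == 1 {
    scalar med = price[(`n' + 1)/2]
}
else {
    scalar med = (price[`n'/2] + price[`n'/2 + 1])/2
}
display "n = `n', median by hand = " med

* Compare with the 50th percentile reported by summarize, detail
quietly summarize price, detail
display "median from summarize = " r(p50)
```

The two numbers should agree, which is a useful sanity check on what the 50% row of the detail output means.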

We can get the median (and other percentiles) from STATA using the sum command with the detail option.

    sum earnwke, detail

          Actual or computed weekly earnings $2014, edited
    -------------------------------------------------------------
          Percentiles      Smallest
     1%     283.1579       142.7207
     5%      365.365       142.7207
    10%     439.1345       146.9453       Obs              72,666
    25%     593.7182       151.8548       Sum of Wgt.      72,666

    50%     913.4126                      Mean           1100.081
                            Largest       Std. Dev.      686.8093
    75%     1370.119       3293.549
    90%     2084.864       3293.549       Variance       471707.1
    95%     2634.579       3293.549       Skewness       1.424771
    99%     3293.549       3293.549       Kurtosis       4.800422

We could also get the median using egen

    egen med_earnwke=median(earnwke)
    sum med_earnwke

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
     med_earnwke |     72,666    913.4126           0   913.4126   913.4126

    o Sample mode – the mode is the most frequently occurring value in the data.

Example 16: Using data from the CPS ORG, what is the (sample) mode of weekly earnings for men in 2007?

In STATA you have to use egen to find the mode.[1] We have seen egen a few times already.

    egen mode_earnwke=mode(earnwke)
    sum mode_earnwke

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
    mode_earnwke |     72,666    456.7063           0   456.7063   456.7063

Example 17: How many observations of weekly earnings take on this modal value?

    sum earnwke if earnwke==mode_earnwke

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
         earnwke |      2,157    456.7063           0   456.7063   456.7063

[1] egen can also be used to find medians, any percentile, mins, maxes, standard deviations, and a variety of other measures.

Example 18: What are modal weekly earnings for black and white men (separately) in 2007?

    bysort race: egen mode_earnwke_byrace /*
        */ =mode(earnwke)
    bysort race: sum mode_earnwke_byrace

    --------------------------------------------------------------------------
    -> race = White, Non-Hispanic

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
    mode_earnw~e |     52,404    3293.549           0   3293.549   3293.549

    --------------------------------------------------------------------------
    -> race = Black, Non-Hispanic

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
    mode_earnw~e |      5,734    456.7063           0   456.7063   456.7063

    --------------------------------------------------------------------------
    -> race = Hispanic

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
    mode_earnw~e |      9,941    456.7063    5.68e-14   456.7063   456.7063

    --------------------------------------------------------------------------
    -> race = Other

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
    mode_earnw~e |      4,587    456.7063           0   456.7063   456.7063

    o Mean vs Median: Which one to show?
        ▪ Basically the mean is going to be sensitive to extreme values and outliers whereas the median is not.
        ▪ For a positively (right) skewed distribution the mean will typically be above the median.
        ▪ For a negatively (left) skewed distribution the mean will typically be below the median.
        ▪ In practice if I felt it was important to report the median in a descriptive table, I would also report the mean.

Example 19: The distribution of male weekly earnings in 2007

The distribution is positively skewed and there are some large values that are likely to pull the mean up. This might be an instance where reporting the mean and the median is appropriate. Note that the mean and median are both reported in the median example above.

[Figure: histogram titled "Distribution of Weekly Earnings" (male, full-time, full-year workers, 2007). Vertical axis: Percent (0 to 10). Horizontal axis: Weekly Earnings in $2014 (0 to 3,200 in steps of 200).]
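A tiny made-up data set shows how differently the mean and median react to one extreme value. This sketch types in five values by hand (they are invented, not taken from the CPS ORG data) and then replaces the largest one with an outlier.

```stata
* Five invented values: mean and median are both 3
clear
input x
1
2
3
4
5
end
quietly summarize x, detail
display "before: mean = " r(mean) ", median = " r(p50)

* Replace the largest value with an outlier: the mean jumps to 22,
* the median stays at 3
replace x = 100 in 5
quietly summarize x, detail
display "after:  mean = " r(mean) ", median = " r(p50)
```

One observation moved the mean from 3 to 22 while the median did not change, which is the sense in which the median is robust to outliers.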

    o Measures of central location for categorical data – really the only thing that you can do is make dummy (or indicator) variables and report the means of these.
        ▪ Means of dummy or indicator variables are relative frequencies.

Example 20: Find the mean of educational level indicator variables for men in 2007

    gen ed1=cond(ed_level==1,1,0)
    gen ed2=cond(ed_level==2,1,0)
    gen ed3=cond(ed_level==3,1,0)
    gen ed4=cond(ed_level==4,1,0)
    sum ed1-ed4

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
             ed1 |      3,096    .2328811    .4227354          0          1
             ed2 |      3,096    .3914729    .4881586          0          1
             ed3 |      3,096    .1821705    .3860474          0          1
             ed4 |      3,096    .1934755    .3950862          0          1
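To see that the mean of an indicator variable is just a relative frequency, compare summarize with tabulate on a 0/1 variable. The sketch below uses the built-in auto data, where foreign is already coded 0/1; it is an illustration only and not part of the CPS ORG analysis.

```stata
* foreign is a 0/1 indicator in the built-in auto data
sysuse auto, clear

* Relative frequency of each category in the Percent column
tabulate foreign

* The mean of the indicator is the share of observations coded 1
quietly summarize foreign
display "share foreign (percent) = " 100*r(mean)
```

The displayed percentage should match the Percent entry for the category coded 1 in the tabulate output.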

Later on it will be useful to use what STATA refers to as factor variables. The simplest form of a factor variable is a categorical variable with an i. prefix. The i. prefix lets STATA know to treat the variable as categorical.

    sum i.ed_level

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
        ed_level |
    High school  |     72,666    .3385627    .4732241          0          1
    Some coll..  |     72,666     .223447    .4165583          0          1
    4 or more..  |     72,666    .3278975    .4694505          0          1

By default the lowest category of a categorical factor variable is omitted from any operations. This can be changed by using the fvset command (fvset stands for factor variable set). We want to use fvset to set the base category to none.

    fvset base none ed_level
    sum i.ed_level

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
        ed_level |
    High scho..  |     72,666    .1100928    .3130075          0          1
    High school  |     72,666    .3385627    .4732241          0          1
    Some coll..  |     72,666     .223447    .4165583          0          1
    4 or more..  |     72,666    .3278975    .4694505          0          1

Yet another option is to use the ibn. prefix. The ibn. prefix tells STATA to treat the variable as categorical with no base level.

    sum ibn.ed_level

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
        ed_level |
    High scho..  |     72,666    .1100928    .3130075          0          1
    High school  |     72,666    .3385627    .4732241          0          1
    Some coll..  |     72,666     .223447    .4165583          0          1
    4 or more..  |     72,666    .3278975    .4694505          0          1

Measures of Dispersion

• Sample range – the range is simply the largest observed value less the smallest observed value.
    o The range can be calculated manually from information provided by the sum command
    o After you run the sum command certain temporary variables are created. Among these are r(min) and r(max), which can be used to get the range.

Example 21: What is the range of weekly earnings for women in the 1989 CPS ORG data?

    use "C:\cpsorg(updated).dta", clear
    keep if sex==2 & year==1989 &
    sum earnwke

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
         earnwke |     60,072    726.9502    399.2943   143.1871   3671.317

Note that we could easily calculate the range from the above summarize output. If we didn’t want to bother, or wanted the range stored for some reason, we could make use of the temporary variables that are created when the summarize command is run to do the computation and store the results as an additional variable.

    display r(max)-r(min)

    3528.1301

We could also use egen to the same effect

    egen r_max=max(earnwke)
    egen r_min=min(earnwke)
    gen range_1989_2=r_max-r_min
    sum range_1989_2

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
    range_1989_2 |     60,072     3528.13           0    3528.13    3528.13
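If the goal is just to keep the range around for later calculations, a scalar may be lighter than creating a new variable with the same value in every row. The following sketch uses the built-in auto data; the scalar name price_range is made up for illustration.

```stata
* Store the range in a scalar instead of a constant variable
sysuse auto, clear
quietly summarize price

* return list shows everything summarize left behind in r()
return list

scalar price_range = r(max) - r(min)
display "range of price = " price_range
```

The r() results are overwritten by the next command that stores results, so copy anything you need into a scalar or local right away.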

• Sample standard deviation and variance

    Sample variance: $S^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

    Sample standard deviation: $S = \sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

Example 22: What are the sample standard deviation and variance of female weekly earnings in 1989?

The standard deviation is 399.29 – it is directly available from the summarize output in Example 21. The variance is stored as a temporary variable.

    quietly sum earnwke
    display r(Var)

    159435.96

More on the Standard Deviation

o You typically want to report the standard deviation.
    ▪ Standard deviation is measured in the same units as the data, whereas variance is measured in the squared units of the data
    ▪ Variances tend to be rather large
    ▪ The standard deviation conveys more information

o What information does a standard deviation tell us?
    ▪ Normal distribution (most distributions aren’t normal)
        • Roughly 68% of observations are within one standard deviation of the mean
        • Roughly 95% of observations are within two standard deviations of the mean
    ▪ Non-normal distribution of unknown type
        • Chebysheff’s Theorem: at least a proportion $1 - \frac{1}{k^2}$ of observations lie within $k$ standard deviations of the mean, where $k$ is any number larger than 1.
        • Using this theorem we can say that at least 75% of observations lie within 2 standard deviations of the mean.

o How large is the standard deviation? If we would like to know how large the standard deviation is relative to the mean, we can calculate the sample coefficient of variation, which is 100 times the ratio of the sample standard deviation to the sample mean.
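Chebysheff's bound is easy to compare with what actually happens in a data set. The sketch below uses the built-in auto data and three arbitrary values of k; it is meant only as an illustration of the theorem, and the actual shares will differ for other data.

```stata
* Compare Chebysheff's lower bound, 1 - 1/k^2, with the actual share of
* observations within k standard deviations of the mean.
sysuse auto, clear
quietly summarize price
local m = r(mean)
local s = r(sd)

foreach k in 1.5 2 3 {
    quietly count if abs(price - `m') <= `k' * `s'
    display "k = `k': bound = " %5.3f (1 - 1/`k'^2) ///
        "   actual share = " %5.3f (r(N)/_N)
}
```

The actual share can never fall below the bound; for most real data it is well above it, because the theorem has to hold for every possible distribution.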

Example 23: Sample coefficient of variation of weekly earnings for women in the CPS ORG in 1989

    display 100*r(sd)/r(mean)

    54.927325

The standard deviation is 55% of the mean.

Measures of relative standing and box plots

• Percentiles – the P’th percentile is the value for which P percent of the observations are less than the value and (100-P) percent of the observations are greater than the value.

Example 24: Percentiles of weekly earnings in 1989 for women in the CPS ORG data. sum with the detail option provides a bunch of percentiles. More are available with egen.

    sum earnwke, detail

                      Weekly Earnings in $2014
    -------------------------------------------------------------
          Percentiles      Smallest
     1%     244.3726       143.1871
     5%     301.6475       143.1871
    10%      343.649       143.1871       Obs              60,072
    25%     458.1987       143.1871       Sum of Wgt.      60,072

    50%     620.4774                      Mean           726.9502
                            Largest       Std. Dev.      399.2943
    75%     906.8516       3671.317
    90%     1212.317       3671.317       Variance         159436
    95%     1470.054       3671.317       Skewness       2.047665
    99%     2100.077       3671.317       Kurtosis        10.3863

Example 25: egen can be used to find any percentile. In this example we find the 33rd percentile of weekly earnings for women in 1989

    egen earnwkep33=pctile(earnwke), p(33)
    sum earnwkep33

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
      earnwkep33 |     60,072    502.1094           0   502.1094   502.1094

• Interquartile range
    o A measure of dispersion defined as the difference between the 75th and 25th percentiles
    o We could also think about calculating other percentile ranges like p90-p10.

Example 26: Calculate the interquartile range of weekly earnings in 1989 for women in the CPS ORG data. The easiest way to do this is to access the temporary percentile variables that are created after the sum command is run with the detail option.

    quietly sum earnwke, detail
    display r(p75)-r(p25)

    448.6529

• Percentile ratios – the ratios of percentiles are commonly used as measures of dispersion (e.g., p90/p10, p75/p25, p90/p50).
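The interquartile range and the common percentile ratios all come from the same stored results. The following sketch computes them on the built-in auto data; the choice of price as the variable is arbitrary and only for illustration.

```stata
* Percentile-based dispersion measures from one summarize, detail call
sysuse auto, clear
quietly summarize price, detail

display "IQR (p75 - p25) = " r(p75) - r(p25)
display "p90/p10         = " r(p90)/r(p10)
display "p75/p25         = " r(p75)/r(p25)
display "p90/p50         = " r(p90)/r(p50)
```

Unlike the IQR, the ratios are unit free, which is one reason they are popular for comparing dispersion across years or groups with different price levels.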

Example 27: Economist that have studies wage inequality have noted a rise in wage inequality within education level groups and experience level groups. Is this rise in residual wage inequality apparent for men in the updated CPS ORG data? use "C:\cpsorg(updated).dta", clear drop if sex==2

egen ex_level=cut(ex), at(0(5)60) tab ex_level drop if ex_level>=50 // not very many men bysort year ex_level ed_level: egen /* */ p90=pctile(wage), p(90) bysort year ex_level ed_level: egen /* */ p10=pctile(wage), p(10) gen ratio90_10=p90/p10 graph bar ratio90_10, over(year) The figure below shows a frequency (by ex_level) weighted mean of the 90-10 residual wage dispersion by ed_level


[Figure: Mean 90/10 Residual Wage Dispersion By Year (male, full-time, full-year workers). Vertical axis: Mean 90/10 Weekly Earnings Ratio, 0 to 4; bars for 1979, 1989, and 2007.]
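The bar heights are means of the cell-level ratio taken over observations, so cells with more observations count for more. The sketch below makes that explicit with collapse. It is an illustration under assumptions: the built-in nlsw88 data replace the CPS ORG file, with collgrad standing in for year and race standing in for the within-year cells.

```stata
* Assumption: built-in nlsw88 data; collgrad and race are stand-ins for
* year and the ex_level/ed_level cells used in Example 27
sysuse nlsw88, clear

* Cell-level percentiles, repeated on every observation in the cell
bysort collgrad race: egen p90 = pctile(wage), p(90)
bysort collgrad race: egen p10 = pctile(wage), p(10)
gen ratio90_10 = p90/p10

* Averaging over observations weights each cell by its frequency;
* this is the same mean that graph bar plots by default
collapse (mean) ratio90_10, by(collgrad)
list
```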


Example 28: Suppose I wanted to see whether this trend differs by education level. I could simply add the option by(ed_level) to the graph bar command (in this case we will use graph hbar).

graph hbar ratio90_10, over(year) by(ed_level)

[Figure: Mean 90/10 Residual Wage Dispersion By Education Level and Year (male full-time, full-year workers). Horizontal bars for 1979, 1989, and 2007 in four panels: < High School, High School, Some College, College+. Horizontal axis: Mean 90/10 Residual Earnings Ratio, 0 to 4. Graphs by ed_level.]


• Box plots – an innovative way to graphically display 5 statistics
    o Minimum value
    o Maximum value
    o 25th percentile (p25)
    o 50th percentile (median or p50)
    o 75th percentile (p75)
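The five statistics listed above can also be read off numerically from the results stored by sum with the detail option. A minimal sketch, assuming the built-in nlsw88 data in place of the CPS ORG file:

```stata
* Assumption: built-in nlsw88 data stand in for the CPS ORG file
sysuse nlsw88, clear

quietly summarize wage, detail
display "min = " r(min)
display "p25 = " r(p25)
display "p50 = " r(p50)
display "p75 = " r(p75)
display "max = " r(max)
```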

Example 29: Generate a box plot of hourly wages in 2007 for men in the updated CPS ORG data.
graph box wage if year==2007, horizontal

[Figure: horizontal box plot. Axis: Hourly Wage in $2014, 0 to 100. Annotations mark the lowest value, p25, p50, p75, and p75+1.5(p75-p25).]
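The upper limit p75+1.5(p75-p25) marked in the figure can be computed by hand. In Stata's graph box, the whisker ends at the upper adjacent value, which is the largest observation that does not exceed this limit; points beyond it are plotted individually. The sketch below assumes the built-in nlsw88 data, not the CPS ORG file, and the scalar name is hypothetical.

```stata
* Assumption: built-in nlsw88 data stand in for the CPS ORG file
sysuse nlsw88, clear

quietly summarize wage, detail
scalar s_upper = r(p75) + 1.5*(r(p75) - r(p25))
display "p75 + 1.5*(p75-p25) = " scalar(s_upper)

* Observations beyond the limit are drawn as individual points
count if wage > scalar(s_upper) & !missing(wage)

* Upper adjacent value: largest observation not exceeding the limit
quietly summarize wage if wage <= scalar(s_upper)
display "upper adjacent value = " r(max)
```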


Example 30: Box plot for final exam scores in 2012

[Figure: box plot. Axis: Final Exam Score, 0 to 180.]


Wrapping it all up
• Example 31: The most commonly used survey-based health variable is self-reported health on a 5-point scale (1=poor, 2=fair, 3=good, 4=very good, 5=excellent). Suppose that you had data on self-reported health. What options would you have to summarize it?

This sort of variable is ordinal. With any sort of categorical data, simple means should not be used as descriptive statistics because the intervals between categories do not reflect fixed units of measurement. Rather, it is appropriate to show relative frequencies in a table, a bar graph, or a pie chart. The table, discrete histogram, and pie chart below show the relative frequencies associated with the 5 self-reported health categories in the first wave of the Health and Retirement Study (HRS).
sum i.rshlth

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      rshlth |
       Poor  |     99,867    .0823896    .2749587          0          1
       Fair  |     99,867    .1800194    .3842056          0          1
       Good  |     99,867    .3050758    .4604418          0          1
  Very Good  |     99,867    .2916179    .4545096          0          1
  Excellent  |     99,867    .1408974    .3479174          0          1

Showing the frequency distribution above is fine, but I could also make use of a bar graph or a pie chart.
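A one-way tabulate is another way to get the frequency, relative frequency, and cumulative frequency in a single table. The sketch below is only an illustration: it assumes the built-in auto data, with the five-category variable rep78 standing in for rshlth, and the rep_ prefix is a made-up name.

```stata
* Assumption: built-in auto data; rep78 (coded 1-5) stands in for rshlth
sysuse auto, clear

* Frequency, percent, and cumulative percent in one table
tabulate rep78

* One 0/1 indicator per category; their means are the relative frequencies
tabulate rep78, generate(rep_)
summarize rep_*
```

The means of the indicators reproduce the proportions in the tabulate output, which is the same idea as summarizing the categories of rshlth one at a time.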

http://hrsonline.isr.umich.edu/



hist rshlth, discrete percent /*
*/ xlabel(1 2 3 4 5, valuelabel) /*
*/ fcolor(black) graphregion(color(white)) /*
*/ gap(40) title("Distribution of Self-Reported Health") /*
*/ subtitle("(HRS Respondents, 1992)")

[Figure: Distribution of Self-Reported Health (HRS Respondents, 1992). Discrete histogram. Vertical axis: Percent, 0 to 30; horizontal axis: Self Reported Health (Poor, Fair, Good, Very Good, Excellent).]


graph pie, over(rshlth) legend(off) /*
*/ plabel(_all name, color(white)) plabel(_all percent) /*
*/ title("Distribution of Self-Reported Health", color(black)) /*
*/ subtitle("HRS Respondents, 1992")

[Figure: Distribution of Self-Reported Health (HRS Respondents, 1992). Pie chart: Poor 8.239%, Fair 18%, Good 30.51%, Very Good 29.16%, Excellent 14.09%.]


• Example 32: Questionable or improper use of descriptive statistics. What’s wrong with this picture? (Rector, R. and R. Sheffield (2011). Air Conditioning, Cable TV, and Xbox: What is Poverty in the United States Today, Heritage Foundation, Washington, DC.)


• Example 33: Construct a simple set of descriptive statistics for the updated CPS ORG data. If you just wanted to provide a basic set of descriptive statistics for the CPS ORG data, without telling a particular story, a simple table like the one below might be sufficient.

Sample Means/Medians and Standard Deviations (N=410,911)

                                Mean/Median   Standard deviation
Hourly wage (mean)                 21.858          12.406
Hourly wage (median)               18.912
Age                                38.327          12.004
Sex
  Male (=1)                         0.577           0.494
  Female (=1)                       0.423           0.494
Race
  White, non-Hispanic (=1)          0.787           0.410
  Black, non-Hispanic (=1)          0.085           0.293
  Hispanic (=1)                     0.077           0.266
  Other, non-Hispanic (=1)          0.041           0.199
Education Level
  Less than HS (=1)                 0.127           0.333
  High school (=1)                  0.377           0.484
  Some college (=1)                 0.226           0.418
  College+ (=1)                     0.270           0.443
Year
  1979 (=1)                         0.335           0.472
  1989 (=1)                         0.341           0.474
  2007 (=1)                         0.324           0.469
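One way the numbers for a table like this might be generated is with tabstat, using 0/1 indicators for the categorical variables. This is a sketch under assumptions, not the code behind the table above: it uses the built-in nlsw88 data in place of the CPS ORG file, and the race_ prefix is a made-up name.

```stata
* Assumption: built-in nlsw88 data stand in for the CPS ORG file
sysuse nlsw88, clear

* Mean, median, and standard deviation for the interval variables
tabstat wage age, statistics(mean p50 sd n) columns(statistics)

* 0/1 indicators for a categorical variable; their means are sample shares
tabulate race, generate(race_)
tabstat race_*, statistics(mean sd) columns(statistics)
```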
