MEASURES OF CENTRALITY

Last lecture summary
• Which graphs did we meet?
  • scatter plot (bodový graf)
  • bar chart (sloupcový graf)
  • histogram
  • pie chart (koláčový graf)
• How do they work, what are their advantages and/or disadvantages?

SDA women – histogram of heights 2014

n = 48 or N = 48

bin size = 3.8

Distributions

negatively skewed (skewed to the left)

positively skewed (skewed to the right)

http://turnthewheel.org/free-textbooks/street-smart-stats/

e.g., life expectancy (negatively skewed); e.g., body height (roughly symmetric); e.g., income (positively skewed)

STATISTICS IS BEAUTIFUL

new stuff

Life expectancy data
• Watch TED talk by Hans Rosling, Gapminder Foundation:

http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html

STATISTICS IS DEEP

UC Berkeley

Though the data are fake, the paradox is the same

Simpson’s paradox

www.udacity.com – Introduction to statistics

Male

          Applied  Admitted  Rate [%]
MAJOR A   900      450       50
MAJOR B   100      10        10

Female

          Applied  Admitted  Rate [%]
MAJOR A   100      80        80
MAJOR B   900      180       20

Gender bias

What do you think, is there a gender bias?

Who do you think is favored? Male or female?

Gender bias

Male

          Applied  Admitted  Rate [%]
MAJOR A   900      450       50
MAJOR B   100      10        10
Both      1000     460       46

Female

          Applied  Admitted  Rate [%]
MAJOR A   100      80        80
MAJOR B   900      180       20
Both      1000     260       26

Gender bias

          Male rate [%]  Female rate [%]
MAJOR A   50             80
MAJOR B   10             20
Both      46             26
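The pooled rates can be checked with a few lines of Python. This is only a minimal sketch using the counts from the tables above; the names `admissions`, `rate`, and `rates_by_group` are made up for illustration.

```python
# (applied, admitted) per major, copied from the tables above.
admissions = {
    "male": {"MAJOR A": (900, 450), "MAJOR B": (100, 10)},
    "female": {"MAJOR A": (100, 80), "MAJOR B": (900, 180)},
}


def rate(applied, admitted):
    """Admission rate in percent."""
    return 100 * admitted / applied


def rates_by_group(data):
    """Per-major rates plus the pooled ("Both") rate for each group."""
    result = {}
    for group, majors in data.items():
        per_major = {major: rate(*counts) for major, counts in majors.items()}
        total_applied = sum(applied for applied, _ in majors.values())
        total_admitted = sum(admitted for _, admitted in majors.values())
        per_major["Both"] = rate(total_applied, total_admitted)
        result[group] = per_major
    return result


rates = rates_by_group(admissions)
for group, per_major in rates.items():
    print(group, per_major)
```

Females have the higher rate in each major, yet the lower pooled rate, because most of them applied to the major that admits few applicants.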

Statistics is ambiguous
• This example illustrates how ambiguous statistics can be.
• How you choose to graph your data can strongly influence what people believe to be the case.

“I never believe in statistics I didn’t doctor myself.”
“Nikdy nevěřím statistice, kterou si sám nezfalšuji.”

Who said that?

Winston Churchill

What is statistics?
• Statistics – the science of collecting, organizing, summarizing, analyzing and interpreting data
• Goal – use imperfect information (our data) to infer facts, make predictions, and make decisions
• Descriptive statistics – describing and summarizing data with numbers or pictures
• Inferential statistics – making conclusions or decisions based on data

Variables
• variable – a value or characteristic that can vary from individual to individual
  • example: favorite color, age
• How are variables classified?
• quantitative variable – numerical values, often with units of measurement, arising from a "how much/how many" question; example: age, annual income, number of children
  • continuous (spojitá proměnná), example: height, weight
  • discrete (diskrétní proměnná), example: number of children
• continuous variables can be discretized

Variables
• categorical (qualitative) variables
  • categories that have no particular order
  • example: favorite color, gender, nationality
• ordinal
  • they are not numerical but their values have a natural order
  • example: temperature low/medium/high

Variables

variable (proměnná)
  quantitative (kvantitativní)
    continuous (spojitá)
    discrete (diskrétní)
  categorical (kategorická)
    ordinal (ordinální)

Choosing a profession

Chemistry: 50 000 – 60 000
Geography: 40 000 – 55 000

www.udacity.com – Statistics

Choosing a profession
• We made an interval estimate.
• But ideally we want one number that describes the entire dataset. This allows us to quickly summarize all our data.

Choosing a profession

1. The value at which frequency is highest.

2. The value where frequency is lowest.

3. Value in the middle.

4. Biggest value of x-axis.

5. Mean

[Figure: salary distributions for Chemistry and Geography]

Three big M’s

• The value at which frequency is highest is called the mode, i.e. the most common value is the mode.

• The value in the middle of the distribution is called the median.

• The mean is the mean (average is a synonym).

Quick quiz
• What is the mode in our data?

2 5 6 5 2 6 9 8 5 2 3 5
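One way to check the answer is to count how often each value occurs. The sketch below is illustrative only: the helper name `modes` and the color list are made up. It returns every value tied for the highest frequency, so it also works for multimodal and categorical data.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency, sorted."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)


quiz_data = [2, 5, 6, 5, 2, 6, 9, 8, 5, 2, 3, 5]
print(modes(quiz_data))  # [5]  (5 occurs four times)

# Hypothetical categorical data: the mode needs no arithmetic, only counting.
colors = ["red", "blue", "red", "green"]
print(modes(colors))  # ['red']
```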

Mode in negatively skewed distribution

Mode in uniform distribution

Multimodal distribution

Mode in categorical data

More on mode

True or False?

1. The mode can be used to describe any type of data we have, whether it’s numerical or categorical.

2. All scores in the dataset affect the mode.

3. If we take a lot of samples from the same population, the mode will be the same in each sample.

4. There is an equation for the mode.

• Ad 3.
  • http://onlinestatbook.com/stat_sim/sampling_dist/
  • http://www.shodor.org/interactivate/activities/Histogram/ - the mode changes as you change the bin size.

• Because 3. is not true, we can’t use mode to learn something about our population. Mode depends on how you present the data.
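The dependence on presentation can be seen by binning the same numbers with two different bin widths. This is a rough sketch on hypothetical integer data chosen only to make the effect visible; `modal_bins` is an illustrative helper, not part of any library.

```python
from collections import Counter


def modal_bins(values, width):
    """Return the lower edges of the most frequent bins of the given width."""
    counts = Counter((value // width) * width for value in values)
    top = max(counts.values())
    return sorted(edge for edge, count in counts.items() if count == top)


# Hypothetical data (assumption): not taken from the lecture.
data = [1, 1, 1, 4, 5, 5, 6, 6, 7]

print(modal_bins(data, 1))  # [1]  -> with bin width 1 the mode is the value 1
print(modal_bins(data, 4))  # [4]  -> with bin width 4 the modal bin is [4, 8)
```

With the wider bins the most frequent bin no longer contains the value that was the mode with narrow bins.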

Life expectancy data

www.coursera.org – Statistics: Making Sense of Data

Minimum

Sierra Leone

minimum = 47.8

Maximum

Japan

maximum = 83.4

Life expectancy data

all countries

Life expectancy data

Countries ranked from 1 to 197 by life expectancy: the middle country (rank 99) is Egypt.

median = 73.2 (half of the countries have a larger value, half a smaller one)

Life expectancy data

Minimum = 47.8

Maximum = 83.4

Median = 73.2

Q1

Ranks 1 to 197: the country ¼ of the way up (rank 50) is Sao Tomé & Príncipe.

1st quartile = 64.7 (¼ of the countries have a smaller value, ¾ a larger one)

Q3

Ranks 1 to 197: the country ¾ of the way up (rank 148) is Netherlands Antilles.

3rd quartile = 76.7 (¾ of the countries have a smaller value, ¼ a larger one)

Life expectancy data

Minimum = 47.8

Maximum = 83.4

Median = 73.2

1st quartile = 64.7

3rd quartile = 76.7

Box Plot

[Figure: box plot marking the minimum, 1st quartile, median, 3rd quartile, and maximum]

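Modified box plot

IQR = interquartile range (3rd quartile − 1st quartile)

Whiskers extend at most 1.5 × IQR beyond the quartiles; points farther out are drawn individually as outliers.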
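The 1.5 × IQR rule can be sketched as follows. The data are hypothetical, and the quartiles use the "inclusive" convention of Python's `statistics.quantiles`, which is an assumption here; other quartile conventions shift the fences slightly.

```python
import statistics


def iqr_outliers(values):
    """Return (lower fence, upper fence, outliers) under the 1.5 x IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers


# Hypothetical scores (assumption), with one value far from the rest.
scores = [68, 74, 78, 84, 90, 93, 150]
lower, upper, outliers = iqr_outliers(scores)
print(lower, upper, outliers)  # 52.75 114.75 [150]
```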

Quartiles, median – how to do it?

79, 68, 88, 69, 90, 74, 87, 93, 76

Find min, max, median, Q1, Q3 in these data. Then, draw the box plot.

Another example

78, 93, 68, 84, 90, 74

Min.    1st Qu.  Median  3rd Qu.  Max.
68.00   75.00    81.00   88.50    93.00
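The summary above can be reproduced in Python. This sketch assumes the "inclusive" quartile convention of `statistics.quantiles` (Python 3.8+), which reproduces the values in the table; other conventions give slightly different quartiles. The helper name `five_number_summary` is illustrative.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


second_example = [78, 93, 68, 84, 90, 74]
print(five_number_summary(second_example))  # (68, 75.0, 81.0, 88.5, 93)

# The nine values from the previous exercise, under the same convention.
first_example = [79, 68, 88, 69, 90, 74, 87, 93, 76]
print(five_number_summary(first_example))  # (68, 74.0, 79.0, 88.0, 93)
```

If the quartiles are instead taken as medians of the lower and upper halves (excluding the overall median), the nine-value exercise gives Q1 = 71.5 and Q3 = 89.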

Percentiles

[Figure: growth chart with percentile curves; x-axis: age [years]]

http://www.rustovyhormon.cz/on-line-rustove-grafy

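3rd M – Mean
• Another measure of the center of the data: mean (average)
• Data values: $x_1, x_2, \ldots, x_n$
• Mathematical notation: $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• $\Sigma$ … Greek letter capital sigma, means SUM in mathematics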

Salaries of 25 players of the American soccer team New York Red Bulls in 2012.

33 750

33 750

33 750

33 750

44 000

44 000

44 000

44 000

45 566

65 000

95 000

103 500

112 495

138 188

141 666

181 500

185 000

190 000

194 375

195 000

205 000

292 500

301 999

4 600 000

5 600 000

median = 112 495

mean = 518 311

Mean is not a robust statistic.

Median is a robust statistic.
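A quick way to see the difference in robustness is to recompute both statistics without the two largest salaries. The following is a small sketch on the salary list above; the variable names are illustrative.

```python
import statistics

# The 25 salaries listed above.
salaries = (
    [33_750] * 4
    + [44_000] * 4
    + [45_566, 65_000, 95_000, 103_500, 112_495, 138_188, 141_666,
       181_500, 185_000, 190_000, 194_375, 195_000, 205_000,
       292_500, 301_999, 4_600_000, 5_600_000]
)

# Same data without the two highest-paid players.
without_top_two = sorted(salaries)[:-2]

print(statistics.mean(salaries), statistics.median(salaries))
print(statistics.mean(without_top_two), statistics.median(without_top_two))
```

Dropping two values out of 25 moves the mean from about 518 000 to about 120 000, while the median only moves from 112 495 to 103 500.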

Robust statistic

10% trimmed mean … eliminate upper and lower 10% of data

Trimmed mean is more robust.

Trimmed mean

With 25 salaries, 10% of the data is 2.5 values, rounded down to 2: drop the two lowest salaries (33 750 and 33 750) and the two highest (4 600 000 and 5 600 000), then average the remaining 21.

median = 112 495

mean = 518 311

10% trimmed mean = 128 109
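The trimmed mean can be sketched in a few lines. This assumes the number of values cut from each end is rounded down, which reproduces the figure above; `trimmed_mean` is an illustrative helper, not a library function.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end.

    The number of dropped values per end is rounded down (assumption).
    """
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # values removed from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# The 25 salaries listed above.
salaries = (
    [33_750] * 4
    + [44_000] * 4
    + [45_566, 65_000, 95_000, 103_500, 112_495, 138_188, 141_666,
       181_500, 185_000, 190_000, 194_375, 195_000, 205_000,
       292_500, 301_999, 4_600_000, 5_600_000]
)

print(trimmed_mean(salaries, 0.10))  # 128109.0
```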