Basic Statistics
by Anna Alberts and Michael Peters
Funded by: H2020 EU research and innovation programme
Statistics is the science of collecting, organising, analysing, and presenting data
Overview of Today
● Statistics gone wrong
● Basic Statistical Measures
● Correlation, Regression, Causation and some trends
Break
● Hands-on: work out basic measures on example data and your own dataset
● Wrap up
Bad Numbers...
Common Sense?!
Descriptive Statistics
What is in my dataset?
● Measures of Central Tendency
○ What is the typical value, on average?
● Measures of Spread
○ How is the data spread out?
Measures of Central Tendency
● Mean: the average
○ The sum of all data points divided by the number of data points
● Median: the value in the middle
○ Sort all N values; the median is the value at the middle position (for an even N, the average of the two middle values)
● Mode: the most frequent value
○ The data point that appears most often
○ Very useful with qualitative data or rankings
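The three measures above can be tried out in a few lines of Python. This is a minimal sketch using the standard `statistics` module; the list `values` is made-up example data, not taken from the slides.

```python
from statistics import mean, median, mode

# Hypothetical example data (made up for illustration).
values = [2, 3, 3, 5, 7, 8, 8, 8, 9, 10]

average = mean(values)       # sum of all values divided by how many there are
middle = median(values)      # middle of the sorted values (N is even here,
                             # so it is the average of the two middle values)
most_common = mode(values)   # the value that appears most often

print("mean:", average)
print("median:", middle)
print("mode:", most_common)
```

Note how the three measures can differ on the same data: here the mode sits above both the mean and the median.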
Measures of Spread
● Range: from minimum to maximum (maximum minus minimum)
● Variance: the average of the squared differences from the mean
● Standard deviation:
○ Measures what is “normal” or expected: how far values typically lie from the mean
○ Square root of the variance
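A small sketch of these spread measures, again with Python's standard `statistics` module. The data is made up for illustration. The slide defines variance as the plain average of squared differences, which corresponds to the "population" functions `pvariance` and `pstdev`; the "sample" variant `variance` divides by n - 1 instead and is shown only for comparison.

```python
from statistics import pstdev, pvariance, variance

# Hypothetical example data (made up for illustration).
values = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(values) - min(values)  # maximum minus minimum
pop_variance = pvariance(values)         # average squared difference from the mean
pop_std = pstdev(values)                 # square root of that variance
sample_variance = variance(values)       # divides by n - 1 instead of n

print("range:", value_range)
print("variance (population):", pop_variance)
print("standard deviation (population):", pop_std)
print("variance (sample):", round(sample_variance, 3))
```

Which variant is appropriate depends on whether the data is the whole population or only a sample drawn from it.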
Exercise - Real Life Statistics
- Write down your height
- Line up from tallest to shortest: range
- Now find the others with the same height and stand behind each other
- “Normal distribution”
- Who is an outlier?
Sheet for illustration!
Normal Distribution
Central limit theorem: averages of independent random variables tend toward a normal distribution as the number of observations grows
Y: Frequency
X: Value
Examples: blood pressure, human height, IQ scores, test scores
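The central limit theorem can be checked informally with a simulation. The sketch below is an illustration under assumptions, not part of the original slides: it averages 30 simulated dice rolls, repeats that 2000 times, and looks at how the averages are distributed. The numbers 30 and 2000 and the seed are arbitrary choices.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the sketch gives the same result each run

def average_of_dice(n_dice):
    """Average of n_dice simulated rolls of a fair six-sided die."""
    rolls = [random.randint(1, 6) for _ in range(n_dice)]
    return sum(rolls) / n_dice

# Each entry is one "observation": the average of 30 dice.
averages = [average_of_dice(30) for _ in range(2000)]

center = mean(averages)
spread = pstdev(averages)

# For a normal distribution, roughly two thirds of the values
# lie within one standard deviation of the mean.
within_one_sd = sum(1 for a in averages if abs(a - center) <= spread)
share_within_one_sd = within_one_sd / len(averages)

print("center of the averages:", round(center, 2))
print("spread of the averages:", round(spread, 2))
print("share within one standard deviation:", round(share_within_one_sd, 2))
```

A single die is uniformly distributed, not bell-shaped, yet the averages cluster around 3.5 in a bell shape. Plotting `averages` as a histogram (X: value, Y: frequency) makes this visible.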
Correlation
Mutual relation of two or more things (dictionary.com)
● Two variables x and y
● Y = dependent variable (to be explained)
● X = independent variable (explaining Y)
○ Y = Shoe size
○ X = Length of foot (in cm)
● How are they related?
Average Income per Age
● Y = average monthly net income in Euro
● This is what “classic” statistical software produces (STATA)
Regression
Statistical process for estimating relationships among variables.
● “Trying to make sense of data”
● Fitting a line into the “cloud” of data
● Looking for possible causal relationships
Correlation coefficients: quantify the relationship between two or more random variables
● Range between -1 and 1
● http://guessthecorrelation.com/
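To make "fitting a line into the cloud" concrete, here is a minimal sketch of an ordinary least-squares line and the Pearson correlation coefficient, written out by hand. The foot lengths and shoe sizes are invented for illustration only, and the function name `fit_line_and_correlation` is hypothetical.

```python
from math import sqrt

def fit_line_and_correlation(xs, ys):
    """Return (slope, intercept, r) for the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)  # Pearson correlation, always between -1 and 1
    return slope, intercept, r

# Made-up data for illustration only.
foot_length_cm = [24, 25, 26, 27, 28]  # X: independent variable
shoe_size = [38, 40, 41, 40, 41]       # Y: dependent variable

slope, intercept, r = fit_line_and_correlation(foot_length_cm, shoe_size)
print("slope:", round(slope, 2))
print("intercept:", round(intercept, 2))
print("correlation coefficient:", round(r, 2))
```

The slope says how much Y changes per unit of X; the correlation coefficient says how tightly the points follow the line. Neither says anything about why.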
Inferences from Data
“Correlation does not imply causation.”
● Stork population and human birth rates correlate
● Expected relationship?
● Underlying factor: stork population and birth rates are both higher in the countryside
● Control variables → use your common sense!
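The stork example can be sketched with a toy dataset. All numbers below are invented purely for illustration and are not real stork or birth statistics. The idea: within urban areas and within rural areas, storks and births are unrelated, but rural areas have more of both, so the combined data shows a strong correlation.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient, always between -1 and 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up numbers for illustration only, NOT real data.
urban_storks, urban_births = [1, 2, 3, 4], [12, 10, 13, 11]
rural_storks, rural_births = [11, 12, 13, 14], [22, 20, 23, 21]

# Ignoring the control variable (urban vs. rural): everything in one pot.
r_all = pearson_r(urban_storks + rural_storks, urban_births + rural_births)

# Controlling for it: look at each kind of area separately.
r_urban = pearson_r(urban_storks, urban_births)
r_rural = pearson_r(rural_storks, rural_births)

print("all areas together:", round(r_all, 2))
print("urban only:", round(r_urban, 2))
print("rural only:", round(r_rural, 2))
```

In this toy data the overall coefficient is about 0.95, while within each group it is 0. Splitting the data by a suspected underlying factor is one simple way to "control" for it.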
Causation
Inferring a “causal” dependence of different variables
Think medicine:
● Falling from a tree and breaking your arm
○ Clear causal effect
● Claim that variable x determines outcome y
○ Foot length determines shoe size
○ Storks deliver the babies
● Can you control for other factors?
● Does the relationship hold up against reasonable doubt?
Causation?
Exercise
Excel or Google Spreadsheets
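● Find the mean: =AVERAGE(Value 1:Value 10)
● Find the median: =MEDIAN(Value 1:Value 10)
● Find the mode: =MODE(Value 1:Value 10)
● Variance: =VAR(Value 1:Value 10)
● Standard deviation: =STDEVP(Value 1:Value 10)
○ VAR and STDEV treat the data as a sample; VARP and STDEVP treat it as the whole population
○ “Value 1:Value 10” stands for the cell range holding your data, e.g. A1:A10
● Try out different plots: “Insert, Chart”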
http://bit.ly/2fVrepJ - World Bank Unemployment Data
http://bit.ly/2fUbbKM - World Bank GDP Growth
Wrap Up
Basic Statistical Concepts:
● Measures of Central Tendency and Measures of Spread
● In other words: what is in my dataset?
Correlation, Causation, Regression
http://tylervigen.com/spurious-correlations
Statistics Cheat Sheet
● High correlation does not imply causation
○ But there is often a reason for the correlation (storks, babies, rural areas)
● Do not trust small samples!
○ Population vs. sample size
○ Statistical studies with fewer than 100 observations are dubious; better n > 1000
■ For social science surveys, sample sizes n > 90
○ Law of large numbers → with more observations the sample average converges to the true average (outliers have little effect)
● Control factors
○ Was something obvious left out? Think critically
● Wording
○ Genuine scientists do not use words such as “definitely, absolutely…”
● Visualization
○ X and Y axes: do they start at 0?
○ Misleading presentation: ask yourself, do I understand the graph?
○ If something does not add up, it might actually be wrong!
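Why small samples deserve less trust can be shown with a quick simulation. This is a sketch under assumptions: the "population" is 20,000 simulated heights (mean 170 cm, standard deviation 10 cm, both invented), and the helper `spread_of_sample_means` is a hypothetical name. It draws many samples of a given size and measures how much their averages disagree with each other.

```python
import random
from statistics import pstdev

random.seed(1)  # fixed seed so the sketch is repeatable

# Simulated population (made-up parameters, for illustration only).
population = [random.gauss(170, 10) for _ in range(20000)]

def spread_of_sample_means(sample_size, repeats=200):
    """Draw `repeats` random samples and return the standard deviation of their means."""
    means = []
    for _ in range(repeats):
        sample = random.sample(population, sample_size)
        means.append(sum(sample) / sample_size)
    return pstdev(means)

spread_small = spread_of_sample_means(10)    # many studies with n = 10
spread_large = spread_of_sample_means(1000)  # many studies with n = 1000

print("spread of averages, n = 10:", round(spread_small, 2))
print("spread of averages, n = 1000:", round(spread_large, 2))
```

Averages from the small samples scatter far more widely than averages from the large ones, so a single small study can land a long way from the true value by chance alone.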
Symbolism
x usually stands for the variable
n = the number of observations
x₄ = the value of variable x for observation number 4
Squared: x²
Square root: √
∑ : sigma = the sum over all the observations
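Put together, these symbols give the usual formulas for the measures from earlier in the deck. The block below is a sketch in LaTeX notation; the symbols x̄ (mean) and σ (standard deviation) are conventional choices not introduced on the slides, and the variance is written as the plain average of squared differences, matching the definition given under Measures of Spread.

```latex
% Mean: the sum of all observations divided by their number
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

% Variance: the average of the squared differences from the mean
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2

% Standard deviation: the square root of the variance
\sigma = \sqrt{\sigma^2}
```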
Thanks!
Find us on [email protected]
@anna_alberts @miguelitoj89
@okfde