http://www.summerschool.websci.net/ WebScience Summer School Southampton Data Science 2014
Dr. Claudia Wagner http://claudiawagner.info/
Web Science Summer School WS3, Southampton, UK, 21st July 2014
source: Twitter
Statistical computing is central, but data science is more than statistics
Activities of data scientists:
▪ collection and generation,
▪ preparation,
▪ analysis,
▪ visualization,
▪ management and preservation of large collections of data
Jeffrey Stanton, Introduction to Data Science, free e-book
Ask an interesting question
▪ Why is it important? Which number answers your question?
Get or generate the data
▪ Which data will help answer your question? How is the data generated? Are there any sampling biases? Ethical issues?
Analyze the data
▪ Are there any anomalies or regularities? Which hidden process has generated the data? Fit a model to the data and validate it
Visualize and communicate results
▪ What does 75% probability mean?
Preserve and share the data to make results reproducible
Data is a collection of facts
Facts can be numbers, words, measurements, observations or even just descriptions of things
Qualitative data (e.g., “it was great”)
Quantitative data
▪ Discrete (e.g., 5)
▪ Continuous (e.g., 3.723)
Scales of measurement:
▪ Nominal (e.g., ethnic group, sex, nationality): observations are only named
▪ Ordinal (e.g., status): observations can be ordered
▪ Interval (e.g., temperature in Celsius): distance is meaningful
▪ Ratio (e.g., weight): absolute zero
Stevens, S. S. (1946). "On the Theory of Scales of Measurement". Science 103 (2684): 677–680.
Random sample of Twitter users
▪ If we take a random sample of tweets from the public timeline, more active users are more likely to be included
Friendship Paradox
▪ Select a random sample of people and ask them to list the people they know. Contact a sample of the listed friends and repeat the survey.
▪ Sampling bias: people with more friends are more likely to show up in the friend lists that we generate at the first stage
A study found that the profession with the lowest average age of death was student.
▪ Being a student does not cause you to die at an early age.
▪ Being a student means you are young. This is what makes the average age of those who die so low.
The amount of ice cream consumed per day is highly correlated with the number of drownings per day
▪ Both variables are correlated with the daily temperature
"Teaching Statistics: A Bag of Tricks," by Gelman and Nolan (2002)
A study found that only 1.5% of drivers in accidents reported that they were using a cell phone, whereas 10.9% reported that they were distracted by another occupant in the car.
Can we conclude that using a cell phone is safer than speaking with another occupant?
▪ P(cellphone | accident) != P(accident | cellphone)
▪ We need to compare P(accident | cellphone) and P(accident | occupant)
▪ For that we need to know the prevalence of cell phone use
▪ It is likely that many more people talk to another occupant in the car while driving than talk on a cell phone
Jessica Utts, What Educated Citizens Should Know about Statistics and Probability, The American Statistician, Vol. 57, No. 2 (May, 2003), pp. 74-79
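The inversion of the conditional probabilities can be sketched with Bayes' rule. The two percentages come from the slide; the base accident rate and the two prevalence values are hypothetical numbers picked only to illustrate the point, and they are not taken from the study.

```python
def p_accident_given(p_behaviour_given_accident, p_accident, p_behaviour):
    """Bayes' rule: P(accident | behaviour)."""
    return p_behaviour_given_accident * p_accident / p_behaviour

# From the slide: share of drivers in accidents who reported each behaviour.
P_CELL_GIVEN_ACCIDENT = 0.015
P_OCCUPANT_GIVEN_ACCIDENT = 0.109

# Hypothetical values chosen only for illustration, not taken from the study.
P_ACCIDENT = 0.001   # assumed base rate of an accident per trip
P_CELL = 0.02        # assumed share of trips with cell phone use
P_OCCUPANT = 0.30    # assumed share of trips with a talking occupant

risk_cell = p_accident_given(P_CELL_GIVEN_ACCIDENT, P_ACCIDENT, P_CELL)
risk_occupant = p_accident_given(P_OCCUPANT_GIVEN_ACCIDENT, P_ACCIDENT, P_OCCUPANT)

print(f"P(accident | cellphone) = {risk_cell:.6f}")
print(f"P(accident | occupant)  = {risk_occupant:.6f}")
```

Under these assumed prevalences the cell phone risk comes out higher even though fewer accidents involve a phone; with different prevalences the ordering could flip, which is exactly why the prevalence is needed.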
Ecological Fallacy
Illiteracy rate in each US state and the proportion of immigrants per state
▪ Negative correlation of −0.53
▪ The greater the proportion of immigrants in a state, the lower its average illiteracy.
When individuals are considered, the correlation is +0.12: immigrants were on average more illiterate than native citizens.
Robinson, W.S. (1950). "Ecological Correlations and the Behavior of Individuals". American Sociological Review 15 (3): 351–357.
Data Collection Data Preprocessing Data Analysis Data Visualization Data Preservation
Found data or observational data
▪ Are observational data enough?
▪ Are such data available?
Generate data
▪ Design the data generation process
▪ E.g., via surveys, experiments, crowdsourcing
http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html
Two general types of traces:
▪ Accretion: a build-up of physical traces
▪ Erosion: the wearing away of material
Webb, Eugene J. et al. Unobtrusive Measures: nonreactive research in the social sciences. Chicago: Rand McNally, 1966
Bulk downloads
▪ Wikipedia, IMDB, Million Song Dataset, etc.
API access
▪ NY Times, Twitter, Facebook, Foursquare, etc.
Web scraping
▪ Tools, e.g., http://scrapy.org/
▪ What data is OK to scrape?
▪ Public, non-sensitive, anonymized, fully referenced information. Check the terms and conditions!
Takes time to accumulate
Conservative estimate
Only what happened counts! Intentions, motivations or internal states don’t count.
Inferentially weak
Cannot answer “what-if” questions
Surveys
Simulations
▪ Model behavior of users/agents on a micro-level
▪ Simulate what happens under different conditions
▪ Empirical validation
Experiments
▪ Keep all variables constant and only manipulate one variable (e.g., emotions)
Simulations
▪ Study of macro-phenomena
▪ Difficult to validate empirically
Surveys and/or Experiments
▪ We only get data from those who are accessible and willing to respond or participate
▪ Responders provide answers that are in line with their self-image and the researcher’s expectations
▪ Hawthorne effect, etc.
Data cleaning
Fill in missing values
Smooth noisy data
Identify or remove outliers
Resolve inconsistencies
Data integration
Integration of multiple databases, or files
Data transformation
▪ Normalization: values are scaled to fall within a small, specified range
▪ Standardization: how many standard deviations each data point lies from the mean
▪ Discretization: divide the range of a continuous attribute into intervals, since some algorithms require discrete attributes
Data reduction
▪ Dimensionality reduction (remove unimportant attributes via feature selection, or group features into factors, e.g., PCA, SVD)
▪ Aggregation and clustering
▪ Sampling
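The three transformations can be sketched in a few lines of plain Python. The function names and the `visits` data are made up for illustration, and equal-width binning is just one of several possible discretization schemes.

```python
from statistics import mean, pstdev

def min_max_normalize(values, low=0.0, high=1.0):
    """Rescale values linearly so that they fall within [low, high]."""
    lo, hi = min(values), max(values)
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

def standardize(values):
    """Express each value as the number of standard deviations from the mean."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def discretize(values, n_bins):
    """Equal-width binning: map each value to a bin index 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical attribute: bar visits per month for eight users.
visits = [2, 4, 4, 4, 5, 5, 7, 9]

print(min_max_normalize(visits))
print(standardize(visits))
print(discretize(visits, 3))
```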
Data Collection Data Preprocessing Data Mining Data Analysis Statistical Inference Data Visualization Machine Learning Data Preservation
Problem:
▪ Given a high-dimensional space (e.g., fb users who are described via various attributes such as the locations they visited)
▪ Find pairs of data points (x, y) that are within some distance threshold: d(x, y) ≤ s
We first need to decide what “distance” means
Distance Measures
Jaccard similarity between two sets of items $I_1$, $I_2$:
$$\mathrm{sim}(I_1, I_2) = \frac{|I_1 \cap I_2|}{|I_1 \cup I_2|}$$
$$\mathrm{dist}(I_1, I_2) = 1 - \mathrm{sim}(I_1, I_2)$$
Euclidean distance, Hamming distance, Cosine similarity, etc.
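A minimal sketch of the measures named on this slide, using only the standard library. The function names and the example sets of visited locations are illustrative assumptions.

```python
import math

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1 - len(a & b) / len(union)

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    return sum(xi != yi for xi, yi in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero numeric vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

# Hypothetical users described by the locations they visited.
alice = {"bar", "cinema", "gym"}
bob = {"bar", "gym", "library", "park"}
print(jaccard_distance(alice, bob))  # 2 shared out of 5 distinct locations
```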
Goal: Given a set of items, group the items into some number of clusters, so that
▪ Members of a cluster are similar to each other
▪ Members of different clusters are dissimilar
Anand Rajaraman, Jeffrey Ullman, Jure Leskovec, Mining of Massive Datasets, Cambridge University Press
Non-hierarchical / point assignment:
Maintain a set of clusters
Each point belongs to the “nearest” cluster
Hierarchical:
Agglomerative (bottom up):
▪ Initially, each point is a cluster
▪ Repeatedly combine the two “nearest” clusters into one
Divisive (top down):
▪ Start with one cluster and recursively split it
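The agglomerative variant can be sketched as follows. This is a naive illustration that measures the distance between clusters by the distance between their centroids, which is only one possible definition of “nearest”; the points are made up.

```python
import math
from itertools import combinations

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def agglomerative(points, k):
    """Merge the two nearest clusters repeatedly until k clusters remain."""
    clusters = [[p] for p in points]  # initially, each point is a cluster
    while len(clusters) > k:
        # Find the pair of clusters whose centroids are closest (i < j).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(centroid(clusters[ij[0]]),
                                     centroid(clusters[ij[1]])),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 1.2), (1.2, 0.8), (8.0, 8.0), (8.5, 8.2), (9.0, 7.8)]
result = agglomerative(points, k=2)
for cluster in result:
    print(sorted(cluster))
```

Recomputing all pairwise distances in every round is expensive; it is fine for a handful of points but not for large datasets.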
Try different k, looking at the change in the average distance to the centroid as k increases
The average falls rapidly until the right k, then changes little
[Figure: average diameter plotted against k; the best k is where the curve flattens]
Aim: Find hidden concepts/groups in a matrix
Method: Singular Value Decomposition (SVD)
Leskovec et al., Mining of Massive Datasets, p. 418
Rank = 2
Rank denotes the information content of the matrix.
For instance, a rank-1 matrix can be written as the product of one column vector and one row vector
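A small illustration of the rank-1 statement: multiplying a column vector by a row vector (an outer product) gives a matrix in which every row is a multiple of the same row vector. The vectors are arbitrary example values.

```python
def outer(column, row):
    """Outer product: entry (i, j) is column[i] * row[j]."""
    return [[c * r for r in row] for c in column]

column = [1, 2, 3]
row = [4, 5]
matrix = outer(column, row)

for line in matrix:
    print(line)  # each line is column[i] times the row vector [4, 5]
```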
[Figure: SVD of a users × movies matrix into three factors: one relates users to concepts, one gives the strength of the concepts, one relates movies to concepts]
Estimate population parameters from sample statistics
Sampling distribution of a statistic:
▪ Draw a sample of size n from the population
▪ Compute the statistic on the sample
▪ Repeat this process
The mean of the sampling distribution is the expected value of the statistic
The SD of the sampling distribution is the standard error
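The procedure can be simulated when the population is known. The sketch below uses a made-up, skewed population (exponential with mean 50) and compares the SD of the simulated sample means with the textbook value σ/√n; all names and parameters are assumptions for illustration.

```python
import random
from statistics import mean, pstdev

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 20,000 values from an exponential distribution.
population = [rng.expovariate(1 / 50) for _ in range(20_000)]

def sampling_distribution(population, n, repeats):
    """Draw `repeats` samples of size n and return the mean of each."""
    return [mean(rng.sample(population, n)) for _ in range(repeats)]

n = 100
sample_means = sampling_distribution(population, n, repeats=2000)
standard_error = pstdev(sample_means)

print("population mean      :", mean(population))
print("mean of sample means :", mean(sample_means))
print("simulated SE         :", standard_error)
print("sigma / sqrt(n)      :", pstdev(population) / n ** 0.5)
```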
Some descriptive statistics such as mean or median are unbiased estimators of central tendency
Expected value of the statistic is the true population parameter
Expected value of dispersion in a sample is an underestimate of the true population value
True population size is N
Sample size n < N (e.g., n = 100)
Correction factor: $\frac{n}{n-1}$
▪ For n = 100 the correction factor is ≈ 1.01
▪ For n = 100,000 the correction factor is ≈ 1.00001
Estimated population variance (μ here denotes the sample mean):
$$\hat{\sigma}^2 = \frac{n}{n-1} \cdot \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$$
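The correction can be checked numerically with Python's `statistics` module, where `pvariance` divides by n and `variance` divides by n − 1. The four-value sample is only an example.

```python
from math import isclose
from statistics import pvariance, variance

sample = [2, 3, 8, 8]
n = len(sample)

uncorrected = pvariance(sample)           # sum of squared deviations / n
corrected = n / (n - 1) * uncorrected     # apply the n / (n - 1) factor

print(uncorrected, corrected, variance(sample))
print(100 / 99, 100_000 / 99_999)         # correction factors for n = 100 and n = 100,000
```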
Specify the range of values that has a high probability of containing the true population parameter
Confidence level 1 − α: the probability that the confidence interval contains the true population parameter
CI = sample statistic ± MOE
MOE = SE × critical value
$$\mathrm{MOE} = \frac{\sigma}{\sqrt{n}} \cdot z_{\alpha/2}$$
Critical value: how far away from the mean must a point lie in order to be considered “extreme” or “unexpected”?
n … sample size
σ … standard deviation
$z_{\alpha/2}$ … confidence coefficient
Area under the curve is 0.475. What’s the z-score?
Select 1000 fb users randomly
Average number of bar visits per year: $\bar{X} = 78$
Standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}} = 30$
Confidence level is 95%
▪ Divide 0.95 by 2 to get 0.475
▪ Look up 0.475 in the z table: z = 1.96
$$\mathrm{MOE} = \frac{\sigma}{\sqrt{n}} \cdot z_{\alpha/2} = \frac{30}{\sqrt{1000}} \cdot 1.96 \approx 1.86$$
78 ± 1.86, CI: [76.14 ; 79.86]
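The same calculation as a short sketch. Instead of reading the critical value from a z table, it is computed from the standard normal distribution, which gives z ≈ 1.96 for a 95% confidence level. The function name is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sd, n, confidence=0.95):
    """Return (low, high, z, moe) for a z-based confidence interval."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value z_{alpha/2}
    moe = z * sd / sqrt(n)                   # margin of error = critical value * SE
    return sample_mean - moe, sample_mean + moe, z, moe

low, high, z, moe = z_confidence_interval(sample_mean=78, sd=30, n=1000)
print(f"z = {z:.2f}, MOE = {moe:.2f}, CI = [{low:.2f} ; {high:.2f}]")
```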
An exact CI can only be computed when the sampling distribution and its SD (i.e., the SE) are known
Otherwise we have to estimate the standard error (SE)
▪ Bootstrap
Sampling with replacement
▪ The population is unknown
▪ But we observe one sample of size n = 4 from the population: {2, 3, 8, 8}
▪ We use this sample to generate a large number of bootstrap samples of size n:
  ▪ 8, 8, 8, 3
  ▪ 3, 3, 8, 2
  ▪ …
▪ Compute the statistic (e.g., mean) for each bootstrap sample
▪ Estimate the SE from the bootstrap distribution
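The bootstrap steps for the sample {2, 3, 8, 8} can be sketched as below. The number of bootstrap samples (10,000), the seed and the function name are arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev

def bootstrap_se(sample, statistic=mean, n_boot=10_000, seed=0):
    """Estimate the standard error of `statistic` by resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    # Each bootstrap sample has the same size as the observed sample.
    boot_stats = [statistic(rng.choices(sample, k=n)) for _ in range(n_boot)]
    return stdev(boot_stats), boot_stats

sample = [2, 3, 8, 8]
se, boot_stats = bootstrap_se(sample)
estimate = mean(sample)

print("sample mean  :", estimate)
print("bootstrap SE :", se)
print("approx 95% CI:", (estimate - 2 * se, estimate + 2 * se))
```

With only four observations the interval is very wide; the example is meant to show the mechanics, not to suggest that n = 4 is enough.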
[Figure: Population → Sample → many bootstrap samples → statistic calculated for each bootstrap sample → bootstrap distribution]
Standard Error (SE): SD of the bootstrap distribution
CI: statistic ± MOE, where MOE for a 95% CI = 2 * SE
Randomly selected sample of fb users
Have they ever checked in at a nightclub?
▪ Democrats: 100/1000 yes
▪ Republicans: 90/1000 yes
Do the nightlife preferences differ significantly across political parties?
Give a 95% CI for the difference in proportions
# 1 = has checked in at a nightclub, 0 = has not
dems   = rep( c(0,1), c(1000-100, 100) )
repubs = rep( c(0,1), c(1000-90, 90) )
mean(dems)    # 0.1
mean(repubs)  # 0.09
del.p = mean(dems) - mean(repubs)  # 0.01 (point estimate)

# bootstrap the difference in proportions
reps = replicate( 1000, {
  ds = sample( dems, 1000, replace=TRUE )
  rs = sample( repubs, 1000, replace=TRUE )
  mean( ds ) - mean( rs )
} )
SE = sd( reps )  # 0.0131
c( del.p - 2*SE, del.p + 2*SE )  # -0.0162 0.0362 (interval estimate)
H1: political party affects nightlife preferences
H0: political party does not affect nightlife preferences
Observed frequencies:
        Democrats   Republicans   Total
yes     100         90            190
no      900         910           1810
Proportion of users who visited nightclubs, no matter which party they belong to: 190/2000 = 0.095
If political affinities have no effect, we would expect the following frequencies:
        Democrats   Republicans   Total
yes     95          95            190
no      905         905           1810
$$\chi^2 = \sum \frac{(o - e)^2}{e} = 0.5815$$
DF = (number of rows – 1) x (number of columns – 1) = 1
The critical value of χ2 at 5% significance and 1 DF is 3.84
Our χ2 does not exceed the critical value
We cannot reject H0
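The expected frequencies and the test statistic can be reproduced with a few lines of Python. The function is an illustrative sketch for a plain contingency table; the critical value 3.84 is the one quoted on the slide.

```python
def chi_square(observed):
    """Return (chi2, df, expected) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count under H0: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df, expected

#            Democrats  Republicans
observed = [[100, 90],    # yes
            [900, 910]]   # no

chi2, df, expected = chi_square(observed)
CRITICAL_VALUE = 3.84  # chi-square, 5% significance, 1 DF (from the slide)

print(expected)
print(f"chi2 = {chi2:.4f}, df = {df}, reject H0: {chi2 > CRITICAL_VALUE}")
```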
If α = 0.05, then 95% of all values fall in this interval
Two-tailed test: the 2.5% of values in the upper tail and the 2.5% in the lower tail are considered so extreme that we reject H0 if we observe them
Test whether Democrats on fb have, on average, more than 60 bar visits per year
H1: µ > 60
H0: µ <= 60
Random sample of 20 Democratic fb users: {65 73 51 67 48 80 69 53 59 62 71 67 64 78 65 49 80 60 51 70}
Sample mean $\hat{\mu} = 64.1$
Assume we know the SD in the population: SD = 10
$$z = \frac{\hat{\mu} - \mu}{SE}, \qquad SE = \frac{SD}{\sqrt{n}}, \qquad z = \frac{64.1 - 60}{10 / \sqrt{20}} = 1.8336$$
Would we expect that? How extreme is this observation?
If H0 is true (mean <= 60), in which area around the mean do 95% of all points lie?
Pick an alpha level α = 0.05: the maximum probability of rejecting the null hypothesis when the null hypothesis is true
Right-tailed test: find the critical value for an area of 0.45 using the z-distribution
If the z-score of our observed data exceeds this value, we have to reject H0
1.8336 > 1.645: reject the null hypothesis
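A sketch of this right-tailed z-test in Python. The sixteenth observation is taken as 49, which is the reading of the sample that reproduces the stated mean of 64.1; the function name and the additional p-value are illustrative additions.

```python
from math import sqrt
from statistics import NormalDist, mean

visits = [65, 73, 51, 67, 48, 80, 69, 53, 59, 62,
          71, 67, 64, 78, 65, 49, 80, 60, 51, 70]

def right_tailed_z_test(sample, mu0, sd, alpha=0.05):
    """Return (z, critical value, p-value) for H1: mu > mu0 with known SD."""
    se = sd / sqrt(len(sample))
    z = (mean(sample) - mu0) / se
    critical = NormalDist().inv_cdf(1 - alpha)  # 5% of the area lies to the right
    p_value = 1 - NormalDist().cdf(z)
    return z, critical, p_value

z, critical, p_value = right_tailed_z_test(visits, mu0=60, sd=10)
print(f"z = {z:.4f}, critical = {critical:.3f}, p = {p_value:.4f}")
print("reject H0" if z > critical else "cannot reject H0")
```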
Large effects, small samples:
▪ In small samples it is easy to overestimate an effect that might have happened by chance
Small effects, large samples:
▪ The smaller the effect you want to measure, the larger the sample size you need to show that it is significant!
Example: Assume a coin is biased: 10% heads and 90% tails
▪ Tossing the coin 10 times should be enough to convince people that the coin is biased.
Example: Assume a coin is biased: 51% heads and 49% tails
▪ The minimum sample size increases as the effect size that one wants to demonstrate decreases
The more we analyze, the more we find by chance!
If you calculate the correlations between 10 variables (i.e., 45 different correlation coefficients), you should expect about 2 correlations to be significant with p < 0.05 by chance (one in every 20)
Corrections or adjustments for the total number of comparisons are needed!
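The arithmetic behind this warning can be sketched as follows: 10 variables give C(10, 2) = 45 pairwise correlations, so about 45 × 0.05 ≈ 2 are expected to look significant by chance. The Bonferroni adjustment shown here is one common correction and is an addition that the slide does not name; the family-wise error rate assumes independent tests.

```python
from math import comb

def pairwise_tests(n_variables):
    """Number of distinct pairs, i.e. correlation coefficients."""
    return comb(n_variables, 2)

def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of tests significant by chance when every H0 is true."""
    return n_tests * alpha

def family_wise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive), assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests

n_tests = pairwise_tests(10)
print(n_tests, expected_false_positives(n_tests))
print(family_wise_error_rate(n_tests), bonferroni_alpha(n_tests))
```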
Many tests such as the z-test, t-test and ANOVA make the normality assumption.
If the true population is very skewed (e.g., power law), the sampling distribution of the statistic will not be normal
Nonparametric methods like the sign test use, e.g., the median rather than the mean
▪ Hypothesis about the median of the true population (e.g., H1: median < 100, H0: median = 100)
▪ Count the number of measurements that favor the null hypothesis
▪ If H0 is true, half of the measurements should fall on each side.
Aim
▪ Find a function that describes the relation between X (e.g., bar visits) and Y (e.g., new friends)
▪ Given X, predict Y
Problem
▪ Infinite number of ways X and Y could be related
Idea
▪ Reduce the space of possible functions and start with the simplest one (linear relation)
$$Y = b_0 + b_1 X$$
[Figure: the line Y = 2 + 0.5 X, with X from 0 to 8 and Y from 0 to 6]
Use gradient descent to minimize the cost function $C(b_0, b_1)$
$$C(b_0, b_1) = \frac{1}{2N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2$$
$$C(b_0, b_1) = \frac{1}{2N} \sum_{i=1}^{N} (Y_i - b_0 - b_1 X_i)^2$$
Start with some guess for $b_0$, $b_1$
Keep changing $b_0$, $b_1$ to reduce $C(b_0, b_1)$ until we hopefully end up at a minimum
$$b_0 := b_0 - \alpha \frac{\partial}{\partial b_0} C(b_0, b_1)$$
$$b_1 := b_1 - \alpha \frac{\partial}{\partial b_1} C(b_0, b_1)$$
Simultaneous updates of $b_0$ and $b_1$
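A minimal sketch of these updates for the squared-error cost with the 1/(2N) factor. The learning rate, the number of steps and the noise-free data generated from Y = 2 + 0.5 X are assumptions chosen so that the example converges.

```python
def gradient_descent(xs, ys, alpha=0.05, steps=5000):
    """Fit Y = b0 + b1 * X by gradient descent on C = 1/(2N) * sum of squared errors."""
    b0, b1 = 0.0, 0.0  # initial guess
    n = len(xs)
    for _ in range(steps):
        errors = [b0 + b1 * x - y for x, y in zip(xs, ys)]
        grad_b0 = sum(errors) / n                              # dC/db0
        grad_b1 = sum(e * x for e, x in zip(errors, xs)) / n   # dC/db1
        # Simultaneous update: both gradients use the old b0 and b1.
        b0, b1 = b0 - alpha * grad_b0, b1 - alpha * grad_b1
    return b0, b1

xs = [0, 2, 4, 6, 8]
ys = [2 + 0.5 * x for x in xs]  # points lying exactly on Y = 2 + 0.5 X

b0, b1 = gradient_descent(xs, ys)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
```

If the learning rate is chosen too large for the scale of X, the updates overshoot and the cost grows instead of shrinking.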
The derivative of the cost function informs us about the slope of the cost function
α is the learning rate
[Figure: cost C(b) plotted against b]
Residuals: deviations between the observed and the predicted values
Residual sum of squares: $\sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
Is this a good measure?
▪ No, it depends on the number of observations N
▪ What if we multiply it by 1/N?
$y_i$ … observed value, $\hat{y}_i$ … value predicted by the model, $\bar{y}$ … mean of the observed data
Total variability in the outcome that needs to be explained
Unexplained variability! Residuals: difference between the observed value and the estimated value
Proportion of the total variability unexplained by the model
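These three quantities combine into R-squared, computed as one minus the proportion of unexplained variability. The sketch below uses made-up observations and predictions from the line Y = 2 + 0.5 X.

```python
from statistics import mean

def r_squared(observed, predicted):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    y_bar = mean(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))  # unexplained
    ss_tot = sum((y - y_bar) ** 2 for y in observed)                 # total
    return 1 - ss_res / ss_tot

xs = [0, 2, 4, 6, 8]
observed = [2.2, 2.9, 4.1, 4.8, 6.0]       # hypothetical measurements
predicted = [2 + 0.5 * x for x in xs]      # model predictions

r2 = r_squared(observed, predicted)
print(f"R^2 = {r2:.3f}")
```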
The dependent variable is binary (e.g., went to nightclub or not)
We can group users by number of new friends per year (20-25, 25-30, 30-35, etc.) and compute the proportion of people with high “nightclub-probability” in each group
Logistic Regression:
Maximum Likelihood Estimator
▪ Estimate the unknown coefficients by maximizing the log-likelihood function
Each coefficient is interpreted as the rate of change in the "log odds" as X changes
$$\ln \frac{P(Y = 1)}{1 - P(Y = 1)} = b_0 + b_1 X + \epsilon$$
Simple example:
You have a coin that you know is biased towards heads and you want to know what the probability of heads (p) is.
We want to estimate the unknown parameter p!
You flip the coin 10 times and the coin comes up heads 7 times. What’s your best guess for p?
The probability of observing 7 heads when tossing a coin 10 times is given by this binomial distribution:
$$P(7 \text{ heads}) = \binom{10}{7} p^7 (1-p)^3 = \frac{10!}{7!\,3!} p^7 (1-p)^3$$
Find the value for p that makes our data most likely!
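$$\text{Likelihood} = \binom{10}{7} p^7 (1-p)^3 = \frac{10!}{7!\,3!} p^7 (1-p)^3$$
$$\log \text{Likelihood} = \log \frac{10!}{7!\,3!} + 7 \log p + 3 \log(1-p)$$
Derivative with respect to p:
$$\frac{d}{dp} \log \text{Likelihood} = 0 + \frac{7}{p} - \frac{3}{1-p}$$
*derivative of a constant is 0
*derivative of 7f(x) is 7f '(x)
*derivative of log x is 1/x
Set the derivative equal to 0 and solve for p:
$$\frac{7}{p} - \frac{3}{1-p} = 0 \;\Rightarrow\; \frac{7(1-p) - 3p}{p(1-p)} = 0 \;\Rightarrow\; 7(1-p) = 3p \;\Rightarrow\; 7 - 7p = 3p \;\Rightarrow\; 7 = 10p \;\Rightarrow\; p = \frac{7}{10}$$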
web.stanford.edu/~kcobb/hrp261/lecture4.ppt
The likelihood of observing 7 heads when tossing a biased coin with p(heads) = 0.7 and p(tails) = 0.3 10 times is:
$$\text{Value of the likelihood} = \binom{10}{7} (0.7)^7 (0.3)^3 = 120 \cdot (0.7)^7 (0.3)^3 = 0.267$$
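The maximum likelihood estimate can also be found numerically. The sketch below evaluates the binomial likelihood on a grid of candidate values for p (a brute-force stand-in for the derivative-based solution) and picks the best one; the grid resolution is an arbitrary choice.

```python
from math import comb

def likelihood(p, heads=7, tosses=10):
    """Binomial probability of observing `heads` heads in `tosses` tosses."""
    return comb(tosses, heads) * p ** heads * (1 - p) ** (tosses - heads)

# Candidate values 0.001, 0.002, ..., 0.999
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=likelihood)

print(f"p_hat = {p_hat}, likelihood at p_hat = {likelihood(p_hat):.3f}")
```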
Linear Regression (R-squared)
Logistic Regression (pseudo R-squared)
you can “prove” anything with graphics
http://www.motherjones.com/kevin-drum/2012/01/lying-charts-global-warming-edition
Be careful when drawing conclusions from graphs
▪ Size of the effect shown in the graphic != size of the effect in the sample data != size of the effect in the true population
▪ Scale distortion (e.g., bar charts not starting at zero)
▪ Snapshot
▪ …
GESIS Data Archives & Data Centers
Preserve research data and make them accessible for reuse.
Competencies and infrastructure
▪ e.g. https://datorium.gesis.org/xmlui/
CESSDA:
umbrella organisation for the European national data archives (http://www.cessda.net/)
Re3data
browse data archives by topic: http://www.re3data.org/
DPC Digital Preservation Handbook: http://www.dpconline.org/advice/preservationhandbook
Legal and regulatory framework, including open access and licenses
Incentives to share data
▪ Credentials? Citation principles are under development (see e.g. http://www.datacite.org/).
Long-term preservation strategies
▪ Software and hardware changes, documentation, metadata and retrieval/access
Data preservation starts at an individual level
▪ Reasons for data loss are often on an individual level, e.g. broken hardware, researchers leaving a group.
Vasant Dhar, Data Science and Prediction, Communications of the ACM, December 2013, Vol. 56, No. 12, pp. 64-73
Anand Rajaraman, Jeffrey Ullman, Jure Leskovec, Mining of Massive Datasets, Cambridge University Press (free download)
Jeffrey Stanton, Introduction to Data Science (free download)
Steffen Staab, Data Science Course, University Koblenz-Landau, https://www.uni-koblenz-landau.de/campus-koblenz/fb4/west/teaching/ss14/data-science/data-science1
Thom Baguley, Serious Stats