8/8/2019 Useful Statistical Concepts for Engineers
Useful Statistical concepts for Engineers
Deepak Agarwal
ML Class
12/10/2009
Yahoo!
Scope of the lecture
Basic probability distributions to model randomness in data
Fitting distributions to data
Common parametric distributions: discrete distributions, continuous distributions
Generalized Linear models
Multi-level hierarchical models
Generalized Linear mixed effects models
Role of Probability distributions
Probability distributions
Mathematical models to describe intrinsic variation in data
Helps in quantifying uncertainty and eventually decision making
How do we construct such distributions to compute probabilities for any subset in the data domain?
Domain
Finite set of points (total clicks in 100 displays of ad X on Pub Y)
Countable but infinite set of points (total visits to webpage Y)
Real numbers (time spent on webpage Y)
Is it necessary to specify probabilities for all subsets?
NO, what to specify ?
Cumulative distribution function (CDF)
X : random variable
CDF $F : \mathbb{R} \to [0,1]$ such that
$F(x) = \Pr(X \le x)$
F is non-decreasing and right continuous
CDF uniquely characterizes a probability distribution
Given CDF, we can compute probability of any subset
E.g. $P(a < X \le b) = F(b) - F(a)$; $P(X > b) = 1 - F(b)$
What about more complicated sets?
In high dimension?
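The two identities on this slide can be sketched in a few lines. This is an illustrative example only: the standard normal CDF is an assumed choice of F, and the helper names `normal_cdf`, `prob_interval` and `prob_upper_tail` are hypothetical.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma^2), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def prob_interval(cdf, a, b):
    """P(a < X <= b) = F(b) - F(a)."""
    return cdf(b) - cdf(a)


def prob_upper_tail(cdf, b):
    """P(X > b) = 1 - F(b)."""
    return 1.0 - cdf(b)


# Assumed example: X ~ N(0, 1)
p_within_one_sd = prob_interval(normal_cdf, -1.0, 1.0)
p_above_two = prob_upper_tail(normal_cdf, 2.0)
print(p_within_one_sd, p_above_two)
```

Any other CDF can be passed in place of `normal_cdf`; only F itself is needed to obtain interval and tail probabilities.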
Probability density function (PDF)
A unique function $p : \mathbb{R} \to [0, \infty)$ such that
$P(A) = \int_A p(x)\,dF(x)$ [aggregate density with weights from F]
Meaning of notation $\int_A p(x)\,dF(x)$
Real numbers: $P(A) = \int_A p(x)\,dx$ (continuous distributions)
Discrete numbers: $P(A) = \sum_{x \in A} p(x)$ (discrete distributions)
PDF often easier to work with when (modeling) fitting distributions to data
Empirical CDF
Empirical CDF $F_m$ for data $X = (x_1, x_2, \ldots, x_m)$ (i.i.d.)
Probability distribution with mass 1/m on each $x_i$
1. m=10; -1.21 0.28 1.08 -2.35 0.43 0.51 -0.57 -0.55 -0.56 -0.89
2. m=10; 0 0 0 0 0 1 0 0 1 1
[Figure: empirical CDF $F_m(x)$ for example 1]
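A minimal sketch of the empirical CDF, using the two data sets listed on this slide. The function name `ecdf` is a hypothetical helper, not something from the lecture.

```python
def ecdf(data):
    """Return F_m, the CDF that puts mass 1/m on each observation."""
    m = len(data)

    def F_m(x):
        # Fraction of observations that are <= x
        return sum(1 for xi in data if xi <= x) / m

    return F_m


example_1 = [-1.21, 0.28, 1.08, -2.35, 0.43, 0.51, -0.57, -0.55, -0.56, -0.89]
example_2 = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

F1 = ecdf(example_1)
F2 = ecdf(example_2)
print(F1(0.0), F2(0), F2(1))
```

For the binary data in example 2, $F_m$ has only two jumps (at 0 and at 1), so it is the CDF of a Bernoulli distribution with success probability equal to the observed fraction of ones.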
Plug-in principle
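$\theta(F)$ : Some characteristic of the theoretical distribution
E.g. mean: $\theta(F) = \int x \, dF(x)$
$\theta(F_m)$ : Corresponding quantity for empirical CDF; for the mean this is the sample mean $\frac{1}{m}\sum_{i=1}^{m} x_i$
Exercise: Convince yourself this is true
Plug-in principle: $\theta(F_m)$ is a good estimator of $\theta(F)$ for all characteristics $\theta(F)$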
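The exercise can be checked numerically. The sketch below takes expectations under the empirical distribution (mass 1/m on each point) and compares them with the familiar sample statistics; the data are example 1 from the empirical CDF slide, and the variable names are illustrative.

```python
import statistics

data = [-1.21, 0.28, 1.08, -2.35, 0.43, 0.51, -0.57, -0.55, -0.56, -0.89]
m = len(data)
mass = 1.0 / m  # the empirical distribution puts mass 1/m on each x_i

# Mean of the empirical distribution: sum of x * mass
plug_in_mean = sum(mass * x for x in data)

# Variance of the empirical distribution (note the divisor m, not m - 1)
plug_in_var = sum(mass * (x - plug_in_mean) ** 2 for x in data)

print(plug_in_mean, statistics.mean(data))
print(plug_in_var, statistics.pvariance(data))
```

The plug-in variance uses divisor m, so it matches `statistics.pvariance` rather than the unbiased `statistics.variance`.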
Why does plug-in work? Glivenko-Cantelli lemma
Intuitively, $F_m$ is a good estimator of $F$
Estimator should get better with increasing sample size m
Glivenko-Cantelli: for the i.i.d. scenario,
$\sup_x |F_m(x) - F(x)| \to 0$ almost surely as $m \to \infty$, i.e. $P_m(A) \to P(A)$ uniformly over sets of the form $A = (-\infty, x]$
We can infer about the distribution with a large sample
Error does not grow as we increase the number of quantities estimated from the same sample
Justifies why plug-in principle works
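A small simulation can illustrate the uniform convergence. This is a sketch under assumed settings: the true F is taken to be N(0, 1), the sample sizes and seed are arbitrary, and `sup_distance` is a hypothetical helper.

```python
import math
import random


def normal_cdf(x):
    """CDF of N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sup_distance(sample):
    """sup_x |F_m(x) - F(x)| for F = N(0, 1).

    F_m is a step function, so the supremum is attained just before
    or exactly at one of its jumps (the sorted data points).
    """
    xs = sorted(sample)
    m = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        F = normal_cdf(x)
        d = max(d, abs((i + 1) / m - F), abs(i / m - F))
    return d


rng = random.Random(0)
distances = {}
for m in (10, 100, 1000, 10000):
    sample = [rng.gauss(0.0, 1.0) for _ in range(m)]
    distances[m] = sup_distance(sample)
    print(m, distances[m])
```

The printed distances should shrink roughly like $1/\sqrt{m}$ as the sample grows.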
Quantifying uncertainty in estimates
We won't have infinite m in practice (costly)
Quantify uncertainty in estimates for a given m
Sample mean
Estimate of uncertainty
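As an illustration, the sample mean and its usual standard error estimate $s/\sqrt{m}$ (with $s$ the sample standard deviation) can be computed as below. The formula is the standard textbook one and is assumed here; the data are example 1 from the empirical CDF slide.

```python
import math
import statistics

data = [-1.21, 0.28, 1.08, -2.35, 0.43, 0.51, -0.57, -0.55, -0.56, -0.89]
m = len(data)

sample_mean = statistics.mean(data)

# Standard error of the mean: sample standard deviation divided by sqrt(m)
standard_error = statistics.stdev(data) / math.sqrt(m)

print(sample_mean, standard_error)
```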
Standard error calculations continued
Consider median
Difficult to compute
Asymptotic approx:
Is there a better way?
Re-sampling from empirical CDF: Bootstrap
Bootstrap: Random sample (with replacement) from Fm
The samples help compute s.e.
For the median example,
Take a random sample of size m with replacement from empirical CDF $F_m$ and compute the median
For B such samples, compute the standard deviation (s.e.) of the median estimates; this quantifies uncertainty
This works well if $F_m$ is a good approximation to $F$
Bootstrap is only finding an approximation of $\mathrm{s.e.}_F(\theta(F_m))$ under $F_m$, where $\theta(\cdot)$ denotes the statistic of interest
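The bootstrap recipe for the median can be sketched as follows. The data (100 draws from N(0, 1)), the choice B = 200 and the seeds are assumptions for illustration; `bootstrap_se` is a hypothetical helper.

```python
import random
import statistics


def bootstrap_se(data, statistic, B=200, seed=0):
    """Bootstrap standard error of `statistic`.

    Draws B resamples of size m with replacement from the data
    (i.e. from the empirical CDF) and returns the standard deviation
    of the statistic across resamples.
    """
    rng = random.Random(seed)
    m = len(data)
    estimates = []
    for _ in range(B):
        resample = [rng.choice(data) for _ in range(m)]
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)


rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
se_median = bootstrap_se(sample, statistics.median)
print(se_median)
```

The same function works for any statistic that maps a list to a number, which is what makes the bootstrap a convenient black box.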
Why does bootstrap work?
Except for the mean, it is difficult to compute the standard deviation of other sample statistics
Bootstrap sampling provides an approximation to $F_m$, an easy black box to compute variance estimates
How many bootstrap samples B ?
For estimating s.e., 20-100 are good enough
Depends on m, tails of the underlying distribution
Exercise: How many distinct bootstrap samples are there for a given m ?
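One way to check an answer to the exercise is by brute force for small m. The sketch assumes the m data values are all distinct and that the order within a resample does not matter, in which case the count is the number of multisets of size m from m items, $\binom{2m-1}{m}$.

```python
import itertools
import math


def count_distinct_resamples(m):
    """Brute-force count of multisets of size m drawn from m distinct points."""
    return sum(1 for _ in itertools.combinations_with_replacement(range(m), m))


def closed_form(m):
    """Number of multisets of size m from m distinct items: C(2m - 1, m)."""
    return math.comb(2 * m - 1, m)


for m in range(1, 7):
    print(m, count_distinct_resamples(m), closed_form(m))
```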
Bootstrap: Variations
Does it always work?
No, especially in cases where $F_m$ is not a good approximation of $F$
E.g. sample m data points from Uniform(0, $\theta$)
$\max_i x_i$ is the ML estimator of $\theta$
Bootstrap as defined so far won't work well here
Parametric bootstrap
What if we know the parametric form of $F$ (e.g. Gaussian)?
Sample from $F_{m,\mathrm{par}}$ instead of $F_m$
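The Uniform example can be simulated to see what goes wrong. In the sketch below (sample size, B and seed are assumed values), the nonparametric bootstrap reproduces the observed maximum in a large fraction of resamples, so its distribution has a big point mass, while the parametric bootstrap samples from the fitted Uniform(0, max x_i) and does not.

```python
import random


def max_bootstrap(sample, B, rng, parametric):
    """Bootstrap distribution of the sample maximum.

    parametric=False resamples the data with replacement (from F_m);
    parametric=True samples from the fitted model Uniform(0, max x_i).
    """
    m = len(sample)
    theta_hat = max(sample)
    maxima = []
    for _ in range(B):
        if parametric:
            resample = [rng.uniform(0.0, theta_hat) for _ in range(m)]
        else:
            resample = [rng.choice(sample) for _ in range(m)]
        maxima.append(max(resample))
    return maxima


rng = random.Random(0)
sample = [rng.uniform(0.0, 1.0) for _ in range(50)]
theta_hat = max(sample)

nonpar = max_bootstrap(sample, 2000, rng, parametric=False)
par = max_bootstrap(sample, 2000, rng, parametric=True)

# Fraction of bootstrap maxima exactly equal to the observed maximum
frac_tied_nonpar = sum(1 for v in nonpar if v == theta_hat) / len(nonpar)
frac_tied_par = sum(1 for v in par if v == theta_hat) / len(par)
print(frac_tied_nonpar, frac_tied_par)
```

For resampling with replacement, the chance that a resample contains the observed maximum is $1 - (1 - 1/m)^m$, which is about 0.63 for moderate m.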
Example
m=100; data drawn iid from N(0,1); distribution of median
[Figure: c.v. plotted against the number of bootstrap samples B (0 to 5000)]
How can we use it at Y! ?
Variance estimates can help with online learning (explore/exploit): John's lecture
Bootstrapping can help better understand variance properties of models
Running too many experiments on test data is not a good idea (Kilian's lecture)
Easy to Map-reduce
Before we move on, another look at bias-variance tradeoff
Bias-Variance Tradeoff
Important in all scenarios (regression, density estimation, ...)
Recall Rob's example from lecture 1
[Figure panels: High bias; High variance; Optimal trade-off]
Bias-Variance continued
F : True distribution generating the data (not known)
$\mathcal{F} = \{F_\theta\}$ : Model class chosen by analyst to approximate F
Influenced by things like domain knowledge, previous studies, software availability, my favorite algo, ad-server latency, ...
E.g. linear models, neural networks, logistic regression, ...
X : Available data
Loss L(F, F[X]) : Metric that measures model performance
E.g. MSE, misclassification error, total click lift, total revenue, ...
Bias-Variance continued
Loss influenced by two aspects
How flexible is the model class $\mathcal{F}$ to approximate reality? (Bias)
The more flexible it is, the more complex it gets (reduces bias)
How stable is the best fit from $\mathcal{F}$ to data? (Variance)
Does the fit change a lot with perturbations to data?
The more flexible the class to choose from, the more data we need to control the variance
With too much flexibility and little data, we tend to learn patterns that are not real
(chasing the data, too many parameters, generalization error, fitting noise, too many degrees-of-freedom)
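The trade-off can be made concrete with a toy simulation. Everything in this sketch is an assumption chosen for illustration: the true curve is a sine wave, the noise level is 0.3, and the model class is a moving-average smoother whose half-width `h` controls flexibility (h = 0 reproduces the data exactly, a very large h fits a single constant).

```python
import math
import random


def moving_average_fit(y, h):
    """Predict each point by the mean of its neighbours within h index steps."""
    m = len(y)
    fit = []
    for i in range(m):
        lo, hi = max(0, i - h), min(m, i + h + 1)
        fit.append(sum(y[lo:hi]) / (hi - lo))
    return fit


def bias_variance(h, m=30, sigma=0.3, reps=300, seed=0):
    """Estimate average squared bias and variance of the fit over repeated data sets."""
    rng = random.Random(seed)
    truth = [math.sin(2 * math.pi * i / (m - 1)) for i in range(m)]
    fits = []
    for _ in range(reps):
        y = [t + rng.gauss(0.0, sigma) for t in truth]
        fits.append(moving_average_fit(y, h))
    mean_fit = [sum(f[i] for f in fits) / reps for i in range(m)]
    bias2 = sum((mean_fit[i] - truth[i]) ** 2 for i in range(m)) / m
    var = sum(
        sum((f[i] - mean_fit[i]) ** 2 for f in fits) / reps for i in range(m)
    ) / m
    return bias2, var


# h = 0: most flexible; h = 30: a single global mean (least flexible)
results = {h: bias_variance(h) for h in (0, 2, 30)}
for h, (bias2, var) in results.items():
    print(h, bias2, var)
```

The most flexible fit should show almost no bias but the largest variance, the constant fit the reverse, and the intermediate window a compromise between the two.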
Example: Recall regression from Rob's lecture
Exercise:
1. Identify the model class $\mathcal{F}$
2. If dim(x) = 20, m = 10M; what is a more serious problem here (bias or variance)?
3. Based on 2., what other tools would you try on this problem?
A useful exercise
Might have heard things like
"All models are wrong, some are more useful than others"
"Google uses simple models but trains them on lots of data"
"SVM works well on my data"
"Naive Bayes is hard to beat on text data"
"Boosting is the best off-the-shelf classifier"
"It's all about feature engineering; Maxent, GBDT doesn't matter"
"Y! data is too noisy, better to work with simple models"
"We have terabytes of data but it is too little"
Exercise: Interpret these in terms of bias-variance tradeoff
Other remarks on bias-variance
There is no universal solution to the bias-variance tradeoff
Several model classes available, each with pros and cons
Understanding the properties of model classes and experimenting with data is important
Inventing new model classes motivated by failures of existing ones on real applications is important for advancement of the field
How to measure performance ?
Depends on the loss function
For classification and regression, test errors used in ML
Several other measures in Statistics (do not use test data)
MODEL FIT - MODEL COMPLEXITY
AIC, BIC, DIC, Mallows' $C_p$, Bayes factors, ...
E.g. AIC = $-2 \log L(\text{training}) + 2 \cdot (\#\text{ parameters})$, where $L$ is the maximized likelihood
Based on assumptions that may not hold in all scenarios
Parametric Distributions: A useful class to work with data
Parametric models
Non-parametric approach attractive, no assumptions needed
Bootstrap and asymptotics often provide answers, BUT
Hard to incorporate additional knowledge about the system
Computationally intensive
Higher uncertainty in estimates: the price we pay for generality
Theory gets harder for dependent random variables
Social network data, spatial data, time series
Parametric models that assume a functional form are an alternate way to model the world
Faster computation, better estimates if the model is a good approximation to reality
Easier to model dependent random variables
Common discrete parametric distributions
Bernoulli
Poisson
Geometric
Negative Binomial
Multinomial
Common continuous parametric distributions
Normal (Gaussian)
Log-normal: Normal on log scale
Gamma : Tails thinner than log-normal
Beta: flexible class on [0,1]
Multivariate Normal : Multivariate Gaussian data
Exponential family: A general class of parametric distributions
Distribution with PDF given by
$g(\theta)$: convex (log-partition function)
Example: Bernoulli distribution
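For reference, one standard way to write a 1-parameter exponential family and the Bernoulli case is sketched below. The symbols $h$, $T$, $\theta$ and $\pi$ are notation assumed here for illustration; only $g$ as the log-partition function is taken from the slide.

```latex
% General form: natural parameter \theta, sufficient statistic T(x),
% base measure h(x), log-partition function g(\theta)
p(x \mid \theta) = h(x)\,\exp\{\theta\,T(x) - g(\theta)\}

% Bernoulli with success probability \pi, x \in \{0, 1\}
p(x \mid \pi) = \pi^{x}(1-\pi)^{1-x}
             = \exp\Big\{ x \log\frac{\pi}{1-\pi} + \log(1-\pi) \Big\}

% Matching terms:
\theta = \log\frac{\pi}{1-\pi}, \quad T(x) = x, \quad h(x) = 1, \quad
g(\theta) = \log\big(1 + e^{\theta}\big)
```

Here $g'(\theta) = e^{\theta}/(1 + e^{\theta}) = \pi$, so the derivative of the log-partition function recovers the mean, and $g''(\theta) = \pi(1-\pi) \ge 0$ confirms convexity.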
Estimation
Maximum likelihood estimation (MLE)
For the i.i.d. case,
MLE is asymptotically unbiased, consistent and achieves lowest variance asymptotically under mild conditions
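A minimal sketch of maximum likelihood for i.i.d. Bernoulli data: the log-likelihood is the sum of per-observation log-probabilities, and its maximizer can be found on a grid and compared with the closed form (the sample proportion). The data are the binary example from the empirical CDF slide; the grid resolution is an arbitrary choice.

```python
import math

clicks = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]


def log_likelihood(p, data):
    """Log-likelihood of i.i.d. Bernoulli(p) data: sum of log-probabilities."""
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in data)


# Numerical maximization over a grid of candidate values in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: log_likelihood(p, clicks))

# Closed-form MLE for the Bernoulli model: the sample proportion
p_closed_form = sum(clicks) / len(clicks)

print(p_grid, p_closed_form)
```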
Desirable properties of estimators
Unbiased
Consistent
Low variance, efficient [Attains Cramer-Rao lower bound]
MLE efficient
Under mild regularity conditions, MLE is asymptotically unbiased, consistent and efficient
Other estimators
MVUE: Minimum variance unbiased estimators
Search for lowest variance estimator among unbiased ones
Requires only moment assumptions on the distributions
Method-of-moments (MOM) estimators
Equate empirical moments with theoretical ones
May lose efficiency but easier to estimate in some cases
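As an illustration of method-of-moments, consider a Gamma distribution with shape $k$ and scale $s$, whose mean is $ks$ and variance is $ks^2$. Equating these to the empirical moments gives $\hat{k} = \bar{x}^2 / v$ and $\hat{s} = v / \bar{x}$. The Gamma example, true parameter values, sample size and seed are assumptions for this sketch.

```python
import random


def gamma_mom(data):
    """Method-of-moments estimates (shape, scale) for a Gamma distribution."""
    m = len(data)
    mean = sum(data) / m
    var = sum((x - mean) ** 2 for x in data) / m  # empirical second central moment
    shape_hat = mean * mean / var
    scale_hat = var / mean
    return shape_hat, scale_hat


rng = random.Random(0)
# random.gammavariate(alpha, beta): alpha is the shape, beta is the scale
sample = [rng.gammavariate(2.0, 3.0) for _ in range(20000)]
shape_hat, scale_hat = gamma_mom(sample)
print(shape_hat, scale_hat)
```

The Gamma MLE has no closed form for the shape parameter, which is one case where the moment estimator is easier to compute.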
Non-i.i.d. data [adding more flexibility]
Statistically independent but different means
Too flexible, sharing parameters a good compromise
E.g. 100 displays of an ad on a website (Bernoulli)
Click probabilities not same, how do we model it ?
$\theta_i$ a function of features? Males, females have different probs
Regression problem (logistic regression, ...)
$\theta_i := \theta(z_i, \beta) = z_i^T \beta$; $\dim(\beta) = n$
Generalized Linear Models: Flexible class for regressions
Data: $(x_1, z_1), (x_2, z_2), \ldots, (x_m, z_m)$
Assumption: the $z_i$ are measured without error (important)
1-parameter exponential family
= (xi). i
Do linear regression on transformed scale
$\theta_i := \theta(z_i, \beta) = z_i^T \beta$
GLM continued
Example: logistic regression (covered in Rob's lecture)
Gaussian regression, Poisson regression are special cases
Referred to as Generalized linear model (GLM)
McCullagh and Nelder (book)
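A minimal sketch of fitting one GLM, logistic regression with an intercept and a single feature, by Newton's method on the log-likelihood. The simulated data, the true coefficients (-1, 2), and the helper names are assumptions for illustration; in practice a library routine would normally be used.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_logistic(z, x, iters=25):
    """Newton's method for x_i ~ Bernoulli(sigmoid(b0 + b1 * z_i))."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for zi, xi in zip(z, x):
            p = sigmoid(b0 + b1 * zi)
            w = p * (1.0 - p)
            # Gradient of the log-likelihood
            g0 += xi - p
            g1 += (xi - p) * zi
            # Negative Hessian (Fisher information), a 2x2 matrix
            h00 += w
            h01 += w * zi
            h11 += w * zi * zi
        det = h00 * h11 - h01 * h01
        # Newton step: beta += H^{-1} g, with the 2x2 inverse written out
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(5000)]
x = [1 if rng.random() < sigmoid(-1.0 + 2.0 * zi) else 0 for zi in z]
b0_hat, b1_hat = fit_logistic(z, x)
print(b0_hat, b1_hat)
```

The regression is linear on the transformed (log-odds) scale, which is the sense in which the model is a generalized linear model.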
Other option: Shrinkage estimators (Stein)
Stein
Result
Stein estimator has smaller MSE than MLE
Remarkable: incurring some bias by pooling data reduces variance significantly
Shrinkage: Estimates pulled towards the mean
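The effect can be seen in a small simulation. The sketch assumes the classical setting $x_i \sim N(\theta_i, 1)$ for $i = 1, \ldots, p$ with known unit variance, and uses a James-Stein-type estimator that shrinks toward the grand mean with factor $1 - (p-3)/\sum_i (x_i - \bar{x})^2$. The values of p, the true means, and the number of replications are arbitrary choices.

```python
import random


def shrink_toward_mean(x):
    """James-Stein-type estimator for x_i ~ N(theta_i, 1), pulled toward the grand mean."""
    p = len(x)
    xbar = sum(x) / p
    s = sum((xi - xbar) ** 2 for xi in x)
    factor = 1.0 - (p - 3) / s
    return [xbar + factor * (xi - xbar) for xi in x]


def compare_total_squared_error(p=20, reps=500, seed=0):
    """Average total squared error of the MLE (x itself) and the shrinkage estimator."""
    rng = random.Random(seed)
    theta = [rng.gauss(0.0, 1.0) for _ in range(p)]  # fixed true means
    err_mle = err_js = 0.0
    for _ in range(reps):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        js = shrink_toward_mean(x)
        err_mle += sum((xi - t) ** 2 for xi, t in zip(x, theta))
        err_js += sum((ji - t) ** 2 for ji, t in zip(js, theta))
    return err_mle / reps, err_js / reps


err_mle, err_js = compare_total_squared_error()
print(err_mle, err_js)
```

The MLE's total squared error should be close to p, while the shrinkage estimator's is noticeably smaller even though each coordinate is now biased.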
Bayesian statistics
Data, parameters are all random variables that we model
All inferences about parameters are conditional on data
Bayes' Theorem
$[\theta \mid X] = [X \mid \theta]\,[\theta] / [X]$
Posterior $\propto$ Likelihood $\times$ Prior
[Figure: likelihood and prior curves]
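A small worked example of "posterior is proportional to likelihood times prior", assuming Bernoulli click data with a Beta(1, 1) prior on the click probability (the prior choice and variable names are illustrative). The conjugate update gives the posterior in closed form, and a grid approximation of likelihood times prior gives the same answer numerically.

```python
clicks = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]
k, n = sum(clicks), len(clicks)


def beta_posterior(a, b, successes, trials):
    """Beta(a, b) prior + Bernoulli data -> Beta(a + successes, b + failures)."""
    return a + successes, b + (trials - successes)


a_post, b_post = beta_posterior(1.0, 1.0, k, n)
posterior_mean = a_post / (a_post + b_post)

# Grid check: posterior weight at p is likelihood(p) * prior(p), then normalize
grid = [(i + 0.5) / 1000 for i in range(1000)]
prior = [1.0 for _ in grid]  # Beta(1, 1) is flat on (0, 1)
weights = [p ** k * (1.0 - p) ** (n - k) * pr for p, pr in zip(grid, prior)]
grid_mean = sum(p * w for p, w in zip(grid, weights)) / sum(weights)

print(a_post, b_post, posterior_mean, grid_mean)
```

With 3 clicks in 10 displays the posterior mean (1/3) sits between the prior mean (1/2) and the MLE (0.3), a simple instance of shrinkage.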
Bayesian continued
Does not depend on asymptotics, works for finite m
Rich class of models (generally over-parametrized) but avoids over-fitting through constraints on parameters
Model specification often requires care
Computationally intensive (but approximations work well for large data)
Bayesian interpretation of Stein
Exercise
Analysis of variance (ANOVA)
Replications within each group
E.g. log NGD prices in different DMAs
How to estimate the unknown hyper-parameters?
Estimating hyper-parameters: Empirical Bayes
Empirical Bayes (EB): Maximize marginal likelihood
ANOVA example (integral available in closed form)
EB works well for large data; in small samples it may overfit
Double dipping
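A minimal sketch of empirical Bayes for a balanced one-way layout, under assumptions that are not spelled out on the slide: observations $y_{ij} \sim N(\theta_i, \sigma^2)$ with $\sigma^2$ known, group effects $\theta_i \sim N(\mu, \tau^2)$, and $n$ replications per group. Then the group means are marginally $N(\mu, \tau^2 + \sigma^2/n)$, so maximizing the marginal likelihood gives $\hat{\mu}$ as the mean of the group means and $\hat{\tau}^2$ as their variance minus $\sigma^2/n$ (truncated at zero).

```python
import random


def empirical_bayes_shrinkage(groups, sigma2):
    """Empirical Bayes estimates of group means in a balanced normal-normal model.

    Assumes every group has the same number of replications and that the
    within-group variance sigma2 is known.
    """
    n = len(groups[0])
    means = [sum(g) / n for g in groups]
    k = len(means)
    mu_hat = sum(means) / k
    spread = sum((mbar - mu_hat) ** 2 for mbar in means) / k
    tau2_hat = max(0.0, spread - sigma2 / n)
    # Shrinkage weight on the grand mean: noise variance over total variance
    B = (sigma2 / n) / (sigma2 / n + tau2_hat)
    shrunk = [mu_hat + (1.0 - B) * (mbar - mu_hat) for mbar in means]
    return means, shrunk, mu_hat, tau2_hat


rng = random.Random(0)
true_effects = [rng.gauss(0.0, 0.5) for _ in range(30)]
groups = [[t + rng.gauss(0.0, 1.0) for _ in range(5)] for t in true_effects]
means, shrunk, mu_hat, tau2_hat = empirical_bayes_shrinkage(groups, sigma2=1.0)
print(mu_hat, tau2_hat)
```

The hyper-parameters are estimated from the same data that are then shrunk, which is the "double dipping" the slide refers to.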
Example
Time spent on landing page after a story click on Today Module on Y! Front Page
Distribution across different properties
ANOVA
Observations for a property replications: log(time spent) data
0.04651195, 0.11435909 , 2.52275583
Shrinkage
[Figure: MLE versus shrinkage estimates]
Estimating hyper-parameters: Full Bayes
Assume a mild prior on hyper-parameters
In ANOVA example
Computation gets difficult, often requires simulation
Main idea
Simulate samples from the posterior distribution and draw all conclusions from these (recall parametric bootstrap)
Several techniques : Markov Chain Monte Carlo (MCMC)
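To illustrate the simulation idea, here is a sketch of a random-walk Metropolis sampler, one simple MCMC technique. The target is an assumed toy posterior: a Beta(1, 1) prior on a click probability with 3 clicks observed in 10 displays, so the exact posterior is Beta(4, 8) and the sampler can be checked against it. Step size, chain length and burn-in are arbitrary choices.

```python
import math
import random


def log_posterior(p, k, n, a=1.0, b=1.0):
    """Unnormalized log posterior: Bernoulli likelihood times Beta(a, b) prior."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return (a - 1.0 + k) * math.log(p) + (b - 1.0 + n - k) * math.log(1.0 - p)


def metropolis(k, n, steps=20000, step_size=0.1, seed=0):
    """Random-walk Metropolis sampler for the click probability p."""
    rng = random.Random(seed)
    p = 0.5
    current = log_posterior(p, k, n)
    samples = []
    for _ in range(steps):
        proposal = p + rng.gauss(0.0, step_size)
        candidate = log_posterior(proposal, k, n)
        # Accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            p, current = proposal, candidate
        samples.append(p)
    return samples[steps // 10:]  # discard the first 10% as burn-in


samples = metropolis(k=3, n=10)
mcmc_mean = sum(samples) / len(samples)
print(mcmc_mean)
```

Only the unnormalized posterior is needed, which is why the approach carries over to models where the normalizing constant has no closed form.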
Modeling correlations through priors
Time series: Autoregressive prior
Conditional independence, marginal dependence
Attractive way to model correlations
Spatial correlation
Generalized linear mixed model (GLMM)
Fit different regressions to different groups but share parameters
Example: Random intercept models
Parallel regression lines for groups
Front Page example: log(ts) = a + b*Gender + prop_id
(Intercept)  gender)0  gender)f  gender)m  sigma^2  tau^2
0.025        0.114     0.051     0.049     1.32     0.121
GLMM continued
Crossed random effects
Group-specific slopes and intercepts
FP example
log(ts) = a + b*gend + Propid*Gender
Exercise: fit this model using lme4 in R
Hint: formula (log(ts) ~ gender + (Propid|gender) )
GLMMs
From an ML perspective
Linear models with different cross-product features
Fancy regularization (different priors for different features)
No cross-validation, all parameters estimated automatically
Priors motivated by problem, highly flexible class
Model specification has to be done carefully by analyst
Extends to exponential family
Conceptually easy, more computation required
Software (lme4 in R; PROC GLIMMIX in SAS)
Generalized linear mixed model (GLMM)
Back to ANOVA: Regression + ANOVA
Define , then we can write
Extends to exponential family, computation gets harder
Generalized linear mixed models (GLMM)
Software (lme4 package in R)
Summary
We covered
Bootstrap for the i.i.d. case
Parametric distributions
Shrinkage estimators
Generalized linear models
Grouped regressions (mixed effects models)
For non-i.i.d. data, working with flexible parametric models provides a powerful, expressive language to model data
Needs some practice to master these models
Next lecture: Olivier Chapelle (Optimization techniques)