General remarks about learning
Probability Theory and Statistics
Linear spaces
Machine Learning: Preliminaries and Math Refresher
M. Luthi, T. Vetter
February 18, 2008
Outline
1 General remarks about learning
2 Probability Theory and Statistics
3 Linear spaces
The problem of learning is arguably at the very core of the problem of intelligence, both biological and artificial.
T. Poggio and C.R. Shelton
Model building in natural sciences
Model building
Given a phenomenon, construct a model for it.
Example (Heat Conduction)
Phenomenon: The spontaneous transfer of thermal energy through matter, from a region of higher temperature to a region of lower temperature.
Model:
$$\frac{\partial Q}{\partial t} = -k \oint_S \nabla T \cdot dS$$
Learning as Model Building
Example (Learning)
Phenomenon: Learning (inferring general rules from examples)
Model:
$$f^* = \arg\max_{f \in H} \frac{P(f)\, P(D \mid f)}{P(D)}$$
Neural networks, Decision Trees, Naive Bayes, Support Vector machines, etc.
Models for learning
The models for learning are the learning algorithms.
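The arg max in the model above can be illustrated with a minimal Python sketch. Everything in it is assumed for illustration and does not come from the slides: the hypothesis space H contains three candidate coin biases, the prior values are made up, and D is a short list of coin flips. The names `hypotheses`, `prior`, `likelihood` and `map_estimate` are hypothetical.

```python
from math import prod  # requires Python 3.8+

# Hypothetical hypothesis space H: each hypothesis f is a coin bias P(heads).
hypotheses = [0.2, 0.5, 0.8]
# Assumed prior P(f); the values are made up for illustration and sum to 1.
prior = {0.2: 0.25, 0.5: 0.5, 0.8: 0.25}

def likelihood(data, f):
    """P(D | f) for i.i.d. coin flips, where 1 = heads and 0 = tails."""
    return prod(f if x == 1 else 1.0 - f for x in data)

def map_estimate(data):
    """Return arg max over f in H of P(f) P(D|f) / P(D), plus the posterior."""
    joint = {f: prior[f] * likelihood(data, f) for f in hypotheses}
    evidence = sum(joint.values())  # P(D), the same for every f
    posterior = {f: joint[f] / evidence for f in hypotheses}
    return max(posterior, key=posterior.get), posterior

data = [1, 1, 0, 1, 1, 1]  # made-up observations: five heads, one tail
f_star, posterior = map_estimate(data)
print(f_star, posterior)
```

Since P(D) does not depend on f, dividing by it changes the posterior values but not the maximizer; it is computed here only so that the posterior sums to one.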
Goals of the first block
Life is short . . .
We want to cover the essentials of learning.
General Setting
  Mathematically precise setting of the learning problem
  Valid for any kind of learning algorithm
Statistical Learning Theory
  When does learning work
  Conditions any algorithm has to satisfy
  Performance bounds
Kernel Methods
  Theory of Kernels
  Make linear algorithms non-linear.
  Learning from non-vectorial data.
Mathematics needed in the first block
The need for mathematics
As we treat the learning problem in a formal setting, the results and methods are necessarily formulated in mathematical terms.
General Setting
  Probability theory
  Statistics
  Basic optimization theory
Statistical Learning Theory
  More probability theory
  More statistics
Kernel Methods
  Linear spaces
  Linear algebra
  Basic optimization theory
A bit of mathematical maturity and an open mind are required. The rest will be explained.
Nothing is more practical than a good theory.
Vladimir N. Vapnik
Nothing (in computer science) is more beautiful than learning theory?
M. Luthi
Probability theory vs Statistics
Definition (Probability Theory)
A branch of mathematics concerned with the analysis of random phenomena.
General ⇒ Specific
Definition (Statistics)
The science of collecting, analyzing, presenting, and interpreting data.
Specific ⇒ General
Statistical machine learning is closely related to (inferential) statistics.
Many state-of-the-art learning algorithms are based on concepts from probability theory.
Probabilities
Definition (Probability Space)
A probability space is the triple
(Ω, F, P)
where
Ω is the sample space, a set of outcomes (elementary events) ω,
F is a collection of events, i.e. of subsets of Ω (e.g. the power set of Ω),
P is a measure on F that satisfies the probability axioms.
Axioms of Probability
1 For any A ∈ F, there exists a number P(A), the probability of A, satisfying P(A) ≥ 0.
2 P(Ω) = 1.
3 Let A_n, n ≥ 1, be a collection of pairwise disjoint events, and let A be their union. Then
$$P(A) = \sum_{n=1}^{\infty} P(A_n).$$
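As a small sanity check of the three axioms, the following Python sketch builds a finite probability space. The fair six-sided die, the uniform measure and the helper names `power_set` and `P` are assumptions chosen for illustration. Because the sample space is finite, countable additivity reduces to additivity over pairs of disjoint events, which is what the sketch tests.

```python
from fractions import Fraction
from itertools import chain, combinations

# Sample space of a fair six-sided die (an assumed toy example).
omega = frozenset(range(1, 7))

def power_set(s):
    """All subsets of s; used here as the event collection F."""
    items = list(s)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(c) for c in subsets]

def P(event):
    """Uniform probability measure |A| / |Omega|, kept exact with Fraction."""
    return Fraction(len(event), len(omega))

events = power_set(omega)

# Axiom 1: non-negativity for every event in F.
axiom1 = all(P(a) >= 0 for a in events)
# Axiom 2: the whole sample space has probability one.
axiom2 = P(omega) == 1
# Axiom 3 (finite case): additivity over pairwise disjoint events.
axiom3 = all(
    P(a | b) == P(a) + P(b) for a in events for b in events if not (a & b)
)

print(axiom1, axiom2, axiom3)
```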
Independence
Definition (Independence)
Two events, A and B, are independent iff the probability of their intersection equals the product of the individual probabilities, i.e.
$$P(A \cap B) = P(A) \cdot P(B).$$
Definition (Conditional probability)
Given two events A and B, with P(B) > 0, we define the conditional probability for A given B, P(A|B), by the relation
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
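Both definitions can be tried out on a concrete example. The sketch below assumes a toy sample space of two fair dice with the uniform measure. The events `A`, `B`, `C` and the helper `conditional` are hypothetical names introduced only for this illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered outcomes of two fair dice (assumed toy example).
omega = set(product(range(1, 7), repeat=2))

def P(event):
    """Uniform probability measure on omega, kept exact with Fraction."""
    return Fraction(len(event), len(omega))

def conditional(a, b):
    """P(A | B) = P(A ∩ B) / P(B), defined only when P(B) > 0."""
    if P(b) == 0:
        raise ValueError("P(B) must be positive")
    return P(a & b) / P(b)

A = {w for w in omega if w[0] % 2 == 0}      # first die is even
B = {w for w in omega if w[1] == 6}          # second die shows six
C = {w for w in omega if w[0] + w[1] >= 10}  # the sum is at least ten

independent_AB = P(A & B) == P(A) * P(B)
independent_BC = P(B & C) == P(B) * P(C)

print(independent_AB, independent_BC)
print(conditional(A, B), conditional(C, B))
```

Here A and B concern different dice and come out independent, so conditioning on B leaves the probability of A unchanged. B and C are not independent: knowing that the second die shows six raises the probability of a large sum.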
Random Variables
A single event is not that interesting.
Definition (Random Variable)
A random variable X is a function from the sample space Ω of a probability space to a vector of real numbers
$$X : \Omega \to \mathbb{R}^n.$$
Random variables are characterized by their distribution function F:
Definition (Probability Distribution Function)
Let X : Ω → R be a random variable. We define
$$F_X(x) = P(X \le x), \quad -\infty < x < \infty.$$
Probability density function
Definition (Probability density function)
The density function is the function f_X with the property
$$F_X(x) = \int_{-\infty}^{x} f_X(y)\, dy, \quad -\infty < x < \infty.$$
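The relation between density and distribution function can be checked numerically. The sketch below assumes an exponential distribution with rate 2 (this choice is not from the slides) and compares a trapezoidal approximation of the integral with the known closed form. The names `pdf`, `cdf_numeric` and `cdf_exact` are hypothetical.

```python
import math

RATE = 2.0  # assumed rate parameter of an exponential distribution

def pdf(y):
    """Density f_X(y) = RATE * exp(-RATE * y) for y >= 0, and 0 otherwise."""
    return RATE * math.exp(-RATE * y) if y >= 0 else 0.0

def cdf_numeric(x, steps=10_000):
    """Approximate F_X(x) by trapezoidal integration of the density.

    The density vanishes below 0, so the integral starts at 0 instead of -inf.
    """
    if x <= 0:
        return 0.0
    h = x / steps
    total = 0.5 * (pdf(0.0) + pdf(x))
    total += sum(pdf(i * h) for i in range(1, steps))
    return total * h

def cdf_exact(x):
    """Closed form F_X(x) = 1 - exp(-RATE * x) for x >= 0."""
    return 1.0 - math.exp(-RATE * x) if x >= 0 else 0.0

for x in (0.5, 1.0, 3.0):
    print(x, cdf_numeric(x), cdf_exact(x))
```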
Convergence
Definition (Convergence in Probability)
Let X_1, X_2, . . . be random variables. We say that X_n converges in probability to the random variable X as n → ∞ iff, for all ε > 0,
$$P(|X_n - X| > \varepsilon) \to 0 \quad \text{as } n \to \infty.$$
We write $X_n \xrightarrow{p} X$ as n → ∞.
Weak law of large numbers
Theorem (Bernoulli's Theorem (Weak law of large numbers))
Let X_1, . . . , X_n be a sequence of independent and identically distributed (i.i.d.) random variables, each having mean µ and standard deviation σ. Then, for every ε > 0,
$$P\left[\,\left|\frac{X_1 + \dots + X_n}{n} - \mu\right| > \varepsilon\,\right] \to 0$$
as n → ∞.
Thus, given enough observations x_i ∼ F_X, the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ will approach the true mean µ.
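The statement can be observed empirically. The following sketch is a simulation, not a proof. It assumes i.i.d. rolls of a fair die (true mean 3.5), a tolerance ε = 0.25 and a fixed random seed, all chosen for illustration. The helper names are hypothetical.

```python
import random

def sample_mean_of_die_rolls(n, rng):
    """Sample mean of n i.i.d. rolls of a fair six-sided die (true mean 3.5)."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

def deviation_frequency(n, eps, trials, rng):
    """Estimate P(|mean_n - mu| > eps) by repeating the experiment."""
    mu = 3.5
    hits = sum(
        abs(sample_mean_of_die_rolls(n, rng) - mu) > eps for _ in range(trials)
    )
    return hits / trials

rng = random.Random(0)  # fixed seed so that the run is reproducible
for n in (10, 100, 1000):
    print(n, deviation_frequency(n, eps=0.25, trials=500, rng=rng))
```

The printed frequencies estimate the probability in the theorem for increasing n, and they should shrink towards zero as n grows.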
Expectation
Definition (Expectation)
Let X be a random variable with probability density function f_X, and g : R → R a function. We define the expectation
$$E[g(X)] := \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx.$$
Definition (Sample mean)
Let a sample x = x_1, x_2, . . . , x_n be given. We define the (sample) mean to be
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$
Variance
Definition (Variance)
Let X be a random variable with density function f_X. The variance is given by
$$\mathrm{Var}[X] = E[(X - E[X])^2] = E[X^2] - (E[X])^2.$$
The square root $\sqrt{\mathrm{Var}[X]}$ of the variance is referred to as the standard deviation.
Definition (Sample Variance)
Let the sample x = x_1, x_2, . . . , x_n with sample mean $\bar{x}$ be given. We define the sample variance to be
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$
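The sample mean and sample variance translate directly into code. The sketch below uses a made-up sample and hypothetical function names, and cross-checks the result against Python's `statistics` module, whose `variance` function also normalises by n − 1.

```python
import statistics

def sample_mean(xs):
    """Arithmetic mean (1/n) * sum of the observations."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sample variance with the 1 / (n - 1) normalisation."""
    n = len(xs)
    if n < 2:
        raise ValueError("at least two observations are required")
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (n - 1)

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
print(sample_mean(xs), sample_variance(xs))
# Cross-check against the standard library, which uses the same definitions.
print(statistics.mean(xs), statistics.variance(xs))
```

For this sample the mean is 5 and the squared deviations sum to 32, so the sample variance is 32/7.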
Notation
Assume F has a probability density function:
$$f(x) = \frac{dF(x)}{dx}$$
Formally, we write:
$$f(x)\, dx = dF(x)$$
Example: Expectation
$$E[g(X)] := \int_{-\infty}^{\infty} g(x)\, f(x)\, dx = \int_{-\infty}^{\infty} g(x)\, dF(x)$$
Vector Space
A set V together with two binary operations
1 vector addition + : V × V → V and
2 scalar multiplication · : R× V → V
is called a vector space over R, if it satisfies the following axioms:
1 ∀x, y ∈ V : x + y = y + x (commutativity)
2 ∀x, y, z ∈ V : x + (y + z) = (x + y) + z (associativity)
3 ∃0 ∈ V, ∀x ∈ V : 0 + x = x (identity of vector addition)
4 ∀x ∈ V : 1 · x = x, where 1 ∈ R (identity of scalar multiplication)
5 ∀x ∈ V : ∃(−x) ∈ V : x + (−x) = 0 (additive inverse element)
6 ∀α ∈ R, ∀x, y ∈ V : α · (x + y) = α · x + α · y (distributivity)
7 ∀α, β ∈ R, ∀x ∈ V : (α + β) · x = α · x + β · x (distributivity)
8 ∀α, β ∈ R, ∀x ∈ V : α · (β · x) = (αβ) · x (compatibility of scalar multiplication)
Vector Space
More importantly for us, the definition implies:
x + y ∈ V , ∀x , y ∈ V
αx ∈ V , ∀α ∈ R,∀x ∈ V
Subspace criterion
Let V be a vector space over R, and let W be a subset of V. Then W is a subspace if and only if it satisfies the following 3 conditions:
1 0 ∈ W
2 If x , y ∈ W then x + y ∈ W
3 If x ∈ W and α ∈ R then αx ∈ W
Normed spaces
Definition (Normed vector space)
A normed vector space is a pair (V, ‖·‖) where V is a vector space and ‖·‖ is the associated norm, satisfying the following properties for all u, v ∈ V and all α ∈ R:
1 ‖v‖ ≥ 0 (positivity)
2 ‖u + v‖ ≤ ‖u‖+ ‖v‖ (triangle inequality)
3 ‖αv‖ = |α|‖v‖ (positive scalability)
4 ‖v‖ = 0 ⇔ v = 0 (positive definiteness)
Definition (Inner product space)
A real inner product space is a pair (V, 〈·, ·〉), where V is a real vector space and 〈·, ·〉 the associated inner product, satisfying the following properties for all u, v, w ∈ V and all α ∈ R:
1 〈u, v〉 = 〈v, u〉 (symmetry)
2 〈αu, v〉 = α〈u, v〉, 〈u, αv〉 = α〈u, v〉 and 〈u + v, w〉 = 〈u, w〉 + 〈v, w〉, 〈u, v + w〉 = 〈u, v〉 + 〈u, w〉 (bilinearity)
3 〈u, u〉 ≥ 0 (positive semi-definiteness)
Definition (Strict inner product space)
An inner product space is called strict if
〈u, u〉 = 0 ⇔ u = 0
Inner product space
The strict inner product
  induces a norm: ‖f‖² = 〈f, f〉,
  is used to define distances and angles between elements.
Theorem (Cauchy-Schwarz inequality)
For all vectors u and v of a real inner product space (V, 〈·, ·〉), the following inequality holds:
|〈u, v〉| ≤ ‖u‖ ‖v‖.
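A numerical spot check of these statements, assuming the standard dot product on R^5 as the strict inner product, is sketched below. The random test vectors, the tolerance `TOL` and the helper names `inner`, `norm` and `angle` are illustrative assumptions. Passing the check on random vectors is evidence, not a proof.

```python
import math
import random

def inner(u, v):
    """Standard dot product on R^n, one example of a strict inner product."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Norm induced by the inner product: ||u|| = sqrt(<u, u>)."""
    return math.sqrt(inner(u, u))

def angle(u, v):
    """Angle between non-zero u and v via cos(theta) = <u, v> / (||u|| ||v||)."""
    c = inner(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, c)))  # clamp against rounding error

rng = random.Random(0)  # fixed seed so that the run is reproducible
vectors = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(200)]
TOL = 1e-12  # tolerance for floating-point rounding

# Cauchy-Schwarz: |<u, v>| <= ||u|| ||v|| for every pair of test vectors.
cauchy_schwarz_ok = all(
    abs(inner(u, v)) <= norm(u) * norm(v) + TOL for u in vectors for v in vectors
)
# Triangle inequality for the induced norm.
triangle_ok = all(
    norm([a + b for a, b in zip(u, v)]) <= norm(u) + norm(v) + TOL
    for u in vectors
    for v in vectors
)

print(cauchy_schwarz_ok, triangle_ok, angle([1.0, 0.0], [0.0, 1.0]))
```

The Cauchy-Schwarz inequality is what makes `angle` well defined: it guarantees that the quotient passed to `acos` lies in [−1, 1] up to rounding.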
If you’re not comfortable with any of the presented material, you should take your favourite textbook and read up on it within the next two weeks.