IDS Lab
Universal Approximation Theorem
why does a deep neural network work?
basic maths for deep learning
Jamie Seol
Motivation
• Basic logistic regression, or a perceptron, can’t classify data like this:
• because the decision boundary will be just a line!
Motivation
• Unless we add extra features like x3 = x1², x4 = x2²
• but this is feature engineering!
• we’d have to do it by hand!!
• we don’t want to do this!
• this is one of the main reasons why we’re studying deep learning!
• we can’t always do this!
• what if the input data were a 1000-dimensional vector?
• we know that deep learning automatically finds these additional features
• but why and how?
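The feature-engineering trick above can be sketched in a few lines of numpy (a hypothetical ring dataset, not the one from the slides): a linear classifier fails on the raw coordinates but works once we append x1² and x2² by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "ring" data: inner disk = class 0, outer ring = class 1.
n = 200
r = np.concatenate([rng.uniform(0.0, 1.0, n), rng.uniform(1.5, 2.0, n)])
theta = rng.uniform(0, 2 * np.pi, 2 * n)
X = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
y = np.concatenate([np.zeros(n), np.ones(n)])

def train_logreg(X, y, lr=0.1, epochs=2000):
    # Plain logistic regression trained by full-batch gradient descent.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def accuracy(X, y, w, b):
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))

# Raw coordinates: the decision boundary is a line, so it fails.
w1, b1 = train_logreg(X, y)
acc_raw = accuracy(X, y, w1, b1)

# Hand-made features x1^2, x2^2 make the classes linearly separable.
X_feat = np.hstack([X, X ** 2])
w2, b2 = train_logreg(X_feat, y)
acc_feat = accuracy(X_feat, y, w2, b2)
```

In the squared-feature space the boundary is just a threshold on x1² + x2², i.e. a circle in the original coordinates.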
Toy experiment
• We know that an SVM will probably work well on the previous problem
• How about an MLP (multilayer perceptron)?
• It works, with 100% accuracy
• https://github.com/theeluwin/kata/blob/master/tasks/Universal%20Approximation%20Theorem/uat.py
Toy experiment
• The MLP actually learned a non-linear decision boundary without any feature engineering!
[figure: the data and the learned decision boundary]
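A minimal numpy sketch of the same experiment (not the linked uat.py script; the data and architecture here are illustrative assumptions): a one-hidden-layer MLP learns the non-linear boundary with no hand-made features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ring data again: inner disk = class 0, outer ring = class 1.
n = 200
r = np.concatenate([rng.uniform(0.0, 1.0, n), rng.uniform(1.5, 2.0, n)])
theta = rng.uniform(0, 2 * np.pi, 2 * n)
X = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
y = np.concatenate([np.zeros(n), np.ones(n)])

# One hidden layer of tanh units, logistic output, full-batch gradient descent.
H = 16
W1 = rng.normal(0, 0.5, (2, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.5, H);      b2 = 0.0

def forward(X):
    h = np.tanh(X @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    return h, p

def loss(p):
    p = np.clip(p, 1e-9, 1 - 1e-9)  # guard the logs against saturation
    return -float(np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

_, p0 = forward(X)
initial_loss = loss(p0)

lr = 0.5
for _ in range(3000):
    h, p = forward(X)
    d_out = (p - y) / len(y)                      # dL/d(output logit)
    gW2 = h.T @ d_out; gb2 = d_out.sum()
    d_h = np.outer(d_out, W2) * (1 - h ** 2)      # backprop through tanh
    gW1 = X.T @ d_h; gb1 = d_h.sum(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

_, p = forward(X)
final_loss = loss(p)
accuracy = float(np.mean((p > 0.5) == (y == 1)))
```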
Universal Approximation Theorem
• The theorem states that, long story short, an MLP can represent ANY given (nice) function
• Formally proved by G. Cybenko in 1989 for sigmoidal activation functions
• K. Hornik proved that other activation functions work too
• therefore, theoretically, an MLP can learn ANYTHING
• “it’s not working in practice” ← it will, if you have a zillion neurons in a zillion layers, and infinite data
• now we can say deep learning is a universal learning algorithm
• then why do we need things like CNNs and RNNs? because we don’t have infinite data
Universal Approximation Theorem
• We’ll prove this theorem
• We can’t skip maths forever!
• let’s use Times New Roman font
Analysis 101
• In Euclidean space ℝn, we call the following set an open ball
• B(x, ε) = {y ∈ ℝn | d(x, y) < ε}
• for given x ∈ ℝn and ε ∈ ℝ+
• the function d can be any metric, but usually we use
• d(x, y) = l2(x - y) = ||x - y||2
• also known as the Euclidean distance
• We call a set A ⊂ ℝn open if for ∀x ∈ A, ∃ε ∈ ℝ+ such that B(x, ε) ⊂ A
• We generalize this concept of open set by the open sets themselves, which we call a topology
Topology 101
• For some set X, we say 𝒯 is a topology on X if:
• 𝒯 is a family of subsets of X
• that is, 𝒯 ⊂ 𝒫(X)
• ∅, X ∈ 𝒯
• any union of elements of 𝒯 is an element of 𝒯
• any finite intersection of elements of 𝒯 is an element of 𝒯
• Members of 𝒯 are often called open sets
• for example, with X = ℝn, we can give a topology 𝒯 by collecting all the open sets defined in the previous slide
• note that this is a typical non-constructive definition!
• we can’t specify all members of 𝒯, though we know that they exist just fine
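For a finite set we can verify the three axioms exhaustively; a small sketch with the discrete topology (every subset is open):

```python
from itertools import combinations

X = frozenset({0, 1, 2})

def powerset(s):
    # All subsets of s, as frozensets.
    s = list(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

T = set(powerset(X))  # the discrete topology on X: T = P(X)

# Axiom 1: the empty set and X itself are open.
assert frozenset() in T and X in T

# Axioms 2 & 3: unions and (finite) intersections of open sets stay open.
for A in T:
    for B in T:
        assert A | B in T
        assert A & B in T
```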
Topology 101
• Note that topology is motivated by generalizing the open set!
• for example, in analysis, we say a sequence {an} converges to a if for ∀ε > 0, ∃N s.t. for ∀n > N, |an - a| < ε holds
• in topology, we say a sequence {an} converges to a if for any open set O ∈ 𝒯 containing a, ∃N s.t. for ∀n > N, an ∈ O holds
• x ∈ X is called a limit point of A (for some given A ⊂ X) if for any open set O ∈ 𝒯 containing x, O∩A ≠ ∅ holds
• we denote by A’ the set of limit points of A
• Ā is called the closure of A, defined by Ā = A∪A’
• we say A is dense in X if Ā = X
• THIS is the true meaning of dense!
• dense dance ~
Fourier Analysis 101
• Easy example: any irrational number is a limit point of ℚ (why?), so the closure of ℚ is equal to ℝ, or, ℚ is dense in ℝ
• One of the most useful applications of the concept of density is the Fourier series/transform:
• trigonometric polynomials (a.k.a. Fourier series) are dense in L2
• this is non-trivial when the support is not compact
• how about Lp?
• the Fourier transform can be defined on Lp
• this is also non-trivial
• it can be proved by the following steps:
• show that the Schwartz class is dense in Lp
• extend the domain to Lp by completion
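A numerical taste of this density: the partial sums of the Fourier series of f(x) = x on [-π, π] (a standard textbook example) get closer to f in the L2 sense as we keep more terms.

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 4001)
f = x  # target function on [-pi, pi]

def fourier_partial_sum(x, N):
    # Known Fourier series of f(x) = x on [-pi, pi]:
    #   x = 2 * sum_{n>=1} (-1)**(n+1) * sin(n x) / n
    s = np.zeros_like(x)
    for n in range(1, N + 1):
        s += 2 * (-1) ** (n + 1) * np.sin(n * x) / n
    return s

def l2_error(g):
    # RMS distance on the grid, a discrete stand-in for the L2 norm.
    return float(np.sqrt(np.mean((f - g) ** 2)))

# The error shrinks as the trigonometric polynomial uses more terms.
errs = [l2_error(fourier_partial_sum(x, N)) for N in (1, 5, 50)]
```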
Topology 101
• We say a topological space X is compact if every open cover has a finite subcover
• an open cover is just a collection of open sets whose union contains the space
• Actually, talking about compactness requires a lot of time
• For Euclidean space, we can think of compact as closed and bounded
• this is not trivial: see the Heine-Borel theorem
• bounded means that the set can be contained in some big open ball
• closed is just the opposite of open; A is a closed set when X\A is open
Linear Algebra 101
• For a vector space V over a field F, we call || • ||: V → ℝ a norm of a vector if
• for all a ∈ F and all u, v ∈ V,
• ||av|| = |a| ||v||
• ||u + v|| ≤ ||u|| + ||v||
• ||u|| = 0 implies u = 0
• Examples
• lp-norm: lp(x) = (∑ |xi|^p)^(1/p)
• Lp-norm: Lp(f) = (∫ |f(x)|^p dx)^(1/p)
• Note that the set of real-valued functions is a vector space! functions are vectors!
• (3f + g) is a function (vector) defined by (3f + g)(x) = 3f(x) + g(x)
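The "functions are vectors" point can be made concrete: pointwise addition and scaling give the vector-space operations (a toy sketch with hypothetical f and g).

```python
# Functions form a vector space: addition and scaling are defined pointwise.
def f(x):
    return x * x

def g(x):
    return 2 * x + 1

def add(u, v):
    # (u + v)(x) = u(x) + v(x)
    return lambda x: u(x) + v(x)

def scale(a, u):
    # (a * u)(x) = a * u(x)
    return lambda x: a * u(x)

h = add(scale(3, f), g)  # the vector h = 3f + g
assert h(2) == 3 * f(2) + g(2)
```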
Linear Algebra 101
• For a vector space V over a field F, we call <•, •>: V × V → F an inner product of two vectors if
• for all a ∈ F and all x, y, z ∈ V,
• <x, y> is the conjugate of <y, x>
• <ax + y, z> = a<x, z> + <y, z>
• <x, x> ≥ 0, and <x, x> = 0 implies x = 0
• Note that we can induce a canonical (trivial) norm from an inner product by ||x|| = √<x, x>
• Example: the inner product of two functions, <f, g> = ∫ f(x)g(x) dx
Linear Algebra 101
• Function space - just a vector space
• which means, its elements, or vectors, are functions
• mostly we’ll deal with continuous, real-valued functions
• Typical examples of function spaces are:
• linear functionals (the dual space)
• the Schwartz class
• Banach spaces, Hilbert spaces
• Reproducing Kernel Hilbert Space (RKHS)
• this is a really really really important concept! mandatory for understanding regularization and transfer learning
• we’ll cover it someday…
• The domain of the functions (vectors) is often called the support
Analysis 101
• A set M is called a metric space if a metric d is given, satisfying
• d: M × M → ℝ, and for all x, y, z in M,
• d(x, y) ≥ 0
• d(x, y) = 0 if and only if x = y
• d(x, y) = d(y, x)
• d(x, z) ≤ d(x, y) + d(y, z)
• Obviously, if some vector space is a normed space, then it has a canonical (trivial) metric d induced by d(x, y) = ||x - y||
• Another obvious one: a metric space has a canonical (trivial) topology induced by its open balls
• For example, the Kullback-Leibler divergence is not a metric
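Why KL fails to be a metric is easy to check numerically: it isn't even symmetric. A quick sketch with two hypothetical discrete distributions:

```python
import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence between two discrete distributions.
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])

# Symmetry fails, so KL violates the metric axioms.
assert abs(kl(p, q) - kl(q, p)) > 1e-6
```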
Analysis 101
• A sequence {an} in a metric space M is said to be Cauchy if for ∀ε > 0, ∃N s.t. for ∀n, m > N, d(an, am) < ε holds
• A metric space M is said to be complete if every Cauchy sequence converges
• a very, very typical complete space: ℝ
• but this is not trivial
• for example, a rational sequence {an} satisfying an ∈ [π - 1/n, π + 1/n] is Cauchy, but it never converges in ℚ!
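A quick illustration with rational truncations of π (one concrete choice of such a sequence): the terms get arbitrarily close to each other, yet the limit lies outside ℚ.

```python
from fractions import Fraction
from math import pi

# Rational truncations of pi: a_n = floor(pi * 10**n) / 10**n, each in Q.
def a(n):
    return Fraction(int(pi * 10 ** n), 10 ** n)

# The sequence is Cauchy: |a_n - a_m| < 10**-(min(n, m) - 1) ...
for n in range(1, 10):
    for m in range(n, 10):
        assert abs(a(n) - a(m)) < Fraction(1, 10 ** (n - 1))

# ... but its limit is pi, which is irrational, so the sequence never
# converges *within* Q: the rationals are not complete.
```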
Analysis 101
• In a function space whose functions have a metric codomain, we say a sequence {fn} converges uniformly to f if for ∀ε > 0, ∃N s.t. for ∀n > N, d(fn(x), f(x)) < ε holds for all x in the support
• why uniform? because N is determined independently of x
• which means, N is a function of ε alone
• if N is determined by both ε and x, then we call it point-wise convergence
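The classic example is fn(x) = x**n on [0, 1): it converges pointwise to 0, but the N required for x**n < ε blows up as x → 1, so the convergence is not uniform. A small sketch:

```python
import math

# f_n(x) = x**n -> 0 pointwise on [0, 1), but not uniformly:
# the threshold N depends on x, and it explodes as x approaches 1.
eps = 0.01

def smallest_N(x):
    # Smallest n with x**n < eps, i.e. n > log(eps) / log(x).
    return math.ceil(math.log(eps) / math.log(x))

Ns = [smallest_N(x) for x in (0.5, 0.9, 0.99, 0.999)]
```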
Real Analysis 101
• We say a function f is measurable if f−1(A) is measurable for any open set A
• maybe we should stop here…
• if someone meets a non-measurable function by chance, it’ll be one of the unluckiest days ever
• one old mathematician said, “If someone built an airplane using a non-measurable function, then I’ll not ride that plane”
• Anyway, a measure is something that measures the area of a set
• for example, µ([a, b]) = b - a holds, as intuition suggests, in ℝ with the Lebesgue measure µ
• we say some property p holds almost everywhere in A if the measure of {x ∈ A | ¬p(x)} is 0
• note that probability is a measure of an event!
All together
• Want to show: “the set of MLPs is uniformly dense in the measurable normed function space on every compact support”
• at last! now we can read this statement
• Let’s prove this theorem
Notations
• For some natural number r, let Ar be the set of all affine functions from ℝr to ℝ, where an affine function has the form A(x) = wTx + b
• For any measurable function G from ℝ to ℝ, let ∑r(G) be the class of functions {f: ℝr → ℝ, f(x) = ∑j 𝛽jG(Aj(x)), x ∈ ℝr, 𝛽j ∈ ℝ, Aj ∈ Ar, finite sum}
• note that ∑r(G) is the family of single-hidden-layer feedforward neural networks with one output, having G as the activation function
• actually, we’re talking about Borel measurability, not just any measure
• For any measurable function G from ℝ to ℝ, let ∑∏r(G) be the class of functions {f: ℝr → ℝ, f(x) = ∑j 𝛽j∏k G(Ajk(x)), x ∈ ℝr, 𝛽j ∈ ℝ, Ajk ∈ Ar, finite sum and finite product}
• a somewhat complex but more general form of MLP
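An element of ∑r(G) is literally a one-hidden-layer network; a toy numpy sketch (the weights here are arbitrary) checks that the vectorized form matches the finite sum ∑j 𝛽jG(Aj(x)).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_r_element(W, b, beta, G):
    # An element of Sigma^r(G): f(x) = sum_j beta_j * G(w_j . x + b_j),
    # where A_j(x) = w_j . x + b_j are the affine functions.
    return lambda x: float(beta @ G(W @ x + b))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))     # four affine maps A_j: R^2 -> R
b = rng.normal(size=4)
beta = rng.normal(size=4)       # output weights beta_j
f = sigma_r_element(W, b, beta, sigmoid)

# The vectorized network equals the explicit finite sum, term by term.
x = np.array([0.3, -1.2])
manual = sum(beta[j] * sigmoid(W[j] @ x + b[j]) for j in range(4))
assert abs(f(x) - manual) < 1e-12
```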
Notations
• A function 𝛹: ℝ → [0, 1] is a squashing function if it is
• non-decreasing
• its limit at +∞ is 1, and at -∞ is 0
• examples: the positive indicator function, the standard sigmoid function
• Cr = the set of continuous functions from ℝr to ℝ
• if G is continuous, both ∑r(G) and ∑∏r(G) belong to Cr
• Mr = the set of measurable functions from ℝr to ℝ
• if G is measurable, both ∑r(G) and ∑∏r(G) belong to Mr
• note that Cr is a subset of Mr
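The squashing conditions are easy to sanity-check numerically for the two examples above (on a finite grid, which is of course only a spot check):

```python
import math

# Two squashing functions: the standard sigmoid and the positive indicator.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def indicator(x):
    return 1.0 if x >= 0 else 0.0

for psi in (sigmoid, indicator):
    # non-decreasing on a sample grid of [-10, 10]
    xs = [i / 10 for i in range(-100, 101)]
    ys = [psi(x) for x in xs]
    assert all(a <= b for a, b in zip(ys, ys[1:]))
    # limits: psi(x) -> 0 as x -> -inf, psi(x) -> 1 as x -> +inf
    assert psi(-50) < 1e-6 and psi(50) > 1 - 1e-6
```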
Notations
• A subset S of Cr is said to be uniformly dense on compacta in Cr if for every compact subset K ⊂ ℝr, S is dK-dense in Cr, where
• dK(f, g) = supx∈K |f(x) - g(x)|
• uniform convergence should hold
• Given a measure µ on (ℝr, Br), we define the metric
• dµ(f, g) = inf{ε > 0: µ{x: |f(x) - g(x)| > ε} < ε}
• so dµ(f, g) = 0 if and only if f and g agree almost everywhere
• Br is the Borel 𝜎-field of ℝr (we omit the detailed explanation)
• note that this measure µ is a kind of input space environment (think of it as the probability of an event)
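dK can be approximated on a grid over K; a small sketch with K = [0, 1] and two toy functions, where supx |x - x²| = 1/4 at x = 1/2:

```python
import numpy as np

# The sup metric d_K(f, g) = sup_{x in K} |f(x) - g(x)|, approximated
# on a fine grid over the compact set K = [0, 1].
K = np.linspace(0.0, 1.0, 10001)

def d_K(f, g):
    return float(np.max(np.abs(f(K) - g(K))))

# f(x) = x and g(x) = x**2 differ most at x = 1/2, where |x - x**2| = 1/4.
dist = d_K(lambda x: x, lambda x: x ** 2)
```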
Stone-Weierstrass Theorem
• A family A of real functions defined on E is an algebra if A is closed under addition, multiplication, and scalar multiplication
• A family A separates points on E if for every x, y in E with x ≠ y, ∃f ∈ A s.t. f(x) ≠ f(y)
• A family A vanishes at no point of E if for each x in E, ∃f ∈ A s.t. f(x) ≠ 0
• statement if A is an algebra of continuous functions on a compact metric space K that separates points on K and vanishes at no point of K, then A is dK-dense in C(K)
• the proof is too long to include here
• see the references for the full proof
Sketch of the proof
1. Apply the Stone-Weierstrass theorem to ∑∏r(G)
2. Extend uniform density on compacta to density in the sense of measure
3. For squashing functions, approximate by a continuous squashing function
4. For continuous squashing functions, approximate by a Fourier series
5. Remove the ∏ by the cosine product-to-sum rule
Theorem 1
• statement let G be any continuous non-constant function from ℝ to ℝ, then ∑∏r(G) is uniformly dense on compacta in Cr
• proof let K ⊂ ℝr be any compact set; then for any such G, ∑∏r(G) is obviously an algebra on K which separates points and vanishes at no point, since we collected all affine functions of the form A(x) = wTx + b, so we can apply the Stone-Weierstrass theorem
Lemma 1
• statement if {fn} is a sequence of functions in Mr that converges uniformly on compacta to a function f, then dµ(fn, f) → 0
• by this lemma, extending from Cr to Mr becomes really easy!
• note that our activation functions are not always continuous
• but they’re surely measurable, because we don’t want to build any non-measurable MLP - it would be a terror!
• proof pick a big compact set (so that its measure is almost as large as that of the whole space), find a large N from the uniform convergence, then apply Chebyshev’s inequality
• fn will be close to f on the compact set, and the remaining part of the support is small, so the integral in Chebyshev’s inequality becomes tiny
• we’re working in ℝr, which is a locally compact metric space with a regular measure, so many problems are relatively simple
Theorem 2
• statement let G be any continuous non-constant function from ℝ to ℝ, then ∑∏r(G) is dµ-dense in Mr for any finite measure µ
• proof immediately follows from Theorem 1 and Lemma 1 once we know that for any finite measure µ, Cr is dµ-dense in Mr
• this can be proved in a similar way to Lemma 1
• So the activation function doesn’t really have to be continuous! we can use the positive indicator function as the activation function
Theorem 3
• statement let 𝛹 be any squashing function, then ∑∏r(𝛹) is uniformly dense on compacta in Cr and dµ-dense in Mr for any probability measure µ
• proof show that ∑∏r(𝛹) is uniformly dense on compacta in ∑∏r(F), where F is a continuous squashing function
• any continuous squashing function F can be uniformly approximated by some element H of ∑1(𝛹), in the sense of supx∈ℝ |F(x) - H(x)| < ε for ∀ε > 0
• if the products ∏k F(Ak(x)) can be uniformly approximated by members of ∑∏r(𝛹), then we’re done
• these approximations are done by analytic (rather dirty) work
• the rest is completed by Lemma 1 and Theorem 2
Theorem 4
• statement let 𝛹 be any squashing function, then ∑r(𝛹) is uniformly dense on compacta in Cr and dµ-dense in Mr for any probability measure µ
• proof uses the fact that a continuous squashing function can be uniformly approximated by its Fourier series on each compact support, and note that cos A cos B = (cos(A + B) + cos(A - B)) / 2
• wow! that’s really all!
• we removed the ∏ part by uniformly approximating 𝛹 with some continuous squashing function (using Theorem 3), which can in turn be uniformly approximated by its Fourier series on each compact set (for more details, see the reference book), and some cosine rules
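The product-to-sum identity doing the work here can be spot-checked numerically:

```python
import math

# Product-to-sum identity used to remove the product in Theorem 4:
#   cos(A) * cos(B) = (cos(A + B) + cos(A - B)) / 2
# so a product of cosines is again a finite *sum* of cosines.
for A in (0.1, 1.3, 2.7):
    for B in (0.4, 1.9):
        lhs = math.cos(A) * math.cos(B)
        rhs = (math.cos(A + B) + math.cos(A - B)) / 2
        assert abs(lhs - rhs) < 1e-12
```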
Corollaries
• The rest is just some natural corollaries, like
• it can be extended to general Lp spaces
• it can be extended to multi-output MLPs
• it can be extended to multi-layer MLPs
• hmm? actually, Theorem 4 claims that a 1-hidden-layer NN is already a universal approximator
Conclusion
• We now know that the MLP actually works, theoretically!
• MLPs are universal approximators
• Taylor series and Fourier series are also good universal approximators, but those two require detailed information about the original function (or a computable oracle), while an MLP can be learned using only the training data
• me?
• Jamie Seol
• email - [email protected]
• twitter - @theeluwin
• website - theeluwin.kr
• blog - blog.theeluwin.kr
References
• Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. "Multilayer feedforward networks are universal approximators." Neural Networks 2.5 (1989): 359-366.
• Cybenko, George. "Approximation by superpositions of a sigmoidal function." Mathematics of Control, Signals and Systems 2.4 (1989): 303-314.
• Rudin, Walter. Principles of Mathematical Analysis. Vol. 3. New York: McGraw-Hill, 1964.
• Rudin, Walter. Real and Complex Analysis. Tata McGraw-Hill Education, 1987.
• Munkres, James R. Topology. Prentice Hall, 2000.
• Gockenbach, Mark S. Finite-Dimensional Linear Algebra. CRC Press, 2011.
• Young, Matt. "The Stone-Weierstrass Theorem." (2006).
• Stein, Elias M., and Rami Shakarchi. "Fourier Analysis, Princeton Lectures in Analysis I." (2003).