Iterated geometric harmonics for missing data recovery
Jonathan A. Lindgren, Erin P. J. Pearse, and Zach Zhang
{jlindgre, epearse, zazhang}@calpoly.edu
California Polytechnic State University
San Luis Obispo, CA
Nov. 14, 2015
Motivation: the missing data problem
Introduction and background
The missing data problem
Missing data is often a problem. Data can be lost
- while recording measurements,
- during storage or transmission,
- due to equipment failure,
- ...
Existing techniques:
- require some records (rows) to be complete, or
- require some characteristics (columns) to be complete, or
- are based on linear regression.
(But data often has highly nonlinear internal structure!)
A dataset is a collection of vectors, stored as a matrix
The data is an n × p matrix. Each row is a vector of length p; one row is a record and each column is a parameter or coordinate.
[Diagram: a matrix with n records (rows) and p characteristics (columns); one row is one record.]
EXAMPLES
- 36 photos, each of size 112 pixels × 92 pixels: $\{v_k\}_{k=1}^{36} \subseteq \mathbb{R}^{10{,}304}$. (Each photo is stored as a vector.)
- Results from a psychology experiment, a 50-question exam given to 200 people: $\{v_k\}_{k=1}^{200} \subseteq \mathbb{R}^{50}$.
- 3000 student records (SAT, ACT, GPA, class rank, etc.): $\{v_k\}_{k=1}^{3000} \subseteq \mathbb{R}^{20}$.
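As a rough illustration of the first example, the sketch below stacks 36 grayscale images of 112 × 92 pixels into a 36 × 10,304 data matrix. The pixel values are random placeholders, not the photos used in the talk, and the variable names are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder "photos": 36 grayscale images of 112 x 92 pixels.
# Random values stand in for real image data.
photos = rng.random((36, 112, 92))

# Flatten each photo into one row vector, giving an n x p data matrix.
data = photos.reshape(len(photos), -1)
n, p = data.shape
print(n, p)  # 36 10304
```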
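Special case of the missing data problem
Suppose all missing data are in one column:
\[
\begin{bmatrix}
v_1 & \ast \\
v_2 & f_2 \\
v_3 & \ast \\
\vdots & \vdots \\
v_n & f_n
\end{bmatrix}
\]
where $\ast$ marks a missing value.
Consider the last column as a function $f : \{1, 2, \dots, n\} \to \mathbb{R}$.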
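Out-of-sample extension of an empirical function
Idea: A function f is defined on a subset Γ of the dataset.
$f : \Gamma \to Y$, where $\Gamma \subseteq \mathbb{R}^p$ is the set where the value of f is known.
Want to extend f to $F : X \to Y$ so that $F|_\Gamma(x) = f(x)$ for $x \in \Gamma$.
[Figure: f defined on the subset Γ of X, and its extension F defined on all of X.]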
Application: The data is a sample $\{(x, f(x))\}_{x \in \Gamma}$.
Example: X may be a collection of images or documents, and $Y = \mathbb{R}$.
Want to generalize to as-yet-unseen instances in X.
“function extension” ←→ “automated sorting”
⟹ machine learning / manifold learning
A solution: Geometric harmonics
The network model associated to a dataset
Similarities within data are modeled via nonlinearity
Introduce a nonlinear kernel function k to model the similarity between two vectors:
\[
k(v,u)
\begin{cases}
\approx 0, & \text{$v$ and $u$ very different,} \\
\approx 1, & \text{$v$ and $u$ very similar.}
\end{cases}
\]
Two possible choices of such a kernel function:
\[
k(v,u) = e^{-\|v-u\|_2^2/\varepsilon}
\qquad\text{or}\qquad
k(v,u) = |\operatorname{Corr}(v,u)|^m .
\]
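A minimal sketch of the two kernel choices, for illustration only. The function names, the default values `eps=1.0` and `m=2`, and the use of the Pearson correlation for Corr are assumptions made here; the slides do not specify them.

```python
import numpy as np

def gaussian_kernel(v, u, eps=1.0):
    """exp(-||v - u||_2^2 / eps): 1 for identical vectors, near 0 for distant ones."""
    diff = np.asarray(v, dtype=float) - np.asarray(u, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / eps))

def correlation_kernel(v, u, m=2):
    """|Corr(v, u)|^m, taking Corr to be the Pearson correlation (assumption)."""
    corr = np.corrcoef(v, u)[0, 1]
    return float(abs(corr) ** m)

v = np.array([1.0, 2.0, 3.0, 4.0])
u_near = v + 0.01                          # small perturbation of v
u_far = np.array([40.0, -3.0, 25.0, 0.0])  # far from v in Euclidean distance

print(gaussian_kernel(v, u_near), gaussian_kernel(v, u_far))
print(correlation_kernel(v, u_near), correlation_kernel(v, u_far))
```

Both kernels are symmetric in their arguments, which is what makes the matrix K symmetric later on.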
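Convert the dataset into a network
Goal: replace the original dataset in $\mathbb{R}^{n \times p}$ with a similarity network.
- Network = connected weighted undirected graph.
- Similarity network = weights represent similarities.
- Vector $v_i$ ⟶ vertex $v_i$ in the network.
[Diagram: the kernel k maps the vectors v1, v2, v3, v4 to a weighted graph with edges v1-v2 (weight 4), v1-v3 (weight 2), v2-v3 (weight 3), and v3-v4 (weight 1).]
The adjacency matrix, with rows and columns indexed by v1, v2, v3, v4:
\[
K =
\begin{bmatrix}
0 & 4 & 2 & 0 \\
4 & 0 & 3 & 0 \\
2 & 3 & 0 & 1 \\
0 & 0 & 1 & 0
\end{bmatrix}
\]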
Efficiency gain: the n × p data matrix is replaced by the n × n adjacency matrix K, with entries
\[
K_{i,j} := k(v_i, v_j).
\]
Advantageous for high-dimensional datasets: p ≫ n.
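To make the size reduction concrete, here is a sketch that builds the n × n Gaussian kernel matrix from an n × p data matrix with p ≫ n. The random data, the helper name `pairwise_sq_dists`, and the choice of ε as the median squared distance are assumptions for illustration. Note that a kernel matrix built this way has ones on its diagonal, unlike the small adjacency-matrix example with a zero diagonal.

```python
import numpy as np

def pairwise_sq_dists(data):
    """Squared Euclidean distances between all pairs of rows."""
    sq_norms = np.sum(data ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (data @ data.T)
    return np.maximum(d2, 0.0)  # clip tiny negative round-off

rng = np.random.default_rng(0)
data = rng.random((36, 10304))  # n = 36 records, p = 10,304 coordinates

d2 = pairwise_sq_dists(data)
eps = np.median(d2)             # heuristic kernel scale (assumption)
K = np.exp(-d2 / eps)           # K[i, j] = k(v_i, v_j)

print(data.shape, "->", K.shape)
```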
Geometric harmonics
Coifman and Lafon introduced the machine learning tool “geometric harmonics” in 2005.
Idea: the eigenfunctions of a diffusion operator can be used to perform global analysis of the dataset and of functions on a dataset.
Geometric harmonics: construction and definition
For the matrix K with $K_{u,v} = k(u,v)$, consider the integral operator $f \mapsto Kf$ given by
\[
(Kf)(u) := \sum_{v \in \Gamma} K_{u,v} f(v), \qquad u \in X.
\]
(“Restricted matrix multiplication”)
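Diagonalize the restricted matrix $[K]_{u,v \in \Gamma}$ via
\[
\sum_{v \in \Gamma} K_{u,v} \psi_j(v) = \lambda_j \psi_j(u), \qquad u \in \Gamma.
\]
NOTE: k symmetric ⟹ K symmetric ⟹ $\{\psi_j\}$ form an orthonormal basis (ONB).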
[Nyström] Reverse this equation to define values off Γ:
\[
\Psi_j(u) := \frac{1}{\lambda_j} \sum_{v \in \Gamma} K_{u,v} \psi_j(v), \qquad u \in X.
\]
$\{\Psi_j\}_{j=1}^{n}$ are the geometric harmonics, where $n = |\Gamma|$.
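The construction above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the Gaussian kernel, the random sample points, and the function names are assumptions. It also discards eigenvalues below a relative cutoff `rel_tol` (an assumption added here), because dividing by very small eigenvalues is numerically unstable; the slides define a harmonic for every j.

```python
import numpy as np

def gaussian_kernel(A, B, eps):
    """Matrix of exp(-||a - b||^2 / eps) for rows a of A and rows b of B."""
    d2 = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / eps)

def geometric_harmonics(X, gamma_idx, eps, rel_tol=1e-6):
    """Return (lam, psi, Psi): eigenpairs on Gamma and their Nystrom extensions to X."""
    Gamma = X[gamma_idx]
    lam, psi = np.linalg.eigh(gaussian_kernel(Gamma, Gamma, eps))  # orthonormal columns
    keep = lam > rel_tol * lam.max()        # stability cutoff (assumption)
    lam, psi = lam[keep], psi[:, keep]
    # Psi_j(u) = (1 / lam_j) * sum_{v in Gamma} K[u, v] * psi_j(v), for every u in X
    Psi = gaussian_kernel(X, Gamma, eps) @ psi / lam
    return lam, psi, Psi

rng = np.random.default_rng(0)
X = rng.random((50, 3))        # 50 sample points in R^3
gamma_idx = np.arange(30)      # f is assumed known on the first 30 points
lam, psi, Psi = geometric_harmonics(X, gamma_idx, eps=0.5)

# On Gamma, each extended harmonic should agree with its eigenvector.
print(np.max(np.abs(Psi[gamma_idx] - psi)))
```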
Geometric harmonics: the extension algorithm
For $f : \Gamma \to Y$ and $n = |\Gamma|$, define
\[
F(x) = \sum_{j=1}^{n} \langle f, \psi_j \rangle_\Gamma \, \Psi_j(x), \qquad x \in X.
\]
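For $x \in \Gamma$, $\Psi_j(x) = \psi_j(x)$, so
\[
F(x) = \sum_{j=1}^{n} \langle f, \psi_j \rangle_\Gamma \, \Psi_j(x)
     = \sum_{j=1}^{n} \langle f, \psi_j \rangle_\Gamma \, \psi_j(x)
     = f(x),
\]
since this is just the decomposition of f in the ONB $\{\psi_j\}_{j=1}^{n}$.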
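The interpolation property can be checked numerically. The sketch below is an illustration under assumptions made here: Γ is a set of ten equally spaced points on the real line, the kernel is Gaussian with ε = 1, and f is a sine curve. These choices keep the kernel matrix well conditioned, so all n harmonics can be used without dividing by tiny eigenvalues.

```python
import numpy as np

def gaussian_kernel(a, b, eps=1.0):
    """Kernel matrix exp(-(a_i - b_j)^2 / eps) for 1-D sample points."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / eps)

gamma = np.arange(10.0)                  # Gamma: points where f is known
new_pts = gamma[:-1] + 0.5               # points of X outside Gamma
X = np.concatenate([gamma, new_pts])
f = np.sin(gamma / 2.0)                  # values of f on Gamma

lam, psi = np.linalg.eigh(gaussian_kernel(gamma, gamma))
Psi = gaussian_kernel(X, gamma) @ psi / lam   # geometric harmonics on all of X
coeffs = psi.T @ f                            # inner products <f, psi_j> on Gamma
F = Psi @ coeffs                              # extension of f to X

print(np.max(np.abs(F[:10] - f)))                        # error on Gamma
print(np.max(np.abs(F[10:] - np.sin(new_pts / 2.0))))    # error off Gamma
```

The first printed value should be at round-off level, reflecting F restricted to Γ being equal to f; the second depends on the kernel scale and on how smooth f is.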
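Geometric harmonics: limitations
Geometric harmonics does not apply to missing data.
Consider $f : \Gamma \to \mathbb{R}$ as an extra column with holes:
[Diagram: the rows v1, v2, v3, ..., vn with an extra column f that has missing entries.]
Geometric harmonics requires the first p columns to be complete.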
Iterated geometric harmonics
Iterated geometric harmonics: basic idea
Underlying assumption of geometric harmonics: data are samples from a submanifold.
Restated as a continuity assumption: if p − 1 entries of u and v are very close, then so is the pth.
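Idea: Consider the jth column to be a function of the others.
\[
\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}
\longrightarrow
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1j} & \cdots & a_{1p} \\
a_{21} & a_{22} & \cdots & a_{2j} & \cdots & a_{2p} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nj} & \cdots & a_{np}
\end{bmatrix}
\]
Geometric harmonics can be applied to the jth column.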
Iterated geometric harmonics: the iteration scheme
1. Record locations of missing values in the dataset.
2. Stochastically impute missing values.
   Drawn from N(μ, σ²), computed columnwise.
3. Iteration through columns.
   (a) Choose (at random) a column to update.
   (b) “Unlock” entries of column to be imputed.
   (c) Use geometric harmonics to update those entries.
       - Current column is treated as a function of the others.
       - New values are initially computed in terms of poor guesses.
       - Successive passes improve guesses.
   (d) Continue until all columns are updated.
4. Repeat iteration until updates cause negligible change.
   Process typically stabilizes after about 4 cycles.
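The four steps above can be sketched as follows. This is a simplified illustration and not the authors' implementation. The Gaussian kernel, the median-distance scale `eps`, the eigenvalue cutoff `rel_tol`, the centering of each column before extension, the stopping rule, and all function names are assumptions made here. Missing values are assumed to be marked with NaN, and each column is assumed to have at least two observed values.

```python
import numpy as np

def sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B."""
    d2 = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.maximum(d2, 0.0)

def gh_extend(features, known, f_known, rel_tol=1e-3):
    """Extend f from the rows flagged in `known` to all rows of `features`."""
    G = features[known]
    d2_gg = sq_dists(G, G)
    eps = np.median(d2_gg[d2_gg > 0])          # heuristic kernel scale (assumption)
    lam, psi = np.linalg.eigh(np.exp(-d2_gg / eps))
    keep = lam > rel_tol * lam.max()           # drop unstable small eigenvalues (assumption)
    lam, psi = lam[keep], psi[:, keep]
    Psi = np.exp(-sq_dists(features, G) / eps) @ psi / lam   # Nystrom extension
    mean = f_known.mean()                      # centering is a choice made in this sketch
    return mean + Psi @ (psi.T @ (f_known - mean))

def iterated_gh(data, max_cycles=10, tol=1e-3, seed=0):
    """Fill the NaN entries of `data` by iterating geometric harmonics over columns."""
    rng = np.random.default_rng(seed)
    X = np.array(data, dtype=float)
    missing = np.isnan(X)                      # step 1: record missing locations
    for j in range(X.shape[1]):                # step 2: draw from N(mu, sigma^2) per column
        obs = X[~missing[:, j], j]
        X[missing[:, j], j] = rng.normal(obs.mean(), obs.std(), missing[:, j].sum())
    for _ in range(max_cycles):                # step 4: repeat until change is negligible
        previous = X.copy()
        for j in rng.permutation(X.shape[1]):  # step 3(a): columns in random order
            hole = missing[:, j]               # step 3(b): entries to re-impute
            if not hole.any():
                continue
            others = np.delete(X, j, axis=1)   # column j as a function of the others
            F = gh_extend(others, ~hole, X[~hole, j])
            X[hole, j] = F[hole]               # step 3(c): update only unlocked entries
        if np.linalg.norm(X - previous) <= tol * np.linalg.norm(previous):
            break
    return X

# Toy data: points along a smooth curve, with about 10% of entries removed.
t = np.linspace(0.0, 1.0, 120)
truth = np.column_stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t), t, t ** 2])
damaged = truth.copy()
damaged[np.random.default_rng(1).random(truth.shape) < 0.1] = np.nan

restored = iterated_gh(damaged)
print(np.abs(restored - truth)[np.isnan(damaged)].mean())  # mean error on imputed entries
```

Observed entries are never modified; only the positions recorded in step 1 are updated on each pass.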
[Figure: three image panels: damaged (70% data loss), restored, and original.]
Iterated geometric harmonics: applications
- Iterated geometric harmonics requires a continuity assumption.
  Probably not well-suited to social network analysis, etc.
- Iterated geometric harmonics requires multiple similar datapoints/records.
  Video footage is a natural application: 10–24 images per second, usually very similar.
  Applications for security, military, law enforcement.
- Iterated geometric harmonics excels when p ≫ n.
  However, it has demonstrated good performance on low-dimensional time series.
  Example: San Diego weather data (next slide).
San Diego Airport weather data
n = 2000, p = 25
[Figure: two plots against the number of GH iterations, one of L-2 error and one of standard deviation, with one curve for each legend value 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4.]
Summary
Iterated Geometric Harmonics (IGH):
- Robust data reconstruction, even at high rates of data loss.
- Well suited to high-dimensional problems, p ≫ n.
- Relies on continuity assumptions on underlying data.
- Application to image reconstruction, video footage, etc.
- Patent pending (U.S. Patent Application No.: 14/920,556).
Future work: noisy data.
Future work
From missing data to noisy data
Future work: noisy data
The problem of “noisy data” is more difficult:
before improving the data, bad values need to be located.
Current work: using Markov random fields to detect noise.
Markov random fields: another graph-based tool for data analysis.
Future work: Markov random fields
[Figure: original (noisy) data and improved data.]
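[Diagram: a graph on the improved values $a_1, \dots, a_6$, joined to one another by edges with weights $w_{ij}$; each $a_i$ is also joined to the given value $b_i$ by an edge with weight $u_i$.]
Minimize the energy functional
\[
E = \sum w_{ij} (a_i - a_j)^2 + \sum u_i (a_i - b_i)^2 ,
\]
where the first sum runs over the edges between the $a_i$, the second over the nodes, and
- $\{b_i\}$ are given,
- $w_{ij}$ are tuned by the user (and usually all equal), and
- $u_i$ are tuned by the user (and usually all equal).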
With all weights equal, minimize the energy functional
\[
E = \sum (a_i - a_j)^2 + \lambda \sum (a_i - b_i)^2 ,
\]
where $\{b_i\}$ are given, $w_{ij} = u_i = 1$, and $\lambda$ is tuned by the user.
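Because this energy is quadratic in the $a_i$, its minimizer can also be found by solving a linear system: setting the gradient to zero gives $(L + \lambda I)a = \lambda b$, where L is the graph Laplacian. The sketch below uses that direct solve instead of simulated annealing, purely as an illustration. The 2 × 3 grid of edges, the sample values in `b`, and the value of `lam` are assumptions made here.

```python
import numpy as np

# Assumed layout: a 2 x 3 grid of nodes numbered 0..5, row by row.
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5)]
b = np.array([1.0, 1.1, 0.9, 1.0, 5.0, 1.0])  # given (noisy) data; node 4 is an outlier
lam = 1.0                                     # fidelity weight, tuned by the user

def energy(a, b, edges, lam):
    """E = sum over edges of (a_i - a_j)^2  +  lam * sum over nodes of (a_i - b_i)^2."""
    smooth = sum((a[i] - a[j]) ** 2 for i, j in edges)
    return smooth + lam * np.sum((a - b) ** 2)

# Graph Laplacian L = D - A, so that a^T L a equals the smoothness term.
L = np.zeros((len(b), len(b)))
for i, j in edges:
    L[i, i] += 1.0
    L[j, j] += 1.0
    L[i, j] -= 1.0
    L[j, i] -= 1.0

# Zero gradient: 2 L a + 2 lam (a - b) = 0, i.e. (L + lam I) a = lam b.
a = np.linalg.solve(L + lam * np.eye(len(b)), lam * b)

print(a)
print(energy(b, b, edges, lam), energy(a, b, edges, lam))
```

A larger `lam` keeps the improved values closer to the given data, while a smaller one smooths more strongly across edges.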
Future work: Markov random fields
Markov random fields (MRF) use simulated annealing to solve
    minimize E given $\{b_i\}$.
Output: improved data $\{a_i\}$.
Our approach:
1. Apply MRF to find improved data $\{a_i\}$.
2. Compare $\{a_i\}$ to original data $\{b_i\}$.
3. Label nodes with large values of $|a_i - b_i|$ as missing data.
4. Apply IGH and obtain better improved data.
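Steps 2 and 3 of this approach amount to a comparison and a threshold. The sketch below is one hypothetical way to write them: the function name, the fixed `threshold`, the use of NaN to mark missing data, and the sample values standing in for MRF output are all assumptions, since the slides do not say how "large" is decided.

```python
import numpy as np

def flag_suspect_entries(a, b, threshold):
    """Mark entries of b as missing (NaN) where |a - b| exceeds `threshold`."""
    flagged = np.abs(a - b) > threshold
    cleaned = b.astype(float)   # astype returns a copy, so b is left unchanged
    cleaned[flagged] = np.nan
    return cleaned, flagged

b = np.array([1.0, 1.1, 0.9, 1.0, 5.0, 1.0])  # original (noisy) data
a = np.array([1.2, 1.5, 1.1, 1.6, 3.0, 1.5])  # hypothetical MRF output
cleaned, flagged = flag_suspect_entries(a, b, threshold=1.0)

print(flagged)  # [False False False False  True False]
print(cleaned)
```

The array `cleaned` is then in the form expected by a missing-data method such as IGH (step 4).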
Theoretical underpinnings
Reproducing kernel Hilbert spaces
Under the hood: reproducing kernel Hilbert spaces
Suppose $X \subseteq \mathbb{R}^n$ and $k : X \times X \to \mathbb{R}$ is
- nonnegative: $k(x, y) \ge 0$,
- symmetric: $k(x, y) = k(y, x)$,
- positive semidefinite: for any choice of $\{x_i\}_{i=1}^{m}$, $K_{i,j} = k(x_i, x_j)$ defines a positive semidefinite matrix.
[Aronszajn] There is a Hilbert space H of functions on X with
- $k_x := k(x, \cdot) \in H$ for $x \in X$,
- $\langle k_x, f \rangle = f(x)$ (reproducing property).
In the discrete case, H is the closure of the set of functions $f = \sum_x a_x k_x$, where the $a_x$ are scalars.
For $\Gamma \subseteq X$, the operator $K : L^2(\Gamma, \mu) \to H$ given by
\[
(Kf)(x) = \int_\Gamma k(x, y) f(y) \, d\mu(y), \qquad x \in X,
\]
turns out to have adjoint operator $K^* : H \to L^2(\Gamma, \mu)$ given by domain restriction:
\[
K^* g(y) = g(y), \qquad y \in \Gamma, \ g \in H.
\]
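A short derivation of why the adjoint is restriction, sketched under the assumption that the inner product in H may be exchanged with the integral over Γ (automatic in the discrete case, where the integral is a finite sum). For $f \in L^2(\Gamma, \mu)$ and $g \in H$, using the symmetry $k(x, y) = k_y(x)$ and the reproducing property:
\[
\langle Kf, g \rangle_H
= \Big\langle \int_\Gamma k_y \, f(y) \, d\mu(y), \; g \Big\rangle_H
= \int_\Gamma f(y) \, \langle k_y, g \rangle_H \, d\mu(y)
= \int_\Gamma f(y) \, g(y) \, d\mu(y)
= \langle f, g|_\Gamma \rangle_{L^2(\Gamma, \mu)} .
\]
Comparing with $\langle Kf, g \rangle_H = \langle f, K^* g \rangle_{L^2(\Gamma, \mu)}$ gives $K^* g = g|_\Gamma$.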
$K^* K$ is self-adjoint, positive, and compact.
Its eigenvalues are discrete and non-negative.
Since $K^*$ is restriction, the eigenvalues can be found by diagonalizing k on Γ.