Iterated geometric harmonics for missing data recovery

Jonathan A. Lindgren, Erin P. J. Pearse, and Zach Zhang
{jlindgre, epearse, zazhang}@calpoly.edu
California Polytechnic State University, San Luis Obispo, CA
Nov. 14, 2015


Motivation: the missing data problem

Introduction and background


The missing data problem

Missing data is often a problem. Data can be lost
  while recording measurements,
  during storage or transmission,
  due to equipment failure,
  ...

Existing techniques:
  require some records (rows) to be complete, or
  require some characteristics (columns) to be complete, or
  are based on linear regression.
  (But data often has highly nonlinear internal structure!)


A dataset is a collection of vectors, stored as a matrix

The data is an n × p matrix. Each row is a vector of length p; one row is a record and each column is a parameter or coordinate.

[Figure: an n × p data matrix with n records (rows) and p characteristics (columns); one row is marked as one record.]


EXAMPLES

36 photos, each of size 112 pixels × 92 pixels: $\{v_k\}_{k=1}^{36} \subseteq \mathbb{R}^{10,304}$. (Each photo is stored as a vector.)

Results from a psychology experiment, a 50-question exam given to 200 people: $\{v_k\}_{k=1}^{200} \subseteq \mathbb{R}^{50}$.

3000 student records (SAT, ACT, GPA, class rank, etc.): $\{v_k\}_{k=1}^{3000} \subseteq \mathbb{R}^{20}$.
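As a minimal sketch of the first example, assuming the photos are available as 2-D grayscale arrays (the random arrays below are stand-ins, not real data, and the names `photos` and `data` are illustrative), each photo can be flattened into one row of the data matrix:

```python
import numpy as np

# Hypothetical stand-in for 36 grayscale photos of size 112 x 92.
rng = np.random.default_rng(0)
photos = rng.random((36, 112, 92))

# Flatten each photo into a vector of length 112 * 92 = 10,304,
# giving an n x p data matrix with one record per row.
data = photos.reshape(len(photos), -1)
n, p = data.shape
```

Reshaping a row back to 112 × 92 recovers the photo, so nothing is lost by storing it as a vector.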


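Special case of the missing data problem

Suppose all missing data are in one column (? marks a missing entry):

$$\begin{bmatrix} v_1 & ? \\ v_2 & f_2 \\ v_3 & ? \\ \vdots & \vdots \\ v_n & f_n \end{bmatrix}$$

Consider the last column as a function $f : \{1, 2, \dots, n\} \to \mathbb{R}$.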


Out-of-sample extension of an empirical function

Idea: A function f is defined on a subset Γ of the dataset.

$f : \Gamma \to Y$, where $\Gamma \subseteq \mathbb{R}^p$ is the set where the value of $f$ is known.
Want to extend $f$ to $F : X \to Y$ so that $F|_\Gamma(x) = f(x)$, for $x \in \Gamma$.

Application: The data is a sample $\{(x, f(x))\}_{x \in \Gamma}$.

Example: X may be a collection of images or documents, with $Y = \mathbb{R}$.

Want to generalize to as-yet-unseen instances in X.

“function extension” ⟷ “automated sorting”

⟹ machine learning/manifold learning


A solution: Geometric harmonics

The network model associated to a dataset


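Similarities within data are modeled via nonlinearity

Introduce a nonlinear kernel function k to model the similarity between two vectors.

$$k(v,u) = \begin{cases} \approx 0, & v \text{ and } u \text{ very different} \\ \approx 1, & v \text{ and } u \text{ very similar} \end{cases}$$

Two possible choices of such a kernel function:

$$k(v,u) = \begin{cases} e^{-\|v-u\|_2^2/\varepsilon} \\ |\mathrm{Corr}(v,u)|^m \end{cases}$$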
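The two kernel choices can be sketched in NumPy as follows. This is an illustration only: the function names are hypothetical, and Corr is assumed here to mean the Pearson correlation of the two vectors, which the slides do not spell out.

```python
import numpy as np

def gaussian_kernel(v, u, eps):
    """exp(-||v - u||_2^2 / eps): near 1 for close vectors, near 0 for distant ones."""
    v, u = np.asarray(v, dtype=float), np.asarray(u, dtype=float)
    return float(np.exp(-np.sum((v - u) ** 2) / eps))

def correlation_kernel(v, u, m):
    """|Corr(v, u)|^m, with Corr assumed to be the Pearson correlation."""
    r = np.corrcoef(v, u)[0, 1]
    return float(abs(r) ** m)
```

Both functions are symmetric in their arguments and take values in [0, 1]; `eps` and `m` control how quickly similarity decays.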


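Convert the dataset into a network

Goal: replace the original dataset in $\mathbb{R}^{n \times p}$ with a similarity network.
Network = connected weighted undirected graph.
Similarity network = weights represent similarities.

Vector $v_i$ ⟶ vertex $v_i$ in the network.

$$\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix} \xrightarrow{\;k\;} K = \begin{array}{c|cccc} & v_1 & v_2 & v_3 & v_4 \\ \hline v_1 & 0 & 4 & 2 & 0 \\ v_2 & 4 & 0 & 3 & 0 \\ v_3 & 2 & 3 & 0 & 1 \\ v_4 & 0 & 0 & 1 & 0 \end{array}$$

[Figure: the corresponding weighted graph on vertices v1, v2, v3, v4, with edge weights v1–v2: 4, v1–v3: 2, v2–v3: 3, v3–v4: 1.]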


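Efficiency gain: $n \times p$ data matrix $\mapsto$ $n \times n$ adjacency matrix

$$\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix} \xrightarrow{\;k\;} K = \begin{bmatrix} 0 & 4 & 2 & 0 \\ 4 & 0 & 3 & 0 \\ 2 & 3 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}, \qquad K_{i,j} := k(v_i, v_j)$$

Advantageous for high-dimensional datasets: $p \gg n$.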
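A minimal sketch of this conversion, assuming the Gaussian kernel is the one chosen (the function name `kernel_matrix` is illustrative):

```python
import numpy as np

def kernel_matrix(data, eps):
    """n x n Gaussian similarity matrix for the rows of an n x p data matrix."""
    data = np.asarray(data, dtype=float)
    gram = data @ data.T                        # n x n inner products
    sq_norms = np.diag(gram)
    # ||v_i - v_j||^2 = ||v_i||^2 + ||v_j||^2 - 2 <v_i, v_j>
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    sq_dists = np.maximum(sq_dists, 0.0)        # clip tiny negative round-off
    return np.exp(-sq_dists / eps)
```

After this step every later computation works with an n × n matrix, regardless of how large p is. Unlike the illustrative adjacency matrix on the slide, a Gaussian kernel gives ones on the diagonal.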


Geometric harmonics

Coifman and Lafon introduced the machine learning tool “geometric harmonics” in 2005.

Idea: the eigenfunctions of a diffusion operator can be used to perform global analysis of the dataset and of functions on a dataset.


Geometric harmonics: construction and definition

For matrix K with $K_{u,v} = k(u, v)$, consider the integral operator

$$f \mapsto Kf \quad \text{by} \quad (Kf)(u) := \sum_{v \in \Gamma} K_{u,v} f(v), \qquad u \in X.$$

“Restricted matrix multiplication”


Diagonalize the restricted matrix $[K]_{u,v \in \Gamma}$ via:

$$\sum_{v \in \Gamma} K_{u,v} \psi_j(v) = \lambda_j \psi_j(u), \qquad u \in \Gamma.$$

NOTE: k symmetric ⟹ K symmetric ⟹ $\{\psi_j\}$ form an ONB


[Nyström] Reverse this equation to define values off Γ:

$$\Psi_j(u) := \frac{1}{\lambda_j} \sum_{v \in \Gamma} K_{u,v} \psi_j(v), \qquad u \in X.$$

$\{\Psi_j\}_{j=1}^{n}$ are the geometric harmonics, where $n = |\Gamma|$.


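Geometric harmonics: the extension algorithm

For $f : \Gamma \to Y$ and $n = |\Gamma|$, define

$$F(x) = \sum_{j=1}^{n} \langle f, \psi_j \rangle_\Gamma \Psi_j(x), \qquad x \in X.$$

For $x \in \Gamma$, $\Psi_j(x) = \psi_j(x)$, so

$$F(x) = \sum_{j=1}^{n} \langle f, \psi_j \rangle_\Gamma \Psi_j(x) = \sum_{j=1}^{n} \langle f, \psi_j \rangle_\Gamma \psi_j(x) = f(x),$$

since this is just the decomposition of $f$ in the ONB $\{\psi_j\}_{j=1}^{n}$.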
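The extension formula can be sketched in NumPy as below. This is a sketch under stated assumptions, not the authors' implementation: it assumes a Gaussian kernel, the names `gaussian_kernel_rows` and `gh_extend` are illustrative, and the `rel_tol` cutoff that discards very small eigenvalues is an added safeguard against dividing by values near zero.

```python
import numpy as np

def gaussian_kernel_rows(A, B, eps):
    """Matrix of exp(-||a - b||_2^2 / eps) for rows a of A and rows b of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / eps)

def gh_extend(X, gamma_idx, f_gamma, eps, rel_tol=1e-10):
    """Extend f, known on the rows X[gamma_idx], to every row of X."""
    G = X[gamma_idx]
    K_gg = gaussian_kernel_rows(G, G, eps)      # restricted matrix [K]_{u,v in Gamma}
    lam, psi = np.linalg.eigh(K_gg)             # columns of psi form an ONB
    keep = lam > rel_tol * lam.max()            # assumption: drop tiny eigenvalues
    lam, psi = lam[keep], psi[:, keep]
    K_xg = gaussian_kernel_rows(X, G, eps)      # K_{u,v} for u in X, v in Gamma
    Psi = (K_xg @ psi) / lam                    # Nystrom: Psi_j(u) = (1/lam_j) sum_v K_uv psi_j(v)
    coeffs = psi.T @ np.asarray(f_gamma, dtype=float)   # <f, psi_j>_Gamma
    return Psi @ coeffs                         # F(x) = sum_j <f, psi_j> Psi_j(x)
```

When no eigenvalues are discarded, the returned values on Γ reproduce f up to round-off, which is the identity shown on the slide. If the cutoff removes some eigenvalues, F only approximates f on Γ.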


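Geometric harmonics: limitations

Geometric harmonics does not apply to missing data.

Consider $f : \Gamma \to \mathbb{R}$ as an extra column with holes:

$$\left[\begin{array}{c|c} v_1 & \\ v_2 & \\ v_3 & f \\ \vdots & \\ v_n & \end{array}\right]$$

Geometric harmonics requires the first p columns to be complete.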


Iterated geometric harmonics


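Iterated geometric harmonics: basic idea

Underlying assumption of geometric harmonics:

Data are samples from a submanifold.

Restated as a continuity assumption:
If p − 1 entries of u and v are very close, then so is the pth.

Idea: Consider the jth column to be a function of the others

$$\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \longrightarrow \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1j} & \cdots & a_{1p} \\ a_{21} & a_{22} & \cdots & a_{2j} & \cdots & a_{2p} \\ \vdots & \vdots & & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nj} & \cdots & a_{np} \end{bmatrix}$$

Geometric harmonics can be applied to the jth column.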


Iterated geometric harmonics: the iteration scheme

1 Record locations of missing values in the dataset.
2 Stochastically impute missing values.
    Drawn from N(µ, σ²), computed columnwise.
3 Iteration through columns.
    (a) Choose (at random) a column to update.
    (b) “Unlock” entries of column to be imputed.
    (c) Use geometric harmonics to update those entries.
        Current column is treated as a function of the others.
        New values are initially computed in terms of poor guesses.
        Successive passes improve guesses.
    (d) Continue until all columns are updated.
4 Repeat iteration until updates cause negligible change.
    Process typically stabilizes after about 4 cycles.
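The scheme above can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes missing entries are encoded as NaN, a Gaussian kernel with a user-chosen `eps`, at least one known entry per column, and a relative-change stopping rule. The helper names and the eigenvalue cutoff `rel_tol` are hypothetical.

```python
import numpy as np

def _sq_dists(P, Q):
    """Pairwise squared Euclidean distances between rows of P and rows of Q."""
    return ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2)

def gh_column_update(A, j, missing_j, eps, rel_tol=1e-8):
    """Re-estimate A[missing_j, j] from the other columns via geometric harmonics."""
    others = np.delete(A, j, axis=1)            # column j is a function of the rest
    known = ~missing_j
    G = others[known]                           # Gamma: rows where column j is known
    lam, psi = np.linalg.eigh(np.exp(-_sq_dists(G, G) / eps))
    keep = lam > rel_tol * lam.max()            # assumption: discard tiny eigenvalues
    lam, psi = lam[keep], psi[:, keep]
    K_mg = np.exp(-_sq_dists(others[missing_j], G) / eps)
    Psi = (K_mg @ psi) / lam                    # Nystrom extension to the missing rows
    return Psi @ (psi.T @ A[known, j])

def igh_impute(data, eps, max_cycles=10, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    A = np.array(data, dtype=float)
    missing = np.isnan(A)                       # step 1: record missing locations
    for j in range(A.shape[1]):                 # step 2: columnwise N(mu, sigma^2) draws
        col = A[~missing[:, j], j]
        size = int(missing[:, j].sum())
        A[missing[:, j], j] = rng.normal(col.mean(), col.std(), size=size)
    for _ in range(max_cycles):                 # step 4: repeat until negligible change
        prev = A.copy()
        for j in rng.permutation(A.shape[1]):   # step 3: random order, each column once
            if missing[:, j].any():
                A[missing[:, j], j] = gh_column_update(A, j, missing[:, j], eps)
        if np.linalg.norm(A - prev) <= tol * max(np.linalg.norm(prev), 1.0):
            break
    return A
```

Only the entries recorded as missing in step 1 are ever overwritten; known values pass through unchanged. The quality of the result depends on `eps` and on how well the data satisfy the continuity assumption.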


[Figure: three image panels labeled “damaged”, “restored”, and “original”; caption: 70% data loss.]


Iterated geometric harmonics: applications

Iterated geometric harmonics requires a continuity assumption
  Probably not well-suited to social network analysis, etc.

Iterated geometric harmonics requires multiple similar datapoints/records
  Video footage is a natural application.
  10–24 images per second, usually very similar.
  Applications for security, military, law enforcement.

Iterated geometric harmonics excels when $p \gg n$
  However, it has demonstrated good performance on low-dimensional time series.
  Example: San Diego weather data (next slide)


San Diego Airport weather data: n = 2000, p = 25

[Figure: two plots against the number of GH iterations, one of L-2 error and one of standard deviation, each with curves labeled 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4.]


Summary

Iterated Geometric Harmonics (IGH):

Robust data reconstruction, even at high rates of data loss.
Well suited to high-dimensional problems, $p \gg n$.
Relies on continuity assumptions on underlying data.
Application to image reconstruction, video footage, etc.
Patent pending (U.S. Patent Application No.: 14/920,556)

Future work: noisy data.


Future work

From missing data to noisy data


Future work: noisy data

The problem of “noisy data” is more difficult:
  Before improving the data, bad values need to be located.

Current work: using Markov random fields to detect noise.
  Markov random fields: another graph-based tool for data analysis.


Future work: Markov random fields

[Figure: a graph with nodes $a_1, \dots, a_6$ (improved data) and $b_1, \dots, b_6$ (original (noisy) data), with edge weights $w_{ij}$ and $u_i$.]

Minimize the energy functional:

$$E = \sum w_{ij}(a_i - a_j)^2 + \sum u_i (a_i - b_i)^2$$

where $\{b_i\}$ are given,
$w_{ij}$ are tuned by user (and usually all equal), and
$u_i$ are tuned by user (and usually all equal).


Minimize the energy functional:

$$E = \sum (a_i - a_j)^2 + \lambda \sum (a_i - b_i)^2$$

where $\{b_i\}$ are given, $w_{ij} = u_i = 1$, and $\lambda$ is tuned by user.
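Because this energy is quadratic in the $a_i$, its minimizer can also be computed directly. The sketch below is an illustration and swaps in a different technique from the slides: instead of simulated annealing it solves the stationarity condition $(L + \lambda I)a = \lambda b$, where $L$ is the graph Laplacian. It assumes the first sum runs over each edge once; the function names and the edge list format are hypothetical.

```python
import numpy as np

def mrf_smooth(b, edges, lam):
    """Minimize sum_{(i,j) in edges} (a_i - a_j)^2 + lam * sum_i (a_i - b_i)^2."""
    b = np.asarray(b, dtype=float)
    n = len(b)
    L = np.zeros((n, n))                        # unweighted graph Laplacian
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    # Setting the gradient 2 L a + 2 lam (a - b) to zero gives this linear system.
    return np.linalg.solve(L + lam * np.eye(n), lam * b)

def energy(a, b, edges, lam):
    """Value of the energy functional for candidate values a and given data b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    smooth = sum((a[i] - a[j]) ** 2 for i, j in edges)
    return float(smooth + lam * ((a - b) ** 2).sum())
```

A larger `lam` keeps the output closer to the given data, while a smaller `lam` favors agreement between neighboring nodes.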


Markov random fields (MRF) use simulated annealing to solve

minimize E given $\{b_i\}$.
Output: improved data $\{a_i\}$.

Our approach:
1 Apply MRF to find improved data $\{a_i\}$.
2 Compare $\{a_i\}$ to original data $\{b_i\}$.
3 Label nodes with large values of $|a_i - b_i|$ as missing data.
4 Apply IGH and obtain better improved data.
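Steps 2 and 3 of this approach might look like the following sketch. The fixed `threshold` is a hypothetical choice, since the slides do not say how "large" is decided, and encoding the flagged entries as NaN is likewise an assumption.

```python
import numpy as np

def mark_suspect_entries(a, b, threshold):
    """Flag entries where the MRF output a differs from the data b by more than threshold."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    suspect = np.abs(a - b) > threshold         # steps 2-3: compare and label
    marked = b.copy()
    marked[suspect] = np.nan                    # treat flagged values as missing
    return marked, suspect
```

The returned array keeps every trusted value of `b` and marks the rest as missing, which is the form of input that step 4 would then hand to IGH.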


Theoretical underpinnings

Reproducing kernel Hilbert spaces


Under the hood: reproducing kernel Hilbert spaces

Suppose $X \subseteq \mathbb{R}^n$ and $k : X \times X \to \mathbb{R}$ is

nonnegative: $k(x, y) \ge 0$
symmetric: $k(x, y) = k(y, x)$
positive semidefinite: for any choice of $\{x_i\}_{i=1}^{m}$, $K_{i,j} = k(x_i, x_j)$ defines a positive semidefinite matrix.

[Aronszajn] There is a Hilbert space $H$ of functions on $X$ with
$k_x := k(x, \cdot) \in H$, for $x \in X$
$\langle k_x, f \rangle = f(x)$ (reproducing property)

In the discrete case, $H$ is the closure of the set of functions of the form $f = \sum_x a_x k_x$, with scalar coefficients $a_x$.


For $\Gamma \subseteq X$, the operator $K : L^2(\Gamma, \mu) \to H$ given by

$$(Kf)(x) = \int_\Gamma k(x, y) f(y)\, d\mu(y), \qquad x \in X,$$

turns out to have adjoint operator $K^* : H \to L^2(\Gamma, \mu)$ given by domain restriction:

$$K^* g(y) = g(y), \qquad y \in \Gamma,\ g \in H.$$

$K^*K$ is self-adjoint, positive, and compact.

Its eigenvalues are discrete and non-negative.
Since $K^*$ is restriction, the eigenvalues can be found by diagonalizing $k$ on Γ.
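To spell out that last step, here is a short worked computation using only the definitions above. For $f \in L^2(\Gamma, \mu)$ and $y \in \Gamma$, applying $K$ and then restricting to Γ gives:

```latex
(K^{*}K f)(y) = (Kf)(y) = \int_{\Gamma} k(y, z)\, f(z)\, d\mu(z), \qquad y \in \Gamma .
```

If Γ is finite and µ is counting measure (an assumption made here to match the discrete setting), the right-hand side is multiplication by the restricted matrix $[K]_{u,v \in \Gamma}$, so the eigenpairs of $K^*K$ are exactly those of that matrix.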