
Feature Selection &

the Shapley-Folkman Theorem.

Alexandre d’Aspremont,

CNRS & D.I., École Normale Supérieure.

With Armin Askari, Laurent El Ghaoui (UC Berkeley) and Quentin Rebjock (EPFL)

Alex d’Aspremont OWOS, June 2020. 1/32


Introduction

Feature Selection.

• Reduce number of variables while preserving classification performance.

• Often improves test performance, especially when samples are scarce.

• Helps interpretation.

Classical examples: LASSO, $\ell_1$-logistic regression, RFE-SVM, ...


Introduction: feature selection

RNA classification. Find genes which best discriminate cell type (lung cancer vs control). 35238 genes, 2695 examples. [Lachmann et al., 2018]

[Figure: objective (×10^11) versus number of features (k), for k from 0 to 35000.]

Best ten genes: MT-CO3, MT-ND4, MT-CYB, RP11-217O12.1, LYZ, EEF1A1, MT-CO1, HBA2, HBB, HBA1.


Introduction: feature selection

Applications. Mapping brain activity by fMRI.

From PARIETAL team at INRIA.


Introduction: feature selection

fMRI. Many voxels and very few samples lead to false discoveries.

Wired article on Bennett et al., “Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument For Proper Multiple Comparisons Correction”, Journal of Serendipitous and Unexpected Results, 2010.


Introduction: linear models

Linear models. Select features from large weights $w$.

• LASSO solves $\min_w \|Xw - y\|_2^2 + \lambda\|w\|_1$, with linear prediction given by $w^\top x$.

• Linear SVM solves $\min_w \sum_i \max\{0, 1 - y_i w^\top x_i\} + \lambda\|w\|_2^2$, with linear classification rule $\mathrm{sign}(w^\top x)$.

• $\ell_1$-logistic regression, etc.

In practice.

• Relatively high complexity on very large-scale data sets.

• Recovery results require uncorrelated features (incoherence, RIP, etc.).

• Cheaper featurewise methods (ANOVA, TF-IDF, etc.) have relatively poor performance.
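The LASSO objective above can be minimized with a few lines of proximal gradient descent (ISTA), after which features are ranked by the magnitude of their weights. This is only a minimal sketch: the slides do not say which solver is used, and the names `lasso_ista`, `soft_threshold` and `select_features` are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (entrywise shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize ||Xw - y||_2^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    # Lipschitz constant of the gradient 2 X^T (Xw - y) of the smooth part.
    L = 2.0 * np.linalg.norm(X, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y)
        w = soft_threshold(w - grad / L, lam / L)
    return w

def select_features(w, k):
    """Indices of the k largest weights in absolute value."""
    return np.argsort(-np.abs(w))[:k]

# Toy example with X = I, where the solution is soft_threshold(y, lam / 2).
X = np.eye(3)
y = np.array([3.0, 0.1, -2.0])
w = lasso_ista(X, y, lam=1.0)
top2 = select_features(w, 2)
```

On this toy problem the small middle coefficient is shrunk to zero, so the two selected features are the first and the last.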


Outline

• Sparse Naive Bayes

• The Shapley-Folkman theorem

• Duality gap bounds

• Numerical performance


Multinomial Naive Bayes

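Multinomial Naive Bayes. In the multinomial model

$$\log \mathrm{Prob}(x \mid C_\pm) = x^\top \log\theta^\pm + \log\left(\frac{\left(\sum_{j=1}^m x_j\right)!}{\prod_{j=1}^m x_j!}\right).$$

Training by maximum likelihood

$$(\theta^+_*, \theta^-_*) = \operatorname*{argmax}_{\substack{\mathbf{1}^\top\theta^+ = \mathbf{1}^\top\theta^- = 1 \\ \theta^+,\, \theta^- \in [0,1]^m}} \; f^{+\top}\log\theta^+ + f^{-\top}\log\theta^-$$

Linear classification rule: for a given test point $x \in \mathbb{R}^m$, set

$$\hat{y}(x) = \mathrm{sign}(v + w^\top x),$$

where

$$w \triangleq \log\theta^+_* - \log\theta^-_* \quad\text{and}\quad v \triangleq \log\mathrm{Prob}(C_+) - \log\mathrm{Prob}(C_-).$$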
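The maximum likelihood problem above has a closed-form solution, which the following sketch uses to build the linear rule. It assumes that $f^\pm$ are the per-class sums of the count vectors (the slides do not define them), that class priors are estimated by class frequencies, and that every feature occurs in both classes; the function names are hypothetical.

```python
import numpy as np

def train_mnb(X, y):
    """Closed-form multinomial naive Bayes fit.

    X: (n_samples, m) array of nonnegative counts; y: labels in {+1, -1}.
    Assumption: f+ and f- are the per-class sums of the rows of X, and all
    their entries are positive so that the logarithms are finite.
    """
    f_plus = X[y == 1].sum(axis=0)
    f_minus = X[y == -1].sum(axis=0)
    # Maximizing f^T log(theta) over the simplex gives theta = f / (1^T f).
    theta_plus = f_plus / f_plus.sum()
    theta_minus = f_minus / f_minus.sum()
    w = np.log(theta_plus) - np.log(theta_minus)
    # Assumption: class priors are estimated by class frequencies.
    v = np.log(np.mean(y == 1)) - np.log(np.mean(y == -1))
    return w, v

def predict(w, v, x):
    """Linear classification rule sign(v + w^T x)."""
    return np.sign(v + w @ x)

X = np.array([[3, 1], [4, 1], [1, 3], [1, 5]], dtype=float)
y = np.array([1, 1, -1, -1])
w, v = train_mnb(X, y)
```

A document dominated by the first feature is then assigned to the positive class, and one dominated by the second feature to the negative class.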


Sparse Naive Bayes

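Naive Feature Selection. Make $w \triangleq \log\theta^+_* - \log\theta^-_*$ sparse.

Solve

$$\begin{array}{rl}
(\theta^+_*, \theta^-_*) = \operatorname{argmax} & f^{+\top}\log\theta^+ + f^{-\top}\log\theta^- \\
\text{subject to} & \|\theta^+ - \theta^-\|_0 \le k \\
& \mathbf{1}^\top\theta^+ = \mathbf{1}^\top\theta^- = 1 \\
& \theta^+, \theta^- \ge 0
\end{array} \tag{SMNB}$$

where $k \ge 0$ is a target number of features. Features for which $\theta^+_i = \theta^-_i$ can be discarded.

Nonconvex problem.

• Convex relaxation?

• Approximation bounds?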


Sparse Naive Bayes

Convex Relaxation. The dual is very simple.

Sparse Multinomial Naive Bayes [Askari, A., El Ghaoui, 2019]

Let $\phi(k)$ be the optimal value of (SMNB). Then $\phi(k) \le \psi(k)$, where $\psi(k)$ is the optimal value of the following one-dimensional convex optimization problem

$$\psi(k) := C + \min_{\alpha\in[0,1]} s_k(h(\alpha)), \tag{USMNB}$$

where $C$ is a constant, $s_k(\cdot)$ is the sum of the top $k$ entries of its vector argument, and for $\alpha \in (0, 1)$,

$$h(\alpha) := f^+\circ\log f^+ + f^-\circ\log f^- - (f^+ + f^-)\circ\log(f^+ + f^-) - f^+\log\alpha - f^-\log(1-\alpha).$$

Solved by bisection, linear complexity $O(n + k\log k)$. Approximation bounds?
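As a rough illustration of the one-dimensional problem (USMNB), the sketch below minimizes $s_k(h(\alpha))$ over $\alpha$. It uses a golden-section search rather than the bisection mentioned on the slide, omits the constant $C$ (not specified here), and adopts the convention $0\log 0 = 0$; all function names are hypothetical.

```python
import numpy as np

def xlogx(v):
    """Entrywise v * log(v), with the convention 0 * log(0) = 0."""
    safe = np.where(v > 0, v, 1.0)
    return v * np.log(safe)

def h(alpha, f_plus, f_minus):
    """Vector h(alpha) from (USMNB)."""
    return (xlogx(f_plus) + xlogx(f_minus) - xlogx(f_plus + f_minus)
            - f_plus * np.log(alpha) - f_minus * np.log(1.0 - alpha))

def top_k_sum(v, k):
    """s_k(v): sum of the k largest entries of v."""
    if k == 0:
        return 0.0
    return np.partition(v, len(v) - k)[len(v) - k:].sum()

def solve_usmnb(f_plus, f_minus, k, tol=1e-10):
    """Minimize the convex function s_k(h(alpha)) by golden-section search."""
    obj = lambda a: top_k_sum(h(a, f_plus, f_minus), k)
    gr = (np.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 1e-12, 1.0 - 1e-12
    while hi - lo > tol:
        a = hi - gr * (hi - lo)
        b = lo + gr * (hi - lo)
        if obj(a) < obj(b):
            hi = b
        else:
            lo = a
    alpha = 0.5 * (lo + hi)
    return alpha, obj(alpha)  # value of the min term only, C is not added

f_plus = np.array([5.0, 1.0, 2.0])
f_minus = np.array([1.0, 4.0, 2.0])
alpha_star, value = solve_usmnb(f_plus, f_minus, k=3)
```

When $k$ equals the number of features, the objective reduces to a constant minus $\mathbf{1}^\top f^+\log\alpha + \mathbf{1}^\top f^-\log(1-\alpha)$, so the minimizer is $\mathbf{1}^\top f^+ / \mathbf{1}^\top(f^+ + f^-)$, which gives a simple sanity check.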


Shapley-Folkman Theorem

Minkowski sum. Given sets $X, Y \subset \mathbb{R}^d$, we have

$$X + Y = \{x + y : x \in X,\ y \in Y\}$$

(CGAL User and Reference Manual)

Convex hull. Given subsets $V_i \subset \mathbb{R}^d$, we have

$$\mathrm{Co}\left(\sum_i V_i\right) = \sum_i \mathrm{Co}(V_i)$$
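For finite point sets the Minkowski sum can be enumerated directly. The sketch below is only an illustration of the definition, with a hypothetical helper `minkowski_sum` operating on sets of coordinate tuples.

```python
from itertools import product

def minkowski_sum(X, Y):
    """Minkowski sum of two finite point sets given as iterables of tuples."""
    return {tuple(a + b for a, b in zip(x, y)) for x, y in product(X, Y)}

# Two segments' endpoints along each axis.
X = {(0, 0), (1, 0)}
Y = {(0, 0), (0, 1)}
S = minkowski_sum(X, Y)  # the four corners of the unit square
```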


Shapley-Folkman Theorem

[Figure: the $\ell_{1/2}$ ball, Minkowski average of two and ten balls, convex hull.]

[Figure: Minkowski sum of the first five digits (obtained by sampling).]


Shapley-Folkman Theorem

Shapley-Folkman Theorem [Starr, 1969]

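Suppose $V_i \subset \mathbb{R}^d$, $i = 1, \ldots, n$, and

$$x \in \mathrm{Co}\left(\sum_{i=1}^n V_i\right) = \sum_{i=1}^n \mathrm{Co}(V_i)$$

then

$$x \in \sum_{i \in [1,n]\setminus S_x} V_i + \sum_{i \in S_x} \mathrm{Co}(V_i)$$

where $|S_x| \le d$.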
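A one-dimensional toy case makes the statement concrete. Take $V_i = \{0, 1\} \subset \mathbb{R}$ (so $d = 1$): any $x \in [0, n]$ can be written as a sum of $n$ terms, all in $\{0, 1\}$ except at most one in $[0, 1]$. The sketch below builds such a decomposition; the function name is hypothetical and the example is an illustration, not part of the original slides.

```python
import math

def shapley_folkman_1d(x, n):
    """Write x in [0, n] as a sum of n terms x_i, with x_i in {0, 1} for all
    but at most d = 1 index, where x_i lies in Co({0, 1}) = [0, 1]."""
    if not 0 <= x <= n:
        raise ValueError("x must lie in [0, n]")
    whole = min(int(math.floor(x)), n)
    parts = [1.0] * whole + [0.0] * (n - whole)
    if whole < n:
        # The single term allowed to lie in the convex hull [0, 1].
        parts[whole] = x - whole
    return parts

parts = shapley_folkman_1d(3.7, n=10)
non_extreme = [p for p in parts if p not in (0.0, 1.0)]
```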


Shapley-Folkman Theorem

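Proof sketch. Write $x \in \sum_{i=1}^n \mathrm{Co}(V_i)$, or

$$\begin{pmatrix} x \\ \mathbf{1}_n \end{pmatrix} = \sum_{i=1}^n \sum_{j=1}^{d+1} \lambda_{ij} \begin{pmatrix} v_{ij} \\ e_i \end{pmatrix}, \quad \text{for } \lambda \ge 0.$$

Conic Carathéodory then yields a representation with at most $n + d$ nonzero coefficients. Use a pigeonhole argument: each of the $n$ blocks $(\lambda_{ij})_j$ contains at least one nonzero coefficient, so at most $d$ blocks contain more than one. Blocks with a single nonzero coefficient give $x_i \in V_i$, while the remaining blocks (at most $d$) give $x_i \in \mathrm{Co}(V_i)$.

Number of nonzero $\lambda_{ij}$ controls gap with convex hull.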


Shapley-Folkman: geometric consequences

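Consequences.

• If the sets $V_i \subset \mathbb{R}^d$ are uniformly bounded with $\mathrm{rad}(V_i) \le R$, then

$$d_H\left(\frac{\sum_{i=1}^n V_i}{n},\ \mathrm{Co}\left(\frac{\sum_{i=1}^n V_i}{n}\right)\right) \le R\,\frac{\sqrt{\min\{n, d\}}}{n}$$

where $\mathrm{rad}(V) = \inf_{x\in V}\sup_{y\in V}\|x - y\|$.

• In particular, when $d$ is fixed and $n \to \infty$,

$$\left(\frac{\sum_{i=1}^n V_i}{n}\right) \to \mathrm{Co}\left(\frac{\sum_{i=1}^n V_i}{n}\right)$$

in the Hausdorff metric with rate $O(1/n)$.

• Holds for many other nonconvexity measures [Fradelizi et al., 2017].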
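The Hausdorff bound can be checked numerically on a toy example, taking $V_i = \{0, 1\} \subset \mathbb{R}$ for all $i$ (so $d = 1$ and $\mathrm{rad}(V_i) = 1$). The Minkowski average of $n$ copies is then $\{0, 1/n, \ldots, 1\}$ and its convex hull is $[0, 1]$. This example and the helper names are illustrative assumptions.

```python
import numpy as np

def averaged_sum_points(n):
    """Minkowski average (1/n) * (V + ... + V) of n copies of V = {0, 1}."""
    return np.arange(n + 1) / n

def hausdorff_to_hull_1d(points):
    """Hausdorff distance between a finite subset of R and its convex hull.

    The set is contained in its hull, so the distance is half the largest
    gap between consecutive points.
    """
    pts = np.sort(points)
    return 0.5 * np.max(np.diff(pts))

R, d = 1.0, 1  # rad({0, 1}) = 1
results = {}
for n in (1, 10, 100):
    dist = hausdorff_to_hull_1d(averaged_sum_points(n))
    bound = R * np.sqrt(min(n, d)) / n
    results[n] = (dist, bound)
```

Here the distance equals $1/(2n)$, which sits below the bound $R/n$ and decays at the $O(1/n)$ rate stated above.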


Nonconvex Optimization

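Separable nonconvex problem. Solve

$$\begin{array}{ll}
\text{minimize} & \sum_{i=1}^n f_i(x_i) \\
\text{subject to} & Ax \le b,
\end{array} \tag{P}$$

in the variables $x_i \in \mathbb{R}^{d_i}$ with $d = \sum_{i=1}^n d_i$, where $f_i$ are lower semicontinuous and $A \in \mathbb{R}^{m\times d}$.

Take the dual twice to form a convex relaxation,

$$\begin{array}{ll}
\text{minimize} & \sum_{i=1}^n f_i^{**}(x_i) \\
\text{subject to} & Ax \le b
\end{array} \tag{CoP}$$

in the variables $x_i \in \mathbb{R}^{d_i}$.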


Nonconvex Optimization

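Convex envelope. Biconjugate $f^{**}$ satisfies $\mathrm{epi}(f^{**}) = \mathrm{Co}(\mathrm{epi}(f))$, which means that $f^{**}(x)$ and $f(x)$ match at extreme points of $\mathrm{epi}(f^{**})$.

Define lack of convexity as $\rho(f) \triangleq \sup_{x\in\mathrm{dom}(f)}\{f(x) - f^{**}(x)\}$.

Example.

[Figure: $\mathrm{Card}(x)$ and $|x|$ plotted against $x$ on $[-1, 1]$.]

The $\ell_1$ norm is the convex envelope of $\mathrm{Card}(x)$ in $[-1, 1]$.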
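The envelope claim can be verified numerically by applying a discrete Legendre transform twice on a grid. This is a brute-force sketch under assumed grid sizes, with a hypothetical helper `conjugate`; it is meant as a sanity check rather than an efficient method.

```python
import numpy as np

def conjugate(grid_in, values, grid_out):
    """Discrete Legendre transform: g(s) = max_t (s * t - values(t))."""
    return np.max(np.outer(grid_out, grid_in) - values[None, :], axis=1)

x = np.linspace(-1.0, 1.0, 201)   # domain [-1, 1], step 0.01, contains 0
y = np.linspace(-3.0, 3.0, 601)   # slopes, contains -1 and 1 up to rounding
card = (np.abs(x) > 1e-12).astype(float)   # Card(x) restricted to [-1, 1]

f_star = conjugate(x, card, y)        # f*(y)  = max_x (x * y - f(x))
f_bistar = conjugate(y, f_star, x)    # f**(x) = max_y (x * y - f*(y))
rho = np.max(card - f_bistar)         # lack of convexity on the grid
```

On this grid the biconjugate coincides with $|x|$, and the computed gap is one minus the grid step, consistent with $\rho(\mathrm{Card}) = 1$ being approached as $x \to 0$.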


Nonconvex Optimization

Writing the epigraph of problem (P) as in [Lemaréchal and Renaud, 2001],

$$G_r \triangleq \left\{(r_0, r) \in \mathbb{R}^{1+m} : \sum_{i=1}^n f_i(x_i) \le r_0,\ Ax - b \le r,\ x \in \mathbb{R}^d\right\},$$

we can write the dual function of (P) as

$$\Psi(\lambda) \triangleq \inf\left\{r_0 + \lambda^\top r : (r_0, r) \in G_r^{**}\right\},$$

in the variable $\lambda \in \mathbb{R}^m$, where $G_r^{**} = \mathrm{Co}(G_r)$ is the closed convex hull of the epigraph $G_r$.

Affine constraints mean that (P) and (CoP) have the same dual [Lemaréchal and Renaud, 2001, Th. 2.11], given by

$$\sup_{\lambda\ge 0}\ \Psi(\lambda) \tag{D}$$

in the variable $\lambda \in \mathbb{R}^m$. Roughly speaking, if $G_r^{**} = G_r$, there is no duality gap in (P).


Nonconvex Optimization

Epigraph & duality gap. Define

$$F_i = \left\{(f_i^{**}(x_i), A_i x_i) : x_i \in \mathbb{R}^{d_i}\right\} + \mathbb{R}^{m+1}_+$$

where $A_i \in \mathbb{R}^{m\times d_i}$ is the $i$th block of $A$.

• The epigraph $G_r^{**}$ can be written as a Minkowski sum of the $F_i$

$$G_r^{**} = \sum_{i=1}^n F_i + (0, -b) + \mathbb{R}^{m+1}_+$$

• Shapley-Folkman at $x \in G_r^{**}$ shows $f_i^{**}(x_i) = f_i(x_i)$ for all but at most $m + 1$ terms in the objective.

• As $n \to \infty$, with $m/n \to 0$, $G_r$ gets closer to its convex hull $G_r^{**}$ and the duality gap becomes negligible.


Bound on duality gap

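A priori bound on duality gap of

$$\begin{array}{ll}
\text{minimize} & \sum_{i=1}^n f_i(x_i) \\
\text{subject to} & Ax \le b,
\end{array}$$

where $A \in \mathbb{R}^{m\times d}$.

Proposition [Aubin and Ekeland, 1976, Ekeland and Temam, 1999]

A priori bounds on the duality gap. Suppose the functions $f_i$ in (P) satisfy Assumption (...). There is a point $x^\star \in \mathbb{R}^d$ at which the primal optimal value of (CoP) is attained, such that

$$\underbrace{\sum_{i=1}^n f_i^{**}(x_i^\star)}_{\text{CoP}} \;\le\; \underbrace{\sum_{i=1}^n f_i(\hat{x}_i^\star)}_{\text{P}} \;\le\; \underbrace{\sum_{i=1}^n f_i^{**}(x_i^\star)}_{\text{CoP}} + \underbrace{\sum_{i=1}^{m+1} \rho(f_{[i]})}_{\text{gap}}$$

where $\hat{x}^\star$ is an optimal point of (P) and $\rho(f_{[1]}) \ge \rho(f_{[2]}) \ge \ldots \ge \rho(f_{[n]})$.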
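The gap term in this bound only involves the $m + 1$ largest lack-of-convexity values, which is straightforward to evaluate. The sketch below uses made-up values of $\rho(f_i)$ purely for illustration; `duality_gap_bound` is a hypothetical helper.

```python
import numpy as np

def duality_gap_bound(rho, m):
    """Sum of the m + 1 largest lack-of-convexity values rho(f_i)."""
    rho = np.sort(np.asarray(rho, dtype=float))[::-1]
    return rho[: m + 1].sum()

rho = [0.0, 1.0, 0.5, 1.0, 0.25, 0.0]   # hypothetical rho(f_i), n = 6 terms
bound = duality_gap_bound(rho, m=2)     # m = 2 constraints: top three values
```

The bound depends on the number of constraints $m$ and not on the number of terms $n$, which is why the relative gap vanishes when $n$ grows with $m$ fixed.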


Bound on duality gap

General result. Consider the separable nonconvex problem

$$\begin{array}{rll}
h_P(u) := & \text{min.} & \sum_{i=1}^n f_i(x_i) \\
& \text{s.t.} & \sum_{i=1}^n g_i(x_i) \le b + u
\end{array} \tag{P}$$

in the variables $x_i \in \mathbb{R}^{d_i}$, with perturbation parameter $u \in \mathbb{R}^m$.

Proposition [Ekeland and Temam, 1999]

A priori bounds on the duality gap. Suppose the functions $f_i$, $g_{ji}$ in problem (P) satisfy assumption (...) for $i = 1, \ldots, n$, $j = 1, \ldots, m$. Let

$$p_j = (m + 1)\max_i \rho(g_{ji}), \quad \text{for } j = 1, \ldots, m$$

then

$$h_P(p)^{**} \le h_P(p) \le h_P(0)^{**} + (m + 1)\max_i \rho(f_i),$$

where $h_P(u)^{**}$ is the optimal value of the dual to (P).


Naive Feature Selection

Duality gap bound. Sparse naive Bayes reads

$$\begin{array}{rll}
h_P(u) = & \min_{q,r} & -f^{+\top}\log q - f^{-\top}\log r \\
& \text{subject to} & \mathbf{1}^\top q = 1 + u_1, \\
& & \mathbf{1}^\top r = 1 + u_2, \\
& & \sum_{i=1}^m \mathbf{1}_{q_i \ne r_i} \le k + u_3
\end{array}$$

in the variables $q, r \in [0, 1]^m$, where $u \in \mathbb{R}^3$. There are three constraints, two of them convex, which means $p = (0, 0, 4)$.

Theorem [Askari, A., El Ghaoui, 2019]

NFS duality gap bounds. Let $\phi(k)$ be the optimal value of (SMNB) and $\psi(k)$ that of the convex relaxation (USMNB). We have

$$\psi(k - 4) \le \phi(k) \le \psi(k),$$

for $k \ge 4$.
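The value $p = (0, 0, 4)$ can be traced back to the general result with $m = 3$ constraints. The short computation below is a sketch which assumes that the two linear constraints have zero lack of convexity and that each indicator term $\mathbf{1}_{q_i \ne r_i}$ has lack of convexity equal to 1 on the domain, which is consistent with the value of $p$ stated above.

```latex
\begin{aligned}
p_1 &= (m+1)\max_i \rho(g_{1i}) = 4 \cdot 0 = 0
  && \text{(linear constraint } \mathbf{1}^\top q = 1 + u_1\text{)} \\
p_2 &= (m+1)\max_i \rho(g_{2i}) = 4 \cdot 0 = 0
  && \text{(linear constraint } \mathbf{1}^\top r = 1 + u_2\text{)} \\
p_3 &= (m+1)\max_i \rho\big(\mathbf{1}_{q_i \ne r_i}\big) = 4 \cdot 1 = 4
  && \text{(cardinality constraint)}
\end{aligned}
```

The perturbation therefore only affects the sparsity budget, which is shifted by 4 and accounts for the offset between $\psi(k - 4)$ and $\psi(k)$ in the theorem.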


Naive Feature Selection

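Primalization. Given $\alpha^*$ solving (USMNB), reconstruct point $(\theta^+, \theta^-)$.

For $\alpha^*$ optimal for (USMNB), let $I$ be the complement of the set of indices corresponding to the top $k$ entries of $h(\alpha^*)$, set $B_\pm := \sum_{i \notin I} f_i^\pm$, and

$$\theta^+_{*i} = \theta^-_{*i} = \frac{f_i^+ + f_i^-}{\mathbf{1}^\top(f^+ + f^-)}, \ \forall i \in I, \qquad \theta^\pm_{*i} = \frac{B_+ + B_-}{B_\pm}\,\frac{f_i^\pm}{\mathbf{1}^\top(f^+ + f^-)}, \ \forall i \notin I.$$

In all but pathological scenarios, the $k$ largest coefficients in (USMNB) give the support of the solution.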
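The reconstruction formulas translate directly into array operations. In the sketch below, `scores` stands for the vector $h(\alpha^*)$ and is passed in as an input; the function name `primalize` is hypothetical, and the code assumes $k \ge 1$ with $B_+$ and $B_-$ both positive.

```python
import numpy as np

def primalize(f_plus, f_minus, scores, k):
    """Rebuild (theta_plus, theta_minus) from scores = h(alpha*).

    Assumes k >= 1 and that B+ and B- below are both positive.
    """
    m = len(scores)
    top = np.argsort(-scores)[:k]            # indices of the top k entries
    in_I = np.ones(m, dtype=bool)
    in_I[top] = False                        # I is the complement of the top k
    total = f_plus.sum() + f_minus.sum()     # 1^T (f+ + f-)
    B_plus, B_minus = f_plus[~in_I].sum(), f_minus[~in_I].sum()
    shared = (f_plus + f_minus) / total      # common value on I
    scale = B_plus + B_minus
    theta_plus = np.where(in_I, shared, scale / B_plus * f_plus / total)
    theta_minus = np.where(in_I, shared, scale / B_minus * f_minus / total)
    return theta_plus, theta_minus

f_plus = np.array([5.0, 1.0, 2.0, 3.0])
f_minus = np.array([1.0, 4.0, 2.0, 3.0])
scores = np.array([2.0, 1.5, 0.1, 0.0])     # placeholder values for h(alpha*)
theta_plus, theta_minus = primalize(f_plus, f_minus, scores, k=2)
```

Both reconstructed vectors sum to one, and they differ on at most $k$ coordinates, so the point is feasible for (SMNB).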


Outline

• Sparse Naive Bayes

• Approximation bounds & the Shapley-Folkman theorem

• Numerical performance


Naive Feature Selection

Data.

Feature Vectors       Amazon     IMDB         Twitter       MPQA      SST2
Count Vector          31,666     103,124      273,779       6,208     16,599
tf-idf                31,666     103,124      273,779       6,208     16,599
tf-idf wrd bigram     870,536    8,950,169    12,082,555    27,603    227,012
tf-idf char bigram    25,019     48,420       17,812        4,838     7,762

Number of features in text data sets used below.

Feature Vectors       Amazon     IMDB         Twitter       MPQA      SST2
Count Vector          0.043      0.22         1.15          0.0082    0.037
tf-idf                0.033      0.16         0.89          0.0080    0.027
tf-idf wrd bigram     0.68       9.38         13.25         0.024     0.21
tf-idf char bigram    0.076      0.47         4.07          0.0084    0.082

Average run time (seconds, plain Python on CPU).


Naive Feature Selection.

[Figure: accuracy (%) versus training time (s, log scale from 10^-1 to 10^3). Methods: Logistic-$\ell_1$, Logistic-RFE, SVM-$\ell_1$, SVM-RFE, LASSO, Odds Ratio, TMNB, SMNB (this work). Sparsity levels: 0.1%, 1%, 5%, 10%.]

Accuracy versus run time on IMDB/Count Vector, MNB in stage two.


Naive Feature Selection.

[Figure: function value versus k for $\psi(k)$, primalization and $\psi(k - 4)$; left panel with k from 0 to 30, right panel with k from 0 to 3000.]

Duality gap bound versus sparsity level for m = 30 (left panel) and m = 3000 (right panel), showing that the duality gap quickly closes as m or k increase.


Naive Feature Selection.

[Figure: time (s) versus m, for m from 10000 to 80000.]

Run time on the IMDB data set with tf-idf vectors, for increasing m and k with fixed ratio k/m, empirically showing (sub)linear complexity.


Naive Feature Selection.

Criteo data set. Conversion logs. 45 GB, 45 million rows, 15000 columns.

• Preprocessing (NaN, encoding categorical features) takes 50 minutes.

• Computing $f^+$ and $f^-$ takes 20 minutes.

• Computing the full curve below (i.e. solving 15000 problems) takes 2 minutes.

[Figure: objective versus number of features (k), for k from 0 to 14000.]

Standard workstation, plain Python on CPU.


Conclusion

Naive Feature Selection.

For naive Bayes, we get sparsity almost for free.

• Linear complexity.

• Nearly tight convex relaxation.

• Feature selection performance comparable to LASSO or $\ell_1$-logistic regression, but NFS is 100× faster...

• Requires no RIP assumption (only the naive one behind NB).

https://github.com/aspremon/NaiveFeatureSelection


References

Jean-Pierre Aubin and Ivar Ekeland. Estimates of the duality gap in nonconvex optimization. Mathematics of Operations Research, 1(3):225–245, 1976.

Ivar Ekeland and Roger Temam. Convex analysis and variational problems. SIAM, 1999.

Matthieu Fradelizi, Mokshay Madiman, Arnaud Marsiglietti, and Artem Zvavitch. The convexification effect of Minkowski summation. Preprint, 2017.

Alexander Lachmann, Denis Torre, Alexandra B Keenan, Kathleen M Jagodnik, Hoyjin J Lee, Lily Wang, Moshe C Silverstein, and Avi Ma’ayan. Massive mining of publicly available RNA-seq data from human and mouse. Nature Communications, 9(1):1366, 2018.

Claude Lemaréchal and Arnaud Renaud. A geometric study of duality gaps, with applications. Mathematical Programming, 90(3):399–427, 2001.

Ross M Starr. Quasi-equilibria in markets with non-convex preferences. Econometrica: Journal of the Econometric Society, pages 25–38, 1969.
