Some Recent Insights on Transfer Learning
P → Q?

Samory Kpotufe, Columbia University
Based on work with Guillaume Martinet, and (ongoing) Steve Hanneke

Transfer Learning:
Given data {X_i, Y_i} ∼ i.i.d. P, produce a classifier for (X, Y) ∼ Q.

Case study: Apple Siri's voice assistant
- Initially trained on data from American English speakers ...
- Could not understand 30M+ nonnative speakers in the US!
Costly solution ≡ 5+ years acquiring more data and retraining!

A Main Practical Goal:
Cheaply transfer ML software between related populations.

Transfer is of general relevance:

AI for Judicial Systems
- Source population: prison inmates
- Target population: everyone arrested
Over 60% inaccurate risk assessments on minorities (2016 ProPublica study)

Main Issue: Good target data is hard or expensive to acquire.
AI in medicine, genomics, the insurance industry, smart cities, ...

Many heuristics ... but no mature theory or principles.

Basic questions remain largely unanswered:

Suppose: h is trained on source data ∼ P, to be transferred to target Q.
• Is there enough information in source P about target Q?
• If not, how much new data should we collect, and how?
• Would unlabeled target data suffice? Or help at least?

Formal Setup:

Covariate-shift: P_X ≠ Q_X but P_{Y|X} = Q_{Y|X}.
(For P_{Y|X} ≠ Q_{Y|X}: [Scott 19], [Blanchard et al. 19], [Cai & Wei 19].)

Given: source data {X_i, Y_i} ∼ P^{n_P}, target data {X_i, Y_i} ∼ Q^{n_Q}.
Goal: h with small target error err_Q(h) = E_Q 1(h(X) ≠ Y).

What is the best err_Q(h) achievable in terms of n_P, n_Q?
Depends on distance(P_X → Q_X).

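To make the setup concrete, a minimal simulation (added illustration, not from the talk): source and target share η(x) = E[Y|x] but differ in their X-marginals, and a source-trained k-NN classifier is scored on the target. All names (eta, sample) are mine.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)

    def eta(x):                      # shared conditional P(Y = 1 | x)
        return 1.0 / (1.0 + np.exp(-4.0 * x))

    def sample(mean, n):             # X-marginal differs only through `mean`
        X = rng.normal(mean, 1.0, size=(n, 1))
        Y = (rng.uniform(size=n) < eta(X[:, 0])).astype(int)
        return X, Y

    Xp, Yp = sample(-2.0, 2000)      # source P_X: mass centered at -2
    Xq, Yq = sample(+2.0, 2000)      # target Q_X: mass centered at +2

    h = KNeighborsClassifier(n_neighbors=15).fit(Xp, Yp)
    print("err_Q(h) =", np.mean(h.predict(Xq) != Yq))  # target error of source-trained h
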
However, the right notion of distance(P_X → Q_X) remains unclear...

Basic intuition: Transfer is easiest if P has sufficient mass in regions of large Q-mass.

[Figure omitted: unequal proportions of the two classes (e.g., unequal proportions of English accents).]

Many foundational results quantify this intuition ...

Common notions of distance(P_X → Q_X):

• Extensions of TV: differences |P_X(A) − Q_X(A)| over suitable A
  (e.g., the dA-divergence / Y-discrepancy of S. Ben-David, M. Mohri, ...)
• Density ratios: the ratio dQ_X/dP_X over the data space
  (e.g., Sugiyama, Belkin, Jordan, Wainwright, ...)

How well do these notions measure transferability?

Seminal results: (TV, dA, Y-disc, dQ/dP, KL, Rényi, Wasserstein)

The closer P_X is to Q_X =⇒ the easier transfer is ...
However: P_X far from Q_X does NOT imply that transfer is hard.
(E.g., large TV, dA, Y-disc ≈ 1/2; large dQ/dP, KL-div ≈ ∞.)

These notions can be pessimistic in measuring transferability ...

We propose a new distance(P → Q) shown to control transfer ...

Relating source P to target Q [Kpo., Martinet, COLT 18]

Main intuition: P_X needs mass in regions of significant Q_X mass.

Transfer exponent γ ≥ 0:
∀ x ∈ supp(Q_X), ∀ r ∈ (0, 1]:  P_X(B(x, r)) ≥ C · r^γ · Q_X(B(x, r))

γ captures a continuum of easy to hard transfer ...

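A rough empirical probe of γ (my own illustration, not a procedure from the paper): since the definition gives P_X(B(x,r)) ≥ C · r^γ · Q_X(B(x,r)), one can compare empirical ball masses at target points and fit a slope in log-log scale. The definition is a worst-case bound over x, so a fitted slope is only a crude summary.

    import numpy as np

    rng = np.random.default_rng(1)
    Xp = rng.normal(0.0, 1.0, size=(5000, 1))      # source sample from P_X
    Xq = rng.normal(1.5, 0.5, size=(5000, 1))      # target sample from Q_X

    def ball_mass(X, x, r):                        # empirical mass of B(x, r)
        return np.mean(np.linalg.norm(X - x, axis=1) <= r)

    radii = np.array([0.1, 0.2, 0.4, 0.8])
    log_ratio, log_r = [], []
    for x in Xq[rng.choice(len(Xq), 50)]:          # ball centers drawn from Q
        for r in radii:
            p, q = ball_mass(Xp, x, r), ball_mass(Xq, x, r)
            if p > 0 and q > 0:
                log_ratio.append(np.log(p / q))    # ≈ log C + γ · log r
                log_r.append(np.log(r))

    gamma_hat = np.polyfit(log_r, log_ratio, 1)[0] # slope ≈ γ (crude)
    print("estimated γ ≈", gamma_hat)
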
First let's look at the extremes γ = ∞ or 0.

Notice that γ ≐ γ(P → Q) is asymmetric
(unlike TV, dA, Y-discrepancy, Wasserstein, ...).

The continuum 0 < γ < ∞:

• γ ≡ how fast P_X shifts mass away from Q_X-dense regions.
• γ ≡ how fast f_Q/f_P goes to ∞.
• γ ≡ difference in support dimension.

Optimistic: γ is often small when other measures are not.

• Large TV, dA, Y-disc ≈ 1/2, but here typically γ ≈ 0.
• Large dQ_X/dP_X, KL-div ≈ ∞, but γ = 1 ≡ dim(P_X) − dim(Q_X).

γ captures performance limits (minimax rates) under transfer ...

Performance depends on γ + hardness of Q:

[Figures omitted: examples of easy to hard target Q_{X,Y} classification.]

Essential: noise in Q_{Y|X}, and Q_X-mass near the decision boundary.

Parametrizing easy to hard Q_{X,Y}:

Setup: X ∈ compact 𝒳, Y ∈ {0, 1}.

Noise conditions on η(x) = E[Y | x]:
• Smoothness: |η(x) − η(x′)| ≤ λ ρ(x, x′)^α.
• Noise margin: Q_X(x : |η(x) − 1/2| < t) ≤ C t^β.

2 types of regularity on Q_X:
• Near-uniform mass: for any ball B_r, Q_X(B_r) ≥ C r^d.
• Support regularity: 𝒳_Q has r-cover size ≤ C r^{−d}.

d above acts like the intrinsic dimension of Q_X, for X ∈ ℝ^D.

Classification is easiest with large α, β, small d ... so is transfer.

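For concreteness, a small example (added here): if η(x) = x on 𝒳 = [0, 1] with Q_X uniform, then |η(x) − η(x′)| = |x − x′|, so α = 1 (with λ = 1); and Q_X(|η(x) − 1/2| < t) = 2t, so β = 1 (with C = 2); here d = 1.
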
Minimax Rates of Transfer:

Given: labeled source and target data {X_i, Y_i} ∼ P^{n_P} × Q^{n_Q}.
Excess error: E_Q(h) ≡ err_Q(h) − inf_h err_Q(h).

Theorem. Define h on {X_i, Y_i}, even with knowledge of P_X, Q_X:

  inf_h  sup_{dist(P,Q)=γ}  E E_Q(h) ≈ (n_P^{d_0/(d_0 + γ/α)} + n_Q)^{−(β+1)/d_0},

where d_0 = 2 + d/α for near-uniform Q_X, and d_0 = 2 + β + d/α otherwise.

Immediate message: transfer is easiest as γ → 0, hardest as γ → ∞ ...

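Plugging numbers into the rate helps read it. The sketch below (my own, near-uniform Q_X case) shows how the effective source sample size n_P^{d_0/(d_0+γ/α)} collapses as γ grows; all parameter values are arbitrary choices for illustration.

    # rate ≈ (n_P^{d_0/(d_0+γ/α)} + n_Q)^{-(β+1)/d_0}, with d_0 = 2 + d/α
    def minimax_rate(n_P, n_Q, gamma, alpha=1.0, beta=1.0, d=2):
        d0 = 2 + d / alpha
        effective_source = n_P ** (d0 / (d0 + gamma / alpha))
        return (effective_source + n_Q) ** (-(beta + 1) / d0)

    for gamma in [0, 1, 4, 100]:
        print(gamma, minimax_rate(n_P=10_000, n_Q=100, gamma=gamma))
    # As gamma grows, 10,000 source points are worth ever fewer target points.
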
Simulations with increasing γ:

n_S ≡ source sample size (no target sample used).

Transfer requires more source data for larger γ values.

Lower-bound analysis:

Main ingredients:
- Established techniques based on Fano's inequality.
- h has access to (P, Q) samples, but has to do well on just Q ...

Upper-bound analysis:

Rate achieved by k-NN on combined source and target data.
Main ingredient: show that NN distances depend on ρ.

One open problem we had to solve:
rates for vanilla k-NN without a uniform density assumption.

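A minimal sketch of the flavor of the procedure: vanilla k-NN fit on the pooled source + target sample. The choice of k below is a placeholder heuristic of mine, not the paper's tuned value.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def pooled_knn(Xp, Yp, Xq, Yq, k=None):
        X = np.vstack([Xp, Xq])               # pool source and target samples
        Y = np.concatenate([Yp, Yq])
        if k is None:
            k = max(1, int(len(X) ** 0.5))    # heuristic k ~ sqrt(n), an assumption
        return KNeighborsClassifier(n_neighbors=k).fit(X, Y)
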
Many new messages:

Recall minimax rates ≈ (n_P^{d_0/(d_0 + γ/α)} + n_Q)^{−(β+1)/d_0}.

- Fast rates O(1/n) are possible even with large γ.
- Target data is most beneficial when n_Q ≳ n_P^{d_0/(d_0 + γ/α)}.
- Unlabeled data does not improve rates beyond constants ...

All good but ...
Can we automatically decide how much target data to sample? (Ongoing work...)

γ yields insights into adaptive sampling: (ongoing work)

Setup: n_P labeled samples from P, n_Q unlabeled samples from Q.

Adaptive Sampling:
Sample in low-confidence regions A ⊂ 𝒳 with large γ(A)
(γ(A) compares P_X and Q_X in region A).

Essentially: label in Q-massive regions with few samples from P ...

The above refines a procedure of [Berlind, Urner, 15].

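The sketch below is my own toy rendering of this idea, not the authors' algorithm: request target labels where a source-trained classifier is unconfident and the source sample is locally sparse relative to the target sample (a crude proxy for a region with large γ(A)). The scoring rule and constants are assumptions.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

    def request_labels(Xp, Yp, Xq_unlabeled, budget, k=15):
        h = KNeighborsClassifier(n_neighbors=k).fit(Xp, Yp)
        conf = np.abs(h.predict_proba(Xq_unlabeled)[:, 1] - 0.5)  # low = uncertain
        # Local-mass proxy: k-NN radius in source vs. target sample; a larger
        # source radius means less local source mass (locally large gamma(A)).
        rp = NearestNeighbors(n_neighbors=k).fit(Xp).kneighbors(Xq_unlabeled)[0][:, -1]
        rq = NearestNeighbors(n_neighbors=k).fit(Xq_unlabeled).kneighbors(Xq_unlabeled)[0][:, -1]
        score = (rp / rq) / (conf + 1e-3)     # prefer uncertain, source-sparse points
        return np.argsort(-score)[:budget]    # indices of target points to label
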
Initial experiments with CIFAR-10:

Transfer setup: target Q has 50% images of cats and dogs.
n_P = 20K, n_Q = 6K unlabeled Q data.
h: deep NN with 10 layers, and k-NN on top.

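A hedged sketch of the "deep features + k-NN on top" pipeline; the talk's 10-layer architecture is not specified, so the small network below is a stand-in of my own.

    import torch
    import torch.nn as nn
    from sklearn.neighbors import KNeighborsClassifier

    class SmallCNN(nn.Module):                 # stand-in feature extractor
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Flatten(), nn.Linear(64 * 16 * 16, 128), nn.ReLU())
        def forward(self, x):                  # x: (N, 3, 32, 32) CIFAR images
            return self.features(x)

    def knn_on_features(net, Xp, Yp, k=15):
        with torch.no_grad():
            Z = net(Xp).numpy()                # embed labeled source images
        return KNeighborsClassifier(n_neighbors=k).fit(Z, Yp)
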
Results:

Best performance achieved after relatively few label requests ...

Quick Summary and some New Directions ...

Quick Summary:

• γ captures a more optimistic view of transferability P → Q.
• Unlabeled data can only improve constants in the rates.
• Adaptive sampling is possible with no knowledge of γ.

New direction: refining γ ...

Often in practice, a family H of predictors is fixed
(NN, trees, SVMs, neural nets ...).

Intuition: consider regions of 𝒳 most relevant to H (with S. Hanneke).

In particular, consider disagreements between classifiers:

  γ:  P_X(h ≠ h′) ≳ Q_X(h ≠ h′)^{1+γ}

[Figure omitted: the set A where two half-spaces disagree.]

This yields H-specific performance limits ...

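To make the refined quantity concrete, a small check I added: the empirical disagreement masses P_X(h ≠ h′) and Q_X(h ≠ h′) for two half-spaces, the quantities the H-specific γ compares. Distributions and classifiers are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(2)
    Xp = rng.normal(0.0, 1.0, size=(10000, 2))    # source sample from P_X
    Xq = rng.normal(2.0, 1.0, size=(10000, 2))    # target sample from Q_X

    h      = lambda X: X[:, 0] > 0.0              # two half-space classifiers
    hprime = lambda X: X[:, 0] + 0.3 * X[:, 1] > 0.5

    def disagreement_mass(X):                     # mass of {x : h(x) != h'(x)}
        return np.mean(h(X) != hprime(X))

    print("P_X(h != h') =", disagreement_mass(Xp))
    print("Q_X(h != h') =", disagreement_mass(Xq))
    # The refined gamma asks that P-mass dominate (Q-mass)^(1+gamma) over all pairs in H.
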
New Messages: [Kpo. & Hanneke, 19]

Near-optimal heuristics for bounded VC classes (no need to estimate γ):

No classification noise:
ERM on combined source and target data is minimax-optimal.

Any level of noise:
minimize R_Q(h) subject to R_P(h) ≤ min_{h′} R_P(h′) + Δ_{n_P}(h).
Caveat: hard to implement in general...

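A toy instance I wrote of this constrained rule for a finite class of threshold classifiers on the line; the slack argument stands in for Δ_{n_P}(h), whose exact form follows [Kpo. & Hanneke, 19].

    import numpy as np

    def constrained_erm(thresholds, Xp, Yp, Xq, Yq, slack):
        thresholds = np.asarray(thresholds)
        def risk(t, X, Y):                    # empirical 0-1 risk of x -> 1[x > t]
            return np.mean((X > t).astype(int) != Y)
        rp = np.array([risk(t, Xp, Yp) for t in thresholds])
        rq = np.array([risk(t, Xq, Yq) for t in thresholds])
        feasible = rp <= rp.min() + slack     # keep near-optimal source risk only
        return thresholds[feasible][np.argmin(rq[feasible])]
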
Somehow we are still just scratching the surface of what's possible...

Results extend beyond covariate-shift to P_{Y|X} ≠ Q_{Y|X}.

Mostly open:
• More complex transfer regimes?
• Multitask, curriculum, lifelong, fairness, robustness?

Thanks!