Optimal Quantum Sample Complexity of Learning Algorithms
Srinivasan Arunachalam
(Joint work with Ronald de Wolf)
Machine learning

Classical machine learning
Grand goal: enable AI systems to improve themselves
Practical goal: learn "something" from given data
Recent success: deep learning is extremely good at image recognition, natural language processing, even the game of Go
Why the recent interest? Flood of available data, increasing computational power, growing progress in algorithms

Quantum machine learning
What can quantum computing do for machine learning?
The learner will be quantum, the data may be quantum
Some examples are known of reduction in time complexity:
  clustering (Aïmeur et al. '06)
  principal component analysis (Lloyd et al. '13)
  perceptron learning (Wiebe et al. '16)
  recommendation systems (Kerenidis & Prakash '16)
Probably Approximately Correct (PAC) learning

Basic definitions
Concept class C: collection of Boolean functions on n bits (Known)
Target concept c: some function c ∈ C (Unknown)
Distribution D : {0,1}^n → [0,1] (Unknown)
Labeled example for c ∈ C: (x, c(x)) where x ∼ D

Formally: A theory of the learnable (L.G. Valiant '84)
Using i.i.d. labeled examples, learner for C should output hypothesis h that is Probably Approximately Correct
Error of h w.r.t. target c: err_D(c, h) = Pr_{x∼D}[c(x) ≠ h(x)]
An algorithm (ε, δ)-PAC-learns C if:
  ∀c ∈ C, ∀D : Pr[ err_D(c, h) ≤ ε ] ≥ 1 − δ
  ("err_D(c, h) ≤ ε" is the Approximately Correct part; "≥ 1 − δ" is the Probably part)
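To make the definitions concrete, here is a minimal classical sketch in Python (the concept, hypothesis and distribution are illustrative toys, not anything from the talk): it draws i.i.d. labeled examples (x, c(x)) with x ∼ D and compares the true error err_D(c, h) with the empirical error a learner can actually observe.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                               # number of input bits
domain = np.arange(2 ** n)          # points x encoded as integers 0..2^n - 1

# Unknown distribution D over {0,1}^n (here just a random distribution, for illustration)
D = rng.random(2 ** n)
D /= D.sum()

# Toy target concept c and candidate hypothesis h, given as truth tables over the domain
c = (domain % 3 == 0).astype(int)
h = c.copy()
h[6] ^= 1                           # h differs from c only on the point x = 6

def labeled_examples(T):
    """Draw T i.i.d. labeled examples (x, c(x)) with x ~ D."""
    xs = rng.choice(domain, size=T, p=D)
    return list(zip(xs, c[xs]))

err_true = D[c != h].sum()          # err_D(c, h) = Pr_{x~D}[c(x) != h(x)]
err_emp = np.mean([h[x] != y for x, y in labeled_examples(10_000)])
print(err_true, err_emp)            # the two should be close
```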
Complexity of learning

Recap
Concept: some function c : {0,1}^n → {0,1}
Concept class C: set of concepts
An algorithm (ε, δ)-PAC-learns C if:
  ∀c ∈ C, ∀D : Pr[ err_D(c, h) ≤ ε ] ≥ 1 − δ

How to measure the efficiency of the learning algorithm?
Sample complexity: number of labeled examples used by learner
Time complexity: number of time-steps used by learner

This talk: focus on sample complexity
No need for complexity-theoretic assumptions
No need to worry about the format of hypothesis h
Vapnik and Chervonenkis (VC) dimension

VC dimension of C ⊆ {c : {0,1}^n → {0,1}}
Let M be the |C| × 2^n Boolean matrix whose c-th row is the truth table of concept c : {0,1}^n → {0,1}
VC-dim(C): largest d s.t. some |C| × d rectangle in M contains {0,1}^d
These d column indices are shattered by C
Vapnik and Chervonenkis (VC) dimension

VC dimension of C ⊆ {c : {0,1}^n → {0,1}}
M is the |C| × 2^n Boolean matrix whose c-th row is the truth table of c
VC-dim(C): largest d s.t. some |C| × d rectangle in M contains {0,1}^d
These d column indices are shattered by C

Table: VC-dim(C) = 2

  Concepts   Truth table
  c1         0 1 0 1
  c2         0 1 1 0
  c3         1 0 0 1
  c4         1 0 1 0
  c5         1 1 0 1
  c6         0 1 1 1
  c7         0 0 1 1
  c8         0 1 0 0
  c9         1 1 1 1

Table: VC-dim(C) = 3

  Concepts   Truth table
  c1         0 1 1 0
  c2         1 0 0 1
  c3         0 0 0 0
  c4         1 1 0 1
  c5         1 0 1 0
  c6         0 1 1 1
  c7         0 0 1 1
  c8         0 1 0 1
  c9         0 1 0 0
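The shattering condition can be checked by brute force for tiny classes. The following sketch (the truth-table matrices are passed as lists of rows; only practical for very small |C| and 2^n) reports 2 and 3 on the two tables above.

```python
from itertools import combinations

def vc_dim(M):
    """VC dimension of the concept class whose rows are the truth tables in M."""
    num_cols = len(M[0])
    best = 0
    for d in range(1, num_cols + 1):
        found = False
        for cols in combinations(range(num_cols), d):
            patterns = {tuple(row[i] for i in cols) for row in M}
            if len(patterns) == 2 ** d:      # this set of d columns is shattered
                found = True
                break
        if found:
            best = d
        else:
            break   # if no d-set is shattered, no larger set can be shattered either
    return best

# The two example classes from the tables above (rows = concepts c1..c9)
C1 = [[0,1,0,1],[0,1,1,0],[1,0,0,1],[1,0,1,0],[1,1,0,1],
      [0,1,1,1],[0,0,1,1],[0,1,0,0],[1,1,1,1]]
C2 = [[0,1,1,0],[1,0,0,1],[0,0,0,0],[1,1,0,1],[1,0,1,0],
      [0,1,1,1],[0,0,1,1],[0,1,0,1],[0,1,0,0]]
print(vc_dim(C1), vc_dim(C2))   # prints: 2 3
```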
VC dimension characterizes PAC sample complexity

VC dimension of C
M is the |C| × 2^n Boolean matrix whose c-th row is the truth table of c
VC-dim(C): largest d s.t. some |C| × d rectangle in M contains {0,1}^d
These d column indices are shattered by C

Fundamental theorem of PAC learning
Suppose VC-dim(C) = d
Blumer-Ehrenfeucht-Haussler-Warmuth '86: every (ε, δ)-PAC learner for C needs Ω(d/ε + log(1/δ)/ε) examples
Hanneke '16: there exists an (ε, δ)-PAC learner for C using O(d/ε + log(1/δ)/ε) examples
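As a quick numerical illustration of the theorem, the sketch below evaluates d/ε + log(1/δ)/ε for one hypothetical parameter setting; the constant factors hidden by the Ω(·) and O(·) are ignored.

```python
import math

def pac_sample_bound(d, eps, delta):
    """Theta(d/eps + log(1/delta)/eps): the PAC sample complexity up to constant factors."""
    return d / eps + math.log(1 / delta) / eps

# e.g. a class of VC dimension 20, error 0.05, confidence 0.99
print(round(pac_sample_bound(d=20, eps=0.05, delta=0.01)))   # ~492 examples, up to constants
```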
Quantum PAC learning

(Bshouty-Jackson '95): Quantum generalization of classical PAC
Learner is quantum
Data is quantum: Quantum example is a superposition
  ∑_{x∈{0,1}^n} √D(x) |x, c(x)⟩
Measuring this state gives (x, c(x)) with probability D(x), so quantum examples are at least as powerful as classical
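A small numpy sketch of a quantum example (toy concept and distribution, purely illustrative): the state has amplitude √D(x) on the basis state |x, c(x)⟩, so a computational-basis measurement returns the classical labeled example (x, c(x)) with probability D(x).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
N = 2 ** n

D = rng.random(N); D /= D.sum()     # "unknown" distribution over {0,1}^n
c = rng.integers(0, 2, size=N)      # toy target concept, as a truth table

# |E_{c,D}> = sum_x sqrt(D(x)) |x, c(x)> on n+1 qubits; basis state |x, b> gets index 2*x + b
state = np.zeros(2 * N)
for x in range(N):
    state[2 * x + c[x]] = np.sqrt(D[x])
assert np.isclose(np.linalg.norm(state), 1.0)

# A computational-basis measurement gives outcome |x, b> with probability amplitude^2
probs = state ** 2
x, b = divmod(int(rng.choice(2 * N, p=probs)), 2)
print(f"measured example (x={x:0{n}b}, label={b}); this outcome has probability D(x)={D[x]:.3f}")
assert b == c[x]                    # the label register always agrees with c(x)
```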
Classical vs. Quantum PAC learning algorithm!
Question
Can quantum sample complexity be significantly smaller than classical?
Quantum PAC learning

Quantum Data
Quantum example: |E_{c,D}⟩ = ∑_{x∈{0,1}^n} √D(x) |x, c(x)⟩
Quantum examples are at least as powerful as classical examples

Quantum is indeed more powerful for learning! (for the uniform distribution)

Sample complexity: Learning the class of linear functions
  Classical: Ω(n) classical examples needed
  Quantum: O(1) quantum examples suffice (Bernstein-Vazirani '93)
Time complexity: Learning DNFs
  Classical: Best known upper bound is quasi-polynomial time (Verbeurgt '90)
  Quantum: Polynomial time (Bshouty-Jackson '95)
Quantum PAC learning

Quantum Data
Quantum example: |E_{c,D}⟩ = ∑_{x∈{0,1}^n} √D(x) |x, c(x)⟩
Quantum examples are at least as powerful as classical examples

Quantum is indeed more powerful for learning! (for a fixed distribution)

Learning the class of linear functions under uniform D:
  Classical: Ω(n) classical examples needed
  Quantum: O(1) quantum examples suffice (Bernstein-Vazirani '93)
Learning DNF under uniform D:
  Classical: Best known upper bound is quasi-polynomial time (Verbeurgt '90)
  Quantum: Polynomial time (Bshouty-Jackson '95)

But in the PAC model, the learner has to succeed for all D!
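As an illustration of the uniform-distribution speedup quoted above, here is a small simulation (state vectors in numpy; indexing conventions and helper names are ad hoc) of the Bernstein-Vazirani idea: take one quantum example of a linear function c_a(x) = a·x mod 2 under uniform D, apply a Hadamard to each of the n+1 qubits, and measure. With probability 1/2 the outcome reveals the whole hidden string a, so a constant number of quantum examples suffices, while classically Ω(n) examples are needed.

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(2)
n = 6
N = 2 ** n
a = rng.integers(0, 2, size=n)                # hidden string of the linear function c_a(x) = a.x mod 2

def label(x_int):                             # a.x mod 2, with x given as an integer
    bits = (x_int >> np.arange(n - 1, -1, -1)) & 1
    return int(bits @ a) % 2

# Uniform-distribution quantum example |E> = 2^{-n/2} sum_x |x, a.x>; basis state |x, b> has index 2*x + b
state = np.zeros(2 * N)
for x in range(N):
    state[2 * x + label(x)] = 1 / np.sqrt(N)

# Hadamard on every qubit (Fourier sampling): the state becomes (|0^n, 0> + |a, 1>) / sqrt(2)
H1 = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
H = reduce(np.kron, [H1] * (n + 1))
probs = (H @ state) ** 2

# Measure fresh quantum examples until the label qubit reads 1 (probability 1/2 per example)
for attempt in range(1, 11):
    idx = int(rng.choice(2 * N, p=probs))
    y, b = idx >> 1, idx & 1
    if b == 1:
        y_bits = (y >> np.arange(n - 1, -1, -1)) & 1
        print(f"attempt {attempt}: recovered a = {y_bits}, true a = {a}")
        break
```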
Quantum sample complexity

Quantum upper bound
Classical upper bound O(d/ε + log(1/δ)/ε) carries over to quantum

Best known quantum lower bounds
Atici & Servedio '04: lower bound Ω(√d/ε + d + log(1/δ)/ε)
Zhang '10: improved the first term to d^{1−η}/ε for all η > 0
Quantum sample complexity = Classical sample complexity

Quantum upper bound
Classical upper bound O(d/ε + log(1/δ)/ε) carries over to quantum

Best known quantum lower bounds
Atici & Servedio '04: lower bound Ω(√d/ε + d + log(1/δ)/ε)
Zhang '10: improved the first term to d^{1−η}/ε for all η > 0

Our result: Tight lower bound
We show: Ω(d/ε + log(1/δ)/ε) quantum examples are necessary

Two proof approaches
Information theory: conceptually simple, nearly-tight bounds
Optimal measurement: tight bounds, some messy calculations
Proof approach

1. First, we consider the problem of probably exactly learning: the quantum learner should identify the concept
2. Here, the quantum learner is given one out of |C| quantum states. Identify the target concept using copies of the quantum state
3. Quantum state identification has been well-studied
4. We'll get to probably approximately learning soon!
Proof sketch: Quantum sample complexity T ≥ VC-dim(C)/ε

State identification: Ensemble E = {(p_z, |ψ_z⟩)}_{z∈[m]}
Given state |ψ_z⟩ ∈ E with prob p_z. Goal: identify z
Optimal measurement could be quite complicated, but we can always use the Pretty Good Measurement
Crucial property: if P_opt is the optimal success probability, then P_opt ≥ P_pgm ≥ P_opt²  (Barnum-Knill '02)

How does learning relate to identification?
Quantum PAC: Given |ψ_c⟩ = |E_{c,D}⟩^{⊗T}, learn c approximately
Let VC-dim(C) = d. Suppose {s_0, ..., s_d} is shattered by C. Fix
  D(s_0) = 1 − ε, D(s_i) = ε/d on {s_1, ..., s_d}
Let k = Ω(d) and E : {0,1}^k → {0,1}^d be an error-correcting code
Pick 2^k codeword concepts {c_z}_{z∈{0,1}^k} ⊆ C:
  c_z(s_0) = 0, c_z(s_i) = E(z)_i ∀ i ∈ [d]
Pick concepts {c_z} ⊆ C: c_z(s_0) = 0, c_z(s_i) = E(z)_i ∀ i

Suppose VC-dim(C) = d + 1 and {s_0, ..., s_d} is shattered by C, i.e., the |C| × (d+1) rectangle of {s_0, ..., s_d} contains {0,1}^{d+1}

  Concepts       s_0  s_1  ...  s_{d-1}  s_d  ...
  c_1             0    0   ...     0      0   ...
  c_2             0    0   ...     1      0   ...
  c_3             0    0   ...     1      1   ...
  ...
  c_{2^d}         0    1   ...     1      1   ...
  c_{2^d + 1}     1    0   ...     0      1   ...
  ...
  c_{2^{d+1}}     1    1   ...     1      1   ...
  ...

  (the first block, c_1, ..., c_{2^d}, consists of the concepts with c(s_0) = 0)

Among {c_1, ..., c_{2^d}}, pick the 2^k concepts that correspond to codewords of E : {0,1}^k → {0,1}^d on {s_1, ..., s_d}
Proof sketch: Quantum sample complexity T ≥ VC-dim(C)/ε

State identification: Ensemble E = {(p_z, |ψ_z⟩)}_{z∈[m]}
Given state |ψ_z⟩ ∈ E with prob p_z. Goal: identify z
Optimal measurement could be quite complicated, but we can always use the Pretty Good Measurement
If P_opt is the optimal success probability, then P_opt ≥ P_pgm ≥ P_opt²

How does learning relate to identification?
Quantum PAC: Given |ψ_c⟩ = |E_{c,D}⟩^{⊗T}, learn c approximately
Let VC-dim(C) = d. Suppose {s_0, ..., s_d} is shattered by C. Fix D: D(s_0) = 1 − ε, D(s_i) = ε/d on {s_1, ..., s_d}
Let k = Ω(d) and E : {0,1}^k → {0,1}^d be an error-correcting code
Pick 2^k concepts {c_z}_{z∈{0,1}^k} ⊆ C: c_z(s_0) = 0, c_z(s_i) = E(z)_i ∀ i

Learning c_z approximately (w.r.t. D) is equivalent to identifying z!
Sample complexity lower bound via PGM

Recap
Learning c_z approximately (w.r.t. D) is equivalent to identifying z!
If the sample complexity is T, then there is a good learner that identifies z from |ψ_{c_z}⟩ = |E_{c_z,D}⟩^{⊗T} w.p. ≥ 1 − δ
Goal: Show T ≥ d/ε

Analysis of PGM
For the ensemble {|ψ_{c_z}⟩ : z ∈ {0,1}^k} with uniform probabilities p_z = 1/2^k, we have P_pgm ≥ P_opt² ≥ (1 − δ)²
Recall k = Ω(d) because we used a good ECC
P_pgm ≤ · · · 4-page calculation · · · ≤ exp(T²ε²/d + √(Tdε) − d − Tε)
This implies T = Ω(d/ε)
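A small sanity check of the last implication (a back-of-the-envelope rearrangement, not part of the original calculation): writing T = α·d/ε, the exponent in the bound equals d·(α² + √α − 1 − α), which is negative for every α < 1. So for T much smaller than d/ε the upper bound on P_pgm is exponentially small in d, contradicting P_pgm ≥ (1 − δ)².

```python
import numpy as np

# Write T = alpha * d / eps. The exponent T^2*eps^2/d + sqrt(T*d*eps) - d - T*eps
# then equals d * f(alpha) with f(alpha) = alpha^2 + sqrt(alpha) - 1 - alpha.
def f(alpha):
    return alpha ** 2 + np.sqrt(alpha) - 1 - alpha

for alpha in [0.1, 0.25, 0.5, 0.9, 1.0, 1.1]:
    print(f"alpha = {alpha:4}: f(alpha) = {f(alpha):+.3f}")

# f(alpha) < 0 for alpha < 1, so if T << d/eps the bound exp(d*f(alpha)) is
# exponentially small in d, contradicting P_pgm >= (1-delta)^2. Hence T = Omega(d/eps).
```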
Conclusion and future work

Further results
Agnostic learning: No quantum bounds were known before (unlike the PAC model). We showed that quantum examples do not reduce the sample complexity
We also studied the model with random classification noise and showed that quantum examples are no better than classical examples

Future work
Quantum machine learning is still young!
Theoretically, one could consider more optimistic PAC-like models where the learner need not succeed ∀c ∈ C and ∀D
Efficient quantum PAC learnability of AC^0 under uniform D?
Buffer 1: Proof approach via Information theory

Suppose {s_0, ..., s_d} is shattered by C. By definition:
  ∀a ∈ {0,1}^d ∃c ∈ C s.t. c(s_0) = 0 and c(s_i) = a_i ∀ i ∈ [d]
Fix a nasty distribution D: D(s_0) = 1 − 4ε, D(s_i) = 4ε/d on {s_1, ..., s_d}
A good learner produces a hypothesis h s.t. h(s_i) = c(s_i) = a_i for ≥ 3/4 of the i's
Think of c as a uniform d-bit string A, approximated by h ∈ {0,1}^d that depends on the examples B = (B_1, ..., B_T)

1. I(A : B) ≥ I(A : h(B)) ≥ Ω(d)   [because h ≈ A]
2. I(A : B) ≤ ∑_{i=1}^T I(A : B_i) = T · I(A : B_1)   [subadditivity]
3. I(A : B_1) ≤ 4ε   [because the probability of a useful example is 4ε]

This implies Ω(d) ≤ I(A : B) ≤ 4Tε, hence T = Ω(d/ε)

For analyzing quantum examples, only step 3 changes:
  I(A : B_1) ≤ O(ε log(d/ε)) ⇒ T = Ω(d/(ε log(d/ε)))
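Step 3 can be verified exactly for small d by building the joint distribution of (A, B_1) and computing the mutual information directly; the sketch below (toy parameters, with points and labels encoded in an ad-hoc way) returns I(A : B_1) ≈ 4ε bits for the classical example distribution used above.

```python
import numpy as np
from itertools import product

d, eps = 6, 0.05
# The "nasty" distribution: D(s0) = 1 - 4*eps, D(si) = 4*eps/d for i = 1..d
D = np.array([1 - 4 * eps] + [4 * eps / d] * d)

def H(probs):
    """Shannon entropy (in bits) of a collection of probabilities."""
    return -sum(p * np.log2(p) for p in probs if p > 0)

# A is a uniform d-bit string; the concept has c_A(s0) = 0 and c_A(si) = A_i.
# One labeled example is B1 = (index of the sampled point, its label).
joint = {}
for a in product([0, 1], repeat=d):
    labels = (0,) + a
    for i, px in enumerate(D):
        joint[(a, (i, labels[i]))] = px / 2 ** d

pA, pB = {}, {}
for (a, b), p in joint.items():
    pA[a] = pA.get(a, 0) + p
    pB[b] = pB.get(b, 0) + p

I = H(pA.values()) + H(pB.values()) - H(joint.values())   # I(A : B1) in bits
print(f"I(A : B1) = {I:.4f} bits   vs   4*eps = {4 * eps}")
```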
Buffer 2: Proof approach in detail

Suppose we're given state |ψ_i⟩ with prob p_i, i = 1, ..., m. Goal: learn i
Optimal measurement could be quite complicated, but we can always use the Pretty Good Measurement.
This has POVM operators M_i = p_i ρ^{−1/2} |ψ_i⟩⟨ψ_i| ρ^{−1/2}, where ρ = ∑_i p_i |ψ_i⟩⟨ψ_i|
Success probability of the PGM: P_PGM = ∑_i p_i Tr(M_i |ψ_i⟩⟨ψ_i|)
Crucial property (BK'02): if P_OPT is the success probability of the optimal POVM, then P_OPT ≥ P_PGM ≥ P_OPT²
Let G be the m × m Gram matrix of the vectors √p_i |ψ_i⟩; then P_PGM = ∑_i (√G)(i, i)²
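The definitions above translate directly into a few lines of numpy. The sketch below (a random toy ensemble, nothing specific to the learning problem) builds ρ, the PGM operators M_i and P_PGM, and checks the Gram-matrix formula P_PGM = ∑_i (√G)(i,i)² numerically.

```python
import numpy as np

rng = np.random.default_rng(3)
m, dim = 4, 8                                   # toy ensemble: 4 random pure states in dimension 8

psi = rng.normal(size=(m, dim)) + 1j * rng.normal(size=(m, dim))
psi /= np.linalg.norm(psi, axis=1, keepdims=True)
p = rng.random(m); p /= p.sum()                 # prior probabilities p_i

proj = [np.outer(psi[i], psi[i].conj()) for i in range(m)]     # |psi_i><psi_i|
rho = sum(p[i] * proj[i] for i in range(m))

# rho^{-1/2} on the support of rho (pseudo-inverse square root)
w, V = np.linalg.eigh(rho)
inv_sqrt = V @ np.diag([1 / np.sqrt(x) if x > 1e-12 else 0.0 for x in w]) @ V.conj().T

# PGM operators M_i = p_i rho^{-1/2} |psi_i><psi_i| rho^{-1/2}, and the PGM success probability
M = [p[i] * inv_sqrt @ proj[i] @ inv_sqrt for i in range(m)]
P_pgm = sum(p[i] * np.real(np.trace(M[i] @ proj[i])) for i in range(m))

# Gram-matrix formula: P_pgm = sum_i sqrt(G)(i,i)^2, where G(i,j) = sqrt(p_i p_j) <psi_i|psi_j>
G = np.array([[np.sqrt(p[i] * p[j]) * np.vdot(psi[i], psi[j]) for j in range(m)] for i in range(m)])
wG, VG = np.linalg.eigh(G)
sqrtG = VG @ np.diag(np.sqrt(np.clip(wG, 0, None))) @ VG.conj().T
print(P_pgm, float(np.sum(np.real(np.diag(sqrtG)) ** 2)))       # the two numbers agree
```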
Buffer 3: Analysis of PGM

For the ensemble {|ψ_{c_z}⟩ : z ∈ {0,1}^k} with uniform probabilities p_z = 1/2^k, we have P_PGM ≥ (1 − δ)²
Let G be the 2^k × 2^k Gram matrix of the vectors √p_z |ψ_{c_z}⟩; then P_PGM = ∑_z (√G)(z, z)²
G_{xy} = g(x ⊕ y). We can diagonalize G using the Hadamard transform, and its eigenvalues will be 2^k ĝ(s). This gives √G explicitly
∑_z (√G)(z, z)² ≤ · · · 4-page calculation · · · ≤ exp(T²ε²/d + √(Tdε) − d − Tε)
This implies T = Ω(d/ε)
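The diagonalization step is easy to check numerically: for any real g, the matrix G(x, y) = g(x ⊕ y) is diagonalized by the ±1 Hadamard matrix, with eigenvalues 2^k ĝ(s), i.e. the unnormalized Fourier coefficients ∑_x g(x)(−1)^{x·s}. A short sketch with a random g:

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(4)
k = 5
K = 2 ** k

g = rng.random(K)                                          # any real function g on {0,1}^k
G = g[np.bitwise_xor.outer(np.arange(K), np.arange(K))]    # G(x, y) = g(x XOR y)

# Unnormalized +-1 Hadamard matrix H(x, s) = (-1)^{x.s}; H @ g gives 2^k * ghat(s)
H1 = np.array([[1, 1], [1, -1]])
H = reduce(np.kron, [H1] * k)

eig_fourier = H @ g                                        # claimed eigenvalues 2^k * ghat(s)
eig_direct = np.linalg.eigvalsh(G)
print(np.allclose(np.sort(eig_fourier), np.sort(eig_direct)))   # True
```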