
Sub-Linear Algorithms in Learning and Testing

Course taught at Columbia University (COMS 6998-3)¹

Rocco Servedio

Spring 2014

¹Webpage: http://www.cs.columbia.edu/~rocco/Teaching/S14/


Foreword

Recently there has been a lot of glorious hullabaloo about Big Data and how it is going to revolutionize the way we work, play, eat and sleep. As part of the general excitement, it has become clear that for truly massive datasets and digital objects of various sorts, even algorithms which run in linear time (linear in the size of the relevant dataset or object) may be much too slow. In this course we will study sub-linear time algorithms which are aimed at helping us understand massive objects. The algorithms we study inspect only a tiny portion of an unknown object and have the goal of coming up with some useful information about the object. Algorithms of this sort provide a foundation for principled analysis of truly massive data sets and data objects.

In particular we will consider
• learning algorithms: here the goal is to come up with a representation (or hypothesis) that is a high-accuracy approximation of the unknown object;
• algorithms for property testing: here the (more modest) goal is to determine whether the unknown object has some particular property of interest, or is "far" (in a suitable sense) from every object having the property.

We will study these algorithms for three different types of objects: Boolean functions, graphs, and probability distributions.¹

¹Due to time constraints, this last topic was only touched on in the first introductory lecture.


Acknowledgements

These lecture notes were compiled from scribe notes taken by the students during the course of the class. In particular, we would like to thank Clement Canonne, Anirban Gangopadhyay, Hyungtae Kim, Lucas Kowalczyk, Enze Li, Ting-Chu Lin, Yang Liu, Yiren Lu, Jelena Marasevic, Bach Nguyen, Keith Nichols, Dimitris Paidarakis, Fotis Psallidas, Richard J. Stark, Timothy Sun, Li-Yang Tan, Jinyu Xie and Yichi Zhang.

Special thanks go to Clement Canonne, both for editing all the scribe notes and for compiling and formatting them into this document.


Contents

Foreword

1  January 22, 2014: Introduction
   1.1  High-level overview
        1.1.1  What is this course about?
        1.1.2  Kinds of algorithmic problems
        1.1.3  Summary
   1.2  Some content – Testing Sortedness
        1.2.1  First naive attempt
        1.2.2  Second naive attempt
        1.2.3  Third (and right) attempt

2  January 29, 2014: Learning Boolean functions
   2.1  Learning Boolean Functions
        2.1.1  Setup
   2.2  Learning Parities
   2.3  Learning Juntas
   2.4  Fourier Analysis of Boolean Functions
   2.5  Special Properties of Parity Functions in V
   2.6  Parity Functions as an Orthonormal Basis of V

3  February 05, 2014
   3.1  Overview
        3.1.1  Last time
        3.1.2  Today
   3.2  Basics of Fourier analysis, concluded
        3.2.1  Fourier and learning
   3.3  The LMN Algorithm (Linial–Mansour–Nisan)
        3.3.1  Preliminary Definitions
        3.3.2  LMN Algorithm
        3.3.3  Summing up
   3.4  The KM Algorithm (Kushilevitz–Mansour)
        3.4.1  Introduction
        3.4.2  Building up to the KM algorithm

4  February 12, 2014
   4.1  Last Time
   4.2  KM Algorithm
   4.3  Applications of the KM algorithm
        4.3.1  Learning functions with small L1 norm
        4.3.2  Sparse Fourier Representations
   4.4  Monotone Boolean Functions
        4.4.1  Influence
   4.5  Preview for next time

5  February 19, 2014
   5.1  Overview
        5.1.1  Last time
        5.1.2  Today
   5.2  Finish learning monotone Boolean functions
   5.3  Lower bounds for learning monotone Boolean functions
   5.4  Main contribution of LMN: learning AC^0 circuits
   5.5  Learning halfspaces
   5.6  Next time

6  February 26, 2014
   6.1  Overview
        6.1.1  Last time
        6.1.2  Today
   6.2  Learning functions with small prefix covers
   6.3  (Property) testing Boolean functions

7  March 5, 2014: Property testing for Boolean functions
   7.1  Overview
        7.1.1  Last time
        7.1.2  Today: Property testing for Boolean functions
   7.2  Proper learning implies Property Testing
   7.3  Linearity testing
        7.3.1  Definitions
        7.3.2  BLR Test
        7.3.3  Generalizations
   7.4  Monotonicity testing

8  March 12, 2014
   8.1  Overview
        8.1.1  Last time
        8.1.2  Today
   8.2  Testing Monotonicity: O(n/ε) upper bound
   8.3  Ω(√n) lower bound for non-adaptive 1-sided testers
   8.4  Ω(n^{1/5}) lower bound for non-adaptive, 2-sided testers
        8.4.1  Yao's Principle (easy direction)

9  March 26, 2014
   9.1  Overview
        9.1.1  Last Time
        9.1.2  Today: lower bound for two-sided non-adaptive monotonicity testers
   9.2  Proving the Ω(n^{1/5}) lower bound
        9.2.1  Preliminaries
        9.2.2  The lower bound construction

10 March 04, 2014
   10.1  Overview
         10.1.1  Last Time
         10.1.2  Today
   10.2  Monotonicity testing lower bound: wrapping up
   10.3  Testing Juntas
         10.3.1  Setup
         10.3.2  Characterization of far-from-juntas functions
         10.3.3  (Naive) Junta testing

11 April 9, 2014
   11.1  Overview
         11.1.1  Last Time
         11.1.2  Today
   11.2  The actual junta test
   11.3  Proof of the main lemma
         11.3.1  Big sets S ⊆ J
         11.3.2  Small sets S ⊆ J
   11.4  Testing juntas: a lower bound

12 April 16, 2014: Property testing for graphs
   12.1  Overview
         12.1.1  Last Time: end of Boolean function testing
         12.1.2  Today: Graph Property testing
         12.1.3  Next Time
   12.2  Basics of Graph Property Testing
   12.3  Adaptiveness is not (that) helpful
   12.4  Testing bipartiteness
         12.4.1  Naive analysis
         12.4.2  Harder, stronger (better) analysis

13 April 24, 2014
   13.1  Overview
         13.1.1  Last Time
         13.1.2  Today
   13.2  poly(1/ε)-query testable graph properties
   13.3  General Graph Partition Testing (GGPT)
   13.4  Triangle-freeness

14 April 30, 2014
   14.1  Overview
         14.1.1  Last Time
         14.1.2  Today
   14.2  Proof Sketch for Szemerédi Regularity Lemma
         14.2.1  High-level Idea of SRL Proof
   14.3  Lower Bound for Testing ∆-freeness
   14.4  Sparse Graph Testing in Bounded-Degree Model

Bibliography


Lecture 1

January 22, 2014: Introduction

Plan:
• High-level overview
• Some content: first taste.

Relevant Readings:
• Ergun, Kannan, Kumar, Rubinfeld and Viswanathan, 1998: Spot-checkers. [EKK+98]
• Ron, 2008: Property Testing: A Learning Theory Perspective. [Ron08]

1.1 High-level overview

1.1.1 What is this course about?

Goal: Get information from some massive data object – so humongous we cannot possibly look at the whole object. Instead, we must use sublinear-time algorithms to have any hope of getting something done.

This immediately triggers the first natural question – is it even possible to do anything? As we shall see, the answer is – perhaps surprisingly – yes.

What kind of "objects"? We will be mainly interested in 3 different sorts of ginormous objects: Boolean functions, graphs and probability distributions.

Example 1. The object is a Boolean function f : {0, 1}^n → {0, 1}, of size 2^n, to which we have query access:


(Diagram: a black box computing f; on query x ∈ {0, 1}^n, the oracle returns f(x).)

Example 2. The object is a graph G = (V, E) where V = [N] = {1, . . . , N}, for N = 2^n; we have query access to its adjacency matrix:

(Diagram: the adjacency-matrix oracle for G; on query (i, j) ∈ V × V, it answers Yes iff (i, j) ∈ E.)

Example 3. The object is a probability distribution D over [N], and we have access to independent samples:

(Diagram: a "sample-delivering button" for D; each press returns an independent draw i ∼ D.)

What can we hope for? For most questions, it is impossible to get an exact answer in sublinear time. Think for instance of deciding whether a function is identically zero or not; until all 2^n points have been queried, there is no way to be certain of the answer. However, by accepting approximate answers, we can do a lot.

Similarly, it is not hard to see that it is paramount that our algorithms are randomized. Deterministic ones are easy to "fool". So, we will seek algorithms that give good approximations with high probability.

randomization + approximation

1.1.2 Kinds of algorithmic problems

We will consider two main families of problems over the three types of objects introduced above: learning and property testing.

Learning problems

The goal is to output some high-quality approximation of the object O. Note that we can only do this if we know O is highly structured. For instance, learning a completely random Boolean function f defined by tossing a coin to generate a truth table (for each x ∈ {0, 1}^n, f(x) is chosen uniformly, independently at random in {0, 1}) clearly cannot be done with an efficient number of queries.

Example 4 (Learning decision trees). Assume f : {0, 1}^n → {0, 1} is computed by a poly(n)-size decision tree (DT). E.g., for x = (x1, . . . , x8), f(x) is given by the value at the leaf reached by going down the following tree (0: left, 1: right):

(Figure: an example decision tree over x1, . . . , x8; internal nodes query variables such as x7, x5, x4, x3, x1 and x6, and each leaf is labeled 0 or 1.)

(In this example, f(11001100) = 1.) The distance measure used here will be the Hamming distance: for f, g : {0, 1}^n → {0, 1},

    d(f, g) def= |{ x ∈ {0, 1}^n : f(x) ≠ g(x) }| / 2^n = Pr_{x∼U({0,1}^n)}[ f(x) ≠ g(x) ]   (1.1)

Question: Can we, given black-box access to f (promised to be computed by a poly(n)-size DT), run in poly(n, 1/ε) time and output some hypothesis h : {0, 1}^n → {0, 1} s.t. d(f, h) ≤ ε?
Spoiler: yes – we will cover this in future lectures.

Example 5 (Learning distributions). Distribution D over [N]. Assume D has structure;¹ more particularly, for this example, assume D is monotone (non-increasing):

    D(1) ≥ D(2) ≥ · · · ≥ D(N)

¹As we will show later in the class, if D is arbitrary, the number of samples needed for learning is linear in N; specifically, Θ(N/ε²).

Figure 1.1: Example of a monotone distribution D.

The distance measure considered will be the total variation distance (TV), also referred to as statistical distance or (half) the L1 distance (a very stringent metric): for D1, D2 distributions over [N],

    dTV(D1, D2) def= max_{S⊆[N]} (D1(S) − D2(S)) = (1/2) Σ_{i∈[N]} |D1(i) − D2(i)|   (1.2)

where the second equality (known as Scheffé's identity) is left as an exercise. (Hint: it can be shown by considering the set S = { i ∈ [N] : D1(i) > D2(i) }.)

It is not hard to see that dTV(D1, D2) ∈ [0, 1] (where the upper bound is for instance achieved for D1, D2 with disjoint supports).
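To see that the two expressions in Equation (1.2) agree on a concrete example, here is a small Python sketch (not part of the original notes); the distributions D1, D2 below are arbitrary illustrative choices.

from itertools import chain, combinations

# Total variation distance computed two ways (Equation (1.2)).
# D1 and D2 are arbitrary illustrative distributions over [N].
N = 4
D1 = [0.4, 0.3, 0.2, 0.1]
D2 = [0.1, 0.2, 0.3, 0.4]

# Right-hand side: half the L1 distance (Scheffe's identity).
half_l1 = 0.5 * sum(abs(p - q) for p, q in zip(D1, D2))

# Left-hand side: maximum over subsets S of [N] of D1(S) - D2(S).
subsets = chain.from_iterable(combinations(range(N), r) for r in range(N + 1))
max_gap = max(sum(D1[i] - D2[i] for i in S) for S in subsets)

assert abs(half_l1 - max_gap) < 1e-9   # both equal 0.4 for this example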

Question: Given access to independent samples from some D, an unknown monotone distribution over [N], what are the sample and time complexity required to output (a succinct representation of) a hypothesis distribution D′ s.t. dTV(D, D′) ≤ ε?
Good news: this can be done with O(log N / ε³) samples and runtime – and this is optimal.

Property testing problems

The object O is now arbitrary, and we are interested in some property P on the set of all objects. The goal is to distinguish whether (a) O has the property P, or (b) O is "far" from every O′ with property P. We do not care about the in-between cases where O is "close" to having property P; in such cases we can give whatever answer we want. Equivalently, this setting can be seen as a "promise" problem, where we are "promised" that all objects will either have property P or be "far" from having property P.

(Figure: the objects having property P, and the objects far from P.)

For instance, for the property P (on Boolean functions) of "being the identically zero function", f is ε-far from P if it takes value 1 on at least an ε fraction of the inputs.

Example 6 (Testing f : {0, 1}^n → {0, 1} for monotonicity).

Definition 7. A Boolean function f is monotone (non-decreasing) if x ⪯ y implies f(x) ≤ f(y), where x ⪯ y means xi ≤ yi for all i ∈ [n]. For instance, f defined by f(x) = x1 ∧ x17 is monotone; f(x) = ¬x1 is not.

Taking our property P to be monotonicity (equivalently, P = { f : {0, 1}^n → {0, 1} : f is monotone }, the set of all functions with the property), we define the distance of f from P as follows:

    dist(f, P) def= min_{g∈P} d(f, g)   (1.3)

(where d(f, g) is the Hamming distance defined in Equation (1.1)). We define a testing algorithm T for monotonicity as follows: given parameter ε ∈ (0, 1], and query access to any f,
• if f is monotone, T should accept (with probability ≥ 9/10);
• if dist(f, P) > ε, T should reject (with probability ≥ 9/10).

As it turns out, testing monotonicity of Boolean functions can be done with O(n/ε) queries and runtime.

Example 8 (Bipartiteness testing for graphs). Given an arbitrary graph G = ([N], E), we would like to design a testing algorithm T which, given oracle access to the adjacency matrix of G,
• if G is bipartite, accepts (with probability ≥ 9/10);
• if dist(G, Bip) > ε, rejects (with probability ≥ 9/10),
where Bip is the set of all bipartite graphs over vertex set [N], and dist(G, Bip) = min_{G′∈Bip} d(G, G′) with d(G, G′) def= |E(G) ∆ E(G′)| / (N choose 2), the edit distance between G and G′.

Somewhat surprisingly, testing bipartiteness can be done with poly(1/ε) queries and runtime, independent of N.

1.1.3 Summary

Most of the topics covered in the class will fit into this 2 × 3 table:

             Boolean functions     Graphs                Probability distributions
  Learning   e.g., DTs                                   e.g., monotone
  Testing    e.g., Monotonicity    e.g., Bipartiteness

Flavor of the course: algorithms and lower bounds; Fourier analysis over {−1, 1}^n; probability; graph theory. . .

1.2 Some content – Testing Sortedness

The first topic covered will be the leftmost top cell – learning Boolean functions. However, before doing so, we will get a taste of property testing with the example of testing sortedness of a list (an example slightly out of the above summary table, yet which captures the "spirit" of many property testing results).

Problem: Given access to a list a = (a1, . . . , aN) ∈ Z^N, figure out whether it is sorted – that is, if a1 ≤ a2 ≤ · · · ≤ aN. More precisely, we want to distinguish sorted lists from lists which are "far from sorted".

Definition 9. For ε ∈ [0, 1], we say a list of N integers a = (a1, . . . , aN) is ε-approximately sorted if there exists b = (b1, . . . , bN) ∈ Z^N such that
(i) b is sorted; and
(ii) |{ i ∈ [N] : ai ≠ bi }| ≤ εN.
(Suggestion: write this in terms of some distance dist(a, b).)

Remark 1. This definition is equivalent to saying that a is ε-approximately sorted if it has a (1 − ε)N-length sorted subsequence.

Goal: Design an algorithm which queries "few" elements ai (here a "query" means providing the value i to an oracle, and being given ai as response) and
• if a is sorted, accepts with high probability;
• if a is not ε-approximately sorted, rejects with high probability.

Remark 2. Deterministic algorithms will not work here; indeed, consider any deterministic algorithm which reads at most N/2 of the ai's; it is possible to change any sorted input on the unchecked spots to make it Θ(1)-far from sorted, and the algorithm cannot distinguish between the two cases.

Theorem 10. There exists an O(log N / ε)-query algorithm for ε-testing sortedness. Further, this tester is one-sided:
• if a is sorted, it accepts with probability 1;
• if a is not ε-approximately sorted, it rejects with probability ≥ 2/3.

1.2.1 First naive attempt

Natural idea: read (log N)/ε random spots, and accept iff the induced sublist is sorted.

Why does it fail? Consider the (1/2-far from sorted) list

a = (11, 10, 21, 20, 31, 30, . . . , 10^100 + 1, 10^100).

A violation will only be detected if certain consecutive elements – e.g., 21, 20 – are queried, which happens with probability o(1) if we make O((1/ε) log N) queries. (Think of "ε" as being a small constant like 1/100.)

(Figure: plot of ai against i – many locally decreasing pairs arranged along a globally increasing staircase from 1 to N.)

1.2.2 Second naive attempt

Since the previous approach failed at finding local violations, let us focus on such violations: draw pairs of consecutive elements, check if each pair is sorted, and accept iff all pairs drawn pass the test.

Why does it fail? Consider the list

a = (1, 2, 3, . . . , N/2, 1, 2, 3, . . . , N/2).

Once again, a is 1/2-far from sorted, yet there is only one pair of consecutive elements that may show a violation.

(Figure: plot of ai against i – two increasing runs of length N/2, with the only adjacent violation at position N/2 + 1.)

1.2.3 Third (and right) attempt

Some preliminary setup: first, observe that we can assume without loss of generality that all ai's are distinct. Indeed, we can always ensure this is the case by replacing on-the-fly ai by bi def= N·ai + i. It is not hard to see that the bi's are now distinct, and moreover that ai ≤ aj iff bi < bj (for i < j).

Furthermore, suppose x is a sorted list of N distinct integers. In such a list, one can use binary search to check in log N queries whether a given value x′ is in x.

Definition 11. Given a (not necessarily sorted) list a, we say that ai is well-positioned if a binary search on ai ends up at the ith location (where it successfully finds ai). In particular, if a is sorted, all its elements are well-positioned.

For instance, in a = (1, 2, 100, 4, 5, 6, . . . , 98, 99), a3 = 100 is not well-positioned, while a75 = 75 is.

Note that with 1 + log N queries, one can query location i to get ai and then use binary search on ai to determine whether ai is well-positioned.

The algorithm:
1: for 10/ε iterations do
2:     Pick i ∈ [N] uniformly at random. Query and get ai.
3:     Do binary search on ai; if ai is not well-positioned, halt and return REJECT.
4: end for
5: return ACCEPT

• This makes (10/ε)·(1 + log N) = O(log N / ε) queries, as claimed.

• If a is sorted, all elements are well-positioned and the tester accepts with probability 1.

• It remains to prove that if a is ε-far from sorted, the algorithm will reject with probability at least 2/3. Equivalently, we will show the contrapositive: if a is such that Pr[ACCEPT] ≥ 1/3, then a is ε-approximately sorted (i.e. has a sorted subsequence of size at least (1 − ε)N).

Suppose that Pr[ACCEPT on a] ≥ 1/3, and define W ⊆ [N] to be the set of all well-positioned indices. If we had |W| ≤ (1 − ε)N, then the probability that the algorithm accepts would be at most

    Pr[ACCEPT on a] ≤ (1 − ε)^{10/ε} < 0.01,

so it must be the case that |W| ≥ (1 − ε)N. Therefore, it is sufficient to prove that the set W is sorted, that is, for all i, j ∈ W such that i < j, ai < aj. Fix any two such i, j; clearly, if there were an index k s.t. (1) i ≤ k ≤ j, (2) the binary search for ai visits k, and (3) the binary search for aj visits k, then this would entail that ai ≤ ak ≤ aj.

So it is enough to argue such a k exists. To see why, note that both binary searches for ai and aj start at the same index N/2, and there must be a last common index to both searches, as they end on different indices i and j. Take k to be this last common index; as the two binary searches diverge at k, it has to be between i and j.
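For concreteness, here is a minimal Python sketch of this tester (ours, not from the notes); the helper names are illustrative, the list is accessed directly rather than through an oracle, and the number of iterations follows the 10/ε choice in the pseudocode.

import random

def is_well_positioned(a, i):
    # Binary search for a[i] over the whole list (as if a were sorted);
    # a[i] is well-positioned iff the search terminates at position i.
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if a[mid] == a[i]:
            return mid == i
        if a[mid] < a[i]:
            lo = mid + 1
        else:
            hi = mid - 1
    return False

def test_sortedness(a, eps):
    # One-sided tester: always accepts sorted lists; rejects lists that are
    # not eps-approximately sorted with probability >= 2/3 (distinct entries assumed).
    for _ in range(int(10 / eps)):
        i = random.randrange(len(a))
        if not is_well_positioned(a, i):
            return False   # REJECT
    return True            # ACCEPT

# The list from the first naive attempt (rejected with overwhelming probability):
print(test_sortedness([11, 10, 21, 20, 31, 30, 41, 40], eps=0.25))
# The list from the second naive attempt, made distinct via b_i = N*a_i + i:
a2 = list(range(50)) + list(range(50))
b2 = [len(a2) * v + i for i, v in enumerate(a2)]
print(test_sortedness(b2, eps=0.25))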


Lecture 2

January 29, 2014: Learning Boolean functions

2.1 Learning Boolean Functions

2.1.1 Setup

We describe the framework for uniform-distribution learning of a class C of Boolean functions f : {0, 1}^n → {0, 1} using membership queries (MQ). The class C should be thought of as "known" to the learning algorithm; there is also an unknown target function f ∈ C. The learning algorithm has MQ access to f. Recall that given f, g : {0, 1}^n → {0, 1}, the distance dist(f, g) is the probability over uniform random x ∈ {0, 1}^n that f(x) ≠ g(x), that is Pr[f(x) ≠ g(x)]. A successful learning algorithm for C has the following property: given input parameters ε, δ, for any choice of the target function f ∈ C, with probability at least 1 − δ the learning algorithm outputs a hypothesis h such that dist(f, h) ≤ ε. (For simplicity we will sometimes consider learning algorithms which achieve dist(f, h) ≤ ε with probability at least 9/10, i.e. we will sometimes work with the fixed constant 1/10 for δ.)

Given a value of ε and a target function f, a useful piece of terminology is that we will sometimes refer to a hypothesis h such that dist(f, h) ≤ ε as a good hypothesis; thus the goal of a learning algorithm is to output a good h with high probability.

Suppose h is such that dist(f, h) > ε: we can easily detect such a bad hypothesis by drawing random points, querying them using the MQ oracle and evaluating h on them, and seeing if h makes a mistake. This is because if h is bad, then given m random examples from {0, 1}^n, it will be the case that Pr[h = f on all m points] < (1 − ε)^m, which is less than δ/|C| for m def= (1/ε)(ln |C| + ln(1/δ)).

This simple observation suggests the following generic learning algorithm which can be used for any finite class C:


Algorithm 1 Naive Algorithm
1: Draw m = (1/ε)(ln |C| + ln(1/δ)) samples.
2: for all h ∈ C do
3:     Check if h gets the label of all m points right; if so, output this h.
4: end for

With the discussion above and a union bound over all the (at most) |C| bad hypotheses, we can claim that with probability at least 1 − δ no bad hypothesis in C will be right on all m examples. Therefore with probability 1 − δ the algorithm outputs some h ∈ C such that dist(f, h) ≤ ε (which is guaranteed to exist, since the target function f itself is in C).

While this has good query complexity and in general works for any class, it has a terrible runtime of O(|C|) (as we enumerate in the worst case all functions in the class). For each class of functions that we study our goal will be to do much better than this running time (and happily we will achieve this for all the function classes we consider).
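As a concrete illustration of Algorithm 1, here is a sketch (ours, not from the notes) instantiated with the class of parity functions over {0, 1}^n; all helper names and the small value of n are illustrative.

import math
import random
from itertools import chain, combinations, product

n = 6
def parity(S):
    return lambda x: sum(x[i] for i in S) % 2

# The finite class C: all 2^n parity functions over {0,1}^n.
all_subsets = list(chain.from_iterable(combinations(range(n), r) for r in range(n + 1)))
C = [parity(S) for S in all_subsets]

target = random.choice(C)          # unknown target f in C
eps, delta = 0.1, 0.1

# Step 1: draw m = (1/eps)(ln|C| + ln(1/delta)) uniform labeled examples.
m = math.ceil((math.log(len(C)) + math.log(1 / delta)) / eps)
examples = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(m)]
labels = [target(x) for x in examples]

# Step 2: output any hypothesis in C consistent with all m examples.
hypothesis = next(h for h in C if all(h(x) == y for x, y in zip(examples, labels)))

# With probability >= 1 - delta the output is eps-good; here the printed error
# is 0.0 except with tiny probability (distinct parities disagree on half the points).
domain = list(product([0, 1], repeat=n))
print(sum(hypothesis(x) != target(x) for x in domain) / len(domain))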

We begin our study of learning classes of Boolean functions by considering two simple classes, namely, parities and juntas.

2.2 Learning Parities

Given a subset of indices S ⊆ [n], the parity function corresponding to S, denoted PAR_S, is the function from {0, 1}^n to {0, 1} defined as

    PAR_S(x) = ( Σ_{i∈S} xi ) mod 2,    x = (x1, . . . , xn) ∈ {0, 1}^n.

In other words, the parity is 1 iff an odd number of the bits in positions S are set to 1.

The class C of all parities consists of 2^n functions corresponding to the 2^n subsets of [n]. A very simple algorithm suffices to learn this class, based on the observation that PAR_S(e_i) = 1 iff i ∈ S, where e_i is the element of {0, 1}^n that has a 1 precisely in position i and a 0 in every other position. By querying these n strings it is possible to exactly learn the unknown parity function (achieve error ε = 0) with probability 1.
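A sketch of this membership-query parity learner (the oracle interface and names are ours, for illustration): query each basis string e_i and include i in S iff the answer is 1.

def learn_parity(n, mq):
    # Recover the index set S of an unknown parity PAR_S over {0,1}^n,
    # given a membership-query oracle mq.
    S = set()
    for i in range(n):
        e_i = tuple(1 if j == i else 0 for j in range(n))
        if mq(e_i) == 1:          # PAR_S(e_i) = 1 iff i is in S
            S.add(i)
    return S

# Example usage with a hidden parity on S = {0, 3, 4}:
hidden = {0, 3, 4}
mq = lambda x: sum(x[i] for i in hidden) % 2
assert learn_parity(5, mq) == hidden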

2.3 Learning Juntas

We say that a Boolean function f : {0, 1}^n → {0, 1} depends on coordinate i if there exists a string z ∈ {0, 1}^n such that f(z) ≠ f(z^{⊕i}). Here and subsequently z^{⊕i} denotes the string z with the ith bit flipped.

Definition 12 (k-junta). For k ≤ n, f : {0, 1}^n → {0, 1} is said to be a k-junta if it depends on at most k variables. Equivalently, f is a k-junta if there are k′ ≤ k distinct indices i1, . . . , i_{k′} and a function g : {0, 1}^{k′} → {0, 1} such that for all x ∈ {0, 1}^n, we have

    f(x1, . . . , xn) = g(x_{i1}, . . . , x_{i_{k′}}).

Let J_k denote the set of all k-juntas over {0, 1}^n. We have that |J_k| ≤ (n choose k) · 2^{2^k} ≤ n^k · 2^{2^k}, and this upper bound is essentially tight (at least for k not too large). Therefore, we can use the naive algorithm above to learn J_k in n^k · 2^{2^k} time using roughly 2^k + k log n queries. However this is a rather poor running time, and in fact it is possible to do exponentially better:

Theorem 13. There is a learning algorithm for J_k that, with probability ≥ 1 − δ, will output an h such that dist(f, h) = 0 (i.e. h is logically equivalent to f). This algorithm makes poly(2^k, log n, log(1/δ)) queries and has running time poly(2^k, n, log(1/δ)) (note that even specifying a single n-bit query string takes Θ(n) time).

Proof.

High level idea: There are two main steps to the algorithm.
1. Find all relevant variables (there will be k′ ≤ k relevant variables);
2. Perform 2^{k′} queries to exactly determine the truth table of g.
Given the set {i1, . . . , i_{k′}} of relevant variables and the truth table of g, we have exactly identified the junta f.

A useful tool for the algorithm is binary search. Note that given two strings x, y with f(x) = 0 and f(y) = 1, by performing a binary search (to successively halve the size of the set of bits on which the two strings disagree) it is possible to identify a relevant variable using at most log n MQ queries. Our routine for step 1, Find-Relevant-Variables (abbreviated FRV), makes use of binary search as detailed below.

A note on notation before giving the algorithm: when we write "f|_{xi←b}" below, where b is an element of {0, 1}, this refers to the (n − 1)-variable function that is obtained from f by fixing its i-th input bit to the value b. Note that given MQ access to f it is straightforward to simulate MQ access to f|_{xi←b}.

Find-Relevant-Variables(n, k, δ, f):
1: Draw m = 2^k · log(6/δ) random examples from {0, 1}^n.
2: if f takes the same value on all m examples then return the empty set
3: else
4:     Given x, y such that f(x) = 1, f(y) = 0, use log n queries (binary search) to find i ∈ [n] such that i is relevant.
5:     Run FRV(n − 1, k − 1, δ/3, f|_{xi←0}) to get S0 ⊆ [n] and FRV(n − 1, k − 1, δ/3, f|_{xi←1}) to get S1 ⊆ [n].
6:     return S0 ∪ S1 ∪ {i}
7: end if

It is easy to verify that the above recursive algorithm has recursion depth at most k and makes a total of at most 2^k recursive calls. An easy inductive argument shows that if FRV(n, k, δ, f) is called on an n-variable k-junta f, then with probability at least 1 − δ it finds all relevant variables. The key observation underlying the analysis is that if f is a k-junta which is not a constant function (i.e. has at least one relevant variable), then we have both Pr_{x∈{0,1}^n}[f(x) = 0] ≥ 1/2^k and Pr_{x∈{0,1}^n}[f(x) = 1] ≥ 1/2^k. Hence a collection of 2^k ln(6/δ) random examples will with probability at least 1 − δ/3 have at least one positive example and at least one negative example. (The other two contributions of δ/3 failure probability come from the two recursive calls.)
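The binary-search step at the heart of FRV can be sketched as follows (an illustrative implementation of that single step, not the full FRV routine; names are ours): given x, y with f(x) ≠ f(y), repeatedly flip half of the coordinates on which they disagree.

def find_relevant_variable(f, x, y):
    # Given f(x) != f(y), binary-search over the coordinates where x and y
    # differ to locate one relevant variable, using about log(n) queries to f.
    assert f(x) != f(y)
    x, y = tuple(x), tuple(y)
    diff = [i for i in range(len(x)) if x[i] != y[i]]
    while len(diff) > 1:
        half = diff[:len(diff) // 2]
        z = list(x)
        for i in half:                 # move x halfway towards y
            z[i] = y[i]
        z = tuple(z)
        if f(z) != f(x):
            y, diff = z, half          # a relevant variable lies inside `half`
        else:
            x, diff = z, diff[len(diff) // 2:]   # it lies in the other half
    return diff[0]

# Example: a 3-junta on n = 8 variables, f(x) = (x0 AND x3) XOR x5.
f = lambda x: (x[0] & x[3]) ^ x[5]
print(find_relevant_variable(f, (1, 0, 0, 1, 0, 0, 0, 0), (0,) * 8))   # prints 0, a relevant variable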

These learning results so far for parities and juntas are rather "ad hoc". Next we will develop the machinery of Fourier analysis over the Boolean hypercube; over the course of the next few lectures we will apply this machinery to get powerful learning algorithms for a range of function classes.

2.4 Fourier Analysis of Boolean Functions

Now, we look at Boolean functions as f : {−1, 1}^n → {−1, 1}. Somewhat counter-intuitively, we will be viewing −1 as "true" and 1 as "false". With this convention we can redefine (and rename) the parity function corresponding to a subset S ⊆ [n] as:

Definition 14.

    χ_S(x) = Π_{i∈S} xi,    x ∈ {−1, 1}^n.

To check that this makes sense, observe that just as the XOR of two "true" bits yields "false" (i.e. 1 + 1 = 0 mod 2), the product of two −1 bits is 1. Note that we can view any Boolean function f : {−1, 1}^n → {−1, 1} as a function with the new range R (f : {−1, 1}^n → R).

The set of all functions f : {−1, 1}^n → R is a vector space of dimension 2^n (each function can be viewed as a list of its values on its 2^n possible inputs). Call this vector space V. We now define an inner product on V:

Definition 15. Given f, g ∈ V, the inner product 〈f, g〉 is defined as:

    〈f, g〉 def= E_{x∼{−1,1}^n}[f(x)g(x)] = (1/2^n) Σ_{x∈{−1,1}^n} f(x)g(x).

Remark 3. Notice that this induces the norm of an element of V as:

    ‖f‖_2 = √〈f, f〉 = (1/2^{n/2}) √( Σ_{x∈{−1,1}^n} f(x)^2 ).

2.5 Special Properties of Parity Functions in V

Proposition 16 (Parities are unit vectors). The norm of χS for any S is 1.

Proof.

    ‖χ_S‖ = √〈χ_S, χ_S〉 = √(E_{x∼{−1,1}^n}[χ_S(x)^2]) = √1 = 1

(since for all x, χ_S(x) ∈ {−1, 1} implies χ_S(x)^2 = 1).

We note that this proposition is not unique to parity functions: any Boolean function f : {−1, 1}^n → {−1, 1} has ‖f‖ = √(E[f^2]) = 1.

Proposition 17. For any S, T ⊆ [n], χS · χT = χS∆T .

Proof. Observe that

    χ_S(x) · χ_T(x) = ( Π_{i∈S} xi ) ( Π_{j∈T} xj ).

Clearly, the only xi in the joint product are ones where i is either in S or T. However, the i which are in both S and T will cancel each other out (since xi ∈ {−1, 1} would then show up twice in the product). So, the only i that affect the result are those in the set symmetric difference S∆T, and we have χ_S(x) · χ_T(x) = χ_{S∆T}(x).

Proposition 18. If S = ∅, then E_{x∼{−1,1}^n}[χ_S(x)] = E[1] = 1.

Proof. Obvious, since χ∅ is identically 1.

Proposition 19. If S ≠ ∅, then E_{x∼{−1,1}^n}[χ_S(x)] = 0.

Proof. One can think of determining the value of χ_S(x) by first taking the product over all but one i ∈ S. At this point, the value of χ_S(x) is completely determined by the last index's xi (which will make the final product either 1 or −1). Each of these outcomes occurs with probability 1/2 since x is drawn uniformly at random, so the expected value of the final product is 0.

Proposition 20 (Orthogonality in V). If S ≠ T, then 〈χ_S, χ_T〉 = 0.

Proof. By definition and using the previous properties, 〈χ_S, χ_T〉 = E_{x∼{−1,1}^n}[χ_S · χ_T] = E[χ_{S∆T}]. But S∆T ≠ ∅ since S ≠ T, so by Proposition 19 we have 〈χ_S, χ_T〉 = 0.

Proposition 21 (Linear independence of the χ_S). The set {χ_S : S ⊆ [n]} is linearly independent.

Proof. Suppose 0 = c1χ_{S1} + c2χ_{S2} + · · · + c_mχ_{S_m}, where the S_i are distinct subsets of [n]. Then

    0 = 〈χ_{S1}, 0〉 = 〈χ_{S1}, c1χ_{S1} + c2χ_{S2} + · · · + c_mχ_{S_m}〉
      = c1〈χ_{S1}, χ_{S1}〉 + c2〈χ_{S1}, χ_{S2}〉 + · · · + c_m〈χ_{S1}, χ_{S_m}〉
      = c1,

by linearity and since the 〈χ_{S_i}, χ_{S_j}〉 terms equal 1 only when i = j = 1 and equal 0 for all other pairs i ≠ j.

The same argument can be used for any i to show that c_i = 0. Since 0 = c1χ_{S1} + · · · + c_mχ_{S_m} implies c_i = 0 for all i, we have then shown linear independence.

2.6 Parity Functions as an Orthonormal Basis of V

By the previous properties, we then have that {χ_S : S ⊆ [n]} is an orthonormal basis of V (it is a set of 2^n linearly independent, orthogonal elements of length 1 in V). So, any f ∈ V has a unique representation as

    f(x) = Σ_{S⊆[n]} f̂(S) χ_S(x).

The f̂(S) are referred to as the Fourier coefficients of f.

One way of viewing this representation is as the expansion of f as a multilinear polynomial. (The parity functions are all possible monomials over x1, . . . , xn.)

What is f̂(S)?

    〈f, χ_S〉 = 〈 Σ_{T⊆[n]} f̂(T) χ_T , χ_S 〉 = Σ_{T⊆[n]} f̂(T) 〈χ_T, χ_S〉 = f̂(S),

by linearity and since 〈χ_T, χ_S〉 equals 1 when S = T and 0 otherwise. So, f̂(S) = 〈f, χ_S〉.
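For small n, the whole Fourier expansion can be computed by brute force directly from the formula f̂(S) = 〈f, χ_S〉. The sketch below (ours, not from the notes) does this for an arbitrary example function, the majority of 3 bits, and checks that the expansion recovers f.

from itertools import chain, combinations, product

n = 3
def chi(S, x):
    # Parity character chi_S(x) = prod_{i in S} x_i, for x in {-1,1}^n.
    p = 1
    for i in S:
        p *= x[i]
    return p

# An arbitrary Boolean example: the majority function on 3 bits.
f = lambda x: 1 if sum(x) > 0 else -1

cube = list(product([-1, 1], repeat=n))
subsets = list(chain.from_iterable(combinations(range(n), r) for r in range(n + 1)))

# Fourier coefficients f_hat(S) = <f, chi_S> = (1/2^n) sum_x f(x) chi_S(x).
f_hat = {S: sum(f(x) * chi(S, x) for x in cube) / 2 ** n for S in subsets}

# The expansion sum_S f_hat(S) chi_S(x) recovers f exactly at every point.
for x in cube:
    assert abs(sum(f_hat[S] * chi(S, x) for S in subsets) - f(x)) < 1e-9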


Lecture 3

February 05, 2014

3.1 Overview

3.1.1 Last time

Last time we covered learning parity functions and k-juntas. We also went through the basics of Fourier analysis for functions f : {−1, 1}^n → R.

3.1.2 Today

Today we finish the basics of Fourier analysis over the Boolean cube and cover two learning algorithms for Boolean functions, the LMN (Linial/Mansour/Nisan) algorithm [LMN93] and the KM (Kushilevitz/Mansour) algorithm [KM91]. As we will see, the LMN algorithm lets us learn to high accuracy given uniform random labeled examples if we are given a set S of subsets of [n] whose Fourier coefficients have almost all of the "Fourier weight" of f. The KM algorithm uses membership queries and can be used to find all the "heavy" Fourier coefficients of f.

3.2 Basics of Fourier analysis, concluded

We recall that f(x) = Σ_{S⊆[n]} f̂(S) χ_S(x), where f : {−1, 1}^n → R and χ_S(x) = Π_{i∈S} xi (note that χ_S(x) is the multilinear monomial corresponding to the product of all variables in S). We also defined f̂(S) = E[f(x)χ_S(x)] = 〈f, χ_S〉. For Boolean f, this is equal to Pr[f(x) = χ_S(x)] − Pr[f(x) ≠ χ_S(x)].

As an example of the Fourier representation of a Boolean function, let us consider the AND function.

    AND(x1, . . . , xk) def= −1 if x1 = · · · = xk = −1, and +1 otherwise.


To determine the Fourier representation of AND, first let's consider the real-valued function f such that f(x1, . . . , xk) = 1 if x1 = · · · = xk = −1, and 0 otherwise. It's not hard to see that

    f(x) = (1 − x1)/2 · (1 − x2)/2 · · · (1 − xk)/2 = Σ_{S⊆[k]} (−1)^{|S|} χ_S(x) / 2^k

(note that (1 − xj)/2 is either 0 or 1 when xj ∈ {−1, 1}). Then we see that

    AND(x) = 1 − 2f(x) = 1 − 2/2^k − Σ_{S:|S|>0} (−1)^{|S|} χ_S(x) / 2^{k−1}.

Turning back to a general Boolean function f : {−1, 1}^n → {−1, 1}, it is easy to see from the definition that |f̂(S)| ≤ 1. However, a much stronger statement is in fact true: for any f : {−1, 1}^n → {−1, 1}, we have that Σ_{S⊆[n]} f̂(S)^2 = 1. This is a special case of the following more general result:

Theorem 22 (Plancherel's Identity). For any two functions f, g : {−1, 1}^n → R, we have

    Σ_{S⊆[n]} f̂(S) ĝ(S) = 〈f, g〉 = E[fg].   (3.1)

Proof. By plugging the Fourier expansions of f and g into the RHS and using linearity of expectation:

    E[fg] = E_{x∼U({−1,1}^n)}[ ( Σ_S f̂(S)χ_S(x) ) ( Σ_T ĝ(T)χ_T(x) ) ]
          = Σ_{S,T} f̂(S) ĝ(T) · E_x[χ_S(x)χ_T(x)] = Σ_S f̂(S) ĝ(S),

where the last equality derives from the orthonormality of the χ_S's.

Corollary 23 (Parseval's Theorem). Let f : {−1, 1}^n → R. Then

    Σ_{S⊆[n]} f̂(S)^2 = ‖f‖_2^2,

and in particular, if f is Boolean ({−1, 1}-valued) then Σ_{S⊆[n]} f̂(S)^2 = 1.

3.2.1 Fourier and learning

The intuition is that the f̂(S)^2's are the "weights" of the χ_S's in the Fourier representation of f, so that learning a big enough fraction of the total weight means learning a "good fraction" of the function (in a sense that we shall make precise). Hence, our goal in Fourier-based learning is to approximate most of the Fourier representation of f.

A basic first objective is, given f and S ⊆ [n], to find f̂(S). We note that f̂(S) = E[f(x)χ_S(x)]. Clearly, computing this quantity exactly requires 2^n queries; however, we do not need the exact value! Hence, we will be satisfied with an approximation of f̂(S), which we can obtain efficiently via sampling.

Lemma 24. Fix any Boolean function f : {−1, 1}^n → {−1, 1}. Given S ⊆ [n], γ > 0, δ > 0 and access to independent, random examples (x, f(x)) where x ∼ U({−1, 1}^n), one can, using O((1/γ^2) log(1/δ)) such examples, output a value c_S such that with probability at least 1 − δ, c_S satisfies |c_S − f̂(S)| ≤ γ.

Proof. Follows from the Hoeffding bound (additive Chernoff) applied to the random variable f(x)χ_S(x) (where x is uniform random over {−1, 1}^n), whose expected value is f̂(S). Note that this random variable always takes values in [−1, 1], and hence we may straightforwardly apply the Hoeffding bound.
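A sketch of the estimator from Lemma 24 (ours; the sample-size constant 2 below is an illustrative choice rather than the sharp Hoeffding constant):

import math
import random

def estimate_coefficient(f, S, n, gamma, delta):
    # Estimate f_hat(S) = E[f(x) chi_S(x)] from uniform random labeled examples;
    # the sample size follows the Hoeffding bound up to the illustrative constant.
    m = math.ceil((2 / gamma ** 2) * math.log(2 / delta))
    total = 0.0
    for _ in range(m):
        x = [random.choice([-1, 1]) for _ in range(n)]
        chi = 1
        for i in S:
            chi *= x[i]
        total += f(x) * chi
    return total / m

# Example: for f = chi_{0,2} on n = 4 variables, the estimate should be close to 1.
f = lambda x: x[0] * x[2]
print(estimate_coefficient(f, {0, 2}, n=4, gamma=0.05, delta=0.01))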

Remark 4. This is useful, but does not solve the learning question: there are 2^n Fourier coefficients, and for a Boolean function almost all of them are very small in magnitude. For instance, at most n^4 subsets S ⊆ [n] are such that |f̂(S)| ≥ 1/n^2. This is because otherwise we would get Σ_S f̂(S)^2 > n^4 · 1/n^4 = 1, contradicting Parseval's theorem.

But suppose we were given a list of subsets S1, . . . , Sm with the promise that 99% of the "Fourier weight" is on these: Σ_{i=1}^m f̂(Si)^2 ≥ 0.99. Then we would be in good shape, as we could use the lemma above to estimate each f̂(Si) to sufficiently high accuracy. The next section will make this precise.

3.3 The LMN Algorithm (Linial–Mansour–Nisan)

3.3.1 Preliminary Definitions

Definition 25. Fix f : {−1, 1}^n → {−1, 1} and S = {S1, . . . , Sm} where each Si ⊆ [n]. We say that f is ε-concentrated on S if

    Σ_{Si∈S} f̂(Si)^2 ≥ 1 − ε

(that is, Σ_{S∉S} f̂(S)^2 ≤ ε).

3.3.2 LMN Algorithm

Theorem 26. Suppose f is ε-concentrated on S. Then with probability at least 1 − δ, the hypothesis h returned by the LMN Algorithm satisfies E_x[(f(x) − h(x))^2] ≤ ε + τ.


Algorithm 2 LMN Algorithm
Require:
  - access to random samples (x, f(x))
  - collection of subsets S = {S1, . . . , Sm}
  - parameters τ > 0, δ > 0
1: Pick M = O((m/τ) log(m/δ)) random examples (x, f(x)).
2: Use this set to get an estimate c_{Si} of f̂(Si), for all i ∈ [m].
3: return h : {−1, 1}^n → R defined by h(x) = Σ_{i=1}^m c_{Si} χ_{Si}(x).

Proof. Fix some Si ∈ S. Define γ = √(τ/m). From the earlier lemma, we are guaranteed that the probability of |c_{Si} − f̂(Si)| > γ is at most δ/m. With probability ≥ 1 − δ, by the union bound, we get |c_{Si} − f̂(Si)| ≤ γ for all i ∈ {1, . . . , m}.

Let g def= f − h. We have

    E[(f − h)^2] = E[g^2] =_{(Plancherel)} Σ_S ĝ(S)^2 = Σ_{Si∈S} ĝ(Si)^2 + Σ_{S∉S} ĝ(S)^2 ≤ ε + τ,

where the last inequality holds because |ĝ(Si)| = |f̂(Si) − ĥ(Si)| ≤ γ, which implies that Σ_{Si∈S} ĝ(Si)^2 ≤ mγ^2 ≤ τ; and Σ_{S∉S} ĝ(S)^2 = Σ_{S∉S} f̂(S)^2 ≤ ε since h has no nonzero Fourier coefficients outside S.

The following observation allows us to easily convert the output of the LMN algorithm (which is a real-valued function) to a Boolean function with the same approximation guarantees:

Observation 27. Let us fix any Boolean function f : {−1, 1}^n → {−1, 1} and a real-valued h : {−1, 1}^n → R. Define

    h′(x) = sign(h(x)) = 1 if h(x) ≥ 0, and −1 if h(x) < 0.

Then we have that Pr[h′(x) ≠ f(x)] ≤ E[(f(x) − h(x))^2].

Proof. Note that the left hand side evaluates to (1/2^n) Σ_x 1_{h′(x)≠f(x)} (the indicator function) while the right hand side evaluates to (1/2^n) Σ_x (f(x) − h(x))^2. We will show the desired inequality holds term by term, which will prove the result: for any x,
• if h′(x) = f(x), the corresponding summand on the LHS evaluates to 0 while the RHS summand is ≥ 0;
• if h′(x) ≠ f(x), the LHS summand contributes 1 to the sum while the RHS summand contributes ≥ 1.

3.3.3 Summing up

Theorem 28 (LMN algorithm). Given a collection of subsets S such that f is ε-concentrated on S, as well as parameters ε, δ > 0 and access to independent, uniform random examples (x, f(x)), the LMN algorithm runs in poly(n, |S|, 1/ε, log(1/δ)) time and outputs a hypothesis h′ : {−1, 1}^n → {−1, 1} such that dist(h′, f) ≤ 2ε with probability at least 1 − δ.
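Putting the pieces together, here is a sketch of the LMN procedure followed by the sign-rounding of Observation 27 (ours, not from the notes; the sampling interface, the fixed choice of M, and the example target are illustrative assumptions):

import random

def lmn(sample, S_list, M):
    # Sketch of the LMN algorithm: draw M uniform labeled examples (x, f(x)),
    # estimate f_hat(S) for every S in S_list from that sample, and return the
    # sign-rounded hypothesis h'(x) = sign(sum_S c_S chi_S(x)).
    examples = [sample() for _ in range(M)]

    def chi(S, x):
        p = 1
        for i in S:
            p *= x[i]
        return p

    coeffs = {S: sum(y * chi(S, x) for x, y in examples) / M for S in S_list}
    h = lambda x: sum(c * chi(S, x) for S, c in coeffs.items())   # real-valued h
    return lambda x: 1 if h(x) >= 0 else -1                       # Boolean h' = sign(h)

# Example: target f = chi_{0,1}, which is 0-concentrated on S_list = [{0,1}].
f = lambda x: x[0] * x[1]
def sample():
    x = tuple(random.choice([-1, 1]) for _ in range(5))
    return x, f(x)

h_prime = lmn(sample, [frozenset({0, 1})], M=2000)
print(all(h_prime(x) == f(x) for x in [(1, 1, 1, 1, 1), (1, -1, 1, 1, 1), (-1, -1, -1, 1, 1)]))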

This is tremendously helpful, but requires that S be given. We discuss below some ideas on how to get S:

1. For some C (classes of functions), we can show that every f ∈ C is ε-Fourier-concentrated on S = { S ⊆ [n] : |S| ≤ d }. We see that |S| ≈ n^d. Hence, we get poly(n^d, 1/ε)-time algorithms. We explore this approach in future lectures.

2. Using membership queries, we can find the S such that |f̂(S)| is large! (Note: we need MQ access for this approach.) Thus, we can learn concept classes C which have 1 − ε of their Fourier weight on such "big" coefficients. The Kushilevitz–Mansour (KM) algorithm, the subject of the next section, is an algorithm that lets us find the S such that |f̂(S)| is large using MQ.

3.4 The KM Algorithm (Kushilevitz–Mansour)

3.4.1 Introduction

In the KM algorithm, we find the S such that |f̂(S)| is large, using membership queries.

Want: given θ, efficiently find S def= { S ⊆ [n] : |f̂(S)| ≥ θ }. Sadly, this is unrealistic: since we can only ever hope to obtain approximate values of the Fourier coefficients, there is no way to handle the sets at or very close to the threshold.

Will get: given θ and MQ(f), find a collection S s.t.
• if |f̂(S)| ≥ 2θ, then S ∈ S;
• if |f̂(S)| ≤ θ/2, then S ∉ S;
with probability at least 1 − δ, in poly(n, 1/θ, log(1/δ)) time.

High-level idea: We want to "successively isolate" large f̂(S)'s. So we think of a way to break up the set of all S ⊆ [n] as follows:

Definition 29. Fix k ∈ {0, . . . , n} and S1 ⊆ [k]. Given f : {−1, 1}^n → R, we define f_{k,S1} : {−1, 1}^{n−k} → [−1, 1] by¹

    f_{k,S1}(x) def= Σ_{T2⊆{k+1,...,n}} f̂(S1 ∪ T2) χ_{T2}(x)

(note that we will use the subscripts to indicate where the sets "live": S1, T1 ⊆ {1, . . . , k}, while T2 ⊆ {k + 1, . . . , n}).

¹Note that from the definition it is not immediately obvious that the range of f_{k,S1} is [−1, 1]; however we will prove this below.

Observe that f_{k,S1} "includes" (in a sense we will define shortly) exactly the Fourier coefficients f̂(S) such that S ∩ {1, . . . , k} = S1. So all the f_{k,S1}, as S1 ranges over all subsets of {1, . . . , k}, include all 2^n Fourier coefficients of f.

Example 30.
• For k = 0, we have f_{0,∅}(x) = f(x) = Σ_{T2⊆{1,...,n}} f̂(T2) χ_{T2}(x).
• For k = n and S1 ⊆ [n], we have that f_{n,S1}(x) = f̂(S1) (a constant function).

The question that begs asking: supposing we are given MQ(f), k, S1 ⊆ {1, . . . , k}, and x, can we compute or approximate f_{k,S1}(x)? Using the definition, yes (term by term) – but this would be insanely inefficient; there is a better way, by sampling, given by the following lemma.

3.4.2 Building up to the KM algorithm

We propose a better way than the one given above.

Lemma 31. Fix k ∈ {0, . . . , n}, S1 ⊆ [k], x ∈ {−1, 1}^{n−k}, and f. Then

    f_{k,S1}(x) = E_{y∼{−1,1}^k}[ f(yx) χ_{S1}(y) ]

(where we write yx for the concatenation (y1, . . . , yk, x1, . . . , x_{n−k})).

Proof. We note that f(yx) = Σ_T f̂(T) χ_T(yx). Writing T as T1 ∪ T2 (where T1 = T ∩ {1, . . . , k} and T2 = T ∩ {k + 1, . . . , n}), we get χ_T(yx) = χ_{T1}(y) χ_{T2}(x). Hence, we see that

    f(yx) = Σ_{T1} Σ_{T2} f̂(T1 ∪ T2) χ_{T1}(y) χ_{T2}(x).

Therefore, by linearity,

    E_{y∼{−1,1}^k}[f(yx)χ_{S1}(y)] = Σ_{T1} Σ_{T2} f̂(T1 ∪ T2) χ_{T2}(x) E[χ_{T1}(y)χ_{S1}(y)]
                                   = Σ_{T1} Σ_{T2} f̂(T1 ∪ T2) χ_{T2}(x) 1_{T1=S1}   (by orthonormality)
                                   = Σ_{T2} f̂(S1 ∪ T2) χ_{T2}(x)
                                   = f_{k,S1}(x).

Using this lemma, f(yx)χ_{S1}(y) (with y ∼ {−1, 1}^k) is a [−1, 1]-valued random variable we can sample (as long as f takes values in [−1, 1]), and invoking Hoeffding–Chernoff bounds we get:

Lemma 32. There is an algorithm which, given MQ(f) (where f : {−1, 1}^n → {−1, 1}), k ∈ {0, . . . , n}, S1 ⊆ [k] and x ∈ {−1, 1}^{n−k}, as well as parameters γ, δ > 0, outputs with probability at least 1 − δ a value v ∈ [−1, 1] satisfying

    |v − f_{k,S1}(x)| ≤ γ

in time poly(n, 1/γ, log(1/δ)), making O((1/γ^2) log(1/δ)) queries.
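A sketch of the estimator of Lemmas 31–32 (ours; the membership-query interface and the fixed sample size are illustrative stand-ins for the Hoeffding-based choice):

import random

def estimate_f_k_S1(mq, k, S1, x_suffix, num_samples=20000):
    # Estimate f_{k,S1}(x) = E_{y ~ {-1,1}^k}[ f(yx) chi_{S1}(y) ] by drawing
    # random prefixes y and querying the membership oracle mq on the concatenation yx.
    total = 0.0
    for _ in range(num_samples):
        y = tuple(random.choice([-1, 1]) for _ in range(k))
        chi = 1
        for i in S1:
            chi *= y[i]
        total += mq(y + tuple(x_suffix)) * chi
    return total / num_samples

# Example: f(z) = z0*z1*z4 on n = 6 variables.  For k = 2 and S1 = {0, 1},
# f_{k,S1}(x) = chi_{{4}}(x), i.e. coordinate 4 of z, which is entry 2 of the suffix.
f = lambda z: z[0] * z[1] * z[4]
print(estimate_f_k_S1(f, k=2, S1={0, 1}, x_suffix=(1, 1, -1, 1)))   # exactly -1.0 here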

Simplification: for the sake of clarity, we hereafter pretend we can get the exact value f_{k,S1}(x), instead of just approximating it. This will make the exposition cleaner, and can be addressed modulo some technical details.

Recall that our ultimate goal is to find a large f̂(S) if there is one. We first make a couple of useful observations:

Observation 33. Given k and S1 ⊆ [k] as above, we have

    E[f_{k,S1}(x)^2] =_{(Plancherel)} Σ_{T2⊆{k+1,...,n}} f̂(S1 ∪ T2)^2 = Σ_{S : S∩[k]=S1} f̂(S)^2.

Observation 34. Fix any f : {−1, 1}^n → {−1, 1} and θ > 0. Then:
(i) At most 1/θ^2 sets S ⊆ [n] can have |f̂(S)| ≥ θ;
(ii) For any fixed k ∈ {0, . . . , n}, at most 1/θ^2 sets S1 ⊆ {1, . . . , k} have E[f_{k,S1}(x)^2] ≥ θ^2 (since Σ_{S1⊆[k]} E[f_{k,S1}(x)^2] = Σ_{S⊆[n]} f̂(S)^2 = 1);
(iii) If k, S1 are such that E[f_{k,S1}(x)^2] < θ^2, then |f̂(S)| < θ for all S such that S ∩ [k] = S1 (indeed, E[f_{k,S1}(x)^2] = Σ_{S′:S′∩[k]=S1} f̂(S′)^2 ≥ f̂(S)^2 for any such S).

Lemma 35. Fix f : {−1, 1}^n → {−1, 1}, and pretend that E[f_{k,S1}(x)^2] can be computed exactly in unit time given MQ(f), k, and S1 ⊆ [k]. Then we can, given θ > 0, exactly identify { S ⊆ [n] : |f̂(S)| ≥ θ } in time poly(n, 1/θ).

The idea is to build a (partial) binary tree, where each level-ℓ node has "address" (ℓ, S′1) and value

    Σ_{T2⊆{ℓ+1,...,n}} f̂(S′1 ∪ T2)^2 = E[f_{ℓ,S′1}(x)^2].

This tree (built on-the-fly by the algorithm exploring it) has as key invariant that, for each ℓ, Σ_{S′1⊆[ℓ]} value(ℓ, S′1) ≤ 1. The algorithm will stop exploring (ℓ, S′1) if value(ℓ, S′1) ≤ θ^2; therefore, for all ℓ, the number of live nodes at level ℓ will be at most 1/θ^2. This means the tree (which has depth at most n) can be entirely explored in time poly(n, 1/θ). Each leaf node that is reached at depth n corresponds to a Fourier coefficient, and this collection of leaves will contain all Fourier coefficients with |f̂(S)| ≥ θ.

(We give a formal proof of this in the next lecture.)


Lecture 4

February 12, 2014

4.1 Last Time

Last time we covered Fourier basics, the LMN (Linial/Mansour/Nisan) algorithm, and the KM (Kushilevitz/Mansour) algorithm. Today we finish the KM algorithm and explore its applications, and start a unit on learning monotone Boolean functions using the LMN algorithm.

Recall that f_{k,S1}(x) = Σ_{T2⊆{k+1,...,n}} f̂(S1 ∪ T2) χ_{T2}(x) and E[f_{k,S1}^2] = Σ_{T2⊆{k+1,...,n}} f̂(S1 ∪ T2)^2.

Lemma 36. Fix any f : {−1, 1}^n → {−1, 1}. Pretend that we can exactly compute E[f_{k,S1}^2] in unit time, for any given k and S1 ⊆ [k]. Then given θ > 0, there is an algorithm that outputs exactly { S : |f̂(S)| ≥ θ } in poly(n, 1/θ) time.

Proof. The algorithm builds a partial binary tree on-the-fly where each level-k node has address (k, S1) and value E[f_{k,S1}²]. Note that a node's value is the sum of its two children's values: the node v at location (k, S1) has children
• (k+1, S1), with value ∑_{T2⊆{k+2,...,n}} f̂(S1 ∪ T2)²; and
• (k+1, S1 ∪ {k+1}), with value ∑_{T2⊆{k+2,...,n}} f̂(S1 ∪ {k+1} ∪ T2)².
In particular, for each k the sum of the values of all level-k nodes of the complete binary tree is E[f_{0,∅}²] = E[f²] = 1; hence the number of nodes at level k whose value is at least θ² is at most 1/θ².

The algorithm then builds the partial tree from the root (0, ∅). When it reaches a node with value < θ², it calls it "dead" and stops exploring this branch. The algorithm only continues from "live" nodes. Therefore each level contains at most 1/θ² live nodes, and the total number of nodes ever explored is O(n/θ²) (since there are n levels). Once the whole tree is built, the value at each leaf (n, S1) at the bottom level is E[f_{n,S1}²] = f̂(S1)², and the surviving leaves correspond exactly to all S1 ⊆ [n] such that E[f_{n,S1}²] ≥ θ², i.e. |f̂(S1)| ≥ θ.
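For concreteness, here is a short Python sketch of the tree exploration just described (my own, not from the notes). The oracle weight(k, S1), assumed to return E[f_{k,S1}²] exactly, plays the role of the "unit time" pretense of the lemma; in the real KM algorithm it would be replaced by the estimator of Lemma 37.

```python
def km_exact_search(weight, n, theta):
    """Find all S with hat{f}(S)^2 >= theta^2, given an exact-weight oracle.

    weight(k, S1) is assumed to return E[f_{k,S1}^2], i.e. the total Fourier
    weight of the coefficients hat{f}(S) with S intersect [k] = S1.  Here S1 is
    encoded as a frozenset of coordinates from {1, ..., k}.
    """
    live = [frozenset()]                      # level 0: only the root (0, {})
    for k in range(1, n + 1):
        next_live = []
        for S1 in live:
            for child in (S1, S1 | {k}):      # the two children at level k
                if weight(k, child) >= theta ** 2:
                    next_live.append(child)   # keep only "live" nodes
        live = next_live                      # at most 1/theta^2 of them survive
    return live                               # leaves: E[f_{n,S}^2] = hat{f}(S)^2
```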


4.2 KM Algorithm

Now we would like to look at "almost-exact" computations, since (as already observed) we cannot get the exact values E[f_{k,S1}²], but only approximate them. Note that |f_{k,S1}(x)| ≤ 1 for all x, k ∈ {0, ..., n} and S1 ⊆ [k], because f_{k,S1}(x) = E_{y∼{−1,1}^k}[f(yx)χ_{S1}(y)]. So we estimate E_x[f_{k,S1}(x)²] by picking uniform random points x, estimating f_{k,S1}(x) at each, and taking the square; we are basically using sample points. Towards this end, we can prove the following:

Lemma 37 (Homework Exercise). Given MQ(f) for f : {−1,1}^n → {−1,1}, 0 ≤ k ≤ n, S1 ⊆ [k] and τ, δ > 0, there is a poly(n, 1/τ, log(1/δ))-time algorithm that with probability ≥ 1 − δ outputs v such that

|v − E[f_{k,S1}²]| ≤ τ.

Proof. Left as an exercise.

Using this approximate version in place of the exact computation, we obtain the actual KM algorithm:

Theorem 38 (Homework Exercise). Let f : {−1,1}^n → {−1,1}. Then, given θ > 0 and error probability δ > 0, there is an algorithm (KM) that, with probability 1 − δ, outputs a collection S of size O(1/θ²) such that
• if S ∈ S, then |f̂(S)| ≥ θ/2;
• if S has |f̂(S)| ≥ 2θ, then S ∈ S.

Proof. Left as an exercise.

4.3 Applications of the KM algorithm

What can we use the KM algorithm for? It does not work well for everything; some functions have only small Fourier coefficients. KM works best if most of the Fourier weight is on a "small" set of large coefficients.

Parities. As a trivial case, we can use the KM algorithm to learn an unknown parity function: if f = χ_S for some unknown S ⊆ [n], then f has all its Fourier weight on one single non-zero coefficient; setting θ := 0.9 will then ensure we find S.

Inner Product. An example of a function which "defeats" the KM algorithm (by having only very small Fourier coefficients) is the inner product function, defined as

IP2 : x ∈ {−1,1}^{2n} ↦ (x1 ∧ x2) ⊕ (x3 ∧ x4) ⊕ ··· ⊕ (x_{2n−1} ∧ x_{2n}).


Since x1 ∧ x2 = (1 + x1 + x2 − x1x2)/2 (recall that True corresponds to −1), the Fourier representation of IP2 is

IP2 = ∑_{S⊆[2n]} ±(1/2^n) χ_S,

and all coefficients have the same tiny magnitude 1/2^n.

As we shall see momentarily, there exists a class of functions for which the Kushilevitz–Mansour algorithm is well-suited: the functions with bounded L1 norm.

4.3.1 Learning functions with small L1 norm

Definition 39. Given f : {−1,1}^n → {−1,1}, define L1(f) := ∑_{S⊆[n]} |f̂(S)|.

Claim 40. Given a Boolean function f and ε > 0, let Sε := { S : |f̂(S)| ≥ ε/L1(f) }. Then
1. |Sε| ≤ L1(f)²/ε; and
2. f is ε-concentrated on Sε.

Proof. Fix S ∈ Sε; then |f̂(S)| ≥ ε/L1(f). We know by definition that

L1(f) = ∑_{S⊆[n]} |f̂(S)| ≥ ∑_{S∈Sε} |f̂(S)| ≥ |Sε| · ε/L1(f),

which gives us the first item. For the second, the goal is to show that ∑_{S∉Sε} f̂(S)² ≤ ε. This holds because

∑_{S∉Sε} f̂(S)² ≤ max_{S∉Sε} |f̂(S)| · ∑_{S∉Sε} |f̂(S)| < ε,

as ∑_{S∉Sε} |f̂(S)| ≤ L1(f) and max_{S∉Sε} |f̂(S)| < ε/L1(f).

Putting it together, we obtain an algorithm that can learn Boolean functions with small L1 norm:

Theorem 41. There exists an algorithm Learn-L1 which, given MQ(f), L = L1(f) and ε, δ > 0, outputs with probability ≥ 1 − δ a Boolean function h such that dist(f, h) ≤ ε, and runs in time poly(n, 1/ε, L, log(1/δ)).

Proof. Learn-L1 runs the KM algorithm to find a collection S of poly(L/ε) many Fourier coefficients containing all coefficients in S_{ε/2}; f is then ε/2-concentrated on S, and it suffices to run the LMN algorithm on S to obtain a hypothesis h that approximates f to accuracy ε.


Remark 5 (Homework Exercise). If the algorithm is not given L, one can still use it by "guessing" an upper bound on it (with repeated guesses), and testing the hypothesis obtained until a good one is found.

Corollary 42. We can learn size-s decision trees in time poly(n, s, 1/ε, log(1/δ)), where the size s denotes the number of leaves in the decision tree.

Proof. This directly follows from the following claim:

Lemma 43. Let f : {−1,1}^n → {−1,1} be computed by a size-s decision tree. Then L1(f) ≤ s.

To show this, fix a leaf v in the decision tree, and call its bit label b_v ∈ {−1,1}. Consider the function g_v that outputs b_v if x reaches the leaf v and 0 otherwise. The path to v is an AND of literals ℓ_{i1}, ℓ_{i2}, ..., ℓ_{ik}, where ℓ_{ij} = ±x_{ij}. What is the function g_v then?

g_v(x) = b_v · ((1 + ℓ_{i1})/2)((1 + ℓ_{i2})/2) ··· ((1 + ℓ_{ik})/2) = ∑_{S⊆{i1,i2,...,ik}} ±(1/2^k) χ_S(x).

Each path function g_v thus has a simple Fourier representation, with each of its 2^k coefficients being ±1/2^k, so L1(g_v) = 1. The decision tree is f = g_{v1} + ··· + g_{vs}, where v1, ..., vs are the s leaves, so L1(f) ≤ L1(g_{v1}) + ··· + L1(g_{vs}) ≤ s.

4.3.2 Sparse Fourier Representations

Another application is learning functions with sparse Fourier representations. Given a Boolean function f, let sparse(f) be the number of its non-zero Fourier coefficients, that is

sparse(f) := |{ S ⊆ [n] : f̂(S) ≠ 0 }|.

As a first observation, we have the following relation between sparsity and L1 norm:

Fact 44. Fix any f : {−1,1}^n → {−1,1}. If sparse(f) = s, then L1(f) ≤ √s.

Proof. Let S1, ..., Ss be the sets corresponding to the non-zero coefficients of f, so that L1(f) = ∑_{i=1}^s |f̂(Si)|. By Cauchy–Schwarz,

∑_{i=1}^s |f̂(Si)| ≤ √(∑_{i=1}^s 1²) · √(∑_{i=1}^s f̂(Si)²) = √s · 1.

Corollary 45. We can learn the class of s-sparse functions in time poly(s, n, 1/ε, log(1/δ)).

Remark 6. There is no general relationship between sparsity and decision tree size. Consider χ_[n], which is (super) sparse: sparse(χ_[n]) = 1, but its decision tree size is 2^n. On the other hand, the function AND_n has sparsity 2^n, but decision tree size n.

Remark 7. It is known that the KM algorithm can learn poly(n)-term DNF to accuracy ε in time n^{O(log log(n/ε))}.


4.4 Monotone Boolean Functions

We will see that the LMN algorithm will learn any monotone function f in n^{O(√n/ε)} time (in particular, without membership queries). Let us first define monotone Boolean functions.

Definition 46. A Boolean function f : {−1,1}^n → {−1,1} is said to be monotone if x ⪯ y implies f(x) ≤ f(y). Here, ⪯ refers to the partial order on the Boolean hypercube, i.e. x ⪯ y if and only if xi ≤ yi for all i ∈ [n].

Remark 8. Note that there are a huge number of monotone functions (more than 2^(n choose n/2)); to see this, consider (for n even) any function f that has f(x) = 1 when ∑_{i=1}^n xi > 0 and f(x) = −1 when ∑_{i=1}^n xi < 0. Any such function is monotone, and there are 2^(n choose n/2) such functions, since each of the (n choose n/2) points x with ∑_i xi = 0 can take either value +1 or −1.

So any naive method would result in an exponential-time algorithm. We define a key notion to help us deal with monotone functions, which is the concept of influence.

4.4.1 Influence

Definition 47 (Influence of a variable). Given f : {−1,1}^n → {−1,1} and i ∈ [n], the influence of the i-th variable on f is

Inf_f(i) := Pr_{x∼{−1,1}^n}[ f(x^{i←1}) ≠ f(x^{i←−1}) ],

where x^{i←b} denotes x with the i-th bit set to b. The total influence of f is Inf[f] = ∑_{i=1}^n Inf_f(i).

Let's see a couple of examples.
1. For the constant function f ≡ 1, Inf_f(i) = 0 for all i.
2. For the parity function f = χ_[n], Inf_{χ_[n]}(i) = 1 for all i.
3. Consider the majority function f = MAJ_n(x) = sign(x1 + ··· + xn), where n is odd. What is the influence of each variable? (This has applications in voting.) We can see that Inf_{MAJ_n}(i) = (n−1 choose (n−1)/2)/2^{n−1} = Θ(1/√n), so Inf[MAJ_n] = Θ(√n). We'll see next time that this is in fact the largest possible total influence of any monotone Boolean function.
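Since Inf_f(i) is just a probability over a uniform x, it can be estimated directly by sampling; the Python sketch below (an illustration, not part of the notes) does exactly that, and the MAJ_3 example matches the formula above (each variable of MAJ_3 has influence 1/2).

```python
import random

def estimate_influence(f, n, i, samples=100_000):
    """Monte Carlo estimate of Inf_f(i) = Pr_x[ f(x^{i<-1}) != f(x^{i<--1}) ].

    f takes a tuple of n values in {-1, +1}; coordinate i is 1-indexed as in
    the notes, and the sample size is illustrative.
    """
    count = 0
    for _ in range(samples):
        x = [random.choice((-1, 1)) for _ in range(n)]
        x[i - 1] = 1
        hi = f(tuple(x))
        x[i - 1] = -1
        lo = f(tuple(x))
        count += (hi != lo)
    return count / samples

maj3 = lambda x: 1 if sum(x) > 0 else -1
# estimate_influence(maj3, 3, 1)  ->  approximately 0.5
```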

4.5 Preview for next time

Next time we will talk about the connection between influence and the KM algorithm.

Lemma 48. Consider f : {−1,1}^n → {−1,1}. Then Inf_f(i) = ∑_{S : i∈S} f̂(S)².

Proof. Define the discrete derivative of f in the i-th direction as

D_i f(x) = ( f(x^{i←1}) − f(x^{i←−1}) ) / 2.


Note that by definition D_i f(x) = 0 if f is insensitive to coordinate i at x, and ±1 otherwise. By linearity of the Fourier transform, one gets

D_i f(x) = (1/2) ( ∑_{S⊆[n]} f̂(S) χ_S(x^{i←1}) − ∑_{S⊆[n]} f̂(S) χ_S(x^{i←−1}) ) = ∑_{S∋i} f̂(S) χ_{S\{i}}(x),

and then

Inf_f(i) = Pr[ f(x^{i←1}) ≠ f(x^{i←−1}) ] = Pr[ D_i f(x) = ±1 ] = E[ D_i f(x)² ] = (Plancherel) ∑_{S∋i} f̂(S)².

Corollary 49. For any Boolean function f, the influence of the i-th variable is the Fourier weight of all the coefficients that contain i, and in particular

Inf[f] = ∑_{i=1}^n ∑_{S∋i} f̂(S)² = ∑_{S⊆[n]} |S| f̂(S)².


Lecture 5

February 19, 2014

5.1 Overview

5.1.1 Last time

• Finished the KM algorithm;
• Applications of the KM algorithm: learning decision trees; learning functions with sparse Fourier representations (in particular k-juntas of parities); started learning monotone Boolean functions (via their influence).

5.1.2 Today

• Finish learning monotone Boolean functions (using Inf_f(i)), and AC0 circuits (via Fourier concentration on low-degree coefficients – no membership queries);
• Lower bounds for learning monotone Boolean functions;
• Learning k-juntas of halfspaces in poly((nk/ε)^k) time. No Fourier here! (but membership queries).

Relevant Readings:
• Mansour [Man94]: Learning Boolean Functions via the Fourier Transform.
• Gopalan, Klivans and Meka [GKM12]: Learning Functions of Halfspaces Using Prefix Covers.

5.2 Finish learning monotone Boolean functions

Recall:
• f : {−1,1}^n → {−1,1} is monotone if x ⪯ y implies f(x) ≤ f(y);
• Influence: Inf_f(i) = Pr_x[ f(x^{i←1}) ≠ f(x^{i←−1}) ] for i ∈ [n];


• Total influence: Inf[f] = ∑_{i=1}^n Inf_f(i); an important example is Inf[MAJ] ≤ √n (for the majority function MAJ(x) = sign(x1 + ··· + xn));
• For all Boolean functions f, Inf_f(i) = ∑_{S∋i} f̂(S)², and thus Inf[f] = ∑_S |S| · f̂(S)².

Claim 50. If f is a monotone Boolean function, we have Inf_f(i) = f̂(i) (where f̂(i) is short for f̂({i})).

Proof. Without loss of generality, we consider the case i = 1:

Inf_f(1) = Pr_{x′∼{−1,1}^{n−1}}[ f(1x′) ≠ f(−1x′) ] = |{ x′ ∈ {−1,1}^{n−1} : f(1x′) = 1, f(−1x′) = −1 }| / 2^{n−1},

with the last equality holding because of the monotonicity of f. But on the other hand,

f̂(1) = E[f(x)x1] = (1/2^n) ∑_{x′∈{−1,1}^{n−1}} ( f(1x′) − f(−1x′) ) = |{ x′ ∈ {−1,1}^{n−1} : f(1x′) = 1, f(−1x′) = −1 }| / 2^{n−1},

where the second equality again uses the monotonicity of f.

Lemma 51. For any monotone Boolean function f, Inf[f] ≤ Inf[MAJ] ≤ √n.

Proof.
• A first approach: we can prove that Inf[f] ≤ √n using the Cauchy–Schwarz inequality:

Inf[f] = ∑_{i=1}^n f̂(i) · 1 ≤ (Cauchy–Schwarz) √(∑_{i=1}^n 1²) · √(∑_{i=1}^n f̂(i)²) ≤ √n,

since ∑_{i=1}^n f̂(i)² ≤ ∑_{S⊆[n]} f̂(S)² = 1.
• Or we can also show the stronger statement that Inf[f] ≤ Inf[MAJ]:

Inf[f] = ∑_{i=1}^n f̂(i) = ∑_{i=1}^n E[f(x)xi] = E[ f(x) · ∑_{i=1}^n xi ] = E[ f(x) · (x1 + ··· + xn) ]
       ≤ E[ MAJ(x) · (x1 + ··· + xn) ] = Inf[MAJ]

(where the inequality comes from observing that, since f(x) ∈ {−1,1}, the quantity f(x)·(x1 + ··· + xn) is at most |x1 + ··· + xn|, which is attained for f = MAJ, i.e. when f(x) = sign(x1 + ··· + xn)).


We can combine the facts above to obtain Fourier concentration for monotone Boolean functions; more generally:

Theorem 52. Let f be any monotone Boolean function. Then

∑_{|S| ≥ Inf[f]/ε} f̂(S)² ≤ ε.

Proof. By contradiction, assume ∑_{|S| ≥ Inf[f]/ε} f̂(S)² > ε; then

Inf[f] = ∑_{S⊆[n]} |S| f̂(S)² ≥ ∑_{|S| ≥ Inf[f]/ε} |S| f̂(S)² ≥ (Inf[f]/ε) · ∑_{|S| ≥ Inf[f]/ε} f̂(S)² > (Inf[f]/ε) · ε = Inf[f],

leading to a contradiction.

Remark 9. One can also see this proof as an application of Markov's Inequality, by viewing the Fourier weights f̂(S)² as the probability distribution they induce over subsets of [n]. The theorem can now be rephrased as

Pr_{S∼f̂²}[ |S| ≥ Inf[f]/ε ] ≤ E_{S∼f̂²}[|S|] / (Inf[f]/ε) = ε,

as Inf[f] = E_{S∼f̂²}[|S|].

Corollary 53. Suppose f : {−1,1}^n → {−1,1} is monotone. Then f is ε-concentrated on S := { S ⊆ [n] : |S| ≤ √n/ε }.

5.3 Lower bounds for learning monotone Boolean functions

As a direct consequence, the LMN algorithm will learn any monotone Boolean function in time poly(n^{√n/ε}) = 2^{O(√n (log n)/ε)}. While this constitutes a huge saving compared to the 2^{Ω(n)} general bound, it is still a lot! Hence, an immediate question is: can we do better?

One may first ask whether there is a better analysis of the LMN algorithm for monotone Boolean functions which would yield significantly better performance. However, the answer to this is negative; it is known that there exist monotone Boolean functions such that ∑_{|S|≤√n/100} f̂(S)² ≤ 1/100, which implies that no low-degree learning algorithm such as LMN can do better than to deal with the n^{Ω(√n)} Fourier coefficients up to degree Ω(√n).


Learning to high accuracy. Clearly, the 2^{O(√n (log n)/ε)} bound becomes trivial for ε ≈ 1/√n; hence, this range of accuracy seems like a "good regime" to look for a lower bound. And indeed, one can show that we cannot efficiently learn monotone Boolean functions to high accuracy, which we do below.

Claim 54. There is a class C of monotone Boolean functions such that, if the target function f is drawn uniformly from C, then any learning algorithm A making less than (1/10)·(2^n/√n) membership queries will output a hypothesis h such that E[dist(f, h)] ≥ 1/(5√n) (where the expectation is taken over the draw of f).

Proof. For simplicity consider n even (the case n odd is similar, up to technicalities). Define C to be the class of monotone Boolean functions f such that

f(x) = +1 if ∑_{i=1}^n xi > 0;  −1 if ∑_i xi < 0;  ±1 (arbitrarily) if ∑_i xi = 0.

Equivalently, drawing a function from this class amounts to tossing (n choose n/2) independent fair coins that specify the value of f on the middle layer of the hypercube (where ∑_i xi = 0).

Yet, the learning algorithm makes at most (1/10)·(2^n/√n) membership queries in this middle layer, which contains between (1/2)·(2^n/√n) and 2^n/√n points. So A "sees" less than a 1/5 fraction of the values that define f, and misses at least a 4/5 fraction of them. Each "unseen" point contributes in expectation (1/2)·(1/2^n) to the error of the hypothesis h. Therefore,

E[error(h)] ≥ (4/5) · ( (1/2)·(2^n/√n) ) · ( (1/2)·(1/2^n) ) = 1/(5√n).

In fact, a stronger lower bound can be proven:

Theorem 55 ([BBL98]). There is a (different) class C′ of monotone Boolean functions such that any algorithm that makes at most 2^{√n/100} membership queries outputs, when the target function f is drawn uniformly from C′, a hypothesis h such that E[dist(f, h)] ≥ 0.49.

High-level sketch of proof. Each f ∈ C′ is a 2^{√n/50}-term monotone DNF, f = T1 ∨ ··· ∨ T_{2^{√n/50}}, where each term T1, ..., T_{2^{√n/50}} is drawn independently from the set of all conjunctions of length c·√n/50 (for an appropriately chosen constant c so that the function is balanced with high probability). The argument then goes roughly as follows: every time a query x satisfies one of the terms, the algorithm is given "for free" all the variables of that term. But even with this overly generous assumption,


• there are at most 2^{√n/100} positive examples among the queries, hence at most 2^{√n/100} terms out of the 2^{√n/50} total terms are "shown" to the algorithm. Intuitively, this means that the algorithm does not "see" anything about almost all terms (with high probability);
• further, each negative example eliminates (again, with high probability) very few possible terms, so that negative examples do not help either.

5.4 Main contribution of LMN: learning AC0 circuits

[Figure: an example constant-depth circuit over inputs x1, x3, x4, x7, of size 6 and depth 3.]

(See the HW page for a related problem: there exists a depth-D, size-M circuit with no Fourier weight on any subset S of coefficients such that |S| ≤ log^{D−1} M.)

The figure above shows a size-6, depth-3 constant-depth circuit. Linial, Mansour and Nisan showed that if f is computed by a size-M, depth-D circuit, then f is ε-concentrated on

S = { S : |S| ≤ (O(log(M/ε)))^D }.

That is, we can learn the class AC0 of constant-depth, polynomial-size Boolean circuits in n^{poly(log(n/ε))} time.

5.5 Learning halfspaces

Definition 56. A Boolean function f : {0,1}^n → {−1,1} is said to be a halfspace (or Linear Threshold Function, LTF) if there exist weights w1, ..., wn ∈ R and a threshold θ ∈ R such that f(x) = sign(w·x − θ) for all x ∈ {0,1}^n.

Fact 57 (PAC-learning halfspaces). There is an algorithm that can learn any unknown halfspace over {0,1}^n in poly(n, 1/ε, log(1/δ)) time, using only independent and identically distributed random examples drawn from an arbitrary distribution D over {0,1}^n. The algorithm outputs with probability at least 1 − δ a hypothesis h such that Pr_{x∼D}[ f(x) ≠ h(x) ] ≤ ε.

This algorithm is based on polynomial-time linear programming. It works when f is a halfspace, but breaks down completely if f is a function of halfspaces, such as f = h1 ∧ h2. Indeed, in the


arbitrary-distribution model, even if we allow membership queries, no algorithm faster than 2^n time is known for f = h1 ∧ h2. So we will restrict our attention (as usual) to the uniform-distribution setting.

One question to ask is whether one can use Fourier analysis to learn (under the uniform distribution) a single halfspace or, more ambitiously, a function g(h1, ..., hk) of halfspaces, where g : {0,1}^k → {−1,1}. The following results are known here:
• Let h = MAJ. If S is such that ∑_{S∈S} ĥ(S)² ≥ 1 − ε, then it must be the case that |S| = n^{Ω(1/ε²)}. In particular, the KM algorithm will not work well.
• There exists f = g(h1, ..., hk) such that if S is such that ∑_{S∈S} f̂(S)² ≥ 1 − ε, then it must be the case that |S| = (n/k)^{Ω(k²/ε²)}. For k small compared to n this is n^{Ω(k²/ε²)}.
These are bad-news results for Fourier concentration. The good news is that this is as bad as the bad news gets; it is known that any f = g(h1, ..., hk) satisfies

∑_{|S| ≤ O(k²)/ε²} f̂(S)² ≥ 1 − ε

and thus can be learnt with LMN in n^{O(k²/ε²)} time (without membership queries).

However, with membership queries, it is possible to achieve a much better running time – namely, polynomial in n and 1/ε for any fixed constant k:

Theorem 58 ([GKM12]). The class of k-juntas of halfspaces (functions of the form f = g(h1, ..., hk) with the hi's being halfspaces) can be learnt under the uniform distribution in poly((nk/ε)^k) time, using membership queries.

Idea of the proof

The algorithm will use a hypothesis that is a Read-Once Branching Program (ROBP).

Definition 59. A width-W ROBP M is a layered digraph with layers 0, 1, ..., n and at most W nodes in each layer.
• L(M, i) is the set of nodes in layer i, with L(M, 0) = {v0} (v0 being the start node). Moreover, each node in L(M, n) is labeled 0 or 1 (REJECT or ACCEPT, respectively);
• for i ∈ {0, 1, ..., n−1}, each v ∈ L(M, i) has two out-edges, one labeled 0 and the other labeled 1, both going to nodes in L(M, i+1);
• for z ∈ {0,1}^i and a node v, M(v, z) denotes the node reached by starting from v and following i edges according to z. We can view an ROBP as a Boolean function M : {0,1}^n → {0,1} by setting, for z ∈ {0,1}^n,

M(z) := 0 if M(v0, z) is labeled 0, and 1 if M(v0, z) is labeled 1.


Notation. We will write Ui for the uniform distribution over {0,1}^i. For a prefix x ∈ {0,1}^i (with i ≤ n), we define fx : {0,1}^{n−i} → {0,1} by fx(z) = f(x ∘ z), where ∘ stands for concatenation. Note that dist(fx, fy) = Pr_{z∼{0,1}^{n−i}}[ f(x ∘ z) ≠ f(y ∘ z) ].

Definition 60. A function f : {0,1}^n → {0,1} is said to be (ε, W)-prefix coverable if for all i ∈ [n] there exists S^i ⊆ {0,1}^i with |S^i| ≤ W such that

∀y ∈ {0,1}^i, ∃x ∈ S^i such that dist(fx, fy) ≤ ε.

The collection (S^1, ..., S^n) is then called an (ε, W)-prefix cover of f.

The two building blocks of the proof will be the following lemmas:

Lemma 61. Every k-junta of LTFs g(h1, ..., hk) is (ε, (4k/ε)^k)-prefix coverable.

Lemma 62. There is a membership-query algorithm which, given ε, W, δ and MQ(f) for some (ε, W)-prefix coverable function f, outputs (a width-W ROBP) h such that dist(h, f) ≤ 4nε. Furthermore, the algorithm runs in time poly(n, W, 1/ε, log(1/δ)).

Remark 10. The two lemmas above combined yield the theorem: to learn a k-junta of LTFs to accuracy ε′, set ε := ε′/(4n) and W := (4k/ε)^k = (16kn/ε′)^k, and run the algorithm of Lemma 62.

5.6 Next time

Lemma 61 is a direct consequence of the following two claims, which we will prove next time.

Claim 63. If h is any LTF, then h is (ε, 2/ε)-prefix coverable.

Claim 64. Let f1, ..., fk be any (ε, W)-prefix coverable functions, and fix any g : {0,1}^k → {0,1}. Then g(f1, ..., fk) is (2kε, W^k)-prefix coverable.


Lecture 6

February 26, 2014

Relevant Readings:
• Gopalan et al., Learning functions of halfspaces using prefix covers [GKM12].

6.1 Overview

6.1.1 Last time

• Learning monotone Boolean functions and proving lower bounds.
• Learning AC0 circuits.
• Started learning functions with small prefix covers using read-once branching programs (ROBPs).

6.1.2 Today

• Finish the algorithm for learning functions with small prefix covers.
  – Apply it to learning k-juntas of halfspaces.
• Introduction to property testing.
  – Adaptive vs. non-adaptive testers.
  – Connections between testing and learning.
  – Testing linearity (over GF(2)), i.e. parities.

6.2 Learning functions with small prefix covers

Recall that a Boolean function f : {0,1}^n → {0,1} is said to be (ε, W)-prefix coverable if for all i ∈ [n] there exists a subset S^i ⊆ {0,1}^i with |S^i| ≤ W such that for every prefix y ∈ {0,1}^i, there


exists another prefix x ∈ S^i such that the distance between fx and fy¹ is at most ε. If we define an ε-ball to be

B_f(x, ε) := { y ∈ {0,1}^i : dist(fy, fx) ≤ ε },

then the union of the ε-balls of the elements of S^i covers all prefixes of length i. The collection (S^1, ..., S^n) is said to be an (ε, W)-prefix cover of the function f.

The proof that k-juntas of halfspaces are prefix coverable relies on the following technical lemmas:

Lemma 65. If h is a linear threshold function (LTF), then h is (ε, 2/ε)-prefix coverable.

Lemma 66. If f1, ..., fk are arbitrary (ε, W)-prefix coverable functions and g : {0,1}^k → {0,1} is any Boolean function, then the k-junta F = g(f1, ..., fk) is (2kε, W^k)-prefix coverable.

Combining these two results shows that every k-junta of halfspaces is (ε, (4k/ε)^k)-prefix coverable.

Figure 6.1: For a general function (a), we cannot expect the prefix functions fx to be nested; linear threshold functions do have the nesting property (b). The inner ellipses represent the sets of satisfying assignments f_x^{-1}(1) within {0,1}^{n−i}.

Proof of Lemma 65. We need to exhibit an (ε, 2/ε)-prefix cover (S^1, ..., S^n) for a linear threshold function h. Recall that an LTF is defined as

h(x) = sign( ∑_{i=1}^n wi xi − θ )

for some real-valued weights wi and threshold θ. If we have two prefixes x, x′ ∈ {0,1}^j for any j ∈ [n], what do the functions fx and fx′ look like? They are also linear threshold functions, but

¹Informally, "fx" is the Boolean function on the last n − i variables that agrees with f when the first i variables are set to x.


more importantly, the only difference between the two is the threshold θ. More explicitly, if we define the following function on prefixes,

v(x) = ∑_{i=1}^j wi xi,

then we can express the prefix functions fx as

fx(z) = sign( ∑_{i=j+1}^n wi zi − (θ − v(x)) ).

Suppose that fx has the larger effective threshold θ − v(x), i.e. v(x) ≤ v(x′). Then any satisfying assignment z for fx is also a satisfying assignment of fx′; that is, the set of satisfying assignments of fx is nested inside that of fx′ (see Figure 6.1). More precisely, if v(x) ≤ v(x′), then f_x^{-1}(1) ⊆ f_{x′}^{-1}(1).

Figure 6.2: Arranging prefixes along the number line [0, 1] (with marks at ε, 2ε, 3ε, ...). Our greedy algorithm picks the red points and discards the black points, which fall within ε (the light blue intervals) of a red point.

Let p(x) denote the probability Pr[ fx(z) = 1 ]. Arrange all 2^i prefixes x in increasing order of p(x) on the interval [0, 1] (see Figure 6.2). We use a greedy algorithm to pick the prefix cover. Sweeping from left to right, we take the first prefix x, put it into our prefix cover S^i, and discard every other prefix x′ within ε of x; that is, we remove all x′ such that p(x′) < p(x) + ε. We repeat this process until we have gone through all the prefixes. We need to argue that
(1) the size of S^i is at most 2/ε, and
(2) S^i covers all prefixes with ε-balls.
For (1), we know that the prefixes in S^i are at least ε apart from each other on the number line. Therefore, the worst case is that we selected prefixes with p-values 0, ε, 2ε, ..., 1. Thus, there are at most 1/ε + 1 ≤ 2/ε prefixes in S^i.

To show (2), we exploit the nesting property for prefixes of LTFs. In our algorithm, we picked x to be in our cover and discarded all x′ with p(x′) within ε of p(x). The nesting property tells us that f_x^{-1}(1) ⊆ f_{x′}^{-1}(1), so the only difference between fx and fx′ is that on f_{x′}^{-1}(1) \ f_x^{-1}(1), fx′ returns 1 and fx returns 0. Thus, the distance between these functions is

dist(fx, fx′) = |f_{x′}^{-1}(1) \ f_x^{-1}(1)| / 2^{n−i} = |f_{x′}^{-1}(1)|/2^{n−i} − |f_x^{-1}(1)|/2^{n−i} = p(x′) − p(x) < ε,

and so (S^1, ..., S^n) is an (ε, 2/ε)-prefix cover.
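The greedy sweep in this proof is a one-dimensional covering procedure; here is a small Python sketch of it (an illustration, not from [GKM12]), which assumes the values p(x) are already available.

```python
def greedy_prefix_cover(p_values, eps):
    """Greedy sweep from the proof of Lemma 65.

    p_values maps each length-i prefix to p(x) = Pr_z[f_x(z) = 1].  Returns a
    cover S_i such that every prefix has a chosen prefix within eps in p-value
    (hence, by the nesting property, within eps in distance), with
    |S_i| <= 1/eps + 1.
    """
    cover = []
    last_p = None
    for x, p in sorted(p_values.items(), key=lambda kv: kv[1]):
        if last_p is None or p >= last_p + eps:
            cover.append(x)      # keep x (a "red point" in Figure 6.2)
            last_p = p           # every prefix within eps of it is discarded
    return cover
```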


Proof of Lemma 66. We know that each fj is (ε, W)-prefix coverable, so let (S^1_j, ..., S^n_j) denote such a prefix cover. We want to use these sets to construct a prefix cover (T^1, ..., T^n) for F. Let us consider some i ∈ [n]. Given some k-tuple (x1, ..., xk) ∈ S^i_1 × ··· × S^i_k, define

U(x1, ..., xk) := { z ∈ {0,1}^i : dist((fj)_{xj}, (fj)_z) ≤ ε for j = 1, ..., k }.

Basically, U(x1, ..., xk) is just the set of all prefixes z that are simultaneously within ε of each xj (with respect to the corresponding fj). The way we construct our prefix cover is that for each nonempty U(x1, ..., xk), we take any z ∈ U(x1, ..., xk) and add it to T^i. Once again, we need to prove that
1. the size |T^i| is at most W^k, and
2. T^i covers all prefixes with 2kε-balls.
For (1), we can simply count how many sets U(x1, ..., xk) there are. We pick at most one element from each such set, so

|T^i| ≤ |S^i_1 × ··· × S^i_k| = |S^i_1| × ··· × |S^i_k| ≤ W^k,

where the last inequality follows from the fact that we had (ε, W)-prefix covers for each fj.
To prove (2), let y ∈ {0,1}^i be some prefix. From each S^i_j there exists xj that covers y, that is, dist((fj)_{xj}, (fj)_y) ≤ ε. By the definition of U, y must be in U(x1, ..., xk), and we picked some element z from U(x1, ..., xk) to be in T^i. The distance function defines a metric, so we can use the triangle inequality: for each j, the distance between y and z is at most

dist((fj)_y, (fj)_z) ≤ dist((fj)_y, (fj)_{xj}) + dist((fj)_{xj}, (fj)_z) ≤ ε + ε = 2ε.

Since F = g(f1, ..., fk) can only differ at a point where some fj differs, summing this bound over all k functions fj yields dist(F_y, F_z) ≤ 2kε.

The last piece we need is a learning algorithm for functions with small prefix covers. The idea is to use a read-once branching program (ROBP) of width corresponding to the size of the prefix covers, but how can we construct it without knowing the prefix covers beforehand? It turns out that with some extra error, we can achieve this with high probability.

The main idea of the algorithm is that we build the i-th level L(M, i) of a branching program by adding another bit after each prefix in the (i−1)-th level. If the corresponding prefix function is close enough to that of a node in layer i, we make a connection between those two nodes. If not, we create a new node.

First, we need to explain how we get our estimates d of the distance between two prefix functions f_{xb} and f_y. We sample suffixes z uniformly and make membership queries for f(xb ∘ z) and f(y ∘ z). The algorithm will have to make at most 2nW² estimates, as there are two choices for the bit b, at most W nodes in each of layers i−1 and i, and n layers in total. Then, if we run each estimate with confidence parameter δ/(2nW²), the overall probability of failure is at most δ by the union bound. In order to get the accuracy to within ε, the Chernoff bounds tell us that we need O(log(nW/δ)/ε²) samples per estimate. Thus, the runtime, which is essentially the number of membership queries, is poly(n, W, 1/ε, log(1/δ)).

Algorithm 3 Main Algorithm [GKM12]
On input n, ε, δ, W, and MQ access to f:
1: Set L(M, 0) = {λ} and L(M, i) = ∅ for each i ∈ [n].
2: for i = 1, 2, ..., n do
3:   for each x ∈ L(M, i−1) and b ∈ {0,1} do
4:     for each y ∈ L(M, i) do
5:       Obtain an estimate d_{xb,y} of dist(f_{xb}, f_y) accurate to within ±ε.
6:     end for
7:     If the smallest value d_{xb,y} is ≤ 3ε, add an edge labeled b from x to y.
8:     Otherwise, add an edge labeled b from x to a new node x ∘ b ∈ L(M, i).
9:   end for
10:  If |L(M, i)| > W, output FAIL and halt.
11: end for
12: Label each node x ∈ L(M, n) with the bit f(x).
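The following compressed Python sketch mirrors Algorithm 3 (it is my own illustration, not code from [GKM12]); the membership-query interface, the tuple encoding of prefixes, and the sample sizes are assumptions, and the bookkeeping for the confidence parameter δ is omitted.

```python
import random

def estimate_dist(mq, x, y, n, eps):
    """Estimate dist(f_x, f_y) = Pr_z[ f(x z) != f(y z) ] by sampling suffixes.
    mq is an assumed membership-query oracle on length-n bit tuples; the sample
    size giving +-eps accuracy with high probability is schematic here."""
    m = max(1, int(4 / eps ** 2))
    z_len = n - len(x)
    hits = 0
    for _ in range(m):
        z = tuple(random.randint(0, 1) for _ in range(z_len))
        hits += (mq(x + z) != mq(y + z))
    return hits / m

def learn_prefix_coverable(mq, n, eps, W):
    """Sketch of Algorithm 3: returns the layers, edges and leaf labels of the ROBP."""
    layers = [{()}]                             # L(M, 0) = {empty prefix}
    edges = [dict() for _ in range(n)]          # edges[i-1][(x, b)] = node in L(M, i)
    labels = {}
    for i in range(1, n + 1):
        layer = set()                           # L(M, i), nodes named by a representative prefix
        for x in layers[i - 1]:
            for b in (0, 1):
                xb = x + (b,)
                ests = {y: estimate_dist(mq, xb, y, n, eps) for y in layer}
                best = min(ests, key=ests.get) if ests else None
                if best is not None and ests[best] <= 3 * eps:
                    edges[i - 1][(x, b)] = best     # reuse an existing node
                else:
                    layer.add(xb)                   # create a new node
                    edges[i - 1][(x, b)] = xb
        if len(layer) > W:
            raise RuntimeError("FAIL")              # width exceeded
        layers.append(layer)
    for x in layers[n]:
        labels[x] = mq(x)                           # Line 12: label the leaves with f(x)
    return layers, edges, labels
```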

Now we need to argue that the algorithm does not output FAIL (except with small probability) and that the resulting ROBP is close enough to the original function.

Lemma 67. When the estimates d are all within ±ε of the true values, the algorithm does not output FAIL.

Proof. Let (S^1, ..., S^n) be an (ε, W)-prefix cover for f. Fix x ∈ S^i and consider the ε-ball B_f(x, ε) around x. Our algorithm will add at most one prefix in B_f(x, ε) to the i-th layer L(M, i): if there were two such prefixes y and y′, then by the triangle inequality the distance between fy and fy′ would be at most 2ε; our estimate d_{y,y′} is accurate to within ±ε, so d_{y,y′} would be at most 3ε; but our algorithm only creates a new node when the estimated distance to every existing node is more than 3ε, which is a contradiction. Since the ε-balls around the at most W elements of S^i cover all prefixes of length i, it follows that |L(M, i)| ≤ W.

Lemma 68. The ROBP hypothesis M satisfies dist(M, f) = Pr_x[ M(x) ≠ f(x) ] ≤ 4nε.

Proof. If x is a node in L(M, i), let Mx be the ROBP on the variables x_{i+1}, ..., x_n starting at node x. We will show that for each node x ∈ L(M, i) we have dist(Mx, fx) ≤ 4(n − i)ε, so plugging in i = 0 gives the desired result.

We induct backwards, starting from the n-th layer; more formally, we induct on n − i. By construction, the error is 0 on the nodes in level n: in Line 12 of the Main Algorithm, we placed the actual value of the function f on those nodes, so the base case is established.

Suppose the error bound is true for level i. Let x ∈ L(M, i−1) be some node, and examine its two outgoing edges, labeled 0 and 1, going to nodes y0 and y1 respectively. We can express the error dist(Mx, fx) as the average of dist(M_{y0}, f_{x0}) and dist(M_{y1}, f_{x1}), as half of the inputs start with 0 and the other half start with 1. By the induction hypothesis, dist(M_{yb}, f_{yb}) ≤ 4(n − i)ε. Finally, by the way we constructed the edges,

dist(f_{xb}, f_{yb}) ≤ d_{xb,yb} + ε ≤ 3ε + ε = 4ε.

Putting this all together using the triangle inequality, we find that

dist(Mx, fx) = (1/2) ∑_{b∈{0,1}} dist(M_{yb}, f_{xb})
            ≤ (1/2) ∑_{b∈{0,1}} ( dist(M_{yb}, f_{yb}) + dist(f_{yb}, f_{xb}) )
            ≤ (1/2) ∑_{b∈{0,1}} ( 4(n − i)ε + 4ε )
            = 4(n − (i − 1))ε,

as desired.

This concludes our unit on learning Boolean functions; the next one will be devoted to a somewhat related task, property testing of such functions.

6.3 (Property) testing Boolean functions

Definition 69. A property P is a subset of all Boolean functions (as a concept class was). We will almost exclusively consider "reasonable" properties, that is, ones which are invariant under renaming variables: if f(x1, ..., xn) ∈ P, then f(x_{π(1)}, ..., x_{π(n)}) ∈ P for all permutations π of [n]. As before, we define

dist(f, P) := min_{g∈P} dist(f, g).

Definition 70 (Property tester). A property testing algorithm A for P is a (randomized) algorithm that is given ε > 0 and membership query access to an arbitrary f, and satisfies the following:
• if f ∈ P, then A outputs ACCEPT with probability at least 2/3;
• if dist(f, P) > ε, then A outputs REJECT with probability at least 2/3.
Note that unlike in the learning setting, the function f is not guaranteed to be in P: the only promise is that it is either in P or ε-far from it. The query complexity of A is the number of queries q(n, ε) that A makes to MQ(f).

Remark 11. Certainly, any successful property tester must be randomized: otherwise, unless it queries most of the points of the domain (say, a 1 − ε fraction), any deterministic tester can be made to always fail on some ad hoc instances.


Example 71. Let P be the property of being perfectly balanced, that is, f ∈ P if and only if |f^{-1}(0)| = |f^{-1}(1)| = 2^{n−1}. We can test this trivially with O(1/ε²) samples via random sampling, counting the proportion of 0's and 1's. Using Chernoff bounds we can prove that, with probability 9/10, our estimate is within ε/10 of the true value.
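A minimal sketch of such a sampling tester is below (my own illustration, not from the notes); the acceptance threshold ε/2 and the constant in the sample size are illustrative choices rather than the ones implicit in the example.

```python
import random

def balanced_tester(mq, n, eps):
    """Sampling tester for the 'perfectly balanced' property of Example 71.

    mq is query access to f : {0,1}^n -> {0,1}; we only make uniform random
    queries.  Accept iff the empirical fraction of 1's is within eps/2 of 1/2.
    """
    m = max(1, int(10 / eps ** 2))                # O(1/eps^2) samples
    ones = sum(mq(tuple(random.randint(0, 1) for _ in range(n))) for _ in range(m))
    return abs(ones / m - 0.5) <= eps / 2         # True = ACCEPT, False = REJECT
```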

Crucially, the main concern in property testing is the query complexity, and not the running time. This is the key difference between learning and testing. In learning, practically any interesting class of functions requires a number of samples that depends on n (something like Ω(log n) samples). However, there are many cases in property testing where the complexity is independent of n, i.e. q(n, ε) = q′(ε). For fixed ε > 0, such properties are constant-query testable.

Definition 72 (One- and two-sided testers). A property tester that never rejects functions in P is said to have one-sided error. Otherwise (as defined above), it is said to have two-sided error.

One way of viewing one-sided error is to observe that such an algorithm rejects only if the queries exhibit some kind of "witness" that f is not in the property P. This perspective is useful in proving lower bounds.

Definition 73 (Adaptive and non-adaptive testers). A property testing algorithm A is said to be non-adaptive if it chooses which inputs to query all at once at the beginning, before evaluating any of them; that is, if its k-th query x_k does not depend on f(x_1), ..., f(x_{k−1}). Otherwise, the algorithm is adaptive and can choose its later queries based on the results of previous queries.

Clearly, adaptive testers are at least as powerful as non-adaptive ones.

Fact 74. If A is an adaptive tester that makes q(n, ε) queries, there is a 2^{q(n,ε)}-query non-adaptive tester A′.

Proof sketch. A′ simulates all the possible ways A can proceed depending on the outcomes of the queries. This branching process is essentially a binary tree of queries of depth q, so there are at most 2^q queries we would have to make to be able to trace any path down the tree.

One may wonder if this exponential gap is tight, i.e. if adaptive testers can indeed be exponentially more query-efficient than non-adaptive ones. The following result essentially shows this is the case:

Fact 75 (Ron and Servedio [RS13]). There are properties for which this exponential gap cannot be avoided; for example, the property P of signed majority functions

f(x) = sign( ∑_{i=1}^n σi xi ), σi ∈ {±1}.

There is an adaptive algorithm for P that makes poly(log n, 1/ε) queries, but every non-adaptive tester requires at least n^c queries for some absolute constant c > 0.


Lecture 7

March 5, 2014: Property testing for Boolean functions

7.1 Overview

7.1.1 Last time

• k-juntas of prefix-coverable functions are prefix-coverable;
• gave an algorithm to learn prefix-coverable functions.

Combined, these yield a (nk/ε)^k-time algorithm to learn k-juntas of halfspaces g(h1, ..., hk): exponential in k, but since g is arbitrary, an exponential dependence on k is unavoidable. Note that this algorithm requires membership queries.

Question: For g = ANDk, i.e. h1 ∧ · · · ∧ hk, is it possible to do better?

7.1.2 Today: Property testing for Boolean functions

• Proper learning for P implies property testing of P (generic, but quite inefficient);
• Testing linearity (over GF(2)), i.e. P = all parities: an (optimal) O(1/ε)-query 1-sided non-adaptive tester;
• Testing monotonicity (P = all monotone functions): an efficient O(n/ε)-query 1-sided non-adaptive algorithm¹.

¹This result (from 2000) was improved to O(n^{7/8}/ε^{3/2}) by Chakrabarty and Seshadhri in 2013.


Relevant Readings:
• Ron, 2008: Property Testing: A Learning Theory Perspective. [Ron08]
• Ron, 2009: Algorithmic and Analysis Techniques in Property Testing. [Ron09]
• Bellare, Coppersmith, Hastad, Kiwi and Sudan: Linearity Testing in Characteristic Two. [BCH+95]

7.2 Proper learning implies Property Testing

Recall that a proper learning algorithm for a class C is a learning algorithm which only outputs hypotheses from C. Hereafter, we will assume that when the algorithm fails to return such a hypothesis (e.g., after erring because the target function was not in the right class), it signals it by outputting FAIL.

Theorem 76 (Testing reduces to proper learning). Suppose the class P of Boolean functions has a proper learning algorithm L that makes m_L(n, ε, δ) queries and achieves accuracy 1 − ε with probability 1 − δ. Then there exists a property testing algorithm T for P making m_T(n, ε) = m_L(n, ε/2, 1/6) + O(1/ε) queries.

Proof. The idea is to run L on f, getting some hypothesis h, and then to check whether h is indeed close to f on O(1/ε) fresh random examples, accepting or rejecting accordingly.

Algorithm 4 Generic testing algorithm
1: Run L on f with accuracy parameter ε/2 and δ = 1/6. If it does not output some h ∈ P, return REJECT.
2: Otherwise, h ∈ P: draw O(1/ε) uniform random examples from {0,1}^n, query f and evaluate h on them. Let ε̂ be the fraction of those examples on which f and h differ.
3: if ε̂ > 3ε/4 then return REJECT, otherwise return ACCEPT
4: end if

Analysis:
Case 1: f ∈ P. With probability at least 5/6, the proper learning algorithm L returns h ∈ P such that dist(f, h) ≤ ε/2 – and survives Step 1. In Step 3, a (multiplicative) Chernoff bound ensures we reject with probability at most 1/6. Overall, we get that Pr[ T outputs ACCEPT ] ≥ 2/3.
Case 2: dist(f, P) > ε. Since L is a proper learning algorithm, every valid hypothesis h it may return has dist(f, h) > ε; therefore, either (a) L fails to output such a hypothesis (causing us to reject in Step 1); or (b) it outputs h ∈ P with dist(f, h) > ε, and T rejects in Step 3 with probability at least 5/6.


Conclusion. Recall that every class P has an O((1/ε)(log|P| + log(1/δ)))-query proper learning algorithm (via an Occam's Razor / consistent-hypothesis-finder type argument). Therefore:
Good: this is generic – every class P is testable with this many queries, using such an L;
Bad: it is not optimal. It yields an O(n/ε)-query tester for parities, whereas we'll see we can test this class with a customized algorithm that uses only O(1/ε) queries. For monotone Boolean functions, a bound based only on |P| is 2^n/(√n · ε), but our (non-proper) learning algorithm can be converted into a proper learning algorithm with the same query complexity, so it is possible to use the learning-based approach to get an n^{O(√n)}/ε-query tester. However, we will see that a customized approach for this class yields an O(n/ε)-query tester.

Remark 12. Later, [DLM+07] proved that for "many" classes P (those in which every function f can be approximated by a not-too-large junta), one can do a more sophisticated learning-to-testing conversion, using junta testing ideas. For several classes such as size-s DNF formulas, size-s decision trees, and size-s Boolean circuits, this approach leads to testing algorithms with query complexity poly(s, 1/ε), independent of n (while the straightforward "testing via proper learning" result above would give query complexities that depend on n).

7.3 Linearity testing

7.3.1 Definitions

We consider linearity mod 2 of functions f : {0,1}^n → {0,1}:

Definition 77. A function f is said to be linear if there exists a family (ai)_{1≤i≤n} ∈ {0,1}^n such that

f(x) = f(x1, ..., xn) = a1x1 + ··· + anxn mod 2 = ∑_{i∈S} xi mod 2 = PAR_S(x)

for S = { i : ai = 1 }; that is, the class of all linear functions is P = {PAR_S}_{S⊆[n]}.

Although this definition is totally fine, it is not clear how to design a testing algorithm from it (short of learning the set S). Hence, one may wonder if there is an alternate, "testing-friendly" characterization of linearity – inspired by the usual characterization of linearity² for general vector spaces, something local along the lines of:

Definition 78. A function f is said to be linear′ if it satisfies

∀x, y ∈ {0,1}^n, f(x) + f(y) = f(x + y)   (7.1)

where x + y is the bitwise addition mod 2 (that is, (x+y)i = xi + yi mod 2).


Fortunately, it turns out that these two definitions are indeed equivalent:

Lemma 79. f is linear if and only if it is linear′.

Proof.
⇒ Let S ⊆ [n] be as in Definition 77 for f, and fix any x, y ∈ {0,1}^n:

f(x) + f(y) = ∑_{i∈S} xi + ∑_{i∈S} yi = ∑_{i∈S} (xi + yi) = f(x + y).

⇐ Suppose f is linear′.
• For any x ∈ {0,1}^n, f(x + 0^n) = f(x) + f(0^n), which implies f(0^n) = 0;
• Define ei := (0, ..., 1, ..., 0), the string with only the i-th coordinate set to 1. For any x ∈ {0,1}^n and i ∈ [n], a simple case distinction shows that f(xi · ei) = xi · f(ei) (as xi is either 0 or 1).
Setting S := { i ∈ [n] : f(ei) = 1 }, we have

f(x) = f( ∑_{i∈[n]} xi ei ) = ∑_{i∈[n]} xi f(ei) = ∑_{i∈S} xi f(ei) = ∑_{i∈S} xi,

proving f is linear.

7.3.2 BLR Test

Observe that the second definition looks intuitively easy to test: a natural way to check linearity′ is to pick x, y at random, and check whether Eq. (7.1) holds. This 3-query procedure, first analyzed by Blum, Luby and Rubinfeld [BLR90], is known as the BLR linearity test.

Algorithm 5 The BLR linearity test
1: Pick x, y independently and uniformly from {0,1}^n, and set z := x + y (bitwise).
2: Query f(x), f(y) and f(z).
3: if f(x) + f(y) = f(z) then return ACCEPT
4: else return REJECT
5: end if

The full linearity tester merely repeats the BLR test O(1/ε) times, and accepts if and only if all tests passed:

²Note that, the only scalars being 0 and 1, the axiom f(λx) = λf(x) is vacuous here.


Algorithm 6 (Full) linearity tester
1: Repeat the BLR linearity test 3/ε times.
2: return ACCEPT if all tests passed, REJECT otherwise.
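Algorithms 5 and 6 translate directly into code; the Python sketch below is an illustration (the membership-query interface mq and the rounding of 3/ε are assumptions of the sketch, not part of the notes).

```python
import random

def blr_test_once(mq, n):
    """One run of the BLR test (Algorithm 5) on f : {0,1}^n -> {0,1}."""
    x = tuple(random.randint(0, 1) for _ in range(n))
    y = tuple(random.randint(0, 1) for _ in range(n))
    z = tuple((xi + yi) % 2 for xi, yi in zip(x, y))     # z = x + y (mod 2)
    return (mq(x) + mq(y)) % 2 == mq(z)

def linearity_tester(mq, n, eps):
    """The full tester (Algorithm 6): repeat BLR ~3/eps times, accept iff all pass."""
    reps = max(1, round(3 / eps))
    return all(blr_test_once(mq, n) for _ in range(reps))

# Sanity check: a parity such as f(x) = x1 + x3 (mod 2) is always accepted.
parity = lambda x: (x[0] + x[2]) % 2
# linearity_tester(parity, 5, 0.1)  ->  True
```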

Clearly, this is a 9/ε-query non-adaptive tester; it remains to prove that it behaves as required. In order to show correctness, we must prove a robust version of Eq. (7.1): if f is far from linear, it has to violate f(x) + f(y) = f(x + y) for many pairs (x, y).

As a first and easy observation, note that if f is indeed linear, then the test will never find a violation, hence accepting with probability 1 (i.e., this is even a one-sided tester). As for the (hard) case where f is ε-far from linear, it is enough to prove the following theorem on the behavior of the BLR test:

Theorem 80. If f is such that dist(f, P) > ε, then Pr[ BLR outputs Yes on f ] < 1 − ε.

(Indeed, as (1 − ε)^{3/ε} < 1/e³ < 1/10, the full tester will then accept (and err) with probability at most 1/10.)

Proof. For convenience of use with Fourier analysis, we switch back to viewing f as a {−1,1}^n → {−1,1} function, and a parity PAR_S as χ_S(x) = ∏_{i∈S} xi. In this setting, the BLR test can be rephrased as checking whether f(z) = f(x)f(y), for random x, y ∈ {−1,1}^n and z = x ∘ y (that is, zi = xi yi for all i).

The proof will hinge upon the following lemma, which relates the probability of success of the BLR test to the Fourier expansion of f:

Lemma 81. Fix any f : {−1,1}^n → {−1,1}. Then

Pr[ BLR outputs Yes on f ] = 1/2 + (1/2) ∑_{S⊆[n]} f̂(S)³.

Proof. Define the indicator variable 1_BLR for the event "the BLR test outputs Yes on f". Rewriting "cleverly" what this event means (namely, f(z)f(x)f(y) = 1), we get that

1_BLR = 1/2 + (1/2) f(x)f(y)f(z).

Therefore,

Pr[ BLR outputs Yes ] = E[1_BLR] = E[ 1/2 + (1/2) f(x)f(y)f(z) ] = 1/2 + (1/2) E[f(x)f(y)f(z)]


and, to argue the lemma, it is sufficient to show that E[f(x)f(y)f(z)] = ∑_{S⊆[n]} f̂(S)³. Expanding the Fourier expansions,

E[f(x)f(y)f(z)] = E[ ∑_{S⊆[n]} f̂(S)χ_S(x) · ∑_{T⊆[n]} f̂(T)χ_T(y) · ∑_{U⊆[n]} f̂(U)χ_U(z) ]
                = ∑_{S,T,U⊆[n]} f̂(S)f̂(T)f̂(U) · E[χ_S(x)χ_T(y)χ_U(x ∘ y)],

where the expectation is over the independent choice of x and y. This last term can itself be rewritten as

E[χ_S(x)χ_T(y)χ_U(x ∘ y)] = E[ (∏_{i∈S} xi)(∏_{i∈T} yi)(∏_{i∈U} xi yi) ]
                          = E[ (∏_{i∈S△U} xi)(∏_{i∈T△U} yi) ]   (since xi² = yi² = 1)
                          = E[χ_{S△U}(x)] · E[χ_{T△U}(y)]   (by independence of x, y)
                          = 1 if S = T = U, and 0 otherwise   (by orthonormality of the χ's),

where △ denotes symmetric difference. Substituting this back into the triple-sum expression of E[f(x)f(y)f(z)], we get E[f(x)f(y)f(z)] = ∑_S f̂(S)³ as desired.

With this in hand, we are ready to prove Theorem 80 (by contrapositive). Suppose that Pr[ BLR outputs Yes on f ] ≥ 1 − ε: we will show this implies dist(f, P) ≤ ε. Lemma 81 gives us

1 − ε ≤ Pr[ BLR outputs Yes on f ] = 1/2 + (1/2) ∑_S f̂(S)³,

so that

1 − 2ε ≤ ∑_S f̂(S)³ ≤ max_{S⊆[n]} f̂(S) · ∑_{S⊆[n]} f̂(S)² = max_{S⊆[n]} f̂(S),

using ∑_{S⊆[n]} f̂(S)² = 1 for a Boolean function. Thus, there exists S* ⊆ [n] with f̂(S*) ≥ 1 − 2ε. Since f̂(S*) = E[f(x)χ_{S*}(x)] = 2 Pr[ f(x) = χ_{S*}(x) ] − 1, we get that Pr[ f(x) = χ_{S*}(x) ] ≥ 1 − ε, which implies dist(f, P) ≤ ε.

7.3.3 Generalizations

The original paper [BCH+95] and subsequent work broadly generalize this O(1/ε)-query linearity test to many other settings:


• f : G → H, for Abelian groups G, H;
• the class Pd of degree-d polynomials over GF(2) (e.g. f(x) = x1x2x3 + x1x4 + x1x3x5 mod 2): it is known that Pd is 2^{Θ(d)}/ε-query testable.

7.4 Monotonicity testing

We will cover the following results:
Upper bound: We will describe and analyze the O(n/ε)-query (non-adaptive, 1-sided) tester of [GGL+00]. Recently, [CS13] gave an O(n^{7/8}/ε^{3/2})-query tester, breaking the longstanding "barrier" of n (we will not prove this result in class).
Lower bound: For non-adaptive 1-sided testers, we will present an Ω(√n) lower bound from [FLN+02]. For non-adaptive 2-sided testers, the best known lower bound was for a long time the Ω(log n) bound of [FLN+02]. A few weeks ago, Chen–Servedio–Tan brought it up to Ω(n^{1/5}), giving the first polynomial lower bound for this problem.

Let M denote the class of all n-variable monotone Boolean functions. Once again, for the sake of testing we will resort to a handier, equivalent definition of monotonicity:

Lemma 82. A function f : {0,1}^n → {0,1} is monotone if and only if for all x ∈ {0,1}^n and all i ∈ [n], f(x^{i←0}) ≤ f(x^{i←1}).

Here and subsequently the notation x^{i←b} indicates the string in {0,1}^n obtained from x ∈ {0,1}^n by setting the i-th coordinate to b.

The algorithm will use as a subroutine the procedure EdgeTester, which picks an edge of the hypercube uniformly at random and checks whether the values of f at its two endpoints violate monotonicity:

Algorithm 7 The procedure EdgeTester
1: Draw independently x ∼ U_{{0,1}^n} and i ∼ [n], uniformly.
2: Query f(x^{i←0}) and f(x^{i←1}).
3: return REJECT if f(x^{i←0}) > f(x^{i←1}), ACCEPT otherwise.

Given this "edge tester", the overall testing algorithm is very simple:

Algorithm 8 The monotonicity tester MonTester
1: Invoke EdgeTester O(n/ε) times.
2: return REJECT if any of the calls returned REJECT, ACCEPT otherwise.
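Algorithms 7 and 8 are equally short in code; here is an illustrative Python sketch (not from the notes), where the constant in the O(n/ε) repetition count is an arbitrary choice.

```python
import random

def edge_tester(mq, n):
    """One run of EdgeTester (Algorithm 7) on f : {0,1}^n -> {0,1}.
    Returns True for ACCEPT and False for REJECT (a violating edge was found)."""
    x = [random.randint(0, 1) for _ in range(n)]
    i = random.randrange(n)              # a uniformly random coordinate
    x[i] = 0
    lo = mq(tuple(x))
    x[i] = 1
    hi = mq(tuple(x))
    return not (lo > hi)                 # reject iff f(x^{i<-0}) > f(x^{i<-1})

def mon_tester(mq, n, eps, const=2):
    """MonTester (Algorithm 8): O(n/eps) calls to the edge tester."""
    reps = max(1, int(const * n / eps))
    return all(edge_tester(mq, n) for _ in range(reps))
```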

Albeit very simple and “natural”, the edge tester is not that straightforward to analyze.


Theorem 83. MonTester is an O(n/ε)-query, 1-sided, non-adaptive tester for M.

Proof. The 1-sided and non-adaptive properties are easily seen to hold; the heart of the proof is, as before, in the correctness for f ε-far from monotone. This point will follow from the correctness of the edge tester:

Theorem 84. Suppose dist(f, M) ≥ ε. Then

Pr[ EdgeTester outputs REJECT ] ≥ ε/n.

Proof of Theorem 84.
Notation. Let E := { (x, y) : (x, y) is an edge of {0,1}^n with x ≺ y }, so that |E| = n2^{n−1}. The set of violating edges V(f) ⊆ E is defined as V(f) := { (x, y) ∈ E : f(x) = 1, f(y) = 0 }. Viewing V(f) as a disjoint union, we write

V(f) = V1(f) ∪ V2(f) ∪ ··· ∪ Vn(f),

where Vi(f) ⊆ V(f) is the set of coordinate-i violating edges.

Let η(f) := |V(f)|/|E| = |V(f)|/(n2^{n−1}) = Pr[ EdgeTester outputs REJECT ]. Our goal is to prove that η(f) ≥ dist(f, M)/n.

High-level idea: fixing some f, the proof will go by building a g ∈ M such that dist(f, g) ≤ nη(f). To do so, we will "monotonize" f, coordinate by coordinate, by sorting the violating edges for a given coordinate with a shift operator.

Definition 85 (Shift operator). Fix i ∈ [n]. The shift operator Si acts on functions h : {0,1}^n → {0,1} by sorting the pair (h(x^{i←0}), h(x^{i←1})): Si h is the function from {0,1}^n to {0,1} defined by

Si h(x^{i←0}) = min( h(x^{i←0}), h(x^{i←1}) ),
Si h(x^{i←1}) = max( h(x^{i←0}), h(x^{i←1}) ).

(Rest of the proof during next lecture.)


Lecture 8

March 12, 2014

8.1 Overview

8.1.1 Last time

• Proper learning for P implies property testing of P (generic, but quite inefficient);
• Testing linearity (over GF(2)), i.e. P = all parities: an (optimal) O(1/ε)-query 1-sided non-adaptive tester;
• Testing monotonicity (P = all monotone functions): an efficient O(n/ε)-query 1-sided non-adaptive tester.

8.1.2 Today

• Finish testing monotonicity (P = all monotone functions): an efficient O(n/ε)-query 1-sided non-adaptive algorithm.
• Lower bounds:
  – For non-adaptive 1-sided testers, we will show an Ω(√n) lower bound from [FLN+02].
  – Start the proof of the Ω(n^{1/5}) bound by Chen–Servedio–Tan for non-adaptive, 2-sided testers, using Yao's minimax principle, which converts the problem into a lower bound for deterministic algorithms (under a suitable distribution on inputs).

Relevant Readings:
• E. Fischer, E. Lehman, I. Newman, S. Raskhodnikova, R. Rubinfeld and A. Samorodnitsky: Monotonicity Testing Over General Poset Domains. [FLN+02]
• O. Goldreich, S. Goldwasser, E. Lehman, D. Ron and A. Samorodnitsky: Testing Monotonicity. [GGL+00]


8.2 Testing Monotonicity (contd. from last time)

Recall that the set of violating edges V(f) ⊆ E can be decomposed as

V(f) = V1(f) ∪ V2(f) ∪ ··· ∪ Vn(f),

where Vi(f) ⊆ V(f) is the set of coordinate-i violating edges, and that we defined the quantity

η(f) := |V(f)|/(n2^{n−1}) = Pr[ EdgeTester outputs REJECT ].

Goal: prove that η(f) ≥ dist(f, M)/n, i.e.

nη(f) ≥ dist(f, M).   (8.1)

To do so, for any fixed f we will show how to construct a monotone function g such that dist(f, g) ≤ nη(f). Finally, recall the definition of the shift operator Si:

Definition 86 (Shift operator). Fix i ∈ [n]. The shift operator Si acts on functions h : {0,1}^n → {0,1} by sorting the pair (h(x^{i←0}), h(x^{i←1})): Si h is the function from {0,1}^n to {0,1} defined by

Si h(x^{i←0}) = min( h(x^{i←0}), h(x^{i←1}) ),
Si h(x^{i←1}) = max( h(x^{i←0}), h(x^{i←1}) ).

In the following, we let Di(f) := 2|Vi(f)| be the number of vertices x such that Si(f)(x) ≠ f(x).

Definition 87. We say h : {0,1}^n → {0,1} is i-monotone if no x has h(x^{i←0}) = 1 but h(x^{i←1}) = 0, that is, if h has no violation in the i-th coordinate. For A ⊆ [n], we say h is A-monotone if h is i-monotone for all i ∈ A.

Claim 88 (2-part claim).
1. If h is A-monotone and j ∉ A, then Sj(h) is (A ∪ {j})-monotone.
2. For every i, j ∈ [n], we have Di(Sj(h)) ≤ Di(h) (shifting does not increase violations).

Before proving this claim, we show how it directly yields our goal:

Proof of Eq. (8.1) using Claim 88. Let g := Sn(Sn−1(···S1(f))···) = Sn ∘ Sn−1 ∘ ··· ∘ S1(f). By Part 1 of the claim, g is monotone (as it is [n]-monotone); hence, it is sufficient to prove it is not too far from f – namely, that nη(f) ≥ dist(f, g).

Page 65: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

8.2. TESTING MONOTONICITY: O(nε

)UPPER BOUND 57

Let fi denote Si Si−1 · · · S1(f) (so in particular f = f0 and g = fn). By the triangleinequality,

dist(f, g) ≤ dist(f0, f1) + · · ·+ dist(fn−1, fn)

Focusing on a fixed term of the sum, for i ∈ [n]

dist(fi−1, fi) = dist(fi−1, Si(fi−1)) = Di(fi−1)2n

= Di(Si−1 · · · S1(f))2n

≤ Di(Si−2 · · · S1(f))2n (Claim 88, Part 2)

≤ Di(f0)2n = |Vi(f)|

2n−1 (Repeating the inequality)

which, coming back to the sum, gives

dist(f, g) ≤ |V1(f)|+ · · ·+ |Vn(f)|2n−1 = |V (f)|

2n−1 = nη(f)

as |V (f)| = |∪ni=1Vi(f)| =∑ni=1 |Vi(f)| by disjointness; and finally by definition of η(f).

It remains to prove the claim:

Proof of Claim88. First, observe that (2) ⇒ (1): indeed, assume Part 2 holds, and suppose h isA-monotone. Fix any j 6∈ A. Since Sj(h) is j-monotone by application of the shift operator; weonly have to show that Sj(h) is i-monotone as well, for any i ∈ A.Fix such an i ∈ A: the number of i-edges where Sj(h) violates monotonicity is

|Vi(Sj(h))| = Di(Sj(h))2 ≤

(Part 2)

Di(h)2 = |Vi(h)| = 0

as stated.

Turning to Part 2: rather disappointly, this is a “proof by inspection”, as there are actually only 16cases to consider: only 2 variables are really involved, i and j.More precisely, without loss of generality, one can take i = 1 and j = 2; fixing coordinatesx3, · · · ,xn ∈ 0, 1n−2, h becomes a bivariate function h : 0, 12 → 0, 1. Hence, it is sufficient toargue that for all h : 0, 12 → 0, 1, D1(S2(h)) ≤ D1(h) – which can be done by enumerating all16 cases.

Page 66: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

58 LECTURE 8. MARCH 12, 2014

Remark 13. This algorithm was analyzed in 2000; it is known that the analysis is tight, that isthat this “edge tester” needs Ω(n) queries: a hard instance would be any dictator function x 7→ xi,anti-monotone.In 2013, Chakrabarty and Seshadhri ([CS13]) broke the “linearity barrier” for testing monotonicityby giving a O

(n7/8/ε3/2

)-query tester1 which combines the edge tester with a “path tester” (which

picks a random path in the hypercube, then queries two points randomly on this path). This has(very) recently been improved to an n5/6 dependency, by Chen–Servedio–Tan (2014).

8.3 Ω(√

n) lower bound for non-adaptive 1-sided testers

Theorem 89. There is an absolute constant ε0 > 0 such that any one-sided non-adaptive ε0-testerfor M must make at least

√n

3 queries.

Observation 90. Suppose A is such a tester, and say A reveals a violation of f if it queries x,y with x ≺ y such that f(x) = 1, f(y) = 0. As it is one-sided, A can only reject when it is “sure”beyond any doubt; that is, if A does not reveals a violation in an execution, it must output ACCEPT.Therefore, if A is 1-sided non-adaptive tester for monotonicity, it must be the case that for every fwith dist(f,M) > ε0, A must reveal a violation of f with probability at least 2

3 .

Definition 91. For i ∈ [n], define the truncated anti-dictator fi as

fi : 0, 1n → 0, 1

x 7→

1 if

∑nj=1 xj ≥ n

2 +√n

0 if∑nj=1 xj <

n2 −√n

xi o.w.

Fact 92. There exists an absolute constant ε0 > 0 such that, for every i ∈ [n], dist(fi,M) > ε0.

Proof. Indeed, there are at least c2n) (for some suitable constant c > 0) many x ∈ 0, 1n having:

n

2 −√n <

n∑i=1

xi <n

2 +√n

(in the “middle slice”). Without loss of generality, we consider the case i = 1: we can pair up inputsof the form z = (1, z2, · · · , zn) for which f1(z) = 0 with z′ = (0, z2, · · · , zn), for which f1(z′) = 1.Any monotone function g disagrees with f1 on at least 1 of these two inputs; so any monotonefunction must disagree with f on at least c

2 · 2n points.

1Note that the quantity of interest in the query complexity is n, so this result is an improvement even though theexponent of ε is now 3/2 > 1. More generally, compared to n the parameter ε is seen as a constant, and in propertytesting 2221/ε

will always be considered better than log∗ nε

.

Page 67: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

8.4. Ω(n1/5

)LOWER BOUND FOR NON-ADAPTIVE, 2-SIDED TESTERS 59

Lemma 93. Let A be any non-adaptive q-query algorithm. Then there exists i ∈ [n] such that Areveals a violation on fi with probability at most 2q√

n.

This implies the theorem: any one-sided non-adaptive tester A with query complexity q <√n

3will reveal a violation on some fi∗ with probability < 2/3; but it only rejects on such occasions, yetany successful tester should reject fi∗ with probability at least 2/3.

Proof of Lemma 93. Fix A to be any q-query non-adaptive algorithm, and let Q be the set of qqueries it makes. We will show Q reveals violations of fi for at most 2(q − 1)

√n many i ∈ [n]: this

in turn implies thatn∑i=1

Pr[A reveals a violation of fi ] =n∑i=1

E[1 A reveals a

violation of fi

]= E

[n∑i=1

1 A reveals aviolation of fi

]= E[| i ∈ [n] : A reveals a violation of fi |]≤ 2(q − 1)

√n

so there exists i ∈ [n] such that Pr[A reveals a violation of fi ] ≤ 2(q−1)√n

.

Q is an arbitrary set of q strings in 0, 1n; without loss of generality, one can further assume everystring z ∈ Q has Hamming weight |z| ∈ [n2 −

√n, n2 +

√n], as querying any other cannot reveal any

violation of fi. Q reveals violations for fi only if Q contains 2 comparable strings u v such thatui 6= vi.

Accordingly, let GQ be a q-node unirected graph with vertex set V = Q and edge set E containingonly comparable pairs: (u, v) ∈ E iff u ≺ v or v ≺ u.

(1) |E| ≤(n

2)≤ q2 (pairs of comparable strings); and each pair reveals a violation of at most

2√n fi’s (by the Hamming weight assumption: u, v ∈ Q can differ in at most that many

coordinates). Therefore, the total number of i’s such that Q can reveal a violation of fi is atmost 2

√n(n

2)≤ 2q2√n. Almost what we need, but with q2 instead of q.

(2) A better bound can be achieved by considering a spanning forest FQ of GQ: FQ has at mostq − 1 edges. Furthermore, if Q has two comparable strings u, v with ui 6= vi, u and v will bein the same tree and some edge in the path u; v has endpoints with different value on theirith coordinate, and hence presents a violation of fi. As before, every 2 adjacent vertices ina tree differ by at most 2

√n coordinates, so the maximum number of i’s such that fi has a

violation reveals in FQ (and thus in GQ) is 2(q − 1)√n.

8.4 Ω(n1/5

)lower bound by Chen–Servedio–Tan for non-adaptive,

2-sided testers

We will now (start to) prove the following lower bound:

Page 68: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

60 LECTURE 8. MARCH 12, 2014

Theorem 94. There exists ε0 > 0 such that any 2-sided non-adaptive tester for M must makeΩ(n1/5

)queries.

To do so, we start by describing a general approach and one of the key tools for property testinglower bounds: “Yao’s Minmax Theorem”.

8.4.1 Yao’s Principle (easy direction)

Consider a decision problem (here, Property Testing) over a (finite) set X of possible inputs (in ourcase, X = P ∪ f : dist(f,P) > ε , and the inputs are functions), and a randomized non-adaptivedecision algorithm A that makes q queries to its input f . Such an algorithm is equivalent to aprobability distribution µ = µA over deterministic q-query decision algorithms. Letting Y be theset of all such determistic algorithms, we consider the X × Y matrix M with Boolean entries, and• rows indexed by functions f ∈ X;• columns indexed by algorithms y ∈ Y (or, equivalently, by sets Q of queries, possibly with

repetitions)

such that M(f, y) =

1 if y is right on input f0 o.w.

.

Our randomized algorithm A is thus equivalent to a distribution µ over columns (i.e., over Y ),non-negative function with

∑y∈Y µ(y) = 1. Similarly, a distribution λ over inputs (f ∈ X) satisfies∑

f∈X λ(f) = 1.For A to be a successful q-query property testing algorithm, it must be such that for every row

f ∈ X:Pr[A outputs right answer on f ] ≥ 2/3

that is Pry∼µ [M(f, y) = 1 ] ≥ 2/3.

Suppose there is a distribution λ over X such that every y ∈ Y has:

Prf∼λ

[M(f, y) = 1 ] < 2/3.

Then, for any distribution µ over Y :

Prf∼λy∼µ

[M(f, y) = 1 ] < 2/3

so it cannot be the case that for every f ∈ X, Pry∼µ [M(f, y) = 1 ] ≥ 2/3.and in particular A (which is fully characterized by µ) is not a legit tester – since there exists somef with Pr[A right on f ] < 2/3.

This is what Yao’s Principle states (at least, what its “easy direction” does): one can reduce theproblem of dealing with randomized (non-adaptive) algorithms over arbitrary inputs to the one ofdeterministic algorithms over a (“suitably difficult”) distribution over inputs:

Page 69: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

8.4. Ω(n1/5

)LOWER BOUND FOR NON-ADAPTIVE, 2-SIDED TESTERS 61

Theorem 95 (Yao’s Minmax Principle, easy direction). Suppose there is a distribution λ overfunctions (legitimate inputs: f ∈ P ∪ h : dist(h,P) > ε ) such that any q-query deterministicalgorithm is correct with probability < 2/3 when f ∼ λ.Then, given any (non-adaptive) q-query randomized algorithm A, there exists fA ∈ X, such that

Pr[A is correct on fA ] < 2/3

Hence, any non-adaptive property testing algorithm for P must make at least q + 1 queries.

Goal: find hard distribution over functions, for deterministic algorithms.More precisely, to get a grip on what being a hard distribution is, recall the notion of distance

between probability distributions we introduced at the beginning of the course:

Definition 96. Suppose D1, D2 are both probability distributions over a finite set Ω; their totalvariation distance is defined2 as

dTV(D1,D2) def= maxS⊆Ω

(D1(S)−D2(S)) = 12∑ω∈Ω|D1(ω)−D2(ω)| ∈ [0, 1]

This will come in handy to prove our lower bounds, as (very hazily) two sequences of queries/answerswhose distribution are very close are impossible to distinguish with high probability:

Exercise 97 (Homework problem). HW ProblemLet D1, D2 be two distributions over some set Ω, and A beany algorithm (possibly randomized) that takes x ∈ Ω as input and outputs Yes or No. Then∣∣∣∣ Pr

x∼D1[A(x) = Yes ]− Pr

x∼D2[A(x) = Yes ]

∣∣∣∣ ≤ dTV(D1,D2)

where the probabilities are also taken over the possible randomness of A.

2The second equality is known as Scheffe’s lemma.

Page 70: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

62 LECTURE 8. MARCH 12, 2014

Page 71: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

Lecture 9

March 26, 2014

9.1 Overview

9.1.1 Last Time

• Finished analysis of O(nε

)-query algorithm for monotonicity.

• Showed an Ω(√n) lower bound for one-sided non-adaptive monotonicity testers.

• Stated and proved (one direction of) Yao’s Principle: Suppose there exists a distribution Dover functions f : −1, 1n → −1, 1 (the inputs to the property testing problem) such thatany q-query deterministic algorithm gives the right answer with probability at most c. Then,given any q-query non-adaptive randomized testing algorithm A, there exists some functionfA such that:

Pr[A outputs correct answer onfA ] ≤ c.

9.1.2 Today: lower bound for two-sided non-adaptive monotonicity testers.

We will use Yao’s Principle to show the following lower bound:

Theorem 98 (Chen–Servedio–Tan ’14). Any 2-sided non-adaptive property tester for monotonicity,to ε0-test, needs Ω

(n1/5

)queries (where ε0 > 0 is an absolute constant).

9.2 Ω(n1/5

)lower bound: proving Theorem 98

9.2.1 Preliminaries

Recall the definition of total variation distance between two distributions over the same set Ω:

dTV(D1,D2) = 12∑x

|D1(x)−D2(x)| .

63

Page 72: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

64 LECTURE 9. MARCH 26, 2014

As homework problem from last lecture (Exercise 97), we have the lemma1 below, which relatesthe probability of distinguishing between samples from two distributions to their total variationdistance:

Lemma 99 (DPI for TV). Let D1, D2 be two distributions over some set Ω, and A be any algorithm(possibly randomized) that takes x ∈ Ω as input and outputs Yes or No. Then∣∣∣∣ Pr

x∼D1[A(x) = Yes ]− Pr

x∼D2[A(x) = Yes ]

∣∣∣∣ ≤ dTV(D1,D2)

where the probabilities are also taken over the possible randomness of A.

To apply this lemma, recall that given a deterministic algorithm’s set of queriesQ = z(1), . . . , z(q) ⊆−1, 1n, a distribution D over Boolean functions induces a distribution D

∣∣Q

over −1, 1q: x isdrawn from D

∣∣Q

by• drawing f ∼ D;• outputting (f(z(1), . . . , f(z(q))) ∈ −1, 1q.

With this observation and Yao’s principle in hand, we can state and prove a key tool in provinglower bounds in property testing:

Lemma 100 (Key Tool). Fix any property P (a set of Boolean functions). Let DYes be a distributionover the Boolean functions that belong to P, and DNo be a distribution over Boolean functions thatall have dist(f,P) > ε.Suppose that for all q-query sets Q, one has dTV

(DYes

∣∣Q,DNo

∣∣Q

)≤ 1

4 . Then any (2-sided) non-adaptive ε-tester for P must use at least q + 1 queries.

Proof. Let D be the mixture D def= 12DYes + 1

2DNo (that is, a draw from D is obtained by tossinga fair coin, and returning accordingly a sample drawn either from DYes or DNo). Fix a q-querydeterministic algorithm A. Let

pYdef= Pr

f∼DYes[A accepts on f ] , pN

def= Prf∼DNo

[A accepts on f ]

That is, pY is the probability that a random “Yes” function is accepted, while pN is the probabilitythat a random “No” function is accepted. Via the assumption and the previous lemma, |pY − pN | ≤ 1

4 .However, this means that A cannot be a succesful tester; as

Prf∼D

[A gives wrong answer ] = 12(1− pY ) + 1

2pN = 12 + 1

2(pN − pY ) ≥ 38 >

13

So Yao’s Principle tells us that any randomized non-adaptive q-query algorithm is wrong on some fin support of D with probability at least 3

8 ; but a legit tester can only be wrong on any such f withprobability less than 1

3 .1This is sometimes referred to as a “data processing inequality” for the total variation distance.

Page 73: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

9.2. PROVING THE Ω(n1/5

)LOWER BOUND 65

Exercise 101 (Generalization of Lemma 100). Relax the previous lemma slightly. Prove that theconclusion still holds even under the weaker assumptions HW Problem

Prf∼DYes

[ f ∈ P ] ≥ 99100 , Pr

f∼DNo[ dTV(f,P) > ε ] ≥ 99

100 .

For our lower bound, we need to come up with DYes (resp. DNo) to be over monotone functions(resp. ε0-far from monotone) such that ∀Q ⊆ −1, 1n with |Q| = q, dTV

(DYes

∣∣Q,DNo

∣∣Q

)≤ 1

4 .At a high-level, we need to argue that both distributions “look the same”. One may thus think ofthe Central Limit Theorem – the sum of many independent, “nice” real-valued random variablesconverges to a Gaussian in distribution (in cumulative distribution function). For instance, abinomial distribution Bin

(106, 1

2

)has the same shape (“bell curve”) as the corresponding Gaussian

distribution N(

12 ,

14106

). For our purpose, however, the convergence guarantees stated by the

Central Limit Theorem will not be enough, as they do not give explicit bounds on the rate ofconvergence; we will use a “quantitative version” of the CLT, the Berry–Esseen Theorem.

First, recall the definition a (real-valued) Gaussian random variable:

Definition 102 (One-dimensional Gaussian distribution). A real-valued random variable is said tobe Gaussian with mean µ and variance σ if it follows the distribution N (µ, σ), which has probabilitydensity function

fµ,σ(x) def= 1√2πσ

e−(x−µ)2

2σ2 , x ∈ R

Such a random variable has indeed expectation µ and variance σ2; futhermore, the distributionis fully specified by these two parameters. Extending to higher dimensions, one can define similarlya d-dimensional Gaussian random variable:

Definition 103 (d-dimensional Gaussian distribution). Fix a vector µ ∈ Rd and a symmetricnon-negative definite matrix Σ ∈ Rd×d. A random variable taking values in Rd is said to be Gaussianwith mean µ and covariance Σ if it follows the distribution N (µ,Σ), which has probability densityfunction

fµ,Σ(x) def= 1√(2π)k det Σ

e−12 (x−µ)TΣ−1(x−µ), x ∈ Rd

As in the univariate case, µ and Σ uniquely define the distribution; further, one has that forX ∼ N (µ,Σ),

Σi,j = Cov(Xi, Xj) = E[(Xi − EXi)(Xj − EXj)] , i, j ∈ [d].

Page 74: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

66 LECTURE 9. MARCH 26, 2014

−4 −2 0 2 40

0.2

0.4

0.6

0.8

1

x

F0,

1(x

)

(a) Cumulative distribution function (CDF)

−4 −2 0 2 40

0.1

0.2

0.3

0.4

0.5

x

f 0,1

(x)

(b) Probability density function (PDF)

Figure 9.1: Standard Gaussian N (0, 1).

Theorem 104 (Berry–Esseen2). Let S def= X1 + . . .+Xn be the sum of n independent (real-valued)random variables X1, . . . , Xn satisfying

Pr[ |Xi − E[Xi]| ≤ τ ] = 1.

that is every Xi is almost surely bounded. For i ∈ [n], define µidef= E[Xi] and σi

def=√

VarXi, sothat ES =

∑ni=1 µi and VarS =

∑ni=1 σ

2i (the last equality by independence). Finally, let G be a

N(∑n

i=1 µi,√∑n

i=1 σ2i

)Gaussian variable, matching the first two moments of S. Then, for all

θ ∈ R,

|Pr[S ≤ θ ]− Pr[G ≤ θ ]| ≤ O(τ)√∑ni=1 σ

2i

.

In other terms3, letting FS (resp. FG) denote the CDF of S (resp. G), one has ‖FS − FG‖∞ ≤O(τ)√∑n

i=1 σ2i

.

Remark 14. The constant hidden in the O(·) notation is actually very reasonable – one can takeit to be equal to 1.

3This quantity ‖FS − FG‖∞ is also referred to as the Kolmogorov distance between S and G.3There exist other versions of this theorem, with weaker assumptions or phrased in terms of the third moments of

the Xi’s; we only state here one tailored to our needs.

Page 75: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

9.2. PROVING THE Ω(n1/5

)LOWER BOUND 67

Application: baby step towards the lower bound. Fix any string z ∈ −1, 1n, and fori ∈ [n] let the (independent) random variables γi be defined as

γidef=

+1 w.p. 12

−1 w.p. 12

Letting Xidef= γizi, we have µi = EXi = 0, σi = VarXi = 1; and can take τ = 1 to apply the

Berry–Esseen theorem to X def= X1 + . . .+Xn. This allows us to conclude that

∀θ ∈ R, |Pr[X ≤ θ ]− Pr[G ≤ θ ]| ≤ O(1)√n

for G ∼ N (0,√n).

Now, consider a slightly different distribution than the λi’s: for the same z ∈ −1, 1n, definethe independent random variables νi by

νidef=1

3 w.p. 910

−3 w.p. 110

and let Yidef= νizi for i ∈ [n], Y def= Y1 + · · ·+ Yn. By our choice of parameters,

EYi =( 1

10 · (−3) + 910 ·

13

)zi = 0 = EXi

VarYi = E[Y 2i

]= 1

10 · 9 + 910 ·

19 = 1 = VarXi

So E[Y ] = E[Y ] = 0 and VarY = VarX = n; by the Berry–Esseen theorem (with τ set to 3, and Gas before)

∀θ ∈ R, |Pr[Y ≤ θ ]− Pr[G ≤ θ ]| ≤ O(1)√n

and by the triangle inequality

∀θ ∈ R, |Pr[X ≤ θ ]− Pr[Y ≤ θ ]| ≤ O(1)√n

(9.1)

We can now define DYes and DNo based on this (that is, based on respectively a random draw ofλ, ν ∈ Rn distributed as above): a function fλ ∼ DYes is given by

∀z ∈ −1, 1n, fλ(z) def= sign(λ1z1 + . . . λnzn).

and similarly for fν ∼ DNo:

∀z ∈ −1, 1n, fν(z) def= sign(ν1z1 + . . . νnzn)

Page 76: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

68 LECTURE 9. MARCH 26, 2014

With the notations above, X ≤ 0 if and only if fγ(z) = −1 and Y ≤ 0 if and only if fν(z) = −1.This implies that for any fixed single query z,

dTV(DYes

∣∣z,DNo

∣∣z

)= 1

2 (|Pr[X ≤ 0 ]− Pr[Y ≤ 0 ]|+ |Pr[X > 0 ]− Pr[Y > 0 ]|) ≤ O(1)√n.

This almost looks like what we were aiming at – so why aren’t we done? There are two problemswith what we did above:

1. This only deals the case q = 1; that is, would provide a lower bound against one-queryalgorithms.Fix: we will use a multidimensional version of the Berry–Esseen Theorem for the sums ofq-dimensional independent random variables (converging to a multidimensional Gaussian).

2. fγ , fν are not monotone (indeed, both the γi’s and νi’s can be negative).Fix: shift everything by 2:

- γi ∈ 1, 3: fγ is monotone;- νi ∈ −1, 7

3: fν will be far from monotone with high probability (will show this).

9.2.2 The lower bound construction

Up until this point, everything has been a warmup; we are now ready to go into more detail.

DYes and DNo. As we mentioned in the previous section, we need to (re)define the distributionsDYes and DNo (that is, of γ and ν) to solve the second issue:DYes Draw f ∼ DYes by independently drawing, for i ∈ [n],

γidef=

+3 w.p. 12

+1 w.p. 12

and setting f : x ∈ −1, 1n 7→ sign(∑ni=1 γixi). Any such f is monotone, as the weights are

all positive.DNo Similarly, draw f ∼ DNo by independently drawing, for i ∈ [n],

νidef=

+73 w.p. 9

10−1 w.p. 1

10

and setting f : x ∈ −1, 1n 7→ sign(∑ni=1 νixi). f is not always far from monotone – actually,

one of the functions in the support of DNo (the one with all weights set to 7/3) is evenmonotone. However, we shall argue that f ∼ DNo is far from monotone with overwhelmingprobability, and then apply the relaxation of the key tool (HW 101) to conclude.

Page 77: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

9.2. PROVING THE Ω(n1/5

)LOWER BOUND 69

The theorem will stem from the following two lemmas, that states respectively that (†) No-functions are almost all far from monotone, and (‡) that the two distributions are hard to distinguish:

Lemma 105 (Lemma †). There exists a universal constant ε0 > 0 such that

Prf∼DNo

[ dist(f,M) > ε0 ] ≥ 1− 12Θ(n) .

(note that this 1− o(1) probability is actually stronger than what the relaxation from Problem 101requires.)

Lemma 106 (Lemma ‡). Let A be any deterministic q-query algorithm. Then

∣∣∣∣ PrfYes∼DYes

[A accepts ]− PrfNo∼DNo

[A accepts ]∣∣∣∣ ≤ O

(q5/4(logn)1/2

n1/4

)

so that if q = O(n1/5

)the RHS is at most 0.01, which implies with the earlier lemmas and discussion

that at least q + 1 queries are needed for any 2-sided, non-adaptive randomized tester.

Proof of Lemma 105. By an additive Chernoff bound, with probability at least 1− 12Θ(n) the random

variables νi satisfym

def= | i ∈ [n] : νi = −1 | ∈ [0.09n, 0.11n]. (?)

Say that any linear threshold function for which (?) holds is nice. Fix any nice f in the support ofDNo, and rename the variables so that the negative weights correspond to the first variables:

f(x) = sign(−(x1 + · · ·+ xm) + 7

3(xm+1 + · · ·+ xn)), x ∈ −1, 1n

It is not difficult to show that for this f (remembering that m = Θ(n)), these first variables havehigh influence – roughly of the same order as for the MAJ function:

Claim 107 (HW Problem). For i ∈ [m], Inf f (i) = Ω(

1√n

). HW Problem

Observe further that f is unate (i.e., monotone increasing in some coordinates, and monotonedecreasing in the others). Indeed, any LTF g : x 7→ sign(w · x) is unate:

- non-decreasing in coordinate xi if and only if wi ≥ 0;- non-increasing in coordinate xi if and only if wi ≤ 0.

We saw in previous lectures that, for g monotone, g(i) = Inf g(i); it turns out the same proofgeneralizes to unate g, yielding

g(i) = ±Inf g(i)

Page 78: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

70 LECTURE 9. MARCH 26, 2014

where the sign depends on whether g is non-decreasing or non-increasing in xi. Back to our functionf , this means that

Inf f (i) =

+f(i) if νi = 73

−f(i) if νi = −1

and thus for all i ∈ [m] f(i) = −Ω(

1√n

).

Fix any monotone Boolean function g: we will show that dist(f, g) ≥ ε0, for some choice of ε0 > 0independent of f and g.

4 · dist(f, g) = Ex∼U−1,1n

[(f(x)− g(x))2

]=

(Parseval)

∑S⊆[n]

(f(S)− g(S))2

≥n∑i=1

(f(i)− g(i))2 ≥m∑i=1

(f(i)− g(i))2 =(g mon.)

m∑i=1

(−Inf f (i)− Inf g(i))2

=m∑i=1

(Inf f (i) + Inf g(i))2 ≥m∑i=1

(Inf f (i))2

=m∑i=1

(Ω( 1√

n

))2= Ω

(m

n

)= Ω(1).

Proof (sketch) of Lemma 106. Fix any deterministic, non-adaptive q-query algorithm A; and viewits q queries z(1), . . . , z(q) ∈ −1, 1n as a q × n matrix Q ∈ −1, 1q×n, where z(i) corresponds tothe ith row of Q.

q

n︷ ︸︸ ︷z

(1)1 z

(1)2 z

(1)3 · · · · · · · · · z

(1)n

z(2)1 z

(2)2 z

(2)3 · · · · · · · · · z

(2)n

......

... . . . ...z

(q)1 z

(q)2 z

(q)3 · · · · · · · · · z

(q)n

Define the “Yes-response vector” RY , random variable over −1, 1q, by the process of

(i) drawing fYes ∼ DYes, where fYes(x) = sign(γ1x1 + · · ·+ γnxn);(ii) setting the ith coordinate of RY to fYes(Qi,·) (fYes on the ith row of Q, i.e. z(i)).

Similarly, define the “No-response vector” RN over −1, 1q. Via Lemma 99 (the homework problemon total variation distance),

(LHS of Lemma 106) ≤ dTV(RY , RN ).

Page 79: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

9.2. PROVING THE Ω(n1/5

)LOWER BOUND 71

(abusing the notation of total variation distance, by identifying the random variables with theirdistribution.) Hence, our new goal is to show that:

dTV(RY , RN )≤?

(RHS of Lemma 106).

Multidimensional Berry–Esseen setup. For fixed Q as above, define two random variablesS, T ∈ Rq as• S = Qγ, with γ ∼ U1,3n ;• T = Qν, with

νi =

+73 w.p. 9

10−1 w.p. 1

10

for each i ∈ [n] (independently).We will also need the following geometric notion:

Definition 108. An orthant in Rq is the analogue in q-dimensional Euclidean space of a quadrantin the plane R2; that is, it is a set of the form

O = O1 ×O2 × · · · × Oq

where each Oi is either R+ or R−. There are 2q different orthants in Rq.

The random variable RY is fully determined by the orthant S lies in: the ith coordinate ofRY is the sign of the ith coordinate of S, as Si = (Qγ)i = Qi,· · γ. Likewise, RN is determinedby the orthant T lies in. Abusing slightly the notation, we will write RY = sign(S) for ∀i ∈ [q],RY,i = sign(Si) (and similarly, RT = sign(T )).

Now, it is enough to show that for any union O of orthants,

|Pr[S ∈ O ]− Pr[T ∈ O ]| ≤ O(q5/4(logn)1/2

n1/4

). ()

as this is equivalent to proving that, for any subset U ⊆ −1, 1q, |Pr[RS ∈ U ]− Pr[RT ∈ U ]| ≤O(q5/4(logn)1/2

n1/4

)(and the LHS is by definition equal to dTV(RY , RN )).

Note that for q = 1 we get back to the “regular” Berry–Esseen Theorem; for q > 1, we will needa “multidimensional Berry–Esseen”. The key will be to have random variables with matching meansand covariances (instead of means and variances for the one-dimensional case).

(Rest of the proof during next lecture.)

Page 80: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

72 LECTURE 9. MARCH 26, 2014

Page 81: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

Lecture 10

March 04, 2014

10.1 Overview

10.1.1 Last Time

Started Ω(n1/5

)lower bound non-adaptive monotonicity testers (introducing Yao’s principle; Berry–

Esseen Theorem, DYes, DNo; multidimensional Berry–Esseen Theorem).

10.1.2 Today

Finish this lower bound (using a multidimensional analogue of the Berry–Esseen Theorem); starttesting juntas.

10.2 Monotonicity testing lower bound: wrapping up

Recall the definitions of DYes and DNo: f ∼ DYes is a linear threshold function (LTF) drawn bychoosing independently

γidef=

+3 w.p. 12

+1 w.p. 12

and setting f : x ∈ −1, 1n 7→ sign(γTx); similarly, for f ∼ DNo,

νidef=

+73 w.p. 9

10−1 w.p. 1

10

73

Page 82: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

74 LECTURE 10. MARCH 04, 2014

and f : x ∈ −1, 1n 7→ sign(νTx). The set of queries Q of any q-query non-adaptive tester will beseen as a q × n Boolean matrix

Q = q

n︷ ︸︸ ︷±1

where the ith row q(i) is the ith query string. We also defined the random variables RY , RN ∈ −1, 1qby

(RY )i = f(q(i)) for f ∼ DYes

(RN )i = f(q(i)) for f ∼ DNo

so that by setting S def= Qσ ∈ Rq and Tdef= Qν ∈ Rq, we get RY = sign(S) and RN = sign(T ).

Need to show: For S, T as above, for O any union of orthants in Rq, one has

|Pr[S ∈ O ]− Pr[T ∈ O ]| ≤ O(q5/4(logn)1/2

n1/4

)(†)

as this would imply an Ω(

n1/5

log4/5 n

)lower bound on q for the RHS to be less than 0.01.

Theorem 109 (Original (unidimensional) Berry–Esseen Theorem). Let X1, X2, . . . , Xn be n inde-pendent real-valued random variables such that |Xi − E[Xi]| ≤ τ almost surely (with probability 1);and let G be Gaussian with mean and variance matching S def=

∑ni=1Xi. Then

∀θ ∈ R, |Pr[S ≤ θ ]− Pr[G ≤ θ ]| ≤ O(τ)√VarS

.

Theorem 110 (Multidimensional Berry–Esseen Theorem1). Let S def= X(1) + · · ·+X(n), where theX(i)’s are independent random variables in Rq with ‖X(i)

j − E[X

(i)j

]‖∞≤ τ a.s.; and let G be a

q-dimensional Gaussian with mean and covariance matrix matching those of S. Then, for any Ounion of orthants and any r > 0,

|Pr[S ∈ O ]− Pr[G ∈ O ]| ≤ O

τq3/2 lognr

+q∑i=1

r + τ√∑nj=1 VarX(j)

i

(‡)

Page 83: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

10.2. MONOTONICITY TESTING LOWER BOUND: WRAPPING UP 75

Proof of (†) using (‡). Note that in (†), S = Qσ is the sum of the σi · (ith column of Q)’s, which areindependent q-dimensional vector-valued random variables (likewise for T = Qν). σi·(ith column of Q)is a q-dim independent vector-valued random variables. So

S ∼= GS , T ∼= GT

by our multidimensional Berry–Esseen theorem. But as the means and covariance matrices of thesetwo Gaussians match (because – as we will prove momentarily – ES = ET and CovS = CovT ), weget GS ≡ GT and by the triangle inequality

∀O, ∀r > 0, |Pr[S ∈ O ]− Pr[G ∈ O ]| ≤ 2 · (RHS of (‡)) .

Hence, it only remains to check the expectations and covariance matrices of S and T do match: wehave

S = Qσ = X(1) + · · ·+X(n), where X(j) def= σj ·Q∗,j

T = Qν = Y (1) + · · ·+ Y (n), where Y (j) def= νj ·Q∗,j

(Q∗,j ∈ Rq denoting the jth column of Q); and it is not hard to see that the expectations are equaltermwise, i.e. EX(j) = EY (j) for all j ∈ [n]:

EX(j) = 12 · 1 ·Q∗,j + 1

2 · 3 ·Q∗,j = 2Q∗,j

EY (j) = 110 · (−1) ·Q∗,j + 9

10 ·73 ·Q∗,j = 2Q∗,j

so ES = ET . As for the covariance matrices, as for any random variable Z ∈ Rq by definition

(CovZ)k,` = E[(Zk − EZk) (Z` − EZ`)] = E[ZkZ`]− E[Zk] · E[Z`]

for all k, ` ∈ [q], one can check that, using the independence of the X(j)’s (resp. Y (j)’s),

∀j ∈ [n], (CovX(j))k,` = (Cov Y (j))k,` = Qk,jQ`,j

and hence CovX(j) = Cov Y (j); so that (again by independence) CovS =∑nj=1 CovX(j) =∑n

j=1 Cov Y (j) = CovT .This finally results in

∀r > 0, |Pr[S ∈ O ]− Pr[G ∈ O ]| ≤ O(τq3/2 logn

r+ q · r + τ√

n

)

(as VarX(j)i = VarY (j)

i = 1) which holds for any r. Taking r = (qn)1/4√logn, the RHS becomesO(q5/4(logn)1/2

n1/4

).

1From [Chen–Servedio–Tan’14], building upon a Central Limit Theorem for Earthmover distance of [VV11]

Page 84: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

76 LECTURE 10. MARCH 04, 2014

10.3 Testing Juntas

We will now describe and analyze an algorithm for testing juntas (recall that a k-junta is a Booleanfunction with at most k relevant variables).Let us write Jk for the class of all k-juntas over −1, 1n (where k can depend on n); from earlierlectures, we know that one can learn Jk with 2k logn (membership) queries. As we shall see,however, testing is significantly more query-efficient:

Theorem 111. There is an O(k log k + k

ε

)-query (one-sided) algorithm for testing Jk.

Remark 15. Next time, we will prove an Ω(k) lower bound for this problem, which shows thistheorem is roughly optimal.

10.3.1 Setup

Let S ⊆ [n] be a set of variables, and S = [n] \ S. For x, y ∈ −1, 1n, we write ySxS for the stringin −1, 1n which has for ith coordinate

(ySxS)idef=xi if i /∈ Syi if i ∈ S

.

Definition 112. Given f : −1, 1n → −1, 1 and S ⊆ [n], then Inf f (S) is defined as

Inf f (S) = 2 Prx,y∼U−1,1n

[ f(ySxS) 6= f(x) ] .

Remark 16. Intutively, this captures the (overall) influence of variables in S, as it amounts to“rerandomizing the variables in S, and seeing if f ’s value flips”. The factor 2 is added for consistencywith the usual definition of influence of a single variable, when S is taken to be a singleton i:indeed, in that definition, the variable xi is chosen to be flipped instead of rerandomized, whichchanges the quantity by a factor 2. (One could define this generalization of influence by flippingvalues as well, instead of rerandomizing them; but this turns out not to be equivalent, and generallymessier. Rerandomizing appears somehow to be the “right” thing to do)As a final remark, the notation Inf f (S) instead of InfS(f) is not insignificant, as it is meant tohighlight the fact that the quantity of interest will indeed be the set S, not the function f .

Lemma 113. For any2 S ⊆ [n], Inf f (S) =∑T⊆[n] : T∩S 6=∅ f(T )2.

2Observe that this is consistent with Inf i(f) =∑

T3i f(T )2, when S = i.

Page 85: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

10.3. TESTING JUNTAS 77

Proof. We have

Inf f (S) = 2 Prx,y

[ f(ySxS) 6= f(x) ]

= 2 · Ex,y[1f(ySxS) 6=f(x)]

= 2 · Ex,y[1− f(x)f(ySxS)

2

](rewriting)

= 1− Ex,y [f(x)f(ySxS)]

= 1− Ex,y

( ∑T⊆[n]

f(T )χT (x))( ∑

U⊆[n]f(U)χU (ySxS)

)= 1−

∑U,T⊆[n]

f(T )f(U)Ex,y [χT (x)χU (ySxS)] . (linearity)

If U ∩ S 6= ∅, for any fixed x we have Ey[χU (yS)xS ] = 0 and the corresponding terms in the sumvanish. Otherwise, fix U ⊆ S, (equivalently, U ∩ S = ∅):

Ex,y [χT (x)χU (ySxS)] = Ex,y [χT (x)χU (x)] = Ex[χT∆U (x)] =

1 if T = U

0 o.w.

so Ex,y [f(x)f(ySxS)] =∑U⊆S f(U)2; and plugging this in the expression of Inf f (S) yields

Inf f (S) = 1−∑U⊆S

f(U)2 =∑

U : U∩S 6=∅f(U)2.

Corollary 114 (Monotonicity and Subadditivity of Influence). For all S, T ⊆ [n] and any Booleanfunction f ,

Inf f (S) ≤ Inf f (S ∪ T ) ≤ Inf f (S) + Inf f (T ) (10.1)

10.3.2 Characterization of far-from-juntas functions

Proposition 115. Fix f : −1, 1n → −1, 1. If dist(f,Jk) > ε, then every J ⊆ [n] such that|J | ≤ k has Inf f (J) > 2ε.

Proof. By contrapositive, suppose there exists a subset J ⊆ [n] with |J | ≤ k and Inf f (J) ≤ 2ε: wewill show f is ε-close to some (explicit) junta over J . Unrolling the definition of Inf f (J), we have

Prx,y

[ f(x) 6= f(yJxJ) ] ≤ ε

Page 86: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

78 LECTURE 10. MARCH 04, 2014

or equivalentlyPr

x,y,z∼U−1,1n[ f(zJxJ) 6= f(yJxJ) ] ≤ ε. (10.2)

For all choices of xJ ∈ −1, 1J , let h(xJ) be the bit b ∈ −1, 1 such that pxJdef= Pry [ f(yJxJ) = b ] ≥

12 . Note that h : −1, 1n 7→ h(xJ) is then a J-junta (which outputs the “most likely value of fgiven xJ”).For a fixed setting of xJ , we have

Pry,z

[ f(yJxJ)) 6= f(zJxJ) ] = 2pxJ (1− pxJ ) ≥ 1− pxJ (by choice of pxJ )

but Pry [ f(yJxJ)) 6= h(xJ) ] = 1− pxJ by definition of h, so we get

∀x ∈ −1, 1n, Pry

[ f(yJxJ)) 6= h(xJ) ] ≤ Pry,z

[ f(yJxJ)) 6= f(zJxJ) ] . (10.3)

Rewriting with this in mind,

Prx

[ f(x) 6= h(x) ] = Prx

[ f(xJxJ)) 6= h(xJ) ] (10.4)

= Prx,y

[ f(yJxJ)) 6= h(xJ) ] (10.5)

≤(10.3)

Prx,y,z

[ f(yJxJ)) 6= f(zJxJ) ] (10.6)

≤(10.2)

ε (10.7)

where we used the fact that Eq. (10.3) holds for all values of x, and thus a fortiori for a randomone.

10.3.3 (Naive) Junta testing

Observation 116. If f(x) 6= f(y), some variable i ∈ [n] such that xi 6= yi must be relevant for f .Further, S = i ∈ [n] : xi 6= yi has “typically” |S| = Θ(n).

In a previous lecture (dealing with learning juntas), we used this observation to perform a binarysearch and find a relevant variable in log |S| queries. If willing to use logn queries, one can testjuntas by the same approach: repeatedly find relevant variables, reject if at least k + 1 are found.; see Naive–Junta–Test algorithm (Alg. 9)

Theorem 117. Naive–Junta–Test is an O(kε + k logn

)-query 1-sided ε-tester for Jk.

Proof.1-sided: Clear (cannot find more than k relevant variables if f ∈ Jk).

Page 87: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

10.3. TESTING JUNTAS 79

Algorithm 9 Naive–Junta–TestRequire: MQ(f), k, ε > 0

1: Initialize S ← [n], `← 02: for r def= 6(k + 1)/ε rounds do3: Draw x, y ∼ U−1,1n4: if f(x) 6= f(xSyS then5: Use binary search to find a relevant coordinate j6: Update S ← S \ j, `← `+ 17: If ` > k then halt and return REJECT.8: end if9: return ACCEPT. . Did not reject in any of the r rounds

10: end for

Query complexity: There are at most 2 · 6(k+1)ε queries made from Line 2 (two per iteration, to

get f(x) and f(ySxS)); and Step 2(a) uses log |S| ≤ logn queries for the binary search, and isexecuted at most k + 1 times (as every time it is a new relevant variable is found, and afterk + 1 variables the algorithm rejects in any case).

Correctness: we must prove that if dist(f,Jk) > ε, then Naive–Junta–Test rejects withprobability at least ≥ 2/3. Assume dist(f,Jk) > ε. By Proposition 115, forallJ ⊆ [n] with|J | ≤ k we have Inf f ([)J ] > 2ε, that is

Prx,y

[ f(x) 6= f(yJxJ) ] > ε.

Therefore, in every iteration of Step 2, we have a probability at least ε of entering the steps(a)–(b)–(c) and finding a new relevant variable. The expected number of rounds R beforek + 1 relevant variables are found satisfies

ER ≤ k + 1ε

and by Markov’s inequality R ≤ 6(k+1)ε with probability at least 5/6. Hence, with at least this

probability the algorithm will unveil k + 1 relevant variables and reject.

This is great – except for the logn part. Too expensive: we are not willing to spend anydependence on n, and as a matter of fact do not need to find the relevant variables: it is enough tojust infer their existence (and “set them aside” not to doublecount any in future searches).

This is what the Junta–Test (see Alg. 10) does: the key is to make a random partition of thevariables, in a number of buckets independent of n; and treat each bucket as a “(meta)variable”.

Page 88: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

80 LECTURE 10. MARCH 04, 2014

We will deal with a random partition of [n]:

I = I1, I2, . . . , Is

where the Ij ’s are disjoint and⋃sj=1 IJ = [n], built as follows: independently, each of the n variables

xi is put in a random element of I.

Definition 118. Fix a partition I of [n], and f : −1, 1n → −1, 1. f is said to be a k-partjunta with relation to I if there exist Ii1 , . . . , Iik ∈ I such that all relevant variables of f are inIi1 ∪ · · · ∪ Iik .

We say f ε-violates being a k-part junta (with relation to I) if every J that is a union of kelements of I has Inf f (J) > 2ε (i.e., Prx,y [ f(x) 6= f(yJxJ) ] > ε).

This concept has a natural connection with the notion of junta-hood, but with “chunks of variables”playing the role of variables. The following lemma generalizes the characterization of Proposition 115(which was key to the analysis of Naive–Junta–Test) to k-part juntas:

Lemma 119 (Main Lemma (Analogue of Proposition 115)). Let I be a random partition of [n]into s

def= 1020k9

ε5 sets, as in Junta–Test, and suppose dist(f,Jk) > ε. Then, with probability atleast 5/6 over the choice of I, f ε/2-violates being a k-part junta with relation to I.

Using this lemma, one can getHW Problem the correctness of Junta–Test via the same proof that gaveus correctness of Naive–Junta–Test from Proposition 115 (left as a homework problem (Exer-cise 124)):

Theorem 120 (Main Theorem). Junta–Test is an O(kε + k log k

)-query 1-sided ε-tester for Jk.

(Proof of Lemma 119 during next lecture.)

Page 89: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

Lecture 11

April 9, 2014

11.1 Overview

11.1.1 Last Time

• Completed Ω(n1/5

)lower bound against non-adaptive monotonicity testers.

• The naıve junta test and its analysis.

11.1.2 Today

• The actual O(k log k + k/ε)-query junta test and its analysis.• An Ω(k) lower bound for testing juntas (even to 0.49-test).

Relevant Readings

• Eric Blais. Testing juntas nearly optimally. [Bla09]• Hana Chockler and Dan Gutfreund. A lower bound for testing juntas. [CG04]

11.2 The actual junta test

Definition 121. Fix any partition I of [n] and any f : −1, 1n → −1, 1. We say f is a k-partjunta w.r.t. I if the union of at most k parts in I together contain all the relevant coordinates of f .We say f ε-violates being a k-part junta w.r.t. I if Inf f (J) > 2ε (i.e. Px,y[f(x) 6= f(yJxJ)] > ε)for every J that is the union of at most k parts of I.

Next we recall the actual junta test, which is very similar to naıve one given in the previouslecture, only except using subsets of variables I1, . . . , Is ⊆ [n] in place of singleton variables (forclarity, the changes between the two testers are highlighted in blue in Alg. 10).

81

Page 90: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

82 LECTURE 11. APRIL 9, 2014

Algorithm 10 Junta–TestRequire: MQ(f), k, ε > 0

1: Initialize S ← [n], `← 0, and set s def= 1020k9/ε5

2: Randomly partition [n] into I = I1, . . . , Is (i.e. assign each i ∈ [n] uniformly to one of the ssets I)

3: for r def= 12(k + 1)/ε rounds do4: Draw x, y ∼ U−1,1n5: if f(x) 6= f(xSyS then6: Use binary search on S to find a block Ij ∈ I that contains a relevant variable (by flipping

all the variables in an entire block at a time)7: Update S ← S \ Ij , `← `+ 18: If ` > k then halt and return REJECT.9: end if

10: return ACCEPT. . Did not reject in any of the r rounds11: end for

We also restate a useful property of set influence, from last time (Corollary 114):

Corollary 122 (Monotonicity and Subadditivity of Influence). For all S, T ⊆ [n] and any Booleanfunction f ,

Inf f (S) ≤ Inf f (S ∪ T ) ≤ Inf f (S) + Inf f (T ) (11.1)

Recall also that the key ingredient in the correctness proof of the naıve algorithm from theprevious lecture was a lemma showing that if f is ε-far from Jk then Inf f (J) > 2ε for every |J | ≤ k.We will need an analogous lemma to analyze the actual Junta–Test algorithm, for sets J that arethe union of k blocks I1, . . . , Ik ∈ I.

Lemma 123 (Main Lemma). Let I = I1, . . . , Is be a random partition of [n] into s = 1020k9/ε5

parts. Suppose that f : −1, 1n → −1, 1 has dist((, f),Jk) > ε. Then f ε/2-violates being ak-part junta w.r.t. I with probability at least 5/6 over the random choice of I.

Exercise 124.HW Problem Prove that Lemma 123 implies the main testing result: Junta–Test usesO(k log k + k/ε) queries and is a 1-sided ε-tester for whether f ∈ Jk.

11.3 Proof of the main lemma

Let f : −1, 1n → −1, 1 be ε-far from Jk. Recall the definition of influence: for every J ⊆ [n],

Inf f (J) =∑

T :J∩T 6=∅

f(T )2 = 1−∑

S:J∩S=∅

f(S)2 = 1−∑S⊆J

f(S)2.

Page 91: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

11.3. PROOF OF THE MAIN LEMMA 83

Our goal is to show that with probability at least 5/6 over the choice of I,∑S⊆J

f(S)2 < 1− ε (11.2)

for every J that is the union of at most k parts of I. (As a sanity check to see that the lemma isat least plausible, consider f = PAR(x1, . . . , xk+1), which is 1/2-far from every k-junta. With veryhigh probability all k + 1 variables end up in different parts in a random partition I, and so indeedwe have that

∑S⊆J f(S)2 = 0 for any J that is the union of at most k parts.)

Recall that our goal is to upper bound by 1− ε the total of f(S)2 summed across all subsetsS ⊆ J . The proof analyzes different sets S ⊆ J in different ways:

(1) the set of “big S’s” ; Fourier weight ≤ ε/2, with probability at least 17/18;(2) the set of “small S’s entirely of special variables (players)” ; Fourier weight ≤ 1− 2ε, with

probability at least 17/18;(3) the set of “small S’s not entirely of players” ; Fourier weight ≤ ε/2, with probability at least

17/18so that in total we get, with probability at least 5/6, a Fourier weight at most 1− ε in total (acrossthese three categories). We begin with sets S that have large magnitude:

11.3.1 Big sets S ⊆ J

For any J ⊆ [n], define BJ = S ⊆ J : |S| > 2k . We will show these large sets S are very likelyto get “broken up” under I into more than k parts, and so their contributions to (11.2) is small.

Lemma 125. With probability at least 17/18 over the choice of I, for every J that is the union ofat most k parts of I we have ∑

S∈BJ

f(S)2 ≤ ε

2 .

Proof. We say that a set T ⊆ [n] is k-covered by a partition I, written T k I, if there are k partsIi1 , . . . , Iik in I such that T ⊆ Ii1 ∪ · · · ∪ Iik . Fix S ⊆ [n] with |S| > 2k, and note that S k I iff allelements of S are “sent” to k or fewer parts. Therefore

PI [S k I] ≤(s

k

)(k

s

)2k+1≤(es

k

)k (ks

)2k+1= ek

(k

s

)k+1 ε

36 ,

where we have used our choice of s = 1020k9/ε5 for the final bound. It follows that the expectedtotal Fourier weight on all large cardinality sets S that are k-covered by I is upper bounded by:

EI

∑SkI|S|>2k

f(S)2

=∑|S|>2k

f(S)2 · PrI

[S k I ] ≤ 1 · ε36 .

Page 92: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

84 LECTURE 11. APRIL 9, 2014

We conclude that EI[∑

S∈BJ f(S)2]≤ ε/36 for every J that is the union of at most k parts, and

hence by Markov’s inequality∑S∈BJ f(S)2 > ε/2 with probability at most 1/181.

11.3.2 Small sets S ⊆ J

Next we handle sets S ⊆ J such that |S| ≤ 2k. For this we will need a definition that lets usconsider the influence of a set T on f in more detailed way. Recall that Inf f (T ) def=

∑U∩T 6=∅ f(U)2.

Definition 126 (Low-order influence of a set). Given f , T ⊆ [n], and a value k, the influence oforder ≤ k of T on f is

Inf≤kf (T ) def=∑

U∩T 6=∅|U |≤k

f(U)2,

and the influence of order > k of T on f is

Inf>kf (T ) def=∑

U∩T 6=∅|U |>k

f(U)2.

Note that Inf≤kf (T ) + Inf>kf (T ) = Inf f (T ).

For singleton sets T = i we write Inf f (i) as shorthand for Inf f (i), and note that then

Inf≤kf (i) =∑S3i|S|≤k

f(S)2.

We consider two cases, depending on whether S ⊆ J consists entirely of variables with large low-orderinfluence. First, observe that Parseval’s identity implies that the total low-order influence of allvariables is at most k:

Fact 127 (Low–Order Influence). For all f : −1, 1n → −1, 1,∑i∈[n] Inf≤kf (i) ≤ k.

Proof. ∑i∈[n]

Inf≤kf (i) =∑i∈[n]

∑i∈S|S|≤k

f(S)2 ≤ k ·∑|S|≤k

f(S)2 ≤ k.

This in turns shows, by an averaging/Markov–like argument, that we cannot have many variableswith large low-order-influence:

1As a side remark, what we actually proved is somewhat stronger than the lemma: we bounded (with highprobability) the sum over sets S that each belong some some BJ , but not necessarily the same for each S.

Page 93: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

11.3. PROOF OF THE MAIN LEMMA 85

Corollary 128. For all θ > 0,∣∣∣ i ∈ [n] : Inf≤kf (i) > θ

∣∣∣ ≤ k/θ.Fix now

θdef= 10−9 ε

2

k4 log kε,

our threshold for “large” low-order influence, and define

Hdef=i ∈ [n] : Inf≤2k

f (i) > θ

so that |H| ≤ 2k/θ. We think of elements of H as players, since none of the variables in a juntahave any high-order influence, and the ones that “matter” have large low-order influence. Given J ,we define

HJ = S ⊆ J : |S| ≤ 2k, S ⊆ J ∩H LJ = S ⊆ J : |S| ≤ 2k, S 6⊆ J ∩H ,

and note that BJ ∪HJ ∪ LJ = S : S ⊆ J . We showed in Lemma 125 that the contribution ofBJ to the sum in (11.2) is at most ε/2; in the next two sections we will argue that the contributionof HJ is at most 1− 2ε, and that of LJ at most ε/4.

Small sets S that consist entirely of players

The intuition behind the proof of this case is that the players (at most 2k/θ of them) are likely toall get split up in a random partition I. Hence any S that is contained within a union of at mostk parts and consists entirely of players has size at most k. Since f is far from a junta, the totalFourier weight on such sets (indeed on all sets of size at most k) cannot be too close to 1. Formally,we will show:

Lemma 129. With probability at least 17/18 over the choice of I, for every J that is the union ofat most k parts of I we have ∑

S∈HJ

f(S)2 ≤ 1− 2ε.

Proof. Recall that |H| ≤ 2k/θ, and so

PI [some Ii ∈ I has ≥ 2 elements of H] ≤(|H|2

)·(s

1

)· 1s2 ≤

(2kθ

)2· 1s≤ 1

18 .

That is, with probability at least 17/18 every part Ii has at most 1 element of H. Conditioning onthis event occurring, every J that is the union of at most k parts has |J ∩H| ≤ k. Since f is ε-farfrom being a junta, we have that Inf f (J ∩H) > 2ε, and so indeed∑

S∈HJ

f(S)2 =∑

S : S⊆J∩Hf(S)2 < 1− 2ε.

Page 94: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

86 LECTURE 11. APRIL 9, 2014

Small sets S that contain at least one non-player

Finally it remains to handle LJ , small sets (of size at most 2k) that contain at least one non-player.Intuitively, the presence of a non-player makes their contribution small. (Recall that if a function gis a junta and S contains an irrelevant variable, then g(S) = 0.) Formally, we prove:

Lemma 130. With probability at least 17/18 over the choice of I, for every J that is the union ofat most k parts of I we have ∑

S∈LJ

f(S)2 ≤ ε

4 .

Proof. We will show that with probability 17/18, every Ii ∈ I has

Inf≤2kf (Ii \H) ≤ ε

4k . (11.3)

Given this, every J that is the union of at most k parts satisfies∑S∈LJ

f(S)2 =∑S⊆J|S|≤2kS 6⊆J∩H

f(S)2 = Inf≤2kf (J \H))

≤ k ·maxi∈J

Inf≤2kf (Ii \H) (Subadditivity)

≤ k · ε4k = ε

4 ,

giving the lemma.Fix i ∈ [s]. For every j ∈ [n] let Xj be random variable (randomness is over the choice of I)

Xj =

Inf≤2kf (j) if j ∈ Ii \H

0 otherwise.

Note that if j is a player (i.e. j ∈ H) then Xj is always 0 regardless of I. If j a non-player(i.e. Inf≤2k

f (j) is small), the value of Xj is either 0 or Inf≤2kf (j) depending on whether j lands in Ii.

Exercise 131.HW Problem Show that

EI

∑j∈[n]

Xj

≤ 2ks<

ε

8k

and that

PrI

∑j∈[n]

Xj ≥ E

∑j∈[n]

Xj

+ ε

8k

≤ 118s. (Hint: Chernoff bound)

Page 95: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

11.4. TESTING JUNTAS: A LOWER BOUND 87

These two items guarantee that, with probability at least 1− 118s , one gets

∑j∈[n] Xj ≤ ε

4k . Takinga union bound over all i ∈ [s] completes the proof of Lemma 130.

Combining Lemmas 125,129, and 130, we get that with probability at least 5/6 over the choiceof I, every J that is the union of at most k parts of I satisfies

∑S⊆J f(S)2 ≤ 1− ε, and the proof

of Lemma 123 (and therefore the proof of correctness of Juna-Test) is complete.

11.4 Testing juntas: a lower bound

Our next theorem (and the last Boolean function result of the course!) is a strong lower boundshowing that the query complexity of Junta–Test is essentially optimal, even for adaptive testers:

Theorem 132. Any algorithm, adaptive or non-adaptive, that tests the class Jk to accuracy ε = 0.49must make at least k/4 queries.

The idea of the proof is to show this boils down to “finding a needle in a haystack”. Althoigh wewill not directly apply Yao’s principle, we consider two distributions over functions:• DNo: A draw fNo ∼ DNo is a uniform random (k + 1)-junta over the first (k + 1) variablesx1, . . . , xk+1.• DYes: A draw fYes ∼ DYes is a uniform random k-junta over the first (k + 1) variablesx1, . . . , xk+1. Equivalently, fYes is drawn from DYes by first selecting a uniform randomindex i ∈ [k + 1], and then setting fYes to be a uniform random k-junta over the k variablesx1, . . . , xi−1, xi+1, . . . , xk+1.

Clearly DYes is supported entirely on k-juntas; it is also straightforward to verify that DNo issupported almost entirely on functions that are constant-far from k-juntas:

Exercise 133. HW ProblemFor all sufficiently large k, with probability at least 0.999 a random functionfNo ∼ DNo satisfies dist((,f)No,Jk) > 0.49.

Proof of Theorem 132. The intuition is that to distinguish between the two cases, one needs tocheck whether the first k + 1 first variables are relevant, ruling them out one by one. To make thisprecise, let A be any q-query algorithm (which may be adaptive), where q < k/4. The transcript ofA’s execution on a function f is the complete list

(x(1), f(x(1))), . . . , (x(q), f(x(q)))

of query-responses that A makes and receives. Let TNo be the transcript of A’s execution on arandom fNo ∼ DNo, and TYes be the transcript of A’s execution on a random fYes ∼ DYes. Note thatboth TNo and TYes are length-q transcript-valued random variables, where the randomness is overboth the algorithm A’s coin tosses and the draw of the function from the corresponding distribution.

Page 96: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

88 LECTURE 11. APRIL 9, 2014

We will show that with probability at least 3/4, T_Yes is distributed exactly according to T_No. This suffices since it implies that

| Pr_{f_No∼D_No}[ A accepts f_No ] − Pr_{f_Yes∼D_Yes}[ A accepts f_Yes ] | ≤ 1/4,

which rules out the possibility that A tests J_k to accuracy ε = 0.49. (If A were such a tester, the difference in these probabilities would have to be at least 1/3 − 0.001 > 1/4, since with probability at least 0.999 a draw f_No ∼ D_No is ε-far from J_k.)

The following rule specifies how the label f_No(x(ℓ)) of the ℓ-th query string x(ℓ) in a draw from T_No is distributed:
• If the first k + 1 bits (x(ℓ)_1, . . . , x(ℓ)_{k+1}) of x(ℓ) perfectly match those of an earlier query x(r), then f_No(x(ℓ)) is the same response f_No(x(r)) that was given earlier.
• If the first k + 1 bits (x(ℓ)_1, . . . , x(ℓ)_{k+1}) of x(ℓ) do not perfectly match those of any earlier query, then f_No(x(ℓ)) is a fair coin toss independent of everything previous.

Likewise, the following rule specifies how the label f_Yes(x(ℓ)) of the ℓ-th query string x(ℓ) in T_Yes is distributed. First, a uniform i ∈ [k + 1] is sampled and fixed once and for all before the first query. Then, for each query x(ℓ):
• If the first k + 1 bits (x(ℓ)_1, . . . , x(ℓ)_{k+1}) of x(ℓ) perfectly match those of an earlier query x(r), then f_Yes(x(ℓ)) is the same response f_Yes(x(r)) that was given earlier.
• If the first k + 1 bits (x(ℓ)_1, . . . , x(ℓ)_{k+1}) of x(ℓ) do not perfectly match those of an earlier query x(r), we consider three cases:
  1. if x(ℓ) and x(r) differ on at least two of the first k + 1 coordinates, then f_Yes(x(ℓ)) is a fair coin toss independent of everything previous;
  2. if x(ℓ) and x(r) differ only on the j-th coordinate among the first k + 1 for some j ≠ i (we say x(ℓ) and x(r) form a j-twin), then f_Yes(x(ℓ)) is again a fair coin toss independent of everything previous, since coordinate j is relevant for f_Yes;
  3. if x(ℓ) and x(r) differ only on the i-th coordinate among the first k + 1 (we call these two queries x(ℓ) and x(r) an i-twin), then f_Yes(x(ℓ)) is the same response f_Yes(x(r)) that was given earlier, since f_Yes does not depend on coordinate i.

Note that case (3) is the only case in which T_No is distributed differently from T_Yes: in T_No the bits f_No(x(ℓ)) and f_No(x(r)) agree with probability 1/2 (since they are each independent fair coin tosses), whereas in T_Yes the responses agree with probability 1. In other words, T_No and T_Yes are distributed identically if case (3) does not occur in any of the q queries of A.

It remains to bound the probability that T_Yes contains two queries x(ℓ) and x(r) that comprise an i-twin. We will use the fact that q queries can contain a j-twin for at most q − 1 values of j. (To see this, consider the q-vertex graph G with one vertex corresponding to each of the q queries


of A, and an edge between two vertices x(ℓ) and x(r) if they form a j-twin for some j. Let T be a spanning forest of G. Then T contains at most q − 1 edges; moreover, every coordinate j realized by some twin is also realized on an edge of T: a j-twin edge outside T closes a cycle with edges of T, and along any cycle each coordinate is flipped an even number of times, so j must be flipped on some T-edge of that cycle as well.) Since i ∈ [k + 1] is distributed uniformly, the probability that case (3) occurs within q < k/4 queries is at most (q − 1)/(k + 1) < 1/4 by a union bound, and the proof is complete.
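To see the counting fact in this last step concretely, here is a small Python sketch (illustrative only, not part of the notes) that computes, for a list of query strings, the set of coordinates j among the first k + 1 on which some pair of queries forms a j-twin, and empirically checks that q queries never yield more than q − 1 distinct twin coordinates.

    import itertools
    import random

    def twin_directions(queries, k):
        """Coordinates j (0-indexed, among the first k+1) such that some pair of
        query strings differs in exactly coordinate j on its first k+1 bits,
        i.e. forms a j-twin."""
        dirs = set()
        for x, y in itertools.combinations(queries, 2):
            diff = [j for j in range(k + 1) if x[j] != y[j]]
            if len(diff) == 1:
                dirs.add(diff[0])
        return dirs

    # Empirical check of "q queries contain j-twins for at most q - 1 values of j",
    # on random queries clustered around a common prefix so that twins are likely.
    random.seed(0)
    k, n, q = 10, 20, 5
    for _ in range(2000):
        base = [random.randint(0, 1) for _ in range(n)]
        queries = []
        for _ in range(q):
            x = list(base)
            if random.random() < 0.5:
                x[random.randrange(k + 1)] ^= 1   # flip one coordinate of the prefix
            queries.append(tuple(x))
        assert len(twin_directions(queries, k)) <= q - 1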


Lecture 12

April 16, 2014: Property testing for graphs

12.1 Overview

12.1.1 Last Time: end of Boolean function testing

• Finished junta testing with an O(k log k + k/ε)-query algorithm;
• Completed the Ω(k) lower bound for junta testing.

12.1.2 Today: Graph Property testing

• Basics; adaptive vs. nonadaptive testers;
• O(1/ε³)-query algorithm for testing bipartiteness, and some generalizations.

12.1.3 Next Time

• △ (triangle)-freeness;
• Regularity (Szemeredi regularity lemma).

Relevant readings
• Goldreich, Goldwasser, and Ron. Property testing and its connection to learning and approximation. [GR98]


12.2 Basics of Graph Property Testing

Some models of representing graphs are better-suited to dense graphs, and other models to sparse graphs. Today, we will be focusing on the former:

Adjacency Matrix Model All our graphs will be, unless specified otherwise, denoted by G = (V, E), where the vertex set is V := {1, . . . , N} (and N is to be thought of as a huge, insanely big integer). They will be undirected simple graphs (no self-loops nor multiple edges), and will be represented by their adjacency matrix:

Definition 134 (Adjacency matrix). The adjacency matrix of a graph G = (V, E) is a symmetric matrix A ∈ {0, 1}^{N×N} such that

A_{ij} = 1 if (i, j) ∈ E, and A_{ij} = 0 otherwise.

Equivalently, it can also be seen as a function f_G : [N] × [N] → {0, 1} where f_G(u, v) = 1 if (u, v) ∈ E, and 0 otherwise.

For two graphs G1 = (V1, E1), G2 = (V2, E2) such that |V1| = |V2| = N, we consider the following notion of distance:

Definition 135 (Distance between graphs). Given two N-vertex graphs G1, G2,

dist(G1, G2) := |E1 Δ E2| / N² = (1/N²) · |{ (u, v) ∈ [N] × [N] : (u, v) ∈ (E1 \ E2) ∪ (E2 \ E1) }|

where Δ denotes the set symmetric difference.

(The distance can also be normalized by (N choose 2) instead, which is essentially the same since (N choose 2) ∼ N²/2 as N → ∞.)
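As a quick illustration of Definition 135 (not from the notes), here is a minimal Python sketch computing this distance for graphs given as 0/1 NumPy adjacency matrices; the tree-versus-empty-graph example of Remark 17 below shows how small the distance between two sparse graphs is.

    import numpy as np

    def graph_dist(A1, A2):
        """dist(G1, G2): fraction of the N^2 ordered pairs (u, v) on which the
        two symmetric 0/1 adjacency matrices disagree (Definition 135)."""
        N = A1.shape[0]
        return np.sum(A1 != A2) / N**2

    N = 1000
    star = np.zeros((N, N), dtype=int)
    star[0, 1:] = star[1:, 0] = 1       # a star: a tree with N - 1 edges
    empty = np.zeros((N, N), dtype=int)
    print(graph_dist(star, empty))      # 2(N-1)/N^2, about 0.002: o(1) although the graphs are "very different"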

Remark 17. Note that this is indeed a good representation and distance measure for dense graphs only, as for sparse graphs the adjacency matrix is still of size N², but only has very few (that is, o(N²)) non-zero entries; so in particular dist(G1, G2) = o(1) as soon as both graphs are sparse, and our definition of distance is meaningless.
For instance, let G1 be a tree (hence containing N − 1 edges) and G2 be the empty graph on N vertices: although the two graphs are intuitively very different, dist(G1, G2) = O(1/N).

Definition 136 (Reasonable properties). A property P of N-vertex graphs is a set of N-vertex graphs. Such a P is said to be reasonable if it is closed under renaming (permuting) of vertices, that is,

G = ([N], E) ∈ P  ⟹  ∀π ∈ S_N, G_π = ([N], E_π) ∈ P

where E_π := { (π(u), π(v)) : (u, v) ∈ E }, and S_N is the set of all permutations of [N].


Example 137 (Reasonable properties). The following properties are examples of reasonable properties:
• P, the set of all connected N-vertex graphs;
• P, the set of all N-vertex graphs with maximum degree at most 20.

Example 138 (Unreasonable property). Here is an example of an unreasonable property: P := { G : ∀p prime in [N], deg(p) = 0 in G }.

In these lectures, all properties we study will be reasonable. Finally, the distance from a graph G to a property P is then simply defined as:

dist(G, P) := min_{G′∈P} dist(G, G′).

Definition 139 (Property Testing in the Adjacency Model). Fix a reasonable property P of N-vertex graphs. A property testing algorithm for P is an algorithm A which is given ε > 0 and query access to f_G; A makes q = q(ε, N) queries to f_G, and outputs either ACCEPT or REJECT. Furthermore, A must satisfy the following:
• if G ∈ P, Pr[ A outputs ACCEPT ] ≥ 2/3; (completeness)
• if dist(G, P) > ε, Pr[ A outputs REJECT ] ≥ 2/3. (soundness)
If A accepts every G ∈ P with probability 1, then it is said to have one-sided error.

Definition 140 (Adaptive and nonadaptive testers). If A (randomly) chooses all its future queries (u(1), v(1)), . . . , (u(q), v(q)) before making any query to f_G, then A is nonadaptive. Otherwise, it is adaptive.

12.3 Adaptiveness is not (that) helpful

For Boolean functions, we saw that a q-query adaptive algorithm could be transformed into a 2^q-query nonadaptive algorithm, and that this exponential blowup was sometimes necessary. The following theorem provides an analogous statement for graphs, but involving only a quadratic blowup:

Theorem 141. Fix a reasonable graph property P which has a q(ε, N)-query (adaptive) tester. Then there exists an O(q²)-query nonadaptive tester for P. Moreover, if the original adaptive algorithm had one-sided error, then so does the nonadaptive one.

Sketch of the proof. The key observation is that any query made to the adjacency matrix either involves two vertices seen previously, or involves at least one "new" vertex. But morally, every edge containing an unseen vertex is in some sense generic, because the property is reasonable: any unseen vertex is, to the tester, indistinguishable from any other unseen one, as the property should be invariant under renaming of the vertices.


More precisely, we say that the first i queries (u(1), v(1)), . . . , (u(i), v(i)) touch vertex v if v ∈ {u(1), v(1), . . . , u(i), v(i)}. Consider the (i + 1)st query that the original tester T makes; it is either between:
(i) two touched vertices;
(ii) one touched and one untouched vertex; or
(iii) two untouched vertices.
But, as handwaved above, "all untouched vertices are equivalent" due to the permutation property: running T on G_π should yield the same answer as running the tester on G, for any π ∈ S_N. We use this to describe a nonadaptive version T′ of the original tester T: it simulates the (i + 1)st query of T as follows.
• Uniformly select (u1, u2) amongst all pairs of untouched vertices, query all 1 + 2(2i) pairs (u1, u2), (u1, v), (u2, v) for v touched, and record the answers.
Note that the above is carried out in an entirely non-adaptive way. Having done this, the tester T′ has all the information it needs to simulate the answer that T would receive to each of its queries, if T were run on a graph G_π for a uniform random π ∈ S_N. (Note that after T's (i + 1)st query, T′ has explored all possible edges between 2(i + 1) vertices.) The random π is effectively built "on-the-fly". An answer for G_π is also a correct answer for G, and T′ makes in total (2q+2 choose 2) = O(q²) queries.

Remark 18. This unit on graph property testing has less of an algorithmic flavor than the Boolean function unit did; it is rather a unit about (hard) structural results, with these results then yielding simple algorithms.

12.4 Testing bipartiteness

Definition 142 (Bipartiteness). A graph G = (V, E) is said to be bipartite if there exists a partition V1, V2 of V such that every e ∈ E has exactly one endpoint in V1 and the other in V2 (that is, E ⊆ (V1 × V2) ∪ (V2 × V1)). Equivalently, G is bipartite if and only if it has no odd-length cycle.

Fact 143 (Deciding bipartiteness is "easy"). Given an N-vertex, M-edge graph G, there is an O(M)-time algorithm (based on breadth-first search) to check whether G is bipartite.

This is good, but clearly not sublinear: what about testing instead of deciding?

Definition 144 (Violating edge and bad partition). Given a partition V1, V2 of V, an edge e = (u, v) is called a violating edge with respect to V1, V2 if u, v ∈ V1 or u, v ∈ V2. The partition V1, V2 is said to be ε-bad if there are at least εN² such violating edges; otherwise, V1, V2 is ε-good.

One then has that a graph G is ε-far from bipartite if and only if every partition V1, V2 of V is ε-bad. Thus, a natural way to try and test whether G is bipartite is given as Algorithm 11 below.


Algorithm 11 Test–Bipartiteness
Require: Access to f_G for G = (V, E), integer m > 0
1: Uniformly and independently pick a set R ⊆ [N] of m vertices. ▷ Pick each with probability m/N.
2: Query all (m choose 2) pairs of vertices of R to get the induced subgraph G_R.
3: Run breadth-first search to check if G_R is bipartite.
4: if G_R is bipartite then return ACCEPT
5: else return REJECT
6: end if

A couple of immediate remarks:
• if G is bipartite, the above algorithm accepts with probability 1, so completeness is not an issue: for the rest of the analysis, we will assume that dist(G, Bip) > ε;
• if m is large enough (as a function of ε and N), then the above algorithm rejects every graph that is ε-far from bipartite with probability at least 2/3 (e.g. for m = Θ(N), clearly).
Question: how small can m be for the algorithm to work?
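For concreteness, here is a minimal Python sketch of Algorithm 11 (an illustration, not code from the notes); it assumes the adjacency oracle is available as a Python function f_G(u, v) returning 0/1, and checks bipartiteness of the sampled subgraph by the usual BFS 2-coloring.

    import random
    from collections import deque

    def test_bipartiteness(f_G, N, m, seed=None):
        """One-sided tester for bipartiteness, following Algorithm 11."""
        rng = random.Random(seed)
        R = rng.sample(range(N), m)
        # Query all pairs of sampled vertices to build the induced subgraph G_R.
        adj = {u: [] for u in R}
        for i, u in enumerate(R):
            for v in R[i + 1:]:
                if f_G(u, v):
                    adj[u].append(v)
                    adj[v].append(u)
        # BFS 2-coloring: G_R is bipartite iff no edge joins two same-colored vertices.
        color = {}
        for s in R:
            if s in color:
                continue
            color[s] = 0
            queue = deque([s])
            while queue:
                u = queue.popleft()
                for v in adj[u]:
                    if v not in color:
                        color[v] = 1 - color[u]
                        queue.append(v)
                    elif color[v] == color[u]:
                        return "REJECT"      # odd cycle found in G_R
        return "ACCEPT"

As in the analysis that follows, the only question is how large the sample size m has to be.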

12.4.1 Naive analysis

Fix any partition (V1, V2) of V: there exist at least εN² violating edges with respect to V1, V2. Take m := (2 ln(1/δ))/ε (to be viewed as m/2 = (1/ε) ln(1/δ) many pairs of vertices) for some δ to be determined shortly. The probability that none of these pairs contains a violating edge is at most

(1 − ε)^{(1/ε) ln(1/δ)} ≤ δ.

This does look promising – yet, there is a catch: indeed, by querying a small number of edges, we can rule out one fixed partition with failure probability δ; but there are 2^{N−1} partitions. A union bound would thus require δ = O(1/2^N), yielding m = Ω(N/ε), which we do not want to pay.

12.4.2 Harder, stronger (better) analysis

Theorem 145. For m := O(log(1/ε)/ε²), the algorithm Test–Bipartiteness rejects any G with dist(G, Bip) > ε with probability at least 2/3, i.e. it is a legit tester for bipartiteness with query complexity O(1/ε⁴).

This O(1/ε⁴) query complexity, immediate from the definition of the algorithm (Alg. 11), can actually be brought down to O(1/ε³), as the analysis will show (see Remark 19).

Proof. The algorithm can be viewed as working in two stages:
• View the sample set R of m vertices as U ∪ S, where U contains the first m1 := (1/ε) · Θ(log(1/ε)) vertices and S the next m2 := (1/ε²) · Θ(log(1/ε)) ones;


• There are only 2^{m1−1} possible partitions U1, U2 of U. Each such partition defines a partial partition of V: to be consistent with U1, U2, all neighbors of U1 must go to one side (the one containing U2), and all neighbors of U2 in the other.

We will show that with high probability, most high-degree vertices of V do neighbor U, and so are "forced" by U1, U2. This will let us argue that S "reveals" that R does not satisfy bipartiteness with respect to U1, U2 (with high probability). The key to this argument is that there are only 2^{m1−1} possible partitions U1, U2 to rule out, independent of N.

Notation: Given a vertex v ∈ V, we let Γ(v) denote the set of neighbors of v. Given a set X ⊆ V, Γ(X) := ⋃_{v∈X} Γ(v) is then the set of all neighbors of X.

A partition U1, U2 of U defines a partial partition V1, V2 of V, where

V1 := U1 ∪ Γ(U2),    V2 := U2 ∪ Γ(U1).    (12.1)

The vertices outside the neighbor sets are not constrained.

Definition 146 (High-degree vertex). A vertex v ∈ V is said to have high degree if deg(v) > (ε/3)N. A subset U ⊆ V is then said to be good if at most (ε/3)N of the high-degree vertices of V do not have a neighbor in U.

The following lemma claims that most high-degree vertices of V neighbor (a random choice of) U:

Lemma 147. With probability at least 5/6 over the choice of a set U (of m1 vertices picked uniformly at random), U is good.

Proof. Fix a high-degree vertex v. Then Pr[ v ∉ Γ(U) ] ≤ (1 − ε/3)^{Θ((1/ε) log(1/ε))} ≤ ε/18. Let X be the number of high-degree vertices with no neighbor in U (i.e., not in Γ(U)). By linearity of expectation,

E[X] ≤ εN/18,

and Markov's inequality ensures that

Pr[ X ≥ εN/3 ] ≤ (εN/18)/(εN/3) = 1/6,

so that Pr[ X ≤ εN/3 ] ≥ 5/6.

From now on, we condition on U being good.

Definition 148. An edge e ∈ E disturbs a partition U1, U2 of U if both its endpoints lie in the same Γ(Ui).


Lemma 149. Suppose U is good. Then for any partition U1, U2 of U, at least (ε/3)N² edges of G disturb U1, U2.

Proof. Fix any partition U1, U2 of U, and any (real) partition V′1, V′2 of V completing the corresponding partial partition V1, V2 (defined as in (12.1)). By assumption, G is ε-far from Bip and thus there must be at least εN² violating edges with respect to V′1, V′2. We will upper bound the number of edges that do not disturb U1, U2.
(i) Since U is good, at most (ε/3)N high-degree vertices do not have a neighbor in U. These vertices touch in total at most (ε/3)N · N = (ε/3)N² edges (each of them has at most N neighbors).
(ii) The number of edges incident to all low-degree vertices is no more than N · (ε/3)N = (ε/3)N² (each of the at most N such vertices has at most (ε/3)N neighbors).
This implies that at least (ε/3)N² of the original (at least εN²) violating edges are in neither group (i) nor group (ii), i.e. are between two high-degree vertices both having a neighbor in U. These (ε/3)N² violating edges connect high-degree vertices in Γ(U); they either connect Γ(U1) to Γ(U1) or Γ(U2) to Γ(U2), hence disturbing U1, U2.

With this in hand, we are finally set to prove Theorem 145:

Proof of Theorem 145. We will show that G_R is bipartite with probability at most 1/3; only two "bad events" can make G_R end up bipartite:
Event 1: U is not good. Then assume the "worst" (G_R bipartite); by Lemma 147 this happens with probability at most 1/6.
Event 2: although U is good, there exists a partition U1, U2 of U such that G_R contains no edge disturbing U1, U2.
Let us bound the probability that, conditioned on U being good, Event 2 (⋆) occurs. Our last lemma ensures that there are at least (ε/3)N² disturbing edges for any partition of U, so that

Pr[ ⋆ ] ≤ Pr[ ∃ a partition U1, U2 of U such that none of these (ε/3)N² disturbing edges appears in the graph induced by S ]
       ≤ 2^{m1−1} · (1 − ε/3)^{m2/2}    (union bound over all partitions of U; pairing up the vertices of S into m2/2 pairs)
       ≤ 2^{m1} · (2^{−m1}/6) = 1/6    (by the choice of m1, m2).

Overall, G_R is bipartite (so the algorithm fails to reject) with probability at most 1/6 + 1/6 = 1/3.

Remark 19 (Improving the query complexity). As we just saw, the algorithm needs to query only:
• |U|² = O(1/ε²) pairs (for all possible edges within U);
• |U| · |S| = O(1/ε³) pairs (for all possible edges between U and S);
• |S|/2 = O(1/ε²) pairs (those of an arbitrary pairing of the vertices of S).
In total, Alg. 11 can therefore be implemented with query complexity O(1/ε³).

Summary of the idea: a "small sample" U forces the structure of the unseen part; this allows us to escape the exponential blowup (due to the number of possible partitions of U).

This analysis is from [GR98], a (100+)-page paper by Goldreich, Goldwasser and Ron which also describes several generalizations of this algorithm. (These will be discussed during the next lecture.)

Page 107: Sub-Linear Algorithms in Learning and Testing › ~rocco › Teaching › S14 › Scribe › lectures...ii Foreword Recently there has been a lot of glorious hullabaloo about Big Data

Lecture 13

April 24, 2014

13.1 Overview

13.1.1 Last Time

• Graph property testing for (dense) graphs with N nodes and access to the adjacency matrix (N sufficiently large);
• q-query adaptive testing implies an O(q²)-query non-adaptive tester;
• O(1/ε³)-query tester for bipartiteness (⋆).

13.1.2 Today

• Broad generalization of (⋆) to "general graph partition properties" (GGPT).
• Testing △-freeness with an O_ε(1)-query algorithm:
  – △-removal lemma
  – Szemeredi Regularity Lemma

Relevant readings
• Goldreich, Goldwasser, and Ron. Property testing and its connection to learning and approximation. [GR98]
• Szemeredi. Regular partitions of graphs. [Sze78]

13.2 poly(1/ε)-query testable graph properties

Besides bipartiteness, some of the graph properties that were shown in [GR98] to be testable with poly(1/ε) queries (and running time exponential in the number of queries) include the following:


k-colorability

Definition 150. Fix any integer k. A graph G = (V, E) is said to be k-colorable if there exists a (proper) k-coloring of G, that is, a mapping ϕ : V → [k] such that ϕ(i) ≠ ϕ(j) for every (i, j) ∈ E. In other terms, after assigning a color to each node, any two adjacent nodes i, j have different colors. Note that for k = 2, this is exactly bipartiteness.

Theorem 151. For any fixed k ≥ 2, there is a poly(k/ε)-query tester for k-colorability.


ρ-clique

Definition 152. Fix any ρ ∈ (0, 1). A ρ-clique of an N-vertex graph is a collection of ρN vertices containing all the edges within them.

Theorem 153. Let P := { G N-vertex graph : G has a ρ-clique }. Then there is a poly(1/ε)-query tester for P.


ρ-bisection

Definition 154. Fix any ρ ∈ (0, 1/4). A ρ-bisection of an N-vertex graph G = (V, E) is a partition of V into two subsets V1, V2 of size N/2 such that the number of edges crossing from V1 to V2 is at most ρN² (that is, V1, V2 define a balanced, ρ-sparse cut of G).


Theorem 155. Let P := { G N-vertex graph : G has a ρ-bisection }. Then there is a poly(1/ε)-query tester for P.

The similarity between these results is not fortuitous: it turns out they all fall into the same general setting, General Graph Partition Testing.

13.3 General Graph Partition Testing (GGPT)

Bipartiteness, k-colorability, having a ρ-clique or having a ρ-bisection are special cases of a more general family of properties, the class of "General Graph Partition Testing" (GGPT) properties, which all admit constant-query testers – a single "meta-algorithm" actually allows one to test any of these properties.

More specifically, a GGPT property is specified by an integer k (the number of "pieces", i.e. the size of the desired partition of the graph), as well as (a) size bounds for each of the k pieces and (b) edge density bounds for each pair of pieces.

Definition 156 (GGPT property). A General Graph Partition Testing property is specified by an integer k and a collection Φ of k + k² intervals in [0, 1]:

Φ = { [ℓ_i, u_i] }_{i∈[k]} ∪ { [ℓ_{i,j}, u_{i,j}] }_{i,j∈[k]}.

Given k, Φ, the property P_{k,Φ} is the set of all N-vertex graphs G = (V, E) for which there exists a k-way partition of V into V1 ∪ V2 ∪ · · · ∪ Vk satisfying
(i) ∀i ∈ [k], N·ℓ_i ≤ |V_i| ≤ N·u_i (the right density of vertices);
(ii) ∀(i, j) ∈ [k]², N²·ℓ_{i,j} ≤ |E(V_i, V_j)| ≤ N²·u_{i,j} (the right density of edges), where |E(V_i, V_j)| denotes the number of edges between V_i and V_j.

As an example, consider the k-colorability property, which can be rephrased as a GGPT property as follows:
• [ℓ_1, u_1] = [ℓ_2, u_2] = · · · = [ℓ_k, u_k] = [0, 1];
• [ℓ_{i,i}, u_{i,i}] = [0, 0] for all i ∈ [k];
• [ℓ_{i,j}, u_{i,j}] = [0, 1] for all i, j ∈ [k] with i ≠ j.
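To make Definition 156 concrete, here is a small Python sketch (illustrative, not from the notes). It stores an instance (k, Φ) as two dictionaries of intervals and checks whether one given k-way partition of a graph meets the constraints; the k-colorability instance above is used as the example. (The property itself, of course, asks whether some such partition exists.)

    def k_colorability_instance(k):
        """GGPT instance (k, Phi) for k-colorability: no constraint on part sizes,
        zero edge density inside each part, no constraint across parts."""
        vertex_bounds = {i: (0.0, 1.0) for i in range(k)}
        edge_bounds = {(i, j): ((0.0, 0.0) if i == j else (0.0, 1.0))
                       for i in range(k) for j in range(k)}
        return vertex_bounds, edge_bounds

    def satisfies(adj, parts, vertex_bounds, edge_bounds):
        """Does the k-way partition `parts` of the graph with 0/1 adjacency
        matrix `adj` satisfy the GGPT bounds?"""
        N = len(adj)
        for i, Vi in enumerate(parts):
            lo, hi = vertex_bounds[i]
            if not (lo * N <= len(Vi) <= hi * N):
                return False
        for (i, j), (lo, hi) in edge_bounds.items():
            e = sum(adj[u][v] for u in parts[i] for v in parts[j] if u != v)
            if i == j:
                e //= 2                      # internal edges were counted twice
            if not (lo * N**2 <= e <= hi * N**2):
                return False
        return True

    # A proper 2-coloring of the 4-cycle satisfies the 2-colorability instance:
    C4 = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
    vb, eb = k_colorability_instance(2)
    print(satisfies(C4, [[0, 2], [1, 3]], vb, eb))   # True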

Theorem 157 (Testing GGPT properties). Given any k and Φ, the property P_{k,Φ} is testable with (k/ε)^{O(k)} queries (and running time exponential in the query complexity).

Proof sketch. Using the same high-level idea as in the bipartiteness tester:
(1) First, draw a small set of vertices and query all pairs of these; call the resulting graph G′ (note that the number of possible partitions of G′ is exponential in |G′|).
(2) If G ∈ P_{k,Φ}, then some partition of G′ is good; this good partition induces a partial partition of G that will (approximately) satisfy the constraints.


(3) Like for bipartiteness, draw a 2nd sample and see if it complies with any partition of G′ (we "only" need to worry about exp(|G′|) many partitions).

13.4 Triangle-freeness

Definition 158 (△-freeness). Fix any graph G = (V, E). A triangle △ is a triple (i, j, k) ∈ V³ such that (i, j), (j, k), (k, i) ∈ E. G is said to be triangle-free (or △-free) if it does not contain any triangle.


Let P := { G N-vertex graph : G is △-free }. Then, the distance of a graph G from P is

dist(G, P) = (number of edges one needs to erase to kill all triangles) / N².

Algorithm 12 Test–△-Freeness
1: for s iterations do
2:    Randomly pick nodes v1, v2, v3 ∈ [N]
3:    Query (v1, v2), (v1, v3) and (v2, v3)
4:    return REJECT if the three edges all exist (i.e., there is a △)
5: end for
6: return ACCEPT (none of the s iterations found a △)
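A minimal Python sketch of Algorithm 12 (illustrative, not from the notes), again assuming an adjacency oracle f_G(u, v):

    import random

    def test_triangle_freeness(f_G, N, s, seed=None):
        """One-sided tester for triangle-freeness, following Algorithm 12:
        sample s triples and reject iff one of them spans a triangle."""
        rng = random.Random(seed)
        for _ in range(s):
            v1, v2, v3 = rng.sample(range(N), 3)
            if f_G(v1, v2) and f_G(v1, v3) and f_G(v2, v3):
                return "REJECT"
        return "ACCEPT"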

Analysis. First, it is easy to see that if a graph G is △-free, then Test–△-Freeness accepts with probability 1, and that it always makes at most 3s queries. Furthermore, if s is chosen big enough (e.g., N², N³, . . . ), Test–△-Freeness will work.

Question. How small can s be? More specifically, is s = O_ε(1) enough? Note that for O_ε(1) many queries to suffice, it must be the case that

G ε-far from △-free  ⇒  G has Ω_ε(1) · N³ triangles.

Fortunately, this is true; it is a corollary of the following lemma, which we will prove in the remaining part of this lecture:

Theorem 159 (△-Removal Lemma). For all ε > 0, there exists δ_ε > 0 such that any N-vertex graph G which is ε-far from △-free contains at least δ_ε N³ triangles.


(Given this, Test–△-Freeness works with s set to 10/δ_ε, as then (1 − δ_ε)^s ≪ 1/3.)

Proof.
• Consider first the very special case where G is an N-vertex α-dense random graph (i.e. any possible edge is in E independently with probability α). In such a graph, the probability that a triangle exists between 3 nodes v1, v2 and v3 is

Pr[ v1, v2, v3 form a △ ] = Pr[ (v1, v2) ∈ E ] · Pr[ (v1, v3) ∈ E ] · Pr[ (v2, v3) ∈ E ] = α³,

and by linearity the expected number of triangles is E[#△] = α³ · (N choose 3) (and one can show that the graph is Θ(α)-far from △-free). Thus, in this case, "δ_ε = α³" works for the statement of the △-Removal Lemma.
• However, we do not deal with random graphs here, but with arbitrary graphs. The key will be to argue that these graphs still present some structure, namely enough "regularity" – and that this regularity is roughly equivalent, for our purposes, to "behaving like a random graph".

Definition 160 (Density). Given disjoint sets X, Y ⊆ [N] of a graph G, the density d(X, Y) is defined as

d(X, Y) := e(X, Y) / (|X| · |Y|),

where e(X, Y) := |E(X, Y)| is the number of edges between X and Y.

Recall that a partition of [N] is a collection of disjoint subsets V1, V2, . . . , Vk such that ⋃_{i=1}^k V_i = [N].

Definition 161 (Regularity). Let A, B ⊆ [N] be disjoint. The pair (A, B) is said to be ε-regular if for all X ⊆ A, Y ⊆ B with |X| ≥ ε|A| and |Y| ≥ ε|B|, one has

|d(A, B) − d(X, Y)| ≤ ε.
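The following Python sketch (illustrative only; it is a brute-force, exponential-time check that only makes sense for very small sets) spells out Definitions 160 and 161:

    from itertools import combinations

    def density(adj, X, Y):
        """d(X, Y) = e(X, Y) / (|X| |Y|) for disjoint vertex sets X, Y."""
        e = sum(adj[x][y] for x in X for y in Y)
        return e / (len(X) * len(Y))

    def is_eps_regular(adj, A, B, eps):
        """Check eps-regularity of the pair (A, B) by enumerating every
        X in A, Y in B with |X| >= eps|A| and |Y| >= eps|B|."""
        def big_subsets(S, threshold):
            S = list(S)
            return (sub for r in range(1, len(S) + 1)
                    for sub in combinations(S, r) if r >= threshold)
        dAB = density(adj, A, B)
        for X in big_subsets(A, eps * len(A)):
            for Y in big_subsets(B, eps * len(B)):
                if abs(dAB - density(adj, X, Y)) > eps:
                    return False
        return True

    # The complete bipartite graph between A and B: every pair of subsets has
    # density exactly 1, so (A, B) is eps-regular for every eps.
    A, B = [0, 1, 2], [3, 4, 5]
    adj = [[0] * 6 for _ in range(6)]
    for a in A:
        for b in B:
            adj[a][b] = adj[b][a] = 1
    print(is_eps_regular(adj, A, B, 0.3))   # True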

The idea is to show that regularity is sufficient to ensure lots of triangles, just like in the case of random graphs.

Lemma 162. Fix 0 < α < 1/2 and 0 < ε < α/2. Suppose A, B, C ⊆ [N] are disjoint subsets such that each pair (A, B), (A, C), and (B, C) is both (i) ε-regular and (ii) α-dense (i.e., of density at least α). Then the number of triangles with one vertex in each of A, B and C is at least (α³/16) |A| |B| |C|.


Figure 13.4: A "well-connected" vertex u ∈ A*, together with its neighborhoods B_u ⊆ B and C_u ⊆ C.

Proof. First we show that A has many "well-connected" vertices (adjacent to many elements of both B and C). Define

A* := { a ∈ A : a has at least (α − ε)|B| neighbors in B and at least (α − ε)|C| neighbors in C }.

Claim 163. |A*| ≥ (1 − 2ε)|A|.

Proof. Let A_bad(B) ⊆ A be { a ∈ A : a has fewer than (α − ε)|B| neighbors in B }. We have

d(A_bad(B), B) = e(A_bad(B), B) / (|A_bad(B)| |B|) < |A_bad(B)| (α − ε) |B| / (|A_bad(B)| |B|) = α − ε.

Since d(A, B) ≥ α by assumption, we get |d(A, B) − d(A_bad(B), B)| > ε; and so, (A, B) being ε-regular, we must have by contrapositive |A_bad(B)| < ε|A|. Analogously, defining A_bad(C), we obtain |A_bad(C)| < ε|A|, and therefore

|A*| ≥ |A| − |A_bad(B)| − |A_bad(C)| ≥ (1 − 2ε)|A|.

Now we will use A* to get a lot of triangles, as follows: for a vertex u ∈ A* (that is, a "well-connected" vertex – see Figure 13.4), let

B_u := { b ∈ B : (u, b) ∈ E },    C_u := { c ∈ C : (u, c) ∈ E }.

Every edge between B_u and C_u gives a triangle with u; to get many of them, we want to lower bound

e(B_u, C_u) = d(B_u, C_u) |B_u| |C_u|.    (13.1)

But since u ∈ A*, |B_u| and |C_u| are both large:

|B_u| ≥ (α − ε)|B| ≥ (α/2)|B| ≥ ε|B|,
|C_u| ≥ (α − ε)|C| ≥ (α/2)|C| ≥ ε|C|,


and this also implies, as (B, C) is ε-regular, that

d(B_u, C_u) ≥ d(B, C) − ε ≥ α − ε > α/2,

and thus, plugging these back into (13.1), e(B_u, C_u) ≥ (α³/8)|B||C|. So finally, the total number of triangles with one vertex in each of A, B and C is at least

∑_{u∈A*} e(B_u, C_u) ≥ (1 − 2ε)|A| · (α³/8)|B||C| ≥ (α³/16)|A||B||C|.

To conclude the proof of the △-Removal Lemma, we conjure an amazing fact (or miracle): the Szemeredi Regularity Lemma. This structural result, from [Sze78], is a cornerstone in graph property testing; it states that every sufficiently large graph can be divided into subsets of about the same size so that the edges between different subsets behave almost as in a random graph. More formally:

Theorem 164 (Szemeredi Regularity Lemma). Given ε > 0 and m0 ≥ 1, there exist M = M(ε, m0) (an upper bound on the number of pieces of the partition) and K = K(ε, m0) such that for any graph G = (V, E) with at least K vertices there exist an integer m and a partition of V into V0, V1, . . . , Vm satisfying:
(i) |V1| = |V2| = · · · = |Vm|; (all same size)
(ii) |V0| ≤ ε|V|; (slop bin)
(iii) m0 ≤ m ≤ M; and
(iv) at most εm² pairs (Vi, Vj) are not ε-regular.

As a "small" catch, however, one feels compelled to point out that M can be as large as a tower of 2's

2^2^2^···^2

of height 1/ε⁵ (and, sadly, one cannot hope for much improvement, as lower bounds on this height have been proven). However, the amazing fact is that this is still completely independent of N: no matter how big N ≥ K is, M will not change by an iota.

Now for the kill: recall the statement of the △-Removal Lemma we want to prove:

∀ε > 0 ∃δ_ε > 0 such that any N-vertex G that is ε-far from △-free has at least δ_ε N³ △'s.


We will use the Szemeredi Regularity Lemma (SRL) above to finish the proof. We are given ε > 0; set m0 = 10/ε, and apply SRL to G with parameter "ε" chosen to be ε/10. This guarantees the existence of M = M(ε), K = K(ε) such that if N ≥ K there exists a partition V0 ∪ V1 ∪ · · · ∪ Vm of V with
– 10/ε ≤ m ≤ M;
– at most (ε/10)m² pairs Vi, Vj not ε/10-regular;
– |V0| ≤ (ε/10)N; and
– |V1| = |V2| = · · · = |Vm| ∈ [ (1 − ε/10)N/M, εN/10 ].

Now, if N < K, then set δ_ε := (1/2) · (1/K³), so that 0 < δ_ε ≤ 1/2. In this case it suffices for G to contain at least one triangle for the statement of the theorem to hold, which is true as G is ε-far from △-free (so must contain at least one triangle).

Goal: assuming now N ≥ K, we want to modify G so that we can use Lemma 162. Let G′ be obtained from G by:
(1) removing all edges incident to V0 (at most (ε/10)N · N = (ε/10)N² edges);
(2) removing all edges within Vi, for i ∈ [m] (at most m · (N/m choose 2) ≤ m · N²/(2m²) = N²/(2m) ≤ (ε/20)N² edges);
(3) removing all edges between Vi and Vj if (Vi, Vj) is not ε/10-regular (at most (ε/10)m² · (N/m)² ≤ (ε/10)N² edges);
(4) removing all edges between Vi and Vj if d(Vi, Vj) ≤ ε/5 (at most m² · (ε/5)(N/m)² ≤ (ε/5)N² edges).
In total, this removes at most (9ε/20)N² edges from G: since G is ε-far from △-free, it follows that G′ has at least one remaining triangle. This triangle, by construction of G′, has to be between some Vi, Vj, Vk for distinct i, j, k, with all three pairs among Vi, Vj, Vk simultaneously ε/10-regular and ε/5-dense. Therefore, from Lemma 162 there are at least

((ε/5)³/16) |Vi| |Vj| |Vk| ≥ ((ε/5)³/16) ((1 − ε/10)N/M)³

many triangles in G. Choosing δ_ε := (1/2000) · ( ε(1 − ε/10)/M )³ then concludes the proof.


Lecture 14

April 30, 2014

14.1 Overview

14.1.1 Last Time

• Generalization of bipartiteness testing: "general graph partition properties" (GGPT)
• Testing △-freeness via regularity:
  – Szemeredi Regularity Lemma (SRL)
  – △-removal lemma
  (tester with query complexity O_ε(1), where the dependence on ε is a tower of 2's of height 1/ε⁵)

14.1.2 Today

• Quick sketch of the proof of SRL

• A (1/ε)^{Ω(log(1/ε))} lower bound for 1-sided error testers for △-freeness (ruling out poly(1/ε)-query testers)
• Bounded-degree graph property testing (a model well-suited for sparse graphs):
  – adjacency list model
  – for constant maximum degree d, a poly(1/ε)-query algorithm for testing connectivity.

Relevant readings
• Szemeredi. Regular partitions of graphs. [Sze78]
• Komlos, Simonovits. Szemeredi's Regularity Lemma and Its Applications in Graph Theory. [KS96]
• Ron. Algorithmic and Analysis Techniques in Property Testing, chapter 9. [Ron09]


• Goldreich, Ron. Property Testing in Bounded Degree Graphs. [GR02]

14.2 Proof Sketch for Szemeredi Regularity Lemma

Recall the Szemeredi Regularity Lemma from the last class:

Theorem 165 (Szemeredi Regularity Lemma). Given ε > 0 and m0 ≥ 1, there exist M = M(ε, m0) and K = K(ε, m0) such that for any graph G = (V, E) with at least K vertices there exist an integer m and a partition of V into V0, V1, . . . , Vm satisfying:
1. |V1| = |V2| = · · · = |Vm|,
2. |V0| ≤ ε|V|,
3. m0 ≤ m ≤ M, and
4. at most εm² pairs (Vi, Vj) are not ε-regular.

14.2.1 High-level Idea of SRL Proof

We will first need a definition of the refinement of a partition.

Definition 166. A refinement of a partition P = (V0, V1, . . . , Vm) is a partition P′ = (V′_0, V′_1, . . . , V′_{m′}) such that for all j ∈ {0, 1, . . . , m′} there exists i ∈ {0, 1, . . . , m} such that V′_j ⊆ V_i.

In words, a refinement P′ of a partition P further partitions some (or all) sets from P into smaller (disjoint) subsets.

The elements of the proof are:
• Define a potential function f(P) ∈ [0, 1], where P is a partition (we will give more details later).
• Start with any partition P1 of V into m0 pieces V1, V2, . . . , V_{m0} of equal size, and put any remaining vertices into the slop bin V0. (This partition will satisfy conditions 1–3 of SRL, up to increasing K (i.e., N) for 2 to hold.)
• If 4 is satisfied, done. If not, show that P_t can be refined into P_{t+1} such that:
  (a) f(P_{t+1}) > f(P_t) + ε⁵;
  (b) if the partition P_t has n_t pieces, then its refinement P_{t+1} has at most n_{t+1} ≤ n_t · 2^{n_t} pieces (this is where the tower of twos comes from);
  (c) each piece in P_{t+1} (except the slop bin) has the same size;
  (d) "not too much" is added to the slop bin (i.e., we maintain |V0| ≤ ε|V|).

We will only briefly comment on some of the proof details. In particular, we will define the potential function, and comment on parts (a) and (b).


Potential function. Given a partition P = (V0, V1, . . . , Vk) of V, where |V| = N, define the potential function f(P) as:

f(P) := ∑_{0≤i,j≤k} (|Vi| · |Vj| / N²) · d(Vi, Vj)².

Recall from the previous class that the density is d(Vi, Vj) = e(Vi, Vj)/(|Vi| · |Vj|). Therefore:

f(P) = ∑_{0≤i,j≤k} e(Vi, Vj)² / (N² |Vi| · |Vj|).

It is clear that f(P) ≥ 0. To see that f(P) ≤ 1, notice that the number of edges e(Vi, Vj) between two sets of vertices is at most |Vi| · |Vj|. Therefore:

f(P) ≤ ∑_{0≤i,j≤k} |Vi| · |Vj| / N² = (1/N²) (∑_{i=0}^k |Vi|) (∑_{j=0}^k |Vj|) = (1/N²) · N² = 1,

where the inequality is tight for the complete graph K_N.
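As a sanity check of the computation above, here is a short Python sketch of f(P) (not from the notes; the handling of the diagonal terms i = j, where each internal edge gets counted twice, is a convention the notes leave implicit).

    import numpy as np

    def potential(adj, parts):
        """f(P) = sum over pairs of parts of (|Vi||Vj|/N^2) * d(Vi, Vj)^2, where
        d is computed from e(Vi, Vj) = #{(u, v) adjacent : u in Vi, v in Vj}."""
        N = len(adj)
        total = 0.0
        for Vi in parts:
            for Vj in parts:
                e = sum(adj[u][v] for u in Vi for v in Vj)
                d = e / (len(Vi) * len(Vj))
                total += (len(Vi) * len(Vj)) / N**2 * d**2
        return total

    # On the complete graph K_N the bound f(P) <= 1 is essentially attained:
    N = 60
    adj = np.ones((N, N), dtype=int) - np.eye(N, dtype=int)
    parts = [list(range(0, 20)), list(range(20, 40)), list(range(40, 60))]
    print(potential(adj, parts))   # about 0.97 here; it tends to 1 as the parts grow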

Some details of part (a). Part (a) is shown using the following two claims:

Claim 167. If P′ is a refinement of P, then f(P′) ≥ f(P).

Proof. This claim can be proved using the standard form of the Cauchy–Schwarz inequality,

∑_k x_k y_k ≤ √(∑_k x_k²) · √(∑_k y_k²),

and we omit the details of the proof.

Claim 168. Suppose that the pair Vi, Vj is not ε-regular. Then there exist subsets V′_i ⊆ V_i, V′_j ⊆ V_j, with |V′_i| ≥ ε|V_i| and |V′_j| ≥ ε|V_j|, such that |d(V′_i, V′_j) − d(V_i, V_j)| > ε.

Using Claim 168, the proof proceeds by showing that for Vi, Vj that are not ε-regular, the new partition P′ with Vi, Vj partitioned into V′_i, V_i \ V′_i, V′_j, V_j \ V′_j (see Fig. 14.1) has:

f(P′) ≥ f(P) + ε⁴ · |Vi| · |Vj| / N².


Figure 14.1: A refinement of a partition: Vi is split into V′_i and V_i \ V′_i, and Vj into V′_j and V_j \ V′_j.

Figure 14.2: Multiple splits.

Now, since at each step there are at least εm² pairs of sets that are not ε-regular – as condition 4 of SRL was not satisfied – the increase in the potential function is at least

εm² · ε⁴ · |Vi| · |Vj| / N².

Since |Vi|/N = |Vj|/N ≥ (1 − ε)/m (recall that the partition P breaks V into equal pieces plus the slop bin), we get that the total increase is, after working out the right constants, at least ε⁵.

Some details of part (b). We need to make splits for all the (at least εm²) pairs of sets that are not ε-regular. Each set Vi takes part in up to n_t such splits, so it can get split into up to 2^{n_t} pieces, yielding at most n_t · 2^{n_t} pieces in total.


14.3 Lower Bound for Testing △-freeness

We saw earlier that we can construct a 1-sided error tester for △-freeness with query complexity O_ε(1): a tower of 2's of height 1/ε⁵. Although this is a constant query complexity, one could argue it is a very ugly (huge) constant. So can we get a better dependence on 1/ε, say, a polynomial one? It turns out that the answer is no – the dependence must be super-polynomial, although it is still not known whether it must be a tower. We will prove the following lower bound (omitting some details):

Theorem 169. Any 1-sided error (adaptive or non-adaptive) ε-tester for △-freeness must use at least (1/ε)^{Ω(log(1/ε))} queries.

Proof. We saw in previous lectures that the existence of a q-query adaptive tester implies the existence of an O(q²)-query non-adaptive tester. Therefore, it is sufficient to show that the theorem holds for non-adaptive testers.
We can assume without loss of generality that the non-adaptive tester works as follows:
• draw r vertices uniformly at random;
• query the (r choose 2) induced edges;
• reject if and only if a triangle is found.
Therefore, to prove the statement of the theorem it suffices to show that there exists a graph G such that:
1. G is ε-far from being △-free (this implies that a (valid) tester must reject with probability ≥ 2/3);
2. the number of △'s in G is at most (ε/C)^{C log(C/ε)} · N³, for some constant C (this implies that the expected number of triangles among q random vertices is at most (q choose 3) · (ε/C)^{C log(C/ε)} · N³ / (N choose 3); so, by choosing constants appropriately, for q = (1/ε)^{O(log(1/ε))} one can make this value arbitrarily small (say 0.001) and have Pr[ REJECT ] ≤ 0.001).

Let’s take a closer look at these two requirements:• 2 demands that G has few triangles. Clearly, we can achieve this by selecting a graph G that

has few edges; but then G will not be far from being 4-free (if G has less than εN2 edges, itviolates 1).• To achieve 1, we must have at least εN2 edges. But if we try to use, for instance, an ε-dense

random graph, we will get a graph with an expected ε3(N

3)∼ ε3

6 N3 number of edges, which

violates 2.The two requirements appear to contradict each other, and finding a graph that satisfies both of

them might be a bit too challenging. But graphs are complicated objects; what looks easier thanfinding a graph? Maybe we should try with a set of numbers, and a proper analogy. The analogywill more or less look like this (Table 14.1):


Set [m]                              ↔  Complete graph on {1, . . . , N}
Set X ⊆ [m]                          ↔  Graph on {1, . . . , N}
|X| large                            ↔  Dense graph
x1, x2, x3 ∈ X s.t. x1 + x3 = 2x2    ↔  △ in the graph

Table 14.1: Analogy between a set of numbers and a graph.

In particular, our appropriate analogue of a dense graph with few triangles will be a dense subset X ⊆ [m] with few or no triples x1, x2, x3 s.t. x1 + x3 = 2x2. Following our above correspondence, we will also refer to such triples of numbers as triangles. We shall apply the following lemma, which we state without proof.

Lemma 170. For sufficiently large m, there exists a set X ⊆ [m] with |X| ≥ m / e^{10√(log m)} > m^{0.99}, such that X contains no non-trivial triangles: that is, the only solutions to x1 + x3 = 2x2 for x1, x2, x3 ∈ X are of the form x1 = x2 = x3.
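Lemma 170 is (a form of) Behrend's classical construction of a large subset of [m] with no 3-term arithmetic progression; we use it as a black box. As a small illustration (not from the notes), here is a brute-force Python check for such "triangles" of numbers, together with a greedy construction that produces a valid (though far smaller than Behrend's) triangle-free set to experiment with.

    def has_nontrivial_triangle(X):
        """Does X contain x1, x2, x3, not all equal, with x1 + x3 = 2*x2?"""
        S = set(X)
        for x1 in X:
            for x3 in X:
                if (x1 + x3) % 2 == 0:
                    x2 = (x1 + x3) // 2
                    if x2 in S and not (x1 == x2 == x3):
                        return True
        return False

    def greedy_ap_free(m):
        """Greedily build a triangle-free subset of [1, m]."""
        X = []
        for x in range(1, m + 1):
            if not has_nontrivial_triangle(X + [x]):
                X.append(x)
        return X

    print(greedy_ap_free(30))   # [1, 2, 4, 5, 10, 11, 13, 14, 28, 29]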


1st construction. Fix m, and choose a set X with no triangles using Lemma 170. Construct the graph G as follows:
• Vertices are grouped into three disjoint sets: A = {a1, . . . , am}, B = {b1, . . . , b2m}, and C = {c1, . . . , c3m}. Notice that there are 6m vertices in total.
• Add the following edges to G: for each j ∈ [m] and x ∈ X, add (a_j, b_{j+x}), (b_{j+x}, c_{j+2x}), (c_{j+2x}, a_j) to G.
Clearly, we have in total |E| = 3m|X| = Θ(m² / e^{10√(log m)}) edges in G, and constructed m|X| "intended" triangles. We would like to be certain these are the only ones – i.e., that no "non-intended" triangle was created:

Claim 171. G has no “non-intended” triangles.

Proof. By construction, there are no edges inside the sets A, B, C; therefore, since the edges appear only between vertices from different sets, the only possible triangles must be of the form a_i, b_j, c_k, for a_i ∈ A, b_j ∈ B, c_k ∈ C.
If there is a triangle a_i, b_j, c_k then, by construction, it must be that j = i + x1 for some x1 ∈ X (to have the edge (a_i, b_j)) and k = i + 2x3 for some x3 ∈ X (to have the edge (a_i, c_k)). But then, to have the edge (b_j, c_k), we must also have k = j + x2 for some x2 ∈ X, i.e.

i + x1 + x2 = i + 2x3  ⟺  x1 + x2 = 2x3,

which, since X contains no non-trivial triangles, forces x1 = x2 = x3.
Therefore, all the triangles must be of the form a_i, b_{i+x}, c_{i+2x}, for i ∈ [m], x ∈ X.
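A minimal Python sketch of the 1st construction (illustrative, not from the notes); the brute-force triangle count confirms that, when X is triangle-free, the only triangles are the m|X| intended ones.

    from itertools import combinations

    def tripartite_from(X, m):
        """Vertices A = ('a', 1..m), B = ('b', 1..2m), C = ('c', 1..3m); for each
        j in [m] and x in X add the triangle a_j, b_{j+x}, c_{j+2x}."""
        edges = set()
        for j in range(1, m + 1):
            for x in X:
                a, b, c = ('a', j), ('b', j + x), ('c', j + 2 * x)
                edges |= {frozenset((a, b)), frozenset((b, c)), frozenset((c, a))}
        return edges

    def count_triangles(edges):
        verts = sorted({v for e in edges for v in e})
        return sum(1 for u, v, w in combinations(verts, 3)
                   if {frozenset((u, v)), frozenset((v, w)), frozenset((u, w))} <= edges)

    X, m = [1, 2, 4, 5], 8                  # a small triangle-free set X inside [m]
    E = tripartite_from(X, m)
    print(len(E), count_triangles(E))       # 3*m*|X| = 96 edges, m*|X| = 32 triangles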

Are we done? Not quite. G has Θ(m) vertices and Θ(m² / e^{10√(log m)}) edges, which looks good; but it has too few edges – its distance to △-freeness is o(1), so for constant ε it is not ε-far from being △-free.

Fixing it. We want to keep the same structure of G, but make it more dense. Our "fix" will be a "blowup" operation on G, which we define below.

Definition 172. Fix a graph G = (V, E) and s > 0. The s-blowup of G is the graph G(s) = (V(s), E(s)) with s|V| vertices and s²|E| edges obtained by:
• replacing each vertex v ∈ V by a super-vertex, i.e. a set of s independent vertices;
• for every edge (u, v) ∈ E, adding a complete bipartite graph between the super-vertices of u and v.

Figure 14.3: An example of the blow-up operation on an edge (u, v): the s copies of u and the s copies of v are joined by s² edges.
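A minimal Python sketch of the s-blowup of Definition 172 (illustrative, not from the notes), for a graph given as a set of 2-element frozensets:

    def blowup(edges, s):
        """s-blowup: replace each vertex v by s copies (v, 0), ..., (v, s-1), and
        each edge (u, v) by the complete bipartite graph between the copies of u
        and the copies of v (s^2 new edges per original edge)."""
        new_edges = set()
        for e in edges:
            u, v = tuple(e)
            for i in range(s):
                for j in range(s):
                    new_edges.add(frozenset(((u, i), (v, j))))
        return new_edges

    # Blowing up a single triangle with s = 3 gives 3 * 3^2 = 27 edges (and 3^3 = 27
    # triangles, one for each choice of a copy in each super-vertex).
    tri = {frozenset((1, 2)), frozenset((2, 3)), frozenset((3, 1))}
    print(len(blowup(tri, 3)))   # 27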

For the next (and final) construction of G, the idea will be to blow up the graph from the 1st construction, for an appropriately chosen s. We will get a graph G(s) with 6ms vertices, 3s²m|X| edges, and (exactly) s³m|X| triangles.

Exercise 173 (HW Problem). Show that G(s) has Θ(s²m|X|) edge-disjoint triangles.

2nd construction. Fix ε > 0, and pick the largest value of m such that ε ≤ 1/(36 e^{10√(log m)}): this gives

m ≥ (C/ε)^{C log(C/ε)}

for some constant C > 0. From the 1st construction, get the graph G with 6m nodes; let s := N/(6m). Our goal graph is the graph G(s) obtained after an s-blowup of G. Observe that:
• G(s) has N nodes.


• The number of triangles in G(s) is

s³ m |X| = Θ( N³ m |X| / m³ ) = Θ( N³ |X| / m² ) = O( N³ / m ) = (ε/C)^{Θ(log(1/ε))} · N³.

The only thing left to show for the constructed graph G(s) is that it is ε-far from being △-free. In fact, if we show that the number of edge-disjoint triangles is at least εN², we are done (because we need to remove at least one edge from each such triangle to get a △-free graph). The above exercise (Exercise 173) was to show that G(s) has Θ(s²m|X|) edge-disjoint triangles; recalling the values we had for m and |X|, this yields

s² m |X| = N² m |X| / (36 m²) = N² |X| / (36 m) ≥ N² / (36 e^{10√(log m)}) ≥ εN²,

concluding the proof of the lower bound.

14.4 Sparse Graph Testing in Bounded-Degree Model

Fix a degree bound d ≥ 2. We will say that G has degree d if every v ∈ V has deg(v) ≤ d. As usual, we will work with graphs of the form G = (V, E), where |V| = N is large. Given two graphs G1 and G2, we define the distance between them as

dist(G1, G2) := |E1 Δ E2| / (dN) ∈ [0, 1],

where Δ denotes the symmetric difference between two sets.

where “∆” denotes the symmetric difference between two sets.The adjacency matrix representation is not good anymore, since the graphs we are observing

now are sparse: many queries (even Ω(N)) would be required before even finding an edge. Therefore,we will from now on assume that the graphs are represented by adjacency lists:

Definition 174 (Adjacency list). Given a vertex v ∈ V with degree dv ≤ d, the adjacency list of vis a d-tuple of the form

`(v) = (u1, . . . , udv , 0, 0, . . . 0)that is a list of v’s neighbors followed by a sequence of 0’s, so that the total list length equals d. Theneighbors of v may be ordered arbitrarily in the list, but the order must be fixed.

A query to the graph is a pair (v, i) ∈ V × [d], and returns the i-th neighbor of v if i ≤ d_v, and 0 otherwise.
As before, a property P is defined as a set of degree-d graphs; as in the case of dense graphs, we will only consider reasonable properties. Examples of properties are "connectivity" or "bipartiteness". The distance of a graph G to a property P is

dist(G, P) := min_{G′∈P} dist(G, G′).


The property we will deal with in the rest of this lecture is P = Connectivity. The question we want to answer is: how many queries are needed to test P?

We start with the following observations:
• if G is connected, then, by definition of connectivity, for all vertices u, v ∈ V there exists a path in G connecting u and v. However, this does not seem to be a good way to test connectivity: if we get unlucky when picking a pair of vertices u, v, walking the u–v path may take Ω(N) queries (e.g., consider the case where G is a line);
• if G has K connected components, then to make G connected we must add at least K − 1 edges to G (each added edge can merge at most two components);
• if G is "far" from being connected (dist(G, P) > ε), then G must have "many" (more than εdN) connected components. Moreover, if there are "many" connected components, then some of them must be "small".
In fact, this last point can be strengthened: if G has many connected components, then many of them must be small. This is true because every connected component must contain at least one vertex. Therefore, if we select a vertex v uniformly at random, we have a decent chance of hitting a "small" component (say, of size s). But a component with s vertices can be completely explored via breadth-first search in O(sd) time (that is, with O(sd) queries). A natural algorithm for testing Connectivity becomes clear now, and its pseudocode is provided below (Algorithm 13).

Algorithm 13 Test–Connectivity
1: Select m vertices of G uniformly at random.
2: For each of the selected m vertices, perform a BFS starting from that vertex, until the search has explored ℓ vertices or there are no more vertices to be explored (i.e., a whole connected component has been found).
3: If a (complete, small) connected component was detected in any of the BFSs, return REJECT. Otherwise, return ACCEPT.

The values of m and ℓ in Algorithm 13 are m := 16/(εd) and ℓ := 8/(εd). The algorithm makes at most m · ℓ · d = (16/(εd)) · (8/(εd)) · d = 128/(ε²d) queries. Further, it is clear that if G is connected, Test–Connectivity accepts with probability 1. The only part that remains is to show that if dist(G, P) > ε, the algorithm rejects with probability at least 2/3. The key to proving this are the following two lemmas.

Lemma 175. If a graph G is ε-far from being connected, then it has at least εdN/4 connected components.

The proof of Lemma 175 can be found in [GR02]. The following is a simple corollary of Lemma 175:

Lemma 176. If G is ε-far from being connected, then it has at least εdN/8 connected components, each containing at most 8/(εd) vertices.


Proof. By Lemma 175, there are at least εdN/4 connected components in G. The number of connected components with at least 8/(εd) vertices is at most N/(8/(εd)) = εdN/8. Therefore, there are at least εdN/4 − εdN/8 = εdN/8 connected components, each with at most 8/(εd) vertices.

From Lemma 175, for ε > 4/d every graph is ε-close to being connected, so assume ε ≤ 4/d. Since we are selecting m vertices uniformly at random, by Lemmas 175 and 176 the probability that no selected vertex lies in a component of size at most 8/(εd) (i.e., the probability that the algorithm does not reject an ε-far G) is at most

(1 − εd/8)^m ≤ e^{−(εd/8) · m} = e^{−2} < 1/3,

establishing the soundness of Test–Connectivity.


Bibliography

[BBL98] A. Blum, C. Burch, and J. Langford. On learning monotone Boolean functions. In Proceedings of the Thirty-Ninth Annual Symposium on Foundations of Computer Science, pages 408–415, 1998.

[BCH+95] Mihir Bellare, Don Coppersmith, Johan Hastad, Marcos A. Kiwi, and Madhu Sudan. Linearity testing in characteristic two. In FOCS, pages 432–441, 1995.

[Bla09] Eric Blais. Testing juntas nearly optimally. In Proceedings of the Forty-first Annual ACM Symposium on Theory of Computing, STOC '09, pages 151–158, New York, NY, USA, 2009. ACM.

[BLR90] M. Blum, M. Luby, and R. Rubinfeld. Self-testing/correcting with applications to numerical problems. In Proceedings of the Twenty-second Annual ACM Symposium on Theory of Computing, STOC '90, pages 73–83, New York, NY, USA, 1990. ACM.

[CG04] Hana Chockler and Dan Gutfreund. A lower bound for testing juntas. Inf. Process. Lett., 90(6):301–305, 2004.

[CS13] Deeparnab Chakrabarty and C. Seshadhri. A o(n) monotonicity tester for Boolean functions over the hypercube. In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, STOC '13, pages 411–418, New York, NY, USA, 2013. ACM.

[DLM+07] Ilias Diakonikolas, Homin K. Lee, Kevin Matulef, Krzysztof Onak, Ronitt Rubinfeld, Rocco A. Servedio, and Andrew Wan. Testing for concise representations. In FOCS, pages 549–558, 2007.

[EKK+98] Funda Ergun, Sampath Kannan, S. Ravi Kumar, Ronitt Rubinfeld, and Mahesh Viswanathan. Spot-checkers. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC '98, pages 259–268, New York, NY, USA, 1998. ACM.

[FLN+02] E. Fischer, E. Lehman, I. Newman, S. Raskhodnikova, R. Rubinfeld, and A. Samorodnitsky. Monotonicity testing over general poset domains. In STOC, pages 474–483, 2002.

[GGL+00] O. Goldreich, S. Goldwasser, E. Lehman, D. Ron, and A. Samorodnitsky. Testing monotonicity. Combinatorica, 20(3):301–337, 2000.

[GKM12] Parikshit Gopalan, Adam R. Klivans, and Raghu Meka. Learning functions of halfspaces using prefix covers. In Shie Mannor, Nathan Srebro, and Robert C. Williamson, editors, COLT, volume 23 of JMLR Proceedings, pages 15.1–15.10. JMLR.org, 2012.

[GR98] Oded Goldreich, Shafi Goldwasser, and Dana Ron. Property testing and its connection to learning and approximation. Journal of the ACM, 45(4):653–750, 1998.

[GR02] O. Goldreich and D. Ron. Property testing in bounded degree graphs. Algorithmica, 32:302–343, 2002.

[KM91] Eyal Kushilevitz and Yishay Mansour. Learning decision trees using the Fourier spectrum. In Proceedings of the Twenty-third Annual ACM Symposium on Theory of Computing, STOC '91, pages 455–464, New York, NY, USA, 1991. ACM.

[KS96] Janos Komlos and Miklos Simonovits. Szemeredi's regularity lemma and its applications in graph theory, 1996.

[LMN93] N. Linial, Y. Mansour, and N. Nisan. Constant depth circuits, Fourier transform and learnability. Journal of the ACM, 40(3):607–620, 1993.

[Man94] Yishay Mansour. Learning Boolean functions via the Fourier transform. In Vwani Roychowdhury, Kai-Yeung Siu, and Alon Orlitsky, editors, Theoretical Advances in Neural Computation and Learning, pages 391–424. Springer US, 1994.

[Ron08] Dana Ron. Property testing: A learning theory perspective. Foundations and Trends in Machine Learning, 1(3):307–402, 2008.

[Ron09] Dana Ron. Algorithmic and analysis techniques in property testing. Foundations and Trends in Theoretical Computer Science, 5:73–205, 2009.

[RS13] Dana Ron and Rocco A. Servedio. Exponentially improved algorithms and lower bounds for testing signed majorities. In Symposium on Discrete Algorithms, pages 1319–1336. SIAM, 2013.

[Sze78] E. Szemeredi. Regular partitions of graphs. Problemes combinatoires et theorie des graphes, pages 399–401, 1978.

[VV11] Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In STOC, pages 685–694, 2011.