
Boolean Function Analysis

Yuval Filmus

August 2, 2021

Contents

1 Introduction: linearity testing

2 Polymorphisms of majority
  2.1 Approximate polymorphisms

3 Friedgut–Kalai–Naor theorem

4 Voting and influences
  4.1 Bonus: L1 influences

5 Friedgut and KKL
  5.1 Friedgut's junta theorem
  5.2 Kahn–Kalai–Linial theorem

6 Hypercontractivity
  6.1 Another proof of FKN
  6.2 General norms

7 Constant degree functions: Kindler–Safra theorem

8 Biased Fourier analysis: Erdős–Ko–Rado
  8.1 Intersecting families
  8.2 Hypercontractivity

9 Russo–Margulis
  9.1 Russo–Margulis + Friedgut

10 Very biased Fourier analysis: Biased FKN theorem

11 Invariance principle
  11.1 Application: Majority is Stablest
  11.2 Application: Bourgain's tail bound

12 Global hypercontractivity
  12.1 Application: Bourgain's booster theorem

13 Analysis on the slice: Erdős–Ko–Rado
  13.1 Influence
  13.2 Noise
  13.3 Application: Erdős–Ko–Rado
  13.4 Coupling the slice and the cube

1 Introduction: linearity testing

Boolean function analysis [O'D14] studies functions f : {0,1}^n → {0,1} (known as Boolean functions) from a spectral perspective. (Often we will replace {0,1} by {±1}.) These functions could come from a Boolean circuit, from a probabilistically checkable proof, from an error-correcting code, from an intersecting family, and so on. Much of the area is dedicated to understanding the structure of functions which satisfy given properties. By way of introduction, we will consider one of the simplest applications of Boolean function analysis: linearity testing.

A function f : {±1}^n → {±1} is a homomorphism or linear if for all x, y ∈ {±1}^n we have

f(xy) = f(x)f(y).

Which functions are linear? First, note that f(1) = f(1)^2, and so f(1) = 1. Next, let e^i be the vector given by e^i_i = −1 and e^i_j = 1 for all j ≠ i. Then for all x ∈ {±1}^n,

f(x) = ∏_{i : x_i = −1} f(e^i).

If S is the set of coordinates i such that f(e^i) = −1, this shows that

f(x) = ∏_{i ∈ S : x_i = −1} (−1) = ∏_{i ∈ S} x_i.

In other words, f is linear if and only if it is a monomial. Such functions are also known as Fourier characters, since they are characters of the group Z_2^n.

What can we say about a function f which is close to linear, in the sense that f(xy) = f(x)f(y) for most x, y? Concretely, suppose that

Pr[f(xy) = f(x)f(y)] = 1 − ε.

What can we say about f? Does it have to be close to a linear function? This is what we will show, using Fourier analysis.

The basic idea is to express the function f as a mixture of Fourier characters:

f(x) = ∑_{S ⊆ [n]} c_S x_S, where x_S = ∏_{i ∈ S} x_i.

How do we know that such a representation exists? Is it unique?

First of all, notice that if every function can be represented in this way, then the representation must be unique: if we think of the space of functions on the Boolean cube {±1}^n as a vector space, then it has dimension 2^n, which exactly coincides with the number of Fourier characters.

There are many ways to show that every function can be represented in the form above, in other words, as a multilinear polynomial. For example, here is such a representation:

f(x) = ∑_{y ∈ {±1}^n} f(y) ∏_{i=1}^{n} (x_i y_i + 1)/2.

The idea is that if x_i = y_i then x_i y_i = 1, and otherwise x_i y_i = −1, so the product vanishes unless x = y, in which case it equals 1.

The unique representation f = ∑_{S ⊆ [n]} f(S) x_S is known as the Fourier expansion of f, and the coefficients f(S) are known as the Fourier coefficients of f.

With this representation in hand, let us try to express the assumption in a different way. First, notice that f(xy) = f(x)f(y) is the same as f(x)f(y)f(xy) = 1. Second, since f(x)f(y)f(xy) ∈ {±1}, if we know the probability that f(x)f(y)f(xy) = 1, then we also know the probability that f(x)f(y)f(xy) = −1, and so we can compute the expectation of f(x)f(y)f(xy):

E[f(x)f(y)f(xy)] = Pr[f(x)f(y)f(xy) = 1] − Pr[f(x)f(y)f(xy) = −1] = 1 − 2ε.

At this point, we substitute the Fourier expansion of f and apply linearity of expectation to obtain

1 − 2ε = ∑_{S,T,U} f(S) f(T) f(U) E[x_S y_T (xy)_U].

Notice that (xy)_U = x_U y_U. Moreover, x_S x_U = x_{S∆U}, where ∆ is symmetric difference. This is because x_i^2 = 1. Altogether,

1 − 2ε = ∑_{S,T,U} f(S) f(T) f(U) E[x_{S∆U}] E[y_{T∆U}],

since x, y are independent.

What is the expectation of x_R? If R = ∅ then x_R = 1, and so E[x_R] = 1. Otherwise,

E[x_R] = ∏_{i ∈ R} E[x_i] = 0,

since each x_i has zero mean. Hence a term survives only when S∆U = T∆U = ∅, that is, when S = T = U, and we conclude that

1 − 2ε = ∑_S f(S)^3.

Before continuing, let us pause to notice something that came up during the calculation: E[x_S x_T] = 0 if S ≠ T, and E[x_S^2] = E[1] = 1. In other words, the Fourier characters x_S form an orthonormal basis for the space of real-valued functions on the Boolean cube. This gives us a way to compute the Fourier coefficients of a function h:

E[h x_S] = ∑_T h(T) E[x_T x_S] = h(S).

More generally,

E[gh] = ∑_{S,T} g(S) h(T) E[x_S x_T] = ∑_S g(S) h(S).

In particular,

E[h^2] = ∑_S h(S)^2.

This is known as Parseval's identity. The left-hand side is often written as ‖h‖^2, since it is the square of the L2 norm of h.

Back to our function f, which satisfies f^2 = 1, and so

∑_S f(S)^2 = 1.
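These formulas are easy to check by brute force on small examples. The following Python sketch (an illustration, not part of the notes) computes all Fourier coefficients of a function on a few bits via f(S) = E[f · x_S] and verifies Parseval's identity; the function maj3 is just a running example.

    import itertools
    import numpy as np

    def fourier_coefficients(f, n):
        # f(S) = E[f(x) * x_S(x)], averaged over the whole cube {±1}^n.
        cube = list(itertools.product([1, -1], repeat=n))
        coeffs = {}
        for k in range(n + 1):
            for S in itertools.combinations(range(n), k):
                char = [np.prod([x[i] for i in S]) for x in cube]   # x_S at each point
                coeffs[S] = float(np.mean([f(x) * c for x, c in zip(cube, char)]))
        return coeffs

    maj3 = lambda x: 1 if sum(x) > 0 else -1
    c = fourier_coefficients(maj3, 3)
    print(c)                                    # nonzero only on levels 1 and 3
    print(sum(v ** 2 for v in c.values()))      # Parseval: equals E[maj3^2] = 1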

Combining this with the preceding equation, we get

1 − 2ε = ∑_S f(S)^3 ≤ max_S f(S) · ∑_S f(S)^2 = max_S f(S).

In other words, some Fourier coefficient f(S) must be close to 1 in value! Our earlier formula for the Fourier coefficients shows that

1 − 2ε ≤ f(S) = E[f x_S].

Since both f and x_S are ±1-valued,

E[f x_S] = Pr[f x_S = 1] − Pr[f x_S = −1] = Pr[f = x_S] − Pr[f ≠ x_S] = 2 Pr[f = x_S] − 1.

In other words,

Pr[f = x_S] ≥ 1 − ε.

To conclude, we have shown that if f(x)f(y) = f(xy) with probability 1 − ε, then there exists a set S such that f = x_S with probability at least 1 − ε.

Property testing view Linearity testing is often viewed from the perspective of property testing. In this view, we are given a function f : {±1}^n → {±1} as a black box, and our goal is to find out whether f is a Fourier character or not, by sampling only a few values of f.

What properties can we require of such a test? It is natural to require a Fourier character to always pass the test, and this is known as perfect completeness.

What if f is not a Fourier character? If f is very close to a Fourier character, say results from a Fourier character by changing only a few entries, no test that samples only a few values of f will be able to tell the difference. Hence the most we can say is that if f passes the test then it is probably close to a Fourier character. More accurately, the soundness guarantee is that if f passes the test with probability close to 1, then f is close to a Fourier character.

We can view the above as analyzing the following natural test: choose x, y at random, query f at the locations x, y, xy, and check whether f(x)f(y) = f(xy). If f is a Fourier character then the test always passes (perfect completeness). Conversely, if the test passes with probability 1 − ε, then f is ε-close to a Fourier character (soundness), meaning that Pr[f ≠ x_S] ≤ ε for some Fourier character x_S.
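As a small illustration (a sketch under the same black-box setup, not part of the notes), the test can be run as a randomized procedure that queries f at three correlated points per trial:

    import itertools, random

    def linearity_test(f, n, trials=20000):
        # Estimate Pr[f(x)f(y) = f(xy)] over uniformly random x, y in {±1}^n.
        passes = 0
        for _ in range(trials):
            x = [random.choice([1, -1]) for _ in range(n)]
            y = [random.choice([1, -1]) for _ in range(n)]
            xy = [a * b for a, b in zip(x, y)]
            passes += (f(x) * f(y) == f(xy))
        return passes / trials

    character = lambda x: x[0] * x[2]
    flips = {x: random.random() < 0.05 for x in itertools.product([1, -1], repeat=5)}
    noisy = lambda x: -character(x) if flips[tuple(x)] else character(x)
    print(linearity_test(character, 5))   # 1.0: perfect completeness
    print(linearity_test(noisy, 5))       # noticeably below 1, but still high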

List decoding regime A random function f satisfies f(x)f(y) = f(xy) with probability very close to 1/2. Intuitively, this is because for any given x, y, the probability that f(x)f(y) = f(xy) is exactly 1/2. (To formalize this argument, we need to show that Pr[f(x)f(y) = f(xy)] is concentrated around its mean 1/2, which we can do using Chebyshev's inequality.) Therefore if a function satisfies f(x)f(y) = f(xy) with probability noticeably larger than 1/2, the function is not random. What can we say about such functions?

Our argument above actually shows that

max_S Pr[f = x_S] ≥ Pr[f(x)f(y) = f(xy)].

Therefore if f(x)f(y) = f(xy) happens more often than in a random function, this implies that f has non-trivial correlation with some Fourier character.

Exercise Repeat the analysis above, replacing the test f(x)f(y) = f(xy) with the test f(x)f(y)f(z) = f(xyz).

2 Polymorphisms of majority

Here is another way of viewing Fourier characters: they are polymorphisms of the predicate P(x, y, z) on {±1}^3 which holds when xyz = 1. A function f : D^n → D is a polymorphism of a predicate P ⊆ D^m if whenever vectors x^1, . . . , x^n satisfy the predicate P, then so does the vector (f(x^1_1, . . . , x^n_1), . . . , f(x^1_m, . . . , x^n_m)). Pictorially, we can think of an n × m table:

x^1_1   · · ·   x^1_m
x^2_1   · · ·   x^2_m
 ⋮       ⋱       ⋮
x^n_1   · · ·   x^n_m
f(x^1_1, . . . , x^n_1)   · · ·   f(x^1_m, . . . , x^n_m)

Here the last row is generated from the rest of the table by applying f columnwise. The guarantee is that if all original rows satisfy P, then so does the final one.

A particular case of interest is when the predicate P is truth-functional, that is, arises from a function φ : D^{m−1} → D. In this case, P(x_1, . . . , x_m) holds if x_m = φ(x_1, . . . , x_{m−1}). The predicate P considered above arises from the function φ(x, y) = xy, which becomes the XOR function if we switch from ±1 to {0, 1}.

Today we would like to consider the predicate arising from the majority function MAJ : {±1}^3 → {±1}. A function f : {±1}^n → {±1} is a polymorphism of MAJ if for all x, y, z ∈ {±1}^n,

MAJ(f(x), f(y), f(z)) = f(MAJ(x_1, y_1, z_1), . . . , MAJ(x_n, y_n, z_n)). (1)

In contrast to the analysis in Section 1, in this case we won't get far if we just try to substitute the Fourier expansion in all places. Instead, we will fix x and average over y, z.

To see what we get on the left-hand side, we need to compute the Fourier expansion of MAJ. The easiest way to do this is using the formula from Section 1:

MAJ(∅) = E[MAJ] = 0,
MAJ({1}) = E[MAJ · x_1] = (1/2) E[MAJ | x_1 = 1] − (1/2) E[MAJ | x_1 = −1] = (1/2)(3/4 − 1/4) − (1/2)(−3/4 + 1/4) = 1/2,
MAJ({1, 2, 3}) = E[MAJ · x_1 x_2 x_3] = (1 − 3 − 3 + 1)/8 = −1/2.

What about MAJ({1, 2}) = E[MAJ · x_1 x_2]? Since MAJ is an odd function, that is, MAJ(−x_1, −x_2, −x_3) = −MAJ(x_1, x_2, x_3), we have

E[MAJ(x_1, x_2, x_3) x_1 x_2] = E[−MAJ(−x_1, −x_2, −x_3) x_1 x_2] = −E[MAJ(y_1, y_2, y_3)(−y_1)(−y_2)],

where y_i = −x_i. Since (−y_1)(−y_2) = y_1 y_2, we conclude that E[MAJ(x_1, x_2, x_3) x_1 x_2] = 0. Altogether,

MAJ(x_1, x_2, x_3) = (x_1 + x_2 + x_3 − x_1 x_2 x_3)/2.

Plugging in this Fourier expansion and taking the expectation over y, z, the left-hand side of (1) becomes

(f(x) + 2E[f] − f(x) E[f]^2)/2 = ((1 − E[f]^2)/2) f(x) + E[f].

Now let us turn to the right-hand side of (1):

E_{y,z}[f(MAJ(x_1, y_1, z_1), . . . , MAJ(x_n, y_n, z_n))] = ∑_S f(S) ∏_{i ∈ S} E_{y_i,z_i}[MAJ(x_i, y_i, z_i)].

If x_i = 1, then MAJ(x_i, y_i, z_i) = 1 with probability 3/4, and so E[MAJ(x_i, y_i, z_i)] = 1/2. Similarly, if x_i = −1 then the expectation equals −1/2, and so we can say that it equals x_i/2. Therefore the right-hand side of (1) equals

∑_S f(S) ∏_{i ∈ S} (x_i/2) = ∑_S (1/2)^{|S|} f(S) x_S.

This function is usually denoted T_{1/2} f, where T_ρ is the noise operator, which multiplies the S-th Fourier coefficient by ρ^{|S|}.

This sort of dependence on |S| is very common in Fourier analysis, and it suggests decomposing the Fourier expansion into levels according to the size of S:

f = ∑_{d=0}^{n} ∑_{|S|=d} f(S) x_S.

The d'th level of the Fourier expansion consists of the coefficients f(S) with |S| = d. We often use the notation f^{=d} for the sum above:

f^{=d} = ∑_{|S|=d} f(S) x_S.

Using this notation, we can express the noise operator more succinctly:

T_ρ f = ∑_{d=0}^{n} ρ^d f^{=d}.
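As a quick illustration (a sketch, not part of the notes), the two descriptions of the noise operator, damping the Fourier coefficient of x_S by ρ^{|S|} and averaging f over rerandomized inputs, can be compared numerically for a small function:

    import itertools, random
    import numpy as np

    def noise_via_fourier(f, n, rho, x):
        # T_rho f(x) = sum_S rho^{|S|} f(S) x_S(x), with f(S) computed by brute force.
        cube = list(itertools.product([1, -1], repeat=n))
        total = 0.0
        for k in range(n + 1):
            for S in itertools.combinations(range(n), k):
                x_S = lambda z: np.prod([z[i] for i in S])
                coeff = np.mean([f(z) * x_S(z) for z in cube])
                total += (rho ** k) * coeff * x_S(x)
        return total

    def noise_via_sampling(f, rho, x, trials=100000):
        # T_rho f(x) = E[f(w)], where w_i = x_i with probability (1+rho)/2, else -x_i.
        acc = 0.0
        for _ in range(trials):
            w = [xi if random.random() < (1 + rho) / 2 else -xi for xi in x]
            acc += f(w)
        return acc / trials

    maj3 = lambda z: 1 if sum(z) > 0 else -1
    print(noise_via_fourier(maj3, 3, 0.5, (1, 1, -1)))   # exactly 0.3125
    print(noise_via_sampling(maj3, 0.5, (1, 1, -1)))     # approximately 0.3125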

For future reference, let us give another interpretation of T_ρ f. On input x, T_ρ f(x) = E[f(w)], where w_i = x_i with probability (1+ρ)/2 and w_i = −x_i with probability (1−ρ)/2 (this interpretation makes sense as long as |ρ| ≤ 1). Indeed,

E[w_i] = ((1+ρ)/2) x_i − ((1−ρ)/2) x_i = ρ x_i,

and so this generalizes the case of MAJ(x_i, y_i, z_i).

Back to (1), which we have shown to be equivalent to

((1 − E[f]^2)/2) f(x) + E[f] = T_{1/2} f(x).

This equation holds for any x, and so we can think of it as an identity of functions. Replacing each side with its Fourier expansion, we obtain

((1 − E[f]^2)/2) f(∅) + E[f] + ∑_{S ≠ ∅} ((1 − E[f]^2)/2) f(S) x_S = ∑_S (1/2^{|S|}) f(S) x_S.

Since the Fourier expansion is unique, we can compare coefficients on both sides. In particular, the free coefficients must be equal:

((1 − E[f]^2)/2) f(∅) + E[f] = f(∅).

This is a good point to mention that E[f] = E[f x_∅] = f(∅), which allows us to simplify the equation:

((1 − E[f]^2)/2) E[f] + E[f] = E[f].

Therefore either E[f] = 0, or else E[f]^2 = 1. In the latter case, E[f] = ±1, and so f = ±1 is constant.

The more interesting case is when E[f] = 0. The Fourier expansions simplify to

(1/2) ∑_S f(S) x_S = ∑_S (1/2^{|S|}) f(S) x_S.

This shows that f(S) ≠ 0 only if S belongs to the first level, and so f has the form

f = ∑_{i=1}^{n} c_i x_i.

We say that f has degree 1, since the largest non-zero Fourier coefficient is on level 1, and moreover f is homogeneous, since all non-zero Fourier coefficients belong to the same level.

What does f look like? We claim that at most one coefficient c_i is non-zero. Indeed, suppose that c_1, c_2 ≠ 0, and assume without loss of generality that c_1, c_2 > 0. Then

f(1, 1, 1, . . . , 1) > f(1, −1, 1, . . . , 1) > f(−1, −1, 1, . . . , 1),

so f takes at least three distinct values, which is impossible since f is ±1-valued. Thus f = c_i x_i. Since f is Boolean, in fact f = ±x_i.

Concluding, we have shown that if f is a polymorphism of MAJ, then f is either a constant or of the form ±x_i. We call functions of the form ±x_i dictators, since they are dictated by the i'th coordinate of the input. (Sometimes only x_i is called a dictator, and −x_i is called an anti-dictator.)


Exercise Determine all Boolean functions (that is, functions f : {±1}^n → {±1}) which have degree at most 1.

Exercise Find the polymorphisms of the predicate NAE on {±1}^3, which holds when the inputs are not all equal.

2.1 Approximate polymorphisms

Exact polymorphisms of XOR are Fourier characters. Section 1 shows something stronger: if f : {±1}^n → {±1} is an approximate polymorphism of XOR, meaning that the polymorphism property holds for most tables, then f is close to some Fourier character.

What happens in the case of MAJ? Suppose that f : {±1}^n → {±1} satisfies

Pr[MAJ(f(x), f(y), f(z)) = f(MAJ(x_1, y_1, z_1), . . . , MAJ(x_n, y_n, z_n))] = 1 − ε.

Is it true that f must be close to a constant or to a dictatorship?

We would like to repeat the analysis we had before. Let us try to see which parts of it we can salvage. If we denote the left-hand side by g(x, y, z) and the right-hand side by h(x, y, z), then we are given that Pr[g = h] = 1 − ε. Above we have calculated

G(x) := E_{y,z}[g(x, y, z)] = ((1 − E[f]^2)/2) f + E[f],
H(x) := E_{y,z}[h(x, y, z)] = T_{1/2} f,

and then compared the Fourier expansions of G and H.

If G = H then the Fourier expansions of G and H must be equal. In our case, G ≈ H, a notion that we will have to formalize. What do we require of this notion? We need it to follow from the assumption Pr[g = h] = 1 − ε, and we want it to imply something about the Fourier expansions of G and H.

It turns out that the correct way to formalize G ≈ H is by considering ‖G − H‖^2 = E[(G − H)^2]. Indeed, Parseval's identity shows that

‖G − H‖^2 = ∑_S (G(S) − H(S))^2.

It remains to bound ‖G − H‖^2 using the promise on g, h. The first step is to "undo" the expectation over y, z:

E[(G − H)^2] = E_x[(E_{y,z}[g(x, y, z) − h(x, y, z)])^2] ≤ E_{x,y,z}[(g(x, y, z) − h(x, y, z))^2].

Indeed, for every x, we can think of g(x, y, z) − h(x, y, z) as a random variable R_x (corresponding to the experiment of choosing y, z uniformly at random). The inequality then reads

E_x[E[R_x]^2] ≤ E_x[E[R_x^2]],

which follows from E[R_x]^2 ≤ E[R_x^2] (this basic inequality states that V[R_x] ≥ 0, or follows from convexity of t^2 by Jensen's inequality).

Now (g(x, y, z) − h(x, y, z))^2 equals 4 if g(x, y, z) ≠ h(x, y, z) and 0 if g(x, y, z) = h(x, y, z), and so

E[(G − H)^2] ≤ 4 Pr[g ≠ h] = 4ε.

Substituting the Fourier expansions of G and H, we conclude that

(((1 − E[f]^2)/2) E[f])^2 + ∑_{S ≠ ∅} ((1 − E[f]^2)/2 − 1/2^{|S|})^2 f(S)^2 ≤ 4ε.

We will now try to follow our steps in the analysis of the case ε = 0. The first step was to consider the expectation. Using the triangle inequality,

|E_{x,y,z}[MAJ(f(x), f(y), f(z))] − E_{x,y,z}[f(MAJ(x_1, y_1, z_1), . . . , MAJ(x_n, y_n, z_n))]| ≤
E_{x,y,z}[|MAJ(f(x), f(y), f(z)) − f(MAJ(x_1, y_1, z_1), . . . , MAJ(x_n, y_n, z_n))|] =
2 Pr[MAJ(f(x), f(y), f(z)) ≠ f(MAJ(x_1, y_1, z_1), . . . , MAJ(x_n, y_n, z_n))] = 2ε.

Substituting the expressions for the expectations on the left-hand side,

|((1 − E[f]^2)/2) E[f] + E[f] − E[f]| ≤ 2ε  ⟹  |E[f] − 1| · |E[f] + 1| · |E[f]| ≤ 4ε.

Now suppose that among {0, ±1}, E[f] is closest to a. Then for any other b ∈ {0, ±1}, the distance from E[f] to b is at least 1/2. Therefore E[f] is 16ε-close to some a ∈ {0, ±1}.

If E[f] is 16ε-close to a ∈ {±1} then Pr[f = a] ≥ 1 − 8ε. Otherwise, |E[f]| ≤ 16ε, and we concentrate on the rest of the sum:

∑_{S ≠ ∅} ((1 − E[f]^2)/2 − 1/2^{|S|})^2 f(S)^2 ≤ 4ε.

If |S| = 1 then the coefficient (1 − E[f]^2)/2 − 1/2 could be close to zero. But for larger |S|, on the one hand (1 − E[f]^2)/2 ≥ 1/2 − 128ε^2, which is at least 1/3 when ε ≤ 1/100, and on the other hand, 1/2^{|S|} ≤ 1/4. Therefore the coefficient is at least 1/144, assuming that ε ≤ 1/100 (if ε ≥ 1/100 then f is trivially 100ε-close to any Boolean function, so this case is not interesting). Concentrating only on that part of the sum yields

∑_{|S|>1} f(S)^2 ≤ 576ε.

The sum on the left is the squared norm of the function f^{>1}, which consists of all levels of f beyond level 1. We are therefore led to the following question:

What can we say about Boolean functions f satisfying ‖f^{>1}‖^2 ≤ ε?

We will answer this question next week.

3 Friedgut–Kalai–Naor theorem

Suppose that f : {±1}^n → {±1} satisfies ‖f^{>1}‖^2 ≤ ε. What can we say about f? Does it have to be close to a constant or a dictator?

Let us rephrase this question in terms of g = f^{≤1}, that is,

g = f(∅) + ∑_{i=1}^{n} f({i}) x_i.

This is the orthogonal projection of f to the space of functions of degree 1 (by which we really mean the space of functions of degree at most 1), which means that it is the degree 1 function minimizing ‖f − g‖^2. This is because the Fourier characters form an orthonormal basis.

What can we say about the function g? It is close to f, in the sense that ‖g − f‖^2 ≤ ε. In particular, using the notation

dist(y, S) = min_{z ∈ S} |y − z|,

since f is ±1-valued, we deduce that

E[dist(g, {±1})^2] ≤ E[(g − f)^2] ≤ ε.

Our working hypothesis is that f should be close to some Boolean function r, which is a constant or a dictator, in the sense that Pr[f ≠ r] ≤ δ, where δ depends on ε. If this is the case, then we expect g to be close to r as well. Indeed,

‖g − r‖^2 = E[(g − r)^2] ≤ E[2(g − f)^2 + 2(f − r)^2] ≤ 2‖g − f‖^2 + 8 Pr[f ≠ r] = O(ε + δ).

Here we used two useful facts: the inequality (a + b)^2 ≤ 2a^2 + 2b^2 (a special case of Cauchy–Schwarz), and ‖f − r‖^2 = 4 Pr[f ≠ r] (since (f − r)^2 ∈ {0, 4}).

Conversely, suppose that we show that ‖g − r‖^2 ≤ η for some Boolean function r which is either a constant or a dictator. Then an identical argument shows that ‖f − r‖^2 = O(ε + η), and so Pr[f ≠ r] = O(ε + η).

This shows that the following questions are essentially equivalent:

1. Understanding the structure of Boolean functions f such that ‖f^{>1}‖^2 is small.

2. Understanding the structure of degree 1 functions g such that E[dist(g, {±1})^2] is small.

We will focus on the second.

Getting rid of the constant term Our goal is to show that if deg g ≤ 1 and E[dist(g, {±1})^2] = ε, then g is close to a constant or a dictator. Since deg g ≤ 1, we can represent it as

g = c_0 + ∑_{i=1}^{n} c_i x_i.

It will be convenient to get rid of the constant coefficient, by considering the function

h = ∑_{i=0}^{n} c_i x_i.

This function satisfies

h(+1, x_1, . . . , x_n) = g(x_1, . . . , x_n),
h(−1, x_1, . . . , x_n) = −g(−x_1, . . . , −x_n).

This shows that E[dist(h, {±1})^2] = E[dist(g, {±1})^2], hence it suffices to understand functions like h. We will show that h must be close to a dictator, and it will follow that g must be close to a constant or a dictator.

Getting rid of large coefficients If h is close to a dictator r, say r = ±x_0, then this would mean that ‖h − r‖^2 is small, and so the following is small:

(±1 − c_0)^2 + ∑_{i=1}^{n} c_i^2.

In other words, one of the coefficients is close to ±1, and the others are close to zero. It turns out that it is easy to show this for individual coefficients; most of the effort will be to show that this holds in aggregate.

Consider the coefficient c_0. We know that

E_{x_1,...,x_n} E_{x_0}[dist(h, {±1})^2] ≤ ε,

hence there is a choice of x_1, . . . , x_n such that

dist(H + c_0, {±1})^2 + dist(H − c_0, {±1})^2 ≤ 2ε,  where H = ∑_{i=1}^{n} c_i x_i.

Suppose without loss of generality that H + c_0 is closer to +1, say H + c_0 = 1 + γ, where γ^2 ≤ 2ε. There are now two cases to consider. First, suppose that H − c_0 is also closer to +1. Since H − c_0 = 1 + γ − 2c_0, we have (2c_0 − γ)^2 ≤ 2ε, and so

(2c_0)^2 ≤ 2γ^2 + 2(2c_0 − γ)^2 = O(ε),

and so c_0^2 = O(ε). If H − c_0 is closer to −1, then (2c_0 − γ − 2)^2 ≤ 2ε, and so

(2c_0 − 2)^2 ≤ 2γ^2 + 2(2c_0 − γ − 2)^2 = O(ε),

hence (c_0 − 1)^2 = O(ε). If H + c_0 were closer to −1, then we would also have the option (c_0 + 1)^2 = O(ε).

So far we have shown that each individual coefficient is somewhat close to {0, ±1}. Next, we will show that there cannot be two coefficients which are "large", using an argument similar to how we showed that Boolean degree 1 functions are constants or dictators.

For this part of the argument, we will need ε to be "small enough", that is, smaller than some absolute constant. This is an assumption which we commonly make, since usually structure results are trivial when ε is large. For example, suppose that we aim to conclude, eventually, that h is O(ε)-close to a dictator. If ε ≥ 1/100 then this is automatically satisfied for any dictator, by choosing the big O constant appropriately. Indeed, if r is any dictator and round(h, {±1}) results from rounding h to the nearest value in {±1}, then

‖h − r‖^2 = E[(h − r)^2] ≤ 2 E[(h − round(h, {±1}))^2] + 2 E[(round(h, {±1}) − r)^2] ≤ 2ε + 8 ≤ 802ε.

Suppose that c_0, c_1 are both large, say (c_0 − 1)^2, (c_1 − 1)^2 = O(ε). We know that

E_{x_2,...,x_n} E_{x_0,x_1}[dist(h, {±1})^2] ≤ ε,

hence there is a choice of x_2, . . . , x_n such that

dist(H + c_0 + c_1, {±1})^2 + dist(H − c_0 − c_1, {±1})^2 ≤ 4ε.

(We removed two terms.) The idea now is that since c_0 + c_1 ≈ 2, it is impossible for H + (c_0 + c_1) and H − (c_0 + c_1) to both be close to ±1, assuming ε is small enough.

Formally, suppose that H + c_0 + c_1 is closer to a ∈ {±1} and that H − c_0 − c_1 is closer to b ∈ {±1}. Then

(2(c_0 + c_1) − (a − b))^2 ≤ 2(H + c_0 + c_1 − a)^2 + 2(H − c_0 − c_1 − b)^2 ≤ 8ε.

This implies that

(c_0 + c_1 − (a − b)/2)^2 ≤ 2ε.

On the other hand,

(c_0 + c_1 − 2)^2 ≤ 2(c_0 − 1)^2 + 2(c_1 − 1)^2 ≤ 4ε.

Altogether, this shows that ((a − b)/2 − 2)^2 ≤ 12ε. Since (a − b)/2 ≤ 1, this is impossible as long as ε < 1/12.

Summarizing, assuming that ε < 1/12, at most one of the c_i can be "large". If such a coefficient exists, let us assume that it is c_0. We are now at the following situation: we know that c_0 is close to C ∈ {0, ±1}, and that c_1, . . . , c_n are each individually close to 0. This suggests that h is close to Cx_0. However,

‖h − Cx_0‖^2 = (c_0 − C)^2 + ∑_{i=1}^{n} c_i^2.

We would like to show that the right-hand side is small. So far, all we know is that each particular summand is O(ε), but it doesn't follow that the sum itself is small, since there are n + 1 many summands! The main part of the proof is to show that the coefficients are close to {0, ±1} in aggregate.

Main argument The intuition here is that since the coefficients c_1, . . . , c_n are small, the distribution of ∑_i c_i x_i is close to a normal distribution with zero mean and variance ∑_i c_i^2. On the other hand, we know that ∑_i c_i x_i must be close to ±1 − c_0. Since a normal distribution is "smooth", this can only happen if ∑_i c_i x_i is concentrated around one of the values ±1 − c_0, and in particular, ∑_i c_i^2 must be small.

Arguing this formally requires some work. We will do so by induction. Under the assumption that c_1^2, . . . , c_n^2 ≤ Kε (which is what we get from the preceding step), we will show, by induction on m, that in fact

∑_{i=1}^{m} c_i^2 ≤ Kε.

The base case m = 1 is trivial, so let us assume that ∑_{i=1}^{m} c_i^2 ≤ Kε, and show that ∑_{i=1}^{m+1} c_i^2 ≤ Kε, assuming that c_{m+1}^2 ≤ Kε and that ε is small enough.

First of all, we note that there is a setting of x_0, x_{m+2}, . . . , x_n for which

E[dist(H + ∑_{i=1}^{m+1} c_i x_i, {±1})^2] ≤ ε,  where H = c_0 x_0 + ∑_{i=m+2}^{n} c_i x_i.

The remaining coefficients c_1, . . . , c_{m+1} satisfy

∑_{i=1}^{m+1} c_i^2 ≤ 2Kε.

Let us now eliminate H. We can write

2 ∑_{i=1}^{m+1} c_i x_i = (H + ∑_{i=1}^{m+1} c_i x_i) − (H + ∑_{i=1}^{m+1} c_i (−x_i)).

Denoting by a, b ∈ {±1} the values that the two expressions on the right are closest to, this gives

(2 ∑_{i=1}^{m+1} c_i x_i − (a − b))^2 ≤ 2 (H + ∑_{i=1}^{m+1} c_i x_i − a)^2 + 2 (H + ∑_{i=1}^{m+1} c_i (−x_i) − b)^2.

Taking expectation and dividing by 4, this shows that

E[dist(∑_{i=1}^{m+1} c_i x_i, {0, ±1})^2] ≤ ε,

since (a − b)/2 ∈ {0, ±1}.

∑m+1i=1 cixi is small, this sum is concentrated around its mean, which is zero. Hence

we expect that most of the time,∑m+1i=1 cixi would be closest to 0, and this would imply that

n∑i=1

c2i = E

( n∑i=1

cixi

)2 ≈ E

dist

(m+1∑i=1

cixi, 0,±1

)2 ≤ ε.

In order to argue this formally, let us notice that if s :=∑m+1i=1 cixi is not closest to 0, then |s| ≥ 1/2, and

so s2 ≤ 4s4 (we will see below why this is useful). This shows that(m+1∑i=1

cixi

)2

≤ dist

(m+1∑i=1

cixi, 0,±1

)2

+ 4

(m+1∑i=1

cixi

)4

.

11

Indeed, if the sum is closest to 0 then the term on the left equals the first term on the right, and otherwise itis bounded by the second term on the right. We conclude that

m+1∑i=1

c2i = E

(m+1∑i=1

cixi

)2 ≤ E

dist

(m+1∑i=1

cixi, 0,±1

)2+4E

(m+1∑i=1

cixi

)4 ≤ ε+4E

(m+1∑i=1

cixi

)4 .

What do we do about the second term on the right? It is four times the expectation of

∑_{i,j,k,ℓ} c_i c_j c_k c_ℓ x_i x_j x_k x_ℓ.

Most of the terms here vanish: indeed, if i ∉ {j, k, ℓ} then E[x_i x_j x_k x_ℓ] = E[x_i] E[x_j x_k x_ℓ] = 0. There are two types of terms that survive: i = j ≠ k = ℓ (and their two permutations), and i = j = k = ℓ. Since E[x_i^2 x_k^2] = E[x_i^4] = 1, we can bound

E[(∑_{i=1}^{m+1} c_i x_i)^4] ≤ 3 ∑_{i=1}^{m+1} ∑_{j=1}^{m+1} c_i^2 c_j^2 + ∑_{i=1}^{m+1} c_i^4 ≤ 3 (∑_{i=1}^{m+1} c_i^2)^2 + Kε ∑_{i=1}^{m+1} c_i^2,

since c_1^2, . . . , c_{m+1}^2 ≤ Kε. By assumption, ∑_i c_i^2 ≤ 2Kε, and so putting everything together, we conclude that

∑_{i=1}^{m+1} c_i^2 ≤ ε + 4 · [3 · (2Kε)^2 + (Kε) · (2Kε)] = ε + 56K^2 ε^2.

We would like this to be at most Kε, assuming that ε is small enough. Possibly increasing K so that it is at least 2, it suffices to assume that ε ≤ 1/(112K) to make the inductive step go through.

Concluding the argument Our inductive proof shows that

∑_{i=1}^{n} c_i^2 = O(ε).

We also know that (c_0 − C)^2 = O(ε), where C ∈ {0, ±1}. This shows that

‖h − Cx_0‖^2 = (c_0 − C)^2 + ∑_{i=1}^{n} c_i^2 = O(ε).

It remains to show that C ≠ 0. Indeed,

dist(C, {±1})^2 = E[dist(Cx_0, {±1})^2] ≤ 2 E[(h − Cx_0)^2] + 2 E[dist(h, {±1})^2] = O(ε),

which for small enough ε implies that C ≠ 0.

Altogether, we have proved the following results, due to Friedgut, Kalai and Naor [FKN02]:

If g : {±1}^n → R is a degree 1 function satisfying E[dist(g, {±1})^2] ≤ ε, then there is a Boolean function r : {±1}^n → {±1}, which depends on at most one input, such that ‖g − r‖^2 = O(ε).

If f : {±1}^n → {±1} satisfies ‖f^{>1}‖^2 ≤ ε then there is a Boolean function r : {±1}^n → {±1}, which depends on at most one input, such that Pr[f ≠ r] = O(ε).

Together with the results of Section 2.1, we have completed the proof of the following result:

If f : {±1}^n → {±1} satisfies

Pr_{x,y,z ∈ {±1}^n}[MAJ(f(x), f(y), f(z)) = f(MAJ(x_1, y_1, z_1), . . . , MAJ(x_n, y_n, z_n))] ≥ 1 − ε

then there exists a function r : {±1}^n → {±1}, depending on at most one input, such that Pr[f ≠ r] = O(ε).

Exercise Find all functions f : {±1}^n → {0, ±1} such that deg f ≤ 1. Then characterize all functions f : {±1}^n → {0, ±1} such that ‖f^{>1}‖^2 ≤ ε.
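The FKN statement is easy to probe empirically on small examples. The sketch below (an illustration, not part of the notes) measures ‖f^{>1}‖^2 for a given Boolean function and compares it with the distance to the best constant or dictator; noisy_dictator is a hypothetical test case.

    import itertools
    import numpy as np

    def fkn_check(f, n):
        cube = list(itertools.product([1, -1], repeat=n))
        fvals = np.array([f(x) for x in cube], dtype=float)
        # Weight above level 1: ||f^{>1}||^2 = E[f^2] - f(emptyset)^2 - sum_i f({i})^2.
        mean = fvals.mean()
        level1 = [np.mean([f(x) * x[i] for x in cube]) for i in range(n)]
        above1 = np.mean(fvals ** 2) - mean ** 2 - sum(c ** 2 for c in level1)
        # Distance to the nearest constant or (anti-)dictator.
        candidates = [lambda x: 1, lambda x: -1]
        candidates += [lambda x, i=i, s=s: s * x[i] for i in range(n) for s in (1, -1)]
        closest = min(np.mean([f(x) != r(x) for x in cube]) for r in candidates)
        return above1, closest

    # A dictator perturbed on a single point of the cube.
    noisy_dictator = lambda x: x[0] if x.count(-1) != 4 else -x[0]
    print(fkn_check(noisy_dictator, 4))   # both quantities are small, as FKN predicts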

4 Voting and influences

Consider an election between two candidates, −1 and 1. There are n voters, and each one votes either −1 or 1. The outcome of the election is given by a function f : {±1}^n → {±1}. In many places f is the majority function, but sometimes more sophisticated functions are used, for example in the United States. It is natural to assume that f is monotone, that is, if a voter changes their vote from −1 to 1, then the outcome cannot change from 1 to −1.

Suppose furthermore that each voter independently tosses a fair coin, a simplifying assumption which is not too far from reality. How many votes need to be "bought" in order to force the outcome, with probability 2/3, say?

In the case of majority, it suffices to bribe Θ(√n) voters. To see this, denote the votes by x_1, . . . , x_n, and note that by the central limit theorem, x_1 + · · · + x_n has roughly Gaussian distribution with zero mean and standard deviation √n. If we bribe the last C√n voters to vote 1, then x_1 + · · · + x_n has roughly Gaussian distribution with mean C√n and standard deviation √(n − C√n) ≤ √n. In particular, the probability that the sum is positive is roughly the probability that a standard Gaussian is at least −C, which tends to a positive constant; the constant is 1 − Θ(e^{−C^2/2}/C), and in particular, it tends to 1 as C → ∞.

Ajtai and Linial [AL93] constructed a function in which Ω(n/ log^2 n) voters need to be bribed. This is almost optimal, due to a fundamental result of Kahn, Kalai and Linial [KKL88], which shows that for any function f, there is a set of O(n/ log n) voters whose bribing makes the outcome biased, in the sense that one of the candidates wins with probability 2/3.

Kahn, Kalai and Linial choose which voters to bribe sequentially. The first voter to bribe is the one with the largest influence on the outcome of the election. For a Boolean function f, it is natural to define the i'th influence of f by

Inf_i[f] = Pr[f(x) ≠ f(x^{(i)})],

where x^{(i)} results from negating the i'th coordinate.

The i'th influence is closely related to the Laplacian in direction i, which is given by

L_i f(x) = (f(x) − f(x^{(i)}))/2.

Indeed, if f is Boolean, then |L_i f(x)| is the indicator of f(x) ≠ f(x^{(i)}).

The effect of negating the i'th coordinate on a Fourier character x_S is easy to describe: if i ∉ S then the character stays the same, and otherwise it is negated. Therefore

L_i f = (1/2) ∑_S f(S) x_S − (1/2) ∑_{S : i ∉ S} f(S) x_S + (1/2) ∑_{S : i ∈ S} f(S) x_S = ∑_{S : i ∈ S} f(S) x_S.

In particular, Parseval's identity shows that

‖L_i f‖^2 = ∑_{S : i ∈ S} f(S)^2.

All of this makes sense even for non-Boolean f. When f is Boolean, L_i f(x)^2 is exactly the indicator of f(x) ≠ f(x^{(i)}), since L_i f ∈ {0, ±1}. This shows that

Inf_i[f] = ‖L_i f‖^2 = ∑_{S : i ∈ S} f(S)^2.

We adopt this definition of influence even for non-Boolean f.

What happens if we bribe voter i to always vote 1? By how much does that increase the probability that the outcome is 1? We can choose a random x by first choosing all coordinates other than x_i, collectively known as x_{−i}, and then choosing x_i. If f(x_{−i}, −1) = f(x_{−i}, 1), where the second argument is x_i, then bribing voter i makes no difference. Otherwise, since f is monotone, bribing voter i increases the probability of the outcome 1 from 1/2 to 1. Overall, this shows that

Pr_{x_i = 1}[f(x) = 1] = Pr[f(x) = 1] + (1/2) Inf_i[f].

In the extreme case when f = x_i, we have Inf_i[f] = 1, and indeed the probability of the outcome 1 increases from 1/2 to 1.
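The two formulas for influence, the combinatorial one and the Fourier one, agree, which is easy to confirm numerically. The sketch below (an illustration, not part of the notes) computes both for majority on five bits; the value 0.375 matches the Θ(1/√n) behaviour of majority influences.

    import itertools
    import numpy as np

    def influence_combinatorial(f, n, i):
        # Inf_i[f] = Pr[f(x) != f(x with coordinate i negated)].
        cube = list(itertools.product([1, -1], repeat=n))
        flip = lambda x: tuple(-v if j == i else v for j, v in enumerate(x))
        return np.mean([f(x) != f(flip(x)) for x in cube])

    def influence_fourier(f, n, i):
        # Inf_i[f] = sum over S containing i of f(S)^2.
        cube = list(itertools.product([1, -1], repeat=n))
        total = 0.0
        for k in range(1, n + 1):
            for S in itertools.combinations(range(n), k):
                if i in S:
                    coeff = np.mean([f(x) * np.prod([x[j] for j in S]) for x in cube])
                    total += coeff ** 2
        return total

    maj5 = lambda x: 1 if sum(x) > 0 else -1
    print(influence_combinatorial(maj5, 5, 0))   # 0.375
    print(influence_fourier(maj5, 5, 0))         # 0.375 (same, by Parseval)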

Before continuing with the Kahn–Kalai–Linial strategy, let us see what happens if we choose who to bribe at random. The effect depends on an important quantity known as the total influence:

Inf[f] = ∑_{i=1}^{n} Inf_i[f].

Indeed, the calculation above shows that the output is skewed by Inf[f]/(2n). The total influence has a nice formula in terms of the Fourier coefficients:

Inf[f] = ∑_{i=1}^{n} Inf_i[f] = ∑_{i=1}^{n} ∑_{S : i ∈ S} f(S)^2 = ∑_S |S| f(S)^2 = ∑_{d=0}^{n} d ‖f^{=d}‖^2.

Total influence also has a nice combinatorial interpretation as the edge boundary. Suppose that f is a Boolean function, which is the indicator function of some subset A ⊆ {±1}^n of the Boolean cube. The i'th influence Inf_i[f] measures the number of edges of the cube in direction i crossing from A to its complement, and so Inf[f] measures the total number of edges crossing from A to its complement.

A variant of this interpretation is average sensitivity. The sensitivity of f at a point x is the number of coordinates i such that f(x) ≠ f(x^{(i)}). The average sensitivity of f is simply Inf[f].

Finally, total influence can also be defined in terms of the Laplacian of f, which is Lf = ∑_i L_i f:

Inf[f] = E[f · Lf] (we prove this in Section 13.1).

An important inequality involving total influence is the Poincaré inequality: Inf[f] ≥ V[f]. As a special case, if a Boolean function is balanced, then its average sensitivity is at least 1, which is tight for dictators. The Fourier formula for total influence immediately implies the Poincaré inequality, once we notice that

V[f] = E[f^2] − E[f]^2 = ∑_S f(S)^2 − f(∅)^2 = ∑_{S ≠ ∅} f(S)^2.

Indeed,

Inf[f] = ∑_S |S| f(S)^2 ≥ ∑_{S ≠ ∅} f(S)^2 = V[f].

This bound is almost tight for low-degree functions: if f has degree d then

Inf[f] = ∑_S |S| f(S)^2 ≤ d ∑_{S ≠ ∅} f(S)^2 = d V[f].

Poincaré's inequality implies that if we bribe a random voter to 1, then we bias the outcome by at least V[f]/(2n). This is tight for dictatorships, but far from tight in the case of majority. Indeed, a voter is influential if the other votes split exactly evenly, which happens with probability Θ(1/√n), which is much larger than 1/(2n).

Bribing a random voter is not a good strategy in general, as the case of a dictatorship demonstrates. In order to show that every election can be biased by bribing only O(n/ log n) voters, we will show that every function f has a coordinate whose influence is Ω((log n / n) V[f]), a fundamental result known as the KKL theorem. This theorem is tight, as shown by the example of the Tribes function:

Tribes(x) = ⋁_{i=1}^{n/m} ⋀_{j=1}^{m} x_{i,j},  m = log n − log log n,

where ∨ is the max operator and ∧ is the min operator.

Let us check that Tribes is more or less balanced. Each of the n/m "tribes" evaluates to 1 with probability 2^{−m}, and so the function itself evaluates to −1 with probability

(1 − 2^{−m})^{n/m} ≈ e^{−n/(2^m m)}.

Since 2^m m = (n/ log n)(log n − log log n) ≈ n, this probability is roughly e^{−1}. The same calculation also allows us to estimate the influences of Tribes. For x_{i,j} to be influential, we need all other tribes to evaluate to −1, which happens with probability roughly e^{−1}, and the other coordinates in the tribe to evaluate to 1, which happens with probability 2^{1−m} = 2 log n/n. Overall, all influences are O(log n / n).

Tribes is extremal from the point of view of the maximal influence, and dictators are extremal from the point of view of the average influence. Can we characterize functions which are extremal on either front? The answer in the case of maximal influence is not completely clear, but the answer in the case of average influence was worked out by Friedgut [Fri98]. Let us first make the question precise: What can we say about functions with average influence O(1/n)? Equivalently, what can we say about functions with total influence O(1)?
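Returning to Tribes for a moment, these estimates are easy to sanity-check by sampling. The sketch below (an illustration, not part of the notes, with concrete parameters chosen only for convenience) estimates the bias and a single coordinate's influence empirically.

    import random

    def tribes(x, m):
        # OR of ANDs over consecutive blocks of size m, in the ±1 convention (1 = "true").
        blocks = [x[k:k + m] for k in range(0, len(x), m)]
        return 1 if any(all(v == 1 for v in block) for block in blocks) else -1

    n, m = 1022, 7                       # 146 tribes of size 7; m is roughly log n - log log n
    samples = 10000
    minus = sum(tribes([random.choice([1, -1]) for _ in range(n)], m) == -1
                for _ in range(samples)) / samples
    flips = 0
    for _ in range(samples):
        x = [random.choice([1, -1]) for _ in range(n)]
        y = x.copy()
        y[0] = -y[0]                     # negate the first coordinate
        flips += tribes(x, m) != tribes(y, m)
    print(minus, flips / samples)        # Pr[Tribes = -1] is roughly e^{-1}; influence of order (log n)/n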

First, let us try to see which functions satisfy this. One example is constants and dictators. More generally, if a function depends on O(1) coordinates, then its total influence is O(1). Such a function is called a junta. Friedgut's junta theorem shows that every function with total influence O(1) is close to a junta.

The proofs of the KKL theorem and of Friedgut's theorem are quite similar, but since the second one is more intuitive, we will start by proving Friedgut's theorem. Afterwards we will prove the KKL theorem, and finally, we will show how to use it to bias elections.

4.1 Bonus: L1 influences

We defined the influences of f : {±1}^n → {±1} as

Inf_i[f] = Pr[f(x) ≠ f(x^{(i)})],

and observed that

Inf_i[f] = ‖L_i f‖^2.

In fact, more is true:

Inf_i[f] = ‖L_i f‖_p^p = E[|L_i f(x)|^p]

for all p > 0. We usually concentrate on the case p = 2 since it leads to a formula for Inf_i[f] in terms of Fourier coefficients.

When we consider non-Boolean functions, the choice of p does matter. Aaronson and Ambainis, in the conference version of their work [AA14], implicitly considered the case p = 1. They considered bounded functions f : {±1}^n → [−1, 1], and implicitly assumed that

∑_{i=1}^{n} E[|L_i f|] ≤ deg(f),

an inequality which we saw holds for the usual influences, but isn't known to hold for these "L1 influences". It turns out that the argument of Aaronson and Ambainis can be fixed to use the usual influences.

Backurs and Bavarian [BB14] showed that the "total L1 influence" is O(deg(f)^3), and this was improved to deg(f)^2 in [FHKL16], using approximation theory. Below we present the slightly weaker upper bound 2 deg(f)^2. The best known lower bound is deg(f), achieved by Fourier characters, by other Boolean functions, and by some non-Boolean functions (see [FHKL16, Section 4]). Backurs and Bavarian conjecture that the true answer is O(deg(f)); it might well be deg(f).

Suppose that f : {±1}^n → [−1, 1]. We will bound ∑_{i=1}^{n} E[|L_i f|] by showing that for every x ∈ {±1}^n,

∑_{i=1}^{n} |L_i f(x)| ≤ 2 deg(f)^2.

(This can be improved to deg(f)^2, which is tight for Chebyshev polynomials.) Let's first get rid of the absolute values: it suffices to show that for all x, y ∈ {±1}^n,

∑_{i=1}^{n} y_i L_i f(x) ≤ 2 deg(f)^2.

If S = {i ∈ [n] : y_i = 1}, then the left-hand side is

∑_{i ∈ S} L_i f(x) − ∑_{i ∉ S} L_i f(x) = ∑_{i ∈ S} L_i f(x) + ∑_{i ∉ S} L_i (−f)(x),

and so it suffices to show that for all x ∈ {±1}^n and S ⊆ [n],

∑_{i ∈ S} L_i f(x) ≤ deg(f)^2.

We convert f into a univariate polynomial φ so that the left-hand side equals some derivative of φ. We choose

φ(t) = f(t · x|_S , x|_{[n]∖S}),

where the first part of the input corresponds to the coordinates in S, and the second part to the coordinates outside S. In other words,

φ(t) = ∑_{T ⊆ [n]} t^{|S∩T|} f(T) x_T.

Therefore

φ′(1) = ∑_{T ⊆ [n]} |S ∩ T| f(T) x_T = ∑_{i ∈ S} L_i f(x).

So φ′(1) is what we want to bound.

What do we know about φ? If S = [n] then φ(t) = T_t f(x), and so for t ∈ [−1, 1], φ(t) is an average of values of f on {±1}^n. The same property holds for arbitrary S, and we conclude that |φ(t)| ≤ 1 when |t| ≤ 1. Moreover, clearly deg(φ) ≤ deg(f).

We are now left with the following problem: Given a degree d polynomial φ such that |φ(t)| ≤ 1 for all |t| ≤ 1, how large can φ′(1) be? The answer, due to Bernstein and Markov, is d^2, which is achieved uniquely by the Chebyshev polynomial T_d(x) = cos(d cos^{−1}(x)).
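As a quick numerical aside (not from the notes), the extremal behaviour of the Chebyshev polynomials in this bound can be checked directly with numpy's Chebyshev class:

    from numpy.polynomial.chebyshev import Chebyshev

    for d in range(1, 7):
        T_d = Chebyshev.basis(d)           # T_d is bounded by 1 in absolute value on [-1, 1]
        print(d, T_d.deriv()(1.0))         # the derivative at 1 equals d^2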

5 Friedgut and KKL

5.1 Friedgut’s junta theorem

Let f be a Boolean function with total influence I. Friedgut's junta theorem states that if I is small, then f is close to a junta, which is a function depending on a small number of coordinates. Which coordinates belong to the junta? It is natural to conjecture that the junta J is composed of all influential coordinates, say J consists of all coordinates whose influence is at least τ (we will determine τ later on).

Once we have decided on the junta coordinates J, it is easy to construct the function itself: for every setting of the junta coordinates, we simply take the majority value, obtaining a Boolean function g. In order to understand how close f and g are, it will be useful to consider a third function h, which results from averaging f over all coordinates outside the junta, that is,

h(x_J, x_{−J}) = E_{y_{−J}}[f(x_J, y_{−J})],   g(x_J, x_{−J}) = round(h(x_J, x_{−J}), {±1}).

It is easy to compute the Fourier expansion of h given that of f. Indeed, it is enough to see what happens to a Fourier character x_S. If all variables in S belong to the junta, then the character survives. Otherwise, the character is averaged out (since E[x_i] = 0). Therefore

h = ∑_{S ⊆ J} f(S) x_S,

which implies that

‖f − h‖^2 = ∑_{S ⊄ J} f(S)^2.

We will come back to this expression later, but first, let us see how to relate ‖f − h‖^2 and Pr[f ≠ g]. The idea is very simple. We consider an arbitrary point x, and the three values f(x), g(x), h(x). We know that on average (f(x) − h(x))^2 is small, and furthermore f(x), g(x) are Boolean, and g(x) is obtained from h(x) by rounding. If |h(x)| > 1, then rounding actually brings g(x) closer to f(x). Otherwise, suppose that 0 ≤ h(x) ≤ 1. If f(x) = 1 then g(x) is again close to f(x), and otherwise |g(x) − f(x)| = 2 whereas |h(x) − f(x)| ≥ 1. This shows that (g(x) − f(x))^2 ≤ 4(h(x) − f(x))^2, and so

Pr[f ≠ g] = (1/4) ‖f − g‖^2 ≤ ‖f − h‖^2.

Therefore it suffices to bound ‖f − h‖^2.
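This averaging-and-rounding construction is concrete enough to code up. The sketch below (an illustration, not part of the notes) builds h and g from f for a given candidate junta J and confirms the inequality Pr[f ≠ g] ≤ ‖f − h‖^2 on a small example; rounding ties are broken towards +1.

    import itertools
    import numpy as np

    def junta_approximation(f, n, J):
        cube = list(itertools.product([1, -1], repeat=n))
        # h averages f over the coordinates outside J; g rounds h to ±1.
        def h(x):
            outside = [i for i in range(n) if i not in J]
            vals = []
            for ys in itertools.product([1, -1], repeat=len(outside)):
                z = list(x)
                for i, y in zip(outside, ys):
                    z[i] = y
                vals.append(f(tuple(z)))
            return np.mean(vals)
        g = lambda x: 1 if h(x) >= 0 else -1
        err = np.mean([f(x) != g(x) for x in cube])            # Pr[f != g]
        dist = np.mean([(f(x) - h(x)) ** 2 for x in cube])     # ||f - h||^2
        return err, dist

    # Majority on 5 bits, approximated by a junta on its first three coordinates.
    maj5 = lambda x: 1 if sum(x) > 0 else -1
    print(junta_approximation(maj5, 5, {0, 1, 2}))   # err is at most dist, as claimed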

The formula for ‖f − h‖^2 sums over all squared Fourier coefficients f(S)^2 with S ⊄ J. Each such S contains some coordinate i ∉ J, and by construction, ∑_{S : i ∈ S} f(S)^2 = Inf_i[f] < τ for every i ∉ J. This shows that

‖f − h‖^2 ≤ ∑_{i ∉ J} ∑_{S : i ∈ S} f(S)^2 < nτ.

This bound is clearly not good enough. The problem is that we are counting each f(S)^2 multiple times; in fact, |S ∖ J| times. This suggests trying to get rid of Fourier coefficients corresponding to large sets S. Indeed,

∑_{|S| ≥ M} f(S)^2 ≤ (1/M) ∑_S |S| f(S)^2 = Inf[f]/M,

so we can disregard large coefficients. It remains to bound

∑_{S ⊄ J, |S| < M} f(S)^2.

At this point we invoke the magic wand of Boolean function analysis, hypercontractivity. An operator on functions is contractive if it reduces norm. For example, recall the noise operator T_ρ from Section 2. When |ρ| ≤ 1, this operator is contractive:

‖T_ρ f‖^2 = ∑_S (ρ^{|S|} f(S))^2 ≤ ∑_S f(S)^2 = ‖f‖^2.

It turns out that T_ρ is actually hypercontractive, which means that it satisfies an inequality of the form ‖T_ρ f‖_p ≤ ‖f‖_q for p > q.

Let us first recall what the L_p norms are:

‖f‖_p = (E_x[|f(x)|^p])^{1/p}.

This turns out to be a norm for p ≥ 1 (including the limit p = ∞), that is, ‖cf‖_p = |c| ‖f‖_p (which is easy to see), and the triangle inequality ‖f + g‖_p ≤ ‖f‖_p + ‖g‖_p holds (which requires some argument). Another standard result states that ‖f‖_p is nondecreasing in p (when p = ∞, ‖f‖_∞ is just the maximum value of |f|).

The noise operator T_ρ is contractive for any norm (when |ρ| ≤ 1). To see this, let us describe the noise operator in a slightly different way. One of the definitions we gave was: T_ρ f(x) = E[f(y)], where y_i = x_i with probability (1+ρ)/2, and y_i = −x_i otherwise. Instead, we can let z_i = 1 with probability (1+ρ)/2 and z_i = −1 otherwise, and then T_ρ f(x) = E[f(xz)]. This shows that T_ρ f is an average of the functions f_z defined by f_z(x) = f(xz). Since ‖f_z‖_p = ‖f‖_p for all z, the triangle inequality immediately implies that ‖T_ρ f‖_p ≤ ‖f‖_p.

Hypercontractivity is the stronger property that ‖T_ρ f‖_p ≤ ‖f‖_q for p > q (depending on ρ). Using an inductive argument similar to an argument which we encountered in Section 3, we will show that

‖T_{1/√3} f‖_4 ≤ ‖f‖_2.

This actually holds for every function f, not just Boolean functions. From this, we will deduce that

‖T_{1/√3} f‖_2 ≤ ‖f‖_{4/3}.

We will apply this not to f itself, but rather to L_i f:

∑_{S : i ∈ S} 3^{−|S|} f(S)^2 = ‖T_{1/√3} L_i f‖_2^2 ≤ ‖L_i f‖_{4/3}^2 = E[|L_i f(x)|^{4/3}]^{3/2}.

Since L_i f(x) ∈ {0, ±1}, the right-hand side is in fact Inf_i[f]^{3/2}. This shows that

∑_{S ⊄ J} 3^{−|S|} f(S)^2 ≤ ∑_{i ∉ J} ∑_{S : i ∈ S} 3^{−|S|} f(S)^2 ≤ ∑_{i ∉ J} Inf_i[f]^{3/2} ≤ √τ · Inf[f].

At this point it becomes apparent why it is useful to separate the coefficients into small S and large S: the above inequality is only useful if 3^{−|S|} is not too small, that is, when |S| is not too large. Altogether, we obtain

‖f − h‖^2 ≤ ∑_{|S| ≥ M} f(S)^2 + ∑_{S ⊄ J, |S| < M} f(S)^2
         ≤ Inf[f]/M + 3^M ∑_{S ⊄ J} 3^{−|S|} f(S)^2
         ≤ Inf[f]/M + 3^M Inf[f] √τ.

Suppose we are aiming at ‖f − h‖^2 ≤ ε. The easiest way to satisfy this is to ask for both summands to be at most ε/2. Looking at the first summand, we should choose M = 2 Inf[f]/ε, and so the second summand is 2^{O(Inf[f]/ε)} Inf[f] √τ, which means that we need to choose τ = 2^{−Θ(Inf[f]/ε)} (this requires some calculation).

How many coordinates does J contain? Each coordinate contributes Inf_i[f] ≥ τ to the total influence, and so the number of coordinates is at most Inf[f]/τ = Inf[f] 2^{O(Inf[f]/ε)} = 2^{O(Inf[f]/ε)}. This concludes the proof of Friedgut's junta theorem:

Let f : {±1}^n → {±1}. For any ε, there is a Boolean junta g, depending on 2^{O(Inf[f]/ε)} coordinates, such that Pr[f ≠ g] ≤ ε.

5.2 Kahn–Kalai–Linial theorem

The Kahn–Kalai–Linial theorem states that every Boolean function has a somewhat influential coordinate. Our starting point is the inequality

∑_{S ⊄ J} f(S)^2 ≤ Inf[f]/M + 3^M Inf[f] √τ,

where M is arbitrary and J is the collection of all coordinates whose influence is at least τ.

Suppose we are aiming at an influence of at least (κ/n) V[f], where κ is a function of n. If Inf[f] ≥ κ V[f], then the maximal influence is obviously at least (κ/n) V[f], so we can assume that Inf[f] ≤ κ V[f].

A natural place to find an influential variable is in the set J. Indeed,

∑_{i ∈ J} Inf_i[f] = ∑_S |S ∩ J| f(S)^2 ≥ ∑_{S ≠ ∅} f(S)^2 − ∑_{S ⊄ J} f(S)^2,

since if S ≠ ∅ is a subset of J then |S ∩ J| ≥ 1. Now, the first term is V[f], and we bounded the other one above, so averaging over all variables in J, there must be one whose influence is at least

(V[f] − Inf[f]/M − 3^M Inf[f] √τ) / |J| ≥ (V[f] − Inf[f]/M − 3^M Inf[f] √τ) / (Inf[f]/τ),

since |J| ≤ Inf[f]/τ. This suggests choosing M, τ so that the two subtrahends are at most V[f]/10, say. Accordingly, we choose M = 10 Inf[f]/V[f], and so τ = 2^{−Θ(Inf[f]/V[f])}. This gives us a variable whose influence is at least

Θ((V[f]/Inf[f]) · 2^{−Θ(Inf[f]/V[f])}).

Since Inf[f] ≤ κ V[f], this is at least

Θ(2^{−Θ(κ)}/κ) = 2^{−Θ(κ)}.

The best choice of κ is the one which balances the two terms 2^{−Θ(κ)} and (κ/n) V[f]. We want 2^{Θ(κ)} κ = n/V[f], and so κ = Θ(log(n/V[f])). Altogether, we obtain the KKL theorem:

Let f : {±1}^n → {±1}. There exists a variable i whose influence is at least

Ω((log(n/V[f]) / n) · V[f]).

How do we use the KKL theorem to influence elections? As we mentioned in Section 4, the idea is to iteratively bribe the most influential voter. Suppose that we want to bribe voters until the probability that one of the candidates wins is at least 2/3. The variance of f is

V[f] = E[f^2] − E[f]^2 = 1 − (Pr[f = 1] − Pr[f = −1])^2,

and so one of the candidates wins with probability at least 2/3 when the variance drops below 1 − (2/3 − 1/3)^2 = 8/9.

If the original variance is below 8/9, then there is nothing to do. Otherwise, we repeatedly bribe the most influential voter to vote for candidate 1. As long as the variance is above 8/9 and there are m voters left, we can find a voter whose influence is at least Ω((log m)/m) = Ω((log n)/n). Bribing this voter increases the probability that candidate 1 wins by Ω((log n)/n). Hence this process necessarily stops after O(n/ log n) steps.

The formulation of the KKL theorem above is not dimension-free, that is, it involves n. The general philosophy in Boolean function analysis is to prove statements where n does not appear. We can obtain such a statement directly from our proof. What the proof shows is that for a parameter κ of our choice, either Inf[f] ≥ κ V[f], or max_i Inf_i[f] ≥ 2^{−Θ(κ)}. Stated in terms of δ = 2^{−Θ(κ)}, this gives the following dimension-free version of the KKL theorem:

Let f : {±1}^n → {±1}. For every δ > 0, one of the following must hold:

max_i Inf_i[f] ≥ δ   or   Inf[f] = Ω(log(1/δ) V[f]).

That is, for balanced functions, if all influences are small, then the total influence is large. We can deduce the previous formulation of the KKL theorem as before, by balancing both terms.

6 Hypercontractivity

Hypercontractivity is the secret spice behind much of Boolean function analysis. One way to think about it is that it encapsulates a conceptually useful proof by induction. Another is via convergence of Markov chains, a point of view which we will not discuss here.

Let us try to prove an inequality of the form ‖T_ρ f‖_4 ≤ ‖f‖_2 by induction. It would be simpler to raise everything to the fourth power, proving instead E[(T_ρ f)^4] ≤ E[f^2]^2. The induction is on the number of inputs n. When n = 0, the inequality trivially holds for any ρ. Now suppose that we can prove this inequality for functions on n inputs, and try to prove it for functions on n + 1 inputs. In order to reduce the number of variables, we will separate the variable x_{n+1}, writing

f = x_{n+1} ∑_{S ⊆ [n]} f(S ∪ {n+1}) x_S + ∑_{S ⊆ [n]} f(S) x_S.

For the sake of succinctness, we will write this in the following way:

f = x_{n+1} g + h,

where g, h are functions on n variables. We then have

T_ρ f = ρ x_{n+1} T_ρ g + T_ρ h.

Now we can attempt the proof by induction:

E[(T_ρ f)^4] = ∑_{i=0}^{4} (4 choose i) ρ^i E[x_{n+1}^i] E[(T_ρ g)^i (T_ρ h)^{4−i}].

It is easy to compute E[x_{n+1}^i] = 1 for i = 0, 2, 4 and E[x_{n+1}^i] = 0 for i = 1, 3, and so

E[(T_ρ f)^4] = ρ^4 E[(T_ρ g)^4] + 6ρ^2 E[(T_ρ g)^2 (T_ρ h)^2] + E[(T_ρ h)^4].

We can bound E[(T_ρ g)^4] ≤ E[g^2]^2 and E[(T_ρ h)^4] ≤ E[h^2]^2 by induction. As for the mixed term, the Cauchy–Schwarz inequality shows that

E[(T_ρ g)^2 (T_ρ h)^2] ≤ √(E[(T_ρ g)^4] E[(T_ρ h)^4]) ≤ E[g^2] E[h^2].

Altogether, this gives

E[(T_ρ f)^4] ≤ ρ^4 E[g^2]^2 + 6ρ^2 E[g^2] E[h^2] + E[h^2]^2.

Our target is E[f^2]^2. By Parseval's identity,

E[f^2] = ‖f‖^2 = ∑_{S ⊆ [n+1]} f(S)^2 = ∑_{S ⊆ [n]} g(S)^2 + ∑_{S ⊆ [n]} h(S)^2 = E[g^2] + E[h^2],

and so we are looking for a value of ρ for which the following always holds:

ρ^4 E[g^2]^2 + 6ρ^2 E[g^2] E[h^2] + E[h^2]^2 ≤ E[g^2]^2 + 2 E[g^2] E[h^2] + E[h^2]^2.

Comparing coefficients, we need ρ^4 ≤ 1 and 6ρ^2 ≤ 2, and so |ρ| ≤ 1/√3. We have proved hypercontractivity in the following form:

For all functions f : {±1}^n → R and all |ρ| ≤ 1/√3:

‖T_ρ f‖_4 ≤ ‖f‖_2.
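For intuition, the inequality can be checked numerically on random Boolean functions (a sketch, not part of the notes; brute force, so only for small n):

    import itertools, random
    import numpy as np

    def hypercontractivity_check(n, rho, trials=200):
        cube = list(itertools.product([1, -1], repeat=n))
        subsets = [S for k in range(n + 1) for S in itertools.combinations(range(n), k)]
        chars = np.array([[np.prod([x[i] for i in S]) for S in subsets] for x in cube])
        for _ in range(trials):
            fvals = np.array([random.choice([1, -1]) for _ in cube], dtype=float)
            coeffs = chars.T @ fvals / len(cube)                 # Fourier coefficients
            damped = coeffs * np.array([rho ** len(S) for S in subsets])
            Tf = chars @ damped                                  # T_rho f, pointwise
            lhs = np.mean(np.abs(Tf) ** 4) ** 0.25               # ||T_rho f||_4
            rhs = np.mean(fvals ** 2) ** 0.5                     # ||f||_2 (= 1 here)
            assert lhs <= rhs + 1e-9
        print("no violations found")

    hypercontractivity_check(4, 1 / np.sqrt(3))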

Crucially, this inequality doesn't involve n. We say that it is dimension-independent. Boolean function analysis concerns itself mostly with such dimension-independent results. We have seen several examples above: the Friedgut–Kalai–Naor theorem, Friedgut's junta theorem and one version of the Kahn–Kalai–Linial theorem.

When f has constant degree d, we can get rid of the noise operator, by writing f = T_ρ T_ρ^{−1} f. The spectral formula for the noise operator makes it clear that T_ρ is indeed invertible, and T_ρ^{−1} = T_{ρ^{−1}}, and so

‖f‖_4 = ‖T_{1/√3} T_{√3} f‖_4 ≤ ‖T_{√3} f‖_2 = √(∑_S 3^{|S|} f(S)^2) ≤ √(3^d) ‖f‖_2.

The proofs in Section 5 used a different form of hypercontractivity, in which the L2 norm was on the left-hand side rather than on the right-hand side. This version is easily deducible from the current version, via Hölder's inequality, which states that 〈f, g〉 ≤ ‖f‖_p ‖g‖_q, where 1/p + 1/q = 1. If p = q = 2 then we just get the Cauchy–Schwarz inequality. Here we will be interested in p = 4 and q = 4/3. Using this, if |ρ| ≤ 1/√3 then

‖T_ρ f‖_2^2 = 〈f, T_ρ^2 f〉 ≤ ‖f‖_{4/3} ‖T_ρ^2 f‖_4 ≤ ‖f‖_{4/3} ‖T_ρ f‖_2,

and so ‖T_ρ f‖_2 ≤ ‖f‖_{4/3}. The first step uses the symmetry of the operator T_ρ:

〈T_ρ g, h〉 = ∑_S ρ^{|S|} g(S) h(S) = ∑_S g(S) ρ^{|S|} h(S) = 〈g, T_ρ h〉.

Altogether, we get

For all functions f : {±1}^n → R and all |ρ| ≤ 1/√3:

‖T_ρ f‖_2 ≤ ‖f‖_{4/3}.

6.1 Another proof of FKN

As another illustration of hypercontractivity, let us give an alternative proof of the Friedgut–Kalai–Naor theorem which we proved in Section 3.

The Friedgut–Kalai–Naor theorem states, in one formulation, that if F : {±1}^n → {±1} is close to degree 1, in the sense that ‖F^{>1}‖^2 = ε, then F is close to a Boolean function G which is either constant or a dictator, in the sense that Pr[F ≠ G] = O(ε).

As we have shown in Section 3, we can assume, without loss of generality, that E[F] = 0. Therefore, f = F^{≤1} has the form

f = ∑_{i=1}^{n} c_i x_i,

where c_i = F({i}) = E[F x_i]. In this case our goal is to show that f is close to a Boolean dictator ±x_i.

In order to show that f is close to a dictator, it suffices to show that some c_i is close to ±1. Indeed,

E[F x_i] = Pr[F = x_i] − Pr[F = −x_i] = 2 Pr[F = x_i] − 1 = 1 − 2 Pr[F = −x_i],

and so if c_i = 1 − δ then Pr[F = x_i] = 1 − δ/2, and if c_i = −1 + δ then Pr[F = −x_i] = 1 − δ/2.

Since ‖F^{>1}‖^2 = ε while ‖F‖^2 = 1, we can conclude that ‖f‖^2 = 1 − ε, that is,

∑_{i=1}^{n} c_i^2 = 1 − ε.

Also, we know that each c_i is close to 0 or to ±1, as we have shown in Section 3. It could therefore conceivably be the case that all c_i are small. In order to rule this case out, we will consider

∑_{i=1}^{n} c_i^4.

If all c_i were small, say c_i^2 = O(ε), then this sum would be at most O(ε) (since then c_i^4 ≤ O(ε) c_i^2 and ∑_i c_i^2 ≤ 1), so to rule this case out, all we need to do is give a lower bound on ∑_i c_i^4, which we expect to be close to 1.

In order to get a handle on ∑_i c_i^4, we consider E[f^4]:

E[f^4] = ∑_{i,j,k,ℓ=1}^{n} c_i c_j c_k c_ℓ E[x_i x_j x_k x_ℓ] = ∑_{i=1}^{n} c_i^4 + 3 ∑_{i=1}^{n} ∑_{j ≠ i} c_i^2 c_j^2 = 3 (∑_{i=1}^{n} c_i^2)^2 − 2 ∑_{i=1}^{n} c_i^4.

We expect the left-hand side to be close to 1. Since the right-hand side is 3 E[f^2]^2 − 2 ∑_i c_i^4 ≈ 3 − 2 ∑_i c_i^4, this will show that ∑_i c_i^4 ≈ 1.

Instead of estimating E[f^4] directly, we will consider the related quantity E[(f^2 − 1)^2], prompted by the known property E[f^2] = 1 − ε.

We know that E[dist(f, {±1})^2] ≤ E[(f − F)^2] ≤ ε, and so with probability 1 − 1/C, it holds that dist(f, {±1})^2 ≤ Cε (we will choose C later on). This implies that f = ±1 + τ, where |τ| ≤ √(Cε), and so f^2 = 1 + Θ(τ) (since ε ≤ 1), implying that (f^2 − 1)^2 = O(τ^2) = O(Cε). This shows that

E[(f^2 − 1)^2] ≤ O(Cε) + E[(f^2 − 1)^2 · 1_{dist(f,{±1})^2 > Cε}].

The hard part is to bound the behavior of f on the bad inputs, which cause it to be abnormally large. This is where hypercontractivity comes in. But first, we need a trick, the most standard one: Cauchy–Schwarz:

E[(f^2 − 1)^2 · 1_{dist(f,{±1})^2 > Cε}] ≤ √(E[(f^2 − 1)^4]) · √(E[1_{dist(f,{±1})^2 > Cε}]) ≤ (1/√C) ‖f^2 − 1‖_4^2.

Since f^2 − 1 has degree 2, we know that ‖f^2 − 1‖_4 ≤ 3 ‖f^2 − 1‖_2, and so altogether,

E[(f^2 − 1)^2] ≤ O(Cε) + (9/√C) E[(f^2 − 1)^2].

Choosing any C > 81, we conclude that

E[(f^2 − 1)^2] = O(ε).

Since (f^2 − 1)^2 = f^4 − 2f^2 + 1 and E[f^2] = 1 − ε, this shows that

E[f^4] = O(ε) + 2(1 − ε) − 1 = 1 + O(ε),

as predicted above. Therefore

2 ∑_{i=1}^{n} c_i^4 = 3 E[f^2]^2 − E[f^4] = 3(1 − ε)^2 − (1 + O(ε)) ≥ 2 − O(ε),

implying that ∑_i c_i^4 ≥ 1 − O(ε). It is now easy to show that some c_i must be large. This follows from

∑_{i=1}^{n} c_i^4 ≤ max_i c_i^2 · ∑_{i=1}^{n} c_i^2 = (1 − ε) max_i c_i^2.

Altogether, we conclude that some c_i satisfies c_i^2 ≥ 1 − O(ε), and so |c_i| ≥ 1 − O(ε). As we have seen above, this implies that F is O(ε)-close to either x_i or −x_i.

6.2 General norms

Most applications of hypercontractivity in Boolean function analysis use one of the two forms stated above. Sometimes, however, we need to consider larger norms. Here is the most general form of hypercontractivity:

For all functions f : {±1}^n → R, all 1 ≤ p ≤ q ≤ ∞, and all |ρ| ≤ √((p−1)/(q−1)):

‖T_ρ f‖_q ≤ ‖f‖_p.

This theorem is proved by induction. In the base case n = 1, we have a function f = a + bx. We need to prove that

((|a + ρb|^q + |a − ρb|^q)/2)^{1/q} ≤ ((|a + b|^p + |a − b|^p)/2)^{1/p}.

If a = 0, then the inequality reads |ρb| ≤ |b|, which holds as long as |ρ| ≤ 1. Hence we can assume that a ≠ 0. Since the inequality is homogeneous, we can consider f/a = 1 + (b/a)x instead of f. Writing t = b/a, we need to prove that

((|1 + ρt|^q + |1 − ρt|^q)/2)^{1/q} ≤ ((|1 + t|^p + |1 − t|^p)/2)^{1/p}.

When t is small, a Taylor expansion shows that

|1 + ρt|^q = (1 + ρt)^q ≈ 1 + qρt + (q(q−1)/2) ρ^2 t^2.

We get a similar expression for |1 − ρt|^q, and so the left-hand side is roughly

(1 + (q(q−1)/2) ρ^2 t^2)^{1/q} ≈ 1 + ((q − 1)/2) ρ^2 t^2.

Similarly, the right-hand side is roughly 1 + ((p − 1)/2) t^2. Comparing coefficients, we see that we need (q − 1)ρ^2 ≤ p − 1, and so |ρ| ≤ √((p−1)/(q−1)). It turns out that the base case does hold for such ρ, but we will not prove this here.

The more interesting part of the proof is tensorization, which is the way in which we deduce the generalcase from the base case. While a direct inductive proof is possible, it is a bit tricky. Instead, we willconsider an equivalent form of hypercontractivity, due to Ryan O’Donnell, for which the inductive proof isstraightforward.

Suppose that ‖Tρf‖q ≤ ‖f‖p. Using Holder’s inequality, we can obtain a two-function version:

〈Tρf, g〉 ≤ ‖Tρf‖q‖g‖q′ ≤ ‖f‖p‖g‖q′ , where1

q+

1

q′= 1.

Coincidentally, the left-hand side is a very natural bilinear form:

〈Tρf, g〉 = Ex

[Tρf(x)g(x)] = Ex∼±1ny∼Nρ(x)

[f(y)g(x)],

where Nρ(x) is obtained by flipping each coordinate with probability 1−ρ2 . The connection between x and y

is symmetric: we could also have sampled y ∼ ±1n and then x ∼ Nρ(y) to obtain the same distribution.We write this symmetric distribution as (x, y) ∼ Nρ.

The two-function version will be easier to tensorize. Before seeing that, let us show how to deducehypercontractivity in its original form. The idea is very simple: assuming for simplicity that f ≥ 0, we takeg = (Tρf)q−1 to obtain

‖Tρf‖qq = E[(Tρf)q] = 〈Tρf, (Tρf)q−1〉 ≤ ‖f‖p‖(Tρf)q−1‖q/(q−1) = ‖f‖p‖Tρf‖q−1q ,

23

and so ‖Tρf‖q ≤ ‖f‖p. The proof for arbitrary f is similar, using g = |Tρf |q−1 sgn(Tρf).Now let us prove that 〈Tρf, g〉 holds by induction. We have already seen the base case n = 1, so suppose

that f, g are functions on n+ 1 variables. The basic idea is to write

〈f, Tρg〉 = E(xn+1,yn+1)∼Nρ

E(x1,y1),...,(xn,yn)∼Nρ

[f(x1, . . . , xn, xn+1)g(y1, . . . , yn, yn+1)].

If we fix the value of xn+1, yn+1, then the functions f, g now depend only on n variables, and so we can applythe induction hypothesis:

〈f, Tρg〉 ≤ E(xn+1,yn+1)∼Nρ

[E

x1,...,xn[|f(x1, . . . , xn, xn+1)|p]1/p E

x1,...,xn[|g(x1, . . . , xn, yn+1)|q

′]1/q

′].

The outer expectation is also of the form 〈F, TρG〉, on a single coordinate. Applying the base case gives

〈f, Tρg〉 ≤

Exn+1

[∣∣∣∣ Ex1,...,xn

[|f(x1, . . . , xn, xn+1)|p]1/p∣∣∣∣p]1/p

· Exn+1

[∣∣∣∣ Ex1,...,xn

[|g(x1, . . . , xn, xn+1)|q′]1/q

′∣∣∣∣q′]1/q′

= ‖f‖p‖g‖q′ .

7 Constant degree functions: Kindler–Safra theorem

In Section 2 we showed that every Boolean degree 1 function is a dictator. Similarly, every Boolean degree dfunction is a junta, although the argument is less trivial.

Let f : ±1n → ±1 have degree d. If f is a junta, then f must depend on all variables with positiveinfluence. We call such variables influential. Since there are only finitely many juntas, the influence of anyinfluential variable in a junta is Ω(1). This suggests showing that every influential variable in f has influenceΩ(1). To show this, we consider the function Lif , define in Section 4. The proof is a simple application ofhypercontractivity:

Infi[f ] = ‖Lif‖22 ≤ 3d‖T1/√

3Lif‖22 ≤ 3d‖Lif‖24/3 = 3d‖Lif‖32 = 3d Infi[f ]3/2.

If Infi[f ] 6= 0 then Infi[f ] ≥ 9−d. Conversely, as we have shown in Section 4, Inf[f ] ≤ dV[f ] ≤ d. Thereforeat most 9dd variables are influential in f . In other words, f depends on at most 9dd variables. In fact, thiscan be improved to O(2d) [CHS20, Wel20].

What about Boolean functions which are merely close to degree d? In Section 3, we showed that suchfunctions are close to a constant or a dictator, that is, to a Boolean degree 1 function. Using a similarstrategy, we will show that the same holds for degree d functions, a result first proved by Kindler andSafra [KS04, Kin02].

Let f : ±1n → ±1 satisfy ‖f>d‖2 = ε. Following the lead of Section 3, we will consider the functiong = f≤d, which is a degree d function satisfying E[dist(g, ±1)2] ≤ ε.

Identifying the junta variables The first step is identifying the junta variables, which we accomplish bymodifying the argument for Boolean degree d functions. For any variable i,√

Infi[g] = ‖Lig‖2 ≤√

3d‖Lig‖4/3.

This time we cannot directly relate ‖Lig‖3/2 and Infi[g]. However, rounding g to G = round(g, ±1),

‖Lig‖4/3 ≤ ‖LiG‖4/3 +‖Lig−LiG‖4/3 ≤ ‖LiG‖4/3 +‖Lig−LiG‖2 ≤ ‖LiG‖4/3 +‖g−G‖2 ≤ ‖LiG‖4/3 +√ε.

24

Since G is Boolean,

‖LiG‖4/3 = ‖LiG‖3/22 ≤ (‖Lig‖2 + ‖LiG− Lig‖2)3/2 = (√

Infi[g] +√ε)3/2.

Therefore √Infi[g] ≤

√3d(

(√

Infi[g] +√ε)3/2 +

√ε).

If√

Infi[g] ≥ 2√

3d√ε then

1

4Infi[g] ≤ 3d(

√Infi[g] +

√ε)3 = O(Infi[g]3/2),

and so Infi[g] = Ω(1).We have shown that the influence of every variable is either O(ε) or Ω(1). Since Inf[g] ≤ d, there can

be at most O(1) of the latter variables, forming a set J . We reorder the variables so that J = 1, . . . , k.Variables outside of J have influence at most Kε, for some K > 0. We can assume that K > 1.

Inductive argument The heart of the proof is an inductive argument, mimicking the proof in Section 3.We will prove by induction that for all m > k,∑

max(S)≥m

g(S)2 ≤ Kε.

The base case, m = n + 1, is trivial. Now suppose that this inequality holds for some m + 1. SinceInfm[g] ≤ Kε, ∑

max(S)≥m

g(S)2 ≤∑

max(S)>m

g(S)2 +∑m∈S

g(S)2 ≤ 2Kε.

For any subset T ⊆ m, . . . , n, let g⊕T be the function formed by negating the coordinates in T . Thus

∆T :=g − g⊕T

2=

∑|S∩T | odd

g(S)xS .

Since E[dist(g/2, ±1/2)2] = E[dist(−g⊕T /2, ±1/2)2] ≤ ε/4,

E[dist(∆T , 0,±1)2] ≤ 2E[dist(g/2, ±1)2] + 2E[dist(−g⊕T /2, ±12)] ≤ ε.

(This is because the sum of two values in ±1 lies in 0,±1.) On the other hand, we know that

‖∆T ‖2 =∑

|S∩T | odd

g(S)2 ≤∑

max(S)≥m

g(S)2 ≤ 2Kε.

Following the steps of Section 3, we want to say that the only way that ∆T can be simultaneously closeto 0,±1 and have small norm is if it is in fact close to 0. As in Section 3, we observe that

∆2T ≤ dist(∆T , 0,±1)2 + 4∆4

T ,

since either |∆T | ≤ 1/2, in which case ∆2T equals the first term, or |∆T | > 1/2, in which case ∆2

T ≤ 4∆4T .

Taking expectations and using hypercontractivity, we can bound E[∆4T ] in terms of E[∆2

T ]2 (in Section 3 weresorted to a direct calculation instead):

‖∆T ‖22 ≤ ε+ 4‖∆T ‖44 ≤ ε+ 4 · 9d‖∆T ‖42 ≤ ε+O(K2ε2).

So far so good, but how does ∆T relate to the quantity we are actually interested in? The idea is to pickT at random. If max(S) < m, then |S ∩ T | is never odd. Otherwise, the probability that |S ∩ T | is odd is at

25

least 2−|S∩m,...,n| ≥ 2−d (this is the probability that one element in S ∩ m, . . . , n is inside T , and the restare outside T ). Therefore ∑

max(S)≥m

g(S)2 ≤ 2d ET‖∆T ‖22 ≤ 2dε+O(K2ε2).

If K > 2d and ε is small enough (a valid assumption, as in Section 3), the right-hand side is at most Kε,completing the proof by induction.

Finishing the proof Where do we stand? Assuming that J = 1, . . . , k, we have shown that∑S : max(S)>k

g(S)2 = O(ε).

For general J , this shows that ∑S 6⊆J

g(S)2 = O(ε).

As in Section 5, this shows that the function h obtained by averaging h over the coordinates outside of Jsatisfies ‖g − h‖2 = O(ε). Furthermore, as in Section 5, the function H = round(h, ±1) is a Boolean juntawhich also satisfies ‖g −H‖2 = O(ε).

We would like to argue that H is not just a junta, but actually has degree d. Indeed, up to the choice of thejunta coordinates, there are only finitely many options for H. Therefore either degH ≤ d, or ‖H>d‖2 = Ω(1).In the latter case, ‖g −H‖2 ≥ ‖H>d‖2 = Ω(1), since deg g ≤ d. This possibility can be ruled out if ε is smallenough. Summarizing, we have proved the following results, known as the Kindler–Safra theorem:

If g : ±1n → R is a degree d function satisfying E[dist(g, ±1)2] < ε, then there is a Boolean degree d functionr : ±1n → ±1, depending on a constant number of inputs, such that ‖g − r‖2 = O(ε).

If f : ±1n → ±1 satisfies ‖f>d‖2 ≤ ε then there is a Boolean degree d function r : ±1n → ±1, dependingon a constant number of inputs, such that Pr[g 6= r] = O(ε).

How does the big O constant scale with d? Carefully keeping track of all big O constants, we see that‖g −H‖2 ≤ 2O(d)ε. Similarly, H depends on 2O(d) coordinates.

In order to guarantee that H has degree d, we appealed to there being only finitely many possible H.We can make this argument constructive. Suppose that degH > d, say H(S) 6= ∅, where |S| > d. Since Hdepends on M = 2O(d) coordinates, this means that |H(S)| = |E[HxS ]| ≥ 2−M , since E[HxS ] is the average

of 2M many integers. Therefore, in order to guarantee that H has degree d, we need ε ≤ 2−2O(d)

.Keller and Klein [KK20] used a different proof (carried out in the more difficult setting of the slice) that

directly constructs a Boolean function H of degree d such that ‖g −H‖2 ≤ 2O(d)ε. They also showed thatPr[f 6= H] ≤ 4ε+ 2O(d)ε2, using a simple application of hypercontractivity.

8 Biased Fourier analysis: Erdos–Ko–Rado

So far we have been considering functions with respect to the uniform distribution. This distribution ishidden in our expectations, which are always taken with respect to the input distribution. However, in manycases we are interested in other distributions. Perhaps the best example is G(n, p) random graphs, in whicheach edge appears in the graph with probability p. When p = 1/2, this is just the uniform distribution, butwe are often interested in other values of p.

When p = 1/2, it is convenient to think of the Boolean cube as ±1n, and of Boolean functions as±1-valued functions. For general p, this convention makes less sense, and instead, typically we replace ±1

26

with 0, 1. The distribution on 0, 1n in which each coordinate equals 1 with probability p independentlyis known as µp.

How do we generalize the Fourier expansion to the p-biased case? There are two basic properties of theFourier basis: it is an orthonormal basis, and it is “coordinate-based”, in the sense that there is a naturalcorrespondence between subsets of [n] and Fourier characters. We can express “coordinate-based” in a moreformal way: if a function depends only on the coordinates in J , then its Fourier expansion is supported onsubsets of J ; and conversely, χS depends only on the coordinates in S.

These “axioms” suffice to derive the p-biased Fourier basis, up to negation. Fixing p, we will denote thisbasis by (ωS)S⊆[n]. First of all, ω∅ is constant, and by orthonormality, the constant is ±1. We choose ω∅ = 1.Continuing, ωi depends only on xi, say ωi = αx+ β. By orthogonality,

0 = Eµp

[ωiω∅] = Eµp

[αxi + β] = αp+ β.

By orthonormality,1 = E

µp[ω2i] = E

µp[α2x2

i + 2αβxi + β2] = (α2 + 2αβ)p+ β2.

The first equation shows that β = −αp, and so the second equation gives

α2p− 2α2p2 + α2p2 = 1 =⇒ α2 =1

p(1− p).

We choose

ωi =xi − p√p(1− p)

.

In this formula, p = E[xi] and p(1− p) = V[xi].At this point one can already guess the general formula:

ωS =∏i∈S

ωi =∏i∈S

xi − p√p(1− p)

.

Indeed, E[ω2S ] =

∏i∈S E[ω2

i] = 1, and if S 6= T , then without loss of generality there is some i ∈ S \ T ,

and then E[ωSωT ] ∝ E[ωi] = 0. In both cases, we crucially used the independence of the coordinates —equivalently, the fact that µp is a product measure.

A basis of this form is known as tensorial — by defining the tensor product accordingly, we can think ofthe p-biased Fourier basis as the tensor product

ω∅, ω1 ⊗ · · · ⊗ ω∅, ωn.

This basis is unique up to (individual) negation. To see this, suppose that we have constructed ωT forT ( S. The space of all functions on S has dimension 2|S|. The function ωS is orthogonal to the 2|S| − 1functions ωT for T ( S. These functions are orthogonal, and so linearly independent, hence ωT belongs tosome one-dimensional subspace, which contains exactly two points of unit norm.

8.1 Intersecting families

We now give a simple application of the p-biased Fourier basis, to the study of intersecting families. A subsetF ⊆ 0, 1n is called an intersecting family if every A,B ∈ F intersect (that is, they are not disjoint). Whatis the largest µp-measure of an intersecting family? The answer depends qualitatively on p.

When p > 1/2, one can take the family of all sets of size dn+12 e. This is an intersecting family whose

µp-measure tends to 1 as n→∞, and so this regime is not so interesting.When p = 1/2, every intersecting family has measure at most 1/2, since it can contain at most one set

out of each pair A,A. There are many different intersecting families of measure 1/2, for example all familiesof the form

S : |S ∩ [2r + 1]| ≥ r + 1,

27

which can also be viewed as the majority function on the first 2r + 1 coordinates.The most interesting regime is when p < 1/2. One obvious construction is a star, which consists of all

sets containing the point i, for some arbitrary i ∈ [n]. This family has µp-measure p. Is there any betterconstruction? If not, are there any other constructions achieving the same measure? If not, are there anyother constructions approaching the same measure?

We will approach the subject using a technique originating in the work of Hoffman [Hof70] andLovasz [Lov79], although for the purpose of exposition, the relation to their work will not be immedi-ately apparent.

The idea is to define a “noise operator” T with appropriate properties. Just as the noise operator Tρ, wewill define Tf(x) = E[f(y)], where y is some noisy version of x, say y ∼ N(x). It is natural to require that ifx ∼ µp then also y ∼ µp. Moreover, noise should be applied to different coordinates independently.

We will design our noise operator in such a way that if f is the characteristic function of an intersectingfamily, then 〈f, Tf〉 = 0 (this mimics a construction of Hoffman, in which T is the normalized adjacencymatrix of some graph). This property will be satisfied if N(1A) is guaranteed to be a set which is disjointfrom A. Considering an individual coordinate, this means that N must change 1 to 0, and can act arbitrarilyon 0, under the constraint that if the input to N is distributed µp, then so is the output. If we denote by tthe probability that N changes 0 to 1, then the probability that N outputs 1 is (1− p)t, and so t = p

1−p .

If 1A ∼ µp and 1B ∼ N(1A) then A ∩B = ∅ and both 1A and 1B have distribution µp. As seen above,this determines the distribution of (A,B) uniquely. Since the constraints are symmetric, we conclude that wecan also sample (A,B) by first sampling 1B ∼ µp and then sampling 1A ∼ N(1B).

This symmetry implies that ωS is an eigenfunction of T . The proof is by induction. Suppose that wehave shown that TωR = λRωR for all R ( S. Then

〈ωR, TωS〉 = 〈TωR, ωS〉 = λR〈ωR, ωS〉 = 0.

Hence TωS is a function, depending only on the coordinates in S, which is orthogonal to ωR for all R ( S.This means that TωS = λSωS for some λS .

If 1R ∼ N(1S) then R ∩ S = ∅, and so TωS(1S) = E[ωS(R)] = ωS(∅). This allows us to calculate λS bysubstituting 1S :

λS

(1− p√p(1− p)

)|S|= λSωS(1S) = TωS(1S) = ωS(∅) =

(−p√p(1− p)

)|S|,

and so

λS =

(− p

1− p

)|S|.

This means that

Tf =∑S

(−p

1− p

)|S|f(S)ωS .

If f is indeed the characteristic function of an intersecting family, then 〈f, Tf〉 = 0, which by Parseval’sidentity implies that

0 = 〈f, Tf〉 =∑S

(−p

1− p

)|S|f(S)2 ≥ f(∅)2 − p

1− p∑S 6=∅

f(S)2.

We know that f(∅) = E[f ] = µp(F). Parseval’s identity shows that∑S 6=∅

f(S)2 =∑S

f(S)2 − f(∅)2 = E[f2]− E[f ]2 = E[f ]− E[f ]2 = µp(F)(1− µp(F)).

28

Substituting this in the inequality, we obtain

0 ≥ µp(F)2 − p

1− pµp(F)(1− µp(F)) = µp(F)(1− µp(F))

(µp(F)

1− µp(F)− p

1− p

),

and so µp(F) ≤ p, since the function x1−x =

∑n≥1 x

n is increasing for x ≥ 0.This calculation allows us to say much more. First of all, we can understand which intersecting families

have µp-measure exactly p. If f is the characteristic function of such an intersecting family, all inequalitiesabove must be tight. Since the only inequality we used was ( −p1−p )|S| ≥ −p

1−p and p1−p < 1 (since p < 1

2 ), thismeans that the Fourier expansion of f is supported on the first two levels, that is, deg f ≤ 1. Just as in theunbiased case, this implies that f is a dictator, and so a star (since stars are the only intersecting dictators).In other words, stars are the unique extremal families.

We can say even more. If f is the characteristic function of an intersecting family whose measure is closeto p, then the inequalities above must be nearly tight. This translates to ‖f>1‖ being small (see exercisebelow). In the unbiased case (p = 1/2), this implies that f is close to a dictator via the Friedgut–Kalai–Naortheorem, and so (since f is the characteristic function of an intersecting family) to a star.

Our proof of the Friedgut–Kalai–Naor theorem in Section 3 goes through with little changes for anyconstant p, once we replace xi with ωi = xi−p√

p(1−p); perhaps the biggest difference is that E[ω4

i ] 6= 1 in general.

Therefore, for any constant p, an intersecting family of almost extremal µp-measure is close to a star (this isknown as stability).

Most results in Boolean function analysis generalize from p = 1/2 to any constant p; the most glaringexception is that ωSωT 6= ωS∆T in general. Moreover, these results usually hold uniformly for all p ∈ [η, 1−η],where η > 0 is arbitrary. The case of small p is more interesting, and we will consider it carefully later on.

Exercise Fix p < 1/2, and suppose that F is an intersecting family.

(a) Show that if µp(F) = p then F is a star.

(b) Assuming the Friedgut–Kalai–Naor theorem for µp, show that if µp(F) ≥ p− ε then there exists a star Gsuch that µp(F∆G) = O(ε).

(c) Prove the Friedgut–Kalai–Naor theorem for µp, and show that it holds uniformly for all p ∈ [η, 1− η],where η > 0 is arbitrary.

Exercise Two families F ,G are cross-intersecting if any A ∈ F and B ∈ G intersect. Show that if p < 1/2and F ,G are cross-intersecting then µp(F)µp(G) ≤ p2, with equality achieved only if F = G is a star.

8.2 Hypercontractivity

Does hypercontractivity extend to the p-biased setting? Before answering this question, we need to define ananalog of the noise operator Tρ. A natural choice is to define it in terms of the Fourier expansion:

Tρf =∑S⊆[n]

ρ|S|f(S)ωS .

In the unbiased case, there was an equivalent definition of Tρ, namely Tρf(x) = E[f(y)], where y isobtained from x by flipping each entry with probability 1−ρ

2 . A similar interpretation exists in the p-biasedcase, which also serves to clarify the unbiased case, as we now explain.

We are looking for a noise distribution Nρ(x) satisfying two constraints. First, if x ∼ µp then Nρ(x) ∼ µp.Second, the corresponding noise operator coincides with the spectral definition above. Let x ∼ µp andy ∼ Nρ(x). If the two constraints hold, then

Ex,y

[x− p√p(1− p)

· y − p√p(1− p)

]= Ex,y

ρ( x− p√p(1− p)

)2 = ρ.

29

Let us consider two extreme cases. In order to get ρ = 1, we can simply choose y = x. In order to get ρ = 0,we can choose y ∼ µp independent of x. By linearity, we can get any value of ρ by a mixture of these twoextremes: Nρ(x) is obtained by retaining x with probability ρ, and resampling it (according to µp) withprobability 1− ρ.

Armed with a definition of Tρ, we can try to mimic the proof of hypercontractivity in Section 6 (only thesimple case). We are looking for a value of ρ that satisfies ‖Tρf‖4 ≤ ‖f‖2, and aiming at an inductive proof.The base case n = 0 holds for any value of ρ, so it suffices to consider the inductive step.

An arbitrary function f on n+ 1 variables can be written as

f = ωn+1g + h, where g =∑S⊆[n]

f(S ∪ n+ 1)ωS , h =∑S⊆[n]

f(S)ωS .

Retracing our steps in Section 6,

E[(Tρf)4] =

4∑i=0

(4

i

)ρi E[ωin+1]E[(Tρg)i(Tρh)4−i].

As in the unbiased case, we have E[ωn+1] = 0 and E[ω2n+1] = 1. In contrast to the unbiased case, the following

two moments are not 0, 1:

κ3 := E[ω3n+1] =

(1− p)(−p)3 + p(1− p)3

(p(1− p))3/2=

1− 2p√p(1− p)

,

κ4 := E[ω4n+1] =

(1− p)(−p)4 + p(1− p)4

(p(1− p))2=

1− 3p(1− p)p(1− p)

.

The exact values do not matter as long as p is constant. The only significant difference from the unbiasedcase is that κ3 is non-zero. Nevertheless, let us try to bound E[(Tρf)4] in terms of E[f2]2 = (E[g2] + E[h2])2:

E[(Tρf)4] = ρ4κ4 E[(Tρg)4] + 4ρ3κ3 E[(Tρg)3(Tρh)] + 6ρ2 E[(Tρg)2(Tρh)2] + E[(Tρh)4].

We can bound E[(Tρg)4] ≤ E[g2]2 and E[(Tρh)4] ≤ E[h2]2 by induction. Applying Cauchy–Schwarz, we can

similarly bound E[(Tρg)2(Tρh)2] ≤√E[(Tρg)4]E[(Tρh)4] ≤ E[g2]E[h2]. As for the remaining term,

E[(Tρg)3(Tρh)] ≤√

E[(Tρg)4]E[(Tρg)2(Tρh)2] ≤ E[g2]√

E[g2]E[h2] ≤ E[g2]2 + E[g2]E[h2]

2,

using the AM-GM inequality. Altogether, we obtain

E[(Tρf)4] ≤ (ρ4κ4 + 2ρ3κ3)E[g2]2 + (2ρ3κ3 + 6ρ2)E[g2]E[h2] + E[h2]2.

In order for this to be bounded by (E[g2] + E[h2])2, we need ρ to satisfy

ρ4κ4 + 2ρ3κ3 ≤ 1, ρ3κ3 + 3ρ2 ≤ 1.

For any fixed p, there will be a fixed ρ satisfying this. Moreover, there is a single ρ which works for allp ∈ [ε, 1− ε], for any fixed ε > 0.

How does ρ vary with p? Let us consider the setting p ≤ 1/2− ε, in which κ3 = Θ( 1√p ) and κ4 = Θ( 1

p ).

The optimal choice of ρ satisfies

ρ = Θ(min((1/κ4)1/4, (1/κ3)1/3, 1)) = Θ(min(p1/4,√p

1/3, 1)) = Θ ( 4

√p) .

30

9 Russo–Margulis

One of the most famous applications of p-biased Boolean function analysis is to understand thresholdphenomena in random graphs. The starting point is a simple connection between the µp-measure of amonotone property and total influence, known as the Russo–Margulis formula [Rus82].

A monotone property of subsets of [n] (in the case of random graphs, subsets of [(n2

)]) is a collection F

of subsets of [n] which is closed upwards: if S ∈ F and T ⊇ S then T ∈ F . For example, the followingproperties are monotone: containing a certain point; containing at least n/2 points; the Tribes function; (fora graph) containing a triangle; being connected. If f is the characteristic function of F , then f is monotone.

Let φ(p) = µp(F). What does the derivative of φ look like? To answer this, we will compare φ(p) andφ(p+ε) using a coupling of µp and µp+ε. Let t1, . . . , tn ∼ U([0, 1]) and define xi = [ti < p] and yi = [ti < p+ε].Clearly x ∼ µp and y ∼ µp+ε. Moreover, x ≤ y, and so φ(p + ε) − φ(p) is the probability that x /∈ F buty ∈ F (we identify sets with Boolean vectors).

The probability that y differs from x in more than one coordinate is O(n2ε2). The probability that ydiffers from x in a specific coordinate i is ε(1− ε)n−1 = ε+O(nε2). It follows that

φ(p+ ε)− φ(p) = ε

n∑i=1

Prµp

[x|i=0 /∈ F and x|i=1 ∈ F ] +On(ε2).

(The notation On(ε2) denotes a quantity bounded by Cnε2, where Cn depends on n.) Taking the limit ε→ 0,

we see that

φ′(p) =

n∑i=1

Prµp

[f(x−i, 0) 6= f(x−i, 1)].

This looks very similar to the formula for the total influence in the unbiased case: while we defined thei’th influence by comparing f(x) and f(x⊕i), we would get the same result if we compared instead f(x−i, 0)and f(x−i, 1). How does this look in terms of the Fourier expansion? It suffices to consider the Fourier basisvectors. If i /∈ S then ωS(x−i, 0) = ωS(x−i, 1), and otherwise

ωS(x−i, 1)− ωS(x−i, 0) = ωS−i(x)

(1− p√p(1− p)

− −p√p(1− p)

)=

1√p(1− p)

ωS−i(x).

(This is also the derivative of ωS(x) with respect to xi.) Therefore

Eµp

[(f(x−i, 1)− f(x−i, 0))2] =1

p(1− p)∑i∈S

f(S)2.

This suggests two different ways of defining influence, differing by a factor of p(1− p). We could either defineit as the expectation above, or as the sum of squared Fourier coefficients. We choose the latter, althoughsome sources prefer the former. That is, we define

Infi[f ] = p(1− p) Eµp

[(f(x−i, 1)− f(x−i, 0))2] =∑i∈S

f(S)2.

Equivalently, if y is obtained from x by resampling the i’th coordinate, then xi 6= yi with probability 2p(1−p),and so

Infi[f ] =1

2Eµp

[(f(x)− f(y))2].

The total influence is defined by summing over the individual influences.Having defined total influence, we see that

φ′(p) =1

p(1− p)Inf(p)[f ],

31

where the superscript denotes that total influence is taken with respect to µp.This simple statement already has interesting implications. If F is non-trivial (isn’t empty and doesn’t

contain everything), then φ is strictly increasing, and so we can define τ(q) = φ−1(q) for any q ∈ [0, 1]. Howlarge can τ(3/4)− τ(1/4) be? Nothing interesting can be said in general, but in many cases, including thatof random graphs, the property in question has some symmetries, and in particular, is invariant under sometransitive permutation group (a permutation group is transitive is for any i, j there is a permutation mappingi to j). This implies that all influences are the same. Using a p-biased version of the KKL theorem, thisallows us to show that τ(3/4)− τ(1/4) is small.

We will consider the case in which 0 τ(1/2) 1 (that is, ε ≤ τ(1/2) ≤ 1 − ε for some fixed ε > 0),although a similar argument works in general, once we take care of the dependence on p, as shown by Friedgutand Kalai [FK96].

Let p0 = τ(1/2). If we choose x, y ∼ µp0 independently then their coordinate-wise minimum z = min(x, y)is distributed µp20 . Also,

Pr[z ∈ F ] ≤ Pr[x, y ∈ F ] = Pr[x ∈ F ]2 = 1/4.

Therefore τ(1/4) ≥ µp20 . Similarly, τ(3/4) ≤ µ1−(1−p0)2 . By assumption, 0 τ(1/2) 1, and so the sameholds for τ(1/4) and τ(3/4).

As we stated above, a theorem such as the KKL theorem holds uniformly for all p ∈ [η, 1−η]. In particular,for all p ∈ [τ(1/4), τ(3/4)] the maximal influence is Ω( logn

n V[f ]) = Ω( lognn ) (since the variance is at least 3

16 ).Since all influences are the same, the total influence is Ω(log n). In other words, φ′(p) = Ω(log n), and so

τ(3/4)− τ(1/4) = O

(1

log n

).

In contrast, if f is a junta such as xi, then τ(3/4)− τ(1/4) is constant. This shows that coarse thresholdbehavior is associated with f depending on a small number of coordinates. The sharp threshold theorems ofFriedgut [Fri99], Bourgain and Hatami [Hat12] prove this in a formal sense, crucially also for p = o(1), whichis the interesting regime for random graphs and many other settings.

Exercise

(a) Extend the KKL theorem to the p-biased setting, with an explicit dependence on p.

(b) Obtain a bound on τ(3/4) − τ(1/4) for monotone functions invariant under a transitive permutationgroup, as a function of p0 = τ(1/2) and n.

9.1 Russo–Margulis + Friedgut

One winning combination is a that of Russo–Margulis and Friedgut’s junta theorem, which is often usedwhen analyzing PCPs as part of hardness of approximation proofs. As a toy example, we will prove thefollowing result. Fix p < 1/2 and ε > 0. We will show how to associate with each monotone family F ofµp-measure at least ε a set L(F) of Op,ε(1) labels such that if F and G are cross-intersecting (that is, anyset in F intersects any set in G) then L(F) and L(G) intersect. This is the technical heart of the proof thatthe trivial 2-approximation algorithms for vertex cover cannot be improved (assuming the unique gamesconjecture), due to Khot and Regev [KR08]. In this proof, the approximation ratio is roughly 1/(1− p), andso we want p to be close to 1/2.

The idea is very simple. Since µp(F) ≥ ε ≥ 0, and clearly µ1/2(F) ≤ 1, there must be a point pF ∈ [p, 1/2]where the derivative of φ(q) = µq(F) is at most 1/(1/2− p). According to the Russo–Margulis formula, thetotal influence of the characteristic function f at µpF is Op(1). It is natural to choose as L(F) all coordinateswhose influence is at least some η = η(p) > 0. Since the total influence is Op(1), the number of labels onlydepends on p.

Now suppose we are given two monotone families F ,G whose label sets L(F), L(G) are disjoint. We willshow that F ,G are not cross-intersecting.

32

The proof of Friedgut’s junta theorem (which works for µpF as well) shows that F is δ-close with respectto µpF to a junta depending on some subset C(F) ⊆ L(F) of at most M = M(p, δ) coordinates (that is, theprobability that F differs from the junta is at most δ). Similarly, g is δ-close with respect to µpG to a juntadepending on some subset C(G) ⊆ L(G) of coordinates.

Our first step is to remove a few sets from F so that it doesn’t depend on the coordinates in C(G) (we will

see later why this is necessary). To accomplish this, we think of F as a 2C(G)× 2C(G) Boolean table. We forma new family F ′ by removing every row which is non-constant; the resulting family is monotone since F ismonotone. The restriction fS of f to any such row S ⊆ C(G) has variance at least pMF (1− pMF ) ≥ pM (1− pM )with respect to µpF . Therefore

µpF (F \ F ′) ≤ PrS⊆C(G)

[fS is not constant]

≤ [pM (1− pM )]−1 ES⊆C(G)

[V[fS ]]

(1)

≤ [pM (1− pM )]−1 ES⊆C(G)

[Inf[fS ]]

= [pM (1− pM )]−1∑

i∈C(G)

ES⊆C(G)

[Infi(fS)]

= [pM (1− pM )]−1∑

i∈C(G)

Infi(f)

(2)

≤ [pM (1− pM )]−1Mη,

where (1) is Poincare’s inequality, and (2) follows since C(G) ⊆ L(G) is disjoint from L(F).We will choose η so that the right-hand side is at most min(δ, ε/2), implying that

µpF (F ′) ≥ µpF (F)− ε

2

(∗)≥ µp(F)− ε

2≥ ε

2,

where (∗) holds since F is monotone. Note that η depends on M, δ, and therefore to prevent circular choice,δ will need to depend only on p, ε.

Recall that F is δ-close with respect to µpF to a junta depending on C(F). Since F and F ′ are δ-close,this implies that F ′ is 2δ-close to such a junta. For each A ⊆ C(F), consider the restriction F ′A of F ′ tothose sets whose intersection with C(F) is A. Thus

EA

[min(µpF (F ′A), 1− µpF (F ′A))] ≤ δ,

where the expectation on A is taken with respect to µpF restricted to C(F). In particular, min(µpF (F ′A), 1−µpF (F ′A)) ≤

√δ for all but a

√δ-fraction of A’s (with respect to µpF !). Such A’s contribute at most

√δ to

the total measure of F ′. Sets A such that µpF (F ′A) ≤√δ contribute another

√δ, and so if ε/2 > 2

√δ, there

must be a set A such that µpF (F ′|A) ≥ 1−√δ. Accordingly, we choose δ = (ε/5)2.

Since F ′ doesn’t depend on the coordinates in C(G), the same holds for F ′A, and we conclude that thefollowing set has µpF -measure at least 1− ε/5, with respect to the coordinates outside of C(F) ∪ C(G):

F∗ = S ∈ F ′ : S ∩ (C(F) ∪ C(G)) = A.

Since F ′ is monotone, µ1/2(F∗) ≥ 1− ε/5 as well.In a completely analogous way, we can find B ⊆ C(G) such that the following family has µ1/2-measure at

least 1− ε/5, again with respect to the coordinates outside of C(F) ∪ C(G):

G∗ = S ∈ F ′ : S ∩ (C(F) ∪ C(G)) = B.

33

We are finally ready to show that F ,G are not cross-intersecting. Choose a set T ⊆ C(F) ∪ C(G) uniformlyat random. The probability that A ∪ T ∈ F∗ and B ∪ (C(F) ∪ C(G) \ T ) ∈ G∗ is at least 1− 2ε/5 > 0, andso such a set T exists. Since A ⊆ C(F) and B ⊆ C(G), the two sets A ∪ T and B ∪ (C(F) ∪ C(G) \ T ) aredisjoint, completing the proof.

10 Very biased Fourier analysis: Biased FKN theorem

How does p-biased Fourier analysis differ from standard Fourier analysis (the case p = 1/2)? When 0 p 1,the behavior is very similar, although one crucial property of the Fourier characters is lost, namely, they areno longer characters of the group Zn2 (no longer satisfy xSxT = xS∆T ), which makes it harder to analyzelinearity testing, for example.

The situation greatly differs when p is very small (or very close to 1), which is of interest in random graphtheory, among else: G(n, p) random graphs often exhibit their most interesting behavior for sub-constant p.For example, the threshold for connectivity is p = logn

n .The Russo–Margulis formula allows understanding the speed of threshold phenomena in terms of total

influence: using the terminology of Section 9, if τ(2/3)− τ(1/3) is large compared to τ(1/2), then this meansthat the total influence around τ(1/2) is relatively small. If the threshold is bounded away from 0, 1, thenFriedgut’s junta theorem implies that the graph property is essentially a junta, which is impossible for graphproperties. For sub-constant p, the characterization (due to Friedgut, Bourgain and Hatami) is much morecomplex, and only states that the graph property is “global”, that is, affected only slightly by local events.

As an example of very biased Fourier analysis, we generalize the FKN theorem 3 to this setting. Recallthat the Friedgut–Kalai–Naor (FKN) theorem states that if f is a degree 1 function which is close to Boolean,in the sense that E[dist(f, ±1)2] = ε, then f is O(ε)-close to a Boolean dictatorship. What happens forother values of p?

When dealing with several different values of p, it is often advantageous to switch to 0, 1-valued functionsand variables, which to avoid confusion we will name y1, . . . , yn. Thus the function f can be written in theform

f = c0 +

n∑i=1

ciyi,

and the FKN theorem states that if f is close to Boolean then it is close to one of the following functions:0, 1, yi, 1−yi. The same result holds for any constant value of p. Note that if we replaced yi with ωi = yi−p√

p(1−p),

the coefficients in front of ωi would have to depend on p.What happens when p is small? Other examples arise. For example, the function y1 + y2 is quite close

to Boolean, since the probability that both y1 and y2 equal 1 is only p2. More generally, we can considerthe function fm =

∑mi=1 yi or its negation. If m = λ/p then the distribution of fm is roughly Poisson with

expectation λ, and soPr[fm /∈ 0, 1] ≈ 1− e−λ(1 + λ) = Θ(λ2).

Moreover,E[f2

m] = mE[y2i ] +m(m− 1)E[yiyj ] ≈ λ+ λ2.

This shows that as long as λ = O(√ε), the function fm will be O(ε)-close to Boolean.

The FKN theorem for small p [Fil16] states that this is the only example: if p ≤ 1/2 and f is a degree 1function such that E[dist(f, 0, 1)2] ≤ ε, then either f or 1 − f is close to a sum of O(

√ε/p) coordinates.

(In fact, this is slightly wrong, can you see why?)Recall that the FKN theorem can also be stated analogously for Boolean functions F which are close to

degree 1, in the sense that ‖F>1‖2 ≤ ε. A simple corollary of the formulation above is that either F or 1− Fis close to the maximum of O(

√ε/p) coordinates.

34

In the rest of this subsection, we prove the FKN theorem for small p in full, starting with its firstformulation. That is, we are given a linear function

f = c0 +

n∑i=1

ciyi,

where yi ∈ 0, 1, and we are promised that for some p ≤ 1/2,

Eµp

[dist(f, 0, 1)2] ≤ ε.

As in the proof of the unbiased FKN theorem, we can assume that ε is “small enough”, that is, smallerthan some constant. Indeed, if ε ≥ ε0 for some ε0 > 0 then since x2 ≤ 2(x− 1)2 + 2,

E[f2] ≤ 2 + 2E[dist(f, 0, 1)2] ≤ 2 + 2ε = O(ε),

since 2 ≤ (2/ε0)ε.The main idea behind the proof is that we can sample from µp in two steps: first, sample a set S according

to µ2p, and substitute 0 for every variable outside of S. Then, sample a point according to µ1/2 restricted toS. In other words, if we sample S ∼ µ2p and a subset T ∼ µ1/2(S) (that is, each element of S is included inT with probability 1/2), then T ∼ µp.

Accordingly, for every S ⊆ [n] we define a function fS : 0, 1S → R by fS(yS) = f(yS , 0), where the secondargument on the right-hand side corresponds to the coordinates outside S. Let εS = Eµ1/2

[dist(fS , 0, 1)2].Then

ES∼µ2p

[εS ] = ε.

Let fS = 2fS − 1, a function which is O(εS)-close to ±1. How does the function fS look like in theunbiased Fourier basis? Using the conversion formula xi = 2yi − 1,

fS = 2c0 − 1 + 2

n∑i=1

cixi + 1

2= 2c0 +

n∑i=1

ci − 1 +

n∑i=1

cixi.

The unbiased FKN theorem states that there exists a Boolean dictatorship gS such that ‖fS − gS‖2 = O(εS).Since gS is a Boolean dictatorship, gS(i) ∈ 0,±1, and so

O(εS) ≥ ‖fS − gS‖2 ≥∑i∈S

(f(i)− gS(i))2 ≥∑i∈S

dist(ci, 0,±1)2.

The idea now is to take expectation with respect to S ∼ µ2p. On the left-hand side we get O(ε). SincePr[i ∈ S] = 2p for every i ∈ [n], on the right-hand side we get 2p

∑i dist(ci, 0,±1)2. Over all, this shows

thatn∑i=1

dist(ci, 0,±1)2 = O

p

).

Let di = round(ci, 0,±1). It is natural to “round” the coefficients ci to di. In order to do this properly,we switch to the orthogonal variables ωi = yi−p√

p(1−p). Writing f in these terms, we get

f = c0 − pn∑i=1

ci +

n∑i=1

ci(yi − p) = c0 − pn∑i=1

ci +

n∑i=1

√p(1− p)ciωi.

We form a new function g by replacing ci with di:

g = c0 − pn∑i=1

ci +

n∑i=1

ci(yi − p) = c0 − pn∑i=1

ci +

n∑i=1

√p(1− p)diωi.

35

We only changed the level one Fourier coefficients, since this ensures that we can estimate ‖f − g‖2:

‖f − g‖2 =

n∑i=1

(√p(1− p)ci −

√p(1− p)di)2 = p(1− p)

n∑i=1

(ci − di)2 = O(ε).

In particular, using (a+ b)2 ≤ 2a2 + 2b2, we see that g is also close to Boolean:

E[dist(g, 0, 1)2] ≤ 2E[(g − f)2] + 2E[dist(f, 0, 1)2] = O(ε).

A similar calculation shows that not too many di’s can be non-zero. On the one hand, ‖g‖2 ≤ 2 +2E[dist(g, 0, 1)2] = O(1). On the other hand,

‖g‖2 ≥n∑i=1

g(i)2 = p(1− p)n∑i=1

d2i .

This shows that∑ni=1 d

2i = O(1/p), and so at most O(1/p) of the di can be non-zero.

How does the function g look like in terms of the original variables yi? Reversing the calculations above,for some b ∈ R we have

g = b+

n∑i=1

diyi.

We can write this even more simply: let I be the set of indices such that di = 1, and let J be the set ofindices such that dj = −1. Then

g = b+∑i∈I

yi −∑j∈J

yj .

Since at most O(1/p) are non-zero, |I|, |J | = O(1/p).The probability that all yi in I ∪ J are zero is (1− p)|I|∪|J| = (1− p)O(1/p) = Ω(1). This shows that

E[dist(b, 0, 1)2] = E[dist(g, 0, 1)2 | yi = 0 for all i ∈ I ∪ J ] ≤ E[dist(g, 0, 1)2]

Pr[yi = 0 for all i ∈ I ∪ J ]= O(ε).

Taking d = round(b, 0, 1), this shows that the function

h = d+∑i∈I

yi −∑j∈J

yj

satisfies ‖g − h‖2 = O(ε) and so, as before, ‖f − h‖2 = O(ε) and E[dist(h, 0, 1)2] = O(ε).The cases d = 0 and d = 1 are quite similar, so from now on we assume that d = 0, and approximate the

structure of f ; if d = 1, the same structure will hold for 1− f . Thus from now on, we assume that

h =∑i∈I

yi −∑j∈J

:= hI − hJ .

Let us start by getting rid of J . We know that Pr[hI = 0] = Ω(1), and so E[dist(−hJ , 0, 1)2] = O(ε).Since hJ ≥ 0, this implies that E[h2

J ] = O(ε). Therefore ‖hI − f‖2 = O(ε) and E[dist(hI , 0, 1)2] = O(ε).Now we bound the size of I, by looking at the probability that hI = 2:

Pr[hI = 2] =

(|I|2

)p2(1− p)|I|−2 (∗)

= Ω(|I|2p2).

The starred lower bound holds as long as |I| 6= 1. This shows that either |I| = 1, or |I| = O(√ε/p).

Notice that‖hI‖2 ≤ |I|p+ |I|2p2 = O(p+

√ε),

and so the same holds for ‖f‖2. In other words, f is actually somewhat close to a constant function.Summarizing:

36

Let f : 0, 1n → R be a degree 1 polynomial satisfying Eµp [dist(f, 0, 1)2] ≤ ε, where p ≤ 1/2.Then there exists a set I, of size at most max(1, O(

√ε/p)), such that

min

∥∥∥∥∥∑i∈I

yi − f

∥∥∥∥∥2

,

∥∥∥∥∥1−∑i∈I

yi − f

∥∥∥∥∥2 = O(ε).

Moreover, there exists d ∈ 0, 1 such that

‖f − d‖2 = O(p+√ε).

Finally, let us see what all this implies for Boolean functions F : 0, 1n → 0, 1 which are close to degreeone, in the sense that ‖F>1‖2 = ε. Taking f = F≤1, we get ‖f − F‖2 = ε. Since F is Boolean, in particularE[dist(f, 0, 1)2] ≤ ε. Therefore we can approximate either f or 1− f by a function of the form hI , where|I| ≤ max(1, O(

√ε/p)).

Without loss of generality, let us assume that it is f which is close to hI , that is, ‖f − hI‖2 = O(ε). Sincewe are going to replace f with the Boolean function F , it is natural to replace hI with the Boolean functionHI given by

HI =∨i∈I

yi,

that is, the maximum of the coordinates I. If |I| = 1 then HI = hI . Otherwise,

‖hI −HI‖2 = E[(hI − 1)21hi≥2] = E[(hI − 1)2]− Pr[hI = 0].

Since E[h2I ] ≤ |I|p+ |I|2p2, E[hI ] = |I|p, and Pr[hI = 0] ≥ 1− |I|p, this shows that

‖hI −HI‖2 = E[h2I ]− 2E[hI ] + 1− Pr[hI = 0] ≤ (|I|p+ |I|2p2)− 2|I|p+ 1− (1− |I|p) = |I|2p2 = O(ε).

Since ‖f − hI‖2 = O(ε), ‖f − F‖2 = ε, and ‖hI −HI‖2 = O(ε), we deduce that ‖F −HI‖2 = O(ε). SinceF and HI are both Boolean, in fact Pr[F 6= HI ] = O(ε). In summary:

Let F : 0, 1n → 0, 1 be a Boolean function satisfying ‖F>1‖2 ≤ ε with respect to µp, where p ≤ 1/2.Then there exists a set I, of size at most max(1, O(

√ε/p)), such that

min

(Pr[F 6=

∨i∈I

yi

],Pr[1− F 6=

∨i∈I

yi

])= O(ε).

Moreover, there exists d ∈ 0, 1 such that

Pr[F 6= d] = O(p+√ε).

In this theorem, it is important to note that the definition of F>1 itself also depends on p.

Exercise Generalize the results of this subsection to the case of an arbitrary product measure on 0, 1n.

11 Invariance principle

One of the most celebrated results in probability theory is the central limit theorem, which states (in oneversion) that the sum of many i.i.d. “reasonable” random variables behaves like a Gaussian random variable.

The assumption that the variables are not only independent but also identically distributed is crucial.Indeed, a sum of the form

x1 +x2 + · · ·+ xn

n,

37

where (x1, . . . , xn) is a random point in ±1n, does not look like a Gaussian. The problem is that x1 hastoo much influence on the sum, compared to the other variables.

The Berry–Esseen1 theorem is one way to quantify this phenomenon. Suppose that X1, . . . , Xn areindependent zero-mean variables with second moments E[X2

i ] = σ2i and third moments E[|Xi|3] = ρi. The

sum X1 + · · ·+Xn has zero mean and variance σ2 =∑i σ

2i . The Berry–Esseen theorem states that for every

t ∈ R,

|Pr[X1 + · · ·+Xn < t]− Pr[N(0, σ2) < t]| = O

(∑ni=1 ρiσ3

),

where N(µ, σ2) is a Gaussian random variable with mean µ and variance σ2.Let us see what this implies in the special case Xi = cixi, where (x1, . . . , xn) is a random point in ±1n.

The second and third moments are easily calculated to be

σ2i = c2i , ρi = |ci|3.

Therefore the sum is close to a Gaussian if

n∑i=1

|ci|3

(n∑i=1

c2i

)3/2

.

This condition is a bit hard to work with. To better understand the picture, let us notice that

n∑i=1

|ci|3 ≤ maxi|ci| ·

n∑i=1

c2i ,

and so the condition holds whenever

maxic2i

n∑i=1

c2i ,

that is, whenever none of the squared weights is “prominent” compared to the sum of squared weights.

The invariance principle is a similar statement for polynomials. In contrast to the linear case, whendealing with polynomials, we cannot say that a “smooth” polynomial behaves like a Gaussian. For example,suppose that (x1, . . . , xn) is a random point in ±1n, and consider the polynomial(

n∑i=1

xi

)2

.

This behaves not like a Gaussian, but rather like the square of a Gaussian. Similarly,n/2∑i=1

xi

2

+

n∑i=n/2+1

xi

behaves like the sum of a squared Gaussian and an independent Gaussian. Therefore we need to change thestatement of the result.

The basic idea is that in the context of the Berry–Esseen theorem, we can “convert” X1 + · · ·+Xn toN(0, σ2) (recall that σ2 =

∑i σ

2i , where σ2

i = E[X2i ]) by replacing each Xi by a Gaussian random variable

Gi = N(0, σ2i ) with the same mean and variance. This kind of replacement is something that makes sense for

arbitrary polynomials.We will be interested in functions on the Boolean cube ±1n. For such functions, we cannot consider

arbitrary polynomials. Indeed, consider the quadratic∑ni=1 x

2i√

n.

1Sometimes spelled Esseen. The name is apparently pronounced as in English, with the stress on the first syllable.

38

On the Boolean cube, this is just the constant√n, but if we replace each xi by a standard Gaussian, then

we obtain N(0, 3) (since E[N(0, 1)4] = 3)! This kind of problem goes away if we concentrate on multilinearpolynomials.

We have arrived at the following problem. Let P be a multilinear polynomial, let X1, . . . , Xn be randomi.i.d. variables uniformly distributed on ±1, and let G1, . . . , Gn be random i.i.d. standard Gaussians. Whenare the distributions of P (X1, . . . , Xn) and P (G1, . . . , Gn) close?

There are many notions of closeness that one can consider. For example, the Berry–Esseen theoremconsiders closeness in CDF, also known as Kolmogorov–Smirnov distance. It turns out that it is much easierto consider closeness with respect to test functions. That is, given a “nice” function Φ: R→ R, when can webound

|E[Φ(P (X1, . . . , Xn))]− E[Φ(P (G1, . . . , Gn))]|?

It is natural to try a hybrid argument, in this context known as the replacement technique. In thisargument, we don’t compare (X1, . . . , Xn) and (G1, . . . , Gn) directly. Instead, we consider a sequence ofdistributions, each differing only in a single random variable:

(X1, X2, X3, . . . , Xn)

(G1, X2, X3, . . . , Xn)

(G1, G2, X3, . . . , Xn)

. . .

(G1, G2, G3, . . . , Gn)

We bound the difference in expectation of two neighboring distributions, and then take the sum of differences.This is a standard technique in cryptography. Let’s try to apply it here and see what happens.

We wish to bound

|E[Φ(P (G1, . . . , Gi−1, Xi, Xi+1, . . . , Xn))]− E[Φ(P (G1, . . . , Gi−1, Gi, Xi+1, . . . , Xn))]| ≤E

G1,...,Gi−1

Xi+1,...,Xn

[| EXi

[Φ(P (G1, . . . , Gi−1, Xi, Xi+1, . . . , Xn))]− EGi

[Φ(P (G1, . . . , Gi−1, Gi, Xi+1, . . . , Xn)])|].

Once we fix G1, . . . , Gi−1, Xi+1, . . . , Xn, the polynomial becomes a function on a single variable, say

P (G1, . . . , Gi−1, z,Xi+1, . . . , Xn) = Az +B.

We wish to argue that E[Φ(AXi +B)] ≈ E[Φ(AGi +B)]. We can expect this to be the case if A is small onaverage. We chose Gi in such a way that Xi and Gi have the same first and second moment, and they alsohappen to have an identical third moment. This suggests aiming at an expression involving powers of Xi.We can obtain such an expression using a Taylor series for Φ:

Φ(Az +B) = Φ(B) + Φ′(B)Az + Φ′′(B)A2z2

2+ Φ′′′(B)

A3z3

6+ Φ′′′′(Aθz +B)

A4z4

24,

for some θ ∈ [0, 1]. Suppose that |Φ′′′′| is always bounded by some K. Since Xi and Gi have identical first,second, and third moments,

|E[Φ(AXi +B)]− E[Φ(AGi +B)]| ≤ KA4

24|E[X4

i ] + E[G4i ]| = O(KA4).

In other words, the difference in expectation under the two hybrid distributions is small if E[A4] is small.To see whether this is the case, we need a formula for A. Recalling the operator Li from Section 4,

A = LiP (G1, . . . , Gi−1, ·, Xi+1, . . . , Xn),

39

where the value of the placeholder makes no difference. We have thus shown that

|E[Φ(P (G1, . . . , Gi−1, Xi, Xi+1, . . . , Xn))]− E[Φ(P (G1, . . . , Gi−1, Gi, Xi+1, . . . , Xn))]| ≤O(K) · E

G1,...,Gi−1

Xi+1,...,Xn

[(LiP )4].

What can we say about E[(LiP )4]? Before answering that, let us note that

EG1,...,GiXi+1,...,Xn

[(LiP )2] =∑S⊆[n]i∈S

P (S)2 = Infi[P ].

This is clear if we replaced G1, . . . , Gi−1 by X1, . . . , Xi−1, and it still holds with the Gaussian variables, sincethe calculation only uses the first two moments, which are identical for Xj and Gj .

Again replacing G1, . . . , Gi−1 with X1, . . . , Xi−1, we can bound E[(LiP )4] by 9degP E[(LiP )2]2, usinghypercontractivity (see Section 6). The proof of hypercontractivity in Section 6 easily extends to the Gaussiancase, and so

|E[Φ(P (G1, . . . , Gi−1, Xi, Xi+1, . . . , Xn))]−E[Φ(P (G1, . . . , Gi−1, Gi, Xi+1, . . . , Xn))]| ≤ O(K9degP Infi[P ]2).

Summing over all i, this shows that

|E[Φ(P (X1, . . . , Xn))]− E[Φ(P (G1, . . . , Gn))]| = O

(K9degP

n∑i=1

Infi[P ]2

).

We can simplify this expression using the bound Inf[p] ≤ degP · V[P ], proved in Section 4:

|E[Φ(P (X1, . . . , Xn))]− E[Φ(P (G1, . . . , Gn))]| = O(K2O(degP ) V[P ] max

iInfi[P ]

).

Apart from the normalizing factor V[P ], this states that if P has low degree and low influences, then for anytest function Φ with bounded fourth derivative, Φ(P ) behaves the same under both the uniform distributionover ±1n and the standard n-dimensional Gaussian distribution. This is one form of the celebratedinvariance principle [MOO10]:

Let Φ: R → R be a function satisfying |Φ′′′′(z)| ≤ K for all z ∈ R. If P is a degree dmultilinear polynomial with maximal influence τ then

|E[Φ(P (X1, . . . , Xn))]− E[Φ(P (G1, . . . , Gn))]| = O(K2O(d)τ V[P ]),

where (X1, . . . , Xn) is the uniform distribution on ±1n, and (G1, . . . , Gn) are i.i.d.standard Gaussians.

It is possible to deduce closeness in CDF (as in the Berry–Esseen theorem) from this result by usingsuitable test functions. The idea is to approximate the function Φ(z) = [z < t] by two other functions Φ`,Φu,with bounded fourth derivatives, that approximate Φ in the following senses:

1. Φ` ≤ Φ ≤ Φu.

2. Φ` and Φu agree with Φ outside a small neighborhood of t.

The invariance principle shows that

E[Φ`(P (X1, . . . , Xn))] ≈ E[Φ`(P (G1, . . . , Gn))].

If we show thatE[Φ`(P (G1, . . . , Gn))] ≈ E[Φ(P (G1, . . . , Gn))],

40

then we can conclude that

E[Φ(P (X1, . . . , Xn))] ≥ E[Φ`(P (X1, . . . , Xn))] ≈ E[Φ(P (G1, . . . , Gn))],

and repeating the same argument with Φu would show a similar upper bound.Why does E[Φ`(P (G))] ≈ E[Φ(P (G))]? The two functions Φ`,Φ differ only in a neighborhood of t,

so this amounts to showing that P (G1, . . . , Gn) cannot be too concentrated in any small neighborhood, aphenomenon known as Gaussian anti-concentration and proved by Carbery and Wright [CW01].

Exercise Show that ‖T1/√

3f‖4 ≤ ‖f‖2 when the input consists of m standard Gaussians and n−m Boolean

variables uniformly distributed on ±1, and conclude that E[f4] ≤ 9deg f E[f2]2 for such functions.

11.1 Application: Majority is Stablest

Noise sensitivity Suppose that f : ±1n → ±1 represents the outcome of an election, just like inSection 4. How sensitive is the election to random errors in reading individual votes? Suppose that eachvote is flipped with probability p < 1/2. Denoting the original votes by x = x1, . . . , xn, the new votes byy = y1, . . . , yn, and assuming as usual that the original votes are random, the probability that the outcomechanges is

Pr[f(x) 6= f(y)] = Pr[f(x)f(y) = −1] = E[

1− f(x)f(y)

2

].

Notice that y ∼ Nρ(x) for ρ = 1− 2p, where Nρ is the distribution used to define the noise operator Tρ. Thus

Pr[f(x) 6= f(y)] =1

2− 1

2E[f(x)Tρf(x)] =

1

2− 1

2〈f, Tρf〉.

Thus the stability of f depends only 〈f, Tρf〉: the larger 〈f, Tρf〉 is, the more stable the elections are. Thequantity 〈f, Tρf〉 is known as the noise sensitivity of f .

Which balanced function f is the most stable? Since f is balanced,

〈f, Tρf〉 =∑S

ρ|S|f(S)2 ≤ ρ∑S

f(S)2 = ρ,

and this bound is achieved by functions of degree 1. As we have seen in Section 2, such functions are dictators.Dictators are not a good choice for a voting system. What happens if we decree that no voter have too

much of an influence on the outcome of the election? That is, what if we ask that the maximal influence besmall? According to the invariance principle, in this case the function f behaves as if it lived on Gaussianspace. What is the analog of noise sensitivity in Gaussian space?

Rotation sensitivity The noise operator Tρ has a counterpart in Gaussian space. Recall that in Section 6.2,we defined a distribution Nρ on pairs of points in ±1n. This distribution had the following features: if(x, y) ∼ Nρ then:

1. The marginals x, y are distributed uniformly over ±1n.

2. For every i, the correlation between xi and yi is E[xiyi] = ρ.

3. For any two functions f, g, 〈f, Tρg〉 = E(x,y)∼Nρ [f(x)g(y)].

We can construct a similar distribution on Gaussian space using bivariate Gaussians:

(xi, yi) ∼ N((

0 0),

(1 ρρ 1

)).

41

In words, (xi, yi) is a bivariate Gaussian with zero mean and covariance matrix

(1 ρρ 1

). Concretely, given

xi, we can sample yi using the formula

yi = ρxi +√

1− ρ2 · zi, where zi ∼ N(0, 1).

Indeed, it is not hard to check that E[xiyi] = ρ and E[y2i ] = ρ2 + (

√1− ρ2)2 = 1.

Now suppose that wi ∼ N(0, 1), and define

yi = ρ1xi +√

1− ρ21 · wi,

zi = ρ2xi −√

1− ρ22 · wi.

Clearly (xi, yi) ∼ Nρ1 and (xi, zi) ∼ Nρ2 . What about (yi, zi)? Since

E[yizi] = ρ1ρ2 −√

1− ρ21

√1− ρ2

2,

also (yi, zi) ∼ Nρ for a suitable ρ. We get a much simpler formula if we take θ1 = cos−1 ρ1 and θ2 = cos−1 ρ2:the addition formula for cosine shows that cos−1 ρ = θ1 + θ2, if we choose θ1, θ2 ∈ [0, π].2

If f is a Boolean function on Gaussian space then

Pr[f(y) 6= f(z)] ≤ Pr[f(y) 6= f(x)] + Pr[f(z) 6= f(x)],

and so, using the notation Rθ for Ncos θ, if θ1, θ2 ∈ [0, π] then

Pr(x,y)∼Rθ1+θ2

[f(x) 6= f(y)] ≤ Pr(x,y)∼Rθ1

[f(x) 6= f(y)] + Pr(x,y)∼Rθ2

[f(x) 6= f(y)].

The same formula holds for any number of summands, and in particular, for any integer ` ≥ 1,

Pr(x,y)∼Rπ/2

[f(x) 6= f(y)] ≤ ` Pr(x,y)∼Rπ/2`

[f(x) 6= f(y)].

Since cos(π/2) = 0, the distribution Rπ/2 = N0 consists of two independent Gaussians, and so the left-handside is just

Prx,y

[f(x) 6= f(y)] = Ex,y

[1− f(x)f(y)

2

]=

E[f2]− E[f ]2

2=

1

2V[f ].

Altogether, this shows that

Pr(x,y)∼Rπ/2`

[f(x) 6= f(y)] ≥ V[f ]

2`. (2)

This inequality is tight for the sign function f = sgn(x1), as we now show. The sign function satisfiesV[f ] = 1. How do we compute its rotation sensitivity, that is the probability that sgn(x1) 6= sgn(y1) when(x1, y1) ∼ Rπ/2`?

We can sample x1, y1 by sampling two independent Gaussians x1, z1 and letting y1 = cos(π/2`)x1 +sin(π/2`)z1. Thus sgn(x1) is the sign of the inner product of (x1, z1) with (1, 0), and sgn(y1) is the sign ofthe inner product of (x1, z1) with (cos(π/2`), sin(π/2`)).

Write (x1, z1) = (r cos θ, r sin θ), which is what we get if we think of this pair as a point on the complexplane. Crucially, θ is uniformly distributed. The inner product of (x1, z1) with a vector (cos(α), sin(α)) equalsr cos(θ − α), and in particular, the sign of the inner product only depends on α. In particular:

• x1 is positive if θ lies in a sector of width π around 0.

2This is not a coincidence, and can be explained by viewing the noise operator in terms of rotations on the plane, similar toFigure 1.

42

(1, 0)(cos θ, sin θ)

θ

θ

θ

Figure 1: Red region is where x1 is positive, blue region is where y1 is positive, purple region is where theyhave different signs

• y1 is positive if θ lies in a sector of width π around π/2`.

This shows that x1 and y1 have different signs with probability 2(π/2`)/(2π) = 1/2`, see Figure 1. We haveshown that the sign function satisfies

Pr(x,y)∼Rπ/2`

[sgn(x1) 6= sgn(y1)] =1

2`.

Other functions satisfying this include the sign of any other linear form, for example f(x) = sgn(x1 + · · ·+xn);this is because the Gaussian distribution is rotation-invariant, and sign functions are scale-invariant. Suchfunctions f are known as (balanced) halfspaces. More generally, we can consider functions of the formsgn(a1x1 + · · ·+ anxn + θ), which are not balanced. It turns out that they are also tight for Equation (2).

Borell [Bor75], using completely different techniques (symmetrization), showed that Equation (2) holdsfor all angles θ, and that halfspaces are the unique optimizers (up to measure zero). Stability versions ofBorell’s theorem are also known [MN15, Eld15]. Such results show that if Equation (2) is almost tight then fis close to a halfspace.

Majority is Stablest The analog of halfspaces on the Boolean cube is threshold functions, that is, functionsof the form sgn(x1 + · · ·+ xn + θ). Let us focus on the case of majority, MAJ(x) = sgn(x1 + · · ·+ xn). If x isa random point on the Boolean cube, then X := (x1 + · · ·+ xn)/

√n has distribution quite close to Gaussian.

If (x, y) ∼ Nρ and Y := (y1 + · · ·+ yn)/√n then E[XY ] = ρ, and so the bivariate central limit theorem shows

that (X,Y ) roughly has the Gaussian distribution Nρ. Therefore MAJ behaves much like the sign functionconsidered above, and in particular,

Pr(x,y)∼Nρ

[MAJ(x) 6= MAJ(y)] ≈ cos−1 ρ

2`.

In contrast, if f is a dictator then

Pr(x,y)∼Nρ

[f(x) 6= f(y)] = Pr(x,y)∼Nρ

[x1 6= y1] =1− ρ

2.

The difference is illustrated in Figure 2.The invariance principle implies (after several technical manipulations that we skip) that Borell’s theorem

approximately holds for functions on the Boolean cube ±1n as long as all influences of f are small. Moreaccurately, for every δ > 0 there exists τ > 0 such that if maxi Infi[f ] ≤ τ then

Pr(x,y)∼Nρ

[f(x) 6= f(y)] ≥ cos−1 ρ

π− δ.

43

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

ρ

1−ρ2

cos−1 ρπ

Figure 2: Pr[f(x) 6= f(y)] when (x, y) ∼ Nρ for dictator (red) and majority (blue)

Stated differently, if a balanced Boolean function f is significantly more stable than cos−1 ρπ , then the only

explanation is that it has an influential variable, which is preventing it from behaving as if it lived in Gaussianspace.

What does all this have to do with theoretical computer science? It turns out that this result, known asMajority is Stablest, implies that the Goemans–Williamson algorithm [GW95] gives the optimal worst-caseapproximation ratio for MAX-CUT, as shown in [KKMO07]. This was the original impetus for proving theinvariance principle.

While the original proof of Majority is Stablest used the invariance principle, since then direct inductiveproofs were found [DMN16].

11.2 Application: Bourgain’s tail bound

The Kindler–Safra theorem, proved in Section 7, states that a Boolean function which is concentrated upto constant degree is close to a junta. If we take the degree into account, the Kindler–Safra theorem statesthat if f : ±1n → ±1 satisfies ‖f>k‖2 ≤ ε, then f is 2O(k)ε-close to a Boolean junta depending on 2O(k)

variables. This result is meaningful only as long as k log n. What happens for larger k?Let us see what happens for the majority function. Since all influences are very small, we instead consider

the sign function, which is the Gaussian analog of majority. We have seen that

Pr(x,y)∼Nρ

[sgn(x) 6= sgn(y)] =cos−1 ρ

π.

Since sgn is Boolean,

〈sgn, Tρ sgn〉 = E(x,y)∼Nρ

[sgn(x) sgn(y)] = 1− 2 Pr(x,y)∼Nρ

[sgn(x) 6= sgn(y)] = 1− 2 cos−1 ρ

π.

If sgn was a function on the Boolean cube, then the left-hand side would be

∞∑d=0

ρd‖ sgn=d ‖2.

44

(We can make this meaningful on Gaussian space using the Hermite expansion.) The right-hand side is

2

π

∞∑d=0

(2dd

)4d

ρ2d+1

2d+ 1.

This is just the Taylor series of arccosine. Using the well-known asymptotics of central binomial coefficients,we see that for odd d,

‖ sgn=d ‖2 = Θ

(1

d3/2

).

Summing this over all d ≥ k,

‖ sgn>k ‖2 = Θ

(1√k

).

Assuming that majority behaves like the sign function, this shows that there exist functions which are farfrom being a junta and have mass Θ( 1√

k) beyond level k.

Bourgain’s theorem [Bou02] states that this bound is optimal: any Boolean function which has less massbeyond level k is essentially a junta. Bourgain’s original proof was complicated. Here we will follow thetechnique of Kindler–Kirshner–O’Donnell [KKO18], who also introduced rotation sensitivity, analyzed in thepreceding section.

Gaussian tail bound The starting point is showing that every Boolean function f in Gaussian space has

mass Ω(V[f ]√

k

)beyond the k’th level; the connection to functions on the Boolean cube is via the invariance

principle. This will follow as an application of the inequality

Pr(x,y)∼R2θ

[f(x) 6= f(y)] ≤ 2 Pr(x,y)∼Rθ

[f(x) 6= f(y)],

applied to a suitable angle θ.Let us first express both sides in terms of the Hermite expansion. Above we have shown that

〈f, Tρf〉 = 1− 2 Pr(x,y)∼Nρ

[f(x) 6= f(y)],

and so, taking ρ = cos θ,

Pr(x,y)∼Rθ

[f(x) 6= f(y)] =1− 〈f, Tρf〉

2=〈f, f〉 − 〈f, Tρf〉

2=

∞∑d=0

1− ρd

2‖f=d‖2.

(The Hermite expansion continues to infinity.) Therefore we can express the inequality above as follows:

∞∑d=0

1− cosd(2θ)

2‖f=d‖2 ≤

∞∑d=0

(1− cosd θ)‖f=d‖2.

How do the two sides compare? Taylor expansion shows that

1− cosd θ =dθ2

2−O(d2θ4).

Similarly,1− cosd(2θ)

2=d(2θ)2

4−O(d2θ4) = dθ2 −O(d2θ4).

Thus when dθ2 1, the coefficient on the left is in fact larger than the coefficient on the right! The onlyexplanation is that the function has significant component in higher levels.

45

In order to quantify this, let us pick a threshold D = Θ(1/θ2). When d ≤ D, the “eigenvalue” on the leftis roughly double that on the right, and so

∞∑d=D+1

(1− cosd θ)‖f=d‖2 ≥ Ω(1) ·∞∑d=0

1− cosd(2θ)

2‖f=d‖2.

The expression on the left is bounded by ‖f>D‖2. The expression on the right is just

Ω(1) · Pr(x,y)∼Rθ

[f(x) 6= f(y)](2)

≥ V[f ]

π/θ,

assuming that π/θ is an even integer. Given k, choose θ to be an even integer such that k = Θ(1/θ2).Combining everything, we have shown that

‖f>k‖2 = Ω

(V[f ]√k

).

Junta conclusion Applying the invariance principle, we obtain a similar bound for functions on theBoolean cube when all influences are small. But we can show more. Let f : ±1n → ±1 be an arbitraryBoolean function such that ‖f>k‖2 ≤ ε√

k. We will show that f is approximately a junta.

The junta variables are easy to identify, namely the variables having high influence. However, there is asubtle issue: f could have large total influence, and so there could be many variables with high influence.Therefore we actually choose as the junta variables those variables of high influence in f≤k, whose numberwe can bound. We ignore this subtlety (which is also required in order to apply the invariance principle inthe first place!) in the sequel; roughly speaking, this is not an issue since f is close to f≤k, by assumption.

Denote by J the variables with high influence. We would like to move f to Gaussian space using theinvariance principle. However, the invariance principle only applies to functions in which all variables havelow influence. Therefore we only replace the variables outside J with Gaussians, obtaining a function on

±1J × RJ . Consequently, we can only apply the tail bound on the Gaussian part.For each assignment α to the variables in J , the function fα obtained by substituting α to J lives in

Gaussian space, and so satisfies

‖f>kα ‖2 ≥ Ω

(1√k

)· V[fα]. (3)

How do these quantities relate to the Fourier spectrum of f? The Fourier expansion of fα is

∑S⊆J

∑T⊆J

αT f(S ∪ T )

xS .

Therefore

‖f>kα ‖2 =∑S⊆J|S|>k

∑T⊆J

αT f(S ∪ T )

2

=∑S⊆J|S|>k

∑T1,T2⊆J

αT1αT2

f(S ∪ T1)f(S ∪ T2).

Taking expectation over α gives

[‖f>kα ‖2] =∑S⊆J|S|>k

∑T⊆J

f(S ∪ T )2 =∑

|R\J|>k

f(R)2,

due to orthogonality of characters.

46

Since variance is just the case k = 0, taking expectation over the tail bound (3) gives∑|R\J|>k

f(R)2 ≥ Ω

(1√k

)∑R 6⊆J

f(R)2.

The left-hand side is at most ‖f>k‖2 ≤ ε√k

. The right-hand side, as we have seen in Section 5, is the distance

between f and the function g obtained by averaging over the coordinates outside J . Thus

ε√k≥ ‖f>k‖2 ≥ Ω

(1√k

)‖f − g‖2,

and so ‖f − g‖2 = O(ε). As we have seen in Section 5, if we round g to a Boolean function G, thenPr[f 6= G] = O(ε) as well.

How large is the junta J? This depends on the threshold required for the invariance principle to gothrough, an issue we have been vague about. It turns out that |J | ≤ 2O(k)poly(1/ε).

12 Global hypercontractivity

Hypercontractivity is the crucial ingredient in many proofs in Boolean function analysis. Hypercontractivityholds in the p-biased cube, as we have shown in Section 8.2. However, the parameters deteriorate as p getscloser to 0. This is inevitable, as the following simple calculation shows.

Suppose that ‖Tρf‖4 ≤ ‖f‖2 for some value of ρ, with respect to µp. As we have seen in Section 6, thisimplies that

‖f‖4 = ‖TρTρ−1f‖4 ≤ ‖Tρ−1f‖2 ≤ ρ− deg f‖f‖2.

Let us apply this to the function f(y1, . . . , yn) = y1, where y1, . . . , yn ∼ µp. For any q ≥ 1, ‖f‖q = E[|f |q]1/q =p1/q. Therefore the inequality above reads

4√p ≤ ρ−1√p =⇒ ρ ≤ 4

√p.

Conversely, in Section 8.2 we have shown that hypercontractivity does hold for ρ = 4√p, with the usual

inductive proof that we first explained in Section 6. For p close to zero, this result is next to useless, sinceT 4√p reduces a function to essentially its average.This examples shows that we cannot hope for a hypercontractivity estimate with ρ independent of p.

However, it might be that such an estimate holds for some functions. This should remind us of the situationin Section 11, in which we showed that low degree functions with low influences behave in similar wayson the unbiased Boolean cube and on Gaussian space. Could it be that functions with low influences doobey hypercontractivity with ρ independent of p? Such a result was proved in [KLLM19]. Here we give anexposition based on unpublished notes of Noam Lifshitz.

The basic idea is to compare a function f on the p-biased cube with its analog F on the unbiased cube.By that we mean that f and F have the same Fourier coefficients:

f =∑S⊆[n]

f(S)ωS , F =∑S⊆[n]

f(S)xS ,

where ωS =∏i∈S

yi−p√p(1−p)

, a p-biased Fourier character, depends on (y1, . . . , yn) ∼ µp, and xS =∏i∈S xi,

an unbiased Fourier character, depends on (x1, . . . , xn) ∼ µ1/2.Just as hypercontractivity states that by applying noise we can convert an L4 norm into an L2 norm, we

will show that by applying noise, we can convert the L4 norm of f to the L4 norm of F :

E[(Tρf)4] ≤ E[F 4] + penalty terms,

47

where the penalty terms are the analogs of the error terms in the proof of the invariance principle. Applyingeven more noise, we would be able to convert the L4 norm on the right into an L2 norm E[F 2]2, and soobtain an expression involving only the original function, since f and F have the same L2 norm.

The proof will proceed by induction, as in Section 6. Therefore we start with the base case. Supposethat f = aω + b, where ω = y−p√

p(1−p)for y ∼ µp, and let F = ax+ b for x ∼ µ1/2. We would like to compare

E[(Tρf)4] and E[F 4], which equal

E[(Tρf)4] = E[ω4]ρ4a4 + 4E[ω3]ρ3a3b+ 6ρ2a2b2 + b4,

E[F 4] = a4 + 6a2b2 + b4.

We calculated E[ω3] and E[ω4] in Section 8.2. Looking at the expressions, we see that if p is bounded awayfrom 1/2 then

κ3 := E[ω3] = Θ

(1√p

), κ4 := E[ω4] = Θ

(1

p

).

Comparing the expressions for E[Tρf4] and E[F 4], there are two main issues. First, κ4 is not constant.

We will take care of this using a penalty term. Second, we have an additional term κ3a3b. Using the AM-GM

inequality, we can convert it to terms which we are able to handle:

κ3a3b =

√κ2

3a6b2 =

√κ2

3a4 · a2b2 ≤ κ2

3a4 + a2b2

2.

ThereforeE[(Tρf)4] ≤ (κ4ρ

4 + 2κ23ρ

3)a4 + (2ρ3 + 6ρ2)a2b2 + b4.

For small enough constant ρ, this is at most

E[F 4] +O

(1

p

)a4.

We can extract a by taking a derivative of F . Renaming x to x1, We have L1F = ax1 (see Section 4),and so

E[(Tρf)4] ≤ E[F 4] + αE[(L1F )4], where α = O

(1

p

).

This concludes the one-dimensional case.It is not too difficult to guess what we get in the general case:

E[(Tρf)4] ≤∑S⊆[n]

α|S| E[(LSF )4],

where LSF is obtained by applying the operators Li for all i ∈ S, in an arbitrary order (they commute).We prove this by induction. We have already seen the base case n = 1. Now suppose f = ωn+1g + h, and

let F = xn+1G+H, where G,H are obtained from g, h by replacing ωT by xT .For every value of y1, . . . , yn, the function Tρf becomes the one-dimensional function Tρ[ωn+1Tρg(y1, . . . , yn)+

Tρh(y1, . . . , yn)] (where the first Tρ is with respect to the last coordinate and the other Tρ’s are with respectto the first n coordinates), and so the base case shows that

Eyn+1

[(Tρf)4(y1, . . . , yn, yn+1)] ≤ Exn+1

[(xn+1Tρg(y1, . . . , yn) + Tρh(y1, . . . , yn))4] + α(Tρg(y1, . . . , yn))4.

Taking expectation over y1, . . . , yn, the second term is at most

α∑S⊆[n]

α|S| E[(LSG)4] =∑

S⊆[n+1]n+1∈S

α|S+1| E[(LSF )4].

48

As for the first term, since xn+1G+H = F , we can similarly bound it by

Exn+1

∑S⊆[n]

α|S| Ex1,...,xn

[(LSF )4] =∑S⊆[n]

α|S| E[(LSF )4].

Altogether, this shows that

E[(Tρf)4] ≤∑

S⊆[n+1]

α|S| E[(LSF )4],

as needed. This completes the inductive proof.Finally, in order to obtain an expression in terms of f on the right-hand side, we apply more noise. Let

σ = ρ/√

3. Then

E[(Tσf)4] ≤∑S⊆[n]

α|S| E[(LST1/√

3F )4] ≤∑S⊆[n]

α|S| E[(LSF )2]2 =∑S⊆[n]

α|S| E[(LSf)2]2,

since LSF and LSf have the same Fourier coefficients. We record this form of hypercontractivity:

For any p ≤ 1/2, the following holds with respect to µp and some constant σ > 0independent of p:

‖Tσf‖44 ≤∑S⊆[n]

α|S|‖LSf‖42, α = O

(1

p

).

When is the right-hand side small? For starters, let us notice that

f(x−n, 1)− f(x−n, 0) =∑S⊆[n]n∈S

f(S)ωS\n(1− p)− (−p)√

p(1− p)=∑S⊆[n]n∈S

f(S)ωS\n1√

p(1− p).

This shows that

Ex−n

[(f(x−n, 1)− f(x−n, 0))2] =1

p(1− p)‖Lnf‖22.

(We have actually already seen this calculation in Section 9.) The expression f(x−n, 1)− f(x−n, 0) can becast as an operator Dnf = (Lnf)/(

√p(1− p)ωn). Defining Di1,...,isf = Di1 . . . Disf , this shows that

‖Tσf‖44 ≤∑S⊆[n]

(pα)|S|‖DSf‖22‖LSf‖22,

where pα = O(1). Now suppose that ‖DSf‖2 ≤ β for all S. Then

‖Tσf‖44 ≤∑S⊆[n]

(pα)|S|β2‖LSf‖22 = β2∑S⊆[n]

∑T⊇S

(pα)|S|f(T )2 = β2∑T⊆[n]

f(T )2∑S⊆T

(pα)|S| = β2∑T⊆[n]

(1+pα)|T |f(T )2.

Suppose now that we replace σ by τ = σ/(1 + pα). Then

‖Tτf‖44 ≤ β2∑T⊆[n]

f(T )2 = β2‖f‖22.

This is another form of hypercontractivity worth recording:

For any p ≤ 1/2, the following holds with respect to µp and some constant τ > 0independent of p:

‖Tτf‖4 ≤√β‖f‖2,where β = max

S⊆[n]‖DSf‖2.

49

12.1 Application: Bourgain’s booster theorem

One of the classical topics in random graph theory is the threshold behavior of monotone graph properties.Here are two examples, connectivity and containing a triangle:

Pr[G(n, logn+c

n

)is connected]→ e−e

−c,

Pr[G(n, cn

)contains a triangle]→ 1− e−c

3/6.

In both cases, the limit is as n→∞.There is a crucial difference between these two properties. In the case of connectivity, the threshold is

around lognn , and the probability jumps from 0 to 1 in an interval (“window”) of width 1

n = o( lognn ). In

contrast, in the case of containing a triangle, both the threshold and the window width are around 1n .

Let us recall a notation from Section 9: for a monotone property f : 0, 1n → 0, 1 and q ∈ [0, 1], wedefine τ(q) as the value of p such that µp(f) = q, where µp(f) = Eµp [f ] is the probability that G(n, p) satisfiesthe property.

A simple argument (taking the union or intersection of several G(n, τ(1/2)) graphs) shows that the windowwidth is always at most the threshold itself. When the window width has the same order of magnitude as thethreshold, we say that the threshold is coarse, and otherwise we say that it is sharp. Sharp thresholds areeasier to locate (since we only need to show that we are inside the window in order to locate the thresholdwith high accuracy), so we would like a criterion that guarantees that a property manifests a sharp threshold.

The Russo–Margulis formula, proved in Section 9, shows that if f is a monotone property then φ(p) = µp(f)

satisfies φ′(p) = Inf(p)[f ]/p(1− p), where Inf(p)[f ] is the total influence of f with respect to µp. If the criticalprobability is pc = τ(1/2) and the window width is w = τ(3/4)− τ(1/4), then some point p ∈ [τ(1/4), τ(3/4)]

has derivative at most 1/(2w). If the threshold is coarse then p, w = Θ(pc) and so Inf(p)[f ] = O(p/w) = O(1).What can we say about such functions? If p is constant, then Friedgut’s junta theorem states that f is

close to a junta, as we showed in Section 5. However, this is no longer the case for small p. For example, letus consider the case of containing a triangle. Since φ(p) ≈ 1 − e−(pn)3/6, we have φ′(p) ≈ 1

2n3p2e−(pn)3/6,

and so Inf(p)[f ] ≈ pφ′(p) ≈ 12 (pn)3e−(pn)3/6. When p = c/n, this is constant, yet f is far from being a junta.

Sharp threshold theorems describe the structure of functions f satisfying Inf(p)[f ] = O(1), under theadditional assumption that 0 µp(f) 1; we do not know the answer when µp(f) is small. The most usefulsuch theorem is due to Bourgain (appearing in an appendix to [Fri98]), which we will prove here using globalhypercontractivity, following [KLLM19].

Let f be a monotone function such that

Inf(p)[f ] ≤ K V[f ].

When 0 µp(f) 1, this is the same as requiring that Inf(p)[f ] = O(1).An argument in the style of the proofs in Section 5 shows that

V[f ] =∑

1≤|S|≤2K

f(S)2 +∑|S|>2K

f(S)2.

The first term is µp(f)2. The second term is at most Inf[f≤2K ]. The third term is at most

1

2K

∑S

|S|f(S)2 =1

2KInf[f ] ≤ 1

2V[f ].

Altogether,

V[f ] ≤ Inf[f≤2K ] +1

2V[f ],

which implies that1

2V[f ] ≤ Inf[f≤2K ]. (4)

50

We now bound Inf[f≤2K ] using global hypercontractivity. For each individual i,

Infi[f≤2K ] = ‖Li(f≤2K)‖22 = ‖(Lif)≤2K‖22 = p(1− p)‖(Dif)<2K‖22,

since Dif = Lif/(ωn√p(1− p)). Crucially, Dif ∈ 0,±1. Applying Holder’s inequality 〈g, h〉 ≤ ‖g‖4‖h‖4/3,

this shows that

Infi[f≤2K ] = p(1− p)〈(Dif)<2K , Dif〉 ≤ p(1− p)‖(Dif)<2K‖4‖Dif‖4/3 = p(1− p)‖(Dif)<2K‖4‖Dif‖3/22 .

Global hypercontractivity shows that

‖(Dif)<2K‖4 = ‖TτTτ−1(Dif)<2K‖4 ≤√βi‖Tτ−1(Dif)<2K‖1/22 ≤

√βie

O(K)‖Dif‖1/22 ,

whereβi = max

S⊆[n]‖DSTτ−1(Dif)<2K‖2.

Altogether,Infi[f

≤2K ] ≤ p(1− p)√βi‖Dif‖22 =

√βi Infi[f ].

Substituting this in (4),

1

2V[f ] ≤

√maxiβie

O(K) Inf[f ] =⇒ maxiβi ≥ e−O(K).

Now let us explore βi further. First, we can restrict S in the definition of βi to sets of size less than2K. Second, Tτ−1 increases the Fourier coefficients of (Dif)<2K by a factor of at most eO(K), and so we canremove it at the cost of increasing the entire expression by that factor. Altogether, we can bound

maxiβi ≤ eO(k) max

|S|≤2K‖DSf‖22.

We conclude thatmax|S|≤2K

‖DSf‖2 ≥ e−O(K).

Now take a set S such that ‖DSf‖2 ≥ e−O(K). We can give an explicit formula for DSf :

DSf(y) =∑T⊆S

(−1)|S\T |f(y−S , T ),

where y−S consists of the coordinates of y outside S, and we think of T as as assignment to the coordinatesof y in S; this generalizes the explicit formula for DnF appearing above. Since f is Boolean, this expressionis bounded by 2|S|, and so ‖DSf‖22 ≤ 4|S| Pr[DSf 6= 0], implying that Pr[DSf 6= 0] ≥ e−O(K).

Consider a particular assignment y−S to the coordinates outside of S. If DSf 6= 0 then since f ismonotone, necessarily f(y−S , ∅) = 0 and f(y−S , S) = 1. Assuming that p ≤ 1/2, this shows that if we fix thecoordinates in S to 1, then this increases the expected value (for this partial assignment y−S) by at least(1− p)|S| ≥ e−O(K). Therefore,

E[f | yi = 1 for all i ∈ S] ≥ E[f ] + e−O(K).

We have deduced Bourgain’s booster theorem:

Let f be a monotone function and p ≤ 1/2. If Inf(p)[f ] ≤ Kµp(f)(1− µp(f)) then thereexists a set S of size at most 2K such that

E[f | yi = 1 for all i ∈ S] ≥ E[f ] + e−O(K).

51

13 Analysis on the slice: Erdos–Ko–Rado

Up to now we have considered Fourier analysis on the Boolean cube ±1n with respect to the uniformmeasure, on the Boolean cube 0, 1n with respect to the biased measure µp, and briefly on Gaussian space.All of these settings have the useful feature that the coordinates are independent. What happens when thisproperty is absent?

The simplest such situation is the slice(

[n]k

), which is the uniform distribution νk over all vectors in

0, 1n of Hamming weight k. Intuitively, the slice behaves very similarly to the Boolean cube with respectto the measure µk/n, and we will show a formal version of this later. But first, let us explore the followingbasic question: what is the right notion of Fourier expansion for functions on the slice?

Every function on ±1n can be written in a unique way as a multilinear polynomial in x1, . . . , xn. Thisis no longer the case for the slice, where we consider the inputs to be the coordinates x1, . . . , xn ∈ 0, 1,which are promised to sum to k. Indeed,

n∑i=1

xi − k = 0,

and more generally, if we multiply the left-hand side by any polynomial P and multilinearize the result, thenwe also get a polynomial that vanishes on the slice. Therefore, in order to obtain a canonical representation,we need to add more constraints. Such a canonical representation was found by Dunkl [Dun76], who askedthat the representing polynomial P be “orthogonal to the defining system”, that is,

n∑i=1

∂P

∂xi= 0.

This constraint, known as harmonicity, is not enough. We need the polynomial to be multilinear, to take intoaccount the identity x2

i = xi. Similarly, since xS = 0 for any |S| > k, we need degP ≤ k; dually, we alsoneed degP ≤ n− k. Putting all these constraints together, we do obtain a canonical representation.

Existence Let us first show that if f is any function on the slice(

[n]k

), where for simplicity we assume that

k ≤ n/2, then it can be represented by a harmonic multilinear polynomial of degree at most k.Our starting point is the following trivial representation of f as a homogeneous multilinear polynomial:

Pk =∑|S|=k

cSxS .

The idea is to find a homogeneous multilinear polynomial Qk of the same degree such that ∆Qk = 0 andPk −Qk can be represented by a homogeneous multilinear polynomial of degree k − 1, and then repeat theprocess k − 1 more times.

How do we find such a polynomial Qk? If the coefficients of Qk are dS , then

∆Qk =∑

|T |=k−1

∑i∈[n]\T

dT∪ixT .

Therefore ∆Qk = 0 if∑i/∈T dT∪i = 0 for all T of size k − 1.

It is natural to think of Qk as a formal sum of the “indeterminates” xS . Similarly, ∆Qk is a formalsum of the indeterminates xT . The operator ∆ is a linear operator mapping Vk := span(xS : |S| = k) toVk−1 := span(xT : |T | = k − 1).

Linear algebra tells us that Vk can be written as the direct sum of ker ∆ and im ∆T . This is very promising,since applying this to Pk, this means that we can find a representation Pk = Qk + ∆TPk−1 where ∆Qk = 0and Pk−1 ∈ Vk−1. We have thus found a candidate for Qk.

What can we say about ∆TPk−1? The operator ∆T maps xT to∑i∈[n]\T xT∪i. On the slice,∑

i∈[n]\T

xT∪i = xT∑

i∈[n]\T

xi = xT ,

52

since if xT = 1 then exactly one of the xi’s will evaluate to 1. Thus on the slice, we have Pk = Qk + Pk−1.We now have to repeat the same process for Pk−1. We can view ∆ as an operator from Vk−1 to

Vk−2 := span(xU : |U | = k − 2), and so find a representation Pk−1 = Qk−1 + ∆TPk−2, where ∆Qk−1 = 0.This time we have ∑

i∈[n]\U

xU∪i = xU∑

i∈[n]\U

xi = 2xU ,

on the slice. Thus on the slice, Pk−1 = Qk−1 + 2Pk−2. Continuing in this vein, we obtain

Pk =

k∑d=0

(k − d)!Qd,

where Qd is a degree d polynomial satisfying ∆Qd = 0. This is the representation we were looking for.This argument also shows that if f can be represented by some polynomial of degree d, then it can be

represented as a harmonic multilinear polynomial of degree at most d.

Uniqueness Now, let us show that the representation is unique. Equivalently, we need to show that if f isidentically zero on the slice, then the only harmonic multilinear polynomial of degree at most k representingit is the zero polynomial. Let us therefore consider a harmonic multilinear polynomial

P =∑|S|≤k

cSxS

which evaluates to zero on the entire slice.Since P evaluates to zero on the slice, in particular E[P ] = 0. Therefore

0 = c∅ +

k∑d=1

∑|S|=d

E[xS ]cS = c∅ +

k∑d=1

E[x1 · · ·xd]∑|S|=d

cS .

Since P is harmonic, we know that for every T ,∑i∈[n]\T

cT∪i = 0.

Summing this over all T of size d− 1, we see that (d+ 1)∑|S|=d cS = 0, and so c∅ = 0. We have actually

shown something stronger: c∅ = 0 iff E[P ] = 0.Next, we also know that E[Px1] = 0. Therefore

0 = E[x1]c1 +

k∑d=1

∑|S|=d1/∈S

(E[x1 · · ·xd]cS + E[x1 · · ·xd+1]cS∪1).

If we sum the constraint∑i∈[n]\T cT∪i = 0 over all T containing 1 of size d, then we see that

d∑

|S|=d+11∈S

cS = 0 =⇒∑|S|=d1/∈S

cS∪1 = 0.

Similarly, if we sum the constraint over all T not containing 1 of size d− 1, then we see that

0 =∑

|T |=d−11/∈T

∑i/∈T

cT∪i =∑

|T |=d−11/∈T

∑i/∈T∪1

cT∪i + cT∪1 = d∑|S|=d1/∈S

cS +∑

|S|=d−11/∈S

cS∪1.

53

We already know that the second summand vanishes, hence the first one also does. In total, we deduce thatc1 = 0. Again, we have actually shown something stronger: assuming that E[P ] = 0, then c1 = 0 iffE[Px1] = 0. In particular, if E[P ] = E[Px1] = 0 then c∅ = c1 = 0; and the converse also holds.

In the same way, we prove that all other coefficients vanish (we leave this as an exercise to the reader).Importantly, the proof actually shows something quite a bit stronger: f is orthogonal to all functions of degreeat most d iff its harmonic representation (which is how we shall call the unique representation describedabove) has no terms of degree at most d. This means that if we write

f =

k∑d=0

f=d,

where f=d is the d’th homogeneous part of the harmonic representation of f , then the functions f=d areorthogonal.

Generating set How do harmonic representations look like? Here is an example of a harmonic functionwhich is homogeneous of degree d:

(x1 − x2) · · · (x2d−1 − x2d).

To check that this function is indeed harmonic, all we need to do is observe is that the product of harmonicpolynomials is harmonic, by the product rule of the derivative; the claim then follows by considering the easycase of x1 − x2.

It turns out that every homogeneous harmonic multilinear polynomial of degree d can be written as alinear combination of functions of this form. To see this, let us first determine the dimension Dd of the spaceof harmonic multilinear polynomials which are homogeneous of degree d. Without the harmonicity constraint,the dimension is

(nd

). There are

(nd−1

)different harmonicity constraints, so Dd ≥

([n]d

)−(

[n]d−1

). On the other

hand, by unique representation we know that∑kd=0Dd =

(nk

). Thus

(n

k

)=

k∑d=0

Dd ≥k∑d=0

(n

d

)−(

n

d− 1

)=

(n

k

).

This shows that all inequalities are tight, and so Dd =(nd

)−(nd−1

).

Let us now say that a sequence 1 ≤ j1 < · · · < jd ≤ n is admissible if there exist indices i1, . . . , id such thatall of i1, . . . , id, j1, . . . , jd are different, and additionally it < jt for all t ∈ [d]. It is a classical combinatorialresult that the number of admissible sequences is

(nd

)−(nd−1

)(this is the famous Bertrand ballot problem,

when we think of indices jt as being votes to one candidate, and indices not in j1, . . . , jd as being votes tothe other candidate).

For each admissible sequence j1 < · · · < jd, choose some witness i1, . . . , id, and consider the polynomial

χj1,...,jd =

d∏t=1

(xit − xjt),

which is harmonic multilinear. We claim that these polynomials are linearly independent (as vectors ofcoefficients), and so must form a basis for all harmonic multilinear polynomials which are homogeneous ofdegree d. To see this, we construct a partial order on monomials of degree d. Say that (i1, . . . , id) ≺ (j1, . . . , jd)if it < jt for all t ∈ [d], and extend this partial order arbitrarily to a linear order. The matrix of coefficientsof the polynomials χj1,...,jd is triangular with 1s on the diagonal with respect to this order, and so thesepolynomials are linearly independent.

Exercise Complete the proof of uniqueness, and see where the condition k ≤ n/2 comes into play.

54

13.1 Influence

Two concepts that were important in Boolean function analysis on the Boolean cube were influence and noise.How do we extend them to the slice?

We defined the influence in direction i via the operation of flipping the i’th coordinate. However, thisoperation doesn’t preserve the Hamming weight. Instead, the minimal change is swapping two coordinates.Accordingly, one can define an influence in direction (i j), but it is less useful than the analog of total influence.In the case of the Boolean cube, this is the sum of all influences, and also has an alternative characterizationin terms of the Hamming graph, which is the graph corresponding to the Boolean cube (two vertices areconnected if they are at Hamming distance 1): if f is a Boolean function, then Inf[f ] is the average, over x,of the number of neighbors y of x such that f(x) 6= f(y). For arbitrary functions,

Inf[f ] =1

4Ex

[∑y∼x

(f(x)− f(y))2

].

We also had another formula for total influence: using y ∼ x to go over all neighbors y of x in the Hamminggraph,

〈f, Lf〉, where Lf(x) =∑y∼x

f(x)− f(y)

2.

Indeed,

Inf[f ] =1

4Ex

[∑y∼x

(f(x)− f(y))2

]=n

4E

x,y∼x[f(x)2 − 2f(x)f(y) + f(y)2] =

n

4E

x,y∼x[2f(x)2 − 2f(x)f(y)] =

1

2Ex

[∑y∼x

f(x)(f(x)− f(y))

]= 〈f, Lf〉,

since if we choose a random x and a random neighbor y, then y is also random.The analog of the Hamming graph in the case of the slice is the Johnson graph, in which two points x, y

are connected if they differ in two coordinates, which is the minimal number. Accordingly, for functions onthe slice we will define

Inf[f ] ∝ Ex

[∑y∼x

(f(x)− f(y))2

].

The constant of proportionality is somewhat arbitrary — we will determine it so that the Fourier formula forInf[f ] will resemble the one for the Boolean cube.

Just as in the case of the Boolean cube, it will be easier to use the formula Inf[f ] ∝ 〈f, Lf〉, where

Lf(x) =∑y∼x

(f(x)− f(y)).

(It is more natural to forego the factor of 2 in this case, since the xi are 0, 1-valued rather than ±1-valued.)This is easier since we can compute Lf directly on the functions χd = (x1 − x2) · · · (x2d−1 − x2d), and deduceits value for arbitrary f .

Suppose that x is a point such that χd(x) = 0. This means that there is a pair of equal coordinatesx2i−1 = x2i. Suppose for definiteness that x2i−1 = x2i = 0. The only neighbors of x on which χd possiblydoes not vanish are in directions (2i− 1 j) and (2i j), where j 6= 2i− 1, 2i. Suppose that y is a neighbor indirection (2i− 1 j) such that χd(y) 6= 0. Necessarily xj = 1, and so y2i−1 − y2i = 1. If instead we look at theneighbor z in direction (2i j), then the only difference between y and z is in coordinates 2i− 1, 2i, wherez2i−1 − z2i = −1. Therefore χd(z) = −χd(y). Matching neighbors in this way, we see that Lχd(x) = 0.

The more interesting case is when χd(x) 6= 0. Without loss of generality, let us say that x1 = x3 =· · · = x2d−1 = 1 and x2 = x4 = · · · = x2d = 0. For a swap to change the value of the function, one of thecoordinates being swapped must belong to [2d]. This can happen in the following ways:

55

1. Coordinates 2i− 1, 2i ∈ [2d] are swapped. This changes the value of the function to −1. There are dsuch swaps.

2. A coordinate 2i− 1 ∈ [2d] is swapped with a coordinate 2j ∈ [2d], where i 6= j. This changes the valueof the function to 0. There are d(d− 1) such swaps.

3. A coordinate 2i− 1 ∈ [2d] is swapped with a 0-coordinate outside [2d]. This also zeroes the function.There are d(n− k − d) such swaps.

4. A coordinate 2i ∈ [2d] is swapped with a 1-coordinate outside [2d]. This also zeroes the function. Thereare d(k − d) such swaps.

In total,Lχd(x) = 2d+ d(d− 1) + d(n− k − d) + d(k − d) = d(n− d+ 1).

When χd(x) = −11, we similarly get the negative of this value. Therefore

Lχd = d(n− d+ 1)χd.

Since functions of the form χd span the d’th level (consisting of harmonic multilinear polynomials whichare homogeneous of degree d), this shows that

Lf =

k∑d=0

d(n− d+ 1)f=d,

and so, since the different levels are orthogonal to each other,

〈f, Lf〉 =

k∑d=0

d(n− d+ 1)‖f=d‖2.

In comparison, the coefficients in the case of the Boolean cube were d. We therefore define the total influenceby dividing the formula above by a factor of n:

Inf[f ] =1

n〈f, Lf〉 =

k∑d=0

(1− d− 1

n

)d‖f=d‖2.

One useful property of Lf is that the eigenvalues d(n− d+ 1) are all distinct (assuming k ≤ n/2). To seethis, it suffices to notice that

(d+ 1)(n− d)− d(n− d+ 1) = n− 2d,

which is strictly positive when d+ 1 ≤ n/2.

The operator L is closely related to the adjacency operator A of the Johnson graph. The relation isLf = k(n− k)f −Af , since every vertex has degree k(n− k). The eigenspaces of A are thus also the k + 1levels of the Fourier expansion. The same holds for all powers of A, and so for all polynomials in A, and sofor all operators B such that B(x, y) depends only on the distance between x and y in the Johnson graph.

Such operators are encountered very frequently, since if B is any linear operator which “doesn’t careabout names of elements” — formally, is invariant under the operation of permuting the indices [n] — thenB(x, y) only depends on the Hamming distance between x and y, which is twice their distance in the Johnsongraph. Any such operator will have the same eigenspaces as A, and so to compute its eigenvalues, it sufficesto compute them on the functions χd.

56

13.2 Noise

In the case of the Boolean cube, we defined the noise operator Tρ in two equivalent ways. First, usingan operation which flips every coordinate with probability 1−ρ

2 . Second, using the Fourier expansion: Tρmultiplies the d’th level by ρd. It is clear how to extend the second definition to the case of the slice. Whatabout the first definition?

Here is another way at looking at the Fourier definition. Recall that

Lf =

n∑d=0

df=d.

In other words, the eigenspaces of L are the Fourier levels, with eigenvalues d. Therefore

Tρf =

n∑d=0

ρdf = e−L ln(1/ρ)f.

As in the case of the slice, we can write L in terms of the adjacency operator A of the Hamming graph. It willbe slightly nicer to consider instead the corresponding random walk operator M = A/n, which correspondsto taking a random neighbor. Then L = (nI −A)/2 = (n/2)(I −M), and so

Tρf = e−(n/2) ln(1/ρ)(I−M).

Defining t = (n/2) ln(1/ρ) and using the Taylor series for ex, this gives

Tρf = e−t∞∑k=0

tk

k!Mkf = E

k∼P(t)[Mkf ],

where P(t) is a Poisson random variable with expectation t. That is, Tρf(x) = E[f(y)], where y is obtainedby taking P(t) random steps. (Equivalently, we can view this in terms of continuous-time Markov chains.)

We can generalize all of this to the slice by considering the adjacency and random walk operators ofthe Johnson graph. The upshot is that the d in ρd will be replaced by (1− d−1

n )d, leading to the followingdefinition:

Tρf =

k∑d=0

ρ(1− d−1n )df=d.

How does this compare with the noise operator T ∗ρ whose eigenvalues are exactly ρd? We have

‖Tρf − T ∗ρ f‖2 =

k∑d=0

(ρ(1− d−1n )d − ρd)2‖f=d‖2 ≤ max

d(ρ(1− d−1

n )d − ρd)2‖f‖2.

When d ≤√n, using ex = 1 +O(x) (for x = O(1)) we can upper bound

ρ(1− d−1n )d − ρd = ρd

((1/ρ)d(d−1)/n − 1

)= Oρ

(d2ρd

n

),

and otherwiseρ(1− d−1

n )d − ρd ≤ ρd/2 ≤ ρ√n/2.

This shows that

‖Tρf − T ∗ρ f‖2 = Oρ

(1

n

)‖f‖2.

Hypercontractivity still holds for constant p. We show this below for low-degree functions by relating theslice to the corresponding p-biased cube.

57

13.3 Application: Erdos–Ko–Rado

In Section 8.1, we considered the p-biased version of the Erdos–Ko–Rado theorem. The original version tookplace in the slice, and this is one motivation to study it. (Another motivation is G(n,m) random graphs,which were the model of random graphs originally studied by Erdos and Renyi [ER60].)

A family F ⊆(

[n]k

)is intersecting if any two sets A,B ∈ F intersect. If k > n/2 then any two sets

intersect, and so the concept is not interesting. If k = n/2 then an intersecting family contains at most oneout of each pair S, S, and so the measure of any such family (which is its size divided by

(nk

)) is at most 1/2,

and this is achieved by stars, that is, all families containing some fixed element i.The interesting case is when k < n/2. In this case, stars are intersecting families of measure k/n, but

is this optimal? Let us try to mimic the proof in Section 8.1, constructing a noise operator T such that〈f, Tf〉 = 0 if f is the characteristic function of an intersecting family. We will have Tf(x) = Ey∼N(x)[f(y)],where N(x) is supported on sets disjoint from x. The only reasonable choice for N(x) is a random set disjointfrom x. This makes T the random walk operator of the Kronecker graph, in which two sets are connected ifthey are disjoint.

The operator T “doesn’t care about names of elements”, and so we know that its eigenspaces are theFourier levels. To compute the eigenvalues, it suffices to compute Tχd on some point x such that χd(x) = 1,say one in which x1 = x3 = · · · = x2d−1 = 1 and x2 = x4 = · · · = x2d = 0. Any y in the support of N(x) hasy1 = y3 = · · · = y2d−1 = 0, so Tχd(x) is (−1)d times the probability that y2 = y4 = · · · = y2d = 1. Out of the(n−kk

)neighbors of x in the Kronecker graph,

(n−k−dk−d

)are of this form, and so

Tχd = (−1)d(n−k−dk−d

)(n−kk

) χd.

Therefore if f is the characteristic function of an intersecting family F , then

0 = 〈f, Tf〉 =

k∑d=0

(−1)d(n−k−dk−d

)(n−kk

) ‖f=d‖2.

Following our footsteps in Section 8.1, we will single out d = 0, and lower bound all other eigenvalue by theminimal one, that is, the one maximizing

(n−k−dk−d

)over odd d. The interpretation of this quantity as the

number of neighbors such that y2 = y4 = · · · = y2d = 1 makes it clear that the maximizer is d = 1 (the leastnumber of constraints), and so

0 ≥ ‖f=0‖2 −(n−k−1k−1

)(n−kk

) ‖f>0‖2.

We can simplify the ratio of the binomials to kn−k .

What is f=0? It is the constant coefficient of the harmonic expansion of f . Since E[χd] = 0 for d > 0, wesee that f=0 = E[f ]. Therefore, as in the p-biased case,

k

n− k(E[f ]− E[f ]2) ≥ E[f ]2 =⇒ E[f ] ≤ k

n− k(1− E[f ]).

This shows that nn−k E[f ] ≤ k

n−k , and so E[f ] ≤ k/n. Furthermore, as in the p-biased case, equality can onlyhappen if deg f ≤ 1, which implies that f is a dictator and so a star (exercise). We can even deduce thatnear-maximizers are close to stars, at least when k/n is bounded away from 0, using a slice version of theFriedgut–Kalai–Naor theorem.

Exercise Show that if deg f ≤ 1 then f is a dictator (unless k ∈ 1, n−1), and conclude that if 2 ≤ k < n/2and F is an intersecting family of measure k/n, then F is a star.

58

13.4 Coupling the slice and the cube

When k/n is constant (an assumption we make from now on), the slice behaves similarly to the p-biased cube.

For example if we put p = k/n, then E[x1] = p, and E[x1x2] = k(k−1)n(n−1) ≈ p

2; the same holds for all monomials

of degree o(√n).

One particularly transparent way to connect the slice(

[n]k

)and the corresponding p-biased cube (with

p = k/n) was found by Noam Lifshitz (as yet unpublished). The idea is to consider a coupling of the sliceand the p-biased cube. The coupling is very simple: we choose x ∼ µp, and then take a uniformly randomsubset (if |x| ≥ k) or superset (if |x| ≤ k) of x of size exactly k.

Let Tµ→ν be an operator mapping functions on the p-biased cube to functions on the slice as follows:Tµ→νf(y) = E[f(x)], where x is distributed as the first element of the coupling above, subject to the secondelement being y. Define Tν→µ similarly in the other direction. We define a noise operator on the slice by

T νρ f = Tµ→νTµρ Tν→µ,

where Tµρ is the usual noise operator on the p-biased cube. Since Tµ→ν , Tν→µ are averaging operatorsthey are contractive (by an application of the triangle inequality), and so (aiming, arbitrarily, at (2, 4/3)-hypercontractivity)

‖T νρ f‖2 = ‖Tµ→νTµρ Tν→µf‖2 ≤ ‖Tµρ Tν→µf‖2 ≤ ‖Tν→µf‖4/3 ≤ ‖f‖4/3.

It remains to find out how T νρ operates on the harmonic expansion of f . Since T νρ “doesn’t care aboutnames of coordinates”, it suffices to consider f = χd.

What can we say about Tν→µχd? Since the operators Tν→µ and Tµ→ν are adjoint, we have 〈Tν→µχd, g〉 =〈χd, Tµ→νg〉. If deg g < d then g can be written as a linear combination of monomials of degree e < d. Eachsuch monomial xS only depends on e < d variables, and so Tµ→νxS can be written as a polynomial in thesevariables, hence its harmonic expansion has degree at most e. This shows that deg Tµ→νg < d, and so theinner product vanishes. In other words, Tν→µχd has no Fourier coefficients below level d. Conversely, eachmonomial in χd depends on only d variables, and so Tν→µχd has degree at most d. We conclude that Tν→µχdlies in the d’th level of the Fourier expansion. This is useful since it implies that

T νρ χd = ρdTµ→νTν→µχd.

It remains to understand the effect of Tµ→νTν→µ on χd. We know that χd is an eigenvector of Tµ→νTν→µ,with some eigenvalue λd which can be calculated as

λd =〈Tµ→νTν→µχd, χd〉

‖χd‖2=‖Tν→µχd‖2

‖χd‖2=

Pr[Tν→µχd 6= 0]

Pr[χd 6= 0].

The denominator is exactly

2dkd(n− k)d

n2d≈ (2p(1− p))d.

In order to estimate the numerator, let us consider the distribution of the Hamming weight of x. If ` ≥ k,then there are

(n−k`−k)

vectors x ≥ y of Hamming weight `. Since y is obtained from such an x with probability

1/(k`

), the probability that |x| = ` is proportional to

p`(1− p)n−`(n−k`−k)(

`k

) .

When ` increases by 1, this gets multiplied by a factor of

p

1− pn− ``+ 1

=k

`+ 1

n− `n− k

.

59

Similarly, if ` ≤ k then the probability that |x| = ` is proportional to

p`(1− p)n−`(k`

)(n−`k−`) .

When ` decreases by 1, this gets multiplied by a factor of

1− pp

`

n− `+ 1=

n− kn− `+ 1

`

k.

In both cases, once ` deviates by√n from k, the ratio becomes 1−O(1/

√n), and so it drops exponentially

every√n steps. This means that |x| is concentrated in a distance of roughly O(

√n) from k.

If we denote by δ the deviation of |x| from k, then fixing |x|, the probability that x \ y (if |x| ≥ k) or y \ x(if |x| ≤ k) contains one of the first 2d coordinates is O(dδ/n). Since δ is of order roughly O(

√n), we see

that Pr[Tν→µχd 6= 0] differs from Pr[χd 6= 0] by at most roughly O(d/√n). The error term is much smaller

than the main term as long as d log n, in which case

λd ≈ 1± eO(d)

√n.

Just as in the case of Tρ and T ∗ρ , this implies that ‖T νρ f‖2 is close to ‖Tρf‖2 and ‖T ∗ρ f‖2, at least for lowdegree functions. This suffices to show that ‖f‖4 = O(‖f‖2) for functions of bounded degree, for example,and so can be used to derive many of the theorems proved in these lecture notes.

Exercise Obtain a more precise estimate for the eigenvalues of T νρ , and deduce a concrete statement ofhypercontractivity for low-degree functions on the slice.

References

[AA14] Scott Aaronson and Andris Ambainis. The need for structure in quantum speedups. TheoryComput., 10:133–166, 2014. doi:10.4086/toc.2014.v010a006.

[AL93] Miklos Ajtai and Nathal Linial. The influence of large coalitions. Combinatorica, 13(2):129–145,1993. doi:10.1007/BF01303199.

[BB14] Arturs Backurs and Mohammad Bavarian. On the sum of L1 influences. In IEEE 29th Conferenceon Computational Complexity—CCC 2014, pages 132–143. IEEE Computer Soc., Los Alamitos,CA, 2014. doi:10.1109/CCC.2014.21.

[Bor75] Christer Borell. The Brunn-Minkowski inequality in Gauss space. Invent. Math., 30(2):207–216,1975. doi:10.1007/BF01425510.

[Bou02] Jean Bourgain. On the distributions of the Fourier spectrum of Boolean functions. Israel J.Math., 131:269–276, 2002. doi:10.1007/BF02785861.

[CHS20] John Chiarelli, Pooya Hatami, and Michael Saks. An asymptotically tight bound on the numberof relevant variables in a bounded degree Boolean function. Combinatorica, 40(2):237–244, 2020.doi:10.1007/s00493-019-4136-7.

[CW01] Anthony Carbery and James Wright. Distributional and Lq norm inequalities for polynomials overconvex bodies in Rn. Math. Res. Lett., 8(3):233–248, 2001. doi:10.4310/MRL.2001.v8.n3.a1.

[DMN16] Anindya De, Elchanan Mossel, and Joe Neeman. Majority is stablest: discrete and SoS. TheoryComput., 12:Paper No. 4, 50, 2016. doi:10.4086/toc.2016.v012a004.

60

[Dun76] Charles F. Dunkl. A Krawtchouk polynomial addition theorem and wreath products of symmetricgroups. Indiana Univ. Math. J., 25(4):335–358, 1976. doi:10.1512/iumj.1976.25.25030.

[Eld15] Ronen Eldan. A two-sided estimate for the Gaussian noise stability deficit. Invent. Math.,201(2):561–624, 2015. doi:10.1007/s00222-014-0556-6.

[ER60] P. Erdos and A. Renyi. On the evolution of random graphs. Magyar Tud. Akad. Mat. KutatoInt. Kozl., 5:17–61, 1960.

[FHKL16] Yuval Filmus, Hamed Hatami, Nathan Keller, and Noam Lifshitz. On the sum of the L1 influencesof bounded functions. Israel J. Math., 214(1):167–192, 2016. doi:10.1007/s11856-016-1355-0.

[Fil16] Yuval Filmus. Friedgut-Kalai-Naor theorem for slices of the Boolean cube. Chic. J. Theoret.Comput. Sci., pages Art. 14, 17, 2016. doi:10.4086/cjtcs.2016.014.

[FK96] Ehud Friedgut and Gil Kalai. Every monotone graph property has a sharp threshold. Proc.Amer. Math. Soc., 124(10):2993–3002, 1996. doi:10.1090/S0002-9939-96-03732-X.

[FKN02] Ehud Friedgut, Gil Kalai, and Assaf Naor. Boolean functions whose Fourier transform isconcentrated on the first two levels. Adv. in Appl. Math., 29(3):427–437, 2002. doi:10.1016/

S0196-8858(02)00024-6.

[Fri98] Ehud Friedgut. Boolean functions with low average sensitivity depend on few coordinates.Combinatorica, 18(1):27–35, 1998. doi:10.1007/PL00009809.

[Fri99] Ehud Friedgut. Sharp thresholds of graph properties, and the k-sat problem. J. Amer.Math. Soc., 12(4):1017–1054, 1999. With an appendix by Jean Bourgain. doi:10.1090/

S0894-0347-99-00305-7.

[GW95] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximumcut and satisfiability problems using semidefinite programming. J. Assoc. Comput. Mach.,42(6):1115–1145, 1995. doi:10.1145/227683.227684.

[Hat12] Hamed Hatami. A structure theorem for Boolean functions with small total influences. Ann. ofMath. (2), 176(1):509–533, 2012. doi:10.4007/annals.2012.176.1.9.

[Hof70] Alan J. Hoffman. On eigenvalues and colorings of graphs. In Graph Theory and its Applications(Proc. Advanced Sem., Math. Research Center, Univ. of Wisconsin, Madison, Wis., 1969), pages79–91. Academic Press, New York, 1970.

[Kin02] Guy Kindler. Property testing, PCP and Juntas. PhD thesis, Tel-Aviv University, 2002.

[KK20] Nathan Keller and Ohad Klein. A structure theorem for almost low-degree functions on the slice.Israel J. Math., 240(1):179–221, 2020. doi:10.1007/s11856-020-2062-4.

[KKL88] Jeff Kahn, Gil Kalai, and Nathan Linial. The influence of variables on Boolean functions. InProc. 29th Ann. Symp. on Foundations of Comp. Sci., pages 68–80. Computer Society Press,1988.

[KKMO07] Subhash Khot, Guy Kindler, Elchanan Mossel, and Ryan O’Donnell. Optimal inapproximabilityresults for MAX-CUT and other 2-variable CSPs? SIAM J. Comput., 37(1):319–357, 2007.doi:10.1137/S0097539705447372.

[KKO18] Guy Kindler, Naomi Kirshner, and Ryan O’Donnell. Gaussian noise sensitivity and Fourier tails.Israel J. Math., 225(1):71–109, 2018. doi:10.1007/s11856-018-1646-8.

[KLLM19] Peter Keevash, Noam Lifshitz, Eoin Long, and Dor Minzer. Hypercontractivity for global functionsand sharp thresholds, 2019. URL: https://arxiv.org/abs/1906.05568, arXiv:1906.05568.

61

[KR08] Subhash Khot and Oded Regev. Vertex cover might be hard to approximate to within 2− ε. J.Comput. System Sci., 74(3):335–349, 2008. doi:10.1016/j.jcss.2007.06.019.

[KS04] Guy Kindler and Shmuel Safra. Noise-resistant Boolean functions are juntas, 2004. Unpublishedmanuscript.

[Lov79] Laszlo Lovasz. On the Shannon capacity of a graph. IEEE Trans. Inform. Theory, 25(1):1–7,1979. doi:10.1109/TIT.1979.1055985.

[MN15] Elchanan Mossel and Joe Neeman. Robust optimality of Gaussian noise stability. J. Eur. Math.Soc. (JEMS), 17(2):433–482, 2015. doi:10.4171/JEMS/507.

[MOO10] Elchanan Mossel, Ryan O’Donnell, and Krzysztof Oleszkiewicz. Noise stability of functionswith low influences: invariance and optimality. Ann. of Math. (2), 171(1):295–341, 2010.doi:10.4007/annals.2010.171.295.

[O’D14] Ryan O’Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014. doi:

10.1017/CBO9781139814782.

[Rus82] Lucio Russo. An approximate zero-one law. Z. Wahrsch. Verw. Gebiete, 61(1):129–139, 1982.doi:10.1007/BF00537230.

[Wel20] Jake Wellens. Relationships between the number of inputs and other complexity measures ofboolean functions, 2020. arXiv:2005.00566.

62