Gaussian width bounds with applications to arithmetic progressions in random settings

Motivated by problems on random differences in Szemer\'{e}di's theorem and on large deviations for arithmetic progressions in random sets, we prove upper bounds on the Gaussian width of point sets that are formed by the image of the $n$-dimensional Boolean hypercube under a mapping $\psi:\mathbb{R}^n\to\mathbb{R}^k$, where each coordinate is a constant-degree multilinear polynomial with 0-1 coefficients. We show the following applications of our bounds. Let $[\mathbb{Z}/N\mathbb{Z}]_p$ be the random subset of $\mathbb{Z}/N\mathbb{Z}$ containing each element independently with probability $p$. $\bullet$ A set $D\subseteq \mathbb{Z}/N\mathbb{Z}$ is $\ell$-intersective if any dense subset of $\mathbb{Z}/N\mathbb{Z}$ contains a proper $(\ell+1)$-term arithmetic progression with common difference in $D$. Our main result implies that $[\mathbb{Z}/N\mathbb{Z}]_p$ is $\ell$-intersective with probability $1 - o(1)$ provided $p \geq \omega(N^{-\beta_\ell}\log N)$ for $\beta_\ell = (\lceil(\ell+1)/2\rceil)^{-1}$. This gives a polynomial improvement for all $\ell \ge 3$ of a previous bound due to Frantzikinakis, Lesigne and Wierdl, and reproves more directly the same improvement shown recently by the authors and Dvir. $\bullet$ Let $X_k$ be the number of $k$-term arithmetic progressions in $[\mathbb{Z}/N\mathbb{Z}]_p$ and consider the large deviation rate $\rho_k(\delta) = \log\Pr[X_k \geq (1+\delta)\mathbb{E}X_k]$. We give quadratic improvements of the best-known range of $p$ for which a highly precise estimate of $\rho_k(\delta)$ due to Bhattacharya, Ganguly, Shao and Zhao is valid for all odd $k \geq 5$. We also discuss connections with error correcting codes (locally decodable codes) and the Banach-space notion of type for injective tensor products of $\ell_p$-spaces.


1. Introduction
The Gaussian width of a point set T ⊆ R^k measures the expected maximum correlation between T and a standard Gaussian vector g ∼ N(0, I_k), and is given by
$$\mathrm{GW}(T) = \mathbb{E}_g\, \sup_{x \in T}\, \langle g, x \rangle.$$
The terminology reflects the fact that the Gaussian width of a set is proportional to √k times its average width in a random direction. While this quantity plays a central role in high-dimensional probability, it is notoriously hard to estimate in general; see for instance [Tal14] for an extensive discussion of this problem.
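As a concrete illustration (not part of the paper's argument), the definition above can be checked numerically. The Python sketch below estimates GW(T) by Monte Carlo for the finite set T = {0,1}^k, where the width is exactly k/√(2π): the supremum over T is attained at x_i = 1_{g_i > 0}, so the maximum equals Σ_i max(g_i, 0).

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

def gaussian_width(T, trials=20000):
    # T: (m, k) array of m points in R^k; estimate E_g max_{x in T} <g, x>
    G = rng.standard_normal((trials, T.shape[1]))
    return (G @ T.T).max(axis=1).mean()

# Sanity check: for T = {0,1}^k, GW(T) = E sum_i max(g_i, 0) = k/sqrt(2*pi).
k = 10
T = np.array(list(product([0, 1], repeat=k)), dtype=float)
print(gaussian_width(T), k / np.sqrt(2 * np.pi))  # both close to 3.99
```

The exact enumeration of the hypercube is only feasible for small k, which is precisely why general-purpose upper bounds such as Theorem 1.1 below are needed.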
Our main result gives upper bounds on the Gaussian width of sets that appear naturally in the context of probabilistic combinatorics. The relevant sets are given by the image of the n-dimensional Boolean hypercube under a certain polynomial mapping ψ : R^n → R^k. In particular, we focus on the case where each coordinate ψ_i : R^n → R is a multilinear polynomial with 0-1 coefficients. Say that a polynomial has multiplicity t if each of its variables has a nonzero exponent in at most t monomials in its support.

Theorem 1.1. Let ψ : R^n → R^k be a polynomial mapping such that each coordinate is multilinear, has 0-1 coefficients, and has degree at most d and multiplicity t. Then,
$$\mathrm{GW}\big(\psi(\{0,1\}^n)\big) \lesssim_d n t \sqrt{k\, n^{1 - \frac{1}{\lceil d/2 \rceil}} \log n}.$$
The factor nt can be seen as a natural scaling due to the fact that each coordinate ψ_i maps the Boolean hypercube into [0, nt] (which follows from a handshaking argument). In the special case where ψ is linear, say ψ(x) = (⟨c_1, x⟩, …, ⟨c_k, x⟩) for some c_1, …, c_k ∈ {0,1}^n, the set ψ({0,1}^n) is easily seen to be contained in the set T = {(⟨c_i, y⟩)_{i=1}^k : ‖y‖_{ℓ_∞} ≤ 1}. The Gaussian width of the former set is thus at most that of the latter, which in turn is at most
$$\mathbb{E}_g \sup_{\|y\|_{\ell_\infty} \le 1} \sum_{i=1}^k g_i \langle c_i, y \rangle = \mathbb{E}_g \Big\| \sum_{i=1}^k g_i c_i \Big\|_{\ell_1} \le n\sqrt{k},$$
as the sum $\sum_{i=1}^k g_i c_i$ is an n-dimensional Gaussian vector whose coordinates have variance at most k.
Perhaps surprisingly, Theorem 1.1 shows that if ψ is quadratic and has constant multiplicity, then the Gaussian width is at most a factor √(log n) larger than the above upper bound. This turns out to be an easy consequence of a 1974 random matrix inequality due to Tomczak-Jaegermann [TJ74], which also forms the basis for our proof of the higher-degree cases. The proof of Theorem 1.1 (given in Section 2) proceeds in two steps: first we reduce to the case of homogeneous mappings of even degree, and then we reduce to the quadratic case. The first step is the reason for the ceiling in ⌈d/2⌉ appearing in the exponent of n and it would be interesting to know if one can remove this ceiling; i.e., does the result hold with the exponent 1 − 2/d? More generally, an exponent of the form o(1) for constant d would imply the truth of some unresolved conjectures on variants of Szemerédi's theorem, large deviations and coding theory (topics which we discuss below). The link to coding theory also implies that the bound is optimal for d = 2 and that the smallest possible exponent is at least (log log n)^{2−o(1)}/log n for d = 3 and (log log n)^{r−o(1)}/log n for d = 2^r, r ≥ 3. Finally, a close inspection of the proof of Theorem 1.1 shows that it also holds for polynomials with non-negative integer coefficients, for a suitable change of the definition of multiplicity. In the following four subsections we discuss two applications of this result and links with error-correcting codes and the Banach-space notion of type.
1.1. Random differences in Szemerédi's Theorem. In 1975 Szemerédi [Sze75] proved that any subset of the integers of positive upper density contains arbitrarily long arithmetic progressions, answering a famous open question of Erdős and Turán. It is well known that this is equivalent to the assertion that for every positive integer k and any α ∈ (0, 1), there exists an N_0(k, α) ∈ N such that if N ≥ N_0(k, α) and A ⊆ Z/NZ is a set of size |A| ≥ αN, then A must contain a proper k-term arithmetic progression. Certain refinements of Szemerédi's theorem concern sets D ⊆ N for which the theorem still holds true when the arithmetic progressions are required to have common difference from D. Such sets are usually referred to as intersective sets in number theory, or recurrent sets in ergodic theory. More precisely, a set D ⊆ N is ℓ-intersective (or ℓ-recurrent) if any set A ⊆ N of positive upper density has an (ℓ+1)-term arithmetic progression with common difference in D. Szemerédi's theorem then states that N is ℓ-intersective for every ℓ ∈ N, but much smaller intersective sets exist. For example, for any t ∈ N, the set {1^t, 2^t, 3^t, …} is ℓ-intersective for every ℓ, which is a special case of more general results of Sárközy [Sár78a] when ℓ = 1 and of Bergelson and Leibman [BL96] for all ℓ ≥ 1. The shifted primes {p − 1 : p is prime} and {p + 1 : p is prime} are also ℓ-intersective for every ℓ ∈ N, shown by Sárközy [Sár78b] when ℓ = 1 and in a more general setting by Wooley and Ziegler [WZ12] for all ℓ ≥ 1.
It is natural to ask at what density random sets become ℓ-intersective. To simplify the discussion, we will look at the analogous question in Z/NZ.

Definition 1.2. Let ℓ be a positive integer and α ∈ (0, 1]. A subset D ⊆ Z/NZ is (ℓ, α)-intersective if any subset A ⊆ Z/NZ of size |A| ≥ αN contains a proper (ℓ + 1)-term arithmetic progression with common difference in D.

Our main result on this problem, proved in Section 5, is the following.

Theorem 1.3. For every positive integer ℓ and α ∈ (0, 1], the random set [Z/NZ]_p is (ℓ, α)-intersective with probability 1 − o(1), provided p ≥ ω(N^{−1/⌈(ℓ+1)/2⌉} log N).
1.2. Large deviations for arithmetic progressions. Let H = (V, E) be a hypergraph over a finite vertex set V of cardinality N and for p ∈ (0, 1) denote by V_p the random binomial subset where each element of V appears independently of all others with probability p. Let X be the number of edges in H that are induced by V_p. Important instances of the random variable X include the count of triangles in an Erdős–Rényi random graph and the count of arithmetic progressions of a given length in the random set [Z/NZ]_p. The study of the asymptotic behavior of X when p = p(N) is allowed to depend on N and N grows to infinity motivates a large body of research in probabilistic combinatorics. Of particular interest is the problem of determining the probability that X significantly exceeds its expectation, Pr[X ≥ (1 + δ)EX] for δ > 0, referred to as the upper tail. Despite the fact that standard probabilistic methods fail to give satisfactory bounds on the upper tail in general, advances were made recently for special instances, in particular for triangle counts [LZ17] and general subgraph counts [BGLZ17]. For more general hypergraphs, progress was made by Chatterjee and Dembo [CD16] using a novel nonlinear large deviation principle (LDP), which was improved by Eldan [Eld18] shortly after. The LDPs give precise estimates on the upper tail in terms of a parameter φ_p whose value is determined by the solution to a certain variational problem. The range of values of p for which these estimates are actually valid depends on the underlying hypergraph H. This splits the problem of estimating the upper tail into two sub-problems: (1) determining for what range of p the estimate in terms of φ_p holds true and (2) solving the variational problem to determine the value of φ_p. The answer to problem (1) turns out to depend on the Gaussian width of a point set related to H.
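To make the random variable X concrete, the following small Python simulation (purely illustrative; the values of N, p and δ are arbitrary choices, not taken from the results discussed here) samples [Z/NZ]_p and estimates the upper-tail probability for the 3-AP count by brute force.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, delta = 31, 0.3, 0.5

# enumerate the proper 3-term APs in Z/NZ as unordered triples {a, a+b, a+2b}, b != 0
aps = {frozenset((a, (a + b) % N, (a + 2 * b) % N))
       for a in range(N) for b in range(1, N)}

def count_aps(S):
    return sum(ap <= S for ap in aps)

EX = len(aps) * p ** 3          # each triple survives with probability p^3
samples = [count_aps(set(np.flatnonzero(rng.random(N) < p).tolist()))
           for _ in range(2000)]
upper_tail = np.mean([x >= (1 + delta) * EX for x in samples])
print(len(aps), EX, upper_tail)  # N(N-1)/2 = 465 triples, EX = 12.555
```

Such direct simulation is only possible for tiny N and moderate deviations; the whole point of the large deviation principles discussed above is to estimate the (exponentially small) tail in regimes where sampling is hopeless.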
This approach was pursued in [CD16] to estimate the upper tail of the number of 3-term arithmetic progressions in [Z/NZ]_p, for which the authors solved problem (1). The case of longer APs, asking for the upper tail probability of the count X_k of k-term arithmetic progressions in [Z/NZ]_p, was recently treated by Bhattacharya et al. [BGSZ18]. They solved the variational problem (2) for N prime and gave bounds for the relevant Gaussian width towards solving problem (1). Based on this, they showed that if k ≥ 3 and δ > 0 are fixed and p tends to zero sufficiently slowly as N → ∞ along the primes, then
$$\log \Pr[X_k \ge (1+\delta)\,\mathbb{E}X_k] = -(1 + o(1))\, c_k(\delta)\, p^{k/2} N \log(1/p), \tag{1}$$
where c_k(δ) is an explicit constant determined by the solution to the variational problem. Similar results were shown for the analogous problem over {1, …, N} (in which case N no longer needs to be prime), but we shall focus on the problem in Z/NZ for ease of exposition. The rate at which p is allowed to decay for (1) to hold turns out to depend on Gaussian widths of the form featuring in Theorem 1.1. The bounds proved in [BGSZ18] imply that (1) holds provided p ≥ N^{−ε_k}, for k ≥ 5 and absolute constants ε_k ∈ (0, ∞) depending only on k. However, the authors conjecture that a probability p slightly larger than N^{−1/(k−1)} suffices for all k. Some support for this conjecture is given by a result of Warnke [War17] showing that for all p ≥ (log N/N)^{1/(k−1)}, the logarithm of the upper tail (also referred to as the large deviation rate) of the k-AP count in {1, …, N}_p is given by Θ_k(√δ p^{k/2} N log(1/p)), where the asymptotic notation hides constants depending only on k. Notice that (1) is more accurate than this result in that it (almost) determines those constants, though currently for a narrower range of p. Using Theorem 1.1, we widen the range of p for which (1) can be shown to hold, for all k ≥ 5.
Theorem 1.4. For every integer k ≥ 3 and c_k = \frac{1}{6k\lceil (k-1)/2 \rceil}, the estimate (1) holds true, provided p ≥ N^{−c_k} log N and N is prime.
1.3. Locally decodable codes. There is a close connection between the Gaussian widths considered in Theorem 1.1 and special error-correcting codes called locally decodable codes (LDCs). A map C : {0,1}^k → {0,1}^n is a q-query LDC if for every i ∈ [k] and x ∈ {0,1}^k, the value x_i can be retrieved by reading at most q coordinates of the codeword C(x), even if the codeword is corrupted in a not too large (but possibly constant) fraction of coordinates. A main open problem is to determine the smallest possible codeword length n as a function of the message length k, when q is a fixed constant. Currently this problem is settled only in the cases q = 1, 2 [KT00, KW04, GKST06] and remains wide open for the case q = 3. We refer to the extensive survey [Yek12] for more information on this problem. A connection with Gaussian width was established by the authors and Dvir in [BDG17], where we show that q-query LDCs from {0,1}^{Ω(k)} to {0,1}^{O(n)} are equivalent to mappings ψ : R^n → R^k whose coordinates are degree-q, multiplicity-1 polynomials with 0-1 coefficients that are supported by Ω(n) monomials, and such that the set ψ({0,1}^n) has Gaussian width Ω(k). It was observed there that the best-known lower bounds on the length n = n(k) of q-query LDCs (proved using techniques from quantum information theory [KW04]) imply a slightly different but equivalent version of Theorem 1.3 (see Section 5). The proof of Theorem 1.1 is based on ideas from [KW04], but does not use quantum information theory.

1.4. Gaussian width bounds from type constants. We observe that the Gaussian width in Theorem 1.1 can be bounded in terms of type constants of certain Banach spaces. Unfortunately, we do not have good enough bounds on the type constants of the required spaces to improve Theorem 1.1. But we hope that this connection will motivate progress on understanding these spaces.
A Banach space X is said to have (Rademacher) type p > 0 if there exists a constant T < ∞ such that for every k and x_1, …, x_k ∈ X,
$$\mathbb{E}_\varepsilon \Big\| \sum_{i=1}^k \varepsilon_i x_i \Big\| \le T \Big( \sum_{i=1}^k \|x_i\|^p \Big)^{1/p}, \tag{2}$$
where the expectation is over a uniformly random ε = (ε_1, …, ε_k) ∈ {−1,1}^k. The smallest T for which (2) holds is referred to as the type-p constant of X, denoted T_p(X). Type, and its dual notion cotype, play an important role in Banach space theory as they are tightly linked to local geometric properties (we refer to [LT79] and [Mau03] for extensive surveys). Some fundamental facts are as follows. It follows from the triangle inequality that every Banach space has type 1 and from the Kahane–Khintchine inequality that no Banach space has type p > 2. The parallelogram law implies that Hilbert spaces have type 2. An easy but important fact is that ℓ_1 fails to have type p > 1. Indeed, a famous result of Maurey and Pisier [MP73] asserts that a Banach space fails to have type p > 1 if and only if it contains ℓ_1 uniformly. Finite-dimensional Banach spaces have type p for all p ∈ [1, 2]. Of importance to Theorem 1.1 are the actual type constants T_p(X) of a certain family of finite-dimensional Banach spaces. Let r_1, …, r_d ≥ 1 be such that $\sum_{i=1}^d \frac{1}{r_i} = 1$ and let L^n_{r_1,…,r_d} be the space of d-linear forms on R^n × ⋯ × R^n (d times) endowed with the norm
$$\|\Lambda\| = \sup\big\{ |\Lambda(x_1, \dots, x_d)| : \|x_1\|_{\ell_{r_1}} \le 1, \dots, \|x_d\|_{\ell_{r_d}} \le 1 \big\}.$$
This space is also known as the injective tensor product of ℓ^n_{s_1}, …, ℓ^n_{s_d} for r_i^{−1} + s_i^{−1} = 1 and as such plays an important role in the theory of tensor products of Banach spaces [Rya02]. The relevance of the type constants of this space to Theorem 1.1 is captured by the following lemma, proved in Section 7.
Lemma 1.5. Let ψ : R^n → R^k be a polynomial mapping such that each coordinate is multilinear and has 0-1 coefficients, degree at most d and multiplicity t. Then, for any p ∈ [1, 2] and any r_1, …, r_d ≥ 1 with $\sum_{i=1}^d 1/r_i = 1$,
$$\mathrm{GW}\big(\psi(\{0,1\}^n)\big) \lesssim d\, t\, n\, T_p\big(L^n_{r_1,\dots,r_d}\big)\, k^{1/p}.$$
Observe that the space L^n_{2,2} may be identified with the space of n × n matrices endowed with the spectral norm (or operator norm). A key ingredient in the proof of Theorem 1.1, Theorem 2.1 below, easily implies that the type-2 constant of this space is of order O(√(log n)). A well-known lower bound of the same order follows for instance from the connection between Gaussian width and LDCs and a basic construction of a 2-query LDC known as the Hadamard code. More generally, lower bounds on the type constants of L^n_{r_1,…,r_d} are implied by d-query LDCs [BNR12, Bri16].
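A quick numeric sanity check of the type-2 behavior of L^n_{2,2} is possible for small n. The sketch below (the symmetrized permutation matrices are an arbitrary illustrative choice of test vectors, not taken from the paper) compares the Rademacher average E_ε‖Σ ε_i A_i‖ in spectral norm against (Σ ‖A_i‖²)^{1/2}; their ratio should stay within an O(√(log n)) factor, consistent with T_2(L^n_{2,2}) = O(√(log n)).

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, trials = 40, 30, 200

# test matrices: symmetrized permutation matrices, each of spectral norm exactly 2
As = [np.eye(n)[rng.permutation(n)] for _ in range(k)]
As = [P + P.T for P in As]

sq = np.sqrt(sum(np.linalg.norm(A, 2) ** 2 for A in As))   # (sum ||A_i||^2)^{1/2}
avg = np.mean([
    np.linalg.norm(sum(e * A for e, A in zip(rng.choice([-1, 1], k), As)), 2)
    for _ in range(trials)
])
print(avg / sq, np.sqrt(np.log(n)))  # the ratio is of order sqrt(log n)
```

Here `np.linalg.norm(A, 2)` computes the largest singular value, i.e. the spectral norm.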
2. Proof of Theorem 1.1

In this section we prove Theorem 1.1. We begin by giving a high-level overview of the ideas. The main tool we use is the following random matrix inequality, which is a special case of a non-commutative version of the Khintchine inequality due to Tomczak-Jaegermann [TJ74, Theorem 3.1]. Let ⟨·, ·⟩ be the standard inner product on R^N and denote by B^N_2 the Euclidean unit ball in R^N. Given a matrix A ∈ R^{N×N}, its operator norm (or spectral norm) is given by ‖A‖ = sup{|⟨Ax, y⟩| : x, y ∈ B^N_2}.

Theorem 2.1 (Tomczak-Jaegermann). There exists an absolute constant C ∈ (0, ∞) such that the following holds. Let A_1, …, A_k ∈ R^{N×N} be a collection of matrices and let g_1, …, g_k be independent Gaussian random variables with mean zero and variance 1. Then,
$$\mathbb{E}\Big\| \sum_{i=1}^k g_i A_i \Big\| \le C \sqrt{\log N}\, \Big( \sum_{i=1}^k \|A_i\|^2 \Big)^{1/2}.$$
This result already suffices to prove Theorem 1.1 when the coordinate mappings ψ_i are quadratic forms, in which case there exist matrices A_1, …, A_k ∈ {0,1}^{n×n} such that ψ_i(x) = ⟨x, A_i x⟩ for each i ∈ [k]. The assumption that each ψ_i has multiplicity t implies that each row and column of A_i has at most t ones. This in turn implies that ‖A_i‖ ≤ t by a Birkhoff–von Neumann-type theorem.
Since each x ∈ {0,1}^n has Euclidean norm at most √n, we get
$$\mathbb{E} \sup_{x \in \{0,1\}^n} \sum_{i=1}^k g_i \langle x, A_i x \rangle \le n\, \mathbb{E}\Big\| \sum_{i=1}^k g_i A_i \Big\|.$$
By Theorem 2.1 and the bound ‖A_i‖ ≤ t, the above is at most C t n √(k log n).
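The norm bound ‖A_i‖ ≤ t used above can be verified numerically: a 0-1 matrix with at most t ones per row and per column is entrywise dominated by a sum of t permutation matrices, and for entrywise-nonnegative matrices the spectral norm is monotone under entrywise domination. A minimal Python sketch (the random construction is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, t = 60, 4

# a random 0-1 matrix with at most t ones per row and per column,
# built as a (clipped) sum of t random permutation matrices
A = sum(np.eye(n)[rng.permutation(n)] for _ in range(t))
A = np.minimum(A, 1)  # clip coincidences so entries stay 0-1

assert A.sum(axis=0).max() <= t and A.sum(axis=1).max() <= t
print(np.linalg.norm(A, 2))  # spectral norm, at most t
```

The bound is tight: the all-ones vector nearly attains it when the row and column sums all equal t.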
The general case is proved via a reduction to the above quadratic case and consists of two steps. In the first step, we reduce to the case where each coordinate ψ_i is a homogeneous polynomial of degree 2⌈d/2⌉. This is done in a straightforward way by adding at most dn variables in such a way as to preserve the multiplicity. The second step consists of a reduction to the quadratic case. For this, it will be convenient to consider the hypergraphs associated with the monomial support of the coordinate mappings ψ_i.
Recall that a d-hypergraph H = (V, E) consists of a vertex set V and a multiset E, also denoted E(H), of subsets of V of size at most d, called the edges. A hypergraph is d-uniform if each edge has size exactly d. The degree of a vertex is the number of edges containing it and the degree of H, denoted ∆(H), is the maximum degree among its vertices. A matching is a hypergraph in which no two edges intersect. Associate with a hypergraph H = ([n], E) the multilinear polynomial p_H ∈ R[x_1, …, x_n] given by
$$p_H(x) = \sum_{e \in E} \prod_{i \in e} x_i. \tag{3}$$
The multiplicity of p_H is then exactly the degree ∆(H). Clearly the coordinate mappings ψ_i of the form featuring in Theorem 1.1 can be written as p_H for some d-hypergraph H of degree at most t. The reduction to the quadratic case is based on the following key lemma, in which for x ∈ R^n and m ∈ N, the mth tensor power x^{⊗m} ∈ R^{n^m} is defined by (x^{⊗m})_{(i_1,…,i_m)} = x_{i_1} ⋯ x_{i_m}.

Lemma 2.2. For every r ∈ N there exist C_r, c_r ∈ (0, ∞) and n_0(r) ∈ N such that the following holds. Let n ≥ n_0(r), m = C_r n^{1−1/r} and N = n^m. Let H = ([n], E) be a 2r-uniform hypergraph and let p_H be the polynomial as in (3). Then, there exists a matrix A ∈ R^{N×N} such that ‖A‖ ≲_r ∆(H) and for every x ∈ {−1,1}^n,
$$\langle x^{\otimes m}, A\, x^{\otimes m} \rangle = \frac{c_r N}{n}\, p_H(x).$$
Moreover, A is the adjacency matrix of a graph (with possible parallel edges). With this lemma in hand, the proof of Theorem 1.1 is straightforward (see below). The idea behind Lemma 2.2 is to use decompositions into matchings and a generalization of the Birthday Paradox that says that for any n-vertex 2r-matching, a random subset of C_r n^{1−1/r} vertices contains r vertices of any fixed edge with probability roughly c_r/n. To illustrate how this is used in the r = 2 case, let H be a 4-matching, let m = C_2 √n and N = n^m. It follows from the generalized Birthday Paradox that there are about c_2 N/n strings in [n]^m containing at least two elements of a given edge.
Now let G be the graph with vertex set [n]^m whose edges are the pairs {u, v} that cover some edge in H and complement each other, meaning: there are an edge e ∈ E(H) and an r-subset I ⊆ [m] such that u(I) ∪ v(I) = e and u(i) = v(i) for all i ∈ [m] ∖ I (here we identify strings in [n]^m with maps [m] → [n]). The main observation is that for every edge {u, v} ∈ E(G) that covers an edge e ∈ E(H) and every x ∈ {−1,1}^n, we have (x^{⊗m})_u (x^{⊗m})_v = ∏_{i∈e} x_i, since the coordinates on which u and v agree contribute factors x_i² = 1. It follows that, modulo the relations x_1² = 1, …, x_n² = 1, we have p_G(x^{⊗m}) = (c_2 N/n) p_H(x). The (appropriately scaled) adjacency matrix of G then satisfies the second criterion of the lemma, but it will have large norm if G has high degree. To obtain a matrix with the desired norm, we consider a pruned version of G in which we keep only edges that do not cover too many edges of H (at the cost of only a constant-factor decrease of the constant c_2).
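The cancellation identity underlying this observation can be checked directly. The Python sketch below (with arbitrary illustrative choices of n, m, the edge e and the positions I) builds a complementing pair u, v covering a fixed 4-edge and verifies that (x^{⊗m})_u (x^{⊗m})_v = ∏_{i∈e} x_i for random sign vectors x:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, r = 12, 6, 2
e = [0, 1, 2, 3]            # a fixed 2r-edge of H
I = [1, 4]                  # the r positions where u and v differ
u = rng.integers(0, n, size=m)
v = u.copy()
u[I] = e[:r]                # u covers one half of e on I ...
v[I] = e[r:]                # ... and v the complementary half

for _ in range(100):
    x = rng.choice([-1, 1], n)
    xu = np.prod(x[u])      # the coordinate (x^{(x)m})_u
    xv = np.prod(x[v])
    assert xu * xv == np.prod(x[e])   # coordinates off I cancel, as x_i^2 = 1
print("identity verified")
```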
We now give the formal proof of Theorem 1.1. The first step homogenizes the polynomials. Let r = ⌈d/2⌉. Given a d-hypergraph H_i = ([n], E_i) of degree at most t, pad each edge with dummy vertices from a set of (2r − 1)n fresh vertices, assigned so that no dummy vertex is used more than t times; this yields a 2r-uniform hypergraph H′_i on 2rn vertices of degree at most t such that p_{H_i}(x) = p_{H′_i}(x, 1), where 1 ∈ R^{(2r−1)n} is the all-ones vector. Hence, if we let ψ′ : R^{2rn} → R^k be the polynomial map whose coordinates are given by the polynomials p_{H′_i}, then ψ({0,1}^n) ⊆ ψ′({0,1}^{2rn}). Since the claimed bound on the Gaussian width is polynomial in n, the extra vertices will result in an extra factor depending only on d. It thus suffices to prove the theorem for the case where H_1, …, H_k are 2r-uniform.
Observe that since the polynomials ψ_i are multilinear, the Gaussian width is bounded from above by replacing binary vectors with sign vectors. In particular,
$$\mathrm{GW}\big(\psi(\{0,1\}^n)\big) \lesssim_d \mathbb{E} \sup_{x \in \{-1,1\}^n} \sum_{i=1}^k g_i\, p_{H_i}(x).$$
Let m = C_r n^{1−1/r} and N = n^m and for each i ∈ [k], let A_i ∈ R^{N×N} be a matrix for p_{H_i} as in Lemma 2.2. Then, for every x ∈ {−1,1}^n,
$$\sum_{i=1}^k g_i\, p_{H_i}(x) = \frac{n}{c_r N} \sum_{i=1}^k g_i \langle x^{\otimes m}, A_i\, x^{\otimes m} \rangle \le \frac{n}{c_r} \Big\| \sum_{i=1}^k g_i A_i \Big\|,$$
where in the inequality we used that x^{⊗m} has Euclidean norm √N. Taking expectations, it then follows from Theorem 2.1 that the Gaussian width of ψ({0,1}^n) is at most
$$\frac{n}{c_r}\, C \sqrt{\log N}\, \Big( \sum_{i=1}^k \|A_i\|^2 \Big)^{1/2} \lesssim_r n t \sqrt{k\, m \log n} \lesssim_r n t \sqrt{k\, n^{1-1/r} \log n},$$
where in the second inequality we used that ‖A_i‖ ≲_r t and log N = m log n.

3. Proof of the matrix lemma
In this section we prove Lemma 2.2. The starting point is a decomposition of a bounded-degree hypergraph into a small number of matchings. For this, we use the following basic result on edge colorings. The edge chromatic number of a hypergraph H, denoted by χ_E(H), is the minimum number of colors needed to color the edges of H such that no two edges which intersect have the same color. Note that χ_E(H) equals the smallest number of matchings into which E(H) can be partitioned.

Lemma 3.1. For every d-hypergraph H, we have χ_E(H) ≤ d(∆(H) − 1) + 1. Indeed, each edge intersects at most d(∆(H) − 1) other edges, so coloring the edges greedily never requires more than d(∆(H) − 1) + 1 colors.
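The greedy edge coloring behind Lemma 3.1 can be sketched in a few lines of Python (the example hypergraph, all 3-subsets of a 6-element vertex set, is an arbitrary illustrative choice):

```python
from itertools import combinations

def matching_decomposition(edges):
    """Greedily partition the edges of a hypergraph into matchings
    (a proper edge coloring); uses at most d*(Delta - 1) + 1 colors."""
    matchings = []
    for e in edges:
        for match in matchings:
            if all(e.isdisjoint(f) for f in match):
                match.append(e)
                break
        else:
            matchings.append([e])
    return matchings

# all 3-subsets of [6]: a 3-uniform hypergraph with vertex degree Delta = C(5,2) = 10
edges = [set(e) for e in combinations(range(6), 3)]
ms = matching_decomposition(edges)
assert all(a.isdisjoint(b) for match in ms for a in match for b in match if a is not b)
print(len(ms), 3 * (10 - 1) + 1)  # matchings used vs the greedy bound 28
```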
Fix a maximal family M of pairwise disjoint 2r-subsets of [n] (so |M| = ⌊n/(2r)⌋), and let m = C_r n^{1−1/r} and s = 200 · 4^r. Given a mapping f : [m] → [n] and a set S ∈ M, let
$$\mu_S(f) = \Big| \Big\{ I \in \binom{[m]}{r} : |S \cap f(I)| = r \Big\} \Big|.$$
Note that this is a count of the r-subsets I ⊆ [m] that f maps injectively into S. Denote φ(f) = Σ_{S∈M} µ_S(f) and say that f is s-good if 1 ≤ φ(f) ≤ s. Say that a mapping g : [m] → [n] complements f if there is an r-subset I ⊆ [m] such that:
(1) For some S ∈ M we have |S ∩ f(I)| = r and g maps I injectively onto S ∖ f(I).
(2) For all i ∈ [m] ∖ I, we have g(i) = f(i).
If g complements f then clearly the converse also holds. Say that the complementary pair (f, g) covers S ∈ M if f(I) ∪ g(I) = S. Observe that if (f, g) covers S, then for every x ∈ {−1,1}^n,
$$(x^{\otimes m})_f\, (x^{\otimes m})_g = \prod_{i \in S} x_i, \tag{4}$$
since f and g agree outside I and their images on I partition S. Define the set of ordered pairs
$$P = \big\{ (f, g) : f \text{ is } s\text{-good and } g \text{ complements } f \big\}. \tag{5}$$
Proposition 3.2. Let P be as in (5). Then, for every S ∈ M, the number of pairs (f, g) ∈ P that cover S equals |P|/|M|.
Proof: Fix distinct sets S, T ∈ M and let π ∈ S_n be a permutation such that π(S) = T, π(T) = S and π(i) = i for all i ∉ S ∪ T. Let P_S be the set of pairs (f, g) ∈ P which cover S and define P_T similarly. We claim that the map ψ : (f, g) ↦ (π ◦ f, π ◦ g) is an injective map from P_S to P_T. It follows that T is covered by at least as many pairs from P as S is. Similarly, interchanging S and T, the converse also holds. To prove the claim, note that if (f, g) covers S, then (π ◦ f, π ◦ g) covers T. Moreover, φ(π ◦ f) = φ(f) because π maps sets of the family M to sets of M. Thus ψ((f, g)) ∈ P_T, and ψ is injective since π is a bijection. ✷

Proposition 3.3. For every (f, g) ∈ P, we have that g is s²-good.
Proof: Let S ∈ M and (f, g) ∈ P be such that (f, g) covers S. Consider the histograms F, G : [n] → {0, 1, …, m} given by F(i) = |f^{−1}(i)| and G(i) = |g^{−1}(i)| for each i ∈ [n]. Then F and G differ only on S. In particular, there is an r-set T ⊆ S such that G(i) = F(i) + 1 for each i ∈ T and G(i) = F(i) − 1 for each i ∈ S ∖ T. Hence µ_S(g) can exceed µ_S(f) by at most a factor depending only on r, while for all other S′ ∈ M we have µ_{S′}(g) = µ_{S′}(f). Moreover, f must be s-good for (f, g) to belong to P, and φ(g) ≥ µ_S(g) ≥ 1 since g maps I injectively into S. It follows that φ(g) ≤ s², where in the last step we used the choice of s = 200 · 4^r. ✷

Lemma 3.4 (Generalized birthday paradox). For every r ∈ N there exist a C_r ∈ (0, ∞) and an n_0(r) ∈ N such that the following holds. Let h be a uniformly distributed random variable over the set of maps from [m] to [n]. Then, provided n ≥ n_0(r) and m = C_r n^{1−1/r}, the map h is s-good with probability at least 1/2. We postpone the proof of Lemma 3.4 to Section 4.
Corollary 3.5. Let P be as in (5) and let A : [n]^m × [n]^m → {0, 1} be its incidence matrix, that is, A(f, g) = 1 ⟺ (f, g) ∈ P. Then, |P| ≥ Ω(N) and every row and every column of A has at most s² · r! ones.
Proof: The first claim follows from Lemma 3.4 and the fact that |P| is at least the number of s-good mappings. If h is l-good, then there are at most l · r! mappings from [m] to [n] that complement h. Hence, every row of A has at most s · r! ones and, by Proposition 3.3, every column of A has at most s² · r! ones. ✷

With this, we can now prove Lemma 2.2.

Proof of Lemma 2.2: Let t = ∆(H). By Lemma 3.1, H can be decomposed into χ_E(H) ≤ 2rt matchings, which we denote by F_1, …, F_{χ_E(H)}. Complete each F_i to a maximal family M_i of disjoint 2r-subsets of [n] in some arbitrary way. For each M_i, let P_i be as in (5) and let A_i : [n]^m × [n]^m → {0, 1} be its incidence matrix. Set to zero all the entries of A_i that correspond to a pair (f, g) covering a set in M_i ∖ F_i. Let B = A_1 + ⋯ + A_{χ_E(H)} and A = B + B^T. It follows from (4) and Proposition 3.2 that for each x ∈ {−1,1}^n, we have
$$\langle x^{\otimes m}, A\, x^{\otimes m} \rangle = \frac{2|P_1|}{|M_1|}\, p_H(x). \tag{6}$$
Since all M_i are maximal, they have the same size, as do the P_i. Hence, by Corollary 3.5, there exists a constant c_r ∈ (0, 1] such that the right-hand side of (6) equals (2c_r N/n) p_H(x). Let G be the graph with adjacency matrix A, allowing for parallel edges. Then G has degree at most O_r(t). It follows from Lemma 3.1 that G can be partitioned into O_r(t) matchings. Since the adjacency matrix of a matching has unit norm, we get that ‖A‖ ≤ O_r(t). ✷

4. Proof of the generalized birthday paradox
For the proof of Lemma 3.4, we use a standard Poisson approximation result for "balls and bins" problems [MU05, Theorem 5.10]. A discrete Poisson random variable Y with expectation µ is nonnegative, integer valued, and has probability density function Pr[Y = j] = e^{−µ} µ^j / j!.

Proposition 4.1. If X, Y are independent Poisson random variables with expectations µ_X, µ_Y, respectively, then X + Y is a Poisson random variable with expectation µ_X + µ_Y.

Lemma 4.2 ([MU05, Theorem 5.10]). Throw m balls independently and uniformly into n bins, let X = (X_1, …, X_n) be the resulting bin loads, and let Y = (Y_1, …, Y_n) be independent Poisson random variables with expectation m/n. Then, for any nonnegative function f such that E[f(X)] is monotonically increasing or decreasing in m, we have E[f(X)] ≤ 2 E[f(Y)].
Proof of Lemma 3.4: Let C_r > 0 be a parameter depending only on r to be set later. Let µ = m/n = C_r n^{−1/r} and assume that n ≥ n_0(r) := 4(C_r r)^r. Identify a random map h : [m] → [n] with a throw of m balls into n bins and let X = (X_1, …, X_n) be the resulting histogram, so that X_i = |h^{−1}(i)|. We begin by lower bounding the probability of the event that φ(h) ≥ 1. Recall that this occurs if there exist an S ∈ M and an r-subset T ⊆ S such that T ⊆ im(h). Let ψ : (N ∪ {0})^n → {0, 1} be the function with ψ(y) = 1 if and only if for every S ∈ M, the string (1_{≥1}(y_i))_{i∈S} has Hamming weight strictly less than r. Then ψ(X) = 1 if and only if φ(h) = 0, and E[ψ(X)] decreases monotonically with m. Hence, for Y a Poisson random vector as in Lemma 4.2,
$$\Pr[\varphi(h) = 0] = \mathbb{E}[\psi(X)] \le 2\,\mathbb{E}[\psi(Y)] = 2 \prod_{S \in M} \Pr\Big[ \sum_{i \in S} 1_{\ge 1}(Y_i) < r \Big], \tag{8}$$
where in the last step we used the fact that since the sets S ∈ M are disjoint, the corresponding random variables are independent. The random variables 1_{≥1}(Y_i), i ∈ S, are independent Bernoullis that are zero with probability e^{−µ}. Each factor in (8) is the probability that these random variables form a string of Hamming weight strictly less than r. For a fixed subset T ⊆ S of size r, using that n ≥ 4(C_r r)^r and the fact that 1 − x ≤ exp(−x) ≤ 1 − x + x²/2 when x > 0, this probability is at most 1 − (1 − e^{−µ})^r ≤ 1 − µ^r/2 ≤ exp(−µ^r/2). Hence, since M is maximal, the above and (8) give
$$\Pr[\varphi(h) = 0] \le 2 \exp\big(-\lfloor n/(2r) \rfloor\, \mu^r/2\big). \tag{9}$$
Set C_r = (6er)^{1/r}; then the above right-hand side is at most 1/4. Next, we upper bound the probability that φ(h) ≥ s = 200 · 4^r. Define χ : (N ∪ {0})^n → R_+ by
$$\chi(y) = \sum_{S \in M} \sum_{T \in \binom{S}{r}} \prod_{i \in T} y_i.$$
Then φ(h) = χ(X). Moreover, E[χ(X)] increases monotonically with m. It thus follows from Lemma 4.2 that
$$\mathbb{E}[\varphi(h)] \le 2\,\mathbb{E}[\chi(Y)] = 2\,|M| \binom{2r}{r} \mu^r \le 6e \binom{2r}{r} \le 6e \cdot 4^r,$$
where in the equality we used the fact that the Y_i are independent. By Markov's inequality, Pr[φ(h) > 200 · 4^r] ≤ 6e/200 < 1/4. With (9), we get that h is s-good with probability at least 1/2. ✷
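The scaling in the generalized birthday paradox can be sanity-checked by simulation. The Python sketch below (the constant C and the range of n are arbitrary illustrative choices) estimates, for r = 2 and a fixed 2r-set S, the probability that a uniform map [m] → [n] with m = C n^{1−1/r} hits at least r elements of S; multiplied by n, this stays bounded as n grows, consistent with a probability of order c_r/n.

```python
import numpy as np

rng = np.random.default_rng(5)
r, C = 2, 3
trials = 5000

for n in [100, 400, 1600]:
    m = int(C * n ** (1 - 1 / r))           # m = C * sqrt(n) for r = 2
    S = set(range(2 * r))                   # a fixed 2r-element edge
    hits = 0
    for _ in range(trials):
        im = set(rng.integers(0, n, size=m).tolist())
        if len(S & im) >= r:
            hits += 1
    print(n, hits / trials * n)             # n * Pr[...] stays bounded
```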

5. Random differences in Szemerédi's Theorem
In this section we prove Theorem 1.3. We first consider a slightly different random model, where we form a random multiset D_k of size k by repeatedly sampling a uniformly random element from Z/NZ. We will need the following equivalent formulation of Szemerédi's Theorem due to Varnavides [Var59] (see [Tao07, Theorem 4.8] for this exact formulation).

Proposition 5.1 (Varnavides). For every ℓ ∈ N and α ∈ (0, 1] there exist an N_1(ℓ, α) ∈ N and an ǫ(ℓ, α) > 0 such that the following holds for all N ≥ N_1(ℓ, α): every set A ⊆ Z/NZ of size |A| ≥ αN contains at least ǫN² proper (ℓ+1)-term arithmetic progressions.

Proposition 5.2. Let ℓ be a positive integer and α ∈ (0, 1]. If k = ω(N^{1−1/⌈(ℓ+1)/2⌉} log N), then D_k is (ℓ, α)-intersective with probability 1 − o(1).
Proof: We will arrive at a contradiction assuming that the statement is false. Let Γ = Z/NZ. For f : Γ → R and y ∈ Γ ∖ {0}, define
$$\varphi_y(f) = \frac{1}{N} \sum_{x \in \Gamma} f(x) f(x+y) \cdots f(x + \ell y),$$
which is a degree-(ℓ+1) polynomial in the variables (f(x))_{x∈Γ}. For a multiset S ⊆ Γ ∖ {0}, define
$$\Lambda_S(f) = \frac{1}{|S|} \sum_{y \in S} \varphi_y(f).$$
If f = 1_A, then this counts the fraction of proper (ℓ+1)-term APs with common difference in S that lie completely in A.
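These averages are easy to compute directly for small instances; the Python transcription below (the specific N, A and differences are arbitrary illustrative choices, and N is taken prime so that all progressions with nonzero difference are proper) evaluates φ_y and Λ_S for an interval A:

```python
def phi_y(f, y, ell, N):
    # phi_y(f) = N^{-1} * sum_x f(x) f(x+y) ... f(x + ell*y)
    total = 0
    for x in range(N):
        prod = 1
        for j in range(ell + 1):
            prod *= f[(x + j * y) % N]
        total += prod
    return total / N

def Lam(f, S, ell, N):
    # average of phi_y over the common differences y in the multiset S
    return sum(phi_y(f, y, ell, N) for y in S) / len(S)

# A = {0,...,25} in Z/53Z: 24 3-APs with difference 1, none with difference 26
N, ell = 53, 2
f = [1 if x < 26 else 0 for x in range(N)]
print(Lam(f, [1], ell, N), Lam(f, [26], ell, N))  # 24/53 and 0.0
```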
Let N_1(ℓ, α) and ǫ(ℓ, α) be as in Proposition 5.1. Suppose that with constant probability, there is a subset A ⊆ Γ of size at least αN with no proper (ℓ+1)-term APs whose common difference lies in D_k. Then,
$$\mathbb{E}_{D_k} \sup_{A} \big( \Lambda_{\Gamma \setminus \{0\}}(1_A) - \Lambda_{D_k}(1_A) \big) \ge \Omega(\epsilon),$$
where the supremum is over A ⊆ Γ of size at least αN; here we used that, by Proposition 5.1, Λ_{Γ∖{0}}(1_A) ≥ ǫ for every such A. We are going to apply a standard symmetrization trick to establish a connection with Gaussian width. Let D′ be an independent copy of D_k. Then,
$$\mathbb{E}_{D_k} \sup_A \big( \Lambda_{\Gamma \setminus \{0\}}(1_A) - \Lambda_{D_k}(1_A) \big) \le \mathbb{E}_{D_k, D'} \sup_A \big( \Lambda_{D'}(1_A) - \Lambda_{D_k}(1_A) \big).$$
Observe that for i.i.d. random y, y′ ∈ Γ ∖ {0}, the random variable φ_y(1_A) − φ_{y′}(1_A) is symmetric in the sense that it has the same distribution as its negation. Let σ_1, …, σ_k be independent uniformly distributed {−1,1}-valued random variables. Then it follows from the above that
$$\mathbb{E}_{D_k} \sup_A \big( \Lambda_{\Gamma \setminus \{0\}}(1_A) - \Lambda_{D_k}(1_A) \big) \le \frac{2}{k}\, \mathbb{E}_{y, \sigma} \sup_A \sum_{i=1}^k \sigma_i \varphi_{y_i}(1_A).$$
Let us fix y_1, …, y_k ∈ Γ ∖ {0}. Each φ_{y_i} can be written as φ_{y_i} = N^{−1} p_{H_i} (as in (3)), where H_i is the hypergraph on Γ whose edges are the (ℓ+1)-term arithmetic progressions with common difference y_i. The maximum degree of H_i is O(ℓ). This is because each such AP (x + t y_i)_{0 ≤ t ≤ ℓ} intersects another AP (x′ + t′ y_i)_{0 ≤ t′ ≤ ℓ} iff x − x′ = (t′ − t) y_i; so there are only O(ℓ) such x′ for a given x. Let g_1, …, g_k be independent N(0, 1) random variables. Then we can bound
$$\mathbb{E}_{\sigma} \sup_A \sum_{i=1}^k \sigma_i \varphi_{y_i}(1_A) \lesssim \mathbb{E}_g \sup_A \sum_{i=1}^k g_i \varphi_{y_i}(1_A) \le \frac{1}{N}\, \mathrm{GW}\big(\psi(\{0,1\}^\Gamma)\big) \lesssim_\ell \sqrt{k\, N^{1 - 1/\lceil (\ell+1)/2 \rceil} \log N},$$
where the last inequality follows directly from Theorem 1.1 applied to ψ = (p_{H_1}, …, p_{H_k}). Thus we get k ≲_{ℓ,α} N^{1−1/⌈(ℓ+1)/2⌉} log N, which is a contradiction. ✷

We will need the following simple fact that conditioning on a high-probability event does not change the probability of any event by much.

Lemma 5.3. Let E be an event with Pr[E] = 1 − o(1). Then, for any event B, |Pr[B | E] − Pr[B]| = o(1).

Proof of Theorem 1.3: Let D_k be a random subset of (Z/NZ) ∖ {0} of size at most k, formed by sampling a uniformly random element k times. Let D_p = [(Z/NZ) ∖ {0}]_p be a random subset of (Z/NZ) ∖ {0} formed by including each element with probability p independently. We claim that if D_k is ℓ-intersective with probability 1 − o(1), then D_p will also be ℓ-intersective with probability 1 − o(1) when p = 2k/N and k = ω_N(1).
Let p = 2k/N and k = ω_N(1). Let E be the event that D_p has size at least k. By the Chernoff bound,
$$\Pr[\neg E] \le \exp\big(-N \cdot D_{\mathrm{KL}}(k/N \,\|\, p)\big) = o(1),$$
where D_KL is the Kullback–Leibler divergence. By Lemma 5.3, conditioning on E changes the probability of D_p being ℓ-intersective by o(1). Conditioned on E, the probability that D_p is ℓ-intersective is at least the probability that D_k is ℓ-intersective. Indeed, both D_p and D_k, after conditioning on a given size, reduce to the uniform distribution over all subsets of that size. Proposition 5.2 thus implies that D_p is ℓ-intersective with probability 1 − o(1) when p = ω(N^{−1/⌈(ℓ+1)/2⌉} log N). ✷

6. Upper tails for arithmetic progressions in random sets
Here we prove Theorem 1.4. Let Γ = Z/NZ. In the following we identify maps from a set S to R with vectors in R^S. For f : Γ → R, define
$$\Lambda_k(f) = \sum_{a \in \Gamma} \sum_{b \in \Gamma \setminus \{0\}} f(a) f(a+b) f(a+2b) \cdots f(a + (k-1)b).$$
Observe that for a subset A ⊆ Γ, Λ_k(1_A) counts the number of proper k-term arithmetic progressions in A. Moreover, Λ_k is an N-variate polynomial of degree k. Recall that the gradient of a polynomial p ∈ R[x_1, …, x_n] is the mapping ∇p : R^n → R^n whose ith coordinate is given by (∇p(x))_i = (∂p/∂x_i)(x). The proof of Theorem 1.4 follows from a simple corollary of Theorem 1.1 and one of the main results of [BGSZ18]. For the corollary, we consider polynomial mappings given by gradients of polynomials of the form (3).

Corollary 6.1. Let d ≥ 3 and let H = ([n], E) be a d-uniform hypergraph in which every pair of distinct vertices is contained in at most t edges. Then,
$$\mathrm{GW}\big(\nabla p_H(\{0,1\}^n)\big) \lesssim_d t\, n \sqrt{n \cdot n^{1 - 1/\lceil (d-1)/2 \rceil} \log n}.$$
Proof: Each coordinate of ∇p_H has the form (∇p_H)_i = p_{H_i}, where H_i is the (d−1)-hypergraph on [n] whose edges are the sets e ∖ {i} for e ∈ E with i ∈ e; each H_i has degree at most t. The claim now follows from Theorem 1.1, applied with k = n. ✷

Theorem 6.2 (Bhattacharya–Ganguly–Shao–Zhao). Let k ≥ 3 be a fixed integer and let σ, τ be positive real numbers such that
$$\frac{1}{N}\, \mathrm{GW}\big(\nabla \Lambda_k(\{0,1\}^\Gamma)\big) \lesssim N^{1-\sigma} (\log N)^\tau.$$
Then, for fixed δ > 0, the estimate (1) holds provided p^k min{δp^k, δ²p} ≥ N^{−σ/3} (log N)^{C(τ)} for a suitable constant C(τ) > 0.
Proof of Theorem 1.4: Let H = (Γ, E) be the k-uniform hypergraph whose edges are the (unordered) proper k-term arithmetic progressions in Γ. Then, accounting for the fact that Λ_k distinguishes between the same progression with step b run forward from a point a or backward from a + (k−1)b, and since N is prime, we have 2p_H = Λ_k. We claim that every pair of distinct vertices appears in O(k²) edges. First note that H is 2-transitive: for any two pairs of distinct vertices (a, b), (c, d), the affine map x ↦ c + (d − c)(b − a)^{−1}(x − a) sends a to c, b to d and preserves arithmetic progressions. It follows that every pair of distinct vertices is contained in the same number of edges. Since each edge contains \binom{k}{2} pairs, the claim follows by double counting. By Corollary 6.1, we may thus set σ = 1/(2⌈(k−1)/2⌉) and τ = 1/2 in Theorem 6.2, and it follows that for constant δ, the estimate (1) holds if
$$p^k \min\{\delta p^k, \delta^2 p\} \ge N^{-\frac{1}{6\lceil (k-1)/2 \rceil}} (\log N)^{7/6}.$$
Taking kth roots now gives the claim. ✷

7. Proof of Lemma 1.5

In this section we give a proof of Lemma 1.5. As explained in the proof of Theorem 1.1, it suffices to prove the statement when the coordinates of ψ are given by p_{H_i} (as in (3)) for d-uniform hypergraphs H_1, …, H_k. Let Λ_{H_i} be a d-multilinear form such that p_{H_i}(x) = Λ_{H_i}(x, x, …, x). Let g = (g_1, …, g_k) be a vector of independent standard Gaussians and ε = (ε_1, …, ε_k) be uniformly random in {−1,1}^k. Then,
$$\mathbb{E}_g \sup_{x \in \{0,1\}^n} \sum_{i=1}^k g_i\, p_{H_i}(x) = \mathbb{E}_\varepsilon \mathbb{E}_g \sup_{x} \sum_{i=1}^k \varepsilon_i g_i\, \Lambda_{H_i}(x, \dots, x) \le n\, \mathbb{E}_g \mathbb{E}_\varepsilon \Big\| \sum_{i=1}^k \varepsilon_i g_i\, \Lambda_{H_i} \Big\|,$$
where in the first step we used that each g_i is symmetrically distributed, that is, g_i and −g_i have the same distribution, and in the second step that ‖x‖_{ℓ_{r_1}} ⋯ ‖x‖_{ℓ_{r_d}} ≤ n^{1/r_1 + ⋯ + 1/r_d} = n for every x ∈ {0,1}^n. By Jensen's inequality, the above expectation over ε is at most
$$n\, T_p\big(L^n_{r_1,\dots,r_d}\big)\, \mathbb{E}_g \Big( \sum_{i=1}^k |g_i|^p \|\Lambda_{H_i}\|^p \Big)^{1/p} \le n\, T_p\big(L^n_{r_1,\dots,r_d}\big)\, k^{1/p} \max_i \|\Lambda_{H_i}\|,$$
where we used the fact that $\mathbb{E}_g \|g\|_{\ell_p} \le (\sum_{i=1}^k \mathbb{E}|g_i|^p)^{1/p} \le k^{1/p} (\mathbb{E}|g_1|^2)^{1/2} = k^{1/p}$. If H_i is a matching hypergraph, using Hölder's inequality it is easy to see that ‖Λ_{H_i}‖ ≤ 1. If not, by Lemma 3.1 we can decompose H_i into at most d∆(H_i) matchings and use the triangle inequality to conclude that ‖Λ_{H_i}‖ ≤ d∆(H_i), which gives the desired bound.