Tight general bounds for the extremal numbers of 0–1 matrices

A zero-one matrix $M$ is said to contain another zero-one matrix $A$ if we can delete some rows and columns of $M$ and replace some 1-entries with 0-entries such that the resulting matrix is $A$. The extremal number of $A$, denoted $\mathrm{ex}(n, A)$, is the maximum number of 1-entries that an $n \times n$ zero-one matrix can have without containing $A$. The systematic study of this function for various patterns $A$ goes back to the work of Füredi and Hajnal from 1992, and the field has many connections to other areas of mathematics and theoretical computer science. The problem has been particularly extensively studied for so-called acyclic matrices, but very little is known about the general case (that is, the case where $A$ is not necessarily acyclic). We prove the first asymptotically tight general result by showing that if $A$ has at most $t$ 1-entries in every row, then $\mathrm{ex}(n, A) \le n^{2-1/t+o(1)}$. This verifies a conjecture of Methuku and Tomon. Our result also provides the first tight general bound for the extremal number of vertex-ordered graphs with interval chromatic number 2, generalizing a celebrated result of Füredi, and Alon, Krivelevich and Sudakov about the (unordered) extremal number of bipartite graphs with maximum degree $t$ in one of the vertex classes.


Introduction
One of the most central problems in extremal graph theory is concerned with estimating the maximum number of edges that a graph on a given number of vertices can have without containing some given graph as a subgraph. Formally, the extremal number (also known as the Turán number) of a graph $H$, denoted $\mathrm{ex}(n, H)$, is the maximum number of edges that a graph on $n$ vertices can have if it does not contain $H$ as a subgraph. The first result on this topic was obtained by Mantel [21] in 1907, when he determined the value of $\mathrm{ex}(n, H)$ when $H$ is the complete graph on three vertices. In 1941, Turán [30] extended this to the case where $H$ is an arbitrary complete graph. The celebrated Erdős-Stone-Simonovits theorem [9,10] asserts that $\mathrm{ex}(n, H) = \left(1 - \frac{1}{\chi(H)-1} + o(1)\right)\binom{n}{2}$, which determines the asymptotics of $\mathrm{ex}(n, H)$ whenever the chromatic number $\chi(H)$ of $H$ is at least 3. However, this result only gives the weak estimate $o(n^2)$ for the extremal number of bipartite graphs. A slightly better upper bound can be obtained using the Kővári-Sós-Turán theorem [19], which states that $\mathrm{ex}(n, K_{s,t}) = O(n^{2-1/s})$.

Getting good estimates for the extremal number of bipartite graphs is an important and notoriously challenging problem. The difficulty is well reflected by the fact that the order of magnitude of $\mathrm{ex}(n, H)$ is unknown even in very simple cases such as when $H$ is $K_{4,4}$, the complete bipartite graph with 4 vertices on each side, or $C_8$, the cycle of length 8. One of the few tight general results is the following theorem of Füredi [12]. A different proof of this result was later found by Alon, Krivelevich and Sudakov [2] as one of the first applications of the celebrated dependent random choice method.
Theorem 1.1 (Füredi [12], Alon-Krivelevich-Sudakov [2]). Let $H$ be a bipartite graph with maximum degree $t$ in one side of the bipartition. Then $\mathrm{ex}(n, H) = O(n^{2-1/t})$.

This result is tight when $H = K_{t,r}$ for some $r$ much larger than $t$ (see [3,5,17]), but there have been recent improvements in the case where $H$ is $K_{t,t}$-free: see [6,15,27].

Zero-one matrices
In the 1990s, motivated by various applications to problems in other areas of mathematics (some of which we will discuss shortly), researchers started developing an analogous extremal theory for zero-one matrices. For zero-one matrices $A$ and $M$, we say that $M$ contains $A$ if, by deleting some rows and columns from $M$, and possibly turning some of its 1-entries to 0-entries, we can obtain $A$. The weight of a zero-one matrix $M$, denoted $w(M)$, is the number of 1-entries in $M$. The extremal number of a zero-one matrix $A$, denoted $\mathrm{ex}(n, A)$, is the maximum possible weight of an $n \times n$ zero-one matrix that does not contain $A$.
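The definitions of containment, weight and $\mathrm{ex}(n, A)$ can be made concrete with a small brute-force sketch. It is only meant to illustrate the definitions on tiny inputs (the search is exponential), and the function names `contains`, `weight` and `ex_bruteforce` are hypothetical names chosen for this illustration; they do not come from the paper.

```python
from itertools import combinations

def contains(M, A):
    """Return True if the 0-1 matrix M contains the 0-1 matrix A.

    Brute force: try every order-preserving choice of len(A) rows and
    len(A[0]) columns of M and check that every 1-entry of A lands on a
    1-entry of M. Extra 1-entries of M are fine, since the definition
    allows turning 1-entries into 0-entries.
    """
    a, b = len(A), len(A[0])
    m, n = len(M), len(M[0])
    for rows in combinations(range(m), a):
        for cols in combinations(range(n), b):
            if all(M[rows[u]][cols[y]] >= A[u][y]
                   for u in range(a) for y in range(b)):
                return True
    return False

def weight(M):
    """Number of 1-entries of M, written w(M) in the text."""
    return sum(map(sum, M))

def ex_bruteforce(n, A):
    """Maximum weight of an n x n 0-1 matrix not containing A (tiny n only)."""
    best = 0
    for mask in range(1 << (n * n)):
        # Decode the bitmask into an n x n 0-1 matrix.
        M = [[(mask >> (i * n + j)) & 1 for j in range(n)] for i in range(n)]
        if weight(M) > best and not contains(M, A):
            best = weight(M)
    return best
```

For instance, `ex_bruteforce(3, [[1, 1], [1, 1]])` enumerates all 512 zero-one matrices of size $3 \times 3$ and returns the largest weight of one that avoids the $2 \times 2$ all-one matrix.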
One of the first results on the topic was obtained by Füredi [11] in 1990, who determined the order of magnitude of the extremal number of a certain $2 \times 3$ matrix, and used it to give an $O(n \log n)$ bound for the number of unit distances in a convex $n$-gon, thereby making significant progress on an old problem of Erdős and Moser [8]. This bound is still the best known, though the implicit constant has been improved [1] (also using forbidden submatrix theory). A more systematic study of extremal numbers of zero-one matrices was initiated by Füredi and Hajnal [13] in 1992. Since then, the extremal theory of zero-one matrices has been very successful at resolving problems in combinatorics, discrete and computational geometry, structural graph theory and the analysis of data structures. We refer the reader to the paper of Pettie and Tardos [26], and the surveys of Tardos [28,29] for an excellent overview of the extremal theory of zero-one matrices and its many applications.
There is a natural relationship between zero-one matrices and bipartite graphs. For a matrix $A$, we define $H_A$ to be the bipartite graph obtained by taking a vertex for each row and column of $A$, and taking an edge between $u$ and $v$ if and only if $A(u, v) = 1$. It is easy to prove that for any zero-one matrix $A$, we have

$$\mathrm{ex}(n, A) = \Omega(\mathrm{ex}(n, H_A)). \tag{1}$$

Interestingly, disproving a conjecture of Füredi and Hajnal [13] in a very strong sense, Pach and Tardos [24] showed that $\mathrm{ex}(n, A)$ can be much larger than $\mathrm{ex}(n, H_A)$. More precisely, they proved that there are matrices $A$ for which $H_A = C_{2k}$ (for arbitrary $k$) and yet $\mathrm{ex}(n, A) = \Omega(n^{4/3})$. Since $\mathrm{ex}(n, C_{2k}) = O(n^{1+1/k})$ (see [4]), this implies a huge gap between $\mathrm{ex}(n, A)$ and $\mathrm{ex}(n, H_A)$ for these matrices, and demonstrates that proving upper bounds for the extremal number of zero-one matrices is even more difficult than the corresponding problem for bipartite graphs.
When $H_A$ is a forest, we say that $A$ is an acyclic matrix. The extremal numbers of acyclic matrices have been extensively studied. Füredi and Hajnal [13] conjectured that for any permutation matrix $P$, we have $\mathrm{ex}(n, P) = O(n)$. Klazar [16] showed that this would imply the well-known Stanley-Wilf conjecture on the number of permutations without forbidden patterns, and Marcus and Tardos [22] proved these conjectures. Another conjecture posed by Füredi and Hajnal asserted that for any acyclic zero-one matrix $A$, we have $\mathrm{ex}(n, A) = O(n \log n)$. This was disproved by Pettie [25], who showed that there are acyclic zero-one matrices $A$ such that $\mathrm{ex}(n, A) = \Omega(n \log n \log\log n)$. Recently, Pettie and Tardos [26] found, for each positive integer $t$, an acyclic zero-one matrix $A_t$ such that $\mathrm{ex}(n, A_t) = \Omega(n (\log n)^t)$. Despite these advances, it is still very much unknown how large the extremal number of an acyclic zero-one matrix can be: there are no known acyclic zero-one matrices $A$ with $\mathrm{ex}(n, A) = \Omega(n^{1+\varepsilon})$ for some $\varepsilon > 0$ but, strikingly, it is not even known whether there is an absolute constant $\varepsilon > 0$ such that $\mathrm{ex}(n, A) = O(n^{2-\varepsilon})$ for all acyclic zero-one matrices $A$.
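Whether a given matrix is acyclic in the above sense can be tested mechanically. The following is a minimal sketch, assuming the matrix is given as a list of rows; it builds $H_A$ implicitly and uses a union-find structure to detect a cycle. The function name `is_acyclic` is a hypothetical name used only for this illustration.

```python
def is_acyclic(A):
    """Return True if the bipartite graph H_A of the 0-1 matrix A is a forest."""
    a, b = len(A), len(A[0])
    # Vertices 0..a-1 stand for the rows, vertices a..a+b-1 for the columns.
    parent = list(range(a + b))

    def find(x):
        # Find the representative of x's component, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u in range(a):
        for y in range(b):
            if A[u][y] == 1:
                ru, ry = find(u), find(a + y)
                if ru == ry:
                    # Row u and column y are already connected,
                    # so this edge closes a cycle in H_A.
                    return False
                parent[ru] = ry
    return True
```

Every permutation matrix passes this test, while the $2 \times 2$ all-one matrix does not, since its graph $H_A$ is the cycle $C_4$.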
Given our rather incomplete understanding of the theory even for acyclic zero-one matrices, it is unsurprising that very little is known about the general case, where $A$ is not necessarily acyclic. To the best of our knowledge, the only known tight result for the extremal number of a zero-one matrix that is not acyclic concerns the $r \times t$ all-one matrix $A_{r,t}$ (naturally corresponding to the complete bipartite graph $K_{r,t}$): it is easy to show that $\mathrm{ex}(n, A_{r,t}) = O(n^{2-1/t})$, and this is tight when $r$ is much larger than $t$ by equation (1) and the known lower bounds for $\mathrm{ex}(n, K_{r,t})$.
An important general result about extremal numbers of matrices that are not acyclic was proved by Methuku and Tomon [23]. Say that a matrix $A$ is column-$t$-partite (respectively, row-$t$-partite) if it can be cut along its columns (respectively, rows) into $t$ submatrices such that every row (respectively, column) of each of these submatrices contains at most one 1-entry.

Theorem 1.2 (Methuku-Tomon [23]). Let $t \ge 2$ and let $A$ be a column-$t$-partite zero-one matrix. Then $\mathrm{ex}(n, A) \le n^{2-1/t+1/t^2+o(1)}$.
Since the $r \times t$ all-one matrix is column-$t$-partite, this is a fairly good estimate for large $t$. Methuku and Tomon conjectured that their result can be strengthened in two ways: firstly, by improving the exponent to $2 - 1/t + o(1)$, and secondly, by significantly relaxing the column-$t$-partite assumption.
This conjecture is motivated by the aforementioned result of Füredi, and Alon, Krivelevich and Sudakov (Theorem 1.1), and indeed (in view of equation (1)) it would be a direct generalization of that result, since zero-one matrices with at most $t$ 1-entries in each row correspond to bipartite graphs with maximum degree at most $t$ in one side of the bipartition. As a partial result towards Conjecture 1.3, Methuku and Tomon proved that $\mathrm{ex}(n, A) \le n^{2-1/t+o(1)}$ when $A$ is both column-$t$-partite and row-$t$-partite.
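The column-$t$-partite condition can be checked greedily: scan the columns from left to right and start a new block exactly when the next column would give some row a second 1-entry inside the current block. The sketch below follows this idea. The names `min_column_parts` and `is_column_t_partite` are hypothetical, and the second function assumes that a matrix which can be cut into fewer than $t$ valid blocks also counts as column-$t$-partite (that is, empty blocks are allowed), which is an interpretation made for this illustration.

```python
def min_column_parts(A):
    """Smallest number of consecutive column blocks such that, inside each
    block, every row of A has at most one 1-entry (greedy, left to right)."""
    a, b = len(A), len(A[0])
    parts = 1
    used = [False] * a  # used[u]: row u already has a 1 in the current block
    for y in range(b):
        if any(used[u] and A[u][y] == 1 for u in range(a)):
            # Column y would create a second 1 in some row: cut before it.
            parts += 1
            used = [False] * a
        for u in range(a):
            if A[u][y] == 1:
                used[u] = True
    return parts

def is_column_t_partite(A, t):
    """Assumption: cutting into fewer than t valid blocks is also accepted."""
    return min_column_parts(A) <= t
```

For the $r \times t$ all-one matrix every column forces a new cut, so `min_column_parts` returns $t$; the row-$t$-partite condition can be tested by applying the same function to the transpose.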
In this paper we completely resolve Conjecture 1.3 as follows.
Theorem 1.4. Let $t \ge 1$ and let $A$ be a zero-one matrix which contains at most $t$ 1-entries in each row. Then $\mathrm{ex}(n, A) \le n^{2-1/t+o(1)}$.
As we have discussed before, this is tight up to the $o(1)$ term. It is known [13] that the $o(1)$ term is necessary in the case $t = 1$; indeed, certain zero-one matrices $A$ with one 1-entry in each row correspond to Davenport-Schinzel sequences [7] satisfying $\mathrm{ex}(n, A) = \omega(n)$. This suggests that perhaps the $o(1)$ term is necessary in general. On the other hand, we can show that for column-$t$-partite matrices, this error term is not needed when $t \ge 2$.
Theorem 1.5. Let $t \ge 2$ and let $A$ be a column-$t$-partite zero-one matrix. Then $\mathrm{ex}(n, A) = O(n^{2-1/t})$.

Every matrix with at most one 1-entry in each row is column-1-partite, so by our discussion above, there are column-1-partite matrices with superlinear extremal numbers. Hence, Theorem 1.5 reveals an intriguing difference between the cases $t = 1$ and $t > 1$.

An interesting feature of our proof method is that, unlike the proof of Methuku and Tomon and many other important recent advances in the field [18,20], it does not use a density increment argument. Instead, our proof employs a novel use of 'blocks' of different sizes to construct an embedding of our forbidden matrix $A$ (see Section 1.3 for an overview of our proof). We believe that this approach may have further applications in the extremal theory of zero-one matrices and ordered graphs.

Ordered graphs
In 2006, Pach and Tardos [24] initiated the systematic study of the extremal numbers of ordered graphs. An ordered graph is a pair $(G, <)$ for which $G$ is a graph and $<$ is a total ordering of the vertex set of $G$; in what follows, we will sometimes abuse notation slightly and simply write $G$ for $(G, <)$. We say that $(G, <)$ contains $(H, <')$ as an ordered subgraph if there exists an order-preserving embedding of $H$ into $G$. Analogously to the unordered setting, the extremal number of an ordered graph $H$, denoted by $\mathrm{ex}_<(n, H)$, is the maximum possible number of edges in an ordered graph on $n$ vertices that does not contain $H$ as an ordered subgraph. The interval chromatic number of an ordered graph $H$, denoted by $\chi_<(H)$, is the minimum number of colours needed to colour the vertices of $H$ such that there are no edges within the colour classes, and each colour class is an interval with respect to the ordering on $V(H)$. Pach and Tardos [24] established an analogue of the Erdős-Stone-Simonovits theorem by proving that $\mathrm{ex}_<(n, H) = \left(1 - \frac{1}{\chi_<(H)-1} + o(1)\right)\binom{n}{2}$ holds for any ordered graph $H$. This means that, much like in the unordered case, the asymptotic value of $\mathrm{ex}_<(n, H)$ is determined whenever $\chi_<(H) \ge 3$.

There is a natural connection between ordered bipartite graphs and zero-one matrices: for an ordered bipartite graph $H$ with vertex classes $X, Y$ such that $\max(X) < \min(Y)$, we can define the matrix $A_H$ whose rows correspond to elements of $X$ (ordered according to the ordering of $V(H)$), whose columns correspond to elements of $Y$, and in which $A_H(x, y) = 1$ if and only if $xy \in E(H)$. It is easy to see then that an ordered bipartite graph $G$ contains another ordered bipartite graph $H$ as an ordered subgraph if and only if $A_G$ contains $A_H$. Pach and Tardos established the following connection (which is much closer than the one established in (1) between zero-one matrices and unordered graphs) between the extremal numbers of $H$ and $A_H$.
Theorem 1.6 (Pach-Tardos [24]). For any ordered bipartite graph $H$, we have […]

Using this result, our Theorem 1.4 immediately implies the following general bound on the extremal numbers of ordered bipartite graphs.
Theorem 1.7. Let $H$ be an ordered bipartite graph with maximum degree at most $t$ in one side of the bipartition. Then $\mathrm{ex}_<(n, H) \le n^{2-1/t+o(1)}$.
As a corollary, we obtain that if $H$ is an ordered even cycle (with interval chromatic number 2), then $\mathrm{ex}_<(n, H) \le n^{3/2+o(1)}$. For some known results about the extremal number of certain families of ordered even cycles, see [14].
Organization of the paper. In the next subsection, we give a detailed overview of the proof of Theorem 1.4. The proof of Theorem 1.5 follows a similar strategy. In Section 2, we prove both Theorem 1.4 and Theorem 1.5.

Proof outline
We now discuss some of the ideas used in our proof. For comparison, let us first recall the proof strategy of Methuku and Tomon [23] for their (weaker) bound of $n^{2-1/t+1/t^2+o(1)}$ in the special case of column-$t$-partite matrices. If $A$ is a fixed column-$t$-partite matrix, and $M$ is an $m \times n$ matrix which does not contain $A$, they divide $M$ into $k$ (horizontal) 'blocks', i.e., submatrices of size $(m/k) \times n$, for some $k$. They show that if we cannot find a copy of $A$ where each row is coming from a different block, then one of the blocks must have large 'density' in some sense; more precisely, the number of copies of $K_{t,t}$ must be large in one of the blocks. They pass to that block (i.e., delete the rows corresponding to the other blocks) and repeat this process, obtaining a density increment at each step, eventually leading to a contradiction.
One can try to follow this strategy for general matrices $A$ which have at most $t$ 1-entries in each row. However, without the column-$t$-partite condition, the density increment is too weak to obtain useful bounds. Hence, in this paper we will use a different argument. We will also repeatedly divide our large matrix $M$ into $k$ blocks. However, a key difference is that our argument is more 'global' in the sense that we will not ignore the 'deleted' rows in the blocks that we do not pass to; in fact, they will be crucial to building our copy of $A$.
Another key difference comes from the way we choose the block we pass to: instead of passing to the block which is the 'densest', we will select the block in a certain randomised way which works well together with classical dependent random choice-type arguments. Upon completion of this procedure we obtain a sequence of blocks, from which we build $A$.
To give more details about our proof, let us focus on the case $t = 2$ (i.e., each row of $A$ has at most two 1-entries), and assume that we are trying to prove a bound of $O(n^{1.51})$ for a matrix $A$ which has 20 rows and 20 columns. Let $n$ be large, and let $M$ be an $n \times n$ zero-one matrix of weight $n^{1.51}$. We perform the following procedure. First, we pick a row $r$ of $M$ uniformly at random, and look at the set of columns $C$ which have a 1-entry in this row. (The row $r$ is fixed for the remainder of this procedure.) Let $s$ be a large but bounded number; for instance, $s = 10^5$ works for the parameters above. This is the number of steps we will have in the process. Starting with our matrix $M_0 = M$, we obtain $M_1, M_2, \dots, M_s$ as follows. Having defined $M_i$, we divide it into $k = n^{1/s}$ (horizontal) blocks of equal size, and $M_{i+1}$ is defined to be the block which contains the row $r$. Note that, after the last step, our matrix $M_s$ is the single row $r$.
For each pair of columns $c_1, c_2$ in $C$, we consider how the size of their common neighbourhood (i.e., the number of rows having a 1-entry in both $c_1$ and $c_2$) changes during this process, i.e., we consider the numbers $N_j(c_1c_2) = |\{i : M_j(i, c_1) = M_j(i, c_2) = 1\}|$ for $j = 0, 1, \dots, s$. Note that, for almost all pairs $\{c_1, c_2\} \in \binom{C}{2}$, $N_0(c_1c_2)$ is expected to be large (at least $n^{0.01}$), by our definition of $C$ using dependent random choice. However, $N_s(c_1c_2)$ is 1 for all $\{c_1, c_2\} \in \binom{C}{2}$. We consider two types of steps for the columns $c_1, c_2$: 'shrinking' steps, i.e., steps $j$ which have $N_j(c_1c_2) < N_{j-1}(c_1c_2)$, and 'non-shrinking' steps, i.e., steps $j$ which have $N_j(c_1c_2) = N_{j-1}(c_1c_2)$. Note that we have $N_j(c_1c_2) < N_{j-1}(c_1c_2)$ if and only if more than one of the $k$ blocks of $M_{j-1}$ contain a row which has a 1-entry in both of the columns $c_1$ and $c_2$. We consider two subcases for shrinking steps: a shrinking step for $c_1c_2$ is 'above-shrinking' if there is a row in $M_{j-1}$ which has a 1-entry in both of the columns $c_1$ and $c_2$, and this row is 'above' the block containing $r$ (i.e., 'above' the submatrix $M_j$), and 'below-shrinking' otherwise (in which case there is a row in $M_{j-1}$ which has a 1-entry in both of the columns $c_1$ and $c_2$, and this row is 'below' the block containing $r$).
Figure 1: Step $j$ is 'below-shrinking' for $c_1c_2$ because we have a row $i$ 'below' $M_j$ that contains a 1-entry in both of the columns $c_1$ and $c_2$.
To see why shrinking steps are useful, let us assume that we can find a set $C' = \{y_1, \dots, y_{20}\}$ of 20 columns ($y_1 < \dots < y_{20}$), together with 20 steps $j_1, \dots, j_{20} \in [s]$ (where $j_1 < \dots < j_{20}$), such that each of the steps $j_i$ is below-shrinking for all pairs of columns from $C'$. For each $\ell \in [20]$, let $B_\ell$ be the submatrix of $M_{j_\ell - 1}$ obtained by taking all rows below $M_{j_\ell}$. Note that the submatrices $B_\ell$ of $M$ are pairwise disjoint, and $B_\ell$ is located completely below $B_{\ell'}$ if $\ell < \ell'$. Moreover, for each $\ell \in [20]$, and each pair $c_1c_2$ from $C'$, since the step $j_\ell$ is below-shrinking for $c_1c_2$, we know that there is a row of $B_\ell$ which has a 1-entry in both $c_1$ and $c_2$. These conditions easily imply that we can find a copy of $A$ in $M$ by using the columns $C' = \{y_1, \dots, y_{20}\}$, and embedding the $\ell$-th row of $A$ into $B_{21-\ell}$ appropriately (more precisely, if $A(\ell, a) = A(\ell, b) = 1$, then the $\ell$-th row of the embedded copy of $A$ is an arbitrary row of $B_{21-\ell}$ which contains a 1-entry in columns $y_a$ and $y_b$). (See Figure 2 for an example of how shrinking steps can be used to construct an embedding of $A$ in $M$.) Thus, it is enough to find a set $C' \subseteq C$ of 20 columns, and a set of 20 steps from $[s]$, such that each step is below-shrinking for each column-pair from $C'$. Similarly, it is enough to find a set $C' \subseteq C$ of 20 columns, and a set of 20 steps from $[s]$, such that each step is above-shrinking for each column-pair from $C'$.

As noted above, if we look at a pair of columns, we 'almost always' expect to have $N_0(c_1c_2) > n^{0.01}$, and we also know that $N_s(c_1c_2) = 1$. Since we always divide into $k$ blocks and pass to one of the blocks in each shrinking step, $N_j(c_1c_2)$ is expected to shrink by a factor of at most $k = n^{1/s}$ in each step. This can be made more precise using that, crucially, we always pass to the block containing $r$, and $r$ is a uniformly random row in the common neighbourhood of $c_1$ and $c_2$. Thus, typically, we expect to have more than $0.01s$ shrinking steps; in particular, for almost all pairs of columns, we expect more than 40 shrinking steps. Hence, for almost all pairs of columns, we can find 20 shrinking steps of the same subtype (above-shrinking or below-shrinking). Because this holds for almost all pairs of columns, we can find a large subset $C_1 \subseteq C$ of columns such that each pair from $C_1$ has either 20 above-shrinking steps or 20 below-shrinking steps (by Turán's theorem). But then, since the number of 'shrinking patterns' is bounded (as there are only $s = 10^5$ steps, and we need to select 20 of them to be above-shrinking or to be below-shrinking), if $|C_1|$ is large enough, then by the multicolour Ramsey's theorem, we can find a large subset $C_2$ (of size more than 20) of columns in $C_1$ such that every pair of columns in $C_2$ has the same shrinking pattern (i.e., there exist 20 steps that are either all above-shrinking or all below-shrinking for every pair of columns in $C_2$), which finishes the proof by the observations above.

Figure 2: […] for the column pair $y_1y_2$ to embed the third row of $A$, step $j_2$ for the column pair $y_2y_3$ to embed the second row of $A$, and step $j_3$ for the column pair $y_1y_3$ to embed the first row of $A$.
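The block procedure of this outline can be simulated directly on a small matrix. The sketch below makes the simplifying assumption that $n = k^s$ exactly, so that every block splits into $k$ equal parts, and it labels rows by $0, \dots, n-1$ with smaller indices counted as 'above'. The function names `block_intervals` and `classify_steps` are hypothetical and serve only this illustration.

```python
def block_intervals(n, k, s, r):
    """Row intervals [lo, hi) of M_0, M_1, ..., M_s, assuming n == k**s.
    M_{j+1} is the one of the k equal blocks of M_j that contains row r."""
    assert n == k ** s and 0 <= r < n
    intervals = [(0, n)]
    lo, size = 0, n
    for _ in range(s):
        size //= k
        lo += ((r - lo) // size) * size  # start of the block containing r
        intervals.append((lo, lo + size))
    return intervals

def classify_steps(M, r, k, s, cols):
    """For columns that all have a 1-entry in row r, label each step
    j = 1, ..., s as 'above-shrinking', 'below-shrinking' or 'non-shrinking'."""
    n = len(M)
    # Common neighbourhood: rows with a 1-entry in every chosen column.
    common = [i for i in range(n) if all(M[i][c] == 1 for c in cols)]
    assert r in common
    iv = block_intervals(n, k, s, r)
    labels = []
    for j in range(1, s + 1):
        (prev_lo, prev_hi), (lo, hi) = iv[j - 1], iv[j]
        above = any(prev_lo <= i < lo for i in common)
        below = any(hi <= i < prev_hi for i in common)
        if above:
            labels.append('above-shrinking')
        elif below:
            labels.append('below-shrinking')
        else:
            labels.append('non-shrinking')
    return labels
```

As in the outline, a step is reported as above-shrinking whenever a discarded common row lies above the surviving block, and as below-shrinking only otherwise. Since the common neighbourhood can lose at most a factor of $k$ per step, a pair with $N_0(c_1c_2) > n^{0.01}$ must have more than $0.01s$ steps labelled as shrinking.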

Proofs of our results
We now turn to the formal proofs of our results. We will use the following lemma to show that typically we have many shrinking steps.

Lemma 2.1. Let $k$ and $m$ be non-negative integers with $k \ge 1$, and let $T$ be a rooted tree such that each vertex has at most $k$ children. Let us say that a vertex of $T$ is branching if it has at least 2 children. Then the number of (non-root) leaves of $T$ with at most $m$ branching ancestors is at most $k^m$.

Proof. The result is established easily by induction on $m$, as follows. The case $m = 0$ is clear, as if a leaf has zero branching ancestors then $T$ must be a path from the leaf to the root. Now assume that $m \ge 1$, and that the result holds for smaller values of $m$. For a given $m \ge 1$, we proceed by induction on $|T|$. The case $|T| = 1$ is clear. For $|T| > 1$, if the root $r$ is not branching, then let $r'$ be its unique child, and we are done by applying induction to $T - r$ (rooted at $r'$). Otherwise let $r_1, \dots, r_\ell$ be the children of the root $r$, with $\ell \le k$. Then $T - r$ splits up into $\ell$ trees $T_1, \dots, T_\ell$, rooted at $r_1, \dots, r_\ell$, respectively. Moreover, a vertex of $T_i$ has at most $m$ branching ancestors in $T$ if and only if it has at most $m - 1$ branching ancestors in $T_i$. By induction, the number of such non-root leaves in $T_i$ is at most $k^{m-1}$. It follows that the number of non-root leaves in $T$ with at most $m$ branching ancestors is at most $\ell k^{m-1} \le k^m$.

We are now ready to prove Theorem 1.4 in the following equivalent form.

Theorem 2.2. Let $t \ge 1$ be a positive integer. Let $\varepsilon > 0$, and let $A$ be a zero-one matrix such that each row of $A$ contains at most $t$ 1-entries. Then $\mathrm{ex}(n, A) = O(n^{2-1/t+\varepsilon})$.

Proof. Let $A$ have $a$ rows and $b$ columns. We may assume that $\varepsilon < 1/10$, $b \ge t$, and each row of $A$ contains exactly $t$ 1-entries. Let $n$ be sufficiently large, let $M$ be an $n \times n$ zero-one matrix, and assume that $M$ has weight at least $n^{2-1/t+\varepsilon}$. Let $s = \lceil 4a/\varepsilon \rceil$, and let $k = \lceil n^{1/s} \rceil$. (Here, as described in the proof outline, $s$ denotes the number of steps we will perform, and $k$ is the number of (horizontal) blocks we divide our matrix $M$ into in each step.) Note that $s$ depends on $\varepsilon$ and $A$ but not on $n$. Our goal is to show that $M$ must contain $A$. For convenience, we label the rows of $M$ by $[0, n-1]$ instead of $[n]$ (but we label the columns of $M$ by $[n]$). For each $i \in [0, n-1]$, let $\sigma(i) = (\sigma_1(i), \dots, \sigma_s(i))$ denote the $k$-ary representation of $i$, padded with leading zeros to length exactly $s$ (this is possible since $k^s \ge n$). Note that the $k$-ary representation $\sigma(i)$ of row $i$ encodes for each step in the proof outline the block that the row is contained in.

We pick a row label $r \in [0, n-1]$ uniformly at random. Let $C$ be the set of column labels $c$ such that $M(r, c) = 1$. For each $j \in [0, s]$, let $R_j = \{i \in [0, n-1] : \sigma_\ell(i) = \sigma_\ell(r) \text{ for all } \ell \in [j]\}$. Note that $R_j$ is the set of rows remaining after $j$ steps, i.e., the rows of the matrix $M_j$ in the proof outline, and we have $[0, n-1] = R_0 \supseteq R_1 \supseteq \dots \supseteq R_s = \{r\}$. Furthermore, for each (unordered) $t$-set $e = \{c_1, \dots, c_t\}$ of distinct column labels from $C$, and for each $j \in [s]$, we define $\mathrm{type}_j(e)$ to be 'shrinking' if there is some row index $i \in R_{j-1} \setminus R_j$ such that $M(i, c_1) = \dots = M(i, c_t) = 1$. Otherwise, we define $\mathrm{type}_j(e)$ to be 'non-shrinking'. (Note that these definitions agree with the ones mentioned in the proof outline.) Let a $t$-set $e$ of columns from $C$ be good if there are at least $2a$ values of $j \in [s]$ such that $\mathrm{type}_j(e)$ is 'shrinking', and let $e$ be bad otherwise. Let $H$ denote the set of bad $t$-sets from $C$. Clearly, $\mathbb{E}[|C|] \ge n^{1-1/t+\varepsilon}$ and hence $\mathbb{E}\left[\binom{|C|}{t}\right] = \Omega(n^{t-1+t\varepsilon})$.

Claim 2.3. We have $\mathbb{E}[|H|] = o\left(\mathbb{E}\left[\binom{|C|}{t}\right]\right)$.

Proof. Let a $t$-set of columns $e = \{c_1, \dots, c_t\}$ from $\binom{[n]}{t}$ be light if $|\{i \in [0, n-1] : M(i, c_1) = \dots = M(i, c_t) = 1\}| \le$ […], and heavy otherwise. Note that the expected number of light column $t$-sets in $C$ is at most […] $= o\left(\mathbb{E}\left[\binom{|C|}{t}\right]\right)$. Thus, it suffices to show that the expected number of column $t$-sets in $C$ which are both heavy and bad is $o\left(\mathbb{E}\left[\binom{|C|}{t}\right]\right)$. To show this, it is enough to prove that for any heavy $t$-set of columns $e$ from $\binom{[n]}{t}$, the conditional probability $\mathbb{P}(e \text{ is bad} \mid e \subseteq C)$ is $o(1)$. Let us fix a heavy $t$-set $e = \{c_1, \dots, c_t\}$, and let us write $I = \{i \in [0, n-1] : M(i, c_1) = \dots = M(i, c_t) = 1\}$. Note that, conditioned on $e \subseteq C$, $r$ is a uniformly random element of $I$. Let $V$ be the set of sequences $x = (x_1, \dots, x_p)$ of length at most $s$ such that there is some $i \in I$ with $(\sigma_1(i), \dots, \sigma_p(i)) = x$. We define a rooted tree $T$ on $V$ by letting $x = (x_1, \dots, x_p)$ be the parent of $y = (y_1, \dots, y_q)$ precisely when $q = p + 1$ and $(y_1, \dots, y_p) = (x_1, \dots, x_p)$ (i.e., $y$ is obtained from $x$ by extending the sequence by one step). In other words, the rooted tree corresponds to a poset on the initial segments of the $k$-ary representations of the row indices in $I$, where the elements of the poset are ordered by inclusion. Note that the root of the tree is the empty sequence, the leaves are $\sigma(i)$ for $i \in I$, and each vertex has at most $k$ children. Recall that a vertex of $T$ is branching if it has at least 2 children. Notice that for all rows $i \in I$ and for all $j \in [s]$, if $r = i$, then $\mathrm{type}_j(e)$ is 'shrinking' if and only if $(\sigma_1(i), \dots, \sigma_{j-1}(i))$ is branching. Thus, if $r = i$, then $e$ is bad if and only if the corresponding leaf $\sigma(i)$ has at most $2a - 1$ branching ancestors. Therefore, by Lemma 2.1, there are at most $k^{2a-1}$ choices of $r$ which make $e$ bad. Since $e$ is heavy, a uniformly random element of $I$ makes $e$ bad with probability at most $k^{2a-1}/|I| = o(1)$, finishing the proof of the claim.

Let $N$ be a sufficiently large integer (namely, the multicolour hypergraph Ramsey number $N = r_t\left(b; 2\binom{s}{a}\right)$) such that every $2\binom{s}{a}$-colouring of the complete $t$-uniform hypergraph $K_N^{(t)}$ on $N$ vertices contains a monochromatic $K_b^{(t)}$. Furthermore, let $L$ be a sufficiently large positive real number so that every $t$-uniform hypergraph with sufficiently many vertices and edge density more than $1 - 1/L$ contains a complete $t$-uniform subgraph $K_N^{(t)}$ on $N$ vertices (for example, we can choose $L = \binom{N}{t}$). Note that $N$ and $L$ do not depend on $n$. Using the claim above, we see that $\mathbb{E}\left[\binom{|C|}{t} - L|H|\right] = \Omega(n^{t-1+t\varepsilon})$. Therefore, we can fix a choice of $r$ such that we have $|C| = \Omega(n^{1-1/t+\varepsilon})$ and $|H| < \frac{1}{L}\binom{|C|}{t}$. Then (if $n$ is sufficiently large), by the definition of $L$, we can find a subset $C_1 \subseteq C$ such that $|C_1| = N$ and each $t$-set from $C_1$ is good (i.e., $C_1$ contains no $t$-set from $H$).

Whenever $e = \{c_1, \dots, c_t\}$ is a $t$-set from $C_1$ and $j \in [s]$ is such that $\mathrm{type}_j(e)$ is 'shrinking', we let $\mathrm{subtype}_j(e) = \uparrow$ if there is some row index $i \in R_{j-1} \setminus R_j$ such that $i < \min R_j$ and $M(i, c_1) = \dots = M(i, c_t) = 1$. Otherwise, we let $\mathrm{subtype}_j(e) = \downarrow$. Note that in this case there is some row index $i \in R_{j-1} \setminus R_j$ such that $i > \max R_j$ and $M(i, c_1) = \dots = M(i, c_t) = 1$. For each $t$-set $e$ from $C_1$, we know that there are at least $2a$ choices of $j$ with type 'shrinking', thus, we can find at least $a$ choices of $j$ with the same subtype ($\uparrow$ or $\downarrow$). For each $e$, choose some $z_e \in \{\uparrow, \downarrow\}$ and a subset $J_e \subseteq [s]$ of size $a$ such that $\mathrm{subtype}_j(e) = z_e$ for all $j \in J_e$ (and $\mathrm{type}_j(e)$ is 'shrinking' for all $j \in J_e$). Then $e \mapsto (z_e, J_e)$ gives a $2\binom{s}{a}$-colouring of the $t$-sets from $C_1$. Hence, by the definition of $N$, we can find $C_2 \subseteq C_1$ of size $b$ such that each $t$-set from $C_2$ has the same colour, i.e., there is some $z \in \{\uparrow, \downarrow\}$ and a subset $J \subseteq [s]$ of size $a$ such that for all $t$-sets $e$ from $C_2$ we have $z_e = z$ and $J_e = J$. In particular, $\mathrm{subtype}_j(e) = z$ for all $e \in \binom{C_2}{t}$ and all $j \in J$.

We now show that this allows us to find a copy of $A$ on column set $C_2$. In what follows we will assume that $z = \downarrow$, as one can deal with the case $z = \uparrow$ similarly. Let us label the rows of $A$ by $[a]$ and the columns by $[b]$. Furthermore, let $C_2 = \{c_1, \dots, c_b\}$ with $c_1 < \dots < c_b$, and let $J = \{j_1, \dots, j_a\}$ with $j_1 < \dots < j_a$. For each $u \in [a]$, the columns $c_y$ with $A(u, y) = 1$ form a $t$-set from $C_2$, and step $j_{a+1-u}$ has subtype $\downarrow$ for this $t$-set, so we can choose a row index $i_u \in R_{j_{a+1-u}-1} \setminus R_{j_{a+1-u}}$ such that $i_u > \max R_{j_{a+1-u}}$ and $M(i_u, c_y) = 1$ for all $y$ with $A(u, y) = 1$. Note that these conditions imply that $i_1 < \dots < i_a$, and that $M(i_u, c_y) = 1$ whenever $A(u, y) = 1$. Thus, the rows $i_1, \dots, i_a$ and columns $c_1, \dots, c_b$ give a copy of $A$, finishing the proof.

We now turn to the proof of Theorem 1.5. The proof is quite similar to the proof of Theorem 2.2, so we will just highlight the differences.

Sketch of the proof of Theorem 1.5. Let $M$ be an $n \times n$ zero-one matrix with weight $\omega(n^{2-1/t})$. It suffices to prove that if $n$ is sufficiently large, then $M$ contains $A$. Let $a$ be the number of rows in $A$. Consider an arbitrary partition of $A$ using vertical cuts into $t$ submatrices such that each of these submatrices has at most one 1-entry in each row. We may assume without loss of generality that they all have precisely one 1-entry in each row. Let $b_1, \dots, b_t$ be the number of columns of these submatrices in the natural order. One of the main differences compared to the proof of Theorem 2.2 is that we choose $k = 2$ and $s = \lceil \log_2 n \rceil$. We define the random set $C$ of columns and the set $H$ of bad $t$-sets in $C$ as in the proof of Theorem 2.2. Now $\mathbb{E}[|C|] = \omega(n^{1-1/t})$ and hence $\mathbb{E}\left[\binom{|C|}{t}\right] = \omega(n^{t-1})$. We claim that, similarly to Claim 2.3, we have $\mathbb{E}[|H|] \le \frac{1}{4} \cdot \mathbb{E}\left[\binom{|C|}{t}\right]$ (for every sufficiently large $n$). Let a $t$-set of columns $e = \{c_1, \dots, c_t\}$ from $\binom{[n]}{t}$ be light if $|\{i \in [0, n-1] : M(i, c_1) = \dots = M(i, c_t) = 1\}| \le 5 \cdot 2^{2a-1}$, and heavy otherwise. Note that the expected number of light column $t$-sets in $C$ is at most $n^t \cdot (5 \cdot 2^{2a-1}/n) = o\left(\mathbb{E}\left[\binom{|C|}{t}\right]\right)$. Thus, it suffices to show that the expected number of column $t$-sets which are both heavy and bad is at most $\frac{1}{5} \cdot \mathbb{E}\left[\binom{|C|}{t}\right]$. To show this, it is enough to prove that for any heavy $t$-set of columns $e$ from $\binom{[n]}{t}$, the conditional probability $\mathbb{P}(e \text{ is bad} \mid e \subseteq C)$ is at most $1/5$. But an identical argument to the one in the proof of Claim 2.3 shows that this conditional probability is at most $\frac{k^{2a-1}}{5 \cdot 2^{2a-1}} = 1/5$.

The rest of the argument is fairly different from the proof of Theorem 2.2. Since $\mathbb{E}[|H|] \le \frac{1}{4} \cdot \mathbb{E}\left[\binom{|C|}{t}\right]$, we have $\mathbb{E}\left[\binom{|C|}{t} - 2|H|\right] \ge \frac{1}{2} \cdot \mathbb{E}\left[\binom{|C|}{t}\right] = \omega(n^{t-1})$. Hence, there exists an outcome such that $|C| = \omega(n^{1-1/t})$ and $|H| \le \frac{1}{2} \cdot \binom{|C|}{t}$. For each $e = \{c_1, \dots, c_t\} \in \binom{C}{t} \setminus H$, let us define $z_e \in \{\uparrow, \downarrow\}$ and $J_e \subseteq [s]$ as in the proof of Theorem 2.2. Then $e \mapsto (z_e, J_e)$ is a $2\binom{s}{a}$-colouring of $\binom{C}{t} \setminus H$. It follows that there exist some $z \in \{\uparrow, \downarrow\}$ and $J \subseteq [s]$ of size $a$ such that for at least $\frac{1}{4\binom{s}{a}} \cdot \binom{|C|}{t}$ sets $e \in \binom{C}{t} \setminus H$, we have $z_e = z$ and $J_e = J$. Now if $n$ is sufficiently large, then by an ordered analogue of the Erdős box theorem (see, e.g., Lemma 4 in [23]), there exist sets $C_1, \dots, C_t \subseteq C$ of columns such that $|C_i| = b_i$ for every $i \in [t]$, every column in $C_i$ comes before every column in $C_{i+1}$ (for all $i \in [t-1]$), and for each $e = \{c_1, \dots, c_t\}$ which satisfies $c_i \in C_i$ for all $i \in [t]$, we have $z_e = z$ and $J_e = J$. Then, since $A$ is column-$t$-partite, we can embed $A$ into $M$ using the set $C_1 \cup \dots \cup C_t$ of columns, similarly as in the proof of Theorem 2.2.
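As a sanity check on the statement of Lemma 2.1, the following sketch counts the non-root leaves of a rooted tree that have at most $m$ branching ancestors, so that the count can be compared with $k^m$ on small examples. The representation of the tree as a dictionary from each vertex to its list of children, and the function name `leaves_with_few_branching_ancestors`, are assumptions made for this illustration.

```python
def leaves_with_few_branching_ancestors(children, root, m):
    """Count non-root leaves with at most m branching ancestors.

    `children` maps a vertex to the list of its children (leaves may be
    omitted). A vertex is branching if it has at least 2 children.
    """
    count = 0
    # Each stack entry is (vertex, number of branching proper ancestors).
    stack = [(root, 0)]
    while stack:
        v, branching = stack.pop()
        kids = children.get(v, [])
        if not kids:
            if v != root and branching <= m:
                count += 1
            continue
        step = 1 if len(kids) >= 2 else 0
        for c in kids:
            stack.append((c, branching + step))
    return count
```

In the complete binary tree of depth 3 (so $k = 2$) every leaf has exactly 3 branching ancestors, hence the count is 0 for $m = 2$ and attains the bound $2^3 = 8$ for $m = 3$.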