Representing the special linear group with block unitriangular matrices

Abstract We prove that every element of the special linear group can be represented as the product of at most six block unitriangular matrices, and that there exist matrices for which six factors are necessary, independent of indexing. We present an analogous result for the general linear group. These results serve as general statements regarding the representational power of alternating linear updates. The factorizations and lower bounds of this work immediately imply tight estimates on the expressive power of linear affine coupling blocks in machine learning.


Introduction
Let F be an arbitrary field and M_{m,n}(F) denote the set of m × n matrices with coefficients in F. When m = n, we simply write M_n(F). Let GL_n(F) denote the group of n × n nonsingular matrices with coefficients in F, and SL_n(F) denote the subgroup of GL_n(F) consisting of matrices with determinant one. Let I denote the identity matrix and 0 denote the zero matrix; the dimension of each is always clear from context. Let S_n denote the symmetric group and P_π denote the permutation π ∈ S_n of the columns of I. For X ⊂ M_n(F), let X^k ⊂ M_n(F) denote the set of k-fold products of elements of X. Let BL_{m,n}(F) and BU_{m,n}(F) denote the subgroups of SL_{m+n}(F) consisting of block lower and upper unitriangular matrices, respectively, with block partition {1, . . ., m} and {m + 1, . . ., m + n}:

BL_{m,n}(F) = { [[I, 0], [A, I]] : A ∈ M_{n,m}(F) },  BU_{m,n}(F) = { [[I, B], [0, I]] : B ∈ M_{m,n}(F) }.

In this work, we prove the existence of the following block unitriangular factorization of the special linear group.
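In numerical terms, the two subgroups are easy to realize concretely. A minimal sketch over R (the paper works over an arbitrary field; numpy is used here purely for illustration):

```python
import numpy as np

def block_lower(A):
    """Return the block lower unitriangular matrix [[I, 0], [A, I]]."""
    n, m = A.shape
    return np.block([[np.eye(m), np.zeros((m, n))], [A, np.eye(n)]])

def block_upper(B):
    """Return the block upper unitriangular matrix [[I, B], [0, I]]."""
    m, n = B.shape
    return np.block([[np.eye(m), B], [np.zeros((n, m)), np.eye(n)]])

rng = np.random.default_rng(0)
m = n = 3
L = block_lower(rng.standard_normal((n, m)))
U = block_upper(rng.standard_normal((m, n)))

# Unitriangular matrices have determinant one, so every k-fold product of
# elements of BL and BU lies in SL_{m+n}(R).
det_LU = np.linalg.det(L @ U)
```

Note that both sets are closed under inversion, since [[I, 0], [A, I]]^{-1} = [[I, 0], [-A, I]], so they are indeed subgroups.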
THEOREM 1. For every M ∈ SL_{2n}(F), there exist A_1, . . ., A_6 ∈ M_n(F) such that

M = [ I    0 ] [ I  A_2 ] [ I    0 ] [ I  A_4 ] [ I    0 ] [ I  A_6 ]
    [ A_1  I ] [ 0   I  ] [ A_3  I ] [ 0   I  ] [ A_5  I ] [ 0   I  ].

For every m + n > 3, there exists some M ∈ SL_{m+n}(F) such that M ∉ [BL_{m,n}(F) ∪ BU_{m,n}(F)]^5. Furthermore, if F has at least four elements, then the lower bound holds independent of indexing: there exists M ∈ SL_{m+n}(F) such that P_π M P_π^{-1} ∉ [BL_{m,n}(F) ∪ BU_{m,n}(F)]^5 for all permutations π ∈ S_{m+n}.
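For intuition in the smallest case n = 1, where the blocks are scalars (and the lower bound above does not apply, since m + n = 2), the classical elementary matrix factorization already expresses any element of SL_2(R) as an alternating product of at most four unitriangular factors. A sketch of this warm-up (not the construction of Theorem 1):

```python
import numpy as np

def factor_sl2(M, tol=1e-12):
    """Factor M in SL_2(R) into alternating 2x2 unitriangular matrices.

    Classical warm-up for the scalar-block case: if b != 0,
      [[a,b],[c,d]] = [[1,0],[(d-1)/b,1]] @ [[1,b],[0,1]] @ [[1,0],[(a-1)/b,1]];
    if b == 0, one extra upper unitriangular factor first makes the
    (1,2) entry nonzero, for four factors in total.
    """
    a, b = M[0]
    if abs(b) < tol:
        # M = (M @ [[1,1],[0,1]]) @ [[1,-1],[0,1]]; the shifted matrix has
        # (1,2) entry a != 0, since ad = det(M) = 1 forces a != 0.
        shifted = M @ np.array([[1.0, 1.0], [0.0, 1.0]])
        return factor_sl2(shifted) + [np.array([[1.0, -1.0], [0.0, 1.0]])]
    d = M[1, 1]
    L1 = np.array([[1.0, 0.0], [(d - 1) / b, 1.0]])
    U = np.array([[1.0, b], [0.0, 1.0]])
    L2 = np.array([[1.0, 0.0], [(a - 1) / b, 1.0]])
    return [L1, U, L2]
```

The verification that the three-factor product recovers M uses only ad - bc = 1, mirroring how the determinant-one condition drives the block constructions below.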
We have an analogous theorem for the general linear group. Let T_{m,n}(F) denote the set of matrices of the form

[ B  0 ]      [ B  A ]
[ A  C ]  or  [ 0  C ],

with B ∈ GL_m(F) and C ∈ GL_n(F) both diagonal. We prove the following result.
THEOREM 2. For every M ∈ GL_{2n}(F) and diagonal D ∈ GL_n(F) with det(D) = det(M), there exist A_1, . . ., A_6 ∈ M_n(F) such that

M = [ D  0 ] [ I    0 ] [ I  A_2 ] [ I    0 ] [ I  A_4 ] [ I    0 ] [ I  A_6 ]
    [ 0  I ] [ A_1  I ] [ 0   I  ] [ A_3  I ] [ 0   I  ] [ A_5  I ] [ 0   I  ].

Furthermore, for every m + n > 3, there exists some M ∈ SL_{m+n}(F) such that M ∉ [T_{m,n}(F)]^5.

The factorizations of Theorems 1 and 2 are efficiently computable. Our construction is rather unusual among block matrix factorizations.^a It can be viewed as a generalized version of a block LU factorization. One benefit of our construction is that such a factorization always exists, whereas the existence of a block LU factorization relies on the invertibility of the upper left block. Moreover, the above theorems serve as general results regarding the representational power of alternating linear updates.

^a One somewhat similar construction is the LUL' factorization proposed by Serre and Püschel (1), where L, L' are block lower unitriangular and U is block upper triangular; the authors studied how close (in a rank sense) such a factorization can be to block diagonal.
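The contrast with block LU can be made concrete. An illustrative sketch (not from the paper): the matrix J below is invertible with singular upper left block, so it admits no block LU factorization [[I, 0], [X, I]] [[B_1, B_2], [0, B_4]] (invertibility of J would force det(B_1) ≠ 0, yet B_1 must equal the zero upper left block), but it factors readily into block unitriangular matrices, mirroring the scalar identity [[0,1],[-1,0]] = [[1,0],[-1,1]] [[1,1],[0,1]] [[1,0],[-1,1]]:

```python
import numpy as np

I2 = np.eye(2)
Z2 = np.zeros((2, 2))
J = np.block([[Z2, I2], [-I2, Z2]])   # in SL_4(R); upper left block singular

L = np.block([[I2, Z2], [-I2, I2]])   # block lower unitriangular
U = np.block([[I2, I2], [Z2, I2]])    # block upper unitriangular

# The block analogue of the scalar identity: three unitriangular factors
# suffice here, well under the six guaranteed in general.
product = L @ U @ L
```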

Brief Report
In fact, Theorem 2 immediately solves an open problem regarding affine coupling networks in machine learning. An affine coupling block is a function f : R^{m+n} → R^{m+n} of the form

f(x, y) = (x, y ⊙ exp(s(x)) + t(x)),

where s, t : R^m → R^n and ⊙ is the entry-wise product. A version of these functions (with s = 0) was originally introduced by Dinh et al. (2) in their NICE deep learning model. Dinh et al. (3) expanded that work to real non-volume-preserving transformations (e.g., the general formulation above). These papers led, in part, to the popularization of normalizing flows in machine learning, a general class of diffeomorphisms that map some standard distribution (say, a standard Gaussian vector) to a more complex one. We refer the reader to Refs. (4, 5) for two excellent surveys of this emerging field. As discussed in Ref. (5), the "question of foremost importance" is the expressive power of such models, even when restricted to simple inputs. Recently, Koehler et al. studied the expressive power of linear affine couplings, i.e., matrices of the form

[ B  0 ]
[ A  C ],

with B and C diagonal with strictly positive entries (6). They posed the following question: how many linear affine coupling layers are needed to represent an arbitrary orientation-preserving matrix? They produced a 47-layer construction and showed that at least 5 layers are necessary. Theorem 2 answers their question exactly.
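A minimal sketch of such a coupling block and its closed-form inverse (the choices of s and t below are purely illustrative; any maps R^m → R^n yield an invertible f):

```python
import numpy as np

def coupling_forward(x, y, s, t):
    """f(x, y) = (x, y * exp(s(x)) + t(x)): x passes through, y is updated."""
    return x, y * np.exp(s(x)) + t(x)

def coupling_inverse(x, z, s, t):
    """Invert in closed form; invertibility never constrains s or t."""
    return x, (z - t(x)) * np.exp(-s(x))

# Hypothetical choices of s and t for illustration (here m = n = 2).
s = lambda x: np.tanh(x)
t = lambda x: x ** 2

x = np.array([0.5, -1.0])
y = np.array([2.0, 3.0])
x2, z = coupling_forward(x, y, s, t)
x3, y_rec = coupling_inverse(x2, z, s, t)
```

Setting s = 0 recovers the additive (NICE) update y + t(x); with s and t linear in x, the forward map is exactly a linear affine coupling matrix of the displayed form.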
COROLLARY 3. Every matrix M ∈ GL_{2n}(R) with det(M) > 0 can be represented by a depth-six linear affine coupling network. In addition, for every n > 1, there exists M ∈ SL_{2n}(R) for which six layers are necessary, i.e., M cannot be exactly represented by a depth-five linear affine coupling network.
The lower bound of Corollary 3 immediately implies one for the nonlinear setting, by considering the function Mx and applying a Jacobian argument; see Ref. (6, Corollary 6) for details. Theorem 1 gives an analogous result for the aforementioned NICE model (where D = I), with an additional lower bound independent of indexing, e.g., a learned partition cannot do uniformly better than an arbitrary one. The improvement in construction from depth 47 to depth 6 leads to a significant practical difference in terms of architecture. For example, since permutation matrices can be represented with six layers, the choice of partition may be of limited importance. Furthermore, the improved construction has consequences for maximum likelihood estimation, as Corollary 3 implies that the distributions representable as the application of a six-layer linear affine coupling network to N(0, I) are exactly the set of N(0, Σ) with Σ invertible; see Ref. (6, Appendix A.2) for details.

Proof of Theorems 1 and 2
We construct the factorizations of Theorems 1 and 2 first by showing that matrices with a nonsingular upper right block can be represented with five layers (Lemmas 5 and 6), and then by proving that it costs only one layer to convert any matrix into one with a nonsingular upper right block (Lemma 7).We prove matching lower bounds by analyzing the class of block diagonal (Lemma 8) and diagonal (Lemma 9) matrices that can be represented with five layers.
In what follows, we make use of the theory of commutators, i.e., elements of a group G of the form [g, h] := g^{-1}h^{-1}gh for some g, h ∈ G. We recall the following consequence of a combination of results of R.C. Thompson.

LEMMA 4. [Thompson (7, 8)] Suppose (n, |F|) ≠ (2, 2). Then every M ∈ SL_n(F) is a commutator, i.e., M = X^{-1}Y^{-1}XY for some X, Y ∈ GL_n(F).

The matrices X and Y are efficiently computable; see Refs. (7, 8) for details. Using Lemma 4 and well-chosen block unitriangular matrices, we produce a five-layer decomposition for matrices with a nonsingular upper right block.

LEMMA 5. Suppose (n, |F|) ≠ (2, 2), and let M ∈ SL_{2n}(F) have a nonsingular upper right block. Then there exist A_1, . . ., A_5 ∈ M_n(F) such that

M = [ I    0 ] [ I  A_2 ] [ I    0 ] [ I  A_4 ] [ I    0 ]
    [ A_1  I ] [ 0   I  ] [ A_3  I ] [ 0   I  ] [ A_5  I ].

Proof. The matrices A_1, . . ., A_5 can be written explicitly in terms of the blocks of M and a commutator representation furnished by Lemma 4; the result follows from a short computation verifying that their product equals M. □

Although the proof of Lemma 5 is quite short, it gives little motivation for the choice of matrices A_1, . . ., A_5. This exact choice can only be justified by a detailed analysis of both the representational power of three layers and the image of any matrix under a two-layer transformation. Unfortunately, SL_4(GF(2)) cannot be treated using Lemma 5, as the transvections [[1, 1], [0, 1]] and [[1, 0], [1, 1]] are not commutators of SL_2(GF(2)). Despite this, elements of SL_4(GF(2)) with nonsingular upper right block can still be represented as the product of five block unitriangular matrices, which, given the small group size, is easily verified by exhaustive search.^b

LEMMA 6. [Urschel (10)] Let M ∈ SL_4(GF(2)) have upper right block M_2 ∈ SL_2(GF(2)), i.e., M_2 nonsingular. Then there exist A_1, . . ., A_5 ∈ M_2(GF(2)) such that M is the product of the five alternating block unitriangular factors of Lemma 5.

^b See repository (10) for a short computer-assisted proof [using the Julia programming language (11)]; the program terminates in under a second on a personal computer. It is also possible to prove Lemma 6 via an involved case analysis. The details are left to the interested reader.
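The exceptional case above is small enough to check directly: SL_2(GF(2)) has only six elements. A quick sketch of an exhaustive verification (working over the integers and reducing mod 2):

```python
import itertools
import numpy as np

def sl2_gf2():
    """All six elements of SL_2(GF(2)): 2x2 matrices over GF(2) with det 1."""
    group = []
    for bits in itertools.product((0, 1), repeat=4):
        M = np.array(bits).reshape(2, 2)
        if (M[0, 0] * M[1, 1] - M[0, 1] * M[1, 0]) % 2 == 1:
            group.append(M)
    return group

def inv_gf2(M):
    """Inverse over GF(2) of a determinant-one 2x2 matrix (its adjugate mod 2)."""
    return np.array([[M[1, 1], M[0, 1]], [M[1, 0], M[0, 0]]]) % 2

group = sl2_gf2()
commutators = {
    tuple(int(v) for v in ((inv_gf2(X) @ inv_gf2(Y) @ X @ Y) % 2).flatten())
    for X in group
    for Y in group
}
# The transvection [[1, 1], [0, 1]] never appears as a commutator, which is
# exactly the obstruction that forces the separate treatment of SL_4(GF(2)).
transvection = (1, 1, 0, 1)
```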
The desired factorizations of Theorems 1 and 2 follow from the application of Lemmas 5 and 6 to the product M [[I, A], [0, I]], where A ∈ M_n(F) is chosen so that the upper right block M_1 A + M_2 of this product is nonsingular (here M_1 and M_2 denote the upper left and upper right blocks of M). That such a matrix A exists is a consequence of the following simple lemma, as M ∈ GL_{2n}(F) implies coker(M_1) ∩ coker(M_2) = 0.

LEMMA 7. Let M_1, M_2 ∈ M_n(F) with coker(M_1) ∩ coker(M_2) = 0. Then there exists A ∈ M_n(F) such that M_1 A + M_2 is nonsingular.

We now consider the lower bounds of Theorems 1 and 2. We have the following lemma regarding the representation of block diagonal matrices.
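The reduction step is easy to see numerically. A sketch over R (A is chosen at random here, which suffices generically, though the lemma constructs it deterministically):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
M = np.eye(2 * n)              # invertible, but its upper right block is 0
M1, M2 = M[:n, :n], M[:n, n:]  # upper left and upper right blocks of M

# [M1 M2] has full row rank whenever M is invertible: this is the
# hypothesis coker(M1) ∩ coker(M2) = 0.
rank_top = np.linalg.matrix_rank(np.hstack([M1, M2]))

# Right-multiplying by [[I, A], [0, I]] replaces the upper right block by
# M1 @ A + M2, at the cost of exactly one unitriangular layer.
A = rng.standard_normal((n, n))
U = np.block([[np.eye(n), A], [np.zeros((n, n)), np.eye(n)]])
upper_right = (M @ U)[:n, n:]
```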
By setting these matrices equal and inspecting the upper left and upper right blocks, we deduce equalities among the blocks of the factors. Applying the former equality to the lower left block, and combining the result with the lower right block, yields an identity between two products of matrices. Taking the trace of each side gives our desired result, as the product of two matrices has a fixed trace, independent of the order of the operands. □

Consider the matrices X ∈ GL_m(F) and Y ∈ GL_n(F), m, n > 1, defined entrywise in terms of the Kronecker delta function δ(•). These matrices satisfy det(X) = det(Y) = (-1)^{m+1}, trace(XD) = 0 for all diagonal D ∈ GL_m(F), and, for every diagonal D' ∈ GL_n(F), trace(Y D') ≠ 0 if and only if m · 1 = n · 1 in F. Therefore, by Lemma 8, diag(X^{-1}, Y) ∉ [T_{m,n}(F)]^5. To complete our desired lower bound, we must briefly analyze the case when either m or n is equal to one. If, say, n = 1 and m > 2, let us keep X as above and set Y = (-1)^{m+1}, so that diag(X^{-1}, Y) ∈ SL_{m+1}(F). By the analysis in the proof of Lemma 8, if diag(X^{-1}, Y) ∈ [T_{m,1}(F)]^5, then I - D'XD is a rank one matrix for some diagonal D', D ∈ GL_m(F). However, this is not possible, as [I - D'XD](1, 1) = [I - D'XD](2, 2) = 1 and [I - D'XD](2, 1) = 0, so the corresponding 2 × 2 minor is nonzero. This completes the proof of Theorem 2.

When F has at least four elements, the lower bound for SL_{m+n}(F) holds independent of indexing. The following lemma completes the proof of Theorem 1.

LEMMA 9. If F has at least four elements, then, for every m + n > 3, there exists M ∈ SL_{m+n}(F) such that P_π M P_π^{-1} ∉ [BL_{m,n}(F) ∪ BU_{m,n}(F)]^5 for all permutations π ∈ S_{m+n}.

Proof. Let M be diagonal, with diagonal elements g, h, (gh)^{-1} (not necessarily distinct), and m + n - 3 copies of 1, for some g, h ≠ 1 satisfying gh ≠ 1. Such g, h ∈ F always exist when F has at least four elements (take any distinct g_1, g_2 ∉ {0, 1}; either g_1 g_2 ≠ 1, in which case we may take (g, h) = (g_1, g_2), or g_1 g_2 = 1, in which case g_1^2 ≠ 1 and we may take g = h = g_1). The product of two matrices has a fixed set of nonzero characteristic roots, independent of the order of the operands (12, Theorem 1). However, in total, exactly three elements of I - M_1^{-1} and I - M_4 are nonzero. Therefore, there is no ordering and bipartition of the diagonal elements such that the nonzero characteristic roots, taken with multiplicity, of I - M_1^{-1} and I - M_4 are the same, a contradiction. □
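Both order-invariance facts used in these proofs, the trace identity in Lemma 8 and the invariance of characteristic roots cited from Ref. (12), are easy to check numerically. A small sketch over R:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

# trace(AB) = trace(BA): the trace of a product of two matrices is
# independent of the order of the operands.
tr_AB, tr_BA = np.trace(A @ B), np.trace(B @ A)

# AB and BA share the same characteristic roots with multiplicity; for
# rectangular factors, the nonzero roots still agree (12, Theorem 1).
eig_AB = np.sort_complex(np.linalg.eigvals(A @ B))
eig_BA = np.sort_complex(np.linalg.eigvals(B @ A))
```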