Determining the Number of Factors in High-dimensional Generalized Latent Factor Models

As a generalization of the classical linear factor model, generalized latent factor models are useful for analyzing multivariate data of different types, including binary choices and counts. This paper proposes an information criterion to determine the number of factors in generalized latent factor models. The consistency of the proposed information criterion is established under a high-dimensional setting where both the sample size and the number of manifest variables grow to infinity, and data may have many missing values. An error bound is established for the parameter estimates, which plays an important role in establishing the consistency of the proposed information criterion. This error bound improves several existing results and may be of independent theoretical interest. We evaluate the proposed method by a simulation study and an application to Eysenck's personality questionnaire.


Introduction
Factor analysis is a popular method in social and behavioral sciences, including psychology, economics, and marketing (Bartholomew et al., 2011). It uses a relatively small number of factors to model the variation in a large number of observable variables, often known as manifest variables. For example, in psychological science, manifest variables may correspond to personality questionnaire items, for which factors are often interpreted as personality traits. Multivariate data in social and behavioral sciences often involve categorical or count variables, for which the classical linear factor model may not be suitable. Generalized latent factor models (Skrondal and Rabe-Hesketh, 2004; Chen et al., 2020) provide a flexible framework for such data by combining generalized linear models and factor analysis.
Specifically, item response theory models (Embretson and Reise, 2000; Reckase, 2009), which are widely used in psychological measurement and educational testing, can be viewed as special cases of generalized latent factor models. Generalized latent factor models are also closely related to several low-rank models for count data (Liu et al., 2018; Robin et al., 2019; McRae and Davenport, 2020) and mixed data (Collins et al., 2002; Robin et al., 2020) that make similar probabilistic assumptions, though these works do not pursue interpretations from the factor analysis perspective.
Factor analysis is often used in an exploratory manner for generating scientific hypotheses.
In this case, which is known as exploratory factor analysis, the number of factors and the corresponding loading structure are unknown and need to be learned from data. Quite a few methods have been proposed for determining the number of factors in linear factor models, including eigenvalue-based criteria (Kaiser, 1960;Cattell, 1966;Onatski, 2010;Ahn and Horenstein, 2013), information criteria (Bai and Ng, 2002;Bai et al., 2018;Choi and Jeong, 2019), cross-validation (Owen and Wang, 2016), and parallel analysis (Horn, 1965;Buja and Eyuboglu, 1992;Dobriban and Owen, 2019). However, fewer methods are available for determining the number of factors in generalized latent factor models, and statistical theory remains to be developed, especially under a high-dimensional setting when the sample size and the number of manifest variables are large.
Statistical inference of generalized latent factor models is traditionally carried out based on a marginal likelihood function (Bock and Aitkin, 1981; Skrondal and Rabe-Hesketh, 2004), in which latent factors are treated as random variables and are integrated out of the likelihood function. However, for high-dimensional data involving large numbers of observations, manifest variables, and factors, marginal-likelihood-based inference tends to suffer from a high computational burden and thus may not always be feasible. In that case, a joint likelihood function that treats factors as fixed model parameters may be a good alternative (Chen et al., 2019, 2020; Zhu et al., 2016). Specifically, a joint maximum likelihood estimator proposed in Chen et al. (2019, 2020) is easy to compute and also statistically optimal in the minimax sense when both the sample size and the number of manifest variables grow to infinity. With a diverging number of parameters in the joint likelihood function, classical information criteria, such as the Akaike information criterion (AIC; Akaike, 1974) and the Bayesian information criterion (BIC; Schwarz, 1978), may no longer be suitable. This paper proposes a joint-likelihood-based information criterion (JIC) for determining the number of factors in generalized latent factor models. The proposed criterion is suitable for high-dimensional data with large numbers of observations and manifest variables and can be used even when data contain many missing values. Under a very general setting, we prove the consistency of the proposed JIC when both the number of samples and the number of manifest variables grow to infinity. Specifically, the missing entries are allowed to be non-uniformly distributed in the data matrix, and their proportion is allowed to grow to one, i.e., the proportion of observable entries is allowed to decay to zero.
An error bound for the joint maximum likelihood estimator is established under a general setting where the data entries can be non-uniformly missing, and the number of factors can grow to infinity. This error bound substantially extends the existing results on the estimation of generalized latent factor models and related models, including Cai and Zhou (2013), Davenport et al. (2014), Bhaskar and Javanmard (2015), Ni and Gu (2016), and Chen et al. (2020). Simulation shows that the proposed JIC has good finite sample performance under different settings, and an application to the revised Eysenck's personality questionnaire (Eysenck et al., 1985) finds three factors, which confirms the design of this personality survey.
Joint-likelihood-based Information Criterion

Generalized Latent Factor Models
We consider multivariate data involving N individuals and J manifest variables. Let $y_{ij}$ be a random variable denoting the ith individual's value on the jth manifest variable. Factor models assume that each individual is associated with K latent factors, denoted by a vector $F_i = (f_{i1}, \dots, f_{iK})^T$. We assume that the distribution of $y_{ij}$ given $F_i$ follows an exponential family distribution with natural parameter $d_j + A_j^T F_i$, and possibly a scale parameter $\phi$ that is also known as a dispersion parameter, where $d_j$ and $A_j = (a_{j1}, \dots, a_{jK})^T$ are manifest-variable-specific parameters. Specifically, $d_j$ can be viewed as an intercept parameter, and $a_{jk}$ is known as a loading parameter. More precisely, the probability density/mass function of $y_{ij}$ takes the form
$$f(y_{ij} \mid d_j, A_j, F_i) = \exp\left\{\frac{y_{ij}(d_j + A_j^T F_i) - b(d_j + A_j^T F_i)}{\phi} + c(y_{ij}, \phi)\right\}, \quad (1)$$
where b and c are pre-specified functions that depend on the exponential family distribution.
Given all the person- and manifest-variable-specific parameters, the data $y_{ij}$, i = 1, ..., N, j = 1, ..., J, are assumed to be independent. In particular, the linear factor model for continuous data, the logistic factor model for binary data, and the Poisson factor model for count data are special cases of model (1). We present the logistic and Poisson models as two examples, while pointing out that (1) includes the linear factor model as a special case when the exponential family distribution is chosen to be Gaussian.
Example 1. When data are binary, (1) leads to a logistic model. That is, by letting b(d j + A T j F i ) = log{1 + exp(d j + A T j F i )}, φ = 1, and c(y, φ) = 0, (1) implies that y ij follows a Bernoulli distribution with success probability exp(d j + A T j F i )/{1 + exp(d j + A T j F i )}. This model is known as the multi-dimensional two-parameter logistic model (Reckase, 2009) that is widely used in educational testing and psychological measurement.
Example 2. For count data, (1) leads to a Poisson model by letting b(d_j + A_j^T F_i) = exp(d_j + A_j^T F_i), φ = 1, and c(y, φ) = − log(y!). Then y_ij follows a Poisson distribution with intensity exp(d_j + A_j^T F_i). This model is known as the Poisson factor model for count data (Wedel et al., 2003).
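To make the two examples concrete, the following minimal numpy sketch simulates binary and count data from the logistic and Poisson factor models. The dimensions and parameter ranges here are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, J, K = 200, 50, 3  # sample size, number of manifest variables, number of factors

# True parameters: intercepts d_j, loadings A_j (rows of A), factors F_i (rows of F)
d = rng.uniform(-1, 1, size=J)
A = rng.uniform(-1, 1, size=(J, K))
F = rng.uniform(-1, 1, size=(N, K))

# Natural parameters m_ij = d_j + A_j^T F_i
M = d[None, :] + F @ A.T  # shape (N, J)

# Example 1 (logistic): y_ij ~ Bernoulli(exp(m_ij) / (1 + exp(m_ij)))
y_binary = rng.binomial(1, 1.0 / (1.0 + np.exp(-M)))

# Example 2 (Poisson): y_ij ~ Poisson(exp(m_ij))
y_count = rng.poisson(np.exp(M))
```

Both models share the same natural-parameter matrix M; only the response distribution and the function b change.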
We further take missing data into account under an ignorable missingness assumption.
Let $\omega_{ij}$ be a binary random variable indicating the missingness of $y_{ij}$. Specifically, $\omega_{ij} = 1$ means that $y_{ij}$ is observed, and $\omega_{ij} = 0$ means that $y_{ij}$ is missing. It is assumed that, given all the person- and manifest-variable-specific parameters, the missing indicators $\omega_{ij}$, i = 1, ..., N, j = 1, ..., J, are independent of each other and are also independent of the data $y_{ij}$. The same missing-data setting is adopted in Cai and Zhou (2013). Throughout, we assume the dispersion parameter φ > 0 is known and does not change with N and J.
Our theoretical development below can be extended to the case when φ is unknown; see Remark 6 below for a discussion.
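The missingness mechanism above can be sketched as follows; the observation probabilities $p_{ij}$ used here (varying by row) are an arbitrary illustrative choice, not a setting from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N, J = 200, 50

# Observation probabilities p_ij = Pr(omega_ij = 1); rows receive different
# probabilities to mimic a non-uniform missingness pattern (illustrative choice)
p = np.repeat(rng.uniform(0.2, 0.9, size=(N, 1)), J, axis=1)

omega = rng.binomial(1, p)   # omega_ij = 1 iff y_ij is observed
n = int(omega.sum())         # number of observed entries
```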

Proposed Information Criterion
Under the above setting for generalized latent factor models, the log-likelihood function for the observed data takes the form
$$l_K = \sum_{i,j:\ \omega_{ij}=1} \left\{\frac{y_{ij}(d_j + A_j^T F_i) - b(d_j + A_j^T F_i)}{\phi} + c(y_{ij}, \phi)\right\}, \quad (2)$$
where the subscript K is added to the likelihood function to emphasize the number of factors in the current model.
For exploratory factor analysis, we consider the following constrained joint maximum likelihood estimator, as proposed in Chen et al. (2019, 2020):
$$(\hat d_j, \hat A_j, \hat F_i) = \arg\max\ l_K, \quad \text{subject to } \|F_i\| \le C \text{ and } \|(d_j, A_j^T)^T\| \le C \text{ for all } i, j, \quad (3)$$
where $\|\cdot\|$ denotes the standard Euclidean norm. Here C is a reasonably large constant that ensures a finite solution to (3) exists and satisfies certain regularity conditions.
As no further constraint is imposed under the exploratory factor analysis setting, the solution to (3) is not unique. This indeterminacy is not an issue when determining the number of factors, since the proposed JIC only depends on the log-likelihood value rather than the specific parameter values. The optimization in (3) can be carried out by an alternating maximization algorithm, which has good convergence properties according to numerical experiments (Chen et al., 2019, 2020), even though (3) is a non-convex optimization problem. See Appendix D of the online supplement for further discussion of the computation of (3) and the choice of the constraint constant C.
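To illustrate the alternating-maximization idea, here is a minimal sketch of a constrained joint maximum likelihood fit for the logistic model of Example 1, using alternating projected gradient ascent. The step sizes, iteration count, and the exact form of the projection are my own simplifications for illustration, not the algorithm of Chen et al. (2019, 2020).

```python
import numpy as np

def project_rows(X, C):
    """Scale each row of X so its Euclidean norm is at most C."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X * np.minimum(1.0, C / np.maximum(norms, 1e-12))

def joint_mle_logistic(y, omega, K, C=5.0, steps=300, lr=0.1, seed=0):
    """Alternating projected gradient ascent on the joint log-likelihood of the
    logistic factor model; F is N x K, and dA stacks (d_j, A_j^T) as rows."""
    rng = np.random.default_rng(seed)
    N, J = y.shape
    F = rng.normal(scale=0.1, size=(N, K))
    dA = rng.normal(scale=0.1, size=(J, K + 1))   # column 0 holds the intercepts d_j
    for _ in range(steps):
        # Update item parameters (d_j, A_j) given the factors
        F1 = np.hstack([np.ones((N, 1)), F])      # prepend 1 for the intercept
        resid = omega * (y - 1.0 / (1.0 + np.exp(-(F1 @ dA.T))))
        dA = project_rows(dA + (lr / N) * resid.T @ F1, C)
        # Update the factors F_i given the item parameters
        resid = omega * (y - 1.0 / (1.0 + np.exp(-(F1 @ dA.T))))
        F = project_rows(F + (lr / J) * resid @ dA[:, 1:], C)
    M_hat = np.hstack([np.ones((N, 1)), F]) @ dA.T
    loglik = float(np.sum(omega * (y * M_hat - np.log1p(np.exp(M_hat)))))
    return M_hat, loglik

# Small demo on fully observed simulated data
rng = np.random.default_rng(1)
N, J, K = 100, 20, 2
F0, A0 = rng.uniform(-1, 1, (N, K)), rng.uniform(-1, 1, (J, K))
d0 = rng.uniform(-1, 1, J)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(d0[None, :] + F0 @ A0.T))))
omega = np.ones((N, J))
M_hat, loglik = joint_mle_logistic(y, omega, K=2)
```

The gradient of the joint log-likelihood with respect to each natural parameter is $\omega_{ij}\{y_{ij} - \sigma(m_{ij})\}$, which the sketch propagates to the item parameters and factors via the chain rule before projecting back onto the constraint set.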
Let $n = \sum_{i=1}^N \sum_{j=1}^J \omega_{ij}$ be the number of observed data entries. The proposed JIC takes the form
$$\mathrm{JIC}(K) = -2\,\hat l_K + v(n, N, J, K),$$
where $\hat l_K$ is the maximized joint log-likelihood from (3), and $v(n, N, J, K)$ is a penalty term depending on n, N, J, and K. We choose $\hat K$ to minimize JIC(K).
As will be shown in Section 3, the consistency of $\hat K$ can be guaranteed under a wide range of choices of v(n, N, J, K). In practice, we suggest using
$$v(n, N, J, K) = K(N \vee J)\log\{n/(N \vee J)\}, \quad (4)$$
where $N \vee J$ denotes the maximum of N and J. When there are no missing data, i.e., n = NJ, (4) becomes $v(n, N, J, K) = K(N \vee J)\log(N \wedge J)$, where $N \wedge J$ denotes the minimum of N and J. The advantage of this choice will be explained in Section 3.
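Given maximized joint log-likelihood values, computing the JIC with the suggested penalty $v(n,N,J,K) = K(N\vee J)\log\{n/(N\vee J)\}$ is straightforward. The log-likelihood values below are hypothetical numbers chosen only to illustrate the selection step.

```python
import numpy as np

def v_penalty(n, N, J, K):
    """Suggested penalty v(n, N, J, K) = K (N v J) log{n / (N v J)}."""
    return K * max(N, J) * np.log(n / max(N, J))

def select_num_factors(logliks, n, N, J):
    """logliks maps K to the maximized joint log-likelihood with K factors;
    returns the K minimizing JIC(K) = -2 l_K + v(n, N, J, K)."""
    jic = {K: -2.0 * l + v_penalty(n, N, J, K) for K, l in logliks.items()}
    return min(jic, key=jic.get), jic

# Hypothetical log-likelihood values for K = 1..5 (illustrative numbers only):
# gains level off after K = 3, so the penalty should select K = 3.
logliks = {1: -61000.0, 2: -56000.0, 3: -52000.0, 4: -51950.0, 5: -51900.0}
K_hat, jic = select_num_factors(logliks, n=200 * 100, N=200, J=100)
```

With no missing data (n = NJ), the penalty reduces to $K(N\vee J)\log(N\wedge J)$, as noted in the text.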

Theoretical Results
We start with the definition of several useful quantities. Let $p_{ij} = \Pr(\omega_{ij} = 1)$ be the sampling weight for $y_{ij}$ and $p_{\min} = \min_{1\le i\le N,\,1\le j\le J} p_{ij}$ be their minimum. Also let $n^* = \sum_{i=1}^N \sum_{j=1}^J p_{ij}$, $n^*_{i\cdot} = \sum_{j=1}^J p_{ij}$, and $n^*_{\cdot j} = \sum_{i=1}^N p_{ij}$ be the expected numbers of observations in the entire data matrix, in each row, and in each column, respectively. Let $p_{\max} = (J^{-1}\max_{1\le i\le N} n^*_{i\cdot}) \vee (N^{-1}\max_{1\le j\le J} n^*_{\cdot j})$ be the maximum average sampling weight over rows and columns. Let $m^*_{ij} = d^*_j + (A^*_j)^T F^*_i$ be the true natural parameter for $y_{ij}$, and let $M^* = (m^*_{ij})_{1\le i\le N,\,1\le j\le J}$. We also write $\widehat M = (\hat d_j + \hat A_j^T \hat F_i)_{N\times J}$ for the corresponding estimator of $M^*$ obtained from (3). To emphasize the dependence on the number of factors, we use $\widehat M(K)$ to denote the estimator when assuming K factors in the model. Let $K_{\max}$ denote the maximum number of factors considered in the model selection process, and let $K^*$ be the true number of factors.
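The sampling-weight quantities defined above can be computed directly from a matrix of observation probabilities; the matrix P below is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)
N, J = 200, 50
P = rng.uniform(0.2, 0.8, size=(N, J))         # p_ij = Pr(omega_ij = 1), illustrative

p_min = P.min()                                 # minimum sampling weight
n_star = P.sum()                                # n*: expected total number of observations
n_row = P.sum(axis=1)                           # n*_{i.}: expected observations per row
n_col = P.sum(axis=0)                           # n*_{.j}: expected observations per column
p_max = max(n_row.max() / J, n_col.max() / N)   # maximum average sampling weight
```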
The following two assumptions are made throughout the paper.
Assumption 2. The true model parameters F * i , A * j , and d * j satisfy the constraint in (3).
In the rest of the section, we will first present error bounds for the joint maximum likelihood estimator, and then present conditions on v(n, N, J, K) that guarantee consistent model selection.
Theorem 1. Assume $n^*/(\log n^*)^2 \ge (N \wedge J)\log(N+J)$ and the true number of factors satisfies $1 \le K^* \le K_{\max}$. Then there is a finite constant κ, depending on $p_{\max}/p_{\min}$, C, φ, and the function b but independent of $K_{\max}$, N, J, and $n^*$, such that, with high probability,
$$(NJ)^{-1/2}\,\|\widehat M(K) - M^*\|_F \le \kappa\{K_{\max}(N\vee J)/n^*\}^{1/2} \quad \text{for all } K^* \le K \le K_{\max}. \quad (5)$$
In particular, if $K^*$ is known, then with high probability $(NJ)^{-1/2}\|\widehat M(K^*) - M^*\|_F \le \kappa\{K^*(N\vee J)/n^*\}^{1/2}$.

The upper bound established in Theorem 1 is sharp, in the sense that the following lower bound holds under mild conditions.

Proposition 1 (Lower bound). Assume $(K^*)^2(J+N) \le n^*$. Then there are constants $\kappa, N_0, J_0 > 0$ such that for any $N \ge N_0$, $J \ge J_0$, and any estimator $\widehat M$,
$$\sup_{M^* \in \mathcal M}\ E\,(NJ)^{-1/2}\|\widehat M - M^*\|_F \ge \kappa\{K^*(N\vee J)/n^*\}^{1/2},$$
where $\mathcal M$ denotes the parameter space. Here, κ is a constant that depends on $p_{\max}/p_{\min}$, C, φ, and the function b, and is independent of $K^*$; it is possibly different from the κ in Theorem 1.
We make a few remarks on Theorem 1.
Remark 1. It is well-known that in exploratory factor analysis, the factors F 1 , · · · , F N are not identifiable due to rotational indeterminacy, while m ij s are identifiable. Thus, we establish error bounds for estimating the matrix M rather than those of F i s and A j s. If additional design information is available and a confirmatory generalized latent factor model is used, then the methods described in Section 2.2 and theoretical results in Theorem 1 can be extended to establish error bounds for F i s following a similar strategy as in Chen et al. (2020).
The key assumption for Theorem 1 to hold is that both M * andM are low-rank matrices.
It can easily be generalized to other low-rank models beyond the current generalized latent factor model, including the low-rank interaction model proposed in Robin et al. (2019).

Remark 2. The error bound (5) improves several recent results on low-rank matrix estimation and completion. For example, when $n^* = o\{(N \wedge J)^2\}$, it improves the error bound in Chen et al. (2020), where a fixed $K^*$ and uniform sampling, i.e., $p_{\max} = p_{\min}$, are assumed. Other examples include Ni and Gu (2016) and Bhaskar and Javanmard (2015), which establish slower error rates for binary data. The error bound (5) is also smaller than the optimal rate $\{K^*(N\vee J)(n^*)^{-1}\}^{1/4}$ for approximate low-rank matrix completion (Cai and Zhou, 2013; Davenport et al., 2014), which is expected, as the parameter space in those works, consisting of nuclear-norm-constrained matrices, is larger than that of our setting. Several technical tools are used to obtain the improved error bound, including a sharp bound on the spectral norm of random matrices that extends a recent result of Bandeira and Van Handel (2016), and an upper bound on the singular values of Hadamard products of low-rank matrices based on a result established in Horn (1995).
Note that the constant κ in Theorem 1 depends on $p_{\max}/p_{\min}$. The theorem is thus most useful when $p_{\max}/p_{\min}$ is bounded by a finite constant independent of N and J, in which case the asymptotic error rate is similar under uniform and weighted sampling.
For the case where the sampling scheme is far from uniform, the next theorem provides a finite-sample error bound.
Remark 3. Theorem 2 provides a finite-sample error bound for the joint maximum likelihood estimator when the number of factors is known to be no greater than $K_{\max}$. It extends Theorem 1 in several aspects. First, the constants $\kappa_{2C^2}$, $\delta_{C^2}$, $\kappa_{1,b,C,\phi}$, and $\kappa_{2,b,C,\phi}$ are made explicit in Theorem 2. In addition, it allows the missingness pattern to be far from uniform sampling. To see this, consider the case where $J = N^\alpha$, $p_{\min} = N^{-\beta}$, and $p_{\max}/p_{\min} \le N^\gamma$, with $\alpha \in (0, 1]$, $\beta \in [0, \alpha)$, $\gamma \in [0, \beta]$, and C fixed. Roughly, a larger γ corresponds to a more non-uniform missingness pattern.

Let $u(n, N, J, K) = v(n, N, J, K) - v(n, N, J, K-1)$, and let $\sigma_1(M^*) \ge \sigma_2(M^*) \ge \cdots \ge \sigma_{K^*+1}(M^*)$ be the non-zero singular values of $M^*$. Note that, due to the inclusion of the intercept term $d_j$, a non-degenerate $M^*$ has rank $K^*+1$. The next theorem provides sufficient conditions on $u(n, N, J, K)$ for consistent model selection.
Theorem 3. Consider the asymptotic regime where N, J → ∞ with $K^*$, C, φ, and $p_{\min}$ held fixed. If the function u satisfies conditions (9) and (10), then $\lim_{N,J\to\infty} \Pr(\hat K = K^*) = 1$.

Note that conditions (9) and (10) implicitly require $u(n, N, J, K)$ not to be random; a general result allowing a random penalty is given in Theorem 4 below. The quantity $\sigma^2_{K^*+1}(M^*)$ can be viewed as a measure of the strength of the factors. Under the conditions of Theorem 3, it is guaranteed that, when $K \le K^*$, $\mathrm{JIC}(K) - \mathrm{JIC}(K-1) = -2(\hat l_K - \hat l_{K-1}) + u(n, N, J, K) < 0$ with probability tending to one; this avoids under-selection. Second, under the conditions of Theorem 3, $2(\hat l_K - \hat l_{K-1}) = O_p(N \vee J)$ for each fixed $K \ge K^* + 1$, i.e., when both models are correctly specified. When $N \vee J = o\{u(n, N, J, K)\}$ and $K \ge K^* + 1$, $\mathrm{JIC}(K) - \mathrm{JIC}(K-1) = -2(\hat l_K - \hat l_{K-1}) + u(n, N, J, K) > 0$ with probability tending to one; this avoids over-selection. Finally, the two requirements also imply that selection consistency can only be guaranteed when $N \vee J = o\{\sigma^2_{K^*+1}(M^*)\}$. That is, the factor strength has to be stronger than the noise level.
In practice, the factor strength $\sigma^2_{K^*+1}(M^*)$ is unknown, while $N \vee J$ is observable. We therefore recommend choosing $u(n, N, J, K) = (N \vee J)h(n, N, J)$ for some slowly diverging factor $h(n, N, J)$, so that over-selection is avoided. We require $h(n, N, J)$ to diverge slowly so that under-selection is also avoided for a wide range of factor strength levels. More specifically, we suggest using $h(n, N, J) = \log\{n/(N \vee J)\}$, which becomes $\log(N \wedge J)$ when there are no missing data. Its consistency is established in Corollaries 1 and 2 below. The next theorem extends Theorem 3 to a more general asymptotic setting.
Theorem 4. Suppose there exists a random sequence $\{\xi_{N,J}\}$ such that $\xi_{N,J} \to \infty$ in probability as N, J → ∞ and, with probability converging to one as N, J → ∞, the inequalities in (11) hold for $u(n, N, J, K)$, where $K_{\max} \ge K^*$ denotes the largest number of factors considered in model selection and we allow $K_{\max} = \infty$. Then $\lim_{N,J\to\infty} \Pr(\hat K = K^*) = 1$.
Theorem 4 relaxes the assumptions of Theorem 3 in several aspects. First, it is established under a more general asymptotic regime that allows $K^*$ to diverge and $p_{\min}$ to decay to zero as N and J grow. It also allows the missingness pattern to be very different from uniform sampling by allowing $p_{\max}/p_{\min}$ to grow. Second, $u(n, N, J, K)$ is allowed to be random as long as (11) holds with high probability. In particular, the model selection consistency of the suggested penalty (4) is established in Corollary 2 below as an implication of Theorem 4. Third, (11) provides a more specific requirement on $u(n, N, J, K)$. The second and third lines of (11) depend on the true number of factors $K^*$, which is unknown in practice.

Remark 6. In Theorems 3 and 4, the dispersion parameter φ is assumed known. When φ is unknown, we may first fit the largest model with $K_{\max}$ factors to obtain an estimate $\hat\phi$, and then select the number of factors using the JIC with φ replaced by $\hat\phi$. Similar model selection consistency results would still hold. We note that the use of a plug-in estimator for the dispersion parameter is common in constructing information criteria for linear models and linear factor models (Bai and Ng, 2002).
Remark 7. Several information criteria have been proposed for linear factor models under high-dimensional regimes. In particular, Bai and Ng (2002) consider a setting in which the observed data matrix can be decomposed as the sum of a low-rank matrix and a mean-zero error matrix, and propose information criteria to select the rank of the low-rank matrix.
Their setting is very similar to the case when the exponential family distribution in (1) is chosen to be Gaussian and there are no missing data, except that Bai and Ng (2002) do not require the Gaussian assumption. In fact, under the Gaussian linear factor model with dispersion parameter φ = 1, the suggested JIC with penalty term $v(n, N, J, K) = K(N \vee J)\log(N \wedge J)$ is asymptotically equivalent to the $PC_{p1}$ through $PC_{p3}$ criteria proposed in Bai and Ng (2002), and in particular takes the same form as the $PC_{p3}$ criterion. Bai et al. (2018) consider the spiked covariance model (Johnstone, 2001) and develop information criteria for choosing the number of dominant eigenvalues, which corresponds to the number of factors when regarding the spiked covariance model as a linear factor model. Using random matrix theory, they establish consistency results when the sample size and the number of manifest variables grow to infinity at the same rate and there are no missing data.
As mentioned in Section 1, nonlinear factor models are more suitable for multivariate data that involve categorical or count variables. Specifically, under model (1), the matrix of expected responses, with entries $E(y_{ij} \mid F_i) = b'(d_j + A_j^T F_i)$, is no longer a low-rank matrix when $b'$ is a nonlinear transformation. Consequently, methods developed for the linear factor model do not work well when data follow a nonlinear factor model. The presence of massive missing data further complicates the problem.
Finally, we point out that the theoretical results established above may also be useful for developing information criteria based on the marginal likelihood. The marginal likelihood, which is widely used for estimating latent variable models, treats the latent factors as random variables and integrates them out. When both N and J are large, by applying the Laplace approximation (Huber et al., 2004), the marginal likelihood can be approximated by a joint likelihood plus some remainder terms. The development above can be used to analyze this joint likelihood term. Further discussions are given in Appendix E of the online supplement.

Simulation Study

We evaluate the proposed JIC by a simulation study. We consider eight combinations of N and J, given by J = 100, 200, 300, 400, with N = J or N = 5J. We consider three settings for missing data: (M1) no missing data; (M2) uniform missingness, with observation probability $p_{ij} = 0.5$ for all i and j; and (M3) non-uniform missingness, with an observation probability that depends on the value of the first factor. The true number of factors is set to $K^* = 3$.
The model parameters are generated as follows. First, the true parameters d * j , a * j1 , ..., a * j3 are generated by sampling independently from the uniform distribution over the interval [−2, 2]. Second, the true factor values are generated under two settings. Under the first setting (S1), all three factors f * i1 , ..., f * i3 are generated by sampling independently from the uniform distribution over the interval [−2, 2], so that all the factors have essentially the same strength. Under the second setting (S2), the first two factors f * i1 and f * i2 are generated the same as those under the first setting, while the last factor f * i3 is sampled from the uniform distribution over the interval [−0.8, 0.8]. Under the second setting, the last factor tends to be weaker than the rest and thus is more difficult to detect. We use the proposed JIC to select K from the candidate set {1, 2, 3, 4, 5} and the constraint constant C in (3) is set to be 5. The true model parameters satisfy this constraint. All the combinations of the above settings lead to 48 different simulation settings. For each setting, we run 100 independent replications. The computation is done using the R package mirtjml (Zhang et al., 2020b).
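A minimal sketch of the parameter generation described above (settings S1 and S2); the function name and seed handling are my own illustrative choices.

```python
import numpy as np

def generate_true_params(N, J, setting="S1", seed=0):
    """Generate true parameters following the simulation design: intercepts d_j and
    loadings a_j1..a_j3 from Unif[-2, 2]; factors from Unif[-2, 2] under S1, with
    the third factor drawn from Unif[-0.8, 0.8] under S2 (a weaker factor)."""
    rng = np.random.default_rng(seed)
    d = rng.uniform(-2, 2, size=J)
    A = rng.uniform(-2, 2, size=(J, 3))
    F = rng.uniform(-2, 2, size=(N, 3))
    if setting == "S2":
        F[:, 2] = rng.uniform(-0.8, 0.8, size=N)
    return d, A, F

d, A, F = generate_true_params(500, 100, setting="S2")
```

Under S2, the third factor's smaller range makes its contribution to the natural parameters weaker, which is what makes it harder to detect.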
We first examine the results on parameter estimation. The estimation loss for $M^*$ under each setting is displayed in Figure 1. As we can see, under each setting for factor strength and missing data, the loss decays towards zero as both N and J grow. Given the same N and J, the estimation tends to be more accurate when there are no missing data.
In addition, the estimation tends to be more accurate under setting M2, where the data entries are uniformly missing, than under M3, where the missingness depends on the latent factors. We further examine the selection of the number of factors. Table 1 presents the frequencies with which the number of factors is under- and over-selected among the 100 independent replications for all 48 settings. As we can see, the proposed JIC becomes more accurate as N and J grow. Under the settings with N = J, no under-selection is observed, but the proposed JIC is likely to over-select when J is relatively small. Under the settings with N = 5J, no over-selection is observed, but under-selection occurs when one factor is relatively weaker than the others and J is relatively small. We point out that determining the number of factors is a challenging task under our settings when J is relatively small. To illustrate, Figure 2 shows box plots of $2(\hat l_3 - \hat l_2)$ and $2(\hat l_4 - \hat l_3)$ under the settings with J = 100. For most of these settings, $2(\hat l_3 - \hat l_2)$ is not substantially larger than $2(\hat l_4 - \hat l_3)$, while our asymptotic theory requires the former to be of a higher order.
From the results in Table 1, we see that for relatively small values of N and J, the proposed information criterion tends to over-penalize when N = 5J and under-penalize when N = J. We explain this phenomenon as follows. Our choice of v(n, N, J, K) is derived from the error bound (5) in Theorem 1. Although this error bound is rate-optimal, as implied by Proposition 1, it does not take into account the relationship between N and J. For example, consider two settings that both have no missing data and the same J, but one with N = J and the other with N = 5J. By Theorem 1, the two settings have exactly the same upper bound $\kappa(K_{\max}/J)^{1/2}$. However, as we can see from Figure 1, the error tends to be larger when N = J than when N = 5J. Consequently, with the JIC derived from the same upper bound, it is more likely to over-select when N = J and to under-select when N = 5J. To improve the current information criterion, a refined error bound is needed, according to which we can choose a v(n, N, J, K) that better adapts to the relationship between N and J. This is a challenging problem, and we leave it for future investigation.

Application to Eysenck's Personality Questionnaire
We apply the proposed JIC to a dataset based on the revised Eysenck's personality questionnaire (Eysenck et al., 1985), a personality inventory that has been widely used in clinics and research. This questionnaire is designed to measure three personality traits, extraversion, neuroticism, and psychoticism. We refer the readers to Eysenck et al. (1985) for the characteristics of these personality traits. The factor structure of this personality inventory remains of interest in psychology, due to its importance in the literature of human personality and wide use in several studies worldwide (Barrett et al., 1998;Chapman et al., 2013;Heym et al., 2013). In particular, it has been found that the dependence between items measuring the psychoticism trait tends to be lower than that between items measuring the other two traits. Based on this observation, some researchers suggested that psychoticism may consist of multiple dimensions (Caruso et al., 2001). We use the proposed JIC to investigate the factor structure of the inventory.
Specifically, we analyze all the items from the questionnaire except the lie scale items, which are used to guard against various concerns about response style. There are 79 items in total, each with "Yes" and "No" response options. An example item is "Do you often need understanding friends to cheer you up?". Among the 79 items, 32, 23, and 24 items are designed to measure psychoticism, extraversion, and neuroticism, respectively. For each participant, a total score can be computed based on each of the three item sets. This total score is often used to measure the corresponding personality trait. Here, we analyze a female UK normative sample dataset (Eysenck et al., 1985).

The results are given in Tables 2 and 3. Specifically, the three-factor model achieves the minimum JIC value among the five candidate choices of K, suggesting a three-factor structure for the inventory. We investigate the three-factor model using the oblimin method, one of the most popular oblique rotation methods (Browne, 2001), to obtain a relatively simple loading structure. Table 3 shows Kendall's tau rank correlation between participants' estimated factor scores under the oblimin rotation and the total scores for the three personality traits given by the design. According to the correlations, the extracted factors tend to correspond to the extraversion, psychoticism, and neuroticism traits, respectively.

Table 3: Kendall's tau rank correlation between participants' estimated factor scores under the oblimin rotation and the total scores for the three personality traits.

            F1      F2      F3
  P score   0.08    0.78   -0.05
  E score   0.86    0.00   -0.12
  N score  -0.08    0.08    0.88
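For readers who wish to compute rank correlations like those in Table 3, here is a minimal self-contained Kendall's tau sketch. Note this is the tau-a variant without tie correction; a tie-corrected variant (tau-b, e.g., `scipy.stats.kendalltau`) is preferable for scores with many ties, such as total scores on binary items.

```python
import numpy as np

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / total pairs.
    Naive O(n^2) implementation; tied pairs contribute zero."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    return 2.0 * s / (n * (n - 1))

tau_concordant = kendall_tau([1, 2, 3, 4], [10, 20, 30, 40])   # fully concordant
tau_discordant = kendall_tau([1, 2, 3, 4], [4, 3, 2, 1])       # fully discordant
```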
Additional results can be found in Appendix H of the online supplement, including the estimated parameters for the fitted models and a comparison with marginal-likelihood-based inference.

Discussion

When model (1) takes the form of a Gaussian density and there are no missing data, the proposed JIC and its theory are consistent with the results of Bai and Ng (2002) for high-dimensional linear factor models. In this sense, the current work substantially extends Bai and Ng (2002) by considering nonlinear factor models and allowing a general setting for missing values. Although we focus on generalized latent factor models with an exponential-family link function, the proposed JIC is applicable to other models, for example, a probit factor model for binary data that replaces the logistic link in Example 1 with a probit link. The consistency results are likely to hold under similar conditions for a wider range of models. This extension is left for future investigation.

A Proof of Theoretical Results
A.1 Proof of Theorems 1 and 2

We present the proof of Theorem 2 first and then that of Theorem 1, because the former is more general than the latter. The proof of Theorem 2 is based on the following two lemmas, whose proofs are provided later in the supplementary material.
Lemma 2. There is a universal constant c > 0 such that

Lemma 3. Let $V = (v_{ij})_{1\le i\le N,\,1\le j\le J}$ be a random matrix with independent and centered entries. In addition, assume the $v_{ij}$ are sub-exponential random variables with parameters ν, α > 0; that is, $E(e^{\lambda v_{ij}}) \le e^{\lambda^2\nu^2/2}$ for all $|\lambda| < 1/\alpha$. Then there exists a universal constant c > 0 such that, with probability at least the stated level, for all N ≥ 1, J ≥ 1, and $n^* \ge 6$. In particular, under Assumptions 1 and 2, there is a universal constant c > 0 such that, with probability at least $1 - (N+J)^{-1} - (n^*)^{-1}$, for all N ≥ 1, J ≥ 1, and $n^* \ge 6$.
Proof of Theorem 1. Note that $\max_i n^*_{i\cdot} \le p_{\max} J$ and $\max_j n^*_{\cdot j} \le p_{\max} N$. Thus, (7) simplifies to a bound with a constant κ depending on C, b, φ, and $p_{\max}/p_{\min}$. Because $p_{\min} = (p_{\min}/p_{\max})p_{\max} \ge (p_{\min}/p_{\max})\, n^*/(NJ)$, the above inequality holds with a possibly different κ that also depends on C, b, and φ. Multiplying both sides by $(NJ)^{-1/2}$ and simplifying, we arrive at the desired form. Note that for $n^*/(\log n^*)^2 \ge (N \wedge J)\log(N+J)$, we have $\{(N\vee J)/n^*\}^{1/2} \ge \{(NJ)^{1/2}\log^{1/2}(N+J)\}(n^*)^{-1}\log n^*$, and the above inequality simplifies to (5). This completes the proof.

A.2 Proof of Theorems 3 and 4 and Corollary 2
The proofs of Theorems 3 and 4 are based on the following three supporting lemmas, whose proofs are given in the supplementary material. We start by recalling $u(n, N, J, K) = v(n, N, J, K) - v(n, N, J, K-1)$ and defining $R = 4(p_{\min}\delta_{C^2})^{-1}\|Z \circ \Omega\|_2 + 2\delta_{C^2} C^2 \|Q\|_2^2$.
In the rest of the section, we provide the proof of Theorem 4 first and then the proof of Theorem 3 because the former is more general than the latter.
of Theorem 4. We will verify that conditions of Theorem 4 ensure conditions in Lemma 4 and Lemma 5. We start with verifying conditions in Lemma 4. According to the second line of (11), where the last line is obtained according to Lemma 6 and that ξ N,J → ∞ in probability.

Pr inf
Thus, the conditions of Lemma 4 are verified and we obtain $\lim_{N,J\to\infty}$ Next, we verify the conditions of Lemma 5. According to Lemma 6 and the assumption $p_{\min}^{-2}\, p_{\max} K^* (N \vee$ Thus, In addition, according to the first line of (11), From (32) and (33), we complete the proof by combining (30) and (34).
Proof of Theorem 3. First note that the existence of $u$ satisfying (9) and (10) is verified.

B Proof of Supporting Lemmas
Proof of Lemma 1. By definition, In the rest of the proof, we derive upper bounds for each term on the right-hand side of the above display. For the first term, where $\langle A, B\rangle = \mathrm{tr}(A^T B)$ denotes the matrix inner product. Recall the following inequality from linear algebra: $|\langle A, B\rangle| \le \|A\|_2 \|B\|_* \le \{\mathrm{rank}(B)\}^{1/2} \|A\|_2 \|B\|_F$ for any two matrices $A$ and $B$. Applying this fact to the above display, we obtain Notice that $\mathrm{rank}(M - M^*) \le r + r^*$ for $M \in \mathcal{M}_r$. Thus, the above inequality implies We proceed to the analysis of the second term. Note that for $M \in \mathcal{M}_r$, $|m_{ij}| \le \|B_i\| \|G_j\| \le C^2$. Similarly, $|m^*_{ij}| \le C^2$. Thus, for any $\tilde m_{ij} = t m^*_{ij} + (1-t) m_{ij}$ with $t \in (0,1)$, $|\tilde m_{ij}| \le C^2$. Recall the definition $\delta_{C^2} = \inf_{|x| \le C^2} b''(x)$.
Note that where we define $Q = \Omega - P$ and $P = (p_{ij})_{1\le i\le N, 1\le j\le J}$. The next lemma, whose proof is given later in this section, is helpful for bounding matrix norms involving Hadamard products.
Remark 9. The proof of Lemma 7 uses the property that $m_{ij} = B_i^T G_j$ with $\|B_i\|, \|G_j\| \le C$, and combines it with a result in Horn (1995). This improves the estimate in Chen et al.
Compared with this bound, the above lemma provides a sharper bound of order $r + r^*$.
Applying Lemma 7 to (44) and combining it with (43), we obtain We complete the proof by combining the above display with (39) and (42).
On the other hand, Theorem 2 in Horn (1995) states that, for any $m \times n$ matrices, where $\sigma_i(\cdot)$ denotes the $i$th largest singular value of a matrix, Note that the left-hand side of the above display equals $\|(M - M^*) \circ (M - M^*)\|_*$. Thus, The proofs of Lemmas 2 and 3 are based on the next lemma, which provides an upper tail bound for the spectral norm of a large class of random matrices. Its proof mainly combines standard symmetrization and truncation arguments with a recent result by Bandeira and Van Handel (2016) on the spectral norm of symmetric random matrices with independent, centered, and symmetric entries.
Then, there is a universal constant $c > 0$ such that, for all $t, \lambda \ge 0$, where $\sigma_1$ and $\sigma_2$ are defined with $x'_{ij}$ an independent copy of $x_{ij}$.
Proof of Lemma 8. Let $X' = (x'_{ij})$ be an independent copy of $X$ and let $\tilde X = (\tilde x_{ij}) = X - X'$.
Then, the $\tilde x_{ij}$'s are independent and symmetrically distributed.
$Z$ is a symmetric $(N+J)\times(N+J)$ random matrix whose entries are independent and symmetric random variables. Define a random matrix $Z(\lambda)$ as the truncated $Z$. Then, the entries of $Z(\lambda)$ are independent, symmetric random variables bounded by $\lambda$. Applying Corollary 3.12 in Bandeira and Van Handel (2016) to $Z(\lambda)$, there exists a universal constant $c > 0$ such that Note that Thus, On the other hand, The above two inequalities together imply Note that $\|Z\|_2 = \|\tilde X\|_2$ and $\max_{1\le i,j\le N+J} |z_{ij}| = \max_{1\le i\le N, 1\le j\le J} |\tilde x_{ij}|$. From the above inequality, we obtain With a union bound, we further get $\Pr(\|\tilde X\|_2 \ge 4(\sigma_1 \vee \sigma_2) + t) \le (N+J)e^{-t^2/(c\lambda^2)} + \sum_{1\le i\le N,\, 1\le j\le J} \Pr(|\tilde x_{ij}| > \lambda)$.
We complete the proof by noting that (2c) 1/2 is still a universal constant.
Proof of Lemma 4. For each $K^*+1 \le K \le K_{\max}$, we first derive an upper bound for $\phi\{l(\widehat{M}_K) - l(\widehat{M}_{K^*})\}$. According to Lemma 1, Combining this with (16) gives Thus, the penalized log-likelihood satisfies It is easy to see that, if the events $u(n, N, J, K^*+1) > 2\phi^{-1}(K^*+1)R$ and $u(n, N, J, l) > 2\phi^{-1}R$ happen simultaneously for all $K^*+2 \le l \le K_{\max}$, then the right-hand side of the above inequality is strictly greater than zero. Thus, $\ge \Pr\big(u(n, N, J, K^*+1) > 2\phi^{-1}(K^*+1)R$, and $\inf$ We complete the proof by noting that the right-hand side of the above inequality tends to one under the assumptions of the lemma.
The proof of Lemma 5 requires the next lemma.
This completes the proof.
Proof of Lemma 6. According to Lemma 3, there is a universal constant $c$ such that with probability Under the asymptotic regime (10), we have $4(\phi\kappa_{2C}$ as $N, J \to \infty$. Similarly, according to Lemma 2, with probability at least $1 - (N+J)^{-1}$. Under the asymptotic regime (10), the right-hand side of the above inequality is of the order $O(\{p_{\max}(N \vee J)\}^{1/2})$, and thus We complete the proof by combining (81), (83), and the definition of $R$.

C Proof of Proposition 1
Without loss of generality, assume J ≥ N , N/K * is an integer, and φ = 1. The proof can be easily extended to the other cases.
To prove the lower bound for the minimax risk, we use a local version of Fano's method, a standard tool for proving minimax lower bounds (Tsybakov, 2008). We start by constructing a local packing as follows. First, let $F^{(0)} = C(I_{K^*}, \cdots, I_{K^*})^T$.
Next, according to the Gilbert-Varshamov bound (Gilbert, 1952), The set $\mathcal{M}^*$ defined above has the following properties. Property (a) holds obviously. Property (b) holds because of the following inequalities, where we used the construction $M = F^{(0)} A^T$ in the last equation. Note that $|a_{jk}| = |\gamma b_{jk}| = \gamma$ for $A \in \mathcal{A}$. Thus, $(a_{jk} - a'_{jk})^2 \le 4\gamma^2$, which leads to property (b) of the set $\mathcal{M}^*$ for a possibly different $\kappa$. Property (c) holds for the following reasons. By construction, for $M, M' \in \mathcal{M}^*$ with $M \ne M'$, where the last inequality is due to (84). Thus, property (c) holds.

D On Optimization for Joint Likelihood
We provide some discussion of the optimization problem (3) for the constrained joint maximum likelihood estimator. The two reasons below explain why the solution given by an alternating maximization algorithm typically performs well, even though (3) is a non-convex optimization problem. First, according to the proofs of Theorems 1 through 4, Theorems 1 and 2 hold as long as the estimates satisfy when $K \ge K^*$. In addition, for Theorems 3 and 4 to hold, we only need This means that the number of factors can be consistently selected even if our estimate is not a global solution to (3), as long as (94) holds.
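To make the first point concrete, the alternating scheme can be sketched as follows for the logistic model of Example 1. This is a hedged illustration only: the projected gradient updates, the step size, and all function names (`joint_loglik`, `alternating_maximization`) are our own choices for exposition, not the exact algorithm used in the paper.

```python
import numpy as np

def joint_loglik(Y, F, A, d):
    # Bernoulli (logistic) joint log-likelihood: sum_ij y_ij*m_ij - log(1+exp(m_ij))
    M = d[None, :] + F @ A.T
    return np.sum(Y * M - np.logaddexp(0.0, M))

def project(V, C):
    # Project each row of V onto the Euclidean ball of radius C
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return V * np.minimum(1.0, C / np.maximum(norms, 1e-12))

def alternating_maximization(Y, K, C=5.0, step=0.1, n_iter=200, seed=0):
    """Alternate projected gradient ascent steps on person parameters F
    and item parameters (A, d) for the constrained joint likelihood."""
    rng = np.random.default_rng(seed)
    N, J = Y.shape
    F = rng.normal(scale=0.1, size=(N, K))
    A = rng.normal(scale=0.1, size=(J, K))
    d = np.zeros(J)
    for _ in range(n_iter):
        # Update person parameters with item parameters fixed
        M = d[None, :] + F @ A.T
        R = Y - 1.0 / (1.0 + np.exp(-M))   # residual = Y - E[Y]
        F = project(F + step * R @ A / J, C)
        # Update item parameters with person parameters fixed
        M = d[None, :] + F @ A.T
        R = Y - 1.0 / (1.0 + np.exp(-M))
        A = project(A + step * R.T @ F / N, C)
        d = d + step * R.mean(axis=0)
    return F, A, d
```

In practice the updates for `F` decouple across persons and those for `(A, d)` across items, so each half-step can be parallelized.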
Second, we use good starting points when solving the optimization (3). Specifically, under the logistic factor model for binary data, a singular-value-decomposition-based algorithm is proposed by Zhang et al. (2020a) that is guaranteed to give a consistent estimator of the model parameters. Although this estimator is statistically less efficient than the joint-likelihood-based estimator (and thus cannot be directly plugged into the likelihood to construct an information criterion), it can serve as a good starting point when solving the optimization (3). For other models, similar singular-value-decomposition-based algorithms can also be developed.
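The idea behind such a starting point can be sketched as follows for binary data. This is only a schematic version in the spirit of Zhang et al. (2020a), not their exact algorithm; the clipping constant `eps` and the normalization of `F` and `A` are illustrative choices.

```python
import numpy as np

def svd_initialization(Y, K, eps=0.01):
    """Rank-(K+1) SVD of a binary data matrix, followed by an entrywise
    inverse logit link, to obtain starting values (F, A, d)."""
    N, J = Y.shape
    # Step 1: low-rank denoising of the probability matrix
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    P_hat = (U[:, :K + 1] * s[:K + 1]) @ Vt[:K + 1]
    P_hat = np.clip(P_hat, eps, 1 - eps)
    # Step 2: invert the link entrywise to estimate M = d + F A^T
    M_hat = np.log(P_hat / (1 - P_hat))
    # Step 3: split M_hat into intercepts and a rank-K factor part
    d = M_hat.mean(axis=0)
    U2, s2, Vt2 = np.linalg.svd(M_hat - d[None, :], full_matrices=False)
    F = U2[:, :K] * np.sqrt(N)
    A = (Vt2[:K].T * s2[:K]) / np.sqrt(N)
    return F, A, d
```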
We also discuss the choice of the constraint constant C, which needs to be specified when computing the constrained joint maximum likelihood estimator. First, we point out that it is standard to impose such a constraint for low-rank matrix estimation under nonlinear models. For example, in the work of Cai and Zhou (2013) on 1-bit matrix completion, the max norm (i.e., the maximum of the absolute values of the entries) of the underlying low-rank matrix is required to be smaller than a constant, which plays essentially the same role as the constant C in the current work. Second, according to our simulation study, the estimation of the model parameters and the performance of the proposed information criteria are not sensitive to the choice of C, as long as it is sufficiently large.
In practice, we therefore suggest starting with a sufficiently large C and then, for a given dataset, running the estimator under several different values of C as a sensitivity analysis.

E Information Criteria based on Marginal Likelihood
We provide some discussion of the behavior of the maximum marginal likelihood when both N and J grow to infinity. To simplify the discussion, we assume that there are no missing values and the dispersion parameter is 1, but the discussion can be generalized to the case where there are missing data and the dispersion parameter needs to be estimated. Then, by the Laplace approximation (Huber et al., 2004) and under suitable regularity conditions, we should be able to establish where $H(\hat F_i, \hat A, \hat D)$ is the Hessian matrix of $L_i(x) = l_i(x, \hat A, \hat D)$ evaluated at $\hat F_i$ and the $R_{N,J}$ term comes from the remainder of the Laplace approximation. Note that the first term in (95) is the dominant term and takes the same form as the joint likelihood, though $\hat A_j$, $\hat d_j$, and $\hat F_i$ are obtained from the marginal likelihood. The remainder $R_{N,J}$ is a term of smaller asymptotic order. Moreover, we believe that the error bound established in Theorem 1 can be extended to $\hat M_K = (\hat d_j + \hat F_i^T \hat A_j)_{N \times J}$ when $K^* \le K \le K_{\max}$. Therefore, the development in this article will also be useful for developing marginal-likelihood-based information criteria for generalized latent factor models under a high-dimensional setting.
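For concreteness, the Laplace approximation underlying this expansion takes, for each person $i$, the generic form below. This is a hedged reconstruction of the standard approximation (the precise statement, including the form of the remainder, is as in Huber et al., 2004); here $\phi_K$ denotes the $K$-variate standard normal density and $\hat F_i$ maximizes $L_i$.

```latex
\int \exp\{L_i(x)\}\, \phi_K(x)\, dx
  \;\approx\;
  \exp\{L_i(\hat F_i)\}\; \phi_K(\hat F_i)\;
  (2\pi)^{K/2}\,
  \bigl[\det\{-H(\hat F_i, \hat A, \hat D)\}\bigr]^{-1/2}.
```

Taking logarithms and summing over $i = 1, \ldots, N$ then yields a leading term of the joint-likelihood form plus lower-order correction terms.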

F Comparison with Some Related Works
As discussed in Remark 2, the error bound (5) improves several recent results on low-rank matrix estimation and completion. We now summarize the comparison in Remark 2 using Table 4 below. This comparison focuses on the error bound (5) when K max = K * and data entries are binary and are uniformly missing.
We further compare the current development with Chen et al. (2019) and Chen et al. (2020) that also concern likelihood-based analysis of generalized latent factor models. We discuss the similarities and differences below.
1. Model: Chen et al. (2020) and the current work consider the same generalized latent factor model (1), and Chen et al. (2019) consider the special case for binary data given in Example 1.

Table 4: Comparison of the error bound (5) with Bhaskar and Javanmard (2015), Ni and Gu (2016), and Cai and Zhou (2013).

2. Contributions: First, we propose information criteria for selecting the number of factors in high-dimensional generalized latent factor models and establish conditions under which selection consistency is guaranteed.
Second, we substantially extend the results on the estimation of generalized latent factor models under a general setting where the data entries can be non-uniformly missing and the number of factors can also grow to infinity.

G.1 Additional Results for Simulation in Section 4.1
The average running time for one independent replication for each of the 48 simulation settings is given in Table 5, where the computation is run on a computer with an Intel(R) Xeon(R) CPU 2.30GHz. The computation code for our simulations can be found on the author's Github page: https://github.com/yunxiaochen/JML_IC.

G.2 Simulation under Poisson Factor Model
We further provide a simulation study under the Poisson factor model as given in Example 2.
Similar to the simulation study in Section 4.1, we consider the same factor strength settings S1 and S2 and the same missing data settings M1-M3. Again, we consider two relationships between N and J, namely N = J and N = 5J, and we consider J = 100, 200, 300, 400. As before, we let $K^* = 3$ and generate the true model parameters similarly to the simulation study in Section 4.1. More precisely, under setting S1, the true parameters $d^*_j, a^*_{j1}, \ldots, a^*_{j3}$ are generated by sampling independently from the uniform distribution over the interval $[-1, 1]$, and the true factor values $f^*_{i1}, \ldots, f^*_{i3}$ are generated by sampling independently from the uniform distribution over the interval $[-1, 1]$. Under setting S2, $f^*_{i3}$ is generated from the uniform distribution over the interval $[-0.4, 0.4]$ and the rest of the parameters are generated in the same way as in S1. We use the proposed JIC to select K from the candidate set {1, 2, 3, 4, 5} and the constraint constant C in (3)
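As a hedged sketch, the data-generating scheme just described can be written as follows; the function name and the use of `numpy`'s random generator are illustrative, and the exact simulation code is the one available in the repository referenced in Section G.1.

```python
import numpy as np

def simulate_poisson_factor(J, setting="S1", ratio=1, seed=0):
    """Generate data from the Poisson factor model with K* = 3 factors,
    following the S1/S2 settings described above (parameter ranges as
    stated; all other details are illustrative)."""
    rng = np.random.default_rng(seed)
    N = ratio * J                      # N = J or N = 5J
    d = rng.uniform(-1, 1, J)
    A = rng.uniform(-1, 1, (J, 3))
    F = rng.uniform(-1, 1, (N, 3))
    if setting == "S2":
        # Weak third factor: f*_{i3} drawn from Uniform[-0.4, 0.4]
        F[:, 2] = rng.uniform(-0.4, 0.4, N)
    M = d[None, :] + F @ A.T           # natural parameter matrix
    Y = rng.poisson(np.exp(M))         # Poisson counts via the log link
    return Y, M
```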

G.3 A Scree Plot Example
Scree plots are a widely used tool for selecting the number of factors in factor analysis (Cattell, 1966). A scree plot displays the eigenvalues of the covariance matrix of the data as a downward curve, ordering the eigenvalues from largest to smallest. The number of factors is then determined by finding the "elbow" of the graph, i.e., the point where the eigenvalues seem to level off; the number of factors is taken to be the number of eigenvalues greater than the elbow. This approach typically works well for data following a linear factor model, because under such a model the covariance matrix of the data is approximately a low-rank matrix plus a diagonal matrix, and the low-rank part drives the "elbow" phenomenon. When data are generated from a nonlinear factor model, such as the logistic or Poisson factor models considered in the current work, the factor structure of the data cannot be fully characterized by the covariance matrix. In particular, the covariance matrix cannot be approximated by a low-rank matrix plus a diagonal matrix, so the elbow of the scree plot may no longer correspond to the number of factors. We provide a simulated example to illustrate this point. Figure 4 shows the scree plot for data generated from a Poisson factor model with N = J = 200, K* = 3, and no missing entries. Based on the scree plot, one may tend to choose seven or eight factors, which is larger than the true number of factors.

We now compare the proposed JIC with the method proposed in Bai et al. (2018) via a simulation study. In this study, data are generated from a linear factor model, with N = 2J.
More specifically, $y_{ij}$ follows a normal distribution with mean $d_j + A_j^T F_i$ and variance 1, so that the data follow the spiked covariance structure model assumed in Bai et al. (2018), with the eigenvalues of the covariance matrix of $(y_{i1}, \ldots, y_{iJ})^T$ satisfying $\lambda_{K^*} > \lambda_{K^*+1} = \cdots = \lambda_J = 1$.
Again, we let $K^* = 3$ and generate the true model parameters under setting S1 of the simulation study in Section 4.1, where the three factors are of the same strength. More precisely, the true parameters $d^*_j, a^*_{j1}, \ldots, a^*_{j3}$ are generated by sampling independently from the uniform distribution over the interval $[-2, 2]$, and the true factor values $f^*_{i1}, \ldots, f^*_{i3}$ are generated by sampling independently from the uniform distribution over the interval $[-2, 2]$. We consider $J = 10, 20, \ldots, 50$ and $N = 2J$. Note that under this linear factor model, with no missing data and assuming that the variance of $y_{ij}$ is known to be 1, the proposed JIC is the same as the $PC_{p3}$ criterion proposed in Bai and Ng (2002).
We use the proposed JIC to select K from the candidate set {1, 2, 3, 4, 5}, and the constraint constant C in (3) is set to be 5. In addition, we also use the AIC and BIC proposed in Bai et al. (2018) to select K from the candidate set {1, 2, 3, 4, 5}. The results are given in Table 8. As we can see, all three information criteria become more accurate as N and J increase.

H Additional Results for Real Data Analysis
In what follows, we provide additional results for the real data analysis. In Tables 9 and 10, we show the loading matrix and the sample covariance matrix for the estimated factor scores, after applying the oblimin rotation. Note that the items have been reordered, with items 1-32, 33-55, and 56-79 designed to measure the psychoticism, extraversion, and neuroticism traits, respectively. The content of the items can be found in Eysenck et al. (1985). Note that our data have been pre-processed so that the negatively worded items are reversely scored. As we can see, items 1-32, 33-55, and 56-79 tend to have high loadings on F2, F1, and F3, respectively. According to Table 10, the correlations between the three estimated factors are relatively small, suggesting that the three factors tend to be uncorrelated.
We further provide results for the two- and four-factor models, whose JIC values are also relatively small. These results may provide further insights into the latent structure of this personality inventory. Tables 11 and 13 provide the loading matrices for the two models, respectively, after applying the oblimin rotation. Moreover, Tables 12 and 14 show the sample covariance matrices for the estimated factor scores from the two models, respectively. According to Table 11, the items designed to measure the extraversion trait tend to have high loadings on the first factor, and items designed to measure neuroticism tend to have high loadings on the second factor, while most items designed to measure psychoticism have small loadings on both factors. These results suggest that the psychoticism factor may not be captured by the two-factor model.
From the loading structure given in Table 13, the extracted factors F4, F2, and F3 tend to correspond to the psychoticism, extraversion, and neuroticism traits, respectively.
In addition, most items have small loadings on F1, except for items 14 ("Do you stop to think things over before doing anything?"), 28 ("Do you generally 'look before you leap'?"), 45 ("Have people said that you sometimes act too rashly?"), and 48 ("Do you often make decisions on the spur of the moment?"), where items 14 and 28 are negatively worded and thus reversely scored. F1 thus appears to be a minor factor related to impulsive decision-making.
We compare the proposed method with the classical Akaike information criterion (AIC) and Bayesian information criterion (BIC), calculated based on the marginal likelihood function in which the latent factors are treated as random variables. More specifically, the latent factors are assumed to follow a multivariate normal distribution in the calculation of the marginal likelihood. The marginal maximum likelihood estimator is computed using the R package "mirt" (Chalmers, 2012), where the computation is carried out using an Expectation-Maximization (EM) algorithm. The EM algorithm is very time-consuming, even when only a moderate number of factors is involved (Reckase, 2009). The AIC and BIC values for the one- through five-factor models are given in Table 15 below. In calculating the AIC and BIC values, the number of parameters for a K-factor model is J(K + 1) - K(K - 1)/2, recalling that J is the number of items. The three-factor model fits best according to the BIC, which is consistent with the selection based on the proposed JIC. On the other hand, the AIC selects the four-factor model. Note that under the classical asymptotic regime, when the true model is one of the candidate models, the BIC guarantees consistent model selection, while the AIC tends to over-select (Shao, 1997).
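For reference, the parameter counting and the resulting AIC/BIC values can be sketched as follows. This is a hedged illustration: `loglik` and `n_obs` stand for the maximized marginal log-likelihood and the sample size, which the sketch takes as given.

```python
import numpy as np

def n_params(J, K):
    """Number of free parameters in a K-factor model with J items:
    J intercepts plus J*K loadings, minus K(K-1)/2 rotational constraints."""
    return J * (K + 1) - K * (K - 1) // 2

def aic_bic(loglik, J, K, n_obs):
    """Classical AIC and BIC based on the maximized marginal likelihood."""
    p = n_params(J, K)
    return -2.0 * loglik + 2.0 * p, -2.0 * loglik + np.log(n_obs) * p
```

For instance, with the J = 79 items of this questionnaire, the three-factor model has 79 × 4 − 3 = 313 free parameters.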
Finally, we provide the estimation results for the three-factor model from the marginal-likelihood approach, in comparison with those from the joint-likelihood approach. The results are given in Tables 16 and 17. Similar to the analysis above, the results are under the oblimin rotation. As we can see, although the estimates are slightly different from those given by the joint likelihood, the loading structure is similar and suggests that the three factors correspond to the extraversion, psychoticism, and neuroticism traits, respectively.

Table 9: Estimated loading matrix for the three-factor model after applying the oblimin rotation.

      F1     F2     F3
F1   1.00  -0.03  -0.19
F2  -0.03   1.00  -0.02
F3  -0.19  -0.02   1.00

Table 10: The sample covariance matrix for the estimated factor scores, under the three-factor model after applying the oblimin rotation. Note that the model parameters have been rescaled so that the sample variance of each factor is one.