## Abstract

Prospective pregnancy studies are a valuable source of longitudinal data on menstrual cycle length. However, care is needed when making inferences of such renewal processes. For example, accounting for the sampling plan is necessary for unbiased estimation of the menstrual cycle length distribution for the study population. If couples can enroll when they learn of the study as opposed to waiting for the start of a new menstrual cycle, then due to length-bias, the enrollment cycle will be stochastically larger than the general run of cycles, a typical property of prevalent cohort studies. Furthermore, the probability of enrollment can depend on the length of time since a woman’s last menstrual period (a backward recurrence time), resulting in selection effects. We focus on accounting for length-bias and selection effects in the likelihood for enrollment menstrual cycle length, using a recursive two-stage approach wherein we first estimate the probability of enrollment as a function of the backward recurrence time and then use it in a likelihood with sampling weights that account for length-bias and selection effects. To broaden the applicability of our methods, we augment our model to incorporate a couple-specific random effect and time-independent covariate. A simulation study quantifies performance for two scenarios of enrollment probability when proper account is taken of sampling plan features. In addition, we estimate the probability of enrollment and the distribution of menstrual cycle length for the study population of the Longitudinal Investigation of Fertility and the Environment Study.

## Introduction

We consider the setting of longitudinal prospective studies of cyclic events forming a renewal process, where the primary objective is to estimate the distribution of interarrival times. A necessary consideration when making inferences about the study population based on the observed times is the sampling plan; in particular, one in which enrollment can occur at any time in a gap interval including its start or end, and the enrollment date may differ among participants. After enrollment, generally participants are followed with cycle renewals recorded until censoring at the end of the study, after a fixed number of cycles, loss to follow-up, or occurrence of a competing event. Also, the gap time can be truncated.

For example, the process underlying the periodic shedding of the lining of the uterus known as menstruation is a stationary renewal process in that the occurrence of bleeding events is cyclic over time. The interarrival time of menstrual bleeding events, or menstrual cycle length, is an important determinant of risk for several chronic diseases afflicting females including heart disease and breast cancer; and it is also an important component in assessing female fecundity (see e.g. Jensen *and others*, 1999). It is typically defined as the number of days from the first day of menstrual bleeding (the day marking the last menstrual period, LMP) to the day preceding the next menstrual cycle, with care taken to distinguish menstrual bleeding from episodic bleeding. Cycle length is commonly measured in time-to-pregnancy studies either via retrospective recall or more reliably in prospective designs using daily diaries of intensity of bleeding along with hormonal monitoring. In this paper, we will analyze cycle lengths from a comprehensive prospective pregnancy study, the Longitudinal Investigation of Fertility and the Environment (LIFE) Study (Buck Louis, Schisterman *and others*, 2011) in which repeated measurements of menstrual cycle length and couple’s preconception exposure to persistent environmental chemicals were collected. Research goals include estimating the distribution of menstrual cycle length accounting for covariates and residual variation, and using the estimated distribution to predict time-to-pregnancy. An initial step is to estimate the distribution of menstrual cycle length, both in a population of women and for an individual woman.

A complicating feature of the LIFE and other such studies of renewal processes is that in order to efficiently sample couples attempting to become pregnant, enrollments occur between renewals as opposed to waiting for the next bleeding event. Furthermore, the interarrival time is potentially right censored by pregnancy or exit from the study resulting in four different combinations of truncation and censoring relevant to the observed enrollment menstrual cycle lengths (more generally, the gap times) as described by Gill and Keiding (2010). In modeling the distribution of (potentially censored) enrollment lengths, we account for two features of this sampling process; length-bias and selection effects. Length-bias relates to the process by which an announcement of a study has a probability of “intersecting” a menstrual cycle length that is proportional to its length. Due to length-biased sampling, the distribution of enrollment lengths is biased in the positive direction compared with the population distribution. Selection relates to the possible dependence of the probability that the couple will decide in favor of enrollment on the time from LMP until they learn of the study (the backward recurrence time), resulting in effects that could be positive or negative; thus, adding to or canceling out the length-bias.

A common and possibly unbiased approach for estimating the menstrual time distribution is to circumvent these challenges by excluding the enrollment cycle lengths and analyzing only post-enrollment cycle lengths (see, e.g. Murphy *and others*, 1995), conceding some loss of information. However, in some applications the additional information provided by the enrollment data can be important in detecting subtle covariate effects and the selection effects may in themselves be of interest. Moreover, in prospective pregnancy studies, the most fecund women may have very few cycles under observation, if any, post-enrollment.

Several authors have developed methods for estimating the unbiased distribution of survival times from length-biased or left-truncated data (see, e.g. Asgharian *and others*, 2002; Cox, 1969; Huang and Qin, 2011; Vardi, 1989; Wang, 1989, 1991; Wang *and others*, 1986; Winter and Foldes, 1988), which would allow for inclusion of the enrollment menstrual cycle lengths. Building upon this body of work, we propose an approach for estimating both the probability of enrollment as a function of time from LMP and the distribution of menstrual cycle length for the study population, adjusting for length-bias and selection effects. In Sections 2–5, we formalize the issue and develop our estimation approach with a focus on enrollment data. In Sections 6 and 7, we present results from a simulation study and analysis of the distribution of menstrual cycle length for the LIFE study population, respectively. We conclude with a discussion of remaining challenges and impact on analysis and design of longitudinal studies of recurrent events.

## Estimating the distribution of menstrual cycle length

Our aim is to estimate the distribution of menstrual cycle length for a population focusing on the role of the enrollment cycle data, as the distribution estimated from observations of this cycle can be influenced by the way in which these cycles are ascertained. We assume that the enrollment cycle lengths for different individuals are independent and identically distributed. The sampling (enrollment process) induces a distribution for the initial observed cycle length that could potentially be different from that for the population. Specifically, we account for length-bias and selection effects on the enrollment cycle length distribution so that inferences relate to a reference population. We first consider the uncensored case and then generalize.

### Notation$$,$$ sampling plan and consequences

Let $$\tilde {Y} \sim f(\cdot), 0 < \tilde {Y} < \infty $$ with associated survivor function $$S(y)=\int ^{\infty }_{y} f(u)\,{\mathrm {d}} u$$ and cumulative distribution function (CDF) $$F(y)=\int ^y_0 f(u)\,{\mathrm {d}} u$$. Without loss of generality, assume $$\tilde {Y} < V < \infty $$, where $$V$$ is a finite upper bound. Assume that the sequence $$(\tilde {Y}_1, \tilde {Y}_2, \ldots)$$, indexed by $$m$$, starts at some origin ($$T_0$$) and continues over time with $$T_m = T_0 +\tilde {Y}_1 + \tilde {Y}_2 + \cdots + \tilde {Y}_m$$. Consider the following sampling plan for a longitudinal study:

An announcement of study availability is issued at calendar time $$t^*$$ and it “intersects” cycle length $$\tilde {Y}_{m(t^*)}$$ with $$m(t^*) = \{m: T_{m-1} < t^* \le T_m\}$$ at $$\tilde {B}$$, the backwards recurrence time. That is, $$\tilde {B}$$ is the time from $$T_{m-1}$$ to $$t^*$$ and can take on values in the range $$0\le \tilde {B} < V$$. In addition, let,

– $$\lambda (\tilde {Y}) = $$ pr(the announcement “intersects” a cycle of length $$\tilde {Y})$$.

– $$\tilde A$$ be the residual time from enrollment to the next event, so $$\tilde {Y}_{m(t^*)} = \tilde {B}+\tilde A$$. We consider the case where both $$\tilde {B}$$ and $$\tilde A$$ are known. As examples, Keiding

*and others*(2002) and Keiding*and others*(2012) studied estimation of the distribution of time-to-pregnancy in the current-duration design when only $$\tilde {B}$$ was known; and Brookmeyer and Gail (1987) studied bias in estimation of the distribution of time to disease when $$\tilde {B}$$ was unknown, but $$\tilde A$$ was known.

The potential participant (the couple) decides on enrolling in the study. Let $$R=1$$ if the participant enrolls and $$R=0$$ otherwise. If the participant enrolls, the enrollment cycle length is $$\tilde {Y}_{m(t^*)}$$.

– Assume that $$t^*$$ is stochastically independent of the $$T_m$$ sequence. As the underlying process $$\tilde {Y}$$ is a renewal process, using the theorem proved by Winter (1989) and details included in Section 1.1 of supplementary material available at Biostatistics online, $$[\tilde {B} \mid \tilde {Y}_{m(t^*)}] \sim \mathcal {U}(0,\tilde {Y}_{m(t^*)}),\ 0\le \tilde {B} < \tilde {Y} < V$$.

– A potential participant knows only the $$\tilde {B}$$ value and not $$\tilde {Y}_{m(t^*)}$$, and so define the probability of enrollment, $$\pi (\tilde {B})= {\mathrm {pr}}(R=1 \mid \tilde {Y}_{m(t^*)},\tilde {B})={\rm pr}(R=1 \mid \tilde {B})$$.

A $$\lambda (\tilde {Y})\propto \tilde {Y}$$ produces standard length-bias analogous to the waiting time paradox (see, Feller, 1971) or, for example, in Aalen and Husebye (1991) pertaining to the cyclic movements in the small bowel during the fasting state. Just as longer bus waiting times are more likely to be intersected by an individual arriving at a bus stop, longer cycle lengths are more likely to be intersected by the announcement of the study than are shorter. This length-biased sampling results in a distribution of enrollment cycle lengths that is stochastically larger than the population distribution. Further, a non-constant $$\pi (\cdot)$$ induces additional selection effects and, depending on the shape of $$\pi (\cdot)$$, can induce a stochastically larger or smaller distribution for an enrollment cycle.

In modeling the enrollment cycle length, one must account for the aforementioned length-bias and selection effects on the first observation of the underlying renewal process. Our focus is on estimating the distribution of $$\tilde {Y}$$ using a likelihood that adjusts for the enrollment sampling process and henceforth we consider only enrollment lengths.

### Modeling

If $$\lambda (\tilde {Y}) \equiv \bar {\lambda }$$ and $$\pi (\tilde {B}) \equiv \bar {\pi }$$, then neither length-bias nor selection effects operate and standard modeling is valid. If $$\lambda (\tilde {Y})\propto \tilde {Y}$$ and $$\pi (\tilde {B}) \equiv \bar {\pi }$$, then $$\tilde {Y}_{m(t^*)}$$ is distributed according to the length-biased distribution: $$f(y)y\mu ^{-1}$$, where $$\mu = E(\tilde {Y})$$ (see, Cox (1962)). If, in addition, $$\pi (\tilde {B})$$ depends on $$\tilde {B}$$, then $$\tilde {Y}_{m(t^*)}$$ is distributed as a more generally weighted version of $$f(\cdot)$$. Consequently, statistical models must account for the enrollment sampling plan by conditioning on $$R=1$$. To simplify notation, henceforth we drop the “ $$\tilde {}\ $$” notation and subscript $$m(t^*)$$, let $$[Y, B ] \stackrel {\cal {D}}{=} [\tilde {Y}_{m(t^*)}, \tilde {B} \mid R = 1]$$ and denote this enrollment distribution by $$g$$. We have the joint distribution

## Parameter estimation

Assume that $$f(\cdot) = f(\cdot \mid \theta)$$, and that recruitment for the longitudinal study continues until $$n$$ participants are enrolled. At enrollment, participants report the LMP of the current cycle, so observed data for participant $$i$$ are $$(b_i, y_i)$$. The full likelihood is the product of the conditional likelihood ($$[y \mid b]$$) and the marginal likelihood for $$B$$. Specifically, with $$f(\cdot) = f(\cdot \mid \theta)$$, assuming that the functions $$\lambda (\cdot)$$ and $$\pi (\cdot)$$ are known,

We focus on estimating the survivor (tail) function $$S(y) = \int ^\infty _y f(u)\,{\mathrm {d}} u$$ or in the parametric case $${ S(y \mid \theta)=\int ^\infty _y f(u \mid \theta)\,{\mathrm {d}} u}$$. Under the renewal process sampling plan in Section 2.1, we henceforth assume that $$\lambda (y)\propto y$$. If $$\pi (\cdot)$$ can be estimated, it is straightforward to use the full likelihood. The conditional likelihood ($$\mathcal {L}_c$$) does not depend on $$\pi (\cdot)$$ and so if this selection function is not of interest, inferences can be based on (3.2). Including the marginal likelihood ($$\mathcal {L}_m$$) may provide a notable increment in information on $$\theta $$, but it does depend on $$\pi (\cdot)$$. As (2.2) shows, $$\pi (\cdot)$$ and $$f$$ interact, and a recursive approach is effective.

### Recursive estimation

The following recursive algorithm requires $$g_B$$ (equivalently the CDF $$G_B$$). Using the observed backward recurrence times, $$g_B$$ can be estimated by the empirical probability mass function (equivalently $$G_B$$ by the empirical distribution function). Alternatively, a smoothed estimate of $$g_B$$, can be obtained using adaptive bandwidth kernel density estimation (see, e.g. Wand and Jones, 1994). A two-stage algorithm for estimating $$\pi (\cdot)$$ and $$S(\cdot)$$ proceeds as follows:

Conditioning on $$B$$, let $$\hat {S}^{(0)}(\cdot)$$ be a non-parametric estimate of $$S(\cdot)$$ based on the conditional likelihood (see, e.g. Winter and Foldes, 1988). For parametric estimation, equivalently let $$\hat {\theta }^{(0)}$$ be the estimate of $$\theta $$ based on maximizing $$\mathcal {L}_c(\theta ; \textbf {y} \mid \textbf {b}, \lambda)$$ in (3.2).

Using the relation between $$\pi (\cdot)$$ and $$g_B$$ in (2.2), obtain $$\hat {\pi }^{(1)}(\cdot)$$ (only the shape is needed), with $$S(\cdot) = \hat {S}^{(0)}(\cdot)$$.

Let $$\hat {S}^{(\nu)}(\cdot) $$ be the non-parametric maximum likelihood estimate using the full likelihood, with $$\pi (\cdot) = \hat {\pi }^{(\nu)}(\cdot)$$. Equivalently, for parametric estimation let $$\hat {\theta }^{(\nu)}$$ be the maximum likelihood estimate (MLE) using $$\mathcal {L}(\theta ; \textbf {y} \mid \textbf {b}, \lambda)$$ in (3.1).

Using relation (2.2), update $$\hat {\pi }^{(\nu +1)}(\cdot)$$ (up to a constant of proportionality) with $$S(\cdot) = \hat {S}^{(\nu)}(\cdot)$$.

Iterate between steps 3 and 4 until convergence criterion is met.

In Sections 1.2 and 1.3 of supplementary material available at Biostatistics online, we provide a proof of the existence and uniqueness of the solution for $$\pi $$.

### Non-parametric estimation of $$S(y)$$

Assuming $$\lambda (y)\propto y$$ and $$\pi (b)\equiv \bar {\pi }$$, the inverse probability weighted empirical estimate of $$S(y)$$ derived by Cox (1969) for uncensored lengths $$(y_i; i=1,\ldots ,n)$$ sampled from $$G_Y$$ is

For right-censored data, assuming $$\pi (\cdot)\equiv \bar {\pi }$$, several non-parametric approaches for estimating $$S(\cdot)$$ accounting for length-bias or left-truncation have been developed (Asgharian *and others*, 2002; Huang and Qin, 2011; Vardi, 1989). In order to relax the assumption on $$\pi (\cdot)$$, we note that conditioning on $$B$$ the likelihood does not depend on $$\pi (\cdot)$$. Methods for estimating $$S(\cdot)$$ based on $$\mathcal {L}_c$$ have been developed by many including Keiding and Gill (1990), Wang (1991), Wang *and others* (1986), and Winter and Foldes (1988). When estimation of $$\pi (\cdot)$$ is not of interest, these methods allow for non-parametric estimation of $$S(\cdot)$$, but may lead to a loss of efficiency when compared with using the full likelihood. In the simulation study (see Section 6), we do not find evidence of an efficiency loss.

### Semi-parametric estimation of $$S(y\mid \theta)$$

In the semi-parametric approach, we assume $$\tilde {Y}\sim f(\cdot ; \theta)$$; however, we do not make any parametric assumptions about $$\pi (\cdot)$$. The recursive algorithm given in Section 3.1 may be used to estimate $$\pi (\cdot)$$ and $$S(y \mid \theta) = \int _y^\infty f(u \mid \theta)\,{\mathrm {d}} u$$ with the adaptation in step 3 of producing $$\hat {\theta }^{(\nu)}$$ from the full likelihood written in terms of the current estimate of $$\pi (\cdot)=\hat {\pi }^{(\nu)}(\cdot)$$. As we will show in Section 4.1, the semi-parametric approach can accommodate censoring. Bootstrap bias-corrected and accelerated ($${\rm BC}_a$$) confidence intervals (Efron, 1987) can be constructed for $$\theta $$ that account for the uncertainty in estimating $$\pi (\cdot)$$.

## Censoring

To accommodate censoring, let $$\tilde {C}$$ denote the censoring time defined from the most recent event (LMP) to the censoring date. Under the sampling plan detailed in Section 2.1, length-bias operates on the enrollment cycle length and also on the enrollment cycle censoring time, when defined from LMP. Consequently, length-bias operates on the available length: $$\tilde {Y}\wedge \tilde {C}$$. For the enrollment cycle, we eliminate the “$$\ \tilde {}\ $$” notation and write $$C = B + V$$, where V is the residual time from enrollment date to censoring date, and write $$X=Y\wedge C$$ as the length from LMP to the next menstrual period or censoring, with event indicator $$\delta =I(Y\wedge C)$$. As Asgharian *and others* (2002), Huang and Qin (2011), and Wang (1991) among others noted, $$B$$ is the initial segment of both $$Y$$ and $$C$$, so in general $$Y$$ and $$C$$ are dependent. We assume the standard relations for left truncated, right-censored random variables, that $$[A \perp V \mid B]$$ (conditional independence) and the conditional distribution of $$V$$ given $$B$$ is non-informative regarding $$f$$, and so $$f_{V\mid B}$$ does not depend on $$\theta $$.

As before, recruitment continues until $$n$$ couples are enrolled, with enrollment data $$(b_i,x_i,\delta _i)$$; $$i=1,\ldots ,n$$. Under the conditional independence and non-informative assumptions, the conditional likelihood of $$X$$ given $$B$$ is (leaving out $$\theta $$ on the right-hand side)

### Semi-parametric estimation of $$S(y\mid \theta)$$ in the censored case

We find the MLE $$\hat \theta $$ and use $$S(y \mid \hat {\theta }) = \int ^\infty _y f(u;\hat {\theta })\,{\mathrm {d}} u,\ y>0$$. For known $$\pi (\cdot)$$, maximum likelihood methods can be used to estimate $$\theta $$, $$S(\cdot \mid \theta)$$ and other functions of it, and the likelihood ratio can be used to calculate confidence regions. However, $$\pi (\cdot)$$ is usually not known and a recursive algorithm similar to that in Section 3.1 is necessary. Starting with an estimate of $$g_B(b)$$, proceed as follows: (1) let $$\hat {\theta }^{(0)}$$ be the MLE using the conditional likelihood in (4.1) so, $$\hat S^{(0)} = {S}(y \mid \hat {\theta }^{(0)})=\int ^\infty _y f(u\mid \hat {\theta }^{(0)})\,{\mathrm {d}} u$$; (2) estimate the shape of $$\pi (b)$$ by $$\hat {\pi }^{(1)}(b) \propto {\hat {g}_{B}(b)}/{{S}(b \mid \hat {\theta }^{(0)})}$$; (3) let $$\hat {\theta }^{(\nu)}$$ be the MLE using the full likelihood in (4.2) with $$\pi (\cdot)=\hat {\pi }^{(\nu)}(\cdot)$$; (4) update $$\hat {\pi }^{(\nu +1)}(b) \propto {\hat {g}_{B}(b)}/{{S}(b \mid \hat {\theta }^{\nu)})}$$; and (5) iterate between steps (3) and (4) until convergence criterion is met. Use bootstrap $${\mathrm {BC}}_a$$ confidence intervals (Efron, 1987) to account for uncertainty in estimating $$\pi (\cdot)$$.

## Incorporating a random effect and a covariate

In many applications, it is desirable to account for observed and unobserved residual heterogeneity; therefore, we propose the following model for the observed menstrual cycle length for couple $$i:$$

Our goal is to use the enrollment cycle data to estimate the marginal distribution of menstrual cycle length for the reference population with covariate $$Z=z$$. For the enrollment cycle, we eliminate the “$$\ \tilde {}\ $$” notation and assume as in Section 4 that length-bias operates on $$X_i=Y_i\wedge C_i$$ with $$Y_i$$ modeled as in (5.1). Since the random effects are not observed, we assume $$\pi (\cdot)$$ does not depend on $$W_i$$, and further we restrict to the case where $$\pi (\cdot)$$ does not explicitly depend on $$z_i$$. Incorporating the covariate, the observed enrollment data consist of $$(b_i,x_i,\delta _i,z_i);i=1,\ldots ,n$$. We assume non-informative and independent censoring conditional on $$Z_i=z_i$$ and $${W_i=w_i}$$. Under these assumptions, the observed data conditional likelihood of $$X$$ given $$(B,Z)$$ is

As before, note that $$\mathcal {L}_c$$ in (5.2) does not depend on $$\pi (\cdot)$$. Therefore, one could estimate $$(\theta ,\alpha , \beta)$$ using a maximum likelihood approach based on it, and we compare the estimates and standard errors obtained using $$\mathcal {L}_c$$ to those based on $$\mathcal {L}$$ in Sections 6 and 7.

### Semi-parametric estimation of $$S(y\mid z;\theta ,\alpha , \beta)$$

To extend the semi-parametric approach to incorporate a covariate and random effect, we assume a parametric distribution for $$\tilde {Y}_0$$; but as before, we do not make any parametric assumptions about the form of $$\pi (\cdot)$$. Starting with an estimate of $$g_B(b\mid z)$$ (as shown in Section 6, the empirical probability mass function works well), proceed as follows:

Let $$(\hat {\theta }^{(0)},\hat {\alpha }^{(0)},\hat {\beta }^{(0)})$$ be the MLE using the conditional likelihood in (5.2) so,

\[ {S}^{(0)}(y \mid z;\hat{\theta}^{(0)},\hat{\alpha}^{(0)},\hat{\beta}^{(0)})=\int^\infty_y f(u\mid z;\hat{\theta}^{(0)},\hat{\alpha}^{(0)},\hat{\beta}^{(0)})\,{\rm d} u. \]- \[ \hat{\pi}^{(1)}(b\mid z) \propto \frac{\hat{g}_{B\mid Z}(b\mid z)}{{S}^{(0)}(y \mid z;\hat{\theta}^{(0)},\hat{\alpha}^{(0)},\hat{\beta}^{(0)})}. \]
Let $$(\hat {\theta }^{(\nu)},\hat {\alpha }^{(\nu)},\hat {\beta }^{(\nu)})$$ be the MLE using the full likelihood in (5.3) with $$\pi (\cdot \mid z)=\hat {\pi }^{(\nu)}(\cdot \mid z)$$.

- \[ \hat{\pi}^{(\nu+1)}(b\mid z) \propto \frac{\hat{g}_{B\mid Z}(b\mid z)}{{S}^{(\nu)}(y \mid z;\hat{\theta}^{(\nu)},\hat{\alpha}^{(\nu)},\hat{\beta}^{(\nu)})}. \]
Iterate between steps (3) and (4) until convergence criterion is met.

In Sections 6 and 7, we consider the case where $$Z$$ is a binary covariate, which is useful when comparing the distribution of menstrual cycle length for two populations.

## Simulation studies

We describe and report a simulation study examining the semi-parametric estimation approach described in Section 5 in the setting in which we model the observed menstrual cycle length dependent on an unobserved random effect $$W$$ and an observed time-independent binary covariate $$z$$ (see model (5.1)). Our primary interest is in the estimation of the marginal means and standard deviations: $$E(\tilde {Y} \mid z=0)$$, $$E(\tilde {Y} \mid z=1)$$, $${\rm SD}(\tilde {Y} \mid z=0)$$, and $${\mathrm {SD}}(\tilde {Y} \mid z=1)$$, which are functions of the parameters $$(\theta ,\alpha ,\beta)$$. We evaluate performance of maximum likelihood estimators when proper account is taken of length-bias and selection effects. We focus on censored data using the full likelihood in (5.3).

Using the data generation scheme described in Section 2 of supplementary material available at Biostatistics online, we generate the simulated data consisting of 2000 samples, each with 500 participants enrolled using quota sampling, a sampling scheme in which participants are enrolled until a quota is reached. The first 250 participants enrolled of each sample compose a small sample size case. In the data generation scheme, the choice of parametric family and the values of the parameters mimic the LIFE data analysis in Section 7.

We study the performance of the proposed recursive semi-parametric algorithm for empirical estimation of $$\pi (b\mid z)$$ and parametric estimation of the distribution of $$\tilde {Y}$$, under model (5.1) in which $$\tilde {Y_0}$$ is assumed to be a normal-Gumbel mixture and $$W\sim {\mathrm {Gamma}}(\alpha ,1/\alpha)$$. In particular, we are interested in two data generating models for $$\pi $$, constant as found in the data analysis, and decreasing with $$b$$. The bias, standard errors, mean squared error, and coverage probabilities for the estimators of $$E(\tilde {Y} \mid z=0)$$, $$E(\tilde {Y} \mid z=1)$$, $${\mathrm {SD}}(\tilde {Y} \mid z=0)$$, $${\mathrm {SD}}(\tilde {Y} \mid z=1)$$, $$\beta $$, and $${\mathrm {SD}}(W)$$ are summarized in Table 1. For both cases of $$\pi $$, the bias of each of the marginal parameters is approximately 0. To account for the uncertainty in estimating both $$\pi $$ and the model parameters, we calculated bootstrap $${\mathrm {BC}}_a$$ intervals with 1000 bootstrap replicates. The coverage probability of the nuisance parameter, $${\mathrm {SD}}(W)$$, is less than the nominal 95%; however, the coverage probability for each of the marginal parameters of interest is very good even for small samples. Using the same simulated data, we also studied the performance of maximum likelihood estimators based on the conditional likelihood, $$\mathcal {L}_c$$ in (5.2) which does not require estimation of $$\pi $$ (see Table 2). Under both generating models for $$\pi $$ and for each sample size, we did not observe a notable difference between the two approaches in any of the performance measures.

Setting | Parameter | True | Bias | Sim SE | MSE | $${\mathrm {BC}}_a$$ CP |
---|---|---|---|---|---|---|

Const $$\pi (b)$$$$(n=250)$$ | $$E(Y\mid z=0)$$ | 31.95 | 0.03 | 0.81 | 0.66 | 92.80 |

$$E(Y\mid z=1)$$ | 30.39 | $$-0.04$$ | 0.83 | 0.70 | 93.05 | |

$${\mathrm {SD}}(Y\mid z=0)$$ | 9.70 | 0.02 | 1.08 | 1.17 | 94.45 | |

$${\mathrm {SD}}(Y\mid z=1)$$ | 9.23 | 0.00 | 1.04 | 1.08 | 94.35 | |

$$\beta $$ | $$-0.05$$ | 0.00 | 0.03 | 0.00 | 93.05 | |

$${\mathrm {SD}}(W)$$ | 0.10 | 0.00 | 0.00 | 0.00 | 78.70 | |

Const $$\pi (b)$$$$(n=500)$$ | $$E(Y\mid z=0)$$ | 31.95 | 0.01 | 0.55 | 0.31 | 94.90 |

$$E(Y\mid z=1)$$ | 30.39 | $$-0.04$$ | 0.56 | 0.31 | 91.50 | |

$${\mathrm {SD}}(Y\mid z=0)$$ | 9.70 | 0.02 | 0.72 | 0.52 | 94.75 | |

$${\mathrm {SD}}(Y\mid z=1)$$ | 9.23 | 0.00 | 0.69 | 0.48 | 94.65 | |

$$\beta $$ | $$-0.05$$ | 0.00 | 0.02 | 0.00 | 90.70 | |

$${\mathrm {SD}}(W)$$ | 0.10 | 0.00 | 0.00 | 0.00 | 75.35 | |

Decr $$\pi (b)$$$$(n=250)$$ | $$E(Y\mid z=0)$$ | 31.95 | 0.04 | 0.76 | 0.58 | 94.75 |

$$E(Y\mid z=1)$$ | 30.39 | $$-0.03$$ | 0.76 | 0.58 | 94.90 | |

$${\mathrm {SD}}(Y\mid z=0)$$ | 9.70 | 0.02 | 1.10 | 1.21 | 95.70 | |

$${\mathrm {SD}}(Y\mid z=1)$$ | 9.23 | 0.00 | 1.05 | 1.11 | 95.35 | |

$$\beta $$ | $$-0.05$$ | 0.00 | 0.02 | 0.00 | 93.35 | |

$${\mathrm {SD}}(W)$$ | 0.10 | 0.00 | 0.00 | 0.00 | 81.15 | |

Decr $$\pi (b)$$$$(n=500)$$ | $$E(Y\mid z=0)$$ | 31.95 | 0.01 | 0.53 | 0.29 | 95.45 |

$$E(Y\mid z=1)$$ | 30.39 | $$-0.03$$ | 0.52 | 0.27 | 92.50 | |

$${\mathrm {SD}}(Y\mid z=0)$$ | 9.70 | 0.01 | 0.77 | 0.59 | 95.40 | |

$${\mathrm {SD}}(Y\mid z=1)$$ | 9.23 | 0.00 | 0.73 | 0.54 | 95.25 | |

$$\beta $$ | $$-0.05$$ | 0.00 | 0.02 | 0.00 | 92.70 | |

$${\mathrm {SD}}(W)$$ | 0.10 | 0.00 | 0.00 | 0.00 | 73.45 |

Setting | Parameter | True | Bias | Sim SE | MSE | $${\mathrm {BC}}_a$$ CP |
---|---|---|---|---|---|---|

Const $$\pi (b)$$$$(n=250)$$ | $$E(Y\mid z=0)$$ | 31.95 | 0.03 | 0.81 | 0.66 | 92.80 |

$$E(Y\mid z=1)$$ | 30.39 | $$-0.04$$ | 0.83 | 0.70 | 93.05 | |

$${\mathrm {SD}}(Y\mid z=0)$$ | 9.70 | 0.02 | 1.08 | 1.17 | 94.45 | |

$${\mathrm {SD}}(Y\mid z=1)$$ | 9.23 | 0.00 | 1.04 | 1.08 | 94.35 | |

$$\beta $$ | $$-0.05$$ | 0.00 | 0.03 | 0.00 | 93.05 | |

$${\mathrm {SD}}(W)$$ | 0.10 | 0.00 | 0.00 | 0.00 | 78.70 | |

Const $$\pi (b)$$$$(n=500)$$ | $$E(Y\mid z=0)$$ | 31.95 | 0.01 | 0.55 | 0.31 | 94.90 |

$$E(Y\mid z=1)$$ | 30.39 | $$-0.04$$ | 0.56 | 0.31 | 91.50 | |

$${\mathrm {SD}}(Y\mid z=0)$$ | 9.70 | 0.02 | 0.72 | 0.52 | 94.75 | |

$${\mathrm {SD}}(Y\mid z=1)$$ | 9.23 | 0.00 | 0.69 | 0.48 | 94.65 | |

$$\beta $$ | $$-0.05$$ | 0.00 | 0.02 | 0.00 | 90.70 | |

$${\mathrm {SD}}(W)$$ | 0.10 | 0.00 | 0.00 | 0.00 | 75.35 | |

Decr $$\pi (b)$$$$(n=250)$$ | $$E(Y\mid z=0)$$ | 31.95 | 0.04 | 0.76 | 0.58 | 94.75 |

$$E(Y\mid z=1)$$ | 30.39 | $$-0.03$$ | 0.76 | 0.58 | 94.90 | |

$${\mathrm {SD}}(Y\mid z=0)$$ | 9.70 | 0.02 | 1.10 | 1.21 | 95.70 | |

$${\mathrm {SD}}(Y\mid z=1)$$ | 9.23 | 0.00 | 1.05 | 1.11 | 95.35 | |

$$\beta $$ | $$-0.05$$ | 0.00 | 0.02 | 0.00 | 93.35 | |

$${\mathrm {SD}}(W)$$ | 0.10 | 0.00 | 0.00 | 0.00 | 81.15 | |

Decr $$\pi (b)$$$$(n=500)$$ | $$E(Y\mid z=0)$$ | 31.95 | 0.01 | 0.53 | 0.29 | 95.45 |

$$E(Y\mid z=1)$$ | 30.39 | $$-0.03$$ | 0.52 | 0.27 | 92.50 | |

$${\mathrm {SD}}(Y\mid z=0)$$ | 9.70 | 0.01 | 0.77 | 0.59 | 95.40 | |

$${\mathrm {SD}}(Y\mid z=1)$$ | 9.23 | 0.00 | 0.73 | 0.54 | 95.25 | |

$$\beta $$ | $$-0.05$$ | 0.00 | 0.02 | 0.00 | 92.70 | |

$${\mathrm {SD}}(W)$$ | 0.10 | 0.00 | 0.00 | 0.00 | 73.45 |

Data generated using algorithm enumerated in Section $$2$$ of supplementary material available at Biostatistics online with two cases for $$\pi (b):$$ constant $$(\zeta =0)$$ and decreasing $$(\zeta = 5.5)$$ with $$b$$. Number of $$\mbox {simulations} = 2000$$. Standard errors based on simulation estimates (Sim SE). Bootstrap $${\mathrm { BC}}_a$$ CI constructed using $$1000$$ bootstrap replicates. Two sample sizes:$$250$$ and $$500$$.

Setting | Parameter | True | Bias | Sim SE | MSE | $${\mathrm {BC}}_a$$ CP |
---|---|---|---|---|---|---|

Const $$\pi (b)\ (n=250)$$ | $$E(Y\mid z=0)$$ | 31.95 | 0.04 | 0.81 | 0.66 | 92.85 |

$$E(Y\mid z=1)$$ | 30.39 | 0.01 | 0.81 | 0.66 | 93.25 | |

$${\mathrm {SD}}(Y\mid z=0)$$ | 9.70 | 0.02 | 1.08 | 1.17 | 94.50 | |

$${\mathrm {SD}}(Y\mid z=1)$$ | 9.23 | 0.02 | 1.04 | 1.08 | 94.30 | |

$$\beta $$ | $$-0.05$$ | 0.00 | 0.03 | 0.00 | 95.25 | |

$${\mathrm {SD}}(W)$$ | 0.10 | 0.00 | 0.00 | 0.00 | 78.95 | |

Const $$\pi (b)\ (n=500)$$ | $$E(Y\mid z=0)$$ | 31.95 | 0.01 | 0.55 | 0.31 | 95.00 |

$$E(Y\mid z=1)$$ | 30.39 | $$-0.01$$ | 0.54 | 0.29 | 95.50 | |

$${\mathrm {SD}}(Y\mid z=0)$$ | 9.70 | 0.02 | 0.72 | 0.52 | 94.70 | |

$${\mathrm {SD}}(Y\mid z=1)$$ | 9.23 | 0.01 | 0.69 | 0.47 | 95.05 | |

$$\beta $$ | $$-0.05$$ | 0.00 | 0.02 | 0.00 | 96.20 | |

$${\mathrm {SD}}(W)$$ | 0.10 | 0.00 | 0.00 | 0.00 | 72.25 | |

Decr $$\pi (b)\ (n=250)$$ | $$E(Y\mid z=0)$$ | 31.95 | 0.04 | 0.76 | 0.58 | 94.75 |

$$E(Y\mid z=1)$$ | 30.39 | $$-0.01$$ | 0.76 | 0.58 | 95.10 | |

$${\mathrm {SD}}(Y\mid z=0)$$ | 9.70 | 0.02 | 1.10 | 1.21 | 95.80 | |

$${\mathrm {SD}}(Y\mid z=1)$$ | 9.23 | 0.00 | 1.06 | 1.12 | 95.35 | |

$$\beta $$ | $$-0.05$$ | 0.00 | 0.02 | 0.00 | 95.00 | |

$${\mathrm {SD}}(W)$$ | 0.10 | 0.00 | 0.00 | 0.00 | 81.05 | |

Decr $$\pi (b)\ (n=500)$$ | $$E(Y\mid z=0)$$ | 31.95 | 0.01 | 0.53 | 0.29 | 95.65 |

$$E(Y\mid z=1)$$ | 30.39 | $$-0.01$$ | 0.52 | 0.27 | 95.85 | |

$${\mathrm {SD}}(Y\mid z=0)$$ | 9.70 | 0.01 | 0.77 | 0.59 | 95.25 | |

$${\mathrm {SD}}(Y\mid z=1)$$ | 9.23 | 0.00 | 0.73 | 0.54 | 95.15 | |

$$\beta $$ | $$-0.05$$ | 0.00 | 0.00 | 0.00 | 96.50 | |

$${\mathrm {SD}}(W)$$ | 0.10 | 0.00 | 0.02 | 0.00 | 72.25 |

Setting | Parameter | True | Bias | Sim SE | MSE | $${\mathrm {BC}}_a$$ CP |
---|---|---|---|---|---|---|

Const $$\pi (b)\ (n=250)$$ | $$E(Y\mid z=0)$$ | 31.95 | 0.04 | 0.81 | 0.66 | 92.85 |

$$E(Y\mid z=1)$$ | 30.39 | 0.01 | 0.81 | 0.66 | 93.25 | |

$${\mathrm {SD}}(Y\mid z=0)$$ | 9.70 | 0.02 | 1.08 | 1.17 | 94.50 | |

$${\mathrm {SD}}(Y\mid z=1)$$ | 9.23 | 0.02 | 1.04 | 1.08 | 94.30 | |

$$\beta $$ | $$-0.05$$ | 0.00 | 0.03 | 0.00 | 95.25 | |

$${\mathrm {SD}}(W)$$ | 0.10 | 0.00 | 0.00 | 0.00 | 78.95 | |

Const $$\pi (b)\ (n=500)$$ | $$E(Y\mid z=0)$$ | 31.95 | 0.01 | 0.55 | 0.31 | 95.00 |

$$E(Y\mid z=1)$$ | 30.39 | $$-0.01$$ | 0.54 | 0.29 | 95.50 | |

$${\mathrm {SD}}(Y\mid z=0)$$ | 9.70 | 0.02 | 0.72 | 0.52 | 94.70 | |

$${\mathrm {SD}}(Y\mid z=1)$$ | 9.23 | 0.01 | 0.69 | 0.47 | 95.05 | |

$$\beta $$ | $$-0.05$$ | 0.00 | 0.02 | 0.00 | 96.20 | |

$${\mathrm {SD}}(W)$$ | 0.10 | 0.00 | 0.00 | 0.00 | 72.25 | |

Decr $$\pi (b)\ (n=250)$$ | $$E(Y\mid z=0)$$ | 31.95 | 0.04 | 0.76 | 0.58 | 94.75 |

$$E(Y\mid z=1)$$ | 30.39 | $$-0.01$$ | 0.76 | 0.58 | 95.10 | |

$${\mathrm {SD}}(Y\mid z=0)$$ | 9.70 | 0.02 | 1.10 | 1.21 | 95.80 | |

$${\mathrm {SD}}(Y\mid z=1)$$ | 9.23 | 0.00 | 1.06 | 1.12 | 95.35 | |

$$\beta $$ | $$-0.05$$ | 0.00 | 0.02 | 0.00 | 95.00 | |

$${\mathrm {SD}}(W)$$ | 0.10 | 0.00 | 0.00 | 0.00 | 81.05 | |

Decr $$\pi (b)\ (n=500)$$ | $$E(Y\mid z=0)$$ | 31.95 | 0.01 | 0.53 | 0.29 | 95.65 |

$$E(Y\mid z=1)$$ | 30.39 | $$-0.01$$ | 0.52 | 0.27 | 95.85 | |

$${\mathrm {SD}}(Y\mid z=0)$$ | 9.70 | 0.01 | 0.77 | 0.59 | 95.25 | |

$${\mathrm {SD}}(Y\mid z=1)$$ | 9.23 | 0.00 | 0.73 | 0.54 | 95.15 | |

$$\beta $$ | $$-0.05$$ | 0.00 | 0.00 | 0.00 | 96.50 | |

$${\mathrm {SD}}(W)$$ | 0.10 | 0.00 | 0.02 | 0.00 | 72.25 |

Data generated using algorithm enumerated in Section $$2$$ of supplementary material available at Biostatistics online with two cases for $$\pi (b):$$ constant $$(\zeta =0)$$ and decreasing $$(\zeta = 5.5)$$ with $$b$$. Number of $${\rm simulations} = 2000$$. Standard errors based on simulation estimates (Sim SE). Bootstrap $${\mathrm { BC}}_a$$ CI constructed using $$1000$$ bootstrap replicates. Two sample sizes:$$250$$ and $$500$$.

## Application to the LIFE study

We apply the proposed semi-parametric approach to estimate the enrollment probability function and the distribution of menstrual cycle length for the population studied in the LIFE Study, focusing on enrollment cycle data with adjustment for length-bias and selection features of the sampling plan. For this prospective pregnancy study, recruitment took place in the U.S. in Michigan and Texas in 2005–2009 with minimal criteria for eligibility listed as follows: female ages 18–44 years and male ages $$18+$$ years; in a committed relationship; ability to communicate in English or Spanish; menstrual cycle length between 21 and 42 days; no hormonal contraception injections during past year; and no sterilization procedures or physician diagnosed infertility. Prior to conception, 501 couples satisfying the eligibility criteria and interested in becoming pregnant were enrolled on a calendar date irrespective of the day of the woman’s menstrual cycle and followed until pregnancy, exit from the study, or censored at the completion of 12 menstrual cycles at-risk for pregnancy. Under these inclusion criteria and protocol, the sampling plan is independent of the participants’ cycle lengths. For further details, see the complete study protocol (Buck Louis, Schisterman *and others*, 2011).

The enrollment menstrual cycle length was observed as the sum of two segments: the backward recurrence time, which was recalled on the enrollment date as the time since LMP; and the residual time, which was prospectively observed from enrollment until the next menstrual bleeding event determined from the woman’s daily recorded level of bleeding or spotting ($$0 = {\mathrm {none}}$$ to $$4 = {\rm heavy}$$), using an algorithm to distinguish menstrual bleeding from non-menstrual bleeding. The residual time was censored if the couple exited the study for one or more of the following reasons: no longer wishing to participate, relocation, diagnosis of sterility, and non-compliance with study protocol; or if pregnancy was detected using Clearblue^{r} Easy pregnancy tests which detect levels of human chorionic gonadotropin over 25 mIU$$/$$mL. For couples becoming pregnant in the enrollment cycle, we would have ideally censored the residual time on the day on which the next menstrual bleeding is precluded by the absence of a decline in estrogen and progesterone. In the absence of knowledge of this date, we censored on the date of the first positive pregnancy date.

We consider the enrollment cycles of 467 couples for which both the backward recurrence time and residual time (or censoring time) could be determined, with total observed length in the range of (8, 90) days and backward recurrence time less than 50 days. Since a menstrual cycle length of 90 days is more than twice the study’s inclusion criteria of 42 days, we have excluded lengths outside this range as they are likely an artifact of a missing cycle stop/start. Furthermore, we have excluded 4 women because the onset date of the enrollment menstrual cycle could not be determined due to breastfeeding, miscarriage, or discontinuation of hormonal birth control. This restriction excludes women who recently were pregnant which may be important if cycle length is associated with pregnancy, but this is a very small number of exclusions. We excluded 6 women who exited the study before recording any bleeding information due to non-compliance or no longer wishing to participate. This exclusion is based on the assumption that these exits were not associated with menstrual cycle length.

We estimate the marginal distribution of menstrual cycle length using model (5.1) for the observed menstrual cycle length, with $$z=0$$ for the women aged $$< 30$$ years and $$z=1$$ for the women aged $$\ge 30$$ years, where 30 is the median age; and we assume $$W\sim {\rm Gamma}(\alpha ,1/\alpha)$$ so $$E(W)=1$$ and $${\rm Var}(W)=1/\alpha $$. Motivated by the statistical model for menstrual cycle length implemented by McLain *and others* (2012), we assume that $$\tilde {Y}_0$$ has a normal-Gumbel mixture distribution with unknown parameter vector: $$\theta =(q, \mu _N,\sigma _N,\mu _G,\sigma _G)$$, where $$q$$ is the proportion of normally distributed lengths and $$(\mu _N,\sigma _N)$$ and $$(\mu _G,\sigma _G)$$ are the mean and standard deviation of the normal and Gumbel distributions, respectively. As described in Section 3 of supplementary material available at Biostatistics online, we find this parametric choice for $$\tilde {Y}_0$$ fits the menstrual cycle length data better than the log-normal model. We assume $$\lambda (y)\propto y$$ and provide a check of this assumption as suggested by a reviewer in Section 4 of supplementary material available at Biostatistics online. Rather than assuming a model for $$\pi (\cdot \mid z)$$ in the full likelihood (see (5.3)), we use the semi-parametric approach proposed in Section 5.1 to estimate the shape of $$\pi (\cdot \mid z)$$ and $$(\theta ,\beta ,\alpha)$$.

In order to illustrate the effects of adjusting for the sampling plan, we show the marginal survival curves (Figure 1(a)) for women aged $$< 30$$ years, point estimates, and 95% confidence intervals (Table 3) with and without adjustment for length-bias and selection effects. The estimates of the unadjusted marginal survivor function and marginal means and standard deviations are located above those which have been adjusted, due to the over-representation of long cycles. The marginal survivor function that adjusts for both length-bias and selection effects lies in between the other curves, suggesting some cancelation of the length-bias by the selection effects. The same pattern was observed for the women aged $$\ge 30$$ years (not shown). The point estimate and confidence interval of the standard deviation of the random effect are small indicating that the residual variation is small. Of biological interest is that the estimated marginal survivor curve (see Figure 1(b)) and marginal mean and standard deviation for the women aged $$\ge 30$$ years is below that of women aged $$< 30$$ years which agrees with the findings of Harlow *and others* (2000). If estimation of $$\pi (\cdot \mid z)$$ is not of interest, we can alternatively estimate the parameters by maximizing the censored form of the conditional likelihood, $$\mathcal {L}_c$$, in (5.2). For these data, the point estimates and standard errors for the full and conditional approaches are approximately identical (not shown); however, this is not always the case as seen in the simulation results in Table 2. Additionally, in Section 5 of supplementary material available at Biostatistics online, we compare the unadjusted and adjusted estimated distribution of $$\tilde {Y}$$ based on enrollment cycle data to the estimated distribution based on first post-enrollment cycle data.

Parameter | Unadjusted | Adjusted for LB | Adjusted for LB and S |
---|---|---|---|

$$E(\tilde {Y} \mid {\rm age} <30)$$ | $$_{{32.55}\;\!}{33.76}_{\;{34.77}}$$ | $$_{{30.42}\;\!}{31.40}_{\;{32.31}}$$ | $$_{{31.01 }\;\!}{32.18}_{\;{33.39}}$$ |

$$E(\tilde {Y}\mid {\mathrm {age}} \ge 30)$$ | $$_{{31.30}\;\!}{32.51}_{\;{33.71}}$$ | $$_{{29.12}\;\!}{30.13}_{\;{31.00}}$$ | $$_{{29.91}\;\!}{31.07}_{\;{32.22}}$$ |

$${\mathrm {SD}}(\tilde {Y}\mid {\rm age} <30)$$ | $$_{{8.70}\;\!}{10.92}_{\;{12.30}}$$ | $$_{{7.59}\;\!}{8.56}_{\;{9.84}}$$ | $$_{{8.54}\;\!}{10.04}_{\;{11.98}}$$ |

$${\mathrm {SD}}(\tilde {Y}\mid {\rm age} \ge 30)$$ | $$_{{8.50}\;\!}{10.51}_{\;{11.99}}$$ | $$_{{7.31}\;\!}{8.21}_{\;{9.42}}$$ | $$_{{8.21}\;\!}{9.69}_{\;{11.66}}$$ |

$$\beta :{\mathrm {age}}$$ | $$_{{-0.08}\;\!}{-0.04}_{\;{-0.01}}$$ | $$_{{-0.08}\;\!}{-0.04}_{\;{-0.01}}$$ | $$_{{-0.07}\;\!}{-0.04}_{\;{-0.01}}$$ |

$$\alpha $$ | $$_{{82.69}\;\!}{85.24}_{\;{88.00}}$$ | $$_{{73.28}\;\!}{80.94}_{\;{84.91}}$$ | $$_{{71.41}\;\!}{82.88}_{\;{85.85}}$$ |

$${\mathrm {SD}}(W)$$ | $$_{{0.11}\;\!}{0.11}_{\;{0.11}}$$ | $$_{{0.11}\;\!}{0.11}_{\;{0.12}}$$ | $$_{{0.11}\;\!}{0.11}_{\;{0.12}}$$ |

$$q$$ | $$_{{0.54}\;\!}{0.64}_{\;{0.70}}$$ | $$_{{0.62}\;\!}{0.71}_{\;{0.78}}$$ | $$_{{0.60}\;\!}{0.70}_{\;{0.77}}$$ |

$$\mu _N$$ | $$_{{28.93}\;\!}{29.83}_{\;{31.11}}$$ | $$_{{29.00}\;\!}{29.72}_{\;{30.88}}$$ | $$_{{28.98}\;\!}{29.73}_{\;{30.70}}$$ |

$$\sigma _N$$ | $$_{{0.65}\;\!}{0.76}_{\;{1.35}}$$ | $$_{{0.70}\;\!}{0.83}_{\;{1.69}}$$ | $$_{{0.69}\;\!}{0.81}_{\;{1.72}}$$ |

$$\mu _G$$ | $$_{{37.54}\;\!}{40.72}_{\;{42.87}}$$ | $$_{{32.72}\;\!}{35.47}_{\;{38.66}}$$ | $$_{{34.13}\;\!}{37.78}_{\;{41.56}}$$ |

$$\sigma _G$$ | $$_{{10.31}\;\!}{14.59}_{\;{16.74}}$$ | $$_{{11.15}\;\!}{13.45}_{\;{16.79}}$$ | $$_{{12.39}\;\!}{15.49}_{\;{19.99}}$$ |

Parameter | Unadjusted | Adjusted for LB | Adjusted for LB and S |
---|---|---|---|

$$E(\tilde {Y} \mid {\rm age} <30)$$ | $$_{{32.55}\;\!}{33.76}_{\;{34.77}}$$ | $$_{{30.42}\;\!}{31.40}_{\;{32.31}}$$ | $$_{{31.01 }\;\!}{32.18}_{\;{33.39}}$$ |

$$E(\tilde {Y}\mid {\mathrm {age}} \ge 30)$$ | $$_{{31.30}\;\!}{32.51}_{\;{33.71}}$$ | $$_{{29.12}\;\!}{30.13}_{\;{31.00}}$$ | $$_{{29.91}\;\!}{31.07}_{\;{32.22}}$$ |

$${\mathrm {SD}}(\tilde {Y}\mid {\rm age} <30)$$ | $$_{{8.70}\;\!}{10.92}_{\;{12.30}}$$ | $$_{{7.59}\;\!}{8.56}_{\;{9.84}}$$ | $$_{{8.54}\;\!}{10.04}_{\;{11.98}}$$ |

$${\mathrm {SD}}(\tilde {Y}\mid {\rm age} \ge 30)$$ | $$_{{8.50}\;\!}{10.51}_{\;{11.99}}$$ | $$_{{7.31}\;\!}{8.21}_{\;{9.42}}$$ | $$_{{8.21}\;\!}{9.69}_{\;{11.66}}$$ |

$$\beta :{\mathrm {age}}$$ | $$_{{-0.08}\;\!}{-0.04}_{\;{-0.01}}$$ | $$_{{-0.08}\;\!}{-0.04}_{\;{-0.01}}$$ | $$_{{-0.07}\;\!}{-0.04}_{\;{-0.01}}$$ |

$$\alpha $$ | $$_{{82.69}\;\!}{85.24}_{\;{88.00}}$$ | $$_{{73.28}\;\!}{80.94}_{\;{84.91}}$$ | $$_{{71.41}\;\!}{82.88}_{\;{85.85}}$$ |

$${\mathrm {SD}}(W)$$ | $$_{{0.11}\;\!}{0.11}_{\;{0.11}}$$ | $$_{{0.11}\;\!}{0.11}_{\;{0.12}}$$ | $$_{{0.11}\;\!}{0.11}_{\;{0.12}}$$ |

$$q$$ | $$_{{0.54}\;\!}{0.64}_{\;{0.70}}$$ | $$_{{0.62}\;\!}{0.71}_{\;{0.78}}$$ | $$_{{0.60}\;\!}{0.70}_{\;{0.77}}$$ |

$$\mu _N$$ | $$_{{28.93}\;\!}{29.83}_{\;{31.11}}$$ | $$_{{29.00}\;\!}{29.72}_{\;{30.88}}$$ | $$_{{28.98}\;\!}{29.73}_{\;{30.70}}$$ |

$$\sigma _N$$ | $$_{{0.65}\;\!}{0.76}_{\;{1.35}}$$ | $$_{{0.70}\;\!}{0.83}_{\;{1.69}}$$ | $$_{{0.69}\;\!}{0.81}_{\;{1.72}}$$ |

$$\mu _G$$ | $$_{{37.54}\;\!}{40.72}_{\;{42.87}}$$ | $$_{{32.72}\;\!}{35.47}_{\;{38.66}}$$ | $$_{{34.13}\;\!}{37.78}_{\;{41.56}}$$ |

$$\sigma _G$$ | $$_{{10.31}\;\!}{14.59}_{\;{16.74}}$$ | $$_{{11.15}\;\!}{13.45}_{\;{16.79}}$$ | $$_{{12.39}\;\!}{15.49}_{\;{19.99}}$$ |

Model assumptions:$$\tilde {Y}=\tilde {Y}_0W\,{\rm e}^{\beta z},$$$$z=0$$(age $$<30$$),$$z=1$$ (age $$\ge 30$$),$$W\sim {\mathrm { Gamma}}(\alpha ,1/\alpha),$$$$\tilde {Y}_0 \sim qf_{\rm normal}(\mu _N,\sigma ^2_N)+(1-q)f_{\rm Gumbel}(\mu _G,\sigma ^2_G)$$. Adjustments for LB and S done using semi-parametric algorithm to estimate model parameters and $$\pi (\cdot \mid z)$$ assuming $$\lambda (y)\propto y$$. Point estimate with $${\mathrm {BC}}_a$$$$95\%$$ confidence limits displayed as: $$_{{Lo}\;\!}{\rm Est}_{\;{Up}}$$(Louis and Zeger, 2009). $${\mathrm { BC}}_a$$ confidence intervals were calculated using non-parametric bootstrapping with $$1000$$ replicates. Restricted to $$n=467$$ enrollment cycles of known observed length within $$(8, 90)$$ days and with backward recurrence time less than $$50$$ days.

Although the estimates shown in Figure 1 and Table 3 are based on an empirical estimate of $$\pi (b\mid z)$$, for the purpose of visualizing the enrollment probability function we display a smoothed version of the estimate for each age group in Figure 2 which was obtained using a kernel density smoothing approach with oversmoothed bandwidth selection (Wand and Jones, 1994), implemented in

## Discussion

We have considered the problem of estimating the distribution of menstrual cycle length from enrollment cycle data for which the distribution is potentially weighted by features of the sampling process, in particular length-bias and selection effects. Addressing these features in the likelihood, we developed a two-stage semi-parametric approach for estimating the population distribution along with the shape of the enrollment probability. Using the empirical probability mass function to estimate the distribution of backward recurrence times, we obtained approximately unbiased estimates of the parameters of the distribution and traded very little variance for estimating the shape of the enrollment probability.

In the analysis of the LIFE Study data, we showed that adjustment for both length-bias and selection effects leads to visible differences in parameter and survivor curves estimates. These differences may be especially important when making risk predictions for TTP or disease; or when studying covariate effects in studies where an effect size on the order of 1–3 days is not uncommon such as those investigating the effects of environmental pollutants on menstrual cycle length (see, e.g. Buck Louis, Rios *and others*, 2011).

This approach is particularly useful in settings where there is a desire to model the first gap time or an interest in the characteristics of the selection process. For example, one could compare the estimated enrollment probability across multiple studies or recruitment strategies and, if a significant difference is found, then account for selection effects to make inference on a reference population.

As is done in many studies involving a renewal process, the backward recurrence time was ascertained using self-report at enrollment. Since the inclusion criteria restricted to women with regular cycles, we expect the recall bias and uncertainty to be small. As is common with retrospective recalled information, we do observe some digit preference, in particular for an LMP of 7, 14, and 21 days before enrollment. In a more comprehensive analysis of these data, we could conduct a sensitivity analysis on digit preference generating distributions around those digits.

Focusing solely on enrollment cycle data, we were able to obtain relatively small confidence intervals for all parameters of the menstrual cycle length distribution; suggesting potential gain for inclusion of these data even when post-enrollment data are available. Our future work will focus on modeling the full history of menstrual cycle data from enrollment to pregnancy along with the time-to-pregnancy process. In modeling the pregnancy cycles for this work, we used time-to-positive test date as a proxy for the date on which the menstrual cycle is paused; however, the true date is at a time before the positive test date. Improvements can be made using data on intercourse patterns, ovulation, and pregnancy tests and by conditioning on the date for which the menstrual cycle is paused being earlier than the positive pregnancy test.

This work impacts the analysis of existing longitudinal data by providing an approach to estimate $$\pi (\cdot)$$ for a particular study and to use this estimate to adjust inferences to a reference population. The estimate of $$\pi (\cdot)$$ can also be useful in designing recruitment for a new study. Additionally, it might be of interest to collect data on the number of participants who decline enrollment in order to estimate the magnitude of $$\pi (\cdot)$$.

## Supplementary material

Supplementary material is available online at http://biostatistics.oxfordjournals.org.

## Funding

This work was supported by the Intramural Research Program of the U.S., National Institutes of Health, *Eunice Kennedy Shriver* National Institute of Child Health and Human Development [Contracts N01-HD-3-3355, N01-HD-3-3356, N01-HD-3-3358] and by the U.S., National Institutes of Health, National Institute of Environmental Health Sciences [Training Grant 2T32ES012871].

## Acknowledgements

The authors thank Dr Germaine Buck Louis for providing us access to the LIFE Study. The authors acknowledge that this study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD (http://biowulf.nih.gov). *Conflict of Interest*: None declared.