Semiparametric Estimation of Treatment Effects in Randomized Experiments

We develop new semiparametric methods for estimating treatment effects. We focus on settings where the outcome distributions may be thick tailed, where treatment effects may be small, where sample sizes are large, and where assignment is completely random. This setting is of particular interest in recent online experimentation. We propose using parametric models for the treatment effects, leading to semiparametric models for the outcome distributions. We derive the semiparametric efficiency bound for the treatment effects in this setting, and propose efficient estimators. In the leading case with constant quantile treatment effects, one of the proposed efficient estimators has an interesting interpretation as a weighted average of quantile treatment effects, with the weights proportional to minus the second derivative of the log of the density of the potential outcomes. Our analysis also suggests an extension of Huber's model and trimmed mean to include asymmetry.


Introduction
Historically, randomized experiments were often carried out in medical and agricultural settings. In these settings, sample sizes were often modest, typically on the order of hundreds or (more rarely) thousands of units. Outcomes commonly studied included mortality or crop yield, and were characterized by relatively well-behaved distributions with thin tails. Standard analyses in those settings typically involved estimating the average effect of the treatment using the difference in average outcomes by treatment group, followed by constructing confidence intervals using Normal distribution based approximations. These methods originated in the 1920s, e.g., Neyman (1990); Fisher (1937), but they continue to be the standard in modern applications. See Wu and Hamada (2011) for a recent discussion.
More recently, many experiments are conducted online (see Kohavi et al. (2020) for an overview), leading to substantially different settings. Gupta et al. (2019, p.20) claim that "Together these organizations [Airbnb, Amazon, Booking.com, Facebook, Google, LinkedIn, Lyft, Microsoft, Netflix, Twitter, Uber, and Yandex] tested more than one hundred thousand experimental treatments last year." The settings for these online experiments are substantially different from those in biomedical and agricultural settings. First, the experiments are often on a vastly different scale, with the number of units on the order of millions to tens of millions. Second, the outcomes of interest, variables such as time spent by a consumer, sales per consumer, or payments per service provider, are characterized by distributions with extremely thick tails. Third, the treatment effects are often extremely small relative to the standard deviation of the outcomes, even if their magnitude remains substantively important. For example, Lewis and Rao (2015) analyze challenges with statistical power in experiments designed to measure the effect of digital advertising on consumer expenditures. They discuss a hypothetical experiment where the average expenditure per potential customer is $7 with a standard deviation of $75, and where an average treatment effect of $0.35 (0.005 of a standard deviation) would be substantial in the sense of being highly profitable for the company given the cost of advertising. In the Lewis and Rao example, an experiment with power 0.8 for a treatment effect of $0.35, and a significance level for the two-sided test of means of 0.05, would require a sample size of 1.4 million customers. As a result, confidence intervals for the average treatment effect are likely to include zero even if the true effects are substantively important and samples are large.
Even if a confidence interval for the average treatment effect includes zero, there may be evidence about the presence of causal effects of the treatment. Using Fisher exact p-value calculations (Fisher, 1937) with well-chosen statistics (e.g., the Hodges-Lehmann difference in average ranks, Rosenbaum (1993)), one may well be able to establish conclusively that treatment effects are present. However, it is typically the magnitude of the treatment effect, rather than its presence, that matters for decision makers.
This sets the stage for the problem we address in this paper. In the absence of additional information, there exists no estimator for the average treatment effect that is more efficient than the difference in means. To obtain more precise estimates, we either need to change the focus away from the average treatment effect, or we need to make additional assumptions. One approach to changing the question, at least slightly, is to transform the outcome (e.g. taking logarithms or winsorizing) followed by a standard analysis estimating the average effect of the treatment on the transformed outcome. In this paper, like Taddy et al. (2016); Tripuraneni et al. (2021), we choose a different approach, namely making additional assumptions on the joint distribution of the outcomes and treatment indicator.
The key conceptual contribution is that we postulate a semiparametric model for the outcome distributions by treatment group. The leading example of this semiparametric model corresponds to restricting the quantile treatment effects to be identical across quantiles, thus assuming that the two conditional outcome distributions differ only by a shift. We do not directly use parametric models for the outcome distributions by treatment group, because specifying such a model that well approximates the full outcome distribution is more challenging than postulating a model for the treatment effects. Unlike outcomes, treatment effects tend to be small and often have little variation. For this semiparametric set-up (e.g., Bickel et al. (1993), Bickel and Doksum (2015)), we derive the influence function, the semiparametric efficiency bound, and we propose semiparametrically efficient estimators.
It turns out that the parametrization of the treatment effect can be very informative, potentially making the asymptotic variance for the corresponding semiparametric estimators substantially smaller than the asymptotic variance for the difference-in-means estimator. For example, if the potential outcomes have Cauchy distributions, the variance bound for the average treatment effect is infinite because the moments of the Cauchy distribution do not exist. However, under the constant additive treatment effect assumption (implying that the quantile treatment effects are identical), the semiparametric variance bound for the treatment effect is finite.
In addition, even if this model for the treatment effect is misspecified, the estimand corresponding to the proposed estimators continues to have a causal interpretation, as a weighted average of quantile treatment effects, making the estimators an easy-to-implement and attractive choice in practice.
The remainder of the paper is organized as follows. First, in Section 2 we consider the leading case where we assume the two potential outcome distributions differ only by a shift, so that the quantile treatment effects are all identical. This is implied by, but does not require, the assumption that the treatment effect is additive and constant. In Section 3 we consider the case where we have more flexible parametric models linking the two conditional outcome distributions. In Section 4 we provide some simulation evidence regarding the finite sample properties of the proposed methods in controlled settings and provide real data illustrations. Section 5 concludes. A software implementation for R is available at https://github.com/michaelpollmann/parTreat.

Constant Quantile Treatment Effects
In this section we focus on a special case with constant quantile treatment effects. After setting up the problem formally we discuss robust estimation in the one-sample case to motivate a class of weighted quantile treatment effect estimators. We then discuss the formal semiparametric problem and show adaptivity of the proposed estimators. Finally we consider partial adaptivity and robustness.
This case is closely related to the classical two-sample problem, as discussed in Hodges Jr and Lehmann (1963), and to problems considered in the literature on robust descriptive statistics as in Bickel and Lehmann (1975a,b), Doksum (1974), Doksum and Sievers (1976), and in particular Jaeckel (1971a), Jaeckel (1971b). Section 2 can be interpreted as an extension of Jaeckel's work, in a causal inference framework, to the two-sample context in the setting of semiparametric theory. In the process of doing so, we generalize Huber's model (Huber, 1964) and the estimator based on trimmed means to include asymmetry, and present a simplified version of the results of Chernoff et al. (1967) (see also Bickel (1967), Govindarajulu et al. (1967) and Stigler (1974) on linear combinations of order statistics). In particular, we exhibit efficient M (maximum-likelihood type) and L (linear combination of order statistics) estimates for outcome distributions that are known up to a shift. We then analyze fully adaptive estimates of both types, as discussed in Bickel et al. (1993), and partially adaptive estimates, in particular flexible trimmed means (Jaeckel (1971a)). In this setting the problem is closely related to the literature on robust estimation of location (e.g., Bickel and Lehmann (1975a,b); Hampel et al. (2011); Huber (2011); Lehmann (1976, 2012)).

Set Up
We consider a set-up with a randomized experiment with n observations drawn randomly from a large population. With probability p ∈ (0, 1) a unit is assigned to the treatment group. Let n_1 and n_0 = n − n_1 denote the number of units assigned to the treatment and control group, respectively. Following (Neyman, 1990; Rubin, 1974; Imbens and Rubin, 2015), let Y_i(0) and Y_i(1) denote the two potential outcomes for unit i, and let the treatment be denoted by Z_i ∈ {0, 1}. We assume that the treatment assignment for one unit does not affect the outcomes for any other unit. For all units in the sample we observe the pair (Z_i, Y_i), where Y_i ≡ Y_i(Z_i) is the realized outcome. The cumulative distribution functions for the two potential outcomes are F_0(y) and F_1(y), with inverses F_0^{-1}(u) and F_1^{-1}(u), and means and variances μ_0, μ_1, σ_0^2, and σ_1^2. Note that by the random assignment assumption the distribution of the potential outcome Y_i(z) is identical to the conditional distribution of the realized outcome Y_i given Z_i = z. We are interested in the average treatment effect in the population, τ ≡ E[Y_i(1) − Y_i(0)] = μ_1 − μ_0. The natural estimator for this average treatment effect is the difference in sample averages, τ̂ ≡ Ȳ_1 − Ȳ_0, where Ȳ_1 and Ȳ_0 are the averages of the observed outcomes by treatment group. Under standard conditions √n(τ̂ − τ) is asymptotically Normal with variance σ_1^2/p + σ_0^2/(1 − p). The concern is that this conventional estimator τ̂ may be imprecise. In particular in settings where the outcome distribution is thick tailed, sometimes extremely so, confidence intervals may be wide. We address this issue in this paper by imposing some restrictions on the two potential outcome distributions. Following the semiparametric literature (Bickel and Lehmann, 1975a,b, 1976), we exploit these restrictions to develop new estimators.
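As a baseline for what follows, the difference-in-means estimator and its conventional standard error can be computed in a few lines (a minimal sketch; the function name is ours):

```python
import numpy as np

def difference_in_means(y, z):
    """Difference-in-means estimate of the average treatment effect and
    its standard error under complete randomization: tau_hat = Ybar_1 - Ybar_0,
    with estimated variance s1^2/n1 + s0^2/n0."""
    y, z = np.asarray(y, dtype=float), np.asarray(z)
    y1, y0 = y[z == 1], y[z == 0]
    tau_hat = y1.mean() - y0.mean()
    se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    return tau_hat, se
```

With thick-tailed outcomes the standard error above can be enormous relative to plausible effect sizes, which is exactly the problem the paper addresses.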

Weighted Average Quantile Treatment Effects
It is useful to start with quantile treatment effects (Lehmann and D'Abrera, 1975), which play an important role in our set-up. For quantile u ∈ (0, 1), define ∆(u) ≡ F_1^{-1}(u) − F_0^{-1}(u). These quantile treatment effects are closely related to what Doksum (1974) and Doksum and Sievers (1976) label the response function: R(y) ≡ F_1^{-1}(F_0(y)) − y = ∆(F_0(y)). The natural estimate for the quantile treatment effect is the empirical plug-in, ∆̂(u) ≡ F̂_1^{-1}(u) − F̂_0^{-1}(u), where F̂_1^{-1}(u) is defined as the [n_1 u]-th order statistic of Y_i | Z_i = 1, i = 1, ..., n_1, where n_1 = Σ_{i=1}^n Z_i, and similarly for F̂_0^{-1}(u). A natural class of parameters summarizing the difference between the Y_i(1) and Y_i(0) distributions consists of weighted averages of the quantile treatment effects, τ(F_0, F_1; W) ≡ ∫_0^1 ∆(u) dW(u), where the weights integrate to one: W(0) = 0, W(1) = 1. Different choices for the weight function correspond to different estimands. The constant weight case, W′(u) ≡ 1, corresponds to the population average treatment effect τ = E[Y_i(1) − Y_i(0)]. The median corresponds to the case where W(·) puts all its mass at 1/2. We thus allow W(·) to have point masses.
For a given weight function W(·) we can estimate the parameter τ(F_0, F_1; W) using a weighted average quantile estimator, τ̂_W ≡ ∫_0^1 (F̂_1^{-1}(u) − F̂_0^{-1}(u)) dW(u), (2.5) where the F̂_z^{-1}(u) are constructed from the order statistics Y(z)_(i) in treatment group z.
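A discretized version of this estimator evaluates the quantile differences on a fixed grid of quantiles (our own sketch; the grid size and normalization are implementation choices, not the paper's):

```python
import numpy as np

def weighted_average_quantile_estimator(y1, y0, weight_density, m=999):
    """Discretized weighted average quantile estimator:
    tau_hat_W ~= sum_j w(u_j) * (F1_hat^{-1}(u_j) - F0_hat^{-1}(u_j)),
    with weights normalized to sum to one over the grid u_j = j/(m+1)."""
    u = np.arange(1, m + 1) / (m + 1)
    q1 = np.quantile(np.asarray(y1, dtype=float), u)
    q0 = np.quantile(np.asarray(y0, dtype=float), u)
    w = weight_density(u)
    w = w / w.sum()                 # weights integrate (sum) to one
    return np.sum(w * (q1 - q0))
```

With constant weights this approximates the difference in means; putting all weight near u = 1/2 approximates the difference in medians.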

Efficient Estimation of Weighted Average Quantile Treatment Effects Using Influence Functions
To understand the properties of the weighted average quantile estimator τ̂_W, we begin by considering the nonparametric model for a single sample, with cumulative distribution function F(·), where the interest is in the weighted quantile ∫_0^1 F^{-1}(u) dW(u) for a given weight function W(·). For this one-sample case, Jaeckel (1971a,b), building on Chernoff et al. (1967), shows that under simple conditions on W and F, for a sample size of n, the estimator is asymptotically linear with remainder o_p(n^{-1/2}), where the influence function ψ is related to the weight function W by ψ(x, F, W) = ∫_0^1 (u − 1{F(x) ≤ u}) / f(F^{-1}(u)) dW(u). (2.7) The first term in the integrand ensures that ∫ ψ(x, F, W) dF(x) = 0, since E[1{F(X) ≤ u}] = u. Note that if the derivatives ψ′(·) and W′(·) = w(·) exist, by (2.7), ψ′(x) = w(F(x)), and the asymptotic variance equals the expectation of the square of the influence function, V(F, W) = E[ψ(X, F, W)^2]. The results in Jaeckel (1971a,b) for the one-sample case extend in the following way to the two-sample setting that is our primary focus. If τ(F_0, F_1, W) is estimated by τ̂_W in (2.5), then, under regularity conditions given in the appendix (Theorem A.1), √n(τ̂_W − τ(F_0, F_1, W)) converges in distribution to a Normal with variance E[ψ(Y(1), F_1, W)^2]/p + E[ψ(Y(0), F_0, W)^2]/(1 − p), where ψ(x, F, W) is given by (2.7).
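For absolutely continuous W with density w = W′, the influence function of such an L-estimator can be written equivalently as ψ(x) = −∫ (1{x ≤ y} − F(y)) w(F(y)) dy, from which ψ′(x) = w(F(x)) follows by differentiation. A small numerical check of both facts for the standard normal (our own sketch; the helper names are ours):

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def psi(x, w, lo=-8.0, hi=8.0, m=100001):
    """psi(x) = -(integral of (1{x <= y} - F(y)) * w(F(y)) dy) for the
    standard normal F, computed by the trapezoid rule on [lo, hi]."""
    y = np.linspace(lo, hi, m)
    F = np.array([Phi(v) for v in y])
    g = ((y >= x).astype(float) - F) * w(F)
    dy = y[1] - y[0]
    return -(dy * (g.sum() - 0.5 * (g[0] + g[-1])))

# constant weights w = 1 give the influence function of the mean: psi(x) = x - mu
const_w = lambda u: np.ones_like(u)

# a smooth weight density with integral one; numerically psi'(x) ~= w(F(x))
w = lambda u: 6.0 * u * (1.0 - u)
deriv = (psi(0.71, w) - psi(0.69, w)) / 0.02
```

The first check recovers the familiar influence function of the sample mean; the second verifies the derivative identity that links weights and influence functions.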

Constant Quantile Treatment Effects
Now let us return to the primary focus of this section, the estimation of the average treatment effect under the constant quantile treatment effect assumption. Our key assumption in this section is that the quantile treatment effects are all equal, ∆(u) = τ for all u ∈ (0, 1), and thus, for any weight function W(·), τ(F_0, F_1; W) = τ. (2.10) Later, in Section 3, we generalize this to allow for a more general parametric function linking the quantile treatment effects. One way to motivate the constant quantile treatment effect assumption is to assume that the unit-level treatment effects are all constant, Y_i(1) − Y_i(0) = τ for all units i = 1, . . . , n. This implies, but is not implied by, the assumption that all the quantile treatment effects are identical. The assumption of constant unit-level treatment effects is very strong, implying rank invariance, which is in fact stronger than what we need. In this section, for expository reasons, we further assume that we know the control outcome distribution F_0(·) up to a shift. That is, F_0(x) = F(x − η), where F(·) (with derivative f) is known and η is unknown. Because of the constant quantile treatment effect assumption the treated potential outcome distribution is also known up to a shift, F_1(x) = F(x − η − τ). Assuming that F_0(·) is known up to a shift is unrealistic in practice; we relax this assumption in the next subsection. It is interesting to inspect the form of the efficient weights w_f(u). These weights are proportional to minus the second derivative of the logarithm of the density function, evaluated at the corresponding quantile. (2.12) In other words, we can approximate the efficient estimator by first estimating a large number of quantile treatment effects. Under the model these quantile treatment effects are all identical. To efficiently estimate that common treatment effect we can simply use a weighted average of the estimated quantile treatment effects, where the optimal weights simplify to minus the second derivative of the logarithm of the density. For the Normal distribution, that means the weights are constant.
For the double exponential distribution the weights put point mass at the median. For the Cauchy distribution the weights are proportional to −cos(2πu) sin²(πu). Interestingly, these weights are negative for some quantiles, as one can see by inspecting the estimated weights. If one is concerned by the negative weights, one can modify them by restricting them to be nonnegative. Note that the influence function estimator implicitly uses the same negative weights in such cases, because the two estimators are first-order equivalent. A final comment connects this to common methods for dealing with thick-tailed distributions. In practice many researchers use winsorizing to deal with these problems. This can be interpreted as using a weighted average quantile estimator with a particular set of weights: with winsorizing at the q and 1 − q quantiles, the implicit weights are constant on the interval (q, 1 − q), with additional point mass q on the qth and (1 − q)th quantiles. As discussed in Bickel (1965), the asymptotic properties of the winsorized estimator depend delicately on the density at the winsorizing quantiles. In our simulations this estimator does not perform particularly well. There is a tension here between having an interpretable target that may not be precisely estimable (e.g., the average effect of the treatment) and a precisely estimable estimand whose interpretation is more complex (e.g., the weighted average quantile effect). This tension arises in other settings as well. An example is the estimation of average treatment effects under unconfoundedness, where weighting by the confounders may affect both the interpretation of the estimand and the precision with which we can estimate it (Crump et al., 2009; Li et al., 2018). Another setting is that discussed in Vansteelandt and Dukes (2022).
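The Cauchy weight function quoted above can be verified directly: for the standard Cauchy density, minus the second derivative of the log density, evaluated at the quantile function, equals −2 cos(2πu) sin²(πu), i.e., it is proportional to −cos(2πu) sin²(πu) as stated. A small check (our own verification, not from the paper):

```python
import numpy as np

def neg_d2_log_density_cauchy(x):
    """-(d^2/dx^2) log f(x) for the standard Cauchy density
    f(x) = 1 / (pi * (1 + x^2))."""
    return 2.0 * (1.0 - x ** 2) / (1.0 + x ** 2) ** 2

u = np.linspace(0.05, 0.95, 19)
x = np.tan(np.pi * (u - 0.5))        # Cauchy quantile function F^{-1}(u)
closed_form = -2.0 * np.cos(2 * np.pi * u) * np.sin(np.pi * u) ** 2
# the two expressions agree; weights are negative for u < 1/4 and u > 3/4
```

The sign pattern confirms the remark in the text: central quantiles receive positive weight while extreme quantiles receive negative weight.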
The use of quantile methods for estimating treatment effects in thick-tailed settings has been studied in Firpo (2007) and Firpo et al. (2009), but unlike in those papers, our focus is on the overall treatment effect, rather than the effect at specific quantiles.

Fully adaptive estimation
As stated earlier, in practice we do not know the density f(·) up to location. However, in this case with constant quantile treatment effects this knowledge does not matter to first order. Because of the orthogonality of the tangent space with respect to f, it follows from semiparametric theory (Bickel et al., 1993) that even if the density f is unknown, substituting a suitable estimate of f (and η) in (2.11) or (2.13) will yield estimators with influence functions given by (2.11), equivalently expressed in terms of f_0(·) ≡ F_0′(·) ≡ f(· − η). For our proposed estimator we split the data randomly into two parts, with the two subsamples denoted by A, corresponding to {(Z_i, Y_i) : 1 ≤ i ≤ n/2}, and B, corresponding to {(Z_i, Y_i) : n/2 < i ≤ n}. The M estimate based on the estimated density f̂(·) is a one-step estimate using the sample-splitting technique (Klaassen, 1987): on each subsample, an initial √n-consistent estimate is updated by a single Newton step based on the score estimated from the other subsample. The initial estimates τ̃(A) and τ̃(B) can be based, for example, on the difference in medians or other quantiles. Algorithm 1 shows the key steps; additional details are given in the supplementary materials.
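A stylized sketch of such a one-step update in the constant-shift model, assuming for simplicity that the score −f′/f and the Fisher information I(f) are given (in the paper both are estimated on the other half of the sample split); the function name and interface are ours:

```python
import numpy as np

def one_step_shift_estimate(y1, y0, tau_init, eta_init, score, fisher_info):
    """One Newton step from an initial sqrt(n)-consistent (eta, tau)
    toward the efficient M-estimate in the constant-shift model
    F0 = F(. - eta), F1 = F(. - eta - tau).  `score` is -f'/f and
    `fisher_info` is I(f); in the sample-splitting scheme both would
    be estimated on the other subsample (Klaassen, 1987)."""
    s1 = score(np.asarray(y1, dtype=float) - eta_init - tau_init).mean()
    s0 = score(np.asarray(y0, dtype=float) - eta_init).mean()
    return tau_init + (s1 - s0) / fisher_info

# with the normal score (-f'/f)(x) = x and I(f) = 1, a single step from
# any starting value lands exactly on the difference in means
tau_hat = one_step_shift_estimate([0.0, 1.0, 2.0], [-1.0, 0.0, 1.0],
                                  0.8, 0.0, lambda x: x, 1.0)
```

For the double exponential score, sign(x), the update instead pulls the estimate toward the difference in medians, illustrating the adaptivity discussed below.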
We can also construct an L estimate based on an average of the quantile differences. This estimator is obtained by first estimating F_0(·), f_0(·), and f_0′(·), substituting these for F(·), f(·), and f′(·) in w_f(u) in equation (2.12), and then using the estimated weights in the weighted average quantile estimator. Algorithm 2 (the weighted average quantile estimator τ̂_waq) proceeds in the same key steps on each split of the sample: calculate a preliminary consistent estimator; estimate the density and its derivatives; order and pair the treated and control observations, duplicating treated or control observations as needed so that there are equal numbers of both, evenly across the distribution; compute the weighted average of the paired quantile differences; and finally repeat with the roles of the A and B splits reversed and average the two results.

Formally, we have for the unknown f(·) case:

Theorem 1. For all f such that f′ exists and 0 < I(f) < ∞: (i) there exists a √n-consistent estimator τ̃; (ii) the estimators τ̂_if and τ̂_waq are asymptotically normal, with asymptotic variance 1/(p(1 − p)I(f)), the same limit as for the efficient estimator when F_0 is known up to a shift.
The main insight is that the asymptotic variance for the proposed estimator is the same as the variance for the maximum likelihood estimator in the case where F 0 is known up to a shift. Our conditions are not optimal (see Stone (1975) for minimal ones in the one sample case). The √ n-consistent estimator for τ can be based on any quantile treatment effect estimator.
Thus, both the estimators τ̂_waq and τ̂_if are adaptive to models in which the distributions of the control and treated potential outcomes are known up to location. Theorem 1 implies that the influence function based estimator is as efficient as the maximum likelihood estimator based on the true distribution function. For instance, if the potential outcomes are Normally distributed, the maximum likelihood estimator is the difference in means, and the influence function estimator has the same limiting distribution. If, however, the potential outcomes follow a double exponential distribution, the difference in medians is the efficient estimator; under this distribution, the influence function based estimator adapts and has the same limiting distribution as the difference in medians. For the Cauchy distribution the optimal weights are more complicated, w_f(u) ∝ −cos(2πu) sin²(πu), but the influence function based estimator again has the same limiting distribution, without requiring a priori knowledge of the distribution. The influence function based estimator τ̂_if is a special case of the estimator developed by Cuzick (1992a,b) for the partial linear regression model setting.
Although these estimators are efficient under the constant additive treatment effect model, as we shall see in Section 3, if the constant quantile treatment effect assumption is violated, estimators of the type derived for known f continue to estimate, at rate n^{-1/2}, meaningful measures of the treatment effect as discussed in Section 2.1. This is unfortunately not the case for τ̂_if and τ̂_waq, because estimation of f′ and f″ introduces components of variance of order larger than n^{-1/2}. However, there is a partial remedy, which we discuss next.

Partial Adaptation
An interesting alternative to fully adaptive estimation in the closely related one-sample symmetric case was studied by Jaeckel (1971b); see also Yu and Yao (2017). For a sample X_1, ..., X_n from F(x − μ) with f symmetric, the estimator proposed by Jaeckel (1971b) is the α-trimmed mean, the average of the order statistics X_([nα]+1), ..., X_(n−[nα]). We can interpret this as restricting the class of weight functions to one indexed by a scalar parameter α: w_α(u) = 1/(1 − 2α) for α ≤ u ≤ 1 − α, and w_α(u) = 0 otherwise. This weight function yields the α-trimmed mean. We can then choose the value of α that minimizes the asymptotic variance of μ̂_α. This asymptotic variance is equal to the α-winsorized variance divided by (1 − 2α)², which can be estimated by replacing F with the empirical distribution; denote the resulting estimate by σ̂²_α. Let α̂ = arg min_α σ̂²_α, and let μ̂_α̂ be the corresponding estimator for μ. Jaeckel (1971b) shows that μ̂_α̂ is adaptive for estimating μ over a Huber family of densities. In the Huber family, (X − μ)/σ has density f for varying μ and σ > 0: f(x) = c(k) exp(−x²/2) for |x| ≤ k, and f(x) = c(k) exp(k²/2 − k|x|) for |x| > k, where c(k) makes ∫ f(x)dx = 1, and k = −F^{-1}(α). Adaptivity here means that using the trimming proportion optimizing the variance estimate in fact yields an estimate which is efficient for the member of the Huber family generating the data. Jaeckel optimizes over 0 < α_0 ≤ α ≤ α_1 < 1/2. Because this family includes, among others, the Gaussian (k → ∞) and the double exponential (k = 0), it is very flexible. For more properties, see Huber (2011).
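Jaeckel's adaptive choice of the trimming proportion can be sketched as follows, using the plug-in rule that the asymptotic variance of the α-trimmed mean is the α-winsorized variance divided by (1 − 2α)². This is our own simplified illustration, not the construction in the paper's Theorem A.3:

```python
import numpy as np

def trimmed_mean(x, alpha):
    """alpha-trimmed mean: average after discarding the lowest and
    highest [n * alpha] order statistics."""
    xs = np.sort(np.asarray(x, dtype=float))
    k = int(len(xs) * alpha)
    return xs[k:len(xs) - k].mean()

def est_asym_var(x, alpha):
    """Plug-in estimate of the asymptotic variance of the alpha-trimmed
    mean: the alpha-winsorized variance divided by (1 - 2*alpha)^2."""
    xs = np.sort(np.asarray(x, dtype=float))
    n, k = len(xs), int(len(xs) * alpha)
    w = xs.copy()
    if k > 0:
        w[:k] = xs[k]                # winsorize the lower tail
        w[n - k:] = xs[n - k - 1]    # winsorize the upper tail
    w = w - trimmed_mean(xs, alpha)
    return (w ** 2).mean() / (1.0 - 2.0 * alpha) ** 2

def adaptive_trimmed_mean(x, alphas=(0.0, 0.05, 0.1, 0.2, 0.3, 0.4)):
    """Pick the trimming proportion with the smallest estimated
    asymptotic variance, in the spirit of Jaeckel (1971b)."""
    best = min(alphas, key=lambda a: est_asym_var(x, a))
    return trimmed_mean(x, best), best
```

On data with a single gross outlier, the variance criterion rejects α = 0 and the resulting estimate ignores the outlier.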
In the two-sample case, it is reasonable to consider asymmetric weight functions, leading to the natural generalization in which the trimming proportions on the left and on the right may differ. This estimator is partially adaptive, in a way similar to the symmetric trimmed mean in the one-sample problem. In the online appendix, we extend Jaeckel's result on partial adaptation for our two-sample problem to a generalization of the Huber family whose members are symmetric iff the two tail parameters coincide; the family is defined in terms of Φ, the CDF of N(0, 1). See Figure 1 for an illustration. In the supplementary materials we also discuss how inference can proceed in this setting.

The General Parametric Treatment Effect Case
In some settings, the assumption of an additive model may be too restrictive. In this section, we develop estimators given a general parametric model for the difference between the two outcome distributions.
The starting point is a model governing the relation between the two potential outcome distributions: Assumption 1 (Parametric Model for the Quantile Treatment Effects). The potential outcome distributions satisfy F_1^{-1}(u) = h(F_0^{-1}(u), θ) for all u ∈ (0, 1), for a known function h(·, ·) and an unknown finite-dimensional parameter θ. The constant quantile treatment effect case is a special case of this with h(y, θ) = y + θ. Another important special case is the proportional treatment effect case, h(y, θ) = θy. For the general case the weighted average quantile estimator does not directly generalize, so we focus on the influence-function-based estimator; for the general case the influence function is more complex.
This approach of modelling treatment effects has connections to the literature on structural nested models, which also imposes modeling restrictions on treatment effects, although for different reasons, largely based on the challenges in dynamic settings. See Robins (1986) for an early paper, and Vansteelandt and Joffe (2014) for a review.
As before, we initially assume F_0 known. In terms of the quantile treatment effects τ(u), Assumption 1 implies the restriction τ(u) = h(F_0^{-1}(u), θ) − F_0^{-1}(u). Given Assumption 1, the population average treatment effect can be characterized as τ_pop = E[h(Y_i(0), θ) − Y_i(0)]. In practice, however, estimating τ_pop may still be subject to substantial sampling variance, even if h(·) is known. For example, suppose that h(y, θ) = θy, so that the treatment effect is proportional. The average treatment effect is then (θ − 1)E[Y_i | Z_i = 0]. Even if θ is known, estimating the population mean E[Y_i | Z_i = 0] could lead to a large standard error. As an alternative, we therefore focus on a different estimand: the in-sample, as opposed to population, average treatment effect. This is still a well-defined average causal effect that is useful for decision makers, and it is in the spirit of the typical analysis of randomized experiments based on convenience samples, where the focus is on the average effect for the particular sample. A key insight is that estimators of this object can have a much lower variance. We define the in-sample average treatment effect as the sample analogue of τ_pop, averaging h(Y_i(0), θ) − Y_i(0) over the n units in the sample. (3.20) When h is not simply additive, τ_is is sample-dependent, and thus stochastic.
In particular, when the variance of Y_i is large because of thick tails of the potential outcome distributions, the variance of τ_is over repeated samples can be large, too. To give some intuition for this, suppose that θ is known. Then the variance of τ̂ − τ_is is zero, but the variance of τ_is − τ_pop over repeated samples can be large. We therefore focus on the variance of estimators τ̂ relative to τ_is for the particular sample at hand, rather than on the variance of τ̂ relative to the population average τ_pop. If F_0 were known, we could estimate θ efficiently by some version of maximum likelihood to get an estimate θ̂.

By Assumption 1, the distribution of the treated outcomes is determined by F_0 and θ (3.21), with associated score function ℓ̇_θ. If F_0 is assumed unknown, to obtain an efficient influence function we must: 1) compute the tangent space as f_0 varies with θ fixed (note that both factors of the likelihood must be varied treating θ as fixed); 2) project ℓ̇_θ onto the orthocomplement of the tangent space to obtain the efficient score ℓ̇*_θ; 3) the efficient influence function for θ is then given by Lemma 1. To use the influence function approach we need a √n-consistent initial estimator θ̃. We can obtain one from a fixed number of quantiles, u_1, . . . , u_d, where d is the dimension of θ, by finding the θ that solves F̂_1^{-1}(u) = h(F̂_0^{-1}(u), θ) for u = u_1, . . . , u_d. For simplicity, we suggest using evenly spaced quantiles, u = 1/(1+d), 2/(1+d), . . . , d/(1+d). Let θ̃ denote the solution to this system of equations. Then we have: Theorem 2. Under mild conditions on the estimation of the density and its derivative, the one-step estimator based on the efficient influence function, again using the sample-splitting technique (Klaassen, 1987) similar to (2.15), is √n-consistent and asymptotically efficient.
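For a scalar θ in the proportional-effect model h(y, θ) = θy, the quantile-matching starting value can be sketched as follows (our own illustration; for scalar θ each quantile equation solves exactly, and with d = 1 it reduces to the ratio of the sample medians):

```python
import numpy as np

def initial_theta_proportional(y1, y0, d=1):
    """sqrt(n)-consistent starting value for a scalar theta in the
    proportional-effect model h(y, theta) = theta * y, by matching
    empirical quantiles at u = 1/(1+d), ..., d/(1+d).  Each equation
    F1_hat^{-1}(u) = theta * F0_hat^{-1}(u) solves exactly; we average
    across the d quantiles (a simplification of solving the system
    jointly for vector-valued theta)."""
    u = np.arange(1, d + 1) / (d + 1)
    q1 = np.quantile(np.asarray(y1, dtype=float), u)
    q0 = np.quantile(np.asarray(y0, dtype=float), u)
    return (q1 / q0).mean()
```

Because it uses only central quantiles, this starting value remains √n-consistent even when the outcome distributions have no finite moments.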
Note that, unlike in the constant treatment effect setting, the efficient influence function is not necessarily the one corresponding to F_0 known.
Given inference for θ̂, inference for τ̂ as an estimator of the in-sample average treatment effect is straightforward, based on the Delta method and the representation in (3.20), conditional on the potential outcomes. The asymptotic variance for τ̂ is equal to the variance for θ̂, pre- and post-multiplied by the derivative of the expression in (3.20) with respect to θ.

Simulations
We evaluate the performance of the proposed estimators and of conventional estimators in a Monte Carlo study. Throughout most of these simulations, the true unit-level treatment effects are all zero. We estimate the treatment effect using the proposed efficient estimators based on an additive model. We consider seven estimators: the (standard) difference in means, the difference in medians, the Hodges-Lehmann (Hodges Jr and Lehmann (1963)) estimator,† the adaptively trimmed mean, the adaptively winsorized mean,‡ the estimator based on the efficient influence function (eif), and the weighted average quantiles (waq) estimator. For the latter two we report only results without sample splitting; results with sample splitting are very similar and are available in the supplementary materials. Although in our illustrations we use relatively simple estimators for the densities and their derivatives based on variable-bandwidth kernels, an alternative would be to use methods directly aimed at estimating derivatives of the logarithm of the density, as in Pinkse and Schurter (2021). Detailed descriptions of how the new estimators are implemented, as well as R code implementing all estimators with performance optimizations for these simulations, are available in the supplemental materials.

†The Hodges-Lehmann estimator is equal to the median of all pairwise differences between treated and control observations.

‡We apply the ideas of Jaeckel (1971b) for the optimal trimmed mean to choose the parameters for the estimators in Section 2.6; see Theorem A.3 in the Appendix. We allow anywhere between no trimming (difference in means) and the extreme of trimming all but the medians (difference in medians). While including the extremes is not covered by the theory, this approach appeared to work best in our simulations.
We present results for three sets of simulations: one with a range of known distributions for the potential outcomes, so we can directly assess the ability of the proposed methods to adapt to different distributions, and two with simulations based on real data, one on housing prices and one on medical expenditures, both with thick-tailed distributions.

Simulations with Known Distributions
We simulate samples of n = 20,000 observations, half of which are treated, and report summary statistics based on 10,001 simulated samples (using an odd number so that the median is unique). We repeat the simulation study for standardized Normal, Double Exponential (Laplace), and Cauchy distributions for the potential outcomes. The difference in means is the maximum likelihood estimator for Normally distributed data, and so will do well there, but may perform poorly for thicker-tailed distributions such as the Double Exponential distribution and in particular the Cauchy distribution. The difference in medians is the maximum likelihood estimator for the Double Exponential distribution and relatively robust to thick tails and outliers, and so is expected to perform reasonably well across all specifications, though not as well as the efficient estimators for the Normal.
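As a rough, scaled-down illustration of this design (our own sketch, with smaller n and fewer replications than in the text), the following compares the Monte Carlo spread of the difference in means and the difference in medians under standard Cauchy outcomes:

```python
import numpy as np

rng = np.random.default_rng(0)

def sampling_iqr(n=2000, reps=500):
    """Interquartile range, across Monte Carlo replications, of the
    difference in means and the difference in medians when both
    potential outcomes are standard Cauchy and the true tau is 0."""
    means, medians = [], []
    for _ in range(reps):
        y0 = rng.standard_cauchy(n // 2)
        y1 = rng.standard_cauchy(n // 2)
        means.append(y1.mean() - y0.mean())
        medians.append(np.median(y1) - np.median(y0))
    iqr = lambda v: np.subtract(*np.percentile(v, [75, 25]))
    return iqr(means), iqr(medians)

iqr_means, iqr_medians = sampling_iqr()
# under Cauchy tails the difference in means never concentrates, while
# the difference in medians is tightly concentrated around zero
```

The gap is dramatic: the difference in means inherits a Cauchy sampling distribution and does not improve with n, while the difference in medians is √n-consistent.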
For the simulations with known distributions we can derive the functional form of the optimal weights for the quantile-based estimator. The optimal weights for the waq estimator are proportional to the (estimated) second derivative of the log density. For the Normal distribution, ∂² ln f/∂y² (y) = −1/σ², implying that the optimal weights are constant. The density of the double exponential distribution is such that the optimal weights asymptotically place all weight close to the median. For the standard Cauchy distribution the efficient weights w on the difference in the u ∈ (0, 1) quantiles of the treated and control distributions are w_f(u) ∝ −cos(2πu) sin²(πu), shown in Figure 2. Most of the weight is placed on the central quantiles, with negative weights in the tails.§

The efficient estimators perform well across distributions, and confidence intervals based either on estimates of the analytic variance formulas or on the bootstrap achieve their nominal coverage levels, as shown in Table 1. Their standard deviations are close to the theoretical efficiency bound, as shown in column 4, labelled relative efficiency, where values larger than one imply standard deviations of the estimator in excess of the efficiency bound. The efficient influence function and weighted average quantiles estimators are close to the most efficient estimator for the Normal, double exponential, and Cauchy distributions. The last columns show that the confidence intervals are close to their nominal coverage for each distribution.

§The fully documented R package is available at https://github.com/michaelpollmann/parTreat. Details on the empirical implementation of our estimators are in the supplementary materials. For a sample of 1,000 treated and 1,000 control observations, the R package computes estimates and standard errors practically instantaneously. With very large samples, the derivatives of the log density can be precomputed on a random subsample of the data for similarly fast computation.
For computational convenience in the simulations, the confidence intervals based on the bootstrap variance use the m-out-of-n bootstrap, with m = 2,000 (half treated, half control), to estimate the variance of the estimators. Even with these smaller sample sizes, the density estimates calculated within each bootstrap sample appear to be sufficiently good to yield reasonable confidence intervals for the estimators.
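A minimal sketch of the m-out-of-n bootstrap variance, shown here for the difference in medians rather than the paper's efficient estimators (illustrative function names; the key step is rescaling the size-m bootstrap variance by m/n to the full sample size):

```python
import numpy as np

rng = np.random.default_rng(0)

def m_out_of_n_variance(y_treat, y_control, estimator, m, n_boot=200):
    """Estimate the variance of `estimator` at the full sample size using
    the m-out-of-n bootstrap (m/2 treated, m/2 control per draw)."""
    n = len(y_treat) + len(y_control)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        t = rng.choice(y_treat, size=m // 2, replace=True)
        c = rng.choice(y_control, size=m // 2, replace=True)
        stats[b] = estimator(t, c)
    # bootstrap variance at size m, rescaled to the full sample size n
    return (m / n) * stats.var(ddof=1)

diff_med = lambda t, c: np.median(t) - np.median(c)
y1 = rng.normal(0.2, 1.0, size=10_000)  # synthetic treated outcomes
y0 = rng.normal(0.0, 1.0, size=10_000)  # synthetic control outcomes
v = m_out_of_n_variance(y1, y0, diff_med, m=2_000)
```

For root-n-consistent estimators the variance scales like 1/n, which is what justifies the m/n rescaling from the subsample size to the full sample size.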

Simulations with House Price Data
In the second set of simulations we use house price data from the replication files of Linden and Rockoff (2008), available at Linden and Rockoff (2019). They obtained property sales data for Mecklenburg County, North Carolina, between January 1994 and December 2004. They dropped sales below $5,000 and above $1,000,000, such that 170,239 observations remain, which we take as our population of interest. Despite the trimming, the distribution is noticeably skewed (skewness 2.2) and thick-tailed (kurtosis 9.5). Even after taking logs, the distribution is heavy-tailed, with kurtosis equal to 5.1. Figure 3 plots a histogram for house prices, both in levels and in logs, along with the estimated optimal weights (minus the second derivative of the log density) based on all 170,239 observations. We base simulations on these data by drawing samples of size n = 20,000, and randomly assigning exactly half of each sample to the treatment group and the remaining half to the control group, with a zero treatment effect. Within each sample, observations are drawn from the population without replacement, but sampling is independent across samples, such that observations may appear in multiple samples. We estimate the efficiency bound using density estimates based on all observations. We compute the same estimators as in the simulations of the previous section, with a small adjustment to the adaptively trimmed and winsorized means, where we fix the trimming and winsorizing percentiles on the left to 0% (no trimming/winsorizing) and only adaptively choose the threshold on the right. Table 2 summarizes the simulation results based on 10,001 simulated samples. When the house prices are in levels, the standard deviation of the difference in means estimator is twice as large as that of the efficient estimators.
For the difference in medians and the Hodges-Lehmann estimators, which are less affected by outliers in the data, the standard deviation is larger by approximately 30% and 20%, respectively. Confidence intervals, based on estimated variances and asymptotic normal approximations, have close to nominal coverage throughout, and are meaningfully shorter for the efficient estimators we propose.
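The sampling scheme used in these simulations — draws of n = 20,000 without replacement within each sample, with exactly half assigned to treatment and a zero treatment effect — can be sketched as follows (the log-normal population here is a synthetic stand-in for the house-price data):

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_simulation_sample(population, n=20_000):
    """One simulated sample: n units drawn without replacement from the
    population; exactly half are assigned to treatment at random
    (zero treatment effect, as in the simulation design)."""
    sample = rng.choice(population, size=n, replace=False)
    z = np.zeros(n, dtype=int)
    z[rng.choice(n, size=n // 2, replace=False)] = 1  # completely random assignment
    return sample, z

# synthetic stand-in population; the paper uses 170,239 observed house prices
population = rng.lognormal(mean=11.8, sigma=0.7, size=170_239)
y, z = draw_simulation_sample(population)
```

Because sampling is without replacement within a sample but independent across samples, the same population unit may appear in multiple simulated samples, matching the design described above.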
In Figure 4 we show the root mean squared error and the coverage of 95% confidence intervals, both relative to the average treatment effect, under deviations from the constant treatment effect model.¶ In the top panel, the unit-level treatment effects are independent draws from a normal distribution with mean equal to 0.1 times the (population) standard deviation of the potential outcomes in the absence of treatment.
On the horizontal axis, we vary the standard deviation of this normal distribution as a fraction q of the (population) standard deviation of the control potential outcomes, simulating 10,001 samples for each value. When q > 0, the constant treatment effect model is misspecified. In the bottom panel, the unit-level treatment effects are 0 with probability q and t with probability 1 − q, where t is chosen as a function of q such that the average treatment effect is constant across all simulations and the same as in the top panel. The difference in means estimator is unbiased for the average treatment effect regardless of the value of q, so its large root mean squared error is due to its variance. In both panels, the influence function-based and weighted average quantile estimators are only (asymptotically) unbiased for the average treatment effect when q = 0. We also estimate a proportional (multiplicative) treatment effect model and translate the estimated coefficients into level effects. Under the multiplicative model, Y_i(1) = θ Y_i(0). When outcomes are strictly positive, this is identical to an additive model for log outcomes: log(Y_i(1)) = log(θ) + log(Y_i(0)). Given an estimate τ̂_log of the additive model with the outcome in logs, we translate it into an estimate τ̂ of the level treatment effect by applying τ̂_log to a population of interest with known means µ_Y0 and µ_Y1 and fixed treatment probability p. Using the Delta method, if V is the asymptotic variance of τ̂_log, we obtain the asymptotic variance of τ̂, holding µ_Y0, µ_Y1, and p fixed, which we estimate by replacing τ_log with τ̂_log and V with the estimate of the variance, V̂. For the purpose of these simulations, we set µ_Y0 = µ_Y1 equal to the population mean of house prices, and p = 1/2.
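The log-to-level translation can be illustrated with a Delta-method sketch. For simplicity, it assumes the special case τ = (exp(τ_log) − 1)·µ_Y0; this mapping and the function name are illustrative assumptions, not the paper's exact formula, which also involves µ_Y1 and p:

```python
import numpy as np

def level_effect_from_log(tau_log, var_tau_log, mu_y0):
    """Translate an additive-in-logs estimate into a level effect via the
    Delta method, under the simplifying (hypothetical) assumption
    tau = (exp(tau_log) - 1) * mu_y0."""
    tau = (np.exp(tau_log) - 1.0) * mu_y0
    grad = np.exp(tau_log) * mu_y0           # d tau / d tau_log
    var_tau = grad ** 2 * var_tau_log        # Delta-method variance
    return tau, var_tau

# e.g., a 5% multiplicative effect on a $150,000 mean outcome
tau, v = level_effect_from_log(tau_log=0.05, var_tau_log=1e-4, mu_y0=150_000.0)
```

The Delta method replaces the nonlinear transformation by its first-order expansion, so the variance of the level effect is the squared gradient of the transformation times the variance of the log-scale estimate.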
When treatment effects are assumed to be proportional to potential outcomes, the proposed estimators for this multiplicative model are still more efficient than alternative estimators, but the gains are smaller. The middle panel of Table 2 shows the quality of estimates of the multiplicative parameter, log(θ), obtained by transforming outcomes into logs. As can be seen in panel (b) of Figure 3, the distribution of log house prices appears closer to the normal distribution, with fewer "outliers." Consequently, the difference in means estimator, which is efficient for the normal distribution, comes noticeably closer to the efficiency bound than when outcomes are in levels. Nevertheless, the efficient estimators further reduce the variance. The bottom panel of Table 2 shows the same summary statistics when the multiplicative parameter is translated back into a level effect. The efficient estimators of the multiplicative parameter lead to treatment effect estimates with smaller variance and shorter confidence intervals than the difference in means, irrespective of whether the latter is estimated in levels (top panel) or in logs and then translated into levels (bottom panel).

¶The design of these simulations was kindly suggested by one of the referees.
Level of Health Expenditure

Medical Expenditures Data
Next, we present simulation results based on confidential medical expenditure data from the IBM MarketScan Research Database, following the sample construction of Koenecke et al. (2021, Figure 2). We restrict the sample to males, age 45-64, with a pneumonia inpatient diagnosis and at least 1 year of continuous medical enrollment. For each patient, we consider the first inpatient admission only, to abstract away from any dynamics. We focus on medical expenditure as the outcome variable. For each patient, we sum the payments recorded by MarketScan for this admission. In total, we use data on 103,662 admissions.∥ Figure 5 plots a histogram for medical expenditure in levels and in logs, along with the estimated optimal weights (minus the second derivative of the log density). We also observe a treatment variable in this data set, the (prior) use of alpha blockers, which Koenecke et al. (2021) find may improve health outcomes during respiratory distress by preventing hyperinflammation. We design a simulation study similar to those of the previous sections, treating the receipt of alpha blockers as randomly assigned in our population. This allows us to study (coverage) properties of the estimators and inference procedures in settings where the parametric treatment effect model is not (necessarily) correctly specified. For these simulations, each sample is a draw, without replacement, of 200 of the 5,507 observations in the treatment group and 3,565 of the 98,155 observations in the control group. While the control group is smaller than in our other simulations, it remains sufficiently large to estimate the density and its derivatives required for our estimators.

∥The sample size differs slightly from that reported by Koenecke et al. (2021) due to missing expenditure data for a small number of admissions.
In this simulation design, F_0 is given by the empirical distribution of the control group, and F_1 by the empirical distribution of the treatment group, in the full sample. Although it is not necessarily correct, the additive treatment effect model may offer a reasonable approximation, and we are interested in the performance of inference methods when the conditions for our theoretical results are not (quite) met in this particular application. The simulation results reported in Table 3, for the same estimators as in Section 4.2, suggest that the proposed estimators perform reasonably well in this setting despite misspecification.

Conclusion
In many modern settings where randomized experiments are used to estimate treatment effects, the presence of heavy-tailed outcome distributions can lead to large standard errors. Researchers often address this by winsorizing with ad hoc thresholds. Here we develop systematic methods for obtaining more precise inferences using parametric models for the treatment effects, while avoiding the specification of models for the potential outcomes. We present results for semiparametric efficiency bounds, suggest efficient estimators, and show in simulations that these methods can be effective in realistic settings.
In particular we recommend the semiparametrically efficient estimator under the constant additive treatment effect model. Although one may not think the constant additive treatment effect assumption holds exactly, the fact that the estimator can be interpreted as estimating a weighted average of the quantile treatment effects makes this an attractive choice.
In this discussion we do not incorporate covariates or pretreatment variables. One could combine the ideas exposited in the current paper with models for the control potential outcome. One may also wish to incorporate covariates in the model for the quantile treatment effects and thus allow for heterogeneous treatment effects, e.g., Chen and Au (2022); Wager and Athey (2018). Another interesting avenue is to consider alternative estimators for the current model by first estimating the unrestricted quantile functions, followed by minimum distance methods, as in Alvarez and Biderman (2022).

Conflict of interest:
We have no conflicts of interest to disclose.

Data availability
The source for the house price data is Linden and Rockoff (2019). The relevant columns of the data are also included in the supplementary materials of this paper. The medical data are proprietary and confidential. Researchers at some institutions, in particular medical schools, may have access to the IBM MarketScan Research Database from which the data is drawn following Koenecke et al. (2021, Figure 2).

Funding
Generous support from the Office of Naval Research through ONR grants N00014-17-1-2131 and N00014-19-1-2468 is gratefully acknowledged. Data for the health application were accessed using the Stanford Center for Population Health Sciences Data Core.

Theorem A.1. Let X_1, …, X_n be i.i.d. from F, and let F̂ and F̂⁻¹ be the empirical distribution function and the empirical quantile function, respectively. Assume that the C-R condition and the W condition above hold, and let

When ψ′ exists, we may also represent W as follows, if it is well defined:

Remark 1. Justification of the conditions for a few common scenarios.

B. A lemma on the C-R condition
Lemma A.1. If the C-R condition holds, then for any t ∈ (0, ϵ), where K_0, K_1, and K_2 are positive constants depending only on C_0 and ϵ. Similar results hold for t ∈ (1 − ϵ, 1) and are omitted for conciseness.


C. Proof of Theorem A.1
Proof. First of all, under the C-R condition, by Theorem 4 of Csorgo and Revesz (1978), we have ∆_n = o_P(n^{−3/4} log n).
Under condition (A.2), where the last inequality is due to Lemma A.1 and the last equality to (A.1). Furthermore, by Lemma A.1, where the last equality is due to (A.1). Similarly, we have

D. Proof of Theorem 1
Our proof is based on a sample-splitting argument; see Klaassen (1987).
where the last equality is due to Theorem 4 of Csorgo and Revesz (1978). Thus, under the further condition (A.11), we have the desired bound, and hence it is sufficient to prove (A.13). Recall that w_f is independent of F̂, so (A.12) holds. Therefore (A.13) holds, which concludes the proof.

D.3. Proof of Theorem 1(ii)
For simplicity, assume that ∑_{i=1}^{n/2} Z_i = np/2 and ∑_{i=n/2+1}^{n} Z_i = np/2. Let f̂_{0(j)} be the estimate of f_0 ≡ F_0′ based on D_j, j ∈ {1, 2}, which satisfies conditions (A.9) and (A.10). By a first-order Taylor expansion at τ, there exists τ̄ between τ̂ and τ such that the remainder is o_P(1) according to (A.9). Similarly, we have the corresponding expansion for the second split.

E. Proof of partial adaptation with the (α, β)-trimmed mean

E.1. Main results
Theorem A.2. Suppose that f_0 falls into the extended Huber family as defined in Section 2.6. If α and β are chosen properly, then for the constant additive treatment effect model, τ̂_{α,β} as defined in (2.18) is an efficient estimate of τ.
Proof. First of all, apply an integral change of variables. Furthermore, it is not hard to verify that F_0(−k_1) = f_0(−k_1)/k_1 and 1 − F_0(k_2) = f_0(k_2)/k_2. Using the fact that
which completes the proof.

E.1.1. Supporting results
Lemma A.3. The influence function (A.3), specialized to the (α, β)-trimmed mean, can be simplified to
The asymptotic variance σ²_f, denoted σ²(α, β) for the (α, β)-trimmed mean, can be written as
Proof. The proof follows from direct calculation and is omitted.

F. Proof of Theorem 2
F.1. Proof of Lemma 1
Let g(y, θ) = (∂/∂θ) log f_1(y, θ), where f_1(y, θ) is the same as in (3.21). In order to find the projection of l̇ on the orthocomplement of Ṗ_f, we need to find Q which minimizes

F.2. Proof of Theorem 2
The proof is similar to the sample-splitting argument for Claim 3 of Theorem 1 and thus is omitted for conciseness.

G.1. Density Estimation
Our estimators rely on estimates of the density and its first and second derivative.
Using data X (for instance, half the sample of control observations for the estimator with sample splitting), we estimate f^{(k)}(x) at x ∈ 𝒳 ≡ {X_(⌈n_d k/1000⌉) : k = 1, …, 999}, where n_d is the number of observations in the data X used for estimating densities. Estimating the density only at points in the density-estimation sample avoids estimating densities of 0 in the tails, where there are few observations, especially when using sample splitting. We use adaptive kernel estimates (Silverman 1986, ch. 5.3) with a triweight kernel, which allows estimation of the first and second derivatives of the density. An alternative based on direct estimation of the logarithm of the density and its derivatives is Pinkse and Schurter (2021).
Density estimation proceeds in two steps. First, we obtain an initial estimate of the density at each point x ∈ 𝒳 as

f̃(x) = (1/n_d) ∑_{i=1}^{n_d} (1/h) K((x − X_i)/h),

where K(u) is the triweight kernel and h is the bandwidth, chosen as h = 3.15 · σ̂_X · n_d^{−1/5}, where 3.15 is the Silverman rule-of-thumb constant for the triweight kernel and

σ̂_X = (quantile(X, 0.95) − quantile(X, 0.05)) / (2 · 1.6449)

is an estimate of the standard deviation of X if X is normally distributed, but one less affected by outliers than the conventional unbiased estimator of the variance. Second, we obtain our final estimate of the density by calculating an adaptive bandwidth. Intuitively, in regions of high density we can choose a small bandwidth that lowers bias without incurring much of a cost in terms of variance; in regions of low density, however, a larger bandwidth is needed to have sufficiently many data points near x to rein in the variance. Following Silverman (1986, ch. 5.3.1), define the local bandwidth factors λ_i for X_i as

λ_i = (f̃(X_i)/g)^{−1/2},

where g is the geometric mean of the f̃(X_i), log g = (1/n_d) ∑_{i=1}^{n_d} log f̃(X_i). Then the estimate of the density is

f̂(x) = (1/n_d) ∑_{i=1}^{n_d} (1/(h λ_i)) K((x − X_i)/(h λ_i)).

To estimate the first and second derivatives of the density, we take λ_i as above and estimate

f̂^{(k)}(x) = (1/n_d) ∑_{i=1}^{n_d} (1/(h_k λ_i))^{k+1} K^{(k)}((x − X_i)/(h_k λ_i)), k ∈ {1, 2},

with h_1 and h_2 the Silverman rule-of-thumb bandwidths for estimating the derivatives of the density using a triweight kernel: h_1 = 2.83 · σ̂_X · n_d^{−1/7} and h_2 = 2.70 · σ̂_X · n_d^{−1/9}.
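The two-step procedure can be sketched as follows (a minimal illustration, assuming Silverman's default sensitivity parameter of 1/2 for the local bandwidth factors, and using the bandwidth constants quoted above; the R package implementation may differ in details):

```python
import numpy as np

def triweight(u):
    """Triweight kernel K(u) = (35/32)(1-u^2)^3 on [-1, 1], else 0."""
    return np.where(np.abs(u) <= 1, (35.0 / 32.0) * (1 - u ** 2) ** 3, 0.0)

def adaptive_density(x_eval, data):
    """Two-step adaptive kernel density estimate (Silverman 1986, ch. 5.3):
    a fixed-bandwidth pilot, then local bandwidth factors lambda_i."""
    n = len(data)
    # robust scale estimate from the 5%/95% quantile range
    sigma = (np.quantile(data, 0.95) - np.quantile(data, 0.05)) / (2 * 1.6449)
    h = 3.15 * sigma * n ** (-1 / 5)  # rule-of-thumb bandwidth, triweight
    # step 1: fixed-bandwidth pilot estimate at the data points
    pilot = triweight((data[:, None] - data[None, :]) / h).mean(axis=1) / h
    pilot = np.maximum(pilot, 1e-300)  # guard against log(0)
    # step 2: local factors lambda_i = (pilot_i / g)^(-1/2),
    # g the geometric mean of the pilot values (alpha = 1/2 assumed)
    g = np.exp(np.mean(np.log(pilot)))
    lam = (pilot / g) ** (-0.5)
    u = (x_eval[:, None] - data[None, :]) / (h * lam[None, :])
    return (triweight(u) / (h * lam[None, :])).mean(axis=1)

rng = np.random.default_rng(0)
data = rng.normal(size=2_000)
fhat = adaptive_density(np.array([0.0]), data)  # near 1/sqrt(2*pi) ≈ 0.399
```

The adaptive step shrinks the bandwidth where the pilot density is high and widens it in sparse regions, which is what makes the estimator usable in the thick-tailed settings the paper targets.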
Here c is a normalization constant such that ∑_{i=1}^{n_0} w(i/(n_0 + 1)) = 1, and quantile(Y^{(0)}, u) is the u quantile of all the control observations for the full-sample estimator. f̂(F̂^{−1}(u)) is an ad hoc estimate of the density at the u quantile, calculated as f̂(F̂^{−1}(i/(n_0 + 1))) = (1/n_0)/max(∆Y^{(1)}_{(⌈n_1 i/(n_0+1)⌉)}), with ∆ the first-difference operator. The factor MAD, i.e., the median absolute deviation of the data (multiplied by 1.4826 such that it estimates the standard deviation if the data are normally distributed), is used so that the truncation condition is both shift-invariant and scale-invariant. The truncation of the weights w̃ addresses violations of condition (A.11) in the tails by truncating observations with excessive (normalized) weight relative to the density at the quantile. In the simulations, it improves the performance for the Cauchy distribution in some samples with extreme outliers, and otherwise has no effect. For the sample-splitting estimator, we randomly split the samples of treated and control units in half, use the first half to estimate w(u) (the densities and quantile(Y^{(0)}, u)), and then evaluate equation (A.17) for the second half of the sample given w; we then reverse the roles of the sample splits and average the two estimators, as described in the main text. Under correct specification, θ̂_waq and θ̂_if are asymptotically equivalent, so the standard errors based on Î are appropriate for θ̂_waq as well. Table 4 reports results for the influence function-based and weighted average quantile estimators with and without sample splitting for the simulations shown in Tables 1, 2, and 3.
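As an illustration of the weighted-average-quantile construction, the following sketch computes the estimator on the quantile grid i/(n_0 + 1), taking the weight function −(log f)″ as known rather than estimated (a hypothetical simplification; the full estimator also uses the density estimates, truncation, and sample splitting described above):

```python
import numpy as np

def waq_estimate(y_treat, y_control, neg_d2_log_f):
    """Weighted average quantile estimator sketch: a weighted average of
    quantile treatment effects, with weights proportional to minus the
    second derivative of the log density of the control outcomes."""
    n0 = len(y_control)
    u = np.arange(1, n0 + 1) / (n0 + 1)     # quantile grid i/(n0+1)
    q0 = np.quantile(y_control, u)           # control quantiles
    q1 = np.quantile(y_treat, u)             # treated quantiles
    w = neg_d2_log_f(q0)
    w = w / w.sum()                          # normalize weights to sum to one
    return np.sum(w * (q1 - q0))

# Normal outcomes: -(log f)'' is constant, so all quantile differences
# receive equal weight and the estimator behaves like a mean difference
rng = np.random.default_rng(0)
y0 = rng.normal(0.0, 1.0, size=5_000)
y1 = rng.normal(0.3, 1.0, size=5_000)
tau_hat = waq_estimate(y1, y0, lambda y: np.ones_like(y))
```

For thick-tailed outcomes one would pass the estimated −(log f̂)″ instead of the constant weight, which down-weights the extreme quantiles exactly as the Cauchy weight function in Figure 2 suggests.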