## Abstract

Hundreds of papers and factors attempt to explain the cross-section of expected returns. Given this extensive data mining, it does not make sense to use the usual criteria for establishing significance. Which hurdle should be used for current research? Our paper introduces a new multiple testing framework and provides historical cutoffs from the first empirical tests in 1967 to today. A new factor needs to clear a much higher hurdle, with a *t*-statistic greater than 3.0. We argue that most claimed research findings in financial economics are likely false.

Received October 22, 2014; accepted June 15, 2015 by Editor Andrew Karolyi.

Over forty years ago, one of the first tests of the capital asset pricing model (CAPM) found that the market beta was a significant explanator of the cross-section of expected returns. The reported *t*-statistic of 2.57 in Fama and MacBeth (1973, Table III) comfortably exceeded the usual cutoff of 2.0. However, since that time, hundreds of papers have tried to explain the cross-section of expected returns. Given the known number of factors that have been tried and the reasonable assumption that many more factors have been tried but did not make it to publication, the usual cutoff levels for statistical significance may not be appropriate. We present a new framework that allows for multiple tests and derive recommended statistical significance levels for current research in asset pricing.

We begin with 313 papers published in a selection of journals that study cross-sectional return patterns. We provide recommended test thresholds from the first empirical tests in 1967 to present day. We also project minimum *t*-statistics through 2032, assuming the rate of “factor production” remains the same as the last ten years. We present a taxonomy of historical factors, as well as definitions.^{1}

Our research is related to a recent paper by McLean and Pontiff (2015), who argue that certain stock market anomalies are less anomalous after being published.^{2} Their paper tests the statistical biases emphasized in Leamer (1978), Ross (1989), Lo and MacKinlay (1990), Fama (1991), and Schwert (2003).

Our paper also adds to the recent literature on biases and inefficiencies in cross-sectional regression studies. Lewellen, Nagel, and Shanken (2010) critique the usual practice of using cross-sectional $R^2$s and pricing errors to judge success and show that the explanatory power of many previously documented factors is spurious. Our work focuses on evaluating the statistical significance of a factor given the previous tests on other factors. Our goal is to use a multiple testing framework to both re-evaluate past research and to provide a new benchmark for current and future research.

We tackle multiple hypothesis testing from the frequentist perspective. Bayesian approaches to multiple testing and variable selection also exist.^{3} However, the high dimensionality of the problem, combined with the fact that we do not observe all the factors that have been tried, poses a big challenge for Bayesian methods. While we propose a frequentist approach to overcome this missing data issue, it is unclear how to do this in the Bayesian framework. Nonetheless, we provide a detailed discussion of Bayesian methods in the paper.

Multiple testing has only recently gained traction in the finance literature. For the literature on multiple testing corrections for data snooping biases, see Sullivan, Timmermann, and White (1999, 2001) and White (2000). For research on data snooping and variable selection in predictive regressions, see Foster, Smith, and Whaley (1997), Cooper and Gulen (2006), and Lynch and Vital-Ahuja (2012). For applications of the multiple testing approach in the finance literature, see, for example, Shanken (1990), Ferson and Harvey (1999), Boudoukh et al. (2007), and Patton and Timmermann (2010). More recently, multiple testing methods have been used to study technical trading and mutual fund performance; see, for example, Barras, Scaillet, and Wermers (2010), Bajgrowicz and Scaillet (2012), and Kosowski et al. (2006). Conrad, Cooper, and Kaul (2003) point out that data snooping accounts for a large proportion of the return differential between equity portfolios that are sorted by firm characteristics. Bajgrowicz, Scaillet, and Treccani (2013) show that multiple testing methods help eliminate a large proportion of spurious jumps detected using conventional test statistics for high-frequency data. Holland, Basu, and Sun (2010) emphasize the importance of multiple testing in accounting research. Our paper is consistent with the theme of this literature.

There are limitations to our framework. First, should all factor discoveries be treated equally? We think no. A factor derived from a theory should have a lower hurdle than a factor discovered from a purely empirical exercise. Economic theories are based on a few economic principles and, as a result, there is less room for data mining. Nevertheless, whether suggested by theory or empirical work, a *t*-statistic of 2.0 is too low. Second, our tests focus on unconditional tests. While the unconditional test might consider the factor marginal, it is possible that this factor is very important in certain economic environments and not important in other environments. These two caveats need to be taken into account when using our recommended significance levels for current asset pricing research.

While our focus is on the cross-section of equity returns, our message applies to many different areas of finance. For instance, Frank and Goyal (2009) investigate around thirty variables that have been documented to explain the capital structure decisions of public firms. Welch and Goyal (2008) examine the performance of a dozen variables that have been shown to predict market excess returns. Novy-Marx (2014) proposes unconventional variables to predict anomaly returns. These three applications are ideal settings to employ multiple testing methods.

## 1. The Search Process

Our goal is not to catalog every asset pricing paper ever published. We narrow the focus to papers that propose and test new factors. For example, Sharpe (1964), Lintner (1965), and Mossin (1966) all theoretically proposed, at roughly the same time, a single-factor model: the capital asset pricing model (CAPM). Following Fama and MacBeth (1973), there are hundreds of papers that test the CAPM. We include the theoretical papers, as well as the first paper to provide test statistics. We do not include the hundreds of papers that test the CAPM in different contexts, for example, various international markets and different time periods. We do, however, include papers such as Kraus and Litzenberger (1976), which tests the market factor as well as one additional risk factor linked to the market factor.

Sometimes different papers propose different empirical proxies for the same type of economic risk. Although they may look similar from a theoretical standpoint, we still include them. An example is the empirical proxies for idiosyncratic financial constraints risk. While Lamont, Polk, and Saa-Requejo (2001) use the Kaplan and Zingales (1997) index to proxy for firm-level financial constraints, Whited and Wu (2006) estimate their own constraint index based on the first-order conditions of firms' optimization problem. We include both even though they are likely highly correlated.

Since our focus is on factors that can broadly explain return patterns, we omit papers that focus on a small group of stocks or a short period of time. This will, for example, exclude a substantial amount of empirical corporate finance research that studies event-driven return movements.^{4}

Certain theoretical models lack immediate empirical content. Although they could be empirically relevant once suitable proxies are constructed, we choose to exclude them.

With these rules in mind, we narrow our search to generally the top journals in finance, economics, and accounting. To include the most recent research, we search for working papers on the Social Science Research Network (SSRN). Working papers pose a challenge because there are thousands of them, and they have not been subjected to peer review. We choose a subset of papers that we suspect are in review at top journals, have been presented at top conferences, or are due to be presented at top conferences. We end with 63 working papers. In total, we focus on 313 articles, among which are 250 published articles. We catalogue 316 different factors.^{5}

Our collection of 316 factors likely underrepresents the factor population. First, we generally only consider top journals. Second, we are selective in choosing only a handful of working papers. Third, sometimes there are many variants of the same characteristic, and we usually only include the most representative ones. Fourth, and perhaps most importantly, we should be measuring the number of factors tested (which is unobservable)—that is, we do not observe the factors that were tested but that failed to pass the usual significance levels and were never published (see Fama 1991). Our multiple testing framework tries to account for this possibility.

## 2. Factor Taxonomy

To facilitate our analysis, we group the factors into different categories. We start with two broad categories: “common” and individual firm “characteristics.” “Common” means the factor can be viewed as a proxy for a common source of risk. Risk exposure to this factor or its innovations is supposed to help explain cross-sectional return patterns. “Characteristics” means the factor is specific to the security or portfolio. A good example is Fama and MacBeth (1973). While the beta against the market return is systematic (exposure to a common risk factor), the standard deviation of the market model residual is not based on a common factor—it is a property of the individual firm, that is, it is an idiosyncratic characteristic.

Strictly speaking, a risk factor should be a variable that has unpredictable variation through time. Moreover, assets' risk exposures to this factor need to be able to explain the cross-sectional return patterns. Based on these criteria, individual firm characteristics should not qualify as risk factors because characteristics are preknown and have limited time-series variation. However, we interpret firm characteristics in a broader sense. If a certain firm characteristic is found to be correlated with the cross-section of expected returns, a long-short portfolio can usually be constructed to proxy for the underlying unknown risk factor. It is this unknown risk factor that we have in mind when we classify particular firm characteristics as risk factors.

Based on the unique properties of the proposed factors, we further divide the “common” and “characteristics” groups into finer categories. In particular, we divide “common” into “financial,” “macro,” “microstructure,” “behavioral,” “accounting,” and “other.” We divide “characteristics” into the same categories, except we omit the “macro” classification, which is common, by definition. The following table provides further details on the definitions of these subcategories and gives examples for each.

## 3. Adjusted *t*-statistics in Multiple Testing

### 3.1 Why multiple testing?

Given that so many papers have attempted to explain the same cross-section of expected returns, statistical inference should not be based on a “single” test perspective. Our goal is to provide guidance as to the appropriate significance level using a multiple testing framework. When just one hypothesis is tested, we use the terms “individual test,” “single test,” and “independent test” interchangeably.^{6}

Strictly speaking, different papers study different sample periods and hence focus on different cross-sections of expected returns. However, the bulk of the papers we consider have substantial overlapping sample periods. Also, if one believes that cross-sectional return patterns are stationary, then these papers are studying roughly the same cross-section of expected returns.

| Risk type | | Description | Examples |
|---|---|---|---|
| Common (113) | Financial (46) | Proxy for aggregate financial market movement, including market portfolio returns, volatility, squared market returns, among others | Sharpe (1964): market returns; Kraus and Litzenberger (1976): squared market returns |
| | Macro (40) | Proxy for movement in macroeconomic fundamentals, including consumption, investment, inflation, among others | Breeden (1979): consumption growth; Cochrane (1991): investment returns |
| | Microstructure (11) | Proxy for aggregate movements in market microstructure or financial market frictions, including liquidity, transaction costs, among others | Pastor and Stambaugh (2003): market liquidity; Lo and Wang (2006): market trading volume |
| | Behavioral (3) | Proxy for aggregate movements in investor behavior, sentiment or behavior-driven systematic mispricing | Baker and Wurgler (2006): investor sentiment; Hirshleifer and Jiang (2010): market mispricing |
| | Accounting (8) | Proxy for aggregate movement in firm-level accounting variables, including payout yield, cash flow, among others | Fama and French (1992): size and book-to-market; Da and Warachka (2009): cash flow |
| | Other (5) | Proxy for aggregate movements that do not fall into the above categories, including momentum, investors' beliefs, among others | Carhart (1997): return momentum; Ozoguz (2009): investors' beliefs |
| Characteristics (202) | Financial (61) | Proxy for firm-level idiosyncratic financial risks, including volatility, extreme returns, among others | Ang et al. (2006): idiosyncratic volatility; Bali, Cakici, and Whitelaw (2011): extreme stock returns |
| | Microstructure (28) | Proxy for firm-level financial market frictions, including short sale restrictions, transaction costs, among others | Jarrow (1980): short sale restrictions; Mayshar (1981): transaction costs |
| | Behavioral (3) | Proxy for firm-level behavioral biases, including analyst dispersion, media coverage, among others | Diether, Malloy, and Scherbina (2002): analyst dispersion; Fang and Peress (2009): media coverage |
| | Accounting (87) | Proxy for firm-level accounting variables, including PE ratio, debt-to-equity ratio, among others | Basu (1977): PE ratio; Bhandari (1988): debt-to-equity ratio |
| | Other (24) | Proxy for firm-level variables that do not fall into the above categories, including political campaign contributions, ranking-related firm intangibles, among others | Cooper, Gulen, and Ovtchinnikov (2010): political campaign contributions; Edmans (2011): intangibles |

The numbers in parentheses represent the number of factors identified. See Table 6 and http://faculty.fuqua.duke.edu/~charvey/Factor-List.xlsx.

We want to emphasize that there are many forces that make our guidance lenient; that is, a credible case can be made for an even higher threshold for discovery. We have already mentioned that we only sample a subset of research papers and the “publication bias/hidden tests” issue (i.e., it is difficult to publish a nonresult).^{7} However, there is another publication bias that is more subtle. In many scientific fields, replication studies routinely appear in top journals. That is, a factor is discovered, and others try to replicate it. In finance and economics, it is very difficult to publish replication studies. Hence, there is a bias towards publishing “new” factors rather than rigorously verifying the existence of discovered factors.

There are two ways to deal with the bias introduced by multiple testing: out-of-sample validation and using a statistical framework that allows for multiple testing.^{8} When feasible, out-of-sample testing is the cleanest way to rule out spurious factors. In their study of anomalies, McLean and Pontiff (2015) take the out-of-sample approach. Their results show a degradation of performance of identified anomalies after publication, which is consistent with the statistical bias. It is possible that this degradation is larger than they document. In particular, they drop 12 of their 97 anomalies because they could not replicate the in-sample performance of published studies. Given that these nonreplicable anomalies were not even able to survive routine data revisions, they are likely to be insignificant strategies, either in-sample or out-of-sample. The degradation from the original published “alpha” is 100% for these strategies, which would lead to a higher average rate of degradation for their strategies.

While the out-of-sample approach has many strengths, it has one important drawback: it cannot be used in real time. To make real time assessments in the out-of-sample approach, it is common to hold out some data. However, this is not genuine out-of-sample testing as all the data are observable to researchers. A real out-of-sample test requires data in the future. In contrast to many tests in the physical sciences (where new data can be created for an experiment), we often need years of data to do an out-of-sample test. We pursue the multiple testing framework because it yields immediate guidance on whether a discovered factor is real.

### 3.2 A multiple testing framework

In statistics, multiple testing refers to simultaneous testing of more than one hypothesis. The statistics literature was aware of this multiplicity problem at least 60 years ago.^{9} Early generations of multiple testing procedures focus on the control of the family-wise error rate (see Section 3.3.1). More recently, increasing interest in multiple testing from the medical literature has spurred the development of methods that control the false discovery rate (see Section 3.3.2). Multiple testing is an active research area in both the statistics and the medical literature.^{10}

Despite the rapid development of multiple testing methods, they have not attracted much attention in the finance literature. Moreover, most of the research that does involve multiple testing focuses on the Bonferroni adjustment,^{11} which is known to be too stringent. Our paper aims to fill this gap.

First, we introduce a hypothetical example to motivate a more general framework. In Table 2, we categorize the possible outcomes of a multiple testing exercise. Panel A displays an example of what the literature could have discovered, and panel B notationalizes panel A to ease our subsequent definition of the general type I error rate—the chance of making at least one false discovery or the expected fraction of false discoveries.

Panel A: An example

| | Unpublished | Published | Total |
|---|---|---|---|
| Truly insignificant | 500 | 50 | 550 |
| Truly significant | 100 | 50 | 150 |
| Total | 600 | 100 ($R$) | 700 ($M$) |

Panel B: The testing framework

| | $H_0$ not rejected | $H_0$ rejected | Total |
|---|---|---|---|
| $H_0$ true | $N_{0 \mid a}$ | $N_{0 \mid r}$ | $M_0$ |
| $H_0$ false | $N_{1 \mid a}$ | $N_{1 \mid r}$ | $M_1$ |
| Total | $M - R$ | $R$ | $M$ |

Panel A shows a hypothetical example for factor testing. Panel B presents the corresponding notation in a standard multiple testing framework.

Our example in panel A assumes 100 published factors (denoted as $R$). Among these factors, suppose 50 are false discoveries and the rest are real ones. In addition, researchers have tried 600 other factors, but none were found to be significant. Among them, 500 are truly insignificant, but the other 100 are true factors. The total number of tests ($M$) is 700. Two types of mistakes are made in this process: 50 factors are falsely discovered to be true (type I error or false positive), while 100 true factors are buried in unpublished work (type II error or false negative). The usual statistical control in a multiple testing context aims at reducing “50” or “50/100,” the absolute or proportionate occurrence of false discoveries, respectively. Of course, we only observe published factors because factors that are tried and found to be insignificant rarely make it to publication.^{12} This poses a challenge since the usual statistical techniques only handle the case in which all test results are observable.

Panel B defines the corresponding terms in a formal statistical testing framework. In a factor testing exercise, the typical null hypothesis is that a factor is not significant. Therefore, a factor being insignificant means the null hypothesis is “true.” Using “0” (“1”) to indicate the null is true (false) and “a” (“r”) to indicate “not reject” (“reject”), we can easily summarize panel A. For instance, $N_{0|r}$ measures the number of rejections when the null is true (i.e., the number of false discoveries) and $N_{1|a}$ measures the number of failed rejections when the null is false (i.e., the number of missed discoveries). To avoid confusion, we try not to use standard statistical language in describing our notation but rather use words unique to our factor testing context. The generic notation in panel B is convenient in formally defining different types of errors and describing adjustment procedures in subsequent sections.

### 3.3 Type I and type II errors

For a single hypothesis test, a value $\alpha $ is used to control type I error rate: the probability of finding a factor to be significant when it is not. The $\alpha $ is sometimes called the “level of significance.” In a multiple testing framework, restricting each individual test's type I error rate at $\alpha $ is not enough to control the overall probability of false discoveries. The intuition is that, under the null that all factors are insignificant, it is very likely for an event with $\alpha $ probability to occur when many factors are tested. In multiple hypothesis testing, we need measures of the type I error that help us simultaneously evaluate the outcomes of many individual tests.
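The intuition can be made concrete with a minimal sketch. It assumes that the tests are independent and that every null hypothesis is true; both are simplifying assumptions introduced here for illustration, and the function name `prob_at_least_one_false_discovery` is hypothetical.

```python
def prob_at_least_one_false_discovery(alpha: float, num_tests: int) -> float:
    """P(at least one rejection) when all nulls are true.

    Assumes independent tests, each run at level `alpha`. This is an
    illustrative assumption; real factor tests are correlated.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    # Complement of "no test rejects": (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for m in (1, 10, 100, 300):
        print(m, round(prob_at_least_one_false_discovery(0.05, m), 4))
```

Under these assumptions, with $\alpha$ at 5% the chance of at least one false discovery already exceeds 40% at ten tests, even though each individual test is controlled at 5%.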

To gain some intuition about plausible measures of type I error, we return to panel B of Table 2. $N_{0|r}$ and $N_{1|a}$ count the total number of the two types of errors: $N_{0|r}$ counts false discoveries, while $N_{1|a}$ counts missed discoveries. As generalized from single hypothesis testing, the type I error in multiple hypothesis testing is also related to false discoveries, by which we conclude a factor is “significant” when it is not. But, by definition, we must draw several conclusions in multiple hypothesis testing, and there is a possible false discovery for each. Therefore, plausible definitions of the type I error should take into account the joint occurrence of false discoveries.

The literature has adopted at least two ways of summarizing the “joint occurrence.” One approach is to count the total number of false discoveries $N_{0|r}$. $N_{0|r}$ greater than zero suggests incorrect statistical inference for the overall multiple testing problem, the occurrence of which we should limit. Therefore, the probability of the event $N_{0|r} > 0$ should be a meaningful quantity for us to control. Indeed, this is the intuition behind the family-wise error rate introduced later. On the other hand, when the total number of discoveries $R$ is large, one or even a few false discoveries may be tolerable. In this case, $N_{0|r}$ is no longer a suitable measure; a certain false discovery proportion may be more desirable. Unsurprisingly, the expected value of $N_{0|r}/R$ is the focus of the false discovery rate, the second type of control.

#### 3.3.1 Family-wise error rate

The two aforementioned measures are the most widely used in the statistics literature. Moreover, many other techniques can be viewed as extensions of these measures. Holm (1979) is the first to formally define the family-wise error rate. Benjamini and Hochberg (1995) define and study the false discovery rate. Alternative definitions of error rate include per comparison error rate (Saville 1990), positive false discovery rate (Storey 2003), and generalized false discovery rate (Sarkar and Guo 2009). We now describe the two leading approaches in detail.

The family-wise error rate (FWER) is the probability of at least one type I error:

$$\text{FWER} = \Pr(N_{0|r} \geq 1).$$

#### 3.3.2 False discovery rate

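The *false discovery proportion* (FDP) is the proportion of type I errors:

$$\text{FDP} = \begin{cases} \dfrac{N_{0|r}}{R} & \text{if } R > 0, \\ 0 & \text{if } R = 0. \end{cases}$$

The false discovery rate (FDR) is the expected value of the FDP:

$$\text{FDR} = E[\text{FDP}].$$

FDR is a less stringent criterion than FWER.^{13} Intuitively, this is because FDR allows $N_{0|r}$ to grow in proportion to $R$, whereas FWER measures the probability of making even a single type I error.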

Returning to our example, panel A of Table 2 shows that a false discovery event has occurred under FWER since $N_{0|r} = 50 \geq 1$ and that the realized *FDP* is high, $50/100 = 50\%$. This suggests that the probability of false discoveries (FWER) and the expected proportion of false discoveries (FDR) may both be high.^{14} The remedy, as suggested by many FWER and FDR adjustment procedures, is to lower *p*-value thresholds for these hypotheses (*p*-value, as defined in our context, is the single test probability of having a *t*-statistic that is at least as large as the observed one under the null hypothesis). In terms of panel A, this would turn some of the fifty false discoveries insignificant and push them into the “Unpublished” category. Hopefully, the fifty true discoveries would survive this change in *p*-value threshold and remain significant, which is only possible if their *p*-values are relatively small.

On the other hand, type II errors (the mistake of missing true factors) are also important in multiple hypothesis testing. Similar to type I errors, both the total number of missed discoveries $N_{1|a}$ and the fraction of missed discoveries among all abandoned tests $N_{1|a}/(M - R)$ are frequently used to measure the severity of type II errors.^{15} Ideally, one would like to simultaneously minimize the chance of committing a type I error and that of committing a type II error. In our context, we would like to include as few insignificant factors (i.e., as low a type I error rate) as possible and simultaneously as many significant ones (i.e., as low a type II error rate) as possible. Unfortunately, this is not feasible: as in single hypothesis testing, a decrease in the type I error rate often leads to an increase in the type II error rate and vice versa. We therefore seek a balance between the two types of errors. A standard approach is to specify a significance level $\alpha $ for the type I error rate and derive testing procedures that aim to minimize the type II error rate, that is, maximize power, among the class of tests with type I error rate at most $\alpha $.
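The two error measures can be checked against the hypothetical counts in panel A of Table 2 with a short sketch. The class name `OutcomeCounts` and its field names are illustrative, and returning zero when a denominator is zero is a convention assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutcomeCounts:
    """Counts from a two-by-two multiple testing table (Table 2, panel B)."""

    n0_a: int  # null true, not rejected
    n0_r: int  # null true, rejected (false discoveries)
    n1_a: int  # null false, not rejected (missed discoveries)
    n1_r: int  # null false, rejected (true discoveries)

    @property
    def discoveries(self) -> int:
        """R, the number of rejected nulls."""
        return self.n0_r + self.n1_r

    @property
    def total_tests(self) -> int:
        """M, the total number of tests."""
        return self.n0_a + self.n0_r + self.n1_a + self.n1_r

    def false_discovery_proportion(self) -> float:
        """N_{0|r} / R, taken to be 0 when there are no discoveries."""
        r = self.discoveries
        return self.n0_r / r if r > 0 else 0.0

    def missed_discovery_fraction(self) -> float:
        """N_{1|a} / (M - R), taken to be 0 when nothing is abandoned."""
        abandoned = self.total_tests - self.discoveries
        return self.n1_a / abandoned if abandoned > 0 else 0.0


# Panel A of Table 2: 500 and 50 truly insignificant, 100 and 50 truly significant.
panel_a = OutcomeCounts(n0_a=500, n0_r=50, n1_a=100, n1_r=50)
```

For panel A this gives a false discovery proportion of 50/100 and a missed discovery fraction of 100/600, matching the counts discussed in the text.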

When comparing two testing procedures that can both achieve a significance level $\alpha $, it seems reasonable to use their type II error rates. However, when we have multiple tests, the exact type II error rate typically depends on a set of unknown parameters and is therefore difficult to assess.^{16} To overcome this difficulty, researchers frequently use the distance of the actual type I error rate to some prespecified significance level as the measure for a procedure's efficiency. Intuitively, if a procedure's actual type I error rate is strictly below $\alpha $, we can probably push this error rate closer to $\alpha $ by making the testing procedure less stringent, that is, a higher *p*-value threshold so there will be more discoveries. In doing so, the type II error rate is presumably lowered given the inverse relation between the two types of error rates. Therefore, once a procedure's actual type I error rate falls below a prespecified significance level, we want it to be as close as possible to that significance level in order to achieve the smallest type II error rate. Ideally, we would like a procedure's actual type I error rate to be exactly the same as the given significance level.^{17}

Both FWER and FDR are important concepts that are widely applied in many scientific fields. However, based on specific applications, one may be preferred over the other. When the number of tests is very large (e.g., a million), FWER controlling procedures tend to become very tough as they control for the occurrence of even a single false discovery among one million tests. As a result, they often lead to a very limited number of discoveries, if any. Conversely, FWER control is more desirable when the number of tests is relatively small, in which case more discoveries can be achieved and at the same time trusted. In the context of our paper, we are sure that many tests have been tried in the finance literature. Although there are around 300 published ones, hundreds or even thousands of factors could have been constructed and tested. However, it is not clear whether this number should be considered “large” compared to the number of tests conducted in, say, medical research.^{18} This creates difficulty in choosing between FWER and FDR. Given this difficulty, we do not take a stand on the relative appropriateness of these two measures but instead provide adjusted *p*-values for both. Researchers can compare their *p*-values with these benchmarks to see whether FDR or even FWER is satisfied.
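To illustrate how FWER control tightens as the number of tests grows, the sketch below computes the two-sided hurdle implied by the Bonferroni bound, under which each test is run at level $\alpha/M$. It treats the *t*-statistic as standard normal, which is an approximation assumed here, and the function name `bonferroni_t_hurdle` is hypothetical.

```python
from statistics import NormalDist


def bonferroni_t_hurdle(num_tests: int, alpha: float = 0.05) -> float:
    """Two-sided critical value when each test is run at alpha / num_tests.

    Uses a standard normal approximation to the t-statistic
    (an assumption made here for simplicity).
    """
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    per_test_alpha = alpha / num_tests
    # Split the per-test level equally between the two tails.
    return NormalDist().inv_cdf(1.0 - per_test_alpha / 2.0)


if __name__ == "__main__":
    for m in (1, 10, 100, 300, 1_000_000):
        print(f"M = {m:>9,d}  hurdle = {bonferroni_t_hurdle(m):.2f}")
```

Under these assumptions the hurdle is about 1.96 for a single test and is above 3 once a hundred tests are considered. It keeps rising with $M$, which is why FWER procedures yield few discoveries when the number of tests is very large.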

### 3.4 *p*-value adjustment: Three approaches

The statistics literature has developed many methods to control both FWER and FDR.^{19} We choose to present the three most well-known adjustments: Bonferroni, Holm, and Benjamini, Hochberg, and Yekutieli (BHY). Both Bonferroni and Holm control FWER, and BHY controls FDR. Depending on how the adjustment is implemented, they can be categorized into two general types of corrections: a “single-step” correction equally adjusts each *p*-value, and a “sequential” correction is an adaptive procedure that depends on the entire distribution of *p*-values. Bonferroni is a single-step procedure, whereas Holm and BHY are sequential procedures. Table 3 summarizes the two properties of the three methods.

Adjustment type | Single/Sequential | Multiple test |
---|---|---|
Bonferroni | single | family-wise error rate |
Holm | sequential | family-wise error rate |
Benjamini, Hochberg, and Yekutieli (BHY) | sequential | false discovery rate |

Suppose there are in total $M$ tests and we choose to set FWER at $\alpha_w$ and FDR at $\alpha_d$. In particular, we consider an example with the total number of tests $M=10$ to illustrate how different adjustment procedures work. In this example, we set both $\alpha_w$ and $\alpha_d$ at 5%. Table 4, panel A, lists the *t*-statistics and the corresponding *p*-values for ten hypothetical tests. The numbers in the table are broadly consistent with the magnitude of *t*-statistics that researchers report for factor significance. Note that all ten factors will be “discovered” if we test one hypothesis at a time. Multiple testing adjustments will usually generate different results.^{20}

Panel A: Single tests and “significant” factors | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Test → | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | # of discoveries |
*t*-statistic | 1.99 | 2.63 | 2.21 | 3.43 | 2.17 | 2.64 | 4.56 | 5.34 | 2.75 | 2.49 | 10 |
*p*-value (%) | **4.66** | **0.85** | **2.71** | **0.05** | **3.00** | **0.84** | **0.00** | **0.00** | **0.60** | **1.28** | |

Panel B: Bonferroni “significant” factors | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Test → | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | # of discoveries |
*t*-statistic | 1.99 | 2.63 | 2.21 | 3.43 | 2.17 | 2.64 | 4.56 | 5.34 | 2.75 | 2.49 | 3 |
*p*-value (%) | 4.66 | 0.85 | 2.71 | **0.05** | 3.00 | 0.84 | **0.00** | **0.00** | 0.60 | 1.28 | |

Panel C: Holm adjusted *p*-values and “significant” factors | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Reordered tests $b$ | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | (9) | (10) | # of discoveries |
Old order | 8 | 7 | 4 | 9 | 6 | 2 | 10 | 3 | 5 | 1 | 4 |
*p*-value (%) | **0.00** | **0.00** | **0.05** | **0.60** | 0.84 | 0.85 | 1.28 | 2.71 | 3.00 | 4.66 | |
$\alpha_w/(M+1-b)$ (%), $\alpha_w=5\%$ | 0.50 | 0.56 | 0.63 | 0.71 | 0.83 | 1.00 | 1.25 | 1.67 | 2.50 | 5.00 | |

Panel D: BHY adjusted *p*-values and “significant” factors | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Reordered tests $b$ | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | (9) | (10) | # of discoveries |
Old order | 8 | 7 | 4 | 9 | 6 | 2 | 10 | 3 | 5 | 1 | 6 |
*p*-value (%) | **0.00** | **0.00** | **0.05** | **0.60** | **0.84** | **0.85** | 1.28 | 2.71 | 3.00 | 4.66 | |
$(b\cdot\alpha_d)/(M\times c(M))$ (%), $\alpha_d=5\%$ | 0.17 | 0.34 | 0.51 | 0.68 | 0.85 | 1.02 | 1.19 | 1.37 | 1.54 | 1.71 | |

The table displays ten *t*-statistics and their associated *p*-values for a hypothetical example. Panels A and B highlight the significant factors under single tests and Bonferroni's procedure, respectively. Panels C and D explain Holm's and BHY's adjustment procedures, respectively. The bold numbers in each panel are associated with significant factors under the specific adjustment procedure of that panel. $M$ represents the total number of tests ($M=10$) and $c(M)=\sum_{j=1}^{M}1/j$. $b$ is the order of *p*-values from lowest to highest. $\alpha_w$ is the significance level for Bonferroni's and Holm's procedures, and $\alpha_d$ is the significance level for BHY's procedure. Both numbers are set at 5%. The cutoff *p*-value for Bonferroni is 0.5%, for Holm is 0.60%, and for BHY is 0.85%.

#### 3.4.1 Bonferroni's adjustment

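Bonferroni's adjustment is as follows:

- Reject any hypothesis with a *p*-value $\le\alpha_w/M$. The equivalent adjusted *p*-value is

$$p_i^{\text{Bonferroni}}=\min[M\,p_i,\,1].$$

Bonferroni applies the same adjustment to each test. It inflates the original *p*-value by the number of tests $M$; the adjusted *p*-value is compared with the threshold value $\alpha_w$.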

**Example 3.4.1** To apply Bonferroni's adjustment to the example in Table 4, we simply multiply all the *p*-values by ten and compare the new *p*-values with $\alpha_w=5\%$. Equivalently, we can look at the original *p*-values and consider the cutoff of 0.5% ($=\alpha_w/10$). This leaves the *t*-statistics of tests 4, 7, and 8 as significant, which are highlighted in panel B.

Using the notation in panel B of Table 2 and assuming $M_0$ of the $M$ null hypotheses are true, Bonferroni operates as a single-step procedure that can be shown to restrict FWER at levels less than or equal to $(M_0\times\alpha_w)/M$, without any assumption on the dependence structure of the *p*-values. Since $M_0\le M$, Bonferroni also controls FWER at level $\alpha_w$.^{21}
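The single-step rule can be checked directly on the ten hypothetical *p*-values of Table 4. The following is a minimal Python sketch, not the authors' code; the function name `bonferroni` and the variable names are illustrative assumptions, and the *p*-values are the rounded percentages printed in panel A.

```python
# p-values (in percent) copied from Table 4, panel A, in test order 1..10.
P_VALUES_PCT = [4.66, 0.85, 2.71, 0.05, 3.00, 0.84, 0.00, 0.00, 0.60, 1.28]


def bonferroni(p_values, alpha):
    """Return (adjusted p-values, 1-based labels of rejected tests)."""
    m = len(p_values)
    # Single-step rule: every p-value is inflated by the same factor M, capped at 1.
    adjusted = [min(m * p, 1.0) for p in p_values]
    rejected = [i + 1 for i, p_adj in enumerate(adjusted) if p_adj <= alpha]
    return adjusted, rejected


p_values = [p / 100 for p in P_VALUES_PCT]
bonferroni_adjusted, bonferroni_rejected = bonferroni(p_values, alpha=0.05)
print(bonferroni_rejected)
```

Comparing the adjusted *p*-values with $\alpha_w$ is the same as comparing the original *p*-values with $\alpha_w/M$, so the sketch should single out the same tests as panel B.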

#### 3.4.2 Holm's adjustment

Sequential methods have been proposed to adjust *p*-values in multiple hypothesis testing.^{22} They are motivated by a seminal paper by Schweder and Spjotvoll (1982), who suggest a graphical presentation of the multiple testing *p*-values. In particular, using $N_p$ to denote the number of tests that have a *p*-value exceeding $p$, Schweder and Spjotvoll (1982) suggest plotting $N_p$ against $(1-p)$. When $p$ is not very small (e.g., $p>0.2$), it is very likely that the associated test is from the null hypothesis. In this case, the *p*-value for a null test can be shown to be uniformly distributed between 0 and 1. It then follows that for a large $p$ and under independence among tests, the expected number of tests with a *p*-value exceeding $p$ equals $T_0(1-p)$, where $T_0$ is the number of null hypotheses, i.e., $E(N_p)=T_0(1-p)$. By plotting $N_p$ against $(1-p)$, the graph should be approximately linear with slope $T_0$ for large *p*-values. Points on the graph that substantially deviate from this linear pattern should correspond to non-null hypotheses, i.e., discoveries. The gist of this argument, that large and small *p*-values should be treated differently, has been distilled into many variations of sequential adjustment methods, among which we will introduce Holm's method that controls FWER and BHY's method that controls FDR.
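The relation $E(N_p)=T_0(1-p)$ can be illustrated with a small simulation. This is a minimal Python sketch under stated assumptions: the number of nulls (`T0 = 1000`), the number of non-null tests, and the way non-null *p*-values are generated (uniform on a tiny interval near zero) are all hypothetical choices made only for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

T0 = 1000        # hypothetical number of true null hypotheses
N_SIGNALS = 200  # hypothetical number of non-null tests

# Null p-values are uniform on (0, 1); non-null p-values are assumed to be tiny.
null_p = [random.random() for _ in range(T0)]
signal_p = [random.uniform(0.0, 0.01) for _ in range(N_SIGNALS)]
all_p = null_p + signal_p


def n_exceeding(p_values, p):
    """N_p: the number of tests whose p-value exceeds p."""
    return sum(1 for x in p_values if x > p)


# For large p, E(N_p) = T0 * (1 - p), so N_p / (1 - p) estimates the slope T0.
t0_estimates = {p: n_exceeding(all_p, p) / (1 - p) for p in (0.2, 0.4, 0.6)}
for p, estimate in t0_estimates.items():
    print(f"p = {p:.1f}: estimated T0 = {estimate:.0f}")
```

The estimates should cluster around the assumed number of nulls rather than around the total number of tests, because the non-null *p*-values almost never exceed the chosen cutoffs.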

Holm's adjustment is as follows:

- Order the original *p*-values such that $p_{(1)}\le p_{(2)}\le\cdots\le p_{(b)}\le\cdots\le p_{(M)}$, and let the associated null hypotheses be $H_{(1)},H_{(2)},\ldots,H_{(b)},\ldots,H_{(M)}$.
- Let $k$ be the minimum index such that $p_{(b)}>\frac{\alpha_w}{M+1-b}$.
- Reject the null hypotheses $H_{(1)},\ldots,H_{(k-1)}$ (i.e., declare these factors significant), but not $H_{(k)},\ldots,H_{(M)}$.

The equivalent adjusted *p*-value is therefore

$$p_{(i)}^{\text{Holm}}=\min\left[\max_{j\le i}\left\{(M-j+1)\,p_{(j)}\right\},\,1\right].$$

Holm's adjustment is a step-down procedure: for the ordered *p*-values, we start from the smallest *p*-value and go down to the largest one.^{23} If $k$ is the smallest index that satisfies $p_{(b)}>\frac{\alpha_w}{M+1-b}$, we will reject all tests whose ordered index is below $k$.

To explore how Holm's adjustment procedure works, suppose $k$ is the smallest index such that $p_{(b)}>\frac{\alpha_w}{M+1-b}$. This means that for $b<k$, $p_{(b)}\le\frac{\alpha_w}{M+1-b}$. In particular, for $b=1$, Bonferroni equals Holm, that is, $\frac{\alpha_w}{M}=\frac{\alpha_w}{M+1-1}$; for $b=2$, $\frac{\alpha_w}{M}<\frac{\alpha_w}{M+1-2}$, so Holm is less stringent than Bonferroni. Since less stringent hurdles are applied to the second to the $(k-1)$th *p*-values, Holm's adjustment generates at least as many discoveries as Bonferroni's.

**Example 3.4.2** To apply Holm's adjustment to the example in Table 4, we first order the *p*-values in ascending order and try to locate the smallest index $k$ that makes $p_{(b)}>\frac{\alpha_w}{M+1-b}$. Table 4, panel C, shows the ordered *p*-values and the associated $\frac{\alpha_w}{M+1-b}$'s. Starting from the smallest *p*-value and going up, we see that $p_{(b)}$ is below $\frac{\alpha_w}{M+1-b}$ until $b=5$, at which point $p_{(5)}$ is above $\frac{\alpha_w}{10+1-5}$. Therefore, the smallest $b$ that satisfies $p_{(b)}>\frac{\alpha_w}{M+1-b}$ is 5 and we reject the null hypothesis for the first four ordered tests (we discover four factors) and fail to reject the null for the remaining six tests. The original labels for the rejected tests are in the second row in panel C. Compared to Bonferroni, one more factor (test 9) is discovered; that is, four factors rather than three are significant. In general, Holm's approach leads to at least as many discoveries as Bonferroni's, and all discoveries under Bonferroni are also discoveries under Holm's criteria.

Like Bonferroni, Holm also restricts FWER at $\alpha w$ without any requirement on the dependence structure of *p*-values. It can also be shown that Holm is uniformly more powerful than Bonferroni in that tests rejected (factors discovered) under Bonferroni are always rejected under Holm, but not vice versa. In other words, Holm leads to at least as many discoveries as Bonferroni. Given the dominance of Holm over Bonferroni, one might opt to only use Holm. We include Bonferroni because it is the most widely used adjustment and a simple single-step procedure.
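The step-down logic can be traced on the ten hypothetical *p*-values of Table 4. The following is a minimal Python sketch, not the authors' implementation; the function name `holm` is an illustrative assumption and the inputs are the rounded percentages printed in the table.

```python
# p-values (in percent) copied from Table 4, panel A, in test order 1..10.
P_VALUES_PCT = [4.66, 0.85, 2.71, 0.05, 3.00, 0.84, 0.00, 0.00, 0.60, 1.28]


def holm(p_values, alpha):
    """Step-down procedure: return the 1-based labels of rejected tests."""
    m = len(p_values)
    # Indices sorted by ascending p-value (ties keep their original order).
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = []
    for b, i in enumerate(order, start=1):
        # Stop at the first ordered p-value that exceeds alpha / (M + 1 - b).
        if p_values[i] > alpha / (m + 1 - b):
            break
        rejected.append(i + 1)
    return rejected


p_values = [p / 100 for p in P_VALUES_PCT]
holm_rejected = holm(p_values, alpha=0.05)
print(sorted(holm_rejected))
```

Because the two smallest *p*-values are both rounded to 0.00 in the table, the sketch may list tests 7 and 8 in a different order than panel C, but the set of rejected tests is unaffected.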

#### 3.4.3 Benjamini, Hochberg, and Yekutieli's adjustment

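Benjamini, Hochberg, and Yekutieli's (BHY) adjustment is as follows:

- As with Holm's procedure, order the original *p*-values such that $p_{(1)}\le p_{(2)}\le\cdots\le p_{(b)}\le\cdots\le p_{(M)}$ and let the associated null hypotheses be $H_{(1)},H_{(2)},\ldots,H_{(b)},\ldots,H_{(M)}$.
- Let $k$ be the maximum index such that $p_{(b)}\le\frac{b}{M\times c(M)}\alpha_d$.
- Reject null hypotheses $H_{(1)},\ldots,H_{(k)}$, but not $H_{(k+1)},\ldots,H_{(M)}$.

The equivalent adjusted *p*-value is defined sequentially as

$$p_{(i)}^{\text{BHY}}=\begin{cases}\min\left[c(M)\,p_{(M)},\,1\right] & \text{if } i=M,\\[4pt] \min\left[p_{(i+1)}^{\text{BHY}},\,\dfrac{M\times c(M)}{i}\,p_{(i)}\right] & \text{if } i\le M-1,\end{cases}$$

where $c(M)=\sum_{j=1}^{M}1/j$.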

In contrast to Holm's, BHY's method starts with the largest *p*-value and goes to the smallest one. If $k$ is the largest index that satisfies $p_{(b)}\le\frac{b}{M\times c(M)}\alpha_d$, we will reject tests (discover factors) whose ordered index is below or equal to $k$. Also, note that $\alpha_d$ (significance level for FDR) is chosen to be the same as $\alpha_w$ (significance level for FWER). The significance level is subjective in nature. Here, we choose the same significance level to make an apples-to-apples comparison between FDR and FWER adjustment procedures. We discuss this choice in more detail in Section 3.6.

To explore how BHY works, let $k$ be the largest index such that $p_{(b)}\le\frac{b}{M\times c(M)}\alpha_d$. This means that for $b>k$, $p_{(b)}>\frac{b}{M\times c(M)}\alpha_d$. In particular, we have $p_{(k+1)}>\frac{k+1}{M\times c(M)}\alpha_d$, $p_{(k+2)}>\frac{k+2}{M\times c(M)}\alpha_d$, …, $p_{(M)}>\frac{M}{M\times c(M)}\alpha_d$. We see that the $(k+1)$th to the last null hypotheses, not rejected, are compared to numbers smaller than $\alpha_d$, the usual significance level in single hypothesis testing. By being stricter than single hypothesis tests, BHY guarantees that the false discovery rate, which depends on the joint distribution of all the test statistics, is below the prespecified significance level. See Benjamini and Yekutieli (2001) for details on the proof.

**Example 3.4.3** To apply BHY's adjustment to the example in Table 4, we first order the *p*-values in ascending order and try to locate the largest index $k$ that satisfies $p_{(b)}\le\frac{b}{M\times c(M)}\alpha_d$. Table 4, panel D, shows the ordered *p*-values and the associated $\frac{b}{M\times c(M)}\alpha_d$'s. Starting from the largest *p*-value and going down, we see that $p_{(b)}$ is above $\frac{b}{M\times c(M)}\alpha_d$ until $b=6$, at which point $p_{(6)}$ is below $\frac{6}{10\times 2.93}\alpha_d$. Therefore, the largest $b$ that satisfies $p_{(b)}\le\frac{b}{M\times c(M)}\alpha_d$ is 6, and we reject the null hypothesis for the first six ordered tests and fail to reject for the remaining four tests. In the end, BHY leads to six significant factors (tests 8, 7, 4, 9, 6, and 2), three more than Bonferroni and two more than Holm.
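The step-up search in this example can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code; the function name `bhy` is an assumption, $c(M)$ is taken to be the harmonic sum given in the note to Table 4, and the *p*-values are the rounded percentages from the table.

```python
# p-values (in percent) copied from Table 4, panel A, in test order 1..10.
P_VALUES_PCT = [4.66, 0.85, 2.71, 0.05, 3.00, 0.84, 0.00, 0.00, 0.60, 1.28]


def bhy(p_values, alpha):
    """Step-up procedure: return the 1-based labels of rejected tests."""
    m = len(p_values)
    c_m = sum(1.0 / j for j in range(1, m + 1))  # c(M) = sum_{j=1}^{M} 1/j
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    # Scan from the largest ordered p-value down to the smallest and stop at
    # the first (i.e., largest) index b whose p-value clears its threshold.
    for b in range(m, 0, -1):
        if p_values[order[b - 1]] <= b * alpha / (m * c_m):
            k = b
            break
    return [i + 1 for i in order[:k]]


p_values = [p / 100 for p in P_VALUES_PCT]
bhy_rejected = bhy(p_values, alpha=0.05)
print(sorted(bhy_rejected))
```

Note the contrast with a step-down search: once the largest qualifying index $k$ is found, every test with a smaller ordered *p*-value is rejected without being checked against its own threshold.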

In summary, for single tests, using the usual 5% cutoff, 10 out of 10 are discovered. Allowing for multiple tests, the cutoffs are far smaller, with BHY at 0.85%, Holm at 0.60%, and Bonferroni at 0.5%.

The choice of $c(M)$ determines the generality of BHY's procedure. Intuitively, the larger $c(M)$ is, the more difficult it is to satisfy the inequality $p_{(b)}\le\frac{b}{M\times c(M)}\alpha_d$, and hence there will be fewer discoveries. This makes it easier to restrict the false discovery rate below a given significance level since fewer discoveries are made. In the original work that develops the concept of false discovery rate and related testing procedures, $c(M)$ is set equal to one. Under this choice, BHY is only valid when the test statistics are independent or positively dependent. With our choice of $c(M)$ (i.e., $c(M)=\sum_{j=1}^{M}\frac{1}{j}$), BHY is valid under any form of dependence among the *p*-values.^{24} Note that $c(M)>1$ reduces the size of $\frac{b}{M\times c(M)}\alpha_d$, so it is tougher to satisfy the inequality $p_{(b)}\le\frac{b}{M\times c(M)}\alpha_d$. That is, there will be fewer factors found to be significant.
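To see how much the choice of $c(M)$ matters in the ten-test example of Table 4, the following Python sketch counts discoveries under both specifications. It is only an illustration; the helper names `harmonic` and `bhy_count` are assumptions, and the inputs are the rounded *p*-values printed in the table.

```python
# p-values (in percent) copied from Table 4, panel A.
P_VALUES_PCT = [4.66, 0.85, 2.71, 0.05, 3.00, 0.84, 0.00, 0.00, 0.60, 1.28]


def harmonic(m):
    """c(M) = sum_{j=1}^{M} 1/j, the choice that is valid under any dependence."""
    return sum(1.0 / j for j in range(1, m + 1))


def bhy_count(p_values, alpha, c_m):
    """Number of discoveries: the largest b with p_(b) <= b * alpha / (M * c(M))."""
    m = len(p_values)
    ordered = sorted(p_values)
    passing = [b for b in range(1, m + 1) if ordered[b - 1] <= b * alpha / (m * c_m)]
    return max(passing, default=0)


p_values = [p / 100 for p in P_VALUES_PCT]
c10 = harmonic(10)
count_general = bhy_count(p_values, alpha=0.05, c_m=c10)   # c(M) = harmonic sum
count_lenient = bhy_count(p_values, alpha=0.05, c_m=1.0)   # c(M) = 1
print(round(c10, 2), count_general, count_lenient)
```

In this particular example the lenient choice $c(M)=1$ is no adjustment at all in terms of outcomes: the largest *p*-value (4.66%) is below the top threshold of 5%, so all ten tests are rejected, compared with six under the harmonic-sum choice.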

Figure 1 summarizes our example. It plots the original *p*-values (single tests), as well as adjusted *p*-value lines, for various multiple testing adjustment procedures. We see the stark difference in outcomes between multiple and single hypothesis testing. While all ten factors would be discovered under single hypothesis testing, only three to six factors would be discovered under a multiple hypothesis test. Although single hypothesis testing guarantees the type I error of each test meets a given significance level, meeting the more stringent FWER or FDR bound will lead us to discard a number of factors.

### 3.5 Summary statistics

Figure 2 shows the history of discovered factors and publications.^{25} We observe a dramatic increase in factor discoveries during the last decade. In the early period from 1980 to 1991, only about one factor is discovered per year. This number has grown to around five for the 1991–2003 period, during which time a number of papers, such as Fama and French (1992), Carhart (1997), and Pastor and Stambaugh (2003), spurred interest in studying cross-sectional return patterns. In the last nine years, the annual factor discovery rate has increased sharply to around 18. In total, 164 factors were discovered in the past nine years, roughly doubling the 84 factors discovered in all previous years. We do not include working papers in Figure 2. In our sample, there are 63 working papers covering 68 factors.

We obtain *t*-statistics for each of the 316 factors discovered, including the ones in the working papers.^{26} The overwhelming majority of *t*-statistics exceed the 1.96 benchmark for 5% significance.^{27} The nonsignificant ones typically belong to papers that propose a number of factors. These likely represent only a small subsample of nonsignificant *t*-statistics for all tried factors. Importantly, we take published *t*-statistics as given. That is, we assume they are econometrically sound with respect to the usual suspects (data errors, coding errors, misalignment, heteroscedasticity, autocorrelation, clustering, outliers, etc.).

### 3.6 *p*-value adjustment when all tests are published ($M=R$)

We now apply the three adjustment methods previously introduced to the observed factor tests, under the assumption that the test results of all tried factors are available. We know that this assumption is false since our sample underrepresents all insignificant factors by conventional significance standards: we only observe those insignificant factors that are the results of purposeful falsification experiments. We design methods to handle this missing data issue later.

Despite some limitations, our results in this section are useful for at least two reasons. First, the benchmark *t*-statistic based on our incomplete sample provides a lower bound of the true *t*-statistic benchmark. In other words, if $M$ (total number of tests) > $R$ (total number of discoveries), then we would expect fewer factors than when $M=R$,^{28} so future *t*-statistics need to at least surpass our benchmark to claim significance. Second, results in this section can be rationalized within a Bayesian or hierarchical testing framework.^{29} Factors in our list constitute an “elite” group: they have survived academia's scrutiny for publication. Placing a high prior on this group in a Bayesian testing framework or viewing this group as a cluster in a hierarchical testing framework, one can interpret results in this section as the first-step factor selection within an a priori group.

Based on our sample of observed *t*-statistics of published factors,^{30} we obtain three benchmark *t*-statistics. In particular, at each point in time, we transform the set of available *t*-statistics into *p*-values. We then apply the three adjustment methods to obtain benchmark *p*-values. Finally, these *p*-value benchmarks are transformed back into *t*-statistics, assuming that the standard normal distribution approximates the *t*-distribution well. To guide future research, we extrapolate our benchmark *t*-statistics into the future, assuming that the rate of “factor production” remains the same as in recent history, that is, 2003–2012.
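The two transformations described above can be sketched with the standard normal distribution from Python's standard library. This is a minimal sketch under stated assumptions: *p*-values are taken to be two-sided, the helper names `t_to_p` and `p_to_t` are illustrative, and the test count `M = 316` is borrowed from the factor count in Section 3.5 purely as an example input for a Bonferroni-style cutoff.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal approximation to the t-distribution


def t_to_p(t_stat):
    """Two-sided p-value implied by a t-statistic under the normal approximation."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(t_stat)))


def p_to_t(p_value):
    """t-statistic whose two-sided p-value equals p_value."""
    return STD_NORMAL.inv_cdf(1.0 - p_value / 2.0)


M = 316          # assumed number of tests (factor count from Section 3.5)
ALPHA_W = 0.05   # FWER significance level

single_test_t = p_to_t(ALPHA_W)       # conventional single-test hurdle
bonferroni_t = p_to_t(ALPHA_W / M)    # Bonferroni hurdle for M tests
print(round(single_test_t, 2), round(bonferroni_t, 2))
print(round(100 * t_to_p(1.99), 2))   # p-value (%) for a t-statistic of 1.99
```

The sequential procedures (Holm and BHY) need the full set of ordered *p*-values rather than a single cutoff, so only the Bonferroni benchmark can be recovered from $M$ alone in this way.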

We choose to set $\alpha_w$ at 5% (Holm, FWER) and $\alpha_d$ at 1% (BHY, FDR) for our main results. The significance level is subjective, as in individual hypothesis testing, where conventional significance levels are usually adopted. Since FWER is a special case of the type I error in individual testing and 5% seems the default significance level in cross-sectional studies, we set $\alpha_w$ at 5%. On the other hand, FDR is a more lenient control relative to FWER. If we choose the same $\alpha_d$ as $\alpha_w$, then by definition the BHY method will be more lenient than both Holm and Bonferroni. We set FDR at 1% but will explain what happens when $\alpha_d$ is increased to 5%.

Figure 3 presents the three benchmark *t*-statistics. Both Bonferroni and Holm adjusted benchmark *t*-statistics are monotonically increasing in the number of discoveries. For Bonferroni, the benchmark *t*-statistic starts at 1.96 and increases to 3.78 by 2012. It reaches 4.00 in 2032. The corresponding *p*-values (under single tests) for 3.78 and 4.00 are 0.02% and 0.01%, respectively, much lower than the starting level of 5%. Holm implied *t*-statistics always fall below Bonferroni *t*-statistics, consistent with the fact that Bonferroni always results in fewer discoveries than Holm. However, Holm tracks Bonferroni closely and their differences are small. BHY implied benchmarks, on the other hand, are not monotonic. They fluctuate before year 2000 and stabilize at 3.39 (*p*-value = 0.07%) after 2010. This stationarity feature of BHY implied *t*-statistics, inherent in the definition of FDR, contrasts with that of Bonferroni and Holm. Intuitively, at any fixed significance level $\alpha$, the law of large numbers forces the false discovery rate (FDR) to converge to a constant. If we change $\alpha_d$ to 5%, the corresponding BHY implied benchmark *t*-statistic is 2.78 (*p*-value = 0.54%) in 2012 and 2.81 (*p*-value = 0.50%) in 2032, still much higher than the starting value of 1.96. In sum, after taking testing multiplicity into account, we believe the minimum threshold *t*-statistic for 5% significance is about 2.8, which corresponds to a *p*-value (if a single test) of 0.5%.

To see how the new *t*-statistic benchmarks better reveal the statistical significance of factors, in Figure 3 we mark the *t*-statistics of a few prominent factors. Among these factors, HML, MOM, DCG, SRV, and MRT are significant across all types of *t*-statistic adjustments, EP, LIQ, and CVOL are sometimes significant, and the rest are never significant.

One concern with our results is that factors are discovered at different times and tests are conducted using different methods. This heterogeneity in the time of discovery and testing methods may blur the interpretation of our results. Ideally, we want updated factor tests that are based on the most recent sample and the same testing method.^{31} To alleviate this concern, we focus on the group of factors that are published no earlier than 2000 and rely on Fama-MacBeth tests. Additionally, we require that factor tests cover at least the 1970–1995 period and have as controls at least the Fama-French three factors (Fama and French 1993). This leaves us with 124 factors. Based on this factor group, the Bonferroni and Holm implied threshold *t*-statistics are 3.54 and 3.20 (5% significance), respectively, and the BHY implied thresholds are 3.23 (1% significance) and 2.67 (5% significance) by 2012. Not surprisingly, these statistics are smaller than the corresponding thresholds based on the full sample. However, the general message that we need a much higher *t*-statistic threshold when multiple testing is taken into account is unchanged.

### 3.7 Robustness

#### 3.7.1 Test statistics dependence

There is a caveat for all three methods considered so far. In the context of multiple testing, any type of adjustment procedure can become too stringent when there is a certain dependence structure in the data. This is because these procedures are primarily designed to guard against type I errors. Under a certain correlation structure, they may penalize type I errors too harshly and lead to a high type II error rate.

In theory, under independence, Bonferroni and Holm approximately achieve the prespecified significance level $\alpha $ when the number of tests is large. On the other hand, both procedures tend to generate fewer discoveries than desired when there is a certain degree of dependence among the tests. Intuitively, in the extreme case in which all tests are the same (i.e., correlation = 1.0), we do not need to adjust at all: FWER is the same as the type I error rate for single tests. Hence, the usual single hypothesis test is sufficient. Similarly, BHY may generate too few discoveries when tests are independent or positively correlated.
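The two polar cases in this argument can be worked out in closed form. The following Python sketch assumes, for illustration only, that all $M$ null hypotheses are true and that each test is run at the Bonferroni level $\alpha/M$; the function names and the choice `M = 316` are hypothetical.

```python
def fwer_bonferroni_independent(alpha, m):
    """FWER when all m nulls are true and the m tests are independent.

    Each test rejects with probability alpha / m, so the chance of at least
    one false rejection is 1 - (1 - alpha / m) ** m.
    """
    return 1.0 - (1.0 - alpha / m) ** m


def fwer_bonferroni_identical(alpha, m):
    """FWER when all m tests are identical (correlation = 1.0).

    There is effectively one test, which rejects with probability alpha / m.
    """
    return alpha / m


ALPHA = 0.05
M = 316  # hypothetical number of tests

fwer_indep = fwer_bonferroni_independent(ALPHA, M)
fwer_ident = fwer_bonferroni_identical(ALPHA, M)
print(f"independent tests: FWER = {fwer_indep:.4f}")
print(f"identical tests:   FWER = {fwer_ident:.6f}")
```

Under independence the realized FWER sits just below the 5% target, whereas under perfect correlation it collapses to $\alpha/M$, far below the target. This gap is the sense in which Bonferroni becomes too stringent under strong positive dependence.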

Having discussed assumptions for the testing methods to work efficiently, we now try to think of scenarios that can potentially violate these assumptions. First, factors that proxy for the same type of risk may be dependent. Moreover, returns of long-short portfolios designed to achieve exposure to a particular type of factor may be correlated. For example, there are a number of factors with price in the denominator that are naturally correlated. We also count four different idiosyncratic volatility factors. If this type of positive dependence exists among test statistics, all three methods would likely generate fewer significant factors than desired. On the other hand, most often factors need to “stand their ground” to be published. In the end, if you think we are overcounting at 316, consider taking a haircut to 113 factors (the number of “common” factors in Table 1). Figure 3 shows that our main conclusions do not materially change. For example, the Holm at 113 factors is 3.29 (*p*-value = 0.10%), while Holm at 316 factors is 3.64 (*p*-value = 0.03%).

Second, research studying the same factor but based on different samples will generate highly dependent test statistics. Examples include the sequence of papers studying the size effect. We try to minimize this concern by including, with a few exceptions, only the original paper that proposes the factor. To the extent that our list includes few such duplicate factors, our method greatly reduces the dependence that would be introduced by including all papers studying the same factor but for different sample periods.

Finally, when dependence among test statistics can be captured by Pearson correlations among contemporaneous strategy returns, we present a new model in Section 4 to systematically incorporate the information in test correlations.

#### 3.7.2 The case in which $M>R$

To deal with the hidden tests issue when $M>R$, we propose in Appendix A a simulation framework to estimate benchmark *t*-statistics. The idea is to first back out the underlying distribution for the *t*-statistics of all tried factors, then to generate benchmark *t*-statistic estimates, and apply the three adjustment procedures to simulated *t*-statistics samples.^{32}

Based on our estimates, 71% of all tried factors are missing. Using this information, the new benchmark *t*-statistics for Bonferroni and Holm are estimated to be 4.01 and 3.96, respectively, both slightly higher than when $M=R$. This is as expected because more factors are tried under this framework. The BHY implied *t*-statistic increases from 3.39 to 3.68 at 1% significance and from 2.78 to 3.18 at 5% significance. In sum, across various scenarios, we think the minimum threshold *t*-statistic is 3.18, corresponding to BHY's adjustment for $M>R$ at 5% significance. Alternative cases all result in even higher benchmark *t*-statistics.

One concern with BHY is that our specification of $c(M)$ results in an overly stringent threshold for FDR. We therefore try the more lenient choice (i.e., $c(M)\equiv 1$) as in Benjamini and Hochberg (1995). Based on our estimate that 71% of tried factors are missing and by simulating the missing tests as in Appendix A, we find that the BHY implied threshold equals 3.05 at 5% significance and 3.17 at 1% significance. Indeed, these numbers are smaller than the numbers under our default specification of $c(M)$ (i.e., $c(M)=\sum_{j=1}^{M}\frac{1}{j}$). However, they are above 3.0 and therefore are consistent with our overall message.

#### 3.7.3 A Bayesian hypothesis testing framework

We can also study multiple hypothesis testing within a Bayesian framework. One major obstacle of applying Bayesian methods in our context is the unobservability of all tried factors. While we propose new frequentist methods to handle this missing data problem, it is not clear how to structure the Bayesian framework in this context. In addition, the high dimensionality of the problem raises concerns on both the accuracy and the computational burden of Bayesian methods.

Nevertheless, ignoring the missing data issue, we outline a standard Bayesian multiple hypothesis testing framework in Appendix B and explain how it relates to our multiple testing framework. We discuss in detail the pros and cons of the Bayesian approach. In contrast to the frequentist approach, which uses generalized type I error rates to guide multiple testing, the Bayesian approach relies on the posterior likelihood function and thus contains a natural penalty term for multiplicity. However, this simplicity comes at the expense of having a restrictive hierarchical model structure and independence assumptions that may not be realistic for our factor testing problem. Although extensions incorporating certain forms of dependence are possible, it is unclear what precisely we should do for the 316 factors in our list. In addition, even for the Bayesian approach, the final reject/accept decision still involves the threshold choice. Due to these concerns, we choose not to implement the Bayesian approach. We leave extensions of the basic Bayesian framework that could possibly alleviate the above concerns to future research.

#### 3.7.4 Methods controlling the FDP

Instead of FDR, recent research by Lehmann and Romano (2005) develops methods to directly control the realized FDP. In particular, they propose a step-down method to control for the probability of FDP exceeding a threshold value. Since their definition of type I error (i.e., $P(FDP>\gamma )$, where $\gamma $ is the threshold FDP value) is different from either FWER or FDR, results based on their methods are not comparable to ours. However, the main conclusion is the same. For instance, when $\gamma =0.10$ and $\alpha =0.05$, the benchmark *t*-statistic is 2.70 (*p*-value = 0.69%), which is much higher than the conventional cutoff of 1.96. Details are presented in Online Appendix C.

## 4. Correlation among Test Statistics

Although the BHY method is robust to arbitrary dependence among test statistics, it does not use any information about the dependence structure. Such information, when appropriately incorporated, can be helpful in making the method more accurate (i.e., less stringent). We focus on the type of dependence that can be captured by the Pearson correlation. To generate correlation among test statistics, we focus on the correlation among factor returns. This correlation is likely driven by macroeconomic and market-wide variables. Therefore, in our context, the dependence among test statistics is equivalent to the correlation among factor returns.

Multiple testing corrections in the presence of correlation have been only considered in the recent statistics literature. Existing methods include bootstrap-based permutation tests and direct statistical modeling. Permutation tests resample the entire dataset and construct an empirical distribution for the pool of test statistics.^{33} Through resampling, the correlation structure in the data is taken into account and no model is needed. In contrast, direct statistical modeling makes additional distributional assumptions on the data-generating process. These assumptions are usually case dependent as different kinds of correlations are more plausible under different circumstances.^{34}

In addition, recent research in finance explores bootstrap procedures to assess the statistical significance of individual tests.^{35} Many of these studies focus on performance evaluation and test whether fund managers exhibit skill. Our approach focuses on the joint distribution of the test statistics (both FWER and FDR depend on the cross-section of *t*-statistics) and evaluates the significance of each individual factor.

Unfortunately, we do not always observe the time series of factor returns (when a *t*-statistic is based on long-short strategy returns) or the time series of slopes in cross-sectional regressions (when a *t*-statistic is based on the slope coefficients in cross-sectional regressions). Because few researchers post their original data, often all we have is the single *t*-statistic that summarizes the significance of a factor. We propose a novel approach to overcome this missing data problem. It is in essence a “direct modeling approach” but does not require the full information of the return series based on which the *t*-statistic is constructed. In addition, our approach is flexible enough to incorporate various kinds of distributional assumptions. We expect it to be a valuable addition to the multiple testing literature, especially when only test statistics are observable.

### 4.1 A model with correlations

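For each factor, suppose researchers construct a corresponding long-short trading strategy and normalize the return standard deviation to be $\sigma = 15\%$ per year, which is close to the annual volatility of the market index.^{36} In particular, let the normalized strategy return in period $t$ for the $i$-th discovered strategy be $X_{i,t}$. Then the *t*-statistic for testing the significance of this strategy over $N$ periods is:

$$T_i = \frac{\sum_{t=1}^{N} X_{i,t}/N}{\sigma/\sqrt{N}}.$$

Under the model, this *t*-statistic has a normal distribution

$$T_i \sim \mathcal{N}\left(\frac{\mu_i}{\sigma/\sqrt{N}},\ 1\right),$$

where $\mu_i$ denotes the population mean return of strategy $i$. Each $\mu_i$ is drawn from a mixture distribution: with probability $p_0$ the mean is zero, and otherwise it is drawn from an exponential distribution with mean parameter $\lambda$:

$$\mu_i \sim p_0\, I_{\{\mu = 0\}} + (1 - p_0)\, \mathrm{Exp}(\lambda).$$

We adopt the exponential distribution^{37} because, perhaps most importantly, it is consistent with the intuition that more profitable strategies are less likely to exist. An exponential distribution captures this feature by having a monotonically decreasing probability density function.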

Next, we incorporate correlations into the above framework. Among the various sources of correlations, the cross-sectional correlations among contemporaneous returns are the most important for us to take into account. These correlations are likely induced by a response to common macroeconomic or market shocks. Other kinds of correlations can be easily embedded into our framework as well.^{38}

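As a starting point, we assume that the contemporaneous correlation between two strategies' returns is $\rho$. The noncontemporaneous correlations are assumed to be zero. That is, for $i \neq j$,

$$\mathrm{Corr}(X_{i,t}, X_{j,s}) = \begin{cases} \rho, & t = s, \\ 0, & t \neq s. \end{cases}$$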

Finally, to incorporate the impact of hidden tests, we assume that $M$ factors are tried, but only factors that exceed a certain *t*-statistic threshold are published. We set the threshold *t*-statistic at 1.96 and focus on the subsample of factors that have a *t*-statistic larger than 1.96. However, as shown in Appendix A, factors with marginal *t*-statistics (i.e., *t*-statistics just above 1.96) are less likely to be published than those with larger *t*-statistics. Therefore, our subsample of published *t*-statistics covers only a fraction of the *t*-statistics above 1.96 for tried factors. To overcome this missing data problem, we assume that a fraction $r$ of the *t*-statistics between 1.96 and 2.57 is missing from our sample and that all *t*-statistics above 2.57 are covered. We augment the existing *t*-statistic sample to construct the full sample. For instance, when $r=1/2$, our sample covers half of the *t*-statistics between 1.96 and 2.57, so we simply duplicate them and keep the sample of *t*-statistics above 2.57 unchanged. For the baseline case, we set $r=1/2$, consistent with the analysis in Appendix A. We try alternative values of $r$ to determine how the results change.^{39}
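The augmentation step can be sketched as follows. This is an illustration only: the function name, the `coverage` parameter (the share of marginal *t*-statistics assumed to be observed), and the sample values are hypothetical. In the baseline case ($r = 1/2$), half of the marginal *t*-statistics are assumed to be observed, so each one is duplicated.

```python
def augment_sample(t_stats, coverage=0.5, lower=1.96, upper=2.57):
    """Replicate marginal t-statistics to stand in for unpublished ones.

    `coverage` is the assumed share of t-statistics between `lower` and
    `upper` that made it into the published sample (hypothetical name).
    """
    copies = round(1 / coverage)  # coverage 1/2 -> 2 copies, 1/3 -> 3 copies
    full = []
    for t in t_stats:
        if t <= lower:
            continue  # below the publication threshold, not part of the sample
        full.extend([t] * (copies if t < upper else 1))
    return sorted(full)

published = [2.1, 2.4, 2.9, 3.5, 6.0]  # hypothetical published t-statistics
augmented = augment_sample(published)
print(augmented)  # [2.1, 2.1, 2.4, 2.4, 2.9, 3.5, 6.0]
```

Only the marginal region is replicated, so the upper tail of the sample, which drives the higher quantiles, is left unchanged.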

Given the correlation structure and the sampling distribution for the means of returns, we can fully characterize the distributional properties of the cross-section of returns. We can also determine the distribution for the cross-section of *t*-statistics as they are functions of returns. Based on our sample of *t*-statistics for published research, we match key sample statistics with their population counterparts in the model.

The sample statistics we choose to match are the quantiles of the sample of *t*-statistics and the sample size (i.e., the total number of discoveries). Two concerns motivate us to use quantiles. First, sample quantiles are less susceptible to outliers compared to means and other moment-related sample statistics. Our *t*-statistic sample does have a few influential observations, and we expect quantiles to be more useful descriptive statistics than the mean and the standard deviation. Second, simulation studies show that quantiles in our model are more sensitive to changes in parameters than other statistics. To offer a more efficient estimation of the model, we choose to focus on quantiles.
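As an illustration of how model-implied counterparts of these statistics could be obtained, the sketch below simulates one cross-section of *t*-statistics and computes the number of discoveries and a few quantiles. Several things here are assumptions of the sketch rather than details from the text: the function names, the one-common-shock shortcut used to induce equicorrelation, the particular percentiles (20th, 50th, 90th), and the parameter values, which are borrowed from the $\rho = 0.2$ row of Table 5 purely for illustration.

```python
import math
import random
import statistics

def simulate_t_stats(M, p0, lam, rho, N=240, sigma=0.15 / math.sqrt(12), rng=random):
    """Draw one cross-section of M t-statistics from the mixture model.

    Shortcut (an assumption of this sketch): with contemporaneous correlation
    rho and no serial correlation, sample means are equicorrelated normals, so
    T_i = mu_i / (sigma / sqrt(N)) + sqrt(rho) * Z + sqrt(1 - rho) * e_i.
    """
    se = sigma / math.sqrt(N)          # standard error of a monthly mean
    common = rng.gauss(0.0, 1.0)       # shock shared by all strategies
    t_stats = []
    for _ in range(M):
        # Mean is zero with probability p0, otherwise exponential with mean lam.
        mu = 0.0 if rng.random() < p0 else rng.expovariate(1.0 / lam)
        noise = math.sqrt(rho) * common + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
        t_stats.append(mu / se + noise)
    return t_stats

def matched_statistics(t_stats, cutoff=1.96, percentiles=(20, 50, 90)):
    """Number of discoveries above the cutoff and selected percentiles among them."""
    discoveries = [t for t in t_stats if t > cutoff]
    cuts = statistics.quantiles(discoveries, n=100, method="inclusive")
    return len(discoveries), [cuts[p - 1] for p in percentiles]

rng = random.Random(0)
t_stats = simulate_t_stats(M=1378, p0=0.444, lam=0.00555, rho=0.2, rng=rng)
count, qs = matched_statistics(t_stats)
print(count, [round(q, 2) for q in qs])
```

Averaging these quantities over many simulated cross-sections would give the population counterparts that are compared with the sample values.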

In particular, the quantities we choose to match and their values for the baseline sample (i.e., $r=1/2$) are given by:

^{40}The estimation works by seeking to find the set of parameters that minimizes the following objective function:

^{41}

We estimate the three parameters ($\lambda$, $p_0$, and $M$) in the model and choose to calibrate the correlation coefficient $\rho$. In particular, for a given level of correlation $\rho$, we numerically search for the model parameters $(\lambda, p_0, M)$ that minimize the objective function $D(\lambda, p_0, M, \rho)$.

We choose to calibrate the amount of correlation because the correlation coefficient is likely to be weakly identified in this framework. Ideally, to have a better identification of $\rho $, we would like to have *t*-statistics that are generated from samples that have varying degrees of overlap.^{42} We do not allow heterogeneity in sample periods in either our estimation framework (i.e., all *t*-statistics are generated from samples that cover the same period) or our data (we do not record the specific period for which the *t*-statistic is generated). As a result, our results are best interpreted as the estimated *t*-statistic thresholds for a hypothetical level of correlation.

To investigate how correlation affects multiple testing, we follow an intuitive simulation procedure. In particular, fixing $\lambda$, $p_0$, and $M$ at their estimates, we know the data-generating process for the cross-section of returns. Through simulations, we are able to calculate the previously defined type I error rates (i.e., FWER and FDR) for any given threshold *t*-statistic. We search for the optimal threshold *t*-statistic that exactly achieves a prespecified error rate.
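The simulation procedure can be sketched as below. This is a simplified illustration, not the estimation code behind the reported results: it uses a coarse grid search instead of solving for the threshold exactly, a small hypothetical $M$ to keep the run short, a one-sided rejection rule, and a single common shock to generate equicorrelated *t*-statistics. It is therefore not expected to reproduce the thresholds in Table 5. All function names are hypothetical.

```python
import math
import random

def simulate(M, p0, lam, rho, rng, N=240, sigma=0.15 / math.sqrt(12)):
    """One simulated cross-section as a list of (t_statistic, is_null) pairs."""
    se = sigma / math.sqrt(N)
    common = rng.gauss(0.0, 1.0)  # shock shared by all strategies
    out = []
    for _ in range(M):
        is_null = rng.random() < p0
        mu = 0.0 if is_null else rng.expovariate(1.0 / lam)
        t = mu / se + math.sqrt(rho) * common + math.sqrt(1 - rho) * rng.gauss(0.0, 1.0)
        out.append((t, is_null))
    return out

def error_rates(panels, threshold):
    """Monte Carlo FWER and FDR when every t > threshold is declared significant."""
    any_false, fdp_sum = 0, 0.0
    for panel in panels:
        hits = [is_null for t, is_null in panel if t > threshold]
        false = sum(hits)                      # number of false discoveries
        any_false += false > 0                 # FWER: at least one false discovery
        fdp_sum += false / len(hits) if hits else 0.0  # false discovery proportion
    return any_false / len(panels), fdp_sum / len(panels)

def smallest_threshold(panels, target, which, grid):
    """First grid value whose simulated error rate (0 = FWER, 1 = FDR) is at most target."""
    for c in grid:
        if error_rates(panels, c)[which] <= target:
            return c
    return None

rng = random.Random(1)
panels = [simulate(M=200, p0=0.444, lam=0.00555, rho=0.2, rng=rng) for _ in range(300)]
grid = [1.5 + 0.1 * k for k in range(36)]  # 1.5, 1.6, ..., 5.0
fwer_cut = smallest_threshold(panels, 0.05, 0, grid)
fdr_cut = smallest_threshold(panels, 0.05, 1, grid)
print(round(fwer_cut, 1), round(fdr_cut, 1))
```

Because the false discovery proportion never exceeds the indicator of at least one false discovery, the simulated FDR is never above the simulated FWER, so the FDR threshold is never the larger of the two.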

### 4.2 Results

Our estimation framework assumes a balanced panel with $M$ factors and $N$ periods of returns. We need to assign a value to $N$. Published papers usually cover a period ranging from twenty to fifty years. In our framework, the choice of $N$ does not affect the distribution of $T_i$ under the null hypothesis (i.e., $\mu_i = 0$) but will affect $T_i$ under the alternative hypothesis (i.e., $\mu_i > 0$). When $\mu_i$ is different from zero, $T_i$ has a mean of $\mu_i/(\sigma/\sqrt{N})$. A larger $N$ reduces the noise in returns and makes it more likely for $T_i$ to be significant. To be conservative (i.e., less likely to generate significant *t*-statistics under the alternative hypotheses), we set $N$ at 240 (i.e., twenty years). Other specifications of $N$ change the estimate of $\lambda$ but leave the other parameters almost intact. In particular, the threshold *t*-statistics are little changed for alternative values of $N$.

The results are presented in Table 5. Across different correlation levels, $\lambda$ (the mean parameter of the exponential distribution that represents the mean returns of true factors) is consistently estimated at about 0.55% per month. This corresponds to an annual factor return of 6.6%. Therefore, we estimate the average mean return for truly significant factors to be 6.6% per annum. Given that we standardize factor returns by an annual volatility of 15%, the average annual Sharpe ratio for these factors is 0.44 (or a monthly Sharpe ratio of 0.13).^{43}
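The arithmetic behind these figures can be checked with a few lines. The variable names are hypothetical, and the last line is an added illustration: it applies the mean formula $\mu_i/(\sigma/\sqrt{N})$ with $N = 240$ to a factor whose mean return equals $\lambda$.

```python
import math

lam_monthly = 0.0055   # estimated mean monthly return of true factors
sigma_annual = 0.15    # normalized annual volatility
N = 240                # months of data assumed in the estimation

annual_return = 12 * lam_monthly                 # 0.066, i.e., 6.6% per annum
sharpe_annual = annual_return / sigma_annual     # about 0.44
sharpe_monthly = sharpe_annual / math.sqrt(12)   # about 0.13
expected_t = sharpe_monthly * math.sqrt(N)       # about 1.97

print(round(annual_return, 3), round(sharpe_annual, 2),
      round(sharpe_monthly, 2), round(expected_t, 2))
```

Under these assumptions, a true factor with the average mean return has an expected *t*-statistic of only about 1.97 over twenty years, which helps explain why many true factors in the model do not clear the higher thresholds.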

Panel A: $r = 1/2$ (baseline)

| $\rho$ | $p_0$ | $\lambda$ (%) | $M$ | *t*-stat., FWER (5%) | *t*-stat., FWER (1%) | *t*-stat., FDR (5%) | *t*-stat., FDR (1%) |
|---|---|---|---|---|---|---|---|
| 0 | 0.396 | 0.550 | 1,297 | 3.89 | 4.28 | 2.16 | 2.88 |
| 0.2 | 0.444 | 0.555 | 1,378 | 3.91 | 4.30 | 2.27 | 2.95 |
| 0.4 | 0.485 | 0.554 | 1,477 | 3.81 | 4.23 | 2.34 | 3.05 |
| 0.6 | 0.601 | 0.555 | 1,775 | 3.67 | 4.15 | 2.43 | 3.09 |
| 0.8 | 0.840 | 0.560 | 3,110 | 3.35 | 3.89 | 2.59 | 3.25 |

Panel B: $r = 2/3$ (more unobserved tests)

| $\rho$ | $p_0$ | $\lambda$ (%) | $M$ | *t*-stat., FWER (5%) | *t*-stat., FWER (1%) | *t*-stat., FDR (5%) | *t*-stat., FDR (1%) |
|---|---|---|---|---|---|---|---|
| 0 | 0.683 | 0.550 | 2,458 | 4.17 | 4.55 | 2.69 | 3.30 |
| 0.2 | 0.722 | 0.551 | 2,696 | 4.15 | 4.54 | 2.76 | 3.38 |
| 0.4 | 0.773 | 0.552 | 3,031 | 4.06 | 4.45 | 2.80 | 3.40 |
| 0.6 | 0.885 | 0.562 | 4,339 | 3.86 | 4.29 | 2.91 | 3.55 |
| 0.8 | 0.922 | 0.532 | 5,392 | 3.44 | 4.00 | 2.75 | 3.39 |

Table 5. We estimate the model with correlations. $r$ is the assumed proportion of missing factors with a *t*-statistic between 1.96 and 2.57. Panel A shows the results for the baseline case in which $r=1/2$, and panel B shows the results for the case in which $r=2/3$. $\rho$ is the correlation coefficient between two strategy returns in the same period. $p_0$ is the probability of having a strategy that has a mean of zero. $\lambda$ is the mean parameter of the exponential distribution for the monthly means of the true factors. $M$ is the total number of trials.

For the other parameter estimates, both $p_0$ and $M$ are increasing in $\rho$. Focusing on the baseline case in panel A and at $\rho = 0$, we estimate that researchers have tried $M = 1{,}297$ factors and 60.4% ($= 1 - 0.396$) are true discoveries. When $\rho$ is increased to 0.60, we estimate that a total of $M = 1{,}775$ factors have been tried and around 39.9% ($= 1 - 0.601$) are true factors.

Turning to the estimates of threshold *t*-statistics and focusing on FWER, we see that they are not monotonic in the level of correlation. Intuitively, two forces are at work in driving these threshold *t*-statistics. On the one hand, both $p_0$ and $M$ are increasing in the level of correlation. Therefore, more factors, both in absolute value and in proportion, are drawn from the null hypothesis. To control the occurrences of false discoveries based on these factors, we need a higher threshold *t*-statistic. On the other hand, a higher correlation among test statistics reduces the required threshold *t*-statistic. In the extreme case when all test statistics are perfectly correlated, we do not need multiple testing adjustment at all. These two forces work against each other and result in the nonmonotonic pattern for the threshold *t*-statistics under FWER. For FDR, it appears that the impact of larger $p_0$ and $M$ dominates so that the threshold *t*-statistics are increasing in the level of correlation.
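The second force can be isolated with a small simulation that holds the number of null tests fixed and varies only the correlation. The sketch below is illustrative: the function name, the choice of 100 null tests, the one-sided 5% level, and the single-common-shock construction are all assumptions made here, not part of the estimation.

```python
import math
import random

def fwer_threshold(M, rho, alpha=0.05, sims=2000, seed=0):
    """(1 - alpha) quantile of the maximum of M equicorrelated null t-statistics."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(sims):
        common = rng.gauss(0.0, 1.0)                         # shared shock
        best = max(rng.gauss(0.0, 1.0) for _ in range(M))    # largest idiosyncratic draw
        # max_i [sqrt(rho) Z + sqrt(1 - rho) e_i] = sqrt(rho) Z + sqrt(1 - rho) max_i e_i
        maxima.append(math.sqrt(rho) * common + math.sqrt(1.0 - rho) * best)
    maxima.sort()
    return maxima[int((1.0 - alpha) * sims)]

thresholds = {rho: fwer_threshold(M=100, rho=rho) for rho in (0.0, 0.4, 0.8, 1.0)}
for rho, cut in thresholds.items():
    print(rho, round(cut, 2))
```

With the number of nulls held fixed, the threshold falls as $\rho$ rises and, at $\rho = 1$, collapses to the single-test critical value (about 1.645 for a one-sided 5% test). In the estimated model this effect is offset by the rise in $p_0$ and $M$.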

Across various correlation specifications, our estimates show that in general *t*-statistics of 3.9 and 3.0 are needed to control FWER at 5% and FDR at 1%, respectively.^{44} Notice that these numbers are not far from our previous estimates of 3.78 (Holm adjustment that controls FWER at 5%) and 3.38 (BHY adjustment that controls FDR at 1%). However, these similar numbers are generated through different mechanisms. Our current estimate assumes a certain level of correlation among returns and relies on an estimate of more than 1,300 for the total number of factor tests. In contrast, our previous calculation assumes that the 316 published factors are all the factors that have been tried but does not specify a correlation structure.

### 4.3 How large is $\rho $?

Our sample has limitations in making a direct inference on the level of correlation. To give some guidance, we provide indirect evidence on the plausible levels of $\rho $.

First, the value of the optimized objective function sheds light on the level of $\rho $. Intuitively, a value of $\rho $ that is more consistent with the data-generating process should result in a lower optimized objective function. Across the various specifications of $\rho $ in Table 5, we find that the optimized objective function reaches its lowest point when $\rho =0.2$. Therefore, our *t*-statistic sample suggests a low level of correlation. However, this evidence is only suggestive given the weak identification of $\rho $ in our model.

Second, we draw on an external data source to provide inference. In particular, we analyze the S&P Capital IQ database, which includes detailed information on the time series of returns of over 400 factors for the U.S. equity market. We estimate the average pairwise correlation among these factors to be 0.15 for the 1985–2014 period.

Finally, existing studies in the literature provide guidance on the level of correlation. McLean and Pontiff (2015) estimate the correlation among anomaly returns to be around 0.05. Green, Hand, and Zhang (2013a) focus on accounting-based factors and find the average correlation to be between 0.06 and 0.20. Focusing on mutual fund returns, Barras, Scaillet, and Wermers (2010) argue for a correlation of zero among fund returns (i.e., excess returns against benchmark factors), while Ferson and Chen (2013) calibrate this number to be between 0.04 and 0.09.

Overall, we believe that the average correlation among factor returns is in the neighborhood of 0.20.

### 4.4 How many true factors are there?

The number of true discoveries using our method seems high given that most of us believe a priori that there are only a handful of true systematic risk factors. However, many of these factors that our method deems statistically true have tiny Sharpe ratios. For example, around 70% of them have a Sharpe ratio that is less than 0.5 per annum. From a modeling perspective, we impose a monotonic exponential density for the mean returns of true factors. Hence, by assumption, the number of discoveries will be decreasing in the mean return.
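The "around 70%" figure follows from the exponential assumption. Because volatility is normalized to a constant, the Sharpe ratio of a true factor is also exponentially distributed, with a mean equal to the average annual Sharpe ratio of 0.44 estimated in Section 4.2. A short check, with hypothetical variable names:

```python
import math

mean_sharpe = 0.44   # average annual Sharpe ratio of true factors
cutoff = 0.5         # annual Sharpe ratio cutoff

# Exponential CDF with mean `mean_sharpe`, evaluated at the cutoff.
share_below = 1.0 - math.exp(-cutoff / mean_sharpe)
print(f"{share_below:.1%}")  # roughly 68%
```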

Overall, statistical evidence can only get us so far in reducing the number of false discoveries. This is a limitation not only of our framework but probably of any statistical framework that relies on individual *p*-values. To see this, suppose the smallest *t*-statistic among true risk factors is 3.0 and assume our sample covers fifty risk factors that all have a *t*-statistic above 3.0. Then based on statistical evidence only, it is impossible to rule out any of these fifty factors from the list of true risk factors.

We agree that further scrutiny of the factor universe is a valuable exercise. There are at least two routes we can take. One route is to introduce additional testable assumptions that a systematic risk factor has to satisfy to claim significance. Pukthuanthong and Roll (2014) use the principal components of the cross-section of realized returns to impose such assumptions. The other route is to incrementally increase the factor list by allowing different factors to crowd each other out. Harvey and Liu (2014c) provide such a framework. We expect both lines of research to help in culling the number of factors.

## 5. Conclusion

At least 316 factors have been tested to explain the cross-section of expected returns. Most of these factors have been proposed over the last ten years. Indeed, Cochrane (2011) refers to this as “a zoo of new factors.” Our paper argues that it is a serious mistake to use the usual statistical significance cutoffs (e.g., a *t*-statistic exceeding 2.0) in asset pricing tests. Given the plethora of factors, and the inevitable data mining, many of the historically discovered factors would be deemed “significant” by chance.

There is an important philosophical issue embedded in our approach. Our threshold cutoffs increase through time as more factors are data mined. However, data mining is not new. Why should we have a higher threshold for today's data mining than for data mining in the 1980s? We believe there are three reasons for tougher criteria today. First, the low-hanging fruit has already been picked. That is, the rate of discovering a true factor has likely decreased. Second, there is a limited amount of data. Indeed, there is only so much you can do with the CRSP database. In contrast, in particle physics, it is routine to create trillions of new observations in an experiment. We do not have that luxury in finance. Third, the cost of data mining has dramatically decreased. In the past, data collection and estimation were time intensive, so it was more likely that only factors with the highest priors—potentially based on economic first principles—were tried.

Our paper presents three conventional multiple testing frameworks and proposes a new one that particularly suits research in financial economics. While these frameworks differ in their assumptions, they are consistent in their conclusions. We argue that a newly discovered factor today should have a *t*-statistic that exceeds 3.0. We provide a time-series of recommended “cutoffs” from the first empirical test in 1967 through to present day. Many published factors fail to exceed our recommended cutoffs.

While a *t*-statistic of 3.0 (which corresponds to a *p*-value of 0.27%) seems like a very high hurdle, we also argue that there are good reasons to expect that 3.0 is too low. First, we only count factors that are published in prominent journals and we sample only a small fraction of the working papers. Second, there are surely many factors that were tried by empiricists, failed, and never made it to publication or even a working paper. Indeed, the culture in financial economics is to focus on the discovery of new factors. In contrast with other fields, such as medical science, it is rare to publish replication studies focusing on only existing factors. Given that our count of 316 tested factors is surely too low, this means the *t*-statistic cutoff is likely even higher.^{45}
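The correspondence between a *t*-statistic and its *p*-value can be verified with the standard normal distribution. This is a minimal sketch that assumes a two-sided test and a normal approximation; the function name is hypothetical.

```python
import math

def two_sided_p_value(t):
    """Two-sided p-value for a t-statistic under a standard normal null."""
    return math.erfc(abs(t) / math.sqrt(2.0))

for t in (2.0, 3.0, 3.9):
    print(t, f"{two_sided_p_value(t):.4%}")
```

For $t = 3.0$ this gives approximately 0.27%, compared with roughly 4.6% for the conventional cutoff of 2.0.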

Should a *t*-statistic of 3.0 be used for every factor proposed in the future? Probably not. A case can be made that a factor developed from first principles should have a lower threshold *t*-statistic than a factor that is discovered as a purely empirical exercise. Nevertheless, a *t*-statistic of 2.0 is no longer appropriate—even for factors that are derived from theory.

In medical research, the recognition of the multiple testing problem has led to the disturbing conclusion that “most claimed research findings are false” (Ioannidis 2005). Our analysis of factor discoveries leads to the same conclusion: many of the factors discovered in the field of finance are likely false discoveries. Of the 296 published significant factors, 158 would be considered false discoveries under Bonferroni, 142 under Holm, 132 under BHY (1%), and 80 under BHY (5%). In addition, the idea that there are so many factors is inconsistent with principal component analysis, which suggests that there are perhaps five “statistical” common factors driving time-series variation in equity returns (Ahn, Horenstein, and Wang 2012).

The assumption that researchers follow the rules of classical statistics (e.g., randomization, unbiased reporting) is at odds with the notion of individual incentives, ironically, one of the fundamental premises in economics. Importantly, the optimal amount of data mining is not zero since some data mining produces knowledge. The key, as argued by Glaeser (2008), is to design appropriate statistical methods to adjust for biases, not to eliminate research initiatives. The multiple testing framework detailed in our paper is true to this advice.

Our research quantifies the warnings of both Fama (1991) and Schwert (2003). We attempt to navigate the zoo and establish new benchmarks to guide empirical asset pricing tests.

| Reference | Factor | # | Reference | Factor | # |
|---|---|---|---|---|---|
| Sharpe (1964) | market return | T | Constantinides (1982) | individual consumer's wealth | T |
| Lintner (1965) | market return | T | Basu (1983) | EP ratio | C8 |
| Mossin (1966) | market return | T | Adler and Dumas (1983) | FX rate change | T |
| Douglas (1967) | total volatility | C1 | Arbel, Carvell, and Strebel (1983) | institutional holding^{‡} | |
| Heckerman (1972) | market return | T | Hawkins, Chamberlin, and Daniel (1984) | earnings expectations^{‡} | |
| | relative prices of cons. goods | T | McConnell and Sanger (1984) | new listings announcement^{‡} | |
| ^{1}Black, Jensen, and Scholes (1972) | market return | | Chan, Chen, and Hsieh (1985) | market return^{†} | |
| Black (1972) | market return | T | | industrial production growth | F5 |
| Merton (1973) | state variables investment opps. | T | | change in expected inflation* | F6 |
| Fama and MacBeth (1973) | market return | F1 | | unanticipated inflation | F7 |
| | beta squared* | F2 | | credit premium | F8 |
| | idiosyncratic volatility* | C2 | | term structure* | F9 |
| Rubinstein (1973) | high-order market return | T | De Bondt and Thaler (1985) | long-term return reversal | C9 |
| Solnik (1974) | world market return | T | Cox, Ingersoll, and Ross (1985) | $\Delta$ investment opportunities | T |
| Rubinstein (1974) | individual investor resources | T | Amihud and Mendelson (1986) | transaction costs | T |
| Gupta and Ofer (1975) | earnings growth expectations | C3 | Constantinides (1986) | transaction costs | T |
| Kraus and Litzenberger (1976) | market return^{†} | | Stulz (1986) | expected inflation | T |
| | squared market return* | F3 | Sweeney and Warga (1986) | long-term interest rate | F10 |
| Basu (1977) | PE ratio | C4 | Chen, Roll, and Ross (1986) | industrial production growth^{†} | |
| Lucas (1978) | marginal rate of substitution | T | | credit premium^{†} | |
| Litzenberger and Ramaswamy (1979) | dividend yield | C5 | | term structure^{†} | |
| | market return^{†} | | | unanticipated inflation^{†} | |
| Breeden (1979) | real consumption growth | T | | change in oil prices* | F11 |
| Jarrow (1980) | short-sale restrictions | T | Bhandari (1988) | debt-to-equity ratio | C10 |
| ^{2}Fogler, John, and Tipton (1981) | market return^{†‡} | | Bauman and Dowen (1988) | long-term growth forecasts^{‡} | |
| | Treasury bond return^{‡} | | Breeden, Gibbons, and Litzenberger (1989) | consumption growth | F12 |
| | corporate bond return^{‡} | | Amihud and Mendelson (1989) | illiquidity | C11 |
| Oldfield and Rogalski (1981) | Treasury-bill return | F4 | Ou and Penman (1989) | predicted earnings change | C12 |
| Stulz (1981) | world consumption | T | Jegadeesh (1990) | return predictability | C13 |
| Mayshar (1981) | transaction costs | T | | | |
| Banz (1981) | firm size | C6 | | | |
| Figlewski (1981) | short interest | C7 | | | |
| Ferson and Harvey (1991) | market return^{†} | | Elton, Gruber, and Blake (1995) | change in expected inflation | F22 |
| | consumption growth^{†} | | | change in expected GNP | F23 |
| | credit spread^{†} | | Spiess and Affleck-Graves (1999) | seasoned equity offerings^{‡} | |
| | $\Delta$ slope of the yield curve | F13 | Chan, Foresi, and Lang (1996) | money growth | F24 |
| | unexpected inflation^{†} | | Cochrane (1996) | returns on physical inv. | F25 |
| | real short rate | F14 | Campbell (1996) | market return^{†} | |
| ^{3}Fama and French (1992) | size | F15 | | labor income | F26 |
| | value | F16 | | dividend yield^{†} | |
| Chopra, Lakonishok, and Ritter (1992) | return momentum^{‡} | | | interest rate^{†} | |
| Holthausen and Larcker (1992) | predicted return signs^{‡} | | | term structure^{†} | |
| Jegadeesh and Titman (1993) | return momentum | C14 | Jagannathan and Wang (1996) | market return^{†} | |
| Elton et al. (1993) | returns on S&P stocks^{‡} | | | slope of yield curve^{†} | |
| | returns on non-S&P stocks^{‡} | | | labor income^{†} | |
| ^{4}Bansal and Viswanathan (1993) | high-order equity & bond returns^{‡} | | La Porta (1996) | earnings forecasts | C18 |
| Fama and French (1993) | market return^{†} | | Lev and Sougiannis (1996) | R&D capital | C19 |
| | size^{†} | | Sloan (1996) | accruals | C20 |
| | value^{†} | | Womack (1996) | buy recommendations | C21 |
| | term structure^{†} | | | sell recommendations | C22 |
| | credit risk^{†} | | Erb, Harvey, and Viskanta (1996) | credit rating | C23 |
| ^{5}Ferson and Harvey (1993) | world equity return^{‡} | | Brennan and Subrahmanyam (1996) | illiquidity | C24 |
| | change in weighted exchange rate^{‡} | | ^{6}Chapman (1997) | nonlinear fn. of cons. growth^{‡} | |
| | $\Delta$ LT inflation expectations^{‡} | | ^{7}Fung and Hsieh (1997) | opportunistic style return^{‡} | |
| | weighted real short-term rate^{‡} | | | global/macro style return^{‡} | |
| | change in oil price^{†‡} | | | value style return^{‡} | |
| | change in TED spread^{‡} | | | trend following style return^{‡} | |
| | $\Delta$ in G-7 industrial production^{‡} | | | distressed inv. style return^{‡} | |
| | unexpected G-7 inflation^{‡} | | Carhart (1997) | size^{†} | |
| Ferson and Harvey (1994) | world equity return | F17 | | value^{†} | |
| | change in weighted FX rate* | F18 | | market return^{†} | |
| | $\Delta$ LT inflation expectations* | F19 | | momentum | F27 |
| | change in oil price^{*†} | | Botosan (1997) | disclosure level | C25 |
| Bossaerts and Dammon (1994) | tax rate for capital gains | F20 | Ackert and Athanassakos (1997) | earnings forecast uncertainty | C26 |
| | tax rate for dividend | F21 | Daniel and Titman (1997) | size^{†} | |
| Loughran and Ritter (1995) | new public stock issuance | C15 | | value^{†} | |
| Michaely, Thaler, and Womack (1995) | dividend initiations | C16 | | | |
| | dividend omissions | C17 | | | |
| Beneish (1997) | earnings management likelihood^{‡} | | Griffin and Lemmon (2002) | distress risk | C45 |
| Loughran and Vijh (1997) | corporate acquisitions | C27 | Diether, Malloy, and Scherbina (2002) | analyst dispersion | C46 |
| Brennan, Chordia, and Subrahmanyam (1998) | size^{†} | | Chen, Hong, and Stein (2002) | breadth of ownership | C47 |
| | book-to-market ratio^{†} | | Easley, Hvidkjaer, and O'Hara (2002) | information risk | C48 |
| | momentum^{†} | | Jones and Lamont (2002) | short-sale constraints | C49 |
| | trading volume | C28 | Penman and Zhang (2002) | earnings sustainability | C50 |
| Abarbanell and Bushee (1998) | fundamental analysis^{‡} | | Amihud (2002) | market illiquidity | F33 |
| Frankel and Lee (1998) | firm fundamental value^{‡} | | Vassalou (2003) | GDP growth news | F34 |
| Dichev (1998) | bankruptcy risk | C29 | Pastor and Stambaugh (2003) | market liquidity | F35 |
| Datar, Naik, and Radcliffe (1998) | illiquidity | C30 | Ali, Hwang, and Trombley (2003) | idiosyncratic return volatility^{†} | |
| Ferson and Harvey (1999) | expected portfolio return | F28 | | transaction costs^{†} | |
| Moskowitz and Grinblatt (1999) | industry momentum | C31 | | investor sophistication^{†} | |
| Spiess and Affleck-Graves (1999) | debt offerings^{‡} | | Gompers, Ishii, and Metrick (2003) | shareholder rights | C51 |
| Heaton and Lucas (2000) | entrepreneur income | F29 | Doyle, Lundholm, and Soliman (2003) | excluded expenses | C52 |
| Harvey and Siddique (2000) | coskewness | F30 | Fairfield, Whisenant, and Yohn (2003) | growth in LT net operating assets | C53 |
| Lee and Swaminathan (2000) | trading volume | C32 | Rajgopal, Shevlin, and Venkatachalam (2003) | order backlog | C54 |
| Asness, Porter, and Stevens (2000) | intra-industry size | C33 | Watkins (2003) | return consistency | C55 |
| | intra-industry value | C34 | Jacobs and Wang (2004) | idiosyncratic consumption | F36 |
| | intra-industry CF/P | C35 | Campbell and Vuolteenaho (2004) | cash-flow news | F37 |
| | intra-industry $\Delta$% # employees | C36 | | discount rate news | F38 |
| | intra-industry momentum | C37 | ^{8}Vanden (2004) | market return^{†} | |
| Piotroski (2000) | financial statement infor. | C38 | | index option returns | F39 |
| Lettau and Ludvigson (2001) | consumption growth^{†} | | Vassalou and Xing (2004) | default risk | F40 |
| | consumption-wealth ratio | F31 | Brennan, Wang, and Xia (2004) | real interest rate | F41 |
| Chordia, Subrahmanyam, and Anshuman (2001) | level of liquidity | C39 | | maximum Sharpe ratio portfolio | F42 |
| | variability of liquidity | C40 | Teo and Woo (2004) | return reversals at the style level | F43 |
| Lamont, Polk, and Saa-Requejo (2001) | financial constraints | C41 | Eberhart, Maxwell, and Siddique (2004) | unexpected change in R&D | C56 |
| Fung and Hsieh (2001) | straddle return^{‡} | | George and Hwang (2004) | 52-week high | C57 |
| Barber et al. (2001) | consensus recommendations* | | Jegadeesh et al. (2004) | analysts' recommendations | C58 |
| Dichev and Piotroski (2001) | bond rating changes | C42 | Ofek, Richardson, and Whitelaw (2004) | put-call parity | C59 |
| Elgers, Lo, and Pfeiffer (2001) | analysts' forecasts | C43 | Titman, Wei, and Xie (2004) | abnormal capital investment | C60 |
| Gompers and Metrick (2001) | institutional ownership | C44 | Hirshleifer et al. (2004) | balance sheet optimism | C61 |
| Dittmar (2002) | market return^{†} | | Parker and Julliard (2005) | LT consumption growth | F44 |
| | squared market return^{†} | | Bansal, Dittmar, and Lundblad (2005) | long-run consumption | F45 |
| | labor income growth^{†} | | | | |
| | squared labor income growth | F32 | | | |
| Lustig and Van Nieuwerburgh (2005) | housing price ratio | F46 | Brammer, Brooks, and Pavelin (2006) | environment indicator* | C78 |
| Cremers and Nair (2005) | external corporate governance | C62 | | employment indicator* | C79 |
| | internal corporate governance | C63 | | community indicator* | C80 |
| ^{9}Acharya and Pedersen (2005) | market return^{†} | | Daniel and Titman (2006) | intangible information | C81 |
| | market liquidity* | F47 | Fama and French (2006) | profitability | C82 |
| | individual stock liquidity | C64 | | investment* | C83 |
| Hou and Moskowitz (2005) | price delay | C65 | | book-to-market^{†} | |
| Anderson, Ghysels, and Juergens (2005) | heterogeneous beliefs | C66 | Bradshaw, Richardson, and Sloan (2006) | net financing | C84 |
| Nagel (2005) | short-sale constraints | C67 | Cen, Wei, and Zhang (2006) | forecasted earnings per share | C85 |
| Asquith, Pathak, and Ritter (2005) | short-sale constraints | C68 | Franzoni and Marin (2006) | pension plan funding | C86 |
| Gu (2005) | patent citation | C69 | Gettleman and Marks (2006) | acceleration | C87 |
| Jiang, Lee and Zhang (2005) | information uncertainty | C70 | Narayanamoorthy (2006) | unexpected earnings' autocorr. | C88 |
| Lev, Nissim, and Thomas (2005) | adjusted R&D | C71 | Boudoukh et al. (2007) | payout yield | F62 |
| Lev, Sarath, and Sougiannis (2005) | R&D reporting biases | C72 | Balvers and Huang (2007) | productivity | F63 |
| Mohanram (2005) | growth index | C73 | | capital stock | F64 |
| ^{10}Vanden (2006) | market return^{†} | | Jagannathan and Wang (2007) | 4th Q to 4th Q cons. growth | F65 |
| | index option return^{†} | | Avramov et al. (2007) | credit rating | C89 |
| | market × option return^{‡} | | Shu (2007) | trader composition | C90 |
| Gomes, Yaron, and Zhang (2006) | financing frictions | F48 | Baik and Ahn (2007) | change in order backlog | C91 |
| Li, Vassalou, and Xing (2006) | inv. growth (IG) households* | F49 | Brown and Rowe (2007) | firm productivity | C92 |
| | IG nonfinancial corporates | F50 | Doran, Fodor, and Peterson (2007) | insider forecasts of firm vol. | C93 |
| | IG noncorporate business | F51 | Head, Smith, and Wilson (2007) | ticker symbol | C94 |
| | IG financial firms | F52 | Gourio (2007) | earnings cyclicality | F66 |
| ^{11}Chung, Johnson, and Schill (2006) | 3rd-10th power market return^{‡} | | Kumar et al. (2008) | market volatility innovation | F67 |
| Whited and Wu (2006) | financial constraints | C74 | | firm age | C95 |
| Ang et al. (2006) | downside risk | F53 | | market return^{†} | |
| Ang et al. (2006) | systematic volatility | F54 | | market vol. × firm age | C96 |
| | idiosyncratic volatility | C75 | Adrian and Rosenberg (2008) | short-run market volatility | F68 |
| Baker and Wurgler (2006) | investor sentiment | F55 | | long-run market volatility | F69 |
| Kumar and Lee (2006) | retail investor sentiment | F56 | Xing (2008) | investment growth | F70 |
| Yogo (2006) | durable & nondur. cons. growth | F57 | Korniotis (2008) | mean consumption growth | F71 |
| Lo and Wang (2006) | market return^{†} | | | variance of consumption growth* | F72 |
| | trading volume | F58 | | mean habit growth | F73 |
| Sadka (2006) | liquidity | F59 | | variance of habit growth | F74 |
| Chordia and Shivakumar (2006) | earnings | F60 | Korajczyk and Sadka (2008) | liquidity | F75 |
| Liu (2006) | liquidity | F61 | Guo and Savickas (2008) | country-level idiosyncratic vol. | C97 |
| Anderson and Garcia-Feijóo (2006) | capital investment | C76 | Campbell, Hilscher, and Szilagyi (2008) | distress | C98 |
| Hou and Robinson (2006) | industry concentration | C77 | Garlappi, Shu, and Yan (2008) | shareholder advantage | C99 |
| | | | | implied market value from KMV | C100 |
Cooper, Gulen, and Schill (2008) | asset growth | C101 | Barber, Odean, and Zhu (2009) | order imbalance | C125 |

Pontiff and Woodgate (2008) | share issuance | C102 | Cremers, Halling, and Weinbaum (2010) | market volatility and jumps | F85 |

Brandt et al. (2008) | earnings announcement return^{‡} | Hirshleifer and Jiang (2010) | market mispricing | F86 | |

Cohen and Frazzini (2008) | firm economic links | C103 | Boyer, Mitton, and Vorkink (2010) | idiosyncratic skewness | C126 |

Fabozzi, Ma, and Oliphant (2008) | sin stock | C104 | Cooper, Gulen, and Ovtchinnikov (2010) | political contributions | C127 |

Gu and Lev (2011) | goodwill impairment | C105 | Tuzel (2010) | real estate holdings | C128 |

Gu, Wang, and Ye (2008) | information in order backlog | C106 | Amaya et al. (2011) | realized skewness | C129 |

Lehavy and Sloan (2008) | investor recognition | C107 | realized kurtosis | C130 | |

Soliman (2008) | DuPont analysis | C108 | An, Bhojraj, and Ng (2010) | excess multiple | C131 |

Hvidkjaer (2008) | small trades | C109 | Armstrong, Banerjee, and Corona (2010) | firm information quality | C132 |

Brennan and Li (2008) | idiosyncratic S&P 500 return | F76 | Cao and Xu (2010) | long-run idiosyncratic vol. | C133 |

Da (2009) | cash flow cova. with cons. | F77 | Easley, Hvidkjaer, and O'Hara (2010) | private information | F87 |

cash flow duration | F78 | Hameed, Huang, and Mian (2010) | intra-industry return reversals | C134 | |

Livdan, Sapriza, and Zhang (2009) | financial constraints | Menzly and Ozbas (2010) | related industry returns | C135 | |

Malloy, Moskowitz, and Vissing-Jorgensen (2009) | LT stockholder cons. growth | F79 | Papanastasopoulos, Thomakos, and Wang (2010) | earnings to equity holders | C136 |

Cremers, Nair, and John (2009) | takeover likelihood | F80 | net cash to equity holders | C137 | |

Chordia, Huh, and Subrahmanyam (2009) | illiquidity | F81 | Simutin (2010) | excess cash | C138 |

Da and Warachka (2009) | cash flow | F82 | Huang et al. (2012) | extreme downside risk | C139 |

Ozoguz (2009) | investors' beliefs* | F83 | Xing, Zhang, and Zhao (2010) | volatility smirk | C140 |

investors' uncertainty | F84 | George and Hwang (2010) | exposure financial distress costs | ||

Fang and Peress (2009) | media coverage | C110 | Berkman, Jacobsen, and Lee (2011) | rare disasters | F88 |

Avramov et al. (2009) | financial distress | C111 | ^{12}Kapadia (2011) | distress risk^{‡} | |

Fu (2009) | idiosyncratic volatility | C112 | Hou, Karolyi, and Kho (2011) | momentum^{†} | |

Hahn and Lee (2009) | debt capacity | C113 | cash flow-to-price | F89 | |

Bali and Hovakimian (2009) | realized-implied vol. spread | C114 | Li (2011) | R&D investment | C141 |

call-put implied vol. spread | C115 | financial constraints^{†} | |||

Chandrashekar and Rao (2009) | productivity of cash | C116 | Bali, Cakici, and Whitelaw (2011) | extreme stock returns | C142 |

Chemmanur and Yan (2009) | advertising | C117 | Yan (2011) | jumps individual stock returns | C143 |

Da and Warachka (2009) | analyst forecasts optimism | C118 | Edmans (2011) | intangibles | C144 |

Gokcen (2009) | information revelation | C119 | ^{13}Chen, Novy-Marx, and Zhang (2011) | market return^{†} | |

Gow and Taylor (2009) | earnings volatility | C120 | investment portfolio return | F90 | |

Huang (2009) | cash-flow volatility | C121 | ROE portfolio return | F91 | |

Korniotis and Kumar (2009) | local unemployment | C122 | Akbas, Armstrong, and Petkova (2011) | volatility of liquidity | C145 |

local housing collateral | C123 | Jiang and Sun (2011) | dispersion in beliefs | C146 | |

Nguyen and Swanson (2009) | efficiency score | C124 | Han and Zhou (2011) | credit default swap spreads | C147 |

Eisfeldt and Papanikolaou (2011) | organizational capital | C148 | Lioui and Maio (2012) | future growth opp. cost of money | F106 |

Balachandran and Mohanram (2011) | residual income | C149 | Gârleanu, Kogan, and Panageas (2012) | inter-cohort cons. differences | T |

Bandyopadhyay, Huang, and Wirjanto (2010) | accrual volatility | C150 | Hu, Pan, and Wang (2012) | market-wide liquidity | F107 |

Callen and Lyle (2011) | implied cost of capital | C151 | Conrad, Dittmar, and Ghysels (2013) | stock skewness | C166 |

Callen, Khan, and Lu (2013) | nonaccounting infor. quality | C152 | Baltussen, Van Bekkum, and Van der Grient (2012) | expected return uncertainty | C167 |

accounting infor. quality | C153 | Zhao (2012) | information intensity | C168 | |

Chen, Kacperczyk, and Ortiz-Molina (2011) | labor unions | C154 | Friewald, Wagner, and Zechner (2012) | credit risk premia | C169 |

Da, Liu, and Schaumburg (2011) | overreaction to nonfundamentals | C155 | Garcia and Norli (2012) | geographic dispersion | C170 |

Drake, Rees, and Swanson (2011) | short interest | C156 | Kim, Pantzalis, and Park (2012) | political geography | C171 |

Hafzalla, Lundholm, and Van Winkle (2011) | percent total accrual | C157 | Johnson and So (2012) | option to stock volume ratio | C172 |

Hess, Kreutzmann, and Pucker (2011) | projected earnings accuracy^{‡} | Palazzo (2012) | cash holdings | C173 | |

Imrohoroglu and Tuzel (2011) | firm productivity | C158 | Donangelo (2012) | labor mobility | C174 |

Landsman et al. (2011) | really dirty surplus | C159 | Wang (2012) | debt covenant protection | C175 |

Li (2011) | earnings forecast | C160 | Chen and Strebulaev (2012) | stock cash-flow sensitivity | C176 |

Nyberg and Pöyry (2011) | asset growth | C161 | Li (2012) | jump beta | F108 |

Ortiz-Molina and Phillips (2011) | real asset liquidity | C162 | ^{15}Ferson, Nallareddy, and Xie (2012) | long-run cons. growth^{‡} | |

Patatoukas (2011) | customer-base concentration | C163 | short-run cons. growth^{‡} | ||

Thomas and Zhang (2011) | tax expense surprises | C164 | cons. growth volatility^{‡} | ||

Wahlen and Wieland (2011) | predicted earnings increase score^{‡} | Ang, Bali, and Cakici (2012) | change in call implied vol. | C177 | |

Garlappi and Yan (2011) | shareholder recovery | change in put implied vol. | C178 | ||

Savov (2011) | garbage growth | F92 | Bazdresch, Belo and Lin (2012) | firm hiring rate | C179 |

Adrian, Etula, and Muir (2012) | financial intermediary's wealth | F93 | Cohen and Lou (2012) | infor. processing complexity | C180 |

Campbell et al. (2012) | stochastic volatility* | F94 | Cohen, Malloy, and Pomorski (2012) | opportunistic buy | C181 |

Chen and Petkova (2012) | average variance of equity returns | F95 | opportunistic sell | C182 | |

Eiling (2013) | income growth goods industries | F96 | Hirshleifer, Hsu, and Li (2012) | innovative efficiency | C183 |

income growth for manufacturing | F97 | Li (2012) | abnormal operating cash flows | C184 | |

income growth for distributive | F98 | abnormal production costs | C185 | ||

income growth for service* | F99 | Prakash and Sinha (2012) | deferred revenues | C186 | |

income growth for government* | F100 | Price et al. (2012) | earnings conference calls | C187 | |

Boguth and Kuehn (2012) | consumption volatility | F101 | So (2012) | earnings forecast optimism | C188 |

Chang, Christoffersen, and Jacobs (2012) | market skewness | F102 | Boons, De Roon, and Szymanowska (2012) | commodity index | F109 |

Viale, Garcia-Feijoo, and Giannetti (2012) | learning* | F103 | Moskowitz, Ooi, and Pedersen (2012) | time-series momentum | C189 |

Knightian uncertainty | F104 | Koijen et al. (2012) | carry | C190 | |

Bali and Zhou (2012) | market uncertainty | F105 | Burlacu et al. (2012) | expected return proxy | C191 |

^{14}Gómez, Priestley, and Zapatero (2012) | labor income^{‡} | Beneish, Lee, and Nichols (2012) | fraud probability | C192 | |

Van Binsbergen (2012) | product price change | C165 | |||

Brennan et al. (2012) | buy orders | C193 | Frazzini and Pedersen (2013) | betting-against-beta | C198 |

sell orders | C194 | Valta (2013) | secured debt | C199 | |

Doskov, Pekkala, and Ribeiro (2013) | expected dividend level | F110 | convertible debt | C200 | |

expected dividend growth | F111 | convertible debt indicator | C201 | ||

Cohen, Diether, and Malloy (2013) | firm's ability to innovate | C195 | Akbas et al. (2013) | cross-sectional pricing inefficiency | F112 |

Larcker, So, and Wang (2013) | board centrality | C196 | Chordia, Subrahmanyam, and Tong (2013) | attenuated returns | C202 |

Novy-Marx (2013) | gross profitability | C197 | Brennan, Huh, and Subrahmanyam (2013) | bad private information | C203 |

Han and Zhou (2013) | trend signal | F113 |

Reference | Factor | # | Reference | Factor | # |
---|---|---|---|---|---|

Sharpe (1964) | market return | T | Constantinides (1982) | individual consumer's wealth | T |

Lintner (1965) | market return | T | Basu (1983) | EP ratio | C8 |

Mossin (1966) | market return | T | Adler and Dumas (1983) | FX rate change | T |

Douglas (1967) | total volatility | C1 | Arbel, Carvell, and Strebel (1983) | institutional holding^{‡} | |

Heckerman (1972) | market return | T | Hawkins, Chamberlin, and Daniel (1984) | earnings expectations^{‡} | |

relative prices of cons. goods | T | McConnell and Sanger (1984) | new listings announcement^{‡} | ||

^{1}Black, Jensen, and Scholes (1972) | market return | Chan, Chen, and Hsieh (1985) | market return^{†} | ||

Black (1972) | market return | T | industrial production growth | F5 | |

Merton (1973) | state variables investment opps. | T | change in expected inflation* | F6 | |

Fama and MacBeth (1973) | market return | F1 | unanticipated inflation | F7 | |

beta squared* | F2 | credit premium | F8 | ||

idiosyncratic volatility* | C2 | term structure* | F9 | ||

Rubinstein (1973) | high-order market return | T | De Bondt and Thaler (1985) | long-term return reversal | C9 |

Solnik (1974) | world market return | T | Cox, Ingersoll, and Ross (1985) | $\Delta $ investment opportunities | T |

Rubinstein (1974) | individual investor resources | T | Amihud and Mendelson (1986) | transaction costs | T |

Gupta and Ofer (1975) | earnings growth expectations | C3 | Constantinides (1986) | transaction costs | T |

Kraus and Litzenberger (1976) | market return^{†} | Stulz (1986) | expected inflation | T | |

squared market return* | F3 | Sweeney and Warga (1986) | long-term interest rate | F10 | |

Basu (1977) | PE ratio | C4 | Chen, Roll, and Ross (1986) | industrial production growth^{†} | |

Lucas (1978) | marginal rate of substitution | T | credit premium^{†} | ||

Litzenberger and Ramaswamy (1979) | dividend yield | C5 | term structure^{†} | ||

market return^{†} | unanticipated inflation^{†} | ||||

Breeden (1979) | real consumption growth | T | change in oil prices* | F11 | |

Jarrow (1980) | short-sale restrictions | T | Bhandari (1988) | debt-to-equity ratio | C10 |

^{2}Fogler, John, and Tipton (1981) | market return^{†‡} | Bauman and Dowen (1988) | long-term growth forecasts^{‡} | ||

Treasury bond return^{‡} | Breeden, Gibbons, and Litzenberger (1989) | consumption growth | F12 | ||

corporate bond return^{‡} | Amihud and Mendelson (1989) | illiquidity | C11 | ||

Oldfield and Rogalski (1981) | Treasury-bill return | F4 | Ou and Penman (1989) | predicted earnings change | C12 |

Stulz (1981) | world consumption | T | Jegadeesh (1990) | return predictability | C13 |

Mayshar (1981) | transaction costs | T | |||

Banz (1981) | firm size | C6 | |||

Figlewski (1981) | short interest | C7 | |||

Ferson and Harvey (1991) | market return^{†} | Elton, Gruber, and Blake (1995) | change in expected inflation | F22 | |

consumption growth^{†} | change in expected GNP | F23 | |||

credit spread^{†} | Spiess and Affleck-Graves (1999) | seasoned equity offerings^{‡} | |||

$\Delta $ slope of the yield curve | F13 | Chan, Foresi, and Lang (1996) | money growth | F24 | |

unexpected inflation^{†} | Cochrane (1996) | returns on physical inv. | F25 | ||

real short rate | F14 | Campbell (1996) | market return^{†} | ||

^{3}Fama and French (1992) | size | F15 | labor income | F26 | |

value | F16 | dividend yield^{†} | |||

Chopra, Lakonishok, and Ritter (1992) | return momentum^{‡} | interest rate^{†} | |||

Holthausen and Larcker (1992) | predicted return signs^{‡} | term structure^{†} | |||

Jegadeesh and Titman (1993) | return momentum | C14 | Jagannathan and Wang (1996) | market return^{†} | |

Elton et al. (1993) | returns on S&P stocks^{‡} | slope of yield curve^{†} | |||

returns on non-S&P stocks^{‡} | labor income^{†} | ||||

^{4}Bansal and Viswanathan (1993) | high-order equity & bond returns^{‡} | La Porta (1996) | earnings forecasts | C18 | |

Fama and French (1993) | market return^{†} | Lev and Sougiannis (1996) | R&D capital | C19 | |

size^{†} | Sloan (1996) | accruals | C20 | ||

value^{†} | Womack (1996) | buy recommendations | C21 | ||

term structure^{†} | sell recommendations | C22 | |||

credit risk^{†} | Erb, Harvey, and Viskanta (1996) | credit rating | C23 | ||

^{5}Ferson and Harvey (1993) | world equity return^{‡} | Brennan and Subrahmanyam (1996) | illiquidity | C24 | |

change in weighted exchange rate^{‡} | ^{6}Chapman (1997) | nonlinear fn. of cons. growth^{‡} | |||

$\Delta $ LT inflation expectations^{‡} | ^{7}Fung and Hsieh (1997) | opportunistic style return^{‡} | |||

weighted real short-term rate^{‡} | global/macro style return^{‡} | ||||

change in oil price^{†‡} | value style return^{‡} | ||||

change in TED spread^{‡} | trend following style return^{‡} | ||||

$\Delta $ in G-7 industrial production^{‡} | distressed inv. style return^{‡} | ||||

unexpected G-7 inflation^{‡} | Carhart (1997) | size^{†} | |||

Ferson and Harvey (1994) | world equity return | F17 | value^{†} | ||

change in weighted FX rate* | F18 | market return^{†} | |||

$\Delta $ LT inflation expectations* | F19 | momentum | F27 | ||

change in oil price^{*†} | Botosan (1997) | disclosure level | C25 | ||

Bossaerts and Dammon (1994) | tax rate for capital gains | F20 | Ackert and Athanassakos (1997) | earnings forecast uncertainty | C26 |

tax rate for dividend | F21 | Daniel and Titman (1997) | size^{†} | ||

Loughran and Ritter (1995) | new public stock issuance | C15 | value^{†} | ||

Michaely, Thaler, and Womack (1995) | dividend initiations | C16 | |||

dividend omissions | C17 | ||||

Beneish (1997) | earnings management likelihood^{‡} | Griffin and Lemmon (2002) | distress risk | C45 | |

Loughran and Vijh (1997) | corporate acquisitions | C27 | Diether, Malloy, and Scherbina (2002) | analyst dispersion | C46 |

Brennan, Chordia, and Subrahmanyam (1998) | size^{†} | Chen, Hong, and Stein (2002) | breadth of ownership | C47 | |

book-to-market ratio^{†} | Easley, Hvidkjaer, and O'Hara (2002) | information risk | C48 | ||

momentum^{†} | Jones and Lamont (2002) | short-sale constraints | C49 | ||

trading volume | C28 | Penman and Zhang (2002) | earnings sustainability | C50 | |

Abarbanell and Bushee (1998) | fundamental analysis^{‡} | Amihud (2002) | market illiquidity | F33 | |

Frankel and Lee (1998) | firm fundamental value^{‡} | Vassalou (2003) | GDP growth news | F34 | |

Dichev (1998) | bankruptcy risk | C29 | Pastor and Stambaugh (2003) | market liquidity | F35 |

Datar, Naik, and Radcliffe (1998) | illiquidity | C30 | Ali, Hwang, and Trombley (2003) | idiosyncratic return volatility^{†} | |

Ferson and Harvey (1999) | expected portfolio return | F28 | transaction costs^{†} | ||

Moskowitz and Grinblatt (1999) | industry momentum | C31 | investor sophistication^{†} | ||

Spiess and Affleck-Graves (1999) | debt offerings^{‡} | Gompers, Ishii, and Metrick (2003) | shareholder rights | C51 | |

Heaton and Lucas (2000) | entrepreneur income | F29 | Doyle, Lundholm, and Soliman (2003) | excluded expenses | C52 |

Harvey and Siddique (2000) | coskewness | F30 | Fairfield, Whisenant, and Yohn (2003) | growth in LT net operating assets | C53 |

Lee and Swaminathan (2000) | trading volume | C32 | Rajgopal, Shevlin, and Venkatachalam (2003) | order backlog | C54 |

Asness, Porter, and Stevens (2000) | intra-industry size | C33 | Watkins (2003) | return consistency | C55 |

intra-industry value | C34 | Jacobs and Wang (2004) | idiosyncratic consumption | F36 | |

intra-industry CF/P | C35 | Campbell and Vuolteenaho (2004) | cash-flow news | F37 | |

intra-industry $\Delta $% # employees | C36 | discount rate news | F38 | ||

intra-industry momentum | C37 | ^{8}Vanden (2004) | market return^{†} | ||

Piotroski (2000) | financial statement infor. | C38 | index option returns | F39 | |

Lettau and Ludvigson (2001) | consumption growth^{†} | Vassalou and Xing (2004) | default risk | F40 | |

consumption-wealth ratio | F31 | Brennan, Wang, and Xia (2004) | real interest rate | F41 | |

Chordia, Subrahmanyam, and Anshuman (2001) | level of liquidity | C39 | maximum Sharpe ratio portfolio | F42 | |

variability of liquidity | C40 | Teo and Woo (2004) | return reversals at the style level | F43 | |

Lamont, Polk, and Saa-Requejo (2001) | financial constraints | C41 | Eberhart, Maxwell, and Siddique (2004) | unexpected change in R&D | C56 |

Fung and Hsieh (2001) | straddle return^{‡} | George and Hwang (2004) | 52-week high | C57 | |

Barber et al. (2001) | consensus recommendations* | Jegadeesh et al. (2004) | analysts' recommendations | C58 | |

Dichev and Piotroski (2001) | bond rating changes | C42 | Ofek, Richardson, and Whitelaw (2004) | put-call parity | C59 |

Elgers, Lo, and Pfeiffer (2001) | analysts' forecasts | C43 | Titman, Wei, and Xie (2004) | abnormal capital investment | C60 |

Gompers and Metrick (2001) | institutional ownership | C44 | Hirshleifer et al. (2004) | balance sheet optimism | C61 |

Dittmar (2002) | market return^{†} | Parker and Julliard (2005) | LT consumption growth | F44 | |

squared market return^{†} | Bansal, Dittmar, and Lundblad (2005) | long-run consumption | F45 | ||

labor income growth^{†} | |||||

squared labor income growth | F32 | ||||

Lustig and Van Nieuwerburgh (2005) | housing price ratio | F46 | Brammer, Brooks, and Pavelin (2006) | environment indicator* | C78 |

Cremers and Nair (2005) | external corporate governance | C62 | employment indicator* | C79 | |

internal corporate governance | C63 | community indicator* | C80 | ||

^{9}Acharya and Pedersen (2005) | market return^{†} | Daniel and Titman (2006) | intangible information | C81 | |

market liquidity* | F47 | Fama and French (2006) | profitability | C82 | |

individual stock liquidity | C64 | investment* | C83 | ||

Hou and Moskowitz (2005) | price delay | C65 | book-to-market^{†} | ||

Anderson, Ghysels, and Juergens (2005) | heterogeneous beliefs | C66 | Bradshaw, Richardson, and Sloan (2006) | net financing | C84 |

Nagel (2005) | short-sale constraints | C67 | Cen, Wei, and Zhang (2006) | forecasted earnings per share | C85 |

Asquith, Pathak, and Ritter (2005) | short-sale constraints | C68 | Franzoni and Marin (2006) | pension plan funding | C86 |

Gu (2005) | patent citation | C69 | Gettleman and Marks (2006) | acceleration | C87 |

Jiang, Lee and Zhang (2005) | information uncertainty | C70 | Narayanamoorthy (2006) | unexpected earnings' autocorr. | C88 |

Lev, Nissim, and Thomas (2005) | adjusted R&D | C71 | Boudoukh et al. (2007) | payout yield | F62 |

Lev, Sarath, and Sougiannis (2005) | R&D reporting biases | C72 | Balvers and Huang (2007) | productivity | F63 |

Mohanram (2005) | growth index | C73 | capital stock | F64 | |

^{10}Vanden (2006) | market return^{†} | Jagannathan and Wang (2007) | 4th Q to 4th Q cons. growth | F65 | |

index option return^{†} | Avramov et al. (2007) | credit rating | C89 | ||

market × option return^{‡} | Shu (2007) | trader composition | C90 | ||

Gomes, Yaron, and Zhang (2006) | financing frictions | F48 | Baik and Ahn (2007) | change in order backlog | C91 |

Li, Vassalou, and Xing (2006) | inv. growth (IG) households* | F49 | Brown and Rowe (2007) | firm productivity | C92 |

IG nonfinancial corporates | F50 | Doran, Fodor, and Peterson (2007) | insider forecasts of firm vol. | C93 | |

IG noncorporate business | F51 | Head, Smith, and Wilson (2007) | ticker symbol | C94 | |

IG financial firms | F52 | Gourio (2007) | earnings cyclicality | F66 | |

^{11}Chung, Johnson, and Schill (2006) | 3rd-10th power market return^{‡} | Kumar et al. (2008) | market volatility innovation | F67 | |

Whited and Wu (2006) | financial constraints | C74 | firm age | C95 | |

Ang et al. (2006) | downside risk | F53 | market return^{†} | ||

Ang et al. (2006) | systematic volatility | F54 | market vol. × firm age | C96 | |

idiosyncratic volatility | C75 | Adrian and Rosenberg (2008) | short-run market volatility | F68 | |

Baker and Wurgler (2006) | investor sentiment | F55 | long-run market volatility | F69 | |

Kumar and Lee (2006) | retail investor sentiment | F56 | Xing (2008) | investment growth | F70 |

Yogo (2006) | durable & nondur. cons. growth | F57 | Korniotis (2008) | mean consumption growth | F71 |

Lo and Wang (2006) | market return^{†} | variance of consumption growth* | F72 | ||

trading volume | F58 | mean habit growth | F73 | ||

Sadka (2006) | liquidity | F59 | variance of habit growth | F74 | |

Chordia and Shivakumar (2006) | earnings | F60 | Korajczyk and Sadka (2008) | liquidity | F75 |

Liu (2006) | liquidity | F61 | Guo and Savickas (2008) | country-level idiosyncratic vol. | C97 |

Anderson and Garcia-Feijóo (2006) | capital investment | C76 | Campbell, Hilscher, and Szilagyi (2008) | distress | C98 |

Hou and Robinson (2006) | industry concentration | C77 | Garlappi, Shu, and Yan (2008) | shareholder advantage | C99 |

implied market value from KMV | C100 | ||||

Cooper, Gulen, and Schill (2008) | asset growth | C101 | Barber, Odean, and Zhu (2009) | order imbalance | C125 |

Pontiff and Woodgate (2008) | share issuance | C102 | Cremers, Halling, and Weinbaum (2010) | market volatility and jumps | F85 |

Brandt et al. (2008) | earnings announcement return^{‡} | Hirshleifer and Jiang (2010) | market mispricing | F86 | |

Cohen and Frazzini (2008) | firm economic links | C103 | Boyer, Mitton, and Vorkink (2010) | idiosyncratic skewness | C126 |

Fabozzi, Ma, and Oliphant (2008) | sin stock | C104 | Cooper, Gulen, and Ovtchinnikov (2010) | political contributions | C127 |

Gu and Lev (2011) | goodwill impairment | C105 | Tuzel (2010) | real estate holdings | C128 |

Gu, Wang, and Ye (2008) | information in order backlog | C106 | Amaya et al. (2011) | realized skewness | C129 |

Lehavy and Sloan (2008) | investor recognition | C107 | realized kurtosis | C130 | |

Soliman (2008) | DuPont analysis | C108 | An, Bhojraj, and Ng (2010) | excess multiple | C131 |

Hvidkjaer (2008) | small trades | C109 | Armstrong, Banerjee, and Corona (2010) | firm information quality | C132 |

Brennan and Li (2008) | idiosyncratic S&P 500 return | F76 | Cao and Xu (2010) | long-run idiosyncratic vol. | C133 |

Da (2009) | cash flow cova. with cons. | F77 | Easley, Hvidkjaer, and O'Hara (2010) | private information | F87 |

cash flow duration | F78 | Hameed, Huang, and Mian (2010) | intra-industry return reversals | C134 | |

Livdan, Sapriza, and Zhang (2009) | financial constraints | Menzly and Ozbas (2010) | related industry returns | C135 | |

Malloy, Moskowitz, and Vissing-Jorgensen (2009) | LT stockholder cons. growth | F79 | Papanastasopoulos, Thomakos, and Wang (2010) | earnings to equity holders | C136 |

Cremers, Nair, and John (2009) | takeover likelihood | F80 | net cash to equity holders | C137 | |

Chordia, Huh, and Subrahmanyam (2009) | illiquidity | F81 | Simutin (2010) | excess cash | C138 |

Da and Warachka (2009) | cash flow | F82 | Huang et al. (2012) | extreme downside risk | C139 |

Ozoguz (2009) | investors' beliefs* | F83 | Xing, Zhang, and Zhao (2010) | volatility smirk | C140 |

investors' uncertainty | F84 | George and Hwang (2010) | exposure financial distress costs | ||

Fang and Peress (2009) | media coverage | C110 | Berkman, Jacobsen, and Lee (2011) | rare disasters | F88 |

Avramov et al. (2009) | financial distress | C111 | ^{12}Kapadia (2011) | distress risk^{‡} | |

Fu (2009) | idiosyncratic volatility | C112 | Hou, Karolyi, and Kho (2011) | momentum^{†} | |

Hahn and Lee (2009) | debt capacity | C113 | cash flow-to-price | F89 | |

Bali and Hovakimian (2009) | realized-implied vol. spread | C114 | Li (2011) | R&D investment | C141 |

call-put implied vol. spread | C115 | financial constraints^{†} | |||

Chandrashekar and Rao (2009) | productivity of cash | C116 | Bali, Cakici, and Whitelaw (2011) | extreme stock returns | C142 |

Chemmanur and Yan (2009) | advertising | C117 | Yan (2011) | jumps individual stock returns | C143 |

Da and Warachka (2009) | analyst forecasts optimism | C118 | Edmans (2011) | intangibles | C144 |

Gokcen (2009) | information revelation | C119 | ^{13}Chen, Novy-Marx, and Zhang (2011) | market return^{†} | |

Gow and Taylor (2009) | earnings volatility | C120 | investment portfolio return | F90 | |

Huang (2009) | cash-flow volatility | C121 | ROE portfolio return | F91 | |

Korniotis and Kumar (2009) | local unemployment | C122 | Akbas, Armstrong, and Petkova (2011) | volatility of liquidity | C145 |

local housing collateral | C123 | Jiang and Sun (2011) | dispersion in beliefs | C146 | |

Nguyen and Swanson (2009) | efficiency score | C124 | Han and Zhou (2011) | credit default swap spreads | C147 |

Eisfeldt and Papanikolaou (2011) | organizational capital | C148 | Lioui and Maio (2012) | future growth opp. cost of money | F106 |

Balachandran and Mohanram (2011) | residual income | C149 | Gârleanu, Kogan, and Panageas (2012) | inter-cohort cons. differences | T |

Bandyopadhyay, Huang, and Wirjanto (2010) | accrual volatility | C150 | Hu, Pan, and Wang (2012) | market-wide liquidity | F107 |

Callen and Lyle (2011) | implied cost of capital | C151 | Conrad, Dittmar, and Ghysels (2013) | stock skewness | C166 |

Callen, Khan, and Lu (2013) | nonaccounting infor. quality | C152 | Baltussen, Van Bekkum, and Van der Grient (2012) | expected return uncertainty | C167 |

accounting infor. quality | C153 | Zhao (2012) | information intensity | C168 | |

Chen, Kacperczyk, and Ortiz-Molina (2011) | labor unions | C154 | Friewald, Wagner, and Zechner (2012) | credit risk premia | C169 |

Da, Liu, and Schaumburg (2011) | overreaction to nonfundamentals | C155 | Garcia and Norli (2012) | geographic dispersion | C170 |

Drake, Rees, and Swanson (2011) | short interest | C156 | Kim, Pantzalis, and Park (2012) | political geography | C171 |

Hafzalla, Lundholm, and Van Winkle (2011) | percent total accrual | C157 | Johnson and So (2012) | option to stock volume ratio | C172 |

Hess, Kreutzmann, and Pucker (2011) | projected earnings accuracy^{‡} | Palazzo (2012) | cash holdings | C173 | |

Imrohoroglu and Tuzel (2011) | firm productivity | C158 | Donangelo (2012) | labor mobility | C174 |

Landsman et al. (2011) | really dirty surplus | C159 | Wang (2012) | debt covenant protection | C175 |

Li (2011) | earnings forecast | C160 | Chen and Strebulaev (2012) | stock cash-flow sensitivity | C176 |

Nyberg and Pöyry (2011) | asset growth | C161 | Li (2012) | jump beta | F108 |

Ortiz-Molina and Phillips (2011) | real asset liquidity | C162 | ^{15}Ferson, Nallareddy, and Xie (2012) | long-run cons. growth^{‡} | |

Patatoukas (2011) | customer-base concentration | C163 | short-run cons. growth^{‡} | ||

Thomas and Zhang (2011) | tax expense surprises | C164 | cons. growth volatility^{‡} | ||

Wahlen and Wieland (2011) | predicted earnings increase score^{‡} | Ang, Bali, and Cakici (2012) | change in call implied vol. | C177 | |

Garlappi and Yan (2011) | shareholder recovery | change in put implied vol. | C178 | ||

Savov (2011) | garbage growth | F92 | Bazdresch, Belo and Lin (2012) | firm hiring rate | C179 |

Adrian, Etula, and Muir (2012) | financial intermediary's wealth | F93 | Cohen and Lou (2012) | infor. processing complexity | C180 |

Campbell et al. (2012) | stochastic volatility* | F94 | Cohen, Malloy, and Pomorski (2012) | opportunistic buy | C181 |

Chen and Petkova (2012) | average variance of equity returns | F95 | opportunistic sell | C182 | |

Eiling (2013) | income growth goods industries | F96 | Hirshleifer, Hsu, and Li (2012) | innovative efficiency | C183 |

income growth for manufacturing | F97 | Li (2012) | abnormal operating cash flows | C184 | |

income growth for distributive | F98 | abnormal production costs | C185 | ||

income growth for service* | F99 | Prakash and Sinha (2012) | deferred revenues | C186 | |

income growth for government* | F100 | Price et al. (2012) | earnings conference calls | C187 | |

Boguth and Kuehn (2012) | consumption volatility | F101 | So (2012) | earnings forecast optimism | C188 |

Chang, Christoffersen, and Jacobs (2012) | market skewness | F102 | Boons, De Roon, and Szymanowska (2012) | commodity index | F109 |

Viale, Garcia-Feijoo, and Giannetti (2012) | learning* | F103 | Moskowitz, Ooi, and Pedersen (2012) | time-series momentum | C189 |

Knightian uncertainty | F104 | Koijen et al. (2012) | carry | C190 | |

Bali and Zhou (2012) | market uncertainty | F105 | Burlacu et al. (2012) | expected return proxy | C191 |

^{14}Gómez, Priestley, and Zapatero (2012) | labor income^{‡} | Beneish, Lee, and Nichols (2012) | fraud probability | C192 | |

Van Binsbergen (2012) | product price change | C165 | |||

Brennan et al. (2012) | buy orders | C193 | Frazzini and Pedersen (2013) | betting-against-beta | C198 |

sell orders | C194 | Valta (2013) | secured debt | C199 | |

Doskov, Pekkala, and Ribeiro (2013) | expected dividend level | F110 | convertible debt | C200 | |

expected dividend growth | F111 | convertible debt indicator | C201 | ||

Cohen, Diether, and Malloy (2013) | firm's ability to innovate | C195 | Akbas et al. (2013) | cross-sectional pricing inefficiency | F112 |

Larcker, So, and Wang (2013) | board centrality | C196 | Chordia, Subrahmanyam, and Tong (2013) | attenuated returns | C202 |

Novy-Marx (2013) | gross profitability | C197 | Brennan, Huh, and Subrahmanyam (2013) | bad private information | C203 |

Han and Zhou (2013) | trend signal | F113 |

*Notes to Table*: T, theoretical; F, common factors; C, characteristics. An augmented version of this table, which includes full citations and hyperlinks to each of the cited articles, is available for download and re-sorting. See http://faculty.fuqua.duke.edu/charvey/Factor-List.xlsx. Many of the working papers we cite have since been published, but because our method depends on the point in time, we cite only the working paper version.

This table contains a summary of risk factors that explain the cross-section of expected returns.

*, insignificant; †, duplicated; ‡, missing *p*-value.

1: Black, Jensen, and Scholes (1972) first tested the market factor. However, they focus on industry portfolios and thus present a less powerful test compared to Fama and MacBeth (1973). We therefore use the test statistics in Fama and MacBeth (1973) for the market factor.

2: No *p*-values reported for their factors constructed from principal component analysis.

3: Fama and French (1992) create zero-investment portfolios to test size and book-to-market effects. This differs from the testing approach in Banz (1981). We therefore count Fama and French's (1992) test of the size effect as a separate one.

4: No *p*-values reported for their high order equity index return factors.

5: No *p*-values reported for their eight risk factors that explain international equity returns.

6: No *p*-values reported for his return factors.

7: No *p*-values reported for their five hedge fund style return factors.

8: Vanden (2004) reports a *t*-statistic for each of the Fama-French 25 size and book-to-market sorted stock portfolios. We average these 25 *t*-statistics.

9: Acharya and Pedersen (2005) consider the illiquidity measure in Amihud (2002). This is different from the liquidity measure in Pastor and Stambaugh (2003). We therefore count their factor as a separate one.

10: No *p*-values reported for the interactions between market return and option returns.

11: No *p*-values reported for their comoment betas.

12: No *p*-values reported for his distress tracking factor.

13: Gómez, Priestley, and Zapatero (2012) study census division level labor income. However, most of the division level labor income factors have nonsignificant *t*-statistics. We do not count their factors.

14: No *p*-values reported for their factors estimated from the long-run risk model.

15: The paper is replaced by Hou, Xue, and Zhang (2014).

### Appendix A

#### Multiple Testing When the Number of Tests ($M$) Is Unknown

The empirical difficulty in applying standard *p*-value adjustments is that we do not observe factors that have been tried, found to be insignificant and then discarded. We attempt to overcome this difficulty using a simulation framework. The idea is first to simulate the empirical distribution of *p*-values for all experiments (published and unpublished) and then to adjust the *p*-values based on these simulated samples.

First, we assume the test statistic (*t*-statistic, for instance) for any experiment follows a certain distribution $D$ (e.g., exponential distribution) and the set of published works is a truncated $D$ distribution. Based on the estimation framework for truncated distributions,^{46} we estimate the parameters of distribution $D$ and the total number of trials $M$. Next, we simulate many sequences of *p*-values, each corresponding to a plausible set of *p*-value realizations of all trials. To account for the uncertainty in parameter estimates of $D$ and $M$, we simulate the *p*-value sequences based on the distribution of estimated $D$ and $M$. Finally, for each *p*-value, we calculate the adjusted *p*-value based on a sequence of simulated *p*-values. The median is taken as the final adjusted *p*-value.

##### A.1 Using Truncated Exponential Distribution to Model the *t*-statistic Sample

Truncated distributions have been used to study hidden tests (i.e., publication bias) in medical research.^{47} The idea is that studies reporting significant results are more likely to get published. Assuming a threshold significance level or *t*-statistic, researchers can, to some extent, infer the results of unpublished works and gain an understanding of the overall effect of a drug or treatment. However, in medical research, insignificant results are still viewed as an indispensable part of the overall statistical evidence and are given much more prominence than in the financial economics research. As a result, medical publications are more likely to report insignificant results. This makes applying the truncated distribution framework to medical studies difficult as there is no clear-cut threshold value.^{48} In this sense, the truncated distributional framework suits our study better—1.96 is the obvious hurdle that research needs to overcome to be published.

On the other hand, not all tried factors with a $t$-statistic above 1.96 are reported. In the quantitative asset management industry, significant results are not published—they are considered “trade secrets.” For the academic literature, factors with “borderline” *t*-statistics are difficult to get published. Thus, our sample is likely missing a number of factors that have *t*-statistics just over 1.96. To make our inference robust, for our baseline result, we assume all tried factors with *t*-statistics above 2.57 are observed and ignore those with *t*-statistics in the range of (1.96, 2.57). We experiment with alternative ways to handle *t*-statistics in this range.

Many distributions can be used to model the *t*-statistic sample. One restriction that we think any of these distributions should satisfy is the monotonicity of the density curve. Intuitively, it should be easier to find factors with small *t*-statistics than with large ones.^{49} We choose to use the simplest distribution that incorporates this monotonicity condition: the exponential distribution.

Panel A of Figure A.1 presents the histogram of the baseline *t*-statistic sample and the fitted truncated exponential curve.^{50} The fitted density closely tracks the histogram and has a population mean of 2.07.^{51} Panel B is a histogram of the original *t*-statistic sample, which, as we discussed before, is likely to underrepresent the sample with a *t*-statistic in the range of (1.96, 2.57). Panel C is the augmented *t*-statistic sample with the ad hoc assumption that our sample covers only half of all factors with *t*-statistics between 1.96 and 2.57. The population mean estimate is 2.22 in panel B and 1.93 in panel C. As expected, the underrepresentation of relatively small *t*-statistics results in a higher mean estimate for the *t*-statistic population. We think the baseline model is the best among all three models as it not only overcomes the missing data problem for the original sample, but also avoids guessing the fraction of missing observations in the 1.96–2.57 range. We use those model estimates for the follow-up analysis.
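To illustrate how a truncated exponential fit can recover the population mean from published results alone, note that for an exponential distribution the excess over a cutoff $c$ is again exponential with the same mean, so the maximum likelihood estimate of the mean is simply the average excess over $c$. The sketch below runs on simulated data, not on the paper's sample; the function name `fit_truncated_exponential`, the seed, and the simulated population are assumptions made for illustration, and the estimator used in the cited framework may differ in detail.

```python
import random


def fit_truncated_exponential(t_stats, cutoff):
    """MLE of the exponential mean when only values above `cutoff` are observed."""
    observed = [t for t in t_stats if t > cutoff]
    if not observed:
        raise ValueError("no observations above the cutoff")
    # Memorylessness: (t - cutoff) given t > cutoff is exponential with the same mean.
    return sum(observed) / len(observed) - cutoff


# Hypothetical population of tried factors with a true mean t-statistic of 2.07.
rng = random.Random(0)
all_trials = [rng.expovariate(1.0 / 2.07) for _ in range(200_000)]

published = [t for t in all_trials if t > 2.57]      # only these are "observed"
naive_mean = sum(published) / len(published)         # ignores truncation
lam_hat = fit_truncated_exponential(all_trials, cutoff=2.57)

print(f"naive mean of published t-statistics: {naive_mean:.2f}")
print(f"truncation-adjusted estimate of the mean: {lam_hat:.2f}")
```

The naive average of the published *t*-statistics overstates the population mean by exactly the cutoff, which is why the truncation adjustment matters.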

Using the baseline model, we calculate other interesting population characteristics that are key to multiple hypothesis testing. Assuming independence, we model observed *t*-statistics as draws from an exponential distribution with mean parameter $\hat{\lambda}$ and a known cutoff point of 2.57. The proportion of unobserved factors is then estimated as

$$\hat{P} = \Phi(2.57; \hat{\lambda}) = 1 - \exp(-2.57/\hat{\lambda}) = 71.1\%,$$

where $\Phi(c; \lambda)$ is the cumulative distribution function, evaluated at $c$, of an exponential distribution with mean $\lambda$. Our estimates indicate that the mean *t*-statistic for the underlying factor population is 2.07 and about 71.1% of tried factors are discarded. Given that 238 out of the original 316 factors have a *t*-statistic exceeding 2.57, the total number of factor tests is estimated to be 824 ($= 238/(1 - 71.1\%)$) and the number of factors with a *t*-statistic between 1.96 and 2.57 is estimated to be 82.^{52} Since our *t*-statistic sample covers only 57 such factors, roughly 30% ($= (82 - 57)/82$) of *t*-statistics between 1.96 and 2.57 are hidden.
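The arithmetic behind these population estimates can be reproduced in a few lines. This is a minimal sketch that takes the reported mean of 2.07, the cutoff of 2.57, and the counts of 238 and 57 from the text as inputs; the function name `truncated_exponential_summary` is hypothetical.

```python
import math


def truncated_exponential_summary(lam, cutoff, n_above_cutoff, lower=1.96):
    """Population quantities implied by an exponential with mean `lam`."""
    p_unobserved = 1.0 - math.exp(-cutoff / lam)                 # share below the cutoff
    total_trials = n_above_cutoff / (1.0 - p_unobserved)         # implied number of tests
    p_band = math.exp(-lower / lam) - math.exp(-cutoff / lam)    # share in (lower, cutoff)
    n_band = total_trials * p_band
    return p_unobserved, total_trials, n_band


p_unobserved, total_trials, n_band = truncated_exponential_summary(2.07, 2.57, 238)
hidden_share = (n_band - 57) / n_band   # 57 such factors appear in the sample

print(f"proportion unobserved: {p_unobserved:.1%}")
print(f"estimated total tests: {total_trials:.0f}")
print(f"estimated factors with t in (1.96, 2.57): {n_band:.0f}")
print(f"share of those that are hidden: {hidden_share:.0%}")
```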

##### A.2 Simulated Benchmark *t*-statistics under Independence

The truncated exponential distribution framework helps us approximate the distribution of *t*-statistics for all factors, both published and unpublished. We can then apply the aforementioned adjustment techniques to this distribution to generate new *t*-statistic benchmarks. However, there are two sources of sampling and estimation uncertainty that affect our results. First, our *t*-statistic sample may underrepresent all factors with *t*-statistics exceeding 2.57.^{53} Hence, our estimates of the total trials are biased (too low), which affects our calculation of the benchmarks. Second, estimation errors in the truncated exponential distribution can affect our benchmark *t*-statistics. Although we can approximate the estimation error through the usual asymptotic distribution theory for MLE, it is unclear how this error affects our benchmark *t*-statistics. This is because *t*-statistic adjustment procedures usually depend on the entire *t*-statistic distribution and so standard transformational techniques (e.g., the delta method) do not apply. Moreover, we are not sure whether our sample is large enough to trust the accuracy of asymptotic approximations.

Given these concerns, we propose a four-step simulation framework that incorporates these uncertainties.

**Step I.** Estimate $\lambda$ and $M$ based on a new *t*-statistic sample with size $r \times R$. Suppose our current *t*-statistic sample size is $R$ and it covers only a fraction $1/r$ of all factors. We sample $r \times R$ *t*-statistics (with replacement) from the original *t*-statistic sample. Based on this new *t*-statistic sample, we apply the above truncated exponential distribution framework to the *t*-statistics and obtain the parameter estimate $\hat{\lambda}$ for the exponential distribution. The truncation probability is calculated as $\hat{P} = \Phi(2.57; \hat{\lambda})$. We can then estimate the total number of trials by

$$\hat{M} = \frac{rR}{1 - \hat{P}}.$$

**Step II.** Calculate the benchmark *t*-statistic based on a random sample generated from $\hat{\lambda}$ and $\hat{M}$. Based on the previous step estimate of $\hat{\lambda}$ and $\hat{M}$, we generate a random sample of *t*-statistics for all tried factors. We then calculate the appropriate benchmark *t*-statistic based on this generated sample.

**Step III.** Repeat Step II 10,000 times to obtain the median benchmark *t*-statistic. We take the median as the final benchmark *t*-statistic corresponding to the parameter estimate $(\hat{\lambda}, \hat{M})$.

**Step IV.** Repeat Steps I-III 10,000 times to generate a distribution of benchmark *t*-statistics. Repeat Steps I-III 10,000 times, each time with a newly generated *t*-statistic sample as in Step I. For each repetition, we obtain a benchmark *t*-statistic $t_i$ corresponding to the parameter estimates $(\hat{\lambda}_i, \hat{M}_i)$. In the end, we have a collection of benchmark *t*-statistics $\{t_i\}_{i=1}^{10000}$.

To see how our procedure works, notice that Steps II and III calculate the theoretical benchmark *t*-statistic for a *t*-statistic distribution characterized by $(\hat{\lambda}, \hat{M})$. As a result, the outcome is simply one number and there is no uncertainty around it. Uncertainties are incorporated in Steps I and IV. In particular, by repeatedly sampling from the original *t*-statistic sample and re-estimating $\lambda$ and $M$ each time, we take into account the estimation error of the truncated exponential distribution. Also, under the assumption that neglected significant *t*-statistics follow the empirical distribution of our *t*-statistic sample, by varying $r$, we can assess how this underrepresentation of our *t*-statistic sample affects results.
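The four steps can be sketched in code as follows. This is an illustration under stated assumptions, not the paper's implementation: the input is a synthetic sample rather than the actual factor data, *t*-statistics are converted to two-sided *p*-values with a standard normal distribution, the BHY benchmark is taken to be the smallest *t*-statistic that the procedure still rejects, and the repetition counts are cut from 10,000 to 20 to keep the run short. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, median

CUTOFF = 2.57               # truncation point of the baseline model
STD_NORMAL = NormalDist()   # assumption: t-statistics map to p-values via N(0, 1)


def estimate_lambda_and_m(t_sample, cutoff=CUTOFF):
    """Step I (estimation): fit lambda-hat and back out M-hat = rR / (1 - P-hat)."""
    lam = sum(t_sample) / len(t_sample) - cutoff    # MLE for a left-truncated exponential
    p_hat = 1.0 - math.exp(-cutoff / lam)           # P-hat = Phi(2.57; lambda-hat)
    return lam, round(len(t_sample) / (1.0 - p_hat))


def bhy_benchmark(t_stats, alpha):
    """Smallest t-statistic that BHY still rejects (inf if it rejects nothing)."""
    m = len(t_stats)
    c_m = sum(1.0 / j for j in range(1, m + 1))
    benchmark = math.inf
    for k, t in enumerate(sorted(t_stats, reverse=True), start=1):
        p_value = 2.0 * STD_NORMAL.cdf(-abs(t))     # two-sided p-value
        if p_value <= k * alpha / (m * c_m):
            benchmark = t                           # ends at the largest k that passes
    return benchmark


def simulate_benchmarks(t_sample, r=1.0, alpha=0.01, n_outer=20, n_inner=20, seed=0):
    rng = random.Random(seed)
    size = round(r * len(t_sample))
    results = []
    for _ in range(n_outer):                        # Step IV: repeat Steps I-III
        resampled = rng.choices(t_sample, k=size)   # Step I: bootstrap r x R t-statistics
        lam, m_hat = estimate_lambda_and_m(resampled)
        draws = []
        for _ in range(n_inner):                    # Step III: repeat Step II
            population = [rng.expovariate(1.0 / lam) for _ in range(m_hat)]  # Step II
            draws.append(bhy_benchmark(population, alpha))
        results.append(median(draws))
    return results


# Synthetic stand-in for the observed sample: 238 t-statistics above 2.57.
demo_rng = random.Random(42)
synthetic = [CUTOFF + demo_rng.expovariate(1.0 / 2.07) for _ in range(238)]
benchmarks = simulate_benchmarks(synthetic)
print(f"median BHY(1%) benchmark over {len(benchmarks)} repetitions: {median(benchmarks):.2f}")
```

Swapping `bhy_benchmark` for another adjustment procedure leaves the rest of the loop unchanged, which is the main reason for structuring the sketch this way.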

Table A.1 shows estimates of $M$ and benchmark *t*-statistics. When $r=1$, the median estimate for the total number of trials is 817,^{54} almost the same as our previous estimate of 824 based on the original sample. Unsurprisingly, the Bonferroni implied benchmark *t*-statistic (4.01) is larger than 3.78, which is what we obtain when ignoring unpublished works. The Holm implied *t*-statistic (3.96), while not necessarily increasing in the number of trials, is also higher than before (3.64). The BHY implied *t*-statistic increases from 3.39 to 3.68 at 1% significance and from 2.78 to 3.18 at 5% significance. As $r$ increases, the sample size $M$ and the benchmark *t*-statistics for all four types of adjustments increase. When $r$ doubles, the estimate of $M$ also approximately doubles and the Bonferroni and Holm implied *t*-statistics increase by about 0.2, whereas the BHY implied *t*-statistics increase by around 0.03 (under both significance levels).
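For Bonferroni, the benchmark depends on the sample only through the number of tests, so its order of magnitude is easy to check. The sketch below assumes that *t*-statistics are converted to two-sided *p*-values with a standard normal distribution, which may not match the paper's exact conversion; `bonferroni_t` is a hypothetical helper.

```python
from statistics import NormalDist


def bonferroni_t(m_tests, alpha=0.05):
    """Two-sided standard normal critical value at level alpha / m_tests."""
    return NormalDist().inv_cdf(1.0 - alpha / (2.0 * m_tests))


# 316 published factors versus the estimated totals for r = 1 and r = 2.
for m_tests in (316, 817, 1646):
    print(m_tests, round(bonferroni_t(m_tests), 2))
```

Because the critical value grows only with the logarithm of the number of tests, doubling $M$ moves the Bonferroni hurdle by a modest amount.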

Sampling ratio ($r$) | $M$ [10%, 90%] | Bonferroni [10%, 90%] | Holm [10%, 90%] | BHY(1%) [10%, 90%] | BHY(5%) [10%, 90%] |
---|---|---|---|---|---|
1 | 817 [731, 947] | 4.01 [3.98, 4.04] | 3.96 [3.92, 4.00] | 3.68 [3.63, 3.74] | 3.17 [3.12, 3.24] |
1.5 | 1234 [1128, 1358] | 4.11 [4.08, 4.13] | 4.06 [4.03, 4.09] | 3.70 [3.66, 3.74] | 3.20 [3.16, 3.24] |
2 | 1646 [1531, 1786] | 4.17 [4.15, 4.19] | 4.13 [4.11, 4.15] | 3.71 [3.67, 3.75] | 3.21 [3.18, 3.25] |

The estimated total number of factors tried ($M$) and the benchmark *t*-statistic percentiles based on a truncated exponential distribution framework. Our estimation is based on the original *t*-statistic sample truncated at 2.57. The sampling ratio is the assumed ratio of the true population size of *t*-statistics exceeding 2.57 over our current sample size. Both Bonferroni and Holm have a significance level of 5%.

### Appendix B

#### A Bayesian Approach to Multiple Tests

The following framework is adopted from Scott and Berger (2006) and highlights the key issues in Bayesian multiple hypothesis testing.^{55} More sophisticated generalizations modify the basic model but are unlikely to change the fundamental hierarchical testing structure.^{56} We use this framework to explain the pros and cons of performing multiple testing in a Bayesian framework.

The hierarchical model is as follows:

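**H1.** $(X_i \mid \mu_i, \sigma^2, \gamma_i) \overset{iid}{\sim} N(\gamma_i \mu_i, \sigma^2)$,

**H2.** $\mu_i \mid \tau^2 \overset{iid}{\sim} N(0, \tau^2)$, $\gamma_i \mid p_0 \overset{iid}{\sim} \mathrm{Ber}(1 - p_0)$,

**H3.** $(\tau^2, \sigma^2) \sim \pi_1(\tau^2, \sigma^2)$, $p_0 \sim \pi_2(p_0)$.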

Under this framework, the joint conditional likelihood function for the $X_i$'s is simply a product of individual normal likelihood functions, and the posterior probability that $\gamma_i = 1$ (discovery) can be calculated by applying Bayes' law. When the number of trials is large, calculating the posterior probability involves high-dimensional integrals and requires efficient methods, such as importance sampling. We explain each step and the notation in detail below.

**H1.** $X_i$ denotes the average return generated from a long-short trading strategy based on a certain factor; $\mu_i$ is the unknown mean return; $\sigma^2$ is the common variance for returns; and $\gamma_i$ is an indicator function, with $\gamma_i = 0$ indicating a zero factor mean. $\gamma_i$ is the counterpart of the reject/accept decision in the usual (frequentists') hypothesis testing framework.

H1 therefore says that factor returns are independent conditional on mean $\gamma_i \mu_i$ and common variance $\sigma^2$, with $\gamma_i = 0$ indicating that the factor is spurious. The common variance assumption may look restrictive, but we can always scale factor returns by changing the dollar investment in the long-short strategy. The crucial assumption is conditional independence of average strategy returns. A certain form of conditional independence is unavoidable for Bayesian hierarchical modeling,^{57} yet it is probably unrealistic for our application. We can easily think of scenarios in which average returns of different strategies are correlated, even when population means are known. For example, it is well known that two of the most popular factors, the Fama and French (1992) HML and SMB, are correlated.

**H2.** The first-step population parameters $\mu_i$'s and $\gamma_i$'s are assumed to be generated from two other parametric distributions: $\mu_i$'s are independently generated from a normal distribution, and $\gamma_i$'s are simply generated from a Bernoulli distribution, that is, $\gamma_i = 0$ with probability $p_0$.

The normality assumption for the $\mu_i$'s requires the reported $X_i$'s to randomly represent either long/short or short/long strategy returns. If researchers have a tendency to report positive abnormal returns, we need to randomly assign plus/minus signs to these returns. The normality assumptions in both H1 and H2 are important as they are necessary to guarantee the properness of the posterior distributions.

**H3.** Finally, the two variance variables $\tau^2$ and $\sigma^2$ follow a joint prior distribution $\pi_1$ and the probability $p_0$ follows a prior distribution $\pi_2$.
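As a worked illustration of the Bayes' law step, consider the simplified case in which the hyperparameters $(\tau^2, \sigma^2, p_0)$ are treated as known. Integrating out $\mu_i$ in H1 and H2 gives a marginal distribution for $X_i$ of $N(0, \sigma^2)$ when $\gamma_i = 0$ and $N(0, \sigma^2 + \tau^2)$ when $\gamma_i = 1$. Writing $\phi(x; v)$ for the $N(0, v)$ density (notation introduced here for illustration), the conditional discovery probability is

$$P(\gamma_i = 1 \mid X_i, \tau^2, \sigma^2, p_0) = \frac{(1 - p_0)\,\phi(X_i; \sigma^2 + \tau^2)}{(1 - p_0)\,\phi(X_i; \sigma^2 + \tau^2) + p_0\,\phi(X_i; \sigma^2)}.$$

Under the full hierarchical model, this expression must still be averaged over the posterior distribution of $(\tau^2, \sigma^2, p_0)$ implied by H3 and all of the data, which is where the computational burden arises.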

One benefit of a Bayesian framework for multiple testing is that the multiplicity penalty term is already embedded. In the frequentists' framework, this is done by introducing FWER or FDR. In a Bayesian framework, the so-called “Ockham's razor effect”^{58} automatically adjusts the posterior probabilities when more factors are simultaneously tested.^{59} Simulation studies in Scott and Berger (2006) show how the discovery probabilities for a few initial signals decrease when more noise is added to the original sample.
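The multiplicity penalty can be seen in a stripped-down version of the hierarchical model. The sketch below is not the Scott and Berger (2006) simulation: it holds $\sigma^2 = 1$ and $\tau^2 = 9$ fixed, places a uniform prior on $p_0$, integrates over $p_0$ on a grid, and uses standard normal quantiles as a deterministic stand-in for noise observations. All of these choices, along with the function names and the three signal values, are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def normal_pdf(x, variance):
    """Density of N(0, variance) at x."""
    return math.exp(-x * x / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def discovery_probabilities(xs, sigma2=1.0, tau2=9.0, grid_size=999):
    """P(gamma_i = 1 | all data) under a uniform prior on p0 (sigma2, tau2 fixed)."""
    f0 = [normal_pdf(x, sigma2) for x in xs]          # marginal density if gamma_i = 0
    f1 = [normal_pdf(x, sigma2 + tau2) for x in xs]   # marginal density if gamma_i = 1
    grid = [(k + 0.5) / grid_size for k in range(grid_size)]  # p0 midpoints in (0, 1)

    # Posterior weight of each p0 value: product of the mixture likelihoods.
    log_lik = [sum(math.log(p0 * a + (1.0 - p0) * b) for a, b in zip(f0, f1))
               for p0 in grid]
    peak = max(log_lik)
    weights = [math.exp(v - peak) for v in log_lik]
    total = sum(weights)

    probabilities = []
    for a, b in zip(f0, f1):
        # Bayes' law for one observation at each p0, averaged over the p0 posterior.
        conditional = [(1.0 - p0) * b / (p0 * a + (1.0 - p0) * b) for p0 in grid]
        probabilities.append(sum(w * c for w, c in zip(weights, conditional)) / total)
    return probabilities


def noise_sample(n):
    """Deterministic stand-in for n pure-noise draws: N(0, 1) quantiles."""
    return [NormalDist().inv_cdf((j + 0.5) / n) for j in range(n)]


signals = [3.0, 3.5, 4.0]
results = {}
for n_noise in (0, 10, 100):
    probs = discovery_probabilities(signals + noise_sample(n_noise))
    results[n_noise] = probs[:len(signals)]
    print(n_noise, [round(p, 3) for p in results[n_noise]])
```

As noise observations are added, the data push the posterior for $p_0$ toward one, and the discovery probabilities of the same three signals fall even though the signals themselves are unchanged.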

However, there are several shortcomings for the Bayesian approach. Some of them are specific to the context of our application and the others are generic to the Bayesian multiple testing framework.

At least two issues arise when applying the Bayesian approach to our factor selection problem. First, we do not observe all tried factors. While we back out the distribution of hidden factors parametrically under the frequentist framework, it is not clear how the missing data and the multiple testing problems can be simultaneously solved under the Bayesian framework. Second, the hierarchical testing framework may be overly restrictive. Both independence and normality assumptions can have a large impact on the posterior distributions. Although normality can be somewhat relaxed by using alternative distributions, the scope of alternative distributions is limited as there are only a few distributions that can guarantee the properness of the posterior distributions. Independence, as we previously discussed, is likely to be violated in our context. In contrast, the three adjustment procedures under the frequentists' framework are able to handle complex data structures since they rely on only fundamental probability inequalities to restrict their objective function—the type I error rate.

There are a few general concerns about the Bayesian multiple testing framework. First, it is not clear what to do after obtaining the posterior probabilities for individual hypotheses. Presumably, we should find a cutoff probability $P$ and reject all hypotheses that have a posterior discovery probability larger than $P$. But then we return to the initial problem of finding an appropriate cutoff *p*-value, which is not a clear task. Scott and Berger (2006) suggest a decision-theoretic approach that chooses the cutoff $P$ by minimizing a loss function. The parameters of the loss function, however, are again subjective. Second, the Bayesian posterior distributions are computationally challenging. We document 300 factors, but there are potentially many more if missing factors are taken into account. When $M$ gets large, importance sampling is a necessity. However, results of importance sampling rely on simulations and subjective choices of the centers of the probability distributions for random variables. Consequently, two researchers trying to calculate the same quantity might obtain very different results. Moreover, in multiple testing, the curse of dimensionality generates additional risks for Bayesian statistical inference.^{60} These technical issues create additional hurdles for the application of the Bayesian approach.

We would like to thank the Editor (Andrew Karolyi) and three anonymous referees for their detailed and thoughtful comments. We would also like to thank Viral Acharya, Jawad Addoum, Tobias Adrian, Andrew Ang, Ravi Bansal, Mehmet Beceren, Itzhak Ben-David, Bernard Black, Jules van Binsbergen, Oliver Boguth, Tim Bollerslev, Alon Brav, Ian Dew-Becker, Robert Dittmar, Jennifer Conrad, Michael Cooper, Andres Donangelo, Gene Fama, Wayne Ferson, Ken French, Simon Gervais, Bing Han, John Hand, Abby Yeon Kyeong Kim, Lars-Alexander Kuehn, Sophia Li, Harry Markowitz, Kyle Matoba, David McLean, Marcelo Ochoa, Peter Park, Lubos Pastor, Andrew Patton, Lasse Pedersen, Tapio Pekkala, Jeff Pontiff, Ryan Pratt, Tarun Ramadorai, Alexandru Rosoiu, Tim Simin, Avanidhar Subrahmanyam, Ivo Welch, Basil Williams, Yuhang Xing, Josef Zechner, and Xiaofei Zhao, as well as seminar participants at the 2014 New Frontiers in Finance Conference at Vanderbilt University, the 2014 Inquire Europe-UK meeting in Vienna, the 2014 WFA meetings, and seminars at Duke University, Texas A&M University, Baylor University, University of Utah, and Penn State University. Our data are available for download and resorting. The main table includes full citations and Web addresses to each of the cited articles. See http://faculty.fuqua.duke.edu/~charvey/Factor-List.xlsx. Supplementary data can be found on *The Review of Financial Studies* web site.

## References
