Field experiments provide a useful way to address a number of important issues in environmental and resource economics. This article provides a review of studies that have used field experiments to inform (1) benefit–cost analysis and (2) efforts to promote resource conservation. In these areas, scholars have used field experiments to test existing theories, inform the development of new theories, and guide policymakers. After summarizing these contributions, we discuss new directions for the use of field experiments in environmental and resource economics.
The field of environmental and resource economics has never been short on empirical questions. Thus it should come as no surprise that controlled experimentation has been used by many environmental and resource economists as a way to uncover causal relationships and develop policy-relevant benefit–cost estimates. Until recently, studies using such methods have relied primarily on an analysis of data from either laboratory subjects or a natural experiment. 1 In the last decade, however, the use of field experimentation has become more prominent in environmental economics. This article presents an overview of this growing literature, with an emphasis on studies that aim to inform (1) benefit–cost analysis and (2) efforts to promote resource conservation.
We want to emphasize at the outset that this article does not provide a comprehensive review of the voluminous literature on controlled experimentation. Rather, we focus on studies that have utilized field experiments to inform the design and evaluation of environmental policy, particularly as viewed through the lens of the individual agent or consumer. Furthermore, we discuss a limited number of studies on these topics, those that illustrate what we view to be a distinguishing feature of field experiments: the ability to draw causal inferences about behavior in naturally occurring settings with self-selected agents that vary in both experience and familiarity with the underlying choice setting.
The next section provides an overview of what field experiments are, along with a brief discussion of the potential challenges of running field experiments. We then provide an overview of field experiments that inform benefit–cost analysis. Specifically, we first review field experiments that address the valuation of nonmarket goods and various methods that can be used to align hypothetical statements of values with “real” values for the good in question. We next turn to field experiments that focus on the disparity between measures of willingness to accept (WTA) and willingness to pay (WTP). These topics are important for environmental economists since they provide evidence on the stability and consistency of preferences, a critical assumption that underlies all existing methods for estimating the value of non-market goods and services.
Next we review the rapidly growing literature that explores the effectiveness of dynamic pricing plans and nonpecuniary strategies (e.g., normative appeals, targeted information) in managing the consumption of energy and water. For researchers, this literature is of interest because it fosters a deeper understanding of the individual behaviors that generate public goods (bads). For policymakers, these studies demonstrate how to use insights from behavioral economics to promote policy goals. We conclude with a summary of the lessons learned from this review of the literature and a discussion of future directions for the use of field experiments in environmental and resource economics.
What Is a Field Experiment?
A fundamental challenge for researchers when they are attempting to estimate the causal effect of some action or policy is the construction of the correct counterfactual. Because the action or policy of interest is either taken or not, the researcher is unable to observe what would have happened in the absence of the policy or what would have happened if another action had been taken instead. However, it is possible to observe outcomes for agents with similar characteristics—“like” others or a control group—whose choices were not affected by the policy of interest. Field experiments build upon the experimental model of the physical sciences in order to create valid control groups. Specifically, field experiments rely on randomization within a naturally occurring setting as a way to create an instrumental variable to facilitate causal identification. 2
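The logic of randomization can be illustrated with a minimal simulation sketch. Everything in it is hypothetical: the number of households, the baseline usage distribution, and the assumed 2 percent treatment effect are invented for illustration and are not taken from any study discussed in this article. Because assignment is random, the control group's mean outcome stands in for the unobserved counterfactual of the treated group, and a simple difference in means recovers the assumed effect up to sampling noise.

```python
import random
import statistics


def simulate_experiment(n_households=10_000, true_effect=-0.02, seed=0):
    """Randomly assign households to treatment or control.

    Returns the difference in mean usage (treated minus control) and the
    two group sizes. All parameter values are hypothetical.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n_households):
        baseline = rng.gauss(30.0, 5.0)  # hypothetical daily use in kWh
        if rng.random() < 0.5:           # coin-flip assignment
            treated.append(baseline * (1 + true_effect))
        else:
            control.append(baseline)
    diff = statistics.mean(treated) - statistics.mean(control)
    return diff, len(treated), len(control)


diff, n_treated, n_control = simulate_experiment()
print(f"treated={n_treated}, control={n_control}, difference={diff:.3f} kWh")
```

With a mean baseline of 30 kWh, the assumed effect corresponds to a difference of roughly -0.6 kWh per day; the estimate will differ from that value only by sampling error.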
The Various Types of Field Experiments
Harrison and List (2004) classify field experiments into three categories: artefactual, framed, and natural. An artefactual field experiment mimics a laboratory experiment except that it uses “nonstandard” subjects such as participants from the market of interest. Early contributions to this category in the field of environmental economics include the seminal study by Bohm (1972), which examines whether stated WTP for a sneak preview of a Swedish television show differs when the choice is purely hypothetical versus when the choice is real and requires actual payment.
The second category, the framed field experiment, incorporates important elements of the naturally occurring environment with respect to the commodity, task, stakes, and information set of the subjects (see Harrison and List 2004). In such experiments, subjects often know about the randomization and/or are aware of the study via a survey that is used to generate information for policy purposes. Prominent examples of framed field experiments include social experiments and randomized controlled trials in the area of development economics.
The third category—natural field experiments—refers to experiments conducted in environments where subjects naturally undertake the desired task and are not aware that they are participants in an experiment. This means that they know neither that they are being randomized into treatment nor that their subsequent behavior is being scrutinized. 3
Additional Considerations and Caveats
Given the differences between field experiments and other empirical methods, it is important to note some challenges that may arise when conducting field experiments. First, compared with laboratory experiments, field experiments are difficult to replicate. Thus a fundamental advantage of laboratory experiments over field experiments is the ability of others to reproduce the study and independently verify its results.
There are three distinct approaches to replication: (1) taking the actual data generated by an experiment and reanalyzing the data, (2) running an experiment that follows a similar protocol but employs a new subject pool, and (3) testing the hypotheses of the study using a new research design (List and Rasul 2011). Laboratory experiments can be replicated using all three approaches. While the same is true for many artefactual and framed field experiments, the second type of replication is much more difficult to do for natural field experiments because such experiments often require the cooperation of outside entities, which may make it difficult to rerun the original experiment.
A second, related, challenge for field experiments is the “external validity” of experiments that are designed around the evaluation of a particular policy (e.g., a targeted messaging campaign designed to encourage households to sign up for an in-home energy audit). Although studies based on such field experiments are relatively easy for policymakers to understand, the simplicity of the approach, whereby individuals are randomly assigned to either a treatment group (i.e., a group that receives the targeted message) or a control group (i.e., a group that does not receive a message), often comes at a cost. This is because empirical studies that are designed to identify reduced-form causal effects (i.e., the effect of the targeted message campaign on signup rates) often provide little information on why the program works or whether it influences behavior over a longer time horizon. This limits the extent to which one can use the results from any given study to predict the impact of similar programs on other individuals or groups, or to predict changes in behavior over time. For example, Allcott (2015) finds evidence of a “partner selection bias,” which indicates that sites/firms willing to partner with researchers to implement large-scale field experiments may not represent a random sample of the relevant population. More specifically, Allcott (2015) shows that predictions that used data from a series of early field experiments designed to promote energy conservation substantially overstated the actual effects of later replications of this program.
Third, it is important to note that in order to run a field experiment, the researcher often needs to identify a partner who is willing to implement the proposed design and provide the data needed to estimate the causal effects of interest. Unfortunately, many researchers do not have the personal connections necessary to form such partnerships. Similarly, in many cases, the nature of the research question is not amenable to field experimentation (e.g., the effect of a new regulatory policy on the entry and exit decisions of firms in affected industries). In such cases, the controlled environment of the laboratory is a better starting point for inquiry.
With this background on the different types of field experiments and practical challenges that arise when implementing field experiments and/or interpreting the results from such studies, we next provide a review of field experiments that are designed to inform benefit–cost analysis. In doing so, we first discuss a body of literature that examines the valuation of nonmarket goods and various methods to align hypothetical statements of values with “real” values for the good in question. We then turn to field experiments that focus on the disparity between measures of WTA and WTP.
Using Field Experiments to Inform Benefit–Cost Analysis
Public policy decision-making is often based on a comparison of the benefits and costs associated with proposed regulations. Such an analysis requires that researchers are able to accurately estimate the total value of the goods and services affected by the proposed regulation. For commodities traded in the marketplace, prices provide a direct signal of value, thus making the task of valuation straightforward. The task becomes more challenging when it is the total value of nonmarket goods and services (such as improved air or water quality) that is being estimated. In such cases, policymakers often rely on stated preference methods to provide signals of value. 4
While stated preference methods are the only option for researchers to estimate nonuse or existence values and hence recover the total value of nonmarket goods, critics argue that such estimates are unreliable because the hypothetical nature of the approach allows respondents to distort their statements of value without any penalty. Understanding whether and why people distort their preferences when responding to hypothetical questions remains a fundamental challenge for environmental economists. Fortunately, there is an extensive literature that examines the nature and extent of such hypothetical bias, as well as methodologies to address this tendency and align hypothetical statements of value with respondents’ “true” preferences. 5 In the remainder of this section we review this literature and summarize the major lessons for policymakers and those relying on hypothetical choices to estimate nonmarket values.
The Impact of Cheap Talk
Much of the early work on hypothetical bias focused on ways to design contingent valuation (CV) questions that induce subjects to respond as if their choices involved actual payments. 6 Cummings, Harrison, and Osborne (1995) and Cummings and Taylor (1999) present evidence for one such promising approach, an ex ante design they refer to as a “cheap talk” scheme. The underlying premise of the cheap talk design is to include the issue of hypothetical bias as an integral part of survey questionnaires. The “scripts” for such questionnaires describe hypothetical bias, note its prevalence in surveys, and discuss underlying reasons for its occurrence. Moreover, the script asks subjects to consider this problem and adjust their responses to the valuation questions accordingly.
To illustrate the robustness of these findings, we discuss several applications of the cheap talk design approach in a field setting. List (2001) conducted the earliest such study in a framed field experiment that compared bids (WTP) for a 1982 Topps Traded Cal Ripken, Jr. baseball card across three different treatments: (1) a hypothetical second-price auction, (2) a hypothetical second-price auction that included a cheap talk script, and (3) an actual second-price auction. 7 The field experiment was conducted on the floor of a sports card show and examined the behavior of actual market participants—professional sports card dealers or ordinary consumers. For the sample of nondealers, List (2001) found that average bids in the hypothetical treatment exceeded those submitted by individuals in both the actual and cheap talk treatments. However, there was no difference in average bids for individuals in the actual and cheap talk treatments, suggesting that the cheap talk script worked and provided a way to overcome hypothetical bias.
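For readers unfamiliar with the auction format used in these treatments, the following minimal sketch shows how a sealed-bid second-price auction allocates the good and why bidding one's true value is a sensible strategy. The bidder names, dollar amounts, and tie-breaking rules are assumptions made for illustration; they are not data or procedures from List (2001).

```python
def second_price_auction(bids):
    """Return (winner, price) for a sealed-bid second-price auction.

    bids maps a bidder name to a bid. The highest bidder wins and pays
    the second-highest bid. Ties go to the earliest-listed bidder, an
    arbitrary rule assumed here.
    """
    if len(bids) < 2:
        raise ValueError("need at least two bidders")
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]
    return winner, price


def bidder_payoff(value, own_bid, highest_rival_bid):
    """Payoff to one bidder: value minus price if she wins, else zero.

    A tie is treated as a loss (an assumed tie rule).
    """
    if own_bid > highest_rival_bid:
        return value - highest_rival_bid
    return 0


# Hypothetical dollar bids.
winner, price = second_price_auction({"A": 120, "B": 95, "C": 60})
print(winner, price)  # A 95

# For a bidder who values the card at 100, bidding exactly 100 never
# does worse than any other bid on this grid, whatever the rivals bid.
value = 100
for rival in range(0, 201, 5):
    truthful = bidder_payoff(value, value, rival)
    assert all(truthful >= bidder_payoff(value, b, rival)
               for b in range(0, 201, 5))
```

Because the price paid does not depend on the winner's own bid, shading or inflating a bid can only change whether the bidder wins, never what she pays, which is why actual bids in this format are commonly interpreted as statements of WTP.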
Carlsson, Frykblom, and Lagerkvist (2005) extend this line of inquiry using a framed field experiment to examine the impact of cheap talk on responses in a choice experiment concerning the purchase of two goods (chicken and ground beef) for a sample of Swedish adults. Under this approach, hypothetical bias can occur at two levels: (1) the decision to purchase and (2) the extent to which the purchase decision is affected by changes in various attributes of the good. Carlsson, Frykblom, and Lagerkvist (2005) find evidence suggesting that cheap talk affects how respondents react to changes in the attributes of the goods and the estimated marginal values. List, Sinha, and Taylor (2006) report similar results, finding little difference in estimated marginal values between respondents randomly assigned a cheap talk script and respondents facing an actual purchase decision.
Despite this evidence, several studies find differences in the impact of cheap talk scripts across various segments of the population (see, e.g., List 2001; Lusk 2003; Aadland and Caplan 2003, 2006; Blumenschein et al. 2008). Specifically, these studies find that cheap talk is only effective when examining the choices of inexperienced subjects or those who are unfamiliar with the good they are being asked to value. Thus, overall, the literature provides mixed evidence concerning the effectiveness of cheap talk as a means to overcome hypothetical bias.
Nevertheless, the results of these studies suggest the critical importance of two factors for the success of cheap talk scripts: (1) the respondents’ familiarity with the good being valued and (2) the information content and length of the cheap talk script. In particular, cheap talk scripts appear to be more effective when respondents are unfamiliar with the good being valued and the researcher can provide information on both the expected direction and magnitude of hypothetical bias. Hence, we believe this evidence can be used to guide researchers about when they should and should not use cheap talk scripts.
The Role of Consequentialism
Although cheap talk has garnered much of the attention in the literature, researchers have also examined other methods aimed at overcoming hypothetical bias. Carson, Groves, and Machina (2000) provide a theoretical model of “consequential” survey designs—surveys that have some probability of influencing public policy—and show that such designs should induce respondents to truthfully reveal their preferences. The intuition for this approach is that if respondents believe that their responses have the potential to influence policy, then they have no incentive to distort their behavior and provide statements of value that do not reflect their “true” preferences.
Cummings and Taylor (1998) use a framed field experiment that provides the first test of a “consequential” survey design by varying the probability that a referendum, if passed, would be binding and require the subjects to pay for the production of a brochure (i.e., make an actual financial commitment). 8 They find that at lower probability levels (P ≤ 0.50), respondents are significantly more likely to vote “yes” than their counterparts voting in a binding referendum (i.e., a referendum that, if passed, would require subjects to pay for the production of the brochure). However, at a higher probability level (P = 0.75), the respondents’ voting behavior is indistinguishable from what is observed in the binding referendum.
Landry and List (2007) extend this literature by comparing statements of value elicited using a consequential design with those elicited using a cheap talk script. The results from this framed field experiment suggest that the two methods have similar effects on voting behavior, that is, there is no discernable difference between the two approaches in the proportion of “yes” votes. Although the results from these studies provide mixed evidence for the theory of consequentialism, they do suggest that “consequential” designs may provide a viable alternative to the use of a cheap talk script, particularly when the perceived degree of consequentiality is high.
An important prediction from Carson, Groves, and Machina (2000) is the invariance result, which suggests that statements of value should be unaffected by the degree of perceived consequentiality; that is, subjects should truthfully reveal their preferences whenever there is any probability that their response will have real economic consequences. A number of more recent studies (e.g., Bulte et al. 2005; Carson, Groves, and List 2006; Herriges et al. 2010; Vossler, Doyon, and Rondeau 2012) find empirical support for the invariance result. For example, using data from the 2005 Iowa Lakes Survey, Herriges et al. (2010) examine the causal impact of the perceived degree of consequentiality on WTP for improved water quality at a lake. In this natural field experiment, a random subset of individuals was informed that results from previous Iowa Lakes surveys had influenced policy decisions at the state level. As this information was positively correlated with the perceived consequentiality of the survey, the authors were able to estimate the “causal” impact of consequentiality on WTP. Their results are consistent with the invariance result: there is no difference in the distribution of WTP between those who report the survey to be minimally consequential and those who report higher levels of consequentiality, but there is a significant difference between those who report the survey to be minimally consequential and those who report that the survey is irrelevant for policy purposes.
Overall, this literature on consequentialism suggests that individuals respond to incentives when formulating statements of value. When surveys are perceived to be consequential, respondents appear to truthfully reveal preferences. However, when the response to a survey question is perceived to be inconsequential or has the possibility of affecting an outside option, respondents may strategically distort statements of value or fail to apply the necessary cognitive resources to make a careful calculation of such values.
The Impact of the Mode of Elicitation and Interviewer Characteristics
We next discuss a body of literature that explores how the mode of elicitation (in-person interviews where decisions are directly observed by others versus less interpersonal methods where choices are made in anonymity) affects statements of value. The National Oceanic and Atmospheric Administration panel on CV recommends in-person interviews as opposed to less interpersonal modes of elicitation, such as phone or mail surveys, when implementing CV studies (Arrow et al. 1993). While such an approach undoubtedly has benefits, there is ample evidence that individuals are more cooperative when interacting with those from a similar social group (e.g., Devine 1989; Fershtman and Gneezy 2001; Andreoni and Petrie 2008). It is also well documented that respondents may distort answers to survey questions in order to please the interviewer or maintain consistency with societal norms (e.g., Atkin and Chaffee 1972–1973; Campbell 1981; Cotter, Cohen, and Coulter 1982; Finkel, Guterbock, and Borg 1991; Fisher 1993; Davis 1997; Krosnick 1999). Thus it is important to recognize that respondents in CV studies may be influenced by both the presence and characteristics of the surveyor.
List et al. (2004) explore this issue by examining whether WTP estimates are affected by the degree of anonymity. In their framed field experiment, subjects were randomly assigned to six different treatment groups and asked to vote on whether to contribute $20 to provide startup capital for the Center for Environmental Policy Analysis at the University of Central Florida. The treatments varied choice along two dimensions: (1) whether or not the subjects’ responses were observed by others and (2) whether decisions were hypothetical or had real economic consequence. The results suggest that estimated WTP depends on the degree of anonymity and that such effects are similar in magnitude to differences in WTP between the hypothetical and real treatments.
Alpizar, Carlsson, and Johansson-Stenman (2008) continue this line of inquiry in a natural field experiment examining the effect of anonymity on charitable donations in support of Poas National Park (PNP) in Costa Rica. International visitors to the park were asked to complete an exit interview about their experience and to make a donation to support PNP. Experimental treatments varied whether contributions were made anonymously and placed in a ballot box or registered and observed by an interviewer. The results reveal that average donations were approximately 25 percent higher when made in front of an interviewer, highlighting the influence of an observant “other” on revealed preferences.
A related line of inquiry examines interviewer effects on estimated WTP using CV or related survey-based methods. For example, Leggett et al. (2003) assess interviewer bias in face-to-face versus self-administered surveys of visitors to Fort Sumter National Monument. They find that the estimated WTP for a fort visit is approximately 23 to 29 percent lower when the survey is self-administered than when it is conducted through an in-person interview. A more recent set of framed field experiments examines interviewer effects by systematically varying interviewer characteristics and exploring how these characteristics affect estimated values of WTP (e.g., Bateman and Mawby 2004; Loureiro and Lotade 2005; Gong and Aadland 2009). Results from these studies suggest that characteristics such as the race, gender, and attire of an interviewer can have significant impacts on estimated WTP.
Overall, this literature suggests that statements of value are sensitive to both the mode of elicitation and the characteristics of those eliciting the values. Fortunately, there are a number of ways to control for and mitigate these effects. For example, one can use variation in both the mode of elicitation and the observable characteristics of the interviewer to ex post estimate the effects of such characteristics on WTP and derive an estimate of WTP that nets out the influence of such factors. Alternatively, one can attempt ex ante to minimize these effects by using a cheap talk script or consequential survey design.
Differences Between WTA and WTP
It has been more than four decades since researchers discovered that the WTP and WTA measures of value differed starkly (see, e.g., Hammack and Brown 1974). This finding stimulated a series of laboratory experiments designed to explore the role of endowment on both rates of trade and estimates of WTP and WTA elicited using CV-style methods (e.g., Knetsch 1989; Kahneman, Knetsch, and Thaler 1990; Bateman et al. 1997). In the trade experiments, subjects would be randomly provided one of two goods (e.g., a coffee mug or candy bar) and asked if they would like to exchange the good they were provided for the other good. Results from these studies indicate that rates of trade and the proportion of individuals who wind up with any given good depend on endowment. That is, the proportion of individuals who wind up with a coffee mug is significantly greater when individuals are initially provided the coffee mug than when they are initially provided the candy bar, a finding at odds with predictions of the standard economic model.
In the WTP/WTA experiments, subjects were randomly assigned the role of a seller who was provided a good (coffee mug) or a buyer who was provided cash. Sellers were then asked how much they would be willing to accept to sell the good they were provided, and buyers were asked how much they would be willing to pay to acquire the good. Results from these studies suggest that reservation values for sellers (WTA) exceed reservation values for buyers (WTP). Taken together, the results from these early experiments call into question one of the fundamental assumptions of neoclassical models: that preferences (and hence value) for any good are independent of endowment. That is, a person should not value a good more once they “own” it than when they are looking to acquire it.
Environmental economics may be the branch of economics most affected by this research as the WTA/WTP “disparity” calls into question the legitimacy of all existing methods used to estimate values for nonmarket goods and services. More fundamentally, when the losses associated with reductions in the quality of a nonmarket good such as air quality are greater than the gains associated with an equivalent improvement in air quality, the decision concerning how to assign property rights (e.g., do individuals have the right to clean air or do firms have the right to pollute) when evaluating proposed regulations and assessing environmental damages can have a dramatic impact (Knetsch 1990).
In the discussion that follows, we review findings from a series of framed field experiments (List 2003, 2004a, 2004b) designed to explore whether the disparity between WTA and WTP reflects an inherent behavioral trait or the actions of inexperienced agents who are unfamiliar with the good being traded. 9 These studies build upon the early laboratory experiments and can be divided into four categories: (1) experiments that examine trading patterns of “familiar” goods, (2) experiments that examine trading patterns of “unfamiliar” goods, (3) experiments that examine bidding patterns for “familiar” goods, and (4) experiments that examine bidding patterns for “unfamiliar” goods.
In the first category (the “familiar” goods trading experiments), subjects were randomly provided a piece of sports memorabilia and offered the opportunity to exchange it for a different piece of sports memorabilia. In these experiments, subjects were recruited from the floor of a sports memorabilia show and thus were familiar with both the trading environment and the goods available for trade. Observed rates of trade were found to depend upon the initial endowment, but this evidence of an endowment effect was driven entirely by the actions of inexperienced agents. For the sample of professional dealers and experienced nondealers, trading rates and final holdings were found to be independent of initial endowment. Results from these experiments thus suggest an important caveat on the earlier literature exploring the WTP/WTA disparity: institutions and experience matter.
Do these results hold when the good is unfamiliar? To address this question, subjects were endowed with an “unfamiliar” good—either a mug or a candy bar. 10 As with the “familiar” good experiments, both rates of trade and final holdings were found to be independent of initial endowment for professional dealers and experienced nondealers. Results from these experiments again show that market experience reduces the importance of endowment.
In the bidding treatments, WTP and WTA measures were elicited using auction formats where the best strategy for an individual is to submit a bid that is equal to his/her value (WTP or WTA) for the good. For both familiar and unfamiliar goods, the experimental results suggest that individual behavior converges to the neoclassical prediction that estimates of WTA equal estimates of WTP as trading experience intensifies. For inexperienced agents, statements of value were found to depend on endowment, with statements of WTA exceeding statements of WTP. However, data from these experiments show no difference in WTA and WTP estimates among their experienced counterparts. When WTA and WTP measures are evaluated separately, the results suggest a potential channel through which experience affects the disparity in values—experienced agents are more likely to give up (sell) goods that they own and to require less compensation to do so. In particular, while there is no difference in WTP across experienced and inexperienced subjects, more experienced subjects state significantly lower WTA figures than their inexperienced counterparts.
To summarize, these trading and bidding experiments suggest that the main effect of endowment may be to enhance not the appeal of the good one owns, but rather the “pain” of giving it up (Loewenstein and Kahneman 1991). That is, ex ante, agents may overestimate the cost they will incur from giving up a good (and so state a high WTA). However, through actual market interactions, agents may realize that the pain associated with a loss is not as great as initially imagined and learn to take advantage of arbitrage opportunities.
As a whole, the literature exploring the disparity in WTA and WTP suggests that concerns regarding the stability and consistency of preferences may not be as serious as some have argued. When the behavior of a population of agents familiar with the underlying trading institution is examined, choices converge to neoclassical benchmarks (i.e., rates of trade and statements of value are independent of initial endowment), particularly as market experience intensifies. For researchers and policymakers alike, these results underscore the importance of institutions and experience when considering the WTP/WTA disparity. 11
Promoting Conservation Efforts
This section provides a review of the growing literature that explores the effectiveness of dynamic pricing plans and behavioral “nudges” such as normative appeals or tailored information on residential energy and water use. In doing so, we provide evidence from a series of field experiments that show that both dynamic pricing plans and behavioral “nudges” are effective strategies for promoting desired reductions in use.
Norms and Social Comparisons
In his social comparison theory, Festinger (1954) argues that individuals validate the appropriateness of an action through comparisons to others. There is a broad body of work within the social psychology literature that examines the use of social-norm marketing, feedback, and tailored information campaigns as a way to manage the consumption of energy and water and thus promote environmental conservation. The most influential work in this literature is Schultz et al. (2007), which finds that combining normative messages that compare a household’s energy use to the use of its neighbors with injunctive messages (i.e., emoticons ☺ and ☹) generated significant reductions in energy consumption. This approach has been the foundation for OPower’s program to promote household energy conservation around the world. 12
Social comparisons and residential energy use
Given the scope of OPower’s operations, a body of literature has emerged that evaluates the effectiveness of its baseline energy efficiency program. For example, Allcott (2011) evaluates data from seventeen natural field experiments targeting more than 600,000 residential households. These households were randomly assigned to either a treatment group, with each household receiving a home energy report (HER) comparing its energy use with similar neighbors’ use, or a control group, which did not receive this information. The results suggest that receipt of an HER leads to a reduction in average monthly energy consumption of approximately 1.4 to 3.3 percent. Such effects are equivalent to a reduction in daily electricity consumption of approximately 0.62 kilowatt hours, or 10.4 hours of 60 watt light bulb use.
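Because households are randomly assigned, the simplest estimate of the HER's average treatment effect is the difference in mean consumption between the treatment and control groups. The sketch below illustrates that logic only. The function names and the four-household samples are hypothetical, and the studies discussed here estimate effects from far larger samples with additional controls, so this is not the estimator used in that work.

```python
from statistics import mean


def average_treatment_effect(treated, control):
    """Difference in mean outcomes between treated and control households."""
    return mean(treated) - mean(control)


def percent_effect(treated, control):
    """Treatment effect expressed as a percentage of mean control-group use."""
    return 100.0 * average_treatment_effect(treated, control) / mean(control)


# Hypothetical daily electricity use in kWh (illustrative numbers only,
# not data from any study cited in this article).
control_kwh = [30.0, 32.0, 28.0, 30.0]
treated_kwh = [29.0, 31.5, 27.5, 29.6]

ate = average_treatment_effect(treated_kwh, control_kwh)
pct = percent_effect(treated_kwh, control_kwh)
print(f"ATE: {ate:.2f} kWh/day ({pct:.1f} percent of control use)")
```

With random assignment, the two groups are comparable in expectation, which is why a simple comparison of means can be given a causal interpretation in this setting.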
Yet, Allcott (2011) and Ayres, Raseman, and Shih (2013) also document significant heterogeneity in the effect of the HER based on (1) a household’s average monthly consumption in the preintervention period, (2) the frequency of messaging, and (3) the day of the week. For example, Allcott (2011) finds that for households in the highest decile of preintervention energy use, the estimated average treatment effect exceeds the effect for households in the lowest decile by a factor of more than twenty. Moreover, the average treatment effect for households that receive monthly HERs was approximately one-third greater than for households receiving quarterly HERs. Ayres, Raseman, and Shih (2013) analyze data from an experiment conducted by OPower and Puget Sound Energy (PSE) and find evidence of heterogeneity in the effects of the HER across days of the week: nearly 38 percent of the observed reductions in use occur on Sundays and Mondays.
The role of political ideology
In a study that reanalyzes data from an OPower experiment with the Sacramento Municipal Utility District (SMUD), Costa and Kahn (2013) compare the effect of the HER on energy use for subgroups of political liberals versus subgroups of political conservatives. Specifically, they find that the estimated effect of the HER is approximately twice as large for liberals who live in a census block group where the share of liberals is high than for the subset of conservatives that neither purchase renewable energy nor donate to environmental groups and live in a census block group where the share of liberals is low. 13 Moreover, they find that conservatives are more likely to opt out of receiving the HER and to state that they disliked receiving the report.
Gromet, Kunreuther, and Larrick (2013) find similar heterogeneity in the role of political ideology on support for energy efficiency programs and the response to normative appeals to promote energy efficiency. Specifically, they document that Republicans and those who are politically conservative are less likely to support programs that provide subsidies for energy efficiency and less likely to purchase energy efficient lighting when the product includes a label highlighting environmental benefits. Such heterogeneity suggests that there is no one-size-fits-all approach for using normative messages to influence behavior, and that policymakers should tailor the content of appeals to account for differences in ideology or norms across groups.
Social comparisons and residential water consumption
To examine the effectiveness of normative messages in managing residential water demand, Ferraro and Price (2013) implemented a natural field experiment in conjunction with the Cobb County Water System (CCWS) in Cobb County, Georgia. Treated households received a letter detailing one of three conservation strategies: (1) information on behavioral and technological modifications, (2) a prosocial appeal that urges households to “use water wisely” and “make every drop count,” and (3) information comparing the household’s use during the previous summer to use by others in Cobb County. The results show that while technical advice (strategy 1) has only a small impact on water use (consumption falls by approximately 1 percent), augmenting that advice with either a prosocial appeal (strategy 2) or social comparison (strategy 3) generates reductions that are approximately 2.5 to 4.5 times larger.
Similar findings are reported in Brent, Cook, and Olsen (2015) , who use data from three natural field experiments in California conducted by WaterSmart. 14 Those customers receiving a home water report comparing their consumption over the previous two months to that of similar neighbors reduced their use by 4.9 to 5.1 percent and were six percentage points more likely to enroll in a utility sponsored efficiency program. Moreover, both Brent, Cook, and Olsen (2015) and Ferraro and Price (2013) find that the effects of the social comparison are greater for households that use more water in the preintervention period.
The persistence of normative appeals
Allcott and Rogers (2012) examine whether such norm-based messages influence behavior in the long run. They focus on approximately 12,000 households that were randomly selected from three of OPower’s earliest experiments to stop receiving home energy reports after two years. For these households, Allcott and Rogers (2012) document a decay in the estimated treatment effect relative to those who continue to receive the HER and a convergence of use toward preintervention levels. However, the observed rate of decay was orders of magnitude slower than what was observed during the initial months of the program, during which households would reduce energy use in the days following receipt of the HER but would revert to preintervention levels of use over the remainder of the month. These results suggest that households in the treatment group develop a “habit” for conservation (see Becker and Murphy 1988) that persists over time.
Ferraro, Miranda, and Price (2011) found a similar persistence in their examination of posttreatment usage (between 2007 and 2009) for households in the original CCWS experiment. Their results indicate that social comparisons affect long-run patterns of use, with households that received such messages using significantly less water than their control group counterparts during both the 2008 and 2009 summer seasons. In an extension of this analysis, Bernedo, Ferraro, and Price (2014) show that the social comparison letter’s effect on use was still detectable in the summer of 2013, six years after the initial mailing.
In summary, this literature on normative appeals and social comparisons highlights the importance of social norms on consumption decisions and the benefits of framing conservation as a normative behavior. Providing households information that compares their behavior (energy/water use) to that of similar neighbors appears to be a powerful tool for managing resource use, especially among higher-use groups. This suggests that policies based on messages that target the “why” and “how much” of conservation may prove to be a useful complement to pecuniary measures.
Using Prices to Promote Conservation: Dynamic Pricing Experiments
While policies based on normative appeals and social comparisons have received a great deal of attention in the literature, researchers have also explored other mechanisms to promote conservation. For example, economists have long recognized the potential for dynamic pricing strategies such as “peak load” or “real-time” pricing to manage electricity consumption during periods when the marginal cost of production is high. Thus this subsection summarizes the growing literature that uses field experiments to examine the effectiveness of various dynamic pricing schemes. We focus on a series of pilot experiments in the early to mid-2000s that explore both critical peak and time-of-use pricing plans, and a more recent set of targeted field experiments aimed at identifying the effect of information feedback on price sensitivity. 15
Critical peak and time-of-use pricing
Wolak (2006) evaluates data from a critical peak pricing (CPP) experiment involving 123 residential consumers of the City of Anaheim Public Utilities (APU) in Anaheim, California. Participants received a “smart” meter that recorded consumption over 15-minute intervals and were randomly assigned to either a treatment or control group. Control group customers were charged according to APU’s prevailing rate schedule. Treatment group customers paid the same tariff except for during peak hours on CPP days, when they received a rebate of 35 cents per kilowatt-hour for reductions in consumption relative to a reference level. 16 Wolak (2006) found that treated households consumed approximately 12 percent less electricity during peak hours on CPP days, about half of which can be attributed to increased consumption by treated households during peak hours on non-CPP days.
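The rebate rule described above can be written as a one-line calculation. The sketch below is illustrative only: the function name, the reference level, and the consumption figures are hypothetical, and it assumes that a household using more than its reference level simply receives no rebate rather than paying a penalty.

```python
REBATE_PER_KWH = 0.35  # dollars per kWh, the rebate rate reported in the text


def cpp_rebate(reference_kwh, actual_kwh, rate=REBATE_PER_KWH):
    """Rebate for peak-hour use on a CPP day.

    Assumption: use above the reference level earns a zero rebate
    (no penalty), so the reduction is floored at zero.
    """
    reduction = max(0.0, reference_kwh - actual_kwh)
    return rate * reduction


# Hypothetical households, each with a 10 kWh peak-period reference level.
rebate_saver = cpp_rebate(reference_kwh=10.0, actual_kwh=6.0)   # 4 kWh below reference
rebate_none = cpp_rebate(reference_kwh=10.0, actual_kwh=12.0)   # above reference

print(f"Rebate for the conserving household: ${rebate_saver:.2f}")
print(f"Rebate for the household above its reference: ${rebate_none:.2f}")
```

Because the payment depends on the gap between the reference level and actual use, the size of the rebate is sensitive to how the reference level is set.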
Faruqui and George (2005) and Herter, McAuliffe, and Rosenfeld (2007) report similar results from a statewide pilot pricing experiment in California designed to compare the relative impact of various dynamic pricing plans. Their findings suggest the relative superiority of CPP programs over time-of-use pricing: households facing time-of-use pricing reduced peak use by nearly 6 percent, but those facing a CPP plan reduced peak use by more than 13 percent over the same period.
Wolak (2011) extends this earlier work to consider the impact of a broader array of dynamic pricing plans (hourly pricing, critical peak pricing, or critical peak pricing with a rebate) for a sample of 1,245 residential consumers throughout Washington, DC. 17 The results support the earlier findings on dynamic pricing plans, with treated customers reducing electricity use during high-priced periods (peak events). However, the average treatment effect for the CPP treatment is greater than the effect for the CPP with rebate treatment (13 versus 5.3 percent)—a difference that is consistent with the literature showing that incentives framed as penalties have a greater effect on behavior than those framed as bonuses (see, e.g., Fryer et al. 2012 ; Hossain and List 2012 ; Levitt et al. 2012 ).
Jessoe and Rapson (2012) extend this literature by exploring the role of imperfect information on the price elasticity of demand. In their study, a subset of households facing a CPP tariff was given an in-home display that provided real-time feedback on the price and quantity of electricity consumed. They found that households exposed solely to real-time prices reduced demand by up to 7 percent, but those provided real-time feedback on their electricity use reduced demand by 8 to 22 percent. Similarly, Allcott (2011) examines data from a real-time pricing experiment in Chicago and finds that households that received a plastic orb that changed color to indicate current prices were significantly more responsive to the underlying price changes. 18
In summary, the literature on dynamic pricing plans shows that such strategies are an effective way to promote conservation during periods of peak demand. However, such effects can be enhanced by providing consumers with real-time feedback on prices and use. Thus policies designed to promote the adoption of in-home electricity displays and/or disseminate information on real-time energy use should be viewed as complements to strategies that use financial incentives to influence demand.
Lessons Learned and Directions for Future Research
This review of field experiments in environmental and resource economics offers several important lessons that have implications for both policymakers and researchers. First, in the context of nonmarket valuation, institutions matter. Thus policymakers should design elicitation schemes very carefully because statements of value are sensitive to both the mode of elicitation and the characteristics of those eliciting the value. Second, in the context of the disparity between WTA and WTP, the most important lesson is that experience and institutions matter, because, when investigated within a population of experienced agents familiar with the underlying trading institution, behavior converges to neoclassical predictions and the value disparity disappears. Finally, in the context of energy and water conservation, we believe the most important lesson is that norms matter. Targeted messages that frame conservation as a normative behavior are powerful tools for managing residential consumption. Similarly, dynamic pricing plans that increase prices during periods of peak demand are promising options for influencing the temporal profile of use, but they can be enhanced by providing households with in-home displays that make consumption (and underlying prices) salient.
With these lessons in mind, we have identified several promising areas for future research using field experiments in environmental and resource economics. In terms of methodology, artefactual, framed, and natural field experiments can be used to identify structural parameters that help explain or predict behavior in naturally occurring settings. Early examples of such an approach are Ashraf, Karlan, and Yin (2006) and Meier and Sprenger (2010) , who use laboratory methods to elicit individual-specific measures of time preference that are then used to predict observed patterns of borrowing and savings.
Another promising methodological area is the use of field experiments to test the design of new environmental markets or regulatory policies. To date, researchers have relied almost exclusively on laboratory settings for such studies (e.g., Cason 1995; Cason and Plott 1996; Cason and Gangadharan 2005; Murphy and Stranlund 2007; Suter et al. 2010). Yet, as noted in Cason (2010), field experiments may provide qualitatively different insights regarding market design and the relative performance of different trading institutions. In this regard, we see a need for more work in the spirit of Jack (2011), who designs a series of framed field experiments to compare alternate mechanisms to allocate tree planting contracts in Malawi.
In terms of topic areas, our discussion of nonmarket valuation indicates the need for further research on how people formulate values when responding to CV questions. For example, more work is needed on increasing our understanding of the underlying causes of observed biases. A theoretical model of such biases would be helpful not only to place the results into perspective, but also to guide future field experiments. It would also be useful to explore new mechanisms to minimize hypothetical bias (and other biases).
We also see significant potential for continued work at the intersection of development, health, and environmental economics. In particular, randomized field trials designed to promote investments in safe drinking water (Kremer et al. 2011) or reduce individual exposure to contaminated groundwater (Bennear et al. 2010) would be natural complements to the research exploring ways to promote conservation efforts and improved environmental quality.
Another important area for future research is to make better use of behavioral economics to promote our policy goals. For example, economists have found that the default option is very powerful, with most people sticking to the default rather than choosing other available options. However, whether policymakers can use default options to promote actions that offset the external costs of our consumption decisions remains an open question. Thus studies that explore the use of defaults and nudges to prompt compensatory actions such as the purchase of renewable energy certificates or carbon offsets would appear to have tremendous potential.
Another area in which behavioral economics could be used to help meet policy objectives is through the exploration of goal setting as a way to reduce energy consumption. The natural field experiment of Harding and Hsiaw (2012) , which examines participation in and the impact of a utility sponsored goal-setting program, is an important foundation for further research in this area. In particular, it would be helpful to explore how consumers set their initial goals and the effect of initial goals on subsequent changes in energy use.
Finally, social comparisons and normative appeals have proven to be an effective way to manage residential water and energy consumption. Thus future studies that use such strategies to promote the adoption of green technologies and manage the temporal profile of energy use have great potential. Herberich, List, and Price (2012), who use a natural field experiment to investigate the relative impact of price reductions and normative appeals on the decision to purchase compact fluorescent light bulbs, and Ito, Ida, and Tanaka (2015), who investigate the relative effectiveness of dynamic pricing and moral suasion on energy use during peak periods, have already laid the groundwork for such research.
We thank Charles Kolstad, the former editor of REEP , for his tremendous patience with this manuscript and two anonymous reviewers for comments.