Why China needs data sharing to address its air-quality challenge

In the fight against air pollution, sharing data within China and forming scientists' consensus on data-driven policy recommendations are at least as important as data collection. The scale of data publicly available can have dramatic implications for the insights possible: a series of research studies in limited data contexts can lead to the wrong conclusions if the contexts do not encompass the broader picture or if they are not connected. Despite the clear benefits of data sharing, there are systemic hurdles in China: in the past, collection of air-pollution data was fraught with distortion. Further, promotion and tenure requirements disincentivize data sharing essential to informative results. Finally, overcoming the data-sharing and collaboration problem is not enough. As demonstrated by the California Air Resources Board and in air-quality management in Houston, scientists must then come together around that shared data to inform the policymaking process.
As data sharing and collaboration can take on multiple forms, to elucidate our argument, we first differentiate between forms of data sharing. Data sharing is the exchange of data between members of a group for an express purpose. Data sharing can take many shapes with varying numbers of researchers at the same or different institutions. A data-sharing program may or may not involve making data public. Making data public is the most extreme form of data sharing, inasmuch as it makes the data available to everyone. In between public data sharing and data shared amongst only two entities is a data-sharing club, where data are shared based on a set of common principles. Traditionally, an ambitious international collaboration addressing an issue with the complexity of air pollution would be the product of a small group with the ability to expand as the project develops. The types and scale of questions answerable with shared data can be significantly greater than any single group could collect and analyse on their own. Data sharing can conflict with traditional cultural standards of data collection and ownership in academia, without incentives otherwise. Houston, Texas's history of addressing air pollution, discussed later in this paper, is a prime case study in data sharing, without making the data public.
Data pooling is the construction of a database to hold data that members, or in some extreme cases the general public, can access at will, rather than for a particular research or policy goal. It may be thought of as a more sophisticated extension of data sharing. Data pooling, when executed well, can have the advantage of ensuring validity and uniformity of data. Data pooling also has the potential to leverage shared resources to answer more complex, far-reaching questions, as well as allowing new questions to be addressed as scientific understanding develops, encouraging the emergence of new knowledge. Pooling data can be difficult to implement, as it requires buy-in from participants who may be accustomed to holding data independently to make the most individual use of it. It also requires significant maintenance and management to build, curate and continually validate such a database. Databases such as those created by the Convention on Long-range Transboundary Air Pollution (CLRTAP) or the Aerosols, Clouds, and Trace gases Research Infrastructure (ACTRIS), both of which are discussed later in the paper, are examples of data pooling [1,2].
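To make the idea of a pool that ensures validity and uniformity more concrete, the following is a minimal sketch only; it does not describe how the CLRTAP or ACTRIS databases actually work. It assumes two hypothetical groups report PM2.5 in different units, and shows a pooled store converting every reading to one common unit and rejecting implausible values on entry. The names `PooledDatabase` and `submit`, and the 0-2000 μg/m3 plausibility range, are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Conversion factors to the pool's common unit (micrograms per cubic metre).
TO_UG_M3 = {"ug/m3": 1.0, "mg/m3": 1000.0}

# Hypothetical plausibility bounds for PM2.5 in ug/m3 (an assumption for this sketch).
PM25_RANGE = (0.0, 2000.0)


@dataclass
class PooledDatabase:
    """Toy pooled store: every accepted record is held in ug/m3."""
    records: list = field(default_factory=list)
    rejected: list = field(default_factory=list)

    def submit(self, group, site, value, unit):
        """Convert a PM2.5 reading to ug/m3 and keep it only if it is plausible."""
        if unit not in TO_UG_M3:
            self.rejected.append((group, site, value, unit, "unknown unit"))
            return False
        value_ug = value * TO_UG_M3[unit]
        low, high = PM25_RANGE
        if not (low <= value_ug <= high):
            self.rejected.append((group, site, value, unit, "out of range"))
            return False
        self.records.append({"group": group, "site": site, "pm25_ug_m3": value_ug})
        return True

    def site_mean(self, site):
        """Mean of all accepted readings for a site, or None if there are none."""
        values = [r["pm25_ug_m3"] for r in self.records if r["site"] == site]
        return sum(values) / len(values) if values else None


pool = PooledDatabase()
pool.submit("group_a", "site_1", 80.0, "ug/m3")
pool.submit("group_b", "site_1", 0.12, "mg/m3")  # converted to ug/m3 on entry
pool.submit("group_b", "site_1", -5.0, "ug/m3")  # rejected: negative concentration
```

Because conversion and checking happen once, at submission, any later analysis across groups can treat the pooled records as directly comparable, which is the uniformity benefit described above.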

THE PROBLEM
In the midst of heightened tension between the USA and China on issues of trade and the South China Sea, John Holdren (Director of the US Office of Science and Technology Policy) and Wan Gang (Chinese Minister of Science and Technology) convened the US-China Innovation Dialogue on 5 June 2016 to discuss collaboration opportunities that could lead to beneficial outcomes for both nations. One focus of the dialogue was the role of data sharing and collaborative analysis in tackling climate change and conventional air pollution.
Atmospheric particulate matter smaller than 2.5 μm in diameter (PM2.5) and ozone caused ∼4.5 million premature deaths in 2005 [3]. A component of PM2.5, black carbon, is a top contributor to global warming, second only to CO2 [4]. Reducing these pollutants benefits both the quality of the air we breathe and the pace and extent of climate change [5]. Data sharing is critical in effectively tackling this widespread issue. The scale of data that are currently publicly available can have dramatic implications for the possible insights: a series of research studies in limited data contexts can lead to the wrong conclusions if the contexts do not encompass the broader picture or if they are not connected with each other. For example, in understanding ozone depletion, many scientists and policy makers initially questioned the chlorofluorocarbon hypothesis, but the hypothesis was eventually validated with data collection across scales and platforms, leading to the formulation of a new policy [6].
Despite the clear benefits of data sharing and pooling, there are systemic hurdles in China. In the past, collection of air-pollution data in China was fragmented and fraught with distortions [7-9]. Most of the data were not publicly available. Moreover, China's academic promotion practices have unintentionally hindered data sharing. For example, first or corresponding authorship is necessary for promotion and tenure. Because only one, or at most two or three, authors can be listed first or as corresponding authors, and because data sharing is typically conditional on being named as first author, publishing of data is discouraged, which shows that co-authorship is not always a sufficient solution. Researchers and institutions also guard data in order to maximize the publishable material from its use. With high-quality sensors operated by different groups across China, and air-pollution analysis requiring high-quality data from multiple sites within each region, these data-sharing and collaboration barriers must be overcome to better reduce and manage air pollution in China.
Overcoming the data-sharing problem is not enough. As demonstrated by the California Air Resources Board in Los Angeles and in air-quality management in Houston, scientists must then together analyse the accumulated data to inform the policy-making process. Without this knowledge transfer, policy can easily be misinformed, misguided and lead to undesired outcomes. Finally, sharing and using data permit improvement in the process of data collection itself.

LESSONS FROM HISTORY: DATA-SHARING FRAMEWORKS
China is not alone in facing these challenges. Mandated data-sharing protocols are still relatively new internationally in air-quality research. In 1950, the World Meteorological Organization (WMO) began to emphasize improved international collaboration through data pooling in the areas of weather, climate and water [10]. This early effort faced challenges because the data were published without checks and balances to ensure equal quality and reliability, making them difficult to use for research. The WMO did, however, establish its Global Atmosphere Watch program as a high-quality data-pooling project that is still publicly available [11]. The CLRTAP was signed in 1979 and has been implemented by the European Monitoring and Evaluation Program (EMEP), creating a shared data network to address air-quality issues across nations [2]. In 1983, NASA's Global Tropospheric Experiment (GTE) was created to study tropospheric chemistry and its effects on the USA. GTE mandated a data-management protocol as a condition of funding to ensure that data were shared across parties with funded projects, and funding was also explicitly allocated for data management. Field data collection ended in 2001, but the data are still publicly available, highlighting the potential for a pooled and public resource [12]. Eionet in the EU was created in 1997 as a data consortium among EU member and cooperating countries to collect, share and study environmental data [13]. In 2011, ACTRIS was created to merge a group of programs (not including Eionet) dating back to 2000 into a combined database and network [1]. ACTRIS is funded by the European Commission and jointly run by member-state representatives. Standards for data and meta-data reporting (including location, instrumentation, uncertainties/percentiles, etc.) are implemented in the management of the network in order to maintain quality across contributing sites. Top journals have also brought data-sharing practices into their requirements for publishing. Nature's policy reads 'A condition of publication in a Nature journal is that authors are required to make materials, data, code, and associated protocols promptly available to readers without undue qualifications' [14]. There is still work, however, to be done internationally. The USA does not have the equivalent of ACTRIS in place. Further, the US NSF's funding for air-quality monitoring projects has not historically allowed for dedicated data-management roles, though sharing and eventually publicizing data are current requirements.
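As an illustration of what a metadata reporting standard of the kind described for ACTRIS might look like in practice, here is a minimal sketch. The required fields mirror the categories named in the text (location, instrumentation, uncertainty), but the field names and the helpers `missing_metadata` and `partition_submissions` are hypothetical and do not reproduce any actual ACTRIS schema.

```python
# Required metadata fields, loosely following the categories named in the text.
# The exact field names are assumptions made for this sketch.
REQUIRED_FIELDS = ("latitude", "longitude", "instrument", "uncertainty")


def missing_metadata(record):
    """Return the required fields that are absent or empty in a record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]


def partition_submissions(records):
    """Split submissions into those that meet the standard and those that do not."""
    accepted, returned = [], []
    for rec in records:
        gaps = missing_metadata(rec)
        if gaps:
            # Sent back to the contributing site with the gaps listed.
            returned.append((rec, gaps))
        else:
            accepted.append(rec)
    return accepted, returned


submissions = [
    {"site": "A", "latitude": 39.9, "longitude": 116.4,
     "instrument": "beta attenuation monitor", "uncertainty": 0.05},
    {"site": "B", "latitude": 31.2, "longitude": 121.5,
     "instrument": "", "uncertainty": None},
]
accepted, returned = partition_submissions(submissions)
```

Returning incomplete submissions with an explicit list of gaps, rather than silently dropping them, is one plausible way a network could hold contributing sites to a common standard without losing their data.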

LESSONS FROM HISTORY: SCIENTISTS INFORMING THE POLICY PROCESS
There are important international examples of how scientists can inform the policy process, and how shared data practices between scientists and government regulators improve air-quality policy. In the USA, the California Air Resources Board (CARB) has a long history of promoting and funding air-quality research in California. Its work has led to significant improvements in air quality throughout the state, often leading the nation [15]. In the mid-twentieth century, the air quality in Los Angeles was degraded to an extent comparable to the worst found in Beijing today; annual average PM10 reached ∼150 μg/m³ and peak ozone exceeded 600 ppb [16]. The CARB set the first automobile NOx standards in the nation as a result of research into the effects of NOx on ozone formation and PM2.5 [15]. As a consequence of these and subsequent effective emissions controls, LA air-pollution levels are now less than one-fourth of those in the past, even though the population has doubled and vehicle miles traveled have quadrupled [17]. In California as a whole, the collective cancer risk from exposure to major toxic air contaminants has declined by 76% in the past 23 years alone [18]. With the close ties that exist in China between government institutions and academia, the CARB's experience could be a data-sharing model from which to draw insights for building cross-boundary projects and institutions in the international context.
Air-quality management in Houston, Texas, offers another example of how collaboration between academia and government can drastically improve air quality [19,20]. In 1999, while experiencing the worst ozone pollution in the USA, Houston embarked on an ambitious research collaboration among a variety of state, federal and academic groups. Rather than NOx being the dominant cause of ozone formation, as in California, highly reactive volatile organic compounds (VOCs) were found to be more critical ozone-forming agents [20]. Because the results of over 300 scientists' work were rapidly summarized prior to any publications, the correct policy solution was effectively communicated to and acted upon by lawmakers. By regulating the highly reactive VOCs rather than NOx, industry saved an estimated 1 billion dollars [21]. The LA and Houston cases underscore the importance of engaging scientists in the policy process and sharing data along the way. In these cases, opposite interventions were necessary to reduce ozone, due to regional geographic and industry differences.
The USA and China have a history of working together to bring science into the regulatory process. In 2003, the US Environmental Protection Agency and China's State Environmental Protection Administration (the predecessor to the Ministry of Environmental Protection) made history by signing a memorandum of understanding focused on collaboration on fuel and vehicle technology and standards. Although the data were not shared with the greater public in China, American and Chinese scientists working on the projects shared data with each other. Reports of some of the work were also published in the USA. In 2008, the Department of Energy's Atmospheric Radiation Measurement Climate Research Mobile Facility was moved to southeastern China [22]. While this was a positive step, data use and ownership issues forced the project to close down before it could gather the long-term data needed to gain scientific or policy insight. Most recently, US-China cooperation on climate and energy under the Climate Change Working Group (CCWG) contributed to the successful negotiation of the Paris Agreement. The CCWG has already begun to promote data sharing between research institutions: the China Energy Modeling Forum was established in 2015 at Tsinghua to put modeling teams and policy makers on the same platform in order to better inform policy development [23]. China has also begun the process of creating a pooled database of independently collected environmental data as well as consolidated data from a variety of sources made available through a government data center [24]. Such efforts will greatly benefit from a broad approach addressing the systemic barriers to data sharing and pooling, as well as the scientific informing of policy.
While the challenges are not unique to China, as one of the top pollution emitters, China uniquely holds the opportunity to solve some of the most important air-pollution and climate challenges of our time, leading the world in these innovations and this dialogue [25]. China has an impressive history of leading the world in the engineering and implementation of large-scale infrastructure innovations, such as its achievements with high-speed rail [26]. By leading the world in data-pooling infrastructure and giving the very best scientists from China (and possibly also select friends from around the world) access to that pooled data with which to advance science and inform policy makers, China has the potential to lead the world in solving arguably the most important problem facing the globe today.
(e) Include a commitment to a timeline for full global inclusion in the Blue Ribbon Panel.
Finally, while international participation could bring the best minds from around the world to the urgent air-quality issue in China, mandating data sharing within China and bringing Chinese scientists together to inform air-quality policy would, on their own, be an enormous step forward for air-quality improvements in China and thus worldwide.