David J. Hand, Statistical Challenges of Administrative and Transaction Data, Journal of the Royal Statistical Society Series A: Statistics in Society, Volume 181, Issue 3, June 2018, Pages 555–605, https://doi.org/10.1111/rssa.12315
Summary
Administrative data are becoming increasingly important. They are typically the side effect of some operational exercise and are often seen as having significant advantages over alternative sources of data. Although it is true that such data have merits, statisticians should approach the analysis of such data with the same cautious and critical eye as they approach the analysis of data from any other source. The paper identifies some statistical challenges, with the aim of stimulating debate about and improving the analysis of administrative data, and encouraging methodology researchers to explore some of the important statistical problems which arise with such data.
Introduction
Administrative data are data generated during the course of some operation, and then retained in a database. They are becoming increasingly important as the potential for discovery from such sources of data is being recognized and as alternative sources of data become more costly or difficult to use (e.g. because of declining response rates in surveys). In the main, this means that the analysis of administrative data is secondary—the data are being repurposed—although, as explained below, this is not always so. The existence of large, often administrative, data sets, offering potential for secondary analysis, was one of the primary drivers behind the development of data mining technology (Hand et al., 2000) as well as the modern rise of interest in ‘big data’. But the analysis of administrative data presents new statistical challenges. This can be seen by a cursory examination of the examples in most basic statistics texts, which will almost all involve ‘random samples’: administrative data are, by definition, typically not random samples. The aim of this paper is to explore these statistical challenges and to stimulate discussion. The hope is that it will help to focus attention on what is needed for valid and accurate analysis of administrative data. The need is illustrated by the comment made by Wallgren and Wallgren (2014), page 3, on the closely related topic of analysing data from statistical registers:
‘Although register-based statistics are a common form of statistics used for official statistics and business reports, no well-established theory in the field exists. There are no recognised terms or principles, which makes the development of register-based statistics and register-statistical methodology all the more difficult. As a consequence, ad hoc methods are used instead of methods based on a generally accepted theory.’
It is hoped that this paper will serve as a framework to stimulate discussion about what ‘generally accepted theory’ might be taught for the analysis of administrative data.
There are many definitions of statistics. This is because the discipline has various aspects, including the study of methods for collecting, presenting, interpreting and analysing data, but also because it involves expertise in coping with uncertainty and chance. My own definition (Hand, 2008) tries to capture this diversity: statistics is the technology of extracting meaning from data and of handling uncertainty.
There are fewer definitions of administrative data. The Organisation for Economic Co-operation and Development (Organisation for Economic Co-operation and Development, 2016) defined administrative data as having the following features:
- (a)
the agent that supplies the data to the statistical agency and the unit to which the data relate are usually different, in contrast with most statistical surveys;
- (b)
the data were originally collected for a definite non-statistical purpose that might affect the treatment of the source unit;
- (c)
complete coverage of the target population is the aim;
- (d)
control of the methods by which the administrative data are collected and processed rests with the administrative agency.
The definition continues by saying that
‘In most cases it is normal to accept (and expect) that the administrative agency will be a government unit that is responsible for implementing an administrative regulation’.
That leads to a rather narrower definition than is taken in this paper. For example, it excludes corporate use of administrative data, describing the workforce, products, processes, and so on, as well as narrowly restricting the uses to which such data are put. Instead, although accepting that the features described above do characterize administrative data, I shall follow Nordbotten (2010) and simply distinguish between statistical data and administrative data. Statistical data are collected primarily for statistical purposes—e.g. to summarize in order to shed light on the system generating the data, or to make predictions. In contrast, administrative data are initially collected for some administrative purpose—to run an organization, such as a company, government, charity, school, hospital, and so on. Running the organization might require on-going operational analysis of the data but, once collected and stored, the data can later be analysed to shed light on what has happened, to help to predict what might happen in the future, and to evaluate systems and their performance, i.e. the data can later be subjected to statistical analysis. Often statistical data consist of mere samples from the universe of possible values which could have been obtained, and these will have been collected by surveys or experiments for example. In contrast, administrative data will ideally consist of data on all of the cases, records or transactions in some population. This leads to something of a conceptual distinction: sample data are used to obtain estimates of a population parameter. In contrast, administrative data are summarized to obtain a descriptive feature of the population.
Transaction data are an important kind of administrative data concerned with events, typically with sequences of events. Usually the prime operational purpose of collecting the data is to inform the transaction (e.g. to decide how much to charge a supermarket customer or to decide how much tax someone should pay), but once collected the data can be retained in a database and analysed to improve understanding of the organization's operations.
I used the word ‘operational’ above. Occasionally we see the terms ‘operational data’, ‘management data’ or ‘management information’ used to describe data collected and analysed to guide the operation of a system. It is clear from the above discussion that the data, once collected and placed in a database, are no different from administrative data. What differs is the way in which the data are being used—from immediate decisions to more considered analysis with longer-term implications. Operational data become administrative data when they are stored and used for some purpose beyond the day-to-day operations of the organization. In a sense, then, administrative data are data exhaust: that which is left over after the organizational machinery has used the data to drive itself forward.
Incidentally, in this paper, I shall use the term ‘survey’ to refer to data collection by sample survey, so that it is contrasted with administrative data, collected as a side effect of an operational activity. This is different from, for example, Statistics Canada (2009), page 11, which used the term survey ‘generically to cover any activity that collects or acquires statistical data’, including the collection of data from administrative records.
At first glance, though we shall see that appearances can be deceptive, administrative data appear to have several advantages compared with statistical data.
- (a)
Since the data have already been collected, no additional cost appears to be incurred in collecting them.
- (b)
In a sense, we might reasonably expect that ‘all’ the data are available. After all, a company will certainly process and can retain details of all its transactions.
- (c)
The data might be of high quality, since the effectiveness of the operation of the organization depends on this.
- (d)
The stored data will certainly be timely and might be regarded as being as up to date as it is possible to achieve, since they describe the organization as it is, or at least as it was when the last change was made. This advantage is strikingly illustrated by the use of administrative data to derive estimates of population attributes at times intermediate between decadal censuses, and by essentially real-time estimates of price inflation.
- (e)
In a real sense administrative data often tell us what people are and what they do, not what they say they are and what they claim to do. We might thus argue that such data get us closer to social reality than do survey data.
- (f)
Administrative data may provide tighter definitions than alternative sources of data. Wallgren and Wallgren (2014), page 33, gave examples of data about income and children in families. Where the time restrictions on eliciting responses to a survey might mean one must simply ask ‘what is your yearly income before tax?’, administrative data might, depending on the source of the data, specify whether this means ‘disposable income, taxable income, earned income or income including unearned income…’.
Unfortunately, although all of those advantages of administrative data might apply in an ideal world, in practice things are typically not so straightforward.

Regarding (a), effort will normally be required to extract the data, to clean them and possibly to link them to other data sets. Moreover, although data may be free for the organization which collected them, other organizations which wish to use these data may have to pay, and the cost must be balanced against that of data from alternative sources, such as surveys or administrative data collected by other organizations.

Regarding (b), data will usually enter a database via a complex social process, so the sample of records in a database may not be representative of the population to which one wishes to make an inference. An operational database might not have a form which is convenient for statistical analysis exercises. In particular, different parts of an organization might use different database systems; indeed, there is a large amount of current activity as organizations seek to put all of their data into a single data repository (a data warehouse, for example). The notion of 'data=all' is discussed in Section 3.

Regarding (c), although sampling variation issues may not apply, administrative data will have other sources of uncertainty, and unfortunately these may be various and diverse, and not susceptible to resolution by machinery as mathematically elegant or unified as sampling theory. More generally, although in principle one might expect administrative data to be of high quality, in practice all data sets, perhaps especially those involving human beings, are susceptible to quality issues. A particular issue with administrative data sets arises from the very fact that they were not deliberately collected to answer the later statistical question being addressed. This means that the data may not be ideally suited to answering the question; there is often a compromise between cost and relevance. For example, a costly survey could be designed to answer the specific question whereas 'free' administrative data might be only roughly suitable (but see principle 8, clause 8.1, of the European statistics code of practice (Eurostat, 2011), which says 'When European Statistics are based on administrative data, the definitions and concepts used for administrative purposes are a good approximation to those required for statistical purposes'). Moreover, the definitions that are involved in administrative data are subject to change for operational purposes, which might affect the research questions that can reasonably be asked. It follows that time series of administrative data might exhibit discontinuities: an administrative database containing details of unemployment benefits might appear to be ideal for addressing questions of unemployment rates but, if the definition that is used in assessing benefits changed over time, then this might limit what can be done.

Regarding (d), although administrative data may be instantly available to the organization collecting them, they may not be so readily available to other organizations which wish to use them.

Regarding (e), there are important kinds of administrative data which are not automatically accrued through some transaction but are specifically sought and reported: income tax data, for example.

And, finally, (f) is not universally true: credit card transaction data contain considerable detail of the nature of the item purchased, but not necessarily to a level that is adequate for all potential analyses.
An example of the relative merits of administrative and statistical data is given by crime statistics in England and Wales. There are two main sources of such data: the crime survey for England and Wales and police-recorded crime (Office for National Statistics, 2016a). These can sometimes show trends going in opposite directions. The reason is that definitions—of crimes, of victims, of what data are collected—differ. With the administrative data (the police-recorded crime data) we must make use of the data that we have, whereas with survey data (the crime survey for England and Wales) we can decide what data are best collected to answer the questions we want to ask. Indeed, there has been extensive research on how best to formulate survey questions to elicit the information that is sought (see, for example, Presser et al. (2004) and Fowler (1995)).
There are also other issues which arise with administrative data and which are slightly different from those arising with survey data. An obvious problem relates to privacy and confidentiality, discussed in more detail below. Since survey data are collected purely for statistical analysis, data released for analysis would not normally retain identifying information: apart from its use in ensuring consistency and representativeness, the identifying information is not relevant to the use of the data. In contrast, such information is central to the initial (operational) purpose for which administrative data are collected. The (hoped-for) comprehensiveness of administrative data increases the risk of reidentification, and perhaps also the public concern.
I could go on, but it is clear that, although administrative data have merits, the statistician should approach such data sets with the same critical eye with which they approach any other data set.
It is worth noting that it is sometimes useful to distinguish between two kinds of administrative data. The first kind is that which is necessarily collected during the course of some operation. Credit card transaction data, for example, necessarily involve the recording of the amount spent, the currency and the business where the transaction occurs, since these items of information are needed to run the credit card operation. The second kind is additional information which is not needed for an operation, but which is helpful for other reasons, and which is collected during the administrative process. The age and gender of a customer might fall into this category: a product might be bought by anyone, but it could be useful to analyse the customer's details later to inform new marketing strategies. In some sense this second kind of data lies between administrative and statistical: they are collected for statistical rather than operational purposes, but they are collected during and as part of the administrative process. There is an important lesson to be taken from this: benefits can be gained from involving the statisticians and data analysts in the data collection stage. This is not a new lesson: we recall Ronald Fisher's comment that
‘to call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of’.
This last point leads us into the modern world of so-called ‘big data’. The term has no universally accepted definition, but we might define it as the result of some automatic data collection system. Indeed, I have argued elsewhere that the data revolution is not so much a consequence of the size of modern data sets and the ability to store them (big data) but rather of the fact that data are nowadays largely collected automatically without requiring explicit human effort. Examples of automatically collected data are everywhere and include personal health data collected by wrist monitors, automated monitoring of tickets as people travel through a rail network, telemetry of engine functioning and recording of metadata of phone calls. Data arising from the so-called ‘Internet of things’ would clearly be of this type.
Given that so many official and economic statistics are based on administrative data, or on a combination of administrative and survey data, we might have expected there to be a substantial literature in the leading methodological statistical journals describing the statistical challenges and how to overcome them. This appears not to be so, with such journals carrying relatively few papers on the statistical challenges of administrative data (being mostly focused on the consequences of sampling theory). For example, a search of the Journal of the American Statistical Association for occurrences of the phrase ‘administrative data’ yielded 44 results. Obviously the Journal of Official Statistics is an exception, although even there most of the papers including the phrase ‘administrative data’ in the title are concerned with particular applications. More generally, papers on the topic seem to be widely scattered and often appear in the proceedings of conferences and workshops, or perhaps as reports of official exercises (e.g. from official statistics offices). Given this wide scattering, it is certain that important contributions have been omitted from this paper, and I welcome attention being drawn to them in the discussion.
One of the best discussions is the excellent and comprehensive introduction to register-based statistics by Wallgren and Wallgren (2014). In one sense, this has a much wider scope than this paper, including discussion of register structures and the creation of registers, but in another sense it is narrower, being focused on official statistics and not including, for example, commercial or engineering applications of administrative data.
The series of conferences on ‘New techniques and technologies for statistics’ (e.g. New Techniques and Technologies for Statistics (2013, 2015)), organized by the European Commission, often have items touching on administrative data challenges: the phrase ‘administrative data’ appeared in the 2013 conference proceedings 220 times. To give the flavour of the breadth of topics that were covered, these proceedings included papers on state space models (Horn and Czaplewski, 2013), structural equation models (Scholtus and Bakker, 2013), business statistics and other topics. Romanov and Gubman (2013) explored regression to the mean in survey responses to questions on income by using administrative (tax) data, and Lewis and Woods (2013) described some of the issues which must be tackled when using administrative data in the form of value-added tax and company accounts data, as the basis for business statistics. As well as problems of matching and cleaning administrative and survey data, they also discussed differences of timeliness and periodicity. Kloek and Vâju (2013) discussed integration of administrative data with other kinds of data. They characterized five different kinds of use: direct use at microlevel, use as auxiliary information at microlevel, use as auxiliary information at aggregate level, use as a source for the population frame and use as circumstantial evidence. They also explored the distinction between administrative data for business and those for households. This, of course, reflects the general point that data describing different kinds of entities might have different characteristics (e.g. more pronounced skewness for some variables for business data compared with household data). Ćetković et al. (2013) have provided an elaborate example: the Austrian register-based census, involving seven base registers and several comparison registers which are provided with data from 35 data holders. They characterized data quality in terms of several ‘hyperdimensions’ described by Berka et al. (2012) (see below). Antoni (2013) linked survey and administrative employment data.
Other sources which have relevant materials include the following:
- (a)
the United Nations Economic Commission for Europe data collection workshops (see, for example, United Nations Economic Commission for Europe (2016), on new frontiers in data collection);
- (b)
the ‘ESS vision 2020’, which includes discussion of administrative data sources and challenges (European Statistical System, 2020) (their ‘Administrative data sources’ project is exploring how administrative data may be used to increase data availability and to reduce costs (European Statistical System Admin, 2015));
- (c)
ESSNet has current and previous projects on administrative data topics (ESSNet, 2017) (see, for example, ESSNet Admin Data Workshop (2013));
- (d)
the US Review of Administrative Data Sources (Ruggles, 2015);
- (e)
the Statistics New Zealand ‘Guide to reporting on administrative data quality’ (Statistics New Zealand, 2016);
- (f)
the use of administrative data at Statistics Canada (Statistics Canada, 2015);
- (g)
the administrative data quality assurance documents produced by the UK Statistics Authority (UK Statistics Authority, 2014, 2015);
- (h)
the checklist of quality of statistical outputs in van Nederpelt (2009);
- (i)
‘Pros and cons for using administrative records in statistical bureaus’, from the Israel Central Bureau of Statistics (2007);
- (j)
the Organisation for Economic Co-operation and Development compilation ‘Short-term economic statistics (STES) administrative data: two frameworks of papers’ (Organisation for Economic Co-operation and Development, 2016) is a particularly valuable source of examples of the use and challenges of administrative data, albeit focused mainly on economic uses.
The structure of this paper is as follows. Section 2 describes a fundamental problem that is relevant to all data analysis, no matter what the source of the data, namely data quality. But the challenges—and even the recognition that there are challenges—that are presented by administrative data differ from those presented by other sources. We look at some of these challenges and how they differ from those of other types of data.
Section 3 addresses the notion that one might have ‘all’ of the data. This is typically regarded as one of the particular merits of administrative data but, as we show, it is all too often an unjustified assumption.
Section 4 explores the fact that administrative data are collected for operational purposes, and not with specific research questions in mind. The consequence is that the data may be far from ideal for addressing those questions.
Sections 5, 6 and 7 look at deeper issues where the nature of administrative data impacts other aspects of analysis, including efforts to identify causation, merging data from multiple sources and the thorny issues of confidentiality, privacy and anonymization of records.
Section 8 draws some conclusions.
Data quality
The value of administrative data for producing official statistics has attracted increasing attention recently. In large part this is in the hope that they can replace more conventional survey data, motivated on the one hand by a worldwide decrease in survey response rates, and on the other by a perceived lower cost in using administrative data, since they have already been collected. However, as the UK Statistics Authority put it,
‘we have been surprised by the general assumption made by many statistical producers that administrative data can be relied upon with little challenge, and, unlike survey-based data, are not subject to any uncertainties’
(UK Statistics Authority, 2014). Because of this, the UK Statistics Authority has produced a report on quality issues in administrative data, summarizing the lessons learnt from a review of users of administrative data for statistical purposes and describing a toolkit to monitor data quality in this context (UK Statistics Authority, 2014, 2015).
Other explorations of the quality aspect of administrative data include the model that Daas et al. (2008) have developed for Statistics Netherlands. Noting that a key issue with administrative data is that the source of the data is typically some other body, Daas et al. (2008) pointed out that the collection and maintenance are not within the control of the analyst: when data are collected by bodies other than those undertaking the analysis, issues of data provenance and curation are critical. In their review of earlier work on administrative data quality, Daas et al. (2008) observed that different researchers have identified
‘a remarkable difference between the number and types of quality groups or dimensions identified for the statistical quality aspects of administrative data’.
They attributed this partly to the complexity of the problem and partly to the fact that different researchers had different perspectives on the topic. Their paper is then an attempt to integrate the various views into a single framework. Their conclusions include the observation that administrative data quality is a multi-dimensional issue, with a hierarchy of dimensions (Karr et al., 2006).
Other work (e.g. Eurostat (2003)) has explored the potential uses of administrative data. This is an important point when attempting to evaluate data quality, as data may be ‘good’ for one purpose but ‘bad’ for another: quality is not a property of the data set itself, but of the interaction between the data set and the use to which it is put. And yet other, more general, work on data quality, especially in the context of official statistics, inevitably touches on administrative sources (e.g. van Nederpelt (2009), Memobust Handbook (2014) and Statistics Netherlands (2014)). Once again, we stress that work on the quality of administrative data has appeared in diverse publications, from a wide range of sources.
The common misconception that quality issues are less important for administrative data than they are for survey data seems to be based chiefly on the belief that data that have initially been acquired for operational purposes must necessarily be both complete and error free, whereas survey data will be based on a mere sample from the population being studied, so the results will vary between possible samples. The fact is, however, that administrative data may be neither complete nor error free. As far as ‘complete’ is concerned, incompleteness can manifest either in the form of partial records—records in which some of the fields are missing—or in the form of entire records missing, so that the data set does not in fact cover the entire population. And, as far as ‘error free’ is concerned, errors can arise in an unlimited number of ways. To paraphrase Leo Tolstoy: ‘A perfect data set is perfect in only one way; each imperfect data set is imperfect in its own way’. This means that we can never be sure that all the errors have been detected. The problem is analogous to that of testing random-number generators: we can look for particular kinds of departures from randomness, but there will always be kinds that we have not thought of. Unfortunately, one of the lessons that we have learnt from data mining practice over the past 20 years is that most of the unusual structures in large data sets arise from data errors, rather than anything of intrinsic interest. We should be suspicious of any data set (large or small) which appears perfect. A standard check that I carry out is to ask those providing the data what they have done about missing values. Often this has resulted in surprising responses which the researchers would not have thought to mention if the question had not been explicitly asked. For example, it is not uncommon for researchers to have removed any incomplete records from the data set, introducing unknown selection bias. Caruana et al. (2015) described the development of a machine learning diagnostic system based on hospital administrative data which classified high risk asthma patients as low risk because such cases had been excluded from the training data.
Although the technology of data editing and imputation has been substantially developed, with entire books being written about it (e.g. de Waal et al. (2011)), it is not the case that detected errors can necessarily usefully be corrected. This means that commercial tools for detecting and correcting data errors are unlikely to be 100% effective, whatever they may claim.
Statisticians know very well that it is common for the major part of their time on a project to be spent cleaning data before actual statistical analysis. This is all very well when the data set is of moderate size, but it becomes more of a problem when the data set is massive—as is increasingly the case in the ‘big data’ world, and is particularly the case with administrative data and data which are captured automatically. Especially in such contexts, the computer is a necessary intermediary between the analyst and the data, with consequent risks of missing important shortcomings of the data—and indeed, even creating extra errors during an automatic data cleaning process. For example, rule-based correction mechanisms can distort perfectly good, though unusual, data values, and an unfortunately all-too-common strategy for coping with missing values is to substitute the mean of the observed values (so leading to an underestimate of variance).
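As a minimal sketch of the second of these pitfalls, the following Python fragment (using simulated data, not data from any of the applications discussed here) shows how substituting the observed mean for missing values shrinks the estimated variance:

```python
# Sketch: mean substitution for missing values underestimates variance.
# Simulated data only; the missingness rate is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=10_000)      # complete data
miss = rng.random(x.size) < 0.3                    # 30% missing completely at random
observed = x[~miss]

imputed = x.copy()
imputed[miss] = observed.mean()                    # substitute the observed mean

print(f"sd of complete data : {x.std(ddof=1):.2f}")
print(f"sd of observed part : {observed.std(ddof=1):.2f}")
print(f"sd after mean fill  : {imputed.std(ddof=1):.2f}")   # noticeably smaller
```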
Familiarity with the fact that data are often not of the highest quality has led to the development of relevant statistical methods and tools, such as detection methods based on integrity checks and on statistical properties (e.g. comparing distributions with expected distributions in electoral data, or using the Benford distribution for leading digits); see, for example, Hellerstein (2008) and de Jonge and van der Loo (2013). However, this emphasis has often not been matched within the realm of machine learning, which places more emphasis on the final modelling stage of data analysis. This can be unfortunate: feed data into an algorithm and a number will emerge, whether or not it makes sense. However, even within the statistical community, most teaching implicitly assumes perfect data. This is entirely reasonable: if one is aiming to teach the basic concepts of regression, one does not want to spend time pointing out the consequences of missing data, digit heaping or digit transposition. Nonetheless, students do need to understand the reality of data analysis. This leads to our first challenge.
Challenge 1. Statistics teaching should cover data quality issues.
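By way of illustration of the statistically based integrity checks mentioned above, the following sketch screens the leading digits of a column of positive amounts against the Benford distribution; the simulated data and the informal chi-square comparison are illustrative assumptions rather than a recommended procedure:

```python
# Sketch: first-digit (Benford) screen for a column of positive amounts.
# Illustrative only; simulated data, no formal decision rule is implied.
import numpy as np

def leading_digit(values):
    v = np.abs(np.asarray(values, dtype=float))
    v = v[v > 0]
    return (v / 10 ** np.floor(np.log10(v))).astype(int)

def benford_chi2(values):
    digits = leading_digit(values)
    observed = np.bincount(digits, minlength=10)[1:10]
    expected = np.log10(1 + 1 / np.arange(1, 10)) * digits.size
    return ((observed - expected) ** 2 / expected).sum()

rng = np.random.default_rng(1)
genuine = rng.lognormal(mean=4, sigma=1.2, size=5_000)        # roughly Benford-like
suspect = np.r_[genuine, np.full(500, 400.0)]                 # an artificial spike of identical amounts

for name, data in [("genuine", genuine), ("suspect", suspect)]:
    print(name, f"chi-square distance from Benford: {benford_chi2(data):.1f}")
```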
Even if data may depart from perfect quality in an unlimited number of ways, it is important to characterize as many ways as possible, and Kim et al. (2003) have produced a general ‘taxonomy of dirty data’. They characterized data as dirty ‘if the user or application ends up with a wrong result or is not able to derive a result due to certain inherent problems with the data’, and they identified various possible causes of the problem, including data entry errors, data update errors, data transmission errors and also bugs in a data processing system. Particular applications are likely to have their own characteristic types of error, and it seems likely that an 80/20 rule will often apply, with a large proportion of errors being of just a few types, so that relatively little effort will lead to substantial initial improvement in overall quality. An illustration of this was given by Lewis and Woods (2013), who identified the main causes of error in value-added tax data to be just four types: scanning errors, unit errors, incorrect quarterly data and errors in individual responses. De Veaux and Hand (2005) gave examples of data errors and their consequences, and national statistical institutes often define several dimensions of quality, including accuracy, relevance, timeliness, existence, coherence, completeness, accessibility and security (see, for example, Eurostat (2000) and the archives of other national statistical institutes; Meader and Tily (2008) and Biemer et al. (2014)), though these will affect administrative data in varying degrees.
The ‘relevance’ aspect in the national statistical institutes list is more subtle than simply finding a mistake in the data. Even perfectly accurate data may be useless for answering a particular research question if the data have not been collected with the research question in mind—as is typical with administrative data. Clearly we can try to ease that difficulty if we know beforehand what questions are likely to occur, but even then difficulties can arise. For example, in a project aimed at constructing a scorecard to predict likely default on bank loans, one of the (relatively highly predictive) variables was ‘is the applicant a home owner or renter?’. This was administrative data of the second kind mentioned above—the question was not relevant to everyday operations. But as a consequence of this the people tasked with recording the data failed to see its importance, with the result that they initially recorded it for only a small percentage of customers.
If administrative data are subject to restrictions arising from operational imperatives, they are also subject to possible constraints from the opposite direction: administrative data are often communicated, compared and aggregated across bodies collecting the data. For example, national statistics for US states and countries within the European Union will be aggregated to produce Federal statistics and European Union statistics respectively. The need to do this imposes constraints on what must be collected and on its format, with particular standards requiring particular structures, formats and protocols, as well as content.
As mentioned above, administrative data are also susceptible to changes of definition, which can adversely affect things like time series, rendering them non-comparable over time. Since much administrative data, especially those concerned with government and public policy, are subject to regulation and legislation, changes in laws can have an unfortunate effect, at least from the perspective of the statistician hoping to use the data to make inferences. Changes in what data can be stored, or the characteristics which are allowed to be used in statistical models, can mean that earlier models become unusable.
Data can be incorrectly entered, even for operational purposes. We have all heard of ‘fat finger’ errors leading to mistaken financial transactions. Other classic examples include things like weights of 1 lb being miscoded as 11 lb, data being entered in incorrect columns, abbreviations leading to confusion (e.g. MS for Microsoft or Morgan Stanley), incorrect time stamps due to clocks being mis-set, mistakes in the use of measurement units, simple misspellings and instrument failures not being detected (leading to, for example, an unnoticed stream of 0-values). The list of examples is endless. Kruskal (1981) observed that
‘A reasonably perceptive person, with some common sense and a head for figures, can sit down with almost any structured and substantial data set or statistical compilation and find strange-looking numbers in less than an hour’.
However, even data which are entered correctly and unambiguously for operational purposes can lead to errors when subjected to statistical analysis. Alternative, equally legitimate spellings or identifiers (e.g. David Hand, David J. Hand and D. J. Hand) may not be recognized as equivalent in a subsequent analysis unless they have been explicitly characterized as such. Conversely, identical entries might refer to different objects (e.g. father and son with the same name). Missing values for age coded as 999 can be analysed as legitimate ages, with obvious adverse consequences. Although clearly this should be flagged in the metadata, we note, again, that large data sets necessarily involve an opacity that does not affect small data sets, in that the computer is a necessary intermediary between the data and the analyst. Mistakes and ambiguities can slip through.
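A minimal sketch of the sentinel code problem just described, using simulated ages and assuming 999 as the missing value code:

```python
# Sketch: a sentinel code such as 999 for 'age unknown' silently distorts summaries
# unless it is flagged as missing first. Simulated data; 999 is an assumed code.
import numpy as np

rng = np.random.default_rng(2)
age = rng.integers(18, 90, size=1_000).astype(float)
age[rng.random(age.size) < 0.05] = 999               # 5% recorded with the sentinel

print(f"naive mean age   : {age.mean():.1f}")        # pulled upwards by the 999s

cleaned = np.where(age == 999, np.nan, age)          # flag the sentinel as missing
print(f"flagged mean age : {np.nanmean(cleaned):.1f}")
```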
The argument has been made that errors in data will often affect only a very small part of the data, and so will, for example, have no significant effect on large-scale conclusions. Although this may be true, large-scale conclusions are typically not the only ones which will be drawn from administrative data. One of the particular strengths of such data is that they are also used for small-scale investigations—to explore subgroups or for small area statistics, for example. In such cases, errors in only a few records can have important consequences.
Quality issues may also arise when data sets, of adequate quality in themselves, are merged. Take, for example, time series which are out of phase, or have different frequencies of publication, or publish on different dates or, even worse, are irregular.
These considerations lead to several challenges.
Challenge 2. Develop detectors for particular quality issues.
Challenge 3. Construct quality metrics and quality scorecards for data sets.
Challenge 4. Audit data sources for quality.
Challenge 5. Be aware of time series discontinuities arising from changing definitions.
Challenge 6. Evaluate the impact of data quality on statistical conclusions.
‘Data = all’?
The phrase ‘data=all’ is sometimes encountered in the context of administrative data. This is intended to convey the notion that the data are not merely a sample from the population of objects but are its entirety: all credit card transactions, all supermarket purchases, all tax records, and so on. The implication is that having data describing the entire population means that we need not worry about sampling errors or errors arising from non-representativeness. This, however, is misleading. Administrative data tell us what happened with a particular group of people, but this group of people may or may not be the group about which we wish to make statements or from which we wish to generalize. Very often, for example, the selection process which results in their being chosen will include an aspect of self-selection.
A few examples will illustrate some of the difficulties.
Retail banks and other financial institutions construct scorecards to predict likely customer behaviour with financial products. For example, such models are used to predict who is likely to default on a loan, and hence whom to give loans to. Administrative data are then collected on the customers who are awarded loans as they make their repayments. In particular, outcome data—whether they defaulted or not—are collected. Such data, the outcomes along with potential predictor variables (from application forms or behaviour on other financial products), can then be used to construct models to make loan decisions on future applicants. Unfortunately, this data set will not be representative of the population of applicants. It will only include people who were previously thought to be good risks. This means that models based on it could give seriously distorted predictions for people who are drawn from the entire population of applicants (Hand and Henley, 1993; Hand, 2001). ‘All’ of the data are there, but they are not all of the data that one needs.
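The point can be illustrated with a small simulation; the sketch below is not the method of the cited papers, and the acceptance rule, coefficients and variable names are invented for illustration. A scorecard fitted only to previously accepted applicants gives a distorted estimate of default risk for the full applicant population:

```python
# Sketch: a model fitted only on previously accepted applicants can give distorted
# predictions for the full applicant population. Simulated data; the acceptance
# rule, coefficients and variables are illustrative assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 50_000
income = rng.normal(size=n)
z = rng.normal(size=n)                               # used by the old scorecard, unavailable to the new model
p_default = 1 / (1 + np.exp(0.5 + 1.0 * income + 1.0 * z))
default = rng.random(n) < p_default

accepted = income + z > 0                            # only previously 'good risks' were given loans

def population_default_prediction(mask):
    X = sm.add_constant(income[mask])
    model = sm.Logit(default[mask].astype(float), X).fit(disp=0)
    X_all = sm.add_constant(income)
    return model.predict(X_all).mean()               # predicted default rate for ALL applicants

print(f"true applicant default rate          : {default.mean():.3f}")
print(f"model built on all applicants        : {population_default_prediction(np.ones(n, bool)):.3f}")
print(f"model built on accepted only (biased): {population_default_prediction(accepted):.3f}")
```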
An example that is currently attracting a huge amount of attention is publication bias and associated phenomena in scientific literature. We can obtain data on all papers that are published, but they arrive at publication through a complex sociological selection process: papers reporting positive results are more likely to be submitted, editors are more likely to publish them, anomalous results may be regarded as errors so the work is not written up, and so on. So what we see is a distorted view of the work that is done and the results that are obtained, so much so, in fact, that John Ioannidis could publish a paper with the title ‘Why most published research findings are false’ (Ioannidis, 2005), stimulating much interest and subsequent work. The notion that the published scientific literature represents ‘all’ the relevant material is simply false.
The Crimemaps system provides another example. Originally developed in Chicago, based on police-recorded crimes, this gives (approximate) locations of crimes, displayed on maps so that people can see which areas are dangerous. However, research from the Direct Line insurance company in the UK suggests that large numbers of people are not reporting crimes because of the potentially adverse effect it will have on house prices and hence their ability to sell or rent their house (Direct Line, 2011). The data are purely administrative—from the police databases—but can become progressively more distorted. This may mean not only that the data are of limited value for determining which areas are risky, but also that they become increasingly less valuable for their original purposes. This is a straightforward illustration of Campbell's law:
‘The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor’.
This law applies just as much to administrative data as to survey or other data.
Even something as apparently automatic and complete as data from financial markets exhibits numerous errors and omissions. These can arise from the use of different time stamps on transactions, ambiguity over the time resolution of transactions, the extent and method of aggregation of data, whether or not certain types of transactions are present (e.g. so-called 'dark pool trades'), changes to symbols identifying corporations perhaps leading to mismatches or failed matches when data are linked, confusion arising from stock splits or mergers, and so on. The financial data sources described in the Caltech Quantitative Finance Group guide to market data at http://quant.caltech.edu/historical-stock-data.html illustrate the problems.
The bulk of traditional survey work is based on known probabilities of including each person in the study, with sampling theory permitting solid inferential conclusions, but sometimes non-probability samples are chosen, e.g. convenience sampling, matched sampling, network sampling and situations in which people are allowed to opt in or opt out. These kinds of non-probability samples have stimulated research on drawing valid population conclusions (see, for example, Baker et al. (2013) and Bethlehem (2010)) although it may not be straightforward to apply the methods to administrative data. In general, the data distortion will have different consequences in different contexts and with different problems. For example, for one particular credit scoring data set, Crook and Banasik (2004) drew the conclusion that
‘even where a very large proportion of applicants are rejected, the scope for improving on a model parameterised only on those accepted appears modest. Where the rejection rate is not so large, that scope appears to be very small indeed.’
So, for their example at least, the problem seems not to be too bad, but it would be unwise to assume that this is generally so.
On top of all this, there are more complicated questions of what is meant by ‘complete’. For example, many systems are dynamic and constantly changing. A database of all the people in a company today will provide at best only a snapshot—it is likely that some employees will have moved on or been recruited by next year. Indeed, it is certain that the individual employees will have changed by next year—if only because they will have aged, let alone possible name changes due to marriage, address changes due to moving house, and so on. This population drift poses interesting statistical challenges, and it again points out the weakness of the assertion that administrative data represent ‘all’ the data one needs.
Statistical methods have been developed for correcting for sample distortion (e.g. Heckman (1976) and Copas and Li (1997)), but they depend on making assumptions about the form of the distortion. Statisticians can do amazing things, but they cannot perform miracles and if the data have been chosen in an arbitrary and unspecified way there is little that can be done. If this were not so we could always draw accurate conclusions from the most limited of data. And this is precisely why survey sampling and experimental design have grown into such elaborate disciplines: they specify and constrain how the data must be collected so that valid conclusions can be drawn from a statistical analysis. Administrative data, in contrast, without this underlying statistical imperative, may not be so useful for drawing statistical conclusions. They may be selected in precisely those ‘arbitrary and unspecified ways’.
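As a minimal illustration of the general point, rather than of the specific methods cited, the sketch below assumes that the selection mechanism is known exactly and corrects by inverse-probability weighting; if that assumption about the form of the distortion fails, so does the correction:

```python
# Sketch: correcting for a known selection mechanism with inverse-probability
# weighting. Illustrative only; the cited methods rely on different assumptions
# about the selection process, and here p_select is assumed known exactly.
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(loc=10, scale=2, size=100_000)        # population values
p_select = 1 / (1 + np.exp(-(y - 10)))               # larger values more likely to be recorded
selected = rng.random(y.size) < p_select

naive = y[selected].mean()                           # biased upwards
weights = 1 / p_select[selected]                     # valid ONLY if p_select is truly known
corrected = np.average(y[selected], weights=weights)

print(f"true mean     : {y.mean():.2f}")
print(f"naive mean    : {naive:.2f}")
print(f"IPW-corrected : {corrected:.2f}")
```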
In short, the fact that administrative data arise through an administrative process does not mean that they represent the entire population of interest. Some major successes in the world of 'big data' have been achieved by simply analysing the data as they present themselves, but some major failures, such as the initial Google flu trends projections (Hodson, 2014), have also arisen from taking the data at face value.
These points lead to the following challenges.
Challenge 7. Explore potential sources of non-representativeness in the data.
Challenge 8. Develop and adopt tools for adjusting conclusions in the light of the data selection processes.
Answering the right question
As the previous sections have illustrated, there can be difficulties in using administrative data to answer specific research questions. This might be because the data were not collected with those questions in mind, because of quality issues that are irrelevant to operations but highly relevant to subsequent statistical analysis, because of changes in definitions of the recorded data items or for other reasons. This brings us back to a point that was made earlier: it can be useful, if it is possible, to have statisticians involved in the data collection process. They might be able to think ahead and to expand the range of data collected so that they will be more able to answer future questions.
Statistical analysis methods are often divided into descriptive and inferential. Descriptive methods are used to summarize a body of data so that the important messages within it can be readily grasped. We might summarize a distribution of values by their mean and standard deviation, or the results of a census by using a series of counts organized as cross-tabulations. It goes without saying that the summary statistics that are appropriate will depend on the subject matter and on the questions to which answers are sought. Administrative data are often used for purely descriptive purposes—perhaps especially so in official statistics contexts, where we might want to establish the characteristics of some population.
In contrast, inferential methods are used to make a statement about unobserved values or underlying mechanisms. We might be trying to infer the disease of a new patient, on the basis of analysis of patients with similar symptoms diagnosed in the past. We might be trying to forecast whether inflation will go up or down next month. We might be trying to elucidate an underlying mechanism, so that we can understand how the data were generated, and perhaps influence things in the future. Much of the statistical theory of inference is based on the notion of random sampling from a (possibly infinite) population of values. Because the sampling is random, solid mathematics (such as the law of large numbers and the central limit theorem) means that sound statements can be made about the characteristics of the population from summary statistics obtained from the sample. Moreover, error bounds can be put on the conclusions. We can say things such as ‘on average, 99 out of 100 of our intervals will cover the true population mean’, so we can be confident of our results (always subject to data quality issues, of course).
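The coverage statement can be checked directly by simulation; the following sketch assumes genuine random sampling from a normal population, with the population mean and other settings invented for illustration:

```python
# Sketch: under genuine random sampling, a 99% confidence interval for the mean
# covers the true population mean in roughly 99 of 100 repetitions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
true_mean, n, reps = 100.0, 50, 10_000
t_crit = stats.t.ppf(0.995, df=n - 1)                # two-sided 99% interval

covered = 0
for _ in range(reps):
    sample = rng.normal(loc=true_mean, scale=15, size=n)
    half_width = t_crit * sample.std(ddof=1) / np.sqrt(n)
    covered += abs(sample.mean() - true_mean) <= half_width

print(f"empirical coverage: {covered / reps:.3f}")   # close to 0.99
```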
But administrative data are not collected by such a random sampling process. We can certainly calculate descriptive statistics, summarizing the data before us and, if we are willing to assume that the data are perfect, with no missing or distorted values (a brave assumption, as the above discussion has illustrated), then this will accurately summarize the population which led to our data. We can make a statement such as ‘this is the true population mean’.
But, if our aim is not really to summarize the data at hand but to make an inference to another (often future) population, then additional, unknown and quite possibly unquantified, sources of uncertainty may be relevant—possible incompleteness of the data set, discussed above, is only one example. This has consequences for inferential statements.
Of course, these unknown and possibly unquantified sources of uncertainty beyond sampling variation also affect the sampling approach. Indeed, it is likely that they have been underappreciated in many contexts, where the elegance of the sampling theory mathematics has distracted attention from the fact that there are other sources of uncertainty. This could be one of the drivers behind the phenomena to which John Ioannidis has drawn attention, mentioned above. Hand (2006) has given examples and implications of this oversight in a different context.
Administrative data are often highly complex. For example, a single credit card transaction leads to 70–80 items of data being recorded, whereas Web search and social media data have an elaborate graph structure. And this leads to the observation that data capture technology changes rapidly. In the context of the first of these examples, a recent change is a shift towards mobile phone banking, leading to new and additional transaction characteristics being available. In the context of the second, changes in social media platforms mean that there is a very real risk that any particular kind of model based on data recorded from Web transactions may be impossible to build just a couple of years in the future, as social interaction media change and evolve. And similar problems apply in other areas—in medicine, for example, with different kinds of bioinformatic measurement methods, in the financial sector with short time series because of changing regulations, and so on. At a substantive level, this has clear implications for studying how society is developing. At a statistical analysis level, it drives home the point that was made above, that administrative data are collected for operational reasons, and may have serious weaknesses for subsequent analysis.
Economic and social measures such as gross domestic product, the consumer price index and national wellbeing are what are called pragmatic measures (see, for example, Hand (2004)): the definition of the concept and the way that it is measured are two sides of the same coin. Change the measurement procedure and you change the thing being measured, with different measures being suitable for different purposes. This is why, in the UK, we have CPI, CPIH, RPI, RPIJ, and so on. It is not a question of any of these being more ‘right’ than the others, but simply that they measure slightly different things. This means that they have different properties and are suited to answering different questions. Increasingly, interest is turning to the possibility of using administrative data for measuring productivity and price inflation. Instead of conducting surveys of businesses to obtain data, the data can be automatically transmitted from the transaction to the database. Scanner data, such as retail purchase data obtained directly from the point-of-sale machine, provide an example, yielding data that are ideal for use in price index calculation. Moreover, such data also give information on the volume of different goods purchased, so that weights can be chosen. But issues of selection bias still apply: not all purchases are made through such routes, and we cannot assume that those purchases which are made in this way represent a random or representative sample of all purchases.
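A minimal sketch of the kind of calculation involved, using a simple Laspeyres-type index on invented scanner records (real index construction involves far more, including product churn, quality adjustment and aggregation structure):

```python
# Sketch: a simple Laspeyres-type price index from hypothetical scanner data.
# Items, prices and quantities are invented for illustration.
base = {          # period 0: (price, quantity sold)
    "milk":   (0.90, 1200),
    "bread":  (1.10,  800),
    "coffee": (3.50,  300),
}
current = {       # period 1: price
    "milk":   0.95,
    "bread":  1.15,
    "coffee": 3.40,
}

num = sum(current[item] * q0 for item, (p0, q0) in base.items())
den = sum(p0 * q0 for p0, q0 in base.values())
print(f"Laspeyres index: {100 * num / den:.1f}")     # base period = 100
```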
A variant of this uses Web-scraped price collection, being explored by various national statistical institutes. For example, the billion prices project (Cavallo and Rigobon, 2016) seeks to collect massive amounts of price data from Web sites. Apart from its vast coverage (‘big data’) this means that much more timely estimates can be obtained much more cheaply than by traditional methods. Note, however, that Cavallo and Rigobon (2016) did not describe this as an alternative to traditional methods, but as a complement.
This approach has had some notable successes—for example, Cavallo (2013), using such on-line data collection, estimated Argentina's annual inflation rate between 2007 and 2011 to be over 20%, which was a striking contrast with the 8% claimed by the Argentinian Government. Moreover the on-line estimates were reported daily.
The apparent simplicity of this approach risks concealing various complications. It is still necessary to decide which Web sites to collect data from, and this choice might be biased towards larger retailers. In fact, Cavallo and Rigobon (2016) said 'we … focus almost exclusively on large multichannel retailers and tend to ignore online-only retailers (such as Amazon.com)'. The basket of goods to be included must still be chosen. Cavallo and Rigobon noted that on-line prices cover a much smaller set of retailers and product categories than is covered by the traditional approach. Also, on-line prices are one thing, but they say nothing about the quantity sold.
A key question, perhaps obvious in view of the earlier sections, is how representative the on-line prices are of prices in general: what about prices of goods or services that are not bought on line? Moreover, in certain contexts, such as airline tickets, dynamic pricing systems operate, which introduce not only changes over time, but also the effect of gaming strategies.
One of the problems with Web-based tools is the rate of change of that technology. Companies appear, grow to a massive size and vanish at a dramatic pace. Bebo, for example, launched in January 2005 and was sold to America Online just 3 years later for $850 million. But then, in May 2013, it voluntarily filed for Chapter 11 bankruptcy protection. Worse still, the algorithms that are used by the companies can change arbitrarily. Google's search algorithm is constantly being redeveloped. As we have noted above, this means that administrative data may have a short shelf life, in the sense that comparative data and time series may not merely have discontinuities as definitions change but may also experience changes in ill-defined, even ill-understood, ways.
Since survey data will be collected with a view to answering specific questions, the variables will be relevant by definition. Variables from administrative data may be less relevant. This means that derived variables will be more important for the analysis of administrative data. These are variables that are created by combining other variables. For example, whereas a survey question might ask about disposable income directly, to obtain the corresponding value from administrative data might require adding earned income, interest from bank and other deposits as well as other sources of income, subtracting tax paid, and so on.
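A minimal sketch of constructing such a derived variable from separate administrative fields; the field names and the definition of disposable income used here are invented for illustration:

```python
# Sketch: deriving 'disposable income' from separate administrative fields.
# Field names and the definition used are invented; real definitions vary.
record = {
    "earned_income":   28_000,
    "interest_income":    450,
    "other_income":       900,
    "tax_paid":         5_600,
}

disposable_income = (record["earned_income"]
                     + record["interest_income"]
                     + record["other_income"]
                     - record["tax_paid"])
print(f"derived disposable income: {disposable_income}")
```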
As a final example of definitional difficulties, the media were recently exercised by an apparent discrepancy between the number of long-term migrants to the UK, estimated by the International Passenger Survey, and the number of national insurance number registrations (administrative data). Close examination (Office for National Statistics, 2016b) revealed that the discrepancy was due to differences in definitions. They commented that
‘it is not possible to provide an accounting type reconciliation that simply “adds” and “subtracts” different elements of the NINo registrations to match the LTIM definitions’.
All of this leads to the next challenge.
Challenge 9. Explore how suitable the administrative data are for answering the questions. Identify their limitations, and be wary of changes of definitions and data capture methods over time.
Administrative data, typically being observational data, permit hypothesis-generating exercises. Whereas survey data have the advantage that they will be tuned to answer the survey questions, and administrative data may not be well suited to answer those questions, the converse also applies: administrative data, often being much richer than survey data, can be used to explore other questions and to generate hypotheses based on relationships that are observed in the data.
Here is one example illustrating both the complexity of human behaviour and the use of administrative data in detecting unsuspected patterns in that behaviour.
Hand and Blunt (2001) sought to model the distribution of sizes of credit card transactions at petrol stations in the UK, on the basis of administrative data recorded at petrol stations. Superficially, the distribution was as expected—roughly normal, but with some right skewness since it could take only positive values. However, closer investigation revealed some anomalous spikes. The size of the data set meant that these spikes could not be attributed to random variation arising from the particular period being studied but must have represented an underlying reality. (Note that the data were all of the transactions but were being analysed as a sample to make inferences about underlying mechanisms, and hence about what might be expected to happen at other times.) Closer investigation led to the observation that there are two different types of behaviour pattern: some people simply fill the petrol tank at each purchase, whereas others seek to hit a convenient whole number of pounds cost, such as £20 or £30. Noting this, and digging deeper, led us to recognize further patterns of behaviour: there was more overshoot than undershoot; people preferred to hit whole numbers of pounds of any magnitude rather than numbers ending in a non-zero number of pennies (though especially those which were a multiple of £10); subject to that, they particularly favoured numbers ending in 50p, and then 25p, and so on. Things were further complicated by the fact that a significant proportion of spend in such situations is in the forecourt shop, where goods have particular prices, often with special values of their own (e.g. ending in 99p). And, as if all that was not enough, there were further features in the data which arose as a consequence of marketing initiatives run by the forecourt operations. In the end, we constructed an elaborate mixture model which tried to take all these phenomena into account. This model was purely descriptive, though inference was needed to decide whether effects were sufficiently large (in the context of the particular data set being supposedly drawn from a superpopulation of possible such data sets) to be included. However, the aim was not merely to describe what customers had done in the past, but to use the model to inform future pricing strategies.
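As a flavour of the 'spike plus continuous' structure involved, the following minimal sketch (not the model of Hand and Blunt (2001); the simulated amounts and the choice of round-pound spike locations are illustrative assumptions) separates point masses at round amounts from a roughly normal component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated transaction amounts in pounds: a roughly normal 'fill the tank'
# component plus customers who stop at a convenient round amount.
amounts = np.concatenate([
    rng.normal(32, 8, 7000),                      # fill-the-tank purchases
    rng.choice([10.0, 20.0, 30.0, 40.0], 3000),   # round-amount purchases
])

round_points = np.array([10.0, 20.0, 30.0, 40.0])
is_spike = np.isclose(amounts[:, None], round_points, atol=0.005).any(axis=1)

# Crude two-part description: point masses ('spikes') at round amounts and a
# normal component for everything else. A fuller treatment would allocate
# observations to components probabilistically, e.g. by EM for a mixture.
print(f"spike proportion: {is_spike.mean():.2f}")
print(f"continuous part: mean {amounts[~is_spike].mean():.1f}, "
      f"sd {amounts[~is_spike].std():.1f}")
```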
A particular merit of administrative data, and especially of transaction data, is that they are recorded as time progresses. Unlike data that are recorded at a particular time, or at a discrete sequence of times as in repeated surveys, they are essentially continuous. This means that administrative data can be very useful for early detection of changes in populations. Indeed, often one of the operational reasons for collecting the data in the first place will be for monitoring processes. But an assertion that a time series (of gross domestic product or unemployment, say) has changed will typically not be intended merely as a face value assertion that the raw numbers differ, but rather as an assertion that some underlying reality has changed. And this should not be based on a simple comparison of the numbers, but rather on a comparison of the difference between the numbers with the inaccuracy of measurement. It should answer the question: is this difference larger than we would expect, given the intrinsic uncertainty in how the measured numbers represent the reality, or is it well within the scope of what we might expect with no change in the underlying reality? And the crucial point is that this intrinsic uncertainty should include all sources of uncertainty, not merely sampling variation (if that is indeed relevant).
This leads to the following challenge.
Challenge 10. Report changes and time series with appropriate measures of uncertainty, so that both the statistical and the substantive significance of changes can be evaluated. The measures of uncertainty should include all sources of uncertainty which can be identified.
Causality and intervention
As is well known, observational data present challenges in establishing causality. If we observe a difference in some outcome measure (e.g. income) between two groups, and we note that the groups differ in various properties (e.g. education), we cannot be sure that the observed differences in their properties explain or cause the difference in outcome. To establish causality, we need to intervene to break all possible causal links except the link that we wish to test (but see also Pearl et al. (2016)). The most common way to do this is via a properly controlled experiment involving randomization. Usually this is difficult with administrative data, not least because it requires modifying the standard operation of the organization, although occasionally experimental designs are built into on-going operations, enabling comparisons to be made by using administrative data. In such situations the designs will typically be fairly simple, such as merely comparing two groups.
This notion of modifying an operation so that we can learn from it, as well as simply using it to carry out its normal function, can manifest in other ways. One mail order organization that we worked with enrolled a ‘gold sample’—a small set of people regarded as poor risks (who would normally be rejected), just so that they could collect data across the entire population distribution, and hence enhance their models and improve future predictions. Scholtus et al. (2015) have also explored the use of such a gold sample, in their case to yield estimates of intercept bias in a model.
We see from this that the needs of planned subsequent statistical analysis can sometimes influence what administrative data are to be collected. Occasionally, such intentions can lead to data being collected during the operations which are not required to run the organization but which can be used subsequently—the second kind of administrative data mentioned in Section 1.
Challenge 11. Be aware that administrative data are observational data, and exercise due caution about claiming causal links.
Combining data from different sources
Combining data and evidence from different sources is increasingly important in statistics and elsewhere. This can be for statistical purposes, such as to yield an improved or more comprehensive estimate (e.g. Ashley et al. (2005) and Cunningham and Jeffery (2007)), or simply because information is needed for a higher level organization (e.g. combining statistics from several countries to give European-Union-wide statistics). But it can also be at the individual level, e.g. in detecting fraud or adverse drug reactions, or tracking terrorist activity.
Even if the data are, at least in principle, of the same type, such as combining economic statistics from different countries, they may have been collected by using different methods or definitions, so producing combinations or aggregates is not necessarily straightforward. Vâju et al. (2015) described
‘a huge number of possible sources of lack of comparability, given by combinations of (i) national legal and institutional environments, (ii) acceptable trade-off between quality dimensions at national level, (iii) appropriate trade-off between costs and benefits in terms of output data quality at national level, (iv) methodological choices to integrate the several data sources’.
At a lower level, problems might arise because a particular characteristic might be grouped in different ways in two data sets (e.g. age classified into 10-year bands or into young versus old), or observations might be taken or recorded with different periodicities. Possible strategies for overcoming such problems include the latent variable perspective, with the observed data being regarded as a coarsened or grouped version, or state space models (Horn and Czaplewski, 2013).
The situation is further complicated because the data are often of different types—survey data, administrative data, Web-scraped data, social network data, data collected from wrist health and activity monitors, and even non-numerical forms of data such as speech and image data. This is perhaps where the real opportunities, and statistical challenges, arise. Medicine, in particular, is making extensive use of such approaches, combining medical images, clinical trial reports, epidemiological data and health registry data. Credit bureaus combine credit card transaction records from several operators to build a single database from which they can construct a generally applicable credit scorecard. An example from official statistics in the UK is the estimation of income within small geographical areas, based on linking data from the Family Resources Survey and administrative data from benefit claimant counts, council tax bandings and tax credit claims. Vâju et al. (2015) pointed out that, even if the accuracy of the separate sources of data can be measured, assessing the sensitivity of the accuracy of the final combined data set to the source-specific errors and the integration methods can be very difficult.
Reasons for merging data from different sources include the following.
- (a)
Complement: different sources and types of data can each serve as a complement to each other by providing different types of information. This is perhaps particularly true for administrative and survey data. Some types of variables—attitudes and opinions, for example—do not naturally arise in administrative data but must be collected by surveys (or panels, or some other purposive data collection strategy). Surveys can be designed so that they shed light on tightly focused research questions, whereas with administrative data we may have to be satisfied with questions which are a little different from those we would ideally like to ask, perhaps because they are based on slightly different definitions of the concepts involved. In contrast, administrative data sets are likely to be larger, with better population coverage (though possibly vulnerable to the other data quality issues that were mentioned above).
- (b)
Supplement: although administrative data are often thought of as an alternative to survey data, they are at least as valuable when used in conjunction with survey data. Survey data can be used to pinpoint particular research questions, but cost necessarily limits coverage. However, relationships that are found from survey data can be extrapolated to yield estimates for overall populations and smaller groups by using such tools as regression estimation applied to an administrative data population base. This can be useful to yield small area and regional estimates. Indeed, such statistical tools can be used to improve estimates from survey data. A further point is that surveys require sampling frames, and administrative data are central to their construction.
- (c)
Accuracy: we have stressed issues of data quality above. Triangulation and imputation from multiple sources of data and reconciliation between sources of data are good ways to tackle these issues. Berka et al. (2012) gave an example, exploring accuracy in the Austrian register-based census of 2011. They noted the use of surveys to check register data but pointed out that this is resource intensive. They evaluated the quality of data at the raw data level in terms of three ‘hyperdimensions’, assessing documentation (e.g. plausibility and legal aspects), preprocessing (formal methods for testing for errors and inconsistencies) and comparison with an external source. The results are three measures, each scored in the interval 0–1. A weighted average is taken to yield an overall quality indicator for each register and attribute. The fundamental challenge here is that of combining quality indicators from different sources, and Berka et al. (2012) explored the use of Dempster–Shafer theory to do this.
Another example was given by Romanov and Gubman (2013), who used administrative data to explore bias in answers to survey questions about income. Discrepancies pinpoint potential errors and issues to be resolved. Of course, there are complications. Errors can propagate and perhaps not all of them can be resolved. Worse, especially in the context of administrative data, this jigsaw solution is vulnerable to one of the pieces disappearing as the operational imperatives generating the administrative data change. Moreover, as we have repeatedly stressed, one must be alert to different sources of data using different definitions.
A special case of merging data from different sources is matching data from different administrative databases. For example, we may have identified data on individuals, collected for different reasons and stored in two distinct databases, and we may want to combine them. But, of course, the problem is not restricted to data on individuals: Lewis and Woods (2013) described a problem of incompatible business registers, with different identifiers in the two databases. Because of its importance, the matching of corresponding records from different databases has been the focus of much research effort—see, for example, Christen (2012), D'Orazio et al. (2006) and Rässler (2002). It faces various data analytic challenges, including deciding when to match two records given that they do not have unique and identical identifiers, detection of duplicate records (again, because slightly different identifiers may refer to the same individual person or object) and merging of duplicate records into a single entity (or deduplication).
A traditional, and still widely used, method, at least for small data sets, is manual matching. This has some obvious shortcomings, including a scalability cost (in various measures), subjectivity arising from human biases, variation between people, variation within any one person as they become tired or bored, and the difficulty of objectively improving performance. A modern variant of manual matching, for contexts where confidentiality is not important or where the data to be matched can be effectively encrypted, is crowdsourcing, enlisting the help of large numbers of people.
Computational methods can be divided into two classes: deterministic and probabilistic or statistical.
Deterministic methods simply see whether two records agree on all of a specified set of identifiers. This is clearly very quick. It can be a single-step procedure or can proceed through sequential steps, beginning with stringent matching criteria and progressively relaxing them.
Probabilistic methods relax the requirement of an exact match and instead calculate a dissimilarity measure for each field in the pair of records being compared. The choice of dissimilarity measure will depend on the context (e.g. approximate string matches for some text fields, matches that allow different date formats and matches that allow the given name and surname to occur in the reverse order). The separate field dissimilarity measures are then combined (e.g. added or used to maximize the likelihood of a match, given a probability model) to yield an overall dissimilarity measure for the record pair. In the simplest approaches, these dissimilarity measures can then be compared with a threshold to yield a match–non-match classification. More sophisticated approaches (e.g. the classic work of Fellegi and Sunter (1969)) follow the ‘reject option’ and define three types of decisions: match, non-match or possible match. The third class is then subjected to a second stage of investigation, which is often a manual comparison. Winkler (2006) has reviewed linkage methods.
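As a minimal sketch of this kind of probabilistic comparison (the field comparators, weights and thresholds below are illustrative assumptions, not the Fellegi–Sunter calculations themselves), a candidate record pair might be scored and classified as follows.

```python
from difflib import SequenceMatcher

def field_similarity(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_score(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Weighted sum of per-field similarities for a candidate record pair."""
    return sum(w * field_similarity(rec_a[f], rec_b[f]) for f, w in weights.items())

def classify(score: float, upper: float = 2.5, lower: float = 1.5) -> str:
    """Three-way decision in the spirit of Fellegi and Sunter (1969)."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "possible match: refer for clerical review"

weights = {"surname": 1.0, "given_name": 0.8, "date_of_birth": 1.2}
rec_a = {"surname": "Smith", "given_name": "John", "date_of_birth": "1970-01-02"}
rec_b = {"surname": "Smyth", "given_name": "Jon", "date_of_birth": "1970-01-02"}

score = record_score(rec_a, rec_b, weights)
print(round(score, 2), classify(score))
```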
Clearly methods which are based on pair-by-pair comparisons run the risks of intransitivity, of several records from one database being matched to a single record in the other and of computational intractability if all possible pairs are compared. The first two problems, at least, can be eased if a higher level view of the matching process is taken, in which constrained groups of records are compared. To take a simple example, suppose that we wanted to match a collection of left shoes with a collection of right shoes, to find which shoes belonged in a pair. One strategy would be simply to calculate similarities between shoes, one from each collection, and to choose the pairs which had the greatest similarity—but this would be susceptible to the first two of the problems just listed. An alternative approach would be (computation allowing) to look at all possible pairings of shoes, one from each collection, and to choose the set of pairings which maximized the likelihood. Exactly this sort of approach has been used in chromosome matching.
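One way to take such a higher level view in practice, sketched here under the assumption that pairwise similarities have already been computed, is to solve a one-to-one assignment problem rather than match record pairs greedily; the similarity matrix below is invented for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# similarity[i, j] is an assumed, pre-computed similarity between record i in
# database A and record j in database B.
similarity = np.array([
    [0.9, 0.8, 0.1],
    [0.8, 0.3, 0.2],
    [0.1, 0.3, 0.9],
])

# The Hungarian algorithm chooses a one-to-one pairing that maximizes total
# similarity, so no two A-records can claim the same B-record.
rows, cols = linear_sum_assignment(similarity, maximize=True)
for i, j in zip(rows, cols):
    print(f"A record {i} <-> B record {j} (similarity {similarity[i, j]:.1f})")
```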
These considerations lead to the following challenges.
Challenge 12. Be aware of the risks that are associated with linked data sets and the potential effect on the accuracy and validity of any conclusions. Recognize that quality issues of individual databases may propagate and amplify in linked data. Develop better measures of overall combined data quality.
Challenge 13. Continue to develop statistically principled and sound methods for record linkage and evidence assimilation, especially from non-structured data and data of different modes.
Challenge 14. Develop improved methods for data triangulation, combining different sources and types of data to yield improved estimates.
Confidentiality, privacy and anonymization
A common challenge with all data describing human beings is the need to preserve confidentiality and privacy, but this often seems to be a particularly sensitive issue with administrative data. This may be because, unlike with surveys, there may be no choice about being included (at least, if one wants access to the service or product) or perhaps because it is obvious that the identifier must be retained in the data (since it is needed for operational reasons—one cannot run a credit card operation without being able to match transactions to customers). There seems to be growing concern about the data shadows that we all inevitably leave as we access administrative services, whether corporate or public.
Anonymization and deidentification tools do exist—e.g. based on aggregating data, perturbing data or randomly generating data with statistical properties the same as the raw data—but they all have shortcomings. An overview of such methods is given by Duncan et al. (2011); see also Karr et al. (2006), Reiter (2005), Matthews and Harel (2011) and McClure and Reiter (2016). One of the most challenging—and probably intractable—problems is that it is often possible to combine a data set with other publicly available data to identify an individual and to reveal something about them. There have been several well-known public incidents of this kind, such as the identification of individual subscribers from the Netflix prize data set (Narayanan and Shmatikov, 2008) and the identification of the medical records of Massachusetts Governor William Weld (Anderson, 2009).
From the perspective of statistical challenges, work continues to develop statistical methods of disclosure control—such as the development of differential privacy (Dwork and Roth, 2014). More generally, statistical tools are being developed to permit analysis without divulging the identity of individuals. For example, multiparty computation is a strategy to calculate aggregate statistics for a collection of individuals without requiring any individual to give away their value (Cramer et al., 2015).
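To give a flavour of what such methods involve, here is a minimal sketch of the Laplace mechanism used in differential privacy, releasing a noisy count; the query and the value of epsilon are illustrative choices only.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, rng=None) -> float:
    """Release a count with Laplace noise calibrated to sensitivity 1.

    Adding or removing one individual changes a count by at most 1, so noise
    with scale 1/epsilon gives epsilon-differential privacy for this query.
    """
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(42)
print(laplace_count(1234, epsilon=0.5, rng=rng))  # privacy-protecting noisy count
```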
Challenge 15. Continue to explore anonymization and deidentification methods.
Conclusion
In the paper, I have sought to identify and characterize what I thought were the main statistical challenges arising from administrative data. There are other challenges, including the following three.
- (a)
The communication of uncertainty: as statisticians, we are familiar with uncertainty arising from sampling variation, and with methods of communicating that uncertainty, such as confidence intervals. However, since the sources of uncertainty in administrative data are many and diverse, and may not include sampling variation, we need to find other ways to communicate (and indeed perhaps even to define) such uncertainties. In some contexts this is already done. For example, the Bank of England's August 2016 inflation report (Bank of England, 2016), chart 5.1, shows a fan chart with
‘To the left of the vertical dashed line, the distribution reflects the likelihood of revisions to the data over the past; to the right, it reflects uncertainty over the evolution of GDP growth in the future’.
More, however, remains to be done. Manski (2014) has a good discussion of the issues.
- (b)
Statistical education: challenge 1 above was about statistical education, although limited to the context of data quality. Administrative data are becoming so important, and so widely used (as a consequence of automatic data capture), that one can argue a case for more specialized teaching of specific methods related to administrative data.
- (c)
Legal environment: the growth of awareness of modern data analysis technology has stimulated considerable legal and regulatory thought, much of it a consequence of the privacy and confidentiality issues discussed above. On April 14th, 2016, the European Union's General Data Protection Regulation (European Union, 2017) was adopted by the European Parliament, and on April 27th, 2017, the UK's Digital Economy Act received Royal Assent (Her Majesty's Government, 2017). These changes will certainly impact how personal data are stored and are likely to impact statistical analyses of administrative data.
As a final comment, applied statisticians often emphasize the importance of being familiar with the data generation process. Understanding where the data come from and how they are collected can lead to the avoidance of many misunderstandings and mistakes. At first glance it might seem as if this is less critical for administrative data. This, however, is not so. Issues of data quality, changes over time, changing regulatory and legal environments, advances in data capture and access technology and a host of other factors are likely to impact administrative data, and their analysis. In fact, because the data will have been primarily collected for some operational purpose, these changes will almost certainly have been made without any subsequent statistical analysis in mind. They may not even be reported to the statistician who is later analysing the data. It is thus even more important—perhaps essential—that the statistician understands the data collection process. But note that this is a two-way communication. If they are aware of the analyses to be undertaken later, the data producers will be able to adjust their data collection and recording processes to facilitate the subsequent analyses.
The aim of this paper is to raise awareness and to stimulate discussion among statisticians of the need for methodological statistical work on administrative data. Such data are being used increasingly widely—partly a consequence of the 'big data' revolution. But drawing valid conclusions from such data encounters problems that are distinct from the more familiar and well-trodden paths of sampling theory inference. The problems are diverse and heterogeneous, so it is doubtful that a unifying theory as elegant as that of sampling theory can be developed. But nevertheless some principles apply. These include the need to cope with rather different kinds of data quality issues, the recognition that, despite superficial appearances, we typically do not have 'all' the data, possible mismatches between the question we want to answer and the information in the available data, challenges arising from the fact that the data are (usually) merely observational, so elucidation of causality is difficult, the need to combine data from multiple rather different sources, and issues of confidentiality, privacy and anonymization which might be rather different from those of survey data.
Discussion on the paper by Hand
Penny Babb (Office for Statistics Regulation, London)
Reflections on quality, administrative data and official statistics
I am grateful to the Royal Statistical Society for the invitation to propose thanks to Professor Hand for sharing an excellent and thought-provoking paper. It is one that has many lessons for official statistics.
To quote David's paper: ‘quality is not a property of the data set itself, but of the interaction between the data set and the use to which it is put’. This statement brings together three important aspects: the reminder that quality is not an attribute of the data; it speaks of ‘data’ in the collective; and most importantly it relates quality to use.
The nature and degree of quality can vary by how the data are used; the data can be sufficiently good for some purposes but not for others. David also highlighted that we can never be sure that all errors have been detected—a sobering reminder to analysts that the job is never done and not overly to trust either the data or their analysis. Also, in our enthusiasm to make greater use of the wide variety of administrative data sources within official statistics, we need to be wary that new quality issues can arise when merging data of adequate quality.
In his paper David identifies some common misconceptions—here are three which struck me as particularly pertinent for those producing official statistics.
Data collected for operational reasons will be complete and error free: in response David provides a helpful tip—ask data providers what they have done about missing values. It is only when you probe and find out about the data collection that you begin to recognize weaknesses that can have a profound effect on your statistics. But, also, what I really like about David's advice is that it points producers of official statistics to speaking directly to those involved in the data collection—this fits well with the UK Statistics Authority's guidance in the 'Quality assurance of administrative data' (https://www.statisticsauthority.gov.uk/osr/monitoring/administrative-data-and-official-statistics/), in establishing collaborative relationships with data supply partners.
David also recommends being suspicious of anything that seems perfect. Absolutely—I learnt through experience as a junior researcher working with official statistics that interesting findings are usually wrong!
Data equals all: David reminds us that administrative systems reflect the operational needs—the population of interest for statistics may not be the same as needed for the original purpose. And going back to the observation about data as a collective term—it is a plural noun—there are likely to be many different data items, all with their own data collection and quality issues. Do not assume that data quality for a source is homogeneous.
Data errors will only affect a small part of the data set: this is a misconception that any errors will not compromise your use. But remember that a key benefit of administrative data is that they enable analysis at greater levels of disaggregation—for small geographic areas, for subgroups of the population. Errors in only a few records can have important consequences.
Our work on understanding quality assurance issues for administrative data within the Statistics Authority led to the release of a toolkit (https://www.statisticsauthority.gov.uk/wp-content/uploads/2015/12/images-qualityassurancetoolkitcm97-44368.pdf) that sets out four practice areas that producers of official statistics should consider in their on-going preparation of statistics. Our advice can be summarized in two questions.
- (a)
How do you know that the data are sufficiently reliable and suitable to be used to produce official statistics?
- (b)
What do users need to know about their quality to use them appropriately?
And these are questions that are useful for expert users of administrative data to consider as well. Do not overly trust the data but probe them with a challenging mind. And think about the implications of quality issues for the way that you use the data—what effect do they have on your analysis and what does that mean for the way that you present your findings?
David presents 15 challenges in his paper: valuable insights that have great relevance for official statistics producers. Some resonated with what I have learnt from official statistics producers over the past few years, and with the areas where they have had difficulty when working with administrative data, which I have interpreted as follows:
- (a)
develop quality indicators for your data sources,
- (b)
monitor the indicators by using scorecards or dashboards,
- (c)
evaluate the effect of quality issues on statistical conclusions and
- (d)
provide training on understanding and challenging data.
I would also emphasize the need for training on monitoring and reporting on quality, and explaining quality issues to users.
Ultimately, understanding use and potential use is essential to recognizing the relevance of quality issues to users and to producing valuable statistics.
Li-Chun Zhang (University of Southampton and Statistics Norway, Oslo)
I congratulate Professor Hand on this timely and thought-provoking paper. Although there is an established tradition of using 'administrative data' as auxiliary information to enhance the quality of 'statistical data' collected via sample surveys and censuses, more direct secondary statistical uses of administrative data raise many difficult questions (Zhang, 2012; Di Zio et al., 2005). This comment applies also to the use of other forms of 'big data' for official statistics, as for example discussed by Pfeffermann (2013).
Hand points to various challenges arising from the reuses of administrative data. It is equally important to emphasize that combining multiple sources (and techniques) is generally required, because of the deficiencies of any single data set, especially if it is being repurposed. Historically, the comprehensive uptake of administrative data in the Nordic countries was inspired by the idea of possible separation of data collection and production: on the one hand, capture and curation as data are generated; on the other hand, more or less separately, processing and output as the need arises (Nordbotten, 1966). Such a perspective naturally implies both reuses and combination of data.
For integration of data from multiple sources, Zhang (2012) delineated the potential errors, including, but not limited to, those mentioned in the paper. That delineation is based on a two-phase extension of the life cycle schema of administrative data (Bakker, 2010), which in turn is an adaptation of the total-survey-error framework (Groves et al., 2004). De Waal et al. (2017) have surveyed the most typical multisource data situations in official statistics and related statistical methods. Di Zio et al. (2017) have identified various statistical tasks pertaining to
- (a)
transformation of input data objects and attributes to relevant statistical units and measurements for secondary uses and
- (b)
microlevel and macrolevel integration of separate data sets, often with overlapping units and measurements.
In comparison, classical survey sampling theory focuses only on the single task of devising a sampling strategy, consisting of a sampling design and an associated estimator.
Hand calls for developing better statistical methods. Granted that secondary data are not collected by 'a random-sampling process', modelling will be necessary to deal with their deficiencies, whatever they may be. I would like to raise a multifaceted and thorny issue: valid statistical modelling and inference of secondary data for descriptive targets. Now, there is an expanding range of uses of 'big data' by 'smart' methods, which are associated with tangible gains such as sales increases or numbers of people reached by aid operations, where the statistical validity of the relevant approach may seem less of a preoccupation. For analytic (or 'inferential') targets, such as the consumer price index, life expectancy or seasonally adjusted figures, the validity of modelling is of a theoretical nature, as can be noted in the set-up of associated statistical methods of hypothesis testing or model selection. At the other end, the descriptive targets that are traditionally held in official statistics lead inexorably to a confrontation between the model-based estimate and the 'truth', of a kind which in principle can be obtained by a perfectly executed instantaneous census.
Take the case of population count. In some countries it is produced directly from a central population register, for which it is necessary that we accept a definition that is suitable to the coverage of the central population register as is. A second option requires additional data, such as a population coverage survey, to implement a desirable definition that is feasible in the population coverage survey, but at the expense of extra cost and loss of granularity. As a third potential option, suppose that an estimator, denoted by N̂_n, can be calculated under a statistical model, using multiple incomplete administrative registers, where N denotes the unknown population size and n the generic size of the available data sets. Below are some questions regarding the statistical validity of N̂_n and inference of N.
- (a)
How likely is it to have N̂_n/N→1 under some asymptotic setting, as n, N→∞? Is it possible to validate the model to this end on the basis of the available data?
- (b)
Granted the possibility of an auditing survey that provides additional data, what does it mean to establish the validity of N̂_n, i.e. the estimator without using the survey data? Is the formulation of validity still acceptable even when the auditing survey is a perfect instantaneous census itself?
- (c)
Provided that audit sampling can either be used to validate the model underlying N̂_n or to modify it to become a valid model, how do we incorporate the validation or modelling uncertainty, which is affected by the sampling error of the auditing survey?
Hand's paper does a valuable service to an important topic, by raising the awareness of ‘methodology researchers to explore some of the important statistical problems which arise’. It gives me great pleasure to second the vote of thanks.
The vote of thanks was passed by acclamation.
Paul Allin (Imperial College London)
On first reading this paper, I was sceptical that administrative data needed their own theory. To draw on David's opening sentence, surely survey data are also 'generated during the course of some operation', as the operation here is the construction and delivery of a structured exercise to collect predefined data? But, as David carefully points out, there are crucial differences between administrative data and survey or census data. So, I welcome this paper not so much as launching a drive for theory, but as aiming for a set of ethics and a culture around the use of data: establishing practices and ways of doing things, including with standard and robust tools and methods, that underpin the quality of the analyses and applications of data from whatever source.
David of course recognizes that understanding the use to which statistics are to be put is a crucial part of assessing quality of data. What flows from that, it seems to me, is that gathering and understanding users’ requirements would make a good starting point, before ‘statisticians’ are gathered to discuss ‘the need for methodological statistical work on administrative data’. An example of this was the Royal Statistical Society's working party on the measurement of unemployment in the UK (Bartholomew et al., 2014), in which I played a part. The working party ‘consulted widely in this country and abroad’, particularly to understand the views of users of the administrative and the survey sources then available. Our recommendations included that the headline measure should be based on a survey, rather than on the administrative count of claimants. (Would this be different today, given for example the widely documented decline in survey response rates?)
Being clearer about the potential use and value of administrative data might also help with resolving a log jam that is currently seen, because costs and benefits of processing administrative data for statistical purposes fall in different places and often to different organizations.
Finally, many of the challenges and opportunities discussed have long-standing parallels in the secondary analysis of survey data, from where lessons can be learned. Hakim (1982) included the encouragement of more collaborative research, for example, as one of the benefits.
Anders Wallgren and Britt Wallgren (formerly Statistics Sweden, Stockholm)
The missing challenge
The Nordic countries have changed their statistical systems by using administrative registers. The challenge was to improve the national system so that registers could replace the traditional census. The systems were changed from traditional data collection for area samples and censuses into systems where all surveys are register based.
Hand does not discuss this important challenge: how can we use administrative registers to improve the national statistical system? This requires support from politicians, co-operation with central and local government and strategic thinking together with hard work. The Nordic countries have gone through this process and have developed efficient statistical systems. It was necessary to improve administrative systems for taxation and national registration to make georeferenced social statistics possible.
Data quality
We want to add two remarks to Hand's discussion regarding quality. In Laitila et al. (2013), we explained that we should not only search for errors when we analyse an administrative register—we should also look for opportunities—how can we use this register? It should also be remembered that it is the output quality that is important. We have the impression that people mainly think about input quality.
‘Data = all’
There are two 'alls' but Hand discusses only one. Assume that we want to produce statistics on cancer incidence based on hospital data. The 'all' that is discussed by Hand concerns whether all hospitals have reported all cases of cancer. But there is another 'all': every survey must have a defined population, in this case the population of all individuals with or without a cancer diagnosis. For this 'all' we need access to a population register with small coverage errors. The population register is the most important part of the statistical systems in the Nordic countries; all social statistics are based on populations from this register.
Combining different sources
The efficiency of the Nordic systems is explained by the record linkage that is used. As identities are of good quality, thousands of deterministic record linkage operations are done every year to produce social statistics of high quality. It is necessary to combine many sources to obtain registers with rich content, e.g. for creating longitudinal registers. In an efficient statistical system, deterministic linkage with good identities is the only option.
When registers have bad coverage, it may be necessary to combine registers with area samples. This was discussed in Wallgren (2016), where we used calibration of weights to produce good quality estimates.
Gordon Blunt (Market Harborough)
David Hand raises many important challenges in his paper but gives relatively little consideration to one very important issue, which is that of domain knowledge. Such knowledge is important in any data analysis, of course, but it is even more important when approaching the analysis of an administrative data set, compared with the typical data set that a statistician might see.
When we first see a ‘new’ administrative data set, we must rely on the data owners to describe everything that they know about their data. Note the use of the word owners in this sentence: most administrative data sets, as so eloquently described in the paper, are some sort of ‘data exhaust’. One consequence of this is that statisticians are unlikely to have been involved in the data collection process, so we shall be presented with the data set ‘as is’.
Crucially, and before we even see any data, we are likely to have to persuade the data owner(s) to provide a data extract for us. If we are working in a commercial environment, it is highly unlikely that we shall be given access to data on a live system. Therefore, as well as our data handling and analytical skills, we must also develop our negotiating skills if we are to become proficient in dealing with administrative data sets.
As a consequence of these restrictions, we must be good influencers as well as good analysts. To achieve this, we must understand the client's motivations: what they want, what they need and, importantly, what they need to be told. We also need to engage in debate at senior levels, and this can be challenging, because many data owners will need to be convinced of the worth of our analysis.
Once we have persuaded senior contacts—often board directors—to allow access to their data, we need to work very closely with the analysts to gain their understanding of the domain. Analysts and senior contacts in an organization are rarely the same people.
Finally, once we have gained all the domain knowledge that we can, and undertaken some analysis, we need to speak again to our senior contacts, so that we can sell our analysis. However good our work is, we shall fail to convince anyone if we cannot demonstrate our recently acquired knowledge of the problem and the domain.
Andrew Garrett (ICON Clinical Research, Marlow)
The paper nicely reflects David Hand's wide ranging experience over many years in the evolving area of administrative data. I have three points.
Firstly, both randomized clinical trials (RCTs) and surveys have well-known properties that address bias in different but specific ways:
- (a)
RCTs through balancing groups (over all randomizations) with respect to factors, known and unknown, that may impact outcome;
- (b)
surveys through sampling that is representative of a population.
Statisticians understand these properties well, but they do not guarantee unbiased estimation. Missing data are a common problem and untestable assumptions may lead to misinterpretation. Sampling of passengers at airports may be random, but the answer on expected length of stay may be unreliable. Clinical trial end points may have been generated from machines that produce summaries of 'big data' for individual subjects, with data anomalies of the kind described in Section 2. Furthermore, RCTs and surveys may be limited in their ability to answer questions associated with long durations or complexity; examples include understanding disease trajectories and the treatment of comorbidity in an ageing population. So RCTs and surveys also have limitations, and administrative data, although far from perfect, present unique opportunities.
The second point relates to prespecification and sensitivity analysis, both of which are more established in drug development than elsewhere (e.g. social science). The European Medicines Agency (2017) has introduced the estimand, which translates the trial objective into a precise definition of the treatment effect to be estimated. Furthermore, sensitivity analysis investigates the effect of various data limitations and assumptions on the estimand. The assumptions behind, and the limitations of, administrative data are likely to go well beyond those for clinical trial data and vulnerability to p-hacking (Gelman and Loken, 2014) is inevitable. So statisticians should demand more rigour in the planning and analysis of administrative data to ensure that reliable decisions are made—including at the individual level through algorithms.
The final point relates to the practical aspects of data linkage. There is a lack of standardization around variable naming conventions and definitions, formats etc. across government (Garrett, 2016). Legacy systems also mean that there is little appetite or capacity to add or modify variables that would produce better statistics—with no desire for increased operational complexity or expense. Data scientists, like statisticians, complain that most of their time is spent preparing data. A greater emphasis on administrative data by design is required to bring efficiency and to improve quality, with the ‘statistics’ requirement placed higher up the priority list.
Fionn Murtagh (University of Huddersfield)
This is a really seminal paper in relation to data science and 'big data' analytics. Potentially, I see the 15 challenges being addressed in general with new statistical drivers (Murtagh, 2017): open data and general frameworks based on the geometry and topology of data and information, semantics, homology and field, geometric data analysis and the correspondence analysis platform.
Section 2, ‘data quality’ (implying, also, data encoding and data curation), and Section 3, ‘Data = all?’
Depending on how comprehensive or insightful our data sources are, there is the need to interact with them to have visualization and verbalization of data (see Blasius and Greenacre (2014)).
Section 4, ‘Answering the right question’ (focus of the analysis), and Section 5, ‘Causality and intervention’
Observational data are now our main engagement, and there are limitations in how we can interact and engage with the domains that are at issue. Relative to underpinning causality, this is well addressed on the 16th page.
Section 6, ‘Combining data from different sources’
This can lead to the importance of triangulation, which is very important in understanding the narrative of behavioural patterns, of analytical processes and other such themes and issues.
Section 7, ‘Confidentiality, privacy and anonymization’
Full account can be taken both of security and of the ethical issue of the individual not being thoroughly replaced by the cluster or group.
Population rather than sampling is at issue. Another consideration can be that the data are aggregated, leading to a different resolution scale of the analysis; the resolution scale of our data is another aspect of context and relevance. It is also interesting to note that data collected for statistical purposes can be subject to bias; if so, the big data context can be useful for calibration purposes (Keiding and Louis, 2013).
Measurement, e.g. having quantitative or qualitative (categorical) variables, and other forms of data encoding, are central issues. Integration of statistical analytics with the data is crucial: 'quality is not a property of the data set itself, but of the interaction between the data set and the use to which it is put'.
The end of the ninth page relating to data storage and rebranding of data is, in effect, the burgeoning issue of data curation. This was further emphasized (relating to machine learning) on the eighth page.
Peter W. F. Smith (University of Southampton)
Professor Hand has clearly articulated the main challenges for the statistical analysis of administrative data. Many of these challenges are being addressed by the methodological work being undertaken by the Administrative Data Research Network (ADRN) which includes the Administrative Data Research Centre for England (ADRCE).
The ADRN is funded by the Economic and Social Research Council and was instigated in 2013 to facilitate access to linked administrative data for research purposes. The setting up of the ADRN was a recommendation of the Administrative Data Taskforce (2012). They recommended a centre in each of the four countries of the UK, because of their different administrative data environments. The Network also contains the Administrative Data Service, which has a co-ordinating role. Each centre is also a partnership between academic institutions and a national statistical institute. The ADRCE is led by the University of Southampton and run in collaboration with University College London, the London School of Hygiene and Tropical Medicine, the Institute for Fiscal Studies and the Office for National Statistics. For more details, see https://www.adrn.ac.uk/.
In his conclusions, Professor Hand mentions the lack of communication between those involved at various stages of the administrative data journey. To improve communication, members of the ADRCE consortium and the ADRN have produced some guidance for information about linking data sets (Gilbert et al., 2017). This ‘Guidance for information about linking data sets’ divides the data linkage pathway into four areas: data provision, data linkage, data analysis and reporting study findings, and recommends what information should be provided at each stage.
Members of ADRCE's Southampton Group have used call record data from the three UK social surveys linked to census data to assess the representativeness of the surveys and how it changes as the data are collected (Moore et al., 2018a). Results suggest that representativeness does not increase after 6–8 calls. Stopping here would reduce the number of calls made by 7–15% and result in substantial financial savings. Moore et al. (2018c) have studied biases in linkable 2010 UK Small Business Survey data sets. As well as assessing how representative a final linked data set would be, they have identified the correlates of consent to linkage, and whether a register identifier can be appended and hence a link made. These two references address challenge 12 regarding the risks, and measures of the quality, of linked data.
Duncan Elliott (Office for National Statistics, Newport), Guy Nason (University of Bristol) and Ben Powell (University of York)
We welcome Professor Hand's insightful tour of the challenges involved in using administrative and transaction data, and we hope it motivates further research in this area. The issues identified are surely shared by statisticians in many fields and a robust justification for attempting to interpret ‘repurposed’ data is likely to require deep thought and open discussion of the sort provided by this paper. Especially valuable are his wide selection of illustrative examples typifying common problems.
We agree with Hand's identification of sampling theory's inadequacy for informing inference with administrative data. Historically, a universe of ignored factors could be lumped into a model's ‘random-error term’; today automated data collection introduces data artefacts that are persistent, structured and that, crucially, do not cancel out given enough data.
In our own work (Powell et al., 2018) we have also found difficulties because there is typically neither a ‘ground truth’ to validate our inferences against, nor an accepted metric for scoring inferences should the truth ever be revealed. This is problematic for making arguments for or against particular inference methodologies. We have also encountered the need for data disaggregation that effectively isolates clusters of outlying observations. We found, for example, that consumer price data feature relatively small subsets relating to specific time windows or product categories with enough leverage to skew our inferences significantly. The experience taught us of the need to model at sufficiently fine granularity to identify such subsets, even when interest lies in aggregated quantities.
We are also interested in the role that Bayesian methods might play in the contexts described in the paper. We often have prior knowledge for the biases of different data sources, and we need to make decisions on the back of our analyses. These are two factors that are well handled with Bayesian methods. However, we can understand the reluctance of official statisticians to impose any more structure on their data than is strictly necessary.
Finally, we note that data combination strategies motivated by potential cost savings and potentially improved inferences ought to be backed up by explicit cost analyses. We have explored the issue of cost effectiveness in the context of an optimal sampling rate for a time series (Nason et al., 2016) and are working towards extending the methodology to consider decisions to combine sample survey and non-sample survey data.
Jamie C. Moore, Gabriele B. Durrant and Peter W. F. Smith (University of Southampton)
Professor Hand provides an authoritative overview of the potential for and challenges of using administrative and transaction data in statistical research. Such work is especially timely in the UK context, where, as he notes, recent legislation facilitates access to various government data sets (the Digital Economy Act 2017). Given our research backgrounds, we have read with interest the sections considering data quality, particularly Section 6, where the practice of combining and merging data from different sources is reviewed. Reasons for utilizing these approaches and potential issues are discussed. Concerning issues, it is noted that between-source differences in item definitions or categorizations may cause difficulties for estimation. In addition, there is mention of the work of Romanov and Gubman (2016), who used administrative data to identify biases in survey subject income estimates, in the context of discrepancies pinpointing potential issues and errors to be resolved.
We feel that this latter point deserves emphasis. In recent research, we identify similar differences in the reporting of subject highest qualifications and ethnicity between interviewer-administered UK social surveys and the self-reported census (Moore et al., 2009). Our work is motivated by proposals to combine data from the decadal census and more frequent surveys to produce interim estimates of population characteristics (Luna Hernandez et al., 2015; Correa-Onel et al., 2016). However, between-source differences in contemporaneous summary estimates have been found (Awano, 2011). Using, for the UK, a unique data set linking subjects' survey responses to their responses to the same items in the concurrent 2011 census (the 'Census non-response link study': Moore et al. (2018a)), we describe non-trivial differences in highest qualification reporting, especially among subjects with foreign qualifications (inferred if the subject was over 16 years old when first arriving in the UK). We also describe ethnicity reporting differences, particularly among subjects responding 'mixed race' in the census. Because of the impact of foreign qualifications, we suggest that highest qualification differences are caused by survey interviewers probing subjects who are not familiar with categorizations (see also Awano (2011)). Something similar is likely with ethnicity, though perceptual differences will also be important. We suspect that reporting mode differences, and therefore such 'differential reporting', are likely to be common when combining data sources. Research is now needed on how to quantify and adjust for its impacts on estimation in these scenarios.
Paul A. Smith (University of Southampton) and Raymond L. Chambers (University of Wollongong)
We congratulate Professor Hand on a paper that poses challenges across a wide range of topics in administrative data; here we focus on analysis of linked data and implications for a general framework (challenges 12 and 13).
Analysis of linked data is now widespread, enabled by data sharing legislation such as in the Statistics and Registration Service and Digital Economy Acts, and also by projects like the Administrative Data Research Network. Some linkages use unique identifiers, but often linkage is probabilistic, based on matching of record level characteristics, and hence is subject to error.
Given N potential matches and a correct match value Yi, Neter et al. (1965) characterized this error by defining a new random variable
where λ+(N−1)γ=1. This has been dubbed the exchangeable linkage errors model. Analyses using the linkage error affected variable are unbiased for means, but variances are inflated (increasing type II errors), correlations between and other variables are attenuated and estimates of regression parameters based on are biased.
A simple extension (Chambers, 2009; Kim and Chambers, 2012a, 2012b) embeds this model within post-strata, assuming no between-stratum matches and 1–1 linkage together with ignorable linkage errors within strata. Given a matrix X_q of regression covariates and a vector y*_q of linked values in stratum q, an unbiased estimator of the linear regression of y on X under exchangeable linkage errors is then

β̂ = (Σ_q X_qᵀ E_q X_q)⁻¹ Σ_q X_qᵀ y*_q,

where E_q = (λ_q − γ_q)I_q + γ_q 1_q 1_qᵀ (with obvious notation). More complex extensions are possible, for example allowing linkage errors only with 'closer' units, giving a banded diagonal E, and also where λ varies from unit to unit. A maximum likelihood estimator (different from the above) is also available under additional assumptions on the variances (Chambers (2009), section 2.3).
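To make the attenuation and the correction concrete, the following minimal simulation (our own illustrative sketch, not the implementation in the papers cited above; the stratum sizes, λ = 0.8 and the crude mismatch mechanism are all assumptions) generates exchangeable-style linkage errors within post-strata and compares the naive least squares fit with the bias-corrected estimator.

```python
# Sketch: exchangeable-style linkage errors within post-strata and the
# bias-corrected regression estimator; all settings are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_strata, N = 50, 40
lam = 0.8                                   # probability a record links correctly
gamma = (1 - lam) / (N - 1)                 # exchangeable errors: lam + (N-1)*gamma = 1
beta_true = np.array([1.0, 2.0])            # intercept and slope

E = (lam - gamma) * np.eye(N) + gamma * np.ones((N, N))  # expected link matrix E_q
XtX = np.zeros((2, 2)); Xty = np.zeros(2)   # accumulators for naive least squares
XtEX = np.zeros((2, 2)); XtEy = np.zeros(2) # accumulators for corrected estimator

for q in range(n_strata):
    X = np.column_stack([np.ones(N), rng.normal(size=N)])
    y = X @ beta_true + rng.normal(scale=0.5, size=N)

    # crude 1-1 mismatch mechanism: records flagged as wrongly linked are
    # permuted among themselves (an approximation to exchangeable errors)
    perm = np.arange(N)
    wrong = rng.random(N) > lam
    perm[wrong] = rng.permutation(perm[wrong])
    y_star = y[perm]                        # linked (possibly mismatched) responses

    XtX += X.T @ X;      Xty += X.T @ y_star
    XtEX += X.T @ E @ X; XtEy += X.T @ y_star

print("true:     ", beta_true)
print("naive OLS:", np.round(np.linalg.solve(XtX, Xty), 2))    # slope attenuated
print("corrected:", np.round(np.linalg.solve(XtEX, XtEy), 2))  # close to the truth
```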
Other principled ways to analyse linked data have also been suggested; for example Goldstein et al. (2012) proposed Bayesian methods and multiple imputation.
Analysis of linked data should account for linkage errors, minimally to assess the sensitivity of results, although unbiased analysis quickly becomes challenging even in simple situations. More development and some case-studies implementing these methods would be very valuable. In this respect we think that Professor Hand's challenges 12 and 13 are not sufficiently ambitious and should be extended to include principled model-based analysis of linked data sets. This would go some way towards statisticians being more specific about the effect of data quality on analytical outputs from administrative data.
Agnes M. Herzberg (Queen's University, Kingston)
Leon Trotsky said ‘You may not be interested in war, but war is interested in you’. This quotation could be changed to ‘You may not be interested in marketing but marketing is interested in you’.
Although this paper may not be unique in the history of discussion papers read to the Royal Statistical Society, it stands out as there is not a single equation or table in it.
Because of the 15 challenges or drawbacks that he mentions, it is difficult to tell whether Professor Hand is for or against the use of administrative data.
The use of administrative data has the same drawbacks as self-selected surveys: the method of the collection of the data is not soundly based. If the basic laws of probability and statistics are ignored when selecting the sample, the conclusions are invalid and the actions taken on the basis of these conclusions inappropriate, and sometimes detrimental.
More research needs to be done on eliminating these challenges before money and time have been wasted.
As in many things, one gets what one pays for. The use of administrative data in the long run will be more expensive than applying a full fledged sample survey. Administrative data are not obtained by using sound methods; these data were collected for marketing only!
I feel the present paper does not address the problems.
Mark Pilling (University of Cambridge)
The paper and its listed ‘challenges’ are a welcome summary of the issues facing the use of (increasingly available and published) administrative and transactional data sets. One improvement to the paper which I propose would be to comment more on the data set construction phase, where the paper states the majority of the researcher's time is usually spent. Strengthening the reporting of this phase of a study would improve
- (a)
reproducibility,
- (b)
reuse of data sets and
- (c)
context for results.
Reproducibility in science is a hot topic, and the assumptions made as well as ideally the code used during data manipulations need to be made available. Certain analyses will be valid from one data set, but other analyses may be misleading if the data set is not fully understood (e.g. by researchers reusing or repurposing a large data set). Given the vulnerability of analyses of administrative and transaction data sets to underlying assumptions and biases, improving the documentation and reporting of data set creation is vital.
The following contributions were received in writing after the meeting.
Wendy Appleby (London)
Professor Hand's otherwise excellent paper omits the possibility of having two independent administrative sources that purport to measure the same thing. A good example is deaths from road accidents, which may be measured from police records (STATS 19: http://www.adls.ac.uk/department-for-transport/stats19-road-accident-dataset/?detail) or death certificates. When seat belt wearing was made compulsory for drivers and front seat passengers in 1983, STATS 19 showed an immediate sharp drop in deaths, but figures from the Office of Population Censuses and Surveys (OPCS) showed a gradual decline.
The STATS 19 system covers Great Britain whereas OPCS figures were for England and Wales, but this should not have caused the difference. STATS 19 reports only include accidents that the police are aware of. However, whereas many minor injuries go unreported, very few fatalities should be missed. (STATS 19 data clearly have many other flaws due to poor reporting, but these should not affect this discussion.) In STATS 19, an injury is only a fatality if death is within 30 days of the accident, which is not so with OPCS figures; STATS 19 thus understates deaths slightly.
However, the real reason is timing. STATS 19 data show the date of the accident. OPCS figures show the date of death registration. Death may occur days or weeks after the accident. Also, whereas nearly all deaths are registered within a few days of death, road deaths will frequently be referred to a coroner, leading to long delays in registration. This is why the OPCS figures showed a spurious gradual decline rather than the true sudden decline.
Arthur Barnett (Harrow)
This is an excellent paper from David Hand which has a read across to the UK Code of Practice for Official Statistics (UK Statistics Authority, 2009), which was being revised at the time of writing.
The 'data = all?' fallacy not only has a potential issue with too few members of a population from an administrative source but also with too many. For example, school performance tables are based on examination board data, which are essentially a data set of examinations, not candidates. When the data were first being used, comparison with the school census—a statistical exercise—showed that there were more 16-year-olds taking examinations than there were 16-year-olds in the population. To help to resolve this problem schools were subsequently asked to check their data: a school has no difficulty identifying its pupils, whereas doing so from the examination records alone is a difficult computing task.
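A toy sketch of the counting issue (invented records; the field names are illustrative): row counts in an examination-level file overstate the number of candidates unless the records are first reduced to candidate level.

```python
# Counting examinations versus counting distinct candidates; data are invented.
import pandas as pd

exams = pd.DataFrame({
    "candidate_id": ["A1", "A1", "A1", "B2", "B2", "C3"],
    "subject": ["Maths", "English", "History", "Maths", "English", "Maths"],
})
print("examination records:", len(exams))                       # 6 rows, overstates pupils
print("distinct candidates:", exams["candidate_id"].nunique())  # 3 actual pupils
```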
The current version and the latest draft revision of the Code of Practice for Official Statistics do not include a requirement for credibility or empirical testing of outputs. The above example demonstrates that this is important with administrative systems but it is also important more generally. An example of the latter is UK consumer price indices which are based on economic theory and approximations to standard models or formulae with, in recent times, no empirical verification that they meet the needs of government policy, commercial interests or the public.
There is also a potential issue with changes to the definitions of important variables. At a recent meeting it was suggested that the Office for National Statistics were considering changing to a location-based definition of a household because that was all administrative systems could support. This change is apparently being considered for the next-but-one census and may not happen, but if implemented it is likely to result in poorer classification variables.
Finally, there is the issue of the paucity of measures of precision and accuracy for administrative systems. Statistical systems are designed to provide measures of precision and accuracy—the so-called 'scientific principles' referred to in the current Code of Practice for Official Statistics. This scientific principles requirement in the current code has been replaced in the latest draft by 'sound principles'. The term 'sound' is not well defined and arguably, in civil service or 'Yes Minister' terms, is more likely to refer to the need not to rock the boat than to a rigorous scientific approach.
Rajendra Bhansali (University of Liverpool and Imperial College London)
David is to be congratulated on a timely recognition of a pertinent issue. The challenges that he has delineated certainly deserve greater attention than has hitherto been the case, as do some of the statistical problems that he has identified. On a positive note, however, it should be emphasized that both transaction and administrative data may be analysed by descriptive and exploratory statistical methods, and a careful analysis can yield new insights which can challenge an established orthodoxy. An important example is provided by financial returns, which may be viewed as 'transaction data'. A widely accepted point of view was that asset price series follow a random-walk model. This belief was supported by an extensive statistical analysis of asset price changes and underpinned by the 'efficient market hypothesis'. Many financial analysts, however, had serious doubts about this particular hypothesis, which appeared to contradict their own personal experiences of how financial markets functioned. Nevertheless, it was not until the squares and absolute values of financial returns were analysed that evidence began to emerge that the statistical behaviour of neither series is entirely consistent with a random-walk model for asset prices and that the volatility of financial prices is predictable from its own past. This analysis in turn has led to a distinction being made between 'strict white noise' and 'weak white noise'. Moreover, a descriptive statistical analysis of financial returns over many assets and markets has led to the development of the concept of 'empirical stylized facts' of financial data, which forms an important part of the subject area of 'financial econometrics'.
Neeraj Bharadwaj (University of Tennessee, Knoxville) and Yuexiao Dong (Temple University, Philadelphia)
We congratulate Professor Hand on the thoughtful call to improve statistical analysis of administrative and transaction data. This comment identifies several nuances in an organizational decision-making context and notes three challenges that result.
First, managers within organizations design strategies to achieve goals. Those strategies are informed by data, which can be either pre-existing data collected for a different reason or new data to be collected to address the specific goal at hand. Given that managers are purposive about the new data to acquire (Bharadwaj, 2018; McAfee and Brynjolfsson, 2012), it is necessary to underscore that they are proactive (not passive) in
- (a)
identifying the data that are needed and
- (b)
designing a strategy either to harness pre-existing administrative data or to collect new data.
Acknowledging that managers are purposive about the data they require escalates the importance of administrative and transaction data in an organizational decision-making context, and takes a forward step to ensure their usefulness. The challenge that we identify here is: what can managers do a priori to use pre-existing administrative and transaction data more effectively to address a managerial question and to inform new data collection?
Second, it is important to acknowledge that changes in population demographics continually present opportunities and threats. Although existing strategies may have been effective in servicing customers in the past, they may prove inadequate for appealing to a new set of future customers. What are needed are model diagnostics which can identify whether the earlier strategies will still be relevant in the new demographic context. The administrative and transaction data can serve as an important input to help managers to monitor the environment and to provide additional information to (re)align the organization. The challenge we identify here is: how can a firm leverage administrative and transaction data for model diagnostics which help the firm to safeguard against population shifts?
Third, the detection of a 'data anomaly' can yield significant business improvements beyond well-behaved 'normal data'. For example, abnormal credit card transaction patterns could be a leading indicator of fraud. Administrative and transaction data can be useful in distinguishing a data anomaly from low quality data. Thus, the third challenge that we identify is: how can a firm leverage administrative and transaction data for the purpose of data diagnostics and help a manager to distinguish rightfully between abnormal data that may contain important information (i.e. a data anomaly) and data that are of low quality (and contain little meaningful information)?
J. A. van den Brakel (Statistics Netherlands, Heerlen, and Maastricht University)
The paper provides an inspiring discussion around the increasing interest in making more use of administrative data and other ‘big data’ sources in the production of statistical information about modern societies. If administrative and big data sources are used as primary data for compiling statistical information about an intended target population, inference procedures are required that correct for possible selection bias. This is a challenge since the data-generating process is unknown. An alternative, which is often considered safer, is to combine administrative data with survey data, e.g. as auxiliary information. In the context of design-based and model-assisted inference procedures this is a well-established approach (Särndal et al., 1992). In the context of model-based inference, Marchetti et al. (2015) and Blumenstock et al. (2015) used big data as a source of auxiliary information for cross-sectional small area estimation models. In the context of official statistics, surveys are conducted repeatedly. Therefore time series methods are particularly appropriate to combine survey data with administrative data. Harvey and Chung (2000) and Van den Brakel and Krieg (2016) proposed state space models for the Labour Force Survey extended with a series of claimant counts. It is also worthwhile to refer to the econometrics literature where low frequency survey time series are combined with high frequency administrative series to nowcast gross domestic product (Giannone et al., 2008; Stock and Watson, 2002). In this context timeliness and precision of survey data are improved with big data time series. Van den Brakel et al. (2017) discussed how the concept of cointegration can be used to evaluate to what extent a big data source is representative for an intended target population. A time series modelling approach also enables us to account for discontinuities in time series to avoid distorting real evolutions (Harvey and Durbin, 1986; Van den Brakel et al., 2008).
In Section 5, the author emphasizes the use of randomized experiments for formal causal inference. Embedding randomized experiments in on-going surveys already dates back to Mahalanobis (1942). See also Fienberg and Tanur (1987, 1988, 1989) and Van den Brakel (2008, 2013) for details on how experiments embedded in sample surveys can be used for causal inference, and also for situations with more than two treatments. Applying these concepts to the context of administrative data is an interesting idea. If formal randomization is not achievable, the methods reviewed and proposed by Pfeffermann and Landsman (2011) for causal inference in observational studies can be considered.
Lisa Budd (London)
Professor Hand correctly says that a time series of the number of people receiving unemployment benefit is unreliable if the criteria for receiving benefit change over time. Indeed, in the UK in the 1980s, changes were made deliberately to reduce this number so the number lost all credibility. However, more important is the fact that it does not measure what usually interests users: the number of people with no job who would like to have one. It has always been the case that many people in this position have been ineligible for the benefit, and some people who work only a few hours a week and do not wish to work more hours may receive the benefit. Thus at best unemployment benefit can only give an accurate answer to the wrong question.
Similarly, the numbers of marriages and divorces measure something exactly, but nowadays they bear little relationship to the numbers of people entering and leaving long-term cohabiting relationships.
Administrative records are often problematic because of the lack of a unique identifier for people. Fingerprints are believed to be unique. However, there are many cases on the British fingerprint databases of two people with identical fingerprints because they are the same person under different names. Similarly, a count of the number of people who are company directors is inflated by people who are directors of multiple companies and their records have not been linked.
A good example of how improved statistical techniques can improve data from administrative sources is death rates by occupation. Some high status occupations show surprisingly high death rates, taking numerators from occupations recorded on death certificates and denominators from the census. However, when the Longitudinal Survey linked death records with census records, it showed that the deceased's relatives often gave a more prestigious job title than that person had given in the census.
James Doidge and Ruth Gilbert (University College London)
Routinely collected and administrative data are central to the future of health and social research and the shift towards evidence-based public policy and decision making. The challenges laid out by David Hand are a timely reminder of the significant hurdles that must be overcome if we are to maximize the value of these oceans of data and navigate their many hazards. The ‘Bloomsbury Group’ of the Administrative Data Research Centre for England, led by University College London in conjunction with researchers from the London School of Hygiene and Tropical Medicine and Institute for Fiscal Studies, and partners at the Office for National Statistics, has been directly grappling with these methodological issues for the past 4 years. We
- (a)
provide training on the linkage and analysis of administrative data, focusing on issues of data quality and linkage error and resources to help people to understand high value data sets (Herbert et al., 2017) and linked data (Gilbert et al., 2017) (challenges 1 and 12),
- (b)
constructed data quality metrics and showed how they can be incorporated in data linkage and analysis (Harron et al., 2017; Hagger-Johnson et al., 2015) (challenge 3),
- (c)
demonstrated how linkage error can influence statistical conclusions (Harron et al., 2014) (challenge 6),
- (d)
developed improved methods for record linkage (Goldstein et al., 2017; Harron et al., 2016) and techniques for analysis of linked data that better account for limitations of data quality and uncertainty (Goldstein et al., 2012; Harron et al., 2014) (challenges 12 and 13),
- (e)
are collaborating with population-based longitudinal studies and the Office for National Statistics on triangulating data across multiple administrative, registry and primary data collections (challenge 14),
- (f)
are developing new methods for anonymization that retain the necessary properties for valid and efficient statistical inference (challenge 15) and
- (g)
tackle all the challenges in our programme of exemplary studies of linked data, which are designed to highlight both the research potential and the limitations of specific administrative data sets.
To achieve these outcomes, we work closely with government agencies such as Public Health England, NHS Digital, the Department of Health, the Ministry of Justice, the Department for Work and Pensions and the Department for Education,
- (a)
to understand the administrative data that are collected,
- (b)
to develop tools to enable better use of administrative data,
- (c)
to build expertise around record linkage and analysis of linked or administrative data and,
- (d)
through applied research, to demonstrate the value of administrative data and methods for overcoming the many challenges it presents.
Brian Francis (Lancaster University)
I very much welcome this paper, which synthesizes a large body of disparate research and provides a research agenda for the future.
I would like to make two points. The first concerns the purity of administrative data, which may additionally be biased by ‘system design’. Thus, not all crimes recorded by the police are notifiable to the Home Office—these include less serious motoring offences, less serious public order offences and less serious fraud. Such ‘bias by design’ is often not well documented and requires substantial detective work to find for example a list of omitted crimes. This contrasts with traditional sampling frames for surveys, where documentation is clear.
Additionally, human intervention in the recording process can cause further bias. The gap between a crime being reported and the crime being recorded is particularly salient, as the police are allowed to record 'no crime' or to cancel a report. Her Majesty's Inspectorate of Constabulary (2014) estimated that around one in four sexual offences, and around one in three violence offences, were not recorded after being reported. This led the UK Statistics Authority to remove the national statistics designation from police-recorded crime figures (administrative data), whereas the Crime Survey for England and Wales (survey data) has retained its national statistics quality badge. Moreover, these non-compliance rates will also change over time and affect crime trends. Thus longitudinal series arising from administrative data should in many cases also be assessed for collection compliance as well as for changing definitions.
The second point concerns replication of analyses. Good scientific practice demands that data sets be made available so that analyses can be tested and repeated. With survey data, this is achieved by depositing the data sets in data archives, and survey administrators understand this research culture and the need to make data sets available. However, administrative data are often not in the public domain, and access to them can easily be withheld. There is currently no research culture requiring collectors of administrative data to deposit data sets that have been used for research.
Kayla Frisoli and Rebecca Nugent (Carnegie Mellon University, Pittsburgh)
We thank Professor Hand for highlighting some challenges of working with administrative data that are often overlooked or misunderstood. These challenges must be addressed, especially in the wake of large data collection capabilities and the growing demand for public transparency.
We recently had the opportunity to collaborate with John Abowd, the Associate Director of Research and Methodology and Chief Scientist at the US Census Bureau, on a unique project incorporating active learning in record linkage methodology. Although this work was fruitful and enriching, we experienced first hand some challenges related to administrative data.
Hand makes an important argument that we are handcuffed by current sampling methodologies and little attention seems to be paid to the uncertainty that surely propagates through analysis of population and administrative data. We saw directly how future analyses are predicated on records being correctly linked. But, what if they are not correct? We could take the linked records as they are, or we could incorporate uncertainty measures into subsequent analyses. We could also try to add additional historical or complementary information to records that seem problematic or incomplete (e.g. attach the marriage certificate to a record pair match that otherwise seems uncertain).
But, what if our errors originate from inconsistent data practices? Because departments within large government agencies often function independently, variables may be altered, added or dropped in a non-systematic fashion. Name and address standardization techniques may differ across departments, or not be utilized at all. It is crucial that building a consistent data management infrastructure be viewed as a worthwhile investment for government agencies and large companies.
As Hand points out, maintaining data privacy and confidentiality is critical to working with administrative data; although there have been methodological strides in privacy and utility guarantees, the day-to-day logistics of working with this type of administrative data can be burdensome. There are large potential benefits of having long-term collaborations between government agencies and leading universities; however, in practice, progress can be slowed by government clearances, travel time to secure locations, limitations of available data and the bureaucratic approval process for adopting new technologies. Although government agencies and large corporations are necessarily constrained, major stakeholders must emphasize non-trivial investments in streamlining access (e.g. secure virtual data centres) and the approval of new approaches to maximize productivity and privacy simultaneously.
Francisco Javier García Perez (Universidad de Castilla la Mancha), Libia Lara (Universidad Andrés Bello, Santiago) and Emilio Porcu (University of Newcastle and University Federico Santa Maria, Valparaiso)
We congratulate Professor Hand on his fine paper, which gives a thorough picture of the challenges in statistics for administrative data. We believe that statistical thinking would benefit from distinguishing administrative data held by governments (public administration) from transactional administrative data (commercial operations and others). Such a distinction is non-trivial, and the first category in particular needs much care. In fact, UK government data are protected by the Digital Economy Act 2017 (section 41) and cannot be assigned to companies or statesmen without control of their use and of the application of their results (section 64, and article 89.1 of Regulation (EU) 2016/679). In particular, it is worth mentioning that data regarding citizens can be used by companies doing statistical analysis and by statisticians (researchers or academics): section 64 allows the restricted transfer of data for their study, but not their subsequent use for, for example, a particular marketing campaign. The same applies under European legislation: article 9.1 of Regulation (EU) 2016/679 of the European Parliament and Council, on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, stipulates that data relating to European citizens may not be released for uses other than those allowed by the citizen, who owns his or her personal data. Enforcement of the regulation will be overseen by the Data Protection Officer (article 39.1, Regulation (EU) 2016/679). The paper discusses the data before the analysis (the quality of the administrative data depending on how they are obtained) and the data analysis itself, but the author does not mention the use of the results of statistical analysis of administrative data, which is limited by section 80 of the aforementioned Act. In conclusion, data obtained from commercial transactions cannot be treated in the same way as data that are in the control of the administration, because the latter are the property of the citizen and have been collected for uses very different from statistical analysis. The best contribution of the study is to show that administrative data, although they may be of very high quality, should be collected with appropriate criteria so that they can be used in statistical analysis.
Sarah Henry (Office for National Statistics, Newport)
I welcome this helpful paper that sets out the challenges as we move towards greater use of administrative data generated by government and other sectors. It will be important to address these issues at pace. To do so we must find the right collaborative working arrangement with academia, the private sector and government statistical services, both nationally and internationally.
As the paper eloquently states, the data collected for administrative and service purposes are then intended to be analysed for a different purpose, raising methodological issues, including inferring from a sample of one population to another—e.g. ‘users’ of a service and the population of a country. Although we expect a large overlap, especially once data have been linked, the data that are missing can often represent the most vulnerable in society such as people sleeping rough on the streets or children being denied education. Although the numbers are likely to be small, from a public policy perspective in a modern welfare state, the risk is that those who are not counted do not count. Although this problem exists with surveys also, the assumption that administrative data are ‘big’ and more inclusive compounds the issue.
Addressing this type of missing data will require the use of new surveys to inform editing and imputation and to improve estimation although the methods for doing so are still in their infancy.
It would be helpful to set out the methodological issues in more detail with initial proposals on how they can be solved so that this work can be commissioned. As we head towards the last planned census, timing in the UK is pivotal. It will be the last opportunity to validate the census by using administrative data and vice versa with diminishing returns, and increased risk of bias, thereafter. These risks must be addressed using new surveys, which will need to be developed.
Alongside the issues with administrative data, statisticians are facing increasing methodological problems surveying a faster changing society and an increasingly complex economy, serving impatient users of statistics. Using administrative data, combined with other sources of data, will ensure that statistics continue to underpin sound research, better decisions and a democratic society.
Ian Hunt (Nice)
‘I have a feeling that statisticians are cynics, because you realise how much of the stuff that you are told is true in the world is actually just that month's accident that worked out ...’ (Efron, 2010).
I think that Professor Hand's paper cleverly flags three general ways in which we can differentiate the profession of statistics from the practice of ‘data science’ and ensure that we play a key role in modern inductive research.
First, we need to focus on creativity and understanding with respect to data—in particular, its collection, classification and exploration. Without decent data no one can even specify useful models, let alone assess and compare them.
Secondly, let us aim to link properly our knowledge of ‘sampling variances’ to decision processes and inductive inferences. And, if our hoary statistical inferences are too crudely deductive for this purpose, then let us work hard to develop new arguments that help our clients to ‘go beyond the data’.
Thirdly, statisticians can and should develop a unique voice with which residual uncertainty can be expressed.
For a statistician, the trick with the last two points is to maintain the attitude of an empirical sceptic rather than a cynical cynic, while matching a unique appreciation of everyday randomness (as alluded to by Efron above) with complex mathematical knowledge.
But I worry that the profession of statistics clings to mathematics and classical inferences that are too remote from useful induction. For example, Hand suggests that, when we can, we conclude by saying 'things such as "on average, 99 out of 100 of our intervals will cover the true population mean"'. This seems of little use to a statistician's client who wishes to make a particular decision or inductive inference. Does Hand agree?
I maintain that a statistical inference is a deduction—in other words, it is a logical proof that follows from a set of premises (in the guise of a model). And that the main ‘virtue of a logical proof is not that it compels belief, but that it suggests doubts’ (Lakatos, 1976). In this context, Hand's focus on data-based challenges is a call to consider new premises, to explicate hitherto hidden premises and to create new models that suggest new doubts. This sounds promising and is unlike data science. For example, I think that statistical deductions based on scenarios with different premises could formally address many of the cautions raised by Hand's challenges. Does Hand agree?
Francesca Ieva and Francesca Gasperoni (Politecnico di Milano)
In this contribution, we focus on clinical databases as a specific class of administrative databases, discussing some issues raised in Professor Hand's paper.
In countries where a public or national health service is present, healthcare administrative data come from the automatic storage of billing records, drug receipts, hospital admissions to be reimbursed, and so on. They therefore address 'operational' goals, since they are collected mainly for managerial and economic purposes, but they are increasingly used for clinical and epidemiological purposes. Extracting meaningful information from such huge data sets holds unparalleled potential for epidemiology (Motheral and Fairman, 1987). Indeed, they can give a complete picture of the whole healthcare system, both from the patients' side (therapies, prognosis and pharmacoepidemiological information) and from the providers' side (efficiency of policies, cost-effectiveness of hospitalizations and procedures, and provider profiling). Moreover, the size of the samples collected in these data warehouses enables us to reach real-world conclusions and to deal with rare events. They are also much cheaper than clinical trials, which are often the gold standard in clinical statistics.
Despite all the positive aspects described above, several problems arise when clinical databases are considered, like
- (a)
accessibility and anonymization,
- (b)
reliability of the information and
- (c)
identification of the most appropriate method for extracting high-quality information from the data.
Accessibility is a crucial issue for public health data (Schneeweiss, 2014) because of the differing legal constraints and protocols that exist. However, to ensure the accuracy of the information, a twofold strategy may be pursued: to link administrative data with suitable clinical registries or surveys (Gasperoni et al., 2017) and to check the appropriateness of the clinical questions that may be answered by these data, since 'the quality is not a property of the data set itself' but depends on the 'use to which it is put'. As an example, specific guidelines for dealing with administrative databases when acute and/or chronic cardiovascular diseases are considered are provided in Gasperoni et al. (2017), Barbieri et al. (2010), Mazzali et al. (2016) and Ieva et al. (2015).
In line with these observations, close collaboration between clinicians, data owners or administrators and data scientists is crucial. In fact, only cross-disciplinary research can meet the challenges posed by such complex data. This scenario implies involvement of statistical research at multiple levels, making statisticians fundamental for both data analysis and experimental design. Therefore, will it be possible to think of administrative trials as an evolution of administrative data, to implement an 'in vivo' procedure instead of a 'post-mortem autopsy'?
Ingegerd Jansson (Statistics Sweden, Stockholm)
The paper by David Hand is a comprehensive and very interesting summary of challenges when using administrative data for statistical purposes. Those who use administrative data will recognize the challenges, those who aim to use such data in the future will have to consider them and those who are cautious will find support for their view.
Administrative data are no quick fix, and this is made very clear in the paper. There are many interesting threads that deserve to be further elaborated on, but I shall focus on two.
It is true that relatively few papers are published on the statistical challenges that come with the use of administrative data, given the interest in using such data. However, attempts to move theory forward have been made. The Journal of Official Statistics published a special issue in 2015 dedicated to coverage problems in administrative sources (Bakker et al., 2015), and Statistica Neerlandica published an issue in 2012 on methodological challenges of register-based research (Bakker and Van Rooijen, 2012). Other recent references are for example Reid et al. (2017) and Zhang (2011). Although papers on statistical theory related to the use of administrative data might be few in number, I believe that they will be followed by other papers and that we shall see theory develop.
I come from an organization that has been using administrative data for more than 50 years, but it does not mean we do not have surveys. We do conduct many surveys, and we shall continue to do so. This is because surveys and administrative data are not competing sources; on the contrary, they can complement each other. This is touched on in the paper but deserves to be further emphasized. Sometimes survey variables can be replaced by administrative data, but many variables, such as those measuring views and opinions, or variables based on concepts that do not match administrative data, never can.
But administrative data have a wider field of use and may in many ways support surveys, e.g. to construct sampling frames, for imputation and calibration, for support when editing or for reconciliation between sources. ‘Big data’ sources can possibly also be used in similar ways. I can only agree with David Hand that data integration and data linking are very important challenges for statisticians, today and in the future.
Kuldeep Kumar (Bond University, Gold Coast)
Although this nicely written paper by Professor Hand looks specifically at administrative data and the statistical challenges associated with them, I am sure that the discussion will be helpful for any other data sets which are analysed by statisticians in other areas of research. However, the challenge of data quality in the context of 'big data' is much more serious compared with routine administrative data because of the volume and velocity of data collection. Also, if we are combining data from different sources then it is much more difficult to assess the quality. The problem usually comes when the different databases to be merged have been developed independently. The computational methods to match the data may depend on the situation. For example, if we are dealing with credit card fraud then we must use deterministic methods, in which two records are declared a match only when they agree on all of a specified set of identifiers. However, for other microeconomic or macroeconomic data sets we can use probabilistic methods. Finally, is there a measure which can assess the quality of merged data?
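The contrast can be sketched as follows (the field names, weights and thresholds are illustrative assumptions, not a production matching system): a deterministic rule requires exact agreement on every identifier, whereas a probabilistic score in the spirit of Fellegi and Sunter accumulates evidence across fields and can tolerate a typographical error.

```python
# Deterministic versus probabilistic record matching on two invented records.
from difflib import SequenceMatcher

KEYS = ("surname", "dob", "postcode")
WEIGHTS = {"surname": 4.0, "dob": 5.0, "postcode": 2.0}  # stand-ins for agreement weights

def deterministic_match(a, b):
    # all identifiers must agree exactly
    return all(a[k] == b[k] for k in KEYS)

def probabilistic_score(a, b):
    # partial agreement accumulates evidence; strong disagreement counts against
    score = 0.0
    for k, w in WEIGHTS.items():
        sim = SequenceMatcher(None, str(a[k]), str(b[k])).ratio()
        score += w if sim > 0.9 else (-w / 2 if sim < 0.5 else 0.0)
    return score

rec1 = {"surname": "Smith", "dob": "1980-02-01", "postcode": "BS1 4AG"}
rec2 = {"surname": "Smyth", "dob": "1980-02-01", "postcode": "BS1 4AG"}  # typo in name
print(deterministic_match(rec1, rec2))        # False: one field differs
print(probabilistic_score(rec1, rec2) > 3.0)  # True under an assumed threshold
```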
Nick Longford (Imperial College London)
Professor Hand should be commended on a fine assessment of the potential of routinely collected data and an outline of a comprehensive research agenda that would exploit such databases more fully.
Although a section addresses issues related to causality, I think that the general theme of causal analysis with observational data, as well as practical and by now well-established methods, such as propensity matching, are rather glossed over. In medical research, their importance is in the gradual altering of the balance of the strengths of clinical trials and retrospective (register-based) studies.
Clinical trials are the gold standard, and they are indispensable for testing new treatments and procedures. But their full potential is undermined by difficulties in recruitment, ethical constraints, resources required (funding, time, expertise and goodwill to tolerate the disruption of the normal practice) and analytical complexities in dealing with deviations from the protocol, including missing data.
In some contexts, such as perinatal medicine, much research is concerned with fine tuning the care and establishing the best practice that would then be codified in improved guidelines. Comprehensive registers of care provide much more detailed and extensive data that are relevant to such issues than could realistically be collected in a clinical trial. The flipside is the potential confounding.
The developments that span the groundwork on the potential outcomes framework (Rubin, 1978; Holland, 1986), detailed study of confounding in observational studies (Rosenbaum, 2002, 2012) and collection of practical experience (Rubin, 2006; Imbens and Rubin, 2015) are tipping the balance in favour of observational studies, especially when data entry has been seamless and nearly complete for some time, and the design of the registers is informed by the anticipated research needs.
The discussion in the paper is sparse about regularly conducted analyses, such as audits, which are on the borderline of statistical research and routine analysis (Longford, 2011). The National Neonatal Audit Programme (NNAP) and the National Neonatal Research Database (NNRD) are an example of two entities that work in tandem. For the NNAP, the NNRD is the principal data source and, for the NNRD, the NNAP is an important rationale and an incentive to maintain high quality of the data. Whereas Professor Hand's paper emphasizes adapting research priorities to deal with the (relatively) new data sources, a feedback mechanism is also at play, whereby research priorities influence changes in database design and compilation.
Asta Manninen (formerly City of Helsinki Urban Facts)
The paper by David Hand is profound and covers all essential issues. It identifies major challenges and raises important questions. It provides a good framework for further discussion and development.
I would like to contribute briefly with some experiences from Finland. A grand example of using administrative data to produce official statistics in Finland is the 1990 population and housing census, which was based entirely on administrative data. The use of administrative data has increased systematically since. Experience has shown that the use of administrative records is the most efficient way of rationalizing data collection, but also that it is essential to have a legislative base which governs their use. Today, open data and open interfaces further extend the opportunities, and also the obligations, of official statistics. In addition, a growing amount of data is geocoded, allowing innovative opportunities for producing official statistics. Improved interoperability between various data sets is also an important asset.
Example: how can official statistics contribute to the discussion on making commuting more sustainable?
Statistics Finland has produced statistics about commuting since 2005. Sources of data including the map co-ordinates of the buildings of each person's dwelling and workplace have allowed estimates of distances to work and directions of commuting. These calculations have been based on the shortest distance between two points 'as the crow flies'. Today, Statistics Finland can improve the estimates and extend the use of commuting statistics by integrating new multisource geospatial data into its register data. These new data are the Digiroad—a national road and street database (by the Finnish Transport Agency)—data from automatic traffic measuring devices (by the Finnish Transport Agency) and timetable data of public transportation (by the Helsinki Region Transport 'journey planner' and the Finnish Transport Agency 'Matka.fi'). The new statistics provided by Statistics Finland are estimates of distances and time to work by private car, bicycle or public transportation, all of them based on real commuting routes.
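For readers unfamiliar with the baseline calculation, a minimal sketch of the 'as the crow flies' distance between two geocoded points is given below; the haversine formula and the illustrative coordinates (roughly central Helsinki and Espoo) are our assumptions, not Statistics Finland's production code. Route-based distances computed from Digiroad would generally exceed this straight-line figure.

```python
# Great-circle ('as the crow flies') distance between a dwelling and a workplace.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, r_earth_km=6371.0):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi, dlmb = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * r_earth_km * asin(sqrt(a))

# illustrative coordinates only: dwelling (Helsinki) -> workplace (Espoo)
print(round(haversine_km(60.1699, 24.9384, 60.2055, 24.6559), 1), "km")
```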
Jorge Mateu (University Jaume I, Castellón)
Professor Hand is to be congratulated on a valuable contribution and thought-provoking paper. The paper is a kind of position and opinion paper and touches on the trending topic of 'big data' problems in connection with the 'Internet of things'. I write this note to express my own, very personal, point of view, focusing on three aspects of administrative data.
Big data; big opportunities
Businesses are increasingly driven by data and organizations can gain competitive advantage from this change. However, doing so means changing the way they collect, store and use information. The availability of data is not the issue. The success of a data-driven approach relies on the quality of the data gathered and the effectiveness of its analysis in making smart business decisions. This is becoming increasingly complex with the emergence of big data; new types of unstructured data in exponential volumes coming from a variety of sources such as social media, sensors and the Internet.
Internet of things: when machines meet business needs
To face the challenges of an ever-changing and competitive market, companies must improve their operational processes and maximize their efficiency. To do so, they are digitalizing their operations, with machines monitoring those operations in real time, in what is being called the 'Industrial Internet'. Nevertheless, new challenges have arisen regarding the storage, processing and analysis of all the unstructured and structured data that are now available to these organizations. Thanks to big data and the use of advanced analytic techniques, or data science, enterprises can gain new insights from their business processes and start to take advantage of all of this information to become data-driven organizations. They now have the capacity to generate value from operational data for multiple purposes. With the increasing complexity of applications, reporting structures, regulations and internal processes across the enterprise, the benefits of establishing true governance over shared data are significant.
Data are the new oil
The real potential is the business insights that can be derived from this new, vast and growing natural resource. If data are the next big thing, then companies need to think about a new business model that exploits this valuable resource. They must also be able to analyse, process and integrate an extremely large amount of new big data types in almost real time.
Paul D. McNicholas, Sharon M. McNicholas and Peter A. Tait (McMaster University, Hamilton)
We congratulate Professor Hand on an important and timely contribution. In addition to being an effective conversation starter, this paper is an excellent resource for anyone wishing to understand more about administrative and transaction data. We have some comments and would be grateful if he would clarify some points.
The definition proffered for ‘big data’, i.e. ‘the result of some automatic data collection system’, is very interesting for several reasons. For one, it seems to suggest that data that are ‘tall’ (i.e. large n) might be considered big data. It also raises the question over whether the author views big data as distinct from administrative data. Perhaps the author can clarify his position on these points. In doing so, we are particularly interested in the extent to which the views of the author diverge from those articulated by Puts et al. (2015), who contrasted administrative and big data. We are not suggesting that the author's definition is inferior to, or even incompatible with, other definitions for big data; rather, we wish to understand the author's view on the relationship between big and administrative data.
The use of crime statistics data as an example is perhaps particularly interesting in that questions have been raised around the veracity of the associated administrative data, i.e. police-recorded crime data. Specifically, the problem of police departments manipulating data has been reported by the media in several countries including Australia (http://www.abc.net.au/news/2017-01-30/allegations-goldcoast-police-crime-managers-manipulating-stats/8217550), the UK (https://www.theguardian.com/uk-news/2014/jan/15/police-crime-figures-status-claims-fiddling; http://www.bbc.com/news/uk-25002927) and the USA (http://www.nytimes.com/2012/06/29/nyregion/new-york-police-department-manipulates-crime-reports-study-finds.html; https://www.reuters.com/article/us-crime-newyork-statistics/nypd-report-confirms-manipulation-of-crime-stats-idUSBRE82818620120309). More generally, we wonder whether the author could clarify his view on strategies for identifying and dealing with the, perhaps systematic, manipulation of administrative data.
It is noteworthy that, with the help of machine learning methods, two common paradigms for causal inference have been extended to the administrative data setting. The potential outcome paradigm (Imbens and Rubin, 2015) and the structural causal model framework (Pearl, 2009) use machine learning methods to predict the probability of receiving treatment (Lee et al., 2010) and to learn the causal structure from the data (Kalisch and Bühlmann, 2014) respectively. Karwa et al. (2011) suggested that an approach combining aspects of both paradigms may be the most effective. It would be interesting if the author could share his experiences of using administrative data to answer causal questions when the standard operation of the organization is not modified to meet the needs of the analysis.
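As a hedged illustration of the first of these ideas (simulated data; the classifier, sample size and confounding structure are our assumptions, not the cited authors' implementations), the sketch below estimates propensity scores with a machine learning classifier and uses inverse probability weighting to recover a treatment effect that a naive comparison misses.

```python
# Propensity scores estimated by a machine learning classifier, then used in
# an inverse-probability-weighted treatment effect estimate; data simulated.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=(n, 3))                       # observed confounders
p_treat = 1 / (1 + np.exp(-(x[:, 0] + 0.5 * x[:, 1])))
t = rng.binomial(1, p_treat)                      # treatment depends on confounders
y = 2.0 * t + x[:, 0] + rng.normal(size=n)        # true treatment effect = 2.0

ps = GradientBoostingClassifier().fit(x, t).predict_proba(x)[:, 1]
ps = np.clip(ps, 0.01, 0.99)                      # trim extreme scores for stability
ate_ipw = np.mean(t * y / ps) - np.mean((1 - t) * y / (1 - ps))
ate_naive = y[t == 1].mean() - y[t == 0].mean()   # confounded comparison
print(round(ate_naive, 2), round(ate_ipw, 2))     # naive is biased; IPW is near 2.0
```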
Jenny Mehew (National Health Service Blood and Transplant, Bristol)
In medical applications, progress is often driven by robust analyses of administrative data, rather than interventional studies. This is particularly so in organ transplantation. Although every UK registration and transplant is recorded on the database, enabling analysis on a complete population, there are issues associated with the nature of administrative data.
Data collection
The information that is directly required to register and transplant a patient must be accurate; however, data for audit and analysis are non-essential and often incomplete. Imputation is not realistic when a missing value may occur systematically and liberal interpretation of p-values is therefore important.
As new technologies emerge, some data fields previously collected become inaccurate. It is therefore important to have a flexible mechanism of capturing new data fields enabling our analysis to continue to stay relevant.
Study design
Scenarios may arise where clinicians propose a hypothesis yet the subsequent statistical analysis does not support this hypothesis. Often this is because the observed values do not cover a sufficiently wide (but clinically plausible) spectrum. Clinical practice is restricted by logistics, funding and ethical considerations and hence analysing such observational administrative data is limited in comparison with running a controlled trial. This leads on to the issue of causality and intervention. Although an event may occur, seemingly corresponding to a change in outcome measure, it may not be possible to obtain statistical evidence supporting this. There may be other changes in clinical practice that occur around the same time, and the event may occur to differing extents at different places. Although risk adjustment is a useful tool, as well as the use of random effects, neither can solve the issue of bias which is sometimes inherent in observational event analysis.
Although clinical trials would address these issues, they are costly, may take years to set up and are frequently not ethical in the setting of donation and transplantation.
Example
A topical issue where administrative data are essential is the consenting system for organ donation. In Wales, this recently changed from an opt-in system, where individuals express their wishes to join or not the organ donor register, to an opt-out system. The existence of two systems side by side in England and Wales provides an ideal opportunity to compare consent rates in these two countries over the same time period. Sequential methods are being used to signify when there is evidence, or otherwise, of the benefits of changing the system.
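As one hedged illustration of what such a sequential rule can look like (a Wald sequential probability ratio test on simulated consent decisions; the rates, error levels and benchmark set-up are assumptions, not NHS Blood and Transplant's actual method), see the sketch below.

```python
# Wald SPRT monitoring of a consent rate against a benchmark; purely illustrative.
import numpy as np
from math import log

p0, p1 = 0.60, 0.70          # benchmark consent rate vs hypothesised improved rate
alpha, beta = 0.05, 0.10     # type I and type II error targets
upper, lower = log((1 - beta) / alpha), log(beta / (1 - alpha))

rng = np.random.default_rng(2)
llr = 0.0
for i, consented in enumerate(rng.binomial(1, 0.70, size=2000), start=1):
    # log-likelihood ratio contribution of one family's consent decision
    llr += consented * log(p1 / p0) + (1 - consented) * log((1 - p1) / (1 - p0))
    if llr >= upper:
        print(f"evidence of a higher consent rate after {i} approaches"); break
    if llr <= lower:
        print(f"evidence consistent with the benchmark after {i} approaches"); break
else:
    print("no decision after 2000 approaches")
```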
Daniel L. Oberski (Utrecht University)
A research programme for dealing with most administrative data challenges: data linkage and latent variable modelling
David Hand has written an excellent overview of the challenges associated with using administrative data. As the paper already hints, many of the issues discussed apply equally to other types of ‘data exhaust’, such as social media, Internet usage data and many other types of ‘big data’. We must lay the groundwork of addressing these challenges before data exhaust can be leveraged by businesses for actionable insights and by scientists for valid conclusions; David's call to action is therefore well taken.
Here, I take up that call by suggesting a single common theoretical framework for addressing what I consider the four most important challenges. The framework I suggest is
- (a)
linkage of the exhaust data to designed data, followed by
- (b)
latent variable modelling (LVM).
LVM offers a comprehensive way of thinking about the challenges, whereas linkage to designed data allows us to relax many of the assumptions that would otherwise be necessary.
For example, although both surveys and administrative data have errors and strengths, these errors are often non-overlapping and strengths complementary. Surveys can directly address definition errors and cover parts of the population missed by registers, whereas administrative data omit many errors associated with the survey response process. Neither source is perfect in any respect. But linkage effectively ‘crosses’ the errors, making it possible (identifiable) for LVMs to account for the effect of errors in both sources.
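A small simulated sketch of this 'crossing' (the variable names and variances are illustrative; the key assumption is that the survey and register errors are independent of each other and of the true value): the covariance between the two linked measurements identifies the variance of the latent quantity, and hence each source's error variance.

```python
# Linked survey and register measurements of the same latent quantity: under
# independent errors, cov(S, R) estimates var(X), which identifies both error
# variances. Simulated data; variances are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x = rng.normal(0, 2.0, n)              # latent true values, var = 4
s = x + rng.normal(0, 1.0, n)          # survey report, error variance 1
r = x + rng.normal(0, 0.5, n)          # register value, error variance 0.25

var_x_hat = np.cov(s, r)[0, 1]         # ≈ var(X) under independent errors
print(round(var_x_hat, 2),
      round(s.var(ddof=1) - var_x_hat, 2),   # ≈ survey error variance (1.0)
      round(r.var(ddof=1) - var_x_hat, 2))   # ≈ register error variance (0.25)
```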
A recent example is our work on linked survey–administrative data to address measurement error (Oberski et al., 2017). LVMs have been extended with graphical modelling of missing data to account for selection bias. Of course, the linkage itself may introduce linkage error; a process that has been recognized as generating ‘latent classes’ (finite mixtures). Modelling these latent classes mirrors Lahiri and Larsen (2005) and is also suggested in Oberski et al. (2017). Finally, privacy is an important issue. Existing solutions—disclosure control, distributed computation and homomorphic encryption—are all applicable to LVMs and have been so applied in the machine learning literature. Indeed, adding noise to protect values, as done in disclosure control and differential privacy, leads to an LVM with known measurement relationships.
Much more work is needed. But LVMs are a promising overarching framework to address the challenges laid out so eloquently in this paper. Perhaps some day this framework could serve as the ‘generally accepted theory’ that David identifies as currently lacking.
Marcelo Ruiz (University of Rio Cuarto), Victor J. Yohai (University of Buenos Aires) and Ruben Zamar (University of British Columbia, Vancouver)
We praise Professor Hand for this paper and for bringing attention to administrative data (AD) and several interesting related issues.
Professor Hand convincingly states that AD are not statistical samples but a complete account of all the individuals or items in a finite (but often large) population. These data need to be summarized, described and analysed to uncover possibly hidden patterns occurring at different levels of complexity and dimensionality. He also brings due attention to the issue of uneven data quality (DQ) in many AD. Classical descriptive measures of a population such as means and standard deviations have the same shortcomings as the corresponding sample estimators: failure to describe well the majority of the population. Therefore the notion of robust descriptive measures that summarize the behaviour of the majority of the population is necessary for the study of AD.
In general data quantity is not a concern in most AD applications because the number of cases is very large. However, the issue of DQ remains paramount. As mentioned in the paper, data can be incorrectly entered, even for operational purposes. Hence, we wish to elaborate on two DQ issues:
- (a)
cellwise outliers (gross errors affecting single cells in the data table, e.g. 11.1 in place of 1.11) and
- (b)
casewise outliers (entire outlying rows in the data table, e.g. a tainted blood sample causing overall wrong test results, or the inclusion of cases from different subpopulations).
The devastating effect of combined cellwise and casewise outliers on standard robust estimates has recently been discussed in Agostinelli et al. (2015) who introduced a second generation of robust estimates that can deal simultaneously with these two DQ issues. These procedures can also be used to summarize and analyse AD.
Robust procedures search for the strongest pattern followed by the majority of the data. For example, if only 95% of the items follow a certain pattern, this pattern may be missed by a least squares fit but clearly revealed by a robust algorithm fit.
The toy example in Fig. 01 shows that the least squares fit (y = −9.15 + 3.28x; the broken line) poorly summarizes the whole data set. The robust fit (y = 6.29 + 1.91x; the full line) reveals the existence of two distinct regimes, fitting the one followed by the majority and, through the residuals, clearly exposing the existence of a distinct minority. Similar situations arise in higher dimensions, where visualization is much more difficult; hence robust algorithms are needed in these cases.

Fig. 01. Small example to illustrate the performance of robust regression algorithms: number of employees versus advertising expenditure for 1000 large companies in Argentina.
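In the same spirit as this toy example (the figure and the Argentinian data are the discussants'; the snippet below uses simulated numbers and library choices of my own), a least squares fit can be compared with a high-breakdown robust fit on data in which 5% of cases follow a different regime.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.default_rng(1)
n = 1000

# Majority regime (95% of cases): y is roughly 6 + 2x plus noise
x_major = rng.uniform(0, 10, size=int(0.95 * n))
y_major = 6 + 2 * x_major + rng.normal(scale=1.0, size=x_major.size)

# Minority regime (5% of cases): a distinct, outlying pattern
x_minor = rng.uniform(8, 10, size=n - x_major.size)
y_minor = 60 + 5 * x_minor + rng.normal(scale=1.0, size=x_minor.size)

X = np.concatenate([x_major, x_minor]).reshape(-1, 1)
y = np.concatenate([y_major, y_minor])

ols = LinearRegression().fit(X, y)                   # pulled towards the minority regime
robust = RANSACRegressor(random_state=1).fit(X, y)   # fits the majority pattern

print("least squares:", ols.intercept_, ols.coef_[0])
print("robust       :", robust.estimator_.intercept_, robust.estimator_.coef_[0])

# Large residuals from the robust fit expose the minority regime
residuals = y - robust.predict(X)
print("cases flagged as outlying:", int((np.abs(residuals) > 5).sum()))
```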
Milan Stehlík (Johannes Kepler University in Linz and University of Valparaíso), Silvia Stehlíková (University of Valparaíso) and Ludy Núñez Soza (Universidad de Tarapacá)
We congratulate Hand for giving readers a provocative discussion on administrative data. We agree with him that we need to develop specific methods for analysis of administrative data, as we point out in the following examples.
Basel
Operational administrative data in finance can be significantly biased compared with their real counterparts. Thus low quality of such data substantially decreases the effectiveness of initiatives such as the Basel accord, e.g. the assessment of upper quantiles of risk (Potocký et al., 2014). A robust and consistent estimator is needed to assess upper quantiles; a solution is given in Stehlík et al. (2010). A special problem is that many banks consider risk to be a problem for their information technology departments.
Negative interest rates
Changing regulations can invalidate existing financial series (as Hand points out). Quantitative easing in the banking sector and low quality administrative data can lead to significant challenges requiring non-standard techniques to be developed (Stehlík et al., 2014). The current models are old-fashioned and thus not reliable for present data. Data underlying the recent Greek or Portuguese crises show principal discrepancies with standard random-walk models (Stehlík et al., 2014). Standard predictions of interest rates also fail completely in the presence of negative interest rates. In 2005 the first author was consulted by KPMG Slovakia about negative interest rates found by their audit. The administrative and financial bodies were not willing to accept the existence of negative interest rates (a referee even required interest rates to be grounded at 0 in Stehlík et al. (2015), despite the fact that the European Central Bank had introduced negative rates in 2014).
Ecology and medicine
The 2014 arsenic data from the provinces of Arica and Parinacota in Chile, which were obtained by the Ministry of Health, are a nice example of a non-random, censored and incomplete sample (Núñez Soza and Stehlík, 2017). Proper collection, analysis and maintenance of water pollution data are needed for the development of appropriate water quality norms in Chile (Stehlík et al., 2014). Many data are aggregated by an inappropriate statistic (e.g. the arithmetic mean), which severely fails to capture extremes (Beran et al., 2014; Stehlík et al., 2016, 2016; Jordanova et al., 2016). Aggregation statistics can have very different properties for sparse and dense data (Stehlík, 2011), which, in the case of, for example, complicated medical images, demands a very careful choice of method (Hermann et al., 2015).
Jude Towers (Lancaster University)
To a quantitative social scientist with a deep interest in data, the content of the paper seems rather obvious, perhaps even bland; but its existence, and in such a prestigious statistical journal, demonstrates the necessity of such work. It is very welcome because, in reality, it is still relatively rare to find critical reflections about data, and about the implications for analysis, findings and conclusions, routinely embedded in the research process.
The multiple challenges that are outlined throughout the paper provide a useful 'checklist', but what is not explicitly discussed is the need to understand the socially constructed nature of data, which is arguably especially crucial for administrative data. The author is optimistic about the potential of administrative data, and certainly they are an increasingly important (re)source for researchers, but they are imbued with power relationships which are not explicit and with feedback loops of their own making. For example, police-recorded crime does not reflect the 'real' rate of crime; it reflects the policies and practices of police forces, the Police and Crime Commissioners, the government, etc. When a campaign to encourage victims of domestic violence to come forward is launched, the associated number of recorded crimes increases. When these statistics are published at some later date, this rise is often interpreted as 'real', justifying additional intervention and resource. But resources are finite, so those spent on domestic violence cannot be spent elsewhere; neglected, other crimes begin to show a declining trend, justifying even less intervention and resource, and so on ….
Accepting the socially constructed nature of quantitative data and statistics leads to a fundamental principle of critical reflection about the data being embedded throughout the research process: I would argue this is not (yet) systematically practised, taught or written about, although things are starting to change; for example a small number of innovative critical thinking courses are being introduced by some universities as ‘prequels’ to technical statistical methods courses so that students have foundational frameworks for ‘statistical thinking’ on which they build technical skills and knowledge. Most of this work seems to be concentrated in the social sciences (perhaps because British quantitative social scientists have had to convince their own discipline that quantitative data are not inherently positivistic). I would argue that the big challenge is embedding this into hard sciences and especially data science: the challenges are not only statistical.
Priyantha Wijayatunga (Umeå University)
In the case of entire population data we can estimate many probabilistic and statistical quantities without sampling variation, setting aside other uncertainties that should be treated appropriately. Important features are association measures between variables, such as conditional probabilities, (partial) correlation coefficients and (conditional) odds ratios. Sometimes these do not change from one population to another (for example, from the present to a future instance), even though marginal probability distributions often do; causal relationships are one instance. In such cases they can be transported to other populations to obtain causal or predictive effect estimates (for example, regression coefficients) when no direct analyses are done or possible. For instance, we can estimate what these quantities might be at a future time if marginal distributions are expected to differ from those at present; this can be useful for policy making when such changes are anticipated. Suppose that an effect Y of a certain drug Z is the same for similar age subgroups X in two populations that have two different age structures, and let X be the only confounding factor of the causal relationship in both populations. Then knowing or estimating P(Y|Z,X) accurately in one population (with entire population data) is sufficient to estimate the average causal effect in each population, where in the second population only the marginal distribution of age is known or assumed. Note that the second population can be the first itself at a future time, or similar.
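A minimal formalization of the standardization implied here (the notation is mine, not the discussant's): if X is the only confounder and P(Y|Z,X) is invariant across populations, then the average causal effect in a second population with age distribution P_2(X) is
\[
\mathrm{ACE}_2 \;=\; \sum_{x}\bigl\{P(Y=1 \mid Z=1, X=x) \;-\; P(Y=1 \mid Z=0, X=x)\bigr\}\,P_2(X=x),
\]
so an accurate estimate of P(Y|Z,X) from whole-population data in the first population, combined only with the known or assumed age distribution of the second, suffices.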
My proposal is to use population data to estimate relationships between variables, ideally probabilistic ones, which may give more accurate estimates. Dependence measures should be more probabilistic than statistical, as argued (implicitly) in Wijayatunga (2016) through a generalization of Pearson's correlation coefficient. We can find instances where some probabilistic relationships may not change over time, place, etc., especially if they are causal. With such estimates we can estimate many statistical quantities that are of interest to policy makers in the event that the population structure is assumed to be different in the future, at another place, etc. Another simple example is that the regression coefficient of X in the regression of Y at a future time, where marginal distributions are assumed to differ from those of the present, can be estimated if the current correlation coefficient between X and Y is known reliably from (population) data. Thus administrative data on an entire population can be useful for policy makers.
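In the same hedged spirit (my notation, under the discussant's assumption that the correlation is stable while the marginal spreads change), the regression example can be written as
\[
\beta_{\mathrm{now}} \;=\; \rho\,\frac{\sigma_Y}{\sigma_X},
\qquad
\beta_{\mathrm{future}} \;=\; \rho\,\frac{\sigma_Y^{\mathrm{future}}}{\sigma_X^{\mathrm{future}}},
\]
so a reliably estimated correlation from current population data, together with assumed future marginal standard deviations, yields the future coefficient.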
The author replied later, in writing, as follows.
I would like to express my appreciation to the discussants for their deep and insightful comments on my paper. I expected, when I wrote the paper, that its value would lie at least as much in the discussion contributions as in the paper itself, and this expectation has been fully met. My only regret is that brevity prevents a more detailed rejoinder: the large number of discussants has meant that I can say very little in response to the important points that each of them makes. However, I have read the contributions carefully, and absorbed them, so that they will certainly be reflected in my future work!
Penny Babb stresses the need to understand the data collection process. This is important whatever the nature of the data but, whereas such understanding is implicit for data which are deliberately collected for statistical purposes (such as survey data), it need not accompany administrative data: an extra effort must be made. It is particularly important because, as Brian Francis says, data distortion can arise as an intrinsic aspect of the data collection process. Duncan Elliott, Guy Nason and Ben Powell note that
‘automated data collection introduces data artefacts that are persistent, structured and that, crucially, do not cancel out given enough data’.
The fact is that the term ‘big data’ does not mean good, or reliable, or trustworthy or even valid data. As I have said elsewhere, big data have all the problems of small data, along with further problems of their own.
Francis also notes that, since many administrative data sources evolve over time, replication studies can be difficult, and that data sets must be stored. In many contexts this will require a culture change. I entirely endorse Mark Pilling's comments about the importance of documenting how data sets are manipulated and cleaned.
Babb also properly reminds us of the intrinsic heterogeneity of many administrative data collection processes—so that data items, fields and records are likely to have different and possibly unique data quality issues. And I think her sentence ‘Do not overly trust the data but probe them with a challenging mind’ could usefully be adopted as a motto for all data and all statistical results. Such a perspective, if adopted by laypeople as well as professional statisticians, might go a long way towards quelling the rise in ‘false facts’ which we read so much about today.
The Nordic countries lead the world in the use of administrative data by the national statistical institutes, as is illustrated by Asta Manninen's commuting example, so it is fitting that Anders and Britt Wallgren should raise the possibility of using administrative registers to improve national statistical systems. In doing so, they go further than I have gone in the paper. They point out that the Nordic countries’ experience led to improvement of certain administrative systems. This is an immensely important point. I can see a clear parallel to the improvement of medical and health data collection systems when their data began to be used by statistical and machine learning diagnostic systems: these require high quality data, and this reflected back in an improvement to the data collection process itself. I think Wallgren and Wallgren's introduction of the notion of ‘output data quality’ in contrast with ‘input quality’ is a valuable perspective.
Li-Chun Zhang also draws attention to the extensive use of administrative data in the Nordic countries. As he notes, administrative data lead naturally to the notion of combining data from multiple sources—with immediate potential benefits and challenges. Priyantha Wijayatunga also comments on this. Zhang further makes the important distinction between data-based predictive models (often based on big data), where performance is all, and theory-based descriptive models, such as those for inflation. The statistical analysis of much commercial data is aimed at the first kind of problem, whereas much of the analysis in national statistical institutes targets the second. The purely empirical nature of the first, with their relatively clear-cut objectives (e.g. the prediction of sales), might make them less susceptible to the shortcomings of administrative data than the inferential and descriptive nature of the second. Li-Chun's example of population count illustrates this. In a related vein, Duncan Elliott, Guy Nason and Ben Powell make the point that for social measures there is often no 'ground truth'.
Paul Allin is, of course, quite correct that survey data are generated during the course of some operation. Perhaps I might better have said that the operation (of collating the data) is intrinsic to what a survey is, whereas compiling administrative data is a side effect (or even often an afterthought) of some operation. In the paper I made the point that, although administrative data might be free for the organization collecting them, they might not be free for other organizations. Indeed, as we move towards a world which increasingly recognizes the monetary value of data, so we should expect to see more discussion of these aspects. Allin quite rightly generalizes these observations to a wider discussion of where the costs (and benefits) fall.
I am grateful to Allin for drawing attention to the secondary analysis of survey data. And, in this regard, perhaps I can add a reference to the UK Data Archive in Essex (http://www.data-archive.ac.uk/home).
Gordon Blunt, from the perspective of the private rather than public sector, also notes that statisticians will typically not have been involved in the collection of administrative data. Like Babb, he stresses the critical importance of domain knowledge, and, like Allin, he comments that we must understand the client's motivations (in consultancy environments). It seems to me that all of this implies a need for statisticians to be more pre-emptive and actively involved during the data collection process, rather than simply reactive.
Neeraj Bharadwaj and Yuexiao Dong go further, asserting that managers are not passive in deciding what data to collect. There is opportunity for them to recognize that the operational data exhaust can be subsequently analysed as administrative data and that this analysis might in turn feed back to better operational decisions.
Agnes Herzberg says that it is difficult to tell whether I am for or against the use of administrative data, so I welcome the opportunity to state explicitly that I am in favour of it! However, and this is really the driver behind the paper, the fact that administrative data are a very rich resource, with huge potential for human benefit, should not lead us to overlook their shortcomings and the hurdles that we need to overcome: we should retain the statisticians' traditional cautious eye. That is what I have tried to do in the paper, and the nice examples that were provided by Lisa Budd, Sarah Henry, Jenny Mehew and others serve to illustrate my point. As Henry says, the data that are omitted from administrative sources might be describing the very people who are of primary interest.
I was interested in Andrew Garrett's comment that legacy systems mean that ‘there is little appetite or capacity to add or modify variables that would produce better statistics’. I think that this brings us back to the points made by Babb, Allin, Wallgren and Wallgren, Blunt and others, that statisticians need to engage more with the administrative data collection process, seeking to convince the data producers of the merits of bearing in mind future possible statistical analysis. Garrett's phrase ‘administrative data by design’ nicely captures this. This is related to Jorge Mateu's point about businesses which hope to capitalize on the potential of big data needing to change the way that they collect, store and use information.
There are, however, complications arising from inconsistent data practices between different organizations leading to problems when data are linked, as noted by Kayla Frisoli and Rebecca Nugent, Wendy Appleby, Fionn Murtagh and others. Since different agencies will often operate independently, it would be unrealistic to expect them to change definitions coherently. However, within the UK, as far as the functions of the UK Statistics Authority are concerned, section 80 of the Digital Economy Act amended the Statistics and Registration Service Act of 2007 by inserting several passages, including one which
‘may require [any public authority providing data to the UKSA] to consult the UKSA before making changes to (a) its processes for collecting, organising, storing or retrieving information’.
Frisoli and Nugent also comment about possible slow progress for various reasons, and I certainly endorse this, having experienced it when I was Chair of the Board of the Administrative Data Research Network. Incidentally, while on the subject of the Network, I am grateful to Peter Smith for drawing attention to its work, and in particular to the guidance on linking data drawn up by Gilbert et al. (2017), and to James Doidge and Ruth Gilbert for pointing to the work of the Administrative Data Research Centre for England. I am also delighted with Smith's illustrations of the benefits of combining administrative and survey data. Jan van den Brakel also makes some very useful suggestions about strategies to combine survey and administrative data to overcome their complementary shortcomings, and Jamie Moore, Gabriele Durrant and Peter Smith provide a powerful example.
Murtagh points out that administrative data are often aggregated before analysis. Working with data which have been aggregated always carries a risk—in particular, from such phenomena as the ecological fallacy.
I certainly think that it is going too far to assert, as Herzberg does, that ‘administrative data are not obtained by using sound methods’. The fact is that they are collected by sound methods, but that those methods are not the familiar random samples of traditional statistics. And, as for the suggestion that administrative data are collected for marketing only, that is also going too far: medical records, tax records, education records and so on are collected for a variety of reasons (see other discussion contributions for many examples), but marketing is not one of them.
Paul Smith and Raymond Chambers also draw attention to the importance of linked data. It seems to me that there are several distinct research communities working in this area (perhaps most importantly computer science and statistics, but also several application domains such as medicine), with less cross-fertilization than might be desirable. Smith and Chambers nicely summarize the unique role that statistics can play here, hopefully deriving sound estimators even in the face of less than perfect data, and I take their point about being more ambitious in this regard.
Appleby claims that the paper omits the possibility of having two different administrative data sources that purport to measure the same thing. I had intended that Section 6 of the paper should include this, as well as be about combining different types of data. It is certainly something I have encountered frequently while working with the UK Statistics Authority, so I am grateful to Appleby for raising it. The issue, as she illustrates with a particular example, is that raised above, namely that often different data providers use different definitions.
I like Arthur Barnett's demonstration that administrative data can err on the side of including too many cases, as well as too few! However, I am surprised by his assertion that there is
‘no empirical verification that UK consumer price indices meet the needs of government policy, commercial interests or the public’
given the very extensive public and user consultations which have taken place over the past few years. Regarding his comments about the draft revised Code of Practice for Official Statistics, I believe that it has been reworded.
I am grateful to Rajendra Bhansali for drawing attention to the fact that financial market data are a very clear example of administrative and transaction data. Wearing one of my hats, I spend time looking at such data, and it is abundantly clear that, although the efficient market hypothesis is a nice mathematical idealization, it leaves much to be desired as a description of the way that the real world works.
Francisco Javier García Perez, Libia Lara and Emilio Porcu spell out the implications of recent legislative changes regarding the use of administrative data. I should have said more about this in the paper—and I am grateful to them for rectifying my oversight. They point out that the UK's Digital Economy Act (section 64) allows official data to be analysed by other organizations, subject to certain conditions, and (section 71) provided that the research is in the public interest. Likewise new European Union legislation (the General Data Protection Regulation) delineates how personal data may be used.
Ian Hunt raises some deep and important issues. He says ‘without decent data no one can even specify useful models, let alone assess and compare them’. And I do think that this is one feature which distinguishes statistics from much so-called ‘data science’. Statisticians have a reputation for caution (Hunt's ‘empirical sceptic’), asking questions such as ‘have the data the quality to support the model?’, ‘Are the assumptions of my modelling procedure roughly satisfied?’ and so on. Regarding Hunt's comment that the interpretation of confidence intervals seems of ‘little use for a statistician's client who wishes to make a particular decision or inductive inference’, I do not see why. We can never know for sure how the single observed sample compares with the unobserved underlying distribution, so the best that we can do is to rely on the fact that the method typically (here 99 times out of 100) gives a range containing the true value of the parameter. I think that Andrew Garrett's comment on the importance of and the potential for sensitivity analysis in ensuring that valid conclusions are drawn from administrative data goes some way towards answering Hunt's final point.
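To illustrate the frequency interpretation invoked here (a simulation sketch of my own, not part of the paper; the sample size and distribution are arbitrary assumptions), repeated sampling shows a 99% interval for a mean covering the true value in roughly 99 of 100 replications.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_mean, n, replications = 50.0, 30, 10_000

covered = 0
for _ in range(replications):
    sample = rng.normal(loc=true_mean, scale=10.0, size=n)
    half_width = stats.t.ppf(0.995, df=n - 1) * sample.std(ddof=1) / np.sqrt(n)
    lower, upper = sample.mean() - half_width, sample.mean() + half_width
    covered += (lower <= true_mean <= upper)

# The empirical coverage should be close to the nominal 99%
print(f"empirical coverage: {covered / replications:.3f}")
```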
Francesca Ieva and Francesca Gasperoni's suggestion of administrative trials is interesting. This links back to the suggestion above, that engaging the statisticians in the data collection process can be beneficial in a variety of ways.
I am grateful to Ingegerd Jansson for the references to further methodological work on administrative data: this was exactly the sort of contribution that I hoped the paper would stimulate, raising awareness of what is out there.
Kuldeep Kumar has hit on something that became increasingly apparent to me while I was writing the paper: although some of the issues were specific to administrative data, and some were more pointed for such data than other kinds, many issues were also common across types of data. There has been work on the quality of linked data; see, for example, Zaveri et al. (2012).
In answer to two specific questions from Paul McNicholas, Sharon McNicholas and Peter Tait: yes, large n is one way in which big data can arise, and whereas administrative data may be ‘big’ there are other kinds of big data, such as those generated from particle physics experiments or the latest generation of space telescopes. I am grateful to these contributors for bringing up the possibility of manipulated data. I think that there are rich possibilities for detecting fraud and other kinds of manipulation in administrative data, as is illustrated by work on detecting electoral and scientific fraud. I am grateful to these discussants and to Nick Longford for drawing further attention to causal modelling with observational data.
Like Daniel Oberski, I also am an enthusiast for latent variable models, and I also see these as a major part of the solution to combining data of different kinds. By mapping different kinds of representations to a common underlying structure, the different data modes can all usefully contribute.
Many of the discussants, including Marcelo Ruiz, Victor Yohai and Ruben Zamar, and Milan Stehlík, Silvia Stehlíková and Ludy Núñez Soza, drew attention to data quality issues of various kinds, as well as to their potentially major consequences. From one angle or another, this certainly seems like a central issue for administrative data—belying the popular notion that such data are exhaustive and error free.
Jude Towers raises the issue of how administrative data in particular are ‘socially constructed’. I did cite Campbell's law, but perhaps I should have said more about this—because, as Towers implies, there are circumstances when it can be critical.
I must conclude by mentioning one area which, rather disappointingly, was barely picked up in the discussion. This is the question of what statistical methodology we should teach in the context of administrative data. Given the growth in the use of such data, this is clearly important.
Acknowledgements
The first draft of this paper was written as part of the Isaac Newton Institute programme on ‘Data linkage and anonymization’, July–December 2016. I would like to express my appreciation to the three referees and the Associate Editor for their detailed and helpful comments, which led to substantial improvement of the paper. The opinions expressed in this paper are the personal opinions of the author and do not necessarily reflect those of any organization with which the author is associated.