Summary

A very large share of the adult population frequently assents to providing data on their place of residence to local governments and businesses when registering for or acquiring goods and services. When linked together, such data can provide highly granular inventories of local populations and their characteristics on far faster refresh cycles than conventional statistical sources. However, each of the constituent sources of data is of largely unknown provenance. We describe how careful curation, linkage and analysis of sources of consumer and administrative data can resolve many questions of content and coverage, resulting in comprehensive, highly disaggregated and frequently updateable representations of population structure, along with reliable estimates of incompleteness and possible bias. We link 20 consecutive annual public UK registers of electors to a range of sources of consumer data to create annual updates to a longitudinal profile of the adult residents of almost every domestic property. We illustrate the applicability and value of the resulting unique data resource through the derivation of an annual small area household change index. We also assess the prospects of other, related, data linkage projects.

1 Introduction

A large and increasing share of the ‘big data’ that have been collected about citizens in recent years has arisen through transactions between consumers and the (private and public sector) organizations which provide them with goods and services. Collectively, they can be referred to as consumer data, although they comprise a range of forms and originate from a wide variety of sources (Longley et al., 2018). They are best thought of as digital ‘exhaust’, in that they are essentially by-products of business or service delivery (Harford, 2014). These data can also be interpreted as digital footprints that can be repurposed to create precise indicators of population statistics to supplement those traditionally collected by government agencies. In other instances, such data can provide entirely novel insights on population activities and characteristics.

The wide penetration of big data collection procedures is today enticing researchers and statistics agencies to repurpose these data to describe the population at large. Indeed, the future of conventional long-form-based censuses is uncertain and several countries are considering ways in which conventional statistics might be supplemented by using administrative records (Office for National Statistics, 2018) or even commercial data (Office for National Statistics, 2017a). The core limitation of traditional sources of population statistics is that data sets that aim to achieve nearly complete coverage are costly to produce and are infrequently collected. New forms of data are attractive because of their volume, velocity and (often) ready availability, despite their disconnection from scientific sampling procedures or quality controls. Seen from this perspective, data-driven approaches are motivated by the richer content and faster refresh of nascent big data sources, albeit at the expense of full population coverage and of a sound basis for generalization, inference and scientific replicability (Hand, 2018; Norman et al., 2017).

Thus, new forms of data are fundamentally changing empirical analysis in statistics, and indeed the practice of social science. Concerns have arisen that the epistemology of ‘data-driven’ approaches to representing populations is unclear (Miller and Goodchild, 2015), and that data-driven analytical methods may be unable to accommodate the poorly understood sources and operation of bias in big data. Few, if any, sources of consumer data can approach completeness of coverage, not least because no consumer organization has a monopoly of market share, and few if any goods or services are consumed by every member of any crisply defined population (Lansley and Cheshire, 2018). Conventional survey research requires prespecification of the probability of selection of any member of a known and clearly defined population, which is a condition that is less easy to fulfil with many administrative sources, where some subgroups may be underenumerated or overenumerated (Office for National Statistics, 2017b). Furthermore, many administrative data sets are difficult to reconcile with one another in the absence of an overarching address frame (Goerge and Lee, 2002). Further complications arise when individuals change address. However, none of these problems is insurmountable and the spirit of our own research is to develop and extend work that has been undertaken by using administrative data in the context of consumer data research. We believe that this offers the prospect of improving the range of characteristics that can be assigned to individuals, and of improving the spatial and temporal granularity for which such data may be harvested (see also Office for National Statistics (2018)).

Our specific goal is to reuse underexploited consumer data to construct an annually updated linked database of the residences of the individuals and households that make up the entire UK adult population. Although the source and operation of bias in the component data sets are largely unknown, we develop and apply address matching and data linkage procedures to develop a consistent inventory of individual names and addresses for the period 1997–2016. Our motivation is to facilitate reliable annual estimation of the changing attributes of neighbourhoods and the characteristics of households that reside within them, and to understand the social and spatial consequences of residential mobility better. Most of our sources of data are from commercial organizations, but we also include public versions of the electoral roll, which, although they fulfil administrative functions, are also considered ‘consumer data’ because they facilitate choice of elected representatives and (since 2003) indicate consent by the named individuals to be contacted for unrelated purposes such as marketing.

Here we describe the creation of an individual level linked consumer register (LCR) which traces the residences of individuals from 1997 to 2016. This process required the initial assembly of multiple sources of consumer data, as defined above, their reconciliation with an assured address framework for each year and their subsequent linkage over time at the level of the individual. Procedures were developed to establish the provenance of the various sources. Through the amalgamation of 20 years of linked records, we present a detailed individual level product and demonstrate how it can be used to infer a longitudinal perspective on the changing characteristics of the adult population. We conclude by speculating on the implications of this work for empiricist approaches to social science in the big data era.

2 Consumer data sources

No comprehensive population register is collected for the UK population, although local authorities have a statutory obligation to make available annual lists of electors who have not opted out of inclusion, according to a published schedule of charges. Our constituent data sets each comprised lists of adults’ names and addresses, obtained with appropriate consents, along with dates on which individuals were ‘last seen’ by each data collection agency. These dates were bunched around the deadline for filing voter registrations in the case of the public electoral rolls. The data were structured into annual time intervals for the 20-year period.

The full electoral register includes eligible electors for both parliamentary and local government elections. Before electors were given the option to opt out of inclusion from 2003 onwards, electoral registers were frequently used to frame social surveys and investigations (Hoinville and Jowell, 1978). The bulk of electoral registrations are compiled during a canvassing period in October each year and the public versions are usually made available by individual local authorities after the following February. Thus, the compiled registers generally represent the population of the preceding year.

The coverage of electorates in the public ‘edited’ register has gradually decreased since its introduction (White and Horne, 2014). By 2013, 14.6 million electors in England and Wales opted to be excluded from the edited register. Following the introduction of individual electoral registration in place of registration by a self-nominated head of household in 2014, the number of opt-outs had increased to 23.6 million in 2016, or 56% of electors. The opt-out rate varies considerably by local authority (Fig. 1), at least in part because of differences in the layout of voter registration materials.

Fig. 1. Proportion of electors who opted out of the edited electoral register in (a) 2014, (b) 2015 and (c) 2016 in England and Wales

It has become the practice of value-added data resellers to supplement the public electoral roll with additional consumer data, to enhance its value in marketing applications. The data in the research that is reported here were sourced from the composite ‘consumer registers’ for 2003–2012 from DataTalk Ltd (St Ives, UK), and for 2013–2017 from CACI Ltd (London, UK). The identities of the providers of the consumer data enhancements are not revealed for commercial reasons, but each is identified by a separate flag in the files. Generally, the proportion of records in the consumer registers that were acquired from the consumer data files fluctuates between 20% and 40% for each data set. The total numbers of records, and the associated proportions obtained from the contemporaneous electoral register, are shown in Table 1.

Table 1

Number of records in each electoral register (1998–2002) and each consumer register (2003–2017), and the proportion of records which are derived from the most recent public version of the electoral register

Year    Individual records    % electoral register
1998    45466638              100
1999    46299201              100
2000    46616530              100
2001    44037323              100
2002    43713671              100
2003    44881619              76.04
2004    42733269              73.69
2005    41527046              72.50
2006    37573888              77.30
2007    36032336              76.69
2008    36556222              72.12
2009    33161520              75.04
2010    42203205              57.00
2011    43524797              55.78
2012    41235002              63.97
2013    48370910              46.47†
2014    54283557              57.19
2015    51820247              54.49
2016    51387463              45.98
2017    53711052              39.82

†The percentage for 2013 is a minimum figure, because of some ambiguity on the data source flags provided for this year’s data.

Compared with DataTalk Ltd, the registers from CACI Ltd include more individuals because records are carried forward to later years if no new information on the individuals who were resident at an address was collected. CACI date stamps indicate that the proportion of records that were collected within 12 months of each annual release incrementally but cumulatively declined in successive registers. The consumer registers do not conform to any address standard and so official address products from the Royal Mail, Ordnance Survey and Land Registry were used to establish consistency. Each of the data sets is described in Table 2, along with summaries of the ways in which they were repurposed for database creation. The data on resident individuals and households are less reliable than those on addresses (Lynn and Taylor, 1995). AddressBase and address level house sale data were not available for Northern Ireland.

Table 2

Summary of the data components of the LCRs

Source of data: Full electoral register 1998–2002
Purpose of collection: Enumeration of all named voters (age 17 years or older) for all elections: this includes attainers, i.e. teenagers due to become eligible voters during the currency of the registers
Likely strengths: Legal requirement for completion (with minor caveats); includes Old and New Commonwealth citizens; includes Irish citizens and other European Union citizens
Likely limitations: Underrepresentation of Commonwealth and European Union citizens; double or undercounting of students and recent movers; possible double counting of some second-home owners; no non-voters

Source of data: Public version of electoral register 2003–2017
Purpose of collection: Enumeration of all named voters and those coming of voting age for any elections: in Scotland, attainers from the local government register are not included (as these can be as young as 15 years old)
Likely strengths: As above, although the European Union enlarged during this time period: Scotland lowered the legal voting age to 16 years in 2013 for local elections
Likely limitations: As above plus exclusion of ‘opt-out’ individuals: variability in opt-out rates from 24% to 60% over the period 2003–2017 (see Table 1)

Source of data: Consumer files (2003–2017)
Purpose of collection: Provision and promotion of consumer goods and services
Likely strengths: Fills in many of those who ‘opt out’ of the public version of the electoral register and those ineligible to vote
Likely limitations: Unknown motivations for inclusion and consent; possible systematic bias for inclusion; non-standard address fields

Source of data: Land Registry records of domestic property transactions in England and Wales (1995–2017)
Purpose of collection: Payment of stamp duty and title registration
Likely strengths: All transactions recorded; very high correspondence with residential moves in owner-occupied sector; precise transaction dates
Likely limitations: Hard to differentiate the minority of landlord transactions from the majority of owner-occupier residential moves

Source of data: Registers of Scotland records of domestic property transactions in Scotland (2003–2016)
Purpose of collection: As above
Likely strengths: As above
Likely limitations: As above

Source of data: Ordnance Survey AddressBase Premium 2018
Purpose of collection: Enumeration and location of residential addresses (including historic records dating back to 1990)
Likely strengths: Nearly complete coverage of residential addresses; all address names have been consistently formatted and include a unique reference number which is used in other official products
Likely limitations: Not entirely complete or accurate: Great Britain only

Source of data: Postcode Address File® 2016
Purpose of collection: Enumeration and approximate locations of residential addresses
Likely strengths: As above, but extends to the rest of the UK
Likely limitations: Not entirely complete or accurate: contain some non-domestic records; only made available for a single snapshot in 2016

Past research has suggested that the full electoral register underenumerates young adults and ethnic minorities (Lynn and Taylor, 1995). Private rental tenants and recent movers are also known to be underenumerated (Electoral Commission, 2016). Unfortunately, less research has focused on the provenance of the edited version of the electoral registers, beyond rudimentary geographical analysis at local authority scale (Fig. 1). Little is known about the quality of the consumer data files, beyond that they are supplied by four different suppliers in most consumer registers (2013–2017) and we expect that their compilation may be prone to errors (e.g. data linkage errors) which could lead to duplications or removal of records. Prima facie, it is reasonable to expect a similar lack of coverage of recent movers and migrants as the data involve address-based registrations, although non-voters are eligible for inclusion. Indeed, we expect that in blending multiple data sets of unknown provenance we may encounter issues of undercoverage (of hard-to-reach groups) and overcoverage (of those who might be duplicated because of changes of address or second-home ownership). There is a need to investigate such issues in future research, using methods promulgated in the Census Coverage Survey (Abbott, 2009).

3 Constructing the linked consumer register

A barrier to our core objective of establishing a reliable linked data product is that individuals, businesses or local authorities may use differing conventions for recording names and addresses. As such, it is often difficult to reconcile individual records, requiring the development of bespoke heuristics. The construction of the LCR required two core linkage exercises: the construction of a common address spine, and attribution of household composition (including assignment of houses in multiple occupation) to each address. Below we first describe the steps that were used to link records pertaining to the same address and to deduplicate individual records by using linkage to the address frames and fuzzy matching procedures. Second, we link individuals through matching names at each address and a series of steps which attempt to identify instances where individuals may have changed name or recorded a part of their name in a different way.

Given the personal nature of the data, ethical review was sought and approval was granted subject to conduct of the research in a safe researcher environment. This research considers only public and private sector data sets for which appropriate consents have been obtained by third-party organizations. Our processing of the data falls under the public interest derogation for research under Article 89 of the General Data Protection Regulation. Although formed from proprietary component data sources, the resulting LCRs are available for bona fide research purposes on successful application by accredited safe researchers to the UK Economic and Social Research Council Consumer Data Research Centre (cdrc.ac.uk). Such access extends to the code (written in Scala and structured query language) that has been used to link the registers for different years. Furthermore, aggregated data products which have been run through disclosure controls will be made available to the research community and public institutions to improve the availability of statistics for further research and end uses in providing public services.

3.1 Address matching

Across all the registers for 1998–2017, 67.6 million unique address strings were recorded: more than twice the expected number of addresses. An initial exploration suggested that some unique addresses were composed in any of eight major variants. Table 3 identifies the nature of the address matching task over the 1998–2017 period.

Table 3

Number of unique address strings in each source of data

Sources of data                                                                                 Number of unique address strings
Consumer and electoral registers, 1998–2017                                                     67582896
AddressBase Premium 2018 (includes demolished addresses from 1990 and non-domestic addresses)   45967398
Postcode Address File® 2016                                                                     30063575
Land Registry (England and Wales), 1995–2017, cumulative total, property sales only             16115514
Registers of Scotland, 2003–2016, cumulative total, property sales only                         1562488

2014 Department for Communities and Local Government dwelling estimate: 28.1 million.

It was therefore necessary to standardize and consolidate the list of addresses from the diverse sources by using AddressBase Premium and the Postcode Address File® (PAF). These data sets each contain individual address records for Great Britain and the UK respectively, and both were used to establish consistent content, format and complete UK coverage.

Before matching, the addresses in the consumer registers needed to be cleaned and reformatted to remove inconsistencies. Common abbreviations (such as ‘st.’ or ‘rd’) were expanded to their full forms, and commonly used property partitions (such as ‘gff’: ground floor flat) were similarly expanded by using a standardization procedure. Changes in postcodes were accommodated by using a Royal Mail update lookup table of 272240 postcodes that changed between 1992 and 2006. Other possible duplicates were identified by filtering out multiple-unit postcodes that shared precisely identical reference co-ordinates in the Office for National Statistics (ONS) Postcode Database.
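The cleaning step can be illustrated with a minimal Python sketch. It is not the procedure that was used to build the LCR: the abbreviation table, the postcode lookup and the function names are all hypothetical, and a production version would need to resolve ambiguous tokens (for example ‘st’ as ‘street’ or ‘saint’).

```python
import re

# Hypothetical abbreviation table; the full set used for the LCR is not listed here.
# Note that 'st' is ambiguous (street or saint); this sketch ignores that.
ABBREVIATIONS = {
    "st": "street",
    "rd": "road",
    "ave": "avenue",
    "gff": "ground floor flat",
}

# Hypothetical old-to-new postcode lookup standing in for the Royal Mail table.
POSTCODE_UPDATES = {"AB1 2CD": "AB10 2CD"}


def standardize_address(address: str) -> str:
    """Lower-case, strip punctuation and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def update_postcode(postcode: str) -> str:
    """Normalise case and spacing, then apply any recorded postcode change."""
    compact = re.sub(r"\s+", "", postcode.upper())
    # UK postcodes end in a three-character inward code.
    normalised = f"{compact[:-3]} {compact[-3:]}"
    return POSTCODE_UPDATES.get(normalised, normalised)


print(standardize_address("GFF, 12 High St."))
print(update_postcode("ab1 2cd"))
```

Keeping the abbreviation and postcode tables as data, rather than hard-coding them in the functions, makes it straightforward to extend them as further variants are discovered.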

Following this, we utilized three approaches to address matching. At each stage, we attempted to reduce the number of unique addresses in the consumer registers. The procedures were designed to minimize false matches in favour of non-matches, as the latter could be picked up in a subsequent stage. They are briefly summarized below.

3.1.1 Rule-based matching

Given that the consumer registers share no common standardized address format, and that any component may be inconsistently ordered or configured, we successively rearranged the address components in the AddressBase Premium and PAF framework data sets to ascertain whether any would then directly match to the registers. Only matches within the same unit postcode identifier were considered. However, it is possible that selecting certain components of an address may incorrectly match some addresses. Therefore, matches that linked to multiple records in AddressBase Premium or the PAF were not amended.
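A minimal Python sketch of this idea follows. The reference records, component names and helper functions are hypothetical; the sketch simply enumerates orderings of subsets of the reference components within a postcode and accepts a match only when exactly one reference record produces the register string.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical reference records: (reference id, postcode, address components).
REFERENCE = [
    ("ref-1", "AB1 2CD", {"flat": "flat 1", "number": "12", "street": "high street"}),
    ("ref-2", "AB1 2CD", {"flat": "flat 2", "number": "12", "street": "high street"}),
    ("ref-3", "AB1 2CD", {"flat": "", "number": "14", "street": "high street"}),
]


def build_index(reference):
    """Map (postcode, candidate string) to the reference ids that can produce it."""
    index = defaultdict(set)
    for ref_id, postcode, parts in reference:
        values = [value for value in parts.values() if value]
        # Every ordering of every subset of at least two components.
        for size in range(2, len(values) + 1):
            for arrangement in permutations(values, size):
                index[(postcode, " ".join(arrangement))].add(ref_id)
    return index


def match(postcode, address, index):
    """Return the single matching reference id, or None if absent or ambiguous."""
    candidates = index.get((postcode, address), set())
    return next(iter(candidates)) if len(candidates) == 1 else None


INDEX = build_index(REFERENCE)
print(match("AB1 2CD", "12 high street flat 1", INDEX))  # unique arrangement
print(match("AB1 2CD", "12 high street", INDEX))         # shared by two flats
```

The second lookup returns no match because dropping the flat component makes the string consistent with two reference records, which mirrors the decision not to amend matches that linked to multiple records.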

3.1.2 Occupier-based matching

We also took advantage of the data on residents to reduce the number of unique addresses in our database. Our assumption was that it is very unlikely that two properties within a postcode will share an identical composition of residents’ names. Thus, we concatenated occupants’ names from addresses and searched for identical occurrences within the same postcode for each source register, and we repeated the procedure for immediately succeeding years. Where duplicates were identified, they were merged into a single address (favouring the string that matches AddressBase Premium or occurred most commonly).
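The occupier-based step can be sketched in Python as a grouping of address strings by an order-independent key of residents' names within each postcode. The rows and function names below are hypothetical, and the fallback to the most commonly occurring string is simplified to the first string seen.

```python
from collections import defaultdict

# Hypothetical register rows: (postcode, address string, occupant name).
ROWS = [
    ("AB1 2CD", "12 high street", "jane smith"),
    ("AB1 2CD", "12 high street", "john smith"),
    ("AB1 2CD", "12 high st", "john smith"),
    ("AB1 2CD", "12 high st", "jane smith"),
    ("AB1 2CD", "14 high street", "ann jones"),
]


def occupier_key(names):
    """Order-independent concatenation of the occupants' names."""
    return "|".join(sorted(names))


def merge_by_occupiers(rows, reference_addresses):
    """Map each (postcode, address) to a preferred string with the same occupiers."""
    occupants = defaultdict(set)
    for postcode, address, name in rows:
        occupants[(postcode, address)].add(name)

    groups = defaultdict(list)
    for (postcode, address), names in occupants.items():
        groups[(postcode, occupier_key(names))].append(address)

    mapping = {}
    for (postcode, _), addresses in groups.items():
        # Prefer a string found in the reference frame; otherwise keep the
        # first one seen (a simplification of 'occurred most commonly').
        preferred = next(
            (a for a in addresses if a in reference_addresses), addresses[0]
        )
        for address in addresses:
            mapping[(postcode, address)] = preferred
    return mapping


MAPPING = merge_by_occupiers(ROWS, {"12 high street", "14 high street"})
print(MAPPING[("AB1 2CD", "12 high st")])
```

Sorting the names before concatenation means that the order in which residents are listed in each register does not affect the comparison.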

3.1.3 Fuzzy matching

It is also feasible that addresses may not match because of typographic errors. Therefore, we implemented a fuzzy matching procedure which was based on three separate techniques:

  • (a) a comparison of flat and address numbers to give an indication of the likelihood that they pertain to the same address (if the numbers did not match, the pairs were considered further);

  • (b) a comparison of text strings by using a word bag approach to consider the difference in unique words used in the addresses (common address words, such as ‘road’ and ‘street’, were assigned low weights in inverse proportion to their frequency of occurrence; this step also took into account use of common abbreviations);

  • (c) use of a variant of the Levenshtein distance (edit distance) of the difference between successive two-character strings, with stronger weighting on differences detected at the beginning of each address string, because the first address elements typically pertain to unique addresses and the later strings relate to aggregations such as districts or towns.

The three parts of the similarity function were linearly combined with tunable parameters to reduce false matches. The parameters were manually tuned following testing on small samples of the data.
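The sketch below shows one way in which three such components might be combined linearly in Python. It is not the implementation used for the analysis: the geometric `decay` weighting, the default parameter values, the word weights and all function names are assumptions, and the position-weighted edit distance over two-character strings is only one plausible reading of component (c).

```python
import re

def numbers_agree(a, b):
    """(a) 1.0 when both addresses carry the same set of numbers, else 0.0."""
    return float(set(re.findall(r"\d+", a)) == set(re.findall(r"\d+", b)))

def word_bag_similarity(a, b, weights, default=1.0):
    """(b) Weighted overlap of unique words; common words get low weights."""
    words_a, words_b = set(a.split()), set(b.split())
    total = sum(weights.get(w, default) for w in words_a | words_b)
    shared = sum(weights.get(w, default) for w in words_a & words_b)
    return shared / total if total else 1.0

def bigrams(s):
    return [s[i:i + 2] for i in range(len(s) - 1)]

def weighted_bigram_distance(a, b, decay=0.9):
    """(c) Edit distance over bigram sequences in which an edit at
    position k costs decay**k, so early differences weigh more."""
    x, y = bigrams(a), bigrams(b)
    prev = [0.0]
    for j in range(1, len(y) + 1):
        prev.append(prev[-1] + decay ** (j - 1))
    for i in range(1, len(x) + 1):
        cur = [prev[0] + decay ** (i - 1)]
        for j in range(1, len(y) + 1):
            cost = decay ** (max(i, j) - 1)
            substitute = prev[j - 1] + (0.0 if x[i - 1] == y[j - 1] else cost)
            cur.append(min(substitute, prev[j] + cost, cur[j - 1] + cost))
        prev = cur
    return prev[-1]

def bigram_similarity(a, b, decay=0.9):
    n = max(len(a), len(b)) - 1
    if n <= 0:
        return 1.0 if a == b else 0.0
    worst = sum(decay ** k for k in range(n))  # cost of editing every bigram
    return 1.0 - weighted_bigram_distance(a, b, decay) / worst

def address_similarity(a, b, weights, params=(0.4, 0.3, 0.3)):
    """Linear combination of the three parts with tunable parameters."""
    p1, p2, p3 = params
    return (p1 * numbers_agree(a, b)
            + p2 * word_bag_similarity(a, b, weights)
            + p3 * bigram_similarity(a, b))
```

In practice the three parameters and any acceptance threshold would be tuned against a manually checked sample, as described above.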

3.1.4 Matching stages

Each stage of the matching processes condensed the total number of addresses in our consumer registers by eliminating possible duplicates (Table 4). However, it is of course extremely difficult to validate matching processes on data so vast, and some domestic residences may not be included in the PAF or AddressBase Premium. The existence of inconsistencies between AddressBase Premium and the PAF highlights the difficulties in attaining universal coverage.

Table 4

Cumulative reduction in addresses at successive stages of the analysis

| Step | Address identities |
|---|---|
| 1: rules based | 37976018 |
| 2: occupier matching | 36704969 |
| 3: fuzzy matching | 32034661 |

We estimate that around 30 million residential addresses have existed in the UK over the 1998–2017 period. This estimate derives from the number of active addresses in the PAF and AddressBase Premium, recent dwelling stock estimates, the number of demolitions (1998–2016) and the number of conversions between 1998 and 2016. Our final list of addresses that occur in the consumer and electoral registers stands at just over 32 million entries (see Table 4). This overestimation is possibly a consequence of the fact that our databases include postal addresses that may have been altered over time and have thus been duplicated; in addition, our data also include a very small proportion of non-domestic addresses. It is also feasible that some addresses may appear more than once because of the different formatting of individual records.

Table 5 shows how each of the final 32 million unique addresses was identified and assigned a unique reference number. Table 5 also shows how each unique address in the property sales data for England and Wales and for Scotland may be linked to the final unique reference numbers. In total, 94.9% of unique addresses where sales occurred could be linked to the unique addresses in the consumer registers. Almost all of these were linked to AddressBase Premium, indicating that Land Registry transaction data are generally of better quality than those assembled for electoral registration or marketing to consumers.

Table 5

Reference frames for the unique reference numbers in the consolidated 1998–2017 database

| Linkage | Consumer registers and electoral registers: unique addresses | Consumer registers and electoral registers: % | Property sales data: unique addresses | Property sales data: % |
|---|---|---|---|---|
| Link to AddressBase Premium | 28019531 | 87.47 | 13872557 | 99.78 |
| Link to PAF but not AddressBase Premium | 842479 | 2.63 | 3063 | 0.02 |
| Linked only to other consumer registers or electoral registers | 2927970 | 9.14 | 24371 | 0.18 |
| Unmatched or unique | 244681 | 0.76 | 3428 | 0.02 |

3.2 Resident matching

Having established a universal address spine for all the registers, we could then link residents across the 20-year period. Individuals’ names may differ between registers because of issues of marriage, name changes, alternative variants of spellings and misspellings. Therefore, an additional pipeline method was developed to improve the match rate of residents. In each step, the occurrence of each unique name at each unique address by year was recorded. It is rare in the UK, but conceivable, that a household may include multiple individuals who share the same name (e.g. junior and senior), although the consumer registers do not include minors. Implausibly high duplication of names occurred within addresses each year, averaging around 440000: duplicate names were thus flagged and merged. Before the analysis, empty spaces and punctuation were removed from the names (excluding hyphens which were used in subsequent steps).

3.2.1 Alternative versions of forenames

Apparent inconsistencies arise out of use of shortened or informal versions of forenames. We therefore developed a database of nicknames and their common name equivalents by recording the co-occurrence of forenames which commonly share both addresses and surnames. The assumption is that many individuals will record their different name variants over time and therefore, within addresses, the two monikers that they volunteer may share higher-than-expected rates of co-occurrence.

The most frequently co-occurring forenames were combinations of the most common, yet distinctive, names in the database—for example, almost 80000 Margarets and Johns were observed to share both addresses and surnames, and were not of interest to this analysis. Instead, Table 6 shows the pairs of names that had the highest co-occurrence ratios, i.e. the frequency of a co-occurrence relative to the total frequency of the less common name of a pair. For instance, 80% of occurrences of the name ‘Stpehen’ appear in the same household as an occurrence of the name ‘Stephen’. In this case, it is likely that the former name is a misspelling.
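The co-occurrence ratio can be sketched in Python as follows. The input layout (`(address_id, surname, forename)` tuples) and the function name are assumptions; for simplicity this sketch counts each forename once per address and surname combination, which may differ from the counting that was actually used.

```python
from collections import Counter, defaultdict
from itertools import combinations

def cooccurrence_ratios(records, min_pairs=1):
    """Map (less common name, more common name) to the share of the less
    common name's occurrences that share an address and surname with the
    more common name."""
    households = defaultdict(set)
    for address_id, surname, forename in records:
        households[(address_id, surname)].add(forename)
    name_counts, pair_counts = Counter(), Counter()
    for names in households.values():
        name_counts.update(names)
        pair_counts.update(combinations(sorted(names), 2))
    ratios = {}
    for (a, b), n in pair_counts.items():
        if n < min_pairs:
            continue
        rarer, commoner = (a, b) if name_counts[a] <= name_counts[b] else (b, a)
        ratios[(rarer, commoner)] = n / name_counts[rarer]
    return ratios
```

Setting `min_pairs=1000` would correspond to the frequency floor that is applied in the tables of name pairs.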

Table 6

The 10 most frequent moniker–forename pairs with a frequency of 1000 or more

| Moniker | Forename | Pair count | Moniker count | Co-occurrence ratio |
|---|---|---|---|---|
| stpehen | stephen | 1087 | 1364 | 0.80 |
| rober | robert | 1338 | 1865 | 0.72 |
| wiliam | william | 2406 | 3513 | 0.68 |
| gilian | gillian | 1370 | 2193 | 0.62 |
| magaret | margaret | 1131 | 1829 | 0.62 |
| patrica | patricia | 3750 | 6762 | 0.55 |
| valarie | valerie | 1834 | 3563 | 0.51 |
| malcom | malcolm | 1334 | 2719 | 0.49 |
| shiela | sheila | 2702 | 6337 | 0.43 |
| hillary | hilary | 1654 | 4149 | 0.40 |

In addition to common alternative spellings we also considered co-occurrences that differ in length by two or more characters to demonstrate shortened name variants (Table 7).

Table 7

The 10 most frequent shortened moniker–forename pairs with a frequency of 1000 or more

| Moniker | Forename | Pair count | Moniker count | Co-occurrence ratio |
|---|---|---|---|---|
| liz | elizabeth | 4445 | 14358 | 0.31 |
| tasha | natasha | 1208 | 4114 | 0.29 |
| pat | patricia | 5039 | 19397 | 0.26 |
| val | valerie | 1074 | 4158 | 0.26 |
| pam | pamela | 2117 | 8210 | 0.26 |
| les | leslie | 1478 | 5766 | 0.26 |
| gill | gillian | 2857 | 12066 | 0.24 |
| sue | susan | 9084 | 38961 | 0.23 |
| jacqui | jacqueline | 3248 | 14276 | 0.23 |
| mick | michael | 1380 | 6281 | 0.22 |

Thus, a moniker lookup table was produced for name pairs with co-occurrence ratios above 0.05. This table was manually inspected to ensure that no erroneous pairings were generated. Some monikers appeared to match multiple forenames and in such cases only the pairing with the highest score was retained. In addition, a handful of moniker–forename pairs were reversed to account for shortened names that could match two distinctive forenames, e.g. ‘Steve’ matched both ‘Stephen’ and ‘Steven’. Some pairs were removed if they were clearly not variants of the same name, e.g. Kehinde and Taiwo had a co-occurrence ratio of 13.6%. Interestingly, these names originate from West Africa and are typically given to twins. In total, 1253 unique monikers remained in the lookup table and over 680000 records were subsequently cleaned.
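The automatic part of this filtering can be sketched in a few lines of Python. The function name and the input format (a dictionary keyed by `(moniker, forename)` with the co-occurrence ratio as value) are assumptions; the manual inspection and the reversal of selected pairs are not reproduced.

```python
def build_moniker_lookup(ratios, threshold=0.05):
    """Keep pairs above the threshold and, for a moniker that matches
    several forenames, only the pairing with the highest ratio."""
    best = {}
    for (moniker, forename), ratio in ratios.items():
        if ratio <= threshold:
            continue
        if moniker not in best or ratio > best[moniker][1]:
            best[moniker] = (forename, ratio)
    return {moniker: forename for moniker, (forename, _) in best.items()}
```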

3.2.2 Initialisms

In total, almost 3.65 million records in the database provided initials instead of forenames. As they could hamper linkage when used inconsistently, we sought to link initials to other forenames that shared the same surname and began with the same letter. Where an initial could be linked to two or more other records on this basis, the flag identifying the source of the data was used to prioritize linkage of data from different providers; after this, priority was given to pairings that occurred across the most similar time period. A total of 1.68 million duplicates were identified and merged in this step.
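A simplified Python sketch of the initial-to-forename link is given below. It resolves an initial only when a single candidate forename exists at the address, and it leaves out the prioritization by data source and time period; the function name and input layout are illustrative assumptions.

```python
def resolve_initials(names):
    """names: list of (forename, surname) tuples recorded at one address.
    Replace a one-letter forename by the only full forename that shares
    its surname and first letter; leave ambiguous cases unchanged."""
    resolved = []
    for forename, surname in names:
        if len(forename) == 1:
            candidates = {f for f, s in names
                          if s == surname and len(f) > 1 and f[0] == forename}
            if len(candidates) == 1:
                forename = candidates.pop()
        resolved.append((forename, surname))
    return resolved
```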

3.2.3 Double-barrelled names

We also expected that some individuals may use both their double-barrelled surname and one of its single-surname components. We created a filter to identify cases where a forename occurred twice within a household: once with a double-barrelled surname, and once with just one of the components of the double-barrelled name. These records were then merged and the shorter name was retained. In total, 1.76 million records contained hyphens, although only approximately 200000 of them could be linked via the method described, as the majority of double-barrelled name bearers reported their surnames consistently. Following this stage, hyphens were removed and the matches were rerun.
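The filter can be sketched in Python as follows, assuming that the names at one address are held as a set of `(forename, surname)` tuples; the function name is hypothetical.

```python
def merge_double_barrelled(names):
    """Map each double-barrelled record to the record with the same
    forename and one component surname, when such a record exists at
    the same address. The shorter name is the one retained."""
    merges = {}
    for forename, surname in names:
        if "-" in surname:
            for part in surname.split("-"):
                if (forename, part) in names:
                    merges[(forename, surname)] = (forename, part)
                    break
    return merges
```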

3.2.4 Surname changes

Although name changes are generally rare, many women take their husband’s surname on marriage. It is estimated that between 1998 and 2015 there were almost 290000 marriages a year in the UK. An algorithm was developed to identify probable female surname changes following marriage. These records were deduplicated, and flagged with both the maiden and the married names in all databases as they might be useful for future linkage work.

The following additional steps were undertaken.

  • Step 1: gender was ascribed through linkage to a forename database of probable genders (see Lansley and Longley (2016)). The source database was compiled from over 10 million records from birth certificates and consumer data files. 94% of records in the LCR were assigned a probable gender at this stage of the analysis.

  • Step 2: a flag was created to identify female forenames that appeared multiple times (but with different surnames) at an address in our linked database.

  • Step 3: a second search identified whether one of the female surnames was shared with a male at the same address.

  • Step 4: married women were then identified where a probable female with a duplicated forename bore a family surname unless (a) the female without the family name was first recorded after the first recording of the individual with the family name or (b) the address contained a large (over 35) number of individuals.

In total, 1969411 probable married women with changed names were identified in the database.
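Steps 2–4 can be sketched in Python as below, taking the probable genders from step 1 as given. The record layout, the function name and the extra guard that skips cases where both surnames are shared with a male are assumptions of this sketch rather than documented features of the algorithm.

```python
def probable_married_women(residents, gender_of, max_household=35):
    """residents: list of dicts with 'forename', 'surname' and 'first_seen'
    (year) for one address; gender_of maps forename to 'F' or 'M'.
    Returns (forename, maiden_surname, married_surname) tuples."""
    if len(residents) > max_household:  # exception (b): very large addresses
        return []
    male_surnames = {r["surname"] for r in residents
                     if gender_of.get(r["forename"]) == "M"}
    females = [r for r in residents if gender_of.get(r["forename"]) == "F"]
    found = []
    for maiden in females:
        for married in females:
            if (maiden["forename"] == married["forename"]        # step 2
                    and maiden["surname"] != married["surname"]
                    and married["surname"] in male_surnames       # step 3
                    and maiden["surname"] not in male_surnames    # sketch-only guard
                    and maiden["first_seen"] <= married["first_seen"]):  # exception (a)
                found.append((maiden["forename"], maiden["surname"],
                              married["surname"]))
    return found
```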

3.2.5 Fuzzy matching

Despite the above steps, many misspelled names may be retained in the database so a fuzzy matching procedure was implemented. The Soundex fuzzy matching technique is based on phonetics as pronounced in English and was devised for matching names (Stanier, 1990). It produces Soundex codes based on homophones which can be used to group words that sound the same but are spelt slightly differently. A Soundex code was assigned to each name that remained in the database, although no changes were made to names that were matched at an address but nevertheless probably had different genders (e.g. Jean and John, and Michelle and Michael). Where a match occurred, the most common name was retained. The process was run separately for forenames and surnames.
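For readers unfamiliar with Soundex, a compact Python version of the standard (American) coding rules is sketched below. It is a generic implementation offered for illustration and is not necessarily the variant that was applied to the database.

```python
# Consonant groups of the standard Soundex scheme.
CODES = {letter: digit
         for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                                "4": "l", "5": "mn", "6": "r"}.items()
         for letter in letters}

def soundex(name):
    """Return the four-character Soundex code of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate consonants with the same code
        digit = CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept
    return (code + "000")[:4]
```

Both ‘Jean’ and ‘John’ receive the code J500, and both ‘Michelle’ and ‘Michael’ receive M240, which shows why a check on probable gender is needed before names are merged.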

A summary of the number of unique records in the data following each step is shown in Table 8. For each individual, we retained a flag to indicate the stage of the analysis at which they were linked, as a measure of uncertainty.

Table 8

Cumulative reduction in the number of unique names at unique addresses, at successive stages of the analysis

| Process | Number of unique records |
|---|---|
| Joining all registers | 154514095 |
| Text cleaning | 150031561 |
| Monikers | 149349785 |
| Initialisms | 147666541 |
| Double-barrelled names | 147485472 |
| Marriages | 145516061 |
| Fuzzy matching | 143789049 |

3.3 Identifying missing records

The amalgamation of data from numerous sources for each year may cause many individuals to appear in and to drop out of address records in successive registers. Thus, the final section of the data cleaning attempted to impute data where records were thought to be missing. The primary means of doing this was by identifying gaps in an individual’s apparent residence at an address, and then using data from adjacent time periods to fill in the gaps. Although it is possible that some people may indeed vacate a property and then return (e.g. university students), inspection of the data sets suggested that the vast majority of the gaps were from incomplete temporal records. When blending the registers, we aligned the records to the years when they were most probably collected. Thus most of the registers were time stamped as pertaining to the immediate previous year to account for their autumn collection dates unless specific ‘last-seen’ dates were provided.
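The gap-filling idea can be sketched in Python for a single individual at a single address. The function name and the ‘observed’ and ‘imputed’ labels are illustrative assumptions; the sketch simply fills every year between the first and last sighting and flags the filled years.

```python
def fill_gaps(years_seen):
    """Return every year between the first and last sighting, labelled
    'observed' or 'imputed' so that filled years remain identifiable."""
    observed = set(years_seen)
    if not observed:
        return {}
    return {year: ("observed" if year in observed else "imputed")
            for year in range(min(observed), max(observed) + 1)}
```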

The populations that were attributed to the registers for each year following the linkage exercise are included in Table 9. By monitoring residence over the entire study period, we can boost the number of individuals who are allocated to addresses throughout the intermediate years of our study. Only very small numbers of adults were added to the earlier years, largely because the coverage from the full electoral register data was very high. Unfortunately, this approach is less effective at supplementing records during the later years as the number of new records diminishes. Other consumer data providers are available, however, and in future work we plan to address this issue.

Table 9

Number of records in the final version of the UK LCR by year

| Year | Frequency of records seen following record linkage | Number of records from enhanced households | Number of anonymized records imputed from house sales data | Final counts in the LCR |
|---|---|---|---|---|
| 1997 | 45128532 | 0 | 0 | 45128532 |
| 1998 | 46973618 | 8446 | 411 | 46982475 |
| 1999 | 47365943 | 15023 | 2754 | 47383720 |
| 2000 | 47172883 | 37320 | 14505 | 47224708 |
| 2001 | 45717253 | 368915 | 24900 | 46111068 |
| 2002 | 47167988 | 81713 | 34605 | 47284306 |
| 2003 | 46157405 | 112111 | 57307 | 46326823 |
| 2004 | 45634333 | 862727 | 77875 | 46574935 |
| 2005 | 44604330 | 1546450 | 96549 | 46247329 |
| 2006 | 44131069 | 2355005 | 128712 | 46614786 |
| 2007 | 44835431 | 2302915 | 162632 | 47300978 |
| 2008 | 45008081 | 2603161 | 187080 | 47798322 |
| 2009 | 46875665 | 2175449 | 217888 | 49269002 |
| 2010 | 47909994 | 2506732 | 270567 | 50687293 |
| 2011 | 45908413 | 3830296 | 396688 | 50135397 |
| 2012 | 36714366 | 12512141 | 609189 | 49835696 |
| 2013 | 35619824 | 14709334 | 937793 | 51266951 |
| 2014 | 31260608 | 19552928 | 1433729 | 52247265 |
| 2015 | 27319898 | 23081685 | 2212712 | 52614295 |
| 2016 | 25732822 | 24630985 | 3156526 | 53520333 |

Efforts were made to simulate the missing records at addresses where no data were collected. This was achieved by bringing forward records from previous registers for active properties that were missing data in 2016, where active properties are those identified as in use in AddressBase. This approach is viable as the vast majority of adults who were recorded at a property in a particular year are likely to remain at their address during the following year. Unadjusted LCR records suggest that 95% of residents spend more than 1 year at their address. Between 2001 and 2011, an average of 11% of individual LCR records terminated in each year, indicating that the adult had either changed address or died. This amounts to a very modest apparent underenumeration relative to the figures from the 2011 ONS estimates of 12.2%. The ONS figure includes all international emigrants, internal migration and death statistics. We seek to accommodate vacancies between known residences by assuming that most such instances arise from failures in data capture. The Electoral Commission (2016) identified that a minority of elector records are correctly updated within a year of a change of address. Thus, for properties that appear to be vacant, we backdated incoming households by up to 2 years and, in a small number of instances, also rolled forward the outgoing household to fill the remaining void, although gaps of over 2 years occurred only for a very small minority of properties. These records were flagged to signal a measure of uncertainty about properties that may well have been vacant.

We also attempted to identify changes at properties that occurred since the last evidence (if any) that a property had been occupied. Property sales data were used to identify households that had probably vacated, and anonymous residents were imputed for vacant dwellings. The specific number of anonymous residents for each property was based on the median number of residents per year as recorded in the earlier data. Finally, there were some new build properties that were recorded as in use in AddressBase but had no recorded occupants in the LCR. These properties were allocated two notional adult residents. Transitions in the rental sector can be modelled by using historic lettings records from companies such as Zoopla or the tenancy deposit scheme, though these were not available for this analysis. This is unfortunate, given that these households have a higher residential churn rate than owner-occupied households.

The final counts in the LCR are shown in Table 9. The LCR data represent the vast majority of the adult population for every year between 1997 and 2016. They comprise data on almost every single active property for every year, although the size of the overall counts waned over the 20-year period. This is largely a consequence of incomplete households being recorded at many addresses. Although it was possible to enhance the data for many households where there were some relevant data, we did not have a means of imputing residents who were completely missing. In addition, some records were also missing where properties did not appear in our data for a large number of years. Fig. 2 compares the number of records in our analysis with UK mid-year population estimates. On average, the total number of imputed records in the LCR for each year is only 1.8% different from the mid-year estimates from 1997 to 2016. However, it was observed that following the analysis the final years of the LCR very slightly overestimate the adult population. This is probably because of the unknown size of new build properties (which were simply ascribed the rounded mean household size for the UK), and because some duplicated addresses may have been retained despite our best efforts. The slight overestimation of the adult population between 1998 and 2000 probably arises because of those registered at multiple addresses, or possible lags in data entries. Nevertheless, it is also worth remembering that official mid-year population estimates are approximate calculations (Rees et al., 2004).

Fig. 2

Number of individual records in the UK LCR compared with mid-year population estimates (series shown: original counts after linkage; enhanced counts; enhanced counts (excluding anonymized imputations); mid-year population estimates)

We also observed that, for each year, the counts in the LCR corresponded very closely to mid-year population estimates at the district level. The correlation coefficients for every year were over 0.99. The coefficients in the most recent 3 years were the lowest; however, mid-year population estimates are believed to deviate further from the true totals as the elapsed time since the most recent census increases.

4 Application: residential mobility and neighbourhood change

The LCR offers several opportunities for the investigation of neighbourhood characteristics at any convenient spatial scale, and with annual temporal refresh. A core limitation of censuses is that their infrequent collection makes it impossible to monitor rapid population changes. Thus, for example, Short (1978) argued that electoral registers could be an invaluable tool for understanding population turnover because of their annual refresh and high coverage. With the advent of opt-out provisions for electoral registers, and with the development of data handling technologies that allow integration of other sources of consumer data for which consents have been obtained, the LCR can be considered the natural successor to full public UK electoral registers. Following Short (1978) and Marshall (1971), we use recurrence of adults at addresses as a means of estimating population turnover across space (see also Clark and Coulter (2015)). Using the LCR series, we can develop estimates of annual neighbourhood turnover (or ‘churn’) for the entire UK settlement system and identify individual addresses where the occupants have changed annually. Thus we can identify areas that have undergone considerable change.

In our analysis, we have attempted to pinpoint the year in which each household joined and vacated an address. In this application we have investigated neighbourhood change by using the years in which the most recently identified households at each address first joined their properties. If none of the household members joined an address in the same year, then we considered the first-seen date of the earliest household member. Household members were defined as all residents who were estimated to be present in 2016. The specific dates were refined by using property sales records and aggregated to lower super-output areas of between 400 and 1200 households. The equivalent small area units were used for Scotland (data zones) and Northern Ireland (super-output areas). The resulting household change index (HCI) records the proportion of active addresses that have changed in occupation completely between 2016 and each of the preceding years. Active addresses were identified as properties that were recorded as in use in AddressBase Premium, were recorded in the 2016 PAF or were recorded in the consumer registers after 2010.
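A minimal Python sketch of the HCI calculation follows. It assumes that each active address has already been reduced to a pair of small area code and the first-seen year of its 2016 household; the function name and input layout are hypothetical.

```python
from collections import defaultdict

def household_change_index(addresses, since_year):
    """addresses: iterable of (area_code, first_seen_year) pairs, one per
    active address. Returns, for each area, the proportion of addresses
    whose 2016 household first appeared in since_year or later."""
    totals, changed = defaultdict(int), defaultdict(int)
    for area, first_seen in addresses:
        totals[area] += 1
        if first_seen >= since_year:
            changed[area] += 1
    return {area: changed[area] / totals[area] for area in totals}
```

Calling the function with each year in turn yields the index for every reference year from a single pass over the address list per year.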

This approach enables us to hone the dates for when an entire set of house members ‘refresh’. As such, the analysis is limited to addresses rather than household units. Houses in multiple occupation may exhibit partial rather than collective transitions and, in such instances, our approach records the year in which the first 2016 resident moved into the property. The cumulative frequencies of the ‘first-seen’ dates for the households in the HCI are shown in Table 10.

Table 10

Years in which the last recorded households in properties extant in 2016 joined their present address

| Year first seen | Frequency of households | Cumulative % |
|---|---|---|
| 1997 and before | 7839962 | 100.00 |
| 1998 | 795695 | 73.53 |
| 1999 | 639182 | 70.85 |
| 2000 | 710924 | 68.69 |
| 2001 | 517490 | 66.29 |
| 2002 | 859346 | 64.54 |
| 2003 | 672635 | 61.64 |
| 2004 | 645499 | 59.37 |
| 2005 | 673778 | 57.19 |
| 2006 | 793510 | 54.92 |
| 2007 | 823548 | 52.24 |
| 2008 | 621289 | 49.46 |
| 2009 | 1223087 | 47.36 |
| 2010 | 1424420 | 43.23 |
| 2011 | 1256909 | 38.42 |
| 2012 | 1656489 | 34.18 |
| 2013 | 2367456 | 28.59 |
| 2014 | 1394258 | 20.60 |
| 2015 | 2010633 | 15.89 |
| 2016 and after | 2695846 | 9.10 |

The change index estimates that just over 38% of households in 2016 had moved to their current address in the period since the start of 2011 (which coincides with when the most recent census was recorded). A large share of these households are likely to be from the private rental sector where short-term tenancy agreements are common. There is a dip in the frequency of households joining addresses in 2008. This might reflect the effects of the financial crisis in that year, which resulted in a sharp decrease in property sales.
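The cumulative percentages can be reproduced from the household frequencies in Table 10 by accumulating from the most recent year backwards, as in the short Python sketch below. The variable and function names are illustrative.

```python
# Household frequencies by year first seen, copied from Table 10.
FIRST_SEEN = {
    "1997 and before": 7839962, "1998": 795695, "1999": 639182,
    "2000": 710924, "2001": 517490, "2002": 859346, "2003": 672635,
    "2004": 645499, "2005": 673778, "2006": 793510, "2007": 823548,
    "2008": 621289, "2009": 1223087, "2010": 1424420, "2011": 1256909,
    "2012": 1656489, "2013": 2367456, "2014": 1394258, "2015": 2010633,
    "2016 and after": 2695846,
}

def cumulative_share_since(frequencies):
    """Percentage of households first seen in each year or later."""
    total = sum(frequencies.values())
    running, shares = 0, {}
    for label in reversed(list(frequencies)):  # newest year first
        running += frequencies[label]
        shares[label] = 100 * running / total
    return shares

shares = cumulative_share_since(FIRST_SEEN)
```

The entry for 2011 is the share of 2016 households that joined their address in 2011 or later, which is the figure of just over 38% quoted above.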

The lower super-output area level index reveals that, in general, central urban areas have experienced the greatest population turnover, especially during the last 5 years. Neighbourhoods that are known to have young and cosmopolitan populations with high proportions of properties in the private rental sector have experienced the greatest rate of change. For example, Fig. 3(a) shows the proportion of change since 2011 across Bristol. The central parts of the city experienced the highest rates of change, although areas where there have been extensive new residential developments also obviously experienced change. A large area of developments has occurred near the University of the West of England campus at Filton; here 84% of households have moved in since 2011. In addition, Fig. 3(b) shows that substantial changes have occurred in Portishead since 2001, following the redevelopment of the marina and the construction of a large number of properties to the east of the town.

Fig. 3

(a) Incidence of household change 2011–2016 across Bristol and surrounding areas and (b) the same measure for 2001–2016 (source: maps.cdrc.ac.uk)
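
The small area measure mapped in Fig. 3 can be thought of as the same calculation applied within each area. The sketch below, in Python, assumes a hypothetical input of one `(area_code, year_first_seen)` pair per household; the function name `change_index`, the area codes and the sample records are all invented for illustration, and the published index may involve further processing steps that are not shown here.

```python
from collections import defaultdict


def change_index(households, since_year):
    """Percentage of households per area first seen in or after since_year.

    households: iterable of (area_code, year_first_seen) pairs,
    one pair per household (hypothetical input format).
    """
    totals = defaultdict(int)
    recent = defaultdict(int)
    for area, year in households:
        totals[area] += 1
        if year >= since_year:
            recent[area] += 1
    return {area: round(100 * recent[area] / totals[area], 1) for area in totals}


# Made-up records for two made-up areas (not real LCR data).
sample = [
    ("AREA_A", 2012), ("AREA_A", 2015), ("AREA_A", 1999), ("AREA_A", 2014),
    ("AREA_B", 1997), ("AREA_B", 2003), ("AREA_B", 2013), ("AREA_B", 2001),
]

index = change_index(sample, since_year=2011)
print(index)  # {'AREA_A': 75.0, 'AREA_B': 25.0}
```

Changing `since_year` to 2001 would give the longer-run measure of the kind shown in Fig. 3(b).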

The analysis of population turnover is just one of many useful applications that can be developed by using the LCR. Table 11 identifies some other potential applications that might be useful for understanding local population composition and changes over time.

Table 11

Selection of potential research applications that might use the LCR

Variable | Comment
Gender and age group | Using a forenames database built from birth certificates and consumer data files it is possible to ascribe the probable demographic statistics to individual level names data (Lansley and Longley, 2016)
Ethnicity | Annual updates of neighbourhood ethnicity profiles can be developed by using an ethnicity estimator to ascribe probable ethnic group by using names (see Kandt and Longley (2018))
Household compositions | It is possible to detect the number of adults per address and to use surnames as indicators of family membership to detect rates of shared households (for example see Samuel et al. (2019))
Internal migration | Use of patterns in names within households to investigate the nature of residential transitions (as demonstrated in Lansley and Li (2018)): the granularity of the data enables us to measure trends such as social mobility through linkage to small area data on deprivation or socio-economics
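
As an illustration of the household composition application in Table 11, the sketch below counts adults per address and uses surnames as a rough indicator of family membership. It is written in Python under stated assumptions: the input format of one `(address_id, surname)` pair per adult, the function name `household_summary` and the rule for flagging a possibly shared household are all hypothetical, and they are not the method of Samuel et al. (2019).

```python
from collections import defaultdict


def household_summary(records, min_adults=3):
    """Summarise adults and surnames per address.

    records: iterable of (address_id, surname) pairs, one per adult.
    An address is flagged as possibly shared when it has at least
    `min_adults` adults and no two of them share a surname. This rule
    is an illustrative assumption only.
    """
    surnames = defaultdict(list)
    for address_id, surname in records:
        surnames[address_id].append(surname.strip().lower())

    summary = {}
    for address_id, names in surnames.items():
        n_adults = len(names)
        n_surnames = len(set(names))
        summary[address_id] = {
            "adults": n_adults,
            "distinct_surnames": n_surnames,
            "possibly_shared": n_adults >= min_adults and n_surnames == n_adults,
        }
    return summary


# Made-up records (not real register data).
sample = [
    ("addr1", "Smith"), ("addr1", "Smith"),
    ("addr2", "Patel"), ("addr2", "Jones"), ("addr2", "O'Neill"),
]

summary = household_summary(sample)
print(summary["addr2"]["possibly_shared"])  # True
```

In practice a surname rule of this kind would misclassify some families, which is one reason why such estimates are best validated against conventional sources.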

5 Evaluation

Ever-increasing amounts of data are collected about citizens today, and an increasing share of these data is collected through transactions with organizations. An important contribution of this work is our demonstration that such data can be blended with administrative data and refashioned into comprehensive, timely and granular data sets that can be used for the social good, for example by facilitating better understanding of neighbourhood dynamics and the sociospatial implications that follow from them. We thus conclude that such data linkage exercises present a pivotal opportunity to broaden our conceptions of population statistics to include indicators of activities and processes. The LCR, which we have described in this paper, provides an important underpinning for more granular and frequently updated demographic statistics. Such statistics are especially timely given the developing interest in non-traditional sources of social data (Hand, 2018) and data-led approaches to robust small area estimation (Tzavidis et al., 2018).

This analysis presents a means of supplementing conventional and new methods of estimating local population size and composition. UK ONS mid-year population estimates are blended from geographically referenced administrative data (such as births and deaths), national level data (such as international migration statistics) and decennial census data. This process involves amalgamating data that are produced at different time periods and at different scales and then attempting to deduce trends for small areas. In contrast, the LCR is built up from large assemblages of public electoral registers and sources of consumer data that remain grounded at the level of the individual. It can therefore offer fresh insights at a highly granular scale and can assist with honing aggregate statistics.

However, the consumer registers do not have full and accurate adult population coverage, and their provenance is unknown. This makes it necessary to ‘harden’ the raw data by anchoring them to conventional statistical sources, where and when these are available. There are thus challenges arising from repurposing data that are not collected for research purposes, not least because the sources of bias in the recorded data may be systematic, and may operate to exclude certain groups in society. Our analysis considered both AddressBase Premium and the PAF as address frames, but additional addresses were added where their occurrences in the consumer register could not be reconciled with these frames. Address-based frames are imperfect and also create data issues where they do not perfectly correspond to household definitions or property use categories.

The research that is described here is consistent with heightened interest in the use of hybrid sources of big data to supplement or even to replace some conventional official statistics (Hand, 2018). In particular, it draws on methods to enhance record linkage at the individual level, as is often required with administrative and consumer data sets. Our innovative approach to harnessing extensive lists of residents could be adopted by institutions that have access to more complete data sets that are unobtainable elsewhere. However, the sensitivity of the data dictates that such data sets should only be accessed at highly granular scales within safe research environments. Crucially, in addition to being a novel source of demographic data, the names-based individual level data provide a spine through which additional administrative and consumer data can be linked. Through this linkage, the provenance of those data can be investigated and new, pertinent and current geodemographic insights may be gleaned.

Acknowledgements

This work was funded by the UK Economic and Social Research Council Consumer Data Research Centre, grant reference ES/L011840/1. The raw data and aggregated outputs from this project may be obtained on successful application through the Consumer Data Research Centre Data Service. Access will also be granted to specialized software to assist with data linkage and processing. The Registers of Scotland data for this research were provided by the Urban Big Data Centre, Glasgow.

References

Abbott, O. (2009) 2011 UK Census coverage assessment and adjustment methodology. Popln Trends, 137, 25–32.

Clark, W. A. and Coulter, R. (2015) Who wants to move?: The role of neighbourhood change. Environ. Planng A, 47, 2683–2709.

Electoral Commission (2016) The December 2015 electoral registers in Great Britain, accuracy and completeness of the registers in Great Britain and the transition to Individual Electoral Registration. Report. Electoral Commission, London.

Goerge, R. M. and Lee, B. J. (2002) Matching and cleaning administrative data. New Zeal. Econ. Pap., 36, 63–64.

Hand, D. J. (2018) Statistical challenges of administrative and transaction data (with discussion). J. R. Statist. Soc. A, 181, 555–605.

Harford, T. (2014) Big data: are we making a big mistake? Significance, 11, no. 5, 14–19.

Hoinville, G. and Jowell, R. (1978) Survey Research Practice. London: Heinemann Educational Books.

Kandt, J. and Longley, P. A. (2018) Ethnicity estimation using family naming practices. PLOS One, 13, no. 8, article e0201774.

Lansley, G. and Cheshire, J. (2018) Challenges to representing the population from new forms of consumer data. Geog. Compass, 12, article e12374.

Lansley, G. and Li, W. (2018) Consumer registers as spatial data infrastructure and their use in migration and residential mobility research. In Consumer Data Research (eds P. Longley, J. Cheshire and A. Singleton), pp. 15–27. London: University College London Press.

Lansley, G. and Longley, P. (2016) Deriving age and gender from forenames for consumer analytics. J. Retail. Consmr Serv., 30, 271–278.

Longley, P. A., Singleton, A. and Cheshire, J. (eds) (2018) Consumer Data Research. London: University College London Press.

Lynn, P. and Taylor, B. (1995) On the bias and variance of samples of individuals: a comparison of the electoral registers and postcode address file as sampling frames. Statistician, 44, 173–194.

Marshall, M. L. (1971) The use of probability distributions for comparing the turnover of families in a residential area. In London Papers in Regional Science, vol. 2, Urban and Regional Planning (ed. A. G. Wilson), pp. 171–193. London: Pion.

Miller, H. J. and Goodchild, M. F. (2015) Data-driven geography. GeoJournal, 80, 449–461.

Norman, P., Marshall, A. and Lomax, N. (2017) Data analytics: on the cusp of using new sources? Rad. Statist., 116, 19–30.

Office for National Statistics (2017a) Research outputs: using mobile phone data to estimate commuting flows. Office for National Statistics, Newport. (Available from https://www.ons.gov.uk/census/censustransformationprogramme/administrativedatacensusproject/administrativedatacensusresearchoutputs/populationcharacteristics/researchoutputsusingmobilephonedatatoestimatecommutingflows.)

Office for National Statistics (2017b) Research outputs: estimating the size of the population in England and Wales, 2017 release. Office for National Statistics, Newport. (Available from https://www.ons.gov.uk/census/censustransformationprogramme/administrativedatacensusproject/administrativedatacensusresearchoutputs/sizeofthepopulation/researchoutputsestimatingthesizeofthepopulationinenglandandwales2017release?platform=hootsuite.)

Office for National Statistics (2018) Annual assessment of ONS’s progress on the Administrative Data Census: July 2018. Office for National Statistics, Newport. (Available from https://www.ons.gov.uk/census/censustransformationprogramme/administrativedatacensusproject/administrativedatacensusannualassessments/annualassessmentofonssprogressontheadministrativedatacensusjuly2018.)

Rees, P., Norman, P. and Brown, D. (2004) A framework for progressively improving small area population estimates. J. R. Statist. Soc. A, 167, 5–36.

Samuel, A., Lansley, G. and Coulter, R. (2019) Estimating the prevalence of shared accommodation across the UK from Big Data. In Proc. Conf. Geographical Information Science Research UK, Newcastle-upon-Tyne.

Short, J. R. (1978) Population turnover: problems in analysis and an alternative method. Area, 10, 231–235.

Stanier, A. (1990) How accurate is Soundex matching. Comput. Geneal., 3, 286–288.

Tzavidis, N., Zhang, L.-C., Luna, A., Schmid, T. and Rojas-Perilla, N. (2018) From start to finish: a framework for the production of small area official statistics (with discussion). J. R. Statist. Soc. A, 181, 927–979.

White, I. and Horne, A. (2014) Supply and sale of the electoral register. Report SN/PC/01020. House of Commons Library, London. (Available from http://researchbriefings.files.parliament.uk/documents/SN01020/SN01020.pdf.)

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)