Joseph J. Salvo, Annette Jacoby, Arun Peter Lobo, Census 2020 Why Increasing Self-Response is Key to a Good Count, Significance, Volume 17, Issue 1, February 2020, Pages 30–33, https://doi.org/10.1111/j.1740-9713.2020.01356.x
Abstract
With just weeks to go until the decennial United States Census, Joseph J. Salvo, Annette Jacoby and Arun Peter Lobo explain why proxy respondents, administrative records, and imputation are no substitute for a high rate of self-response.
For years now, survey researchers have watched with concern as people become ever more reluctant to respond to surveys. With the most important US survey of the last 10 years due to take place on 1 April 2020, worries about survey non-response take on a new, much larger dimension.
The survey in question is the 2020 US Census. All households are required to self-respond (that is, to fill in the questionnaire on their own), but many do not. The Census Bureau is emphatic that this non-response will be addressed by robust field data collection operations, known as non-response follow-up (NRFU). And when households still do not respond, even after visits from fieldworkers, the Census Bureau will use imputation and other procedures to correct for this.
According to the Census Bureau, these procedures, taken together, will result in everyone being counted in 2020. However, the Bureau concedes that NRFU, imputation, and other efforts to correct for non-response could result in lower-quality data (bit.ly/BrennanC). An accurate count of the population and data quality are closely intertwined; indeed, compromised data quality is often a consequence of a poor count – which is why low self-response is such a worry for all who rely on census data.
In what follows, we show how NRFU and other operations open the door to errors that have compromised past censuses and could do so again in 2020. While our discussion is rooted in New York City, its conclusions have implications for the census count in urban areas and other geographies that tend to have low self-response rates. We discuss what local governments can do to increase self-response, which will aid both the count and quality of data.
Self-response
Self-response is the gold standard in decennial census data collection and is the most accurate and efficient source of data. But there are a number of reasons why households might not self-respond. According to the Census Bureau, these reasons include fear of government and data breaches; low English-language proficiency; a diversifying population; more complex living arrangements; and a mobile population.¹
Regardless of the reason, a general unwillingness to take part in the census can be a precursor to an undercount, where the number of people counted is less than the actual resident population. This was the case in 1990, a particularly problematic census. That year, the overall self-response for New York City, measured by the mail return rate, was just 53%, and the city's undercount was estimated at 244,000, or 3.2%. Black neighbourhoods had among the lowest rates of self-response and, in general, neighbourhoods with low levels of self-response were subject to higher undercounts as well.
The differential pattern of self-response that was seen in 1990 was also evident in the 2010 Census. Again, predominantly black neighbourhoods had the lowest mail return rates and, as we show later, were more likely to have been undercounted.
Low mail returns adversely affect the quality of census data. In the 2010 Census, among those who self-responded, about 96% of returns were deemed correct – defined as an enumeration for an actual person or housing unit that was appropriate, complete, and in the correct location for the relevant tabulation – and only 0.5% of data for all household members were missing and had to be filled in, or imputed. The comparable figures for those who did not self-respond were 89% and 7%, respectively.
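To make the stakes concrete, the two correctness rates above can be blended into a rough overall figure for any given self-response rate. The sketch below is only an illustration: it assumes the national 2010 rates (96% and 89%) apply uniformly to every area, which the figures above do not establish, and the function name and example self-response rates are hypothetical.

```python
# Illustrative only: blends the 2010 national correctness rates quoted above.
CORRECT_SELF = 0.96  # share of self-response returns deemed correct (2010)
CORRECT_NRFU = 0.89  # comparable share for households that did not self-respond


def expected_correct_share(self_response_rate: float) -> float:
    """Weighted average of the two correctness rates.

    Assumes (hypothetically) that the national rates hold in every area.
    """
    if not 0.0 <= self_response_rate <= 1.0:
        raise ValueError("self_response_rate must be between 0 and 1")
    return (self_response_rate * CORRECT_SELF
            + (1.0 - self_response_rate) * CORRECT_NRFU)


if __name__ == "__main__":
    for rate in (0.50, 0.60, 0.70, 0.80):  # hypothetical self-response rates
        share = expected_correct_share(rate)
        print(f"self-response {rate:.0%}: about {share:.1%} correct")
```

Under this simplification, every additional point of self-response moves a household from the 89% group to the 96% group, so the blended share rises steadily with the self-response rate.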
High self-response in a community not only results in higher-quality data and a better count for households that respond, but it also limits the proportion of households that need to be followed up in NRFU operations – and these processes can adversely affect both the overall count and data quality.
Non-response follow-up
NRFU is meant to fill the gaps in census coverage created by non-response. But, unfortunately, neighbourhoods that have low self-response are also less likely to respond to fieldworkers knocking on doors.
According to standard NRFU procedures, when a housing unit appears to be occupied but no one responds to a knock on the door, enumerators are encouraged to speak to “proxy respondents”, including neighbours and landlords, to garner information on the unit and its occupants. In small, multi-unit buildings with informal or illegal – what we call “hidden” – housing units, information from proxies is often highly speculative and prone to error. Census Bureau research shows that in 2010, 93% of responses obtained from other household members were correct, but this was true for just 70% of proxy responses (bit.ly/2010coverage).
Often, it can even be difficult for NRFU fieldworkers to determine whether a unit is occupied or not. For example, in New York City in 2010, a host of neighbourhoods with strong housing markets showed increases in vacant units, sometimes in excess of 500%.² But many of these neighbourhoods had an abundance of smaller multi-unit buildings with hidden housing units and poor labelling of individual apartments, which made it hard for census enumerators to identify and determine their occupancy status. Comparisons with other data sources, as well as with the Census Bureau's own internal data, showed that this pattern of increases in vacant units was erroneous and likely to have been the result of ineffective NRFU operations, which are often tied to the time pressures faced by local census field staff.
The Census Bureau has procedures in place to determine whether non-responding units are occupied or whether they exist. In 2020, even a housing unit where mail gets returned as undeliverable as addressed (UAA) will be visited by the Census Bureau, as a safeguard, to confirm its existence and occupancy status. When this information cannot be elicited from a visit, administrative records could be used to assign this information. Unfortunately, though, there are serious issues regarding the completeness and accuracy of administrative records. For example, the Census Bureau found that about 20% of housing units determined to be vacant using administrative records in census tests conducted in 2016 in Los Angeles County, California, and Harris County, Texas, were actually occupied.³
Post-NRFU procedures
After NRFU operations are completed in 2020, the Census Bureau will rely on administrative records to assign missing characteristics and on donor households for imputation. But these procedures can produce data that are biased because administrative records tend to underrepresent certain groups, and the characteristics of those who need to be imputed are almost always different from those who self-respond.
Administrative data
In 2020, the Census Bureau will match individuals to administrative records to fill in missing items on race, age, gender, etc. However, administrative records have been found to be problematic in terms of coverage. When person records from the 2010 Census were compared with multiple administrative records to assess overall coverage, a match was found only 89% of the time nationally (bit.ly/2010matchstudy). Internal Revenue Service (IRS) data, one of the most important administrative data sources, had only 81% of household members represented, according to the Census Bureau, and this varied across the country.
The incomplete coverage in administrative records affects groups differently. For example, in administrative records, only 79% of Hispanics self-identified, while 92% of non-Hispanics self-identified.⁴ The same research showed that for smaller minority groups, even when administrative data were available, the records tended not to match responses in the 2010 Census. Young children, who tend to be undercounted, also have lower administrative record match rates than other age groups (bit.ly/OHareAdmin).
So, while the use of administrative records can help with assigning missing information, socially vulnerable populations are unlikely to benefit equally. As researchers from the Urban Institute put it, “these groups may not only miss the benefits of a more accurate count experienced by other groups with more extensive [administrative records], they may actually be misrepresented due to a lack of data that are assumed to exist but do not” (urbn.is/38r0tDJ).
People who interact with government agencies, have bank accounts, and deal with public institutions are more likely to be represented in administrative records, while more vulnerable populations – including immigrants, the formerly incarcerated, and homeless persons – might be underrepresented because of the absence of such ties. Even administrative data on benefits for underrepresented and marginalised populations fall short of capturing everyone, as those eligible to receive these benefits often do not apply for them.
While the use of multiple databases is supposed to ensure that groups will be represented in at least one of them, the underrepresentation of certain groups tends to be mirrored across different data sets. Moreover, sources like IRS data are assigned greater weight, and therefore less coverage of a group in these data will be even more consequential. Thus, the bias introduced by administrative records varies for population subgroups and geographic areas, which translates into poorer coverage and data quality for marginalised groups.
Imputation
Statistical imputation is the last resort used to fill in the remaining missing data. The Census Bureau has three imputation procedures in place. In count imputation, the status of a unit (non-existent, occupied or vacant) and household size are imputed through the use of proximate “donor” households. The Census Bureau also imputes characteristics when single items are missing, such as the age of a respondent (item imputation), or when all the characteristics of household residents are missing (whole-household imputation).
Count, item, and whole-household imputation in 2010 all made use of donor households to fill in missing information. These were households that had filled in their census forms and were often neighbours of non-responding households, and were thus assumed to be in some ways similar. However, as the Census Bureau points out: “As the country has diversified, … the assumption that neighbors are alike is growing increasingly tenuous” (bit.ly/AdminImpute). This is especially true in gentrifying areas of New York, where whites have higher self-response and could be used as donors for non-responding black or other minority households.
In 2010, almost 6 million persons enumerated in the census nationally were in households that were imputed in total using donor households (bit.ly/2010NRI). This process, referred to as substitution, is based on a statistical routine that substitutes the characteristics of a donor household in the immediate geographic area for the missing household. For a subset of 4.8 million, the Census Bureau asserts that an enumerator could determine the count of persons living at the address using proxy respondents and other information. For the remaining 1.2 million persons where proxy information was not available, the Census Bureau conducted a count imputation, which assigned an occupancy status and a count of persons to these households.
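The idea of substitution can be sketched in a few lines. The example below is a deliberately simplified nearest-neighbour donor routine, not the Census Bureau's actual algorithm: the `Household` fields, the one-dimensional `block_position`, and the group labels "A" and "B" are all hypothetical, chosen only to show how a non-responding household inherits whatever its nearest responding neighbour reported.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Household:
    block_position: float        # hypothetical location along a street
    size: Optional[int] = None   # None means the household did not respond
    group: Optional[str] = None  # hypothetical demographic label


def substitute(households: list[Household]) -> list[Household]:
    """Fill each non-responding household from its nearest responding neighbour."""
    donors = [h for h in households if h.size is not None]
    if not donors:
        raise ValueError("at least one responding household is needed as a donor")
    result = []
    for h in households:
        if h.size is not None:
            result.append(h)  # respondents are kept unchanged
            continue
        # Pick the donor closest to the non-respondent (ties go to the first found).
        donor = min(donors, key=lambda d: abs(d.block_position - h.block_position))
        result.append(Household(h.block_position, donor.size, donor.group))
    return result


if __name__ == "__main__":
    street = [Household(1.0, 2, "A"), Household(2.0), Household(5.0, 4, "B")]
    for household in substitute(street):
        print(household)
```

In this toy street, the non-respondent at position 2.0 is recorded as a two-person household in group "A" simply because that is what its nearest responding neighbour reported, whatever its true size and characteristics may be.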
While substituted persons accounted for 1.9% of the nation's enumerated population in 2010, this proportion varied considerably across counties and population subgroups. Substitution in New York City was at 3%, with the highest rates of substitution clustered in just a few black neighbourhoods, where as much as 20% of the population was “enumerated” through a substitution algorithm. Indeed, Table 1 shows that, compared to the city overall, neighbourhoods with substitution rates greater than 6% were disproportionately home to hard-to-count groups, such as blacks, persons living in poverty, female-headed families with children, and those living in small, multi-unit buildings. Since this form of imputation extrapolates data based on neighbouring donor households, error is introduced because the characteristics of non-respondents are unknown and may differ substantially from those reported by donor respondents.
Table 1. Characteristics of New York City areas∗ with high rates of substitution, 2010. Sources: 2010 Census, Summary File 1, 2013–2017 A, Table P44; NYC Department of Finance.
Characteristic | Areas with high substitution | City average |
---|---|---|
Black (%) | 46.8 | 24.6 |
Below poverty (%) | 20.9 | 18.2 |
Female-headed families with children (%) | 20.1 | 14.4 |
Lots with 1 or 2 housing units (%) | 25.6 | 17.6 |
Lots with 3 or 4 housing units (%) | 20.3 | 12.1 |
∗ Aggregation of census tracts with substitution greater than 6%.
The problem with a “zero” undercount
Given the inherent problems with its field and statistical operations, the Census Bureau fails to enumerate every person, which results in a net undercount. In 2010, however, the estimated net undercount for the USA as a whole was close to zero. (There was, in fact, an estimated small net overcount of 0.01%; see Table 2.) Nevertheless, there were major differentials by race and Hispanic origin. Among the major groups, whites had a net overcount, while blacks and Hispanics had undercounts. In 2000, the Census Bureau estimated there was an overcount of 0.49% for the USA as a whole, but blacks were still undercounted.
Table 2. Estimates and standard errors (SE) of percentage net undercount for the USA by race/Hispanic origin. (Negative numbers are net overcounts.) Source: Thomas Mule, Census Coverage Measurement Memorandum Series #2010-G-01, US Census Bureau, 2012.
Race/Hispanic origin | 2010 estimate | 2010 SE | 2000 estimate | 2000 SE | 1990 estimate | 1990 SE |
---|---|---|---|---|---|---|
Total, USA | –0.01 | 0.14 | –0.49∗ | 0.20 | 1.61∗ | 0.20 |
Non-Hispanic white | –0.84∗ | 0.15 | –1.13∗ | 0.20 | 0.68∗ | 0.22 |
Non-Hispanic black | 2.07∗ | 0.53 | 1.84∗ | 0.43 | 4.57∗ | 0.55 |
Non-Hispanic Asian | 0.08 | 0.61 | –0.75 | 0.68 | 2.36∗ | 1.39 |
American Indian on reservation | 4.88∗ | 2.37 | –0.88 | 1.53 | 12.22∗ | 5.29 |
American Indian off reservation | –1.95 | 1.85 | 0.62 | 1.35 | 0.68∗ | 0.22 |
Native Hawaiian or Pacific Islander | 1.34 | 3.14 | 2.12 | 2.73 | 2.36∗ | 1.39 |
Hispanic | 1.54∗ | 0.33 | 0.71 | 0.44 | 4.99∗ | 0.82 |
∗ Percentage net undercount statistically significantly different from 0, p < 0.1.
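For readers who want to see how the significance flag in Table 2 relates to the estimates and standard errors, here is a minimal sketch using the 2010 columns. It assumes a simple two-sided test under a normal approximation (critical value 1.645 for p < 0.1); the Census Bureau's own procedure may differ, and the names `TABLE_2_2010` and `significant_groups` are hypothetical.

```python
# (estimate, standard error) pairs for 2010, copied from Table 2.
TABLE_2_2010 = {
    "Total, USA": (-0.01, 0.14),
    "Non-Hispanic white": (-0.84, 0.15),
    "Non-Hispanic black": (2.07, 0.53),
    "Non-Hispanic Asian": (0.08, 0.61),
    "American Indian on reservation": (4.88, 2.37),
    "American Indian off reservation": (-1.95, 1.85),
    "Native Hawaiian or Pacific Islander": (1.34, 3.14),
    "Hispanic": (1.54, 0.33),
}

Z_CRITICAL = 1.645  # two-sided normal critical value for p < 0.1 (assumed test)


def is_significant(estimate: float, se: float, z_critical: float = Z_CRITICAL) -> bool:
    """True when the estimate is more than z_critical standard errors from zero."""
    return abs(estimate / se) > z_critical


def significant_groups(table: dict[str, tuple[float, float]]) -> list[str]:
    """Names of the groups whose net undercount differs significantly from zero."""
    return [name for name, (est, se) in table.items() if is_significant(est, se)]


if __name__ == "__main__":
    for name in significant_groups(TABLE_2_2010):
        print(name)
```

Under this assumption, the check picks out the same four 2010 estimates that Table 2 marks as significantly different from zero, and it leaves the near-zero national total unflagged.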
The seemingly precise overall 2010 Census coverage was a function of combining persons who should not have been counted or were counted more than once (erroneous enumerations) and those who were missed (omissions). In New York City, erroneous enumerations totalled 6.5%, while omissions were 7.9%, for a net undercount of well under 2%. However, the two components of census error occur in different neighbourhoods – omissions are more likely in neighbourhoods with large minority populations of lower socioeconomic status, while erroneous enumerations are more likely in majority white, higher-income communities. In 1990, the last time omission rates were available for small areas, neighbourhoods that were at least 50% black constituted just over one-fifth of the city's population but accounted for nearly two-fifths of the undercount, while neighbourhoods with an overcount were, on average, more than 90% white. These lower-level geographic effects have real-life consequences for conducting needs assessments, calculating rates of disease incidence, determining the distribution of resources and services, and drawing local political districts.
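The arithmetic behind a small net figure can be sketched as follows. This is a simplification that treats net undercount as omissions minus erroneous enumerations, the relationship the New York City figures above imply; the Census Bureau's full coverage estimates may involve further components. The two neighbourhoods in the example are hypothetical and are there only to show how large offsetting errors can net out to zero.

```python
def net_undercount(omissions_pct: float, erroneous_pct: float) -> float:
    """Net undercount (%) as omissions minus erroneous enumerations."""
    return omissions_pct - erroneous_pct


def combined_net_undercount(areas: list[tuple[float, float, float]]) -> float:
    """Population-weighted net undercount (%) across several areas.

    Each area is a (population, omissions_pct, erroneous_pct) tuple.
    """
    total_population = sum(pop for pop, _, _ in areas)
    weighted = sum(pop * net_undercount(om, err) for pop, om, err in areas)
    return weighted / total_population


if __name__ == "__main__":
    # New York City, 2010: figures quoted in the text.
    print(f"NYC net undercount: {net_undercount(7.9, 6.5):.1f}%")

    # Two hypothetical neighbourhoods of equal size with opposite errors.
    neighbourhoods = [
        (1000, 12.0, 3.0),  # mostly omissions: net undercount of 9%
        (1000, 3.0, 12.0),  # mostly erroneous enumerations: net overcount of 9%
    ]
    print(f"Combined: {combined_net_undercount(neighbourhoods):.1f}%")
```

In the hypothetical case the combined figure is zero even though each neighbourhood is off by nine percentage points, which is why a near-zero citywide or national net undercount says little about accuracy at the neighbourhood level.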
Discussion
The finest operational plans and the best technology will not result in a good count unless the public heeds the call to answer the census. Self-response is the gold standard of data collection in the decennial census; non-response opens the door to the use of proxy respondents, administrative records, and imputation, all of which have been shown to compromise the quality of census data and the overall count. The most socially vulnerable populations tend to be undercounted, while whites tend to be overcounted, with differential impacts at the neighbourhood level. Thus, factors that encourage non-response pose a grave threat: these include heightened concerns about government intrusion, data privacy and cybersecurity, as well as the climate of fear among the nation's immigrants, which was exacerbated by the failed bid to include a citizenship question on the 2020 Census questionnaire.
To encourage high rates of self-response, local jurisdictions will need to mobilise “trusted voices” in immigrant communities and communities of colour, which is the key way to convince residents that it is safe and in their interest to self-respond. There should also be a high-profile effort among municipalities to encourage the Census Bureau to hire residents to work in their own neighbourhoods, irrespective of their citizenship status. The hiring of non-citizens, who are authorised to work in the USA, is crucial to ensure that there are enough interpreters available to engage respondents.
Increasing self-response is challenging in an era of distrust of government. Nevertheless, the Census Bureau managed to do just that in 2010, increasing self-response nationwide, including in New York City. After the self-response period, the point at which NRFU operations begin, the mail return rate stood at 62% in the city. Additional funding allowed the Census Bureau to increase outreach and mail another round of questionnaires, lifting the final mail return rate to 72%. A similar effort tailored for the internet era, along with robust outreach efforts by the federal government to allay fears in immigrant communities, needs to occur. These efforts, coupled with vigorous partnerships with local governments and trusted community voices, could pay huge dividends in 2020.