There’s a parable about six men who are blind touching an elephant. Each describes the animal differently, depending on whether they felt its tusk, tail, leg, trunk, ear, or side. Take the accounts separately, and you’ll learn something about the feel of the individual parts. Put them together, and you get a sense of the elephant as a whole.
It’s the latter that worries those working at the Census Bureau. Right now, identifying individuals based on public Census data is difficult. But more information from outside sources is increasingly accessible, and the computing power needed to link globs of data from different places is also easier to attain than it was in 2010. There have been numerous studies showing that even anonymized datasets can be re-identified when they’re cross-referenced with each other. Journalists with The New York Times were able to verify they had received Donald Trump’s tax returns from 1985 to 1994 by comparing them to an anonymized database and other public documents.
So to protect your data, the Census Bureau is digging into statistical methodology, injecting variability and randomness into the data itself. But as numerous interviews with researchers and data scientists eager for all of that information show, balancing privacy with data science won’t be easy.
The Census attacks itself
In 2018, the Bureau published the results of a simulated attack on its own 2010 Census data to see if it could recreate private information from the many chunks of public data floating around. Over 308 million people were counted in the 2010 Census. Using published 2010 statistics like sex, age, race, and ethnicity, the Bureau was able to reconstruct records for 46 percent of the population, exactly matching the confidential records that only certain Census workers have access to.
Even with the Census records secure, the Bureau wanted to try linking the reconstructed records with commercially available data. Those reconstructed records didn’t have names, but using public databases, the Bureau found it could attach names and addresses to 45 percent of them. Those names were accurate only 38 percent of the time, however, coming out to a correct identification for 17 percent of the total population. An attacker wouldn’t necessarily know which 17 percent they had correct without some extra work. “They could have found out if they were right by doing additional field work,” John Abowd, chief scientist and associate director for research and methodology at the U.S. Census Bureau, told Digital Trends. “That means they’d have to go and find out by telephone or sending people to the homes to ask.” But the Census Bureau didn’t want to wait and see if more data would make re-identification more likely. It started looking into using differential privacy ahead of the 2020 Census.
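To make that linkage step concrete, here is a minimal sketch in Python of the kind of join an attacker (or the Bureau’s simulated attack) might run. Every record, column name, and value here is hypothetical; the point is only that a reconstructed record with no name can pick one up by matching on the attributes the two files share.

```python
# Toy illustration of the linkage step: reconstructed Census records (no names)
# are joined to a commercial file on the quasi-identifiers both share.
# All data and column names here are hypothetical.
import pandas as pd

# Records reconstructed from published tables: block, sex, age, race, but no name.
reconstructed = pd.DataFrame([
    {"block": "0101", "sex": "F", "age": 20, "race": "Pacific Islander"},
    {"block": "0101", "sex": "M", "age": 47, "race": "White"},
])

# A commercial marketing file that does carry names and addresses.
commercial = pd.DataFrame([
    {"block": "0101", "sex": "F", "age": 20, "name": "J. Doe", "address": "12 Elm St"},
    {"block": "0102", "sex": "M", "age": 33, "name": "A. Roe", "address": "9 Oak Ave"},
])

# Matching on the shared attributes attaches a putative name to a record.
# Whether that name is correct is exactly what an attacker can't know without
# field work, which is where the 38 percent accuracy figure comes in.
matches = reconstructed.merge(commercial, on=["block", "sex", "age"], how="inner")
print(matches)
```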
The impossible dream of perfect data
The more unique you are, the easier you are to spot in the data. If you’re the only 20-year-old Pacific Islander on your block, your record will stand out. That’s why, for years, the Bureau used “swapping” to mask such identifiable individuals. For example, The New York Times tracked down the sole couple living on Liberty Island, the caretakers of the Statue of Liberty. While their Census records had their correct ages, their ethnicities had been listed as Asian, though they identify as white. That ethnicity wasn’t randomly assigned; it had been swapped in from another couple in the area. Just how frequently the Census swaps information like this is kept secret, precisely to keep the records private: if attackers knew what fraction of the values had been switched around, it would help them reconstruct the records.
“Differential privacy is forcing people to confront the fact that there’s error in the data …”
The Bureau has applied different methods of privacy protection over the years. In the 1970s, it suppressed full tables, and it started using measures including swapping for the 1990 Census. On top of that, there have always been errors and missing information on the forms people send back, which workers do their best to correct and fill in. Add to this fundamental problems like undercounting — missing vulnerable populations like people experiencing homelessness or those in very remote areas — and overcounting — marking a child of divorced parents twice.
In other words, there have been inaccuracies in the data forever. Differential privacy just lets the Bureau be transparent about how much it has fiddled with the numbers. Let’s say there are 12 angry jurors in a room. In a secret ballot, they learn that 11 are for conviction and one is against. No one knows who’s who, unless they vote again while the lone holdout is in the bathroom. The idea with differential privacy is that a juror’s vote should be protected whether or not they’re actually included in the participant pool, though it’s not an absolute guarantee of privacy.
“Differential privacy is forcing people to actually confront the fact that there’s error in the data, because differential privacy is very explicit about the introduction of error,” said Dr. Salil Vadhan, a computer science and applied mathematics professor at the Harvard John A. Paulson School of Engineering & Applied Sciences. “And we who work in differential privacy think of that as a feature not a bug.”
With differential privacy, some amount of “noise” is added to each value in a table. With the jurors example, you’d add or subtract an amount from the yay and nay votes, and the amount would have to fall within a certain range. With a very small population, like 12, you’d want to keep the range tight while still allowing for privacy. Maybe you choose plus or minus three. The algorithm would then randomly select a value within the range and apply it to the yays, then do the same for the nays. You could then end up with results that look like this: 10 for and negative two against. That’s obviously illogical, but it simply means the algorithm randomly chose to subtract one from the yays and three from the nays. The point is that the people in the room can’t tell exactly how much was added or subtracted, so they can’t work backward from the published numbers to the true vote. That’s not helpful for a jury, but it does keep things a little more private.
In this example, the total number of differentially private votes — technically eight but, more logically, 10 — doesn’t add up to the real number of people in the room, 12. You might look at that vote and say it’s worthless, but what if the vote didn’t have to be unanimous and merely needed to pass by a majority? Even though the numbers aren’t exact, it’s clear the yays have it. Again, things become trickier if the voters are split down the middle and the algorithm adds one to the nays and subtracts one from the yays. The problem is magnified with small populations but starts to lessen as groups get larger.
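For readers who want to see the mechanics, here is a minimal sketch of the jurors example in Python. It assumes whole-number noise drawn uniformly from a range of plus or minus three, as in the description above; real differentially private systems calibrate the noise to a formal privacy parameter, but the effect is the same: the published counts differ from the true counts by a random amount.

```python
# A toy version of the jurors example: add bounded, random integer noise to
# each count before publishing it. Spread of 3 matches the plus-or-minus-three
# range described above; this is an illustration, not a formal DP mechanism.
import random

def noisy_count(true_count, spread=3):
    """Add uniform integer noise in [-spread, spread] to a count."""
    return true_count + random.randint(-spread, spread)

yays, nays = 11, 1           # the true, confidential tally
print(noisy_count(yays))     # e.g., 10
print(noisy_count(nays))     # e.g., -2 -- illogical, but protective
```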
“There’s always been resentment about this kind of two-tiered access.”
One feature of this noise is that it’s “tunable.” You can adjust it. If you have a table people are going to use for a specific metric, you can narrow the noise range for that column while widening it for other values. If a demographer wants to know how many people of Hawaiian or Pacific Islander descent live in a city, the table with that information might have only a narrow range of noise applied to the actual number of people, while the ages are altered within a larger range. Instead of seeing the single 20-year-old, it’s suddenly a 25-year-old, and an attacker would be less certain that record belongs to a specific name and address in a commercial database.
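As a rough illustration of that tuning, the sketch below gives each column its own noise scale, so population counts barely move while ages can shift by years. The column names and scales are hypothetical, not the Bureau’s actual settings.

```python
# A sketch of "tunable" noise: each column gets its own noise scale, so the
# statistic users care most about (the population count) stays close to the
# truth while ages absorb more of the distortion. Scales here are illustrative.
import numpy as np

rng = np.random.default_rng()

noise_scale = {"population": 1.0, "median_age": 5.0}   # tighter vs. looser

def privatize(record):
    """Add Laplace noise to each value, using that column's scale."""
    return {col: value + rng.laplace(scale=noise_scale[col])
            for col, value in record.items()}

tract = {"population": 312, "median_age": 20}
print(privatize(tract))   # population barely moves; the age can shift by years
```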
From a demographics perspective, it might not matter too much that a 20-year-old is suddenly a 25-year-old, but for certain uses, like voting issues, that 20-year-old absolutely cannot become a 17-year-old. There are certain stats, known as invariants, that won’t have any noise injected. For example, state-level populations will remain untouched, so we’ll know exactly how many people live in Alaska, Kansas, and so on. The Bureau will also release the exact, unaltered total number of housing units at the Census block level, and it will not alter the number and type of occupied group quarters (like correctional facilities, college dorms, and shelters).
To make all the data products it releases more secure, the Bureau applied differential privacy to the voting-age population in the 2018 end-to-end Census test and the 2010 Demonstration Data Product, which the Bureau released to help researchers see how the process would affect the data they use. While the Census used to provide exact numbers of people both above and below 18 (the voting age), the Census Bureau’s Data Stewardship Executive Policy Committee (DSEP) has “grave concerns about its effects on the Census Bureau’s ability to protect confidentiality, especially in block and block-group level tabulations,” according to an email from a Bureau spokesperson. DSEP hasn’t yet made final decisions on what will remain invariant.
Deductions from the privacy budget
For the 2020 Census, the form includes a number of demographic questions, including how many people live in the household; their ages, sexes, races, and ethnicities; and their relation to the head of household. As the 2010 Census data shows, however, the information adds up to more than it asks; based on its questions from a decade ago, the Bureau released about 7.8 billion statistics about Americans.
This time around, instead of releasing all that data and relying on swapping and suppression, each statistical table made public will nibble away at the privacy loss budget. This budget has to be determined first, then each table will be assigned a slice of it. Frequently used tables might stick closer to the original data, while less utilized ones may get more noise.
The more privacy a table needs, the greater the chunk of the budget it takes and the more noise that needs to be injected. It’s a double-edged sword. Small populations need more privacy protection to deter database reconstruction, but introducing more noise in tables with small numbers affects the results more significantly. As with the invariant question, the Bureau hasn’t made final decisions about the privacy loss budget.
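Here is a back-of-the-envelope sketch of that budget arithmetic, assuming the textbook Laplace mechanism on simple counts: the total privacy loss is split across tables, and the noise added to each table shrinks as its share of the budget grows. The table names and allocations are hypothetical, and the Bureau’s production system is considerably more elaborate, so treat this only as an illustration of the trade-off.

```python
# Illustration of a privacy loss budget, assuming the classic Laplace mechanism
# over counting queries (sensitivity 1). The total budget epsilon is split
# across tables; a table with a bigger slice gets less noise, and the slices
# of everything published must add up to the total.
import numpy as np

rng = np.random.default_rng()
total_epsilon = 1.0

# Heavily used tables might get a larger share of the budget (less noise).
allocation = {"total_population": 0.5, "age_by_sex": 0.3, "detailed_race": 0.2}
assert abs(sum(allocation.values()) - total_epsilon) < 1e-9

def noisy(count, epsilon):
    # Laplace noise with scale 1/epsilon protects a count with sensitivity 1.
    return count + rng.laplace(scale=1.0 / epsilon)

for table, eps in allocation.items():
    print(table, round(noisy(1000, eps), 1))
```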
The question for smaller populations, like Alaska Natives, is what is an acceptable level of privacy loss, said Dr. Randall Akee at a recent Committee on National Statistics (CNSTAT) workshop on differential privacy and the Census. He’s an associate professor at the University of California, Los Angeles in the Department of Public Policy and American Indian Studies. “I think that’s something that has to be addressed directly to tribal governments themselves,” he said. Some might be fine with their populations being publicly enumerated, while others may be more reticent, he said. It’s a problem the Census Bureau is still grappling with. “We have some further prototyping and other work to do before we can show the user community what those will look like,” said Abowd.
The demands on data access
Critics of the Census Bureau’s differential privacy plan worry that it will release less information than it has in the past or that researchers will have to visit Federal Statistical Research Data Centers to do their work. There are only 29 centers throughout the U.S., and demographers and others are concerned about applying for and receiving access in a timely manner. While researchers have always needed to have their work approved to visit the centers, some think that they’ll need to do so more often with the 2020 data. “There’s always been a little bit of resentment about this kind of two-tiered access,” said Jane Bambauer, a law professor at the University of Arizona. She thinks differential privacy might exacerbate the issue, with graduate students and researchers at smaller universities losing out with less publicly available data.
“A lot of social scientists feel shut out of the sphere of influence for the key decision makers at the Census Bureau.”
At the December 2019 CNSTAT workshop, a number of researchers presented their findings after working with some differentially private data. The Bureau had released some 2010 data products that it had put through its differential privacy system, and researchers then compared the new data with the original 2010 data that the Bureau released under the old privacy measures, like swapping. Many participants highlighted the discrepancies they found. William Sexton of the Census Bureau said that one source of error was “post-processing,” or fiddling with the data after applying differential privacy measures. This would include adjustments like making sure a block didn’t have a negative number of people. There are ways to improve these fixes, he said. In addition, the Bureau is taking into account the problems people are finding with the differentially private data and looking for solutions. “In order to know where to look for anomalies, we need a lot more eyes on the data than are available inside the house,” Abowd told Digital Trends.
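Here is a minimal sketch of the kind of post-processing Sexton described, assuming the simplest possible fix of clipping and rounding noisy block counts. Post-processing cannot weaken the privacy guarantee, but it can introduce distortions of its own, which is one reason researchers saw discrepancies in the demonstration data.

```python
# A toy post-processing step: raw noisy counts can be negative or fractional,
# so they are clipped and rounded before release. This keeps blocks from having
# "negative people," at the cost of nudging the published numbers further from
# the raw differentially private values.
def post_process(noisy_counts):
    """Clip negative block counts to zero and round to whole people."""
    return [max(0, round(c)) for c in noisy_counts]

print(post_process([4.2, -1.7, 0.4]))   # -> [4, 0, 0]
```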
There has been frustration from some researchers and others about just how they should prepare for the 2020 Census data. “It will take some time for the data users to learn which are the appropriate methods to use to try to analyze the data that have been protected in this way,” said Vadhan. The Bureau is still deciding on all the products it will release and how researchers will access the data.
Privacy pros and woes
Each dent in the privacy loss budget represents a value judgment. While those judgments will ultimately be made by the Census Bureau, it is seeking feedback and input from researchers, advocates, and others. “It’s not a computer just spitting out a set of parameters that are the best ones to use,” said David Van Riper, director of spatial analysis at the Minnesota Population Center. “It’s a group of people that are going to take in information from user groups, different stakeholders, and decide on these policy decisions.”
Yet there have been communication issues between data users and the Bureau. “I went to the National Demographers Conference earlier this year, and there are a lot of social scientists that feel shut out of the sphere of influence for the key decision makers at the Census Bureau,” said Bambauer.
Some researchers still feel that the Bureau is putting a higher value on privacy than access to the data itself. “The Census Bureau has an obligation to provide data that’s useful for a broad spectrum of data users, from local planners to researchers to state and local governments,” said Van Riper. “And that usefulness and utility is, in my opinion, as important as the privacy protections.”
In 2010, the “Census moment” was set at 11:59 p.m. on April 1. The aim was to count everyone living in the U.S. at that exact time. Because of the gap between this moment and when people send back their forms, the enumeration will never be flawless. The uses of the Census data — reapportioning Congressional seats, distributing federal funds, and so on — are important enough that data users are willing to overlook the imperfections.
Recently, historians learned that census officials provided the government with information about Japanese-Americans who were then sent to internment camps. While there is no citizenship question on the 2020 Census, people are wary of how their information will be used. Some experts are concerned that mistrust could result in one of the largest undercounts of several minority groups in decades.
With differential privacy, the hope is to safeguard the information from anyone who would use the data against another person, whether they’re inside or outside the government. The Bureau hopes the promise of increased security will make people more willing to participate, especially those who have been hesitant to do so in the past.
Correction: This story was updated on March 5 to clarify the measures the Census Bureau will take to anonymize block-level data.