Census 2016: Arguments For And Against Name/Address Retention

The Village Lawyer, c. 1621, by Pieter Brueghel the Younger

The Australian Bureau of Statistics has made a right hash of the decision to retain name and address data provided in this year’s Census for up to four years.

The ABS has repeatedly hand-waved away specific concerns raised by experts in privacy, information security, and statistical research who rely on Census data, while simultaneously claiming we have nothing to worry about. Various supporters of the ABS’ position have used similarly thin justifications, and waved away concerns and criticisms as unhinged, conspiratorial, and overblown.

What hasn’t been attempted — by the ABS or anyone else, really — is to make a clear and cogent argument for why retention of the name and address data is so important, and worth it.

So I’m going to attempt it.

I’m going to try very hard to take the side of those who want to keep names and addresses, and to argue logically from facts. I’m not trying to build a strawman that I then dismantle. I want to build a strong summary argument; the kind that I wish the ABS had made in the first place.

Why? Because the census is important, as I will cover. I am a big believer in it, and until this year was a very big fan of both the census and the ABS more generally.

Why A Census Is Important

That the Census is important and worthwhile isn’t the issue here, but let’s quickly recap some of its most important points.

Firstly, and possibly most importantly, the Census is central to a representative democracy. Representation in the House of Representatives is based on how many people live in a given area, so we need to count them periodically to keep the tally up to date.

The number of people in an area, such as a state or electorate, is also used for divvying up various common resources, like tax dollars towards things like roads, healthcare, schools, and so on. We need to know how many children are in a given area so we can make sure there are enough schools, for example.

Other demographic information is also important for making policy decisions. For example, how many people live in cities versus rural areas? What is the literacy rate for English? What is the distribution of ages in a given area? How does the proportion of Indigenous or Muslim people in the population compare to the proportion in Parliament, or in upper management of corporations?

Note how these are all aggregate statistics. They say nothing about individuals.

They also provide data at a point in time: when the census is taken. Multiple snapshots, one at each census, can show us how things change over time (trends) but again, only at an aggregate level.

That’s really, really useful all by itself, because we can track things like literacy rates, life expectancy, rates of disease, and so on. We don’t need names for any of this.

We do need addresses, at some level of granularity, or we won’t know where in Australia people are. Your specific address doesn’t matter once we’ve counted the number of people in Victoria (for example), which is why after the initial processing, the address can be discarded.

Longitudinal Studies

Unlike snapshots (or cross-sectional studies), longitudinal studies can provide more detailed information about how things change over time.

A longitudinal study looks at the same group of people at each sample time, so you can track what happened to a specific group of people, rather than a general class of people. One great example of how this is useful is in studying poverty. If the poverty rate in the 2006 Census was 10%, and at the 2011 Census it was still 10%, we don’t know much about why. It could be that the same 10% of people were poor both times, or that a different 10% of the population was poor each time.

A longitudinal study would help you figure out which is the case, and that should result in different policies. If it’s always the same 10% of people who are poor, you can direct your efforts to just that 10% of people, and not waste your effort on people who don’t need it (to oversimplify a bit). If 10% of the population generally are poor all the time, then helping people is a more general problem, and requires a different approach.
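To make that concrete, here’s a minimal sketch in Python. The data is entirely made up (100 hypothetical people, 10 of them poor at each census), and the function names are my own, not anything the ABS uses. It shows two populations with identical snapshot statistics that only a longitudinal view can tell apart.

```python
# Toy illustration only: all data and names here are hypothetical.

def poverty_rate(census):
    """Cross-sectional view: share of people flagged as poor in one census."""
    return sum(census.values()) / len(census)

def persistence(first, second):
    """Longitudinal view: share of people poor in the first census
    who are still poor in the second. Requires matching individuals."""
    poor_first = [person for person, poor in first.items() if poor]
    still_poor = [person for person in poor_first if second[person]]
    return len(still_poor) / len(poor_first)

people = range(100)

# First census: people 0-9 are poor.
first_census = {p: p < 10 for p in people}

# Scenario A: the same ten people are poor at the second census.
second_same = {p: p < 10 for p in people}

# Scenario B: a completely different ten people (10-19) are poor.
second_churn = {p: 10 <= p < 20 for p in people}

# Both scenarios look identical in the snapshots...
print(poverty_rate(first_census), poverty_rate(second_same), poverty_rate(second_churn))

# ...but the longitudinal measure separates them.
print(persistence(first_census, second_same))   # everyone stayed poor
print(persistence(first_census, second_churn))  # nobody stayed poor
```

Note that `persistence` only works because the same person identifier appears in both censuses, which is exactly the linking problem discussed next.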

But if we don’t keep people’s names and addresses, we have no way of knowing who filled out which set of data for any given Census, and can’t do longitudinal research. If we have your name and address, we can. That’s why in the Census we want to ask for your name and address now, and also five years ago, because then we can look up where you were last time, and match your answers to this time, and see if you’re still below the poverty line or not.

What About Privacy?

The ABS has a lot of very sensitive information about every Australian, so it’s vital that the information be kept safe. Since the very beginning, the Census and Statistics Act has made the privacy and security of your personal data very clear. It is expressly forbidden for the ABS to share personally identifiable information about anyone. There are substantial penalties for anyone at the ABS misusing data, or providing information to anyone else, including any other government department, including ASIO and the ATO.

In fact, back before World War II, a government official attempted to get personal wealth details out of the ABS, but the Statistician at the time refused. The ABS has a long history of defending the privacy of those who fill out the Census, and this speech by Dennis Trewin to the National Press Club highlights just how important the ABS considers its role.

But researchers need to be able to look at data, or they can’t do their job. They don’t get to see individual responses. They only get to see aggregate data, and the ABS employs a variety of techniques to foil attempts at unmasking individuals. Sometimes data is randomly scrambled, which doesn’t affect the aggregation too much (because of the way randomness works), but it hides the individual responses. Other times the data is combined with a minimum number of other answers, again to hide individuals while still providing useful summary statistics.
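As a rough illustration of those two ideas, here’s a small Python sketch. It is not the ABS’ actual method: the threshold, the noise range, and the suburb counts are all assumptions of mine, and I’ve used simple suppression of small cells as a stand-in for combining them with other answers.

```python
import random

# Assumed values for illustration; not taken from any ABS documentation.
MIN_CELL_SIZE = 5   # cells smaller than this are not released
MAX_SHIFT = 2       # largest random adjustment applied to a count

def perturb(count, rng):
    """Shift a count by a small random amount, never going below zero."""
    return max(0, count + rng.randint(-MAX_SHIFT, MAX_SHIFT))

def protect_table(table, rng):
    """Suppress cells that are too small and perturb the rest."""
    protected = {}
    for cell, count in table.items():
        if count < MIN_CELL_SIZE:
            protected[cell] = None  # too few people to release safely
        else:
            protected[cell] = perturb(count, rng)
    return protected

rng = random.Random(2016)  # fixed seed so the sketch is repeatable

# Hypothetical counts of people with some characteristic, by suburb.
table = {"Suburb A": 1520, "Suburb B": 874, "Suburb C": 3}
released = protect_table(table, rng)
print(released)
```

The large counts barely move, so totals and trends survive, while the three people in Suburb C never appear in the released table at all.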

Because the Census isn’t about individuals. It’s about all of us.

In the modern, computerised age, additional care needs to be taken. It’s far harder to make off with a million paper forms that you have to physically steal from a warehouse than it is to silently infiltrate a computer system and download them all.

The ABS employs a variety of well-tested techniques to protect the computer systems that house sensitive data. There are the usual principles of keeping systems maintained, ensuring only authorised people have access, using encryption, and so on. The ABS has been independently audited by the Australian National Audit Office and found to have a very high level of internal security. But the ABS goes much further.

Not all of the data is kept in the one place, which is called the separation principle. This guards against an intruder making off with all the data in one go. If someone did get in, which is extremely unlikely but cannot be deemed impossible, this reduces the risk they can make off with all of the data. The idea is to make the task so difficult that even if someone were to partially succeed, they would be caught in the act.

Keeping names and addresses increases the risk to privacy, which is why the ABS has added even more layers of security to keep the data safe. Name and address data is kept completely separate from all the other data (the separation principle again). In order to do longitudinal studies, this name and address data needs to be linked with different sets of data. How do we do that without violating the separation principle?

The ABS uses a special technique called linkage keys. These are special fields created from names, addresses, and some other fields like date of birth or sex, to create a code that is used instead of the original data. The way the code is created makes it difficult to reverse it to recover the original information (similar, but not identical, to a hash function). The keys are created on the name and address data by one person, who is granted access to that sensitive data but is not allowed to see any other data. A different person then uses the linkage key to match individual responses on a different dataset, ensuring that no one person is given access to all of the data at any one time, but data can still be linked together.
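Here’s a sketch of the general idea in Python. To be clear, I don’t know how the ABS actually builds its keys; this version uses a salted SHA-256 hash purely as a stand-in, and the salt, record IDs, and people are all invented for the example.

```python
import hashlib

def normalise(value):
    """Lower-case and collapse whitespace so trivial differences still match."""
    return " ".join(value.strip().lower().split())

def linkage_key(name, address, date_of_birth, sex, salt):
    """Derive a hard-to-reverse code from identifying fields.
    SHA-256 is an assumption; the real technique is not described."""
    fields = [normalise(f) for f in (name, address, date_of_birth, sex)]
    material = salt + "|" + "|".join(fields)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

SALT = "hypothetical-project-salt"

# Step 1: the key-maker sees identifying fields, but no answers.
census_ids = {"c-001": ("Jane Citizen", "1 Example St, Melbourne", "1980-01-01", "F")}
health_ids = {"h-417": ("jane  citizen", "1 Example St, Melbourne ", "1980-01-01", "f")}

census_keys = {linkage_key(*fields, salt=SALT): rid for rid, fields in census_ids.items()}
health_keys = {linkage_key(*fields, salt=SALT): rid for rid, fields in health_ids.items()}

# Step 2: the linker sees only keys and record IDs, never names or addresses.
links = {census_keys[k]: health_keys[k] for k in census_keys.keys() & health_keys.keys()}
print(links)
```

The linker ends up knowing that census record `c-001` and the other dataset’s record `h-417` belong to the same person without ever seeing who that person is.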

It’s a little complex, but those are the lengths the ABS has to go to in order to keep your data safe.

Why Retain For So Long?

Why not just create a single master linkage key, and then throw away the names and addresses? While that would simplify the linking process, and mean we didn’t need the name and address data for as long, it would create some major problems that the ABS cannot support.

Each dataset to be linked doesn’t have the same key in it. That would require every person in Australia to be issued with a single, constant identifier that is used on all datasets. The risk of having that linkage identifier exposed is just too great. The idea has come up before, most famously as the Australia Card, and it was rejected for a host of very good reasons.

Instead, each dataset keeps its own set of fields and a way of identifying people, independently of each other. There is no over-arching government way to track an individual across all the datasets held by government.

If we want to link datasets together, we need to create a special linkage key just for that purpose. In order to create that linkage, we need identifying information: names and addresses. By keeping names and addresses for four years, we can create linkage keys for up to four years where there are valid research projects that require linking.
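One way to picture purpose-specific keys is with a keyed hash, where each research project gets its own secret. This is my own sketch of the concept using Python’s standard `hmac` module, not a description of the ABS’ system; the project secrets and the identity string are made up.

```python
import hashlib
import hmac

def project_linkage_key(identity, project_secret):
    """Derive a linkage key that is only meaningful within one project."""
    return hmac.new(project_secret, identity.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical, already-normalised identifying fields for one person.
identity = "jane citizen|1 example st, melbourne|1980-01-01|f"

# Each project is given its own secret (assumed values for illustration).
key_in_project_a = project_linkage_key(identity, b"secret-for-project-a")
key_in_project_b = project_linkage_key(identity, b"secret-for-project-b")

# Within a project the key is stable, so records can be matched...
print(key_in_project_a == project_linkage_key(identity, b"secret-for-project-a"))

# ...but the same person has unrelated keys in different projects,
# so the keys can't double as a universal identifier.
print(key_in_project_a == key_in_project_b)
```

The catch is that creating a fresh key for a new project needs the original identifying fields, which is the argument for holding on to them.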

Why four years? We believe it’s a reasonable trade-off between the number of worthwhile research projects that normally occur and the risk of hanging onto personally identifiable data, given all of the additional steps we’ve taken to protect it.

Argument Against

That took a bit longer than I’d expected.

There are two major arguments against name and address retention. The first is security, and the second is privacy and consent.

Security

Security of the data cannot be guaranteed. It is not possible for the ABS to rule out the data being misused, stolen, or exposed.

We know that the ABS has suffered at least 14 data breaches since 2013. One ex-employee was sent to jail for misusing ABS data for insider trading.

Still, this wasn’t Census data, and the ABS has an admirable track record of protecting census data. The ABS is more likely to be able to protect this sensitive data than your mate Dave at IT-for-Less Inc., but that’s a function of practice and resources, not some kind of innate immunity to being hacked.

Success in the past is not a guarantee of success in the future. I’ve never had Hepatitis C, but that doesn’t mean I can’t get it.

And we have loads of examples of data breaches, some of very sensitive data that we would expect was well protected. I can’t imagine that the NSA really wanted Snowden to make off with all those documents, and the Office of Personnel Management held people’s fingerprints for pity’s sake!

The easiest way to protect people’s data is to not collect it in the first place. If the data is collected, it will probably be misused or exposed at some point.

We also have the ABS at first claiming they were in the very top Cyber Secure Zone when they plainly weren’t, and then quietly removing the references when people noticed. The response I got when I asked them about the claims was (in part):

“The ABS has improved its compliance and security since the ANAO audit, and continues to do so through a rolling program of security projects. The ABS has updated its online information in relation to this audit. The ABS network is subjected to regular independent security testing and audits, and been found to be highly resilient to external attacks.”

I was also alerted by someone on Twitter, and then verified myself, that the ABS emails you a server-generated password tied to your Census supplied login ID if you opt to save a partially completed Census form. In plain text. Sending plain text passwords in email is known to be bad for a host of reasons.

Information security is very hard to do well. These examples undermine the ABS’ claim that it really is as good at information security as it says. It’s not just the ABS either: the ABS has contracted IBM to supply the online census system, so we need to trust IBM’s information security practices as well as those of the ABS.

The ABS is critically reliant on people’s trust in order that they provide accurate and truthful answers in the Census. Its actions thus far have unfortunately undermined that trust for a great many people.

So we’re now down to a trade-off between the value of keeping the data, and the risk and impact of it being stolen. That’s a really difficult calculus to attempt, and the ABS hasn’t really done this in any significant detail. At least, not yet.

If you are a judge, policeman, or member of the military, the impact of having your personal details exposed would be substantial.

If you are a woman with an abusive ex, a witness to a major crime, or a member of a persecuted minority, the impact of personal details leaking could be devastating.

How do you even calculate the upside versus downside here? And how can the ABS make that decision on behalf of vulnerable people?

Which brings us to privacy and consent.

Privacy and Consent

Australians have a legal right to privacy, but it’s a bit of a fluid concept that depends on context. The right to privacy is balanced against other rights, responsibilities, and duties.

Once again, we need to make a trade-off between individual privacy and the benefits to society or other groups from a lack of individual privacy.

Here, we have a risk to our personal security by giving up our privacy, as discussed above. That implies that the benefits from a lack of privacy for individuals must be pretty substantial. The ABS has not made the case for people to give up their privacy voluntarily. Instead, participation is being compelled by force of law and the threat of prosecution for a crime, as well as open-ended fines. That’s pretty heavy-handed stuff given that not voting, which I would argue is a greater personal duty than participation in the Census, carries a maximum fine of $180.

There’s also the issue of informed consent. Informed consent is a vital ethical consideration for any research involving humans, as all university ethics committees know.

The ABS is proposing that individuals will be linked to other, as yet undefined datasets. It is not possible to give informed consent to undefined future research. Now, sometimes consent is not possible, but where it is, it should be attempted. The ABS here is compelling people to give their consent, up-front, to future unnamed research projects that use their private, personally identifiable information.

This is another major expansion in the purview of what a census is supposed to be about. What is supposed to be a periodical cross-sectional snapshot of the population — counting people — has turned into an undefined number of perpetual longitudinal studies carried out on every individual in Australia without their consent.

Now maybe, maybe, the benefits of this situation are worth it. What are the benefits from these longitudinal studies? How is their worth measured? I just don’t believe the ABS has made the case other than to broadly assert that there are benefits.

I’m willing to look, but the ABS doesn’t seem willing to articulate these purported benefits and clearly demonstrate that it is aware of the risks and show that the benefits outweigh them.

3 Comments

  1. Thank you for a very informative and thoughtful response.

    You and your arguments deserve some coverage in the media.

  2. Thank you for sharing your erudition with the great unwashed, we need it. My suspicions were alerted by the inarticulate government representative charged with making the case in front of the cameras. Appeared to me that the more educated public servants were perhaps refusing.

  3. Your argument about longitudinal data and 10% poor people fails because while it might be a statisticians wet dream there in fact isn’t a different solution for helping those poor people in the different scenarios. You can’t come up with a REAL scenario, and if you could, or political parties don’t care much for that way of formulating policy anyway.