Organizations that collect data often claim it’s anonymous. Research shows this is impossible.

Commercial mass surveillance: The collected data can’t be kept anonymous

When the tech giants collect huge quantities of data about your internet behavior, they always hide behind defenses such as ‘it’s only metadata’ or ‘we’ve anonymized the information’.

The tech giants have two standard excuses. The first is: ‘It’s only metadata’. In other words, they claim it’s not a problem because they don’t collect the actual conversation between two people (although in fact they do) or anything else they consider concrete. But as we’ve explained elsewhere, metadata is enough to map someone’s life. The second excuse is: ‘We’ve anonymized the data’. Then they explain how they’ve replaced or hidden the digits in an IP address, or removed other information that can be linked to a particular person. But you only have to read about data brokers to realize that anyone who wants to can easily put two and two together and ‘re-identify users’, as it’s often called.

The fact is that if you collect enough data, it’s impossible to keep it anonymous. And because the entire business model of the tech giants is built on big data, your internet behavior can undoubtedly be linked to you as a person. For example, anyone with access to several different databases can compare them and de-anonymize people very quickly. When Netflix released roughly 100 million film ratings from about half a million anonymous users, a team of researchers at the University of Texas proved the point by identifying several of them, simply by comparing the ratings and the dates they were made with ratings posted publicly on IMDb. Here’s another example: when the State of Washington sold medical data about anonymous patients for 50 dollars a time, Harvard researchers could put names to several of them by comparing parts of the records with news articles about accidents and violent crimes.
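
To make the idea of comparing two databases concrete, here is a minimal sketch in Python. It is not the method the University of Texas researchers used (theirs was a more robust statistical approach); all names, films, dates and thresholds below are made up for illustration. It simply counts how many ratings in an "anonymous" list agree with publicly posted reviews on film, score and approximate date.

```python
from datetime import date

# Made-up "anonymized" ratings: (user number, film, stars, date rated)
anonymous_ratings = [
    (1001, "Film A", 5, date(2005, 3, 1)),
    (1001, "Film B", 2, date(2005, 3, 4)),
    (1001, "Film C", 4, date(2005, 4, 9)),
    (1002, "Film A", 5, date(2005, 3, 1)),
    (1002, "Film D", 3, date(2005, 5, 2)),
]

# Made-up public reviews under real names: (name, film, stars, date posted)
public_reviews = [
    ("Alice Example", "Film A", 5, date(2005, 3, 2)),
    ("Alice Example", "Film B", 2, date(2005, 3, 4)),
    ("Alice Example", "Film C", 4, date(2005, 4, 10)),
    ("Bob Example", "Film D", 1, date(2005, 6, 20)),
]

def link_users(anonymous, public, max_days_apart=3, min_matches=3):
    """Pair a user number with a name when enough ratings line up."""
    scores = {}
    for user_id, film, stars, rated in anonymous:
        for name, pub_film, pub_stars, posted in public:
            close_in_time = abs((rated - posted).days) <= max_days_apart
            if film == pub_film and stars == pub_stars and close_in_time:
                key = (user_id, name)
                scores[key] = scores.get(key, 0) + 1
    # Keep only pairs with enough agreeing ratings (threshold is an assumption)
    return {uid: name for (uid, name), n in scores.items() if n >= min_matches}

matches = link_users(anonymous_ratings, public_reviews)
print(matches)  # {1001: 'Alice Example'}
```

In this toy data, user 1001 is linked to a name because three ratings agree, while user 1002 shares only one rating with anyone and stays unlinked. Neither list contains anything secret on its own; the name only appears once the two are compared.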

It’s difficult to identify someone if you only have access to one or two data points. But as soon as you have access to more, you can use classic exclusion methods to work out who’s behind the information. In his book Data and Goliath, cryptographer and security expert Bruce Schneier gives a good example: The FBI needed to track someone sending anonymous emails from different IP addresses. When they looked at the IP addresses, it turned out they all belonged to different hotels. The person had been careful to change the hotel every time they wanted to send an email. But all the FBI had to do was examine the customer records from the different hotels. Was there somebody who’d checked in at all the hotels when the emails were sent? They didn’t have to look at many hotel stays before the list came down to a single person.
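
The hotel example is a plain set intersection, which can be sketched in a few lines of Python. The guest lists below are invented; the sketch only shows how fast the candidate list shrinks as each new record is added.

```python
# Made-up guest lists: one set of names per hotel, for the nights
# on which an anonymous email was sent from that hotel's IP address.
guest_lists = [
    {"A. Berg", "C. Diaz", "E. Fox", "G. Hall"},
    {"C. Diaz", "E. Fox", "I. Jones"},
    {"E. Fox", "K. Lee", "A. Berg"},
]

def narrow_down(guest_lists):
    """Yield the remaining candidates after each hotel's records are added."""
    candidates = None
    for guests in guest_lists:
        # Keep only the people who also appear in this hotel's records
        candidates = set(guests) if candidates is None else candidates & guests
        yield set(candidates)

remaining = list(narrow_down(guest_lists))
for step, names in enumerate(remaining, start=1):
    print(f"after {step} hotel(s): {len(names)} candidate(s)")
# after 1 hotel(s): 4 candidate(s)
# after 2 hotel(s): 2 candidate(s)
# after 3 hotel(s): 1 candidate(s)
```

Each list looks harmless by itself. It is the overlap between them that points to one person.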

Research has repeatedly shown that you don’t need many data points to identify people. The fastest way is by using location data, if you know several places an anonymous person has visited. Think about it: there may be hundreds of people at your workplace, but how many of them shop in the same grocery store as you? Perhaps only a couple of you match both of these points. Add a few more data points and only you remain. Researchers at universities in the UK and Belgium have published a method estimating that 99.98% of people in an anonymized dataset can be correctly re-identified from a mere 15 demographic attributes. Another group of researchers found that four data points, each consisting of a place and a time, are enough to identify 95% of individuals. In a further study, researchers looked at three months of credit card records and found that four points, once again of place and time, were enough to identify nine out of ten people.
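
The workplace-and-grocery-store reasoning can be tried out on synthetic data. The sketch below is only an illustration under stated assumptions: the traces are random (place, hour) pairs, and the population size, number of places and points per person are arbitrary. Real movement patterns are far less random, so the percentages it prints are not comparable to the figures in the studies above. What it shows is the mechanism: every extra known point removes most of the remaining candidates.

```python
import random

def make_traces(num_people, num_places, num_hours, points_each, seed=0):
    """Give every person a random set of (place, hour) points. Purely synthetic."""
    rng = random.Random(seed)
    return [
        {(rng.randrange(num_places), rng.randrange(num_hours))
         for _ in range(points_each)}
        for _ in range(num_people)
    ]

def share_identified(traces, known_points, seed=1):
    """Fraction of people singled out by `known_points` of their own points."""
    rng = random.Random(seed)
    singled_out = 0
    for trace in traces:
        # The observer learns a few random points from this person's trace...
        known = set(rng.sample(sorted(trace), known_points))
        # ...and counts how many traces in the dataset contain all of them.
        matches = sum(1 for other in traces if known <= other)
        if matches == 1:  # only the person themselves fits
            singled_out += 1
    return singled_out / len(traces)

traces = make_traces(num_people=1000, num_places=20, num_hours=24, points_each=30)
shares = {k: share_identified(traces, k) for k in (1, 2, 3, 4)}
for k, share in shares.items():
    print(f"{k} known point(s): {share:.0%} of people singled out")
```

With these settings, a single known point singles out almost nobody, while four known points single out nearly everyone in the synthetic dataset.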

Given how much data is collected about each of us as soon as we open a web browser, anyone who wants to use the data (and de-anonymize it) barely even needs place and time parameters. One of the examples Bruce Schneier gives is of researchers who examined the search histories of 657,000 users. In total it involved 20 million searches, and the information was, as they say, anonymized: there was only a number linked to each list of searches. But by correlating different pieces of data, the researchers could replace the numbers with names. We’ll say it again: your internet behavior is tracked and logged in detail. Using exclusion methods, it doesn’t take long to narrow the options down to just you.
