This blog post was published 5 years ago and may be outdated.

Big Data = Big Failure II

May 12, 2021 Privacy

It's possible to remain anonymous with a single data point. However, as soon as a series of data is collected, remaining anonymous becomes impossible. Let’s take it step by step.

Last week we wrote about how the vast amount of data collected today makes it possible to predict political opinions, beliefs, religion, and interests. This week we will show by example how single data points, once put together in a series, make it impossible to stay anonymous.

One Data Point

As an example, let’s take a single data point which contains time, location, and temperature:

Time	Location	Temperature
2021-05-31 12:00	Gothenburg	15 degrees Celsius

A common way of “anonymizing” the data is to remove one of the elements, in this case, the location:

Time	Temperature
2021-05-31 12:00	15 degrees Celsius

Now it would be hard to determine the data point’s location. Even if we had all the available temperature data in the world, a search would most likely find many locations that match this specific data point. If we further remove the time as well, trying to pinpoint specifics becomes pointless.

A Series of Data Points

With a series of data points, the scenario changes significantly.

Time	Location	Temperature
2021-05-31 12:00	Gothenburg	15 degrees Celsius
2021-06-01 12:00	Gothenburg	14 degrees Celsius
2021-06-02 12:00	Gothenburg	12 degrees Celsius
2021-06-03 13:00	Gothenburg	15 degrees Celsius

If we remove the location, we still have 4 values for temperature and time in a sequence which we can match to measurements throughout the world.

Time	Temperature
2021-05-31 12:00	15 degrees Celsius
2021-06-01 12:00	14 degrees Celsius
2021-06-02 12:00	12 degrees Celsius
2021-06-03 13:00	15 degrees Celsius

This narrows the number of locations the data series could have originated from down to about 1 or 2. For location data, about 4 data points are typically enough to identify a person.
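The matching step can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the reference readings for Oslo, Hamburg, and Madrid are invented, the function name `candidate_locations` is hypothetical, and a real search would run against actual weather archives with some tolerance for measurement error.

```python
from typing import Dict, List, Tuple

TIMES = ["2021-05-31 12:00", "2021-06-01 12:00", "2021-06-02 12:00", "2021-06-03 13:00"]

# Hypothetical reference data: city -> temperatures (°C) at the four times above.
# Only the Gothenburg row mirrors the table in this post; the rest is invented.
_TEMPS = {
    "Gothenburg": [15, 14, 12, 15],
    "Oslo": [15, 16, 12, 13],
    "Hamburg": [15, 14, 17, 15],
    "Madrid": [27, 28, 26, 29],
}
REFERENCE: Dict[str, Dict[str, int]] = {
    city: dict(zip(TIMES, temps)) for city, temps in _TEMPS.items()
}


def candidate_locations(
    observations: List[Tuple[str, int]],
    reference: Dict[str, Dict[str, int]] = REFERENCE,
    tolerance: int = 0,
) -> List[str]:
    """Return every location whose readings match all (time, temperature) pairs."""
    matches = []
    for city, readings in reference.items():
        if all(
            time in readings and abs(readings[time] - temp) <= tolerance
            for time, temp in observations
        ):
            matches.append(city)
    return matches


# The "anonymized" series: location removed, time and temperature kept.
series = list(zip(TIMES, [15, 14, 12, 15]))

single = candidate_locations(series[:1])  # one data point
full = candidate_locations(series)        # all four data points

print(single)  # several candidates remain
print(full)    # only one candidate remains
```

With this made-up reference set, the single data point still fits three cities, while the four-point series fits only Gothenburg.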

If we were to remove the time data points, we would need a longer series of data in order to determine the location as “Gothenburg”. This is possible as long as we have a sequence of data points in chronological order.

For instance, if we have 365 data points, it would be pretty easy to spot that these temperature readings follow the typical Scandinavian weather cycle; we can determine what year the data is from and narrow it down further to Gothenburg. With the amount of reference data available today and the possibilities of AI, this would be easy to do.
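When the timestamps are gone but the order is kept, one possible approach is to slide the sequence along each reference series and keep the position where it fits best. The sketch below assumes hypothetical daily readings for three stations (all values invented) and uses a simple sum of squared differences as the fit measure; the name `best_alignment` is also an assumption. It uses only four values to stay short, whereas a real search would need a much longer sequence and far more reference data.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical daily noon temperatures (°C) per station, in chronological order.
# All values are invented for illustration.
REFERENCE: Dict[str, List[int]] = {
    "Gothenburg": [9, 11, 15, 14, 12, 15, 17, 16],
    "Oslo": [8, 10, 15, 16, 12, 13, 15, 18],
    "Hamburg": [12, 13, 15, 14, 17, 15, 18, 19],
}


def best_alignment(
    sequence: List[int],
    reference: Dict[str, List[int]] = REFERENCE,
) -> Optional[Tuple[int, str, int]]:
    """Slide the timestamp-free sequence over every reference series.

    Returns (error, station, offset) for the window with the smallest sum of
    squared differences, or None if no reference series is long enough.
    """
    best: Optional[Tuple[int, str, int]] = None
    for station, readings in reference.items():
        for offset in range(len(readings) - len(sequence) + 1):
            window = readings[offset:offset + len(sequence)]
            error = sum((a - b) ** 2 for a, b in zip(sequence, window))
            if best is None or error < best[0]:
                best = (error, station, offset)
    return best


# Temperatures only, in order: no location, no time.
anonymous = [15, 14, 12, 15]
result = best_alignment(anonymous)
print(result)
```

The returned offset is the index in the reference series where the sequence starts, so a good match recovers not only the location but also the dates that were removed.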

As more data sets are collected and the number of identifiable data series available for comparison keeps growing, staying anonymous becomes impossible.

More about the de-anonymization of data series

“Researchers from two universities in Europe have published a method they say is able to correctly re-identify 99.98% of individuals in anonymized data sets with just 15 demographic attributes.” Researchers spotlight the lie of ‘anonymous’ data – TechCrunch
“We study 3 months of credit card records for 1.1 million people and show that four spatiotemporal points are enough to uniquely re-identify 90% of individuals.” A study published in Science
Netflix users were identified from a database of nameless customer records in a study at the University of Texas at Austin
In a Harvard study, patients in an anonymized hospitalization data set were reidentified by name
Researchers are able to estimate the likelihood of re-identifying people in incomplete data sets, as published in Nature Communications
De-anonymization attack on geolocated data
De-anonymizing Social Networks

Did you miss part one in this series? Or do you want to read part three right away?

For the universal right to privacy,

Mullvad VPN