Big Data = Big Failure II
It's possible to remain anonymous with a single data point. However, as soon as a series of data is collected, remaining anonymous becomes impossible. Let’s take it step by step.
Last week we wrote about how the vast amount of data collected today makes it possible to predict political opinions, beliefs, religion, and interests. This week we will show by example how single data points, once put together in a series, make it impossible to stay anonymous.
One Data Point
As an example, let’s take a single data point which contains time, location, and temperature:
Time | Location | Temperature |
---|---|---|
2021-05-31 12:00 | Gothenburg | 15 degrees Celsius |
A common way of “anonymizing” the data is to remove one of the elements, in this case, the location:
Time | Temperature |
---|---|
2021-05-31 12:00 | 15 degrees Celsius |
Now it would be hard to determine the data point’s location. Even if we had all the available temperature data in the world, a search would most likely find many locations that match this specific data point. If we further remove the time as well, trying to pinpoint specifics becomes pointless.
A Series of Data Points
With a series of data points, the scenario changes significantly.
Time | Location | Temperature |
---|---|---|
2021-05-31 12:00 | Gothenburg | 15 degrees Celsius |
2021-06-01 12:00 | Gothenburg | 14 degrees Celsius |
2021-06-02 12:00 | Gothenburg | 12 degrees Celsius |
2021-06-03 13:00 | Gothenburg | 15 degrees Celsius |
If we remove the location, we still have 4 values for temperature and time in a sequence which we can match to measurements throughout the world.
Time | Temperature |
---|---|
2021-05-31 12:00 | 15 degrees Celsius |
2021-06-01 12:00 | 14 degrees Celsius |
2021-06-02 12:00 | 12 degrees Celsius |
2021-06-03 13:00 | 15 degrees Celsius |
This narrows down the number of locations the data series could have originated from to about one or two. In terms of location data, four data points are typically enough to identify a person.
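To make the narrowing-down concrete, here is a minimal Python sketch. It assumes we have a reference set of timestamped temperature readings per location; the `REFERENCE` values and the `candidate_locations` helper are invented for illustration and are not real measurements. Only the Gothenburg readings are taken from the table above.

```python
# Hypothetical reference data: location -> {time: temperature in °C}.
# All values except Gothenburg's are made up for this illustration.
REFERENCE = {
    "Gothenburg": {"2021-05-31 12:00": 15, "2021-06-01 12:00": 14,
                   "2021-06-02 12:00": 12, "2021-06-03 13:00": 15},
    "Oslo":       {"2021-05-31 12:00": 15, "2021-06-01 12:00": 13,
                   "2021-06-02 12:00": 12, "2021-06-03 13:00": 16},
    "Copenhagen": {"2021-05-31 12:00": 15, "2021-06-01 12:00": 14,
                   "2021-06-02 12:00": 13, "2021-06-03 13:00": 15},
    "Hamburg":    {"2021-05-31 12:00": 17, "2021-06-01 12:00": 14,
                   "2021-06-02 12:00": 12, "2021-06-03 13:00": 15},
}

# The "anonymized" series: location removed, time and temperature kept.
ANONYMIZED = [
    ("2021-05-31 12:00", 15),
    ("2021-06-01 12:00", 14),
    ("2021-06-02 12:00", 12),
    ("2021-06-03 13:00", 15),
]


def candidate_locations(series, reference):
    """Return the locations whose readings agree with every (time, temp) pair."""
    return sorted(
        location
        for location, readings in reference.items()
        if all(readings.get(time) == temp for time, temp in series)
    )


# Each additional data point removes candidates.
for n in range(1, len(ANONYMIZED) + 1):
    print(n, candidate_locations(ANONYMIZED[:n], REFERENCE))
# 1 ['Copenhagen', 'Gothenburg', 'Oslo']
# 2 ['Copenhagen', 'Gothenburg']
# 3 ['Gothenburg']
# 4 ['Gothenburg']
```

With a single data point, three of the four invented locations still match. By the third point only one remains. Real measurements are noisier, so a practical matcher would likely allow a small tolerance instead of requiring exact equality.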
If we were to remove the time data points, we would need a longer series of data in order to determine the location as “Gothenburg”. This is possible as long as we have a sequence of data points in chronological order.
For instance, if we have 365 data points, it would be pretty easy to spot that these temperature readings follow the typical Scandinavian weather cycle; we can determine what year the data is from and narrow it down further to Gothenburg. With the amount of reference data available today and the possibilities of AI, this would be easy to do.
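The same idea works without timestamps, as long as the readings are in chronological order. The sketch below slides the untimed series along each location's reference readings and keeps the position with the smallest total difference. The `DAILY_REFERENCE` values and the `best_alignment` helper are assumptions made up for this example, and the series is kept to a handful of values for readability; the longer the series, the less likely a coincidental match becomes.

```python
# Hypothetical daily noon temperatures (°C) per location, in chronological order.
# The values are invented for illustration only.
DAILY_REFERENCE = {
    "Gothenburg": [16, 15, 14, 12, 15, 17],
    "Oslo":       [14, 15, 13, 12, 16, 18],
    "Hamburg":    [18, 17, 14, 12, 15, 19],
}

# Anonymized series: no location and no timestamps, only the order is known.
UNTIMED = [15, 14, 12, 15]


def best_alignment(series, reference):
    """Slide the series along each location's readings.

    Returns (error, location, start_index) for the window with the lowest
    sum of absolute differences, or None if no window fits.
    """
    best = None
    for location, readings in reference.items():
        for start in range(len(readings) - len(series) + 1):
            window = readings[start:start + len(series)]
            error = sum(abs(a - b) for a, b in zip(series, window))
            if best is None or error < best[0]:
                best = (error, location, start)
    return best


print(best_alignment(UNTIMED, DAILY_REFERENCE))
# (0, 'Gothenburg', 1)
```

The result recovers both the location and the offset into the reference series, which is what reveals the dates that were removed.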
As the number of collected data sets grows, and with it the number of identifiable data series available for comparison, staying anonymous becomes impossible.
More about the de-anonymization of data series
- “Researchers from two universities in Europe have published a method they say is able to correctly re-identify 99.98% of individuals in anonymized data sets with just 15 demographic attributes.” Researchers spotlight the lie of ‘anonymous’ data – TechCrunch
- “We study 3 months of credit card records for 1.1 million people and show that four spatiotemporal points are enough to uniquely re-identify 90% of individuals.” A study published in Science
- Netflix users were identified from a database of nameless customer records in a study at the University of Texas at Austin
- In a Harvard study, patients in an anonymized hospitalization data set were reidentified by name
- Researchers are able to estimate the likelihood of re-identifying people in incomplete data sets, as published in Nature Communications
- De-anonymization attack on geolocated data
- De-anonymizing Social Networks
Did you miss part one in this series? Or do you want to read part three right away?
For the universal right to privacy,
Mullvad VPN