
Big Data = Big Failure II

Privacy 

It's possible to remain anonymous with a single data point. However, as soon as a series of data is collected, remaining anonymous becomes impossible. Let’s take it step by step.

Last week we wrote about how the vast amount of data collected today makes it possible to predict political opinions, beliefs, religion, and interests. This week we will show by example how single data points, once put together in a series, make it impossible to stay anonymous.

One Data Point

As an example, let’s take a single data point which contains time, location, and temperature:

Time Location Temperature
2021-05-31 12:00 Gothenburg 15 degrees Celsius

A common way of “anonymizing” the data is to remove one of the elements, in this case, the location:

Time Temperature
2021-05-31 12:00 15 degrees Celsius

Now it would be hard to determine the data point’s location. Even if we had all the available temperature data in the world, a search would most likely find many locations that match this specific data point. If we further remove the time as well, trying to pinpoint specifics becomes pointless.
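The ambiguity of a single point can be illustrated with a minimal Python sketch. The reference readings below are invented for illustration (they are not real measurements), and `candidate_locations` is a hypothetical helper, not part of any library.

```python
from datetime import datetime

# Hypothetical reference readings: (location, time, temperature in Celsius).
# The values are invented for illustration, not real measurements.
REFERENCE = [
    ("Gothenburg", datetime(2021, 5, 31, 12, 0), 15),
    ("Hamburg", datetime(2021, 5, 31, 12, 0), 15),
    ("Edinburgh", datetime(2021, 5, 31, 12, 0), 15),
    ("Oslo", datetime(2021, 5, 31, 12, 0), 13),
]


def candidate_locations(time, temperature, reference=REFERENCE):
    """Return every location whose reading matches the anonymized point."""
    return sorted(
        loc for loc, t, temp in reference if t == time and temp == temperature
    )


# The "anonymized" point: the location is removed, time and temperature remain.
matches = candidate_locations(datetime(2021, 5, 31, 12, 0), 15)
print(matches)
```

With this made-up reference set, three of the four locations still match, so the single point does not reveal where it was measured.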

A Series of Data Points

With a series of data points, the scenario changes significantly.

Time Location Temperature
2021-05-31 12:00 Gothenburg 15 degrees Celsius
2021-06-01 12:00 Gothenburg 14 degrees Celsius
2021-06-02 12:00 Gothenburg 12 degrees Celsius
2021-06-03 13:00 Gothenburg 15 degrees Celsius

If we remove the location, we still have 4 values for temperature and time in a sequence which we can match to measurements throughout the world.

Time Temperature
2021-05-31 12:00 15 degrees Celsius
2021-06-01 12:00 14 degrees Celsius
2021-06-02 12:00 12 degrees Celsius
2021-06-03 13:00 15 degrees Celsius

This narrows the possible origins of the data series down to about one or two locations. In terms of location data, four data points are typically enough to identify a person.
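As a rough sketch of why a series narrows things down, the Python snippet below filters a set of candidate locations one data point at a time. The anonymized series is the one from the table; the reference readings for Hamburg and Edinburgh are invented for illustration, and `surviving_candidates` is a hypothetical helper.

```python
from datetime import datetime

# The anonymized series: (time, temperature in Celsius), location removed.
SERIES = [
    (datetime(2021, 5, 31, 12, 0), 15),
    (datetime(2021, 6, 1, 12, 0), 14),
    (datetime(2021, 6, 2, 12, 0), 12),
    (datetime(2021, 6, 3, 13, 0), 15),
]

TIMES = [time for time, _ in SERIES]

# Hypothetical reference data: location -> {time: temperature}.
# Only the Gothenburg values come from the example; the rest are invented.
REFERENCE = {
    "Gothenburg": dict(zip(TIMES, [15, 14, 12, 15])),
    "Hamburg": dict(zip(TIMES, [15, 16, 12, 17])),
    "Edinburgh": dict(zip(TIMES, [15, 14, 11, 15])),
}


def surviving_candidates(series, reference):
    """Return the locations still matching after each successive data point."""
    candidates = set(reference)
    history = []
    for time, temperature in series:
        candidates = {
            loc for loc in candidates if reference[loc].get(time) == temperature
        }
        history.append(sorted(candidates))
    return history


history = surviving_candidates(SERIES, REFERENCE)
for step, remaining in enumerate(history, start=1):
    print(step, remaining)
```

In this toy data set, all three locations match the first point, two remain after the second, and only Gothenburg is left from the third point onward. Real reference data would be noisier, so an exact-equality filter like this one is a simplification.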

If we were to remove the time data points, we would need a longer series of data in order to determine the location as “Gothenburg”. This is possible as long as we have a sequence of data points in chronological order.

For instance, if we have 365 data points, it would be pretty easy to spot that these temperature readings follow the typical Scandinavian weather cycle; we can determine what year the data is from and narrow it down further to Gothenburg. With the amount of reference data available today and the possibilities of AI, this would be easy to do.
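Matching a series without timestamps can be sketched as a sliding-window search over reference series kept in chronological order. The snippet below is a simplified Python illustration under stated assumptions: the reference values are invented, `find_offsets` is a hypothetical helper, and a real comparison over 365 points would need a tolerance and far more reference data.

```python
def find_offsets(sequence, reference_series, tolerance=0):
    """Return every start index where the sequence fits the reference series.

    Both inputs are temperature lists in chronological order; a reading counts
    as a match when it differs by at most `tolerance` degrees.
    """
    n = len(sequence)
    return [
        start
        for start in range(len(reference_series) - n + 1)
        if all(
            abs(reference_series[start + i] - sequence[i]) <= tolerance
            for i in range(n)
        )
    ]


# Hypothetical daily reference series per location (invented values).
REFERENCE = {
    "Gothenburg": [11, 13, 15, 14, 12, 15, 17, 16],
    "Hamburg": [12, 15, 14, 13, 15, 18, 17, 16],
}

# The anonymized sequence: temperatures only, no time and no location.
sequence = [15, 14, 12, 15]

matches = {loc: find_offsets(sequence, series) for loc, series in REFERENCE.items()}
print(matches)
```

In this toy example the sequence fits only the Gothenburg series, and the matching offset also reveals where in the reference timeline the readings were taken, which is how the missing time information can be recovered along with the location.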

As more data sets are collected and the pool of identifiable data series available for comparison steadily grows, staying anonymous becomes impossible.

More about the de-anonymization of data series

Did you miss part one in this series? Or do you want to read part three right away?

For the universal right to privacy,

Mullvad VPN