Is anonymized data worthless?

January 11, 2018 - Big Data & AI

Anonymization is still sometimes criticized as a practice that renders data useless because it removes important information. The CNIL set out to prove the contrary with the Cabanon project, conducted in 2017. In particular, it used TeraLab, the IMT's big data platform, to anonymize New York cab data and demonstrate that a transportation service could still be built from it.

On March 10, 2014, an infographic published on Twitter by the New York cab regulator aroused Chris Whong's curiosity. What interested the young urban planner was not so much the rush-hour vehicle occupancy rate shown on the graph. Rather, his eye was drawn to the data source credited at the bottom, which had enabled New York City's Taxi and Limousine Commission (NYC TLC) to produce the illustration. Commenting on the tweet, he joined a request from another user of the social network, Ben Wellington, to find out whether the raw data was available. An exchange ensued, and Chris Whong was able to retrieve the dataset through a procedure that was tedious but open to anyone determined enough to work through the administrative paperwork. Once in possession of the data, he put it online. Using it, software engineer Vijay Pandurangan then showed that the identity of drivers and customers, as well as their addresses, could be recovered from the information recorded for each trip.

Problems with anonymizing open datasets are not new. In fact, they were not even new in 2014, when the NYC TLC data story broke. Yet cases of this kind persist to this day. One reason is that anonymized datasets are reputed to be less useful than their unfiltered counterparts: removing the possibility of tracing identity means removing information. In the case of New York cabs, for example, it means limiting the location of cabs to geographical areas rather than providing coordinates to the nearest metre. For service creators who want to build applications, and for data managers who want their data to be used as effectively as possible, anonymization means a loss of value.

A fervent defender of personal data protection, the French Data Protection Authority (Commission Nationale de l'Informatique et des Libertés - CNIL) decided to challenge this preconceived idea. The Cabanon project, led in 2017 by its digital innovation laboratory (the LINC), took up the challenge of anonymizing the NYC TLC dataset and using it in concrete scenarios for creating new services. "There are several ways to anonymize data, but none is a miracle solution that would suit all uses," warns Vincent Toubiana, who was in charge of dataset anonymization on the project and has since moved from the CNIL to the Arcep. The Cabanon team therefore came up with a dedicated solution.

Spatial and temporal degradation

First step: GPS coordinates were replaced by the ZCTA (ZIP Code Tabulation Area) code, the American equivalent of French postal codes. This is the method chosen by Uber to guarantee the security of personal data. The operation degrades the spatial data: it blurs the departure and arrival positions of cabs into zones covering several city blocks. However, it may prove insufficient to truly guarantee the anonymity of customers and drivers. At certain times of the night, only a single cab may travel from one zone to another. Even with GPS positions erased, such an isolated trip can still be linked to an identity.

"In addition to spatial degradation, we have introduced temporal degradation," adds Vincent Toubiana. Time slots are adapted to avoid the single-customer problem. "In each departure and arrival zone, we look at all the people who take a cab over 5, 15, 30 and 60-minute slots," he continues. In the dataset, the slot width is chosen so that no time slot contains fewer than ten people. If a slot still falls below that threshold at the widest setting of 60 minutes, the data is simply deleted. For Vincent Toubiana, "the aim is to find the best compromises from a mathematical point of view, so as to retain the maximum amount of data with the smallest possible time intervals".

The 2013 data used by the CNIL - the same data made public by Chris Whong - covers more than 130 million trips recorded by the NYC TLC. The double degradation operation therefore requires considerable computing resources. Processing the data according to the different temporal and spatial breakdowns required the use of TeraLab, the IMT's big data platform. "Going through TeraLab was essential in order to run queries on the database to see the 5-minute intervals, or to test the minimum number of people we could group together," says Vincent Toubiana.
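The double degradation described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Cabanon team's actual code: the rectangular `Zone` objects and their codes are hypothetical stand-ins for real ZCTA polygons, times are reduced to integer minutes, trips are grouped by (departure zone, arrival zone) pair, and the narrow-to-wide strategy is a simple greedy choice rather than the mathematical optimization mentioned by Vincent Toubiana. Only the slot widths (5, 15, 30, 60 minutes) and the threshold of ten come from the article.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

SLOT_WIDTHS = (5, 15, 30, 60)  # slot widths in minutes, as quoted in the article
MIN_GROUP = 10                 # minimum number of trips per published slot


@dataclass(frozen=True)
class Zone:
    """Hypothetical rectangular zone standing in for a ZCTA polygon."""
    code: str
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.min_lat <= lat < self.max_lat
                and self.min_lon <= lon < self.max_lon)


# Made-up zones for illustration; real ZCTAs are irregular polygons.
ZONES = [
    Zone("Z-A", 40.70, 40.75, -74.02, -73.97),
    Zone("Z-B", 40.75, 40.80, -74.02, -73.97),
]


def to_zone(lat: float, lon: float, zones=ZONES) -> Optional[str]:
    """Spatial degradation: replace a GPS point with the code of its zone."""
    for zone in zones:
        if zone.contains(lat, lon):
            return zone.code
    return None  # the point falls outside every known zone


def degrade_times(trips, widths=SLOT_WIDTHS, k=MIN_GROUP):
    """Temporal degradation of (pickup_zone, dropoff_zone, minute) tuples.

    The narrowest slot is tried first. Trips whose slot holds fewer than
    k trips are retried with the next wider slot, and are dropped if even
    the widest slot is too sparse.
    Returns (pickup_zone, dropoff_zone, slot_start, slot_width) tuples.
    """
    remaining = list(trips)
    published = []
    for width in widths:
        buckets = defaultdict(list)
        for pickup, dropoff, minute in remaining:
            slot_start = (minute // width) * width
            buckets[(pickup, dropoff, slot_start)].append(minute)
        remaining = []
        for (pickup, dropoff, slot_start), minutes in buckets.items():
            if len(minutes) >= k:
                published.extend(
                    (pickup, dropoff, slot_start, width) for _ in minutes)
            else:
                remaining.extend((pickup, dropoff, m) for m in minutes)
    return published  # trips still in `remaining` are deleted
```

In this sketch a trip is published with the exact minute replaced by a slot start and a slot width, so a busy zone pair keeps 5-minute precision while a quiet one falls back to wider slots. Because each pass only counts trips not yet published, every published group contains at least `k` trips.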

Data visualization in the service of data use

Once the dataset has been anonymized in this way, its usefulness remains to be demonstrated. To make it easier to read, a data visualization was created in the form of a choropleth map - in other words, a geographical representation that assigns each zone a color according to its volume of trips. "The visual experience makes it easier to see the difference between anonymized and non-anonymized data, and facilitates the analysis and narration of this data," emphasizes Estelle Hary, the CNIL designer who created the data visualization.

Left: map showing routes based on non-anonymized data. Right: choropleth map representing trips at anonymized granularity.
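The principle behind a choropleth map can be illustrated with a short Python sketch: count the trips per zone, then assign each zone a color class according to that count. The palette, the equal-interval classification and the function name are assumptions made for the example; the article does not say how the CNIL map was actually classified or colored.

```python
from collections import Counter

# Hypothetical light-to-dark palette; any ordered list of colors would do.
PALETTE = ["#ffffcc", "#a1dab4", "#41b6c4", "#225ea8"]


def zone_colors(pickup_zones, palette=PALETTE):
    """Map each zone code to a color according to its number of trips.

    Uses equal-interval classes between the smallest and largest count.
    """
    counts = Counter(pickup_zones)
    if not counts:
        return {}
    lowest, highest = min(counts.values()), max(counts.values())
    span = highest - lowest
    colors = {}
    for zone, n in counts.items():
        if span == 0:
            index = 0  # all zones have the same volume
        else:
            index = min(int((n - lowest) / span * len(palette)),
                        len(palette) - 1)
        colors[zone] = palette[index]
    return colors
```

The input here is simply the list of departure zone codes left after anonymization, which is why the map loses nothing from the removal of exact coordinates.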

On the basis of this map, a reflection process was launched on the services that could be created with anonymized data. It identified points in Brooklyn from which people order cabs to complete their journey home. "We came up with the idea of a private transportation network that would complement New York's public transport system," says Estelle Hary. Cheaper than cabs, these privately run shared transport services could fill the gaps left by buses. "This is a typical example of a viable service that can be created using anonymized data," she continues. In this case, the information lost to protect personal data has no impact: the processed dataset is just as effective. And that is just one example. "By coupling an anonymized dataset with other public data, the possibilities truly multiply," points out the designer. In other words, the value of an open dataset also depends on the ability to be creative.

Of course, there will always be cases where degrading the raw data limits the creation of a service. This is particularly true for the most personalized services. But perhaps anonymity should be imagined not as a binary value, but as a gradient. Rather than seeing anonymity as a characteristic that is either present or absent in a dataset, isn't it more relevant to consider several degrees of anonymity, made accessible according to the dataset's exposure and the control exercised over its use? This is what the CNIL proposes in its conclusion to the Cabanon project. Data could be publicly accessible in a completely anonymized form. Alternatively, the same dataset could be made available in progressively less anonymized versions, each with a correspondingly higher level of control over its use.


© 2022 Carnot Télécom & Société Numérique