Anonymous? Estimating the Risk of Re-identification Within the ProCAncer-I Health Data Sets
by Emily Johnson and Theresa Henne
| Age | Weight | PSA value | Gleason Group |
|-----|----------|-----------|---------------|
| 33  | 112.2 kg | 7.9 ng/ml | 3+4           |
Table 1. Is this data anonymous? Data sample of a fictional prostate cancer patient
Imagine you stumble across the information given above: it may be part of a leaked database or a paper file forgotten on the bus. Could you reconstruct who the information is about? Would you consider this data anonymous in the sense that the person it refers to is not identifiable?
The question of when data anonymization can be considered successful is a recurring topic of discussion within the ProCAncer-I project, which deals with personal health data. In view of rapid technological development, it is necessary to continuously re-evaluate whether the re-identification of an individual is reasonably likely. Rather than treating anonymization as an item that can be ticked off, we understand it as a process comprising numerous measures carried out during the project’s lifespan and beyond, a process that should be regularly monitored and assessed. This approach is also outlined by the Article 29 Data Protection Working Party, which states that anonymisation should not be viewed as a “one-off exercise” and that data controllers should regularly assess the risks of re-identification [1]. Joint guidance from the Spanish data protection authority (AEPD) and the European Data Protection Supervisor (EDPS) further demands that all of the objective factors associated with re-identification be considered at the time of anonymisation and on an ongoing basis. As outlined in Recital 26 GDPR, the costs of and the amount of time required for identification are particularly important factors, and both depend heavily on the state of the art in technology [2].
But before we dive deeper into the universe of anonymization and attempt to answer the question of whether the data above can be considered anonymous, we would like to begin with a brief outline of the project and its ambitions.
ProCAncer-I is an EU-funded Horizon 2020 research project that aims to employ artificial intelligence technology to improve the detection, prediction and treatment of prostate cancer. To many, the gland, which is part of the male reproductive system and about the size of a chestnut, might not seem particularly noteworthy, so it may come as a surprise that prostate cancer is the second most frequent type of cancer in men and the third most lethal. A formerly common procedure for diagnosing prostate cancer built on testing the patient’s blood for PSA, a protein produced in high quantities by cancer cells in the prostate. The PSA value, however, did not prove to be a reliable indicator and led to over-diagnosis and over-treatment. Currently, the medical community places its hopes in AI-based solutions that analyse multi-parametric (mp) MR images of the patient’s organ. Although existing efforts are promising, they remain fragmented and of limited scope.
To catalyse this development, the ProCAncer-I project aims to build Europe’s largest repository of MR images of prostate cancer and to develop AI models trained to serve the needs of eight different clinical scenarios. The gathered data stems from 13 clinical sites across Europe and includes mpMR scans as well as clinical data from over 17 000 patients. Before the patient records are shared via a cloud-based platform, the data will be anonymized by the clinical partners using designated software. All “direct identifiers”, such as the name, social security number, or contact details of the patient, will be removed from the data set. Information such as the patient’s age or weight and the results of biopsies or blood tests, however, is kept so that it can be considered within the research. For example, the risk of recurrence of cancer might correlate with one of these variables and hence be an important predictor when assessing future cases.
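To make the idea of removing direct identifiers concrete, here is a minimal, purely illustrative sketch in Python; the field names and values are hypothetical and do not reflect the actual ProCAncer-I schema or the designated anonymization software.

```python
# Purely illustrative: strip direct identifiers from a patient record while
# keeping indirect identifiers and clinical variables needed for research.
DIRECT_IDENTIFIERS = {"name", "social_security_number", "contact_details"}

def strip_direct_identifiers(record: dict) -> dict:
    """Return a copy of the record with all direct identifiers removed."""
    return {field: value for field, value in record.items()
            if field not in DIRECT_IDENTIFIERS}

patient = {
    "name": "John Doe",                   # direct identifier: removed
    "social_security_number": "1234567",  # direct identifier: removed
    "age": 33,                            # indirect identifier: kept
    "weight_kg": 112.2,                   # indirect identifier: kept
    "psa_ng_ml": 7.9,                     # clinical variable: kept
    "gleason_group": "3+4",               # clinical variable: kept
}

print(strip_direct_identifiers(patient))
# {'age': 33, 'weight_kg': 112.2, 'psa_ng_ml': 7.9, 'gleason_group': '3+4'}
```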
Although from a medical perspective it is often desirable to enrich the data set with more contextual information on the patients, from a legal perspective caution is necessary. With every so-called “indirect identifier” added to the data set, the risk of re-identification rises, since the combination of variables can easily make a person unique. But how can one determine which variables can and cannot be included in the data set in order to ensure anonymity? “Well, it depends” must be the answer. It depends on several factors, such as the sample size, i.e. the overall number of people included in the data set. The distribution of the variables and the existence of so-called outliers, rare values such as an extremely high age that are easily attributable to an individual, must also be considered. Here is an example:
If a data set only includes patients from a small town on the coast of the UK, indicating the patient’s age might be sufficient to identify him. For instance, he might be the only 33-year-old man living in the town, or the only person in that age group weighing more than 100 kg. One strategy to reduce the risk of re-identification is to widen the intervals in which the data is recorded, as sketched below: the indication that a patient is between 30 and 35 years old will not render him identifiable even within a small data set.
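A minimal sketch of this generalization step, assuming five-year age bands; the band width is our illustrative choice, not a project-mandated value.

```python
# Illustrative: replace exact ages with five-year intervals so that a single
# record can no longer be pinned to one specific person.
def generalize_age(age: int, width: int = 5) -> str:
    """Map an exact age to a half-open interval such as '30-35'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width}"

for exact_age in (33, 34, 71):
    print(exact_age, "->", generalize_age(exact_age))
# 33 -> 30-35
# 34 -> 30-35
# 71 -> 70-75
```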
Luckily, preserving the privacy of the patients in the ProCAncer-I project is not as difficult as in a small British town. The sheer volume of the data set, comprising more than 17 000 cases, grants a high degree of protection, since the likelihood of unique combinations decreases with size. In addition, the origin of the data will not be represented within the repository, making it impossible to trace whether a patient’s data stems from Greece, Spain or the UK. Furthermore, the ProCAncer-I data set only lists two indirect identifiers, namely age and weight, which further decreases the likelihood of unique combinations.
When deciding whether a new indirect identifier can be included in the data set in the future, or whether the interval of a variable should be widened, it is helpful to consider probability metrics that aim to quantify the risk of re-identification. In their books on anonymizing health data, Khaled El Emam and Luk Arbuckle [3] take a risk-based perspective on anonymization and propose holistic measures to ensure sustainable anonymization. Within their metrics, the authors differentiate between the maximum risk, which is attributed to the record with the highest risk of re-identification, typically an outlier, and the average risk, which is taken across all records in the data set. Different threat scenarios are also considered. Clearly, the risk of re-identification during a targeted attack, in which an adversary attempts to identify an extreme outlier, for example by matching records with public registers, is rather high. The likelihood that a passer-by spontaneously recognizes someone they know (“Holly Smoker, that’s my neighbour” [4]), however, can be considered rather low. In either case, El Emam and Arbuckle propose to ensure that the maximum risk of re-identification does not exceed 0.5 or 0.33, meaning that no combination of indirect identifiers should be unique and that each record should share its values with at least one or two other records (a group size of at least two or three).
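The following sketch illustrates how such metrics can be computed, assuming that each record’s risk is one divided by the number of records sharing its combination of indirect identifiers; the records and the check against the 0.5 threshold are purely illustrative.

```python
# Illustrative maximum- and average-risk computation: records are grouped by
# their combination of indirect identifiers (an "equivalence class"), and each
# record's re-identification risk is 1 / class size.
from collections import Counter

records = [
    ("30-35", "110-115"),  # (age band, weight band) - two indirect identifiers
    ("30-35", "110-115"),
    ("30-35", "105-110"),  # unique combination -> risk of 1.0
    ("65-70", "80-85"),
    ("65-70", "80-85"),
    ("65-70", "80-85"),
]

class_sizes = Counter(records)                      # size of each equivalence class
per_record_risk = [1 / class_sizes[r] for r in records]

maximum_risk = max(per_record_risk)                 # driven by the smallest class
average_risk = sum(per_record_risk) / len(per_record_risk)

print(f"maximum risk: {maximum_risk:.2f}")          # 1.00 -> a unique record exists
print(f"average risk: {average_risk:.2f}")          # 0.50
print("meets 0.5 threshold:", maximum_risk <= 0.5)  # False
```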
Making sure that there are no unique entries is only one approach to anonymization and should be accompanied by other privacy-preserving measures, not least because, where unique entries do occur, it might be undesirable simply to remove them from the data set: doing so would exclude patients with uncommon characteristics from the research project and thereby from the resulting health care benefits.
While the processing of anonymised data falls outside the scope of the GDPR [5], the risk of re-identification of a data subject after anonymisation “is never zero” [6]. This point is highlighted both in regulatory guidance and in the wording of the GDPR itself. Prior to anonymisation, the ProCAncer-I partners locally process data related to health. The processing of this special category of personal data entails increased risks to the rights and freedoms of the data subjects. Any possibility of re-identification from the anonymised data must therefore be taken seriously, and this includes assessing whether the inclusion of additional data sets increases the risk of identification. Such assessments can only be made by examining the whole data set in relation to the “means reasonably likely to be used” to re-identify the data subject, whether directly or indirectly [7].
When looking at only one line of a data set, as presented in Table 1, the question “Is this data anonymous?” can therefore hardly be answered. Besides the characteristics of the data set, such as its size and the distribution of its variables, all the objective factors that may lead to re-identification must be taken into account, by carrying out regular anonymisation assessments and by ensuring that state-of-the-art technology is in place to maintain anonymisation [8].
Sources:
[1] Article 29 Data Protection Working Party, Opinion 05/2014 on Anonymisation Techniques, adopted 10 April 2014, 0829/14/EN, p. 7.
[2] AEPD & EDPS, ‘10 Misunderstandings Related to Anonymisation’ (Joint Paper, 27 April 2021), p. 5.
[3] Luk Arbuckle and Khaled El Emam, Building an Anonymization Pipeline: Creating Safe Data (O’Reilly Media 2020); Khaled El Emam and Luk Arbuckle, Anonymizing Health Data: Case Studies and Methods to Get You Started (O’Reilly Media 2013).
[4] El Emam and Arbuckle (n 3), p. 34.
[5] GDPR, Article 1(1); Recital 26.
[6] AEPD & EDPS, ‘10 Misunderstandings Related to Anonymisation’ (Joint Paper, 27 April 2021), p. 5.
[7] GDPR, Recital 26.
[8] GDPR, Recital 26.