Data anonymisation – or so we thought
26 July 2019
The new oil, the gold of your organisation, plutonium – handle with care!
All of these epithets have been used to describe data in the age of the hyperconnected enterprise. That said, I have a T-shirt from a now-acquired business that asserts data is, in fact, the new bacon.
But the fact remains that data is now the most valuable commodity, as evidenced by the lengths to which organisations will go to acquire it and base campaigns upon it, Cambridge Analytica being a prime example.
“As people become more aware of their privacy and the need for personal data to be protected, so too the value of large datasets for all manner of services is also becoming apparent”
However, the power of data to provide better insights to support resource allocation, distribution, supply chains and planning cannot be denied, and so ways must be found to make data available, securely, reliably and sometimes, anonymously.
Data anonymisation was thought to be a good way of sanitising data to make it useful for purposes such as public planning: data drawn from sources such as census or electoral records has its personal details stripped out so that it represents a populace, not individual people.
But in various tests of these anonymisation exercises, worrying trends have emerged that have shown the problem of re-identification to be all too easy.
The latest is the one in our news story, where researchers from Imperial College London and UCLouvain found that 99.98% of Americans could be correctly re-identified in any available anonymised dataset using just 15 characteristics, including age, gender and marital status.
Another online tool was able to re-identify individuals in up to 86% of cases with the help of just four key parameters: gender, marital status, date of birth and postcode.
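To see why so few attributes suffice, consider how quickly combinations of values become unique. The sketch below uses an invented toy dataset (none of the records or figures come from the studies above) to count the fraction of records that are the only ones matching their combination of quasi-identifiers:

```python
from collections import Counter

# Toy "anonymised" records: names removed, quasi-identifiers retained.
# These rows are invented for illustration only.
records = [
    {"gender": "F", "dob": "1984-03-12", "postcode": "D04", "marital": "married"},
    {"gender": "M", "dob": "1990-07-01", "postcode": "D08", "marital": "single"},
    {"gender": "M", "dob": "1990-07-01", "postcode": "D08", "marital": "married"},
    {"gender": "F", "dob": "1975-11-23", "postcode": "T12", "marital": "single"},
    {"gender": "F", "dob": "1984-03-12", "postcode": "D04", "marital": "married"},
]

def uniqueness(rows, keys):
    """Fraction of rows whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in rows)
    return sum(1 for r in rows if combos[tuple(r[k] for k in keys)] == 1) / len(rows)

# One attribute narrows things down very little; four together
# single out a majority of the toy population.
print(uniqueness(records, ("gender",)))                               # 0.0
print(uniqueness(records, ("gender", "marital", "dob", "postcode")))  # 0.6
```

On real national-scale datasets the same effect is far starker, which is what makes the headline re-identification rates possible.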
Other studies are also cited. In Germany, journalists re-identified public figures in an anonymised browsing-history dataset of three million German citizens, acquired for free from a data broker, allowing them to discover some rather personal information about those figures, including a judge and an MP.
Dealing with this issue is key to ensuring that large datasets can be made safe for wider usage, and there are already suites of tools to allow it to be accomplished. Many of these claim compliance with various standards and regulations, such as HIPAA and GDPR, and yet the tales keep emerging of just how readily re-identification can occur.
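Many such tools rest on some form of generalisation: coarsening quasi-identifiers until every record shares its combination with at least k-1 others, the idea behind k-anonymity. A minimal sketch of that idea follows, again with invented records and field names rather than any particular tool's API:

```python
from collections import Counter

# Invented records for illustration; real tools work on far richer data.
records = [
    {"gender": "F", "dob": "1984-03-12", "postcode": "D04"},
    {"gender": "F", "dob": "1984-09-30", "postcode": "D06"},
    {"gender": "M", "dob": "1990-07-01", "postcode": "D08"},
    {"gender": "M", "dob": "1990-12-25", "postcode": "D02"},
]

def generalise(row):
    """Coarsen quasi-identifiers: full date of birth -> birth year,
    full postcode -> leading district letter."""
    return {
        "gender": row["gender"],
        "birth_year": row["dob"][:4],
        "district": row["postcode"][:1],
    }

def k_anonymity(rows, keys):
    """The k in k-anonymity: size of the smallest group of rows
    sharing the same combination of `keys`."""
    combos = Counter(tuple(r[k] for k in keys) for r in rows)
    return min(combos.values())

print(k_anonymity(records, ("gender", "dob", "postcode")))  # 1: all unique
coarse = [generalise(r) for r in records]
print(k_anonymity(coarse, ("gender", "birth_year", "district")))  # 2
```

The catch, as the re-identification studies show, is the trade-off: generalisation coarse enough to resist linkage attacks can destroy much of the analytical value the data was shared for in the first place.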
The European open data initiative aims to provide access to data from the European Union (EU) institutions and other EU bodies, which can be used and reused for commercial or non-commercial purposes. It says, “by providing easy access to data — free of charge — we aim to help you put them to innovative use and unlock their economic potential.” It provides a data portal to “make the EU institutions and other bodies more open and accountable.”
In fact, Ireland is a key player in this: as recently as 2017, the country’s National Open Data portal, run by the Open Data Unit, won the general category at the eGovernment Awards.
I’m not aware of any research on Irish datasets assessing the extent to which re-identification is possible, but I’m sure there must be concerns.
As people become more aware of their privacy and the need for personal data to be protected, so too the value of large datasets for all manner of services is also becoming apparent.
As fines under various regulations, from GDPR to the California Consumer Privacy Act, make organisations sit up and take notice, it is equally important that we do not lose the ability to access and leverage these datasets for fear of consequences.
A great source of intelligence may be lost, not just the ability to target ads.
But as the US presidential election looms once more, with the prospect of a UK general election in the near future, and perhaps one on this island too, the spectre of manipulation and interference based on targeting through granular datasets is all too real.
A careful middle ground must be reached where the greater good is not subverted through fear, while at the same time ensuring that each and every citizen is given the privacy they deserve.
Anonymisation techniques must be developed as quickly as the data interrogation tools themselves to ensure that all the above goals can be achieved.