We learned over a decade ago that Big Data(tm) is ~impossible to anonymize through conventional means. The Massachusetts GIC released painstakingly “anonymized” health records for research, only to have researchers match them against public voting records and identify, among others, the health records of the Governor of Massachusetts.
This ‘redaction-proof’ quality is one reason Big Data threatens private lives the way nuclear weapons threaten cities.
If you give someone (or another organization) any access, they get all of it. It’s like stashing your suitcase nuke under their couch for a while. In both cases they could compromise all of Paris, even if it would take the Parisians a bit longer to notice the former.
Microsoft’s come up with a partial solution to this problem: database queries are filtered through a sort of carefully drunken oracle, so the analyst gets answers with just enough noise mixed in to keep any individual’s privacy from being compromised. The guard software also keeps track of past queries (accountability at work!) and ensures that all the queries put together can’t reveal more data than intended.
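That query-tracking idea can be sketched in a few lines. This is a hypothetical toy, not Microsoft’s actual implementation: the `PrivacyGuard` class, its `total_epsilon` budget, and the `count` method are all illustrative names I’ve made up to show the bookkeeping.

```python
import random

class PrivacyGuard:
    """Toy sketch of a DP guard: the analyst queries through this object
    and never touches the underlying records directly."""

    def __init__(self, records, total_epsilon=1.0):
        self._records = records   # the raw database stays behind the guard
        self._spent = 0.0         # privacy budget consumed by past queries
        self._total = total_epsilon

    def count(self, predicate, epsilon=0.1):
        # Refuse once this query, combined with all earlier ones,
        # would reveal more than the budget allows.
        if self._spent + epsilon > self._total:
            raise RuntimeError("privacy budget exhausted")
        self._spent += epsilon
        true_answer = sum(1 for r in self._records if predicate(r))
        # A counting query has sensitivity 1, so Laplace noise with
        # scale 1/epsilon suffices; a Laplace sample is the difference
        # of two exponentials with rate epsilon.
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true_answer + noise
```

The key design point is that the budget check depends only on the sequence of queries, never on the data itself, so the refusal message can’t leak anything.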
One problem remains: there is still a giant database sitting behind the “Differential Privacy” guard, and it would be exposed if anyone managed to compromise or bypass the guard.
For that, the best solutions still seem to be not collecting the data at all, salting it with noise from the start (why does Arthur Andersen have so many VPs of performance art?), and storing all data widely dispersed so nobody can get all of it at once.
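“Salted with noise from the start” has a classic concrete form: randomized response, where each record is perturbed at collection time, so the stored database never holds a reliable answer about any one person. A minimal sketch (the function names and the 50/50 coin parameters are my own illustrative choices):

```python
import random

def randomized_response(truth: bool) -> bool:
    """Record a sensitive yes/no attribute with noise baked in at
    collection time, so even the raw database can't betray one person."""
    if random.random() < 0.5:
        return truth               # answer honestly half the time
    return random.random() < 0.5   # otherwise record a coin flip

def estimate_true_rate(responses):
    # Under this scheme, observed_rate = 0.5 * true_rate + 0.25,
    # so invert that to recover a population-level estimate.
    observed = sum(responses) / len(responses)
    return 2 * observed - 0.5
```

Any single stored bit is deniable (it may just be a coin flip), yet aggregate statistics remain recoverable across many records.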
“Roughly speaking, DP works by inserting an intermediary piece of software between the analyst and the database. The analyst never gets to access or actually see the contents of the database; instead the intermediary acts as a privacy-protecting screen or filter, effectively serving as a privacy guard. The guard takes the questions (queries) that the analyst wishes to ask of the database and evaluates the combined privacy implications of that question and those that have preceded it. This evaluation depends only on the sequence of the queries, not on the actual data in the database. Once the guard establishes the privacy risk of the question, it then gets the answer from the database, and changes it to be slightly imprecise (we say that it injects a certain amount of “distortion” into the response, and the amount of distortion is calibrated to the privacy risk), before sending it back to the analyst. When the privacy risk is low, we can think of this distortion as inaccuracies that are small enough that they do not affect the quality of the answers significantly, but large enough that they protect the identities of individuals in the database. If, however, answering the question with relative accuracy opens up the possibility that somebody’s privacy will be breached, then the guard will increase the amount of distortion to a level that may make the answer not useful. The analyst may then ask a more general question, or simply abandon it.”
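The calibration step in that quote (more privacy risk, more distortion) is usually done with the Laplace mechanism: noise scaled to how much one person’s record could possibly shift the answer. A hedged sketch, with function names of my own invention:

```python
import random

def laplace_noise(scale):
    # Sample Laplace(0, scale) as the difference of two exponential draws.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def answer(true_value, sensitivity, epsilon):
    """Distortion calibrated to privacy risk: a query whose result one
    person could shift a lot (high sensitivity), or a tight privacy
    budget (small epsilon), gets proportionally more noise."""
    return true_value + laplace_noise(sensitivity / epsilon)
```

So a head count (one person moves the answer by at most 1) needs little noise, while a salary total (one person could move it by their whole salary) gets so much distortion the answer may be useless, which is exactly when the analyst should “ask a more general question, or simply abandon it.”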