Christopher Waites

Computer Science @
Stanford University 🌲

Broadly interested in deep
learning, privacy, ethics,
and generative modeling

Fortunate to have been
advised by the excellent
Rachel Cummings

26 February 2020

When Differential Privacy Might Be Most Useful

by Chris Waites

When we talk about differential privacy, the setting typically involves a data curator and the outside world. The curator supposedly has an incentive to release the result of some data analysis, but in such a way that the privacy leakage associated with the analysis is tracked and/or minimized.

There are, however, several practical issues with this. A common one is the issue of an infinite horizon. That is, say some Silicon Valley tech giant wants to adhere to some strict privacy budget for the rest of eternity. When they want to release the result of an analysis, what epsilon should they choose? 1? 0.1? 0.0001? It’s unclear, especially if their goal is to stick around for eternity and we assume that the privacy leakage associated with a given analysis is permanent.

What will happen? Well, once they’ve inevitably crept up on their limit, should we expect them to seriously consider halting the release of further analyses forever? No - more than likely they’ll raise the limit. And going forward, they’ll inevitably do it again. This is to say that, for entities which have no foreseeable horizon, the apparent practicality of differential privacy is of concern because the privacy expenditure is a number which monotonically increases with time, and you can never go back.
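To make the "monotonically increases" point concrete, here is a minimal sketch of a privacy accountant, assuming basic sequential composition (the epsilons of successive releases simply add up). The class name `PrivacyAccountant` and its methods are hypothetical names I made up for illustration, not any particular library's API.

```python
class PrivacyAccountant:
    """Tracks cumulative epsilon, assuming basic sequential composition."""

    def __init__(self, budget):
        self.budget = budget  # total epsilon the curator has committed to
        self.spent = 0.0      # only ever goes up; there is no refund method

    def charge(self, epsilon):
        """Record one release costing `epsilon`; return the remaining budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not reject a
        # release that lands exactly on the budget.
        if self.spent + epsilon > self.budget + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return self.budget - self.spent


accountant = PrivacyAccountant(budget=1.0)
for _ in range(10):
    accountant.charge(0.1)  # ten releases at epsilon = 0.1 use up the budget

try:
    accountant.charge(0.1)  # the eleventh release is refused
except RuntimeError as err:
    print(err)
```

The only ways out once `charge` starts raising are to stop releasing or to construct a new accountant with a bigger budget, which is exactly the temptation described above.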

This is not a problem I have a solution to. I’m a big fan of differentially private synthetic datasets, and in certain contexts they can help in this regard (say, in the context of the U.S. Census). However, not every undertaking can be easily framed as a synthetic data problem, especially if new relevant data comes in at a consistent rate.

The point I’d like to make is that the continual public release of analysis results may not be the most interesting or useful context in which to evaluate the utility of differential privacy.

Instead, consider the case where an engineer at the aforementioned Silicon Valley tech giant accidentally leaves their laptop in the car and it gets stolen, and say they were doing some work concerning sensitive data. Can you begin to quantify the amount of damage done to the individuals included in the data they were working with? Not by default, but if the results the engineer was working with were computed with differential privacy in mind, then you could actually start to get some form of guarantee.
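As a rough illustration of what "computed with differential privacy in mind" could look like on that laptop, here is a minimal sketch of a counting query answered with the Laplace mechanism. The function names and the toy records are hypothetical, and this naive floating-point sampler is for intuition only, not something to deploy.

```python
import random


def laplace_noise(scale, rng=random):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=random):
    """Epsilon-DP count of the records satisfying `predicate`.

    Adding or removing one record changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical sensitive data: ages of 1000 users.
ages = [random.randint(18, 90) for _ in range(1000)]
noisy = private_count(ages, lambda age: age >= 65, epsilon=0.5)
print(f"noisy count of users aged 65+: {noisy:.1f}")
```

If only outputs like `noisy` live on the stolen machine, and the raw `ages` do not, then what the thief learns about any one individual is bounded by the total epsilon spent producing those outputs.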

So, the slight distinction I’m making here is that maybe the utility of differential privacy is not as pronounced in contexts where the primary goal is information release. Maybe a more useful context for differential privacy is actually behind the walls of cryptographic protection that separate you from the world outside your organization.

So, the idea is to share the results of analyses internally while incorporating differential privacy as a means of protection against the worst-case scenario, where a data leakage happens against your will. The conversation then shifts from saying “Silicon Valley tech giant, use differential privacy so that your public analyses don’t reveal too much about your users” to saying “have your employees speak through the lens of differential privacy, so we know how much damage has been done in the worst case where information is leaked.”

This reformulation of the problem setting, in a bit of a roundabout manner, highlights the utility of differential privacy by dampening the issues concerning infinite horizons. This stems from the inherent nature of data leakages: namely, they are unintended and sparse. For example, it doesn’t make as much sense to complain about the limitations of differential privacy if data release is unintended by definition. If it’s going to happen regardless, you don’t have to worry as much about the number of releases you intend to perform forever onwards, because that’s not a variable you can control in the first place.

Additionally, given that data leakages are canonically sparse, this allows you to talk about a global privacy budget per individual which might actually be useful. That is, you could actually get away with something like an epsilon of 1.0 per user over a very long timespan if data release occurs every ten years, not every day.
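To put rough numbers on this, here is a small back-of-the-envelope sketch. It assumes basic sequential composition (per-release epsilons add up), a hypothetical 50-year horizon, and a sensitivity-1 query answered with Laplace noise; tighter composition theorems would change the constants but not the overall picture.

```python
def per_release_epsilon(total_epsilon, num_releases):
    """Split a total budget evenly across releases (basic composition)."""
    return total_epsilon / num_releases


def laplace_scale(epsilon, sensitivity=1.0):
    """Noise scale the Laplace mechanism needs for a given epsilon."""
    return sensitivity / epsilon


TOTAL_EPSILON = 1.0   # per-user budget from the paragraph above
HORIZON_YEARS = 50    # assumed horizon, purely for illustration

daily = per_release_epsilon(TOTAL_EPSILON, HORIZON_YEARS * 365)
per_decade = per_release_epsilon(TOTAL_EPSILON, HORIZON_YEARS // 10)

print(f"daily releases:      eps = {daily:.2e}, noise scale = {laplace_scale(daily):.0f}")
print(f"one leak per decade: eps = {per_decade:.2f}, noise scale = {laplace_scale(per_decade):.0f}")
```

Under these assumptions, daily releases leave each one with an epsilon of roughly 5.5e-5, which means noise on the order of tens of thousands for a simple count. One leak per decade leaves an epsilon of 0.2 each, with noise on the order of 5.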

These are just my thoughts in isolation, and this has been said before by others. But hopefully it sparks additional discussion on where it will make the most sense to deploy differential privacy in the real world in the years to come.