13 Reflections
13.1 Toward practicing privacy
Differential Privacy was designed with internet-scale data sets in mind.
Reconstruction attacks along the lines of those in Section 8 can be
carried out by a polynomial time bounded adversary asking only O(n)
queries on databases of size n. When n is on the order of hundreds of
millions, and each query requires a linear amount of computation, such
an attack is unrealistic, even though the queries can be parallelized.
This observation led to the earliest steps toward differential privacy: If
the adversary is restricted to a sublinear number of counting queries,
then o(√n) noise per query (less than the sampling error!) is sufficient for preserving privacy (Corollary 3.21).
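To make the comparison with sampling error concrete (a standard calculation, not one spelled out in the text): if each of the n rows is an independent draw in which an attribute is present with probability p, then a counting query over the database has standard deviation
\[
\sqrt{n\,p(1-p)} = \Theta(\sqrt{n}),
\]
so perturbing each answer by o(√n) changes it by less than its own sampling variability.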
To what extent can differential privacy be brought to bear on
smaller data sets, or even targeted attacks that isolate a small sub-
set of a much larger database, without destroying statistical utility?
First, an analysis may require a number of queries that begins to look
something like the size of this smaller set. Second, letting n now denote
the size of the smaller set or small database, and letting k be the num-
ber of queries, fractional errors on the order of √k/n are harder to
ignore when n is small. Third, the √(ln(1/δ)) factor in the advanced
composition theorem becomes significant. Keeping in mind the recon-
struction attacks when noise is o(√n), there appears to be little room
to maneuver for arbitrary sets of k ≈ n low-sensitivity queries.
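To see the scale of the problem, the sketch below computes the per-query fractional error of Laplace-noised counting queries when the overall budget is split across queries by the usual advanced-composition heuristic ε_0 ≈ ε/√(2k ln(1/δ)). The function and the specific parameter values are illustrative assumptions, not taken from the text, and the lower-order composition term is ignored.

    import math

    def fractional_error(n, k, eps=1.0, delta=1e-6):
        # Heuristic per-query budget under advanced composition
        # (ignores the lower-order k * eps0 * (e^eps0 - 1) term).
        eps0 = eps / math.sqrt(2 * k * math.log(1 / delta))
        laplace_scale = 1.0 / eps0       # noise magnitude per counting query
        return laplace_scale / n         # error as a fraction of the data size

    # Negligible at internet scale ...
    print(fractional_error(n=100_000_000, k=10_000))   # about 5e-6
    # ... but comparable to the signal itself on a small data set.
    print(fractional_error(n=1_000, k=10_000))          # about 0.5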
There are several promising lines of research for addressing these
concerns.
The Query Errors Don’t Tell the Whole Story. As an example of this
phenomenon, consider the problem of linear regression. The input is
a collection of labeled data points of the form (x, y), where x ∈ R^d
and y ∈ R, for arbitrary dimension d. The goal is to find θ ∈ R^d that
“predicts” y “as well as possible,” given x, under the assumption that
the relationship is linear. If the goal is simply to “explain” the given
data set, differential privacy may well introduce unacceptable error.
Certainly the specific algorithm that simply computes
argmin_θ ∑_{i=1}^n |θ · x_i − y_i|²
and adds appropriately scaled Laplace noise independently to each
coordinate of θ may produce a θ̃ that differs substantially from θ. But if
the goal is to learn a predictor that will do well for future, unseen inputs
(x, y) then a slightly different computation is used to avoid overfitting,
and the (possibly large) difference between the private and non-private
coefficient vectors does not translate into a gap in classification error!
A similar phenomenon has been observed in model fitting.
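The geometric point, that two coefficient vectors can be far apart while making nearly identical predictions, can be seen even without any privacy machinery. The sketch below is entirely my own construction: synthetic data, an ill-conditioned design, and ridge regularization standing in for the "slightly different computation used to avoid overfitting". It only compares the coefficient gap with the gap in held-out error; it does not implement a differentially private regression.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, d_latent = 2000, 20, 5

    # Ill-conditioned design: the 20 observed features are noisy mixtures of
    # only 5 latent ones, so many coefficient directions barely affect predictions.
    Z = rng.normal(size=(n, d_latent))
    M = rng.normal(size=(d_latent, d))
    X = Z @ M + 0.01 * rng.normal(size=(n, d))
    theta_true = rng.normal(size=d)
    y = X @ theta_true + rng.normal(scale=0.5, size=n)

    # Plain least squares: "explains" the given data set.
    theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

    # A stabilized fit (ridge); the penalty strength is an arbitrary choice.
    lam = 10.0
    theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    # Fresh data from the same distribution.
    X2 = rng.normal(size=(n, d_latent)) @ M + 0.01 * rng.normal(size=(n, d))
    y2 = X2 @ theta_true + rng.normal(scale=0.5, size=n)

    print("coefficient gap :", np.linalg.norm(theta_ols - theta_ridge))
    print("test MSE (plain):", np.mean((X2 @ theta_ols - y2) ** 2))
    print("test MSE (ridge):", np.mean((X2 @ theta_ridge - y2) ** 2))

With this construction the coefficient gap is typically large while the two held-out errors are close; the point is only that distance in coefficient space is not the right yardstick for predictive utility.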
Less Can Be More. Many analyses ask for more than they actually
use. Exploitation of this principle is at the heart of Report Noisy Max,
where for the accuracy “price” of one measurement we learn one of
the largest of many measurements. By asking for “less” (that is, not
requiring that all noisy measurements be released, but rather only ask-
ing for the largest one), we obtain “more” (better accuracy). A familiar
principle in privacy is to minimize collection and reporting. Here we
see this play out in the realm of what must be revealed, rather than
what must be used in the computation.
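As a reminder of how little needs to be released, here is a minimal sketch of Report Noisy Max for counting queries; the Lap(1/ε) calibration is the one I recall from Section 3.3 and should be checked there.

    import numpy as np

    def report_noisy_max(counts, eps, rng=np.random.default_rng()):
        """Return only the index of the largest noisy count.

        The noisy counts themselves are never released; paying the accuracy
        price of a single measurement buys the identity of (one of) the largest.
        """
        counts = np.asarray(counts, dtype=float)
        noisy = counts + rng.laplace(scale=1.0 / eps, size=counts.shape)
        return int(np.argmax(noisy))

    # Example: which of 1,000 counting queries is (roughly) the largest?
    counts = np.random.default_rng(0).integers(0, 500, size=1000)
    print(report_noisy_max(counts, eps=0.1))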
Quit When You are NOT Ahead. This is the philosophy behind
Propose-Test-Release, in which we test in a privacy-preserving way
that small noise is sufficient for a particular intended computation on
the given data set.
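In pseudocode-like form, the pattern looks as follows. This is a sketch only: distance_to_instability is a caller-supplied placeholder, and the threshold and noise scales should be taken from the earlier treatment of Propose-Test-Release rather than trusted here.

    import math
    import numpy as np

    def propose_test_release(data, f, bound, distance_to_instability,
                             eps, delta, rng=np.random.default_rng()):
        """Sketch of the Propose-Test-Release pattern.

        distance_to_instability(data, bound) is assumed to return how many rows
        of `data` must change before the local sensitivity of f exceeds `bound`;
        it is a placeholder supplied by the caller.
        """
        # Test, privately, that we are well inside the region where the
        # proposed bound on local sensitivity holds.
        gamma = distance_to_instability(data, bound)
        noisy_gamma = gamma + rng.laplace(scale=1.0 / eps)
        if noisy_gamma <= math.log(1.0 / delta) / eps:
            return None   # quit: we are not (verifiably) ahead
        # Release: small noise, scaled to the proposed bound rather than to
        # the worst-case global sensitivity.
        return f(data) + rng.laplace(scale=bound / eps)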
Algorithms with Data-Dependent Accuracy Bounds. This can be
viewed as a generalization of Quit When You are Not Ahead. Algo-
rithms with data-dependent accuracy bounds can deliver excellent
results on “good” data sets, as in Propose-Test-Release, and the accu-
racy can degrade gradually as the “goodness” decreases, an improve-
ment over Propose-Test-Release.
Exploit “Nice” Query Sets. When (potentially large) sets of linear
queries are presented as a batch it is possible, by analyzing the geometry
of the query matrix, to obtain higher quality answers than would be
obtained were the queries answered independently.¹

¹ More accurately, the analysis is of the object K = AB_1^k, where A is the
query matrix and B_1^k is the k-dimensional L_1 ball; note that K is the
feasible region in answer space when the database has one element.
Further Relaxation of Differential Privacy. We have seen that (ε, δ)-
differential privacy is a meaningful relaxation of differential privacy
that can provide substantially improved accuracy bounds. Moreover,
such a relaxation can be essential to these improvements. For example,
Propose-Test-Release algorithms can only offer (ε, δ)-differential pri-
vacy for δ > 0. What about other, but still meaningful, relaxations of
differential privacy? Concentrated Differential Privacy is such a relax-
ation that is incomparable to (ε, δ)-differential privacy and that per-
mits better accuracy. Roughly speaking, it ensures that large privacy
loss happens with very small probability; for example, for all k the
probability of privacy loss ≥ kε falls exponentially in k². In contrast,
(ε, δ)-differential privacy is consistent with having infinite privacy loss
with probability δ; on the other hand, a privacy loss of 2ε can happen in
concentrated differential privacy with constant probability, while in
(ε, δ)-differential privacy it will only occur with probability bounded
by δ, which we typically take to be cryptographically small.
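In symbols, and only as a paraphrase of the informal comparison above (not the formal definitions): writing L for the privacy loss random variable,
\[
\text{concentrated DP (roughly):}\quad \Pr[\,L \ge k\varepsilon\,] \le e^{-\Omega(k^2)} \ \text{for all } k,
\qquad
(\varepsilon,\delta)\text{-DP (roughly):}\quad \Pr[\,L > \varepsilon\,] \le \delta,
\]
so the former has subgaussian-style tails around a small loss, while the latter tolerates an unbounded loss on an event of probability δ.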
Why might we feel comfortable with this relaxation? The answer
lies in behavior under composition. As an individual’s data participate
in many databases and many different computations, perhaps the real
worry is the combined threat of multiple exposures. This is captured by
privacy under composition. Concentrated differential privacy permits
better accuracy while yielding the same behavior under composition as
(ε, δ) (and (ε, 0)) differential privacy.
Differential privacy also faces a number of cultural challenges. One
of the most significant is non-algorithmic thinking. Differential privacy
is a property of an algorithm. However, many people who work with
data describe their interactions with the data in fundamentally non-
algorithmic terms, such as, “First, I look at the data.” Similarly, data
cleaning is often described in non-algorithmic terms. If data are rea-
sonably plentiful, and the analysts are energetic, then the “Raw Data”
application of the Subsample and Aggregate methodology described in
Example 7.3 suggests a path toward enabling non-algorithmic inter-
actions by trusted analysts who will follow directions. In general, it
seems plausible that on high-dimensional and on internet-scale data
sets non-algorithmic interactions will be the exception.
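For reference, the shape of that methodology in code (a minimal sketch, not the text's own pseudocode: the clipped noisy mean used as the aggregator here is chosen for brevity, and the clipping range [lower, upper] is an assumption the analyst must supply):

    import numpy as np

    def subsample_and_aggregate(rows, analyst, num_blocks, eps, lower, upper,
                                rng=np.random.default_rng()):
        """Run an arbitrary (even interactive, 'look at the data') analyst on
        disjoint blocks, then combine the per-block answers privately."""
        rows = list(rows)
        rng.shuffle(rows)
        blocks = np.array_split(np.array(rows, dtype=object), num_blocks)
        answers = np.array([float(analyst(list(b))) for b in blocks])
        # Each row influences exactly one block, hence one answer; clipping
        # caps the effect of that answer on the mean at (upper - lower) / num_blocks.
        clipped = np.clip(answers, lower, upper)
        sensitivity = (upper - lower) / num_blocks
        return clipped.mean() + rng.laplace(scale=sensitivity / eps)

    # e.g. a "raw data" analyst who simply reports a median:
    data = np.random.default_rng(0).normal(50, 10, size=10_000)
    print(subsample_and_aggregate(data, np.median, num_blocks=100,
                                  eps=0.5, lower=0, upper=100))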
What about ε? In Example 3.7 we applied Theorem 3.20 to con-
clude that to bound the cumulative lifetime privacy loss at ε = 1 with
probability 1 − e^{−32}, over participation in 10,000 databases, it is suf-
ficient that each database be (1/801, 0)-differentially private. While
k = 10,000 may be an overestimate, the dependence on k is fairly
weak (√k), and in the worst case these bounds are tight, ruling out a
more relaxed bound than ε_0 = 1/801 for each database over the lifetime
of the database. This is simply too strict a requirement in practice.
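To see where a number like 1/801 comes from (using the advanced composition bound in the form I recall from Theorem 3.20; the exact bookkeeping of Example 3.7 should be checked against the text): k-fold composition of (ε_0, 0)-differentially private mechanisms is (ε', δ')-differentially private for
\[
\varepsilon' = \sqrt{2k\ln(1/\delta')}\,\varepsilon_0 + k\varepsilon_0\,(e^{\varepsilon_0}-1),
\]
and with k = 10,000 and δ' = e^{−32} the dominant first term is \(\sqrt{2\cdot 10^4\cdot 32}\,\varepsilon_0 = 800\,\varepsilon_0\), which already forces ε_0 to be on the order of 1/800 if the lifetime loss ε' is to stay near 1.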
Perhaps we can ask a different question: Fix ε, say, ε = 1 or
ε = 1/10; now ask: How can multiple ε’s be apportioned? Permitting
ε privacy loss per query is too weak, and ε loss over the lifetime of the
database is too strong. Something in between, say, ε per study or ε
per researcher, may make sense, although this raises the questions of
who is a “researcher” and what constitutes a “study.” This affords sub-
stantially more protection against accidental and intentional privacy
compromise than do current practices, from enclaves to confidentiality
contracts.
A different proposal is less prescriptive. This proposal draws from
second-generation regulatory approaches to reducing environmental
degradation, in particular pollution release registries such as the Toxic
Release Inventory that have been found to encourage better practices
through transparency. Perhaps a similar effect could arise with private
data analysis: an Epsilon Registry describing data uses, granularity of
privacy protection, a “burn rate” of privacy loss per unit time, and a
cap on total privacy loss permitted before data are retired, when accom-
panied with a financial penalty for infinite (or very large) loss, can lead
to innovation and competition, deploying the talents and resources of
a larger set of researchers and privacy professionals in the search for
differentially private algorithms.
13.2 The differential privacy lens
An online etymological dictionary describes the original 18th century
meaning of the word “statistics” as “science dealing with data about
the condition of a state or community.” This resonates with
differential privacy in the breach: if the presence or absence of the data
of a small number of individuals changes the outcome of an analysis
then in some sense the outcome is “about” these few individuals, and
is not describing the condition of the community as a whole. Put differ-
ently, stability to small perturbations in the data is both the hallmark
of differential privacy and the essence of a common conception of the
term “statistical.” Differential privacy is enabled by stability (Section 7)
and ensures stability (by definition). In some sense it forces all queries
to be statistical in nature. As stability is also increasingly understood to
be a key necessary and sufficient condition for learnability, we observe a
tantalizing moral equivalence between learnability, differential privacy,
and stability.
With this in mind, it is not surprising that differential privacy is also
a means to ends other than privacy, and indeed we saw this with game
theory in Section 10. The power of differential privacy comes from its
amenability to composition. Just as composition allows us to build com-
plex differentially private algorithms from smaller differentially private
building blocks, it provides a programming language for constructing
stable algorithms for complex analytical tasks. Consider, for example,
the problem of eliciting a set of bidder values, and using them to price
a collection of goods that are for sale. Informally, Walrasian equilib-
rium prices are prices such that every individual can simultaneously
purchase their favorite bundle of goods given the prices, while ensuring
that demand exactly equals the supply of each good. It would seem
at first blush, then, that simply computing these prices, and assigning
each person their favorite bundle of goods given the prices would yield
a mechanism in which agents were incentivized to tell the truth about
their valuation function, since how could any agent do better than
receiving their favorite bundle of goods? However, this argument fails
because in a Walrasian equilibrium, agents receive their favorite bundle
of goods given the prices, but the prices are computed as a function
of the reported valuations, so an industrious but dishonest agent could
potentially gain by manipulating the computed prices. However, this
problem is solved (and an approximately truthful mechanism results)
if the equilibrium prices are computed using a differentially private
algorithm, precisely because individual agents have almost no effect
on the distribution of prices computed. Note that this application is
made possible by the use of the tools of differential privacy, but is com-
pletely orthogonal to privacy concerns. More generally, this connection
is more fundamental: computing equilibria of various sorts using algo-
rithms that have the stability property guaranteed by differential pri-
vacy leads to approximately truthful mechanisms implementing these
equilibrium outcomes.
Differential privacy also helps in ensuring generalizability in adap-
tive data analysis. Adaptivity means that the questions asked and
hypotheses tested depend on outcomes of earlier questions. General-
izability means that the outcome of a computation or a test on the
data set is close to the ground truth of the distribution from which the
data are sampled. It is known that the naive paradigm of answering
queries with the exact empirical values on a fixed data set fails to gener-
alize even under a limited amount of adaptive questioning. Remarkably,
answering with differential privacy not only ensures privacy, but with
high probability it ensures generalizability even for exponentially many
adaptively chosen queries. Thus, the deliberate introduction of noise
using the techniques of differential privacy has profound and promising
implications for the validity of traditional scientific inquiry.
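A toy simulation, entirely of my own construction and only in the spirit of the experiments in this literature, makes the contrast visible: the attributes below carry no signal at all, yet adaptively selecting and weighting them by their exact empirical correlations yields a rule that looks far better than chance on the very data used to choose it, while answering the same correlation queries with (here deliberately heavy, uncalibrated) Laplace noise keeps the apparent accuracy close to the truth.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 1000, 5000

    def experiment(noise_scale):
        # Attributes and labels are independent coin flips: there is nothing
        # real to find, so the true accuracy of any rule built here is 50%.
        X = rng.choice([-1.0, 1.0], size=(n, d))
        y = rng.choice([-1.0, 1.0], size=n)

        # Adaptive analyst: ask for every attribute's correlation with the
        # label, then keep the ones whose answers look significant, weighted
        # by the sign of the answer.
        answers = X.T @ y / n + rng.laplace(scale=noise_scale, size=d)
        keep = np.abs(answers) > 2.0 / np.sqrt(n)
        w = np.sign(answers) * keep

        apparent = np.mean(np.sign(X @ w + 0.5) == y)      # same data reused
        X2 = rng.choice([-1.0, 1.0], size=(n, d))           # fresh data
        y2 = rng.choice([-1.0, 1.0], size=n)
        fresh = np.mean(np.sign(X2 @ w + 0.5) == y2)
        return round(apparent, 2), round(fresh, 2)

    print("exact answers (apparent, fresh):", experiment(noise_scale=0.0))
    print("noisy answers (apparent, fresh):", experiment(noise_scale=0.5))

The gap between apparent and fresh accuracy is the failure to generalize; with enough noise it largely disappears (here at the cost of utility, though in this toy there is no real signal to lose).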