Privacy protection requires released data to satisfy a certain confidentiality standard, for example, differential privacy.
A method includes storing a value in data storage so that a third party is prevented from accessing the value, retrieving the value and applying a first transform to the value to form a transformed value having a uniform distribution. Noise is added to the transformed value to form a sum and a second transform is applied to the sum to form a transformed sum having a uniform distribution. An inverse of the first transform is applied to the transformed sum to form a privatized value and the privatized value is provided to the third party.
In a further embodiment, a computing device having a memory and a processor executes instructions to perform steps including storing a value in data storage so that a third party is prevented from accessing the value, retrieving the value and applying a first transform to the value to form a transformed value having a uniform distribution. Noise is added to the transformed value to form a sum and a second transform is applied to the sum to form a transformed sum having a uniform distribution. An inverse of the first transform is applied to the transformed sum to form a privatized value and the privatized value is provided to the third party.
In a still further embodiment, a method includes storing values in data storage so that a third party is prevented from accessing the values, the values having a distribution. Noise is then used to form a plurality of privatized values from a plurality of the values in the data storage while ensuring that the privatized values have the same distribution as the distribution of the values in the data storage. Access to the privatized values is then provided to the third party.
Privacy protection requires released data to satisfy a certain confidentiality standard, for example, differential privacy. It becomes increasingly crucial as large amounts of data are collected. To meet the standard of differential privacy, researchers have concentrated on the development of privatization mechanisms. Unfortunately, these mechanisms typically sacrifice statistical accuracy for privacy protection while requiring bounded support of the sample distribution to guard against extreme events. Furthermore, they alter the multivariate structures of an original sample after privatization. Preserving the original data's distribution is essential to any downstream statistical analysis while satisfying differential privacy. Towards this goal, we propose a distribution-invariant privatization mechanism, descriptively named DIP, for independently and identically distributed observations of a vector of variables, not only satisfying differential privacy but also permitting any distribution of continuous, discrete, mixed, and categorical variables, with or without bounded support, regardless of the number of variables. Specifically, DIP perturbs a transformed sample while employing a suitable transformation back to the original scale to (approximately) retain the sample distribution. Consequently, any downstream statistical analysis on privatized data can maintain statistical accuracy while being differentially private at any desired level of protection, as if it had used the original sample. Numerically, we demonstrate the utility of the proposed method on simulated and benchmark examples and illustrate its advantages over its competitors.
Data privacy has become increasingly important in the big-data era, where a massive amount of sensitive personal information is collected and stored in digital form. To protect data privacy and promote data sharing, differential privacy has recently received a lot of attention. Differential privacy mathematically quantifies the notion of privacy for downstream statistical analysis of large datasets, particularly for publicly released data such as census and survey data. It is becoming a gold standard of privacy protection, and many technology companies have adopted it to guard privacy against even the most extreme events.
There are two main challenges in differential privacy. First, many existing privatization mechanisms alter the sample distribution when satisfying differential privacy. For example, the Laplace mechanism modifies the data distribution and can convert count data to negative numerical values. As a result, a downstream analysis may draw dramatically different conclusions based on privatized data. From a statistical perspective, multivariate analysis requires retaining dependency structures for privatized data, namely, distribution-invariant privatization. Second, one needs to impose bounded support on the underlying data distribution to satisfy differential privacy. This requirement is stringent because many widely used distributions have unbounded support in practice, and it is challenging to develop privatization mechanisms for distributions with unbounded support.
The embodiments provide a novel distribution-invariant privatization (DIP) mechanism to address the above challenges for essentially all types of univariate and multivariate data involving continuous, discrete, mixed, and categorical variables. First, the privatized data produced by DIP, or the DIP privatized sample, approximately preserves the data distribution while satisfying differential privacy, c.f., Theorem 7. Precisely, the DIP privatized data preserve a known data distribution and approximate an unknown data distribution when it is estimated by the empirical distribution on a hold-out sample. Consequently, any downstream privatized statistical analysis and learning will lead to the same conclusion as if the original data were used, which is a unique aspect that existing mechanisms do not enjoy. Moreover, DIP approximately maintains statistical accuracy even with strict privacy protection. By comparison, existing methods suffer from the trade-off between statistical accuracy and the level of privacy protection. These characteristics enable us to treat multivariate problems effectively, including regression, classification, graphical models, and clustering, among others. Second, DIP is differentially private even if the underlying distributions have unbounded support, due to the proposed nonlinear transformation. Third, DIP's statistical accuracy depends on the accuracy of estimating the unknown data distribution by the empirical distribution. To the best of our knowledge, this is the first attempt to tackle all of the above issues.
Methodologically, DIP first transforms a univariate variable into a variable with bounded support. Data perturbation then adds Laplace noise to the transformed variable to ensure differential privacy. Finally, DIP maps the perturbed variable back to the original scale using an inverse distribution function to guarantee an invariant distribution. In multivariate situations, DIP first factorizes the multivariate distribution into a product of conditional and marginal distributions by the probability chain rule, and then applies the univariate treatment sequentially to preserve differential privacy and an invariant distribution. This construction enables us to accommodate correlated data and dependency structures in, for example, privatized regression and graphical models. In practice, when the probability distribution is unknown, we estimate it by a good estimator, for example, the empirical distribution. Finally, we propose a high-dimensional continualization method to treat continuous data empirically.
Theoretically, we prove that DIP satisfies ε-differential privacy (Definition 1). Numerically, we illustrate DIP with four simulations, including estimation of distributions and parameters, linear regression, and Gaussian graphical models, as well as two real-world applications. In numerical examples, DIP compares favorably with strong competitors, including the Laplace mechanism, the exponential mechanism, and the minimax optimal procedure, in terms of the statistical accuracy of downstream analysis while maintaining strict privacy protection.
Section 2 introduces the notion of differential privacy and the general DIP method. Section 3 considers DIP for univariate continuous variables, whereas Section 4 focuses on univariate discrete variables with a discussion of mixed variables. Section 5 generalizes the univariate variables to the multivariate case. Section 6 is devoted to simulation studies of the operating characteristics of DIP and compares it with other strong competitors in the literature, followed by two real-world applications in Section 7.
Differential privacy concerns a publicly shared dataset: it describes the patterns of any subset of the data while withholding information about the remaining individuals. Subsequently, we focus on a one-time privatization release problem.
Consider a random sample (Z1, . . . , ZN)~F, where F is a cumulative distribution function (cdf) with support 𝒵. Here 𝒵 can be bounded or unbounded, for example, 𝒵=ℕ∪{0} for a Poisson distribution, 𝒵=ℝp for a p-dimensional normal distribution, and 𝒵={0, 1}p for dummy variables of a categorical variable with (p+1) possible values. For this sample, we randomly partition the original sample into a privatization sample (Z1, . . . , Zn) and a hold-out sample (Z*1, . . . , Z*m), with m=N−n. The privatization sample (Z1, . . . , Zn) will be privatized through a privatized mechanism m(·): 𝒵→𝒵, which generates a new sample ({tilde over (Z)}1, . . . , {tilde over (Z)}n) to release. For notational simplicity, we also write {tilde over (Z)}i=m(Zi); i=1, . . . , n. The hold-out sample is not privatized or released, and is reserved for estimating the probability distribution F if F is unknown. When F is known, we do not need a hold-out sample, and have n=N and m=0.
The primary objective of differential privacy is to guard against disclosing information about the data through the alteration of any one observation of the data to be released. Suppose z and z′ are two adjacent realizations of Z, which differ in only one observation. Then ε-differential privacy is defined as follows:
Definition 1 A privatized mechanism m(·) satisfies ε-differential privacy if

P(m(z)∈B)/P(m(z′)∈B)≤eε for any two adjacent realizations z and z′,

where B⊂𝒵n is a measurable set and ε≥0 is a privacy factor or leakage that is usually small. For convenience, the ratio is defined as 1 when the numerator and denominator are both 0.
Definition 1 requires that the ratio of conditional probabilities of any privatization event given two adjacent data realizations is upper bounded by eε. This definition is stringent because the proximity of two distributions in a certain statistical distance may not meet the criterion in Definition 1, as rare events may drastically increase the probability ratio. It differs from a conventional privacy definition in cryptography.
To make Definition 1 more meaningful in privacy protection, we describe a context of multiple privatized data releases.
Lemma 2 Suppose {tilde over (Z)}(1), . . . , {tilde over (Z)}(M) are M independent and ε-differentially private copies of Z generated from m(·) as a result of multiple data releases. Then any test based on {tilde over (Z)}(1), . . . , {tilde over (Z)}(M) with significance level γ>0 of H0: Zi=μ0 against an alternative value of Zi has power no larger than γeMε.
Lemma 2 says that ε-differential privacy protects against data alteration. In particular, it is impossible to reject a null hypothesis H0 that an observation equals a specific value μ0 in any sample because of the small power γeMε for sufficiently small ε, especially so in a one-time data-release scenario with M=1. However, information leakage could occur when M increases while holding ε fixed, in that H0 is eventually rejected as a result of increased power. Throughout this article, therefore, we focus on the case of M=1.
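For instance, in a one-time release (M=1) with ε=0.1 and significance level γ=0.05, any such test has power at most γeMε=0.05×e0.1≈0.055, barely above the significance level itself, so the alteration of a single observation is essentially undetectable.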
This subsection presents the main ideas of DIP and its general form while deferring technical details to Sections 3-5.
Lemma 3 shows that any direct perturbation of Zi (up to a linear transformation) cannot ensure ε-differential privacy for Zi with unbounded support.
Lemma 3 Assume that a privatization mechanism m(·) satisfies m(Zi)=β0+β1Zi+ei for Zi∈𝒵⊂ℝp, where β0∈ℝp and β1∈ℝp×p are fixed coefficients with |β1|≠0, and ei∈ℝp is a random noise vector, i=1, . . . , n. If Zi has unbounded support, then m(·) is not ε-differentially private.
This lemma motivates our general treatment of univariate privatization, consisting of three steps. First, we apply an integral transformation to a continuous variable, or a discrete variable after continualization, to yield a uniform distribution. Second, we add a Laplace perturbation to encode the data for differential privacy. Third, we transform the perturbed data by applying an integral transformation, followed by the inverse cdf transformation to retain the data distribution.
DIP's privatized mechanism m(·) for univariate Zi is written as:

{tilde over (Z)}i=m(Zi)=H(C(Vi)+ei), (1)

where ei~Laplace(0, 1/ε) is random noise, H(·) is a function ensuring that the distribution of {tilde over (Z)}i follows the target distribution for any privacy factor ε>0, Vi is a continualized version of Zi with C(Vi)=FV(Vi), and FV is the cdf of Vi. Here ei follows a Laplace distribution and C(Vi) is bounded, and hence C(Vi)+ei is differentially private. Note that H(·) is not a function of Z. Subsequently, we provide specific forms of Vi, C, and H.
In general, we develop our privatized mechanism for the multivariate case Zi=(Zi1, . . . , Ziq)T; i=1, . . . , n, where differential privacy is guaranteed regardless of the dimension q. In particular, we apply (1) sequentially,

{tilde over (Z)}i1=m1(Zi1), {tilde over (Z)}il=ml(Zil|{tilde over (Z)}i1, . . . , {tilde over (Z)}i,l−1); l=2, . . . , q, (2)

according to the probability chain rule. It first privatizes Zi1 via (1) to yield its privatized {tilde over (Z)}i1, then privatizes Zi2 given {tilde over (Z)}i1, and so forth, where ml(·) denotes the privatization process at sequential step l. Note that the order of privatization with respect to the variables of interest does not matter. We defer details to Section 3.3.
One important aspect of (2) is that {tilde over (Z)}i preserves the joint distribution of Zi; i=1, . . . , n, and thus any dependency structure of (Zi1, . . . , Ziq)T therein. As a result, any downstream statistical analysis remains the same for privatized data as if the original data were used. In other words, there is no loss of statistical accuracy for any method based on privatized data.
In practice, F may be unknown and is estimated by {circumflex over (F)}. However, {circumflex over (F)} needs to be independent of Z to satisfy differential privacy on Z. Therefore, we construct {circumflex over (F)} based on a random hold-out subsample of size m while treating the remaining sample Z of size n as the sample to be privately released. Then the results in (1) and (2) continue to be valid provided that {circumflex over (F)} is a good estimate of F. Also, there is no loss of statistical accuracy asymptotically in the release of {tilde over (Z)}, as long as {circumflex over (F)} is consistent for F as m→∞, c.f., Theorems 8 and 10. However, a small loss may be incurred in finite samples, depending on the precision of {circumflex over (F)} as an estimate of F for a given m. Whereas a large m provides a refined estimate of F, a large n guarantees a large privatization sample for downstream statistical analysis. See Section 5 for simulations. In general, m needs to be suitably chosen to strike a balance between statistical accuracy and the privatization sample size n when holding n+m=N fixed.
This section assumes that the true distribution F is known, and hence that all data are privatized and released with n=N and m=0.
This subsection begins with a simple situation, where Z1, . . . , Zn are univariate continuous variables. By the probability integral transformation, we have F(Zi)~Unif(0, 1). First we perturb F(Zi) by adding an independent noise ei for privatization; i=1, . . . , n. Here ei~Laplace(0, b) is an independent noise with pdf f(x)=(2b)−1exp(−|x|/b), and b is a privatization parameter to be chosen for ε-differential privacy. Then we consider G(F(Zi)+ei)~Unif(0, 1), where G is the cdf of F(Zi)+ei, whose expression is given in the supplementary materials. Finally, we apply the inverse function F−1 to G(F(Zi)+ei). Then (1) for a continuous variable becomes

{tilde over (Z)}i=m(Zi)=F−1∘G(F(Zi)+ei), (3)

where ∘ denotes function composition.
Theorem 4 (Continuous case: distribution preservation and differential privacy) Let Z1, . . . , Zn be an i.i.d. sample from a continuous probability distribution F. Then {tilde over (Z)}1, . . . , {tilde over (Z)}n~F and DIP (3) is ε-differentially private when b≥1/ε.
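To make (3) concrete, the following minimal Python sketch implements the continuous mechanism for a known F. The closed form of G below is our own derivation under the paper's construction (the paper defers G's exact expression to its supplementary materials), and all function names are illustrative:

```python
import numpy as np
from scipy.stats import norm

def laplace_uniform_cdf(s, b):
    """cdf G of U + e, with U ~ Unif(0, 1) and e ~ Laplace(0, b) independent.

    G(s) = A(s) - A(s - 1), where A is an antiderivative of the Laplace(0, b)
    cdf; this closed form is our derivation, not the paper's stated formula."""
    def A(t):
        t = np.asarray(t, dtype=float)
        neg = 0.5 * b * np.exp(np.minimum(t, 0.0) / b)       # piece for t < 0
        pos = t + 0.5 * b * np.exp(-np.maximum(t, 0.0) / b)  # piece for t >= 0
        return np.where(t < 0.0, neg, pos)
    return A(s) - A(s - 1.0)

def dip_continuous(z, cdf, quantile, eps, rng=None):
    """DIP (3) for a known continuous F: Z~ = F^{-1}(G(F(Z) + e)),
    with e ~ Laplace(0, b) and b = 1/eps, per Theorem 4."""
    rng = np.random.default_rng() if rng is None else rng
    b = 1.0 / eps
    u = cdf(np.asarray(z, dtype=float))        # probability integral transform
    e = rng.laplace(0.0, b, size=u.shape)      # privatizing noise
    w = laplace_uniform_cdf(u + e, b)          # back to Unif(0, 1)
    return quantile(w)                         # back to the original scale

# Example: privatize a standard normal sample at eps = 1.
z_tilde = dip_continuous(np.random.randn(1000), norm.cdf, norm.ppf, eps=1.0)
```

Here distribution preservation is exact: F(Zi)~Unif(0, 1) and G is the continuous cdf of F(Zi)+ei, so G(F(Zi)+ei)~Unif(0, 1) and the final quantile step returns a draw from F regardless of ε.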
This subsection generalizes the result from continuous to discrete variables. We first discuss discrete numerical variables Z1, . . . , Zn~F, such as binomial, Poisson, geometric, and negative binomial. Now, our strategy of a continuous H in (3) requires modifications to accommodate jumps of F. Towards this end, we continualize F by convolving Zi with a continuous variable so that (3) is applicable, and then deconvolute to reconstruct a discrete variable. Specifically, we continualize the step function of a discrete cumulative distribution over its jump points by subtracting a uniform random variable Ui from Zi.
To scrutinize the convolution-deconvolution process, consider an example in which Z is a Bernoulli random variable with P (Z=1)=P1>0. When U˜Unif (0, 1) is independent of Z, V=Z−U spreads out point masses at 0 and 1 uniformly over intervals (−1, 0] and (0, 1], respectively. Without loss of generality, we choose left-open and right-closed intervals to avoid overlap. As a result, the cumulative distribution function of V becomes continuous and piecewise linear over (−1, 1), which also agrees with F at the support of Z in a sense that P(Z≤k)=P(V≤k); k=0, 1. Moreover, this mixed variable V can transform back to follow the original distribution of Z by applying a ceiling function.
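To illustrate with numbers, take P1=0.3. Then FV(v)=0.7(v+1) for v∈(−1, 0] and FV(v)=0.7+0.3v for v∈(0, 1], so that FV(0)=P(Z=0)=0.7 and FV(1)=P(Z≤1)=1, and applying the ceiling function to V recovers the original Bernoulli distribution exactly.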
In general, let Zi's support be {a1, . . . , as}⊂ℝ, where a1< . . . <as can be unequally spaced and s=∞ is permitted. Given Zi=ak, Vi follows a uniform distribution on (ak−1, ak], and the Vi's are independent; i=1, . . . , n, where, without loss of generality, a0≡a1−1.
Lemma 5 The cumulative distribution function FV of Vi is written as

FV(v)=F(ak−1)+P(Zi=ak)·(v−ak−1)/(ak−ak−1), v∈(ak−1, ak]; k=1, . . . , s,

where F(a0)≡0. Then FV(v) is Lipschitz-continuous in v and invertible, and P(Zi≤ak)=P(Vi≤ak); k≥0.

By Lemma 5, (1) for a discrete variable becomes

{tilde over (Z)}i=H(FV(Vi)+ei), H=L∘FV−1∘G, (4)

where ei~Laplace(0, b), G is the cdf of FV(Vi)+ei, and L is the ceiling function mapping v∈(ak−1, ak] to ak.
Theorem 6 (Discrete case: distribution preservation and differential privacy) Let Z1, . . . , Zn be an i.i.d. sample from a discrete probability distribution F. Then {tilde over (Z)}1, . . . , {tilde over (Z)}n~F and (4) satisfies ε-differential privacy when b≥1/ε.
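A companion sketch of the discrete mechanism (4), under the same illustrative assumptions and reusing laplace_uniform_cdf from the continuous sketch above; FV is the piecewise-linear cdf of Lemma 5, and the last step is the ceiling map back to the support:

```python
import numpy as np
from scipy.stats import binom

def dip_discrete(z, support, pmf, eps, rng=None):
    """DIP (4) for a known discrete F: continualize Z into V (Lemma 5),
    privatize F_V(V) with Laplace noise, invert F_V, then apply the ceiling
    map back to the support. Reuses laplace_uniform_cdf from the sketch above."""
    rng = np.random.default_rng() if rng is None else rng
    b = 1.0 / eps
    a = np.asarray(support, dtype=float)            # a_1 < ... < a_s
    edges = np.concatenate(([a[0] - 1.0], a))       # prepend a_0 = a_1 - 1
    cum = np.concatenate(([0.0], np.cumsum(pmf)))   # F_V at the edges
    k = np.searchsorted(a, z)                       # Z_i = a_k
    # V_i | Z_i = a_k ~ Unif(a_{k-1}, a_k]
    v = edges[k + 1] - rng.uniform(size=np.size(z)) * (edges[k + 1] - edges[k])
    fv = np.interp(v, edges, cum)                   # piecewise-linear F_V
    w = laplace_uniform_cdf(fv + rng.laplace(0.0, b, size=v.shape), b)
    v_tilde = np.interp(w, cum, edges)              # F_V^{-1}
    k_tilde = np.searchsorted(edges, v_tilde, side="left")
    return a[np.clip(k_tilde - 1, 0, a.size - 1)]   # ceiling to the support

# Example: privatize a Binomial(5, 0.5) sample at eps = 1.
grid = np.arange(6)
z_tilde = dip_discrete(binom.rvs(5, 0.5, size=1000), grid,
                       binom.pmf(grid, 5, 0.5), eps=1.0)
```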
DIP (4) continues to work for most mixed distributions consisting of both continuous and discrete components, including, for example, many mixture distributions with finite components. The high-level idea is the same: convert Zi into a variable Vi with a continuous cdf such that the probability integral transform can still be applied.
Next we work with categorical variables Zi~F, which have finite support {a1, . . . , as}, where, for example, {a1, . . . , a4} represents “red”, “yellow”, “blue”, and “green”. In the absence of numerical ordering, the treatment of (4) is no longer applicable. Nevertheless, Zi with s categories can be considered as an (s−1)-dimensional multivariate variable, which is a special case of the multivariate distributions discussed in Section 3.3.
We now expand our univariate result to the multivariate case, which can be applied to regression, classification, and graphical models, among others.
Suppose Zi~F, where F is a multivariate distribution. Our privatization of Z=(Z1, . . . , Zn)T with Zi=(Zi1, . . . , Zip)T proceeds sequentially by the probability chain rule in (2), where ml(·) is one of the univariate DIP mechanisms discussed in Sections 3.1 and 3.2, depending on the type of the conditional distribution of Zil|({tilde over (Z)}i1, . . . , {tilde over (Z)}i,l−1). Thus, given ({tilde over (Z)}i1, . . . , {tilde over (Z)}i,l−1), {tilde over (Z)}il follows the same distribution as Zil, implying that Zi and {tilde over (Z)}i follow the same joint distribution, as indicated in Theorem 7. To guarantee ε-differential privacy for (2), we follow sequential composition and require Zil|({tilde over (Z)}i1, . . . , {tilde over (Z)}i,l−1) to be ε/p-differentially private, that is, eil~Laplace(0, p/ε).
Theorem 7 (Multivariate case: distribution preservation and differential privacy) Let Z1, . . . , Zn be an i.i.d. sample from a p-dimensional distribution F. Then {tilde over (Z)}1, . . . , {tilde over (Z)}n~F and DIP (2) is ε-differentially private provided that b is chosen to satisfy ε/p-differential privacy in each sequential step.
As suggested by Theorem 7, the sequential order of privatization in (2) does not matter. Some natural orders may be preferable from a practical consideration, in, for example, regression of Y on X and chronological order in longitudinal data. To confirm this aspect, we present a linear regression simulation in supplementary materials in the same setting as in Section 5.2 to demonstrate that the sequential order of privatization has little impact on the regression accuracy.
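As an illustration of the sequential mechanism (2) when the conditionals are known, the following hedged sketch privatizes a bivariate normal sample, reusing dip_continuous from the earlier sketch; each step operates at ε/2, matching Theorem 7 with p=2:

```python
import numpy as np
from scipy.stats import norm

def dip_bivariate_normal(z, rho, eps, rng=None):
    """Sequential DIP (2) for Z ~ N(0, [[1, rho], [rho, 1]]): privatize Z1 by
    its N(0, 1) marginal, then Z2 by its conditional N(rho*Z1~, 1 - rho^2).
    Reuses dip_continuous from the continuous sketch; eps/2 per step."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.sqrt(1.0 - rho ** 2)
    z1_t = dip_continuous(z[:, 0], norm.cdf, norm.ppf, eps / 2, rng)
    cdf2 = lambda x: norm.cdf(x, loc=rho * z1_t, scale=s)  # conditional cdf
    q2 = lambda w: norm.ppf(w, loc=rho * z1_t, scale=s)    # conditional quantile
    z2_t = dip_continuous(z[:, 1], cdf2, q2, eps / 2, rng)
    return np.column_stack([z1_t, z2_t])
```

Since {tilde over (Z)}i1 has the same marginal as Zi1 and {tilde over (Z)}i2 is drawn through the conditional cdf given {tilde over (Z)}i1, the pair reproduces the joint distribution, which is the dependency-preservation property exploited later for regression and graphical models.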
DIP extends to most mixed distributions consisting of both continuous and discrete components, such as mixture distributions with a finite number of components.
The main idea of privatization remains the same, that is, to convert Zi into a variable Vi with a continuous cdf so that the probability integral transform can apply. We focus on a scenario in which the cdf F contains a series of jump discontinuities. By Darboux-Froda's theorem, since F is non-decreasing, the set of jump discontinuities is at most countable. Let the set of discontinuities be {ak: k∈ℤ}⊂ℝ, where ak<ak+1 for any k∈ℤ, and a0 denotes the first non-negative jump in that a−1<0≤a0.
Recall that in a univariate discrete distribution, P(ak−1<Zi<ak)=0 and hence the probability mass at ak can be evenly spread across (ak−1, ak]. In contrast, P(ak−1<Zi<ak)>0 for a mixed distribution since Zi contains both continuous and discrete components. To address this issue, we evenly spread the probability mass at each ak by “squeezing in” a unit interval to its left.
Our method proceeds as follows. We define Vi by squeezing in a unit interval immediately to the left of each jump point ak, shifting the continuous part of Zi accordingly, and spreading the probability mass at each ak uniformly over its squeezed-in interval through an independent Ui~Unif(0, 1), so that the cdf of Vi is continuous. After privatization of Vi following (3), we acquire the privatized {tilde over (Z)}i by reversing the shift and applying a ceiling-type map over the squeezed-in intervals to restore the probability mass at each ak. Unlike in (4), P(Zi≤ak)=P(Vi≤ak) may not necessarily hold for all k∈ℤ. However, {tilde over (Z)}i still follows the same distribution as Zi, and the property of ε-differential privacy continues to hold.
When F is unknown, we replace it by a consistent estimate {circumflex over (F)} of F, such as an empirical distribution function (edf), based on an independent hold-out sample Z* of F. In this section, we discuss asymptotic properties of DIP with respect to the size of Z*, namely m, going to infinity. Note that each Zi, i=1, . . . , n, is privatized independently, and differential privacy holds for any n.
Suppose that {circumflex over (F)} is an empirical distribution function based on a hold-out sample Z*=z*. Note that the probability integral transformation in (3) may not be directly applicable since {circumflex over (F)} is not continuous. We construct a continuous cdf Ĉ that is a continualized version of {circumflex over (F)} in order to apply (3).
Construction of Ĉ given {circumflex over (F)}. When F is continuous, we assume that z*(1)< . . . <z*(m) are the order statistics of Z*, where “<” holds almost surely. We apply the same construction as in Lemma 5 to yield

Ĉ(v)=(k−1)/m+(v−z*(k−1))/(m(z*(k)−z*(k−1))), v∈(z*(k−1), z*(k)]; k=1, . . . , m, (5)

where, without loss of generality, z*(0)≡z*(1)−1. For notational consistency, we let V*=Z*. When F is discrete, we create a continuous variable V*j by continualizing each Z*j, j=1, . . . , m. Specifically, let V*j given Z*j=ak follow a uniform distribution on (ak−1, ak] as described in Section 3.2. Then V*j is continuous and, by Lemma 5, V*j~FV. Similarly, let v*(1)< . . . <v*(m) be the order statistics of V*=(V*1, . . . , V*m)T. Then (5) applies to V*. A combination of the two cases yields a general expression:

Ĉ(v)=(k−1)/m+(v−dk−1)/(m(dk−dk−1)), v∈(dk−1, dk]; k=1, . . . , m, (6)

where dk=z*(k) if F is continuous and dk=v*(k) if F is discrete, k=1, . . . , m, and d0≡d1−1. When F is a mixed distribution, (6) is similarly constructed.
For Z, we generate V similarly as V* above. Then the mechanism in (1) continues to be applicable through replacing C by Ĉ, that is,

{tilde over (Z)}i=L∘Ĉ−1∘Ĝ(Ĉ(Vi)+ei), (7)

for i=1, . . . , n, where Ĝ is the cdf of Ĉ(Vi)+ei and L takes the corresponding form as defined in Section 3, depending on the type of F. Let {tilde over (F)} be the distribution of {tilde over (Z)}i. Since the Zi's are i.i.d., we have {tilde over (Z)}1, . . . , {tilde over (Z)}n~{tilde over (F)}.
Theorem 8 DIP (7) satisfies ε-differential privacy on Z when b≥1/ε. Furthermore, {tilde over (F)} is a consistent estimator of F with respect to the Kolmogorov-Smirnov distance, that is, σ(F, {tilde over (F)})=supz|F(z)−{tilde over (F)}(z)|→0 as m→∞.
Theorem 8 implies that {tilde over (Z)} follows F when m→∞. This preserves the original distribution asymptotically. It is also important to note that Ĉ is built upon the hold-out sample Z* which is independent of Z. This guarantees that Ĉ and Ĉ−1 are not functions of Z, which is necessary in the proof of ε-differential privacy.
In addition to empirical distribution functions, Theorem 8 can be generalized to consistent estimators of C. This is useful especially when additional information is available. For example, suppose we know that Z1, . . . , ZN~N(μ, 1), where μ is unknown. Then we can set Ĉ as the cdf of N({circumflex over (μ)}, 1), with {circumflex over (μ)} the sample mean of the hold-out sample, which is continuous and is a more refined estimator of F than the continualized empirical distribution function.
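Because the continualized empirical cdf (6) is piecewise linear, both Ĉ and Ĉ−1 reduce to linear interpolation over the hold-out order statistics. A minimal sketch, assuming a continuous F with almost surely distinct order statistics:

```python
import numpy as np

def continualized_edf(holdout):
    """Continualized empirical cdf C^ and its inverse, per (5)-(6): piecewise
    linear through the points (d_k, k/m), with d_0 = d_1 - 1."""
    d = np.sort(np.asarray(holdout, dtype=float))  # order statistics d_1..d_m
    knots = np.concatenate(([d[0] - 1.0], d))
    probs = np.arange(d.size + 1) / d.size
    C = lambda v: np.interp(v, knots, probs)
    C_inv = lambda w: np.interp(w, probs, knots)
    return C, C_inv

# Sketch of (7): estimate C on the hold-out sample, then privatize the
# release sample with the continuous mechanism from the earlier sketch.
# C_hat, C_hat_inv = continualized_edf(z_holdout)
# z_tilde = dip_continuous(z_release, C_hat, C_hat_inv, eps=1.0)
```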
A detailed algorithm for the privatization of univariate data with an unknown distribution is summarized in Algorithm 1. We further demonstrate the great scalability of the univariate DIP in Proposition 9. Notice that, although we illustrate privatization for each i in Algorithm 1, most operations are with respect to the vector Z in practice.
Similarly, (2) can be generalized to multivariate empirical distributions when F is unknown. For Z*=z*, its empirical distribution function is

{circumflex over (F)}(z)=m−1Σj=1m 1(−∞, z](z*j),

where 1A(t) is an indicator function, with 1A(t)=1 if t∈A and 1A(t)=0 otherwise.
Note that (2) cannot be easily obtained for empirical distributions conditional on continuous variables. To address this issue, we propose a multivariate continualization method, which generalizes the univariate continualization method described in Section 4.1. The basic idea is to split the support of {circumflex over (F)} into a grid of small p-dimensional cubes, and evenly spread the probability mass of each observation within the cube that is on the immediate bottom-left of the observation. Then for any v in the support of {circumflex over (F)}, we can find the cube it belongs to, and the corresponding Ĉ(v1) and Ĉ(vl|v1, . . . , vl−1), l=2, . . . , p, can be obtained accordingly. On this ground, the sequential privatization mechanism in (2) can be conducted.
We use a two-dimensional case as an example. Suppose F is continuous and we have m=5 observations, namely z*1=(x1, y3), z*2=(x2, y1), z*3=(x3, y5), z*4=(x4, y2), and z*5=(x5, y4), based on which the empirical distribution {circumflex over (F)} can be built. For example, {circumflex over (F)}(z*1)={circumflex over (F)}(z*2)=0.2, and {circumflex over (F)}(x5, y5)=1. The continualized empirical distribution Ĉ is created by spreading the probability mass associated with each observation over a rectangle. We can see that Ĉ is continuous and agrees with {circumflex over (F)} on z*1, . . . , z*5. Then for an arbitrary x∈(0, x5], the empirical conditional distribution Ĉ(z2|z1=x) can be calculated. For example, for x∈(x2, x3], the mass of z*3=(x3, y5) is spread over the rectangle (x2, x3]×(y4, y5], and we have Ĉ(z2|z1=x)=0 for z2≤y4, Ĉ(z2|z1=x)=(z2−y4)/(y5−y4) for z2∈(y4, y5], and Ĉ(z2|z1=x)=1 for z2>y5. And for z1∉(0, x5], we let Ĉ(z2|z1)=0.
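The following numeric sketch mirrors this two-dimensional example with hypothetical coordinate values (only the ranks match the example above; the function name is ours):

```python
import numpy as np

# Hypothetical coordinates; only the y-ranks match the example above.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # x_1 < ... < x_5
y = np.array([0.2, 0.5, 1.1, 1.7, 2.3])   # y_1 < ... < y_5
r = np.array([3, 1, 5, 2, 4])             # z*_j = (x_j, y_{r_j})

def cond_cdf(z2, z1):
    """C^(z2 | z1): locate the cube (x_{j-1}, x_j] containing z1 (x_0 = x_1 - 1);
    the mass of z*_j is uniform over (y_{q-1}, y_q], q the y-rank of z*_j."""
    if z1 <= x[0] - 1.0 or z1 > x[-1]:
        return 0.0                              # no covering cube
    j = np.searchsorted(x, z1, side="left")     # x_{j-1} < z1 <= x_j
    q = r[j] - 1                                # 0-based index of y_q
    lo = y[q - 1] if q > 0 else y[0] - 1.0      # y_0 = y_1 - 1
    return float(np.clip((z2 - lo) / (y[q] - lo), 0.0, 1.0))

# For z1 in (x_2, x_3], the mass of z*_3 = (x_3, y_5) applies, so cond_cdf
# rises linearly from 0 at y_4 to 1 at y_5.
```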
Formally, for each l=1, . . . , p, let Z*l denote the lth variable of Z*. If Z*l is continuous, we let V*l=Z*l, which has order statistics v*(1)l< . . . <v*(m)l almost surely. In the above example, v*(j)1=xj and v*(j)2=yj; j=1, . . . , 5. If Z*l is discrete with support {a1l, . . . , asll}, we continualize each Z*jl into V*jl as described in Section 3.2, yielding order statistics v*(1)l< . . . <v*(m)l almost surely. In either case, write the order statistics of V*l as d1l< . . . <dml.
For a given v=(v1, . . . , vp)T, we now introduce the conditional cdf Ĉ(vl|v1, . . . , vl−1) for l=2, . . . , p. Suppose there exists j∈{1, . . . , m} such that (v1, . . . , vl−1)T is within the smallest cube on the immediate bottom-left of (V*j1, . . . , V*j(l−1))T. For each variable, recall that the order statistics are non-identical almost surely; thus in the hold-out sample such j is unique almost surely. Suppose that the corresponding V*jl is the qth order statistic of V*l, that is, V*jl=dql. Then the conditional distribution can be obtained as Ĉ(vl|v1, . . . , vl−1)=(vl−d(q−1)l)/(dql−d(q−1)l) for vl∈(d(q−1)l, dql], with Ĉ(vl|v1, . . . , vl−1)=0 for vl≤d(q−1)l and 1 for vl>dql, where d0l≡d1l−1. And we let Ĉ(vl|v1, . . . , vl−1)=0 if there is no observation in the hold-out sample whose first (l−1) coordinates form the “top-right” corner of the cube in which (v1, . . . , vl−1)T is located. For notational simplicity, we write Ĉ(vl|v1, . . . , vl−1) as Ĉl.
Now we discuss the privatization of the privatization sample Z=z when Ĉ is available. We first convert each Zil into Vil depending on the type of the lth variable, i=1, . . . , n, l=1, . . . , p. Next, we apply the probability chain rule in (2) to Vi with each ml(·) being

{tilde over (V)}il=Ĉl−1∘Ĝl(Ĉl(Vil)+eil), eil~Laplace(0, p/ε), (9)

where Ĝl is the cdf of Ĉl(Vil)+eil, to yield its privatized version {tilde over (V)}i. Then we apply Ll({tilde over (V)}il) for each l to convert {tilde over (V)}i back to {tilde over (Z)}i, where Ll(·) is the ceiling function corresponding to the lth variable of {tilde over (V)}i, l=1, . . . , p.
Let {tilde over (F)} be the distribution of {tilde over (Z)}i. Since the Zi's are i.i.d., we have {tilde over (Z)}1, . . . , {tilde over (Z)}n~{tilde over (F)}.
Theorem 10 DIP (2) using (9) is ε-differentially private provided that b is chosen for satisfying ε/p-differential privacy in each sequential step. Furthermore, {tilde over (F)} is a consistent estimator of F with respect to the Kolmogorov-Smirnov distance as m→∞.
As in the univariate cases, we may also consider any consistent estimators of F when additional information is available. Importantly, the multivariate distribution-invariant property in Theorem 10 is invariant with respect to the sequential order of privatization of each variable. Algorithm 2 summarizes the sequential privatization process for multivariate data. Note that Algorithm 2 is directly applicable to categorical variables coded as dummy variables. Meanwhile, (10) and Algorithm 2 continue to be valid even for the high-dimensional case when p>m. Proposition 11 demonstrates the scalability of the multivariate DIP.
It is worth noting that the privacy of the raw data in the hold-out sample is also protected. First, any alteration, querying, or release of any raw data in the hold-out sample is not permissible. Second, only a continuous version of the estimated F is constructed, which guarantees that {circumflex over (F)} does not place probability mass on the raw data values; this is achieved through adding noise to the raw data in the hold-out sample. Therefore, the privacy protection of the hold-out sample remains at a high standard. (The definition of differential privacy is not applicable to the hold-out sample because no adjacent realizations are permissible.) Specifically, the probability of identifying a continualized value of the hold-out sample from the privatized data is zero. In other words, there is no privacy leakage for the hold-out sample. Meanwhile, we can also use a public dataset that follows the same distribution F as the hold-out sample, while treating the entire original dataset as the privatization sample. For example, the American Community Survey data are public and the U.S. census data are private, both coming from the same population; the former can serve as a hold-out sample for the latter.
This section performs simulations to investigate the operating characteristics of DIP and compare it with some strong competitors in the literature. These include the non-private mechanism (NP), the Laplace randomized mechanism (LRM), the optimal procedure of minimaxity (OPM), and the exponential mechanism (EXM). Here NP serves as a benchmark, which yields the best statistical accuracy since no privatization is imposed. LRM is a popular method that adds Laplace-distributed noise to a variable of interest. OPM is defined under local differential privacy; although its results may not be optimal when applied to ε-differential privacy, it serves as a benchmark. Finally, EXM is among the most popular privatization mechanisms, especially for discrete variables, and is hence compared in discrete settings.
Section 5.1 illustrates the distribution preservation property of DIP using continuous and discrete variables with bounded or unbounded support. Here the true distribution F is known with no hold-out sample, and hence n=N.
The first simulation study focuses on the distribution preservation aspect of NP, DIP, LRM, and OPM, as measured by the Kolmogorov-Smirnov distance between the true cdf and the empirical distribution of each method. Moreover, we investigate the robustness with respect to a change of privacy factor ε and sample size n=N.
A random sample Z1, . . . , Zn~F of size n=1000 is generated according to Unif(0, 1), Beta(2, 5), N(0, 1), or Exp(1). For each ε, the privatized sample {tilde over (Z)}=({tilde over (Z)}1, . . . , {tilde over (Z)}n)T is obtained from each of LRM, OPM, and DIP, while {tilde over (Z)}=Z for NP. For DIP, we use (3) with the corresponding F. Note that LRM and OPM require a bounded domain of F, which is not met by N(0, 1) and Exp(1). For LRM, we use [−max(|Zi|), max(|Zi|)] and [0, max(Zi)]; i=1, . . . , n, to respectively approximate the unbounded supports of N(0, 1) and Exp(1). Throughout Sections 5 and 6, we conduct all LRM runs similarly. For OPM, we consider a privacy mechanism for the l2-ball and use √n·max(|Zi|); i=1, . . . , n, to approximate the true radius there to deal with the unbounded support issue. In this study, the privacy factor ε is set to 1, 2, 3, 4. The simulation is repeated 1000 times.
As indicated in Table 1, DIP performs the best across all 16 situations. By preserving the true distribution, DIP maintains a Kolmogorov-Smirnov distance as small as if a non-private empirical distribution were used, with a small distributional distance across various values of ε for differential privacy. This is because the privatized data {tilde over (Z)} by DIP retain the same distribution as Z regardless of the value of ε when a known distribution F is used, as implied by Theorem 4. This aligns with our discussion in Section 3 that DIP entails no loss of statistical accuracy when the true distribution F is known. By comparison, the amount of improvement of DIP over LRM ranges from (129.55−26.91)/26.91=381% to (439.61−27.62)/27.62=1492%, while that over OPM is from (507.38−27.42)/27.42=1750% to (510.14−26.99)/26.99=1790%. In contrast, LRM's and OPM's performance deteriorates when a stricter differential privacy policy is imposed by a smaller value of ε.
Next, we investigate the impact of the sample size n on the privatization by NP, DIP, OPM, and LRM. DIP gives the best performance among all competitors and essentially agrees with NP, where the small difference is only due to the sampling error of the added noise. Importantly, DIP enjoys the large-sample property, with its Kolmogorov-Smirnov distance decreasing at the same rate as NP because of its distribution preservation property. Overall, the Kolmogorov-Smirnov distances for OPM and LRM also decrease as n increases, although at a much slower rate than those of DIP and NP. In fact, the amount of improvement of DIP over OPM and LRM is rather large, ranging from 421% to 2638%. This phenomenon is also attributed to the fact that DIP preserves the distribution of the original data whereas OPM and LRM fail to do so. This reinforces our argument that distribution preservation in privatization is critical to statistical analysis.
The second study concentrates on the distribution preservation aspect of discrete distributions. For DIP, we apply the privatization in (4). For LRM, as in the continuous case, Laplace noise is added to the response. For OPM, we consider its private mean estimation function under the assumption of a given k-th moment with k=∞. Here k=∞ is adopted since it provides the most accurate results compared with other values; in an unreported study, settings with k finite and slightly greater than 1, where the random variables are less bounded, are also tested and yield less accurate results. In addition, we also consider EXM: for each element a in the support of a discrete distribution, we choose the quality score function as the number of data points equal to a, and the sensitivity is specified as max(Zi). Throughout Sections 5 and 6, we perform all EXM runs similarly.
Our goal is to compare the accuracy of each method in terms of distributional parameter estimation. We consider Z1, . . . , Zn F, where n=1000, and F(·) is chosen from Bernoulli(0.1), Binomial(5, 0.5), Poisson(3), and Geometric(0.2). We calculate the private mean via each framework and estimate the parameter. Then we compare the estimation error with that provided by the non-private estimator. The simulation is repeated 1000 times.
As suggested by Table 2, the estimation error of DIP and its standard deviation are the smallest across all settings. Moreover, DIP, through implementing the proposed continualization method and ceiling function, continues to preserve discrete distributions in that it is robust against the change of ε and incurs only a finite-sample estimation error, as in the non-private case. This agrees with our theoretical finding in Theorem 6. In contrast, the competing mechanisms yield relatively larger estimation errors, among which the exponential mechanism has performance closest to DIP. The estimation errors of all competing methods are substantially impacted by the value of ε, where the severity of the impact is least for the exponential mechanism.
Notice that DIP is essentially a perturbation method retaining the similarity between each pair of Zi and {tilde over (Z)}i. In other words, {tilde over (Z)}i preserves the information about the subject i. In contrast, EXM creates a private sampling scheme, whose privatized data are generated from the estimated distribution and are not necessarily associated with the original subject ID i.
Parameter estimation with respect to the change of sample size is investigated as well. Similar to the results for continuous distributions, DIP has the best performance and the same convergence rate as the non-private estimator. EXM has the second-best results. All other mechanisms enjoy the large sample property as well, but with performance worse than that of DIP.
This section performs privatized linear regression, Gaussian graphical model, and graphical lasso to examine the performance of DIP. We assume the true distributions are unknown, and split the data into a hold-out sample (15%, 25%, or 35% of N) and a privatization sample (85%, 75%, or 65% of N).
Our next study is designed to examine regression parameter estimation in linear regression for NP, DIP, LRM, and OPM. Let the response variable Y follow Y=Xβ+e, where ei~N(0, 1), X is a design matrix, and β is a vector of coefficients.
Simulations proceed with N=200 or 2000, the privacy factor ε=1, 2, 3, 4, and a p-dimensional true regression parameter vector βp×1=(1, 1, . . . , 1)T with p=6 or 30. Finally, each ⅓ of X's columns follows independent N(0, 100), Poisson(5), and Bernoulli(0.5) distributions, respectively.
For DIP, we consider a sample split with a splitting ratio of 15%, 25%, or 35% of the original sample for the hold-out sample, with the rest serving as the privatization sample. The reported parameter estimation results are based on linear regression using the privatized release data. We privatize each column of the privatization sample Z:=(X, Y) sequentially following Algorithm 2. For LRM and OPM, we follow the configurations described in Section 5.1: we utilize the additional information that the columns of X are independent, privatize them independently to yield {tilde over (X)}, and then privatize Y given {tilde over (X)} through privatizing the corresponding linear regression residuals. Then privatized regression is performed to regress {tilde over (Y)} against {tilde over (X)} to obtain an estimated regression parameter vector {tilde over ({circumflex over (β)})}. For all methods, each variable is privatized under a privacy factor ε/(p+1). The estimation accuracy is measured by the L2 norm of the difference between the estimated and true regression parameter vectors.
As indicated in Table 3, DIP yields the best private estimator of the regression coefficients across all situations, with a significant amount of improvement over LRM and OPM. Compared with NP, a small difference is seen, which is due to the estimation error of the unknown multivariate cdf in privatization. Meanwhile, there is a trade-off between the hold-out size and statistical accuracy: a large hold-out sample yields a more accurate estimated cdf but reduces the privatization sample size. As suggested by an unreported study, DIP's estimation accuracy improves as the hold-out size increases, up to 50% of the entire sample, across all cases; in other words, a reasonable amount of release data is still required to train the desired model. Interestingly, an increased level of the privacy factor ε seems to have a more profound impact on LRM and OPM than on DIP with respect to the estimation error, although their standard errors remain much larger. In contrast, DIP is robust against the change of ε. Overall, the performance of each method improves as the sample size N increases, or as the number of regression parameters p decreases. To assess the effect of the order of sequential privatization in (2), we conduct privatized linear regression using DIP in the same setting but in the reverse privatization order. Specifically, we let Z:=(Y, X.p, . . . , X.1), that is, we first privatize Y, followed by X.p to X.1. The order impact is minimal, and the new results are nearly the same or even slightly better in certain settings. This observation agrees with the preservation of the joint distribution in Theorem 10.
The next example concerns reconstruction of the structure of an undirected graph in a Gaussian graphical model. It illustrates the importance of the distribution-invariant property of privatization in (2) for estimation of pairwise dependence structures. In this case, Zi=(Zi1, . . . , Zip)T~N(0, Σ), where Σ is a known covariance matrix and Ω=Σ−1 is the precision matrix. Note that the (k, l)th element of Ω encodes the partial correlation between Zik and Zil; i=1, . . . , N.
Two types of graph networks are considered, namely, the chain and exponential decay networks. For the chain network, Ω is a tri-diagonal matrix corresponding to the first-order autoregressive structure. Specifically, the (k, l)th element of Σ is σkl=exp(−0.5|τk−τl|), where τ1< . . . <τp and τl−τl−1~Unif(0.5, 1); l=2, . . . , p. For the exponential decay network, the (k, l)th element of Ω is ωkl=exp(−2|k−l|). Clearly, Ω is sparse in the first case but not in the second.
We now perform simulations. Let N=2000, p=5 or 15, and the privacy factor ε be from 1 to 4. Then we generate a random sample according to the aforementioned cases, followed by privatization by DIP, LRM, and OPM. Finally, the estimated precision matrix {circumflex over (Ω)} will be compared with the true precision matrix Ω using the entropy loss with EL(Ω, {circumflex over (Ω)})=tr(Ω−1{circumflex over (Ω)})−log |Ω−1{circumflex over (Ω)}|−p and the quadratic loss QL(Ω, {circumflex over (Ω)})=tr((Ω−1{circumflex over (Ω)}−Ip)2). Note that QL=p for LRM and OPM across almost all settings due to close-to-zero {circumflex over (Ω)}, and hence the corresponding table is not presented.
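For reference, the two network constructions and the two losses above translate directly into code; a sketch with our own function names:

```python
import numpy as np

def chain_precision(p, rng):
    """Chain network: sigma_kl = exp(-0.5 |tau_k - tau_l|), with increments
    tau_l - tau_{l-1} ~ Unif(0.5, 1); Omega = Sigma^{-1} is tri-diagonal."""
    tau = np.cumsum(rng.uniform(0.5, 1.0, size=p))
    return np.linalg.inv(np.exp(-0.5 * np.abs(tau[:, None] - tau[None, :])))

def expdecay_precision(p):
    """Exponential decay network: omega_kl = exp(-2 |k - l|)."""
    idx = np.arange(p)
    return np.exp(-2.0 * np.abs(idx[:, None] - idx[None, :]))

def entropy_loss(omega, omega_hat):
    """EL(Omega, Omega^) = tr(Omega^{-1} Omega^) - log|Omega^{-1} Omega^| - p."""
    m = np.linalg.solve(omega, omega_hat)
    return np.trace(m) - np.linalg.slogdet(m)[1] - omega.shape[0]

def quadratic_loss(omega, omega_hat):
    """QL(Omega, Omega^) = tr((Omega^{-1} Omega^ - I_p)^2)."""
    m = np.linalg.solve(omega, omega_hat) - np.eye(omega.shape[0])
    return np.trace(m @ m)
```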
For DIP, we consider the same setting as in the previous case. Then (2) is applied, assuming an unknown multivariate cdf, and each variable is privatized with a privacy factor ε/p. For LRM and OPM, we follow the configurations as described in the first part of Section 5.1. The entire sample is released, and each variable is privatized independently under ε/p.
As suggested by the top panel of Table 4, DIP has the smallest error in estimating the precision matrix by preserving the underlying dependency structure, for both the chain and the exponential decay networks and across different network sizes. Its improvement over its competitors is about 1000-fold across all situations in terms of the entropy loss. Moreover, it yields a much smaller standard error when compared with LRM and OPM. Compared with NP, DIP's performance is very close to the non-private precision matrix estimation results, indicating that the dependency structure and the underlying multivariate probability distribution are mostly preserved. A small difference is seen, as in the regression case, due to the estimation error in the approximation of the unknown multivariate cdf in privatization. In fact, DIP is less sensitive to the value of ε than LRM and OPM. All these methods perform better when p decreases. In summary, DIP's distribution preservation property becomes more critical in recovering multivariate structures, which explains the large difference between DIP and the existing methods.
We perform an additional simulation for graphical lasso in a high-dimensional situation with the graph size p exceeding the sample size N, particularly p=250 and N=200. In this case, the dependence structure in the precision matrices and all other parameters remain the same as in the Gaussian graphical model. We consider a sample split of the original sample for DIP with a ratio 15%, 25%, 35% for a hold-out sample, leaving n=170, 150, 130 for privatization and release. Using the “glasso” package in R and 5-fold cross-validation, we estimate all model and tuning parameters. The tuning parameter is selected through evaluating the log-likelihood on the validation set.
As seen in the bottom panel of Table 4, DIP is the only privatization method that still allows the estimation of the dependence structure. Compared with the non-private graphical lasso, DIP still has good performance despite some estimation errors due to the approximation of the high-dimensional cdf.
This section analyzes two sensitive yet publicly available datasets, namely the University of California system salary data and a Portuguese bank marketing campaign data, to understand practical implications of distribution-invariant privatization. In particular, privatized mean is estimated for the salary data while privatized logistic regression is conducted for the bank marketing data. In both examples, the true distributions are unknown, and we use empirical distributions for DIP.
Our first study concerns the salary data of the University of California (UC) system, which collect the annual salaries of N=252,540 employees, including faculty, researchers, and staff. For Year 2010, the average salary of all employees is $39,531.49 with a standard deviation of $53,253.93. The data are highly right-skewed, with a 90% quantile of $95,968.12 and a maximum exceeding two million dollars.
For this dataset, we apply each mechanism to estimate the ε-differentially private mean UC salary. One important aspect is to contrast the privatized mean with the original mean $39,531.49 to understand the impact of privatization on statistical accuracy of estimation. Three privatized mechanisms are compared, including DIP, LRM, and OPM. For DIP, we hold out 15%, 25%, or 35% of the sample, and apply Algorithm 1. For OPM, we follow a private mean estimation function and choose the number of moments k=20 and the moment as the one closest to 3 in order to optimize its performance. The above process, including privatization, is repeated 1000 times.
As indicated in Table 5, DIP delivers the most accurate mean salary estimation under differential privacy. The amount of improvement over LRM and OPM is in a range of 405% to 3533%. By comparison, LRM and OPM yield large estimation errors. Note that this performance is attributed primarily to the distribution-invariant property that LRM and OPM do not possess. Moreover, the cost of stricter privacy protection is small for DIP: when ε decreases from 4 to 1, DIP's relative error increases only by 35%, 33%, and 24%, given 15%, 25%, and 35% of the sample held out, respectively. By comparison, those of LRM and OPM increase by 288% and 151%, respectively. This is a result of the impact of the high privatization cost of LRM and OPM on statistical accuracy. In summary, the distribution preservation property is critical to maintaining statistical accuracy in downstream analysis.
Our second study examines marketing campaign data of a Portuguese banking institution.
This marketing campaign intends to sell long-term deposits to potential clients through phone conversations. During a phone call, an agent collects a client's personal and demographic data, past contact histories, and whether the client is interested in subscribing to a term deposit (yes/no). Our goal is to examine the statistical accuracy of logistic regression based on privatized data versus the original data.
The campaign data include a binary response variable indicating whether a client has subscribed to a long-term deposit. A total of 9 explanatory variables are used, including the client's age (numeric), employment status (yes/no, where “no” includes “unemployed”, “retired”, “student”, and “housemaid”), marital status (“single”, “married”, and “divorced”, coded as two dummy variables with “married” being the reference level), education level (“illiterate”, “4 years”, “6 years”, “9 years”, “professional course”, “high school”, and “university”, labeled as 0 to 6, respectively), default status (yes/no), housing loan (yes/no), personal loan (yes/no), client's device type (“mobile” or “landline”), and the total number of contacts regarding this campaign. This leads to a total of N=30,488 complete observations for data analysis, and p=11 variables to be privatized (including the binary response variable and 2 dummy variables for marital status).
For DIP, we hold out 15%, 25%, or 35% of the sample and apply Algorithm 2 without assuming any underlying true distributions. For OPM, we follow the same procedure as described in Section 5.2, and then conduct private logistic regression following a private estimation of generalized linear models. The privatization process is repeated 1000 times.
As shown in Table 6, parameter estimation based on privatized logistic regression by DIP yields a very small Kullback-Leibler divergence of less than 5×10−2. Moreover, its performance is insensitive to the privacy factor ε, implying a low cost of strict privacy protection, which is guaranteed by the distribution-invariant property, c.f., Theorem 10. DIP performs slightly better if more data are used in the hold-out sample for empirical cdf estimation. In contrast, the performance of LRM is at least 5 times worse than that of DIP, and the results provided by OPM are infinite, since an estimated probability of 1 or 0 occurs across all settings.
Differential privacy has become a standard of privacy protection as a massive amount of personal information is collected and stored in digital form. In this paper, we propose a novel privatization method, called DIP, that preserves the distribution of the original data while guaranteeing differential privacy. Consequently, any downstream privatized statistical analysis or learning leads to the same conclusion as if the original data were used, which is a unique aspect that existing mechanisms do not enjoy. Second, DIP is differentially private even if the underlying variables have unbounded support or unknown distributions. Finally, DIP maintains statistical accuracy even under a strict privacy factor, unlike existing methods, which are constrained by the trade-off between statistical accuracy and the level of the privacy factor. Our extensive numerical studies demonstrate the utility and statistical accuracy of the proposed method against strong competitors in the literature. Moreover, DIP can be easily generalized to the case of Zi~Fi independently when repeated measurements of Zi are available, as in longitudinal data analysis.
The proposed methodology also opens up several future directions. One direction is to investigate the privacy protection aspect of DIP against a large number of queries, possibly interactive, that is, whether the original data can be retrieved under an extreme scenario. Another direction is to investigate privatization in the presence of missing observations, as in semi-supervised learning.
Definition 1 protects data owners' privacy after aggregating all data points to a data server and then applying privatization. In many scenarios, however, data owners may not even trust the data server. To further protect data owners' privacy from the server, one may consider data privatization at each owner's end before aggregation. Accordingly, the notion of ε-local differential privacy is introduced.
Definition 12 A privatized mechanism m(·) satisfies ε-local differential privacy if

P(m(Zi)∈Bi|Zi=z)/P(m(Zi)∈Bi|Zi=z′)≤eε for any z, z′∈𝒵,

where Bi⊂𝒵 is a measurable set and ε≥0, for i=1, . . . , n. The ratio is defined as 1 if the numerator and denominator are both 0.
Lemma 13 If {tilde over (Z)}i=m(Zi) satisfies ε-local differential privacy, then ζ({tilde over (Z)}i) is also ε-locally differentially private for any measurable function ζ(·).
Based on Lemma 13, the DIP privatization mechanism in (1) and (2) can be extended. Further investigation of DIP is necessary to understand DIP under local differential privacy.
System 100 includes a secured data storage unit 102 containing secured data 104. Secured data 104 has actual values for variables and may indicate a link between those values and individual entities such as people, companies, universities and the like. Secured data storage unit 102 includes one or more protections against unauthorized access to the values and any links between those values and individual entities stored in data storage unit 102. At a minimum, secure data storage unit 102 requires authentication of people and computer programs that request access to the data in data storage unit 102 before providing access to the data.
Because only a limited number of users are able to access the actual data in secured data storage unit 102, the data is of limited use to others. To make the data of more use to others, a privatization module 106 executed by a processor in a computing device 108 converts values of secured data 104 into privatized data 110. Privatization module 106 then stores privatized data 110 in a data storage unit 112. In some embodiments, data storage units 102 and 112 are a single data storage unit. Privatized data 110 not only obscures the actual values of secured data 104 but also ensures a privacy level for the link between secured data 104 and individual entities. Third parties using a third party device 114 are able to query privatized data 110 through one or more query APIs 116 executed by a processor on computing device 115. In accordance with one embodiment, data storage unit 112 is part of computing device 115.
In step 200 of
At step 202, one of the values in the set of secured values is selected and at step 204 a transform based on the distribution is applied to the value to form a transformed value having a bounded uniform distribution between [0,1]. In one particular embodiment, the value is applied to the cumulative distribution function to produce a probability of the value. (See equation 1 above)
At step 206, noise is applied to the transformed value to form a sum. In accordance with one embodiment, the noise is a Laplacian noise. This obscures the transformed value but also causes the resulting sum to have an unbounded distribution. At step 208, a second transform is applied to the sum to provide a transformed sum that has a bounded uniform distribution between [0,1]. In accordance with one embodiment, the second transform involves applying the sum to a cumulative distribution function of the sum to obtain the probability of the sum. At step 210, the inverse of the first transform applied in step 204 is applied to the transformed sum. This produces a privatized value that has a same distribution as the secured value selected at step 202.
At step 301, the discrete cumulative distribution is continualized to form a continuous cumulative distribution function. In accordance with one embodiment, the discrete cumulative distribution is continualized by the convolution of the discrete variable with a continuous variable. At step 302, one of the discrete values in the set of secured values is selected and at step 304 the discrete value is continualized by subtracting a random variable having a uniform distribution. At step 306, a transform based on the continuous cumulative distribution determined at step 301 is applied to the continualized value to form a transformed value having a bounded uniform distribution between [0,1]. In one particular embodiment, the continualized value is applied to the continuous cumulative distribution function to produce a probability of the value. (See equation 4 above)
At step 308, noise is added to the transformed value to form a sum. In accordance with one embodiment, the noise is Laplacian noise. This obscures the transformed value but also causes the resulting sum to have an unbounded distribution. At step 310, a second transform is applied to the sum to provide a transformed sum that has a bounded uniform distribution on [0,1]. In accordance with one embodiment, the second transform evaluates the cumulative distribution function of the sum at the sum to obtain a probability for the sum. At step 312, the inverse of the first transform applied in step 306 is applied to the transformed sum. This produces a continuous privatized value that has the same distribution as the continualized value determined at step 304. At step 314, a ceiling function is applied to the continuous privatized value to produce a discrete privatized value having the same distribution as the discrete value selected at step 302.
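Continuing the example, steps 301 through 314 can be sketched for a variable on consecutive integer support, in which case the continualized cumulative distribution of step 301 is simply the piecewise-linear interpolation of the discrete cumulative distribution. The sketch reuses `laplace_uniform_sum_cdf` from the sketch above, assumes strictly positive probabilities so that the interpolation is invertible, and again treats the Laplace scale 1/epsilon as an illustrative calibration.

```python
import numpy as np

def dip_discrete(k, support, probs, epsilon, rng=None):
    # Steps 301-314 for a value k on consecutive integer support.
    rng = np.random.default_rng() if rng is None else rng
    b = 1.0 / epsilon
    support = np.asarray(support, dtype=float)
    # Step 301: the CDF of X - V, with V ~ Uniform(0, 1), is piecewise
    # linear between the jump points of the discrete CDF.
    grid = np.concatenate(([support[0] - 1.0], support))
    cum = np.concatenate(([0.0], np.cumsum(probs)))
    y = k - rng.uniform()                   # step 304: continualize the value
    u = np.interp(y, grid, cum)             # step 306: first transform
    s = u + rng.laplace(scale=b)            # step 308: add Laplace noise
    u_priv = laplace_uniform_sum_cdf(s, b)  # step 310: second transform
    y_priv = np.interp(u_priv, cum, grid)   # step 312: invert the first transform
    return int(np.ceil(y_priv))             # step 314: ceiling restores the support
```

For a fair six-sided die, `dip_discrete(3, np.arange(1, 7), np.full(6, 1/6), epsilon=1.0)` returns a privatized roll that is again uniform on {1, ..., 6}.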
At step 402, one of the values in the set of secured values is selected, and at step 404 a respective transform, based on the conditional probability distribution of each dimension, is applied to the value of that dimension to form, for each dimension, a transformed value having a bounded uniform distribution on [0,1].
At step 406, noise is added to the transformed value of each dimension to form a sum for each dimension. In accordance with one embodiment, the noise is Laplacian noise. This obscures the transformed values but also causes the resulting sums to have unbounded distributions. At step 408, a second transform is applied to each sum to provide a transformed sum for each dimension that has a bounded uniform distribution on [0,1]. In accordance with one embodiment, the second transform evaluates the respective cumulative distribution function of each sum at that sum to obtain a probability for the sum. At step 410, the inverse of the respective first transform applied in step 404 is applied to the transformed sum of each dimension. This produces a privatized value for each dimension, resulting in a privatized multivariate value that has the same distribution as the secured value selected at step 402.
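One plausible reading of step 404 is a chained, Rosenblatt-style construction, in which the transform for each dimension is the conditional cumulative distribution of that dimension given the coordinates already privatized. A sketch under that reading follows, reusing `dip_continuous` from the earlier sketch; the interfaces `cond_cdfs[j](v, prev)` and `cond_quantiles[j](u, prev)` are hypothetical.

```python
import numpy as np

def dip_multivariate(x, cond_cdfs, cond_quantiles, epsilon, rng=None):
    # Steps 402-410: privatize dimension j with a transform based on its
    # conditional distribution given the already-privatized coordinates.
    rng = np.random.default_rng() if rng is None else rng
    priv = []
    for j, xj in enumerate(x):
        prev = list(priv)  # snapshot of the earlier privatized coordinates
        cdf = lambda v, j=j, prev=prev: cond_cdfs[j](v, prev)
        qf = lambda u, j=j, prev=prev: cond_quantiles[j](u, prev)
        priv.append(dip_continuous(xj, cdf, qf, epsilon, rng))
    return priv
```

How the overall privacy budget is divided among the dimensions is a separate design choice that this sketch does not address.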
Embodiments of the present invention can be applied in the context of computer systems other than computing device 10. Other appropriate computer systems include handheld devices, multi-processor systems, various consumer electronic devices, mainframe computers, and the like. Those skilled in the art will also appreciate that embodiments can be applied within computer systems wherein tasks are performed by remote processing devices that are linked through a communications network (e.g., communication over the Internet or through web-based software systems). For example, program modules may be located in either local or remote memory storage devices or simultaneously in both local and remote memory storage devices. Similarly, any storage of data associated with embodiments of the present invention may be accomplished utilizing either local or remote storage devices, or simultaneously utilizing both local and remote storage devices.
Computing device 10 further includes an optional hard disc drive 24, an optional external memory device 28, and an optional optical disc drive 30. External memory device 28 can include an external disc drive or solid-state memory that may be attached to computing device 10 through an interface such as Universal Serial Bus interface 34, which is connected to system bus 16. Optical disc drive 30 can illustratively be utilized for reading data from (or writing data to) optical media, such as a CD-ROM disc 32. Hard disc drive 24 and optical disc drive 30 are connected to the system bus 16 by a hard disc drive interface 32 and an optical disc drive interface 36, respectively. The drives and external memory devices and their associated computer-readable media provide nonvolatile storage media for the computing device 10 on which computer-executable instructions and computer-readable data structures may be stored. Other types of media that are readable by a computer may also be used in the exemplary operation environment.
A number of program modules may be stored in the drives and RAM 20, including an operating system 38, one or more application programs 40, other program modules 42 and program data 44. In particular, application programs 40 can include programs for implementing any one of the methods discussed above. Program data 44 may include any data used by the systems and methods discussed above.
Processing unit 12, also referred to as a processor, executes programs in system memory 14 and solid-state memory 25 to perform the methods described above.
Input devices including a keyboard 63 and a mouse 65 are optionally connected through an Input/Output interface 46 that is coupled to system bus 16. Monitor or display 48 is connected to the system bus 16 through a video adapter 50 and provides graphical images to users. Other peripheral output devices (e.g., speakers or printers) could also be included but have not been illustrated. In accordance with some embodiments, monitor 48 comprises a touch screen that both displays images and detects the locations on the screen where the user is contacting the screen.
The computing device 10 may operate in a network environment utilizing connections to one or more remote computers, such as a remote computer 52. The remote computer 52 may be a server, a router, a peer device, or other common network node. Remote computer 52 may include many or all of the features and elements described in relation to computing device 10, although only a memory storage device 54 has been illustrated.
The computing device 10 is connected to the LAN 56 through a network interface 60. The computing device 10 is also connected to WAN 58 and includes a modem 62 for establishing communications over the WAN 58. The modem 62, which may be internal or external, is connected to the system bus 16 via the I/O interface 46.
In a networked environment, program modules depicted relative to the computing device 10, or portions thereof, may be stored in the remote memory storage device 54. For example, application programs may be stored utilizing memory storage device 54. In addition, data associated with an application program may illustratively be stored within memory storage device 54. It will be appreciated that the network connections shown are exemplary, and other means of establishing a communications link between the computers may be used.
Although elements have been shown or described as separate embodiments above, portions of each embodiment may be combined with all or part of other embodiments described above.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms for implementing the claims.
The present application is based on and claims the benefit of U.S. provisional patent application Ser. No. 63/171,828, filed Apr. 7, 2021, the content of which is hereby incorporated by reference in its entirety.
This invention was made with government support under DMS-1712564, DMS-1721216, DMS-1952539 awarded by the National Science Foundation and GM126002 and HL105397 awarded by the National Institutes of Health. The government has certain rights in the invention.
Number | Name | Date | Kind |
---|---|---|---|
7237115 | Thomas | Jun 2007 | B1 |
8275204 | Kovalsky | Sep 2012 | B1 |
8515058 | Gentry | Aug 2013 | B1 |
20030118246 | August | Jun 2003 | A1 |
20030135741 | Nuriyev | Jul 2003 | A1 |
20050021266 | Kouri | Jan 2005 | A1 |
20050234686 | Cheng | Oct 2005 | A1 |
20080294565 | Kongtcheu | Nov 2008 | A1 |
20090010428 | Delgosha | Jan 2009 | A1 |
20090077543 | Siskind | Mar 2009 | A1 |
20150286827 | Fawaz | Oct 2015 | A1 |
20180349605 | Wiebe | Dec 2018 | A1 |
20190065775 | Klucar, Jr. | Feb 2019 | A1 |
20190312854 | Fiske | Oct 2019 | A1 |
20200076604 | Argones Rua | Mar 2020 | A1 |
20200401916 | Rolfe | Dec 2020 | A1 |
20210035002 | Hastings | Feb 2021 | A1 |
20210058241 | Georgieva | Feb 2021 | A1 |
20210133590 | Amroabadi | May 2021 | A1 |
20220067505 | Liu | Mar 2022 | A1 |
20220180234 | Kamthe | Jun 2022 | A1 |
20240095392 | Damewood | Mar 2024 | A1 |
Entry |
---|
Zhu et al., Poisson subsampled Renyi differential privacy, International Conference on Machine Learning, 9 pages, 2019. |
Abadi et al., Deep learning with differential privacy, In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 14 pages, 2016. |
Agarwal et al., cpSGD: communication-efficient and differentially-private distributed SGD, In Advances in Neural Information Processing Systems, 12 pages, 2018. |
Alda et al., The Bernstein mechanism: function release under differential privacy, Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 7 pages, 2017. |
Apple, Learning with privacy at scale, Apple Machine Learning Journal, vol. 1, No. 8, 25 pages, 2017. |
Avella-Medina, Privacy-preserving parametric inference: A case for robust statistics, Journal of the American Statistical Association, 62 pages, 2019. |
Bu et al., Deep learning with Gaussian differential privacy, HHS Public Access, Harv Data Sci Rev, Author manuscript, 40 pages, 2020. |
Butucea et al., Interactive versus non-interactive locally differentially private estimation: Two elbows for the quadratic functional, arXiv preprint arXiv:2003.04773, 49 pages, 2020. |
Cai et al., The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy, arXiv preprint arXiv:1902.04495, 55 pages, 2020. |
Chaudhuri et al., Differentially private empirical risk minimization, Journal of Machine Learning Research, vol. 12, p. 1069-1109, 2011. |
Cormen et al., Introduction to algorithms, Third Edition, MIT press, pp. 148-169, 2009. |
Csorgo, Strong approximations of the Hoeffding, Blum, Kiefer, Rosenblatt multivariate empirical process, Journal of Multivariate Analysis, vol. 9, pp. 84-100, 1979. |
Day One Staff, Protecting data privacy: How Amazon is advancing privacy-aware data processing, https://blog.aboutamazon.com/amazon-ai/protecting-data-privacy, 3 pages, 2018. |
Demerjian, Rise of the Netflix hacker, https://www.wired.com/2007/03/rise-of-the-netflix-hackers, 2 pages, 2007. |
Ding et al., Collecting telemetry data privately, Advances in Neural Information Processing Systems, 10 pages, 2017. |
Dong et al., Gaussian differential privacy, arXiv preprint arXiv:1905.02383, 86 pages, 2019. |
Duchi et al., Minimax optimal procedures for locally private estimation, Journal of the American Statistical Association, vol. 113, No. 521, 64 pages, 2017. |
Durfee, Practical differentially private top-k selection with pay-what-you-get composition, Advances in Neural Information Processing Systems, 11 pages, 2019. |
Dwork, Differential privacy, The 33rd International Colloquium on Automata, Languages and Programming, 12 pages, 2006. |
Dwork et al., Our data, ourselves: Privacy via distributed noise generation, Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 486-503, 2006. |
Dwork et al., Calibrating noise to sensitivity in private data analysis, Proceedings of the 3rd Theory of Cryptography Conferences, pp. 265-284, 2006. |
Dwork et al., Boosting and differential privacy, IEEE 51st Annual Symposium on Foundations of Computer Science, 10 pages, 2010. |
Dwork et al., The algorithmic foundations of differential privacy, Foundations and Trends in Theoretical Computer Science, vol. 9, No. 3-4, pp. 211-407, 2014. |
Erlingsson et al., RAPPOR: Randomized aggregatable privacy-preserving ordinal response, Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, pp. 1054-1067, 2014. |
Evfimievski et al., Limiting privacy breaches in privacy preserving data mining, Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 12 pages, 2003. |
Friedman et al., Sparse inverse covariance estimation with the graphical lasso, Biostatistics, vol. 9, No. 3, pp. 432-441, 2008. |
Funk, Netflix update: Try this at home, http://sifter.org/˜simon/journal/20061211.html, 4 pages, 2006. |
Hall et al., Random differential privacy, Journal of Privacy and Confidentiality, vol. 4, No. 2, 12 pages, 2011. |
Hall et al., Differential privacy for functions and functional data, Journal of Machine Learning Research, vol. 14, 23 pages, 2012. |
Harper et al., The MovieLens datasets: History and context, ACM Transactions on Interactive Intelligent Systems, vol. 5, No. 4, Article 19, 19 pages, 2015. |
Kairouz et al., The composition theorem for differential privacy, IEEE Transactions on Information Theory, vol. 63, No. 6, 10 pages, 2017. |
Karwa et al., Inference using noisy degrees: differentially private β-model and synthetic graphs, The Annals of Statistics, vol. 44, No. 1, pp. 87-112, 2016. |
Karwa et al., Finite sample differentially private confidence intervals, arXiv preprint arXiv:1711.03908, 51 pages, 2017. |
Kasiviswanathan et al., What can we learn privately?, SIAM Journal on Computing, vol. 40, No. 3, 35 pages, 2010. |
Kenthapadi et al., PriPeARL: A framework for privacy-preserving analytics and reporting at LinkedIn, 10 pages, 2018. |
Lin et al., A Monte Carlo comparison of four estimators of a covariance matrix, Multivariate Analysis, vol. 6, pp. 411-429, 1985. |
McSherry et al., Mechanism design via differential privacy, 48th Annual IEEE Symposium on Foundations of Computer Science, 84 pages, 2007. |
Mironov, Renyi differential privacy, IEEE 30th Computer Security Foundations Symposium, 13 pages, 2017. |
Moro et al., A data-driven approach to predict the success of bank telemarketing, Decision Support Systems, vol. 62, 35 pages, 2014. |
Narayanan et al., Robust De-anonymization of large datasets (how to break anonymity of the Netflix prize dataset), arXiv preprint cs/0610105, 24 pages, 2007. |
Nayak, New privacy-protected Facebook data for independent research on social media's impact on democracy, https://research.fb.com/blog/2020/02/new-privacyprotected-facebook-data-for-independent-research-on-social-medias-impact-ondemocracy, 2 pages, 2020. |
Rohde et al., Geometrizing rates of convergence under local differential privacy constraints, arXiv:1805.01422v2, 72 pages, 2019. |
Rudin, Principles of mathematical analysis, International Series in Pure and Applied Mathematics, McGraw-Hill New York, pp. 94-97, 1964. |
Shorack, Errata, 55 pages, 2009. |
Shorack et al., Empirical Processes with Applications to Statistics, SIAM, Chapter 26, pp. 826-841, 2009. |
United States Census Bureau, Disclosure avoidance and the 2020 census, available at https://www.census.gov/about/policies/privacy/statistical_safeguards/disclosureavoidance-2020-census.html, 2 pages, 2020. |
Vadhan, The complexity of differential privacy, In Tutorials on the Foundations of Cryptography, Springer, 95 pages, 2017. |
Van der Vaart, Asymptotic statistics, Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, Chapter 19, pp. 265-266, 1998. |
Wang et al., Subsampled Renyi differential privacy and analytical moments accountant, The 22nd International Conference on Artificial Intelligence and Statistics, 10 pages, 2019. |
Wasserman et al., A statistical framework for differential privacy, Journal of the American Statistical Association, vol. 105, No. 489, 42 pages, 2010. |
Ye et al., Optimal schemes for discrete distribution estimation under locally differential privacy, IEEE Transactions on Information Theory, 15 pages, 2018. |
Number | Date | Country | |
---|---|---|---|
20220327238 A1 | Oct 2022 | US |
Number | Date | Country | |
---|---|---|---|
63171828 | Apr 2021 | US |