Some of the most effective uses of computers involve analyzing data to determine patterns. These patterns can include the distribution of a single variable across a population or the relationships between different variables.
In general, there are two types of data: public data and private data. Public data is information that people and entities do not mind sharing with the world. Private data is information that people and entities do not want shared with others, such as medical information, political opinions, and financial activities. At times, people must release their private information in order to obtain something they want, such as medical care or a loan. However, even in these situations, the person releasing the information has an expectation that it will be kept in confidence.
Because of this confidentiality, the vast amount of collected private information stored on computer systems is under-utilized since it cannot be released to a third party for analysis.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the background.
A method includes, for each of a plurality of values stored in a computer memory, requesting a set of intervals for the value; receiving the set of intervals such that different values have different sets of intervals; determining which interval the value falls within; and storing the interval that the value falls within as part of public interval data that does not identify the value.
In accordance with a further embodiment, a computer-implemented method secures values from disclosure to a third party by performing steps for each value. The steps include obtaining at least two intervals for the value such that different ones of the values have different intervals; determining which of the at least two intervals the value is found in; and providing an indication of which of the at least two intervals the value is found in without providing the value itself to the third party.
In accordance with a still further embodiment, a system includes a memory holding values in a secure manner to prevent disclosure of the values and a processor executing instructions to perform steps. The steps include, for each of a plurality of the values held in a secure manner, obtaining a set of intervals; determining which interval the value is from; storing the set of intervals and which interval the value is from in the memory; and providing access to the set of intervals and the interval the value is from without providing access to the value itself.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The embodiments described below provide an improved computer system that secures private data while allowing processes to accurately run on the private data. In particular, for each private datum, the embodiments obtain a randomly generated set of intervals and determine which of the intervals the datum is in. The interval that the datum is in and the anchor points of the intervals for the datum are saved for use by an untrusted data collector. Because the intervals are randomly generated, the data collector is able to perform the same processes on the interval data that the data collector would have been able to perform on the private data, such as determining relationships represented by the data, determining distributions of the data, and using the data to train learning algorithms, for example. Thus, the system secures the private data from the data collector but continues to provide a truthful representation of the private data that can be used in many processes, including in the generation of artificial intelligence systems based on the interval data.
The choice of the intervals affects the privacy-utility tradeoff. Consider an extreme case where the number of intervals is sufficiently large. Then each interval becomes narrow and the privacy coverage tends to zero. In another case, where only a small number of intervals is provided, the number of values covered by each interval is large and the information provided by indicating that a value is in an interval is of little utility.
Below, the privacy provided by the present embodiments is described mathematically.
Definition 1 (Interval Privacy).
A mechanism M has the property of interval privacy if, for all x1, x2∈Sz,

pX|Z(x1|Z=z)/pX|Z(x2|Z=z)=pX(x1)/pX(x2)
where pX|Z denotes the distribution of X conditional on Z and Sz is the essential support of X given Z=z.
Definition 2 (Privacy Coverage).
The privacy coverage of M, denoted by τ(M), is defined by E(L(SZ)), where the expectation is over both X and Z. The privacy leakage from mechanism M is defined by κ(M)=1−τ(M). We say that M has the property of κ-interval privacy if κ(M)≤κ.
One way to estimate E(L(SZ)) is to use En(Ln(SZ)), where the subscript n denotes the empirical expectation or probability. In this way, we do not need to estimate the distribution function of X. An alternative way to quantify the privacy leakage is through the mutual information between the observed data and the raw data. For example, observing an interval (−∞, u] or (u, ∞) with u around the median of X gives about one bit of information. The tradeoff between effective sample size and information loss may be derived on a case-by-case basis. Evaluating the information-theoretic quantity often requires the estimation of density functions and may not be computationally easy.
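The empirical estimate En(Ln(SZ)) can be computed directly from a privatized sample. Below is a minimal sketch in Python; the distributional choices are illustrative, and L is taken to be the fraction of the sample mass inside the reported interval, one natural normalization that keeps τ in [0, 1]:

```python
import random

def empirical_privacy_coverage(xs, us):
    """E_n(L_n(S_Z)) for the I-I mechanism: for each datum, measure the
    reported interval (-inf, u] or (u, inf) by the fraction of the whole
    sample falling inside it, then average over the sample."""
    n = len(xs)
    total = 0.0
    for x, u in zip(xs, us):
        if x <= u:   # reported interval is (-inf, u]
            mass = sum(1 for t in xs if t <= u) / n
        else:        # reported interval is (u, inf)
            mass = sum(1 for t in xs if t > u) / n
        total += mass
    return total / n

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(2000)]     # private values
us = [random.uniform(-4.0, 4.0) for _ in range(2000)]  # random cutpoints
tau_hat = empirical_privacy_coverage(xs, us)           # estimated coverage
```

Because most cutpoints here land in the tails of the data, each reported interval covers most of the sample and the estimated coverage is high.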
Interval privacy means that the conditioning on Z=z does not provide extra information except that x falls into Sz. If x1 does not equal x2, and they fall into the same support Sz, then their likelihood ratio remains the same as if no action is taken. Interval privacy also implies that
pX|Z(x|Z=z)=cz·1x∈Sz
where 1 is an indicator function and cz=1/∫Sz pX(x)dx is the normalizing constant.
Suppose that X=x1 is to be protected; interval privacy creates ambiguity by obfuscating the true x1 with sufficiently many x2's in a neighborhood whose posterior ratios do not vary upon incorporating the new information Z=z. Also, suppose that SZ is a (closed or open) interval; then the finite cover theorem implies the following alternative to the second condition above: for all x in the interior of Sz, there exists an open neighborhood U(x)⊂Sz such that interval privacy holds for all x1, x2∈U(x).
In addition, by its definition, the privacy coverage τ(M) takes values in [0, 1]. The privacy coverage quantifies the average amount of ambiguity, or the level of privacy; a larger value indicates increased privacy. Likewise, for each pair [x, z] (nonrandom), we introduce L(Sz) as the individual privacy coverage, interpreted as the privacy level for a particular datum being collected. To realize interval privacy, we will introduce a natural interval mechanism that converts x to a random interval that contains x. In such a scenario, Sz will be of the form [l, r], and thus L(Sz)=r−l.
In the general case, a raw data point x∈R is ‘observed’ as an interval that covers it, and a particular case is when the interval degenerates to the exact point x. For multi-dimensional data, we privatize each entry separately. Let X denote a generic raw data point to be privatized, so that X˜FX. Under an interval privacy mechanism, the observed data is not X itself but a privatized datum Z generated in the following way: partition the data domain Y⊂R into disjoint intervals; report the interval into which X falls; if the reported interval is pre-marked, the point value of X is further collected. The resulting data is a mixture of intervals and points. Formally, we introduce the following notation.
Suppose that the domain Y is partitioned into m intervals of the form (Ui−1, Ui] for i=1, . . . , m, where U0=−∞ and Um=∞. Let I(U, X)=i denote that X falls into (Ui−1, Ui], for i=1, . . . , m. Alternatively, we may use the indicator variables 1X≤Ui, i=1, . . . , m−1, to represent where X is located. By definition, 1X≤Ui=1 if and only if I(U, X)≤i.
Definition 3 (Interval Mechanism).
A privacy mechanism M: X→Z maps each given X to the following Z,

Z=[Ui, Δi, i=1, . . . , m−1, X·1X∈C]

where Δi=1X≤Ui and C denotes the union of the pre-marked intervals in which the point value of X is collected.
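The mechanism of Definition 3 can be sketched as follows; the `marked` argument is a hypothetical stand-in for the set of pre-marked intervals whose point values are collected:

```python
import bisect

def interval_mechanism(x, cuts, marked=frozenset()):
    """Map a private value x to Z = [U_i, Delta_i, i=1..m-1, x if marked].
    `cuts` holds the random cutpoints U_1 < ... < U_{m-1}; `marked` holds the
    (1-based) indices of pre-marked intervals for which x itself is reported."""
    cuts = sorted(cuts)
    deltas = [1 if x <= u else 0 for u in cuts]   # Delta_i = 1{x <= U_i}
    interval = bisect.bisect_left(cuts, x) + 1    # x lies in (U_{i-1}, U_i]
    point = x if interval in marked else None     # collect x only if pre-marked
    return {"cuts": cuts, "deltas": deltas, "interval": interval, "point": point}

z = interval_mechanism(2.5, [1.0, 3.0, 4.0])
```

With cutpoints [1, 3, 4], the value 2.5 falls in the second interval (1, 3], and its point value is withheld unless that interval is pre-marked.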
Suppose that there are k observers/adversaries, each using a mechanism and observing information. These k observers may choose to collaborate to narrow down the range where a particular X falls. This motivates the following ensemble mechanism, denoted by ⊕j=1kMj, which is itself an interval mechanism.
Definition 4 (Ensemble Mechanism).
The ensemble of two privacy mechanisms

Mi: X→Zi=[Ui, I(Ui, X), X·1X∈Ci]

with i=1, 2 is defined by

M1⊕M2: X→Z=[U1⊕U2, I(U1⊕U2, X), X·1X∈C1∪C2]

where U1⊕U2 denotes the finer partition generated by U1 and U2, and C1∪C2 denotes the union of the two sets C1, C2. In general, the ensemble of k privacy mechanisms, denoted by ⊕i=1kMi, is recursively defined by ⊕i=1kMi=(M1⊕ . . . ⊕Mk−1)⊕Mk (k≥2).
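The ensemble operation amounts to merging cutpoint sets. The illustrative sketch below shows how two collaborating observers narrow the cell containing a value, which is why leakage adds up under composition:

```python
import bisect

def ensemble(cuts1, cuts2):
    """U1 (+) U2: the finer partition generated by both cutpoint sets."""
    return sorted(set(cuts1) | set(cuts2))

def interval_of(x, cuts):
    """1-based index i with x in (U_{i-1}, U_i], taking U_0=-inf, U_m=+inf."""
    return bisect.bisect_left(sorted(cuts), x) + 1

cuts1, cuts2 = [1.0, 3.0], [2.0, 3.0, 5.0]
merged = ensemble(cuts1, cuts2)
# Each observer alone places x = 2.5 in a wide cell; together they pin it
# into the narrower cell (2, 3] of the merged partition.
```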
This leads to the following:
(a) Composition Property. Suppose that there are k (ensemble) interval mechanisms Mj, j=1, . . . , k. Then κ(⊕j=1kMj)≤Σj=1kκ(Mj).
(b) Robustness to Preprocessing. For any interval mechanism M and preprocessing map g, it holds that κ(M∘g)≤κ(M), where equality holds if and only if g−1({tilde over (x)}) has zero FX-measure for all {tilde over (x)}∈g(Y).
(c) Robustness to Post-processing. Suppose that M: X→Z is an interval mechanism with κ-interval privacy. Let f: Z→W be an arbitrary deterministic or random mapping that defines a conditional distribution W|Z. Then f∘M: X→[Z, W] also meets κ-interval privacy.
Together, the composition property and robustness to post-processing permit modular design and analysis of interval privacy mechanisms: if each system element that accesses sensitive data is separately interval-private, then so is their combination or any consequent post-processing. We will elaborate on some specific uses of the interval mechanism in the following sections.
We introduce the following privacy mechanisms.
(a) The I-I Interval Mechanism (illustrated in the figures) is M: X→Z=[U, Δ], where Δ=1X≤U and U is independent of X.
(b) The I-P-I Interval Mechanism (illustrated in the figures) is M: X→Z=[U, V, Δ, Γ, Γ·X], where Δ=1X≤U, Γ=1U<X≤V, and [U, V] is independent of X with P(U<V)=1.
(c) The P-I-I Interval Mechanism is M: X→Z=[U, V, Δ, Γ, Δ·X], where Δ=1X≤U, Γ=1U<X≤V, and [U, V] is independent of X with P(U<V)=1.
(d) Suppose that [U1, . . . , Um−1] are random variables with P(U1< . . . <Um−1)=1. Let U0=−∞, Um=∞ for notational convenience. The I- . . . -I Interval Mechanism (illustrated in the figures) is

M: X→Z=[U1, . . . , Um−1, Δ1, . . . , Δm−1] (5)

where Δi=1X≤Ui for i=1, . . . , m−1.
The discussion above provides a description of the privacy attainable using intervals in place of actual values. Although using intervals provides privacy, it also creates technical challenges for the data analyst when they are attempting to analyze the actual values but only have access to the intervals. A number of different embodiments are provided below for using the intervals to produce an analysis of the hidden actual values.
In accordance with one embodiment, the mean of the distribution of X is estimated from the interval data. For example, suppose that the raw data are i.i.d. Xi∈[a, b] for i=1, . . . , n, with unknown mean μ. The observations are Zi=[Ui, Δi], i=1, . . . , n, from the I-I interval mechanism with Ui˜i.i.d. Uniform[a, b]. In accordance with one embodiment, the mean is estimated as:
This estimation works in part because Ui is different and random for each observation Zi. As a result, over the sample set, the information about which interval each data value is in takes on the distribution of the underlying values.
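The estimator in equation (6) is not reproduced above. As a hedged illustration, when U˜Uniform[a, b] and X is (essentially) supported on [a, b], one unbiased estimator follows from E(Δ|X=x)=P(U≥x)=(b−x)/(b−a): inverting the sample average of Δ gives μ̂=b−(b−a)·n−1ΣiΔi. A sketch:

```python
import random

def mean_from_ii(deltas, a, b):
    """Mean estimate from I-I observations with U ~ Uniform[a, b].
    Since E[Delta | X=x] = P(U >= x) = (b - x)/(b - a) for x in [a, b],
    the sample average of Delta can be inverted to recover the mean."""
    dbar = sum(deltas) / len(deltas)
    return b - (b - a) * dbar

random.seed(1)
a, b, n = -4.0, 4.0, 20000
# N(0.5, 1) places all but negligible mass inside [a, b], so the
# uniform-cutpoint inversion above applies up to a tiny truncation bias.
xs = [random.gauss(0.5, 1.0) for _ in range(n)]        # private values
us = [random.uniform(a, b) for _ in range(n)]          # random cutpoints
deltas = [1 if x <= u else 0 for x, u in zip(xs, us)]  # reported sides
mu_hat = mean_from_ii(deltas, a, b)                    # estimates true mean 0.5
```

Note that the estimate never touches the raw values: only the interval indicators Δi and the public cutpoint range [a, b] are used.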
In accordance with another embodiment, a method is provided for estimating a regression function from the interval data.
Suppose that we are interested in estimating the regression function with X being the response variable and Y=[Y1, . . . , Yp]∈Rp being the predictors. Suppose that X has already been privatized, and we only have access to an interval-private observation of X, while the data Y is fully visible. This scenario occurs, for example, when Alice (who holds X) sends her privatized data to Bob (who holds Y) to seek Assisted Learning. The scenario also occurs when X has to be private while Y is already publicly available.
Following the typical setting of regression analysis, we postulate the data generating model X=f*(Y)+ε, where f* is the underlying regression function to be estimated, and ε˜Fε is an additive random noise. We suppose that Y is a random variable independent of ε, and that Fε is a known distribution, say Gaussian or Logistic distributions.
Suppose that I-I or I-P-I intervals are used; then the data are n observations of the form

Di=[ui, δi, yi]T or Di=[ui, vi, δi, γi, yi]T (7)

for i=1, . . . , n. The function f* is unknown. A parametric approach typically models f* by a linear function of the form X=YTβ+ε, where β∈Rp is treated as an unknown parameter, and ε˜Fε. The above model includes parametric regression and nonparametric regression based on series expansion with bases such as polynomials, splines, or wavelets. Suppose a data analyst is to estimate β from Dn. A classical approach is to maximize the likelihood function, which for I-I data can be written as Πi=1n[Fε(ui−yiTβ)]δi[1−Fε(ui−yiTβ)]1−δi.
Though the likelihood approach is principled for estimating a parametric regression, its implementation depends on the specific parametric form of the regression function f, and its extension to nonparametric function classes is challenging. In supervised learning, data analysts typically use a nonparametric approach, such as various types of tree ensembles and neural networks. However, the observables are intervals in the form of [L, R], and existing regression techniques for point-valued responses (X) cannot handle interval-valued labels. As such, we are motivated to “transform” the interval-data format into the classical point-data format, to enable direct uses of existing nonparametric regression methods and software packages.
Our main idea is to transform the data format from intervals to point values. Ideally, we hope to transform [L, R] to a single point so that classical methods are readily applicable. We propose to use
{tilde over (X)}=E(X|D,Y) (8)
as a surrogate for X. This is motivated by the observation that {tilde over (X)} is an unbiased estimator of f(y) for a given Y=y, namely E({tilde over (X)}|y)=E(X|y)=f(y). It is worth noting that we cannot use {tilde over (X)}=E(X|D), because its expectation is E(X) instead of E(X|y), and thus this surrogate would produce a large estimation error.
In general, suppose that we choose a loss function ℓ: X×X→R, such as the L1 or L2 loss. Based on the above, it is tempting to solve the following optimization problem
The following result justifies the validity of using {tilde over (X)} as a surrogate of X to estimate f, if the {tilde over (X)}i were computable. We define the norm of f by ∥f∥Y=(E(f(Y)2))1/2. Suppose that the underlying regression function f*∈F, where F is a parametric or nonparametric function class with bounded L2(Y)-norms. The next result shows that the optimal f obtained by minimizing (9) with a squared loss ℓ is asymptotically close to the underlying truth f*. For notational convenience, we let En({tilde over (X)}−f(Y))2=n−1Σi=1n({tilde over (X)}i−f(Yi))2 denote the average squared loss.
Suppose that

|En({tilde over (X)}−f(Y))2−E({tilde over (X)}−f(Y))2|→p0 (11)

uniformly over f∈F as n→∞. Then any sequence {circumflex over (f)}n that minimizes En({tilde over (X)}−f(Y))2 converges in probability to f*, meaning that ∥{circumflex over (f)}n−f*∥Y→p0 as n→∞.
In practice, however, the calculation of {tilde over (X)} itself is unrealistic as it involves the knowledge of f(Y). In other words, the unknown function f appears in both the optimization (9) and calculation of surrogates (10). The above difficulty motivates us to propose an iterative method where we iterate the steps in (9) and (10), using any commonly used supervised learning method to obtain {circumflex over (f)} at each step.
The pseudocode is provided in Algorithm 1, where I-I intervals are considered for brevity.
We provide a theoretical guarantee of its convergence for linear regression models. In practice, we set the initialization by {circumflex over (f)}0(y)=0 for all y. We experimentally found that Algorithm 1 is robust and works well for a variety of nonlinear models, including tree ensembles and neural networks.
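Algorithm 1 itself is not reproduced here. The following is a minimal sketch of the iteration for the linear model with Gaussian noise, with all simulation settings chosen for illustration: each pass imputes the censored response by its conditional mean given the reported interval and the current fit, then refits by least squares.

```python
import math
import random

def phi(c):   # standard normal density
    return math.exp(-0.5 * c * c) / math.sqrt(2.0 * math.pi)

def Phi(c):   # standard normal CDF, stable in both tails via erfc
    return 0.5 * math.erfc(-c / math.sqrt(2.0))

def fit_interval_regression(us, deltas, ys, iters=40):
    """Iterative surrogate scheme for X = beta*y + eps, eps ~ N(0, 1),
    observed only through I-I data [u_i, delta_i]. Each pass imputes
    x_i = beta*y_i + E[eps | reported interval], then refits beta."""
    beta = 0.0                              # initialization f_hat_0(y) = 0
    for _ in range(iters):
        surrogates = []
        for u, d, y in zip(us, deltas, ys):
            c = u - beta * y                # threshold for eps under current fit
            if d == 1:                      # observed eps <= c
                e = -phi(c) / Phi(c)        # left-truncated-normal mean
            else:                           # observed eps > c
                e = phi(c) / Phi(-c)        # right-truncated-normal mean
            surrogates.append(beta * y + e)
        # least-squares refit without intercept, matching f(y) = beta*y
        beta = sum(y * s for y, s in zip(ys, surrogates)) / sum(y * y for y in ys)
    return beta

random.seed(2)
n = 2000
ys = [random.gauss(0.0, 1.0) for _ in range(n)]
xs = [y + random.gauss(0.0, 1.0) for y in ys]          # true beta = 1
us = [random.uniform(-6.0, 6.0) for _ in range(n)]     # random cutpoints
deltas = [1 if x <= u else 0 for x, u in zip(xs, us)]  # reported sides
beta_hat = fit_interval_regression(us, deltas, ys)
```

The `Phi(-c)` form of the survival function avoids catastrophic cancellation when `c` is large, which matters because the imputation is evaluated deep in the tails for extreme cutpoints.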
Suppose that the underlying true regression function is f*(y)=yTβ*, β*∈Rp, and the least squares estimate {circumflex over (β)}k is used in (9). Assume that ε is supported on R with finite variance, and that the CDF Fε(•) of ε is continuously differentiable. Also assume that |U−f*(Y)| is almost surely bounded by a constant B, and supv:|v|≤B|1+Gε(v)pε(v)/{Fε(v)(1−Fε(v))}|<1, where pε is the density of ε and Gε(v)=∫−∞v y dFε(y). Then there exists a constant δ>0 such that for all initial values β0 satisfying ∥{circumflex over (β)}0−β*∥<δ, we have limn,k→∞∥{circumflex over (β)}k−β*∥=0 in probability.
The term E(ε|ui, δi, yi) is essentially a conditional mean indicated by δi, namely E(ε|ε≤ui−{circumflex over (f)}k−1(yi)) if δi=1 and E(ε|ε>ui−{circumflex over (f)}k−1(yi)) otherwise. In implementing Algorithm 1, we need to specify a distribution for the error term ε when calculating (12). In general, the learning accuracy can be suboptimal if a data analyst postulates a model that violates the underlying data generating process, an issue known as model mis-specification. Nevertheless, in our context of data-private regression, we found from various experiments that the accuracy of estimating regression functions is not sensitive to mis-specification of the noise distribution. We conjecture that the consequent estimation of the regression function is still consistent under mild conditions. A practical suggestion to data analysts is to treat ε as a Logistic random variable to simplify the computation. Also, if the standard deviation of the noise term ε is unknown in practice, we suggest estimating σ2 with n−1Σi=1n(xi−{tilde over (x)}i)2 at each iteration of Algorithm 1.
In this experiment, we demonstrate interval-private data (with reasonably broad privacy coverage) in moment estimation. Suppose that 100 raw data points are generated from X˜N(0.5, 1), and the private data we collect is based on the I-I mechanism with U˜Uniform[−T, T], where T=2n1/3. The goal is to estimate E(X). We demonstrate three methods and compare them with the (unrealistic) estimates based on the raw data in Table 1. The first method, denoted by ‘Example 2’, uses the estimator in (6). The second method, denoted by ‘NPMLE’, uses the nonparametric maximum likelihood approach available from the ‘Icens’ R package (Gentleman and Vandal, 2010). The third method, denoted by ‘MLE’, presumes that the distribution is Gaussian with unknown mean and variance, and estimates the mean by the standard maximum likelihood principle. Unlike the first two methods, the third one requires knowledge of a parametric form and is thus expected to converge faster. We also consider two methods based on the raw data in hindsight: the sample average and the sample median. To demonstrate the robustness of interval data, we add 0%, 1%, and 5% proportions of outliers (where X is set to 999). We also considered the estimation of E(X2), where we used a similar procedure to collect interval-private X2, except that U˜Uniform[0, 2T] so that the privacy coverage remains about 0.95 (or 5% leakage).
Table 1 summarizes the results, which indicate that estimation under highly private data performs reasonably well when compared with the oracle approach (under 0% outliers). The interval-private estimates tend to be more robust against outliers than estimation based on the sample mean and are comparable to the median (based on raw data).
In this experiment, we estimate the linear regression model X=f(Y)+ε, where f(Y)=βY is to be estimated from the I-I privatized data Z=[U, Δ]. We generate n=200 data points with β=1, ε˜N(0, 1), and U a Logistic random variable with scale 5. The corresponding privacy coverage is around 0.9. The prediction error is evaluated by the mean squared error E(f({tilde over (Y)})−{circumflex over (f)}({tilde over (Y)}))2, where {tilde over (Y)} denotes unobserved (future) data. With a limited size of data, the algorithm produces an estimate {circumflex over (f)}(y)={circumflex over (β)}y that converges well within 20 iterations. The initialization is done by simply setting {circumflex over (f)}0(y)=0.
In another experiment, we demonstrate the method proposed above on the nonparametric regression model X=f(Y)+ε, using the I-P-I privatized data Z=[U, V, Δ, Γ]. Suppose that n=200 data points are generated from the quadratic regression f(Y)=Y2−2Y+3, with ε˜N(0, 1). Let U=min(L1, L2), V=max(L1, L2), where L1, L2 are independent Logistic random variables with scale 5. The corresponding privacy coverage is around 0.9. A Random Forest (depth 3, 100 trees) with features Y1=Y, Y2=Y2 is used to fit Algorithm 1. With a limited size of data, the algorithm produces a tree ensemble estimate that converges well within 20 iterations.
In the examples above, the intervals are used directly to estimate the mean and to determine the linear regression. In other embodiments, referred to herein as D-transformations, the intervals are transformed to points whose conditional expectation (averaging over the randomness of the intervals) equals the raw values.
This section addresses a general design in which data falling into the middle interval are observed. As before, we suppose that the distribution of X is denoted by F, and the distribution of U=[U1, . . . , Um−1] is FU(u). Assume that U and X are independent. Let Δi be defined in a similar way as in Definition 3. We will elaborate on specific designs, including the I-I, I-P-I, P-I-I, and I- . . . -I designs mentioned above.
Given n observations of Z=[Ui, Δi, i=1, . . . , m−1], denoted by zj=[ui,j, δi,j, i=1, . . . , m−1], j=1, . . . , n, our goal is to find a suitable surrogate for each unobserved/latent xj, denoted by {tilde over (x)}j. We consider an unbiased estimator of xj, so that E({tilde over (x)}j|xj)=xj, where the expectation is with respect to the distribution of uj. Since we assume the data [xj, uj], j=1, . . . , n, are i.i.d., our goal is to construct a random variable {tilde over (X)} that satisfies
E({tilde over (X)}|X=x)=x (13)
A natural choice is {tilde over (X)}=Σi=1, . . . , m, i≠i*1I(U,X)=iE(X|Ui−1<X≤Ui)+1I(U,X)=i*X, where i* indexes the pre-marked interval, since direct calculations show that (13) holds. Unfortunately, FX is usually unknown in practice, so we cannot easily compute each conditional expectation E(X|Ui−1<X≤Ui).
Alternatively, we consider a surrogate {tilde over (X)} in the following form
where Φ1, . . . , Φm are fixed measurable functions that do not depend on x or its underlying distribution and that are easy to evaluate. In other words, if x is known to fall into the interval (Ui−1, Ui], then let {tilde over (X)}=Φi(U); moreover, if x is exactly observed, then simply let {tilde over (X)}=x. Note that {tilde over (X)} is a statistic computable from the observed data Z, while x is not. For {tilde over (X)} to be an unbiased estimator of x, we need to design appropriate Φi's, which motivates the following definition.
Definition 5 (D-Transformation).
If the function tuple (Φ1, . . . , Φm) meets the following conditions, we say it is a D-transformation with respect to the distribution of U=[U1, . . . , Um−1], and denote it by (Φ1, . . . , Φm)∈D.
1) [Regularity] E{Σi=1mΦi(U)1I(U,x)=i}=x for all x, where the expectation is over U.
2) [Computability] (Φ1, . . . , Φm) may depend on the distribution function FU,V, but remains the same regardless of FX.
Regularity is the condition that guarantees the unbiasedness of {tilde over (X)}. In later subsections, we will show that for any given distribution of U, there may exist multiple designs of (Φ1, . . . , Φm) that meet this condition. The computability condition says that the surrogate {tilde over (X)} can be evaluated even without knowing the distribution function of X (which is often the case in practice).
For any (Φ1, . . . , Φm)∈D, we have E({tilde over (X)}|X)=X. Furthermore, E({tilde over (X)})=E(X).
Suppose that in Definition 5, FU,V and FX are replaced with the conditional distributions given R, namely FU,V|R and FX|R; then E({tilde over (X)}|R)=E(X|R). In particular, E({tilde over (X)}|X, R)=X, E({tilde over (X)}|X)=X, and E({tilde over (X)})=E(X).
The D-transformation provides a class of unbiased estimators of x. In the sequel, we show two examples. We will assume that E(X2)<∞ and that the joint distribution H of [U, V] has density fu,v(u, v)>0.
This section addresses the I-I design, where the observation is in the form of Z=[U, 1X≤U], where U is generated independently of the data. The surrogate {tilde over (X)} in (14) takes the following specific form
{tilde over (X)}=Φ1(U)Δ+Φ2(U)(1−Δ) (16)
and the regularity condition for Φ1, Φ2 becomes
∫x∞Φ1(u)dFu(u)+∫−∞xΦ2(u)dFu(u)=x, ∀x∈R (17)
Then the following is a D-transformation,
The intuition of the above I-I Design 1 is explained as follows. The underlying raw data is only known to fall into either (−∞, U] or (U, ∞); if x is known to be smaller than U, then we surrogate it with U minus a correction term uniquely determined by U; otherwise, we surrogate it with U plus a correction term uniquely determined by U. Note that if X∈[a, b] with finite a, b, a similar result holds by replacing −∞ and ∞ with a+ and b−, respectively. The next result provides another design. Though different designs give unbiased estimates of x, their variances may differ.
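The displayed pair for I-I Design 1 is omitted above. One pair consistent with the regularity condition (17) and with this intuition — offered here as a reconstruction, not a quotation — is Φ1(u)=u−(1−Fu(u))/fu(u) when x≤U and Φ2(u)=u+Fu(u)/fu(u) otherwise; note Φ2−Φ1=1/fu, as differentiating (17) requires. The sketch below checks unbiasedness numerically for a Logistic cutpoint:

```python
import math

S = 5.0  # scale of the Logistic cutpoint U (illustrative choice)

def F(u):  # CDF of U
    return 1.0 / (1.0 + math.exp(-u / S))

def f(u):  # density of U
    p = F(u)
    return p * (1.0 - p) / S

# Candidate D-transformation for the I-I design (a reconstruction; see text):
def phi1(u):  # surrogate when x <= U: "U minus a correction term"
    return u - (1.0 - F(u)) / f(u)

def phi2(u):  # surrogate when x > U: "U plus a correction term"
    return u + F(u) / f(u)

def expected_surrogate(x, lo=-80.0, hi=80.0, step=0.005):
    """Numerically integrate E[X_tilde | X = x] over the randomness of U
    with a midpoint rule; the truncated tails contribute negligibly."""
    total, u = 0.0, lo + step / 2.0
    while u < hi:
        total += (phi1(u) if x <= u else phi2(u)) * f(u) * step
        u += step
    return total
```

For the Logistic cutpoint the corrections are available in closed form, (1−F)/f=S/F and F/f=S/(1−F), so the surrogate is cheap to evaluate per observation.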
Assume that Fu is continuously differentiable on R and fu(t)>0 for all t∈R. Then the following pair is a D-transformation,
I-P-I Design: Two Intervals Mixed with Mid-Range Points
This section addresses the I-P-I design, where the observation is in the form of Z=[U, V, Δ, Γ], where Δ=1X≤U and Γ=1U<X≤V. The surrogate {tilde over (X)} in (14) is of the form
{tilde over (X)}=Φ1(U,V)Δ+Φ2(x)Γ+Φ3(U,V)(1−Δ−Γ) (20)
For simplicity, we suppose that Φ1(u, v) is only a function of u, and Φ3(u, v) is only a function of v. The regularity condition for Φ1, Φ2, Φ3 becomes
∫xbΦ1(u)dFu(u)+Φ2(x){Fu(x)−Fv(x)}+∫axΦ3(v)dFv(v)=x, ∀x∈[a, b] (21)
where Fu and Fv denote the CDF of U and V, respectively. We provide two specific designs below.
(I-P-I Design 1).
Assume that Fu and Fv are continuously differentiable on R, the support of U, V, that fu(t)>0, fv(t)>0 for all t∈R, and
Then the following triplet is a D-transformation.
The intuition of this design is similar to that of I-I Design 1, except that we use the exact value of x whenever it is observed (i.e. falling into (U, V]). Also, I-I Design 1 can be regarded as a special case of I-P-I Design 1 when U=V.
(I-P-I Design 2).
Assume that the joint density fu,v>0 almost everywhere on the support. Then the following triplet is a D-transformation.
The intuition of this design is explained as follows: If x is not observed and only known to fall into two tails, then we truncate it to be simply zero; since this may cause a severe bias, we compensate the truncation by correcting the observed x by a factor that is uniquely determined by U and V.
Before we proceed to other mechanisms, we provide some example designs in correspondence to the I-I and I-P-I designs. We focus on the following class of joint distributions of U, V that is often easy to implement in practice.
We let
U=W−a,V=W+a (24)
where a is a constant that controls the (deterministic) gap between U, V. It can be verified that
fu(u)=f(u+a), fv(v)=f(v−a), Fu(u)=F(u+a), Fv(v)=F(v−a) (25)
Therefore, calculations of Φ1, Φ2, Φ3 can be made in closed-form as long as f, F have closed forms.
We let
U=min(W1,W2),V=max(W1,W2) (26)
where W1, W2 are independently drawn from F. It can be verified that
fu(u)=2f(u)(1−F(u)), fv(v)=2f(v)F(v), Fu(u)=2F(u)−F(u)2, Fv(v)=F(v)2 (27)
Compared with the class in (24), this class does not allow U≈V without reducing the variance of the distribution F.
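The identities in (27) can be spot-checked by simulation; the sketch below uses W˜Uniform[0, 1], an illustrative choice for which F(t)=t:

```python
import random

random.seed(3)
n = 50000
pairs = [(random.random(), random.random()) for _ in range(n)]  # W1, W2 ~ Uniform[0,1]
us = [min(w1, w2) for w1, w2 in pairs]                          # U = min(W1, W2)
vs = [max(w1, w2) for w1, w2 in pairs]                          # V = max(W1, W2)

def ecdf(sample, t):
    """Empirical CDF of `sample` evaluated at t."""
    return sum(1 for s in sample if s <= t) / len(sample)

t = 0.5
fu_pred = 2 * t - t * t   # F_U(t) = 2F(t) - F(t)^2 = 0.75 with F(t) = t
fv_pred = t * t           # F_V(t) = F(t)^2 = 0.25
```

The empirical CDFs of the simulated minima and maxima match the closed-form expressions 2F−F² and F² to within Monte Carlo error.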
Suppose that the distribution of W (as used in Examples 1 and 2) is the Uniform distribution on [0, c], and X∈[0, c]. We calculate the resulting surrogates {tilde over (X)} in Table 2. This uniform design is not recommended in practice unless x is known to be bounded.
Suppose that the distribution of W is the Logistic distribution on R with CDF
with s>0 being a constant. We calculate the resulting surrogates {tilde over (X)} in Table 3.
P-I-I Design: Two Intervals Mixed with Left-Tail Points
This section addresses the P-I-I design, where the observation is in the form of Z=[U, V, Δ, Γ], where Δ=1X≤U, Γ=1U<X≤V. The surrogate {tilde over (X)} in (14) may be written in the form of
{tilde over (X)}=Φ1(x)Δ+Φ2(U)Γ+Φ3(V)(1−Δ−Γ) (29)
For simplicity, we have assumed Φ2, Φ3 to be scalar functions (similarly to the I-P-I Design). The regularity condition for Φ1, Φ2, Φ3 becomes
∫xbΦ1(u)dFu(u)+Φ2(x){Fu(x)−Fv(x)}+∫axΦ3(v)dFv(v)=x, ∀x∈[a, b] (30)
We provide two designs below.
Assume that Fu and Fv are continuously differentiable on R, the support of U, V, and that fu(t)>0, fv(t)>0 for all t∈R. Then the following triplet is a D-transformation.
The intuition is that we assign 0 to censored data and leverage the weight of the censored data. When the distribution H of the censored data is unknown, we can use a D-transformation.
Assume that Fu and Fv are continuously differentiable on R, the support of U, V, and that fu(t)>0, fv(t)>0 for all t∈R. Then the following triplet is a D-transformation.
The cutpoints U=[U1, . . . , Um−1] may be bounded or unbounded, dependent or independent. Our general method is described as follows. Suppose that [Φ1,i(Ui), Φ2,i(Ui)] is a D-transformation of x for the distribution of Ui under the I-I design. Given an observation of Z as defined in (5), our key idea is to turn the m-interval problem into many 2-interval problems and average their corresponding estimators to reduce the overall variance. One possible solution is to consider a surrogate {tilde over (X)} of the following form
where I(U, X)∈{1, . . . , m} denotes the index k such that X falls into (Uk−1, Uk].
Since each term in the summation is a (conditionally) unbiased estimate of x, their average is still unbiased, while the variance will typically decrease. Sometimes, the distribution of Ui is nontrivial, and thus it is not easy to compute its corresponding Φi and evaluate the variance. We will propose an alternative unbiased estimator that can be much easier to compute compared with (33), whose variance is also provably reduced by m times. Next, we show an example to illustrate the above ideas.
Let W be a random variable whose distribution function and density functions are Fw and fw, respectively. Let a>0 be a constant that specifies the distance between two censoring points. We generate U1, . . . , Um−1 (m≥3) in a way similar to Example 1:
Ui=W−(m−2)a+2(i−1)a, i=1, . . . , m−1 (34)
The marginal distribution function of Ui is therefore Fw(ui+(m−2)a−2(i−1)a), which can be used to evaluate (33) using specific designs (e.g. those introduced before).
Suppose that one generates U1, . . . , Um−1 (m≥3) by drawing i.i.d. random variables W1, . . . , Wm−1 from a distribution function Fw (with density function fw), and lets W(1)≤ . . . ≤W(m−1) be their order statistics. Then we let Ui=W(i), whose density function fi and distribution function Fi are calculated to be
for i=1, . . . , m−1. The Φ1, Φ2 in (33) can then be calculated from these. Nevertheless, the variance of {tilde over (X)} is usually nontrivial to evaluate, mainly because the summation terms in (33) are often dependent. Calculating the pairwise covariances can be challenging. Methods such as Rio's covariance inequality may be used to derive upper bounds on the variance as m increases. To simplify computation and to evaluate the variance of the constructed {tilde over (X)}, we propose the following alternative estimator.
Let [Φ1, Φ2] be a D-transformation with respect to the distribution of W. The following proposition indicates that the estimator
satisfies E({tilde over (X)}|X)=X and admits a more tractable variance.
Suppose that W1, . . . , Wm−1 are identically distributed random variables such that the joint distribution of [W(1), 1x≤W(1), . . . , W(m−1), 1x≤W(m−1)] is the same as that of [U1, Δ1, . . . , Um−1, Δm−1] for any given x. Here, W(1)≤ . . . ≤W(m−1) are the order statistics of W1, . . . , Wm−1. Suppose that [Φ1, Φ2] is a D-transformation with respect to the distribution of W1. Then the {tilde over (X)} in (36) is an unbiased estimator of x, and its conditional variance is
whenever the right-hand side is bounded.
The identity (37) further implies that
where {tilde over (X)} is an independent copy of {tilde over (X)} with m=2.
The boundedness of the right-hand side requires only mild conditions, which hold for common distributions such as Gaussian, Logistic, etc. Suppose that V({tilde over (X)}|X=x) is bounded; then the Proposition above implies that as m increases, V({tilde over (X)}|X=x) monotonically converges to zero. This property is intuitively appealing, as the additional information brought by more intervals tends to reduce the uncertainty of an estimate of x.
Suppose the raw observation is [X(−), X(+)], which may or may not be dependent on X and which contains X almost surely. One example is that [X(−), X(+)] represents a confidence interval of an unknown parameter X, and the intervals are further collected for meta-analysis. Another example is that a set of interval-privatized data needs to be re-collected by another agent. Suppose that X(−)=X−T(−), X(+)=X+T(+), where T(−), T(+) are treated as two nonnegative random variables. Similar to Definition 5(a), the privatized data is obtained in the following way.
Definition 6 (I-I Design with Non-Point Raw Data).
Suppose that the privacy mechanism is
[X(−),X(+)]↦Z=[U,Δ,Γ] (38)
where
Δ=1X(+)≤U, Γ=1X(−)≤U≤X(+) (39)
and U is independent of Y, X(−), X(+).
The privatization is interpreted in the following way: if X(+)≤U, then almost surely Y∈(−∞, U] and we report this interval; if X(−)>U, then almost surely Y∈(U, ∞); otherwise, there is ambiguity and we do not report any interval.
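The reporting rule in (38)-(39) can be transcribed directly; the sketch below (our own illustration) returns the reported interval, or nothing in the ambiguous case.

```python
# A direct transcription of the mechanism in (38)-(39):
# given raw interval [x_minus, x_plus] and an independent anchor u,
# report an interval only when the anchor does not fall inside the raw interval.
def privatize(x_minus, x_plus, u):
    delta = int(x_plus <= u)                 # Delta = 1{X(+) <= u}
    gamma = int(x_minus <= u <= x_plus)      # Gamma = 1{X(-) <= u <= X(+)}
    if delta:
        report = ("-inf", u)                 # value lies in (-inf, u]
    elif not gamma:
        report = (u, "inf")                  # value lies in (u, inf)
    else:
        report = None                        # ambiguous: no interval reported
    return delta, gamma, report

print(privatize(0.2, 0.8, 1.0))   # anchor above the raw interval: report (-inf, 1.0]
print(privatize(0.2, 0.8, 0.5))   # anchor inside: ambiguous, nothing reported
print(privatize(0.2, 0.8, 0.1))   # anchor below: report (0.1, inf)
```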
As before, suppose that Fx is unknown while Fu is known. We still aim to obtain an unbiased estimator of x or E(Y|R), where R, if non-null, denotes external side information.
We construct an estimator in the form of
{tilde over (X)}=Φ1(U)Δ+Φ2(U)Γ+Φ3(U)(1−Δ−Γ) (40)
Similar to Definition 5, we have the following result.
Let
be a fixed constant. If
∫x
Then E({tilde over (X)}|X, R)=X for any random variable R that is independent of U.
In particular, (26) holds if
The Design in (43) is an analog of that in I-I Design 1, except that there is a correction term brought by Γ=1.
Suppose that Y is an unbounded real-valued random variable. We generate a privatized variable Z as introduced in (3), and collect Z for further transmission/storage/learning (see the accompanying figure).
Suppose that one is interested in a measurable function of X, say g(X). One possible way is to turn Z into an interval observation for g(X) and apply the same technique. For some distributions of U, this may not be computationally tractable. Alternatively, one could collect an interval observation of g(X) directly. For instance, to estimate V(X), one may choose to observe X and g(X)=X2 separately, in order to construct an unbiased estimator of V(X). If so, the overall privacy leakage of X is no larger than the sum of both leakages.
Though a constructed statistic {tilde over (X)} satisfies E({tilde over (X)})=E(X), its variance depends on the interval mechanism. The following inequality holds regardless of the underlying interval mechanism.
V({tilde over (X)})=E(V({tilde over (X)}|X))+V(E({tilde over (X)}|X))=V(X)+E(V({tilde over (X)}|X))≥V(X) (44)
with equality if and only if {tilde over (X)}=X almost surely. Intuitively, the uncertainty of {tilde over (X)} is larger than that of X on average, due to the ambiguity brought by intervals. In some regression and classification problems (elaborated in later sections), it is helpful to estimate the original variance of X from the inflated variance V({tilde over (X)}), which can be estimated by its sample analog. In general, an analytical form of V({tilde over (X)}) needs to be derived on a case-by-case basis. For example, the following Proposition provides an explicit formula for the variance V({tilde over (X)}) in I-P-I Designs 1 & 2.
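A quick Monte Carlo check of the variance inflation in (44), using a logistic-anchor D-transformation of our own construction (assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def d_transform(x, u, s=1.0):
    # A D-transformation for a Logistic(0, s) anchor: E(X_tilde | X) = X.
    F = 1.0 / (1.0 + np.exp(-u / s))
    return np.where(x <= u, u - s / F, u + s / (1.0 - F))

n = 400_000
x = rng.normal(0.0, 1.0, size=n)        # raw data X ~ N(0, 1)
u = rng.logistic(0.0, 1.0, size=n)      # independent anchor points
x_tilde = d_transform(x, u)

print(x.mean(), x_tilde.mean())         # both near 0: unbiasedness on average
print(x.var(), x_tilde.var())           # V(X_tilde) exceeds V(X), as in (44)
```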
Assume that Fu and Fv are continuously differentiable on the support of U, V, and
The {tilde over (X)} in (12) satisfies
whenever the right hand side is well defined, for example when the following condition holds
Example (Variance Estimation).
Even when Fx is unknown, it is possible to construct an estimator of V(X) depending on Fu,v. Suppose that Y˜N(μ, σ2) is Gaussian. Note that these two conditions hold for X being Gaussian. We obtain
Note that V({tilde over (X)})→V(X) as a→∞, which matches the intuition. Let {circumflex over (V)}n({tilde over (X)}) denote the sample variance of {tilde over (X)} calculated from {tilde over (x)}1, . . . , {tilde over (x)}n. Consider the equation of σ2:
The above root exists and is unique if the sample variance {circumflex over (V)}n({tilde over (X)}) exceeds s2 exp{(−μ+a)/s}+s2 exp{(μ−a)/s}, since the left-hand side is an increasing function of σ2. This motivates the following consistent estimator of μ, σ2:
A similar procedure may be applied to other interval mechanisms. For example, the above arguments apply to the I-I Design 1 since it is a special case of the I-P-I Design 1 when a=0. For another example, to estimate the variance σ2 for I- . . . -I Design 2, equation 37 implies that the {circumflex over (σ)}n2(Fx) in (49) is still consistent, except that Equation (48) needs to be replaced with
Example (Covariance Estimation).
Suppose that [X1, . . . , Xp] is a multivariate random variable whose covariance Σx is unknown and needs to be estimated. The covariance of each pair of variables, say X1, X2, satisfies
Therefore, Σx,12 may be estimated by the consistent estimator {circumflex over (Σ)}x,12=n−1Σi=1n{tilde over (X)}1,i{tilde over (X)}2,i, where {tilde over (X)}j,i is the surrogate value for Xj computed from the ith observation (for j=1, 2). In the above derivation, we used E(X1|{tilde over (X)}2)=E({tilde over (X)}1|{tilde over (X)}2). An implicit assumption for this to hold is that the interval mechanisms are applied independently to X1 and X2.
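A sketch of the covariance estimator, assuming zero-mean coordinates and our illustrative logistic-anchor D-transformation:

```python
import numpy as np

rng = np.random.default_rng(2)

def d_transform(x, u, s=1.0):
    # A D-transformation for a Logistic(0, s) anchor: E(X_tilde | X) = X.
    F = 1.0 / (1.0 + np.exp(-u / s))
    return np.where(x <= u, u - s / F, u + s / (1.0 - F))

n, rho = 400_000, 0.5
z = rng.normal(size=(n, 2))
x1 = z[:, 0]
x2 = rho * z[:, 0] + np.sqrt(1 - rho**2) * z[:, 1]   # Corr(X1, X2) = rho

# Independent interval mechanisms applied to each coordinate.
x1_t = d_transform(x1, rng.logistic(0.0, 1.0, n))
x2_t = d_transform(x2, rng.logistic(0.0, 1.0, n))

cov_hat = np.mean(x1_t * x2_t)    # n^{-1} sum of X1_tilde * X2_tilde (zero means)
print(cov_hat)                    # close to Sigma_x,12 = 0.5
```

The independence of the two anchor sequences is exactly the implicit assumption noted above; without it the cross term would not be unbiased.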
We may obtain
{circumflex over (ρ)}={circumflex over (Σ)}x,12/√{square root over ({circumflex over (σ)}n2(Fx1){circumflex over (σ)}n2(Fx2))}
as a consistent estimate of the correlation between X1 and X2.
Experiment 2 (Correlation Estimation).
In a simulation study, we let X1, X2 be bivariate Gaussian with correlation ρ=0, 0.5, 0.9. We calculated the sample correlation using the aforementioned method and its standard error (from 100 replications) under two different interval mechanisms, for sample sizes n=100, 300, 500. Mechanism 1 uses the I-P-I interval mechanism (Definition 5(b), Example 1 with s=1, a=1), and the privacy leakage is around 0.4. Mechanism 2 uses the I- . . . -I mechanism with m=10, and the privacy leakage is around 0.7. The results are reported in Table 4.
Use for More Complicated Learning Tasks.
In many learning problems to be elaborated in later sections, the solutions often reduce to the estimation of certain means or (co)variances. Often, we can simply replace the unobserved X with {tilde over (X)}. We exemplify this through the following two estimation problems. In the rest of the paper, we consider the following default setup. We use the I-I Design 1 in (22) as the default option to construct {tilde over (X)} from privatized data of X. We generate U in the way described in (26), with F being a Logistic distribution with scaling factor s, and by default s=1. We will assume that the interval mechanism is independently applied to each scalar random variable. As before, we will use {tilde over (X)} to denote the D-transformed variable of the raw data X, so that it satisfies E({tilde over (X)}|X, R)=X for any given raw data X and side information R (Corollary 1). Note that V({tilde over (X)}) usually depends on the distributions of the anchor points U and the raw data Y, and may or may not be finite. Though the existence of V({tilde over (X)}) may be verified analytically, it can be difficult to verify without knowing the distribution of Y in practice. We will assume that {tilde over (X)} has a finite variance for technical convenience when performing variance and covariance estimation in later sections.
Privatizing features X.
Suppose that the underlying data generating process is modeled by a linear regression model
Y=βTX+ε (53)
where X∈ℝp is a p-dimensional feature vector with zero mean and covariance matrix Σx, and ε is a random noise with zero mean and bounded variance that is independent of X. The unknown parameter is β∈ℝp. Suppose that the observed data are realizations of (Y, ZX), where ZX is the interval-privatized data for X. We apply a D-transformation to ZX to obtain {tilde over (X)}. It can be calculated that
E(Y{tilde over (X)})=E{YE({tilde over (X)}|Y)}=E{YE(X|Y)}=E(YX)=E(XXTβ)=Σxβ (54)
Thus, a consistent estimator of β is
{circumflex over (β)}=Σx−1En(Y{tilde over (X)}) (55)
where En denotes the empirical expectation. Moreover, if Σx is unknown, it may be further estimated.
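With Σx known (here the identity), the estimator (55) reduces to an empirical mean; a minimal simulation sketch, using an illustrative logistic-anchor D-transformation of our own construction:

```python
import numpy as np

rng = np.random.default_rng(3)

def d_transform(x, u, s=1.0):
    # A D-transformation for a Logistic(0, s) anchor: E(X_tilde | X) = X.
    F = 1.0 / (1.0 + np.exp(-u / s))
    return np.where(x <= u, u - s / F, u + s / (1.0 - F))

n, beta = 400_000, np.array([1.0, 2.0, -1.0])
X = rng.normal(size=(n, 3))                  # Sigma_x = I
y = X @ beta + rng.normal(size=n)

# Only the privatized surrogate of X is used for estimation.
X_tilde = d_transform(X, rng.logistic(0.0, 1.0, size=(n, 3)))

# beta_hat = Sigma_x^{-1} * empirical mean of Y * X_tilde, per (55); Sigma_x = I.
beta_hat = (y[:, None] * X_tilde).mean(axis=0)
print(beta_hat)    # close to [1, 2, -1]
```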
Privatizing Response Y.
In this scenario, Y is to be privatized. Similarly to (54), we have E({tilde over (Y)}X)=E(YX)=Σxβ, which leads to the estimator {circumflex over (β)}=Σx−1En({tilde over (Y)}X). Moreover, suppose that both the response Y and the features X are privatized. It holds that E({tilde over (Y)}{tilde over (X)})=Σxβ, and thus we may use
{circumflex over (β)}=Σx−1En({tilde over (Y)}{tilde over (X)}) (56)
as an estimator.
Proposition.
Assume that V({tilde over (X)}) and V(Y) are finite, and the data is generated from (53) with β=β* and p a fixed integer. The {circumflex over (β)} in (56) is a consistent estimator of β. Moreover, we have E(β*T{tilde over (X)}−Y)=0, E{β*TX(β*T{tilde over (X)}−Y)}=0, and E{β*T{tilde over (X)}(2)(β*T{tilde over (X)}−Y)}=0.
Regression Diagnostics.
The proposed D-transformation provides natural machinery to perform regression diagnostics, which were known to be challenging in the classical literature of survival analysis. In particular, we propose the following way of performing residual analysis to visualize the adequacy of postulated regression models and the assumption of independent noise. Suppose that we fit the linear regression model (53) using realizations of [Y, {tilde over (X)}], and obtain an estimated coefficient vector {circumflex over (β)}. Then we may plot {tilde over (Y)}−Y, where {tilde over (Y)}={circumflex over (β)}T{tilde over (X)} (as the y-axis), against {circumflex over (β)}TX (as the x-axis). If {circumflex over (β)}≈β*, then we expect to see a null residual plot when the postulated model is valid. Nevertheless, {circumflex over (β)}TX may not be known since X is unobserved. In practice, a data analyst may choose to query another interval-privatized copy {tilde over (X)}(2) of X, and use {circumflex over (β)}T{tilde over (X)}(2) as the x-axis. We do not use {circumflex over (β)}T{tilde over (X)} because it is usually correlated with {tilde over (Y)}−Y.
Suppose that we fit the following regression model
Y=f(x)+ε (57)
where f is an unknown regression function.
This model includes the model (53) as a special case. To address privatized Y, we may simply replace Y with its surrogate {tilde over (Y)}. This is because E({tilde over (Y)}|X)=E(Y|X)=f(X), and thus the solution to
is a consistent estimator of f* under mild conditions.
To estimate the model in (57) with privatized X, we cannot apply an argument similar to (54) due to the lack of linear structure imposed on f. Alternatively, it is tempting to directly use [Y, {tilde over (X)}] to fit a nonlinear model by minimizing some loss within a function class F as in (58). Assume that the data are indeed generated from (57) with some underlying correct f*∈F. The solution of (58) may not be consistent even in the linear regression case. This is conceptually related to the literature on regression attenuation, where regression slopes tend to shrink towards zero due to measurement error in covariates. For example, if one fits the simple linear regression Y=β0+β1X+ε using data X+η instead of X, where η is independent of X, the coefficient shrinks by a factor of V(X)/{V(X)+V(η)}. Obtaining a consistent estimate of a nonlinear f under privatized X is, in general, a challenging problem. We recommend the following solution. Given labeled data [Yi, Xi] that may come from a regression or classification task, we estimate the unknown function f by solving the following empirical risk minimization problem
where F is a regression function class and l:X×Y→ℝ+∪{0} is a task-related loss function; we then use the rescaled function {circumflex over (f)}n=λĝn to predict future X, where
The above λ is motivated by the exact shrinkage factor in the simple linear regression case. Our numerical studies show that the above general method works reasonably well for nonlinear regression. The pseudocode of general supervised learning is summarized in Algorithm 2.
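The exact form of λ is given in an equation not reproduced here; the sketch below uses the shrinkage factor from the simple linear case, λ=V({tilde over (X)})/V(X), and assumes V(X) is known (in practice it would be estimated as described earlier). The D-transformation used is our illustrative logistic-anchor construction.

```python
import numpy as np

rng = np.random.default_rng(4)

def d_transform(x, u, s=1.0):
    # A D-transformation for a Logistic(0, s) anchor: E(X_tilde | X) = X.
    F = 1.0 / (1.0 + np.exp(-u / s))
    return np.where(x <= u, u - s / F, u + s / (1.0 - F))

n, beta = 400_000, 2.0
x = rng.normal(size=n)
y = beta * x + rng.normal(size=n)
x_t = d_transform(x, rng.logistic(0.0, 1.0, n))

naive = np.mean(y * x_t) / np.mean(x_t**2)   # OLS on the surrogate: attenuated
lam = np.mean(x_t**2) / 1.0                  # lambda = V(X_tilde)/V(X); V(X)=1 assumed
print(naive, lam * naive)                    # naive is well below 2; rescaled is near 2
```

In this linear special case the rescaled slope coincides with the moment estimator (55); the value of λ lies in correcting generic nonlinear fits.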
Experiment
In an experiment, we simulated data from the following two data-generating models with n=1000, p=5. The data observed by the data collector/analyst is interval-valued data for each dimension of X using the I-P-I interval mechanism. In particular, the I-P-I Design 1 with U, V Design 1 and s=a=1 (as described in Table II) is applied to each dimension of X. We also report the cases when Y is not privatized and when it is privatized (with the same privacy mechanism applied to its standardization).
Model M1: a linear model Y=βTX+ε, where β=[1, . . . , 1]T∈ℝp and X1, . . . , Xp, ε˜N(0, 1) are independent. The signal-to-noise ratio (SNR), defined by V(E(Y|X))/V(ε), is 5.
Model M2: a nonlinear model Y=10 sin(πX1X2)+20(X3−0.5)2+10X4+5X5+ε, where X1, . . . , Xp are independent and uniformly distributed on [0, 1], and ε˜N(0, 1) is the independent noise. This data is also known as the Friedman1 dataset, with SNR around 24.
For the data generated from M1, we fit the linear regression model in (53); for the data generated from M2, we fit the (nonlinear) gradient boosting regression (with depth 3, 100 boosting stages, and 0.1 learning rate) as well as the linear regression. The predictive performance is summarized in Table 5 below, where we compared three methods, including the proposed method in Algorithm 2 and the mid-value method that simply uses the anchor points as a surrogate, namely
where k is such that (Uk−1, Uk] contains X, and U, Δ are defined in Definition 3.
The performance as evaluated by L1 and L2 losses is reported in Table 5. The results show that the proposed method works reasonably well even if both X and Y are privatized, except in misspecified linear regression models (where severe underfitting occurs). Moreover, interval-privatized Y tends to degrade performance compared with point-valued Y, which is conceivable.
It is worth noting that for the I-P-I design, a natural alternative method is to consider only the point-valued data of X and omit all observations involving intervals. Nevertheless, this is often not realistic for moderately large p. For instance, in our experiment, the probability of observing a point-valued X is on the order of 10−4 (by calculation). Thus, point values barely appeared in our replicated experiments (with a sample size of n=1000). Moreover, this method does not apply to many other designs that involve only intervals, and therefore we did not consider it in the experiments.
C. Linear Discriminant Analysis with Privatized Features
As we commented in equation (59), the proposed method may be directly used for classification where interval-transformed values are fed into an empirical risk minimization objective with a suitably chosen loss function. In some widely used classification models such as linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and logistic regression, one may develop more specific solutions. We will take LDA as an example in this subsection. A similar argument applies to QDA.
In a general classification problem, the raw data are in the form of (Y, X), where Y∈{1, . . . , k} denotes the class label and X∈ℝp is the feature vector, and the learning goal is to predict the label Y for any given X. LDA assumes that the data generating process is
Y˜multinomial(q), X|(Y=ℓ)˜N(μℓ,V), ∀ℓ=1, . . . ,k (62)
q=[q1, . . . , qk], μ1, . . . , μk, V are often estimated from training data. To minimize the Bayes risk, the prediction is obtained by the decision rule
Suppose that the interval mechanism is applied to each dimension of X, and the observed data is ZX. We may apply the D-transformation to obtain {tilde over (X)} from ZX, and then estimate μ1, . . . , μk, V from {tilde over (X)}. The estimation procedure is the same as the existing one based on X, except that we need to correct the diagonal elements of V. Nevertheless, in the presence of a large-dimensional X whose dimension p is comparable to the sample size n, various numerical studies indicate that using {tilde over (X)} directly as if it were X is more stable than the method that corrects V. A possible reason is that the inflated variance on V's diagonal enhances the numerical stability when computing the V−1 in (63), in a way similar to the ridge penalization method.
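A minimal sketch of the plug-in LDA described above, using {tilde over (X)} directly without correcting V (our illustration on simulated two-class Gaussian data, with an illustrative logistic-anchor D-transformation):

```python
import numpy as np

rng = np.random.default_rng(5)

def d_transform(x, u, s=1.0):
    # A D-transformation for a Logistic(0, s) anchor: E(X_tilde | X) = X.
    F = 1.0 / (1.0 + np.exp(-u / s))
    return np.where(x <= u, u - s / F, u + s / (1.0 - F))

# Two Gaussian classes with shared covariance I; only privatized features observed.
n = 20_000
mu = {0: np.array([1.0, 1.0]), 1: np.array([-1.0, -1.0])}
X = {c: rng.normal(size=(n, 2)) + mu[c] for c in (0, 1)}
Xt = {c: d_transform(X[c], rng.logistic(0.0, 1.0, size=(n, 2))) for c in (0, 1)}

# Plug-in LDA: class means from X_tilde (unbiased); pooled covariance from X_tilde
# has an inflated diagonal, which acts like a ridge-type stabilizer.
mu_hat = {c: Xt[c].mean(axis=0) for c in (0, 1)}
V_hat = np.cov(np.vstack([Xt[0] - mu_hat[0], Xt[1] - mu_hat[1]]).T)
w = np.linalg.solve(V_hat, mu_hat[0] - mu_hat[1])
b = -0.5 * w @ (mu_hat[0] + mu_hat[1])

# Evaluate the linear rule on fresh raw test data (equal priors).
test0 = rng.normal(size=(5000, 2)) + mu[0]
test1 = rng.normal(size=(5000, 2)) + mu[1]
acc = 0.5 * np.mean(test0 @ w + b > 0) + 0.5 * np.mean(test1 @ w + b < 0)
print(acc)    # near the Bayes accuracy for this configuration
```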
Experiment.
We exemplify the application to LDA with the ‘phoneme speech recognition’ dataset [43], which contains 4509 samples of digitized speech for five phonemes: aa, ao, dcl, iy, and sh, each sample being a speech frame of 32-ms audio. For each speech frame, a log-periodogram of length 256 was computed and stored as the feature vector X. In other words, we have p=256, k=5, and n=4509. We also considered n=1000 by random sampling. We consider three I- . . . -I interval mechanisms which use a) m=3 intervals and s=4 (the scale of the Logistic anchor distributions), b) m=3 and s=1, and c) m=6 and s=1 on the standardized data. The privacy coverages for the three mechanisms are around 0.9, 0.6, 0.3, respectively. We compared three data-processing procedures for LDA: interval data handled by the proposed D-transformation, interval data handled by the mid-value method in Equation (61), and using the raw data. The performance, as summarized in Table 6, shows that the proposed method is comparable to the oracle method using raw data, and it consistently outperforms the mid-value method. Moreover, it is interesting to see that the proposed method slightly outperforms the method based on the raw data in mechanisms (b) and (c), possibly because the inflated variance stabilizes the large matrix inverse.
For a continuous multivariate random variable (Y, X) where X∈ℝp and Y∈ℝ, a subspace S′ is called an effective dimension reduction (EDR) space if Y is independent of X conditional on PS′X, where PS′ denotes the projection matrix onto S′. Under mild conditions, the intersection of all the EDR spaces is again an EDR space, named the central space and denoted by S. If the dimension of S is much smaller than p, it is possible to represent a predictive model of Y using only a small number of variables (which are linearly transformed from X). To exemplify how our proposed method may be used for sufficient dimension reduction, we will consider a particular method called Sliced Inverse Regression (SIR).
Y=f(β1TX, . . . ,βdTX,ε) (64)
where X follows a p-dimensional elliptical distribution with mean zero and covariance matrix Σx. The βi's are unknown projection vectors, d is unknown but assumed to be much smaller than p, and the error ε is independent of X and has mean 0.
Note that, in general, the matrix B=[β1, . . . , βd]∈ℝp×d is not identifiable, but the column span of B, Col(B), is identifiable. The SIR method is based on the following key identity
Col(Λ)=ΣxCol(B) (65)
which holds for elliptical distributions of X. Given n i.i.d. samples [yi, xi], i=1, . . . , n, SIR first divides them into H slices according to the quantiles of Y at h/H, h=1, . . . , H. Let the sample mean of X in the h-th slice be x̄h. Then Λ=V(E(X|Y)) can be estimated by
The SIR then estimates the central space Col(B) by Col({circumflex over (Σ)}x−1{circumflex over (Λ)}d), where {circumflex over (Σ)}x is the empirical covariance of X, and {circumflex over (Λ)}d consists of the top d eigenvectors of {circumflex over (Λ)}.
Suppose that we apply the proposed D-transformation to the interval-privatized X and obtain {tilde over (X)}. Since E({tilde over (X)}|Y)=E(X|Y), we can use the same SIR procedure where we only replace the sample mean of X in each slice with {tilde over (x)}h, the sample mean of {tilde over (X)} in the h-th slice. Using a similar argument as was used in [45], [52], it can be proved under some regularity conditions that Col({circumflex over (B)}) converges in probability to Col(B). It is interesting to study extensions of the above method to high-dimensional settings, using LASSO-type shrinkages (see [51] and the references therein).
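The modified SIR procedure can be sketched as follows (our illustration; a single-index Gaussian model with Σx=I and an illustrative logistic-anchor D-transformation are assumed):

```python
import numpy as np

rng = np.random.default_rng(6)

def d_transform(x, u, s=1.0):
    # A D-transformation for a Logistic(0, s) anchor: E(X_tilde | X) = X.
    F = 1.0 / (1.0 + np.exp(-u / s))
    return np.where(x <= u, u - s / F, u + s / (1.0 - F))

n, p, H = 40_000, 4, 10
beta = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2.0)
X = rng.normal(size=(n, p))                       # Sigma_x = I
y = X @ beta + 0.5 * rng.normal(size=n)           # single-index model, d = 1
Xt = d_transform(X, rng.logistic(0.0, 1.0, size=(n, p)))

# SIR on privatized data: slice by quantiles of Y, average X_tilde per slice.
order = np.argsort(y)
slices = np.array_split(order, H)
means = np.stack([Xt[idx].mean(axis=0) for idx in slices])
Lambda_hat = sum(len(idx) / n * np.outer(m, m) for idx, m in zip(slices, means))

# Top eigenvector of Sigma_x^{-1} Lambda_hat (= Lambda_hat here) estimates span(beta).
vals, vecs = np.linalg.eigh(Lambda_hat)
b_hat = vecs[:, -1]
print(abs(b_hat @ beta))    # close to 1: estimated direction aligns with beta
```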
Experiment.
We exemplify SIR with interval data with n=1000, p=10, 50, and various models. We generate X, β1, . . . , βd∈ℝp and ε∈ℝ from independent standard Gaussians. For d=1, we generated data from the models M1, M2, M3, and M4 under the general formulation (64):
respectively, where μ=βTX. For d=2, the models we considered are
respectively, where μk=βKtX for k=1, 2. For each dataset, we compared the proposed SIR method using three mechanisms of interval data, as well as the original SIR method using raw data. The performance for d=1 and d=2 is summarized in Table 7 and Table 8, respectively.
Recall that the goal is to estimate Col(β), the space spanned by β. For performance evaluation, we considered two quantities: 1) the estimation error D(Col({circumflex over (β)}), Col(β)), where D(S, S′) denotes the Frobenius norm of the difference between the projection matrices associated with the subspaces S, S′, and 2) the estimated intrinsic dimension {circumflex over (d)}. The {circumflex over (d)} was selected using the Gap statistic, namely arg maxi:1≤i≤H−1(λi−λi+1), where λ1, . . . , λH are the singular values of {circumflex over (Λ)} in descending order. A benchmark value of D(S, S′) is d, which is achieved when the projection matrix of S′ is simply taken to be zero. Table 7 and Table 8 show that the proposed SIR method performs better as the privacy coverage decreases (or the intervals are on average narrower). The performance under Mechanisms (b) and (c) is comparable to the oracle method using raw data. Under Mechanism (a), however, the inflated variances of {tilde over (X)} caused by less-informative intervals tend to favor too many intrinsic dimensions, which in turn degrades the performance. For example, for p=50, Mechanism (a) selects {circumflex over (d)}=9 in all four data cases. The number 9 is the maximum possible, as we chose the number of slices to be H=10 in the experiments.
The principal components analysis (PCA) algorithm is a standard tool for identifying good low-dimensional approximations to high-dimensional data. We will show that PCA can be used for dimension reduction and visualization of multidimensional interval-valued data. PCA estimates a few principal axes in the feature space, representing the directions of maximum variance in the data. Suppose that we obtain an estimated covariance matrix {circumflex over (Σ)}X for X. Let {circumflex over (Σ)}X=ΩΛΩT be the eigendecomposition of {circumflex over (Σ)}X, where the diagonal matrix Λ is sorted in descending order. Let Ωd denote the matrix that consists of the first d columns of Ω. The transformation X↦ΩdTX maps each data point X∈ℝp to a lower-dimensional vector in ℝd.
Unlike PCA, Canonical Correlation Analysis (CCA) aims to quantify the associations between two sets of variables using a small number of linear combinations in each set. Suppose that the two datasets are represented by random variables X∈ℝp and Y∈ℝq whose joint distribution exists. CCA seeks vectors a∈ℝp and b∈ℝq that maximize the correlation between aTX and bTY, also named the first pair of canonical variables. Such a and b can be determined from the covariance matrix of [XT, YT], and thus our method is readily available to obtain estimates of a and b. As a result, we can measure and visualize the associations between two sets of interval-valued data.
Experiment.
In an experiment of visualizing multidimensional intervals in 2D, we considered the Iris dataset that consists of 50 samples from each of three flower species of ‘setosa’, ‘virginica’ and ‘versicolor’, with an overall 150 samples. Four features were measured from each sample. We apply the Mechanism (c) on (standardized) X to obtain the intervals and then apply the D-transformation to obtain {tilde over (X)}. Different procedures of PCA were applied to visualize the projection onto two dimensions.
In another experiment, we studied the Mmreg data that consists of 600 college students and two sets of variables, including (psychological) ‘locus of control’, ‘self concept’, ‘motivation’, and (academic) ‘reading’, ‘writing’, ‘math’, ‘science’. CCA may be used to associate the psychological measures and academic achievement measures. We calculated aTX and bTY in three settings similar to the PCA case and plotted them in the corresponding figures.
We use the Infant Health and Development Program (IHDP) dataset, via the DoWhy package. The data was from a study of infants followed from hospital discharge to 36 months between 1985 and 1988. Infants were randomized to the Intervention Group or the Follow-Up Group. The Intervention Group received home visits, attendance at a special child development center, and pediatric follow-up. The Follow-Up Group received only the pediatric follow-up component of the program. The treatment here is the intervention. The outcome was quantified through the combination of the mental, behavioral, and health statuses of infants. There are 25 variables, such as measures of cognitive development, behavioral status, health status, child's gender, race, etc., collected from both groups, which are possible confounders of the outcome and the intervention.
The goal is to estimate the average treatment effect on the outcome while accounting for the confounders (assuming the data are observational). To that end, we considered four methods, including the difference between the average outcome in the treatment group and the control group, propensity score matching, propensity score stratification, and inverse probability weighting, where the last three methods employ Logistic regression. In this experiment, we only privatized the outcome data by using three I- . . . -I designs applied to each of two practical scenarios: normalized data and unnormalized data. The estimated treatment effects are summarized in Table 9. As the data was collected through a randomized trial, we may estimate the causal effect by the difference of the average effect of the two groups without accounting for the confounders. The value estimated from the raw data is 4.02, which can be regarded as the oracle. The results indicate that the performance does not vary across different designs, and the estimates are close to the oracle. The performance in the first column exhibits larger standard errors because of its greater information loss.
Due to societal and ethical concerns, patients' private diagnostic data may not be publicly available. Nevertheless, it may help to publish privatized data to solicit powerful data-driven diagnosis methods from the scientific community. For instance, during the outbreak of Coronavirus Disease 2019 (COVID-19), a significant challenge in controlling the escalation of the disease was the shortage of test kits, which are mostly based on reverse transcription polymerase chain reaction (RT-PCR). Hospitals have been utilizing alternative diagnosis methods, including computed tomography (CT) scans, to facilitate the screening of COVID-19 patients. In this experiment, we apply interval privacy to CT image data and show its potential use for data-private image classification. We considered an open-sourced dataset ‘COVID-CT-Dataset’ consisting of CT images from patients infected or not infected by COVID-19 (labeled as ‘COVID’ and ‘Non-COVID’, respectively). The numbers in the two classes are balanced; among them, 425 images are for training and 203 for testing. Our goal is to first interval-privatize each pixel of the image, resulting in an image of pixel ranges. We apply the D-transformation to obtain the transformed images and train a classifier from them. We report the classification accuracy on the testing data. We note that the way we privatize an image is at the pixel level, and the anchor points are independent across pixels. The method could be replaced with more sophisticated methods, e.g., using dependent anchor points, or privatizing preprocessed features from blocks of pixels.
The model we used to train a classifier is a deep neural network named MobileNetV2. Since millions of pixels exist in each image and only a limited number of images are available, we used a transfer learning approach. In particular, we first store a model pre-trained on the ImageNet dataset, attach a two-layer feed-forward network with input dimension 1000, hidden dimension 20, and output dimension 2 (representing the two classes), and re-train the constructed neural network with our training images. During training, we resized each image to 224×224 pixels and used stochastic gradient descent with momentum 0.9 and initial learning rate 0.001, decayed by a factor of 0.1 every 7 epochs, for an overall of 50 epochs. We considered the use of raw data (‘Raw’), I-P-I interval data generated with s=a=1 (‘M1’), and I-I interval data generated with s=1 (‘M2’) and s=4 (‘M3’). The privacy leakages for the four scenarios are 100%, 43%, 24%, 7%, respectively. In the evaluation of testing performance, we consider two scenarios. In the first scenario, we use the original testing data. The numerical results, as summarized in the first three columns of Table 10, show that the performance of the proposed method is much better than that of random sampling, which is close to a random guess. The second scenario concerns the use of transformed testing data, parallel to the operation on the training data. Numerical results indicate that the performance in this scenario is better than in the first. This result indicates that the bias induced by a shift of covariate distributions (between training and testing) outweighs the change of learnability due to information loss. Interestingly, in both scenarios, the classification performance does not degrade as the privacy leakage reduces.
-transformation
Suppose that we have an urn that contains p=4 kinds of colored balls, say black, red, green, and blue. In an experiment, each of the n individuals draws a ball from this urn and then puts it back. The procedure generates n original data values. Due to privacy concerns, the data collector asks each individual whether a (randomly-generated) subset (e.g., {red, blue}) includes the actual color. Assuming everyone's answer is honest, our observation consists of two-element sets, each containing an individual's actual color.
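One simple honest-reporting scheme consistent with this description (our illustration, anticipating the independent mechanism introduced later): generate a random two-element subset independently of the color, and report it if it contains the color, otherwise report its complement. The empirical check below verifies that, conditioned on a reported set, the color frequencies match the restriction of the urn composition to that set; the assumed composition w is ours.

```python
import numpy as np

rng = np.random.default_rng(7)

colors = ["black", "red", "green", "blue"]
w = np.array([0.1, 0.2, 0.3, 0.4])           # urn composition (assumed for illustration)
pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]

def draw_observation():
    x = rng.choice(4, p=w)                    # individual's actual color
    a = pairs[rng.integers(len(pairs))]       # randomly generated two-element subset
    # Honest answer: report the subset if it contains the color, else its complement.
    if x in a:
        return x, a
    return x, tuple(sorted(set(range(4)) - set(a)))

obs = [draw_observation() for _ in range(100_000)]
# Sanity check of non-informativeness: among reports A = {black, red},
# the conditional frequency of 'black' should be w_black / (w_black + w_red) = 1/3.
hits = [x for x, a in obs if a == (0, 1)]
print(abs(np.mean([x == 0 for x in hits]) - 1 / 3) < 0.05)
```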
Notation. We let [p]={1, 2, . . . , p}. For a set a, let |a| denote its cardinality, ac the complement of a subset a, and 1E the indicator function of an event E. Let log and ln denote the base-2 and natural logarithms, respectively, with the convention 0 log 0=0. Let 1, Ip, diag(w), wT, and w⊗u denote the all-one vector, the p×p identity matrix, the diagonal matrix formed from a vector w, the transpose of the vector w, and the outer product of w, u, respectively. We define M† as the pseudo-inverse of a matrix M, tr(M)=Σmii the trace of a matrix, and ∥w∥q=(Σ|wi|q)1/q the Lq norm of a vector w. We use E for expectation and P for probability. We write X⊥Y if X is independent of Y, and X⊥̸Y otherwise.
We will focus on a scalar categorical random variable X, and then extend it to more general cases below.
Raw data: X∈[p] is assumed to be a random variable with distribution P(X=j)=w_j for j=1, 2, . . . , p.
We also write X∼p_w, w=(w_1, . . . , w_p)^T, where p_w is the probability mass function (PMF) and p_w(j)=w_j.
Privatized data: A∈A is assumed to be a random variable whose joint distribution with X exists. Here, A={a: a⊆[p]} denotes the set of all subsets of [p].
Recall that we will collect A instead of X to infer population-level information while protecting each individual's privacy. We require |A|≥2 so that the true category cannot be immediately identified. Consequently, we need p≥3 to create a non-trivial subset. For now, we assume p≥4, and defer the special cases of p=2, 3 to below. The case of p=1 is not interesting since the variable is deterministic. Also, it is appealing to design A so that it does not immediately imply X. Ideally, the only information provided by A about X is the event X∈A, which is the so-called non-informative property. This motivates the following definition.
Definition 7 (Subset Privacy):
A mechanism described by a Markov chain X→A meets subset privacy if |A|≥2 and, for any x∈[p] and a⊆[p] with P(A=a)>0, P(X=x|A=a)=P(X=x|X∈a).
Next, we introduce a general mechanism to generate a Markov chain X→A satisfying the subset privacy.
Definition 8 (Conditional Mechanism):
A conditional mechanism is a Markov chain X→A whose transition law is determined by
P(A=a|X=j)=μ_a 1_{j∈a}, ∀a⊆[p], j∈[p], (70)
where μ_a satisfies μ_a≥0, μ_a=0 whenever |a|<2, and Σ_{a: j∈a} μ_a=1 for every j∈[p]. (71)
Any specific realization {μa, a∈A} from a conditional mechanism is referred to as a conditional design. We will use {μa, a∈A} to represent a conditional design.
We will show that there exist infinitely many conditional designs (for p≥4). It can be verified that the conditional mechanism is the only mechanism that meets Definition 7. Nevertheless, it is nontrivial to find all the conditional designs that meet the above constraints.
Additionally, such a mechanism requires the data collector to know the true value X in order to realize the design and generate A. As a result, the mechanism is often convenient only from the data publisher's perspective. For these reasons, we introduce a special conditional mechanism, named the independent mechanism, that is easy to implement. The idea behind the independent mechanism is to generate a Markov chain X→A through another chain [X, Ã]→A, where Ã is independent of X, and A satisfies
A=Ã if X∈Ã, and A=Ãc if X∉Ã (72)
In other words, the mechanism independently generates a subset à and then uses it or its complement as the subset observation A.
Definition 9 (Independent Mechanism):
An independent mechanism is a Markov chain X→A induced by another Markov chain [X, Ã]→A, whose transition law is given by
Ã⊥X, P(Ã=ä)=v_ä, P(A=a|X=j, Ã=ä)=1_{a=ä} if j∈ä and 1_{a=ä^c} if j∉ä, ∀a, ä⊆[p], j∈[p], (73)
where v_ä satisfies v_ä≥0 and Σ_{ä⊆[p]} v_ä=1.
We will use {v_a, a∈A} to represent an independent mechanism throughout the paper. In practice, in order to avoid the trivial case a=[p], we usually require v_{[p]}=v_∅=0. It is conceivable that the final observation A will be non-informative, since A is constructed from Ã, which is independent of X. Technically, the transition law of X→A in Definition 9 can be characterized as P(A=a|X=j)=(v_a+v_{a^c}) 1_{j∈a}, so an independent mechanism is a conditional mechanism with μ_a=v_a+v_{a^c}.
As an example, a specific design under the independent mechanism assigns the same probability to all allowed subsets.
Definition 10 (Uniform Independence Design):
An independent design {v_a, a∈A} is uniform if v_a takes the same value for every allowed subset a (and zero for the rest).
Before we conclude this subsection, we introduce another representation of a that will be used frequently in the remainder of the paper. We represent a by 1_a, a vector in {0, 1}^p whose j-th coordinate is one if j∈a and zero otherwise. Let 1_A be the random vector corresponding to 1_a. With this notation, under a conditional design, the joint distribution of X, A is P(X=j, A=a)=w_j μ_a 1_{j∈a}, with marginal distribution P(A=a)=(1_a^T w) μ_a.
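As a quick sanity check on these formulas, one can enumerate a small design and verify that the induced marginal P(A=a)=(1_a^T w) μ_a sums to one. The sketch below does so for p=4 with an assumed distribution w and a uniform independent design over the two-element subsets (the symmetric choice for which both a subset and its complement can be published):

```python
from itertools import combinations

p = 4
w = [0.01, 0.10, 0.20, 0.69]           # assumed population distribution
full = frozenset(range(p))

# For p=4, the symmetric non-trivial subsets are the two-element ones.
subsets = [frozenset(c) for c in combinations(range(p), 2)]
v = {a: 1.0 / len(subsets) for a in subsets}    # uniform independent design
mu = {a: v[a] + v[full - a] for a in subsets}   # induced mu_a = v_a + v_{a^c}

# Marginal P(A=a) = (1_a^T w) * mu_a; it must sum to one over all subsets.
marginal = {a: sum(w[j] for j in a) * mu[a] for a in subsets}
assert abs(sum(marginal.values()) - 1.0) < 1e-9
```

The same check passes for any valid choice of {v_a}, since every column of the transition law sums to one.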
In this subsection, we will quantify the amount of privacy for a conditional design or a particular user.
1) Threat model: A potential adversary can access the subset observations {A_1, . . . , A_n}. We assume that the adversary does not know the identity of the individual associated with any subset A before observing the data. The adversary's goal is to infer an individual's identity, the underlying value X, or both. This setting is common in practice; for example, before publishing data, entities often first anonymize it by removing all identifiers. We allow potential adversaries to know all or part of the individuals' information.
2) Subset-size based privacy coverage and leakage: It is conceivable that a 'larger' subset A makes it more difficult to guess X, since it conveys less knowledge about X. Such knowledge involves not only the cardinality of A but also the probability of X belonging to A. This intuition motivates the following notion of privacy.
Definition 11 (Size Privacy):
The size of an individual subset a is defined as L(a)=P(X∈a)=1_a^T w. The size privacy coverage is τ=E[L(A)]=E[1_A^T w], and the size privacy leakage is S=1−τ=E[1−1_A^T w].
In the above definition, we use the total probability of the categories observed in a subset to measure each individual's privacy; the size privacy coverage is thus the average subset size. The defined privacy coverage and leakage are naturally interpretable for the unique subset data format. For an individual from whom the subset a is collected, the same subset may also be collected from an L(a) portion of the population; the larger L(a) is, the more ambiguity there is in identifying a specific individual. The size of a subset a depends on the underlying distribution w, which is usually unknown in practice. To actively control a lower bound on the privacy coverage, we will propose a modification of subset privacy below.
Example (Obfuscated urn, continued): Assume that the underlying distribution is given by P(black)=0.01, P(red)=0.1, P(green)=0.2, and P(blue)=0.69. Suppose that we observe a={red, green, blue} from an individual. Then we know X is not black. However, this subset eliminates only a very small proportion of the uncertainty; knowing a will not significantly help us guess X. This is aligned with the subset size of 0.99, interpreted to mean that the individual datum is very secure given the observation.
In another case with a={black, red}, we know that a small-probability event has occurred. In this case, knowing a leaks more information about X. Quantitatively, the subset size of a is 0.11, meaning a relatively higher privacy leakage. Under the uniform design, the average subset size (or privacy coverage) for this example is about 0.684.
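The two subset sizes in this example can be reproduced directly from Definition 11; a minimal sketch:

```python
# Underlying distribution from the running urn example.
w = {"black": 0.01, "red": 0.10, "green": 0.20, "blue": 0.69}

def subset_size(a):
    # L(a) = P(X in a): total probability mass covered by the subset.
    return sum(w[c] for c in a)

assert round(subset_size({"red", "green", "blue"}), 2) == 0.99  # secure
assert round(subset_size({"black", "red"}), 2) == 0.11          # leaky
```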
We note that subset privacy is a concept in a relative sense, comparing the information before and after observing A. As a particular example, suppose that we know there is only one possible color, red. Then, even if we observe {red, green}, the coverage is one because no additional information is leaked. In this sense, it is considered private.
3) Information-theoretic privacy coverage and leakage: From an information-theoretic perspective, a data privacy mechanism X→Y may be treated as a Markov chain, where the privatized data Y may take values in a general alphabet.
The subset privacy can also be quantified from the information-theoretic view. One natural idea is to measure the information contained in A about X using the mutual information.
Definition 12 (Mutual Information Privacy):
For a given design, the mutual information privacy leakage is the mutual information I(X; A), and the privacy coverage is the conditional entropy H(X|A)=−E[log P(X|A)].
Another natural idea is to measure the privacy leakage by the maximal probability (or 'prediction risk') of correctly guessing the label X given A. Formally, let a random variable Y∈[p] be our guess for X. Then X→A→Y forms a Markov chain, since Y depends only on A. The prediction risk is the largest probability P(X=Y) among all conditional distributions Y|A.
Definition 13 (Prediction Privacy):
For a given design, the prediction privacy leakage is the prediction risk max_{Y|A} P(X=Y)=E[max_j P(X=j|A)], and the prediction privacy coverage is one minus the leakage.
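Both leakage notions can be computed exactly for a small design. The sketch below evaluates them for the urn example under an assumed uniform design over two-element subsets (mutual information via I(X;A)=E[−log₂ L(A)] for a conditional design, and prediction leakage via the posterior maximum); the specific distribution and design are illustrative assumptions:

```python
from itertools import combinations
from math import log2

w = [0.01, 0.10, 0.20, 0.69]     # assumed distribution from the urn example
p = len(w)
subsets = [frozenset(c) for c in combinations(range(p), 2)]  # 2-element subsets
mu = 2.0 / len(subsets)          # uniform design: mu_a = v_a + v_{a^c}

mutual_info = 0.0                # I(X; A), the information-theoretic leakage
prediction_leakage = 0.0         # E[max_j P(X=j | A)], the prediction leakage
for a in subsets:
    pa = sum(w[j] for j in a) * mu               # P(A=a)
    for j in a:
        pja = w[j] * mu                          # P(X=j, A=a)
        mutual_info += pja * log2(pja / (w[j] * pa))
    # Posterior P(X=j|A=a) = w_j / (1_a^T w) for j in a, zero otherwise.
    prediction_leakage += pa * max(w[j] for j in a) / sum(w[j] for j in a)

assert 0.0 < mutual_info < 2.0                   # at most H(X) <= log2(p) bits
assert max(w) < prediction_leakage <= 1.0
```

The leakage values depend jointly on the design {μ_a} and the population distribution w, in line with the relative nature of subset privacy noted above.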
In this subsection, we show that the one-variable case can be generalized to the multi-variable case. Suppose we have a d-dimensional random vector X=(X_1, . . . , X_d)^T, each X_k being a discrete random variable with p_k categories. The observation is A, a random variable whose value is a subset of all possible outcomes of X. We can construct a mapping from X to a single variable X̃, where X̃ has Π_k p_k categories. In this way, the problem degenerates to the d=1 case.
A practical concern with constructing such a mapping is the large sample space of the subset A, as the number of possible subsets, 2^{Π_k p_k}, grows exponentially with d. To address this concern, we propose another way to deal with multiple variables. In particular, we restrict the form of the observation A to the product of subsets corresponding to each X_k. In other words, A can be decomposed as a product A=(A_1, . . . , A_d), where each A_k is a subset of the possible outcomes of X_k. For implementation, we observe A_k from each X_k with a conditional mechanism and then put them together as A. As a result, the conditional mechanism for multiple variables becomes the following product mechanism.
Definition 14 (Product Mechanism):
A product mechanism is a Markov chain X→A whose transition law is determined by
P(A=(a_1, . . . , a_d)|X=(j_1, . . . , j_d))=Π_{k=1}^d μ_{a_k}^{(k)} 1_{j_k∈a_k},
where {μ_{a_k}^{(k)}, a_k⊆[p_k]} is a conditional design for the k-th coordinate.
The advantage of using a product mechanism is that the sample space of A is a product space over the coordinates, with cardinality 2^{Σ_k p_k}, which is much smaller than 2^{Π_k p_k}.
D. Extension to Two- or Three-Category Variables (p=2, 3)
In this subsection, we propose two solutions to address the case when p=2 or 3.
The first solution is to combine every two variables with fewer than four categories into one variable. For example, suppose we have two binary variables, say gender (male, female) and income (low, high). We represent them as a new variable with four categories and apply subset privacy based on this new variable. The inference of each variable is often straightforward from the inference of the combined variable. For example, the probability of observing one category is implied by the marginal distribution. The above method can be directly applied to any even number of such variables. If we have an odd number of variables with fewer than four categories, a solution is to combine the remaining one with an artificial variable. In a practical data collection system, this technique is easy to implement. Specifically, the data holder (who owns the private data) and the data collector may use an open-source program to generate a random number as an artificial variable.
An alternative method is to introduce dummy categories to the data, which is elaborated below.
In this section, we provide methods to estimate the population distribution w for a single variable X. Recall that the observed data (to the data collector/analyst) are n subsets A_1, . . . , A_n generated from some design {μ_a, ∀a∈A} and i.i.d. original data X_1, . . . , X_n.
A standard way to estimate w is the maximum likelihood estimator (MLE). Here we discuss model identifiability, consistency, and the asymptotic normality of the MLE. Note that w has only p−1 degrees of freedom. We reparameterize the parameter space with θ=(w_1, . . . , w_{p−1})^T. Hence, we have w=α+Bθ, where α=(0, . . . , 0, 1)^T and B=(I_{p−1}, −1)^T.
Let x_i, a_i be n i.i.d. realizations of X, A. The log-likelihood function is
l_n(w)=Σ_{i=1}^n log(1_{a_i}^T w), (80)
up to a constant Σ_{i=1}^n log μ_{a_i} that does not depend on w. Since there is a one-to-one mapping between w and θ, with a slight abuse of notation, we will also write the log-likelihood function as
l_n(θ)=Σ_{i=1}^n log(1_{a_i}^T(α+Bθ)). (81)
As we will discuss, Eq. (81) is more convenient for optimization, while Eq. (80) is suitable for theoretical analysis.
First, we provide a sufficient and necessary condition to guarantee the model identifiability.
Assumption 1 (Identifiability): If for all a with μa>0, 1aTBu=0 holds, then u=0.
Theorem: A subset privacy model X→A is identifiable if and only if Assumption 1 holds. Moreover, for an independent mechanism, Assumption 1 is equivalent to the following: if for all a with μa>0, 1aTu=0 holds, then u=0.
The following assumption is sufficient to guarantee the uniqueness of the MLE.
Assumption 2 (Uniqueness of MLE): Matrix RB is full-rank, where R=(1_{a_1}, . . . , 1_{a_n})^T.
Note that Assumption 2 implies Assumption 1; conversely, if Assumption 1 holds, then Assumption 2 also holds with high probability for large n. To show the asymptotic normality of the MLE, we need the following regularity assumption.
Assumption 3 (Bound on log density): We assume there is a positive constant c such that 1_a^T w≥c for every a with μ_a>0.
Theorem (Consistency): Under Assumption 1, the MLE satisfies θ̂_n→θ a.s. as n→∞. As a consequence, ŵ_n→w a.s. as n→∞.
Theorem (Asymptotic normality): Under Assumptions 1 and 3, the MLE satisfies √n(θ̂_n−θ)→N(0, I(θ)^{−1}) in distribution. Hence √n(ŵ_n−w)→N(0, B I(θ)^{−1} B^T) in distribution, where I(θ) is the Fisher information matrix.
Next, we briefly introduce two algorithms to compute the MLE. It can be verified that l_n(w) is concave, and thus we need to solve a convex optimization problem: maximize l_n(w) over the feasible range of w introduced at the beginning of this subsection.
A general algorithm to approximate the MLE is the expectation-maximization (EM) algorithm, where we treat X as missing data. Applying the EM algorithm to Eq. (81), we can derive the iterative updating formula w_j^{(m+1)}=n^{−1} Σ_{i=1}^n P_m(X_i=j|a_i), where P_m(X_i=j|a_i)=w_j^{(m)} 1_{j∈a_i}/(1_{a_i}^T w^{(m)}).
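The EM update takes only a few lines with NumPy. The sketch below is illustrative (simulated data under an assumed uniform design over two-element subsets for p=4; the sample size and tolerance are assumptions, not part of the disclosure):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
p = 4
w_true = np.array([0.01, 0.10, 0.20, 0.69])    # assumed population distribution

# Uniform design over two-element subsets, stored as indicator vectors 1_a.
subsets = [np.array([int(j in c) for j in range(p)])
           for c in combinations(range(p), 2)]

def observe(x):
    # Independent mechanism: publish the probe if it contains x, else its complement.
    a = subsets[rng.integers(len(subsets))]
    return a if a[x] else 1 - a

n = 20000
obs = np.array([observe(rng.choice(p, p=w_true)) for _ in range(n)])

# EM update: w_j^(m+1) = n^-1 sum_i P_m(X_i=j | a_i), with
# P_m(X_i=j | a_i) = w_j^(m) 1_{j in a_i} / (1_{a_i}^T w^(m)).
w = np.full(p, 1.0 / p)
for _ in range(200):
    post = w * obs                               # w_j^(m) 1_{j in a_i}, per row
    post /= post.sum(axis=1, keepdims=True)      # normalize each posterior
    w = post.mean(axis=0)

assert abs(w.sum() - 1.0) < 1e-9
assert np.abs(w - w_true).max() < 0.05           # close to the truth for large n
```

Each iteration is an E-step (the row-wise posteriors) followed by an M-step (averaging them), so the update preserves the simplex constraint automatically.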
Calculating the MLE is often computationally expensive, as there is no closed-form solution. Even with an approximate method such as the EM algorithm, many iterations are still needed to converge. When the design is known, we propose a much faster estimator using the method of moments (MoM) and prove its consistency and asymptotic normality.
We introduce a coefficient matrix Q=(q_{ij}), q_{ij}=Σ_{a: i,j∈a} μ_a, ∀i, j∈[p], with the i-th row denoted by q_i. Note that for the k-th coordinate of the random vector 1_A=(1_{A,1}, . . . , 1_{A,p}), the first-order moment is γ_k=E[1_{A,k}]=q_k^T w. Hence, γ=(γ_1, . . . , γ_p)^T=Qw establishes the relationship between the first-order moment of 1_A and the population w. With the empirical first-order moment {circumflex over (γ)}=n^{−1} Σ_{i=1}^n 1_{a_i}, the MoM estimator ŵ is defined as the solution of
Qŵ={circumflex over (γ)}. (85)
Assumption 1 ensures that Q is positive definite, so ŵ=Q^{−1}{circumflex over (γ)}. We introduce C=Cov(1_A), the covariance matrix of 1_A.
Then, as a direct application of the central limit theorem, we have that the MoM estimator defined in Eq. (85) satisfies √n(ŵ−w)→N(0, Q^{−1}CQ^{−1}) in distribution.
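Since Q is determined by the known design and γ̂ is an empirical mean, the MoM estimate reduces to a single linear solve. A sketch under the same assumed uniform two-subset design for p=4 (all parameters are illustrative assumptions):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
p = 4
w_true = np.array([0.01, 0.10, 0.20, 0.69])    # assumed population distribution

subsets = [np.array([int(j in c) for j in range(p)])
           for c in combinations(range(p), 2)]  # uniform design over 2-subsets
mu = 2.0 / len(subsets)                         # mu_a = v_a + v_{a^c}

# Q_ij = sum of mu_a over subsets a containing both i and j.
Q = sum(mu * np.outer(a, a) for a in subsets)

def observe(x):
    a = subsets[rng.integers(len(subsets))]
    return a if a[x] else 1 - a

n = 20000
obs = np.array([observe(rng.choice(p, p=w_true)) for _ in range(n)])

gamma_hat = obs.mean(axis=0)            # empirical first moment of 1_A
w_hat = np.linalg.solve(Q, gamma_hat)   # MoM: solve Q w = gamma_hat
assert abs(w_hat.sum() - 1.0) < 1e-6
assert np.abs(w_hat - w_true).max() < 0.05
```

Unlike the EM iteration, this estimator is non-iterative, which is why it is substantially faster at a comparable accuracy for moderate p.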
In particular, for the uniform design there is a much simpler form: by symmetry, Q takes the form c_1 I_p+c_2 11^T, so ŵ can be computed in closed form without a numerical matrix inversion. The simplification uses the constraint Σ_i ŵ_i=1.
Based on the previous MoM estimator, which has an n^{−1/2} rate of convergence, we can further construct an asymptotically efficient estimator (just like the MLE) using the following one-step update.
θ̂_one=θ̂_MoM−[∇²l_n(θ̂_MoM)]^{−1} ∇l_n(θ̂_MoM), ŵ_one=α+Bθ̂_one, (89)
where ∇² is the Hessian operator and ∇ is the gradient operator.
Under Assumptions 1 and 3, the one-step estimator defined in Eq. (89) has the same asymptotic distribution as the MLE.
We show that when p grows with n, the MoM estimator is still consistent under some conditions. Let the MoM estimator be ŵ. Under the uniform design, (p² ln p)/n→0 guarantees that ∥ŵ−w∥_1→0 in probability as n→∞, and p/n→0 guarantees that ∥ŵ−w∥_2→0 in probability as n→∞. Moreover, it can be proved using the Berry-Esseen theorem that each coordinate of the MoM estimator is asymptotically normal under the uniform design, even when p>n.
As we mentioned above, the product mechanism is more suitable for multi-variable cases. In this section, we study the two-variable case by introducing a contingency table for subset data and then proposing methods for independence tests.
Suppose that we have two random variables X∼p_{w_X} and Y∼p_{w_Y}, with w_X∈ℝ^p, w_Y∈ℝ^q, and joint distribution P(X=x, Y=y)=w_{xy}, denoted by (X, Y)∼p_W with W∈ℝ^{p×q}. By Definition 8 of the product design, if we let both X→A and Y→B be conditional designs, then (X, Y)→(A, B) is a product design. Let {μ_a^A, ∀a∈A}, {μ_b^B, ∀b∈B} denote the two conditional designs and s=|A|, r=|B| denote the cardinalities of possible outcomes. A typical s is much larger than p, and it is at most 2^p−p−2.
The observations under this setting are n i.i.d. pairs of Ai, Bi. Thus, it is natural to consider a two-way contingency table for A and B. In particular, the table N=({nab})s×r, where each nab is the number of observations satisfying A=a, B=b.
We study how to perform independence testing between X and Y based on the subset-valued contingency table N. Our null hypothesis is H_0: X⊥Y against the alternative H_1: X⊥̸Y. It can be verified that the joint and marginal distributions of A, B are
P(A=a, B=b)=μ_a^A μ_b^B 1_a^T W 1_b, P(A=a)=μ_a^A 1_a^T w_X, P(B=b)=μ_b^B 1_b^T w_Y. (90)
Hence, the above hypothesis is equivalent to H_0: A⊥B against H_1: A⊥̸B. Based on this equivalence, we develop four tests below and compare them numerically.
The methods are based on the estimation of distributions. With the contingency table, we can consistently estimate the joint distribution W using any of the methods mentioned above with a subtle modification. Let Ŵ, ŵ_X, ŵ_Y be the estimated parameters. We then estimate the distributions of A, B with
P̂(A=a, B=b)=μ_a^A μ_b^B 1_a^T Ŵ 1_b, P̂(A=a)=μ_a^A 1_a^T ŵ_X, P̂(B=b)=μ_b^B 1_b^T ŵ_Y. (91)
1) Pearson's Chi-square test: The classical Pearson's Chi-square test statistic in our context is
T_P=Σ_{a,b}(n_{ab}−ε_{ab})²/ε_{ab}, where ε_{ab}=n P̂(A=a) P̂(B=b) (92)
is the expected number of observations under the null hypothesis. Under the null hypothesis, T_P asymptotically converges to a χ²_{sr} distribution as n→∞. A common rule (followed also in our experiments) suggests using Pearson's Chi-square test only when all n_{ab}≥5.
2) Likelihood ratio test (LRT): Recall that the joint distribution of A, B is determined by W. The LRT here is based on the estimation of W and w_X, w_Y. In other words, the null is H_0: W=w_X⊗w_Y against H_1: W represents a free discrete distribution. The test statistic is
T_L=2(l_n(Ŵ)−l_n(ŵ_X⊗ŵ_Y)). (93)
Under the null hypothesis, T_L converges to χ²_{(p−1)(q−1)} in distribution as n→∞. Hence, the limiting distribution has fewer degrees of freedom than that of T_P.
3) Approximated LRT: From the experimental results of the parameter estimation methods discussed below, we find that the MoM estimator is significantly faster than the other estimators with similar accuracy. This motivates us to replace the MLE with the MoM estimator when evaluating the LRT. This leads to a test statistic identical to Eq. (93), except that the distribution parameters are estimated by MoM. Though this test only approximates the LRT, we experimentally find that the statistical power does not degrade.
4) Bonferroni correction: Instead of testing H_0: A⊥B, we simultaneously test the independence of each pair of coordinates of 1_A=(1_{A,1}, . . . , 1_{A,p})^T and 1_B=(1_{B,1}, . . . , 1_{B,q})^T with a Bonferroni correction. In other words, we test
H_0: 1_{A,x}⊥1_{B,y}, ∀x=1, . . . , p, y=1, . . . , q. (94)
It proceeds as follows. For each x=1, . . . , p and y=1, . . . , q, we use Pearson's Chi-square test to calculate a p-value p_{xy} for H_0: 1_{A,x}⊥1_{B,y}, and we reject the overall null hypothesis at level α if the smallest p_{xy} is below α/(pq).
In this section, we provide a real-data example to demonstrate subset privacy and the developed methods. We use the 'Adult' (also known as 'Census Income') dataset, which Barry Becker extracted from the 1994 Census database. The dataset contains N=32561 observations with 14 attributes. Among these attributes, race, gender, and income are of particular interest.
There are five categories of race; the numerical labels and the population distribution w are summarized in Table 12. Race is sometimes considered sensitive information, and it can be protected by subset privacy. Here, we show that our method provides desirable privacy and efficiently estimates the population distribution.
In the experiments, we randomly draw n samples from the original dataset in each replication. We draw subset observations based on the uniform design and estimate ŵ. To stabilize the estimation, we replicate k=100 times and record the average scaled L_2 loss n∥ŵ−w∥²_2, whose expectation is expected to converge to a constant as n→∞ for the proposed methods. We implement and compare the estimators above, including the MLE computed by the EM algorithm ('EM'), the MLE solved by the general-purpose optimization package CVXPY ('MLE'), the moment-based estimator ('MoM'), and the one-step estimator derived from MoM ('ONE'). The additional line named 'Sample' is calculated from the original non-private data. The results for different values of n are summarized in
We calculate the privacy leakage for three designs using the three measures introduced in Subsection II-B. The three designs are the non-private design (A={X}), the uniform design, and the fully-private design (A=range(X)). The results are shown in
As we expect, the privacy leakage of the uniform design lies between no privacy and full privacy. The mutual information privacy leakage is half the maximum leakage, meaning that the information about X contained in a subset observation is one bit (whether X is in the subset or not). The size privacy leakage decreases from 0.25 to 0.15 under the uniform design. The interpretation is that, on average, we can eliminate 0.25 of the probability mass by observing the non-private data X, while we can eliminate only 0.15 from the private data A. The prediction privacy leakage decreases from 1 to 0.9, meaning that the probability of correctly guessing the label is 0.9 under the uniform design. Without any privatization, we immediately know the true label, so the leakage is 1. One may worry that 0.9 represents a considerable leakage. However, it should be compared with the minimum leakage: if we have no knowledge about X, the optimal strategy is to guess the most likely category, 'white' in this case, with an average accuracy of 0.85.
In the above results, the measurement of privacy is averaged over the population. Recall that the size and prediction privacy can be applied to evaluate each individual's privacy as well. We illustrate the idea below.
Because the true category coincides with the most likely one, the prediction leakage is one despite the privatization. The situation changes for Case 14, whose category is Asian-Pac-Islander. The subset observation includes white, so the optimal guess is still white, and an adversary will not coincidentally recover the true label. Moreover, a large amount of uncertainty is introduced because of the inclusion of white in the subset. The subset size leakage decreases significantly from 0.97 to 0.11 (compare the ball size of gray plus cyan with gray alone).
In conclusion, at the individual level, subset privacy protects rare categories much better than frequent categories, which is an attractive property in real-world applications.
The last application is to test the independence of gender and income under subset privacy. The null hypothesis is that they are independent. The contingency table from the whole non-private dataset is presented in Table III. Pearson's Chi-square test has a p-value close to zero, indicating that the two variables are effectively dependent at the population level. In the experiment, we take the dependence between gender and income as the ground truth.
We will show the power of our tests while controlling the probability of a type I error at 5%. Specifically, we consider Pearson's Chi-square test ('Pearson'), the likelihood ratio test designed for subset privacy ('LRT:MLE'), the likelihood ratio test with the MLE approximated by MoM ('LRT:MoM'), and the Bonferroni correction-based test ('Bon'). The subset data are generated as follows. First, since both gender and income are binary variables, after sampling n observations from the original dataset, we apply the 'p=2' uniform design introduced in Subsection II-D to sample subsets through the Markov chain X→Y→A. Then, we perform all four tests based on the subset observations. For comparison, we also apply Pearson's Chi-square test to the non-private samples X and the intermediate data Y, denoted by 'PT:X' and 'PT:Y', respectively. We independently replicate the above procedure k times, and each time we calculate the p-value and power for each test. The results are summarized in
We note that the transformation from the original data X to intermediate data Y loses some information, and the subset sampling further loses information.
In the following, we highlight some subtle aspects of subset privacy and elaborate on its connections with (local) differential privacy.
In this section, we propose a modification of subset privacy, which enables us to control the privacy coverage. In other words, we can set an arbitrary lower bound (within [0, ½)) for the size privacy coverage.
We consider a one-dimensional variable X∈[p] for brevity. The technique can be extended to the multi-dimensional case using the product design. We further consider an independent design {v_a, a∈A} with v_a=v_{a^c}.
For a parameter 0<α<½ that gives the lower bound on the privacy coverage, we normalize the population distribution by w̃_j=(1−2α)w_j for j∈[p] and w̃_j=α for the dummy categories j=p+1, p+2. Since each dummy category has probability α, and a published subset must contain one of them, the size coverage is lower bounded by α. Accordingly, we state the following new subset-generating mechanism. If X is a true user (X∈[p]), then we draw a subset A⊆[p] with X∈A using the original design {v_a, a∈A}, and then add one of the two dummy categories to A with equal probability to produce the final output subset Â. If X is a dummy, we draw Ã from {v_a, a∈A} and let Â=Ã∪{X}. It can be seen that this procedure defines a new conditional design on the enlarged data domain [p+2]. Hence, we can apply all of the developed methods to analyze the enlarged subset Â.
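The enlargement step can be sketched as follows; the inner two-element design, the value of α, and the base distribution are illustrative assumptions rather than part of this disclosure:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 0.2                       # assumed lower bound on size privacy coverage
w = np.array([0.01, 0.10, 0.20, 0.69])
p = len(w)

# Enlarged distribution: shrink the true categories and append two dummy
# categories (indices p and p+1 here), each carrying probability alpha.
w_tilde = np.concatenate([(1 - 2 * alpha) * w, [alpha, alpha]])
assert abs(w_tilde.sum() - 1.0) < 1e-12

def publish(x):
    # Illustrative inner design: a two-element subset of [p] containing x,
    # plus one of the two dummy categories chosen with equal probability.
    other = int(rng.choice([j for j in range(p) if j != x]))
    return {x, other, p + int(rng.integers(2))}

sizes = [sum(w_tilde[j] for j in publish(int(rng.choice(p, p=w))))
         for _ in range(200)]
assert min(sizes) >= alpha        # size coverage is lower bounded by alpha
```

Because every published subset contains a dummy category of mass α, the size L(Â) can never fall below α, which is exactly the guaranteed coverage.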
In summary, in addition to collecting the data from n true users, we also simulate 2αn/(1−2α) dummy users, each associated with a dummy category. Then, the collection system creates subsets from the induced design stated above. As long as α is revealed, the inference of the true users' population information can be derived. This modification ensures a minimum size privacy of α while retaining the virtue of faithfulness: users only have to answer a relatively non-sensitive question of whether the true value is contained in a subset, with a guaranteed privacy coverage.
Using a technique similar to the one proposed above, we introduce another way to address two- or three-category variables. We only have to allow the independent design to sample subsets A with |A|=1 or |A|=p−1. For instance, when p=2 with dummy categories 3 and 4: if X∈[p], then Â={X, 3} or Â={X, 4} with equal probability; if X is a dummy, then Â={X, 1} or Â={X, 2} with equal probability.
In
The present embodiments transform the private data into interval data 112 that can be shared with data collector devices 102 without exposing the private data itself. In particular, an interval transform 110 generates interval data 112 from private data 106 using the method of
In step 200 of
At step 204, interval transform 110 determines which of the returned intervals the selected private datum 106 falls within. Interval transform 110 then stores, as interval data 112, the identified interval together with the set of anchor points/intervals that were returned for the selected private datum at step 206. Interval transform 110 then determines whether there are more private data 106 at step 208. When there are more private data, the next private datum is selected by returning to step 200. When all of the private data 106 have been processed, the interval data is sent to or made accessible to data collector device 102 while continuing to secure private data 106 from data collector device 102. Thus, data collector device 102 is only able to learn which interval each private datum was in and, for each private datum, the set of intervals that were available for that datum.
One example of a data collector device 102 is one that trains a model from the interval data.
In step 400 of
Embodiments of the present invention can be applied in the context of computer systems other than computing device 10. Other appropriate computer systems include handheld devices, multi-processor systems, various consumer electronic devices, mainframe computers, and the like. Those skilled in the art will also appreciate that embodiments can also be applied within computer systems wherein tasks are performed by remote processing devices that are linked through a communications network (e.g., communication utilizing Internet or web-based software systems). For example, program modules may be located in either local or remote memory storage devices or simultaneously in both local and remote memory storage devices. Similarly, any storage of data associated with embodiments of the present invention may be accomplished utilizing either local or remote storage devices, or simultaneously utilizing both local and remote storage devices.
Computing device 10 further includes an optional hard disc drive 24, an optional external memory device 28, and an optional optical disc drive 30. External memory device 28 can include an external disc drive or solid state memory that may be attached to computing device 10 through an interface such as Universal Serial Bus interface 34, which is connected to system bus 16. Optical disc drive 30 can illustratively be utilized for reading data from (or writing data to) optical media, such as a CD-ROM disc 32. Hard disc drive 24 and optical disc drive 30 are connected to the system bus 16 by a hard disc drive interface 32 and an optical disc drive interface 36, respectively. The drives and external memory devices and their associated computer-readable media provide nonvolatile storage media for the computing device 10 on which computer-executable instructions and computer-readable data structures may be stored. Other types of media that are readable by a computer may also be used in the exemplary operation environment.
A number of program modules may be stored in the drives and RAM 20, including an operating system 38, one or more application programs 40, other program modules 42 and program data 44. In particular, application programs 40 can include programs for implementing any one of modules discussed above. Program data 44 may include any data used by the systems and methods discussed above.
Processing unit 12, also referred to as a processor, executes programs in system memory 14 and solid state memory 25 to perform the methods described above.
Input devices including a keyboard 63 and a mouse 65 are optionally connected to system bus 16 through an Input/Output interface 46 that is coupled to system bus 16. Monitor or display 48 is connected to the system bus 16 through a video adapter 50 and provides graphical images to users. Other peripheral output devices (e.g., speakers or printers) could also be included but have not been illustrated. In accordance with some embodiments, monitor 48 comprises a touch screen that both displays input and provides locations on the screen where the user is contacting the screen.
The computing device 10 may operate in a network environment utilizing connections to one or more remote computers, such as a remote computer 52. The remote computer 52 may be a server, a router, a peer device, or other common network node. Remote computer 52 may include many or all of the features and elements described in relation to computing device 10, although only a memory storage device 54 has been illustrated in
The computing device 10 is connected to the LAN 56 through a network interface 60. The computing device 10 is also connected to WAN 58 and includes a modem 62 for establishing communications over the WAN 58. The modem 62, which may be internal or external, is connected to the system bus 16 via the I/O interface 46.
In a networked environment, program modules depicted relative to the computing device 10, or portions thereof, may be stored in the remote memory storage device 54. For example, application programs may be stored utilizing memory storage device 54. In addition, data associated with an application program may illustratively be stored within memory storage device 54. It will be appreciated that the network connections shown in
Although elements have been shown or described as separate embodiments above, portions of each embodiment may be combined with all or part of other embodiments described above.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are described as example forms of implementing the claims.
The present application is based on and claims the benefit of U.S. provisional patent application Ser. No. 63/119,811, filed Dec. 1, 2020, the content of which is hereby incorporated by reference in its entirety.