The invention relates generally to differential data privacy, and more particularly to determining a differentially private aggregate classifier for multiple databases.
Data collection provides information for a wide variety of academic, industrial, business, and government purposes. For example, data collection is necessary for sociological studies, market research, and a census. To maximize the utility of collected data, all data can be amassed and made available for analysis without any privacy controls. Of course, most people and organizations (“privacy principals”) are unwilling to disclose all data, especially when data are easily exchanged and could be accessed by unauthorized persons. Privacy guarantees can improve the willingness of privacy principals to contribute their data, and can reduce fraud, identity theft, extortion, and other problems that arise from sharing data without adequate privacy protection.
A method for preserving privacy is to compute collective results of queries performed over collected data, and disclose such collective results without disclosing the inputs of the participating privacy principals. For example, a medical database might be queried to determine how many people in the database are HIV positive. The total number of people that are HIV positive can be disclosed without disclosing the names of the individuals that are HIV positive. Useful data are thus extracted while ostensibly preserving the privacy of the principals to some extent.
However, adversaries might apply a variety of techniques to predict or narrow down the set of individuals from the medical database who are likely to be HIV positive. For example, an adversary might run another query that asks how many people both have HIV and are not named John Smith. The adversary may then subtract the second query output from the first, and thereby learn the HIV status of John Smith without ever directly asking the database for a name of a privacy principal. With sensitive data, it is useful to provide verifiable privacy guarantees. For example, it would be useful to verifiably guarantee that nothing more can be gleaned about any specific privacy principal than was known at the outset.
Adding noise to a query output can enhance the privacy of the principals. Using the example above, some random number might be added to the disclosed number of HIV positive principals. The noise will decrease the accuracy of the disclosed output, but the corresponding gain in privacy may warrant this loss.
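As a purely illustrative sketch (hypothetical data and names, not part of the invention), the following Python fragment shows the differencing attack described above and how adding Laplace noise to each answer obscures the inferred status:

```python
# Minimal illustration (hypothetical data): a differencing attack on exact
# counts, and how Laplace noise obscures the difference.
import numpy as np

rng = np.random.default_rng(0)
records = [("John Smith", True), ("Alice Jones", False), ("Bob Lee", True)]

def count_hiv(db, exclude_name=None):
    return sum(1 for name, hiv in db if hiv and name != exclude_name)

# Exact answers leak John Smith's status by subtraction.
q1 = count_hiv(records)                              # "how many are HIV positive?"
q2 = count_hiv(records, exclude_name="John Smith")   # "... and are not named John Smith?"
print("inferred status of John Smith:", q1 - q2)     # 1 -> positive

# Adding Laplace noise to each answer blurs the difference.
epsilon = 0.5                                        # privacy parameter (smaller = more private)
noisy_q1 = q1 + rng.laplace(scale=1.0 / epsilon)
noisy_q2 = q2 + rng.laplace(scale=1.0 / epsilon)
print("noisy difference:", noisy_q1 - noisy_q2)      # no longer reliably 1
```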
The concept of adding noise to a query result to preserve the privacy of the principals is generally known. One method uses differentially private classifiers for protecting the privacy of individual data instances using added noise. A classifier evaluated over a database is said to satisfy differential privacy if the probability of the classifier producing a particular output is almost the same regardless of the presence or absence of any individual data instance in the database.
However, conventional differentially private classifiers are determined locally for each database and fail to provide privacy when such classifiers must be used over multiple databases. Accordingly, there is a need to determine such a classifier for a set of databases that preserves the differential data privacy of each database.
Differential privacy provides statistical guarantees that the output of a classifier does not include information about individual data instances. However, in multiparty applications, data for determining a classifier are distributed across several databases, and conventional differential privacy methods do not preserve differential data privacy for multiple contributing parties.
This is because conventional methods are inherently designed for the case where the classifier is determined with access to the entire data of the database and is modified by a noise value computed over that data to produce a differentially private classifier specifically for that data. However, in multiparty applications, it is often impossible to access the data of the different databases due to security constraints.
Embodiments of the invention are based on a realization that for a set of databases, a differentially private aggregate classifier preserving the differential data privacy of each database can be determined from the classifiers and the noise values of individual databases in the set of databases, without allowing access to the data of the databases.
However, in multiparty applications, adding the noise value to the classifier is no longer straightforward, because there is no guarantee that the added noise results in the differential data privacy of each database. For example, the aggregation of all classifiers and noise values, which might be considered a logical approach, does not satisfy differential privacy for the combined data.
Embodiments of the invention are based on another realization that a differentially private aggregate classifier can be determined as an aggregation of classifiers of each database modified by a noise value corresponding to a smallest database in the set of databases. The smallest database has the smallest number of entries, wherein a data structure of each entry is the same across all databases. The proof for correctness of this realization is provided in the Appendix.
Accordingly, one embodiment of the invention discloses a method for determining a differentially private aggregate classifier for a set of databases, wherein each database in the set of databases is associated with a classifier and a noise value, wherein the classifier and the noise value are determined locally for each database, such that a combination of the classifier and the noise value ensure differential data privacy of the database, and wherein the differentially private aggregate classifier preserves the differential data privacy of each database, comprising the steps of: combining classifiers of each database to determine an aggregate classifier; and modifying the aggregate classifier with a noise value corresponding to a smallest database, wherein the smallest database has the smallest number of entries, wherein a data structure of each entry is the same for all databases.
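For illustration only, the following Python sketch shows the two steps of the method on hypothetical inputs: the local classifiers are averaged and the result is perturbed with noise scaled to the smallest database, using the Laplace scale 2/(n(1)ελ) described below. Per-coordinate sampling is used here as a simplification of the d-dimensional Laplace noise, and the secure multiparty version of the computation is described later.

```python
# Minimal sketch of the aggregation step (assumed interfaces, hypothetical
# values).  Each party j supplies a locally trained weight vector w_j and its
# database size n_j; the aggregate is perturbed with noise scaled to the
# smallest database.
import numpy as np

def dp_aggregate_classifier(local_weights, db_sizes, epsilon, lam, rng=None):
    """local_weights: list of K d-dimensional weight vectors (one per database).
    db_sizes: list of K entry counts.  epsilon: privacy parameter.
    lam: regularization parameter of the local classifiers."""
    rng = rng or np.random.default_rng()
    W = np.asarray(local_weights, dtype=float)
    n_smallest = min(db_sizes)                     # size of the smallest database
    scale = 2.0 / (n_smallest * epsilon * lam)     # Laplace scale, per the body text
    d = W.shape[1]
    # Per-coordinate Laplace noise shown for simplicity; the body text samples
    # a d-dimensional Laplace vector.
    eta = rng.laplace(scale=scale, size=d)
    return W.mean(axis=0) + eta                    # aggregate classifier plus noise

# Example with three hypothetical parties.
w_s = dp_aggregate_classifier(
    local_weights=[np.ones(4), 2 * np.ones(4), 3 * np.ones(4)],
    db_sizes=[1000, 250, 4000], epsilon=0.1, lam=0.01)
```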
Moreover, various embodiments of the invention determine the differentially private aggregate classifier securely using cryptographic protocols. Those embodiments ensure that the data of each database are not shared with any other party and the differentially private aggregate classifier cannot be reverse engineered to learn about any individual data instances of any database.
Another embodiment discloses a system for determining a differentially private aggregate classifier for a set of databases, wherein each database in the set of databases is associated with a classifier and a noise value, wherein the classifier and the noise value are determined locally for each database, such that a combination of the classifier and the noise value ensure a differential data privacy of the database, and wherein the differentially private aggregate classifier preserves the differential data privacy of each database, comprising: means for combining classifiers to determine an aggregate classifier; and means for modifying the aggregate classifier with a noise value corresponding to a smallest database in the set of databases to produce the differentially private aggregate classifier.
Yet another embodiment discloses a computer readable medium storing a differentially private aggregate classifier for a set of databases, wherein each database in the set of databases is associated with a classifier and a noise value, wherein the classifier and the noise value are determined locally for each database, such that a combination of the classifier and the noise value ensure a differential data privacy of the database, wherein the differentially private aggregate classifier is a combination of the classifiers of the set of databases modified with the noise value corresponding to a smallest database in the set of databases.
In describing embodiments of the invention, the following definitions are applicable throughout (including above).
A “computer” refers to any apparatus that is capable of accepting a structured input, processing the structured input according to prescribed rules, and producing results of the processing as output. Examples of a computer include a computer; a general-purpose computer; a supercomputer; a mainframe; a super mini-computer; a mini-computer; a workstation; a microcomputer; a server; an interactive television; a hybrid combination of a computer and an interactive television; and application-specific hardware to emulate a computer and/or software. A computer can have a single processor or multiple processors, which can operate in parallel and/or not in parallel. A computer also refers to two or more computers connected together via a network for transmitting or receiving information between the computers. An example of such a computer includes a distributed computer system for processing information via computers linked by a network.
A “central processing unit (CPU)” or a “processor” refers to a computer or a component of a computer that reads and executes software instructions.
A “memory” or a “computer-readable medium” refers to any storage for storing data accessible by a computer. Examples include a magnetic hard disk; a floppy disk; an optical disk, like a CD-ROM or a DVD; a magnetic tape; a memory chip; and a carrier wave used to carry computer-readable electronic data, such as those used in transmitting and receiving e-mail or in accessing a network, and a computer memory, e.g., random-access memory (RAM).
“Software” refers to prescribed rules to operate a computer. Examples of software include software; code segments; instructions; computer programs; and programmed logic. Software of intelligent systems may be capable of self-learning.
A “module” or a “unit” refers to a basic component in a computer that performs a task or part of a task. It can be implemented by either software or hardware.
As shown in
Each database 120-130 in the set of databases 110 is associated with a classifier, e.g., the classifiers 121 and 131, and a noise value, e.g., the noise values 122 and 132. For example, the databases are denoted as D1, . . . , DK, where Dj=(x, y)|j includes a set of entries x and corresponding binary labels y. The classifier and the noise value are determined locally for each database, such that a combination of the classifier and the noise value ensures a differential data privacy 125 or 135 of the database. Determined locally means that the classifiers and the noise values are determined independently by each owner of the database before or concurrently with the execution of the method 100.
Typically, to ensure the differential privacy, the noise value is determined over all data entries of the database in dependence on a size 123 or 133 of the database. As used herein, the size of the database is the number of entries. The entry can have any data structure. However, the data structure of each entry is the same across all databases. Examples of the entry are a field, a row in a table, a table itself, a file, or another database.
Embodiments of the invention are based on a realization that a differentially private aggregate classifier from the union of the databases D1∪D2 . . . ∪DK can be determined as an aggregation of classifiers of each database modified by a noise value corresponding to the smallest database in the set of databases. The smallest database has the smallest number of entries, wherein a data structure of each entry is the same for all databases. The proof for correctness of this realization is provided in the Appendix A.
Accordingly, one embodiment combines 140 the classifiers of each database to determine an aggregate classifier 145. Additionally, the embodiment determines 150 the noise value 155 corresponding to the smallest database. Next, the aggregate classifier 145 is modified 160 with the noise value 155 to produce the differentially private aggregate classifier 170. The differentially private aggregate classifier is published, e.g., stored in a memory 175 or distributed over the Internet.
Differential Privacy
According to the definition of a differential privacy model, given any two databases D and D′ differing by one element, i.e., adjacent databases, a classifier defined by a randomized query function M is differentially private, if the probability that the function M produces a response S on the database D is similar to the probability that the function M produces the same response S on the database D′. As the query output is almost the same in the presence or absence of an individual entry with high probability, almost nothing can be learned about any individual entry from the output.
The randomized function M with a well-defined probability density P satisfies ε-differential privacy if, for all adjacent databases D and D′ and for any S∈range(M), P(M(D)=S)≤exp(ε)P(M(D′)=S).
Accordingly, the differentially private classifier guarantees that no additional details about the individual entries can be obtained with certainty from the output of the learning algorithm, beyond the a priori background knowledge. Differential privacy provides an ad omnia guarantee, as opposed to most other models that provide ad hoc guarantees against a specific set of attacks and adversarial behaviors. By evaluating the differentially private classifier over a large number of entries, an adversary cannot learn the exact form of the data.
A differential diameter 240 and privacy parameter ε can be used in calculating each of the distributions in
Typically, the classifiers are designed to be differentially private by adding a noise value to the weights of the classifier, where the noise value is selected from the distribution described above. Further, the parameters of the distribution depend on a degree of desired privacy expressed by the privacy parameter ε, which usually depends on the size of the database, and on the type of the function of the classifier, e.g., an average, maximum, or logarithm function. In one embodiment, the noise values have a Laplace distribution.
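For concreteness, the d-dimensional Laplace density assumed in the Appendix is sketched below (the normalization constant is omitted):

```latex
h(\eta) \;\propto\; \exp\!\left(-\frac{\lVert \eta \rVert}{\beta}\right),
\qquad \eta \in \mathbb{R}^{d},
```

where the scale β is set according to the desired privacy ε, the size of the database, and the sensitivity of the classifier.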
Determining Classifiers Locally on Individual Databases
Each database owner Pj uses its database (x, y)|j to determine the classifier with weights wj, wherein j is an index of the database. One embodiment uses an l2-regularized logistic regression function for the classifiers. For example, the classifier can be determined by minimizing the following objective function
where λ>0 is a regularization parameter, and T is a transpose operator. However, the classifiers are determined locally for each individual database and no data or information are shared.
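The objective function itself is not reproduced above. A standard form of the l2-regularized logistic regression objective for the j-th database, consistent with the regularization parameter λ and the transpose operator T referenced here and with the sensitivity bound of Theorem 1 in the Appendix, is the following sketch (assuming labels y ∈ {−1, +1}):

```latex
J_j(\mathbf{w}) \;=\; \frac{1}{n_j}\sum_{i=1}^{n_j}
\log\!\left(1 + \exp\!\left(-y_i^{(j)}\,\mathbf{w}^{T}\mathbf{x}_i^{(j)}\right)\right)
\;+\; \frac{\lambda}{2}\,\mathbf{w}^{T}\mathbf{w},
\qquad
\mathbf{w}_j \;=\; \arg\min_{\mathbf{w}} J_j(\mathbf{w}).
```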
Example of Differentially Private Aggregate Classifier
One embodiment of the invention defines the differentially private aggregate classifier ws 170 according to ws=(1/K)Σjwj+η,
where K is the number of databases in the set of databases, j is the index of the database, and η is a d-dimensional random variable sampled from a Laplace (Lap) distribution scaled with a parameter 2/(n(1)ελ), wherein n(1) is the number of entries of the smallest database, i.e., n(1)=minjnj, λ is the regularization parameter of the classifiers, and ε is the differential privacy parameter.
The differentially private aggregate classifier ws incurs only a well-bounded excess risk over training a classifier directly on the union of all data while enabling the parties to maintain their privacy. The noise value η ensures that the classifier ws satisfies differential privacy, i.e., that individual data instances cannot be discerned from the classifier.
The definition of the noise value η above is not intuitive, but we have proved that the differentially private aggregate classifier constructed by aggregating locally trained classifiers is limited by the performance of the individual classifier trained on the database that has the least number of entries.
Some embodiments of the invention are based on the realization that the owners of the databases Pj cannot simply take their locally trained classifiers wj, perturb them with a noise vector, and publish the perturbed classifiers, because aggregating such classifiers does not give the correct noise value η∼Lap(2/(n(1)ελ)) that ensures differential privacy. Also, because individual database owners cannot simply add noise to their classifiers to impose differential privacy for all other classifiers, the actual averaging operation must be performed such that neither the individual classifiers nor the number of entries in each database are exposed. Accordingly, some embodiments use a secure multiparty computation (SMC) method for interacting with a processor to perform the averaging. The outcome of the method is that each of the database owners obtains an additive share of the desired differentially private classifier ws, and these shares must be added to obtain the differentially private aggregate classifier.
Secure Multiparty Computation (SMC) Method
The embodiments use asymmetric-key additively homomorphic encryption. A desirable property of such encryption is that operations performed on ciphertext elements map onto known operations on the corresponding plaintext elements. For an additively homomorphic encryption function ξ(•), this means that, for any a and b, ξ(a)ξ(b)=ξ(a+b) and ξ(a)b=ξ(ab), i.e., multiplying ciphertexts adds the underlying plaintexts, and raising a ciphertext to a plaintext integer multiplies the underlying plaintexts.
The additively homomorphic encryption is semantically secure, i.e., repeated encryption of the same plaintext will result in different ciphertexts. For the SMC method, encryption keys are considered public and decryption keys are privately owned by the specified database owners.
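The following self-contained Python sketch of a toy Paillier cryptosystem (small, insecure parameters chosen only for illustration, not the encryption scheme of any particular embodiment) demonstrates the two homomorphic identities used above:

```python
# Toy Paillier additively homomorphic encryption (illustrative only; tiny,
# insecure parameters).  Demonstrates xi(a)*xi(b) = xi(a+b) and xi(a)^b = xi(a*b).
import math
import random

def lcm(a, b):
    return a * b // math.gcd(a, b)

def keygen(p=293, q=433):                        # tiny primes, illustration only
    n = p * q
    n2 = n * n
    g = n + 1
    lam = lcm(p - 1, q - 1)
    mu = pow((pow(g, lam, n2) - 1) // n, -1, n)  # inverse of L(g^lam mod n^2) mod n
    return (n, g), (lam, mu, n)

def encrypt(pk, m):
    n, g = pk
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2  # semantically secure: fresh r each time

def decrypt(sk, c):
    lam, mu, n = sk
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n) * mu % n

pk, sk = keygen()
n2 = pk[0] ** 2
a, b = 17, 25
ca, cb = encrypt(pk, a), encrypt(pk, b)
assert decrypt(sk, ca * cb % n2) == a + b            # additive homomorphism
assert decrypt(sk, pow(ca, b, n2)) == (a * b) % pk[0]  # plaintext-scalar multiplication
```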
Determining an Obfuscated Index of the Smallest Database
The processor determines 310 an obfuscated index 315 of a smallest database based on permuted indexes 320 resulting from a permutation of indexes of the databases. For example, each database owner, i.e., a party Pj, computes integer additive shares aj and bj such that nj=aj+bj, for j=1, 2, . . . , K, where nj is the number of entries of the database Dj. The K-length vectors of additive shares are denoted a and b, respectively.
The parties Pj mutually agree on a permutation π1 on the index vector (1, 2, . . . , K). This permutation is unknown to the processor. Then, each party Pj transmits its share aj to a representative party Pπ1(j).
The parties Pj generate a key pair (pk, sk) where pk is a public key for homomorphic encryption and sk is the secret decryption key known only to the parties, but not to the processor. The element-wise encryption of a is defined as ξ(a). The parties send ξ(π1(a))=π1(ξ(a)) to the processor.
The processor generates a random vector r=(r1, r2, . . . , rK) where the elements ri are integers selected uniformly at random and are equally likely to be positive or negative. Then, the processor computes ξ(π1(aj))ξ(rj)=ξ(π1(aj)+rj). In vector notation, the processor computes ξ(π1(a)+r).
Similarly, by subtracting the same random integers in the same order from the received additive shares, the processor obtains π1(b)−r, selects a permutation π2 at random, and obtains a signal
π2(ξ(π1(a)+r))=ξ(π2(π1(a)+r)),
and a signal π2(π1(b)−r). The processor transmits the signal ξ(π2(π1(a)+r)) to the individual parties in order, e.g., the first element to the first party P1, the second element to the second party P2, . . . , and the Kth element to the party PK.
Each party decrypts the signal received from the processor, i.e., the parties P1, P2, . . . , PK respectively possess the elements of the vector π2(π1(a)+r) while the processor possesses the vector π2(π1(b)−r). Because π1 is unknown to the processor and π2 is unknown to the parties, the indices in both vectors are obfuscated.
If π2(π1(a)+r)=ã and π2(π1(b)−r)=b̃, then ni>nj if and only if ãi+b̃i>ãj+b̃j, which holds if and only if ãi−ãj>b̃j−b̃i.
For each (i, j) pair with i, j∈{1, 2, . . . , K}, these comparisons can be solved by any implementation of a secure millionaire protocol. When all the comparisons are done, the processor determines the index j̃ 325 such that ãj̃+b̃j̃=minjnj. However, the true index corresponding to the smallest database is obfuscated.
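The following Python sketch simulates the bookkeeping of this step in plaintext (hypothetical sizes; the encryption, decryption, and secure millionaire comparisons are omitted), showing that the blinding and permutations preserve the comparisons on the database sizes while obfuscating the indices:

```python
# Plaintext simulation (illustrative only) of the index-obfuscation bookkeeping.
import random

K = 4
n = [1200, 350, 980, 510]                   # hypothetical database sizes n_j
a = [random.randrange(-10**6, 10**6) for _ in range(K)]
b = [n[j] - a[j] for j in range(K)]         # additive shares: n_j = a_j + b_j

pi1 = random.sample(range(K), K)            # parties' permutation (unknown to processor)
pi2 = random.sample(range(K), K)            # processor's permutation (unknown to parties)
r = [random.randrange(-10**6, 10**6) for _ in range(K)]

a1 = [a[pi1[j]] for j in range(K)]          # pi1(a), received (encrypted) by the processor
b1 = [b[pi1[j]] for j in range(K)]          # pi1(b)
a_tilde = [a1[pi2[j]] + r[pi2[j]] for j in range(K)]   # pi2(pi1(a) + r), held by parties
b_tilde = [b1[pi2[j]] - r[pi2[j]] for j in range(K)]   # pi2(pi1(b) - r), held by processor

# The blinding cancels: a~_i + b~_i is the size of some database, index obfuscated.
j_tilde = min(range(K), key=lambda i: a_tilde[i] + b_tilde[i])
assert a_tilde[j_tilde] + b_tilde[j_tilde] == min(n)

# Comparison identity used with the millionaire protocol.
i, j = 0, 1
assert (a_tilde[i] + b_tilde[i] > a_tilde[j] + b_tilde[j]) == \
       (a_tilde[i] - a_tilde[j] > b_tilde[j] - b_tilde[i])
```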
Selecting Obliviously First Additive Share of Noise Value of Smallest Database
Based on the obfuscated index 315, the processor obliviously selects 330, from the additive shares 340 of all noise values, a first additive share 335 of the noise value associated with the smallest database. A second additive share 360 of the noise value associated with the smallest database is stored by one or more databases.
For example, the processor constructs an indicator vector u of length K such that uj̃=1 and all other elements are 0. Then the processor permutes the indicator vector to produce a permuted vector π2−1(u), where π2−1 inverts π2. Next, the processor generates a key pair (pk′, sk′) for an additively homomorphic function ζ(•), where only the encryption key pk′ is publicly available to the parties Pj, and transmits ζ(π2−1(u))=π2−1(ζ(u)) to the parties Pj.
The parties mutually obtain a permuted vector π1−1(π2−1(ζ(u)))=ζ(v) where π1−1 inverts the permutation π1 originally applied by the parties Pj and v is the underlying vector. Now that both permutations have been removed, the index of the non-zero element in the indicator vector v corresponds to the true index of the smallest database. However, since the parties Pj cannot decrypt ζ(•), the parties cannot find out this index.
For j=1, . . . , K, the party Pj selects the noise value ηj. In one embodiment, the noise value is a d-dimensional noise vector sampled from a Laplace distribution with parameter 2/(njελ). In another embodiment, the noise value is selected from a different distribution. In yet another embodiment, the noise value is predetermined. Then, the party Pj obtains a d-dimensional vector ψj where, for i=1, . . . , d, ψj(i)=ζ(v(j))ηj(i)=ζ(v(j)ηj(i)).
All parties Pj compute a d-dimensional noise vector ψ such that, for i=1, . . . , d, ψ(i)=Πjψj(i)=Πjζ(v(j)ηj(i))=ζ(Σjv(j)ηj(i)).
By construction, the above equation selects only the noise value for the smallest database, while rejecting the noise values for all other databases. This is because v has an element with value 1 at the index corresponding to the smallest database and has zeroes everywhere else.
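A plaintext Python sketch of this selection (hypothetical values; in the protocol the sum is evaluated under the encryption ζ(•), so the indicator vector v remains hidden from the parties):

```python
# Plaintext simulation (illustrative only) of the oblivious selection step: an
# indicator vector v with a single 1 at the hidden index of the smallest
# database picks out exactly that party's noise vector.
import numpy as np

rng = np.random.default_rng(1)
K, d = 4, 3
epsilon, lam = 0.1, 0.01                       # hypothetical privacy and regularization parameters
n = [1200, 350, 980, 500]                      # hypothetical database sizes
eta = [rng.laplace(scale=2.0 / (n[j] * epsilon * lam), size=d) for j in range(K)]

v = np.zeros(K)
v[int(np.argmin(n))] = 1                       # indicator of the smallest database

psi = sum(v[j] * eta[j] for j in range(K))     # corresponds to zeta(sum_j v(j) eta_j(i))
assert np.allclose(psi, eta[int(np.argmin(n))])
```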
One of the parties, e.g., P1, generates a d-dimensional random integer noise vector s to produce the first additive share 335 ψ(i)ζ(s(i)) for all i=1, . . . , d, and transmits the first additive share to the processor. Also, the party P1 stores a second additive share 360 of the noise value, e.g., by computing w1−Ks, wherein w1 is the classifier of the party P1. Additionally or alternatively, the second additive share is stored at multiple databases.
The processor decrypts ψ(i)ζ(s(i)) to obtain η(i)+s(i) for i=1, . . . , d. Accordingly, the processor stores a first additive share of the noise value associated with the smallest database as K(η+s), the selected party P1 stores the second additive share of the noise value and the classifier as w1−Ks, and all other parties Pj, j=2, . . . , K, store their classifiers wj.
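The following Python sketch verifies the share bookkeeping on hypothetical values: the processor's share, party P1's share, and the remaining classifiers sum to K times the differentially private aggregate classifier:

```python
# Sanity check (plaintext, illustrative only) of the additive shares after the
# selection step: K*(eta + s) + (w1 - K*s) + w2 + ... + wK = sum_j wj + K*eta.
import numpy as np

rng = np.random.default_rng(2)
K, d = 3, 4
w = [rng.normal(size=d) for _ in range(K)]           # locally trained classifiers (hypothetical)
eta = rng.laplace(scale=0.5, size=d)                 # noise value of the smallest database
s = rng.integers(-100, 100, size=d).astype(float)    # P1's random blinding vector

processor_share = K * (eta + s)
p1_share = w[0] - K * s
other_shares = w[1:]

total = processor_share + p1_share + sum(other_shares)
ws = sum(w) / K + eta                                # differentially private aggregate classifier
assert np.allclose(total / K, ws)
```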
Obliviously Combining Classifiers, the First and the Second Additive Shares
In various embodiments, the processor and the K database-owning parties execute a secure function evaluation protocol, such that each of the K+1 participants obtains an additive share of the differentially private aggregate classifier Kws. In some embodiments, additive shares are generated using a computationally secure protocol. In other embodiments additive shares are generated using an unconditionally secure protocol. The resulting K+1 shares form the differentially private aggregate classifier and are published, e.g., are stored in the memory 175.
The embodiments of the invention determine a differentially private aggregate classifier to ensure differential privacy of multiple databases. The invention is based on the realization that, to achieve multiparty differential privacy, it is sufficient to select the stochastic component based on the size of the smallest database. Some embodiments further recognize that, because of this realization, the selection of the noise value can be performed securely via the SMC method.
However, unlike conventional methods, the embodiments do not use SMC to construct a classifier. Therefore, our embodiments are significantly less complex than any SMC method to compute the classifier on the combined data.
It is to be understood that various other adaptations and modifications can be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.
It is sufficient to select the stochastic component based on the size of the smallest database. Below, a theoretical proof is given that this is indeed the case.
We show that the perturbed aggregate classifier in this invention satisfies differential privacy. We use the following bound on the sensitivity of the regularized regression classifier:
Theorem 1 Given a set of n data instances lying in a ball of radius 1, the sensitivity of the regularized logistic regression function is at most 2/(nλ). That is, if w1 and w2 are functions (classifiers) trained on adjacent databases of size n with regularization parameter λ, then ∥w1−w2∥≤2/(nλ).
This bound has been proved by Kamalika Chaudhuri and Claire Monteleoni, Privacy-preserving logistic regression, Neural Information Processing Systems, pages 289-296, 2008, which is incorporated herein by reference. To show that the perturbed function or classifier in our invention satisfies differential privacy, we proceed as follows:
Theorem 2 The classifier ws preserves ε-differential privacy. That is, for any two adjacent databases D and D′, P(ws|D)/P(ws|D′)≤exp(ε).
Proof Consider the case where one instance of the training database D is changed to result in an adjacent database D′. This implies a change in one element of the training database of one party and thereby a change in the corresponding learned vector. Assuming that the change is in the database of the party Pj, the change is only in the learned vector wj; let the new classifier be denoted by wj′. From Theorem 1, the sensitivity of wj is bounded as ∥wj−wj′∥≤2/(njλ).
Considering that we observe the same output vector ws using either of the training databases D and D′, we have, by the definition of function sensitivity, that the ratio of the corresponding densities is upper bounded by exp(ε). Similarly, we can lower bound the ratio by exp(−ε).
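A compact sketch of the density-ratio argument, under the assumptions stated above (ws=(1/K)Σjwj+η, with η drawn from the d-dimensional Laplace density with scale β=2/(n(1)ελ), and a change only in the database of party Pj):

```latex
\frac{P(\mathbf{w}^{s}=S \mid D)}{P(\mathbf{w}^{s}=S \mid D')}
= \exp\!\left(\frac{\lVert S-\bar{\mathbf{w}}'\rVert-\lVert S-\bar{\mathbf{w}}\rVert}{\beta}\right)
\le \exp\!\left(\frac{\lVert \bar{\mathbf{w}}-\bar{\mathbf{w}}'\rVert}{\beta}\right)
= \exp\!\left(\frac{\lVert \mathbf{w}_{j}-\mathbf{w}_{j}'\rVert}{K\beta}\right)
\le \exp\!\left(\frac{n_{(1)}\,\varepsilon\,\lambda}{2}\cdot\frac{2}{K\,n_{j}\,\lambda}\right)
\le \exp(\varepsilon),
```

where w̄=(1/K)Σjwj and w̄′ is the corresponding average with wj replaced by wj′; the last inequality uses n(1)≤nj and K≥1, and the lower bound by exp(−ε) follows symmetrically.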
As expected, adding a perturbation noise term introduces an error in the function evaluation. For functions used as classifiers, this error is called the excess error or excess risk. It is the price paid for differential privacy. For a differentially private classifier to be useful, the excess risk should be small. In other words, it is desirable that adding noise does not deteriorate the classification performance too much.
In the following discussion, we consider how much excess error is introduced when using the perturbed aggregate classifier ws in this invention (satisfying differential privacy) as opposed to the non-private unperturbed classifier wx trained on the entire training data. We also consider how much excess error is introduced with respect to the (non-private) unperturbed aggregate classifier w.
We first establish a bound on the l2 norm of the difference between the aggregate classifier w and the classifier trained over the entire training data. To prove the bound, we apply the following Lemma.
Lemma 1 Let G(w) and g(w) be two differentiable, convex functions of w. If w1=arg minwG(w) and w2=arg minwG(w)+g(w), then ∥w1−w2∥≤g1/G2, where g1=maxw∥∇g(w)∥ and G2=minvminwvT∇2G(w)v for any unit vector v∈Rd.
Lemma 1 is obtained from Kamalika Chaudhuri and Claire Monteleoni, Privacy-preserving logistic regression, Neural Information Processing Systems, pages 289-296, 2008, which is incorporated herein by reference. First, consider the following theorem bounding the excess risk between the non-private unperturbed aggregate classifier w and the non-private classifier wx trained on the entire database.
Theorem 3 Given the aggregate classifier w, the classifier wx trained over the entire training data, and n(1), the size of the smallest training database,
We formulate the problem of estimating the individual classifiers wj and the classifier wx trained over the entire training data in terms of minimizing the two differentiable and convex functions g(w) and G(w).
Substituting the bounds on g1 and G2 into Lemma 1,
Applying the triangle inequality,
The bound is inversely proportional to the number of instances in the smallest database. This indicates that, when the databases are of disparate sizes, w will differ significantly from wx. The largest possible value for n(1) is n/K, in which case all parties have an equal amount of training data and w is closest to wx. In the one-party case, K=1, the bound indicates that the norm of the difference is upper bounded by zero, which is a valid sanity check, as the aggregate classifier w is the same as wx.
We use this result to establish a bound on the empirical risk of the perturbed aggregate classifier ws=w+η over the empirical risk of the unperturbed classifier wx in the following theorem.
Theorem 4 If all data instances xi lie in a unit ball, with probability at least 1−δ, the empirical regularized excess risk of the perturbed aggregate classifier ws over the classifier wx trained over the entire training data is
We use the Taylor series expansion of the function J to have
J(ws)=J(wx)+(ws−wx)T∇J(wx)+½(ws−wx)T∇2J(w)(ws−wx),
for some w∈Rd. By definition, ∇J(wx)=0.
Taking the l2 norm of both sides and applying the Cauchy-Schwarz inequality,
|J(ws)−J(wx)|≤½∥ws−wx∥2∥∇2J(w)∥. (8)
The second gradient of the regularized loss function for logistic regression is ∇2J(w)=λI+(1/n)Σi[exp(−yiwTxi)/(1+exp(−yiwTxi))2]xixiT.
Since the logistic function term is always less than one and all xi lie in a unit ball, ∥∇2J(w)∥≤λ+1. Substituting this into Equation 8 and using the fact that J(wx)≤J(w), ∀w∈Rd,
The classifier ws is the perturbed aggregate classifier, i.e., ws=w+η, with the noise term η sampled from Lap(2/(n(1)ελ)).
To bound ∥η∥ with probability at least 1−δ, we apply the following Lemma from Kamalika Chaudhuri and Claire Monteleoni, Privacy-preserving logistic regression, Neural Information Processing Systems, pages 289-296, 2008, which is incorporated herein by reference.
Lemma 2 Given a d-dimensional random variable η∼Lap(β), i.e., a random variable with density proportional to exp(−∥η∥/β), with probability at least 1−δ, the l2 norm of the random variable is bounded as
Substituting this into Equation 9, we have
Using the Cauchy-Schwarz inequality on the last term,
The bound suggests an error because of two factors: aggregation and perturbation. The bound increases for smaller values of ε implying a tighter definition of differential privacy, indicating a clear trade-off between privacy and utility. The bound is also inversely proportional to n(1)2 implying an increase in excess risk when the parties have training databases of disparate sizes.
In the limiting case ε→∞, we add a perturbation term η sampled from a Laplace distribution of infinitesimally small variance, resulting in the perturbed classifier being almost the same as the unperturbed aggregate classifier w and satisfying a very loose definition of differential privacy. With such a value of ε, our bound becomes
Similar to the analysis of Theorem 3, the excess error of using an aggregate classifier is inversely proportional to the size of the smallest database n(1), and in the one-party case, K=1, the bound becomes zero, as the aggregate classifier w is the same as wx.
While the previous theorem gives us a bound on the empirical excess risk over a given training database, it is important to consider a bound on the true excess risk of ws over wx. Let us denote the true risk of the classifier ws by J̃(ws)=E[J(ws)] and, similarly, the true risk of the classifier wx by J̃(wx)=E[J(wx)].
Theorem 5 If all training data instances xi lie in a unit ball, with probability at least 1−δ, the true excess risk of the perturbed aggregate classifier ws over the classifier wx trained over the entire training data is
Let wr be the classifier minimizing J̃(w). By rearranging the terms,
J̃(ws)=J̃(wx)+[J̃(ws)−J̃(wr)]+[J̃(wr)−J̃(wx)]≤J̃(wx)+[J̃(ws)−J̃(wr)].
To proceed, we first need a bound on the true excess risk of any classifier expressed in terms of a bound on the regularized empirical risk of that classifier relative to the classifier minimizing the regularized empirical risk. With probability at least 1−δ,
Substituting the bound from Theorem 4,
Substituting this bound into Equation 10 gives us a bound on the true excess risk of the classifier ws over the classifier wx.