This application relates generally to data sharing and collaboration in biomedicine.
We are entering the age where everyone will have a copy of his or her own genome sequence. Allowing researchers to tap into this vast source of information would spur scientific and medical breakthroughs. Currently, however, most sequenced genomes reside in strict access-controlled repositories.
Genome-wide association studies (GWAS) aim to identify genetic variants that are statistically correlated with phenotypes of interest (e.g., disease status). Analyzing a large number of individuals in a GWAS is critical for detecting elusive causal variants that are rare or have small effect sizes. To this end, several research groups have been coordinating efforts to amass large genomic and phenotypic databases. With the advent of large centralized databases, however, comes severe privacy concerns. In an era where companies' sensitive user data are routinely leaked in bulk, a breach of genomic and clinical data is becoming increasingly probable. Adding to the concern, a leak of genomic data permanently affects the individuals whose data is released, and their descendants. Unfortunately, it is challenging to accurately assess, let alone contain, the risks associated with genomic breaches, due to our incomplete understanding of what is encoded in one's genome.
Modern cryptography offers a means to substantially reduce the likelihood of a devastating leak. In particular, secure multiparty computation (MPC) frameworks enable researchers to collaboratively perform analyses over encrypted data without having direct access to the underlying input. Because no entity is given trust to handle the raw data, the breach or corruption of a single party no longer compromises the privacy of study participants. Despite its promises, conducting GWAS via any such framework to date has not been possible, at least in part due to major technical challenges. Existing proposals either consider vastly simplified versions of the task or require overwhelming runtime or resource constraints for modern data sets with a large number of individuals. Alternative cryptographic systems for secure computation, such as homomorphic encryption, or garbled circuits, currently impose even greater computational burden and thus are not viable for large-scale GWAS.
A major computational bottleneck for secure GWAS is identifying population structure within the study cohort, which can cause spurious associations that are linked to inter-population differences rather than the phenotype of interest. A widely used procedure for accounting for such confounding is to use principal component analysis (PCA) to capture broad patterns of genetic variation in the data. The top principal components, which are thought to be representative of population-level differences among the individuals, are included as covariates in the subsequent statistical tests to correct for bias. Performing PCA on very large matrices, however, is a highly challenging task for secure computation, which to our knowledge has never been successfully addressed. This shortcoming is mainly due to the iterative nature of algorithms for PCA, which significantly increases the communication cost and the overall complexity of the computation (e.g., the multiplicative depth). In addition, PCA requires the support of fractional numbers with a relatively high precision, which adds nontrivial overhead to most MPC frameworks that are inherently based on integer operations.
This disclosure provides for enhanced computational techniques that are leveraged to facilitate secure crowdsourcing of genomic and phenotypic data, e.g., for large-scale association studies. According to one embodiment, a method for securely crowdsourcing genomic and phenotypic data for large-scale association studies involves a set of high-level operations. In particular, the method begins by receiving, via secret sharing, genomic and phenotypic data of individual study participants in a manner that preserves privacy of individual genomic and phenotypic data. Such data is a “first data set.” Then, a “second data set” is received. The second data set preferably is also received via secret sharing. The second data set comprises results of pre-computation over random number data, e.g., mutually independent and uniformly-distributed random numbers and results of calculations over those random numbers. Then, a secure computation is executed against the secretly-shared genomic and phenotypic data, using the secretly-shared results of the pre-computation over random number data, to generate a set of genome-wide association study (GWAS) statistics. For increased computational efficiency, at least a part of the computation is executed over dimensionality-reduced genomic data. The resulting GWAS statistics are then used to identify genetic variants that are statistically-correlated with a phenotype of interest.
The foregoing has outlined some of the more pertinent features of the subject matter. These features should be construed to be merely illustrative. Many other beneficial results can be attained by applying the disclosed subject matter in a different manner or by modifying the subject matter as will be described.
For a more complete understanding of the subject matter and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
The technique herein provides for a secure multiparty communication protocol for GWAS that accounts for population stratification while achieving practical performance at scale. Several technical advances enable this breakthrough. First, a core multiparty computation (MPC) technique known as Beaver triples, which was initially developed for pairwise multiplications, is generalized to arbitrary arithmetic circuits. The generalized method is used to construct efficient subroutines for matrix multiplication, exponentiation, and iterative algorithms with data reuse. Second, random projection techniques are then provided to significantly reduce the complexity of a Principal Component Analysis (PCA) procedure, e.g., during a population stratification analysis.
In one embodiment, the overall workflow of our MPC protocol is illustrated in
The success of the MPC protocol on a real lung-cancer GWAS data set with 9,178 individuals (5,088 cases and 4,090 controls) and 612,794 genomic loci has been demonstrated. In this example, a common quality control procedure for GWAS, which includes filters for missing rate, departure from Hardy-Weinberg equilibrium, heterozygosity, and minor allele frequency, was incorporated with the MPC protocol. In addition, population bias was corrected by computing the top five principal components and including them in the association tests as covariates. As shown in
In particular, in this example the secure GWAS protocol accurately recapitulated the ground truth association scores calculated based on the plaintext data (r2=1.00;
Moreover, the results closely matched what was presented in the original publication; the two strongest associations securely identified by the protocol, rs2736100 (p=4.76×10−24) and rs7086803 (p=5.95×10−16), were also the top two findings in prior work. rs2736100 maps to telomerase reverse transcriptase (TERT), whose association with lung cancer is well-established in the literature. On the other hand, rs7086803, which maps to an intracellular trafficking gene named VTI1A, was one of the three novel susceptibility loci identified by the previous study. Notably, the other two loci were originally reported with substantially weaker associations than rs7086803 and thus are likely to have been more sensitive to minor differences in the data and analysis.
In addition to obtaining accurate association statistics, the secure GWAS protocol achieved a practical runtime of one day (see
A primary GWAS data set was obtained via dbGaP Authorized Access (study accession: phs000716.v1.p1). Retaining only autosomal SNPs, the combined data across seven study groups consisted of 612,794 genotyped loci over 9,178 individuals (5,088 cases and 4,090 controls). The study cohort was divided into five age groups: <40, 40-50, 50-60, 60-70, and >70. Binary membership vectors for age and study group were utilized as additional covariates for the association tests (ten linearly independent features). To generate smaller data sets for scalability analysis, the individuals were randomly subsampled to obtain data sets with 1k, 2k, and 5k individuals.
Following the original study, the following filters were incorporated for quality control: genotype missing rate per individual <0.05 and per SNP<0.1, individual heterozygosity rate >0.25 and <0.30, minor allele frequency >0.1, and Hardy-Weinberg equilibrium test χ2 statistic <28.3740 (p-value <10−7). All of these filters were evaluated securely without revealing the underlying data. The resulting inclusion/exclusion status for each individual and SNP is revealed to allow the CPs to reduce subsequent computation. In the experiment, the reduced data included 9,098 individuals and 378,493 loci.
For population stratification analysis, a subset of SNPs with low levels of linkage disequilibrium by imposing a pairwise distance threshold of >100 kb was used, which resulted in 23,723 loci. Genomic positions of the SNPs in the data are considered public, because it does not contain any private information. Therefore, this filtering step is performed in plaintext. Genotypes of each SNP were standardized before PCA was performed. Following previous work, this was achieved by subtracting the mean and dividing by the standard deviation of a Bernoulli variable representing an allele at the given locus being the non-reference type. This random variable is treated as being sampled twice per individual to give rise to the observed genotypes. As explained in the Supplementary Information (see below), this standardization was performed indirectly (i.e., computation results were adjusted after the fact) to avoid the size of pre-computed data being quadratic in genotype matrix dimensions. The top five principal components were then kept for the subsequent analysis.
As proof of concept, a Cochran-Armitage trend test was used to assess the association between each SNP and the disease status. In the presence of covariates, which in the example case consisted of top principal components and age/study group memberships, the desired test statistic is equivalent to the squared Pearson correlation coefficient between the genotype and phenotype vectors, where the subspace defined by the covariates are projected out from both vectors before computing the correlation. The resulting correlations were then revealed as the final output of the secure protocol. Note the mapping between test statistics and significance scores (p-values) did not reveal any additional information about the input and thus can be performed in plaintext.
To assess the scalability of the protocol, several aspects of the experiments were quantified, namely: runtime, communication bandwidth, the size of pre-computed data, and the size of initial data sharing. Runtime refers to only the main computation phase after initial data sharing, because the initial transfer is heavily dependent upon the study setup given the distributed nature of data ownership. Instead, the total size of initial data, which consists of a genotype vector, disease status, and covariate phenotypes collected from each study participant, is presented. Note that initial data sharing includes (i) distributed data transfer from each SP to CP1 or CP2 (one party receives only a few bytes for the pseudorandom generator key) and (ii) an equal-sized data exchange between CP1 and CP2 for the “Beaver partitioning” of input data (see the detailed discussion below titled “Supplementary Information”).
Because the exchange between CP1 and CP2 during this procedure can be coalesced into a single batch transfer, physical shipment is a potential alternative to online data transfer at the scale of tens of terabytes corresponding to millions of genomes. Next, communication bandwidth refers to the total amount of data exchanged between CP1 and CP2 during the main computation phase. Finally, the size of pre-computed data is the amount of data transferred from CP0 to either CP1 or CP2 during the pre-computation phase.
Notably, the protocol achieves communication bandwidth (total amount of data transferred) linear in the number of individuals (n) and the number of variants (m). This excludes the initial data-sharing phase, which is necessarily quadratic due to the size of the genotype matrix. In comparison, existing frameworks lead to quadratic bandwidths with large multiplier constants, which are infeasible at scale, e.g., where both n and m are close to a million. A comprehensive description of the technical contributions, protocol details, and security and complexity analyses are provided in the Supplementary Information that follows below.
In a standard honest-but-curious security model, all parties are assumed to faithfully follow the prescribed protocol, but are free to inspect and analyze any portion of the data they observe to gain information about the input. Under this model, the described protocol guarantees that the computing parties do not learn anything about the raw genotypes or phenotypes other than what can be inferred from the final results. Notably, this security guarantee rests upon the assumption that the CPs do not collude with one another. This no-collusion assumption can be further relaxed by introducing more CPs at the expense of efficiency (see the Supplementary Information below). The protocol as described above is based on two main CPs and an auxiliary one to perform the pre-processing, but this implementation is not intended to be limiting. There may be a different number of entities in these roles, and one or more entities may have shared responsibility.
As will be described in further detail, a number of techniques that are summarized here are provided to achieve scalability. A first technique improves upon an existing building block for secure computation, known as Beaver triples. Beaver triples were originally developed for carrying out secure pairwise multiplications; according to this disclosure, they are extended to arbitrary arithmetic circuits to thereby obtain more efficient subroutines for various operations such as matrix multiplication and exponentiation. Another technique involves using a random projection-based algorithm for PCA so that the scale of the data is greatly reduced for performing population stratification correction (a known computational bottleneck of secure GWAS). An additional technique provides for the notion of shared pseudorandom number generators (PRGs) between pairs of computing entities; the use of shared PRGs obviates transferring a stream of random numbers, which is a frequent operation in the protocol, by enabling each party to independently draw the random numbers from the same PRG. Collectively, these techniques (namely, improved secure computation building blocks, randomized PCA for population stratification, and shared PRGs) enable the provision of a secure protocol that achieves practical scalability.
As will be described in more detail in the Supplementary Information that follows, the technique of this disclosure leverages building blocks that enable arbitrary arithmetic functions, i.e., addition and multiplication, over the private input, to be securely evaluated. As will be seen, the approach avoids the requirement of generating a Beaver triple for every pairwise multiplication, which is computationally-efficient especially if the overall computation contains many multiplications. As will be described, the MPC technique of this disclosure works with elements in a finite field, and there are additional high-level routines for various bit operations like bit shift, bit comparison, etc., which preferably are also written as arithmetic functions so that they can be efficiently-implemented. To address the problem (of having to generate a Beaver triple for every pairwise multiplication), the approach herein generalizes Beaver triples so that auxiliary random numbers, which enable the secure evaluation of nonlinear functions, are generated for groups of operations at a time, rather than for each individual multiplication. In particular, the generalized Beaver tuple preferably includes a random number for each input value, and some function of these random numbers for each of the output values of a desired computation. Thus, in this operation-centric to data-centric view, the total amount of auxiliary data to be generated and secret-shared depends on the size of input and output, rather than the number of pairwise multiplications, and this leads to significant computational efficiency.
Although there are many ways to generate the Beaver triple, according to the protocol herein preferably a third party (CP0) is used to perform this task. As noted, the third-party preferably does not observe the input data in any way and is able to finish its task before the main protocol, as a pre-processing step.
This work is complementary to existing literature on differential privacy techniques in biomedicine, whose aim is to control the privacy leak in the published summary statistics (i.e., association p-values), the output of a GWAS. While the amount of sensitive information revealed by the final statistics will only decrease as the size of the GWAS data sets grow, any existing method for slightly perturbing the statistics for privacy protection also can be used in conjunction with the protocol as a post-processing step.
The technical advances in achieving a scalable MPC protocol for GWAS as described herein may be extended for use with diverse applications in biomedicine and other disciplines. As an example, insights from the PCA protocol may be directly applied to drug-target interaction prediction, e.g., by casting it as a low-rank matrix factorization problem. The computational approach herein allows pharmaceutical companies and academic researchers to pool their data for more accurate predictions while keeping the raw data private. Further, the methods for secure computation as described herein removes obstacles for data sharing and, as a result, enable new workflows and opportunities for scientific collaboration.
In summary, and in the illustrative embodiment, the technique herein provides an end-to-end protocol (e.g., for secure GWAS) that provably reveals no information about input genotypes and phenotypes, except for the final association statistics.
The following Supplementary Information comprises a set of Notes that describe further technical aspects of the MPC protocol of this disclosure.
There are four parties in our GWAS protocol: study participants (SP), two main computing parties (CP1 and CP2), and an auxiliary computing party (CP0). The auxiliary computing party CP0 only participates in a precomputation (or preprocesisng) phase of the protocol, and in particular, does not participate in the main protocol execution. For simplicity, we take SP to be a single entity that possesses all of the input genomes and phenotypes. As we discuss in Supplementary Note 9, this setting naturally generalizes to the crowdsourcing scenario where SP represents many independent participants, each securely contributing their own genome to the computing parties.
Overview. The overall workflow of our GWAS protocol is as follows. First, using a cryptographic technique called secret sharing, the study participants SP share their data (i.e., their genome and associated phenotypes) with CP1 and CP2 in such a way that enables computation over the shared data, but does not reveal any information to either CP1 or CP2. Next, CP1 and CP2 engage in an interactive protocol to perform GWAS over the private inputs without learning anything about the underlying data. During this computation, CP1 and CP2 leverage pre-computed data from CP0 (which is input-agnostic) to greatly speed up the computation. Finally, CP1 and CP2 combine their outputs to reconstruct the final GWAS statistics from secret shares and publish the results. We work in the general paradigm of computing on secret-shared data first formalized by Ben-Or et. al [1], and subsequently extended in a number of works [2, 3, 4, 5, 6].
Security model. In this paper, we assume that the protocol participants are semi-honest (i.e., honest-but-curious). In other words, all parties follow the protocol exactly as specified, but at the end of the protocol execution, parties may try to infer additional information about other parties' private inputs based on their view of the protocol execution. Informally, we say that a cryptographic protocol is secure in the semi-honest model if all of the information a party is able to learn from the protocol execution can be expressed as a function of just the party's input to the protocol execution and the output of the computation (i.e., the GWAS results). In Supplementary Note 10, we briefly describe ways of extending our protocol to relax our security assumptions and additionally provide security against malicious adversaries in the online phase of the protocol by applying the techniques from the SPDZ online protocol [3].
Communication model. In our protocol, we assume that every pair of parties communicate over a secure and authenticated channel (e.g., over the TLS protocol). Concretely, this means that if CP1 sends a message to CP2 during the protocol execution, only CP2 can read the message; other parties (such as SP, CP0, or an eavesdropper on the network) are unable to do so.
Here, we formally describe a cryptographic framework known as secure multiparty computation (MPC) based on secret sharing, which provides the foundation for our secure GWAS protocol.
We begin by introducing some notation that we use throughout this work. For a prime q, we write q to denote the integers modulo q. For a finite set S, we write
to denote that x is sampled uniformly at random from S. Our protocols consist of a sequence of operations performed by multiple parties. We annotate each operation with the relevant party or parties that is responsible for performing the operation (unless the operation applies to all parties). We write a tuple with angle brackets (e.g., a, b, c, d to represent data that are visible to only a subset of the parties in the protocol. The ordering we adopt is CP1, CP2, CP0, SP. In other words, these tuples are placeholders that take on different values depending on which party is executing the corresponding line of the protocol. We write ⊥ to denote an empty slot, and we adopt the following abbreviations for compactness:
a,b
≡
a,b,⊥,⊥ and a,b,c≡a,b,c,⊥.
We give a few examples of our notation in
Our protocol relies on secret sharing-a cryptographic technique that allows two or more parties to collectively represent a private value without having any knowledge about it individually. The underlying secret is reconstructed only when a prescribed set of parties (e.g., all of them) combine their respective shares. In our setting, the study participants first secret share their data to the two computing parties CP1 and CP2. The protocol execution consists of an interactive protocol where CP1 and CP2 jointly compute over the secret-shared data.
Our protocol relies on a two-party additive secret sharing scheme. A two-party additive sharing of a value x∈q consists of a pair of values (r, x−r) where
is a uniformly random field element. By construction, each share (either r or x−r) individually reveals no information about the value x. However, given both shares r and x−r, it is possible to recover the secret value x by adding together the two shares (hence the name, additive secret sharing). In this work, we denote the two shares (r, x−r) of a value x∈q by ([x]1, [x]2), respectively. Using our tuple notation, a secret sharing [x] of a value x∈q can be written as
[x]:=[x]1,[x]2.
In words, [x]1 and [x]2 are shares of x individually owned by CP1 and CP2, respectively. Adding the two shares together yields the value x. Since CP1 and CP2 each only sees a single share of x, individually, they have no information about the value of x. The procedures for sharing and reconstructing a secret are summarized in Protocols 1 and 2, shown in
Not only do secret sharing schemes allow data to be shared across multiple parties in a privacy-preserving manner, the parties can exploit the structure of the shared data to additionally compute over the private inputs without learning anything about the underlying inputs [1]. Intuitively, this is akin to manipulating objects while blindfolded. Secure computation of an arbitrary function is typically implemented as a composition of subprotocols that are each defined for primitive operations (e.g., addition and multiplication). Each subprotocol takes secret-shared values as input and outputs the intended result also as secret shares, while ensuring that no information about the private input is leaked in between.
Affine functions over shared values. First, consider the case of adding two secret-shared values [x] and [y] without revealing x and y. This is possible by having each party add their own shares. Note that the resulting shares [x]1+[y]1∈q and [x]2+[y]2∈q add up to x+y∈q. As long as the shares of [x] and [y] were constructed independently, the resulting shares contain no information about x+y. Similarly, adding or multiplying by a public field element a∈q (represented as a, a using our notation) is also possible. We describe these basic subroutines in Protocols 3-5, shown in
Protocols 3-5 together enable secure computation of any affine function over the private input. For notational simplicity, we write affine functions directly in terms of the secret shares [⋅] without explicitly showing how the basic subroutines are invoked. For instance, we can express the composite protocol for securely evaluating ƒ(x, y)=ax−y+b given shared inputs [x] and [y] and public values a and b as
[ƒ(x,y)]←a[x]−[y]+b.
Multiplication of shared values. Multiplication of secret-shared values is more complicated. We adopt a useful tool known as Beaver multiplication triples [7], which are triples of (correlated) random values that can be used for secure multiplication. More precisely, a Beaver multiplication triple is a secret-sharing of a random product: namely, a triple ([a], [b], [c]) where
are random field elements and c=ab∈q. The key observation is that if two parties possess a secret-sharing of a random multiplication triple, it is possible to compute a secret-sharing of an arbitrary product with a small amount of communication. In this work, we take a server-aided approach [8] to generate the multiplication triples, where we essentially outsource the generation of the multiplication to an auxiliary party (namely, CP0) during a preprocessing step. The Sharemind framework [9] for multiparty computation uses a similar technique in their share multiplication protocol. Protocol 6 outlines the generation of a multiplication triple, and Protocol 7 shows how it is used to perform secure multiplication on shared values. Both protocols are provided in
Correctness. Correctness of Protocol 7 follows from the fact that
Because this expression is affine over the values in a multiplication triple (a, b, and c=ab), it can be securely computed given only the secret-shared triple using Protocols 3-5. Note that x−a and y−b can be treated as public since they are revealed to both parties in Lines 2 and 3 of Protocol 7.
Security. Unlike previous subroutines, Protocol 7 is interactive, so each party sees additional information beyond their initial input shares. Thus, its security needs to be analyzed more carefully. Consider the view of CP1 in Protocol 7. It observes the secret shares of the multiplication triple ([a]1, [b]1, and [c]1) and the data sent by CP2 in Lines 2 and 3 ([x]2−[a]2 and [y]2−[b]2). First, note that [a]1, [b]1, and [c]1 are fresh shares of a, b, and c, and thus are independent of a and b. In addition, a and b are each independently random and unknown to CP1. This means that [x]2−[a]2=([x]2+[a]1)−a and [y]2−[b]2=([y]2+[b]1)−b are independent and uniformly random. Thus, the view of CP1 during the protocol execution contain no information about x and y (all the values CP1 sees are distributed uniformly and independently random). The same argument holds for CP2. Overall, neither party learns anything about x and y during Protocol 7 that was not previously known.
With the addition of secure multiplication subroutine (Protocol 7), we now have a general-purpose secure computation framework for arbitrary arithmetic circuits (polynomial functions). In the following, we extend our shorthand notation for composite protocols to include multiplication:
[a][b]:=Secure multiplication of [a] and [b].
Multiplication triples are traditionally used to compute a product of two secret-shared values. In many scenarios, the direct application of multiplication triples to more complex computations (where each multiplication operation invokes Protocol 7) leads to highly inefficient protocols, especially when there are a large number of multiplications. Here, we describe how to generalize multiplication triples to facilitate the evaluation of arithmetic circuits (i.e., polynomials) on secret-shared data. Our approach is much more practical for complex computations such as computing GWAS statistics, and thus, may be of independent interest. We begin by giving an outline of our method:
More formally, let ƒ be the arithmetic circuit (represented as a polynomial) that we want to evaluate:
where the number of monomials T, the coefficients c(t), and the exponents pi(t) are public, and the input is given as secret shares [x1], . . . , [xn].
Input blinding. In the first step, the inputs are Beaver-partitioned into (i) public values x1−r1, . . . , xn−rn and (ii) hidden (secret-shared) blinding factors [r1], . . . , [rn] (Protocol 8). The computing parties CP1 and CP2 receive x1−r1, . . . , xn−rn as well as shares of the blinding factors [r1], . . . , [rn]. The auxiliary party CP0 chooses the blinding factors r1, . . . , rn.
Next, we re-express the function ƒ as a polynomial in the variables r1, . . . , rn using the substitution xiri+(xi−ri). Moreover, we treat the values xi−ri as a constant for all i (since this value is publicly known to CP1 and CP2):
where we introduce a new set of parameters: coefficients {tilde over (c)}(t) and {tilde over (d)}i, exponents {tilde over (p)}i(t), and {tilde over (T)} denoting the number of intermediate-degree terms (between 1 and deg(ƒ)). We separately group degree-one terms and the terms corresponding to ƒ(r1, . . . , rn) and ƒ(x1−r1, . . . , xn−rn) to simplify our subsequent analysis. We enforce that ({tilde over (p)}1(t), . . . , {tilde over (p)}n(t)) be unique for each t=1, . . . , {tilde over (T)}, and assume that there is a canonical structure for g that is fixed and known to all of the parties. In other words, {tilde over (T)} and the new exponents {tilde over (p)}i(t) are fixed and known to all parties.
Offline computation. In the second step, CP1 and CP2 compute the values of {tilde over (c)}(t), {tilde over (d)}i, and ƒ(x1−r1, . . . , xn−rn) from the blinded input x1−r1, . . . , xn−rn derived in the first step of the computation. Meanwhile, CP0 computes
which is possible because CP0 knows the blinding values r1, . . . , rn.
Output reconstruction. Finally, CP0 secret shares R(0), . . . , R({tilde over (T)}) with CP1 and CP2. Then, CP1 and CP2 compute
which is [g(x1, . . . , xn)] by Eq. (1). Note this function is affine over the secret inputs [r1], . . . , [rn] and [R(0)], . . . , [R({tilde over (T)})]. Therefore, it can be computed non-interactively using Protocols 3-5.
The overall procedure for securely evaluating a polynomial given a secret-shared input is provided in Protocol 9, as shown in
Security. Security of this protocol follows analogously to that for the Beaver multiplication protocol. Namely, the additional data observed by each party in the protocol execution consists entirely of independent random values q, and thus, do not reveal any information about any secret input. As before, consider the view of each party. In the first step (BeaverPartition), CP1 sees values [x]2−[r]2=([x]2+[r]1)−r and [r]1, which are distributed uniformly and independently over q (since r and [r]1 are independent, uniformly random values of q). An analogous argument holds for CP2. The second step is non-interactive. In the final step, CP1 and CP2 receive secret shares of CP0's computation results, which individually are independent and uniform over q. Therefore, the views of both CP1 and CP2 in the protocol execution consists of uniformly random values. To conclude the argument, we note that CP0 does not receive any messages from any other party in Protocol 9 (so its view is trivially independent of any secret inputs).
Circuits with multiple outputs. Our method can be further generalized to arithmetic circuits with more than one output as follows. Suppose the circuit has output gates. We can represent each output by a polynomial ƒ1, . . . , ƒl on a common set of inputs x1, . . . , xn. To avoid redundant computation, we first write out Eq. (1) for each polynomial, and take the union over all intermediate-degree terms to obtain a set of exponents ({tilde over (p)}1, . . . , {tilde over (p)}n) that include all intermediate-degree terms that emerge in the circuit. During the second step, CP0 calculates r1{tilde over (p)}
The linearization cost. We define the linearization cost T of an arithmetic circuit to be the cardinality of the aforementioned union set of intermediate-degree terms. Intuitively, {tilde over (T)} represents the number of extra terms that need to be calculated by CP0 (over the blinding factors) and secret-shared with CP1 and CP2 as a consequence of linearizing the function as in Eq. (3). This notion will be useful in analyzing the efficiency of our Beaver partitioning approach when applied to various arithmetic circuits.
3.1 Comparison with Beaver Multiplication Triples
In
The key advantage of our generalized method is that irrespective of the number of multiplication gates and depth of the circuit, the online phase only requires a single round of communication and bandwidth equal to the input size. As a tradeoff, however, we incur a potentially large offline communication cost that includes {tilde over (T)}, which is O(2d) in general. We address this tradeoff by employing our method only in situations where the benefits clearly outweigh the costs. In the following, we highlight three concrete scenarios where our generalized Beaver partitioning approach is useful in the context of performing large-scale GWAS.
When the multiplicative depth of the circuit is at most one (deg(ƒ)≤1), there are no intermediate-degree terms in Eq. (1), and the linearization cost becomes zero ({tilde over (T)}=0). A frequent operation in GWAS for which this property holds is matrix multiplication. Suppose we want to compute the product of two secret-shared matrices with dimensions d1×d2 and d2×d3. Even though this operation requires evaluating m=d1d2d3 pairwise multiplications, the multiplicative depth d of the overall computation is 1. Moreover, the arithmetic circuit for performing matrix multiplication contains d1d2+d2d3 inputs and d1d3 outputs. Substituting these values into the expression in
Because our generalized Beaver partitioning method can evaluate circuits with multiple outputs, multiple operations (each represented as a circuit) that share common inputs can be combined into a single circuit for joint evaluation. In doing so, each input only needs to be Beaver partitioned once, which is more efficient than constructing a fresh Beaver partition for each operation individually. In fact, we can implicitly combine circuits operating on common inputs by reusing the Beaver partitions on the common inputs across multiple invocations of Protocol 9.
More precisely, given a secret-shared input [x] involved in the evaluation of ƒ1(x, . . . ) followed by ƒ2(x, . . . ), we can simulate the joint evaluation of ƒ1 and ƒ2 by obtaining the Beaver partition of [x] during the evaluation of ƒ1 and reusing the results, namely x−r, x−r and [r]1, [r]2, r for a uniformly and independently random r∈q, for ƒ2. Note this does not affect the security of our method, because in every invocation of BeaverPartition or ShareSecret in the overall protocol, the views of CP1 and CP2 still consist only of uniform and independent values in q.
As a concrete example, consider the following power iteration procedure that appears in our randomized PCA protocol (Protocol 32):
where G is d1×d2 and A(t) is d2×d3 for all t. In practice, d1 and d2 are large while d3 is small. Even with our improved method for secure matrix multiplication (Scenario 1), carrying out this computation by considering each multiplication separately would require Beaver partitioning G a total of 2T times—a prohibitive cost for a large G. With the reuse of Beaver partitioned data, we need to Beaver partition G only once. This is another key factor in achieving practicality for large-scale GWAS. The complexity comparison for this example is shown in
Exponentiation is a special case where the linearization cost grows linearly with the circuit depth. Specifically, consider a circuit that takes a single value x as input and outputs all powers of x up to a known parameter α≥2: x2, . . . , xα. For simplicity, assume that the circuit is naïvely represented as a sequence of α−1 pairwise multiplications. In this case, the circuit has 1 input, α−1 outputs, depth α−1, and consists of α−1 multiplication gates. Its linearization cost T is {tilde over (T)}=α−2, since there are α−2 intermediate powers of the blinding factor (from 2 to α−1) that needs to be calculated and shared by CP0. Using the asymptotic analysis given in
Here, we introduce a number of protocol building blocks that we leverage when constructing our GWAS protocol.
Let T={kivi}i=1d be a public table that maps keys ki to values vi, where ki, vi∈q for all i∈{1, . . . , d}. The table lookup functionality takes as input a key k∈q and returns a value v∈q, where v=T(k) if k is in the table. If k is not present in the table, then the output of the table lookup is undefined (can be an arbitrary field element). A table lookup on secret-shared inputs corresponds to the setting where the input key [k] is secret-shared, and the output should be a secret sharing of the value [v]. Table lookups can be implemented by representing the entries in the table as points on a polynomial ƒ. Then, looking up a key k in T just corresponds to evaluating the polynomial on k. We write LagrangeInterpolation to denote an algorithm that takes as input a table T of d values and outputs a polynomial (of degree d−1) that interpolates the mappings in T:
We now describe several protocols for performing bit-wise operations on secret-shared inputs. Let {[ai]}i=1L:={[a1], . . . , [aL]} be a secret L-bit vector where ai∈{0, 1} for all i, and every individual bit is secretly shared.
To use secret sharing in practice, we need to define a mapping between data values and field elements. In this work, we use a fixed-point (FP) representation [13] to represent real-valued numbers that appear during the GWAS computation. Various building blocks for secure computation with FP numbers, including division, have been proposed by Catrina and Saxena [13]. Here, we recall their protocols and describe several optimizations that we make in our work (namely, using a server-aided design and removing the dependence on the secure bit-decomposition subroutine). We additionally describe novel square-root and square-root inversion subroutines, which are frequently invoked in GWAS.
Let k be the number of bits we use to represent a signed real number, of which ƒ bits are allocated to the fractional domain (referred to as the precision). The values of k and ƒ are chosen such that any real number we expect to encounter during the protocol falls in the range
:=[−2k−ƒ−1,2k−ƒ−1)⊂.
We map each x∈ to an element in the finite field q using the encoding function
E
ƒ(x)=└x·2ƒ┘ mod q,
where └⋅┘ denotes the floor function. Conversely, each field element z∈q is mapped back to real data space ∪{NaN} using the following decoding function:
Intuitively, this representation corresponds to taking a real number in and truncating it to the closest multiple of 2−ƒ.
Notation. We write [Eƒ(x)] to denote a secret-share of a value x∈. We alternatively express this as
[x](ƒ):=[Eƒ(x)] for x∈.
We use the same base field q for all values of ƒ for interoperability. We choose q to be large enough for the highest precision required. Secret sharing of integers x∈ with an identity encoding function (i.e., [x](0)) is denoted without the superscript as [x], i.e.,
[x]:=[x](0) for x∈∩,
since it is equivalent to directly treating the integer as an element in q.
Addition and subtraction of secret-shared fixed-point values correspond to addition and subtraction on the underlying secret-shared field elements, provided that the fixed-point values are encoded with the same precision. In particular, we define
[x±y](ƒ):=[x](ƒ)±[y](ƒ) and [x±a](ƒ):=[x](ƒ)±Eƒ(a),
for secret-shared values x, y∈ and public constant a∈. Unlike the case of adding and subtracting shares of field elements, adding and subtracting shares of encoded fixed point values introduces a small amount of error (bounded by ½ƒ−1). To see why, observe that by construction, the shared value [x](ƒ)+[y](ƒ) is an additive sharing of Dƒ(Eƒ(x))+Dƒ(Eƒ(y)). Due to the imprecision introduced by the fixed-point encoding, this value is close to [x+y](ƒ)=Dƒ(Eƒ(x+y)), but there is some error due to the non-linearity of the encoding/decoding procedures.
Multiplying two FP numbers requires an additional step. Note that directly multiplying two encoded FP numbers outputs a result with precision 2ƒ instead of ƒ. In particular, we write
[xy](2f):=[x](ƒ)[y](ƒ).
To scale the precision back to ƒ, we use the Truncate subroutine (Protocol 16) from Catrina and Saxena [13], as shown in
Security. Notably, Line 5 of Truncate reveals the value x+r in the clear (to CP1 and CP2), where r is uniform over the set {0, . . . , 2b+κ−1}, and x∈{0, . . . , 2b−1}. Assume that the base prime q is large enough such that x+r does not overflow. Then, the distribution of x+r is statistically close to the uniform distribution over {0, . . . , 2b+κ−1}. More precisely, the statistical distance between the two distributions is bounded by 2−κ for all x∈{0, . . . , 2b−1}. (Note that the statistical distance between two distributions 1 and 2 over a common finite domain χ is defined to be ½Σx∈χ|Pr[1(x)]−Pr[2(x)]|.) This implies no adversary (including computationally unbounded ones) is able to distinguish x+r from a uniformly random field element r′, except with an advantage 2−κ—i.e., the difference in the a posteriori guessing probability of a given candidate being a real message (i.e., x+r) between the two candidates (i.e., r′ and x+r, randomly shuffled) is upper-bounded by 2−κ, where 0 corresponds to perfect indistinguishability. Thus, CP1 and CP2 effectively sees x+r as a uniformly random element, and do not learn anything about x from it as a result.
Our overall GWAS protocol requires multiple invocations of the Truncate protocol, which increases the adversary's distinguishing advantage by a multiplicative factor v equal to the total number of invocations (based on a standard hybrid argument; cf. [14, § 3.2.3]). In practice, we choose our statistical security parameter κ such that the adversary's distinguishing advantage v·2−κ<2−30, or equivalently, κ−log2(v)>30. This ensures that the view of each party in the overall protocol is statistically indistinguishable from a sequence of uniformly random field elements.
Next, since 2b+κ lower-bounds the base prime q we use for the overall protocol, which in turn determines the overall efficiency, we choose the smallest possible b for each invocation of Truncate. For example, in Line 2 of MultiplyFP, we choose b=k+ƒ. Recall that by assumption, every value encountered during the protocol lies in D. This means that xy∈, which implies E2f(xy) has effective bit-length at most k+ƒ.
The following sections describe protocols in which individual bits of a secret integer are separately secret-shared. To exploit the smaller field size needed for bit operations compared to FP operations, we instead perform operations over an auxiliary field q′ where q′<<q. We write [[x]]:=[[x]]1, [[x]]2 to denote a secret sharing over q′. Because our use of the auxiliary field is limited to unsigned integers, the mapping between the data space and the auxiliary field is simply the identity map over {0, . . . , q′−1}. When appropriate, we invoke subroutines previously defined over [⋅] with the secret shares [[⋅]]. In our protocols, secret shares over q′ are constructed using the function ShareSecretSmallField defined as follows:
To implement a comparison protocol on secret-shared fixed-point values, we first implement a secure sign test for FP values (i.e., a comparison with zero). Here, we use the technique of Nishide and Ohta [15] for comparing elements of q to the value q/2 (as a real number). Given a secret-shared FP value [x](ƒ), the idea is to securely retrieve the least significant bit of 2·Eƒ(x). If x is negative, then Eƒ(x)>q/2, and thus 2·Eƒ(x) wraps around and results in an odd integer (since q>2 is odd). On the other hand, if x is positive, then Eƒ(x)<q/2, which implies 2·Eƒ(x) is even. Thus, retrieving the least significant bit of 2·Eƒ(x) is equivalent to performing a sign test on x. The least significant bit is securely obtained by manipulating the binary representation of 2·Eƒ(x)+r, where the blinding factor r is sampled uniformly at random and secret-shared by CP0. The overall procedure is shown in IsPositive (Protocol 19), provided in
Security. The security of IsPositive follows from the security of secret sharing (Lines 4-7) and the fact that 2Eƒ(x)+r in Line 8 does not reveal any information about x to either party, since r is uniformly random and independent from all other values.
Comparing secret-shared FP values. The IsPositive protocol can be used to compare two secret-shared FP values [x](ƒ) and [y](ƒ) by casting the problem as a sign test of the difference [y](ƒ)−[x](ƒ). This extended protocol is named LessThan (Protocol 20). Additionally, we define LessThanPublic (Protocol 21) for the setting where one of the input values is known to both CP1 and CP2. Both protocols are provided in
We implement division and square root routines (for FP values) using Goldschmidt's algorithm [16], which approximates the desired operation as a series of multiplications. Each iteration quadratically reduces the relative approximation error.
Goldschmidt's algorithm for division. Take two real numbers a, b∈. We describe Goldschmidt's algorithm for approximating the quotient a/b∈. Let w0 be an initial approximation of 1/b and let ϵ0:=1−bw0 be the relative error for the approximation w0. We require that |ϵ0|<1. The algorithm iteratively computes the following:
x
t
←x
t−1(1+yt−1)
y
t
←y
t−1
y
t−1
with x0=aw0 and y0=ϵ0. Here, xt converges to a/b [16].
Goldschmidt's algorithm for square root. For computing the square root of a value a∈, Goldschmidt's algorithm starts with an initial approximation w0 of 1/√{square root over (a)}. As before, let ϵ0:=1−aw02 denote the relative error in the approximation. We require that |ϵ0|<1. The algorithm then iteratively computes
z
t−1←1.5−xt−1yt−1
x
t
←z
t−1
x
t−1
y
t
←z
t−1
y
t−1
with x0=aw0 and y0=w0/2. Here, xt converges to √{square root over (a)} and 2yt converges to 1/√{square root over (a)} [16]. Note we obtain the square root inverse of a for free.
Choosing the number of iterations. Normally, if the input is available in the clear, we would iterate Goldschmidt's algorithm until the values satisfy some convergence criterion. However, this strategy does not extend to the secure computation setting when the number of iterations needed before convergence could itself reveal information about the underlying inputs. In our protocol execution, we thus take an oblivious evaluation approach and instead fix the number of iterations in advance and always iterate the algorithm for that many rounds. We choose the number of iterations to be sufficiently large to ensure the desired level of accuracy. In this work, we follow the recommendation from a previous work [13] and use 2┌log2(k/3.5)┐ iterations for both the division and square root routines. An extra iteration is added for division protocol to account for our initial approximation (described below) being slightly less accurate, albeit more efficient, than what was proposed in the original work.
Computing the initial approximations. For both the division and square root routines, we require a suitable initial approximation w0 of 1/x or 1/√{square root over (x)} to ensure convergence of the iterative procedure. We achieve this by first scaling the input to be within [0.25, 1) and evaluating a quadratic polynomial that approximates the desired function within that range. Reverse scaling is performed to map the approximation back to the scale of input.
We introduce a new protocol NormalizerEvenExponent (Protocol 22, shown in
After obtaining [22t] and [2t] where 2k−2≤22t·z<2k for the input value [x](ƒ), we scale x to [0.25, 1) by truncating k−ƒ bits of [22t][x](ƒ) with Truncate. The resulting scaled input is denoted [
Given the scaled input [
where the coefficients of non-constant terms are chosen to be integers to avoid adding an extra call to Truncate. The resulting estimate, denoted [
Thus, we approximate 1/x as
[w0](ƒ)←Truncate([22t][w0](ƒ),k+ƒ+2,k−ƒ).
Note our scaled approximation
Adapting the above approach for the inverse square root, we observe
which leads to
[w0](ƒ)←Truncate([2t][w0](ƒ),└k/2┘+ƒ+2,(k−ƒ)/2)
for approximating 1/√{square root over (x)}. Here, b=└k/2┘+ƒ+2 is sufficient, since 2t is at most └k/2┘ bits and
The resulting subroutines Divide and SqrtAndSqrtInverse are shown as Protocols 23 and 24 in
Here, we describe how we choose the size of finite fields (e.g., q) for secret sharing.
Choosing the primary field q. Recall first that our supported data range ⊂ is defined to be [−2k−ƒ−1, 2k−ƒ−1), where k is the total number of bits used for the fixed point representation and ƒ<k is the number of bits assigned to the fractional range. Also, κ is the statistical security parameter for the subroutines that provide statistical security guarantees (i.e., Truncate and NormalizerEvenExponent). Whenever we multiply two secret-shared FP values, we obtain the value [x](2f) for some x∈. This means q must have at least k+ƒ bits. Moreover, invoking Truncate subroutine on [x](2f) to obtain [x](ƒ) requires us to generate a (k+ƒ+κ)-bit random number (Line 1 of Truncate, with b=k+ƒ), which lower bounds q at 2k+ƒ+κ. We ensure that the maximum precision of a secret FP number in our protocol is 2ƒ (by invoking Truncate when necessary), although one could use higher precisions at the expense of increasing q. In our setting, the highest lower bound on q is introduced by Line 6 of Divide where we invoke Truncate with k+ƒ+2. Thus, our final requirement for q is
q>k+ƒ+κ+2.
Choosing the auxiliary fields. For the small auxiliary field for bit operations, we choose two different base primes q1 and q2 for maximal efficiency.
In our protocols, TableLookupWithFieldConversion imposes a lower bound on the small prime, as the size of the field must be at least as large as the number of entries in the table. When the secret shares are being constructed in the auxiliary field, we find all associated table lookups and choose the smallest prime that is larger than the largest table size. In summary, NormalizerEvenExponent, which Divide and SqrtAndSqrtInverse depend on, uses
q
1
>k/2,
whereas IsPositive, which LessThanPublic and LessThan depend on, uses
q
2>√{square root over (BitLength(q))}+1.
Note that q2 is often much smaller than q1, and thus our use of two auxiliary fields leads to a substantial improvement in efficiency compared to a one-size-fits-all approach.
In addition, we note that our use of NormalizerEvenExponent for Divide, as opposed to the more natural normalization routine without the requirement of the exponent being even, is intended for efficiency. If we were to take the alternative approach, the size of the table lookup at the end of the procedure becomes k instead of k/2. This doubles the size of the small field used for the entire protocol execution, which in turn, effectively doubles the amount of communication.
Parameters for experiments. The parameter setting we used for our benchmark experiments is as follows: k=60, ƒ=30, κ=64, q=2160−47, q1=31, and q2=17. Base primes q, q1, and q2 are chosen to satisfy the requirements above. The data range parameters k and ƒ naturally depend on the input data dimensions and the required level of precision. Our choice was sufficient to obtain accurate results for the benchmark data sets. While larger data may require higher precision, we note that even doubling the precision (e.g., k=120 and ƒ=60) increases the size of each field element (the bit length of q) by only roughly 40%, and thus maintains the practical feasibility of our protocol.
Supplementary Note 7: Further Optimization with Shared Random Streams
To further improve performance, we use a pseudorandom generator (PRG) to generate the random bits (used primarily for secret sharing and Beaver partitioning). At a high level, a PRG takes as input a short uniformly random seed (e.g., 128-bits) and outputs a long stream of random-looking bits (e.g., 264 bits). The security guarantee is that no “efficient” adversary (i.e., a polynomial-time algorithm) is able to distinguish the output of the PRG from a uniformly random sequence of random bits. Using a PRG to derive a random stream of bits allows us to significantly reduce the communication needed in the protocol. For instance, instead of sending a party a long stream of uniformly random bits, one can instead send a short PRG seed and have the receiving party derive the random bits by evaluating the PRG.
This optimization significantly improves the performance of ShareSecret and BeaverPartition. Protocols 25 and 26, provided in
Using PRGs to construct the random streams significantly reduces the bandwidth of our overall protocol. In particular, the initial Beaver partitioning of the input genotype matrix (a matrix with a million rows and a million columns) no longer requires CP0 to send an equally-large random matrix to each of the main computing parties (CP1 and CP2). In fact, because the communication needed for matrix multiplication in our Beaver partitioning framework is proportional to the size of the output matrix (
Here, we describe the subroutines for linear algebraic operations that we developed and used to implement principal component analysis (PCA) in GWAS.
Notation. We extend our previous notation for secret sharing to matrices (which includes vectors and scalars as special cases). Typically, we will use bold-faced uppercase letters (e.g., A, B) to denote matrices and bold-faced lowercase letters (e.g., u, v) to denote vectors. For a matrix A∈n×m and an index 1≤i≤n we write Ai:,: to denote the submatrix corresponding to the bottom n−i+1 rows of A (namely, the matrix consisting of rows i, i+1, . . . , n of A). We write A:i,: to denote the submatrix consisting of the top i rows of A. Similarly, for a column index 1≤j≤m, we write A:,j: to denote the submatrix consisting of the rightmost m−j+1 columns of A, and A:,:j to denote the leftmost j columns of A. In some cases, we also combine the two; for instance, we write A:i,:j to denote the submatrix of A containing the first i rows and j columns of A.
For a real-valued matrix A∈n×m, we define its secret sharing to be a secret-sharing of the fixed-point encoding of A:
We also extend our previously-defined protocols and functions that operate on scalar values to operate component-wise on matrices. For instance, invoking ShareSecret or ShareSecretSharedPRG on a matrix is equivalent to invoking the protocol on each element in parallel so that every element is individually secret shared.
Secure arithmetic with matrices. Much of our previous discussion in Supplementary Note 2 naturally extends to arithmetic operations with matrices. For example, affine functions over secret-shared matrices can be performed non-interactively, since each element of the resulting matrix is an affine function over the secret-shared elements in the input. Secure multiplication of two matrices is efficiently handled by our generalized Beaver partitioning method (Supplementary Note 3, Section 3.2). Furthermore, if the same matrix is used across multiple multiplications, its Beaver-partitioned data can be reused (Supplementary Note 3, Section 3.3). Note that to simplify the protocol description, we do not explicitly describe how the variable reuse optimization is applied in our protocol.
Protocols. We now describe the subroutines used for PCA. All of the protocols described here implement secure computations over secret-shared real values (represented as fixed-point values, as described in Supplementary Note 5).
Tridiagonalize([A](ƒ))→([Q](ƒ), [T](ƒ)): On input a secret-shared symmetric matrix [A](ƒ) where A∈d×d, the Tridiagonalize protocol (Protocol 30, shown in
Now we describe how we build a secure and efficient GWAS protocol using the tools described in the previous Supplementary Notes. We work in the crowdsourcing scenario where there are n study participants, denoted SP1, . . . , SPn, and m candidate SNPs to be tested for association. For simplicity, we assume each SP owns the data of a single individual. In addition, we assume that the mapping from SNP indices {1, . . . , m} to known SNP identifiers (e.g., rsids) is fixed and public. The individuals' genotypes at each position are of course hidden.
We represent the input data owned by each SPi as follows:
x
i
:=g
i
Aa+2·giaa,
which can be easily computed from the above data. We use a one-hot encoding of genotypes with giAA, giAa, and giaa in order to obtain the separate counts for AA, Aa, and aa for each SNP, which are needed for the quality control procedure (Hardy-Weinberg equilibrium and heterozygosity filters). Note that while computing xi from giAA, giAa, and giaa is easy, the reverse is quite expensive due to the need to perform several equality tests in order to extract each genotype. For a similar reason, missing genotypes are encoded in a separate vector hi as opposed to using a sentinel value in the genotype vector, which would also require equality tests to extract the information.
While all numbers in our data are binary-valued, it is straightforward to incorporate continuous values using the fixed-point representation (Supplementary Note 5). For instance, we can incorporate imputed genotypes by assigning probabilities (as FP values) to the corresponding entries in gAA, gAa, and gaa. Covariate features and the phenotype encodings can also be continuous.
During the initial data sharing phase, each SPi samples a random seed for a PRG and sends it to CP1 over a secure channel. This initializes PRGCP
At the end of this procedure, the computing parties CP1 and CP2 have access to the shares [giAA], [giAa], [giaa], [hi], [ci], and [yi], as well as their Beaver partitions for all i∈{1, . . . , n}. We also assume CPs have non-interactively computed the minor allele dosages
[xi]←[giAa]+2[giaa]
for all i∈{1, . . . , n}. Note that the Beaver partitions of [xi] can be obtained for free by simply taking a linear combination of the Beaver partitions of [giAa] and [giaa].
The first phase of the main GWAS computation includes common quality control filters for GWAS. In the following, we provide the list of filters we implemented. We write UB and LB to denote an upper bound and a lower bound, respectively, which take on different values for each filter. We assume that these bounds are public and fixed.
After the CPs compute each of the above quantities over the secret-shared values, they compare each quantity against the (public) thresholds using the LessThan or LessThanPublic protocols (Protocols 20 and 21). For all filters except for HWE, we multiply the thresholds with the denominator of the term being compared to avoid invoking the relatively more expensive Divide protocol. This reduces the required computation (other than comparisons) to only affine functions over the secret shares, which can be computed non-interactively.
The HWE filter is the most complex among the quality control filters. First, the numerator and denominator of αj are separately computed for all j, which is a depth-one computation with an overall output size 2 m. Given this result, we use m parallel invocations of Divide to securely compute αj for all j. After computing the terms (Ojt−Ejt)2 and Ejt via secure multiplications, we use three rounds of m parallel invocations of Divide (one for each t∈{AA, Aa, aa}) to calculate the ratios (Ojt−Ejt)2/Ejt. Lastly, the results are added for each SNP to obtain the HWE test statistics, which are then compared with the threshold.
After the CPs have securely evaluated each of the above filters, the computing parties compute an AND over the filter outputs. The computing parties then publish their shares and reconstruct the binary inclusion/exclusion status for each individual or SNP (to be revealed also at the conclusion of GWAS along with the association statistics). We perform this step in advance to allow the CPs to directly filter the data sets for the subsequent steps, which cannot be done efficiently if the filtering results are kept hidden. As further explained at the end of this Supplementary Note, this information reveals only a single bit per SNP or per individual and therefore arguably poses a significantly smaller privacy risk than the publication of association statistics.
In practice, researchers may wish to apply some filters first and evaluate the remaining filters over the reduced data. For instance, if the data is pooled from different genotyping platforms, it may be desirable to filter out SNPs with high missing rates first in order to discard non-intersecting loci. In our experiments, we performed quality control in three stages: (i) locus missing rate filter, (ii) individual missing rate and heterozygosity filters, and (iii) locus HWE and MAF filters. The results from each stage was used to reduce the data set before proceeding to the next stage.
We use nqc and mqc to denote the number of individuals and SNPs passing all quality control filters, respectively.
The next phase of the computation is population stratification analysis, where the goal is to obtain the top principal components of the genotype matrix via principal component analysis (PCA) to be included as covariates in the association tests.
SNP selection. The current standard practice is to perform PCA over a reduced set of SNPs that are largely independent from one another. Strong correlations among the SNPs, such as those arising from linkage disequilibrium (LD), can distort the PCA results and thus, need to be avoided [20]. In our protocol, we take the simplified approach of keeping only SNPs that are at least 100 Kb apart in order to minimize the impact of LD. Alternative approaches, such as directly computing pairwise correlations and filtering based on them, can be implemented at the expense of efficiency. Since annotations (e.g., genomic position) associated with each SNP in the input data is public, each party independently filters the SNPs according to the same procedure and obtains the reduced matrix non-interactively. Let mpca be the resulting number of SNPs to be used for PCA.
Computing standardization parameters. We denote the filtered input matrix for PCA as X∈{0, 1, 2}n
To standardize the matrix before performing PCA, we follow previous work [21] to compute the mean μj and standard deviation σj of each SNP j as
where the mean is computed over only the observed genotypes. Note μj is smoothed by adding a pseudocount for both allele types to avoid zero standard deviations.
To compute these parameters, the CPs first calculate the denominator and numerator of μj for all j, which are affine functions over the secret shares. Next, μj for all j are computed by mpca parallel invocations of Divide. Then, μj and 1−μj are securely multiplied and provided as input to mpca parallel invocations of SqrtAndSqrtInverse to obtain 1/σj for all j. We keep 1/σj instead of σj in order to standardize the genotypes by multiplication, and not by division, as
{tilde over (X)}
ij:=(1/σj)·(Xij−μj(1−Hij)),
where {tilde over (X)} denotes the standardize input matrix for PCA. Note we subtract the mean only where the genotype is not missing.
Implicit standardization. Explicitly constructing the standardized matrix {tilde over (X)} incurs a communication cost that scales quadratically in the dimension of the input data. This is because each element in {tilde over (X)} corresponds to an output gate that needs to be reconstructed. Instead, we adopt a lazy computation scheme for standardizing X, where every occurrence of {tilde over (X)} in the following computation is replaced with
In our PCA protocol, {tilde over (X)} only appears in products where the resulting matrix is either tall-and-skinny or short-and-fat with the smaller dimension being a very small constant. After writing out each multiplication with the above substitution, we observe that we can evaluate matrix products involving {tilde over (X)} in one of two different ways:
A{tilde over (X)}
A(X−HM)Σ−1(AX)Σ−1−((AH)M)Σ−1,
{tilde over (X)}B
(X−HM)Σ−1BX(Σ−1B)−H(M(Σ−1B)),
where A is short-and-fat and B is tall-and-skinny. This way, each matrix multiplication involving X results in an intermediary matrix that has at least one very small dimension. This means that the overall communication bandwidth scales linear in mpca and nqc. Note that the computation results we obtain via this procedure is equivalent to directly working with {tilde over (X)}.
Randomized PCA. Given the standardized genotype matrix {tilde over (X)} (which is never explicitly constructed), the final step of Phase 2 is to securely perform PCA to obtain the top principal components (PCs) of the columns of {tilde over (X)} (i.e., the left singular vectors of {tilde over (X)}). These principal component represent broad genetic patterns in the data that are thought to be indicative of population structure. Since PCA is a complex, iterative algorithm, directly performing PCA on {tilde over (X)} is infeasible due to the large input dimensions involved in realistic GWAS scenarios. To illustrate, if we were to naively invoke Eigendecompose (Protocol 31) on the covariance matrix {tilde over (X)}{tilde over (X)}T to perform PCA, we would need to QR factorize a nqc×nqc matrix a multiple of nqc times. As a result, communication scales cubically in nqc, which is infeasible. In this work, we instead use an efficient randomized algorithm for matrix factorization based on the techniques of [22].
We give a high-level sketch of the RandomizedPCA (Protocol 32, shown in
{tilde over (X)}QQ
T
≈{tilde over (X)}.
Note ψ is the desired number of top principal components, and α is a small oversampling parameter (set to 10 in our experiments) used to increase the stability and accuracy of the algorithm. Also, ψ+α<<mpca. Such Q can be obtained by projecting {tilde over (X)} onto a random subspace and extracting its orthonormal bases via QR factorization. We additionally apply the power iteration procedure to improve the approximation quality, as described in [22]. More precisely,
Q∈
m
×(ψ+α):=Orthonormal bases of the row space of Π{tilde over (X)}({tilde over (X)}T{tilde over (X)})ρ,
where Π∈(ψ+α)×m
Next, we perform ψ-truncated singular value decomposition (SVD) on the matrix Z:={tilde over (X)}Q≈UψΣψVψT, where UψΣψ(QVψ)T is the ψ-truncated SVD of {tilde over (X)}QQT, which is approximately {tilde over (X)}. Thus, Uψ approximates the desired left singular vectors of {tilde over (X)}.
In our adaptation of this algorithm, we take a step further and compute the SVD of Z by the eigendecomposition of an even smaller (ψ+α)×(ψ+α) matrix
Z
T
Z=Q′L′(Q′)T
with eigenvectors Q′ and a diagonal L′ containing the eigenvalues. Assuming the eigenvalues are sorted, we have
Q
:,:ψ
′=V
ψ and L:ψ,:ψ′=Eψ2.
Thus, we can recover Uψ given the eigendecomposition of ZTZ by computing
U
ψ
=ZQ
:,:ψ′(L:ψ,:ψ′)−1/2,
which concludes our algorithm.
Overall, our modified algorithm reduces the original problem of factorizing a large nqc×mpca matrix {tilde over (X)} to an eigendecomposition of a tiny (ψ+α)×(ψ+α) matrix ZTZ whose dimensions only slightly exceed the number of top principal components we are interested in. Notably, the size of this subproblem does not depend on nqc or mpca, and in typical GWAS scenarios, we expect ψ+α≤20.
The final phase of GWAS is computing the association statistic for each SNP, which intuitively quantifies how informative each SNP is for predicting the phenotype of interest. In this work, we compute the χ2 statistics of Cochran-Armitage (CA) trend test. We use a generalized version of the CA test described in [21] that additionally corrects for covariates, which in our case, includes those provided in the input (e.g., age) as well as the principal components from Phase 2. Correction is performed by regressing out the covariate features from each genotype and phenotype vectors before computing the test statistic.
Given the secret shares of ψ principal components Uψ∈n
Let xj∈{0,1,2}n
{circumflex over (x)}
j:=(In
ŷ:=(In
Next, the CA statistic is computed as the squared Pearson correlation coefficient between the corrected vectors {circumflex over (x)}j and ŷ, which can be expressed as
We include missing genotypes as zeros to maintain consistency with the PCA step. Note only SNPs and individuals with low missing rates are considered here as a result of Phase 1, so the impact of missing data is minimal.
Efficiently computing the above expression for rj2 for every SNP is not trivial. For instance, explicitly constructing the corrected genotypes {circumflex over (x)}j for all SNPs should be avoided, as it would incur a communication cost equal to the size of the genotype matrix, which is naturally quadratic in the input dimensions. Here, we provide an alternative formulation for computing the CA statistics that requires only linear communication bandwidth.
First, observe that ŷ can be explicitly computed, since it is only a single vector (unlike {circumflex over (x)}j, which exists for each j∈{1, . . . , mqc}). We evaluate ŷ as y−Q(QTy) to ensure that all intermediary results have linear size. Once the secret shares of ŷ are obtained, they are Beaver partitioned to use in future computation.
Next, note that computing the following summary statistics are sufficient for computing the CA statistics:
s
j
x:=Σi{circumflex over (x)}ji,∀j∈{1, . . . ,mqc},
s
j
xx:=Σi{circumflex over (x)}ji2,∀j∈{1, . . . ,mqc},
s
j
xy:=Σi{circumflex over (x)}jiŷi,∀j∈{1, . . . ,mqc},
s
y:=Σiŷi,
s
yy:=Σiŷi2,
We can directly calculate sy and syy from the secret shares of ŷ. In addition, note sjxy can be computed using the uncorrected xj, since taking the inner product with ŷ, which is already projected onto the null space of Q, ensures that the component of x residing in the covariate space will be ignored. Thus, we compute sjxy as Σixjiŷi for all j, which is a depth-one circuit with output size mqc.
It remains to show how to compute sjx and sjxx. Let X be an nqc×mqc matrix constructed by horizontally concatenating xj for all j (i.e., xj is placed in the j-th column of X). We express the required computation in terms of matrices as
where we write 1n
u:=QQ
T1n
B:=Q
T
X,
both of which have size linear in the input dimensions and thus, can be evaluated without incurring a quadratic communication cost. The remainder of the computation is computed as
s
x=1n
s
xx=diag(XTX)−diag(BTB),
which are depth-one circuits with output size only mqc each.
After all of the summary statistics have been computed, the CA statistics can be obtained via secure multiplications and invocations of SqrtAndSqrtInverse (or Divide) over length-mqc vectors in accordance with Eq. (4). We formally describe our procedure CochranArmitage for efficiently computing CA statistics in Protocol 33, shown in
At the end of our GWAS protocol, the computing parties CP1 and CP2 reveal their individual shares of the association statistics via ReconstructSecret (Protocol 2) and publish the results. In addition, they publish the results of the quality control filters from Phase 1 to facilitate the interpretation of the released GWAS statistics. For instance, one may wish to distinguish whether a particular SNP is deemed insignificant based on the association test or simply excluded from the analysis due to poor quality.
None of the computation by CP0 in our overall protocol for GWAS depends on input values, and thus can be performed entirely in advance in a preprocessing phase. The one exception is the filtering step in Phase 1. The auxiliary computing party CP0 needs the results of the filtering step to obtain the data dimensions used in the subsequent computation and use the appropriate Beaver partitions obtained during the initial data sharing phase. Thus, CP0 performs the precomputation in stages: once before initial data sharing and once after each filtering stage in Phase 1. Note CP0 can remain offline for the entirety of Phases 2 and 3.
To summarize, excluding the initial data sharing phase, our secure GWAS protocol requires communication complexity for both precomputation and online computation phases only linear in number of individuals and number of SNPs in the data. This is enabled by the following technical contributions:
The security of our end-to-end GWAS protocol reduces to the security of the underlying building blocks we use. More precisely, the view of each computing party in each of the subprotocols consist of uniformly random values (or values that are statistically close to uniform). As explained in the security section of Supplementary Note 5, we choose a large enough statistical security parameter κ to achieve an overall statistical distance (or equivalently, an adversary's distinguishing advantage)<2−30 between each computing party's view and an ideal distribution where all of the messages exchanged in the protocol consist of uniformly random values. This ensures no information about the underlying input genotypes and phenotypes is leaked to the computing parties during the computation.
Therefore, the only information about the input that is “leaked” are the publicly-revealed GWAS results computed by our protocol, which include the association test statistics and the output of the quality control filters. The quality control results consist of binary-valued inclusion/exclusion status of each individual or SNP, and in realistic GWAS scenarios, do not pose a significant privacy risk for the participants. For example, our per-individual filter reveals only whether a study participant had poor genotyping (too many missing genotypes) or has too many or too few heterozygous sites across the whole genome. The link between such high-level and limited information (a single bit) and the raw genotypes at individual SNPs is extremely tenuous. On the other hand, the association results arguably contain more information about the raw genotypes. We can compose our protocols with techniques based on differential privacy [24] (as a post-processing step) to assuage these concerns [25, 26]. However, at the scale of a million individuals or more, we expect the risk of releasing such summary statistics to be considerably small.
The security of our protocol assumes the computing parties do not collude with each other. In settings where it is difficult to justify this assumption, we can introduce additional computing parties to ensure tolerance against a bounded number of collusions in the online (namely, input-dependent) phase of the computation. Note that we still need to assume a semi-honest (and non-colluding) CP0 in the precomputation phase. In particular, instead of secret sharing the data between two main computing parties CP1 and CP2, we can distribute the private input x across n such entities CP1, . . . , CPn such that CP1 for 1≤i≤n−1 holds an independent and uniformly random number as its share ([x]i=ri), and CPn holds [x]n=x−Σi=1n−1ri. Analogous to the two-party case, we have the property that Σi[x]i=x. Here, as long as there exists a single honest party that does not collude, x remains perfectly hidden. Our building block protocols can be easily extended to handle secret shares over more than two parties. In the extended version of our Beaver partitioning approach, CP0 still obtains the blinding factors in the clear, so we require that CP0 does not collude with the other parties. In summary, the relaxed security assumption in the (n+1)-party setting (including CP0) is that CP0 and at least one other CP are honest (i.e., do not collude with the other parties). The main benefit of this setup is that the protocol is able to tolerate collusion among the other n−1 computing parties in the online phase of the computation. As a tradeoff, however, the overall communication scales linearly with the number of computing parties involved.
Our protocol provides security against semi-honest adversaries, namely adversaries that honestly follow the protocol execution, but may subsequently try to learn additional information about other parties' private inputs. In some scenarios, it may be necessary to ensure security even against malicious or active adversaries. For instance, if one of the computing parties is compromised or subverted during the computation, then it may deviate from the protocol specification in order to learn additional information about the participants' genomes.
In the last few years, an elegant line of work [3, 4, 6, 27] has introduced secret-sharing-based multiparty computation protocols with an optimal online phase which provides security against active (i.e., malicious) adversaries that may deviate from the protocol execution. The same techniques can be applied to our protocol to achieve security against active adversaries in the online phase of the computation. We describe the main idea used in the SPDZ protocol here [3] and how to extend it to our protocol. For simplicity, we describe our extension in the two-party setting, but everything generalizes naturally to the multiparty setting.
Secret-shared authenticated values. First, in the precomputation phase of the protocol, CP0 samples a random field element
and secret shares it with the computing parties CP1 and CP2. The secret element a serves as a secret key for an information-theoretic message authentication code (MAC) that is used to authenticate the secret-shared data and validate the outputs of the computation. Then, in the online phase of the computation, an authenticated secret-sharing of a value x∈q is represented as follows:
[x]:=(δ,[x]1,[α(x+δ)]1),(δ,[x]2,[α(x+δ)]2),
where δ∈q is some public scalar (known to all parties). At a high level, each computing party possesses a share of the input x as well as a share of the MAC αx on x.
Computing on authenticated shares. It is straightforward to see that
[x]+[y]=[x+y] and a·[x]=[ax] and [x]+a=[x+a], (5)
where [x]+[y] denotes component-wise addition, a·[x] denotes component-wise scalar multiplication, and
[x]+a:=(δ−a,[x]1+a,[α(x+δ)]1),(δ−a,[x]2,[a(x+δ)]2).
Thus, computing linear functions on secret-shared authenticated values is almost identical to computing linear functions of normal secret-shared values (Supplementary Note 2). Multiplication relies on Beaver multiplication triples as before. In particular, given two secret-shared authenticated values [x], [y], and an authenticated sharing of a multiplication triple ([a], [b], [c]) where c=ab, the parties can compute an authenticated secret-sharing of the product [xy]. Specifically, the parties first reveal the values x−a and y−b (but not their MACs). Then, they compute
[xy]:=(x−a)(y−b)+(x−a)[b]+(y−b)[a]+[c],
exactly as in Protocol 7 using the linear relations defined in Eq. (5). Moreover, our generalization of Beaver multiplication triples (Supplementary Note 3) directly applies to reduce the online communication costs of computing on authenticated values as our generalized Beaver partitioning method only requires computing linear relations over secret-shared authenticated values.
Validating the MAC. At the end of the computation, all of the computing parties have an authenticated secret-sharing [y] of the output y. In the semi-honest version of the protocol, the computing parties would simply publish their shares of the output, and reconstruct the final output of the computation. In the actively-secure protocol, all of the parties first validate the MAC on the output and only if the MAC verification succeeds do the (honest) parties publish their shares. This step of the computation is identical to the output verification step in the SPDZ protocol described in [3,
Initial data sharing. To leverage the SPDZ techniques for active security in the online setting, we assumed that each of the computing parties have secret-shared authenticated values (rather than vanilla secret-shared values). Thus, we need to adapt the initial data sharing procedure (between the study participants SP and the computing parties CP) so that the computing parties possess authenticated shares of the participants' input. We achieve this using a simple adaptation of the SPDZ input-sharing procedure. To simplify the description, assume for now that each SPi contributes just a single input xi∈q. The protocol naturally generalizes to the setting where each study participant contributes a vector of field elements. During the pre-computation phase, for each study participant SPi, the auxiliary computing party CP0 secret-shares a random value [ri] where
with each of the computing parties. To contribute its input to the study, SPi first interacts with CP0 to obtain the blinding factor ri. It then sends the blinded value xi−ri to each of the computing parties. Since the computing parties possess an authenticated sharing of [r1], they can locally compute an authenticated secret-sharing of [ri]+(xi−ri)=[xi], exactly as required for the online protocol. Note that here, we can use the same trick from Supplementary Note 7 and have CP0 derive the randomness ri from a PRG. Then, the total communication between CP0 and each SPi consists of only a single (short) PRG seed. Compared to the semi-honest protocol, there is increased communication between CP0 and one of the computing parties CP1 since CP0 needs to send over a share of ri (the other shares can also be derived from a PRG as before). The communication from the study participant to the computing parties also increases, since it now needs to broadcast xi−ri to all computing parties. However, the resulting protocol provides stronger security in the online setting.
Security discussion. By relying on the SPDZ protocol for the online phase of the computation, our online protocol is secure against even if one of the computing parties is actively malicious. More generally, in the extended setting with additional computing parties, the online phase provides security against an active adversary that corrupts all but a single computing party. However, the resulting protocol still relies on a semi-honest precomputation phase. In other words, security of the online phase relies on correctly-generated authenticated secret-shared (generalized) Beaver triples, as well as CP0 not colluding with any of the computing parties in the online phase. While this may seem like a strong assumption, it is important to keep in mind that the precomputation phase is input-independent, and moreover, we only need CP0 to be a “trusted dealer” (as opposed to a trusted party that possibly sees private inputs). More precisely, CP0 can be modeled as a “write-only” party since we can structure the protocol such that it never needs to receive any message from another party during the protocol execution. This means that an adversary who only corrupts CP0 does not compromise privacy of any of the study participants' inputs.
Here, we provide a ballpark assessment of the applicability of other existing cryptographic frameworks for secure computation, namely homomorphic encryption (HE) [28] and garbled circuits (GC) [12], for securely evaluating large-scale GWAS. HE refers to an encryption scheme that allows certain types of computation to be performed over the private input by manipulating the ciphertexts without decrypting them. Unlike our multiparty solution, HE computation can be carried out by a single party, albeit with greater computational overhead. The current state-of-the-art HE schemes can evaluate a single multiplication in 0.1 seconds [29]. Given that the number of multiplication gates in just the PCA computation is loosely lower-bounded by the number of elements in the genotype matrix, existing HE solutions already require over 30 years of computation to evaluate PCA over a matrix of one million individuals and a reduced set of 10K SNPs, which is clearly infeasible.
On the other hand, Yao's GC protocol is a two-party protocol that enables secure evaluation of arbitrary functionalities (represented as Boolean circuits). In Yao's protocol, one of the parties (i.e., the “garbler”) takes the Boolean circuit, and encrypts and permutes (i.e., “garbles”) the circuit. The other party (i.e., the “evaluator”) is then able to evaluate the garbled circuit and learn the final output, but nothing else about the other party's input. While Yao's protocol require just two rounds of communication for arbitrary computation, the size of the Boolean circuit needed to evaluate our large-scale GWAS computation is prohibitively large. This is due to the fact that GWAS evaluation consists primarily of arithmetic operations. Converting an arithmetic circuit to a Boolean circuit incurs a non-negligible overhead, which is typically linear or quadratic in the bit-length of the values. For example, assuming the same number of bits (60) are used to represent a single number as in our method (using the fixed-point representation), a pairwise multiplication using the Karatsuba algorithm requires roughly 600 AND gates. At 128-bits of security, we require 32 bytes to represent a single AND gate in a garbled circuit [30]. Thus, a single pairwise multiplication requires 20.6 KB of communication. Using the same lower-bound for the number of multiplication gates as in our analysis of HE, communicating a garbled circuit for performing PCA requires roughly 190 PB for a million individuals and 10K SNPs. This is well beyond the feasible realm.
Note our estimates above are very loose lower-bounds, and thus we expect the computational burden of the current state-of-the-art HE and GC frameworks for large-scale GWAS to be even greater in practice.
In case-control studies where the phenotype of interest is binary (e.g., disease status), logistic regression analysis is often used in conjunction with Cochran-Armitage (CA) trend tests to quantify the candidate SNP's impact on phenotypic odds ratio (e.g., disease risk). To this end, one typically trains a logistic regression model for each SNP that predicts the phenotype based on the SNP's minor allele dosage as well as other covariate features, such as age, gender, and population weights (captured by principal components). The estimated model parameter for minor allele dosage is interpreted as the marginal effect of the SNP on the odds ratio and is often reported with top GWAS hits.
Secure evaluation of logistic regression is very challenging. Unlike CA tests, where the required computation can be formulated as a few rounds of matrix multiplications and division (see Protocol 33), training a logistic regression model not only requires frequent evaluation of the sigmoid function (highly nonlinear), but also relies on iterative optimization techniques for parameter estimation (e.g., stochastic gradient descent). These aspects greatly increase the complexity of the overall computation. To illustrate, a recent work based on homomorphic encryption reported a runtime of 30 seconds for a single evaluation of the sigmoid function [31]. Applying this technique to GWAS with a million individuals would result in a runtime of almost a year for a single pass through the data set, which is clearly not feasible considering that gradient descent methods typically require many passes through the data.
The efficient secure computation techniques we introduced for GWAS offer a more tractable approach for performing logistic regression at large-scale. Notably, we approximate the sigmoid function (more precisely, its logarithm: −log(1+exp(−x))) as a piecewise-linear function (with 64 segments to ensure high accuracy). Given a private input, we perform secure binary search of depth six—implemented as a sequence of secure comparisons (Protocol 21)—to determine which segment the input belongs to. Then, we retrieve the coefficients of the corresponding linear function via secure table lookup (Protocol 11). Given these coefficients, the (approximate) output of sigmoid and its derivative can be easily computed, non-interactively.
The rest of the computation in logistic regression can be handled by our MPC framework in a straightforward manner. Importantly, our novel generalization of Beaver multiplication triples (Supplementary Note 3) is critical for obtaining an efficient protocol for the stochastic gradient descent (SGD) algorithm, which heavily depends on matrix multiplications and displays high data reuse patterns. In particular, a naïve approach using Beaver multiplication triples would freshly blind/Beaver-partition the input matrix for every iteration, which is infeasible at our scale; with our technique, Beaver-partitioning is performed only once.
Even with our advances, logistic regression imposes an overwhelming computational burden when applied to hundreds of thousands of SNPs in a typical GWAS data set. In practice, we suggest a two-step approach where CA tests are first used to narrow down the set of candidate SNPs with tangible association signals and we consider only the chosen SNPs in a subsequent logistic regression analysis.
We implemented logistic regression in our secure MPC framework and tested it on our benchmark lung cancer data set. Our protocol accurately computed the odds ratios for 100 SNPs within a day of runtime (Supplementary
Recently, Mohassel and Zhang also introduced an implementation of privacy-preserving logistic regression by combining techniques from secret-sharing-based MPC with garbled circuits [32]. Although they show that their protocol achieves practical runtimes for data sets containing up to a million training instances, their improved scalability comes at the expense of accuracy. In particular, a key factor that contributed to the scalability of their protocol is their use of a coarse approximation of the sigmoid function (namely, as a piecewise linear function with three segments). While this approximation may suffice for obtaining competitive predictive performance (the focus of their work) for certain data sets, it is too inaccurate in our GWAS setting where the goal is to obtain an accurate estimate of the model parameters. Moreover, even with their coarse approximation, training a separate logistic regression model for each of the hundreds of thousands of SNPs in a typical GWAS data set still requires several years of computation. We also observed that, using our implementation, we can achieve runtimes that are comparable to their approach by reducing the quality of our approximation of the sigmoid function and taking a similar number of passes through the data set for SGD. This further illustrates the challenges of achieving a practical solution for genome-wide logistic regression analysis under secure computation frameworks.
One or more functions of the computing platform of this disclosure may be implemented in a cloud-based architecture. As is well-known, cloud computing is a model of service delivery for enabling on-demand network access to a shared pool of configurable computing resources (e.g. networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines, and services) that can be rapidly provisioned and released with minimal management effort or interaction with a provider of the service. Available services models that may be leveraged in whole or in part include: Software as a Service (SaaS) (the provider's applications running on cloud infrastructure); Platform as a service (PaaS) (the customer deploys applications that may be created using provider tools onto the cloud infrastructure); Infrastructure as a Service (IaaS) (customer provisions its own processing, storage, networks and other computing resources and can deploy and run operating systems and applications).
The platform may comprise co-located hardware and software resources, or resources that are physically, logically, virtually and/or geographically distinct. Communication networks used to communicate to and from the platform services may be packet-based, non-packet based, and secure or non-secure, or some combination thereof.
More generally, the techniques described herein are provided using a set of one or more computing-related entities (systems, machines, processes, programs, libraries, functions, or the like) that together facilitate or provide the described functionality described above. In a typical implementation, a representative machine on which the software executes comprises commodity hardware, an operating system, an application runtime environment, and a set of applications or processes and associated data, that provide the functionality of a given system or subsystem. As described, the functionality may be implemented in a standalone machine, or across a distributed set of machines.
A computing entity herein receives the secretly-shared data from a client device, which may even be an end user device. Thus for example, but without limitation, a client device is a mobile device, such as a smartphone, tablet, or wearable computing device. Such a device comprises a CPU (central processing unit), computer memory, such as RAM, and a drive. The device software includes an operating system (e.g., Google® Android™, or the like), and generic support applications and utilities.
The underlying network transport may be any communication medium including, without limitation, packet-based, cellular, wireless, Wi-Fi, small cell, and combinations thereof.
Each above-described process (e.g., each of the protocols set forth in the drawings) preferably is implemented in computer software as a set of computer program instructions executable in one or more processors, as a special-purpose machine.
Representative machines on which the subject matter herein is provided may be hardware processor-based computers running an operating system and one or more applications to carry out the described functionality.
While the above describes a particular order of operations performed by certain embodiments of the invention, it should be understood that such order is exemplary, as alternative embodiments may perform the operations in a different order, combine certain operations, overlap certain operations, or the like. References in the specification to a given embodiment indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic.
While the disclosed subject matter has been described in the context of a method or process, the subject matter also relates to apparatus for performing the operations herein. This apparatus may be a particular machine that is specially constructed for the required purposes, or it may comprise a computer otherwise selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but is not limited to, any type of disk including an optical disk, a CDROM, and a magnetic-optical disk, a read-only memory (ROM), a random access memory (RAM), a magnetic or optical card, or any type of media suitable for storing electronic instructions, and each coupled to a computer system bus.
A given implementation of the computing platform is software that executes on a hardware platform running an operating system, such as Linux. A machine implementing the techniques herein comprises a hardware processor, and non-transitory computer memory holding computer program instructions that are executed by the processor to perform the above-described methods.
Communications herein (e.g., secret sharing of data, sharing of PRGs, etc.) preferably take place over secure connections. For machine-to-machine communications over a network, information typically is communicated over SSL/TLS links, or using any other protocol, although if significant trust is in place (e.g., machines being operated in a secure environment, or between entities that have a trust relationship, etc.) information may be transmitted in the clear.
In lieu of using physically-distinct computing entities, computations herein may be carried out within a cluster comprises a set of computing entities (e.g., processors). A grid computing architecture may also be utilized.
There is no limitation on the type of computing entity that may implement any connection (e.g. client-to-server). Any computing entity (system, machine, device, program, process, utility, or the like) may act as a client or a server.
While given components of the system have been described separately, one of ordinary skill will appreciate that some of the functions may be combined or shared in given instructions, program sequences, code portions, and the like. Any application or functionality described herein may be implemented as native code, by providing hooks into another application, by facilitating use of the mechanism as a plug-in, by linking to the mechanism, and the like.
The functionality may be co-located or various parts/components may be separately and run as distinct functions, perhaps in one or more locations (over a distributed network).
There may any number of computing parties that securely compute the GWAS statistics using the secretly-shared data sets (namely, the genome and phenotype data, and the random number data created during the pre-processing operation).
The entity that performs pre-processing (CP0) may also be a computing entity (e.g., CP1) in the MPC computation to generate the GWAS output.
The pre-processing may be provided as a service.
The secure computation of the genome and phenotype data may be provided as a service.
Computing entities herein may be independent from one another, or associated with one another. Multiple computing entities may be associated with a single enterprise entity, but are separate and distinct from one another with respect to the MPC secure computation itself over their respective secret shares.
Having described our invention, what we claim also is set forth below.
This invention was in part made with government support under GM108348 awarded by Department of Health and Human Services, National Institutes of Health. The government has certain rights in the invention.
Number | Date | Country | |
---|---|---|---|
62525446 | Jun 2017 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 16020058 | Jun 2018 | US |
Child | 17164235 | US |