Described herein is an ultra-secure integrated storage and analysis solution for personal genomic applications. The method guarantees data privacy whilst enabling access and ongoing analysis of genomic data when required. The provided solution can be applied to any confidential biological information regardless of its nature or size and can also be applied to natively encrypted sequencing.
As methods for faster and cheaper DNA sequencing and analysis continuously emerge, the market for “Direct-to-Consumer” genetic tests is booming. The sequencing revolution, paired with the emergence of well characterised and clinically actionable mutations, opens the way to personalised medicine and much more.
The adequate management of Genetic Data Privacy (GDP), as well as respect for the user's preferences in terms of reporting, requires new tools. Indeed, even though the relative standardisation of bioinformatics formats and analysis pipelines allows genetic analysts to build informative personalised genetic reports, the storage and reporting of these data require new methods to fully respect the personal privacy preferences of each patient, including in the case of an IT security breach or a successful cyber-attack.
The danger to GDP also comes from the fact that “anonymised DNA data” might in the future be “de-anonymised” by powerful A.I. models able to integrate information from within our genome and from outside it (including our personal social media information) and deduce the missing parts. Once our DNA sequence is disclosed, it is very difficult, if not impossible, to take our genetic privacy back, for better or for worse.
Indeed, the particular nature of biological information (and especially DNA) necessitates extreme caution to properly store, protect, analyse, and communicate both raw and analysed data to the final user. By nature, DNA sequence information can be used either against the user's personal interests or in his best interests.
As such, as of 2019, a growing number of companies are marketing genetic testing kits directly to consumers in order to inform them about their genome variations for many different applications (Health, Lifestyle, Ancestry, etc). Some of these tests are sold directly to consumers (“DTC”). Subsequent communication about the genetic results occurs frequently through web applications or through websites. These applications allow online access to computing systems that extract particular genetic variations out of the total sequence data and report in a technical manner as to their relevance.
The communication of the information is aimed at physicians and health care providers, who are familiar with genetic variations, but is difficult for the end user to understand. On top of that, in the case of psychologically impactful results (such as a 60% chance of developing breast cancer), DTC companies do not adequately adjust the test reporting to the user's preferences, leading to unnecessary stress.
DTC companies such as 23andme offer web and mobile applications for communicating genetic results. These consist of popular descriptions of hundreds of genetic features and associated technical reports. Although some reports are useful and actionable, many of them are either useless because they state what is already known (eye colour, teeth shape, alcohol tolerance), non-actionable (no preventive treatment available) or difficult to understand without the help of a genetic counsellor. The focus should be on the patient himself. Instead, the focus is rather clinical, and the genetic results are thus difficult for the end user to understand. The user is offered no tools to personalise the content in the mobile application or to share or discuss certain information with his/her physician or any other person, such as his/her genetic counsellor.
The existing tools for interpretation and communication of next generation sequencing (NGS) raw data remain uninviting, of limited utility and too high-level for general clinicians or consumers, who do not necessarily have an extensive background in genetics and bioinformatics. Nowadays, whole genome sequencing (or whole exome sequencing) data is still predominantly used in academia and is only gradually gaining interest in daily clinical practice. A number of companies develop tools for analysis, annotation and interpretation of these raw sequencing data. However, existing approaches remain high-level, are solely focused on experienced geneticists, often use complex user interfaces, lack flexible and responsive filtering, use limited annotation, and only a few of them offer a truly actionable, affordable, secure and personalised experience to the user.
Computer files can be protected by means of encryption. Homomorphic encryption is a form of encryption that allows direct computation on “ciphertext”.
In cryptography, a ciphertext is the result of encryption performed on plaintext using a cipher type algorithm, generating a piece of encrypted information that contains a form of the original plaintext, but which is unreadable by a human or computer without the proper cipher key to decrypt it.
Homomorphic encryption is capable of performing operations on ciphertext and generating an encrypted result which, if it were decrypted, would match the result of corresponding operations that had been performed on the original piece of unencrypted plaintext. As such, homomorphic encryption can be used for secure outsourced computation, for example secure cloud computing services, and securely chaining together different services without exposing sensitive data.
In typically highly regulated industries, such as health care, homomorphic encryption can be used to enable new services by removing privacy barriers inhibiting data sharing. For example, predictive analytics in health care can be hard to utilize due to medical data privacy concerns, but if the predictive analytics service provider can operate on encrypted data instead these privacy concerns are diminished.
A cryptographic system that supports arbitrary computation on ciphertexts, in other words a system which can perform computations of any type, rather than a limited number of types of computation from a set of predefined operations, is known as fully homomorphic encryption (FHE) system.
Theoretically, a fully homomorphic encryption system can provide any desirable functionality that an unencrypted system could, running on encrypted inputs to produce an encryption of the results. As such an FHE program need never decrypt its inputs, it can be run by an untrusted party without revealing its inputs and internal state.
Cryptographic systems that support FHE thus have great practical implications in the outsourcing of private computations, for instance, in the context of cloud computing. However, relatively few FHE systems have been demonstrated to function, and those that have been have done so at great cost to security level and processing power.
For example, a useful application of FHE would be for securely querying a database. Typical database encryption leaves the database encrypted at rest, but when queries are performed the data must be decrypted in order to be parsed. A fully homomorphic encryption scheme applied to this application was demonstrated in Gahi, Youssef; Guennoun, Mouhcine; El-Khatib, Khalil (11 Dec. 2015). “A Secure Database System using Homomorphic Encryption Schemes”. However the authors noted that the scheme was both low-level and non-secure compared to regular encryption techniques, and a huge toll was taken on the performance, with operations such as a 16 bit multiplication taking approximately 24 minutes.
US2017357749 describes methods of homomorphic encryption wherein genomic data and linear prediction models are batch encoded into one or more sets of polynomials, which are then encrypted, and dot product operations are performed on the encrypted polynomials. Limiting the supported prediction models to linear models removes the need for relinearization so that the encrypted operations are not impractically slow.
In order for FHE to be used for a more comprehensive, holistic analysis of bioinformatics data, the improvements described herein are desirable.
The invention uses a computer implemented method for securely providing a user with a personally relevant analysis of biological information comprising:
The biological information may be genetic information such as nucleic acid or protein sequence data.
The term “non-linear prediction models” as used herein refers to models where data are modeled by a function which is a nonlinear combination of the model parameters, and may depend on one or more independent variables as inputs. A detailed explanation of the supported prediction models is provided below.
It is an objective of the present invention to remedy all or part of the disadvantages mentioned above. The present invention fulfils these objectives by providing methods and systems allowing for the easy and quick interpretation of a personal genome sequence and more generally any biologically-relevant information.
The methods and systems create a powerful and secure environment allowing the user to exploit his/her own complex genome data and facilitate the user's exploration as to the relevance of particular genome variations in an actionable way. The present invention overcomes shortcomings of the conventional art and may achieve other advantages not contemplated by the conventional software and services.
In general terms, the present invention provides a method and/or system for efficient storage and/or communication of personal genome sequence data and/or medical information, making the relevant personal genome sequence and/or relevant medical information accessible on a mobile device or web application in an easy, secure and efficient way. The user is then free to select secure, cutting-edge methods (such as A.I. models) to analyse his own data in a way that only the user can access the analysis results.
Described is a computer implemented homomorphic encryption method for securely producing natively encrypted sequencing data in a way that allows subsequent analysis on the encrypted data without requiring the file to be decrypted.
In one embodiment, the present invention provides a method and/or system for securely providing a user with a personally relevant analysis of biological information comprising:
In one embodiment, the present invention provides a method for securely providing a user with a personally relevant analysis of biological information comprising:
The encryption of the file can be irreversible such that the raw data cannot be decrypted. The encrypted file can be entered using a unique user ID key, herein referred to as a GeneKey. The key allows the user to enter the encrypted file, ask specific queries of the data in the file and generate and access reports from the file. The key is unique to the user and cannot be duplicated or replaced. The unique key allows the user and nobody else access to the user specific analysis of the genetic information. The key can be in the form of a chip card with or without contactless capacity. The key can act as both a genetic/biometric ID card and a cryptographic key to open reports.
A unique DNA based identifier can be added to the user specific personal information at step b. The DNA based identifier can be selected from one or more of:
The SNPs or STRs can be from Chromosome Y or autosomes.
The method can contain a unique identifier determined according to DNA forensics methods added to the user specific personal information at step b, wherein
The file can contain any information relevant to an individual. The file can contain biological sequence data including protein or nucleic acid sequence data. The data may be genetic sequence information, for example a collection of single nucleotide polymorphisms (SNPs), a whole genome sequence, a partial or exome sequence. The data may include transcriptome, proteome, metabolome, medical data or any data stored in electronic medical records or collected by quantified-self devices. The data may be an amalgamation compiled from a variety of different providers or experimental techniques, optionally including genome, transcriptome, proteome, metabolome, medical data or any data stored in electronic medical records or collected by quantified-self devices.
The information may originate from a number of different providers or be sourced from two or more databases.
The user can add further information to the file. The file may therefore be supplemented with user specific personal information, for example one or more of history of illness, blood group, allergy information, birth date, location of birth, nationality, family contacts or family history of illness.
The information can be automatically added to the encrypted file without requiring user input. For example, the file can automatically be linked to a wearable fitness device which measures blood pressure or heart rate.
Further information can optionally be added after the file has been encrypted. Further genetic sequence or medically relevant information can be added after encryption.
The key allows access to the file to interrogate the information stored therein. The user can query the data and generate the answer to specific queries, for example disposition to future illness. The reports are also encrypted and require the user to have the key for access. The reports can be designed such that the report containing the analysed data can only be accessed once. Optionally the report containing the analysed data can only be accessed for a time limited period after creation (from a few seconds to decades).
The interrogation of the encrypted file can be operated through a mobile app providing access to a variety of analysis methods. The analysis methods may be end-to-end encrypted methods. The method may be applied to the fields of health (optionally including risk prediction and predispositions analysis), nutrition (optionally including genetically optimised diet), lifestyle (optionally including daily sunlight needs or life rhythms), family history (optionally including genetic genealogy, paternity testing, forensics), and genetic-centered social interactions (optionally including genetic interest group about syndromes or Orphan diseases).
The method can be used for the analysis of an individual's medical information. Alternatively the method can be used to prove ownership of a biological organism, for example an animal or plant. For example the method can be used to authenticate the origin or ownership of pets, race-horses, laboratory animals, farm animals, microorganisms, agricultural crops etc. Thus the method can be used to prove ownership, or prove the authenticity of an organism, based on comparing the sequence derived from the sample with the sequence in the encrypted file.
The method can be used where genetic information is from a biological asset owned by the user (whose ownership can be demonstrated by the user), such as, without being limited to, plants, animals, synthetic biological systems or microorganisms. The biological asset can be an animal or plant used in the agro-food industry, the cosmetics industry, or any other industrial domain or human activity.
The data can be analysed by a computer program. The computer program can be a classical or an Artificial Intelligence (“A.I”) program regardless of its A.I. class including without being limited to: Apriori Algorithm, Artificial Neural Networks, Collaborative Filtering, Decision Trees, Deep Learning, K Means Clustering Algorithm, Linear Regression, Logistic Regression, Naïve Bayes Classifier Algorithm, Nearest Neighbours, Random Forests, Support Vector Machine Algorithm or any method commonly described as belonging to A.I. field. The method can be a combination of programs including classical or A.I. models.
In order to further protect the information, the genetic sequence information can be encrypted at the point of sequencing a sample provided by the user.
In order to authenticate a sample, the genetic sequence information and authenticity of the sample can be encrypted at the point of origin of a sample provided by the user.
The method described includes a method wherein the content of the encrypted file combines all or part of the following elements:
Described is a method allowing authentication of the user by digital means during sample collection (such as a saliva sample) as well as a method to guarantee the integrity of a sample and of the sample shipment to the sequencing laboratory. These digital methods implement biometric authentication as well as digital tracking of the shipment and may involve the following technologies (and any combination of these technologies): GPS tracking, remote biometric authentication on secured software or hardware including drones, a USB stick logger embedded in the shipment and blockchain recording of sampling and transportation events.
Described herein is a secure integrated storage and analysis solution for personal genomic applications. The method guarantees data privacy whilst enabling access and ongoing analysis of genomic data when required. Described is a computer implemented homomorphic encryption method for securely producing natively encrypted sequencing data in a way that allows subsequent analysis on the encrypted data without requiring the file to be decrypted.
The system allows a user to directly mine his own information. Analyses are secured and new results are transferred encrypted. Each analysis method complies with End-to-End encryption. The user decrypts results using his unique key, guaranteeing total privacy. The methods bring the state-of-the-art algorithms close to the customer, for example using A.I. As new algorithms are developed they can be applied to the encrypted data. If a query cannot be applied due to insufficient genetic data, the system determines the quickest and most cost-effective way to generate the additional data required for the query to be satisfied.
The file can be supplemented with additional data, including additional genetic data or phenotype data.
Data on the file may include one or more of:
Encryption technology described herein allows fully homomorphic encryption to support super-fast operations in the encrypted domain. The technology takes the form of a set of software tools for use-case specifications and semi-automatic code generation.
A user's genomic data is provided in encrypted form to a service provider in order to predict a genetic trait or a risk of disease. The service provider evaluates a proprietary prediction model homomorphically on the encrypted data and returns an encrypted result without ever being able to access the genomic data in the clear. The encrypted result is then decrypted by the user—or an associated device—to view the prediction result value.
The method supports a wide class of prediction models that combine table look-ups and additive aggregation of independent gene-level contributions. Thus the invention extends far beyond logistic regression—the classical linear model for genome-wide association.
The classes of prediction models supported by the invention and the methods of their application are described below.
The prediction service provider is provided with a set of single nucleotide polymorphisms (or SNPs)
S=((rsid1, x1), (rsid2, x2), . . . , (rsidn, xn))
where rsidi indicates the identifier of the i-th SNP and xi indicates its value. For instance, when the SNPs contain a pair of nucleotide bases, each xi is an ordered pair of symbols in the standard alphabet “-ACGTYRWSKMDVHBN” and can only have 136 possible values.
In addition to the set of SNPs, the prediction may require a set of covariates cov providing additional information such as age, weight, height, body mass index, ethnicity or other relevant user-specific information.
The output value of the prediction is a probability that measures the presence of a genetic trait or a health risk:
p=prediction_model(S, cov)
By comparison with a selected threshold probability, the result value can be made a binary value (yes or no). By applying several models in parallel, the output may also be a vector of probabilities and/or binary values.
The sets of SNPs and covariates are input into the prediction models as a single vector of values:
V=(v1,v2, . . . vk)
Given the input vector V=(v1, v2, . . . , vk), a linear model returns the output probability
p=f(w0+w1·v1+ . . . +wk·vk)
where the function f and all the weights w0, w1, . . . , wk are real-valued and constitute the model.
For instance, when f is chosen to be the logistic function f(t)=1/(1+e^(−t)), the model is said to be a logistic model and w0, w1, . . . , wk are called the regression coefficients. However other linear models may use different functions.
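The linear model above can be sketched in a few lines of Python on unencrypted values. This is only an illustration: the function names `logistic` and `linear_model`, and the convention of passing the weights as a single list starting with w0, are assumptions made for this sketch.

```python
import math

def logistic(t):
    """Logistic function f(t) = 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def linear_model(weights, v, f=logistic):
    """Return p = f(w0 + w1*v1 + ... + wk*vk).

    weights: [w0, w1, ..., wk] (hypothetical layout for this sketch)
    v:       [v1, ..., vk], the input vector V
    """
    w0, rest = weights[0], weights[1:]
    if len(rest) != len(v):
        raise ValueError("need exactly one weight per input variable")
    return f(w0 + sum(w * x for w, x in zip(rest, v)))

# Toy values: the two contributions cancel, so the logistic output is 0.5.
p = linear_model([0.0, 1.0, -1.0], [2.0, 2.0])
```

Passing a different `f` gives another linear model in the sense used here, since only the weighted sum inside f is linear.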
Linear models have 2 intrinsic limitations:
Limitation 1. They assume that all input variables have independent contributions in the computation of p. Indeed the contribution wi·vi of vi is independent from all the other input variables.
Limitation 2. The contribution of an input variable vi is linear in vi.
What we call here non-linear models are a generalization of linear models where
p=f(w0+f1(v1)+ . . . +fk(vk))
and the coefficient w0 as well as the functions f, f1, . . . , fk are arbitrary and belong to the model.
Thus non-linear models escape Limitation 2. However each contribution fi(vi) remains independent from the other input variables, so Limitation 1 still applies.
Non-linear co-dependent models allow each contribution to depend on arbitrary subsets of input variables.
As an example, assume that input variables in V form contiguous clusters of co-dependent variables, for instance
V=((v1, v2), v3, (v4, v5, v6), v7, . . .).
In this example, v1 and v2 form a cluster, v3 is independent, v4, v5 and v6 form another cluster, v7 is independent, and so forth. A non-linear co-dependent model outputs
p=f(w0+f12(v1, v2)+f3(v3)+f456(v4,v5,v6)+f7(v7)+ . . . )
and the model parameters now include arbitrary multivariate functions.
In the general case, V is a collection of clusters (V1, . . . , Vq) where each cluster Vl is a collection of input variables Vl⊆{v1, . . . , vk}. An input variable may belong to several clusters. The contribution of cluster Vl in the computation of p is fl(Vl) and the output of the model is
p=f(w0+f1(V1)+ . . . +fq(Vq)).
We see that non-linear co-dependent models have no longer Limitation 1 and that
The method as per the invention supports these 3 categories of models.
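The most general of the three categories can be illustrated with a short plaintext sketch in Python. It works on unencrypted values only; the function name `codependent_model`, the representation of each cluster as a tuple of indices into V, and the toy contribution functions are assumptions made for this illustration.

```python
import math

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def codependent_model(w0, clusters, contributions, v, f):
    """Return p = f(w0 + f1(V1) + ... + fq(Vq)).

    clusters:      list of index tuples; cluster l selects v[i] for i in clusters[l]
    contributions: list of callables f_l, one per cluster
    """
    if len(clusters) != len(contributions):
        raise ValueError("need exactly one contribution function per cluster")
    total = w0
    for indices, f_l in zip(clusters, contributions):
        total += f_l(*(v[i] for i in indices))
    return f(total)

# Toy example: the first cluster couples two variables, the second is a single variable.
clusters = [(0, 1), (2,)]
contributions = [
    lambda a, b: 1.0 if a == b else -1.0,  # depends jointly on both variables
    lambda c: 0.5 * c,                     # independent contribution
]
p = codependent_model(0.0, clusters, contributions, [3, 3, -2], logistic)
```

A linear model is recovered when every cluster holds a single variable and each contribution is wi·vi; a non-linear model is recovered when every cluster holds a single variable and the contributions are arbitrary.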
In linear or non-linear models, all input SNP variables have an independent effect on the final prediction result.
However, in potentially many concrete cases of genomic predictions, this is not accurate because some of the input SNPs may belong to the same gene, resulting in dependencies between the contributions of these SNPs being observed in acquired medical data.
Therefore one gets a far more accurate model by combining the SNPs belonging to the same gene together in the same cluster, and possibly adding relevant covariates to that cluster as well, so that all observed dependencies are taken into account in the model.
The particular parameters of a model (the coefficient w0 and functions f, f1, . . . , fq) can be extracted from medical acquisitions in various ways e.g. using machine learning techniques.
We now show how the invention allows the evaluation of any non-linear co-dependent prediction model over encrypted input variables using homomorphic encryption.
Because this is the most general class of models, this description also applies—with simplifications—to linear and non-linear models.
The description that follows makes use of a generic homomorphic encryption scheme that supports:
An encryption of an integer x is denoted [[x]].
Section 3 describes one particular reduction to practice in more detail using a particular scheme.
2.1. Step 1: Key Generation
Using the key generation procedure of the encryption scheme, the user generates 3 different cryptographic keys:
The user publishes enc_key so that third parties can encrypt genomic data on behalf of the user.
The user publishes eva_key so that third parties such as prediction service providers can carry out homomorphic computations over encrypted data.
The user keeps sec_key private and will use it to decrypt the encrypted prediction results.
Optionally, sec_key can also be used by the user to provide encrypted genomic data to prediction service providers.
User data is divided into 2 distinct categories:
1. The set of SNPs attached to the user (genomic data),
2. The set of covariates attached to the user (medical profile).
In their standard form, the value of an SNP is an ordered pair of symbols in the alphabet “-ACGTYRWSKMDVHBN”. For non-autosomal chromosomes, or in cases of trisomy, an SNP can be composed of fewer or more than 2 symbols.
A convention must be adopted to encode the SNP value into an integer in an appropriate range. Typically, SNPs containing a pair of standard symbols can be encoded as an integer ranging from 1 to 136.
Alternately, the values of an SNP may be categorized into genetic variants, or groups of variants that are known to produce the same statistical effect on the medical condition of the user. In that case, the SNP value is replaced with an integer that encodes the group of variants the SNP belongs to.
In any case, if (rsidi, xi) denotes an SNP, we identify xi with the integer-valued encoding of its value.
The above SNP is made available in encrypted form as (rsidi, [[xi]]) where [[xi]] is a homomorphic encryption of xi under the user's public encryption key enc_key.
Covariates may be of very different nature and may rely on medical measurements in various units. By convention, the numeric representation of the j-th covariate may adopt the generic format
(Descriptionj, cj)
where (Descriptionj) is a unique descriptive object (e.g. a character string or a reference to some class in an ontology) and cj an integer-valued encoding of the value of the covariate. For instance,
(‘Height(cm)@2019-05-13’, 189)
may represent the user's height in centimeters at a certain date.
The above covariate is made available in encrypted form as
(Descriptionj, [[cj]])
where [[cj]] is a homomorphic encryption of cj.
The homomorphic prediction model, known by the service provider who is performing the evaluation homomorphically, is composed of:
(rsid1, . . . , rsidn)
(Description1, . . . ,Descriptionm)
Since the homomorphic prediction model is necessarily integer-valued, it may be obtained by approximating a continuous prediction model with an appropriate degree of precision.
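One simple way to picture this approximation is fixed-point quantisation of a real-valued contribution into an integer lookup table. The Python below is a sketch under assumptions: the source does not say how precision is chosen, and the names `quantize`, `build_table` and the `scale` parameter are hypothetical.

```python
import math

def quantize(value, scale):
    """Round value * scale to the nearest integer (halves round up)."""
    return int(math.floor(value * scale + 0.5))

def build_table(fn, inputs, scale):
    """Tabulate an integer approximation of fn over a finite set of inputs.

    A larger scale keeps more fractional precision, at the cost of larger
    integers to carry through the homomorphic evaluation.
    """
    return {x: quantize(fn(x), scale) for x in inputs}

# Toy contribution fn(x) = x / 2, tabulated with one fractional bit (scale = 2).
table = build_table(lambda x: x / 2, range(4), scale=2)
```

The party that decrypts the result would need to know the same `scale` in order to interpret the integer sum.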
The prediction service provider is given the encrypted input data
[[x1]], . . . , [[xn]], [[c1]], . . . , [[cm]]
and for l=1, . . . , q, collects the encrypted variables belonging to cluster Vl:
[[Vl]]=( [[xi
The prediction service provider is given the user's public evaluation key eva_key.
For a given query from a user, using eva_key, the prediction service provider performs the following algorithm:
1. Initialize acc=w0.
2. For l=1 to q:
(2a) Perform a homomorphic table lookup with [[Vl]] on table T_{f_l} to obtain the encrypted contribution z_l=[[T_{f_l}[Vl]]] of cluster Vl.
(2b) Use homomorphic addition to aggregate the contribution: acc=acc+z_l.
3. Perform a homomorphic table lookup with acc on table T_f to get the encrypted prediction probability [[p]].
The encrypted prediction result [[p]] is returned to the user.
Using the secret decryption key sec_key, the user decrypts [[p]] to get the prediction result value p in the clear.
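The data flow of the evaluation can be mocked in Python as below. This is deliberately not cryptography: the `Enc` class is a stand-in that holds its value in the clear so that the sequence of lookups and additions can be followed, and the names `h_lookup`, `h_add` and `evaluate`, as well as the dictionary representation of the tables, are assumptions made for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Enc:
    """Stand-in for a ciphertext [[x]]. Holds x in the clear: NOT secure."""
    value: int

def h_lookup(table, enc_inputs):
    """Mock homomorphic table lookup: returns [[table[inputs]]]."""
    return Enc(table[tuple(e.value for e in enc_inputs)])

def h_add(a, b):
    """Mock homomorphic addition: [[x]] + [[y]] = [[x + y]]."""
    return Enc(a.value + b.value)

def evaluate(w0, clusters, tables, final_table, enc_vars):
    """Mirror the service provider's loop over clusters."""
    acc = Enc(w0)                                  # initialise the accumulator with w0
    for indices, table in zip(clusters, tables):
        z_l = h_lookup(table, [enc_vars[i] for i in indices])  # per-cluster lookup
        acc = h_add(acc, z_l)                      # aggregate the contribution
    return h_lookup(final_table, [acc])            # final lookup applies f

# Toy run: two clusters over three encrypted variables.
enc_vars = [Enc(1), Enc(2), Enc(5)]
result = evaluate(
    w0=1,
    clusters=[(0, 1), (2,)],
    tables=[{(1, 2): 3}, {(5,): 4}],
    final_table={(8,): 99},
    enc_vars=enc_vars,
)
```

In a real deployment each mock operation would be replaced by the corresponding operation of the homomorphic scheme, performed with eva_key and without access to the plaintext values.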
In this particular embodiment of the invention, we make use of a set of techniques based on the Torus FHE (TFHE) homomorphic encryption scheme. TFHE defines 3 distinct encryption formats, TLWE, TRLWE and TRGSW, with distinct features. Only the description of TLWE is needed to show how the invention is implemented using TFHE.
The plaintext is assigned a real value, μ, in the range [0,1) and is encrypted as
TLWE(μ)=(a1, . . . , an, b)
with
where each ai˜U[0,1) is picked uniformly at random in the interval [0,1) and ϵ˜N(0, σ) is a centered Gaussian noise with variance σ^2.
The secret encryption-decryption key is sec_key=(s1, . . . , sn) ∈ {0,1}^n.
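A toy version of secret-key TLWE can be sketched in Python with floating-point numbers standing in for torus elements. The sketch assumes the usual TLWE relation b = a1·s1 + . . . + an·sn + μ + ϵ mod 1, which is not spelled out in the text above, and it uses tiny parameters chosen for readability that offer no security. The function names and the `levels` parameter (the number of equally spaced messages on the torus) are hypothetical.

```python
import random

def tlwe_keygen(n, rng):
    """Sample sec_key = (s1, ..., sn) uniformly from {0, 1}^n."""
    return [rng.randrange(2) for _ in range(n)]

def tlwe_encrypt(mu, sec_key, sigma, rng):
    """Return (a, b) with b = <a, s> + mu + eps mod 1 (assumed TLWE relation)."""
    a = [rng.random() for _ in sec_key]   # each a_i uniform in [0, 1)
    eps = rng.gauss(0.0, sigma)           # centred Gaussian noise
    b = (sum(ai * si for ai, si in zip(a, sec_key)) + mu + eps) % 1.0
    return a, b

def tlwe_phase(ct, sec_key):
    """Return b - <a, s> mod 1, i.e. mu plus noise."""
    a, b = ct
    return (b - sum(ai * si for ai, si in zip(a, sec_key))) % 1.0

def tlwe_decrypt(ct, sec_key, levels):
    """Round the phase to the nearest multiple of 1/levels; return its index."""
    return round(tlwe_phase(ct, sec_key) * levels) % levels
```

For example, with `levels = 4` the messages 0, 1/4, 1/2 and 3/4 decrypt to the indices 0 to 3 as long as the noise stays well below 1/8.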
TLWE public-key encryption
Given sec_key, the encryption public key enc_key is derived by providing a vector of random encryptions of zero:
enc_key=(Z1, . . . , Zr)
where Zi=TLWE(0). The public-key encryption of μ ∈ [0,1) consists in selecting random bits a1, . . . , ar ∈ {0, 1} and computing
TLWE(μ)=a1·Z1+ . . . +ar·Zr+μ mod 1.
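The public-key variant can be sketched the same way. As before, this is a floating-point toy with insecure parameters, the relation b = <a, s> + μ + ϵ mod 1 used to build the encryptions of zero is an assumption, and the names `encrypt_zero`, `make_enc_key`, `public_encrypt` and `phase` are hypothetical. Adding μ to a sample is interpreted as adding it to the last component b.

```python
import random

def encrypt_zero(sec_key, sigma, rng):
    """One TLWE encryption of 0 under sec_key (assumed relation b = <a, s> + eps)."""
    a = [rng.random() for _ in sec_key]
    b = (sum(ai * si for ai, si in zip(a, sec_key)) + rng.gauss(0.0, sigma)) % 1.0
    return a, b

def make_enc_key(sec_key, r, sigma, rng):
    """enc_key = (Z1, ..., Zr), a vector of r random encryptions of zero."""
    return [encrypt_zero(sec_key, sigma, rng) for _ in range(r)]

def public_encrypt(mu, enc_key, rng):
    """Sum a random subset of the Zi and add mu to the last component, mod 1."""
    n = len(enc_key[0][0])
    a_sum, b_sum = [0.0] * n, mu
    for a, b in enc_key:
        if rng.randrange(2):              # random selection bit for this Zi
            a_sum = [(x + y) % 1.0 for x, y in zip(a_sum, a)]
            b_sum = (b_sum + b) % 1.0
    return a_sum, b_sum % 1.0

def phase(ct, sec_key):
    """b - <a, s> mod 1; used here only to check the result with the secret key."""
    a, b = ct
    return (b - sum(ai * si for ai, si in zip(a, sec_key))) % 1.0
```

Note that the noise of the selected encryptions of zero adds up, so r and σ have to be chosen together for decryption to remain correct.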
1. The user selects sec_key=(s1, . . . , sn) ∈ {0,1}^n uniformly at random.
2. The user generates r encryptions of zero Z1, . . . , Zr and sets the encryption public key to enc_key=(Z1, . . . , Zr).
3. The user randomly generates a bootstrapping key eva_key to allow homomorphic computations by third parties.
To encrypt an integer variable v (an SNP value or a covariate), v is decomposed into bits v0, . . . , vt-1 and [[v]] is defined as
Relying on the description of section 2.3.4, it is enough to provide a description of how homomorphic table lookups and homomorphic additions are performed for a single cluster of input variables.
Given an encrypted cluster of integer variables
[[Vl]]=([[xi
and since each encrypted variable is a vector of its encrypted bits under TLWE, we view [[Vl]] as a concatenated vector of encrypted bits under TLWE:
Now, TFHE provides a technique for the homomorphic evaluation of a table lookup. Let T be an arbitrary t-dimensional table of 2^t integer values in the range {0, . . . , 2^d−1}. By applying the CMux tree and gate bootstrapping techniques on the vector of encrypted bits
one can compute
where the integer d>0 is a system parameter.
In this embodiment of the invention, these techniques are used for every table lookup made necessary by the prediction model.
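The shape of a CMux tree can be illustrated on plaintext bits. The Python below is only an analogy: in TFHE the selector bits and table entries are ciphertexts and each selection is a homomorphic CMux, whereas here `cmux` is ordinary arithmetic. The names `to_bits`, `cmux` and `mux_tree_lookup`, and the least-significant-bit-first ordering, are assumptions made for this sketch.

```python
def to_bits(v, t):
    """Decompose integer v into t bits, least significant first."""
    return [(v >> k) & 1 for k in range(t)]

def cmux(bit, if_one, if_zero):
    """Plaintext analogue of CMux: select if_one when bit == 1, else if_zero."""
    return bit * if_one + (1 - bit) * if_zero

def mux_tree_lookup(table, bits):
    """Look up table[v] using only selections driven by the bits of v.

    The table must hold 2^t entries for t selector bits. Each level of the
    tree halves the number of candidates, so t levels leave a single value.
    """
    layer = list(table)
    if len(layer) != 1 << len(bits):
        raise ValueError("table size must be 2^t for t selector bits")
    for bit in bits:
        layer = [cmux(bit, layer[i + 1], layer[i]) for i in range(0, len(layer), 2)]
    return layer[0]

# Toy table of 2^3 entries.
table = [k * k for k in range(8)]
value = mux_tree_lookup(table, to_bits(5, 3))
```

The tree uses 2^t − 1 selections for a table of 2^t entries, which is why grouping variables into small clusters keeps each lookup manageable.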
Since TLWE supports homomorphic additions, the current accumulated value
can be updated as
As a result of successive accumulations, the final value of the accumulator acc contains the sum
z=w0+T_{f_1}[V1]+ . . . +T_{f_q}[Vq] ∈ {0, . . . , 2^d−1}
of all contributions, namely
In this embodiment, the function f is not applied homomorphically on acc to compute [[p]]=f([[z]]). Instead, the prediction service provider directly returns acc=[[z]] to the user together with a description of f. The function f can also be chosen once and for all as a convention between users and prediction service providers.
Using her secret encryption-decryption key sec_key, the user
1. Decrypts acc=[[z]] to obtain z.
2. Applies f to z to get p=f(z).
An example is below
Among all predictive genetic tests currently available DTC, BRCA mutation testing can be considered the most actionable with proven clinical utility. Specific genetic variants in the BRCA1 and BRCA2 genes are associated with an increased risk of developing certain cancers, including breast cancer (in women and men) and ovarian cancer. These variants may also be associated with an increased risk for prostate cancer and certain other cancers. This test includes three genetic variants in the BRCA1 and BRCA2 genes that are most common in people of Ashkenazi Jewish descent.
Data relating to an individual was encrypted and the BRCA status analysed:
Number | Date | Country | Kind |
---|---|---|---|
1907358.4 | May 2019 | GB | national |
This application is a 35 U.S.C. § 371 national stage filing of International Application No. PCT/GB2020/051268, filed May 26, 2020, which claims the benefit of priority to United Kingdom Patent Application No. 1907358.4, filed May 24, 2019, the contents of each of which are herein incorporated by reference in their entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/GB2020/051268 | 5/26/2020 | WO |