This disclosure relates to systems and methods for a real-time sentinel system.
In one aspect, a method includes constructing an isolated memory partition that forms a secure enclave, wherein the secure enclave is available to one or more processors for running one or more application computing processes in isolation from one or more unauthorized computing processes running on the one or more processors of the secure enclave; pre-provisioning software within the secure enclave, wherein the pre-provisioned software within the secure enclave is configured to execute instructions of the one or more application computing processes on the one or more processors of the secure enclave by: receiving input data for the one or more application computing processes in an encrypted form, wherein the input data includes, for a plurality of individuals, a vaccination date, a test date, and a test result; decrypting the input data using one or more cryptographic keys; executing the one or more application computing processes to generate output data; generating a proof of execution that proves that the one or more instructions of the one or more application computing processes operated on the received input data; and sending the output data to a central node; wherein the central node is pre-provisioned with software configured to execute instructions of the one or more application computing processes on one or more processors of the central node by receiving the output data from a plurality of secure enclaves; executing the one or more application computing processes to apply an aggregate analysis to the output data of each of the plurality of secure enclaves; and providing an aggregate output.
In some embodiments, each of the plurality of secure enclaves is associated with a different health system.
In some embodiments, two or more of the plurality of secure enclaves are associated with two different information systems within a single health system.
In some embodiments, the input data includes at least one of demographic data, comorbidities, and geographic data.
In some embodiments, executing the one or more application computing processes includes fitting a regression model to the input data to generate output data, wherein the output data includes at least one of a regression coefficient of the regression model, a standard error for a regression coefficient, and a value calculated from one or more regression coefficients or one or more standard errors.
In some embodiments, the regression model is a stratified model.
In some embodiments, the regression model is a conditional logistic regression model.
In some embodiments, the output includes the log of an odds ratio at one or more time points and a standard error corresponding to each odds ratio.
In some embodiments, the output includes a plurality of regression coefficients of the regression model.
In some embodiments, the output includes a covariance matrix of the coefficients of the regression model.
In some embodiments, the output includes the log of an odds ratio as a continuous function of time.
In some embodiments, the log of the odds ratio as a continuous function of time is estimated based on an interpolation between the log of an odds ratio at two or more time points.
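By way of illustration, the interpolation described above may be sketched as follows; the time points, odds ratio values, and function names are illustrative assumptions, not part of the disclosed method.

```python
import math

def interp_log_or(t, times, log_ors):
    """Linearly interpolate the log odds ratio at time t between
    known (time, log OR) estimates; clamp outside the known range."""
    if t <= times[0]:
        return log_ors[0]
    if t >= times[-1]:
        return log_ors[-1]
    for t0, t1, y0, y1 in zip(times, times[1:], log_ors, log_ors[1:]):
        if t0 <= t <= t1:
            return y0 + (t - t0) / (t1 - t0) * (y1 - y0)

# Log odds ratios estimated at 2 and 6 months post-vaccination
# (illustrative values; an OR rising toward 1 suggests waning protection).
times = [2.0, 6.0]
log_ors = [math.log(0.1), math.log(0.4)]
estimate_at_4_months = interp_log_or(4.0, times, log_ors)
```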
In some embodiments, the pre-provisioned software within the secure enclave is further configured to execute instructions of the one or more application computing processes on the one or more processors of the secure enclave by encrypting the output data using the one or more cryptographic keys; and providing external access to the encrypted output data and the proof of execution.
In some embodiments, the central node is an isolated memory partition available to one or more processors for running one or more application computing processes in isolation from one or more unauthorized computing processes running on the one or more processors of the central node.
In some embodiments, the output data includes a portion of the input data.
In some embodiments, the aggregate analysis includes fitting a regression model to the output data, wherein the output data includes a portion of the input data from each of the plurality of secure enclaves.
In some embodiments, the aggregate analysis includes inverse variance weighting.
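A minimal sketch of inverse variance weighting (fixed-effect pooling of per-enclave estimates) follows; the estimates and standard errors are hypothetical values, not results of the disclosed system.

```python
import math

def inverse_variance_pool(estimates, std_errors):
    """Fixed-effect meta-analysis: pool per-enclave estimates
    (e.g., log odds ratios), each weighted by 1 / SE^2."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical log odds ratios reported by three health-system enclaves.
est, se = inverse_variance_pool([-1.9, -2.2, -1.7], [0.3, 0.5, 0.4])
```

The smaller an enclave's standard error, the more its estimate contributes to the pooled result, which is why larger cohorts tighten the aggregate uncertainty band.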
In some embodiments, the aggregate analysis includes fitting an aggregate regression model to the output data.
In some embodiments, fitting the regression model and fitting the aggregate regression model are performed iteratively.
In some embodiments, the pre-provisioned software within the central node is configured to execute instructions of the one or more application computing processes on one or more processors of the central node by sending one or more aggregate regression coefficients of the aggregate regression model to each of the plurality of secure enclaves.
In some embodiments, the pre-provisioned software within the secure enclave is further configured to execute instructions of the one or more application computing processes on the one or more processors of the secure enclave by: receiving one or more aggregate regression coefficients of the aggregate regression model; tuning the regression model using the one or more aggregate regression coefficients of the aggregate regression model to generate updated output data; and sending the updated output data to a central node.
In some embodiments, the updated output includes a gradient of one or more regression coefficients of the regression model.
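The iterative exchange above (the central node broadcasts aggregate coefficients; each enclave returns a local gradient) can be sketched with stand-in quadratic losses; the loss function, learning rate, and per-enclave optima are illustrative assumptions, not the disclosed regression objective.

```python
def local_gradient(coef, target):
    # Gradient of the stand-in local loss 0.5 * (coef - target)^2
    # that each enclave would compute on its private data.
    return coef - target

enclave_targets = [1.0, 2.0, 3.0]  # hypothetical per-enclave optima
coef, lr = 0.0, 0.5
for _ in range(50):
    # Enclaves return gradients; the central node averages and steps.
    grads = [local_gradient(coef, t) for t in enclave_targets]
    coef -= lr * sum(grads) / len(grads)
# coef converges toward the mean of the enclave optima (2.0)
```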
In some embodiments, the aggregate output includes the log of an odds ratio across the plurality of secure enclaves.
In some embodiments, a vaccination status of an individual is assigned as unvaccinated if the test date is within a predetermined time of the vaccination date, and the vaccination status of the individual is assigned as vaccinated if the test date is at least the predetermined time after the vaccination date.
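This assignment rule can be sketched as follows; the 14-day window and the function name are illustrative assumptions, as the disclosure leaves the predetermined time unspecified.

```python
from datetime import date, timedelta

def assign_vaccination_status(vaccination_date, test_date, window_days=14):
    """Assign cohort status per the rule above. Tests earlier than
    window_days after vaccination count as 'unvaccinated' (protection
    not yet established); later tests count as 'vaccinated'.
    The 14-day window is an illustrative assumption."""
    if vaccination_date is None:
        return "unvaccinated"
    if test_date < vaccination_date + timedelta(days=window_days):
        return "unvaccinated"
    return "vaccinated"

early = assign_vaccination_status(date(2021, 1, 1), date(2021, 1, 10))
late = assign_vaccination_status(date(2021, 1, 1), date(2021, 3, 1))
```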
In some embodiments, the vaccination date includes dates of one or more doses.
In one aspect, a system includes a non-transitory memory; and one or more hardware processors configured to read instructions from the non-transitory memory that, when executed, cause one or more of the hardware processors to perform operations including: constructing an isolated memory partition that forms a secure enclave, wherein the secure enclave is available to one or more processors for running one or more application computing processes in isolation from one or more unauthorized computing processes running on the one or more processors of the secure enclave; pre-provisioning software within the secure enclave, wherein the pre-provisioned software within the secure enclave is configured to execute instructions of the one or more application computing processes on the one or more processors of the secure enclave by: receiving input data for the one or more application computing processes in an encrypted form, wherein the input data includes, for a plurality of individuals, a vaccination date, a test date, and a test result; decrypting the input data using one or more cryptographic keys; executing the one or more application computing processes to generate output data; generating a proof of execution that proves that the one or more instructions of the one or more application computing processes operated on the received input data; and sending the output data to a central node; wherein the central node is pre-provisioned with software configured to execute instructions of the one or more application computing processes on one or more processors of the central node by receiving the output data from a plurality of secure enclaves; executing the one or more application computing processes to apply an aggregate analysis to the output data of each of the plurality of secure enclaves; and providing an aggregate output.
In some embodiments, executing the one or more application computing processes includes fitting a regression model to the input data to generate output data, wherein the output data includes at least one of a regression coefficient of the regression model, a standard error for a regression coefficient, and a value calculated from one or more regression coefficients or one or more standard errors.
In some embodiments, the output includes the log of an odds ratio at one or more time points and a standard error corresponding to each odds ratio.
In some embodiments, the aggregate analysis includes fitting a regression model to the output data, wherein the output data includes a portion of the input data from each of the plurality of secure enclaves.
In some embodiments, the aggregate analysis includes inverse variance weighting.
In some embodiments, the aggregate analysis includes fitting an aggregate regression model to the output data.
In one aspect, a non-transitory computer-readable medium stores instructions that, when executed by one or more hardware processors, cause the one or more hardware processors to perform operations including: constructing an isolated memory partition that forms a secure enclave, wherein the secure enclave is available to one or more processors for running one or more application computing processes in isolation from one or more unauthorized computing processes running on the one or more processors of the secure enclave; pre-provisioning software within the secure enclave, wherein the pre-provisioned software within the secure enclave is configured to execute instructions of the one or more application computing processes on the one or more processors of the secure enclave by: receiving input data for the one or more application computing processes in an encrypted form, wherein the input data includes, for a plurality of individuals, a vaccination date, a test date, and a test result; decrypting the input data using one or more cryptographic keys; executing the one or more application computing processes to generate output data; generating a proof of execution that proves that the one or more instructions of the one or more application computing processes operated on the received input data; and sending the output data to a central node; wherein the central node is pre-provisioned with software configured to execute instructions of the one or more application computing processes on one or more processors of the central node by receiving the output data from a plurality of secure enclaves; executing the one or more application computing processes to apply an aggregate analysis to the output data of each of the plurality of secure enclaves; and providing an aggregate output.
In some embodiments, executing the one or more application computing processes includes fitting a regression model to the input data to generate output data, wherein the output data includes at least one of a regression coefficient of the regression model, a standard error for a regression coefficient, and a value calculated from one or more regression coefficients or one or more standard errors.
In some embodiments, the output includes the log of an odds ratio at one or more time points and a standard error corresponding to each odds ratio.
In some embodiments, the aggregate analysis includes fitting a regression model to the output data, wherein the output data includes a portion of the input data from each of the plurality of secure enclaves.
In some embodiments, the aggregate analysis includes inverse variance weighting.
In some embodiments, the aggregate analysis includes fitting an aggregate regression model to the output data.
In one aspect, a method includes constructing a central node; wherein the central node is in communication with a plurality of secure enclaves, wherein each of the plurality of secure enclaves is formed by an isolated memory partition and is available to one or more processors for running one or more computing processes in isolation from one or more unauthorized computing processes running on the one or more processors of each of the plurality of secure enclaves; pre-provisioning software within the central node, wherein the pre-provisioned software is configured to execute instructions of the one or more application computing processes on the one or more processors of the central node by receiving output data from the plurality of secure enclaves; executing the one or more application computing processes to apply an aggregate analysis to the output data of each of the plurality of secure enclaves; and providing an aggregate output; wherein each of the plurality of secure enclaves is pre-provisioned with software configured to execute instructions of the one or more application computing processes on the one or more processors of each of the plurality of secure enclaves by: receiving input data for the one or more application computing processes in an encrypted form, wherein the input data includes, for a plurality of individuals, a vaccination date, a test date, and a test result; decrypting the input data using one or more cryptographic keys; executing the one or more application computing processes to generate the output data; generating a proof of execution that proves that the one or more instructions of the one or more application computing processes operated on the received input data; and sending the output data to the central node.
In some embodiments, each of the plurality of secure enclaves is associated with a different health system.
In some embodiments, two or more of the plurality of secure enclaves are associated with two different information systems within a single health system.
In some embodiments, the input data includes at least one of demographic data, comorbidities, and geographic data.
In some embodiments, executing the one or more application computing processes includes fitting a regression model to the input data to generate output data, wherein the output data includes at least one of a regression coefficient of the regression model, a standard error for a regression coefficient, and a value calculated from one or more regression coefficients or one or more standard errors.
In some embodiments, the regression model is a stratified model.
In some embodiments, the regression model is a conditional logistic regression model.
In some embodiments, the output includes the log of an odds ratio at one or more time points and a standard error corresponding to each odds ratio.
In some embodiments, the output includes a plurality of regression coefficients of the regression model.
In some embodiments, the output includes a covariance matrix of the coefficients of the regression model.
In some embodiments, the output includes the log of an odds ratio as a continuous function of time.
In some embodiments, the log of the odds ratio as a continuous function of time is estimated based on an interpolation between the log of an odds ratio at two or more time points.
In some embodiments, the pre-provisioned software within each of the plurality of secure enclaves is further configured to execute instructions of the one or more application computing processes on the one or more processors of the secure enclave by encrypting the output data using the one or more cryptographic keys; and providing external access to the encrypted output data and the proof of execution.
In some embodiments, the central node is an isolated memory partition available to one or more processors for running one or more application computing processes in isolation from one or more unauthorized computing processes running on the one or more processors of the central node.
In some embodiments, the output data includes a portion of the input data.
In some embodiments, the aggregate analysis includes fitting a regression model to the output data, wherein the output data includes a portion of the input data from each of the plurality of secure enclaves.
In some embodiments, the aggregate analysis includes inverse variance weighting.
In some embodiments, the aggregate analysis includes fitting an aggregate regression model to the output data.
In some embodiments, fitting the regression model and fitting the aggregate regression model are performed iteratively.
In some embodiments, the pre-provisioned software within the central node is configured to execute instructions of the one or more application computing processes on one or more processors of the central node by sending one or more aggregate regression coefficients of the aggregate regression model to each of the plurality of secure enclaves.
In some embodiments, the pre-provisioned software within each of the plurality of secure enclaves is further configured to execute instructions of the one or more application computing processes on the one or more processors of the secure enclave by: receiving one or more aggregate regression coefficients of the aggregate regression model; tuning the regression model using the one or more aggregate regression coefficients of the aggregate regression model to generate updated output data; and sending the updated output data to a central node.
In some embodiments, the updated output includes a gradient of one or more regression coefficients of the regression model.
In some embodiments, the aggregate output includes the log of an odds ratio across the plurality of secure enclaves.
In some embodiments, a vaccination status of an individual is assigned as unvaccinated if the test date is within a predetermined time of the vaccination date, and the vaccination status of the individual is assigned as vaccinated if the test date is at least the predetermined time after the vaccination date.
In some embodiments, the vaccination date includes dates of one or more doses.
In one aspect, a system includes a non-transitory memory; and one or more hardware processors configured to read instructions from the non-transitory memory that, when executed, cause one or more of the hardware processors to perform operations including: constructing a central node; wherein the central node is in communication with a plurality of secure enclaves, wherein each secure enclave is formed by an isolated memory partition and is available to one or more processors for running one or more computing processes in isolation from one or more unauthorized computing processes running on the one or more processors of each secure enclave; pre-provisioning software within the central node, wherein the pre-provisioned software is configured to execute instructions of the one or more application computing processes on the one or more processors of the central node by receiving output data from a plurality of secure enclaves; executing the one or more application computing processes to apply an aggregate analysis to the output data of each of the plurality of secure enclaves; and providing an aggregate output; wherein each of the plurality of secure enclaves is pre-provisioned with software configured to execute instructions of the one or more application computing processes on the one or more processors of each of the plurality of secure enclaves by: receiving input data for the one or more application computing processes in an encrypted form, wherein the input data includes, for a plurality of individuals, a vaccination status, a vaccination date, a test date, and a test result; decrypting the input data using one or more cryptographic keys; executing the one or more application computing processes to generate the output data; generating a proof of execution that proves that the one or more instructions of the one or more application computing processes operated on the received input data; and sending the output data to the central node.
In some embodiments, executing the one or more application computing processes includes fitting a regression model to the input data to generate output data, wherein the output data includes at least one of a regression coefficient of the regression model, a standard error for a regression coefficient, and a value calculated from one or more regression coefficients or one or more standard errors.
In some embodiments, the output includes the log of an odds ratio at one or more time points and a standard error corresponding to each odds ratio.
In some embodiments, the aggregate analysis includes fitting a regression model to the output data, wherein the output data includes a portion of the input data from each of the plurality of secure enclaves.
In some embodiments, the aggregate analysis includes inverse variance weighting.
In some embodiments, the aggregate analysis includes fitting an aggregate regression model to the output data.
In one aspect, a non-transitory computer-readable medium stores instructions that, when executed by one or more hardware processors, cause the one or more hardware processors to perform operations including: constructing a central node; wherein the central node is in communication with a plurality of secure enclaves, wherein each secure enclave is formed by an isolated memory partition and is available to one or more processors for running one or more computing processes in isolation from one or more unauthorized computing processes running on the one or more processors of each of the plurality of secure enclaves; pre-provisioning software within the central node, wherein the pre-provisioned software is configured to execute instructions of the one or more application computing processes on the one or more processors of the central node by receiving output data from a plurality of secure enclaves; executing the one or more application computing processes to apply an aggregate analysis to the output data of each of the plurality of secure enclaves; and providing an aggregate output; wherein each of the plurality of secure enclaves is pre-provisioned with software configured to execute instructions of the one or more application computing processes on the one or more processors of each of the plurality of secure enclaves by: receiving input data for the one or more application computing processes in an encrypted form, wherein the input data includes, for a plurality of individuals, a vaccination status, a vaccination date, a test date, and a test result; decrypting the input data using one or more cryptographic keys; executing the one or more application computing processes to generate the output data; generating a proof of execution that proves that the one or more instructions of the one or more application computing processes operated on the received input data; and sending the output data to the central node.
In some embodiments, executing the one or more application computing processes includes fitting a regression model to the input data to generate output data, wherein the output data includes at least one of a regression coefficient of the regression model, a standard error for a regression coefficient, and a value calculated from one or more regression coefficients or one or more standard errors.
In some embodiments, the output includes the log of an odds ratio at one or more time points and a standard error corresponding to each odds ratio.
In some embodiments, the aggregate analysis includes fitting a regression model to the output data, wherein the output data includes a portion of the input data from each of the plurality of secure enclaves.
In some embodiments, the aggregate analysis includes inverse variance weighting.
In some embodiments, the aggregate analysis includes fitting an aggregate regression model to the output data.
Any one of the embodiments disclosed herein may be properly combined with any other embodiment disclosed herein. The combination of any one of the embodiments disclosed herein with any other embodiments disclosed herein is expressly contemplated.
The objects and advantages will be apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings, in which like reference characters refer to like parts throughout, and in which
A real-time ‘sentinel’ system is disclosed that can alert policymakers, health care provider systems, and other decision makers when the efficacy of vaccines (e.g., COVID-19 vaccines) is declining. To make a determination of the durability of a vaccine, real-world data from health systems is used to compare two cohorts: a vaccinated cohort and an unvaccinated cohort. If the risk of contracting the virus (risk ratio or odds ratio) for a person who took the vaccine approaches the risk for an unvaccinated person, one can conclude that the vaccine is losing its effectiveness (or durability).
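For illustration, the log odds ratio and its standard error can be computed from a 2x2 table of test results using the standard Woolf formula; the cohort counts below are hypothetical.

```python
import math

def log_odds_ratio(a, b, c, d):
    """Log odds ratio and its standard error from a 2x2 table:
        a = vaccinated, test positive    b = vaccinated, test negative
        c = unvaccinated, test positive  d = unvaccinated, test negative
    SE uses the Woolf formula sqrt(1/a + 1/b + 1/c + 1/d)."""
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se

# Hypothetical counts; as the OR drifts toward 1 over successive
# time windows, the vaccine's protection is interpreted as waning.
lor, lor_se = log_odds_ratio(40, 960, 200, 800)
```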
For example, a small amount of data from one health system leads to a conclusion that the vaccines are losing their durability after several (6-8) months. See
If there is a greater number of patients in these cohorts, the ‘uncertainty band’ will be tighter as a consequence of statistics. Accordingly, it would be desirable to develop improved methods to increase the number of patients in cohorts, e.g., by aggregating data across health systems. For example, in some embodiments, the software for a sentinel system runs under the ‘control’ of a health system so that patient privacy is not compromised (e.g., using a privacy-preserving architecture). In some embodiments, multiple health systems, e.g., in every state of the country, use a ‘federated architecture’ so that patient data is processed within secure enclaves (e.g., within each health system) and is not copied to a single central place before the software or a central coordinator computes risk ratios or other aggregate output relating to vaccine durability. Further description of federated architecture can be found below and in U.S. patent application Ser. No. 16/908,520, titled “Systems and Method for Computing with Private Healthcare data,” filed Jun. 22, 2020, the contents of which are incorporated herein by reference in their entirety.
A truly astonishing amount of information has been collected from patients and consumers pertaining to their health status, habits, environment, surroundings, and homes. Increasingly, this information is being processed by computer programs utilizing machine learning and artificial intelligence models. Such computer programs have shown remarkable progress in analyzing and predicting consumer health status, incidence and treatment of diseases, user behavior, etc. Furthermore, since the collected data may contain patient biometric and other personal identification attributes, there is a growing concern that such computer programs may allow the identities of patients and consumers to be learned. Accordingly, enterprises interested in analyzing healthcare data containing private attributes are concerned with maintaining privacy of individuals and observing the relevant regulations pertaining to private and personal data, such as HIPAA (Health Insurance Portability and Accountability Act 1996) regulations.
In addition to HIPAA, many other regulations have been enacted in various jurisdictions, such as GDPR (General Data Protection Regulations) in the European Union, PSD2 (Revised Payment Services Directive), CCPA (California Consumer Privacy Act 2018), etc.
In the following descriptions, the terms “user information,” “personal information,” “personal health information (“PHI”),” “healthcare information data or records,” “identifying information,” and “PII” (Personally Identifiable Information) may be used interchangeably. Likewise, the terms “electronic health records (“EHR”)” and “data records” may be used interchangeably.
One approach to handling private data is to encrypt all the records of a dataset. Encrypted text is sometimes referred to as ciphertext; decrypted text is also referred to as plaintext. Encryption may be described, by way of analogy, as putting the records of the dataset in a locked box. Access to the records of the locked box is then controlled by the key to the locked box. The idea is that only authorized entities are allowed access to the (decryption) key.
Some regulations (e.g., HIPAA) require that healthcare data be stored in encrypted form. This is also sometimes referred to as “encryption at rest.”
Malicious entities may, however, gain access to the decryption key or infer/guess the decryption key using computational mechanisms. The latter possibility becomes probable when the encryption/decryption technologies are not sufficiently strong (e.g., the length of the key—the number of bits comprising the key—is not sufficiently long to withstand computational attacks), or if the key is lost or not stored securely.
Encryption and other such security technologies may depend on the expectation that a computational attacker is likely to expend a certain amount of resources—computer time, memory and computing power—to gain access to the underlying data. The length of encryption keys is one of the variables used to increase the amount of computational resources needed to break the encryption.
Even strong encryption technology may not resolve security challenges associated with processing private data. For example, an enterprise that is processing an encrypted dataset may load the dataset into a computer, decrypt the dataset, process the records of the dataset and re-encrypt the dataset. In this example, one or more records of the dataset are decrypted (into plaintext) during processing. A malicious entity may gain access to the computer while the plaintext records are being processed, leading to a leak of personal information. That is, decrypting the data for the purpose of processing introduces a “run-time” vulnerability.
Accordingly, it would be desirable to develop improved techniques for processing private data.
Fully homomorphic encryption (FHE) describes an approach for computing with encrypted data without decrypting it. That is, given encrypted data elements x1e, x2e, . . . compute the function f (x1e, x2e, . . . ) yielding an encrypted result (y1e, y2e, . . . ). Since the input, output and processing phases of such computations deal with encrypted data elements only, the probability of leaks is minimized. If the (mathematical) basis of the encryption technology is sufficiently strong, the inference/guessing of keys may become an infeasible computation, even if very powerful computers, e.g., quantum computers, are used.
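FHE supports arbitrary functions over ciphertexts; as a minimal (and deliberately insecure) illustration of the underlying idea, textbook RSA is multiplicatively homomorphic: multiplying two ciphertexts yields a ciphertext of the product of the plaintexts, so a computation is performed without decrypting the inputs. The small parameters below are standard textbook values, not usable key material.

```python
# Toy illustration (NOT secure): textbook RSA with tiny parameters.
# n = 61 * 53; e * d = 1 (mod phi(n)).
n, e, d = 3233, 17, 2753

def enc(m):
    return pow(m, e, n)

def dec(c):
    return pow(c, d, n)

a, b = 4, 5
c = (enc(a) * enc(b)) % n   # compute on ciphertexts only
assert dec(c) == a * b      # decrypts to 20; a and b were never exposed
```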
However, conventional techniques for computing with FHE datasets may be inefficient to the point of being impractical. Calculations reported in 2009 put computations running over FHE datasets at a hundred trillion times slower than unencrypted data computations. (See Ameesh Divatia, https://www.darkreading.com/attacks-breaches/the-fact-and-fiction-of-homomorphic-encryption/a/d-id/1333691 and Priyadarshan Kolte, https://baffle.io/blog/why-is-homomorphic-encryption-not-ready-for-primetime/.)
Furthermore, existing application code may need to be re-written to use FHE libraries that provide the basic FHE functions.
A secure enclave describes a computing environment where sensitive data can be decrypted and processed in memory without exposing it to the other processes running in the computer. Data is decrypted and processed in a computing environment that is “isolated” from other processes and networks. Protection of such an environment could be further enhanced by protecting the decryption keys in a manner explained later.
The technology of secure enclaves may be more efficient than FHE techniques.
In some instances, a computer containing a secure enclave may also be referred to as a secure computer. A secure computer may contain one or more secure enclaves, e.g., one secure enclave for each application running in the computer.
In general, it is a goal of secure enclave technology to ensure isolation of the enclave from other processes and from other enclaves.
A secure enclave is an isolated environment including hardware (CPU, memory, registers, cache, etc.) and/or software (programmed circuitry). The secure enclave is accessible by application programs via especially configured hardware and software elements, sometimes referred to as a call gate or a firewall. Access to the secure enclave may be controlled via cryptographic keys some of which may reside in hardware elements, configured at the time of manufacturing. A malicious entity could attempt to extract keys during the booting process of the secure enclave. Reverse engineering or other such attacks to extract keys may be thwarted by disallowing repeated key requests and/or lengthening the time between such requests. In some cases, a set of keys may be associated with a particular set of hardware elements.
Additional protection may be achieved by requiring that data (and computer programs) injected into a secure enclave be encrypted, and further that data outputted from a secure enclave be encrypted as well. Encrypted data, once injected into a secure enclave, may then be decrypted within the secure enclave and processed, and the results may be encrypted in preparation for output. Thus, an isolated secure enclave addresses the runtime vulnerability problem discussed above.
Additional measures of protecting the data within a secure enclave can be introduced by requiring that the decryption keys be protected from being known outside the secure enclave. That is, entities external to the secure enclave infrastructure are prohibited from accessing the decryption keys.
In this manner, encrypted data may be injected into a secure enclave when an injecting agent satisfies the constraints of the firewall of the secure enclave. The secure enclave includes a decryption key that may be used to decrypt the injected data and process it. The secure enclave may encrypt results of the processing activity using an encryption key available inside the secure enclave before outputting the results.
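By way of a non-limiting illustration, the inject-decrypt-process-encrypt flow described above may be sketched as follows. The XOR stream cipher here is a toy stand-in for a production cipher (e.g., an AEAD scheme), and both keys appear in a single process only for illustration; in the described arrangement, the decryption key would be available only inside the enclave.

```python
import hashlib
import secrets

def keystream(key: bytes, n: int) -> bytes:
    # Derive a pseudorandom keystream by hashing key || counter blocks.
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def xor_cipher(key: bytes, data: bytes) -> bytes:
    # XOR with the keystream; the same call both encrypts and decrypts.
    ks = keystream(key, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))

def enclave_process(encrypted_input: bytes, decryption_key: bytes,
                    output_key: bytes, compute) -> bytes:
    # Model of the enclave flow: decrypt inside, process, re-encrypt before output.
    plaintext = xor_cipher(decryption_key, encrypted_input)
    result = compute(plaintext)
    return xor_cipher(output_key, result)

# The injecting agent encrypts the input before it crosses the enclave boundary.
k_in, k_out = secrets.token_bytes(32), secrets.token_bytes(32)
ciphertext = xor_cipher(k_in, b"heart_rate=72")
sealed = enclave_process(ciphertext, k_in, k_out, lambda d: d.upper())
assert xor_cipher(k_out, sealed) == b"HEART_RATE=72"
```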
Another technique to address the issue of protecting private data is to de-identify or anonymize the data. This technique relies on replacing private data with random data, e.g., replacing social security numbers with random digits. Such techniques may be used in structured datasets. For example, a structured dataset comprising the names, social security numbers, and heart rates of patients may be anonymized by de-identifying the values of the attributes “name” and “social security number.”
De-identification technologies applied to structured datasets lead to a loss of processing power, as follows.
Structured datasets often need to be combined with other structured datasets to gain maximum processing advantage. Consider, by way of example, two structured datasets (name, SS#, heartrate) and (name, SS#, weight). By combining the two datasets, one may gain a more complete data record of a patient. That is, one may exploit the relationships inherent in the two datasets by associating the patients represented in the two datasets. The process of de-identifying the two datasets anonymizes the patients and thereby loses these inherent relationships.
To continue with the above example, in order to preserve the inherent relationship, the entity performing the de-identification may assign the same random data to the represented patients in the two datasets. That is, the anonymizing entity knows that a patient, say John, is represented by certain data in the two datasets. This implies that the knowledge of the entity doing the anonymizing becomes a vulnerability.
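The linkage-preserving anonymization described above may be sketched, for illustration only, with a keyed hash (HMAC): the same patient receives the same pseudonym in both datasets, so the join is preserved, but whoever holds the key can re-identify patients, which is precisely the vulnerability noted above. The key value and record fields are hypothetical.

```python
import hashlib
import hmac

def pseudonym(secret_key: bytes, name: str, ssn: str) -> str:
    # Keyed hash so the same patient maps to the same token in every dataset.
    msg = f"{name}|{ssn}".encode()
    return hmac.new(secret_key, msg, hashlib.sha256).hexdigest()[:16]

key = b"held-only-by-the-anonymizing-entity"
ds1 = [("John Doe", "123-45-6789", {"heartrate": 72})]
ds2 = [("John Doe", "123-45-6789", {"weight": 81})]

anon1 = {pseudonym(key, n, s): rec for n, s, rec in ds1}
anon2 = {pseudonym(key, n, s): rec for n, s, rec in ds2}

# The join across the two datasets is preserved without exposing name or SS#...
joined = {t: {**anon1[t], **anon2[t]} for t in anon1.keys() & anon2.keys()}
# ...but the holder of `key` can recompute any patient's pseudonym, so the
# anonymizing entity's knowledge remains a point of vulnerability.
```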
Thus, de-identifying structured data may lead to introducing vulnerabilities that may be exploited by malicious computational entities.
Another disadvantage of traditional de-identifying technologies is that they do not apply to unstructured datasets such as medical notes, annotations, medical history, pathology data, etc. A large amount of healthcare data consists of unstructured datasets. In a later part of this disclosure, techniques that use machine learning and artificial intelligence to de-identify unstructured datasets are disclosed.
One consequence of de-identifying unstructured datasets is that the resulting dataset may contain some residual private data. In one embodiment, de-identification of an unstructured dataset is subjected to a statistical analysis that derives a measure of the effectiveness of the de-identification. That is, measures of the probability to which a dataset has been de-identified may be obtained.
In embodiments, an entity A de-identifies a dataset to a probability measure p and provides it to an entity B. The latter also receives from an entity C one or more computer programs. Entity B processes the data received from entity A using the computer programs received from entity C and provides the result of the processing to another entity D. (In embodiments, A, B, C and D may be distinct entities in principle; in practice, one or more of entities A, B, C and D may cooperate through mutual agreements.)
Embodiments of the present invention enable Entity B to assure entity A (and C, and D) that its processing maintains the probability p associated with the data.
Further, in a process not involving entity B, entity A may approve the use of computer programs of entity C on its dataset.
Embodiments of the present invention enable entity B to assure entity C (and A, and D) that the dataset in question was only processed by computer programs provided by entity C and that the dataset was not processed by any other computer program. Furthermore, entity B may be able to assure the other entities that the computer programs provided by entity C and used to process the underlying dataset were not altered, changed or modified in any manner, i.e., the binary image of the computer programs used during processing was identical to the binary image of the provided computer programs. That is, this enablement maintains the provenance of the received computer programs.
Furthermore, the inscrutability property corresponds to a property that satisfies the conditions described below.
Additionally, the various assurances above are provided in the form of verifiable and unforgeable data instruments, i.e., certificates, based on the technology of cryptography.
Embodiments of the present invention, shown in
Without loss of generality and for ease of description, in the illustrative embodiment of
The present disclosure, inter alia, describes “federated pipelines” (implemented using software technologies and/or hardware/firmware components) that maintain the input de-identification probability of datasets, the provenance of input computer programs, and the inscrutability of the various data and computer programs involved in the computation.
In some cases, a data scientist (e.g., entity 1A104 cf.
That is, the third party may wish to obtain a proof that the output received from the federated pipeline was indeed used as input to a new computer program and the output provided to the third-party is outputted by the said program. That is, the data scientist may be asked by the third-party to extend the chain of trust associated with the federated pipeline. If the data scientist is not associated with the federated pipeline, a method of extending the chain of trust is needed that is independent of the method(s) used in the federated pipeline system.
Furthermore, the data scientist may wish that the recipient trust that dataset #2 was processed by a computer program P2 (that may have been provided by the data scientist) and the alleged execution of program P2 resulted in the Final Output Dataset (
D. Genkin, et al., “Privacy in Decentralized Cryptocurrencies,” Communications of the ACM, 2018, incorporated herein by reference in its entirety, illustrates exemplary techniques for verifying the execution of programs P1 and P2. A software module called the prover provides a computing environment in which the programs P1 and P2 may be executed. Upon such executions, the prover produces two outputs: (1) the output of the programs P1 and P2, and (2) a data object called the proof of the execution of programs P1 and/or P2.
Additionally, the prover also provides a software module called the verifier (cf.
Thus, D. Genkin, et al. show systems and methods whereby the alleged execution of computer programs may be verified by submitting the proofs of the alleged executions to a verifier system. The proof objects are cryptographic objects and do not leak information about the underlying data or the programs (other than the meta statement that the alleged execution is verifiable).
In embodiments, a computer program, P, may be agreed upon as incorporating a policy between two enterprises, E1 and E2. The former enterprise E1 may now cause the program P to be executed and to produce a proof π of its alleged execution using the above described prover technology. Enterprise E2 may now verify π (using the verifying technology described above) and trust that the program P was executed, thereby trusting that the agreed upon policy has been implemented.
In some embodiments, the following method may be performed to populate a secure enclave with code and data.
Method [Create and populate Secure Enclave]
(1) compile secure part of application;
(2) issue command to create secure enclave (e.g., using underlying hardware/OS instruction set);
(3) Load any pre-provisioned code from pre-specified libraries;
(4) load the compiled code from step 1 into secure enclave;
(5) generate appropriate credentials; and
(6) save the image of the secure enclave and the credentials.
In some embodiments, the following method may be performed to execute the code in a secure enclave.
(1) compile unsecure part of an application (e.g., application 100) along with the secure image;
(2) execute the application;
(3) the application creates the secure enclave and loads the image in the secure enclave; and
(4) verify the various credentials.
The hardware and software components of a secure enclave provide data privacy by protecting the integrity and confidentiality of the code and data in the enclave. The entry and exit points are pre-defined at the time of compiling the application code. A secure enclave may send/receive encrypted data from its application and it can save encrypted data to disk. An enclave can access its application's memory, but the reverse is not true, i.e., the application cannot access an enclave's memory.
An enclave is a self-sufficient executable software that can be run on designated computers. For example, the enclave may include the resources (e.g., code libraries) that it uses during operation, rather than invoking external or shared resources. In some cases, hardware (e.g., a graphic processing unit or certain amount of memory) and operating system (e.g., Linux version 2.7 or Alpine Linux version 3.2) requirements may be specified for an enclave.
As described in method “Create and Populate Secure Enclave,” pre-provisioned software may also be loaded into a secure enclave. SE 220 contains, inter alia, pre-provisioned software 240-2 that acts as one endpoint for a TLS (Transport Layer Security) connection. The second endpoint 240-1 for the TLS connection resides with the database 200. (Any secure network connection technology, e.g., https, VPN, etc., may be used in lieu of TLS.)
The TLS connection may be used by App 230 to retrieve data from database 200. App 230 may also include a proxy mechanism for handling the receipt of data records.
Additionally, SE 220 contains pre-provisioned software modules PA 250 (Policy Agent) and AC 260 (Access Controller) whose functions are discussed below.
Program App 230 in SE 220 may thus retrieve data from database 200 using the TLS endpoints 240-1 and 240-2. TLS technology ensures that the data being transported is secure. Database 200 may contain encrypted data records. Thus, App 230 receives encrypted data records. In operation, App 230 decrypts the received data records and processes them according to its programmed logic. (The method by which decryption occurs is described later.)
Using method “Execute Code in Secure Enclave” described above, App 230 may be invoked, which may then retrieve and decrypt data from database 200. The result of the processing may be directed to an entity labelled data scientist 280 under control of the policy agent PA 250. Generally, PA 250 operates in conjunction with policy manager 280. The functioning and inter-operation of PA 250 and policy manager 280 will be described in more detail later.
In some embodiments, the policy manager 280 may exist in its own secure enclave 290.
In summary, a computational task may be achieved by encoding it as an application program with a secure and an unsecure part. When invoked, the unsecure part of the application creates one or more secure enclaves, injects its secure part into a secure enclave and invokes its execution. The secure part of the application may have access to data from (pre-provisioned) databases connected to the enclaves or from other enclaves. The secure part of the application then decrypts received data. Processing then proceeds as per the app's logic possibly utilizing the arrangement of the interconnected enclaves. The results are presented to the data scientist via the policy agent.
In comparison to the FHE dataset approach, in which the data is never decrypted and processing proceeds on encrypted data, in the arrangement shown in
The pipeline technology described above allows computations to be carried out on datasets that may contain private and personal data. An aspect of pipeline technology is that data (and programs) inside a secure enclave are inscrutable, i.e., subject to policy control exercised by the policy manager (or its cohort, the policy agent). Furthermore, the outputs produced as a consequence of the execution of the program, may be directed according to policies also.
As an illustration, consider a computation carried out in a pipeline that calculates the body mass index (BMI) of individual patients stored in a dataset containing, inter alia, their weights, heights, date of births and addresses. The computation then proceeds to calculate the average BMI across various US counties.
Since these calculations involve private and personal patient data, the computations may be subject to privacy regulations. Various types of outputs may be desired, such as the following illustrative examples: (1) a dataset of 5 US counties that have the highest average BMI; (2) a dataset of 5 patients with street addresses with “overweight” BMI; (3) a dataset of patients containing their zip codes and BMI from Norfolk county, MA; (4) a dataset of patients with “overweight” BMI between the ages of 25-45 years from Dedham, Mass.; or (5) a dataset of patients containing their weight, height and age from Allied Street, Dedham Mass. In each case, the input to the computation is a dataset that may contain private and personal data and the output is a dataset that may also contain private and personal data.
The first output dataset above lists data aggregated to the level of county populations and does not contain PII data attributes. The result is independent of any single individual's data record; the result pertains to a population. A policy may therefore provide that such a dataset may be outputted, i.e., as plaintext.
On the other hand, the second outputted dataset above (1) contains personal identifiable information, i.e., street address, and (2) the number of items in the dataset, i.e., the cardinality of the output set, is small. A malicious agent may be able to isolate particular individuals from such a dataset. In this case, a policy may be formed to disallow such requests.
That is, a parameter, K, called the privacy parameter, may be provided that imposes a bound on the cardinality of outputted datasets. Thus, an outputted dataset may be disallowed if its PII attributes identify fewer than K individuals.
Additionally, or alternatively, the output dataset may be provided in encrypted form inside a secure enclave to the intended recipient, e.g., the data scientist, along with a computer program responsive to queries submitted by the data scientist. The latter may then use an (unsecure) application program to query the (secure) program inside the enclave and receive the latter's responses. Thus, the data scientist may not see the patient data but can receive the responses to his queries. Furthermore, the responses of the secure program may be constrained to reveal only selected and pre-determined “views” of the output dataset, where the “views” may correspond to the generally accepted notion of views in database systems. Alternatively, the output dataset may also be provided to the data scientist without enclosing it in a secure enclave by first encrypting the dataset using FHE.
In the third output request above, the data is being aggregated across zip codes of a county and therefore may not engender privacy concerns, provided that the number of such patients is large enough. In such examples, a policy may be formed that imposes a constraint on the size of the output dataset, e.g., output dataset must contain data pertaining to at least 20 patients. Similar policies may also be used for the fourth and fifth output requests.
In some embodiments, a policy may be formed that provides for adding random data records to an outputted dataset if the cardinality of the dataset is less than the imposed limit. That is, a constraint is imposed such that enough records are included in the output dataset to achieve an output of a minimum size, e.g., 20 individuals.
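The privacy-parameter policies above may be sketched as follows (an illustrative, non-limiting sketch; the value of K, the record fields, and the helper names are hypothetical). A policy could instead disallow the undersized output entirely rather than pad it.

```python
import random

K = 20  # privacy parameter: minimum number of individuals in any output

def enforce_privacy_parameter(records: list, make_random_record) -> list:
    # If the output already identifies at least K individuals, release it as-is;
    # otherwise pad it with random records up to the minimum size.
    if len(records) >= K:
        return records
    padding = [make_random_record() for _ in range(K - len(records))]
    return records + padding

def fake_record() -> dict:
    # Hypothetical random record matching the output dataset's schema.
    return {"zip": f"{random.randint(10000, 99999)}",
            "bmi": round(random.uniform(18.0, 35.0), 1)}

out = enforce_privacy_parameter([{"zip": "02026", "bmi": 27.4}], fake_record)
assert len(out) == K
```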
Further challenges may arise when output requests (e.g., the third, fourth and fifth output requests above) are issued as a series of requests and the outputs are collected by a single entity (e.g., a data scientist) or multiple entities that collude to share the outputs. Since the output requests compute datasets that successively apply to smaller population sizes, there is a possibility that such “narrowing” computations may be used to gain information about specific individuals.
It has been shown in the literature (cf. Cynthia Dwork, Differential Privacy: A Survey of Results, International Conference on Theory and Applications of Models of Computation, 2008) that a sequence of ever-narrowing (or ever-more-accurate) responses ultimately leaks individual information.
In some embodiments, a policy agent may be configured so as to be included as a pre-provisioned software in one or more secure enclaves of a pipeline. The policy agent receives its policies from a Policy Manager (described below) and imposes its policies, some examples of which have been provided in the discussion above, on every outputted dataset. Out of band agreements between various (business) parties may be used to allow parties to specify and view the pre-provisioned policies contained in a policy agent.
Policy agent software also records, i.e., logs, all accesses and other actions taken by the programs executing within an enclave.
A Policy Manager may be configured to manage one or more policy agents. The Policy Manager may also perform other functions which will be described below. For simplicity, the present disclosure illustrates a single Policy Manager for a pipeline managing all policy agents in the pipeline in a master-slave arrangement.
The present disclosure also shows a Policy Manager running in the domain of the Operator of the pipeline for illustrative purposes, and various alternatives are possible. In some embodiments, the Policy Manager may be implemented in any domain controlled by either the operator, data provider, program provider or data scientist. If the Policy Manager is implemented using decentralized technology, the control of the Policy Manager can be decentralized across one or more of the above business entities. The term “decentralized” as used in this disclosure implies that the policies that control a policy manager may be provided by multiple parties and not by any single party.
For example,
In some embodiments, a policy agent may record its state with the Policy Manager. Additionally, the Policy Manager may be architected to allow regulators and/or third-party entities to examine the recorded state of the individual Policy Agents. Thus, regulators and third-party entities may examine the constraints under which datasets have been outputted. In embodiments, a possible implementation method for the Policy Manager is as a block-chain system whose ledgers may then contain immutable data records.
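An illustrative, non-limiting sketch of such an immutable record of policy-agent state follows. It models the hash-chained ledger property of a blockchain system in plain Python (the class and field names are hypothetical) rather than a full blockchain implementation: each entry commits to the previous one, so a past record cannot be altered without breaking every later hash.

```python
import hashlib
import json

class PolicyLedger:
    """Append-only, hash-chained log of policy-agent state records."""

    def __init__(self):
        self.entries = []

    def record(self, agent_id: str, state: dict) -> str:
        # Each entry embeds the hash of the previous entry.
        prev = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = json.dumps({"agent": agent_id, "state": state, "prev": prev},
                          sort_keys=True)
        h = hashlib.sha256(body.encode()).hexdigest()
        self.entries.append({"body": body, "hash": h})
        return h

    def verify(self) -> bool:
        # A regulator or third party can recompute the chain end to end.
        prev = "0" * 64
        for e in self.entries:
            if json.loads(e["body"])["prev"] != prev:
                return False
            if hashlib.sha256(e["body"].encode()).hexdigest() != e["hash"]:
                return False
            prev = e["hash"]
        return True

ledger = PolicyLedger()
ledger.record("agent-1", {"outputs_allowed": 3, "K": 20})
ledger.record("agent-1", {"outputs_allowed": 2, "K": 20})
assert ledger.verify()
```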
In scenarios discussed above, a policy may dictate that a data scientist may receive an outputted dataset enclosed in a secure enclave. This means that the data in the dataset is non-transparent to the data scientist. The latter is free to run additional output requests on the outputted dataset in the enclave by injecting new requests into the enclave. In those cases, when the outputted dataset does not have any PII data or does not violate the privacy parameter constraint, the dataset may become unconstrained and may be made available to the data scientist.
In some embodiments, a data scientist or other requestor may view the contents of a dataset contained within an enclave. The contents of an enclave may be made available to a requestor by connecting the enclave to a web browser and causing the contents of the enclave to be displayed as a web page. This prevents the requestor from saving or copying the state of the browser. However, in some cases, the requestor may take a visual image of the browser page.
In some embodiments, a data scientist may submit data requests, which are then curated using a curation service. If the curation service deems the data requests to be privacy-preserving, then the data requests may be processed using the dataset in the enclave and the outputted dataset may be provided to the data scientist as an unconstrained dataset. In this manner, the curation service checks and ensures that the submitted data requests are benign, i.e., that the data requests do not produce outputs that violate privacy regulations.
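A minimal, non-limiting sketch of such a curation check follows. A practical curation service would apply richer rules (e.g., the cardinality and narrowing-sequence policies discussed above); the attribute names here are hypothetical.

```python
# Hypothetical set of attributes deemed personally identifiable.
PII_ATTRIBUTES = {"name", "ssn", "street_address", "date_of_birth"}

def curate(requested_columns: set) -> bool:
    # A minimal curation rule: a request is benign only if it selects
    # no PII attributes; otherwise it is rejected.
    return not (requested_columns & PII_ATTRIBUTES)

assert curate({"zip", "bmi"})        # aggregate-friendly request passes
assert not curate({"name", "bmi"})   # PII-selecting request is rejected
```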
As discussed above, a further challenge associated with processing private data using enclaves is whether policies can be provided about the computations carried out within the enclave, since the processes internal to an enclave are inscrutable. Consider, for example, a use case of secure enclave technologies, following the general description above with respect to
Furthermore, the policy in question may require that the access to process the data and receive the outputted dataset by the data scientist must be authorized. That is, the access by the data scientist must be authenticated. Data scientists on their part may require assurance that their data requests operate on data provided by a specified data provider, since the integrity of data is crucial to the data processing paradigm. In particular, if the data scientist intends to share the outputted results with a third party, the data scientist may need to assure the third party of the integrity of the input data and the fact that the results were obtained by executing a particular data request. Regulators may require that the entire process of storing and processing the data be transparent and made available for investigations and ex post facto approval.
To address the various concerns stated above, an orchestration method may be performed as shown in a workflow diagram in
Referring to
Referring to
Step 1. The policy manager initiates the policy agent that it had prepared in step 5 of
Public key cryptography relies on a pair of complementary keys typically called the private and public keys. The latter may be distributed to any interested party. The former, i.e., the private key, is always kept secret. Using the public key distributed by, say Alice, another party, say Bob, may encrypt a message and send it to Alice safe in the knowledge that only Alice can decrypt the message by using her private key. No other key can be used to decrypt the message encrypted by Bob. As mentioned before, ownership of a private key is a major concern, and several techniques relevant to this topic are discussed in the literature.
Secure enclave technology may be used to address the private key ownership issue by ensuring that the private key (corresponding to a public key) always resides in a secure enclave. This may be accomplished, for instance, by creating a first secure enclave and pre-provisioning it with public/private key cryptography software that creates pairs of private and public keys. Such software is available through opensource repositories. A computer program residing in a second secure enclave may then request the first enclave to provide it (using a secure channel) a copy of the private key that it needs. Thus, the private key never exists outside the secure enclave infrastructure, always residing in secure enclaves and being transmitted between the same using secure channels.
In some embodiments, a policy manager may be pre-provisioned with public/private key software and the Policy Manager be enclosed in a secure enclave as shown in
A secure enclave may then request its policy agent for a private key. The policy agent, as discussed above, operates in conjunction with the policy manager and may request the same from its policy manager. A computer program executing in a secure enclave may need a private key to decrypt the encrypted data it may receive from a data provider. It may request its policy agent, which may then provide it the needed private key for decryption purposes.
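The key-request chain described above (enclave application, to policy agent, to policy manager) may be sketched, for illustration only, as follows. Symmetric keys stand in for the private keys of key pairs, and the class and identifier names are hypothetical; in the described design, these exchanges would occur only over secure channels between enclaves.

```python
import secrets

class PolicyManager:
    # Generates and holds keys; in the described design it runs inside
    # its own secure enclave and would apply policy checks before release.
    def __init__(self):
        self._keys = {}

    def get_private_key(self, key_id: str) -> bytes:
        if key_id not in self._keys:
            self._keys[key_id] = secrets.token_bytes(32)
        return self._keys[key_id]

class PolicyAgent:
    # Pre-provisioned in an enclave; forwards key requests to its manager.
    def __init__(self, manager: PolicyManager):
        self._manager = manager

    def request_key(self, key_id: str) -> bytes:
        return self._manager.get_private_key(key_id)

manager = PolicyManager()
agent = PolicyAgent(manager)
# An application inside an enclave asks its agent for the decryption key
# corresponding to a data provider's encrypted records.
k1 = agent.request_key("data-provider-7")
k2 = agent.request_key("data-provider-7")
assert k1 == k2 and len(k1) == 32
```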
As explained earlier, cryptographic technologies referred to as hash functions or hashing algorithms exist that can take a string of cleartext, often called a message, and map it to a fixed-length sequence of hexadecimal digits, i.e., a sequence of digits [0-9, A-F]. Examples of publicly available hash functions are MD5, SHA-256, and SHA-512. The latter two functions produce outputs of length 256 and 512 bits, respectively. As discussed above, the length of the output is a factor in the strength of a hash function to withstand malicious attacks.
One property of cryptographic hash functions that map cleartext into hexadecimal digits is that it is computationally infeasible to find two different cleartexts that map to the same digits. Thus, a piece of cleartext has an effectively unique signature, i.e., the output of the hash function operating on the cleartext as input.
If a secure enclave containing programs and data can be viewed as comprising cleartext, then it follows that every secure enclave has a unique signature. Thus, by applying a suitable hash function to the contents of a secure enclave, a signature of that enclave is obtained. The signature is unique in that, in practice, no other, different secure enclave will have that signature.
If a secure enclave is populated with a known computer program and a known dataset then that secure enclave's signature may be used to assert that the secure enclave is executing (or executed) the program on the known dataset by comparing the signature of a secure enclave with previously stored signatures.
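For illustration, such a signature computation and comparison may be sketched as follows (a non-limiting sketch using SHA-256; the byte strings stand in for the actual binary images and datasets).

```python
import hashlib

def enclave_signature(program: bytes, dataset: bytes) -> str:
    # Hash the enclave contents (program binary + dataset) into a signature.
    h = hashlib.sha256()
    h.update(program)
    h.update(dataset)
    return h.hexdigest()

expected = enclave_signature(b"<binary of approved App>", b"<provider dataset>")
observed = enclave_signature(b"<binary of approved App>", b"<provider dataset>")
assert observed == expected           # uncorrupted: signatures match

tampered = enclave_signature(b"<binary of MODIFIED App>", b"<provider dataset>")
assert tampered != expected           # any change yields a different signature
```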
Thus, a data provider, if provided with a signature of the enclave, may be assured that its dataset is uncorrupted or unchanged and is operated upon by a pre-determined program.
Similarly, a program provider may be assured that its programs are uncorrupted and unchanged. A data scientist may be assured that its output is the result of processing by the pre-determined program on pre-determined data.
Since a policy manager may be programmed to disallow the operator to access the contents of a secure enclave by denying access to the relevant decryption keys, the operator of the pipeline cannot view or edit the contents of the secure enclave.
In the present disclosure, secure enclaves may be pre-provisioned with software to compute hash functions that may be invoked by the policy manager to create signatures. The policy manager may then be programmed to provide these signatures as certificates upon request to various entities, e.g., to the data provider or the program provider.
Referring now to
Thus, enterprise 1005 has a choice to run apps injected into enclave 1003 or to receive the dataset 1008 into a different enclave 1004 and run their proprietary apps therein.
That is, a series of enclaves 1001, 1002, 1003 and 1004 (
Enterprise 1005 has the flexibility to run its own data requests on the datasets and provide the results of the processing to its customers, along with certificates that the appropriate data requesting programs were executed and the provenance of the input data was ascertained. Enterprise 1005 may assume ownership of the dataset 1007 but then it assumes legal responsibility for its privacy.
Along with the secure data layer available to all enclaves, additional layers may be provided for secure messaging 904, access control and policy agent communication 905 and exchange of cryptographic keys 906. These additional communication layers are provided so that enclaves may exchange various kinds of data securely and without leaks with each other.
Referring to the illustrative embodiment shown in
Enterprise 890 receives the dataset 2B and causes it to be stored in enclave 802 where it may be processed and readied for further processing, whereupon it is stored in the secure data layer 810 as dataset 850.
Enclave 802 is pipelined to enclave 803 which implies that the dataset 850 is outputted from enclave 802 and provided as input to enclave 803. The apps in enclave 803 may now process the data and produce as output dataset 809.
In turn, enclave 803 is pipelined to enclave 804 which exists in a network administered by enterprise 899. That is, enclave 803 is administered by enterprise 890 and enclave 804 is administered by enterprise 899. The latter enterprise may inject additional data 811 into enclave 804, and also inject apps to process the dataset 811 in conjunction with input dataset 809, to produce dataset 805. The result of the computation may be made accessible to a data scientist at enterprise 899 as per the dictates of the policy agent/manager.
In the foregoing discussion, various embodiments have shown systems and methods for collaborative storing, processing and analyzing of data by multiple parties. For example,
In another embodiment, a decentralized trust model may be provided in which multiple enterprises are trusted. Such a trust model may be particularly apt in an open marketplace where data providers contribute data and analyzers contribute data requests, i.e., computer programs, that process the contributed data. No single enterprise or entity is to be trusted in the decentralized model. Rather an openly available structure is provided that any third party may access to verify that the constraints governing the data and algorithm providers are being maintained.
As explained above, in order to load the image into an enclave, a specific encryption key is needed to encrypt the data (whose corresponding decryption key will be used by the enclave to decrypt the data).
It is to be understood that the foregoing embodiments are illustrative, and that many additional and alternative embodiments are possible. In some embodiments, at least a portion of the federated pipeline described above may be run on hardware and firmware that provides protected memory, such as Intel Software Guard Extensions (SGX), the implementation details of which are described at https://www.intel.com/content/www/us/en/architecture-and-technology/software-guard-extensions.html. In some embodiments, at least a portion of the federated pipeline may be run using virtualization software that creates isolated virtual machines, such as AMD Secure Encrypted Virtualization (SEV), the implementation details of which are described at https://developer.amd.com/sev/. In some embodiments, the federated pipeline may manage cryptographic keys using a key management service, such as the Amazon AWS Key Management Service (KMS), which is described in further detail at https://aws.amazon.com/kms/. However, these examples of hardware, firmware, virtualization software, and key management services may not independently create isolated software processes that are based on cryptographic protocols which can be used to create federated pipelines that have different ownerships, policies and attestations. Accordingly, in some embodiments, middleware (e.g., a layer of software) may be provided that can use underlying hardware/firmware, operating system, key management and cryptographic algorithms to achieve secure and private isolated processes, such as secure enclaves.
In some embodiments, secure enclaves can be linked together to form pipelines. Consistent with such embodiments, computations can be broken into sub-tasks that are then processed in pipelines, either concurrently or sequentially or both based on the arrangement of the pipelines.
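A minimal, non-limiting sketch of a sequential pipeline of sub-tasks follows; each function stands in for the processing performed in one enclave, and the hypothetical sub-tasks (filtering, BMI computation, averaging) echo the BMI example discussed above.

```python
def pipeline(stages, data):
    # Run sub-tasks sequentially: each enclave's output feeds the next.
    for stage in stages:
        data = stage(data)
    return data

# Hypothetical sub-tasks split from one computation across linked enclaves.
clean = lambda records: [r for r in records if r.get("height") and r.get("weight")]
add_bmi = lambda records: [{**r, "bmi": r["weight"] / r["height"] ** 2}
                           for r in records]
average = lambda records: sum(r["bmi"] for r in records) / len(records)

result = pipeline([clean, add_bmi, average],
                  [{"height": 1.8, "weight": 81}, {"height": None, "weight": 70}])
assert round(result, 2) == 25.0
```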
In some embodiments, an attestation service can be associated with a pipeline. The attestation service establishes a chain of trust that originates from the start of the pipeline to the end of the pipeline, which provides external entities assurances even though the internal contents of a pipeline may not be observable to external entities. In some embodiments, the chain of trust can be further extended without extending the associated pipeline itself.
One way of dealing with healthcare data is to anonymize or mask the private data attributes, e.g., mask social security numbers before the data is processed or analyzed. In some embodiments of the present disclosure, methods may be employed for masking and de-identifying personal information from healthcare records. Using these methods, a dataset containing healthcare records may have various portions of its data attributes masked or de-identified. The resulting dataset thus may not contain any personal or private information that can identify one or more specific individuals.
The foregoing embodiments related to federated architecture can be applied to provide a sentinel system and method to determine durability of vaccines. For example, such a system can calculate aggregate statistics related to vaccine durability based on data from multiple enterprises or health systems without a need to store personally identifiable patient data in a single location. In some embodiments, an enterprise is an entity that has a health record, e.g., a health system, an academic health center, or a private provider of healthcare data. In some embodiments, an enterprise is an entity providing or performing analysis of healthcare information or providing software for analysis of healthcare information, for example, providing information related to vaccine durability. In some embodiments, each enterprise can process data (e.g., patient data) on its own secure enclave, without storing personally identifiable patient data in a central location. In these embodiments, an output can be provided to a central node from each secure enclave so that the central node can provide an aggregate output related to vaccine durability. In some embodiments, an enterprise can include multiple secure enclaves, for example, an enclave for each of multiple information systems or EHR systems within a hospital system.
In some embodiments, the secure enclaves are constructed by the enterprise where they are located, e.g., by each health system. In other embodiments, the secure enclaves are constructed by another enterprise, e.g., an enterprise providing analysis of healthcare information or providing software for analysis of healthcare information. In some embodiments, analysis can be performed within the secure enclave by the entity where the secure enclaves are located, e.g., by each health system. In some embodiments, analysis can be performed within each secure enclave by another enterprise, such as one providing analysis of healthcare information or providing software for analysis of healthcare information.
An exemplary embodiment is shown in
In some embodiments, the output data 1222, 1223, 1224 from each secure enclave does not include personally identifiable information, e.g., if the output data includes aggregated data or parameters for the respective enterprise or health system or if the output data includes de-identified data. In these embodiments, the output data need not be encrypted before sending to the central node. In these embodiments, the central node need not be a secure enclave. Alternatively, in other embodiments, the output data can include personally identifiable information. In these embodiments, the output data can be encrypted before sending to the central node, the central node can be a secure enclave, and the output data can be decrypted by software within the central node using one or more cryptographic keys. In some embodiments, the data within each secure enclave is decrypted using a pair of keys, e.g., public and private keys. In some embodiments, each secure enclave is encrypted with a different pair of public and private keys. In some embodiments, the secure enclaves are encrypted with a single pair of public and private keys.
In some embodiments, rather than storing individual patient data in a single location, there can be a 'coordinator' or central node where pre-computed Risk Ratios or odds ratios from each of the federated nodes or enclaves are aggregated. In these embodiments, the central node itself does not receive or store any patient data. In some embodiments, this coordinator function is provided by an enterprise providing analysis of healthcare information or providing software for analysis of healthcare information, e.g., by a software provider. In some embodiments, this coordinator function is provided by a national, state, or local health agency or organization. Such a coordinator node can be made capable of providing an alert based on the 'trajectory of the temporal Relative Risk curve' on a single system, single state, multiple systems or states, or for a larger geographical region or the entire country by computing with minimal information (e.g., number of vaccinated people in the cohort, number of unvaccinated people in the cohort, number of infections in each subset, number of symptomatic infections, etc.). For example, an unadjusted odds ratio can be determined from aggregate counts of infections and numbers of vaccinated and unvaccinated people in the cohort.
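For example, the unadjusted odds ratio described above can be computed by the coordinator from aggregate counts alone. The sketch below uses hypothetical counts; no patient-level data is involved.

```python
def unadjusted_odds_ratio(inf_vax, n_vax, inf_unvax, n_unvax):
    """Odds ratio of infection for vaccinated vs. unvaccinated individuals,
    computed from aggregate cohort counts only (no patient-level data)."""
    odds_vax = inf_vax / (n_vax - inf_vax)
    odds_unvax = inf_unvax / (n_unvax - inf_unvax)
    return odds_vax / odds_unvax

# Hypothetical aggregate counts reported by one federated node:
# 30 infections among 10,000 vaccinated people, 120 among 8,000 unvaccinated.
or_vax = unadjusted_odds_ratio(30, 10_000, 120, 8_000)  # ≈ 0.198
```

An odds ratio well below one, as here, indicates substantially lower odds of infection in the vaccinated cohort.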
In other embodiments, only the Risk Ratios are shared from each federated node, and the coordinator node can still compute an aggregate output, such as a composite or aggregate Risk Ratio and associated uncertainty (confidence interval). In these embodiments, not even counts of the patients need be shared with the coordinator node. A visualization of an individual node's information is provided by the 'federated node' or secure enclave itself, as the coordinator will not have this information. This embodiment is useful when some part of the federation wants to minimize information sharing with the party that is providing the 'coordinator' function.
In some embodiments, the system uses aggregate statistics without sharing individual statistics from each federated node. In some embodiments, individual data from enclaves is sent to the central node and combined to form a new cohort. In this embodiment, analysis, e.g., a regression analysis, can be performed on this new cohort in the central node. In these embodiments, at least a portion of the input data from each secure enclave is sent to the central node. In these embodiments, the input data from individual enclaves can be encrypted before sending to the central node, the central node can be a secure enclave, and that data can be decrypted by software within the central node using one or more cryptographic keys. By combining data, trends that are too small to be seen in an individual dataset can emerge from a larger, combined dataset.
Yet another embodiment is a total lack of a ‘coordinator’ function in the network. Here, as in Bitcoin and various Blockchain implementations, the end system, e.g., on the desktop of the CDC director, acts as the sole aggregator.
In some embodiments, an alert functionality is configured by the user of the end system based on various thresholds of a Temporal Risk Ratio curve. For example, an alert can be configured by the enterprise or user who operates the central node or performs analysis within the central node. In some embodiments, an alert can be triggered when the odds ratio of testing positive relative to the odds ratio of testing positive at a time of full protection (e.g., a defined interval after vaccination) exceeds a threshold value. In these embodiments, comparing to a time of full protection can provide an estimate of absolute effectiveness. In some embodiments, an alert can be triggered when the odds ratio of testing positive relative to the odds ratio of testing positive at baseline (e.g., shortly after vaccination) exceeds a threshold value. In some embodiments, an alert can be triggered when a month-to-month change (e.g., an increase) in the odds ratio of testing positive exceeds a threshold value.
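The threshold-based alert logic described above might be sketched as follows; the odds-ratio series, baseline, and threshold are hypothetical placeholders for values a user of the end system would configure.

```python
def should_alert(or_series, baseline_or, ratio_threshold=2.0):
    """Trigger an alert when the latest odds ratio of testing positive,
    relative to the odds ratio at a reference time (e.g., time of full
    protection or baseline), exceeds a configurable threshold."""
    return or_series[-1] / baseline_or > ratio_threshold

# Hypothetical temporal odds-ratio curve (e.g., one value per month):
monthly_or = [1.0, 1.2, 1.6, 2.3]
alert = should_alert(monthly_or, baseline_or=1.0)  # True: 2.3 / 1.0 > 2.0
```

The month-to-month variant described above would compare consecutive entries of the series against a change threshold instead of comparing the latest entry against the baseline.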
In some embodiments, vaccinated versus unvaccinated individuals are compared to establish when the protection against infection has been lost (like a vaccinated vs. placebo group in a clinical trial). However, in some health systems, individuals who are not recorded as being vaccinated are not necessarily unvaccinated. For example, individuals may have received vaccines elsewhere and not had the vaccination recorded in the health system database. This is related to how different states and hospitals maintain synchronization between registries and electronic health records, which is inherently variable. Particularly as the vaccination rate in the population increases to high numbers (e.g., >90% in elderly people), this "contamination" of unvaccinated cohorts with vaccinated individuals will have substantial impacts on estimation of vaccine effectiveness.
To get around this, in some embodiments, a strategy is for the system to assess durability by considering only data from vaccinated patients. The "baseline risk of infection" is determined based on a predetermined interval or time (e.g., 4-10 days) after the first vaccine dose was received. During this time period, per the phase 3 trials and real-world studies, there should not yet be an observed protection against infection; thus, this interval is considered to approximate the unvaccinated state. In some embodiments, the predetermined interval is determined based on the time that it takes for a vaccine to provide protection against infection. The odds of being infected on each subsequent day are compared to this baseline rate to determine when a fully vaccinated individual's risk of infection is "back to baseline."
In some embodiments, an approach that considers only data from vaccinated individuals makes the system more scalable. Relying on access to adequate numbers of confidently unvaccinated individuals would likely hamper the speed and quality of such an analysis. For example, some health systems did not know what to do with data from vaccinated patients because they were not able to construct a high-confidence unvaccinated cohort. An approach that considers only data from vaccinated individuals circumvents that by relying only on individuals for whom "vaccination=YES," a designation that generally carries much more confidence than "vaccination=NO."
In some embodiments, assessment of antibody durability is performed using a cohort study or case-control study. Assessment of antibody durability can be performed using any known methods for forming a cohort. In some embodiments, data from a cohort study or a case-control study can provide input to a secure enclave for calculation of vaccine durability. In some embodiments, analysis of a cohort study or case-control study can occur within a secure enclave of each enterprise. In a cohort study, groups can be defined based on an exposure, e.g., vaccination status, and the rate of an outcome, e.g., infection, is assessed between groups. For example, the question is how the rate of the outcome differs based on exposure. In a case-control study, groups are defined based on outcome, e.g., infection status, and the rate of an exposure, e.g., vaccination, is assessed between groups. For example, the question is what the odds are of being a case for exposed versus non-exposed people.
In an exemplary cohort study, shown in
Vaccine Effectiveness (VE)=1−Incidence Rate Ratio=1−(IRvax/IRUnvax)
VE=1−([1/900]/[3/700])
VE=74%
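Using the numbers from the cohort example above, the VE arithmetic can be checked with a short sketch:

```python
def vaccine_effectiveness_cohort(cases_vax, at_risk_vax, cases_unvax, at_risk_unvax):
    """VE = 1 - Incidence Rate Ratio, as defined for the cohort study above."""
    incidence_rate_ratio = (cases_vax / at_risk_vax) / (cases_unvax / at_risk_unvax)
    return 1 - incidence_rate_ratio

# 1 case among 900 vaccinated vs. 3 cases among 700 unvaccinated:
ve = vaccine_effectiveness_cohort(1, 900, 3, 700)  # ≈ 0.74, i.e., VE = 74%
```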
In an exemplary case-control-study, shown in
Odds Ratio=(CasesVax/ControlsVax)/(CasesUnvax/ControlsUnvax)
VE=1−Odds Ratio
In the exemplary case-control study shown in Table 1, VE is 89%:
VE=1−([1/3]/[3/1])=1−0.11=89%
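Similarly, the case-control computation from Table 1 can be sketched as:

```python
def vaccine_effectiveness_case_control(cases_vax, controls_vax,
                                       cases_unvax, controls_unvax):
    """VE = 1 - Odds Ratio, as defined for the case-control study above."""
    odds_ratio = (cases_vax / controls_vax) / (cases_unvax / controls_unvax)
    return 1 - odds_ratio

# Counts from the exemplary study: 1 vaccinated case, 3 vaccinated controls,
# 3 unvaccinated cases, 1 unvaccinated control:
ve = vaccine_effectiveness_case_control(1, 3, 3, 1)  # ≈ 0.89, i.e., VE = 89%
```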
In some embodiments, a cohort study is a crossover study which allows people to contribute their at-risk time to the unvaccinated cohort until they become vaccinated and then contribute their at-risk time to the vaccinated cohort.
In some embodiments, a crossover design is beneficial because such a study avoids “looking into the future” during selection of participants. For example, if an individual is vaccinated at day 0, a matched unvaccinated individual is selected for the unvaccinated cohort. However, if the matched individual subsequently becomes vaccinated, the matched individual would be excluded in a non-crossover study because of the subsequent vaccination. The study designers have looked to the future in determining whether the match is valid. In contrast, in a crossover design, the match is valid, and the match would be moved from the unvaccinated cohort to the vaccinated cohort on the date of vaccination.
In some embodiments, a cohort study uses a dynamic cohort, which is like a crossover study but includes a “buffer” during the interval in which the vaccine is not yet fully effective. In embodiments using a dynamic cohort, an individual does not contribute at-risk time to any group between the first dose and the date of full vaccination.
In some embodiments, the selection of cohort itself provides a proxy for unvaccinated patients by using the first few days of the ‘vaccinated patients’ as an unvaccinated cohort.
In some embodiments, a case-control study is a test-negative study where cases and controls presented with COVID-19 symptoms and were tested (e.g., by PCR) for SARS-CoV-2 infection.
Table 2 shows results for an exemplary test-negative study for COVID-19 vaccine effectiveness. In Table 2, cases are positive PCR tests, and controls are negative PCR tests. For each case/control, the vaccination status is determined at the time of the test, and the odds of being a case are computed for vaccinated vs. unvaccinated individuals. The Odds Ratio and Vaccine Effectiveness are computed as shown below:
Odds Ratio=(A/B)/(C/D)
Vaccine Effectiveness=1−Odds Ratio
In this study, probability of being a case can be modeled as follows:
logit(p)˜Vaccination Status, where p=probability of being a case
Table 3 shows results for an exemplary test-negative study for COVID-19 vaccine durability. In Table 3, cases are positive PCR tests, and controls are negative PCR tests. For each case/control, the time since vaccination is determined at the time of test, and the odds of being a case for recently vaccinated vs. distantly vaccinated is computed. The Odds Ratio is computed as shown below:
Odds Ratio=(A/B)/(C/D)
In this study, if the Odds Ratio (OR) is greater than one, the odds of infection are greater for people who were vaccinated a long time ago than for people who were recently vaccinated. In some embodiments, an OR greater than one indicates waning immunity. In this study, probability of being a case can be modeled as follows:
logit(p)˜Time Since Vaccination, where p=probability of being a case
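The model logit(p)˜Time Since Vaccination can be fit with any statistics package; the following is a minimal pure-Python Newton-Raphson sketch for a single covariate plus intercept. The data below are hypothetical, and in practice the covariates and stratification described below would also be included.

```python
import math

def fit_logistic(x, y, iters=25):
    """Fit logit(p) = b0 + b1*x by Newton-Raphson (illustrative sketch)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w                # Fisher information (2x2)
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # Newton step: H^-1 g
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: x = days since vaccination, y = 1 for a positive test.
x = [10, 20, 40, 60, 90, 120, 150, 180]
y = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(x, y)
# b1 > 0 here: the odds of being a case grow with time since vaccination,
# consistent with waning immunity.
```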
In some embodiments, cohort selection accounts for factors that could influence the odds of testing positive. In some embodiments, these factors are included as inputs to a model used to calculate an odds ratio.
In some embodiments, demographics influence odds of testing positive. Non-limiting examples of demographic factors include age, sex, race, ethnicity, income, and geography (e.g. continent, nation, state, county, city, borough, or zip code).
In some embodiments, clinical comorbidity influences odds of testing positive. Non-limiting examples of comorbidities include essential hypertension, hyperlipidemia, acute pharyngitis, otitis media, hypertension, type 2 diabetes mellitus, coronary atherosclerosis, upper respiratory tract infection, chronic kidney disease, and heart disease.
In some embodiments, geography influences odds of testing positive. For example, different geographic regions can have differences in COVID-19 incidence, masking policies, social distancing policies, or vaccine coverage. In some embodiments, geography data can include continent, nation, region, state, county, city, borough, or zip code. In some embodiments, a region is within a country, e.g., the Northeast, Southeast, South, Southwest, or Northwest of the United States.
In some embodiments, the time at which the test was taken influences odds of testing positive. For example, whether the test was taken during a spike in COVID-19 cases (e.g., during July or August 2021) or during a COVID variant's prevalence (e.g., during Alpha, Delta, or Omicron prevalence).
In some embodiments, the time at which the individual was vaccinated influences odds of testing positive. For example, potentially higher risk groups were vaccinated earlier. Non-limiting examples of high risk groups include individuals with occupational exposure, essential workers, individuals in long-term care facilities, older individuals, and individuals with comorbidities. For example, younger or healthier groups were vaccinated later. For example, there may be behavioral differences in the decision to become vaccinated over time.
In some embodiments, a health system can process input data using a regression model, e.g., a logistic regression model. For example, in an exemplary regression model, logit(p) is equal to the sum of Time Since Vaccination and one or more of Age, Sex, Race, Ethnicity, Comorbidities, Residential County, Date of Test, and Date of Vaccination, depending on the input data provided. In some embodiments, input to a regression model includes a vaccination status, a vaccination date, a test date, and a test result. In some embodiments, the input data can also include one or more of the demographic, comorbidity, and geographic data described above. In some embodiments, the date of the test and the date of vaccination define a primary exposure of interest: for example, time since vaccination=date of test−date of vaccination.
In some embodiments, a durability study design can be performed using test-negative approaches. In some embodiments, a regression can be stratified using any of the demographic information, comorbidity, or geographic information described above. In some embodiments, a regression can be stratified using time, e.g., date of test or date of vaccination. In some embodiments, the regression can be stratified using time based on a week, a two-week period, or a month. In a first exemplary approach, regression is stratified using county and date of test:
logit(p)˜Time Since Vaccination+Age+Sex+Race+Ethnicity+Comorbidities+Strata(Residential County, Date of Test)
In a second exemplary approach, regression is stratified using county and date of vaccination:
logit(p)˜Time Since Vaccination+Age+Sex+Race+Ethnicity+Comorbidities+Strata(Residential County, Date of Vax)
In both the first and second approaches, two analyses are performed. First, for fully vaccinated individuals, the odds of infection each day relative to the date of full vaccination is computed to approximate maximal protection. Second, for individuals with at least one dose, the odds of infection each day relative to four days after the first dose is computed to approximate the unvaccinated state.
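Fitting a stratified or conditional logistic regression as above is typically done with a statistics package. As a simpler illustration of the same idea of adjusting for strata (e.g., county and date of test), the classical Mantel-Haenszel estimator, used here as a stand-in rather than the exact model above, combines per-stratum 2x2 tables into one adjusted odds ratio. All counts below are hypothetical.

```python
def mantel_haenszel_or(strata):
    """Combine 2x2 tables (a = vaccinated cases, b = vaccinated controls,
    c = unvaccinated cases, d = unvaccinated controls) across strata into
    one stratification-adjusted odds ratio."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Hypothetical strata, e.g. one 2x2 table per (county, week-of-test) cell:
strata = [(5, 20, 15, 10), (2, 30, 9, 12)]
or_mh = mantel_haenszel_or(strata)
```

With a single stratum, the estimator reduces to the plain odds ratio (a/b)/(c/d).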
In some embodiments, using a federated architecture system, a regression model can be fit to patient data for an individual health system, e.g., within a secure enclave of that system, and the output of that regression can be sent to a central node. In these embodiments, the central node can receive such output from a plurality of health systems and apply an aggregate analysis to that output data to provide an aggregate output, e.g., an odds ratio of infection aggregated over the plurality of health systems. For example, this aggregate output can be used as part of a sentinel system to determine vaccine effectiveness over time. In some embodiments, the regression model is a stratified model. In some embodiments, the regression model is a conditional logistic regression model. In some embodiments, the input data from each health system includes a vaccination status, a vaccination date, a test date, and a test result. In some embodiments, the input data includes one or more other factors that may impact positivity rate, e.g., demographic data, comorbidities, or geographic data. In some embodiments, the output from each health system includes one or more parameters of the regression model, e.g., one or more regression coefficients, one or more standard errors of the one or more regression coefficients, or a value calculated from a combination thereof. In some embodiments the output from each health system can include a matrix of a regression model, e.g., a covariance matrix of a logistic regression model.
In some embodiments, a logistic regression can be applied to input data from a health system to produce estimates and confidence intervals for vaccine effectiveness. In some embodiments, a logistic regression is applied to input data within a secure enclave of the health system. In some embodiments, these estimates and confidence intervals can be sent to a central node that receives such outputs from a plurality of health systems and combines those outputs to provide an aggregate output. In one embodiment, a logistic regression can be applied to input data from a health system to produce estimates and confidence intervals for vaccine effectiveness for a number of time windows, each time window relative to some reference time window. In some embodiments, the reference time window is the date of vaccination or the date after vaccination when full immunity is reached (full vaccination). In some embodiments, the reference time window is the date of receiving a booster or a date after a booster when full immunity is reached, e.g., one month after a booster. For example, the input data can be from a COVID vaccine durability test-negative analysis (as discussed above in relation to
For example, when evaluating change of vaccine efficacy relative to a reference time window of time of full vaccination (e.g., 2 doses+14 days), the aggregate covariate-adjusted odds of testing positive for COVID-19 is computed in each time window of interest (e.g., the time windows (i) 30-60 days following full vaccination, (ii) 60-90 days following full vaccination, (iii) 90-120 days following full vaccination, etc.) relative to a reference time window of 0-30 days following full vaccination (equivalent to the window 14-44 days following 2 doses).
In some embodiments, a covariate-adjusted odds ratio can be determined by fitting a regression model, e.g., a conditional logistic regression model, to the input data of each health system. In these embodiments, the input data for the regression model includes data for each time window of interest, where each individual test is bucketed into a time window and each individual's input data includes a binary covariate for each time window and the reference time window (1 if the test falls within the time window and 0 if the test falls outside the time window or if the test falls within the reference time window). This regression model can be applied to input data of a system within a secure enclave of that system. The regression model can produce, for each time window of interest w, an estimated regression coefficient {circumflex over (β)}w, along with an estimated standard error SE{circumflex over (β)}w. The quantity exp({circumflex over (β)}w) can then be interpreted as the covariate-adjusted odds ratio of testing positive in window w relative to the reference window.
Assumptions required for this interpretation to hold (in the slightly different context of using a test-negative design to estimate vaccine effectiveness) include that decision to vaccinate is not correlated to exposure or susceptibility and that a vaccine confers all-or-nothing protection.
In some embodiments, such a regression analysis can be run on each of a plurality of health systems, e.g., within a secure enclave for each health system, and aggregated, e.g., by a central node. One exemplary method to combine the estimates for each health system is to apply a standard inverse-variance weighting meta-analysis procedure, described herein. First, each system is denoted by superscript (1), (2), etc.: so the coefficient/standard error estimates are {circumflex over (β)}w(1), SE{circumflex over (β)}w(1), {circumflex over (β)}w(2), SE{circumflex over (β)}w(2), and so on.
To calculate a standard inverse-variance weighting analysis, the following summations are made over all health systems j. For each time window w the following are computed:
wj=1/(SE{circumflex over (β)}w(j))2 (the inverse-variance weight of system j)
{circumflex over (β)}w,combined=(Σj wj {circumflex over (β)}w(j))/(Σj wj)
SE{circumflex over (β)}w,combined=1/√(Σj wj)
These combined estimates can be used to give a combined or aggregated estimate of the protection from vaccine, along with 95% confidence interval during each window w:
Estimated odds exp({circumflex over (β)}w,combined)
95% CI: [exp({circumflex over (β)}w,combined−1.96 SE{circumflex over (β)}w,combined), exp({circumflex over (β)}w,combined+1.96 SE{circumflex over (β)}w,combined)]
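The inverse-variance combination and confidence interval described above can be sketched as follows, with hypothetical per-system estimates for one time window w:

```python
import math

def inverse_variance_combine(betas, ses):
    """Fixed-effect meta-analysis: weight each system's log-odds estimate
    by 1/SE^2, per the inverse-variance weighting procedure above."""
    weights = [1.0 / se ** 2 for se in ses]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return beta, se

# Hypothetical per-system estimates (log odds ratios) and standard errors:
betas = [0.40, 0.55, 0.30]
ses = [0.10, 0.20, 0.15]
beta_c, se_c = inverse_variance_combine(betas, ses)
odds = math.exp(beta_c)                         # combined estimated odds
ci = (math.exp(beta_c - 1.96 * se_c),           # 95% confidence interval
      math.exp(beta_c + 1.96 * se_c))
```

Note that more precise systems (smaller standard errors) dominate the combined estimate.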
In some embodiments, this method assumes that the health systems providing regression coefficients are estimating the same underlying true parameter values βw. For example, this method assumes that the regression model has accounted for systemic biases, e.g., demographics (e.g., age, sex, race, ethnicity), comorbidities, or geography. In some embodiments, the possibility of additional systemic difference in treatment effect (e.g., an effect that is not adjusted for by the covariate adjustments in the regression model) can be handled by applying the DerSimonian-Laird meta-analysis approach for handling possible heterogeneity in effects. The formulas for this method are also contained in Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. A basic introduction to fixed-effect and random-effects models for meta-analysis. Res Synth Methods. 2010;1:97-111. For example, a between-study or between-system variance τ2 can be estimated by computing the amount of study-to-study or system-to-system variation actually observed, estimating how much the observed effects would be expected to vary from each other if the true effect were actually the same across studies, and assuming that excess variation reflects real differences in effect size. In this example, τ2 can be estimated using the following equations:
k is the number of studies (or health systems), and df=k−1.
The statistic Q is a (weighted) sum of squares of the effect size estimates (Yi) about their mean (M). Q is weighted in a manner that assigns more weight to larger studies, which also puts Q on a standardized metric. In this metric, the expected value of Q if all studies share a common effect size is df. Therefore, Q−df represents the excess variation between studies, that is, the part that exceeds what we would expect based on sampling error. Since Q−df is on a standardized scale, we divide by a factor, C, which puts this index back into the same metric that had been used to report the within-study variance, and this value is τ2. If τ2 is less than zero, it is set to zero, since a variance cannot be negative. These methods can also be run using the function statsmodels.stats.meta_analysis.combine_effects in the Python statsmodels package. See Seabold S, Perktold J. Statsmodels: Econometric and Statistical Modeling with Python. Proceedings of the 9th Python in Science Conference. 2010. pp. 92-96.
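The τ2 computation described above (Q, df, and C) can be sketched in a few lines; the effect sizes and variances below are hypothetical, and in practice a function such as statsmodels' combine_effects performs this calculation.

```python
def dersimonian_laird_tau2(effects, variances):
    """Between-system variance tau^2 via the DerSimonian-Laird estimator:
    tau^2 = max(0, (Q - df) / C), where Q is the weighted sum of squares
    of the effect sizes about their weighted mean."""
    w = [1.0 / v for v in variances]
    mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - mean) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi * wi for wi in w) / sum(w)
    return max(0.0, (q - df) / c)   # clamp: a variance cannot be negative

# Hypothetical per-system effects (log odds ratios) and their variances:
tau2 = dersimonian_laird_tau2([0.1, 0.9, 0.3], [0.01, 0.04, 0.0225])
```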
In one embodiment, the central node can fit output data from a plurality of health systems to an aggregate regression model across the plurality of health systems. In this embodiment, the output data of each health system can include all regression coefficients and associated standard errors of the regression model of that health system, or the output data can include one or more regression coefficients from each health system that are stratified by one or more demographic factors. In this embodiment, the output of each health system can also include the coefficient covariance matrix, for example, if providing a continuous estimate over time, as discussed in further detail below. In other embodiments, the output data of each health system can include all patient data and a regression is performed at the central node.
In some embodiments, fitting of the aggregate model can be done in an iterative fashion, e.g., by gradient descent to fine tune the aggregate model. For example, as shown in
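One way to realize such iterative fitting is a federated gradient-aggregation loop, in which each system shares only gradients computed on its local data and the central node updates shared parameters. The sketch below shows this idea for a one-covariate logistic model; the node datasets and learning rate are hypothetical, and this is a simplification rather than the exact protocol of the embodiments above.

```python
import math

def local_gradient(beta, data):
    """Gradient of the logistic log-likelihood on one node's local data.
    Only this gradient (not the patient rows) leaves the node."""
    g = [0.0, 0.0]
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x)))
        g[0] += y - p
        g[1] += (y - p) * x
    return g

def federated_fit(node_datasets, lr=0.001, rounds=200):
    """Central node: aggregate per-node gradients each round and step."""
    beta = [0.0, 0.0]
    for _ in range(rounds):
        grads = [local_gradient(beta, d) for d in node_datasets]
        beta[0] += lr * sum(g[0] for g in grads)
        beta[1] += lr * sum(g[1] for g in grads)
    return beta

# Hypothetical per-system datasets of (time since vaccination, case?) pairs:
node_a = [(1, 0), (2, 0), (3, 1), (4, 1)]
node_b = [(1, 0), (2, 1), (3, 0), (4, 1)]
beta = federated_fit([node_a, node_b])
```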
An exemplary aggregate regression embodiment is shown in
In one embodiment, the central node can provide a continuous estimate of vaccine durability over time, as an alternative to providing estimates only at specific time windows relative to a reference window. In some embodiments, the number of time windows can be limited by the number of parameters that can be fit based on the sample size. In these embodiments, a day-by-day change can be calculated using interpolation (e.g., a spline), without increasing the number of time windows or parameters to be fit. In some embodiments, the number of time windows can be limited by the sample size during a particular time period. In these embodiments, if each time window is short, there can be a small number of tests in each time window and the data can be noisy.
For example, to provide a continuous estimate of vaccine durability over time, within each health system, as discussed above, a regression model can produce, for each time window of interest w, an estimated regression coefficient {circumflex over (β)}w, along with an estimated standard error SE{circumflex over (β)}w.
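As a simplified stand-in for the spline mentioned above, a day-by-day estimate can be interpolated linearly between per-window estimates; the window midpoints and coefficient values below are hypothetical.

```python
def interpolate_daily(window_midpoints, betas, day):
    """Day-by-day log-odds estimate by linear interpolation between the
    per-window estimates (a simple stand-in for a spline)."""
    pts = sorted(zip(window_midpoints, betas))
    if day <= pts[0][0]:
        return pts[0][1]            # clamp before the first window
    if day >= pts[-1][0]:
        return pts[-1][1]           # clamp after the last window
    for (d0, b0), (d1, b1) in zip(pts, pts[1:]):
        if d0 <= day <= d1:
            return b0 + (b1 - b0) * (day - d0) / (d1 - d0)

# Hypothetical per-window estimates at midpoints of 0-30, 30-60, 60-90 days:
mids = [15, 45, 75]
betas = [0.0, 0.20, 0.45]
daily = [interpolate_daily(mids, betas, d) for d in range(0, 91, 15)]
```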
Certain embodiments will now be described in the following non-limiting examples.
Out of approx. 259K patients who had the mRNA vaccines at this health system, 3.8K patients (less than 1.5%) had a breakthrough infection after full vaccination. The vaccine has been protective for over 98.5% of people.
Both Pfizer and Moderna are highly effective, but Moderna is even more impressive given that, as shown in
As shown in
Yet, as shown in
As shown in
As shown in
The processes and logic flows described in this specification, including the method steps of the subject matter described herein, can be performed by one or more programmable processors executing one or more computer programs to perform functions of the subject matter described herein by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus of the subject matter described herein can be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processor of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of nonvolatile memory, including by way of example semiconductor memory devices, (e.g., EPROM, EEPROM, and flash memory devices); magnetic disks, (e.g., internal hard disks or removable disks); magneto optical disks; and optical disks (e.g., CD and DVD disks). The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
To provide for interaction with a user, the subject matter described herein can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, (e.g., a mouse or a trackball), by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback, (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form, including acoustic, speech, or tactile input.
The subject matter described herein can be implemented in a computing system that includes a back end component (e.g., a data server), a middleware component (e.g., an application server), or a front end component (e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described herein), or any combination of such back end, middleware, and front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
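As an illustration only (not part of the claimed subject matter), the three-tier arrangement described above can be sketched in a few lines: a back end data store, a middleware layer applying application logic, and a front end client interacting over a local network connection. All names here (`DATA_STORE`, `summarize`, `Handler`) are hypothetical stand-ins chosen for the sketch.

```python
# Minimal sketch of a back end / middleware / front end system, using only
# the Python standard library. Assumes nothing about the disclosed invention.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Back end component: an in-memory store standing in for a data server.
DATA_STORE = {"records": [1, 2, 3]}

# Middleware component: application logic between the store and the client.
def summarize(store):
    return {"count": len(store["records"]), "total": sum(store["records"])}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The middleware result is serialized and returned to the front end.
        body = json.dumps(summarize(DATA_STORE)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # suppress request logging in the sketch
        pass

# Digital data communication over a (loopback) network, per the description.
server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick a free port
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Front end component: a client interacting with the implementation.
reply = json.loads(urlopen(f"http://127.0.0.1:{port}/").read())
server.shutdown()
```

In practice the three components would run on separate machines joined by a LAN or WAN; the loopback address is used here only so the sketch is self-contained.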
Those of skill in the art would appreciate that the various illustrations in the specification and drawings described herein can be implemented as electronic hardware, computer software, or combinations of both. To illustrate this interchangeability of hardware and software, various illustrative blocks, modules, elements, components, methods, and algorithms have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware, software, or a combination depends upon the particular application and design constraints imposed on the overall system. Skilled artisans can implement the described functionality in varying ways for each particular application. Various components and blocks can be arranged differently (for example, arranged in a different order, or partitioned in a different way) all without departing from the scope of the subject technology.
Furthermore, an implementation of the communications protocol can be realized in a centralized fashion in one computer system, or in a distributed fashion where different elements are spread across several interconnected computer systems. Any kind of computer system, or other apparatus adapted for carrying out the methods described herein, is suited to perform the functions described herein.
A typical combination of hardware and software could be a general purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein. The methods for the communications protocol can also be embedded in a non-transitory computer-readable medium or computer program product, which comprises all the features enabling the implementation of the methods described herein, and which, when loaded in a computer system, is able to carry out these methods. Input to any part of the disclosed systems and methods is not limited to a text input interface. For example, the disclosed systems and methods can work with any form of user input, including text and speech.
Computer program or application in the present context means any expression, in any language, code, or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: a) conversion to another language, code, or notation; b) reproduction in a different material form. Significantly, this communications protocol can be embodied in other specific forms without departing from the spirit or essential attributes thereof, and accordingly, reference should be had to the following claims, rather than to the foregoing specification, as indicating the scope of the invention.
The communications protocol has been described in detail with specific reference to these illustrated embodiments. It will be apparent, however, that various modifications and changes can be made within the spirit and scope of the disclosure as described in the foregoing specification, and such modifications and changes are to be considered equivalents and part of this disclosure.
It is to be understood that the disclosed subject matter is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The disclosed subject matter is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.
As such, those skilled in the art will appreciate that the conception, upon which this disclosure is based, may readily be utilized as a basis for the designing of other structures, systems, methods and media for carrying out the several purposes of the disclosed subject matter. It is important, therefore, that the claims be regarded as including such equivalent constructions insofar as they do not depart from the spirit and scope of the disclosed subject matter.
It will be appreciated that while one or more particular materials or steps have been shown and described for purposes of explanation, the materials or steps may be varied in certain respects, or materials or steps may be combined, while still obtaining the desired outcome. Additionally, modifications to the disclosed embodiment and the invention as claimed are possible and within the scope of this disclosed invention.
This application claims priority to U.S. Provisional Application No. 63/252,778, filed on Oct. 6, 2021, the contents of which is incorporated by reference herein in its entirety.
Number | Date | Country
---|---|---
63/252,778 | Oct. 6, 2021 | US