The present invention relates generally to protecting data privacy and intellectual property of computer programs with particular emphasis on cases wherein such data or programs are shared between distinct entities such as enterprises and corporations.
Data collection has become ubiquitous and a major activity of enterprises. There are many enterprises whose business model is to collect and monetize data. Some enterprises are engaged in the creation and distribution of computer programs and applications, i.e., programmatic assets.
Owners of datasets and computer programs would understandably like to protect their data from being copied or distributed without authorization. Enterprises which acquire datasets need to provide assurances that they will abide by the policies governing the acquisition of data or execution of computer programs.
When data and programmatic assets are shared, several questions arise pertaining to ownership, usage, intellectual property, rights adhering to the sharing, etc.
Sharing of assets may be further complicated if a shared asset contains private or sensitive data, e.g., a shared dataset may contain Protected Health Information (commonly abbreviated as PHI) or Personally Identifiable Information (PII).
Therefore, a technology that protects and manages the sharing of assets would be of enormous benefit to commercial activities and members of society.
In accordance with one aspect of the methods and systems described herein, a method is provided for securely receiving an algorithm in a computing environment that is to process a dataset. In accordance with the method, an algorithm for processing datasets is received, in encrypted form, in a first trusted and isolated computing environment from an algorithm-providing computational domain of an entity that is authorized to provide the algorithm. The algorithm is encrypted by a first encryption key. The first trusted and isolated computing environment is established by a controlling trusted and isolated computing environment that provides the algorithm-providing computational domain with a second encryption key for encrypting the first encryption key. A first decryption key for decrypting the first encryption key is received in the first trusted and isolated computing environment from the controlling trusted and isolated computing environment such that the first trusted and isolated computing environment is able to decrypt the encrypted algorithm without allowing any other computational domain to access the algorithm in an unencrypted form except for the algorithm-providing computational domain. A trusted and isolated computing environment is a computing environment whose computer code is able to be attested by comparing a digest of the computing environment to a baseline digest of the computing environment that is available to third parties to thereby verify computing environment integrity while also being a computing environment in which only a specified maximum number of application processes and specified system processes implementing the computing environment are able to operate.
In accordance with another aspect of the methods and systems described herein, the first encryption key is a symmetric key.
In accordance with another aspect of the methods and systems described herein, the symmetric key is generated within the algorithm-providing computational domain.
In accordance with another aspect of the methods and systems described herein, the symmetric key is encrypted by the second encryption key within the algorithm-providing computational domain and provided to the controlling trusted and isolated computing environment.
In accordance with another aspect of the methods and systems described herein the method further includes: receiving in the first trusted and isolated computing environment from the controlling trusted and isolated computing environment a decryption key for decrypting the symmetric key; decrypting the encrypted algorithm within the first trusted and isolated computing environment to provide an unencrypted algorithm; re-encrypting the unencrypted algorithm with a second encryption key received from the controlling trusted and isolated computing environment; and storing the re-encrypted algorithm in a storage system external to the first trusted and isolated computing environment.
In accordance with another aspect of the methods and systems described herein, a method of securely processing a dataset with an algorithm to produce an output result to be securely provided to an output recipient includes: establishing, with a controlling trusted and isolated computing environment, a first trusted and isolated computing environment in which a dataset to be processed by an algorithm is received in encrypted form from a dataset-providing computational domain of an entity that is authorized to provide the dataset, the dataset being encrypted by a first encryption key, the controlling trusted and isolated computing environment providing the dataset-providing computational domain with a second encryption key for encrypting the first encryption key; providing to the first trusted and isolated computing environment, from the controlling trusted and isolated computing environment, a first decryption key for decrypting the first encryption key such that the first trusted and isolated computing environment is able to decrypt the encrypted dataset without allowing any other computational domain to access the dataset in an unencrypted form except for the dataset-providing computational domain, wherein a trusted and isolated computing environment is a computing environment whose computer code is able to be attested by comparing a digest of the computing environment to a baseline digest of the computing environment that is available to third parties to thereby verify computing environment integrity while also being a computing environment in which only a specified maximum number of application processes and specified system processes implementing the computing environment are able to operate; wherein the first trusted and isolated computing environment obtains the algorithm that is to process the dataset by receiving the algorithm as an encrypted algorithm from an external storage system and decrypts the encrypted algorithm using a second decryption key obtained from the controlling trusted and isolated computing environment such that the first trusted and isolated computing environment is able to decrypt the encrypted algorithm without allowing any other computational domain to access the algorithm in an unencrypted form except for the computational domain of an entity that is authorized to provide the algorithm; and causing the algorithm to process the dataset in the first trusted and isolated computing environment to produce an output result.
In accordance with another aspect of the methods and systems described herein, the method further includes: providing from the controlling trusted and isolated computing environment to a designated recipient of the output result a second encryption key for encrypting a symmetric key provided to the designated recipient by the dataset-providing computational domain; receiving in the first trusted and isolated computing environment the encrypted symmetric key from the designated recipient; receiving a third decryption key in the first trusted and isolated computing environment from the controlling trusted and isolated computing environment for decrypting the encrypted symmetric key; decrypting the encrypted symmetric key in the first trusted and isolated computing environment using the third decryption key; and encrypting the output result in the first trusted and isolated computing environment using the symmetric key and storing the encrypted output result in a storage system external to the first trusted and isolated computing environment and the controlling trusted and isolated computing environment.
In accordance with another aspect of the methods and systems described herein, a method for securely receiving a dataset in a computing environment that is to process the dataset includes: receiving in a first trusted and isolated computing environment an encrypted dataset from a dataset-providing computational domain of an entity that is authorized to provide the dataset, wherein the encrypted dataset is only able to be decrypted in the first trusted and isolated computing environment using decryption keys available from the dataset-providing computational domain and a controlling trusted and isolated computing environment that generates the first trusted and isolated computing environment, wherein a trusted and isolated computing environment is a computing environment whose computer code is able to be attested by comparing a digest of the computing environment to a baseline digest of the computing environment that is available to third parties to thereby verify computing environment integrity while also being a computing environment in which only a specified maximum number of application processes and specified system processes implementing the computing environment are able to operate; and decrypting the encrypted dataset in the first trusted and isolated computing environment using the decryption keys such that the decrypted dataset cannot be accessed in an unencrypted form by any computational domain except for the computational domain of the entity that is authorized to provide the dataset.
In accordance with another aspect of the methods and systems described herein, one of the decryption keys is a symmetric key received in the first trusted and isolated computing environment in an encrypted form and generated by the dataset-providing computational domain.
In accordance with another aspect of the methods and systems described herein, a second of the decryption keys is received in the first trusted and isolated computing environment from the controlling computational domain, wherein the second decryption key is configured to decrypt the symmetric key.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.
Modern business enterprises rely on data and programmatic assets in their business operations. For many enterprises, production of data and computer programs is the main business. Thus, it is commonplace to hear expressions such as “data is the new oil.” Many enterprises acquire and share data and computer programs. Thus, data and computer programs may rightfully be treated as assets.
It is well known that machine learning, artificial intelligence, and pharmaceutical and medical/biological research and development require large datasets for the training and development of high-performing systems. Acquisition of such datasets is known to be a cumbersome but much-needed activity.
When assets are shared, several questions arise pertaining to ownership, usage, intellectual property, rights adhering to the sharing, etc.
Sharing of assets may be further complicated if a shared asset contains private or sensitive data, e.g., a shared dataset may contain Protected Health Information (commonly abbreviated as PHI) or Personally Identifiable Information (PII).
Furthermore, acquisition of data creates additional risks and costs for the data-acquiring enterprise. The acquired data must be transported from its storage area (typically a cloud system or a data lake, etc.), which is expensive. The communication links used for the transportation of data need to be made secure. The computing environment used for data processing needs to be secure against malicious attacks and intrusive computer programs. The processing of data and the ensuing output itself must preserve consumer data privacy according to regulations governing PII, such as GDPR (General Data Protection Regulation), HIPAA (Health Insurance Portability and Accountability Act), etc. The outputted data must be used or shared in a manner that preserves the regulations and privacy of consumer data. (For example, GDPR restricts the movement of certain kinds of datasets across jurisdictional boundaries; HIPAA imposes restrictions on the sharing of patient data, etc.)
Conventional approaches to the problems associated with data acquisition involve enacting legal contracts amongst the data providing and data processing entities. Many enterprises have instituted compliance departments to satisfy the legal contracts under which datasets are acquired.
In the discussions to follow, we use the terms data provider, (computer) program provider, and output receiver to denote entities that own or are otherwise legitimately authorized/entitled to provide datasets, own or are otherwise legitimately authorized/entitled to provide computer programs to process datasets, and entities that are legitimately authorized/entitled to receive the outputted results of the data processing activity, respectively. Often, data provider and program provider entities will be distinct.
In some embodiments, the output receiver and the data provider or the program provider may be the same entity. In some embodiments, it will be convenient to describe an entity as playing a role, e.g., when describing an entity that is both a program provider and an output receiver. The term role thus refers to the actions or operational activities of entities. We sometimes use the term algorithm provider as a synonym for the term program provider. We also observe that we use the term “dataset” to denote various types of data irrespective of its storage method, e.g., we use the term to refer to structured data stored in database systems, unstructured data stored in file systems, cloud storage, digital images, electrocardiogram (ECG) data, real-time data being served through some networked system such as Kafka queues, etc.
We often use the terms program and algorithm interchangeably, but we note that as used herein the terms algorithm and program denote the availability of source code. Thus, for example, when we refer to “encrypting/decrypting a program/algorithm”, we shall be referring to encrypting the source code of the program/algorithm.
The role of the dataset provider 212 is enabled by a computer system 210 that contains datasets 211. The role of the output receiver 222 is enabled by computer system 220 that is programmed to receive results 221. Details of the programming of computer systems 200, 210 and 220 will be provided later, but a partial and simple operational explanation may be arrived at as follows.
Computer system 210 may receive program 201 and use it to process datasets 211. It may then transmit the results of the processing to computer system 220. Other operational methods (e.g., in one such method, the data may travel and be provided to the compute asset) can also be envisaged by those of ordinary skill in the art.
The above simple operational explanation does not address the crucial detail of constraints that may be imposed by the data provider, program provider and/or output receiver entities. Examples of such constraints are provided as follows.
A program provider entity may require that its intellectual property in the form of the computer program be protected. Thus, no entity should be able to view, edit, copy, or modify the computer program. Only a pre-determined and identified dataset may be provided to the computer program for processing. The processing itself may be subject to constraints, e.g., the computer program may process the dataset for a pre-determined period, or have access to selected updated versions of the dataset, etc.
A data provider entity may require that its dataset not be copied, duplicated, edited, modified, or transmitted outside its domain. It may require that only pre-determined and identified computer programs may process its dataset. (Some commercial enterprises refer to computer programs that have been pre-determined and identified as “curated.”) It may further require that the outputted results may be provided only to designated entities.
The output receiver entity may require that the results it receives may not contain any personal or protected health information of consumers, etc. The data provider and program provider may both require that the output receiver may not share the provided results with any other entity.
Thus, the simple, operational method described above with reference to
The descriptions that follow are meant to provide illustrative embodiments that are able to implement the architecture shown in
We take the opportunity to express some general comments about the invention with respect to
If the output receiver entity is distinct from the data provider and/or the program provider, and since it never comes into possession of the dataset or the program upon which the results are based, can it trust the results 321?
Trusting the (results of the) execution of computer programs in remote—in the sense of being inaccessible—computing environments has important commercial consequences, some of which we highlight later by providing illustrative embodiments.
The term user, client, edge or endpoint device as used herein refers to a broad and general class of computers used by consumers including but not limited to smart phones, personal digital assistants, laptops, desktops, tablet computers, IoT (Internet of Things) devices such as smart thermostats and doorbells, digital (surveillance) cameras, etc. The list includes one or more devices associated with (using wireless or wired connections) user/endpoint devices, e.g., smart watches, fitness bracelets, consumer health monitoring devices, environment monitoring devices, home monitoring devices such as smart thermostats, smart light bulbs, smart locks, smart home appliances, etc.
Given the prevalent situation of frequent malicious attacks on computing machinery, there is concern that a computer program may be hijacked by malicious entities. Can a program's code be secured against attacks by unauthorized and malicious entities and hence be trusted?
One possibility is for an enterprise to develop a potential algorithm and put it up in a publicly accessible place where it may be analyzed, updated, edited and improved by the developer community. After some time during which this process has been used, the algorithm can be “expected” to be reasonably safe against intrusive attacks, i.e., it garners some trust from the user community. As one learns more from the experiences of the developers, one can continue to increase one's trust in the algorithm. However, complete trust in such an algorithm can never be reached for any number of reasons, e.g., malicious actors may simply be waiting for a more opportune time to strike.
It should be noted that Bitcoin, Ethereum and certain other cryptocurrencies, and some open-source enterprises use certain methods of gaining the community's trust by making their source code available on public sites. Any person may then download the software so displayed and, e.g., become a “miner,” i.e., a member of a group that makes processing decisions based on the consensus of a majority of the group.
U.S. patent application Ser. No. 17/094,118, which is incorporated by reference herein in its entirety, proposes a different method of gaining trust. As discussed therein, a computation is a term describing the execution of a computer program or algorithm on one or more datasets. (In contrast, an algorithm or dataset that is stored, e.g., on a storage medium such as a disk, does not constitute a computation.) The term process is used in the literature on operating systems to denote the state of a computation and we use the term—process—to mean the same herein. A computing environment is a term for a process created by software contained within the supervisory programs, e.g., the operating system of the computer (or cluster of computers), that is configured to represent and capture the state of computations, i.e., the execution of algorithms on data, and provide the resulting outputs to recipients as per its configured logic. The software logic that creates computing environments (a type of process) may utilize the services provided by certain hardware elements of the underlying computer (or cluster of computers).
As used herein, a computing cluster may refer to a single computer, a group of networked computers or computers that otherwise communicate and interact with one another, and/or a group of virtual machines. That is, a computing cluster refers to any combination and arrangement of computing entities.
U.S. patent application Ser. No. 17/094,118 creates computing environments which are guaranteed to be isolated and trusted. As explained below, an isolated computing environment is an environment that supports a fixed or maximum number of application processes and specified system processes. A trusted computing environment is an environment in which the digest (described below) of the code running in the environment has been verified against a baseline digest. (Such verifications based on matching digests, etc., may be operationalized using a Certificate Authority (CA).)
Cryptographic hash functions may be used to create computing environments that can be trusted. One way to achieve trust in a computing environment is to allow the code running in the environment to be verified using cryptographic hash functions/digests.
That is, a computing environment is created by the supervisory programs, which are invoked by commands in the boot logic of a computer at boot time and which then use hash functions, e.g., SHA-256 (specified by the U.S. National Institute of Standards and Technology), to take a digest of the created computing environment. This digest may then be provided to an escrow service to be used as a baseline for future comparisons.
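By way of illustration only, the digest-and-compare step might be sketched in Python as follows; this is a minimal sketch, and the function names and the assumption that the environment's code can be serialized to bytes are ours rather than part of the foregoing description.

    import hashlib

    def compute_digest(environment_image: bytes) -> str:
        # SHA-256 digest of the bytes that constitute the computing environment
        # (e.g., the code and configuration loaded at creation time).
        return hashlib.sha256(environment_image).hexdigest()

    def attest(environment_image: bytes, escrowed_baseline_digest: str) -> bool:
        # The environment is trusted only if its current digest matches the
        # baseline digest previously deposited with the escrow service.
        return compute_digest(environment_image) == escrowed_baseline_digest

    # Baseline taken at creation time; a verifier later repeats the comparison.
    baseline = compute_digest(b"environment code and configuration")
    assert attest(b"environment code and configuration", baseline)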
Note that the installation script is an application-level computer program. Any application program may request the supervisory programs to create a computing environment; the above method may then be used to verify whether the created environment can be trusted. The boot logic of the computer may also be configured, as described above, to request the supervisory programs to create a computing environment.
Whereas the above process can be used to trust a computing environment created on a computer, we may in certain cases require that the underlying computer be trusted as well. That is, can we trust that the computer was booted securely and that its state at any given time, as presented by the contents of its internal memory registers, is valid?
The attestation method may be further enhanced to read the various PCRs (Platform Configuration Registers), e.g., taken from a Trusted Platform Module (TPM) and take a digest of their contents. In practice, we may concatenate the digest obtained from the PCRs with that obtained from a computing environment (e.g., such as a Virtual Machine, VM) and use that as a baseline for ensuring trust in the boot software and the software running in the computing environment. In such cases, the attestation process which has been upgraded to include PCR attestation may be referred to as a measurement. Accordingly, in the examples presented below, all references to obtaining a digest of a computing environment are intended to refer to obtaining a measurement of the computing environment in alternative embodiments.
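A minimal sketch of such a combined measurement is given below; the function name is hypothetical, and reading the actual PCR contents would normally be done through platform facilities such as a TPM rather than from local variables.

    import hashlib

    def environment_measurement(pcr_values: list[bytes], environment_image: bytes) -> str:
        # Digest of the PCR contents concatenated with the digest of the
        # computing environment, hashed into a single value that serves as
        # the baseline "measurement" for later comparisons.
        pcr_digest = hashlib.sha256(b"".join(pcr_values)).digest()
        env_digest = hashlib.sha256(environment_image).digest()
        return hashlib.sha256(pcr_digest + env_digest).hexdigest()

    baseline_measurement = environment_measurement([b"pcr0", b"pcr7"], b"VM image bytes")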
The enhanced attestation method described above may be used in computer systems that are not provided or provisioned with a secure boot process or those that do not provide process level support for isolation.
Note that a successful measurement of a computer implies that the underlying supervisory program has been securely booted and that its state, and that of the computer as represented by data in the various PCR registers, is the same as the original state, which is assumed to be valid since the underlying computer(s) may be assumed to be free of intrusion at the time of manufacture. Different manufacturers provide facilities that can be utilized by the Attestation Module to access the PCR registers. For example, some manufacturers provide a hardware module called a TPM (Trusted Platform Module) that can be queried to obtain data from PCR registers.
As mentioned above, U.S. patent application Ser. No. 17/094,118 also creates computing environments which are guaranteed to be isolated in addition to being trusted.
The notion of isolation is useful to eliminate the possibility that an unknown and/or unauthorized process may be “snooping” while an algorithm is running in memory. That is, a concurrently running process may be “stealing” data or affecting the logic of the program running inside the computing environment. An isolated computing environment can prevent this situation from occurring by using memory elements in which only one or more authorized (system and application) processes may be concurrently executed.
The manner in which isolation is accomplished depends on the type of process that is involved. As a general matter there are two types of processes that may be considered: system and application processes. An isolated computing environment may thus be defined as any computing environment in which a specified maximum number of application processes and specified system processes implementing the computing environment are able to operate. System processes are allowed access to an isolated memory segment if they provide the necessary keys. For example, Intel Software Guard Extensions (SGX) technology uses hardware/firmware assistance to provide the necessary keys. Application processes are also allowed entry to an isolated memory segment based on keys controlled by a hardware/firmware/software element called the Access Control Module (ACM), described later.
Typically, the system processes needed to create a computing environment are known a priori to the supervisory program and can be configured to request, and be permitted, access to isolated memory segments. Only these specific system processes can then be allowed to run in an isolated memory segment. In the case of application processes, such knowledge may not be available a priori. In this case, developers may be allowed to specify the keys that an application process needs to gain entry to a memory segment.
Additionally, a maximum number of application processes may be specified that can be allowed concurrent access to an isolated memory segment.
Computing environments are created by code/logic available to supervisory programs of a computer (or cluster of computers). This code may control which specific system processes are allowed to run in an isolated memory segment. On the other hand, as previously mentioned, access control of application processes is maintained by Access Control Modules.
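For illustration only, the two admission rules (allowing only specified system processes, and at most a specified number of key-holding application processes) might be sketched as follows; the class and method names are hypothetical and are not part of the foregoing description.

    class AccessControlModule:
        # Hypothetical sketch of an ACM enforcing both isolation rules: only the
        # specified system processes may run, and at most max_app_processes
        # application processes holding the required key may run concurrently.
        def __init__(self, allowed_system_processes, max_app_processes, acm_key):
            self.allowed_system_processes = set(allowed_system_processes)
            self.max_app_processes = max_app_processes
            self.acm_key = acm_key
            self.active_app_processes = 0

        def admit_system_process(self, name: str) -> bool:
            # System processes are known a priori to the supervisory program.
            return name in self.allowed_system_processes

        def admit_application_process(self, presented_key: bytes) -> bool:
            # Application processes must present the ACM key and stay within
            # the configured maximum number of concurrent processes.
            if presented_key != self.acm_key:
                return False
            if self.active_app_processes >= self.max_app_processes:
                return False
            self.active_app_processes += 1
            return True

    acm = AccessControlModule({"env_supervisor"}, max_app_processes=1, acm_key=b"secret")
    assert acm.admit_system_process("env_supervisor")
    assert acm.admit_application_process(b"secret")
    assert not acm.admit_application_process(b"secret")   # limit of one reached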
It is important to highlight the difference between trusted and isolated computing environments. An isolated computing environment is an environment that supports a fixed or maximum number of application processes and specified system processes. A trusted computing environment is an environment in which the digest of the code running in the environment has been verified against a baseline digest.
As an example of the use of isolated memory as an enabling technology, consider the creation of a computing environment as discussed above. The computing environment needs to be configured to permit a maximum number of (application) processes for concurrent execution. To satisfy this requirement, SGX or SEV technologies can be used to enforce isolation. For example, in the Intel SGX technology, a hardware module holds cryptographic keys that are used to control access by system processes to the isolated memory. Any application process requesting access to the isolated memory is required to present the keys needed by the Access Control Module. In SEV and other such environments, the supervisory program locks down the isolated memory and allows only a fixed or maximum number of application processes to execute concurrently.
Consider a computer with an operating system that can support multiple virtual machines (VMs). (Such an operating system is known as a hypervisor or Virtual Machine Monitor (VMM).) The hypervisor allows one VM at a given instant to be resident in memory and have access to the processor(s) of the computer. Working as in conventional time sharing, VMs may be swapped in and out, thus achieving temporal isolation.
Therefore, to achieve an isolated environment, a hypervisor-like operating system may be used to temporally isolate the VMs and, further, allow only specific system processes and a known (or maximum) number of application processes to run in a given VM.
As previously mentioned, U.S. patent application Ser. No. 17/094,118 introduced the concept of Access Control Modules (ACMs), which allow application processes entry to an isolated memory segment based on keys that the ACMs control. ACMs are hardware/firmware/software components that use public/private cryptographic key technology to control access. An entity wishing to gain access to a computing environment must provide the needed keys. If it does not possess the keys, it would need to generate them to gain access, which would require it to solve the intractable problem underlying the encryption technology deployed by the ACM, i.e., a practical impossibility.
Access to certain regions of memory can also be controlled by software that encrypts the contents of memory that a CPU (Central Processing Unit) needs to load into its registers to execute, i.e., the so-called fetch-execute cycle. The CPU then needs to be provided the corresponding decryption key before it can execute the data/instructions it had fetched from memory. Such keys may then be stored in auxiliary hardware/firmware modules, e.g., Hardware Security Module (HSM). An HSM may then only allow authorized and authenticated entities to access the stored keys.
It is important to note that though a computing environment may be created by supervisory programs, e.g., operating system software, the latter may not have access to the computing environment. That is, mechanisms controlling access to a computing environment are independent of mechanisms that create said environments.
Thus, the contents of a computing environment may not be available to the supervisory or any other programs in the computing platform. An item may only be known to an entity that deposits it in the computing environment. A digest of an item may be made available outside the computing environment and it is known that digests are computationally irreversible.
Computing environments that have been prepared/created in the above manner can thus be trusted since they can be programmed to not reveal their contents to any party. Data and algorithms resident in such computing environments do not leak. In subsequent discussions, computing environments with this property are referred to as secure (computing) environments.
We now demonstrate methods by which secure computing environments may be used to effectuate remote executions of computer programs. We present the description in three phases. The three phases may collectively constitute a data pipeline, and in particular a secure data pipeline of the type shown in PCT/US22/23671 [Docket No. 12701/10], which is incorporated by reference herein in its entirety.
In phase 1, as shown in
We begin by creating a secure computing environment 661 on computing cluster 660 using the method of
Controller 662 is responsive to a user interface 650 associated with the algorithm provider that may be utilized by external programs to interact with it. That is, the user interface 650 is located in the algorithm provider's domain and not the controller's domain. Rather than detail the various commands available in user interface 650, we will describe the commands as they are used in the descriptions below.
Algorithm provider 601 indicates (using commands of user interface 650) to Controller 662 that it wishes to deposit algorithm 642. The user interface 650 employs a program to generate a symmetric key and provides the symmetric key to the algorithm provider who uses it to encrypt the algorithm 642. The Controller 662 requests Key Manager 663 to generate a first secret/public key pair (also known as decryption/encryption keys, respectively). The Key Manager 663 requisitions the underlying hardware to generate a private-public key pair. The public key component is provided to the algorithm provider 601 who uses it to encrypt the symmetric key used to encrypt the algorithm 642, and provides the encrypted symmetric key to the Controller 662. Controller 662, upon receipt, deposits the received information in Policy Manager 664.
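The key handling on the algorithm provider's side might be sketched as follows using Python's cryptography package; this is an illustrative sketch only, the variable names are ours, and the placeholder algorithm source stands in for algorithm 642.

    from cryptography.fernet import Fernet
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import rsa, padding

    # Controller 662 / Key Manager 663: generate the first secret/public key pair.
    controller_private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    controller_public_key = controller_private_key.public_key()

    # Algorithm provider 601: generate a symmetric key, encrypt the algorithm with
    # it, then encrypt (wrap) the symmetric key with the controller's public key.
    symmetric_key = Fernet.generate_key()
    algorithm_source = b"def predict(row): ..."           # placeholder for algorithm 642
    encrypted_algorithm = Fernet(symmetric_key).encrypt(algorithm_source)

    oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                        algorithm=hashes.SHA256(), label=None)
    encrypted_symmetric_key = controller_public_key.encrypt(symmetric_key, oaep)
    # The encrypted symmetric key is handed to the Controller, which records it in
    # the Policy Manager; the encrypted algorithm is later pulled by the secure
    # computing environment.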
Additionally, the algorithm provider 601 may use interface 650 to provide Controller 662 various policy statements that govern/control access to the algorithm 642. Various such policies are described in U.S. patent application Ser. No. 17/094,118. In the descriptions herein, we assume a policy that specifies that the operator 680 is not allowed access to the algorithm 642 or (as detailed below) the dataset, etc. Policy Manager 664 manages the policies provided to it by various entities.
Controller 662 may now invoke supervisory programs to create secure computing environment 641 (using method shown in
Controller 662 may now request and receive an attestation/measurement from secure environment 641 to verify that 641 is secure using the method of
Once verified, the secure computing environment may then request and receive the encrypted algorithm 642 from the algorithm provider 601. Controller 662 may provide secure computing environment 641 the encrypted symmetric key that was used to encrypt the algorithm 642. To use the symmetric key, secure computing environment 641 needs to first decrypt it. Secure computing environment 641 requests and receives from controller 662/Key Manager 663 the corresponding decryption/secret key. (This decryption key corresponds to the encryption key provided to the algorithm provider above.) Once decrypted, the symmetric key can be used to decrypt the algorithm 642 in the secure environment 641. The algorithm 642 is then encrypted again by asking the controller 662 to generate a second decryption/encryption key pair and using the new (i.e., second) encryption key to encrypt the algorithm 642. Controller 662 retains control of the corresponding second secret key. This second pair of keys is referred to as the ALG-key pair. (The encrypted algorithm will be used in phase 2; hence, the secret key will be needed in phase 2.) Secure computing environment 641 may then deposit the encrypted algorithm in storage 670.
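Continuing the sketch above, the steps inside secure computing environment 641 might look as follows; again this is illustrative only and, for brevity, a fresh symmetric key retained by the Controller stands in for the ALG-key pair described above.

    from cryptography.fernet import Fernet
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import rsa, padding

    oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                        algorithm=hashes.SHA256(), label=None)

    # Values repeated from the provider-side sketch (in practice they arrive from
    # the algorithm provider and the Controller rather than being created locally).
    controller_private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    symmetric_key = Fernet.generate_key()
    encrypted_algorithm = Fernet(symmetric_key).encrypt(b"def predict(row): ...")
    encrypted_symmetric_key = controller_private_key.public_key().encrypt(symmetric_key, oaep)

    # Inside secure computing environment 641:
    # 1. Unwrap the symmetric key with the secret key received from the Controller.
    unwrapped_key = controller_private_key.decrypt(encrypted_symmetric_key, oaep)
    # 2. Decrypt the algorithm inside the secure environment.
    algorithm_source = Fernet(unwrapped_key).decrypt(encrypted_algorithm)
    # 3. Re-encrypt the algorithm before depositing it in external storage 670
    #    (a fresh symmetric key stands in for the ALG-key pair here).
    alg_key = Fernet.generate_key()
    re_encrypted_algorithm = Fernet(alg_key).encrypt(algorithm_source)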
It will be convenient to refer to the arrangement created on computer cluster 640 as data plane 698.
We summarize the above steps as follows (cf.
As a parenthetical note, we use the term storage system in a generic sense. In practice, a file system, a data bucket, a database system, a data warehouse, a data lake, live data streams or data queues, etc., may be used to effectuate input and output of data.
We note that in the entire process outlined and detailed above, the operator never comes into possession of the secret keys generated and stored within Controller 662. The secret keys remain inside the secure environments 661 and 641. Thus, the operator 680 is unable to access algorithm 642.
The above concludes phase 1. Note that at the conclusion of phase 1, the encrypted algorithm 642 is stored in storage system 670 (
In phase 2, as shown in
Dataset provider 602 indicates, using commands of a user interface 652 associated with the dataset provider 602, to Controller 662 that it wishes to provide dataset 633. It also encrypts the dataset 633 using a symmetric key that is generated using a program associated with the user interface 652. Controller 662 requests Key Manager 663 to generate a secret/public key pair; the Key Manager uses the underlying hardware to obtain the requested public-private key pair. Controller 662 provides the public key component to dataset provider 602. The dataset provider encrypts its dataset using the symmetric key, encrypts the symmetric key using the provided public key, and provides the encrypted symmetric key to the Controller 662. Controller 662, upon receipt, deposits the received information in Policy Manager 664.
Additionally, the dataset provider 602 may use interface 652 to provide Controller 662 various policy statements that govern/control access to the dataset. Various such policies are described in U.S. patent application Ser. No. 17/094,118. In the descriptions herein, we assume a policy that specifies that the operator 680 is not allowed access to the algorithm or (as detailed below) the dataset, etc. Policy Manager 664 manages the policies provided to it by various entities.
Controller 662 may now invoke supervisory programs to create secure environment 631 (using method shown in
Controller 662 may now request and receive an attestation/measurement from secure environment 631 to verify that 631 is secure using the method of
This attestation/measurement, if successful, establishes that environment 631 is secure since its code base is the same as the baseline code (in escrow).
Once verified, the secure computing environment may request and receive the encrypted dataset 633 from the dataset provider 602. Controller 662 may provide secure computing environment 631 the encrypted symmetric key used to encrypt the dataset 633. To use the symmetric key, secure environment 631 needs to first decrypt it.
Secure computing environment 631 requests Controller 662 to provide the decryption key to decrypt the symmetric key, which the secure environment 631 uses to decrypt the dataset 633. The dataset 633 may then be encrypted again by asking controller 662 to generate a second decryption/encryption key pair. The second encryption key is provided to the secure environment 631, which uses the second encryption key to encrypt the dataset 633.
We will refer to the second public-private key pair as DATA-Keys.
Furthermore, secure computing environment 631 accesses the code of algorithm 642 (cf.
Once the dataset and the algorithm have been decrypted, algorithm 642 processes dataset 633 to produce an output result(s).
Optionally, dataset provider 602 may require the output result(s) to be encrypted using a symmetric key (which is generated using its associated user interface) that will be provided to the output receiver that is to receive the output result(s). This symmetric key is referred to as the OUTPUT key. In such a case, controller 662 provides a public key to the output receiver, which uses it to encrypt the OUTPUT key and provides the encrypted OUTPUT key to the secure computing environment 631. Controller 662 provides the corresponding private key to secure computing environment 631, which uses it to decrypt the OUTPUT key. In turn, the secure computing environment 631 uses the decrypted OUTPUT key to encrypt the output result(s), which are then stored in output storage system 6712.
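A sketch of this OUTPUT-key handling, in the same illustrative style and with the same caveats as the earlier sketches, might look as follows.

    from cryptography.fernet import Fernet
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import rsa, padding

    oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                        algorithm=hashes.SHA256(), label=None)

    # Controller 662 / Key Manager 663: key pair used to protect the OUTPUT key in transit.
    wrap_private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

    # Output receiver side: wrap the OUTPUT key (a symmetric key obtained from the
    # dataset provider) with the public key supplied by the Controller.
    output_key = Fernet.generate_key()
    wrapped_output_key = wrap_private_key.public_key().encrypt(output_key, oaep)

    # Inside secure computing environment 631: unwrap the OUTPUT key with the private
    # key received from the Controller, then encrypt the output result(s) before they
    # are deposited in output storage system 6712.
    unwrapped_output_key = wrap_private_key.decrypt(wrapped_output_key, oaep)
    encrypted_results = Fernet(unwrapped_output_key).encrypt(b"output result(s)")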
It will be convenient to refer to the arrangement created on computer cluster 630 as data plane 6981.
We summarize the above steps of phase 2 as follows (cf.
We note that, as in phase 1, the operator never comes into possession of the secret keys generated and stored within Controller 662. Furthermore, the program provider (601, cf.
The above concludes phase 2.
In phase 3, as shown in
Recall that the output receiver 603 possesses the symmetric key (OUTPUT-Key). It may use it to access the Output Storage System 6712 and decrypt the outputted result(s).
In the above discussion we have assumed that the dataset 633 can be loaded into secure environment 631 (
To accommodate and enable large datasets,
Note that database processor 622 is shown in
The advantage of running program 622 in a secure environment is that the symmetric key used to access and decrypt the outputted results is provided to the
Controller 662 by the output receiver in encrypted form (and is thus secure) and, furthermore, the symmetric key remains secure inside Controller 662 and is provided only to programs, e.g., 622, running in secure environments, e.g., 621. Additionally, the transport between secure environments is secure.
In a preferred embodiment, the database processor 622 runs in secure environments.
The above concludes phase 3.
The method proceeds as follows.
The technology of secure environments allows an encrypted algorithm to be provided to the dataset provider whose corresponding decryption key is only available within a secure environment operating in the domain of the dataset provider.
Additionally, the output of the ensuing computation by the algorithm on the dataset provided by the dataset provider is provided in encrypted form to the output receiver, and the corresponding decryption key is only available inside a secure environment operating in the output receiver's domain.
We have remarked earlier that the executions of computer programs in secure environments can be trusted since they can be verified. We now discuss the verification of program executions.
In some embodiments, every action carried out by a computing entity in the system described herein (
Each action listed above is a description of the actual commands/data used to implement the action. The corresponding log record will contain the detailed commands/data of the action. For example, action 1 above (“Create control plane”) will have the log data that shows the command/data generated by the operator, received by the control plane, commands used to create a new pipeline, etc.
Similarly, action 4 listed above (“Request and verify measurement”) will have a log record that shows the request to the environment for a measurement, the receipt of the requested measurement, and the matching of the received measurement with the baseline measurement (cf.
We can use the log record corresponding to action 4 to verify that the environment can be trusted as follows. The log record contains the digest received from the environment and the baseline digest. If we are provisioned with a suitable computer program, we may use the same to match the baseline digest with the digest provided by the environment. A successful match will indicate that the environment can be trusted.
As another example of verification, consider the log record corresponding to action 5 (“Request and receive keys”). The corresponding log data will contain the request for a public key component from the controller and the receipt of the requested key.
We can use a suitably provisioned computer program operating on the log record corresponding to action 5 to verify the policy controls as follows. Receipt of a public/encryption key indicates that the receiving entity can encrypt a program/data object. Receipt of a secret/decryption key indicates that the receiving entity can decrypt (and, thus, have access to) a program/data object.
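A sketch of how such log records might be checked programmatically is given below; the record structure and field names are hypothetical, since the actual log format is not specified above.

    import hashlib

    def verify_measurement_record(record: dict) -> bool:
        # Action 4 ("Request and verify measurement"): the record carries the
        # digest reported by the environment and the escrowed baseline digest;
        # a match indicates the environment can be trusted.
        return record["reported_digest"] == record["baseline_digest"]

    def infer_access_from_key_record(record: dict) -> str:
        # Action 5 ("Request and receive keys"): receipt of a decryption key
        # implies the entity could decrypt (and thus access) the object;
        # receipt of an encryption key implies it could only encrypt.
        return "can decrypt" if record["key_type"] == "decryption" else "can encrypt only"

    env_digest = hashlib.sha256(b"environment code").hexdigest()
    measurement_record = {"reported_digest": env_digest, "baseline_digest": env_digest}
    key_record = {"entity": "secure environment 641", "key_type": "decryption"}
    print(verify_measurement_record(measurement_record))    # True -> environment trusted
    print(infer_access_from_key_record(key_record))         # "can decrypt"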
We have thus shown that log records of
The various embodiments presented above illustrate various use cases, but the systems and techniques described herein are not limited to those use cases. As will be evident from the discussion below, the various embodiments also may be combined to produce several new service offerings or transform existing service offerings.
In the descriptions above we have considered policies that relate to the ownership of assets. We now describe several other types of policies.
One type of policy relates to enabling time-based access to an asset. For example, access to a dataset may be allowed for one week, or access may be allowed when a certain event occurs, e.g., on the second Wednesday of the month. Another type of policy relates to the number of accesses, e.g., access to a programmatic asset may be allowed for a certain number of uses or executions of the asset. Policies also may be specified that deny or allow access to assets based on the identity credentials of an account, an organization (e.g., all accounts belonging to an organization), etc. Policies also may relate to charges for use of assets and discounts thereof. For example, a policy may dictate that an asset may be used for a given time period for a certain charge.
Policies also may be specified that bind one or more assets to a given secure environment or a cluster of computers upon which one or more secure environments have been defined. (Computers in a cluster may be identified by keys generated by internal hardware elements of the computers. In such cases, provisioning a computer for an execution requires presentation of the “platform” key.) As an example, a policy may require that a particular data asset can only be processed on a certain computer or a computer that is provisioned (from a cloud provider) within a certain jurisdiction.
Yet another type of policy involves revocation of a previously authorized policy. This may be thought of as a “meta” policy in the sense that it relates to policies whereas previously discussed policies relate to assets.
The implementation of the policies described above may be based on controlling the provisioning of keys from key manager 663 (cf.
A central aspect of the policy control of the execution and provisioning of assets concerns charging for those assets.
In some embodiments, one set of charging mechanisms may be based on asset owners assigning prices to their assets or asset users offering pricing for a given asset. Such schemes may be considered as part of “out of band” negotiations between owners and users of assets.
We now present a charging mechanism that is intimately tied to the sharing fabric proposed in the present invention. In this mechanism, we observe that secure environments containing assets may need to encrypt and decrypt those assets. Encryption and decryption of assets is a central and important service provided by secure environments.
It is proposed that in some embodiments secure environments may be provisioned with computer programs that track the number of encryption and decryption actions performed to service a given sharing action. For example, an output receiver may use the catalog of shared assets (described above) to discover a programmatic asset and a dataset asset. It may then stitch the two assets together and launch a computation to be executed on a given data plane, with the output directed to itself.
It is to be observed that to carry out the above directive, the computation may need to decrypt the programmatic and data assets, encrypt the output, etc.
We may now define a charge for the computation as a function of the number of encrypt/decrypt operations. We observe that in this view, a secure computing environment may be thought of as a “meter” that tracks the number of encrypt/decrypt operations.
The fact that the encrypt/decrypt operations occur within a secure environment, and the program tracking such operations runs inside the secure environment, implies that the operations and their tracking can be trusted. That is, secure environments may be thought of as housing or providing trusted metering devices.
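An illustrative sketch of such a meter is given below; the class name, the per-operation rate, and the idea of computing the charge as a simple linear function of the operation count are assumptions made for the example.

    class EncryptDecryptMeter:
        # Hypothetical meter running inside a secure environment: every encrypt or
        # decrypt performed to service a sharing action is tabulated, and the charge
        # is computed as a function of the operation count.
        def __init__(self, rate_per_operation: float):
            self.rate = rate_per_operation
            self.operations = 0

        def record_encrypt(self) -> None:
            self.operations += 1

        def record_decrypt(self) -> None:
            self.operations += 1

        def charge(self) -> float:
            return self.operations * self.rate

    meter = EncryptDecryptMeter(rate_per_operation=0.01)
    meter.record_decrypt()   # decrypt the programmatic asset
    meter.record_decrypt()   # decrypt the data asset
    meter.record_encrypt()   # encrypt the output
    print(f"{meter.charge():.2f}")   # 0.03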
In implementations or arrangements wherein a multiplicity of secure environments is statically provisioned, e.g., in a cluster of computers, or dynamically provisioned, e.g., to effectuate a launched computation, the metering can be made more efficient as follows.
Further note that secure computing environments 621, 631 and 641, when first created, need to be authenticated by the secure supervisory or controlling computing environment 661 before keys can be provisioned to them (cf. step 5,
In some embodiments, primary computing environments may be configured to enable metering, whereas secondary computing environments track the encrypt/decrypt operations and send the corresponding tabulations to the primary environment.
In one particular implementation, consider
In
Additionally, a sharing server node and a (Federation) Agent node exist to receive sharing commands and actions and to maintain consistency of information between the various cloud accounts. The Sharing Server acts as a proxy for the sharing service.
Note that a data plane exists between the secure environments of each account and that each account receives control information (e.g., keys) from the Federation Service Provider.
It is to be noted further that each cloud account may exist on distinct cloud providers, e.g., Organization A's account may exist on cloud provider 1 and Organization B's account may exist on cloud provider 2, etc.
Whilst
To reify the above descriptions without loss of generality, we consider various illustrative embodiments.
As background to a first illustrative embodiment, consider a first organization that wishes to perform predictive analytics on its proprietary data by using an algorithm provided by a second party. The first party would like to preserve the confidentiality of its dataset, whilst the second party wishes to protect the Intellectual Property of its algorithm (embodied in a computer program).
Thus, the envisaged transaction between the first and second party could be effectuated by a policy directed computation using secure environments as shown in
Preferably, the algorithm provider and the dataset provider use out-of-band channels to reach an agreement whereby algorithm/program 842 is provided to secure computing environment 801, where it may be used to obtain predictive analytics from dataset 833. Once the processing is completed, the results may be provided to the output receiver, who may then use web clients 897 to query the results.
As an alternative or support to the out-of-band arrangements, the federation service provider may provide a catalog service and make it available to all member organizations, i.e., their cloud accounts. Such a service may list all available programs and datasets (along with their concomitant policies) that have been “published” by the members for purposes of sharing. The catalog may thus be searched (“browsed”) and programmatic and data assets may be discovered and “stitched” together, along with their constituent policies, into executable computations in secure environments.
We note that in the arrangement shown in
We also note that the output receiver does not have access to dataset 833 or program 842.
We further note that the provisioning of the output receiver with encrypted results is optional. In some embodiments, the results may be provisioned in cleartext.
There is a general practice in today's data markets of preparing healthcare datasets by de-identifying them of patient information and providing the resulting datasets on a commercial basis to interested enterprises. For example, pharmaceutical companies are often interested in de-identified datasets for research and development purposes.
We explain this illustrative embodiment with respect to
Organization A (
Next, organization A shares dataset 933 with the cloud account of organization B containing secure computing environment 901. Dataset 933 is registered in the sharing catalog and becomes visible to members of the sharing federation.
Organization B uses its existing IT infrastructure 999 to import/pull Dataset 933 from its secure computing environment 901. It may now process Dataset 933.
Additionally, since Dataset 933 has been provided by organization A to organization B, a sharing policy may have been imposed by organization A restricting the processing of Dataset 933 by a pre-determined and designated application, identified by its hash digest (as explained above). The designated application would then be required to run in secure computing environment 901. Audit logs may be generated showing that the only application accessing Dataset 933 was the application uniquely identified by its digest.
In an extension of the above use case, organization B may wish to provide an algorithm, say ABC, to organization A and request A to use ABC to construct a new version of Dataset 933, i.e., a customized dataset. To effectuate this variation of the use case, we may proceed as follows.
Organization B publishes algorithm ABC into its cloud account containing secure computing environment 901 and shares it with organization A. Algorithm ABC becomes available to organization A in A's secure environment 900.
Organization A pulls/pushes other assets into its IT infrastructure as needed to create a customized dataset, which it may now share with organization B. Note that “pull,” “push,” “import,” and “export” are well-known terms of art.
Also note that since algorithm ABC is provided to organization A through a secure sharing process which encrypts the contents of ABC, organization A is unable to access the contents of ABC.
In Machine Learning (ML) technology, inputted datasets, called training datasets, are used by computer programs to produce a new type of computer program called a model. The internal memory state of the model is said to represent the learnings obtained through the processing of the training sets. At the conclusion of the training, a model is said to be trained and this phase is called the training phase.
Once trained, a model may be used to process actual data. For example, we may use a model trained for detecting pulmonary hypertension from ECG data, or a model for deciding the approval or disapproval of a credit loan application from an applicant's loan request data. This phase is called the serving phase.
Additionally, a trained model, when deployed in the field, may gather and save data related to its performance (e.g., percentage of correct diagnoses) etc. Such data may then be used to retrain the model to improve its performance. This phase is called the retraining phase.
Model 1033 may now be used in its serving phase, wherein it may be accessed via interface 1098 by various clients 1097. That is, model 1033 is configured to provide outputs to queries inputted from clients 1097.
Model 1033 may be further configured to produce a second set of outputs, the so-called retraining sets, designated to be shared with organization A (See
That is, the retraining sets are encrypted with the decryption key being held by organization A.
We thus see that
In many instances in the above descriptions, a symmetric key may need to be communicated by one entity to another entity, e.g., the Controller may need to communicate a symmetric key to the output receiver. In some embodiments, symmetric keys may be encrypted using a public key for purposes of protecting the symmetric key during communication. (This is in addition to using a secure communication channel.) The following method explains such use cases.
Similarly, there may be a need to communicate a public key securely between two entities and we may use a previously known symmetric key (e.g., using an out of band communication) to encrypt the public key that is to be communicated.
Computing With Data Containing PII and Other Private Information
The above descriptions have shown how a data pipeline may be constructed using secure environments so that the programmatic and data assets used in the data pipeline are protected. We now describe embodiments in which the data assets used in pipelines contain PII (Personally Identifiable Information) and other private and confidential data.
In stage 1220, the raw data is transformed (using computer programs) into anonymous data. The term “anonymization” refers to processes that remove PII, private and confidential data. In certain cases, PII and other data may be converted into or replaced by other data to preserve privacy, e.g., social security numbers may be replaced by “asterisks.” For example, the Health Insurance Portability and Accountability Act (HIPAA) specifies 18 types of data elements that need to be replaced or removed in a dataset for the latter to be HIPAA compliant.
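By way of a simplified illustration, such an anonymization pass might be sketched as follows; this toy example removes a direct identifier field and masks social security numbers only, whereas a HIPAA-compliant transform would have to handle all 18 identifier types.

    import re

    SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

    def anonymize_record(record: dict) -> dict:
        # Drop a direct identifier and replace SSNs with asterisks, as described above.
        cleaned = {k: v for k, v in record.items() if k != "patient_name"}
        for key, value in cleaned.items():
            if isinstance(value, str):
                cleaned[key] = SSN_PATTERN.sub("***-**-****", value)
        return cleaned

    raw = {"patient_name": "Jane Doe", "notes": "SSN 123-45-6789 on file", "age": 52}
    print(anonymize_record(raw))   # {'notes': 'SSN ***-**-**** on file', 'age': 52}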
In stage 1230 (
Since datasets are generally very large, verification procedures typically use statistical probes to gain a degree of “privacy compliance.”
In stage 1240, once a dataset has satisfied some a priori (given) statistical notion of compliance, it may be deemed worthy of storage and dissemination to a third party for data processing.
Stages 1210-1240 are generally performed by a single entity, usually referred to as a data owner or data provider 1201.
An anonymized dataset (1240) may now be provided to an entity (called Data User or Data Processor 1202), who may process the data in step 1250 and obtain results 1260. Since 1240 is assumed to be anonymized, i.e., free of PII and other types of private data, the results 1260 are generally assumed to not violate any privacy regulations.
Whereas
Generally, anonymizing datasets is a computationally difficult process, and so-called anonymized datasets may still contain privacy-violating data elements. For example, consider a dataset of medical patient information containing physicians' notes. It may be computationally difficult to locate and remove all patient-identifying information since the physicians' notes may be difficult to parse or understand by purely machine-based methods. Hence, in many cases, statistical adherence to privacy preservation is considered acceptable. Many healthcare dataset providers, for example, use “expert determination” for statistical verification of anonymized datasets. (Recall that we use the terms “anonymized” and “de-identified” datasets interchangeably.)
More importantly for our purposes, an anonymized dataset may constrain data processing in the sense that the anonymized dataset may fail to return certain results (that are derivable from the original dataset) or may return erroneous results.
Consider by way of example the dataset shown in
Similarly,
Anonymized version of dataset
Simply put, K-anonymity provides the assurance that any individual's record is indistinguishable from those of at least K−1 other individuals, so that no query on the quasi-identifiers can narrow a match down to a group of fewer than K individuals.
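By way of illustration only, checking the K-anonymity property over a chosen set of quasi-identifier columns may be sketched as follows; the column names and the value of K are illustrative assumptions.

```python
# Minimal sketch: a dataset is K-anonymous with respect to the chosen
# quasi-identifiers if every combination of their values occurs in at least
# K records. Column names and K are illustrative assumptions.
from collections import Counter

def is_k_anonymous(records, quasi_identifiers=("zip", "age", "gender"), k=5):
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in counts.values())
```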
While the K-anonymity property may be useful for protecting privacy of individual data contained in a dataset, as shown in
Consider the data pipeline in
A data owner 1601 assembles raw data into a storage system 1610. Data owner 1601 receives a data processing algorithm from data user 1602. In stage 1620, the algorithm so received is used to process the raw data. In stage 1630 the processed data is anonymized and in stage 1640 the anonymized data is verified. Finally, in stage 1650 the verified data is provided to the data user 1602.
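By way of illustration only, the reordered pipeline of stages 1610 through 1650 may be sketched as follows; the helper functions stand in for the stages described above and are illustrative assumptions.

```python
# Minimal sketch of the reordered pipeline: process the raw data first, then
# anonymize, verify, and release only the results. Helpers are placeholders.
def run_pipeline(raw_data, algorithm, anonymize, verify):
    results = algorithm(raw_data)      # stage 1620: process raw data with the data user's algorithm
    anonymized = anonymize(results)    # stage 1630: anonymize the results
    if not verify(anonymized):         # stage 1640: verify privacy compliance
        raise ValueError("anonymized results failed privacy verification")
    return anonymized                  # stage 1650: provide verified results to the data user
```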
Put simply, the dataset is first processed and the results (output) of the processing stage are anonymized as opposed to
In a variation of the process shown in
In
In summary, we propose a method for computing with data that contains PII and other private information using a data pipeline whose various stages use secure environment technology. Rather than following the conventional approach wherein the raw data is first anonymized, we propose that the raw data first be processed to produce results, which in turn are anonymized using a set of anonymizing policies (or data transformation rules). The anonymized results may then be output from the data pipeline.
The data transformation rules may be provided (“shared”) by a third party, e.g., the data owner or the algorithm provider, or the data user.
The algorithm to process the data may also be provided by a third party, e.g., the algorithm provider or the data user.
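By way of illustration only, such externally provided data transformation rules may be sketched as a declarative mapping from field names to actions; the rule names, fields, and generalization scheme shown are illustrative assumptions.

```python
# Minimal sketch of externally supplied transformation (anonymization) rules,
# which could be provided by the data owner, algorithm provider, or data user.
TRANSFORMATION_RULES = {
    "name": "drop",          # remove the field entirely
    "ssn": "mask",           # replace the value with asterisks
    "age": "generalize",     # coarsen to a range, e.g., 37 -> "30-39"
}

def apply_rules(record: dict, rules: dict = TRANSFORMATION_RULES) -> dict:
    out = {}
    for field, value in record.items():
        action = rules.get(field, "keep")
        if action == "drop":
            continue
        if action == "mask":
            out[field] = "*" * len(str(value))
        elif action == "generalize" and isinstance(value, int):
            low = (value // 10) * 10
            out[field] = f"{low}-{low + 9}"
        else:
            out[field] = value
    return out
```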
We now show two alternative embodiments to the methods shown in
Although not expressly shown, each domain shown in
In
The method of
In another alternative embodiment shown in
The method of
As discussed above, aspects of the subject matter described herein may be described in the general context of computer-executable instructions, such as computer programs, being executed by a computer or a cluster of computers. Generally, computer programs include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. Aspects of the subject matter described herein may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Also, it is noted that some embodiments have been described as a process which is depicted as a flow diagram or block diagram. Although each may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be rearranged. A process may have additional steps not included in the figure.
The claimed subject matter may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computer to implement the disclosed subject matter. For instance, the claimed subject matter may be implemented as a computer-readable storage medium embedded with a computer executable program, which encompasses a computer program accessible from any computer-readable storage device or storage media. For example, computer readable storage media can include but are not limited to magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips . . . ), optical disks (e.g., compact disk (CD), digital versatile disk (DVD) . . . ), smart cards, and flash memory devices (e.g., card, stick, key drive . . . ).
However, computer readable storage media do not include transitory forms of storage such as propagating signals, for example. Of course, those skilled in the art will recognize many modifications may be made to this configuration without departing from the scope or spirit of the claimed subject matter.
As used herein, the terms “software,” “computer programs,” “programs,” “computer code” and the like refer to a set of program instructions running on an arithmetical processing device such as a microprocessor or DSP chip, or to a set of logic operations implemented in circuitry such as a field-programmable gate array (FPGA) or in a semicustom or custom VLSI integrated circuit. That is, all such references to “software,” “computer programs,” “programs,” “computer code,” as well as references to various “engines” and the like, may be implemented in any form of logic embodied in hardware, a combination of hardware and software, software, or software in execution. Furthermore, logic embodied, for instance, exclusively in hardware may also be arranged in some embodiments to function as its own trusted execution environment.
Moreover, as used in this application, the terms “component,” “module,” “engine,” “system,” “apparatus,” “interface,” or the like are generally intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a controller and the controller can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.
The foregoing described embodiments depict different components contained within, or connected with, different other components. It is to be understood that such depicted architectures are merely exemplary, and that in fact many other architectures can be implemented which achieve the same functionality. In a conceptual sense, any arrangement of components to achieve the same functionality is effectively “associated” such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality can be seen as “associated with” each other such that the desired functionality is achieved, irrespective of architectures or intermediary components. Likewise, any two components so associated can also be viewed as being “operably connected”, or “operably coupled”, to each other to achieve the desired functionality.
While various embodiments have been described above, it should be understood that they have been presented by way of example, and not limitation. It will be apparent to persons skilled in the relevant art(s) that various changes in form and detail can be made therein without departing from the spirit and scope. In fact, after reading the above description, it will be apparent to one skilled in the relevant art(s) how to implement alternative embodiments. Thus, the present embodiments should not be limited by any of the above described exemplary embodiments.
This application claims the benefit of U.S. Provisional Application Ser. No. 63/440,165, filed Jan. 20, 2023, and is a continuation-in-part of U.S. Ser. No. 17/939,314, filed Sep. 7, 2022, which claims the benefit of U.S. Provisional Application Ser. No. 63/241,239, filed Sep. 7, 2021. The contents of the applications listed above are incorporated herein by reference.
Provisional Applications:
Number | Date | Country
---|---|---
63/440,165 | Jan 2023 | US
63/241,239 | Sep 2021 | US
Parent Case Data:
Relation | Number | Date | Country
---|---|---|---
Parent | 17/939,314 | Sep 2022 | US
Child | 18/415,843 | | US