An Appendix hereto includes the following computer program listing which is incorporated herein by reference: “LEID0046US_BlindFLCodeAppendix.txt” created on Jun. 10, 2024, 63.1 KB.
The technical field is generally related to protecting client data privacy in a federated learning system and more specifically to combining client model segmentation (CMS) and fully homomorphic encryption (FHE) to achieve this goal.
Internet of Things (IoT) devices, sensor systems, and other edge computing devices are lightweight, low-latency platforms that can collect and respond to data quickly. Products that rely upon data collected by these systems have the potential to be enhanced by artificial intelligence (AI). AI models allow computers to process, analyze, and respond to data by detecting patterns and making predictions in a way that mirrors human responses. To develop predictive ability, AI model construction conventionally relies upon the collection of data in a single location. This presents privacy concerns because the centralization of data creates the potential for misuse of sensitive information such as personally identifiable information (PII).
Federated learning (FL) has been popularized as a way to collaboratively train a shared AI model while keeping training data at the edge, thus separating the ability to train AI from the need for a centralized data set. Instead of gathering all data in a centralized location, FL typically relies upon edge nodes passing their locally updated model parameters to a central server. The central server, which coordinates this process, then aggregates the parameters from each local model to update its global model. Finally, the updated global model parameters are sent back to the edge nodes to be further refined. By following this procedure, FL allows edge-to-cloud models to be trained without sharing raw data. Today, AI models are increasingly deployed and updated on edge devices using FL architectures for services like text completion, self-driving cars, healthcare services, and other domains where one or more of the following characteristics is present: data is collected at the edge from user devices; there is limited bandwidth to transmit user data to the cloud; predictions at the edge are made using a local model and locally collected data; and predictions are specific to given users (i.e., they may differ from the predictions made for other users).
Standard FL architecture also offers privacy benefits, since user data does not need to leave the edge device and only a local model, trained with the user's data, is passed from client to server. However, sharing models rather than data does not guarantee privacy: shared models can still leak private information, in particular through model inversion attacks. During a model inversion attack, a malicious actor attempts to reconstruct the model's original training set using only access to the model. Several model inversion attacks, particularly those using generative adversarial networks (GANs), have been shown to be quite effective at reconstructing training data from AI models.
To protect against model inversion attacks (and other server-side privacy risks), several secure aggregation procedures have emerged. These proposed procedures counter data leakage through various privacy-preserving mechanisms, including cryptographic, perturbative, and anonymization techniques, all of which belong to the larger class of privacy-preserving federated learning (PPFL) protocols. These protocols, however, have typically come at a cost to model accuracy, particularly in approaches that use differential privacy, or to system performance, particularly in approaches that use fully homomorphic encryption (FHE).
The seminal work describing PPFL implementations in PyTorch was written by Ryffel, et al., A generic framework for privacy preserving deep learning, CoRR abs/1811.04017 (2018), arXiv: 1811.04017. As FL implementations grew in popularity, so did the number of attacks proposed against FL edge-to-cloud systems. Inference attacks resulting in information leakage were first proposed by Hitaj, et al., Deep Models Under the GAN: Information Leakage from Collaborative Deep Learning, In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (Dallas, Texas, USA) (CCS ′17). Further notable works include the gradient attacks described in Wang, et al., Towards Eliminating Hard Label Constraints in Gradient Inversion Attacks, arXiv: 2402.03124v2 [cs.CR] 15 Apr. 2024, and He et al., Model inversion attacks against collaborative inference, In Proceedings of the 35th Annual Computer Security Applications Conference (San Juan, Puerto Rico, USA) (ACSAC ′19). A full survey of gradient attacks is provided in Huang, et al., Evaluating Gradient Inversion Attacks and Defenses in Federated Learning, arXiv: 2112.00059v1 [cs.CR] 30 Nov. 2021.
A variety of countermeasures to defend against attacks on FL systems has also emerged. In 2019, Zhu, et al. proposed gradient pruning as a method to obscure model gradients without any changes in training. See Zhu, et al., Deep Leakage from Gradients, arXiv: 1906.08935v2 [cs.LG] 19 Dec. 2019. In 2020, Huang, et al. proposed InstaHide, “[encrypting] each training image with a ‘one-time secret key.’” See Huang et al., InstaHide: Instance-hiding Schemes for Private Distributed Learning, arXiv: 2010.02772v2 [cs.CR] 24 Feb. 2021. And Hu, et al. worked on decentralized FL with a segmented gossip approach, as described in Decentralized Federated Learning: A Segmented Gossip Approach, arXiv: 1908.07782v1 [cs.LG] 21 Aug. 2019.
In 2018, Phong, et al. used homomorphic encryption to protect deep learning model gradients as described in Privacy-Preserving Deep Learning via Additively Homomorphic Encryption, IEEE Transactions on Information Forensics and Security 13, 5 (2018), 1333-1345. In 2021, Yin, et al. produced a review of emerging PPFL techniques and noted the decrease in accuracy for several approaches. See Yin, et al., A Comprehensive Survey of Privacy-preserving Federated Learning: A Taxonomy, Review, and Future Directions, ACM Comput. Surv. 54, 6, Article 131 (July 2021), 36 pages. In 2023, Rahulamathavan, et al. worked on FheFL, a PPFL approach applying FHE directly to FL model gradients, as described in FheFL: Fully Homomorphic Encryption Friendly Privacy-Preserving Federated Learning with Byzantine Users, arXiv: 2306.05112v2 [cs.AI] 26 Jun. 2023. Rahulamathavan, et al. noted the increased bandwidth requirements that the use of FHE introduces. And in 2023, Jin, et al. proposed FedML-HE, an approach that selectively encrypts only the most privacy-sensitive parameters within the model, as described in FedML-HE: An Efficient Homomorphic-Encryption-Based Privacy-Preserving Federated Learning System, arXiv: 2303.10837 [cs.LG] Oct. 30, 2023.
Developing privacy enhancements for FL is a continuing research thread. Elkordy, et al. produced a paper in 2022 outlining the theoretical bounds for information leakage and its relationship with the number of clients participating. See Elkordy, et al., How Much Privacy Does Federated Learning with Secure Aggregation Guarantee?, arXiv: 2208.02304v1 [cs.LG] 3 Aug. 2022. And in 2022, Sébert, et al. explored combining homomorphic encryption with differential privacy for protecting FL training data in Combining FHE and DP in Federated Learning, arXiv: 2205.04330v2 [cs.CR] 31 May 2022. But these privacy solutions all require performance trade-offs.
The following patents and published applications also describe various other methods that have been implemented in the prior art to secure data during the federated learning (“FL”) process. U.S. Patent Pub. No. US20210143987 describes key distribution in an FL setting; U.S. Patent Pub. No. US20220374544 describes applying SMPC (secure multi-party computation) to an FL system; U.S. Pat. No. 11,188,791 describes creating an anonymized training set on the client before even training the client model; U.S. Patent Pub. No. US20210166157 describes privatizing the clients' model updates (because the server can perform a diff between the previous global model and a given current client model); and U.S. Pat. No. 11,139,961 describes the use of homomorphic encryption (within the context of a single system). Each of these methods requires a performance trade-off.
Accordingly, there is a need in the art for an architecture and process which preserves data privacy throughout the process without trading off performance.
In a first embodiment, a process for securing individual client data during federated learning of a global model, includes: during a first global model training run:
In a second embodiment, a process for securing individual client data during federated learning of a global model, includes: determining, by a server, initial parameters W for an untrained global model; selecting, by the server, at least two clients, ci, and providing the untrained global model thereto, wherein each of the at least two clients ci initializes its individual model, mi, with W and trains mi over data Di to produce its own new parameters, wi, for the global model, wherein each mi includes at least one parameter matrix M; initiating, by the server, generation of a public-private key pair by a key distributor, wherein the key distributor provides the public key to the at least two clients, ci; determining, by the server, a number Ni of parameter matrices M of each individual model, mi, that each client ci should provide to the server;
In a third embodiment, a non-transitory computer readable medium stores instructions for securing individual client data during federated learning of a global model, the instructions including: determining, by a server, initial parameters W for an untrained global model; selecting, by the server, at least two clients, ci, and providing the untrained global model thereto, wherein each of the at least two clients ci initializes its individual model, mi, with W and trains mi over data Di to produce its own new parameters, wi, for the global model, wherein each mi includes at least one parameter matrix M; initiating, by the server, generation of a public-private key pair by a key distributor, wherein the key distributor provides the public key to the at least two clients, ci; determining, by the server, a number Ni of parameter matrices M of each individual model, mi, that each client ci should provide to the server; receiving, by the server, Ni parameter matrices from clients ci, wherein the Ni parameter matrices M are homomorphically encrypted by the at least two clients, ci, using the public key; aggregating, by the server, the encrypted Ni parameter matrices M to generate an updated global model; notifying, by the server, the key distributor to provide the private key from the public-private key pair to clients, ci; and pushing, by the server, the updated global model to the clients, ci.
The embodiments herein will be better understood from the following detailed description with reference to the drawings, in which:
Initially,
In the typical training scheme, as displayed in
Wnew is then sent to all clients, and this concludes the first round of FL. Successive rounds may be conducted to produce improved models. The primary privacy-preserving feature of FL is that each client gets the benefit of a model trained on every client's data without having to reveal its raw data either to any other client or to the central server.
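For illustration only, the following is a minimal Python sketch of the weighted averaging conventionally used to compute Wnew in standard FL (a FedAvg-style average). The function name and the assumption that each client's update is weighted by its training example count are illustrative and are not a reproduction of the formula referenced above.

```python
import numpy as np

def fedavg_aggregate(client_weights, client_example_counts):
    """Standard FL aggregation sketch: example-count-weighted average of client models.

    client_weights: list (one entry per client) of lists of numpy parameter matrices.
    client_example_counts: number of training examples each client used.
    """
    total = sum(client_example_counts)
    num_matrices = len(client_weights[0])
    w_new = []
    for j in range(num_matrices):
        acc = np.zeros_like(client_weights[0][j], dtype=np.float64)
        for weights, count in zip(client_weights, client_example_counts):
            acc += (count / total) * weights[j]  # weight each client by its share of the data
        w_new.append(acc)
    return w_new
```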
There is a flaw in the privacy-preserving protections offered by standard FL. The privacy of FL derives from the protocol's ability to prevent client data from leaving the edge device. However, sending full, plaintext models to the server poses a serious risk: the server could reverse engineer the model to potentially reveal sensitive characteristics of the data, such as PII, used to train client models. This risk is well-researched, as described in Lyu, et al., Threats to Federated Learning: A Survey, arXiv: 2003.02133v1 [cs.CR] 4 Mar. 2020. Thus, to maximize the security of an FL system, this risk of server-side model inversion attacks should be addressed. Similarly, consider the case in which a semi-honest adversary A has white-box access to either the server or a client within the federation. A will respect the federated learning protocol of the standard FL process, but A will also try to learn from models to infer the data client ci might contain. Accordingly, a privacy threat can emerge in two different ways: (1) A is able to access the server and obtain the global model, or (2) A is able to access a client and obtain the global model upon the completion of a round of FL. In either case, with access to a global model, a model inversion attack could be performed.
While, theoretically, the application of homomorphic encryption alone would solve the problem described herein, it is not practical. Today's implementations of FHE are simply too time- and space-intensive for our purposes. Homomorphically encrypting large weight matrices takes a considerable amount of time, and the result of that encryption requires many gigabytes of space. A description of FHE can be found in Park et al., Privacy-Preserving Federated Learning Using Homomorphic Encryption, Appl. Sci. 2022, 12, 734.
Client model segmentation (CMS), on the other hand, does not fully protect against this problematic reverse engineering attack. Over time, the server may receive enough segments from a given client to piece together an approximation of that client's full model. A description of a segmented FL approach can be found on-line in Hu et al., Decentralized Federated Learning: A Segmented Gossip Approach, arXiv: 1908.07782v1 [cs.LG] 21 Aug. 2019.
The key differentiator in the embodiments herein, then, is our use of CMS to share only small portions of the model while also maintaining the strict security of FHE over the complete segment transferred from client to server. This lowers bandwidth requirements substantially over vanilla combinations of FL with FHE while still providing complete encryption and maintaining model accuracy.
The embodiments described herein address both threat scenarios. We guarantee that at no point is a complete model from a given client shared with or seen by the central server or any other client within the federation. We also ensure that the resultant model sent to clients cannot be attacked to reconstruct other clients' datasets.
Our solution to the central server trust/security problem identified above incorporates two methods into the federated learning (FL) process: (1) client model segmentation (CMS), in which, instead of each client sending its full model, the server requests only certain parameter matrices from each client model; and (2) fully homomorphic encryption (FHE), in which client model segments are homomorphically encrypted before being sent to the server. FHE enables computations to be run on encrypted data, so the server is still able to aggregate the received segments. FHE solves the server-side problem described above, and CMS reduces the high time and space requirements of FHE. We refer to our process as BlindFL.
As with prior art standard FL systems, the embodiments described herein use client data D1, D2, . . . , Dn on each client C1, C2, . . . , Cn to generate an individual client model M1, M2, . . . , Mn. The central server 5 then requests a subset of each client's model layers or a portion of a model layer(s), i.e., client model segments, e.g., Seg. 1B is Seg. B of client C1. These client model segments are then subject to homomorphic encryption using a public key K1 received from a key distributor 10 to encrypt each model layer or portion of a model layer (segment) that the clients are sending to the server 5. The encrypted segments are pulled into the centralized server 5 and aggregated, all while maintaining the encryption of the weights, i.e., FHE enables computations to be run on encrypted data, so the server is still able to perform a weighted average of the received segments. The updated, encrypted global model GM is then returned to the clients for decryption using the private key K2 received from the key distributor 10. The individual segments from each client are thus never seen by the centralized server 5. A more detailed description is provided below.
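For orientation, the round just described can be outlined in the following illustrative sketch. Every name in it (generate_key_pair, build_segment_requests, and so on) is a hypothetical placeholder, not a function from the code appendix.

```python
def blindfl_round(server, clients, key_distributor):
    # 1. The key distributor generates a fresh FHE key pair and shares the public key K1.
    public_key = key_distributor.generate_key_pair()
    # 2. The server decides which parameter matrices (segments) to request from each client.
    requests = server.build_segment_requests(num_clients=len(clients))
    # 3. Each client trains locally, then encrypts only the requested segments with K1.
    encrypted_segments = [
        client.train_and_encrypt_segments(requests[i], public_key)
        for i, client in enumerate(clients)
    ]
    # 4. The server aggregates the encrypted segments without ever decrypting them.
    encrypted_global_model = server.aggregate_encrypted(encrypted_segments)
    # 5. The key distributor releases the private key K2 so the clients can decrypt.
    private_key = key_distributor.release_private_key()
    for client in clients:
        client.install_global_model(encrypted_global_model, private_key)
```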
Suppose that there is 1 server and C>2 clients, each of which has 1 model. Suppose that each model is a deep neural network (DNN) with M>0 parameter matrices. The server selects c≤C clients. The server then generates c random sequences of M booleans (each represented as a 0 or 1), which we represent as request matrix R of size c×M. The server generates R such that the following property holds true:
That is, for every j = 1, 2, . . . , M, the column sum R1,j + R2,j + . . . + Rc,j equals p, where p is the number of client parameter matrices that must be gathered for each server-side global parameter matrix (i.e., p dictates how much of each client model is sent back to the server). The value p≥1 is configured before BlindFL begins. Further, since each element of R is either 0 or 1, it is clear that p≤c.
To generate R such that the above property holds true, the server calculates the number of matrices each client should send back, defined by N below:
The server then generates the c sequences of M booleans. For the first selected client, the server generates a sequence, R1, of M booleans in which N random booleans are set to 1. For each subsequent selected client through the last selected client, the server initializes a sequence, Ri, of M booleans all set to 0. It then calculates the sum of all previously generated Ri sequences in the form of an M-length summation sequence, where each value in the sequence denotes how many client parameter matrices are currently requested for that given global model parameter matrix. Then, N times, the minimum value in that summation sequence is found, and a random index having that minimum value is selected. A “true” boolean (a 1) is inserted at that index into Ri. As 1's are inserted, the summation sequence is updated. The generation of these matrix requests is formalized in Algorithm 1.
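A minimal Python sketch of this request-matrix generation (Algorithm 1) follows. It assumes N = p·M/c (with p, M, and c chosen so that this divides evenly) and that an index already selected for a given client is skipped so each row contains exactly N ones; both points, and the function name, are assumptions made for the sketch rather than details taken from the appendix listing.

```python
import random

def generate_request_matrix(c, M, p):
    """Generate the c x M boolean request matrix R described above (illustrative sketch)."""
    assert (p * M) % c == 0, "sketch assumes p * M divides evenly among the c clients"
    N = (p * M) // c  # number of parameter matrices requested from each client (assumed)

    R = []
    column_sums = [0] * M  # how many client matrices are currently requested per global matrix
    for i in range(c):
        row = [0] * M
        if i == 0:
            # First client: set N random booleans to 1.
            for j in random.sample(range(M), N):
                row[j] = 1
                column_sums[j] += 1
        else:
            # Subsequent clients: N times, pick a random index among the columns with the
            # minimum request count (skipping indices already chosen for this client).
            for _ in range(N):
                available = [j for j in range(M) if row[j] == 0]
                min_val = min(column_sums[j] for j in available)
                j = random.choice([j for j in available if column_sums[j] == min_val])
                row[j] = 1
                column_sums[j] += 1
        R.append(row)
    return R
```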
The server then sends to each client the row of the matrix that corresponds to it (i.e., R1 is sent to client 1, R2 to client 2, and so on through Rc). For each i = 1, 2, . . . , c and each j = 1, 2, . . . , M for which Ri,j = 1, client i will send the server its jth parameter matrix, wi,j, as well as the number of examples that were used to train its model, ti,j. For each j for which Ri,j = 0, client i will set and send both wi,j and ti,j as 0. Once the server has received all requested w parameter matrices and t training example counts, it creates the global model using the following formula to calculate each aggregated parameter matrix Wj:
All parameter matrices Wj are then sent from the server to all C clients, thus placing the newly updated global model on each client. This method was implemented using the free and open-source “Flower: A Friendly Federated Learning Research Framework” as described in the on-line publication at arXiv: 2007.14390v5 [cs.LG] Mar. 5, 2022.
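The following is a plaintext sketch of that server-side aggregation step. It assumes the weighted-average form Wj = Σi ti,j·wi,j / Σi ti,j over the responding clients, with clients that were not asked for matrix j contributing zeros; this reading of the aggregation formula, and the names used, are assumptions made for the sketch.

```python
import numpy as np

def aggregate_segments(w, t):
    """Aggregate requested client segments into global parameter matrices (plaintext sketch).

    w[i][j]: client i's j-th parameter matrix (a numpy array, or 0 if it was not requested).
    t[i][j]: the example count client i reported for matrix j (0 if it was not requested).
    """
    c = len(w)     # number of selected clients
    M = len(w[0])  # number of parameter matrices per model
    W = []
    for j in range(M):
        # Assumed form: W_j = sum_i(t[i][j] * w[i][j]) / sum_i(t[i][j]).
        numerator = sum(t[i][j] * np.asarray(w[i][j], dtype=np.float64) for i in range(c))
        denominator = sum(t[i][j] for i in range(c))
        W.append(numerator / denominator)
    return W
```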
BlindFL leverages asymmetric FHE, which requires two keys: a public key and a private key. The public key contains the information required to perform encryptions and encodings. As a result, the public key can be given to any system performing mathematical operations on the FHE values. However, the public key lacks the information to decrypt any encrypted values. For any party to be able to decrypt the model, it needs the private key.
BlindFL incorporates FHE into the CMS process described above at the following points: (1) the client homomorphically encrypts the requested parameter matrices and example counts per segment before sending them to the server; (2) the server uses homomorphic operations to perform an encrypted, weighted average of the corresponding client model segments; and (3) the server sends the complete, encrypted updated model to the clients, and the clients decrypt using the private key.
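A minimal sketch of these three points using Pyfhel (CKKS scheme assumed) is shown below. As a simplification, each client pre-multiplies its segment by its example count so that the server needs only ciphertext additions, and the final division is deferred to the clients after decryption; that division of labor, the function names, and the omission of chunking for segments larger than the CKKS slot count are assumptions made for the sketch, not the appendix implementation.

```python
import numpy as np
from Pyfhel import Pyfhel

# (1) Client side: encrypt a requested parameter-matrix segment and its example count.
def client_encrypt_segment(HE: Pyfhel, segment: np.ndarray, num_examples: int):
    flat = segment.astype(np.float64).ravel()  # assumed to fit in one ciphertext (n/2 slots)
    enc_weighted = HE.encryptFrac(flat * num_examples)  # pre-weighted by the example count
    enc_count = HE.encryptFrac(np.array([float(num_examples)]))
    return enc_weighted, enc_count

# (2) Server side: homomorphically sum the encrypted contributions for one global matrix.
def server_aggregate(encrypted_contributions):
    enc_weighted_sum, enc_count_sum = encrypted_contributions[0]
    for enc_weighted, enc_count in encrypted_contributions[1:]:
        enc_weighted_sum = enc_weighted_sum + enc_weighted  # ciphertext + ciphertext
        enc_count_sum = enc_count_sum + enc_count
    return enc_weighted_sum, enc_count_sum

# (3) Client side: decrypt with the private key and finish the weighted average.
def client_decrypt_matrix(HE: Pyfhel, enc_weighted_sum, enc_count_sum, shape):
    weighted = HE.decryptFrac(enc_weighted_sum)[: int(np.prod(shape))]
    count = HE.decryptFrac(enc_count_sum)[0]
    return (weighted / count).reshape(shape)
```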
These steps require that FHE keys be properly distributed to maximally preserve privacy. We introduce a third node type, a key distributor (KD), to address this need. The KD is incorporated as follows:
The KD generates a new key pair for each round of training. Otherwise, a valid private key would carry over from the previous round of training, increasing the ability of clients to intercept and decrypt the other clients' models in transit. For exemplary purposes, this process is implemented using the free and open-source “Pyfhel: PYthon For Homomorphic Encryption Libraries” as described in the paper published at WAHC ′21, Nov. 15, 2021, Virtual Event, Republic of Korea.
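As an illustrative sketch, and assuming the Pyfhel API, per-round key generation at the KD might look as follows; serialization and network distribution of the keys are omitted.

```python
from Pyfhel import Pyfhel

def refresh_round_keys(HE: Pyfhel) -> Pyfhel:
    """Key distributor sketch: generate a fresh public/private key pair for the next round,
    reusing an already-configured FHE context (key serialization and distribution omitted)."""
    HE.keyGen()  # replaces the previous round's key pair with a new one
    # In a deployment, the new public key would now be sent to the selected clients, and the
    # private key would be withheld until the server finishes aggregating this round's segments.
    return HE
```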
In a real-world deployment, we would encrypt client-to-server communication using keys established via a Diffie-Hellman key exchange protocol. If the layers sent from the client to the server were not encrypted, then the KD node could easily perform a man-in-the-middle (MITM) attack, recovering client layers and decrypting them using the private key. The same protection is also applied to all communication sent from the KD node.
Finally, as another necessary step for real-world deployment, each node in the federation would need to be equipped with a certificate for verifying identity.
The unique advantage of BlindFL is found in its hybridization of FHE and CMS. Hybridizing these components enables FHE to be used in contexts in which it would otherwise be too time- and space-intensive. Without CMS, FHE would fully protect against the server-side privacy risks of FL. However, current-day FHE has very high time and space requirements. As a result, it is often impractical to encrypt and send entire models from every client in a federation. In such cases, BlindFL can be implemented to provide the enhanced security of FHE while also decreasing its time and space requirements.
CMS, on the other hand, does not fully protect against a model inversion attack. By piecing together segments from a given client collected round-over-round, the server can potentially approximate a full model from a client. However, CMS significantly reduces the load of FHE on any given client. As a result, CMS enables the performance gains that BlindFL offers relative to traditional FHE implementations.
BlindFL experiments were run via simulation. The following parameters were set for each run of the simulation: (1) the number of clients (the dataset for a given experiment is randomly shuffled, then evenly partitioned among the clients); (2) the number of client parameter matrices selected per global model parameter matrix (when this parameter is equal to the number of clients, CMS is effectively inactive); (3) the number of rounds of federated learning; and (4) whether or not FHE should be active (if active, parameters for the FHE context are also set; see below for additional description of FHE context selection).
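For concreteness, a hypothetical configuration object capturing these four per-run parameters might look as follows; the field names are illustrative and are not taken from the appendix code.

```python
from dataclasses import dataclass

@dataclass
class SimulationConfig:
    num_clients: int                 # dataset is shuffled, then split evenly across clients
    matrices_per_global_matrix: int  # equal to num_clients means CMS is effectively inactive
    num_rounds: int                  # rounds of federated learning
    fhe_enabled: bool                # if True, FHE context parameters must also be set

# Example: 10 clients, 5 client matrices per global matrix, 20 rounds, FHE active.
config = SimulationConfig(num_clients=10, matrices_per_global_matrix=5,
                          num_rounds=20, fhe_enabled=True)
```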
For each experiment described below, one or more of these parameters is varied. Each result is averaged across 5 runs of the experiment. The experiments were run on two use cases: classification on the MNIST dataset using a LeNet-5 model and classification on the CIFAR-10 dataset using a ResNet20 model. The datasets were partitioned evenly amongst the given number of clients for a given run of the simulation (e.g., for a run with 3 clients, each client is given a third of the dataset). The LeNet-5 model has 10 total parameter matrices, and the ResNet20 model has 20 total parameter matrices.
All experiments were run on an AWS EC2 r5.8xlarge instance, which has a vCPU count of 32 and 256 GiB of memory. These high specifications allowed simulations of the system to run many clients at once on a single instance, which is an intensive process due to the FHE involved. The accuracy recorded in the figures below is measured as the average accuracy of the global model across each client's test set. As previously described, Flower was used for our federated learning implementation, and Pyfhel was used for our fully homomorphic encryption implementation.
Before any other experiments were run, simulations with FHE enabled and CMS inactive were tried to determine the smallest FHE context required to (a) successfully run BlindFL without crashing and (b) maintain the accuracy of the model. The FHE framework, Pyfhel, required that the following parameters be determined: the n value, the scale, and the qi sizes. Powers of 2 were tested for the n value. Only 2^14 and 2^15 allowed the simulation to run without error, and 2^15 caused the simulation to run over 4 times slower. Thus, 2^14 was selected. Powers of 2 were also tested for the scale. The minimum scale that still allowed the simulation to run without error, 2^30, was selected. Finally, it was determined that at least 4 qi size values were required, and the following were the smallest functional qi sizes tested: [40, 30, 30, 40]. In summary, our Pyfhel FHE context had the following settings: n = 2^14, scale = 2^30, and qi sizes of [40, 30, 30, 40].
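As a sketch of how such a context might be created (the CKKS scheme and Pyfhel 3.x-style argument names are assumptions):

```python
from Pyfhel import Pyfhel

# Sketch of the FHE context selected above (CKKS scheme assumed).
HE = Pyfhel()
HE.contextGen(
    scheme='CKKS',
    n=2**14,                    # 2**15 also ran, but over 4 times slower
    scale=2**30,                # smallest scale that ran without error
    qi_sizes=[40, 30, 30, 40],  # smallest functional qi sizes tested
)
HE.keyGen()
```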
Notice the very slight decrease in performance as the number of clients is increased. As previously mentioned, for a given experiment, the original, full dataset (whether MNIST or CIFAR-10) is partitioned evenly among the number of clients. The very slight decrease in performance, then, is simply an artifact of each client having a bit less data to train with. Overall, though, this result makes clear that, as long as the overall training set remains the same, the number of clients in a BlindFL federation has little to no effect on model performance.
Over the course of 20 rounds of federated learning, we see that the same level of accuracy is achieved across all four run types. In other words, while FHE introduces a clear time overhead, neither component of BlindFL impacts the level of accuracy that is converged to. This is an especially notable result given that the introduction of CMS segmentation does not result in any decrease in convergence accuracy. Overall, it may be concluded that BlindFL does not negatively impact the accuracy of models.
Furthermore, FHE's time overhead is very effectively reduced by CMS. As previously mentioned, this experiment includes 10 clients, and by requesting only 5 parameter matrices from each client, we are able to cut server-side processing time in half. This 50% time reduction can be extended further by requesting even fewer parameter matrices from clients, which, per the accuracy experiments described above, will not result in reduced accuracy.
Tables 1 and 2 show, for both of our use cases, the average amount of data sent to the server per client, varying the number of client parameter matrices CMS sends and setting FHE both on and off. For these experiments, 10 clients were used.
It is clear that, while FHE does introduce a high space overhead, this overhead still results in very manageable client-server bandwidth requirements. Furthermore, as with FHE's time overhead, CMS cuts down FHE's space overhead significantly.
Federated learning is a useful technique for preserving the privacy of distributed data used to train ML models. However, the threat to privacy posed by inversion attacks is significant. While many different PPFL techniques have been proposed, each comes with its own inherent risks and flaws. A technology shown to be very effective at protecting data against these attacks is FHE. FHE greatly reduces the ability of a server to perform an inversion attack, but it does have several drawbacks: increased requirements of compute, memory, network traffic, and time. These drawbacks make FHE less ideal for systems with high volume, low compute, or even low bandwidth, such as edge systems.
Our proposed solution, BlindFL, is a scalable PPFL protocol that enhances an FHE approach with client model segmentation (CMS). Unlike FHE, CMS by itself does not fully protect against server-side inversion attacks, but it does provide increased privacy. Furthermore, it significantly reduces the demands of FHE on both the clients and the server. We are thus able to implement BlindFL in contexts in which an FHE-only approach would be too slow or data-intensive.
Our solution also maintains the accuracy of the aggregated global model and delivers strong performance overall. The addition of CMS cuts server-side processing time roughly in half. Furthermore, in contexts such as edge computing, bandwidth requirements can be a significant limitation. While our experiments show an increase in client-server bandwidth requirements due to FHE, CMS greatly reduces those requirements, thus providing higher security for only a mild increase in bandwidth. Additionally, BlindFL maintains practically identical accuracy to non-FHE models on both the MNIST and CIFAR-10 datasets. Thanks to these enhancements, BlindFL can be confidently offered as a solution for enhanced FL privacy, even for applications for which FHE alone would typically be impractical.
BlindFL protects the security of client data during federated learning, including, but not limited to, personal health information, personal financial information, personally identifiable information, and asset location information. Exemplary use cases for BlindFL might include, but are not limited to: health diagnosis systems, wherein healthcare providers can collaborate to produce a model that gives more accurate patient diagnoses without revealing patient data to one another; inter-intelligence-agency collaboration, wherein agencies like the FBI, CIA, and DIA could all collaborate to produce models that more accurately flag threats; autonomously piloted assets such as aircraft, ships, and vehicles, wherein self-piloting assets could collaborate to produce a model that pilots them more safely and efficiently; mobile applications, e.g., next-word prediction, face detection, and voice recognition, without revealing personal information; and protection against financial fraud, e.g., with BlindFL, financial institutions across the globe can collaborate to improve fraud detection and stop more fraudulent transactions before they occur, without revealing customers' personal information.
All references and patent documents cited in this disclosure are incorporated herein by reference in their entireties.
The present application claims the benefit of priority to U.S. Provisional Patent Application No. 63/507,615 entitled A PROCESS FOR SECURING CLIENT DATA DURING FEDERATED LEARNING, filed Jun. 12, 2023, which is incorporated herein by reference in its entirety.