The present disclosure generally relates to machine operations and communication, including methods for decentralized machine learning, distributed learning, federated learning, asynchronous federated learning, network communication, and machine communication.
Asynchronous Federated Learning with Buffered Aggregation (FedBuff), developed by Meta (Facebook), is a federated learning algorithm known for its efficiency and high scalability. However, FedBuff has a high communication cost. Federated Learning (FL) is a distributed machine learning paradigm that enables training of models on decentralized data, without the need for centralized data storage or sharing of raw data. One of the main challenges in FL is the presence of data heterogeneity and communication constraints. These challenges arise due to distributed data and in particular to data sources that vary. While FL operations provide benefits, there is a need and desire for improved operation and processes.
Many of the latest efforts in machine learning are focused on bringing learning as close to data collection as possible. This is of practical interest in a diverse array of applications and in particular for sensor networks. In sensor networks, nodes are often communication-constrained, operate in a distributed manner, and can send only a few bits to their neighbors. There is a need and a desire for processes that allow for learning with non-fully connected data sources, such as a non-fully connected graph.
Disclosed and described herein are systems, methods and configurations for bidirectional quantized communication and buffered aggregation. In one embodiment, a method is provided for bidirectional quantized communication and buffered aggregation including sampling, by a server device, at least one client device, wherein the sampling includes a request to update a local model of the client device using a hidden state model, and receiving, by the server device, a quantized difference for the local model of the at least one client device. The method also includes aggregating, by the server device, model updates in a buffer, and performing, by the server, a global update on a server model using the model updates in the buffer. The method also includes updating, by the server device, the hidden state model using the server model, and transmitting, by the server device, a hidden state model update to the at least one client device.
In one embodiment, sampling the at least one client device includes a request for local updates to the hidden state model from the at least one client device, and wherein the local updates are determined following training by the at least one client device.
In one embodiment, the server model is a Quantized Federated Learning Model with Buffered Aggregation (QAFeL).
In one embodiment, the quantized difference includes a compressed set of hidden state model updates determined by the at least one client device.
In one embodiment, the quantizer compresses updates to the hidden state model including a function with a compression parameter, and an internal randomness parameter, the function configured to reduce trained dataset updates to a quantized set of data updates.
In one embodiment, aggregating model updates in the buffer includes determining quantized model updates, the quantized model updates providing training feedback for the local model and hidden state model.
In one embodiment, the global update includes updating a server model on the server device based on aggregated training updates received for the hidden state model from at least one client device.
In one embodiment, updating the hidden state model includes modifying a hidden state model using an updated server model on the server device.
In one embodiment, transmitting the hidden state model update includes broadcasting a modified hidden state model to the at least one client device.
In one embodiment, the at least one client device performs a gossiped and quantized communication of hidden state model updates to at least one additional node.
Another embodiment is directed to a system for bidirectional quantized communication and buffered aggregation. The system includes at least one client device, and a server device. The server device includes a controller configured to sample the at least one client device, wherein sampling includes a request to update a local model of the client device using a hidden state model, and receive a quantized difference for a local model of the at least one client device. The controller is configured to aggregate model updates in a buffer, and perform a global update on a server model using the model updates in the buffer. The controller is configured to update the hidden state model using the server model, and transmit a hidden state model update to the at least one client device.
In one embodiment, the system is a Quantized Federated Learning system with Buffered Aggregation (QAFeL).
In one embodiment, the client device includes an unbiased quantizer configured to generate a quantized difference for the model of the at least one client device and the hidden state model.
In one embodiment, the quantizer compresses updates to the hidden state model including a function with a compression parameter, and an internal randomness parameter, the function configured to reduce trained dataset updates to a quantized set of data updates.
In one embodiment, aggregating model updates in the buffer includes determining quantized model updates, the quantized model updates providing training feedback for the local model and hidden state model.
In one embodiment, the global update includes updating a server model on the server device based on aggregated training updates received for the hidden state model from at least one client device.
In one embodiment, the at least one client device performs a gossiped and quantized communication of hidden state model updates to at least one additional node.
Another embodiment is directed to a method for bidirectional quantized communication and buffered aggregation including receiving, by a client device, a hidden state model, the hidden state model based on a server device model. The method also includes training, by the client device, a local model to determine updates for the local model. The method also includes receiving, by the client device, a request to transmit updates determined for the local model of the client device, and generating, by the client device, a quantized difference for the local model and the hidden state model. The method also includes transmitting, by the client device, a quantized difference to the server device.
In one embodiment, the model is a Quantized Federated Learning Model with Buffered Aggregation (QAFeL).
In one embodiment, generating a quantized difference is performed by an unbiased quantizer for the client device configured to perform a gossiped and quantized communication of hidden state model updates to at least one additional node.
Another embodiment is directed to a method for gossiped and quantized communication and learning. The method includes sampling, by a client device, at least one neighboring client device, wherein sampling includes a request for the at least one neighboring client device to update a local model of the at least one neighboring client device using a hidden state model. The method also includes receiving, by the client device, a quantized difference for the local model of the at least one neighboring client device, and aggregating, by the client device, model updates in a buffer. The method also includes performing, by the client device, a global update on a client model using model updates from the at least one neighboring client device in the buffer, and updating, by the client device, the hidden state model of the client device using the client model. The method also includes transmitting, by the client device, a hidden state model update to the at least one neighboring client device, wherein the client device performs a gossiped and quantized communication of hidden state model updates to the at least one neighboring client device.
Other aspects, features, and techniques will be apparent to one skilled in the relevant art in view of the following detailed description of the embodiments.
The application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
The features, objects, and advantages of the present disclosure will become more apparent from the detailed description set forth below when taken in conjunction with the drawings in which like reference characters identify correspondingly throughout and wherein:
One aspect of the disclosure is directed to machine learning including systems and methods for machine communication. In one embodiment, processes and configurations are provided for bidirectional quantized communication for asynchronous federated learning and buffered aggregation. Operations and configurations discussed herein improve machine communication and performance of machine learning operations. Embodiments are directed to Quantized Asynchronous Federated Learning (QAFeL), with a quantization scheme that establishes a shared “hidden” state between the server and clients, and that is continuously updated through quantized updates. The shared hidden state may be a model shared between a server device and client devices, and may operate as a reference for updating a global model. The hidden model may be referred to herein as a client background model. Systems and processes according to embodiments provide high precision while significantly reducing the data transmitted during client-server interactions. In addition, systems and processes discussed herein provide theoretical convergence guarantees for asynchronous Federated Learning (FL) with quantized communications and buffered aggregation, in both directions of server-to-client and client-to-server.
According to embodiments, systems may be configured for distributed learning and federated learning, where the data may be kept at users and a global model is trained by communicating the models trained at each node or the gradients calculated at each node. Embodiments improve communication among the nodes, between clients and with one or more server devices (e.g., cloud). Embodiments provide solutions to the large communication bandwidth required for communication, a requirement that may be a bottleneck in conventional systems. Embodiments may address communication bottlenecks in a system by introducing a quantization method and system that can be used to reduce the transmitted bits while maintaining performance for a system.
Operations may also be directed to models and systems for asynchronous FL methods, as FL may be suited to handle large-scale and dynamic systems. In asynchronous FL, clients and nodes in a system may be configured to update models independently and can also communicate with a server at different times. This eliminates the need for fitting clients into time slots and allows the handling of clients that are slow to respond or have limited communication capabilities. Embodiments also provide operations to address challenges of asynchronous learning such as stale gradients and stragglers, which need to be handled properly to ensure performance.
Embodiments also provide systems and processes that allow for communication and operations for Online Multi-Kernel Learning (OMKL). Configurations may include communication among nodes with quantized communication. Processes and device configurations can include using a shared “hidden” state. In instances of online kernel learning where little prior information is available and centralized learning is unfeasible, online multi-kernel learning may provide sub-linear regret as long as every pair of nodes in the network can communicate (i.e., the communications network is a complete graph). According to embodiments, to manage the communication load, which is often a performance bottleneck, communications between nodes can be quantized. Processes and system configurations are provided for non-fully connected graphs, which is often the case in wireless sensor networks. According to embodiments, a gossip algorithm is provided as well as experimental results to achieve sub-linear regret. Experiments with real datasets are described that confirm operation of the embodiments.
According to embodiments, processes and device configurations may be configured to provide models and machine processes for natural language processing, computer vision, health care, network communication, and computer operations in general.
As used herein, the terms “a” or “an” shall mean one or more than one. The term “plurality” shall mean two or more than two. The term “another” is defined as a second or more. The terms “including” and/or “having” are open ended (e.g., comprising). The term “or” as used herein is to be interpreted as inclusive or meaning any one or any combination. Therefore, “A, B or C” means “any of the following: A; B; C; A and B; A and C; B and C; A, B and C”. An exception to this definition will occur only when a combination of elements, functions, steps or acts are in some way inherently mutually exclusive.
Reference throughout this document to “one embodiment, certain embodiments, an embodiment,” or similar term means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, the appearances of such phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner on one or more embodiments without limitation.
According to embodiments, system 100 is configured for distributed machine learning, and for using and training models on decentralized data, such as data stored or accessed by nodes/clients 1051-n. According to embodiments, server device 110 may include a controller configured to sample at least one client device, such as one or more of nodes/clients 1051-n. According to embodiments, sampling can include transmitting a request to at least one of the nodes/clients 1051-n to update a local model of a client device using a hidden state. According to embodiments, a hidden state model may be used and shared between at least one server device and nodes/clients 1051-n to update a model. According to embodiments, system 100 may use quantized communication for sharing updates relative to nodes/clients 1051-n and server 110. The controller of server 110 may also be configured to receive a quantized difference for a model of the at least one client device, and aggregate model updates in a buffer. The controller is also configured to perform a global update on a model using the model updates in the buffer, and update the hidden state using the model. According to embodiments, the controller is configured to transmit a hidden state update to the at least one of the nodes/clients 1051-n.
According to embodiments, system 100 may include one or more servers, such as server device 110 and optional server device 120. Optional server device 120 may also store model 115 and be configured for updating model 115 and receiving quantized updates.
According to embodiments, system 100 may be configured to provide distributed learning and federated learning. As such, data for machine learning operations may be stored at users, such as one or more of nodes/clients 1051-n and a global model, such as model 115, is trained by communicating the models trained at each node, such as models 1061-n, or the gradients calculated at each node. Nodes/client devices 1051-n may include an unbiased quantizer configured to generate a quantized difference for a local model, such as a model of the at least one client device and a hidden state model.
According to embodiments, system 100 may allow for communication among nodes, such as nodes/clients 1051-n, and/or between clients, such as nodes/clients 1051-n, and network 106, which may provide one or more cloud network services. System 100 and configurations and processes discussed herein allow for high communication throughput challenges to be met. According to embodiments, communication bottlenecks may be overcome by quantization to reduce the transmitted bits while maintaining the performance. Operations and processing including quantization may be applied to other distributed machine learning processes. By way of example, processes and configurations herein may provide a Quantized Asynchronous Federated Learning (QAFeL), with a quantization scheme that establishes a shared “hidden” state between the server and clients and is continuously updated through quantized updates. The QAFeL operations discussed herein can provide high precision while significantly reducing the data transmitted during client-server interactions. In addition, the processes can provide convergence guarantees for asynchronous Federated Learning with quantized communications and buffered aggregation, in both directions of server-to-client and client-to-server. Experimental results are discussed herein supporting the machine improvements.
According to embodiments, one or more of nodes/clients 1051-n may be configured for communication across nodes of system 100 and for Quantized Online Multi-Kernel Learning (OMKL). Operations are provided for a complete graph or fully connected nodes (e.g., every pair of nodes in the network can communicate), and non-fully connected nodes. According to embodiments processes and device configurations are provided for nodes/client devices 1051-n to provide quantized communications. Nodes/client devices 1051-n may provide configurations to expand OMKL to non-fully connected graphs, which is often the case in wireless sensor networks, by utilizing a gossip algorithm. In embodiments, nodes/client devices 1051-n can perform gossiped and quantized communication of hidden model state updates to at least one additional node.
Process 200 may be initiated at block 201 with sampling at least one client device. Sampling can include sending a request from a server device to at least one node/client device to update a local model of the client device using a hidden state. Client devices may train asynchronously based on data available, data received and inputs to the client devices. Sampling the at least one client device can include a request for local updates to the hidden state from the at least one client device, and the local updates may be determined following training by the at least one client device. According to embodiments, process 200 may be configured to provide a communication process for Quantized Federated Learning Model with Buffered Aggregation (QAFeL). The model may be a global model for a system, where a server level model is updated and communicated to one or more nodes or client devices. According to embodiments, the model may be for one or more of machine learning, distributed learning, and federated learning. Process 200 and embodiments provide communication efficient methods for performing gossip/consensus distributed machine learning and asynchronous federated learning. In addition, process 200 may be performed for asynchronous Federated Learning with Buffered Aggregation (FedBuff).
At block 205, process 200 includes at least one server device (e.g., server 110) receiving a quantized difference for a model of the at least one client device. Client devices of a system, such as nodes/client devices 1051-n of system 100 may communicate relative to each other and relative to one or more servers. In response to sampling at least one client device at block 201, a server device of process 200 may receive information and data from one or more node/client devices to update a server model. According to embodiments, a quantized difference may be received including a compressed set of hidden model updates determined by the at least one client device. The quantizer may compress updates to the hidden model state including a function with a compression parameter, and an internal randomness parameter, the function configured to reduce trained dataset updates to a quantized set of data updates. Compression of updates may be performed using a quantizer and/or quantizer operations as a form of lossy compression that allows for a reduced message and fewer communication bits than a full model package of updates.
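The role of the internal randomness parameter can be illustrated with a short sketch. The function below is a generic stochastic rounding quantizer in the style of qsgd, not necessarily the quantizer of the embodiments; the name `stochastic_quantize` and its parameters are hypothetical. Each coordinate is rounded up or down at random to one of `levels` uniform steps of the vector norm, so the quantized vector may equal the input on average while only the norm, the signs, and small integers would need to be sent.

```python
import math
import random


def stochastic_quantize(x, levels, rng):
    """Quantize each coordinate of x to one of `levels` uniform steps of its norm.

    Magnitudes are rounded up or down at random (the internal randomness),
    with probabilities chosen so that the expected output equals the input.
    """
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        return [0.0] * len(x)
    out = []
    for v in x:
        scaled = abs(v) / norm * levels          # position in [0, levels]
        lower = math.floor(scaled)
        # Round up with probability equal to the fractional part.
        level = lower + (1 if rng.random() < scaled - lower else 0)
        out.append(math.copysign(norm * level / levels, v))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    print(stochastic_quantize([0.3, -0.4], levels=4, rng=rng))
```

Averaging many independent quantizations of the same vector should approach the original vector, which is the sense in which such a quantizer is unbiased.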
At block 210, process 200 includes aggregating model updates in a buffer, and at block 215 performing a global update on a server model using the model updates in the buffer. A server device may aggregate client updates for a period of time prior to updating a model. Updates received from one or more clients (e.g., nodes) may be stored as received. Aggregating model updates in the buffer may include determining quantized model updates. Quantized model updates may provide training feedback for the local model and hidden state. The global update includes updating a model on the server device based on aggregated training updates received for the hidden model state from at least one client device. Aggregating for the global update can include averaging local updates received from clients to form a global update. The global update may be based on a weighted mean of the local updates and/or based on an aggregation type.
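A minimal sketch of the buffering and global update described for blocks 210 and 215 follows. It assumes a plain (unweighted) mean of the buffered updates and a single global learning rate; the class name `BufferedServer` and its fields are hypothetical and are not taken from the disclosure.

```python
class BufferedServer:
    """Accumulates client updates and applies them once the buffer is full."""

    def __init__(self, model, buffer_size, global_lr):
        self.model = list(model)
        self.buffer_size = buffer_size
        self.global_lr = global_lr
        self.buffer = []

    def receive(self, update):
        """Store one client update; return True if a global update was applied."""
        self.buffer.append(list(update))
        if len(self.buffer) < self.buffer_size:
            return False
        # Coordinate-wise mean of the buffered updates (assumed aggregation type).
        mean = [sum(col) / len(self.buffer) for col in zip(*self.buffer)]
        self.model = [w + self.global_lr * m for w, m in zip(self.model, mean)]
        self.buffer.clear()
        return True


if __name__ == "__main__":
    server = BufferedServer([0.0, 0.0], buffer_size=2, global_lr=1.0)
    server.receive([1.0, 2.0])
    server.receive([3.0, 4.0])
    print(server.model)
```

A weighted mean or another aggregation type could be substituted at the line that computes `mean` without changing the buffering logic.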
At block 220, process 200 optionally includes updating the hidden state model using the server model. Process 200 may also optionally include transmitting a hidden state update to the at least one client device at block 225. Updating the hidden state model can include modifying a hidden model using an updated server model on the server device. Transmitting the hidden state update includes broadcasting a modified hidden model state to the at least one client device.
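The hidden state update of blocks 220 and 225 can be sketched as follows, under the assumption that the server transmits a quantized difference between the server model and the hidden state and that every party adds the same message to its copy. The helper names `top_k` and `sync_hidden_state` are hypothetical, and top-k sparsification is used only as an example quantizer.

```python
def top_k(x, k):
    """Keep the k largest-magnitude coordinates of x and zero the rest."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def sync_hidden_state(server_model, hidden_state, quantizer):
    """Return (message, new_hidden_state) for one server-to-client broadcast."""
    diff = [s - h for s, h in zip(server_model, hidden_state)]
    message = quantizer(diff)  # the only data that is transmitted
    new_hidden = [h + m for h, m in zip(hidden_state, message)]
    return message, new_hidden


if __name__ == "__main__":
    server_model = [1.0, -2.0, 0.5]
    hidden = [0.0, 0.0, 0.0]
    for _ in range(3):
        message, hidden = sync_hidden_state(server_model, hidden, lambda d: top_k(d, 1))
        print(message, hidden)
```

Because the server and the clients apply the identical message, their copies of the hidden state stay synchronized, and with a fixed server model each broadcast in this example moves the hidden state closer to it.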
According to embodiments, a node/client device (e.g., nodes/client devices 1051-n) may be configured to perform one or more operations of process 200 for machine communication. In particular, a node/client device may perform gossiped and quantized communication and learning without a server (e.g., server 110). By way of example, process 200 may be performed as a gossiped version, where there is no need for a server in order for one or more processes and system configurations described herein to operate. By way of further example, without any server, clients can exchange information with each other. Exchange and updating of hidden state models may eventually lead to convergence of local models to a global model.
According to embodiments, process 200 may illustrate a method for gossiped and quantized communication and learning performed by a node/client device. As such, process 200 may illustrate one or more operations performed by the node client device, such as nodes/client devices 1051-n of system 100. According to embodiments, process 200 may include a client device sampling at least one neighboring client device at block 201, wherein sampling includes a request for the at least one neighboring client to update a local model of the at least one neighboring client device using a hidden state model. At block 205, the client device may receive a quantized difference for the local model of the at least one neighboring client device. At block 210, the client device may aggregate model updates in a buffer. At block 215, the client device may perform a global update on a client model using model updates from the at least one neighboring client device in the buffer. The client device may optionally update the hidden state model of the client device using the client model at block 220. The client device may optionally transmit a hidden state model update to the at least one neighboring client device, wherein the client device performs a gossiped and quantized communication of hidden state model updates to the at least one neighboring client device.
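The serverless exchange can be illustrated with a generic quantized gossip averaging round on a non-fully connected graph. This is a sketch in the style of compressed gossip and is not necessarily the procedure of process 200: models are scalars, the quantizer is a simple rounding to a grid, and the names `gossip_round`, `weight`, and `step` are hypothetical. Each node keeps a private model and a public hidden copy that its neighbors track; only quantized differences to the hidden copy are exchanged.

```python
def gossip_round(models, hidden, neighbors, weight, step):
    """One quantized gossip round over scalar models on an undirected graph."""
    # 1) Each node sends the quantized difference between its private model
    #    and its public hidden copy; all parties add it to that hidden copy.
    new_hidden = []
    for x, h in zip(models, hidden):
        message = round((x - h) / step) * step
        new_hidden.append(h + message)
    # 2) Each node moves toward the hidden copies of its neighbors.
    new_models = []
    for i, x in enumerate(models):
        pull = sum(new_hidden[j] - new_hidden[i] for j in neighbors[i])
        new_models.append(x + weight * pull)
    return new_models, new_hidden


if __name__ == "__main__":
    # Path graph 0 - 1 - 2: nodes 0 and 2 are not directly connected.
    neighbors = {0: [1], 1: [0, 2], 2: [1]}
    models, hidden = [0.0, 3.0, 9.0], [0.0, 0.0, 0.0]
    for _ in range(40):
        models, hidden = gossip_round(models, hidden, neighbors, 1.0 / 3.0, 0.001)
    print(models)
```

With symmetric weights the pulls cancel across each edge, so the network average is preserved while the local models approach one another, up to an error on the order of the grid step.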
Controller 305 may relate to a processor or control device configured to execute one or more operations stored in memory 310, such as providing bidirectional quantized communication and buffered aggregation. Communications module 315 may be configured to communicate with one or more other devices, such as a server device (e.g., server 110) or node/client device (e.g., node/client device 1051).
According to one embodiment, controller 305 may be configured to sample client devices, receive client device quantized differences, aggregate model updates and perform global updates to a model. Controller 305 may also direct updates and control transmission and broadcasting of hidden state updates to one or more client devices. Server device 300 may include one or more components for receiving and detecting communications from other devices. Server device 300 and controller 305 may also be configured to aggregate updates from one or more client devices and perform processes, such as process 200 of
According to embodiments, server 425 may be configured for asynchronous federated learning with bidirectional quantized communications and buffered aggregation, such as process 200. Asynchronous federated learning with bidirectional quantized communications and buffered aggregation can provide a tighter convergence rate, compared to non-quantized communications, and allow for non-convex functions and heterogeneous clients, without assuming bounded client heterogeneity. System 400 incorporates a bidirectional quantization scheme into a buffered process (e.g., FedBuff) and can provide a Quantized Asynchronous Federated Learning (QAFeL) process. The QAFeL process also includes a hidden shared state between a server device and clients and facilitates a reduction in communication costs. According to embodiments, a QAFeL process includes client devices 4101-n that train asynchronously and that are configured to send local updates to a server when the client devices 4101-n are finished training. Server 420 may be configured to accumulate local updates in a buffer until a maximum capacity is reached and then produce a server model update. According to embodiments, system 400 and operations at blocks 415 and 420, which may be performed by client devices, server devices and devices in a network in general, may be configured to compress updates. Compression of updates may be performed using a quantizer and/or quantizer operations as a form of lossy compression that allows for a reduced message and fewer communication bits than a full model package of updates.
According to embodiments, a quantizer and/or quantizer operations may be performed by a device, such as a server device 405 or client devices 4101-n. According to embodiments, a quantizer may include one or more functional operations. According to embodiments, a quantizer, denoted by $Q: \mathbb{R}^d \to \mathbb{R}^d$, is a function (possibly random) that satisfies the following condition:
$$\mathbb{E}_Q\left[\lVert Q(x) - x \rVert^2\right] \le (1 - \delta)\,\lVert x \rVert^2,$$
where δ>0 is a compression parameter and $\mathbb{E}_Q$ denotes the expectation with respect to internal randomness of the quantizer Q, wherein quantizer and compression operator are used interchangeably herein. As an example, the quantizer may use a vector $x \in \mathbb{R}^d$, and operate with the definitions
with probability
otherwise and the receiver can reconstruct
For topk and randk, their compression parameter is δ = k/d. For a qsgd with s levels of quantization,
The topk may be the only biased quantizer of the three and the general biased quantizers may be converted to unbiased quantizers with additional data transmission. According to embodiments, n bits may be used per coordinate instead of a full precision floating point number, which is usually 32 bits. For this n-bit qsgd quantizer, the number of bits per coordinate, n, may automatically determine the quantization level s.
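The sparsifying quantizers named above can be sketched in a few lines. The functions below are illustrative: `top_k` keeps the k largest-magnitude coordinates, and `rand_k` is written here as the unscaled variant that keeps k coordinates chosen uniformly at random, which is an assumption about the intended definition. The helper `compression_error` computes the squared error that a compression parameter of k/d is meant to bound.

```python
import random


def top_k(x, k):
    """Keep the k largest-magnitude coordinates of x and zero the rest."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def rand_k(x, k, rng):
    """Keep k coordinates chosen uniformly at random and zero the rest (unscaled)."""
    keep = set(rng.sample(range(len(x)), k))
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def sq_norm(x):
    """Squared Euclidean norm."""
    return sum(v * v for v in x)


def compression_error(x, qx):
    """Squared error between a vector and its quantized version."""
    return sq_norm([a - b for a, b in zip(x, qx)])


if __name__ == "__main__":
    x = [3.0, -1.0, 0.5, 2.0]
    k, d = 2, len(x)
    q = top_k(x, k)
    print(q, compression_error(x, q), (1 - k / d) * sq_norm(x))
    print(rand_k(x, k, random.Random(1)))
```

For `top_k` the squared error never exceeds (1 − k/d) times the squared norm of the input, since the discarded coordinates are the smallest ones; for `rand_k` the corresponding statement holds in expectation over the random selection.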
In
According to embodiments, communication between nodes may be quantized. Client devices or nodes may be configured to quantize data by using a function to select parameters and models for providing updates. Quantizing updates may reduce one or more of the data transmissions, and data size provided for updates. Process 500 may allow for quantized communication in both directions, such that client devices may send quantized updates to a server device and a server device may send quantized updates to client devices. Process 500 may also be applied to a network complete graph and non-fully connected graphs. For non-fully connected graphs a gossip algorithm may be used, such as process 800.
Process 500 illustrates an example timeline for execution of QAFeL. According to embodiments, the hidden model or hidden state 515 may be shared between clients and a server device. The hidden state model may be saved in server memory and at each client and may be used as an approximation of the server model at each client. The hidden state model can save communication costs while guaranteeing convergence, since both clients and the server know the values of the hidden model. According to embodiments, the server may be configured to send a quantized difference between the hidden state model and the server model at each iteration to progressively improve the approximation. In
According to embodiments, a QAFeL process and model may be configured to accommodate networks without broadcast capabilities. In this case, the server must keep hidden state updates in storage for a maximum of Cmax updates, where Cmax is the storage size of the model divided by the expected size of a compressed hidden state, and then transmit the necessary updates so that the hidden state is synchronized between the client and server. If the staleness is larger than Cmax, the server can transmit the hidden state to the client. In each scenario, the communication cost is improved over prior communication formats.
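The choice between replaying stored updates and sending the full hidden state can be sketched as a small decision function. The function name `catch_up_plan`, its parameters, and the use of integer division for Cmax are assumptions made for illustration; sizes are in bytes and staleness counts the hidden state updates a client has missed.

```python
def catch_up_plan(model_bytes, compressed_update_bytes, staleness):
    """Decide how to resynchronize a client's hidden state without broadcast.

    Returns the chosen action and the number of bytes it would transmit.
    """
    # Cmax: number of compressed updates that together cost as much as one full model.
    c_max = model_bytes // compressed_update_bytes
    if staleness <= c_max:
        return "replay_updates", staleness * compressed_update_bytes
    return "send_full_state", model_bytes


if __name__ == "__main__":
    # Hypothetical sizes: a 4 MB model and 0.5 MB compressed updates.
    print(catch_up_plan(4_000_000, 500_000, staleness=3))
    print(catch_up_plan(4_000_000, 500_000, staleness=10))
```

Under this rule the server never needs to store more than Cmax updates, and the bytes sent to any client never exceed the size of the full hidden state.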
According to embodiments, a global loss function can be based on minimizing the weighted sum of the client loss functions
$$\min_{x} f(x) := \frac{1}{N}\sum_{n=1}^{N} F_n(x),$$
where $F_n$ is the loss function on Client n and N is the total number of clients. The global loss function minimizes the average of local loss functions, which is the same as minimizing the sum of the client loss functions with equal weights 1/N. Each function $F_n$ depends only on data collected locally (i.e., client n) and the notation is summarized in Table 1:
t, Δkt
Assumptions may be made for all clients including bounded local variance, bounded and L-smooth loss gradient, and bounded staleness when K=1. It may be assumed that ƒ achieves a minimum value at ƒ*. The upper bound of the staleness may depend on the buffer size, K. As the buffer size increases, the server may update less frequently, which reduces the number of server steps between when a client starts training and when the updates are applied to the global model.
According to embodiments, to alleviate the communication cost of asynchronous FL, with buffered aggregation, a QAFeL process may include models such as a QAFeL-server (e.g., server model), a QAFeL client (e.g., client device model) and QAFeL client background (e.g., hidden state model).
According to embodiments, quantized data may be a full set of model updates minimized based on a server device and client maintaining a common shared hidden state. To begin training, the server device and one or more client devices may start with a pre-agreed upon model, which may be used to initiate a hidden state. The server device may then asynchronously sample clients and request the clients to compute a local update. A requested client device may copy the current hidden state as $y_0$ and perform P local updates using the equation
$$y_p = y_{p-1} - \eta_\ell\, g_p(y_{p-1}),$$
where $\eta_\ell$ is the local learning rate and $g_p$ is the noisy gradient. After the local updates are computed, the client device can send the quantized difference $Q_c(y_{p-1} - y_0)$ to the server to aggregate. The server can accumulate updates in a buffer until it has K samples and then perform a global update on the model using the equation
Where
According to embodiments, secure aggregation may be used, and the QAFeL algorithm may extend processes such as FedBuff without interfering with a privacy scheme.
According to embodiments, operations discussed herein provide improved convergence. According to embodiments, a condition may be provided on the learning rates necessary for the convergence to hold, such as a bounded rate. For the case of biased server quantizers, a looser bound on the convergence rate may be provided. By selecting local and global learning rates that satisfy the condition and unbiased server and client quantizers, the ergodic convergence rate of a QAFeL model may have an upper bound. In addition, the selection of a client quantizer may affect the order of error more than the choice of a server quantizer.
Experimental results of processes and operations of embodiments include results from a series of simulations to evaluate the performance of QAFeL. Simulations were performed to evaluate the reduction in communication load between a client and server for a federated learning model with buffered aggregation and to assess the impact of different quantization techniques on the convergence speed. A standard metric for comparing synchronous and asynchronous learning methods may be the number of client trips, which can include the number of times the client downloads the model and uploads an updated version after training. In addition, the number of bytes per message may also be illustrative of the benefits of processes and operations according to embodiments.
A trade-off may exist between the amount of quantization and the speed of convergence. In other words, compressing more will send fewer bytes per message, but more messages will have to be sent to reach a target accuracy. According to embodiments, QAFeL may require fewer total communication bits compared to conventional methods including FedBuff. According to embodiments, an optimal level of quantization may be based on the quantizer.
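By way of a non-limiting illustration, the trade-off above reduces to simple arithmetic: the total communication is the number of messages multiplied by the bits per message. The sketch below uses made-up message counts and model sizes, which are hypothetical and are not experimental results; it also ignores any per-message overhead such as transmitting a norm.

```python
def total_bits(num_messages: int, num_params: int, bits_per_param: int) -> int:
    """Total communicated bits = messages sent x parameters per message x bits per parameter."""
    return num_messages * num_params * bits_per_param


# Hypothetical numbers, chosen only to illustrate the trade-off (not measurements).
full_precision = total_bits(num_messages=1000, num_params=10_000, bits_per_param=32)
quantized = total_bits(num_messages=1500, num_params=10_000, bits_per_param=4)

# Ratio of full-precision traffic to quantized traffic under these assumptions.
savings_ratio = full_precision / quantized
```

Under these assumed numbers, the quantized scheme sends more messages but fewer total bits, which is the situation the paragraph above describes.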
Results of experiments support the convergence rate associated with the quantization schemes and indicate that quantization at the client affects the performance of QAFeL more than quantization at the server.
According to embodiments, processes are provided for Gossiped and Quantized Online Multi-Kernel Learning (OMKL). Device configurations described herein (e.g., server device 110, node clients 1051-n) may be configured for online kernel learning. OMKL use may result in sub-linear regret as long as every pair of nodes in the network can communicate (i.e., the communications network is a complete graph), even when little prior information is available. According to embodiments, processes and device configurations are provided for managing communication load, which is often a performance bottleneck, and for providing quantized communications between nodes. Embodiments provide configurations to expand OMKL to non-fully connected graphs, which are often encountered in wireless sensor networks. To address this challenge, embodiments can utilize a gossip algorithm. Experimental results are also discussed.
According to embodiments, a network may be characterized as having J nodes, and the network may be modeled as an undirected connected graph. At each instant of time t, Node j may receive a data string $x_t^j \in \mathbb{R}^d$ and the desired response $y_t^j \in \mathbb{R}^d$. An approximating function $f(x_t^j)$ may be determined for $y_t^j$. The function f may belong to a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H} = \{f \mid f(x) = \sum_{t=1}^{\infty}\sum_{j=1}^{J} \alpha_t^j\, k(x, x_t^j)\}$. The learning problem may be posed with a cost function C and a regularization parameter λ>0 that controls an increasing function Ω. An optimal solution for this problem exists in the form $\hat{f}(x) = \sum_{t=1}^{T}\sum_{j=1}^{J} \alpha_t^j\, k(x, x_t^j) = \alpha^{\top}\mathbf{k}(x)$, where α and $\mathbf{k}(x)$ are the vector versions of $\{\alpha_t^j\}$ and $\{k(x, x_t^j)\}$, respectively. For multi-kernel learning, a weighted combination of several kernels may be selected to improve performance compared to single-kernel learning. A convex combination of kernels $\{k_p\}_{p=1}^{P}$, where each $k_p$ belongs to an RKHS $\mathcal{H}_p$, may be denoted as $\bar{\mathcal{H}} = \mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_P$. Using $\bar{\mathcal{H}}$ instead of $\mathcal{H}$, the problem may be expressed as
A Random Feature (RF) approximation may be used to evade the curse of dimensionality. For normalized shift-invariant kernels, drawing D i.i.d. samples from $\pi_k(v)$, a weight vector $\theta \in \mathbb{R}^{2D}$ can be constructed such that $\hat{f}(x) = \theta^{\top} z_v(x)$, where the vector $z_v(x)$ is generated from the pdf $\pi_k$. The loss function may be defined as $\mathcal{L}(f(x)) = C(\theta^{\top} z_v(x), y) + \lambda\,\Omega(\|\theta\|^2)$. A weight vector may be constructed and a loss function defined for each Kernel p and Node j:
where $\eta \in (0,1)$ is a learning rate. The weights may be normalized as $\bar{\omega}_{p,t}^{j} = \omega_{p,t}^{j} / \sum_{p=1}^{P} \omega_{p,t}^{j}$ to have
The preceding representations show how local online multi-kernel learning models may be built. According to embodiments, to learn a global model, the local models are propagated using a gossip algorithm. The nodes calculate the weighted average of the information provided by their neighbors at each communication round and use it to update their local information. The weights, associated with the existing edges of the graph, are chosen to construct a J×J doubly stochastic gossip matrix W′. The spectral gap of W′ is denoted by $\rho = 1 - \lambda_2(W') \in (0,1]$, where $\lambda_2(W')$ represents the second eigenvalue of W′ in descending order. A gossip algorithm works very well when the nodes communicate their states perfectly with their neighbors. In practice, however, only a few bits can be communicated, i.e., information needs to be quantized before being shared with the neighbors. According to embodiments, a random quantizer $Q: \mathbb{R}^n \to \mathbb{R}^n$ may be used to represent an arbitrary vector $x \in \mathbb{R}^n$ with $Q(x)$ in an efficient way. Embodiments can operate for any random quantizer that satisfies:
for some δ>0, which may be a compression parameter. Here, $\mathbb{E}_Q$ denotes the expected value with respect to the internal randomness of $Q(\cdot)$. Each element $v_i$ of a non-zero vector $v \in \mathbb{R}^n$ may be quantized by $Q_M(v_i) = \|v\|\,\mathrm{sign}(v_i)\,\xi_i(v, M)$, where $M = 2^b - 1$ is the number of quantization levels and defining
is represented by b bits and
Process 800 was evaluated experimentally with real data sets for binary classification, with different topologies and using different quantization levels. The evaluation includes use of the Kernel Logistic Regression (KLR) loss function:
Experiments were conducted with three datasets: Banana, Credit-Card, and MNIST. The synthetic data from the Banana dataset (n=5300, d=2) are two nonlinearly separable clusters. The Credit-Card dataset (n=30000, d=2) contains data from credit card users, such as their credit history and whether or not they have defaulted; the dataset was obtained from the UCI machine learning repository. The MNIST dataset (n=70000, d=784) contains pictures of handwritten digits 0 to 9 and their corresponding labels. For the experiment, the MNIST samples are divided into two classes: those that are the number 8 and those that are not.
The experimental setup has J=20 nodes, dimension D=20 for the RF approximation, regularization parameter λ=0.001, and three Gaussian kernels with σ ∈ {1, 3, 5}. Simulations have been performed 10 times with different sets of Random Features and the corresponding mean is plotted. A quantizer according to embodiments herein is used with 7 levels of quantization, that is, 3 bits per element in any transmitted array. The learning rates are η=0.01 and γ=0.9η=0.009.
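By way of a non-limiting illustration, a random feature map for Gaussian kernels with the bandwidths and feature dimension listed above can be sketched as follows. The sine/cosine construction, the Gaussian sampling of frequencies, and the reading of σ as the kernel standard deviation are assumptions based on the standard random Fourier feature technique; the function names are hypothetical.

```python
import math
import random


def sample_frequencies(num_features, dim, sigma, rng):
    """Draw D frequency vectors v ~ N(0, sigma^-2 I), the spectral density of a Gaussian kernel."""
    return [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)] for _ in range(num_features)]


def random_features(x, freqs):
    """Map x to a 2D-dimensional vector z_v(x) = D^(-1/2) [sin(v^T x), cos(v^T x)]."""
    scale = 1.0 / math.sqrt(len(freqs))
    proj = [sum(vi * xi for vi, xi in zip(v, x)) for v in freqs]
    return [scale * math.sin(p) for p in proj] + [scale * math.cos(p) for p in proj]


def gaussian_kernel(x, y, sigma):
    """Exact Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)), used only for comparison."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


rng = random.Random(0)
x, y = [0.3, -0.2], [0.1, 0.4]

# One set of D = 20 frequencies per kernel bandwidth, mirroring sigma in {1, 3, 5}.
feature_maps = {s: sample_frequencies(20, 2, s, rng) for s in (1.0, 3.0, 5.0)}
z_x = random_features(x, feature_maps[1.0])
```

The inner product of two feature vectors approximates the kernel value, and the approximation improves as D grows; with D=20 the estimate is coarse, which is consistent with averaging over several random feature draws.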
Although not shown in the figure, observations from experimental results indicate that, for values of η and γ chosen in
Results indicate that more densely connected communication graphs lead to better performance. The algorithm is also tested for M≥7 levels of quantization, which satisfy the condition δ>0 in (11), and all of them perform as well as the non-quantized version, with loss function differences on the order of $10^{-6}$.
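By way of a non-limiting illustration, an element-wise random quantizer of the form $\|v\|\,\mathrm{sign}(v_i)\,\xi_i(v, M)$ with $M = 2^b - 1$ levels can be sketched as follows. The rule used for $\xi_i$ (stochastic rounding to the nearest of the M levels, which makes the quantizer unbiased) is an assumption, since the exact definition is not reproduced in this text.

```python
import math
import random


def quantize(v, bits, rng):
    """Randomly quantize v element-wise as ||v|| * sign(v_i) * xi_i with M = 2**bits - 1 levels."""
    levels = 2 ** bits - 1
    norm = math.sqrt(sum(vi * vi for vi in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    out = []
    for vi in v:
        scaled = abs(vi) / norm * levels          # position in [0, levels]
        lower = math.floor(scaled)
        # Assumed rule: round up with probability equal to the fractional part.
        round_up = 1 if rng.random() < scaled - lower else 0
        xi = (lower + round_up) / levels
        out.append(norm * math.copysign(1.0, vi) * xi)
    return out


rng = random.Random(0)
v = [0.5, -1.0, 0.25, 0.0]
q = quantize(v, bits=3, rng=rng)                  # 7 levels, 3 bits per element
```

With 3 bits the sketch uses 7 levels, matching the setup described above; a transmitter would also need to send the norm, which is omitted here.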
The results show that a gossiped OMKL algorithm (such as process 800) can successfully extend an OMKL algorithm to non-complete graphs and extend distributed learning algorithms to multi-kernel learning.
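By way of a non-limiting illustration, the gossip averaging step on a non-complete graph can be sketched as follows. The ring topology, the uniform 1/3 weights, and the scalar node states are hypothetical choices for the example, and the sketch omits quantization of the exchanged states.

```python
def ring_gossip_matrix(n):
    """Doubly stochastic matrix for a ring of n >= 3 nodes: weight 1/3 on self and on each neighbor."""
    w = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in (j - 1, j, j + 1):
            w[j][i % n] = 1.0 / 3.0
    return w


def gossip_round(states, weights):
    """One round: each node replaces its state with the weighted average of its neighbors' states."""
    n = len(states)
    return [sum(weights[j][i] * states[i] for i in range(n)) for j in range(n)]


weights = ring_gossip_matrix(5)
states = [1.0, 2.0, 3.0, 4.0, 5.0]                # hypothetical local values, one per node
for _ in range(50):
    states = gossip_round(states, weights)
```

Because the matrix is doubly stochastic, the network-wide average is preserved at every round while the node states contract toward it, even though each node only talks to its two ring neighbors.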
Embodiments are directed to federated learning, and in particular asynchronous federated learning, which may be faster and more scalable compared to synchronous counterparts. Embodiments can also provide solutions to communication bottlenecks by providing Quantized Asynchronous Federated Learning (QAFeL), which may be configured to include a hidden-state quantization scheme to avoid error propagation caused by direct quantization. QAFeL may also be provided with a buffer to aggregate client updates, thus ensuring scalability and compatibility with techniques such as secure aggregation. For stochastic gradient descent on non-convex objectives, QAFeL may achieve an ergodic convergence rate. Processes and configurations may be performed without imposing restrictions on the choice of step size and without assuming uniform client arrivals.
A conventional machine learning pipeline may collect data from clients at a central server and then train a model on the collected data before the model is deployed for use. Such centralized approaches may have major drawbacks, including requiring a large amount of storage at the central server and, more importantly, raising privacy concerns when sensitive data is collected. Embodiments may provide decentralized learning, which can address privacy concerns. With Federated Learning (FL), clients may train local models and send the local models to a server for aggregation. In FL, local data may be used only to train local models, such that the local data does not leave client devices. By way of example, the clients may send local model updates to a server for aggregation and to update a global model. Inclusion of new updates may improve the global model accuracy. FL may be used for a variety of applications including healthcare, finance, and natural language processing.
FL may have characteristics different from traditional distributed optimization. First, data originates from clients and cannot be shared with a server. Second, clients may be heterogeneous (i.e., clients have access to different data sets with different speeds and communication bandwidths). In addition, ML models may be large, and communicating models may be costly due to bandwidth constraints.
Process 1200 illustrates Quantized Asynchronous Federated Learning (QAFeL), which may include a bidirectional quantization scheme for asynchronous FL with buffered aggregation. To address error propagation, process 1200 uses a common hidden state obtained by aggregating all communicated messages. According to embodiments, a server device may quantize and broadcast the difference between a hidden state and its updated model. Similarly, clients may quantize the difference between their updated model and the corresponding hidden state version. Using process 1200, a QAFeL process is provided that avoids error propagation and is scalable, as only the hidden state needs to be tracked. Process 1200 may be a privacy-aware system that does not track client states.
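By way of a non-limiting illustration, the shared hidden state can be sketched as follows: the server quantizes the difference between its model and the hidden state, and both sides apply the same message to their own copy. The rounding quantizer, its step size, the class and function names, and the toy server updates are hypothetical stand-ins and are not the quantizers of the disclosure.

```python
def round_quantizer(vec, step=0.25):
    """Hypothetical stand-in quantizer: round each entry to a multiple of `step`."""
    return [step * round(v / step) for v in vec]


class HiddenState:
    """Copy of the shared hidden state, kept by the server and by every client."""

    def __init__(self, initial):
        self.value = list(initial)

    def apply(self, q):
        # Same update on both sides: x_hat <- x_hat + q.
        self.value = [h + d for h, d in zip(self.value, q)]


def server_broadcast(server_model, hidden):
    """Quantize the difference between the server model and the hidden state."""
    return round_quantizer([x - h for x, h in zip(server_model, hidden.value)])


server_hidden = HiddenState([0.0, 0.0])           # pre-agreed initial model
client_hidden = HiddenState([0.0, 0.0])
server_model = [0.0, 0.0]
for step_update in ([0.3, -0.1], [0.3, -0.1], [0.3, -0.1]):
    server_model = [x + u for x, u in zip(server_model, step_update)]
    q = server_broadcast(server_model, server_hidden)
    server_hidden.apply(q)                        # server-side copy
    client_hidden.apply(q)                        # client-side copy receives the same message
```

Because the quantizer is applied to the difference from the hidden state rather than to the model itself, the quantization error from one round is folded into the next difference instead of accumulating, and the server never needs to track per-client state.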
where $\mathbb{E}_q$ is the expectation with respect to the possible internal randomness of the quantizer.
According to embodiments,
According to embodiments, process 1300 may be initiated at block 1305 by at least one client initializing hidden states with $\hat{x}_0$, which is the same as the initial server model $x_0$. At block 1310, the server waits for a client update. To start training at block 1340, the client copies the locally stored hidden state into a variable $y_0 \leftarrow \hat{x}_t$ at block 1345, and performs P local model update steps of the type
$$y_p \leftarrow y_{p-1} - \eta_\ell\, g(y_{p-1})$$
at block 1350, where $g(y_p)$ is a noisy, unbiased estimator of the gradient at $y_p$ based on the local dataset and $\eta_\ell$ is a local step-size. Then, the client sends the quantized update $Q_c(y_0 - y_{p-1})$ to the server at block 1355, where $Q_c$ is the client's quantizer. The server adds the received updates to its buffer at block 1315. The k-th update in the buffer is denoted as $\Delta_k$. The server keeps receiving updates until the buffer is full (decision block 1320), i.e., the server has received K updates. Then the server updates the global model at block 1325 by averaging the received updates:
When the predefined number of iterations T is reached (“Yes” path of decision block 1330), the server outputs the model and the training stops at block 1335. Otherwise (“No” path of decision block 1330), the server computes the quantized difference between its updated model and the hidden state at block 1375,
$$q_t = Q_s(x_{t+1} - \hat{x}_t),$$
where $Q_s$ is the server's quantizer. The server broadcasts $q_t$ to all clients at block 1370. The clients, at block 1360 and block 1365, and the server, at block 1380, update their copies of the hidden state using the same equation, $\hat{x}_{t+1} = \hat{x}_t + q_t$.
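By way of a non-limiting illustration, the flow of process 1300 can be sketched as a single-threaded simulation. Several elements are assumptions made only for the example: the rounding quantizer standing in for both $Q_c$ and $Q_s$, the toy quadratic client losses defined by hypothetical targets, the sign convention of the client difference (final iterate minus starting point, added by the server), and the sequential client sampling, which does not model staleness.

```python
import random


def quantize(vec, step=0.01):
    """Hypothetical stand-in for Q_c and Q_s: round entries to multiples of `step`."""
    return [step * round(v / step) for v in vec]


def client_update(hidden, target, local_steps, local_lr, rng):
    """Copy the hidden state, run P noisy gradient steps, return the quantized difference."""
    y = list(hidden)                                    # y_0 <- x_hat
    for _ in range(local_steps):
        # Noisy gradient of the toy loss 0.5 * ||y - target||^2.
        grad = [yi - ti + rng.gauss(0.0, 0.01) for yi, ti in zip(y, target)]
        y = [yi - local_lr * gi for yi, gi in zip(y, grad)]
    # Assumed sign convention: final iterate minus starting point.
    return quantize([yi - hi for yi, hi in zip(y, hidden)])


def run_qafel(targets, buffer_size, rounds, local_steps=5, local_lr=0.1,
              global_lr=1.0, seed=0):
    """Simulate `rounds` >= 1 buffered server updates; returns (server model, hidden state)."""
    rng = random.Random(seed)
    dim = len(targets[0])
    server_model = [0.0] * dim
    hidden = [0.0] * dim                                # shared by server and clients
    buffer = []
    completed = 0
    while True:
        target = rng.choice(targets)                    # sample a client
        buffer.append(client_update(hidden, target, local_steps, local_lr, rng))
        if len(buffer) < buffer_size:
            continue                                    # wait until K updates have arrived
        avg = [sum(col) / buffer_size for col in zip(*buffer)]
        server_model = [x + global_lr * a for x, a in zip(server_model, avg)]
        buffer.clear()
        completed += 1
        if completed == rounds:
            return server_model, hidden                 # T reached: output the model
        q = quantize([x - h for x, h in zip(server_model, hidden)])
        hidden = [h + qi for h, qi in zip(hidden, q)]   # x_hat <- x_hat + q_t


# Two hypothetical clients whose local optima differ.
model, hidden = run_qafel(targets=[[1.0, 1.0], [3.0, -1.0]], buffer_size=4, rounds=200)
```

Only quantized differences cross the network in either direction, and the server stores a single hidden state regardless of how many clients participate.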
While this disclosure has been particularly shown and described with references to exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the claimed embodiments.
This application claims priority to U.S. provisional application No. 63/497,692 titled SYSTEMS AND METHODS FOR QUANTIZED MACHINE LEARNING, FEDERATE LEARNING AND BIDIRECTIONAL NETWORK COMMUNICATION filed on Apr. 21, 2023, the content of which is expressly incorporated by reference in its entirety.
This invention was made with Government support and is supported in part by Grant No. ECCS-2207457 awarded by the National Science Foundation. The Government has certain rights in the invention.
Number | Date | Country | |
---|---|---|---|
63497692 | Apr 2023 | US |