The invention concerns telecommunications networks. It relates to the learning of neural networks implemented by devices connected to a communication network.
The invention more specifically lies in the context of federated learning, in which devices locally train neural network models of the same structure and share the learning carried out on their devices with the other devices.
Federated learning contrasts with centralized learning, in which the learning is done centrally, for example on the servers of a service provider.
For more information on federated learning, those skilled in the art may refer to the document «H. Brendan McMahan, E. Moore, D. Ramage, S. Hampson, and B. Agüera y Arcas, "Communication-efficient learning of deep networks from decentralized data," Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, vol. 54, 2017».
Federated learning can, for example, be favored over centralized learning when it is difficult to envisage a global centralized model suited to all devices.

The use of federated learning can also be advantageous when the devices are likely to train their models with data whose distribution is likely to depend, at least to a certain extent, on these devices.
In recent years, the federated learning approach has attracted a lot of interest in many fields, such as healthcare, banking, Industry 4.0 or smart cities, because it can help build better global models while preserving the confidentiality of local files (medical records, financial files, etc.). It can provide a natural solution to the growing need for personal data protection, while addressing current technological challenges: reducing energy consumption and minimizing latency, two challenges raised by the deployment of 5G technology. As explained above, federated learning is thus a form of distributed learning in which several nodes collaboratively solve a machine learning task.

For some applications, the data collected by users in real contexts often have non-«independent and identically distributed» (non-IID) distributions (unlike random variables following the same probability law), which can have a significant impact on the convergence of the models during federated learning, in particular when a single joint model may not match the objective of each node.
According to a first aspect, the invention concerns a method for configuring models of neural networks of nodes from a set of nodes of a communication network, the neural networks of said nodes all having the same structure.
In particular, the invention concerns a method for configuring weights of models of neural networks (NN) of the same structure, of nodes from a set of nodes of a communication network, said method including a federated learning of said weights in which said nodes locally train their model of neural networks and share the weights of their model with other nodes of said network, the method including:
In at least one embodiment, said designation is temporary, the method comprising at least one other designation for at least one other partition of said set of nodes.
In at least one embodiment, the configuration method comprises, during said federated learning:
In at least one embodiment, the configuration method includes a partition of the set of nodes into at least one cluster by taking into account a communication cost between the nodes within said at least one cluster.
In at least one embodiment, the configuration method includes a partition of the set of nodes to reorganize said clusters into at least one cluster, said reorganized clusters being constituted according to a function taking into account a communication cost between the nodes within a reorganized cluster and a similarity of a change in the weights of the models of the nodes within a reorganized cluster.
In at least one embodiment, said similarity is determined by:
In at least one embodiment, the configuration method includes:
In at least one embodiment, the configuration method includes:
Correlatively, the invention relates to a coordination entity able to configure models of neural networks of nodes from a set of nodes of a communication network, the neural networks of said nodes all having a model of the same structure,
In particular, the invention concerns a coordination entity able to configure weights of models of neural networks of the same structure, of nodes from a set of nodes of a communication network, by federated learning of said weights in which said nodes locally train their models of neural networks and share the weights of their model with other nodes of said network, said coordination entity comprising at least one processor capable of:
According to at least one embodiment, the coordination entity comprises:
According to at least one embodiment, said coordination entity includes:
According to a second aspect, the invention concerns a learning method implemented by a node from a set of nodes of a communication network.
In particular, the invention concerns a learning method implemented by a node from a set of nodes including neural networks having a model of the same structure, of a communication network, said method including, before federated learning of the weights of said models of the neural networks of the nodes of said set, in which said nodes locally train their model of neural networks and share the weights of their model with other nodes called aggregation nodes of said network:
According to at least one embodiment, the learning method comprises, when said node is said aggregation node:
According to at least one embodiment, the learning method comprises, when said node is said aggregation node:
According to at least one embodiment, the learning method includes, when said node is said aggregation node: if it is determined that said cluster must be restructured, restructuring said cluster by grouping at least part of the nodes of said cluster into at least one subcluster, said subclusters being constituted according to a function taking into account a communication cost between the nodes within one said subcluster and a similarity of a change in the weights of the models of the nodes within one said subcluster.
According to at least one embodiment, said restructuring of said cluster includes sending, to said entity of said communication network, the identifier of an isolated node of said cluster.
According to at least one embodiment, the learning method comprises, when said node is not said aggregation node:
According to at least one embodiment, said method is implemented by a node belonging to a first cluster, and said entity of said communication network is:
According to at least one embodiment, the invention concerns a learning method implemented by a node from a set of nodes of a communication network, said node being able to play the role of aggregation node in a cluster of nodes from the set of nodes, the nodes of this set including a neural network, the neural networks of these nodes all having a model of the same structure. This method includes:
Correlatively, the invention concerns a node belonging to a set of nodes of a communication network. In particular, the invention concerns a node belonging to a set of nodes including neural networks having a model of the same structure, of a communication network, said node including at least one processor able to
According to at least one embodiment, the node comprises:
According to at least one embodiment, the invention relates to a node belonging to a set of nodes of a communication network, said node being able to play the role of aggregation node in a cluster of nodes of said set of nodes, the nodes of this set including a neural network, the neural networks from said nodes all having a model of the same structure. This node includes:
According to some embodiments, the invention also targets a system including a coordination entity and at least one node as mentioned above.
The invention proposes federated learning in which nodes of the network can communicate or receive weights (or parameters) or changes in the weights of the models of their neural networks.
These nodes can be communication devices of any type. They can in particular be terminals or connected objects (IoT, for Internet of Things), for example cell phones, laptops or home equipment (for example gateways), or private or public equipment, particularly equipment of an operator of a telecommunications network, for example access points, core network equipment, servers dedicated to the invention or servers implementing functions of the operator for the implementation of a service in the network. The nodes Ni can be fixed or mobile. They can be virtual machines.
In one embodiment, the nodes each have access to a local dataset.
Thus, the invention can be implemented, in a non-limiting manner, within the framework of applications or services of a communication network for which it is not possible or desirable for the devices of the network to communicate their data either to each other or to a centralized entity.
The invention can be implemented with all types of datasets, for example when the data of the local datasets are not «independent and identically distributed » (IID) data, but non-IID data.
In one particular embodiment, the nodes are grouped (partitioned) into clusters (or groups of nodes), these being likely to vary dynamically to help, for example, the convergence of the models shared by the nodes of the same cluster.
More specifically, the partition of the nodes into clusters can vary: the structure of a cluster (namely, in particular, the set of nodes that compose it) is likely to vary over time.
Thus, in some particular embodiments, a coordination entity is configured to partition or repartition the set of nodes into clusters, and to designate an aggregation node in at least some of these clusters.
In some particular embodiments of the invention, at least some nodes of the set of nodes are able to play this role of aggregation node.
In some particular embodiments of the invention, when the coordination entity has defined a new partition of the nodes into clusters and designated the nodes that must play the role of aggregation node within their clusters, the coordination entity sends information to these nodes so that they play this role of aggregation node within their cluster. It also tells them the identifiers of the nodes of the cluster.
In one particular embodiment of the invention, it is considered not only that each node of a cluster includes its own model, but also that each cluster includes its own model.
In some embodiments, the aggregation node of a cluster manages the aggregate model of at least that cluster.
In one particular embodiment of the invention, each cluster includes an aggregation node which manages the aggregate model of this cluster.
In one particular embodiment of the invention, the aggregate model of a cluster is obtained by aggregation of the weights of the models of the nodes of the cluster, trained with datasets local to these nodes.
The nodes of a cluster that train their models with their local datasets and that contribute to the construction of the aggregate model of the cluster can be for example referred to as worker nodes.
In some embodiments of the invention, a node may be able to play the role of aggregation node, to play the role of worker node, or to play both roles.
In one embodiment of the invention, the role of a node can vary over the partitions, for example be redefined at each new partition.
Thus, in one particular embodiment, the learning method is implemented by a node which, in addition to being able to play the role of aggregation node, is further able to play the role of worker node. In this embodiment, an entity of the communications network can specifically inform the node that it must play the role of worker node.
As a variant, a node implicitly understands that it must play the role of worker node when it receives, from an entity of the communication network, the identifier of an aggregation node of a cluster to which it belongs.
The fact of being able to change the roles of the nodes over the iterations, and particularly the fact that worker nodes can at least temporarily play the role of aggregation node, makes it possible to constitute clusters in a much more flexible way than in the methods of the prior art, in which the aggregation, when it exists, is carried out by servers.
When a node plays the role of worker node, it receives, from the aggregation node of its cluster, weights of a model having the structure of the models of all the nodes of the set to initialize the weights of its own model, and it transmits to this aggregation node the weights of its model trained with a dataset local to this node.

In one embodiment of the invention, the aggregation node of a cluster relays the communication between the nodes within the cluster. In this embodiment, if the communication cost between two nodes is used as a criterion (whether as the sole criterion or not) to determine the clusters of a partition of nodes, the communication cost within a cluster can be the sum of the communication costs between the aggregation node of the cluster and each of the nodes of the cluster.
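By way of illustration only, a minimal sketch of this intra-cluster cost computation is given below; the function name and the cost table are assumptions, not elements of the description:

```python
# Minimal sketch, assuming `cost[(i, j)]` holds a symmetric pairwise
# communication cost between nodes i and j (keyed with i < j).
def cluster_cost(aggregator: int, members: list[int], cost: dict) -> float:
    # Intra-cluster cost when all traffic is relayed by the aggregation node:
    # sum of the costs between the aggregation node and each other member.
    return sum(cost[tuple(sorted((aggregator, m)))]
               for m in members if m != aggregator)
```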
In one embodiment, to limit (for example minimize) communication costs, the aggregation node of a cluster is chosen in the vicinity of the nodes of the cluster.
In one embodiment of the invention, the aggregation node of a cluster is one of the nodes from the aforementioned set of nodes. In that case, it manages not only the model of the cluster but also its own model, as described previously.
In one embodiment of the invention, the aggregation node of a cluster relays the communication between the coordination entity and the nodes of its cluster.
In some particular embodiments of the invention, the aggregation node of a cluster has the possibility to reorganize its cluster, particularly to create subclusters within its cluster or to exclude nodes from its cluster.
In one particular embodiment, several cluster levels can be used, and the model of a cluster of level n can be obtained by aggregation of the models of the clusters of level n+1. In this embodiment, the aggregation node of a cluster of level n can for example relay the communications with the aggregation nodes of the clusters of level n−1 and/or of level n+1.
In one embodiment of the invention, it can be considered that the coordination entity is an aggregation node of the lowest level, by a convention of level 0 for example.
In one embodiment of the invention, the entity of the network which sends, to a node of a cluster of level n, the information according to which this node must play said role of aggregation node in this cluster, the identifiers of the nodes of this cluster and the weights of a global model to the set of nodes is:
Likewise, in one embodiment of the invention, the entity of the network which sends to a node the information according to which it must play the role of worker node in a cluster of level n and the identifier of an aggregation node of this cluster is:
In one particular embodiment, the aggregate model of each cluster is sent to the cluster of lower level, for example conditionally, such as after a constant number of iterations. The aggregate models can thus go up to the coordination entity, which can aggregate these models into an updated version of the global model.

This global model can then go down to all the nodes for a new implementation of the method, either directly or via the aggregation nodes.
In some embodiments of the invention, the partition of the nodes into clusters can take into account a communication cost between the nodes of at least one cluster, or take into account at least one service implemented by at least one of the nodes; other criteria can also be used.

For example, in one particular embodiment of the invention, the clusters of the partition of the nodes (in the initial partition, for example) are determined to minimize a communication cost between the nodes of each cluster. The clusters of the partition (such as the initial partition) can also be determined to favor the grouping of nodes which implement the same service in the communication network, or they can be created randomly.
Considering the communication cost between the nodes of a cluster, either for the initialization or for the reorganization of the clusters, can help reduce the communication cost. Indeed, if the nodes are grouped by geographical area and the weight updates are only shared between nodes of the same geographical area, communication latency and energy consumption are reduced, since both are increasing functions of the distance between the two nodes exchanging the weights.
In addition, in some cases there may be a correlation between the non-IID distribution of the data and the geographic distribution of the devices.
In one particular embodiment of the invention, the weights of the model of a cluster can be obtained by aggregation of the weights of the models of the nodes that compose this cluster. The nodes communicate the weights (or as a variant the gradients) of their models, resulting from local calculations from their local datasets. Thus, the data remain local and are not shared or transferred, which ensures data privacy, while achieving the learning objective.
The invention is in this sense very different from the federated multi-task optimization method described in the document «V. Smith, C. K. Chiang, M. Sanjabi, and A. Talwalkar, “Federated multi-task learning,” Advances in Neural Information Processing Systems, vol. 2017-Decem, no. Nips, pp. 4425-4435, 2017 » which does not propose to group the nodes into clusters.
Different aggregation methods can be used to update the aggregate model of a cluster of level n from the aggregate models of the clusters of higher level n+1 or from the models of the nodes that compose this cluster of level n.
In one particular embodiment, the aggregation method used to update:
For example, the «Federated Average» method (average weighted by the size of the dataset of the nodes) presented in the document «H. Brendan McMahan, E. Moore, D. Ramage, S. Hampson, and B. Agüera y Arcas, “Communication-efficient learning of deep networks from decentralized data,” Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, vol. 54, 2017 » can be used.
For example, the Coordinate-wise median method presented in the document «D. Yin, Y. Chen, K. Ramchandran, and P. Bartlett, "Byzantine-robust distributed learning: Towards optimal statistical rates," 35th International Conference on Machine Learning, ICML 2018, vol. 13, pp. 8947-8956, 2018» can also be used.
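By way of illustration, a hedged sketch of these two cited aggregation rules is given below, assuming each model is flattened into a NumPy vector and that the local dataset sizes are known:

```python
import numpy as np

def federated_average(weights: list[np.ndarray], sizes: list[int]) -> np.ndarray:
    # «Federated Average»: average weighted by the size of each node's dataset.
    s = np.asarray(sizes, dtype=float)
    return np.average(np.stack(weights), axis=0, weights=s / s.sum())

def coordinate_wise_median(weights: list[np.ndarray]) -> np.ndarray:
    # Byzantine-robust alternative: median taken coordinate by coordinate.
    return np.median(np.stack(weights), axis=0)
```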
In one particular embodiment, the method includes a loop implemented within each cluster, sketched below. At each iteration, the aggregate model of the cluster is communicated to each of the nodes of the cluster; each node of the cluster updates its model, for example by performing a gradient descent with its local data, and returns either its new model or the change (update) of its model, i.e. the difference between the weights at the current iteration and at the previous iteration, so that it is aggregated at the level of the aggregate model of the cluster and returned to the nodes of the cluster at the next iteration. This loop may or may not include a constant number of iterations. For example, it can stop when a stopping condition is met.
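The loop just described can be sketched as follows; `local_update` and `aggregate` are hypothetical placeholders for a node's local gradient descent and for one of the aggregation rules above, and the stopping condition shown (a norm threshold `eps`) is only one possible choice:

```python
import numpy as np

def cluster_loop(theta, nodes, local_update, aggregate, rounds=10, eps=None):
    for _ in range(rounds):
        # Each node trains locally from the current cluster model and
        # returns the change of its model (the update Δθ_i).
        updates = [local_update(node, theta.copy()) for node in nodes]
        delta = aggregate(updates)       # aggregate the node updates
        theta = theta + delta            # new aggregate model of the cluster
        if eps is not None and np.linalg.norm(delta) < eps:
            break                        # optional stopping condition
    return theta
```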
In one embodiment of the invention, the coordination entity determines how the weights of the global model change, for example to what extent this global model continues to converge, and decides whether or not to redefine the clusters.
In one embodiment of the invention, this determination can comprise obtaining a representation of the global model in the form of a vector whose coordinates are the changes in the weights of this model, and the decision whether or not to redefine the clusters can take into account the norm of this vector, for example via a comparison of this norm with a constant value.
In one particular embodiment, the reorganization of the clusters is a reorganization of the set of nodes into a new partition of clusters of nodes. Optionally, new aggregation nodes can be defined for at least some of the clusters. These can be, for example, nodes of these reorganized clusters.
As a variant, other reorganizations could be envisaged, for example only for the nodes of some clusters.
In one embodiment, during this reorganization, the reorganized clusters are constituted according to a function taking into account:
For example, it may be sought to limit or minimize at least one of the elements above or a combination of these elements.
The fact of taking into account the similarity of the change in the weights of the models of the nodes to constitute the clusters of nodes can help group nodes which a priori have similarities in their local datasets, without sharing information on these local datasets. Such embodiments can help solve a problem of statistical heterogeneity. Indeed, by constituting clusters which group nodes having similar data distributions, statistical heterogeneity is greatly reduced within the clusters.
In one particular mode of implementation of the invention, this similarity is determined by:
These requests can be made to the nodes directly by the coordination entity. As a variant, they can be carried out or relayed by the aggregation nodes.
In one particular embodiment, the changes in the weights of the models are represented in the form of vectors and the similarity of the changes in the weights of the models of the different nodes is for example determined by a method called cosine similarity.
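A minimal sketch of such a comparison, assuming the changes in the weights have been flattened into vectors:

```python
import numpy as np

def cosine_similarity(du: np.ndarray, dv: np.ndarray) -> float:
    # Cosine of the angle between two weight-change vectors: close to 1 when
    # the two models change in the same direction.
    return float(np.dot(du, dv) / (np.linalg.norm(du) * np.linalg.norm(dv)))
```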
In one particular mode of implementation of the invention, the weights of the updated global model are returned to each of the nodes, either directly or via the aggregation nodes of the clusters thus reorganized. The nodes can thus update their model with the global model. The aggregate models of the reorganized clusters can also be updated with the global model.
In one particular embodiment, these new clusters are then constituted by nodes selected according to a proximity criterion (communication cost for example) and whose models are likely to change in the same way.
It can be considered that these steps complete a general initialization phase and that a phase which can be referred to as “optimization phase” then begins, during which at least some of the clusters will be able to be reorganized, for example by creating subclusters or by excluding some of their nodes.
In one particular embodiment of the invention, this phase can include a loop implemented within each reorganized cluster, identical or similar for example to that of the initialization phase. At each iteration, the aggregate model of the reorganized cluster is communicated to each of the nodes of this cluster, each of the nodes updates its model by performing a gradient descent with its local dataset and returns either its new model or the change of its model so that it is aggregated at the level of the aggregate model of the reorganized cluster and returned to the nodes of this cluster at the next iteration. This loop can include a constant or variable number of iterations. For example, it can stop when a stopping condition is met.
In one particular embodiment, the learning method includes a step of determining whether at least one reorganized cluster must be restructured.
In one particular embodiment, it is determined whether a reorganized cluster must be restructured according to a convergence criterion which takes into account a change in the weights of said reorganized cluster and/or a change in the weights of the nodes of the reorganized cluster. For example, it may be a double convergence criterion taking into account a change in the weights of said reorganized cluster and a change in the weights of the nodes of the reorganized cluster.
In at least one embodiment of the invention, it is determined that a reorganized cluster must be restructured if the following conditions are met:
In one embodiment of the invention, to verify the first criterion (1), the global model is represented in the form of a vector whose coordinates are constituted by the changes in the weights of this model and the norm of this vector is compared with a numerical value, used for example as a threshold value. This value can be a constant or a value which depends for example on the level of the cluster or on the number of iterations already carried out.
In one embodiment of the invention, to verify the second criterion (2), a similarity is determined between the change of each of the nodes of the cluster and the change that the cluster would have if it were deprived of this node. For example, for a given node:
In one particular embodiment, the restructuring of a cluster includes the grouping of at least part of the nodes of this cluster into at least one subcluster, these subclusters being constituted according to a function taking into account a communication cost between the nodes within one said subcluster and a similarity of a change in the weights of the models of the nodes within one said subcluster (to minimize this function for example).
This step is similar to the reorganization step described previously (initialization phase) except that it only applies to the nodes of the cluster to be restructured and not to all the nodes.
In one particular embodiment, if at least one node, called “isolated node”, of a cluster to be restructured is not allocated to a subcluster, this node can be allocated to another cluster.
To this end, in one embodiment, when the aggregation node of a cluster of level n detects an isolated node, it sends the identifier of this isolated node to an entity of the communication network so that this node is reallocated in another cluster.
This entity can for example be a coordination entity as mentioned above, or a node which plays the role of aggregation node in a cluster of level n−1.
In one particular embodiment of the invention, the reallocation of an isolated node to another cluster is carried out by the coordination entity mentioned above. Consequently, in one embodiment, the configuration method includes:
In one particular embodiment, the methods mentioned above are implemented by a computer program.
Consequently, the invention also relates to a computer program on a recording medium, this program being able to be implemented by a coordination entity or, more generally, by a computer. This program includes instructions adapted to the implementation of a configuration method or a learning method as described above. These programs can use any programming language, and be in the form of source code, object code, or intermediate code between source code and object code, such as in partially compiled form, or in any other desirable form.
The invention also relates to an information medium or a recording medium readable by a computer, and including instructions of a computer program as mentioned above.
The information or recording medium can be any entity or device capable of storing the programs. For example, the media can include a storage means, such as a ROM, for example a CD ROM or a microelectronic circuit ROM, or a magnetic recording means, for example a floppy disk or a hard drive, or a flash memory.
On the other hand, the information or recording medium may be a transmissible medium such as an electrical or optical signal, which can be conveyed via an electrical or optical cable, by radio link, by wireless optical link or by other means.
A program according to the invention can be particularly downloaded onto an Internet-type network.
Alternatively, the information or recording medium can be an integrated circuit in which a program is incorporated, the circuit being adapted to execute or to be used in the execution of one of the methods in accordance with the invention.
Other characteristics and advantages of the present invention will emerge from the description given below, with reference to the appended drawings which illustrate exemplary embodiments devoid of any limitation. In the figures:
In at least one embodiment, the nodes Ni each have access to a local dataset dsi.
In at least one embodiment, if the set of nodes Ni is considered, the data of the local datasets dsi of these nodes are non-IID data.
In practice, the distribution of the local data dsi of a node Ni is not known, and this distribution is moreover likely to vary over time as the node Ni acquires or generates new data and/or as some data become obsolete.

Each node Ni can acquire or generate the data dsi of its local dataset. These data dsi may for example be signaling or monitoring data of the communication network, for example quality of service data, statistics on the communication network, or performance indicators of the communication network. They may also be data representative of the use of the node Ni, for example durations, locations or ranges of use of the node Ni, data on the profiles of the users of the node Ni, or data on the services accessed or offered by the node Ni. They may also be data acquired by the node Ni or by a sensor of the node Ni, for example meteorological data, or measurements of temperature, consumption, use, wear, etc. They may also be data entered or acquired by a user of the node Ni, for example textual data (message contents, etc.), images, videos, voice messages, or audio recordings.

In at least one embodiment, the local data dsi of a node Ni may be sensitive data in the sense that these data must not be shared with or communicated to the other nodes. For example, they may be data private to a user of the node, such as personal data.
In some embodiments, a communication cost between two nodes Ni, Nj can be known. For example, the nodes are geolocated (for example thanks to their GPS coordinates), and the communication cost between two nodes is constituted by the geographical distance between these nodes. In at least one other embodiment, the communication cost between two nodes can be a measurement of the throughput, latency or bandwidth of a communication between these nodes.
In one embodiment of the invention, the structures (number and topology of the layers) of the models of the neural networks NN of the different nodes Ni are identical. But the weights (or parameters) of the models of these networks are potentially different, since these networks are trained from different local datasets dsi.
The training of the neural network of a node Ni to obtain a more efficient model can comprise a few iterations (or rounds) of a gradient descent. More specifically, once the weights of the network have been initialized, during an iteration, a node Ni can perform a gradient descent over E epochs (that is to say, it computes gradients using, for example, each of its local data dsi E times).
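A hedged sketch of such a local round is given below; `grad` is a hypothetical function returning the gradient of the local loss on a mini-batch, and E, B and eta (η) are the number of epochs, the batch size and the learning rate mentioned later in the description:

```python
import numpy as np

def local_round(theta, data, grad, E=5, B=32, eta=0.01, rng=np.random):
    theta0 = theta.copy()
    for _ in range(E):                        # E passes over the local data
        idx = rng.permutation(len(data))
        for start in range(0, len(data), B):  # mini-batches of size B
            batch = [data[i] for i in idx[start:start + B]]
            theta = theta - eta * grad(theta, batch)
    return theta - theta0                     # the update Δθ_i sent back
```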
In at least one embodiment of the invention, and as represented in the figures, the nodes Ni are grouped into clusters Cj.
In at least one of the embodiments described here, in each cluster Cj, one of the nodes Ni represented in black is an aggregation node Aj of the cluster Cj. In at least one of the embodiments described here, the nodes Ni within a cluster Cj communicate only via the aggregation node Aj so that the communication cost between two nodes within a cluster is the sum of the communication costs between each of these nodes and the aggregation node Aj of this cluster.
The number of cluster levels can be arbitrary, each aggregation node of level n greater than or equal to 1 being configured to communicate with an aggregation node of lower level n−1, with the convention introduced above.

In at least one of the embodiments described here, and for the sake of simplification, only two aggregation levels (levels 0 and 1) will be considered, and the lowest level, level 0, is constituted by an aggregation node A0 (coordination entity within the meaning of the invention).

In at least one of the embodiments described here, the aggregation node A0 of level 0 is able to communicate directly with each of the nodes Ni; however, so as not to overload the figures, these direct communications are not all represented.
In at least one of the embodiments described here, each aggregation node Aj is configured to constitute an aggregate model for the cluster Cj from the local models of the nodes Ni of this cluster.
In the same way, each aggregation node of level n, n greater than or equal to 0, is configured to constitute an aggregate model of level n from the aggregate models of the clusters of level n+1.
Generally, and as described specifically below, in at least one of the embodiments described here:
In at least one embodiment of the invention, the clusters resulting from the partitions and the successive restructuring are constituted by pursuing a dual objective, namely:
The clusters determined by the invention may result, for example, from a compromise between these two objectives.
In at least one of the embodiments described here, the nodes Ni do not communicate their local datasets dsi to the other nodes, to the aggregation nodes, or to the coordination entity. In such an embodiment, the local datasets dsi therefore cannot be used directly to distribute the nodes in the clusters.

Consequently, the nodes Ni whose updates Δθ_i^t of their models change in the same direction can for example be grouped into clusters (or subclusters): the updates Δθ_i^t of the models can be considered representative of the distributions of the datasets on which these models were trained, and aggregating models that change in the same direction can help obtain an aggregate model that changes in that same direction.
In one particular embodiment, and as represented in the figures, the change Δθ_i of the model of a node is represented by a vector whose origin represents the weights of the model before its change, in other words a reference model θ̂_n, and whose direction is substantially directed towards the weights of the optimal model for the local dataset of this node.
This will now be illustrated with reference to the figures.

With reference to the corresponding figure, a cluster C1 constituted by two nodes N1 and N2, having respective local datasets ds1 and ds2, is considered.
It is assumed that the models of the nodes N1 and N2 are initialized with the same set of weights θ0.
The phantom vectors represent, at each round t, the change of the model of the node N1. It is seen that the norms of these vectors tend (if the model converges) to decrease at each round t, and that these vectors are (normally) directed towards the point representing the weights θOPT1 of the optimal model of the dataset ds1.
The dotted line vectors represent, at each round t, the change in the model of the node N2. It is seen that the norms of these vectors tend (if the model converges) to decrease at each round t, and that these vectors are (normally) directed towards the point representing the weights θOPT2 of the optimal model of the dataset ds2.
The solid line vectors represent the change in the aggregate model of the cluster C1, obtained by aggregation of the models of the nodes N1 and N2.
It is noticed that over the rounds (for example at each round), if the models converge:
These findings can help determine that the aggregate model is no longer changing.
In one particular mode of implementation of the invention, and as detailed later, once the aggregate model no longer changes, the nodes whose changes of the models are represented by vectors of identical or neighboring directions are intended to be grouped in the same cluster (assuming that this grouping is not questioned by the criterion of limitation of the communication costs).
Modes of implementation of a configuration method and of a learning method in accordance with the invention will now be described. These methods are described in the context of a system in accordance with the invention and including a coordination entity A0 and a set of nodes Ni all able to play the role of aggregation node within a cluster and the role of worker node.
In the exemplary embodiment described here, the coordination entity A0 can be considered as an aggregation node of level 0.
In this example, a cluster initialization phase is described first, followed by an optimization phase, with reference to the figures.
During a step E10, the aggregation node A0 of level 0 (coordination entity within the meaning of the invention) performs a first partition of the nodes Ni to initialize the clusters Cj and determines, among the nodes Ni, the aggregation node Aj of each of these clusters Cj. In at least one of the embodiments described here, the method includes a parameter kinit which defines the number of clusters which must be created during this initial partition.
In at least one of the embodiments described here, during this initialization step, the kinit clusters are constituted only by taking into account the distances between the nodes Ni, based on their geographical locations.

In some embodiments, the constitution of the clusters can for example comprise the creation of one cluster per node Ni and the recursive merging of the pairs of clusters closest to each other into single clusters.
In some embodiments, the constitution of the clusters can for example use the Hierarchical Agglomerative Clustering algorithm presented in the document «T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed., Springer, 2008».
For each of the clusters Cj (j = 0, …, kinit−1) thus created, the aggregation node Aj can be chosen, for example, as the node Ni of this cluster which minimizes the sum of the distances between this node and each of the other nodes of the cluster.
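An illustrative sketch of this initial partition (step E10) is given below, using SciPy's hierarchical agglomerative clustering on the node positions and choosing each cluster's aggregation node as the member that minimizes the sum of distances to the others; the function name and signature are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

def initial_partition(positions: np.ndarray, k_init: int):
    # Agglomerative clustering of the nodes into k_init geographic clusters.
    condensed = pdist(positions)                       # pairwise distances
    labels = fcluster(linkage(condensed, method="average"),
                      t=k_init, criterion="maxclust")
    dist = squareform(condensed)
    clusters, aggregators = {}, {}
    for j in sorted(set(labels)):
        members = np.flatnonzero(labels == j)
        clusters[j] = members.tolist()
        # Aggregation node: the member minimizing the sum of distances
        # to the other members of its cluster (a medoid).
        sub = dist[np.ix_(members, members)]
        aggregators[j] = int(members[sub.sum(axis=1).argmin()])
    return clusters, aggregators
```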
In one embodiment, during a step E15, once the clusters Cj have been created, the coordination entity A0 sends:
In at least one of the embodiments described here, at the first occurrence of a step E20, the coordination entity A0 initializes a variable t representing the number of the current round to 0 and sends, to each aggregation node Aj, the weights θ0 of an initial model global to the set of nodes and a request to train the models of the nodes of its cluster with these weights.

In at least one of the embodiments described here, the learning request can be accompanied by a number δ of updates, in other words of iterations, to be carried out for this learning.
In at least one of the embodiments described here, during a step E25, the aggregation node Aj of each of the clusters Cj initializes the weights of its model θ_j^t for round t with the weights θ^t of the global model.

In at least one of the embodiments described here, during a step E30, the aggregation node Aj of each of the clusters Cj sends the weights θ_j^t of its aggregate model to each of the nodes Ni of its cluster.

In at least one of the embodiments described here, during a step E35, each of the nodes Ni initializes the weights of its local model θ_i^t for round t with the weights θ_j^t of the model of its cluster Cj.

In at least one of the embodiments described here, during step E30 already described, when the aggregation node Aj sends the weights θ_j^t of the model of its cluster to a node Ni, it asks it to update its model θ_i^t. In one embodiment, the aggregation node Aj communicates the hyperparameters E, B, η to the nodes Ni.

In at least one of the embodiments described here, during step E35, the node Ni updates its model θ_i^t. For example, it performs a gradient descent during E epochs on a batch of size B of its local data dsi.

In at least one of the embodiments described here, during a step E40, the node Ni sends the update Δθ_i^t of its model for round t to the aggregation node of its cluster Cj.

In at least one of the embodiments described here, during a step E45, the aggregation node Aj increments the variable t (number of the current round) and updates the weights of the aggregate model θ_j^t of its cluster Cj for round t by aggregation of the updates Δθ_i^t of the weights of the models of the nodes Ni of this cluster Cj received in step E40. It is noted here that the round indices of θ_j^t and of Δθ_i^t differ by one unit since, for example, the weights of the model θ_j^1 of the cluster Cj for round 1 are obtained by aggregation of the updates Δθ_i^0 of the models of the nodes Ni at round 0.
Different aggregation methods can be used, for example methods such as «Federated Average» (average weighted by the size of the dataset dsi of the node Ni) or «Coordinate-wise median», known to those skilled in the art.
In at least one of the embodiments described here, during a step E50, an aggregation node Aj verifies whether the δ rounds (or iterations) have been carried out, in other words whether t is divisible by δ. If this is not the case, the result of test E50 is negative and the aggregation node sends (during a new iteration of step E30) the weights θ_j^t of the updated model of its cluster Cj to each of the nodes Ni of its cluster. These nodes update their model θ_i^t (step E35) and send the update Δθ_i^t to the aggregation node Aj (step E40), which then increments the value t and updates the aggregate model θ_j^t of its cluster Cj (step E45).

In at least one of the embodiments described here, when the δ rounds (or iterations) have been carried out, the result of the test E50 is positive and the aggregation node Aj sends the aggregate model θ_j^t of its cluster Cj to a node playing the role of aggregation node in a cluster of lower level, namely, in this example, to the coordination entity A0.
In at least one of the embodiments described here, during a step E55, the coordination entity A0 updates the weights of the global model θ^t by aggregation of the aggregate models θ_j^t of the clusters Cj. The coordination entity A0 can use different aggregation methods according to the embodiments, for example the aforementioned «Federated Average» or «Coordinate-wise median» aggregation methods.
In at least one of the embodiments described here, during a test E60, the coordination entity A0 determines whether the global model θ^t is still far from convergence or not. To do so, it compares the norm of the change Δθ^t of its model (Δθ^t = θ^t − θ^{t−1}) with a convergence criterion ε0.
In at least one of the embodiments described here, when the norm of the change Δθ^t of its model is greater than the convergence criterion ε0, the coordination entity A0 considers that its model continues to converge and the result of the test E60 is positive. It then sends (new occurrence of step E20) the weights θ^t of the new global model to the aggregation nodes Aj of the clusters Cj, asking them to repeat the process described above to update their aggregate models θ_j^t δ times.

In at least one of the embodiments described here, when the norm of the change Δθ^t of its model is lower than the convergence criterion ε0, the coordination entity A0 considers that its model is no longer changing and the result of the test E60 is negative.

In at least one of the embodiments described here, it then sends, during a step E65, the weights θ^t of the global model to the nodes Ni (either directly or via the aggregation nodes Aj of the clusters Cj) and asks them to update this model.

In at least one of the embodiments described here, the nodes Ni update their model θ_i^t during a step E65 (for example by performing a gradient descent from their local dataset dsi) and send the update Δθ_i^t of their models to the coordination entity A0 during a step E70.
In at least one of the embodiments described here, during a step E75, the coordination entity A0 carries out a new partition of the nodes to reorganize the clusters. It is recalled that in step E10, the kinit clusters had been constituted, in the embodiment cited as an example, only by taking into account the distances between the nodes Ni, based on their geographical locations.

However, at this stage, the global model created based on these first clusters, constituted on a purely geographical criterion, no longer changes, or barely changes.

In at least one of the embodiments described here, and as mentioned previously, step E75 reorganizes the nodes into clusters so as to address a compromise: on the one hand, limiting (for example minimizing) the communication costs between the nodes within a cluster and, on the other hand, constituting clusters of nodes whose updates Δθ_i^t of their models change in the same direction, this second criterion representing a priori the fact that these nodes have local datasets coming from homogeneous distributions.
In at least one of the embodiments described here, step E75 reorganizes the clusters to globally optimize a distance d_{i,k} calculated for each pair of nodes Ni, Nk. In one embodiment of the invention, this distance takes into account the communication cost between the nodes Ni and Nk and the cosine similarity ⟨x, y⟩/(∥x∥·∥y∥) of the updates Δθ_i^t and Δθ_k^t of their models, where ⟨x, y⟩ denotes the scalar product of x and y and ∥x∥ denotes the norm of x.
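The exact expression of d_{i,k} is not reproduced here; the sketch below assumes one plausible combination, namely a convex mix (mixing weight `lam`, an assumption) of a normalized communication cost and of the cosine dissimilarity of the updates, followed by a clustering on the resulting distance matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def reorganization_distance(cost: np.ndarray, updates: list[np.ndarray],
                            lam: float = 0.5) -> np.ndarray:
    # cost[i, k]: communication cost between nodes i and k (assumed > 0);
    # updates[i]: flattened update Δθ_i of node i's model.
    n = len(updates)
    d = np.zeros((n, n))
    for i in range(n):
        for k in range(i + 1, n):
            cos = np.dot(updates[i], updates[k]) / (
                np.linalg.norm(updates[i]) * np.linalg.norm(updates[k]))
            d[i, k] = d[k, i] = (lam * cost[i, k] / cost.max()
                                 + (1 - lam) * (1 - cos))
    return d

def reorganize(d: np.ndarray, k: int) -> np.ndarray:
    # Cluster the nodes on the combined distance matrix (step E75 analogue).
    return fcluster(linkage(squareform(d, checks=False), method="average"),
                    t=k, criterion="maxclust")
```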
At the end of step E75, the new clusters Cj are reorganized and their aggregation nodes Aj are designated.
It can be considered, at least in some embodiments, that this step E75 completes an initialization phase.
It can be considered that this initialization phase is followed by an optimization phase which will now be described with reference to the
In at least one of the embodiments described here, this optimization phase includes steps F10 to F95.
During a step F10, the coordination entity A0 records the global model θ^t as a reference model θ̂_n.
In at least one embodiment, during a step F15, in a manner similar to step E15 already described, the coordination entity A0 sends:
In at least one of the embodiments described here, during a step F20, the coordination entity A0 sends to each aggregation node Aj:
In at least one of the embodiments described here, during a step F25, the aggregation node Aj of each of the reorganized clusters Cj initializes the weights of its model θ_j^t for round t with the weights θ^t of the global model.

In at least one of the embodiments described here, during a step F30, the aggregation node Aj of each of the reorganized clusters Cj sends the weights θ_j^t of its aggregate model to each of the nodes Ni of its cluster, with a request for them to update their model θ_i^t by performing a gradient descent from their local dataset dsi.

In at least one of the embodiments described here, during a step F35, each of the nodes Ni initializes the weights of its local model θ_i^t for round t with the weights θ_j^t of the model of its cluster Cj and updates this model by performing a gradient descent during E epochs on a batch of size B of its local data dsi.

In at least one of the embodiments described here, during a step F40, the node Ni sends the update Δθ_i^t of its model for round t to the aggregation node of its cluster Cj.
In at least one of the embodiments described here, during a test F45, the aggregation node Aj of a reorganized cluster Cj determines whether the cluster Cj must be restructured. In the example described here, and as described previously with reference to the figures, this determination comprises:

calculating, for each node Ni of the cluster Cj, a vector Δθ_i^t representative of the change of the model of this node for round t, this vector having as origin a point representing the weights of the model before its update, and a direction substantially directed towards the weights of the optimal model for the dataset dsi of this node Ni;
Still during this test F45, if the norm of the change Δθ_j^t of the aggregate model of the cluster Cj is lower than the convergence criterion ε_n, which is the case if the model of the cluster Cj is no longer converging, the aggregation node Aj determines whether there is at least one model of a node Ni of its cluster which continues to change differently from the models of the other nodes of the cluster.

In at least one of the embodiments described here, to determine whether the model of a node Ni continues to change, the aggregation node Aj compares the norm of Δθ_i^t with a convergence criterion.

In at least one embodiment, to determine whether the model of a node Ni that continues to change changes differently from the models of the other nodes of the cluster Cj, the aggregation node Aj considers the angle between:

More specifically, in at least one of the embodiments described here, the aggregation node Aj considers that the model of a node of the cluster changes differently if the angle α_t exceeds a threshold angle, where:
If this is not the case, in other words if the model of the cluster Cj continues to change without the model of any node Ni changing in a direction significantly different from that in which the model of the cluster would change without this node, the result of the test F45 is negative.
In this case, in at least one of the embodiments described here, the aggregation node Aj of a cluster Cj calculates, during a step F50, the aggregate model θ_j^t of the cluster Cj obtained from the updates Δθ_i^t of the models of the nodes Ni of the cluster Cj received in step F40. This updated model is sent to all the nodes Ni of the cluster Cj during a new iteration of step F30.
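An illustrative sketch of the angle test of test F45 described above is given below; the threshold angle `alpha_max` and the use of a plain mean as aggregation are assumptions:

```python
import numpy as np

def diverging_nodes(updates: list[np.ndarray], alpha_max: float) -> list[int]:
    # Flag each node whose change diverges from the change the cluster
    # would have without it (requires at least two nodes in the cluster).
    flagged = []
    for i, du in enumerate(updates):
        rest = np.mean([u for k, u in enumerate(updates) if k != i], axis=0)
        cos = np.clip(np.dot(du, rest)
                      / (np.linalg.norm(du) * np.linalg.norm(rest)), -1.0, 1.0)
        if np.arccos(cos) > alpha_max:   # angle α_t above the threshold
            flagged.append(i)
    return flagged
```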
In at least one of the embodiments described here, the loop of steps F25 to F50 is carried out as long as t is smaller than a value T. Other stopping criteria can be used.
In at least one of the embodiments described here, the aggregation node Aj sends (step F58) the aggregate model θ_j^t of its cluster Cj to the coordination entity A0 (or to the aggregation node of lower level).
When an aggregation node Aj determines either that the model of the cluster Cj is no longer changing or that there is at least one node whose model is changing in a “bad” direction, the result of the test F45 is positive, and the aggregation node Aj undertakes, during a step F60, a restructuring of the cluster Cj.
This step is similar to step E75 already described, except that it only applies to the nodes of the cluster Cj and not to all the nodes. It therefore produces a set of subclusters SCj of the cluster Cj and aggregation nodes SAj of these subclusters, these subclusters SCj being constituted to limit the communication costs between their nodes and to group together nodes whose model updates change substantially in the same direction. Possibly, some nodes Ni are not assigned to any of the subclusters and can be considered isolated.

In at least one of the embodiments described here, the subclusters SCj and the isolated nodes Ni are not processed in the same way.
In one embodiment, during a step F65, the aggregation node Aj sends:
In at least one of the embodiments described here, during this step F65, the aggregation node Aj sends to each aggregation node SAj:
In at least one of the embodiments described here, an aggregation node Aj creates, during a step F70, for each subcluster, a reference model by aggregation of the models of the nodes of this subcluster and then sends the weights of this model to the aggregation node SAj of this subcluster SCj. This subcluster aggregation node can then recursively implement the steps described above to customize its subcluster.
In at least one of the embodiments described here, when a node NIi is isolated, the aggregation node of lower level, namely the coordination entity A0 in this example, updates, during a step F75, the reference model θ̂_n by aggregation of the models θ_i^t.
In at least one of the embodiments described here, the coordination entity A0 sends the weights of this reference model to the isolated node NIi during a step F80.
In at least one of the embodiments described here, during a step F85, the isolated node NIi initializes the weights of its local model with the weights of this reference model θ̂_n and updates this model by performing a gradient descent. The isolated node NIi sends the update of its model to the coordination entity A0 during a step F90.
In at least one of the embodiments described here, during a step F95, the isolated node NIi is allocated to the cluster Cj whose change of the model compared to the reference model θ̂_n, i.e. (θ_j^t − θ̂_n), is closest to the change Δθ_i of the isolated node.
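A minimal sketch of this reallocation (step F95) is given below; comparing the changes by cosine similarity is an assumption, one plausible reading of "closest":

```python
import numpy as np

def reallocate(delta_i: np.ndarray, cluster_models: dict,
               theta_ref: np.ndarray):
    # Allocate the isolated node to the cluster j whose model change
    # (θ_j − θ̂_n) is most similar to the node's own change Δθ_i.
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return max(cluster_models,
               key=lambda j: cos(cluster_models[j] - theta_ref, delta_i))
```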
In the previous example, a mode of implementation of the invention for a system including only two aggregation levels is described but the invention can be implemented with a greater number of aggregation levels.
Generally, during the initialization phase, when an aggregation node Aj has updated the aggregate model θ_j^t of its cluster Cj (step E45), it sends it to the aggregation node of lower level so that the latter can update its own model (step E55) by aggregation of the models it receives. Generally, an aggregation node of level n is configured to determine (step E60) whether its aggregate model is still far from convergence by comparing the norm of the change of its model with a convergence criterion ε_n which can be specific to this aggregation level.
At the end of the initialization phase, an aggregation node of level n sends a reference model resulting from this initialization phase to the aggregation nodes of level n+1 (step F10). Each aggregation node is configured to determine whether its cluster must be reconfigured (step F45) and if so, to create subclusters (step F60) or ask the aggregation node of lower level to assign a cluster (step F95) to the nodes that would be isolated.
With reference to the figures, the coordination entity A0 has the hardware architecture of a computer. It notably comprises a processor, a read only memory 12, and communication means 15.
These communication means 15 can in particular allow the coordination entity A0 to communicate with nodes of the network.
The read only memory 12 of the coordination entity A0 constitutes a recording medium in accordance with the invention, readable by the processor and on which a computer program PGC in accordance with the invention is recorded, including instructions for the execution of a weight configuration method according to the invention.
For example, the processor of said coordination entity A0 can be capable of:
The program PGC defines various functional and software modules of the coordination entity A0, able to implement the steps of the weight configuration method.
The read only memory 22 of the node Aj constitutes a recording medium in accordance with the invention, readable by the processor and on which a learning program PGA in accordance with the invention is recorded, including instructions for the execution of a learning method according to the invention.
For example, the processor of the node can be able to receive, from an entity of said communication network, before a federated learning of the weights of said models of the neural networks of the nodes of said set (in which said nodes locally train their models of neural networks and share the weights of their models with other nodes of said network), information designating a node Aj of said set as an aggregation node managing an aggregate model for said federated learning and, when said node is said aggregation node, the identifiers of the nodes of a cluster whose aggregate model said aggregation node manages.
The program PGA defines various functional and software modules of the node Aj, able to implement the steps of the learning method.
In one embodiment in which this node, then noted Ni, is also able to play the role of worker node, the communication means COM2 are configured to:
Parts (a) and (b) of the appended results figure present accuracy results obtained in these examples.

In these examples, test images of the MNIST dataset, which includes images each representing a digit from 0 to 9 (i.e., ten classes), are used. An accuracy of 99% means that out of every 100 new test images, 99 are classified correctly.

This model is presented in the document «Y. LeCun and C. Cortes, "MNIST handwritten digit database," 2010».
Parts (c) and (d) of the results figure present accuracy results weighted by the size of the local datasets of the nodes.

In these figures, if a set of nodes including only two nodes were considered, the first node having a set of X1 local data and an accuracy P1 and the second node having a set of X2 local data and an accuracy P2, then the ordinate of a point would have as value (X1·P1 + X2·P2)/(X1 + X2).
Parts (e) and (f) of the results figure present the communication cost measured in these examples.

In these figures, a communication cost calculated at each round in the case of a network of 50 nodes is considered. The communication cost is in this example constituted by the sum of the communication costs (i) of sending the model towards the nodes and (ii) of sending the updates back from these nodes towards the aggregation nodes or towards the coordination entity, taking into account the number of bits necessary to send the model (in other words the weights), multiplied by the sum of the distances between the two communicating nodes, each raised to the power of a path-loss exponent.
It is assumed here that in the case of centralized federated learning, the single aggregation node is the barycenter of all the nodes.
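A hedged sketch of this cost metric is given below; `gamma` stands for the path-loss exponent and its value is an assumption:

```python
def round_cost(model_bits: int, distances: list[float],
               gamma: float = 2.0) -> float:
    # Cost of one round: bits needed to send the model, multiplied by the
    # sum of the node-to-aggregator distances raised to the path-loss
    # exponent; the factor 2 counts the downlink (model out) and the
    # uplink (updates back).
    return 2 * model_bits * sum(d ** gamma for d in distances)
```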
It is seen that clustered federated learning can help reduce the communication cost by avoiding communication between each node of the network and a single aggregation node.
Thus, it can be observed, for the two aggregation methods:
Priority application: FR 2109043, filed August 2021 (national).

International filing: PCT/FR2022/051617, filed Aug. 29, 2022 (WO).