This application claims the benefit of priority to Taiwan Patent Application No. 110141246, filed on Nov. 5, 2021. The entire content of the above identified application is incorporated herein by reference.
Some references, which may include patents, patent applications and various publications, may be cited and discussed in the description of this disclosure. The citation and/or discussion of such references is provided merely to clarify the description of the present disclosure and is not an admission that any such reference is “prior art” to the disclosure described herein. All references cited and discussed in this specification are incorporated herein by reference in their entireties and to the same extent as if each reference was individually incorporated by reference.
The present disclosure relates to a federated learning method and a federated learning system, and more particularly to a federated learning method and a federated learning system based on a mediation process.
In the existing federated learning method, a model can be trained on a local device without transferring local data from a client end, and a shared model can then be further established and updated. This method not only provides high confidentiality, but also saves the costs of frequent transmissions of large amounts of data. However, the local data collected by different client ends deviate due to factors such as environment and location, and this data deviation reduces the accuracy of the trained model.
In addition, the existing federated learning method connects a server to a plurality of client devices for learning. However, the selected client devices often experience interruptions in network communication, thereby causing the collecting process executed by the server in the federated learning method to cease.
In response to the above-referenced technical inadequacies, the present disclosure provides a federated learning method and federated learning system based on a mediation process.
In one aspect, the present disclosure provides a federated learning method based on a mediation process. The federated learning method includes: configuring a server device to divide a plurality of client devices into a plurality of mediator groups based on a plurality of records of data distribution information of the plurality of client devices, and generate a plurality of mediator modules that are configured to manage the plurality of mediator groups, respectively; configuring the server device to broadcast initial model weight data to the plurality of mediator modules; configuring the plurality of mediator modules to execute a sequential training process for the plurality of mediator groups, respectively, in which the sequential training process includes: determining a training sequence for the corresponding client devices; transmitting the initial model weight data to the corresponding client devices, and configuring the corresponding client devices to use a plurality of records of local data as a plurality of records of training data, and sequentially train a target model according to the initial model weight data and the training sequence to generate trained model weight data; and transmitting the trained model weight data back to the server device. The federated learning method further includes: configuring the server device to obtain multiple records of the trained model weight data of the plurality of mediator groups, and calculate a plurality of weights respectively corresponding to the plurality of mediator groups according to the multiple records of the trained model weight data; configuring the server device to execute a weighted federated averaging algorithm on the multiple records of the trained model weight data according to the plurality of weights to generate global model weight data; and configuring the server device to set the target model with the global model weight data to generate a global target model.
In another aspect, the present disclosure provides a federated learning system based on a mediation process. The federated learning system includes a plurality of client devices, a server device, and a plurality of mediator modules. The server device is communicatively connected to the plurality of client devices, and is configured to divide the plurality of client devices into a plurality of mediator groups based on a plurality of records of data distribution information of the plurality of client devices. The plurality of mediator modules are generated by the server device and configured to manage the plurality of mediator groups, respectively. The server device is configured to broadcast initial model weight data to the plurality of mediator modules. The plurality of mediator modules are configured to execute a sequential training process for the plurality of mediator groups, respectively, and the sequential training process includes: determining a training sequence for the corresponding client devices; transmitting the initial model weight data to the corresponding client devices, and configuring the corresponding client devices to use a plurality of records of local data as a plurality of records of training data, and sequentially train a target model according to the initial model weight data and the training sequence to generate trained model weight data; and transmitting the trained model weight data back to the server device. The server device is configured to obtain multiple records of the trained model weight data of the plurality of mediator groups, and calculate a plurality of weights respectively corresponding to the plurality of mediator groups according to the multiple records of the trained model weight data. The server device is configured to execute a weighted federated averaging algorithm on the multiple records of the trained model weight data according to the plurality of weights to generate global model weight data. 
The server device is configured to set the target model with the global model weight data to generate a global target model.
Therefore, the federated learning method and federated learning system based on the mediation process provided by the present disclosure add mediators in federated learning to coordinate training tasks in the mediator group, thereby assisting model weights to be transferred between the client ends and server to overcome uneven distribution of data in the federated learning, while having high privacy and low cost characteristics.
In addition, the federated learning method and federated learning system based on the mediation process provided by the present disclosure provide a fault-tolerant mechanism under a mediator architecture of the federated learning, such that the training efficiency and stability of the model can be maintained even if the client device is disconnected during the training process.
Furthermore, the federated learning method and federated learning system based on the mediation process provided by the present disclosure can operate in parallel through a plurality of mediator modules, each of which uses a sequential training method to allow the client devices to update the global model in a specific sequence. Therefore, not only can biased weights be avoided, but communication costs can also be reduced, thereby speeding up an overall training speed of the federated learning.
These and other aspects of the present disclosure will become apparent from the following description of the embodiment taken in conjunction with the following drawings and their captions, although variations and modifications therein may be affected without departing from the spirit and scope of the novel concepts of the disclosure.
The described embodiments may be better understood by reference to the following description and the accompanying drawings, in which:
The present disclosure is more particularly described in the following examples that are intended as illustrative only since numerous modifications and variations therein will be apparent to those skilled in the art. Like numbers in the drawings indicate like components throughout the views. As used in the description herein and throughout the claims that follow, unless the context clearly dictates otherwise, the meaning of “a”, “an”, and “the” includes plural reference, and the meaning of “in” includes “in” and “on”. Titles or subtitles can be used herein for the convenience of a reader, which shall have no influence on the scope of the present disclosure.
The terms used herein generally have their ordinary meanings in the art. In the case of conflict, the present document, including any definitions given herein, will prevail. The same thing can be expressed in more than one way. Alternative language and synonyms can be used for any term(s) discussed herein, and no special significance is to be placed upon whether a term is elaborated or discussed herein. A recital of one or more synonyms does not exclude the use of other synonyms. The use of examples anywhere in this specification including examples of any terms is illustrative only, and in no way limits the scope and meaning of the present disclosure or of any exemplified term. Likewise, the present disclosure is not limited to various embodiments given herein. Numbering terms such as “first”, “second” or “third” can be used to describe various components, signals or the like, which are for distinguishing one component/signal from another one only, and are not intended to, nor should be construed to impose any substantive limitations on the components, signals or the like.
The server device 12 is communicatively connected to the client devices 100-1, 100-2, . . . , 100-K, and is configured to divide the client devices 100-1, 100-2, . . . , 100-K into mediator groups 10-1, 10-2, . . . , 10-N based on a plurality of records of data distribution information of the client devices 100-1, 100-2, . . . , 100-K. The mediator modules 14-1, 14-2, . . . , 14-N are generated by the server device 12 and are configured to manage the mediator groups 10-1, 10-2, . . . , 10-N, respectively. The number of the mediator modules 14-1, 14-2, . . . , 14-N is the same as the number of the mediator groups 10-1, 10-2, . . . , 10-N, but the present disclosure is not limited thereto.
In the federated learning system 1, a main task of the server device 12 is to initialize and assign the client devices 100-1, 100-2, . . . , 100-K to the different mediator groups 10-1, 10-2, . . . , 10-N according to data distribution. For example, the mediator group 10-1 includes the client devices 100-1, 100-2, 100-3, and the mediator group 10-2 includes the client devices 100-4, 100-5, 100-6. The server device 12 further creates the mediator modules 14-1, 14-2, . . . , 14-N for the mediator groups 10-1, 10-2, . . . , 10-N, and executes a weighted federated averaging algorithm for model weights trained by the mediator groups 10-1, 10-2, . . . , 10-N, and finally generates a model that integrates all training results.
On the other hand, the client devices 100-1, 100-2, . . . , 100-K are responsible for processing data requests on a data plane, performing training, and transferring the weights of the trained models. In addition, in the federated learning system 1, in order to coordinate the training tasks among the client devices 100-1, 100-2, . . . , 100-K, the present disclosure includes the mediator modules 14-1, 14-2, . . . , 14-N, which are responsible for a control plane to provide software programs for configuring and closing the data plane, and also determine which client device should be used for training.
Reference is made to
The server device 20 includes a processor 200, a communication interface 202 and a storage medium 204. The processor 200 is coupled to the communication interface 202 and the storage medium 204. The storage medium 204 can be, for example, but not limited to, a hard disk, a solid state drive or other storage devices that can be used to store data, and is configured to store at least a plurality of computer readable instructions D1, global data distribution information D2, a clustering algorithm D3, a mediator module generation program D4, a weighted federated averaging algorithm D5, initial model weight data D6, and target model data D7. The communication interface 202 is configured to access the network under control of the processor 200, and can communicate with the client devices 22, 24, for example.
The client device 22 can include a processor 220, a communication interface 222, and a storage medium 224. The processor 220 is coupled to the communication interface 222 and the storage medium 224. The storage medium 224 can be, for example, but not limited to, a hard disk, a solid state drive or other storage devices that can be used to store data, and is configured to store at least a plurality of computer readable instructions D1′, local data D2′, data distribution information D3′, a training program D4′, target model data D5′, and model weight data D6′. The communication interface 222 is configured to perform network access under control of the processor 220, and can communicate with the server device 20, for example.
Similarly, the client device 24 can include a processor 240, a communication interface 242, and a storage medium 244. The processor 240 is coupled to the communication interface 242 and the storage medium 244, and the storage medium 244 and the communication interface 242 are similar to the storage medium 224 and the communication interface 222, and thus the repeated descriptions are omitted. In some embodiments, the client devices 22 and 24 can be, for example, mobile devices, Internet of Things (IoT) devices, fog computing apparatus, and the like.
In addition, the mediator modules 14-1, 14-2, . . . , 14-N in
Step S30: configuring the client device to statistically calculate local data to generate data distribution information and send the data distribution information to the server device. In detail, this step is included in an initialization process. In the initialization process, the server device can communicate with multiple client devices that are predetermined to participate in the federated learning method, and a registration process is then executed. As shown in
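The client-side statistics of step S30 can be sketched as follows. This is an illustrative example only, not the implementation of the disclosure; the function name summarize_local_data and the choice of a flat list of values are assumptions, and the statistics shown are a subset of those named in step S31 below.

```python
# Hypothetical sketch of the client-side statistics step (step S30).
# The helper name summarize_local_data is illustrative, not from the disclosure.
import statistics

def summarize_local_data(values):
    """Compute distribution statistics the server may use for clustering."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
    }

# A client would send this summary, rather than the raw local data, to the server.
info = summarize_local_data([1.0, 2.0, 2.0, 3.0, 4.0])
```

Because only aggregate statistics leave the client, the raw local data never needs to be transferred, consistent with the confidentiality characteristic described above.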
Step S31: configuring the server device to divide the client devices into a plurality of mediator groups according to a plurality of records of data distribution information of the client devices, and to generate a plurality of mediator modules that are configured to manage the plurality of mediator groups, respectively. In this step, the processor 200 of the server device 20 can be configured to execute the clustering algorithm D3, based on the data distribution of the client devices 22, 24, for example, statistics of the global data distribution information D2 such as an average value, a standard error, a median, a standard deviation, a sample variance, a kurtosis, a skewness, a range, a minimum, a maximum, and a sum, so as to perform an average clustering. A clustering result can be as shown in
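The clustering of step S31 can be sketched with a minimal k-means routine, assuming each client device is represented by a feature vector built from its distribution statistics. This is a simplified illustration under that assumption; the disclosure does not specify a particular clustering algorithm, and the naive initialization below is for brevity only.

```python
# A minimal k-means sketch of the clustering in step S31 (illustrative only).
# Each point stands for one client device's vector of distribution statistics.
def kmeans(points, k, iters=10):
    centroids = list(points[:k])  # naive initialization for brevity
    for _ in range(iters):
        # Assign each client to the nearest centroid (squared Euclidean distance).
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            groups[idx].append(p)
        # Recompute each centroid as the mean of its assigned clients.
        centroids = [
            [sum(dim) / len(g) for dim in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return groups

# Two clients with similar statistics end up in the same mediator group.
groups = kmeans([(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)], k=2)
```

Each resulting cluster corresponds to one mediator group, for which the server device then generates a mediator module.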
Next, the server device 20 can further execute the mediator module generation program D4 to set the mediator modules. In the software implementations, the mediator module generation program D4 can be executed to determine where to execute the mediator modules. For example, in addition to executing the mediator modules on the server device, geographical characteristics or distance characteristics of the client devices can be further collected during the registration process, such that a shared server can be selected based on the corresponding mediator group under the premise that there is a shared server to execute the mediator modules. In the hardware implementations, a device similar to the server device can be set up according to the geographic characteristics or distance characteristics of the client devices in the mediator group, so as to manage the corresponding mediator group. The above are only examples, and the present disclosure is not limited thereto.
Step S32: configuring the server device to broadcast initial model weight data to the mediator modules. As shown in
Step S33: configuring the mediator modules to execute a sequential training process for the mediator groups, respectively.
Reference is made to
As shown in
Furthermore, the corresponding client devices can be configured to use multiple records of local data as multiple records of training data, and sequentially train the target model according to the initial model weight data and the training sequence to generate trained model weight data. For example, the following steps S43 to S45 can be executed.
Step S43: configuring the first client device to train the target model with the initial model weight data, and generate first trained model weight data in response to the training being completed. For example, as shown in
Step S44: transferring the first trained model weight data to a second client device in the training sequence.
Step S45: configuring the second client device to train the target model with the first trained model weight data, and generate second trained model weight data in response to the training being completed. Similarly, after the target model is set with the first trained model weight data, local data of the second client device can be used as the training data for training until all the client devices in the training sequence complete the training, and trained model weight data is generated.
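Steps S43 to S45 above can be sketched as a loop that passes the model weights through the training sequence. This is a toy illustration under stated assumptions: sequential_round and toy_update are hypothetical names, and the additive update stands in for real local training on each client's local data.

```python
# Illustrative sketch of the sequential training process (steps S43 to S45).
# train_on_client is a stand-in for real local training; names are assumptions.
def sequential_round(initial_weights, clients, train_on_client):
    """Pass the model weights through every client in the training sequence."""
    weights = initial_weights
    for client in clients:
        # Each client trains starting from the previous client's result.
        weights = train_on_client(client, weights)
    return weights

# Toy local update: each client nudges every weight by its local delta.
def toy_update(client, weights):
    return [w + client["delta"] for w in weights]

final = sequential_round([0.0, 1.0], [{"delta": 0.1}, {"delta": 0.2}], toy_update)
```

Because each client receives the weights already trained by its predecessor, the group produces a single record of trained model weight data rather than one record per client, which reflects the communication-cost reduction described below.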
The sequential training process proceeds to step S46: determining, according to the trained model weight data, whether or not the sequential training process needs to be executed again. That is, whether or not the training process is performed again can be determined according to a training result. For example, the training process can be determined to be performed again according to the training result if it is desired to test whether or not an accuracy of the trained target model can be further improved.
In response to determining in step S46 that the sequential training process needs to be executed again, the sequential training process returns to step S40. The device states of the corresponding client devices are re-confirmed one by one to re-determine the training sequence, and the trained model weight data is transmitted to the corresponding client devices as the initial model weight data for training.
In response to determining in step S46 that the sequential training process does not need to be executed again, the sequential training process proceeds to step S47: transmitting the trained model weight data back to the server device. For example, the client device 22 can store the trained model weight data D6′ and send it to the mediator module, through which the trained model weight data can then be sent back to the server device.
It should be noted that the above process can be performed in multiple mediator groups in parallel computing through multiple mediator modules, and each mediator module uses the sequential training manner to set client devices to update the weight data in a specific sequence and transmit the updated weight data to the server device. Therefore, not only biased weights can be avoided, but also communication costs can be reduced, thereby speeding up an overall training speed of the federated learning.
In addition, as shown in
Reference is made to
Step S50: monitoring a connection status of the client device that performs training.
For example, the fault-tolerant process can proceed to step S51: sending a periodic signal to the client device that performs training. Step S52: determining whether or not the client device that performs training fails to respond to the periodic signal within a predetermined period of time.
In response to determining that the client device that performs training does not respond to the periodic signal within the predetermined period of time, the fault-tolerant process proceeds to step S53: determining that the client device that performs training enters an offline state.
In response to detecting that the client device that performs training enters the offline state, the fault-tolerant process proceeds to step S54: confirming device states of the corresponding client devices.
Step S55: selecting a new client device from the corresponding client devices according to the device states, and transferring a model weight predetermined to be trained by the client device entering the offline state to the new client device for training.
For example, the fault-tolerant process can proceed to step S56: configuring the client device that performs training immediately before the client device that is determined to enter the offline state to transfer the model weight predetermined to be trained to the new client device.
The fault-tolerant process can then return to S50 to keep monitoring the connection status of the client device that performs training, such that the fault-tolerant process can be triggered any time when the offline state is detected.
On the other hand, every time the offline state is detected, the fault-tolerant process proceeds to step S57: recording fault-tolerant information relevant to the client device entering the offline state, and sending the fault-tolerant information to the server device. The fault-tolerant information can be, for example, the client device identifier assigned in the registration process to the client device that enters the offline state, and can assist the server device in calculating the relevant weights in the subsequent steps.
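The offline-detection and replacement logic of the fault-tolerant process can be sketched as follows. This is a simplified illustration, assuming the mediator module records the last response time of each client; the timeout value, state labels, and helper names are assumptions rather than part of the disclosure.

```python
# Simplified sketch of the fault-tolerant process (steps S50 to S55).
# Timeout, state labels, and helper names are illustrative assumptions.
TIMEOUT = 30.0  # seconds without a response before declaring offline

def check_offline(last_response, now, timeout=TIMEOUT):
    """Steps S52/S53: a client is offline if it misses the periodic-signal window."""
    return (now - last_response) > timeout

def select_replacement(device_states):
    """Step S55: pick an online, idle client to take over the pending model weight."""
    for name, state in device_states.items():
        if state == "online-idle":
            return name
    return None  # no replacement available
```

The mediator module would then instruct the client immediately before the offline device in the training sequence to transfer its trained model weight to the selected replacement, so that the sequential training round can continue.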
Therefore, by providing a fault-tolerant mechanism under a mediator architecture of the federated learning, the training efficiency and stability of the model can be maintained even if the client device is disconnected during the training process.
Reference is further made to
Step S35: configuring the server device to obtain the multiple records of the trained model weight data of the mediator groups, and calculate a plurality of weights respectively corresponding to the mediator groups according to the multiple records of the trained model weight data. For example, the server device can determine a weight of each mediator group according to amount of data generated by the training of each mediator module in the current cycle. In this step, the server device can also determine the weight of each mediator group based on the recorded fault-tolerant information.
Step S36: configuring the server device to execute a weighted federated averaging algorithm on the multiple records of the trained model weight data according to the weights to generate global model weight data.
In the present disclosure, the weighted federated averaging algorithm (also referred to as the FedAvg algorithm) substantially includes the steps of determining a topology, calculating gradients, exchanging information, and aggregating models. In the architecture of the present disclosure, the step of determining the topology is to establish an initial model weight and determine the mediator modules participating in the current round of federated learning. The steps of calculating the gradients and exchanging the information are to first confirm the initial model weight and corresponding parameters downloaded from the server device, and then perform local training on the client device before uploading the trained model weight to the server device. It should be noted that steps S35 and S36 actually correspond to the step of aggregating models. That is, after the trained model weights have been sent to the server, the server assigns a weight to each selected member (that is, each mediator module and mediator group) based on the number of samples it contains; the trained model weight of each mediator module is then multiplied by its assigned weight, and the results are summed and averaged, such that the final model weight is the global model weight data referred to in step S36.
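The aggregation described above can be sketched as a sample-count-weighted average over the trained model weights. This is an illustrative sketch of FedAvg-style aggregation under the assumption that each mediator group reports its weights as a flat list together with its sample count; the function name is hypothetical.

```python
# Sketch of the weighted federated averaging in steps S35 and S36.
# Assumes each group's trained weights arrive as a flat list of floats.
def weighted_fed_avg(group_weights, sample_counts):
    """Average each parameter, weighting every group by its sample count."""
    total = sum(sample_counts)
    num_params = len(group_weights[0])
    return [
        sum(w[i] * n for w, n in zip(group_weights, sample_counts)) / total
        for i in range(num_params)
    ]

# Two mediator groups; the second trained on three times as many samples.
global_w = weighted_fed_avg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

The resulting list corresponds to the global model weight data of step S36, with which the server device sets the target model in step S38 to generate the global target model.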
Step S37: determining, according to the global model weight data, whether or not the federated learning process needs to be executed again. For example, whether or not the federated learning process is executed again can be determined according to a training result, if it is desired to test whether or not an accuracy of the trained target model set with the global model weight data can be further improved.
In response to determining in step S37 that the federated learning process is no longer executed, the federated learning method proceeds to step S38: configuring the server device to set the target model with the global model weight data to generate a global target model.
In response to determining in step S37 that the federated learning process needs to be executed again, the federated learning method proceeds to step S39: configuring the server device to reorganize the data distribution information. For example, the client devices that have entered the offline state can be eliminated, the client devices that are newly connected to the server device can be added, and data distribution information of all the client devices can be collected again. At this time, the federated learning method can return to step S31 to perform clustering again.
It is worth mentioning that since the federated learning method and the federated learning system provided by the present disclosure add a mediation process mechanism, training tasks in the mediator group can be appropriately coordinated, thereby assisting model weights to be transferred between the client ends and server to overcome uneven distribution of data in the federated learning, while having high privacy and low cost characteristics.
In conclusion, the federated learning method and federated learning system based on the mediation process provided by the present disclosure add mediators in the federated learning to coordinate training tasks in the mediator group, thereby assisting model weights to be transferred between the client ends and server to overcome uneven distribution of data in the federated learning, while having high privacy and low cost characteristics.
In addition, the federated learning method and federated learning system based on the mediation process provided by the present disclosure provide a fault-tolerant mechanism under a mediator architecture of the federated learning, such that the training efficiency and stability of the model can be maintained even if the client device is disconnected during the training process.
Furthermore, the federated learning method and federated learning system based on the mediation process provided by the present disclosure can operate in parallel through a plurality of mediator modules, each of which uses a sequential training method to allow the client devices to update the global model in a specific sequence. Therefore, not only can biased weights be avoided, but communication costs can also be reduced, thereby speeding up an overall training speed of the federated learning.
The foregoing description of the exemplary embodiments of the disclosure has been presented only for the purposes of illustration and description and is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Many modifications and variations are possible in light of the above teaching.
The embodiments were chosen and described in order to explain the principles of the disclosure and their practical application so as to enable others skilled in the art to utilize the disclosure and various embodiments and with various modifications as are suited to the particular use contemplated. Alternative embodiments will become apparent to those skilled in the art to which the present disclosure pertains without departing from its spirit and scope.
Number | Date | Country | Kind |
---|---|---|---|
110141246 | Nov 2021 | TW | national |