This application claims the benefit of the Chinese Patent Application No. 202310340865.1, filed Mar. 31, 2023, the contents of which are hereby incorporated herein by reference in their entirety.
Example implementations of the present disclosure generally relate to executing actions and, in particular, to a method, apparatus, device, and computer-readable storage medium for action execution in a distributed environment.
With the development of federated learning and mobile computing, architectures have now been proposed for performing federated learning in distributed environments, which enable collaborative exploration and exploitation by a plurality of clients while protecting sensitive data. For example, individual actions may be executed separately at a plurality of clients, and the plurality of clients may communicate with a server, thereby progressively obtaining, in a distributed manner, an action model for determining the actions to be executed. However, federated learning involves a large amount of data communication and computation, and the resulting overheads might degrade the performance of machine learning. At this point, how to manage the action execution and thus complete the federated learning process in a more effective way has become a research hotspot and a difficult issue.
In a first aspect of the present disclosure, a method for action execution is provided. In the method, a set of actions to be executed at a first device is determined from a plurality of actions based on a first action model at the first device. A data accumulated indicator associated with the set of actions is obtained, the data accumulated indicator indicating an amount of data to be sent from the first device to a second device associated with the first device. In response to the data accumulated indicator meeting a predetermined condition, parameter data associated with the set of actions is transmitted to the second device to cause the second device to update a second action model at the second device using the parameter data, the parameter data comprising reward data and consumption data respectively associated with the set of actions.
In a second aspect of the present disclosure, an apparatus for action execution is provided. The apparatus comprises: a determining module configured for determining a set of actions to be executed at a first device from a plurality of actions based on a first action model at the first device; an obtaining module configured for obtaining a data accumulated indicator associated with the set of actions, the data accumulated indicator indicating an amount of data to be sent from the first device to a second device associated with the first device; and a sending module configured for, in response to the data accumulated indicator meeting a predetermined condition, transmitting parameter data associated with the set of actions to the second device to cause the second device to update a second action model at the second device using the parameter data, the parameter data comprising reward data and consumption data respectively associated with the set of actions.
In a third aspect of the present disclosure, an electronic device is provided. The electronic device comprises: at least one processing unit; and at least one memory, coupled to the at least one processing unit and storing instructions to be executed by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the electronic device to implement the method according to the first aspect of the present disclosure.
In a fourth aspect of the present disclosure, a computer-readable storage medium is provided with a computer program stored thereon, the computer program, when executed by a processor, causing the processor to implement the method according to the first aspect of the present disclosure.
In a fifth aspect of the present disclosure, a method for action execution is provided. In the method, a plurality of parameter data are respectively received, from a plurality of first devices, at a second device associated with the plurality of first devices. Parameter data, among the plurality of parameter data, from a first device among the plurality of first devices is transmitted from the first device to the second device in response to a data accumulated indicator associated with the first device meeting a predetermined condition, the data accumulated indicator indicating an amount of data to be transmitted from the first device to the second device, the parameter data comprising reward data and consumption data respectively associated with a set of actions executed at the first device. Aggregated parameter data is determined based on the plurality of parameter data. The aggregated parameter data is transmitted to the plurality of first devices respectively, so as to cause the plurality of first devices to update a plurality of first action models located at the plurality of first devices based on the aggregated parameter data respectively.
In a sixth aspect of the present disclosure, an apparatus for action execution is provided. The apparatus comprises: a receiving module configured for receiving, at a second device associated with a plurality of first devices, a plurality of parameter data from the plurality of first devices respectively, parameter data, among the plurality of parameter data, from a first device among the plurality of first devices being transmitted from the first device to the second device in response to a data accumulated indicator associated with the first device meeting a predetermined condition, the data accumulated indicator indicating an amount of data to be transmitted from the first device to the second device, the parameter data comprising reward data and consumption data respectively associated with a set of actions executed at the first device; a determining module configured for determining aggregated parameter data based on the plurality of parameter data; and a transmitting module configured for transmitting the aggregated parameter data to the plurality of first devices respectively, so as to cause the plurality of first devices to update a plurality of first action models located at the plurality of first devices based on the aggregated parameter data respectively.
In a seventh aspect of the present disclosure, an electronic device is provided. The electronic device comprises: at least one processing unit; and at least one memory, coupled to the at least one processing unit and storing instructions to be executed by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the electronic device to implement the method according to the fifth aspect of the present disclosure.
In an eighth aspect of the present disclosure, a computer-readable storage medium is provided with a computer program stored thereon, the computer program, when executed by a processor, causing the processor to implement the method according to the fifth aspect of the present disclosure.
It would be appreciated that the content described in the Summary section is neither intended to identify key or essential features of the implementations of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will be readily envisaged through the following description.
The above and other features, advantages and aspects of the implementations of the present disclosure will become more apparent in combination with the accompanying drawings and with reference to the following detailed description. In the drawings, the same or similar reference symbols refer to the same or similar elements, where:
The implementations of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although some implementations of the present disclosure are shown in the drawings, it would be appreciated that the present disclosure can be implemented in various forms and should not be interpreted as limited to the implementations described herein. On the contrary, these implementations are provided for a more thorough and complete understanding of the present disclosure. It would be appreciated that the drawings and implementations of the present disclosure are only for illustrative purposes and are not intended to limit the scope of protection of the present disclosure.
In the description of the implementations of the present disclosure, the term “comprise” and its variants are to be read as open terms that mean “includes, but is not limited to.” The term “based on” is to be read as “based at least in part on.” The terms “one implementation” and “the implementation” are to be read as “at least one implementation.” The term “some implementations” is to be read as “at least some implementations.” Other definitions, either explicit or implicit, may be included below. As used herein, the term “model” may refer to an association relationship between various data. For example, the above association relationship may be obtained based on various technical solutions currently known and/or to be developed in the future.
It is to be understood that data involved in the present technical solution (including but not limited to the data itself, the acquisition or use of the data) should comply with requirements of corresponding laws and regulations and relevant rules.
It is to be understood that, before applying the technical solutions disclosed in various implementations of the present disclosure, the user should be informed of the type, scope of use, and use scenario of the personal information involved in the present disclosure in an appropriate manner in accordance with relevant laws and regulations, and user authorization should be obtained.
For example, in response to receiving an active request from the user, prompt information is sent to the user to explicitly inform the user that the requested operation would acquire and use the user's personal information. Therefore, according to the prompt information, the user may decide on his/her own whether to provide the personal information to the software or hardware, such as electronic devices, applications, servers, or storage media that perform operations of the technical solutions of the present disclosure.
As an optional but non-limiting implementation, in response to receiving an active request from the user, the way of sending the prompt information to the user may, for example, include a pop-up window, and the prompt information may be presented in the form of text in the pop-up window. In addition, the pop-up window may also carry a select control for the user to choose to “agree” or “disagree” to provide the personal information to the electronic device.
It is to be understood that the above process of notifying and obtaining the user authorization is only illustrative and does not limit the implementations of the present disclosure. Other methods that satisfy relevant laws and regulations are also applicable to the implementations of the present disclosure.
The term “in response to” as used herein means a state in which a corresponding event occurs or a condition is satisfied. It will be understood that there is not necessarily a strong correlation between the execution timing of subsequent actions executed in response to the event or condition and the time when the event occurs or the condition is established. For example, in some cases, subsequent actions can be executed immediately when an event occurs or a condition is established; in other cases, subsequent actions can be executed after a period of time after an event occurs or a condition is established.
Many real-world applications, such as data recommendation systems, Internet data push systems, network parameter configuration, crowdsourcing systems, clinical trial systems, and other systems, involve sequential decision-making, in which a decision-maker executes actions in a decided order to maximize their long-term rewards. This type of problem is often modeled as a multi-armed bandit (MAB), in which the decision-maker needs to balance between exploration and exploitation. As a distributed machine learning approach, federated learning can protect user-sensitive data to some extent and thus has gained increasing popularity. Federated MAB extends MAB to the context of federated learning, with multiple clients collaborating to update the model. At this point, a central server is able to leverage a distributed dataset from a large number of clients to improve the performance of MAB algorithms while still protecting the sensitive data of each client.
However, previous research on the federated bandit problem has yet to consider the resource constraints on action execution (i.e., pulling arms of the MAB), which is a crucial issue in the real world. Typically, the execution of an action needs to satisfy resource consumption constraints and achieve the objective of maximizing rewards throughout the entire decision-making process. For example, in a data push system, it is expected to provide users with various data pushes from data providers, and users' feedback on the data push (e.g., clicks, etc.) might involve the protection of sensitive data. The federated bandit may be used to model the data push process, with the goal of optimizing long-term data push effects based on users' feedback. Each data push is associated with a certain consumption. In this case, when performing data push, the decision-making process needs to consider not only the expected rewards (for example, clicks or conversion results) but also the consumption caused by the data push (for example, various resources consumed).
As another example, in the example of parameter configuration for base stations, modeling may also be performed using a federated bandit: the base station corresponds to the client, and each configuration of the base station corresponds to an arm of the MAB. Taking into account the user experience of base station services, mobile network operators only allow a limited number of adjustments, which reflects the restriction of resources. Furthermore, similar problems may also be found in crowdsourcing scenarios. That is, the crowdsourcing platform needs to assign a plurality of tasks to workers and may provide corresponding compensation to workers, so there are budget constraints. However, existing technical solutions do not consider the consumption restriction that should be followed when pulling a certain arm.
A technical solution modeling the problem as an MAB has been proposed. This technical solution describes the problem of executing any of K actions on M clients as follows: each client may correspond to an MAB, at which point pulling a certain arm may correspond to executing a certain action among the K actions. In the traditional setting of MAB, an arm may be represented as a scalar to infer the reward from the unknown arm distribution. As classical modeling, the linear reward model has been widely studied in contextual bandit scenarios. As MAB represents an online learning model that may achieve the intrinsic exploration-exploitation tradeoff in many sequential decision-making problems, many real-world problems are solved by being modeled as bandit problems.
In recent years, bandit problems in multi-agent and distributed environments have gained increasing attention, and technical solutions have been proposed for distributed federated bandits. Specifically, the channel selection problem in distributed wireless networks may be considered, modeled as an MAB with collisions, in which a reward of zero is assigned to clients that select the same arm. Meanwhile, there are also some technical solutions for cooperative estimation in the MAB problem, focusing on network delay and communication efficiency. A distributed linear bandit algorithm has been proposed that uses an efficient communication model, whereas this algorithm fails to protect sensitive data. Although a variety of related technical solutions have been proposed, existing technical solutions cannot be directly used due to the following three main challenges.
The first challenge comes from computation cost. In classical federated learning, clients do not have enough resources to perform complex computation tasks, which requires efficiently solving the knapsack problem and reducing computation overheads. In existing technical solutions, a linear programming (LP) problem is solved in each epoch, incurring high computation costs, especially when there are a large number of clients or arms. Solutions based on an oracle (i.e., optimal solutions) have further been proposed. However, the optimal solution in the real world can hardly be estimated, and strong assumptions on the hypothesis class or distribution are required. The second challenge comes from communication cost. The knapsack constraint under the federated learning framework significantly increases the complexity of the bandit problem. Technical solutions that have been proposed reduce communication cost by setting a specific communication threshold. If the amount of LP calculations is reduced, then unlike traditional federated bandits, the client will not be able to independently update the local action model in each epoch during calculation. Existing technical solutions only consider the communication cost without the computation cost, which is not conducive to improving overall performance. The third challenge comes from the protection of sensitive data. In many real-world applications, users' private information needs to be protected. Existing solutions often involve adding noise and perturbations to the raw data. However, this will affect the performance of the model, and there is still a risk of sensitive data leakage when less noise is added. At this point, how to manage action execution and further complete the federated learning process in a more effective way has become a research hotspot and a difficult issue.
In order to at least partially solve the deficiencies in the prior art, a method for action execution is proposed according to an example implementation of the present disclosure. Specifically, the present disclosure addresses the balance between collaborative exploration and exploitation by studying the knapsack problem. Suppose there are M clients, and any of K actions may be performed at each client. M clients may perform desired actions based on linear rewards and consumption under the coordination of the server, thereby minimizing total regret. According to an example implementation of the present disclosure, the process for action execution is implemented using Federated Linear Bandits with Knapsacks (FedUCBwK).
Confronted with the challenge from computation costs, the present disclosure takes into account the significant delay caused by high computation costs, and thus proposes a technical solution that may reduce the number of LP calculations and solve the problem without the need for strong assumptions. In the face of communication cost challenges, the present disclosure proposes a new synchronization threshold that allows for uniform synchronization of communication and computation. If the client only calculates and updates the policy without timely communication, information from other clients will be lost and the calculation will be less effective; if the client only communicates without calculation, the client will not be able to update the policy, resulting in the transmission of useless information. Faced with the challenge from the protection of sensitive data, the technical solution of the present disclosure only transmits model parameters but not raw data, thereby protecting sensitive data.
A brief overview of a distributed processing environment is described with reference to
As shown in
The server 110 may aggregate parameters from various clients to update the parameters of the action model at the server. Further, the server 110 may transmit corresponding parameters 132 (e.g., integrated parameters), . . . , and 152 to the clients 120, . . . , and 140 respectively, thereby updating the action model at each client. In this way, various constraints on rewards and consumption may be considered in federated learning, thereby obtaining an action model that better matches actual needs.
Further, more details of an example implementation according to the present disclosure are shown with reference to
During the federated learning process, a set of actions 214 to be executed at the first device 210 may be determined from a plurality of actions based on the first action model 212 at the first device 210. A data accumulation indicator 216 associated with the set of actions 214 may be obtained based on the set of actions 214. The data accumulation indicator 216 may here indicate the amount of data to be sent from the first device 210 to the second device 220 associated with the first device 210. If the data accumulation indicator 216 satisfies a predetermined condition, parameter data 218 associated with the set of actions 214 may be sent from the first device 210 to the second device 220 (e.g., the reward and consumption related to the set of actions determined at the first device 210), so that the second device updates the second action model 222 at the second device 220 using the parameter data.
In other words, the data accumulation indicator 216 may indicate whether a communication round is initiated between the first device 210 and the second device 220. In this way, the communication overheads between the two devices may be reduced, thereby improving the efficiency of executing actions. It will be appreciated that although
According to an example implementation of the present disclosure, the data accumulation indicator 216 may be used to indicate whether to send data to the second device 220. If the parameter data generated by the action that has been executed at the first device 210 reaches a predetermined threshold, the accumulated parameter data may be sent to the second device 220. If the amount of accumulated data does not reach the predetermined threshold, one or more actions may continue to be executed and the LP solution may continue to be performed at the first device 210, so that parameter data to be transmitted is further accumulated.
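As an illustration of this flow, the following is a minimal, self-contained sketch (all names and values are hypothetical placeholders, not the exact algorithm of the present disclosure) of a client that accumulates local results and only initiates a communication round once a determinant-ratio indicator on its accumulated data matrix exceeds a threshold:

```python
# Hypothetical sketch: a client executes actions locally and defers
# communication until a determinant-ratio indicator crosses a threshold D.
import numpy as np

rng = np.random.default_rng(0)
K, d = 10, 5
X = rng.uniform(0, 1, size=(d, K))   # static per-arm feature vectors x(a)
V = np.eye(d)                        # accumulated data matrix
V_last_sync = V.copy()
buffered = []                        # locally accumulated (action, reward, cost)
D = 2.0                              # synchronization threshold (placeholder)

for t in range(200):
    a = int(rng.integers(K))         # placeholder for the model's action choice
    x = X[:, a]
    V = V + np.outer(x, x)
    buffered.append((a, rng.random(), rng.random()))  # simulated feedback
    if np.linalg.det(V) / np.linalg.det(V_last_sync) > D:
        # Indicator met: one communication round; only parameter data derived
        # from the buffer would be sent to the server, never the raw data.
        print(f"t={t}: sync, {len(buffered)} actions accumulated")
        V_last_sync = V.copy()
        buffered.clear()
```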
The proposed technical solution may solve the distributed knapsack problem in each epoch using a uniform threshold condition that may balance the regret, communication and computation costs of the client. In this way, at the first device 210, on the one hand, the goal of maximizing the reward may be achieved while satisfying the consumption constraints, and on the other hand, communication overheads and time overheads caused by too frequent communication between the first device 210 and the second device 220 may be avoided, thereby finding a balance between exploration and exploitation and improving the overall performance of federated learning. In addition, during the process of executing actions and updating the action model, the real data collected (for example, user clicks and other operations) is not transmitted, so that the action model may be updated in a distributed manner while ensuring the security of sensitive data.
While the brief overview according to one example implementation of the present disclosure has been described, further details of executing actions in a distributed manner are described below with reference to
During the federated learning process, a plurality of actions may be executed at the first device 210. Here, the plurality of actions may be actions of pulling a certain arm of the MAB. In the initial stage, the plurality of actions may be executed 320 in sequence, and initial parameter data associated with the plurality of actions may be transmitted 322 to the second device 220. Similar operations may be performed at the first device 310, for example, a plurality of actions may be executed 320′ at the first device 310, and initial parameter data associated with the plurality of actions may be transmitted 322′ to the second device 220. The second device 220 may receive initial parameter data from each first device (e.g., the first devices 210, . . . , and 310), and then determine 324 aggregated initial parameter data. The second action model at the second device 220 may be updated using the aggregated initial parameter data.
Further, the second device 220 may transmit 326 the aggregated initial parameter data to the first device 210. Similarly, the second device 220 may transmit 326′ the aggregated initial parameter data to the first device 310. The first action model at the first device 210 may be updated using the aggregated initial parameter data; similarly, the first action model at the first device 310 may be updated using the aggregated initial parameter data. Here, the process indicated by arrows 320 to 328′ involves an initialization stage, and the corresponding first action models at the first device 210, . . . , 310 and the second action model at the second device 220 may be initially updated.
After the initialization stage, actions to be executed may be determined at each first device using the updated action model. For example, at the first device 210, a set of actions may be executed 330 based on the updated first action model (where the number of actions in the set depends on whether the data accumulation indicator satisfies a predetermined condition). The updated first action model may be utilized to determine one or more actions to be executed and further execute the actions.
Further, a data accumulation indicator associated with the above set of actions may be determined. For example, the data accumulation indicator can be determined using features of actions that have been executed. If the data accumulation indicator satisfies 332 the predetermined condition, a communication round may be initiated from the first device 210 to the second device 220 in order to transmit 334 parameter data associated with the set of actions. If the data accumulation indicator does not meet the predetermined condition, the operation flow may return to the position shown by an arrow 330 and continue to determine and execute subsequent one or more actions based on the updated first action model until the data accumulation indicator associated with the set of actions that have already been executed satisfies the predetermined condition.
Similar operations may be performed at the first device 310, for example, a set of actions may be executed 330′ based on the updated first action model. If the data accumulation indicator associated with the set of actions satisfies 332′ the predetermined condition, parameter data associated with the set of actions may be transmitted 334′ to the second device 220.
Further, the second device 220 may determine aggregated parameter data based on the received parameter data from each first device. The second action model at the second device 220 may be updated using the aggregated parameter data. Further, the second device 220 may transmit 338 the aggregated parameter data to the first device 210. Similarly, the second device 220 may transmit 338′ the aggregated parameter data to the first device 310. At this point, the operation flow may return to the position shown by arrows 328 and 328′, and each first device may update the local first action model using the newly received parameter data.
According to an example implementation of the present disclosure, a predetermined cycle end condition may be set. For example, a predetermined time length may be set, and the process shown in
By means of the example implementations of the present disclosure, the second action model at the second device 220 is an accurate action model obtained via federated learning technology, and the action model can achieve the goal of maximizing the reward while satisfying the predetermined consumption limit. At this point, the first device 210 does not need to transmit the parameter data of each epoch of the LP solution to the second device 220 after executing each action, but can accumulate the parameter data generated by executing a plurality of actions and perform transmission when the data accumulation indicator meets the predetermined condition. In this way, it may be guaranteed that excessive communication overheads will not be incurred during the federated learning process.
Having described the brief overview of the collaboration between client and server, specific Equations related to the action execution process will be introduced below. According to an example implementation of the present disclosure, FedUCBwK can be implemented under a server-client architecture. Suppose there are M clients, and each client can access K actions (arms) of d dimensions, represented as a∈[K]:={1, 2, . . . , K}. At each time t∈[T], an action at,m may be executed and the reward and consumption may be observed at each client m∈[M]. The features (that is, the context) of at,m can be expressed as x(at,m)∈ℝd, the unknown expected reward is expressed as rt,m, and the corresponding resource consumption is expressed as ct,m. Assume there is a fixed total consumption budget B∈ℝ+, which is known to the server. B is a strict limit on resource consumption, and the algorithm terminates when B is exhausted. Alternatively, or in addition, the algorithm may terminate when the runtime reaches a predetermined time length T.
According to an example implementation of the present disclosure, a linear structure with global parameters is proposed. In order to obtain the intrinsic correlation between rewards for executing the same action at different clients, it is assumed that the rewards rt,m have a linear structure: rt,m=⟨θ, x(at,m)⟩+ηt,m(r), where θ∈ℝd is a fixed but unknown vector, and ηt,m(r)∈[−1,1] are independent random noise with zero mean for the rewards rt,m. Furthermore, it is assumed that the consumption ct,m has a linear structure: ct,m=⟨β, x(at,m)⟩+ηt,m(c), where β∈ℝd is a fixed but unknown parameter vector, and ηt,m(c)∈[−1,1] are independent random noise with zero mean for the costs ct,m.
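The assumed linear structure can be illustrated with a small synthetic simulation; the dimensions, parameters, and noise scale below are placeholders chosen for illustration only:

```python
# Synthetic illustration of the linear reward/consumption structure:
# r = <θ, x(a)> + η(r) and c = <β, x(a)> + η(c), with bounded zero-mean noise.
import numpy as np

rng = np.random.default_rng(42)
d, K = 5, 10
theta = rng.uniform(0, 1, size=d)    # unknown reward parameter θ
beta = rng.uniform(0, 1, size=d)     # unknown consumption parameter β
X = rng.uniform(0, 1, size=(d, K))   # feature x(a) for each of the K arms

def pull(a):
    """Execute arm a and return a noisy reward and consumption."""
    x = X[:, a]
    reward = theta @ x + rng.uniform(-0.1, 0.1)  # small noise within [-1, 1]
    cost = beta @ x + rng.uniform(-0.1, 0.1)
    return reward, cost

r, c = pull(3)
print(f"arm 3: reward={r:.3f}, consumption={c:.3f}")
```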
In terms of communication, assume there is a central server in a distributed environment, and clients may communicate with the server periodically. Specifically, during each communication, the clients may send locally accumulated parameter data to the central server, and then the central server may aggregate the received parameter data to update the action model and calculate a current policy. The central server may then broadcast the policy to all clients. The communication cost of the algorithm may be defined as the number of scalars (integers or reals) communicated between the server and the clients.
In terms of computation, the computation problem may be described as an MAB problem in a linear static environment, which will be solved in the present disclosure by decomposing the original problem into multiple linear programming problems. The present disclosure defines the computation cost of an algorithm as the number of times the linear programming problem is solved.
In terms of sensitive data protection, the context for each client is static, which can be understood as meaning that, for each client, its characteristics and usage habits will not change within a certain period of time. Although it is not time-varying, the context of each client is personal information, and clients still need to protect their sensitive data. Moreover, the reward of each action, e.g., user feedback (click and purchase behavior) on data in the data push system, is also sensitive information at clients. The technical solution of the present disclosure only requires each client to transmit the parameters calculated in each epoch instead of directly transmitting the raw data.
According to an example implementation of the present disclosure, an objective function in terms of regret may be determined based on the following Equation, that is, the objective function is to execute actions that minimize the expected regret among all clients:

REGRET(T)=OPT−𝔼[Σt=1TΣm=1M⟨θ, x(at,m)⟩]
In this Equation, REGRET(T) represents the objective function, OPT represents the total expected reward for the optimal policy, T represents the predetermined time length, M represents the number of clients, at,m represents the action executed at the client m at time t, x(at,m) represents the features of the corresponding action, and θ represents a reward-related parameter. At the same time, it is necessary to ensure that the consumption does not exceed the overall budget of consumption, that is, Σt=1TΣm=1M ct,m≤B.
In terms of distributed choices, regret can represent the difference between the total reward obtained by the algorithm of the present disclosure and OPT. At this point, the problem description of the present disclosure adopts a federated setting, and the present disclosure expands the OPT-LP problem to a distributed scenario.
In terms of LP relaxation, the present disclosure defines a linear programming relaxation for the expected total reward under a mixed-arm policy. The present disclosure may describe any possible expected reward and possible expected consumption for each action based on the following linear operation:
In the above Equation, X∈ℝd×K is a matrix that contains the feature vectors x of the K arms, and Δ={p: Σi=1K pi=1, pi≥0, i=1, . . . , K}. pi can represent the probability of pulling arm i. Let LP(XTθ, XTβ) represent the value of the LP relaxation, and β represent the consumption-related parameter. It can be concluded that:
LP(XTθ, XTβ) is an upper bound for OPT, which is denoted as OPT-LP in the present disclosure. Therefore, the present disclosure can use OPT-LP to replace OPT in the regret analysis.
In terms of distributed decomposition, when there are multiple clients, how to allocate the budget among them may be decided. In the case of a static context setting, each arm has a corresponding context that does not change over time for all clients. Thus, decomposition may be performed by dividing the budget equally among all clients.
First, according to the LP relaxation, the budget can be allocated into each epoch so that at each time period t, the budget will reach B/T. Suppose the budget is further divided and assigned to M clients. Then for each client, a distributed LP problem can be expressed in the following form:
In the above Equation, Bm represents the budget allocated to the client m. Based on Equation 4, it may be derived that:
If there is an optimal solution, then Bm≥min (XTβ) for each m. A more intuitive understanding is that the current budget must be at least the cost of the cheapest arm; otherwise, there will be no arm that can be pulled. By using the strong duality theorem, the following Equation can be obtained:
Therefore, the present disclosure can further obtain the total reward of M clients:
As ym=y for all m∈[M], no matter how the budget is allocated, as long as there is an optimal solution, the objective achieved by the LP relaxation optimal solution is equal. For simplicity, the budget can be divided equally among the clients. Therefore, for any XTθ, XTβ, let DisLP(XTθ, XTβ) represent the value of the following linear programming.
MT·DisLP(XTθ, XTβ) can be used instead of OPT in the analysis of regret.
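As an illustration, the per-client relaxed problem described above can be sketched with an off-the-shelf LP solver. The estimates and budget below are synthetic placeholders (the budget B/(MT) is chosen so that this toy instance is feasible), and the program is an illustration of the described relaxation rather than the exact DisLP of the present disclosure:

```python
# Toy DisLP sketch: maximize expected reward over a distribution p on K arms
# subject to the per-client, per-epoch budget B/(MT).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(7)
d, K, M, T, B = 5, 10, 4, 1000, 6000.0
X = rng.uniform(0, 1, size=(d, K))
theta_hat = rng.uniform(0, 1, size=d)    # reward parameter estimate
beta_hat = rng.uniform(0, 1, size=d)     # consumption parameter estimate

expected_reward = X.T @ theta_hat        # one entry of X^T θ per arm
expected_cost = X.T @ beta_hat           # one entry of X^T β per arm

res = linprog(
    c=-expected_reward,                  # linprog minimizes, so negate
    A_ub=expected_cost[None, :],         # <X^T β, p> ≤ B/(MT)
    b_ub=[B / (M * T)],
    A_eq=np.ones((1, K)),                # Σ_i p_i = 1, p_i ≥ 0
    b_eq=[1.0],
    bounds=[(0.0, 1.0)] * K,
)
if res.success:
    print("mixed-arm policy p:", np.round(res.x, 3))
```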
To protect clients' sensitive data, the present disclosure only communicates parameter data, and each client saves the raw data locally. Therefore, the calculation of the knapsack problem and arm selection is performed locally on each client. The central server is only responsible for aggregating the received parameter data and then sending it to each client. Furthermore, to meet the computation requirement, the central server sends B/M (alternatively, or in addition, the budget can also be divided unevenly) instead of the original total budget to each client. Each client only knows its own allocated budget without knowing the total value. In this way, sensitive data on both sides can be protected.
In order to solve the computation and communication efficiency problems, a unified communication scheme is proposed. Specifically, within a time period, when the amount of accumulated information (i.e., pulled arms and corresponding feedback results) is limited, the amount of parameter data will not change much. Therefore, the present disclosure can divide the total time into different communication rounds, where communication and synchronization are performed at the end of each communication round and the LP problem is solved. When estimating parameters, the linear bandit problem can be treated based on “optimism in the face of uncertainty”, thereby obtaining optimistic estimates of the parameters β and θ.
The specific process for action execution will be described with reference to the specific steps in
In summary, in the initial stage each client m can execute various actions separately and observe feedback (the corresponding reward and consumption) for initialization. After initialization until the algorithm stops (i.e., when the time range T is not reached and the budget is not exhausted), communication may proceed round by round at the client. In epoch e of each communication round in the algorithm 400, each client m can execute an action according to the policy pe for this epoch, calculated at the end of the previous epoch (line 11). Then, the number of times f(m, a) of executing the action a is recorded (line 12), and the matrix Vt,m, the total reward Rm,a and the total consumption Cm,a of each action are updated (lines 14-16). Then, adding together the matrices of all actions, the total matrix Vt,m of the client m can be obtained. This matrix can be viewed as a matrix containing data up to time t (line 17). Then, it is determined whether the following Equation is true (i.e., it is judged whether the data accumulation indicator meets the predetermined condition):
In this Equation, D is the threshold to determine whether synchronization is required. An intuitive understanding of det Vt,m/det Ve is that this ratio reflects how much new information the client has accumulated since the last synchronization; once it exceeds D, a communication round is warranted.
Afterwards, all clients can send their calculated parameter data and the number of action executions to the central server for aggregation. At this point, the central server only has the relevant parameters {circumflex over (θ)}m,ae and {circumflex over (β)}m,ae of different actions, but does not have any contextual information on different actions, so it cannot obtain the matrix. Since {circumflex over (θ)}m,ae and x(a) are in the same direction, the unit vector of x(a) can be obtained as {circumflex over (θ)}m,ae/∥{circumflex over (θ)}m,ae∥. Additionally, the central server can record the number of executions for each action. Therefore, the present disclosure can summarize and calculate aggregated parameter data by:
After receiving the above parameters from the central server, each client can maintain a confidence set 𝒞e,θ⊆ℝd for the parameter θ and a confidence set 𝒞e,β⊆ℝd for the parameter β in epoch e. Specifically, following techniques commonly used in previous work on linear contextual bandits, the confidence set 𝒞e,θ is an ellipsoid centered at {circumflex over (θ)}e, while the confidence set 𝒞e,β is an ellipsoid centered at {circumflex over (β)}e. The confidence set 𝒞e,β can be constructed using {circumflex over (β)}e and Ve.
In the above Equation,
This explains the reason why the synchronization condition is set to Equation 9: the volumes of the confidence ellipsoids for both θ and β depend on det(Vt,m). If det(Vt,m) does not vary greatly, the confidence guarantee is not affected even if the confidence ellipsoid changes slightly, and Ve can remain in use throughout epoch e. Therefore, det(Ve) instead of det(Vt,m) can be used to participate in the calculation. Then, an optimistic estimate of these parameters can be used:
At this point, the upper confidence limit for rewards and the lower confidence limit for consumption can be calculated. Each client will then calculate and update the policy of selecting the arm for the next epoch. The following linear programming problem can be solved:
It will be understood that to meet the hard constraint restrictions,
can be used in place of
to allow some estimation error. At this point, the algorithm will not end prematurely due to running out of budget.
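A hedged sketch of this optimistic step follows: an upper confidence bound on each arm's reward and a lower confidence bound on its consumption are computed from the ellipsoid width {square root over (xTV−1x)}, and the per-epoch budget is deflated to absorb estimation error. The confidence radius α and the deflator ε below are illustrative placeholders, not the exact constants of the present disclosure:

```python
# Optimism in the face of uncertainty: reward UCB and consumption LCB per arm.
import numpy as np

rng = np.random.default_rng(1)
d, K = 5, 10
X = rng.uniform(0, 1, size=(d, K))
V = np.eye(d) + X @ X.T                  # aggregated data matrix V_e
theta_hat = rng.uniform(0, 1, size=d)    # center of the reward ellipsoid
beta_hat = rng.uniform(0, 1, size=d)     # center of the consumption ellipsoid
alpha = 0.5                              # confidence radius (placeholder)

V_inv = np.linalg.inv(V)
width = np.sqrt(np.sum(X * (V_inv @ X), axis=0))   # sqrt(x^T V^-1 x) per arm

reward_ucb = X.T @ theta_hat + alpha * width                # optimistic reward
cost_lcb = np.maximum(X.T @ beta_hat - alpha * width, 0.0)  # optimistic cost

# A deflated per-epoch budget keeps estimation error from exhausting the
# hard budget before the horizon T is reached.
eps, B, M, T = 0.1, 6000.0, 4, 1000
deflated_budget = (1 - eps) * B / (M * T)
print(np.round(reward_ucb, 2), np.round(cost_lcb, 2), deflated_budget)
```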
Having described the Equations involved in algorithms 400 and 500, further details of the interaction between the client and the server are described below with reference to
Referring to
According to an example implementation of the present disclosure, a plurality of actions may be executed at the client, and a plurality of rewards and consumption associated with the plurality of actions may be obtained respectively. Return to
With the example implementations of the present disclosure, the data accumulation indicator Vt
According to an example implementation of the present disclosure, initial parameter data associated with the plurality of actions may be determined based on the plurality of actions, the plurality of rewards, and the plurality of consumption. Then, the initial parameter data can be transmitted to the server, so that the server can update the corresponding action model using the initial parameter data. Here, the reward rm,a can represent the benefit yielded by executing the action at,m, and the consumption cm,a can represent the consumption of resources allocated to the client, incurred by executing the action. In the application environment of data push, the reward can represent the increase in click rate due to pushing certain data, and consumption can represent the resources consumed due to pushing the data. With the example implementation of the present disclosure, sufficient consideration is given to both factors of reward and consumption due to execution of actions, thereby maximizing the reward while satisfying consumption constraints.
Still with reference to
As shown in line 6 of the algorithm 400, in the initialization stage, after each action has been executed, reward data {circumflex over (θ)}m,a and consumption data {circumflex over (β)}m,a for each action can be sent from the client to the server so that the server determines the aggregated parameter data to update the action model at the server.
More details of the process performed at the server will be described with reference to
Specifically, in line 3 of the algorithm 500, the server may receive initial parameter data from each client m (i.e., reward data {circumflex over (θ)}m,a and consumption data {circumflex over (β)}m,a determined in line 4 and line 5 of the algorithm 400). Here, each initial parameter data may come from a corresponding client and is determined based on K actions executed at the client and a plurality of rewards and a plurality of consumption associated with the K actions respectively. In this way, the server can receive initial parameter data from each client respectively, and then perform an aggregation operation based on the received initial parameter data from each client, thereby obtaining aggregated initial parameter data and updating the action model at the server.
Further, aggregated initial parameter data may be determined at the server based on the plurality of initial parameter data. A specific Equation for generating the aggregated initial parameter data is shown in lines 4 to 6 of the algorithm 500. Here, the aggregated initial parameter data may comprise: aggregated accumulated data, aggregated reward data, and aggregated consumption data. Specifically, an equation for determining the aggregated accumulated data is shown in line 4. For example, the aggregated accumulated data V0 in the aggregated parameter data can be determined based on the reward data. As shown in line 4 of the algorithm 500, a ratio of the outer product of the reward data from each client to the square of the norm of the reward data can be determined and then summed. Data from each of the M clients can be aggregated. For example, the sums for each client may be added together to get the aggregated accumulated data V0.
Further, an equation for determining the aggregated reward data {circumflex over (θ)}0 in the aggregated initial parameter data based on the reward data {circumflex over (θ)}m,a and the aggregated accumulated data V0 is shown in line 5 of the algorithm 500; and an equation for determining the aggregated consumption data {circumflex over (β)}0 in the aggregated initial parameter data based on the consumption data {circumflex over (β)}m,a and the aggregated accumulated data V0 is shown in line 6. In this way, the action model at the server can be updated based on parameter data from each client in the initial stage, thereby improving the accuracy of the action model in a distributed training manner.
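The V0 computation can be sketched as follows. The V0 line follows the text directly (outer products divided by squared norms, summed over clients and arms); since the exact Equations of lines 5 and 6 are not reproduced above, the simple averaging used for {circumflex over (θ)}0 and {circumflex over (β)}0 below is only an illustrative stand-in:

```python
# Server-side initial aggregation sketch (values synthetic).
import numpy as np

rng = np.random.default_rng(3)
M, K, d = 4, 10, 5
theta_ma = rng.uniform(0.1, 1.0, size=(M, K, d))  # reported reward data
beta_ma = rng.uniform(0.1, 1.0, size=(M, K, d))   # reported consumption data

V0 = np.zeros((d, d))
for m in range(M):
    for a in range(K):
        v = theta_ma[m, a]
        V0 += np.outer(v, v) / (v @ v)   # outer product over squared norm

theta0 = theta_ma.mean(axis=(0, 1))      # stand-in for line 5 (exact form elided)
beta0 = beta_ma.mean(axis=(0, 1))        # stand-in for line 6 (exact form elided)
# (V0, theta0, beta0) would then be sent back to all M clients (line 7).
print(np.round(np.diag(V0), 2))
```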
Then, the server can transmit the aggregated initial parameter data to the plurality of clients respectively, so that the plurality of clients can respectively update local action models based on the aggregated initial parameter data. Specifically, in line 7 of the algorithm 500, the determined aggregated initial parameter data (including V0, {circumflex over (θ)}0 and {circumflex over (β)}0) may be transmitted to each client. In this way, each client can update the local action model based on the received aggregated initial parameter data, thereby realizing federated learning.
Further, the client may receive aggregated initial parameter data for updating the local action model (the aggregated initial parameter data is determined by the server based on the initial parameter data), so as to update the local action model using the aggregated initial parameter data. Returning to
At this point, the initialization operation at the client ends, and a set of the plurality of actions may be executed in subsequent operations. When the accumulated data indicator associated with the executed action meets the predetermined condition, the corresponding parameter data may be transmitted from the client to the server. In line 9 of the algorithm 400, the termination condition of the operation process can be set, that is, the algorithm is terminated when the execution time reaches the predetermined time length T. In line 10 of the algorithm 400, it can be judged whether the resources allocated to the client have been exhausted, that is, whether the corresponding budget allocated has been exhausted. If the budget has not yet been exhausted, the corresponding action can be selected and executed based on the determined confidence of executing the action. If the budget has been exhausted, the loop is exited. In this way, the federated learning process can be managed in a more flexible way and terminated when needed.
According to an example implementation of the present disclosure, a target action may be determined from the plurality of actions by a vector output by the updated action model. In the first loop after the initialization stage, as shown in line 11 of the algorithm 400, a target action at,m to be executed at this time may be determined based on the vector pe (e is 0 at this time), and various parameters at the client may be updated based on the action at,m. As shown in line 12 of the algorithm 400, the number of executions f(m, at,m) of the action at,m can be incremented; as shown in line 13, the reward rm,a and consumption cm,a associated with executing the action may be obtained respectively; as shown in line 14, the data accumulated indicator Vt,m may be updated based on the features of the target action; as shown in line 15, the reward data Rm,a (i.e., the summation of the rewards rm,a) at the client may be updated; as shown in line 16, the consumption data Cm,a (i.e., the summation of the consumption cm,a) may be updated.
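The per-step update of lines 11 to 16 can be sketched compactly as follows (variable names are illustrative):

```python
# One client step: draw an action from the epoch policy, observe feedback,
# and update the execution count, data matrix, and running totals.
import numpy as np

rng = np.random.default_rng(5)
d, K = 5, 10
X = rng.uniform(0, 1, size=(d, K))
p_e = np.full(K, 1.0 / K)               # policy from the previous epoch
f = np.zeros(K, dtype=int)              # executions per action
V = np.eye(d)                           # data accumulated indicator matrix
R = np.zeros(K)                         # total reward per action
C = np.zeros(K)                         # total consumption per action

a = rng.choice(K, p=p_e)                # line 11: draw the target action
f[a] += 1                               # line 12: increment f(m, a)
reward, cost = rng.random(), rng.random()  # line 13: observe feedback
V += np.outer(X[:, a], X[:, a])         # line 14: update V_{t,m}
R[a] += reward                          # line 15: accumulate reward
C[a] += cost                            # line 16: accumulate consumption
```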
Further, as shown in line 17, it can be judged whether the accumulated data indicator Vt,m at this time meets the predetermined condition (that is, whether Equation 9 described above is established). If the judgment result is “yes”, the operation flow advances to line 18, and the client can send a synchronization signal to the server in order to start a communication round. If the judgment result is “No”, the operation flow continues to perform the next epoch of operations.
As shown in line 19, if it is determined that the communication round has been started, the operations shown in lines 20 to 26 are performed. Specifically, as shown in line 20, the parameter Vt
With reference to
Specifically, parameter data from each of the plurality of clients may be received separately. Here, parameter data, among the plurality of parameter data, from a client of the plurality of clients is transmitted from the client to the server in response to a data accumulated indicator associated with the client meeting a predetermined condition, and the data accumulated indicator indicates the amount of data to be transmitted from the client to the server. In other words, the parameter data received here is accumulated data formed after the training process at the client accumulates to a certain stage. In this way, the client does not have to transmit parameter data from single action executions one by one, but executes a set of actions at the client depending on a specific predetermined condition and then transmits the corresponding parameter data to the server. As a result, the number of communication epochs between the client and the server can be reduced, thereby reducing communication overheads while ensuring the training effects.
According to an example implementation of the present disclosure, the parameter data received by the server may comprise reward data and consumption data respectively associated with the set of actions performed at the client. For a specific client m, the parameter data may comprise the reward data {circumflex over (θ)}m,ae and consumption data {circumflex over (β)}m,ae of the client m. Here, the reward data {circumflex over (θ)}m,ae can represent the revenue generated by executing each action in the set of actions at the client m; and the consumption data {circumflex over (β)}m,ae can represent the consumption of resources allocated to the client, generated by executing each action in the set of actions at the client m. Further, the parameter data may comprise the frequency data for each action f(m, at,m). The frequency data can identify the number of executions of an action in the set of actions executed at the client m. In this way, the updated parameters for the action model generated based on different action data at each client can be collected from each client in a distributed manner, thereby facilitating the improvement of the training precision of the action model.
According to an example implementation of the present disclosure, the aggregated parameter data may be determined in the following manner. For example, the aggregated accumulated data in the aggregated parameter data may be determined based on the frequency data and the reward data. Specifically, as shown in line 11 of the algorithm 500, the aggregated accumulated data Ve in the aggregated parameter data for the current communication round e can be determined based on the frequency data and the reward data.
According to an example implementation of the present disclosure, the aggregated reward data in the aggregated parameter data may be determined based on the frequency data and the reward data. Specifically, as shown in line 12, the aggregated reward data {circumflex over (θ)}e in the aggregated parameter data for the current communication round e can be determined based on the frequency data and the reward data.
According to an example implementation of the present disclosure, the aggregated consumption data in the aggregated parameter data may be determined based on the frequency data and the consumption data. As shown in line 13, the aggregated consumption data {circumflex over (β)}e in the aggregated parameter data for the current communication round e can be determined based on the frequency data and consumption data. In this way, the aggregated parameter data can be determined in a simple and clear-cut way, and further the action model at the server can be updated.
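Since the exact Equations of lines 11 to 13 are not reproduced above, the frequency-weighted aggregation below is only an assumed illustration; it does, however, use the unit-vector recovery {circumflex over (θ)}m,ae/∥{circumflex over (θ)}m,ae∥ described earlier:

```python
# Assumed sketch of the per-round server aggregation (illustrative only).
import numpy as np

rng = np.random.default_rng(9)
M, K, d = 4, 10, 5
f = rng.integers(1, 20, size=(M, K))             # reported counts f(m, a)
theta_e = rng.uniform(0.1, 1.0, size=(M, K, d))  # reported reward data
beta_e = rng.uniform(0.1, 1.0, size=(M, K, d))   # reported consumption data

V_e = np.zeros((d, d))
for m in range(M):
    for a in range(K):
        u = theta_e[m, a] / np.linalg.norm(theta_e[m, a])  # unit vector of x(a)
        V_e += f[m, a] * np.outer(u, u)          # count-weighted outer product

w = f[..., None] / f.sum()                       # frequency weights
theta_agg = (w * theta_e).sum(axis=(0, 1))       # stand-in for line 12
beta_agg = (w * beta_e).sum(axis=(0, 1))         # stand-in for line 13
print(np.round(theta_agg, 2), np.round(beta_agg, 2))
```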
Further, as shown in line 14, the server can send the aggregated parameter data to each client (e.g., including the aggregated accumulated data Ve, aggregated reward data {circumflex over (θ)}e and aggregated consumption data {circumflex over (β)}e for the current communication round e). At this point, this aggregated parameter data can be used to update the action model at each client, thereby continuously performing federated learning.
According to an example implementation of the present disclosure, the client may receive aggregated parameter data for updating the client's local action model from the server, and the aggregated parameter data is determined by the server based on the parameter data (as in lines 11 to 13 of the algorithm 500). Still referring to
Description has been presented above to determining whether to start a communication round for transmitting parameter data from the client to the server depending on whether the data accumulated indicator meets the predetermined condition. In this way, the communication round can be started only when the data accumulated indicator meets the predetermined condition, thereby achieving the goal of reducing communication overheads. According to an example implementation of the present disclosure, the algorithm shown in
According to an example implementation of the present disclosure, the above process may be terminated when the predetermined condition is met (for example, a predetermined time length is reached, or a predetermined budget is exhausted). At this point, the training process of the action model ends, and corresponding tasks may be executed using the trained action model.
Theoretical guarantees in terms of regret, communication cost, and computation cost will be provided below. For the regret, the following proof is provided.
Theorem 6.1: Using the algorithms 400 and 500 with
the following high-probability upper bound on the regret can be obtained:
Lemma 6.1: For any δ>0, with probability 1−Mδ, θ always lies in the constructed confidence set 𝒞t,m for all t and all m.
Lemma 6.2: Let {Xt}t=1∞ represent a sequence in ℝd, let V be a d×d positive definite matrix, and define Vt=V+Σs=1t XsXsT. It can be concluded that
Furthermore, if ∥Xt∥2≤L for all t, then
Lemma 6.3: With the probability 1−Mδ, the single-step difference difft,m=⟨{tilde over (θ)}t,m−θ, x(at,m)⟩ is bounded by
Lemma 6.4: (Azuma-Hoeffding inequality). If a supermartingale (Yt: t≥0), corresponding to a filtration ℱt, satisfies |Yt−Yt-1|≤ct for some constant ct, for all t=1, . . . , T, then for any a≥0,
Lemma 6.5: Let A, B and C be positive semi-definite matrices such that A=B+C. Then, there is
Lemma 6.6: The Diff(T) is bounded by O(d{square root over (DMT log(MT))}).
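For reference, Lemma 6.2 above corresponds to the standard elliptical potential lemma from the linear bandit literature; a common statement (assuming V=λI and ∥Xt∥2≤L, and not necessarily the exact form elided above) is:

```latex
\sum_{t=1}^{T} \min\left(1,\; \lVert X_t \rVert_{V_{t-1}^{-1}}^{2}\right)
  \;\le\; 2\log\frac{\det V_T}{\det V},
\qquad
\log\frac{\det V_T}{\det V} \;\le\; d\log\left(1 + \frac{T L^{2}}{\lambda d}\right).
```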
In the following, the proof process is provided. First of all, although the same estimator {tilde over (θ)}e is used in each epoch e, for analysis purposes {tilde over (θ)}t,m can be used to represent it for each time period t and each client m, that is, {tilde over (θ)}t,m={tilde over (θ)}e for t in epoch e. In the present disclosure, the total time will be divided into epochs by communication rounds. Assuming that there are E epochs, the aggregated matrix in the epoch e is expressed as Ve. By the communication threshold, the following may be obtained
Otherwise, there will be a synchronization. The MT pulls can all be viewed as being performed by one client in a round-robin fashion (i.e., according to a1,1, a1,2, . . . , a1,M, a2,1, . . . , aT,M). {tilde over (V)}t,m=λI+Σ{(p,q):(p<t)∨(p=t∧q<m)}x(ap,q)x(ap,q)T may be used to represent the matrix that can be obtained when the client executes x(at,m). With the disclosed algorithm, each client m will use a random policy received from the central server, generated by the aggregated matrix in each communication round. Therefore, the gap between matrices can be constrained as:
Therefore, by Lemma 6.5,
Then, the single client difference bound can be used and the difference proved. Suppose 𝒯e represents the set of (t, m) pairs that belong to epoch e. Using Lemmas 6.2 and 6.3, and letting
it may be obtained
The estimates {tilde over (θ)}t,m and {tilde over (β)}t,m satisfy the following properties: with the probability 1−Mδ, {tilde over (θ)}t,m≥θ and {tilde over (β)}t,m≤β. Property (1):
Property (2):
Since 𝔼[⟨{tilde over (θ)}t,m, x(at,m)⟩|pt, XT{tilde over (θ)}t,m]=XT{tilde over (θ)}t,m·pt, it can be obtained, using Lemma 6.4, that with probability 1−Mδ,
Then, according to the Diff(T) in Lemma 6.6, the gap between the real reward of the pulled arm and the expected reward under the fractional solution can be bounded using the confidence upper bound,
In the same way as above, it can further be obtained
Then
can be set, which yields Property (2). According to Property (1), the definition and problem decomposition of OPT can be expressed as:
Then, using Property (2) and Equation 18, the hard constraint can be satisfied, i.e.,
Therefore, the algorithm will not terminate before time T. The total real reward obtained can be expressed as REW:
In terms of communication cost, according to the algorithm 400 and the algorithm 500, it can be determined that if
then a communication round is started, that is:
Then by Lemma 6.2, it can be obtained:
Further,
According to an example implementation of the present disclosure, communication is only required at the end of each epoch, when each client sends $O(dK)$ numbers to the server and then downloads $O(d^2)$ numbers, so that the communication cost in each epoch is $O(Md(d+K))$. Therefore, the total communication cost is
In terms of computation cost, the algorithm 400 and the algorithm 500 use the same threshold, so the linear program (LP) is solved only when a communication round starts. The local feature vectors $X$ are private information not known to the central server, so the LP calculation is performed only at each client. Therefore, the computation cost is
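As a concrete illustration of the per-client LP computation described here, the following is a minimal sketch: it maximizes the estimated reward of a probability distribution over the K actions subject to a per-round budget on estimated consumption. The function and variable names, and the exact constraint form, are illustrative assumptions rather than the disclosure's algorithm.

```python
import numpy as np
from scipy.optimize import linprog

def solve_client_lp(X, theta_tilde, beta_tilde, budget_per_round):
    """Solve the per-client LP over a distribution p on the K actions.

    X:           (d, K) local action features (private to the client)
    theta_tilde: (d,)   optimistic reward parameter estimate
    beta_tilde:  (d,)   optimistic consumption parameter estimate
    """
    est_reward = X.T @ theta_tilde        # (K,) estimated reward per action
    est_cost = X.T @ beta_tilde           # (K,) estimated consumption per action
    K = X.shape[1]
    res = linprog(
        c=-est_reward,                    # linprog minimizes, so negate
        A_ub=est_cost[None, :], b_ub=[budget_per_round],
        A_eq=np.ones((1, K)), b_eq=[1.0],
        bounds=[(0.0, 1.0)] * K,
    )
    return res.x                          # probability vector over actions
```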
In the context of the present disclosure, OPT can be defined as Distributed-OPT, and the feasibility of this solution is demonstrated through a distributed decomposition of the LP. Here, FedUCBwK keeps the total budget private through budget allocation and executes policy computation on the clients. It transmits model update parameters instead of raw data, thereby protecting the clients' sensitive data. Based on the transmitted parameters, the present disclosure designs a unified communication and computation threshold to solve the distributed LP problem in each epoch, and controls the regret by a budget deflator. In addition, a trade-off can be made between regret, communication cost, and computation cost. The present disclosure can thus obtain a high-probability regret bound under the stated communication cost and computation cost.
According to an example implementation of the present disclosure, various tasks can be executed on real datasets in order to verify the performance of the task execution process described above, and the performance of many different technical solutions can be compared. Specifically, FedUCBwK represents the algorithm proposed in the present disclosure; FedUCBwK-FullCom represents the technical solution in which clients perform communication and calculation in each epoch (D can then be set to a smaller value); FedUCBwK-NoCom represents a technical solution in which clients do not perform communication (D can then be set to a larger value); FedUCBwK-FewCom represents a technical solution in which clients only perform a small amount of communication and calculation (D can then be set to an intermediate value); FedUCB and FedUCB-FullCom represent existing technical solutions.
According to an example implementation of the present disclosure, the technical solution according to the present disclosure can be verified on multiple public datasets. Specifically, it can be verified on the MovieLens-100K dataset. To handle the sparse and incomplete rating matrix, the present disclosure uses collaborative filtering to complete the rating matrix, and then employs non-negative matrix factorization with 10 latent factors to obtain $R = WH$, where $W \in \mathbb{R}^{943 \times 10}$ and $H \in \mathbb{R}^{10 \times 1682}$.
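A brief sketch of this factorization step, assuming scikit-learn and a hypothetical pre-extracted rating matrix file (the simple mean-fill below merely stands in for the collaborative-filtering completion described above):

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical pre-extracted 943 x 1682 rating matrix; the mean-fill below
# stands in for the collaborative-filtering completion step.
R = np.load("movielens_100k_ratings.npy")
R_filled = np.where(R > 0, R, R[R > 0].mean())

nmf = NMF(n_components=10, init="nndsvd", max_iter=500, random_state=0)
W = nmf.fit_transform(R_filled)           # 943 x 10 user factors
H = nmf.components_                       # 10 x 1682 item factors
```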
According to an example implementation of the present disclosure, the dataset can be used to simulate two scenarios for the data push system. The first scenario involves pushing proper data to a specific group of users. The present disclosure regards each data push, which corresponds to a certain amount of reward and consumption, as an action. The objective is to find the optimal data push for a specific group of users. To accomplish this, the k-means algorithm can be applied to the column vectors of W to produce 20 classes. One class can be selected as the specific user group $u$, and $\theta_u$ denotes the center of the selected user group. The present disclosure can then apply the k-means algorithm to the row vectors of H to produce $K = 10$ groups (actions). At this point, the action executed corresponds to finding the optimal data push for the specific user group.
The second scenario involves finding suitable targeted users to recommend for a specific category of data push. Each targeted user can be considered as an action, and the k-means algorithm is applied to the row vectors of H in order to produce 20 classes, selecting one class as a specific category of data push $a$. Let $\theta_a$ be the center of the selected data push category, and the corresponding consumption parameter $\beta$ is randomly generated. The k-means algorithm can be applied to the row vectors of W to produce $K = 10$ targeted users (actions). In the above experiments, it is possible to set $K = 20$ and $d = 10$. Additionally, it is possible to set $M = 10$ and $T = 1000$. At this point, the action executed corresponds to selecting a suitable targeted user to recommend.
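Continuing the illustrative sketch above (reusing W and H), the clustering steps might look as follows. Clustering the 10-dimensional item vectors, i.e., the columns of H, is an assumption made here for dimensional consistency, and the selected cluster index is illustrative:

```python
from sklearn.cluster import KMeans

# Scenario 1: cluster users into 20 classes and take one class center as
# theta_u, the parameter of the specific user group (index 0 is arbitrary).
user_km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(W)
theta_u = user_km.cluster_centers_[0]

# Cluster item vectors (columns of H) into K = 10 actions; each cluster
# center serves as the feature vector of one action.
action_km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(H.T)
action_features = action_km.cluster_centers_
```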
In terms of human activity recognition, a dataset can be built from the recordings of 30 data contributors. A crowd-sourcing task can be simulated using the dataset, where each activity is treated as an action and the label of each activity is taken as the expected reward. Furthermore, a possible consumption can be generated for each data sample, the consumption parameter $\beta$ can be calculated, and the class center of each action can be calculated. The center is used as a standard feature of the action to solve the LP and OPT problems. In the simulated crowd-sourcing scenario, each task can be regarded as capturing corresponding human activity features, and each task will incur certain consumption and reward in terms of resources. The goal is to learn the optimal task. There can be six classes of tasks, each having 561-dimensional features, so in the bandit experiments of the present disclosure, $K = 6$ and $d = 561$ can be set. Additionally, $M = 10$ and $T = 1000$ can be set.
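A minimal sketch of deriving the standard action features described here, assuming hypothetical arrays for the 561-dimensional activity features and their labels:

```python
import numpy as np

# Hypothetical arrays holding the 561-dimensional activity features and
# their activity labels for the 30 contributors' recordings.
X_data = np.load("har_features.npy")      # (n_samples, 561)
labels = np.load("har_labels.npy")        # (n_samples,), values 0..5

# The per-class center serves as the standard feature of each of the
# K = 6 actions when solving the LP and OPT problems.
action_centers = np.stack(
    [X_data[labels == k].mean(axis=0) for k in range(6)]
)                                         # (6, 561)
```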
In terms of the results of data push recommendation and targeted user recommendation, an explanation is given of why the regret curve of the Federated Bandits with Knapsacks (FBwK) problem differs from that of a traditional bandit: the latter part of the regret curve in FBwK sometimes increases steeply. When the accumulated regret at each time period is calculated based on the technical solution of the present disclosure, if the budget is exhausted, the algorithm stops while OPT continues to accumulate rewards, leading to a sharp increase in regret. This phenomenon is particularly obvious because the existing FedUCB (DisLinUCB) and FedUCB-FullCom do not consider the budget constraint: they prioritize finding the arm with the highest reward, thus incurring lower regret in the early stage, but when the budget is exhausted at a certain time period, the algorithm terminates prematurely and ends up performing poorly.
It will be understood that a larger ϵ is not always better; rather, a proper ϵ needs to be selected. For example, in the data push task, $\epsilon = 0.05$ works better; when ϵ is too large, the estimation of the consumption becomes too small, which affects the selection of the optimal arm. Although the budget is preserved, the overly cautious selection increases the final regret instead.
For the targeted user recommendation task, the corresponding results are shown in the accompanying drawings.
According to an example implementation of the present disclosure, the method described above can be further applied to the allocation of crowd-sourcing tasks, where the experimental results are similar. Alternatively, or in addition, the method described above may be further applied to other application scenarios such as network parameter configuration.
By means of the example implementations of the present disclosure, a federated linear bandits with knapsacks solution is implemented. This technical solution can be used to minimize the regret by selecting, based on the collaboration of M clients, the action from K actions that maximizes the reward while meeting a predetermined budget. According to an example implementation of the present disclosure, only the parameter data of the model is transmitted instead of raw data, thereby protecting the clients' sensitive data. Further, the communication and computation costs involved in federated learning can be reduced by a unified threshold, and the overall regret can be controlled by solving the distributed knapsack problem in epochs.
According to an example implementation of the present disclosure, determining the data accumulated indicator comprises: determining the data accumulated indicator based on features of each action in the set of actions.
According to an example implementation of the present disclosure, the method further comprises: determining the parameter data associated with the set of actions based on the set of actions, and a set of rewards and a set of consumption associated with the set of actions respectively, wherein the reward in the set of rewards represents the revenue yielded by executing an action in the set of actions, and the consumption in the set of consumption represents the consumption of resources allocated to the first device, generated by executing the action.
According to an example implementation of the present disclosure, determining the parameter data comprises: determining the reward data based on a linear calculation of the set of actions and the set of rewards; determining the consumption data based on a linear calculation of the set of actions and the set of consumption; and determining frequency data in the parameter data based on the number of executions associated with the set of actions.
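A minimal sketch of computing this parameter data at a client, under the reading that the reward and consumption data are feature-weighted sums and the frequency data are per-action execution counts; all names are illustrative:

```python
import numpy as np

def local_parameter_data(features, rewards, consumptions, action_ids, K):
    """Compute the parameter data a client sends to the server.

    features:     (n, d) feature vectors of the executed actions
    rewards:      (n,)   observed rewards
    consumptions: (n,)   observed consumptions
    action_ids:   (n,)   index of each executed action among the K actions
    """
    reward_data = features.T @ rewards            # linear calc: sum r_t * x_t
    consumption_data = features.T @ consumptions  # linear calc: sum c_t * x_t
    frequency_data = np.bincount(action_ids, minlength=K)  # executions per action
    return reward_data, consumption_data, frequency_data
```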
According to an example implementation of the present disclosure, the method further comprises: determining a target action from the plurality of actions based on the first action model; updating the data accumulated indicator based on features of the target action; and updating the parameter data using the target action and a target reward and target consumption associated with the target action.
According to an example implementation of the present disclosure, the method further comprises: receiving aggregated parameter data for updating the first action model from the second device, the aggregated parameter data being determined by the second device based on the parameter data; and updating the first action model using the aggregated parameter data.
According to an example implementation of the present disclosure, the aggregated parameter data comprises aggregated reward data, aggregated consumption data, and aggregated accumulated data.
According to an example implementation of the present disclosure, the method further comprises: determining a target action to be executed at the first device from the plurality of actions using the updated first action model.
According to an example implementation of the present disclosure, the method further comprises: before determining the set of actions, executing the plurality of actions at the first device; obtaining a plurality of rewards and a plurality of consumption associated with the plurality of actions respectively; determining initial parameter data associated with the plurality of actions based on the plurality of actions, the plurality of rewards, and the plurality of consumption; and transmitting the initial parameter data to the second device to cause the second device to update the second action model using the initial parameter data.
According to an example implementation of the present disclosure, the method further comprises: receiving aggregated initial parameter data for updating the first action model from the second device, the aggregated initial parameter data being determined by the second device based on the initial parameter data; and updating the first action model using the aggregated initial parameter data.
According to an example implementation of the present disclosure, the method further comprises: terminating the method in response to at least any of: a time length for performing the method reaching a threshold time length; or consumption associated with at least one action that has been executed at the first device reaching threshold consumption.
According to an example implementation of the present disclosure, the first device is a client device for performing federated learning, and the second device is a server device for performing the federated learning.
According to an example implementation of the present disclosure, the plurality of actions comprises at least any one of: a data push action, a user selection action, a crowd-sourcing task allocation action, and a network parameter setting action.
According to an example implementation of the present disclosure, the plurality of parameter data comprises: reward data of the first device, the reward data representing revenue yielded by the set of actions executed at the first device; consumption data of the first device, the consumption data representing consumption of resources allocated to the first device, generated by the set of actions executed at the first device; and frequency data of the first device, the frequency data representing the number of executions of an action in the set of actions.
According to an example implementation of the present disclosure, determining the aggregated parameter data comprises: determining aggregated accumulated data in the aggregated parameter data based on the frequency data and the reward data; determining aggregated reward data in the aggregated parameter data based on the frequency data and the reward data; and determining aggregated consumption data in the aggregated parameter data based on the frequency data and the consumption data.
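A minimal sketch of the server-side aggregation, assuming the client-side tuples from the sketch above; plain summation across clients is an assumption made here, since the exact combination rule is given by the disclosure's algorithm:

```python
import numpy as np

def aggregate(parameter_data_list):
    """Aggregate clients' parameter data at the server.

    parameter_data_list: one (reward_data, consumption_data, frequency_data)
    tuple per client, as in the client-side sketch above.
    """
    rewards, consumptions, frequencies = zip(*parameter_data_list)
    aggregated_reward = np.sum(rewards, axis=0)
    aggregated_consumption = np.sum(consumptions, axis=0)
    aggregated_accumulated = np.sum(frequencies, axis=0)
    return aggregated_reward, aggregated_consumption, aggregated_accumulated
```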
According to an example implementation of the present disclosure, the method further comprises: updating a second action model at the second device based on the plurality of parameter data.
According to an example implementation of the present disclosure, the method further comprises: determining an action to be executed using the updated second action model.
According to an example implementation of the present disclosure, the method further comprises: before receiving a plurality of parameter data respectively from the plurality of first devices at the second device, receiving a plurality of initial parameter data respectively from the plurality of first devices at the second device, the initial parameter data from the first device among the plurality of initial parameter data being determined based on a plurality of actions executed at the first device and a plurality of rewards and a plurality of consumption associated with the plurality of actions respectively; determining aggregated initial parameter data based on the plurality of initial parameter data; and transmitting the aggregated initial parameter data to the plurality of first devices respectively, so as to cause the plurality of first devices to update a plurality of first action models at the plurality of first devices based on the aggregated initial parameter data.
According to an example implementation of the present disclosure, the method further comprises: updating a second action model at the second device based on the plurality of initial parameter data.
According to an example implementation of the present disclosure, the first device is a client device for performing federated learning, and the second device is a server device for performing the federated learning.
According to an example implementation of the present disclosure, the plurality of actions comprises at least any one of: a data push action, a user selection action, a crowd-sourcing task allocation action, and a network parameter setting action.
According to an example implementation of the present disclosure, determining the data accumulated indicator comprises: determining the data accumulated indicator based on features of each action in the set of actions.
According to an example implementation of the present disclosure, the apparatus further comprises: a parameter determining module configured for determining the parameter data associated with the set of actions based on the set of actions, and a set of rewards and a set of consumption associated with the set of actions respectively, wherein the reward in the set of rewards represents the revenue yielded by executing an action in the set of actions, and the consumption in the set of consumption represents the consumption of resources allocated to the first device, generated by executing the action.
According to an example implementation of the present disclosure, the parameter determining module comprises: a first determining module configured for determining the reward data based on a linear calculation of the set of actions and the set of rewards; a second determining module configured for determining the consumption data based on a linear calculation of the set of actions and the set of consumption; and a third determining module configured for determining frequency data in the parameter data based on the number of executions associated with the set of actions.
According to an example implementation of the present disclosure, the apparatus further comprises: a target action determining module configured for determining a target action from the plurality of actions based on the first action model; an indicator updating module configured for updating the data accumulated indicator based on features of the target action; and a parameter updating module configured for updating the parameter data using the target action and a target reward and target consumption associated with the target action.
According to an example implementation of the present disclosure, the apparatus further comprises: a receiving module configured for receiving aggregated parameter data for updating the first action model from the second device, the aggregated parameter data being determined by the second device based on the parameter data; and a model updating module configured for updating the first action model using the aggregated parameter data.
According to an example implementation of the present disclosure, the aggregated parameter data comprises aggregated reward data, aggregated consumption data, and aggregated accumulated data.
According to an example implementation of the present disclosure, the apparatus further comprises: a target action determining module configured for determining a target action to be executed at the first device from the plurality of actions using the updated first action model.
According to an example implementation of the present disclosure, the apparatus further comprises: an executing module configured for, before determining the set of actions, executing the plurality of actions at the first device; a data obtaining module configured for obtaining a plurality of rewards and a plurality of consumption associated with the plurality of actions respectively; an initial parameter determining module configured for determining initial parameter data associated with the plurality of actions based on the plurality of actions, the plurality of rewards, and the plurality of consumption; and an initial parameter transmitting module configured for transmitting the initial parameter data to the second device to cause the second device to update the second action model using the initial parameter data.
According to an example implementation of the present disclosure, the apparatus further comprises: a receiving module configured for receiving aggregated initial parameter data for updating the first action model from the second device, the aggregated initial parameter data being determined by the second device based on the initial parameter data; and a model updating module configured for updating the first action model using the aggregated initial parameter data.
According to an example implementation of the present disclosure, the apparatus further comprises: a terminating module configured for terminating the method in response to at least any of: a time length for performing the method reaching a threshold time length; or consumption associated with at least one action that has been executed at the first device reaching threshold consumption.
According to an example implementation of the present disclosure, the first device is a client device for performing federated learning, and the second device is a server device for performing the federated learning.
According to an example implementation of the present disclosure, the plurality of actions comprises at least any one of: a data push action, a user selection action, a crowd-sourcing task allocation action, and a network parameter setting action.
According to an example implementation of the present disclosure, the plurality of parameter data comprises: reward data of the first device, the reward data representing revenue yielded by the set of actions executed at the first device; consumption data of the first device, the consumption data representing consumption of resources allocated to the first device, generated by the set of actions executed at the first device; and frequency data of the first device, the frequency data representing the number of executions of an action in the set of actions.
According to an example implementation of the present disclosure, the determining module comprises: a first determining module configured for determining aggregated accumulated data in the aggregated parameter data based on the frequency data and the reward data; a second determining module configured for determining aggregated reward data in the aggregated parameter data based on the frequency data and the reward data; and a third determining module configured for determining aggregated consumption data in the aggregated parameter data based on the frequency data and the consumption data.
According to an example implementation of the present disclosure, the apparatus further comprises: a model updating module configured for updating a second action model at the second device based on the plurality of parameter data.
According to an example implementation of the present disclosure, the apparatus further comprises: an action determining module configured for determining an action to be executed using the updated second action model.
According to an example implementation of the present disclosure, the apparatus further comprises: an initial parameter receiving module configured for, before receiving a plurality of parameter data respectively from the plurality of first devices at the second device, receiving a plurality of initial parameter data respectively from the plurality of first devices at the second device, the initial parameter data from the first device among the plurality of initial parameter data being determined based on a plurality of actions executed at the first device and a plurality of rewards and a plurality of consumption associated with the plurality of actions respectively; an initial aggregated parameter determining module configured for determining aggregated initial parameter data based on the plurality of initial parameter data; and an initial aggregated parameter transmitting module configured for transmitting the aggregated initial parameter data to the plurality of first devices respectively, so as to cause the plurality of first devices to update a plurality of first action models at the plurality of first devices based on the aggregated initial parameter data.
According to an example implementation of the present disclosure, the apparatus further comprises: a model updating module configured for updating a second action model at the second device based on the plurality of initial parameter data.
According to an example implementation of the present disclosure, the first device is a client device for performing federated learning, and the second device is a server device for performing the federated learning.
According to an example implementation of the present disclosure, the plurality of actions comprises at least any one of: a data push action, a user selection action, a crowd-sourcing task allocation action, and a network parameter setting action.
The computing device 1200 typically includes multiple computer storage media. Such media may be any available media that are accessible to the computing device 1200, including but not limited to volatile and non-volatile media, and removable and non-removable media. The memory 1220 may be a volatile memory (for example, a register, a cache, a random access memory (RAM)), a non-volatile memory (for example, a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory), or any combination thereof. The storage device 1230 may be any removable or non-removable medium, and may include a machine-readable medium such as a flash drive, a disk, or any other medium, which may be used to store information and/or data (such as training data for training) and may be accessed within the computing device 1200.
The computing device 1200 may further include additional removable/non-removable, volatile/non-volatile storage media.
The communication unit 1240 communicates with a further computing device through the communication medium. In addition, functions of components in the computing device 1200 may be implemented by a single computing cluster or multiple computing machines, which can communicate through a communication connection. Therefore, the computing device 1200 may be operated in a networking environment using a logical connection with one or more other servers, a network personal computer (PC), or another network node.
The input device 1250 may be one or more input devices, such as a mouse, a keyboard, a trackball, etc. The output device 1260 may be one or more output devices, such as a display, a speaker, a printer, etc. The computing device 1200 may also communicate, through the communication unit 1240 as required, with one or more external devices (not shown) such as storage devices and display devices, with one or more devices that enable users to interact with the computing device 1200, or with any device (for example, a network card, a modem, etc.) that enables the computing device 1200 to communicate with one or more other electronic devices. Such communication may be executed via an input/output (I/O) interface (not shown).
According to an example implementation of the present disclosure, a computer-readable storage medium is provided on which computer-executable instructions are stored, wherein the computer-executable instructions are executed by a processor to implement the method described above. According to an example implementation of the present disclosure, a computer program product is further provided. The computer program product is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions, which are executed by a processor to implement the method described above. According to an example implementation of the present disclosure, a computer program product is further provided, on which a computer program is stored, the program, when executed by a processor, implementing the method described above.
Various aspects of the present disclosure are described herein with reference to the flow chart and/or the block diagram of the method, the apparatus, the device and the computer program product implemented in accordance with the present disclosure. It would be appreciated that each block of the flowchart and/or the block diagram and the combination of each block in the flowchart and/or the block diagram may be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to the processing units of general-purpose computers, special-purpose computers, or other programmable data processing apparatuses to produce a machine, such that the instructions, when executed through the computer or other programmable data processing apparatuses, generate an apparatus for implementing the functions/actions specified in one or more blocks of the flowchart and/or the block diagram. These computer-readable program instructions may also be stored in a computer-readable storage medium. These instructions enable a computer, a programmable data processing apparatus, and/or other devices to work in a specific way, so that the computer-readable medium containing the instructions constitutes a product which includes instructions for implementing various aspects of the functions/actions specified in one or more blocks of the flowchart and/or the block diagram.
The computer-readable program instructions may be loaded onto a computer, other programmable data processing apparatus, or other devices, so that a series of operational steps may be performed on a computer, other programmable data processing apparatus, or other devices, to generate a computer-implemented process, such that the instructions which execute on a computer, other programmable data processing apparatuses, or other devices implement the functions/acts specified in one or more blocks in the flowchart and/or the block diagram.
The flowchart and the block diagram in the drawings show the possible architecture, functions, and operations of the system, the method, and the computer program product implemented in accordance with the present disclosure. In this regard, each block in the flowchart or the block diagram may represent a unit, a program segment, or a part of instructions, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions labeled in the blocks may also occur in a different order from that labeled in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, or sometimes in a reverse order, depending on the functions involved. It should also be noted that each block in the block diagram and/or the flowchart, and combinations of blocks in the block diagram and/or the flowchart, may be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
Various implementations of the present disclosure have been described above. The above description is exemplary rather than exhaustive, and is not limited to the disclosed implementations. Many modifications and changes will be obvious to those of ordinary skill in the art without departing from the scope and spirit of the described implementations. The terms used in the present disclosure were chosen to best explain the principles, practical applications, or technical improvements of each implementation, or to enable others of ordinary skill in the art to understand the implementations disclosed herein.
Number | Date | Country | Kind |
---|---|---|---|
202310340865.1 | Mar 2023 | CN | national |