This application is based upon and claims priority to Netherlands Patent Application No. N2036872, filed on Jan. 23, 2024, the entire contents of which are incorporated herein by reference.
The present invention belongs to the technical field of mobile edge computing and, in particular, relates to a method of joint computation offloading and resource allocation in multi-edge smart communities with personalized federated deep reinforcement learning.
Smart cities use intelligent technologies to empower community governance and services, improving the efficiency of community operation and the quality of citizens' lives. In smart communities, End Devices (EDs) are interconnected through wireless links, forming the Internet-of-Things (IoT). The EDs commonly own certain capabilities of data collection and task processing that can support emerging intelligent applications to some extent, such as smart transport, smart grids, and autonomous driving. However, due to the limited computing and power-storage capacities of EDs, it is hard to meet the high demands of intelligent applications for low delay and sustainable processing. In classic cloud computing, computation-intensive and delay-sensitive tasks on EDs are usually uploaded to the remote cloud, which has sufficient resources, for execution. However, the long transmission distance between EDs and the cloud often leads to excessive delay, which seriously degrades Quality-of-Service (QoS).
To alleviate the contradiction between the high demands of intelligent applications and the limited capacities of EDs, the integration of the emerging Mobile Edge Computing (MEC) with Wireless Power Transmission (WPT) is deemed a feasible and promising solution. In MEC, more computing resources are deployed at the network edge close to EDs, which can extend the computing capacities of EDs by offloading their tasks to MEC servers for execution. Meanwhile, EDs can be charged via WPT to maintain their power demands for long-term operation. However, MEC servers are equipped with fewer resources than cloud data centers, and thus excessive delay may occur if too many tasks are offloaded simultaneously. Moreover, offloading decisions are constrained and affected by many factors such as the attributes and characteristics of tasks, the power storage of EDs, and the available resource status of MEC servers. Therefore, it is extremely challenging to design an effective and efficient solution for computation offloading and resource allocation in complex and dynamic MEC environments with multiple constraints.
Most of the classic solutions for computation offloading and resource allocation are based on rules, heuristics, and control theory. Although they can handle this complex problem to some extent, they commonly rely on prior knowledge of systems (e.g., state transitions, demand changes, and energy consumption) to formulate appropriate policies for computation offloading and resource allocation. Therefore, they may work well in specific scenarios but cannot fit well in real-world MEC systems with high dynamics and complexity, causing degraded QoS and excessive system overheads. In contrast, Deep Reinforcement Learning (DRL) can better adapt to MEC environments and make policies with higher generalization abilities. Recently, there have been some DRL-based studies on computation offloading and resource allocation. Most of them adopted value-based DRL methods such as Deep Q-Network (DQN) and Double Deep Q-Network (DDQN), whose action space grows exponentially with the increasing number of EDs, resulting in huge complexity. Moreover, value-based DRL discretizes the continuous space of resource allocation, which may lead to inaccurate policies and undesired results. To better cope with the problem of continuous control, some studies used policy-based DRL methods such as Deep Deterministic Policy Gradient (DDPG), which avoids exponential growth in the action space by separating action selection and value evaluation. However, policy-based DRL is prone to the Q-value overestimation issue, which may cause great fluctuations in the training process and policies that fall into local optima.
Moreover, most of the existing solutions commonly adopt a centralized training manner, where all the information about EDs may need to be uploaded to a central server. This manner enables models to perform well with rich training samples but might cause severe network congestion and privacy leakage. To ameliorate this problem, some distributed training manners (e.g., multi-agent DRL) can be deemed potentially feasible research directions. In multi-agent DRL, each agent regards other agents as environment variables, interacts with the environment independently, and then uses the feedback from the environment to improve its policy. However, when some agents lack training samples, the performance of local models will be seriously limited, making it hard to efficiently achieve model convergence. In contrast, Federated Reinforcement Learning (FRL) implements collaborative model training on data silos with the original purpose of privacy protection. With FRL, MEC servers only upload their model updates to a central server for federated aggregation, and the aggregated global model is then distributed to the MEC servers for the next round of training on each single agent. Therefore, FRL can achieve results comparable to the centralized training manner at a faster convergence speed, which also mitigates the issue of insufficient training samples. However, different smart communities may own personalized demands on QoS and system overheads, and the classic FRL cannot handle this problem because it just naively averages model parameters over MEC servers.
The purpose of the present invention is to provide a method of joint computation offloading and resource allocation in multi-edge smart communities with personalized federated deep reinforcement learning, and to design a new multi-edge smart community system consisting of communication, computing, and energy-harvesting models, where the task execution delay and energy consumption are formalized as the optimization objectives under multiple constraints;
The proposed system of multi-edge smart communities consists of a Central Base Station (C-BS) and m smart communities, denoted by the set R={Ri, i∈m}; in the smart community Ri, an Access Point (AP) interacts with the C-BS and there are n EDs, denoted by the set EDi={EDi,j, i∈m, j∈n}; each AP is equipped with an MEC server (denoted by Mi) that can process the tasks offloaded by EDs and feed back the results, and it can also transmit energy to the EDs within its communication coverage through the wireless network; each ED is equipped with a rechargeable battery that can receive and store energy to power the processes of task offloading and processing;
When uploading Taski,j(h) to Mi for execution via the AP in Ri, the uplink data rate of EDi,j is defined as
In the proposed model, all EDs and MEC servers can offer computing services, thus consider the local and edge computing modes as follows:
In the proposed system, all EDs are equipped with rechargeable batteries with a maximum capacity of bmax; at the beginning of t, the battery power of EDi,j is bi,j(t); during the process of harvesting energy, EDi,j receives energy through WPT and deposits it into the battery in the form of energy packets, and the amount of harvested energy by an ED during t is denoted as et, which can be used to execute tasks locally or offload tasks to Mi for execution; for different system states during t, consider different situations of power variations on EDi,j as follows;
Based on the above system models, the delay and energy consumption of executing a task with different offloading decisions are respectively defined as
where qi,t and qi,e indicate the weights of delay and energy consumption, respectively; C1 indicates that a task can only be executed locally or offloaded to the MEC server for execution; C2 indicates that the delay of executing a task cannot exceed the maximum tolerable delay; C3 indicates that the energy consumption of executing a task cannot exceed the available battery power of an ED; C4 indicates that the sum of the proportion of bandwidth allocated for uploading tasks should be 1; C5 indicates that the sum of the proportion of the computational resources allocated for executing offloaded tasks should be 1.
The DRL agent selects actions under different states by interacting with the single-edge environment and continuously optimizes the policies of computation offloading and resource allocation referring to the reward signals from the environment; accordingly, the state space, action space, and reward function for DRL are defined as follows;
The main steps of the proposed improved twin-delayed DRL-based computation offloading and resource allocation algorithm are as follows: first, the actor's network μi and two critic's networks Qi,1 and Qi,2 are initialized, and the target actor's network μi′ and two target critic's networks Qi,1′ and Qi,2′ are initialized accordingly; two critic's networks that separate action selection and Q-value update are introduced, aiming to improve the training stability; then initialize the number of training epochs P, the number of time-slots H, the number of sub-slots T, the update frequencies of FL fp and the actor's network fa, the replay buffer Gi, the batch size N, and the learning rate τ; for each training epoch, when it comes to the round of FL update, μi(s|θμ
Design a new personalized FL-based training framework to further improve the adaptiveness and training efficiency of the DRL-based computation offloading and resource allocation model for different environments; the proposed personalized FRL-based training framework is as follows: initialize the federated actor's network μf, two federated critic's networks Qf,1 and Qf,2, the number of edges participating in FRL training K (K≤m), and the communication rounds for federated aggregation Pf; in each communication round, introduce a new proximal term to attenuate the dispersion of local updates, the process is defined as
The technical solution of the present invention is described in detail in combination with the accompanying drawings.
Proposed in the present invention is a method of joint computation offloading and resource allocation in multi-edge smart communities with personalized federated deep reinforcement learning.
The method specifically comprises the following design process:
As shown in
In the proposed system, we adopt a discrete-time running mode, which contains H time-slots with the same span, where h=1, 2, . . . , H. At the beginning of h, EDi,j generates a task, denoted by Taski,j(h)=(Di,j(h), Ci,j(h), Td), where Di,j(h) indicates the data volume, Ci,j(h) indicates the required computational resources, and Td indicates the maximum tolerable delay. If a task cannot be completed within the maximum tolerable delay and the available power, it is determined to have failed. Specifically, the tasks generated by EDi,j are placed in its buffer queue, and the tasks that first enter the queue are completed before subsequently arriving tasks can be executed. Moreover, tasks can be processed locally or offloaded to the MEC server for execution. Furthermore, as shown in
When uploading Taski,j(h) to Mi for execution via the AP in Ri, the uplink data rate of EDi,j is defined as
where Bi(t) indicates the available upload bandwidth at the sub-slot t, wi,j(t) indicates the proportion of bandwidth allocated to Taski,j(h), Pi,j(t) indicates the transmission power of EDi,j, gi,j(t) indicates the channel gain between Mi and EDi,j, σ2 indicates the average power of Gaussian white noise, and li,j(t) indicates the distance between Mi and EDi,j.
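The rate equation itself is not reproduced above. A plausible Shannon-capacity form that is consistent with the variables just listed (the path-loss exponent, written here as θ, is an assumption not stated in the source) is

r_{i,j}(t) = w_{i,j}(t)\, B_i(t)\, \log_2\!\left(1 + \frac{P_{i,j}(t)\, g_{i,j}(t)\, l_{i,j}(t)^{-\theta}}{\sigma^2}\right).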
Thus, the delay of uploading Taski,j(h) is defined as
Accordingly, the energy consumption of uploading Taski,j(h) is defined as
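These two equations are likewise not reproduced above. Under the usual transmission model, and using the uplink rate r_{i,j}(t) sketched earlier, one plausible form is

T^{up}_{i,j}(h) = \frac{D_{i,j}(h)}{r_{i,j}(t)}, \qquad E^{up}_{i,j}(h) = P_{i,j}(t)\, T^{up}_{i,j}(h),

i.e., the upload delay is the data volume divided by the uplink rate, and the upload energy is the transmission power multiplied by the upload delay.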
Since the results of executing tasks are much smaller than the data volumes uploaded by tasks, the delay and energy consumption of downloading results from Mi to EDi,j are commonly negligible.
In the proposed model, all EDs and MEC servers can offer computing services, and thus we consider the local and edge computing modes as follows.
When a task is executed on an ED, the delay and energy consumption of executing the task are defined as
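As a hedged illustration (the source equation is not reproduced above), a standard local-computing model using the CPU frequency f_{i,j} of EDi,j and the effective capacitance coefficient k from the parameter settings would read

T^{loc}_{i,j}(h) = \frac{C_{i,j}(h)}{f_{i,j}}, \qquad E^{loc}_{i,j}(h) = k\, f_{i,j}^{2}\, C_{i,j}(h).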
When offloading a task to the MEC server for execution, the delay and energy consumption of executing the task are defined as
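A plausible form, assuming β_{i,j}(t) denotes the proportion of the MEC server's computing capability F_i(t) allocated to the task (consistent with constraint C5) and that the ED-side energy of edge execution is dominated by the upload, is

T^{edge}_{i,j}(h) = T^{up}_{i,j}(h) + \frac{C_{i,j}(h)}{\beta_{i,j}(t)\, F_i(t)}, \qquad E^{edge}_{i,j}(h) = E^{up}_{i,j}(h);

if server-side energy is also counted, a term proportional to the server power P_m and the edge execution time may be added.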
In the proposed system, all EDs are equipped with rechargeable batteries with a maximum capacity of bmax. At the beginning of t, the battery power of EDi,j is bi,j(t). During the process of harvesting energy, EDi,j receives energy through WPT and deposits it into the battery in the form of energy packets, and the amount of harvested energy by an ED during t is denoted as et, which can be used to execute tasks locally or offload tasks to Mi for execution. Specifically, for different system states during t, we consider different situations of power variations on EDi,j as follows.
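As one hedged sketch of these power variations (the case-by-case equations are not reproduced above), the battery level might evolve as

b_{i,j}(t+1) = \min\!\big(b_{i,j}(t) + e_t - E_{i,j}(t),\; b_{\max}\big),

where E_{i,j}(t) denotes the energy consumed during t for local execution or task offloading and is required not to exceed the available battery power.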
Based on the above system models, the delay and energy consumption of executing a task with different offloading decisions are respectively defined as
To minimize the delay and energy consumption of executing tasks, the optimization objective is formulated as
where qi,t and qi,e indicate the weights of delay and energy consumption, respectively. C1 indicates that a task can only be executed locally or offloaded to the MEC server for execution. C2 indicates that the delay of executing a task cannot exceed the maximum tolerable delay. C3 indicates that the energy consumption of executing a task cannot exceed the available battery power of an ED. C4 indicates that the sum of the proportion of bandwidth allocated for uploading tasks should be 1. C5 indicates that the sum of the proportion of the computational resources allocated for executing offloaded tasks should be 1.
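Although the formulation itself is not reproduced above, a plausible statement of P1 that matches the weights q_{i,t}, q_{i,e} and the constraints C1–C5 described here is

P1: \min_{\{\alpha, \beta, w\}} \; \sum_{h=1}^{H} \sum_{j=1}^{n} \big( q_{i,t}\, T_{i,j}(h) + q_{i,e}\, E_{i,j}(h) \big)

subject to C1: \alpha_{i,j}(h) \in \{0, 1\}; C2: T_{i,j}(h) \le T_d; C3: E_{i,j}(h) \le b_{i,j}(t); C4: \sum_{j} w_{i,j}(t) = 1; C5: \sum_{j} \beta_{i,j}(t) = 1, where \alpha_{i,j}(h) is the offloading decision, w_{i,j}(t) the bandwidth proportion, and \beta_{i,j}(t) the computational-resource proportion.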
To address the above optimization problem, we propose a novel Personalized Federated deep Reinforcement learning based computation Offloading and resource Allocation method (PFR-OA). First, for single-edge scenarios, we design an improved twin-delayed DRL-based algorithm to approximate the optimal policy. Next, for multi-edge scenarios, we develop a new distributed training framework based on personalized FL to further enhance the model adaptiveness and training efficiency.
P1 can be transformed into a classic problem of online budgeted maximum coverage, which has been proven to be NP-hard. In this problem, at each step an element ei with associated costs and values is selected, and all the elements selected during the whole process can be denoted as the set E={e1, e2, . . . , en}. The optimization objective of this problem is to find a set E′⊆E that maximizes the total values while the total costs do not exceed the budget. For P1, we regard the task to be processed at each sub-slot as an element in E, the computational resources allocated to the task as costs, and the rewards of completing the task as values. The objective is to find a set {α′, β′, w′}⊆{α, β, w} (i.e., an optimized policy of computation offloading and resource allocation) that maximizes the rewards without exceeding the constraints on available bandwidth and computational resources, which is the optimization objective of P1. Therefore, P1 is an NP-hard problem. To solve this complicated problem, we model it as a Markov Decision Process (MDP) and propose a new DRL-based solution.
Specifically, we design an improved twin-delayed DRL-based algorithm to address P1 for single-edge scenarios, aiming to minimize the delay and energy consumption of executing tasks. Based on an actor-critic framework, the classic twin-delayed DRL combines deep deterministic policy gradient and dual Q-learning, which performs well on many continuous-control problems. However, the classic twin-delayed DRL adopts a manner of local updating, which has a negative impact on global model convergence during distributed training. Moreover, a high frequency of erroneous updates to the actor's network results in serious action dispersion. To address these issues, the proposed improved twin-delayed DRL-based algorithm lessens the unreasonable update frequency of the actor's network and introduces a new proximal term, which together attenuate the dispersion of local updating and reduce the variance of action-value estimation, thereby generating better policies. As shown in
If a task can be successfully completed, the instant reward will be the opposite of the weighted sum of delay and energy consumption. If C2 or C3 cannot be satisfied, the instant reward will be qi,p, which is used as the penalty for failing to complete the task. Therefore, the long-term reward is defined as
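A plausible form of this reward, with γ denoting the reward discount factor used in the experiments, is

r_i(t) = -\big( q_{i,t}\, T_{i,j}(h) + q_{i,e}\, E_{i,j}(h) \big) if C2 and C3 are satisfied, and r_i(t) = q_{i,p} otherwise, with the long-term reward R_i = \sum_{t} \gamma^{t}\, r_i(t).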
The main steps of the proposed improved twin-delayed DRL-based computation offloading and resource allocation algorithm are given in Algorithm 1. First, the actor's network μi and two critic's networks Qi,1 and Qi,2 are initialized, and the target actor's network μi′ and two target critic's networks Qi,1′ and Qi,2′ are initialized accordingly (Line 1). To address the Q-value overestimation issue in classic actor-critic-based DRL, we introduce two critic's networks that separate action selection and Q-value update, aiming to improve the training stability. Next, we initialize the number of training epochs P, the number of time-slots H, the number of sub-slots T, the update frequencies of FL fp and the actor's network fa, the replay buffer Gi, the batch size N, and the learning rate τ (Line 1). For each training epoch, when it comes to the round of FL update, μi(s|θμ
Specifically, the proposed algorithm uses the critic's network to fit Qi(s(t), a(t)), which can accurately reflect the Q-value of each action. Meanwhile, we use the actor's network to fit the mapping between s(t) and a(t), so that the DRL agent can take proper actions at different states and maximize the long-term reward. We introduce Gaussian noise in the target actor's network to obtain a(t+1), and this process is defined as
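One plausible realization, following the usual twin-delayed formulation with clipped Gaussian exploration noise (the clipping bound c and noise scale \tilde{\sigma} are assumptions), is

a(t+1) = \mathrm{clip}\!\big( \mu_i'(s(t+1)\,|\,\theta^{\mu'}_i) + \epsilon,\; a_{\min},\; a_{\max} \big), \qquad \epsilon \sim \mathrm{clip}\!\big(\mathcal{N}(0, \tilde{\sigma}),\, -c,\, c\big).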
Next, the target Q-value is calculated by considering the current reward and comparing two critic's networks (Line 15), which is defined as
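In the standard twin-delayed form, which the description of comparing two critic's networks suggests, the target Q-value takes the smaller of the two target critics' estimates:

y_{\mathrm{target}} = r(t) + \gamma \min_{k\in\{1,2\}} Q'_{i,k}\big(s(t+1), a(t+1)\big).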
Next, the critic's network is updated by back-propagating the loss of the difference between ytarget and the current Q-value (Line 16). Due to the variable demands on QoS and system overheads among different edge environments, we design a proximal term to replace the original loss function that only tends to minimize the difference in local Q-values, which speeds up the convergence of the FL-based training framework in Algorithm 2 and reduces the negative impact of local updates on global model convergence during distributed training. The update process is defined as
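A hedged sketch of such a loss, written in a FedProx-style form in which θ_{f,k} denotes the federated (global) critic parameters and ρ the proximal coefficient from the parameter settings, is

L(\theta_{i,k}) = \frac{1}{N} \sum_{N\ \text{samples}} \big( y_{\mathrm{target}} - Q_{i,k}(s(t), a(t)\,|\,\theta_{i,k}) \big)^2 + \frac{\rho}{2}\, \big\| \theta_{i,k} - \theta_{f,k} \big\|^2.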
Finally, to reduce the improper update of the actor's network, we design a soft updating mechanism, which makes the update frequency of the actor's network less than the critic's network and thus avoids the action dispersion caused by the high frequency of error updates (Lines 17˜20). With this design, the variance of action-value estimation can be effectively reduced, thus generating better policies of computation offloading and resource allocation.
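The following short Python sketch illustrates how a delayed actor update and a soft target update of this kind are typically implemented with PyTorch; the names are illustrative (TAU and F_A correspond to τ and fa in the parameter settings), and the networks and optimizer are assumed to be existing torch.nn modules rather than components taken from the source.

TAU = 1e-4  # soft-update rate (tau in the parameter settings)
F_A = 2     # the actor is updated once every F_A critic updates (fa)

def soft_update(target_net, net, tau=TAU):
    # theta_target <- tau * theta + (1 - tau) * theta_target
    for p_t, p in zip(target_net.parameters(), net.parameters()):
        p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)

def delayed_actor_update(step, actor, actor_optimizer, critic_1, state_batch,
                         target_actor, critics, target_critics):
    # Updating the actor less frequently than the critics avoids the action
    # dispersion caused by a high frequency of erroneous updates.
    if step % F_A == 0:
        actor_loss = -critic_1(state_batch, actor(state_batch)).mean()
        actor_optimizer.zero_grad()
        actor_loss.backward()
        actor_optimizer.step()
        # Target networks follow the online networks slowly (soft update).
        soft_update(target_actor, actor)
        for t_c, c in zip(target_critics, critics):
            soft_update(t_c, c)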
In classic centralized training manners, all the information about EDs may need to be uploaded to a central server to train a DRL-based decision-making model of computation offloading and resource allocation. Such manners can achieve good model performance with rich training samples, but they are prone to severe network congestion and the potential risk of privacy leakage. As an emerging distributed training framework, multi-agent DRL allows each agent to be trained independently, but the performance of local models will be seriously limited if some agents lack training samples. In contrast, FRL can solve this issue by implementing collaborative model training on data silos with the original purpose of privacy protection. However, different smart communities usually have personalized demands on QoS and system overheads. In this case, the classic FRL with the average aggregation of model parameters cannot make an effective response. To solve these issues, we design a new personalized FL-based training framework to further improve the adaptiveness and training efficiency of the DRL-based computation offloading and resource allocation model for different environments. The proposed personalized FRL-based training framework is illustrated in
First, we initialize the federated actor's network μf, two federated critic's networks Qf,1 and Qf,2, the number of edges participating in FRL training K (K≤m), and the communication rounds for federated aggregation Pf (Line 1). In each communication round, we introduce a new proximal term to attenuate the dispersion of local updates, because different smart communities commonly own personalized demands on QoS and system overheads. Specifically, the proximal term is added to the local training loss by calling Algorithm 1, allowing faster convergence to the global optimum. The process is defined as
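A minimal Python sketch of such a federated step is given below; the simple weighted parameter average and the FedProx-style proximal penalty are assumptions chosen to illustrate the idea, not the source's exact personalized aggregation rule.

import copy

def aggregate(local_state_dicts, weights=None):
    # Aggregate the parameters uploaded by the K participating edge agents.
    k = len(local_state_dicts)
    weights = weights if weights is not None else [1.0 / k] * k
    global_sd = copy.deepcopy(local_state_dicts[0])
    for key in global_sd:
        global_sd[key] = sum(w * sd[key] for w, sd in zip(weights, local_state_dicts))
    return global_sd

def proximal_penalty(local_net, global_state_dict, rho=0.1):
    # FedProx-style term added to the local training loss:
    # (rho / 2) * || theta_local - theta_federated ||^2
    penalty = 0.0
    for name, p in local_net.named_parameters():
        penalty = penalty + (p - global_state_dict[name].detach()).pow(2).sum()
    return 0.5 * rho * penalty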
In each communication round, each DRL model in Ri(Ri∈R) uploads the actor's network μi(s|θμ
Finally, the C-BS distributes the aggregated DRL models to the DRL agents in each Ri(Ri∈R) (Line 15) and waits for the next communication round of FRL training.
At the sub-slot t, the sizes of state and action spaces in a DRL agent are |si(t)| and |ai(t)|, respectively, and all m DRL agents own the same network structure. Therefore, the following two parts should be considered when calculating the complexity of the proposed PFR-OA.
By combining the above two parts, the complexity of a DRL agent is Yt=YtR+YtAC at the sub-slot t, and thus the complexity of completely training a DRL agent is O(H·T·Yt). Considering the FRL training process, the complexity of the proposed PFR-OA is O(Pf·H·T·Yt).
We refer to the architecture of the real-world edge computing platform (i.e., C-ESP) and adopt its running datasets. The C-ESP builds an edge computing platform covering almost all regions in China, which hosts different types of service providers and records running datasets. Specifically, the datasets come from 2,359 edge servers deployed in more than 1,000 locations and 96,209 users with 10,159,851 requests. From the datasets, we can obtain the fuzzy geographic locations of request senders (i.e., users) and receivers (i.e., edge servers) with unique identifiers and the generation time of requests. It is noted that the geographic locations are fuzzed based on IP addresses to protect user privacy. The simulation experiments are conducted on a workstation with an 8-core Intel® Xeon® Silver 4208 CPU @3.2 GHz, two NVIDIA GeForce RTX 3090 GPUs, and 32 GB of RAM. Based on PyTorch, we implement the proposed system model and PFR-OA. Specifically, the system model is built with three MEC servers, where each MEC server owns computing capability Fi(t) of 20 GHz. Meanwhile, each MEC server is equipped with a BS with bandwidth Bi(t) of 15 MHz, where the EDs within the communication coverage are connected to it via the wireless network. The tasks of EDs are generated with various demands for dynamic multi-edge communities based on the running datasets of C-ESP. Moreover, a complete training epoch contains 20 time-slots and each time-slot contains 4 sub-slots. As for the other parameters in the proposed model, we set Td=1.5 s, Di,j(h)∈[0.5, 1.5] MB, Ci,j(h)∈[1, 1.2] GHz, Pi,j(t)∈[100, 400]W, fi,j∈[1, 1.2] GHz, Pi,j(t)∈[0.1, 0.4] MB/s, Pm=100 W, bmax=120 J, et=40 J, k=e−26, and σ=e−3, respectively. As for the parameters in the proposed PFR-OA, we set γ=0.995, τ=0.0001, fp=50, fq=0.3, N=256, ρ=0.1, and fa=2, respectively.
To comprehensively evaluate the proposed PFR-OA, we use the following performance indicators.
To verify the superiority of the proposed PFR-OA, we compare it with the following benchmark methods.
We analyze the impact of different hyperparameters (including the reward discount factor γ and learning rate τ) on the performance of the proposed PFR-OA. As shown in
As shown in
Specifically, as shown in
Performance Comparison with Various Required Computational Resources of Tasks
As illustrated in
However, due to the limited computational capabilities of EDs, a lower task success rate happens with the increasing difficulty of executing tasks. The Edge offloads all tasks to MEC servers for execution, leading to higher energy consumption and transmission delay. It is worth noting that the proposed PFR-OA displays better performance than other advanced DRL-based methods regarding task success rate, average energy consumption, and average waiting time.
Performance Comparison with Various Harvested Energies by an ED
Performance Comparison with Various Network Bandwidths of a BS
We test the influence of various network bandwidths on the performance of different methods. As shown in Table 2, the changes in network bandwidths do not affect the performance of the Local because it does not contain the offloading process. As the network bandwidth increases, the performance of the Edge enhances most significantly. This is because the Edge offloads all tasks to MEC servers for execution, and thus the increasing network bandwidths can reduce the task transmission delay, also considerably improving reward and task success rate. With the increase in network bandwidths, the performance of different methods tends to stabilize. This is because fewer tasks fail due to exceeding the delay constraint during the offloading process. Since the energy consumption of computation offloading is much higher than the local execution, EDs may not support too many tasks for offloading under the constraint of battery power. In this case, the performance cannot be further improved. It is noted that the proposed PFR-OA outperforms other advanced DRL-based methods regarding different performance indicators, verifying its superiority in handling the complex issue of computation offloading and resource allocation in dynamic multi-edge environments.
Performance Comparison with Various Computing Capabilities of a MEC Server
We evaluate the performance of different methods with various computing capabilities of MEC servers. As depicted in
However, EDs are constrained by battery power and cannot support offloading too many tasks, and thus the performance of these methods cannot be further improved. As shown in
Performance Comparison with Various Maximum Tolerable Delays of a Task
We compare the impact of various maximum tolerable delays on the performance of different methods. As illustrated in
Compared to other advanced DRL-based methods, the proposed PFR-OA always achieves better results in terms of all performance indicators. This is because the PFR-OA can optimize the computation offloading and resource allocation process by using an improved twin-delayed DRL and efficiently aggregate models across diverse edge environments by introducing a new personalized FL-based framework. The above experimental results validate the superiority of the proposed PFR-OA.
To further verify the practicality and superiority of the proposed PFR-OA, we construct a real-world testbed with hardware devices to evaluate the performance. As shown in
All these devices are connected to a 5 GHz router where the communication platform is built based on the Flask framework. In the testbed environment, each MEC server owns comparable bandwidth to the simulation environment but its computing capability is different. We adopt image classification as a service instance for computation offloading, where EDs generate tasks of image classification with varying data volumes and resource demands at different time-slots and send offloading requests. If the requests are accepted, the tasks will be uploaded to the corresponding MEC servers for execution. Otherwise, the tasks will be executed locally. Since the data transmission delay might be affected by unstable channels in real-world environments, the errors between the theoretical and actual values of the data transmission time are taken into account. Moreover, we consider the diversity of task attributes and service demands in edge environments of R1, R2, and R3.
As illustrated in
The advanced DRL-based methods can make appropriate offloading decisions based on system states and task attributes. Therefore, the DRL-based methods achieve higher rewards than other heuristics under different edge environments. Among all the DRL-based methods, the proposed PFR-OA reaches the best performance. This is because the PFR-OA considers the demand diversity in different edge environments and improves the efficiency of edge cooperative training through a new personalized FL-based framework, avoiding performance degradation due to local training dispersion. The above results verify the effectiveness of the proposed PFR-OA in real-world scenarios.
In this application, we first formulate the computation offloading and resource allocation in dynamic multi-edge smart community systems with personalized demands as a model-free DRL problem with multiple constraints. Next, we propose a novel PFR-OA that combines an improved twin-delayed DRL-based algorithm and a new personalized FL-based training framework to address the issues of action dispersion and inefficient model updates. Using real-world settings and a testbed, extensive experiments demonstrate the effectiveness of the proposed PFR-OA. Compared to the other seven benchmark methods (i.e., MCF-TD3, TD3, DDPG, DQN, Greedy, Edge, and Local), the PFR-OA shows superiority in improving the task success rate, average energy consumption, and average waiting time. Specifically, the PFR-OA outperforms other benchmark methods in different scenarios with various required computational resources of tasks, harvested energies by EDs, network bandwidths of BSs, computing capabilities of MEC servers, and maximum tolerable delays. Notably, we validate the practicality of the PFR-OA on the real-world testbed. When facing heterogeneous devices and diverse demands in different edge environments, the PFR-OA is able to maintain the best performance among all methods.
Number: 2036872
Date: Jan 2024
Country: NL
Kind: national