The present disclosure relates generally to systems, methods, and computer-readable media for allocating power for base stations in a wireless network, and in particular, for allocating power using a multi-agent deep reinforcement learning-enabled distributed scheme in a wireless network.
Millimeter-wave (mmWave) wireless communication has become an important technology for fifth-generation (5G) cellular systems. The proliferation of mmWave frequency bands has increased link capacity by several orders of magnitude compared to sub-6 GHz wireless systems and is able to support massive connections. To combat propagation loss, directional beamforming is commonly used. It has been shown that even at mmWave frequencies, spectrum availability is still limited considering the abundance of mobile and data-intensive services. Therefore, spectrum sharing is necessary for better utilization of unlicensed and shared spectrum.
The concurrency of highly directional transmissions, however, has presented new challenges to spectrum sharing. Without proper coordination, beams could overlap and cause severe interference that degrades performance. The situation is further exacerbated by the use of small cells with densely populated user devices.
Recently, deep reinforcement learning (DRL) has achieved notable success in wireless resource management. A deep Q-network (DQN)-based (discrete) power allocation scheme was proposed that achieves throughput performance competitive with conventional centralized approaches such as weighted minimum mean square error (WMMSE) and fractional programming (FP). Treated as RL agents, the transmitters improve their decision making by actively interacting with the radio environment and benefit from learning with accumulated experiences. This work was further extended to continuous power control and joint spectrum and power allocation.
Other types of DRL algorithms have also been applied to the same tasks. For mmWave networks, a DQN-based resource management scheme was proposed to learn and predict blockage patterns in a backhaul capacity-limited system. A DQN-based joint spectrum and (discrete) power allocation scheme was proposed as well. The clustering problem for mmWave networks with user mobility has also been studied, and a DQN-based clustering scheme was proposed. Moreover, a deep recurrent Q-network (DRQN)-based handover scheme was proposed for dynamic mmWave user association. One common issue with these attempts is that the stationarity assumption of the MDP is violated in the multi-agent setting because the environment seen by each agent is impacted by the unknown behaviors of other agents. This violation has been ignored in these studies.
In wireless communication systems, sub-optimal solutions obtained by solving a non-convex optimization problem lead to scalability issues due to centralized control. The present disclosure provides various aspects that address these scalability issues by using a multi-agent deep reinforcement learning-enabled distributed scheme in a wireless network.
In accordance with various aspects, a base station is associated with user devices in a wireless network, which includes a plurality of base stations. The base station includes a processor and a memory including instructions that, when executed by the processor, cause the base station to function as an actor network configured to determine a current transmit power, a critic network configured to evaluate a quality function of previous transmit powers of the base station based on local observations and previous transmit powers of neighboring base stations, and a decentralized training unit configured to train the quality function over the neighboring base stations. The neighboring base stations are a subset of the plurality of base stations, and the current transmit power is determined based on the previous transmit powers of the base station, direct channel gains between the base station and the user devices, and interference measures from the user devices.
In accordance with various aspects, a system is provided for allocating power in a wireless network. The system includes a plurality of base stations, each of which is associated with user devices and is configured to autonomously determine a transmit power based on local observations. Each base station includes a processor and a memory including instructions that, when executed by the processor, cause each base station to function as an actor network configured to determine a current transmit power, a critic network configured to evaluate a quality function of previous transmit powers of each base station based on local observations and previous transmit powers of neighboring base stations, and a decentralized training unit configured to train the quality function over the neighboring base stations. The neighboring base stations are a subset of the plurality of base stations, and the current transmit power is determined based on the previous transmit powers of the base station, direct channel gains between the base station and the user devices, and interference measures from the user devices.
In accordance with various aspects, a method is provided for allocating power to a plurality of base stations in a wireless network. The method includes receiving, at each base station associated with user devices, previous transmit powers of neighboring base stations, evaluating, at each base station, a quality function of previous transmit powers based on local observations and previous transmit powers of neighboring base stations, determining, at each base station, a current transmit power based on the previous transmit powers of each base station, direct channel gains between each base station and the user devices, and interference measures from the user devices, and training the quality function over the neighboring base stations.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Additional features and advantages of the present disclosure will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the teachings herein. The features and advantages of such implementations may be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. These and other features of the present disclosure will become more fully apparent from the following description and appended claims, or may be learned by the practice of these implementations as set forth hereinafter.
In order to describe the manner in which at least some of the advantages and features of the present disclosure may be obtained, a more particular description of aspects of the present disclosure will be rendered by reference to specific aspects thereof which are illustrated in the appended drawings. Understanding that these drawings depict only typical aspects of the present disclosure and are not therefore to be considered to be limiting of its scope in terms of dimensions, materials, configurations, arrangements or proportions unless otherwise limited by the claims, aspects of the present disclosure will be described and explained with additional specificity and detail through the use of the accompanying drawings described below.
While these various aspects are described in sufficient detail to enable those skilled in the art to practice the present disclosure, it should be understood that other aspects may be realized and that various changes to the present disclosure may be made without departing from the spirit and scope of the present disclosure. Thus, the following more detailed description of the aspects of the present disclosure is not intended to limit the scope of the present disclosure, as claimed, but is presented for purposes of illustration only and not limitation, to describe the features and characteristics of the present disclosure, to set forth the best mode of operation of the present disclosure, and to sufficiently enable one skilled in the art to practice the present disclosure. Accordingly, the scope of the present disclosure is to be defined solely by the appended claims.
It is noted that aspects of the present disclosure, whether claimed or not, cannot be performed, practically or otherwise, in the mind of a human. Accordingly, nothing herein should be construed as teaching or suggesting that any aspects of the present disclosure could or would be performed, practically or otherwise, in the mind of a human. Further, and unless explicitly indicated otherwise herein, the disclosed methods, processes, and operations are contemplated as being implemented by computing systems that may comprise hardware and/or software. That is, such methods, processes, and operations are defined as being computer-implemented.
Power allocation schemes may be designed for wireless communication networks, including millimeter wave (mmWave) cellular downlinks, by leveraging the multi-agent deep deterministic policy gradient (MADDPG) algorithm. Each base station may be modeled as an agent that determines its transmit power autonomously in real time. MADDPG addresses the multi-agent environment non-stationarity issue by conditioning the Q-functions of individual agents also on other agents' actions, which may be made available by using a centralized-training-distributed-execution framework. Conditioning the Q-functions on the actions of a large number of agents, however, can result in high inter-base station communication overhead and may incur instability in training the agents.
To make the scheme scalable, a distributed version of MADDPG may be employed where the Q-function (e.g., critic network) of each agent may be trained over a subset of base stations (i.e., each base station and its neighboring base stations) with a system-level reward. This formulation can suppress unnecessary information exchanges among base stations that barely impact each other's local environment dynamics. It also increases the stability and training efficiency of the deep neural network (DNN)-based actor/critic training by restricting the input of the DNNs to a relatively small or manageable size. A distributed power allocation scheme will be described below based on the distributed MADDPG algorithm. Simulations can show that the employed scheme can achieve performance comparable to or better than the conventional WMMSE and FP.
First, this scheme may deal with continuous power control while the DQN-based schemes can only handle discrete powers and the effect of quantization has not been properly investigated. Second, agent heterogeneity may be addressed by equipping each base station with a unique actor/critic network that accommodates its specific local radio environment. In contrast, in other approaches, a single global actor is trained using experiences gathered from all base stations which is then copied to each base station for use. For heterogeneous systems like mmWave networks where each base station can face quite different beam coverage and interference conditions, a single actor/critic may not be able to fit all agents. Third, the distributed MADDPG uses information exchange among subsets of base stations. This largely reduces the communication overhead compared to other alternatives where network-level experience collection is required.
In describing and claiming the present disclosure, the following terminology will be used. The singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a base station” includes reference to one or more of such base stations, and reference to “the base station” refers to one or more of such base stations.
As used herein with respect to an identified property or circumstance, “substantially” refers to a degree of deviation that is sufficiently small so as to not measurably detract from the identified property or circumstance. The exact degree of deviation allowable may in some cases depend on the specific context.
As used herein, “adjacent” or “neighboring” refers to the proximity of two structures or elements. Particularly, elements that are identified as being “adjacent” or “neighboring” may be either abutting or connected. Such elements may also be near or close to each other without necessarily contacting each other. The exact degree of proximity may in some cases depend on the specific context.
As used herein, the term “about” is used to provide flexibility and imprecision associated with a given term, metric or value. The degree of flexibility for a particular variable can be readily determined by one skilled in the art. However, unless otherwise enunciated, the term “about” generally connotes flexibility of less than 2%, and most often less than 1%, and in some cases less than 0.01%.
As used herein, a plurality of items, structural elements, compositional elements, and/or materials may be presented in a common list for convenience. However, these lists should be construed as though each member of the list is individually identified as a separate and unique member. Thus, no individual member of such list should be construed as a de facto equivalent of any other member of the same list solely based on their presentation in a common group without indications to the contrary.
As used herein, the term “at least one of” is intended to be synonymous with “one or more of.” For example, “at least one of A, B and C” explicitly includes only A, only B, only C, or combinations of each.
Numerical data may be presented herein in a range format. It is to be understood that such range format is used merely for convenience and brevity and should be interpreted flexibly to include not only the numerical values explicitly recited as the limits of the range, but also to include all the individual numerical values or sub-ranges encompassed within that range as if each numerical value and sub-range is explicitly recited. For example, a numerical range of about 1 to about 4.5 should be interpreted to include not only the explicitly recited limits of 1 to about 4.5, but also to include individual numerals such as 2, 3, 4, and sub-ranges such as 1 to 3, 2 to 4, etc. The same principle applies to ranges reciting only one numerical value, such as “less than about 4.5,” which should be interpreted to include all of the above-recited values and ranges. Further, such an interpretation should apply regardless of the breadth of the range or the characteristic being described.
Any steps recited in any method or process claims may be executed in any order and are not limited to the order presented in the claims. Means-plus-function or step-plus-function limitations will only be employed where for a specific claim limitation all of the following conditions are present in that limitation: a) “means for” or “step for” is expressly recited; and b) a corresponding function is expressly recited. The structure, material or acts that support the means-plus function are expressly recited in the description herein. Accordingly, the scope of the present disclosure should be determined solely by the appended claims and their legal equivalents, rather than by the descriptions and examples given herein.
Now turning to
The base station 110a transmits or relays communication signals to or between the user devices 120a1-120ai, where the subscript index “i” may be any number greater than or equal to one. Likewise, the base station 110n transmits or relays communication signals to or between the user devices 120n1-120nj, where the subscript index “j” may be any number greater than or equal to one.
The base stations 110 may include one or more antennas to enable beam-based directional transmissions, while the user devices 120 may be equipped with a single antenna. In an aspect, the user devices 120 may be equipped with more than one antenna. The wireless communication network 100 may be a fully synchronized and slotted system operating on a shared spectrum of W Hz. The value of W may be in the gigahertz range, the terahertz range, or any range greater than the terahertz range. A block fading model may be adopted for the downlink channels. In particular, the small-scale fading may remain unchanged during each slot and follow a temporally correlated Nakagami distribution with the following probability density:
f(h) = (2m^m/(Γ(m)Ω^m)) h^(2m−1) exp(−mh²/Ω), h ≥ 0, with parameters m ≥ 1/2 and Ω > 0, where Γ denotes the Gamma function. The notation ∀ followed by a variable (e.g., x, t, or any other variable) represents the expression “for all values of the variable.” The fading coefficients {h(t), ∀t} are generated in a way such that, first, h(t), ∀t, follows a Nakagami distribution with the same parameters m and Ω, and, second, the squared channels between any two consecutive time slots have a correlation coefficient ρ of
It is noted that hu
in which PL(dji) indicates the path loss between the base station 110i and the user device 120ij (or uj), dji is the distance between the base station 110i and the user device 120ij, and Gji indicates the antenna gain of the base station 110i towards the user device 120ij. The dual-slope path loss model may be adopted for the wireless communication channel modeling. In particular, the path loss at distance d is equal to
where dC is a critical distance indicating the boundary between the near-field and the far-field, and α0, α1 (α1 ≥ α0 > 0) are path loss exponents for the near-field and the far-field, respectively. Moreover, the base stations 110 may use a keyhole-like sectorized antenna model which has a constant main-lobe radiation gain Gmax and a constant sidelobe gain Gmin. In particular, the antenna gain in the direction of angle θ is equal to Gmax if |θ| ≤ Δ and Gmin otherwise, where Δ is the beamwidth. The main-to-sidelobe ratio (MSR) can be defined as MSR ≙ Gmax/Gmin.
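By way of a non-limiting illustration, the channel-model components described above may be sketched in Python as follows. The reference loss constant pl0, the continuity of the dual-slope model at dC, and the default parameter values are assumptions used only for illustration; the functional forms follow the description above.

```python
import numpy as np

def dual_slope_path_loss(d, d_c=10.0, alpha0=2.1, alpha1=4.0, pl0=1.0):
    """Dual-slope path loss with near-field exponent alpha0 and far-field alpha1."""
    if d <= d_c:
        return pl0 * d ** (-alpha0)
    # Continuity at the critical distance d_c is assumed.
    return pl0 * d_c ** (alpha1 - alpha0) * d ** (-alpha1)

def sectorized_gain(theta, beamwidth, g_max=100.0, g_min=1.0):
    """Keyhole-like sectorized antenna gain; MSR = g_max / g_min."""
    return g_max if abs(theta) <= beamwidth else g_min

def nakagami_magnitude(m=3.0, omega=1.0, rng=None):
    """Draw a channel magnitude |h| ~ Nakagami(m, omega) via a Gamma variate."""
    rng = rng or np.random.default_rng()
    return float(np.sqrt(rng.gamma(shape=m, scale=omega / m)))
```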
When the power allocation profile of all base stations 110 at time t is p(t) ≙ [p1(t), p2(t), . . . , pK(t)], the received signal-to-interference-plus-noise ratio (SINR) at the scheduled user device 120ij of the base station 110i is equal to SINRi(t) = gii(t)pi(t)/(Σ_{j≠i} gij(t)pj(t) + σ²),
where σ² = n0W is the total noise power with n0 denoting the noise power spectral density. The normalized throughput (bps/Hz) of the base station 110i can then be written as Ci(t) = log2(1 + SINRi(t)).
The goal of the base stations 110 is to maximize the total network throughput, which is expressed as Csum(t) ≙ Σ_{i=1}^{K} Ci(t),
which is subject to instantaneous power constraints pi(t)≤pimax, ∀i. In particular, a power allocation pi(t) may be computed at the beginning of each time slot such that Csum(t) may be maximized.
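As a hedged illustration of the system model above, the SINR and sum throughput may be computed as in the following Python sketch. The channel-gain matrix g (with g[i, j] being the gain from base station j to the user scheduled by base station i) and the noise power sigma2 are assumed inputs, and the example values in the comment are illustrative only.

```python
import numpy as np

def sinr(i, p, g, sigma2):
    """SINR at the user scheduled by base station i for power vector p."""
    interference = sum(g[i, j] * p[j] for j in range(len(p)) if j != i)
    return g[i, i] * p[i] / (interference + sigma2)

def sum_throughput(p, g, sigma2):
    """Objective C_sum(t): sum over base stations of log2(1 + SINR_i)."""
    return sum(np.log2(1.0 + sinr(i, p, g, sigma2)) for i in range(len(p)))

# Example usage with two base stations (illustrative numbers only):
# g = np.array([[1.0, 0.1], [0.2, 0.8]]); p = np.array([0.5, 0.5])
# print(sum_throughput(p, g, sigma2=1e-3))
```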
Although this power allocation problem has been extensively studied, finding the optimal solution is still challenging due to its non-convex nature. For wireless communication networks including millimeter wave (mmWave) networks, power control becomes more complicated as the beam-based transmissions by the base stations 110 may need to be properly coordinated to reduce interference. Unlike conventional approaches using centralized control, data-driven learning approaches may be leveraged to design distributed and scalable power allocation schemes, in which the base stations 110 choose powers based on their local measurements and limited information exchange with neighboring base stations 110.
The multi-agent DRL-based power allocation is now described. A distributed version of MADDPG is derived based on which a power allocation scheme is then developed. An overview of DRL is provided first before proceeding to the description of the multi-agent DRL-based power allocation.
In reinforcement learning (RL), an agent aims to optimize an expected return through repeated interactions with the environment over time. In each interaction, the agent receives a reward signal from the environment as an indicator of the quality of the action taken. The agent learns in a trial-and-error manner by gradually refining its decision making using the received reward signals. Specifically, in a discrete-time Markov Decision Process (MDP) (S, A, R, T), given some state st∈S at time t, the agent takes an action at∈A with probability μ(at|st) according to a policy μ satisfying ∫aμ(at|st)da =1, where μ(at|st) is a conditional probability of at in the presence or condition of st. Impacted by at, the environment transitions (governed by the transition function T) to a new state st+1 and the agent receives a scalar reward rt=R(st, at, st+1), which indicates how good the taken action at is. The set of transition quadruples {(st, at, rt, st+1), ∀t} is referred to as an experience.
The return Gt is defined as the cumulative discounted future reward, i.e., Gt ≙ Σ_{τ=0}^{∞} γ^τ r_{t+τ}, with γ ∈ (0, 1] being the discount factor that adjusts the relative importance of the near and far future rewards. The state-action value function (or Q-function) Q^μ under a specific policy μ is defined as the expected return starting from any state-action pair (s, a) ∈ S×A, i.e.,
where the expectation is taken over both the policy μ and the transition dynamics T. Model-free RL optimizes the agent's expected return without knowing or explicitly learning the transition dynamics and has seen significant developments in recent years due to the use of neural function approximators.
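For concreteness, the discounted return defined above may be computed from a finite reward trajectory as in the short sketch below, where the infinite sum is truncated at the end of the trajectory. This is a generic illustration of the definition rather than part of the disclosed scheme.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * r_{t+k} for a finite list of rewards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: discounted_return([1.0, 0.5, 2.0], gamma=0.9)
#          == 1.0 + 0.9 * 0.5 + 0.81 * 2.0 == 3.07
```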
Deep Deterministic Policy Gradient (DDPG) is a DRL algorithm which focuses on deterministic policies that map each state s to a specific action a = μ(s). It uses an actor-critic architecture in which two separate DNNs, θμ and θQ, are used to represent the policy μ(s|θμ), which is called an actor network, and the Q-function Q(s, a|θQ), which is called a critic network, respectively. DDPG may be extended to the multi-agent domain as the multi-agent DDPG (MADDPG) algorithm, which addresses the non-stationarity issue due to multi-agent participation by conditioning the Q-function of each agent also on the actions of other agents (referred to as a centralized critic network); i.e., the i-th agent or base station 110i may be defined with a Q-function Qi(s, a1, ···, aN) that has all agents' or base stations' actions as input. The intuition behind this modification is that, given a set of fixed actions of all other agents, the environment perceived by each agent becomes stationary regardless of what policies are used by other agents. Hereinafter, the term “agent” or “agents” may be used interchangeably with “base station” or “base stations,” respectively.
Under the multi-agent setting, the i-th agent gets a local observation oi instead of the true global state. In this regard, turning to
The replay memory 230 may store local observations, which are a set of input data received by the base station 200, including previous power levels, channel gains, interference measurements, and throughput data from neighboring base stations. The centralized training module 240 may train the Q-functions of the critic networks 220 using data from the neighboring base stations. The distributed execution module 250 may enable each base station to autonomously determine and execute its transmit power based on local observations and the trained actor network 210. This distributed execution module 250 may ensure that the decision-making process is distributed across the network, allowing each base station to operate independently while still maintaining overall network performance. Further details of each component of the base station 200 are described below.
Actions may be chosen based on each agent's local observation, i.e., ai=μi(oi|θiμ). The training of the centralized critic Qi(s, a|θiQ) requires the knowledge of the global state s≙ (oi)i=1K and the joint actions a≙(ai)i=1K. This may be enabled by the centralized training module 240, in which the actor network 210 and the critic network 220 may be trained periodically with network-level experiences, while the actions are determined based solely or partially on each agent's local observation. In an aspect, the replay memory 230 may use a fixed-size experience replay buffer D to store the past experiences collected from all agents in a first-in-first-out (FIFO) manner. Mini-batches of experiences are then sampled from the experience replay buffer D to train the actor network 210 and the critic network 220 using the stochastic gradient descent (SGD) method. More specifically, given a mini-batch B={(sj, aj, rj, s′j)}, with rj=(rij)i=1K being the rewards of all agents, the critic network 220 of the base station 200 or the agent may be trained by minimizing the following loss L:
The regression target over the j-th sample in B can be expressed as the following:
which is generated by two target networks Q′i(·|θiQ′) and μ′k(·|θkμ′) in order to stabilize the training process. The actor network 210 of the i-th agent may be trained by minimizing the following expression:
Finally, the weights of the actor and critic networks 210 and 220 may be updated according to the following expression:
for some small number τ ∈ (0, 1). In MADDPG, exploration may be achieved by adding a random noise Nt to the actor output, as follows:
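Since expressions (10) through (14) are referenced above without being reproduced, the following PyTorch sketch illustrates one update step in this style for a single agent's actor and critic: a mean-squared critic regression toward a target built from target networks, an actor update through the critic, a soft target update with a small factor τ, and Gaussian exploration noise clipped at the actor output. The network classes, optimizers, and hyperparameter values are assumptions; in the distributed MADDPG variant described below, the critic inputs would additionally include the neighboring agents' observations and actions.

```python
import torch
import torch.nn.functional as F

def soft_update(target_net, net, tau=0.005):
    """Slowly track the online network: theta' <- (1 - tau) * theta' + tau * theta."""
    for t_param, param in zip(target_net.parameters(), net.parameters()):
        t_param.data.mul_(1.0 - tau).add_(tau * param.data)

def update_agent(actor, critic, target_actor, target_critic,
                 actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    obs, act, rew, next_obs = batch  # mini-batch tensors; rew shaped (B, 1)

    # Critic update: regress Q(o, a) toward the bootstrapped target y.
    with torch.no_grad():
        y = rew + gamma * target_critic(next_obs, target_actor(next_obs))
    critic_loss = F.mse_loss(critic(obs, act), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: maximize the critic's value of the actor's own action.
    actor_loss = -critic(obs, actor(obs)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft target updates.
    soft_update(target_actor, actor, tau)
    soft_update(target_critic, critic, tau)

def explore(actor, obs, noise_std=0.1, w=0.95):
    """Add Gaussian exploration noise N_t to the actor output and clip to [-w, w]."""
    with torch.no_grad():
        a = actor(obs)
        a = a + noise_std * torch.randn_like(a)
    return a.clamp(-w, w)
```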
One drawback of the original MADDPG algorithm is that the Q-function Qi(s, a) of each agent takes the global state s = (oi)i=1K and the joint action a = (ai)i=1K as inputs. This can be problematic because wireless communication systems generally have a large number of base stations, so the input dimension of the corresponding critic networks can be huge, which may result in slow convergence and instability in DNN training. To address this issue according to the present disclosure, MADDPG may be modified into a distributed version of MADDPG, which allows decentralized critic training. In other words, the critic networks 220 of the agents may be trained over a subset of all base stations. For example, as illustrated in
The i-th agent may employ a localized critic defined by the following expression:
where N̂i ≙ Ni ∪ {i}. Qi and μi may be trained similarly according to expressions (10) and (12). Expression (15) may indicate that each agent only needs to gather information from its neighboring agents and that network-level information exchange (which introduces delays) can be reduced.
The definition of the localized critic (15) may be motivated by the following observation. For example, in mmWave or other wireless networks, base stations that are far apart from each other or with properly placed beams may contribute little interference to each other and have negligible impact on the local environment transition dynamics of each other. This is because mmWave signals suffer from rapid attenuation due to their propagation characteristics, and highly directional beams suppress interference toward undesired directions. Therefore, it is unnecessary to use a centralized critic that needs to be trained over the entire set of base stations. More generally, the formulation may provide a flexible trade-off between inter-base station information exchange and how accurately the overall radio environment can be perceived by each agent. Two special cases may be considered in this disclosure. When no information exchange is allowed among agents, i.e., Ni=∅, ∀i, expression (15) becomes
which is the case of independent learning. In contrast, if information exchange is allowed among arbitrary agents, i.e., Ni={1, . . . , K}\{i}, ∀i, expression (15) becomes
which is equivalent to the original MADDPG critic. It is noted that the notation “\{i}” indicates that the i-th agent is excluded, and thus Ni is the set of neighboring agents excluding the i-th agent. In an aspect, with the neighboring set N̂i, which includes Ni and the i-th agent, a separate replay buffer Di may be defined for the i-th agent to store the experiences in the form of
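A hedged sketch of the neighborhood-restricted bookkeeping described above is shown below: a per-agent FIFO replay buffer Di and the assembly of the localized critic input over N̂i = Ni ∪ {i}. The tuple fields, dictionary layout, and buffer capacity are illustrative assumptions rather than details taken from the disclosure.

```python
import random
from collections import deque, namedtuple
import torch

Transition = namedtuple("Transition", ["obs", "acts", "reward", "next_obs"])

class NeighborhoodReplayBuffer:
    """Per-agent FIFO buffer D_i holding experiences over the set N_i ∪ {i}."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences dropped first

    def push(self, *transition):
        self.buffer.append(Transition(*transition))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

def localized_critic_input(i, neighbors, observations, actions):
    """Concatenate (o_j, a_j) only for j in the closed neighborhood of agent i."""
    hat_n_i = [i] + sorted(neighbors[i])
    obs = torch.cat([observations[j] for j in hat_n_i], dim=-1)
    act = torch.cat([actions[j] for j in hat_n_i], dim=-1)
    return torch.cat([obs, act], dim=-1)

# Special cases noted above:
#   neighbors[i] = set()                -> independent learning (expression (16))
#   neighbors[i] = all agents except i  -> the original MADDPG critic (expression (17))
```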
Actions, local observations, and rewards in the multi-agent deep reinforcement learning-enabled distributed scheme may be designed as described below. With regard to actions, each base station may need to determine the transmit power pi(t) ∈ [0, pimax] for its scheduled user device at the beginning of each slot. Since the Tanh activation may be used for the output layer of the actor networks 210, the actor output ai(t) = μi(oi(t)|θiμ) falls into the range [−1, 1]. The Tanh activation may be defined as the hyperbolic tangent function, tanh(x) = (e^x − e^−x)/(e^x + e^−x).
To achieve exploration, a random noise Nt may be added to ai(t), and the result is then clipped to the range [−w, w] for w ∈ (0, 1). Since the hyperbolic tangent function requires an infinitely large input to achieve the output values ±1, clipping to [−w, w] where 0 < w < 1 increases the numerical stability of the DNN training. Therefore, the actor output is mapped to the powers according to the following expression:
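Expression (20) itself is not reproduced in this text; as a hedged sketch, one natural affine mapping from the clipped actor output in [−w, w] to the feasible power range [0, pimax] is the following.

```python
def action_to_power(a_clipped, p_max, w=0.95):
    """Map a clipped Tanh output in [-w, w] affinely onto [0, p_max]."""
    return p_max * (a_clipped / w + 1.0) / 2.0

# a_clipped = -w maps to power 0, and a_clipped = +w maps to power p_max.
```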
The local observation may represent each agent's (partial) perception of the radio environment. The local observation may have to capture the environment features that are relevant to the agent's decision making. To control complexity and make the scheme scalable, the information exchange is limited to neighboring base stations. In this regard, the distributed execution module 250 may gather local observations, including data on previous power levels, channel gains, interference measurements, and throughput data from neighboring base stations. In particular, the observation oi(t) of the i-th agent may be defined as
which includes several interference and channel measurements at the i-th base station and throughput obtained from neighboring base stations. In particular, pi(t−1) is the power of the i-th base station in the previous slot, and gii(t−1) is the direct channel gain between the i-th base station and its scheduled user device in slot (t−1), defined by gii(t−1) = PL(dii)Gii|hii(t−1)|². The direct channel gain gii(t) is the corresponding gain in slot (t), defined by gii(t) = PL(dii)Gii|hii(t)|². The direct channel gains may be estimated via pilot training. It is assumed that the channel changes from hii(t−1) to hii(t) at the very beginning of slot (t), right before the new powers pi(t), ∀i, are determined; thus, gii(t) may also be used by the i-th base station. The total interference (plus noise), Ii(t−1), measured at the i-th base station in slot (t−1) may be defined as
Likewise, the total interference measured at the i-th base station at the beginning of slot (t), where the channels have changed but the powers have not yet been updated, may be defined as
Ci(t−1) is the throughput of the i-th base station at the previous slot (t−1), and
represents the relative importance of the i-th base station in terms of throughput contribution among its neighboring base stations. Moreover, gij(t−1)pj(t−1) and gij(t)pj(t) are the measured interferences from the j-th base station (j ∈ Ni) in slot (t−1) and at the beginning of slot (t), respectively. Finally, Cj(t−1) represents the throughput achieved by the j-th base station in the previous slot. It is noted that Cj(t−1) has to be delivered from the j-th base station to the i-th base station, whereas all other interference measurements can be obtained directly at the i-th base station. One or more previous slots may be included in order for the agents to better keep track of the time-varying channels.
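As a hedged sketch of how the local observation oi(t) of expression (21) might be assembled from the quantities just listed, the following Python snippet stacks the local scalars with the per-neighbor terms; the field ordering and exact contents are illustrative assumptions.

```python
import numpy as np

def build_observation(p_prev, g_prev, g_now, interf_prev, interf_now,
                      c_prev, c_share,
                      neighbor_interf_prev, neighbor_interf_now, neighbor_c_prev):
    """Stack the i-th agent's local measurements with per-neighbor quantities."""
    return np.concatenate([
        np.array([p_prev, g_prev, g_now, interf_prev, interf_now,
                  c_prev, c_share]),          # 7 local scalars
        np.asarray(neighbor_interf_prev),     # g_ij(t-1) p_j(t-1), j in N_i
        np.asarray(neighbor_interf_now),      # g_ij(t) p_j(t), j in N_i
        np.asarray(neighbor_c_prev),          # C_j(t-1), j in N_i
    ])

# With N = |N_i| neighbors the observation has 3N + 7 entries, matching the
# input-size discussion later in this disclosure.
```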
With regard to the rewards, unlike other methods which rely on a heuristic reward design, a centralized reward, i.e., ri(t)=Σj=1KCj(t), ∀i, which is intuitive and computationally efficient, may be employed.
With the above definitions, the power allocation scheme may be summarized as follows. At the beginning of slot (t), the i-th base station conducts the interference measurements and exchanges the throughput of the previous slot with its neighboring base stations in order to construct its local observation oi(t). The i-th base station may then choose an action, ai(t)=μi(oi(t)|θiμ)+Nt, which is then mapped to the actual power
The chosen power may be used for one slot until the next slot begins and the new observation oi(t+1) can be obtained. The experience of the i-th agent may then be pushed to the corresponding replay buffer Di. Every Ttrain slots, the actor network 210 and the critic network 220 may be trained using the mini-batch stochastic gradient descent (SGD) method according to expressions (10) and (12) by the centralized training module 240.
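Putting the pieces together, the per-slot procedure just summarized may be sketched as follows, reusing the helper functions sketched earlier. The agent and environment interfaces (measure_observation, apply_power) and the collate helper are assumed placeholders and not part of the disclosure.

```python
def run_slot(t, agent, env, t_train=5, batch_size=128):
    # 1) Measure interference/channel gains and exchange neighbor throughputs.
    o_t = env.measure_observation(agent.index)
    # 2) Actor output plus exploration noise, clipped to [-w, w].
    a_t = explore(agent.actor, o_t)
    # 3) Map the action to an actual transmit power and use it for one slot.
    p_t = action_to_power(a_t, agent.p_max)
    reward, o_next = env.apply_power(agent.index, p_t)
    # 4) Store the experience in the agent's replay buffer D_i.
    agent.buffer.push(o_t, a_t, reward, o_next)
    # 5) Periodically train the actor and critic on a sampled mini-batch.
    if t % t_train == 0 and len(agent.buffer) >= batch_size:
        # `collate` (assumed helper) stacks sampled transitions into batched tensors.
        batch = collate(agent.buffer.sample(batch_size))
        update_agent(agent.actor, agent.critic,
                     agent.target_actor, agent.target_critic,
                     agent.actor_opt, agent.critic_opt, batch)
    return p_t
```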
With regard to the complexity, the input sizes of the actor and critic networks 210 and 220 may be |oi(t)|=3N+7 and (N+1)|oi(t)|=(N+1)(3N+7), respectively, where N=|Ni| is the number of neighboring agents of the i-th agent. These input sizes do not scale with K if N is fixed.
For example, with N=6, the actor network 210 and the critic network 220 may contain about 30k and 55k parameters, respectively. Due to the local observation design of expression (21), each base station needs to send its throughput in the previous slot to its neighboring base stations. This incurs a communication overhead of only O(N) per base station, keeping the communication overhead low compared to network-level information exchange. Thus, regardless of the number of base stations (e.g., the base stations 110a-110n) in a wireless network (e.g., the wireless network 100 of
Now turning to
To verify the throughput and computational efficiency of the multi-agent deep reinforcement learning-enabled distributed scheme, simulations have been performed against other conventional approaches including FP, weighted minimum mean square error (WMMSE), random, and full reuse approaches. The simulation setup may consider a wireless network with 4 base stations under various beam and user device configurations as shown in
The parameters used in the simulation are summarized in Table I.
The total noise power is calculated according to σ² (dBm) = 10 log(κB T0 × 10³) + NR (dB) + 10 log W, with κB, NR, and T0 being Boltzmann's constant, the receiver noise figure, and the temperature, respectively. Taking the typical values of NR = 1.5 dB and T0 = 290 K, σ² is calculated to be −86.46 dBm. The scheme may be implemented with PyTorch. Each actor or critic network is represented by a fully-connected DNN with 5 layers, including 3 hidden layers containing 200, 100, and 50 neurons, respectively. The numbers of layers and hidden layers of the DNN are provided as an example and may be greater or less than 5 and 3, respectively. Further, the numbers of neurons are also provided as examples and may be greater or less than 200, 100, or 50.
In these configurations, each actor network of the base stations may have one output port with the Tanh activation clipped to the range [−0.95, +0.95] based on the magnitude of the noise. Each critic network of the base stations may also have one output port activated with the rectified linear unit (ReLU) activation function, which outputs a non-negative value regardless of the value of the input.
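A hedged PyTorch sketch of the actor/critic architecture just described (three hidden layers of 200, 100, and 50 neurons, a Tanh actor output later clipped to [−0.95, +0.95], and a ReLU-activated critic output) is shown below. The use of ReLU in the hidden layers is an assumption, as the text fixes only the layer sizes and output activations.

```python
import torch
import torch.nn as nn

def _hidden_layers(in_dim):
    """Three fully-connected hidden layers of 200, 100, and 50 neurons."""
    return nn.Sequential(
        nn.Linear(in_dim, 200), nn.ReLU(),
        nn.Linear(200, 100), nn.ReLU(),
        nn.Linear(100, 50), nn.ReLU(),
    )

class Actor(nn.Module):
    def __init__(self, obs_dim):
        super().__init__()
        self.body, self.head = _hidden_layers(obs_dim), nn.Linear(50, 1)

    def forward(self, obs):
        # Output in (-1, 1); clipping to [-0.95, +0.95] is applied outside.
        return torch.tanh(self.head(self.body(obs)))

class Critic(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.body, self.head = _hidden_layers(obs_dim + act_dim), nn.Linear(50, 1)

    def forward(self, obs, act):
        # Non-negative Q-value estimate via the ReLU output activation.
        return torch.relu(self.head(self.body(torch.cat([obs, act], dim=-1))))
```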
With the simulation parameters and the wireless communication configuration of
The FP scheme is an optimization technique used to solve the power allocation problem by maximizing the throughput while considering the interference among base stations. The objective function may maximize
subject to Pi ≤ Pmax, ∀i, where Pi is the transmit power of base station i, and Pmax is the maximum allowable transmit power. The WMMSE scheme is another optimization technique used to solve the power allocation problem by minimizing a weighted mean square error (MSE) to improve overall network performance. Both are centralized algorithms with superior performance that is hard to outperform in general. For example, several learning-based methods have been shown to approximate their performance. However, the performance of WMMSE and FP depends on heuristic parameter initialization whose impact is unclear. It is assumed that the required channel state information (CSI) can be obtained with no delay at the beginning of each slot.
The random scheme utilizes randomness to enhance exploration, initialization, and the robustness of the algorithms, while the full reuse scheme simultaneously uses the entire frequency spectrum at all base stations without any dedicated frequency bands, aiming to maximize spectral efficiency. Both schemes are provided as additional baselines for comparison purposes.
Data plots 510-550 of
Based on the data plots 5101-5103 and 5201-5203, the average throughputs of the FP and WMMSE schemes are substantially constant over the time slots. Similarly, based on the data plots 5301-5303 and 5401-5403, the average throughputs of the random and full reuse schemes are substantially constant over the time slots. However, the average throughputs of the FP and WMMSE schemes are generally greater than those of the random and full reuse schemes.
On the other hand, based on the data plots 5501-5503, the multi-agent deep reinforcement learning-enabled distributed scheme is illustrated as converging in roughly 40,000 slots and achieving slightly higher throughputs than the FP and WMMSE schemes and substantially higher throughputs than the random and full reuse schemes.
Now turning to
To further ensure adequate exploration, the action noise {Nt, ∀t} may be chosen as a Gaussian noise with a decreasing variance, i.e., Nt ∼ N(0, σt²), where σt+1 = max{(1−ε)σt, σmin} with ε = 10^(−4), σ0 = 1, and σmin = 0.01. The actor and critic networks may be trained every Ttrain = 5 slots. There are two phases: the training phase (50,000 slots), where the actor and critic networks may be trained periodically while interacting with the environment, and the testing phase (1,000 slots), in which the trained actor networks may be used to select powers without learning. Further, the inputs of the actor and critic networks may be normalized to stabilize the training. Hence, each interference measurement Ii at the i-th base station in expression (21) may be scaled to [0, 1] via the following expression:
where Iimax and Iimin are the maximum and minimum possible interference at the i-th base station, respectively. In a case where Iimin is set to be zero, Iimax may be estimated by letting all base stations transmit with maximum powers and observing the interference.
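A brief sketch of the input scaling and the decaying exploration noise described above follows; the probing-based estimate of Iimax is an assumption about how it could be obtained in practice.

```python
def scale_interference(I, I_max, I_min=0.0):
    """Scale a measured interference value to [0, 1]."""
    return (I - I_min) / (I_max - I_min)

def next_noise_std(sigma_t, eps=1e-4, sigma_min=0.01):
    """sigma_{t+1} = max{(1 - eps) * sigma_t, sigma_min}, starting from sigma_0 = 1."""
    return max((1.0 - eps) * sigma_t, sigma_min)
```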
During the training phase, and based on the data plots 7101-7501 illustrated in
The empirical CDF of the average throughput during the testing phase is illustrated as data plots 7102-7502. The throughputs of the random and full reuse schemes are shown in the data plots 7302 and 7402, respectively, and are much less than the throughput of the configuration 600 shown in the data plot 7502. Although the channels are less correlated and harder to track, the learned policy still maintains performance very close to that of WMMSE according to the data plot 7202. FP achieves a higher throughput according to the data plot 7102 than the configuration 600 does according to the data plot 7502, but has a less concentrated distribution than the configuration 600. Thus, the configuration 600 may demonstrate generalization ability.
Turning now to
The method 800 may include step 810, in which each base station associated with user devices receives previous transmit powers of neighboring base stations, which are a subset of all the base stations in the wireless network. The neighboring base stations may be selected based on their impact on each base station. In an aspect, if one base station does not have any significant impact on a given base station, the one base station may not be selected as a neighboring base station of the given base station. In another aspect, the neighboring base stations may be selected based on their distance from each base station.
The method 800 may further include step 820, in which each base station evaluates a quality function of previous transmit powers based on local observations and previous transmit powers of neighboring base stations. The quality function may be defined as the expected return (e.g., expression (9)) starting from any state-action pair. Each base station may include a critic network configured to evaluate the quality function. The local observation may represent each base station's (partial) perception of the radio environment. The local observation may have to capture the environment features that are relevant to the base station's decision making. To control complexity and make the scheme scalable, the information exchange is limited to neighboring base stations. The local observation may include data on previous transmit power levels, channel gains, interference measurements, and throughput data from neighboring base stations. The local observation may be defined as in expression (21).
The method 800 may further include step 830, in which each base station determines a current transmit power based on the previous transmit powers of each base station, direct channel gains between each base station and the user devices, and interference measures from the user devices. The direct channel gains may be defined as in expression (3). The interference measures may be the total interference at each base station and defined as in expression (22) or (23). The current transmit power may be determined based on expression (20).
The current transmit power may be independently mapped for each base station based on training using multi-agent deep reinforcement learning. In this regard, the method 800 may further include step 840, in which each base station periodically trains the quality function over the neighboring base stations. For example, the centralized training module 240 of
Attention will now be directed to
The computing device 900 includes an operating system configured to perform executable instructions. The operating system is, for example, software, including programs and data, which manages the device's hardware and provides services for execution of applications. Those of skill in the art will recognize that suitable server operating systems include, by way of non-limiting examples, FreeBSD, OpenBSD, NetBSD®, Linux, Apple® Mac OS X Server®, Oracle® Solaris®, Windows Server®, and Novell® NetWare®. Those of skill in the art will recognize that suitable personal computer operating systems include, by way of non-limiting examples, Microsoft® Windows®, Apple® Mac OS X®, UNIX®, and UNIX-like operating systems such as GNU/Linux®. In some aspects, the operating system is provided by cloud computing. Those of skill in the art will also recognize that suitable mobile smart phone operating systems include, by way of non-limiting examples, Nokia® Symbian® OS, Apple® iOS®, Research In Motion® BlackBerry OS®, Google® Android®, Microsoft® Windows Phone® OS, Microsoft® Windows Mobile® OS, Linux®, and Palm® WebOS®.
In various aspects, the computing device 900 may include a storage 910. The storage 910 is one or more physical apparatus used to store data or programs on a temporary or permanent basis. In some aspects, the storage 910 may be volatile memory and requires power to maintain stored information. In aspects, the volatile memory includes dynamic random-access memory (DRAM). In aspects, the storage 910 may be non-volatile memory and retains stored information when the computing device 900 is not powered. In aspects, the non-volatile memory includes flash memory. In aspects, the non-volatile memory includes ferroelectric random-access memory (FRAM). In aspects, the non-volatile memory includes phase-change random access memory (PRAM). In aspects, the storage 910 includes, by way of non-limiting examples, CD-ROMs, DVDs, flash memory devices, magnetic disk drives, magnetic tape drives, optical disk drives, and cloud computing-based storage. In aspects, the storage 910 may be a combination of devices such as those disclosed herein.
The storage 910 includes executable instructions (i.e., code). The executable instructions represent instructions that are executable by the processor 930 of the computing device 900 to perform the disclosed operations, such as those described in the various methods. Furthermore, the storage 910 is a computer storage medium and excludes signals, carrier waves, and propagating signals. On the other hand, transmission media that carry computer-executable instructions may include signals, carrier waves, and propagating signals. Thus, by way of example and not limitation, the current aspects may include at least two distinctly different kinds of computer-readable media: computer storage media and transmission media.
The computing device 900 further includes a processor 930, an extension 940, a display 950, an input device 960, and a network card 970. The processor 930 is the brain of the computing device 900. The processor 930 executes instructions which implement tasks or functions of programs. When a user executes a program, the processor 930 reads the program stored in the storage 910, loads the program into the RAM, and executes instructions prescribed by the program.
The processor 930 may include, without limitation, Field-Programmable Gate Arrays (“FPGA”), Program-Specific or Application-Specific Integrated Circuits (“ASIC”), Program-Specific Standard Products (“ASSP”), System-On-A-Chip Systems (“SOC”), Complex Programmable Logic Devices (“CPLD”), Central Processing Units (“CPU”), Graphical Processing Units (“GPU”), or any other type of programmable hardware that performs the basic arithmetic, logical, control and input/output (I/O) operations specified by the instructions. As used herein, terms such as “executable module,” “executable component,” “component,” “module,” or “engine” may refer to the processor 930 or to software objects, routines, or methods that may be executed by the processor 930 of the computing device 900. The different components, modules, engines, and services described herein may be implemented as objects or processes that execute on the computing device 900 (e.g., as separate threads).
In aspects, the extension 940 may include several ports, such as one or more universal serial buses (USBs), IEEE 1394 ports, parallel ports, and/or expansion slots such as peripheral component interconnect (PCI) and PCI express (PCIe). The extension 940 is not limited to the list but may include other slots or ports that may be used for appropriate purposes. The extension 940 may be used to install hardware or add additional functionalities to a computer that may facilitate the purposes of the computer. For example, a USB port may be used for adding additional storage to the computer and/or an IEEE 1394 may be used for receiving moving/still image data.
In some aspects, the display 950 may be a cathode ray tube (CRT), a liquid crystal display (LCD), or light emitting diode (LED). In some aspects, the display 950 may be a thin film transistor liquid crystal display (TFT-LCD). In some aspects, the display 950 may be an organic light emitting diode (OLED) display. In various some aspects, the OLED display is a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display. In some aspects, the display 950 may be a plasma display. In some aspects, the display 950 may be a video projector. In some aspects, the display may be interactive (e.g., having a touch screen or a sensor such as a camera, a 3D sensor, a LiDAR, a radar, etc.) that may detect user interactions/gestures/responses and the like.
A user may input and/or modify data via the input device 960 that may include a keyboard, a mouse, or any other device with which the use may input data. The display 950 displays data on a screen of the display 950. The display 950 may be a touch screen so that the display 950 may be used as an input device.
The network card 970 is used to communicate with other computing devices, wirelessly or via a wired connection. Through the network card 970, one or more data links and/or data switches may enable the transport of electronic data between computer systems, modules, and/or other electronic devices. When information is transferred, or provided, over a network (either hardwired, wireless, or a combination of hardwired and wireless) to a computer, the computer properly views the connection as a transmission medium. The computing device 900 may include one or more communication channels that are used to communicate with the network card 970. Data or desired program codes are carried or transmitted in the form of computer-executable instructions or in the form of data structures via the network card 970.
Any of the herein described methods, programs, algorithms or codes may be converted to, or expressed in, a programming language or computer program. The terms “programming language” and “computer program,” as used herein, each include any language used to specify instructions to a computer, and include (but are not limited to) the following languages and their derivatives: Assembly, Basic, Batch files, BCPL, C, C+, C++, C #, Delphi, Fortran, Java, JavaScript, machine code, operating system command languages, Pascal, Perl, PL1, scripting languages, Visual Basic, meta-languages which themselves specify programs, and all first, second, third, fourth, fifth, or further generation computer languages. Also included are database and other data schemas, and any other meta-languages. No distinction is made between languages which are interpreted, compiled, or use both compiled and interpreted approaches. No distinction is made between compiled and source versions of a program. Thus, reference to a program, where the programming language could exist in more than one state (such as source, compiled, object, or linked), is a reference to any and all such states. Reference to a program may encompass the actual instructions and/or the intent of those instructions.
The aspects disclosed herein are examples of the disclosure and may be embodied in various forms. Although certain aspects herein are described as separate aspects, each of the aspects herein may be combined with one or more of the other aspects herein. It should be understood that various aspects disclosed herein may be combined in different combinations than the combinations specifically presented in the description and accompanying drawings. It should also be understood that, depending on the example, certain acts or events of any of the processes or methods described herein may be performed in a different sequence, may be added, merged, or left out altogether (e.g., all described acts or events may not be necessary to carry out the techniques). In addition, while certain aspects of this disclosure are described as being performed by a single module or unit for purposes of clarity, it should be understood that the techniques of this disclosure may be performed by a combination of units or modules.
The present disclosure may be embodied in other specific forms without departing from its characteristics. The described aspects are to be considered in all respects only as illustrative and not restrictive. The scope of the present disclosure is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more examples. In the preceding description, numerous specific details were provided, such as examples of various configurations to provide a thorough understanding of examples of the described technology. It will be recognized, however, that the technology may be practiced without one or more of the specific details, or with other methods, components, devices, etc. In other instances, well-known structures or operations are not shown or described in detail to avoid obscuring aspects of the technology.
Although the subject matter has been described in language specific to structural features and/or operations, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features and operations described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. Numerous modifications and alternative arrangements may be devised without departing from the spirit and scope of the described technology.
This application claims the benefit of and priority to U.S. Provisional Patent Application Ser. No. 63/504,467 filed on May 26, 2023, and entitled “MULTI-AGENT DEEP REINFORCEMENT LEARNING-ENABLED DISTRIBUTED POWER ALLOCATION SCHEME FOR MMWAVE CELLULAR NETWORKS,” which is expressly incorporated herein by reference in its entirety.
This present disclosure was made with government support under DE-AC07-05ID14517 awarded by the U.S. Department of Energy, and 2229562 awarded by the National Science Foundation. The government has certain rights in the present disclosure.