The present disclosure relates to the field of cellular telecommunications, and more specifically to the problem of link adaptation in grant-free random-access transmission for massive machine type communication.
The 5th Generation (5G) of mobile communication is required to support diverse applications. The most common scenarios are enhanced Mobile Broadband (eMBB), massive Machine Type Communication (mMTC), and Ultra Reliable Low Latency Communication (URLLC). mMTC is propelled by the need to support machine-to-machine (M2M) communication over cellular networks. M2M communication differs from human-to-human (H2H) communication in several ways. In particular, M2M communication has the following characteristics: 1) the packets are smaller, 2) the transmission is more sporadic, and 3) the quality of service (QoS) requirements are more diverse. As a result, the scheduling procedure for M2M communication is fundamentally different from that for H2H communication.
A considerable number of M2M devices will be battery driven, whereas in currently deployed wireless systems, a large share of the energy consumed by the communicating devices is spent establishing and maintaining connections. As identified in “K. Au, L. Zhang, H. Nikopour, E. Yi, A. Bayesteh, U. Vilaipornsawai, J. Ma, and P. Zhu, Uplink contention based SCMA for 5G radio access, in IEEE Globecom Workshops, December 2014, pp. 900-905”, when transmitting small packets, the grant request procedure can result in an overhead of 30% of the resource elements. While semi-persistent connection, as adopted by the narrowband internet of things (NB-IoT) standard, might reduce the signaling overhead, it can only do so efficiently in the case of periodic traffic arrival.
A grant-free access mechanism can enable devices to transmit data in an arrive-and-go manner in the next available slot. Various grant-free approaches are considered in the literature; most works focus on the decoding and feedback procedure.
In U.S. Ser. No. 10/609,724, a method for determining the modulation and coding scheme (MCS) in grant free uplink transmission is proposed based on a limit MCS received by the base station.
In EP3644673, the user equipment (UE) selects a grant-free transmission resource configuration from a set of available options configured and sent by the base station. The UE selects the configuration based on its service performance index requirement and/or the amount of data to be transmitted. Such a static solution can easily lead to greedy behavior that can hinder the overall performance of the network.
In “N. Mastronarde and M. van der Schaar, Joint physical-layer and system-level power management for delay-sensitive wireless communications, IEEE Trans. Mobile Comput., vol. 12, no. 4, pp. 694-709, April 2013”, a reinforcement learning algorithm is proposed to jointly select adaptive modulation and coding (AMC) and dynamic power management (DPM) actions in order to minimize the transmitted power in a single-user system while satisfying a certain delay constraint.
Afterwards, in “N. Mastronarde, J. Modares, C. Wu, and J. Chakareski, Reinforcement Learning for Energy-Efficient Delay-Sensitive CSMA/CA Scheduling, in IEEE Global Communications Conference, 2016, pp. 1-7”, the previous work is extended to consider a multiuser system in an IEEE 802.11 network with carrier sensing multiple access (CSMA). In this work, the authors considered three users contending for channel access and adopted an independent learners approach, where each user optimizes its own rewards, ignoring the interaction with other users. Despite its simplicity, the independent learners solution is known to have several issues, such as Pareto-selection, non-stationarity, stochasticity, alter-exploration and shadowed equilibria.
There is a need for optimizing the access procedure in grant-free access. In grant-free access, due to the lack of scheduling on orthogonal time-frequency resources, there is a high probability that different devices randomly choose the same resource blocks for the uplink transmission, resulting in the superposition of data (collision). Moreover, grant-free transmission poses new challenges in the design of physical layer (PHY) and medium access control (MAC) protocols. Static policies for adaptive modulation and coding (AMC), power control, and packet retransmission are not efficient and would not be able to scale to the diverse throughput, latency, and power saving requirements of mMTC.
Herein, a partially observable stochastic game (POSG) to model PHY and MAC dynamics of a grant-free mMTC network is described. A multiagent reinforcement learning (MARL) framework is employed for a distributed decision-making solution that captures the interaction between PHY and MAC. As a result, the network performance is improved in terms of transmission latency and energy efficiency compared to baseline schemes, while keeping communication overhead to a minimum.
There is provided a method for selecting, for a device, a transmit power, a physical resource block (PRB), and a modulation and coding scheme (MCS) for a grant free uplink transmission. The method comprises obtaining an observation of the radio environment of the device. The method comprises selecting an action, based on the observation, for execution by the device during a next time slot, the action comprising selecting the transmit power, the PRB, and the MCS for the grant free uplink transmission.
There is provided a device for selecting a transmit power, a physical resource block (PRB), and a modulation and coding scheme (MCS) for a grant free uplink transmission. The device comprises processing circuits and a memory, the memory containing instructions executable by the processing circuits whereby the device is operative to obtain an observation of the radio environment of the device. The device is operative to select an action, based on the observation, for execution during a next time slot, the action comprising selecting the transmit power, the PRB, and the MCS for the grant free uplink transmission.
There is provided a non-transitory computer readable media having stored thereon instructions for selecting a transmit power, a physical resource block (PRB), and a modulation and coding scheme (MCS) for a grant free uplink transmission. The instructions comprise obtaining an observation of the radio environment of the device. The instructions comprise selecting an action, based on the observation, for execution by the device during a next time slot, the action comprising selecting the transmit power, the PRB, and the MCS for the grant free uplink transmission.
The methods, devices, apparatus and systems provided herein present improvements to the way link adaptation in grant-free random-access transmission for massive machine type communication operates. The solution described herein enables grant-free multiple access transmission and allows selecting the modulation and coding scheme, the physical resource block (PRB), and the transmission power that minimize the power consumption while satisfying delay constraints. It supports diverse QoS requirements and packet arrival intensities and offers a learnable dynamic policy for fast and energy-efficient grant-free transmission.
Various features will now be described with reference to the drawings to fully convey the scope of the disclosure to those skilled in the art.
Sequences of actions or functions may be used within this disclosure. It should be recognized that some functions or actions, in some contexts, could be performed by specialized circuits, by program instructions being executed by one or more processors, or by a combination of both.
Further, a computer readable carrier or carrier wave may contain an appropriate set of computer instructions that would cause a processor to carry out the techniques described herein.
The functions/actions described herein may occur out of the order noted in the sequence of actions or simultaneously. Furthermore, in some illustrations, some blocks, functions or actions may be optional and may or may not be executed.
Traditional cellular networks are designed to be used with a grant-based procedure, where a user device requests a channel in order to transmit data. In turn, the base station resolves the request and allocates a channel to the user device, if available, to allow the user device to transmit its data. However, for machine-to-machine communication, the better approach may be to use grant-free access to transmit data directly on the random-access channel. That is because machine-to-machine devices often transmit small packets, and the grant-based process of requesting the allocation of a channel for small packets may take more power and resources than direct transmission of the data on the random-access channel, which completely avoids requesting the allocation of a channel.
The grant-free approach comes with its own challenges, however: although there is less overhead, there are more collisions, which can require retransmitting packets several times and end up consuming more resources than the grant-based procedure. Therefore, an approach to optimize the grant-free access procedure is described herein.
It is proposed to use multi agent reinforcement learning to decide when and how to access the random-access channel. Multi agent reinforcement learning is a framework that can handle requirements related to massive deployment of devices with diverse characteristics, different latency requirements, different power requirements, etc. Multi agent reinforcement learning provides the flexibility and power to adapt to such requirements and can be deployed in a decentralized manner which mitigates complexity and reduces overhead of communicating decisions between devices or between the base station and the devices.
The problem is modeled in terms of a state that varies with time and an environment that determines how the state varies, and actions taken by the agents that shape what the next state will be depending on the environment, the actions and maximization of a reward signal.
The actions available to the agents are, for example, adjusting the transmit power, switching the radio to either the idle or transmit mode, deciding on which sub-carrier to transmit, and selecting the modulation order. The objective of the agents is to minimize the transmit power subject to device-specific constraints.
Challenges include that agents can only access local information, that there is no guarantee of convergence to an equilibrium in multi agent reinforcement learning, and that the search space increases exponentially with the addition of devices. Another challenge is how to learn a good policy or protocol while sharing as little information as possible between the agents.
As will be described in more detail below with reference to the figures, multi-agent reinforcement learning is employed to select the transmit power, PRB, and MCS. For each transmission time interval (TTI), a partially observable stochastic game (POSG) is used to model how multiple agents with distinct and possibly adversarial goals interact with a stochastically changing environment in discrete time slots. At each time slot, the agents receive a partial, and possibly noisy, observation of the environment and select an action to take in the next time slot based on this observation. Each action incurs a reward, and the objective of each agent is to learn a policy that maximizes its rewards. Each device aims to find a stochastic policy that minimizes its infinite-horizon cost function non-cooperatively. One goal in the proposed scenario is to minimize the average discounted expected transmit power subject to a constraint on the discounted expected delay costs, resulting in a constrained Markov decision process (CMDP).
In a grant free multiple access configuration, where the nodes have the liberty to send their data to the base station without a prior handshake, the selection of the transmit power, PRB, and MCS affects the time required (delay) for the data to be successfully received at the base station and the energy consumed by the node to successfully transmit the data. A method is provided to select the transmit power, PRB, and MCS so as to satisfy the nodes' delay constraints while consuming minimal power for data transmission. At the beginning of each transmission time interval, every node decides to go into transmit mode or idle mode to save power. If a node is in transmit mode, it selects a transmit power, the PRB, and the MCS. Then, it draws a random backoff time as a function of its MCS. During the backoff time the node listens to the channel, and if it doesn't sense any transmission during the backoff interval, it transmits a packet in the rest of the TTI. To select the transmit power, PRB, and MCS, multi-agent reinforcement learning can be employed at the nodes.
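By way of a non-limiting illustration, the following Python sketch outlines the per-TTI decision procedure described above. The function and callback names, as well as the backoff scaling with the MCS, are hypothetical placeholders standing in for the learned actor and the carrier-sensing mechanism detailed later in this disclosure.

    import random

    def one_tti(observation, policy, channel_busy, tti=1.0):
        # The policy maps a local observation to the action tuple for this TTI:
        # (transmit power, PRB, MCS, transmit-or-idle flag).
        power, prb, mcs, transmit = policy(observation)
        if not transmit:
            return "idle"                                   # stay idle to save power
        backoff = random.uniform(0.0, tti * 2.0 ** (-mcs))  # higher MCS -> shorter window
        if channel_busy(prb, backoff):
            return "deferred"                               # another node won contention
        # Transmit on the selected PRB with the selected power and MCS
        # for the remainder of the TTI.
        return ("transmit", prb, power, mcs, tti - backoff)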
Three learning strategies based on the actor/critic algorithms are proposed and are described in more details below: independent learners (IL), distributed actors with centralized critic (DACC) and centralized learning with decentralized inference (CLDI).
In the independent learners approach, each device trains an actor neural network, which is used to decide which actions to take, and also trains a critic neural network, which evaluates the value of the current state.
In the approach with distributed actors and centralized critic, each device trains an actor neural network, while the critic neural network is centralized in the cloud and collects and aggregates data from all the devices to train a single critic for all the devices. This enables some level of cooperative behavior between the devices and a better use of the spectrum. However, the cost is that some state information needs to be provided by the devices and a critic value has to be broadcast to the devices to allow the devices to train their local actor neural networks.
In the approach with centralized learning and decentralized inference, there is a single central neural network that is trained in the cloud and fed back to the devices. This removes the burden of training a neural network from the devices and provides a single policy that is shared between all the devices, thereby reducing the search space; i.e., when new devices are added, the search space for the optimal policy is not increased. Experimental data has shown that the centralized learning and decentralized inference approach outperforms the other approaches in terms of delay, dropped packets, power consumption and collisions.
Variables used in
In
In summary, depending on the action, the device tries to access the wireless channel, if it gains access, the device transmits its data with physical layer parameters chosen according to the selected action. Otherwise, it starts the procedure again. The cloud plays no part in the independent learners architecture.
In
In
In
The below description in relation with
Hereinbelow, italic lowercase letters denote real and complex scalar values, and x* denotes the complex conjugate of x. Lowercase boldface letters denote vectors, while uppercase boldface letters denote matrices. A lowercase letter with one subscript, x_i, represents the i-th element of the vector x, while both x_{i,j} and [X]_{i,j} are used to denote the element on the i-th row and j-th column of matrix X. The operators x^H and X^H denote the Hermitian conjugate of a vector and of a matrix, respectively. The operator E[·] denotes the expected value of a random variable. The function Pr(·) represents the probability of an event, and x ~ CN(μ, K) denotes that x is a complex Gaussian random vector with mean μ and covariance matrix K. The notation x ~ U(𝒳) denotes that x is drawn uniformly from the set 𝒳. The sets ℝ, ℂ and 𝔹 are the sets of the real, complex and binary numbers, respectively. A calligraphic uppercase letter, such as 𝒳, denotes a set and |𝒳| is its cardinality. The function ln(·) denotes the natural logarithm of its argument, while the function 𝟙(·) is the indicator function.
A system model is first introduced, considering a grant-free uplink cellular network where a set 𝒰 of Machine Type Devices (MTDs), with |𝒰| = N_U, are transmitting to a base station (BS) belonging to the set ℬ, such that |ℬ| = N_B. It is assumed that each device connects to its closest BS and employs carrier sensing multiple access (CSMA) for congestion control. At the beginning of each transmission time interval (TTI) of duration Δt, every device decides to go into transmit mode or idle mode to save power. If a device is in transmit mode, it selects a transmit power, one subcarrier from the set 𝒞, with |𝒞| = N_S, and a modulation order from the set ℳ = {1, . . . , M}. Then, it draws a random backoff time as a function of its modulation order. During the backoff time the device listens to the channel, and if it doesn't sense any transmission during the backoff interval, it transmits its packet in the rest of the TTI. At the end of the TTI, the MTDs who attempted to transmit data during this TTI receive an acknowledgement signal for the successfully transmitted packets. In the system model, it is hypothesized that the devices have already realized the attach procedure, have successfully connected and synchronized with the network, and that the relevant control information, such as TTI duration, modulation and coding scheme, supported transmit power values, and available subcarriers, has been configured prior to data transmission. Moreover, to maintain flexibility, the BS does not know beforehand how many devices will be connecting. It is assumed that all transmissions are affected by multipath Rayleigh fading and additive white Gaussian noise (AWGN) with power N0. It is assumed that the MTDs have perfect channel state information and can always sense whether the channel is empty, hence ignoring the hidden terminal problem.
The Markov decision process (MDP) based framework is extended to a POSG to incorporate the nature of the distributed policy search in a cellular wireless network.
Referring to the MDP framework, at each time slot t the agent observes the state s_t ∈ 𝒮 and selects an action a_t ∈ 𝒜, where 𝒮 is the set of states and 𝒜 is the set of available actions. Depending on this action and the current state of the environment, it receives a new state observation s_{t+1} ∈ 𝒮 with probability Pr(s_{t+1}|s_t, a_t), and the reward incurred by taking the action a_t while at state s_t is given by r_{t+1} with conditional probability Pr(r_{t+1}|s_t, a_t). Note that the future state and reward of the system depend only on the current state and action, making it an MDP. The block diagram in
The goal of an agent is to learn a policy π: 𝒮 × 𝒜 → [0, 1] that maximizes its expected discounted rewards. A policy is nothing more than a conditional probability distribution of taking an action given the current state of the agent. The value function V^π(s) quantifies how good it is for an agent to be at state s while following a policy π in terms of the discounted expected rewards, and is formally defined as

V^π(s) = E_π[ Σ_{t=0}^∞ γ^t r_{t+1} | s_0 = s ],

where γ ∈ (0, 1] is the discount factor and determines how important future rewards are. Similarly, the action-value function Q^π(s, a) quantifies the value of taking action a while at state s and following policy π and is given by

Q^π(s, a) = E_π[ Σ_{t=0}^∞ γ^t r_{t+1} | s_0 = s, a_0 = a ].
It is possible to establish a partial ordering between different policies, where π ≥ π′ if and only if V^π(s) ≥ V^π′(s) for all s ∈ 𝒮. Hence, for an optimal policy π*, we must have V^{π*}(s) = max_π V^π(s) for all s ∈ 𝒮. When the transition probabilities Pr(s′|s, a) for all s, s′ ∈ 𝒮 and a ∈ 𝒜 and the reward distribution Pr(r|s, a) for all s ∈ 𝒮 and a ∈ 𝒜 are known, an optimal policy can be found via dynamic programming. However, in many problems these probabilities are unknown to the agents or the state and action sets are too large, rendering dynamic programming infeasible.
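As a minimal illustration of the dynamic programming case mentioned above, the following sketch computes an optimal policy by value iteration when the transition probabilities and expected rewards are known; the array layout and the stopping tolerance are assumptions made for the example.

    import numpy as np

    def value_iteration(P, R, gamma=0.9, tol=1e-6):
        """P[s, a, s2]: transition probabilities, R[s, a]: expected rewards.
        Returns the optimal value function and a greedy deterministic policy."""
        n_states, n_actions, _ = P.shape
        V = np.zeros(n_states)
        while True:
            # Bellman optimality backup: Q(s, a) = R(s, a) + gamma * E[V(s')]
            Q = R + gamma * np.einsum("sap,p->sa", P, V)
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, Q.argmax(axis=1)
            V = V_new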
A Partially Observable Stochastic Game (POSG) models how multiple agents with distinct and possibly adversarial goals interact with a stochastically changing environment in discrete time slots. At each time slot, the agents receive a partial, and possibly noisy, observation of the environment and select an action to take in the next slot based on this observation. Each action incurs a reward, and the objective of the agents is to learn a policy that maximizes their rewards. Herein, we are concerned with infinite-horizon POSGs. The POSG problem is formally defined by a tuple (𝒰, 𝒮, 𝒜, P, ℛ, 𝒪), where 𝒰 is a set with N_U agents, and 𝒮 and 𝒜 = ∏_i 𝒜_i are the state space of the system and the joint action space, where 𝒜_i is the action space of agent i. The state-action transition probability P: 𝒮 × 𝒮 × 𝒜 → [0, 1] gives the probability of transitioning to a state, given the current state and the joint selected action. Furthermore, ℛ = {r_i | r_i: 𝒮 × 𝒜 → ℝ, ∀ i ∈ 𝒰} is the set of reward functions, where r_i denotes the reward function of agent i. The set 𝒪 = {𝒪_i: 𝒪_i ⊆ 𝒮, ∀ i ∈ 𝒰} contains the observation space of each device, which is a subset of the complete state space. The state of the environment and the joint selected action at time slot t are denoted by s_t ∈ 𝒮 and a_t ∈ 𝒜, respectively.
In the network model, each TTI corresponds to one POSG time slot. It is considered that the state of the network at time slot t is given by the tuple s_t = (b_t, {H_t^i}_{i=1}^{N_U}, x_t), where b_t, whose entries take values in {1, 2, . . . , B} with B being the packet buffer length, denotes the number of packets queued in the buffer of each device. The matrix H_t^i ∈ ℂ^{N_B×N_S} denotes the complex channel between the i-th MTD and the BSs on each subcarrier, where [H_t^i]_{k,j} is the channel between BS k and device i on subcarrier j. The vector x_t ∈ {1, 0}^{N_U} denotes the dynamic power management state (transmit or idle) of each MTD. Each device is associated with its closest BS, i.e.,

BS(i) = k iff d_{i,k} ≤ d_{i,k′} ∀ k′ ∈ {1, . . . , N_B},   (5)

where d_{i,k} denotes the distance between device i and BS k.
The devices only have information about their local state. Therefore, the observation tuple of device i at time slot t is given by o_t^i = (b_t^i, H_t^i, x_t^i).
At the beginning of each TTI, the i-th device observes the tuple o_t^i and selects an action tuple a_t^i = (p_t^i, β_t^i, θ_t^i, y_t^i), where p_t^i ∈ 𝒫 is the transmit power and 𝒫 is the set of transmit powers, θ_t^i ∈ 𝒞 is the subcarrier on which to transmit, β_t^i ∈ {1, . . . , M} is the modulation order and M is the maximum modulation order, and y_t^i ∈ {1, 0} indicates whether device i transitions to the transmit or idle mode, respectively, at time slot t.
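For clarity, the observation and action tuples defined above can be represented as follows; this is a descriptive sketch only, and the field names are illustrative rather than part of the disclosure.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Observation:          # local view o_t^i of device i
        buffer: int             # b_t^i: packets queued, between 0 and B
        channel: np.ndarray     # H_t^i: complex gains, shape (N_B, N_S)
        power_state: int        # x_t^i: 1 = transmit mode, 0 = idle

    @dataclass
    class Action:               # action tuple a_t^i selected at the start of the TTI
        power_dbm: float        # p_t^i, drawn from the finite power set
        mcs: int                # beta_t^i, modulation order in {1, ..., M}
        subcarrier: int         # theta_t^i
        transmit: int           # y_t^i: 1 = transmit mode, 0 = idle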
In the model, the joint actions and the state of the environment are represented by tuples of state and action vectors. However, not all actions from the tuple affect the transition of all states. For instance, the channel gain at time slot t+1 does not depend on the transmit power p_t^i of device i at time slot t. Therefore, the transition probability function can be decomposed as

Pr(s_{t+1}|s_t, a_t) = Pr(b_{t+1}|b_t, {H_t^i}_{i=1}^{N_U}, a_t) Pr({H_{t+1}^i}_{i=1}^{N_U}|{H_t^i}_{i=1}^{N_U}) Pr(x_{t+1}|x_t, y_t).   (6)
The procedures involved in the action selection, and the analysis of the transition probabilities shown in (6) are discussed below.
It is considered that all MTDs employ a rate-adaptive CSMA as a MAC protocol on each individual subcarrier. A time slot is divided into two phases: contention and transmission. During the contention phase the devices listen to the channel on a specific subcarrier for a random backoff time τ_C < Δt. If no other user has started transmission during this time, the device starts its transmission for an amount of time τ_TX = Δt − τ_C. The protocol is illustrated in
The same rate-adaptive CSMA protocol is considered for all devices, where congestion windows (CW) given by CW_min(β_t^i) = ⌊A·2^{M−β_t^i}⌋, where A is a design parameter, are assigned to the devices according to their modulation order. The backoff time of the i-th device, τ_C^i, is uniformly chosen from [0, CW_min(β_t^i)] and is reset at the end of the time slot. If a collision occurs, the devices' CWs are set to CW_max. The number of packets transmitted by user i in a given TTI is determined by its modulation order and the transmission time remaining after the backoff. This approach increases the likelihood that a device that intends to transmit at higher rates obtains channel access, avoiding the anomaly where low-rate users can significantly degrade the performance of the whole network.
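A minimal sketch of the backoff drawing described above is given below, assuming the congestion window formula CW_min(β) = ⌊A·2^(M−β)⌋; the random number generator handling is illustrative.

    import numpy as np

    def backoff_time(mcs, max_mcs, a, rng=None):
        """Rate-adaptive CSMA backoff: devices with a higher modulation order
        contend with a shorter window and are more likely to win channel access."""
        rng = rng or np.random.default_rng()
        cw_min = int(np.floor(a * 2.0 ** (max_mcs - mcs)))
        return rng.uniform(0.0, cw_min)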
Now, let 𝒰_k = {i : BS(i) = k} and 𝒯_j = {i : θ_t^i = j and x_t^i = 1} be the sets of MTDs associated to BS k and of MTDs trying to transmit on subcarrier j at time slot t, respectively. Finally, let Ω_t^{i,j,k} = {τ_C^i ≤ τ_C^u ∀ u ∈ 𝒰_k ∩ 𝒯_j} be the event that device i obtains transmit access to BS k, on subcarrier j, at time slot t.
In this work, it is assumed that all devices transmit symbols from an M-quadrature amplitude modulation (QAM) constellation with fixed symbol duration T_s and that all the packets are L bits long. At each time slot, the devices select a modulation order and a transmit power from the finite sets ℳ = {1, . . . , M} and 𝒫 = {ρ_1, ρ_2, . . . , ρ_max} dBm, respectively. The modulation order and the transmit power affect the probability of transmitting a packet successfully, the interference levels of the network and the cost associated with each transmission.
How each of the elements in the state tuple s_t = (b_t, {H_t^i}_{i=1}^{N_U}, x_t) evolves over time is described next.
Wireless Channel: The channel gain between MTD i and BS k on subcarrier j at time slot t is given by
where ζ is the path loss exponent, and h_t^{i,j,k} is the small-scale fading. It is assumed that the channel gains are constant during the TTI duration. A first-order Gauss-Markov small-scale flat fading model is considered
h_t^{i,j,k} = κ h_{t−1}^{i,j,k} + n_t^{i,j,k},   (8)
where the innovation n_t^{i,j,k} ~ CN(0, 1 − κ²). “Innovation” is defined as the difference between the observed value of a variable at time t and the optimal forecast of that value based on information available prior to time t. The correlation between successive fading components is given by
κ = J_0(2π f_max T_s),   (9)
where fmax is the maximum Doppler frequency and J0 is the zero-th order Bessel function.
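The fading model of (8) and (9) can be simulated, for example, as follows; the unit-variance initialization and the innovation variance of 1 − κ² are assumptions consistent with the first-order Gauss-Markov model described above.

    import numpy as np
    from scipy.special import j0

    def simulate_fading(n_slots, f_max, t_s, rng=None):
        """First-order Gauss-Markov flat fading: h_t = k*h_{t-1} + n_t,
        with k = J0(2*pi*f_max*T_s) and a complex Gaussian innovation."""
        rng = rng or np.random.default_rng()
        k = j0(2 * np.pi * f_max * t_s)
        h = np.empty(n_slots, dtype=complex)
        h[0] = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
        for t in range(1, n_slots):
            n = np.sqrt((1 - k ** 2) / 2) * (rng.standard_normal()
                                             + 1j * rng.standard_normal())
            h[t] = k * h[t - 1] + n
        return h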
Buffer State and Traffic Model: The buffer state b_t represents the number of packets queued for transmission at each MTD at time slot t. At each time slot, l_t new packets arrive and g_t packets depart from the buffer. Therefore, the current number of packets in the buffer is given by

b_t = min(max(b_{t−1} − g_{t−1}, 0) + l_t, B).   (10)
It is assumed that the number of packets arriving at each device is independent of the actions taken and is independent and identically distributed (i.i.d.). The number of arrivals at each time slot is modeled as a Poisson random process with distribution

Pr(l_t^i = l) = (λ_i^l e^{−λ_i})/l!,   (11)

where λ_i is the mean packet arrival rate of device i.
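The buffer dynamics of (10), (11) and (17) can be illustrated by the following sketch, where the departure count is treated as an input supplied by the transmission model.

    import numpy as np

    def buffer_step(b_prev, departures, lam, capacity, rng=None):
        """One TTI of buffer evolution for a single device. In the disclosure, eq. (10)
        uses the previous slot's goodput and eq. (17) the current slot's; a single
        departure count is used here for brevity."""
        rng = rng or np.random.default_rng()
        arrivals = rng.poisson(lam)                                    # l_t, eq. (11)
        b_new = min(max(b_prev - departures, 0) + arrivals, capacity)  # cf. eq. (10)
        overflow = max(b_prev + arrivals - departures - capacity, 0)   # cf. eq. (17)
        return b_new, arrivals, overflow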
The goodput g_t, defined as the amount of information successfully transmitted to a destination per unit of time, has a more complex relationship with the actions taken at the current time slot. It depends on whether or not the user is able to take hold of the channel, on the interference from other users, on its transmit power and modulation order, and on its channel's quality. Some auxiliary variables and sets are introduced and the goodput's conditional probability distribution is then derived. Firstly, let 𝒦_t(j) = {i : 𝟙(Ω_t^{i,j,k}) = 1, ∀ i, k} be the set of users scheduled to transmit on subcarrier j at time slot t. The goodput of the i-th MTD is a function of the device's transmit power, its selected subcarrier, its channel to the receiving BS, and the interference power at the receiving BS. The probability of decoding a bit in error (denoted as P_e^i) can be approximated by
where I_t^i is the interference experienced by device i's transmission, and is given by
With the approximate probability of decoding a bit in error given by (12), the probability of losing a packet is obtained as
P_loss^i = 1 − (1 − P_e^i)^L.   (14)
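For illustration only, the following sketch maps the transmit power, channel gain, interference and modulation order to a packet loss probability via (14); the exponential M-QAM bit error rate approximation used below is an assumption standing in for the expression in (12), which is not reproduced here.

    import numpy as np

    def packet_loss(p_tx_w, channel_gain, interference_w, noise_w, mcs, packet_bits):
        sinr = p_tx_w * channel_gain / (interference_w + noise_w)
        ber = 0.2 * np.exp(-1.5 * sinr / (2 ** mcs - 1))   # assumed M-QAM approximation
        return 1.0 - (1.0 - ber) ** packet_bits            # eq. (14)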
Finally, the conditional probability distribution of the goodput is given by
Now, based on (10), (11) and (15), the sequence of states {b_t}_{t=0}^∞ is modeled as a controlled Markov chain whose transition probability is expressed in terms of P_l(l) = Pr(l_t^i = l) and P_g(g) = Pr(g_t^i = g | β_t^i, p_t, {H_t^i}_{i=1}^{N_U}).
Moreover, one goal is to reduce the probability of overflown packets, i.e. packets that arrive while the buffer is full. The number of overflown packets at device i's buffer is given by

ξ_t^i = max(b_t^i + l_t^i − g_t^i − B, 0).   (17)
Dynamic Power Management: In the system under consideration, at each time slot the devices have a stochastic number of packets arriving to be transmitted over a stochastic wireless channel. Therefore, in some situations, when there are few or no packets in the queue or during poor channel conditions, not transmitting any data may be the optimal approach to save power. In order to take advantage of this, it is assumed that each device is able to select a power state mode between a transmit mode, in which the radio is active and able to transmit data, and an idle mode, in which the radio is powered down to save energy.
There is an inherent delay in switching between different modes, so the dynamic power management state of device i at time slot t is modeled as a Markov chain with transition probability
where 0≤ω≤1 is the probability of switching between power states in time for the next TTI. It is assumed that ω=0 without loss of generality.
In a POSG, each device aims to find a stochastic policy π_i ∈ Π_i, where Π_i = {π | π: 𝒪_i × 𝒜_i → [0, 1]} is the set of all possible policies for MTD i, that minimizes its infinite-horizon cost function non-cooperatively. One goal in the proposed scenario is to minimize the average discounted expected transmit power subject to a constraint on the discounted expected delay costs, resulting in a constrained Markov decision process (CMDP). Mathematically this can be expressed as the constrained optimization problem (20), where π = [π_1, . . . , π_{N_U}] denotes the joint policy of all the devices.
Power cost: At each time slot, the i-th device incurs an instantaneous power cost of
Note that the power cost is normalized by P_ON. This normalization is important for the stability of the algorithms discussed further below. Therefore, its discounted expected power cost is given by
Delay Cost: According to Little's theorem, the average number of packets queued in the buffer is proportional to the average packet delay in queues with stable buffers (i.e., no overflow). Hence, the delay cost is designed to discourage a large number of packets in the queue, which is referred to as the holding cost, while simultaneously penalizing dropped packets, which is referred to as the overflow cost. The instantaneous delay cost at time t is defined as
where μ is the overflow penalty factor. Hence, the infinite-horizon discounted expected delay cost is given by
The penalty factor is chosen such that it ensures dropping packets is suboptimal while encouraging devices to transmit with low power when advantageous. To meet these requirements, a value of μ is chosen such that dropping a packet costs as much as the largest possible discounted expected cost incurred by holding a packet in the buffer, which happens if the packet is held in the buffer forever. Therefore
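For illustration, assuming the holding cost of a single packet is normalized to one per TTI (an assumption made here for clarity), the largest possible discounted expected cost of holding a packet, i.e., holding it in the buffer forever, is the geometric series

μ = Σ_{t=0}^∞ γ^t = 1/(1 − γ),

which is the value suggested by the requirement above.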
To deal with the problem of dynamic channel, power and modulation selection under delay constraints described previously, three different distributed learning architectures are proposed: independent learners (IL), distributed actors with centralized critic (DACC) and centralized learning with decentralized inference (CLDI).
Finally, the three architectures are compared with a baseline employing power ramping and random transmission probability to avoid congestion.
In order to provide a fair comparison, in all of the proposed architectures an actor-critic style PPO algorithm is considered, due to its ease of implementation, the possibility of decoupling the policy and the value estimator, reduced sample complexity in comparison with trust region policy optimization (TRPO) and first-order updates. Furthermore, it is considered that each agent employs an artificial neural network (ANN) to model the policy and as a value estimator.
In contrast to action-value methods, such as Q-learning, where the agent learns an action-value function and uses it to select actions that maximize its output, policy gradient methods learn a parametrized policy that selects the actions without consulting a value function. Let w be the policy parameter vector; then the parametrized policy π_w(a|s) = Pr(a_t = a | s_t = s, w_t = w) denotes the probability of selecting action a while at state s with policy parameter w during the time slot t.
In order to learn the policy parameter, a scalar performance function J(w), differentiable with respect to w is considered. Then, the learning procedure consists of maximizing J(w) through gradient ascent updates of the form
w_{t+1} = w_t + α ∇_w J̃(w_t),   (26)
where α is the learning rate, and ∇w{tilde over (J)}(wt) is an estimator of the gradient of the performance measure. A common choice of performance measure is
J(w) = π_w(a_t|s_t) A(s_t, a_t),   (27)
where A(s_t, a_t) = Q(s_t, a_t) − V(s_t) is the advantage function, which gives the advantage of taking action a_t in comparison to the average action. The name actor-critic comes from (27), as the difference between the actor estimate Q(s_t, a_t) and the critic estimate V(s_t) is evaluated. The gradient of the performance measure can be estimated by taking the average gradient over a finite batch of time slots as
where Ê[·] denotes an empirical average over the finite batch of state-action tuples, while sampling the actions from policy π_w(a|s). Moreover, equality (a) is obtained from the identity ∇_w ln π_w(a|s) = ∇_w π_w(a|s)/π_w(a|s).
The Proximal Policy Optimization (PPO) algorithm consists of maximizing a clipped surrogate objective Jclip(w) instead of the original performance measure J(w), therefore avoiding the destructively large updates experienced on policy gradient methods without clipping. The surrogate objective is defined as
J_clip(w) = Ê[min(Γ_t(w) A(s_t, a_t), clip(Γ_t(w), 1 − ε, 1 + ε) A(s_t, a_t))],   (30)
where Γ_t(w) = π_w(a_t|s_t)/π_{w_old}(a_t|s_t) is the importance weight, ε is a hyperparameter that controls the clipping, and w_old are the policy weights prior to the update. Due to the term clip(Γ_t(w), 1 − ε, 1 + ε) A(s_t, a_t), the importance weight is clipped between 1 − ε and 1 + ε, minimizing the incentives for large destabilizing updates. Furthermore, by taking the minimum of the clipped and unclipped functions, the resulting surrogate objective is a lower-bound first-order approximation of the unclipped objective around w_old.
Furthermore, the performance measure is augmented to include a value function loss term, corresponding to the critic output, given by
Finally, an entropy bonus term H(π_w) is added to encourage exploration of the state space. The final surrogate objective function to be maximized is given by

J_surr(w) = J_clip(w) − k_1 J_VF(w) + k_2 H(π_w),   (32)
where k_1 and k_2 are system hyperparameters. The PPO algorithm is summarized in Algorithm 1 (1100).
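A minimal sketch of the surrogate objective (30)-(32) is shown below, assuming PyTorch as the framework and assuming that log-probabilities, advantages, value estimates, empirical returns and per-sample entropies have already been collected for a batch; it is an illustration, not the exact Algorithm 1 of the disclosure.

    import torch

    def ppo_surrogate(logp_new, logp_old, advantage, values, returns, entropy,
                      eps=0.2, k1=0.5, k2=0.01):
        ratio = torch.exp(logp_new - logp_old)              # importance weight Gamma_t(w)
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
        j_clip = torch.min(ratio * advantage, clipped * advantage).mean()   # eq. (30)
        j_vf = (values - returns).pow(2).mean()             # value-function loss term
        return j_clip - k1 * j_vf + k2 * entropy.mean()     # eq. (32), to be maximized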
In the IL architecture, each device has its own set of weights w_i and runs its own learning algorithm to update those weights; the devices do not share any information about their policies or current and previous states. As each device has only a local view of the state of the environment, it cannot learn a policy to solve the optimization problem in (20). Instead, the i-th device searches for a policy to solve its own local problem. Therefore, each agent solves the problem given by
In this architecture, each agent changes its policy independently of one another, but their actions affect the rewards experienced by other agents, therefore, agents perceive the environment as non-stationary.
Moreover, the problem, posed as a constrained optimization one, can be reformulated as an unconstrained problem by including a Lagrangian multiplier Λ ≥ 0, corresponding to the delay cost constraint, resulting in the reward function

r_t^i(s_t, a_t) = −(c_p^i(o_t^i, a_t^i) + Λ_t^i c_d^i(o_t^i, a_t^i)).   (34)
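The per-device reward of (34) can be sketched as follows; the multiplier update shown is a hypothetical dual-ascent rule added for illustration, as this excerpt does not specify how Λ_t^i is adapted.

    def lagrangian_reward(power_cost, delay_cost, lam):
        # eq. (34): trade off the normalized power cost against the delay cost
        return -(power_cost + lam * delay_cost)

    def dual_update(lam, discounted_delay_cost, delay_limit, step=0.01):
        # Hypothetical: raise the multiplier when the delay constraint is violated,
        # lower it otherwise, keeping it non-negative.
        return max(0.0, lam + step * (discounted_delay_cost - delay_limit))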
In addition, the observation value function V^{π_i}(o) is defined as the value of observing o while following policy π_i. Analogously, the action-observation value function is defined as the value of taking action a while observing o and following policy π_i as
To summarize, in the fully distributed architecture, each device runs Algorithm 1 independently using (35) and (36), where each device seeks to minimize its local expected discounted costs. As shown in
The PPO algorithm makes use of two networks, the actor that models the agent's policy, and the critic that estimates the value of a state. Originally, the algorithm proposes that both networks can share weights to accelerate convergence and reduce memory costs, however, due to the nature of the problem considered herein, an architecture is proposed where each agent stores its own actor network, while a single critic network is stored in the cloud by the network operator. The goal of this architecture is to mitigate the effects of the partial observation by having a critic who has access to data of every agent (the whole state) to estimate the state value. Despite having access to the whole state information, it is considered that the value estimator at the central agent only makes use of the aggregate local information of each MTD, as in the scenario considered herein, the number of MTDs is not known beforehand. At the same time, the agents just need information about their own observations to take actions.
Hence, in this architecture, the surrogate objective function in (32) is split in two: one part to be maximized by the agents to train the actor network, given by

J_surr^a(w) = J_clip(w) + k_2 H(π_w),   (37)
and another one to be minimized in the cloud to train the critic network, given by
J_surr^c(w) = J_VF(w).   (38)
Additionally, J_VF(w) is computed over data collected from all agents, by calculating the global value function as
Furthermore, as illustrated in
Moreover, this architecture requires the BSs to feed back the value of each state after every TTI, so that the agents can perform backpropagation and train their actor networks.
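A sketch of the split objectives (37) and (38) is given below, again assuming PyTorch; the device-side function trains the local actor, while the cloud-side function trains the single shared critic on data aggregated from all devices.

    import torch

    def dacc_actor_objective(logp_new, logp_old, advantage, entropy, eps=0.2, k2=0.01):
        # Device side, eq. (37): clipped policy term plus entropy bonus, to be maximized.
        ratio = torch.exp(logp_new - logp_old)
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
        return (torch.min(ratio * advantage, clipped * advantage).mean()
                + k2 * entropy.mean())

    def dacc_critic_objective(values, returns):
        # Cloud side, eq. (38): value-function loss over the pooled batch, to be minimized.
        return (values - returns).pow(2).mean()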
As the number of MTDs in the network increases, the size of the policy search space Π increases exponentially, making it less likely for the system to converge to an equilibrium where the MTDs satisfy their delay constraints. To address this issue, in the CLDI architecture there is a single set of weights, and therefore a single policy π and a search space Π that does not increase in size with the number of MTDs. The policy is trained in the cloud and is periodically broadcast to the MTDs in the network, also reducing the computational burden on the devices required to train a neural network. Moreover, the cloud collects data from all MTDs, enabling faster convergence. Hence, instead of solving (20), the CLDI architecture looks for solutions to
So, in order to solve (40), the PPO algorithm maximizing the average reward over the whole network is used.
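The CLDI training round can be sketched as follows; every name below (collect_rollout, update, load_weights) is a hypothetical placeholder describing the data flow rather than an interface defined in this disclosure.

    def cldi_round(shared_policy, devices, cloud_optimizer):
        """One round of centralized learning with decentralized inference."""
        batch = []
        for dev in devices:
            # Decentralized inference: each device acts with the same broadcast policy
            # and reports its (observation, action, reward) samples.
            batch.extend(dev.collect_rollout(shared_policy))
        # Centralized learning: the single shared policy is updated on the pooled batch.
        cloud_optimizer.update(shared_policy, batch)
        for dev in devices:
            dev.load_weights(shared_policy)   # periodic broadcast of the new weights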
The performance of the proposed architectures is evaluated through computer simulation and compared for two different scenarios. It is considered that there are two BSs and eight subcarriers serving a circular area with a 300 m radius. One thousand realizations of this scenario are generated, and at each realization both the WAPs and the MTDs are placed in a random location within the circular area. On each realization the learning algorithms start from scratch (i.e., the weights of the agents are randomly initialized at the beginning of each realization) and run for fifteen thousand TTIs. Then, the average performances are compared, along with their variances, with respect to the average delay experienced by the network, the number of dropped packets, the average power spent and the number of collisions.
In order to provide a frame of reference, the performance of a simple baseline distributed algorithm is also simulated, for comparison.
The baseline and the architectures described previously are then compared in terms of the average network delay, power, dropped packets and collisions during fifteen thousand TTIs. The network delay is evaluated through the holding cost, as the average network delay is proportional to the number of packets held in the devices' buffer.
The following is noted from the simulations. With 40 users, the average holding cost of all four approaches is roughly the same; however, the baseline presents a significantly higher variance than the proposed architectures.
Furthermore, the average network delay is below four, which is the smallest constraint in the network, within at least one standard deviation. With respect to overflown packets, the baseline approach drops on average slightly more packets than the proposed architectures, but again with significantly more variance. With respect to the power consumption, the three proposed architectures spend on average roughly 70% of the power spent by the baseline. Moreover, as mentioned previously, the CLDI algorithm tends to converge faster as it is trained on observations from every device in the network and it has to search for a policy in a notably small policy search space. However, for this same reason, in the best-case scenario it only finds a sub-optimal solution. This is confirmed by the fact that as the simulation advances in time and the IL and DACC algorithm train on more data, they achieve similar levels of performance to CLDI, while using less power. The performance improvement of the proposed architectures, in comparison to the baseline, is even more noticeable when it comes to the number of collisions. The reinforcement learning based solutions experience on average 15% of the baseline's collisions during the same period of time.
When the number of users is increased to 120, the average holding cost of the CLDI architecture converges to 2 packets, while IL and DACC converge to 8 packets and the baseline to 12 packets. From this result, it can be concluded that as the number of users increases, the lack of collaboration between the MTDs in the IL and DACC architectures starts to impact the average network delay, while the CLDI performance stays around the same as for 40 users. Also, the average overflow cost of CLDI still remains around 0, while IL and DACC stabilize around 0.7 and the baseline at 0.19. With regards to the average power costs at convergence, the CLDI architecture spends 16.66% of the power spent by the baseline, while IL and DACC spend 52%. The significant decrease in the power spent by CLDI is explained by the centralized training, which makes more training data available (CLDI has 120 new data points for each TTI while the other architectures have only 1), and it indicates that cooperative behavior arises among the MTDs. This is also reflected in the collision performance, where CLDI experiences around 2.25% of the baseline collisions and IL and DACC experience around 14%.
The IL architecture does not require a central cloud entity to work; therefore it avoids all the overhead data transmission between MTDs and WAPs. However, each MTD has to perform training and inference of its ANN, which can be computationally expensive. Moreover, as each MTD is trained in a fully distributed manner, without sharing any information, there is no chance of cooperation arising. Both the DACC and CLDI architectures require MTDs to transmit information about their local observations that cannot be inferred at the WAP, such as channel state information and the number of packets currently in the buffer. On the other hand, part of the training is done in the cloud for DACC, and all of it for CLDI, which offloads some of the computational burden to the cloud, saving power and requiring less complex MTDs.
As discussed previously, for a smaller user density the IL and DACC architectures slightly outperform CLDI. However, for a higher density of MTDs, the CLDI architecture is able to leverage data from observations collected from all MTDs and, due to the centralized training, the MTDs work together to use the network resources more equitably, resulting in overwhelming power savings, small average delays, and a minimal number of dropped packets and collisions. So, for cellular networks designed to serve a smaller number of MTDs, the IL and DACC architectures may be preferred, depending on whether the devices have enough computational power to train their ANN and on how much overhead is tolerated, while for cellular networks designed to support a massive number of low-cost devices the CLDI architecture may be recommended.
Furthermore, if the MTD is currently violating its delay constraints or if there was a dropped packet in the last TTI, the device ramps up its power. Afterwards, if at least one packet was successfully transmitted on the last TTI, the MTD assumes it is facing a good channel condition and it increases the transmission modulation order, otherwise, it assumes a bad channel and decreases it.
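For reference, the baseline heuristic described above can be sketched as follows; the index-based power ramping and the MCS bounds are assumptions made for the example, and the random transmission probability used for congestion avoidance is omitted.

    def baseline_step(power_idx, mcs, violating_delay, dropped_last_tti,
                      success_last_tti, max_power_idx, max_mcs):
        if violating_delay or dropped_last_tti:
            power_idx = min(power_idx + 1, max_power_idx)   # ramp up the transmit power
        if success_last_tti:
            mcs = min(mcs + 1, max_mcs)   # assume a good channel, raise the rate
        else:
            mcs = max(mcs - 1, 1)         # assume a bad channel, lower the rate
        return power_idx, mcs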
Further details are provided here concerning an ANN that can be used with the techniques described herein. For example, one of the requirements may be that the ANN should be shallow and relatively small to keep a light memory footprint on the devices and to reduce the computational complexity of training and inference. A Gated Recurrent Unit (GRU) connected to a two-layer perceptron may be considered. As the observations of the MTDs are temporally correlated (through the number of packets in the buffer and the channel gains), a GRU unit is included at the input to extract information from sequences of states. GRUs are employed as it has been shown that they have comparable performance to the more commonly used long short-term memory (LSTM) units while being more computationally efficient. In the model presented herein, a GRU with N_B N_S + 4 inputs is considered, where N_B N_S inputs are due to the channel state information, and the remaining four are the number of packets in the buffer (b_t^i), the number of arriving packets (l_t^i), the goodput on the previous TTI (g_{t−1}^i) and the number of overflown packets in the previous TTI (ξ_{t−1}^i). The GRU unit has 32 output values, and both of the linear layers have 32 inputs and 32 outputs. Finally, the actor head has 32 inputs and 2M|𝒫|N_S outputs (one for each possible action), while the critic head has 32 inputs and one output (the critic value).
The networks are trained using an adaptive moment estimation (ADAM) optimizer with a learning rate of 7×10−4. At each ANN update, the weights are trained over 4 PPO epochs with 10 minibatches per epoch. To avoid large gradient updates that make the optimization unstable, the gradients are clipped such that ∥∇J_w∥ ≤ 0.5. A value loss coefficient k_1 = 0.5 and an entropy loss coefficient k_2 = 0.01 are used.
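A sketch of such a network, assuming PyTorch, is given below; the ReLU activations, the batch-first layout and the use of the last GRU output are assumptions not specified in the description above.

    import torch
    import torch.nn as nn

    class ActorCriticGRU(nn.Module):
        """GRU over N_B*N_S + 4 observation features, two 32-unit linear layers,
        an actor head over the discrete actions and a scalar critic head."""
        def __init__(self, n_bs, n_sc, n_actions, hidden=32):
            super().__init__()
            self.gru = nn.GRU(input_size=n_bs * n_sc + 4, hidden_size=hidden,
                              batch_first=True)
            self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
            self.actor = nn.Linear(hidden, n_actions)   # logits, one per possible action
            self.critic = nn.Linear(hidden, 1)          # state-value estimate

        def forward(self, obs_seq, h0=None):
            out, h = self.gru(obs_seq, h0)              # obs_seq: (batch, time, features)
            z = self.mlp(out[:, -1])                    # use the last time step
            return self.actor(z), self.critic(z), h

In this sketch, n_actions would be set to 2M|𝒫|N_S, and the network could be trained with torch.optim.Adam at the learning rate and with the gradient clipping described above.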
The observation may comprise a packet buffer length, denoting a number of packets queued in a buffer, a complex channel between at least one MTD and Base Stations on each of a plurality of subcarriers, and a dynamic power management state of the at least one MTD.
In the method, the action may be selected by an actor and the actor may be a first trained neural network. In the method, the action may incur a reward and the actor learns a policy that maximizes the reward.
The method may further comprise training the first neural network based on previous selected actions, previous observations, previous rewards and a critic value.
In the method, the critic value may be provided by a critic and the critic may be a second trained neural network. The first and second trained neural networks may be the same. The critic value may be a value of a current state that is a function of a current time and of the observed environment at the current time. The state may include the Channel State Information (CSI), arriving packets, overflow, goodput and buffer load.
The method may further comprise a step of training the second neural network based on previous CSI, arriving packets, overflow, goodput and buffer load.
In the method, the actor and the critic may be trained locally in the device.
Alternatively, the actor may be trained in the device and the critic may be trained in a cloud computing environment. The critic trained in the cloud computing environment may be the same for a plurality of devices and each of the devices may listen to a radio channel to get the critic value. The radio channel may be a broadcast channel. The critic value may be broadcasted to the devices.
Alternatively, the actor and the critic may be both trained in a cloud computing environment. The actor and the critic trained in the cloud computing environment may be the same for a plurality of devices and each of the devices may listen to a radio channel to get weights of the actor neural network and to get the critic value. The weights of the actor neural network may be used by the device to update a local actor neural network in the device. The radio channel may be the random-access channel or another radio channel. The weight of the actor neural network and the critic value may be broadcasted to the devices.
Referring to
A virtualization environment (which may go beyond what is illustrated in
A virtualization environment provides hardware comprising processing circuitry 1401 and memory 1403. The memory can contain instructions executable by the processing circuitry whereby functions and steps described herein may be executed to provide any of the relevant features and benefits disclosed herein.
The hardware may also include non-transitory, persistent, machine readable storage media 1405 having stored therein software and/or instruction 1407 executable by processing circuitry to execute functions and steps described herein.
There is provided a device 1500 for selecting a transmit power, a physical resource block (PRB), and a modulation and coding scheme (MCS) for a grant free uplink transmission. The device comprising processing circuits 1501 and a memory 1503, the memory containing instructions 1507 executable by the processing circuits. The device is operative to obtain an observation of the radio environment of the device. The device is operative to select an action, based on the observation, for execution during a next time slot, the action comprising selecting the transmit power, the PRB, and the MCS for the grant free uplink transmission.
The device may be a machine to machine (M2M) type device 1500 or a HW (as illustrated in
The observation may comprise a packet buffer length, denoting a number of packets queued in a buffer, a complex channel between at least one MTD and Base Stations on each of a plurality of subcarriers, and a dynamic power management state of the at least one MTD. The action may be selected by an actor and the actor may be a first trained neural network. The action may incur a reward and the actor may learn a policy that maximizes the reward.
The device may be further operative to train the first neural network based on previous selected actions, previous observations, previous rewards and a critic value. The critic value may be provided by a critic and the critic may be a second trained neural network. The first and second trained neural networks may be the same. The critic value may be a value of a current state that is a function of a current time and of the observed environment at the current time. The state may include the Channel State Information (CSI), arriving packets, overflow, goodput and buffer load.
The device may be further operative to train the second neural network based on previous Channel State Information (CSI), arriving packets, overflow, goodput and buffer load. The actor and the critic may be trained in the device.
Alternatively, the actor may be trained in the device and the critic may be trained in a cloud computing environment. The critic trained in the cloud computing environment may be the same for a plurality of devices and each of the plurality of devices may be operative to listen to a radio channel to get the critic value. The radio channel may be a broadcast channel and the critic value may be broadcasted to the devices.
Alternatively, the actor and the critic may be both trained in a cloud computing environment. The actor and the critic trained in the cloud computing environment may be the same for a plurality of devices and each of the devices may listen to a radio channel to get weights of the actor neural network and to get the critic value. The weights of the actor neural network may be used by each of the plurality of devices to update a local actor neural network in the device. The radio channel may be a random-access channel or another radio channel. The weights of the actor neural network and the critic value may be broadcasted to the devices.
The device is further operative to execute any of the functions described herein.
There is provided a non-transitory computer readable media 1407, 1507 having stored thereon instructions for selecting, for a machine to machine (M2M) type device, a transmit power, a physical resource block (PRB), and a modulation and coding scheme (MCS) for a grant free uplink transmission. The instructions comprise obtaining an observation of the radio environment of the M2M device. The instructions comprise selecting an action, based on the observation, for execution by the M2M device during a next time slot, the action comprising selecting the transmit power, the PRB, and the MCS for the grant free uplink transmission.
The instructions may further comprise instructions to execute any of the functions described herein.
Modifications will come to mind to one skilled in the art having the benefit of the teachings presented in the foregoing description and the associated drawings. Therefore, it is to be understood that modifications, such as specific forms other than those described above, are intended to be included within the scope of this disclosure. The previous description is merely illustrative and should not be considered restrictive in any way. Although specific terms may be employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
This non-provisional patent application claims priority based upon the prior U.S. provisional patent application entitled “METHOD, DEVICE AND APPARATUS FOR OPTIMIZING GRANT FREE UPLINK TRANSMISSION OF MACHINE TO MACHINE (M2M) TYPE DEVICES”, application No. 63/126,284, filed Dec. 16, 2020, in the names of de Carvalho Evangelista et al.