The present invention relates to systems and methods for schedulers in cellular networks.
As the next-generation cellular communication technology, 5G New Radio (NR) aims to cover a wide range of service cases, including broadband human-oriented communications, time-sensitive applications with ultra-low latency, and massive connectivity for the Internet of Things [4]. With its broad range of operating frequencies from sub-GHz to 100 GHz [8], the channel coherence time for NR varies greatly. Compared to LTE, which typically operates on bands lower than 3 GHz [12] with a coherence time over 1 millisecond (ms), NR is likely to operate in a higher frequency range (e.g., 3 to 6 GHz), with a much shorter coherence time (e.g., ˜200 microseconds (μs)). Further, from an application's perspective, 5G NR is expected to support applications with ultra-low latency (e.g., augmented/virtual reality, autonomous vehicles [10]), which call for sub-millisecond time resolution for scheduling.
With such diverse service cases and channel conditions, the air interface design of NR must be much more flexible and scalable than that of LTE [1]. To address such needs, a number of different OFDM numerologies are defined for NR [6], allowing a wide range of frequency and time granularities for data transmission. Instead of a single transmission time interval (TTI) of 1 ms as in LTE, NR allows 4 numerologies (0, 1, 2, 3) for data transmission (with numerology 4 for control signaling) [9], with TTI varying from 1 ms to 125 μs [5]. In particular, numerology 3 allows NR to cope with extremely short channel coherence time and to meet the stringent requirements of extremely low-latency applications, where the scheduling resolution is ˜100 μs.
The new ˜100 μs timing requirement also poses a new challenge to the design of an NR scheduler. To concretize our discussion, we use the most popular proportional-fair (PF) scheduling as an example [19-22]. Within each scheduling time interval, a PF scheduler needs to decide how to allocate frequency-time resource blocks (RBs) to users and determine the modulation and coding scheme (MCS) for each user. The objective of a PF scheduler is to maximize the sum of logarithmic (long-term) average rates of all users. An important constraint is that each user can only use one MCS (from a set of allowed MCSs) across all RBs that are allocated to that user. This problem is found to be NP-hard [20-22] and has been widely studied in the literature. Although some of the existing approaches can offer a scheduling solution on a much larger time scale, none of these PF schedulers can offer a solution close to 100 μs. In [19], Kwan et al. formulated the PF scheduling problem as an integer linear program (ILP) and proposed to solve it using a branch-and-bound technique, which has exponential computational complexity due to its exhaustive search. Some polynomial-time PF schedulers that were designed using efficient heuristics can be found in [20-22]. We will examine the computational complexity and real-time computational time of these schedulers in "The Real-Time Challenge for NR PF Scheduler" section. A common feature of these PF schedulers (designed for LTE) is that they are all of sequential design and need to go through a large number of iterations to determine a solution. Although they may meet the scheduling timing requirement for LTE (1 ms), none of them comes close to meeting the new ˜100 μs timing requirement for 5G NR.
This invention is a novel design of a parallel PF scheduler using an off-the-shelf GPU to achieve ˜100 μs scheduling resolution. We name this new design "GPF", which is the abbreviation of GPU-based PF scheduler. The key ideas of GPF are: (i) to decompose the original PF scheduling problem into a large number of small and independent sub-problems with similar structure, where each sub-problem can be solved within a very small number of iterations; and (ii) to identify and select a subset of promising sub-problems through intensification and fit them into the massively parallel processing cores of a GPU.
In the literature, there have been a number of studies applying GPUs in networking [23-25] and in signal processing for wireless communications [26-28]. The authors of [23] proposed PacketShader, a GPU-based software router that utilizes parallelism in packet processing to boost network throughput. The work in [24] applied GPU to network traffic indexing and is able to achieve an indexing throughput of over one million records per second. In [25], the authors designed a packet classifier that is optimized for the GPU's memory hierarchy and massive number of cores. All these previous works focus on network packet processing, which is fundamentally different from the resource scheduling problem that we consider. The authors of [26] proposed a parallel soft-output MIMO detector for GPU implementation. In [27], the authors designed GPU-based decoders for LDPC codes. The work in [28] addressed the implementation of a fully parallelized LTE Turbo decoder on a GPU. These studies address baseband signal processing, and their proposed approaches cannot be applied to solve a complex scheduling optimization problem like PF.
The objective of the invention is to disclose systems and methods for the first design of a PF scheduler for 5G NR that can meet the 100 μs timing requirement. This design can be used to support 5G NR numerologies 0 to 3, which are to be used for data transmission. This is also the first design of a scheduler (for cellular networks) that exploits a GPU platform. In particular, the invention uses commercial off-the-shelf GPU components and does not require any expensive custom-designed hardware.
Our GPU-based design is based on a successful decomposition of the original optimization problem into a large number of sub-problems through enumerating MCS assignments for all users. We show that for each sub-problem (with a given MCS assignment), the optimal RB allocation problem can be solved exactly and efficiently.
To reduce the number of sub-problems and fit them into the streaming multiprocessors (SMs) in a GPU, we identify the most promising search space among the sub-problems by using an intensification technique. By simple random sampling of sub-problems from the promising subspace, we can find a near-optimal (if not optimal) solution.
We implement our invention, which is a GPU-based proportional-fair scheduler ("GPF scheduler" or "GPF"), on an off-the-shelf Nvidia Quadro P6000 GPU using the CUDA programming model. By optimizing the usage of streaming processors on the given GPU, minimizing memory access time on the GPU based on differences in memory types/locations, and reducing iterative operations by exploiting techniques such as parallel reduction, we are able to reduce the overall scheduling time of GPF to ˜100 μs for a user population size of up to 100 for an NR macro-cell.
We conduct extensive experiments to investigate the performance of our GPF and compare it to three representative PF schedulers (designed for LTE). Experimental results show that our GPF can achieve near-optimal performance (per the PF criterion) in ˜100 μs, while the other schedulers require much more time (ranging from many times to several orders of magnitude more) and none of them can meet the 100 μs timing requirement.
By breaking down the time performance between data movement (CPU to/from GPU) and computation on the GPU, we show that between 50% and 70% (depending on user population size) of the time is spent on data movement while less than half of the time is spent on GPU computation. This suggests that our invention (GPF) can achieve even better performance (e.g., <50 μs) if a customized GPU system (e.g., with enhanced bus interconnection such as NVLink [34], or an integrated host-GPU architecture [35-37]) is used for 5G NR base stations (BSs).
In describing a preferred embodiment of the invention illustrated in the drawings, specific terminology will be resorted to for the sake of clarity. However, the invention is not intended to be limited to the specific terms so selected, and it is to be understood that each specific term includes all technical equivalents that operate in a similar manner to accomplish a similar purpose. Several preferred embodiments of the invention are described for illustrative purposes, it being understood that the invention may be embodied in other forms not specifically shown in the drawings.
Primer on NR Air Interface
To meet diverse operating requirements, NR employs a much more flexible and scalable air interface than LTE [1]. The radio frame structure on an operating carrier of NR is illustrated in the drawings.
At the base station, each scheduling time interval (or scheduling resolution) is called a transmission time interval (TTI), and its duration can vary from several OFDM symbols (a mini-slot or sub-slot), to one slot, to multiple slots. The choice of TTI depends on service and operational requirements [4]. In the frequency domain, the scheduling resolution is one RB, which consists of 12 consecutive SCs grouped together. Within each TTI, the base station needs to decide how to allocate (schedule) all the RBs for the next TTI to different users. Thus the channel coherence time should cover at least two TTIs.
Within a TTI, each RB can be allocated to at most one user, while a user may be allocated multiple RBs. The next question is what modulation and coding scheme (MCS) to use for each user. For 5G NR, 29 MCSs are available (more precisely, 31 MCSs are defined, with 2 of them being reserved, leaving 29 MCSs available) [7], each representing a combination of modulation and coding techniques. For a user allocated with multiple RBs, the BS must use the same MCS across all RBs allocated to this user [7]. Here, one codeword is considered per user. The analysis can be extended to cases where a user has two codewords by configuring the same MCS for both codewords. This requirement also applies in LTE. The motivation behind it is that using different MCSs on different RBs cannot provide a significant performance gain, but would require additional signaling overhead [14]. For each user, the choice of MCS for its allocated RBs depends on channel conditions. A scheduling decision within each TTI thus entails joint RB allocation to users and MCS assignment for the RBs.
A Formulation of the PF Scheduling Problem
Herein, a formulation of the classical PF scheduler under the NR framework is presented. Table 2 describes the notation used for the purposes of the following discussion.
Mathematical Modeling and Formulation
Consider a 5G NR base station (BS) and a set U of users under its service. For scheduling at the BS, we focus on the downlink (DL) direction (data transmissions from the BS to the users) and consider a (worst-case) full-buffer model, i.e., there is always data backlogged at the BS for each user. Denote W as the total DL bandwidth. Under OFDM, radio resource on this channel is organized as a two-dimensional frequency-time resource grid. In the frequency domain, the channel bandwidth is divided into a set B of RBs, each with bandwidth W0=W/|B|. Due to frequency-selective channel fading, the channel condition for a user varies across different RBs. For the same RB, channel conditions from the BS to different users also vary, due to the differences in their geographical locations. In the time domain, we have consecutive TTIs, each with a duration T0. The scheduling decision at the BS must be made within the current TTI (before the start of the next TTI).
Denote xub(t)∈{0, 1} as a binary variable indicating whether or not RB b∈B is allocated to user u∈U in TTI t, i.e., xub(t)=1 if RB b is allocated to user u in TTI t, and xub(t)=0 otherwise.
Since each RB can be allocated to at most one user, we have: Σu∈U xub(t)≤1, (b∈B). (2)
At the BS, there is a set M of MCSs that can be used by the transmitter for each user u∈U at TTI t. When multiple RBs are allocated to the same user, the same MCS, denoted m (m∈M), must be used across all these RBs. Denote yum(t)∈{0, 1} as a binary variable indicating whether or not MCS m∈M is used by the BS for user u∈U in TTI t, i.e., yum(t)=1 if MCS m is used for user u in TTI t, and yum(t)=0 otherwise.
Since only one MCS from M can be used by the BS for all RBs allocated to a user u∈U at TTI t, we have: Σm∈M yum(t)=1, (u∈U). (4)
For user u∈U and RB b∈B, the achievable data-rate for this RB can be determined by
Recall that for user u∈U, the BS must use the same MCS m∈M across all RBs allocated to this user. As an example, suppose there are k RBs (denoted as b1, b2, . . . , bk) allocated to user u. Without loss of generality, suppose qub1≤qub2≤ . . . ≤qubk. Choosing a higher MCS level increases the rate on the RBs that can support it, but yields a rate of zero on any RB whose qub is below the chosen level; thus the best MCS for user u depends on the entire set of RBs allocated to it.
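By way of illustration only, one per-RB rate model consistent with this behavior is sketched below; the symbol ηm (an assumed per-MCS spectral efficiency) and the exact scaling by W0 are illustrative and not limiting:

```latex
r_u^{b,m}(t) =
\begin{cases}
W_0\,\eta_m, & m \le q_u^b(t),\\
0,           & m > q_u^b(t),
\end{cases}
\qquad
R_u(t) \;=\; \sum_{b\in B}\sum_{m\in M} r_u^{b,m}(t)\,x_u^b(t)\,y_u^m(t),
```

where Ru(t) is the total rate scheduled to user u in TTI t (cf. the product-form objective discussed below).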
PF Objective Function
To describe an embodiment of the PF objective function, let {tilde over (R)}u denote the long-term average data-rate of user u (averaged over a sufficiently long time period). A widely used objective function for PF is Σu∈U log2({tilde over (R)}u). A common real-time approach to this objective is to maximize the metric Σu∈U Ru(t)/{tilde over (R)}u(t−1) during TTI(t−1) and use the outcome of the decision variables for scheduling TTI t [17, 18, 20, 21], where Ru(t) is the scheduled rate to user u for TTI t (which can be calculated in (6)) and {tilde over (R)}u(t−1) is user u's exponentially smoothed average data-rate up to TTI(t−1) over a window size of Nc TTIs, updated as {tilde over (R)}u(t−1)=(1−1/Nc)·{tilde over (R)}u(t−2)+(1/Nc)·Ru(t−1).
It has been shown that such a real-time (per-TTI) scheduling algorithm can approach the optimal PF objective value asymptotically when Nc→∞ [17]. Adopting this understanding, a novel PF scheduler is described herein. Substituting the expression for Ru(t) in (6) into the per-TTI metric yields the objective function used in the formulation below.
Problem Formulation
Based on the above, the PF scheduling optimization problem for TTI t can be formulated as:
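By way of illustration only, and up to constant factors, the formulation assembled from the objective and constraints described above can be sketched as follows (the exact presentation in the drawings may differ):

```latex
\text{OPT-PF:}\quad
\max_{x,\,y}\;\sum_{u\in U}\sum_{b\in B}\sum_{m\in M}
\frac{r_u^{b,m}(t)}{\tilde{R}_u(t-1)}\,x_u^b(t)\,y_u^m(t)
\quad\text{s.t.}\quad
\sum_{u\in U} x_u^b(t)\le 1\;\;(b\in B),\quad
\sum_{m\in M} y_u^m(t)=1\;\;(u\in U),\quad
x_u^b(t),\,y_u^m(t)\in\{0,1\}.
```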
In OPT-PF, rub,m(t) is a constant for a given u∈U, b∈B, m∈M and qub(t). Recall that qub(t) is a constant and is determined by the CQI in user u's feedback report at TTI(t−1), which we assume is available by design in an NR cellular network. {tilde over (R)}u(t−1) is also a constant, as it is calculated in TTI(t−1) based on {tilde over (R)}u(t−2) (available at TTI(t−1)) and Ru(t−1) (the outcome of the scheduling decision made in TTI(t−2)). The only variables here are xub(t) and yum(t) (u∈U, b∈B, m∈M), which are binary integer variables. Since we have a product term xub(t)·yum(t) (nonlinear) in the objective function, we can employ the Reformulation-Linearization Technique (RLT) [29] to linearize the problem. To do this, define zub,m(t)=xub(t)·yum(t) (u∈U, b∈B, m∈M). Since both xub(t) and yum(t) are binary variables, zub,m(t) is also a binary variable, and the product term can be replaced using the following two linear constraints (which suffice here because the objective drives each zub,m(t) to its upper bound):
zub,m(t)≤xub(t), (u∈U, b∈B, m∈M), (10)
and
zub,m(t)≤yum(t), (u∈U, b∈B, m∈M). (11)
By replacing xub(t)·yum(t) with zub,m(t) in the objective function and adding constraints (10) and (11), we obtain a reformulated problem, which we refer to as OPT-R.
OPT-R is an ILP since all variables are binary and all constraints are linear. Commercial optimizers such as IBM CPLEX can be employed to obtain an optimal solution to OPT-R (and hence to OPT-PF), which will be used as a performance benchmark for the scheduler design. Note that ILP is NP-hard in general, which is consistent with the fact that our PF scheduling problem is NP-hard [20-22].
The Real-Time Challenge for NR PF Scheduler
Although it is possible to design an algorithm to find a near-optimal solution to OPT-R, it remains an open problem to find a near-optimal solution in real time. By real time, we mean that one needs to find a scheduling solution for TTI t during TTI(t−1). For 5G NR, this is on the order of ˜100 μs per TTI, which is much smaller than a scheduling time interval under 4G LTE. This requirement comes from the fact that the shortest slot duration allowed for data transmission in NR is 125 μs under numerology 3. When numerology 3 is used in scenarios with very short channel coherence time, the real-time requirement for the scheduler is on a TTI level, i.e., ˜100 μs. To the best of our knowledge, we have not seen any scheduling solution in the literature that can claim to solve the PF scheduling problem within a time on the order of ˜100 μs. As such, this is the first scheduler design that breaks this technical barrier for real-time scheduling in a 5G NR network.
To design a ˜100 μs PF scheduler for 5G NR, it is important to first understand why existing LTE schedulers fail to meet such a timing requirement. PF schedulers designed for LTE can be classified into two categories: 1) metric-based schemes (typically implemented in industrial-grade schedulers) that only address RB allocation [15, 16], and 2) polynomial-time approximation algorithms that address both RB allocation and MCS assignment [20-22].
Basically, simple metric-based schedulers such as those surveyed in [15, 16] allocate RBs to users in each TTI by comparing per-user metrics (e.g., the ratio between instantaneous rate and past average rate) on each RB. These schedulers do not address the assignment of MCS. In a BS, an independent adaptive modulation and coding (AMC) module is in charge of assigning MCS for each user [14]. Therefore, metric-based schedulers cannot be used to solve our considered problem OPT-PF. On the other hand, from the perspective of optimization, such a decoupled approach cannot achieve near-optimal performance and will have a loss of spectral efficiency.
In the literature, there have been a number of polynomial-time heuristics designed for LTE PF scheduling. These heuristics are sequential and iterative algorithms that need to go through a large number of iterations. For example, Alg1 and Alg2 proposed in [20] first determine the RB allocation without considering the constraint of a single MCS per user, and then resolve conflicts of multiple MCSs per user by selecting the best MCS for each user given the RB allocation. The computational complexity of Alg1 and Alg2 is O(|U||B||M|). The Unified Scheduling algorithm proposed in [21] selects a user with its associated MCS and adjusts the RB allocation iteratively, until a maximum number of K users are scheduled in a TTI. It has a complexity of O(K|U||B||M|). The greedy algorithm proposed in [22] employs a similar iterative design and can support scheduling over multiple carriers. It does not restrict the number of scheduled users per TTI and thus has a complexity of O(|U|^2|B||M|) for scheduling on a single carrier.
Among the aforementioned schedulers, Alg1 and Alg2 are the fastest since they have the lowest complexity. Consider a practical NR macro-cell setting with 100 users per cell, 100 available RBs, and 29 MCS levels. The number of iterations that Alg1 and Alg2 need to go through is roughly 2.9×10^5. Each iteration involves a number of addition, multiplication and comparison operations. Our implementation of Alg1 on a computer with an Intel Xeon E5-2687W v4 CPU (3.0 GHz) shows that the computation time of Alg1 under the considered network setting exceeds 800 μs. More numerical results for these LTE PF schedulers are provided in the "Performance Validation" section.
For these sequential PF schedulers, employing more CPU cores does not help reduce the time overhead very much. Although an optimized program can benefit from additional cores (utilizing instruction-level parallelism, e.g., pipelining), the resulting reduction in computational time is far from the 10× reduction needed to meet the timing requirement in 5G NR.
A Design of a Real-Time Scheduler
The basic idea in this design is to decompose the original problem (OPT-R) into a large number of mutually independent sub-problems, with a solution to each sub-problem being a feasible solution to the original problem. Then, the optimal solution can be determined by comparing the objectives of all the feasible solutions. In order to implement this idea, the following two questions must be addressed: (1) How to decompose the original problem into a large number of sub-problems that can be executed in parallel; and (2) how to fit the large number of sub-problems into a given GPU platform.
The first question is directly tied to the time complexity of our scheduler. To meet a time requirement of ˜100 μs, each sub-problem must be solved in tens of μs. Therefore, it is important that each sub-problem is small in size and requires only very few (sequential) iterations to find a solution. Also, it is desirable that all sub-problems have the same structure and require the same number of iterations to find their solutions.
The second question addresses the space limitation of a given GPU platform. If a GPU had an infinite number of processors, then we could fit each sub-problem into one or a group of processors and there would be no issue. Unfortunately, any GPU has a limited number of processors. Although this number is large (e.g., 3840 CUDA cores in an Nvidia Quadro P6000 GPU), it is still much smaller than the number of sub-problems that we have. So we have to remove some sub-problems (those that are less likely to produce optimal solutions) so that the remaining sub-problems can fit into the available GPU processing cores. Addressing these two questions leads to the implementation of an embodiment of the invention on a GPU platform.
In our design of GPF, we do not exploit channel correlations in either time or frequency domains. This is to ensure that GPF works under any operating conditions.
Decomposition
There are a number of decomposition techniques for optimization problems, each designed for a specific purpose. For example, in the branch-and-bound method, a tree-based decomposition is used to break a problem into two sub-problems so as to intensify the search in a smaller search space. In the dynamic programming method, decomposition results in sub-problems that still need to be solved recursively. These decompositions cannot be readily parallelized and implemented on a GPU.
Our proposed decomposition aims to produce a large number of independent sub-problems with the same structure. Further, each sub-problem is small and simple enough that the GPU cores can complete their computation in a few tens of μs. In other words, our decomposition is tailored toward the GPU structure (a massive number of cores, a lower clock frequency per core, and a small number of computations per sub-problem). Such a decomposition can be done by fixing a subset of decision variables via enumerating all possibilities. Then for each sub-problem, we only need to determine the optimal solution for the remaining subset of variables.
To see how this can be done for our optimization problem, consider OPT-PF, i.e., the original problem that has two sets of variables xub and yum, u∈U, b∈B, m∈M. To simplify notation, we omit the TTI index t. Recall that the variables xub are for RB allocation (i.e., assigning each RB to a user) while the yum are to determine the MCS for a user (i.e., choosing one MCS from M for each user). So we can decompose either along x or along y. If we decompose along the x-variable, then we will have |U|^|B| sub-problems (since there are |U| ways to assign each RB and we have a total of |B| RBs). On the other hand, if we decompose along y, then we will have |M|^|U| sub-problems (since there are |M| ways to assign an MCS for a user and we have a total of |U| users). Here, we choose to decompose along y, partly due to the fact that the "intensification" technique that we propose to use works naturally with such a sub-problem structure.
For a given y-variable assignment, denote yum=Yum, where Yum is a constant (0 or 1) and satisfies the MCS constraint (4), i.e., Σm∈M Yum=1. Then OPT-PF degenerates into the following sub-problem (under this given y-variable assignment):
In the objective function, for each user u, only one term in the inner summation over m is non-zero, due to the MCS constraint on Yum. Denote the m for this non-zero Yum as m*u. Then the objective function becomes Σu∈U Σb∈B (rub,m*u/{tilde over (R)}u)·xub. By interchanging the two summation orders, we have Σb∈B Σu∈U (rub,m*u/{tilde over (R)}u)·xub, so OPT(Y) reduces to maximizing this quantity subject to the RB allocation constraint (2) and the binary constraints on xub. For a given b∈B, there is only one term in the inner summation Σu∈U (rub,m*u/{tilde over (R)}u)·xub that can be non-zero, due to the RB allocation constraint (2). So this inner summation is maximized when the xub corresponding to the largest rub,m*u/{tilde over (R)}u across all users is set to 1 while all others are set to 0. Physically, this means that the optimal RB allocation (under a given MCS setting) is achieved when each RB is allocated to a user that achieves the largest instantaneous data-rate normalized by its average rate.
We have just shown how to solve each sub-problem involving the x-variable (RB allocation) under a given y-variable (MCS) assignment. If we solve it sequentially, the computational complexity of each sub-problem is |B||U|. Note that the structure of the sub-problem also allows us to perform the optimal RB allocation in parallel across all RBs. In this case, the computational complexity of the sub-problem reduces to the |U| iterations that are used to search for the most suitable user for each RB.
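By way of illustration only, the per-RB selection rule derived above can be summarized by the following sequential reference sketch; the array layouts and names (r, R_avg, mStar, solveSubProblem) are illustrative assumptions and not limiting:

```cuda
// Sequential reference sketch: solve one sub-problem OPT(Y), i.e., find the
// optimal RB allocation for a fixed MCS assignment mStar[u] per user.
// r[(u * B + b) * M + m] : achievable rate of user u on RB b with MCS m
// R_avg[u]               : exponentially smoothed average rate of user u
// alloc[b]               : output, index of the user to which RB b is allocated
void solveSubProblem(const float* r, const float* R_avg, const int* mStar,
                     int U, int B, int M, int* alloc, float* objective) {
    float obj = 0.0f;
    for (int b = 0; b < B; ++b) {            // one pass per RB
        int bestUser = 0;
        float bestMetric = -1.0f;
        for (int u = 0; u < U; ++u) {        // |U| comparisons per RB
            float metric = r[(u * B + b) * M + mStar[u]] / R_avg[u];
            if (metric > bestMetric) { bestMetric = metric; bestUser = u; }
        }
        alloc[b] = bestUser;                 // RB b goes to the user with the
        obj += bestMetric;                   // largest normalized rate
    }
    *objective = obj;
}
```

On a GPU, the outer loop over RBs is parallelized across |B| threads, leaving only the |U|-way comparison of the inner loop as sequential work per RB, as noted above.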
Selection of Sub-Problems
After problem decomposition by enumerating all possible settings of the y-variable, we have a total of |M|^|U| sub-problems. This is far too many to fit into a GPU and solve in parallel. In this second step, we identify a set of K sub-problems that are most promising in terms of containing optimal (or near-optimal) solutions, and only search for the best solution among these K sub-problems. Our selection of the set of K sub-problems is based on the intensification and diversification techniques from optimization (see, e.g., [30]). The basic idea is to break up the search space into promising and less promising subspaces and devote search efforts mostly to the most promising subspace (intensification). Even though there is a small probability that the optimal solution may still lie in the less promising subspace, we can still be assured of getting a high-quality near-optimal solution in the most promising subspace. So the first question to address is: what is the most promising search subspace (among all possible y-variable settings) for the optimal solution?
Recall that each user has |M| levels of MCS to choose from, with a higher level of MCS offering a higher achievable data rate but also requiring a better channel condition. Recall that for each b∈B, qub is the maximum MCS level that can be supported by user u's channel on RB b. Since qub differs for different b∈B, denote qumax=maxb∈B qub as the highest MCS level that user u's channel can support among all RBs. Then for user u, it is safe to remove all MCS assignments with m>qumax (since such MCS assignments yield a rate of 0 on every RB b∈B) and we will not lose the optimal solution.
Among the remaining MCS settings for user u, i.e., {1, 2, . . . , qumax}, it appears that the search space for user u with MCS settings close to qumax is most promising. To validate this idea, we conduct a numerical experiment using CPLEX solver to solve OPT-R (not in real time) and examine the probability of success in finding the optimal solution as a function of the number of MCS levels near qumax (inclusive) for each user u∈U. Specifically, denote:
Qud={m|max(1, qumax−d+1)≤m≤qumax}⊂M (12)
as the set of d MCS levels near qumax (inclusive), where d∈N* denotes the number of descending MCS levels from qumax. For example, when d=1, we have Qu1={m|m=qumax} for user u, meaning that user u will only choose its highest allowed MCS level qumax; when d=2, we have Qu2={m|qumax−1≤m≤qumax} for user u, meaning that user u's MCS can be either qumax−1 or qumax. Across all |U| users, we define:
Qd=Q1d×Q2d× . . . ×Q|U|d⊂M^|U| (13)
as the Cartesian product of the sets Q1d, Q2d, . . . , Q|U|d. Clearly, Qd contains the MCS assignment vectors for all users in which the MCS assigned to each user u is within its corresponding set Qud.
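For intuition about the size of this subspace, a direct count (assuming every user's qumax is at least d, so that |Qud|=d) gives:

```latex
|Q^d| \;=\; \prod_{u\in U} |Q_u^d| \;\le\; d^{\,|U|},
\qquad\text{e.g., } |Q^2| = 2^{100}\ \text{for } d=2 \text{ and } |U|=100,
```

which is consistent with the sub-problem count discussed below and explains why Qd cannot be enumerated exhaustively on a real-world GPU.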
In our experiment, we consider a BS with 100 RBs and a user population size of 25, 50, 75, or 100. A set of 29 MCSs is considered (see the drawings).
Now we turn the tables and consider the probability of success in finding the optimal solution for a given d; the results are shown in the drawings.
For a given target success probability, the optimal d depends not only on |U| but also on the users' channel conditions. For instance, when there are frequency correlations among RBs, i.e., the coherence bandwidth is greater than an RB, the optimal d may change. Thus, in a practical NR cell, the optimal d under each possible |U| should be adapted online to keep up with changes in channel conditions. Specifically, the BS periodically computes the optimal solution to OPT-PF under the current |U| based on users' CQI reports, and records the smallest d that contains the optimal solution associated with the given |U|. Such computations need only be done for selected TTIs and there is no strict real-time requirement. Optimal values of d under different |U|'s are re-calculated periodically based on the recorded results through the statistical approach described above, and are maintained in a lookup table stored in the BS's memory. During run-time, the BS sets d adaptively based on the number of active users in the cell by simply looking up the table.
For any subspace Qd with d>1, the huge number of sub-problems it contains (e.g., for Q2 with 100 users, we have 2^100 sub-problems) prohibits us from enumerating all possibilities using a real-world GPU. We need to select K sub-problems from the promising subspace through intensification. Our strategy is to use random sampling based on a certain distribution. The choice of probability distribution for sampling is open to special design. In this work, we employ a uniform distribution as an example. Specifically, after determining the promising subspace Qd, for each of the K sub-problems that we consider, we choose the MCS for each user u from Qud randomly following a uniform distribution. This is equivalent to sampling from Qd with a uniform distribution. Note that this sampling can be executed in parallel on a GPU across all K sub-problems and users (see the "Implementation" section below). This finalizes our selection of sub-problems.
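By way of illustration only, such parallel sampling could be realized with a CUDA kernel of the following form; the kernel and buffer names (sampleSubProblems, qmax, mcs), the use of cuRAND, and the one-thread-per-(sub-problem, user) mapping are illustrative assumptions and not limiting:

```cuda
#include <curand_kernel.h>

// One thread per (sub-problem, user) pair: pick user u's MCS uniformly at
// random from Q_u^{d*} = {max(1, qmax[u]-dStar+1), ..., qmax[u]}.
// mcs[k * U + u] stores the MCS level chosen for user u in sub-problem k.
__global__ void sampleSubProblems(const int* qmax, int U, int dStar,
                                  unsigned long long seed, int* mcs) {
    int k = blockIdx.x;                              // sub-problem (sample) index
    int u = threadIdx.x;                             // user index
    if (u >= U) return;

    curandState state;
    curand_init(seed, (unsigned long long)(k * U + u), 0, &state);

    int lo = max(1, qmax[u] - dStar + 1);            // lowest level in Q_u^{d*}
    int span = qmax[u] - lo + 1;                     // number of candidate levels
    int offset = (int)(curand_uniform(&state) * span);
    if (offset == span) offset = span - 1;           // guard the (0,1] edge case
    mcs[k * U + u] = lo + offset;
}
```

Launched, e.g., as sampleSubProblems<<<K, U>>>(d_qmax, U, dStar, seed, d_mcs), this draws all K·|U| MCS choices in parallel, corresponding to Step 1 of the solution process described in the "Implementation" section below.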
Near-Optimality of Sub-Problem Solutions
Through the above search intensification, we may not always be able to obtain the optimal solution to OPT-PF by solving the K sampled sub-problems. However, as we will show next, the K sub-problem solutions (samples) would almost surely contain at least one near-optimal solution to OPT-PF (e.g., within 95% of optimum).
The science behind this is as follows. Denote the gap (in percentage) of a sample from the optimum by α. For a given bound ε∈[0%, 100%] on the optimality gap, denote p1−ε as the probability that a sample is (1−ε)-optimal, i.e., that the sample achieves at least (1−ε) of the optimal objective value. We have p1−ε=P(α≤ε). The probability p1−ε is the same for all K samples since they are drawn from the same search subspace following a common uniform distribution. Denote PK,1−ε as the probability that at least one sample (among the K samples) is (1−ε)-optimal. Since all samples are mutually independent, we have:
PK,1−ε=1−(1−p1−ε)^K.
Therefore, to ensure that PK,1−ε≥99.99%, i.e., to have more than 99.99% probability of achieving a (1−ε)-optimal solution among the K samples, we should have p1−ε≥1−(1−0.9999)^(1/K), which depends on the value of K, i.e., the number of sub-problems that can be handled by the available GPU cores. The Nvidia Quadro P6000 GPU we employed in the implementation can solve K=300 sub-problems under a realistic setting of 100 RBs and 25˜100 users. Therefore, we should have p1−ε≥3.02% to ensure PK,1−ε≥99.99%.
We now investigate the probability p1−ε through experiments. The environment setting is: |B|=100, |U|∈{25, 50, 75, 100}, and |M|=29. We consider the scenario without frequency correlation. The parameter d is set to 6, 3, 3, and 2 for |U|=25, 50, 75, and 100, respectively. We run experiments for 100 TTIs with Nc=100. For each TTI, we generate 100 samples from Qd under each |U|, and record the gaps (α's) of their objective values from the optimum. Thus for each |U|, we have 10000 samples and their corresponding α's. Cumulative distribution functions (CDFs) of α under different |U|'s are shown in the drawings.
When the sampling is parallelized, identical samples may occur, but it is easy to verify that the probability of this is very small, since each sample consists of |U| MCS assignments. In fact, even if there are identical samples, the near-optimal performance is not materially affected, because we have a large number (hundreds) of samples available.
Implementation
Why Choose GPU for Implementation
From the perspective of implementing 5G NR scheduling, there are a number of advantages of a GPU over an FPGA or ASIC. First, in terms of hardware, a GPU is much more flexible. By design, a GPU is a general-purpose computing platform optimized for large-scale parallel computation. It can implement different scheduling algorithms without hardware changes. In contrast, an FPGA is not optimized for massive parallel computation, while an ASIC is made for a specific algorithm and cannot be changed or updated after the hardware is made. Second, in terms of software, a GPU (e.g., from Nvidia) comes with highly programmable tools such as CUDA, which are capable of programming the behavior of each GPU core. On the other hand, it is much more complicated to program the same set of functions on an FPGA. Finally, in terms of cost and design cycle, the GPU platform that we use is off-the-shelf, readily available, and low cost (for a BS). The cost of making an ASIC could be orders of magnitude higher than that of an off-the-shelf GPU, and it would take a considerable amount of time to develop an ASIC.
Next, we show how the proposed scheduler is implemented on an off-the-shelf GPU to meet the design target of getting near-optimal scheduling solution in ˜100 μs.
Fitting Sub-Problems into a GPU
We use an off-the-shelf Nvidia Quadro P6000 GPU [31] and the CUDA programming platform [32]. This GPU consists of 30 streaming multiprocessors (SMs). Each SM consists of 128 small processing cores (CUDA cores). These cores are capable of performing concurrent computation tasks involving arithmetic and logic operations. Under CUDA, the K sub-problems considered by the scheduler per TTI are handled by a grid of thread blocks. An illustration of this implementation is given in the drawings.
Thus, the total number of sub-problems that we can fit into an Nvidia Quadro P6000 GPU for parallel computation is K=30·I. For example, for |B|=100 RBs and |U|=100 users, the GPU can solve K=300 sub-problems in parallel.
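By way of illustration only, this sizing can be sketched as follows, assuming (as suggested by the per-block thread budgets in Steps 1 and 2 below) that I is limited by the 1024-threads-per-block limit, i.e., I=⌊1024/max(|U|, |B|)⌋; the exact definition of I in equation (14) may differ:

```cuda
// Host-side sizing sketch (illustrative; the formula for I is an assumption).
constexpr int kNumSMs          = 30;    // SMs on the Nvidia Quadro P6000
constexpr int kThreadsPerBlock = 1024;  // CUDA limit on threads per block

int subProblemsPerBlock(int U, int B) {
    int maxDim = (U > B) ? U : B;       // both I*|U| and I*|B| must fit in a block
    return kThreadsPerBlock / maxDim;   // I
}

int totalSubProblems(int U, int B) {
    return kNumSMs * subProblemsPerBlock(U, B);   // K = 30 * I
}
// Example: U = B = 100 gives I = 10 and K = 300, matching the text.
```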
Solution Process
To find an optimal (or near-optimal) solution on a GPU, we need to spend time for three tasks: (i) transfer the input data from Host (CPU) memory to GPU's global memory; (ii) generate and solve K=30·I sub-problems with 30 thread blocks (one thread block per SM); and (iii) transfer the final solution back to the Host (CPU) memory. In the rest of this section, we give details for each task.
Transferring Input Data to GPU
Based on the above discussion, we only transfer the input data associated with the promising search space Qd*, where d* depends on the user population |U|. For each user u, only the d* MCS levels in Qud* will be considered in the search space. Note that even though, with a probability of up to 10%, we may miss the optimal solution in Qd*, we can still find extremely good near-optimal solutions in Qd*. The input data that we need to transfer from the Host (CPU) memory to the GPU's global memory include the rub,m's (for m∈Qud*, u∈U, b∈B) and the {tilde over (R)}u's (for u∈U). For example, with 100 users and 100 RBs, we have d*=2. Then the size of the transferred data is equal to 80 KB for the rub,m's plus 0.4 KB for the {tilde over (R)}u's (with float data-type).
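This transfer maps to standard CUDA memory copies; by way of illustration only (buffer names are assumptions), under the data sizes given above:

```cuda
#include <cuda_runtime.h>

// Host-to-device transfer of the scheduler inputs (illustrative sketch).
// h_r    : rates r_u^{b,m} restricted to Q_u^{d*}, |U| * |B| * dStar floats
// h_Ravg : smoothed average rates, |U| floats
void copyInputsToGpu(const float* h_r, const float* h_Ravg,
                     float* d_r, float* d_Ravg,
                     int U, int B, int dStar) {
    size_t rBytes    = (size_t)U * B * dStar * sizeof(float);  // 80 KB for 100 x 100 x 2
    size_t RavgBytes = (size_t)U * sizeof(float);              // 0.4 KB for 100 users
    cudaMemcpy(d_r,    h_r,    rBytes,    cudaMemcpyHostToDevice);
    cudaMemcpy(d_Ravg, h_Ravg, RavgBytes, cudaMemcpyHostToDevice);
}
```

The same API, used in the cudaMemcpyDeviceToHost direction, returns the selected solution to the Host memory after Step 5 below.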
Generating and Solving K Sub-Problems
Within each SM, K/30 sub-problems are to be generated and solved with one thread block. Then the best solution among the K/30 sub-problems is selected and sent to the global memory. This is followed by a round of selection of the best solution from the 30 SMs (with a new thread block).
Step 1 (Generating Sub-Problems) Each of the 30 thread blocks needs to first generate I sub-problems, where I is defined in equation (14). For each sub-problem, an MCS level for each user u is randomly and uniformly chosen from the set Qud*. Doing this in parallel requires |U| threads for each sub-problem. Thus, to parallelize this step for all I sub-problems, we need to use I·|U|≤1024 threads. Threads should be synchronized after this step to ensure that all sub-problems are successfully generated before the next step.
Step 2 (Solving Sub-Problems): For each of the I sub-problems (i.e., for a given y-variable), the optimal RB allocation (xub's) can be determined by solving OPT(Y). For each sub-problem, the allocation of each RB b∈B to a user is done in parallel with |B| threads. With I sub-problems per block, we need I·|B|≤1024 threads to parallelize this step. Each thread needs the input data for all users for comparison. Due to the small size of shared memory in an SM (only 96 KB per SM for the Nvidia Quadro P6000 GPU), we cannot store the input data for all |U| users in an SM's shared memory (a part of the shared memory is reserved for other intermediate data). On the other hand, if we let each thread read out data for each user separately from the GPU's global memory, it will result in |U| separate accesses to the global memory. Recall that access to the global memory in a GPU is much slower than access to the shared memory in an SM. To address this problem, we put the |U| users into several sub-groups such that the input data for each sub-group of users can be read out from the global memory in one access and fit into an SM's shared memory. This results in a major reduction in the number of global memory accesses required in this step. Once we have the input data for a sub-group of users in the shared memory, we let the thread find the most suitable user for its RB within this sub-group. By performing these operations for each sub-group of users, a thread finds the optimal allocation for its RB in the sub-problem. A synchronization of all threads in a block is necessary after this step.
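By way of illustration only, Step 2 might be realized as a kernel of the following form, with users processed in shared-memory sub-groups as described above; the array layouts, the tile size TILE, and all names are illustrative assumptions and not limiting:

```cuda
// One thread block per SM; each block handles I sub-problems (Step 2 sketch).
// Thread t handles sub-problem k = t / B and RB b = t % B.
// r[(u * B + b) * D + j] : rate of user u on RB b at the j-th level of Q_u^{d*}
// Ravg[u]                : smoothed average rate of user u
// mcsIdx[k * U + u]      : index (0..D-1) into Q_u^{d*} sampled in Step 1
// alloc[k * B + b]       : output, user allocated to RB b in sub-problem k
#define TILE 16            // users per shared-memory sub-group (illustrative)

__global__ void allocateRBs(const float* r, const float* Ravg, const int* mcsIdx,
                            int U, int B, int D, int I, int* alloc) {
    extern __shared__ float smem[];       // TILE*B*D rates followed by TILE averages
    float* sR   = smem;
    float* sAvg = smem + TILE * B * D;

    int t = threadIdx.x;
    int k = t / B;                        // sub-problem handled by this thread
    int b = t % B;                        // RB handled by this thread

    float best = -1.0f;
    int bestUser = -1;
    for (int u0 = 0; u0 < U; u0 += TILE) {          // loop over user sub-groups
        // Cooperative load of this sub-group's input data into shared memory.
        for (int i = t; i < TILE * B * D && u0 * B * D + i < U * B * D; i += blockDim.x)
            sR[i] = r[u0 * B * D + i];
        for (int i = t; i < TILE && u0 + i < U; i += blockDim.x)
            sAvg[i] = Ravg[u0 + i];
        __syncthreads();

        if (k < I) {
            for (int j = 0; j < TILE && u0 + j < U; ++j) {
                int u = u0 + j;
                float metric = sR[(j * B + b) * D + mcsIdx[k * U + u]] / sAvg[j];
                if (metric > best) { best = metric; bestUser = u; }
            }
        }
        __syncthreads();                  // the shared tile is reused next iteration
    }
    if (k < I) alloc[k * B + b] = bestUser;
}
// Launch sketch: allocateRBs<<<30, I * B, (TILE * B * D + TILE) * sizeof(float)>>>(...);
```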
Step 3 (Calculation of Objective Values): Given the optimal RB allocation for the sub-problem in Step 2, we need to calculate the objective value under the current solution to the sub-problem. The calculation of objective value involves summation of |B| terms. To reduce the number of iterations in completing this summation, we employ a parallel reduction technique.
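For illustration, the standard shared-memory reduction pattern referred to here can be sketched as follows (a generic pattern, not the exact implementation; n is assumed to be padded to a power of two with zero entries):

```cuda
// Sums n partial terms held in shared memory in O(log n) steps instead of n
// sequential additions. Replacing "+" with a max/argmax comparison yields the
// reductions used in Steps 4 and 5 below.
__device__ float blockReduceSum(float* sdata, int tid, int n) {
    for (int stride = n / 2; stride > 0; stride >>= 1) {
        if (tid < stride) sdata[tid] += sdata[tid + stride];
        __syncthreads();                  // all threads must reach this barrier
    }
    return sdata[0];                      // the total, valid after the last step
}
```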
Step 4 (Finding the Best Solution in a Thread Block): At the end of Step 3, we have I objective values in a SM corresponding to I sub-problems. In this step, we need to find the best solution (with the highest objective value) among the solutions to the I sub-problems. This is done through comparison, which again can be realized by parallel reduction. We need I/2 threads to parallelize this comparison. After synchronizing the I/2 threads, we write the best solution along with its objective value to the GPU's global memory.
Step 5 (Finding the Best Solution Across All Blocks): After Steps 1 to 4 are completed by the 30 thread blocks (SMs), we have 30 solutions (and their objective values) stored in the global memory, each corresponding to the best solution from its respective thread block. Then we create a new thread block (with 15 threads) to find the “ultimate” best from these 30 “intermediate” best solutions. Again, this step can be done through parallel reduction.
Transferring Output Solution to Host
After we find the best solution in Step 5, we transfer this solution from the GPU back to the Host (CPU)'s memory.
Performance Validation
Experiment Platform
Our experiment was done on a Dell desktop computer with an Intel Xeon E5-2687W v4 CPU (3.0 GHz) and an Nvidia Quadro P6000 GPU. Data communication between the CPU and GPU goes through a PCIe 3.0 X16 slot with the default configuration. The implementation on the GPU is based on the Nvidia CUDA (version 9.1) platform. For performance comparison, the IBM CPLEX Optimizer (version 12.7.1) is employed to find an optimal solution to OPT-R.
Settings
We consider an NR macro-cell with a BS and a number of users. The user population size |U| is chosen from {25, 50, 75, 100}. The number of available RBs is |B|=100. Assume that the set of |M|=29 MCSs shown in the drawings is available.
Performance
In addition to the optimal solution obtained by CPLEX, we also incorporate the algorithm Alg1 proposed in [20], the Unified algorithm proposed in [21], and the Greedy algorithm proposed in [22] for performance comparison. We set the maximum number of scheduled users per TTI to 20 for the Unified algorithm in all cases.
First, it is necessary to verify that the GPF scheduler can meet the requirement of ˜100 μs for scheduling time overhead, which is the major purpose of this invention. We consider the worst-case scenario where there is no frequency correlation, i.e., the qub(t)'s change independently across RBs. Based on the above results, the parameter d* for controlling the sampling subspace Qd* is 6, 3, 3 and 2 for |U|=25, 50, 75 and 100, respectively. Results of the scheduling time for 100 TTIs are shown in the drawings.
It can be seen that the time spent computing a scheduling solution on the GPU is much shorter than 100 μs, with very small deviation. This meets our target of designing a PF scheduler that has low complexity and extremely short computational time. On the other hand, the most significant time overhead is introduced by the data transfer between the GPU and CPU. Such data transfer operations take more than 60% of the total scheduling time overhead. Thus we conclude that the bottleneck of GPF lies in the communication between the GPU and CPU. Hardware-level tuning to optimize the GPU-CPU communication bus is beyond the scope of this invention, but this does suggest that the data transfer overhead can be reduced by a customized CPU-GPU system with an optimized bus for real-world NR BSs.
Next we verify the near-optimal performance of GPF. We consider two important performance metrics: the PF criterion Σu∈U log2({tilde over (R)}u(t)) (the ultimate objective of a PF scheduler) and the sum average cell throughput Σu∈U {tilde over (R)}u(t) (representing the spectral efficiency). The PF and sum throughput performance for 100 TTIs is shown in the drawings.
We have also run experiments for scenarios with frequency correlation, where qub(t)'s are the same within a group of consecutive RBs and change randomly across groups. Results with coherence bandwidth equal to 2 and 5 RBs indicate that optimal d's change with frequency correlations. Specifically, when coherence bandwidth covers 2 RBs, optimal d's for |U|=25, 50, 75 and 100 are 5, 3, 3 and 2, respectively; when coherence bandwidth covers 5 RBs, optimal d's are 4, 3, 3 and 2, respectively. With adjusted settings of d, GPF achieves similar real-time and near-optimal performance as in the case without frequency correlation.
On that basis it can be concluded that GPF is able to achieve near-optimal performance and meet NR's requirement of ˜100 μs for scheduling time overhead.
Why LTE Scheduler Cannot be Reused for 5G NR
In LTE, the time resolution for scheduling is 1 ms since the duration of a TTI is fixed at 1 ms. This means that an LTE scheduler updates its solution every 1 ms. To investigate the efficiency of reusing an LTE scheduler in 5G NR, we conduct an experiment with the following setting. Assume that the channel coherence time covers two slot durations under numerology 3, i.e., 250 μs (likely to occur at a high frequency band). We compare two scheduling schemes: Scheme 1: Update the scheduling solution every 8 slots (since 1 ms/125 μs=8) by using an LTE scheduler; Scheme 2: In each slot, use GPF to compute the solution. If the time spent is shorter than a slot duration (<125 μs), update the solution; otherwise, reuse the previous solution. We adopt the Alg1 algorithm as the LTE scheduler since it is able to find a solution within 1 ms and is the fastest among the state-of-the-art PF schedulers. Results of the two schemes for 100 TTIs under |U|=25 and 100 are shown in the drawings.
The foregoing description and drawings should be considered as illustrative only of the principles of the invention. The invention is not intended to be limited by the preferred embodiment and may be implemented in a variety of ways that will be clear to one of ordinary skill in the art. Numerous applications of the invention will readily occur to those skilled in the art. Therefore, it is not desired to limit the invention to the specific examples disclosed or the exact construction and operation shown and described. Rather, all suitable modifications and equivalents may be resorted to, falling within the scope of the invention. All references cited are incorporated herein in their entirety.
This application is a continuation of International Application No PCT/US18/42730, filed Jul. 18, 2018, which claims benefit of U.S. Provisional Application No. 62/537,733, filed Jul. 27, 2017, which is incorporated herein in its entirety.
This invention was made with government support under Grant Nos. CNS-1343222 and CNS-1642873 awarded by the National Science Foundation. The government has certain rights in the invention.