Disclosed are embodiments related to online convex optimization.
Many machine learning, signal processing, and resource allocation problems can be cast into a dynamic optimization problem with time-varying convex cost functions. Online convex optimization (OCO) provides the tools to handle dynamic problems in the presence of uncertainty, where an online decision strategy evolves based on the historical information [1], [2] (bracketed numbers refer to references at the end of this disclosure). OCO can be seen as a discrete-time sequential decision-making process by an agent in a system. At the beginning of each time slot, the agent makes a decision from a convex feasible set. The system reveals information about the current convex cost function to the agent only at the end of each time slot. The lack of in-time information prevents the agent from making an optimal decision at each time slot. Instead, the agent resorts to minimizing the regret, which is the performance gap between the online decision sequence and some benchmark solution. A desired online decision sequence should be asymptotically no worse than the performance benchmark, i.e., it should achieve regret that grows at most sublinearly over time.
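As a concrete illustration of this decision protocol (not part of the disclosed embodiments), the following minimal Python sketch plays a decision from a ball-shaped feasible set, observes the cost only after committing, and accumulates the gap to a per-slot benchmark; the quadratic costs and the feasible-set radius are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, radius = 200, 4, 1.0

def project(x):
    # Euclidean projection onto the ball {x : ||x||_2 <= radius}
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

def cost(x, c):
    # illustrative time-varying cost f_t(x) = 0.5 * ||x - c_t||^2
    return 0.5 * float(np.sum((x - c) ** 2))

targets = [rng.normal(size=n) for _ in range(T)]  # hidden data driving f_t
x = np.zeros(n)
online, benchmark = 0.0, 0.0
for t in range(T):
    c = targets[t]
    online += cost(x, c)              # f_t is revealed only after x_t is played
    benchmark += cost(project(c), c)  # per-slot (dynamic) benchmark decision
    x = project(x - 0.5 * (x - c))    # projected gradient step for the next slot

print("dynamic regret:", online - benchmark)
```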
Most of the early works on OCO studied the static regret, which compares the online decision sequence with a static offline benchmark [3], [4], [5], [6]. However, the optimum of dynamic problems is often time varying. As a rather coarse performance metric, achieving sublinear static regret may not be meaningful, since the static offline benchmark itself may perform poorly. A more attractive dynamic regret was first proposed in [3], where the offline benchmark solution can be time varying. It is well known that in the worst case, it is impossible to obtain sublinear dynamic regret, since drastic variations of the underlying systems can make the online problem intractable. Therefore, dynamic regret bounds are often expressed w.r.t. the accumulated system variations that reflect the hardness of the problem. Theoretical guarantees on the dynamic regret for OCO with general cost functions were studied in [3], [7], and [8], while the case of strongly convex cost functions was studied in [9], [10], [11], and [12].
The above OCO frameworks do not consider the network heterogeneity on information timeliness and computation capacity in many practical applications. For example, in the multiple transmission/reception point (TRP) cooperative network with non-ideal backhaul links for 5G New Radio (NR) [13], each TRP has a priori local channel state information (CSI) but less computation capacity compared with a central controller (CC). In mobile edge computing [14], the remote processors have timely information about the computing tasks but may offload some tasks to the edge server due to the limitation on local computation resources [15]. Another example is self-driving vehicular networks, where each vehicle moves based on its real-time sensing while reporting local observations to a control center for traffic routing or utility maximization. In these applications, data are distributed over the network edge and vary over time. Furthermore, the network edge needs to make real-time local decisions to minimize the global costs. However, due to the coupling of data and variables, the global cost function may be non-separable, i.e., it may not be expressed as a summation of local cost functions at the network edge.
Algorithms for non-separable global cost minimization problems, such as block coordinate descent [16] and the alternating direction method of multipliers [17], [18], are centralized in nature, as they implicitly assume there is a central node that coordinates the iterative communication and computation processes. However, with distributed data at the network edge, centralized solutions suffer from high communication overhead and performance degradation due to communication delay. Furthermore, existing distributed online optimization frameworks such as parallel stochastic gradient descent [19], federated learning [20], and distributed OCO [21] are confined to separable global cost functions. Specifically, each local cost function depends only on the local data, which allows each node to locally compute the gradient without information about the data at all the other nodes. Therefore, these distributed online frameworks cannot be directly applied to non-separable global cost minimization problems, such as the multi-TRP cooperative precoding design problem considered in this invention, where downlink transmissions at the TRPs are coupled by broadcasting channels.
It is therefore challenging to develop an online learning framework that takes full advantage of the network heterogeneity on information timeliness and computation capacity, while allowing the global cost functions to be non-separable. In this work, we propose a new Hierarchical Online Convex Optimization (HiOCO) framework for dynamic problems over a heterogeneous master-worker network with communication delay. The local data may not be independent and identically distributed (i.i.d.) and the global cost function may not be separable. We consider network heterogeneity, such that the worker nodes have more timely information about the local data but possibly less computation resources compared with the master node. As disclosed here, HiOCO is a framework that takes full advantage of both the timely local and delayed global information, while allowing gradient descent at both the network edge and control center for improved system performance. Our incorporation of non-separable global cost functions over a master-worker network markedly broadens the scope of OCO.
According to a first aspect, a method for performing online convex optimization is provided. The method includes receiving, from two or more worker nodes, a local decision vector and local data corresponding to each of the two or more worker nodes. The method includes performing a multi-step gradient descent based on the local decision vector and the local data received from the two or more worker nodes, wherein performing the multi-step gradient descent comprises determining a global decision vector and corresponding global information. The method includes sending, to each of the two or more worker nodes, the global decision vector and corresponding global information.
According to a second aspect, a method for performing online convex optimization is provided. The method includes receiving, from a master node, a global decision vector and corresponding global information, wherein the global information has a time delay associated with it. The method includes performing a multi-step gradient descent based on the global decision vector and local data, wherein performing the multi-step gradient descent comprises determining a local decision vector. The method includes sending, to the master node, the local decision vector and local data.
According to a third aspect, a master node for performing online convex optimization is provided. The master node includes processing circuitry and a memory containing instructions executable by the processing circuitry. The processing circuitry is operable to receive, from two or more worker nodes, a local decision vector and local data corresponding to each of the two or more worker nodes. The processing circuitry is operable to perform a multi-step gradient descent based on the local decision vector and the local data received from the two or more worker nodes, wherein performing the multi-step gradient descent comprises determining a global decision vector and corresponding global information. The processing circuitry is operable to send, to each of the two or more worker nodes, the global decision vector and corresponding global information.
According to a fourth aspect, a worker node for performing online convex optimization is provided, the worker node comprising processing circuitry and a memory containing instructions executable by the processing circuitry. The processing circuitry is operable to receive, from a master node, a global decision vector and corresponding global information, wherein the global information has a time delay associated with it. The processing circuitry is operable to perform a multi-step gradient descent based on the global decision vector and local data, wherein performing the multi-step gradient descent comprises determining a local decision vector. The processing circuitry is operable to send, to the master node, the local decision vector and local data.
According to a fifth aspect, a computer program is provided comprising instructions which when executed by processing circuitry of a node causes the node to perform the method of any embodiment of the first and second aspects.
According to a sixth aspect, a carrier containing the computer program of the fifth aspect is provided, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.
The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments.
In the seminal work on OCO [3], an online projected gradient descent algorithm achieved O(√T) static regret with bounded feasible set and gradient, where T is the total time horizon. The static regret was shown to be unavoidably Ω(√T) for general convex cost functions without additional assumptions and was further improved to O(log T) for strongly convex cost functions [4]. Moreover, [5] provided O(√(τT)) static regret in the presence of τ-slot delay, and [6] studied OCO with adversarial delays. First introduced in [3], the dynamic regret of OCO has received a recent surge of interest [7], [8]. Strong convexity was shown to improve the dynamic regret bound in [9]. By increasing the number of gradient descent steps, the dynamic regret bound was further improved in [10]. The standard and proximal online gradient descent algorithms were respectively extended to accommodate inexact gradients in [11] and [12]. Below, we compare the settings and dynamic regret bounds of these works in more detail.
The above OCO algorithms are centralized. Distributed online optimization of a sum of local convex cost functions was studied in [21], [22], [23], [24], [25], [26], [27]. Early works on distributed OCO focused on static regret [21], [22], [23], [24] while more recent works studied dynamic regret [25], [26], [27]. However, existing distributed OCO works are over fully distributed networks with separable global cost functions.
Online frameworks such as Lyapunov optimization [28] and OCO have been applied to solve many dynamic problems in wireless systems. For example, online power control for wireless transmission with energy harvesting and storage was studied for single-hop transmission [29] and two-hop relaying [30]. Online wireless network virtualization with perfect and imperfect CSI was studied in [31] and [32]. Online projected gradient descent and matrix exponential learning were leveraged in [33] and [34] for uplink covariance matrix design. Dynamic transmit covariance design for wireless fading systems was studied in [35]. Online periodic precoding updates for wireless network virtualization were considered in [36]. The above works focused on centralized problems for single-cell wireless systems.
Multi-cell cooperative precoding via multiple base stations (BSs) at the signal level can effectively mitigate inter-cell interference, and this has been shown to significantly improve the system performance. However, traditional cooperative precoding schemes focused on centralized offline problems with instantaneous CSI available at the CC [37], [38], [39]. The TRPs defined in 5G NR are much smaller in size compared with the traditional BSs and therefore have limited computation power. Furthermore, non-ideal backhaul communication links in practice have received a surge of attention in the 5G NR standardization. In this work, we apply the proposed HiOCO framework to an online multi-TRP cooperative precoding design problem with non-ideal backhaul links, by taking full advantage of the CSI timeliness at the TRPs and computation resources at the CC.
We formulate a new OCO problem over a heterogeneous master-worker network with communication delay, where the worker nodes have timely information about the local data but possibly less computation resources compared with the master node. At the beginning of each time slot, each worker node executes a local decision vector to minimize the accumulation of time-varying global costs. The local data at the worker nodes may be non-i.i.d. and the global cost functions may be non-separable.
We propose a new HiOCO framework that takes full advantage of the network heterogeneity in information timeliness and computation capacity. As disclosed here, HiOCO allows both central gradient descent at the master node and local gradient descent at the worker nodes for improved system performance. Furthermore, by communicating the aggregated global information and compressed local information, HiOCO can often reduce the communication overhead while preserving data privacy.
We analyze the special structure of HiOCO in terms of its hierarchical multi-step gradient descent with estimated gradients, in the presence of multi-slot delay. We prove that it can yield sublinear dynamic regret under mild conditions. Even with multi-slot delay, by increasing the estimated gradient descent steps at either the network edge or center, we can configure HiOCO to achieve a better dynamic regret bound compared with centralized inexact gradient descent algorithms.
We apply HiOCO to an online multi-TRP cooperative precoding design problem. Simulation under typical urban micro-cell Long-Term Evolution (LTE) settings demonstrates that both the central and local estimated gradient descent in HiOCO can improve system performance. In addition, HiOCO substantially outperforms both the centralized and distributed alternatives.
Embodiments disclosed here consider OCO over a heterogeneous network with communication delay, where the network edge executes a sequence of local decisions to minimize the accumulation of time-varying global costs. The local data may not be independent and identically distributed (i.i.d.) and the global cost functions may not be separable. Due to communication delays, neither the network center nor edge always has real-time information about the current global cost function. We propose a new framework, termed Hierarchical OCO (HiOCO), which takes full advantage of the network heterogeneity on information timeliness and computation capacity to enable multi-step estimated gradient descent at both the network center and edge.
For performance evaluation, we derive upper bounds on the dynamic regret of HiOCO, which measures the gap of costs between HiOCO and an offline global optimal performance benchmark. We show that the dynamic regret is sublinear under mild conditions. We further apply HiOCO to an online cooperative precoding design problem in multiple transmission/reception point (TRP) wireless networks with non-ideal backhaul links for 5G New Radio (NR). Simulation results demonstrate substantial performance gain of HiOCO over both the centralized and distributed alternatives.
The message passing and internal node calculations described below are also illustrated schematically in the accompanying drawings.
Let ƒt({dtc}c=1C, {xc}c=1C): ℝn→ℝ be the convex global cost function at time slot t. In the hierarchical computing network 100, the worker nodes 104 and the master node 102 cooperate to jointly select a sequence of decisions from the feasible sets 𝒳c, to minimize the accumulated time-varying global costs. This leads to the following optimization problem, referred to as P1:
We consider the general case where the global cost function may be non-separable among the worker nodes 104, i.e., ƒt({dtc}c=1C, {xc}c=1C) may not be expressed as the summation of C local cost functions that each corresponds only to the local data dtc and decision vector xc. Therefore, due to the coupling of both data and variables, each worker node c cannot compute the gradient of the global cost function w.r.t. its own decision vector based only on its local information. Instead, we assume the local gradient can be expressed as

∇xcƒt({dtc}c=1C, {xc}c=1C) ≜ hƒc(dtc, xc, gƒc({dtl}l≠c, {xl}l≠c)), (1)

where gƒc({dtl}l≠c, {xl}l≠c) is some global information function w.r.t. the local data and decision vectors at all the other worker nodes 104. The local gradient and global information functions depend on the specific form of the global cost function. We will show later that communicating the values of the global information functions, instead of the raw data and decision vectors, can often reduce the communication overhead.
For notational simplicity, in the following, we define the global feasible set as 𝒳 ≜ 𝒳1 × ⋯ × 𝒳C and denote the global cost function ƒt({dtc}c=1C, {xc}c=1C) as ƒt(x), where x ≜ [x1T, …, xCT]T.
Due to the lack of in-time information about the global cost function at either the worker nodes 104 or the master node 102, it is impossible to obtain an optimal solution to P1. In fact, even for the most basic centralized OCO problem [3], an optimal solution cannot be found [4]. Instead, we aim at selecting an online solution sequence {xt}t=1T that is asymptotically no worse than the dynamic benchmark {x*t}t=1T, given by

x*t ∈ arg minx∈𝒳 ƒt(x), ∀t. (2)
Note that x*t is computed with the current information about ƒt(x) at each time slot t and the resulting solution sequence {x*t}t=1T is a global optimal solution to P1. The corresponding dynamic regret is defined as
RETd ≜ Σt=1T (ƒt(xt) − ƒt(x*t)). (3)
An OCO algorithm is desired to provide sublinear dynamic regret with respect to the time horizon T, i.e., RETd = o(T).
Sublinearity is important since it implies that the online decision is asymptotically no worse than the dynamic benchmark in terms of its time-averaged performance. However, in the worst case, no online algorithm can achieve sublinear dynamic regret if the systems vary too drastically over time [40]. Therefore, the dynamic regret bounds are expressed in terms of different measures on system variations that represent the hardness of the problem. For a clear comparison on the dynamic regret bounds between HiOCO and existing literature, we introduce several common variation measures as follows.
Borrowing from [3], we define the following accumulated variation of an arbitrary sequence of reference points {rt}t=1T (which is termed the path length in [3]):
ΠT ≜ Σt=1T ∥rt − rt−1∥2. (4)
The online projected gradient descent algorithm in [3] achieved O(√(TΠT)) dynamic regret w.r.t. any sequence of reference points {rt}t=1T. Another version of the path length defined in [7] is
Π′T ≜ Σt=1T ∥rt − Φt(rt−1)∥2, (5)
where Φt(⋅) is a given function available at the decision maker to predict the current reference point. The dynamic mirror descent algorithm in [7] achieved O(√(TΠ′T)) dynamic regret. When the reference points are the optimal points, i.e., rt=x*t for any t, the resulting path length is defined as
Π*T ≜ Σt=1T ∥x*t − x*t−1∥2. (6)
There are some other related measures that can be used to characterize the system variation, e.g., the accumulated variation of the cost functions {ƒt(x)}t=1T, given by

ΘT ≜ Σt=1T supx∈𝒳 |ƒt(x) − ƒt−1(x)|, (7)

and the accumulated squared variation of the gradients, given by
Γ2,T ≜ Σt=1T ∥∇ƒt(xt) − ∇ƒt−1(xt−1)∥2². (8)
The optimistic mirror descent algorithm in [8] achieved a dynamic regret bound expressed in terms of Π*T, ΘT, and Γ2,T simultaneously.
The above OCO works [3], [7], [8] focused on general convex cost functions. With strongly convex cost functions, the one-step projected gradient descent algorithm in [9] improved the dynamic regret to O(Π*T). The multi-step gradient descent algorithm in [10] further improved the dynamic regret to O(Π*2,T), where Π*2,T is the squared path length defined as
Π*2,T ≜ Σt=1T ∥x*t − x*t−1∥2². (9)
Note that if Π*T or Π*2,T is sublinear, Π*2,T is often smaller than Π*T in the order sense. For instance, if ∥x*t − x*t−1∥2 ∝ Tρ for any t, then Π*T = O(T1+ρ) and Π*2,T = O(T1+2ρ). For a sublinear Π*T or Π*2,T, we have ρ<0, and therefore Π*2,T is smaller than Π*T in the order sense. In particular, if ρ=−1/2, we have Π*2,T = O(1) and Π*T = O(√T). The standard and proximal online gradient descent algorithms were respectively extended in [11] and [12] to accommodate inexact gradients. Both resulted in O(max{Π*T, ΔT}) dynamic regret, where ΔT is the accumulated gradient error defined as

ΔT ≜ Σt=1T ∥∇f̂t(xt) − ∇ƒt(xt)∥2, (10)

with ∇f̂t(⋅) being a given function available at the decision maker to predict the current gradient.
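To make the order-sense comparison above concrete, the following sketch (with an illustrative synthetic optimal-point sequence, not taken from the disclosure) computes the standard and squared path lengths when the per-slot variation decays like 1/√t, in which case Π*T grows roughly like √T while Π*2,T grows only logarithmically.

```python
import numpy as np

def path_length(xs):
    # standard path length (6): sum_t ||x*_t - x*_{t-1}||_2
    return sum(np.linalg.norm(xs[t] - xs[t - 1]) for t in range(1, len(xs)))

def squared_path_length(xs):
    # squared path length (9): sum_t ||x*_t - x*_{t-1}||_2^2
    return sum(np.linalg.norm(xs[t] - xs[t - 1]) ** 2 for t in range(1, len(xs)))

rng = np.random.default_rng(1)
T, n = 5000, 3
steps = [rng.normal(size=n) / np.sqrt(t + 1.0) for t in range(T)]  # ~ 1/sqrt(t)
xs = np.cumsum(steps, axis=0)
print(path_length(xs), squared_path_length(xs))
```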
In this section, we present details of HiOCO and study the impact of hierarchical multi-step estimated gradient descent on the performance guarantees of HiOCO to provide dynamic regret bounds. We further provide sufficient conditions under which HiOCO yields sublinear dynamic regret and discuss its performance merits over existing OCO frameworks.
Existing distributed OCO frameworks cannot be directly applied to solve the aforementioned minimization problem with non-separable global cost functions. As an alternative, one may apply a centralized OCO approach at the master node after it has received all the local data from the worker nodes. However, this way of solving the problem does not take advantage of either the more timely information at the worker nodes or the computation resources at the worker nodes. Different from existing OCO frameworks that are either centralized or fully distributed, in HiOCO, the master node and worker nodes cooperate in gradient estimation and decision updates, by taking full advantage of the network heterogeneity on information timeliness and computation capacity. For ease of exposition, we will first consider the case of zero local delay at the worker node but will later extend that to the case of non-zero local delay. In the following, we present the algorithms at the master node and worker nodes.
At the beginning of each time slot t, each worker node c executes its current local decision vector xtc and uploads it to the master node 102. To enable central gradient descent at the master node 102, each worker node c also needs to share information about the local data dtc with the master node 102. However, sending the raw data directly would incur a large amount of uplink overhead. Instead, each worker node c sends a compression lƒc(dtc) of the current local data to the master node 102. With a τru-slot remote uplink delay and a τrd-slot remote downlink delay, only the round-trip delay τr ≜ τru + τrd matters, so we can equivalently consider a τr-slot uplink delay and no downlink delay. Thus, at the beginning of each time slot t>τr, the master node 102 only has the τr-slot-delayed local decision vector xt−τrc and compressed local data lƒc(dt−τrc), from which it recovers an estimate d̂t−τrc of the delayed local data for each worker node c.
Remark 1. There is often a delay-accuracy tradeoff for the recovered data {d̂t−τrc}c=1C at the master node 102, since more accurate data at the master node 102 require less compression at the worker nodes 104 and more transmission time. If data privacy is a concern, the worker nodes 104 can add noise to the compressed data while sacrificing some system performance [41].
With the recovered local data {d̂t−τrc}c=1C, the master node 102 sets an intermediate decision vector x̂tc,0 = xt−τrc for each worker node c and performs Jr steps of estimated gradient descent, updating x̂tc,j for each j ∈ [1, Jr] by solving the optimization problem P2, where ∇f̂t−τr(x̂tc,j−1) is an estimated gradient constructed from the recovered delayed local data {d̂t−τrc}c=1C according to (11). The master node 102 then sends x̂tc,Jr and the corresponding delayed global information ĝƒc({d̂t−τrl}l≠c, {x̂tl,Jr}l≠c) to assist the local gradient descent at each worker node c.
Specifically, at 314, the master node 102 checks whether j≤Jr. If so, the master node 102 proceeds to 316; otherwise, the master node 102 proceeds to 322. Initially, j=1 when the master node 102 reaches 314 for the first time. At 316, an estimated gradient ∇f̂t−τr(x̂tc,j−1) is constructed according to equation (11). At 318, x̂tc,j is updated by solving the optimization problem P2. At 320, the index j is incremented by one, and the master node 102 returns to the check at 314. At 322, after the gradient descent has completed, the master node 102 sends the global decision vector x̂tc,Jr and the corresponding global information ĝƒc({d̂t−τrl}l≠c, {x̂tl,Jr}l≠c) to each worker node c. At 324, the algorithm ends.
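A minimal sketch of the master-side update loop is given below, under illustrative assumptions: grad_est stands in for the application-specific estimated gradient in (11), the feasible set is a ball so the P2-style update reduces to a projected step with step-size 1/α, and the toy cost and data are hypothetical.

```python
import numpy as np

def project(x, radius=1.0):
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

def master_descent(x_delayed, d_recovered, grad_est, J_r, alpha):
    # J_r estimated gradient steps starting from the delayed worker decision.
    x_hat = x_delayed.copy()
    for _ in range(J_r):
        g = grad_est(x_hat, d_recovered)    # estimated (delayed) global gradient
        x_hat = project(x_hat - g / alpha)  # projected step, then repeat
    return x_hat                            # sent to workers with global info

# Toy instance: f(x) = 0.5*||x - d||^2, with compression/delay error in d.
rng = np.random.default_rng(2)
d_true = rng.normal(size=6)
d_hat = d_true + 0.01 * rng.normal(size=6)  # recovered (noisy) local data
x_next = master_descent(np.zeros(6), d_hat, lambda x, d: x - d, J_r=4, alpha=2.0)
print(x_next)
```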
Worker Node c's Algorithm
When the global cost function is non-separable, each worker node c cannot compute the local gradient ∇ƒtc(xtc) = hƒc(dtc, xtc, gƒc({dtl}l≠c, {xtl}l≠c)) based only on its local data dtc. Therefore, in HiOCO, the master node 102 assists the local gradient estimation by communicating the corresponding delayed global information ĝƒc({d̂t−τrl}l≠c, {x̂tl,Jr}l≠c) to each worker node c. Note that due to the communication delay and data compression, the global information received by the worker nodes 104 is delayed and with errors.
At the beginning of each time slot t>τr, each worker node c receives the global decision vector x̂tc,Jr and the corresponding delayed global information ĝƒc({d̂t−τrl}l≠c, {x̂tl,Jr}l≠c) from the master node 102. Each worker node c then sets an intermediate decision vector x̃tc,0 = x̂tc,Jr and performs J1 steps of estimated gradient descent, updating x̃tc,j for each j ∈ [1, J1] by solving the optimization problem P3, where ∇f̂tc(x̃tc,j−1) is an estimated gradient based on the timely local data dtc and the delayed global information, given by (12). The above estimated gradient takes full advantage of the information timeliness at the worker nodes, as well as the central availability of information at the master node, to enable local gradient descent at the worker nodes for non-separable cost functions. Each worker node c then executes xtc = x̃tc,J1 as its local decision at time slot t.
Remark 2. For separable global cost functions, HiOCO can still be applied. In this case, it is still beneficial to perform centralized gradient descent for improved system performance, while sacrificing some communication overhead caused by uploading the compressed local data.
Remark 3. Single-step and multi-step gradient descent algorithms were provided in [9] and [10], while [11] and [12] proposed single-step inexact gradient descent algorithms. However, the algorithms in [9], [10], [11], [12] are centralized and under the standard OCO setting with one-slot delayed gradient information. In HiOCO, both the master node 102 and worker nodes 104 can perform multi-step estimated gradient descent in the presence of multi-slot delay.
After receiving the global decision vector x̂tc,Jr and the corresponding delayed global information from the master node 102, the worker node 104, at 410, sets an intermediate decision vector x̃tc,0 = x̂tc,Jr and performs J1 steps of estimated gradient descent.
Specifically, at 412, the worker node 104 checks whether j≤J1. If so, the worker node 104 proceeds to 414; otherwise, the worker node 104 proceeds to 420. Initially, j=1 when the worker node 104 reaches 412 for the first time. At 414, an estimated gradient ∇f̂tc(x̃tc,j−1) is constructed according to equation (12). At 416, x̃tc,j is updated by solving the optimization problem P3. At 418, the index j is incremented by one, and the worker node 104 proceeds to perform the check at 412. At 420, after the gradient descent has completed, the worker node 104 implements xtc = x̃tc,J1 as its local decision for time slot t.
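A companion sketch of the worker-side loop follows, under the same illustrative assumptions: grad_est stands in for hƒc in (12), mixing the timely local data with the delayed global information received from the master; the coupling term and constants are hypothetical.

```python
import numpy as np

def project(x, radius=1.0):
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

def worker_descent(x_from_master, d_local, g_delayed, grad_est, J_1, alpha):
    # J_1 local steps starting from the master's decision.
    x_tilde = x_from_master.copy()
    for _ in range(J_1):
        g = grad_est(x_tilde, d_local, g_delayed)  # timely data + delayed info
        x_tilde = project(x_tilde - g / alpha)     # P3-style projected step
    return x_tilde                                 # executed as the local decision

# Toy local gradient: local quadratic pull plus a delayed coupling term.
x_c = worker_descent(np.zeros(3), np.ones(3), 0.1 * np.ones(3),
                     lambda x, d, g: (x - d) + g, J_1=2, alpha=2.0)
print(x_c)
```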
In this section, we present new techniques to derive the dynamic regret bounds of HiOCO, particularly to account for its hierarchical multi-step estimated gradient descent with multi-slot delay. For clarity of exposition, proofs are omitted.
We make the following assumptions, which are common in the literature of OCO with strongly convex cost functions [9], [10], [11], [12]. Strongly convex objectives arise in many machine learning and signal processing applications, such as Lasso regression, support vector machines, and robust subspace tracking. For applications with general convex cost functions, adding a simple regularization term, e.g., (μ/2)∥x∥2², often does not sacrifice the system performance. We will show later that strong convexity yields a contraction relation between ∥xt+1 − x*t∥2² and ∥xt − x*t∥2², which can be leveraged to improve the dynamic regret bounds.
Assumption 1. For any t, ƒt(x) satisfies the following:
ƒt(x) is μ-strongly convex over 𝒳, i.e., ∃μ>0, s.t., for any x, y∈𝒳 and t,

ƒt(y) ≥ ƒt(x) + ∇ƒt(x)T(y − x) + (μ/2)∥y − x∥2². (13)
ƒt(x) is L-smooth over 𝒳, i.e., ∃L>0, s.t., for any x, y∈𝒳 and t,

ƒt(y) ≤ ƒt(x) + ∇ƒt(x)T(y − x) + (L/2)∥y − x∥2². (14)
The gradient of ƒt(x) is bounded, i.e., ∃D>0, s.t., for any x∈𝒳 and t,
∥∇ƒt(x)∥2≤D. (15)
Assumption 2. The radius of 𝒳 is bounded, i.e., ∃R>0, s.t., for any x, y∈𝒳,
∥x−y∥2≤R. (16)
We also require the following lemma, which is reproduced from Lemma 2.8 in [1].
Lemma 1. Let 𝒳⊆ℝn be a nonempty convex set. Let ƒ(x) be a μ-strongly-convex function over 𝒳. Let z ≜ arg minx∈𝒳 ƒ(x). Then, for any y∈𝒳, we have

ƒ(y) ≥ ƒ(z) + (μ/2)∥y − z∥2².
The following lemma is general and quantifies the impact of one-step estimated gradient descent in terms of the squared gradient estimation error. We further provide a sufficient condition under which the estimated gradient descent yields a decision closer to the optimal point.
Lemma 2. Assume ƒ(x): 𝒳→ℝ is μ-strongly-convex and L-smooth. Let z = 𝒫𝒳{y − (1/α)∇f̂(y)}, where 𝒫𝒳{·} denotes Euclidean projection onto 𝒳 and ∇f̂(y) is an estimated gradient of ∇ƒ(y), and let x* ≜ arg minx∈𝒳 ƒ(x). For any α>L and γ∈(0, 2μ), ∥z − x*∥2² is bounded by η∥y − x*∥2² plus a term proportional to the squared gradient estimation error ∥∇f̂(y) − ∇ƒ(y)∥2², where η∈(0, 1) is a contraction constant depending on α, γ, μ, and L. The sufficient condition for ∥z − x*∥2² < ∥y − x*∥2² is

∥∇f̂(y) − ∇ƒ(y)∥2² < γ(2μ − γ)∥y − x*∥2². (18)
Remark 4. The condition on the gradient estimation error in (18) is most easily satisfied when γ=μ. In this case, the contraction constant η recovers the one in [9]. Furthermore, as γ approaches 0, η approaches the contraction constant in [10]. Different from Proposition 2 in [9] and Lemma 5 in [10], Lemma 2 takes into account the impact of estimated gradient descent and thus generalizes the results in [9] and [10].
Remark 5. The optimal gradient descent step-size in [11] needs to be in a specific range based on the knowledge of μ, L, and υ from an additional assumption ∥∇f̂t(xt) − ∇ƒt(xt)∥2² ≤ ϵ² + υ²∥∇ƒt(xt)∥2² for some ϵ≥0 and υ≥0. The contraction analysis in [12] focuses on the proximal point algorithm and is substantially different from Lemma 2.
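The flavor of Lemma 2 can be spot-checked numerically; the sketch below (an unconstrained strongly convex quadratic with an artificially perturbed gradient, all values illustrative) verifies that one step along the estimated gradient still contracts toward the minimizer when the gradient error is small relative to the distance from the optimum, in the spirit of condition (18).

```python
import numpy as np

rng = np.random.default_rng(3)
mu, L, alpha = 1.0, 3.0, 4.0           # strong convexity, smoothness, alpha > L
A = np.diag([mu, 2.0, L])              # f(x) = 0.5 x^T A x, minimizer x* = 0
y = rng.normal(size=3)

u = rng.normal(size=3)
u /= np.linalg.norm(u)
err = 0.05 * np.linalg.norm(y) * u     # small error vs. ||y - x*||, cf. (18)
z = y - (A @ y + err) / alpha          # one step along the *estimated* gradient

print(np.linalg.norm(z) < np.linalg.norm(y))  # True: moved closer to x* = 0
```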
We examine the impact of hierarchical multi-step estimated gradient descent on the dynamic regret bounds for OCO, which has not been addressed in the existing literature. To this end, we define the accumulated squared gradient error as

Δ2,T ≜ Σt=1T maxx∈𝒳 ∥∇f̂t(x) − ∇ƒt(x)∥2². (19)

Similar to the relationship between the standard path length Π*T and the squared path length Π*2,T discussed above, Δ2,T is often smaller than ΔT in the order sense. Note that the maximum term in (19) is the maximum gradient estimation error and serves as an upper bound for the gradient estimates in (11) and (12). We use Δ2,T as a loose upper bound for our performance analysis since it covers more general gradient estimation schemes that can be adopted in HiOCO.
Leveraging results in Lemmas 1-2 and OCO techniques, the following theorem provides upper bounds on the dynamic regret RETd for HiOCO.
Theorem 3. For any α≥L, ξ>0, and γ∈(0, 2μ), the dynamic regret yielded by HiOCO is bounded as follows:

For any J1+Jr≥1 such that 2η^(J1+Jr)<1, we have a bound of order O(max{τ²Π*2,T, Δ2,T}) plus a term involving the accumulated squared gradient variation at the optimal points, Σt=1T∥∇ƒt(x*t)∥2².

For any J1+Jr≥1, we have a bound of order O(max{τΠ*T, ΔT}).
Extension with Local Delay
We now consider the case of non-zero local delay, i.e., at the beginning of each time slot t, each worker node c only has the τ1-slot-delayed local data dt−τ1c. In this case, the total delay is τ ≜ τ1 + τr.
The master node's algorithm with local delay may proceed as follows. The algorithm starts, the parameter α is initialized, and at the beginning of each time slot t>τ, the master node 102 receives xt−τrc and the compressed local data lƒc(dt−τc) from each worker node c, recovers the delayed local data d̂t−τc, performs Jr steps of estimated gradient descent as before, and sends the resulting global decision vector and corresponding delayed global information to each worker node c.
The worker node's algorithm with local delay may proceed as follows. The algorithm starts, the local decision vectors xtc∈𝒳c for any t≤τ are initialized, and at the beginning of each time slot t>τ, the worker node 104 receives x̂tc,Jr and the corresponding delayed global information from the master node 102, performs J1 steps of estimated gradient descent based on the τ1-slot-delayed local data dt−τ1c, executes the resulting local decision vector xtc, and uploads xtc and the compressed local data to the master node 102.
Using similar techniques in the proof of Theorem 3, we provide dynamic regret bounds for HiOCO in the presence of both local and remote delay.
Theorem 4. For any α≥L, ξ>0, and γ∈(0, 2μ), the dynamic regret yielded by HiOCO is bounded as follows:

For any J1+Jr≥1 such that 4η^(J1+Jr)<1, we have a bound of order O(max{τ²Π*2,T, Δ2,T}) plus a term involving the accumulated squared gradient variation at the optimal points, Σt=1T∥∇ƒt(x*t)∥2².

For any J1+Jr≥1, we have a bound of order O(max{τΠ*T, ΔT}).
Due to the local delay, Theorem 4 has a more stringent condition on the total number of gradient descent steps compared with Theorem 3. However, the order of the dynamic regret bound is dominated by the accumulated system variation measures and is often the same as the case without local delay.
In this section, we discuss the sufficient conditions for HiOCO to yield sublinear dynamic regret and highlight several prominent advantages of HiOCO over existing OCO frameworks. From Theorems 3 and 4, we can derive the following corollary regarding the dynamic regret bound.
Corollary 5. Suppose the accumulated squared variation of the gradient at the optimal points satisfies Σt=1T∥∇ƒt(x*t)∥2² = O(max{τ²Π*2,T, Δ2,T}). Then, from Theorems 3 and 4, we have

RETd = O(min{max{τΠ*T, ΔT}, max{τ²Π*2,T, Δ2,T}}).
Note that Σt=1T∥∇ƒt(x*t)∥2² is often small and the condition in Corollary 5 is commonly satisfied. In particular, if x*t is an interior point of 𝒳 or P1 is an unconstrained online problem, we have ∇ƒt(x*t)=0. From Corollary 5, a sufficient condition for HiOCO to yield sublinear dynamic regret is either max{τΠ*T, ΔT}=o(T) or max{τ²Π*2,T, Δ2,T}=o(T). Sublinearity of the accumulated system variation measures is necessary to achieve sublinear dynamic regret [40]. In many online applications, the system tends to stabilize and the gradient estimation becomes more accurate over time, leading to sublinear dynamic regret.
Remark 6. The centralized single-step and multi-step gradient descent algorithms achieved O(Π*T) and O(min{Π*T, Π*2,T}) dynamic regret in [9] and [10], respectively. HiOCO takes advantage of both the timely local and delayed global information to perform multi-step estimated gradient descent at both the master and worker nodes. Our dynamic regret bound analysis takes into account the impacts of the unique hierarchical update architecture, gradient estimation errors, and multi-slot delay on the performance guarantees of OCO, which were not considered in [9] and [10].
Remark 7. The centralized single-step inexact gradient descent algorithms in [11] and [12] achieved O(max{Π*T, ΔT}) dynamic regret under the standard OCO setting with one-slot delay. Note that, in the order sense, Π*2,T and Δ2,T are usually smaller than Π*T and ΔT, respectively. Therefore, even in the presence of multi-slot delay, HiOCO provides a better dynamic regret bound by increasing the number of estimated gradient descent steps, and it recovers the performance bounds in [11] and [12] as a special case.
The message passing and internal node calculations described below are also illustrated schematically in the accompanying drawings.
We consider a total of C TRPs 504 coordinated by the CC 502 to jointly serve K users 506 in the cooperative network 500. Each TRP c has Nc antennas, so there is a total of N = Σc=1C Nc antennas in the network 500. Let Htc ∈ ℂK×Nc denote the local channel state between TRP c and the K users 506 at time slot t, and let Ht ≜ [Ht1, …, HtC] ∈ ℂK×N denote the corresponding global channel state.
For ease of illustration only, here we consider the case where there is no local delay at the TRPs to collect the local CSI. However, embodiments may also cover the case of non-zero local delay as explained above. At each time slot t, each TRP c has the current local CSI Htc and implements a local precoding matrix Vtc ∈ ℂNc×K, which is selected from the convex feasible set

𝒱c ≜ {Vc : ∥Vc∥F² ≤ Pmaxc} (20)

to meet the per-slot maximum transmit power limit. Let Vt ≜ [(Vt1)T, …, (VtC)T]T ∈ ℂN×K denote the global precoding matrix. The actual received signal vector (noiseless) at the K users 506 is given by

yt = HtVtst,

where st ∈ ℂK×1 contains the transmitted signals from the TRPs to all K users 506, which are assumed to be independent of each other with unit power, i.e., 𝔼{ststH} = I, ∀t.
We first consider idealized backhaul communication links, where each TRP c communicates Htc to the CC 502 without delay. The CC 502 then has the global CSI Ht at time slot t and designs a desired global precoder Wt ∈ ℂN×K to meet the per-TRP maximum power limits. The design of Wt can be based on the service needs of the K users 506 and is not limited to any specific precoding scheme. For the CC 502 with Wt, the desired received signal vector (noiseless) ỹt is given by

ỹt = HtWtst.
With the TRPs' 504 actual precoding matrix Vt and the desired precoder Wt at the CC 502, the expected deviation of the actual received signal vector at all K users 506 from the desired one is given by 𝔼{∥yt − ỹt∥2²} = ∥HtVt − HtWt∥F². We define the precoding deviation of the TRPs' 504 precoding from the desired precoder at the CC 502 as

ƒt(V) ≜ ∥HtV − HtWt∥F², ∀t, (21)
which is a strongly convex cost function.
Note that due to the coupling of the local channel states {Htc}c=1C and local precoders {Vtc}c=1C, the cost function ƒt(V) is not separable among the TRPs 504. Furthermore, the local gradient at each TRP c depends on the local channel state Htc, the local precoder Vtc, and the channel states {Htl}l≠c and precoders {Vtl}l≠c at all the other TRPs 504, given by

∇Vcƒt(V) = (Htc)H(Σl=1C HtlVtl − HtWt). (22)
The goal of the multi-TRP cooperative network 500 is to minimize the accumulation of the precoding deviation subject to per-TRP maximum transmit power limits with non-ideal backhaul communication links. The online optimization problem is in the same form as P1 with {Htc}c=1C being the local data, {Vtc∈c}c=1C being the local decision vectors, and ƒt(V) being the global cost function.
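For illustration, the cost (21) and its block gradient (22) can be evaluated as in the following sketch (dimensions are illustrative, and the constant factor from differentiating the squared Frobenius norm is absorbed into the step-size, matching the gradient convention visible in this disclosure).

```python
import numpy as np

def precoding_cost(H, V, W):
    # f_t(V) = ||H V - H W||_F^2, as in (21)
    return np.linalg.norm(H @ (V - W), "fro") ** 2

def block_grad(H_blocks, V_blocks, H, W, c):
    # gradient w.r.t. V_c as in (22): H_c^H (sum_l H_l V_l - H W); all of the
    # residual except TRP c's own term is the global information for TRP c.
    residual = sum(Hl @ Vl for Hl, Vl in zip(H_blocks, V_blocks)) - H @ W
    return H_blocks[c].conj().T @ residual

rng = np.random.default_rng(4)
K, Nc, C = 4, 3, 2
H_blocks = [rng.normal(size=(K, Nc)) + 1j * rng.normal(size=(K, Nc)) for _ in range(C)]
H = np.hstack(H_blocks)
V_blocks = [rng.normal(size=(Nc, K)) + 0j for _ in range(C)]
W = rng.normal(size=(C * Nc, K)) + 0j
print(precoding_cost(H, np.vstack(V_blocks), W))
print(block_grad(H_blocks, V_blocks, H, W, 0).shape)  # (Nc, K)
```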
For non-ideal backhaul links with τru-slot uplink and τrd-slot downlink communication delays, as illustrated herein, only the round-trip communication delay τr = τru + τrd matters, and we can equivalently consider a τr-slot uplink delay and no downlink delay. At each time slot t, each TRP c has the timely local CSI Htc and implements a local precoder Vtc. If communication overhead is a concern, instead of sending the complete CSI Htc, each TRP c can send a compressed local CSI Ltc to the CC 502. Due to the communication delay and CSI compression, the CC 502 recovers a delayed global channel state Ĥt−τr.
Leveraging the proposed HiOCO framework, we now provide hierarchical solutions to the formulated online multi-TRP cooperative precoding design problem.
At the beginning of each time slot t>τr, the CC 502 receives the precoding matrices {Vt−τrc}c=1C and the compressed local CSI {Lt−τrc}c=1C from the TRPs 504 and recovers the delayed global CSI Ĥt−τr. It then sets V̂tc,0 = Vt−τrc for each TRP c and performs Jr steps of estimated gradient descent, updating V̂tc,j via a projected gradient step, where 𝒫𝒱c{·} is the projection operator onto the convex feasible set 𝒱c and ∇f̂t−τr(·) is an estimation of the gradient at time slot t−τr. The CC 502 then communicates the intermediate precoder V̂tc,Jr and the corresponding aggregated global information Ĝt−τc to each TRP c. Note that instead of sending the delayed global channel state Ĥt−τr and the precoders of all the other TRPs, the CC 502 sends the aggregated global information Ĝt−τc, which can often reduce the downlink communication overhead.
Each TRP c can implement any local precoder in 𝒱c for any t∈[1, τr]. At the beginning of each time slot t>τr, after receiving the intermediate precoder V̂tc,Jr and the delayed global information Ĝt−τc from the CC 502, each TRP c sets Ṽtc,0 = V̂tc,Jr and performs J1 steps of estimated gradient descent, where ∇f̂tc(·) is an estimation of the current gradient based on the timely local CSI Htc and the delayed global information Ĝt−τc. Finally, each TRP c uses Vtc = Ṽtc,J1 as its local precoder at time slot t and uploads it, together with the compressed local CSI Ltc, to the CC 502.
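A sketch of one slot of these hierarchical precoding updates is given below under illustrative assumptions: the CC runs Jr projected steps with delayed CSI, the TRP continues with J1 steps using its timely CSI and the same delayed aggregated global information, and project_power implements the closed-form projection onto the power ball in (20).

```python
import numpy as np

def project_power(V, P_max):
    # closed-form projection onto {V : ||V||_F^2 <= P_max}, cf. (20)
    p = np.linalg.norm(V, "fro") ** 2
    return V if p <= P_max else V * np.sqrt(P_max / p)

def gradient_steps(V0, H_c, G_c, J, alpha, P_max):
    # J projected steps for TRP c's block; the residual H_c V + G_c carries
    # the coupling, with G_c aggregating the other TRPs and the desired precoder.
    V = V0.copy()
    for _ in range(J):
        grad = H_c.conj().T @ (H_c @ V + G_c)
        V = project_power(V - grad / alpha, P_max)
    return V

rng = np.random.default_rng(5)
K, Nc, P_max, alpha = 4, 3, 1.0, 50.0
H_delayed = rng.normal(size=(K, Nc))                     # CSI available at the CC
H_timely = H_delayed + 0.01 * rng.normal(size=(K, Nc))   # timely CSI at the TRP
G_c = 0.1 * rng.normal(size=(K, K))                      # delayed global info

V = gradient_steps(np.zeros((Nc, K)), H_delayed, G_c, J=8, alpha=alpha, P_max=P_max)  # CC
V = gradient_steps(V, H_timely, G_c, J=1, alpha=alpha, P_max=P_max)                   # TRP
print(np.linalg.norm(V, "fro") ** 2 <= P_max + 1e-9)
```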
Note that the optimal precoding solution is V*t = Wt at each time slot t. However, with non-ideal backhaul links, each TRP c cannot receive V*tc from the CC 502 in time to implement it at each time slot t. A naive solution is to implement the delayed optimal solution V*t−τr instead.
We assume that the channel power is bounded by a constant B>0 at any time t, given by
∥Ht∥F2≤B. (23)
In the following Lemma, we show that the formulated online multi-TRP cooperative precoding design problem satisfies Assumptions 1 and 2 made above.
Lemma 6. Assume the channel power is bounded as in (23). Then, Assumptions 1 and 2 hold with the corresponding constants given by μ=2, L=B, D=2B√(Σc=1C Pmaxc), and R=2√(Σc=1C Pmaxc).
Leveraging the results in Theorems 3 and 4, and noting that the gradient of the optimal precoder satisfies ∇ƒt(V*t) = HtH(HtV*t − HtWt) = 0, the following corollary provides the dynamic regret bounds yielded by the hierarchical online precoding solution sequence {Vt}t=1T.
Corollary 7. The dynamic regret bounds in Theorems 3 and 4 hold for {Vt}t=1T generated by HiOCO, with the constants μ, L, D, and R given in Lemma 6 and Σt=1T∥∇ƒt(V*t)∥F2=0.
In this section, we present simulation results under typical urban micro-cell LTE network settings. We study the impact of various system parameters on the convergence and performance of HiOCO. We numerically demonstrate the performance advantage of HiOCO over both the centralized and distributed alternatives.
We consider an urban hexagon micro-cell of radius 500 m with C=3 equally separated TRPs, each equipped with Nc=16 antennas. We consider 5 co-located users in the middle of every two adjacent TRPs, for a total of K=15 users in the network. Following the standard LTE specification [42], as default system parameters, we set the maximum transmit power limit Pmaxc=30 dBm, noise power spectral density N0=−174 dBm/Hz, and noise figure NF=10 dB, and we focus on the channel over one subcarrier with bandwidth BW=15 kHz. We model the fading channel between each user k and each TRP c as a first-order Gauss-Markov process ht+1c,k = αh htc,k + ztc,k, where htc,k ~ 𝒞𝒩(0, βc,kI) with βc,k [dB] = −31.54 − 33 log10(dc,k) − φc,k represents the path-loss and shadowing effects, dc,k is the distance in kilometers from TRP c to user k, φc,k ~ 𝒩(0, σφ²) is the shadowing effect that is used to model the variation of user positions with σφ² = 8 dB, αh ∈ [0,1] is the channel correlation coefficient, and ztc,k ~ 𝒞𝒩(0, (1−αh²)βc,kI) is independent of htc,k. We set αh=0.998 as default, which corresponds to a user speed of 1 km/h. We consider that each TRP c communicates the accurate local CSI Htc to the CC, since the impact of channel compression error can be emulated by increasing the communication delay τr.
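The fading model above can be simulated directly; in the sketch below (the TRP-to-user distance is an illustrative assumption), the recursion ht+1 = αh·ht + zt keeps the expected channel power stationary at the path-loss/shadowing level β.

```python
import numpy as np

rng = np.random.default_rng(6)
Nc, T, a_h = 16, 2000, 0.998                 # antennas, slots, correlation

d_km = 0.25                                  # illustrative TRP-to-user distance
shadow_dB = rng.normal(0.0, np.sqrt(8.0))    # shadowing with sigma^2 = 8 dB
beta_dB = -31.54 - 33.0 * np.log10(d_km) - shadow_dB
beta = 10.0 ** (beta_dB / 10.0)              # linear-scale channel power

h = np.sqrt(beta / 2) * (rng.normal(size=Nc) + 1j * rng.normal(size=Nc))
powers = []
for _ in range(T):
    z = np.sqrt((1 - a_h**2) * beta / 2) * (rng.normal(size=Nc) + 1j * rng.normal(size=Nc))
    h = a_h * h + z                          # first-order Gauss-Markov update
    powers.append(np.sum(np.abs(h) ** 2))

print(np.mean(powers), Nc * beta)            # empirical power ~ Nc * beta
```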
For our performance study, we assume the CC adopts cooperative zero-forcing (ZF) precoding, given by

WtZF = √(PtZF) HtH(HtHtH)−1,
where PtZF is a power normalizing factor. Note that we must have N≥K to perform ZF precoding. We assume all K users have the same noise power σn² (determined by the noise figure NF, the noise power spectral density N0, and the bandwidth BW), and therefore all the users will have the same data rate.
The CC adopts the power normalizing factor PtZF chosen as the optimal solution of the sum-rate maximization problem under per-TRP maximum transmit power limits.
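A sketch of this cooperative ZF precoder follows; the per-TRP normalization shown (scaling the common factor until the tightest TRP power constraint is met) is one natural reading of the disclosure, with the exact factor PtZF given in the original.

```python
import numpy as np

rng = np.random.default_rng(7)
K, Nc, C = 6, 8, 3
N = C * Nc                                            # need N >= K for ZF
P_max = [1.0, 1.0, 1.0]

H = (rng.normal(size=(K, N)) + 1j * rng.normal(size=(K, N))) / np.sqrt(2)
W0 = H.conj().T @ np.linalg.inv(H @ H.conj().T)       # unnormalized ZF direction

blocks = np.split(W0, C, axis=0)                      # per-TRP blocks, Nc x K
P_zf = min(P_max[c] / np.linalg.norm(blocks[c], "fro") ** 2 for c in range(C))
W = np.sqrt(P_zf) * W0                                # sqrt(P_t^ZF) H^H (H H^H)^{-1}

print(np.allclose(H @ W, np.sqrt(P_zf) * np.eye(K)))  # inter-user interference nulled
```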
As performance metrics, we define the time-averaged normalized precoding deviation and the time-averaged per-user rate, where the per-user rate is computed from the signal-to-interference-plus-noise ratio (SINR) of each user k.
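The two metrics can be computed as in the following sketch (the normalization by the desired signal power and the unit noise value are illustrative assumptions; the exact normalizations appear in the original figures).

```python
import numpy as np

def normalized_deviation(H, V, W):
    # precoding deviation (21), normalized by the desired signal power
    return (np.linalg.norm(H @ (V - W), "fro") ** 2
            / np.linalg.norm(H @ W, "fro") ** 2)

def per_user_rates(H, V, noise_power):
    S = H @ V                                  # S[k, j]: stream j seen by user k
    sig = np.abs(np.diag(S)) ** 2              # intended signal power of user k
    interference = np.sum(np.abs(S) ** 2, axis=1) - sig
    sinr = sig / (interference + noise_power)
    return np.log2(1.0 + sinr)                 # per-user rate (bits/s/Hz)

rng = np.random.default_rng(8)
K, N = 4, 8
H = (rng.normal(size=(K, N)) + 1j * rng.normal(size=(K, N))) / np.sqrt(2)
W = H.conj().T @ np.linalg.inv(H @ H.conj().T)   # desired (ZF) precoder
V = W + 0.05 * (rng.normal(size=(N, K)) + 1j * rng.normal(size=(N, K)))
print(normalized_deviation(H, V, W))
print(per_user_rates(H, V, noise_power=0.1))
```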
Next, we study the impact of channel correlation on the performance of HiOCO. Note that as αh increases, the accumulated system variation measures become smaller, leading to better dynamic regret bounds; the simulation results confirm this trend.
For performance comparison, we consider the delayed optimal precoder V*t−τr, centralized OCO with Jr=8 and Jr=1 steps of gradient descent, and dynamic and fixed user association schemes in which each TRP serves its associated users based on the available local CSI, with τ1=0 and τr=4. We observe that HiOCO achieves the best system performance among all of the above alternative schemes. Furthermore, by performing only J1=1 additional step of local gradient descent at the TRPs, HiOCO achieves substantial performance gain over centralized OCO with Jr=8 steps of gradient descent. The user association schemes, although based on the timely local CSI, perform worse than the other alternatives, since the TRPs are not coordinated to jointly serve the users.
We further study the impact of the number of antennas Nc and the number of users K.
Embodiments provide OCO over a heterogeneous master-worker network with communication delay, to make a sequence of online local decisions to minimize some accumulated global convex cost functions. The local data at the worker nodes may be non-i.i.d. and the global cost functions may be non-separable.
We propose a new HiOCO framework, which takes full advantage of the network heterogeneity in information timeliness and computation capacity, to enable multi-step estimated gradient descent at both the master and worker nodes. Our analysis considers the impacts of multi-slot delay, gradient estimation error, and the hierarchical architecture on the performance guarantees of HiOCO, to show sublinear dynamic regret bounds under mild conditions.
We apply HiOCO to a multi-TRP cooperative network with non-ideal backhaul links for 5G NR. We take full advantage of the information timeliness on CSI and computation resources at both the TRPs and the CC to improve the system performance. By sharing the compressed local CSI and delayed global information, both the uplink and downlink communication overhead can be greatly reduced. The cooperative precoding solutions at both the TRPs and the CC are in closed form with low computational complexity.
Notes on the performance of the proposed methods: We numerically validate the performance of the proposed hierarchical precoding solution for multi-TRP cooperative networks under typical LTE cellular network settings. Extensive simulation results are provided to demonstrate the impact of the number of estimated gradient descent steps, channel correlation, remote and local delay, and the number of antennas and users. Simulation results demonstrate the superior delay tolerance and substantial performance advantage of HiOCO over both the centralized and distributed alternatives under different scenarios.
Process 1500 is a method for performing online convex optimization, performed e.g. by a master node such as master node 102 and/or CC 502.
Step 1502 comprises receiving, from two or more worker nodes, a local decision vector and local data corresponding to each of the two or more worker nodes.
Step 1504 comprises performing a multi-step gradient descent based on the local decision vector and the local data received from the two or more worker nodes, wherein performing the multi-step gradient descent comprises determining a global decision vector and corresponding global information.
Step 1506 comprises sending, to each of the two or more worker nodes, the global decision vector and corresponding global information.
In some embodiments, the local data received from each of the two or more worker nodes is compressed, and the method further comprises uncompressing the local data received from each of the two or more worker nodes. In some embodiments, performing the multi-step gradient descent further comprises: initializing an intermediate decision vector x̂tc,0 = xt−τrc for each of the two or more worker nodes c; and, for each step j: (1) determining an estimated gradient ∇f̂t−τr(x̂tc,j−1) based on the local decision vectors and the recovered local data; and (2) updating x̂tc,j for each of the two or more worker nodes c, by solving an optimization problem for x̂tc,j based on the estimated gradients; where:
xt−τrc refers to the local decision vectors received from each of the two or more worker nodes,
lƒc(dt−τrc) refers to compressed local data for each of the two or more worker nodes that is based on the local data received from each of the two or more worker nodes,
j∈[1, Jr], and
Jr refers to the number of steps of the multi-step gradient descent.
In some embodiments, the estimated gradient is given by equation (11), the optimization problem is given by P2, and the corresponding global information for a given worker node c is given by ĝƒc({d̂t−τrl}l≠c, {x̂tl,Jr}l≠c), where:

∇f̂t−τr( ) refers to the estimated gradient function, and

hƒc( ) refers to a general function.
In some embodiments, the local data corresponding to each of the two or more worker nodes has a non-zero local delay. In some embodiments, the two or more worker nodes comprise transmission/reception points (TRPs), the local data corresponds to local channel state information, and the local decision vectors correspond to precoding matrices. In some embodiments, performing the multi-step gradient descent further comprises: initializing an intermediate precoding matrix V̂tc,0 = Vt−τrc for each of the two or more TRPs c; and, for each step j: (1) determining an estimated gradient based on the local precoding matrices and the local channel state information; and (2) updating V̂tc,j for each of the two or more TRPs c, by solving an optimization problem for V̂tc,j based on the estimated gradients; where:
C refers to the number of the two or more worker nodes,
c is an index referring to a specific one of the two or more worker nodes
t refers to the current time slot,
τr refers to a round-trip remote delay,
Vt−τrc refers to the local precoding matrices received from each of the two or more TRPs,
Lt−τrc refers to compressed local channel state information for each of the two or more TRPs that is based on the local channel state information received from each of the two or more TRPs,
j∈[1, Jr], and
Jr refers to the number of steps of the multi-step gradient descent.
In some embodiments, the estimated gradient is given by an estimate ∇f̂t−τ( ) of the gradient in (22) computed from the recovered delayed global CSI, the optimization problem is a projected gradient update onto 𝒱c, and the corresponding global information for a given TRP c is given by Ĝt−τc = Σl=1,l≠cC Ĥt−τl V̂tl,Jr − Ĥt−τ Ŵt−τ, where:

𝒫𝒱c{ } is the projection operator onto the convex feasible set 𝒱c,

Ŵt−τ refers to the delayed desired precoder recovered at the CC,

∇f̂t−τ( ) refers to the estimated gradient function, and

α refers to a fixed parameter.
Process 1600 is a method for performing online convex optimization, performed e.g. by a worker node such as worker node 104 and/or TRP 504.
Step 1602 comprises receiving, from a master node, a global decision vector and corresponding global information, wherein the global information has a time delay associated with it.
Step 1604 comprises performing a multi-step gradient descent based on the global decision vector and local data, wherein performing the multi-step gradient descent comprises determining a local decision vector.
Step 1606 comprises sending, to the master node, the local decision vector and local data.
In some embodiments, the local data sent to the master node is compressed prior to sending. In some embodiments, performing the multi-step gradient descent further comprises: initializing an intermediate decision vector x̃tc,0 = x̂tc,Jr; and, for each step j: (1) determining an estimated gradient ∇f̂tc(x̃tc,j−1) based on the global decision vector, the global information, and the local data; and (2) updating x̃tc,j, by solving an optimization problem for x̃tc,j based on the estimated gradient; where:
c is an index referring to a worker node corresponding to the local data,
t refers to the current time slot,
τr refers to a round-trip remote delay,
dtc refers to the local data,
x̂tc,Jr refers to the global decision vector,

ĝƒc({d̂t−τrl}l≠c, {x̂tl,Jr}l≠c) refers to the global information,
In some embodiments, the estimated gradient is given by equation (12), the optimization problem is given by P3, and the local decision vector is given by xtc = x̃tc,J1, where:
∇f̂tc( ) refers to a local gradient function,

hƒc( ) refers to a general function,
𝒳c refers to a compact convex feasible set, and
α refers to a fixed parameter.
In some embodiments, the local data has a non-zero local delay. In some embodiments, the local data corresponds to local channel state information, and the local decision vectors correspond to precoding matrices. In some embodiments, performing the multi-step gradient descent further comprises: initializing an intermediate precoding matrix Ṽtc,0 = V̂tc,Jr; and, for each step j: (1) determining an estimated gradient based on the local channel state information and the global information; and (2) updating Ṽtc,j, by solving an optimization problem for Ṽtc,j based on the estimated gradient; where:
c is an index referring to a worker node corresponding to the local data,
t refers to the current time slot,
τr refers to a round-trip remote delay,
τ1 refers to a local delay,
τ refers to the total delay,
Htc refers to the local channel state information,
V̂tc,Jr refers to the intermediate precoding matrix received from the master node,
Ĝt−τc refers to the global information,
j∈[1,J1], and
J1 refers to the number of steps of the multi-step gradient descent.
In some embodiments, the estimated gradient is given by an estimate ∇f̂t−τ1c( ) of the gradient in (22) computed from the τ1-delayed local channel state information and the delayed global information Ĝt−τc, the optimization problem is a projected gradient update onto 𝒱c, and the local precoding matrix is given by Vtc = Ṽtc,J1, where:

𝒫𝒱c{ } is the projection operator onto the convex feasible set 𝒱c,

∇f̂t−τ1c( ) refers to the estimated gradient function, and
α refers to a fixed parameter.
While various embodiments of the present disclosure are described herein, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present disclosure should not be limited by any of the above-described example embodiments. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.
Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, and some steps may be performed in parallel.
Filing Document: PCT/IB2022/050212; Filing Date: 1/12/2022; Country: WO.

Priority Application Number: 63144257; Date: Feb 2021; Country: US.