Disclosed are embodiments related to online convex optimization.
Many machine learning, signal processing, and resource allocation problems can be cast into a dynamic optimization problem with time-varying convex cost functions. Online convex optimization (OCO) provides the tools to handle dynamic problems in the presence of uncertainty, where an online decision strategy evolves based on the historical information [1], [2] (bracketed numbers refer to references at the end of this disclosure). OCO can be seen as a discrete-time sequential decision-making process by an agent in a system. At the beginning of each time slot, the agent makes a decision from a convex feasible set. The system reveals information about the current convex cost function to the agent only at the end of each time slot. The lack of in-time information prevents the agent from making an optimal decision at each time slot. Instead, the agent resorts to minimizing the regret, which is the performance gap between the online decision sequence and some benchmark solution. A desired online decision sequence should be asymptotically no worse than the performance benchmark, i.e., it should achieve regret that grows at most sublinearly over time.
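As a concrete illustration of this decision protocol (not part of the disclosed embodiments), the following minimal Python sketch plays a decision from a ball-shaped feasible set, observes the cost only after committing, and accumulates the gap to a per-slot benchmark; the quadratic costs and the feasible-set radius are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, radius = 200, 4, 1.0

def project(x):
    # Euclidean projection onto the ball {x : ||x||_2 <= radius}
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

def cost(x, c):
    # illustrative time-varying cost f_t(x) = 0.5 * ||x - c_t||^2
    return 0.5 * float(np.sum((x - c) ** 2))

targets = [rng.normal(size=n) for _ in range(T)]  # hidden data driving f_t
x = np.zeros(n)
online, benchmark = 0.0, 0.0
for t in range(T):
    c = targets[t]
    online += cost(x, c)              # f_t is revealed only after x_t is played
    benchmark += cost(project(c), c)  # per-slot (dynamic) benchmark decision
    x = project(x - 0.5 * (x - c))    # projected gradient step for the next slot

print("dynamic regret:", online - benchmark)
```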
Most of the early works on OCO studied the static regret, which compares the online decision sequence with a static offline benchmark [3], [4], [5], [6]. However, the optimum of dynamic problems is often time varying. As a rather coarse performance metric, achieving sublinear static regret may not be meaningful, since the static offline benchmark itself may perform poorly. A more attractive dynamic regret was first proposed in [3], where the offline benchmark solution can be time varying. It is well known that in the worst case, it is impossible to obtain sublinear dynamic regret, since drastic variations of the underlying systems can make the online problem intractable. Therefore, dynamic regret bounds are often expressed w.r.t. the accumulated system variations that reflect the hardness of the problem. Theoretical guarantees on the dynamic regret for OCO with general cost functions were studied in [3], [7], and [8], while the case of strongly convex cost functions was studied in [9], [10], [11], and [12].
The above OCO frameworks do not consider the network heterogeneity on information timeliness and computation capacity in many practical applications. For example, in the multiple transmission/reception point (TRP) cooperative network with non-ideal backhaul links for 5G New Radio (NR) [13], each TRP has a priori local channel state information (CSI) but less computation capacity compared with a central controller (CC). In mobile edge computing [14], the remote processors have timely information about the computing tasks but may offload some tasks to the edge server due to the limitation on local computation resources [15]. Another example is self-driving vehicular networks, where each vehicle moves based on its real-time sensing while reporting local observations to a control center for traffic routing or utility maximization. In these applications, data are distributed over the network edge and vary over time. Furthermore, the network edge needs to make real-time local decisions to minimize the global costs. However, due to the coupling of data and variables, the global cost function may be non-separable, i.e., it may not be expressed as a summation of local cost functions at the network edge.
Algorithms for non-separable global cost minimization problems, such as block coordinate descent [16] and the alternating direction method of multipliers [17], [18], are centralized in nature, as they implicitly assume there is a central node that coordinates the iterative communication and computation processes. However, with distributed data at the network edge, centralized solutions suffer from high communication overhead and performance degradation due to communication delay. Furthermore, existing distributed online optimization frameworks such as parallel stochastic gradient descent [19], federated learning [20], and distributed OCO [21] are confined to separable global cost functions. Specifically, each local cost function depends only on the local data, which allows each node to locally compute the gradient without information about the data at all the other nodes. Therefore, these distributed online frameworks cannot be directly applied to non-separable global cost minimization problems, such as the multi-TRP cooperative precoding design problem considered in this invention, where downlink transmissions at the TRPs are coupled by broadcasting channels.
It is therefore challenging to develop an online learning framework that takes full advantage of the network heterogeneity on information timeliness and computation capacity, while allowing the global cost functions to be non-separable. In this work, we propose a new Hierarchical Online Convex Optimization (HiOCO) framework for dynamic problems over a heterogeneous master-worker network with communication delay. The local data may not be independent and identically distributed (i.i.d.) and the global cost function may not be separable. We consider network heterogeneity, such that the worker nodes have more timely information about the local data but possibly less computation resources compared with the master node. As disclosed here, HiOCO is a framework that takes full advantage of both the timely local and delayed global information, while allowing gradient descent at both the network edge and control center for improved system performance. Our incorporation of non-separable global cost functions over a master-worker network markedly broadens the scope of OCO.
According to a first aspect, a method for performing online convex optimization is provided. The method includes receiving, from two or more worker nodes, a local decision vector and local data corresponding to each of the two or more worker nodes. The method includes performing a multi-step gradient descent based on the local decision vector and the local data received from the two or more worker nodes, wherein performing the multi-step gradient descent comprises determining a global decision vector and corresponding global information. The method includes sending, to each of the two or more worker nodes, the global decision vector and corresponding global information.
According to a second aspect, a method for performing online convex optimization is provided. The method includes receiving, from a master node, a global decision vector and corresponding global information, wherein the global information has a time delay associated with it. The method includes performing a multi-step gradient descent based on the global decision vector and local data, wherein performing the multi-step gradient descent comprises determining a local decision vector. The method includes sending, to the master node, the local decision vector and local data.
According to a third aspect, a master node for performing online convex optimization is provided. The master node includes processing circuitry and a memory containing instructions executable by the processing circuitry. The processing circuitry is operable to receive, from two or more worker nodes, a local decision vector and local data corresponding to each of the two or more worker nodes. The processing circuitry is operable to perform a multi-step gradient descent based on the local decision vector and the local data received from the two or more worker nodes, wherein performing the multi-step gradient descent comprises determining a global decision vector and corresponding global information. The processing circuitry is operable to send, to each of the two or more worker nodes, the global decision vector and corresponding global information.
According to a fourth aspect, a worker node for performing online convex optimization is provided, the worker node comprising processing circuitry and a memory containing instructions executable by the processing circuitry. The processing circuitry is operable to receive, from a master node, a global decision vector and corresponding global information, wherein the global information has a time delay associated with it. The processing circuitry is operable to perform a multi-step gradient descent based on the global decision vector and local data, wherein performing the multi-step gradient descent comprises determining a local decision vector. The processing circuitry is operable to send, to the master node, the local decision vector and local data.
According to a fifth aspect, a computer program is provided comprising instructions which when executed by processing circuitry of a node causes the node to perform the method of any embodiment of the first and second aspects.
According to a sixth aspect, a carrier containing the computer program of the fifth aspect is provided, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.
The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments.
In the seminal work on OCO [3], an online projected gradient descent algorithm achieved O(√T) static regret with bounded feasible set and gradient, where T is the total time horizon. The static regret was shown to be unavoidably Ω(√T) for general convex cost functions without additional assumptions and was further improved to O(log T) for strongly convex cost functions [4]. Moreover, [5] provided O(√(τT)) static regret in the presence of τ-slot delay, and [6] studied OCO with adversarial delays. First introduced in [3], the dynamic regret of OCO has received a recent surge of interest [7], [8]. Strong convexity was shown to improve the dynamic regret bound in [9]. By increasing the number of gradient descent steps, the dynamic regret bound was further improved in [10]. The standard and proximal online gradient descent algorithms were respectively extended to accommodate inexact gradients in [11] and [12]. Below, we compare the settings and dynamic regret bounds of these works in more detail.
The above OCO algorithms are centralized. Distributed online optimization of a sum of local convex cost functions was studied in [21], [22], [23], [24], [25], [26], [27]. Early works on distributed OCO focused on static regret [21], [22], [23], [24] while more recent works studied dynamic regret [25], [26], [27]. However, existing distributed OCO works are over fully distributed networks with separable global cost functions.
Online frameworks such as Lyapunov optimization [28] and OCO have been applied to solve many dynamic problems in wireless systems. For example, online power control for wireless transmission with energy harvesting and storage was studied for single-hop transmission [29] and two-hop relaying [30]. Online wireless network virtualization with perfect and imperfect CSI was studied in [31] and [32]. Online projected gradient descent and matrix exponential learning were leveraged in [33] and [34] for uplink covariance matrix design. Dynamic transmit covariance design for wireless fading systems was studied in [35]. Online periodic precoding updates for wireless network virtualization were considered in [36]. The above works focused on centralized problems for single-cell wireless systems.
Multi-cell cooperative precoding via multiple base stations (BSs) at the signal level can effectively mitigate inter-cell interference, and this has been shown to significantly improve the system performance. However, traditional cooperative precoding schemes focused on centralized offline problems with instantaneous CSI available at the CC [37], [38], [39]. The TRPs defined in 5G NR are much smaller in size compared with the traditional BSs and therefore have limited computation power. Furthermore, non-ideal backhaul communication links in practice have received a surge of attention in the 5G NR standardization. In this work, we apply the proposed HiOCO framework to an online multi-TRP cooperative precoding design problem with non-ideal backhaul links, by taking full advantage of the CSI timeliness at the TRPs and computation resources at the CC.
We formulate a new OCO problem over a heterogeneous master-worker network with communication delay, where the worker nodes have timely information about the local data but possibly less computation resources compared with the master node. At the beginning of each time slot, each worker node executes a local decision vector to minimize the accumulation of time-varying global costs. The local data at the worker nodes may be non-i.i.d. and the global cost functions may be non-separable.
We propose a new HiOCO framework that takes full advantage of the network heterogeneity in information timeliness and computation capacity. As disclosed here, HiOCO allows both central gradient descent at the master node and local gradient descent at the worker nodes for improved system performance. Furthermore, by communicating the aggregated global information and compressed local information, HiOCO can often reduce the communication overhead while preserving data privacy.
We analyze the special structure of HiOCO in terms of its hierarchical multi-step gradient descent with estimated gradients, in the presence of multi-slot delay. We prove that it can yield sublinear dynamic regret under mild conditions. Even with multi-slot delay, by increasing the estimated gradient descent steps at either the network edge or center, we can configure HiOCO to achieve a better dynamic regret bound compared with centralized inexact gradient descent algorithms.
We apply HiOCO to an online multi-TRP cooperative precoding design problem. Simulation under typical urban micro-cell Long-Term Evolution (LTE) settings demonstrates that both the central and local estimated gradient descent in HiOCO can improve system performance. In addition, HiOCO substantially outperforms both the centralized and distributed alternatives.
Embodiments disclosed here consider OCO over a heterogeneous network with communication delay, where the network edge executes a sequence of local decisions to minimize the accumulation of time-varying global costs. The local data may not be independent and identically distributed (i.i.d.) and the global cost functions may not be separable. Due to communication delays, neither the network center nor edge always has real-time information about the current global cost function. We propose a new framework, termed Hierarchical OCO (HiOCO), which takes full advantage of the network heterogeneity on information timeliness and computation capacity to enable multi-step estimated gradient descent at both the network center and edge.
For performance evaluation, we derive upper bounds on the dynamic regret of HiOCO, which measures the gap of costs between HiOCO and an offline global optimal performance benchmark. We show that the dynamic regret is sublinear under mild conditions. We further apply HiOCO to an online cooperative precoding design problem in multiple transmission/reception point (TRP) wireless networks with non-ideal backhaul links for 5G New Radio (NR). Simulation results demonstrate substantial performance gain of HiOCO over both the centralized and distributed alternatives.
The message passing and internal node calculations described below are also illustrated schematically in the accompanying drawings.
Let ƒt({dtc}c=1C, {xc}c=1C): ℝn→ℝ be the convex global cost function at time slot t. In the hierarchical computing network 100, the worker nodes 104 and the master node 102 cooperate to jointly select a sequence of decisions from the feasible sets 𝒳c, to minimize the accumulated time-varying global costs. This leads to the following optimization problem, referred to as P1:
We consider the general case where the global cost function may be non-separable among the worker nodes 104, i.e., ƒt({dtc}c=1C, {xc}c=1C) may not be expressed as the summation of C local cost functions that each corresponds only to the local data dtc and decision vector xc. Therefore, due to the coupling of both data and variables, each worker node c cannot compute the gradient of the global cost function w.r.t. its own decision vector based only on its local information. Instead, we assume the local gradient can be expressed as

∇xcƒt({dtc}c=1C, {xc}c=1C) ≜ hƒc(dtc, xc, gƒc({dtl}l≠c, {xl}l≠c)), (1)

where gƒc({dtl}l≠c, {xl}l≠c) is some global information function w.r.t. the local data and decision vectors at all the other worker nodes 104. The local gradient and global information functions depend on the specific form of the global cost function. We will show later that communicating the values of the global information functions, instead of the raw data and decision vectors, can often reduce the communication overhead.
For notational simplicity, in the following, we define the global feasible set as 𝒳 ≜ 𝒳1 × ⋯ × 𝒳C and denote the global cost function ƒt({dtc}c=1C, {xc}c=1C) as ƒt(x), where x ≜ [x1T, …, xCT]T.
Due to the lack of in-time information about the global cost function at either the worker nodes 104 or the master node 102, it is impossible to obtain an optimal solution to P1. In fact, even for the most basic centralized OCO problem [3], an optimal solution cannot be found [4]. Instead, we aim at selecting an online solution sequence {xt}t=1T that is asymptotically no worse than the dynamic benchmark {x*t}t=1T, given by

x*t ∈ arg minx∈𝒳 ƒt(x), ∀t. (2)
Note that x*t is computed with the current information about ƒt(x) at each time slot t and the resulting solution sequence {x*t}t=1T is a global optimal solution to P1. The corresponding dynamic regret is defined as
RETd ≜ Σt=1T (ƒt(xt) − ƒt(x*t)). (3)
An OCO algorithm is desired to provide sublinear dynamic regret with respect to the time horizon T, i.e., RETd = o(T).
Sublinearity is important since it implies that the online decision is asymptotically no worse than the dynamic benchmark in terms of its time-averaged performance. However, in the worst case, no online algorithm can achieve sublinear dynamic regret if the systems vary too drastically over time [40]. Therefore, the dynamic regret bounds are expressed in terms of different measures on system variations that represent the hardness of the problem. For a clear comparison on the dynamic regret bounds between HiOCO and existing literature, we introduce several common variation measures as follows.
Borrowing from [3], we define the following accumulated variation of an arbitrary sequence of reference points {rt}t=1T (which is termed the path length in [3]):
ΠT ≜ Σt=1T ∥rt − rt−1∥2. (4)
The online projected gradient descent algorithm in [3] achieved O(√(TΠT)) dynamic regret w.r.t. any sequence of reference points {rt}t=1T. Another version of the path length defined in [7] is
Π′T ≜ Σt=1T ∥rt − Φt(rt−1)∥2, (5)
where Φt(⋅) is a given function available at the decision maker to predict the current reference point. The dynamic mirror descent algorithm in [7] achieved O(√(TΠ′T)) dynamic regret. When the reference points are the optimal points, i.e., rt=x*t for any t, the resulting path length is defined as
Π*T ≜ Σt=1T ∥x*t − x*t−1∥2. (6)
There are some other related measures that can be used to characterize the system variation, e.g., the accumulated variation of the cost functions {ƒt(x)}t=1T, given by

ΘT ≜ Σt=1T supx∈𝒳 |ƒt(x) − ƒt−1(x)|, (7)

and the accumulated squared variation of the gradients, given by
Γ2,T ≜ Σt=1T ∥∇ƒt(xt) − ∇ƒt−1(xt−1)∥2². (8)
The optimistic mirror descent algorithm in [8] achieved a dynamic regret bound expressed in terms of Π*T, ΘT, and Γ2,T simultaneously.
The above OCO works [3], [7], [8] focused on general convex cost functions. With strongly convex cost functions, the one-step projected gradient descent algorithm in [9] improved the dynamic regret to O(Π*T). The multi-step gradient descent algorithm in [10] further improved the dynamic regret to O(Π*2,T), where Π*2,T is the squared path length defined as
Π*2,T ≜ Σt=1T ∥x*t − x*t−1∥2². (9)
Note that if Π*T or Π*2,T is sublinear, Π*2,T is often smaller than Π*T in the order sense. For instance, if ∥x*t − x*t−1∥2 ∝ Tρ for any t, then Π*T = O(T1+ρ) and Π*2,T = O(T1+2ρ). For a sublinear Π*T or Π*2,T, we have ρ<0, and therefore Π*2,T is smaller than Π*T in the order sense. In particular, if ρ=−1/2, we have Π*2,T = O(1) and Π*T = O(√T). The standard and proximal online gradient descent algorithms were respectively extended in [11] and [12] to accommodate inexact gradients. Both resulted in O(max{Π*T, ΔT}) dynamic regret, where ΔT is the accumulated gradient error defined as

ΔT ≜ Σt=1T ∥∇f̂t(xt) − ∇ƒt(xt)∥2, (10)

with ∇f̂t(⋅) being a given function available at the decision maker to predict the current gradient.
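To make the order-sense comparison above concrete, the following sketch (with an illustrative synthetic optimal-point sequence, not taken from the disclosure) computes the standard and squared path lengths when the per-slot variation decays like 1/√t, in which case Π*T grows roughly like √T while Π*2,T grows only logarithmically.

```python
import numpy as np

def path_length(xs):
    # standard path length (6): sum_t ||x*_t - x*_{t-1}||_2
    return sum(np.linalg.norm(xs[t] - xs[t - 1]) for t in range(1, len(xs)))

def squared_path_length(xs):
    # squared path length (9): sum_t ||x*_t - x*_{t-1}||_2^2
    return sum(np.linalg.norm(xs[t] - xs[t - 1]) ** 2 for t in range(1, len(xs)))

rng = np.random.default_rng(1)
T, n = 5000, 3
steps = [rng.normal(size=n) / np.sqrt(t + 1.0) for t in range(T)]  # ~ 1/sqrt(t)
xs = np.cumsum(steps, axis=0)
print(path_length(xs), squared_path_length(xs))
```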
In this section, we present details of HiOCO and study the impact of hierarchical multi-step estimated gradient descent on the performance guarantees of HiOCO to provide dynamic regret bounds. We further provide sufficient conditions under which HiOCO yields sublinear dynamic regret and discuss its performance merits over existing OCO frameworks.
Existing distributed OCO frameworks cannot be directly applied to solve the aforementioned minimization problem with non-separable global cost functions. As an alternative, one may apply a centralized OCO approach at the master node after it has received all the local data from the worker nodes. However, this way of solving the problem does not take advantage of either the more timely information at the worker nodes or the computation resources at the worker nodes. Different from existing OCO frameworks that are either centralized or fully distributed, in HiOCO, the master node and worker nodes cooperate in gradient estimation and decision updates, by taking full advantage of the network heterogeneity on information timeliness and computation capacity. For ease of exposition, we will first consider the case of zero local delay at the worker node but will later extend that to the case of non-zero local delay. In the following, we present the algorithms at the master node and worker nodes.
At the beginning of each time slot t, each worker node c executes its current local decision vector xtc and uploads it to the master node 102. To enable central gradient descent at the master node 102, each worker node c also needs to share information about the local data dtc with the master node 102. However, sending the raw data directly would incur a large amount of uplink overhead. Instead, each worker node c sends a compression lƒc(dtc) of the current local data to the master node 102. With a τru-slot remote uplink delay and a τrd-slot remote downlink delay, only the round-trip delay τr ≜ τru + τrd matters, so we can equivalently consider a τr-slot uplink delay and no downlink delay. Thus, at the beginning of each time slot t>τr, the master node 102 only has the τr-slot-delayed local decision vector xt−τrc and compressed local data lƒc(dt−τrc), from which it recovers an estimate d̂t−τrc of the delayed local data for each worker node c.
Remark 1. There is often a delay-accuracy tradeoff for the recovered data {d̂t−τrc}c=1C at the master node 102, since more accurate data at the master node 102 require less compression at the worker nodes 104 and more transmission time. If data privacy is a concern, the worker nodes 104 can add noise to the compressed data while sacrificing some system performance [41].
With the recovered local data {d̂t−τrc}c=1C, the master node 102 sets an intermediate decision vector x̂tc,0 = xt−τrc for each worker node c and performs Jr steps of estimated gradient descent, updating x̂tc,j for each j ∈ [1, Jr] by solving the optimization problem P2, where ∇f̂t−τr(x̂tc,j−1) is an estimated gradient constructed from the recovered delayed local data {d̂t−τrc}c=1C according to (11). The master node 102 then sends x̂tc,Jr and the corresponding delayed global information ĝƒc({d̂t−τrl}l≠c, {x̂tl,Jr}l≠c) to assist the local gradient descent at each worker node c.
Specifically, at 314, the master node 102 checks whether j≤Jr. If so, the master node 102 proceeds to 316; otherwise, the master node 102 proceeds to 322. Initially, j=1 when the master node 102 reaches 314 for the first time. At 316, an estimated gradient ∇f̂t−τr(x̂tc,j−1) is constructed according to equation (11). At 318, x̂tc,j is updated by solving the optimization problem P2. At 320, the index j is incremented by one, and the master node 102 returns to the check at 314. At 322, after the gradient descent has completed, the master node 102 sends the global decision vector x̂tc,Jr and the corresponding global information ĝƒc({d̂t−τrl}l≠c, {x̂tl,Jr}l≠c) to each worker node c. At 324, the algorithm ends.
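A minimal sketch of the master-side update loop is given below, under illustrative assumptions: grad_est stands in for the application-specific estimated gradient in (11), the feasible set is a ball so the P2-style update reduces to a projected step with step-size 1/α, and the toy cost and data are hypothetical.

```python
import numpy as np

def project(x, radius=1.0):
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

def master_descent(x_delayed, d_recovered, grad_est, J_r, alpha):
    # J_r estimated gradient steps starting from the delayed worker decision.
    x_hat = x_delayed.copy()
    for _ in range(J_r):
        g = grad_est(x_hat, d_recovered)    # estimated (delayed) global gradient
        x_hat = project(x_hat - g / alpha)  # projected step, then repeat
    return x_hat                            # sent to workers with global info

# Toy instance: f(x) = 0.5*||x - d||^2, with compression/delay error in d.
rng = np.random.default_rng(2)
d_true = rng.normal(size=6)
d_hat = d_true + 0.01 * rng.normal(size=6)  # recovered (noisy) local data
x_next = master_descent(np.zeros(6), d_hat, lambda x, d: x - d, J_r=4, alpha=2.0)
print(x_next)
```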
Worker Node c's Algorithm
When the global cost function is non-separable, each worker node c cannot compute the local gradient ∇ƒtc(xtc) = hƒc(dtc, xtc, gƒc({dtl}l≠c, {xtl}l≠c)) based only on its local data dtc. Therefore, in HiOCO, the master node 102 assists the local gradient estimation by communicating the corresponding delayed global information ĝƒc({d̂t−τrl}l≠c, {x̂tl,Jr}l≠c) to each worker node c. Note that due to the communication delay and data compression, the global information received by the worker nodes 104 is delayed and with errors.
At the beginning of each time slot t>τr, each worker node c receives the global decision vector x̂tc,Jr and the corresponding delayed global information ĝƒc({d̂t−τrl}l≠c, {x̂tl,Jr}l≠c) from the master node 102. Each worker node c then sets an intermediate decision vector x̃tc,0 = x̂tc,Jr and performs J1 steps of estimated gradient descent, updating x̃tc,j for each j ∈ [1, J1] by solving the optimization problem P3, where ∇f̂tc(x̃tc,j−1) is an estimated gradient based on the timely local data dtc and the delayed global information, given by (12). The above estimated gradient takes full advantage of the information timeliness at the worker nodes, as well as the central availability of information at the master node, to enable local gradient descent at the worker nodes for non-separable cost functions. Each worker node c then executes xtc = x̃tc,J1 as its local decision at time slot t.
Remark 2. For separable global cost functions, HiOCO can still be applied. In this case, it is still beneficial to perform centralized gradient descent for improved system performance, while sacrificing some communication overhead caused by uploading the compressed local data.
Remark 3. Single-step and multi-step gradient descent algorithms were provided in [9] and [10], while [11] and [12] proposed single-step inexact gradient descent algorithms. However, the algorithms in [9], [10], [11], [12] are centralized and under the standard OCO setting with one-slot delayed gradient information. In HiOCO, both the master node 102 and worker nodes 104 can perform multi-step estimated gradient descent in the presence of multi-slot delay.
After receiving the global decision vector x̂tc,Jr and the corresponding delayed global information from the master node 102, the worker node 104, at 410, sets an intermediate decision vector x̃tc,0 = x̂tc,Jr and performs J1 steps of estimated gradient descent.
Specifically, at 412, the worker node 104 checks whether j≤J1. If so, the worker node 104 proceeds to 414; otherwise, the worker node 104 proceeds to 420. Initially, j=1 when the worker node 104 reaches 412 for the first time. At 414, an estimated gradient ∇f̂tc(x̃tc,j−1) is constructed according to equation (12). At 416, x̃tc,j is updated by solving the optimization problem P3. At 418, the index j is incremented by one, and the worker node 104 proceeds to perform the check at 412. At 420, after the gradient descent has completed, the worker node 104 implements xtc = x̃tc,J1 as its local decision for time slot t.
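A companion sketch of the worker-side loop follows, under the same illustrative assumptions: grad_est stands in for hƒc in (12), mixing the timely local data with the delayed global information received from the master; the coupling term and constants are hypothetical.

```python
import numpy as np

def project(x, radius=1.0):
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

def worker_descent(x_from_master, d_local, g_delayed, grad_est, J_1, alpha):
    # J_1 local steps starting from the master's decision.
    x_tilde = x_from_master.copy()
    for _ in range(J_1):
        g = grad_est(x_tilde, d_local, g_delayed)  # timely data + delayed info
        x_tilde = project(x_tilde - g / alpha)     # P3-style projected step
    return x_tilde                                 # executed as the local decision

# Toy local gradient: local quadratic pull plus a delayed coupling term.
x_c = worker_descent(np.zeros(3), np.ones(3), 0.1 * np.ones(3),
                     lambda x, d, g: (x - d) + g, J_1=2, alpha=2.0)
print(x_c)
```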
In this section, we present new techniques to derive the dynamic regret bounds of HiOCO, particularly to account for its hierarchical multi-step estimated gradient descent with multi-slot delay. For clarity of exposition, proofs are omitted.
We make the following assumptions, which are common in the literature of OCO with strongly convex cost functions [9], [10], [11], [12]. Strongly convex objectives arise in many machine learning and signal processing applications, such as Lasso regression, support vector machines, and robust subspace tracking. For applications with general convex cost functions, adding a simple regularization term, e.g., (μ/2)∥x∥2², often does not sacrifice the system performance. We will show later that strong convexity yields a contraction relation between ∥xt+1 − x*t∥2² and ∥xt − x*t∥2², which can be leveraged to improve the dynamic regret bounds.
Assumption 1. For any t, ƒt(x) satisfies the following:
ƒt(x) is μ-strongly convex over 𝒳, i.e., ∃μ>0, s.t., for any x, y∈𝒳 and t,

ƒt(y) ≥ ƒt(x) + ∇ƒt(x)T(y − x) + (μ/2)∥y − x∥2². (13)
ƒt(x) is L-smooth over 𝒳, i.e., ∃L>0, s.t., for any x, y∈𝒳 and t,

ƒt(y) ≤ ƒt(x) + ∇ƒt(x)T(y − x) + (L/2)∥y − x∥2². (14)
The gradient of ƒt(x) is bounded, i.e., ∃D>0, s.t., for any x∈𝒳 and t,
∥∇ƒt(x)∥2≤D. (15)
Assumption 2. The radius of 𝒳 is bounded, i.e., ∃R>0, s.t., for any x, y∈𝒳,
∥x−y∥2≤R. (16)
We also require the following lemma, which is reproduced from Lemma 2.8 in [1].
Lemma 1. Let 𝒳⊆ℝn be a nonempty convex set. Let ƒ(x) be a μ-strongly-convex function over 𝒳. Let z ≜ arg minx∈𝒳 ƒ(x). Then, for any y∈𝒳, we have

ƒ(y) ≥ ƒ(z) + (μ/2)∥y − z∥2².
The following lemma is general and quantifies the impact of one-step estimated gradient descent in terms of the squared gradient estimation error. We further provide a sufficient condition under which the estimated gradient descent yields a decision closer to the optimal point.
Lemma 2. Assume ƒ(x): 𝒳→ℝ is μ-strongly-convex and L-smooth. Let z = 𝒫𝒳{y − (1/α)∇f̂(y)}, where 𝒫𝒳{·} denotes Euclidean projection onto 𝒳 and ∇f̂(y) is an estimated gradient of ∇ƒ(y), and let x* ≜ arg minx∈𝒳 ƒ(x). For any α>L and γ∈(0, 2μ), ∥z − x*∥2² is bounded by η∥y − x*∥2² plus a term proportional to the squared gradient estimation error ∥∇f̂(y) − ∇ƒ(y)∥2², where η∈(0, 1) is a contraction constant depending on α, γ, μ, and L. The sufficient condition for ∥z − x*∥2² < ∥y − x*∥2² is

∥∇f̂(y) − ∇ƒ(y)∥2² < γ(2μ − γ)∥y − x*∥2². (18)
Remark 4. The condition on the gradient estimation error in (18) is most easily satisfied when γ=μ. In this case, the contraction constant η recovers the one in [9]. Furthermore, as γ approaches 0, η approaches the contraction constant in [10]. Different from Proposition 2 in [9] and Lemma 5 in [10], Lemma 2 takes into account the impact of estimated gradient descent and thus generalizes the results in [9] and [10].
Remark 5. The optimal gradient descent step-size in [11] needs to be in a specific range based on the knowledge of μ, L, and υ from an additional assumption ∥∇f̂t(xt) − ∇ƒt(xt)∥2² ≤ ϵ² + υ²∥∇ƒt(xt)∥2² for some ϵ≥0 and υ≥0. The contraction analysis in [12] focuses on the proximal point algorithm and is substantially different from Lemma 2.
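The flavor of Lemma 2 can be spot-checked numerically; the sketch below (an unconstrained strongly convex quadratic with an artificially perturbed gradient, all values illustrative) verifies that one step along the estimated gradient still contracts toward the minimizer when the gradient error is small relative to the distance from the optimum, in the spirit of condition (18).

```python
import numpy as np

rng = np.random.default_rng(3)
mu, L, alpha = 1.0, 3.0, 4.0           # strong convexity, smoothness, alpha > L
A = np.diag([mu, 2.0, L])              # f(x) = 0.5 x^T A x, minimizer x* = 0
y = rng.normal(size=3)

u = rng.normal(size=3)
u /= np.linalg.norm(u)
err = 0.05 * np.linalg.norm(y) * u     # small error vs. ||y - x*||, cf. (18)
z = y - (A @ y + err) / alpha          # one step along the *estimated* gradient

print(np.linalg.norm(z) < np.linalg.norm(y))  # True: moved closer to x* = 0
```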
We examine the impact of hierarchical multi-step estimated gradient descent on the dynamic regret bounds for OCO, which has not been addressed in the existing literature. To this end, we define the accumulated squared gradient error as

Δ2,T ≜ Σt=1T maxx∈𝒳 ∥∇f̂t(x) − ∇ƒt(x)∥2². (19)

Similar to the relationship between the standard path length Π*T and the squared path length Π*2,T discussed above, Δ2,T is often smaller than ΔT in the order sense. Note that the maximum term in (19) is the maximum gradient estimation error and serves as an upper bound for the gradient estimates in (11) and (12). We use Δ2,T as a loose upper bound for our performance analysis since it covers more general gradient estimation schemes that can be adopted in HiOCO.
Leveraging results in Lemmas 1-2 and OCO techniques, the following theorem provides upper bounds on the dynamic regret RETd for HiOCO.
Theorem 3. For any α≥L, ξ>0, and γ∈(0, 2μ), the dynamic regret yielded by HiOCO is bounded as follows:

For any J1+Jr≥1 such that 2η^(J1+Jr)<1, we have a bound of order O(max{τ²Π*2,T, Δ2,T}) plus a term involving the accumulated squared gradient variation at the optimal points, Σt=1T∥∇ƒt(x*t)∥2².

For any J1+Jr≥1, we have a bound of order O(max{τΠ*T, ΔT}).
Extension with Local Delay
We now consider the case of non-zero local delay, i.e., at the beginning of each time slot t, each worker node c only has the τ1-slot-delayed local data dt−τ1c. In this case, the total delay is τ ≜ τ1 + τr.
The master node's algorithm with local delay may proceed as follows. The algorithm starts, the parameter α is initialized, and at the beginning of each time slot t>τ, the master node 102 receives xt−τrc and the compressed local data lƒc(dt−τc) from each worker node c, recovers the delayed local data d̂t−τc, performs Jr steps of estimated gradient descent as before, and sends the resulting global decision vector and corresponding delayed global information to each worker node c.
The worker node's algorithm with local delay may proceed as follows. The algorithm starts, the local decision vectors xtc∈𝒳c for any t≤τ are initialized, and at the beginning of each time slot t>τ, the worker node 104 receives x̂tc,Jr and the corresponding delayed global information from the master node 102, performs J1 steps of estimated gradient descent based on the τ1-slot-delayed local data dt−τ1c, executes the resulting local decision vector xtc, and uploads xtc and the compressed local data to the master node 102.
Using similar techniques in the proof of Theorem 3, we provide dynamic regret bounds for HiOCO in the presence of both local and remote delay.
Theorem 4. For any α≥L, ξ>0, and γ∈(0, 2μ), the dynamic regret yielded by HiOCO is bounded as follows:

For any J1+Jr≥1 such that 4η^(J1+Jr)<1, we have a bound of order O(max{τ²Π*2,T, Δ2,T}) plus a term involving the accumulated squared gradient variation at the optimal points, Σt=1T∥∇ƒt(x*t)∥2².

For any J1+Jr≥1, we have a bound of order O(max{τΠ*T, ΔT}).
Due to the local delay, Theorem 4 has a more stringent condition on the total number of gradient descent steps compared with Theorem 3. However, the order of the dynamic regret bound is dominated by the accumulated system variation measures and is often the same as the case without local delay.
In this section, we discuss the sufficient conditions for HiOCO to yield sublinear dynamic regret and highlight several prominent advantages of HiOCO over existing OCO frameworks. From Theorems 3 and 4, we can derive the following corollary regarding the dynamic regret bound.
Corollary 5. Suppose the accumulated squared variation of the gradient at the optimal points satisfies Σt=1T∥∇ƒt(x*t)∥2² = O(max{τ²Π*2,T, Δ2,T}). Then, from Theorems 3 and 4, we have

RETd = O(min{max{τΠ*T, ΔT}, max{τ²Π*2,T, Δ2,T}}).
Note that Σt=1T∥∇ƒt(x*t)∥2² is often small and the condition in Corollary 5 is commonly satisfied. In particular, if x*t is an interior point of 𝒳 or P1 is an unconstrained online problem, we have ∇ƒt(x*t)=0. From Corollary 5, a sufficient condition for HiOCO to yield sublinear dynamic regret is either max{τΠ*T, ΔT}=o(T) or max{τ²Π*2,T, Δ2,T}=o(T). Sublinearity of the accumulated system variation measures is necessary to achieve sublinear dynamic regret [40]. In many online applications, the system tends to stabilize and the gradient estimation becomes more accurate over time, leading to sublinear dynamic regret.
Remark 6. The centralized single-step and multi-step gradient descent algorithms achieved O(Π*T) and O(min{Π*T, Π*2,T}) dynamic regret in [9] and [10], respectively. HiOCO takes advantage of both the timely local and delayed global information to perform multi-step estimated gradient descent at both the master and worker nodes. Our dynamic regret bound analysis takes into account the impacts of the unique hierarchical update architecture, gradient estimation errors, and multi-slot delay on the performance guarantees of OCO, which were not considered in [9] and [10].
Remark 7. The centralized single-step inexact gradient descent algorithms in [11] and [12] achieved O(max{Π*T, ΔT}) dynamic regret under the standard OCO setting with one-slot delay. Note that, in the order sense, Π*2,T and Δ2,T are usually smaller than Π*T and ΔT, respectively. Therefore, even in the presence of multi-slot delay, HiOCO provides a better dynamic regret bound by increasing the number of estimated gradient descent steps, and it recovers the performance bounds in [11] and [12] as a special case.
The message passing and internal node calculations described below are also illustrated schematically in the accompanying drawings.
We consider a total of C TRPs 504 coordinated by the CC 502 to jointly serve K users 506 in the cooperative network 500. Each TRP c has Nc antennas, so there is a total of N = Σc=1C Nc antennas in the network 500. Let Htc ∈ ℂK×Nc denote the local channel state between TRP c and the K users 506 at time slot t, and let Ht ≜ [Ht1, …, HtC] ∈ ℂK×N denote the corresponding global channel state.
For ease of illustration only, here we consider the case where there is no local delay at the TRPs to collect the local CSI. However, embodiments may also cover the case of non-zero local delay as explained above. At each time slot t, each TRP c has the current local CSI Htc and implements a local precoding matrix Vtc ∈ ℂNc×K, which is selected from the convex feasible set

𝒱c ≜ {Vc : ∥Vc∥F² ≤ Pmaxc} (20)

to meet the per-slot maximum transmit power limit. Let Vt ≜ [(Vt1)T, …, (VtC)T]T ∈ ℂN×K denote the global precoding matrix. The actual received signal vector (noiseless) at the K users 506 is given by

yt = HtVtst,

where st ∈ ℂK×1 contains the transmitted signals from the TRPs to all K users 506, which are assumed to be independent of each other with unit power, i.e., 𝔼{ststH} = I, ∀t.
We first consider idealized backhaul communication links, where each TRP c communicates Htc to the CC 502 without delay. The CC 502 then has the global CSI Ht at time slot t and designs a desired global precoder Wt ∈ ℂN×K to meet the per-TRP maximum power limits. The design of Wt can be based on the service needs of the K users 506 and is not limited to any specific precoding scheme. For the CC 502 with Wt, the desired received signal vector (noiseless) ỹt is given by

ỹt = HtWtst.
With the TRPs' 504 actual precoding matrix Vt and the desired precoder Wt at the CC 502, the expected deviation of the actual received signal vector at all K users 506 from the desired one is given by 𝔼{∥yt − ỹt∥2²} = ∥HtVt − HtWt∥F². We define the precoding deviation of the TRPs' 504 precoding from the desired precoder at the CC 502 as

ƒt(V) ≜ ∥HtV − HtWt∥F², ∀t, (21)
which is a strongly convex cost function.
Note that due to the coupling of the local channel states {Htc}c=1C and local precoders {Vtc}c=1C, the cost function ƒt(V) is not separable among the TRPs 504. Furthermore, the local gradient at each TRP c depends on the local channel state Htc, the local precoder Vtc, and the channel states {Htl}l≠c and precoders {Vtl}l≠c at all the other TRPs 504, given by

∇Vcƒt(V) = (Htc)H(Σl=1C HtlVtl − HtWt). (22)
The goal of the multi-TRP cooperative network 500 is to minimize the accumulation of the precoding deviation subject to per-TRP maximum transmit power limits with non-ideal backhaul communication links. The online optimization problem is in the same form as P1 with {Htc}c=1C being the local data, {Vtc∈c}c=1C being the local decision vectors, and ƒt(V) being the global cost function.
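For illustration, the cost (21) and its block gradient (22) can be evaluated as in the following sketch (dimensions are illustrative, and the constant factor from differentiating the squared Frobenius norm is absorbed into the step-size, matching the gradient convention visible in this disclosure).

```python
import numpy as np

def precoding_cost(H, V, W):
    # f_t(V) = ||H V - H W||_F^2, as in (21)
    return np.linalg.norm(H @ (V - W), "fro") ** 2

def block_grad(H_blocks, V_blocks, H, W, c):
    # gradient w.r.t. V_c as in (22): H_c^H (sum_l H_l V_l - H W); all of the
    # residual except TRP c's own term is the global information for TRP c.
    residual = sum(Hl @ Vl for Hl, Vl in zip(H_blocks, V_blocks)) - H @ W
    return H_blocks[c].conj().T @ residual

rng = np.random.default_rng(4)
K, Nc, C = 4, 3, 2
H_blocks = [rng.normal(size=(K, Nc)) + 1j * rng.normal(size=(K, Nc)) for _ in range(C)]
H = np.hstack(H_blocks)
V_blocks = [rng.normal(size=(Nc, K)) + 0j for _ in range(C)]
W = rng.normal(size=(C * Nc, K)) + 0j
print(precoding_cost(H, np.vstack(V_blocks), W))
print(block_grad(H_blocks, V_blocks, H, W, 0).shape)  # (Nc, K)
```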
For non-ideal backhaul links with τru-slot uplink and τrd-slot downlink communication delays, as illustrated herein, only the round-trip communication delay τr = τru + τrd matters, and we can equivalently consider a τr-slot uplink delay and no downlink delay. At each time slot t, each TRP c has the timely local CSI Htc and implements a local precoder Vtc. If communication overhead is a concern, instead of sending the complete CSI Htc, each TRP c can send a compressed local CSI Ltc to the CC 502. Due to the communication delay and CSI compression, the CC 502 recovers a delayed global channel state Ĥt−τr.
Leveraging the proposed HiOCO framework, we now provide hierarchical solutions to the formulated online multi-TRP cooperative precoding design problem.
At the beginning of each time slot t>τr, the CC 502 receives the precoding matrices {Vt−τrc}c=1C and the compressed local CSI {Lt−τrc}c=1C from the TRPs 504 and recovers the delayed global CSI Ĥt−τr. It then sets V̂tc,0 = Vt−τrc for each TRP c and performs Jr steps of estimated gradient descent, updating V̂tc,j via a projected gradient step, where 𝒫𝒱c{·} is the projection operator onto the convex feasible set 𝒱c and ∇f̂t−τr(·) is an estimation of the gradient at time slot t−τr. The CC 502 then communicates the intermediate precoder V̂tc,Jr and the corresponding aggregated global information Ĝt−τc to each TRP c. Note that instead of sending the delayed global channel state Ĥt−τr and the precoders of all the other TRPs, the CC 502 sends the aggregated global information Ĝt−τc, which can often reduce the downlink communication overhead.
Each TRP c can implement any local precoder in 𝒱c for any t∈[1, τr]. At the beginning of each time slot t>τr, after receiving the intermediate precoder V̂tc,Jr and the delayed global information Ĝt−τc from the CC 502, each TRP c sets Ṽtc,0 = V̂tc,Jr and performs J1 steps of estimated gradient descent, where ∇f̂tc(·) is an estimation of the current gradient based on the timely local CSI Htc and the delayed global information Ĝt−τc. Finally, each TRP c uses Vtc = Ṽtc,J1 as its local precoder at time slot t and uploads it, together with the compressed local CSI Ltc, to the CC 502.
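A sketch of one slot of these hierarchical precoding updates is given below under illustrative assumptions: the CC runs Jr projected steps with delayed CSI, the TRP continues with J1 steps using its timely CSI and the same delayed aggregated global information, and project_power implements the closed-form projection onto the power ball in (20).

```python
import numpy as np

def project_power(V, P_max):
    # closed-form projection onto {V : ||V||_F^2 <= P_max}, cf. (20)
    p = np.linalg.norm(V, "fro") ** 2
    return V if p <= P_max else V * np.sqrt(P_max / p)

def gradient_steps(V0, H_c, G_c, J, alpha, P_max):
    # J projected steps for TRP c's block; the residual H_c V + G_c carries
    # the coupling, with G_c aggregating the other TRPs and the desired precoder.
    V = V0.copy()
    for _ in range(J):
        grad = H_c.conj().T @ (H_c @ V + G_c)
        V = project_power(V - grad / alpha, P_max)
    return V

rng = np.random.default_rng(5)
K, Nc, P_max, alpha = 4, 3, 1.0, 50.0
H_delayed = rng.normal(size=(K, Nc))                     # CSI available at the CC
H_timely = H_delayed + 0.01 * rng.normal(size=(K, Nc))   # timely CSI at the TRP
G_c = 0.1 * rng.normal(size=(K, K))                      # delayed global info

V = gradient_steps(np.zeros((Nc, K)), H_delayed, G_c, J=8, alpha=alpha, P_max=P_max)  # CC
V = gradient_steps(V, H_timely, G_c, J=1, alpha=alpha, P_max=P_max)                   # TRP
print(np.linalg.norm(V, "fro") ** 2 <= P_max + 1e-9)
```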
Note that the optimal precoding solution is V*t = Wt at each time slot t. However, with non-ideal backhaul links, each TRP c cannot receive V*tc from the CC 502 in time to implement it at each time slot t. A naive solution is to implement the delayed optimal solution V*t−τr instead.
We assume that the channel power is bounded by a constant B>0 at any time t, given by
∥Ht∥F2≤B. (23)
In the following Lemma, we show that the formulated online multi-TRP cooperative precoding design problem satisfies Assumptions 1 and 2 made above.
Lemma 6. Assume the channel power is bounded as in (23). Then, Assumptions 1 and 2 hold with the corresponding constants given by μ=2, L=B, D=2B√(Σc=1C Pmaxc), and R=2√(Σc=1C Pmaxc).
Leveraging the results in Theorems 3 and 4, and noting that the gradient of the optimal precoder satisfies ∇ƒt(V*t) = HtH(HtV*t − HtWt) = 0, the following corollary provides the dynamic regret bounds yielded by the hierarchical online precoding solution sequence {Vt}t=1T.
Corollary 7. The dynamic regret bounds in Theorems 3 and 4 hold for {Vt}t=1T generated by HiOCO, with the constants μ, L, D, and R given in Lemma 6 and Σt=1T∥∇ƒt(V*t)∥F2=0.
In this section, we present simulation results under typical urban micro-cell LTE network settings. We study the impact of various system parameters on the convergence and performance of HiOCO. We numerically demonstrate the performance advantage of HiOCO over both the centralized and distributed alternatives.
We consider an urban hexagon micro-cell of radius 500 m with C=3 equally separated TRPs, each equipped with Nc=16 antennas. We consider 5 co-located users in the middle of every two adjacent TRPs, for a total of K=15 users in the network. Following the standard LTE specification [42], as default system parameters, we set the maximum transmit power limit Pmaxc=30 dBm, noise power spectral density N0=−174 dBm/Hz, and noise figure NF=10 dB, and we focus on the channel over one subcarrier with bandwidth BW=15 kHz. We model the fading channel between each user k and each TRP c as a first-order Gauss-Markov process ht+1c,k = αh htc,k + ztc,k, where htc,k ~ 𝒞𝒩(0, βc,kI) with βc,k [dB] = −31.54 − 33 log10(dc,k) − φc,k represents the path-loss and shadowing effects, dc,k is the distance in kilometers from TRP c to user k, φc,k ~ 𝒩(0, σφ²) is the shadowing effect that is used to model the variation of user positions with σφ² = 8 dB, αh ∈ [0,1] is the channel correlation coefficient, and ztc,k ~ 𝒞𝒩(0, (1−αh²)βc,kI) is independent of htc,k. We set αh=0.998 as default, which corresponds to a user speed of 1 km/h. We consider that each TRP c communicates the accurate local CSI Htc to the CC, since the impact of channel compression error can be emulated by increasing the communication delay τr.
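The fading model above can be simulated directly; in the sketch below (the TRP-to-user distance is an illustrative assumption), the recursion ht+1 = αh·ht + zt keeps the expected channel power stationary at the path-loss/shadowing level β.

```python
import numpy as np

rng = np.random.default_rng(6)
Nc, T, a_h = 16, 2000, 0.998                 # antennas, slots, correlation

d_km = 0.25                                  # illustrative TRP-to-user distance
shadow_dB = rng.normal(0.0, np.sqrt(8.0))    # shadowing with sigma^2 = 8 dB
beta_dB = -31.54 - 33.0 * np.log10(d_km) - shadow_dB
beta = 10.0 ** (beta_dB / 10.0)              # linear-scale channel power

h = np.sqrt(beta / 2) * (rng.normal(size=Nc) + 1j * rng.normal(size=Nc))
powers = []
for _ in range(T):
    z = np.sqrt((1 - a_h**2) * beta / 2) * (rng.normal(size=Nc) + 1j * rng.normal(size=Nc))
    h = a_h * h + z                          # first-order Gauss-Markov update
    powers.append(np.sum(np.abs(h) ** 2))

print(np.mean(powers), Nc * beta)            # empirical power ~ Nc * beta
```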
For our performance study, we assume the CC adopts cooperative zero-forcing (ZF) precoding, given by

WtZF = √(PtZF) HtH(HtHtH)−1,
where PtZF is a power normalizing factor. Note that we must have N≥K to perform ZF precoding. We assume all K users have the same noise power σn² (determined by the noise figure NF, the noise power spectral density N0, and the bandwidth BW), and therefore all the users will have the same data rate.
The CC adopts the power normalizing factor PtZF chosen as the optimal solution of the sum-rate maximization problem under per-TRP maximum transmit power limits.
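A sketch of this cooperative ZF precoder follows; the per-TRP normalization shown (scaling the common factor until the tightest TRP power constraint is met) is one natural reading of the disclosure, with the exact factor PtZF given in the original.

```python
import numpy as np

rng = np.random.default_rng(7)
K, Nc, C = 6, 8, 3
N = C * Nc                                            # need N >= K for ZF
P_max = [1.0, 1.0, 1.0]

H = (rng.normal(size=(K, N)) + 1j * rng.normal(size=(K, N))) / np.sqrt(2)
W0 = H.conj().T @ np.linalg.inv(H @ H.conj().T)       # unnormalized ZF direction

blocks = np.split(W0, C, axis=0)                      # per-TRP blocks, Nc x K
P_zf = min(P_max[c] / np.linalg.norm(blocks[c], "fro") ** 2 for c in range(C))
W = np.sqrt(P_zf) * W0                                # sqrt(P_t^ZF) H^H (H H^H)^{-1}

print(np.allclose(H @ W, np.sqrt(P_zf) * np.eye(K)))  # inter-user interference nulled
```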
As performance metrics, we define the time-averaged normalized precoding deviation and the time-averaged per-user rate, where the per-user rate is computed from the signal-to-interference-plus-noise ratio (SINR) of each user k.
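The two metrics can be computed as in the following sketch (the normalization by the desired signal power and the unit noise value are illustrative assumptions; the exact normalizations appear in the original figures).

```python
import numpy as np

def normalized_deviation(H, V, W):
    # precoding deviation (21), normalized by the desired signal power
    return (np.linalg.norm(H @ (V - W), "fro") ** 2
            / np.linalg.norm(H @ W, "fro") ** 2)

def per_user_rates(H, V, noise_power):
    S = H @ V                                  # S[k, j]: stream j seen by user k
    sig = np.abs(np.diag(S)) ** 2              # intended signal power of user k
    interference = np.sum(np.abs(S) ** 2, axis=1) - sig
    sinr = sig / (interference + noise_power)
    return np.log2(1.0 + sinr)                 # per-user rate (bits/s/Hz)

rng = np.random.default_rng(8)
K, N = 4, 8
H = (rng.normal(size=(K, N)) + 1j * rng.normal(size=(K, N))) / np.sqrt(2)
W = H.conj().T @ np.linalg.inv(H @ H.conj().T)   # desired (ZF) precoder
V = W + 0.05 * (rng.normal(size=(N, K)) + 1j * rng.normal(size=(N, K)))
print(normalized_deviation(H, V, W))
print(per_user_rates(H, V, noise_power=0.1))
```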
Next, we study the impact of channel correlation on the performance of HiOCO. Note that as αh increases, the accumulated system variation measures become smaller, leading to better dynamic regret bounds; the simulation results confirm this trend.
For performance comparison, we consider the delayed optimal precoder V*t−τr, centralized OCO with Jr=8 and Jr=1 steps of gradient descent, and dynamic and fixed user association schemes in which each TRP serves its associated users based on the available local CSI, with τ1=0 and τr=4. We observe that HiOCO achieves the best system performance among all of the above alternative schemes. Furthermore, by performing only J1=1 additional step of local gradient descent at the TRPs, HiOCO achieves substantial performance gain over centralized OCO with Jr=8 steps of gradient descent. The user association schemes, although based on the timely local CSI, perform worse than the other alternatives, since the TRPs are not coordinated to jointly serve the users.
We further study the impact of the number of antennas Nc and the number of users K.
Embodiments provide OCO over a heterogeneous master-worker network with communication delay, to make a sequence of online local decisions to minimize some accumulated global convex cost functions. The local data at the worker nodes may be non-i.i.d. and the global cost functions may be non-separable.
We propose a new HiOCO framework, which takes full advantage of the network heterogeneity in information timeliness and computation capacity, to enable multi-step estimated gradient descent at both the master and worker nodes. Our analysis considers the impacts of multi-slot delay, gradient estimation error, and the hierarchical architecture on the performance guarantees of HiOCO, to show sublinear dynamic regret bounds under mild conditions.
We apply HiOCO to a multi-TRP cooperative network with non-ideal backhaul links for 5G NR. We take full advantage of the information timeliness on CSI and computation resources at both the TRPs and the CC to improve the system performance. By sharing the compressed local CSI and delayed global information, both the uplink and downlink communication overhead can be greatly reduced. The cooperative precoding solutions at both the TRPs and the CC are in closed form with low computational complexity.
Notes on the performance of the proposed methods: We numerically validate the performance of the proposed hierarchical precoding solution for multi-TRP cooperative networks under typical LTE cellular network settings. Extensive simulation results are provided to demonstrate the impact of the number of estimated gradient descent steps, channel correlation, remote and local delay, and the number of antennas and users. Simulation results demonstrate the superior delay tolerance and substantial performance advantage of HiOCO over both the centralized and distributed alternatives under different scenarios.
Process 1500 is a method for performing online convex optimization, performed e.g. by a master node such as master node 102 and/or CC 502.
Step 1502 comprises receiving, from two or more worker nodes, a local decision vector and local data corresponding to each of the two or more worker nodes.
Step 1504 comprises performing a multi-step gradient descent based on the local decision vector and the local data received from the two or more worker nodes, wherein performing the multi-step gradient descent comprises determining a global decision vector and corresponding global information.
Step 1506 comprises sending, to each of the two or more worker nodes, the global decision vector and corresponding global information.
In some embodiments, the local data received from each of the two or more worker nodes is compressed, and the method further comprises uncompressing the local data received from each of the two or more worker nodes. In some embodiments, performing the multi-step gradient descent further comprises: initializing an intermediate decision vector x̂tc,0 = xt−τrc for each of the two or more worker nodes c; and, for each step j: (1) determining an estimated gradient ∇f̂t−τr(x̂tc,j−1) based on the local decision vectors and the recovered local data; and (2) updating x̂tc,j for each of the two or more worker nodes c, by solving an optimization problem for x̂tc,j based on the estimated gradients; where:
xt−τrc refers to the local decision vectors received from each of the two or more worker nodes,
lƒc(dt−τrc) refers to compressed local data for each of the two or more worker nodes that is based on the local data received from each of the two or more worker nodes,
j∈[1, Jr], and
Jr refers to the number of steps of the multi-step gradient descent.
In some embodiments, the estimated gradient is given by equation (11), the optimization problem is given by P2, and the corresponding global information for a given worker node c is given by ĝƒc({d̂t−τrl}l≠c, {x̂tl,Jr}l≠c), where:

∇f̂t−τr( ) refers to the estimated gradient function, and

hƒc( ) refers to a general function.
In some embodiments, the local data corresponding to each of the two or more worker nodes has a non-zero local delay. In some embodiments, the two or more worker nodes comprise transmission/reception points (TRPs), the local data corresponds to local channel state information, and the local decision vectors correspond to precoding matrices. In some embodiments, performing the multi-step gradient descent further comprises: initializing an intermediate precoding matrix V̂tc,0 = Vt−τrc for each of the two or more TRPs c; and, for each step j: (1) determining an estimated gradient based on the local precoding matrices and the local channel state information; and (2) updating V̂tc,j for each of the two or more TRPs c, by solving an optimization problem for V̂tc,j based on the estimated gradients; where:
C refers to the number of the two or more worker nodes,
c is an index referring to a specific one of the two or more worker nodes
t refers to the current time slot,
τr refers to a round-trip remote delay,
Vt−τrc refers to the local precoding matrices received from each of the two or more TRPs,
Lt−τrc refers to compressed local channel state information for each of the two or more TRPs that is based on the local channel state information received from each of the two or more TRPs,
j∈[1, Jr], and
Jr refers to the number of steps of the multi-step gradient descent.
In some embodiments, the estimated gradient is given by an estimate ∇f̂t−τ( ) of the gradient in (22) computed from the recovered delayed global CSI, the optimization problem is a projected gradient update onto 𝒱c, and the corresponding global information for a given TRP c is given by Ĝt−τc = Σl=1,l≠cC Ĥt−τl V̂tl,Jr − Ĥt−τ Ŵt−τ, where:

𝒫𝒱c{ } is the projection operator onto the convex feasible set 𝒱c,

Ŵt−τ refers to the delayed desired precoder recovered at the CC,

∇f̂t−τ( ) refers to the estimated gradient function, and

α refers to a fixed parameter.
Process 1600 is a method for performing online convex optimization, performed e.g. by a worker node such as worker node 104 and/or TRP 504.
Step 1602 comprises receiving, from a master node, a global decision vector and corresponding global information, wherein the global information has a time delay associated with it.
Step 1604 comprises performing a multi-step gradient descent based on the global decision vector and local data, wherein performing the multi-step gradient descent comprises determining a local decision vector.
Step 1606 comprises sending, to the master node, the local decision vector and local data.
In some embodiments, the local data sent to the master node is compressed prior to sending. In some embodiments, performing the multi-step gradient descent further comprises: initializing an intermediate decision vector x̃tc,0 = x̂tc,Jr; and, for each step j: (1) determining an estimated gradient ∇f̂tc(x̃tc,j−1) based on the global decision vector, the global information, and the local data; and (2) updating x̃tc,j, by solving an optimization problem for x̃tc,j based on the estimated gradient; where:
c is an index referring to a worker node corresponding to the local data,
t refers to the current time slot,
τr refers to a round-trip remote delay,
dtc refers to the local data,
x̂tc,Jr refers to the global decision vector,

ĝƒc({d̂t−τrl}l≠c, {x̂tl,Jr}l≠c) refers to the global information,
In some embodiments, the estimated gradient is given by equation (12), the optimization problem is given by P3, and the local decision vector is given by xtc = x̃tc,J1, where:
∇f̂tc( ) refers to a local gradient function,

hƒc( ) refers to a general function,
𝒳c refers to a compact convex feasible set, and
α refers to a fixed parameter.
In some embodiments, the local data has a non-zero local delay. In some embodiments, the local data corresponds to local channel state information, and the local decision vectors correspond to precoding matrices. In some embodiments, performing the multi-step gradient descent further comprises: initializing an intermediate precoding matrix Ṽtc,0 = V̂tc,Jr; and, for each step j: (1) determining an estimated gradient based on the local channel state information and the global information; and (2) updating Ṽtc,j, by solving an optimization problem for Ṽtc,j based on the estimated gradient; where:
c is an index referring to a worker node corresponding to the local data,
t refers to the current time slot,
τr refers to a round-trip remote delay,
τ1 refers to a local delay,
τ refers to the total delay,
Htc refers to the local channel state information,
V̂tc,Jr refers to the intermediate precoding matrix received from the master node,
Ĝt−τc refers to the global information,
j∈[1,J1], and
J1 refers to the number of steps of the multi-step gradient descent.
In some embodiments, the estimated gradient is given by an estimate ∇f̂t−τ1c( ) of the gradient in (22) computed from the τ1-delayed local channel state information and the delayed global information Ĝt−τc, the optimization problem is a projected gradient update onto 𝒱c, and the local precoding matrix is given by Vtc = Ṽtc,J1, where:

𝒫𝒱c{ } is the projection operator onto the convex feasible set 𝒱c,

∇f̂t−τ1c( ) refers to the estimated gradient function, and
α refers to a fixed parameter.
While various embodiments of the present disclosure are described herein, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present disclosure should not be limited by any of the above-described example embodiments. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.
Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, and some steps may be performed in parallel.
Filing Document: PCT/IB2022/050212; Filing Date: 1/12/2022; Country: WO.

Priority Application Number: 63144257; Date: Feb 2021; Country: US.