HIERARCHICAL ONLINE CONVEX OPTIMIZATION

Information

  • Patent Application
  • Publication Number: 20240119355
  • Date Filed: January 12, 2022
  • Date Published: April 11, 2024
Abstract
A method for performing online convex optimization is provided. The method includes receiving, from two or more worker nodes, a local decision vector and local data corresponding to each of the two or more worker nodes. The method includes performing a multi-step gradient descent based on the local decision vector and the local data received from the two or more worker nodes. Performing the multi-step gradient descent includes determining a global decision vector and corresponding global information. The method includes sending, to each of the two or more worker nodes, the global decision vector and corresponding global information.
Description
TECHNICAL FIELD

Disclosed are embodiments related to online convex optimization.


BACKGROUND

Many machine learning, signal processing, and resource allocation problems can be cast into a dynamic optimization problem with time-varying convex cost functions. Online convex optimization (OCO) provides the tools to handle dynamic problems in the presence of uncertainty, where an online decision strategy evolves based on the historical information [1], [2] (bracketed numbers refer to references at the end of this disclosure). OCO can be seen as a discrete-time sequential decision-making process by an agent in a system. At the beginning of each time slot, the agent makes a decision from a convex feasible set. The system reveals information about the current convex cost function to the agent only at the end of each time slot. The lack of in-time information prevents the agent from making an optimal decision at each time slot. Instead, the agent resorts to minimizing the regret, which is the performance gap between the online decision sequence and some benchmark solution. A desired online decision sequence should be asymptotically no worse than the performance benchmark, i.e., achieving regret that at most grows sublinearly over time.


Most of the early works on OCO studied the static regret, which compares the online decision sequence with a static offline benchmark [3], [4], [5], [6]. However, the optimum of dynamic problems is often time varying. As a rather coarse performance metric, achieving sublinear static regret may not be meaningful since the static offline benchmark itself may perform poorly. A more attractive dynamic regret was first proposed in [3], where the offline benchmark solution can be time varying. It is well known that in the worst case, it is impossible to obtain sublinear dynamic regret, since drastic variations of the underlying systems can make the online problem intractable. Therefore, dynamic regret bounds are often expressed w.r.t. the accumulated system variations that reflect the hardness of the problem. Theoretical guarantees on the dynamic regret for OCO with general cost functions were studied in [3], [7], and [8], while the case of strongly convex cost functions was studied in [9], [10], [11], and [12].


The above OCO frameworks do not consider the network heterogeneity in information timeliness and computation capacity present in many practical applications. For example, consider the multiple transmission/reception point (TRP) cooperative network with non-ideal backhaul links for 5G New Radio (NR) [13], where each TRP has a priori local channel state information (CSI) but less computation capacity compared with a central controller (CC). In mobile edge computing [14], the remote processors have timely information about the computing tasks but may offload some tasks to the edge server due to the limitation on local computation resources [15]. Another example is self-driving vehicular networks, where each vehicle moves based on its real-time sensing while reporting local observations to a control center for traffic routing or utility maximization. In these applications, data are distributed over the network edge and vary over time. Furthermore, the network edge needs to make real-time local decisions to minimize the global costs. However, due to the coupling of data and variables, the global cost function may be non-separable, i.e., it may not be expressed as a summation of local cost functions at the network edge.


Algorithms for non-separable global cost minimization problems, such as block coordinate descent [16] and the alternating direction method of multipliers [17], [18], are centralized in nature, as they implicitly assume there is a central node that coordinates the iterative communication and computation processes. However, with distributed data at the network edge, centralized solutions suffer from high communication overhead and performance degradation due to communication delay. Furthermore, existing distributed online optimization frameworks such as parallel stochastic gradient descent [19], federated learning [20], and distributed OCO [21] are confined to separable global cost functions. Specifically, each local cost function depends only on the local data, which allows each node to locally compute the gradient without information about the data at all the other nodes. Therefore, these distributed online frameworks cannot be directly applied to non-separable global cost minimization problems, such as the multi-TRP cooperative precoding design problem considered in this invention, where downlink transmissions at the TRPs are coupled by broadcasting channels.


SUMMARY

It is therefore challenging to develop an online learning framework that takes full advantage of the network heterogeneity on information timeliness and computation capacity, while allowing the global cost functions to be non-separable. In this work, we propose a new Hierarchical Online Convex Optimization (HiOCO) framework for dynamic problems over a heterogeneous master-worker network with communication delay. The local data may not be independent and identically distributed (i.i.d.) and the global cost function may not be separable. We consider network heterogeneity, such that the worker nodes have more timely information about the local data but possibly less computation resources compared with the master node. As disclosed here, HiOCO is a framework that takes full advantage of both the timely local and delayed global information, while allowing gradient descent at both the network edge and control center for improved system performance. Our incorporation of non-separable global cost functions over a master-worker network markedly broadens the scope of OCO.


According to a first aspect, a method for performing online convex optimization is provided. The method includes receiving, from two or more worker nodes, a local decision vector and local data corresponding to each of the two or more worker nodes. The method includes performing a multi-step gradient descent based on the local decision vector and the local data received from the two or more worker nodes, wherein performing the multi-step gradient descent comprises determining a global decision vector and corresponding global information. The method includes sending, to each of the two or more worker nodes, the global decision vector and corresponding global information.


According to a second aspect, a method for performing online convex optimization is provided. The method includes receiving, from a master node, a global decision vector and corresponding global information, wherein the global information has a time delay associated with it. The method includes performing a multi-step gradient descent based on the global decision vector and local data, wherein performing the multi-step gradient descent comprises determining a local decision vector. The method includes sending, to the master node, the local decision vector and local data.


According to a third aspect, a master node for performing online convex optimization is provided. The master node includes processing circuitry and a memory containing instructions executable by the processing circuitry. The processing circuitry is operable to receive, from two or more worker nodes, a local decision vector and local data corresponding to each of the two or more worker nodes. The processing circuitry is operable to perform a multi-step gradient descent based on the local decision vector and the local data received from the two or more worker nodes, wherein performing the multi-step gradient descent comprises determining a global decision vector and corresponding global information. The processing circuitry is operable to send, to each of the two or more worker nodes, the global decision vector and corresponding global information.


According to a fourth aspect, a worker node for performing online convex optimization is provided, the worker node comprising processing circuitry and a memory containing instructions executable by the processing circuitry. The processing circuitry is operable to receive, from a master node, a global decision vector and corresponding global information, wherein the global information has a time delay associated with it. The processing circuitry is operable to perform a multi-step gradient descent based on the global decision vector and local data, wherein performing the multi-step gradient descent comprises determining a local decision vector. The processing circuitry is operable to send, to the master node, the local decision vector and local data.


According to a fifth aspect, a computer program is provided comprising instructions which, when executed by processing circuitry of a node, cause the node to perform the method of any embodiment of the first and second aspects.


According to a sixth aspect, a carrier containing the computer program of the fifth aspect is provided, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.





BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments.



FIG. 1 illustrates an example network according to an embodiment.



FIG. 2 illustrates an example network according to an embodiment.



FIG. 3 illustrates a flow chart according to an embodiment.



FIG. 4 illustrates a flow chart according to an embodiment.



FIG. 5 illustrates an example network according to an embodiment.



FIG. 6 illustrates an example network according to an embodiment.



FIGS. 7-14 illustrate graphs according to embodiments.



FIG. 15 illustrates a flow chart according to an embodiment.



FIG. 16 illustrates a flow chart according to an embodiment.



FIG. 17 is a block diagram of an apparatus according to an embodiment.





DETAILED DESCRIPTION

In the seminal work of OCO [3], an online projected gradient descent algorithm achieved 𝒪(√T) static regret with bounded feasible set and gradient, where T is the total time horizon. The static regret was shown to be unavoidably Ω(√T) for general convex cost functions without additional assumptions, and was further improved to 𝒪(log T) for strongly convex cost functions [4]. Moreover, [5] provided 𝒪(√(τT)) static regret in the presence of τ-slot delay, and [6] studied OCO with adversarial delays. First introduced in [3], the dynamic regret of OCO has received a recent surge of interest [7], [8]. Strong convexity was shown to improve the dynamic regret bound in [9]. By increasing the number of gradient descent steps, the dynamic regret bound was further improved in [10]. The standard and proximal online gradient descent algorithms were respectively extended to accommodate inexact gradients in [11] and [12]. Below, we compare the settings and dynamic regret bounds of these works in more detail.


The above OCO algorithms are centralized. Distributed online optimization of a sum of local convex cost functions was studied in [21], [22], [23], [24], [25], [26], [27]. Early works on distributed OCO focused on static regret [21], [22], [23], [24] while more recent works studied dynamic regret [25], [26], [27]. However, existing distributed OCO works are over fully distributed networks with separable global cost functions.


Online frameworks such as Lyapunov optimization [28] and OCO have been applied to solve many dynamic problems in wireless systems. For example, online power control for wireless transmission with energy harvesting and storage was studied for single-hop transmission [29] and two-hop relaying [30]. Online wireless network virtualization with perfect and imperfect CSI were studied in [31] and [32]. Online projected gradient descent and matrix exponential learning were leveraged in [33] and [34] for uplink covariance matrix design. Dynamic transmit covariance design for wireless fading systems was studied in [35]. Online periodic precoding updates for wireless network virtualization was considered in [36]. The above works focused on centralized problems for single-cell wireless systems.


Multi-cell cooperative precoding via multiple base stations (BSs) at the signal level can effectively mitigate inter-cell interference, and this has been shown to significantly improve the system performance. However, traditional cooperative precoding schemes focused on centralized offline problems with instantaneous CSI available at the CC [37], [38], [39]. The TRPs defined in 5G NR are much smaller in size compared with the traditional BSs and therefore have limited computation power. Furthermore, non-ideal backhaul communication links in practice have received a surge of attention in the 5G NR standardization. In this work, we apply the proposed HiOCO framework to an online multi-TRP cooperative precoding design problem with non-ideal backhaul links, by taking full advantage of the CSI timeliness at the TRPs and computation resources at the CC.


We formulate a new OCO problem over a heterogeneous master-worker network with communication delay, where the worker nodes have timely information about the local data but possibly less computation resources compared with the master node. At the beginning of each time slot, each worker node executes a local decision vector to minimize the accumulation of time-varying global costs. The local data at the worker nodes may be non-i.i.d. and the global cost functions may be non-separable.


We propose a new HiOCO framework that takes full advantage of the network heterogeneity in information timeliness and computation capacity. As disclosed here, HiOCO allows both central gradient descent at the master node and local gradient descent at the worker nodes for improved system performance. Furthermore, by communicating the aggregated global information and compressed local information, HiOCO can often reduce the communication overhead while preserving data privacy.


We analyze the special structure of HiOCO in terms of its hierarchical multi-step gradient descent with estimated gradients, in the presence of multi-slot delay. We prove that it can yield sublinear dynamic regret under mild conditions. Even with multi-slot delay, by increasing the estimated gradient descent steps at either the network edge or center, we can configure HiOCO to achieve a better dynamic regret bound compared with centralized inexact gradient descent algorithms.


We apply HiOCO to an online multi-TRP cooperative precoding design problem. Simulation under typical urban micro-cell Long-Term Evolution (LTE) settings demonstrates that both the central and local estimated gradient descent in HiOCO can improve system performance. In addition, HiOCO substantially outperforms both the centralized and distributed alternatives.


Embodiments disclosed here consider OCO over a heterogeneous network with communication delay, where the network edge executes a sequence of local decisions to minimize the accumulation of time-varying global costs. The local data may not be independent and identically distributed (i.i.d.) and the global cost functions may not be separable. Due to communication delays, neither the network center nor edge always has real-time information about the current global cost function. We propose a new framework, termed Hierarchical OCO (HiOCO), which takes full advantage of the network heterogeneity on information timeliness and computation capacity to enable multi-step estimated gradient descent at both the network center and edge.


For performance evaluation, we derive upper bounds on the dynamic regret of HiOCO, which measures the gap of costs between HiOCO and an offline global optimal performance benchmark. We show that the dynamic regret is sublinear under mild conditions. We further apply HiOCO to an online cooperative precoding design problem in multiple transmission/reception point (TRP) wireless networks with non-ideal backhaul links for 5G New Radio (NR). Simulation results demonstrate substantial performance gain of HiOCO over both the centralized and distributed alternatives.


OCO Over Master-Worker Network
Problem Formulation


FIG. 1 illustrates an example network according to an embodiment. We consider OCO over a master-worker network 100 in a time-slotted setting with time indexed by t. As shown in FIG. 1, one master node 102 is connected to C worker nodes 104 through separate communication links. The links each have an associated delay, denoted by τru for the uplink (indicating a τru-slot remote uplink delay) and τrd for the downlink (indicating a τrd-slot remote downlink delay). The round-trip delay is at least one time slot, i.e., τr = τru + τrd ≥ 1. For ease of exposition, we first consider the case of zero local delay, i.e., τl = 0. Later we discuss the case of non-zero local delay. At the beginning of each time slot t, each worker node (such as a TRP) c collects local data dtc and executes a local decision vector xtc ∈ ℝnc from a compact convex feasible set 𝒳c ⊆ ℝnc. The data {dtc}c=1C may be non-i.i.d. and can vary arbitrarily over time with unknown statistics.


The message passing and internal node calculations described below are also illustrated schematically in FIG. 2.


Let ƒ({dtc}c=1C, {xc}c=1C): ℝn → ℝ be the convex global cost function at time slot t. In the hierarchical computing network 100, the worker nodes 104 and master node 102 cooperate to jointly select a sequence of decisions from the feasible sets 𝒳c, to minimize the accumulated time-varying global costs. This leads to the following optimization problem:

\[ \mathbf{P1}: \quad \min_{\{\{x_t^c \in \mathcal{X}^c\}_{c=1}^{C}\}_{t=1}^{T}} \; \sum_{t=1}^{T} f\big( \{d_t^c\}_{c=1}^{C},\, \{x_t^c\}_{c=1}^{C} \big). \]

We consider the general case that the global cost function may be non-separable among the worker nodes 104, i.e., ƒ({dtc}c=1C, {xc}c=1C) may not be expressed as the summation of C local cost functions that each corresponds only to the local data dtc and decision vector xc. Therefore, due to the coupling of both data and variables, each worker node c cannot compute the gradient ∇xcƒ({dtc}c=1C, {xc}c=1C) based only on its local data dtc. In this case, the local gradient at worker node c may depend on its local data dtc, local decision vector xc, and possibly the data dtl and decision vector xtl at any other worker node l≠c. We define the local gradient at each worker node c as a general function denoted by hfc as follows:





\[ \nabla_{x^c} f\big( \{d_t^c\}_{c=1}^{C},\, \{x^c\}_{c=1}^{C} \big) \triangleq h_f^c\Big( d_t^c,\, x^c,\, g_f^c\big( \{d_t^l\}_{l\neq c},\, \{x^l\}_{l\neq c} \big) \Big) \tag{1} \]


where gfc({dtl}l≠c, {xl}l≠c) is some global information function w.r.t. the local data and decision vectors at all the other worker nodes 104. The local gradient and global information functions depend on the specific form of the global cost functions. We will show later that communicating the values of the global information functions, instead of the raw data and decision vectors, can often reduce the communication overhead.
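For a concrete (hypothetical) instance of this decomposition, consider a quadratic cost of the same form as the precoding application discussed later, with local data dtc = Htc, real-valued quantities for simplicity, and an assumed target vector b:

\[ f = \Big\| \sum_{l=1}^{C} H_t^l x^l - b \Big\|_2^2, \qquad \nabla_{x^c} f = 2\, H_t^{cT}\Big( H_t^c x^c + \sum_{l\neq c} H_t^l x^l - b \Big). \]

Here one may take gfc({Htl}l≠c, {xl}l≠c) = Σl≠c Htl xl, so each worker node needs only this aggregate (of the same dimension as b) rather than the raw data and decision vectors of all the other worker nodes.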


For notational simplicity, in the following, we define the global feasible set as 𝒳 ≜ ∏c=1C 𝒳c and denote the global cost function ƒ({dtc}c=1C, {xc}c=1C) as ƒt(x), where x ≜ [x1T, . . . , xCT]T ∈ ℝn is the global decision vector. The local gradient ∇xcƒ({dtc}c=1C, {xc}c=1C) at each worker node c is denoted as ∇ƒtc(xc).


Performance Metric and Measure of Variation

Due to the lack of in-time information about the global cost function at either the worker nodes 104 or the master node 102, it is impossible to obtain an optimal solution to P1. In fact, even for the most basic centralized OCO problem [3], an optimal solution cannot be found [4]. Instead, we aim at selecting an online solution sequence {xt}t=1T that is asymptotically no worse than the dynamic benchmark {x*t}t=1T, given by

\[ x_t^* \in \arg\min_{x\in\mathcal{X}} \{ f_t(x) \}. \tag{2} \]

Note that x*t is computed with the current information about ƒt(x) at each time slot t, and the resulting solution sequence {x*t}t=1T is a global optimal solution to P1. The corresponding dynamic regret is defined as

\[ \mathrm{RE}_T^d \triangleq \sum_{t=1}^{T} \big( f_t(x_t) - f_t(x_t^*) \big). \tag{3} \]


An OCO algorithm is desired to provide sublinear dynamic regret with respect to the time horizon T, i.e.,

\[ \lim_{T\to\infty} \frac{\mathrm{RE}_T^d}{T} = 0. \]
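To make the metric concrete, the following toy Python sketch (all quantities hypothetical, not from the disclosure) computes RE_T^d from (3) for drifting scalar quadratic costs and prints the time-averaged regret RE_T^d/T, which decays toward zero when the drift slows down over time:

```python
import numpy as np

def dynamic_regret(costs, xs, x_stars):
    # RE_T^d = sum_{t=1}^T ( f_t(x_t) - f_t(x_t^*) ), as in eq. (3).
    return sum(f(x) - f(xo) for f, x, xo in zip(costs, xs, x_stars))

# Toy instance: scalar quadratics f_t(x) = (x - c_t)^2 whose minimizer c_t
# drifts by O(1/t) per slot, tracked with a one-slot information delay.
rng = np.random.default_rng(0)
T = 1000
c = np.cumsum(rng.normal(size=T) / np.arange(1, T + 1))  # drifting optimum
costs = [lambda x, ct=ct: (x - ct) ** 2 for ct in c]
xs = np.concatenate(([0.0], c[:-1]))                     # delayed tracking
for horizon in (10, 100, 1000):
    re = dynamic_regret(costs[:horizon], xs[:horizon], c[:horizon])
    print(horizon, re / horizon)   # time-averaged regret decays toward 0
```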
Sublinearity is important since it implies that the online decision is asymptotically no worse than the dynamic benchmark in terms of its time-averaged performance. However, in the worst case, no online algorithm can achieve sublinear dynamic regret if the systems vary too drastically over time [40]. Therefore, the dynamic regret bounds are expressed in terms of different measures on system variations that represent the hardness of the problem. For a clear comparison of the dynamic regret bounds between HiOCO and the existing literature, we introduce several common variation measures as follows.


Borrowing from [3], we define the following accumulated variation of an arbitrary sequence of reference points {rt}t=1T (which is termed the path length in [3]):





\[ \Pi_T \triangleq \sum_{t=1}^{T} \| r_t - r_{t-1} \|_2. \tag{4} \]


The online projected gradient descent algorithm in [3] achieved 𝒪(√(TΠT)) dynamic regret w.r.t. any sequence of reference points {rt}t=1T. Another version of the path length, defined in [7], is





\[ \Pi'_T \triangleq \sum_{t=1}^{T} \| r_t - \Phi_t(r_{t-1}) \|_2 \tag{5} \]


where Φt(⋅) is a given function available at the decision maker to predict the current reference point. The dynamic mirror descent algorithm in [7] achieved 𝒪(√(TΠ′T)) dynamic regret. When the reference points are the optimal points, i.e., rt=x*t for any t, the resulting path length is defined as





\[ \Pi_T^* \triangleq \sum_{t=1}^{T} \| x_t^* - x_{t-1}^* \|_2. \tag{6} \]


There are some other related measures that can be used to characterize the system variation, e.g., the accumulated variation of the cost functions {ƒt(x)}t=1T given by

\[ \Theta_T \triangleq \sum_{t=1}^{T} \max_{x\in\mathcal{X}} \big| f_t(x) - f_{t-1}(x) \big| \tag{7} \]

and the accumulated squared variation of gradient given by





\[ \Gamma_{2,T} \triangleq \sum_{t=1}^{T} \| \nabla f_t(x_t) - \nabla f_{t-1}(x_{t-1}) \|_2^2 \tag{8} \]


The optimistic mirror descent algorithm in [8] achieved a dynamic regret bound

\[ \mathcal{O}\Big( \min\Big\{ \sqrt{\Gamma_{2,T}\, \Pi_T},\; \big( \Gamma_{2,T}\, \Theta_T\, T \big)^{1/3} \Big\} \Big) \]

in terms of ΠT, ΘT, and Γ2,T simultaneously.


The above OCO works [3], [7], [8] focused on general convex cost functions. With strongly convex cost functions, the one-step projected gradient descent algorithm in [9] improved the dynamic regret to 𝒪(Π*T). The multi-step gradient descent algorithm in [10] further improved the dynamic regret to 𝒪(Π*2,T), where Π*2,T is the squared path length defined as





\[ \Pi_{2,T}^* \triangleq \sum_{t=1}^{T} \| x_t^* - x_{t-1}^* \|_2^2. \tag{9} \]


Note that if Π*T or Π*2,T is sublinear, Π*2,T is often smaller than Π*T in the order sense. For instance, if ∥x*t−x*t−1∥2 ∝ T^ϱ for any t, then Π*T = 𝒪(T^(1+ϱ)) and Π*2,T = 𝒪(T^(1+2ϱ)). For a sublinear Π*T or Π*2,T, we have ϱ < 0, and therefore Π*2,T is smaller than Π*T in the order sense. In particular, if ϱ = −1/2, we have Π*2,T = 𝒪(1) and Π*T = 𝒪(√T).
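This order comparison can be checked numerically with a short sketch (toy sequences, all values hypothetical):

```python
import numpy as np

def path_lengths(x_star):
    # x_star: (T, n) array of per-slot minimizers x_t^*.
    steps = np.linalg.norm(np.diff(x_star, axis=0), axis=1)
    return steps.sum(), (steps ** 2).sum()   # Pi*_T (eq. 6), Pi*_{2,T} (eq. 9)

# Per-slot movement proportional to T^rho with rho = -1/2, as in the text:
# Pi*_T grows like sqrt(T) while Pi*_{2,T} stays O(1).
for T in (100, 1000, 10000):
    x_star = np.cumsum(np.full((T, 1), T ** -0.5), axis=0)
    p1, p2 = path_lengths(x_star)
    print(T, round(p1, 2), round(p2, 2))
```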










The standard and proximal online gradient descent algorithms were respectively extended in [11] and [12] to accommodate inexact gradients. Both resulted in 𝒪(max{Π*T, ΔT}) dynamic regret, where ΔT is the accumulated gradient error defined as

\[ \Delta_T \triangleq \sum_{t=1}^{T} \max_{x\in\mathcal{X}} \big\| \nabla f_t(x) - \nabla \hat{f}_t(x) \big\|_2 \tag{10} \]
with ∇f̂t(⋅) being a given function available at the decision maker to predict the current gradient.


Hierarchical Online Convex Optimization

In this section, we present details of HiOCO and study the impact of hierarchical multi-step estimated gradient descent on the performance guarantees of HiOCO to provide dynamic regret bounds. We further provide sufficient conditions under which HiOCO yields sublinear dynamic regrets and discuss its performance merits over existing OCO frameworks.


HiOCO Framework

Existing distributed OCO frameworks cannot be directly applied to solve the aforementioned minimization problem with non-separable global cost functions. As an alternative, one may apply a centralized OCO approach at the master node after it has received all the local data from the worker nodes. However, this approach takes advantage of neither the more timely information at the worker nodes nor the computation resources at the worker nodes. Different from existing OCO frameworks that are either centralized or fully distributed, in HiOCO the master node and worker nodes cooperate in gradient estimation and decision updates, taking full advantage of the network heterogeneity in information timeliness and computation capacity. For ease of exposition, we will first consider the case of zero local delay at the worker nodes and will later extend this to the case of non-zero local delay. In the following, we present the algorithms at the master node and worker nodes.


Master Node's Algorithm

At the beginning of each time slot t, each worker node c executes its current local decision vector xtc and uploads it to the master node 102. To enable central gradient descent at the master node 102, each worker node c also needs to share information about the local data dtc with the master node 102. However, sending the raw data directly would incur a large amount of uplink overhead. Instead, each worker node c sends a compression of the current local data lfc(dtc) to the master node 102. Due to the remote uplink delay, at the beginning of each time slot t>τru, the master node 102 only has the τru-slot-delayed local decision vector xt−τruc and compressed data lfc(dt−τruc) from each worker node c. The master node 102 then recovers estimated data d̂t−τruc from lfc(dt−τruc). The compression and recovery techniques for the data can be chosen based on the specific application. Note that the master node 102 needs to consider the remote downlink delay and design the decision vectors for the worker nodes τrd slots ahead based on the τru-slot-delayed information. Only the round-trip remote delay τr impacts the decision-making process. Therefore, in the following, without loss of generality, we simply consider the case with τr-slot remote uplink delay and zero remote downlink delay.


Remark 1. There is often a delay-accuracy tradeoff for the recovered data {d̂t−τrc}c=1C at the master node, since more accurate data at the master node 102 require less compression at the worker nodes 104 and more transmission time. If data privacy is a concern, the worker nodes 104 can add noise to the compressed data while sacrificing some system performance [41].


With {xt−τrc}c=1C and {d̂t−τrc}c=1C, for each worker node c, the master node 102 sets an intermediate decision vector x̂tc,0 = xt−τrc and performs Jr-step gradient descent to generate x̂tc,Jr as follows. For each gradient descent step j∈[1,Jr], the master node 102 solves the following optimization problem for x̂tc,j:

\[ \mathbf{P2}: \quad \min_{x^c \in \mathcal{X}^c} \; \Big\langle \nabla \hat{f}_{t-\tau_r}^c\big( \hat{x}_t^{c,j-1} \big),\; x^c - \hat{x}_t^{c,j-1} \Big\rangle + \frac{\alpha}{2}\, \big\| x^c - \hat{x}_t^{c,j-1} \big\|_2^2 \]
where ∇f̂t−τrc(x̂tc,j−1) is an estimated gradient based on {x̂tc,j−1}c=1C and {d̂t−τrc}c=1C, and it is given by

\[ \nabla \hat{f}_{t-\tau_r}^c\big( \hat{x}_t^{c,j-1} \big) \triangleq h_f^c\Big( \hat{d}_{t-\tau_r}^c,\; \hat{x}_t^{c,j-1},\; g_f^c\big( \{\hat{d}_{t-\tau_r}^l\}_{l\neq c},\, \{\hat{x}_t^{l,j-1}\}_{l\neq c} \big) \Big). \tag{11} \]


The master node 102 then sends x̂tc,Jr and the corresponding global information gfc({d̂t−τrl}l≠c, {x̂tl,Jr}l≠c) to assist the local gradient descent at each worker node c.



FIG. 3 illustrates an algorithm for the master node 102 according to an embodiment. At 302, the algorithm starts. At 304, the master node 102 initializes a parameter α. At 306, the master node 102 begins a time slot t, where t is greater than the remote delay τr. At 308, the master node 102 receives a local decision vector (xt−τrc) and compressed local data (lfc(dt−τrc)) from each worker node c. At 310, the master node 102 estimates the local data (d̂t−τrc) from the compressed local data (lfc(dt−τrc)). At 312, the master node 102 sets an intermediate decision vector x̂tc,0 = xt−τrc for each worker node c. At 314-320, the master node 102 performs a Jr-step gradient descent to generate a global decision vector (x̂tc,Jr) and corresponding global information (gfc({d̂t−τrl}l≠c, {x̂tl,Jr}l≠c)) for each worker node c.


Specifically, at 314, the master node 102 checks whether j≤Jr. If so, the master node 102 proceeds to 316; otherwise, it proceeds to 322. Initially, j=1 when the master node 102 reaches 314 for the first time. At 316, an estimated gradient ∇f̂t−τrc(x̂tc,j−1) is constructed according to equation (11). At 318, x̂tc,j is updated for each worker node c by solving the optimization problem P2. At 320, the index j is incremented by one, and the master node 102 proceeds to perform the check at 314. At 322, after the gradient descent has completed, the master node 102 sends the global decision vector (x̂tc,Jr) and corresponding global information (gfc({d̂t−τrl}l≠c, {x̂tl,Jr}l≠c)) to each worker node c. At 324, the algorithm ends.
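A minimal Python sketch of this master-node loop follows (zero local delay). It is an illustration under stated assumptions, not the patented implementation: `local_grad` (hfc), `global_info` (gfc), and `project` (Euclidean projection onto the feasible set, assumed identical across workers for brevity) are application-specific callables supplied by the caller, and data recovery from the compressed uploads is assumed to have already produced `d_hat`. The update uses the fact that P2 is solved by a projected gradient step with step size 1/α, matching the closed form given for the precoding application later.

```python
def master_update(x_delayed, d_hat, local_grad, global_info, project, J_r, alpha):
    """One slot at the master node: J_r estimated gradient descent steps.

    x_delayed : dict c -> tau_r-slot-delayed local decision x_{t-tau_r}^c
    d_hat     : dict c -> recovered local data d_hat_{t-tau_r}^c
    """
    x_hat = dict(x_delayed)                    # x_hat_t^{c,0} = x_{t-tau_r}^c
    for _ in range(J_r):
        # Global information g_f^c from the data/decisions of all other nodes.
        g = {c: global_info({l: d_hat[l] for l in d_hat if l != c},
                            {l: x_hat[l] for l in x_hat if l != c})
             for c in x_hat}
        grad = {c: local_grad(d_hat[c], x_hat[c], g[c]) for c in x_hat}
        # P2 in closed form: projected gradient step with step size 1/alpha.
        x_hat = {c: project(x_hat[c] - grad[c] / alpha) for c in x_hat}
    g_final = {c: global_info({l: d_hat[l] for l in d_hat if l != c},
                              {l: x_hat[l] for l in x_hat if l != c})
               for c in x_hat}
    return x_hat, g_final                      # sent down to each worker c
```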


Worker Node c's Algorithm


When the global cost function is non-separable, each worker node c cannot compute the local gradient ∇ƒtc(xtc) = hfc(dtc, xc, gfc({dtl}l≠c, {xl}l≠c)) based only on its local data dtc. Therefore, in HiOCO, the master node 102 assists the local gradient estimation by communicating the corresponding delayed global information gfc({d̂t−τrl}l≠c, {x̂tl,Jr}l≠c) to each worker node c. Note that due to the communication delay and data compression, the global information received by the worker nodes 104 is delayed and contains errors.


At the beginning of each time slot t>τr, each worker node c receives the global decision vector x̂tc,Jr and the global information gfc({d̂t−τrl}l≠c, {x̂tl,Jr}l≠c) from the master node 102. Each worker node c then sets an intermediate decision vector x̃tc,0 = x̂tc,Jr and performs a Jl-step gradient descent to generate x̃tc,Jl as follows. For each gradient descent step j∈[1,Jl], each worker node c solves the following optimization problem for x̃tc,j:

\[ \mathbf{P3}: \quad \min_{x^c \in \mathcal{X}^c} \; \Big\langle \nabla \hat{f}_t^c\big( \tilde{x}_t^{c,j-1} \big),\; x^c - \tilde{x}_t^{c,j-1} \Big\rangle + \frac{\alpha}{2}\, \big\| x^c - \tilde{x}_t^{c,j-1} \big\|_2^2 \]

where ∇f̂tc(x̃tc,j−1) is an estimated gradient based on the timely local data dtc and the delayed global information gfc({d̂t−τrl}l≠c, {x̂tl,Jr}l≠c), and it is given by

\[ \nabla \hat{f}_t^c\big( \tilde{x}_t^{c,j-1} \big) \triangleq h_f^c\Big( d_t^c,\; \tilde{x}_t^{c,j-1},\; g_f^c\big( \{\hat{d}_{t-\tau_r}^l\}_{l\neq c},\, \{\hat{x}_t^{l,J_r}\}_{l\neq c} \big) \Big). \tag{12} \]

The above estimated gradient takes full advantage of the information timeliness at the worker nodes, as well as the central availability of information at the master node, to enable local gradient descent at the worker nodes for non-separable cost functions. Each worker node c then executes xtc = x̃tc,Jl as its current local decision vector. It then uploads xtc and the compressed local data lfc(dtc) to the master node.


Remark 2. For separable global cost functions, HiOCO can still be applied. In this case, it is still beneficial to perform centralized gradient descent for improved system performance, while sacrificing some communication overhead caused by uploading the compressed local data.


Remark 3. Single-step and multi-step gradient descent algorithms were provided in [9] and [10], while [11] and [12] proposed single-step inexact gradient descent algorithms. However, the algorithms in [9], [10], [11], [12] are centralized and under the standard OCO setting with one-slot delayed gradient information. In HiOCO, both the master node 102 and worker nodes 104 can perform multi-step estimated gradient descent in the presence of multi-slot delay.



FIG. 4 illustrates an algorithm for the worker node 104 according to an embodiment. At 402, the algorithm starts. At 404, the worker node 104 initializes local decision vectors xtc ∈ 𝒳c for any t≤τr. At 406, the worker node 104 begins a time slot t, where t is greater than the remote delay τr. At 408, the worker node 104 receives a global decision vector (x̂tc,Jr) and global information (gfc({d̂t−τrl}l≠c, {x̂tl,Jr}l≠c)) from the master node 102. At 410, the worker node 104 sets an intermediate decision vector x̃tc,0 = x̂tc,Jr. At 412-418, the worker node 104 performs a Jl-step gradient descent to generate a local decision vector (x̃tc,Jl).


Specifically, at 412, the worker node 104 checks whether j≤Jl. If so, the worker node 104 proceeds to 414; otherwise, it proceeds to 420. Initially, j=1 when the worker node 104 reaches 412 for the first time. At 414, an estimated gradient ∇f̂tc(x̃tc,j−1) is constructed according to equation (12). At 416, x̃tc,j is updated by solving the optimization problem P3. At 418, the index j is incremented by one, and the worker node 104 proceeds to perform the check at 412. At 420, after the gradient descent has completed, the worker node 104 implements xtc = x̃tc,Jl as its current local decision vector. At 422, the worker node 104 sends the local decision vector (xtc) and corresponding compressed local data (lfc(dtc)) to the master node 102. At 424, the algorithm ends.
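A matching sketch of this worker-node loop, under the same assumptions as the master-node sketch above (P3 likewise reduces to a projected gradient step):

```python
def worker_update(x_hat_c, g_c, d_c, local_grad, project, J_l, alpha):
    """One slot at worker node c: J_l estimated gradient descent steps.

    x_hat_c : global decision x_hat_t^{c,J_r} received from the master node
    g_c     : delayed global information g_f^c(...) received from the master
    d_c     : timely local data d_t^c
    """
    x = x_hat_c                                # x_tilde_t^{c,0} = x_hat_t^{c,J_r}
    for _ in range(J_l):
        grad = local_grad(d_c, x, g_c)         # eq. (12): timely d_t^c, delayed g
        x = project(x - grad / alpha)          # P3 in closed form
    return x                                   # executed as x_t^c, then uploaded
```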


Dynamic Regret Bounds

In this section, we present new techniques to derive the dynamic regret bounds of HiOCO, particularly to account for its hierarchical multi-step estimated gradient descent with multi-slot delay. For clarity of exposition, proofs are omitted.


We make the following assumptions, which are common in the literature on OCO with strongly convex cost functions [9], [10], [11], [12]. Strongly convex objectives arise in many machine learning and signal processing applications, such as Lasso regression, support vector machines, and robust subspace tracking. For applications with general convex cost functions, adding a simple regularization term such as (μ/2)∥x∥₂² often does not sacrifice the system performance. We will show later that strong convexity yields a contraction relation between ∥xt+1−x*t∥₂² and ∥xt−x*t∥₂², which can be leveraged to improve the dynamic regret bounds.


Assumption 1. For any t, ƒt(x) satisfies the following:


ƒt(x) is μ-strongly convex over 𝒳, i.e., ∃μ>0, s.t., for any x, y∈𝒳 and t,

\[ f_t(y) \geq f_t(x) + \big\langle \nabla f_t(x),\, y - x \big\rangle + \frac{\mu}{2}\, \| y - x \|_2^2. \tag{13} \]


ƒt(x) is L-smooth over 𝒳, i.e., ∃L>0, s.t., for any x, y∈𝒳 and t,

\[ f_t(y) \leq f_t(x) + \big\langle \nabla f_t(x),\, y - x \big\rangle + \frac{L}{2}\, \| y - x \|_2^2. \tag{14} \]


The gradient of ƒt(x) is bounded, i.e., ∃D>0, s.t., for any x∈𝒳 and t,

\[ \| \nabla f_t(x) \|_2 \leq D. \tag{15} \]


Assumption 2. The radius of 𝒳 is bounded, i.e., ∃R>0, s.t., for any x, y∈𝒳,

\[ \| x - y \|_2 \leq R. \tag{16} \]


We also require the following lemma, which is reproduced from Lemma 2.8 in [1].


Lemma 1. Let 𝒳 ⊆ ℝn be a nonempty convex set. Let ƒ(x) be a μ-strongly-convex function over 𝒳. Let

\[ x^* \in \arg\min_{x\in\mathcal{X}} \{ f(x) \}. \]

Then, for any y∈𝒳, we have

\[ f(x^*) \leq f(y) - \frac{\mu}{2}\, \| y - x^* \|_2^2. \]

The following lemma is general and quantifies the impact of one-step estimated gradient descent in terms of the squared gradient estimation error. We further provide a sufficient condition under which the estimated gradient descent yields a decision closer to the optimal point.


Lemma 2. Assume ƒ(x): 𝒳 → ℝ is μ-strongly-convex and L-smooth. Let

\[ z = \arg\min_{x\in\mathcal{X}} \Big\{ \big\langle \nabla \hat{f}(y),\, x - y \big\rangle + \frac{\alpha}{2}\, \| x - y \|_2^2 \Big\}, \]

where ∇f̂(y) is an estimated gradient of ∇ƒ(y), and

\[ x^* \in \arg\min_{x\in\mathcal{X}} \{ f(x) \}. \]

For any α>L and γ∈(0, 2μ), we have

\[ \| z - x^* \|_2^2 \leq \eta\, \| y - x^* \|_2^2 + \beta\, \big\| \nabla \hat{f}(y) - \nabla f(y) \big\|_2^2 \tag{17} \]

where

\[ \eta = \frac{\alpha - \mu}{\alpha + \mu - \gamma} < 1 \quad \text{and} \quad \beta = \frac{1}{\gamma(\alpha + \mu - \gamma)}. \]

The sufficient condition for ∥z−x*∥₂² < ∥y−x*∥₂² is

\[ \big\| \nabla \hat{f}(y) - \nabla f(y) \big\|_2^2 < \gamma(2\mu - \gamma)\, \| y - x^* \|_2^2. \tag{18} \]
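Inequality (17) can be sanity-checked numerically. The sketch below (a toy instance with illustrative parameter values, not taken from the disclosure) builds a strongly convex quadratic with minimizer x* = 0 over 𝒳 = ℝn, so the projection is trivial, perturbs the gradient, applies the estimated gradient step z = y − ∇f̂(y)/α, and counts violations of (17):

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu, L_smooth = 5, 1.0, 4.0
# f(x) = 0.5 * x^T A x with spectrum in [mu, L]; minimizer x* = 0.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
A = Q @ np.diag(np.linspace(mu, L_smooth, n)) @ Q.T

alpha, gamma = L_smooth + 1.0, 0.5 * mu       # alpha > L, gamma in (0, 2*mu)
eta = (alpha - mu) / (alpha + mu - gamma)
beta = 1.0 / (gamma * (alpha + mu - gamma))

violations = 0
for _ in range(10000):
    y = rng.normal(size=n)
    err = rng.normal(scale=0.3, size=n)       # gradient estimation error
    z = y - (A @ y + err) / alpha             # estimated gradient step
    lhs = z @ z                               # ||z - x*||_2^2 with x* = 0
    rhs = eta * (y @ y) + beta * (err @ err)
    violations += lhs > rhs + 1e-12
print("violations of (17):", violations)      # expect 0
```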


Remark 4. The condition on the gradient estimation error in (18) is most easily satisfied when γ=μ. In this case, the contraction constant η = (α−μ)/α recovers the one in [9]. Furthermore, as γ approaches 0, η approaches the contraction constant (α−μ)/(α+μ) in [10]. Different from Proposition 2 in [9] and Lemma 5 in [10], Lemma 2 takes into account the impact of estimated gradient descent and generalizes the results in [9] and [10].


Remark 5. The optimal gradient descent step-size in [11] needs to be in a specific range based on the knowledge of μ, L, and υ from an additional assumption ∥∇f̂t(xt)−∇ƒt(xt)∥₂² ≤ ϵ² + υ²∥∇ƒt(xt)∥₂² for some ϵ≥0 and υ≥0. The contraction analysis in [12] focused on the proximal point algorithm and is substantially different from Lemma 2.


We examine the impact of hierarchical multi-step estimated gradient descent on the dynamic regret bounds for OCO, which has not been addressed in the existing literature. To this end, we define the accumulated squared gradient error as

\[ \Delta_{2,T} \triangleq \sum_{t=1}^{T} \max_{x\in\mathcal{X}} \big\| \nabla f_t(x) - \nabla \hat{f}_t(x) \big\|_2^2 \tag{19} \]

Similar to the relationship between the standard path length Π*T and the squared path length Π*2,T discussed above, Δ2,T is often smaller than ΔT in the order sense. Note that

\[ \max_{x\in\mathcal{X}} \big\| \nabla f_t(x) - \nabla \hat{f}_t(x) \big\|_2 \]

in (19) is the maximum gradient estimation error and serves as an upper bound for the gradient estimates in (11) and (12). We use Δ2,T as a loose upper bound for our performance analysis since it covers more general gradient estimation schemes that can be adopted in HiOCO.


Leveraging the results in Lemmas 1-2 and OCO techniques, the following theorem provides upper bounds on the dynamic regret RE_T^d for HiOCO.


Theorem 3. For any α≥L, ξ>0 and γ∈(0, 2μ), the dynamic regret yielded by HiOCO is bounded as follows:

For any Jl+Jr≥1 such that 2η^(Jl+Jr)≤1, we have

\[ \mathrm{RE}_T^d \leq \frac{1}{2\xi} \sum_{t=1}^{T} \big\| \nabla f_t(x_t) \big\|_2^2 + \frac{L+\xi}{2}\,\tau_r R^2 + \frac{L+\xi}{2\big(1 - 2\eta^{J_l+J_r}\big)} \bigg( \tau_r R^2 + 2\tau_r^2\, \Pi_{2,T}^* + \frac{2\beta}{1-\eta}\,\Delta_{2,T} \bigg). \]

For any Jl+Jr≥1, we have

\[ \mathrm{RE}_T^d \leq \tau_r D R + \frac{D}{1 - \eta^{J_l+J_r}} \bigg( \tau_r R^2 + \tau_r\, \Pi_T^* + \frac{\beta}{1-\eta}\,\Delta_T \bigg). \]

Extension with Local Delay


We now consider the case of non-zero local delay, i.e., at the beginning of each time slot t, each worker node c only has the τl-slot-delayed local data dt−τlc for some τl≥1. In this case, we modify the master and worker algorithms by adding a τl-slot delay to the algorithm starting time and to all the time stamps of the data and estimated gradients. Let τ = τl + τr be the total delay. Note that the master node only has the τ-slot-delayed data {d̂t−τc}c=1C with compression errors for gradient estimation at the beginning of each time slot t>τ.


The master node's algorithm with local delay may proceed as follows. The algorithm starts, the parameter α is initialized, and at the beginning of each t>τ, the master node 102 receives xt−τrc and lfc(dt−τc) from each worker node c. The master node 102 estimates d̂t−τc from lfc(dt−τc). The master node 102 sets x̂tc,0 = xt−τc for each worker node c. For each step j of the Jr-step gradient descent, the gradient ∇f̂t−τc(x̂tc,j−1) is constructed. This is done similarly to what is shown in equation (11), noting that the time stamps are adjusted to account for the local delay. Likewise, for each step j of the Jr-step gradient descent, x̂tc,j is updated for each worker node c by solving P2 with ∇f̂t−τc(x̂tc,j−1). Following the gradient descent, x̂tc,Jr and gfc({d̂t−τl}l≠c, {x̂tl,Jr}l≠c) are sent to each worker node c.


The worker node's algorithm with local delay may proceed as follows. The algorithm starts, the local decision vectors xtc ∈ 𝒳c for any t≤τ are initialized, and at the beginning of each t>τ, the worker node 104 receives x̂tc,Jr and gfc({d̂t−τl}l≠c, {x̂tl,Jr}l≠c) from the master node 102. The worker node 104 sets x̃tc,0 = x̂tc,Jr. For each step j of the Jl-step gradient descent, the gradient ∇f̂t−τlc(x̃tc,j−1) is constructed. This is done similarly to what is shown in equation (12), noting that the time stamps are adjusted to account for the local delay. Likewise, for each step j of the Jl-step gradient descent, x̃tc,j is updated by solving P3 with ∇f̂t−τlc(x̃tc,j−1). Following the gradient descent, xtc = x̃tc,Jl is implemented as the local decision vector, and xtc and lfc(dt−τlc) are sent to the master node 102.


Using similar techniques in the proof of Theorem 3, we provide dynamic regret bounds for HiOCO in the presence of both local and remote delay.


Theorem 4. For any α≥L, ξ>0 and γ∈(0, 2μ), the dynamic regret yielded by HiOCO is bounded as follows:


For any Jl+Jr≥1 such that 4η^(Jl+Jr)<1, we have

\[ \mathrm{RE}_T^d \leq \frac{1}{2\xi} \sum_{t=1}^{T} \big\| \nabla f_t(x_t) \big\|_2^2 + \frac{L+\xi}{2}\,\tau R^2 + \frac{L+\xi}{2\big(1 - 4\eta^{J_l+J_r}\big)} \bigg( \tau R^2 + \big(2\tau_l^2 + 4\tau_r^2\big)\, \Pi_{2,T}^* + \frac{2\beta}{1-\eta}\,\Delta_{2,T} \bigg). \]

For any Jl+Jr≥1, we have

\[ \mathrm{RE}_T^d \leq \tau D R + \frac{D}{1 - \eta^{J_l+J_r}} \bigg( \tau R^2 + \tau\, \Pi_T^* + \frac{\beta}{1-\eta}\,\Delta_T \bigg). \]

Due to the local delay, Theorem 4 has a more stringent condition on the total number of gradient descent steps compared with Theorem 3. However, the order of the dynamic regret bound is dominated by the accumulated system variation measures and is often the same as the case without local delay.


Discussion on the Dynamic Regret Bounds

In this section, we discuss the sufficient conditions for HiOCO to yield sublinear dynamic regret and highlight several prominent advantages of HiOCO over existing OCO frameworks. From Theorems 3 and 4, we can derive the following corollary regarding the dynamic regret bound.


Corollary 5. Suppose the accumulated squared variation of the gradient at the optimal points satisfies Σt=1T∥∇ƒt(x*t)∥₂² = 𝒪(max{τ²Π*2,T, Δ2,T}); then, from Theorems 3 and 4, we have

\[ \mathrm{RE}_T^d = \mathcal{O}\big( \min\big\{ \max\{\tau\, \Pi_T^*,\, \Delta_T\},\; \max\{\tau^2\, \Pi_{2,T}^*,\, \Delta_{2,T}\} \big\} \big). \]


Note that Σt=1T∥∇ƒt(x*t)∥₂² is often small, and the condition in Corollary 5 is commonly satisfied. In particular, if x*t is an interior point of 𝒳 or P1 is an unconstrained online problem, we have ∇ƒt(x*t)=0. From Corollary 5, a sufficient condition for HiOCO to yield sublinear dynamic regret is either max{τΠ*T, ΔT}=o(T) or max{τ²Π*2,T, Δ2,T}=o(T). Sublinearity of the accumulated system variation measures is necessary to have sublinear dynamic regret [40]. In many online applications, the system tends to stabilize and the gradient estimation becomes more accurate over time, leading to sublinear dynamic regret.


Remark 6. The centralized single-step and multi-step gradient descent algorithms achieved 𝒪(Π*T) and 𝒪(min{Π*T, Π*2,T}) dynamic regrets in [9] and [10], respectively. HiOCO takes advantage of both the timely local and delayed global information to perform multi-step estimated gradient descent at both the master and worker nodes. Our dynamic regret bound analysis takes into account the impacts of the unique hierarchical update architecture, gradient estimation errors, and multi-slot delay on the performance guarantees of OCO, which were not considered in [9] and [10].


Remark 7. The centralized single-step inexact gradient descent algorithms in [11] and [12] achieved 𝒪(max{Π*T, ΔT}) dynamic regret under the standard OCO setting with one-slot delay. Note that, in the order sense, Π*2,T and Δ2,T are usually smaller than Π*T and ΔT, respectively. Therefore, even in the presence of multi-slot delay, HiOCO provides a better dynamic regret bound by increasing the number of estimated gradient descent steps, and recovers the performance bounds in [11] and [12] as a special case.


Application to Multi-TRP Cooperative Wireless Networks


FIG. 5 illustrates an example application according to an embodiment. We apply HiOCO to solve an online multi-TRP cooperative precoding design problem in multiple-input multiple-output (MIMO) systems, where multiple TRPs 504 (such as TRP 1 and TRP 2 that are illustrated) jointly transmit signals to serve users 506 in the network 500 as shown in FIG. 5. The TRPs 504 also cooperate with the CC 502. Traditional cooperative precoding design schemes focused on centralized offline problems at the CC with instantaneous CSI [37], [38], [39]. In contrast, some disclosed embodiments provided here are online, based on delayed CSI, and updated at both the CC 502 and TRPs 504.


The message passing and internal node calculations described below are also illustrated schematically in FIG. 6.


Online Multi-TRP Cooperative Precoding Design

We consider a total of C TRPs 504 coordinated by the CC 502 to jointly serve K users 506 in the cooperative network 500. Each TRP c has Nc antennas, so there is a total of N=Σc=1CNc antennas in the network 500. Let Htc ∈ ℂK×Nc denote the local channel state of the K users 506 from TRP c. Let Ht=[Ht1, . . . , HtC] ∈ ℂK×N denote the global channel state between the K users 506 and the C TRPs 504.


For ease of illustration only, here we consider the case where there is no local delay at the TRPs to collect the local CSI. However, embodiments may also cover the case of non-zero local delay as explained above. At each time slot t, each TRP c has the current local CSI Htc and implements a local precoding matrix Vtc ∈ ℂNc×K in the compact convex set

\[ \mathcal{V}^c \triangleq \{ V^c : \| V^c \|_F^2 \leq P_{\max}^c \} \tag{20} \]


to meet the per-slot maximum transmit power limit. Let Vt = [Vt1H, . . . , VtCH]H ∈ ℂN×K denote the global precoding matrix executed by the C TRPs 504 at time slot t. The actual received signal vector yt (excluding noise) at the K users 506 is given by

\[ y_t = H_t V_t s_t \]

where st ∈ ℂK×1 contains the transmitted signals from the TRPs to all K users 506, which are assumed to be independent of each other with unit power, i.e., 𝔼{st stH} = I, ∀t.


We first consider idealized backhaul communication links, where each TRP c communicates Htc to the CC 502 without delay. The CC 502 then has the global CSI Ht at time slot t and designs a desired global precoder Wt ∈ ℂN×K to meet the per-TRP maximum power limits. The design of Wt can be based on the service needs of the K users 506 and is not limited to any specific precoding scheme. For the CC 502 with Wt, the desired received signal vector (noiseless) ỹt is given by

\[ \tilde{y}_t = H_t W_t s_t. \]


With the TRPs' 504 actual precoding matrix Vt and the desired precoder Wt at the CC 502, the expected deviation of the actual received signal vector at all K users 506 from the desired one is given by 𝔼{∥yt−ỹt∥F²} = ∥HtVt−HtWt∥F². We define the precoding deviation of the TRPs' 504 precoding from the precoder at the CC 502 as

\[ f_t(V) \triangleq \| H_t V - H_t W_t \|_F^2, \quad \forall t \tag{21} \]

which is a strongly convex cost function.


Note that due to the coupling of the local channel states {Htc}c=1C and local precoders {Vtc}c=1C, the cost function ƒt(V) is not separable among the TRPs 504. Furthermore, the local gradient at each TRP c depends on the local channel state Htc, the local precoder Vc, and the channel states {Htl}l≠c and precoders {Vtl}l≠c at all the other TRPs 504, given by

\[ \nabla f_t^c(V) = \frac{\partial f_t(V)}{\partial V^{c*}} = H_t^{cH}\Big( \sum_{l=1}^{C} H_t^l V^l - H_t W_t \Big). \tag{22} \]
The goal of the multi-TRP cooperative network 500 is to minimize the accumulation of the precoding deviation subject to per-TRP maximum transmit power limits with non-ideal backhaul communication links. The online optimization problem is in the same form as P1, with {Htc}c=1C being the local data, {Vtc ∈ 𝒱c}c=1C being the local decision vectors, and ƒt(V) being the global cost function.


For non-ideal backhaul links with τru-slot uplink and τrd-slot downlink communication delays, as illustrated herein, only the round-trip communication delay τr matters, and we can equivalently consider that there is τr-slot uplink delay and no downlink delay. At each time slot t, each TRP c has the timely local CSI Htc and implements a local precoder Vtc. If communication overhead is a concern, instead of sending the complete CSI Htc, each TRP c can send a compressed local CSI Ltc to the CC 502. Due to the communication delay and CSI compression, the CC 502 recovers a delayed global channel state Ĥt−τr with errors, and then designs the desired precoding matrix Ŵt−τr. Later we will show how HiOCO leverages the instantaneous local CSI {Htc}c=1C at the TRPs 504 and the delayed global channel state Ĥt−τr at the CC 502 to jointly design the cooperative precoding matrices {Vtc}c=1C.


Hierarchical Precoding Solution

Leveraging the proposed HiOCO framework, we now provide hierarchical solutions to the formulated online multi-TRP cooperative precoding design problem.


Precoding Solution at CC

At the beginning of each time slot t>τr, the CC 502 receives the precoding matrices {Vt−τrc}c=1C from the TRPs 504 and recovers the delayed global CSI Ĥt−τr with some errors from the compressed local CSI {Lt−τrc}c=1C.
It then sets V̂tc,0 = Vt−τrc for each TRP c and performs Jr-step estimated gradient descent to generate V̂tc,Jr. For each gradient descent step j∈[1,Jr], the CC 502 has a closed-form precoding solution given by

\[ \hat{V}_t^{c,j} = \mathcal{P}_{\mathcal{V}^c}\Big\{ \hat{V}_t^{c,j-1} - \frac{1}{\alpha}\, \nabla \hat{f}_{t-\tau_r}^c\big( \hat{V}_t^{c,j-1} \big) \Big\} \]

where

\[ \mathcal{P}_{\mathcal{V}^c}\{ V^c \} = \arg\min_{U^c \in \mathcal{V}^c} \big\{ \| U^c - V^c \|_F^2 \big\} \]

is the projection operator onto the convex feasible set 𝒱c and

\[ \nabla \hat{f}_{t-\tau_r}^c\big( \hat{V}_t^{c,j-1} \big) = \hat{H}_{t-\tau_r}^{cH}\Big( \sum_{l=1}^{C} \hat{H}_{t-\tau_r}^l \hat{V}_t^{l,j-1} - \hat{H}_{t-\tau_r} \hat{W}_{t-\tau_r} \Big) \]
is an estimation of the gradient at time slot t−τr. The CC 502 then communicates the intermediate precoder V̂tc,Jr and the global information Ĝt−τrc = Σl=1,l≠cC (Ĥt−τrl V̂tl,Jr) − Ĥt−τr Ŵt−τr ∈ ℂK×K to TRP c, for all c∈{1, . . . , C}. Note that there is no need to communicate the local information Ĥt−τrc V̂tc,Jr to each TRP c, since more recent local information will be used to reduce the gradient estimation error.
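A minimal sketch of these two ingredients in Python: the Euclidean projection onto the per-TRP power set 𝒱c of (20) (a simple rescaling) and the CC-side estimated gradient. `H_hat` and `V_hat` are hypothetical lists holding Ĥt−τrl and V̂tl,j−1, and `HW_hat` holds the product Ĥt−τr Ŵt−τr:

```python
import numpy as np

def proj_power(V, p_max):
    # Projection onto {V : ||V||_F^2 <= p_max}: rescale when outside the ball.
    nrm2 = np.linalg.norm(V, 'fro') ** 2
    return V if nrm2 <= p_max else V * np.sqrt(p_max / nrm2)

def cc_grad(H_hat, V_hat, HW_hat, c):
    # Estimated gradient for TRP c at the CC:
    # H_hat^{cH} ( sum_l H_hat^l V_hat^l - H_hat W_hat ).
    residual = sum(H @ V for H, V in zip(H_hat, V_hat)) - HW_hat
    return H_hat[c].conj().T @ residual

# One CC descent step for TRP c (step size 1/alpha):
# V_hat[c] = proj_power(V_hat[c] - cc_grad(H_hat, V_hat, HW_hat, c) / alpha,
#                       p_max[c])
```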


Note that instead of sending the global channel state Ĥt−τr ∈ ℂK×N, the global precoding matrix V̂tJr ∈ ℂN×K, and the desired global precoder Ŵt−τr ∈ ℂN×K to each TRP c for the local gradient estimation, in the proposed method, sending V̂tc,Jr ∈ ℂNc×K and Ĝt−τrc ∈ ℂK×K to each TRP c greatly reduces the amount of downlink communication overhead.


Precoding Solution at TRP c

Each TRP c can implement any local precoder in 𝒱c for any t∈[1, τr]. At the beginning of each time slot t>τr, after receiving the intermediate precoder V̂tc,Jr and the global information Ĝt−τrc from the CC 502, each TRP c sets Ṽtc,0 = V̂tc,Jr and performs Jl-step estimated gradient descent to generate Ṽtc,Jl. For each gradient descent step j∈[1, Jl], each TRP c also has a closed-form precoding solution given by

\[ \tilde{V}_t^{c,j} = \mathcal{P}_{\mathcal{V}^c}\Big\{ \tilde{V}_t^{c,j-1} - \frac{1}{\alpha}\, \nabla \hat{f}_t^c\big( \tilde{V}_t^{c,j-1} \big) \Big\} \]

where

\[ \nabla \hat{f}_t^c\big( \tilde{V}_t^{c,j-1} \big) = H_t^{cH}\Big( H_t^c \tilde{V}_t^{c,j-1} + \hat{G}_{t-\tau_r}^c \Big) \]
is an estimation of the current gradient based on the timely local CSI Htc and the delayed global information Ĝt−τrc. Finally, each TRP c uses Vtc = Ṽtc,Jl as its precoding matrix for transmission in time slot t and communicates it together with either the complete CSI Htc or the compressed local CSI Ltc to the CC 502.
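The corresponding TRP-side steps, reusing `proj_power` from the sketch above; `H_c` and `G_c` stand for the timely local CSI Htc and the delayed global information Ĝt−τrc received from the CC:

```python
def trp_update(V_hat_c, H_c, G_c, p_max_c, alpha, J_l):
    # J_l local estimated gradient steps at TRP c.
    V = V_hat_c                                # V_tilde_t^{c,0} = V_hat_t^{c,J_r}
    for _ in range(J_l):
        grad = H_c.conj().T @ (H_c @ V + G_c)  # estimated local gradient
        V = proj_power(V - grad / alpha, p_max_c)
    return V                                   # executed precoder V_t^c
```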


Performance Bounds

Note that the optimal precoding solution is V*t = Wt at each time slot t. However, with non-ideal backhaul links, each TRP c cannot receive Vtc* from the CC 502 in time to implement it at each time slot t. A naive solution is to implement the delayed optimal solution V*t−τr at the TRPs 504. However, we will show that directly implementing V*t−τr at the TRPs 504 leads to system performance degradation compared with HiOCO, which can adapt to the unknown channel variations.


We assume that the channel power is bounded by a constant B>0 at any time t, given by





\[ \| H_t \|_F^2 \leq B. \tag{23} \]


In the following Lemma, we show that the formulated online multi-TRP cooperative precoding design problem satisfies Assumptions 1 and 2 made above.


Lemma 6. Assume the channel power is bounded as in (23). Then, Assumptions 1 and 2 hold with the corresponding constants given by μ=2, L=B, D=2B√(Σc=1C Pmaxc), and R=2√(Σc=1C Pmaxc).


Leveraging the results in Theorems 3 and 4, and noting that the gradient at the optimal precoder satisfies ∇ƒt(V*t) = HtH(HtV*t − HtWt) = 0, the following corollary provides the dynamic regret bounds yielded by the hierarchical online precoding solution sequence {Vt}t=1T.


Corollary 7. The dynamic regret bounds in Theorems 3 and 4 hold for $\{V_t\}_{t=1}^T$ generated by HiOCO, with the constants $\mu$, $L$, $D$, and $R$ given in Lemma 6 and $\sum_{t=1}^{T} \|\nabla f_t(V_t^*)\|_F^2 = 0$.


Simulation Results

In this section, we present simulation results under typical urban micro-cell LTE network settings. We study the impact of various system parameters on the convergence and performance of HiOCO. We numerically demonstrate the performance advantage of HiOCO over both the centralized and distributed alternatives.


Simulation Setup

We consider an urban hexagonal micro-cell of radius 500 m with $C = 3$ equally separated TRPs, each equipped with $N_c = 16$ antennas. We consider 5 co-located users in the middle of every two adjacent TRPs, for a total of $K = 15$ users in the network. Following the standard LTE specification [42], as default system parameters we set the maximum transmit power limit $P_{\max}^c = 30$ dBm, noise power spectral density $N_0 = -174$ dBm/Hz, and noise figure $NF = 10$ dB, and we focus on the channel over one subcarrier with bandwidth $BW = 15$ kHz. We model the fading channel between each user $k$ and each TRP $c$ as a first-order Gauss-Markov process $h_{t+1}^{c,k} = \alpha_h h_t^{c,k} + z_t^{c,k}$, where $h_t^{c,k} \sim \mathcal{CN}(0, \beta^{c,k} I)$ with $\beta^{c,k}\,[\mathrm{dB}] = -31.54 - 33\log_{10}(d^{c,k}) - \varphi^{c,k}$ representing the path-loss and shadowing effects, $d^{c,k}$ being the distance in kilometers from TRP $c$ to user $k$, $\varphi^{c,k} \sim \mathcal{N}(0, \sigma_\varphi^2)$ being the shadowing effect used to model the variation of user positions with $\sigma_\varphi^2 = 8$ dB, $\alpha_h \in [0,1]$ being the channel correlation coefficient, and $z_t^{c,k} \sim \mathcal{CN}(0, (1-\alpha_h^2)\beta^{c,k} I)$ being independent of $h_t^{c,k}$. We set $\alpha_h = 0.998$ as default, which corresponds to a user speed of 1 km/h. We assume each TRP $c$ communicates the accurate local CSI $H_t^c$ to the CC, since the impact of channel compression error can be emulated by increasing the communication delay $\tau_r$.
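A minimal sketch of this channel process for one (TRP, user) pair, assuming NumPy; the seeding and the shadowing draw are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

def large_scale_gain(d_km, sigma_phi_sq_db=8.0):
    # beta [dB] = -31.54 - 33 log10(d) - phi, with phi ~ N(0, sigma_phi^2)
    phi = rng.normal(0.0, np.sqrt(sigma_phi_sq_db))
    beta_db = -31.54 - 33.0 * np.log10(d_km) - phi
    return 10.0 ** (beta_db / 10.0)

def gauss_markov_channel(T, N_c, d_km, alpha_h=0.998):
    """First-order Gauss-Markov fading: h_{t+1} = alpha_h * h_t + z_t."""
    beta = large_scale_gain(d_km)
    s = np.sqrt(beta / 2.0)               # per real/imag component of CN(0, beta)
    h = s * (rng.standard_normal(N_c) + 1j * rng.standard_normal(N_c))
    hs = [h]
    zs = np.sqrt(1.0 - alpha_h ** 2) * s  # so z ~ CN(0, (1 - alpha_h^2) beta I)
    for _ in range(T - 1):
        z = zs * (rng.standard_normal(N_c) + 1j * rng.standard_normal(N_c))
        h = alpha_h * h + z
        hs.append(h)
    return np.array(hs)                   # shape (T, N_c)
```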


For our performance study, we assume the CC adopts cooperative zero forcing (ZF) precoding, given by






$$W_t^{ZF} = \sqrt{P_t^{ZF}}\, H_t^H \big(H_t H_t^H\big)^{-1}$$


where $P_t^{ZF}$ is a power normalizing factor. Note that we must have $N \ge K$ to perform ZF precoding. We assume all $K$ users have the same noise power $\sigma_n^2 = NF + N_0 BW$, and therefore all users have the same data rate








$$\log_2\Big(1 + \frac{P_t^{ZF}}{\sigma_n^2}\Big).$$




The CC adopts the power normalizing factor







$$P_t^{ZF} = \min\left\{\frac{P_{\max}^c}{\big\|H_t^{cH}\big(H_t H_t^H\big)^{-1}\big\|_F^2},\ \forall c\right\}$$






which is the optimal solution for the following sum rate maximization problem with per-TRP maximum transmit power limits:






$$\mathrm{P4}:\quad \max_{P_t^{ZF} \ge 0}\ K \log_2\Big(1 + \frac{P_t^{ZF}}{\sigma_n^2}\Big)$$

$$\text{s.t.}\quad P_t^{ZF}\,\big\|H_t^{cH}\big(H_t H_t^H\big)^{-1}\big\|_F^2 \le P_{\max}^c,\quad \forall c.$$
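The following is a minimal sketch of this benchmark computation (cooperative ZF direction plus the per-TRP power normalization solving P4); names are illustrative:

```python
import numpy as np

def cooperative_zf(H, antennas_per_trp, p_max):
    """Cooperative ZF precoder W_t^ZF = sqrt(P_t^ZF) H^H (H H^H)^{-1}.

    H: global channel (K, N), N = sum(antennas_per_trp); p_max: per-TRP limits.
    """
    D = H.conj().T @ np.linalg.inv(H @ H.conj().T)  # unnormalized direction, (N, K)
    # Row-block c of D equals H^{cH} (H H^H)^{-1}; check each per-TRP power limit.
    blocks = np.split(D, np.cumsum(antennas_per_trp)[:-1], axis=0)
    p_zf = min(p / np.linalg.norm(B, 'fro') ** 2 for p, B in zip(p_max, blocks))
    return np.sqrt(p_zf) * D, p_zf
```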






As performance metrics, we define the time-averaged normalized precoding deviation as








$$\bar{f}(T) = \frac{1}{T}\sum_{t=1}^{T} \frac{f_t(V_t)}{\big\|H_t W_t^{ZF}\big\|_F^2}$$







and the time-averaged per-user rate as








$$\bar{R}(T) = \frac{1}{TK}\sum_{t=1}^{T}\sum_{k=1}^{K} \log_2\big(1 + \mathrm{SINR}_t^k\big)$$






where







$$\mathrm{SINR}_t^k = \frac{\big|[H_t V_t]_{k,k}\big|^2}{\sum_{j \neq k} \big|[H_t V_t]_{k,j}\big|^2 + \sigma_n^2}$$







is the signal-to-interference-plus-noise ratio (SINR) of user k.
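A short sketch of how these two metrics can be evaluated from a simulated trajectory, assuming the per-slot cost $f_t(V) = \|H_t V - H_t W_t^{ZF}\|_F^2$ implied by the gradient expressions above; names are illustrative:

```python
import numpy as np

def per_user_rates(H_t, V_t, sigma_n2):
    # Per-user rate log2(1 + SINR_t^k) from the (K, K) effective channel H_t V_t.
    E = H_t @ V_t
    sig = np.abs(np.diag(E)) ** 2                  # |[H_t V_t]_{k,k}|^2
    interf = np.sum(np.abs(E) ** 2, axis=1) - sig  # sum_{j != k} |[H_t V_t]_{k,j}|^2
    return np.log2(1.0 + sig / (interf + sigma_n2))

def averaged_metrics(Hs, Vs, Ws, sigma_n2):
    """Time-averaged normalized precoding deviation f-bar(T) and per-user rate R-bar(T)."""
    f_bar = np.mean([np.linalg.norm(H @ V - H @ W, 'fro') ** 2
                     / np.linalg.norm(H @ W, 'fro') ** 2
                     for H, V, W in zip(Hs, Vs, Ws)])
    R_bar = np.mean([per_user_rates(H, V, sigma_n2).mean()
                     for H, V in zip(Hs, Vs)])
    return f_bar, R_bar
```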


Impact of Number of Estimated Gradient Descent Steps


FIG. 7 shows $\bar{f}(T)$ and $\bar{R}(T)$ versus $T$ for different numbers of estimated gradient descent steps, $J_r$ at the CC and $J_l$ at the TRPs. We first consider zero local delay and set the remote delay to $\tau_r = 1$. We observe that HiOCO converges fast (within $T = 100$ time slots). Furthermore, the system performance improves as $J_r$ or $J_l$ increases, showing the performance gain brought by performing multi-step estimated gradient descent at either the master node or the worker nodes. As shown in FIG. 8, the system performance almost stabilizes when $J_r = 8$. Further considering that the TRPs usually have less computation capacity than the CC, in the following simulations we set $J_l = 1$ and $J_r = 8$ as the default simulation parameters.


Impact of Channel Correlation

Next, we study the impact of channel correlation on the performance of HiOCO. Note that as $\alpha_h$ increases, the accumulated system variation measures become smaller, leading to better dynamic regret bounds. As shown in FIG. 9, the system performance improves as $\alpha_h$ increases, which is consistent with Theorems 3 and 4. When the channel is static, the steady-state per-user rate is high, at 11.4 bpcu. The reason is that our system operates in the high signal-to-noise ratio (SNR) region, in which the desired cooperative ZF precoding at the CC approaches the optimal precoder [43]. The steady-state per-user rate drops to 8.4 bpcu as $\alpha_h$ decreases to 0.999, which corresponds to a user speed of 0.5 km/h. This is because the cooperative ZF precoding nulls the inter-user interference, but its performance is sensitive to CSI inaccuracy [44] and, therefore, to the channel correlation in the online setting.


Performance Comparison

For performance comparison, we consider the delayed optimal precoder $V_{t-\tau_l-\tau_r}^* = W_{t-\tau_l-\tau_r}$ that can be computed by the CC after receiving the local CSI from the TRPs, at each time slot $t > \tau_l + \tau_r$. To show the performance gain brought by the local gradient descent in HiOCO, we consider centralized OCO algorithms that perform multi-step estimated gradient descent only at the CC. For the distributed alternatives, we consider an idealized dynamic user association scheme in which each user $k$ selects, at each time slot $t$, the TRP with the highest channel gain for downlink signal transmission based on the $\tau_l$-slot delayed local CSI $H_{t-\tau_l}^c$. Let $K_{t-\tau_l}^c$ denote the number of users associated with TRP $c$ based on $H_{t-\tau_l}^c$, and let








$$\tilde{H}_{t-\tau_l}^c \in \mathbb{C}^{K_{t-\tau_l}^c \times N_c}$$






denote the available channel state between the $K_{t-\tau_l}^c$ users and the $N_c$ antennas of TRP $c$ at each time slot $t > \tau_l$. Each TRP $c$ then adopts ZF precoding to serve the $K_{t-\tau_l}^c$ users with the $\tau_l$-delayed local CSI as








$$\tilde{V}_{t-\tau_l}^c = \sqrt{\tilde{P}_{t-\tau_l}^c}\, \tilde{H}_{t-\tau_l}^{cH}\big(\tilde{H}_{t-\tau_l}^c\, \tilde{H}_{t-\tau_l}^{cH}\big)^{-1}$$







where $\tilde{P}_{t-\tau_l}^c$ is set such that $\|\tilde{V}_{t-\tau_l}^c\|_F^2 = P_{\max}^c$. We also consider a fixed user association scheme in which each user $k$ selects the TRP with the lowest path loss and shadowing, and the local CSI is delayed by $\tau_l$ time slots at the TRPs. Let $K^c$ denote the number of users associated with TRP $c$ and $H_{t-\tau_l}^c \in \mathbb{C}^{K^c \times N_c}$ be the available channel state between the $K^c$ users and the $N_c$ antennas of TRP $c$ at each time slot $t > \tau_l$. For the fixed user association scheme, each TRP adopts the following ZF precoding to serve the $K^c$ users at each time slot $t > \tau_l$:







$$\bar{V}_{t-\tau_l}^c = \sqrt{\bar{P}_{t-\tau_l}^c}\, H_{t-\tau_l}^{cH}\big(H_{t-\tau_l}^c\, H_{t-\tau_l}^{cH}\big)^{-1}$$


where $\bar{P}_{t-\tau_l}^c$ is set such that











$$\big\|\bar{V}_{t-\tau_l}^c\big\|_F^2 = P_{\max}^c.$$
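Both user-association baselines reduce to the same per-TRP computation, differing only in which (possibly delayed) channel matrix is passed in; a minimal sketch (names illustrative):

```python
import numpy as np

def local_zf_full_power(H_c, p_max_c):
    """Per-TRP ZF V = sqrt(P) H^{cH} (H^c H^{cH})^{-1}, scaled so ||V||_F^2 = p_max_c."""
    D = H_c.conj().T @ np.linalg.inv(H_c @ H_c.conj().T)
    return np.sqrt(p_max_c / np.linalg.norm(D, 'fro') ** 2) * D
```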






FIG. 10 shows the performance comparison between HiOCO, the delayed optimal sequence $\{V_{t-\tau_l-\tau_r}^*\}_{t=1}^T$, centralized OCO with $J_r = 8$ and $J_r = 1$ steps of gradient descent, and the dynamic and fixed user association schemes $\{\tilde{V}_{t-\tau_l}\}_{t=1}^T$ and $\{\bar{V}_{t-\tau_l}\}_{t=1}^T$, respectively, with $\tau_l = 0$ and $\tau_r = 4$. We observe that HiOCO achieves the best system performance among all of the above alternative schemes. Furthermore, by performing only $J_l = 1$ additional local gradient descent step at the TRPs, HiOCO achieves a substantial performance gain over the centralized OCO with $J_r = 8$ steps of gradient descent. The user association schemes that are based on the timely local CSI have worse performance than the other alternatives, since the TRPs are not coordinated to jointly serve the users.


Impact of Remote and Local Delay


FIGS. 11 and 12 show the performance comparison on the steady-state values of $\bar{f}(T)$ and $\bar{R}(T)$ versus the remote delay $\tau_r$ and the local delay $\tau_l$. As shown in FIG. 11, over a wide range of remote delay, HiOCO is better than the user association schemes that are based on the timely local CSI, which shows the performance gain brought by the central gradient descent at the CC. Furthermore, HiOCO performs better than the other centralized alternatives through only one additional step of local gradient descent at the TRPs. As shown in FIG. 12, the performance gain of HiOCO over the centralized OCO algorithms decreases as the local delay $\tau_l$ increases, which indicates the importance of information timeliness to the performance gain of local gradient descent. However, over a wide range of local delay, HiOCO is better than the centralized OCO algorithms and the delayed user association schemes. By taking full advantage of the timely local CSI and the delayed global CSI to perform gradient descent at both the TRPs and the CC, HiOCO outperforms both the centralized and distributed alternatives over a wide range of delay.


Impact of Number of Antennas and Users

We further study the impact of the number of antennas $N_c$ and the number of users $K$. FIG. 13 shows that the precoding deviation $\bar{f}$ decreases as the number of antennas $N_c$ increases, since the TRPs have more degrees of freedom to design the cooperative precoding. The per-user rate $\bar{R}$ drastically improves as $N_c$ increases, indicating the performance advantage of massive MIMO systems.



FIG. 14 shows that the precoding deviation keeps increasing as the number of users $K$ increases, since the TRPs have fewer degrees of freedom to optimize the cooperative precoding for precoding deviation minimization. Note that to perform cooperative ZF precoding at the CC, the maximum number of users is $K = N = 48$. We observe that HiOCO substantially outperforms the delayed optimal precoder when the number of users is close to the number of antennas in the presence of delay. Over a wide range of $N_c$ and $K$, HiOCO yields the best performance among both the centralized and distributed alternatives.


Advantages of Embodiments

Embodiments provide OCO over a heterogeneous master-worker network with communication delay, to make a sequence of online local decisions that minimizes the accumulated global convex cost functions. The local data at the worker nodes may be non-i.i.d., and the global cost functions may be non-separable.


We propose a new HiOCO framework, which takes full advantage of the network heterogeneity in information timeliness and computation capacity, to enable multi-step estimated gradient descent at both the master and worker nodes. Our analysis considers the impacts of multi-slot delay, gradient estimation error, and the hierarchical architecture on the performance guarantees of HiOCO, to show sublinear dynamic regret bounds under mild conditions.


We apply HiOCO to a multi-TRP cooperative network with non-ideal backhaul links for 5G NR. We take full advantage of the information timeliness of the CSI and the computation resources at both the TRPs and the CC to improve system performance. By sharing the compressed local CSI and delayed global information, both the uplink and downlink communication overhead can be greatly reduced. The cooperative precoding solutions at both the TRPs and the CC are in closed form with low computational complexity.


Notes on the performance of the proposed methods: We numerically validate the performance of the proposed hierarchical precoding solution for multi-TRP cooperative networks under typical LTE cellular network settings. Extensive simulation results are provided to demonstrate the impact of the number of estimated gradient descent steps, channel correlation, remote and local delay, and the number of antennas and users. Simulation results demonstrate the superior delay tolerance and substantial performance advantage of HiOCO over both the centralized and distributed alternatives under different scenarios.



FIG. 15 is a flow chart according to an embodiment.


Process 1500 is a method for performing online convex optimization, performed e.g. by a master node such as master node 102 and/or CC 502.


Step 1502 comprises receiving, from two or more worker nodes, a local decision vector and local data corresponding to each of the two or more worker nodes.


Step 1504 comprises performing a multi-step gradient descent based on the local decision vector and the local data received from the two or more worker nodes, wherein performing the multi-step gradient descent comprises determining a global decision vector and corresponding global information.


Step 1506 comprises sending, to each of the two or more worker nodes, the global decision vector and corresponding global information.


In some embodiments, the local data received from each of the two or more worker nodes is compressed, and wherein the method further comprises uncompressing the local data received from each of the two or more worker nodes. In some embodiments, performing the multi-step gradient descent further comprises: initializing an intermediate decision vector {circumflex over (x)}tc,0=xt−τrc for each of the two or more worker nodes c; and for each step j in the multi-step gradient descent: (1) constructing an estimated gradient for each of the two or more worker nodes c, wherein the estimated gradient is based on {{circumflex over (x)}tc,j−1}c=1C and








$\{\hat{d}_{t-\tau_r}^c\}_{c=1}^{C}$,




and (2) updating {circumflex over (x)}tc,j for each of the two or more worker nodes c, by solving an optimization problem for {circumflex over (x)}tc,j based on the estimated gradients; where:

    • C refers to the number of the two or more worker nodes,
    • c is an index referring to a specific one of the two or more worker nodes,
    • t refers to the current time slot,
    • τr refers to a round-trip remote delay,







$\{x_{t-\tau_r}^c\}_{c=1}^{C}$




refers to the local decision vectors received from each of the two or more worker nodes,







$\{\hat{d}_{t-\tau_r}^c\}_{c=1}^{C}$




refers to compressed local data for each of the two or more worker nodes that is based on the local data received from each of the two or more worker nodes,


j∈[1, Jr], and


Jr refers to the number of steps of the multi-step gradient descent.


In some embodiments, the estimated gradient is given by












$$\nabla \hat{f}_{t-\tau_r}^c\big(\hat{x}_t^{c,j-1}\big) = h_f^c\Big(\hat{d}_{t-\tau_r}^c,\ \hat{x}_t^{c,j-1},\ g_f^c\big(\{\hat{d}_{t-\tau_r}^l\}_{l\neq c},\ \{\hat{x}_t^{l,j-1}\}_{l\neq c}\big)\Big),$$




the optimization problem is given by









$$\min_{x^c \in \mathcal{X}^c}\ \Big\langle \nabla \hat{f}_{t-\tau_r}^c\big(\hat{x}_t^{c,j-1}\big),\ x^c - \hat{x}_t^{c,j-1} \Big\rangle + \frac{\alpha}{2}\big\|x^c - \hat{x}_t^{c,j-1}\big\|_2^2,$$




and the corresponding global information for a given worker node c is given by








$$g_f^c\big(\{\hat{d}_{t-\tau_r}^l\}_{l\neq c},\ \{\hat{x}_t^{l,J_r}\}_{l\neq c}\big);$$




where:


∇{circumflex over (ƒ)}t−τrc( ) refers to a local gradient function,


hfc( ) refers to a general function,

    • Xc refers to a compact convex feasible set, and
    • α refers to a fixed parameter.
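Pulling together the initialization, the estimated gradient, and the proximal subproblem described above, the following is a minimal sketch of the master-node procedure of Process 1500. The general functions hfc and gfc and the projections onto Xc are passed in as callables, since the disclosure leaves them abstract; all names are illustrative.

```python
def master_multistep_descent(x_prev, d_hat, h_f, g_f, project, J_r, alpha):
    """Generic J_r-step estimated gradient descent at the master node.

    x_prev:  list of C delayed local decision vectors x_{t-tau_r}^c
    d_hat:   list of C (de)compressed local data d-hat_{t-tau_r}^c
    h_f:     callable h_f(c, d, x, g) -> estimated gradient for worker c
    g_f:     callable g_f(c, d_hat, x) -> global information for worker c
    project: callable project(c, v) -> projection of v onto X^c
    Returns the global decision vectors {x-hat_t^{c,J_r}} and global information.
    """
    C = len(x_prev)
    x = [v.copy() for v in x_prev]           # x-hat_t^{c,0} = x_{t-tau_r}^c
    for _ in range(J_r):
        # Each worker's gradient uses the previous iterates of all other workers.
        grads = [h_f(c, d_hat[c], x[c], g_f(c, d_hat, x)) for c in range(C)]
        # The linearized proximal subproblem above is solved by a projected step:
        # argmin_{v in X^c} <grad, v - x> + (alpha/2)||v - x||^2 = proj(x - grad/alpha).
        x = [project(c, x[c] - grads[c] / alpha) for c in range(C)]
    G = [g_f(c, d_hat, x) for c in range(C)]  # global information at j = J_r
    return x, G
```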


In some embodiments, the local data corresponding to each of the two or more worker nodes has a non-zero local delay. In some embodiments, the two or more worker nodes comprise transmission/reception points (TRPs), the local data corresponds to local channel state information, and the local decision vectors correspond to precoding matrices. In some embodiments, performing the multi-step gradient descent further comprises: initializing an intermediate precoding matrix {circumflex over (V)}tc,0=Vt−τrc for each of the two or more TRPs c; and for each step j in the multi-step gradient descent: (1) constructing an estimated gradient for each of the two or more TRPs c, wherein the estimated gradient is based on









$\{V_{t-\tau_r}^c\}_{c=1}^{C}$ and $\{\hat{H}_{t-\tau_r}^c\}_{c=1}^{C}$,




and (2) updating {circumflex over (V)}tc,j for each of the two or more TRPs c, by solving an optimization problem for {circumflex over (V)}tc,j based on the estimated gradients; where:


C refers to the number of the two or more worker nodes,


c is an index referring to a specific one of the two or more worker nodes


t refers to the current time slot,


τr refers to a round-trip remote delay,







$\{V_{t-\tau_r}^c\}_{c=1}^{C}$




refers to the local precoding matrices received from each of the two or more TRPs,







$\{\hat{H}_{t-\tau_r}^c\}_{c=1}^{C}$




refers to compressed local channel state information for each of the two or more TRPs that is based on the local channel state information received from each of the two or more TRPs,


j∈[1, Jr], and


Jr refers to the number of steps of the multi-step gradient descent.


In some embodiments, the estimated gradient is given by $\nabla \hat{f}_{t-\tau_r}^c\big(\hat{V}_t^{c,j-1}\big) = \hat{H}_{t-\tau_r}^{cH}\big(\sum_{l=1}^{C} \hat{H}_{t-\tau_r}^{l}\, \hat{V}_t^{l,j-1} - \hat{H}_{t-\tau_r}\hat{W}_{t-\tau_r}\big)$, a solution to the optimization problem is given by









$$\hat{V}_t^{c,j} = \mathcal{P}_{\mathcal{V}^c}\Big\{\hat{V}_t^{c,j-1} - \frac{1}{\alpha}\,\nabla \hat{f}_{t-\tau_r}^c\big(\hat{V}_t^{c,j-1}\big)\Big\},$$




and the corresponding global information for a given TRP $c$ is given by $\hat{G}_{t-\tau}^c = \sum_{l=1,\, l\neq c}^{C} \hat{H}_{t-\tau_r}^{l}\, \hat{V}_t^{l,J_r} - \hat{H}_{t-\tau_r}\hat{W}_{t-\tau_r} \in \mathbb{C}^{K\times K}$; where








$$\mathcal{P}_{\mathcal{V}^c}\{V^c\} = \arg\min_{U^c \in \mathcal{V}^c} \big\|U^c - V^c\big\|_F^2$$

is the projection operator onto the convex feasible set $\mathcal{V}^c$,


Ŵt−τr refers to a desired global precoding matrix,


∇{circumflex over (ƒ)}t−τrc( ) refers to a local gradient function, and


α refers to a fixed parameter.



FIG. 16 is a flow chart according to an embodiment.


Process 1600 is a method for performing online convex optimization, performed e.g. by a worker node such as worker node 104 and/or TRP 504.


Step 1602 comprises receiving, from a master node, a global decision vector and corresponding global information, wherein the global information has a time delay associated with it.


Step 1604 comprises performing a multi-step gradient descent based on the global decision vector and local data, wherein performing the multi-step gradient descent comprises determining a local decision vector.


Step 1606 comprises sending, to the master node, the local decision vector and local data.


In some embodiments, the local data sent to the master node is compressed prior to sending. In some embodiments, performing the multi-step gradient descent further comprises:

    • initializing an intermediate decision vector {tilde over (x)}tc,0={circumflex over (x)}tc,Jr; and for each step j in the multi-step gradient descent: (1) constructing an estimated gradient, wherein the estimated gradient is based on dtc and








$$g_f^c\big(\{\hat{d}_{t-\tau_r}^l\}_{l\neq c},\ \{\hat{x}_t^{l,J_r}\}_{l\neq c}\big),$$




and (2) updating {tilde over (x)}tc,j, by solving an optimization problem for {tilde over (x)}tc,j based on the estimated gradient; where:


c is an index referring to a worker node corresponding to the local data,


t refers to the current time slot,


τr refers to a round-trip remote delay,


dtc refers to the local data,


{circumflex over (x)}tc,Jr refers to the global decision vector,







$g_f^c\big(\{\hat{d}_{t-\tau_r}^l\}_{l\neq c},\ \{\hat{x}_t^{l,J_r}\}_{l\neq c}\big)$




refers to the global information,

    • j∈[1,J1], and
    • J1 refers to the number of steps of the multi-step gradient descent.


In some embodiments, the estimated gradient is given by













$$\nabla \hat{f}_t^c\big(\tilde{x}_t^{c,j-1}\big) = h_f^c\Big(d_t^c,\ \tilde{x}_t^{c,j-1},\ g_f^c\big(\{\hat{d}_{t-\tau_r}^l\}_{l\neq c},\ \{\hat{x}_t^{l,J_r}\}_{l\neq c}\big)\Big),$$




the optimization problem is given by









$$\min_{x^c \in \mathcal{X}^c}\ \Big\langle \nabla \hat{f}_t^c\big(\tilde{x}_t^{c,j-1}\big),\ x^c - \tilde{x}_t^{c,j-1} \Big\rangle + \frac{\alpha}{2}\big\|x^c - \tilde{x}_t^{c,j-1}\big\|_2^2,$$




and the local decision vector is given by xtc={tilde over (x)}tc,J1; where:


∇{circumflex over (ƒ)}tc( ) refers to a local gradient function,


hfc( ) refers to a general function,


Xc refers to a compact convex feasible set, and


α refers to a fixed parameter.
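A matching sketch of the worker-node procedure of Process 1600, under the same conventions as the master-node sketch above (abstract functions passed as callables; names illustrative):

```python
def worker_multistep_descent(x_global, d_local, g_info, h_f, project_c, J_1, alpha):
    """Generic J_1-step estimated gradient descent at worker node c.

    x_global:  global decision vector x-hat_t^{c,J_r} received from the master
    d_local:   timely local data d_t^c
    g_info:    delayed global information g_f^c({d-hat^l}, {x-hat^{l,J_r}})
    h_f:       callable h_f(d, x, g) -> estimated gradient
    project_c: projection onto the feasible set X^c
    Returns the local decision x_t^c = x-tilde_t^{c,J_1}.
    """
    x = x_global.copy()                  # x-tilde_t^{c,0} = x-hat_t^{c,J_r}
    for _ in range(J_1):
        grad = h_f(d_local, x, g_info)   # timely local data + delayed global info
        x = project_c(x - grad / alpha)  # solves the proximal subproblem in closed form
    return x
```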


In some embodiments, the local data has a non-zero local delay. In some embodiments, the local data corresponds to local channel state information, and the local decision vectors correspond to precoding matrices. In some embodiments, performing the multi-step gradient descent further comprises: initializing an intermediate precoding matrix {tilde over (V)}tc,0={circumflex over (V)}tc,Jr; and for each step j in the multi-step gradient descent: (1) constructing an estimated gradient, wherein the estimated gradient is based on Ht−τ1c and Ĝt−τc, and (2) updating {tilde over (V)}tc,j, by solving an optimization problem for {tilde over (V)}tc,j based on the estimated gradient; where:


c is an index referring to a worker node corresponding to the local data,


t refers to the current time slot,


τr refers to a round-trip remote delay,


τ1 refers to a local delay,


τ refers to the total delay,


Htc refers to the local channel state information,


{circumflex over (V)}tc,Jr refers to the global precoding matrix,


Ĝt−τc refers to the global information,


j∈[1,J1], and


J1 refers to the number of steps of the multi-step gradient descent.


In some embodiments, the estimated gradient is given by $\nabla \hat{f}_{t-\tau_1}^c\big(\tilde{V}_t^{c,j-1}\big) = H_{t-\tau_1}^{cH}\big(H_{t-\tau_1}^c\, \tilde{V}_t^{c,j-1} + \hat{G}_{t-\tau}^c\big)$, a solution to the optimization problem is given by









$$\tilde{V}_t^{c,j} = \mathcal{P}_{\mathcal{V}^c}\Big\{\tilde{V}_t^{c,j-1} - \frac{1}{\alpha}\,\nabla \hat{f}_{t-\tau_1}^c\big(\tilde{V}_t^{c,j-1}\big)\Big\},$$




and the local precoding matrix is given by Vtc={tilde over (V)}tc,J1; where:








$$\mathcal{P}_{\mathcal{V}^c}\{V^c\} = \arg\min_{U^c \in \mathcal{V}^c} \big\|U^c - V^c\big\|_F^2$$

is the projection operator onto the convex feasible set $\mathcal{V}^c$, $\nabla \hat{f}_{t-\tau_1}^c(\,)$ refers to a local gradient function, and


α refers to a fixed parameter.



FIG. 17 is a block diagram of an apparatus such as master node 102, worker node 104, CC 502, and/or TRP 504, according to some embodiments. As shown in FIG. 17, the apparatus may comprise: processing circuitry (PC) 1702, which may include one or more processors (P) 1755 (e.g., a general purpose microprocessor and/or one or more other processors, such as an application specific integrated circuit (ASIC), field-programmable gate arrays (FPGAs), and the like); a network interface 1748 comprising a transmitter (Tx) 1745 and a receiver (Rx) 1747 for enabling the apparatus to transmit data to and receive data from other nodes connected to a network 1710 (e.g., an Internet Protocol (IP) network) to which network interface 1748 is connected; and a local storage unit (a.k.a., “data storage system”) 1708, which may include one or more non-volatile storage devices and/or one or more volatile storage devices. In embodiments where PC 1702 includes a programmable processor, a computer program product (CPP) 1741 may be provided. CPP 1741 includes a computer readable medium (CRM) 1742 storing a computer program (CP) 1743 comprising computer readable instructions (CRI) 1744. CRM 1742 may be a non-transitory computer readable medium, such as, magnetic media (e.g., a hard disk), optical media, memory devices (e.g., random access memory, flash memory), and the like. In some embodiments, the CRI 1744 of computer program 1743 is configured such that when executed by PC 1702, the CRI causes the apparatus to perform steps described herein (e.g., steps described herein with reference to the flow charts). In other embodiments, the apparatus may be configured to perform steps described herein without the need for code. That is, for example, PC 1702 may consist merely of one or more ASICs. Hence, the features of the embodiments described herein may be implemented in hardware and/or software.


While various embodiments of the present disclosure are described herein, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present disclosure should not be limited by any of the above-described example embodiments. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.


Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, and some steps may be performed in parallel.


REFERENCES



  • [1] S. Shalev-Shwartz, “Online learning and online convex optimization,” Found. Trends Mach. Learn., vol. 4, pp. 107-194, February 2012.

  • [2] E. Hazan, “Introduction to online convex optimization,” Found. Trends Optim., vol. 2, pp. 157-325, August 2016.

  • [3] M. Zinkevich, “Online convex programming and generalized infinitesimal gradient ascent,” in Proc. Int. Conf. Mach. Learn. (ICML), 2003.

  • [4] E. Hazan, A. Agarwal and S. Kale, “Logarithmic regret algorithms for online convex optimization,” Mach. Learn., vol. 69, pp. 169-192, 2007.

  • [5] J. Langford, A. J. Smola and M. Zinkevich, “Slow learners are fast,” in Proc. Adv. Neural Info. Proc. Sys. (NIPS), 2009.

  • [6] K. Quanrud and D. Khashabi, “Online learning with adversarial delays,” in Proc. Adv. Neural Info. Proc. Sys. (NIPS), 2015.

  • [7] E. C. Hall and R. M. Willett, “Online convex optimization in dynamic environments,” IEEE J. Sel. Topics Signal Process., vol. 9, pp. 647-662, June 2015.

  • [8] A. Jadbabaie, A. Rakhlin, S. Shahrampour and K. Sridharan, “Online optimization: competing with dynamic comparators,” in Proc. Int. Conf. Artif. Intell. Statist. (AISTATS), 2015.

  • [9] A. Mokhtari, S. Shahrampour, A. Jadbabaie and A. Ribeiro, “Online optimization in dynamic environments: Improved regret rates for strongly convex problems,” in Proc. IEEE Conf. Decision Control (CDC), 2016.

  • [10] L. Zhang, T. Yang, J. Yi, J. Rong and Z.-H. Zhou, “Improved dynamic regret for non-degenerate functions,” in Proc. Adv. Neural Info. Proc. Sys. (NIPS), 2017.

  • [11] A. S. Bedi, P. Sarma and K. Rajawat, “Tracking moving agents via inexact online gradient descent algorithm,” IEEE J. Sel. Topics Signal Process, vol. 12, pp. 202-217, 2018.

  • [12] R. Dixit, A. S. Bedi, R. Tripathi and K. Rajawat, “Online learning with inexact proximal online gradient descent algorithms,” IEEE Trans. Signal Process., vol. 67, pp. 1338-1352, 2019.

  • [13] 3GPP TS 38.300, “3rd Generation Partnership Project; Technical Specification Group Radio Access Network; NR; NR and NG-RAN Overall Description; Stage 2 (Release 15).”

  • [14] B. Liang, “Mobile edge computing,” in Key Technologies for 5G Wireless Systems, Cambridge University Press, 2017.

  • [15] J. P. Champati and B. Liang, “Semi-online algorithms for computational task offloading with communication delay,” IEEE Trans. Parallel Distrib. Syst., vol. 28, pp. 1189-1201, 2017.

  • [16] S. J. Wright, “Coordinate descent algorithms,” Math. Programming, vol. 151, pp. 3-34, 2015.

  • [17] S. Boyd, N. Parikh, E. Chu, B. Peleato and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Found. Trends Mach. Learn., vol. 3, pp. 1-122, 2011.

  • [18] M. Hong, T.-H. Chang, X. Wang, M. Razaviyayn, S. Ma and Z.-Q. Luo, “A block successive upper-bound minimization method of multipliers for linearly constrained convex optimization,” Math. Oper. Res., vol. 45, pp. 933-961, 2020.

  • [19] M. Zinkevich, M. Weimer, L. Li and A. J. Smola, “Parallelized stochastic gradient descent,” in Proc. Adv. Neural Info. Proc. Sys. (NIPS), 2010.

  • [20] H. B. McMahan, E. Moore, D. Ramage and B. Agüera y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. Int. Conf. Artif. Intell. Statist. (AISTATS), 2017.

  • [21] J. C. Duchi, A. Agarwal and M. J. Wainwright, “Dual averaging for distributed optimization: convergence analysis and network scaling,” IEEE Trans. Autom. Control, vol. 57, pp. 592-606, 2012.

  • [22] D. Mateos-Núñez and J. Cortés, “Distributed online convex optimization over jointly connected digraphs,” IEEE Trans. Netw. Sci. Eng., vol. 1, pp. 23-37, 2014.

  • [23] A. Koppel, F. Y. Jakubiec and A. Ribeiro, “A saddle point algorithm for networked online convex optimization,” IEEE Trans. Signal Process., vol. 63, pp. 5149-5164, 2015.

  • [24] M. Akbari, B. Gharesifard and T. Linder, “Distributed online convex optimization on time-varying directed graphs,” IEEE Trans. Control Netw. Syst., vol. 4, pp. 417-428, 2017.

  • [25] S. Shahrampour and A. Jadbabaie, “Distributed online optimization in dynamic environments using mirror descent,” IEEE Trans. Autom. Control, vol. 63, pp. 714-725, March 2018.

  • [26] N. Eshraghi and B. Liang, “Distributed online optimization over a heterogeneous network with any-batch mirror descent,” in Proc. Int. Conf. Mach. Learn. (ICML), 2020.

  • [27] Y. Zhang, R. J. Ravier, M. M. Zavlanos and V. Tarokh, “A distributed online convex optimization algorithm with improved dynamic regret,” in Proc. IEEE Conf. Decision Control (CDC), 2019.

  • [28] M. J. Neely, Stochastic Network Optimization with Application on Communication and Queueing Systems, Morgan & Claypool, 2010.

  • [29] F. Amirnavaei and M. Dong, “Online power control optimization for wireless transmission with energy harvesting and storage,” IEEE Trans. Wireless Commun., vol. 66, pp. 4888-4901, July 2016.

  • [30] M. Dong, W. Li and F. Amirnavaei, “Online joint power control for two-hop wireless relay networks with energy harvesting,” IEEE Trans. Signal Process., vol. 66, pp. 462-478, January 2018.

  • [31] J. Wang, M. Dong, B. Liang and G. Boudreau, “Online downlink MIMO wireless network virtualization in fading environments,” in Proc. IEEE Global Commun. Conf. (GLOBECOM), 2019.

  • [32] J. Wang, M. Dong, B. Liang and G. Boudreau, “Online precoding design for downlink MIMO wireless network virtualization with imperfect CSI,” in Proc. IEEE Conf. Comput. Commun. (INFOCOM), 2020.

  • [33] P. Mertikopoulos and E. V. Belmega, “Learning to be green: Robust energy efficiency maximization in dynamic MIMO-OFDM system,” IEEE J. Sel. Areas. Commun., vol. 34, pp. 743-757, April 2016.

  • [34] P. Mertikopoulos and A. L. Moustakas, “Learning in an uncertain world: MIMO covariance matrix optimization with imperfect feedback,” IEEE Trans. Signal Process., vol. 64, pp. 5-18, January 2016.

  • [35] H. Yu and M. J. Neely, “Dynamic transmit covariance design in MIMO fading systems with unknown channel distributions and inaccurate channel state information,” IEEE Trans. Wireless Commun., vol. 16, pp. 3996-4008, June 2017.

  • [36] J. Wang, B. Liang, M. Dong and G. Boudreau, “Online MIMO wireless network virtualization over time-varying channels with periodic updates,” in Proc. IEEE Int. Workshop on Signal Process. Advances in Wireless Commun. (SPAWC), 2020.

  • [37] D. Gesbert, S. Hanly, H. Huang, S. Shamai (Shitz), O. Simeone and W. Yu, “Multi-cell MIMO cooperative networks: A new look at interference,” IEEE J. Sel. Areas Commun., vol. 28, pp. 1380-1408, December 2010.

  • [38] H. Zhang, N. B. Mehta, A. F. Molisch, J. Zhang and H. Dai, “Asynchronous interference mitigation in cooperative base station systems,” IEEE Trans. Wireless Commun., vol. 7, pp. 155-165, January 2008.

  • [39] R. Zhang, “Cooperative multi-cell block diagonalization with per-base-station power constraints,” IEEE J. Sel. Areas. Commun., vol. 28, pp. 1435-1445, 2010.

  • [40] O. Besbes, Y. Gur and A. Zeevi, “Non-stationary stochastic optimization,” Oper. Res., vol. 63, pp. 1227-1244, September 2015.

  • [41] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar and L. Zhang, “Deep learning with differential privacy,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur. (CCS), 2016.

  • [42] H. Holma and A. Toskala, WCDMA for UMTS: HSPA Evolution and LTE, John Wiley & Sons, 2010.

  • [43] Y. Jiang, M. K. Varanasi and J. Li, “Performance analysis of ZF and MMSE equalizers for MIMO systems: An in-depth study of the high SNR regime,” IEEE Trans. Inf. Theory, vol. 57, pp. 2008-2026, April 2011.

  • [44] R. Corvaja and A. G. Armada, “Phase noise degradation in massive MIMO downlink with zero-forcing and maximum ratio transmission precoding,” IEEE Trans. Veh. Technol., vol. 65, pp. 8052-8059, October 2016.


Claims
  • 1. A method for performing online convex optimization, the method comprising: receiving, from two or more worker nodes, a local decision vector and local data corresponding to each of the two or more worker nodes;performing a multi-step gradient descent based on the local decision vector and the local data received from the two or more worker nodes, wherein performing the multi-step gradient descent comprises determining a global decision vector and corresponding global information; andsending, to each of the two or more worker nodes, the global decision vector and corresponding global information.
  • 2. The method of claim 1, wherein the local data received from each of the two or more worker nodes is compressed, and wherein the method further comprises uncompressing the local data received from each of the two or more worker nodes.
  • 3. The method of claim 1, wherein performing the multi-step gradient descent further comprises: initializing an intermediate decision vector {circumflex over (x)}tc,0=xt−τrc, for each of the two or more worker nodes c;for each step j in the multi-step gradient descent:(1) constructing an estimated gradient for each of the two or more worker nodes c, wherein the estimated gradient is based on {{circumflex over (x)}tc,j−1}c=1C and
  • 4. The method of claim 3, wherein the estimated gradient is given by
  • 5. The method of claim 1, wherein the local data corresponding to each of the two or more worker nodes has a non-zero local delay.
  • 6. The method of claim 1, wherein the two or more worker nodes comprise transmission/reception points (TRPs), the local data corresponds to local channel state information, and the local decision vectors correspond to precoding matrices.
  • 7. The method of claim 6, wherein performing the multi-step gradient descent further comprises: initializing an intermediate precoding matrix {circumflex over (V)}tc,0=Vt−τrc, for each of the two or more TRPs c;for each step j in the multi-step gradient descent:(1) constructing an estimated gradient for each of the two or more TRPs c, wherein the estimated gradient is based on
  • 8. The method of claim 7, wherein the estimated gradient is given by ∇{circumflex over (ƒ)}t−τrc({circumflex over (V)}tc,j−1)=Ĥt−τrc(Σl=1C(Ĥt−τrl{circumflex over (V)}tl,j−1)−Ĥt−τrŴt−τr), wherein a solution to the optimization problem is given by
  • 9. A method for performing online convex optimization, the method comprising: receiving, from a master node, a global decision vector and corresponding global information, wherein the global information has a time delay associated with it;performing a multi-step gradient descent based on the global decision vector and local data, wherein performing the multi-step gradient descent comprises determining a local decision vector; andsending, to the master node, the local decision vector and local data.
  • 10. The method of claim 9, wherein the local data sent to the master node is compressed prior to sending.
  • 11. The method of claim 9, wherein performing the multi-step gradient descent further comprises: initializing an intermediate decision vector {tilde over (x)}tc,0={circumflex over (x)}tc,Jr;for each step j in the multi-step gradient descent:(1) constructing an estimated gradient, wherein the estimated gradient is based on dtc and
  • 12. The method of claim 11, wherein the estimated gradient is given by
  • 13. The method of claim 9, wherein the local data has a non-zero local delay.
  • 14. The method of claim 9, wherein the local data corresponds to local channel state information, and the local decision vectors correspond to precoding matrices.
  • 15. The method of claim 14, wherein performing the multi-step gradient descent further comprises: initializing an intermediate precoding matrix {tilde over (V)}tc,0={circumflex over (V)}tc,Jr;for each step j in the multi-step gradient descent:(1) constructing an estimated gradient, wherein the estimated gradient is based on Ht−τ1c and Ĝt−τc, and(2) updating {tilde over (V)}tc,j, by solving an optimization problem for {tilde over (V)}tc,j based on the estimated gradient;where:c is an index referring to a worker node corresponding to the local data,t refers to the current time slot,τr refers to a round-trip remote delay,τ1 refers to a local delay,τ refers to the total delay,Htc refers to the local channel state information,{circumflex over (V)}tc,Jr refers to the global precoding matrix,Ĝt−τc, refers to the global information,j∈[1,J1], andJ1 refers to the number of steps of the multi-step gradient descent.
  • 16. The method of claim 15, wherein the estimated gradient is given by ∇{circumflex over (ƒ)}t−τ1c({tilde over (V)}tc,j−1)=Ht−τ1c (Ht−τ1c {tilde over (V)}tc,j−1+Ĝt−τc), wherein a solution to the optimization problem is given by
  • 17. A master node adapted to perform the method of claim 1.
  • 18. A worker node adapted to perform the method of claim 9.
  • 19. A master node for performing online convex optimization, the master node comprising processing circuitry and a memory containing instructions executable by the processing circuitry, whereby the processing circuitry is operable to: receive, from two or more worker nodes, a local decision vector and local data corresponding to each of the two or more worker nodes;perform a multi-step gradient descent based on the local decision vector and the local data received from the two or more worker nodes, wherein performing the multi-step gradient descent comprises determining a global decision vector and corresponding global information; andsend, to each of the two or more worker nodes, the global decision vector and corresponding global information.
  • 20. A worker node for performing online convex optimization, the worker node comprising processing circuitry and a memory containing instructions executable by the processing circuitry, whereby the processing circuitry is operable to: receive, from a master node, a global decision vector and corresponding global information, wherein the global information has a time delay associated with it;perform a multi-step gradient descent based on the global decision vector and local data, wherein performing the multi-step gradient descent comprises determining a local decision vector; andsend, to the master node, the local decision vector and local data.
  • 21. A computer program comprising instructions which when executed by processing circuitry of a node causes the node to perform the method of claim 1.
  • 22. A carrier containing the computer program of claim 21, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.
PCT Information
Filing Document Filing Date Country Kind
PCT/IB2022/050212 1/12/2022 WO
Provisional Applications (1)
Number Date Country
63144257 Feb 2021 US