Federated edge learning (FEEL) is a distributed learning framework that leverages the computational powers of EDs and uses the local data at the EDs without compromising their privacy to train a model.
However, the communication aspect of FEEL stands as one of the main bottlenecks. To address this issue, one of the promising solutions is to perform the aggregation with over-the-air computation methods that harness the interference that naturally occurs in wireless systems. However, developing a broadband AirComp scheme that allows analog computation via a digital scheme is not trivial. In this disclosure, we address this issue with a novel AirComp scheme.
Over-the-air computation (OAC) refers to the computation of mathematical functions by exploiting the signal-superposition property of wireless multiple-access channel [1], [2]. To reduce the utilization of limited wireless resources, it was initially considered for wireless sensor networks [3]. With the same motivation, OAC has recently gained increasing attention in the literature for applications such as distributed learning or wireless control systems [4]-[6]. For example, federated edge learning (FEEL), one of the promising distributed edge learning frameworks, aims to implement federated learning (FL) [7] over a wireless network.
With FEEL, the task of model training is distributed across multiple edge devices (EDs) and the data uploading is avoided to promote user-privacy [5], [8]. Instead of data samples, EDs share a large number of local stochastic gradients (or local model parameters) with an edge server (ES) for aggregation, e.g., averaging.
However, typical orthogonal user multiplexing methods such as orthogonal frequency division multiple access (OFDMA) can be wasteful in this scenario since the ES may not be interested in the local information of the EDs but only in a function of them.
Similarly, a control system that requires an input that is a function of many Internet-of-Things (IoT) devices' readings can suffer from high latency since the available spectrum for these networks is often limited and the OAC can address the latency issues by calculating the functions, e.g., difference equations [6], over the air.
Despite the motivations for OAC, it is challenging to realize a reliable OAC scheme due to the detrimental impact of wireless channels on the OAC. To address this issue, a majority of the state-of-the-art OAC methods rely on pre-equalization techniques [9]-[17]. However, a pre-equalizer can impose stringent requirements on the underlying mechanisms such as time-frequency synchronization, channel estimation, and channel prediction, which can be challenging to satisfy under non-stationary channel conditions [18], [19]. Also, the typical equalization or phase correction methods used at the receiver for traditional multiple-access schemes, e.g., OFDMA, cannot be directly employed for OAC to compensate for the channel distortion or imperfect synchronization, as the impact of distortions on the received symbols after the superposition often does not satisfy the distributive law of mathematical operations.
Another issue is that most of the OAC schemes use analog modulation schemes to achieve a continuous-valued computation. However, analog modulations are more susceptible to noise as compared to digital schemes. Although there are digital aggregation methods, e.g., one-bit broadband digital aggregation (OBDA) [12], frequency-shift keying (FSK)-based majority vote (MV) (FSK-MV) [20], and pulse-position modulation (PPM)-based MV (PPM-MV) [21], by relying on specific training approaches, i.e., distributed training by the MV with sign stochastic gradient descent (signSGD) [22], these schemes do not allow one to compute a continuous-valued function.
1) Over-the-air computation: In the literature, OAC schemes are particularly investigated to reduce the per-round communication latency of FEEL. In [9], broadband analog aggregation (BAA) that modulates the orthogonal frequency division multiplexing (OFDM) subcarriers with the model parameters is proposed. To overcome the impact of the multipath channel on the transmitted signals, the symbols on the OFDM subcarriers are multiplied with the inverse of the channel coefficients and the subcarriers that fade are excluded from the transmissions, known as truncated-channel inversion (TCI) in the literature. In [10], an additional time-varying precoder is applied along with TCI to facilitate the aggregation. In [11], the gradient estimates are sparsified and the sparse vectors are projected into a low-dimensional vector to reduce the bandwidth. The compressed data is transmitted with BAA. In [14], BAA with power control and re-transmissions over static channels is investigated to obtain the optimal number of re-transmissions.
In [15], a multi-slot OAC framework is proposed for fast-fading channels and time diversity is exploited to mitigate the impact of fading channel on OAC. In [16], instead of TCI, the parameters are multiplied with the conjugate of the channel coefficients (i.e., maximum-ratio transmission) to increase the power efficiency. In [17], the channel inversion is optimized with the consideration of sum-power constraint to avoid potential interference issues.
In [12], OBDA is proposed to facilitate the implementation of FEEL for a practical wireless system. In this method, considering distributed training by MV with the signSGD [22], the EDs transmit quadrature phase-shift keying (QPSK) symbols along with TCI, where the real and imaginary parts of the QPSK symbols are formed by using the signs of the stochastic gradients. At the ES, the signs of the real and imaginary components of the superposed received symbols on each subcarrier are calculated to obtain the MV for the sign of each gradient. The authors in also consider OBDA, but the pre-equalization in this method applies only phase correction to the transmitted symbols (i.e., equal-gain transmission) by emphasizing the fact that the amplitude alignment is not needed for digital OAC. The reader is referred to for various combining strategies for channel-aware decision fusion under the assumption of real-valued channel coefficients.
In the literature, there are OAC methods that do not use pre-equalization. For example, in and [27], the authors consider blind EDs, i.e., CSI is not available at the ED. By exploiting channel hardening, the ES utilizes an estimate of the superposed CSI to achieve an analog aggregation with maximum-ratio combining (MRC). Pre-equalization along with digital and analog beamforming are investigated for OAC in and [28], respectively. To address the issues related to TCI, in and [21], the OAC for FEEL is realized by exploiting non-coherent receiver techniques. Similar to OBDA [12], the schemes in and depend on the distributed training based on MV [22]. However, instead of modulating the phase of the OFDM subcarriers based on the sign of the local stochastic gradients, the schemes in and use FSK and PPM, respectively, and a non-coherent detector is used to detect the MV at the ES. While and provide robustness against time-variation of the wireless channel, synchronization errors, and imperfect power control, the 1-bit quantization nature of signSGD can degrade the test accuracy in the heterogeneous data distribution scenarios.
In and [30], Goldenbaum and Stanczak propose to calculate the energy of a sequence of superposed symbols. In [31], they show that their scheme can also work when there is no CSI at the transmitter under a scenario where the ES is equipped with multiple antennas. Nevertheless, the performance of these schemes relies on the existence of a special set of sequences to reduce the interference across the uses. Another relevant work is Goldenbaum's digital OAC method discussed in [32]. In this method, a general nomographic function is targeted. Lattice coding with a special source encoding is investigated. However, the proposed scheme is only evaluated for an additive white Gaussian noise (AWGN) channel.
2) Quantization: In the literature, extensive efforts have been made to decrease the communication costs of machine learning algorithms by quantization. For example, in [33], a general quantized stochastic gradient descent (SGD) (QSGD) with the Elias integer encoding is investigated for encoding the gradients by relying on the fact that large gradients are often less frequent. The signSGD, proposed in [22], is an extreme case of quantization, where the signs of the gradients are considered for the training. In [34], a ternary quantization, which is a balanced number system where the base is 3, is applied to the model parameters to implement FL based on parameter averaging.
In [35], by considering the tradeoff between precision and energy, the quantization levels for the neural network parameters are optimized for FL. In [36], a gradient quantization method that uses the historical gradients as side information to compress the local gradients is proposed. The authors exploit the fact that gradients between adjacent rounds may have a high correlation for SGD. Nevertheless, these quantization methods consider either ideal communication channels for training or orthogonal multiple access for EDs. Also, they do not consider an OAC scheme.
Aspects and advantages of the presently disclosed subject matter will be set forth in part in the following description, or may be apparent from the description, or may be learned through practice of the presently disclosed subject matter.
Broadly speaking, the presently disclosed subject matter relates to improvements in federated edge learning (FEEL) as a distributed learning framework.
Further, this disclosure addresses the communication latency problem of training an artificial intelligence model over a wireless network. It reduces the per-round communication latency. It also paves the way for training when the data distribution is heterogeneous in the network.
More particularly, we presently disclose a digital over-the-air computation (OAC) scheme for achieving continuous-valued (analog) aggregation for distributed machine learning applications in wireless networks, e.g., federated edge learning (FEEL). We show that the average of a set of real-valued parameters can be calculated approximately by using the average of the corresponding numerals, where the numerals are obtained based on a balanced number system. By exploiting this key property, the presently disclosed scheme encodes the local stochastic gradients into a set of numerals. Next, it determines the positions of the activated orthogonal frequency division multiplexing (OFDM) subcarriers by using the values of the numerals. To eliminate the need for precise sample-level time synchronization, channel estimation overhead, and power instabilities due to the channel inversion, the presently disclosed scheme also uses a non-coherent receiver at the edge server (ES) and does not utilize a pre-equalization at the edge devices (EDs). To improve the test accuracy of FEEL with the presently disclosed scheme, we also introduce the concept of adaptive absolute maximum (AAM). Our numerical results show that when the presently disclosed scheme is used with AAM for FEEL, the test accuracy can reach up to 98% for heterogeneous data distribution.
In this disclosure, we introduce an OAC scheme for FEEL, where the local stochastic gradients are encoded into a set of numerals based on a balanced (also called signed-digit) number system to achieve a continuous-valued computation over a digital scheme. To avoid confusion, we use the terms of “numeral” and “balanced” for “digit” and “signed-digit”, respectively, since the term of “digit” may specifically imply the ten symbols of the common base 10 numeral system. The presently disclosed method does not rely on pre-equalization and the availability of channel state information (CSI) at the EDs and the ES, which relaxes the synchronization requirements at the EDs and the ES.
Some of the contributions of this presently disclosed subject matter can be listed as follows:
Continuous-valued OAC with a digital scheme: We show that the average of a set of real-valued parameters can be calculated approximately by using the average of the corresponding numerals in the real domain. By exploiting this key property, discussed in Section III-A, we achieve a continuous-valued computation over a digital OAC scheme. With the presently disclosed method, the EDs first encode the real-valued local stochastic gradients into the numerals for a given balanced number system. The EDs then activate the dedicated time-frequency resources (i.e., OFDM subcarriers) based on the values of the numerals. The EDs simultaneously transmit their OFDM symbols and the average numerals are calculated at the ES with a non-coherent receiver. By using the average of the numerals, the ES computes an estimate of the real-valued average stochastic gradient. To the best of our knowledge, this is the first disclosure that uses a general balanced number system for OAC.
Theoretical MSE analysis: We derive the mean squared error (MSE) of the estimator of the average stochastic gradient for a given set of parameters such as number of numerals, number of EDs, number of antennas at the ES, and the base in Section III-D. We also introduce the concept of adaptive absolute maximum (AAM), where each ED shares a single parameter with the ES to adjust the maximum quantization level to minimize the estimation error over the communication rounds of FEEL.
Theoretical convergence analysis: By using the MSE derivation and considering both homogeneous and heterogeneous data distributions in the network, we prove the convergence of FEEL in the presence of the presently disclosed scheme with and without AAM for a non-convex loss function, i.e., Theorem 1 and Theorem 2, respectively. While the presently disclosed framework without AAM contributes the noise ball due to the stochastic gradients in an additive manner, the additive impact is removed when the presently disclosed scheme is utilized with the AAM.
Organization: The rest of the disclosure is organized as follows. In Section II, the notation and the preliminary discussions used in the rest of the sections are provided. In Section III, the presently disclosed OAC scheme and its MSE performance are discussed. In Section IV, the convergence rate of the FEEL with the presently disclosed scheme is discussed. In Section V, the numerical results are provided. We conclude the disclosure in Section VI.
Notation: The sets of complex numbers, real numbers, integers, and integers modulo $H$ are denoted by $\mathbb{C}$, $\mathbb{R}$, $\mathbb{Z}$, and $\mathbb{Z}_H$, respectively. The $N$-dimensional all-zero vector and the $N\times N$ identity matrix are $\mathbf{0}_N$ and $\mathbf{I}_N$, respectively. The function $\mathbb{I}[\cdot]$ results in 1 if its argument holds, otherwise it is 0. $\mathbb{E}_x[\cdot]$ is the expectation of its argument over $x$. $\nabla f(\mathbf{w})$ denotes the gradient of the function $f$, i.e., $\nabla f$, at the point $\mathbf{w}$. The zero-mean circularly symmetric multivariate complex Gaussian distribution with the covariance matrix $\mathbf{C}_M$ of an $M$-dimensional random vector $\mathbf{x}\in\mathbb{C}^M$ is denoted by $\mathbf{x}\sim\mathcal{CN}(\mathbf{0}_M,\mathbf{C}_M)$. The gamma distribution with the shape parameter $n$ and the rate $\lambda$ is $\Gamma(n,\lambda)$. The binomial distribution with $K$ trials and the success probability $p$ for each trial is $\mathcal{B}(K,p)$. The $\ell_2$-norm of the vector $\mathbf{x}$ is $\|\mathbf{x}\|_2$. [ ] denotes { 0, . . . , K−1}.
Yet another aspect of the presently disclosed subject matter is to improve technology areas related to over-the-air computation (OAC) methodology and systems for federated edge learning (FEEL).
One presently disclosed exemplary embodiment relates to an over-the-air computation (OAC) methodology for federated edge learning (FEEL) without using pre-equalization or channel state information (CSI) at a plurality of edge devices (EDs) or at an edge server (ES). Such exemplary methodology may preferably comprise providing a distributed machine-learning model to be trained with the update vectors received at an edge server (ES) as transmitted from a plurality of edge devices (EDs); and performing methodology operations. Such operations may preferably comprise transmitting local update vectors as real-valued local stochastic gradients from each respective one of the plurality of edge devices (EDs) via a wireless multiple access channel, receiving the superposed local update vectors at the ES, and inputting the superposed local update vectors into the machine-learning model to be updated. Preferably, such real-valued local stochastic gradients are encoded into a set of numerals based on a balanced number system to achieve a continuous-valued computation over a digital scheme.
It is to be understood from the complete disclosure herewith that the presently disclosed subject matter equally relates to both methodology and corresponding and related apparatus/system subject matter.
Another presently disclosed exemplary embodiment preferably relates to a system for an over-the-air computation (OAC) system for federated edge learning (FEEL) without using pre-equalization or channel state information (CSI) at a plurality of edge devices (EDs) or at an edge server (ES). Such exemplary system preferably may comprise a distributed machine-learning model training to process data comprising update vectors received at an edge server (ES) as transmitted from a plurality of edge devices (EDs); one or more processors; and one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the one or more processors to perform operations. Such operations preferably comprise transmitting local update vectors as real-valued local stochastic gradients from each respective of the plurality of edge devices (EDs) via a wireless multiple access channel, receiving the superposed local update vectors at the ES, and inputting the superposed local update vectors into the machine-learning model to be updated. Such real-valued local stochastic gradients preferably are encoded into a set of numerals based on a balanced number system to achieve a continuous-valued computation over a digital scheme.
Other example aspects of the present disclosure are directed to systems, apparatus, tangible, non-transitory computer-readable media, user interfaces, memory devices, and electronic smart devices or the like. To implement methodology and technology herewith, one or more processors may be provided, programmed to perform the steps and functions as called for by the presently disclosed subject matter, as will be understood by those of ordinary skill in the art.
Per some presently disclosed embodiments, the presently disclosed scheme achieves a continuous-valued computation over a digital scheme. Hence, it can address a wide range of scenarios including wireless sensor networks. It does not need a channel inversion at the EDs. From this aspect, it is compatible with time-varying channels and does not lose gradient information due to truncation. The presently disclosed scheme reduces PMEPR with a simple randomization technique. It also does not require CSI at the ES or multiple antennas for over-the-air computation. It can also provide a high test accuracy when the data distribution is heterogeneous.
Potential market size is large as it is related to both commercial wireless and AI technologies. It could be useful for artificial intelligence technologies over wireless or sensor networks, 5G and beyond, 6G wireless standardization, IEEE 802.11 Wi-Fi. Also, recently, IEEE 802.11 has formed a Topic Interest Group (TIG), where distributed learning over a wireless network has been mentioned (https://mentor.ieee.org/802.11/documents?is_dcn=DCN%2C%20Title%2C%20Author%20or%20Affiliation&is_group=).
Additional objects and advantages of the presently disclosed subject matter are set forth in, or will be apparent to, those of ordinary skill in the art from the detailed description herein. Also, it should be further appreciated that modifications and variations to the specifically illustrated, referred and discussed features, elements, and steps hereof may be practiced in various embodiments, uses, and practices of the presently disclosed subject matter without departing from the spirit and scope of the subject matter. Variations may include, but are not limited to, substitution of equivalent means, features, or steps for those illustrated, referenced, or discussed, and the functional, operational, or positional reversal of various parts, features, steps, or the like.
Still further, it is to be understood that different embodiments, as well as different presently preferred embodiments, of the presently disclosed subject matter may include various combinations or configurations of presently disclosed features, steps, or elements, or their equivalents (including combinations of features, parts, or steps or configurations thereof not expressly shown in the Figures or stated in the detailed description of such Figures). Additional embodiments of the presently disclosed subject matter, not necessarily expressed in the summarized section, may include and incorporate various combinations of aspects of features, components, or steps referenced in the summarized objects above, and/or other features, components, or steps as otherwise discussed in this application. Those of ordinary skill in the art will better appreciate the features and aspects of such embodiments, and others, upon review of the remainder of the specification, and will appreciate that the presently disclosed subject matter applies equally to corresponding methodologies as associated with practice of any of the present exemplary devices, and vice versa.
A full and enabling disclosure of the presently disclosed subject matter, including the best mode thereof, directed to one of ordinary skill in the art, is set forth in the specification, which makes reference to the appended Figures, in which:
Table 1 is a table of an exemplary presently disclosed convolutional neural network (CNN) at the EDs for practicing the presently disclosed methodology, listing exemplary layers, learnables, and activations;
Repeat use of reference characters in the present specification and drawings is intended to represent the same or analogous features or elements or steps of the presently disclosed subject matter.
It is to be understood by one of ordinary skill in the art that the present disclosure is a description of exemplary embodiments only, and is not intended as limiting the broader aspects of the disclosed subject matter. Each example is provided by way of explanation of the presently disclosed subject matter, not limitation of the presently disclosed subject matter. In fact, it will be apparent to those skilled in the art that various modifications and variations can be made in the presently disclosed subject matter without departing from the scope or spirit of the presently disclosed subject matter. For instance, features illustrated or described as part of one embodiment can be used with another embodiment to yield a still further embodiment. Thus, it is intended that the presently disclosed subject matter covers such modifications and variations as come within the scope of the appended claims and their equivalents.
The present disclosure is generally directed to improvements in federated edge learning (FEEL) as a distributed learning framework.
More particularly, in this section, we provide the signal and learning model that we use throughout the disclosure and the preliminaries related to the encoding and decoding based on a balanced number system.
We consider a wireless network with K EDs that are connected to an ES, where each ED and the ES are equipped with a single antenna and R antennas, respectively. We assume that the large-scale impact of the wireless channel is compensated with a power control mechanism, e.g., closed-loop power control with physical uplink control channel (PUCCH) in Fifth Generation (5G) New Radio (NR) [37], before the training process for FEEL begins.
For the signal model, we assume that the EDs access the wireless channel on the same time-frequency resources simultaneously with $S$ OFDM symbols consisting of $M$ active subcarriers for OAC. Assuming that the cyclic prefix (CP) duration is larger than the sum of the maximum time-synchronization error and the maximum excess delay of the channel, we express the superposed modulation symbol on the $l$th subcarrier of the $m$th OFDM symbol at the ES for the $t$th communication round of the training as:
$$\mathbf{r}_{l,m}^{(t)}=\sum_{k=0}^{K-1}\mathbf{h}_{k,l,m}^{(t)}\,t_{k,l,m}^{(t)}+\mathbf{n}_{l,m}^{(t)},\qquad(1)$$
where $\mathbf{h}_{k,l,m}^{(t)}\sim\mathcal{CN}(\mathbf{0}_R,\mathbf{I}_R)$ is a vector that consists of the channel coefficients between the $R$ antennas at the ES and the $k$th ED, $t_{k,l,m}^{(t)}\in\mathbb{C}$ is the transmitted modulation symbol from the $k$th ED, and $\mathbf{n}_{l,m}^{(t)}\sim\mathcal{CN}(\mathbf{0}_R,\sigma_n^2\mathbf{I}_R)$ is the AWGN, where $\sigma_n^2$ is the noise variance for $l\in\mathbb{Z}_M$ and $m\in\mathbb{Z}_S$.
We denote the signal-to-noise ratio (SNR) of an ED at the ES receiver as 1/σn2.
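As an illustration of the superposition described above, the following is a minimal sketch that draws one received vector on a single time-frequency resource, assuming independent unit-variance Rayleigh fading per ED and per ES antenna and AWGN with variance equal to the reciprocal of the SNR. The function names `complex_gaussian` and `superposed_symbol` are hypothetical and are not part of the disclosure.

```python
import random


def complex_gaussian(variance, rng):
    """Draw one zero-mean circularly symmetric complex Gaussian sample."""
    sigma = (variance / 2.0) ** 0.5  # variance is split between real and imaginary parts
    return complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))


def superposed_symbol(tx_symbols, R, noise_var, rng):
    """Return the length-R received vector on one subcarrier of one OFDM symbol.

    tx_symbols: one complex modulation symbol per ED (K entries).
    R: number of antennas at the ES.
    noise_var: AWGN variance (the SNR of an ED is 1 / noise_var).
    """
    # Start from the noise vector, then add each ED's faded contribution.
    received = [complex_gaussian(noise_var, rng) for _ in range(R)]
    for t_k in tx_symbols:
        for antenna in range(R):
            h = complex_gaussian(1.0, rng)  # unit-variance channel coefficient
            received[antenna] += h * t_k
    return received


if __name__ == "__main__":
    rng = random.Random(0)
    # Three EDs activate the same resource; the ES has R = 4 antennas, SNR = 10 dB.
    r = superposed_symbol([1 + 0j] * 3, R=4, noise_var=0.1, rng=rng)
    print([abs(z) ** 2 for z in r])
```

The sketch omits the time-synchronization phase rotations and only shows why the ES observes a sum of faded symbols rather than the individual transmissions.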
In practice, the synchronization point where the discrete Fourier transform (DFT) starts to be applied to the received signal for demodulation at the ES and the time synchronization across the EDs may not be precise. To model the former impairment, we assume that the synchronization point can deviate by $N_{\mathrm{err}}$ samples within the CP window. For the latter impairment, the times of arrival of the EDs' signals at the ES location are sampled from a uniform distribution between 0 and $T_{\mathrm{sync}}$ seconds, where $T_{\mathrm{sync}}$ is equal to the reciprocal of the signal bandwidth. Note that coarse time synchronization can be maintained with the state-of-the-art protocols used in cellular systems. We introduce additional phase rotations to $\mathbf{h}_{k,l,m}^{(t)}$ to capture the impact of the time-synchronization errors on $\mathbf{r}_{l,m}^{(t)}$. We assume that the frequency synchronization is handled before the transmissions with a control mechanism, as done in 3GPP Fourth Generation (4G) Long Term Evolution (LTE) and/or 5G NR with random-access channel (RACH) and/or PUCCH, or with custom methods such as AirShare [39].
Let $f_{\mathrm{enc},\beta}$ be a function that maps $v\in\mathbb{R}$ to a sequence of $D$ elements (i.e., numerals) in $\{(x_{D-1},\dots,x_1,x_0)\,|\,x_i\in\mathbb{S}_\beta,\ \beta>1,\ i\in\mathbb{Z}_D\}$ as
$$(x_{D-1},\dots,x_1,x_0)=f_{\mathrm{enc},\beta}(v),\qquad(2)$$
where $\beta$ is the base (also called scale [40]), $x_i$ is referred to as a numeral at the $i$th position, and $\mathbb{S}_\beta$ is the symbol set with base $\beta$.
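In this disclosure, we consider a balanced number system for expressing $f_{\mathrm{enc},\beta}$ and assume that $\beta$ is an odd positive integer. The numerals are obtained as follows: For a given $v$ with $|v|\le v_{\max}$, the encoder $f_{\mathrm{enc},\beta}(v)$ first computes the base-$\beta$ representation of the rounded, biased, and normalized $v$ as:
$$\left\lfloor \xi\frac{v}{v_{\max}}+\xi+\frac{1}{2}\right\rfloor=\sum_{i=0}^{D-1}b_i\beta^i,\qquad(3)$$
for $\xi\triangleq(\beta^D-1)/2$ and $b_i\in\mathbb{Z}_\beta$.
Afterwards, it calculates $x_i$ as $x_i=b_i-(\beta-1)/2$, $\forall i\in\mathbb{Z}_D$. Hence, $\mathbb{S}_\beta$ can be defined as:
$$\mathbb{S}_\beta\triangleq\{a_j\,|\,a_j=f_{\mathrm{bal}}(j),\ j\in\mathbb{Z}_\beta\},\qquad(4)$$
where $f_{\mathrm{bal}}(j)$ is given by
$$f_{\mathrm{bal}}(j)=\begin{cases}(-1)^{j+1}\left(\lfloor j/2\rfloor+1\right), & j<\beta-1,\\ 0, & j=\beta-1.\end{cases}\qquad(5)$$
Based on (5), $a_{\beta-1}$ is a zero-valued symbol. The example symbol sets for $\beta=5$ and $\beta=7$ can be obtained as $\mathbb{S}_5=\{-1,1,-2,2,0\}$ and $\mathbb{S}_7=\{-1,1,-2,2,-3,3,0\}$, respectively. For a balanced number system, there is no dedicated symbol for the sign of $v$ as $\mathbb{S}_\beta$ contains negative-valued symbols.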
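The ordering of the symbol set can be reproduced with a short sketch. The closed form used in `f_bal` below is an assumption chosen so that it matches the two example sets given for β=5 and β=7 and places the zero-valued symbol last; the function names are hypothetical.

```python
def f_bal(j, beta):
    """Return the j-th symbol of the balanced symbol set for an odd base beta.

    Assumed ordering: -1, 1, -2, 2, ..., with the zero-valued symbol at j = beta - 1.
    """
    if not 0 <= j < beta:
        raise ValueError("j must be in {0, ..., beta - 1}")
    if j == beta - 1:
        return 0
    magnitude = j // 2 + 1
    # Even indices carry the negative symbol, odd indices the positive one.
    return magnitude if j % 2 else -magnitude


def symbol_set(beta):
    """Return the ordered balanced symbol set for an odd base beta > 1."""
    if beta < 3 or beta % 2 == 0:
        raise ValueError("beta must be an odd integer greater than 1")
    return [f_bal(j, beta) for j in range(beta)]


if __name__ == "__main__":
    print(symbol_set(5))  # [-1, 1, -2, 2, 0]
    print(symbol_set(7))  # [-1, 1, -2, 2, -3, 3, 0]
```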
Assume that $\beta=5$, $D=3$, and $v_{\max}=1$, and we want to calculate $f_{\mathrm{enc},\beta}(0.28)$ and $f_{\mathrm{enc},\beta}(-0.86)$. By definition, $\xi=(5^3-1)/2=62$. The base-5 representations of the decimal $\lfloor 62\times 0.28+62+\tfrac{1}{2}\rfloor=79$ and the decimal $\lfloor 62\times(-0.86)+62+\tfrac{1}{2}\rfloor=9$ are $(b_2b_1b_0)_5=(304)_5$ and $(b_2b_1b_0)_5=(014)_5$, respectively. Since $x_i=b_i-(\beta-1)/2$, we obtain $f_{\mathrm{enc},\beta}(0.28)=(1,-2,2)$ and $f_{\mathrm{enc},\beta}(-0.86)=(-2,-1,2)$.
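The corresponding decoder $f_{\mathrm{dec},\beta}$ that maps the sequence $(x_{D-1},\dots,x_1,x_0)$ to $\mathbb{R}$ can be expressed as:
$$f_{\mathrm{dec},\beta}(x_{D-1},\dots,x_1,x_0)=\frac{v_{\max}}{\xi}\sum_{i=0}^{D-1}x_i\beta^i.\qquad(6)$$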
Note that
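Consider the parameters given in Example 1. Hence, we obtain $f_{\mathrm{dec},\beta}(f_{\mathrm{enc},\beta}(0.28))=f_{\mathrm{dec},\beta}(1,-2,2)\approx0.2742$ and $f_{\mathrm{dec},\beta}(f_{\mathrm{enc},\beta}(-0.86))=f_{\mathrm{dec},\beta}(-2,-1,2)\approx-0.8548$ based on (6). The step size can also be calculated as $\Delta=2/(5^3-1)\approx0.016$.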
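The encoding and decoding steps can be sketched in a few lines. This is an illustrative sketch only, assuming that the encoder rounds $\xi v/v_{\max}+\xi+1/2$ down to an integer, expands it in base β, and subtracts $(\beta-1)/2$ from each digit, and that the decoder scales the weighted sum of numerals by $v_{\max}/\xi$. The names `f_enc` and `f_dec` mirror the text; their signatures are hypothetical.

```python
import math


def f_enc(v, beta, D, v_max):
    """Encode a real value into D balanced numerals, most significant first."""
    if abs(v) > v_max:
        raise ValueError("|v| must not exceed v_max")
    xi = (beta ** D - 1) // 2
    # Rounded, biased, and normalized value: an integer in {0, ..., beta**D - 1}.
    n = math.floor(xi * v / v_max + xi + 0.5)
    offset = (beta - 1) // 2
    numerals = []
    for _ in range(D):
        n, b = divmod(n, beta)        # peel off the least significant base-beta digit
        numerals.append(b - offset)   # shift the digit into the balanced range
    return tuple(reversed(numerals))


def f_dec(numerals, beta, v_max):
    """Decode numerals (most significant first); real-valued numerals are allowed."""
    D = len(numerals)
    xi = (beta ** D - 1) / 2
    weighted = sum(x * beta ** i for i, x in enumerate(reversed(numerals)))
    return v_max * weighted / xi


if __name__ == "__main__":
    print(f_enc(0.28, beta=5, D=3, v_max=1.0))               # (1, -2, 2)
    print(f_enc(-0.86, beta=5, D=3, v_max=1.0))              # (-2, -1, 2)
    print(round(f_dec((1, -2, 2), beta=5, v_max=1.0), 4))    # 0.2742
    print(round(f_dec((-2, -1, 2), beta=5, v_max=1.0), 4))   # -0.8548
```

Because `f_dec` is linear in the numerals, it also accepts non-integer inputs, which is the property the averaging step of the scheme relies on.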
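Let $\mathcal{D}_k$ denote the local data set containing the labeled data samples $(\mathbf{x}_\ell,y_\ell)$ at the $k$th ED, $\forall k\in\{0,\dots,K-1\}$, where $\mathbf{x}_\ell$ is the $\ell$th data sample with its ground truth label $y_\ell$. Suppose that all EDs upload their data sets to the ES. The centralized learning problem can then be expressed as:
$$\mathbf{w}^{*}=\arg\min_{\mathbf{w}}F(\mathbf{w})=\arg\min_{\mathbf{w}}\frac{1}{|\mathcal{D}|}\sum_{\forall(\mathbf{x}_\ell,y_\ell)\in\mathcal{D}}f(\mathbf{w};\mathbf{x}_\ell,y_\ell),\qquad(7)$$
where $F(\mathbf{w})$ is the loss function, $\mathcal{D}=\mathcal{D}_0\cup\mathcal{D}_1\cup\dots\cup\mathcal{D}_{K-1}$ is the complete data set, $f(\mathbf{w};\mathbf{x}_\ell,y_\ell)$ is the sample loss function for the parameters $\mathbf{w}=[w_0,\dots,w_{Q-1}]^{\mathrm{T}}\in\mathbb{R}^Q$, and $Q$ is the number of parameters.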
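With (full-batch) gradient descent, a local optimum point can be obtained as:
$$\mathbf{w}^{(t+1)}=\mathbf{w}^{(t)}-\eta\,\mathbf{g}^{(t)},\qquad(8)$$
where $\eta$ is the learning rate and the gradient vector $\mathbf{g}^{(t)}=[g_0^{(t)},\dots,g_{Q-1}^{(t)}]^{\mathrm{T}}\in\mathbb{R}^Q$ can be expressed as:
$$\mathbf{g}^{(t)}=\nabla F(\mathbf{w}^{(t)})=\frac{1}{|\mathcal{D}|}\sum_{\forall(\mathbf{x}_\ell,y_\ell)\in\mathcal{D}}\nabla f(\mathbf{w}^{(t)};\mathbf{x}_\ell,y_\ell).$$
The equation (8) can be re-written as:
$$\mathbf{w}^{(t+1)}=\mathbf{w}^{(t)}-\frac{\eta}{|\mathcal{D}|}\sum_{k=0}^{K-1}|\mathcal{D}_k|\,\mathbf{g}_k^{(t)},\qquad \mathbf{g}_k^{(t)}=\frac{1}{|\mathcal{D}_k|}\sum_{\forall(\mathbf{x}_\ell,y_\ell)\in\mathcal{D}_k}\nabla f(\mathbf{w}^{(t)};\mathbf{x}_\ell,y_\ell),$$
where $\mathbf{g}_k^{(t)}\in\mathbb{R}^Q$ denotes the local gradient vector at the $k$th ED.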
Therefore, (8) can still be realized by communicating the local gradients or locally updated model parameters between the EDs and the ES, rather than moving the local data sets from the EDs to the ES, which is beneficial for promoting data privacy [5], [8]. This observation also shows the underlying principle of the plain FL based on gradient or model parameter aggregations [7].
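The observation that the full-batch gradient can be assembled from local gradients can be checked numerically. The sketch below assumes a hypothetical least-squares sample loss $f(\mathbf{w};\mathbf{x},y)=\tfrac{1}{2}(\mathbf{w}^{\mathrm{T}}\mathbf{x}-y)^2$ and small made-up local data sets; neither is taken from the disclosure.

```python
def sample_gradient(w, x, y):
    """Gradient of the assumed sample loss 0.5 * (w.x - y)**2 with respect to w."""
    error = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [error * xi for xi in x]


def mean_gradient(w, data):
    """Average the sample gradients over a data set of (x, y) pairs."""
    total = [0.0] * len(w)
    for x, y in data:
        g = sample_gradient(w, x, y)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / len(data) for t in total]


# Hypothetical local data sets held by K = 3 EDs (unequal sizes on purpose).
local_sets = [
    [((1.0, 2.0), 1.0), ((0.5, -1.0), 0.0)],
    [((2.0, 0.0), -1.0)],
    [((-1.0, 1.0), 2.0), ((0.0, 3.0), 1.0), ((1.0, 1.0), 0.5)],
]
w = [0.1, -0.2]

# Centralized gradient over the union of all local data sets.
complete_set = [sample for data in local_sets for sample in data]
g_central = mean_gradient(w, complete_set)

# Same gradient assembled from local gradients weighted by local data set sizes.
local_grads = [mean_gradient(w, data) for data in local_sets]
g_federated = [
    sum(len(data) * g[q] for data, g in zip(local_sets, local_grads)) / len(complete_set)
    for q in range(len(w))
]

if __name__ == "__main__":
    print(g_central)
    print(g_federated)
```

Both printed vectors agree up to floating-point rounding, which is why only gradients, and not data samples, need to leave the EDs.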
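FEEL aims to realize FL over a wireless network. In this disclosure, we consider the implementation of FL based on SGD, known as FedSGD [7], over a wireless network: The $k$th ED calculates an estimate of the local gradient vector, denoted by $\tilde{\mathbf{g}}_k^{(t)}=[\tilde{g}_{k,0}^{(t)},\dots,\tilde{g}_{k,Q-1}^{(t)}]^{\mathrm{T}}\in\mathbb{R}^Q$, as:
$$\tilde{\mathbf{g}}_k^{(t)}=\frac{1}{n_b}\sum_{\forall(\mathbf{x}_\ell,y_\ell)\in\tilde{\mathcal{D}}_k}\nabla f(\mathbf{w}^{(t)};\mathbf{x}_\ell,y_\ell),$$
where $\tilde{\mathcal{D}}_k\subset\mathcal{D}_k$ is the data batch obtained from the local data set and $n_b=|\tilde{\mathcal{D}}_k|$ is the batch size.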
The EDs transmit the local gradient estimates to the ES. Assuming identical data set sizes across the EDs, to solve (7), the ES calculates the average stochastic gradient vector
$$\mathbf{v}^{(t)}=\frac{1}{K}\sum_{k=0}^{K-1}\tilde{\mathbf{g}}_k^{(t)}$$
and broadcasts it to the EDs. Finally, the model parameters at the EDs are updated as:
$$\mathbf{w}^{(t+1)}=\mathbf{w}^{(t)}-\eta\,\mathbf{v}^{(t)}.\qquad(11)$$
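One communication round of this FedSGD procedure can be sketched as follows, assuming ideal (error-free) aggregation at the ES in place of any wireless scheme. The scalar model, the sample loss $\tfrac{1}{2}(wx-y)^2$, the data, and the function names are hypothetical choices made only to keep the example short.

```python
import random


def local_gradient(w, batch):
    """Stochastic gradient of the assumed loss 0.5 * (w * x - y)**2 over one batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def fedsgd_round(w, local_sets, eta, n_b, rng):
    """Run one round: local batch gradients, averaging at the ES, model update."""
    # Each ED draws a batch of size n_b from its own data and computes a gradient.
    gradients = [local_gradient(w, rng.sample(data, n_b)) for data in local_sets]
    # The ES averages the K local gradients (identical data set sizes assumed).
    v = sum(gradients) / len(gradients)
    # Every ED applies the broadcast average gradient.
    return w - eta * v


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [0.5, 1.0, 1.5, 2.0]
    # K = 3 EDs whose labels follow y = 2x, so the minimizer is w = 2.
    local_sets = [[(x, 2.0 * x) for x in xs] for _ in range(3)]
    w = 0.0
    for _ in range(200):
        w = fedsgd_round(w, local_sets, eta=0.1, n_b=2, rng=rng)
    print(w)
```

In the disclosed setting, the averaging line is the step that would be carried out over the air instead of at the ES after orthogonal uploads.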
With a traditional orthogonal user multiplexing, the per-round communication latency for FEEL linearly increases with the number of EDs [41]. With the motivation of reducing the per-round communication latency, the main objective of this work is to calculate an estimate of $\mathbf{v}^{(t)}$, denoted by $\hat{\mathbf{v}}^{(t)}\triangleq[\hat{v}_0^{(t)},\dots,\hat{v}_{Q-1}^{(t)}]^{\mathrm{T}}$, through a digital OAC scheme that is robust against fading channels.
In this section, we discuss the presently disclosed OAC scheme relying on the representation of the gradients based on a balanced number system. We analyze its performance in terms of MSE and introduce the AAM to improve the MSE over the communication rounds of FEEL.
Based on the discussions given in Section II-C, consider the $q$th gradient at the $k$th ED for the $t$th communication round of the FEEL, i.e., $\tilde{g}_{k,q}^{(t)}$. Suppose that $\tilde{g}_{k,q}^{(t)}$ is encoded into the sequence of length $D$ denoted by:
$$(d_{k,q,D-1}^{(t)},\dots,d_{k,q,1}^{(t)},d_{k,q,0}^{(t)})=f_{\mathrm{enc},\beta}(\tilde{g}_{k,q}^{(t)}),\qquad(12)$$
for $d_{k,q,i}^{(t)}\in\mathbb{S}_\beta$.
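By using the definition of $f_{\mathrm{dec},\beta}$ in (6), the $q$th average stochastic gradient, i.e., $v_q^{(t)}=\frac{1}{K}\sum_{k=0}^{K-1}\tilde{g}_{k,q}^{(t)}$, can be obtained approximately as:
$$v_q^{(t)}\approx\frac{1}{K}\sum_{k=0}^{K-1}\bar{g}_{k,q}^{(t)}=\frac{v_{\max}}{\xi}\sum_{i=0}^{D-1}\underbrace{\frac{1}{K}\sum_{k=0}^{K-1}d_{k,q,i}^{(t)}}_{\triangleq\,\mu_{q,i}^{(t)}}\beta^i=f_{\mathrm{dec},\beta}(\mu_{q,D-1}^{(t)},\dots,\mu_{q,1}^{(t)},\mu_{q,0}^{(t)}),\qquad(13)$$
where $\bar{g}_{k,q}^{(t)}$ is the quantized gradient, i.e., $\bar{g}_{k,q}^{(t)}=f_{\mathrm{dec},\beta}(f_{\mathrm{enc},\beta}(\tilde{g}_{k,q}^{(t)}))$.
Equation (13) implies that $v_q^{(t)}$ can be calculated approximately by evaluating the function $f_{\mathrm{dec},\beta}$ with the values that are calculated by averaging the numerals across the $K$ EDs in the real domain, i.e., $\{\mu_{q,i}^{(t)}\,|\,i\in\mathbb{Z}_D\}$. By evaluating $\mu_{q,i}^{(t)}$ further, it can also be shown that:
$$\mu_{q,i}^{(t)}=\frac{1}{K}\sum_{k=0}^{K-1}d_{k,q,i}^{(t)}=\sum_{j=0}^{\beta-1}a_j\frac{U_{q,i,j}}{K}=\sum_{j=0}^{\beta-2}a_j\frac{U_{q,i,j}}{K},\qquad(14)$$
where $U_{q,i,j}$ denotes the number of EDs with the symbol $a_j$ for the $i$th numeral in (12) and the $q$th gradient.
Note that the identity in (14) is due to the definition of expectation for discrete outcomes as given for a probability mass function. The last expression follows since $a_{\beta-1}=0$.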
Assume that $K=2$, $\tilde{g}_{0,q}^{(t)}=0.28$, and $\tilde{g}_{1,q}^{(t)}=-0.86$. The average of the gradients can be calculated as $v_q^{(t)}=(\tilde{g}_{0,q}^{(t)}+\tilde{g}_{1,q}^{(t)})/2=-0.29$. Now, consider the encoder parameters given in Example 1. We obtain $f_{\text{enc},\beta}(0.28)=(1,-2,2)$ and $f_{\text{enc},\beta}(-0.86)=(-2,-1,2)$. Therefore, the average of the numerals can be calculated as $(\mu_{q,2}^{(t)}, \mu_{q,1}^{(t)}, \mu_{q,0}^{(t)})=(1-2, -2-1, 2+2)/2=(-1/2, -3/2, 2)$. Also, notice that $(\mu_{q,2}^{(t)}, \mu_{q,1}^{(t)}, \mu_{q,0}^{(t)})$ can be calculated by using the number of EDs that vote for each element of $\{-1, 1, -2, 2, 0\}$. For instance, $\mu_{q,0}^{(t)}$ can be calculated via the last expression in (14) for $(U_{q,i,0}, U_{q,i,1}, U_{q,i,2}, U_{q,i,3}, U_{q,i,4})=(0, 0, 0, 2, 0)$ with $i=0$, where the corresponding symbols are $(a_0, a_1, a_2, a_3, a_4)=(-1, 1, -2, 2, 0)$ for $\beta=5$. By evaluating $\bar{v}_q^{(t)}=f_{\text{dec},\beta}(-1/2, -3/2, 2)$, we obtain $\bar{v}_q^{(t)}\approx-0.2903$. Note that $\bar{v}_q^{(t)}$ is also equal to the average of the quantized gradients, i.e., $f_{\text{dec},\beta}(f_{\text{enc},\beta}(0.28))\approx0.2742$ and $f_{\text{dec},\beta}(f_{\text{enc},\beta}(-0.86))\approx-0.8548$, as exemplified in Example 2.
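The arithmetic in this example can be reproduced with a short Python sketch. The definitions of the encoder and decoder in (4)-(6) are not repeated in this section, so the sketch assumes that the encoder rounds to the nearest of $\beta^D$ equally spaced levels in $[-v_{\max}, v_{\max}]$ and that the decoder scales the balanced base-$\beta$ integer by $2v_{\max}/(\beta^D-1)$. Both assumptions are chosen because they reproduce the numbers quoted above; the function names are illustrative only.

```python
def f_enc(g, beta, D, v_max=1.0):
    """Quantize g and return balanced base-beta numerals (d_{D-1}, ..., d_0).

    Assumes an odd beta and nearest-level rounding (an assumption, see lead-in).
    """
    half = (beta**D - 1) // 2          # largest representable integer
    n = round(g / v_max * half)        # nearest quantization level
    n = max(-half, min(half, n))       # clip to the representable range
    numerals = []
    for _ in range(D):
        r = n % beta                   # Python remainder lies in 0..beta-1
        if r > beta // 2:              # fold into the balanced range
            r -= beta
        numerals.append(r)
        n = (n - r) // beta
    return tuple(reversed(numerals))


def f_dec(numerals, beta, v_max=1.0):
    """Map (possibly fractional) numerals (d_{D-1}, ..., d_0) to a real value."""
    D = len(numerals)
    half = (beta**D - 1) / 2
    weighted = sum(d * beta**i for i, d in enumerate(reversed(numerals)))
    return v_max * weighted / half


def average_numerals(gradients, beta, D, v_max=1.0):
    """Average the numerals position by position across the EDs."""
    encoded = [f_enc(g, beta, D, v_max) for g in gradients]
    K = len(encoded)
    return tuple(sum(column) / K for column in zip(*encoded))


mu = average_numerals([0.28, -0.86], beta=5, D=3)   # (-0.5, -1.5, 2.0)
v_bar = f_dec(mu, beta=5)                           # about -0.2903
```

Because the assumed decoder is linear in the numerals, decoding the averaged numerals gives the same value as averaging the two decoded (quantized) gradients.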
The presently disclosed OAC scheme computes an estimate of $\bar{v}_q^{(t)}$ by relying on the expansion in (13) and the identity given in (14), rather than averaging the continuous $\tilde{g}_{k,q}^{(t)}$ with an analog OAC such as BAA proposed in [9].
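At the $t$th communication round of the FEEL, the $k$th ED calculates the numerals $\{d_{k,q,i}^{(t)} \mid q \in \mathbb{Z}_Q, i \in \mathbb{Z}_D\}$ with (12), for a given $\beta$. The main strategy exploited at the $k$th ED with the presently disclosed scheme is that $\beta-1$ subcarriers are dedicated for each numeral and one of them is activated based on its value. To express this encoding operation rigorously, let the resource mapping be a function that maps $q \in \mathbb{Z}_Q$ to a set of $(\beta-1)D$ distinct time-frequency index pairs denoted by $\mathcal{T}_q \triangleq \{(m_{i,j}, l_{i,j}) \mid i \in \mathbb{Z}_D, j \in \mathbb{Z}_{\beta-1}\}$ for $m_{i,j} \in \mathbb{Z}_S$ and $l_{i,j} \in \mathbb{Z}_M$, where $\mathcal{T}_{q_1} \cap \mathcal{T}_{q_2} = \emptyset$ if $q_1 \neq q_2$ for $q_1, q_2 \in \mathbb{Z}_Q$. The $k$th ED determines the modulation symbol $t_{k,m_{i,j},l_{i,j}}^{(t)}$ as:
$$t_{k,m_{i,j},l_{i,j}}^{(t)} = \sqrt{E_s}\, s_{k,q,i}^{(t)}\, \mathbb{I}[d_{k,q,i}^{(t)} = a_j], \tag{15}$$
for all $i \in \mathbb{Z}_D$ and $j \in \mathbb{Z}_{\beta-1}$, where $E_s = \beta-1$ is a factor to normalize the OFDM symbol energy and $s_{k,q,i}^{(t)}$ is a randomization symbol on the unit circle for peak-to-mean envelope power ratio (PMEPR) reduction [20].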
Note that we do not allocate a subcarrier for $a_{\beta-1}=0$ since it does not contribute to the sum given in (14). After the calculation of (15) for all gradients, the $k$th ED calculates the OFDM symbols and all EDs transmit them simultaneously based on the discussions in Section II. Since the presently disclosed scheme uses $(\beta-1)D$ subcarriers for each gradient, the maximum number of gradients that can be transmitted on each OFDM symbol can be calculated as $M_{\text{par}} = \lfloor M/((\beta-1)D) \rfloor$ for all EDs.
It is worth emphasizing that the resource mapping function can be designed based on a scrambler to randomize the synthesized OFDM symbols or an encryption function to enhance the security of the OAC. We leave these extensions for future work and assume that the function uses $(\beta-1)D$ adjacent subcarriers for each gradient, as illustrated in the accompanying drawings.
Consider the parameters given in Example 3, i.e., $K=2$, $\tilde{g}_{0,q}^{(t)}=0.28$, and $\tilde{g}_{1,q}^{(t)}=-0.86$, where the local gradients are represented as $f_{\text{enc},\beta}(0.28)=(d_{0,q,2}^{(t)}, d_{0,q,1}^{(t)}, d_{0,q,0}^{(t)})=(1, -2, 2)$ for the 0th ED, and $f_{\text{enc},\beta}(-0.86)=(d_{1,q,2}^{(t)}, d_{1,q,1}^{(t)}, d_{1,q,0}^{(t)})=(-2, -1, 2)$ for the 1st ED for $\beta=5$ and $D=3$. Assume that the resource set for the $q$th gradient, i.e., $\mathcal{T}_q$, is given by:
$$\mathcal{T}_q = \{(m_{0,0}, l_{0,0}), (m_{0,1}, l_{0,1}), (m_{0,2}, l_{0,2}), (m_{0,3}, l_{0,3}), (m_{1,0}, l_{1,0}), (m_{1,1}, l_{1,1}), (m_{1,2}, l_{1,2}), (m_{1,3}, l_{1,3}), (m_{2,0}, l_{2,0}), (m_{2,1}, l_{2,1}), (m_{2,2}, l_{2,2}), (m_{2,3}, l_{2,3})\} = \{(0, 0), (0, 1), \ldots, (0, 11)\},$$
i.e., the first 12 adjacent subcarriers of the 0th OFDM symbol. Based on (4), $\mathbb{S}_5 = \{a_0=-1, a_1=1, a_2=-2, a_3=2, a_4=0\}$.
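Hence, based on (15), the activated subcarriers for the 0th ED (omitting the randomization symbols for readability) are then subcarriers 3, 6, and 9 of the 0th OFDM symbol, i.e., $l_{0,3}$, $l_{1,2}$, and $l_{2,1}$, because $\mathbb{I}[d_{0,q,i}^{(t)} = a_j]=1$ for $(i=0, j=3)$, $(i=1, j=2)$, $(i=2, j=1)$.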
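For the 1st ED, the active subcarriers are subcarriers 3, 4, and 10 of the 0th OFDM symbol, as $\mathbb{I}[d_{1,q,i}^{(t)} = a_j]=1$ for $(i=0, j=3)$, $(i=1, j=0)$, $(i=2, j=2)$, which follows from $(d_{1,q,2}^{(t)}, d_{1,q,1}^{(t)}, d_{1,q,0}^{(t)})=(-2, -1, 2)$.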
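The subcarrier selection in this example can be sketched in Python as follows. The sketch assumes that the 12 index pairs of the resource set are ordered numeral by numeral, so that the pair for numeral $i$ and symbol $a_j$ is subcarrier $4i+j$ of the 0th OFDM symbol; this ordering is read off the listing of the resource set above and is an assumption. The function name is hypothetical and the randomization symbols are omitted.

```python
# Symbols for beta = 5 as listed in the example; a_4 = 0 gets no subcarrier.
SYMBOLS_5 = (-1, 1, -2, 2, 0)


def active_subcarriers(numerals, symbols, first_subcarrier=0):
    """Return the subcarrier indices an ED activates for one gradient.

    numerals is (d_{D-1}, ..., d_0); numeral i owns beta-1 adjacent
    subcarriers starting at first_subcarrier + i * (beta - 1).
    """
    nonzero = symbols[:-1]                 # a_0 .. a_{beta-2}
    per_numeral = len(nonzero)             # beta - 1 subcarriers per numeral
    active = []
    for i, d in enumerate(reversed(numerals)):   # i = 0 is the last numeral
        if d == 0:
            continue                       # zero numeral: nothing is sent
        j = nonzero.index(d)               # index j with a_j == d
        active.append(first_subcarrier + i * per_numeral + j)
    return active


ed0 = active_subcarriers((1, -2, 2), SYMBOLS_5)    # numerals of the 0th ED
ed1 = active_subcarriers((-2, -1, 2), SYMBOLS_5)   # numerals of the 1st ED
```

Under these assumptions both EDs activate subcarrier 3 for the 0th numeral, which is consistent with the vote counts $(0, 0, 0, 2, 0)$ used for $\mu_{q,0}$ in Example 3.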
Remark 1. If D=1, the presently disclosed scheme divides [−vmax, vmax] into β equal ranges and the modulation is equivalent to (β−1)-ary FSK.
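At the ES, we assume that the CSI, i.e., $\{h_{k,l,m}^{(t)} \mid k \in \mathbb{Z}_K, l \in \mathbb{Z}_M, m \in \mathbb{Z}_S\}$, is not available. Hence, the ES exploits that $\mathbf{r}_{m_{i,j},l_{i,j}}^{(t)}$ is a random vector with $\mathbf{r}_{m_{i,j},l_{i,j}}^{(t)} \sim \mathcal{CN}(\mathbf{0}_R, (E_s U_{q,i,j} + \sigma_n^2)\mathbf{I}_R)$ and obtains an estimate of $\{U_{q,i,j} \mid j \in \mathbb{Z}_{\beta-1}\}$ non-coherently. For given $i$ and $q$, by using the corresponding log-likelihood function, the maximum likelihood (ML) detector can be expressed as: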
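However, due to the constraints, a solution to (16) can increase the receiver complexity considerably. To address this issue, we relax the constraints and evaluate $\hat{U}_{q,i,j}$ independently as given by:
$$\hat{U}_{q,i,j} = \frac{\|\mathbf{r}_{m_{i,j},l_{i,j}}^{(t)}\|_2^2}{E_s R} - \frac{\sigma_n^2}{E_s}. \tag{17}$$
Therefore, a low-complexity estimator of $\mu_{q,i}^{(t)}$ can be obtained as:
$$\hat{\mu}_{q,i}^{(t)} = \frac{1}{K}\sum_{j=0}^{\beta-2} a_j \hat{U}_{q,i,j}. \tag{18}$$
Finally, the estimator of $v_q^{(t)}$ can be expressed as: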
$$\hat{v}_q^{(t)} = f_{\text{dec},\beta}(\hat{\mu}_{q,D-1}^{(t)}, \ldots, \hat{\mu}_{q,1}^{(t)}, \hat{\mu}_{q,0}^{(t)}). \tag{19}$$
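The receiver chain that ends in (19) can be sketched in Python. The sketch assumes a moment-matching form for the relaxed estimator, $\hat{U}_{q,i,j} = (\|\mathbf{r}\|_2^2/R - \sigma_n^2)/E_s$, which is what the unbiasedness claim and the stated mean $E_s U_{q,i,j} + \sigma_n^2$ imply, and it assumes the same decoder scaling $2v_{\max}/(\beta^D-1)$ that reproduces Example 3. All function names are illustrative.

```python
def f_dec(numerals, beta, v_max=1.0):
    """Decode (possibly fractional) numerals (d_{D-1}, ..., d_0); assumed scaling."""
    D = len(numerals)
    half = (beta**D - 1) / 2
    weighted = sum(d * beta**i for i, d in enumerate(reversed(numerals)))
    return v_max * weighted / half


def estimate_votes(norm_sq, R, sigma_n2, E_s):
    """Estimate the number of EDs on one subcarrier from the received energy."""
    return (norm_sq / R - sigma_n2) / E_s


def estimate_gradient(norm_sq_table, symbols, K, R, sigma_n2, beta, v_max=1.0):
    """norm_sq_table[i][j] holds ||r||^2 on the subcarrier of numeral i, symbol a_j."""
    E_s = beta - 1
    mu_hat = []
    for row in norm_sq_table:                        # i = 0 .. D-1
        u_hat = [estimate_votes(x, R, sigma_n2, E_s) for x in row]
        mu_hat.append(sum(a * u for a, u in zip(symbols, u_hat)) / K)
    # the decoder expects the most significant numeral first
    return f_dec(tuple(reversed(mu_hat)), beta, v_max)


# Idealized check with the vote counts of Example 3: the received energy on
# each subcarrier is set to its expected value R * (E_s * U + sigma_n^2).
R, sigma_n2, beta, K = 4, 0.01, 5, 2
votes = [[0, 0, 0, 2], [1, 0, 1, 0], [0, 1, 1, 0]]   # rows: numerals i = 0, 1, 2
table = [[R * ((beta - 1) * u + sigma_n2) for u in row] for row in votes]
v_hat = estimate_gradient(table, (-1, 1, -2, 2), K, R, sigma_n2, beta)
```

With the energies fixed at their expected values, the sketch returns the quantized average of Example 3; with actual fading and noise, the energies fluctuate around these values.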
The ES then transmits $\hat{\mathbf{v}}^{(t)}$ to the EDs for the next communication round and the $k$th ED updates its parameters as $\mathbf{w}^{(t+1)}=\mathbf{w}^{(t)}-\eta\hat{\mathbf{v}}^{(t)}$, $\forall k$.
The transmitter and receiver diagrams with the presently disclosed OAC scheme for FEEL based on the aforementioned discussions are provided in the accompanying drawings.
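The variable $\|\mathbf{r}_{m_{i,j},l_{i,j}}^{(t)}\|_2^2/R$ in (17) is the average of $R$ exponential variables with the mean $E_s U_{q,i,j}+\sigma_n^2$. Thus, the distribution of $\|\mathbf{r}_{m_{i,j},l_{i,j}}^{(t)}\|_2^2/R$ is $\Gamma(R, R/(E_s U_{q,i,j}+\sigma_n^2))$.
As a result, the mean and the variance of the estimator $\hat{U}_{q,i,j}$ can be calculated through the properties of a gamma distribution as:
$$\mathbb{E}[\hat{U}_{q,i,j}] = U_{q,i,j}, \tag{20}$$
$$\mathrm{Var}[\hat{U}_{q,i,j}] = \frac{1}{R}\left(U_{q,i,j} + \frac{\sigma_n^2}{E_s}\right)^2, \tag{21}$$
respectively, where the expectation is calculated over the randomness of the channel and noise.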
Hence, $\hat{U}_{q,i,j}$ is an unbiased estimator. Also, based on (18) and (19), both $\hat{\mu}_{q,i}^{(t)}$ and $\hat{v}_q^{(t)}$ are unbiased estimators of $\mu_{q,i}^{(t)}$ and $\bar{v}_q^{(t)}$, respectively. For a given $\{U_{q,i,j} \mid j \in \mathbb{Z}_{\beta-1}\}$, by using (18) and (21), the variance of the estimator $\hat{\mu}_{q,i}^{(t)}$ is obtained as:
Therefore, we can calculate the variance of the estimator $\hat{v}_q^{(t)}$ as:
Hence, the (classical) MSE of the estimator $\hat{v}_q^{(t)}$ can be obtained as:
where the last term is the squared bias due to the quantization.
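The unbiasedness of the vote-count estimator and its variance can be sanity-checked with a small Monte Carlo sketch. It assumes i.i.d. Rayleigh fading (unit-variance complex Gaussian coefficients per ED and antenna), the moment-matching estimator $(\|\mathbf{r}\|_2^2/R - \sigma_n^2)/E_s$, and the variance $(U + \sigma_n^2/E_s)^2/R$ that follows from the gamma distribution stated above. The parameter values and function names are chosen for illustration only.

```python
import random


def simulate_u_hat(U, R, E_s, sigma_n2, rng):
    """Draw one estimate of U from the energy received on a single subcarrier."""
    total = 0.0
    for _ in range(R):                       # one complex sample per antenna
        re = im = 0.0
        for _ in range(U):                   # superposition of the U active EDs
            re += rng.gauss(0.0, (E_s / 2) ** 0.5)
            im += rng.gauss(0.0, (E_s / 2) ** 0.5)
        re += rng.gauss(0.0, (sigma_n2 / 2) ** 0.5)   # receiver noise
        im += rng.gauss(0.0, (sigma_n2 / 2) ** 0.5)
        total += re * re + im * im
    return (total / R - sigma_n2) / E_s


def monte_carlo(U, R, E_s, sigma_n2, trials, seed=0):
    """Return the sample mean and sample variance of the estimator."""
    rng = random.Random(seed)
    samples = [simulate_u_hat(U, R, E_s, sigma_n2, rng) for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((x - mean) ** 2 for x in samples) / (trials - 1)
    return mean, var


mean, var = monte_carlo(U=2, R=25, E_s=4, sigma_n2=0.01, trials=2000)
predicted_var = (2 + 0.01 / 4) ** 2 / 25     # (U + sigma_n^2 / E_s)^2 / R
```

The sample mean should land close to the true count $U=2$ and the sample variance close to the predicted value; increasing the number of antennas $R$ shrinks the variance proportionally.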
To derive the Bayesian MSE (BMSE) of the estimator $\hat{v}_q^{(t)}$, we assume that the distribution of $\tilde{g}_{k,q}^{(t)}$ is uniform between $-v_{\max}-\Delta/2$ and $v_{\max}+\Delta/2$. This implies that the distribution of $U_{q,i,j}$ is $\mathcal{B}(K, 1/\beta)$. As a result, the variance of the error due to the communication channel can be calculated as:
by using (23), $E_s=\beta-1$, and the identities given by:
Since we assume that $d_{k,q,i}^{(t)}$ follows a uniform distribution, we can also calculate the quantization error as:
Therefore, the BMSE can be calculated as:
$$\text{BMSE}(\hat{v}_q^{(t)}) = \sigma_{\text{channel}}^2 + \sigma_{\text{quan}}^2 = v_{\max}^2 E_{\text{total}}, \tag{27}$$
where $E_{\text{total}}$ is $E_{\text{channel}}+E_{\text{quan}}$.
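The step from uniformly distributed gradients to binomially distributed vote counts can be illustrated with a short enumeration. The sketch assumes an odd $\beta$, equally likely quantization levels, and EDs that pick their levels independently; the helper names are hypothetical. It counts how often each balanced numeral value occurs over all $\beta^D$ levels and then evaluates the mean and variance of a binomial count with success probability $1/\beta$.

```python
from collections import Counter


def balanced_numerals(n, beta, D):
    """Balanced base-beta numerals of the integer n, least significant first."""
    numerals = []
    for _ in range(D):
        r = n % beta
        if r > beta // 2:
            r -= beta
        numerals.append(r)
        n = (n - r) // beta
    return numerals


def numeral_histograms(beta, D):
    """Count the occurrences of each numeral value over all beta**D levels."""
    half = (beta**D - 1) // 2
    hists = [Counter() for _ in range(D)]
    for n in range(-half, half + 1):
        for i, d in enumerate(balanced_numerals(n, beta, D)):
            hists[i][d] += 1
    return hists


def vote_count_moments(K, beta):
    """Mean and variance of a Binomial(K, 1/beta) vote count."""
    p = 1 / beta
    return K * p, K * p * (1 - p)


hists = numeral_histograms(beta=5, D=3)       # every value appears 25 times
mean_U, var_U = vote_count_moments(K=25, beta=5)
```

Since every numeral value occurs equally often at every position, a given ED selects a given symbol with probability $1/\beta$, and summing over $K$ independent EDs gives the binomial vote count used in the BMSE derivation.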
In practice, the gradients often have an unknown probability distribution that changes over the communication rounds [13]. Hence, the expression in (27) is limited by its underlying distribution assumption. On the other hand, the analysis with a general non-stationary distribution is much more complicated because the expected value in (24) for different numerals may not be identical to each other. Nevertheless, (27) is a closed-form expression and roughly predicts the performance of the scheme for a given configuration without using sophisticated expressions, as exemplified in Section V.
Based on (27), we infer the following:
As we show in Section IV and demonstrate in Section V, the amount of BMSE plays a major role for the convergence rate of the FEEL. To reduce BMSE for FEEL, we introduce a simple method in the following subsection.
Without any adaptation, the BMSE in (27) is a constant and the error due to the presently disclosed scheme can dominate the estimate of $v_q^{(t)}$ when its value is closer to 0. This can be a non-negligible issue in practice because the gradients tend to become smaller over time. To address this issue, we exploit the fact that the gradients between adjacent communication rounds may have a high correlation [36] and propose to improve the presently disclosed scheme with a feedback loop: all the EDs transmit only a single parameter related to their local gradients to the ES through a control channel (e.g., PUCCH in 3GPP 5G NR), and the ES sets a new absolute maximum $v_{\max}$ for the next communication round based on the received feedback from the EDs. The information that is transmitted from an ED can be a function of the maximum absolute value of the gradients, the empirical variance, standard deviation, or the mean of the gradients. In this disclosure, we assume that the feedback loop realizes the AAM as:
$$v_{\max}^{(t)} = \alpha \times \|\mathbf{m}^{(t-1)}\|_\infty, \tag{28}$$
where $\mathbf{m}^{(t)}=[m_0^{(t)}, \ldots, m_{K-1}^{(t)}]$ is the metric vector, $m_k^{(t)}$ is the metric for the $k$th ED, $\forall k$, $\alpha$ is a positive value, and $v_{\max}^{(0)}$ is the initial value for the AAM.
The AAM based on (28) can be implemented in a practical network as follows: 1) The $k$th ED transmits $m_k^{(t)}$, $\forall k$, at the $t$th communication round through an orthogonal channel; 2) The ES calculates (28); 3) The ES transmits $v_{\max}^{(t+1)}$ to the EDs; and 4) The EDs update their $f_{\text{enc},\beta}$ based on the new absolute maximum $v_{\max}^{(t+1)}$.
In this disclosure, we choose $m_k^{(t)}$ and $\alpha$ as $m_k^{(t)}=\|\tilde{\mathbf{g}}_k^{(t)}\|_2$ and $\alpha=\sqrt{Q}$, heuristically, based on the five-sigma deviation rule. The convergence rate of FEEL with and without AAM is analyzed in Section IV.
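The feedback loop in (28) amounts to a few lines of code at the ES. The following Python sketch assumes that each ED reports the Euclidean norm of its local gradient vector as its metric and leaves $\alpha$ as a caller-supplied parameter; the function names and the toy gradient values are illustrative and not part of the disclosure.

```python
import math


def local_metric(gradient):
    """Metric reported by one ED: the Euclidean norm of its local gradients."""
    return math.sqrt(sum(g * g for g in gradient))


def aam_update(local_gradients, alpha):
    """Compute the next absolute maximum as alpha times the largest metric.

    The metrics are non-negative, so the infinity norm of the metric
    vector is simply its largest entry.
    """
    metrics = [local_metric(g) for g in local_gradients]
    return alpha * max(metrics)


# Toy round with K = 2 EDs and Q = 2 gradients per ED (made-up values).
v_max_next = aam_update([[3.0, 4.0], [0.0, 1.0]], alpha=0.5)   # 0.5 * 5.0
```

Only one scalar per ED travels over the control channel, so the overhead of the loop is independent of the number of gradients $Q$.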
For the convergence rate analysis, we consider well-known Lipschitz continuity [42] and make several assumptions on the loss function and gradient estimates, given as follows:
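Definition 1. A function $f$ is $L$-Lipschitz over a set $S$ with respect to a norm $\|\cdot\|$ if there exists a real constant $L>0$ such that $\|f(\mathbf{y})-f(\mathbf{x})\| \le L\|\mathbf{y}-\mathbf{x}\|$, $\forall \mathbf{x},\mathbf{y} \in S$.
Lemma 1 ([42, Lemma 1.2.3]). For a differentiable function $f: \mathbb{R}^Q \to \mathbb{R}$, let $\nabla f$ be $L$-Lipschitz on $\mathbb{R}^Q$ with respect to norm $\|\cdot\|_2$. Then, for any $\mathbf{y}, \mathbf{x}$ from $\mathbb{R}^Q$,
$$|f(\mathbf{y}) - f(\mathbf{x}) - \nabla f(\mathbf{x})^{\mathrm{T}}(\mathbf{y}-\mathbf{x})| \le \frac{L}{2}\|\mathbf{y}-\mathbf{x}\|_2^2.$$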
Assumption 1 (Bounded loss function). The loss function is bounded, i.e., F(w)≥F*, ∀w.
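Assumption 2 (Smooth gradients). The gradient of the loss function, i.e., $\nabla F$, is $L$-Lipschitz on $\mathbb{R}^Q$ with respect to norm $\|\cdot\|_2$, i.e.,
$$\|\nabla F(\mathbf{w}')-\nabla F(\mathbf{w})\|_2 \le L\|\mathbf{w}'-\mathbf{w}\|_2, \quad \forall \mathbf{w},\mathbf{w}' \in \mathbb{R}^Q.$$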
Assumption 1 and Assumption 2 are the standard assumptions that are often made in the literature for convergence analysis.
Assumption 3 (Unbiased average local stochastic gradients). For all $\mathbf{w}^{(t)}$, the average stochastic gradient vector is an unbiased estimate of the global gradient vector, i.e.,
$$\mathbb{E}[\mathbf{v}^{(t)}]=\mathbf{g}^{(t)}.$$
Assumption 4 (Gradient divergence). For all $\mathbf{w}^{(t)}$, the second-order moment of the local stochastic gradients of the $k$th ED with respect to the global gradients is bounded as:
$$\mathbb{E}[\|\tilde{\mathbf{g}}_k^{(t)}-\mathbf{g}^{(t)}\|_2^2] \le \delta_k, \quad \forall k.$$
Assumption 3 and Assumption 4 do not require the local gradients to be unbiased estimates of the global gradients. Hence, they are compatible with a heterogeneous data distribution scenario where the sum of the local gradients is unbiased.
Assumption 5 (Average quantization error). The quantization error is zero on average, i.e.,
$$\mathbb{E}[v_q^{(t)}-\bar{v}_q^{(t)}]=0, \quad \forall q.$$
Assumption 6 (MSE bound). The average MSE due to the communication channel and the quantization is bounded by $\sigma_{\text{channel}}^2+\sigma_{\text{quan}}^2$, i.e.,
$$\mathbb{E}[(\hat{v}_q^{(t)}-v_q^{(t)})^2] \le \sigma_{\text{channel}}^2+\sigma_{\text{quan}}^2,$$
and $\sigma_{\text{channel}}^2+\sigma_{\text{quan}}^2$ is given in (27).
Theorem 1. For a fixed learning rate $\eta$, the convergence rate of the distributed training based on the presently disclosed scheme in a Rayleigh fading channel is:
where $\sigma_{\text{channel}}^2$ and $\sigma_{\text{quan}}^2$ are given in (25) and (26), respectively.
The proof is given in Appendix A.
Theorem 1 is an extension of the convergence analysis of SGD under the consideration of the presently disclosed scheme. While the first term of the bound given in (30) becomes smaller for a larger total number of communication rounds T, the noise ball is determined with the values of the learning rate η, the noise variance due to the local stochastic gradient estimates, and the noise due to the presently disclosed scheme. The noise ball decreases when a smaller learning rate η is used at the expense of a larger T due to the first term in (30). The presently disclosed scheme contributes to the noise variance due to stochastic gradient calculation in (11) in an additive manner. Hence, the standard tuning methods for SGD such as momentum can also be utilized with the presently disclosed scheme to improve the convergence rate.
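The trade-off between the noise ball and the number of rounds can be illustrated with a toy model that is not the bound of Theorem 1. The sketch assumes a scalar quadratic loss $F(w)=w^2/2$ (so $L=1$ and the gradient equals $w$) and an additive zero-mean gradient error of variance $\sigma^2$ that stands in for the combined stochastic-gradient, channel, and quantization noise. Under these assumptions the expected squared gradient follows an exact recursion, so no random sampling is needed.

```python
def expected_sq_gradient(eta, sigma2, w0_sq, T):
    """E[|grad F(w_T)|^2] for F(w) = w^2 / 2 with noisy gradient steps.

    The update w <- w - eta * (w + n), with n zero-mean of variance sigma2,
    gives E[w^2] <- (1 - eta)^2 * E[w^2] + eta^2 * sigma2.
    """
    s = w0_sq
    for _ in range(T):
        s = (1 - eta) ** 2 * s + eta**2 * sigma2
    return s


def noise_floor(eta, sigma2):
    """Fixed point of the recursion above: the size of the noise ball."""
    return eta * sigma2 / (2 - eta)


fast = expected_sq_gradient(eta=0.1, sigma2=1.0, w0_sq=1.0, T=50)
slow = expected_sq_gradient(eta=0.01, sigma2=1.0, w0_sq=1.0, T=50)
```

In this toy model the smaller learning rate has a floor roughly ten times lower, yet after 50 rounds it is still far from that floor while the larger learning rate has already reached its own, which mirrors the qualitative behavior described above.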
The convergence rate of the FEEL under the presence of the presently disclosed scheme with AAM based on (28) can be expressed as follows:
Theorem 2. For a fixed learning rate $\eta$, the convergence rate of the distributed training based on the presently disclosed scheme in a Rayleigh fading channel is:
where $L'=L(1+\alpha^2 E_{\text{total}} K)$ for $U_{q,i,j} \sim \mathcal{B}(K, 1/\beta)$ for all $j$, $i$, $q$.
The proof is given in Appendix B.
Theorem 2 shows that the AAM eliminates the additive impact of the presently disclosed scheme on the gradient noise (as in Theorem 1) at the expense of scaling up the constant $L$. As compared to the case without AAM, the noise ball is smaller with AAM. Hence, the convergence rate improves considerably, as demonstrated in Section V.
In this section, we assess the presently disclosed scheme numerically. We demonstrate its BMSE performance and provide the test accuracy results based on FEEL under homogeneous and heterogeneous data distributions.
In this subsection, we demonstrate the divergence from the theoretical BMSE of the estimator of (19), given in (27). To this end, we calculate the BMSE through a simulation for both the uniform distribution discussed in Section III-D and a zero-mean Gaussian distribution with the variance 0.2. We assume $\sigma_n^2=0.01$, $v_{\max}=1$, and $K=25$ and consider $D \in \{1, 2\}$ and $\beta \in \{3, 5, 7\}$. Also, the channel coefficients are assumed to be independent.
To numerically analyze OAC with the presently disclosed scheme for FEEL, we consider the learning task of handwritten-digit recognition in a single cell with $K=25$ EDs. We set the SNR, i.e., $1/\sigma_n^2$, to 20 dB, and choose the number of antennas at the ES as $R \in \{1, 25\}$. For the fading channel, we consider ITU Extended Pedestrian A (EPA) with no mobility and regenerate the channels between the ES and the EDs independently for each communication round to capture the long-term channel variations. The subcarrier spacing is set to 15 kHz. We use $M=1200$ subcarriers (i.e., the signal bandwidth is 18 MHz). Hence, the maximum difference between the arrival times of the ED signals is $T_{\text{sync}}=55.6$ ns. We assume that the synchronization uncertainty at the ES is $N_{\text{err}}=3$ samples. For the comparisons, we consider FSK-MV proposed in [20] as it is based on non-coherent detection and provides robustness against time-synchronization errors. We do not consider methods that rely on TCI as their performance can deteriorate quickly in the cases of time-synchronization errors [20], [21] or imperfect CSI [19]. For the presently disclosed scheme, we consider $\beta \in \{3, 5, 7\}$ and $D \in \{1, 2\}$.
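The numbers in this setup are linked by simple arithmetic, which the following Python sketch spells out. It assumes that the sampling rate equals the 18 MHz signal bandwidth, under which the quoted 55.6 ns corresponds to one sample period, and it evaluates the expression for the maximum number of gradients per OFDM symbol, $\lfloor M/((\beta-1)D) \rfloor$, for the configurations considered. Variable names are illustrative.

```python
M = 1200                          # number of subcarriers
subcarrier_spacing_hz = 15e3      # 15 kHz

bandwidth_hz = M * subcarrier_spacing_hz    # 18 MHz
sample_period_ns = 1e9 / bandwidth_hz       # one sample at an 18 MHz rate

snr_db = 20
sigma_n2 = 10 ** (-snr_db / 10)             # noise variance for SNR = 1/sigma_n^2


def max_parallel_gradients(M, beta, D):
    """Gradients per OFDM symbol when each one uses (beta - 1) * D subcarriers."""
    return M // ((beta - 1) * D)


m_par = {
    (beta, D): max_parallel_gradients(M, beta, D)
    for beta in (3, 5, 7)
    for D in (1, 2)
}
```

For example, the configuration $\beta=3$, $D=1$ fits 600 gradients into one OFDM symbol, whereas $\beta=7$, $D=2$ fits 100, so finer quantization is paid for with more OFDM symbols per round.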
For the local data at the EDs, we use the MNIST database that contains labeled handwritten-digit images of size 28×28 from digit 0 to digit 9. We distribute the data samples in the MNIST database to the EDs to generate representative results for FEEL. We consider both homogeneous and heterogeneous data distributions in the cell. To prepare the data, we first choose $|\mathcal{D}|=25000$ training images from the database, where each digit has 2500 distinct images. For the scenario with the homogeneous data distribution, we assume that each ED has 250 distinct images for each digit. As done in [20], for the scenario with the heterogeneous data distribution, we divide the cell into 5 areas with concentric circles and the EDs located in the $u$th area have the data samples with the labels $\{u-1, u, 1+u, 2+u, 3+u, 4+u\}$ for $u \in \{1, \ldots, 5\}$ (see [20]).
In this disclosure, we investigate an OAC method that exploits balanced number systems for gradient aggregation. The presently disclosed scheme achieves a continuous-valued computation through a digital scheme by exploiting the fact that the average of the numerals in the real domain can be used to compute the average of the corresponding real-valued parameters approximately. With the presently disclosed OAC method, the local stochastic gradients are encoded into a sequence where the elements of the sequence determine the activated OFDM subcarriers. We also use a non-coherent receiver to eliminate the need for precise sample-level time synchronization, the channel estimation overhead, and the power instabilities due to channel inversion techniques. To improve its MSE performance, we also introduce AAM. We theoretically analyze the MSE performance of the scheme and its convergence rate for FEEL under both homogeneous and heterogeneous data distributions. Our numerical results demonstrate that the test accuracy of the FEEL with the presently disclosed scheme using AAM can reach up to 98% even when the EDs do not have all the labels in their data sets.
The presently disclosed scheme provides a potentially rich area to be investigated. For example, in this disclosure, we consider gradient aggregation. On the other hand, one open question is if the presently disclosed scheme can also be utilized for parameter aggregation. Based on our numerical tests, the performance (e.g., test accuracy) can be poor as the neural network may not be tolerant to the errors on the model parameters due to the presently disclosed scheme. Hence, evaluating (and enhancing) the presently disclosed scheme with a noise-tolerant neural network (e.g., quantized neural networks) is an interesting future research direction that can be pursued.
Another interesting direction is the utilization of the presently disclosed OAC scheme along with distributed source coding to reduce the per-round communication latency further.
This written description uses examples to disclose the presently disclosed subject matter, including the best mode, and also to enable any person skilled in the art to practice the presently disclosed subject matter, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the presently disclosed subject matter is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they include structural and/or step elements that do not differ from the literal language of the claims, or if they include equivalent structural and/or elements with insubstantial differences from the literal languages of the claims.
Proof. By Assumption 2, we utilize Lemma 1 to obtain the following inequality:
By using Assumptions 3-6, we obtain the identities given by:
Therefore, the expected improvement can be expressed as:
We then use Assumption 1, perform a telescoping sum over the iterations and calculate the expectation over the randomness in the trajectory as:
By rearranging the terms, (30) is reached.
Proof. The proof of Theorem 2 is similar to that of Theorem 1. By Assumption 2, we use Lemma 1 to express the following inequality:
for $\mathbf{w}^{(t+1)}=\mathbf{w}^{(t)}-\eta\hat{\mathbf{v}}^{(t)}$. By using Assumption 3, Assumption 4, (27), and (26), we calculate:
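Let $\mathbf{b}_k^{(t)} \triangleq \mathbb{E}[\tilde{\mathbf{g}}_k^{(t)}-\mathbf{g}^{(t)}]$ be the bias vector. Based on Assumption 4,
$$\mathbb{E}[\|\tilde{\mathbf{g}}_k^{(t)}\|_2^2] = \mathbb{E}[\|\tilde{\mathbf{g}}_k^{(t)}-\mathbf{g}^{(t)}\|_2^2] + \|\mathbf{g}^{(t)}\|_2^2 + 2\mathbf{g}^{(t)\mathrm{T}}\mathbf{b}_k^{(t)} \le \delta_k + \|\mathbf{g}^{(t)}\|_2^2 + 2\mathbf{g}^{(t)\mathrm{T}}\mathbf{b}_k^{(t)}. \tag{32}$$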
Therefore, based on (28), (32), and by Assumption 3,
Therefore, the expected improvement with AAM can be expressed as:
Considering Assumption 1, we perform a telescoping sum over the iterations and calculate the expectation over the randomness in the trajectory as:
Also, we can express the expected value of the sum over the trajectory as:
Finally, by using (35) and rearranging the terms in (34), (31) is obtained.
The present application claims the benefit of priority of U.S. Provisional Patent Application No. 63/394,438, titled OVER-THE-AIR COMPUTATION METHODS BASED ON BALANCED NUMBER SYSTEMS FOR FEEL, filed Aug. 2, 2022, and which is fully incorporated herein by reference for all purposes.
Number | Date | Country
---|---|---
63394438 | Aug 2022 | US