Federated edge learning (FEEL) is a distributed learning framework that leverages the computational power of edge devices (EDs) and uses the local data at the EDs to train a model without compromising their privacy[1], [2]. In FEEL, the initial model parameters are first distributed to many EDs from an edge server (ES). The EDs then share their local updates, e.g., updated model parameters or local gradients, based on local data with the ES. After the local updates are aggregated at the ES, the global updates are distributed back to the EDs for the next iteration. Since a large number of parameters need to be transmitted from the EDs to the ES in each iteration, the communication aspect of FEEL stands as one of the main bottlenecks. To address this issue, one of the promising solutions is to perform the aggregation with over-the-air computation (AirComp), which harnesses the signal-superposition property of the wireless multiple access channel[3]-[5]. However, developing a broadband AirComp scheme is not trivial due to the multipath channel, and channel state information (CSI) often needs to be available at the EDs or ES. In this disclosure, we address this issue with a novel scheme.
In the literature, several AirComp schemes are investigated for FEEL. In one example, the local model parameters at the EDs are transmitted over orthogonal frequency division multiplexing (OFDM) subcarriers to achieve broadband analog aggregation (BAA) of the model parameters over the air[6]. To overcome the impact of multipath channel on the transmitted signals, the symbols on the OFDM subcarriers are multiplied with the inverse of the channel coefficients and the subcarriers that fade are excluded from the transmissions, i.e., truncated-channel inversion (TCI). In another example[7], BAA is extended to one-bit broadband digital aggregation (OBDA) to facilitate the implementation of FEEL for a practical wireless system by adopting signSGD[8]. In this method, the EDs transmit quadrature amplitude modulation (QAM) symbols over OFDM subcarriers with TCI, where the real and imaginary parts of the QAM symbols are formed by using the signs of the elements of the local gradient vectors, i.e., votes. At the ES, the estimates of the global gradients are calculated based on majority vote (MV), which corresponds to the signs of the real and imaginary components of the superposed symbols on each subcarrier. Although OBDA is compatible with digital modulations, EDs still need the CSI for TCI as in BAA for AirComp. In yet another example, an additional time-varying precoder is applied along with TCI for BAA to facilitate the aggregation[9]. EDs sparsify their gradient estimates and project the resultant sparse vector into a low-dimensional vector for bandwidth reduction. The resulting compressed data is then transmitted with BAA[10]. In other studies, blind EDs are considered. However, it is assumed that the CSI for each ED is available at the ES. The impact of channel on AirComp is mitigated through beamforming with a large number of antennas[11]-[12]. 
To the best of our knowledge, there is no AirComp scheme in the literature that addresses the cases where CSI is unavailable to both EDs and ES for FEEL.
Aspects and advantages of the presently disclosed subject matter will be set forth in part in the following description, or may be apparent from the description, or may be learned through practice of the presently disclosed subject matter.
Broadly speaking, the presently disclosed subject matter relates to methods for reliable over-the-air computation and federated edge learning.
The presently disclosed systems/devices and the corresponding and/or associated methodologies relate to AirComp scheme(s) for FEEL without CSI at the EDs or ES. The proposed scheme adopts the MV principle and defines multiple subcarriers and OFDM symbols for voting options, which reduces to frequency-shift keying (FSK) over OFDM subcarriers as a special case. Since the votes from EDs are separated on orthogonal resources, it eliminates the need for TCI at the EDs and allows the ES to detect the MV with a non-coherent detector. Since the proposed method does not encode the votes on amplitude and phase, it also admits peak-to-mean envelope power ratio (PMEPR) reduction techniques. With randomization symbols, we show that the proposed scheme provides PMEPR characteristics similar to those of OFDM while providing a high test accuracy in fading channels.
FEEL is a distributed learning framework that leverages the computational powers of EDs and uses the local data at the EDs without compromising privacy to train a model. However, the communication aspect of FEEL stands as one of the main bottlenecks. To address this issue, one of the promising solutions is to perform the aggregation with AirComp methods that harness the signal-superposition property of the wireless multiple-access channel. However, developing a broadband AirComp scheme is not trivial due to the multipath channel and often CSI needs to be available. In this disclosure, we address this issue with a novel AirComp scheme.
The presently disclosed subject matter addresses the communication latency problem of training an artificial intelligence model over a wireless network. It reduces the latency with AirComp. However, the presently disclosed subject matter does not use the channel information (e.g., channel frequency response) needed for wireless communication at the EDs (e.g., a user) or ES (e.g., a base station).
This disclosure is most likely applicable to 5G New Radio and beyond (e.g., 6G). Further, BAA and OBDA are two major methods that reduce latency; however, they require channel state information at the EDs, which incurs a substantial overhead.
In addition, there is a large market size for this disclosure as it is related to both commercial wireless and AI technologies. It could be useful for artificial intelligence technologies over wireless or sensor networks, 5G and beyond, 6G wireless standardization, IEEE 802.11 Wi-Fi.
The proposed scheme does not need channel inversion at the EDs. From this aspect, it is compatible with time-varying channels and does not lose gradient information due to truncation. The proposed scheme reduces PMEPR with a simple randomization technique. It also does not require CSI at the ES or multiple antennas for AirComp.
The presently disclosed subject matter is theoretically supported, and its validity is tested through numerical analysis and MATLAB®-based simulations under practical wireless channel models using the publicly available MNIST dataset.
Generally speaking, the presently disclosed subject matter relates to distributed learning, federated edge learning, frequency-shift keying, orthogonal frequency division multiplexing, over-the-air computation, and peak-to-mean envelope power ratio, all relating to electrical-based subject matter.
In this disclosure, we propose an AirComp scheme relying on the MV principle. Instead of encoding the votes with QAM symbols, we use multiple subcarriers and/or OFDM symbols for voting options, which corresponds to FSK over OFDM subcarriers as a special case. As the votes are aggregated on orthogonal resources with the proposed scheme, we eliminate the need for TCI at the EDs and enable the ES to determine the MV with a non-coherent detector. The proposed scheme can be used with well-known PMEPR reduction techniques as it does not utilize the amplitude and the phase to encode votes. PMEPR is reduced by using randomization symbols on active subcarriers, which also speeds up the convergence for non-independent and identically distributed (non-IID) data.
Notation: The sets of complex and real numbers are denoted by $\mathbb{C}$ and $\mathbb{R}$, respectively. $\mathbb{E}_t[\cdot]$ is the expectation of its argument over t. The signum function is denoted by sign(⋅).
Considered another way, we propose an AirComp scheme for FEEL. The proposed scheme relies on the concept of distributed learning by MV with signSGD. In contrast to the state-of-the-art solutions, with the proposed method, EDs transmit the signs of local stochastic gradients by activating one of two orthogonal resources, i.e., OFDM subcarriers, and the MVs at the ES are obtained with non-coherent detectors by exploiting the energy accumulations on the subcarriers. Hence, the proposed scheme eliminates the need for CSI at the EDs and ES. By taking path loss, power control, cell size, and the probabilistic nature of the detected MVs in fading channels into account, we prove the convergence of the distributed learning for a non-convex function. Through simulations, we show that the proposed scheme can provide a high test accuracy in fading channels even when the time synchronization and the power alignment at the ES are not ideal. We also provide insight into distributed learning for location-dependent data distribution for the MV-based schemes.
The disclosure deals with a system and method for an AirComp scheme for FEEL without CSI at the EDs or ES. The disclosure adopts the MV principle and defines multiple subcarriers and OFDM symbols for voting options, which reduces to FSK over OFDM subcarriers as a special case. Thus, FSK-based AirComp is provided for FEEL without CSI. Since the votes from EDs are separated on orthogonal resources, the proposed scheme eliminates the need for TCI at the EDs and allows the ES to detect MV with a non-coherent detector. We also mitigate the PMEPR of the synthesized signals by using randomization symbols. Simulations show the proposed scheme provides high test accuracy in fading channels for both IID and non-IID data while resulting in OFDM symbols with lower PMEPRs as compared to OBDA with QAM.
It is to be understood that the presently disclosed subject matter equally relates to associated and/or corresponding methodologies.
Other exemplary aspects of the present disclosure are directed to systems, apparatus, tangible, non-transitory computer-readable media, user interfaces, memory devices, and electronic devices for an AirComp scheme for FEEL without CSI at the EDs or ES. To implement the methodology and technology herewith, one or more processors may be provided, programmed to perform the steps and functions as called for by the presently disclosed subject matter, as will be understood by those of ordinary skill in the art.
One exemplary presently disclosed method relates to an AirComp methodology for FEEL without using CSI at a plurality of EDs or at an ES, comprising: a distributed machine-learning model to be trained with the update vectors received at an ES as transmitted from a plurality of EDs; one or more processors; and one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the one or more processors to perform operations. Such operations preferably may comprise: transmitting local update vectors as weighted votes over selected multiple orthogonal subcarriers grouped based on the sign of the elements of the update vector from each respective of the plurality of EDs via a wireless multiple access channel, receiving the superposed local updates at the ES, determining the MV for each element of the update vector at the ES with an energy detector over orthogonal time and frequency resources, and inputting the MVs into the machine-learning model to be updated.
Another exemplary embodiment of presently disclosed subject matter relates to an AirComp system for FEEL without using CSI at a plurality of EDs or at an ES, comprising a machine-learning model training to process data received at an ES as transmitted from a plurality of EDs; one or more processors; and one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the one or more processors to perform operations, the operations comprising transmitting local updates as votes over selected multiple subcarriers from each respective of the plurality of EDs via a wireless multiple access channel, receiving the local updates at the ES, aggregating the local updates at the ES including separating votes from the EDs using orthogonal resources and MV principle, and inputting the obtained data into the machine-learning model as training data or data to process.
Additional objects and advantages of the presently disclosed subject matter are set forth in, or will be apparent to, those of ordinary skill in the art from the detailed description herein. Also, it should be further appreciated that modifications and variations to the specifically illustrated, referred and discussed features, elements, and steps hereof may be practiced in various embodiments, uses, and practices of the presently disclosed subject matter without departing from the spirit and scope of the subject matter. Variations may include, but are not limited to, substitution of equivalent means, features, or steps for those illustrated, referenced, or discussed, and the functional, operational, or positional reversal of various parts, features, steps, or the like.
Still further, it is to be understood that different embodiments, as well as different presently preferred embodiments, of the presently disclosed subject matter may include various combinations or configurations of presently disclosed features, steps, or elements, or their equivalents (including combinations of features, parts, or steps or configurations thereof not expressly shown in the figures or stated in the detailed description of such figures). Additional embodiments of the presently disclosed subject matter, not necessarily expressed in the summarized section, may include and incorporate various combinations of aspects of features, components, or steps referenced in the summarized objects above, and/or other features, components, or steps as otherwise discussed in this application. Those of ordinary skill in the art will better appreciate the features and aspects of such embodiments (and others upon review of the remainder of the specification) and will appreciate that the presently disclosed subject matter applies equally to corresponding methodologies as associated with practice of any of the present exemplary devices and vice versa.
These and other features, aspects and advantages of various embodiments will become better understood with reference to the following description and appended claims. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the present disclosure and, together with the description, serve to explain the related principles.
A full and enabling disclosure of the present subject matter, including the best mode thereof to one of ordinary skill in the art, is set forth more particularly in the remainder of the specification, including reference to the accompanying figures in which:
Table 1 correlates Layers and Learnables for a Neural Network at the EDs;
Repeat use of reference characters in the present specification and drawings is intended to represent the same or analogous features, elements, or steps of the presently disclosed subject matter.
Reference will now be made in detail to various embodiments of the disclosed subject matter, one or more examples of which are set forth below. Each embodiment is provided by way of explanation of the subject matter, not limitation thereof. In fact, it will be apparent to those skilled in the art that various modifications and variations may be made in the present disclosure without departing from the scope or spirit of the subject matter. For instance, features illustrated or described as part of one embodiment, may be used in another embodiment to yield a still further embodiment. Thus, it is intended that the presently disclosed subject matter covers such modifications and variations as come within the scope of the appended claims and their equivalents.
In general, the present disclosure is directed to an OFDM-based FEEL system with K EDs. Prior to the training, the initial values of the model parameters, denoted by $\mathbf{w}\in\mathbb{R}^q$, and the model structure are distributed to the EDs from an ES to set up a common learning model at the EDs, where q is the model size. We denote the local dataset containing labeled data samples at the kth ED as $\mathcal{D}_k=\{(\mathbf{x}_\ell, y_\ell)\}$ for k=1, . . . , K, where $\mathbf{x}_\ell$ and $y_\ell$ are the ℓth data sample and its associated label, respectively. The main goal of the FEEL system is to obtain the trained model parameters without uploading the local data to the ES.
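The local loss function of the model with the parameters $\mathbf{w}$ at the kth ED can be calculated as:
$$F_k(\mathbf{w})=\frac{1}{|\mathcal{D}_k|}\sum_{(\mathbf{x}_\ell,y_\ell)\in\mathcal{D}_k} f(\mathbf{w},\mathbf{x}_\ell,y_\ell), \tag{1}$$
where $f(\mathbf{w},\mathbf{x}_\ell,y_\ell)$ is the sample loss function that measures the labelling error for $(\mathbf{x}_\ell,y_\ell)$ for the parameters $\mathbf{w}$.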
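Assuming identical local dataset sizes, i.e., $|\mathcal{D}_k|=D$ for k=1, . . . , K, the global loss function can be measured as:
$$F(\mathbf{w})=\frac{1}{K}\sum_{k=1}^{K}F_k(\mathbf{w}), \tag{2}$$
where $F_k(\mathbf{w})$ is the local loss function at the kth ED.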
In this disclosure, we focus on a FEEL system based on gradient averaging[7]. For each communication round n of FEEL, the kth ED calculates an estimate of the global gradient of the loss function in Eq. (2) by using its local dataset $\mathcal{D}_k$ and the parameter vector $\mathbf{w}^{(n)}$. Assuming that all data samples in $\mathcal{D}_k$ are used for gradient estimation, the local gradient estimate for the kth ED at the nth communication round, denoted by $\mathbf{g}_k^{(n)}$, can be expressed as:
$$\mathbf{g}_k^{(n)}=\frac{1}{|\mathcal{D}_k|}\sum_{(\mathbf{x}_\ell,y_\ell)\in\mathcal{D}_k}\nabla f(\mathbf{w}^{(n)},\mathbf{x}_\ell,y_\ell), \tag{3}$$
where ∇ represents the gradient operator.
Assuming that the local gradient estimates are reliably received at the ES, the ES can obtain the global estimate of the gradient of the loss function in Eq. (2) as:
$$\mathbf{g}^{(n)}=\frac{1}{K}\sum_{k=1}^{K}\mathbf{g}_k^{(n)}. \tag{4}$$
Subsequently, the ES distributes the global gradient estimate $\mathbf{g}^{(n)}$ to the EDs and the current model is updated based on a common update rule, e.g., gradient descent given by $\mathbf{w}^{(n+1)}=\mathbf{w}^{(n)}-\eta\mathbf{g}^{(n)}$, where η is the learning rate and $\mathbf{w}^{(1)}=\mathbf{w}$. This process is repeated consecutively until a predetermined convergence criterion is achieved.
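The gradient-averaging loop described above can be sketched in a few lines of Python. This is only an illustrative toy, not the disclosed implementation: the squared-error sample loss, the synthetic datasets, the learning rate, and the function names `local_gradient` and `feel_round` are all assumptions made for the example.

```python
import numpy as np

def local_gradient(w, X, y):
    """Gradient of the mean of 0.5*(x^T w - y)^2 over one ED's local data (assumed toy loss)."""
    residual = X @ w - y
    return X.T @ residual / len(y)

def feel_round(w, datasets, lr):
    """One communication round: average the local gradients, then take a descent step."""
    grads = [local_gradient(w, X, y) for X, y in datasets]
    g = np.mean(grads, axis=0)   # global gradient estimate (average over the EDs)
    return w - lr * g            # common update rule: gradient descent

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
datasets = []
for _ in range(4):               # K = 4 EDs with D = 50 samples each (hypothetical sizes)
    X = rng.normal(size=(50, 3))
    datasets.append((X, X @ w_true))

w = np.zeros(3)
for n in range(100):
    w = feel_round(w, datasets, lr=0.5)
```

Only gradients leave the EDs in this sketch; the local arrays `X` and `y` are never passed to the averaging step.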
In this disclosure, we adopt signSGD[8] for FEEL. Instead of the actual values of the local gradients, the EDs transmit the signs of their local gradients, i.e., $\tilde{\mathbf{g}}_k^{(n)}$ for k=1, . . . , K, to the ES, where the ith element of $\tilde{\mathbf{g}}_k^{(n)}$ is $\tilde{g}_{k,i}^{(n)}\triangleq\operatorname{sign}(g_{k,i}^{(n)})$. Then the estimate of the global gradient for the ith parameter can be calculated by using the MV principle as given by:
$$v_i^{(n)}\triangleq\operatorname{sign}\left(\sum_{k=1}^{K}\tilde{g}_{k,i}^{(n)}\right). \tag{5}$$
The ES then transmits $\mathbf{v}^{(n)}=(v_0^{(n)},\ldots,v_{q-1}^{(n)})$ to the EDs and the models at the EDs are updated, e.g., $\mathbf{w}^{(n+1)}=\mathbf{w}^{(n)}-\eta\mathbf{v}^{(n)}$.
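As a small worked example of the MV principle with signSGD, the following sketch takes the local gradients of three hypothetical EDs, keeps only their signs, and forms the element-wise majority vote. The gradient values and the function name `majority_vote` are made up for illustration; note that `np.sign` returns 0 for a tie or a zero-valued gradient, which is a choice of this sketch rather than something the disclosure specifies.

```python
import numpy as np

def majority_vote(local_grads):
    """Element-wise majority vote over the signs of K local gradient vectors (K x q array)."""
    votes = np.sign(local_grads)          # each ED contributes +1 or -1 per parameter
    return np.sign(votes.sum(axis=0))     # sign of the vote sum is the majority

# Three hypothetical EDs, three model parameters.
local_grads = np.array([[ 0.3, -1.2,  0.1],
                        [ 0.5,  0.4, -0.2],
                        [-0.1, -0.7, -0.9]])
v = majority_vote(local_grads)            # -> [ 1., -1., -1.]

# Model update with the majority votes instead of averaged gradients.
w = np.zeros(3)
eta = 0.01
w_next = w - eta * v
```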
In this disclosure, we assume that the EDs access the wireless channel on the same time-frequency resources simultaneously for AirComp with S OFDM symbols consisting of M active subcarriers. We assume the transmissions from the EDs are synchronized in both time and frequency and arrive at the ES within the cyclic prefix (CP) duration. We also assume that the CP duration is larger than the maximum-excess delays of the channels between the ES and the EDs. The superposed symbol on the lth subcarrier of the mth OFDM symbol at the ES can then be written as:
$$r_{l,m}=\sum_{k=1}^{K}h_{k,l}\,t_{k,l,m}+n_{l,m}, \tag{6}$$
where $h_{k,l}\in\mathbb{C}$ is the channel coefficient between the ES and the kth ED on the lth subcarrier with $\mathbb{E}[|h_{k,l}|^2]=1$, $t_{k,l,m}\in\mathbb{C}$ is the transmitted symbol from the kth ED on the lth subcarrier of the mth OFDM symbol, and $n_{l,m}$ is the zero-mean additive white Gaussian noise (AWGN) with the variance $\sigma_n^2$ on the lth subcarrier for l∈{0, 1, . . . , M−1} and m∈{0, 1, . . . , S−1}.
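Let $x(t)\in\mathbb{C}$ be the baseband OFDM symbol in continuous time for $t\in[0,T_s)$, where $T_s$ is the OFDM symbol duration. We define the PMEPR of an OFDM symbol as $\max_{t\in[0,T_s)}|x(t)|^2/\mathbb{E}_t[|x(t)|^2]$, i.e., the ratio of the peak instantaneous power to the mean power of the symbol.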
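To make the PMEPR definition concrete, the sketch below estimates it numerically from the frequency-domain symbols of one OFDM symbol. It is an approximation under stated assumptions: the continuous-time maximum is replaced by the maximum over an oversampled inverse FFT (the oversampling factor of 4 is an arbitrary choice), the subcarriers are placed at the lowest positive frequencies, and the function name `pmepr` is hypothetical.

```python
import numpy as np

def pmepr(freq_symbols, oversample=4):
    """Approximate PMEPR (linear scale) of one OFDM symbol from its subcarrier symbols."""
    freq_symbols = np.asarray(freq_symbols, dtype=complex)
    n_fft = oversample * len(freq_symbols)
    # Zero-padding in frequency (zeros appended by `n=`) oversamples the time-domain waveform.
    x = np.fft.ifft(freq_symbols, n=n_fft)
    power = np.abs(x) ** 2
    return power.max() / power.mean()

single_tone = pmepr([0, 0, 1, 0, 0, 0, 0, 0])   # constant envelope
all_ones = pmepr(np.ones(8))                    # fully correlated subcarriers

rng = np.random.default_rng(0)
qpsk = np.exp(1j * (np.pi / 4 + np.pi / 2 * rng.integers(0, 4, size=8)))
randomized = pmepr(qpsk)                        # random phases on the same subcarriers
```

A single active subcarrier has a constant envelope, so its PMEPR is 1, while M identical symbols add coherently at t=0 and give a PMEPR of M. The PMEPR with random QPSK phases cannot exceed that of the fully correlated case, which is the intuition behind using randomization symbols.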
Let ƒ be a bijective function that maps i∈{0, 1, . . . , q−1} to the distinct pairs $(m_0,l_0)$ and $(m_1,l_1)$ for $m_0,m_1\in\{0,1,\ldots,S-1\}$ and $l_0,l_1\in\{0,1,\ldots,M-1\}$. Based on $\tilde{g}_{k,i}^{(n)}$, at the nth communication round, we propose to calculate the symbols $t_{k,l_0,m_0}$ and $t_{k,l_1,m_1}$ as
$$t_{k,l_0,m_0}=\begin{cases}\sqrt{E_s}\,s_{k,i}, & \tilde{g}_{k,i}^{(n)}=1\\ 0, & \tilde{g}_{k,i}^{(n)}=-1\end{cases} \tag{7}$$
and
$$t_{k,l_1,m_1}=\begin{cases}0, & \tilde{g}_{k,i}^{(n)}=1\\ \sqrt{E_s}\,s_{k,i}, & \tilde{g}_{k,i}^{(n)}=-1\end{cases} \tag{8}$$
respectively, where $E_s=2$ is the normalized symbol energy and $s_{k,i}$ is a randomization symbol for k∈{1, . . . , K}.
Therefore, the proposed scheme separates the options for voting over two different resources identified in time and frequency. In this disclosure, we choose $s_{k,i}$ based on a random quadrature phase shift keying (QPSK) symbol to reduce PMEPR by decreasing the correlation in the frequency domain[13].
In one implementation, when {tilde over (g)}k,i=1, the symbols tk,l
In one implementation, the symbols tk,l
where gk,i is the local stochastic gradient and ω(gk,i) is a weighting function. The weighting function may be an even-symmetric function that ranges from 0 to 1 in order to limit the power of the transmitted OFDM symbols. The main motivation for using a weight function is that it can lower the probability of detecting an incorrect majority vote as compared to the sign operation. It may also increase the convergence rate in heterogeneous data distribution scenarios. Examples of the smooth, non-decreasing weight function for negative or positive gk,i are as follows:
where h, t, ρ are some non-negative coefficients. All of these examples ensure that the transmit power increases gradually as the magnitude of the local gradient grows. Therefore, if an ED has a smaller absolute local gradient, its impact on the MV becomes smaller. Similarly, if an ED has a large absolute local gradient, its impact on the MV becomes larger. Hence, the convergence speed may improve.
In one implementation, ω(gk,i)=1 may be chosen to achieve a design based on signs as described. In one implementation, the parameters of the weight function may be tuned through the communications round. For example, the tuning may be based on maximum values of the absolute local gradients or update vectors or the communication round index.
The functionality of ƒ can be divided into two different mappers, i.e., a gradient mapper (GM) and a resource mapper (RM). While the GM shuffles the quantized gradients, the RM identifies how the options for voting are distributed to the time and frequency resources. As a special case of the RM, if $m_1=m_0$ and $l_1=l_0+1$ for all i, the adjacent subcarriers of the $m_0$th OFDM symbol are used for voting, i.e., FSK over OFDM subcarriers. In this case, the weight of the kth ED's vote in the MV for the ith gradient is independent of its vote since these subcarriers are likely to experience similar channel conditions in practice, i.e., $h_{k,l_0}\approx h_{k,l_1}$.
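The FSK special case of the mapping can be sketched as follows for one ED. The sketch assumes that the ith vote occupies the adjacent subcarrier pair starting at flat index 2i on an S×M grid (so M must be even), that a vote of +1 activates the first subcarrier of the pair and a vote of −1 the second, and that the randomization symbols are unit-magnitude QPSK symbols drawn uniformly at random. The function name `encode_votes` and this particular index layout are illustrative assumptions, not taken from the disclosure.

```python
import numpy as np

QPSK = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)

def encode_votes(signs, M, rng, Es=2.0):
    """Map q votes (+1/-1) of one ED onto an S x M OFDM grid with FSK over adjacent subcarriers."""
    signs = np.asarray(signs)
    if M % 2:
        raise ValueError("M must be even so each subcarrier pair stays inside one OFDM symbol")
    q = len(signs)
    S = (2 * q + M - 1) // M                 # number of OFDM symbols needed
    grid = np.zeros((S, M), dtype=complex)
    flat = grid.reshape(-1)                  # view onto the grid, row = OFDM symbol
    s = rng.choice(QPSK, size=q)             # one randomization symbol per vote
    active = 2 * np.arange(q) + (signs < 0)  # first subcarrier for +1, second for -1
    flat[active] = np.sqrt(Es) * s
    return grid

signs = np.array([1, -1, 1, -1, -1])
grid = encode_votes(signs, M=4, rng=np.random.default_rng(0))
active_indices = np.flatnonzero(grid.reshape(-1))   # -> [0, 3, 4, 7, 9]
```

Because exactly one subcarrier of each pair is active, the information is carried by which subcarrier has energy, not by the phase of the symbol, which is why the phase can be randomized freely.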
Gradient mapper and resource mapper may be utilized with an interleaver or an encryption function to increase the security of the proposed scheme. For example, gradient mapper or resource mapper may map the votes to different subcarriers for each communication round based on an encryption operation. Hence, an eavesdropper cannot recover the order of the gradients by simply capturing the transmission.
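One way to picture a round-dependent gradient mapper is a pseudo-random permutation that the ES and the EDs derive from a shared secret and the communication round index. The sketch below uses NumPy's pseudo-random generator purely as a stand-in for the encryption operation mentioned above; it is not cryptographically secure, and the names `round_permutation` and `shared_key` are hypothetical.

```python
import numpy as np

def round_permutation(q, shared_key, round_index):
    """Permutation of the q gradient indices, reproducible from (shared_key, round_index)."""
    rng = np.random.default_rng([shared_key, round_index])
    return rng.permutation(q)

votes = np.array([1, -1, -1, 1, 1, -1])

# ED side: shuffle the votes before mapping them to subcarriers.
perm = round_permutation(len(votes), shared_key=1234, round_index=7)
shuffled = votes[perm]

# ES side: regenerate the same permutation and undo it after detection.
perm_es = round_permutation(len(votes), shared_key=1234, round_index=7)
restored = np.empty_like(shuffled)
restored[perm_es] = shuffled
```

An observer without the shared key would see the active subcarriers but not which model parameter each one belongs to. A real deployment would replace the generator with a proper keyed cipher or keyed shuffle.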
In one implementation, the symbols tk,l
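At the ES, the pairs $(m_0,l_0)$ and $(m_1,l_1)$ are first calculated by using the mapping function ƒ for a given i. Assuming independent multipath channels between the ES and the EDs and unit-magnitude randomization symbols, it can be shown that:
$$\mathbb{E}\left[|r_{l_0,m_0}|^2\right]=E_sK_0+\sigma_n^2 \quad\text{and}\quad \mathbb{E}\left[|r_{l_1,m_1}|^2\right]=E_sK_1+\sigma_n^2,$$
where $K_0$ and $K_1$ are the number of EDs that vote for 1 and −1 for the ith gradient, respectively.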
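The energy accumulation on the two voting resources can be checked with a short Monte Carlo sketch. Several things here are assumptions for illustration only: the channels are drawn as independent Rayleigh fading coefficients with unit average power, the randomization symbols are unit-magnitude QPSK, the values of `K0`, `K1`, and `noise_var` are arbitrary, and the detector is a plain comparison of the two energies rather than the thresholded detector of the disclosure.

```python
import numpy as np

def superposed_energies(num_votes, Es, noise_var, trials, rng):
    """Samples of |r|^2 on one resource that `num_votes` EDs activate simultaneously."""
    shape = (trials, num_votes)
    # Independent channels with E[|h|^2] = 1 (Rayleigh fading assumed).
    h = (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)
    # Random unit-magnitude QPSK randomization symbols.
    s = np.exp(1j * (np.pi / 4 + np.pi / 2 * rng.integers(0, 4, size=shape)))
    n = np.sqrt(noise_var / 2) * (rng.standard_normal(trials) + 1j * rng.standard_normal(trials))
    r = np.sqrt(Es) * (h * s).sum(axis=1) + n
    return np.abs(r) ** 2

rng = np.random.default_rng(1)
K0, K1, Es, noise_var = 7, 3, 2.0, 0.5
e0 = superposed_energies(K0, Es, noise_var, 100_000, rng)   # resource for votes of +1
e1 = superposed_energies(K1, Es, noise_var, 100_000, rng)   # resource for votes of -1

mean_e0, mean_e1 = e0.mean(), e1.mean()   # close to Es*K0 + noise_var and Es*K1 + noise_var
detected_plus = np.mean(e0 > e1)          # fraction of trials where the true majority wins
```

The sample means land near 14.5 and 6.5 for these settings, matching the expected energies. The comparison is correct only in a fraction of the trials, which illustrates the probabilistic nature of the detected MV in a fading channel.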
Therefore, the energies on the superposed symbols rl
where t is the maximum distance between |rl
As opposed to the prior approaches[6], [7], the proposed scheme does not need channel inversions at the EDs. From this aspect, it is compatible with time-varying channels (e.g., mobile networks[14]) and does not lose gradient information due to TCI. On the other hand, it quadruples the number of time-frequency resources for AirComp as compared to OBDA-QAM[7]; however, OBDA-QAM is not investigated in terms of PMEPR in the literature. As the numerical results show, OBDA-QAM can suffer from high PMEPR, while the proposed scheme reduces PMEPR with a simple randomization technique that also leads to better accuracy results for non-IID data. As compared to the approaches in the prior literature[11], [12], the proposed scheme also does not require CSI at the ES or multiple antennas.
For the numerical results, we considered the learning task of handwritten digit recognition with a FEEL system and compared the proposed scheme with BAA[6] for gradient averaging and OBDA-QAM[7]. We used the MNIST dataset, which contains 60,000 labelled 28×28 images of handwritten digits from 0 to 9. For the IID dataset, we randomly partitioned 20,000 training images into equal shares among K∈{10, 50} EDs. For the non-IID dataset, we chose 5 digits for each ED and selected the images randomly, i.e., different local datasets can contain the same image. For a fair comparison, we used the same data randomization for the different AirComp schemes.
For the model, we considered a convolutional neural network (CNN) that includes one 5×5 and two 3×3 convolutional layers, each followed by a batch normalization layer and a rectified linear unit (ReLU) activation. All convolutional layers have 20 filters. After the third ReLU, a fully connected layer with 10 units and a softmax layer were utilized. At the input layer, no normalization was applied. Our model has q=123090 learnable parameters, which corresponds to S=206, S=103, and S=52 OFDM symbols for OBDA-FSK, BAA, and OBDA-QAM for M=1200, respectively. The subcarrier spacing was set to 15 kHz, the truncation threshold for TCI was 0.2, and the threshold t was set to 0.01 for the proposed scheme.
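The OFDM symbol counts quoted above follow from how many subcarriers each scheme spends per model parameter. The short calculation below reproduces them, assuming two subcarriers per parameter for the FSK-based scheme, one subcarrier per parameter for BAA, and two parameters per subcarrier for OBDA-QAM (real and imaginary parts); the BAA loading is inferred from the reported S=103 rather than stated explicitly.

```python
q = 123090   # learnable parameters
M = 1200     # active subcarriers per OFDM symbol

def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return (a + b - 1) // b

ofdm_symbols = {
    "OBDA-FSK": ceil_div(2 * q, M),   # two subcarriers per vote
    "BAA": ceil_div(q, M),            # one subcarrier per parameter (assumed)
    "OBDA-QAM": ceil_div(q, 2 * M),   # two votes per subcarrier
}
# ofdm_symbols == {"OBDA-FSK": 206, "BAA": 103, "OBDA-QAM": 52}
```

The ratio 206/52 is roughly four, which is the resource overhead of the FSK-based scheme relative to OBDA-QAM.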
To test the FEEL, we considered two different uplink signal-to-noise ratios (SNRs), i.e., 0 dB and 20 dB.
For the fading channel, we considered ITU Extended Pedestrian A (EPA) with no mobility and regenerated the channels between the ES and the EDs for each communication round to capture the long-term channel variations. For TCI, we assumed that CSI was available at the EDs. For the update rule, we considered stochastic gradient descent with momentum, where the momentum is 0.9. The initial learning rate was 0.01 and the learning rate decayed with a rate of 0.05 for every communication round.
In this disclosure, we proposed an AirComp scheme for FEEL. The proposed scheme relies on MV and forms the options for voting on different subcarriers and/or OFDM symbols. Thus, it allows the receiver to detect the MV with a non-coherent detector and eliminates the need for TCI at the EDs, which makes it compatible with time-varying channels. Further, it can be used along with randomization methods in the frequency domain to reduce the PMEPR. Through simulations, we demonstrated that the proposed method provides a high test accuracy in fading channels for both IID and non-IID data with an acceptable PMEPR distribution, at the expense of a larger number of time and frequency resources.
The proposed method can be improved in various ways. For example, to lower PMEPR further, the randomization symbols can be designed based on the gradients. Precoded OFDM (e.g., discrete Fourier transform (DFT)-spread OFDM) or various mapping strategies can also be explored to improve the proposed method. In this disclosure, we focused on one-bit quantization. Extending the proposed concept to different quantization levels is another interesting research direction that can be pursued. The system-level analysis of the proposed method with heterogeneous data is also another direction that can be investigated.
Federated edge learning (FEEL) is an implementation of federated learning (FL) in a wireless network to train a model without moving the local data generated at the edge devices (EDs) to an edge server (ES)[001], [002]. With FEEL, a large number of model parameters (or gradients) needs to be communicated between many EDs and the ES through wireless channels. However, typical user multiplexing methods such as orthogonal frequency division multiple access (OFDMA) can be inefficient to address the spectrum congestion due to a large number of EDs[003]. To address this issue, one of the promising solutions is to perform the calculations needed for FEEL, e.g., averaging, with an over-the-air computation (AirComp) method that harnesses the signal-superposition property of the wireless-multiple access channel[004]-[006]. However, developing an AirComp scheme is not a trivial task due to the multipath channel, power misalignment, and time-synchronization errors in practice. Also, the channel state information (CSI) needs to be available at the EDs or the ES with state-of-the-art solutions. In this study, we propose an AirComp scheme to address these issues.
In the literature, various AirComp schemes are proposed for FEEL. In [007], analog modulation over orthogonal frequency division multiplexing (OFDM) is investigated for broadband analog aggregation (BAA). Particularly, it is proposed to modulate the OFDM subcarriers with the model parameters at the EDs. To overcome the impact of the multipath channel on the transmitted signals, the symbols on the OFDM subcarriers are multiplied with the inverse of the channel coefficients and the subcarriers that fade are excluded from the transmissions, which is known as truncated-channel inversion (TCI) in the literature. In [008], an additional time-varying precoder is applied along with TCI to facilitate the aggregation. In [009], it is proposed to sparsify the gradient estimates and project the resultant sparse vector into a low-dimensional vector to reduce the bandwidth. The compressed data is transmitted with BAA. In [010], one-bit broadband digital aggregation (OBDA) is proposed to facilitate the implementation of FEEL for a practical wireless system. In this method, considering distributed training by majority vote (MV) with the sign stochastic gradient descent (signSGD)[011], the EDs transmit quadrature phase-shift keying (QPSK) symbols over OFDM subcarriers along with TCI, where the real and imaginary parts of the QPSK symbols are formed by using the signs of the stochastic gradients, i.e., votes. At the ES, the signs of the real and imaginary components of the superposed received symbols on each subcarrier are calculated to obtain the MV for the sign of each gradient. However, the EDs still need the CSI for TCI as in BAA for AirComp. In [012] and [013], blind EDs are considered. However, it is assumed that the CSI for each ED is available at the ES. The impact of the channel on AirComp is mitigated through beamforming with a large number of antennas.
In this study, we investigate an AirComp method based on non-coherent detection to achieve FEEL without using CSI at the EDs and the ES. Inspired by the MV with signSGD[011], we use orthogonal resources, i.e., multiple subcarriers and/or OFDM symbols, to transmit the signs of local stochastic gradients. Hence, the votes from different EDs accumulate on the orthogonal resources non-coherently in fading channel with the proposed scheme. The ES then obtains the MV with an energy detector. Considering the randomness in the detected MVs due to the fading channel, path loss, and power control in the cell, we prove the convergence of learning in the presence of the proposed scheme for a non-convex loss function. We demonstrate that the proposed approach is robust against time-synchronization errors and power misalignment at the ES. We also show that it can be used with well-known peak-to-mean envelope power ratio (PMEPR) reduction techniques as it does not utilize the amplitude and the phase to encode the sign of local stochastic gradients. Finally, we evaluate the scheme by considering independent and identically distributed (IID) data and non-IID data where the data distribution is a function of the locations of EDs.
Notation: The complex and real numbers are denoted by $\mathbb{C}$ and $\mathbb{R}$, respectively. $\mathbb{E}[\cdot]$ is the expectation of its argument. $\mathbb{1}[\cdot]$ is the indicator function and $\mathbb{P}[\cdot]$ is the probability of its argument. The sign function is denoted by sign(⋅) and results in 1, −1, or ±1 at random for a positive, a negative, or a zero-valued argument, respectively.
Consider a wireless network with K EDs that are connected to an ES, where each ED and the ES are equipped with single antennas. We assume that the frequency synchronization in the network is done before the transmissions with a control mechanism as done in 3GPP Fourth Generation (4G) Long Term Evolution (LTE) and/or Fifth Generation (5G) New Radio (NR) with random-access channel (RACH) and/or physical uplink control channel (PUCCH)[014]. In this study, we consider the fact that the time synchronization among the EDs is not ideal: the maximum difference between the times of arrival of the EDs' signals at the ES location is $T_{\rm sync}$ seconds, which is equal to the reciprocal of the signal bandwidth.
In this study, the power alignment at the ES can be imperfect and the level of misalignment is controlled with a power control mechanism. We assume that the signal-to-noise ratio (SNR) of an ED at the ES is $1/\sigma_n^2$ at the reference distance $R_{\rm ref}$. We then set the received signal power of the $k$th ED at the ES as
$$P_k = \left(\frac{r_k}{R_{\rm ref}}\right)^{-(\alpha-\beta)} \qquad (1)$$
where $r_k$ is the link distance between the $k$th ED and the ES, $\alpha$ is the path loss exponent, and $\beta\in[0,\alpha]$ is a coefficient that determines the amount of the path loss compensated. While $\beta=0$ means that there is no power control in the network, $\beta=\alpha$ leads to a system with perfect power alignment at the ES. We define the effective path loss exponent $\alpha_{\rm eff}$ as $\alpha_{\rm eff}\triangleq\alpha-\beta$.
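As a minimal sketch of the power-control model, the following Python snippet evaluates the received power under the assumption that it scales as $(r_k/R_{\rm ref})^{-\alpha_{\rm eff}}$ with unit power at the reference distance. The function name `received_power` and its argument names are illustrative and are not part of the original description.

```python
def received_power(r_k, r_ref, alpha, beta):
    """Received power of an ED at distance r_k, normalized to 1 at r_ref.

    Assumes P_k = (r_k / r_ref) ** -(alpha - beta) with 0 <= beta <= alpha.
    """
    if not 0 <= beta <= alpha:
        raise ValueError("beta must lie in [0, alpha]")
    alpha_eff = alpha - beta  # effective path loss exponent
    return (r_k / r_ref) ** (-alpha_eff)


# beta = alpha removes the distance dependence (perfect power alignment),
# while beta < alpha leaves a residual path loss.
aligned = received_power(100.0, 10.0, alpha=4.0, beta=4.0)
partial = received_power(100.0, 10.0, alpha=4.0, beta=2.0)
```

With these illustrative values, an ED at ten times the reference distance is received at the reference power for $\beta=\alpha$, and 20 dB below it for $\beta=2$.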
In this study, we assume that the EDs are deployed in a cell, where the cell radius is Rmax meters and the minimum distance between the ES and the EDs is Rmin meters for Rmin≥Rref. It is worth emphasizing that we do not consider the impact of multiple cells (e.g., inter-cell interference) or a more complicated large-scale channel model (e.g., shadowing) on learning in this work as our goal is to provide insights into the impact of power misalignment and the path loss on distributed learning with a tractable analysis.
In this study, for AirComp, the EDs access the wireless channel on the same time-frequency resources simultaneously with $S$ OFDM symbols consisting of $M$ active subcarriers. We assume that the cyclic prefix (CP) duration is larger than $T_{\rm sync}$ and the maximum-excess delays of the channels between the ES and the EDs. Considering independent frequency-selective channels between the EDs and the ES, the superposed symbol on the $l$th subcarrier of the $m$th OFDM symbol at the ES for the $n$th communication round of FEEL can be written as
$$r_{l,m}^{(n)} = \sum_{k=1}^{K}\sqrt{P_k}\,h_{k,l,m}^{(n)}\,t_{k,l,m}^{(n)} + n_{l,m}^{(n)} \qquad (2)$$
where $P_k$ is the received signal power of the $k$th ED at the ES, $h_{k,l,m}^{(n)}\in\mathbb{C}$ is the channel coefficient between the ES and the $k$th ED, $t_{k,l,m}^{(n)}\in\mathbb{C}$ is the transmitted symbol from the $k$th ED, and $n_{l,m}^{(n)}$ is the symmetric additive white Gaussian noise (AWGN) with zero mean and variance $\sigma_n^2$ on the $l$th subcarrier for $l\in\{0,1,\ldots,M-1\}$ and $m\in\{0,1,\ldots,S-1\}$.
We consider the fact that the time synchronization at the receiver may not be precise. To model this, we assume that the synchronization point where the discrete Fourier transform (DFT) starts can deviate by Nerr samples within the CP window. Note that the uncertainty of the synchronization point within the CP window is often not an issue for traditional communications due to the channel estimation. However, it can cause a non-negligible impact on AirComp.
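Let $x(t_{\rm time})\in\mathbb{C}$ be a baseband OFDM symbol in continuous time for $t_{\rm time}\in[0,T_s)$, where $T_s$ is the OFDM symbol duration. We define the PMEPR of an OFDM symbol as
$$\mathrm{PMEPR} \triangleq \frac{\max_{t_{\rm time}\in[0,T_s)}|x(t_{\rm time})|^2}{P_{\rm tx}}$$
where $P_{\rm tx}=\mathbb{E}[|x(t_{\rm time})|^2]$ is the mean-envelope power.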
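The PMEPR definition can be approximated numerically. The sketch below is only an illustration under stated assumptions: it evaluates an oversampled inverse DFT in plain Python, and it replaces the expectation in $P_{\rm tx}$ with the average power of the single symbol under test. The helper name `pmepr` and the oversampling factor are hypothetical choices.

```python
import cmath


def pmepr(freq_symbols, oversampling=4):
    """Approximate the PMEPR of one OFDM symbol from its subcarrier symbols.

    The continuous-time waveform is sampled at `oversampling` times the
    number of subcarriers; the mean power is taken over this one symbol.
    """
    num_subcarriers = len(freq_symbols)
    num_samples = num_subcarriers * oversampling
    power = []
    for t in range(num_samples):
        x_t = sum(
            sym * cmath.exp(2j * cmath.pi * l * t / num_samples)
            for l, sym in enumerate(freq_symbols)
        )
        power.append(abs(x_t) ** 2)
    return max(power) / (sum(power) / num_samples)


# A single active subcarrier has a constant envelope, while identical
# symbols on all subcarriers add up coherently at t = 0.
single_tone = pmepr([1, 0, 0, 0, 0, 0, 0, 0])
all_equal = pmepr([1] * 8)
```

The second case illustrates why correlated symbols in the frequency domain are undesirable: eight identical subcarrier symbols give a PMEPR equal to the number of subcarriers.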
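Let $\mathcal{D}_k$ denote the local data containing labeled data samples at the $k$th ED as $\{(\mathbf{x}_\ell,y_\ell)\}\in\mathcal{D}_k$ for $k=1,\ldots,K$, where $\mathbf{x}_\ell$ and $y_\ell$ are the $\ell$th data sample and its associated label, respectively. The centralized learning problem can be expressed as
$$\mathbf{w}^{*} = \arg\min_{\mathbf{w}} F(\mathbf{w}) = \arg\min_{\mathbf{w}} \frac{1}{|\mathcal{D}|}\sum_{\forall(\mathbf{x},y)\in\mathcal{D}} f(\mathbf{w},\mathbf{x},y) \qquad (3)$$
where $\mathcal{D}=\mathcal{D}_1\cup\mathcal{D}_2\cup\cdots\cup\mathcal{D}_K$ and $f(\mathbf{w},\mathbf{x},y)$ is the sample loss function that measures the labeling error for $(\mathbf{x},y)$ for the parameters $\mathbf{w}=[w_1,\ldots,w_q]^{\rm T}\in\mathbb{R}^q$, and $q$ is the number of parameters. With full-batch gradient descent, a local optimum point can be obtained as
$$\mathbf{w}^{(n+1)} = \mathbf{w}^{(n)} - \eta\,\mathbf{g}^{(n)} \qquad (4)$$
where $\eta$ is the learning rate and
$$\mathbf{g}^{(n)} = \nabla F(\mathbf{w}^{(n)}) = \frac{1}{|\mathcal{D}|}\sum_{\forall(\mathbf{x},y)\in\mathcal{D}} \nabla f(\mathbf{w}^{(n)},\mathbf{x},y) \qquad (5)$$
where the $i$th element of the vector $\mathbf{g}^{(n)}$ is the gradient of $F(\mathbf{w}^{(n)})$ with respect to $w_i^{(n)}$.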
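In [011], in the context of parallel processing, distributed training by MV with signSGD is investigated to solve (3). In this method, for the $n$th communication round, the $k$th ED¹ first calculates the local stochastic gradient as
$$\tilde{\mathbf{g}}_k^{(n)} = \nabla F_k(\mathbf{w}^{(n)}) = \frac{1}{n_b}\sum_{\forall(\mathbf{x},y)\in\tilde{\mathcal{D}}_k} \nabla f(\mathbf{w}^{(n)},\mathbf{x},y) \qquad (6)$$
where $\tilde{\mathcal{D}}_k\subset\mathcal{D}_k$ is the selected data batch from the local data set and $n_b=|\tilde{\mathcal{D}}_k|$ is the batch size. Instead of the actual values of the local gradients, the EDs then send the signs of their local stochastic gradients, denoted as $\bar{\mathbf{g}}_k^{(n)}$ for $k=1,\ldots,K$, to the ES, where the $i$th element of the vector $\bar{\mathbf{g}}_k^{(n)}$ is $\bar{g}_{k,i}^{(n)}\triangleq\mathrm{sign}(\tilde{g}_{k,i}^{(n)})$. The ES obtains the MV for the $i$th gradient as
$$v_i^{(n)} \triangleq \mathrm{sign}\left(\sum_{k=1}^{K}\bar{g}_{k,i}^{(n)}\right) \qquad (7)$$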
Subsequently, the ES pushes $\mathbf{v}^{(n)}=[v_1^{(n)},\ldots,v_q^{(n)}]^{\rm T}$ to the EDs and the models at the EDs are updated as
$$\mathbf{w}^{(n+1)} = \mathbf{w}^{(n)} - \eta\,\mathbf{v}^{(n)} \qquad (8)$$
This procedure is repeated consecutively until a predetermined convergence criterion is achieved.
¹ We refer to the workers and parameter-server mentioned in [011] as EDs and ES, respectively, to describe distributed training by MV with signSGD.
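The procedure above can be sketched in a few lines of Python. This is a minimal, noise-free illustration of one round of distributed training by MV with signSGD; the function names (`sign`, `majority_vote`, `update`) and the toy gradients are assumptions made for the example.

```python
import random


def sign(x, rng=random):
    """Return 1 or -1; a zero-valued argument is resolved at random."""
    if x > 0:
        return 1
    if x < 0:
        return -1
    return rng.choice((1, -1))


def majority_vote(local_grads, rng=random):
    """Element-wise MV over the signs of the local stochastic gradients."""
    num_params = len(local_grads[0])
    return [
        sign(sum(sign(grad[i], rng) for grad in local_grads), rng)
        for i in range(num_params)
    ]


def update(weights, votes, learning_rate):
    """Apply w <- w - eta * v with the MV vector v."""
    return [w_i - learning_rate * v_i for w_i, v_i in zip(weights, votes)]


# Three EDs, two parameters: two of three EDs vote +1 for the first
# parameter, and all three vote -1 for the second.
grads = [[0.5, -1.0], [0.2, -0.3], [-0.1, -2.0]]
votes = majority_vote(grads)
new_weights = update([0.0, 0.0], votes, learning_rate=0.1)
```

Only the signs are exchanged, so the magnitude of each step is set entirely by the learning rate.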
For FEEL, the optimization problem can also be expressed as (3) in a scenario where the local data samples and their labels are not available at the ES and the link between an ED and the ES experiences an independent frequency-selective fading channel. To solve (3) under these constraints, in this study, we adopt the same procedure summarized for the distributed training by the MV. With the motivations of eliminating the latency caused by orthogonal multiple access and enabling distributed training in mobile wireless networks, we propose a simple-but-effective AirComp scheme to detect the MV in fading channels without using CSI at the EDs and the ES.
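With the proposed AirComp scheme, the EDs perform a low-complexity operation to transmit the signs of the gradients given in (6): Let $f$ be a bijective function that maps $i\in\{1,2,\ldots,q\}$ to the distinct pairs $(m^+,l^+)$ and $(m^-,l^-)$ for $m^+,m^-\in\{0,1,\ldots,S-1\}$ and $l^+,l^-\in\{0,1,\ldots,M-1\}$. Based on the value of the sign $\bar{g}_{k,i}^{(n)}=\mathrm{sign}(\tilde{g}_{k,i}^{(n)})$, the $k$th ED calculates the symbols $t_{k,l^+,m^+}^{(n)}$ and $t_{k,l^-,m^-}^{(n)}$ as
$$t_{k,l^+,m^+}^{(n)} = \begin{cases}\sqrt{E_s}\,s_{k,i}^{(n)}, & \bar{g}_{k,i}^{(n)}=1\\ 0, & \bar{g}_{k,i}^{(n)}=-1\end{cases} \qquad (9)$$
and
$$t_{k,l^-,m^-}^{(n)} = \begin{cases}0, & \bar{g}_{k,i}^{(n)}=1\\ \sqrt{E_s}\,s_{k,i}^{(n)}, & \bar{g}_{k,i}^{(n)}=-1\end{cases} \qquad (10)$$
respectively, where $E_s=2$ is a factor to normalize the symbol energy and $s_{k,i}^{(n)}$ is a randomization symbol on the unit circle. Therefore, to indicate the sign of a local stochastic gradient, our scheme dedicates two subcarriers with (9) and (10), as opposed to modulating the phase of a subcarrier as done in OBDA. Also, we do not use TCI to compensate for the impact of the multipath channel on the transmitted symbols, as our goal is to exploit the energy accumulation on two different subcarriers to detect the MV with a non-coherent detector.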
As a special case of $f$, if $m^-=m^+$ and $l^-=l^++1$ hold for all $i$, the adjacent subcarriers of the $m^+$th OFDM symbol form the options for a vote, which corresponds to frequency-shift keying (FSK) over OFDM subcarriers. In this case, the $k$th ED's vote for the $i$th gradient becomes independent from its choice since the adjacent subcarriers are likely to experience similar channel conditions, i.e., $h_{k,l^+,m^+}^{(n)}\approx h_{k,l^-,m^-}^{(n)}$.
After the calculations of $t_{k,l^+,m^+}^{(n)}$ and $t_{k,l^-,m^-}^{(n)}$ for all $i$, the $k$th ED forms the corresponding OFDM symbols and transmits them.
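The FSK special case can be illustrated with a short Python sketch. It assumes a 0-based gradient index (the text uses $i\in\{1,\ldots,q\}$), a mapping that fills each OFDM symbol with adjacent subcarrier pairs in order, and a random QPSK randomization symbol. The names `fsk_resources` and `vote_symbols` are hypothetical, and the specific ordering of the mapping is one possible choice rather than the one prescribed by the scheme.

```python
import cmath
import math
import random

E_S = 2.0  # symbol-energy normalization factor from the text


def fsk_resources(i, num_subcarriers):
    """Map a 0-based gradient index to ((m+, l+), (m-, l-)).

    Both options share one OFDM symbol and sit on adjacent subcarriers.
    """
    pairs_per_symbol = num_subcarriers // 2
    m = i // pairs_per_symbol
    l_plus = 2 * (i % pairs_per_symbol)
    return (m, l_plus), (m, l_plus + 1)


def vote_symbols(vote, rng=random):
    """Return (t_plus, t_minus) for a vote in {+1, -1}.

    Only one of the two resources is activated; the active symbol is a
    random QPSK point scaled by sqrt(E_S).
    """
    phase = math.pi / 4 + rng.randrange(4) * math.pi / 2
    active = math.sqrt(E_S) * cmath.exp(1j * phase)
    return (active, 0j) if vote == 1 else (0j, active)


first = fsk_resources(0, 1200)
last = fsk_resources(123089, 1200)  # last of q = 123090 gradients
t_plus, t_minus = vote_symbols(1, random.Random(0))
```

With $M=1200$ subcarriers, each OFDM symbol carries 600 votes under this mapping, so the last of 123090 gradients lands in the OFDM symbol with index 205.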
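The receiver at the ES observes the superposed symbols at all subcarriers as expressed in (2). By using the mapping function $f$, the superposed symbols for a given $i$ can be shown as
$$r_{l^+,m^+}^{(n)} = \sum_{\forall k:\,\bar{g}_{k,i}^{(n)}=1}\sqrt{E_sP_k}\,h_{k,l^+,m^+}^{(n)}\,s_{k,i}^{(n)} + n_{l^+,m^+}^{(n)} \qquad (11)$$
and
$$r_{l^-,m^-}^{(n)} = \sum_{\forall k:\,\bar{g}_{k,i}^{(n)}=-1}\sqrt{E_sP_k}\,h_{k,l^-,m^-}^{(n)}\,s_{k,i}^{(n)} + n_{l^-,m^-}^{(n)} \qquad (12)$$
respectively. The receiver at the ES detects the MV for the $i$th gradient with an energy detector as
$$v_i^{(n)} = \mathrm{sign}\left(\Delta_i^{(n)}\right) \qquad (13)$$
where $\Delta_i^{(n)}\triangleq e_i^+-e_i^-$ for $e_i^+\triangleq|r_{l^+,m^+}^{(n)}|^2$ and $e_i^-\triangleq|r_{l^-,m^-}^{(n)}|^2$.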
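A small Monte Carlo sketch can make the non-coherent detection concrete. It assumes flat Rayleigh fading per resource (zero-mean, unit-variance complex Gaussian coefficients), complex Gaussian noise, and that the unit-modulus randomization symbol can be absorbed into the channel coefficient. The function names and the chosen parameters are illustrative only.

```python
import math
import random


def complex_gaussian(rng, variance=1.0):
    """Draw a zero-mean circularly symmetric complex Gaussian sample."""
    std = math.sqrt(variance / 2.0)
    return complex(rng.gauss(0.0, std), rng.gauss(0.0, std))


def detect_majority_vote(votes, powers, noise_var, rng, e_s=2.0):
    """Energy-detect the MV for one gradient from its two resources."""
    r_plus = complex_gaussian(rng, noise_var)   # noise on the "+1" resource
    r_minus = complex_gaussian(rng, noise_var)  # noise on the "-1" resource
    for vote, p_k in zip(votes, powers):
        # The randomization symbol has unit modulus, so multiplying it with
        # the fading coefficient leaves the distribution unchanged.
        faded = math.sqrt(e_s * p_k) * complex_gaussian(rng)
        if vote == 1:
            r_plus += faded
        else:
            r_minus += faded
    delta = abs(r_plus) ** 2 - abs(r_minus) ** 2
    # delta == 0 has probability zero here, so ties are not randomized.
    return 1 if delta > 0 else -1


rng = random.Random(1)
votes = [1] * 40 + [-1] * 10      # 40 of 50 EDs vote for +1
powers = [1.0] * 50               # perfect power alignment
trials = 2000
hits = sum(detect_majority_vote(votes, powers, 0.01, rng) == 1
           for _ in range(trials))
detection_rate = hits / trials
```

The detector finds the true majority only probabilistically: with 40 of 50 EDs voting for $+1$ and aligned powers, the rate of correct detection is close to the ratio of the mean energies, about 0.8 in this setting.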
The proposed scheme leads to a fundamentally different training strategy since it determines the correct MV in (7) probabilistically by comparing $e_i^+$ and $e_i^-$. To elaborate on this, assume that the multipath channels between the ES and the EDs are independent. Let $K_i^+$ and $K_i^-=K-K_i^+$ be the number of EDs that vote for $1$ and $-1$ for the $i$th gradient, respectively.
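Lemma 1. $\mathbb{E}[e_i^+]$ and $\mathbb{E}[e_i^-]$ can be calculated as
$$\mu_i^+ \triangleq \mathbb{E}[e_i^+] = E_sK_i^+\lambda+\sigma_n^2 \qquad (14)$$
and
$$\mu_i^- \triangleq \mathbb{E}[e_i^-] = E_sK_i^-\lambda+\sigma_n^2 \qquad (15)$$
respectively, where, for EDs that are uniformly distributed in the cell,
$$\lambda \triangleq \begin{cases}\dfrac{2R_{\rm ref}^{2}\ln\left(R_{\max}/R_{\min}\right)}{R_{\max}^2-R_{\min}^2}, & \alpha_{\rm eff}=2\\[2ex] \dfrac{2R_{\rm ref}^{\alpha_{\rm eff}}\left(R_{\max}^{2-\alpha_{\rm eff}}-R_{\min}^{2-\alpha_{\rm eff}}\right)}{(2-\alpha_{\rm eff})\left(R_{\max}^2-R_{\min}^2\right)}, & \text{otherwise}\end{cases} \qquad (16)$$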
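Proof: Since (11) is a weighted summation of independent complex Gaussian random variables with zero mean and unit variance (i.e., channel coefficients), $r_{l^+,m^+}^{(n)}$ is a zero-mean complex Gaussian random variable for given link distances, and the mean of $e_i^+$ is the sum of the variances of its terms. Averaging over the link distances of the $K_i^+$ EDs that vote for $1$ gives
$$\mu_i^+ = E_sK_i^+R_{\rm ref}^{\alpha_{\rm eff}}\,\mathbb{E}\left[r^{-\alpha_{\rm eff}}\right]+\sigma_n^2 \qquad (17)$$
To calculate (17), we need to calculate the expected value of $y=r^{-\alpha_{\rm eff}}$. Assuming that the EDs are uniformly distributed in the cell, the link distance distribution is
$$f(r) = \frac{2r}{R_{\max}^2-R_{\min}^2},\quad R_{\min}\le r\le R_{\max} \qquad (18)$$
Hence, for $\alpha_{\rm eff}>0$, the distribution of $y$ can be obtained as
$$f(y) = \frac{2\,y^{-2/\alpha_{\rm eff}-1}}{\alpha_{\rm eff}\left(R_{\max}^2-R_{\min}^2\right)},\quad R_{\max}^{-\alpha_{\rm eff}}\le y\le R_{\min}^{-\alpha_{\rm eff}} \qquad (19)$$
By using (19), the expected value of $y$ can be calculated, which yields (16). The same analysis can be done for $\mu_i^-$.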
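The parameter $\lambda$ can be checked numerically. The sketch below assumes that the EDs are uniformly distributed over the annulus between $R_{\min}$ and $R_{\max}$ and that the received power scales as $(r/R_{\rm ref})^{-\alpha_{\rm eff}}$; under these assumptions $\lambda$ is the mean received power, and a closed form follows by direct integration. The function names are hypothetical.

```python
import math
import random


def path_loss_factor(r_min, r_max, r_ref, alpha_eff):
    """Closed-form mean of (r / r_ref) ** -alpha_eff for uniform EDs."""
    area_term = r_max ** 2 - r_min ** 2
    if math.isclose(alpha_eff, 2.0):
        return 2.0 * r_ref ** 2 * math.log(r_max / r_min) / area_term
    exponent = 2.0 - alpha_eff
    numerator = 2.0 * r_ref ** alpha_eff * (r_max ** exponent - r_min ** exponent)
    return numerator / (exponent * area_term)


def monte_carlo_factor(r_min, r_max, r_ref, alpha_eff, trials, rng):
    """Estimate the same mean by sampling link distances."""
    total = 0.0
    for _ in range(trials):
        # Inverse-CDF sampling of the distance for a uniform deployment.
        r = math.sqrt(r_min ** 2 + rng.random() * (r_max ** 2 - r_min ** 2))
        total += (r / r_ref) ** (-alpha_eff)
    return total / trials


lam_perfect = path_loss_factor(10.0, 100.0, 10.0, 0.0)   # beta = alpha
lam_partial = path_loss_factor(10.0, 100.0, 10.0, 2.0)   # alpha_eff = 2
lam_estimate = monte_carlo_factor(10.0, 100.0, 10.0, 2.0, 20000,
                                  random.Random(0))
```

For a cell with $R_{\min}=R_{\rm ref}=10$ m and $R_{\max}=100$ m, perfect power control gives $\lambda=1$, whereas $\alpha_{\rm eff}=2$ gives $\lambda\approx0.047$, which shows how strongly residual path loss reduces the accumulated energy.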
Based on Lemma 1, (13) is likely to obtain the correct MV because $\mu_i^+$ and $\mu_i^-$ are linear functions of $K_i^+$ and $K_i^-$, respectively. However, the detection performance depends on the parameter $\lambda\in[0,1]$ that captures the impacts of power control, path loss, and cell size on $e_i^+$ and $e_i^-$.
We consider several standard assumptions made in the literature for the convergence analysis[10], [11]:
Assumption 1 (Bounded loss function). F(w)≥F*, ∀w.
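Assumption 2 (Smoothness). Let $\mathbf{g}$ be the gradient of $F(\mathbf{w})$ evaluated at $\mathbf{w}$. For all $\mathbf{w}$ and $\mathbf{w}'$, the expression given by
$$\left|F(\mathbf{w}')-\left(F(\mathbf{w})+\mathbf{g}^{\rm T}(\mathbf{w}'-\mathbf{w})\right)\right| \le \frac{1}{2}\sum_{i=1}^{q}L_i\left(w_i'-w_i\right)^2$$
holds for a non-negative constant vector $\mathbf{L}=[L_1,\ldots,L_q]^{\rm T}$.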
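Assumption 3 (Variance bound). The stochastic gradient estimates $\{\tilde{\mathbf{g}}_k=[\tilde{g}_{k,1},\ldots,\tilde{g}_{k,q}]^{\rm T}=\nabla F_k(\mathbf{w}^{(n)})\}$, $\forall k$, are independent and unbiased estimates of $\mathbf{g}=[g_1,\ldots,g_q]^{\rm T}=\nabla F(\mathbf{w})$ with a coordinate bounded variance, i.e.,
$$\mathbb{E}[\tilde{\mathbf{g}}_k]=\mathbf{g},\ \forall k \qquad (20)$$
$$\mathbb{E}\left[(\tilde{g}_{k,i}-g_i)^2\right]\le\sigma_i^2/n_b,\ \forall k,i \qquad (21)$$
where $\boldsymbol{\sigma}=[\sigma_1,\ldots,\sigma_q]^{\rm T}$ is a non-negative constant vector.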
Assumption 4 (Unimodal, symmetric gradient noise). For any given $\mathbf{w}$, the elements of the vector $\tilde{\mathbf{g}}_k$, $\forall k$, have a unimodal distribution that is also symmetric around its mean.
We also assume that $e_i^+$ and $e_i^-$ are exponential random variables with means $\mu_i^+$ and $\mu_i^-$, respectively. This assumption holds true when the power control is ideal under IID Rayleigh fading. It is a weak assumption under imperfect power control due to the central limit theorem.
By extending our theorem in [015] with the considerations of path loss, power control, and cell size, the convergence rate in the presence of FSK-MV can be obtained as follows:
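Theorem 1. For $n_b=N/\gamma$ and $\eta=1/\sqrt{\|\mathbf{L}\|_1n_b}$, the convergence rate of the distributed training by the MV based on FSK in fading channel is
$$\mathbb{E}\left[\frac{1}{N}\sum_{n=0}^{N-1}\|\mathbf{g}^{(n)}\|_1\right] \le \frac{1}{\sqrt{N}}\left(a\sqrt{\|\mathbf{L}\|_1}\left(F(\mathbf{w}^{(0)})-F^*+\frac{\gamma}{2}\right)+\frac{2\sqrt{2}}{3}\sqrt{\gamma}\,\|\boldsymbol{\sigma}\|_1\right) \qquad (22)$$
where $\gamma$ is a positive integer,
$$a \triangleq 1+\frac{2}{\xi K\lambda},\qquad \xi\triangleq\frac{E_s}{\sigma_n^2},$$
and $\lambda\in[0,1]$ given in (16) is a parameter that captures the parameters related to the path loss, power control, and cell size.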
The proof of Theorem 1 is given in the appendix.
Based on Theorem 1, we can infer the following: 1) For a larger SNR (i.e., a larger $1/\sigma_n^2$) and a larger number of EDs (i.e., a larger $K$), the convergence rate with FSK-MV in fading channel improves since $a$ decreases. 2) The power control results in a better convergence rate since $\lambda$ increases with a lower $\alpha_{\rm eff}$. 3) Another way of improving the convergence rate is to reduce the cell size, yielding a larger $\lambda$.
1) Robustness against Time-Varying Fading Channel: As opposed to the approaches in [007] and [010], the proposed scheme does not utilize the CSI for TCI at the EDs. Hence, it is compatible with time-varying channels (e.g., mobile networks[016]) and does not lose gradient information due to TCI. As a trade-off, it quadruples the number of time-frequency resources for AirComp as compared to OBDA in [010]. As compared to the approaches in [012] and [013], the proposed scheme also does not require CSI at the ES or multiple antennas.
2) Robustness against Time-Synchronization Errors: As demonstrated in Section IV, the proposed scheme provides immunity against time-synchronization errors. This is because the timing misalignment among the EDs or the uncertainty on the receiver synchronization within the CP window causes phase rotations in the frequency domain, and FSK-MV does not encode information on the amplitude or phase. Also, the proposed scheme does not use any channel-related information at the EDs and the ES. Hence, FSK-MV is more robust against time-synchronization errors as compared to OBDA.
3) Robustness against Power-Amplifier Non-linearity: The proposed scheme separates the options for voting over two different resources identified in time and frequency. Hence, it allows one to choose $s_{k,i}^{(n)}$ based on specific purposes. In this study, we use random QPSK symbols to reduce the PMEPR by decreasing the correlation in the frequency domain[017]. OBDA is not investigated in terms of PMEPR in the literature. As shown in Section IV, OBDA can suffer from high PMEPR, while the proposed scheme reduces the PMEPR with a simple randomization technique. Also, FSK-MV does not require a long-term transmission power constraint as introduced for OBDA[010, Eq. 9 and Eq. 10] since the 2-norm of the OFDM symbols does not change as a function of CSI with FSK-MV.
For the numerical results, we consider the learning task of handwritten-digit recognition in a single cell with $K=50$ EDs for $R_{\min}=10$ meters and $R_{\max}=100$ meters. We assume that the path loss exponent is $\alpha=4$. To demonstrate the impact of imperfect power control on distributed learning, we choose $\beta\in\{2,4\}$ and set the SNR, i.e., $1/\sigma_n^2$, to 20 dB at $R_{\rm ref}=10$ meters. The link distance between the $k$th ED and the ES is set to $r_k=\sqrt{R_{\min}^2+(k-1)(R_{\max}^2-R_{\min}^2)/(K-1)}$ based on (18). For the fading channel, we consider ITU Extended Pedestrian A (EPA) with no mobility and regenerate the channels between the ES and the EDs independently for each communication round to capture the long-term channel variations. The subcarrier spacing is set to 15 kHz. We use $M=1200$ subcarriers (i.e., the signal bandwidth is 18 MHz). In the case of imperfect time synchronization, we assume that the maximum difference between the times of arrival of the ED signals is $T_{\rm sync}=55.6$ ns and the synchronization uncertainty at the ES is $N_{\rm err}=3$ samples. Otherwise, these parameters are set to 0.
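The deployment and timing parameters above can be reproduced with a short Python check. The snippet uses only values stated in the text; the function name `link_distance` and the constant names are illustrative.

```python
import math


def link_distance(k, num_eds, r_min, r_max):
    """Link distance of the k-th ED (1-based) under the stated rule."""
    step = (r_max ** 2 - r_min ** 2) / (num_eds - 1)
    return math.sqrt(r_min ** 2 + (k - 1) * step)


SUBCARRIERS = 1200
SPACING_HZ = 15e3

bandwidth_hz = SUBCARRIERS * SPACING_HZ   # occupied signal bandwidth
t_sync_ns = 1e9 / bandwidth_hz            # reciprocal of the bandwidth

nearest = link_distance(1, 50, 10.0, 100.0)
farthest = link_distance(50, 50, 10.0, 100.0)
```

The squared distances are equally spaced, so the nearest and farthest EDs sit at $R_{\min}$ and $R_{\max}$, and the reciprocal of the 18 MHz bandwidth rounds to the stated 55.6 ns.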
For the local data at the EDs, we use the MNIST database that contains labeled handwritten-digit images of size 28×28 from digit 0 to digit 9. We consider both IID data and non-IID data in the cell. To prepare the data, we first choose $|\mathcal{D}|=25000$ training images from the database, where each digit has 2500 distinct images. For the scenario with the IID data, we assume that each ED has 50 distinct images for each digit. For the scenario with the non-IID data, we assume that the distribution of the images depends on the locations of the EDs to test the FEEL in a more challenging scenario. To this end, we divide the cell into 5 areas with concentric circles, and the EDs located in the $u$th area have the data samples with the labels $\{u-1,u,1+u,2+u,3+u,4+u\}$ for $u\in\{1,\ldots,5\}$. Hence, the availability of the labels gradually changes based on the link distance. The areas between two adjacent concentric circles are identical and the number of EDs in each area is 10. The IID and non-IID data distributions are illustrated in the accompanying figure.
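The location-dependent label assignment can be sketched as follows. The snippet assumes that the EDs are indexed by increasing link distance, so that each consecutive block of 10 EDs falls into one area; the function names `area_labels` and `area_of_ed` are hypothetical.

```python
def area_labels(u):
    """Labels available in the u-th area, for u in 1..5."""
    return list(range(u - 1, u + 5))


def area_of_ed(k, num_eds=50, num_areas=5):
    """Area index (1-based) of the k-th ED, with EDs sorted by distance."""
    return (k - 1) * num_areas // num_eds + 1


# Labels seen by the nearest and the farthest EDs in the non-IID scenario.
inner_labels = area_labels(area_of_ed(1))
outer_labels = area_labels(area_of_ed(50))
```

Each area holds six of the ten digits, and only digits 4 and 5 are available in every area, which is what makes the non-IID scenario challenging when the power control is imperfect.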
For the model, we consider a convolutional neural network (CNN) that includes one 5×5 and two 3×3 convolutional layers, where each of them is followed by a batch normalization layer and a rectified-linear unit (ReLU) activation. All convolutional layers have 20 filters. After the third ReLU, a fully connected layer with 10 units and a softmax layer are utilized. At the input layer, no normalization is applied. Our model, outlined in Table I, has $q=123090$ learnable parameters, which corresponds to $S=206$ and $S=52$ OFDM symbols for the FSK-MV and OBDA[10], respectively. For TCI, the truncation threshold is 0.2 and we assume that CSI is available at the EDs. For the update rule, the learning rate is set to 0.01. The batch size $n_b$ is set to 64. For the test accuracy calculations, we use 10000 test samples available in the MNIST database.
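The stated numbers of OFDM symbols follow from simple counting, as the sketch below shows. It assumes that FSK-MV uses two subcarriers per gradient and that OBDA carries two gradients per QAM symbol (one on the real part and one on the imaginary part); the function names are illustrative.

```python
import math


def ofdm_symbols_fsk_mv(num_params, num_subcarriers):
    """Two subcarriers are dedicated to every gradient."""
    return math.ceil(2 * num_params / num_subcarriers)


def ofdm_symbols_obda(num_params, num_subcarriers):
    """One QAM symbol per subcarrier carries two gradient signs."""
    return math.ceil(num_params / (2 * num_subcarriers))


s_fsk_mv = ofdm_symbols_fsk_mv(123090, 1200)
s_obda = ofdm_symbols_obda(123090, 1200)
```

The ratio between the two counts reflects the roughly fourfold resource cost of avoiding CSI with FSK-MV.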
Although the test accuracy with OBDA with TCI (with ideal synchronization) or FSK-MV (with/without ideal synchronization) reaches to 95%,
In this study, we propose an effective AirComp scheme for FEEL. The proposed scheme relies on the distributed learning by the MV with the signSGD in fading channel. As compared to the state-of-the-art solutions on AirComp, it uses different subcarriers and/or OFDM symbols to indicate the sign of the local stochastic gradients. Thus, it allows the receiver at the ES to detect the MV with a non-coherent detector and eliminates the need for CSI at the EDs by exploiting the non-coherent energy accumulation on the subcarriers. We also prove the convergence of the distributed learning by taking path loss, power control, and cell size into account. Through simulations, we demonstrate that the proposed method can provide a high test accuracy in fading channel even when the power control and the time synchronization are imperfect, while resulting in an acceptable PMEPR distribution at the expense of a larger number of time and frequency resources. We also provide insights into the scenarios where the local data distribution depends on the locations of the EDs and demonstrate the impact of non-IID data on the distributed learning when the power control is not ideal. Our results indicate that adaptive learning methods that consider the bias in the MV due to the non-IID data and/or imperfect power control are required for achieving a higher test accuracy.
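Proof: The proof of Theorem 1 relies on a well-known strategy of relating the norm of the gradient of the loss function $F(\mathbf{w})$ to the expected improvement made in a single step, as described in [11]. Let $\mathbf{g}^{(n)}$ be the gradient of $F(\mathbf{w}^{(n)})$ (i.e., the true gradient). By using Assumption 2 and using (13), we can state that
$$\mathbb{E}\left[F(\mathbf{w}^{(n+1)})-F(\mathbf{w}^{(n)})\,\middle|\,\mathbf{w}^{(n)}\right] \le -\eta\|\mathbf{g}^{(n)}\|_1+\frac{\eta^2}{2}\|\mathbf{L}\|_1+\underbrace{2\eta\sum_{i=1}^{q}|g_i^{(n)}|\,\mathbb{P}\left[\mathrm{sign}(\Delta_i^{(n)})\ne\mathrm{sign}(g_i^{(n)})\right]}_{\text{stochasticity-induced error}}$$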
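The main challenge is to obtain an upper bound on the stochasticity-induced error. To address this, assume that $\mathrm{sign}(g_i^{(n)})=1$. Let $Z$ be a random variable for counting the number of EDs with the correct decision, i.e., $\mathrm{sign}(\tilde{g}_{k,i}^{(n)})=1$. The random variable $Z$ can then be modeled as the sum of $K$ independent Bernoulli trials, i.e., a binomial variable with the success and failure probabilities given by
$$p_i \triangleq \mathbb{P}\left[\mathrm{sign}(\tilde{g}_{k,i}^{(n)})=\mathrm{sign}(g_i^{(n)})\right]$$
$$q_i \triangleq \mathbb{P}\left[\mathrm{sign}(\tilde{g}_{k,i}^{(n)})\ne\mathrm{sign}(g_i^{(n)})\right]$$
respectively, for all $k$. This implies that
$$\mathbb{P}\left[\mathrm{sign}(\Delta_i^{(n)})\ne\mathrm{sign}(g_i^{(n)})\right] = \sum_{K_i^+=0}^{K}\mathbb{P}\left[\mathrm{sign}(\Delta_i^{(n)})\ne1\,\middle|\,Z=K_i^+\right]\binom{K}{K_i^+}p_i^{K_i^+}q_i^{K-K_i^+}$$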
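To calculate $\mathbb{P}[\mathrm{sign}(\Delta_i^{(n)})\ne1\,|\,Z=K_i^+]$, we use the distribution of $\Delta_i^{(n)}$, which can be obtained by using the properties of exponential random variables as
$$f\left(\Delta_i^{(n)}\right) = \begin{cases}\dfrac{e^{-\Delta_i^{(n)}/\mu_i^+}}{\mu_i^++\mu_i^-}, & \Delta_i^{(n)}\ge0\\[2ex] \dfrac{e^{\Delta_i^{(n)}/\mu_i^-}}{\mu_i^++\mu_i^-}, & \Delta_i^{(n)}<0\end{cases} \qquad (23)$$
Thus, by integrating (23) with respect to $\Delta_i^{(n)}$ over $\Delta_i^{(n)}<0$,
$$\mathbb{P}\left[\mathrm{sign}(\Delta_i^{(n)})\ne1\,\middle|\,Z=K_i^+\right] = \frac{\mu_i^-}{\mu_i^++\mu_i^-} = \frac{E_s(K-K_i^+)\lambda+\sigma_n^2}{E_sK\lambda+2\sigma_n^2} \qquad (24)$$
Hence, by using (24) and the properties of binomial coefficients (i.e., $\mathbb{E}[Z]=Kp_i$ with $p_i+q_i=1$),
$$\mathbb{P}\left[\mathrm{sign}(\Delta_i^{(n)})\ne\mathrm{sign}(g_i^{(n)})\right] = \frac{E_sK\lambda\,q_i+\sigma_n^2}{E_sK\lambda+2\sigma_n^2}$$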
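Under Assumption 3 and Assumption 4, by using the derivations in [11], it can be shown that
$$q_i \le \frac{\sqrt{2}\,\sigma_i}{3\,|g_i^{(n)}|\sqrt{n_b}}$$
Hence, with $a=1+2\sigma_n^2/(E_sK\lambda)$, an upper bound on the stochasticity-induced error can be obtained as
$$2\eta\sum_{i=1}^{q}|g_i^{(n)}|\,\mathbb{P}\left[\mathrm{sign}(\Delta_i^{(n)})\ne\mathrm{sign}(g_i^{(n)})\right] \le \frac{2\eta}{a}\left(\frac{\sqrt{2}}{3\sqrt{n_b}}\|\boldsymbol{\sigma}\|_1+\frac{\sigma_n^2}{E_sK\lambda}\|\mathbf{g}^{(n)}\|_1\right)$$
Based on Assumption 1,
$$F^*-F(\mathbf{w}^{(0)}) \le \mathbb{E}\left[\sum_{n=0}^{N-1}F(\mathbf{w}^{(n+1)})-F(\mathbf{w}^{(n)})\right] \le \mathbb{E}\left[\sum_{n=0}^{N-1}\left(-\frac{\eta}{a}\|\mathbf{g}^{(n)}\|_1+\frac{\eta^2}{2}\|\mathbf{L}\|_1+\frac{2\sqrt{2}\,\eta}{3a\sqrt{n_b}}\|\boldsymbol{\sigma}\|_1\right)\right] \qquad (26)$$
By rearranging the terms in (26), using the expressions for $n_b$ and $\eta$, and noting that $1/\sqrt{\gamma}\le1$ and $\sqrt{\gamma}\le\gamma$ for a positive integer $\gamma$, (22) is reached.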
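The averaging step over the binomial distribution in the proof can be verified numerically. The sketch below assumes exponential energies with the means of Lemma 1, so that the conditional error probability is $\mu_i^-/(\mu_i^++\mu_i^-)$, and compares the explicit binomial sum against the closed form $(E_sK\lambda q_i+\sigma_n^2)/(E_sK\lambda+2\sigma_n^2)$ that results from it. The function names and the numerical values are illustrative.

```python
from math import comb


def conditional_error(k_plus, k, e_s, lam, noise_var):
    """P[Delta < 0 | K+ EDs vote correctly] for exponential energies."""
    mu_plus = e_s * k_plus * lam + noise_var
    mu_minus = e_s * (k - k_plus) * lam + noise_var
    return mu_minus / (mu_plus + mu_minus)


def vote_error_sum(k, q_i, e_s, lam, noise_var):
    """Average the conditional error over the binomial number of votes."""
    p_i = 1.0 - q_i
    return sum(
        comb(k, k_plus) * p_i ** k_plus * q_i ** (k - k_plus)
        * conditional_error(k_plus, k, e_s, lam, noise_var)
        for k_plus in range(k + 1)
    )


def vote_error_closed_form(k, q_i, e_s, lam, noise_var):
    """Closed form obtained with the mean of the binomial distribution."""
    return (e_s * k * lam * q_i + noise_var) / (e_s * k * lam + 2 * noise_var)


params = (2.0, 0.0465, 0.01)  # E_s, lambda, noise variance (illustrative)
by_sum = vote_error_sum(50, 0.3, *params)
by_formula = vote_error_closed_form(50, 0.3, *params)
```

Because the conditional error is linear in the number of correct votes, the binomial average collapses to the closed form, and the detection error approaches $q_i$ as $K$, the SNR, or $\lambda$ grows.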
While certain embodiments of the disclosed subject matter have been described using specific terms, such description is for illustrative purposes only, and it is to be understood that changes and variations may be made without departing from the spirit or scope of the subject matter. The patentable scope of the presently disclosed subject matter is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they include structural and/or step elements that do not differ from the literal language of the claims, or if they include equivalent structural and/or elements or steps with insubstantial differences from the literal language of the claims.
The present application claims the benefit of priority of U.S. Provisional Patent Application No. 63/192,671, titled Methods for Reliable Over-The-Air Computation and Federated Edge Learning, filed May 25, 2021; and claims the benefit of priority of U.S. Provisional Patent Application No. 63/313,321, titled Methods for Reliable Over-The-Air Computation and Federated Edge Learning, filed Feb. 24, 2022, both of which are fully incorporated herein by reference for all purposes.
Number | Date | Country
---|---|---
63192671 | May 2021 | US
63313321 | Feb 2022 | US