METHODS FOR RELIABLE OVER-THE-AIR COMPUTATION AND FEDERATED EDGE LEARNING

Information

  • Patent Application
  • 20220391696
  • Publication Number
    20220391696
  • Date Filed
    April 25, 2022
  • Date Published
    December 08, 2022
Abstract
The disclosure deals with a system and method for an over-the-air computation (AirComp) scheme for federated edge learning (FEEL) without channel state information (CSI) at the edge devices (EDs) or edge server (ES). The disclosure adopts the majority vote (MV) principle and defines multiple subcarriers and orthogonal frequency division multiplexing (OFDM) symbols for voting options, which reduces to frequency-shift keying (FSK) over OFDM subcarriers as a special case. Thus, FSK-based over-the-air computation is provided for federated edge learning without channel state information. Since the votes from EDs are separated on orthogonal resources, the proposed scheme eliminates the need for truncated-channel inversion (TCI) at the EDs and allows the ES to detect MV with a non-coherent detector. We also mitigate the peak-to-mean envelope power ratio (PMEPR) of the synthesized signals by using randomization symbols. Simulations show the proposed scheme provides high test accuracy in fading channels for both independent and identically distributed (IID) and non-IID data while resulting in OFDM symbols with lower PMEPRs as compared to one-bit broadband digital aggregation (OBDA) with quadrature amplitude modulation (QAM).
Description
BACKGROUND OF THE PRESENTLY DISCLOSED SUBJECT MATTER

Federated edge learning (FEEL) is a distributed learning framework that leverages the computational powers of edge devices (EDs) and uses the local data at the EDs without compromising their privacy to train a model [1], [2]. In FEEL, the initial model parameters are first distributed to many EDs from an edge server (ES). The EDs then share their local updates, e.g., updated model parameters or local gradients, based on local data with the ES. After the local updates are aggregated at the ES, the global updates are distributed back to the EDs for the next iteration. Since a large number of parameters needs to be transmitted from the EDs to the ES for each iteration, the communication aspect of FEEL stands as one of the main bottlenecks. To address this issue, one of the promising solutions is to perform the aggregation with over-the-air computation (AirComp) that harnesses the signal-superposition property of the wireless multiple access channel [3]-[5]. However, developing a broadband AirComp scheme is not trivial due to the multipath channel, and channel state information (CSI) often needs to be available at the EDs or ES. In this disclosure, we address this issue with a novel scheme.


In the literature, several AirComp schemes are investigated for FEEL. In one example, the local model parameters at the EDs are transmitted over orthogonal frequency division multiplexing (OFDM) subcarriers to achieve broadband analog aggregation (BAA) of the model parameters over the air[6]. To overcome the impact of multipath channel on the transmitted signals, the symbols on the OFDM subcarriers are multiplied with the inverse of the channel coefficients and the subcarriers that fade are excluded from the transmissions, i.e., truncated-channel inversion (TCI). In another example[7], BAA is extended to one-bit broadband digital aggregation (OBDA) to facilitate the implementation of FEEL for a practical wireless system by adopting signSGD[8]. In this method, the EDs transmit quadrature amplitude modulation (QAM) symbols over OFDM subcarriers with TCI, where the real and imaginary parts of the QAM symbols are formed by using the signs of the elements of the local gradient vectors, i.e., votes. At the ES, the estimates of the global gradients are calculated based on majority vote (MV), which corresponds to the signs of the real and imaginary components of the superposed symbols on each subcarrier. Although OBDA is compatible with digital modulations, EDs still need the CSI for TCI as in BAA for AirComp. In yet another example, an additional time-varying precoder is applied along with TCI for BAA to facilitate the aggregation[9]. EDs sparsify their gradient estimates and project the resultant sparse vector into a low-dimensional vector for bandwidth reduction. The resulting compressed data is then transmitted with BAA[10]. In other studies, blind EDs are considered. However, it is assumed that the CSI for each ED is available at the ES. The impact of channel on AirComp is mitigated through beamforming with a large number of antennas[11]-[12]. 
To the best of our knowledge, there is no AirComp scheme in the literature that addresses the cases where CSI is unavailable to both EDs and ES for FEEL.


SUMMARY OF THE PRESENTLY DISCLOSED SUBJECT MATTER

Aspects and advantages of the presently disclosed subject matter will be set forth in part in the following description, or may be apparent from the description, or may be learned through practice of the presently disclosed subject matter.


Broadly speaking, the presently disclosed subject matter relates to methods for reliable over-the-air computation and federated edge learning.


The presently disclosed systems/devices and the corresponding and/or associated methodologies relate to AirComp scheme(s) for FEEL without CSI at the EDs or ES. The proposed scheme adopts the MV principle and defines multiple subcarriers and OFDM symbols for voting options, which reduces to FSK over OFDM subcarriers as a special case. Since the votes from EDs are separated on orthogonal resources, it eliminates the need for TCI at the EDs and allows the ES to detect MV with a non-coherent detector. Since the proposed method does not encode the votes on amplitude and phase, it also admits PMEPR reduction techniques. With randomization symbols, we show that the proposed scheme provides PMEPR characteristics similar to those of OFDM while providing a high test accuracy in fading channels.


FEEL is a distributed learning framework that leverages the computational powers of EDs and uses the local data at the EDs without compromising privacy to train a model. However, the communication aspect of FEEL stands as one of the main bottlenecks. To address this issue, one of the promising solutions is to perform the aggregation with AirComp methods that harness the signal-superposition property of the wireless multiple-access channel. However, developing a broadband AirComp scheme is not trivial due to the multipath channel and often CSI needs to be available. In this disclosure, we address this issue with a novel AirComp scheme.


The presently disclosed subject matter addresses the communication latency problem of training an artificial intelligence model over a wireless network. It reduces the latency with AirComp. However, the presently disclosed subject matter does not use the channel information (e.g., channel frequency response) needed for wireless communication at the EDs (e.g., a user) or ES (e.g., a base station).


This disclosure is most likely applicable to 5G New Radio and beyond (e.g., 6G). Further, BAA and OBDA are two major methods that reduce latency; however, they require channel state information at the EDs, which is a substantial overhead.


In addition, there is a large market size for this disclosure as it is related to both commercial wireless and AI technologies. It could be useful for artificial intelligence technologies over wireless or sensor networks, 5G and beyond, 6G wireless standardization, IEEE 802.11 Wi-Fi.


The proposed scheme does not need a channel inversion at the EDs. From this aspect, it is compatible with time-varying channels and does not lose the gradient information due to the truncation. The proposed scheme reduces PMEPR with a simple randomization technique. It also does not require CSI at the ES or multiple antennas for AirComp.


The presently disclosed subject matter is theoretically supported, and its validity is tested through numerical analysis and MATLAB®-based simulations under practical wireless channel models using the publicly available MNIST dataset.


Generally speaking, the presently disclosed subject matter relates to distributed learning, federated edge learning, frequency-shift keying, orthogonal frequency division multiplexing, over-the-air computation, and peak-to-mean envelope power ratio, all relating to electrical-based subject matter.


In this disclosure, we propose an AirComp scheme relying on the MV principle. Instead of encoding the votes with QAM symbols, we use multiple subcarriers and/or OFDM symbols for voting options, which corresponds to FSK over OFDM subcarriers as a special case. As the votes are aggregated on orthogonal resources with the proposed scheme, we eliminate the need for TCI at the EDs and enable the ES to determine the MV with a non-coherent detector. The proposed scheme can be used with well-known PMEPR reduction techniques as it does not utilize the amplitude and the phase to encode votes. PMEPR is reduced by using randomization symbols on active subcarriers, which also speeds up the convergence for data that are not independent and identically distributed (non-IID).


Notation: The sets of complex and real numbers are denoted by $\mathbb{C}$ and $\mathbb{R}$, respectively. $\mathbb{E}_t[\cdot]$ is the expectation of its argument over t. The signum function is denoted by sign(⋅).


Considered another way, we propose an AirComp scheme for FEEL. The proposed scheme relies on the concept of distributed learning by MV with signSGD. As compared to the state-of-the-art solutions, with the proposed method, EDs transmit the signs of local stochastic gradients by activating one of two orthogonal resources, i.e., OFDM subcarriers, and the MVs at the ES are obtained with non-coherent detectors by exploiting the energy accumulations on the subcarriers. Hence, the proposed scheme eliminates the need for CSI at the EDs and ES. By taking path loss, power control, cell size, and the probabilistic nature of the detected MVs in fading channels into account, we prove the convergence of the distributed learning for a non-convex function. Through simulations, we show that the proposed scheme can provide a high test accuracy in fading channels even when the time-synchronization and the power alignment at the ES are not ideal. We also provide insight into distributed learning for location-dependent data distribution for the MV-based schemes.


The disclosure deals with a system and method for an AirComp scheme for FEEL without CSI at the EDs or ES. The disclosure adopts the MV principle and defines multiple subcarriers and OFDM symbols for voting options, which reduces to FSK over OFDM subcarriers as a special case. Thus, FSK-based AirComp is provided for FEEL without CSI. Since the votes from EDs are separated on orthogonal resources, the proposed scheme eliminates the need for TCI at the EDs and allows the ES to detect MV with a non-coherent detector. We also mitigate the PMEPR of the synthesized signals by using randomization symbols. Simulations show the proposed scheme provides high test accuracy in fading channels for both IID and non-IID data while resulting in OFDM symbols with lower PMEPRs as compared to OBDA with QAM.


It is to be understood that the presently disclosed subject matter equally relates to associated and/or corresponding methodologies.


Other exemplary aspects of the present disclosure are directed to systems, apparatus, tangible, non-transitory computer-readable media, user interfaces, memory devices, and electronic devices for an AirComp scheme for FEEL without CSI at the EDs or ES. To implement methodology and technology herewith, one or more processors may be provided, programmed to perform the steps and functions as called for by the presently disclosed subject matter, as will be understood by those of ordinary skill in the art.


One exemplary presently disclosed method relates to an AirComp methodology for FEEL without using CSI at a plurality of EDs or at an ES, comprising: a distributed machine-learning model to be trained with the update vectors received at an ES as transmitted from a plurality of EDs; one or more processors; and one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the one or more processors to perform operations. Such operations preferably may comprise: transmitting local update vectors as weighted votes over selected multiple orthogonal subcarriers grouped based on the sign of the elements of the update vector from each respective of the plurality of EDs via a wireless multiple access channel, receiving the superposed local updates at the ES, determining the MV for each element of the update vector at the ES with an energy detector over orthogonal time and frequency resources, and inputting the MVs into the machine-learning model to be updated.


Another exemplary embodiment of presently disclosed subject matter relates to an AirComp system for FEEL without using CSI at a plurality of EDs or at an ES, comprising a machine-learning model training to process data received at an ES as transmitted from a plurality of EDs; one or more processors; and one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the one or more processors to perform operations, the operations comprising transmitting local updates as votes over selected multiple subcarriers from each respective of the plurality of EDs via a wireless multiple access channel, receiving the local updates at the ES, aggregating the local updates at the ES including separating votes from the EDs using orthogonal resources and MV principle, and inputting the obtained data into the machine-learning model as training data or data to process.


Additional objects and advantages of the presently disclosed subject matter are set forth in, or will be apparent to, those of ordinary skill in the art from the detailed description herein. Also, it should be further appreciated that modifications and variations to the specifically illustrated, referred and discussed features, elements, and steps hereof may be practiced in various embodiments, uses, and practices of the presently disclosed subject matter without departing from the spirit and scope of the subject matter. Variations may include, but are not limited to, substitution of equivalent means, features, or steps for those illustrated, referenced, or discussed, and the functional, operational, or positional reversal of various parts, features, steps, or the like.


Still further, it is to be understood that different embodiments, as well as different presently preferred embodiments, of the presently disclosed subject matter may include various combinations or configurations of presently disclosed features, steps, or elements, or their equivalents (including combinations of features, parts, or steps or configurations thereof not expressly shown in the figures or stated in the detailed description of such figures). Additional embodiments of the presently disclosed subject matter, not necessarily expressed in the summarized section, may include and incorporate various combinations of aspects of features, components, or steps referenced in the summarized objects above, and/or other features, components, or steps as otherwise discussed in this application. Those of ordinary skill in the art will better appreciate the features and aspects of such embodiments (and others upon review of the remainder of the specification) and will appreciate that the presently disclosed subject matter applies equally to corresponding methodologies as associated with practice of any of the present exemplary devices and vice versa.


These and other features, aspects and advantages of various embodiments will become better understood with reference to the following description and appended claims. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the present disclosure and, together with the description, serve to explain the related principles.





BRIEF DESCRIPTION OF THE FIGURES

A full and enabling disclosure of the present subject matter, including the best mode thereof to one of ordinary skill in the art, is set forth more particularly in the remainder of the specification, including reference to the accompanying figures in which:



FIG. 1 is a schematic illustration of an exemplary presently disclosed embodiment of federated edge learning (FEEL) with one-bit broadband digital aggregation (OBDA) and frequency-shift keying (FSK) features;



FIG. 2 illustrates multiple subcarrier examples of presently disclosed subject matter involving majority vote (MV) principles based on OBDA-FSK with K=3 EDs;



FIGS. 3A-3H show test accuracy results for non-IID data, where the FEEL with the OBDA-FSK converges without the CSI in both AWGN and fading channel;



FIG. 3A specifically illustrates AWGN, SNR is 0 dB, D=400, K=50;



FIG. 3B specifically illustrates AWGN, SNR is 20 dB, D=400, K=50;



FIG. 3C specifically illustrates AWGN, SNR is 0 dB, D=2000, K=50;



FIG. 3D specifically illustrates AWGN, SNR is 20 dB, D=2000, K=500;



FIG. 3E specifically illustrates Fading channel, SNR is 0 dB (D=400, K=50);



FIG. 3F specifically illustrates Fading channel, SNR is 20 dB (D=400, K=50);



FIG. 3G specifically illustrates Fading channel, SNR is 0 dB (D=2000, K=10);



FIG. 3H specifically illustrates Fading channel, SNR is 20 dB (D=2000, K=10);



FIGS. 4A-4H show test accuracy results for non-IID data, where the FEEL with the OBDA-FSK converges without the CSI in both AWGN and fading channel;



FIG. 4A specifically illustrates AWGN, SNR is 0 dB, D=400, K=50;



FIG. 4B specifically illustrates AWGN, SNR is 20 dB, D=400, K=50;



FIG. 4C specifically illustrates AWGN, SNR is 0 dB, D=2000, K=10;



FIG. 4D specifically illustrates AWGN, SNR is 20 dB, D=2000, K=10;



FIG. 4E specifically illustrates Fading channel, SNR is 0 dB (D=400, K=50);



FIG. 4F specifically illustrates Fading channel, SNR is 20 dB (D=400, K=50);



FIG. 4G specifically illustrates Fading channel, SNR is 0 dB (D=2000, K=10);



FIG. 4H specifically illustrates Fading channel, SNR is 20 dB (D=2000, K=10);



FIG. 5 illustrates peak-to-mean envelope power ratio (PMEPR) distributions, where the randomization symbols in OBDA-FSK lower the PMEPR;



FIG. 6 graphically illustrates the impact of cell size and the effective path loss exponent on λ;


Table 1 correlates Layers and Learnables for a Neural Network at the EDs;



FIGS. 7A-7B, respectively, illustrate IID and non-IID data distributions;



FIGS. 8A-8D illustrate test accuracy versus communication rounds, showing that FSK-MV works without the CSI at the EDs and ES and provides robustness against time-synchronization errors, and that the test accuracy degrades more for non-IID data when the power control is imperfect;



FIGS. 9A-9D, for the same configurations as FIGS. 8A-8D, respectively, illustrate the local loss values at the EDs as a function of link distance after N=500 communication rounds; and



FIG. 10 graphically compares the PMEPR distributions for OBDA and FSK-MV.





Repeat use of reference characters in the present specification and drawings is intended to represent the same or analogous features, elements, or steps of the presently disclosed subject matter.


DETAILED DESCRIPTION OF THE PRESENTLY DISCLOSED SUBJECT MATTER

Reference will now be made in detail to various embodiments of the disclosed subject matter, one or more examples of which are set forth below. Each embodiment is provided by way of explanation of the subject matter, not limitation thereof. In fact, it will be apparent to those skilled in the art that various modifications and variations may be made in the present disclosure without departing from the scope or spirit of the subject matter. For instance, features illustrated or described as part of one embodiment, may be used in another embodiment to yield a still further embodiment. Thus, it is intended that the presently disclosed subject matter covers such modifications and variations as come within the scope of the appended claims and their equivalents.


In general, the present disclosure is directed to an OFDM-based FEEL system with K EDs. Prior to the training, the initial values of the model parameters, denoted by $\mathbf{w}\in\mathbb{R}^q$, and the model structure are distributed to the EDs from an ES to set up a common learning model at the EDs, where q is the model size. We denote the local dataset containing labeled data samples at the kth ED as $\mathcal{D}_k$, with $(\mathbf{x}_\ell, y_\ell)\in\mathcal{D}_k$ for k=1, . . . , K, where $\mathbf{x}_\ell$ and $y_\ell$ are the ℓth data sample and its associated label, respectively. The main goal of the FEEL system is to obtain the trained model parameters without uploading the local data to the ES.


A. Learning Model

The local loss function of the model with the parameters w at the kth ED can be calculated as:

$$F_k(\mathbf{w}) = \frac{1}{|\mathcal{D}_k|} \sum_{(\mathbf{x}_\ell, y_\ell)\in\mathcal{D}_k} f(\mathbf{w}, \mathbf{x}_\ell, y_\ell), \tag{1}$$

where $f(\mathbf{w},\mathbf{x}_\ell,y_\ell)$ is the sample loss function that measures the labelling error for $(\mathbf{x}_\ell,y_\ell)$ for the parameters w.


Assuming identical local dataset sizes, i.e., $|\mathcal{D}_k|=D$ for k=1, . . . , K, the global loss function can be measured as:

$$F(\mathbf{w}) = \frac{1}{K}\sum_{k=1}^{K} F_k(\mathbf{w}). \tag{2}$$

In this disclosure, we focus on a FEEL system based on gradient averaging [7]. For each communication round n of FEEL, the kth ED calculates an estimate of the global gradient of the loss function in Eq. (2) by using its local dataset $\mathcal{D}_k$ and the parameter vector $\mathbf{w}^{(n)}$. Assuming that all data samples in $\mathcal{D}_k$ are used for gradient estimation, the local gradient estimate for the kth ED at the nth communication round, denoted by $\mathbf{g}_k^{(n)}$, can be expressed as:

$$\mathbf{g}_k^{(n)} = \nabla F_k\big(\mathbf{w}^{(n)}\big) = \frac{1}{D}\sum_{(\mathbf{x}_\ell, y_\ell)\in\mathcal{D}_k} \nabla f\big(\mathbf{w}^{(n)}, \mathbf{x}_\ell, y_\ell\big), \tag{3}$$

where ∇ represents the gradient operator.


Assuming that the local gradient estimates are reliably received at the ES, the ES can obtain the global estimate of the gradient of the loss function in Eq. (2) as:

$$\hat{\mathbf{g}}^{(n)} = \frac{1}{K}\sum_{k=1}^{K} \mathbf{g}_k^{(n)}. \tag{4}$$

Subsequently, the ES distributes the global gradient estimate ĝ(n) to the EDs and the current model is updated based on a common update rule, e.g., gradient descent given by w(n+1)=w(n)−ηĝ(n) where η is the learning rate and w(1)=w. This process is repeated consecutively until a predetermined convergence criterion is achieved.
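The gradient-averaging round described by Eqs. (3) and (4) and the gradient-descent update can be sketched as follows. This is a minimal illustration only: the squared-error sample loss, the synthetic data, and the names `local_gradient` and `feel_round` are assumptions made for the example and do not come from the disclosure.

```python
import numpy as np

def local_gradient(w, X, y):
    """Gradient of the assumed loss (1/D) * sum (x^T w - y)^2 over one ED's data."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(y)

def feel_round(w, datasets, lr):
    """One round: Eq. (3) at each ED, Eq. (4) at the ES, then gradient descent."""
    local_grads = [local_gradient(w, X, y) for X, y in datasets]  # Eq. (3)
    g_hat = np.mean(local_grads, axis=0)                          # Eq. (4)
    return w - lr * g_hat                                         # w(n+1) = w(n) - eta * g_hat

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
datasets = []
for _ in range(4):                      # K = 4 EDs with D = 50 samples each
    X = rng.standard_normal((50, 3))
    datasets.append((X, X @ w_true))    # noiseless labels, so w_true minimizes the loss

w = np.zeros(3)
for _ in range(200):
    w = feel_round(w, datasets, lr=0.1)
```

In this idealized setting every ED uploads a full-precision gradient each round, which is the communication cost that the over-the-air schemes in this disclosure aim to reduce.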


In this disclosure, we adopt signSGD [8] for FEEL. Instead of the actual values of the local gradients, the EDs transmit the signs of their local gradients, i.e., $\tilde{\mathbf{g}}_k^{(n)}$ for k=1, . . . , K, to the ES, where the ith element of $\tilde{\mathbf{g}}_k^{(n)}$ is $\tilde{g}_{k,i}^{(n)} \triangleq \operatorname{sign}(g_{k,i}^{(n)})$. The estimate of the global gradient for the ith parameter can then be calculated by using the MV principle as:

$$v_i^{(n)} \triangleq \operatorname{sign}\big(y_i^{(n)}\big), \tag{5}$$

where

$$y_i^{(n)} = \sum_{k=1}^{K} \tilde{g}_{k,i}^{(n)}.$$

The ES then transmits $\mathbf{v}^{(n)} = (v_0^{(n)}, \ldots, v_{q-1}^{(n)})$ to the EDs and the models at the EDs are updated, e.g., $\mathbf{w}^{(n+1)} = \mathbf{w}^{(n)} - \eta\mathbf{v}^{(n)}$.
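The majority vote of Eq. (5) over error-free links can be sketched as below. The gradient values and the function name `majority_vote` are illustrative assumptions.

```python
import numpy as np

def majority_vote(local_grads):
    """Eq. (5): sign of the sum of the per-ED gradient signs, element by element."""
    votes = np.sign(local_grads)         # votes in {-1, 0, +1}, shape (K, q)
    return np.sign(votes.sum(axis=0))    # v_i, shape (q,)

# K = 3 EDs, q = 4 parameters (illustrative values)
local_grads = np.array([
    [ 0.9, -0.1,  0.3, 0.0],
    [ 0.2, -0.7, -0.4, 0.0],
    [-5.0, -0.2,  0.0, 0.0],
])
v = majority_vote(local_grads)

w = np.zeros(4)
eta = 0.01
w_next = w - eta * v                     # model update with the MV vector
```

Note that the large-magnitude entry -5.0 of the third ED counts as a single vote, so the first element of `v` is still +1. This is the behaviour that the weighted-vote variant described later relaxes.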


B. Signal Model

In this disclosure, we assume that the EDs access the wireless channel on the same time-frequency resources simultaneously for AirComp with S OFDM symbols consisting of M active subcarriers. We assume the transmissions from the EDs are synchronized in both time and frequency and arrive at the ES within the cyclic prefix (CP) duration. We also assume that the CP duration is larger than the maximum-excess delays of the channels between the ES and the EDs. The superposed symbol on the lth subcarrier of the mth OFDM symbol at the ES can then be written as:

$$r_{l,m} = \sum_{k=1}^{K} h_{k,l}\, t_{k,l,m} + n_{l,m}, \tag{6}$$

where $h_{k,l}\in\mathbb{C}$ is the channel coefficient between the ES and the kth ED on the lth subcarrier with $\mathbb{E}[|h_{k,l}|^2]=1$, $t_{k,l,m}\in\mathbb{C}$ is the transmitted symbol from the kth ED on the lth subcarrier of the mth OFDM symbol, and $n_{l,m}$ is the zero-mean additive white Gaussian noise (AWGN) with the variance $\sigma_n^2$ on the lth subcarrier for l∈{0, 1, . . . , M−1} and m∈{0, 1, . . . , S−1}.
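The superposition in Eq. (6) can be sketched numerically as follows. The Rayleigh-fading draw for the channel coefficients and the function name `superpose` are assumptions made for the example; the disclosure only requires unit average channel power.

```python
import numpy as np

def superpose(h, t, noise_var, rng):
    """Eq. (6): r[l, m] = sum_k h[k, l] * t[k, l, m] + n[l, m].

    h: (K, M) complex channel coefficients.
    t: (K, M, S) complex transmitted symbols.
    """
    _, M, S = t.shape
    noise = np.sqrt(noise_var / 2) * (rng.standard_normal((M, S))
                                      + 1j * rng.standard_normal((M, S)))
    return np.einsum("kl,klm->lm", h, t) + noise

rng = np.random.default_rng(1)
K, M, S = 3, 4, 2
# Assumed channel model: Rayleigh fading with E[|h|^2] = 1
h = (rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))) / np.sqrt(2)
t = np.ones((K, M, S), dtype=complex)    # every ED sends 1 on every resource
r = superpose(h, t, noise_var=0.0, rng=rng)
```

With all-ones symbols and no noise, each received value is simply the sum of the K channel coefficients on that subcarrier, which shows why coherent aggregation needs channel inversion while an energy-based scheme does not.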


Let $x(t)\in\mathbb{C}$ be the baseband OFDM symbol in continuous time for t∈[0, Ts), where Ts is the OFDM symbol duration. We define the PMEPR of an OFDM symbol as $\max_{t\in[0,T_s)}|x(t)|^2/P_{tx}$, where $P_{tx}=\mathbb{E}_t[|x(t)|^2]$ is the mean envelope power. For AirComp schemes, $P_{tx}$ changes based on the gradient information. In this disclosure, for a fair comparison, we calculate $P_{tx}$ when all subcarriers are actively utilized, i.e., $P_{tx}=M/N$, where N is the inverse DFT (IDFT) size.
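A rough numerical estimate of the PMEPR defined above can be obtained with an oversampled IDFT, as in the following sketch. The oversampling factor, the placement of the active subcarriers at the lowest indices, and the function name `pmepr_db` are assumptions for illustration.

```python
import numpy as np

def pmepr_db(symbols, n_idft, oversample=4):
    """PMEPR (in dB) of one OFDM symbol, normalized by P_tx = M / N as in the text.

    symbols: length-M frequency-domain symbols on the active subcarriers.
    """
    M = len(symbols)
    n_time = n_idft * oversample
    # Zero-padding the spectrum approximates the continuous-time envelope x(t).
    freq = np.zeros(n_time, dtype=complex)
    freq[:M] = symbols                  # a frequency shift would not change |x(t)|
    # Scale so that x = (1/sqrt(N)) * sum_l d_l * exp(j*2*pi*l*t/Ts),
    # whose mean power is M / N for unit-energy symbols.
    x = np.fft.ifft(freq) * n_time / np.sqrt(n_idft)
    p_tx = M / n_idft
    return 10 * np.log10(np.max(np.abs(x) ** 2) / p_tx)

rng = np.random.default_rng(0)
qpsk = np.exp(1j * (np.pi / 4 + np.pi / 2 * rng.integers(0, 4, size=8)))
worst_case = pmepr_db(np.ones(8), n_idft=16)   # fully correlated symbols
randomized = pmepr_db(qpsk, n_idft=16)         # random QPSK symbols
```

Identical symbols on all M subcarriers add coherently at t=0 and give the largest possible value, 10·log10(M) dB, which is why decorrelating the subcarriers with randomization symbols helps.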


FSK-Based Majority Vote
A. Transmitter

Let ƒ be a bijective function that maps i∈{0, 1, . . . , q−1} to the distinct pairs (m0, l0) and (m1, l1) for m0, m1∈{0, 1, . . . , S−1} and l0, l1∈{0, 1, . . . , M−1}. Based on $\tilde{g}_{k,i}^{(n)}$, at the nth communication round, we propose to calculate the symbols $t_{k,l_0,m_0}$ and $t_{k,l_1,m_1}$ as:

$$t_{k,l_0,m_0} = \begin{cases} \sqrt{E_s}\, s_{k,i}, & \tilde{g}_{k,i}^{(n)} = 1 \\ 0, & \tilde{g}_{k,i}^{(n)} = 0 \\ 0, & \tilde{g}_{k,i}^{(n)} = -1 \end{cases}, \tag{7}$$

and

$$t_{k,l_1,m_1} = \begin{cases} 0, & \tilde{g}_{k,i}^{(n)} = 1 \\ 0, & \tilde{g}_{k,i}^{(n)} = 0 \\ \sqrt{E_s}\, s_{k,i}, & \tilde{g}_{k,i}^{(n)} = -1 \end{cases}, \tag{8}$$

respectively, where $E_s=2$ is the normalized symbol energy and $s_{k,i}$ is a randomization symbol for k∈{1, . . . , K}.


Therefore, the proposed scheme separates the options for voting over two different resources identified in time and frequency. In this disclosure, we choose $s_{k,i}$ as a random quadrature phase shift keying (QPSK) symbol to reduce PMEPR by decreasing the correlation in the frequency domain [13].
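The transmitter mapping of Eqs. (7) and (8) with random QPSK randomization symbols can be sketched as follows. The specific mapping used here (vote i occupies subcarriers 2i and 2i+1 of a single OFDM symbol) is one hypothetical choice of ƒ, and the function names are illustrative.

```python
import numpy as np

E_S = 2.0  # normalized symbol energy from the text

def random_qpsk(size, rng):
    """Unit-magnitude QPSK randomization symbols."""
    phases = rng.integers(0, 4, size=size)
    return np.exp(1j * (np.pi / 4 + np.pi / 2 * phases))

def fsk_mv_transmit(signs, rng):
    """Map q votes in {-1, 0, +1} to 2q subcarriers following Eqs. (7) and (8).

    Assumed mapping: vote i uses subcarrier l0 = 2i for +1 and l1 = 2i + 1 for -1.
    """
    signs = np.asarray(signs)
    q = signs.size
    s = random_qpsk(q, rng)
    t = np.zeros(2 * q, dtype=complex)
    t[0::2] = np.where(signs == 1, np.sqrt(E_S) * s, 0)    # Eq. (7)
    t[1::2] = np.where(signs == -1, np.sqrt(E_S) * s, 0)   # Eq. (8)
    return t

rng = np.random.default_rng(0)
t = fsk_mv_transmit([1, 1, -1, -1, -1], rng)
active = np.flatnonzero(np.abs(t) > 0)   # indices of the activated subcarriers
```

Only the position of the active subcarrier carries the vote; the phase of the symbol is free, which is what leaves room for the randomization symbols.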


In one implementation, when $\tilde{g}_{k,i}=1$, the symbols $t_{k,l_0,m_0}$ and $t_{k,l_1,m_1}$ may be chosen randomly from the set {1, −1}.


In one implementation, the symbols $t_{k,l_0,m_0}$ and $t_{k,l_1,m_1}$ can be calculated based on a weighting function. For example,

$$t_{k,l_0,m_0} = \begin{cases} \sqrt{E_s}\, s_{k,i}\, \omega(g_{k,i}), & \tilde{g}_{k,i} = 1 \\ 0, & \tilde{g}_{k,i} \neq 1 \end{cases}$$

$$t_{k,l_1,m_1} = \begin{cases} \sqrt{E_s}\, s_{k,i}\, \omega(g_{k,i}), & \tilde{g}_{k,i} = -1 \\ 0, & \tilde{g}_{k,i} \neq -1 \end{cases}$$

where $g_{k,i}$ is the local stochastic gradient and $\omega(g_{k,i})$ is a weighting function. The weighting function may be an even-symmetric function that ranges from 0 to 1 in order to limit the power of the transmitted OFDM symbols. The main motivation for using a weight function is that it can lower the probability of detecting the incorrect majority vote as compared to the sign operation. It may also increase the convergence rate in heterogeneous data distribution scenarios. Examples of smooth, non-decreasing weight functions for negative or positive $g_{k,i}$ are as follows:

$$\omega(g_{k,i}) = \tanh(h\, g_{k,i}),$$

$$\omega(g_{k,i}) = \tanh(h\, |g_{k,i}|),$$

and

$$\omega(g_{k,i}) = \begin{cases} 1, & |g_{k,i}| > t(1+\rho) \\ 0, & |g_{k,i}| < t(1-\rho) \\ \dfrac{1}{2} + \dfrac{1}{2}\cos\left(\dfrac{\pi\left(|g_{k,i}| + t(1+\rho)\right)}{2\rho}\right), & \text{otherwise} \end{cases}$$

where h, t, and ρ are non-negative coefficients. All of these examples ensure that the transmit power increases gradually as the magnitude of the local gradient grows. Therefore, if an ED has a smaller absolute local gradient, its impact on the MV becomes smaller. Similarly, if an ED has a large absolute local gradient, its impact on the MV becomes larger. Hence, the convergence speed may improve.
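A weighted vote using the tanh-of-magnitude example can be sketched as below. The coefficient value h=1, the fixed randomization symbols, and the function names are assumptions for illustration.

```python
import numpy as np

def tanh_weight(g, h):
    """omega(g) = tanh(h * |g|): even-symmetric, in [0, 1), non-decreasing in |g|."""
    return np.tanh(h * np.abs(g))

def weighted_symbols(g, s, h, e_s=2.0):
    """Weighted votes for one ED on the (+1) and (-1) resources."""
    g = np.asarray(g, dtype=float)
    amp = np.sqrt(e_s) * s * tanh_weight(g, h)
    t_plus = np.where(g > 0, amp, 0)    # resource (m0, l0), active when sign(g) = +1
    t_minus = np.where(g < 0, amp, 0)   # resource (m1, l1), active when sign(g) = -1
    return t_plus, t_minus

g = np.array([0.01, -3.0, 0.0])
s = np.ones(3, dtype=complex)           # randomization symbols fixed to 1 for readability
t_plus, t_minus = weighted_symbols(g, s, h=1.0)
```

Here the near-zero gradient 0.01 contributes almost no energy to its subcarrier, while the gradient -3.0 transmits at close to the full amplitude √2, so weak votes are less likely to tip the energy comparison at the ES.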


In one implementation, $\omega(g_{k,i})=1$ may be chosen to achieve a design based on signs as described. In one implementation, the parameters of the weight function may be tuned across the communication rounds. For example, the tuning may be based on maximum values of the absolute local gradients or update vectors or the communication round index.


The functionality of ƒ can be divided into two different mappers, i.e., a gradient mapper (GM) and a resource mapper (RM). While the GM shuffles the quantized gradients, the RM identifies how the options for voting are distributed to the time and frequency resources. As a special case of the RM, if m1=m0 and l1=l0+1 for all i, the adjacent subcarriers of the m0th OFDM symbol are used for voting, i.e., FSK over OFDM subcarriers. In this case, the weight of the kth ED's vote in the MV for the ith gradient is independent of its vote since these subcarriers are likely to experience similar channel conditions in practice, i.e., $h_{k,l_0}\approx h_{k,l_0+1}$. We denote the proposed scheme with this specific RM as OBDA-FSK in this disclosure.


Gradient mapper and resource mapper may be utilized with an interleaver or an encryption function to increase the security of the proposed scheme. For example, gradient mapper or resource mapper may map the votes to different subcarriers for each communication round based on an encryption operation. Hence, an eavesdropper cannot recover the order of the gradients by simply capturing the transmission.
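A per-round shuffling of the votes, shared by the EDs and the ES, can be sketched as follows. The use of a seeded NumPy generator, the key value, and all function names are hypothetical; a seeded pseudo-random generator is only a stand-in for the encryption operation mentioned above and is not cryptographically secure.

```python
import numpy as np

def gradient_mapper(q, shared_key, round_index):
    """Pseudo-random permutation of the q gradient indices for one round.

    NOTE: stand-in only. A real system would derive the permutation from a
    proper cipher or keyed hash instead of NumPy's generator.
    """
    rng = np.random.default_rng([shared_key, round_index])
    return rng.permutation(q)

def shuffle(votes, perm):
    """Applied at each ED before mapping the votes to resources."""
    return np.asarray(votes)[perm]

def unshuffle(detected, perm):
    """Applied at the ES to restore the original gradient order."""
    out = np.empty_like(detected)
    out[perm] = detected
    return out

votes = np.array([1, 1, 0, -1, -1])
perm = gradient_mapper(len(votes), shared_key=1234, round_index=7)
restored = unshuffle(shuffle(votes, perm), perm)
```

Because every ED applies the same permutation in a given round, the votes for a given gradient still land on the same pair of resources and superpose as intended.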


In one implementation, the symbols tk,l0,m0 and tk,l1,m1 may be based on an update vector, which generalizes the concept of the local stochastic gradients. For example, the machine learning model may be iterated after E local steps. In that case, the update vector may be the difference between the model parameters without local iterations and the model parameters after E local iterations.
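An update vector obtained after E local steps can be sketched as below. Plain gradient descent as the local optimizer, the quadratic stand-in loss, and the name `local_update_vector` are assumptions for illustration.

```python
import numpy as np

def local_update_vector(w, grad_fn, lr, num_local_steps):
    """Difference between the model before and after E local gradient steps."""
    w_local = w.copy()
    for _ in range(num_local_steps):
        w_local = w_local - lr * grad_fn(w_local)
    return w - w_local

def grad_fn(w):
    """Gradient of the stand-in loss ||w||^2."""
    return 2.0 * w

w = np.array([1.0, -1.0])
update = local_update_vector(w, grad_fn, lr=0.25, num_local_steps=2)
votes = np.sign(update)   # these signs would be mapped to subcarriers as votes
```

With a single local step the update vector equals the learning rate times the local gradient, so its signs coincide with the gradient signs used in the basic scheme.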


B. Receiver

At the ES, the pairs (m0, l0) and (m1, l1) are first calculated by using the mapping function ƒ for a given i. Assuming independent multipath channels between the ES and the EDs, it can be shown that:

$$\mathbb{E}\left[\left|r_{l_0,m_0}\right|^2\right] = \mathbb{E}\left[\left|\sqrt{E_s}\sum_{k:\,\tilde{g}_{k,i}^{(n)}=1} h_{k,l_0}\, s_{k,i} + n_{l_0,m_0}\right|^2\right] = E_s K_0 + \sigma_n^2, \tag{9}$$

and

$$\mathbb{E}\left[\left|r_{l_1,m_1}\right|^2\right] = \mathbb{E}\left[\left|\sqrt{E_s}\sum_{k:\,\tilde{g}_{k,i}^{(n)}=-1} h_{k,l_1}\, s_{k,i} + n_{l_1,m_1}\right|^2\right] = E_s K_1 + \sigma_n^2, \tag{10}$$
where K0 and K1 are the number of EDs that vote for 1 and −1 for the ith gradient, respectively.
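The step to the right-hand side of Eq. (9) can be sketched as follows. The sketch additionally assumes that the channel coefficients are zero-mean and independent of the noise, and that the randomization symbols have unit magnitude (as for QPSK); the set notation $\mathcal{K}_0$ for the EDs that vote for 1 is introduced here only for brevity.

```latex
\begin{aligned}
\mathbb{E}\Big[\Big|\sqrt{E_s}\sum_{k\in\mathcal{K}_0} h_{k,l_0}s_{k,i} + n_{l_0,m_0}\Big|^2\Big]
&= E_s\sum_{k\in\mathcal{K}_0}\sum_{k'\in\mathcal{K}_0}
   \mathbb{E}\big[h_{k,l_0}h_{k',l_0}^{*}\big]\, s_{k,i}s_{k',i}^{*}
   + \mathbb{E}\big[|n_{l_0,m_0}|^2\big] \\
&= E_s\sum_{k\in\mathcal{K}_0}\mathbb{E}\big[|h_{k,l_0}|^2\big]\,|s_{k,i}|^2 + \sigma_n^2 \\
&= E_s K_0 + \sigma_n^2 .
\end{aligned}
```

The cross terms with $k\neq k'$ vanish because independent zero-mean channels are uncorrelated, and the signal-noise cross terms vanish because the noise is zero-mean and independent of the channels. Eq. (10) follows in the same way with the EDs that vote for −1.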


Therefore, the energies on the superposed symbols $r_{l_0,m_0}$ and $r_{l_1,m_1}$ can be compared to determine the MV as:

$$v_i^{(n)} = \begin{cases} 1, & \left|r_{l_0,m_0}\right|^2 > \left|r_{l_1,m_1}\right|^2 + t \\ -1, & \left|r_{l_1,m_1}\right|^2 > \left|r_{l_0,m_0}\right|^2 + t \\ 0, & \text{otherwise} \end{cases}, \tag{11}$$

where t is the maximum distance between $|r_{l_0,m_0}|^2$ and $|r_{l_1,m_1}|^2$ to declare a tie under AWGN. In one implementation, the threshold t may be set to zero to simplify the receiver.
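The non-coherent detector of Eq. (11) can be sketched as follows for the adjacent-subcarrier case. The mapping (vote i on subcarriers 2i and 2i+1), the received values, and the function name `energy_detector` are assumptions for illustration.

```python
import numpy as np

def energy_detector(r, threshold=0.0):
    """Eq. (11) for the assumed mapping l0 = 2i (vote +1) and l1 = 2i + 1 (vote -1)."""
    energy = np.abs(np.asarray(r)) ** 2
    e_plus, e_minus = energy[0::2], energy[1::2]
    v = np.zeros(e_plus.shape, dtype=int)        # ties default to 0
    v[e_plus > e_minus + threshold] = 1
    v[e_minus > e_plus + threshold] = -1
    return v

# Superposed symbols for q = 3 gradients (illustrative values)
r = np.array([2.0 + 0j, 0.1j, 0.2, -1.5, 1.0, 1.05j])
v_with_tie_margin = energy_detector(r, threshold=0.5)
v_zero_threshold = energy_detector(r, threshold=0.0)
```

The detector uses only magnitudes, so no channel estimate or phase reference is needed at the ES. With a zero threshold the third gradient is decided by a small energy difference, while the margin of 0.5 declares it a tie.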


In FIG. 1, we provided the transmitter and receiver block diagrams for a FEEL system with OBDA-FSK. We also exemplified OBDA-FSK for K=3, q=5, M=10, and S=1 in FIG. 2. Assume that $\tilde{g}_1^{(n)}=(1,1,-1,-1,-1)$, $\tilde{g}_2^{(n)}=(1,-1,0,0,0)$, and $\tilde{g}_3^{(n)}=(-1,1,1,-1,0)$. Therefore, based on Eqs. (7) and (8), the symbols on the subcarriers can be calculated as $\sqrt{2}\,(s_{1,0},0,s_{1,1},0,0,s_{1,2},0,s_{1,3},0,s_{1,4})$, $\sqrt{2}\,(s_{2,0},0,0,s_{2,1},0,0,0,0,0,0)$, and $\sqrt{2}\,(0,s_{3,0},s_{3,1},0,s_{3,2},0,0,s_{3,3},0,0)$ for the first ED, the second ED, and the third ED, respectively. After each ED's signal passes through its own multipath channel, the ES observes the superposed symbols on the same subcarrier indices. The detector at the ES then compares the energies on the two adjacent subcarriers to determine the gradient vector, i.e., $v^{(n)}=(v_0^{(n)},\dots,v_4^{(n)})$, based on Eq. (11). For example, since the majority of the EDs (i.e., ED 1 and ED 2) activate the first subcarrier for i=0, it is likely that the detector returns $v_0^{(n)}=1$ based on Eqs. (9) and (10). In the case of a tie, e.g., $v_2^{(n)}$, the detector determines the MV as 0. Note that the energies on the two subcarriers are unlikely to be identical in practice due to the noise, randomization symbols, and channel. Hence, we set the MV to 0 if the distance between $|r_{l_0,m_0}|^2$ and $|r_{l_1,m_1}|^2$ is less than t.
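The K=3, q=5 example above can be reproduced numerically. The following is a minimal sketch, not the disclosed implementation: it assumes random QPSK randomization symbols, an independent Rayleigh coefficient per ED and subcarrier (adjacent subcarriers would be correlated in practice), and no noise. The names `ed_symbols`, `detect_mv`, and `one_round` are hypothetical.

```python
import cmath
import math
import random

E_S = 2.0  # symbol-energy normalization, i.e., the sqrt(2) scaling in the example


def ed_symbols(votes, rng):
    """Map one ED's votes (+1, -1, or 0) onto pairs of adjacent subcarriers."""
    symbols = []
    for vote in votes:
        # Randomization symbol on the unit circle (random QPSK phase: an assumption).
        s = cmath.exp(1j * (math.pi / 4 + rng.randrange(4) * math.pi / 2))
        active = math.sqrt(E_S) * s
        if vote == 1:
            symbols += [active, 0j]   # energy on the first subcarrier of the pair
        elif vote == -1:
            symbols += [0j, active]   # energy on the second subcarrier of the pair
        else:
            symbols += [0j, 0j]       # zero-valued entry: nothing is transmitted
    return symbols


def detect_mv(received, t):
    """Non-coherent detector of Eq. (11): compare energies with tie threshold t."""
    mv = []
    for i in range(len(received) // 2):
        e0 = abs(received[2 * i]) ** 2
        e1 = abs(received[2 * i + 1]) ** 2
        mv.append(1 if e0 > e1 + t else (-1 if e1 > e0 + t else 0))
    return mv


def one_round(votes_per_ed, rng, t=0.0):
    """Superpose all EDs through independent Rayleigh coefficients (no noise)."""
    n_sc = 2 * len(votes_per_ed[0])
    rx = [0j] * n_sc
    for votes in votes_per_ed:
        tx = ed_symbols(votes, rng)
        for l in range(n_sc):
            h = complex(rng.gauss(0, math.sqrt(0.5)), rng.gauss(0, math.sqrt(0.5)))
            rx[l] += h * tx[l]
    return detect_mv(rx, t)


votes = [(1, 1, -1, -1, -1), (1, -1, 0, 0, 0), (-1, 1, 1, -1, 0)]
rng = random.Random(0)

# Which of the M=10 subcarriers each ED activates (1 = active, 0 = empty).
pattern = [[int(abs(x) > 0) for x in ed_symbols(v, rng)] for v in votes]

# Empirical probability that the detector returns v_0 = 1 (two EDs vote +1).
trials = 5000
p_v0 = sum(one_round(votes, rng)[0] == 1 for _ in range(trials)) / trials
print(pattern)
print(p_v0)
```

Under these assumptions the detection is probabilistic: two votes against one only make the correct outcome more likely, which is the behavior that Eqs. (9) and (10) describe in expectation.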


C. Trade-offs and Comparisons

As opposed to the approaches in prior literature[6], [7], the proposed scheme does not need channel inversion at the EDs. In this respect, it is compatible with time-varying channels (e.g., mobile networks[14]) and does not lose gradient information due to TCI. On the other hand, it quadruples the number of time-frequency resources for AirComp as compared to OBDA-QAM[7]; however, OBDA-QAM has not been investigated in terms of PMEPR in the literature. As shown in the numerical results, OBDA-QAM can suffer from high PMEPR, while the proposed scheme reduces PMEPR with a simple randomization technique that also leads to better accuracy results for non-IID data. As compared to the approaches in prior literature[11], [12], the proposed scheme also does not require CSI at the ES or multiple antennas.


Numerical Results

For the numerical results, we considered the learning task of handwritten digit recognition with a FEEL system and compared the proposed scheme with BAA[6] for gradient averaging and OBDA-QAM[7]. We used the MNIST dataset, which contains 60,000 labelled handwritten digit images of size 28×28, covering the digits 0-9. For the IID dataset, we randomly partitioned 20,000 training images into equal shares among K∈{10, 50} EDs. For the non-IID dataset, we chose 5 digits for each ED and selected the images randomly, i.e., different local datasets can contain the same image. For a fair comparison, we used the same data randomization for the different AirComp schemes.


For the model, we considered a convolutional neural network (CNN) that includes one 5×5 and two 3×3 convolutional layers, where each of them is followed by a batch normalization layer and a rectified-linear unit (ReLU) activation. All convolutional layers have 20 filters. After the third ReLU, a fully connected layer with 10 units and a softmax layer were utilized. At the input layer, no normalization was applied. Our model has q=123090 learnable parameters, which corresponds to S=206, S=103, and S=52 OFDM symbols for the OBDA-FSK, BAA, and OBDA-QAM, respectively, for M=1200. The subcarrier spacing was set to 15 kHz, the truncation threshold for TCI was 0.2, and the threshold t was set to 0.01 for the proposed scheme.
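The OFDM symbol counts quoted above follow from how many subcarriers each scheme spends per parameter. A minimal sketch, assuming two subcarriers per gradient for OBDA-FSK, one for BAA, and half a subcarrier for OBDA-QAM (two signs per QPSK symbol); `num_ofdm_symbols` is a hypothetical helper:

```python
import math


def num_ofdm_symbols(q, m_active, subcarriers_per_param):
    """Number of OFDM symbols needed to carry q parameters on m_active subcarriers."""
    return math.ceil(q * subcarriers_per_param / m_active)


q, M = 123090, 1200
s_fsk = num_ofdm_symbols(q, M, 2)     # one subcarrier per voting option
s_baa = num_ofdm_symbols(q, M, 1)     # one analog value per subcarrier
s_qam = num_ofdm_symbols(q, M, 0.5)   # real and imaginary parts carry one sign each
print(s_fsk, s_baa, s_qam)
```

The ratio between the first and the last value is the roughly fourfold resource overhead of the proposed scheme relative to OBDA-QAM.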


To test the FEEL, we considered two different uplink signal-to-noise ratios (SNRs), i.e., 0 dB and 20 dB.


For the fading channel, we considered ITU Extended Pedestrian A (EPA) with no mobility and regenerated the channels between the ES and the EDs for each communication round to capture the long-term channel variations. For TCI, we assumed that CSI was available at the EDs. For the update rule, we considered stochastic gradient descent with momentum, where the momentum was 0.9. The initial learning rate was 0.01 and the learning rate decayed with a rate of 0.05 for every communication round.


In FIG. 3, we provided the test accuracy results for IID data. In AWGN channel, all AirComp schemes converged and returned a high test accuracy for both 0 dB and 20 dB SNR for K=10 and K=50 EDs, as shown in FIGS. 3A-D. The test accuracy with the BAA converged more slowly than with the OBDA-QAM and the OBDA-FSK, as the BAA is based on the actual values of the gradient estimates. In FIGS. 3E-F, we considered the fading channel for K=50 EDs. Both BAA and OBDA-QAM failed when the TCI was not used at the EDs. On the other hand, the OBDA-FSK offered a high test accuracy without using TCI at the EDs or CSI at the ES. Similar behaviors for K=10 EDs were noted in FIGS. 3G-H.



FIG. 4 demonstrates test accuracy results for the non-IID data. In AWGN channel, both BAA and OBDA-FSK were better than the OBDA-QAM, as shown in FIGS. 4A-D. Based on these tests, the superiority of the OBDA-FSK to the OBDA-QAM is due to the randomization symbols that alter the MV. For example, although $\mathbb{E}[|r_{l_0,m_0}|^2] > \mathbb{E}[|r_{l_1,m_1}|^2]$ for $K_0 > K_1$, $|r_{l_1,m_1}|^2 > |r_{l_0,m_0}|^2 + t$ can still occur since $r_{l_0,m_0}$ and $r_{l_1,m_1}$ are summations of the randomization symbols. This random behavior may help avoid convergence to a local optimum for non-IID data. In fading channel, the proposed scheme also works without TCI, as shown in FIGS. 4E-H, and the test accuracy converges faster than the one with OBDA-QAM.


In FIG. 5 (PMEPR distributions), we compared the PMEPR of the digital aggregation schemes (i.e., OBDA-QAM and OBDA-FSK) for different numbers of EDs and the IID data in fading channel and 20 dB SNR. The randomization symbols in OBDA-FSK lowered PMEPR. Since the proposed scheme introduces randomness in the frequency based on sk,i for i=0, . . . , q−1, the proposed scheme exhibits a similar behavior to a typical OFDM transmission in terms of PMEPR. On the other hand, the OBDA-QAM with or without TCI caused substantially high PMEPR for OFDM as the signs of the gradient and the channel coefficients in the frequency domain were correlated.


Concluding Remarks

In this disclosure, we proposed an AirComp scheme for FEEL. The proposed scheme relies on MV and forms the options for voting on different subcarriers and/or OFDM symbols. Thus, it allows the receiver to detect the MV with a non-coherent detector and eliminates the need for TCI at the EDs, which makes it compatible with time-varying channels. Further, it can be used along with randomization methods in the frequency domain to reduce the PMEPR. Through simulations, we demonstrated that the proposed method provides a high test accuracy in fading channel for both IID and non-IID data and results in an acceptable PMEPR distribution, at the expense of a larger number of time and frequency resources.


The proposed method can be improved in various ways. For example, to lower PMEPR further, the randomization symbols can be designed based on the gradients. Precoded OFDM (e.g., discrete Fourier transform (DFT)-spread OFDM) or various mapping strategies can also be explored to improve the proposed method. In this disclosure, we focused on one-bit quantization. Extending the proposed concept to different quantization levels is another interesting research direction that can be pursued. The system-level analysis of the proposed method with heterogeneous data is another direction that can be investigated.


ADDITIONAL DISCLOSURE

Federated edge learning (FEEL) is an implementation of federated learning (FL) in a wireless network to train a model without moving the local data generated at the edge devices (EDs) to an edge server (ES)[001], [002]. With FEEL, a large number of model parameters (or gradients) needs to be communicated between many EDs and the ES through wireless channels. However, typical user multiplexing methods such as orthogonal frequency division multiple access (OFDMA) can be inefficient in addressing the spectrum congestion due to a large number of EDs[003]. To address this issue, one of the promising solutions is to perform the calculations needed for FEEL, e.g., averaging, with an over-the-air computation (AirComp) method that harnesses the signal-superposition property of the wireless multiple-access channel[004]-[006]. However, developing an AirComp scheme is not a trivial task due to the multipath channel, power misalignment, and time-synchronization errors in practice. Also, with state-of-the-art solutions, the channel state information (CSI) needs to be available at the EDs or the ES. In this study, we propose an AirComp scheme to address these issues.


In the literature, various AirComp schemes have been proposed for FEEL. In [007], analog modulation over orthogonal frequency division multiplexing (OFDM) is investigated for broadband analog aggregation (BAA). Particularly, it is proposed to modulate the OFDM subcarriers with the model parameters at the EDs. To overcome the impact of the multipath channel on the transmitted signals, the symbols on the OFDM subcarriers are multiplied with the inverse of the channel coefficients and the subcarriers that fade are excluded from the transmissions, which is known as truncated-channel inversion (TCI) in the literature. In [008], an additional time-varying precoder is applied along with TCI to facilitate the aggregation. In [009], it is proposed to sparsify the gradient estimates and project the resultant sparse vector into a low-dimensional vector to reduce the bandwidth. The compressed data is transmitted with BAA. In [010], one-bit broadband digital aggregation (OBDA) is proposed to facilitate the implementation of FEEL for a practical wireless system. In this method, considering distributed training by majority vote (MV) with the sign stochastic gradient descent (signSGD)[011], the EDs transmit quadrature phase-shift keying (QPSK) symbols over OFDM subcarriers along with TCI, where the real and imaginary parts of the QPSK symbols are formed by using the signs of the stochastic gradients, i.e., votes. At the ES, the signs of the real and imaginary components of the superposed received symbols on each subcarrier are calculated to obtain the MV for the sign of each gradient. However, the EDs still need the CSI for TCI, as in BAA. In [012] and [013], blind EDs are considered. However, it is assumed that the CSI for each ED is available at the ES, and the impact of the channel on AirComp is mitigated through beamforming with a large number of antennas.


In this study, we investigate an AirComp method based on non-coherent detection to achieve FEEL without using CSI at the EDs and the ES. Inspired by the MV with signSGD[011], we use orthogonal resources, i.e., multiple subcarriers and/or OFDM symbols, to transmit the signs of local stochastic gradients. Hence, the votes from different EDs accumulate on the orthogonal resources non-coherently in fading channel with the proposed scheme. The ES then obtains the MV with an energy detector. Considering the randomness in the detected MVs due to the fading channel, path loss, and power control in the cell, we prove the convergence of learning in the presence of the proposed scheme for a non-convex loss function. We demonstrate that the proposed approach is robust against time-synchronization errors and power misalignment at the ES. We also show that it can be used with well-known peak-to-mean envelope power ratio (PMEPR) reduction techniques as it does not utilize the amplitude and the phase to encode the sign of local stochastic gradients. Finally, we evaluate the scheme by considering independent and identically distributed (IID) data and non-IID data where the data distribution is a function of the locations of EDs.


Notation: The complex and real numbers are denoted by $\mathbb{C}$ and $\mathbb{R}$, respectively. $\mathbb{E}[\cdot]$ is the expectation of its argument. $\mathbb{I}[\cdot]$ is the indicator function and $\mathbb{P}[\cdot]$ is the probability of its argument. The sign function is denoted by sign(⋅) and results in 1, −1, or ±1 at random for a positive, a negative, or a zero-valued argument, respectively.


System Model
A. Scenario

Consider a wireless network with K EDs that are connected to an ES, where each ED and the ES are equipped with a single antenna. We assume that the frequency synchronization in the network is done before the transmissions with a control mechanism, as done in 3GPP Fourth Generation (4G) Long Term Evolution (LTE) and/or Fifth Generation (5G) New Radio (NR) with random-access channel (RACH) and/or physical uplink control channel (PUCCH)[014]. In this study, we consider the fact that the time synchronization among the EDs is not ideal: the maximum difference between the times of arrival of the EDs' signals at the ES is $T_{\rm sync}$ seconds, which is equal to the reciprocal of the signal bandwidth.


In this study, the power alignment at the ES can be imperfect and the level of misalignment is controlled with a power control mechanism. We assume that the signal-to-noise ratio (SNR) of an ED at the ES is $1/\sigma_n^2$ at the reference distance $R_{\rm ref}$. We then set the received signal power of the kth ED at the ES as










$$P_k = \left(\frac{r_k}{R_{\rm ref}}\right)^{-(\alpha-\beta)} \tag{1}$$







where $r_k$ is the link distance between the kth ED and the ES, α is the path loss exponent, and β∈[0,α] is a coefficient that determines the amount of the path loss compensated. While β=0 means that there is no power control in the network, β=α leads to a system with perfect power alignment at the ES. We define the effective path loss exponent $\alpha_{\rm eff}$ as $\alpha_{\rm eff} \triangleq \alpha - \beta$.
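The effect of the compensation coefficient β on the received power in (1) can be checked with a few values. This is a small illustrative sketch; `received_power` is a hypothetical helper and the distances are example values only.

```python
def received_power(r_k, r_ref, alpha, beta):
    """Received power of Eq. (1), normalized to the power at the reference distance."""
    alpha_eff = alpha - beta  # effective path loss exponent
    return (r_k / r_ref) ** (-alpha_eff)


# An ED at 100 m with a 10 m reference distance and path loss exponent 4.
p_no_control = received_power(100.0, 10.0, alpha=4, beta=0)  # no power control
p_partial = received_power(100.0, 10.0, alpha=4, beta=2)     # partial compensation
p_perfect = received_power(100.0, 10.0, alpha=4, beta=4)     # perfect alignment
print(p_no_control, p_partial, p_perfect)
```

With β=α every ED arrives with the same power regardless of distance, whereas smaller β lets nearby EDs dominate the superposed signal.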


In this study, we assume that the EDs are deployed in a cell, where the cell radius is Rmax meters and the minimum distance between the ES and the EDs is Rmin meters for Rmin≥Rref. It is worth emphasizing that we do not consider the impact of multiple cells (e.g., inter-cell interference) or a more complicated large-scale channel model (e.g., shadowing) on learning in this work as our goal is to provide insights into the impact of power misalignment and the path loss on distributed learning with a tractable analysis.


B. Signal Model

In this study, for AirComp, the EDs access the wireless channel on the same time-frequency resources simultaneously with S OFDM symbols consisting of M active subcarriers. We assume that the cyclic prefix (CP) duration is larger than Tsync and the maximum-excess delays of the channel between the ES and the EDs. Considering independent frequency-selective channels between the EDs and the ES, the superposed symbol on the lth subcarrier of the mth OFDM symbol at the ES for the nth communication round of FEEL can be written as










$$r_{l,m}^{(n)} = \sum_{k=1}^{K} \sqrt{P_k}\, h_{k,l,m}^{(n)}\, t_{k,l,m}^{(n)} + n_{l,m}^{(n)} \tag{2}$$







where $h_{k,l,m}^{(n)} \in \mathbb{C}$ is the channel coefficient between the ES and the kth ED, $t_{k,l,m}^{(n)} \in \mathbb{C}$ is the transmitted symbol from the kth ED, and $n_{l,m}^{(n)}$ is the symmetric additive white Gaussian noise (AWGN) with zero mean and variance $\sigma_n^2$ on the lth subcarrier for l∈{0, 1, . . . , M−1} and m∈{0, 1, . . . , S−1}.


We consider the fact that the time synchronization at the receiver may not be precise. To model this, we assume that the synchronization point where the discrete Fourier transform (DFT) starts can deviate by Nerr samples within the CP window. Note that the uncertainty of the synchronization point within the CP window is often not an issue for traditional communications due to the channel estimation. However, it can cause a non-negligible impact on AirComp.


Let $x(t_{\rm time}) \in \mathbb{C}$ be a baseband OFDM symbol in continuous time for $t_{\rm time} \in [0, T_s)$, where $T_s$ is the OFDM symbol duration. We define the PMEPR of an OFDM symbol as








$$\frac{\max_{t_{\rm time}\in[0,T_s)} \left|x(t_{\rm time})\right|^2}{P_{\rm tx}},$$




where $P_{\rm tx} = \mathbb{E}[|x(t_{\rm time})|^2]$ is the mean-envelope power.
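The PMEPR definition above can be approximated numerically by oversampling one OFDM symbol. The sketch below is illustrative only: it uses a direct oversampled IDFT, replaces the expectation in the mean-envelope power with the sample mean over the symbol, and compares fully correlated subcarrier symbols with random QPSK symbols. The function name `pmepr` and the choice M=64 are assumptions.

```python
import cmath
import math
import random


def pmepr(freq_symbols, oversampling=4):
    """Peak-to-mean envelope power ratio of one OFDM symbol (linear scale)."""
    m = len(freq_symbols)
    n = m * oversampling  # oversampled IDFT size to approximate continuous time
    powers = []
    for t in range(n):
        x = sum(sym * cmath.exp(2j * math.pi * l * t / n)
                for l, sym in enumerate(freq_symbols))
        powers.append(abs(x) ** 2)
    # Sample mean over the symbol stands in for the mean-envelope power.
    return max(powers) / (sum(powers) / n)


rng = random.Random(1)
M = 64
qpsk = [cmath.exp(1j * (math.pi / 4 + rng.randrange(4) * math.pi / 2))
        for _ in range(M)]

pmepr_identical = pmepr([1 + 0j] * M)  # all subcarriers carry the same symbol
pmepr_random = pmepr(qpsk)             # randomized phases across subcarriers
print(pmepr_identical, pmepr_random)
```

When all subcarriers carry the same symbol, they add coherently at one time instant and the ratio reaches M; randomizing the phases breaks this alignment, which is the intuition behind using randomization symbols.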


C. Learning Model

Let $\mathcal{D}_k$ denote the local data containing labeled data samples at the kth ED as $\{(x_\ell, y_\ell)\} \in \mathcal{D}_k$ for k=1, . . . , K, where $x_\ell$ and $y_\ell$ are the ℓth data sample and its associated label, respectively. The centralized learning problem can be expressed as










$$w^* = \arg\min_{w} F(w) = \arg\min_{w} \frac{1}{|\mathcal{D}|} \sum_{\forall (x,y)\in\mathcal{D}} f(w, x, y) \tag{3}$$







where $\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2 \cup \dots \cup \mathcal{D}_K$, ƒ(w, x, y) is the sample loss function that measures the labeling error for (x, y) for the parameters $w = [w_1, \dots, w_q]^{\rm T} \in \mathbb{R}^q$, and q is the number of parameters. With full-batch gradient descent, a local optimum point can be obtained as






$$w^{(n+1)} = w^{(n)} - \eta\, g^{(n)} \tag{4}$$


where η is the learning rate and










$$g^{(n)} = \nabla F(w^{(n)}) = \frac{1}{|\mathcal{D}|} \sum_{\forall (x,y)\in\mathcal{D}} \nabla f(w^{(n)}, x, y) \tag{5}$$







where the ith element of the vector $g^{(n)}$ is the gradient of $F(w^{(n)})$ with respect to $w_i^{(n)}$.


In [011], in the context of parallel processing, distributed training by MV with signSGD is investigated to solve (3). In this method, for the nth communication round, the kth ED¹ first calculates the local stochastic gradient as











$$\tilde{g}_k^{(n)} = \nabla F_k(w^{(n)}) = \frac{1}{n_b} \sum_{\forall (x_\ell, y_\ell)\in\tilde{\mathcal{D}}_k} \nabla f(w^{(n)}, x_\ell, y_\ell) \tag{6}$$







where $\tilde{\mathcal{D}}_k \subset \mathcal{D}_k$ is the selected data batch from the local data set and $n_b = |\tilde{\mathcal{D}}_k|$ is the batch size. Instead of the actual values of local gradients, the EDs then send the signs of their local stochastic gradients, denoted as $\bar{g}_k^{(n)}$ for k=1, . . . , K, to the ES, where the ith element of the vector $\bar{g}_k^{(n)}$ is $\bar{g}_{k,i}^{(n)} \triangleq {\rm sign}(\tilde{g}_{k,i}^{(n)})$. The ES obtains the MV for the ith gradient as










$$v_i^{(n)} = {\rm sign}\left(\sum_{k=1}^{K} \bar{g}_{k,i}^{(n)}\right) \tag{7}$$







Subsequently, the ES pushes $v^{(n)} = [v_1^{(n)}, \dots, v_q^{(n)}]^{\rm T}$ to the EDs and the models at the EDs are updated as

$$w^{(n+1)} = w^{(n)} - \eta\, v^{(n)} \tag{8}$$

This procedure is repeated consecutively until a predetermined convergence criterion is achieved.

¹ We refer to the workers and parameter-server mentioned in [011] as EDs and ES, respectively, to describe distributed training by MV with signSGD.


For FEEL, the optimization problem can also be expressed as (3) in a scenario where the local data samples and their labels are not available at the ES and the link between an ED and the ES experiences independent frequency-selective fading channel. To solve (3) under these constraints, in this study, we adopt the same procedure summarized for the distributed training by the MV. With the motivations of eliminating the latency caused by orthogonal multiple access and enabling distributed training in mobile wireless networks, we propose a simple-but-effective AirComp scheme to detect the MV in fading channel without using CSI at the EDs and the ES.
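The distributed training by MV summarized in (6)-(8) can be illustrated with a toy problem. The sketch below is not the disclosed system: it assumes an ideal (error-free) vote aggregation, a hypothetical quadratic local loss $F_k(w) = \tfrac{1}{2}\|w - c_k\|^2$ so that the local gradient is simply $w - c_k$, and made-up values for the points $c_k$.

```python
import random


def sign(x, rng):
    """Sign with a random +/-1 outcome for a zero-valued argument."""
    if x > 0:
        return 1
    if x < 0:
        return -1
    return rng.choice((1, -1))


def majority_vote_round(w, centers, eta, rng):
    """One round: local gradients -> signs (votes) -> MV (7) -> update (8)."""
    q = len(w)
    # Toy local loss (an assumption): the local gradient of the kth ED is w - c_k.
    votes = [[sign(w[i] - c[i], rng) for i in range(q)] for c in centers]
    v = [sign(sum(vote[i] for vote in votes), rng) for i in range(q)]
    return [w[i] - eta * v[i] for i in range(q)]


rng = random.Random(0)
# Hypothetical local optima of K=5 EDs; the fourth ED is an outlier.
centers = [(1.0, -0.5), (1.2, -0.4), (0.8, -0.7), (3.0, 2.0), (1.0, -0.5)]
w = [0.0, 0.0]
eta = 0.05
for _ in range(200):
    w = majority_vote_round(w, centers, eta, rng)
print(w)
```

In this toy setting the iterate settles within one step size of the coordinate-wise median of the local optima, which shows how voting on signs limits the influence of a single outlying ED.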


FSK-Based Majority Vote
A. Edge Device—Transmitter

With the proposed AirComp scheme, the EDs perform a low-complexity operation to transmit the signs of the gradients given in (6): Let ƒ be a bijective function that maps i∈{1, 2, . . . , q} to the distinct pairs $(m^+, l^+)$ and $(m^-, l^-)$ for $m^+, m^- \in \{0, 1, \dots, S-1\}$ and $l^+, l^- \in \{0, 1, \dots, M-1\}$. Based on the value of $\bar{g}_{k,i}^{(n)}$, at the nth communication round, the kth ED calculates the symbols $t_{k,l^+,m^+}^{(n)}$ and $t_{k,l^-,m^-}^{(n)}$, ∀i, as










$$t_{k,l^+,m^+}^{(n)} = \begin{cases} \sqrt{E_s} \times s_{k,i}^{(n)}, & \bar{g}_{k,i}^{(n)} = 1 \\ 0, & \bar{g}_{k,i}^{(n)} = -1 \end{cases} \tag{9}$$








and









$$t_{k,l^-,m^-}^{(n)} = \begin{cases} 0, & \bar{g}_{k,i}^{(n)} = 1 \\ \sqrt{E_s} \times s_{k,i}^{(n)}, & \bar{g}_{k,i}^{(n)} = -1 \end{cases} \tag{10}$$







respectively, where Es=2 is a factor to normalize the symbol energy and sk,i(n) is a randomization symbol on the unit circle. Therefore, to indicate the sign of a local stochastic gradient, our scheme dedicates two subcarriers with (9) and (10), as opposed to modulating the phase of a subcarrier as done in OBDA. Also, we do not use TCI to compensate the impact of multipath channel on transmitted symbols as our goal is to exploit the energy accumulation on two different subcarriers to detect the MV with a non-coherent detector.


As a special case of ƒ, if $m^- = m^+$ and $l^- = l^+ + 1$ hold for all i, the adjacent subcarriers of the $m^+$th OFDM symbol form the options for a vote, which corresponds to frequency-shift keying (FSK) over OFDM subcarriers. In this case, the kth ED's vote for the ith gradient becomes independent of its choice since the adjacent subcarriers are likely to experience similar channel conditions, i.e., $h_{k,l^+,m^+}^{(n)} \approx h_{k,l^++1,m^+}^{(n)}$. We refer to the MV calculation with the proposed scheme under this specific mapping as FSK-based MV (FSK-MV) in this study.


After the calculations of $t_{k,l^+,m^+}^{(n)}$ and $t_{k,l^-,m^-}^{(n)}$ for all i and k, the EDs calculate the OFDM symbols and transmit them based on the discussions in Section II.
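One possible realization of the FSK-MV mapping ƒ fills each OFDM symbol with adjacent subcarrier pairs before moving to the next symbol. The sketch below is only one hypothetical choice of ƒ: it uses a zero-based gradient index (the text indexes from 1), assumes an even number of active subcarriers, and returns the pairs in (m, l) order.

```python
import math


def fsk_mv_mapping(i, m_active):
    """Map a zero-based gradient index i to ((m, l_plus), (m, l_minus))."""
    if m_active % 2:
        raise ValueError("this sketch assumes an even number of active subcarriers")
    pairs_per_symbol = m_active // 2
    m = i // pairs_per_symbol            # OFDM symbol index
    l_plus = 2 * (i % pairs_per_symbol)  # subcarrier that carries a +1 vote
    return (m, l_plus), (m, l_plus + 1)  # adjacent subcarrier carries a -1 vote


q, M = 123090, 1200
S = math.ceil(2 * q / M)  # OFDM symbols needed for q gradients

# Check that every gradient gets its own two resources (the mapping is one-to-one).
resources = set()
for i in range(q):
    plus, minus = fsk_mv_mapping(i, M)
    resources.add(plus)
    resources.add(minus)
print(S, len(resources))
```

Other mappings, for example ones that place the two options on different OFDM symbols, are equally valid as long as the pairs remain distinct.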


B. Edge Server—Receiver

The receiver at the ES observes the superposed symbols at all subcarriers as expressed in (2). By using the mapping function ƒ, the superposed symbols for a given i can be shown as










$$r_{l^+,m^+}^{(n)} = \sqrt{E_s} \sum_{\forall k,\; \bar{g}_{k,i}^{(n)}=1} \sqrt{P_k}\, h_{k,l^+,m^+}^{(n)}\, s_{k,i}^{(n)} + n_{l^+,m^+}^{(n)} \tag{11}$$








and









$$r_{l^-,m^-}^{(n)} = \sqrt{E_s} \sum_{\forall k,\; \bar{g}_{k,i}^{(n)}=-1} \sqrt{P_k}\, h_{k,l^-,m^-}^{(n)}\, s_{k,i}^{(n)} + n_{l^-,m^-}^{(n)} \tag{12}$$







respectively. The receiver at the ES detects the MV for the ith gradient with an energy detector as






$$v_i^{(n)} = {\rm sign}\left(\Delta_i^{(n)}\right) \tag{13}$$


where $\Delta_i^{(n)} \triangleq e_i^+ - e_i^-$ for $e_i^+ \triangleq |r_{l^+,m^+}^{(n)}|_2^2$ and $e_i^- \triangleq |r_{l^-,m^-}^{(n)}|_2^2$, ∀i. It is worth mentioning that we do not use any method to resolve the interference in (11) and (12) among the EDs as we are not interested in the sign of an individual local gradient. On the contrary, we exploit the interference for aggregation and compare the amount of energy on two different subcarriers to detect the MV in (13). The transmitter and receiver block diagrams are provided in FIG. 1, based on the aforementioned discussions.


The proposed scheme leads to a fundamentally different training strategy since it determines the correct MV in (7) probabilistically by comparing $e_i^+$ and $e_i^-$. To elaborate on this, assume that the multipath channels between the ES and the EDs are independent. Let $K_i^+$ and $K_i^- = K - K_i^+$ be the number of EDs that vote for 1 and −1 for the ith gradient, respectively.


Lemma 1. $\mathbb{E}[e_i^+]$ and $\mathbb{E}[e_i^-]$ can be calculated as

$$\mu_i^+ \triangleq \mathbb{E}[e_i^+] = E_s K_i^+ \lambda + \sigma_n^2 \tag{14}$$

and

$$\mu_i^- \triangleq \mathbb{E}[e_i^-] = E_s K_i^- \lambda + \sigma_n^2 \tag{15}$$


respectively, where









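$$\lambda = \begin{cases} \dfrac{2R_{\rm ref}^{\alpha_{\rm eff}}}{R_{\max}^2 - R_{\min}^2}\, \dfrac{R_{\min}^{2-\alpha_{\rm eff}} - R_{\max}^{2-\alpha_{\rm eff}}}{\alpha_{\rm eff} - 2}, & \alpha_{\rm eff} \neq 2 \\[2ex] \dfrac{2R_{\rm ref}^{\alpha_{\rm eff}}}{R_{\max}^2 - R_{\min}^2}\, \ln\dfrac{R_{\max}}{R_{\min}}, & \alpha_{\rm eff} = 2 \end{cases} \tag{16}$$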







Proof: Since (11) is a weighted summation of independent complex Gaussian random variables with zero mean and unit variance (i.e., channel coefficients), rl+,m+(n) is a zero mean random variable, where its variance is










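$$\mu_i^+ = \mathbb{E}[e_i^+] = \mathbb{E}\left[\left|r_{l^+,m^+}^{(n)}\right|_2^2\right] = \mathbb{E}\left[E_s \sum_{\forall k,\; \bar{g}_{k,i}^{(n)}=1} \left(\frac{r_k}{R_{\rm ref}}\right)^{-\alpha_{\rm eff}} + \sigma_n^2\right] = E_s K_i^+\, \mathbb{E}\left[\left(\frac{r_k}{R_{\rm ref}}\right)^{-\alpha_{\rm eff}}\right] + \sigma_n^2. \tag{17}$$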







To calculate (17), we need to calculate the expected value of $y = r^{-\alpha_{\rm eff}}$. Assuming that the EDs are located uniformly within the cell, the link distance distribution can be expressed as










$$f(r) = \frac{2r}{R_{\max}^2 - R_{\min}^2} \tag{18}$$







Hence, the distribution of y can be obtained as













$$f(y) = \left.\frac{f(r)}{\left|\dfrac{dy}{dr}\right|}\right|_{r = y^{-1/\alpha_{\rm eff}}} = \frac{2\, y^{-\frac{\alpha_{\rm eff}+2}{\alpha_{\rm eff}}}}{\left(R_{\max}^2 - R_{\min}^2\right)\alpha_{\rm eff}} \tag{19}$$







By using (19), the expected value of y can be calculated as (16). The same analysis can be done for $\mu_i^-$.
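The closed form in (16) can be cross-checked against a direct numerical evaluation of the expectation in (17) under the link distance distribution (18). The following sketch is illustrative; the function names are hypothetical, and the cell parameters (10 m to 100 m, reference distance 10 m) are example values.

```python
import math


def path_loss_factor(r_min, r_max, r_ref, alpha_eff):
    """Closed-form lambda of Eq. (16)."""
    c = 2 * r_ref ** alpha_eff / (r_max ** 2 - r_min ** 2)
    if alpha_eff == 2:
        return c * math.log(r_max / r_min)
    return c * (r_min ** (2 - alpha_eff) - r_max ** (2 - alpha_eff)) / (alpha_eff - 2)


def path_loss_factor_numeric(r_min, r_max, r_ref, alpha_eff, steps=100000):
    """Midpoint-rule estimate of E[(r / r_ref)^(-alpha_eff)] under Eq. (18)."""
    dr = (r_max - r_min) / steps
    total = 0.0
    for j in range(steps):
        r = r_min + (j + 0.5) * dr
        pdf = 2 * r / (r_max ** 2 - r_min ** 2)
        total += (r / r_ref) ** (-alpha_eff) * pdf * dr
    return total


lam_ideal = path_loss_factor(10, 100, 10, 0)    # perfect power control
lam_partial = path_loss_factor(10, 100, 10, 2)  # e.g., alpha = 4 and beta = 2
lam_none = path_loss_factor(10, 100, 10, 4)     # e.g., alpha = 4 and beta = 0
print(lam_ideal, lam_partial, lam_none)
```

For these example values, λ drops from 1 under ideal power control to a few percent under partial compensation, which quantifies how strongly path loss can weaken the energy contributed per vote.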


Based on Lemma 1, (13) is likely to obtain the correct MV because $\mu_i^+$ and $\mu_i^-$ are linear functions of $K_i^+$ and $K_i^-$, respectively. However, the detection performance depends on the parameter λ∈[0, 1] that captures the impacts of power control, path loss, and cell size on $e_i^+$ and $e_i^-$. In FIG. 6, we plot λ for different cell sizes for a given $\alpha_{\rm eff}$. For a better power control or a smaller cell size, the parameter λ increases to 1, which implies a better detection performance under noise. On the other hand, the MV is not deterministic even for $\sigma_n^2 = 0$. Hence, the convergence for a non-convex loss function F(w) needs to be shown to justify that the proposed scheme is suitable for FEEL.


C. Convergence in Fading Channel

We consider several standard assumptions made in the literature for the convergence analysis[10], [11]:


Assumption 1 (Bounded loss function). F(w)≥F*, ∀w.


Assumption 2 (Smoothness). Let g be the gradient of F(w) evaluated at w. For all w and w′, the expression given by









$$\left|F(w') - \left(F(w) + g^{\rm T}(w' - w)\right)\right| \le \frac{1}{2} \sum_{i=1}^{q} L_i \left(w_i' - w_i\right)^2$$





holds for a non-negative constant vector L=[L1, . . . , Lq]T.


Assumption 3 (Variance bound). The stochastic gradient estimates $\{\tilde{g}_k = [\tilde{g}_{k,1}, \dots, \tilde{g}_{k,q}]^{\rm T} = \nabla F_k(w^{(n)})\}$, ∀k, are independent and unbiased estimates of $g = [g_1, \dots, g_q]^{\rm T} = \nabla F(w)$ with a coordinate bounded variance, i.e.,






$$\mathbb{E}[\tilde{g}_k] = g, \quad \forall k \tag{20}$$

$$\mathbb{E}\left[\left(\tilde{g}_{k,i} - g_i\right)^2\right] \le \sigma_i^2 / n_b, \quad \forall k, i \tag{21}$$


where $\sigma = [\sigma_1, \dots, \sigma_q]^{\rm T}$ is a non-negative constant vector.


Assumption 4 (Unimodal, symmetric gradient noise). For any given w, the elements of the vector $\tilde{g}_k$, ∀k, have a unimodal distribution that is also symmetric around its mean.


We also assume that the parameters ei+ and ei are exponential random variables, where their means are μi+ and μi, respectively. This assumption holds true when the power control is ideal under IID Rayleigh fading. It is a weak assumption under imperfect power control due to the central limit theorem.


By extending our theorem in [015] with the considerations of path loss, power control, and cell size, the convergence rate in the presence of FSK-MV can be obtained as follows:


Theorem 1. For $n_b = N/\gamma$ and $\eta = 1/\sqrt{\|L\|_1 n_b}$, the convergence rate of the distributed training by the MV based on FSK in fading channel is










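$$\mathbb{E}\left[\frac{1}{N} \sum_{n=0}^{N-1} \left\|g^{(n)}\right\|_1\right] \le \frac{1}{\sqrt{N}} \left(a \sqrt{\|L\|_1} \left(F(w^{(0)}) - F^* + \frac{\gamma}{2}\right) + \frac{2\sqrt{2}}{3} \sqrt{\gamma}\, \|\sigma\|_1\right) \tag{22}$$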







where γ is a positive integer, $a = \left(1 + \dfrac{2}{\xi K}\right) \dfrac{1}{\sqrt{\gamma}}$ for $\xi = \dfrac{E_s \lambda}{\sigma_n^2}$, and λ∈[0, 1] given in (16) is a parameter that captures the parameters related to the path loss, power control, and cell size.



FIG. 6 graphically illustrates the impact of cell size and the effective path loss exponent on λ.


The proof of Theorem 1 is given in the appendix.


Based on Theorem 1, we can infer the following: 1) For a larger SNR (i.e., a larger $1/\sigma_n^2$) and a larger number of EDs (i.e., a larger K), the convergence rate with FSK-MV in fading channel improves since a decreases. 2) The power control results in a better convergence rate since λ increases with a lower $\alpha_{\rm eff}$. 3) Another way of improving the convergence rate is to reduce the cell size, yielding a larger λ as illustrated in FIG. 6. However, this indicates a practical limitation of a single-cell FEEL: the number of EDs may be smaller for a smaller cell, whereas the power control becomes a harder task for a larger cell. 4) Finally, under ideal power control, the convergence rate becomes asymptotically similar to the one with signSGD in an ideal channel[11, Theorem 1].


D. Comparisons

1) Robustness against Time-Varying Fading Channel: As opposed to the approaches in [007] and [010], the proposed scheme does not utilize the CSI for TCI at the EDs. Hence, it is compatible with time-varying channels (e.g., mobile networks[016]) and does not lose gradient information due to TCI. As a trade-off, it quadruples the number of time-frequency resources for AirComp as compared to OBDA in [010]. As compared to the approaches in [012] and [013], the proposed scheme also does not require CSI at the ES or multiple antennas.


2) Robustness against Time-Synchronization Errors: As demonstrated in Section IV, the proposed scheme provides immunity against the time-synchronization errors. This is because the timing misalignment among the EDs or the uncertainty on the receiver synchronization within the CP window cause phase rotations in the frequency domain and FSK-MV does not encode information on the amplitude or phase. Also, the proposed scheme does not use any channel-related information at the EDs and the ES. Hence, FSK-MV is more robust against time-synchronization errors as compared to OBDA.
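The phase-rotation argument above can be illustrated for the receiver-side synchronization uncertainty. The sketch below assumes a common timing offset of `tau` samples within the CP window, which rotates subcarrier l by $e^{-j2\pi l\tau/N}$; the FFT size, the received values, and the function name `energy_detect` are hypothetical, and ties are resolved to −1 for simplicity.

```python
import cmath
import math
import random


def energy_detect(rx_pairs):
    """FSK-MV decision per gradient: sign of the energy difference, as in (13)."""
    return [1 if abs(rp) ** 2 > abs(rm) ** 2 else -1 for rp, rm in rx_pairs]


rng = random.Random(3)
# Hypothetical superposed symbols on the (l+, l-) subcarriers of eight gradients.
rx = [(complex(rng.gauss(0, 1), rng.gauss(0, 1)),
       complex(rng.gauss(0, 1), rng.gauss(0, 1))) for _ in range(8)]

# A common timing offset of tau samples rotates subcarrier l by exp(-j*2*pi*l*tau/N).
n_fft, tau = 2048, 3
rotated = [(rp * cmath.exp(-2j * math.pi * (2 * i) * tau / n_fft),
            rm * cmath.exp(-2j * math.pi * (2 * i + 1) * tau / n_fft))
           for i, (rp, rm) in enumerate(rx)]

same_decisions = energy_detect(rx) == energy_detect(rotated)
print(same_decisions)
```

Because the detector only looks at magnitudes, the rotated symbols yield the same votes, whereas a detector that reads the sign of the real and imaginary parts would be affected by the same rotation.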


3) Robustness against Power-Amplifier Non-linearity: The proposed scheme separates the options for voting over two different resources identified in time and frequency. Hence, it allows one to choose $s_{k,i}^{(n)}$ based on specific purposes. In this study, we use random QPSK symbols to reduce PMEPR by decreasing the correlation in the frequency domain[017]. OBDA has not been investigated in terms of PMEPR in the literature. As shown in Section IV, OBDA can suffer from high PMEPR, while the proposed scheme reduces PMEPR with a simple randomization technique. Also, FSK-MV does not require a long-term transmission power constraint as introduced for OBDA[010, Eq. 9 and Eq. 10] since the $\ell_2$-norm of the OFDM symbols does not change as a function of CSI with FSK-MV.


Numerical Results

For the numerical results, we consider the learning task of handwritten-digit recognition in a single cell with K=50 EDs for $R_{\min}=10$ meters and $R_{\max}=100$ meters. We assume that the path loss exponent is α=4. To demonstrate the impact of the imperfect power control on distributed learning, we choose β∈{2, 4} and set the SNR, i.e., $1/\sigma_n^2$, to be 20 dB at $R_{\rm ref}=10$ meters. The link distance between the kth ED and the ES is set to $r_k = \sqrt{R_{\min}^2 + (k-1)(R_{\max}^2 - R_{\min}^2)/(K-1)}$ based on (18). For the fading channel, we consider ITU Extended Pedestrian A (EPA) with no mobility and regenerate the channels between the ES and the EDs independently for each communication round to capture the long-term channel variations. The subcarrier spacing is set to 15 kHz. We use M=1200 subcarriers (i.e., the signal bandwidth is 18 MHz). In the case of imperfect time synchronization, we assume that the maximum difference between the times of arrival of the ED signals is $T_{\rm sync}=55.6$ ns and the synchronization uncertainty at the ES is $N_{\rm err}=3$ samples. Otherwise, these parameters are set to 0.
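The numbers in this setup are mutually consistent, which can be verified with a short calculation. The sketch below uses hypothetical variable and function names and only the parameters stated above.

```python
import math


def link_distance(k, num_eds, r_min, r_max):
    """Link distance of the kth ED (k = 1, ..., K), spaced uniformly in r^2."""
    return math.sqrt(r_min ** 2 + (k - 1) * (r_max ** 2 - r_min ** 2) / (num_eds - 1))


K, R_MIN, R_MAX = 50, 10.0, 100.0
distances = [link_distance(k, K, R_MIN, R_MAX) for k in range(1, K + 1)]

bandwidth_hz = 1200 * 15e3       # M subcarriers times the subcarrier spacing
t_sync_ns = 1e9 / bandwidth_hz   # reciprocal of the signal bandwidth, in ns
print(distances[0], distances[-1], bandwidth_hz / 1e6, round(t_sync_ns, 1))
```

Spacing the EDs uniformly in $r^2$ mirrors the link distance distribution in (18), where the density grows linearly with r.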


For the local data at the EDs, we use the MNIST database that contains labeled handwritten-digit images of size 28×28 from digit 0 to digit 9.² We consider both IID data and non-IID data in the cell. To prepare the data, we first choose $|\mathcal{D}|=25000$ training images from the database, where each digit has 2500 distinct images. For the scenario with the IID data, we assume that each ED has 50 distinct images for each digit. For the scenario with the non-IID data, we assume that the distribution of the images depends on the locations of the EDs to test the FEEL in a more challenging scenario. To this end, we divide the cell into 5 areas with concentric circles, and the EDs located in the uth area have the data samples with the labels {u−1, u, 1+u, 2+u, 3+u, 4+u} for u∈{1, . . . , 5}. Hence, the availability of the labels gradually changes based on the link distance. The areas between two adjacent concentric circles are identical and the number of EDs in each area is 10. The radii of the concentric circles are {10, 45.6, 63.7, 77.7, 89.6, 100} meters. The IID and non-IID data distributions considered for the numerical analyses are illustrated in FIGS. 7A and 7B, respectively. In particular, FIG. 7A illustrates IID data in the cell, where all EDs have data samples for 10 different digits. FIG. 7B illustrates non-IID data in the cell, where the available digits at the EDs change based on their locations in the cell; the digits in an area are shown in FIG. 7B.

² For FEEL, the data samples are generated at the EDs. We distribute the data samples in the MNIST database to the EDs to generate representative results for FEEL.
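The radii and the per-area ED counts quoted above follow from requiring equal-area rings between 10 m and 100 m. A minimal sketch that reproduces them, with hypothetical variable names and the link-distance rule $r_k$ from the simulation setup:

```python
import math

R_MIN, R_MAX, NUM_AREAS, K = 10.0, 100.0, 5, 50

# Equal-area rings: the squared radii are equally spaced between R_MIN^2 and R_MAX^2.
step = (R_MAX ** 2 - R_MIN ** 2) / NUM_AREAS
radii = [math.sqrt(R_MIN ** 2 + u * step) for u in range(NUM_AREAS + 1)]
rounded_radii = [round(r, 1) for r in radii]

# Labels available in the uth area, u = 1, ..., 5.
labels = {u: list(range(u - 1, u + 5)) for u in range(1, NUM_AREAS + 1)}

# Count the EDs per area using r_k = sqrt(R_MIN^2 + (k-1)(R_MAX^2 - R_MIN^2)/(K-1)).
counts = [0] * NUM_AREAS
for k in range(1, K + 1):
    r_k = math.sqrt(R_MIN ** 2 + (k - 1) * (R_MAX ** 2 - R_MIN ** 2) / (K - 1))
    area = next(u for u in range(1, NUM_AREAS + 1) if r_k <= radii[u])
    counts[area - 1] += 1
print(rounded_radii, counts)
```

Every digit appears in at least one area, but the digits 0 and 9 are each available in only a single area (the innermost and the outermost, respectively), which is what makes the non-IID scenario sensitive to power misalignment.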


For the model, we consider a convolutional neural network (CNN) that includes one 5×5 and two 3×3 convolutional layers, each of which is followed by a batch normalization layer and a rectified-linear unit (ReLU) activation. All convolutional layers have 20 filters. After the third ReLU, a fully connected layer with 10 units and a softmax layer are utilized. At the input layer, no normalization is applied. Our model, outlined in Table I, has q=123090 learnable parameters, which corresponds to S=206 and S=52 OFDM symbols for FSK-MV and OBDA [10], respectively. For TCI, the truncation threshold is 0.2 and we assume that CSI is available at the EDs. For the update rule, the learning rate is set to 0.01. The batch size $n_b$ is set to 64. For the test accuracy calculations, we use the 10000 test samples available in the MNIST database.
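The parameter count and the numbers of OFDM symbols above can be cross-checked with a short sketch. Two points are assumptions made only for this example: the 5×5 layer is taken to be unpadded and the 3×3 layers to preserve the feature-map size (a choice that reproduces q=123090), and OBDA is taken to carry two gradient signs per subcarrier, whereas FSK-MV uses two subcarriers per gradient sign. All function names are hypothetical.

```python
import math


def conv_params(kernel: int, in_ch: int, out_ch: int) -> int:
    """Weights plus biases of a square convolutional layer."""
    return kernel * kernel * in_ch * out_ch + out_ch


def batchnorm_params(channels: int) -> int:
    """Learnable scale and offset of a batch normalization layer."""
    return 2 * channels


FILTERS = 20
IMAGE_SIZE = 28
M = 1200  # active subcarriers per OFDM symbol

# Assumption: unpadded 5x5 convolution, so 28x28 shrinks to 24x24;
# the 3x3 convolutions are assumed to keep the 24x24 size.
feature_size = IMAGE_SIZE - 5 + 1

q = (
    conv_params(5, 1, FILTERS) + batchnorm_params(FILTERS)
    + 2 * (conv_params(3, FILTERS, FILTERS) + batchnorm_params(FILTERS))
    + feature_size * feature_size * FILTERS * 10 + 10  # fully connected layer
)

s_fsk_mv = math.ceil(2 * q / M)   # two subcarriers per gradient sign
s_obda = math.ceil(q / (2 * M))   # assumed two signs per subcarrier

print(q, s_fsk_mv, s_obda)
```

The factor of four between the two symbol counts is the time-frequency resource cost that FSK-MV pays for non-coherent detection.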


In FIGS. 8A-8D, we provide the test accuracy results for IID/non-IID data in the cell by taking time-synchronization errors and imperfect power control into account. In particular, FIG. 8A illustrates IID data with ideal power control (αeff=0), FIG. 8B illustrates IID data with imperfect power control (αeff=2), FIG. 8C illustrates non-IID data with ideal power control (αeff=0), and FIG. 8D illustrates non-IID data with imperfect power control (αeff=2). For the same configurations, we provide the local loss values at the EDs as a function of link distance in FIGS. 9A-9D after N=500 communication rounds. In particular, FIGS. 9A-9D illustrate local loss versus link distance. For non-IID data, the data samples are a function of the locations of the EDs. Since the received signal powers of the cell-edge EDs are dominated by those of the nearby EDs, only the data samples at the nearby EDs are learned. For this analysis, ideal time synchronization is assumed in order to provide the results for OBDA. The available labels are indicated as { . . . }. In particular, FIG. 9A illustrates IID data with ideal power control (αeff=0), FIG. 9B illustrates IID data with imperfect power control (αeff=2), FIG. 9C illustrates non-IID data with ideal power control (αeff=0), and FIG. 9D illustrates non-IID data with imperfect power control (αeff=2).


In FIGS. 8A-8B, we consider the IID data in the cell. We evaluate the scenarios with the non-IID data in FIGS. 8C-8D. For FIG. 8A, the power alignment at the ES is assumed to be perfect (i.e., αeff=0). The results in this figure indicate that OBDA works well when the time synchronization is ideal and the CSI is available at the EDs. However, OBDA without TCI, or its use under imperfect time synchronization, causes drastic reductions in performance. On the other hand, FSK-MV is robust against time-synchronization errors and results in a high test accuracy without using CSI at the EDs, as it is based on non-coherent detection and dedicates two orthogonal resources to indicate the sign of the gradient. In FIGS. 8B-8D, we observe the same trends for OBDA and FSK-MV. However, the maximum test accuracy is highly affected by the data distribution and the power control. In FIG. 8B, the power alignment at the ES is not ideal (i.e., αeff=2).


Although the test accuracy with OBDA with TCI (with ideal synchronization) or FSK-MV (with/without ideal synchronization) reaches 95%, FIG. 9B indicates that the local losses at the EDs increase as compared to the ones in FIG. 9A. In this scenario, the distributed learning exploits the IID data in the cell, which also benefits the cell-edge EDs, as their data distributions are similar to those at the nearby EDs. In FIG. 8C, we see the impact of the non-IID data on the test accuracy. Although the power alignment is ideal in this case, the maximum test accuracy reduces from 95% to 75%. We observe more degradation in accuracy in FIG. 8D, where the power control is not ideal. In FIGS. 9C-9D, we can identify the digits that are not learned well. In the case of ideal power control, based on FIG. 9C, we observe that the digit 0 and the digit 9 are not learned well since these digits are available at a smaller number of EDs as compared to other digits. Hence, the MV is highly biased. A similar issue arises when the power control is not perfect. As shown in FIG. 9D, the local loss tends to increase with the distance, i.e., the data at the cell-edge EDs are not learned. As the received signal powers of the cell-edge EDs are weak as compared to the ones of the nearby EDs, the MV is again biased toward the local data of the nearby EDs. Therefore, the digits available at the cell-edge EDs, e.g., digits 6, 7, 8, and 9, are not learned well. Both issues in the case of non-IID data indicate that an adaptive learning strategy that takes the bias in the MV into account (e.g., through an adaptive ED selection or a power control based on the label distribution) is needed for achieving a higher test accuracy. Finally, we compare the PMEPR distributions in FIG. 10 for OBDA and FSK-MV. Since the proposed scheme introduces randomness in the frequency domain with the randomization symbols, it exhibits a similar behavior to a typical OFDM transmission in terms of PMEPR. On the other hand, OBDA can cause a substantially high PMEPR for OFDM as the signs of the gradients can be highly correlated.
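The PMEPR behavior described above can be illustrated with a small sketch for the extreme case in which all gradient signs are identical. It is a simplified toy model rather than the simulation behind FIG. 10: it uses 64 subcarriers, a naive oversampled inverse DFT, 4-QAM symbols for the OBDA-like mapping, and random QPSK randomization symbols for the FSK-MV-like mapping. These choices and all names are assumptions made for illustration.

```python
import cmath
import math
import random


def ofdm_time_samples(symbols, oversampling=4):
    """Naive oversampled inverse DFT of one OFDM symbol (unnormalized)."""
    n_fft = oversampling * len(symbols)
    return [
        sum(s * cmath.exp(2j * math.pi * k * n / n_fft)
            for k, s in enumerate(symbols))
        for n in range(n_fft)
    ]


def pmepr(symbols, oversampling=4):
    """Peak-to-mean envelope power ratio (linear scale)."""
    power = [abs(x) ** 2 for x in ofdm_time_samples(symbols, oversampling)]
    return max(power) / (sum(power) / len(power))


def obda_like_symbols(signs):
    """Assumed mapping: two gradient signs per 4-QAM symbol."""
    return [complex(signs[i], signs[i + 1]) / math.sqrt(2)
            for i in range(0, len(signs), 2)]


def fsk_mv_like_symbols(signs, rng):
    """Assumed mapping: two subcarriers per sign, one active per vote."""
    symbols = []
    for s in signs:
        r = cmath.exp(2j * math.pi * rng.randrange(4) / 4)  # randomization
        symbols.extend([r, 0j] if s > 0 else [0j, r])
    return symbols


rng = random.Random(0)
M_DEMO = 64  # subcarriers in this toy example

pmepr_obda = pmepr(obda_like_symbols([1] * (2 * M_DEMO)))
pmepr_fsk_mv = pmepr(fsk_mv_like_symbols([1] * (M_DEMO // 2), rng))

print(f"OBDA-like PMEPR:   {10 * math.log10(pmepr_obda):.1f} dB")
print(f"FSK-MV-like PMEPR: {10 * math.log10(pmepr_fsk_mv):.1f} dB")
```

With fully correlated signs, the OBDA-like symbol places the same value on every subcarrier, so all subcarriers add coherently at one time instant and the PMEPR equals the number of subcarriers. The randomization symbols break this alignment.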


Concluding Remarks

In this study, we propose an effective AirComp scheme for FEEL. The proposed scheme relies on distributed learning by the MV with signSGD in fading channels. As compared to the state-of-the-art solutions on AirComp, it uses different subcarriers and/or OFDM symbols to indicate the sign of the local stochastic gradients. Thus, it allows the receiver at the ES to detect the MV with a non-coherent detector and eliminates the need for CSI at the EDs by exploiting the non-coherent energy accumulation on the subcarriers. We also prove the convergence of the distributed learning by taking path loss, power control, and cell size into account. Through simulations, we demonstrate that the proposed method can provide a high test accuracy in fading channels even when the power control and the time synchronization are imperfect, while resulting in an acceptable PMEPR distribution at the expense of a larger number of time and frequency resources. We also provide insights into the scenarios where the local data distribution depends on the locations of the EDs and demonstrate the impact of non-IID data on the distributed learning when the power control is not ideal. Our results indicate that adaptive learning methods that consider the bias in the MV due to the non-IID data and/or imperfect power control are required for achieving a higher test accuracy.
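The non-coherent MV detection summarized above can be sketched for a single gradient element as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: each ED sees an independent unit-variance Rayleigh coefficient (i.e., ideal power control), the randomization symbol is a random unit-modulus phase, and the function names are hypothetical.

```python
import cmath
import math
import random


def rayleigh_coefficient(rng: random.Random) -> complex:
    """Zero-mean, unit-variance complex Gaussian fading coefficient."""
    std = math.sqrt(0.5)
    return complex(rng.gauss(0.0, std), rng.gauss(0.0, std))


def fsk_mv_detect(local_signs, rng, noise_var=0.0):
    """Detect the MV (+1 or -1) for one gradient element without CSI.

    Each ED transmits a randomization symbol on the "plus" or the "minus"
    subcarrier according to the sign of its local gradient. The ES compares
    the energies received on the two subcarriers.
    """
    noise_std = math.sqrt(noise_var / 2.0)
    rx = {+1: 0j, -1: 0j}  # superposed symbols on the two subcarriers
    for sign in local_signs:
        randomization = cmath.exp(2j * math.pi * rng.random())
        rx[sign] += rayleigh_coefficient(rng) * randomization
    for key in rx:
        rx[key] += complex(rng.gauss(0.0, noise_std), rng.gauss(0.0, noise_std))
    delta = abs(rx[+1]) ** 2 - abs(rx[-1]) ** 2
    return 1 if delta > 0 else -1


rng = random.Random(1)
votes = [+1] * 40 + [-1] * 10   # 40 of K=50 EDs vote for the positive sign
trials = 4000
correct = sum(fsk_mv_detect(votes, rng) == 1 for _ in range(trials))
print(f"fraction of rounds with the correct MV: {correct / trials:.3f}")
```

Because the energies accumulate non-coherently, the detected MV is correct only with a probability that grows with the margin of the vote, which is the effect quantified in the convergence analysis.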


APPENDIX PROOF OF THEOREM 1

Proof: The proof of Theorem 1 relies on a well-known strategy of relating the norm of the gradient of the loss function $F(\mathbf{w})$ to the expected improvement made in a single step, as described in [11]. Let $\mathbf{g}^{(n)}$ be the gradient of $F(\mathbf{w}^{(n)})$ (i.e., the true gradient). By using Assumption 2 and (13), we can state that

$$
F(\mathbf{w}^{(n+1)}) - F(\mathbf{w}^{(n)}) \le -\eta\, {\mathbf{g}^{(n)}}^{\mathrm{T}} \mathbf{v}^{(n)} + \frac{\eta^2}{2}\|\mathbf{L}\|_1 = -\eta \|\mathbf{g}^{(n)}\|_1 + \frac{\eta^2}{2}\|\mathbf{L}\|_1 + 2\eta \sum_{i=1}^{q} |g_i^{(n)}|\, \mathbb{I}\!\left[\operatorname{sign}(\Delta_i^{(n)}) \ne \operatorname{sign}(g_i^{(n)})\right].
$$

Therefore,

$$
\mathbb{E}\!\left[F(\mathbf{w}^{(n+1)}) - F(\mathbf{w}^{(n)}) \,\middle|\, \mathbf{w}^{(n)}\right] \le -\eta \|\mathbf{g}^{(n)}\|_1 + \frac{\eta^2}{2}\|\mathbf{L}\|_1 + 2\eta \underbrace{\sum_{i=1}^{q} |g_i^{(n)}|\, \underbrace{\mathbb{P}\!\left[\operatorname{sign}(\Delta_i^{(n)}) \ne \operatorname{sign}(g_i^{(n)})\right]}_{=P_i^{\mathrm{err}}}}_{\text{Stochasticity-induced error}}.
$$

The main challenge is to obtain an upper bound on the stochasticity-induced error. To address this, assume that $\operatorname{sign}(g_i^{(n)})=1$. Let Z be a random variable counting the number of EDs with the correct decision, i.e., $\operatorname{sign}(\tilde{g}_{k,i}^{(n)})=\operatorname{sign}(g_i^{(n)})=1$. The random variable Z can then be modeled as the sum of K independent Bernoulli trials, i.e., a binomial variable with the success and failure probabilities given by

$$
P_i \triangleq \mathbb{P}\!\left[\operatorname{sign}(\tilde{g}_{k,i}^{(n)}) = \operatorname{sign}(g_i^{(n)})\right]
$$

$$
q_i \triangleq \mathbb{P}\!\left[\operatorname{sign}(\tilde{g}_{k,i}^{(n)}) \ne \operatorname{sign}(g_i^{(n)})\right]
$$

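respectively, for all k. This implies that

$$
P_i^{\mathrm{err}} = \sum_{K_i^+=0}^{K} \mathbb{P}\!\left[\operatorname{sign}(\Delta_i^{(n)}) \ne 1 \,\middle|\, Z = K_i^+\right] \mathbb{P}\!\left[Z = K_i^+\right],
$$

where

$$
\mathbb{P}\!\left[Z = K_i^+\right] = \binom{K}{K_i^+} P_i^{K_i^+} q_i^{K-K_i^+}.
$$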

To calculate $\mathbb{P}[\operatorname{sign}(\Delta_i^{(n)}) \ne 1 \mid Z = K_i^+]$, we use the distribution of $\Delta_i^{(n)}$, which can be obtained by using the properties of exponential random variables as

$$
f(\Delta_i^{(n)}) =
\begin{cases}
\dfrac{e^{\Delta_i^{(n)}/\mu_i^-}}{\mu_i^+ + \mu_i^-}, & \Delta_i^{(n)} \le 0 \\[2ex]
\dfrac{e^{-\Delta_i^{(n)}/\mu_i^+}}{\mu_i^+ + \mu_i^-}, & \Delta_i^{(n)} > 0
\end{cases}
\tag{23}
$$

Thus, by integrating (23) with respect to $\Delta_i^{(n)}$ over $(-\infty, 0]$,

$$
\mathbb{P}\!\left[\operatorname{sign}(\Delta_i^{(n)}) \ne 1 \,\middle|\, Z = K_i^+\right] = \frac{\mu_i^-}{\mu_i^+ + \mu_i^-} = \frac{(K - K_i^+) + 1/\xi}{K + 2/\xi}
\tag{24}
$$
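As a numerical sanity check of (24), the probability that the difference of two independent exponential random variables is non-positive can be estimated by Monte Carlo simulation. The sketch below assumes, as the second equality in (24) suggests, that the means of the energies on the two subcarriers are proportional to $K_i^+ + 1/\xi$ and $(K - K_i^+) + 1/\xi$. The numerical values and the function name are hypothetical.

```python
import random


def prob_nonpositive_difference(mu_plus, mu_minus, trials=200_000, seed=0):
    """Estimate P[E_plus - E_minus <= 0] for independent exponentials."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        e_plus = rng.expovariate(1.0 / mu_plus)    # mean mu_plus
        e_minus = rng.expovariate(1.0 / mu_minus)  # mean mu_minus
        if e_plus - e_minus <= 0:
            count += 1
    return count / trials


K, K_PLUS, XI = 50, 40, 100.0     # illustrative values only
mu_plus = K_PLUS + 1.0 / XI       # assumed up to a common scale factor
mu_minus = (K - K_PLUS) + 1.0 / XI

estimate = prob_nonpositive_difference(mu_plus, mu_minus)
closed_form = ((K - K_PLUS) + 1.0 / XI) / (K + 2.0 / XI)

print(f"Monte Carlo: {estimate:.4f}, closed form (24): {closed_form:.4f}")
```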







Hence, by using (24) and the properties of binomial coefficients,

$$
P_i^{\mathrm{err}} = \sum_{K_i^+=0}^{K} \frac{(K - K_i^+) + 1/\xi}{K + 2/\xi} \binom{K}{K_i^+} P_i^{K_i^+} q_i^{K-K_i^+} = \frac{\frac{1}{\xi K}}{1 + \frac{2}{K\xi}} + \frac{q_i}{1 + \frac{2}{K\xi}}
\tag{25}
$$
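The closed form in (25) can be checked numerically by evaluating the binomial sum directly with the conditional probability from (24). The sketch assumes that the success and failure probabilities sum to one. The function names and the numerical values are hypothetical.

```python
import math


def p_err_sum(K: int, q_i: float, xi: float) -> float:
    """Binomial sum of the conditional error probability in (24)."""
    p_i = 1.0 - q_i  # assumption: success and failure probabilities sum to 1
    total = 0.0
    for k_plus in range(K + 1):
        conditional = ((K - k_plus) + 1.0 / xi) / (K + 2.0 / xi)
        pmf = math.comb(K, k_plus) * p_i**k_plus * q_i ** (K - k_plus)
        total += conditional * pmf
    return total


def p_err_closed_form(K: int, q_i: float, xi: float) -> float:
    """Right-hand side of (25)."""
    denom = 1.0 + 2.0 / (K * xi)
    return (1.0 / (xi * K)) / denom + q_i / denom


for K, q_i, xi in [(50, 0.3, 100.0), (10, 0.05, 1.0), (200, 0.45, 10.0)]:
    print(K, q_i, xi, p_err_sum(K, q_i, xi), p_err_closed_form(K, q_i, xi))
```

The agreement follows from the mean of the binomial distribution: the expected number of EDs with the wrong decision is $K q_i$.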







Under Assumption 2 and Assumption 3, by using the derivations in [11], it can be shown that

$$
q_i \le \frac{\sqrt{2}\,\sigma_i}{3\,|g_i^{(n)}|\,\sqrt{n_b}}.
$$

Hence, an upper bound on the stochasticity-induced error can be obtained as

$$
\sum_{i=1}^{q} |g_i^{(n)}|\, P_i^{\mathrm{err}} \le \frac{\frac{1}{\xi K}}{1 + \frac{2}{K\xi}} \|\mathbf{g}^{(n)}\|_1 + \frac{1}{\sqrt{n_b}} \frac{\sqrt{2}/3}{1 + \frac{2}{K\xi}} \|\boldsymbol{\sigma}\|_1.
$$

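Based on Assumption 1,

$$
\begin{aligned}
F(\mathbf{w}^{(0)}) - F^* &\ge F(\mathbf{w}^{(0)}) - \mathbb{E}\!\left[F(\mathbf{w}^{(N)})\right] = \mathbb{E}\!\left[\sum_{n=0}^{N-1} F(\mathbf{w}^{(n)}) - F(\mathbf{w}^{(n+1)})\right] \\
&\ge \mathbb{E}\!\left[\sum_{n=0}^{N-1} \left( \frac{\eta}{1 + \frac{2}{K\xi}} \|\mathbf{g}^{(n)}\|_1 - \frac{\eta^2}{2}\|\mathbf{L}\|_1 - \frac{\eta}{\sqrt{n_b}} \frac{2\sqrt{2}/3}{1 + \frac{2}{K\xi}} \|\boldsymbol{\sigma}\|_1 \right)\right]
\end{aligned}
\tag{26}
$$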

By rearranging the terms in (26) and using the expressions for nb and η, (22) is reached.
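As a sketch of the intermediate algebra, assuming the bound on the stochasticity-induced error stated above, the coefficient of $\|\mathbf{g}^{(n)}\|_1$ in (26) follows from

$$
-\eta + 2\eta\,\frac{\frac{1}{\xi K}}{1 + \frac{2}{K\xi}} = -\eta\,\frac{1 + \frac{2}{K\xi} - \frac{2}{K\xi}}{1 + \frac{2}{K\xi}} = -\frac{\eta}{1 + \frac{2}{K\xi}},
$$

and dividing (26) by $N\eta/(1 + \frac{2}{K\xi})$ gives

$$
\mathbb{E}\!\left[\frac{1}{N}\sum_{n=0}^{N-1} \|\mathbf{g}^{(n)}\|_1\right] \le \left(1 + \frac{2}{K\xi}\right)\left(\frac{F(\mathbf{w}^{(0)}) - F^*}{\eta N} + \frac{\eta}{2}\|\mathbf{L}\|_1\right) + \frac{2\sqrt{2}}{3\sqrt{n_b}}\|\boldsymbol{\sigma}\|_1 .
$$

The specific choices of $n_b$ and $\eta$ that are then substituted are those stated with Theorem 1 and are not repeated here.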


While certain embodiments of the disclosed subject matter have been described using specific terms, such description is for illustrative purposes only, and it is to be understood that changes and variations may be made without departing from the spirit or scope of the subject matter. The patentable scope of the presently disclosed subject matter is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they include structural and/or step elements that do not differ from the literal language of the claims, or if they include equivalent structural and/or elements or steps with insubstantial differences from the literal language of the claims.


REFERENCES



  • [1] T. Gafni, N. Shlezinger, K. Cohen, Y. C. Eldar, and H. V. Poor, “Federated learning: A signal processing perspective,” 2021. [Online]. Available: arXiv:2103.17150

  • [2] M. Chen, D. Gündüz, K. Huang, W. Saad, M. Bennis, A. V. Feljan, and H. V. Poor, “Distributed learning in wireless networks: Recent progress and future challenges,” 2021. [Online]. Available: arXiv:2104.02151

  • [3] M. Goldenbaum, H. Boche, and S. Stańczak, “Harnessing interference for analog function computation in wireless sensor networks,” IEEE Trans. Signal Process., vol. 61, no. 20, pp. 4893-4906, October 2013.

  • [4] W. Liu, X. Zang, Y. Li, and B. Vucetic, “Over-the-air computation systems: Optimization, analysis and scaling laws,” IEEE Trans. Wireless Commun., vol. 19, no. 8, pp. 5488-5502, August 2020.

  • [5] B. Nazer and M. Gastpar, “Computation over multiple-access channels,” IEEE Trans. Inf. Theory, vol. 53, no. 10, pp. 3498-3516, October 2007.

  • [6] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491-506, January 2020.

  • [7] G. Zhu, Y. Du, D. Gündüz, and K. Huang, “One-bit over-the-air aggregation for communication-efficient federated edge learning: Design and convergence analysis,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 2120-2135, November 2021.

  • [8] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signSGD: Compressed optimisation for non-convex problems,” in Proc. in International Conference on Machine Learning, vol. 80. Proceedings of Machine Learning Research, 10-15 Jul. 2018, pp. 560-569.

  • [9] T. Sery, N. Shlezinger, K. Cohen, and Y. C. Eldar, “Over-the-air federated learning from heterogeneous data,” 2020. [Online]. Available: arXiv:2009.12787

  • [10] M. M. Amiri and D. Gündüz, “Federated learning over wireless fading channels,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3546-3557, February 2020.

  • [11] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over the air computation,” IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022-2035, 2020.

  • [12] M. M. Amiria, T. M. Duman, D. Gündüz, S. R. Kulkarni, and H. Vincent Poor, “Collaborative machine learning at the wireless edge with blind transmitters,” IEEE Trans. Wireless Commun., pp. 1-1, March 2021.

  • [13] Y. A. Jawhar, L. Audah, M. A. Taher, K. N. Ramli, N. S. M. Shah, M. Musa, and M. S. Ahmed, “A review of partial transmit sequence for PAPR reduction in the OFDM systems,” IEEE Access, vol. 7, pp. 18021-18041, 2019.

  • [14] T. Zeng, O. Semiari, M. Mozaffari, M. Chen, W. Saad, and M. Bennis, “Federated learning in the sky: Joint power allocation and scheduling with UAV swarms,” in Proc. IEEE International Conference on Communications (ICC), 2020, pp. 1-6.

  • [001] T. Gafni, N. Shlezinger, K. Cohen, Y. C. Eldar, and H. V. Poor, “Federated learning: A signal processing perspective,” 2021. [Online]. Available: arXiv:2103.17150

  • [002] M. Chen, D. Gündüz, K. Huang, W. Saad, M. Bennis, A. V. Feljan, and H. Vincent Poor, “Distributed learning in wireless networks: Recent progress and future challenges,” IEEE J. Sel. Areas Commun., pp. 1-26, 2021.

  • [003] H. Hellstrom, J. M. B. da Silva Jr, V. Fodor, and C. Fischione, “Wireless for machine learning,” 2020.

  • [004] M. Goldenbaum, H. Boche, and S. Stańczak, “Harnessing interference for analog function computation in wireless sensor networks,” IEEE Trans. Signal Process., vol. 61, no. 20, pp. 4893-4906, October 2013.

  • [005] W. Liu, X. Zang, Y. Li, and B. Vucetic, “Over-the-air computation systems: Optimization, analysis and scaling laws,” IEEE Trans. Wireless Commun., vol. 19, no. 8, pp. 5488-5502, August 2020.

  • [006] B. Nazer and M. Gastpar, “Computation over multiple-access channels,” IEEE Trans. Inf. Theory, vol. 53, no. 10, pp. 3498-3516, October 2007.

  • [007] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491-506, January 2020.

  • [008] T. Sery, N. Shlezinger, K. Cohen, and Y. C. Eldar, “Over-the-air federated learning from heterogeneous data,” 2020. [Online]. Available: arXiv:2009.12787

  • [009] M. M. Amiri and D. Gündüz, “Federated learning over wireless fading channels,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3546-3557, February 2020.

  • [010] G. Zhu, Y. Du, D. Gündüz, and K. Huang, “One-bit over-the-air aggregation for communication-efficient federated edge learning: Design and convergence analysis,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 2120-2135, November 2021.

  • [011] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signSGD: Compressed optimisation for non-convex problems,” in Proc. in International Conference on Machine Learning, vol. 80. Proceedings of Machine Learning Research, 10-15 Jul. 2018, pp. 560-569.

  • [012] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022-2035, 2020.

  • [013] M. M. Amiria, T. M. Duman, D. Gündüz, S. R. Kulkarni, and H. Vincent Poor, “Collaborative machine learning at the wireless edge with blind transmitters,” IEEE Trans. Wireless Commun., pp. 1-1, March 2021.

  • [014] E. Dahlman, S. Parkvall, and J. Skold, 5G NR: The Next Generation Wireless Access Technology, 1st ed. USA: Academic Press, Inc., 2018.

  • [015] A. Sahin, B. Everette, and S. Hoque, “Over-the-air computation with DFT-spread OFDM for federated edge learning,” in Proc. IEEE Wireless Communications and Networking Conference (WCNC) (submitted), April 2022, pp. 1-6.

  • [016] T. Zeng, O. Semiari, M. Mozaffari, M. Chen, W. Saad, and M. Bennis, “Federated learning in the sky: Joint power allocation and scheduling with UAV swarms,” in Proc. IEEE International Conference on Communications (ICC), 2020, pp. 1-6.

  • [017] Y. A. Jawhar, L. Audah, M. A. Taher, K. N. Ramli, N. S. M. Shah, M. Musa, and M. S. Ahmed, “A review of partial transmit sequence for PAPR reduction in the OFDM systems,” IEEE Access, vol. 7, pp. 18021-18041, 2019.


Claims
  • 1. An over-the-air computation (AirComp) methodology for federated edge learning (FEEL) without using channel state information (CSI) at a plurality of edge devices (EDs) or at an edge server (ES), comprising: a distributed machine-learning model to be trained with the update vectors received at an edge server (ES) as transmitted from a plurality of edge devices (EDs); one or more processors; and one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the one or more processors to perform operations, the operations comprising: transmitting local update vectors as weighted votes over selected multiple orthogonal subcarriers grouped based on the sign of the elements of the update vector from each respective of the plurality of edge devices (EDs) via a wireless multiple access channel, receiving the superposed local updates at the ES, determining the majority vote (MV) for each element of the update vector at the ES with an energy detector over orthogonal time and frequency resources, and inputting the MVs into the machine-learning model to be updated.
  • 2. An over-the-air computation (AirComp) methodology according to claim 1, wherein the votes comprise orthogonal frequency division multiplexing (OFDM) symbols over multiple OFDM subcarriers, and aggregating operations use one-bit broadband digital aggregation (OBDA) and frequency-shift keying (FSK)-based methodology.
  • 3. An over-the-air computation (AirComp) methodology according to claim 2, further comprising operations using randomization symbols on active subcarriers to reduce peak-to-mean envelope power ratio (PMEPR).
  • 4. An over-the-air computation (AirComp) methodology according to claim 1, wherein the receiving operations include the ES detecting MV with a non-coherent detector.
  • 5. An over-the-air computation (AirComp) methodology according to claim 1, wherein the machine learning model comprises artificial intelligence technology over wireless or sensor networks, 5G or higher, 6G wireless standardization, or IEEE 802.11 Wi-Fi.
  • 6. An over-the-air computation (AirComp) methodology according to claim 1, wherein the transmitting local updates operation includes use of gradient averaging.
  • 7. An over-the-air computation (AirComp) methodology according to claim 6, wherein the local gradient estimate gk(n) for the kth ED at the nth communication round between at least one ED and the ES comprises:
  • 8. An over-the-air computation (AirComp) methodology according to claim 7, further comprising global gradient operations that the ES determines and distributes a global gradient estimate to the EDs and the current machine-learning model is updated based on a common update rule, and the global gradient operations are repeated consecutively until a predetermined convergence criterion is achieved.
  • 9. An over-the-air computation (AirComp) methodology according to claim 1, wherein the transmitting local updates operation includes use of signs of local gradients by the respective EDs with using a general weight function to increase the probability of the detecting the correct MV.
  • 10. An over-the-air computation (AirComp) methodology according to claim 1, further comprising operations, after a signal passes from each ED through their own multipath channels, the ES observes the superposed symbols on the same subcarrier indices.
  • 11. An over-the-air computation (AirComp) methodology according to claim 10, further comprising detector operations at the ES that the detector compares the energies on two adjacent subcarriers to determine the gradient vector.
  • 12. An over-the-air computation (AirComp) methodology according to claim 1, wherein the machine-learning model is training to learn the task of handwritten-digit recognition.
  • 13. An over-the-air computation (AirComp) methodology according to claim 12, wherein the machine-learning model comprises a convolution neural network with multiple convolutional layers, with each convolutional layer followed by a batch normalization layer and rectified-linear unit (ReLU) activation following each of them.
  • 14. An over-the-air computation (AirComp) methodology according to claim 13, wherein the multiple convolutional layers each have a plurality of filters, and a fully connected layer with plural units and a softmax layer are used after one of the ReLU.
  • 15. An over-the-air computation (AirComp) system for federated edge learning (FEEL) without using channel state information (CSI) at a plurality of edge devices (EDs) or at an edge server (ES), comprising: a machine-learning model training to process data received at an edge server (ES) as transmitted from a plurality of edge devices (EDs); one or more processors; and one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the one or more processors to perform operations, the operations comprising: transmitting local updates as votes over selected multiple subcarriers from each respective of the plurality of edge devices (EDs) via a wireless multiple access channel, receiving the local updates at the ES, aggregating the local updates at the ES including separating votes from the EDs using orthogonal resources and majority vote (MV) principle, and inputting the obtained data into the machine-learning model as training data or data to process.
  • 16. An over-the-air computation (AirComp) system according to claim 15, wherein the votes comprise orthogonal frequency division multiplexing (OFDM) symbols over multiple OFDM subcarriers, and aggregating operations use one-bit broadband digital aggregation (OBDA) and frequency-shift keying (FSK)-based methodology.
  • 17. An over-the-air computation (AirComp) system according to claim 15, wherein the receiving operations include the ES detecting MV with a non-coherent detector.
  • 18. An over-the-air computation (AirComp) system according to claim 15, wherein the transmitting local updates operation includes use of either gradient averaging or use of signs of local gradients by the respective EDs.
  • 19. An over-the-air computation (AirComp) system according to claim 18, further comprising global gradient operations comprising that the ES determines and distributes a global gradient estimate to the EDs and the current machine-learning model is updated based on a common update rule, and the global gradient operations are repeated consecutively until a predetermined convergence criterion is achieved.
  • 20. An over-the-air computation (AirComp) system according to claim 15, wherein the machine-learning model comprises a convolution neural network with multiple convolutional layers, with each convolutional layer followed by a batch normalization layer and rectified-linear unit (ReLU) activation following each of them.
PRIORITY CLAIMS

The present application claims the benefit of priority of U.S. Provisional Patent Application No. 63/192,671, titled Methods for Reliable Over-The-Air Computation and Federated Edge Learning, filed May 25, 2021; and claims the benefit of priority of U.S. Provisional Patent Application No. 63/313,321, titled Methods for Reliable Over-The-Air Computation and Federated Edge Learning, filed Feb. 24, 2022, both of which are fully incorporated herein by reference for all purposes.

Provisional Applications (2)
Number Date Country
63192671 May 2021 US
63313321 Feb 2022 US