This application claims the benefit of an earlier filing date and right of priority to Korean Patent Application No. 10-2020-0130054 filed on Oct. 8, 2020. The disclosure of the prior application is considered part of and is incorporated by reference in the disclosure of this application.
The present disclosure relates to a wireless communication system.
Modern communication systems supporting high-rate systems (e.g., LTE and 5G) are usually operated over broadband channels. In such systems, orthogonal frequency-division multiplexing (OFDM) is widely adopted to partition a spectrum into subcarriers to simplify signal processing and facilitate resource allocation.
Recent years have witnessed growing interest in the deployment of artificial intelligence (AI) algorithms at the network edge. Deploying machine learning algorithms at the network edge enables low-latency access to distributed data and computation resources, and has given rise to an active research area called edge learning.
Edge learning places increasing demands on wireless communication systems. Efficient training methods for wireless communication systems are required to minimize the learning latency, while minimizing the transmit power subject to guaranteed data rates for all users or maximizing the throughput under power control.
The object of the present disclosure can be achieved by techniques disclosed herein for training a neural network (NN) model in a neural network (NN) based wireless communication system.
In one aspect, provided is a method for updating a neural network (NN) model in a neural network (NN) based wireless communication system. The method comprises: determining, for a one-round latency T and an overall model size L of the NN model, i) Tu that makes {circumflex over (L)}*(Tu) larger than L and ii) Tl that makes {circumflex over (L)}*(Tl)<L; determining {circumflex over (L)}*(Tm), {R*k,n}, {L*k}, and {C*k,n} by setting Tm=(Tu+Tl)/2 based on Tu being different from Tl, where k=1, . . . , K, n=1, . . . , N, R*k,n is channel capacity of user equipment k on subcarrier n, L*k is parameter allocation to user equipment k, and C*k,n is subcarrier allocation indicator for user equipment k; determining Tu=Tm based on {circumflex over (L)}*(Tm) being equal to or larger than L and determining Tl=Tm based on {circumflex over (L)}*(Tm) being smaller than L; repeating determining {circumflex over (L)}*(Tm), {R*k,n}, {L*k}, and {C*k,n}, where Tm=(Tu+Tl)/2 based on Tu being different from Tl; outputting {R*k,n}, {L*k}, and {C*k,n}, based on Tu being equal to Tl; allocating NN model parameters to user equipments 1 to K based on the output {R*k,n}, {L*k}, and {C*k,n}; receiving update results of the NN model parameters from user equipments 1 to K; and updating the NN model based on the received update results.
In another aspect, provided is a base station for updating a neural network (NN) model in a neural network (NN) based wireless communication system. The base station comprises: at least one transceiver; at least one processor; and at least one memory storing at least one instruction that, when executed, causes the at least one processor to perform operations. The operations comprise: determining, for a one-round latency T and an overall model size L of the NN model, i) Tu that makes {circumflex over (L)}*(Tu) larger than L and ii) Tl that makes {circumflex over (L)}*(Tl)<L; determining {circumflex over (L)}*(Tm), {R*k,n}, {L*k}, and {C*k,n} by setting Tm=(Tu+Tl)/2 based on Tu being different from Tl, where k=1, . . . , K, n=1, . . . , N, R*k,n is channel capacity of user equipment k on subcarrier n, L*k is parameter allocation to user equipment k, and C*k,n is subcarrier allocation indicator for user equipment k; determining Tu=Tm based on {circumflex over (L)}*(Tm) being equal to or larger than L and determining Tl=Tm based on {circumflex over (L)}*(Tm) being smaller than L; repeating determining {circumflex over (L)}*(Tm), {R*k,n}, {L*k}, and {C*k,n}, where Tm=(Tu+Tl)/2 based on Tu being different from Tl; outputting {R*k,n}, {L*k}, and {C*k,n}, based on Tu being equal to Tl; allocating NN model parameters to user equipments 1 to K based on the output {R*k,n}, {L*k}, and {C*k,n}; receiving update results of the NN model parameters from user equipments 1 to K; and updating the NN model based on the received update results.
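The latency search recited in the two aspects above is a bisection over the one-round latency. The following is a minimal Python sketch of that search, not the disclosed implementation. The callable `max_updatable_size` is a hypothetical stand-in for the step that determines {circumflex over (L)}*(Tm) together with the allocations; the toy function `toy_size` is invented only so the example runs. The tolerance `tol` replaces the exact stopping condition Tu=Tl, which is an assumption made for floating-point arithmetic.

```python
from typing import Callable


def bisect_latency(
    max_updatable_size: Callable[[float], float],
    model_size: float,
    t_low: float,
    t_high: float,
    tol: float = 1e-9,
) -> float:
    """Search for the smallest latency T with max_updatable_size(T) >= model_size.

    Assumes max_updatable_size is non-decreasing in T and that the initial
    bracket satisfies max_updatable_size(t_low) < model_size
    <= max_updatable_size(t_high).
    """
    if not (max_updatable_size(t_low) < model_size <= max_updatable_size(t_high)):
        raise ValueError("initial bracket [t_low, t_high] does not contain T*")
    while t_high - t_low > tol:
        t_mid = 0.5 * (t_high + t_low)      # Tm = (Tu + Tl) / 2
        if max_updatable_size(t_mid) >= model_size:
            t_high = t_mid                  # Tu = Tm
        else:
            t_low = t_mid                   # Tl = Tm
    return t_high


def toy_size(t: float) -> float:
    """Hypothetical monotone stand-in: 1000 parameters updatable per second."""
    return 1000.0 * t


t_star = bisect_latency(toy_size, model_size=2500.0, t_low=0.0, t_high=10.0)
```

In this toy setting the search converges to a latency of about 2.5, the point at which the stand-in function first reaches the target model size. In the disclosed method each evaluation would instead return the allocations {R*k,n}, {L*k}, and {C*k,n} along with the updatable size.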
In each aspect of the present disclosure, determining {circumflex over (L)}*(Tm), {R*k,n}, {L*k}, and {C*k,n} may comprise: determining {C*k,n}=0 based on Ik,n being larger than μn; determining {C*k,n}∈(0,1) based on Ik,n being equal to μn for multiple user equipments; and determining {C*k,n}=1 based on Ik,n being equal to μn for a single user equipment, wherein
where B is a system bandwidth of a cell, hk,n is an uplink channel gain of user equipment k on subcarrier n, and σ2 is additive white Gaussian noise power.
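The three cases above can be read as a per-subcarrier selection rule. The sketch below illustrates that reading in Python under two assumptions that are not stated in the text: μn is taken to be the smallest indicator Ik,n on subcarrier n (which the three cases imply, since every other user equipment receives a zero share), and ties are split equally among the tied user equipments (the text only requires a share in (0,1)). The function name `assign_subcarriers` and the input values are hypothetical.

```python
from typing import List


def assign_subcarriers(indicator: List[List[float]], eps: float = 1e-12) -> List[List[float]]:
    """Return C[k][n] from indicator[k][n], which plays the role of I_{k,n}."""
    num_users = len(indicator)
    num_subcarriers = len(indicator[0])
    alloc = [[0.0] * num_subcarriers for _ in range(num_users)]
    for n in range(num_subcarriers):
        column = [indicator[k][n] for k in range(num_users)]
        mu_n = min(column)  # assumption: mu_n equals the smallest indicator
        winners = [k for k, value in enumerate(column) if value - mu_n <= eps]
        share = 1.0 / len(winners)  # assumption: equal split on ties
        for k in winners:
            alloc[k][n] = share     # 1 for a single winner, in (0,1) on ties
    return alloc


# Two user equipments, three subcarriers; the last subcarrier is a tie.
C = assign_subcarriers([[0.2, 0.7, 0.5], [0.9, 0.3, 0.5]])
```

Here `C` is `[[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]`: each of the first two subcarriers goes to a single user equipment, and the tied third subcarrier is shared.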
In each aspect of the present disclosure, determining {circumflex over (L)}*(Tm), {R*k,n}, {L*k}, and {C*k,n} may comprise: determining {R*k,n} based on
for all Ck,n≠0, and ii) R*k,n=0 for all Ck,n=0, wherein hk,n is an uplink channel gain of user equipment k on subcarrier n, λk and νk are Lagrange multipliers, and τ is the number of bits per NN model parameter.
In each aspect of the present disclosure, determining {circumflex over (L)}*(Tm), {R*k,n}, {L*k}, and {C*k,n} may comprise: determining {L*k} based on L*k=[T−√{square root over (λkTm+νkgkfk2Tm−νk(PkTm−ξ)/fk)}]fk, wherein ξ is circuit energy consumption, gk is a computation power factor of user equipment k, fk is computation frequency of user equipment k, and Pk is permitted energy consumption of user equipment k.
In each aspect of the present disclosure, {circumflex over (L)}*(Tm) may be determined as follows: {circumflex over (L)}*(Tm)=Σk=1KL*k.
In each aspect of the present disclosure, determining {circumflex over (L)}*(Tm), {R*k,n}, {L*k}, and {C*k,n} may comprise: i) allocating wg,i to only one user equipment, where wg,i is the i-th neuron parametric vector in layer g of the NN model; and ii) allocating Zm=[z1,m, . . . , zg,m, . . . , zG,m] to only one user equipment, where Zm is an auxiliary matrix for data sample m, zg-1,m=[zg-1,1,m, zg-1,2,m, . . . , zg,l
The above technical solutions are merely some parts of the implementations of the present disclosure and various implementations into which the technical features of the present disclosure are incorporated can be derived and understood by persons skilled in the art from the following detailed description of the present disclosure.
The accompanying drawings, which are included to provide a further understanding of the present disclosure, illustrate examples of implementations of the present disclosure and together with the detailed description serve to explain implementations of the present disclosure:
Hereinafter, implementations according to the present disclosure will be described in detail with reference to the accompanying drawings. The detailed description, which will be given below with reference to the accompanying drawings, is intended to explain exemplary implementations of the present disclosure, rather than to show the only implementations that may be implemented according to the present disclosure. The following detailed description includes specific details in order to provide a thorough understanding of the present disclosure. However, it will be apparent to those skilled in the art that the present disclosure may be practiced without such specific details.
In some instances, known structures and devices may be omitted or may be shown in block diagram form, focusing on important features of the structures and devices, so as not to obscure the concept of the present disclosure. The same reference numbers will be used throughout the present disclosure to refer to the same or like parts.
A technique, a device, and a system described below may be applied to a variety of wireless multiple access systems. The multiple access systems may include, for example, a code division multiple access (CDMA) system, a frequency division multiple access (FDMA) system, a time division multiple access (TDMA) system, an orthogonal frequency division multiple access (OFDMA) system, a single-carrier frequency division multiple access (SC-FDMA) system, a multi-carrier frequency division multiple access (MC-FDMA) system, etc. CDMA may be implemented by radio technology such as universal terrestrial radio access (UTRA) or CDMA2000. TDMA may be implemented by radio technology such as global system for mobile communications (GSM), general packet radio service (GPRS), enhanced data rates for GSM evolution (EDGE) (i.e., GERAN), etc. OFDMA may be implemented by radio technology such as institute of electrical and electronics engineers (IEEE) 802.11 (Wi-Fi), IEEE 802.16 (WiMAX), IEEE 802.20, evolved-UTRA (E-UTRA), etc. UTRA is part of universal mobile telecommunications system (UMTS) and 3rd generation partnership project (3GPP) long-term evolution (LTE) is part of E-UMTS using E-UTRA. 3GPP LTE adopts OFDMA on downlink (DL) and adopts SC-FDMA on uplink (UL). LTE-advanced (LTE-A) is an evolved version of 3GPP LTE.
For convenience of description, the following description assumes that the present disclosure is applied to 3GPP based wireless communication systems. However, the technical features of the present disclosure are not limited thereto. For example, although the following detailed description is given based on mobile communication systems corresponding to 3GPP based wireless communication systems, the description is applicable to any other mobile communication system, except for matters that are specific to the 3GPP based wireless communication systems.
For terms and techniques that are not described in detail among terms and techniques used in the present disclosure, reference may be made to 3GPP based wireless communication standard specifications and the following documents 1 to 9.
In the present disclosure, a user may be fixed or mobile. Each of various devices that transmit and/or receive user data and/or control information by communicating with a base station (BS) may be a user equipment (UE). The term UE may be referred to as terminal equipment, mobile station (MS), mobile terminal (MT), user terminal (UT), subscriber station (SS), wireless device, personal digital assistant (PDA), wireless modem, handheld device, etc. In the present disclosure, a BS refers to a fixed station that communicates with a UE and/or another BS and exchanges data and control information with a UE and another BS. The term BS may be referred to as advanced base station (ABS), Node-B (NB), evolved Node-B (eNB), base transceiver system (BTS), access point (AP), processing server (PS), etc. Particularly, a BS of a universal terrestrial radio access network (UTRAN) is referred to as an NB, a BS of an evolved-UTRAN (E-UTRAN) is referred to as an eNB, and a BS of a new radio access technology network is referred to as a gNB. Hereinbelow, for convenience of description, the NB, eNB, or gNB will be referred to as a BS regardless of the type or version of communication technology.
In the present disclosure, a node refers to a fixed point capable of transmitting/receiving a radio signal to/from a UE or BS. At least one antenna is installed per node. An antenna may refer to a physical antenna port or refer to a virtual antenna or an antenna group. The node may also be called a point.
In the present disclosure, a cell refers to a specific geographical area in which one or more nodes provide communication services. Accordingly, in the present disclosure, communication with a specific cell may mean communication with a BS or a node providing communication services to the specific cell. A DL/UL signal of the specific cell refers to a DL/UL signal from/to the BS or the node providing communication services to the specific cell. A cell providing UL/DL communication services to a UE is especially called a serving cell. Furthermore, channel status/quality of the specific cell refers to channel status/quality of a channel or a communication link generated between the BS or the node providing communication services to the specific cell and the UE.
A 3GPP-based communication system uses the concept of a cell in order to manage radio resources, and a cell related with the radio resources is distinguished from a cell of a geographic area. The “cell” of the geographic area may be understood as coverage within which a node may provide services using a carrier, and the “cell” of the radio resources is associated with bandwidth (BW), which is a frequency range configured by the carrier. Since DL coverage, which is a range within which the node is capable of transmitting a valid signal, and UL coverage, which is a range within which the node is capable of receiving the valid signals from the UE, depend upon a carrier carrying the signal, coverage of the node may also be associated with coverage of the “cell” of radio resources used by the node. Accordingly, the term “cell” may be used to indicate service coverage by the node sometimes, radio resources at other times, or a range that a signal using the radio resources may reach with valid strength at other times. In 3GPP communication standards, the concept of the cell is used in order to manage radio resources. The “cell” associated with the radio resources is defined by a combination of DL resources and UL resources, that is, a combination of a DL component carrier (CC) and a UL CC. The cell may be configured by the DL resources only or by the combination of the DL resources and the UL resources.
In the present disclosure, a UE, a BS or a server may include at least one processor and at least one memory and additionally further include at least one transceiver. The at least one memory stores instructions that, when executed, cause the at least one processor to perform operations according to some implementations of the present disclosure which are described hereinafter.
The at least one processor may be referred to as controllers, microcontrollers, microprocessors, or microcomputers. The at least one processor may be implemented by hardware, firmware, software, or a combination thereof. As an example, one or more application specific integrated circuits (ASICs), one or more digital signal processors (DSPs), one or more digital signal processing devices (DSPDs), one or more programmable logic devices (PLDs), or one or more field programmable gate arrays (FPGAs) may be included in the at least one processor. The functions, procedures, proposals, and/or methods disclosed in this document may be implemented using firmware or software, and the firmware or software may be configured to include the modules, procedures, or functions. Firmware or software configured to perform the functions, procedures, proposals, and/or methods disclosed in this document may be included in the at least one processor or stored in the at least one memory so as to be driven by the at least one processor. The functions, procedures, proposals, and/or methods disclosed in this document may be implemented using firmware or software in the form of code, commands, and/or a set of commands.
The at least one memory may be connected to the at least one processor and store various types of data, signals, messages, information, programs, code, commands, and/or instructions. The at least one memory may be configured by read-only memories (ROMs), random access memories (RAMs), erasable programmable read-only memories (EPROMs), flash memories, hard drives, registers, cache memories, computer-readable storage media, and/or combinations thereof. The at least one memory may be located at the interior and/or exterior of the at least one processor. The at least one memory may be connected to the at least one processor through various technologies such as wired or wireless connection.
Conventional designs of radio resource management (RRM) in OFDM based broadband systems aim at minimizing the transmit power while guaranteeing the data rates of all users (see, e.g., Document 7) or maximizing the throughput under power control (see, e.g., Document 8).
Recent years have witnessed growing interest in the deployment of AI algorithms at the network edge. Deploying machine learning algorithms at the network edge enables low-latency access to distributed data and computation resources, and has given rise to an active research area called edge learning (see Document 1). Among others, the partitioned edge learning (PARTEL) framework can support the efficient training of a large-scale model using distributed computation resources at mobile devices (see Document 2). In the present disclosure, the methods for implementing PARTEL in broadband systems are described.
For the purpose of the present disclosure, the symbols shown in the following table apply.
For the purpose of the present disclosure, the following abbreviations apply.
There exist two paradigms in distributed learning: data parallelism and model parallelism. The former refers to the simultaneous training of an AI model at multiple devices using different parts of a global dataset. On the other hand, model parallelism refers to the simultaneous training of different parts of a model at different devices. Recent research on edge learning focuses on the efficient implementation of different frameworks under these two paradigms.
*Federated Edge Learning (FEEL): FEEL is a popular data-parallelism framework that aims at exploiting distributed mobile data while preserving privacy by avoiding sharing data (see Document 3). The main research focus in FEEL is to overcome the bottleneck by communication-efficient designs that integrate learning and wireless transmission techniques, via designing e.g., simultaneous multi-access technique, called over-the-air computation (see Document 4), gradient quantization (see Document 5), etc.
*PARTEL: PARTEL is a representative model-parallelism framework, which leverages distributed computation resources at devices to train a large-scale model (see Document 2). To this end, the model is partitioned and its parts are allocated to different devices for updating using downloaded datasets. In each round of PARTEL, a server partitions a global model under training into blocks of parameters, called parametric blocks, and allocates each of them to a single device for updating. In other words, PARTEL supports the distributed training of a large-scale AI model by dynamically partitioning the model and allocating the resultant parametric blocks to different devices for updating. Then devices upload the updates to a server where they are assembled and then applied to update the model. The two steps are iterated till the model converges. The sizes of the parametric blocks determine the computation-and-communication (C2) loads of individual devices. The possibility of controlling the sizes gives rise to a new research issue unique to PARTEL, namely C2 load allocation by model partitioning. Model partitioning in PARTEL is straightforward in the case of decomposable loss functions (e.g., logistic regression). For convolutional neural network (CNN) models with nested layers, their implementation using PARTEL relies on introducing a set of auxiliary variables for the models so as to transform the loss function into a decomposable form (see Document 6).
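The partition, update, and assemble cycle of one PARTEL round can be illustrated with a minimal Python sketch. It is an illustration only: the helper names `partition`, `partel_round`, and `toy_worker_update` are hypothetical, the blocks are assumed to be contiguous slices of the parameter vector, and the per-block update is a toy gradient step on a separable quadratic chosen so that the blocks are trivially independent.

```python
from typing import Callable, List, Sequence


def partition(params: Sequence[float], block_sizes: Sequence[int]) -> List[List[float]]:
    """Split the parameter vector into disjoint, contiguous parametric blocks."""
    if sum(block_sizes) != len(params):
        raise ValueError("block sizes must cover the whole parameter vector")
    blocks, start = [], 0
    for size in block_sizes:
        blocks.append(list(params[start:start + size]))
        start += size
    return blocks


def partel_round(
    params: Sequence[float],
    block_sizes: Sequence[int],
    worker_update: Callable[[List[float]], List[float]],
) -> List[float]:
    """One round: the server partitions, each worker updates its own block,
    and the server assembles the uploaded blocks into the new model."""
    blocks = partition(params, block_sizes)
    updated = [worker_update(block) for block in blocks]
    return [value for block in updated for value in block]


def toy_worker_update(block: List[float]) -> List[float]:
    """Hypothetical update: one gradient step of size 0.5 on 0.5 * w**2."""
    return [w - 0.5 * w for w in block]


new_params = partel_round([4.0, 2.0, -2.0, 8.0, 6.0], [2, 3], toy_worker_update)
```

With block sizes `[2, 3]`, the first worker updates two parameters and the second updates three; changing the block sizes changes each worker's load without changing the assembled result in this separable toy case.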
The FEEL framework is less communication efficient than PARTEL, as in the former, each device should upload the updates of the whole parameter vector to the server instead of the updates of only a subset of parameters in the latter.
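As a rough numerical illustration of this comparison, the following Python sketch counts the bits one device uploads per round in each framework. The model size, number of devices, bit width, and the equal split across devices are all hypothetical values chosen for the example, and the function names are not taken from the disclosure.

```python
def feel_upload_bits(model_size: int, bits_per_param: int) -> int:
    """FEEL: each device uploads updates for the whole parameter vector."""
    return model_size * bits_per_param


def partel_upload_bits(block_size: int, bits_per_param: int) -> int:
    """PARTEL: each device uploads updates only for its own parametric block."""
    return block_size * bits_per_param


MODEL_SIZE = 1_000_000      # hypothetical number of model parameters
NUM_DEVICES = 10            # hypothetical number of devices
BITS_PER_PARAM = 32         # hypothetical bits per uploaded element

feel_bits = feel_upload_bits(MODEL_SIZE, BITS_PER_PARAM)
partel_bits = partel_upload_bits(MODEL_SIZE // NUM_DEVICES, BITS_PER_PARAM)
```

Under these assumed values a FEEL device uploads 32,000,000 bits per round, whereas a PARTEL device with an equal share uploads 3,200,000 bits.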
The existing design for PARTEL targets the frequency non-selective channels (see Document 2). It cannot be straightforwardly extended to practical broadband systems where joint C2 control is more challenging due to frequency selectivity.
Whereas conventional work in OFDM systems targets conventional systems providing radio-access services, the design of broadband RRM in PARTEL systems with computation load balancing follows a new design criterion, namely low learning latency. Among others, the C2 resource management in some implementations of the present disclosure differs from its conventional counterparts in two respects. First, the former needs to account for not only channel states, as the latter does, but also devices' computation capacities. Furthermore, the load allocation in the former is more sophisticated than bit allocation in the latter as it involves model partitioning and has to address relevant issues such as model decomposability. Second, the constraint of synchronized updates in PARTEL tends to bias the RRM to favour devices with relatively weak channels and/or computation capacities. These differences lead to new design challenges for broadband RRM.
The parameter allocation of PARTEL provides a mechanism for controlling computation-and-communication (C2) loads. In the present disclosure, efficient joint management of parameter allocation and radio resources is considered to reduce the learning latency of PARTEL when deployed in a broadband system using orthogonal frequency-division multiplexing (OFDM). Specifically, in some implementations of the present disclosure, for both decomposable models and convolutional neural network (CNN) models, the policies for joint subcarrier, parameter, and power allocation (SUPPORT) are optimized under the criterion of minimum latency.
System Model
Learning Models
In some implementations of the present disclosure, the following learning models may be used.
1) Decomposable Models: The large-scale learning tasks with decomposable objective functions (such as logistic regression) can be directly implemented using PARTEL based on the method of block coordinate descent. A decomposable objective function can be written as $\mathcal{L}(\mathbf{w})=\ell(\mathbf{w})+\mathcal{R}(\mathbf{w})$, where $\mathbf{w}=[w_1, w_2, \ldots, w_L]^T$ is the parameter vector of the learning model, $L$ is the size of $\mathbf{w}$, $\ell(\mathbf{w})$ is the loss function, and $\mathcal{R}(\mathbf{w})$ is the regularization function.
2) CNN models: CNN models cannot be directly implemented using PARTEL, as the nested layers therein make the gradient elements of different layers dependent. The method of auxiliary variables may be used to decompose the CNN models into many independent sub-problems (refer to Document 6).
First, consider a CNN model with G hidden layers. The model parameter matrix is denoted as W with the size of L parameters. For an arbitrary layer therein, say layer g, the parameter matrix is denoted as Wg, the number of neurons is denoted as Ig, and the i-th neuron parametric vector is denoted as wg,i. Thereby, the objective function is given by:
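$\mathcal{L}(\mathbf{W})=\sum_{m=1}^{M}\left|y_m-\mathcal{F}(\mathbf{x}_m;\mathbf{W})\right|^2$, with $\mathcal{F}(\mathbf{x};\mathbf{W})=f_{G+1}(\ldots f_2(f_1(\mathbf{x};\mathbf{W}_1);\mathbf{W}_2),\ldots;\mathbf{W}_{G+1})$,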
where the model parameter matrix can be expressed as W=[W1, W2, . . . , WG, WG+1], the parameter matrix of the g-th layer can be expressed as Wg=[wg,1, wg,2, . . . , wg,I
Then, the method of auxiliary variables is used by introducing one auxiliary variable per neuron per data sample: zg,i,m=f(wg,i; zg-1,m), ∀(g, i, m), where f(⋅) is the activation function, wg,i is the i-th neuron parametric vector in layer g, zg,i,m is the auxiliary variable introduced for the i-th neuron in layer g regarding data sample m, zg-1,m=[zg-1,1,m, zg-1,2,m, . . . , zg,I
Next, following Document 6, by using the quadratic-penalty method, the equality-constrained optimization problem (refer to Problem (2) in Document 6) is equivalent to minimizing:
where z0,m=zm and μ→+∞.
Finally, the problem in Eq. (1) can be solved using the alternating optimization over W and Z, i.e., sequentially solving the W-stage and Z-stage, defined below, in each training iteration.
W-stage: Fixing the values of Z, solve the problem of minimizing the objective over W, in which the sub-problem of each neuron is independent and can be written as:
Note that one device is allocated a task of updating one or more neuron parametric vectors by solving the sub-problems in Eq. (2).
Z-stage: Fixing the values of W, solve the problem of minimizing the objective over Z, where the problem of optimizing each per-sample auxiliary matrix is independent of the others and is given as:
Note that one device is allocated a task of updating one or more per-sample auxiliary matrices by solving the sub-problems in Eq. (3).
PARTEL Architecture
Considering the PARTEL system shown in
1) Decomposable Models: The model-parameter vector is partitioned into K disjoint parametric blocks, as w={w1, . . . , wk, . . . , wK} where wk is allocated to worker k for update, using a downloaded dataset. In the PARTEL framework, one training iteration of the decomposable models is called one (communication) round. As shown in
The training process in
2) CNN Models: As mentioned before, each round of CNN models comprises two stages: W-stage and Z-stage, described as follows.
In the leftmost and uppermost cell among eight cells formed by two rows and four columns in
Referring to the cells shown in the lowermost row, auxiliary variables output from a same data sample are allocated to one and only one worker according to the granularity constraint 2.
Each stage (W-stage or Z-stage) comprises three phases, push, computation, and pull, which are similar to those in the case of decomposable models. The main difference lies in the additional granularity constraint 1 or 2. Each round comprises two stages and the rounds are repeated until the CNN model converges.
Latency and Energy Consumption Model
Consider an arbitrary communication round and an arbitrary worker, say worker k. Referring to
1) Push Phase: The push latency is the time for the server to broadcast the whole model-parameter vector to all workers. It is a constant identical for all workers. Besides, as the transmit power and bandwidth are very large during broadcasting, the push latency can be ignored. In this phase, the energy consumption by all workers is to receive the model-parameter vector from the server and is included in the circuit energy consumption that occurs in the workers and/or the server (even when the workers and the server do not perform computation or transmission) because the workers and the server are on. The circuit energy consumption may be constant. In the present disclosure, the circuit energy consumption is denoted as ξ.
2) Computation Phase: The computation latency of worker k depends on the size of the allocated parametric block Lk and its computation speed fk:
According to Document 9, the computation power of worker k is Pkcmp=gk fk3, where gk is the computation power factor. Then, the computation energy of worker k is:
Ekcmp=Pkcmp×Tkcmp=gkfk2Lk,1≤k≤K. Eq. (4-1)
3) Pull Phase: The pull phase consists of two parts. One is uploading gradient blocks from workers to the server. The other is the server updating the global model using the gradients sent by the workers. For the latter part, there is no energy consumption at the workers. Its latency is a constant and is the same for all workers. In the following description, the model update latency is ignored, as it is small and has no impact on the solution of latency minimization.
For uploading, worker k transmits over a set of assigned subcarriers. Let Tk,ncom denote the uploading latency of worker k on subcarrier n. If subcarrier n is not allocated to worker k, i.e., Ck,n=0, then Tk,ncom=0. Otherwise,
where Lk,n is the number of parameters uploaded by worker k on subcarrier n, τ is the number of bits per gradient element, B is the bandwidth of the subcarrier, and Rk,n is the channel capacity of worker k on subcarrier n. The channel capacity is given by:
where σ2 is the power of additive white Gaussian noise, Pk,ncom is the transmit power, and hk,n is the channel gain of worker k on subcarrier n, respectively. It follows that:
Then, the overall uploading latency of worker k is decided by the slowest subcarrier:
The uploading energy consumption of worker k is modeled as follows. Let Ek,ncom denote the transmit energy consumption of worker k on subcarrier n. If subcarrier n is not allocated, i.e., Ck,n=0, Ek,ncom=0. Otherwise,
Ek,ncom=Ck,nPk,ncomTk,ncom,∀(k,n).
By using Eq. (6) and Eq. (7), Ek,ncom can be further derived as:
The total uploading energy consumption of worker k is the sum of uploading energy consumption over all subcarriers: {Ekcom=Σn=1NEk,ncom, 1≤k≤K}. Eq. (8) can be substituted by Eq. (9),
Next, the total latency and energy consumption of worker k are defined as follows. The latency of worker k is the sum of the latencies of the two phases:
Tk=Tkcmp+Tkcom,1≤k≤K. Eq. (10)
The energy consumption of worker k is given by:
Ek=Ekcmp+Ekcom+ξ,1≤k≤K. Eq. (11)
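The per-worker latency and energy model of Eq. (10) and Eq. (11) can be exercised numerically with the following Python sketch. Several relations are assumptions inferred from the symbol definitions above rather than quoted from the disclosure: the computation latency is taken as Lk/fk (which is consistent with Eq. (4-1)), the channel capacity is taken in the Shannon form log2(1+P·h/σ2), and the per-subcarrier uploading latency is taken as Lk,n·τ/(B·Rk,n). The function names and all numeric inputs are hypothetical.

```python
import math
from typing import Sequence, Tuple


def channel_capacity(tx_power: float, gain: float, noise_power: float) -> float:
    """Assumed Shannon form: R = log2(1 + P * h / sigma^2)."""
    return math.log2(1.0 + tx_power * gain / noise_power)


def worker_latency_and_energy(
    block_size: float,                        # L_k
    cpu_speed: float,                         # f_k
    power_factor: float,                      # g_k
    alloc: Sequence[int],                     # C_{k,n} for n = 1..N
    params_per_subcarrier: Sequence[float],   # L_{k,n}
    tx_power: Sequence[float],                # P^com_{k,n}
    gain: Sequence[float],                    # h_{k,n}
    noise_power: float,                       # sigma^2
    bits_per_param: float,                    # tau
    bandwidth: float,                         # B
    circuit_energy: float,                    # xi
) -> Tuple[float, float]:
    """Return (T_k, E_k) following Eq. (10) and Eq. (11)."""
    t_cmp = block_size / cpu_speed                        # assumed form of T^cmp_k
    e_cmp = power_factor * cpu_speed ** 2 * block_size    # Eq. (4-1)
    t_com, e_com = 0.0, 0.0
    for c, l_kn, p, h in zip(alloc, params_per_subcarrier, tx_power, gain):
        if c == 0:
            continue  # unallocated subcarrier: zero latency and zero energy
        rate = channel_capacity(p, h, noise_power)
        t_kn = l_kn * bits_per_param / (bandwidth * rate)  # assumed form of T^com_{k,n}
        t_com = max(t_com, t_kn)                           # slowest subcarrier decides
        e_com += c * p * t_kn                              # E^com_{k,n} = C * P * T
    return t_cmp + t_com, e_cmp + e_com + circuit_energy


latency, energy = worker_latency_and_energy(
    block_size=3000, cpu_speed=6000, power_factor=1e-12,
    alloc=[1, 0], params_per_subcarrier=[3000, 0], tx_power=[0.75, 0.0],
    gain=[1.0, 1.0], noise_power=0.25, bits_per_param=32, bandwidth=48000,
    circuit_energy=0.05,
)
```

With these hypothetical inputs the worker spends 0.5 on computation and 1.0 on uploading, so the total latency is 1.5, and the total energy is roughly 0.908 (0.108 for computation, 0.75 for uploading, and 0.05 for circuit consumption).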
In some implementations of the present disclosure, the overall learning latency of the PARTEL system is minimized. The overall learning latency of the PARTEL system depends on two factors: the per-round latency and the number of rounds for model convergence. It is shown in Document 2 that the overall learning latency minimization is equivalent to separately minimizing the per-round latency. Hereinafter, minimizing the per-round latency is described.
For an arbitrary round, minimizing its latency, denoted as T, under the constraints on subcarrier assignment, latency requirement, parameter allocation, and power control, is described as follows.
1) Subcarrier Assignment Constraints: Each subcarrier can be allocated to only one worker:
Ck,n∈{0,1},∀(k,n), (C1)
Σk=1KCk,n=1,1≤n≤N. (C2)
2) Per-Round Latency Constraints: As all parametric blocks should be updated in one round, all workers' latencies, say {Tk}, should not exceed the overall one-round latency T:
Tk≤T,1≤k≤K,
which, by using Eq. (10), can be derived as:
Tkcmp+Tk,ncom≤T,∀Ck,n=1. (C3)
3) Parameter Constraints: The parameter constraints have two tiers. On the one hand, the total updatable number of parameters by all workers should be no smaller than the size of the model:
Σk=1KLk≥L. (C4)
On the other hand, for each worker, the total uploaded number of parameters on all subcarriers should be no smaller than its allocated parametric-block size:
Σn=1NCk,nLk,n≥Lk,1≤k≤K. (C5)
In the following description, {Lk} and {Lk,n} are relaxed to be continuous for simplicity. In practice, the solved {L*k} and {L*k,n} will be rounded for implementation and the loss caused by the rounding operation can be ignored, since the values of {Lk} and {Lk,n} are typically large.
For the case of CNN models, granularity constraints 1 and 2 can be written mathematically as follows.
where N+ is the set of positive integers and Lsub is the size of the sub-problems, i.e., neurons or per-sample auxiliary matrices.
4) Power Constraints: The power consumption of each worker is constrained as:
5) Latency-Minimization Problem: Under these constraints, the per-round latency-minimization problem by joint SUPPORT can be formulated as:
Hereinafter, in order to solve the per-round latency minimization problem (P1), described are how to partition the model per user (i.e., worker), how to allocate subcarriers per user, and how to allocate power to subcarriers allocated to each user. In particular, methods for minimizing per-round latency are described in connection with i) decomposable models and ii) CNN models, using the afore-mentioned constraints C1 to C6.
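Before turning to the solution methods, it may help to see the explicitly stated constraints as executable checks. The Python sketch below tests a candidate allocation against (C1), (C2), (C4), and (C5) as written above; the latency and power constraints are left out because their full expressions depend on equations not reproduced here. The function names and the sample allocation are hypothetical.

```python
from typing import Sequence


def satisfies_assignment(alloc: Sequence[Sequence[int]]) -> bool:
    """(C1) and (C2): binary indicators, each subcarrier owned by exactly one worker.

    alloc[k][n] plays the role of C_{k,n}.
    """
    binary = all(c in (0, 1) for row in alloc for c in row)
    one_owner = all(sum(column) == 1 for column in zip(*alloc))
    return binary and one_owner


def satisfies_parameters(
    block_sizes: Sequence[float],             # L_k
    uploads: Sequence[Sequence[float]],       # L_{k,n}
    alloc: Sequence[Sequence[int]],           # C_{k,n}
    model_size: float,                        # L
) -> bool:
    """(C4): blocks cover the model; (C5): uploads cover each worker's block."""
    covers_model = sum(block_sizes) >= model_size
    covers_blocks = all(
        sum(c * l for c, l in zip(alloc[k], uploads[k])) >= block_sizes[k]
        for k in range(len(block_sizes))
    )
    return covers_model and covers_blocks


# Hypothetical example: two workers, three subcarriers, a model of 1000 parameters.
alloc = [[1, 0, 1], [0, 1, 0]]
uploads = [[300, 0, 300], [0, 400, 0]]
ok_assignment = satisfies_assignment(alloc)
ok_parameters = satisfies_parameters([600, 400], uploads, alloc, model_size=1000)
bad_assignment = satisfies_assignment([[1, 1, 0], [1, 0, 0]])
```

The sample allocation passes both checks, while the last call fails because one subcarrier is given to two workers and another to none.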
Methods for Decomposable Models
1) By utilizing the Karush-Kuhn-Tucker (KKT) conditions, a necessary condition for the equivalent latency requirement is derived to simplify Problem (P1), as:
Tkcmp+Tk,ncom=T,∀Ck,n=1, Eq. (12)
where Tkcmp is the computation latency of worker k and Tk,ncom is the uploading latency of worker k on subcarrier n. From Eq. (12), all workers should have the same latency, equal to the overall latency T. Besides, for each worker, the uploading latency on all allocated subcarriers should be equal.
By substituting the computation latency Tkcmp defined in Eq. (4) and the uploading latency Tk,ncom defined in Eq. (5) into Eq. (12), the number of parameters uploaded by worker k on subcarrier n, say Lk,n, is derived as:
where Lk is the size of parametric block assigned to worker k.
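As a worked check of this substitution, the step can be written out under two assumed forms that are inferred from the symbol definitions in the latency model rather than quoted from the disclosure: a computation latency of $L_k/f_k$ and a per-subcarrier uploading latency of $L_{k,n}\tau/(B R_{k,n})$. Under those assumptions, Eq. (12) rearranges as follows.

```latex
\begin{align}
T_k^{\mathrm{cmp}} &= \frac{L_k}{f_k}, \qquad
T_{k,n}^{\mathrm{com}} = \frac{L_{k,n}\,\tau}{B\,R_{k,n}}, \\
\frac{L_k}{f_k} + \frac{L_{k,n}\,\tau}{B\,R_{k,n}} &= T
\;\;\Longrightarrow\;\;
L_{k,n} = \frac{B\,R_{k,n}}{\tau}\left(T - \frac{L_k}{f_k}\right),
\qquad \forall\, C_{k,n}=1 .
\end{align}
```

In this form, the number of parameters a worker uploads on a subcarrier grows with that subcarrier's capacity and with the time left after computation, which matches the equal-uploading-latency condition stated above.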
2) By substituting Eq. (13) into Problem (P1), it can be equivalently derived as:
where Ek, defined in Eq. (11), is the energy consumption of worker k. By using Eq. (4-1), Eq. (9), Eq. (11) and Eq. (13), Ek can be derived as:
3) Problem (P2) is a mixed-integer non-convex problem and is NP-hard. Two steps are used to tackle it. First, following the standard approach to integer programming, linear programming relaxation is used to relax the subcarrier-allocation indicators in Problem (P2) to be continuous, i.e., {Ck,n∈[0,1], ∀(k, n)}. Then, the relaxed problem can be equivalently converted to the problem of updatable model size maximization.
Given the one-round latency T for an arbitrary round, let {circumflex over (L)}*(T) denote the maximum size of a model that can be updated within the round. Then {circumflex over (L)}*(T) solves the following problem of model size maximization:
where Ek is the energy consumption of worker k defined in Eq. (14).
4) It can be shown that {circumflex over (L)}*(T) is a monotonically increasing function of T.
5) It follows from 4) that the solution of Problem (P2) is the minimal latency, say T*, which makes the updatable model size {circumflex over (L)}*(T*) no less than the target size L. This suggests a method to solve Problem (P2) by searching for T* using the criterion {circumflex over (L)}*(T)≥L, which will be elaborated later.
6) Obtaining the maximum updatable model size {circumflex over (L)}*(T) requires solving Problem (P3). To this end, the following variables are used to transform Problem (P3) into a convex problem.
7) By using the variables in Eq. (15) and Ek defined in Eq. (14), Problem (P3) can be written as:
8) It can be shown that Problem (P4) is a convex problem.
9) It follows from 8) that the primal-dual method can be used to get the optimal solution of Problem (P4) as follows:
where LP4 is the Lagrange function of Problem (P4), given as:
where {μn}, {λk≥0}, and {νk≥0} are Lagrangian multipliers. The Lagrangian multipliers are auxiliary variables that can be obtained based on optimization theory well known to the person skilled in the art. For example, the initial values of μn, λk, and νk may be arbitrary, and {μn}, {λk≥0}, and {νk≥0} are obtained by searching for the {μn}, {λk}, and {νk} which maximize a value of the inner loop problem
10) The necessary conditions (e.g., KKT) for achieving the optimal solution of the inner loop are used to derive the following optimal policies.
11) The optimal channel-capacity allocation scheme is:
12) The optimal power-allocation scheme is:
13) The optimal inter-worker parameter-allocation scheme is:
L*k=[T−√(λkT+νkgkfk2T−νk(PkT−ξ)fk)]fk, 1≤k≤K. Eq. (18)
14) The optimal intra-worker parameter allocation scheme is given by:
which shows that more parameters should be assigned to the channel with high gain.
15) The optimal subcarrier allocation is given as follows:
where μn=minkIk,n and Ik,n is the indicator function defined as follows:
where R*k,n is the optimal channel capacity of worker k on subcarrier n defined in Eq. (16).
16) Note that in the optimal scheme in Eq. (19), some subcarrier-allocation indicators may be fractions. A scheme for rounding them to binary values will be described later.
17) Based on the closed-form results in 11), 12), 13), 14) and 15), a low complexity optimal algorithm is proposed to solve the convex Problem (P4) and hence, Problem (P3) is equivalently solved. The algorithm is described as follows.
In the algorithm above, LP4 and {μn}, {λk≥0}, {νk≥0} are the Lagrange function and the multipliers of Problem (P4), respectively, as defined in 9). {ηλ
18) The computational complexity of the algorithm in 17) is O(K²N), with K being the number of workers and N being the number of subcarriers.
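The primal-dual iteration referred to in 9) and 17) can be illustrated on a much smaller convex problem. The sketch below is not the algorithm of 17) and does not solve Problem (P4); it assumes a hypothetical toy problem (maximize the sum of log(1+g·x) under a single budget constraint) and uses a plain dual subgradient update with a fixed step size. The names `dual_ascent`, `gains`, `budget`, and `step` are illustrative assumptions.

```python
def dual_ascent(gains, budget, step=0.1, iterations=200):
    """Toy primal-dual solver (assumed example, not Problem (P4)).

    Maximizes sum(log(1 + g * x)) subject to sum(x) <= budget and x >= 0.
    """
    lam = 1.0  # arbitrary positive initial multiplier
    for _ in range(iterations):
        # Inner step: closed-form maximizer of the Lagrangian for fixed lam.
        x = [max(0.0, 1.0 / lam - 1.0 / g) for g in gains]
        # Outer step: raise lam if the budget is exceeded, lower it otherwise.
        lam = max(1e-9, lam + step * (sum(x) - budget))
    # Recompute the primal variables at the final multiplier.
    x = [max(0.0, 1.0 / lam - 1.0 / g) for g in gains]
    return x, lam


allocation, multiplier = dual_ascent(gains=[1.0, 2.0], budget=2.0)
print(allocation, multiplier)
```

The structure mirrors the description above: closed-form primal policies for fixed multipliers, followed by a multiplier update, repeated until the constraint is met with equality.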
19) Then, as mentioned in 4) and 5), Problem (P2) with relaxed subcarrier-allocation indicators can be solved by nesting a one-dimensional search over the latency T and solving the convex Problem (P4). Based on the monotonicity of {circumflex over (L)}*(T) in 4), the search can be efficiently implemented by the bisection method, while the solution of Problem (P4) is presented in 17). The optimal policy to solve Problem (P2) with relaxed subcarrier-allocation indicators, obtained by nesting the bisection search and the algorithm in 17), is presented in the following.
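The outer bisection search can be sketched as follows. This is a minimal sketch, assuming a callable `max_model_size(T)` that stands in for solving Problem (P4) at latency T; here it is replaced by a hypothetical linear toy function, and the tolerance `tol` is an assumed stopping rule for "Tu equal to Tl".

```python
def min_latency(max_model_size, target_size, t_low, t_high, tol=1e-6):
    """Bisection for the smallest T with max_model_size(T) >= target_size.

    max_model_size must be monotonically increasing in T.
    """
    if max_model_size(t_low) >= target_size or max_model_size(t_high) < target_size:
        raise ValueError("need size(t_low) < target_size <= size(t_high)")
    while t_high - t_low > tol:
        t_mid = 0.5 * (t_low + t_high)
        if max_model_size(t_mid) >= target_size:
            t_high = t_mid  # t_mid is feasible: tighten the upper end
        else:
            t_low = t_mid   # t_mid is infeasible: raise the lower end
    return t_high


def toy_size(latency):
    # Hypothetical stand-in for the optimal value of Problem (P4).
    return 1000.0 * latency


t_star = min_latency(toy_size, target_size=2500.0, t_low=0.0, t_high=10.0)
print(t_star)
```

Returning the upper end of the bracket keeps the result feasible, since the updatable model size at that latency is never below the target size.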
20) Finally, based on the algorithm in 19), the joint scheme of SUPPORT without relaxation is proposed to solve the original Problem (P1). Note that not all subcarrier-allocation indicators solved by the algorithm in 19) are integers, i.e., Ck,n∈(0,1) for some (k, n). For these subcarriers, a practical subcarrier-allocation scheme is determined as follows:
where the subcarrier is allocated to the worker with the largest value. Then, given the subcarrier-allocation scheme {Ck,n}, the latency-minimization problem is a special case of Problem (P1), which can also be solved by the algorithm in 19).
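The rounding of fractional indicators can be sketched as follows. The sketch assumes that "the largest value" refers to the relaxed indicator Ck,n itself and that ties go to the lowest worker index; both are assumptions, and `round_subcarriers` is a hypothetical name.

```python
def round_subcarriers(relaxed):
    """Turn relaxed indicators relaxed[k][n] in [0, 1] into a binary allocation.

    Each subcarrier n goes to the worker k with the largest relaxed value.
    """
    num_workers = len(relaxed)
    num_subcarriers = len(relaxed[0])
    binary = [[0] * num_subcarriers for _ in range(num_workers)]
    for n in range(num_subcarriers):
        winner = max(range(num_workers), key=lambda k: relaxed[k][n])
        binary[winner][n] = 1
    return binary


relaxed = [[1.0, 0.3, 0.0],
           [0.0, 0.7, 1.0]]
binary_allocation = round_subcarriers(relaxed)
print(binary_allocation)
```

After this step every subcarrier belongs to exactly one worker, so the remaining problem involves only the continuous variables.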
According to Algorithm 1 shown in Table 2 and Algorithm 2 shown in Table 3, a server (and/or base station) may determine T*, {R*k,n}, {L*k} and {C*k,n} for workers k=1, . . . ,K which are participating in the partitioned edge learning of the model W.
For example, referring to
In other words, based on Algorithm 1 and Algorithm 2, for each worker participating in the partitioned edge learning, the BS may determine the partial model parameter wk to be updated by worker k, subcarriers through which worker k reports updated partial model parameter wk, and channel capacity (e.g. power) for each subcarrier n allocated to worker k. The BS may receive the updated partial model parameter wk from each worker k, and update the overall model parameter matrix W of the NN model.
Methods for CNN Models
1) A practical scheme, which leverages the result for decomposable models, is proposed to solve Problem (P1) for CNN models, described as follows.
2) For the scheme in 1), the challenges lie in Step 2 and are two-fold. On one hand, the rounding indicator should be designed to minimize the rounding loss. On the other hand, as each worker's number of parameters changes, the corresponding channel-capacity (or power) allocation and intra-worker parameter allocation among the assigned subcarriers should be redesigned.
3) Denote the solved one-round latency as T*, the subcarrier-allocation policy as {C*k,n}, the spectrum efficiencies as {R*k,n}, the number of parameters of worker k as L*k, the number of parameters uploaded by worker k on subcarrier n as L*k,n.
4) Consider an arbitrary worker, say worker k. If its number of parameters is rounded down to satisfy (Ccnn), the reduced number of parameters is denoted as ΔLkd≥0. If its number of parameters is rounded up, the additional number of parameters to be uploaded is denoted as ΔLku≥0. Note that if worker k's number of parameters is rounded down, the one-round latency is not affected. Hence, only the case of being rounded up is considered in the following description. In some implementations described below, a rounding scheme to minimize the resulting additional one-round latency is described.
5) Next, the joint scheme of SUPPORT may be designed as follows:
where ΔLk,n is the number of additional parameters allocated to subcarrier n for uploading, which is proportional to its currently uploaded number of parameters L*k,n.
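The proportional split of the additional parameters can be sketched as follows. This is an illustration of the proportionality stated above, not a reproduction of Eq. (22); the function name `spread_extra` and the example numbers are hypothetical.

```python
def spread_extra(per_subcarrier, extra):
    """Split `extra` parameters over subcarriers in proportion to current loads.

    per_subcarrier holds the currently uploaded parameter counts of one worker.
    """
    total = sum(per_subcarrier)
    if total <= 0:
        raise ValueError("worker has no parameters on any subcarrier")
    return [extra * load / total for load in per_subcarrier]


additional = spread_extra([200.0, 600.0], extra=80.0)
print(additional)
```

With this split, the load on every subcarrier of the worker grows by the same relative factor, so no single subcarrier becomes the bottleneck after rounding up.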
6) For an arbitrary worker, say worker k, the design in Eq. (22) results in an upper bound of the minimum additional latency:
where T* is the solved latency in Step 1, ΔTk, ΔLku, and L*k are the additional latency, the number of additional parameters after the rounding operation, and the solved number of parameters in Step 1 of worker k, respectively.
7) The proof of Eq. (23) is straightforward and hence omitted in the present disclosure. Two observations can be made from Eq. (23). On one hand, the size of the sub-problems is far smaller than the problems of W-stage and Z-stage, i.e., ΔLku≪L*k. Therefore, the additional latency ΔTk is small for all workers. On the other hand, the round-up indicator, denoted as Ik, should be the following ratio:
8) Following 7), the parameter rounding scheme is designed as follows, such that the workers with the least Ik round up and the others round down.
where ΔLk′u is the additional number of parameters of worker k′ when being rounded up and ΔLk′d is the reduced number of parameters when being rounded down. Eq. (23) means that by rounding up the K′1 workers with the least round-up indicators, the parameters of all workers can satisfy granularity constraints 1 and 2.
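The rounding scheme can be sketched as follows. The sketch rests on two assumptions that are not taken from the equations of the disclosure: the round-up indicator is taken to be ΔLku/L*k, and workers are rounded up in increasing order of that indicator until the parameters removed by rounding down are covered. The name `round_blocks` and the example sizes are hypothetical.

```python
import math


def round_blocks(sizes, sub_size):
    """Round each worker's block to a multiple of sub_size.

    Workers with the smallest assumed indicator (extra parameters needed to
    round up, relative to the solved size) are rounded up first.
    """
    floors = [math.floor(s / sub_size) * sub_size for s in sizes]
    reduced = [s - f for s, f in zip(sizes, floors)]           # removed if rounded down
    added = [sub_size - d if d > 0 else 0 for d in reduced]    # added if rounded up
    indicator = [a / s for a, s in zip(added, sizes)]          # assumed form of I_k
    rounded = list(floors)
    deficit = sum(sizes) - sum(floors)  # parameters still unassigned
    for k in sorted(range(len(sizes)), key=lambda k: indicator[k]):
        if deficit <= 0:
            break
        if added[k] == 0:
            continue  # already a multiple of sub_size
        rounded[k] += sub_size
        deficit -= sub_size
    return rounded


rounded_sizes = round_blocks([130.0, 250.0, 220.0], sub_size=100)
print(rounded_sizes)
```

In this example only the second worker is rounded up, because it needs the fewest extra parameters relative to its solved block size, and the rounded blocks still cover the whole model.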
For CNN models, when determining the partial model parameter wk, the BS may determine the partial model parameter to be updated by worker k under the granularity constraint 1. In CNN models, the BS may update a CNN model by updating a model parameter matrix W (W-stage) and then updating auxiliary matrix Z (Z-stage). The BS determines the per-sample auxiliary matrix zm for each worker under the granularity constraint 2. For example, the BS may update the model parameter matrix W and auxiliary matrix Z as shown in
According to some implementations of the present disclosure, the learning latency of PARTEL can be significantly reduced. Accordingly, some implementations of the present disclosure can improve the learning efficiency.
Number | Date | Country | |
---|---|---|---|
20220114448 A1 | Apr 2022 | US |