This application relates to the field of artificial intelligence, and in particular, to a model training system and method.
Artificial intelligence (AI) model training refers to providing a large amount of training data for a machine, so that the machine can find a proper neural network architecture and a value assigned to each parameter in the neural network architecture. In this way, the machine can accurately identify or distinguish objects through a neural network.
To perform AI model training more efficiently and accurately, a large quantity of processors may be used to form a model training machine, where the processor is, for example, a graphics processing unit (GPU), a central processing unit (CPU), or a neural network processing unit (NPU). Different training data may be input to the large quantity of processors, or different sub-models of an AI model may be run on them. The processors may obtain respective intermediate data after each iteration, and then transfer the respective intermediate data to one another, to obtain an aggregation result of all intermediate data in a current iteration. Subsequently, each processor may use the aggregation result as an input of a next iteration. In this way, after a plurality of rounds of iterative operations, the machine can learn more key feature details, thereby becoming more intelligent.
As the scale of the neural network and the scale of the dataset increase sharply, data is transferred more and more frequently between processors. Therefore, how to implement efficient data transmission between the large quantity of processors becomes a problem that urgently needs to be resolved.
This application provides a model training system and method, to implement efficient data transfer between a large quantity of processors.
According to a first aspect, this application provides a model training system, including:
a first group, where the first group includes a micro-electro-mechanical system (micro-electro-mechanical system, MEMS) and S×C processors, S is a quantity of nodes in the first group, C is a quantity of processors in one node, and both S and C are positive integers; the MEMS, configured to construct an optical transmission channel between any two nodes in the S nodes; and the S×C processors, configured to jointly train a model, where in one iteration of joint model training, at least two processors in the S×C processors transmit target data through the optical transmission channel, and a processor that receives the target data is configured to adjust a parameter for model training in the processor based on the target data. In this way, a communication connection between any two of the S nodes is implemented by the MEMS, that is, any node may send data to another node through the optical transmission channel constructed by the MEMS. Further, data obtained by one processor in the S nodes by performing model training may be transmitted to a processor of another node through the optical transmission channel constructed by the MEMS, thereby implementing efficient data transfer in model training.
In a possible implementation, the first group includes a first node and a second node, the first node includes a first processor, and the second node includes a second processor. The first processor is configured to perform model training in the first processor to obtain intermediate data of the first processor, and obtain first target data based on the intermediate data of the first processor, where the first target data may be all or a part of the intermediate data of the first processor. The first processor is further configured to send the first target data to the second processor through an optical transmission channel constructed by a first MEMS. The second processor is configured to adjust a parameter for model training in the second processor based on the first target data, where the first MEMS is located between the first node and the second node.
In a possible implementation, the first processor may be configured to send the first target data to the second processor through the optical transmission channel constructed by the first MEMS and an intra-node channel, where the intra-node channel includes a channel that is in the first node and that is between the first processor and the first MEMS, and/or a channel that is in the second node and that is between the second processor and the first MEMS. In this way, when a port in the first processor is not directly connected to a port of the first MEMS, the first processor may send, through a channel in the first node, the first target data to a processor that is in the first node and that is directly connected to the port of the first MEMS, and the processor may send the first target data to the second node through the optical transmission channel constructed by the first MEMS. Correspondingly, when a port in the second processor is not directly connected to the port of the first MEMS, a processor that is in the second node and that is directly connected to the port of the first MEMS may receive the first target data from the first node, and then send the first target data to the second processor through a channel on the second node.
In a possible implementation, the system further includes: a wavelength selective switch (wavelength selective switch, WSS) and (W−1) extended groups, where W is an integer greater than or equal to 2, the first group and the (W−1) extended groups form W groups, and the WSS is connected to each of the W groups. In this way, while the MEMS constructs a fixed optical transmission channel between any two nodes, the training scale of model training may be expanded to W times the original scale through a feature of the WSS, namely that carriers of different wavelengths entering one input port may be output from different output ports, so that model training can be performed on a larger scale.
In a possible implementation, the WSS includes W first WSS ports and W second WSS ports. The W first WSS ports are respectively connected to W node ports, the W node ports respectively belong to the W groups, and the positions of the W node ports in their respective groups correspond to one another. The W node ports correspond to respective MEMS ports in their respective groups, and the MEMS ports corresponding to the W node ports are respectively connected to the W second WSS ports. In this way, the WSS and the MEMSs in the W groups may connect nodes in any two of the W groups, so that processors in any two groups can transmit data to each other, thereby helping expand the training scale.
In a possible implementation, the first processor is further configured to sequentially send the first target data to the second processor through optical transmission channels separately constructed by the WSS and a second MEMS. The second node is another node other than the first node in the first group, or is a node in any one of the (W−1) extended groups. The WSS and the second MEMS are sequentially located between the first node and the second node, and the second MEMS and the second node belong to a same group. In this way, through the WSS and the second MEMS, the first processor may send the first target data to the processor in another node in the group, and may further send the first target data to a processor in another group, thereby helping expand a training scale.
In a possible implementation, the first processor is specifically configured to modulate the first target data to a carrier, where a wavelength of the carrier is a preset wavelength corresponding to a group to which the second node belongs. The WSS is configured to send the carrier carrying the first target data to the second MEMS based on a mapping relationship between the wavelength of the carrier and the group to which the second node belongs. In this way, the first processor may adjust the wavelength of the carrier for carrying the first target data based on the preset wavelength of the group to which the second processor belongs, to send different target data to different groups. In addition, the first processor may quickly adjust the wavelength of the carrier, which helps increase a rate of transmitting data by the first processor to another processor.
In a possible implementation, each of the W groups corresponds to two preset wavelengths. That is, when the first processor sends the target data to a target group through the WSS, the WSS may transmit the target data to two WSS ports corresponding to the target group, and one of the two WSS ports is a WSS port corresponding to an MEMS in the target group. Correspondingly, the first processor may send the first target data to the second processor sequentially through the optical transmission channels corresponding to the WSS and the MEMS. Another WSS port may be a WSS port corresponding to a node in the target group. Correspondingly, the first processor may directly send the first target data to the second processor through the optical transmission channel corresponding to the WSS. This can help improve flexibility of data transmission in model training, and reduce unnecessary bandwidth consumption. In this case, if a total quantity of available wavelengths in the WSS is limited, for example, when the total quantity of available wavelengths in the WSS is less than a total quantity of ports of the WSS, the total quantity W of groups may be set to ½ of the total quantity of available wavelengths in the WSS.
In a possible implementation, both training data and training models in any two processors of the S×C processors are different, and a collective communication manner between the S×C processors is alltoall; or training data in any two processors of the S×C processors is different, and a collective communication manner between the S×C processors is allreduce.
In a possible implementation, the target data includes one or more of a gradient, a feature, and a model parameter for model iteration, and the target data in a plurality of dimensions is exchanged between the processors. This helps improve model training efficiency and improve accuracy of a trained model.
According to a second aspect, this application provides a model training method, including:
A first processor of a first node performs model training in the first processor to obtain first target data. The first processor sends the first target data to a second processor of a second node through an optical transmission channel constructed by an MEMS. The MEMS is located between the first node and the second node, and the first target data is for the second processor to adjust a parameter for model training in the second processor.
In a possible implementation, that the first processor sends the first target data to a second processor of a second node through an optical transmission channel constructed by an MEMS includes: The first processor sends the first target data to the second processor through the optical transmission channel constructed by the MEMS and an intra-node channel. The intra-node channel includes a channel that is in the first node and that is between the first processor and the MEMS, and/or a channel that is in the second node and that is between the second processor and the MEMS.
In a possible implementation, that the first processor sends the first target data to a second processor of a second node through an optical transmission channel constructed by an MEMS includes: The first processor sends the first target data to the second processor sequentially through an optical transmission channel constructed by a WSS and the optical transmission channel constructed by the MEMS. The second node and the MEMS belong to a same group, and the WSS is located between the MEMS and the first node.
In a possible implementation, the WSS includes a mapping relationship between a wavelength of a carrier and a group, and in one mapping relationship, the wavelength of the carrier is a preset wavelength of a corresponding group. That the first processor sends the first target data to the second processor sequentially through an optical transmission channel constructed by a WSS and the optical transmission channel constructed by the MEMS includes: The first processor modulates the first target data onto the carrier, where the wavelength of the carrier is a preset wavelength corresponding to a group to which the second node belongs. The first processor sends the carrier carrying the first target data to the WSS, so that the WSS sends the carrier carrying the first target data to the MEMS.
In a possible implementation, that a first processor of a first node performs model training in the first processor to obtain first target data includes: The first processor performs model training in the first processor to obtain intermediate data of the first processor; and the first processor determines the first target data based on a collective communication manner and the intermediate data of the first processor, where the first target data is all or a part of the intermediate data of the first processor. The training data and the training models in the first processor and the second processor are different, and the collective communication manner is alltoall; or training data in the first processor and the second processor is different, and the collective communication manner is allreduce.
In a possible implementation, that the first processor determines the first target data based on a collective communication manner and the intermediate data of the first processor includes: The first processor divides the intermediate data of the first processor based on the alltoall and a total quantity of processors corresponding to the alltoall. The processors corresponding to alltoall include the first processor and a second processor, a quantity of data parts after division is equal to the total quantity of processors, and the data after division includes first target data corresponding to the second processor.
In a possible implementation, the alltoall corresponds to S nodes, the first node is an s1th node in the S nodes, the second node is an s2th node in the S nodes, s1 and s2 are set to every integer in [0, S−1], and s1 is less than s2. The second processors are the C processors included in the second node. The first target data is (s2×C)th to (s2×C+C−1)th pieces of data in S×C pieces of data after the division.
In a possible implementation, the alltoall corresponds to W groups, the first node is an s1th node of a w1th group in the W groups, and the second node is an s2th node of a w2th group in the W groups. w1 is set to every integer in [0, W−1], and w2 = w1 + offset, where offset = ((s2 % W) − (s1 % W)) % W.
In a possible implementation, that the first processor determines the first target data based on a collective communication manner and the intermediate data of the first processor includes: The first processor divides the intermediate data of the first processor based on the allreduce and a total quantity C of processors in the first node, to obtain C pieces of data. The first processor obtains ith data of other (C−1) processors in the first node through an intra-node channel of the first node. After the first processor performs summation on ith data in the C pieces of data and the ith data of the other (C−1) processors in the first node, the first target data is obtained. The first processor is an ith processor in the first node, and the second processor is an ith processor in the second node. In this way, when the first processor and the second processor belong to different nodes, the first processor may first obtain, through intra-node communication, a result of aggregation between ith data of the processors in the first node, that is, a result of data aggregation in the first node. Similarly, the second processor may also obtain a result of data aggregation in the second node. Then, the first processor and the second processor perform inter-node data aggregation. Specifically, the first processor may send the aggregation result (that is, the first target data) in the first node to the second processor, and the second processor may perform aggregation on the aggregation result in the first node and the aggregation result in the second node, to obtain a result of inter-node data aggregation.
In a possible implementation, the allreduce corresponds to the W groups, one group includes the S nodes, and one node includes the C processors. The first processor is an ith processor in a group to which the first processor belongs, and a second processor is an ith processor in a group to which the second processor belongs. That the first processor determines the first target data based on a collective communication manner and the intermediate data of the first processor includes: The first processor divides the intermediate data of the first processor based on the allreduce and the total quantity S×C of processors in the group, to obtain the S×C pieces of data. The first processor obtains, through the intra-node channel of the first node and/or optical transmission channels that are between the first node and other (S−1) nodes in the group to which the first processor belongs and that are constructed by the MEMS, ith data of other (S×C−1) processors in the group to which the first processor belongs. The first processor performs summation on the ith data in the S×C pieces of data and the ith data of the other (S×C−1) processors in the group to which the first processor belongs, to obtain the first target data. In this way, when the first processor and the second processor belong to different groups, the first processor may first obtain, through intra-group communication, an aggregation result between ith data of the processors in the group to which the first processor belongs, that is, a result of intra-group data aggregation. Similarly, the second processor may also obtain a result of data aggregation in the group to which the second processor belongs. Then the first processor and the second processor perform inter-group data aggregation. Specifically, the first processor may send the intra-group aggregation result (that is, the first target data) obtained by the first processor to the second processor, and the second processor may perform aggregation on the intra-group data aggregation result and the intra-group data aggregation result corresponding to the second processor, to obtain an inter-group data aggregation result.
In a possible implementation, the method further includes: The first processor obtains second target data, where the second target data is data that is obtained by the second processor by performing model training in the second processor and that is to be transmitted to the first processor. The first processor adjusts a parameter for model training in the first processor based on the second target data. In this way, the first processor may determine aggregated data based on the second target data, and adjust the parameter for model training in the first processor based on the aggregated data.
In a possible implementation, before the first processor performs model training in the first processor to obtain first target data, the method further includes: The first processor divides a plurality of nodes into W groups based on a total quantity of the plurality of nodes for joint model training, a total quantity of ports of the WSS, and a total quantity of available wavelengths in the WSS. When the total quantity of available wavelengths in the WSS is less than the total quantity of the ports of the WSS, and one group in the W groups corresponds to two preset wavelengths, W is equal to ½ of the total quantity of available wavelengths in the WSS.
According to a third aspect, this application further provides a computing device. The computing device includes a processor and a memory, and may further include a communication interface. The processor executes program instructions in the memory to perform the method provided in the second aspect or any possible implementation of the second aspect. The memory is coupled to the processor, and stores program instructions and data that are necessary for performing a data processing process. The communication interface is configured to communicate with another device, for example, send first target data to a second node.
According to a fourth aspect, this application provides a computer-readable storage medium. When the instructions stored in the computer-readable storage medium are executed by a computing device, the computing device performs the method provided in the second aspect or any possible implementation of the second aspect. The storage medium stores a program. The storage medium includes, but is not limited to, a volatile memory, for example, a random access memory, or a non-volatile memory, for example, a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD).
According to a fifth aspect, this application provides a computing device program product. The computing device program product includes computer instructions, and when the computer instructions are executed by a computing device, the computing device performs the method provided in the second aspect or any possible implementation of the second aspect.
According to a sixth aspect, this application further provides a chip. The chip is connected to a memory, and the chip is configured to read and execute a software program stored in the memory, to perform the method in the second aspect or any possible implementation of the second aspect.
To better explain embodiments of this application, related terms or technologies in this application are first explained as follows.
A neural network (NN) is a mathematical model that performs distributed parallel information processing by simulating behavior characteristics of animal neural networks. Information can be processed by adjusting an interconnection relationship between a large quantity of nodes in the neural network. The neural network has self-learning and self-adaptation capabilities.
Specifically, the neural network may usually include a plurality of layers connected end to end, for example, a convolution layer, a fully connected layer (fully connected layer, FC), an activation layer, or a pooling layer. Each layer may be expressed as a function y=f_w(x), where f is the function of the layer, the function f is differentiable, w is a weight (or referred to as a weight tensor), x is an input (or referred to as an input tensor), and y is an output (or referred to as an output tensor).
It is assumed that there is a dataset {(x_0, l_0), ..., (x_{n−1}, l_{n−1})}, where x_0, ..., and x_{n−1} are n inputs, and the corresponding l_0, ..., and l_{n−1} are expected outputs of the n inputs respectively, which are also referred to as labels (labels). Each (x_j, l_j) is referred to as a piece of sample data.
Any input (which may be represented as x_j) in the dataset is input to the neural network and processed sequentially from layer 0 to layer m−1 of the neural network.
An objective of model training is to solve w_0, ..., and w_{m−1}, so that y_{m−1}^j is as close as possible to l_j under a loss function L.
Further, a solving process may use a stochastic gradient descent (stochastic gradient descent, SGD) method, which includes the following forward propagation and backward propagation.
Forward propagation: Any input (which may be represented as x_j) in a dataset is input to the function f_0, so that the function f_0 outputs y_0^j; then, y_0^j is input to a function f_1, so that the function f_1 outputs y_1^j; and by analogy, outputs respectively corresponding to the functions f_0 to f_{m−1} are obtained, that is, y_0^j, y_1^j, ..., y_{m−1}^j. Then, a loss (loss) is calculated with reference to l_j corresponding to x_j and a loss function L.
Backward propagation: The chain rule is used to calculate a gradient Δy_j of y_j and a gradient Δw_j of w_j of each layer. Specifically, for example, a gradient Δy_{m−1} of the layer m−1 is determined through the loss and y_{m−1}, and then a gradient Δw_{m−1} of the layer m−1 is determined through Δy_{m−1} and w_{m−1}. By analogy, Δy and Δw of each layer are obtained, that is, Δy_0, Δw_0, ..., Δy_{m−1}, Δw_{m−1} are obtained.
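For illustration only, the forward propagation, backward propagation, and weight update described above can be sketched as follows in Python with NumPy, assuming purely linear layers y = x·w and a squared-error loss; the function names and shapes are illustrative assumptions rather than part of this application.

```python
import numpy as np

def forward(ws, x):
    """Forward propagation: pass x through layers f_0 ... f_{m-1}, keeping every layer output."""
    ys = []
    for w in ws:
        x = x @ w            # y = f_w(x), assumed linear for simplicity
        ys.append(x)
    return ys

def backward(ws, x, ys, label):
    """Backward propagation: use the chain rule to compute the weight gradient of each layer."""
    grad_y = ys[-1] - label  # gradient of the assumed loss 0.5*||y_{m-1} - l||^2 w.r.t. y_{m-1}
    grad_ws = [None] * len(ws)
    for i in reversed(range(len(ws))):
        inp = x if i == 0 else ys[i - 1]
        grad_ws[i] = inp.T @ grad_y      # delta w of layer i
        grad_y = grad_y @ ws[i].T        # delta y of layer i-1, propagated backward
    return grad_ws

def sgd_step(ws, grad_ws, lr=0.01):
    """One SGD update of the weights."""
    return [w - lr * g for w, g in zip(ws, grad_ws)]

# One illustrative iteration:
# ys = forward(ws, x); grads = backward(ws, x, ys, label); ws = sgd_step(ws, grads)
```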
In model training, K NPUs may be used for training a model together (or joint model training), where K is an integer greater than or equal to 2. In this application, the K NPUs may be represented as an NPU0, an NPU1, . . . , and an NPU(K−1), or may be represented as a 0th NPU, a 1st NPU, . . . , and a (K−1)th NPU. This description is also applicable to another example.
To make full use of a parallel computing capability inside each NPU, a dataset is usually divided into a plurality of subsets. A size of each subset is referred to as a batch size (batch size), and a subset may be denoted as a bs.
For the dataset, refer to the following expression 2:
bs_0 may include bs pieces of sample data, and may be represented as (x_0, l_0), ..., (x_{bs−1}, l_{bs−1}); and bs_1 may also include bs pieces of sample data, and may be represented as (x_{bs}, l_{bs}), ..., (x_{2×bs−1}, l_{2×bs−1}), and the like.
During each model training iteration, one bs may be input into the neural network, that is, the foregoing forward propagation and backward propagation operations are performed on the bs.
To improve the training speed by using a plurality of NPUs when the dataset is further enlarged, another level of division, a mini batch size denoted as mbs, may further be added to the expression 2. In this way, a subset may further be divided into a plurality of mbss. For the dataset, refer to the following expression 3:
The mbs pieces of sample data in mbs_0 may be represented as (x_0, l_0), ..., (x_{mbs−1}, l_{mbs−1}), and the mbs pieces of sample data in mbs_1 may be represented as (x_{mbs}, l_{mbs}), ..., (x_{2×mbs−1}, l_{2×mbs−1}), and the like.
Correspondingly, in each training iteration of model training, the K bss or the K mbss may be respectively input into the K NPUs, to complete data parallel training.
For any one of the K NPUs (which may be represented as an NPUk), a weight corresponding to the layer m−1 of a neural network of the NPU may be represented as w_{m−1}^k, and a weight gradient may be represented as Δw_{m−1}^k.
Correspondingly, after the K NPUs respectively obtain their respective Δw_{m−1} through calculation, the K NPUs may perform data aggregation on the respective Δw_{m−1}, to obtain an input of a next round of model training. For example, each NPU may obtain Δw_{m−1} of the other (K−1) NPUs, and each NPU calculates an average value based on the K Δw_{m−1}. For details, refer to the following expression 4:
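The exact expression 4 is not reproduced here; purely as an illustration of the averaging described above, the aggregated gradient used as the input of the next iteration could take the following assumed form, where Δw_{m−1}^k is the weight gradient computed by the NPUk:

```latex
\Delta \bar{w}_{m-1} \;=\; \frac{1}{K} \sum_{k=0}^{K-1} \Delta w_{m-1}^{k}
```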
As the model scale further increases, for example, as a quantity of model layers further increases, the processing capability of a single NPU cannot complete an operation of an entire model. In this case, the single model needs to be split across K NPUs, where K is an integer greater than or equal to 2. This calculation manner may be referred to as model parallelism. Further, during model parallelism, the dataset may further be divided into K bss or K mbss, to combine the model parallelism and the data parallelism. This calculation manner may be referred to as hybrid parallelism.
In a data parallel training process, a model parallel training process, or a hybrid parallel training process, intermediate data of the K NPUs needs to be aggregated. Intermediate data of each NPU may include one or more of a feature (feature or activation), a gradient, and a model parameter obtained through model training. The feature is, for example, a feature of training data learned through the model, the model parameter is, for example, a parameter w of a function f in the neural network, and the gradient is, for example, a gradient Δw_j of w_j generated during backward propagation. For ease of description, the intermediate data may be referred to as data for short in the following.
Specifically, data aggregation between the K NPUs may be completed in a collective communication manner, to obtain the aggregated data. The collective communication manner (or referred to as a collective algorithm) may specifically include one or more of allreduce and alltoall.
The allreduce can be used for data aggregation in a case of data parallelism. Common allreduce includes ring allreduce, recursive halving and doubling, butterfly, and hierarchical allreduce.
The ring allreduce is a logical ring formed by K NPUs (the physical topology is not necessarily a ring). Each NPU divides data of the NPU into K pieces. Then, each NPU obtains data of the other (K−1) NPUs through the following procedure.
The following describes the ring allreduce with reference to an example of K NPUs. In a first step (shown as (a) in the corresponding figure), each NPU sends one of its K pieces of data to a next NPU in the ring, and the next NPU accumulates the received piece into its local piece with the same index. In step 2 (shown as (b)), each NPU forwards the piece accumulated in the previous step to the next NPU, which continues the accumulation. By analogy (shown as (c)), after (K−1) steps, each NPU holds one piece in which the data of all K NPUs is accumulated, and the accumulated pieces are then transferred around the ring again so that each NPU obtains a complete aggregation result.
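The following is a minimal Python sketch of the ring allreduce procedure summarized above; the list-of-lists representation (one row of K pieces per NPU) is an illustrative assumption.

```python
def ring_allreduce(data):
    """data[k] holds the K pieces of NPUk; after the call, every NPU holds the aggregated pieces."""
    K = len(data)
    # Accumulation around the ring: in each of the (K-1) steps, every NPU forwards one
    # piece to the next NPU, which adds it to its local piece with the same index.
    for step in range(K - 1):
        for k in range(K):
            piece = (k - step) % K
            data[(k + 1) % K][piece] += data[k][piece]
    # Distribution around the ring: the fully accumulated pieces are passed on for another
    # (K-1) steps so that every NPU ends up with the complete aggregation result.
    for step in range(K - 1):
        for k in range(K):
            piece = (k + 1 - step) % K
            data[(k + 1) % K][piece] = data[k][piece]
    return data

# Example: 3 NPUs, each holding 3 pieces; every row becomes [12, 15, 18].
result = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```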
Compared with ring allreduce, the recursive halving and doubling can reduce a quantity of transmission times between NPUs. Still using K NPUs as an example, each NPU may include data of the NPU, for example, an NPUk includes data k of the NPUk.
The principle of recursive halving and doubling is as follows.
Step 1: An NPUk1 sends data k1 to an NPUk1−1. Correspondingly, the NPUk1−1 uses a sum (represented as data k1+k1−1) of the local data k1−1 and the data k1 from the NPUk1 as local data, to obtain a sum of data of two adjacent NPUs in the K NPUs. k1 is greater than or equal to 1, and is less than or equal to K−1.
For details, refer to the first step in the corresponding figure.
Step 2: An NPUk2 sends data k2 to an NPUk2−2. Correspondingly, the NPUk2−2 uses a sum (represented as data k2+k2−2) of the local data k2−2 and the data k2 from the NPUk2 as local data. The NPUk2 is any one of a plurality of NPUs that receive data of other NPUs in step 1. In this way, a sum of data of four adjacent NPUs in the K NPUs is obtained.
For details, refer to step 2 in the corresponding figure.
In a manner similar to the step 1 and step 2, data of adjacent NPUs is sequentially summed, and finally data of the NPU0 to the NPU(K−1) is accumulated to the NPU0, that is, the NPU0 includes an accumulation result of the data of the NPU0 to the NPU(K−1). The accumulation result may be understood as an aggregation result of the K NPUs or aggregation data of the K NPUs.
Subsequently, each NPU distributes the accumulation result back to each NPU based on a sequence reverse to the foregoing data transmission sequence. In this way, all recursive halving and doubling are completed.
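A minimal Python sketch of the recursive halving and doubling described above is given below; scalar data and a power-of-two K are illustrative assumptions.

```python
def recursive_halving_doubling(data):
    """data[k] is the local data of NPUk; returns the list after accumulation and distribution."""
    K = len(data)
    # Accumulation: in the first round NPU1 -> NPU0, NPU3 -> NPU2, ...; in the second round
    # NPU2 -> NPU0, NPU6 -> NPU4, ...; and so on, until NPU0 holds the accumulation of all data.
    step = 1
    while step < K:
        for k in range(step, K, 2 * step):
            data[k - step] += data[k]
        step *= 2
    # Distribution: the accumulation result is sent back along the reverse of the
    # transmission sequence, so that every NPU obtains it (shown here as a simple copy).
    return [data[0]] * K
```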
Compared with the foregoing unidirectional transmission of recursive halving and doubling, bidirectional data exchange may be implemented in the butterfly. For example, in step 1, the NPUk1 not only sends the data k1 to the NPUk1−1, but also receives the data k1−1 of the NPUk1−1.
The butterfly may include the following steps.
Step 1: The NPUk1 exchanges local data with the NPUk1−1 to obtain a sum of data of two adjacent NPUs in the K NPUs. k1 is greater than or equal to 1, and is less than or equal to K−1.
Step 2: The NPUk2 exchanges local data with the NPUk2−2 to obtain a sum of data of four adjacent NPUs in the K NPUs, where k2 is greater than or equal to 2, and is less than or equal to K−1.
In a manner similar to the step 1 and step 2, data of adjacent NPUs is sequentially summed, so that each NPU in the NPU0 to the NPU(K−1) has an accumulation result of K pieces of data.
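For comparison, a minimal Python sketch of the butterfly described above is as follows; a power-of-two K is an illustrative assumption.

```python
def butterfly_allreduce(data):
    """data[k] is the local data of NPUk; every NPU ends up with the accumulation of all data."""
    K = len(data)
    step = 1
    while step < K:
        # Paired NPUs at distance `step` exchange data in both directions and both keep the sum,
        # e.g. NPU0 and NPU1 in the first round, NPU0 and NPU2 in the second round.
        data = [data[k] + data[k ^ step] for k in range(K)]
        step *= 2
    return data
```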
A plurality of NPUs may be assembled into a same node. Bandwidth between the plurality of NPUs in the same node is higher than bandwidth between NPUs in different nodes. The node may be understood as a computing node, a computing server, or the like.
When the plurality of NPUs perform data aggregation, the plurality of NPUs may sequentially perform first intra-node data aggregation, inter-node data aggregation, and second intra-node data aggregation.
Refer to the following example.
A quantity of nodes is 4, the four nodes may be respectively represented as a node 0 to a node 3, each node includes four NPUs, and the four NPUs in each node may be represented as an NPU0 to an NPU3, and are distinguished based on a node to which the four NPUs belong.
In the first intra-node data aggregation:
for any node, each NPU divides data of the NPU into four pieces. An ith NPU obtains ith data in other NPUs in a current node, and accumulates the obtained ith data in the other NPUs and the ith data of the ith NPU.
For example, for the node 0, an NPU0 of the node 0 divides data into four pieces, which are respectively represented as a00, a01, a02, and a03; an NPU1 of the node 0 divides data into four pieces, which are respectively represented as b00, b01, b02, and b03; an NPU2 of the node 0 divides data into four pieces, which are respectively represented as c00, c01, c02, and c03; and an NPU3 of the node 0 divides data into four pieces, which are respectively represented as d00, d01, d02, and d03.
The NPU0 of the node 0 respectively obtains 0th data in the NPU1, the NPU2, and the NPU3 in the node 0, to obtain a sum of the 0th data of all NPUs in the node 0, that is, a00+b00+c00+d00. The NPU1 of the node 0 respectively obtains 1st data in the NPU0, the NPU2, and the NPU3 in the node 0, to obtain a sum of the 1st data of the NPUs in the node 0, that is, a01+b01+c01+d01.
In inter-node data aggregation:
the ith NPU of each node performs data aggregation through inter-node bandwidth, where the collective communication manner may be implemented through one of the ring allreduce, recursive halving and doubling, or butterfly.
For example, the NPU0 of the node 0, the NPU0 of the node 1, the NPU0 of the node 2, and the NPU0 of the node 3 perform data aggregation on their respective sums of the 0th data (for example, a00+b00+c00+d00 for the node 0), to obtain a sum of the 0th data of all the NPUs in the four nodes.
Subsequently, in the second intra-node data aggregation:
the ith NPU in each node distributes the ith data obtained by aggregating the inter-node data to another NPU in the current node. For example, the NPU0 of the node 0 distributes, to the NPU1, the NPU2, and the NPU3 of the node 0, the 0th data obtained through the inter-node data aggregation.
Herein, the NPU in the node distributes data obtained by aggregating the inter-node data to another NPU in the current node. It may also be understood that this process is intra-node data distribution. This description may also be applicable to another example.
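The three phases described above can be sketched as follows in Python; the nested-list representation (data[n][j] is the list of C pieces held by the jth NPU of the nth node) and the plain summation used for the inter-node phase are illustrative assumptions.

```python
def hierarchical_allreduce(data):
    """Hierarchical allreduce over S nodes with C NPUs per node."""
    S, C = len(data), len(data[0])
    # 1) First intra-node data aggregation: the i-th NPU of node n sums the i-th piece of
    #    every NPU in its node, e.g. a00 + b00 + c00 + d00 for the NPU0 of the node 0.
    intra = [[sum(data[n][j][i] for j in range(C)) for i in range(C)] for n in range(S)]
    # 2) Inter-node data aggregation: the i-th NPUs of all nodes aggregate their sums
    #    (shown as a plain sum; ring allreduce, recursive halving and doubling, or
    #    butterfly could be used over the inter-node bandwidth instead).
    inter = [sum(intra[n][i] for n in range(S)) for i in range(C)]
    # 3) Second intra-node data aggregation (intra-node distribution): every NPU of every
    #    node receives the complete aggregation result.
    return [[list(inter) for _ in range(C)] for _ in range(S)]
```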
The alltoall can be used for data aggregation in hybrid parallel or model parallel cases.
For example, the alltoall is performed on four NPUs, each NPU includes four pieces of data, and the four pieces of data respectively corresponding to the four NPUs can form a 4×4 data matrix. When the alltoall is performed, a transposition operation is performed on the 4×4 data matrix, that is, a jth piece of data of an ith NPU is exchanged with an ith piece of data of a jth NPU. For details, refer to (a) in the corresponding figure.
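Functionally, the alltoall on the four NPUs amounts to transposing the 4×4 data matrix; the following minimal Python sketch (with an assumed row-per-NPU representation) illustrates this.

```python
def alltoall(pieces):
    """pieces[i][j] is the j-th piece held by NPU i; after alltoall, NPU i holds the
    i-th piece of every NPU, i.e. the transpose of the data matrix."""
    K = len(pieces)
    return [[pieces[j][i] for j in range(K)] for i in range(K)]

# Example with four NPUs: row i is the data of NPU i.
before = [["00", "01", "02", "03"],
          ["10", "11", "12", "13"],
          ["20", "21", "22", "23"],
          ["30", "31", "32", "33"]]
after = alltoall(before)   # row 0 becomes ["00", "10", "20", "30"], and so on
```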
The MEMS is a type of optical cross-connect device (OXC), which can be used to deflect optical signals.
The MEMS may implement deflection of the optical signal by adjusting an angle of the MEMS micromirror, so that the optical signal is output from different output ports, to implement optical path switching.
The WSS is also a type of OXC. The WSS can configure any wavelength to any port.
With reference to an example, it is assumed that the WSS includes three input ports and three output ports, a modulator is connected to the rightmost input port, and the modulator may modulate data onto a carrier corresponding to a first wavelength, a second wavelength, or a third wavelength.
In this case, when the modulator modulates the data to the carrier corresponding to the first wavelength, the carrier is input through the rightmost input port, and is output through the rightmost output port. When the modulator modulates the data to the carrier corresponding to the second wavelength, the carrier is input through the rightmost input port, and is output through the intermediate output port. When the modulator modulates the data to the carrier corresponding to the third wavelength, the carrier is input through the rightmost input port, and is output through the leftmost output port.
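The wavelength-dependent routing described above can be modelled as a simple lookup from (input port, wavelength) to output port; the port and wavelength names below are illustrative assumptions, and only the behavior of the rightmost input port from the example is filled in.

```python
# Assumed routing table of a 3-in-3-out, 1-lane WSS for the example above.
WSS_ROUTES = {
    ("in_right", "lambda_1"): "out_right",
    ("in_right", "lambda_2"): "out_middle",
    ("in_right", "lambda_3"): "out_left",
}

def wss_output_port(input_port, wavelength):
    """Return the output port from which a carrier of the given wavelength is output."""
    return WSS_ROUTES[(input_port, wavelength)]
```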
In a digital communication protocol, a valid communication port usually includes eight lanes of synchronized valid data. The eight lanes of synchronized valid data are input to an 8-lane WSS, and then are output from the 8-lane WSS. The 8-lane WSS is a WSS including eight lanes (links). The 8-lane WSS may include eight 1-lane WSSs. Refer to a schematic diagram of a structure of a WSS shown in (a) in the corresponding figure.
With reference to (a) in the corresponding figure, each of the eight lanes of synchronized valid data is transmitted through one of the eight 1-lane WSSs.
Further, each 1 lane may correspond to one W-in-W-out WSS, where W is a positive integer, and each 1 lane may be configured to determine, based on a wavelength of a carrier and an input port, an output port to which data is to be transmitted. For details, refer to (b) in the corresponding figure.
In model training, a model training node may include a plurality of NPUs. Each NPU may run different data (that is, data parallelism), or run different models (that is, model parallelism or hybrid parallelism). The plurality of NPUs may jointly train a model. Further, in each round of iteration of model training, each NPU needs to perform data aggregation on data obtained by executing its local model and data corresponding to other NPUs, to update a model parameter for a next iteration.
Data aggregation between a plurality of NPUs can be implemented through a fat-tree electrical switching network. Specifically, in the fat-tree electrical switching network, data exchanged between the NPUs is forwarded by electrical switches.
However, as a model training scale gradually increases, and NPU computing power continuously increases, an amount of data that needs to be forwarded by a switch sharply increases. Consequently, a line congestion problem may occur when the switch forwards data. In this way, data may not be efficiently transmitted between NPUs in model training.
Therefore, this application provides a model training system and method, to implement larger-scale model training and implement efficient data transfer between a large quantity of processors. The processor may be an NPU, a GPU, a CPU, or another device having a processing function. The following uses an NPU as an example for description. Other processors are similar.
A model training system (hereinafter referred to as a system) in this application is first explained.
The system may include S nodes, one node may include C NPUs, and one NPU may further include P ports. Correspondingly, the system may include S×C NPUs, and each node may include C×P ports, where the ports may be input ports or output ports. S, C, and P are all positive integers.
In an example, the system may include four nodes, each node includes two NPUs (which may be referred to as four nodes and two NPUs for short below), and each NPU includes two ports.
In another example, the system may include four nodes, each node includes four NPUs (which may be referred to as four nodes and four NPUs for short below), and each NPU includes one port.
The S nodes may be respectively corresponding to respective node numbers, and the C×P ports in each node may be respectively corresponding to respective port numbers. In the following, the S nodes may be sequentially referred to as a node 0, a node 1, a node 2, . . . , and a node S−1, and the C×P ports in each node are sequentially referred to as a port 0, a port 1, a port 2, . . . , and a port C×P−1.
It may be understood that all ports included in the S nodes may form a port matrix. For example, a system including four nodes each having four ports corresponds to a 4×4 port matrix, where a port y of a node x may be denoted as a port (x, y).
Further, the system may further include an MEMS. The MEMS may be configured to construct an optical transmission channel between any two of the S nodes. It may also be understood that the MEMS implements connection between any two nodes, and NPUs in the two nodes may perform data aggregation through the optical transmission channel.
The following uses the node 0 and the node 1 as an example for description.
The node 0 includes a port (0, 0), the node 1 includes a port (1, 0), and the port (0, 0) may be connected to the port (1, 0) by the MEMS. Specifically, the MEMS includes a port M1 and a port M2 corresponding to each other. An optical signal input from the port M1 may be output through the port M2, or an optical signal input from the port M2 may be output through the port M1. The port (0, 0) is connected to the port M1, and the port (1, 0) is connected to the port M2. In this way, the node 0 may communicate with the node 1 through the optical transmission channel (that is, an optical transmission channel between the port M1 and the port M2) corresponding to the MEMS.
Further, this application provides the following two manners of constructing an optical transmission channel by the MEMS.
Manner 1: A port (x1, y) is connected to a port (x2, y) by the MEMS. x1 and x2 correspond to different nodes, x1 and x2 are both set to every integer in [0, S−1], and y may be set to every integer in [0, C×P−1].
Manner 2: A port (x, y) is connected to a port (y, x) by the MEMS.
With reference to the 4×4 port matrix in the foregoing example, the following separately describes the two manners.
For the connection manner of manner 1, refer to the following description.
All the port pairs of the port (0, 0) and the port (1, 0), a port (2, 0) and a port (3, 0), the port (0, 1) and a port (3, 1), a port (1, 1) and a port (2, 1), a port (0, 2) and a port (2, 2), a port (1, 2) and a port (3, 2), a port (0, 3) and a port (1, 3), and a port (2, 3) and a port (3, 3) can be connected by the MEMS.
Optionally, in the plurality of port pairs, port pairs corresponding to a same port number may be connected by a same MEMS. For example, a port number 0 corresponds to two port pairs: the port (0, 0) and the port (1, 0), and the port (2, 0) and the port (3, 0). The two port pairs can be connected by a same MEMS. In other words, the MEMS may implement communication between the port (0, 0) and the port (1, 0), and can further implement communication between the port (2, 0) and the port (3, 0). In this way, ports in the MEMS can be fully used, thereby helping reduce a quantity of MEMSs in the system.
For the connection manner of manner 2, refer to the following description.
The port pairs of the port (0, 1) and the port (1, 0), the port (0, 2) and the port (2, 0), the port (0, 3) and the port (3, 0), the port (1, 2) and the port (2, 1), the port (1, 3) and the port (3, 1), and the port (2, 3) and the port (3, 2) can be separately connected by the MEMS.
Optionally, a plurality of port pairs can be connected by a same MEMS, for example, the port (0, 3) and the port (3, 0), and the port (1, 2) and the port (2, 1) may be connected by a same MEMS. That is, the MEMS may implement communication between the port (0, 3) and the port (3, 0), and may further implement communication between the port (1, 2) and the port (2, 1). In this way, ports in the MEMS can be fully used, thereby helping reduce a quantity of MEMSs in the system.
Certainly, there may be another connection manner in which the MEMS constructs an optical transmission channel. Examples are not provided in this application. An optical transmission channel constructed between any two of the plurality of nodes by the MEMS falls within the protection scope of this application.
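As an illustration of the two manners, the following Python sketch enumerates the MEMS port pairs for the 4×4 example; the per-port pairing table for manner 1 simply writes out the pairs listed above and is not a general construction.

```python
# Manner 2: a port (x, y) is connected to a port (y, x) by the MEMS.
def manner2_pairs(S):
    return [((x, y), (y, x)) for x in range(S) for y in range(S) if x < y]

# Manner 1 for the 4x4 port matrix: for each port number y, the listed node pairs
# are connected by the MEMS (written out explicitly for illustration).
MANNER1_NODE_PAIRS = {
    0: [(0, 1), (2, 3)],
    1: [(0, 3), (1, 2)],
    2: [(0, 2), (1, 3)],
    3: [(0, 1), (2, 3)],
}

def manner1_pairs():
    return [((x1, y), (x2, y))
            for y, pairs in MANNER1_NODE_PAIRS.items()
            for (x1, x2) in pairs]

# manner1_pairs() reproduces the eight port pairs listed above for manner 1;
# manner2_pairs(4) reproduces the six port pairs listed above for manner 2.
```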
The following explains how to implement connection between any two nodes by the MEMS in the manner 1 or manner 2 in a case of a larger node scale. Using an example of eight nodes and eight NPUs, each node includes eight NPUs, and each NPU includes one port, that is, each node includes eight ports. The eight nodes and eight NPUs may correspond to an 8×8 port matrix.
For the connection manner of manner 1, refer to the following description.
In this application, a connection manner of eight nodes and eight NPUs may be obtained based on the connection manner of the foregoing 4×4 port matrix.
For example, the connection manner of the 4×4 port matrix may be used as a first port matrix in the connection manner of eight nodes and eight NPUs, and then the first port matrix is translated in a horizontal direction to obtain a second port matrix. In this way, connections for the port 0 to the port 3 of the eight nodes may be created.
Further, refer to the following steps to create connections for the port 4 to the port 7 of the eight nodes, to obtain a complete connection manner of eight nodes and eight NPUs.
Step 1: Create a connection between a port (x, 4) and a port (x+4, 4), that is, a port pair is formed by connecting the port (x, 4) and the port (x+4, 4) by the MEMS, where 0≤x≤3. For example, if x=1, the MEMS connects a port pair formed by a port (1, 4) and a port (5, 4); or if x=2, the MEMS connects a port pair formed by a port (2, 4) and a port (6, 4).
Step 2: Create a connection relationship between ports y of the nodes based on a connection relationship between ports (y−4) of the nodes, and connect the nodes by the MEMS, where y is set to every positive integer from 5 to 7. Therefore, in the eight nodes corresponding to the port (y−4), connection between any two nodes can be implemented.
For example, a connection relationship between the ports 5 of the nodes may be created based on a connection relationship between the ports 1 of the nodes. For example, for the ports 1, the node 0 is connected to the node 3, and the node 4 is connected to the node 7; therefore, when the ports 5 are connected, the node 0 may be connected to the node 7, and the node 3 may be connected to the node 4. Further, for the ports 1, the node 1 is connected to the node 2, and the node 5 is connected to the node 6; therefore, when the ports 5 are connected, the node 1 may be connected to the node 6, and the node 2 may be connected to the node 5.
In addition, the method in this application is further applicable to interconnection of other quantities of nodes and NPUs. A connection between ports on lower-half sides of the nodes may cross a left half side and a right half side.
The following uses a connection manner of an S×M port matrix as a basic connection to describe how to expand the port matrix to a 2S×2M port matrix based on the basic connection. S is a quantity of nodes, and M is a quantity of ports in one node, where M=C×P.
First, the S×M port matrix is used as a first port matrix, and then the first port matrix is translated in a horizontal direction to obtain a second port matrix. In this way, connections between ports 0 to ports M−1 in the 2S nodes can be created.
Second, a connection between a port (x, M) and a port (x+S, M) is created, that is, a port pair is formed by connecting the port (x, M) and the port (x+S, M) by the MEMS, where 0≤x≤S−1.
Then, a connection relationship between ports y of the nodes is created based on a connection relationship between ports (y−M) of the nodes, and the ports y are connected by the MEMS, where y is set to every positive integer from (M+1) to (2M−1).
Specifically, when a port (y−M) of a node x1 is connected to a port (y−M) of a node x2, and a port (y−M) of a node (x1+S) is connected to a port (y−M) of a node (x2+S), a port y of the node x1 may be connected to a port y of the node (x2+S), and a port y of the node (x1+S) may be connected to a port y of the node x2. Therefore, for the port y, a connection between any two of the 2S nodes can be implemented with reference to the connections of the port (y−M).
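The expansion steps above can be sketched as follows in Python; representing a connection plan as a mapping from a port number to a list of connected node pairs is an illustrative assumption.

```python
def expand_plan(base, S, M):
    """Expand a plan for S nodes with M ports (base[y] = node pairs connected on port y)
    into a plan for 2S nodes with 2M ports, following the three steps above."""
    plan = {}
    # Ports 0 .. M-1: keep the base connections and translate them onto nodes S .. 2S-1.
    for y in range(M):
        plan[y] = list(base[y]) + [(x1 + S, x2 + S) for (x1, x2) in base[y]]
    # Port M: connect the port (x, M) and the port (x + S, M) for 0 <= x <= S-1.
    plan[M] = [(x, x + S) for x in range(S)]
    # Ports M+1 .. 2M-1: cross the left and right halves based on the base port (y - M).
    for y in range(M + 1, 2 * M):
        plan[y] = []
        for (x1, x2) in base[y - M]:
            plan[y].append((x1, x2 + S))
            plan[y].append((x1 + S, x2))
    return plan

# Example: expanding the hypothetical 4x4 manner-1 table from the earlier sketch
# (S = 4, M = 4) yields, for the ports 5, the pairs (0, 7), (4, 3), (1, 6), (5, 2).
```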
For the connection manner of manner 2, refer to the following description.
The node x1 includes a port (x1, y1), the node x2 includes a port (x2, y2), where x1=y2 and x2=y1, and the MEMS is configured to connect the port (x1, y1) and the port (x2, y2). A specific connection manner is similar to that in the foregoing 4×4 port matrix, and details are not described herein again.
In addition, to reduce a quantity of MEMSs, a plurality of port pairs may further be connected to a same MEMS. For example, similar to the foregoing 4×4 port matrix, two port pairs may be connected to a same MEMS, so that the ports in the MEMS are fully used.
The foregoing describes how to construct an optical transmission channel between any two of a plurality of nodes by the MEMS. Subsequently, NPUs in the any two nodes can implement data aggregation between the plurality of NPUs through the optical transmission channel corresponding to the MEMS.
In a possible manner, each NPU in the system may run a model training of the NPU to obtain data corresponding to the NPU. Then, each NPU may perform data aggregation with another NPU based on a current collective communication manner and data obtained through model training of the NPU. The following describes different collective communication manners in different cases.
To implement data aggregation performed by a plurality of NPUs through the alltoall, each NPU may divide data obtained by the NPU through model training into a plurality of parts, and a quantity of data parts obtained through division may be the same as a quantity of NPUs. The system includes S nodes, each node includes C NPUs, and each NPU may divide data of the NPU into S×C parts.
With reference to the foregoing system architecture of four nodes and two NPUs, the following provides description.
It may be understood that each NPU corresponds to eight parts of data of the NPU, and data in the eight NPUs may form an 8×8 data matrix. When the alltoall is performed, a transposition operation may be performed on the 8×8 data matrix.
In the S nodes, (s2×C)th data to (s2×C+C−1)th data of an ith NPU in an s1th node are exchanged with (s1×C+i)th data of all NPUs in an s2th node, where s1 is set to every integer in [0, S−1], s2 is set to every integer in [0, S−1], s1 is less than s2, and i is set to every integer in [0, C−1].
With reference to the foregoing example of four nodes and two NPUs, an ith piece of data of a jth NPU in a node n may be represented as a label nji; for example, 002 represents the 2nd piece of data of the NPU0 in the node 0.
It may also be understood that, the system corresponds to the data matrix, and C² pieces of data on a diagonal in the data matrix are only limited to intra-node transposition or are not transposed, and C² pieces of data not on the diagonal may be transmitted to another node through inter-node transposition. Still with reference to the foregoing example, C²=4, and four pieces of data on a diagonal of the node 0 are respectively 000, 001, 010, and 011, where 000 and 011 are not transposed, and 001 and 010 are still in a current node after transposition (that is, intra-node transposition occurs between 001 and 010). Four pieces of data not on the diagonal of the node 0 are, for example, 002, 003, 012, and 013, where 002 is transmitted to 100 of the node 1 after inter-node transposition, 003 is transmitted to 110 of the node 1 after inter-node transposition, 012 is transmitted to 101 of the node 1 after inter-node transposition, and 013 is transmitted to 111 of the node 1 after inter-node transposition.
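The exchange rule above can be sketched as follows in Python; the (node, NPU, piece) triples mirror the three-digit labels of the example and are used only for illustration.

```python
def alltoall_exchange(s1, i, s2, C):
    """For the i-th NPU of the s1-th node, return which of its pieces are sent to the
    s2-th node and which piece of each NPU of the s2-th node it receives in exchange."""
    sent = [(s1, i, p) for p in range(s2 * C, s2 * C + C)]   # pieces (s2*C) .. (s2*C + C - 1)
    received = [(s2, j, s1 * C + i) for j in range(C)]       # the (s1*C + i)-th piece of every NPU of node s2
    return sent, received

# Example matching the four-node, two-NPU case: the NPU0 of the node 0 exchanges its
# pieces 002 and 003 with the pieces 100 and 110 of the node 1.
sent, received = alltoall_exchange(s1=0, i=0, s2=1, C=2)
# sent == [(0, 0, 2), (0, 0, 3)]; received == [(1, 0, 0), (1, 1, 0)]
```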
In a specific implementation, the node includes C²×P×C pieces of data, and the node performs division based on C² pieces of data included in each part of data, and determines how to perform transposition on each part of data. With reference to the foregoing example, the following provides description.
A part of data in the node 0 is used as an example for description, and the part of data includes 002, 003, 012, and 013. The part of data is not located on the diagonal of the data matrix, and the node 0 performs inter-node transposition. For details, refer to the following steps.
Step 1: The node 0 may exchange data at an upper left corner of the part of data, that is, 002, with 100 in the node 1, so that 002 is exchanged from the node 0 to the node 1. Similarly, 100 is exchanged from the node 1 to the node 0.
Step 2: The node 0 may exchange data at a lower left corner of the part of data, that is, 003, with 110 in the node 1, so that 003 is exchanged from the node 0 to the node 1. Similarly, 110 is exchanged from the node 1 to the node 0.
Step 3: The node 0 may exchange data at an upper right corner of the part of data, that is, 012, with 101 in the node 1, so that 012 is exchanged from the node 0 to the node 1. Similarly, 101 is exchanged from the node 1 to the node 0.
Step 4: The node 0 may exchange data at a lower right corner of the part of data, that is, 013, with 111 in the node 1, so that 013 is exchanged from the node 0 to the node 1. Similarly, 111 is exchanged from the node 1 to the node 0.
Another part of data in the node 0 is used as an example for description, and the part of data includes, for example, 000, 001, 010, and 011. The part of data is located on the diagonal of the data matrix, and is transposed in the node. For details, refer to the following step 5.
Step 5: 001 and 010 are transposed in the node, and 000 and 011 are not transposed.
In another specific implementation, each NPU may divide data of the NPU into C²×P (that is, S×C) parts, where C parts of data correspond to each node, and each NPU may determine an NPU with which each part of data is to be exchanged. For example, an ith NPU in an s1th node may determine, based on the foregoing exchange relationship, that its (s2×C)th to (s2×C+C−1)th parts of data are to be exchanged with the NPUs in an s2th node.
The NPU may implement data aggregation according to the foregoing hierarchical allreduce.
For the first intra-node data aggregation, refer to the foregoing description, and details are not described herein again.
After the first intra-node data aggregation is performed, the inter-node data aggregation needs to be performed. In this embodiment of this application, the inter-node data aggregation needs to be implemented through an inter-node channel, or through an inter-node channel and an intra-node channel.
In a possible implementation, a quantity P of ports corresponding to the NPU is greater than or equal to 2. As shown in step 1 in the following example, the NPU may perform the inter-node data aggregation through the optical transmission channels corresponding to the plurality of ports of the NPU.
In this way, data is transmitted between the plurality of NPUs through an inter-node channel, or through an inter-node channel and an intra-node channel, so that in a port matrix (or a data matrix), data corresponding to an ith port in each node may be accumulated to an NPU corresponding to an ith port in an ith node.
Then, the NPU including the accumulated data may transfer, based on the inter-node channel and/or the intra-node channel again, the accumulated data to another NPU of a current node, or transfer the accumulated data to an NPU of another node.
In a case of four nodes and two NPUs, one NPU includes two ports, and a connection relationship of a 4×4 port matrix is shown in the foregoing manner 1. Data of the node 0 to the node 3 may be respectively represented as A0 to A3, B0 to B3, C0 to C3, and D0 to D3, where an NPU00 of the node 0 corresponds to the data A0 and A1, an NPU01 of the node 0 corresponds to the data A2 and A3, and the other nodes are similar.
For ease of description, the following uses data transmission between nodes as an example to describe the allreduce. When the node transmits data, the node may further correspond to an NPU in the node for data transmission. For example, when the node 0 transmits the data A1, specifically, the NPU00 transmits the data A1. For another example, when the node 1 transmits the data B2, specifically, the NPU11 transmits the data B2.
Step 1: Perform Data Transmission Through the Existing Optical Transmission Channel.
Arrow directions in the corresponding figure represent data transmission directions.
The following is implemented through an optical transmission channel corresponding to the port 0 in the node. The node 1 transmits the data B0 to the node 0, and the node 0 performs summation on the data B0 and the local data A0 to obtain data A0+B0. The node 3 transmits the data D0 to the node 2, and the node 2 performs summation on the data D0 and the local data C0 to obtain data C0+D0.
The following is implemented through an optical transmission channel corresponding to the port 1 in the node. The node 3 transmits the data D1 to the node 0, and the node 0 performs summation on the data D1 and the local data A1 to obtain data A1+D1. The node 2 transmits the data C1 to the node 1, and the node 1 performs summation on the data C1 and the local data B1 to obtain data C1+B1.
Similarly, the following may be implemented through an optical transmission channel corresponding to the port 2 in the node. The node 2 includes data A2+C2, and the node 3 includes data B2+D2. The following is implemented through an optical transmission channel corresponding to the port 3 in each node. The node 0 includes data A3+B3, and the node 3 includes data C3+D3.
Step 2: Perform Data Transmission with Reference to Internal Transmission of the Node and the Existing Optical Transmission Channel.
Vertical arrows in the corresponding figure represent data transmission inside a node.
The following is implemented through the optical transmission channel corresponding to the port 2 in the node. The node 2 transmits data C0+D0 to the node 0, and the node 0 performs summation on the data C0+D0 and local data A0+B0 to obtain data A0+B0+C0+D0.
It should be noted that, the data transmission is actually that the NPU20 in the node 2 transmits the data C0+D0 to the NPU00 in the node 0. The NPU20 corresponds to the port 0 and the port 1 of the node 2, and the NPU00 corresponds to the port 0 and the port 1 of the node 0. With reference to the connection relationship in manner 1, no optical transmission channel is directly constructed between these ports. Therefore, the NPU20 first transmits the data C0+D0 to the NPU21 through an intra-node channel of the node 2, the NPU21 transmits the data C0+D0 to the NPU01 of the node 0 through the optical transmission channel corresponding to the port 2, and the NPU01 then transmits the data C0+D0 to the NPU00 through an intra-node channel of the node 0.
The following is implemented through an optical transmission channel corresponding to the port 3 in the node. The node 0 transmits data A1+D1 to the node 1, and the node 1 performs summation on the data A1+D1 and local data C1+B1 to obtain data A1+B1+C1+D1.
Similarly, the following is implemented through the optical transmission channel corresponding to the port 3 in the node. The node 2 includes data A2+B2+C2+D2. The following is implemented through the optical transmission channel corresponding to the port 1 in the node. The node 3 includes data A3+B3+C3+D3.
In the foregoing manner, it can be implemented that data corresponding to a port 0 of each node is accumulated to the node 0, data corresponding to a port 1 of each node is accumulated to the node 1, data corresponding to a port 2 of each node is accumulated to the node 2, and data corresponding to a port 3 of each node is accumulated to the node 3.
Specifically, the NPU00 includes the data A0+B0+C0+D0, the NPU10 includes the data A1+B1+C1+D1, the NPU21 includes the data A2+B2+C2+D2, and the NPU31 includes the data A3+B3+C3+D3. The NPUs may transfer respective data to another NPU in the node and an NPU in another node based on a previous transmission route.
In still another possible implementation, the NPU includes one port, and the NPU may directly perform the inter-node data aggregation. For example, in a case of four nodes and four NPUs, a connection relationship of a 4×4 port matrix is shown in the foregoing manner 1, and an ith NPU of each node may directly perform the inter-node data aggregation with an NPU of another node through the optical transmission channel corresponding to its port.
Further, the hierarchical allreduce may further be applicable to larger-scale data aggregation. An example in which eight nodes and eight NPUs correspond to an 8×8 port matrix is used. A connection relationship of the 8×8 port matrix is shown in the foregoing connection manner of eight nodes and eight NPUs.
For data flows between nodes, refer to the following steps.
Step 1: The node 0 to the node 7 jointly determine a first data matrix.
The node 4 transmits data E0 in the node 4 to the node 0 through an optical transmission channel between the node 4 and the node 0, an internal channel of the node 4, and an internal channel of the node 0, so that the node 0 obtains data A0+E0. In this step, an internal channel of the node 4 may be from an NPU40 to an NPU44. An optical transmission channel exists between the port 4 of the NPU44 and the port 4 of the NPU04. That is, the NPU40 transmits the data E0 to the NPU44, and the NPU44 transmits the data E0 to the NPU04 through the optical transmission channel. Further, an internal channel of the node 0 may be from the NPU04 to an NPU00, that is, the NPU04 may receive the data E0 from the NPU44 through the optical transmission channel, and then transmit the data E0 to the NPU00 through the internal channel of the node 0.
The node 5 transmits data F0 in the node 5 to the node 1 through an optical transmission channel between the node 5 and the node 1, an internal channel of the node 5, and an internal channel of the node 1, so that the node 1 obtains data B0+F0. For details of the internal channel of the node 5 and the internal channel of the node 1, refer to the description of the internal channel of the node 4 and the internal channel of the node 0.
By analogy, the first data matrix can be obtained, where the NPU00 includes A0+E0, and the NPU01 includes A1+H1; the NPU02 includes A2+G2, and the NPU03 includes A3+F3; the NPU10 includes B0+F0, the NPU11 includes B1+G1, the NPU12 includes B2+H2, and the NPU13 includes B3+E3; and others are similar.
Step 2: The node 0 to the node 7 jointly determine a second data matrix.
The node 0 sends data A4 of the node 0 to the node 4 through the optical transmission channel between the node 0 and the node 4, so that the node 4 can obtain A4+E4.
The node 1 sends data B4 of the node 1 to the node 5 through the optical transmission channel between the node 1 and the node 5, so that the node 5 can obtain B4+F4.
By analogy, the second data matrix can be obtained, where the NPU44 includes A4+E4, the NPU45 includes D5+E5, the NPU46 includes C6+E6, the NPU47 includes B7+E7, and others are similar.
Step 3: Based on steps similar to those in
Then, based on steps similar to those in
In addition, the foregoing uses the port connection relationship in
In the foregoing technical solution, the MEMS constructs an optical transmission channel between any two nodes, so that any two nodes can perform data transmission through the optical transmission channel between them.
However, it should be noted that angle adjustment of the MEMS micromirror takes a relatively long time; for example, it may take hundreds of milliseconds to switch output from an original output port to another output port. Therefore, a connection relationship between the node and the MEMS is generally configured before model training, and data aggregation is performed by using the preconfigured connection relationship during model training. That is, the optical transmission channel corresponding to the MEMS is fixed during model training.
For example, a node includes eight NPUs, and each NPU includes four ports. In this case, one node has 32 ports, and the 32 ports correspond to 32 nodes. In this case, a formed system may include 32 nodes×8 NPUs, that is, 256 NPUs. In this case, interconnection between the 256 NPUs may be implemented by the MEMS.
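The scale in this example can be checked with a few lines of arithmetic; the following sketch only restates the numbers above, and the variable names are illustrative.

    npus_per_node = 8
    ports_per_npu = 4
    node_ports = npus_per_node * ports_per_npu   # 32 ports per node
    nodes = node_ports                            # the 32 node ports correspond to 32 nodes
    total_npus = nodes * npus_per_node            # 32 x 8 = 256 NPUs interconnected by the MEMS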
To further expand a scale of interconnection between NPUs in the model training, a WSS is further introduced in this application, and the WSS can expand a node scale (or an NPU scale) to W times the original node scale. Specifically, the original S nodes and the MEMS may form a group (referred to as a first group). After the WSS is introduced, the system may add (W−1) extended groups based on the first group, and each extended group may include the same quantity of nodes and the same quantity of MEMSs as the first group.
Alternatively, it may be understood that a system may include W×S nodes, each node has a node number, and a modulo-W operation may be performed on the node number, so that the W×S nodes are respectively divided into the W groups.
To distinguish a port in a node (or an NPU), a port in a WSS, and a port in an MEMS, the port in the node (or the NPU) is referred to as a node port, the port in the WSS is referred to as a WSS port, and the port in the MEMS is referred to as an MEMS port.
The WSS may be configured to implement connection between any two of the W groups. In a possible implementation, a quantity of WSSs included in the system is the same as a total quantity of node ports included in each group.
In a possible implementation, in each of the W groups, node ports located at corresponding positions are connected to a same WSS, where a position of the node port may be determined by the node in the group in which the node port is located and by which node port it is within that node.
For example, node ports that are in the W groups and that are located at corresponding positions may be connected to W WSS ports at a same end of the same WSS, and W WSS ports at the other end of the WSS are connected to one MEMS in each of the W groups. In this way, the WSS may connect any two of the W groups.
For example, one WSS includes W first WSS ports and W second WSS ports. The W first WSS ports are respectively connected to W node ports, the W node ports respectively belong to the W groups, and the positions of the W node ports in their respective groups correspond to one another. The W node ports correspond to respective MEMS ports in their respective groups, and the MEMS ports corresponding to the W node ports are respectively connected to the W second WSS ports.
The following uses W=2 as an example to describe an implementation in which the WSS is connected to two groups.
The WSS may include two first WSS ports and two second WSS ports, and the WSS can expand a node scale to twice an original node scale. For example, if the original node scale is four nodes, the node scale can be expanded to eight nodes after the WSS is introduced. For example, the eight nodes may be respectively represented as a node 0, a node 1, a node 2, . . . , and a node 7. A modulo-W operation with W=2 may be performed on the node numbers of the nodes, so that the eight nodes are divided into the two groups. The two groups may be separately represented as a group 0 and a group 1. The group 0 includes the node 0, the node 2, the node 4, and the node 6, and the group 1 includes the node 1, the node 3, the node 5, and the node 7.
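The group division by node number may be sketched as follows; this is a minimal illustration assuming nodes are numbered consecutively from 0, and divide_into_groups is an illustrative name.

    def divide_into_groups(num_nodes, W):
        # The node s is assigned to the group (s % W).
        groups = [[] for _ in range(W)]
        for s in range(num_nodes):
            groups[s % W].append(s)
        return groups

    # With W = 2 and eight nodes, this reproduces the example above:
    # group 0 = [0, 2, 4, 6] and group 1 = [1, 3, 5, 7].
    groups = divide_into_groups(8, 2)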
Further, with reference to a connection relationship shown in
The WSS between the node port (6, 0) and the node port (7, 0) is used as an example to explain and describe connection relationships between the WSS and a group 0 and a group 1 respectively. Refer to
Further, the WSS can implement data transmission (or inter-group data transmission, or inter-group data aggregation) between any two groups, or can further implement data transmission (or intra-group data transmission, or intra-group data aggregation, or inter-node data transmission, or inter-node data aggregation) between different nodes in any group.
The following first describes one WSS.
The WSS may include W WSS input ports and W WSS output ports. For one of the WSS input ports, W carriers with different wavelengths may be input to the WSS input port. Based on an output port corresponding to both the WSS input port and the wavelength of the carrier, the W carriers with different wavelengths may be output through the W different WSS output ports.
Specifically, a plurality of mapping relationships may be preset in the WSS, and each mapping relationship may include a WSS input port, a wavelength, and a WSS output port that jointly corresponds to the WSS input port and the wavelength. It may also be understood that one WSS input port may be separately combined with W wavelengths, and the obtained W combinations may respectively correspond to the W WSS output ports.
In a possible implementation, each WSS input port may correspond to one group (which may be referred to as a source group), and each WSS output port may also correspond to one group (which may be referred to as a target group). Further, each group may correspond to a preset wavelength of the group, and a wavelength in a mapping relationship of the WSS may be specifically a preset wavelength corresponding to the target group. That is, a mapping relationship of the WSS may be specifically a mapping relationship between the WSS input port corresponding to the source group, the wavelength corresponding to the target group, and the WSS output port corresponding to the target group.
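The mapping relationships may be pictured as a lookup keyed by the WSS input port and the carrier wavelength. The following sketch assumes one preset wavelength per group and uses the port labels d0, d1, u0, and u1 that appear in the examples below, without being tied to any specific table in this application; the names mapping and route are illustrative.

    # (WSS input port, wavelength) -> WSS output port.
    # The input port corresponds to the source group, and the wavelength selects the target group.
    mapping = {
        ("d0", "wavelength_0"): "u0",   # from group 0, intra-group traffic, toward the MEMS of group 0
        ("d0", "wavelength_1"): "u1",   # from group 0, traffic for group 1, toward the MEMS of group 1
        ("d1", "wavelength_0"): "u0",   # from group 1, traffic for group 0, toward the MEMS of group 0
        ("d1", "wavelength_1"): "u1",   # from group 1, intra-group traffic, toward the MEMS of group 1
    }

    def route(input_port, wavelength):
        # The WSS forwards the carrier to the output port configured for this combination.
        return mapping[(input_port, wavelength)]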
With reference to examples in
For a mapping relationship in the WSS1, refer to Table 1.
It should be noted that a WSS port corresponds to a group, and specifically corresponds to an NPU of a node in the group. For example, the WSS port d0 corresponds to an NPU (which may be represented as an NPU60) of the node 6 in the group 0, and the WSS port d1 corresponds to an NPU (which may be represented as an NPU70) of the node 7 in the group 1.
Correspondingly, during intra-group data transmission, the NPU in the source group may specifically modulate the data to a carrier with a preset wavelength corresponding to the source group. During inter-group data transmission, the NPU in the source group may specifically modulate the data to a carrier with a preset wavelength corresponding to the target group. With reference to the example in Table 1, when the NPU60 needs to transmit data to an NPU of another node in the group, the NPU60 may modulate the data to a carrier corresponding to the preset wavelength 0. When the NPU60 needs to transmit data to an NPU in the group 1, the NPU60 may modulate the data to a carrier corresponding to the preset wavelength 1.
Intra-group data transmission and/or inter-group data transmission can be implemented through the WSS.
In a possible implementation, after the WSS outputs the carrier carrying the data through the WSS output port, the carrier may be transferred to the MEMS corresponding to the target group. Further, in the target group, the MEMS may transmit the carrier carrying the data to the target node based on a preset optical channel. With reference to the examples in
In a possible implementation, the WSS port may be an input port, or may be an output port. The MEMS of the target group may transmit a carrier carrying data to a corresponding WSS (which may be referred to as a WSS2) through a WSS port connected to the MEMS port. The WSS2 may configure a downlink channel as a straight-through channel, where the downlink channel may be understood as a channel from an MEMS port to a node port, and the straight-through channel may be understood as a channel whose input port and output port are at corresponding positions. For example, in
Further, it may further be set that one group corresponds to two preset wavelengths. A carrier corresponding to one preset wavelength is still transmitted to the MEMS of the corresponding group, and a carrier corresponding to the other preset wavelength may be directly transmitted to a node of the target group. In this way, when the carrier does not need to pass through the MEMS, the carrier may be directly transmitted to the node in the target group.
With reference to the examples in
For example, the NPU60 of the node 6 modulates data to a carrier corresponding to the preset wavelength 00. The carrier is input by the WSS port d0, output by the WSS port u0, and then input to the MEMS0. The NPU60 of the node 6 modulates the data to a carrier corresponding to the preset wavelength 10. The carrier is input by the WSS port d0, output by the WSS port u1, and then input to the MEMS1. The NPU60 of the node 6 modulates data to a carrier corresponding to the preset wavelength 11. The carrier is input by the WSS port d0, output by the WSS port d1, and then directly input to the node 7.
In Table 2, one group may correspond to two preset wavelengths. For example, when the total quantity of available wavelengths in the WSS is limited, that is, when the total quantity of available wavelengths in the WSS is less than the total quantity of ports of the WSS, the quantity of groups may be ½ of the total quantity of available wavelengths. When the total quantity of available wavelengths in the WSS is abundant, that is, when the total quantity of available wavelengths in the WSS is greater than the total quantity of ports of the WSS, the quantity of groups may be ½ of the total quantity of ports of the WSS. This helps ensure that data in any NPU can be sent to an NPU in another node in the same group or to an NPU in another group, and helps avoid unnecessary information transmission.
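The reasoning about the quantity of groups can be written as a small helper; this is a sketch under the assumption that exactly two preset wavelengths are reserved for each group, and the function name is illustrative.

    def max_group_quantity(total_wavelengths, total_wss_ports):
        # With two preset wavelengths per group, the scarcer resource bounds the group quantity:
        # half of the available wavelengths when wavelengths are fewer than WSS ports,
        # and half of the WSS ports otherwise.
        if total_wavelengths < total_wss_ports:
            return total_wavelengths // 2
        return total_wss_ports // 2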
In addition, an output port corresponding to both a WSS input port and a wavelength is configured for an uplink channel of the WSS, or a mapping relationship between a wavelength and a group is set for the uplink channel of the WSS, and a downlink channel is configured as a straight-through channel. The uplink channel may be understood as a channel from a node port to an MEMS port. Certainly, in another embodiment, the uplink channel of the WSS may be configured as a straight-through channel, and the output port corresponding to both the WSS input port and the wavelength is configured for a downlink channel; or the uplink channel of the WSS and the downlink channel of the WSS may be configured as the output port corresponding to both the WSS input port and the wavelength, to implement inter-group data transmission or intra-group data transmission. This is not limited in this application.
The following still describes, based on two different collective communication manners, alltoall and allreduce, an implementation of data aggregation between a plurality of NPUs in different cases.
To implement data aggregation by the plurality of NPUs through the alltoall, an NPU may divide data obtained through model training into a plurality of parts, and a quantity of data parts obtained through division may be the same as a quantity of NPUs in the system. The system includes S×W nodes, each node includes C NPUs, and each NPU may divide data of the NPU into S×W×C parts.
It should be noted that, in this embodiment of this application, a transposition operation is still performed on a data matrix including data in a plurality of NPUs. With reference to the system architecture in
For example, 16 parts of data in the NPU00 may be respectively represented as 000, 001, 002, 003, 004, 005, 006, 007, 008, 009, 00A, 00B, 00C, 00D, 00E, and 00F. Specifically, the NPU00 may send 001 to 00F to the NPU01 to the NPU71 (000 is still in the NPU00), and others are similar.
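The alltoall division can be summarized as: part j of every NPU is destined for the j-th NPU in a global ordering. A minimal sketch follows, assuming NPUs are ordered node by node so that the NPU c of the node s has global index s×C+c; alltoall_plan is an illustrative name.

    def alltoall_plan(S, W, C):
        # Each NPU divides its data into S*W*C parts; part j is destined for the NPU
        # whose global index is j, that is, the NPU (j % C) of the node (j // C).
        total_parts = S * W * C
        return {j: (j // C, j % C) for j in range(total_parts)}

    # With S*W = 8 nodes and C = 2 NPUs per node there are 16 parts per NPU:
    # for the NPU00, part 0 stays local and parts 1..15 go to the NPU01, NPU10, ..., NPU71.
    plan = alltoall_plan(4, 2, 2)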
In an alltoall aggregation operation, for an NPU in the source group, when transmitting data to another NPU, the NPU may carry, based on the mapping relationship in the WSS, the data in a preset wavelength corresponding to the target group, to send the data to the target group. For details, refer to the descriptions in the foregoing embodiment.
Further, in the alltoall aggregation operation, the following relationship may exist between the group numbers of the source group and the target group.
A w1th group in the W groups may be used as the source group, and a w2th group may be used as the target group. In this case, an offset (that is, w2−w1) between the two groups may be determined based on a node number of a node to which each NPU belongs and the total quantity of groups. In a possible manner, (w2−w1) may be represented as ((s2 % W) − (s1 % W)) % W, where s1 is a node number of a node to which the NPU in the w1th group belongs, and s2 is a node number of a node to which the NPU in the w2th group belongs.
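A short sketch of this offset computation, assuming 0-based node numbering and using illustrative names, is as follows.

    def group_offset(s1, s2, W):
        # Offset between the target group and the source group,
        # derived from the node numbers of the two NPUs.
        return ((s2 % W) - (s1 % W)) % W

    # For example, with W = 2, a source NPU in the node 2 (group 0) and a target NPU
    # in the node 5 (group 1) give an offset of 1, so w2 = w1 + 1.
    offset = group_offset(2, 5, 2)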
With reference to the example in
In a case of allreduce, the NPU in each node can implement data aggregation based on hierarchical allreduce.
In an implementation, first intra-group data aggregation may be performed, then inter-group data aggregation is performed, and finally second intra-group data aggregation is performed.
The first intra-group data aggregation is specifically data aggregation performed between a plurality of NPUs in the group. For example, each NPU in the group may divide data of the NPU, and a quantity of parts obtained through division is the same as a total quantity of NPUs in the group. Further, an ith NPU in the group may obtain ith data of other NPUs in the group, and obtain an accumulated sum of ith data in all NPUs in the group, to obtain an accumulated result (that is, an intra-group aggregation result) in the group.
The group 0 in
Refer to first intra-group data aggregation shown in
Similarly, a first NPU is the NPU01, and the NPU01 may obtain the first data of other NPUs in the group 0, and obtain a sum 001+011+ . . . +601+611. A second NPU is the NPU20, and the NPU20 may obtain the second data of other NPUs in the group 0, and obtain a sum 002+012+ . . . +602+612.
A third NPU to a seventh NPU are similar to the foregoing. For details, refer to arrow directions in
Based on steps similar to the foregoing, in the group 1, the ith NPU may also obtain the ith data of each NPU in the group and obtain a sum. For example, the 0th NPU is the NPU10, and the NPU10 may obtain the 0th data of each NPU in the group and obtain a sum, which is represented as 100+110+ . . . +700+710. For another example, the first NPU is the NPU11, and the NPU11 may obtain the first data of all NPUs in the group and obtain a sum, which is represented as 101+111+ . . . +701+711.
After the first intra-group data aggregation is performed, the inter-group data aggregation may be performed. The inter-group data aggregation is that data of the ith NPU in each group is added through inter-group communication between a plurality of groups, to obtain an aggregation result. The inter-group data aggregation may be implemented by using ring allreduce, recursive halving and doubling, or butterfly.
Refer to
Correspondingly, a square corresponding to a pattern 2 may represent data obtained by the first NPU in each group after the steps in
Subsequently, each NPU may distribute, in the group, obtained data to each NPU in the group. For example, the NPU0 may distribute data A to another NPU in the group 0, and the NPU1 may distribute data B to another NPU in the group 0. For details, refer to second intra-group data aggregation/distribution shown in
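Taken together, the three phases form a hierarchical allreduce: a first intra-group aggregation (reduce-scatter), an inter-group aggregation of each NPU's accumulated part, and a second intra-group aggregation/distribution. The following Python sketch shows the data flow only, under simplified assumptions: each NPU's data is already divided into one part per NPU of its group, and the inter-group phase is written as a plain sum rather than ring allreduce, recursive halving and doubling, or butterfly; all names are illustrative.

    def hierarchical_allreduce(groups):
        # groups[g][n][i] is the i-th part of the data held by the n-th NPU of the group g.
        num_npus = len(groups[0])
        # Phase 1: first intra-group aggregation - the i-th NPU of each group
        # accumulates everyone's i-th part within its group.
        partial = [[sum(npu[i] for npu in group) for i in range(num_npus)]
                   for group in groups]
        # Phase 2: inter-group aggregation - the i-th NPUs of all groups combine their sums.
        combined = [sum(partial[g][i] for g in range(len(groups))) for i in range(num_npus)]
        # Phase 3: second intra-group aggregation/distribution - every NPU of every group
        # receives the complete result.
        return [[list(combined) for _ in range(num_npus)] for _ in groups]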
In another implementation, intra-node data aggregation, first intra-group data aggregation, inter-group data aggregation, and second intra-group data aggregation/distribution may be sequentially performed. The intra-node data aggregation means that data aggregation is performed between a plurality of NPUs in one node. For details, refer to (a) in
Intra-group data aggregation may also be considered as inter-node data aggregation, that is, data aggregation is performed between a plurality of nodes in the group. For example, one node in the group may divide data into a plurality of pieces, and a quantity of parts obtained through division is the same as a quantity of nodes in the group. Then, the ith node in the group may obtain the ith data in another node in the group, so that the ith node may obtain an accumulation result of the ith data of all nodes in the group, that is, the first intra-group data aggregation.
Specifically, refer to an example in
In this way, the first data aggregation in each group is completed, and then the inter-group data aggregation is performed, that is, data aggregation is performed between a plurality of groups through inter-group communication. For a specific implementation process, refer to
Based on the foregoing same inventive concept, the following provides a model training method. In the method, a plurality of NPUs may perform joint training to obtain a final model. When two NPUs in a plurality of NPUs perform data aggregation, if the two NPUs belong to different nodes, the two NPUs may communicate based on an optical transmission channel constructed by an MEMS between two nodes to which the two NPUs belong. It may also be understood that, in the plurality of nodes corresponding to the plurality of NPUs, any two nodes may communicate with each other through an optical transmission channel constructed by an MEMS between the two nodes.
Based on whether division exists in joint training, there are two possible manners as follows.
In a possible manner 1, the joint training includes a group (which may be referred to as a first group). The first group includes S nodes, and one node may include C NPUs. In other words, the S×C NPUs may jointly perform model training, to obtain a final model.
In a possible manner 2, the joint training includes W groups, and W is an integer greater than or equal to 2.
One group may include S nodes, and one node includes C NPUs, that is, S×C×W NPUs may jointly perform model training, to obtain a final model. Further, any node may divide the S×W nodes into the W groups based on node numbers of S×W nodes. For a group division manner, refer to descriptions in the embodiment related to
The joint training may include a plurality of iterations. The following explains one iteration: One NPU may perform model training of the NPU, to obtain intermediate data corresponding to model training of the NPU in a current iteration process. The intermediate data may be one or more of a feature, a gradient, or a model parameter. The NPU may implement data aggregation with another NPU based on a collective communication manner, to obtain aggregated data of this iteration. The aggregated data may be used for each NPU to adjust a parameter in a model of each NPU in a next iteration.
The following uses a first NPU and a second NPU as an example for description. The first NPU is located in a first node, the second NPU is located in a second node, and the first node and the second node are different nodes in the foregoing joint training.
Refer to
Step 2801: A first NPU performs first model training to obtain first target data.
In a possible implementation, the first NPU runs first model training. In one iteration, the first NPU runs the first model training to obtain first intermediate data, and then determines the first target data based on the first intermediate data and the collective communication manner. The collective communication manner may specifically include one or more of alltoall and allreduce. The first target data may be the first intermediate data, or a part of the first intermediate data.
The first target data may be sent by the first NPU to a second NPU, to be used for the second NPU to update a parameter in second model training in the second NPU.
Depending on whether group division exists in the joint training, the communication manner between the first NPU and the second NPU differs in different collective communication manners. The following describes two possible manners.
In a possible manner 1, the joint training includes one group.
In a case of alltoall, the first NPU may divide the first intermediate data into S×C parts of data, and then use data that is in the S×C parts of data and that is corresponding to the second NPU as the first target data. With reference to the example in
For example, the first node may be an s1th node in the S nodes, and the second node is an s2th node in the S nodes, where s1 and s2 are set to every integer in [0, S−1], and s1 is less than s2. The second NPU is C NPUs included in the second node, and the first target data is the (s2×C)th to (s2×C+C−1)th data in the S×C pieces of data obtained by dividing the first intermediate data by the first NPU. Herein, it may also be understood that the first NPU exchanges data with the second NPU. For example, the first NPU is a cth NPU in the first node. When the first NPU sends, to the C second NPUs, the (s2×C)th to (s2×C+C−1)th data in the S×C pieces of data obtained through division, the first NPU may further separately obtain data from the C second NPUs. Specifically, each second NPU also divides second intermediate data of the second NPU into S×C parts, and the first NPU may obtain the (s1×C+c)th data of each second NPU from the C second NPUs.
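The index ranges in this exchange can be stated directly; the following is a minimal sketch assuming 0-based node and NPU numbering, with illustrative function names.

    def pieces_for_second_node(s2, C):
        # Of the S*C pieces produced by the first NPU, the pieces destined for the C NPUs
        # of the s2-th node are the (s2*C)-th to the (s2*C + C - 1)-th pieces.
        return list(range(s2 * C, s2 * C + C))

    def piece_for_first_npu(s1, c, C):
        # Conversely, from each second NPU the first NPU (the c-th NPU of the s1-th node)
        # obtains the (s1*C + c)-th piece.
        return s1 * C + c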
In an allreduce case, the first NPU is the ith NPU in the first node, and the second NPU is the ith NPU in the second node, that is, the two NPUs are located at corresponding positions in different nodes. The first NPU and the second NPU perform inter-node data aggregation (or the first NPU and the second NPU perform intra-group data aggregation corresponding to the first group).
In an example, the first NPU divides the first intermediate data to obtain the C parts of data; the first NPU obtains the ith data of other (C−1) NPUs in the first node through the intra-node channel of the first node; and then the first NPU performs summation on the ith data in the C pieces of data and the ith data of the other (C−1) NPUs in the first node, to obtain the first target data. For example, in
In a possible manner 2, the joint training includes W groups.
It should be noted in advance that, the first NPU (or the first node) is used as a data sender, and a group to which the first NPU belongs may be referred to as a source group; and the second NPU (or the second node) is used as a data receiver, and a group to which the second NPU belongs may be referred to as a target group. The source group and the target group may be a same group or two different groups.
In a case of alltoall, the first NPU may divide the first intermediate data into S×C×W parts of data, and then use data that is in the S×C×W parts of data and that is corresponding to the second NPU as the first target data. With reference to the example in
The S×W nodes may be divided into the W groups based on node numbers. It may be understood that the first node to which the first NPU belongs may be the s1th node in the S×W nodes, and the second node is the s2th node in the S×W nodes. Similarly, data may also be transmitted between the first NPU and the second NPU. Details are similar to the foregoing possible manner 1. A difference lies in that a channel used by the first NPU to transmit data to a second NPU in the same group is different from a channel used by the first NPU to transmit data to a second NPU in a different group. The former is intra-group data transmission: the first NPU may send the first target data to the second NPU through an inter-node channel (that is, the optical transmission channel constructed by the MEMS), or through an inter-node channel and an intra-node channel. The latter is inter-group data transmission: the first NPU needs to send the first target data to the second NPU in the target group not only through the inter-node channel, or through the inter-node channel and the intra-node channel, but also through the optical transmission channel constructed by a WSS.
For example, the source group is a w1th group in the W groups, the target group is a w2th group in the W groups, w1 may be set to every integer in [0, W−1], and an offset between w2 and w1 may be represented as offset = w2 − w1 = ((s2 % W) − (s1 % W)) % W.
In a case of allreduce, the first NPU is the ith NPU in the source group, and the second NPU is the ith NPU in the target group, that is, the two NPUs are located at corresponding positions in different groups.
When the source group and the target group are different groups, the first NPU and the second NPU need to perform inter-group data aggregation. There are at least the following two examples.
In an example, after performing the first model training to obtain the first intermediate data, the first NPU may first divide the first intermediate data to obtain the C parts of data. The first NPU obtains the ith data of the other (C−1) NPUs in the first node through the intra-node channel of the first node. Then, the first NPU performs summation on the ith data in the C pieces of data and the ith data of the other (C−1) NPUs in the first node, to obtain an aggregation result in the first node in the source group. The second NPU may obtain the aggregation result in the second node in the target group in a similar manner. Then, the first NPU and the second NPU perform inter-group data aggregation. Specifically, the first NPU may send the intra-node aggregation result in the first node to the second NPU, where the intra-node aggregation result in the first node is the first target data. Correspondingly, the second NPU may perform inter-group data aggregation based on the intra-node aggregation result in the first node and the intra-node aggregation result in the second node.
With reference to the example in
In still another example, after performing the first model training to obtain the first intermediate data, the first NPU may divide the first intermediate data to obtain C×S parts of data, that is, a quantity of parts obtained through division is the same as a quantity of NPUs in a group. The first NPU first performs intra-group data aggregation with another NPU in the source group in which the first NPU is located. For an implementation, refer to a related embodiment in the foregoing possible manner 1. Then, the first NPU may use an aggregation result corresponding to the intra-group data aggregation as the first target data.
For example, in
When the source group and the target group belong to a same group, the first NPU and the second NPU still perform inter-node data aggregation (or the first NPU and the second NPU perform intra-group data aggregation corresponding to a current group). Refer to a related embodiment in the foregoing possible manner 1.
Step 2802: The first NPU sends the first target data to the second NPU through the optical transmission channel constructed by the MEMS.
In a case that the joint training includes one group (that is, the first group),
an MEMS that is located between the first node and the second node and that is configured to connect the first node and the second node may be referred to as a first MEMS. Refer to an example in (a) in
For that the first NPU sends the first target data to the second NPU, refer to the following three examples.
Example 1: The node port A is a port in the first NPU, and the node port B is a port in the second NPU. When the first NPU sends the first target data to the second NPU, specifically, the first NPU sends the first target data to the second NPU sequentially through the node port A, the MEMS port A, the MEMS port B, and the node port B.
With reference to the example in
Example 2: The node port A is a port in another NPU (which may be referred to as a third NPU) other than the first NPU on the first node, and the node port B is a port in another NPU (which may be referred to as a fourth NPU) other than the second NPU on the second node. In this case, when the first NPU sends the first target data to the second NPU, specifically, the first NPU first sends the first target data to the third NPU through an internal channel of the first node, then, the third NPU sends the first target data to the fourth NPU sequentially through the node port A, the MEMS port A, the MEMS port B, and the node port B. The fourth NPU sends the first target data to the second NPU through the internal channel of the second node.
With reference to the example in
Example 3: The node port A is a port in the third NPU in the first node, and the node port B is a port in the second NPU in the second node; or the node port A is a port in the first NPU in the first node, and the node port B is a port in a fourth NPU in the second node. For specific implementation, refer to descriptions in Example 1 and/or Example 2.
In a case that the joint training includes W groups:
the MEMS that is located between the first node and the second node and that is configured to connect the first node to the second node may be referred to as a second MEMS, and both the second MEMS and the second node belong to the target group.
As shown in an example in (b) in
The node port a is connected to the WSS port a, the WSS port b is connected to the MEMS port a, and the MEMS port b is connected to the node port b. In this way, there is a connection channel between the first node and the second node, and the first node and the second node may communicate with each other through the connection channel.
That the first NPU sends the first target data to the second NPU may be specifically that the first NPU sends the first target data to the second NPU sequentially through the optical transmission channel constructed by the WSS and the optical transmission channel constructed by the second MEMS.
The node port a may be a port in the first NPU, and the first NPU may send the first target data to the WSS through the port in the first NPU; or the node port a is not a port in the first NPU, and the first NPU may first send, through the intra-node channel of the first node, the first target data to an NPU corresponding to the node port a, and then the NPU sends the first target data to the WSS through the port in the first NPU.
The node port b may be a port in the second NPU, and the second NPU may receive the first target data from the second MEMS through the node port b. Alternatively, the node port b is not a port in the second NPU, and an NPU corresponding to the node port b in the second node may receive the first target data from the second MEMS, and then the second NPU may receive, through an intra-node channel of the second node, the first target data from the NPU corresponding to the node port b.
Further, in the WSS, the WSS port a may further correspond to another WSS port. For details, refer to descriptions in related embodiments in Table 1 and Table 2. When sending the first target data to the WSS, the first NPU may modulate the first target data to a carrier of a preset wavelength (which may be referred to as a target preset wavelength) corresponding to the target group. In this way, the WSS may send, to the target group based on the mapping relationship between the target preset wavelength and the target group, the carrier carrying the first target data.
It may also be understood that the WSS includes a mapping relationship between the WSS input port corresponding to the source group, the target preset wavelength, and the WSS output port corresponding to the target group. After receiving, through the WSS input port corresponding to the source group, the carrier carrying the first target data and sent by the first NPU in the source group, the WSS may send the carrier carrying the first target data to the target group through the WSS output port corresponding to the target group and based on the mapping relationship.
With reference to the examples in
For another example, the first NPU belongs to the group 0, and the second NPU belongs to the group 1. The group 1 corresponds to the preset wavelength 1. The first NPU may modulate the first target data to a carrier corresponding to the preset wavelength 1, and input the first target data to the WSS through the WSS port a. The WSS sends the carrier carrying the first target data to the second MEMS through the WSS port b corresponding to both the preset wavelength 1 and the WSS port a. The second MEMS also belongs to the group 1, and the second MEMS sends the received carrier carrying the first target data to the second NPU.
It should further be noted that the foregoing describes how the WSS implements inter-group communication through only two groups as an example. When there are a plurality of groups, the first NPU may send different data to NPUs in different groups by changing a carrier wavelength. For example, the first NPU belongs to the group 0, and the first NPU needs to send five parts of target data to NPUs corresponding to five target groups, where the five target groups are the group 1, the group 2, the group 3, the group 4, and the group 5. The five target groups respectively correspond to the preset wavelength 1, the preset wavelength 2, the preset wavelength 3, the preset wavelength 4, and the preset wavelength 5. When sending the target data corresponding to the group 1 to the group 1, the first NPU may carry the target data in the carrier corresponding to the preset wavelength 1, so that the WSS may send the target data to the group 1. When sending the target data corresponding to the group 2 to the group 2, the first NPU may carry the target data in the carrier corresponding to the preset wavelength 2, so that the WSS may send the target data to the group 2. In this way, the first NPU may send different target data to different target groups by adjusting the carrier wavelength.
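In other words, selecting the target group reduces to selecting the carrier wavelength preset for that group; the following sketch assumes one preset wavelength per group, and the names are illustrative.

    # preset_wavelength[g] stands for the wavelength reserved for the group g.
    preset_wavelength = {g: "wavelength_{}".format(g) for g in range(6)}   # groups 0..5 as in the example

    def send_to_group(target_group, target_data):
        # The sending NPU modulates the target data onto the carrier whose wavelength is
        # preset for the target group; the WSS then routes the carrier toward that group.
        carrier = preset_wavelength[target_group]
        return carrier, target_data

    # Sending different target data to the group 1 and the group 2 only changes the carrier wavelength.
    message_1 = send_to_group(1, "target data for the group 1")
    message_2 = send_to_group(2, "target data for the group 2")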
Step 2803: The second NPU obtains aggregated data based on the first target data, and adjusts, based on the aggregated data, a parameter for training the second model.
In a possible manner, the second NPU performs the second model training to obtain the second intermediate data, and then determines the second target data based on the second intermediate data and the collective communication manner. The second target data may be the second intermediate data, or may be included in the second intermediate data. An implementation in which the second NPU determines the second target data is similar to an implementation in which the first NPU determines the first target data in step 2801. The second NPU may further send the second target data to the first NPU. Correspondingly, the first NPU may also receive the second target data from the second NPU, determine the aggregated data based on the second target data, and adjust the parameter for training the first model based on the aggregated data.
In a possible manner, the second NPU may not only receive the first target data from the first NPU, but also receive the target data from the another NPU, and determine the aggregated data based on one or more of the target data of the another NPU, the first target data, and the second intermediate data. Then, the second NPU adjusts, based on the aggregated data, the parameter for training the second model.
In addition, the first NPU and the second NPU may alternatively belong to a same node. In this case, the first NPU and the second NPU may perform intra-node data aggregation. Refer to the first intra-node data aggregation shown in
As described above, the model is jointly trained by the plurality of NPUs, which helps expand a training scale of model training and enables quick aggregation of the intermediate data obtained through model training on each NPU. In this way, the plurality of NPUs jointly train a model more efficiently.
It should be added that for implementations that are not described in detail in the model training method in this application, refer to descriptions in the system embodiments related to
Based on a same inventive concept as the method embodiment, an embodiment of this application further provides a model training apparatus. The model training apparatus may be deployed on a node, and is configured to perform the method performed by the processor in the method embodiment shown in
For a schematic diagram of a structure of the model training apparatus 3000, refer to
In a possible implementation, the interface unit 3002 is specifically configured to send the first target data to the second processor through the optical transmission channel constructed by the MEMS and an intra-node channel. The intra-node channel includes a channel between the apparatus 3000 and the MEMS in the first node, and/or a channel between the second processor and the MEMS in the second node.
In a possible implementation, the interface unit 3002 is specifically configured to sequentially send the first target data to the second processor through an optical transmission channel constructed by a wavelength selective switch WSS and the optical transmission channel constructed by the MEMS. The second node and the MEMS belong to a same group, and the WSS is located between the MEMS and the first node.
In a possible implementation, the WSS includes a mapping relationship between a carrier wavelength and a group, and in one mapping relationship, the carrier wavelength is a preset wavelength corresponding to the group. The interface unit 3002 is specifically configured to: modulate the first target data into the carrier, where a wavelength of the carrier is a preset wavelength corresponding to the group to which the second node belongs; and send the carrier carrying the first target data to the WSS, so that the WSS sends the carrier carrying the first target data to the MEMS.
In a possible implementation, the processing unit 3001 is specifically configured to: perform model training in the apparatus 3000 to obtain intermediate data of the apparatus 3000; and determine first target data based on a collective communication manner and the intermediate data of the apparatus 3000. The first target data is all or a part of the intermediate data of the apparatus 3000. Training data and a training model in the apparatus 3000 are different from those in the second processor, and the collective communication manner is alltoall. Alternatively, training data in the apparatus 3000 is different from that in the second processor, and the collective communication manner is allreduce.
In a possible implementation, the processing unit 3001 is specifically configured to divide the intermediate data of the apparatus 3000 based on alltoall and a total quantity of processors corresponding to alltoall. The processor corresponding to alltoall includes the apparatus 3000 and a second processor. A quantity of data parts after division is equal to the total quantity of processors, and the data after division includes first target data corresponding to the second processor.
In a possible implementation, alltoall corresponds to S nodes, the first node is an s1th node in the S nodes, the second node is an s2th node in the S nodes, s1 and s2 are set to every integer in [0, S−1], and s1 is less than s2. The second processor is C processors included in the second node. The first target data is the (s2×C)th to (s2×C+C−1)th pieces of data in the S×C pieces of data after the division.
In a possible implementation, the alltoall corresponds to W groups, the first node is an s1th node of a w1th group in the W groups, and the second node is an s2th node of a w2th group in the W groups. w1 is set to every integer in [0, W−1], and w2 = w1 + offset, where offset = ((s2 % W) − (s1 % W)) % W.
In a possible implementation, the processing unit 3001 is specifically configured to: divide the intermediate data of the apparatus 3000 based on the allreduce and the total quantity C of processors in the first node, to obtain the C pieces of data; obtain the ith data of other (C−1) processors in the first node through an intra-node channel of the first node; and perform summation on the ith data in the C pieces of data and the ith data of the other (C−1) processors in the first node, to obtain the first target data. The apparatus 3000 is an ith processor in the first node, and the second processor is an ith processor in the second node.
In a possible implementation, allreduce corresponds to W groups, one group includes S nodes, and one node includes C processors. The apparatus 3000 is an ith processor in a group to which the apparatus 3000 belongs. The second processor is an ith processor in a group to which the second processor belongs. The processing unit 3001 is specifically configured to: divide the intermediate data of the apparatus 3000 based on allreduce and the total quantity S×C of processors in the group, to obtain S×C pieces of data; obtain, through the intra-node channel of the first node and/or optical transmission channels that are between the first node and other (S−1) nodes in the group to which the apparatus 3000 belongs and that are constructed by the MEMS, ith data of other (S×C−1) processors in the group to which the apparatus 3000 belongs; and perform summation on the ith data in the S×C pieces of data and the ith data of the other (S×C−1) processors in the group to which the apparatus 3000 belongs, to obtain the first target data.
In a possible implementation, the processing unit 3001 is further configured to: obtain second target data, where the second target data is data that is obtained by the second processor by performing model training in the second processor and that is to be transmitted to the apparatus 3000; and adjust a parameter for model training in the apparatus 3000 based on the second target data.
In a possible implementation, the processing unit 3001 is further configured to: before performing the model training in the apparatus to obtain the first target data, divide a plurality of nodes into W groups based on a total quantity of the plurality of nodes for jointly training a model, a total quantity of ports of the WSS, and a total quantity of available wavelengths in the WSS. When the total quantity of available wavelengths in the WSS is less than the total quantity of ports of the WSS, and one group in the W groups corresponds to two preset wavelengths, W is equal to ½ of the total quantity of available wavelengths in the WSS.
Based on the foregoing content and the same concept,
It should be understood that the processor 100 may be an integrated circuit chip and has a signal processing capability. For example, the processor 100 may be a general-purpose processor, may be a field programmable gate array (FPGA), may be an application-specific integrated circuit (ASIC), may be a system on chip (SoC), may be a network processor (NP), may be a digital signal processing circuit (DSP), may be a microcontroller unit (MCU), may be a programmable logic device (PLD) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or another integrated chip.
The processor 100 may include a central processing unit (CPU), a neural-network processing unit (NPU), and a graphics processing unit (GPU), and may further include an application processor (AP), a modem processor, an image signal processor (ISP), a video codec, a digital signal processor (DSP), and/or a baseband processor. These components may be deployed on different chips in a distributed manner, or may be integrated into one chip. This is not specifically limited. The processor 100 may perform the method in the first processor in the foregoing method embodiments, or is configured to perform the method in any processor in the foregoing system embodiments.
It may be understood that the memory (for example, the external buffer and the internal memory) in embodiments of this application may be a volatile memory or a nonvolatile memory, or may include both a volatile memory and a nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), used as an external cache. Through example but not limitative description, many forms of RAMs may be used, for example, a static random access memory (static RAM, SRAM), a dynamic random access memory (dynamic RAM, DRAM), a synchronous dynamic random access memory (synchronous DRAM, SDRAM), a double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), an enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), a synchronous link dynamic random access memory (synchlink DRAM, SLDRAM), and a direct rambus dynamic random access memory (direct rambus RAM, DR RAM). It should be noted that the memory in the method described in this specification is intended to include, but is not limited to, these memories and any other memory of a proper type.
According to the method provided in embodiments of this application, this application further provides a computer program product. The computer program product includes computer program code. When the computer program code is run on a computer, the computer is enabled to perform the method in any one of the embodiments shown in
According to the method provided in embodiments of this application, this application further provides a computer-readable storage medium. The computer-readable medium stores program code. When the program code is run on a computer, the computer is enabled to perform the method in any one of the embodiments shown in
According to the method provided in embodiments of this application, this application further provides a computing device. The computing device includes a processor, the processor is connected to a memory, and the processor is configured to execute a computer program stored in the memory, so that the computing device performs the method in any one of the method embodiments shown in
Terminologies such as “component”, “module”, and “system” used in this specification are used to indicate computer-related entities, hardware, firmware, combinations of hardware and software, software, or software being executed. For example, a component may be, but is not limited to, a process that runs on a processor, a processor, an object, an executable file, an execution thread, a program, and/or a computer. As illustrated by using figures, both a computing device and an application that runs on the computing device may be components. One or more components may reside within a process and/or a thread of execution, and a component may be located on one computer and/or distributed between two or more computers. In addition, these components may be executed from various computer-readable media that store various data structures. For example, the components may communicate by using a local and/or remote process and based on, for example, a signal having one or more data packets (for example, data from two components interacting with another component in a local system, a distributed system, and/or across a network such as the Internet interacting with other systems by using the signal).
A person of ordinary skill in the art may be aware that the illustrative logical blocks and steps described with reference to embodiments disclosed in this specification may be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on particular applications and design constraint conditions of the technical solutions. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of this application.
It may be clearly understood by a person skilled in the art that, for the purpose of convenient and brief description, for a detailed working process of the foregoing system, apparatus, and unit, refer to a corresponding process in the foregoing method embodiments. Details are not described herein again.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatus embodiment is merely an example. For example, division into the units is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, in other words, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected based on actual requirements to achieve the objectives of the solutions of embodiments.
In addition, functional units in embodiments of this application may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.
When functions are implemented in the form of a software functional unit and sold or used as an independent product, the functions may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the conventional technology, or some of the technical solutions may be implemented in a form of a software product. The computer software product is stored in a storage medium, and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in embodiments of this application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of this application, but are not intended to limit the protection scope of this application. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.
Number | Date | Country | Kind |
---|---|---|---|
202111265110.7 | Oct 2021 | CN | national |
202111617115.1 | Dec 2021 | CN | national |
This application is a continuation of International Application No. PCT/CN2022/096842, filed on Jun. 2, 2022, which claims priority to Chinese Patent Application No. 202111617115.1, filed on Dec. 27, 2021, and Chinese Patent Application No. 202111265110.7, filed on Oct. 28, 2021. All of the aforementioned patent applications are hereby incorporated by reference in their entireties.
 | Number | Date | Country
---|---|---|---
Parent | PCT/CN2022/096842 | Jun 2022 | WO
Child | 18646489 | | US