This application relates to the field of artificial intelligence, and in particular, to a training apparatus and method for a neural network model, and a related device.
Artificial intelligence (AI) is a theory, a method, a technology, or an application system that simulates, extends, and expands human intelligence by using a digital computer or a machine controlled by a digital computer, to perceive an environment, obtain knowledge, and achieve an optimal result based on the knowledge. In other words, artificial intelligence is a branch of computer science, and is intended to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have perception, inference, and decision-making functions. Research in the artificial intelligence field includes robotics, natural language processing, computer vision, decision-making and inference, human-computer interaction, recommendation and search, AI basic theories, and the like.
In recent years, training a neural network model has required both a large network and a large amount of data. Data parallelism is commonly used to meet the resulting surge in computing requirements during training. A basic idea of data parallelism is to simultaneously train on different subsets of the data by using model copies on a plurality of devices, and to synchronize model parameters across the copies when an iteration ends.
Specifically,
The accelerator may be a graphics processing unit (GPU), a neural processing unit (NPU), or a tensor processing unit (TPU). Gradient addition may be implemented by using a plurality of methods, for example, collective communication.
In the data parallel processing manner, training parameters such as the initial weight in step (2) and the initial variable in step (6) consume storage space in the accelerator during training, for example, video random access memory (video RAM) in graphics processing unit (GPU) training and memory in central processing unit (CPU) training. When a large neural network model is trained, training cannot be performed because the storage space in a single accelerator is insufficient.
Embodiments of this application provide a training apparatus and method for a neural network model, and a related device, to reduce video RAM consumption of the training apparatus in a training process of a neural network model.
According to a first aspect, an embodiment of this application provides a training apparatus for a neural network model. The training apparatus includes a plurality of accelerators. When the training apparatus trains a neural network model, each accelerator in the training apparatus is configured to store some weight coefficients, and the some weight coefficients stored in each of the plurality of accelerators form a complete weight coefficient. In other words, the complete weight coefficient of the neural network model is stored in the plurality of accelerators in the training apparatus in a distributed manner. Then, each accelerator adds the some weight coefficients stored in each of the plurality of accelerators, to obtain the complete weight coefficient. Then, each accelerator trains the neural network model based on input data and the complete weight coefficient, where input data of the plurality of accelerators is different. In a parallel processing process in which the training apparatus trains the neural network model by using the plurality of accelerators, the complete weight coefficient of the neural network model is stored in the plurality of accelerators in the training apparatus in a distributed manner, and weight coefficients of the plurality of accelerators are subsequently added, to obtain the complete weight coefficient. The neural network model is further trained on each accelerator based on different input data and the complete weight coefficient. In other words, the complete weight coefficient is stored in the plurality of accelerators in the training apparatus in a distributed manner, to reduce video RAM consumption of the training apparatus in a training process of the neural network model.
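As a rough, illustrative estimate (not a limitation of this application): if the complete weight coefficient occupies M bytes of video RAM and the training apparatus includes N accelerators, plain data parallelism persistently stores M bytes on every accelerator, whereas the distributed storage described above persistently stores only about M/N bytes of weight coefficients per accelerator, with the complete weight coefficient assembled only temporarily when it is needed for computation.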
In a possible implementation of the first aspect in this embodiment of this application, when training the neural network model based on the input data and the complete weight coefficient, each accelerator is specifically configured to: calculate gradient information based on the input data and the complete weight coefficient; calculate a target gradient based on the gradient information of the plurality of accelerators; and further update the some weight coefficients based on the target gradient, and train the neural network model based on updated some weight coefficients. After calculating the gradient information based on different input data and the complete weight coefficient, each accelerator in the training apparatus calculates, based on the gradient information of the plurality of accelerators, the target gradient used to update the some weight coefficients. Further, each accelerator updates, based on the target gradient, the some weight coefficients stored in each accelerator, and trains the neural network model based on the updated some weight coefficients. This implementation provides a specific implementation process of training the neural network model based on different input data. This can improve feasibility of the solution, and further improve flexibility of the solution.
In a possible implementation of the first aspect in this embodiment of this application, each accelerator in the training apparatus is further configured to store some initial variables of an optimizer. Some initial variables stored in each of the plurality of accelerators form a complete initial variable of the optimizer, and the optimizer is configured to update the weight coefficient of the neural network model. When updating the some weight coefficients based on the target gradient, each accelerator in the training apparatus is specifically configured to: process the target gradient and the some weight coefficients based on the some initial variables, to obtain a processed target gradient; and then update the some weight coefficients based on the processed target gradient. When the optimizer optimizes the weight coefficient of the neural network model, each accelerator may store the initial variable of the accelerator in a distributed manner, that is, each accelerator stores the some initial variables. Then, each accelerator uses the target gradient and the some weight coefficients as an input of the optimizer; performs optimization according to an optimization algorithm preset in the optimizer, to obtain the processed target gradient; and then updates, based on the processed target gradient, the some weight coefficients stored in each accelerator. In other words, the plurality of accelerators in the training apparatus store the complete initial variable of the optimizer in a distributed manner, to further reduce video RAM consumption of the training apparatus in the training process of the neural network model.
In a possible implementation of the first aspect in this embodiment of this application, the optimizer includes a vector operation, and when processing the target gradient and the some weight coefficients based on the some initial variables, to obtain the processed target gradient, each accelerator in the training apparatus is specifically configured to: calculate a scalar representation of the target gradient; add the scalar representation of the target gradient in each of the plurality of accelerators, to obtain a summation result of the target gradient; then calculate a vector representation of the target gradient based on the summation result; and further process the vector representation of the target gradient and the some weight coefficients based on the some initial variables, to obtain the processed target gradient. When the optimizer includes the vector operation (for example, a matrix operation or another vector operation), a complete gradient is required during calculation, and the target gradient is distributed in each accelerator. Therefore, when obtaining the processed target gradient, each accelerator first calculates the scalar representation of the target gradient; then adds the scalar representation of the target gradient in each of the plurality of accelerators, to obtain the summation result of the target gradient; calculates the vector representation of the target gradient based on the summation result; and further processes the vector representation of the target gradient and the some weight coefficients based on the some initial variables, to obtain the processed target gradient. In this implementation, the solution may be applied to an implementation process of an optimizer including a vector operation. This can improve feasibility of the solution.
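For illustration only, one concrete example of such a vector operation is computing a global L2 norm of the target gradient when the gradient is distributed across accelerators (a quantity required, for example, by gradient clipping and by some adaptive optimizers). The following minimal sketch uses PyTorch's torch.distributed and assumes that a process group has been initialized with one process per accelerator; the function name is an illustrative placeholder:

    import torch
    import torch.distributed as dist

    def global_grad_norm(target_grad_shard: torch.Tensor) -> torch.Tensor:
        # Scalar representation of the local target gradient: the sum of its squared elements.
        local_sq_sum = (target_grad_shard.float() ** 2).sum()
        # Add the scalar representations of all accelerators to obtain the summation result.
        dist.all_reduce(local_sq_sum, op=dist.ReduceOp.SUM)
        # Recover the vector-level quantity (here, the global L2 norm) from the summation
        # result; the optimizer then uses it together with the some initial variables and
        # the some weight coefficients to produce the processed target gradient.
        return local_sq_sum.sqrt()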
In a possible implementation of the first aspect in this embodiment of this application, when adding the scalar representation of the target gradient in each of the plurality of accelerators, to obtain the summation result of the target gradient, each accelerator in the training apparatus is specifically configured to add the scalar representation of the target gradient in each of the plurality of accelerators by using an allreduce operation in collective communication, to obtain the summation result of the target gradient. Each accelerator in the training apparatus may obtain the summation result of the target gradient in the plurality of accelerators by using the allreduce operation in collective communication. This implementation provides a specific implementation process of obtaining the summation result of the target gradient. This can improve feasibility of the solution, and further improve flexibility of the solution.
In a possible implementation of the first aspect in this embodiment of this application, the some weight coefficients include weight coefficients assigned to the plurality of accelerators one by one after the complete weight coefficient is evenly divided. In the training process of the neural network model, processing capabilities of the plurality of accelerators in the training apparatus are usually the same or similar. Therefore, the complete weight coefficient may be evenly divided based on a quantity of the plurality of accelerators, and then assigned to the plurality of accelerators one by one, so that each accelerator stores, in a distributed manner, the some weight coefficients obtained through even division. This implementation provides a specific implementation process in which the plurality of accelerators store the complete weight coefficient in a distributed manner. This can improve feasibility of the solution, and further improve flexibility of the solution.
In a possible implementation of the first aspect in this embodiment of this application, when adding the some weight coefficients stored in each of the plurality of accelerators, to obtain the complete weight coefficient, each accelerator in the training apparatus is specifically configured to add, by using an allgather operation in collective communication, the some weight coefficients stored in each of the plurality of accelerators, to obtain the complete weight coefficient. Each accelerator in the training apparatus may obtain the complete weight coefficient in the plurality of accelerators by using the allgather operation in collective communication. This implementation provides a specific implementation process of obtaining the complete weight coefficient. This can improve feasibility of the solution, and further improve flexibility of the solution.
In a possible implementation of the first aspect in this embodiment of this application, when calculating the target gradient based on the gradient information of the plurality of accelerators, each accelerator in the training apparatus is specifically configured to calculate, by using a reduce-scatter operation in collective communication, the target gradient based on the gradient information of the plurality of accelerators. Each accelerator in the training apparatus may obtain the target gradient in the plurality of accelerators by using the reduce-scatter operation in collective communication. This implementation provides a specific implementation process of calculating the target gradient. This can improve feasibility of the solution, and further improve flexibility of the solution.
In a possible implementation of the first aspect in this embodiment of this application, the some initial variables include initial variables assigned to the plurality of accelerators one by one after the complete initial variable is evenly divided. In the training process of the neural network model, processing capabilities of the plurality of accelerators in the training apparatus are usually the same or similar. Therefore, the complete initial variable may be evenly divided based on a quantity of the plurality of accelerators, and then assigned to the plurality of accelerators one by one, so that each accelerator stores, in a distributed manner, the some initial variables obtained through even division. This implementation provides a specific implementation process in which the plurality of accelerators store the complete initial variable in a distributed manner. This can improve feasibility of the solution, and further improve flexibility of the solution.
In a possible implementation of the first aspect in this embodiment of this application, each accelerator in the training apparatus is further configured to: obtain an update parameter of the some weight coefficients, and update the some weight coefficients based on the update parameter of the some weight coefficients; obtain an update parameter of the initial variable, and update the initial variable based on the update parameter of the initial variable; obtain an update parameter of the target gradient, and update the target gradient based on the update parameter of the target gradient; and/or obtain an update parameter of the processed target gradient, and update the target gradient based on the update parameter of the processed target gradient. Target parameters related to the training process of the neural network model may be stored in a distributed manner, and the target parameters include the some weight coefficients, the initial variable, the target gradient, and/or the processed target gradient, and the like. When some weight coefficients corresponding to the neural network model or the initial variable of the optimizer needs to be updated, each accelerator in the training apparatus may update the target parameters stored in a distributed manner. This can further reduce video RAM consumption of the accelerator in the training apparatus.
According to a second aspect, an embodiment of this application provides a training apparatus for a neural network model. The training apparatus includes a plurality of accelerators. When the training apparatus trains a neural network model, each accelerator in the training apparatus is configured to: calculate gradient information based on input data and a complete weight coefficient, where input data of the plurality of accelerators is different; then calculate a target gradient based on the gradient information of the plurality of accelerators; and store some initial variables of an optimizer, where some initial variables stored in each of the plurality of accelerators form a complete initial variable of the optimizer, and the optimizer is configured to update the weight coefficient of the neural network model; process the target gradient and some weight coefficients based on the some initial variables, to obtain a processed target gradient, where the some weight coefficients processed by each of the plurality of accelerators form the complete weight coefficient; and update the complete weight coefficient based on the processed target gradient, and train the neural network model based on an updated complete weight coefficient. In a parallel processing process in which the training apparatus trains the neural network model by using the plurality of accelerators, the complete initial variable of the optimizer in the neural network model is stored in the plurality of accelerators in the training apparatus in a distributed manner. Each accelerator processes the target gradient and the some weight coefficients based on the some initial variables, to obtain the processed target gradient; then updates the complete weight coefficient based on the processed target gradient; and trains the neural network model based on the updated complete weight coefficient. In other words, the complete initial variable of the optimizer is stored in the plurality of accelerators in the training apparatus in a distributed manner, to reduce video RAM consumption of the training apparatus in a training process of the neural network model.
In a possible implementation of the second aspect in this embodiment of this application, the optimizer includes a vector operation, and when processing the target gradient and the some weight coefficients based on the some initial variables, to obtain the processed target gradient, each accelerator in the training apparatus is specifically configured to: calculate a scalar representation of the target gradient; add the scalar representation of the target gradient in each of the plurality of accelerators, to obtain a summation result of the target gradient; then calculate a vector representation of the target gradient based on the summation result; and further process the vector representation of the target gradient and the some weight coefficients based on the some initial variables, to obtain the processed target gradient. When the optimizer includes the vector operation (for example, a matrix operation or another vector operation), a complete gradient is required during calculation, and the target gradient is distributed in each accelerator. Therefore, when obtaining the processed target gradient, each accelerator first calculates the scalar representation of the target gradient; then adds the scalar representation of the target gradient in each of the plurality of accelerators, to obtain the summation result of the target gradient; calculates the vector representation of the target gradient based on the summation result; and further processes the vector representation of the target gradient and the some weight coefficients based on the some initial variables, to obtain the processed target gradient. In this implementation, the solution may be applied to an implementation process of an optimizer including a vector operation. This can improve feasibility of the solution.
In a possible implementation of the second aspect in this embodiment of this application, when adding the scalar representation of the target gradient in each of the plurality of accelerators, to obtain the summation result of the target gradient, each accelerator in the training apparatus is specifically configured to add the scalar representation of the target gradient in each of the plurality of accelerators by using an allreduce operation in collective communication, to obtain the summation result of the target gradient. Each accelerator in the training apparatus may obtain the summation result of the target gradient in the plurality of accelerators by using the allreduce operation in collective communication. This implementation provides a specific implementation process of obtaining the summation result of the target gradient. This can improve feasibility of the solution, and further improve flexibility of the solution.
In a possible implementation of the second aspect in this embodiment of this application, when calculating the target gradient based on the gradient information of the plurality of accelerators, each accelerator is specifically configured to calculate, by using a reduce-scatter operation in collective communication, the target gradient based on the gradient information of the plurality of accelerators. Each accelerator in the training apparatus may obtain the target gradient in the plurality of accelerators by using the reduce-scatter operation in collective communication. This implementation provides a specific implementation process of calculating the target gradient. This can improve feasibility of the solution, and further improve flexibility of the solution.
In a possible implementation of the second aspect in this embodiment of this application, the some initial variables include initial variables assigned to the plurality of accelerators one by one after the complete initial variable is evenly divided. In the training process of the neural network model, processing capabilities of the plurality of accelerators in the training apparatus are usually the same or similar. Therefore, the complete initial variable may be evenly divided based on a quantity of the plurality of accelerators, and then assigned to the plurality of accelerators one by one, so that each accelerator stores, in a distributed manner, the some initial variables obtained through even division. This implementation provides a specific implementation process in which the plurality of accelerators store the complete initial variable in a distributed manner. This can improve feasibility of the solution, and further improve flexibility of the solution.
In a possible implementation of the second aspect in this embodiment of this application, each accelerator in the training apparatus is further configured to: obtain an update parameter of the complete weight coefficient, and update the complete weight coefficient based on the update parameter of the complete weight coefficient; obtain an update parameter of the initial variable, and update the initial variable based on the update parameter of the initial variable; obtain an update parameter of the target gradient, and update the target gradient based on the update parameter of the target gradient; and/or obtain an update parameter of the processed target gradient, and update the target gradient based on the update parameter of the processed target gradient. Target parameters related to the training process of the neural network model may be stored in a distributed manner, and the target parameters include the complete weight coefficient, the initial variable, the target gradient, and/or the processed target gradient, and the like. When some weight coefficients corresponding to the neural network model or the initial variable of the optimizer needs to be updated, each accelerator in the training apparatus may update the target parameters stored in a distributed manner. This can further reduce video RAM consumption of the accelerator in the training apparatus.
According to a third aspect, an embodiment of this application provides a training method for a neural network model. The training method is applied to a plurality of accelerators. The plurality of accelerators are included in a training apparatus. The method includes: storing some weight coefficients, where the some weight coefficients stored in each of the plurality of accelerators form a complete weight coefficient; adding the some weight coefficients stored in each of the plurality of accelerators, to obtain the complete weight coefficient; and training a neural network model based on input data and the complete weight coefficient, where input data of the plurality of accelerators is different.
In a possible implementation of the third aspect in this embodiment of this application, the training a neural network model based on input data and the complete weight coefficient includes: calculating gradient information based on the input data and the complete weight coefficient; calculating a target gradient based on the gradient information of the plurality of accelerators; and updating the some weight coefficients based on the target gradient, and training the neural network model based on updated some weight coefficients.
In a possible implementation of the third aspect in this embodiment of this application, the method further includes: storing some initial variables of an optimizer, where some initial variables stored in each of the plurality of accelerators form a complete initial variable of the optimizer, and the optimizer is configured to update the weight coefficient of the neural network model; and the updating the some weight coefficients based on the target gradient includes: processing the target gradient and the some weight coefficients based on the some initial variables, to obtain a processed target gradient; and updating the some weight coefficients based on the processed target gradient.
In a possible implementation of the third aspect in this embodiment of this application, the optimizer includes a vector operation, and the processing the target gradient and the some weight coefficients based on the some initial variables, to obtain a processed target gradient includes: calculating a scalar representation of the target gradient; adding the scalar representation of the target gradient in each of the plurality of accelerators, to obtain a summation result of the target gradient; calculating a vector representation of the target gradient based on the summation result; and processing the vector representation of the target gradient and the some weight coefficients based on the some initial variables, to obtain the processed target gradient.
In a possible implementation of the third aspect in this embodiment of this application, the adding a scalar representation of the target gradient in each of the plurality of accelerators, to obtain a summation result of the target gradient includes: adding the scalar representation of the target gradient in each of the plurality of accelerators by using an allreduce operation in collective communication, to obtain the summation result of the target gradient.
In a possible implementation of the third aspect in this embodiment of this application, the some weight coefficients include weight coefficients assigned to the plurality of accelerators one by one after the complete weight coefficient is evenly divided.
In a possible implementation of the third aspect in this embodiment of this application, the adding the some weight coefficients stored in each of the plurality of accelerators, to obtain the complete weight coefficient includes: adding, by using an allgather operation in collective communication, the some weight coefficients stored in each of the plurality of accelerators, to obtain the complete weight coefficient.
In a possible implementation of the third aspect in this embodiment of this application, the calculating a target gradient based on the gradient information of the plurality of accelerators includes: calculating, by using a reduce-scatter operation in collective communication, the target gradient based on the gradient information of the plurality of accelerators.
In a possible implementation of the third aspect in this embodiment of this application, the some initial variables include initial variables assigned to the plurality of accelerators one by one after the complete initial variable is evenly divided.
In a possible implementation of the third aspect in this embodiment of this application, the method further includes: obtaining an update parameter of the some weight coefficients, and updating the some weight coefficients based on the update parameter of the some weight coefficients; obtaining an update parameter of the initial variable, and updating the initial variable based on the update parameter of the initial variable; obtaining an update parameter of the target gradient, and updating the target gradient based on the update parameter of the target gradient; and/or obtaining an update parameter of the processed target gradient, and updating the target gradient based on the update parameter of the processed target gradient.
For specific implementation steps of the third aspect and the possible implementations of the third aspect of this application and beneficial effects brought by each possible implementation, refer to descriptions in the possible implementations of the first aspect. Details are not described herein again.
According to a fourth aspect, an embodiment of this application provides a training method for a neural network model. The training method is applied to a plurality of accelerators. The plurality of accelerators are included in a training apparatus. The method includes: calculating gradient information based on input data and a complete weight coefficient, where input data of the plurality of accelerators is different; calculating a target gradient based on the gradient information of the plurality of accelerators; storing some initial variables of an optimizer, where some initial variables stored in each of the plurality of accelerators form a complete initial variable of the optimizer, and the optimizer is configured to update the weight coefficient of a neural network model; processing the target gradient and some weight coefficients based on the some initial variables, to obtain a processed target gradient, where the some weight coefficients processed by each of the plurality of accelerators form the complete weight coefficient; and updating the complete weight coefficient based on the processed target gradient, and training the neural network model based on an updated complete weight coefficient.
In a possible implementation of the fourth aspect in this embodiment of this application, the optimizer includes a vector operation, and the processing the target gradient and the some weight coefficients based on the some initial variables, to obtain a processed target gradient includes: calculating a scalar representation of the target gradient; adding the scalar representation of the target gradient in each of the plurality of accelerators, to obtain a summation result of the target gradient; calculating a vector representation of the target gradient based on the summation result; and processing the vector representation of the target gradient and the some weight coefficients based on the some initial variables, to obtain the processed target gradient.
In a possible implementation of the fourth aspect in this embodiment of this application, the adding a scalar representation of the target gradient in each of the plurality of accelerators, to obtain a summation result of the target gradient includes: adding the scalar representation of the target gradient in each of the plurality of accelerators by using an allreduce operation in collective communication, to obtain the summation result of the target gradient.
In a possible implementation of the fourth aspect in this embodiment of this application, the calculating a target gradient based on the gradient information of the plurality of accelerators includes: calculating, by using a reduce-scatter operation in collective communication, the target gradient based on the gradient information of the plurality of accelerators.
In a possible implementation of the fourth aspect in this embodiment of this application, the some initial variables include initial variables assigned to the plurality of accelerators one by one after the complete initial variable is evenly divided.
In a possible implementation of the fourth aspect in this embodiment of this application, the method further includes: obtaining an update parameter of the complete weight coefficient, and updating the complete weight coefficient based on the update parameter of the complete weight coefficient; obtaining an update parameter of the initial variable, and updating the initial variable based on the update parameter of the initial variable; obtaining an update parameter of the target gradient, and updating the target gradient based on the update parameter of the target gradient; and/or obtaining an update parameter of the processed target gradient, and updating the target gradient based on the update parameter of the processed target gradient.
For specific implementation steps of the fourth aspect and the possible implementations of the fourth aspect of this application and beneficial effects brought by each possible implementation, refer to descriptions in the possible implementations of the second aspect. Details are not described herein again.
According to a fifth aspect, an embodiment of this application provides a training apparatus for a neural network model. The training apparatus includes a plurality of accelerators. Each accelerator includes: a storage unit, configured to store some weight coefficients, where the some weight coefficients stored in each of the plurality of accelerators form a complete weight coefficient; an addition unit, configured to add the some weight coefficients stored in each of the plurality of accelerators, to obtain the complete weight coefficient; and a training unit, configured to train a neural network model based on input data and the complete weight coefficient, where input data of the plurality of accelerators is different.
In the fifth aspect of this application, the modules included in the training apparatus for a neural network model may further be configured to perform the steps performed by the training apparatus in the possible implementations of the first aspect. For details, refer to the first aspect. Details are not described herein again.
According to a sixth aspect, an embodiment of this application provides a training apparatus for a neural network model. The training apparatus includes a plurality of accelerators. Each accelerator includes: a calculation unit, configured to calculate gradient information based on input data and a complete weight coefficient, where input data of the plurality of accelerators is different, and the calculation unit is further configured to calculate a target gradient based on the gradient information of the plurality of accelerators; a storage unit, configured to store some initial variables of an optimizer, where some initial variables stored in each of the plurality of accelerators form a complete initial variable of the optimizer, and the optimizer is configured to update the weight coefficient of a neural network model; a processing unit, configured to process the target gradient and some weight coefficients based on the some initial variables, to obtain a processed target gradient, where the some weight coefficients processed by each of the plurality of accelerators form the complete weight coefficient; and an updating unit, configured to update the complete weight coefficient based on the processed target gradient, and train the neural network model based on an updated complete weight coefficient.
In the sixth aspect of this application, the modules included in the training apparatus for a neural network model may further be configured to perform the steps performed by the training apparatus in the possible implementations of the second aspect. For details, refer to the second aspect. Details are not described herein again.
According to a seventh aspect, an embodiment of this application provides a computer-readable storage medium. The computer storage medium stores a computer program, the computer program includes program instructions, and when executing the program instructions, a processor performs the method in any one of the third aspect or the possible implementations of the third aspect or the method in any one of the fourth aspect or the possible implementations of the fourth aspect.
According to an eighth aspect, an embodiment of this application provides a chip system. The chip system includes a processor, configured to support an access network device in implementing a function in any one of the third aspect or the possible implementations of the third aspect, or any one of the fourth aspect or the possible implementations of the fourth aspect. In a possible design, the chip system may further include a memory. The memory is configured to store program instructions and data that are necessary for the access network device. The chip system may include a chip, or may include a chip and another discrete device.
According to a ninth aspect, an embodiment of this application provides a computer-readable storage medium storing one or more computer-executable instructions. When executing the computer-executable instructions, a processor performs the method in any one of the third aspect or the possible implementations of the third aspect or the method in any one of the fourth aspect or the possible implementations of the fourth aspect.
For technical effects brought by any possible implementation of the third aspect to the ninth aspect, refer to technical effects brought by the first aspect or different possible implementations of the first aspect or technical effects brought by the second aspect or different possible implementations of the second aspect. Details are not described herein again.
It can be learned from the technical solutions that embodiments of this application have the following advantages: In a parallel processing process in which the training apparatus trains the neural network model by using the plurality of accelerators, the complete weight coefficient of the neural network model is stored in the plurality of accelerators in the training apparatus in a distributed manner, and weight coefficients of the plurality of accelerators are subsequently added, to obtain the complete weight coefficient. The neural network model is further trained on each accelerator based on different input data and the complete weight coefficient. In other words, the complete weight coefficient is stored in the plurality of accelerators in the training apparatus in a distributed manner, to reduce video RAM consumption of the training apparatus in a training process of the neural network model.
To describe technical solutions in embodiments of this application or in the conventional technology more clearly, the following briefly describes the accompanying drawings for describing embodiments or the conventional technology. It is clear that the accompanying drawings in the following descriptions show some embodiments of this application, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.
The following describes embodiments of this application with reference to the accompanying drawings in embodiments of this application. In the specification, claims, and the accompanying drawings of this application, terms such as “first”, “second”, “third”, and “fourth” are intended to distinguish between different objects but do not describe a particular order. In addition, the terms “include”, “have”, or any other variant thereof are intended to cover a non-exclusive inclusion. For example, a process, a method, a system, a product, or a device including a series of steps or units is not limited to the listed steps or units, but may include steps or units that are not listed or that are inherent to such a process, method, product, or device. An “embodiment” mentioned in this specification means that a particular characteristic, structure, or feature described with reference to the embodiment may be included in at least one embodiment of this application. The phrase appearing in various locations in the specification does not necessarily refer to a same embodiment, nor is it an independent or alternative embodiment mutually exclusive with another embodiment. It is explicitly and implicitly understood by a person skilled in the art that the embodiments described in the specification may be combined with other embodiments.
First, some terms in this application are described, to facilitate understanding of a person skilled in the art.
(1) Neural Network Training
A basic structure of a neural network is shown in
Mathematically, the neural network may be considered as a function transformation layer by layer. For example, a function such as y=a(f(x, w)) is shown in
After the gradient dw is obtained, there are a plurality of adjustment policies, namely, optimizers, for example, stochastic gradient descent, in which dw is simply multiplied by a learning rate to adjust the weight. Currently, the Adam optimizer and the LARS optimizer are widely used.
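For example, for stochastic gradient descent with a learning rate lr, the update may be written as w = w - lr * dw. Optimizers such as Adam additionally maintain state variables, for example moving averages of dw and of its element-wise square, that are initialized before training; state variables of this kind are one example of the initial variable of an optimizer discussed later in this application.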
(2) Collective Communication
Collective communication provides application programming interfaces (APIs) of a plurality of operations for a user, to complete operations such as averaging in a plurality of accelerators. Correspondingly, during data parallel neural network training, gradient averaging in each accelerator may be performed through collective communication. Several basic collective communication operations are shown in
Allreduce: Data of each accelerator is summed up. After the allreduce operation ends, each accelerator obtains a same summation result. An allreduce process is shown in
Broadcast: Data of an accelerator is copied to all accelerators. A broadcast process is shown in
Allgather: Content of each accelerator is combined, and each accelerator may obtain a combined large tensor. An allgather process is shown in
Reduce-scatter: After tensors of the accelerators are scattered, each accelerator obtains a corresponding different part. A reduce-scatter process is shown in
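For illustration only, the four operations above correspond to collective communication primitives provided by common training frameworks. The following minimal sketch uses PyTorch's torch.distributed as one such framework and assumes that a process group has already been initialized with one process per accelerator; it is not a part of the claimed solution:

    import torch
    import torch.distributed as dist

    def demo_collectives():
        # Assumes dist.init_process_group(...) has been called, one process per accelerator.
        rank = dist.get_rank()
        world_size = dist.get_world_size()
        x = torch.full((2 * world_size,), float(rank))   # different data on each accelerator

        # Allreduce: every accelerator obtains the same element-wise sum over all accelerators.
        summed = x.clone()
        dist.all_reduce(summed, op=dist.ReduceOp.SUM)

        # Broadcast: the data of accelerator 0 is copied to all accelerators.
        b = x.clone()
        dist.broadcast(b, src=0)

        # Allgather: the content of each accelerator is combined, and every accelerator
        # obtains the same combined large tensor.
        gathered = [torch.empty_like(x) for _ in range(world_size)]
        dist.all_gather(gathered, x)
        combined = torch.cat(gathered)

        # Reduce-scatter: the tensors of all accelerators are summed, and each accelerator
        # keeps a corresponding different part of the result.
        chunks = list(x.chunk(world_size))
        part = torch.empty_like(chunks[0])
        dist.reduce_scatter(part, chunks, op=dist.ReduceOp.SUM)
        return summed, b, combined, part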
Refer to
Work at each layer of a deep neural network may be described by using a mathematical expression
Because it is expected that an output of the deep neural network is as close as possible to a value that is actually expected to be predicted, a current predicted value of the network may be compared with a target value that is actually expected, and then a weight vector at each layer of the neural network is updated based on a difference between the current predicted value and the target value (there is usually an initialization process before the first update, that is, a parameter is preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is large, the weight vector is adjusted to lower the predicted value until the neural network can predict the target value that is actually expected. Therefore, “how to obtain, through comparison, the difference between the predicted value and the target value” needs to be predefined. This is a loss function or an objective function. The loss function and the objective function are important equations used to measure the difference between the predicted value and the target value. The loss function is used as an example. A higher output value (loss) of the loss function indicates a larger difference. Therefore, training of the deep neural network is a process of minimizing the loss as much as possible.
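For example, for a regression task with a predicted value y' and a target value y over n samples, a commonly used loss function is the mean squared error L = (1/n) * sum((y' - y)^2); a larger difference between the predicted value and the target value yields a larger loss.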
The target model/rule obtained by the training apparatus 220 may be applied to different systems or devices. In
The execution device 210 may invoke data, code, and the like in a data storage system 250, and may further store, in the data storage system 250, data, an instruction, and the like.
A calculation module 211 processes input data by using the target model/rule 201, to obtain gradient information; and further optimizes the gradient information by using an association function module 213, to obtain a processing result. The association function module 213 may specifically include an optimizer, that is, optimizing the gradient information according to an optimization algorithm preset in the optimizer.
Finally, the I/O interface 212 returns the processing result to the client device 240, and provides the processing result to the user.
Further, the training apparatus 220 may generate, for different targets, corresponding target models/rules 201 based on different data, to provide a better result for the user.
In a case shown in
It should be noted that
In the system architecture shown in
As shown in
In a complete training iteration in which the training apparatus 311 trains the neural network model, the n accelerators separately read different input data from the memory 322 by using the bus 323. Optionally, the CPU 321 may pre-process the input data. Natural language processing (NLP) is used as an example. The input data may include text data, and the text data obtained at a time includes a plurality of sentences. After reading the text data, the CPU 321 performs data preprocessing. Because data parallelism needs to be performed, the CPU 321 ensures that the input data sent to each accelerator is different. Then, for a specific implementation process between the CPU 321 and the n accelerators in the training apparatus 311, refer to an interaction process between a CPU 100 and n accelerators (101, 102, 103, and 104) in
Specifically, as shown in
(1) Create a same training model on each accelerator. For example, if a BERT network is trained, each accelerator needs to have a complete BERT model.
(2) Initialize a weight coefficient on the accelerator 1, and send the weight coefficient to each accelerator by using a broadcast operation in collective communication (1001). Usually, when a neural network is trained, an initial value may be randomly assigned to a weight. A weight is first randomly initialized on any accelerator, and then the weight is sent to each accelerator, to keep initial weights of the accelerators consistent.
(3) The CPU sends different data to different accelerators.
(4) Each accelerator performs forward-backward computation, to obtain corresponding gradient information. Step (4) is performed inside each accelerator. After forward-backward computation, gradient information corresponding to a batch of data of the accelerator is obtained. Because input data is different in step (3), gradients obtained by accelerators are different.
(5) Perform an allreduce operation in collective communication, to obtain an average gradient. Different gradients obtained by the accelerators in step (4) are averaged. After step (5), gradients of all accelerators are consistent, and are values obtained after the gradients of all accelerators are averaged.
(6) Use the average gradient to update the initial weight. The initial weights of the accelerators are consistent after step (2), and are updated by using the average gradient obtained after the allreduce operation. Therefore, it can be ensured that the weights of the accelerators remain consistent after each update. In the update process in step (6), each accelerator may further use the average gradient and the weight obtained in step (2) as an input of an optimizer. The optimizer performs optimization by using an initial variable of the optimizer (1002), and outputs a processed gradient. Each accelerator further updates the weight based on the processed gradient. The overall flow of steps (1) to (6) is sketched below.
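For illustration only, the following minimal sketch expresses step (2) and steps (4) to (6) of the conventional data parallel flow in PyTorch-style code; the names model, optimizer, batch, and loss_fn are placeholders, and it is assumed that torch.distributed has been initialized with one process per accelerator. This is a sketch of the conventional flow described above, not of the solution of this application:

    import torch
    import torch.distributed as dist

    def broadcast_initial_weights(model: torch.nn.Module) -> None:
        # Step (2): the weight is initialized on accelerator 0 and broadcast to all
        # accelerators, so that the initial weights of the accelerators are consistent.
        for p in model.parameters():
            dist.broadcast(p.data, src=0)

    def data_parallel_step(model, optimizer, batch, loss_fn) -> None:
        world_size = dist.get_world_size()
        # Step (4): forward-backward computation on this accelerator's own input data.
        loss = loss_fn(model(batch["x"]), batch["y"])
        loss.backward()
        # Step (5): allreduce the gradients and average them, so that every accelerator
        # holds the same average gradient.
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
        # Step (6): the optimizer uses the average gradient and its initial variable to
        # update the weight; the update is identical on every accelerator.
        optimizer.step()
        optimizer.zero_grad()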
The accelerator may be a graphics processing unit (GPU), a neural processing unit (NPU), or a tensor processing unit (TPU). Gradient addition may be implemented by using a plurality of methods, for example, collective communication.
In the data parallel processing manner, training parameters such as the initial weight in step (2) and the initial variable in step (6) consume storage space in the accelerator during training, for example, video RAM in GPU training and memory in CPU training. When a large neural network model is trained, training cannot be performed because the storage space in a single accelerator is insufficient. The data parallel processing manner has the following disadvantages:
1. In step 1001, the weight coefficient of the neural network model occupies a large amount of video RAM. In the training apparatus of the neural network model, each of the n accelerators needs to store the weight coefficient. For a small- or medium-sized model, the video RAM consumption is not excessive. However, for a large model such as BERT or GPT-2, the weight coefficient occupies a large amount of video RAM, and when the weight coefficient is updated, each accelerator needs to perform the same update operation.
2. In step 1002, the initial variable of the optimizer occupies a large amount of video RAM. In the training apparatus of the neural network model, the n accelerators need to store n copies of the initial variable. During training, the initial variable is used in an iterative operation, and occupies video RAM of each accelerator. In addition, each of the n accelerators holds the initial variable, and the values of the initial variables are equal.
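As an illustration (assuming, for example, an Adam-style optimizer, which keeps a first-moment and a second-moment estimate for every weight), the initial variable is roughly twice the size of the weight coefficient; if the weight coefficient occupies M bytes, the n accelerators together hold about 2*n*M bytes of identical optimizer state in the data parallel processing manner.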
A typical fully connected layer (matrix multiplication) is used as an example. The quantity of accelerators in the training apparatus is 4, the input layer in the table may be an output (feature map) of an upper layer, and the size and video RAM consumption of the layer are shown in Table 1.
It can be learned from the foregoing that, in the data parallel processing manner, the weight coefficient of the neural network model and the initial variable of the optimizer are repeatedly stored in each of the n accelerators. Consequently, the n accelerators in the training apparatus waste video RAM unnecessarily in the model training process, and the video RAM space of the n accelerators is used inefficiently. When a large neural network model is trained, training cannot be performed due to insufficient video RAM space in the accelerator.
In the foregoing, both the weight coefficient of the neural network model and the initial variable of the optimizer occupy a large amount of video RAM. In embodiments of this application, it is considered that the two parameters are separately stored in a plurality of accelerators in a training apparatus in a distributed manner. For details, refer to
1. A complete weight coefficient of a neural network model is stored in a distributed manner
Refer to
1101: Store some weight coefficients.
In this embodiment, a training apparatus includes a plurality of accelerators, and each accelerator stores the some weight coefficients in step 1101. The some weight coefficients stored in each of the plurality of accelerators in the training apparatus form a complete weight coefficient used for training a neural network model.
Specifically, in a training process of the neural network model, a weight coefficient of the neural network model needs to be initialized in a first training iteration. The initialization operation may be specified to be performed by a first accelerator, and the first accelerator is any one of the plurality of accelerators in the training apparatus. After the weight coefficient is initialized, the first accelerator may obtain the complete weight coefficient. Then, the first accelerator divides the complete weight coefficient, and sends the some weight coefficients to each of the plurality of accelerators in the training apparatus, to store the complete weight coefficient in a distributed manner. In other words, any one of the plurality of accelerators can initialize the weight coefficient, to reduce operation consumption of another accelerator in the training apparatus.
In a specific implementation, the first accelerator may further equally divide the complete weight coefficient, and then send the some weight coefficients to each of the plurality of accelerators in the training apparatus. Therefore, in step 1101, the some weight coefficients stored in each accelerator include weight coefficients assigned to the plurality of accelerators one by one after the complete weight coefficient is evenly divided. In the training process of the neural network model, processing capabilities of the plurality of accelerators in the training apparatus are usually the same or similar. Therefore, the complete weight coefficient may be evenly divided based on a quantity of the plurality of accelerators, and then assigned to the plurality of accelerators one by one, so that each accelerator stores, in a distributed manner, the some weight coefficients obtained through even division. In a subsequent model training process, each accelerator trains the model by using the evenly divided weight coefficient, to synchronize processing progresses of different accelerators in the training apparatus. In addition, the quantity of accelerators may specifically be 2, 4, 8, 16, 32, or the like. This is not limited herein.
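A minimal sketch of this division is shown below. It assumes PyTorch's torch.distributed with one process per accelerator, a weight that has been flattened into a one-dimensional tensor, and zero-padding so that the weight divides evenly; these choices and the helper name are illustrative assumptions rather than requirements of this application:

    import torch
    import torch.distributed as dist

    def shard_complete_weight(flat_weight: torch.Tensor, src: int = 0) -> torch.Tensor:
        # flat_weight must have the same shape on every accelerator; only its content on
        # the first accelerator (src), where the weight was initialized, matters here.
        rank = dist.get_rank()
        world_size = dist.get_world_size()
        pad = (-flat_weight.numel()) % world_size        # pad so that the division is even
        if pad:
            flat_weight = torch.cat([flat_weight, flat_weight.new_zeros(pad)])
        dist.broadcast(flat_weight, src=src)             # send the initialized weight out
        # Each accelerator keeps only the some weight coefficients assigned to it.
        return flat_weight.chunk(world_size)[rank].clone()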
1102: Add the some weight coefficients stored in each of the plurality of accelerators, to obtain the complete weight coefficient.
In this embodiment, each accelerator in the training apparatus adds the some weight coefficients stored in each of the plurality of accelerators, to obtain the complete weight coefficient.
In a specific implementation, in step 1102, when adding the some weight coefficients stored in each of the plurality of accelerators, to obtain the complete weight coefficient, each accelerator in the training apparatus may specifically add, by using an allgather operation in collective communication, the some weight coefficients stored in each of the plurality of accelerators, to obtain the complete weight coefficient. For a specific communication process in which each accelerator performs the allgather operation, refer to descriptions in
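A minimal sketch of step 1102, under the same assumptions as the sketch above (PyTorch's torch.distributed and an evenly divided, flattened weight), is as follows:

    import torch
    import torch.distributed as dist

    def gather_complete_weight(local_shard: torch.Tensor) -> torch.Tensor:
        # Allgather in collective communication: the some weight coefficients stored on
        # each accelerator are combined into the complete weight coefficient.
        world_size = dist.get_world_size()
        shards = [torch.empty_like(local_shard) for _ in range(world_size)]
        dist.all_gather(shards, local_shard)
        return torch.cat(shards)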
1103: Train the neural network model based on input data and the complete weight coefficient.
In this embodiment, each accelerator in the training apparatus trains the neural network model based on the input data and the complete weight coefficient obtained in step 1102. Input data of the plurality of accelerators is different.
Specifically, in a data parallel system architecture including the plurality of accelerators in the training apparatus, a neural network training model may be pre-created in each accelerator. In one iterative training process, after the complete weight coefficient is obtained in step 1102, the complete weight coefficient and different input data are used as an input of the neural network model, to train the neural network model. For a specific training process, refer to the implementation process in
In this embodiment, in a parallel processing process in which the training apparatus trains the neural network model by using the plurality of accelerators, the complete weight coefficient of the neural network model is stored in the plurality of accelerators in the training apparatus in a distributed manner, and weight coefficients of the plurality of accelerators are subsequently added, to obtain the complete weight coefficient. The neural network model is further trained on each accelerator based on different input data and the complete weight coefficient. In other words, the complete weight coefficient is stored in the plurality of accelerators in the training apparatus in a distributed manner, to reduce video RAM consumption of the training apparatus in a training process of the neural network model.
In the embodiment corresponding to
Refer to
1201: Store some weight coefficients.
In this embodiment, a training apparatus includes a plurality of accelerators, and each accelerator stores the some weight coefficients in step 1201. The some weight coefficients stored in each of the plurality of accelerators in the training apparatus form a complete weight coefficient used for training a neural network model.
1202: Add the some weight coefficients stored in each of the plurality of accelerators, to obtain the complete weight coefficient.
In this embodiment, each accelerator in the training apparatus adds the some weight coefficients stored in each of the plurality of accelerators, to obtain the complete weight coefficient.
For an implementation process of step 1201 and step 1202 in this embodiment, refer to an implementation process of step 1101 and step 1102 in
1203: Calculate gradient information based on input data and the complete weight coefficient.
In this embodiment, each accelerator in the training apparatus calculates the gradient information based on the input data and the complete weight coefficient obtained in step 1202. Input data of the plurality of accelerators is different.
Specifically, in a data parallel system architecture including the plurality of accelerators in the training apparatus, a neural network training model may be pre-created in each accelerator. In one iterative training process, after the complete weight coefficient is obtained in step 1202, the complete weight coefficient and different input data are used as an input of the neural network model, to train the neural network model. For a specific training process, refer to the implementation process in
1204: Calculate a target gradient based on the gradient information of the plurality of accelerators.
In this embodiment, each accelerator in the training apparatus calculates the target gradient based on the gradient information obtained through calculation in step 1203. The target gradient is used to update the some weight coefficients stored in each accelerator in step 1201, to complete an iteration process.
In a specific implementation, in step 1204, when calculating the target gradient based on the gradient information of the plurality of accelerators, each accelerator in the training apparatus may specifically calculate, by using a reduce-scatter operation in collective communication, the target gradient based on the gradient information of the plurality of accelerators. For an implementation process of reduce-scatter, refer to the implementation process in
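The following is a minimal sketch of step 1204, under the assumptions that torch.distributed provides the reduce-scatter operation and that the gradient length is divisible by the number of accelerators; the function name compute_target_gradient is hypothetical.

```python
# Hypothetical sketch of step 1204: the full local gradient information is
# reduce-scattered so that each accelerator ends up holding only the summed
# slice (the target gradient) corresponding to its own weight shard.
import torch
import torch.distributed as dist

def compute_target_gradient(gradient_information: torch.Tensor) -> torch.Tensor:
    world_size = dist.get_world_size()
    grad_slices = list(torch.chunk(gradient_information, world_size))  # one slice per accelerator
    target_gradient = torch.empty_like(grad_slices[dist.get_rank()])
    dist.reduce_scatter(target_gradient, grad_slices, op=dist.ReduceOp.SUM)
    return target_gradient
```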
1205: Store some initial variables of an optimizer.

In this embodiment, the training apparatus includes the plurality of accelerators, and each accelerator stores the some initial variables of the optimizer in step 1205. The some initial variables stored in each of the plurality of accelerators in the training apparatus form a complete initial variable of the optimizer.
In a specific implementation, in step 1205, the some initial variables stored in each of the plurality of accelerators in the training apparatus form the complete initial variable of the optimizer, and the optimizer is configured to update the weight coefficient of the neural network model. Specifically, in a training process of the neural network model, an initial variable of the optimizer needs to be initialized in a first training iteration. The initialization operation may be specified to be performed by a first accelerator, and the first accelerator is any one of the plurality of accelerators in the training apparatus. After the initial variable is initialized, the first accelerator may obtain the complete initial variable of the optimizer. Then, the first accelerator divides the complete initial variable of the optimizer, and sends the some initial variables to each of the plurality of accelerators in the training apparatus, to store the complete initial variable in a distributed manner. In other words, any one of the plurality of accelerators can initialize the initial variable of the optimizer, to reduce operation consumption of another accelerator in the training apparatus.
In addition, the first accelerator may further equally divide the complete initial variable of the optimizer, and then send the some initial variables to each of the plurality of accelerators in the training apparatus. Therefore, in step 1205, the some initial variables stored in each accelerator include initial variables assigned to the plurality of accelerators one by one after the complete initial variable is evenly divided. In the training process of the neural network model, processing capabilities of the plurality of accelerators in the training apparatus are usually the same or similar. Therefore, the complete initial variable may be evenly divided based on a quantity of the plurality of accelerators, and then assigned to the plurality of accelerators one by one, so that each accelerator stores, in a distributed manner, the some initial variables obtained through even division. In a subsequent model training process, each accelerator trains the model by using the evenly divided initial variable, to synchronize processing progresses of different accelerators in the training apparatus.
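As a hedged illustration of step 1205, the sketch below lets one accelerator (rank 0 is an assumption; the text only requires it to be any one of the accelerators) initialize the complete initial variable, evenly divide it, and scatter one shard to every accelerator using torch.distributed.

```python
# Hypothetical sketch of step 1205: the first accelerator initializes the
# optimizer's complete initial variable, evenly divides it, and scatters one
# shard to each accelerator. Assumes an initialized process group and a
# variable size divisible by the number of accelerators.
import torch
import torch.distributed as dist

def init_and_scatter_initial_variable(total_numel: int) -> torch.Tensor:
    rank, world_size = dist.get_rank(), dist.get_world_size()
    shard = torch.empty(total_numel // world_size)
    if rank == 0:
        complete_variable = torch.zeros(total_numel)           # e.g. moment buffers start at zero
        shards = list(torch.chunk(complete_variable, world_size))
    else:
        shards = None
    dist.scatter(shard, scatter_list=shards, src=0)            # each accelerator keeps one shard
    return shard
```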
1206: Process the target gradient and the some weight coefficients based on the some initial variables, to obtain a processed target gradient.
In this embodiment, each accelerator in the training apparatus processes, based on the some initial variables stored in step 1205, the target gradient obtained through calculation in step 1204 and the some weight coefficients stored in step 1201, to obtain the processed target gradient.
Specifically, after the target gradient is obtained through calculation in step 1204, the target gradient may be adjusted for optimization. There are a plurality of adjustment policies, namely, optimizers. For example, in stochastic gradient descent, the obtained target gradient is simply multiplied by a preset learning rate, to adjust the weight coefficient stored in each accelerator. Currently, two types of optimizers are widely used: a type of optimizer performing an element-wise operation on an initial variable, for example, an Adam optimizer, a momentum optimizer, an RMSProp optimizer, an AdaMax optimizer, or an Adagrad optimizer, where when the initial variable is divided and sent to each accelerator, each accelerator can complete calculation without communication; and a type of optimizer performing a vector operation (for example, a matrix operation or another vector operation) on an initial variable, for example, a LARS optimizer, where when calculating the initial variable, the optimizer needs an extra operation to complete calculation.
For an optimizer of the element-wise type, the operations of the optimizer are performed element by element, and operations such as a matrix operation and accumulation are not required. When performing step 1206, each accelerator may directly perform an optimization operation on the locally obtained target gradient.
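For illustration only, the following sketch shows what steps 1206 and 1207 could look like for an element-wise optimizer, using an Adam-style update on the local shards; the hyperparameter values and function name are assumptions, and no collective communication is needed.

```python
# Hypothetical element-wise example of steps 1206/1207: an Adam-style update is
# applied directly to the local target gradient, weight shard, and initial-variable
# shards (first and second moments). All operations are element-wise, so no
# collective communication is required. Hyperparameters are illustrative.
import torch

def elementwise_optimizer_step(weight_shard, target_gradient, m_shard, v_shard, step,
                               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m_shard.mul_(beta1).add_(target_gradient, alpha=1 - beta1)                        # first moment
    v_shard.mul_(beta2).addcmul_(target_gradient, target_gradient, value=1 - beta2)   # second moment
    m_hat = m_shard / (1 - beta1 ** step)
    v_hat = v_shard / (1 - beta2 ** step)
    processed_target_gradient = lr * m_hat / (v_hat.sqrt() + eps)
    weight_shard.sub_(processed_target_gradient)     # step 1207: update the local weight shard
    return weight_shard
```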
For an optimizer performing a vector operation, the optimizer needs to perform an operation such as a matrix operation or another vector operation, and needs a complete gradient during calculation. Because the target gradient is scattered on each accelerator, when performing step 1206, each accelerator needs extra collective communication, to ensure calculation correctness. The following describes in detail an implementation process of such an optimizer.
Specifically, in step 1206, if the optimizer includes a vector operation, when processing the target gradient and the some weight coefficients based on the some initial variables, to obtain the processed target gradient, each accelerator in the training apparatus may specifically calculate a scalar representation of the target gradient; add the scalar representation of the target gradient in each of the plurality of accelerators, to obtain a summation result of the target gradient; then calculate a vector representation of the target gradient based on the summation result; and further process the vector representation of the target gradient and the some weight coefficients based on the some initial variables, to obtain the processed target gradient. The reason is that when the optimizer includes the vector operation (for example, a matrix operation or another vector operation), the complete gradient is required during calculation, but the target gradient is distributed in each accelerator. In this implementation, the solution may be applied to an implementation process of an optimizer including a vector operation. This can improve feasibility of the solution.
A data structure in the LARS optimizer shown in
In addition, in the foregoing operation, when adding the scalar representation of the target gradient in each of the plurality of accelerators, to obtain the summation result of the target gradient, each accelerator in the training apparatus may specifically add the scalar representation of the target gradient in each of the plurality of accelerators by using an allreduce operation in collective communication, to obtain the summation result of the target gradient. For a specific implementation process in which each accelerator uses allreduce for communication, refer to an implementation process in
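The following is a sketch, under assumed LARS-style trust-ratio arithmetic and illustrative coefficient values, of how an optimizer with a vector operation could carry out step 1206: each accelerator computes local squared sums (taken here as the scalar representation), sums them across accelerators with an allreduce, recovers the global norms from the summation result, and then processes its local shard. The interpretation of the scalar and vector representations, and the function name, are assumptions.

```python
# Hypothetical sketch of step 1206 for a vector-operation optimizer (LARS-style).
# The scalar representation is taken here to be the local squared sums of the
# weight shard and the target gradient; these are summed across accelerators with
# an allreduce, and the global norms recovered from the summation result are then
# used to process the local shard. Coefficient values are illustrative assumptions.
import torch
import torch.distributed as dist

def vector_optimizer_step(weight_shard, target_gradient, lr=0.1, trust_coeff=0.001):
    local_sums = torch.stack([weight_shard.pow(2).sum(),
                              target_gradient.pow(2).sum()])   # scalar representation
    dist.all_reduce(local_sums, op=dist.ReduceOp.SUM)          # summation result
    weight_norm, grad_norm = local_sums.sqrt()                 # derived from the summation result
    trust_ratio = trust_coeff * weight_norm / (grad_norm + 1e-12)
    processed_target_gradient = trust_ratio * target_gradient  # processed target gradient
    weight_shard.sub_(lr * processed_target_gradient)          # step 1207: update the local shard
    return weight_shard
```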
1207: Update the some weight coefficients based on the processed target gradient, and train the neural network model based on updated some weight coefficients.
In this embodiment, each accelerator in the training apparatus updates, based on the processed target gradient obtained in step 1206, the some weight coefficients stored in step 1201, and trains the neural network model based on the updated some weight coefficients.
Specifically, in step 1207, when updating the some weight coefficients based on the processed target gradient, each accelerator in the training apparatus may update, based on the processed target gradient obtained in step 1206, the some weight coefficients stored in step 1201. When the optimizer optimizes the weight coefficient of the neural network model, each accelerator may store the initial variable of the optimizer in a distributed manner, that is, each accelerator stores the some initial variables. Then, each accelerator uses the target gradient and the some weight coefficients as an input of the optimizer; performs optimization according to an optimization algorithm preset in the optimizer, to obtain the processed target gradient; and then updates, based on the processed target gradient, the some weight coefficients stored in each accelerator. In other words, the plurality of accelerators in the training apparatus store the complete initial variable of the optimizer in a distributed manner, to further reduce video RAM consumption of the training apparatus in the training process of the neural network model.
In addition, in a specific implementation, each accelerator in the training apparatus may further obtain an update parameter of the some weight coefficients, and update the some weight coefficients based on the update parameter of the some weight coefficients; obtain an update parameter of the initial variable, and update the initial variable based on the update parameter of the initial variable; obtain an update parameter of the target gradient, and update the target gradient based on the update parameter of the target gradient; and/or obtain an update parameter of the processed target gradient, and update the processed target gradient based on the update parameter of the processed target gradient. Target parameters related to the training process of the neural network model may be stored in a distributed manner, and the target parameters include the some weight coefficients, the initial variable, the target gradient, and/or the processed target gradient, and the like. When the some weight coefficients corresponding to the neural network model or the initial variable of the optimizer need to be updated, each accelerator in the training apparatus may update the target parameters stored in a distributed manner. It can be learned that training parameters such as the complete weight coefficient and the complete initial variable are stored in a distributed manner, and only the locally stored training parameters need to be calculated in the step of optimizing and updating the neural network. Therefore, this can reduce repeated calculation in an existing data parallel solution, reduce an overall calculation amount, and further reduce video RAM consumption of the accelerator in the training apparatus.
Training a BERT network and a GPT-2 network is used as an example. The BERT network has about 1.3G parameters, and the training network uses an Adam optimizer. If training is performed in the existing data parallel manner, considering only the space occupied by the weight parameters and the initial variables of the optimizer, about 7.8 GBytes of video RAM is required. If the GPT-2 network is trained in the existing data parallel manner, 36 GBytes of video RAM is required. Currently, the video RAM of most GPU cards on the market is 16 GBytes. Once the size of the feature map is also taken into account, the video RAM may be insufficient, and training cannot be performed; for the GPT-2 network, the video RAM is far from sufficient. In the embodiments shown in
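As a rough consistency check (the per-parameter byte count and the accelerator count are assumptions, not figures from this application), about 6 bytes per parameter for the weight plus the two Adam moment variables reproduces the 7.8 GBytes estimate for a 1.3G-parameter network, and partitioning those buffers across N accelerators divides this portion of the video RAM by N:

```python
# Back-of-the-envelope check of the BERT figure above. The 6 bytes/parameter
# assumption (e.g. a half-precision weight plus two half-precision Adam moments)
# and the accelerator count of 8 are illustrative; only the 1.3G parameter count
# and the 7.8 GBytes total come from the text.
def weight_and_optimizer_memory_gb(num_params=1.3e9, bytes_per_param=6, num_accelerators=8):
    total_gb = num_params * bytes_per_param / 1e9
    return total_gb, total_gb / num_accelerators

total_gb, per_accelerator_gb = weight_and_optimizer_memory_gb()
print(f"replicated: {total_gb:.1f} GB, partitioned across 8 accelerators: {per_accelerator_gb:.2f} GB")
```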
In this embodiment, in a parallel processing process in which the training apparatus trains the neural network model by using the plurality of accelerators, the plurality of accelerators in the training apparatus store the complete weight coefficient of the neural network model in a distributed manner, and then perform addition to obtain the complete weight coefficient. Subsequently, when the optimizer optimizes the weight coefficient of the neural network model, each accelerator may store the initial variable of the optimizer in a distributed manner, that is, each accelerator stores the some initial variables. Then, each accelerator uses the target gradient and the some weight coefficients as an input of the optimizer; performs optimization according to an optimization algorithm preset in the optimizer, to obtain the processed target gradient; and then updates, based on the processed target gradient, the some weight coefficients stored in each accelerator. In other words, the plurality of accelerators in the training apparatus store the complete initial variable of the optimizer in a distributed manner, to further reduce video RAM consumption of the training apparatus in the training process of the neural network model.
2. A complete initial variable of an optimizer is stored in a distributed manner
Refer to
1401: Calculate gradient information based on input data and a complete weight coefficient.
In this embodiment, a training apparatus includes a plurality of accelerators, and each accelerator calculates the gradient information based on the input data and the complete weight coefficient in step 1401. In a data parallel system architecture including the plurality of accelerators in the training apparatus, a neural network training model may be pre-created in each accelerator. Each accelerator may initialize the weight coefficient of the neural network model, to obtain the complete weight coefficient. In addition, in step 1401, input data of the plurality of accelerators is different.
1402: Calculate a target gradient based on the gradient information of the plurality of accelerators.
In this embodiment, each accelerator in the training apparatus calculates the target gradient based on the gradient information obtained through calculation in step 1401. The target gradient may be used to update the complete weight coefficient stored in each accelerator in step 1401, to complete an iteration process.
In a specific implementation, in step 1402, when calculating the target gradient based on the gradient information of the plurality of accelerators, each accelerator in the training apparatus may specifically calculate, by using a reduce-scatter operation in collective communication, the target gradient based on the gradient information of the plurality of accelerators. For an implementation process of reduce-scatter, refer to the implementation process in
1403: Store some initial variables of an optimizer.
In this embodiment, the training apparatus includes the plurality of accelerators, and each accelerator stores the some initial variables of the optimizer in step 1403. The some initial variables stored in each of the plurality of accelerators in the training apparatus form a complete initial variable of the optimizer. The optimizer is configured to update the weight coefficient of the neural network model.
In a specific implementation, in step 1403, the some initial variables stored in each of the plurality of accelerators in the training apparatus form the complete initial variable of the optimizer, and the optimizer is configured to update the weight coefficient of the neural network model. Specifically, in a training process of the neural network model, an initial variable of the optimizer needs to be initialized in a first training iteration. The initialization operation may be specified to be performed by a first accelerator, and the first accelerator is any one of the plurality of accelerators in the training apparatus. After the initial variable is initialized, the first accelerator may obtain the complete initial variable of the optimizer. Then, the first accelerator divides the complete initial variable of the optimizer, and sends the some initial variables to each of the plurality of accelerators in the training apparatus, to store the complete initial variable in a distributed manner. In other words, any one of the plurality of accelerators can initialize the initial variable of the optimizer, to reduce operation consumption of another accelerator in the training apparatus.
In addition, the first accelerator may further equally divide the complete initial variable of the optimizer, and then send the some initial variables to each of the plurality of accelerators in the training apparatus. Therefore, in step 1403, the some initial variables stored in each accelerator include initial variables assigned to the plurality of accelerators one by one after the complete initial variable is evenly divided. In the training process of the neural network model, processing capabilities of the plurality of accelerators in the training apparatus are usually the same or similar. Therefore, the complete initial variable may be evenly divided based on a quantity of the plurality of accelerators, and then assigned to the plurality of accelerators one by one, so that each accelerator stores, in a distributed manner, the some initial variables obtained through even division. In a subsequent model training process, each accelerator trains the model by using the evenly divided initial variable, to synchronize processing progresses of different accelerators in the training apparatus.
1404: Process the target gradient and the some weight coefficients based on the some initial variables, to obtain a processed target gradient.
In this embodiment, each accelerator in the training apparatus processes, based on the some initial variables stored in step 1403, the target gradient obtained through calculation in step 1402 and the some weight coefficients, to obtain the processed target gradient. The some weight coefficients processed by each of the plurality of accelerators form the complete weight coefficient in step 1401.
Specifically, after the target gradient is obtained through calculation in step 1402, the target gradient may be adjusted for optimization. There are a plurality of adjustment policies, namely, optimizers. For example, in stochastic gradient descent, the obtained target gradient is simply multiplied by a preset learning rate, to adjust the weight coefficient stored in each accelerator. Currently, two types of optimizers are widely used: a type of optimizer performing an element-wise operation on an initial variable, for example, an Adam optimizer, a momentum optimizer, an RMSProp optimizer, an AdaMax optimizer, or an Adagrad optimizer, where when the initial variable is divided and sent to each accelerator, each accelerator can complete calculation without communication; and a type of optimizer performing a vector operation (for example, a matrix operation or another vector operation) on an initial variable, for example, a LARS optimizer, where when calculating the initial variable, the optimizer needs an extra operation to complete calculation. For a specific implementation process of the two types of optimizers, refer to an implementation process of step 1206. Details are not described herein again.
1405: Update the complete weight coefficient based on the processed target gradient, and train the neural network model based on an updated complete weight coefficient.
In this embodiment, each accelerator in the training apparatus updates, based on the processed target gradient obtained in step 1404, the complete weight coefficient pre-stored in each accelerator in step 1401, to obtain the updated complete weight coefficient, and trains the neural network model based on the updated complete weight coefficient. It can be learned from step 1401 that a data amount of the gradient information of each accelerator corresponds to a data amount of the complete weight coefficient. It can be learned from step 1402 and step 1403 that a data amount of the target gradient corresponds to a data amount of the some weight coefficients. Therefore, in step 1405, each accelerator in the training apparatus updates a portion of the complete weight coefficient.
Specifically, in a data parallel system architecture including the plurality of accelerators in the training apparatus, a neural network training model may be pre-created in each accelerator. In one iterative training process, after the complete weight coefficient is obtained in step 1401, the complete weight coefficient and different input data are used as an input of the neural network model, to train the neural network model. For a specific training process, refer to the implementation process in
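To tie steps 1401 to 1405 together, here is a minimal single-iteration sketch under several stated assumptions: torch.distributed is the collective-communication backend, the weight length is divisible by the number of accelerators, the gradient information has already been computed as in step 1401, and a simple momentum-style rule stands in for the optimizer. How the non-owned slices of the complete weight coefficient are re-synchronized afterwards is not detailed in this embodiment, so it is only noted in a comment.

```python
# Hypothetical sketch of one iteration of steps 1401-1405: the complete weight
# coefficient is replicated on every accelerator, only the optimizer's initial
# variable is partitioned, and each accelerator updates just its own portion of
# the complete weight coefficient. The momentum rule is an illustrative stand-in.
import torch
import torch.distributed as dist

def train_iteration(complete_weight, momentum_shard, gradient_information, lr=0.01, beta=0.9):
    rank, world_size = dist.get_rank(), dist.get_world_size()
    # Step 1402: reduce-scatter so each accelerator holds its summed target gradient slice.
    grad_slices = list(torch.chunk(gradient_information, world_size))
    target_gradient = torch.empty_like(grad_slices[rank])
    dist.reduce_scatter(target_gradient, grad_slices, op=dist.ReduceOp.SUM)
    # Step 1404: process the target gradient with the locally stored initial-variable shard.
    momentum_shard.mul_(beta).add_(target_gradient)
    processed_target_gradient = lr * momentum_shard
    # Step 1405: update only this accelerator's portion of the complete weight coefficient.
    owned_slice = torch.chunk(complete_weight, world_size)[rank]
    owned_slice.sub_(processed_target_gradient)   # in-place view update of complete_weight
    # (re-synchronizing the other slices, e.g. with an allgather, is not covered here)
    return complete_weight
```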
In addition, each accelerator in the training apparatus may further obtain an update parameter of the complete weight coefficient, and update the complete weight coefficient based on the update parameter of the complete weight coefficient; obtain an update parameter of the initial variable, and update the initial variable based on the update parameter of the initial variable; obtain an update parameter of the target gradient, and update the target gradient based on the update parameter of the target gradient; and/or obtain an update parameter of the processed target gradient, and update the processed target gradient based on the update parameter of the processed target gradient. Target parameters related to the training process of the neural network model may be stored in a distributed manner, and the target parameters include the complete weight coefficient, the initial variable, the target gradient, and/or the processed target gradient, and the like. When the weight coefficient corresponding to the neural network model or the initial variable of the optimizer needs to be updated, each accelerator in the training apparatus may update the target parameters stored in a distributed manner. This can further reduce video RAM consumption of the accelerator in the training apparatus.
In this embodiment, in a parallel processing process in which the training apparatus trains the neural network model by using the plurality of accelerators, the complete initial variable of the optimizer in the neural network model is stored in the plurality of accelerators in the training apparatus in a distributed manner. Each accelerator processes the target gradient and the some weight coefficients based on the some initial variables, to obtain the processed target gradient; then updates the complete weight coefficient based on the processed target gradient; and trains the neural network model based on the updated complete weight coefficient. In other words, the complete initial variable of the optimizer is stored in the plurality of accelerators in the training apparatus in a distributed manner, to reduce video RAM consumption of the training apparatus in a training process of the neural network model.
Based on the embodiments corresponding to
Specifically,
a storage unit 1501, configured to store some weight coefficients, where the some weight coefficients stored in each of the plurality of accelerators form a complete weight coefficient;
an addition unit 1502, configured to add the some weight coefficients stored in each of the plurality of accelerators, to obtain the complete weight coefficient; and
a training unit 1503, configured to train a neural network model based on input data and the complete weight coefficient, where input data of the plurality of accelerators is different.
In a possible design, the training unit 1503 is specifically configured to:
calculate gradient information based on the input data and the complete weight coefficient;
calculate a target gradient based on the gradient information of the plurality of accelerators; and
update the some weight coefficients based on the target gradient, and train the neural network model based on updated some weight coefficients.
In a possible design, the storage unit 1501 is further configured to:
store some initial variables of an optimizer, where some initial variables stored in each of the plurality of accelerators form a complete initial variable of the optimizer, and the optimizer is configured to update the weight coefficient of the neural network model.
The training unit 1503 is specifically configured to:
process the target gradient and the some weight coefficients based on the some initial variables, to obtain a processed target gradient; and
update the some weight coefficients based on the processed target gradient.
In a possible design, the optimizer includes a vector operation, and the training unit 1503 is specifically configured to:
calculate a scalar representation of the target gradient;
add the scalar representation of the target gradient in each of the plurality of accelerators, to obtain a summation result of the target gradient;
calculate a vector representation of the target gradient based on the summation result; and
process the vector representation of the target gradient and the some weight coefficients based on the some initial variables, to obtain the processed target gradient.
In a possible design, the training unit 1503 is specifically configured to:
add the scalar representation of the target gradient in each of the plurality of accelerators by using an allreduce operation in collective communication, to obtain the summation result of the target gradient.
In a possible design, the some weight coefficients include weight coefficients assigned to the plurality of accelerators one by one after the complete weight coefficient is evenly divided.
In a possible design, the addition unit 1502 is specifically configured to:
add, by using an allgather operation in collective communication, the some weight coefficients stored in each of the plurality of accelerators, to obtain the complete weight coefficient.
In a possible design, the training unit 1503 is specifically configured to:
calculate, by using a reduce-scatter operation in collective communication, the target gradient based on the gradient information of the plurality of accelerators.
In a possible design, the some initial variables include initial variables assigned to the plurality of accelerators one by one after the complete initial variable is evenly divided.
It should be noted that content such as information exchange or an execution process between the modules/units in the training apparatus 1500 is based on a same concept as the embodiments shown in
Specifically,
a calculation unit 1601, configured to calculate gradient information based on input data and a complete weight coefficient, where input data of the plurality of accelerators is different, where the calculation unit 1601 is further configured to calculate a target gradient based on the gradient information of the plurality of accelerators;
a storage unit 1602, configured to store some initial variables of an optimizer, where some initial variables stored in each of the plurality of accelerators form a complete initial variable of the optimizer, and the optimizer is configured to update the weight coefficient of a neural network model;
a processing unit 1603, configured to process the target gradient and some weight coefficients based on the some initial variables, to obtain a processed target gradient, where the some weight coefficients processed by each of the plurality of accelerators form the complete weight coefficient; and
an updating unit 1604, configured to update the complete weight coefficient based on the processed target gradient, and train the neural network model based on an updated complete weight coefficient.
In a possible design, the optimizer includes a vector operation, and the processing unit 1603 is specifically configured to:
calculate a scalar representation of the target gradient;
add the scalar representation of the target gradient in each of the plurality of accelerators, to obtain a summation result of the target gradient;
calculate a vector representation of the target gradient based on the summation result; and
process the vector representation of the target gradient and the some weight coefficients based on the some initial variables, to obtain the processed target gradient.
In a possible design, the processing unit 1603 is specifically configured to:
add the scalar representation of the target gradient in each of the plurality of accelerators by using an allreduce operation in collective communication, to obtain the summation result of the target gradient.
In a possible design, the calculation unit 1601 is specifically configured to:
calculate, by using a reduce-scatter operation in collective communication, the target gradient based on the gradient information of the plurality of accelerators.
In a possible design, the some initial variables include initial variables assigned to the plurality of accelerators one by one after the complete initial variable is evenly divided.
It should be noted that content such as information exchange or an execution process between the modules/units in the training apparatus 1600 is based on a same concept as the embodiment corresponding to
An embodiment of this application further provides a training apparatus.
The training apparatus 1700 may further include one or more power supplies 1726, one or more wired or wireless network interfaces 1750, one or more input/output interfaces 1758, and/or one or more operating systems 1741, for example, Windows Server™, Mac OS X™, Unix™, Linux™, and FreeBSD™.
In this embodiment of this application, the central processing unit 1722 is configured to perform the training method for a neural network model performed by the training apparatus in the embodiment corresponding to
An embodiment of this application further provides a computer program product. When the computer program product runs on a computer, the computer is enabled to perform steps performed by the training apparatus in the method described in the embodiments shown in
An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium stores a program used for signal processing. When the program is run on a computer, the computer is enabled to perform steps performed by the training apparatus in the method described in the embodiments shown in
In addition, it should be noted that the apparatus embodiments described above are merely an example. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the modules may be selected based on an actual requirement to achieve the objectives of the solutions of the embodiments. In addition, in the accompanying drawings of the apparatus embodiments provided in this application, connection relationships between modules indicate that the modules have communication connections with each other, which may be specifically implemented as one or more communication buses or signal cables.
Foreign application priority data: 202010441573.3, May 2020, CN (national).
This application is a continuation of International Application No. PCT/CN2021/080784, filed on Mar. 15, 2021, which claims priority to Chinese Patent Application No. 202010441573.3, filed on May 22, 2020. The disclosures of the aforementioned applications are hereby incorporated by reference in their entireties.
Related application data: parent application PCT/CN2021/080784, Mar 2021, US; child application 17991683, US.