The present disclosure relates to computing, and more particularly to techniques for processing an artificial intelligence model such as an artificial neural network.
Artificial neural networks (hereinafter, neural network) have become increasingly important in artificial intelligence applications and modern computing in general. An example neural network is shown in
Such systems “learn” to perform tasks by considering examples, generally without being programmed with task-specific rules. Initially, the weights may be untrained. During a training phase, input values with corresponding known results are processed by the network, and a difference (or error) between the network output values and the known values is determined. The weights may be adjusted based on the error using a process known as backpropagation, in which computations flow through the neural network in the reverse direction (e.g., from the output to the input). Training may involve successively adjusting weights across many input samples and corresponding known network output values. This is often referred to as the training phase. Once trained, the system may receive inputs and produce meaningful results (e.g., classification or recognition). This is often referred to as the inference phase.
Activation functions are mathematical equations that determine the output of a neural network. The term “activations” sometimes refers to the intermediate values within the network, for example, the values that produced a particular output at a particular time. Data may be flowing through the neural network continuously, and weights may be changing, and thus activations at particular times may be stored.
As shown in
The intermediate activations 203 calculated in the first forward operation (F0) may be used by the corresponding backward operation (B0) 205. The backward operations may include one or more operations performed by applying automatic differentiation to the forward operation. Accordingly, the intermediate activations may be stashed (e.g., stored in a buffer) until the corresponding backward operation (B0) 205 is commenced, which may occur after all of the other intervening forward and backward operations are performed. However, stashing activations may require a significant amount of memory. Instead of stashing the intermediate activations, the input used to compute the intermediate activation may be stored, and this input may be used to recompute the intermediate activation during the backward pass. The weight used during the forward pass may also be stored for use in such recomputations.
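Conceptually, the stash-versus-recompute trade-off can be sketched as follows. This is a minimal illustration assuming a NumPy implementation with a ReLU activation; the function names are hypothetical and are not the disclosure's implementation.

```python
import numpy as np

def f(x):
    # Example activation function (ReLU); any differentiable f works similarly.
    return np.maximum(x, 0.0)

def forward(W, x):
    # Forward operation: intermediate activation Y = W * f(x).
    return W @ f(x)

def backward_with_recompute(W_stored, x_stored, E_input):
    # Rather than stashing Y in a buffer until the corresponding backward
    # operation runs, only the input x and the forward-pass weight W are
    # stored, and Y is recomputed here (trading compute for memory).
    Y = forward(W_stored, x_stored)   # recomputation of the stashed value
    # ... Y would feed the gradient calculation for this stage ...
    return W_stored.T @ E_input       # error propagated to the prior stage
```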
Neural networks consist of many numeric operations that must be efficiently partitioned across computation resources. There are many approaches to partitioning the operations, and they are highly model and architecture specific. One partitioning approach that is especially suited to deep networks is to split layers onto sequential compute resources (e.g., separate devices or chips). Splitting layers across hardware in this way leads, for neural networks, directly to pipeline parallelism.
Pipeline parallelism is very efficient for feedforward networks but becomes much more complicated when feedback and weight updates are applied. One technique for neural networks is to update the weights based on a mini-batch of data elements (an N-example average). This approach is inefficient for a pipelined model because it requires the pipeline to flush out its contents before continuing. The flushing process clears the data out of the pipeline after the mini-batch has been processed. This means that the later stages in the pipeline are not processing data that they might otherwise be processing if the data were not split into mini-batches, causing delay in processing the entirety of the data. The delay caused by flushing leads to further delays because the pipeline must be refilled at the start of the next mini-batch. Refilling the pipeline requires the pipeline to be ramped up as the first data element in the subsequent mini-batch is processed. Thus, processing data in mini-batches and updating weights after each mini-batch is processed may cause inefficiency and delay in training a neural network. In some implementations, such delays may account for 10-15% of the time spent training a neural network. The amount of delay may be implementation specific.
One solution that avoids flushing the pipeline is to continuously update the weights each time a weight update is computed on the backward pass at each stage. Thus, the weights of the stages of the pipeline are updated asynchronously. However, the continuous update solution has a few issues in terms of performance and usability. Continuously updating the weights means that the weight gradient is not applied to the same weight that was used to compute it, and the re-computation of the activation in the backward pass uses a different, incorrect weight than the forward pass did.
There is a need for improved techniques for processing neural networks in a pipeline.
Embodiments of the present disclosure provide improved techniques for processing neural networks in a pipeline.
One embodiment provides a computer system including one or more processors and a non-transitory computer readable storage medium coupled to the one or more processors. The storage medium has stored thereon program code executable by the one or more processors to train, using a plurality of data samples, an artificial intelligence model in a plurality of stages forming a pipeline. The pipeline includes a first stage and a last stage. Each of the plurality of stages of the pipeline performs a forward pass calculation based on a particular weight, a backward pass calculation based on the particular weight, and a weight update calculation based on the particular weight for each of the plurality of data samples. The program code is further executable to apply updated weights to the plurality of stages at one or more predetermined intervals during a steady state operation of the training of the artificial intelligence model. The updated weights are applied such that a weight update calculation of the last stage has been performed based on a particular data sample and a weight update calculation of the first stage has not been performed based on the particular data sample.
Another embodiment provides a method of processing an artificial intelligence model. The method includes training, using a plurality of data samples, the artificial intelligence model in a plurality of stages forming a pipeline. The pipeline includes a first stage and a last stage. Each of the plurality of stages of the pipeline performs a forward pass calculation based on a particular weight, a backward pass calculation based on the particular weight, and a weight update calculation based on the particular weight for each of the plurality of data samples. The method further includes applying updated weights to the plurality of stages at one or more predetermined intervals during a steady state operation of the training of the artificial intelligence model. The updated weights are applied such that a weight update calculation of the last stage has been performed based on a particular data sample and a weight update calculation of the first stage has not been performed based on the particular data sample.
Another embodiment provides a non-transitory computer readable storage medium having stored thereon program code executable by a computer system. The program code may cause the computer system to train, using a plurality of data samples, an artificial intelligence model in a plurality of stages forming a pipeline. The pipeline includes a first stage and a last stage. Each of the plurality of stages of the pipeline performs a forward pass calculation based on a particular weight, a backward pass calculation based on the particular weight, and a weight update calculation based on the particular weight for each of the plurality of data samples. The program code may further cause the computer system to apply updated weights to the plurality of stages at one or more predetermined intervals during a steady state operation of the training of the artificial intelligence model. The weight updates are applied such that a weight update calculation of the last stage has been performed based on a particular data sample and a weight update calculation of the first stage has not been performed based on the particular data sample.
Various embodiments of the present disclosure are illustrated by way of example and not limitation in the figures of the accompanying drawings.
In the following description, for purposes of explanation, numerous examples and specific details are set forth in order to provide a thorough understanding of the present disclosure. Such examples and details are not to be construed as unduly limiting the elements of the claims or the claimed subject matter as a whole. It will be evident to one skilled in the art, based on the language of the different claims, that the claimed subject matter may include some or all of the features in these examples, alone or in combination, and may further include modifications and equivalents of the features and techniques described herein.
As discussed above, pipeline parallelism is very efficient for feedforward networks but becomes much more complicated when feedback and weight updates are applied. One technique for neural networks is to update the weights based on a mini-batch of data elements (an N-example average). This approach is inefficient for a pipelined model because it requires the pipeline to flush out its contents before continuing. The flushing operation requires the pipeline to be cleared out at the end of the mini-batch and refilled at the start of the next mini-batch, leading to a large inefficiency.
To avoid the delay and reduced performance caused by flushing the pipeline in order to update the weights, the present disclosure provides a pipeline processing technique that synchronously applies the calculated weight updates to the stages of the pipeline at specific intervals without flushing the pipeline. The interval may be predetermined based on a particular number of operations being performed (e.g., after 64 or 1024, etc., backwards pass operations have been completed by a particular stage of the pipeline). The interval may be predetermined in a similar manner as selecting a mini-batch size even though the data will not be separated into mini-batches.
Since the pipeline is not flushed before the calculated weight updates are applied each interval, the different stages of the pipeline operate on different time schedules due to the nature of the pipeline. That is, different stages of the pipeline are currently operating on different data samples and may have completed operations on different sets of data samples when the weight update is applied. For example, the last stage in the pipeline may have completed backwards pass operations for a particular data sample when the weight updates are applied while a first stage in the pipeline may not have begun or completed its backwards pass operations for that particular data sample (e.g., because the intermediate stages in the pipeline will perform backwards pass operations on that data sample prior to the first stage doing so). Therefore, the total number of weight update calculations of the backwards pass operations that have been performed by the different stages of the pipeline may be different when the weight updates are applied to the stages.
Accordingly, the weight updates applied to the different stages may be based on different sets of data samples. If the weight updates are applied at an interval based on the last stage having completed a certain number of backwards pass operations (e.g., gradient calculations and weight update calculations), then the weight update at the last stage will be based on a data sample which is not used by the other stages at that moment in time. However, the weight update calculations for that data sample may be used by the other stages when the following weight update is applied after the next interval. As such, the previously existing weight calculations used in applying weight updates at a particular stage may correspond to forward pass operations that were performed during the previous interval. Thus, these weight update calculations may be referred to as “stale” weights.
Since the weight updates at certain stages may be applied before the corresponding backwards pass operations have been performed, the weights used during the forwards pass operations may be stored for use in recomputation and in the backwards pass operations. Therefore, the old weight (e.g., prior to the update being applied) may be used for data that is still in the pipeline while the new updated weights are used with the input data in forwards pass operations after the interval. This technique allows full pipeline efficiency (e.g., each stage of the pipeline continuously performs operations as they become available for the entire set of training data) with only a single ramp-up period and a single ramp-down (e.g., flush) period needed to process all of the training data, compared to using mini-batches where each batch requires a ramp-up and a ramp-down period.
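The mechanics of one stage under this scheme can be sketched in code. The following is a minimal sketch under stated assumptions (NumPy, a per-stage update counter, and a caller-supplied flag indicating whether a sample's forward pass used the pre-update weight); the class and parameter names are hypothetical, and in a full system the update application would be coordinated synchronously across all stages.

```python
import numpy as np

class PipelineStage:
    """Illustrative pipeline stage with interval-based weight updates."""

    def __init__(self, W, u=0.01, interval=64):
        self.W_new = W                    # weight used for new forward passes
        self.W_old = W                    # pre-update weight kept for in-flight data
        self.pending = np.zeros_like(W)   # weight updates accumulated this interval
        self.u = u                        # gain term
        self.interval = interval          # e.g., 64 or 1024 backward passes
        self.n_backward = 0

    def backward(self, E, x, forward_used_old_weight):
        # Use the same weight the sample saw on its forward pass, so the
        # gradient is applied against the weight that produced it.
        W = self.W_old if forward_used_old_weight else self.W_new
        self.pending += self.u * np.outer(E, x)   # u * (E x x)
        self.n_backward += 1
        if self.n_backward % self.interval == 0:
            self._apply_updates()
        return W.T @ E                            # error for the previous stage

    def _apply_updates(self):
        # Applied at the predetermined interval without flushing: the old
        # weight is retained for data that is still in the pipeline.
        self.W_old = self.W_new
        self.W_new = self.W_new + self.pending
        self.pending = np.zeros_like(self.W_new)
```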
The technique of applying weight updates at specific intervals improves upon the continuous update technique described above. As mentioned above, the continuous update solution has a few issues in terms of performance and usability. Continuously updating the weights means that the weight gradient is not applied to the same weight that was used to compute it, and the re-computation of the activation in the backward pass uses the wrong weight. The technique of applying weight updates at specific intervals improves upon the continuous update technique because it uses a consistent weight gradient at all times except when flushing the pipeline after all data has been processed. A further improvement is that the activations remain consistent because they are recomputed using the same stored weights. The technique of updating weights at specific intervals trades memory for speed, as it requires an extra set of weight storage in order to ensure that the weight-based calculations are accurate. A process for implementing the technique of updating weights at specific intervals is described below with respect to
At 302, the process applies different sets of calculated weight updates to the plurality of stages in the pipeline at one or more predetermined intervals during the training of the artificial intelligence model such that a fewer number of calculated weight updates are applied to one or more earlier stages of the pipeline and a greater number of calculated weight updates are applied to a later stage of the pipeline at each of the one or more predetermined intervals.
In some embodiments, the applying of the different sets of the calculated weight updates to the plurality of stages in the pipeline at a particular predetermined interval uses existing weights that have not been updated within a prior interval.
In some embodiments, the applying of the different sets of the calculated weight updates to the plurality of stages in the pipeline at a particular predetermined interval includes storing both the different sets of the calculated weight updates and one or more weights used in forward pass calculations during a prior interval.
In some embodiments, the plurality of data samples are not split into mini-batches to train the artificial intelligence model.
In some embodiments, the pipeline is in a steady-state operation during the applying of the different sets of the calculated weight updates to the plurality of stages, the pipeline not being flushed out prior to the applying of the different sets of the calculated weight updates to the plurality of stages.
In some embodiments, consistent weight gradients are used for a gradient calculation of the backward pass calculation, during steady state operation, for each of the plurality of data samples at each stage of the pipeline.
In some embodiments, the training of the artificial intelligence model in the plurality of stages forming the pipeline is completed while only flushing data from the pipeline once.
In some embodiments, the artificial intelligence model is an artificial neural network.
The pipeline stages 401-404 perform nine forward pass calculations (denoted as F0-F8 in
The forwards pass calculation may be expressed as:
Y=W*ƒ(x)
Where Y is the calculated activation, W is the weight, and x is the input data being applied to an activation function ƒ.
The error calculation of the backwards pass may be expressed as:
Eoutput=transpose(W)*Einput
Where Einput is the error output from the prior stage in the backwards pass (e.g., the stage that was the next stage in the corresponding forwards pass), which is multiplied by the transpose of the weight W to calculate the error Eoutput to output to the subsequent stage in the backwards pass.
The weight update calculation may be expressed as:
Wn=Wn-1+u*E×x
Where Wn is the updated weight, Wn-1 is the previously used weight, and u is a gain term that multiplies the cross product of the error E and the data x. Thus, the weight that was used in the forwards pass calculation is also used in the weight update calculation.
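Taken together, the three calculations at a stage can be transcribed directly into code. The following is a minimal NumPy sketch of the equations above (np.outer implements the E×x cross product); it is illustrative only, and the function names are not from the disclosure.

```python
import numpy as np

def forward_pass(W, x, f=lambda v: np.maximum(v, 0.0)):
    # Y = W * f(x), with ReLU as an example activation function f.
    return W @ f(x)

def backward_error(W, E_input):
    # Eoutput = transpose(W) * Einput
    return W.T @ E_input

def weight_update(W_prev, u, E, x):
    # Wn = Wn-1 + u * (E x x); the cross (outer) product of E and x has
    # the same shape as W, and W_prev is the same weight that was used
    # in the corresponding forwards pass calculation.
    return W_prev + u * np.outer(E, x)
```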
As data is processed in the pipeline, each of the four stages 401-404 (stage 0-3) may perform processing in parallel with the other stages in a pipeline as shown in
Once the pipeline has ramped up 451, it operates in a steady state (timeslots 3-29) until the last stage in the pipeline has performed the backwards pass calculations for the last data sample in the training set. “Steady state operation” may refer to a pipeline state in which each stage of the pipeline will perform calculations as they become available (i.e., the stages do not delay processing in order to flush data out of the pipeline). During “steady state operation” there may be periods in which a particular stage is idle after the forward pass calculations for all data samples of the training data have been processed, but before the corresponding backwards pass calculations have been provided by the subsequent stage in the pipeline. For instance, the first stage 401 (stage 0) may be idle at timeslots 15, 18, 21, 24, 27, and 30, the second stage 402 (stage 1) may be idle at timeslots 20, 23, 26, and 29, and the third stage 403 (stage 2) may be idle at timeslots 25 and 28. The fourth stage 404 (stage 3), which is the final stage in the pipeline, may not be idle until the ramp-down period 452 begins at timeslot 30 and continues to timeslot 32 when the last weight update calculation (for the ninth data element, F8) is complete, at which point all of the data will have been flushed out of the pipeline. As an example, Stage 0 is idle at timeslot 15 after having processed F8 since it is waiting for Stage 1 to finish processing G3 at timeslot 15.
As shown in
Looking at timeslots 0-11, the fourth stage 404 (stage 3) has calculated the first three weight updates W0, W1, and W2 (for the first, second, and third data elements) while the first, second, and third stages 401, 402, 403 (stages 0-2) have only calculated two weight updates, W0 and W1, and have not yet calculated the third weight update W2. The third stage 403 (stage 2) will calculate W2 at timeslot 12, while the second stage 402 (stage 1) calculates W2 at timeslot 13. The first stage 401 (stage 0) will not calculate W2 until timeslot 14. Accordingly, if the weight updates are applied after the processing at timeslot 11 (411), then the updated weights stored at the first stage (Stage 0) and at the last stage (Stage 3) may be based on different data samples as follows:
WeightStage0[time 11]=W0+W1
WeightStage3[time 11]=W0+W1+W2
That is, the weight update applied to the last stage at “timeslot 11” (411) is based on W0, W1, and W2 while the weight update applied to the first stage at timeslot 11 is based on W0 and W1, but not W2 because the first stage has not yet processed W2 at timeslot 11 (411). However, the weight update W2 processed by the first stage may be used in later weight updates even if it is “stale” (e.g., it is based on data that was processed in forwards pass operations during the previous interval). That is, the weight updates applied after the processing at timeslot 20 (420) for certain stages may be based on different data samples as follows:
WeightStage0[time 20]=W2+W3+W4
WeightStage3[time 20]=W3+W4+W5
Thus, the last stage (Stage 3) uses W2 when applying the weight update at timeslot 11 while the other stages use W2 in the following weight update occurring after the interval. In
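The bookkeeping in this example can be made concrete with a short sketch that records, per stage, which weight update calculations have been folded into the synchronously applied weight at each synchronization point. The dictionary layout below is purely illustrative and simply reproduces the figures above:

```python
# Which weight update calculations each stage has folded in when the
# synchronous update is applied (reproducing the example above).
applied_updates = {
    11: {"stage0": ["W0", "W1"],          # W2 not yet calculated by stage 0
         "stage3": ["W0", "W1", "W2"]},   # last stage has already calculated W2
    20: {"stage0": ["W2", "W3", "W4"],    # "stale" W2 folded in one interval late
         "stage3": ["W3", "W4", "W5"]},
}
```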
In some embodiments, the training of the neural network may skip processing of weight update calculations by stages that would process it after the weight update has been applied (e.g., the processing highlighted in gray in
Given that the weight update has been applied at timeslot 11 (411), the new weights may be used in calculating the forward pass and the backward pass for new input data. For instance, the forward pass for the 9th data element (F8) at timeslot 12 will use the newly applied weights. In this example, only 9 data elements are being processed for simplicity of explanation and drawing. However, in a real implementation it is likely that more data elements will be processed. In such cases each new input will be processed using the newly applied weights. As such, a fewer number of calculated weight updates (2 weight updates W0 and W1 in this example) are applied to one or more earlier stages of the pipeline (401-403, in this example) and a greater number of calculated weight updates (3 weight updates: W0, W1, and W2 in this example) are applied to a later stage of the pipeline (i.e., the fourth stage 404) at each of the one or more predetermined intervals. Because of this, the weight update calculations (W2) for the third data element are not based on all of the data when calculated by the third stage 403 (stage 2) at timeslot 12, by the second stage 402 (stage 1) at timeslot 13, and by the first stage 401 (stage 0) at timeslot 14. This fact is denoted by gray highlighting in
The weight update calculation (W2) by the third stage 403 (stage 2) at timeslot 12, by the second stage 402 (stage 1) at timeslot 13, and by the first stage 401 (stage 0) at timeslot 14 may not be based on all of the data because it may be performed with “stale” data (e.g., weight data that was used in training the neural network before the weights were updated at timeslot 11) or it may not be updated. In the case where stale data is used, the training of the neural network may involve storing both the calculated weight updates and the stale data.
As shown in
Thus, the loss of data used when synchronously updating the weights at the specific intervals may be insignificant compared to the significant delays in the alternative mini-batch solution, which are caused by waiting for the data to be flushed out of the pipeline. That is, waiting for the data to be flushed out of the pipeline (e.g., when processing data in mini-batches) before applying the weight updates ensures that all of the data is used when applying the weight updates, but the loss of a few data samples may not significantly change the resulting weight update when a large number of data samples are being processed (e.g., 64, 512, 1024, etc.).
As shown in
This delay is significant compared to the processing shown in
Features and advantages of applying weight updates at specific intervals during training of an artificial intelligence model (e.g., neural network) include avoiding the delay and reduced performance caused by flushing the pipeline when the data is split into mini-batches in order to update the weights. This technique allows full pipeline efficiency with only a single ramp-up period and a single ramp-down (e.g., flush) period needed to process all of the data, compared to using mini-batches where each batch requires a ramp-up and a ramp-down period. Furthermore, this technique of applying weight updates at specific intervals improves upon the continuous update technique, as described above.
As discussed above, an updated weight based on a set of weight update calculations may be applied to a particular stage such that the updated weight may be used in later forward pass calculations. However, when implementing the efficient weight updates during steady state operation as discussed above with respect to
The first stage (Stage 0) may include backwards pass operations (B0) 611 including a gradient calculation (G0) 613, a weight update calculation (W0) 614, and a staleness checking operation 612 (“check stale” in
An example of applying weight updates based on stale data was given above with respect to
WeightStage0[time 20]=W2+W3+W4
WeightStage3[time 20]=W3+W4+W5
In this example, the weight calculation W2 of the first stage (stage 0) is based on stale data because the forwards pass calculation (F2) corresponding to the weight calculation (W2) was performed in timeslot 2 during the first interval (e.g., timeslots 0-11) while the weight calculation (W2) was performed by Stage 0 in timeslot 14, after the weight updates were applied at timeslot 11.
As discussed above, some weight calculations may be based on stale data. In other embodiments, however, weight calculations that would be based on stale data may not be performed at all.
In this embodiment, the first stage (e.g., Stage 0) stores a first weight (W0a) 701 for use in forwards pass calculations (F0) and in backwards pass operations. Unlike the embodiment discussed above with respect to
The first stage of the pipeline may include backwards pass operations (B0) 711 including a gradient calculation (G0) 713, a weight update calculation (W0) 714, and a staleness checking operation 712 (“check stale” in
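A sketch of how the staleness checking operation might gate the weight update calculation is given below. This is illustrative only and assumes that each data sample is tagged with the interval in which its forward pass calculation was performed, and that the current weight is used for error propagation; the class and method names are hypothetical.

```python
import numpy as np

class StaleCheckingStage:
    """Illustrative stage that skips weight update calculations for stale data."""

    def __init__(self, W, u=0.01):
        self.W = W
        self.pending = np.zeros_like(W)
        self.u = u
        self.current_interval = 0   # incremented whenever updates are applied

    def is_stale(self, forward_interval):
        # A gradient is stale if its forward pass ran in an earlier interval,
        # i.e., before the most recent synchronous weight update.
        return forward_interval < self.current_interval

    def backward(self, E, x, forward_interval):
        if not self.is_stale(forward_interval):
            # Weight update calculation (W) is performed only for fresh data.
            self.pending += self.u * np.outer(E, x)
        # The error is still propagated to the previous stage either way,
        # so the backward pass itself is not interrupted.
        return self.W.T @ E

    def apply_updates(self):
        self.W = self.W + self.pending
        self.pending = np.zeros_like(self.W)
        self.current_interval += 1
```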
In the example of the other embodiment discussed above with respect to
WeightStage0[time 20]=W2+W3+W4
WeightStage3[time 20]=W3+W4+W5
However, in this embodiment, the weight calculation W2 may not be performed if the staleness checking operation 712 determines that it would be based on stale data. Accordingly, the weight update applied to the first stage at timeslot 20 may not be based on W2 since it is based on stale data:
WeightStage0[time 20]=W3+W4
WeightStage3[time 20]=W3+W4+W5
Not performing the weight update calculation for stale data is advantageous because additional weights (e.g., W0b, W1b, W2b, W3b in
Bus subsystem 804 can provide a mechanism for letting the various components and subsystems of computer system 800 communicate with each other as intended. Although bus subsystem 804 is shown schematically as a single bus, alternative embodiments of the bus subsystem can utilize multiple busses.
Network interface subsystem 816 can serve as an interface for communicating data between computer system 800 and other computer systems or networks. Embodiments of network interface subsystem 816 can include, e.g., Ethernet, a Wi-Fi and/or cellular adapter, a modem (telephone, satellite, cable, ISDN, etc.), digital subscriber line (DSL) units, and/or the like.
Storage subsystem 806 includes a memory subsystem 808 and a file/disk storage subsystem 810. Subsystems 808 and 810 as well as other memories described herein are examples of non-transitory computer-readable storage media that can store executable program code and/or data that provide the functionality of embodiments of the present disclosure.
Memory subsystem 808 includes a number of memories including a main random access memory (RAM) 818 for storage of instructions and data during program execution and a read-only memory (ROM) 820 in which fixed instructions are stored. File storage subsystem 810 can provide persistent (e.g., non-volatile) storage for program and data files, and can include a magnetic or solid-state hard disk drive, an optical drive along with associated removable media (e.g., CD-ROM, DVD, Blu-Ray, etc.), a removable flash memory-based drive or card, and/or other types of storage media known in the art.
In this example environment, one or more servers 902, which may comprise architectures illustrated in
In various embodiments, the present disclosure includes systems, methods, and apparatuses for neural network training.
One embodiment provides a computer system including one or more processors and a non-transitory computer readable storage medium coupled to the one or more processors. The storage medium has stored thereon program code executable by the one or more processors to train, using a plurality of data samples, an artificial intelligence model in a plurality of stages forming a pipeline. The pipeline includes a first stage and a last stage. Each of the plurality of stages of the pipeline performs a forward pass calculation based on a particular weight, a backward pass calculation based on the particular weight, and a weight update calculation based on the particular weight for each of the plurality of data samples. The program code is further executable to apply updated weights to the plurality of stages at one or more predetermined intervals during a steady state operation of the training of the artificial intelligence model. The updated weights are applied such that a weight update calculation of the last stage has been performed based on a particular data sample and a weight update calculation of the first stage has not been performed based on the particular data sample.
Another embodiment provides a method of processing an artificial intelligence model. The method includes training, using a plurality of data samples, the artificial intelligence model in a plurality of stages forming a pipeline. The pipeline includes a first stage and a last stage. Each of the plurality of stages of the pipeline performs a forward pass calculation based on a particular weight, a backward pass calculation based on the particular weight, and a weight update calculation based on the particular weight for each of the plurality of data samples. The method further includes applying updated weights to the plurality of stages at one or more predetermined intervals during a steady state operation of the training of the artificial intelligence model. The updated weights are applied such that a weight update calculation of the last stage has been performed based on a particular data sample and a weight update calculation of the first stage has not been performed based on the particular data sample.
Another embodiment provides a non-transitory computer readable storage medium having stored thereon program code executable by a computer system. The program code may cause the computer system to train, using a plurality of data samples, an artificial intelligence model in a plurality of stages forming a pipeline. The pipeline includes a first stage and a last stage. Each of the plurality of stages of the pipeline performs a forward pass calculation based on a particular weight, a backward pass calculation based on the particular weight, and a weight update calculation based on the particular weight for each of the plurality of data samples. The program code may further cause the computer system to apply updated weights to the plurality of stages at one or more predetermined intervals during a steady state operation of the training of the artificial intelligence model. The weight updates are applied such that a weight update calculation of the last stage has been performed based on a particular data sample and a weight update calculation of the first stage has not been performed based on the particular data sample.
One embodiment provides a computer system including one or more processors and a non-transitory computer readable storage medium coupled to the one or more processors. The storage medium has stored thereon program code executable by the one or more processors to train, using a plurality of data samples, an artificial intelligence model in a plurality of stages forming a pipeline. The pipeline includes a first stage and a last stage. Each of the plurality of stages of the pipeline performs a forward pass calculation based on a particular weight, a backward pass calculation based on the particular weight, and a weight update calculation based on the particular weight for each of the plurality of data samples. The program code is further executable to apply updated weights to the plurality of stages at one or more predetermined intervals during a steady state operation of the training of the artificial intelligence model. The updated weights are applied such that a weight update calculation of the last stage has been performed based on a particular data sample and a weight update calculation of the first stage has not been performed based on the particular data sample.
In some embodiments of the computer system, an updated weight applied to the last stage is based on the particular data sample and an updated weight applied to the first stage is not based on the particular data sample.
In some embodiments of the computer system, a second interval occurs after a first interval, and an updated weight applied to the first stage at a second interval is based on a first data sample used in a forward pass calculation during the first interval.
In some embodiments of the computer system, the weight update calculation of the first stage is not performed based on the particular data sample during the training of the artificial intelligence model.
In some embodiments of the computer system, the program code is further executable by the one or more processors to store, at the first stage, the updated weight and a previous weight used in calculating the updated weight.
In some embodiments of the computer system, the pipeline is not flushed out prior to the applying of the updated weights to the plurality of stages.
In some embodiments of the computer system, each backward pass calculation at a particular stage of the pipeline is based on a same weight as a corresponding forward pass calculation when the corresponding forward pass calculation is performed before an updated weight is applied to the particular stage and the backward pass calculation is performed after the updated weight is applied to the particular stage.
In some embodiments of the computer system, the training of the artificial intelligence model is completed while only flushing data from the pipeline once.
In some embodiments of the computer system, the artificial intelligence model is an artificial neural network.
Another embodiment provides a method of processing an artificial intelligence model. The method includes training, using a plurality of data samples, the artificial intelligence model in a plurality of stages forming a pipeline. The pipeline includes a first stage and a last stage. Each of the plurality of stages of the pipeline performs a forward pass calculation based on a particular weight, a backward pass calculation based on the particular weight, and a weight update calculation based on the particular weight for each of the plurality of data samples. The method further includes applying updated weights to the plurality of stages at one or more predetermined intervals during a steady state operation of the training of the artificial intelligence model. The updated weights are applied such that a weight update calculation of the last stage has been performed based on a particular data sample and a weight update calculation of the first stage has not been performed based on the particular data sample.
In some embodiments of the method, an updated weight applied to the last stage is based on the particular data sample and an updated weight applied to the first stage is not based on the particular data sample.
In some embodiments of the method, a second interval occurs after a first interval, and an updated weight applied to the first stage at a second interval is based on a first data sample used in a forward pass calculation during the first interval.
In some embodiments of the method, the weight update calculation of the first stage is not performed based on the particular data sample during the training of the artificial intelligence model.
In some embodiments of the method, the method further includes storing, at the first stage, the updated weight and a previous weight used in calculating the updated weight.
In some embodiments of the method, the pipeline is not flushed out prior to the applying of the updated weights to the plurality of stages.
In some embodiments of the method, each backward pass calculation at a particular stage of the pipeline is based on a same weight as a corresponding forward pass calculation when the corresponding forward pass calculation is performed before an updated weight is applied to the particular stage and the backward pass calculation is performed after the updated weight is applied to the particular stage.
In some embodiments of the method, the training of the artificial intelligence model is completed while only flushing data from the pipeline once.
In some embodiments of the method, the artificial intelligence model is an artificial neural network.
Another embodiment provides a non-transitory computer readable storage medium having stored thereon program code executable by a computer system. The program code may cause the computer system to train, using a plurality of data samples, an artificial intelligence model in a plurality of stages forming a pipeline. The pipeline includes a first stage and a last stage. Each of the plurality of stages of the pipeline performs a forward pass calculation based on a particular weight, a backward pass calculation based on the particular weight, and a weight update calculation based on the particular weight for each of the plurality of data samples. The program code may further cause the computer system to apply updated weights to the plurality of stages at one or more predetermined intervals during a steady state operation of the training of the artificial intelligence model. The weight updates are applied such that a weight update calculation of the last stage has been performed based on a particular data sample and a weight update calculation of the first stage has not been performed based on the particular data sample.
In some embodiments of the storage medium, an updated weight applied to the last stage is based on the particular data sample and an updated weight applied to the first stage is not based on the particular data sample.
In some embodiments of the storage medium, a second interval occurs after a first interval, and an updated weight applied to the first stage at a second interval is based on a first data sample used in a forward pass calculation during the first interval.
In some embodiments of the storage medium, the weight update calculation of the first stage is not performed based on the particular data sample during the training of the artificial intelligence model.
In some embodiments of the storage medium, the program code further causes the computer system to store, at the first stage, the updated weight and a previous weight used in calculating the updated weight.
In some embodiments of the storage medium, the pipeline is not flushed out prior to the applying of the updated weights to the plurality of stages.
In some embodiments of the storage medium, each backward pass calculation at a particular stage of the pipeline is based on a same weight as a corresponding forward pass calculation when the corresponding forward pass calculation is performed before an updated weight is applied to the particular stage and the backward pass calculation is performed after the updated weight is applied to the particular stage.
In some embodiments of the storage medium, the training of the artificial intelligence model is completed while only flushing data from the pipeline once.
In some embodiments of the storage medium, the artificial intelligence model is an artificial neural network.
The above description illustrates various embodiments of the present disclosure along with examples of how aspects of the particular embodiments may be implemented. The above examples should not be deemed to be the only embodiments, and are presented to illustrate the flexibility and advantages of the particular embodiments as defined by the following claims. Based on the above disclosure and the following claims, other arrangements, embodiments, implementations, and equivalents may be employed without departing from the scope of the present disclosure as defined by the claims.