Artificial intelligence (AI), or machine learning, utilizes learning networks (e.g. deep neural networks) loosely inspired by the brain in order to solve problems. Learning networks typically include layers of weights that weight signals (mimicking synapses) interleaved with activation layers that apply activation functions to the signals (mimicking neurons). Thus, a weight layer provides weighted input signals to an activation layer. Neurons in the activation layer operate on the weighted input signals by applying some activation function to the input signals and provide output signals corresponding to the statuses of the neurons. The output signals from the activation layer are provided as input signals to the next weight layer, if any. This process may be repeated for the layers of the network. Learning networks are thus able to reduce complex problems to a set of weights and the applied activation functions. The structure of the network (e.g., number of layers, connectivity among the layers, dimensionality of the layers, the type of activation function, etc.) is known as a model. Learning networks can leverage hardware, such as graphics processing units (GPUs) and/or AI accelerators, which perform operations usable in machine learning in parallel. Such tools can dramatically improve the speed and efficiency with which data-heavy and other tasks can be accomplished by the learning network.
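For illustration only, the interleaved weight and activation layers described above may be sketched in software as follows. The layer sizes, the ReLU activation, and the use of NumPy are arbitrary choices for this sketch rather than features of any particular learning network:

```python
import numpy as np

def relu(x):
    # Activation layer: apply a nonlinearity to the weighted input signals.
    return np.maximum(x, 0.0)

def forward(x, weight_layers):
    # Each weight layer weights the signals (mimicking synapses); the activation
    # layer operates on the weighted inputs (mimicking neurons) and feeds the
    # next weight layer, if any.
    for W in weight_layers:
        x = relu(W @ x)
    return x

rng = np.random.default_rng(0)
weights = [rng.standard_normal((16, 8)), rng.standard_normal((4, 16))]  # arbitrary layer sizes
output = forward(rng.standard_normal(8), weights)
```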
In order to be used in data-heavy tasks and/or other applications, the learning network is trained prior to its use in an application. Training involves optimizing a configuration of the high-dimensional and nonlinear set of weights. In other words, the weights in each layer are determined, thereby identifying the parameters of a model. Supervised training may include evaluating the final output signals of the last layer of the learning network based on a set of target outputs (e.g., the desired output signals) for a given set of input signals and adjusting the weights in one or more layers to improve the correlation between the output signals for the learning network and the target outputs. Once the correlation is sufficiently high, training may be considered complete. The model can then be deployed for use. Deploying the model may include copying the weights into a memory (or other storage) of the device on which the model is desired to be used. For example, the weights may be copied into the AI accelerator or storage for the GPU.
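As a purely illustrative sketch of supervised training, a single linear weight layer might be adjusted toward a set of target outputs using gradient descent on a squared error; the specific loss and update rule here are assumptions of the sketch, not a description of any particular training procedure:

```python
import numpy as np

def train_step(weights, x, target, lr=0.01):
    # Compare the layer output to the target output and adjust the weights
    # to improve the agreement between the two.
    output = weights @ x
    error = output - target              # deviation from the target output
    grad = np.outer(error, x)            # gradient of 0.5 * ||error||^2 with respect to weights
    return weights - lr * grad

rng = np.random.default_rng(3)
W = rng.standard_normal((4, 8))
for _ in range(100):                     # repeat until agreement is sufficiently high
    x = rng.standard_normal(8)
    W = train_step(W, x, target=np.zeros(4))
```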
Although training can result in a learning network capable of solving challenging problems, determining solutions even with an optimized model may be time-consuming. Use of an AI accelerator may reduce the time required for the machine learning model to provide a solution. However, further improvements are desired. For example, for models having a large number of parameters (e.g. a large number of weights and/or large input data sets) operations such as convolutions may take a long time to complete. A convolution is a linear operation on a tensor, such as one or more vector-matrix multiplications. As a result, latencies and/or other inefficiencies may be introduced. It may be theoretically possible to perform some portions of an operation such as a convolution in parallel. However, the limited memory and other issues may make parallelism challenging. For example, performing a convolution in parallel may use multiple copies of the weights in order to perform the same vector-matrix multiplication in parallel. However, this may exhaust the available memory. Accordingly, what is desired is an improved technique for training and/or using learning networks.
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
A method is described. The method includes profiling a learning network. The learning network includes compute tile(s) and a model. A compute tile includes compute engines and a general-purpose (GP) processor. Each compute engine includes a compute-in-memory (CIM) hardware module. The model includes convolutions and activation functions. The convolutions correspond to the compute engines. The method also includes determining, based on the profiling, a reschedule operation for a convolution of the convolutions. The reschedule operation provides multiple tensors based on an input tensor for the convolution. The tensors are configured to undergo at least a portion of the convolution in accordance with a temporal distribution. The method also includes performing, on the learning network and using the reschedule operation, a forward pass for input data.
In some embodiments, profiling includes determining at least one of CIM memory capacity for each compute engine, a tile memory capacity for the compute tile, a number of the plurality of compute engines used for the convolution, or a data dependency between the convolution and a subsequent convolution and/or a subsequent activation function.
In some embodiments, performing the forward pass includes performing at least the portion of the convolution on each of the tensors. Each of the tensors may have a unique start time. For example, the at least the portion of the convolution may be performed serially on the tensors. In another example, the at least the portion of the convolution may be performed on the tensors at least partially in parallel. Performing the forward pass may further include initiating a subsequent operation on at least a portion of a resultant of the convolution such that the convolution and the subsequent operation are performed at least partially in parallel. In some embodiments, the convolution and the subsequent operation utilize different compute engines such that the different compute engines operate in parallel.
In some embodiments, the reschedule operation further includes writing the input tensor to a buffer at a first speed and loading the plurality of tensors from the buffer at a second speed different from the first speed. The input tensor may include input values in a first dimension and a second dimension. In some such embodiments, each of the tensors includes rescheduled values in the first dimension and the second dimension. The rescheduled values of one of the tensors overlap with the rescheduled values of another of the tensors in the first and/or second dimension. In some embodiments, multiple reschedule operations are determined based on the profiling. Each of the reschedule operations provides tensors based on the corresponding input tensors.
A compute tile is described. The compute tile includes compute engines and a GP processor coupled with the compute engines. Each compute engine includes a CIM hardware module configured to store weights and to perform at least a portion of a convolution. The GP processor is configured to provide control instructions and data to the compute engines. The compute tile is configured to implement a model including convolutions and activation functions in a forward pass. The convolutions correspond to the compute engines. The model further includes a reschedule operation for a convolution of the convolutions. The reschedule operation is based on a profile of the model and the compute tile. The profile may include a CIM memory capacity for the compute engines, a tile memory capacity for the compute tile, a number of the compute engines used for the convolution, and/or a data dependency between the convolution and a subsequent convolution and/or a subsequent activation function. The compute tile is configured to perform the reschedule operation in the forward pass. The reschedule operation provides tensors based on an input tensor for the convolution. The tensors are configured to undergo at least a portion of the convolution in accordance with a temporal distribution.
In some embodiments, the compute tile is configured to perform the portion of the convolution on each of the plurality of tensors at a unique start time. The compute tile may thus be configured to perform the at least the portion of the convolution on the tensors at least partially in parallel. The compute tile may be configured to initiate a subsequent operation to the convolution on at least a portion of a resultant of the convolution such that the convolution and the subsequent operation are performed at least partially in parallel. In some embodiments, the convolution and the subsequent operation utilize different compute engines such that the different compute engines operate in parallel.
In some embodiments, the compute tile includes a buffer. In such embodiments, to perform the reschedule operation, the compute tile may be configured to write the input tensor to the buffer at a first speed and load the tensors from the buffer at a second speed different from the first speed.
In some embodiments, the input tensor includes input values in a first dimension and a second dimension. Each of the tensors includes rescheduled values in the first dimension and the second dimension. The rescheduled values of one of the tensors may overlap with the rescheduled values of another of the tensors in the first and/or second dimension. In some embodiments, the compute tile is configured to perform multiple reschedule operations for at least a portion of the convolutions. Each of the reschedule operations provides multiple tensors based on each input tensor for the portion of the convolutions. In some embodiments, the compute tile is one of multiple compute tiles. In some such embodiments, each of the compute tiles may be separately profiled and reschedule operation(s) determined for each compute tile based on the profile of that compute tile.
A computer program product, embodied in a non-transitory computer readable medium, is described. The computer program product includes computer instructions for: profiling a learning network. The learning network includes compute tile(s) and a model. A compute tile includes compute engines and a GP processor. Each compute engine includes a CIM hardware module. The model includes convolutions and activation functions. The convolutions correspond to the compute engines. The computer program product also includes computer instructions for determining, based on the profiling, a reschedule operation for a convolution of the convolutions. The reschedule operation provides tensors based on an input tensor for the convolution. The tensors are configured to undergo at least a portion of the convolution in accordance with a temporal distribution. The learning network performs the reschedule operation as part of performing a forward pass.
GP processor 110 is a reduced instruction set computer (RISC) processor. For example, GP processor 110 may be a RISC-V processor or ARM processor. In other embodiments, different and/or additional general purpose processor(s) may be used. The GP processor 110 provides control instructions and data to the compute engines 120. GP processor 110 implements instruction set(s) used in controlling compute engines 120. GP processor 110 provides the commands to compute engines 120 and controls data movement to and/or from compute engines 120. GP processor 110 may thus function as part of the control plane (i.e. providing commands) and the data path for compute engines 120 and tile 100.
In some embodiments, data is moved from memory 130 or another source to compute engine(s) 120 through GP processor 110. Data may be sent from memory 130 to internal memory of GP processor 110, and then to the appropriate compute engine(s) 120 via buses 140 and 150. For example, data from memory 130 may be provided to a vector register file (not shown) of GP processor 110 and then provided from GP processor 110 to the appropriate compute engine(s) 120. Once compute engines 120 have performed their functions, the output is provided to GP processor 110. Similarly, data may be moved from compute engines 120 to memory 130 or another destination via GP processor 110. Thus, GP processor 110 may be part of both the control plane and data plane for compute tile 100.
GP processor 110 may also perform other functions. GP processor 110 may apply activation function(s) to data. For example, an activation function (e.g. a ReLU, Tanh, and/or SoftMax) may be applied to the output of compute engine(s) 120. Thus, GP processor 110 may perform nonlinear operations. GP processor 110 may also perform linear functions and/or other operations. However, GP processor 110 is still desired to have reduced functionality as compared to, for example, a graphics processing unit (GPU) or central processing unit (CPU) of a computer system with which tile 100 might be used.
Compute engines 120 are configured to perform, efficiently and in parallel, tasks that may be part of using (e.g. performing inferences) and/or training (e.g. performing inferences and/or updating weights) a model. Compute engines 120 are coupled with and receive commands and, in at least some embodiments, data from GP processor 110. Compute engines 120 are modules which perform vector-matrix multiplications (VMMs) in parallel. Thus, compute engines 120 may perform linear operations. Each compute engine 120 includes a compute-in-memory (CIM) hardware module (not specifically shown).
The CIM module is a hardware module that stores data and performs operations. In some embodiments, CIM module stores weights for the model. As such, the CIM module determines the maximum size of the model that can be handled by compute tile 100 (i.e. the maximum number of parameters, or weights). The CIM module stores the weights (or other data) in cells that are fully addressable. The CIM module also performs operations using the weights. More specifically, the CIM module performs VMMs, where the vector may be an input vector (e.g. an activation) provided using GP processor 110 and the matrix may be weights (i.e. data/parameters) stored by the CIM module. The CIM module may be considered to include a memory (e.g. that stores the weights) and compute hardware (e.g. that performs the vector-matrix multiplication of the stored weights). In some embodiments, the vector may be a matrix. The CIM module may include an analog SRAM having multiple SRAM cells and configured to provide output(s) (e.g. voltage(s)) corresponding to the data (weight/parameter) stored in each cell of the SRAM multiplied by a corresponding element of the input vector. In some embodiments, the CIM module may include a digital SRAM having multiple SRAM cells and configured to provide output(s) corresponding to the data (weight/parameter) stored in each cell of the digital SRAM multiplied by a corresponding element of the input vector. Other configurations of CIM modules are possible. Each CIM module thus stores weights corresponding to a matrix in its cells and is configured to perform a vector-matrix multiplication of the matrix with an input vector. In some embodiments, the CIM module of a compute engine 120 may be repurposed as memory if the compute engine utilization falls below a particular threshold (e.g. 70%-80%). For example, the CIM might store duplicate weights or vectors (e.g. activations) in such embodiments.
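A purely behavioral (software-level) stand-in for such a CIM module, storing a weight matrix in addressable cells and multiplying it by an input vector, might be sketched as follows. This sketch models only the function of the module, not the analog or digital SRAM circuitry:

```python
import numpy as np

class CIMModule:
    """Functional stand-in for a CIM hardware module: fully addressable cells
    hold the weights, and the module multiplies the stored matrix by an input."""

    def __init__(self, rows, cols):
        self.cells = np.zeros((rows, cols))   # weight/parameter storage

    def write_cell(self, row, col, value):
        # Cells are fully addressable, so individual weights may be written.
        self.cells[row, col] = value

    def vmm(self, input_vector):
        # Vector-matrix multiplication of the stored weights with the input vector.
        return self.cells @ input_vector

cim = CIMModule(rows=4, cols=3)
cim.write_cell(0, 0, 0.5)
output = cim.vmm(np.array([1.0, 2.0, 3.0]))
```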
In order to facilitate on-chip learning, local update (LU) modules (not shown) may also be provided in compute engines 120. LU modules are coupled with the corresponding CIM modules. LU modules are used to update the weights (or other data) stored in the CIM modules. LU modules are considered local because LU modules are in proximity to CIM modules. For example, LU module(s) for a particular compute engine 120 may reside in the same integrated circuit as the CIM module(s) for compute engine 120. In some embodiments, the LU module is considered local because it is fabricated on the same substrate (e.g. the same silicon wafer) as the corresponding CIM module. In some embodiments, LU modules are also used in determining the weight updates. In other embodiments, a separate component may calculate the weight updates. For example, in addition to or in lieu of LU modules, the weight updates may be determined by GP processor 110, in software by other processor(s) not part of compute tile 100, by other hardware that is part of compute tile 100, by other hardware outside of compute tile 100, and/or some combination thereof.
Memory 130 may be or include a static random access memory (SRAM) and/or some other type of memory. Memory 130 is shown as coupled with GP processor 110. Stated differently, data movement between memory 130 and compute engines 120 may take place via GP processor 110. In some embodiments, memory 130 may be coupled to compute bus 140 (i.e. to compute engines 120). Memory 130 may store activations (e.g. input vectors provided to compute tile 100 and the resultant of activation functions applied to the output of compute engines 120). Memory 130 may also store weights. For example, memory 130 may contain a backup copy of the weights or different weights if the weights stored in compute engines 120 are desired to be changed. In some embodiments, memory 130 is organized into banks of cells (e.g. banks of SRAM cells). In such embodiments, specific banks of memory 130 may service specific one(s) of compute engines 120. In other embodiments, banks of memory 130 may service any compute engine 120.
In operation, an input vector is provided to one or more of compute engines 120 by GP processor 110. The input vector is desired to be multiplied by the weights, which may have been previously stored in compute engine(s) 120. An input vector may be provided to multiple compute engines 120 if the weight matrix and/or input vector have too many elements for a single compute engine. In some such embodiments, a portion of the input vector is provided to each of the multiple compute engines 120 (each of which stores a portion of the weights). In some embodiments, the input vector is provided from memory 130 to GP processor 110 and from GP processor 110 to compute engine(s) 120. GP processor 110 also instructs compute engine(s) 120 to perform a VMM. Compute engine(s) 120 perform a VMM between the input vector and the matrix of weights to provide an output. The VMM is performed in parallel for the elements of the input vector. The output of compute engine(s) 120 may be considered an output vector. The output is provided by compute engine(s) 120 to GP processor 110. For example, the output may be stored in a vector register file of GP processor 110. GP processor 110 may also store the output (e.g. in memory 130) and/or may provide the output to another component off-tile. GP processor 110 may apply a function (e.g. an activation function) to the output. The results of the activation function applied to the output of compute engines 120 may be stored in GP processor 110 (e.g. in a buffer or the vector register file). GP processor 110 may also store the results in memory 130 or off-tile. GP processor 110 may provide the results as an input vector to other compute engine(s) 120, which apply a different set of weights (stored in those other compute engine(s) 120) to the results. Thus, one or more inferences with one or more distinct sets of weights may be performed. In some embodiments, training may also be performed by tile 100. In some such embodiments, GP processor 110 or another component (such as a host) may determine the desired update for the weights. In some embodiments, LU modules (not shown) of compute engines 120 may be used to determine and apply the updates to the weights.
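For illustration, the case in which a weight matrix is too large for a single compute engine may be sketched as follows, with each engine holding a slice of the weights and receiving the matching portion of the input vector. The column-wise partitioning and the summation of partial results are assumptions of this sketch rather than requirements of compute tile 100:

```python
import numpy as np

def split_vmm(W, x, num_engines):
    # One large VMM partitioned across several compute engines: each engine
    # stores a slice of the weights, receives the matching slice of the input
    # vector, and the partial outputs are accumulated afterward.
    col_groups = np.array_split(np.arange(W.shape[1]), num_engines)
    partials = [W[:, cols] @ x[cols] for cols in col_groups]   # one VMM per engine
    return np.sum(partials, axis=0)                            # combine partial results

rng = np.random.default_rng(1)
W = rng.standard_normal((32, 64))    # weight matrix assumed too large for one engine
x = rng.standard_normal(64)          # input vector (activation)
assert np.allclose(split_vmm(W, x, num_engines=4), W @ x)
```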
Thus, compute tile 100 includes two compute blocks, GP processor 110 and compute engines 120, which work together. GP processor 110 may perform nonlinear operations (e.g. activation functions) and compute engines 120 may perform linear operations (e.g. VMMs). GP processor 110 is in the control and data planes for compute engines 120. GP processor 110 and compute engines 120 are, therefore, tightly coupled. Consequently, data may be moved more efficiently within tile 100. Operations, such as VMMs and the application of activation functions to the output of compute engines 120, may be more efficiently performed. Further, a special purpose controller need not be designed and fabricated for compute tile 100. Instead, GP processor 110 is used. As a result, compute tile 100 may be more flexible and more readily designed and fabricated. For example, the activation applied by GP processor 110 may be updated by updating GP processor 110. A new special purpose controller need not be provided. Consequently, functions for machine learning may be more efficiently and readily performed. In addition, compute tile 100 includes on-tile memory 130. Use of on-tile memory, for example as a scratchpad memory, allows for a high degree of independence of compute tile 100 from other components (e.g. other tiles). Thus, multiple tiles 100 may more readily work in parallel. Consequently, efficiency of learning may be enhanced.
GP processor 210 is analogous to GP processor 110. Thus, GP processor 210 may be a RISC-V processor or ARM processor. In other embodiments, different and/or additional general purpose processor(s) may be used. The GP processor 210 provides control instructions and manages data flow for the compute engines 220. Data sent to or from compute engines 220 may also pass through GP processor 210. Thus, GP processor 210 may be part of both the control plane and data plane for compute tile 200. GP processor 210 may also perform other functions, including nonlinear functions. For example, GP processor 210 may apply activation function(s) to data. In some embodiments, GP processor 210 may include a vector processing unit (not shown) that executes nonlinear operations (e.g. applying activation functions to data). Also explicitly shown as part of GP processor 210 are local memories 212 and 214. In some embodiments, local memory 212 stores instructions while local memory 214 stores data.
Compute engines 220 are analogous to compute engines 120. Compute engines 220 are configured to perform, efficiently and in parallel, tasks that may be part of using and/or training a model. Compute engines 220 are coupled with and receive commands and, in at least some embodiments, data from GP processor 210. Compute engines 220 perform linear operations such as VMMs in parallel. Each compute engine 220 includes a CIM hardware module (not specifically shown).
Bus 250 couples GP processor 210 with compute bus 240 and, therefore, with compute engines 220. Bus 250 includes control bus 252, streaming bus 254, and status bus 256. Control bus 252, streaming bus 254, and status bus 256 are coupled with a control port (not explicitly labeled), a streaming port (not explicitly labeled), and a status port (not explicitly labeled), respectively, of GP processor 210. Control bus 252 receives instructions for compute engines 220 from GP processor 210. Compute engines 220 perform operations based on the instructions. For example, the instructions may include a load instruction to load data from GP processor 210 to identified compute engine(s) 220, a store instruction to store data from identified compute engine(s) 220 to GP processor 210, and supporting instructions that identify the addresses in identified compute engine(s) 220 to which data is to be loaded and from which data is to be read. Streaming bus 254 may be a high speed, high bandwidth bus. In some embodiments, streaming bus 254 is 512 bits wide. Other bus widths are possible. Streaming bus 254 is used to rapidly move data between GP processor 210 and compute engines 220. Status bus 256 may allow for reading from or writing to a status register for a compute engine 220. Thus, GP processor 210 may be informed of the particular compute engine 220 completing a task, such as a VMM.
Compute tile 200 also includes DMA 270 and mesh stop 280. DMA 270 initiates data movement for compute tile 200. DMA 270 may be used to move data from off-tile to on-tile and vice-versa. Thus, DMA 270 may be used to communicate with a host (not shown) and/or other tiles (not shown).
Compute tile 200 functions in an analogous manner to compute tile 100. For example, data may be transferred on-tile from a host or other tile via DMA 270 and/or mesh stop 280. Such data may be stored in memory 230. Thus, memory 230 may store weights and input vectors. The weights may be loaded in one or more compute engines 220 for use. For example, the weights may be moved from memory 230 to the CIM hardware module(s) of compute engine(s) 220 via GP processor 210. For an inference, an input vector is provided to one or more of compute engines 220 by GP processor 210. To do so, the input vector/activation may be moved from memory 230 to GP processor 210 and from GP processor 210 to compute engine(s) 220 via streaming bus 254. Compute engine(s) 220 perform a VMM in parallel of the elements of the input vector and the matrix (or matrices) of weights stored in compute engine(s) 220. The output of compute engine(s) 220 may be provided from compute engine(s) 220 to GP processor 210 via streaming bus 254. GP processor 210 may apply a function (e.g. an activation function) to the output. The resultant of the activation function applied to the output of compute engines 220 may be stored in GP processor 210 (e.g. a buffer, which is not explicitly shown).
Compute tile 200 may share the benefits of compute tile 100. GP processor 210 and compute engines 220 are compute blocks which work closely together. For example, the data and control planes for compute tile 200 may include memory 230, GP processor 210, buses 240 and 250, and compute engines 220. Consequently, data may be moved more efficiently within tile 200 and operations, such as VMMs and the application of activation functions, may be more efficiently performed. Further, a special purpose controller need not be designed and fabricated for compute tile 200. As a result, compute tile 200 may be more flexible and more readily designed and fabricated. Consequently, functions for machine learning may be more efficiently and readily performed. In addition, on-tile memory 230 allows for a high degree of independence of compute tile 200 from other components (e.g. other tiles). Thus, multiple tiles 200 may more readily work in parallel and efficiency may be improved.
GP processor 310 is analogous to GP processors 110 and/or 210. Thus, GP processor 310 may be a RISC-V processor or ARM processor. In other embodiments, different and/or additional general purpose processor(s) may be used. The GP processor 310 provides control instructions and manages dataflow for the compute engines 320. Data sent to or from compute engines 320 may also pass through GP processor 310. Thus, GP processor 310 may be part of both the control plane and data plane for compute tile 300. GP processor 310 may also perform other functions, including nonlinear functions. For example, GP processor 310 may apply activation function(s) to data. In some embodiments, GP processor 310 may include a vector processing unit (not shown) that executes nonlinear operations (e.g. applying activation functions to data).
In addition, GP processor 310 includes an additional fixed function compute block (FFCB) 316. In some embodiments, FFCB 316 is a single instruction multiple data arithmetic logic unit (SIMD ALU). In some embodiments, FFCB 316 may be configured in another manner. FFCB 316 may be a close-coupled fixed-function unit for on-device inference and training of learning networks. In some embodiments, FFCB 316 executes nonlinear operations, number format conversion and/or dynamic scaling. In some embodiments, other and/or additional operations may be performed by FFCB 316. FFCB 316 may be coupled with the data path for the vector processing unit of GP processor 310.
Compute engines 320 are analogous to compute engines 120 and/or 220. Compute engines 320 are configured to perform, efficiently and in parallel, tasks that may be part of using and/or training a model. Compute engines 320 are coupled with and receive commands and, in at least some embodiments, data from GP processor 310. Compute engines 320 perform linear operations such as VMMs in parallel. Each compute engine 320 includes a CIM hardware module (not specifically shown).
Compute tile 300 is also depicted as including buffer 390. In some embodiments, buffer 390 is a memory used in reschedule operations (described further below).
CIM module 430 is a hardware module that stores data and performs operations. In some embodiments, CIM module 430 stores weights for the model. CIM module 430 also performs operations using the weights. More specifically, CIM module 430 performs vector-matrix multiplications, where the vector may be an input vector provided using processor 110 and the matrix may be weights (i.e. data/parameters) stored by CIM module 430. Thus, CIM module 430 may be considered to include a memory (e.g. that stores the weights) and compute hardware (e.g. that performs the vector-matrix multiplication of the stored weights). In some embodiments, the vector may be a matrix (i.e. an n×m vector where n>1 and m>1). For example, CIM module 430 may include an analog static random access memory (SRAM) having multiple SRAM cells and configured to provide output(s) (e.g. voltage(s)) corresponding to the data (weight/parameter) stored in each cell of the SRAM multiplied by a corresponding element of the input vector. In some embodiments, CIM module 430 may include a digital SRAM having multiple SRAM cells and configured to provide output(s) corresponding to the data (weight/parameter) stored in each cell of the digital SRAM multiplied by a corresponding element of the input vector. In some embodiments, CIM module 430 may include an analog resistive random access memory (RAM) configured to provide output (e.g. voltage(s)) corresponding to the impedance of each cell multiplied by the corresponding element of the input vector. Other configurations of CIM module 430 are possible. Each CIM module 430 thus stores weights corresponding to a matrix in its cells and is configured to perform a vector-matrix multiplication of the matrix with an input vector.
In order to facilitate on-chip learning, LU module 440 may be provided. LU module 440 is coupled with the corresponding CIM module 430. LU module 440 is used to update the weights (or other data) stored in CIM module 430. LU module 440 is considered local because LU module 440 is in proximity with CIM module 430. For example, LU module 440 may reside on the same integrated circuit as CIM module 430. In some embodiments LU module 440 for a particular compute engine resides in the same integrated circuit as the CIM module 430. In some embodiments, LU module 440 is considered local because it is fabricated on the same substrate (e.g. the same silicon wafer) as the corresponding CIM module 430. In some embodiments, LU module 440 is also used in determining the weight updates. In other embodiments, a separate component may calculate the weight updates. For example, in addition to or in lieu of LU module 440, the weight updates may be determined by a GP processor, in software by other processor(s) not part of compute engine 400 and/or the corresponding AI accelerator (e.g. compute tile 100, 200, or 300), by other hardware that is part of compute engine 400 and/or the corresponding AI accelerator (e.g. compute tile 100, 200, or 300), by other hardware outside of compute engine 400 or the corresponding AI accelerator (e.g. compute tile 100, 200, or 300), and/or some combination thereof.
Using compute engine 400 in the context of compute tiles 100, 200, or 300 and/or an analogous system, efficiency and performance of a learning network may be improved. Use of CIM modules 430 may dramatically reduce the time to perform the vector-matrix multiplication that provides the weighted signal. Thus, performing inference(s) using compute engine 400 may require less time and power. This may improve efficiency of training and use of the model. LU modules 440 allow for local updates to the weights in CIM modules 430. This may reduce the data movement that may otherwise be required for weight updates. Consequently, the time taken for training may be greatly reduced. In some embodiments, the time taken for a weight update using LU modules 440 may be an order of magnitude less (i.e. require one-tenth the time) than if updates are not performed locally. Efficiency and performance of a learning network provided using system 100 may be increased.
CIM module 530 is a hardware module that stores data corresponding to weights and performs vector-matrix multiplications. The vector is an input vector provided to CIM module 530 (e.g. via input cache 550) and the matrix includes the weights stored by CIM module 530. In some embodiments, the vector may be a matrix. Examples of embodiments of CIM modules that may be used for CIM module 530 are described below.
In operation, voltages of capacitors 620 and 622 are set to zero, for example via Reset provided to transistor 616. DAC 502 provides the differential voltages on lines 602 and 604, and the address decoder selects the cells that are to undergo the vector-matrix multiplication.
In operation, a row including digital SRAM cell 710 is enabled by address decoder 570.
Referring back to compute engine 500, the remaining components are described below.
Input cache 550 receives an input vector for which a vector-matrix multiplication is desired to be performed. In some embodiments, the input vector is provided to input cache by a GP processor, such as GP processor 110. The input vector may be read from a memory, from a cache or register in the processor, or obtained in another manner. Digital-to-analog converter (DAC) 502 converts a digital input vector to analog in order for CIM module 530 to operate on the vector. Although shown as connected to only some portions of CIM module 530, DAC 502 may be connected to all of the cells of CIM module 530. Alternatively, multiple DACs 502 may be used to connect to all cells of CIM module 530. Address decoder 570 includes address circuitry configured to selectively couple vector adder 544 and write circuitry 542 with each cell of CIM module 530. Address decoder 570 selects the cells in CIM module 530. For example, address decoder 570 may select individual cells, rows, or columns to be updated, undergo a vector-matrix multiplication, or output the results. In some embodiments, aBit mixer 504 combines the results from CIM module 530. Use of aBit mixer 504 may save on ADCs 506 and allows access to analog output voltages.
ADC(s) 506 convert the analog resultant of the vector-matrix multiplication to digital form. Output cache 560 receives the result of the vector-matrix multiplication and outputs the result from compute engine 500. Thus, a vector-matrix multiplication may be performed using CIM module 530.
LU module 540 includes write circuitry 542 and vector adder 544. In some embodiments, LU module 540 includes weight update calculator 546. In other embodiments, weight update calculator 546 may be a separate component and/or may not reside within compute engine 500. Weight update calculator 546 is used to determine how to update the weights stored in CIM module 530. In some embodiments, the updates are determined sequentially based upon target outputs for the learning system of which compute engine 500 is a part. In some embodiments, the weight update provided may be sign-based (e.g. increments for a positive sign in the gradient of the loss function and decrements for a negative sign in the gradient of the loss function). In some embodiments, the weight update may be ternary (e.g. increments for a positive sign in the gradient of the loss function, decrements for a negative sign in the gradient of the loss function, and no change for a zero gradient of the loss function). Other types of weight updates may be possible. In some embodiments, weight update calculator 546 provides an update signal indicating how each weight is to be updated. The weight stored in a cell of CIM module 530 is sensed and is increased, decreased, or left unchanged based on the update signal. In particular, the weight update may be provided to vector adder 544, which also reads the weight of a cell in CIM module 530. More specifically, adder 544 is configured to be selectively coupled with each cell of CIM module 530 by address decoder 570. Vector adder 544 receives a weight update and adds the weight update with a weight for each cell. Thus, the sum of the weight update and the weight is determined. The resulting sum (i.e. the updated weight) is provided to write circuitry 542. Write circuitry 542 is coupled with vector adder 544 and the cells of CIM module 530. Write circuitry 542 writes the sum of the weight and the weight update to each cell. In some embodiments, LU module 540 further includes a local batched weight update calculator (not shown).
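A behavioral sketch of this local update path is given below, following the sign convention described above (increment for a positive gradient sign, decrement for a negative sign, no change for zero). The unit step size and the in-place array update are illustrative assumptions of the sketch:

```python
import numpy as np

def local_update(cim_cells, loss_grad, step=1):
    # Weight update calculator 546: ternary update signal from the gradient sign.
    update = step * np.sign(loss_grad)
    # Vector adder 544: read each stored weight and add the update to it.
    summed = cim_cells + update
    # Write circuitry 542: write the sum back to the cells.
    cim_cells[...] = summed
    return cim_cells

weights = np.zeros((2, 3))
grads = np.array([[0.7, -0.2, 0.0], [1.5, 0.0, -3.0]])
local_update(weights, grads)   # each weight moves by +1, -1, or 0 per gradient sign
```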
Compute engine 500 may also include control unit 540. Control unit 540 generates the control signals depending on the operation mode of compute engine 500. Control unit 540 is configured to provide control signals to CIM hardware module 530 and LU module 540. Some of the control signals correspond to an inference mode. Some of the control signals correspond to a training, or weight update mode. In some embodiments, the mode is controlled by a control processor (not shown).
In inference mode, the input data is multiplied by the stored weights and output is obtained after ADC 506. This mode may include many steps. For example, if capacitors arranged in a voltage divider are used to provide the output (e.g. capacitors 620 and 622 described above), the capacitors may be reset, the multiplication performed, and the resulting voltages converted to digital form by ADC 506.
Using compute engine 500, efficiency and performance of a learning network may be improved. CIM module 530 may dramatically reduce the time to perform the vector-matrix multiplication. Thus, performing inference(s) using compute engine 500 may require less time and power. This may improve efficiency of training and use of the model. LU module 540 uses components 542, 544, and 546 to perform local updates to the weights stored in the cells of CIM module 530. This may reduce the data movement that may otherwise be required for weight updates. Consequently, the time taken for training may be dramatically reduced. Efficiency and performance of a learning network provided using compute engine 500 may be increased.
For example, learning network 800 may include weight layers 810 (e.g. weight layers 810-1 and 810-2) interleaved with activation layers 820 (e.g. activation layers 820-1 and 820-2).
Compute tile(s) 100, 200, and/or 300 and compute engine(s) 120, 220, 320, 400, and/or 500 may be used to accelerate the processes of learning network 800. For simplicity, it is assumed that compute engine 500 is used in compute tile 300. Further, weight layers 810 are assumed to be storable within a single CIM module 530. Nothing prevents weight layers 810 from being extended across multiple CIM modules 530. In the data flow described above for learning network 800, an input vector is provided to a compute engine 320-1 from GP processor 310. More specifically, the input vector is provided to CIM module 530 (e.g. via input cache 550 and DAC(s) 502). Initial values of weights are stored in, for example, SRAM cells (e.g. 610 or 710) of CIM module 530. A vector-matrix multiplication is performed by CIM module 530 and provided to output cache 560 (e.g. also using aBit mixers 504 and ADC(s) 506). Thus, the processes of weight layer 810-1 may be performed. Activation layer 820-1 may be performed using GP processor 310. The output of activation layer 820-1 (e.g. from GP processor 310) is provided to the next weight layer 810-2. Initial weights for weight layer 810-2 may be in another compute engine 320-2/CIM module 530. In another embodiment, new weights corresponding to weight layer 810-2 may be stored in the same hardware CIM module 530 of the same compute engine 320-1. A vector-matrix multiplication is performed by CIM module 530 and provided to output cache 560 (e.g. also using aBit mixers 504 and ADC(s) 506). Activation layer 820-2 may be performed using a processor such as GP processor 310. The output of activation layer 820-2 is used to determine the loss function via hardware or GP processor 310. The loss function may be used by GP processor 310 and/or weight update calculator 546 to determine the weight updates. Using LU modules 540 and the weights in CIM modules 530, weight layers 810 may be updated. Thus, learning network 800 may be realized using compute tile 100, 200, and/or 300 and/or compute engine 500. The benefits thereof may, therefore, be obtained.
Compute engines 120, 220, 320, 400 and/or 500 may be combined in a variety of architectures. For example, compute tiles analogous to compute tiles 100, 200, and/or 300 may be combined as tiles 910 of a system on a chip (SoC) 900.
In SoC 900, each tile 910 is an independent compute unit which has its own local memory analogous to SRAM 130, 230, and/or 330. Tiles 910 are interconnected by mesh interconnects. In some embodiments, this allows any tile 910 to access the memory of any other tile 910. Tiles 910 each have memory that is fully globally addressable. In some embodiments, a tile 910 may interact with any other tile 910 of SoC 900. Thus, tiles 910 may be considered to be tightly-coupled, independent compute and memory blocks with globally addressable memory that enable a compiler (not shown) to efficiently map a model onto SoC 900.
Using SoC 900, efficiency and performance of a learning network may be improved. In addition to the benefits of the individual tiles 910, such as more efficient control and movement of data within a tile, SoC 900 may extend the benefits to larger systems. Through super tiles, SoC 900 may be tailored to the specific traffic patterns and applications with which SoC 900 is desired to be used. Consequently, efficiency and performance may be enhanced.
Weights corresponding to a weight matrix may be stored in one or more compute engines of a compute tile, at 1002. In some embodiments, this occurs at a time that is distinct from the remainder of method 1000. In some embodiments, 1002 includes storing the weights in the CIM hardware module of the compute engine of the compute tile. An input vector is provided to the compute engine(s) of the compute tile, at 1004. In some embodiments, this is performed via the GP processor corresponding to the compute tile. The compute engine(s) perform a VMM between the input vector and the matrix, at 1006. In some embodiments, this is performed by the CIM hardware module. Thus, 1006 provides an output that is the weight matrix multiplied by the input vector. One or more activation functions are applied to the output, at 1008. In some embodiments, 1008 is performed by the GP processor for the compute tile. At 1010, 1004, 1006, and 1008 may be repeated for multiple inferences with the same or other compute engines (e.g. other weight matrices).
For example, weights may be stored in the compute engines 320 of compute tile 300, at 1002. For example, data may be stored in SRAM cells 610 of CIM hardware modules 530 of compute engine 500. During inference or training, an input vector is provided to compute engine(s) 320. For example, an input vector stored in memory 330 may be provided to GP processor 310, and from GP processor 310 to the appropriate compute engine(s) 320. GP processor 310 may instruct compute engine(s) 320 to perform a VMM of the input vector and the weight matrix stored in compute engine(s) 320. Thus, at 1006, compute engine(s) 320 perform the VMM in parallel. For example, compute engine 500 may use CIM hardware module 530 to perform a VMM. Also at 1006, the output of the VMM is provided to GP processor 310. Activation function(s) are applied to the output, at 1008. This may be performed by GP processor 310. In some embodiments, fixed function compute block 316 may be used in accomplishing 1008. The resultant of the activation function being applied to the output of compute engines 320 may be stored by GP processor 310 in memory 330. At 1010, these processes may be repeated. Thus, multiple inferences may be performed. Further, training may be performed on-chip using the resultants of method 1000 and, for example, LU modules 440 and/or 540.
Using method 1000, the benefits of compute tiles 100, 200, and/or 300 may be achieved. For example, efficiency and performance of learning may be improved. The time to perform the VMMs may be reduced and the movement of data made more efficient. This may improve efficiency of training and use of the model. Efficiency and performance of a learning network provided using method 1000 may be increased.
Compute engines 120, 220, 320, 400, and/or 500 may improve the efficiency of linear operations such as VMMs used in inference and training of a learning network. However, further improvements may be desired. For example, compute engines 120, 220, 320, 400, and/or 500 may be desired to be configured in a weight-stationary architecture. In other words, the weights for a convolution are desired to be loaded into compute engines 120, 220, 320, 400, and/or 500 and are not swapped out throughout the forward pass (i.e. during an inference). Compute tiles 100, 200, 300, and/or 910 may also be desired to be used in conjunction with models having a large number of parameters (e.g. a large number of weights and/or large input data sets) without unduly sacrificing efficiency. Consequently, techniques for performing inferences (e.g. including 1006 and 1008 in a forward pass through a learning network implemented using compute tiles 100, 200, 300, and/or 910) are desired. For example, the ability of compute tiles, such as compute tiles 100, 200, 300, and/or 910, and/or their components to operate in parallel is desired to be improved.
The learning network is profiled, at 1102. The learning network includes compute tile(s) and a model. Thus, the characteristics of the compute tile(s) and model are determined at 1102. For example, 1102 may determine the number of parameters of the model, the number of layers of the model (e.g. the number of convolutions and activation functions to be applied), the activation functions used (e.g. rectified linear unit (ReLU), sigmoid, tan h, sigmoid linear unit (SiLU), Gaussian error linear unit (GeLU), Softmax etc.), how particular operation(s) (e.g. a convolution or activation function) depends upon the output from previous operation(s), the number of compute tiles, the number of compute engines in each compute tile, the storage available in each compute engine and/or compute tile, outside storage such as DRAM available, and/or other characteristics. Also at 1102, characteristics of the input data may be considered part of the model. Thus, the input data may also be profiled. For example, the size of the tensors (number of columns, number of rows, depth, number of bits per entry) may be determined.
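Purely as an illustrative sketch, the information gathered by such profiling might be collected into structures along the following lines; the field names are assumptions for illustration and do not limit what is determined at 1102:

```python
from dataclasses import dataclass, field

@dataclass
class TileProfile:
    # Hardware characteristics determined at 1102.
    num_compute_engines: int
    cim_capacity_bytes: int      # CIM storage available per compute engine
    tile_memory_bytes: int       # on-tile memory (e.g. SRAM) capacity

@dataclass
class ModelProfile:
    # Model and input-data characteristics determined at 1102.
    num_parameters: int
    layers: list = field(default_factory=list)        # ordered convolutions/activations
    dependencies: dict = field(default_factory=dict)  # operation -> operations it depends on
    input_tensor_shape: tuple = ()                     # e.g. (depth, rows, columns)
    bits_per_entry: int = 8
```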
Based on the profile, one or more reschedule operations are determined for at least one of the compute tiles, at 1104. A reschedule operation is used for a convolution (e.g. VMM(s)) that may be carried out by CIM hardware modules of the compute engine(s). A reschedule operation has as an input the input tensor(s) for the convolution. For example, the input tensor(s) may correspond to image data or other data which the learning network is performing an inference for (i.e. for which the learning network performs a forward pass). The reschedule operation provides one or more tensors based on the input tensor(s). For example, a single input tensor may result in multiple output tensors or multiple input tensors may result in a different number of output tensors. The reschedule operation allows the convolution being performed on portions of the input tensor to be re-timed (i.e., rescheduled). For example, the reschedule operation may split the input tensor into multiple tensors such that the learning network can control (e.g. offset) the start time(s) of the convolution for each of the multiple tensors. Stated differently, the tensors are configured to undergo at least a portion of the convolution in accordance with a temporal distribution. In some embodiments, each of the multiple tensors has a unique start time for the convolution. Thus, subsequent operations, such as activation functions, may also be re-timed.
As used herein, splitting an input tensor in a reschedule operation may include dividing the input tensor into multiple tensors such that no data is repeated in the multiple tensors or allowing replication of data in the multiple tensors. For example, one or more rows of one tensor may be repeated in another tensor. In one example, a reschedule operation may take an n×m matrix and return n 1×m matrices. In another example, a reschedule operation may take n 1×m matrices (that may be distributed over time) and return an n×m matrix. In another example, a reschedule operation may take an n×m matrix and return (n−4) 3×m matrices. Other combinations of input and output tensors are possible.
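As an illustrative sketch of such a split, an n×m input tensor might be divided into overlapping 3×m tensors using a stride-one sliding window, so that rows are repeated between adjacent tensors. The window size and stride here are arbitrary choices for the sketch:

```python
import numpy as np

def reschedule_split(tensor, window_rows=3, stride=1):
    # Split an n x m tensor into overlapping window_rows x m tensors; with
    # stride 1, adjacent output tensors share (repeat) rows.
    n = tensor.shape[0]
    return [tensor[i:i + window_rows] for i in range(0, n - window_rows + 1, stride)]

x = np.arange(24).reshape(6, 4)    # a 6 x 4 input tensor
pieces = reschedule_split(x)       # four 3 x 4 tensors with rows shared between neighbors
```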
The multiple tensors determined by the reschedule operation are configured such that at least some tasks may be performed in parallel and/or such that at least some compute engines may be used in parallel without replicating weights (or with only limited replication of weights) for a convolution in multiple compute engines. Thus, improvements in both serialization (performing an operation over time) and parallelization (performing operations in parallel in different compute engines) may be possible. The reschedule operation(s) are also configured such that the output of the learning network is not adversely affected. For example, the final output of the learning network, such as a classification of an image, may not be altered by the reschedule operation(s). In some embodiments, 1104 determines reschedule operation(s) for each compute tile individually. Thus, operation of different compute tiles (e.g. compute tiles 910 in SoC 900) may remain independent.
The learning network performs a forward pass (e.g. an inference) on data provided using the reschedule operation(s), at 1106. Thus, the convolutions and activation functions of the model are applied. In addition, the reschedule operation(s) are performed for the corresponding convolution(s). As a result, the reschedule operation(s) may be used to improve the parallelism and/or serialization of the learning network. Consequently, performance may be enhanced.
For example, method 1100 may be used in conjunction with compute tile 300 and compute engines 320 and/or 500. At 1102, compute tile 300 and the model used are profiled. Consequently, characteristics such as the number of compute engines 320 (e.g. six), the storage in CIM hardware modules 530, available storage in memory 330, the number of layers of the model implemented by compute tile 300, the activation function(s) applied using GP processor 310, the data dependencies between layers of the model (e.g. subsequent operation(s)), and the number of compute engines 320 used for a particular convolution are determined.
Using the information determined by the profiling of 1102, one or more reschedule operations are determined for compute tile 300, at 1104. For example, compute engines 320-0 and 320-1 may be used for a first convolution performed by compute tile 300, while compute engine 320-3 may be used to perform a second convolution. GP processor 310 may apply an activation function to the output of the first convolution (i.e. the output of compute engines 320-0 and 320-1). The reschedule operation may split the input tensor(s) for compute engines 320-0 and 320-1 such that the multiple tensors are provided to compute engines 320-0 and 320-1 at different times. Compute engines 320-0 and 320-1 thus perform portions of the convolution at different times (e.g. serially). Further, because data dependencies are part of the profile, the reschedule operation may allow for compute engine 320-3 to perform at least a portion of the corresponding convolution at the same time that compute engines 320-0 and 320-1 perform part of their corresponding convolution. Thus, compute engines 320-3 and compute engines 320-0 and 320-1 may operate in parallel. Consequently, performance of the learning network employing compute tile 300 may be improved.
The requirements of the model are determined, at 1202. For example, the activation functions used by the model, the number of layers, the memory requirements for the parameters, the data dependencies for convolutions, activation functions, and/or other operations, the size of the tensors input to the model, and/or other characteristics of the model may be determined at 1202. In some embodiments, at least some of these characteristics are provided by a user of the learning network.
The requirements of the compute engines are determined, at 1204. For example, the capacity of the storage for the VMMs, the number of compute engines, the latencies (e.g. number of cycles required for a VMM), and/or other features of the compute engines are determined. At 1206, additional requirements of the compute tile are determined. For example, additional memory capacity, other memory accessible by the compute tile, latencies for the given activation functions applied using the GP processor, and/or other characteristics of the compute tile are determined. Using method 1200, the traits of the model (e.g. operations performed and data used) and the hardware may be determined. Consequently, the desired reschedule operations may be provided using method 1100.
The data for input tensor(s) may be stored at a first speed, at 1302. In some embodiments, the data may simply be stored in a buffer as rapidly as possible. In some embodiments, the first speed may not be controlled. The data for the multiple tensors may be read at a second speed, at 1304. The second speed is different from the first speed. For example, data may be read from the buffer with delays between some or all of the reads. Consequently, method 1300 may essentially separate the input tensor into multiple tensors that may be output to the convolution, or other operation, at the desired rate.
For example, a reschedule operation may write activations (e.g. input tensors from SRAM 330, the output of previous VMMs of compute engine(s) 320, or the output of an activation applied by GP processor 310) into buffer 390 at a first speed, at 1302. The reschedule operation may also read the data stored in buffer 390 and provide this data to a subsequent component at a second speed, at 1304. For example, data from buffer 390 may be provided to compute engine 320-3 at a lower speed and/or in chunks separated in time.
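A behavioral sketch of such a rate-decoupled buffer is given below; the deque-based model and the notion of a fixed number of rows released per read are assumptions used only to illustrate writing at one speed and reading at another:

```python
from collections import deque

class RescheduleBuffer:
    """Writes accumulate as fast as they arrive (first speed); each read
    releases only a limited chunk (second speed), retiming the consumer."""

    def __init__(self, rows_per_read=1):
        self.rows = deque()
        self.rows_per_read = rows_per_read

    def write(self, row_block):
        # First speed: store incoming rows as quickly as they are produced.
        self.rows.extend(row_block)

    def read(self):
        # Second speed: hand out at most rows_per_read rows per call.
        count = min(self.rows_per_read, len(self.rows))
        return [self.rows.popleft() for _ in range(count)]

buf = RescheduleBuffer(rows_per_read=2)
buf.write(["row0", "row1", "row2"])
chunk = buf.read()    # ["row0", "row1"]; "row2" remains for a later read
```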
Thus, using methods 1100, 1200, and/or 1300 and the corresponding reschedule operation(s), the flow of data through a compute tile, such as compute tiles 100, 200, 300, and/or 910, may be controlled. Consequently, operation of the compute engines, such as compute engines 120, 220, 320, 400, and/or 500, may be better parallelized while allowing for a weight stationary architecture and/or serialization of some tasks. Performance of compute tiles 100, 200, 300, and/or 910 may be improved.
For example, reschedule operations may be used in flow 1400 for a portion of a forward pass of a learning network, as described below.
An input tensor 1402 is provided for convolution 1420. Reschedule operation 1410 converts input tensor 1402 to multiple tensors that are processed over time using convolution portions 1420-1, 1420-2, 1420-3, 1420-4, 1420-5, and 1420-6. In some embodiments, each convolution portion 1420-1, 1420-2, 1420-3, 1420-4, 1420-5, and 1420-6 is performed by the same compute engine(s). In some embodiments, therefore, convolution 1420 may be accomplished without replicating weights in multiple compute engines.
After convolution 1420 is completed, activation function 1422 is applied via activation functions 1422-1, 1422-2, 1422-3, 1422-4, 1422-5, and 1422-6 (collectively or generically 1422) to the outputs of convolution 1420. For example, activation functions 1422 may apply a sigmoid to the outputs of convolution 1420. In some embodiments, reschedule operation 1412, a concatenation, or other analogous task might be performed before activation 1422.
An additional reschedule operation may be considered to be performed at 1412. Reschedule operation 1412 may also be considered to be a write of the outputs of activation function 1422 to a buffer, with partial outputs then being provided from the buffer. In particular, the outputs of activation functions 1422-1, 1422-2, 1422-3, 1422-4, 1422-5, and 1422-6 are collected, for example in the buffer. Because the data dependencies for convolution 1430 are known through profiling of the learning network, it can be determined when enough results of activation 1422 have been provided for convolution 1430 to be started. Convolution portions 1430-1, 1430-2, 1430-3, and 1430-4 are performed over time. Thus, a portion of the results of activation function 1422 that have been collected at 1412 may be provided to convolution portion 1430-1. When sufficient additional results of activation function 1422 have been collected via reschedule operation 1412, these results are provided to convolution portion 1430-2. This process continues for convolution portions 1430-3 and 1430-4.
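The start-when-ready behavior described above, in which a portion of convolution 1430 is launched as soon as enough activation results have been collected, might be sketched as follows. The number of rows needed per portion is an assumption of the sketch:

```python
def pipelined_schedule(activation_outputs, rows_per_portion, run_portion):
    # Collect activation results as they arrive and launch the next convolution
    # portion as soon as its data dependency is satisfied, rather than waiting
    # for the preceding stage to finish entirely.
    buffered, results = [], []
    for rows in activation_outputs:                  # results arrive over time
        buffered.extend(rows)
        while len(buffered) >= rows_per_portion:     # enough data for one portion
            chunk, buffered = buffered[:rows_per_portion], buffered[rows_per_portion:]
            results.append(run_portion(chunk))       # e.g. a convolution portion 1430-x
    return results

# Example: six activation outputs of four rows each feed four portions of six rows each.
out = pipelined_schedule([list(range(i, i + 4)) for i in range(0, 24, 4)],
                         rows_per_portion=6, run_portion=sum)
```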
As a result, convolution 1430 may begin before convolution 1420 has completed. For example, convolution portion 1430-1 may be performed in parallel with convolution portion 1420-5 using different compute engines.
After convolution 1430 is completed, activation function 1432 is applied via activation functions 1432-1, 1432-2, 1432-3, and 1432-4 (collectively or generically 1432) to the outputs of convolution 1430. For example, activation functions 1432 may apply a sigmoid to the outputs of convolution 1430. The flow of operations may then continue.
Thus, use of reschedule operations 1410 and 1412 may allow for parallelism (e.g. compute engines performing 1430-1 and 1420-5 in parallel) such that compute engines storing different weights operate in parallel. Reschedule operations 1410 and 1412 may also allow for serialization of convolutions (e.g. 1420-1 through 1420-6 being performed at different times) such that the same weights need not be replicated in multiple compute engines. Consequently, the efficiency of the learning network (e.g. model and compute tiles) performing flow 1400 may be improved.
Original flow 1501 includes convolutions 1520-1 and 1526 as well as activation functions 1522 and 1524. In the embodiment shown, activation functions 1522 and 1524 are a sigmoid and a MaxPool, respectively. In other embodiments, other and/or additional activation functions may be used. Input tensor 1502, corresponding to image data, is provided to original flow 1501. Convolution 1520-1 is applied to input data 1502. The result of convolution 1520-1 is a 1×10×24×24 tensor (i.e. one block of 10×24×24 image data). This tensor is shown as image data 1508. Sigmoid 1522, MaxPool 1524, and convolution 1526 may then be applied to image data 1508.
Flow 1500 includes convolutions 1520 and 1526 and activation functions 1522 and 1524 corresponding to convolutions 1520-1 and 1526 and activation functions 1522 and 1524, respectively, of original flow 1501. Flow 1500 also includes reschedule operations 1510 and 1530. Reschedule operation 1510 operates on input data 1502 and outputs a 24×1×5×28 tensor (24 blocks of 1×5×28 image data 1504). Convolution 1520 is analogous to convolution 1520-1. The output of convolution 1520 is a 24×10×1×24 tensor (24 blocks of 10×1×24 image data 1506). In some embodiments, the number of blocks (e.g. 24 for convolution 1520) may be viewed as how often a particular task is executed in hardware. For example, convolution 1520 may be executed 24 times, once for each block of data. Reschedule operation 1530 accepts as input image data 1506 (i.e. a 24×10×1×24 tensor) and outputs image data 1508 (i.e. a 1×10×24×24 tensor). Activation functions 1522 and 1524 and convolution 1526 may then be performed in a manner analogous to original flow 1501.
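The equivalence of flow 1500 and original flow 1501 can be checked numerically with a small sketch. The 28×28 single-channel input and the ten 5×5 kernels below are inferred from the tensor shapes given above and are assumptions of this sketch:

```python
import numpy as np

def conv2d_valid(image, kernels):
    # Naive 'valid' convolution: image (H, W) with kernels (K, kh, kw)
    # produces an output of shape (K, H - kh + 1, W - kw + 1).
    K, kh, kw = kernels.shape
    H, W = image.shape
    out = np.zeros((K, H - kh + 1, W - kw + 1))
    for k in range(K):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                out[k, i, j] = np.sum(image[i:i + kh, j:j + kw] * kernels[k])
    return out

rng = np.random.default_rng(2)
image = rng.standard_normal((28, 28))       # assumed 28 x 28 single-channel input 1502
kernels = rng.standard_normal((10, 5, 5))   # assumed ten 5 x 5 kernels for convolution 1520

full = conv2d_valid(image, kernels)                  # 10 x 24 x 24, as in image data 1508
bands = [image[i:i + 5] for i in range(24)]          # reschedule 1510: 24 blocks of 5 x 28
slices = [conv2d_valid(b, kernels) for b in bands]   # convolution 1520: each 10 x 1 x 24
restitched = np.concatenate(slices, axis=1)          # reschedule 1530: 10 x 24 x 24 again
assert np.allclose(full, restitched)
```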
Thus, reschedule operations 1510 and 1530 may be utilized. Such reschedule operations 1510 and 1530 may improve parallelism and serialization, while providing the same result as a corresponding flow that omits reschedule operations. Consequently, the efficiency of the learning network (e.g. model and compute tiles) performing flow 1500 may be improved without adversely affecting the results of the flow of data through the forward path of the learning network.
Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.
This application claims priority to U.S. Provisional Patent Application No. 63/528,283 entitled COMPILER TECHNIQUES FOR IN-MEMORY ARCHITECTURE filed Jul. 21, 2023 which is incorporated herein by reference for all purposes.