Artificial neural networks are computing systems with an architecture based on biological neural networks. Artificial neural networks can be trained in a training process, using training data, to learn how to perform a certain computing task. Artificial neural networks can be implemented on a neural network processor, which can include memory and computation resources to support the computation operations of the artificial neural network. Certain applications may impose limits on the amounts of memory and computation resources available on the neural network hardware accelerator.
In one example, a neural network processor is provided. The neural network processor comprises a memory interface, an instruction buffer, a weights buffer, an input data register, a weights register, an output data register, a computing engine, and a controller. The controller is configured to receive a first instruction from the instruction buffer, and responsive to the first instruction: fetch input data elements from the memory interface to the input data register, and fetch weight elements from the weights buffer to the weights register. The controller is also configured to receive a second instruction from the instruction buffer, and responsive to the second instruction: fetch the input data elements and the weight elements from, respectively, the input data register and the weights register to the computing engine, perform, using the computing engine, computation operations between the input data elements and the weight elements to generate output data elements, and store the output data elements at the output data register.
In one example, a method is provided. The method comprises receiving a first instruction from an instruction buffer of a neural network processor, and responsive to the first instruction: fetching input data elements from a memory external to the neural network processor to an input data register of the neural network processor; and fetching weight elements from a weights buffer of the neural network processor to a weights register of the neural network processor. The method further comprises: receiving a second instruction from the instruction buffer, and responsive to the second instruction: fetching the input data elements and the weight elements from, respectively, the input data register and the weights register to a computing engine of the neural network processor, performing, using the computing engine, computation operations between the input data elements and the weight elements to generate output data elements, and storing the output data elements at an output data register of the neural network processor.
In one example, a neural network processor comprises a memory interface, an instruction buffer, a weights buffer, an input data register, a weights register, an output data register, address registers, an address generation engine, a data load/store engine, a computing engine, and a controller. The address generation engine is configurable to set input data addresses, output data addresses, and weights addresses in the address registers. The data load/store engine is configurable to fetch input data from the memory interface to the input data register based on the input data addresses, fetch output data from the output data register to the memory interface based on the output data addresses, and fetch weights from the weights buffer to the weights register based on the weight addresses. The computing engine is configurable to perform computations based on the input data and the weights to generate the output data. The controller is configured to, responsive to one or more instructions from the instruction buffer: extract a first sub-instruction directed to the address generation engine to set one of the input data addresses or the output data addresses, extract a second sub-instruction directed to the address generation engine to set the weights addresses, extract a third sub-instruction directed to the computing engine to perform the computations, extract a fourth sub-instruction directed to the data load/store engine to fetch the weights at the weight addresses of the weights buffer, extract a fifth sub-instruction directed to the data load/store engine to fetch the input data at the input data addresses via the memory interface, and configure the address generation engine with the first and second sub-instructions, the computing engine with the third sub-instruction, and the data load/store engine with the fourth and fifth sub-instructions in parallel.
The same reference numbers are used in the drawings to designate the same (or similar) features.
Data processor 106 of a particular electronic device 102 can perform data processing operations on the data collected by sensor 104 on the particular electronic device to generate decision 110. For example, in examples where sensor 104 includes an audio/acoustic sensor, data processor 106 can perform data processing operations such as keyword spotting, voice activity detection, and detection of a particular acoustic signature (e.g., glass break, gunshot). Also, in examples where sensor 104 includes a motion sensor, data processor 106 can perform data processing operations such as vibration detection, activity recognition, and anomaly detection (e.g., whether a window/a door is hit or opened when no one is at home or at night). Further, in examples where sensor 104 includes an image sensor, data processor 106 can perform data processing operations such as face recognition, gesture recognition, and visual wake word detection (e.g., determining whether a person is present in an environment). Data processor 106 can also generate and output decision 110 based on the result of the data processing operations including, for example, detection of a keyword in a speech, detection of a particular acoustic signature, a particular activity, a particular gesture, etc. In a case where electronic device 102 includes multiple sensors, data processor 106 can perform a sensor fusion operation on the different types of sensor data to generate decision 110.
Data processor 106 can include various circuitries to process the sensor signal generated by sensor 104. For example, data processor 106 may include sample and hold (S/H) circuits to sample sensor signal. Data processor 106 may also include analog-to-digital converters (ADCs) to quantize the samples into digital signals. Data processor 106 can also include a neural network processor to implement an artificial neural network to process the samples. An artificial neural network (hereinafter “neural network”) may include multiple processing nodes. The neural network can perform an inferencing operation or a classification operation on the sensor data to generate the aforementioned decision. The inferencing operation can be performed by combining the sensor data with a set of weight elements, which are obtained from a neural network training operation, to generate the decision. Examples of neural networks can include a deep neural network (DNN), a convolutional neural network (CNN), etc.
The processing nodes of a neural network can be divided into layers including, for example, an input layer, a number of intermediate layers (e.g., hidden layers), and an output layer. The input layer and the intermediate layers can each be a convolution layer forming CNN, whereas the output layer can be a fully-connected layer, and the input layer, intermediate layer, and the output layer together form a DNN. Each processing node of the input layer receives an element of an input set, and scales the element with a weight element to indicate the element's degree of influence on the output. The input set may include, for example, acoustic data, motion data, image data, a combination of different types of data, or a set of input features extracted from those data. Also, the processing nodes in the intermediate layers may combine the scaled elements received from each processing node of the input layer to compute a set of intermediate outputs. For example, each processing node in the intermediate layers may compute a sum of the element-weight products, and then generate an intermediate output by applying an activation function to the sum. The intermediate outputs from each processing node of one intermediate layer may be considered as an activated vote (or no-vote), associated with a weight indicating the vote's influence, to determine the intermediate output of the next intermediate layer. The intermediate output can represent output features of a particular intermediate layer. The output layer may generate a sum of the scaled intermediate outputs from the final intermediate layer. In some examples, the output layer may generate a binary output (e.g., “yes” or “no”) based on whether the sum of the scaled intermediate outputs exceeds a threshold, which can indicate a decision from the data processing operation (e.g., detection of a keyword in a speech, detection of a particular acoustic signature, a particular activity, a particular gesture).
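For illustration, the per-node computation described above can be summarized by the following Python sketch: a weighted sum of inputs followed by an activation function, with the output layer comparing the final weighted sum against a threshold. The ReLU activation and zero threshold are illustrative choices, not requirements of the layers described above.

    def node_output(inputs, weights, activation=lambda s: max(s, 0.0)):
        # scale each input element by its weight, sum the products, apply the activation
        return activation(sum(x * w for x, w in zip(inputs, weights)))

    def output_layer_decision(intermediate_outputs, weights, threshold=0.0):
        # binary "yes"/"no" decision based on whether the weighted sum exceeds a threshold
        return sum(x * w for x, w in zip(intermediate_outputs, weights)) > threshold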
The neural network processor of data processor 106 can be programmed to perform computations based on an artificial neural network model. The neural network processor can be programmed based on a sequence of instructions that include computation operations (e.g., adding, multiplication, processing of activation function, etc.) associated with the model. The instructions may also access internal and external memory devices to obtain and store data. A compiler may receive information about the neural network model, the input data, and the available memory and computation resources, and generate the set of instructions to indicate, for example, when to access the internal and external memory devices for the data, which component of the neural network processor to perform computations on the data based on the neural network model, etc., to perform the neural network processing. In the example of
Also, as part of DNN layer processing operation 204, data processor 106 can process features 214 using a multi-layer DNN and a set of weight elements 216 and compute a set of outputs 218. Post processing operation 206 can post-process and quantize outputs 218 to generate inferencing outputs 220. The post processing operation can include, for example, activation function processing to map outputs 218 to the set of inferencing outputs, as well as other post processing operations such as batch normalization (batch norm, or BNorm) and residual layer processing to facilitate convergence in training. Inferencing outputs 220 can include a set of probabilities for a set of candidate words. In the example of
To perform a convolution operation to compute one output data element (e.g., output data element 304a), data processor 106 can compute a dot product between a set of weight elements 306 and a subset of the input data elements 302. In the example of
Y_{e,f} = \sum_{r=0}^{K_w-1} \sum_{s=0}^{K_h-1} \sum_{c=0}^{N_{in}-1} X_{c}^{eD+r,\,fD+s} \times F_{c}^{r,s}  (Equation 1)
In Equation 1, r represents an index along the width dimension, s represents an index along the height dimension, and c represents an index along the input channel dimension. Also, X_{c}^{eD+r, fD+s} represents an input data element 302 of input channel c and having indices eD+r along the width dimension and fD+s along the height dimension, where e and f are multiples of stride D, F_{c}^{r,s} represents a weight data element 306 having indices r along the width dimension and s along the height dimension, and Y_{e,f} represents an output data element 304 having indices e along the width dimension and f along the height dimension. In some examples, data processor 106 can perform convolution operations between input data elements 302 and multiple sets of weight elements 306 to generate output data elements 304 having N_{out} output channels, where convolution between the input data elements and one set of weight elements 306 can generate output data elements 304 of a particular output channel.
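For illustration, Equation 1 can be rendered directly as the following nested-loop Python sketch. The indexing of X, F, and Y as [channel][width][height] nested lists is an assumption made for readability, not a statement about the storage format used by data processor 106.

    def conv_output_element(X, F, e, f, Kw, Kh, Nin, D):
        # Y[e][f] per Equation 1: sum over kernel width index r, kernel height index s,
        # and input channel c of X[c][e*D + r][f*D + s] * F[c][r][s]
        y = 0
        for r in range(Kw):
            for s in range(Kh):
                for c in range(Nin):
                    y += X[c][e * D + r][f * D + s] * F[c][r][s]
        return y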
The parameters of the set of loop instructions 400 can be configured based on a particular type/topology of neural network layer to be represented by the set of loop instructions. For example, to implement a depth wise convolution layer, where each output data element is generated from the dot product of input data elements and weight elements of a same input channel, the outermost loops 402 and 404 can be merged into one loop, or the variable m can be set to a constant. Also, to implement an average pooling layer, all of the weight elements provided to the neural network layer can be set to 1/(Kw*Kh). Further, to implement a pointwise convolution layer, where each output data element is generated from the dot product of input data elements and a 1×1 kernel having a depth equal to the number of input channels Nin, Kw and Kh (width and height of the weight elements array) can each be set to 1. Further, to implement a fully connected layer, Kw, Kh, Fw, and Fh can each be set to 1.
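For illustration, the parameterization described above can be summarized by the following Python sketch, which maps a layer type to the loop parameters that change; the function and dictionary names are illustrative only and do not correspond to actual fields of loop instructions 400.

    def loop_params_for_layer(layer_type, Kw, Kh):
        if layer_type == "pointwise":
            return {"Kw": 1, "Kh": 1}                    # 1x1 kernel, depth Nin
        if layer_type == "fully_connected":
            return {"Kw": 1, "Kh": 1, "Fw": 1, "Fh": 1}  # no spatial extent
        if layer_type == "average_pooling":
            return {"weight_value": 1.0 / (Kw * Kh)}     # all weights set to 1/(Kw*Kh)
        if layer_type == "depthwise":
            return {"merge_outer_loops": True}           # or hold the variable m constant
        return {}                                        # standard convolution: no change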
Referring again to
Although it is advantageous to have electronic devices 102 perform inferencing operations on sensor data locally, there are various challenges. Specifically, inferencing operations, even when performed by dedicated hardware such as a neural network processor, can be power intensive and may use substantial memory and computation resources. On the other hand, electronic devices 102 may be low power devices and may also have small form factors, especially in a case where they are IoT devices. Accordingly, a neural network processor on an electronic device 102 may have very limited memory and computation resources available to perform the inferencing operations. Further, different applications may have different and conflicting requirements for the inferencing operations. For example, some applications may require a high precision in the inferencing computations, while other applications may not require such a high precision. Also, the neural network processor may need to support a wide range of neural network topologies, layer types, kernel/filter sizes, and dimensions for filter, input data, and output data to serve different applications. All of these factors make it challenging to provide a neural network processor that can perform various inferencing operations with limited memory and computation resources while supporting a wide range of applications.
Memory 512 is shared by neural network processor 502 and processor 514. Memory 512 may store the instructions, input data and weights of each neural network layer to be provided to neural network processor 502 to perform inferencing operations, and output data generated by neural network processor 502 from the inferencing operations. The input data can include, for example, sensor data provided by sensor 104, feature data extracted from the sensor data, or intermediate output data generated by a prior neural network layer. Memory 512 can also store other data, such as program instructions and data for processor 514. Memory 512 may include any suitable on-chip memory, such as flash memory devices, static random access memory (SRAM), resistive random access memory (ReRAM), etc.
By having processor 514 and neural network processor 502 sharing memory 512 rather than providing dedicated memory devices separately for neural network processor 502 and processor 514, the total memory size of electronic device 102 can be reduced, which can reduce power and footprint of electronic device 102. As to be described in more detail below, neural network processor 502 can implement various memory management schemes, such as in-place computation where output data overwrites input data using circular addressing of output data, and circular addressing of input data to support always on applications, to reduce memory resource usage and memory data movement, which allows neural network processor 502 to operate with limited memory resources. Moreover, neural network processor 502 is configured to handle variable latency for accessing data from memory 512, thereby allowing concurrent operation of the processor 514 with minimal performance impact on neural network processor 502.
Processor 514 can execute software programs that use neural network processor 502 to perform inferencing operations, and then perform additional operations based on the results of the inferencing operations. For example, processor 514 can execute a software program for a home security system. The software program can include instructions (e.g., application programming interface (API)) to neural network processor 502. One API may be associated with computations at one neural network layer, and the software program may include multiple APIs for multiple neural network layers. Upon invoking/executing an API, processor 514 can provide memory addresses of neural network layer instructions executable by neural network processor 502, as well as weights, parameters, and input data, to neural network processor 502, which can then fetch these data at the memory addresses of memory 512. Processor 514 can also transmit a control signal to neural network processor 502 to start computations for that neural network layer. Upon completion of the neural network layer computations, and once the output data are stored in memory 512, neural network processor 502 can transmit the memory addresses of the output data and a control signal back to processor 514 to signal completion, and processor 514 can invoke another API for the next neural network layer. Processor 514 can also execute the software program to perform other functions, such as transmitting the inferencing decision to cloud network 103 (or other devices/systems), providing a graphical user interface, among others. Processor 514 can perform those other functions concurrently with neural network processor 502.
DMA controller 516 may be configured to perform DMA operations to transfer certain data between memory 512 and neural network processor 502. For example, upon invoking an API, processor 514 can provide the memory addresses for the stored instructions, weights, and parameters to neural network processor 502 (e.g., in the form of memory descriptors). Neural network processor 502 can then obtain the stored instructions, weights, and parameters based on the memory addresses provided by the processor 514. As to be described below, in some examples, neural network processor 502 can fetch input data directly from memory 512 on an as-needed basis, instead of fetching the input data in bulk using DMA controller 516. Also, neural network processor 502 can store newly generated output data directly to memory 512 in relatively small chunks (e.g., a chunk size of 32 bits) as the output data is generated and reaches the chunk size, instead of fetching the output data in bulk using DMA controller 516. Such arrangements can avoid having (or at least reduce the size of) an input/output data buffer on neural network processor 502, which can reduce the power and footprint of neural network processor 502.
Neural network processor 502 can be a neural network hardware accelerator, and can provide hardware resources, including computation resources and memory resources, for neural network layer computations to support the inferencing operations. Neural network processor 502 can include an instruction buffer 520, a computation controller 522, and a computing engine 524 having configurable computation precision. Neural network processor 502 also includes weights and parameters buffer 526, registers 528, load/store controller 530, and address generators 532. Load/store controller 530 further includes a memory interface 534. Each component of neural network processor 502 can include combinational logic circuits (e.g., logic gates), sequential logic circuits (e.g., latches, flip flops, etc.), and/or memory devices (e.g., SRAM) to support various operations of neural network processor 502 as to be described below.
Instruction buffer 520 can fetch and store computation instructions for a neural network layer from memory 512 responsive to a control signal from processor 514. For example, responsive to an API being invoked to start a neural network layer computation, processor 514 can control instruction buffer 520 to transfer instructions of the neural network layer computations (e.g., microcodes) from memory 512, or to receive the instructions from other sources (e.g., another processor), and store the instructions at instruction buffer 520.
Computation controller 522 can decode each computation instruction stored in instruction buffer 520 and control computing engine 524, load/store controller 530, and address generators 532 to perform operations based on the instruction. For example, responsive to an instruction to perform a convolution operation, computation controller 522 can control computing engine 524 to perform computations for the convolution operation after the data and weights for the convolution operation are fetched and stored in, respectively, data registers 528a and weights/parameters registers 528b. Computation controller 522 can maintain a program counter (PC) that tracks/points to the instruction to be executed next. In some examples, the computation instruction can include flow control elements, such as loop elements, macro elements, etc., that can be extracted by computation controller 522. Computation controller 522 can then alter the PC value responsive to the flow control elements, and alter the flow/sequence of execution of the instructions based on the flow control elements. Such arrangements allow the neural network layer instructions to include loop instructions that reflect convolutional layer computations, such as those shown in
Computing engine 524 can include circuitries to perform convolution operations to support a CNN network layer computation, and circuitries to perform post processing operations on the neural network output data (e.g., BNorm and residual layer processing). As to be discussed below, computing engine 524 is configurable to perform MAC (e.g., convolution) and post processing operations for weights and input data across a range of bit precisions (e.g., binary precision, ternary precision, 4-bit precision, 8-bit precision, etc.) based on parameters provided by processor 514 and stored in configuration registers 528d. The parameters can be provided for a particular neural network layer, and can be updated between the execution of different neural network layers. This allows processor 514 to dynamically configure the bit precisions of the convolution and post processing operations at computing engine 524 based on, for example, the need of the inferencing operation to be performed, the application that uses the inferencing result, the available power of electronic device 102, etc. In some examples, computing engine 524 allows different bit precisions for weights and input data for different neural network layers, which can enable a wide range of accuracy-versus-compute tradeoffs. This also allows neural network processor 502 to operate as a domain-specific or an application-specific instruction set processor, where neural network processor 502 can be configured/customized to perform certain applications (e.g., machine learning based on deep neural network) efficiently.
Weights and parameters buffer 526 can store the weights for multiple neural network layers, as well as parameters for different internal components of neural network processor 502 to support the post processing operations. Neural network processor 502 can fetch the weights and parameters from memory 512 via, for example, DMA controller 516. In some examples, weights and parameters buffer 526 can include SRAM devices. Having weights and parameters buffer 526 to store the weights and parameters, which are static for a particular neural network layer, can reduce the movement of such static data between neural network processor 502 and memory 512 during the computations for a neural network layer. Also, the size of weights and parameters data can be relatively small compared with the size of input and output data for the neural network layer computations, which allows weights and parameters buffer 526 to have a small footprint.
Further, registers 528 can include data registers 528a, weights and parameters registers 528b, address registers 528c, and configuration registers 528d. Data registers 528a can store a subset of the input data and a subset of the output data for computations of a neural network layer, and weights and parameters registers 528b can store a subset of the weights for the neural network layer computation, and parameters for post processing operations. Address registers 528c can store memory addresses to be accessed by load/store controller 530 at memory 512 for weights and input/output data. Address registers 528c can also store addresses in weights and parameters buffer 526 to be accessed by load/store controller 530 to fetch the weights and parameters. Configuration registers 528d can store configuration parameters that are common for various components for neural network processor 502. For example, configuration registers 528d can store parameters to set the bit precisions of input data elements and weight elements for the convolution computations and post processing operations at computing engine 524, a particular memory management scheme (e.g., in-place computation, circular addressing of input data, etc.) of neural network processor 502, etc. As described above, some of these parameters can be provided/updated by processor 514 between the execution of a neural network, or between the execution of two neural network layers.
The read/write operations of data registers 528a and weight registers 528b can be performed by load/store controller 530 based on instructions executed by computation controller 522. For example, responsive to an instruction that indicates fetching of input data to data register 528a, load/store controller 530 can fetch the input data from memory 512 directly via memory interface 534 (e.g., without going through DMA controller 516) and store the fetched data at data registers 528a. Also, responsive to an instruction that indicates storing of output data back to memory 512, load/store controller 530 can fetch the output data from data registers 528a and store the output data at memory 512 directly via memory interface 534 (e.g., without going through DMA controller 516). As discussed above, such arrangements can avoid having (or reduce the size of) an input/output data buffer on neural network processor 502, which can reduce the power and footprint of neural network processor 502.
Also, as to be described below, load/store controller 530 can implement various memory management schemes, such as in-place computation where output data overwrites input data, and circular addressing of input data to support always on applications, by setting the memory addresses stored in address registers 528c. Such arrangements can reduce the footprint of input/output data in memory 512 and reduce movement of input/output data in memory 512, which facilitates shared access to memory 512 between neural network processor 502 and processor 514. On the other hand, load/store controller 530 can also fetch weights and parameters from weights and parameters buffer 526 to weights and parameters registers 528b, based on instructions executed by computation controller 522. As described above, processor 514 can control neural network processor 502 to fetch the weights and parameters from memory 512 via a separate memory interface (not shown in
Further, sub-instruction 614 indicates a type of instruction 600. Neural network processor 502 can support instructions of different types and bit lengths. For example, in
Computation controller 522 can extract the sub-instructions from pre-determined bit positions of the instruction, and generate control signals for the target component of neural network processor 502 based on the extracted sub-instruction. Computation controller 522 can control computing engine 524, address generators 532, and load/store controller 530 to execute the respective sub-instructions in parallel, which allows neural network processor 502 to provide N-way (e.g., 5-way in execution of sub-instructions 602-610) parallel programmability.
Each of sub-instructions 602-610 includes fields that identify an operation to be executed by the target component and/or registers to be accessed. For example, sub-instruction 602 includes fields 602a, 602b, 602c, and 602d. Field 602a can identify a computation to be performed by computing engine 524, such as a multiply-and-accumulation (MAC) computation operation, BNorm computation operation, or a max pooling computation operation. Field 602b can identify a destination data register (labelled MACreg0-MACreg7) to store the output of the computation. Fields 602c and 602d identify, respectively, a source input data register (labelled Din0 and Din1 in
Also, sub-instruction 604 includes fields 604a, 604b, and 604c. Field 604a can identify an operation to be performed to update an address stored in a source address register identified by field 604c, and the updated address is to be stored in a destination address register identified by field 604b. The operation can include, for example, an increment by one (ADD), a decrement by one (SUB), and a move operation (MOV) to replace the address in the destination address register with the address in the source address register. The address can be a memory address in memory 512 (for input data/output data) or an alias/reference/address to a location in weights and parameters buffer 526 (for weights or parameters). The source/destination address registers can include address registers for input data and output data (labelled ARin0, ARin1, ARout0, and ARout1 in
Further, sub-instruction 606 includes fields 606a, 606b, and 606c. Field 606a can identify an operation to be performed to update an address stored in a source address register identified by field 606c, and the updated address is to be stored in a destination address register identified by field 606b. The operation can include, for example, an increment by one (ADD), a decrement by one (SUB), and a move operation (MOV) to replace the address in the destination address register with the address in the source address register. The address is a memory address in memory 512 for input or output data, and the source/destination address registers can include address registers for input data and output data (labelled ARin0, ARin1, ARout0, and ARout1 in
Also, sub-instruction 608 includes fields 608a, 608b, and 608c. Field 608a can indicate a load instruction to fetch a weight/bias/scale/shift from an address stored in an address register identified by field 608c (e.g., ARwt0, ARwt1, ARss0, ARss1, ARbias0, ARbias1) and store the fetched weight/bias/scale/shift to a register identified by field 608b (weights-0, weights-1, scale-shift-0, scale-shift-1, MACreg0-MACreg7).
Further, sub-instruction 610 includes fields 610a, 610b, and 610c. Field 610a can indicate whether sub-instruction 610 is a load instruction or a store instruction. For a load instruction, field 610c can identify the address register (ARin0, ARin1) that stores the memory address from which input data is to be loaded, and field 610b can identify the input data register to store the input data (Din0 or Din1). For a store instruction, field 610c can identify the address register (ARout0, ARout1) that stores the memory address for storing the output data, and field 610b can identify the output data register (Dout) from which the output data is to be fetched.
In some examples, each of the Din0, Din1, and Dout registers has 32 bits, each of the weights and scale-shift registers has 64 bits, and each MACreg register has 72 bits.
Having neural network processor 502 configured to execute instructions of various bit lengths and various numbers of sub-instructions can reduce code size and power, while providing flexibility to maximize parallelism in execution of the sub-instructions. Specifically, neural network processor 502 does not always execute the five sub-instructions 602, 604, 606, 608, and 610 to support a neural network computation, while some combinations of the sub-instructions are executed together more often (e.g., sub-instructions that read from the memory and increment the address register) than other combinations. Accordingly, by supporting 24b instructions having sub-instructions that are more often executed together, the code size can be reduced. The power consumed in fetching and decoding the shortened instructions can also be reduced. On the other hand, by supporting 48b instructions including five sub-instructions 602-610, the parallelism in execution of the sub-instructions can also be maximized. All these can improve the flexibility of neural network processor 502 in supporting different applications having different requirements for power, code size, and execution parallelism.
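For illustration, the following Python sketch shows how fixed bit positions can be sliced into sub-instructions for parallel dispatch. The field offsets and widths below are hypothetical placeholders chosen only to demonstrate the decoding step; they are not the actual encoding of instruction 600 or of sub-instructions 602-614.

    HYPOTHETICAL_FIELDS = {          # name: (bit offset, bit width) -- illustrative only
        "compute_602":  (0, 12),
        "addr_gen_604": (12, 9),
        "addr_gen_606": (21, 9),
        "wt_load_608":  (30, 8),
        "data_ls_610":  (38, 8),
        "type_614":     (46, 2),
    }

    def decode(instruction_word: int) -> dict:
        # extract each sub-instruction from its (assumed) fixed bit position
        return {name: (instruction_word >> offset) & ((1 << width) - 1)
                for name, (offset, width) in HYPOTHETICAL_FIELDS.items()}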
At time slot 0, load/store controller 530 starts the fetching of input data at a first memory address of memory 512 specified in the ARin0 register. The input data are to be stored at input data register Din0. Also, load/store controller 530 fetches a first set of weight elements at a first buffer address (of weights and parameters buffer 526) specified in the ARwt0 register, and stores the weights at weights register Wt0. Address generators 532 also increment the first memory address in the ARin0 register to generate a second memory address, and increment the first buffer address in the ARwt0 register to generate a second buffer address. The fetching of input data from memory 512 can continue in time slot 1.
At time slot 2, the fetching of input data from memory 512 to input data register Din0 completes. Also, load/store controller 530 fetches a second set of weight elements at a second buffer address specified in the ARwt0 register, and stores the weights at weights register Wt1. Address generators 532 also increment the second buffer address in the ARwt0 register to generate a third buffer address. Computing engine 524 also executes sub-instruction 602 to perform a MAC computation between the input data stored in Din0 and a subset of weight elements in weights register Wt0 (Wt0L), while load/store controller 530 fetches weights to weights register Wt1.
At time slot 3, load/store controller 530 starts the fetching of input data at the second memory address of memory 512 specified in the ARin0 register. The input data are to be stored at input data register Din1. Address generators 532 also increment the second memory address in the ARin0 register to generate a third memory address. Computing engine 524 also executes sub-instruction 602 to perform a MAC computation between the input data stored in Din0 and the remainder of weight elements in weights register Wt0 (Wt0H). Load/store controller 530 holds off on fetching weights from the third buffer address at time slot 3 because the weights in weights registers Wt0 and Wt1 are either in use (at time slot 3) or yet to be used.
At time slot 4, the fetching of input data at the second memory address of memory 512 continues. The MAC computation between the input data stored in Din0 and the remainder of weight elements in weights register Wt0 completes. Accordingly, load/store controller 530 fetches a third set of weight elements from the third buffer address to weights register Wt0. Address generators 532 also increment the third buffer address to a fourth buffer address. Computing engine 524 also executes sub-instruction 602 to perform a MAC computation between the input data stored in Din0 and a subset of weight elements in weights register Wt1 (Wt1L).
The fetching of input data to Din1 completes at time slot 5. However, load/store controller 530 holds off on fetching of input data from the third memory address to input data register Din0 because computing engine 524 is still operating on the input data in Din0. At time slot 5 computing engine 524 also executes sub-instruction 602 to perform a MAC computation between the input data stored in Din0 and the remainder of weight elements in weights register Wt1 (Wt1H).
At time slot 6, load/store controller 530 fetches a fourth set of weight elements from the fourth buffer address to weights register Wt1, while computing engine 524 executes sub-instruction 602 to perform a MAC computation between the input data stored in Din0 and a first set of weights stored in weights register Wt0 (Wt0L), which load/store controller 530 updated at time slot 4. Address generators 532 also increment the fourth buffer address to the fifth buffer address. At time slot 7, computing engine 524 also executes sub-instruction 602 to perform a MAC computation between the input data stored in Din0 and the remainder of weight elements in weights register Wt0 (Wt0H).
At time slot 8, while computing engine 524 executes sub-instructions 602 to perform MAC computations between the input data stored in Din0 and weights elements in weights registers Wt1 (Wt1L and Wt1H), load/store controller 530 fetches weights at the fifth buffer address to weight register Wt0. The weights fetched to weight registers Wt0 at time slot 8 are to be used by computing engine 524 for MAC operations with the input data stored in Din1. Address generators 532 also increment the fifth buffer address to generate a sixth buffer address.
At time slot 9, computing engine 524 also executes sub-instruction 602 to perform a MAC computation between the input data stored in Din0 and the remainder of weight elements in weights register Wt1 (Wt1H), and completes a first set of the MAC computations between Din0 and the weights. The results of the MAC computations are stored in data registers MACreg0-MACreg7.
At time slot 10, load/store controller 530 fetches weights at the sixth buffer address to weight register Wt1, while computing engine 524 executes sub-instructions 602 to perform MAC computations between the input data stored in Din1 and a subset of weights elements in weights registers Wt0 (Wt0L).
At time slot 11, load/store controller 530 fetches input data at the third memory address of memory 512 to register Din0. Address generators 532 also increment the third memory address to generate a fourth memory address, and computing engine 524 executes sub-instructions 602 to perform MAC computations between the input data stored in Din1 and the remainder of weights elements in weights registers Wt0 (Wt0H).
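For illustration, the overlap of weight fetches and MAC computations in the walkthrough above can be reduced to the following simplified Python sketch, which alternates between two weights registers so that the next group of weights is prefetched while the current group is consumed. The function names and group handling are illustrative, and the sketch omits the input-data double buffering and the per-slot timing shown above.

    def mac(din, wt):
        return sum(d * w for d, w in zip(din, wt))

    def run_with_double_buffered_weights(din, weight_groups):
        regs = {"Wt0": None, "Wt1": None}          # two weights registers
        groups = iter(weight_groups)
        regs["Wt0"] = next(groups, None)           # initial fetch into Wt0
        current, other = "Wt0", "Wt1"
        partial_sums = []
        while regs[current] is not None:
            regs[other] = next(groups, None)       # prefetch next group while computing
            partial_sums.append(mac(din, regs[current]))
            current, other = other, current        # ping-pong the registers
        return partial_sums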
As shown in
Referring to chart 800, memory 512 may store 16 sets of input data elements (fm0), with each set of input data elements including four input data elements for four different channels (c0, c1, c2, and c3). Each row in chart 800 can be associated with a memory address, and rows can be associated with consecutive addresses. However, the space for storing 17 sets of input data elements can be allocated, with row 16 (memory space 802) being empty and available to store one set of output data elements (fm1).
At time T0, the initial read address is at row 0, as indicated by a pointer 804. A first set of input data elements in row 0 are fetched to an input data register (e.g., Din0), which are then fetched to computing engine 524 for a computation operation. The initial write address for the first set of output data elements of the computation operation is associated with memory space 802, as indicated by a pointer 806. Such arrangements can prevent the output data overwriting the first set of input data elements, which may still be in use. The initial read and write addresses can be set using sub-instruction 606. After fetching the first set of input data elements, address generators 532 can increment the read address by one responsive to another sub-instruction 606, so that pointer 804 can point to row 1 subsequently.
At time T1, a second set of input data elements (fm0_01c0-fm0_01c3) in row 1 are fetched to the input data register, as indicated by pointer 804. After fetching the second set of input data elements, address generators 532 can increment the read address by one responsive to a sub-instruction 606, so that pointer 804 can point to row 2 subsequently. A first set of output data elements (fm1_00c0-fm1_00c3) is stored in row 16, as indicated by pointer 806. After storing the first set of output data elements in row 16, address generators 532 can increment the write address by one responsive to sub-instruction 606. However, to support in-place computation, circular addressing can be provided for output data, so that the incremented write address can wrap around and point to row 0, so that a next set of output data elements can overwrite the first set of input data elements.
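For illustration, the in-place scheme of chart 800 can be sketched as follows in Python: reads start at row 0, writes start at the spare row 16, and the write address wraps to row 0 so that each new set of output data elements overwrites a set of input data elements that has already been consumed. This is a behavioral sketch of the addressing pattern, not the sub-instruction sequence itself.

    NUM_ROWS = 17                     # 16 rows of input data plus one spare output row

    def next_row(row):
        return (row + 1) % NUM_ROWS   # circular addressing: wrap past the last row

    read_row, write_row = 0, 16       # initial read and write pointers (804, 806)
    for step in range(16):
        # fetch input data at read_row, compute, then store output data at write_row
        read_row = next_row(read_row)
        write_row = next_row(write_row)   # row 16 -> row 0 -> row 1 -> ...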
Referring to
For example, for voice command recognition, the input speech samples can be grouped into 40 millisecond (ms) frames, and 8 frequency domain features can be extracted for each of the frames. Thus for a command length of 1 second, features are extracted for 25 frames, resulting in input features with 25 locations and 8 channels per location. In a case where inferencing operation is performed for every 120 ms, the new input features to be processed by the inferencing operation can have three new sets of features (at three locations, 8 channels per location) relative to the previous set of input features. The overlapping 22 sets of input features can be moved in the memory so that the input features are fit within the same set of allocated memory addresses, but such data movement adds to the latency in memory access and power consumption by the memory.
Instead of moving the overlapping input data in memory 512, the new input data can be stored (e.g., by sensor 104, by processor 514, etc.) in memory 512 following a circular addressing scheme, and address generators 532 can also update the input data addresses for fetching the input data based on the same circular addressing scheme. Chart 900 in
At time T0, memory 512 stores 25 initial sets of input data elements (fm0_00*-fm0_24*). The first address for fetching the 25 sets of input data elements to perform a first inferencing operation is at row 0, indicated by pointer 902. Address generators 532 can provide the initial input data address as the address associated with row 0, and increment the input data address responsive to sub-instructions 606, and load/store controller 530 can fetch the 25 sets of input data elements from memory 512 from the input data addresses.
At time T1, three new sets of input data elements (fm0_25*-fm0_27*) are stored in memory 512. To avoid movement of the rest of the initial sets of input data elements, the new sets of input data elements are stored in memory addresses associated with rows 0, 1, and 2, and overwrite input data elements fm0_00*-fm0_02*. The first address for fetching the 25 sets of input data elements to perform a second inferencing operation is at row 3, indicated by pointer 902. Address generators 532 can provide the initial input data address as the address associated with row 3, and increment the input data address responsive to sub-instructions 606, and load/store controller 530 can fetch the 25 sets of input data elements from memory 512 from the input data addresses. But to read the new sets of input data elements, address generators 532 also implement a circular addressing scheme for input data, where the input data address wraps around after incrementing beyond the address associated with row 24 (storing input data elements fm0_24*), restarts at row 0, and ends at row 2.
At time T2, another three new sets of input data elements (fm0_28*-fm0_30*) are stored in memory 512. To avoid movement of the rest of the initial sets of input data elements, the new sets of input data elements are stored in memory addresses associated with rows 3, 4, and 5, and overwrite input data elements fm0_03*-fm0_05*. The first address for fetching the 25 sets of input data elements to perform a third inferencing operation is at row 6, indicated by pointer 902. Address generators 532 can provide the initial input data address as the address associated with row 6, and increment the input data address responsive to sub-instructions 606, and load/store controller 530 can fetch the 25 sets of input data elements from memory 512 from the input data addresses. But to read the new sets of input data elements, address generators 532 also implement a circular addressing scheme for input data, where the input data address wraps around after incrementing beyond the address associated with row 24 (storing input data elements fm0_24*), restarts at row 0, and ends at row 5.
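For illustration, the circular input addressing of chart 900 can be sketched as follows in Python, using the 25-row buffer and three new frames per inferencing operation from the example above; the helper name is illustrative.

    NUM_ROWS, NEW_PER_INFERENCE = 25, 3

    def read_rows(start_row):
        # rows visited by one inferencing operation, in fetch order, wrapping past row 24
        return [(start_row + i) % NUM_ROWS for i in range(NUM_ROWS)]

    start = 0
    for inference in range(3):
        rows = read_rows(start)          # starts at row 0, then row 3, then row 6
        start = (start + NEW_PER_INFERENCE) % NUM_ROWS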
Referring to
Multiplexer 1002 can receive a start input data address (“start-in” in
Also, difference circuit 1012 has a first input coupled to address input 1040 and a second input coupled to the output of multiplexer 1002. Difference circuit 1014 has a first input coupled to address input 1040 and a second input coupled to the output of multiplexer 1004. Difference circuit 1012 can generate an offset 1050 between the start address and ARw0, which also indicates whether ARw0 is above or below the start address. Also, difference circuit 1014 can generate an offset 1052 between the end address and ARw0, which also indicates whether ARw0 is above or below the end address. Further, difference circuit 1016 can generate an address 1070 by subtracting offset 1050 from incremented end address 1060, and summation circuit 1020 can generate an address 1072 by adding offset 1052 to decremented start address 1062. Address 1070 can be the wrap-around address ARw0_cir0 if ARw0 is below the start address, and address 1072 can be the wrap-around address ARw0_cir1 if ARw0 is above the end address.
Multiplexers 1006 and 1008 can selectively forward one of the input address ARw0, address 1070 (ARw0_cir0), or address 1072 (ARw0_cir1) to address output 1042. The selection is performed by multiplexer control circuits 1030 and 1032. Specifically, multiplexer control circuits 1030 and 1032 can each receive an indication of whether circular addressing is enabled (“circ enable” in
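For illustration, the address selection described above can be summarized by the following behavioral Python sketch, which reproduces the offset arithmetic of difference circuits 1012, 1014, and 1016 and summation circuit 1020; it is an interpretation of the datapath under the standard circular-buffer wrap relations, not a gate-level description.

    def wrapped_address(arw0, start_addr, end_addr, circ_enable=True):
        if not circ_enable or start_addr <= arw0 <= end_addr:
            return arw0                          # forward ARw0 unchanged
        if arw0 < start_addr:                    # fell below the region after a decrement
            offset = start_addr - arw0           # offset 1050
            return (end_addr + 1) - offset       # address 1070 (ARw0_cir0)
        offset = arw0 - end_addr                 # offset 1052, ARw0 above the region
        return (start_addr - 1) + offset         # address 1072 (ARw0_cir1)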
As discussed above, neural network processor 502 supports sub-instructions including flow control elements, such as loop elements and macro elements, that allow compact representation of convolutional layer computations, such as those shown in
Also, post processing engine 1302 can receive intermediate output data elements from the MAC registers. Responsive to a control signal 1305 from computation controller 522, which generates the control signal responsive to sub-instruction 602 indicating a BNorm operation, post processing engine 1302 can perform post processing operations on the intermediate output data (e.g., BNorm and residual layer processing) to generate output data, and store the output data at output data register (e.g., DOUT) of data registers 528a. As to be described below, post processing engine 1302 can also perform a data packing operation at the output data register, and transmit a signal to load/store controller 530 to store the output data from the output data register back to memory 512 upon completion of the data packing operation. Such arrangements can reduce neural network processor 502's access to memory 512 in writing back output data, which can reduce memory usage and power consumption. In addition, post processing engine 1302 can also receive input data elements from input data registers, and perform residual mode operation based on the input data elements.
As described above, computing engine 524 is configurable to perform MAC and post processing operations for weights and input data across a range of bit precisions (e.g., binary precision, ternary precision, 4-bit precision, 8-bit precision, etc.) based on parameters provided by processor 514. Computing engine 524 includes weight multiplexer 1304 and input data multiplexer 1306 to support the precision configurability. Specifically, depending on an input data and weights configuration 1310, weight multiplexer 1304 can either fetch all of the weight elements stored in a weight register (e.g., one of weights-0 or weights-1 registers), or duplicates of half of the stored weight elements, as packed data having a pre-determined number of bits (e.g., 32 bits). Also, depending on configuration 1311, weight multiplexer 1304 can perform processing on the weight elements to support various operations, such as depthwise convolution operation and average pooling. For depthwise convolution, weight multiplexer 1304 can select one of the 8-bit weights stored in the weight register, split the 8-bit weights into groups (e.g., four groups) of weight elements, and pad each weight element group with zeros, so that MAC engine 1300 can multiply input data elements of specific channels with zero. Such arrangements can ensure that the intermediate output for a particular channel is based on MAC operations only on input data of that channel. As another example, for average pooling operation, weight multiplexer 1304 can selectively forward zeros and ones as weight elements, where input data elements paired with weight elements of one are represented in the intermediate output data elements, and input data elements paired with zero weight elements are zeroed out and not represented in the intermediate output data elements. Configuration 1311 can indicate a layer type which can also indicate whether a depthwise convolution operation is to be performed. Configuration 1311 can also indicate whether an average pooling operation is to be performed. Configuration 1311 can be based on configuration data stored in configuration registers 528d and/or a control signal from the computation controller 522 responsive to an instruction.
Also, input data multiplexer 1306 can either fetch all of the input data elements stored in an input data register (e.g., one of Din0 or Din1), or duplicates of half of the stored input data elements, as packed data having a pre-determined number of bits (e.g., 32 bits). In some examples, input data and weights configuration 1310 can include an 8-bit mode (D8) or a 4-bit mode (D4), which indicates whether computing engine 524 fetches input data in 8-bit form or in 4-bit form. Input data and weights configuration 1310 can be stored in and received from configuration registers 528d. Whether computing engine 524 operates in D8 or D4 mode can depend on the input and weight precisions, which can also determine a number of input data elements and a number of weight elements to be fetched to MAC engine 1300 at a time (e.g., in one clock cycle).
Also, MAC engine 1300 and post processing engine 1302 can receive an input and weight precision configuration 1312. Depending on the input data precision and weight precision, the arithmetic circuits and logic circuits of MAC engine 1300 and post processing engine 1302 can handle the computations differently. Post processing engine 1302 also receives post processing parameters 1314 that can define, for example, the parameters for the post processing operations, some or all of which may also depend on the input and/or weights precisions. Input and weight precision configuration 1312 and some of post processing parameters 1314 can be received from configuration register 528d. Some of post processing parameters 1314, such as shift and scale, may vary between different internal components of post processing engine 1302. These parameters may be fetched from weights/parameters buffer 526 to weights/parameters registers 528b, and post processing engine 1302 can receive those parameters from weights/parameters registers 528b.
In addition, computing engine 524 may also include a max pooling engine 1320 to perform a max pooling operation on the input data elements stored in input data registers (DIN0, DIN1) and output data elements stored in output data register (DOUT), and store the max pooling result back at output data register (DOUT). Specifically, max pooling engine 1320 can overwrite an output data element in DOUT with an input data element at DIN0/DIN1 if the input data element has a higher value than the output data element. Computation controller 522 can provide a control signal 1322 to max pooling engine 1320 responsive to, for example, field 602a of sub-instruction 602 indicating a max pooling operation to be performed. Max pooling engine 1320 can then perform the max pooling operation responsive to control signal 1322. Max pooling engine 1320 also receives post processing parameters 1314 and can configure the max pooling operation based on the parameters. In some examples, max pooling engine 1320 can operate independently or in parallel with MAC engine 1300 and post processing engine 1302, which can minimize the disruption of the max pooling operation on the operations by the rest of computing engine 524 and improve efficiency.
In
In
In
MAC engine 1300 can generate a partial sum for each of intermediate output data elements Y[0], Y[1], Y[2], and Y[3] each associated with a different output channel based on multiplication of input data X[0]-X[3] and weights associated with a particular output channel and the input channels of X[0]-X[3]. For example, for an intermediate output data element Y[0], MAC engine 1300 can compute a multiplication product between W[0,3] and X[3], a multiplication product between W[0,2] and X[2], a multiplication product between W[0,1] and X[1], and a multiplication product between W[0,0] and X[0], and perform an accumulation operation by summing the multiplication products to a prior partial sum Y0′ from one of intermediate output data registers MACreg0-MACreg3 to generate a new partial sum Y0, and the new partial sum Y0 can be stored in MACreg0-MACreg3 in place of the prior partial sum Y0′. The summation can be saturating summation. In a case of a first instruction for a convolution operation, a bias value (e.g., Bias0) can be fetched from another set of intermediate output data registers MACreg4-MACreg7 via a multiplexer (labelled MUX in
MAC engine 1330 can perform 16 multiplication operations, such as multiplication operation 1602, to generate the multiplication products in one clock cycle. In some examples, MAC engine 1330 can include 16 multiplier circuits to perform the 16 multiplication operations in parallel. MAC engine 1330 can also update the prior partial sum Y′[0] stored in the MAC register by adding the sum of the multiplication products to Y′[0].
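For illustration, the accumulation described above can be sketched in Python as follows for four input channels and four output channels, with each step adding 16 multiplication products to the partial sums held in the MAC registers; the values and dimensions are illustrative.

    def mac_step(X, W, partial_sums):
        # X: four input data elements (one per input channel)
        # W[m][c]: weight for output channel m and input channel c
        for m in range(4):
            partial_sums[m] += sum(W[m][c] * X[c] for c in range(4))   # 16 multiplies total
        return partial_sums

    partials = [0, 0, 0, 0]    # seeded with bias values on the first instruction
    partials = mac_step([1, 2, 3, 4], [[1] * 4, [2] * 4, [3] * 4, [4] * 4], partials)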
Table 1 below illustrates a set of input precisions and weight precisions supported by computing engine 524. Each row also indicates, for a given input precision and weight precision, a number of input data elements processed by computing engine 524 per clock cycle, a number of output data elements provided by computing engine 524 per clock cycle, and a number of multiplication and accumulation (MAC) operations performed by computing engine 524 per clock cycle. The number of MACs per cycle may be equal to a product between a number of input data elements processed and a number of output data elements generated per clock cycle.
In some examples, as to be described below, MAC engine 1330 can include an array of 32 4-bit-by-2-bit multiplier circuits, where each multiplier circuit can perform a multiplication operation (or bitwise operation) between a 4-bit number and a 2-bit number per clock cycle, so the 32 multiplier circuits can perform 32 such multiplications per clock cycle.
For 8-bit input data elements and 2-bit weight elements (binary or ternary), each input data element can be split into two 4-bit data values, where two multiplier circuits perform operations for one 8-bit data element and one 2-bit weight element. Accordingly, the array of 32 multiplier circuits can perform 32/2 computation operations of 8-bit input data and 2-bit weights per clock cycle.
For 8-bit input data elements and 4-bit weight elements, each input data element can be split into two 4-bit data values, and each 4-bit weight element can be split into two 2-bit weight values, where four multiplier circuits perform operations for one 8-bit data element and one 4-bit weight element. Accordingly, the array of 32 multiplier circuits can perform 32/(2*2) computation operations of 8-bit input data and 4-bit weights per clock cycle.
For 8-bit input data elements and 8-bit weight elements, each input data element can be split into two 4-bit data values, and each weight element can be split into four 2-bit weight values, where eight multiplier circuits perform operations for one 8-bit data element and one 8-bit weight element. Accordingly, the array of 32 multiplier circuits can perform 32/(2*4) computation operations of 8-bit input data and 8-bit weights per clock cycle.
For 4-bit input data elements and 4-bit weight elements, each weight element can be split into two 2-bit weight values, where two multiplier circuits perform operations for one 4-bit data element and one 4-bit weight element. Accordingly, the array of 32 multiplier circuits can perform 32/2 computation operations of 4-bit input data and 4-bit weights per clock cycle.
For binary weights and data, MAC engine 1330 (or computing engine 524) can internally convert a binary weight to a 2-bit weight. As to be described below, each multiplier circuit can perform four bit-wise computation operations (e.g., XNOR) between the binary weights and data, and the array of 32 multiplier circuits can perform 128 computation operations of the binary weights and data per clock cycle.
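The XNOR-based binary mode can be related to the common binary-network identity that a dot product of +1/-1 values equals the number of matching sign bits minus the number of mismatching ones. The following Python sketch illustrates that identity only; it is an assumption about how XNOR-based accumulation is conventionally interpreted, not a description of the processor's circuit, and the packing convention and function name are illustrative.

    def binary_dot(a_bits, b_bits, n):
        # Dot product of two length-n vectors of +1/-1 values packed as bits
        # (bit value 1 encodes +1, bit value 0 encodes -1).
        xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)  # 1 wherever the signs match
        matches = bin(xnor).count("1")
        return 2 * matches - n                      # matches minus mismatches

    # a = [+1, -1, +1, +1] packed as 0b1011, b = [+1, +1, -1, +1] packed as 0b1101
    print(binary_dot(0b1011, 0b1101, 4))  # prints 0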
Multiplier circuit 1700 can receive multiplier configuration 1710 to configure each of multiplier circuits 1702 and 1704 to perform multiplication operations in various modes. Multiplier configuration 1710 can be part of precision configuration 1312 and can include a first flag that indicates whether D[7:4] is signed (e.g., D[7:4] is signed if the flag is asserted or represents a logical one, and unsigned if the flag is deasserted or represents a logical zero), a second flag that indicates whether D[3:0] is signed (both based on input precision), a third flag that indicates whether W[3:2] is signed, a fourth flag that indicates whether W[1:0] is signed (both based on weight precision), and a fifth flag that indicates whether to operate in binary mode. Multiplier circuits 1702 and 1704 can operate in binary mode if precision configuration 1312 indicates that both input data and weights have one-bit (binary) precisions.
If multiplier configuration 1710 indicates that binary mode is disabled, multiplier circuit 1702 can generate an output N0 as a 7-bit signed number by performing a multiplication operation between four-bit LSBs D[3:0] and two-bit LSBs W[1:0], and multiplier circuit 1704 can generate an output N1 as a 7-bit signed number by performing a multiplication operation between four-bit MSBs D[7:4] and two-bit MSBs W[3:2]. The multiplication operation for N0 can be a signed multiplication operation if at least one of D[3:0] or W[1:0] is signed, and the multiplication operation for N1 can be a signed multiplication operation if at least one of D[7:4] or W[3:2] is signed, based on multiplier configuration 1710.
On the other hand, if multiplier configuration 1710 indicates that binary mode is enabled, multiplier circuit 1702 can generate an 8-bit output N0 by performing a bitwise XNOR operation between D[3:0] and W[3:0] (e.g., (D[3]⊕W[3])′ in
MAC engine 1300 can include multiple computation units, each including a set of multiplier circuits 1700 and other logic circuits, to perform MAC operations to generate an intermediate output data element for a range of input precisions and weight precisions described in Table 1.
Computation unit 1800 also includes adders 1802a, 1802b, and 1802c, and adders 1804a, 1804b, and 1804c. Adders 1802a, 1802b, and 1802c generate a sum of N1a, N1b, N1c, and N1d as MAC_L, which has 9 bits and is represented as a signed number when operating in non-binary mode. Also, adders 1804a, 1804b, and 1804c generate a sum of N0a, N0b, N0c, and N0d as MAC_R, which also has 9 bits and is represented as a signed number in non-binary mode. Computation unit 1800 also includes a bit shifter circuit 1806 that can perform a left shift of MAC_L by a number of bits specified in shift control 1808 to generate MAC_L′. Computation unit 1800 can receive shift control 1808 from configuration registers 528c. As to be described below, the amount of left bit shift is based on the input and weight precisions. Further, computation unit 1800 includes an adder 1809 that generates a sum of MAC_R and MAC_L′ as a MAC output MAC_out, which has 13 bits and can be represented as a signed number in non-binary mode. MAC_L and MAC_R can also be provided as the MAC outputs.
Computation unit 1800 also includes an accumulator 1810 that can receive MACreg_in (18 bits) as the old partial sum and update the old partial sum by adding the MAC output to it to generate the new partial sum MACreg_out.
Referring to
MACreg_out[17:0]=Saturate((MACreg_in[17:0]+MAC_out[12:0]),18 bits) (Equation 2)
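The saturating update of Equation 2 can be illustrated with a minimal Python sketch, which assumes signed two's-complement ranges and is not the RTL of accumulator 1810: the 18-bit partial sum is updated with the 13-bit MAC output and clipped to the signed 18-bit range.

    def saturate(value, bits):
        hi = (1 << (bits - 1)) - 1
        lo = -(1 << (bits - 1))
        return max(lo, min(hi, value))

    def accumulate(macreg_in, mac_out):
        # Equation 2: clip the updated partial sum to the signed 18-bit range.
        return saturate(macreg_in + mac_out, 18)

    print(accumulate(131000, 4000))  # clips to 131071, the signed 18-bit maximum
    print(accumulate(-100, 250))     # prints 150, no saturation needed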
On the other hand, if MAC engine 1300 operates in the binary mode, each of N0a, N0b, N0c, and N0d, and each of N1a, N1b, N1c, and N1d, can take a maximum value of 4, and MAC_L and MAC_R can each take a maximum value of 16. Adders 1902 and 1904 can be two separate 9-bit adders. Adder 1902 can generate the 9-bit MSBs of MACreg_out by summing the 9-bit MSBs of MACreg_in and the 6-bit LSBs of MAC_L. Adder 1904 can generate the 9-bit LSBs of MACreg_out by summing the 9-bit LSBs of MACreg_in and the 6-bit LSBs of MAC_R. Because there is no carry over from adder 1904 to adder 1902, adders 1902 and 1904 can perform the additions in parallel and speed up the updating of the partial sum.
Also, MAC engine 1300 includes groups of first, second, third, and fourth MAC weights inputs. Each computation unit is coupled to a respective group of first, second, third, and fourth MAC weights inputs, where the first computation weights input is coupled to the first MAC weights input of the group, the second computation weights input is coupled to the second MAC weights input of the group, the third computation weights input is coupled to the third MAC weights input of the group, and the fourth computation weights input is coupled to the fourth MAC weights input of the group. Each computation unit can receive four 4-bit weights from weight multiplexer 1304. For example, computation unit 1800a receives 4-bit weights W0a, W1a, W2a, and W3a at, respectively, the first, second, third, and fourth computation weights inputs of computation unit 1800a. Computation unit 1800b receives 4-bit weights W0b, W1b, W2b, and W3b at, respectively, the first, second, third, and fourth computation weights inputs of computation unit 1800b. Computation unit 1800c receives 4-bit weights W0c, W1c, W2c, and W3c at, respectively, the first, second, third, and fourth computation weights inputs of computation unit 1800c. Also, computation unit 1800d receives 4-bit weights W0d, W1d, W2d, and W3d at, respectively, the first, second, third, and fourth computation weights inputs of computation unit 1800d. In
Weight multiplexer 1304 can selectively forward weights from one of the weight-0 register or the weight-1 register. In D4 mode, computation units 1800a and 1800b can receive the top-half 32 bits of the selected weight register, and computation units 1800c and 1800d can receive the bottom-half 32 bits of the selected weight register. In D8 mode, computation units 1800a and 1800b can receive duplicates of half of the top-half bits (16 bits) of the selected weight register, and computation units 1800c and 1800d can receive duplicates of half of the bottom-half bits (16 bits) of the selected weight register. The computation units 1800a-d can store intermediate output data elements in one MAC register, where each intermediate output data element can have 18 bits.
Y0=(W0*X0H+W1*X1H+W2*X2H+W3*X3H)<<4+(W0*X0L+W1*X1L+W2*X2L+W3*X3L) (Equation 3)
In Equation 3, each of W0, W1, W2, and W3 is a 2-bit weight element. X0, X1, X2, and X3 each is an 8-bit input data element. X0H, X1H, X2H, and X3H are, respectively, the 4-bit MSBs (also represented as D[7:4] bits) of X0, X1, X2, and X3, whereas X0L, X1L, X2L, and X3L are, respectively, the 4-bit LSBs (also represented as D[3:0] bits) of X0, X1, X2, and X3. Multiplier circuit 1700a (of computation unit 1800a) generates W0*X0H as N1a and W0*X0L as N0a, multiplier circuit 1700b generates W1*X1H as N1b and W1*X1L as N0b, multiplier circuit 1700c generates W2*X2H as N1c and W2*X2L as N0c, and multiplier circuit 1700d generates W3*X3H as N1d and W3*X3L as N0d. Shift control 1808 can control bit shifter circuit 1806a to left shift MAC_L by 4 bits. MAC registers multiplexer 1303 can selectively connect the output of accumulator 1810a to MAC registers and bypass data merge circuit 2002.
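The split in Equation 3 follows from writing each 8-bit data element as its signed high nibble times 16 plus its unsigned low nibble. The following Python fragment is a numeric check of that identity under assumed example values, not the hardware implementation; the helper name and the data values are illustrative.

    def split_data(x):
        # Split a signed 8-bit value into a signed high nibble and an unsigned
        # low nibble, so that x == (x >> 4) * 16 + (x & 0xF).
        return x >> 4, x & 0xF

    W = [1, -2, 0, 1]       # example 2-bit weight elements
    X = [100, -57, 23, -8]  # example 8-bit input data elements

    reference = sum(w * x for w, x in zip(W, X))
    high = sum(w * split_data(x)[0] for w, x in zip(W, X))  # MAC_L terms
    low = sum(w * split_data(x)[1] for w, x in zip(W, X))   # MAC_R terms
    assert (high << 4) + low == reference                   # Equation 3
    print(reference)                                        # prints 206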
Y0=(W0H*X0+W1H*X1+W2H*X2+W3H*X3)<<2+(W0L*X0+W1L*X1+W2L*X2+W3L*X3) (Equation 4)
In Equation 4, each of W0, W1, W2, and W3 is a 4-bit weight element. X0, X1, X2, and X3 each is a 4-bit input data element. W0H, W1H, W2H, and W3H are, respectively, the 2-bit MSBs of W0, W1, W2, and W3, whereas W0L, W1L, W2L, and W3L are, respectively, the 2-bit LSBs of W0, W1, W2, and W3. Multiplier circuit 1700a receives X0 and X4 (from input data multiplexer 1306 operating in D4 mode) as an 8-bit input and W0 as a 4-bit input and generates X0*W0H as N1a and X4*W0L as N0a. Also, multiplier circuit 1700b generates X1*W1H as N1b and X5*W1L as N0b, multiplier circuit 1700c generates X2*W2H as N1c and X6*W2L as N0c, and multiplier circuit 1700d generates X3*W3H as N1d and X7*W3L as N0d. Shift control 1808 can control bit shifter circuit 1806a to left shift MAC_L by 2 bits. MAC registers multiplexer 1303 can selectively connect the output of accumulator 1810a to MAC registers and bypass data merge circuit 2002.
In a subsequent cycle (not shown in
Y0=(W0*X0+W2*X2+W4*X4+W6*X6)+(W1*X1+W3*X3+W5*X5+W7*X7) (Equation 5)
In Equation 5, each of W0, W1, W2, W3, W4, W5, W6, and W7 is a 2-bit weight element. X0, X1, X2, X3, X4, X5, X6, and X7 each is a 4-bit input data element (and represented as D[3:0] bits). Multiplier circuit 1700a receives X0 and X1 as an 8-bit input and W0 and W1 as a 4-bit input and generates W0*X0 as N1a and W1*X1 as N0a. Also, multiplier circuit 1700b generates W2*X2 as N1b and W3*X3 as N0b, multiplier circuit 1700c generates W4*X4 as N1c and W5*X5 as N0c, and multiplier circuit 1700d generates W6*X6 as N1d and W7*X7 as N0d. Shift control 1808 can control bit shifter circuit 1806a not to left shift MAC_L. Accordingly, bit shifter circuit 1806a is omitted in
Y0H=(W0H*X0H+W1H*X1H+W2H*X2H+W3H*X3H)<<4+(W0H*X0L+W1H*X1L+W2H*X2L+W3H*X3L) (Equation 6)
Y0L=(W0L*X0H+W1L*X1H+W2L*X2H+W3L*X3H)<<4+(W0L*X0L+W1L*X1L+W2L*X2L+W3L*X3L) (Equation 7)
Y0=Y0H<<2+Y0L (Equation 8)
Y1H=(W4H*X0H+W5H*X1H+W6H*X2H+W7H*X3H)<<4+(W4H*X0L+W5H*X1L+W6H*X2L+W7H*X3L) (Equation 9)
Y1L=(W4L*X0H+W5L*X1H+W6L*X2H+W7L*X3H)<<4+(W4L*X0L+W5L*X1L+W6L*X2L+W7L*X3L) (Equation 10)
Y1=Y1H<<2+Y1L (Equation 11)
In Equations 6-11, each of W0, W1, W2, W3, W4, W5, W6, and W7 is a 4-bit weight element. X0, X1, X2, and X3 each is an 8-bit input data element. W0H, W1H, W2H, W3H, W4H, W5H, W6H, W7H are, respectively, the 2-bit MSBs of W0, W1, W2, W3, W4, W5, W6, and W7, whereas W0L, W1L, W2L, W3L, W4L, W5L, W6L, W7L are, respectively, the 2-bit LSBs of W0, W1, W2, W3, W4, W5, W6, and W7. Also, X0H, X1H, X2H, and X3H are, respectively, the 4-bit MSBs of X0, X1, X2, and X3 (also represented as D[7:4] bits), whereas X0L, X1L, X2L, and X3L are, respectively, the 4-bit LSBs (also represented as D[3:0] bits) of X0, X1, X2, and X3.
Multiple computation units 1800 can be involved in computing Y0H and Y0L. For example, the first computation weights input of computation unit 1800a can receive duplicates of W0H, the second computation weights input of computation unit 1800a can receive duplicates of W1H, the third computation weights input of computation unit 1800a can receive duplicates of W2H, and the fourth computation weights input of computation unit 1800a can receive duplicates of W3H. Also, the first computation weights input of computation unit 1800b can receive duplicates of W0L, the second computation weights input of computation unit 1800b can receive duplicates of W1L, the third computation weights input of computation unit 1800b can receive duplicates of W2L, and the fourth computation weights input of computation unit 1800b can receive duplicates of W3L.
Further, the first computation weights input of computation unit 1800c can receive duplicates of W4H, the second computation weights input of computation unit 1800c can receive duplicates of W5H, the third computation weights input of computation unit 1800c can receive duplicates of W6H, and the fourth computation weights input of computation unit 1800c can receive duplicates of W7H. The first computation weights input of computation unit 1800d can receive duplicates of W4L, the second computation weights input of computation unit 1800d can receive duplicates of W5L, the third computation weights input of computation unit 1800d can receive duplicates of W6L, and the fourth computation weights input of computation unit 1800d can receive duplicates of W7L. The first computation data input of each of computation units 1800a-1800d can receive X0, the second computation data input of each of computation units 1800a-1800d can receive X1, the third computation data input of each of computation units 1800a-1800d can receive X2, and the fourth computation data input of each of computation units 1800a-1800d can receive X3.
Computation unit 1800a can compute W0H*X0H+W1H*X1H+W2H*X2H+W3H*X3H and W0H*X0L+W1H*X1L+W2H*X2L+W3H*X3L, and computation unit 1800b can compute W0L*X0H+W1L*X1H+W2L*X2H+W3L*X3H and W0L*X0L+W1L*X1L+W2L*X2L+W3L*X3L. The bit shifter circuits of the computation units (e.g., bit shifters 1806a and 1806b) can perform a left shift of four bits. The partial sum of Y0H is stored in a first MAC register (e.g., MACreg0), and the partial sum of Y0L can be stored in a second MAC register (e.g., MACreg1). Also, in the same cycle, computation unit 1800c can compute W4H*X0H+W5H*X1H+W6H*X2H+W7H*X3H and W4H*X0L+W5H*X1L+W6H*X2L+W7H*X3L, and computation unit 1800d can compute W4L*X0H+W5L*X1H+W6L*X2H+W7L*X3H and W4L*X0L+W5L*X1L+W6L*X2L+W7L*X3L, to generate the partial sums of Y1H and Y1L. The partial sums of Y1H and Y1L can be stored in a third MAC register (e.g., MACreg2) and a fourth MAC register (e.g., MACreg3). Accordingly, partial sums of two outputs (Y0 and Y1) can be generated per cycle.
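Equations 6-8 apply the same decomposition to both operands: each 4-bit weight is written as its signed 2-bit high half times 4 plus its unsigned 2-bit low half, and each 8-bit data element as its 4-bit halves. The following Python fragment is a numeric check under assumed example values, not the hardware implementation, confirming that the partial results recombine into the full-precision dot product.

    W = [5, -3, 7, -8]      # example 4-bit weight elements
    X = [100, -57, 23, -8]  # example 8-bit input data elements

    reference = sum(w * x for w, x in zip(W, X))

    # Equation 6: high weight halves against high and low data nibbles.
    y0h = (sum((w >> 2) * (x >> 4) for w, x in zip(W, X)) << 4) + \
          sum((w >> 2) * (x & 0xF) for w, x in zip(W, X))
    # Equation 7: low weight halves against high and low data nibbles.
    y0l = (sum((w & 0x3) * (x >> 4) for w, x in zip(W, X)) << 4) + \
          sum((w & 0x3) * (x & 0xF) for w, x in zip(W, X))
    # Equation 8: recombine the two partial results.
    assert (y0h << 2) + y0l == reference
    print(reference)        # prints 896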
Y0HH=(W0HH*X0H+W1HH*X1H+W2HH*X2H+W3HH*X3H)<<4+(W0HH*X0L+W1HH*X1L+W2HH*X2L+W3HH*X3L) (Equation 12)
Y0HL=(W0HL*X0H+W1HL*X1H+W2HL*X2H+W3HL*X3H)<<4+(W0HL*X0L+W1HL*X1L+W2HL*X2L+W3HL*X3L) (Equation 13)
Y0LH=(W0LH*X0H+W1LH*X1H+W2LH*X2H+W3LH*X3H)<<4+(W0LH*X0L+W1LH*X1L+W2LH*X2L+W3LH*X3L) (Equation 14)
Y0LL=(W0LL*X0H+W1LL*X1H+W2LL*X2H+W3LL*X3H)<<4+(W0LL*X0L+W1LL*X1L+W2LL*X2L+W3LL*X3L) (Equation 15)
Y0=Y0HH<<6+Y0HL<<4+Y0LH<<2+Y0LL (Equation 16)
In Equations 12-16, each of W0, W1, W2, and W3 is an 8-bit weight element, and X0, X1, X2, and X3 each is an 8-bit input data element. W0HH, W1HH, W2HH, and W3HH are, respectively, bits [7:6] of W0, W1, W2, and W3, W0HL, W1HL, W2HL, and W3HL are, respectively, bits [5:4] of W0, W1, W2, and W3, W0LH, W1LH, W2LH, and W3LH are, respectively, bits [3:2] of W0, W1, W2, and W3, and W0LL, W1LL, W2LL, and W3LL are, respectively, bits [1:0] of W0, W1, W2, and W3. Also, X0H, X1H, X2H, and X3H are, respectively, the 4-bit MSBs of X0, X1, X2, and X3 (also represented as D[7:4] bits), whereas X0L, X1L, X2L, and X3L are, respectively, the 4-bit LSBs of X0, X1, X2, and X3 (also represented as D[3:0] bits). Further, Y0HH, Y0HL, Y0LH, and Y0LL are the partial results computed with, respectively, bits [7:6], bits [5:4], bits [3:2], and bits [1:0] of the weight elements, and are combined according to Equation 16 to form Y0.
Multiple computation units 1800 can be involved in computing Y0HH, Y0HL, Y0LH, and Y0LL. For example, the first computation weights input of computation unit 1800a can receive duplicates of W0HH, the second computation weights input of computation unit 1800a can receive duplicates of W1HH, the third computation weights input of computation unit 1800a can receive duplicates of W2HH, and the fourth computation weights input of computation unit 1800a can receive duplicates of W3HH. Also, the first computation weights input of computation unit 1800b can receive duplicates of W0HL, the second computation weights input of computation unit 1800b can receive duplicates of W1HL, the third computation weights input of computation unit 1800b can receive duplicates of W2HL, and the fourth computation weights input of computation unit 1800b can receive duplicates of W3HL.
Further, the first computation weights input of computation unit 1800c can receive duplicates of W0LH, the second computation weights input of computation unit 1800c can receive duplicates of W1LH, the third computation weights input of computation unit 1800c can receive duplicates of W2LH, and the fourth computation weights input of computation unit 1800c can receive duplicates of W3LH. The first computation weights input of computation unit 1800d can receive duplicates of W0LL, the second computation weights input of computation unit 1800d can receive duplicates of W1LL, the third computation weights input of computation unit 1800d can receive duplicates of W2LL, and the fourth computation weights input of computation unit 1800d can receive duplicates of W3LL. The first computation data input of each of computation units 1800a-1800d can receive X0, the second computation data input of each of computation units 1800a-1800d can receive X1, the third computation data input of each of computation units 1800a-1800d can receive X2, and the fourth computation data input of each of computation units 1800a-1800d can receive X3.
Computation unit 1800a can compute W0HH*X0H+W1HH*X1H+W2HH*X2H+W3HH*X3H and W0HH*X0L+W1HH*X1L+W2HH*X2L+W3HH*X3L, computation unit 1800b can compute W0HL*X0H+W1HL*X1H+W2HL*X2H+W3HL*X3H and W0HL*X0L+W1HL*X1L+W2HL*X2L+W3HL*X3L, computation unit 1800c can compute W0LH*X0H+W1LH*X1H+W2LH*X2H+W3LH*X3H and W0LH*X0L+W1LH*X1L+W2LH*X2L+W3LH*X3L, and computation unit 1800d can compute W0LL*X0H+W1LL*X1H+W2LL*X2H+W3LL*X3H and W0LL*X0L+W1LL*X1L+W2LL*X2L+W3LL*X3L. Bit shifters 1806a-1806d of the computation units can each perform a left shift of four bits. The partial sum of Y0HH can be stored in a first MAC register (e.g., MACreg0), the partial sum of Y0HL can be stored in a second MAC register (e.g., MACreg1), the partial sum of Y0LH can be stored in a third MAC register (e.g., MACreg2), and the partial sum of Y0LL can be stored in a fourth MAC register (e.g., MACreg3). Computation units 1800 can compute Y0HH as a signed number and Y0HL, Y0LH, and Y0LL as unsigned numbers.
Y0HH=(W0HH*X0H)<<4+(W0HH*X0L) (Equation 17)
Y0HL=(W0HL*X0H)<<4+(W0HL*X0L) (Equation 18)
Y0LH=(W0LH*X0H)<<4+(W0LH*X0L) (Equation 19)
Y0LL=(W0LL*X0H)<<4+(W0LL*X0L) (Equation 20)
Y0=Y0HH<<6+Y0HL<<4+Y0LH<<2+Y0LL (Equation 21)
Y0=(W0*X0H)<<4+(W0*X0L) (Equation 22)
Y1=(W1*X1H)<<4+(W1*X1L) (Equation 23)
Y2=(W2*X2H)<<4+(W2*X2L) (Equation 24)
Y3=(W3*X3H)<<4+(W3*X3L) (Equation 25)
Y0H=(W0H*X0H)<<4+(W0H*X0L) (Equation 26)
Y0L=(W0L*X0H)<<4+(W0L*X0L) (Equation 27)
Y1H=(W1H*X1H)<<4+(W1H*X1L) (Equation 28)
Y1L=(W1L*X1H)<<4+(W1L*X1L) (Equation 29)
Y0=Y0H<<2+Y0L (Equation 30)
Y1=Y1H<<2+Y1L (Equation 31)
In a subsequent cycle (not shown in the figures), computation units 1800a and 1800b can receive non-zero weights of the third channel (W2) at the third computation weights input and zero weights for other channels at the other computation weights inputs, and computation units 1800c and 1800d can receive non-zero weights of the fourth channel (W3) at the fourth computation weights input and zero weights for other channels at the other computation weights inputs, and compute Y2 and Y3 as follows:
Y2H=(W2H*X2H)<<4+(W2H*X2L) (Equation 32)
Y2L=(W2L*X2H)<<4+(W2L*X2L) (Equation 33)
Y3H=(W3H*X3H)<<4+(W3H*X3L) (Equation 34)
Y3L=(W3L*X3H)<<4+(W3L*X3L) (Equation 35)
Y2=Y2H<<2+Y2L (Equation 36)
Y3=Y3H<<2+Y3L (Equation 37)
In addition, weights multiplexer 1304 also includes a depthwise and average pooling logic 2228. When enabled by a depthwise convolution enable signal (e.g., from configuration 1311), logic 2228 can select an 8-bit portion from one of the 64-bit weights-0 or weights-1 registers, split the 8 bits into four groups of 2 bits, pad the weight groups with zeros, and send the weight groups to computation units 1800a, 1800b, 1800c, and 1800d. Also, depending on the weight precision from input data and weights configuration 1310, weights multiplexer 1304 can pad different weights in each group with zero, as shown in
Data packing circuit 2304 can store output data, such as Out0, Out1, Out2, and Out3, at the DOUT register. In some examples, the DOUT register has 32 bits. Data packing circuit 2304 is configured to store the output data at particular bit locations of the DOUT register and, responsive to the DOUT register being filled with the output data (e.g., 32 bits of output data are stored), transmit a control signal 2305 to load/store controller 530 to fetch the output data from the DOUT register back to memory 512, which allows the DOUT register to be overwritten with new output data.
Specifically, for a certain set of input, output, and weight precisions, processing circuit 2302 can generate four 8-bit output data elements Out0, Out1, Out2, and Out3 in one clock cycle. Data packing circuit 2304 can store the four 8-bit output data elements into the DOUT register after one clock cycle, and then transmit control signal 2305 to load/store controller 530. For a different set of input, output, and weight precisions, processing circuit 2302 may generate 16 bits of output data, such as four 4-bit output data elements or two 8-bit output data elements, per clock cycle. In such cases, data packing circuit 2304 may store the first set of 16-bit output data at first bit locations of the DOUT register after the first clock cycle, and then store the second set of 16-bit output data at second bit locations of the DOUT register after the second clock cycle. Such arrangements allow load/store controller 530 to transmit output data in chunks of a particular number of bits (e.g., 32 bits) that can be optimized for the write operations of memory 512. Such arrangements can reduce neural network processor 502's accesses to memory 512 in writing back output data, which can reduce memory usage and power consumption.
In some examples, data packing circuit 2304 can also receive a control signal 2306 from computation controller 522, and transmit control signal 2305 responsive to control signal 2306. Computation controller 522 can generate control signal 2306 responsive to an instruction from instruction buffer 520.
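As an illustration of the packing behavior, the following Python class is a software sketch under assumed behavior, not the circuit: it accumulates output elements into a 32-bit word and reports when the word is full, which is the point at which a signal such as control signal 2305 would be asserted. The class and method names are hypothetical.

    class DoutPacker:
        # Illustrative model of packing output elements into a 32-bit DOUT word.
        def __init__(self):
            self.word = 0
            self.filled_bits = 0

        def push(self, element, bits):
            # Place one output element at the next free bit locations; return True
            # once 32 bits are filled and a write-back could be requested.
            mask = (1 << bits) - 1
            self.word |= (element & mask) << self.filled_bits
            self.filled_bits += bits
            return self.filled_bits >= 32

    packer = DoutPacker()
    for out in [0x12, 0x34, 0x56, 0x78]:   # four 8-bit output elements
        full = packer.push(out, 8)
    print(hex(packer.word), full)          # prints 0x78563412 True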
Processing circuit 2302 also includes a data routing circuit 2412 coupled between the inputs of processing circuit 2302 and the inputs of multiplier circuits 2402a-d, and a data routing circuit 2414 coupled between the outputs of clamp circuits 2410a-d and the outputs of processing circuit 2302. As to be described below, data routing circuit 2412 can route the intermediate output data elements to the multiplier circuits, and data routing circuit 2414 can route the outputs of the clamp circuits to the outputs of processing circuit 2302, based on input, output, and weight precisions. Further, processing circuit 2302 includes a multiplexer circuit 2416a coupled between the output of adder 2408a and an input of adder 2408b, and a multiplexer circuit 2416c coupled between the output of adder 2408c and an input of adder 2408d. The multiplexer circuits 2416 can also be controlled based on weight precision.
Each multiplier circuit 2402 is coupled to a respective input of processing circuit 2302 to scale an intermediate output data element with a scaling factor (labelled scale[0], scale[1], scale[2], and scale[3] in
For each intermediate output data element Y_n (e.g., Y0, Y1, Y2, and Y3), processing circuit 2302 can generate an output data element Out_n (e.g., Out0, Out1, Out2, and Out3) based on the following Equation. The scaling and shifting can be different for different channels. Also, the bias values bias[n] can be introduced in the MAC registers to initialize the intermediate output data elements, as described above in
Out_n=clamp(Scale[n]*(Y_n+bias[n])>>shift[n],clamp_high,clamp_low) (Equation 38)
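A minimal Python sketch of Equation 38 is shown below; it assumes integer arithmetic and illustrative per-channel parameter values and is not the post-processing engine's implementation.

    def clamp(value, low, high):
        return max(low, min(high, value))

    def post_process(y, bias, scale, shift, clamp_low, clamp_high):
        # Equation 38: add the bias, scale, arithmetic right shift, then clamp.
        return clamp((scale * (y + bias)) >> shift, clamp_low, clamp_high)

    # Example with a signed 8-bit output range and assumed channel parameters.
    print(post_process(y=523, bias=-20, scale=37, shift=7,
                       clamp_low=-128, clamp_high=127))   # prints 127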
In
Also, in
Referring to
For each 8-bit chunk, the comparison between 4-bit MSBs can be performed by an MSB comparison circuit, such as compare circuit 2604a (also labelled CMP-hi in
Referring to chart 2610, CMP-hi circuit (e.g., compare circuit 2604a) can receive configuration data, such as whether the 4-bit MSBs of the 8-bit chunks of the DIN and DOUT data being compared are signed or unsigned data. If the data are signed, CMP-hi can sign extend both the 4-bit MSBs of the DIN and the DOUT data, otherwise pre-pend them with zero (by adding a zero before the MSBs).
If the 4-bit MSBs of the DIN data have a larger value than the 4-bit MSBs of the DOUT data, CMP-hi circuit can set a control signal to the multiplexer (e.g., multiplexer circuit 2604c) to overwrite the 4-bit MSBs of the DOUT data with the 4-bit MSBs of the DIN data. CMP-hi circuit can also provide a first comparison signal (set cmp-res=“GT”) to the CMP-lo circuit. On the other hand, if the 4-bit MSBs of the DIN and DOUT data are the same, CMP-hi circuit can provide a second comparison signal (set cmp-res=“EQ”). The first and second comparison signals are provided to CMP-lo circuit to avoid making a wrong comparison decision where the MSBs of the DOUT data have a higher value than the MSBs of the DIN data but the LSBs of the DOUT data have a lower value than the LSBs of the DIN data. In both cases where the 4-bit MSBs of the DIN data have the same or a lower value than the 4-bit MSBs of the DOUT data, CMP-hi circuit can maintain the 4-bit MSBs of the DOUT data in the DOUT register.
Also, referring to chart 2612, CMP-lo circuit (e.g., compare circuit 2604b) can receive configuration data, such as whether the 4-bit LSBs of the 8-bit chunks of the DIN and DOUT data being compared are signed or unsigned data, and whether an 8-bit comparison or a 4-bit comparison is performed. If the data are signed, CMP-lo can sign extend both the 4-bit LSBs of the DIN and the DOUT data, otherwise pre-pend them with zero. Also, if a 4-bit comparison is performed, CMP-lo circuit can ignore the comparison signals from CMP-hi circuit, and overwrite the 4-bit LSBs of the DOUT data with the 4-bit LSBs of the DIN data if the latter have a higher value. Further, if an 8-bit comparison is performed, CMP-lo circuit can overwrite the 4-bit LSBs of the DOUT data with the 4-bit LSBs of the DIN data (using multiplexer circuit 2604d) only if the first comparison signal indicates that the 4-bit MSBs of the DIN data have a higher value than the 4-bit MSBs of the DOUT data, or if the second comparison signal indicates the 4-bit MSBs are equal and the 4-bit LSBs of the DOUT data have a lower value than the 4-bit LSBs of the DIN data.
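The two-stage comparison can be illustrated with a short Python sketch of an unsigned 8-bit maximum update built from nibble comparisons. This is an assumed software model for the unsigned, 8-bit-comparison case only, not the compare circuits themselves, and the function name is illustrative.

    def max_update_8bit(din, dout):
        # Nibble-wise maximum update of an unsigned 8-bit chunk, mirroring the
        # CMP-hi / CMP-lo ordering: the low-half comparison only takes effect
        # when the high halves are equal.
        din_hi, din_lo = din >> 4, din & 0xF
        dout_hi, dout_lo = dout >> 4, dout & 0xF
        if din_hi > dout_hi:                        # CMP-hi result "GT": take DIN
            return din
        if din_hi == dout_hi and din_lo > dout_lo:  # CMP-hi result "EQ": CMP-lo decides
            return (dout_hi << 4) | din_lo
        return dout                                 # otherwise keep the stored chunk

    print(hex(max_update_8bit(0x4A, 0x47)))  # 0x4a: high halves equal, DIN low half larger
    print(hex(max_update_8bit(0x39, 0x47)))  # 0x47: DIN high half smaller, DOUT kept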
In operation 2702, computation controller 522 receives a first instruction from instruction buffer 520. Example syntax of the first instruction is illustrated in
In operation 2704, responsive to the first instruction (e.g., sub-instruction 610), load/store controller 530 fetches the input data elements from a memory (e.g., memory 512) external to the neural network processor to an input data register (e.g., Din0 or Din1 of data register 528a) of the neural network processor. Load/store controller 530 can perform read operations via memory interface 534 to fetch the input data elements. The read address can be generated by address generators 532, and can be based on a circular addressing scheme as described in
Also, in operation 2706, responsive to the first instruction (e.g., sub-instruction 608), load/store controller 530 fetches the weight elements from a weights buffer of the neural network processor (e.g., weights and parameters buffer 526) to a weights register (e.g., weights-0 or weights-1 registers of weights/parameters registers 528b) of the neural network processor. The fetching of weight elements and input data elements can be performed in parallel.
In operation 2708, computation controller 522 receives a second instruction from instruction buffer 520. The second instruction can include sub-instruction 602 indicating a computation operation (e.g., MAC operations, post-processing operations, etc.) to be performed by computing engine 524 on weight elements stored in the weights register and input data elements stored in the input data register.
In operation 2710, responsive to the second instruction, the computing engine can fetch the input data elements and the weight elements from, respectively, the input data register and the weights register.
In operation 2712, responsive to the second instruction, the computing engine can perform computation operations, including MAC and post-processing operations, on the input data elements and the weight elements to generate output data elements.
In operation 2714, the computing engine can store the output data elements at an output data register (e.g., Dout).
In operation 2802, computation controller 522 receives a first indication of a particular input precision and a second indication of a particular weight precision. The first and second indications can be received from configuration registers 528c.
In operation 2804, computation controller 522 configures a computing engine of a neural network processor based on the first and second indications. The configuration can be based on, for example, setting the D4/D8 modes for the weights multiplexer 1304 and input data multiplexer 1306, setting configuration 1710 including the binary mode (or no binary mode) for computation units 1800 of MAC engine 1300, etc.
In operation 2806, computation controller 522 receives an instruction from an instruction buffer. The instruction may include sub-instruction 602 indicating a set of MAC operations to be performed by MAC engine 1300 on weight elements stored in the weights register and input data elements stored in the input data register.
In operation 2808, responsive to the instruction, the computing engine configured based on the first and second indications can fetch input data elements and weight elements from, respectively, the input data register and the weights register. As described above, depending on whether the weights are 4-bit or 8-bit precisions, and whether the input data elements are 4-bit or 8-bit precisions, which can set the D8/D4 mode, weights multiplexer 1304 and data multiplexer 1306 can fetch the weight elements of different bit precisions and input data elements of different bit precisions to the computation units 1800, as illustrated in
In operation 2810, responsive to the instruction, the computing engine configured based on the first and second indications can perform multiplication and accumulation (MAC) operations between the input data elements at the particular input precision and the weight elements at the particular weight precision to generate intermediate output data elements. For example, for binary mode, the computing engine can perform bitwise XNOR operations between the input data elements and the weights elements. Also, for certain data and weight precisions (e.g., 8-bit input precision and 2-bit weight precision, 4-bit input and weight precisions, etc.), each computation unit can generate an intermediate output data element. For other data and weight precisions (e.g., 8-bit input precision and 4-bit weight precision, 8-bit input precision and 8-bit weight precision), the computing engine can use a data merge circuit (e.g., data merge circuit 2002) to merge outputs from different computation units to generate an intermediate output data element, as illustrated in
In operation 2812, the computing engine can store the intermediate output data elements at intermediate output data registers (e.g., MAC registers).
In operation 2902, computation controller 522 receives a first indication of a particular output precision and a second indication of a particular weight precision. Computation controller 522 may also receive a third indication of a particular input precision. The first, second, and third indications can be received from configuration registers 528c.
In operation 2904, computation controller 522 configures post-processing engine 1302 based on the first and second indications (and third indication). The configuration can be based on, for example, setting the clamp high and clamp low values based on the output precision (8 bit or 4 bit), setting multiplexer circuit 2416a to route adder 2408a output to adder 2408b input and multiplexer circuit 2416b to route adder 2408c output to adder 2408d input to support 2-ternary weights, setting data routing circuit 2412 to route the output of data merge circuit 2002 to multipliers 2402a-d to support 8-bit input precision and 4-bit weight precision or 8-bit input precision and 8-bit weight precision, as shown in
In operation 2906, computation controller 522 receives a first instruction from an instruction buffer. The first instruction may include sub-instruction 602 indicating a set of MAC operations to be performed by MAC engine 1300 on weight elements stored in the weights register and input data elements stored in the input data register.
In operation 2908, responsive to the first instruction, MAC engine 1300 of computing engine 524 can fetch input data elements and weight elements from, respectively, the input data register and the weights register. In some examples, computing engine 524 can also be configured based on the input and weights precisions. As described above, depending on whether the weights are 4-bit or 8-bit precisions, and whether the input data elements are 4-bit or 8-bit precisions, which can set the D8/D4 mode, weights multiplexer 1304 and data multiplexer 1306 can fetch the weight elements of different bit precisions and input data elements of different bit precisions to the computation units 1800, as illustrated in
In operation 2910, responsive to the instruction, MAC engine 1300 can perform multiplication and accumulation (MAC) operations between the input data elements at the particular input precision and the weight elements at the particular weight precision to generate intermediate output data elements. For example, for binary mode, the computing engine can perform bitwise XNOR operations between the input data elements and the weights elements. Also, for certain data and weight precisions (e.g., 8-bit input precision and 2-bit weight precision, 4-bit input and weight precisions, etc.), each computation unit can generate an intermediate output data element. For other data and weight precisions (e.g., 8-bit input precision and 4-bit weight precision, 8-bit input precision and 8-bit weight precision), the computing engine can use a data merge circuit (e.g., data merge circuit 2002) to merge outputs from different computation units to generate an intermediate output data element, as illustrated in
In operation 2912, MAC engine 1300 can store the intermediate output data elements at intermediate output data registers (e.g., MAC registers).
Referring to
In operation 2916, responsive to the second instruction, post-processing engine 1302, configured based on the first and second (and third) indications, fetches the intermediate output data elements from the intermediate output data registers.
In operation 2918, responsive to the second instruction, post-processing engine 1302, configured based on the first and second (and third) indications, performs post-processing operations, such as BNorm operations, residual layer processing, etc., on the intermediate data elements to generate output data elements, as illustrated in
In operation 2920, responsive to the second instruction, post-processing engine 1302 stores the output data elements at an output data register (e.g., Dout) of data registers 528a. The storing of the output data elements can be performed by data packing circuit 2304. Upon storing a threshold size of data at Dout (e.g., 32 bits), data packing circuit 2304 can transmit a control signal 2305 to load/store controller 530 to fetch the output data to memory 512.
In this description, the term “couple” may cover connections, communications, or signal paths that enable a functional relationship consistent with this description. For example, if device A provides a signal to control device B to perform an action, then: (a) in a first example, device A is coupled to device B; or (b) in a second example, device A is coupled to device B through intervening component C if intervening component C does not substantially alter the functional relationship between device A and device B, such that device B is controlled by device A via the control signal provided by device A. Also, in this description, a device that is “configured to” perform a task or function may be configured (e.g., programmed and/or hardwired) at a time of manufacturing by a manufacturer to perform the function and/or may be configurable (or reconfigurable) by a user after manufacturing to perform the function and/or other additional or alternative functions. The configuring may be through firmware and/or software programming of the device, through a construction and/or layout of hardware components and interconnections of the device, or a combination thereof. Furthermore, in this description, a circuit or device that includes certain components may instead be adapted to be coupled to those components to form the described circuitry or device. For example, a structure described as including one or more semiconductor elements (such as transistors), one or more passive elements (such as resistors, capacitors and/or inductors), and/or one or more sources (such as voltage and/or current sources) may instead include only the semiconductor elements within a single physical device (e.g., a semiconductor die and/or integrated circuit (IC) package) and may be adapted to be coupled to at least some of the passive elements and/or the sources to form the described structure either at a time of manufacture or after a time of manufacture, such as by an end-user and/or a third party.
While particular transistor structures are referred to above, other transistors or device structures may be used instead. For example, p-type MOSFETs may be used in place of n-type MOSFETs with little or no additional changes. In addition, other types of transistors (such as bipolar transistors) may be utilized in place of the transistors shown. The capacitors may be implemented using different device structures (such as metal structures formed over each other to form a parallel plate capacitor) or may be formed on layers (metal or doped semiconductors) closer to or farther from the semiconductor substrate surface.
As used above, the terms “terminal”, “node”, “interconnection” and “pin” are used interchangeably. Unless specifically stated to the contrary, these terms are generally used to mean an interconnection between or a terminus of a device element, a circuit element, an integrated circuit, a device or other electronics or semiconductor component.
While certain components may be described herein as being of a particular process technology, these components may be exchanged for components of other process technologies. Circuits described herein are reconfigurable to include the replaced components to provide functionality at least partially similar to functionality available before the component replacement. Components shown as resistors, unless otherwise stated, are generally representative of any one or more elements coupled in series and/or parallel to provide an amount of impedance represented by the shown resistor. For example, a resistor or capacitor shown and described herein as a single component may instead be multiple resistors or capacitors, respectively, coupled in series or in parallel between the same two nodes as the single resistor or capacitor. Also, uses of the phrase “ground terminal” in this description include a chassis ground, an Earth ground, a floating ground, a virtual ground, a digital ground, a common ground, and/or any other form of ground connection applicable to, or suitable for, the teachings of this description. Unless otherwise stated, “about”, “approximately”, or “substantially” preceding a value means+/−10 percent of the stated value.
Modifications are possible in the described examples, and other examples are possible, within the scope of the claims.
This application claims priority to: (a) U.S. Provisional Patent Application No. 63/407,757, titled “Programmable HWA for Deep Neural Networks”, filed Sep. 19, 2022; (b) U.S. Provisional Patent Application No. 63/407,760, titled “Configurable and Scalable MAC engine for Reduced Precision Deep Neural Networks”, filed Sep. 19, 2022; and (c) U.S. Provisional Patent Application No. 63/407,758, titled “Post processing Hardware Engine for Reduced Precision DNNs”, filed Sep. 19, 2022, all of which are incorporated herein by reference in their entireties.