The disclosure relates to a neural network, and more particularly to an architecture of a processor adapted for neural network operation.
Convolutional neural networks (CNNs) have recently emerged as a means to tackle artificial intelligence (AI) problems such as computer vision. State-of-the-art CNNs can recognize one thousand categories of objects in the ImageNet dataset both faster and more accurately than humans.
Among the CNN techniques, binary CNNs (BNNs for short) are suitable for embedded devices such as those for the Internet of things (IoT). The multiplications of BNNs are equivalent to logic XNOR operations, which are much simpler and consume much lower power than full-precision integer or floating-point multiplications. Meanwhile, open-source hardware and open standard instruction set architecture (ISA) have also attracted great attention. For example, RISC-V solutions have become available and popular in recent years.
In view of the BNN, IoT, and RISC-V trends, some architectures that integrate embedded processors with BNN acceleration have been developed, such as the vector processor (VP) architecture and the peripheral engine (PE) architecture, as illustrated in
In the VP architecture, the BNN acceleration is tightly coupled to processor cores. More specifically, the VP architecture integrates vector instructions into the processor cores, and thus offers good programmability to support general-purpose workloads. However, such architecture is disadvantageous in that it involves significant costs for developing toolchains (e.g., compilers) and hardware (e.g., pipeline datapath and control), and the vector instructions may incur additional power and performance costs from, for example, moving data between static random access memory (SRAM) and processor registers (e.g., load and store) and loops (e.g., branch).
On the other hand, the PE architecture makes the BNN acceleration loosely coupled to the processor cores using a system bus such as the advanced high-performance bus (AHB). In contrast to the VP architecture, most IC design companies are familiar with the PE architecture, which avoids the abovementioned compiler and pipeline development costs. In addition, without the loading, storing, and loop costs, the PE architecture can potentially achieve better performance than the VP architecture. However, the PE architecture is disadvantageous in that it uses private SRAM instead of sharing the SRAM already available to the embedded processor cores. Typically, embedded processor cores for IoT devices are equipped with approximately 64 to 160 KB of tightly coupled memory (TCM) that is made of SRAM and that can support concurrent code execution and data transfers. TCM is also known as tightly integrated memory, scratchpad memory, or local memory.
Therefore, an object of the disclosure is to provide a processor adapted for neural network operation. The processor can have the advantages of both of the conventional VP architecture and the conventional PE architecture.
According to the disclosure, the processor includes a scratchpad memory, a processor core, a neural network accelerator and an arbitration unit (such as a multiplexer unit). The scratchpad memory is configured to store to-be-processed data and multiple kernel maps of a neural network model, and has a memory interface. The processor core is configured to issue core-side read/write instructions (such as load and store instructions) that conform with the memory interface to access the scratchpad memory. The neural network accelerator is electrically coupled to the processor core and the scratchpad memory, and is configured to issue accelerator-side read/write instructions that conform with the memory interface to access the scratchpad memory for acquiring the to-be-processed data and the kernel maps from the scratchpad memory to perform a neural network operation on the to-be-processed data based on the kernel maps. The arbitration unit is electrically coupled to the processor core, the neural network accelerator and the scratchpad memory to permit one of the processor core and the neural network accelerator to access the scratchpad memory.
Another object of the disclosure is to provide a neural network accelerator for use in a processor of this disclosure. The processor includes a scratchpad memory storing to-be-processed data and storing multiple kernel maps of a convolutional neural network (CNN) model.
According to the disclosure, the neural network accelerator includes an operation circuit, a partial-sum memory, and a scheduler. The operation circuit is to be electrically coupled to the scratchpad memory. The partial-sum memory is electrically coupled to the operation circuit. The scheduler is electrically coupled to the partial-sum memory, and is to be electrically coupled to the scratchpad memory. When the neural network accelerator performs a convolution operation for an nth (n is a positive integer) layer of the CNN model, the to-be-processed data is nth-layer input data, and the following actions are performed: (1) the operation circuit receives, from the scratchpad memory, the to-be-processed data and nth-layer kernel maps which are those of the kernel maps that correspond to the nth layer, and performs, for each of the nth-layer kernel maps, multiple dot product operations of the convolution operation on the to-be-processed data and the nth-layer kernel map; (2) the partial-sum memory is controlled by the scheduler to store intermediate calculation results that are generated by the operation circuit during the dot product operations; and (3) the scheduler controls data transfer between the scratchpad memory and the operation circuit and data transfer between the operation circuit and the partial-sum memory in such a way that the operation circuit performs the convolution operation on the to-be-processed data and the nth-layer kernel maps so as to generate multiple nth-layer output feature maps that respectively correspond to the nth-layer kernel maps, after which the operation circuit provides the nth-layer output feature maps to the scratchpad memory for storage therein.
Yet another object is to provide a scheduler circuit for use in a neural network accelerator of this disclosure. The neural network accelerator is electrically coupled to a scratchpad memory of a processor. The scratchpad memory stores to-be-processed data, and multiple kernel maps of a convolutional neural network (CNN) model. The neural network accelerator is configured to acquire the to-be-processed data and the kernel maps from the scratchpad memory so as to perform a neural network operation on the to-be-processed data based on the kernel maps.
According to the disclosure, the scheduler includes multiple counters, each of which includes a register to store a counter value, a reset input terminal, a reset output terminal, a carry-in terminal, and a carry-out terminal. The counter values stored in the registers of the counters are related to memory addresses of the scratchpad memory where the to-be-processed data and the kernel maps are stored. Each of the counters is configured to, upon receipt of an input trigger at the reset input terminal thereof, set the counter value to an initial value, set an output signal at the carry-out terminal to a disabling state, and generate an output trigger at the reset output terminal. Each of the counters is configured to increment the counter value when an input signal at the carry-in terminal is in an enabling state. Each of the counters is configured to set the output signal at the carry-out terminal to the enabling state when the counter value has reached a predetermined upper limit. Each of the counters is configured to stop incrementing the counter value when the input signal at the carry-in terminal is in the disabling state. Each of the counters is configured to generate the output trigger at the reset output terminal when the counter value has incremented to be overflowing from the predetermined upper limit to become the initial value. The counters have a tree-structured connection in terms of connections among the reset input terminals and the reset output terminals of the counters, wherein, for any two of the counters that have a parent-child relationship in the tree-structured connection, the reset output terminal of one of the counters that serves as a parent node is electrically coupled to the reset input terminal of the other one of the counters that serves as a child node. The counters have a chain-structured connection in terms of connections among the carry-in terminals and the carry-out terminals of the counters, and the chain-structured connection is a post-order traversal of the tree-structured connection, wherein, for any two of the counters that are coupled together in series in the chain-structured connection, the carry-out terminal of one of the counters is electrically coupled to the carry-in terminal of the other one of the counters.
Other features and advantages of the disclosure will become apparent in the following detailed description of the embodiment(s) with reference to the accompanying drawings, of which:
Before the disclosure is described in greater detail, it should be noted that where considered appropriate, reference numerals or terminal portions of reference numerals have been repeated among the figures to indicate corresponding or analogous elements, which may optionally have similar characteristics.
Referring to
The scratchpad memory 1 may be static random-access memory (SRAM), magnetoresistive random-access memory (MRAM), or another type of non-volatile random-access memory, and has a memory interface. In this embodiment, the scratchpad memory 1 is realized using SRAM that has an SRAM interface (e.g., a specific format of a read enable (ren) signal, a write enable (wen) signal, input data (d), output data (q), and memory address data (addr), etc.), and is configured to store the to-be-processed data and the kernel maps of the neural network model. The to-be-processed data may be different for different layers of the neural network model. For example, the to-be-processed data for the first layer may be input image data, while the to-be-processed data for the nth layer (referred to as the nth-layer input data) may be the (n−1)th-layer output feature maps (the output of the (n−1)th layer) in the case of n>1.
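As a rough illustration of the interface with which the core-side and accelerator-side instructions must conform, the following is a minimal behavioral sketch of an SRAM-style scratchpad. Only the signal names ren, wen, d, q, and addr come from the description above; the 32-bit word width and the 1 K-word depth are hypothetical.

```c
/* Minimal behavioral model of the scratchpad's SRAM-style interface.
 * Signal names (ren, wen, d, q, addr) follow the text; the 32-bit word
 * width and the memory depth are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define SPM_WORDS 1024                /* assumed depth */

typedef struct {
    uint32_t addr;                    /* memory address                   */
    uint32_t d;                       /* write data (input to the SRAM)   */
    uint32_t q;                       /* read data  (output of the SRAM)  */
    uint8_t  ren;                     /* read enable                      */
    uint8_t  wen;                     /* write enable                     */
} spm_if_t;

static uint32_t spm_cells[SPM_WORDS];

/* One access "cycle": honor at most one of wen/ren. */
static void spm_access(spm_if_t *bus)
{
    if (bus->wen)
        spm_cells[bus->addr % SPM_WORDS] = bus->d;
    else if (bus->ren)
        bus->q = spm_cells[bus->addr % SPM_WORDS];
}

int main(void)
{
    spm_if_t bus = { .addr = 3, .d = 0xCAFEF00D, .wen = 1 };
    spm_access(&bus);                         /* write            */
    bus = (spm_if_t){ .addr = 3, .ren = 1 };
    spm_access(&bus);                         /* read back        */
    printf("q = 0x%08X\n", bus.q);
    return 0;
}
```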
The processor core 2 is configured to issue memory address and read/write instructions (referred to as core-side read/write instructions) that conform with the memory interface to access the scratchpad memory 1.
The neural network accelerator 3 is electrically coupled to the processor core 2 and the scratchpad memory 1, and is configured to issue memory address and read/write instructions (referred to as accelerator-side instructions) that conform with the memory interface to access the scratchpad memory 1 for acquiring the to-be-processed data and the kernel maps from the scratchpad memory 1 to perform a neural network operation on the to-be-processed data based on the kernel maps.
In this embodiment, the processor core 2 has a memory-mapped input/output (MMIO) interface to communicate with the neural network accelerator 3. In other embodiments, the processor core 2 may use a port-mapped input/output (PMIO) interface to communicate with the neural network accelerator 3. Since commonly used processor cores usually support an MMIO interface and/or a PMIO interface, no additional cost is required for developing specialized toolchains (e.g., compilers) and hardware (e.g., pipeline datapath and control), which is advantageous in comparison to the conventional VP architecture that uses vector arithmetic instructions to perform the required computation.
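The sketch below shows, in hedged form, how a processor core might drive such an accelerator through MMIO registers. The base address, the register offsets (NNA_INPUT_PTR, NNA_KERNEL_PTR, NNA_OUTPUT_PTR, NNA_CTRL, NNA_STATUS), and the start/busy polling protocol are all hypothetical and are not taken from this disclosure.

```c
/* Hypothetical MMIO register map for the accelerator; the addresses,
 * offsets and protocol below are illustrative assumptions only. */
#include <stdint.h>

#define NNA_BASE        0x40000000u                /* assumed MMIO base address */
#define NNA_REG(off)    (*(volatile uint32_t *)(NNA_BASE + (off)))
#define NNA_INPUT_PTR   0x00u                      /* to-be-processed data address */
#define NNA_KERNEL_PTR  0x04u                      /* kernel-map address           */
#define NNA_OUTPUT_PTR  0x08u                      /* output feature-map address   */
#define NNA_CTRL        0x0Cu                      /* bit 0: start                 */
#define NNA_STATUS      0x10u                      /* bit 0: busy                  */

void nna_run_layer(uint32_t in_addr, uint32_t ker_addr, uint32_t out_addr)
{
    NNA_REG(NNA_INPUT_PTR)  = in_addr;
    NNA_REG(NNA_KERNEL_PTR) = ker_addr;
    NNA_REG(NNA_OUTPUT_PTR) = out_addr;
    NNA_REG(NNA_CTRL)       = 1u;                  /* start the convolution */
    while (NNA_REG(NNA_STATUS) & 1u)               /* poll until the layer is done */
        ;
}
```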
The arbitration unit 4 is electrically coupled to the processor core 2, the neural network accelerator 3 and the scratchpad memory 1 to permit one of the processor core 2 and the neural network accelerator 3 to access the scratchpad memory 1 (i.e., permitting passage of a read/write instruction, memory address, and/or to-be-stored data that are provided from one of the processor core 2 and the neural network accelerator 3 to the scratchpad memory 1). As a result, the neural network accelerator 3 can share the scratchpad memory with the processor core 2, and thus the processor requires less private memory in comparison to the conventional PE architecture. In this embodiment, the arbitration unit 4 is exemplarily realized as a multiplexer that is controlled by the processor core 2 to select output data, but this disclosure is not limited in this respect.
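Behaviorally, the arbitration unit can be pictured as a request multiplexer, as in the short sketch below; the grouping of the signals into one structure and the single select bit driven by the processor core are illustrative assumptions rather than the embodiment's exact circuit.

```c
/* Behavioral sketch of the arbitration unit 4 as a request multiplexer. */
#include <stdint.h>

typedef struct {                /* one scratchpad request, simplified */
    uint32_t addr;              /* memory address      */
    uint32_t d;                 /* data to be written  */
    uint8_t  ren, wen;          /* read/write enables  */
} mem_req_t;

/* sel == 0: the processor core owns the scratchpad; sel == 1: the accelerator. */
mem_req_t arbitrate(uint8_t sel, mem_req_t core_req, mem_req_t accel_req)
{
    return sel ? accel_req : core_req;
}
```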
The abovementioned architecture is applicable to a variety of neural network models, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and so on. In this embodiment, the neural network model is a convolutional neural network (CNN) model, and the neural network accelerator 3 includes an operation circuit 31, a partial-sum memory 32, a scheduler 33 and a feature processing circuit 34.
The operation circuit 31 is electrically coupled to the scratchpad memory 1 and the partial-sum memory 32. When the neural network accelerator 3 performs a convolution operation for the nth layer of the CNN model, the operation circuit 31 receives, from the scratchpad memory 1, the nth-layer input data and nth-layer kernel maps, and performs, for each of the nth-layer kernel maps, multiple dot product operations of the convolution operation on the nth-layer input data and the nth-layer kernel map.
The partial-sum memory 32 may be realized using SRAM, MRAM, or register files, and is controlled by the scheduler 33 to store intermediate calculation results that are generated by the operation circuit 31 during the dot product operations. Each of the intermediate calculation results corresponds to one of the dot product operations, and may be referred to as a partial sum or a partial sum value of a final result of said one of the dot product operations hereinafter. As an example, a dot product of two vectors A=[a1, a2, a3] and B=[b1, b2, b3] is a1b1+a2b2+a3b3, where a1b1 may be calculated first and serve as a partial sum of the dot product, then a2b2 is calculated and added to the partial sum (which is a1b1 at this time) to update the partial sum, and a3b3 is calculated and added to the partial sum (which is a1b1+a2b2 at this time) at last to obtain a total sum (final result) that serves as the dot product.
In this embodiment, the operation circuit 31 includes a convolver 310 (a circuit used to perform convolution) and a partial-sum adder 311 to perform the dot product operations for the nth-layer kernel maps, one nth-layer kernel map at a time. Referring to
In this embodiment, the CNN model is exemplified as a binary CNN (BNN for short) model, so each of the multipliers of the multiplier unit 3103 can be realized as an XNOR gate, and the convolver adder 3104 can be realized as a population count (popcount) circuit.
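As a concrete illustration of the XNOR/popcount datapath just described, the sketch below computes a binary dot product on bit-packed vectors. The {−1, +1} → {0, 1} encoding and the 32-bit packing width are common BNN conventions assumed here, not details taken from the figures.

```c
/* Binary dot product: with inputs and weights packed as {+1 -> 1, -1 -> 0}
 * bits, element-wise multiplication reduces to XNOR and accumulation to a
 * population count. */
#include <stdint.h>
#include <stdio.h>

/* Popcount of a 32-bit word (portable; real hardware uses a popcount tree). */
static int popcount32(uint32_t x)
{
    int n = 0;
    while (x) { n += (int)(x & 1u); x >>= 1; }
    return n;
}

/* Dot product of `bits` packed binary elements; result lies in [-bits, +bits]. */
int bnn_dot(uint32_t activations, uint32_t weights, int bits)
{
    uint32_t mask  = (bits == 32) ? 0xFFFFFFFFu : ((1u << bits) - 1u);
    uint32_t match = ~(activations ^ weights) & mask;   /* XNOR per element */
    int ones = popcount32(match);
    return 2 * ones - bits;          /* (#matches) - (#mismatches) */
}

int main(void)
{
    /* 8-element example: both vectors all +1, so the dot product is 8. */
    printf("%d\n", bnn_dot(0xFFu, 0xFFu, 8));
    return 0;
}
```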
The partial-sum adder 311 is electrically coupled to the convolver adder 3104 for receiving a first input value, which is the sum that corresponds to a dot product operation and that is outputted by the convolver adder 3104, and is electrically coupled to the partial-sum memory 32 for receiving a second input value, which is the one of the intermediate calculation results that corresponds to the same dot product operation. The partial-sum adder 311 adds up the first input value and the second input value to generate an updated intermediate calculation result, which is stored back into the partial-sum memory 32 to update said one of the intermediate calculation results.
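The read-add-write-back behavior of the partial-sum adder can be sketched as follows; the partial-sum memory depth and the indexing scheme are assumptions for illustration only.

```c
/* Sketch of the partial-sum update path: the convolver's sum for one dot
 * product is added to the matching entry of the partial-sum memory. */
#include <stdint.h>

#define PSUM_ENTRIES 256                  /* assumed partial-sum memory depth */
static int32_t psum_mem[PSUM_ENTRIES];

/* Called once per partial dot-product result produced by the convolver. */
void psum_accumulate(uint32_t index, int32_t convolver_sum)
{
    psum_mem[index % PSUM_ENTRIES] += convolver_sum;   /* read, add, write back */
}
```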
In other embodiments, the convolver 310 may include a plurality of the dot product operation units 3101 that respectively correspond to multiple different kernel maps of the same layer to perform the convolution operation on the to-be-processed data and different ones of the kernel maps at the same time, as exemplarily illustrated in
The data layout and the computation scheduling exemplified in
Referring to
When the neural network accelerator 3 performs the convolution operation for the nth layer of the neural network model, the input pointer 331 points to a first memory address of the scratchpad memory 1 where the nth-layer input data (denoted as “Layer N” in
When the neural network accelerator 3 performs the convolution operation for an (n+1)th layer of the neural network model, the input pointer 331 points to the third memory address of the scratchpad memory 1 and makes the nth-layer output feature maps stored therein serve as the to-be-processed data for the (n+1)th layer (denoted as “Layer N+1” in
Furthermore, the scheduler 33 is electrically coupled to the arbitration unit 4 for accessing the scratchpad memory 1 therethrough, is electrically coupled to the partial-sum memory 32 for accessing the partial-sum memory 32, and is electrically coupled to the convolver 310 for controlling the timing of updating data that is stored in the register unit 3100. When the neural network accelerator 3 performs a convolution operation for the nth layer of the neural network model, the scheduler 33 controls data transfer between the scratchpad memory 1 and the operation circuit 31 and data transfer between the operation circuit 31 and the partial-sum memory 32 in such a way that the operation circuit 31 performs the convolution operation on the to-be-processed data and each of the nth-layer kernel maps so as to generate multiple nth-layer output feature maps that respectively correspond to the nth-layer kernel maps, after which the operation circuit 31 provides the nth-layer output feature maps to the scratchpad memory 1 for storage therein. In detail, the scheduler 33 fetches the to-be-processed data and the kernel weights from the scratchpad memory 1, sends the same to the registers of the operation circuit 31 for performing bitwise dot products (e.g., XNOR, popcount, etc.) and accumulating the dot product results in the partial-sum memory 32. Particularly, the scheduler 33 of this embodiment schedules the operation circuit 31 to perform the convolution operation in a manner as exemplified in either
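In plain C, the schedule that the scheduler 33 realizes in hardware might resemble the loop nest below. The loop order, the layer geometry and the array layout are illustrative assumptions (the embodiment's exact ordering is fixed by the counters described next), and bnn_dot refers to the earlier XNOR/popcount sketch.

```c
/* Illustrative loop nest for one layer's binary convolution, reflecting the
 * scheduler's role of moving data between the scratchpad, the convolver and
 * the partial-sum memory.  Geometry and layout are assumptions. */
#include <stdint.h>

enum { H = 8, W = 8, CIN_WORDS = 1, COUT = 4, K = 3 };   /* assumed geometry */

int bnn_dot(uint32_t activations, uint32_t weights, int bits);  /* earlier sketch */

/* in[y][x][c]       : packed input activations read from the scratchpad
 * ker[m][ky][kx][c] : packed weights of the m-th kernel map
 * out[y][x][m]      : accumulated dot products (before pooling/normalization) */
void conv_layer(const uint32_t in[H][W][CIN_WORDS],
                const uint32_t ker[COUT][K][K][CIN_WORDS],
                int32_t out[H - K + 1][W - K + 1][COUT])
{
    for (int m = 0; m < COUT; m++)                        /* one kernel map at a time */
        for (int yo = 0; yo <= H - K; yo++)
            for (int xo = 0; xo <= W - K; xo++) {
                int32_t psum = 0;                         /* held in partial-sum memory */
                for (int ky = 0; ky < K; ky++)
                    for (int kx = 0; kx < K; kx++)
                        for (int c = 0; c < CIN_WORDS; c++)
                            psum += bnn_dot(in[yo + ky][xo + kx][c],
                                            ker[m][ky][kx][c], 32);
                out[yo][xo][m] = psum;                    /* written back to the scratchpad */
            }
}
```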
Each of the counters C1 to C8 includes a register to store a counter value, a reset input terminal (rst_in), a reset output terminal (rst_out), a carry-in terminal (cin), and a carry-out terminal (cout). The counter values stored in the registers of the counters C1-C8 are related to memory addresses of the scratchpad memory 1 where the to-be-processed data and the kernel maps are stored. Each of the counters C1-C8 is configured to perform the following actions: 1) upon receipt of an input trigger at the reset input terminal thereof, setting the counter value to an initial value (e.g., zero), setting an output signal at the carry-out terminal to a disabling state (e.g., logic low), and generating an output trigger at the reset output terminal; 2) when an input signal at the carry-in terminal is in an enabling state (e.g., logic high), incrementing the counter value (e.g., adding one to the counter value); 3) when the counter value has reached a predetermined upper limit, setting the output signal at the carry-out terminal to the enabling state; 4) when the input signal at the carry-in terminal is in the disabling state, stopping incrementing the counter value; and 5) generating the output trigger at the reset output terminal when the counter value has incremented to be overflowing from the predetermined upper limit to become the initial value. It is noted that the processor core 2 may, via the MMIO interface, set the predetermined upper limit of the counter value, inform the scheduler 33 to start counting, check the progress of the counting, and prepare the next convolution operation (e.g., updating the input, kernel and output pointers 331, 332, 333, changing the predetermined upper limits for the counters if needed, etc.) when the counting is completed (i.e., the current convolution operation is finished). In this embodiment, the counter values of the counters C1-C8 respectively represent a position (Xo) of the output feature map in a width direction of the data structure, a position (Xk) of the kernel map (denoted as "kernel" in
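The counter behavior just described can be modeled roughly as below; the zero initial value follows the example above, while the children array bound and the way a wrap-around immediately performs the reset are simplifications of the hardware.

```c
/* Rough behavioral model of one scheduler counter. */
#include <stdbool.h>
#include <stdint.h>

#define MAX_CHILDREN 4                      /* assumed fan-out of rst_out */

typedef struct counter {
    uint32_t value;                         /* counter value (register)       */
    uint32_t limit;                         /* predetermined upper limit      */
    bool     cout;                          /* carry-out signal               */
    struct counter *children[MAX_CHILDREN]; /* counters driven by our rst_out */
    int      nchildren;
} counter_t;

/* Input trigger at rst_in: reset the value, drop the carry-out, and pass the
 * rst_out trigger on to every child in the tree. */
void counter_reset(counter_t *c)
{
    c->value = 0;                           /* initial value assumed to be zero */
    c->cout  = false;
    for (int i = 0; i < c->nchildren; i++)
        counter_reset(c->children[i]);
}

/* One update with the given carry-in level; returns the carry-out level.
 * Incrementing past the limit wraps to the initial value and, like rst_in,
 * emits the rst_out trigger toward the children. */
bool counter_step(counter_t *c, bool cin)
{
    if (cin) {
        if (c->value == c->limit) {
            counter_reset(c);               /* overflow: wrap and trigger rst_out */
        } else {
            c->value++;
            c->cout = (c->value == c->limit);
        }
    }
    return c->cout;
}
```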
The counters C1-C8 have a tree-structured connection in terms of connections among the reset input terminals and the reset output terminals of the counters C1-C8. That is, for any two of the counters C1-C8 that have a parent-child relationship in the tree-structured connection, the reset output terminal of one of the two counters that serves as a parent node is electrically coupled to the reset input terminal of the other one of the two counters that serves as a child node. As illustrated in
On the other hand, the counters C1-C8 have a chain-structured connection in terms of connections among the carry-in terminals and the carry-out terminals of the counters C1-C8, and the chain-structured connection is a post-order traversal of the tree-structured connection, wherein, for any two of the counters C1-C8 that are coupled together in series in the chain-structured connection, the carry-out terminal of one of the two counters is electrically coupled to the carry-in terminal of the other one of the two counters. As illustrated in
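Assuming the counter_t model from the previous sketch, a hypothetical three-counter tree (a root R with two children A and B) illustrates how the post-order traversal A, B, R orders the carry chain; the actual parent/child pairs among C1-C8 follow the embodiment's figures and are not reproduced here.

```c
/* Hypothetical wiring sketch building on the previous counter_t model.
 * Tree:  R is the parent of A and B (R's rst_out resets A and B).
 * Chain: post-order traversal A, B, R, so A's cout feeds B's cin and
 *        B's cout feeds R's cin; the head of the chain counts every cycle. */
#include <stdbool.h>

/* counter_t, counter_reset() and counter_step() as defined in the previous sketch. */

void run_example(counter_t *A, counter_t *B, counter_t *R, int cycles)
{
    R->children[0] = A;
    R->children[1] = B;
    R->nchildren   = 2;
    counter_reset(R);                            /* reset the whole tree */

    for (int i = 0; i < cycles; i++) {
        bool a_cout = counter_step(A, true);     /* head of the carry chain */
        bool b_cout = counter_step(B, a_cout);
        (void)counter_step(R, b_cout);
    }
}
```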
After the convolution of the to-be-processed data and one of the kernel maps is completed, usually the convolution result would undergo max pooling (optional in some layers), batch normalization and quantization. For the purpose of explanation, the quantization is exemplified as binarization since the exemplary neural network model is a BNN model. The max pooling, the batch normalization and the binarization can together be represented using a logic operation of:
$$y = \mathrm{NOT}\left\{\operatorname{sign}\!\left(\frac{\operatorname{Max}(x_i - b_0) - \mu}{\sqrt{\sigma^2 - \varepsilon}} \times \gamma - \beta\right)\right\} \tag{1}$$
where x_i represents inputs of the operation of the max pooling, the batch normalization and the binarization combined, which are results of the dot product operations of the convolution operation; y represents a result of the operation of the max pooling, the batch normalization and the binarization combined; b_0 represents a predetermined bias; μ represents an estimated average of the results of the dot product operations of the convolution operation that is obtained during the training of the neural network model; σ represents an estimated standard deviation of the results of the dot product operations of the convolution operation that is obtained during the training of the neural network model; ε represents a small constant to avoid dividing by zero; γ represents a predetermined scaling factor; and β represents an offset.
This embodiment proposes a simpler circuit structure for the feature processing circuit 34 that achieves the same function as the conventional circuit structure. The feature processing circuit 34 is configured to perform a fused operation of max pooling, batch normalization and binarization on a result of the convolution operation performed on the to-be-processed data and the nth-layer kernel maps, so as to generate the nth-layer output feature maps. The fused operation can be derived from equation (1) to be:
where x_i represents inputs of the fused operation, which are results of the dot product operations of the convolution operation; y represents a result of the fused operation; γ represents a predetermined scaling factor, and b_a represents an adjusted bias related to an estimated average and an estimated standard deviation of the results of the dot product operations of the convolution operation. In detail,
The feature processing circuit 34 includes a number i of adders for adding the adjusted bias to the inputs, a number i of binarization circuits, an i-input AND gate and a two-input XNOR gate that are coupled together to perform the fused operation. In this embodiment, the binarization circuits perform binarization by obtaining only the most significant bit of data inputted thereto, but this disclosure is not limited to such.
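A behavioral sketch of that circuit is given below, assuming a four-input (e.g., 2×2) pooling window and two's-complement inputs whose most significant bit serves as the sign; using the sign bit of the scaling factor γ as the second XNOR input is an inference from equation (1), not an explicit detail of the embodiment.

```c
/* Behavioral sketch of the fused max-pooling / batch-normalization /
 * binarization path: add the adjusted bias to each input, binarize by
 * keeping only the sign (most significant) bit, AND the i sign bits, and
 * XNOR the result with the assumed sign bit of gamma. */
#include <stdbool.h>
#include <stdint.h>

#define POOL_WINDOW 4                          /* assumed pooling-window size */

/* MSB of a 32-bit two's-complement value: 1 when the value is negative. */
static bool msb(int32_t v) { return (uint32_t)v >> 31; }

bool fused_pool_bn_binarize(const int32_t x[POOL_WINDOW],
                            int32_t adjusted_bias,     /* b_a */
                            bool gamma_sign_bit)
{
    bool and_of_signs = true;
    for (int i = 0; i < POOL_WINDOW; i++)
        and_of_signs &= msb(x[i] + adjusted_bias);     /* adder + binarization */
    return !(and_of_signs ^ gamma_sign_bit);           /* two-input XNOR       */
}
```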
In summary, the embodiment of the processor of this disclosure uses an arbitration unit 4 so that the processor core 2 and the neural network accelerator 3 can share the scratchpad memory 1, and further uses a generic I/O interface (e.g., MMIO, PMIO, etc.) for the processor core 2 to communicate with the neural network accelerator 3, so as to reduce the cost of developing specialized toolchains and hardware. Therefore, the embodiment of the processor has the advantages of both the conventional VP architecture and the conventional PE architecture. The proposed data layout and computation scheduling may help minimize the required capacity of the partial-sum memory by maximizing reuse of the partial sums. The proposed structure of the feature processing circuit 34 fuses the max pooling, the batch normalization and the binarization, thereby reducing the required hardware resources.
In the description above, for the purposes of explanation, numerous specific details have been set forth in order to provide a thorough understanding of the embodiment(s). It will be apparent, however, to one skilled in the art, that one or more other embodiments may be practiced without some of these specific details. It should also be appreciated that reference throughout this specification to “one embodiment,” “an embodiment,” an embodiment with an indication of an ordinal number and so forth means that a particular feature, structure, or characteristic may be included in the practice of the disclosure. It should be further appreciated that in the description, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of various inventive aspects, and that one or more features or specific details from one embodiment may be practiced together with one or more features or specific details from another embodiment, where appropriate, in the practice of the disclosure.
While the disclosure has been described in connection with what is (are) considered the exemplary embodiment(s), it is understood that this disclosure is not limited to the disclosed embodiment(s) but is intended to cover various arrangements included within the spirit and scope of the broadest interpretation so as to encompass all such modifications and equivalent arrangements.
This application claims priority of U.S. Provisional Patent Application No. 62/943,820, filed on Dec. 5, 2019.