The disclosure relates in general to a deep neural network (DNN) hardware accelerator and an operating method thereof.
A deep neural network (DNN), which belongs to the class of artificial neural networks (ANNs), may be used in deep machine learning. An ANN has a learning function. DNNs have been widely used to solve various problems, such as machine vision and speech recognition.
To enhance the efficiency of a DNN, a balance between transmission bandwidth and computing capability needs to be reached in the design of the DNN. Therefore, providing a scalable architecture for a DNN hardware accelerator has become a prominent task for the industry.
According to one embodiment, a deep neural network (DNN) hardware accelerator including a processing element array is disclosed. The processing element array includes a plurality of processing element groups, and each of the processing element groups includes a plurality of processing elements. A first network connection implementation between a first processing element group of the processing element groups and a second processing element group of the processing element groups is different from a second network connection implementation between the processing elements in the first processing element group.
According to another embodiment, an operating method of a DNN hardware accelerator is provided. The DNN hardware accelerator includes a processing element array. The processing element array includes a plurality of processing element groups and each of the processing element groups includes a plurality of processing elements. The operating method includes: receiving input data by the processing element array; transmitting input data from a first processing element group of the processing element groups to a second processing element group of the processing element groups in a first network connection implementation; and transmitting data between the processing elements in the first processing element group in a second network connection implementation. The first network connection implementation is different from the second network connection implementation.
Technical terms are used in the specification with reference to terminology generally known in the technical field. For any term described or defined in the specification, the description and definition in the specification shall prevail. Each embodiment of the present disclosure has one or more technical features. Given that each embodiment is implementable, a person ordinarily skilled in the art may selectively implement or combine some or all of the technical features of any embodiment of the present disclosure.
In an embodiment of the present disclosure, the network distributor 210 may be realized by hardware, by firmware, or by software or machine-executable program code stored in a memory and executed by a micro-processing element or a digital signal processing element. If the network distributor 210 is realized by hardware, then the network distributor 210 may be realized by a single integrated circuit chip or by multiple circuit chips, but the present disclosure is not limited thereto. The single integrated circuit chip or the multiple circuit chips may be realized by a digital signal processing element, an application specific integrated circuit (ASIC) or a field programmable gate array (FPGA). The memory may be realized by, for example, a random access memory, a read-only memory or a flash memory.
In an embodiment of the present disclosure, the processing element may be realized by a micro-controller, a micro-processing element, a central processing unit (CPU), a digital signal processing element, an application specific integrated circuit (ASIC), a digital logic circuit, a field programmable gate array (FPGA) and/or another hardware element with an operation function. The processing elements may be coupled by an ASIC, a digital logic circuit, an FPGA and/or other hardware elements.
The network distributor 210 allocates respective bandwidths to a plurality of data types according to the data bandwidth ratios (RI, RF, RIP and ROP). In an embodiment, the DNN hardware accelerator 200 may adjust these bandwidths. Examples of the data types include the input feature map (ifmap), the filter, the input partial sum (ipsum) and the output partial sum (opsum). Examples of the data layers include the convolutional layer, the pooling layer and/or the fully-connected layer. For a particular data layer, the ifmap data may occupy a larger proportion of the traffic; for another data layer, the filter data may occupy a larger proportion. Therefore, in an embodiment of the present disclosure, the respective bandwidth ratios (RI, RF, RIP and/or ROP) of the data layers may be determined according to the proportions of data in the respective data layers, and the respective transmission bandwidths of the data types (such as the transmission bandwidth between the processing element array 220 and the network distributor 210) may be adjusted and/or allocated according to the respective bandwidth ratios (RI, RF, RIP and/or ROP) of the data layers. The bandwidth ratios RI, RF, RIP and ROP respectively represent the bandwidth ratios of the ifmap, filter, ipsum and opsum data. The network distributor 210 may allocate the bandwidths of the data ifmapA, filterA, ipsumA and opsumA according to RI, RF, RIP and ROP, wherein the data ifmapA, filterA, ipsumA and opsumA represent the data transmitted between the network distributor 210 and the processing element array 220.
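In an illustrative rather than a restrictive sense, the following sketch (in Python, with all names hypothetical) shows one way a total transmission bandwidth could be divided among the four data types in proportion to the bandwidth ratios RI, RF, RIP and ROP:

```python
# Hypothetical sketch: divide a fixed transmission bandwidth among the four
# data types according to per-layer bandwidth ratios. Names are illustrative.

def allocate_bandwidth(total_bandwidth, ri, rf, rip, rop):
    """Split total_bandwidth among ifmap, filter, ipsum and opsum
    in proportion to the given ratios."""
    total_ratio = ri + rf + rip + rop
    return {
        "ifmap":  total_bandwidth * ri  / total_ratio,
        "filter": total_bandwidth * rf  / total_ratio,
        "ipsum":  total_bandwidth * rip / total_ratio,
        "opsum":  total_bandwidth * rop / total_ratio,
    }

# Example: a convolutional layer in which ifmap traffic dominates.
print(allocate_bandwidth(64, ri=6, rf=4, rip=3, rop=3))
# {'ifmap': 24.0, 'filter': 16.0, 'ipsum': 12.0, 'opsum': 12.0}
```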
In an embodiment of the present disclosure, the DNN hardware accelerators 200 and 200A may selectively include a bandwidth parameter storage unit (not illustrated) coupled to the network distributor 210 for storing the bandwidth ratios RI, RF, RIP and/or ROP of the data layers and transmitting them to the network distributor 210. The bandwidth ratios RI, RF, RIP and/or ROP stored in the bandwidth parameter storage unit may be obtained through offline training.
In another possible embodiment of the present disclosure, the bandwidth ratios RI, RF, RIP and/or ROP of the data layers may be obtained in a real-time manner. For example, the bandwidth ratios RI, RF, RIP and/or ROP of the data layers are obtained from dynamic analysis of the data layers performed by a micro-processing element (not illustrated), and the bandwidth ratios are subsequently transmitted to the network distributor 210. In an embodiment, if the micro-processing element (not illustrated) dynamically generates the bandwidth ratios RI, RF, RIP and/or ROP, then the offline training for obtaining the bandwidth ratios RI, RF, RIP and/or ROP may be omitted.
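As an illustrative sketch of such a real-time derivation (assuming, hypothetically, that the data volume of each data type in a layer is known), the ratios could be computed as each data type's relative share of the layer's total traffic; the function and field names below are assumptions, not part of the disclosure:

```python
# Illustrative only: derive per-layer bandwidth ratios from the relative
# data volume of each data type in that layer.

def derive_ratios(ifmap_bytes, filter_bytes, ipsum_bytes, opsum_bytes):
    total = ifmap_bytes + filter_bytes + ipsum_bytes + opsum_bytes
    return {
        "RI":  ifmap_bytes / total,
        "RF":  filter_bytes / total,
        "RIP": ipsum_bytes / total,
        "ROP": opsum_bytes / total,
    }
```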
In an embodiment of the present disclosure, the network distributor 210 includes a tag generation unit (not illustrated), a data distributor (not illustrated) and a plurality of first in first out (FIFO) buffers (not illustrated).
The tag generation unit of the network distributor 210 generates a plurality of row tags and a plurality of column tags, but the present disclosure is not limited thereto.
As disclosed above, the processing elements and/or the processing element groups determine whether to process an item of data according to the row tags and the column tags.
The data distributor of the network distributor 210 is configured to receive the data (ifmap, filter and ipsum) and/or the output data (opsum) from the FIFO buffers and to allocate the transmission bandwidths of the data (ifmap, filter, ipsum and opsum) so that the data may be transmitted between the network distributor 210 and the processing element array 220 according to the allocated bandwidths.
The internal FIFO buffers of the network distributor 210 are respectively configured to buffer the data ifmap, filter, ipsum and opsum.
After data is processed, the network distributor 210 transmits the data ifmapA, filterA and ipsumA to the processing element array 220 and receives the data opsumA from the processing element array 220. Thus, the data may be more effectively transmitted between the network distributor 210 and the processing element array 220.
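In an illustrative rather than a restrictive sense, the data path described above may be modeled as one FIFO per data type, drained each clock cycle at a rate set by the allocated bandwidth; the class and method names below are assumptions for illustration only:

```python
from collections import deque

# A toy model of the network distributor's data path: one FIFO per data
# type, with the data distributor draining each FIFO at a rate set by its
# allocated bandwidth (words per clock cycle). Purely illustrative.

class NetworkDistributor:
    def __init__(self, bandwidths):
        # bandwidths: dict mapping data type -> words per clock cycle
        self.bandwidths = bandwidths
        self.fifos = {name: deque() for name in bandwidths}

    def push(self, data_type, word):
        self.fifos[data_type].append(word)

    def cycle(self):
        """Emit up to `bandwidth` buffered words per data type this cycle."""
        out = {}
        for name, bw in self.bandwidths.items():
            fifo = self.fifos[name]
            out[name] = [fifo.popleft() for _ in range(min(int(bw), len(fifo)))]
        return out
```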
In an embodiment of the present disclosure, each processing element group 222 further selectively includes a row decoder (not illustrated) configured to decode the row tags generated by the tag generation unit (not illustrated) of the network distributor 210 to determine which row of processing elements will receive a given item of data. Suppose the processing element group 222 includes four rows of processing elements. If the row tag is directed to the first row (for example, the value of the row tag is 1), then the row decoder, after decoding the row tag, transmits this item of data to the first row of processing elements, and the rest may be deduced by analogy.
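A minimal sketch of this row-decoding behavior, assuming a group with four rows and 1-indexed row tags as in the example above (the function name is hypothetical):

```python
# Minimal sketch of the optional row decoder: steer an item of data to the
# row of processing elements selected by the row tag.

def route_by_row_tag(row_tag, data, rows):
    """rows: list of per-row input queues; row_tag is 1-indexed, so a row
    tag of 1 selects the first row of processing elements."""
    rows[row_tag - 1].append(data)

rows = [[], [], [], []]          # a group with 4 rows of processing elements
route_by_row_tag(1, "ifmap word", rows)
assert rows[0] == ["ifmap word"]
```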
In an embodiment of the present disclosure, the processing element 310 includes a tag matching unit, a data selection and allocation unit, an operation unit, a plurality of FIFO buffers and a reshaping unit.
The tag matching unit of the processing elements 310 compares the column tag, which is generated by the tag generation unit of the network distributor 210 or is received from outside the processing element array 220, with the column ID (col. ID) to determine whether the processing element needs to process this item of data. If the comparison shows that the two match, then the data selection and allocation unit processes this item of data (such as the ifmap, filter or ipsum data).
The data selection and allocation unit of the processing elements 310 selects data from the internal FIFO buffers of the processing elements 310 to form the data ifmapB, filterB and ipsumB (not illustrated).
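In an illustrative rather than a restrictive sense, the tag-matching and buffering behavior of a processing element may be sketched as follows (all names are assumptions):

```python
# Illustrative tag-matching logic inside a processing element: accept an
# item of data only when the incoming column tag matches this element's
# column ID, then buffer it by data type for the data selection and
# allocation unit.

class ProcessingElement:
    def __init__(self, col_id):
        self.col_id = col_id
        self.buffers = {"ifmap": [], "filter": [], "ipsum": []}

    def on_data(self, col_tag, data_type, word):
        if col_tag != self.col_id:
            return            # tag mismatch: this element ignores the data
        self.buffers[data_type].append(word)
```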
The operation unit of the processing elements 310 includes, but is not limited to, a multiplication-and-addition operation unit.
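As a sketch of the arithmetic such a multiplication-and-addition operation unit commonly performs, the outgoing partial sum may be the product of an ifmap value and a filter weight accumulated onto the incoming partial sum:

```python
# One multiply-and-add step: the outgoing partial sum (opsum) is the product
# of an ifmap value and a filter weight added to the incoming partial sum.

def mac(ifmap_val, filter_val, ipsum_val):
    return ifmap_val * filter_val + ipsum_val

opsum = mac(ifmap_val=3, filter_val=2, ipsum_val=10)   # 3*2 + 10 == 16
```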
In an embodiment of the present disclosure, data inputted to the network distributor 210 may be from an internal buffer (not illustrated) of the DNN hardware accelerator 200A, wherein the internal buffer may be directly coupled to the network distributor 210. Or, in another possible embodiment of the present disclosure, the data inputted to the network distributor 210 may be from a memory (not illustrated) connected through a system bus (not illustrated). That is, the memory may possibly be coupled to the network distributor 210 through the system bus.
In a possible embodiment of the present disclosure, the network connection and data transmission between the processing element groups 222 may be performed using a unicast network, a systolic network, a multicast network or a broadcast network.
In a possible embodiment of the present disclosure, the network connection and data transmission between the processing elements in the same processing element group may be performed using a unicast network, a systolic network, a multicast network or a broadcast network.
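In an illustrative rather than a restrictive sense, the four connection types may be contrasted as delivery patterns over a list of nodes (processing element groups or processing elements); real hardware wires these differently, and the sketch below only illustrates which nodes receive an item of data (all names are hypothetical):

```python
class Node:
    """A stand-in for a processing element or a processing element group."""
    def __init__(self, name):
        self.name = name
        self.inbox = []

    def receive(self, data):
        self.inbox.append(data)

def deliver(network_type, nodes, data, target=None, targets=None):
    if network_type == "unicast":        # one sender, exactly one receiver
        nodes[target].receive(data)
    elif network_type == "systolic":     # data ripples neighbor to neighbor,
        for node in nodes[target:]:      # one hop per clock cycle
            node.receive(data)
    elif network_type == "multicast":    # one sender, a selected subset
        for t in targets:
            nodes[t].receive(data)
    elif network_type == "broadcast":    # one sender, every node
        for node in nodes:
            node.receive(data)
```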
The data may be transmitted between the processing element groups as data packages. In the following examples, each data package includes an ID field (identifying the target processing element group), an NT field (indicating the network type), an NC field (indicating whether the network type needs to be changed) and an IN field (indicating the increment applied to the ID field for the next hop).
Suppose data A is transmitted to the processing element groups PEG4, PEG5, PEG6 and PEG7. The relation between the data package and the clock cycle is listed below:
In the 0th clock cycle, data A is transmitted to the processing element group PEG4 (ID=4), and the network type is unicast (NT=0). It is determined that the network type needs to be changed (NC=1, to change the network type from unicast to systolic) based on needs, and data A will subsequently be transmitted to the processing element group PEG5 (IN=1). In the 1st clock cycle, data A is transmitted from the processing element group PEG4 to the processing element group PEG5 (ID=4+1=5), and the network type is systolic (NT=1). It is determined that the network type does not need to be changed (NC=0), and data A will subsequently be transmitted to the processing element group PEG6 (IN=1). In the 2nd clock cycle, data A is transmitted from the processing element group PEG5 to the processing element group PEG6 (ID=4+1+1=6), and the network type is systolic (NT=1). It is determined that the network type does not need to be changed (NC=0), and data A will subsequently be transmitted to the processing element group PEG7 (IN=1). In the 3rd clock cycle, data A is transmitted from the processing element group PEG6 to the processing element group PEG7 (ID=4+1+1+1=7), and the network type is systolic (NT=1). It is determined that the network type does not need to be changed (NC=0).
In another embodiment, the ID field may be changed, and the relation between the data package and the clock cycle is listed below:
In the 0th clock cycle, data A is transmitted to the processing element group PEG4 (ID=4). In the 1st clock cycle, data A is transmitted from the processing element group PEG4 to the processing element group PEG5 (ID=4+1=5), and will subsequently be transmitted to the processing element group PEG6 (IN=1). In the 2nd clock cycle, data A is transmitted from the processing element group PEG5 to the processing element group PEG6 (ID=5+1=6), and will subsequently be transmitted to the processing element group PEG7 (IN=1). In the 3rd clock cycle, data A is transmitted from the processing element group PEG6 to the processing element group PEG7 (ID=6+1=7). The number, size and type of the fields may be designed according to actual needs, and the present disclosure imposes no specific restrictions thereon.
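In an illustrative rather than a restrictive sense, the first walkthrough above may be reproduced by the following sketch, in which the field semantics (ID, NT, NC, IN) are inferred from the example and all names are hypothetical:

```python
# Cycle-by-cycle sketch of the PEG4 -> PEG7 example, using the inferred
# package fields: ID (destination group), NT (0 = unicast, 1 = systolic),
# NC (1 = switch network type at this hop) and IN (ID increment per hop).

def simulate(start_id, hops, increment=1):
    packets = []
    ident, nt = start_id, 0            # the first hop is unicast (NT=0)
    for cycle in range(hops + 1):
        nc = 1 if cycle == 0 else 0    # switch to systolic after first hop
        packets.append({"cycle": cycle, "ID": ident, "NT": nt, "NC": nc,
                        "IN": increment if cycle < hops else 0})
        ident += increment
        nt = 1                         # systolic from the second hop on
    return packets

for p in simulate(start_id=4, hops=3):
    print(p)
# cycle 0: ID=4 NT=0 NC=1 IN=1 ... cycle 3: ID=7 NT=1 NC=0 IN=0
```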
Thus, in the embodiments of the present disclosure, the network connection implementation between the processing element groups is switchable according to actual needs. For example, the network connection implementation may be switched among the unicast network, the systolic network, the multicast network and the broadcast network.
Similarly, in the embodiments of the present disclosure, the network connection implementation between the processing elements in the same processing element group is switchable according to actual needs. For example, the network connection implementation may be switched among the unicast network, the systolic network, the multicast network and the broadcast network.
The processing element array 540 includes a plurality of processing element groups PEG configured to receive data ifmap, filter and ipsum from the buffers 520 and 530, process the received data into data opsum, and then transmit the processed data opsum to the memory 550.
The buffers 630 are configured to buffer data ifmap, filter, ipsum and opsum.
The buffers 710 and 720 may be regarded as equivalent or similar to the buffers 630 described above.
In the above embodiments of the present disclosure, the couplings between the processing element groups are implemented in the same network connection implementation. However, in other possible embodiments of the present disclosure, the network connection implementation between the first processing element group and the third processing element group may be different from the network connection implementation between the first processing element group and the second processing element group.
In the above embodiments of the present disclosure, for each processing element group, the couplings between the processing elements are implemented in the same network connection implementation (for example, the processing elements in all processing element groups are coupled using a multicast network). However, in other possible embodiments of the present disclosure, the network connection implementation between the processing elements in the first processing element group may be different from the network connection implementation between the processing elements in the second processing element group. In an illustrative rather than a restrictive sense, the processing elements in the first processing element group are coupled using a multicast network, while the processing elements in the second processing element group are coupled using a broadcast network.
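In an illustrative rather than a restrictive sense, such a heterogeneous configuration could be expressed as a simple table in which the inter-group network and each group's internal network are chosen independently (the values below are examples only):

```python
# Illustrative configuration table for a heterogeneous processing element
# array: the inter-group network and each group's internal network may be
# chosen independently, as described above.

array_config = {
    "inter_group_network": "systolic",
    "intra_group_network": {
        "PEG0": "multicast",   # elements in the first group
        "PEG1": "broadcast",   # elements in the second group
        "PEG2": "multicast",
        "PEG3": "unicast",
    },
}
```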
In an embodiment, the DNN hardware accelerator receives input data. Between the processing element groups, data is transmitted by a first network connection implementation. Between the processing elements in the same processing element group, data is transmitted by a second network connection implementation. In an embodiment, the first network connection implementation between the processing element groups is different from the second network connection implementation between the processing elements in each processing element group.
The present disclosure may be used in the artificial intelligence (AI) accelerator of a terminal device (such as, but not limited to, a smart phone) or in the system chip of a smart connected device. The present disclosure may also be used in an Internet of Things (IoT) mobile device, an edge computing server, a cloud computing server, and so on.
In the above embodiments of the present disclosure, due to the flexibility of the architecture (the network connection implementation between the processing element groups may be changed according to actual needs, and the network connection implementation between the processing elements may also be changed according to actual needs), the processing element array may be easily expanded.
As disclosed in the above embodiments of the present disclosure, the network connection implementation between the processing element groups may be different from the network connection implementation between the processing elements in the same processing element group. Alternatively, the network connection implementation between the processing element groups may be identical to the network connection implementation between the processing elements in the same processing element group.
As disclosed in the above embodiments of the present disclosure, the network connection implementation between the processing element groups may be a unicast network, a systolic network, a multicast network or a broadcast network, and is switchable according to actual needs.
As disclosed in the above embodiments of the present disclosure, the network connection implementation between the processing elements in the same processing element group may be a unicast network, a systolic network, a multicast network or a broadcast network, and is switchable according to actual needs.
The present disclosure provides a DNN hardware accelerator that effectively accelerates data transmission. The DNN hardware accelerator advantageously adjusts the corresponding bandwidth according to the needs of data transmission, reduces network complexity, and provides a scalable architecture.
As described above, embodiments of the application are disclosed, but the application is not limited thereto. Those skilled in the technical field of the application may make various modifications and variations within the spirit and scope of the application. Therefore, the scope of the application is defined by the following claims.