The present disclosure relates to neural networks (NNs), and specifically relates to selection of routing schemes for network-on-chip (NoC)-based deep NN (DNN) accelerators.
The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent the work is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Network-on-chip (NoC) interconnection is highly flexible and scalable. In order to reduce the design complexity of a deep neural network (DNN) accelerator implementation, an NoC-based DNN design becomes an attractive paradigm.
Aspects of the present disclosure provide a method for controlling a processing device to execute an application that employs a neural network (NN). The processing device can include a plurality of processing units arranged in a network-on-chip (NoC) to which the NN is mapped. For example, the method can include obtaining compiler information. The compiler information can include computing loads of the application on the processing units. The computing loads can relate to a dataflow type of the NN. The method can further include determining a scaling factor for computing time of each of the processing units based on the computing loads, adjusting the computing time of the processing units based on the scaling factors, and enabling the processing units to perform their respective tasks of the application within their respective adjusted computing time.
In an embodiment, the scaling factor for the computing time of each of the processing units can be determined at each synchronization stage of the NN based on the computing load on the processing unit and a critical computing load on one of the processing units at the synchronization stage. For example, the dataflow type can be layer-by-layer tiling, the NN can include a plurality of layers each being partitioned into one or more tiles that correspond to the processing units, and the scaling factor for the computing time of each of the processing units can be determined in a corresponding tile of a corresponding layer of the NN based on the computing load of the corresponding tile and a critical computing load of a critical tile of the corresponding layer. As another example, the dataflow type can be cross-layer tiling, the NN can include a plurality of layers each being partitioned into one or more tiles, each of the processing units can process corresponding fused partitioned tiles of two or more of the layers, and the scaling factor for the computing time of each of the processing units can be determined in corresponding fused tiles at a corresponding synchronization stage of the NN based on the computing load of the corresponding fused tiles and a critical computing load of critical fused tiles at the corresponding synchronization stage. In some examples, the dataflow type can be layer pipeline tiling, the NN can include a plurality of layers each being partitioned into one or more tiles, the processing units, one after another at each synchronization stage, process corresponding tiles of corresponding layers sequentially, and the scaling factor for the computing time of each of the processing units can be determined in a corresponding tile of a corresponding layer of the NN based on the computing load of the corresponding tile and a critical computing load of a critical tile of the corresponding layer.
In an embodiment, the computing time of the processing units can be adjusted based on the scaling factors by employing dynamic voltage and frequency scaling (DVFS). For example, frequencies at which the processing units operate can be adjusted based on the scaling factors. As another example, voltages applied to the processing units can be adjusted based on the scaling factors.
Aspects of the present disclosure also provide a method for controlling a processing device to execute an application that employs a neural network (NN). The processing device can include a plurality of processing units arranged in a network-on-chip (NoC) to which the NN is mapped. For example, the method can include obtaining compiler information. The compiler information can include computing loads on the processing units for a plurality of dataflow types of the NN. The method can further include calculating a sum of the computing loads on the processing units for each of the dataflow types, selecting one of the dataflow types based on the sums, and enabling the processing units to perform their respective tasks of the application, the tasks corresponding to the computing loads on the processing units for the selected dataflow type.
In an embodiment, the method can further include determining a scaling factor for computing time of each of the processing units based on the computing loads, adjusting the computing time of the processing units based on the scaling factors, and enabling the processing units to perform their respective tasks of the application within their respective adjusted computing time.
Aspects of the present disclosure also provide an apparatus for executing an application that employs a neural network (NN). For example, the apparatus can include a plurality of processing units arranged in a network-on-chip (NoC) to which the NN is mapped. The apparatus can further include receiving circuitry configured to receive compiler information. The compiler information can include computing loads of the application on the processing units. The computing loads can relate to a dataflow type of the NN. The apparatus can further include a compiler coupled to the receiving circuitry and the processing units. The compiler is configured to determine a scaling factor for computing time of each of the processing units based on the computing loads, adjust the computing time of the processing units based on the scaling factors, and generate corresponding firmware for the processing units to execute to perform their respective tasks of the application within their respective adjusted computing time.
In an embodiment, the compiler can determine the scaling factor for the computing time of each of the processing units at each synchronization stage of the NN based on the computing load on the processing unit and a critical computing load on one of the processing units at the synchronization stage. For example, the dataflow type can be layer-by-layer tiling, the NN can include a plurality of layers each being partitioned into one or more tiles that correspond to the processing units, and the compiler can determine the scaling factor for the computing time of each of the processing units in a corresponding tile of a corresponding layer of the NN based on the computing load of the corresponding tile and a critical computing load of a critical tile of the corresponding layer. As another example, the dataflow type can be cross-layer tiling, the NN can include a plurality of layers each being partitioned into one or more tiles, each of the processing units can process corresponding fused partitioned tiles of two or more of the layers, and the compiler can determine the scaling factor for the computing time of each of the processing units in corresponding fused tiles at a corresponding synchronization stage of the NN based on the computing load of the corresponding fused tiles and a critical computing load of critical fused tiles at the corresponding synchronization stage. In some examples, the dataflow type can be layer pipeline tiling, the NN can include a plurality of layers each being partitioned into one or more tiles, the processing units, one after another at each synchronization stage, process corresponding tiles of corresponding layers sequentially, and the compiler can determine the scaling factor for the computing time of each of the processing units in a corresponding tile of a corresponding layer of the NN based on the computing load of the corresponding tile and a critical computing load of a critical tile of the corresponding layer.
In an embodiment, the compiler can adjust the computing time of the processing units based on the scaling factors by employing dynamic voltage and frequency scaling (DVFS). For example, the compiler can adjust frequencies at which the processing units operate based on the scaling factors. As another example, the compiler can adjust voltages applied to the processing units based on the scaling factors.
In an embodiment, the compiler information can further include computing loads on the processing units for a plurality of dataflow types of the NN, and the compiler can be further configured to calculate a sum of the computing loads on the processing units for each of the dataflow types, select one of the dataflow types based on the sums, and generate the firmware that corresponds to the selected dataflow type. In another embodiment, the processing units can include deep learning accelerator (DLA) cores.
Note that this summary section does not specify every embodiment and/or incrementally novel aspect of the present disclosure or claimed invention. Instead, this summary only provides a preliminary discussion of different embodiments and corresponding points of novelty over conventional techniques. For additional details and/or possible perspectives of the present disclosure and embodiments, the reader is directed to the Detailed Description section and corresponding figures of the present disclosure as further discussed below.
Various embodiments of this disclosure that are proposed as examples will be described in detail with reference to the following figures, wherein like numerals reference like elements, and wherein:
Neural networks (NNs), e.g., deep neural networks (DNNs) and convolutional neural networks (CNNs), have been widely used in a variety of cognitive applications, e.g., pattern recognition, image classification, computer vision, etc., and have achieved remarkable successes in some scenarios where the volume of data to be processed far exceeds the capability of human beings, e.g., self-driving cars. The scale of DNNs keeps growing in order to better infer the data that are input to them. For example, current DNN models may consist of hundreds of layers and millions of parameters, e.g., weights, biases, kernels and activation functions, and involve complex vector and matrix computations at each layer. However, a DNN model that is too large may be too complex to run efficiently on general hardware platforms. Network-on-chip (NoC), e.g., in the form of a mesh, tree or ring, has been widely utilized in modern multi-core systems, e.g., deep learning accelerators (DLAs), for on-chip data transfer, and has provided a flexible, scalable and reusable solution to accelerate the operations of DNN models.
The NoC 110 is a packet-switched network, which can enable a large number of processing elements (PEs), e.g., the cores 111, to communicate with each other. The NoC 110 may consist of routers and links, where each of the routers can be connected to a PE (or a group of PEs), and links can connect the routers to each other.
The DNN 100 can be mapped to the NoC 110 sequentially and randomly, or by some sophisticated algorithms, e.g., mapping a group of neurons that meet some specific requirements to a PE in order to reduce the overall data communication, packet latency and power consumption.
The tensor data input to the layers can be partitioned into XY-partition tiles and/or K-partition tiles, which may differ in size. As a result, the computing loads of the cores 111 of the NoC 110 may be asymmetric due to different approaches of data tiling and mapping. Therefore, computing power may be wasted on non-critical loads. On average, 85% of the input buffers of the NoC 110 are idle, but still consume power. Besides, as the size of the NoC 110 increases, its network traffic load tends to become unbalanced, due to different approaches of data reuse, causing some routers to become hot-spot nodes.
An incoming flit may spend a router latency L(i) on the input buffers 221 and the switch 222. The router latency L(i) is a performance metric that directly reflects the level of congestion. Therefore, by analyzing the router latency L(i), information about the path congestion can be modeled accurately. The input buffers 221 and the switch 222 are prone to congestion, which increases queueing delays in the routing path. Accordingly, the router latency L(i) may consist of two major delays: a channel transfer delay (BCT+BTD(i)) and a switch delay (RST+OCD(i)), and can be expressed by
$$L(i) = (BCT + BTD(i)) + (RST + OCD(i)), \quad i \in \{\text{north}, \text{east}, \text{south}, \text{west}\} \qquad (1)$$
The channel transfer delay (BCT+BTD(i)) is related to the transmission of flits in the input buffers 221, and may consist of a buffer constant time (BCT) and a buffer transfer delay (BTD(i)). The BCT is a constant delay that occurs when a flit is transferred through an empty input buffer 221. The BTD(i) is the time that an incoming header spends shifting toward the top of the input buffer 221 after flits have accumulated. The switch delay (RST+OCD(i)) is related to allocation and switching of flits, and may consist of a router service time (RST) and an output contention delay (OCD(i)). The RST is a constant delay for a router, e.g., the DR 220, to process a flit. The OCD(i) is the time spent in contention with other flits. For example, if there is no contention, the OCD(i) is zero and the switch delay is equal to the RST. The routed flit needs to wait for some flits to be serviced by the switch 222 and transferred through the router, e.g., the DR 220, after which the output port of the DR 220 can be released. The OCD(i) can also be treated as the switch waiting time.
The router latency L(i) can reflect how different buffer architectures, allocations, and routing algorithms influence the total path delay of a packet. However, not all parameters are required to be considered when identifying how the selection function affects the packet delay. Assume that all routers are homogeneous; that is, they have the same buffer architecture and switch architecture. Therefore, the BCT and the RST remain unchanged for all routers. If the path congestion occurs, the BTD(i) and the OCD(i) can become a significant part of the overall packet delay. When congestion information is used for selection function, the impacts of the BTD(i) and the OCD(i) shall be considered simultaneously. Therefore, to estimate the congestion level, the BTD(i) and the OCD(i) are analyzed predominantly. Also, the modeling of congestion levels for channels and switches can be discussed, respectively.
As mentioned previously, the BTD(i) is the delay caused by previous flits accumulated on the same input buffer 221. In an embodiment, it is assumed that the flits of different packets are not interleaved; that is, the body flits arrive immediately after the header flit arrives at a port, and the amount of time that the incoming header spends in the input buffer 221 is thus equivalent to the service time of previous flits in the switch 222. Therefore, the BTD(i) can be expressed as the product of an occupied buffer size BDR(i) (i.e., the number of previous flits on the input buffer(i) 221 for downstream routers) and the RST, which is given by
BTD(i)=BDR(i)×RST. (2)
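Equations (1) and (2) can be illustrated with a minimal Python sketch. The function names, the use of clock cycles as the unit, and the numeric values are illustrative assumptions rather than part of the disclosure, and the OCD(i) term is simply passed in as a given value.

```python
def buffer_transfer_delay(occupied_flits: int, rst: float) -> float:
    """Equation (2): BTD(i) = BDR(i) x RST."""
    return occupied_flits * rst


def router_latency(bct: float, rst: float, occupied_flits: int, ocd: float = 0.0) -> float:
    """Equation (1): L(i) = (BCT + BTD(i)) + (RST + OCD(i))."""
    channel_transfer_delay = bct + buffer_transfer_delay(occupied_flits, rst)
    switch_delay = rst + ocd
    return channel_transfer_delay + switch_delay


# Hypothetical values in clock cycles: BCT = 1, RST = 2, three flits queued, no contention.
latency = router_latency(bct=1.0, rst=2.0, occupied_flits=3)
print(latency)  # 9.0
```

With an empty input buffer and no contention, the latency in this sketch reduces to BCT + RST, which matches the description of the two constant terms.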
The OCD(i) represents the average port-acquisition delay experienced by an incoming flit due to contention with other packets. If the incoming flit receives a failed output request, it is blocked and must wait for a grant from the switch allocator. That is, the flit needs to wait for the packets that are in the other input buffers of the same router to pass. Therefore, the length of the OCD(i) depends on two factors: a) the channel transfer delay of the packets in the other input buffers, and b) the contention probability between input channels. Namely, the OCD(i) can be expressed as the expected channel transfer delay of competing packets in the other input buffers, which is a function of BTD(j) and the contention probability (cijo), and can be given by
$$OCD(i) = \sum_{j=1,\, j \neq i}^{N_{Ch}} c_{ijo}\, BTD(j), \quad j \in \{\text{north}, \text{east}, \text{south}, \text{west}\} \qquad (3)$$
where the term NCh denotes the number of channels in a router (e.g., for 2-D mesh, NCh=5 directions), and the coefficient cijo represents the contention probability between input channels i and j; that is, cijo is the probability that packets from input channels i and j compete for a common output o. It can be expressed as
where fio and fjo represent the probabilities of the presence of packets in the input buffers (i) and (j), respectively, that are both destined for the output (o). Besides, since an incoming packet cannot compete with itself, cijo is 0 when i is equal to j.
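The summation in equation (3) can be sketched in Python as follows. The contention probabilities are taken as inputs rather than derived from fio and fjo, and the function name, the dictionary layout, and all numeric values are hypothetical. The sketch iterates over the four directions listed with equation (3).

```python
DIRECTIONS = ("north", "east", "south", "west")


def output_contention_delay(i, btd, contention):
    """Equation (3): sum of c_ijo * BTD(j) over the input channels j != i.

    btd maps each direction j to BTD(j); contention maps (i, j) pairs to the
    contention probability c_ijo for one fixed output o. Missing pairs are
    treated as zero probability.
    """
    return sum(
        contention.get((i, j), 0.0) * btd[j]
        for j in DIRECTIONS
        if j != i  # a packet cannot compete with itself, so c_iio = 0
    )


# Hypothetical inputs: BTD(j) in cycles and contention probabilities toward one output.
btd = {"north": 4.0, "east": 2.0, "south": 0.0, "west": 6.0}
contention = {("north", "east"): 0.5, ("north", "west"): 0.25}
print(output_contention_delay("north", btd, contention))  # 2.5
```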
The energy model of multiple cores, e.g., the DLA cores 300, can be expressed by
wherein Pcomputing, k is the power of a computing DLA core, k is the number of DLA cores in an NoC, and v and fcore are the operating voltage and frequency of the DLA core, respectively.
As previously mentioned, the tensor data input to the layers of a DNN can be partitioned into a plurality of tiles, for example, XY-partition tiles or K-partition tiles, which can then be mapped to an NoC that corresponds to a plurality of DLA cores. However, the partitioned tiles may differ in size from one another, and, accordingly, computing loads on the DLA cores may be unbalanced. As a result, the DLA cores take asymmetric computing time to complete their respective tasks. As shown in
According to the present disclosure, the asymmetric computing times of the DLA cores 0-3 are adjusted to become symmetric (or equal) so that the DLA cores 0-3 can complete their respective tasks at the same time during the synchronization stage. Therefore, none of the DLA cores 0-3 sits idle and wastes energy before the computing results at the current stage are forwarded to other DLA cores at the next stage.
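The idle time that this adjustment is meant to remove can be made concrete with a short Python sketch. It assumes that a synchronization stage ends when the slowest (critical) core finishes; the function name and the computing times are hypothetical.

```python
def idle_time_per_core(computing_times):
    """Return how long each core waits at the synchronization point.

    The stage ends when the slowest (critical) core finishes, so every other
    core idles for the difference between the critical time and its own time.
    """
    critical = max(computing_times)
    return [critical - t for t in computing_times]


# Hypothetical computing times (arbitrary units) for DLA cores 0-3 at one stage.
times = [8.0, 8.0, 4.0, 2.0]
print(idle_time_per_core(times))  # [0.0, 0.0, 4.0, 6.0]
```

In this example the cores 2 and 3 would otherwise idle for 4 and 6 time units, respectively, which is the slack that can be traded for a lower frequency and/or voltage.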
The tensor data input to layers of a DNN can be partitioned into one or more tiles in various manners.
where i denotes the current layer, and n denotes the DLA core n that processes the tile (i, n). After the critical computing time Tcritical_per_layer (i) is determined, a scaling factor (i, n) for the computing time of the other tiles n of each layer can be determined. In an embodiment, the scaling factor (i, n) for the computing time of the other tiles (i, n) can be determined by
The computing time of the other DLA cores n that process the other tiles (i, n) can be adjusted based on the scaling factors (i, n). For example, as shown in
where i denotes the fused tiles (i, n) at the current synchronization stage (i) and n denotes the DLA core n that processes the fused tiles (i, n). After the critical computing time Tcritical_per_fused_layer (i) is determined, a scaling factor (i, n) of the other fused tiles (i, n) at the current synchronization stage can be determined. In an embodiment, the scaling factor (i, n) of the other tiles (i, n) can be expressed by
The computing time of the other DLA cores n that process the other fused tiles (i, n) can be adjusted based on the scaling factors (i, n). For example, as shown in
In order to ensure that all four DLA cores 0-3 complete their respective tasks at the same time and none of them is idle, the computing times of the DLA cores 0-3, if asymmetric, shall be adjusted to become equal. In an embodiment, a computing time of a critical path (or tile) of each stage, e.g., a critical computing time Tcritical_per_stage (j), can be determined by
where j denotes the current stage j, n denotes the currently processed critical tile (i, n) of each layer n, and i denotes the DLA core n that processes the current tile (i, n) of the layer n. After the critical computing time Tcritical_per_stage (j) is determined, a scaling factor (i, j, n) of the other tiles (i, n) at the current stage j can be determined. In an embodiment, the scaling factor (i, j, n) of the other tiles (i, n) can be determined by
The computing time of the other DLA cores n that process the other tiles (i, n) can be adjusted based on the scaling factors (i, n). For example, as shown in
A designer generally has in-depth knowledge of the application that is to be run on a network, e.g., a DNN. The designer can therefore decide what type of tiling to employ to partition each layer of the DNN, obtain the loads on and the computing time of the partitioned tiles of each layer or the fused tiles at each stage, and calculate the scaling factor for each non-critical path of the DNN. For example, the knowledge, the load information and the scaling factors can be used by an off-line compiler to generate firmware, which may relate to computation-level energy saving, for the NoC, e.g., multi-DLAs, to execute at run-time, as shown in
At step S910, compiler information is obtained. In an embodiment, given a dataflow type, the compiler information can include loads on and/or computing time of the DLA cores. For example, given a layer-by-layer tiling (layer-based execution) for the DNN, the compiler information can include the computing loads on or computing time of the DLA cores to which one or more tiles of each of the layers of the DNN are mapped, as shown in
At step S920, it is determined as to whether a scaling factor for the computing time of each of the DLA cores at each synchronization stage (or layer) is less than one. If it is determined that the scaling factor for the computing time of a DLA core is less than one, regarding the DLA core, the method 900 proceeds to step S930; otherwise, the method 900 proceeds to step S940. In an embodiment, a critical computing time can be determined based on the loads on the tiles at each synchronization stage, and then scaling factors for non-critical loads on and/or computing time of the DLA cores to which the tiles are mapped can be calculated. For example, as shown in
As another example, as shown in
In yet another example, as shown in
At step S930, the asymmetric computing times of the DLA cores 2 and 3 are adjusted such that they are longer than their original computing times or equal to the critical computing time of the DLA core 0. In an embodiment, the computing times of the DLA cores 2 and 3 can be adjusted based on their respective scaling factors, e.g., t2/t3 and t1/t3, by employing, for example, DVFS. For example, the frequencies at which the DLA cores 2 and 3 operate can be adjusted to be the critical frequency of the DLA core 0 multiplied by the scaling factors, i.e., t2/t3 and t1/t3, respectively. As another example, the voltages applied to the DLA cores 2 and 3 can be adjusted to be the critical voltage of the DLA core 0 multiplied by the scaling factors, i.e., t2/t3 and t1/t3, respectively. The method 900 then proceeds to step S950.
At step S940, the symmetric computing time of the DLA core 1 is kept at its default setting. As the computing time of the DLA core 1 is equal to the critical computing time of the DLA core 0, the DLA core 1 will complete executing its task at the same time as the DLA core 0 does, and will not be idle during this synchronization stage. Therefore, no adjustment to the computing time is required for the DLA core 1.
At step S950, the DLA cores 0-3 perform their respective DNN tasks. As the computing time of all the DLA cores 0-3 are adjusted to become symmetric at this synchronization stage and some of the non-critical DLA cores, e.g., the DLA cores 2 and 3, have their frequencies and/or voltages reduced, none of the DLA cores 0-3 are idle during this synchronization stage and the power consumption is thus reduced.
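Steps S920 through S950 can be sketched in Python for a single synchronization stage. The sketch assumes that the scaling factor of a core is its computing time divided by the critical computing time of the stage, and that computing time is inversely proportional to operating frequency; both are assumptions made for illustration, as are the function name and the numeric values. Computing times are assumed to be positive.

```python
def plan_stage(computing_times, critical_frequency):
    """Return one (scaling_factor, frequency, adjusted_time) tuple per core."""
    critical_time = max(computing_times)
    plan = []
    for t in computing_times:
        scaling = t / critical_time          # assumed definition; 1.0 for the critical core
        if scaling < 1.0:                    # S920 -> S930: slow the non-critical core down
            frequency = critical_frequency * scaling
        else:                                # S920 -> S940: keep the default setting
            frequency = critical_frequency
        # Assumed model: running at a lower frequency stretches the time proportionally.
        adjusted_time = t * critical_frequency / frequency
        plan.append((scaling, frequency, adjusted_time))
    return plan


# Hypothetical stage: cores 0 and 1 are critical, cores 2 and 3 are lightly loaded.
for core, entry in enumerate(plan_stage([8.0, 8.0, 4.0, 2.0], critical_frequency=1000.0)):
    print(core, entry)
```

Under these assumptions every core ends the stage with the same adjusted computing time, so no core waits at the synchronization point.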
At step S1010, compiler information is obtained. In an embodiment, the compiler information can include loads on and/or computing time of the DLA cores for a plurality of types of dataflow, e.g., layer-based execution such as layer-by-layer tiling shown in
At step S1020, the power consumption of the DLA cores to which the dataflow types are mapped is determined. In an embodiment, an average scaling factor for the computing time of the DLA cores for each of the dataflow types can be calculated. For example, the average scaling factor can be determined by calculating a sum of all the computing loads on or computing time of the DLA cores and dividing the sum by a product of the numbers of the DLA cores and the stages and the critical computing time.
At step S1030, one of the dataflow types is selected. For example, one of the dataflow types that corresponds to the smallest average scaling factor can be selected to be mapped to the DLA cores. In an embodiment, step S1030 can be followed by step S920 of the method 900.
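The selection in steps S1020 and S1030 can be sketched in Python as follows. The sketch is one possible reading of the average scaling factor: each computing time is normalized by the critical computing time of its own stage and the result is averaged over all cores and stages, which for a single stage equals the sum divided by the product of the number of cores and the critical computing time. The function names, the dataflow labels, and the computing times are hypothetical.

```python
def average_scaling_factor(stage_times):
    """stage_times: list of stages, each a list of per-core computing times."""
    total = 0.0
    count = 0
    for times in stage_times:
        critical = max(times)                # critical computing time of this stage
        total += sum(t / critical for t in times)
        count += len(times)
    return total / count


def select_dataflow(candidates):
    """Return the name of the dataflow type with the smallest average scaling factor."""
    return min(candidates, key=lambda name: average_scaling_factor(candidates[name]))


# Hypothetical per-stage computing times for two candidate dataflow types.
candidates = {
    "layer_by_layer": [[8.0, 8.0, 4.0, 4.0], [6.0, 3.0, 3.0, 6.0]],
    "cross_layer": [[8.0, 8.0, 8.0, 4.0]],
}
print(select_dataflow(candidates))  # layer_by_layer
```

Here the averages are 0.75 and 0.875, so the first candidate is selected, consistent with choosing the smallest average scaling factor.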
In an embodiment, the apparatus 1100 can include a receiving circuitry 1120, a compiler 1130 coupled to the receiving circuitry 1120, and a DLA 1110 coupled to the compiler 1130. The receiving circuitry 1120 can receive compiler information for the compiler 1130 to generate firmware FW that the DLA 1110 can execute at run-time. For example, the compiler information can include loads on and/or computing time of the DLA 1110 for a plurality of dataflow types, e.g., layer-based execution such as layer-by-layer tiling shown in
In an embodiment, the DLA 1110 can include a plurality of DLA cores 1111 arranged in an NoC. The DLA cores 1111 can execute the firmware FW generated by the compiler 1130 at run-time.
In an embodiment, the compiler 1130 can, for each dataflow type, determine a critical computing time for one of the DLA cores 1111 that performs a task in a critical path at each synchronization stage, calculate scaling factors for the computing time of the other DLA cores 1111 that perform tasks in non-critical paths, and calculate an average scaling factor for the computing time of the DLA cores 1111, and can thus select one of the dataflow types based on the calculated average scaling factors. For example, when determining that the smallest average scaling factor corresponds to the layer-by-layer tiling, the compiler 1130 can adjust the computing time of the DLA cores 1111 based on their respective scaling factors, and generate the firmware FW for the DLA cores 1111 to execute at run-time, in order to minimize the energy consumption of the NoC. In an embodiment, the computing time of the DLA cores 1111 can be adjusted based on their respective scaling factors by employing DVFS. For example, the frequencies at which the non-critical DLA cores 1111 operate can be adjusted to be the critical frequency of the DLA core 1111 that corresponds to the critical computing time at each synchronization stage multiplied by their respective scaling factors. As another example, the voltages applied to the non-critical DLA cores 1111 can be adjusted to be the critical voltage of the critical DLA core multiplied by their respective scaling factors. Therefore, the non-critical DLA cores 1111 can complete their tasks at the same time as the critical DLA core 1111 does at each synchronization stage, and consume less energy as their frequencies and/or voltages are reduced.
While aspects of the present disclosure have been described in conjunction with the specific embodiments thereof that are proposed as examples, alternatives, modifications, and variations to the examples may be made. Accordingly, embodiments as set forth herein are intended to be illustrative and not limiting. There are changes that may be made without departing from the scope of the claims set forth below.
The present application claims the benefit of U.S. Provisional Application No. 63/368,998, “DNN Compute Loading and Traffic-Aware Power Management for Multi-core AI Processing System” filed on Jul. 21, 2022, which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
63368998 | Jul 2022 | US