The present disclosure relates to computational technologies, in particular to an electronic device, an accelerator, and an accelerating method applicable to a neural network operation.
In recent years, convolutional neural network (CNN) technology has seen widespread application and is rapidly becoming an industry trend. Performing CNN operations on a processor, even one with improved computational power, is generally impractical because of the frequent memory accesses required, which significantly lower computational efficiency. Conventionally, a graphics processing unit (GPU) is often used instead to accelerate CNN operations. However, a GPU has high hardware cost and power consumption, making it difficult to apply to portable devices.
Therefore, there is a need to provide a new scheme for low power applications that require high computational efficiency.
The objective of the present disclosure is to provide an electronic device, an accelerator, and an accelerating method applicable to a neural network operation, so as to improve computational efficiency.
In one aspect, the present disclosure provides an electronic device, including: a data transmitting interface configured to transmit data; a memory configured to store the data; a processor configured to execute an application program; and an accelerator coupled to the processor via a bus, wherein, according to an operation request transmitted from the processor, the accelerator is configured to read the data from the memory, perform an operation on the data to generate computed data, and store the computed data in the memory, and wherein the processor is in a power saving state when the accelerator performs the operation.
In another aspect, the present disclosure provides an accelerator for performing a neural network operation on data in a memory, including: a register configured to store a plurality of parameters related to the neural network operation; a reader/writer configured to read the data from the memory; a controller coupled to the register and the reader/writer; and an arithmetic unit coupled to the controller, wherein, based on the parameters, the controller controls the arithmetic unit to perform the neural network operation on the data to generate computed data.
In still another aspect, the present disclosure provides an accelerating method applicable to a neural network operation, including: (a) receiving data; (b) utilizing a processor to execute a neural network application program; (c) in execution of the neural network application program, storing the data in a memory and sending a first signal to an accelerator; (d) using the accelerator to perform the neural network operation to generate computed data; (e) sending a second signal to the processor by using the accelerator after the neural network operation is accomplished; (f) continuing executing the neural network application program using the processor; and (g) determining whether to run the accelerator; if yes, the processor sends a third signal to the accelerator and the method returns to step (d); if no, the process is terminated.
In the present disclosure, the processor delegates some operations (e.g., CNN operations) to the accelerator. This reduces the time spent accessing the memory and improves computational efficiency. Moreover, in some embodiments, when the accelerator performs the operation, the processor is in a power saving state. Accordingly, power consumption can be efficiently reduced.
To further clarify the objectives, technical schemes, and technical effects of the present disclosure, the present disclosure will be described in detail below by using embodiments in conjunction with the appended drawings. It should be understood that the specific embodiments described herein are merely for explaining the present disclosure, and as used herein, the term “embodiment” refers to an instance, an example, or an illustration but is not intended to limit the present disclosure. In addition, the articles “a” and “an” as used in the specification and the appended claims should generally be construed to mean “one or more” unless specified otherwise or unless it is clear from the context that a singular form is intended. Also, in the appended drawings, components having similar or the same structure or function are indicated by the same reference number.
The present disclosure provides an electronic device characterized by offloading certain operations from a processor. In particular, these operations are related to convolutional neural network (CNN) operations. The electronic device of the present disclosure can significantly improve computational efficiency.
Referring to
The processor 14 is used to execute an application program such as a neural network application program, and more particularly, a CNN application program. The processor 14 is coupled to the accelerator 16 via the bus 18. When the processor 14 needs to perform an operation, for example, an operation related to a CNN operation such as a Convolution operation, a Rectified Linear Unit (ReLu) operation, or a Max Pooling operation, the processor 14 sends an operation request to the accelerator 16 via the bus 18. The bus 18 can be implemented by an Advanced High-performance Bus (AHB).
The accelerator 16 receives the operation request from the processor 14 via the bus 18. When the operation request is received, the accelerator 16 reads the raw data from the memory 12, performs an operation on the raw data to generate computed data, and stores the computed data in the memory 12. For example, the operation is a convolution operation, the most complicated operation in a CNN. For the convolution operation, the accelerator 16 multiplies each record of the raw data by a weight coefficient and then sums the products. It can also add a bias to the sum as an output. The result can propagate to a next CNN layer, serving as an input. For example, the result can propagate to a convolutional layer, in which the convolution operation is performed once again; its output then serves as the input of a next layer. The next layer can be a ReLu layer, a max pooling layer, or an average pooling layer. A fully connected layer can be connected before a final output layer.
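Purely for illustration, and not as the accelerator's hardware implementation, the following C sketch shows the operations named above: a convolution that multiplies each record by a weight, sums the products, and adds a bias, followed by ReLu and Max Pooling steps. All function and variable names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical illustration of the layer operations named above;
 * the accelerator 16 performs these in hardware. */

/* Convolution: multiply each record by its weight, sum the products,
 * then add a bias; the result propagates to the next layer. */
static int conv_output(const int *data, const int *weights, size_t len, int bias)
{
    int sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i] * weights[i];
    return sum + bias;
}

/* ReLu: pass positive values and clamp negative values to zero. */
static int relu(int x)
{
    return x > 0 ? x : 0;
}

/* Max Pooling: keep the largest value in a window. */
static int max_pool(const int *window, size_t len)
{
    int m = window[0];
    for (size_t i = 1; i < len; i++)
        if (window[i] > m)
            m = window[i];
    return m;
}
```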
The operations performed by the accelerator 16 are not limited to taking the raw data as an input and operating on the raw data directly. The operations performed by the accelerator 16 can be the operations required by each layer of the neural network, for example, the afore-mentioned Convolution operation, ReLu operation, and Max Pooling operation.
The above-mentioned raw data may be processed and optimized in a front end to generate data, which is then stored in the memory 12. For example, the raw data may be processed with filtering, noise reduction, and time-frequency domain conversion in the front end, and then stored in the memory 12. The accelerator 16 performs the afore-mentioned operation on the processed data. In this disclosure, the raw data is not limited to data retrieved from a sensor but refers broadly to any data that is transmitted to the accelerator 16 to be computed.
The electronic device can be implemented as a System on Chip (SoC). That is, the data transmitting interface 10, the memory 12, the processor 14, the accelerator 16, and the bus 18 can be integrated into the SoC.
In the electronic device of the present disclosure, the processor 14 delegates some operations to the accelerator 16. This can reduce processor load, increase utilization of the processor 14, and reduce latency, and in some applications it can also reduce the cost of the processor 14. If the operations related to CNN applications were processed by the processor 14, accessing the memory 12 would take too much time, leading to a longer processing time. In the electronic device of the present disclosure, the accelerator 16 is in charge of the operations related to the neural network. One advantage in this aspect is that the memory access time is reduced. For example, in a situation where the processor 14 runs at twice the operational frequency of the accelerator 16 and the memory 12, the accelerator 16 can access the content of the memory 12 in one cycle while it takes up to 10 cycles for the processor 14. Accordingly, deployment of the accelerator 16 can efficiently improve computational efficiency.
Another advantage of the present disclosure is that the electronic device can efficiently reduce power consumption. Specifically, when the accelerator 16 performs the operation, the processor 14 is idle and can optionally be put into a power saving state. The processor 14 operates under an operation mode and a power saving mode; when the accelerator 16 performs the operation, the processor 14 is in the power saving mode. In the power saving state or the power saving mode, the processor 14 can be in an idle state waiting for an external interrupt, or in a low clock state, that is, the clock is lowered or completely disabled in the power saving mode. In one embodiment, when changed from the operation mode to the power saving mode, the processor 14 enters the idle state and its clock is lowered to a low clock or completely disabled. In a situation where the processor 14 runs at an operational frequency or clock higher than that of the accelerator 16, the processor 14 consumes more power than the accelerator 16. In the embodiments of the present disclosure, the processor 14 enters the power saving mode when the accelerator 16 performs the operation. Accordingly, power consumption can be efficiently reduced, which is beneficial to wearable device applications, for example.
In one embodiment, the raw data or the data can be stored in the first memory 121, and the computed data generated by the accelerator 16 performing the operation can be stored in the second memory 122. Specifically, the processor 14 transmits the data to the accelerator 16. The accelerator 16 receives the data via the first bus 181 and writes the data to the first memory 121. The computed data generated by the accelerator 16 is written to the second memory 122 via the first bus 181.
In another embodiment, the raw data or the data can be stored in the second memory 122, and the computed data generated by the accelerator 16 performing the operation can be stored in the first memory 121. Specifically, the data is written to the second memory 122 via the first bus 181. The computed data generated by the accelerator 16 is directly written to the first memory 121.
In still another embodiment, both the data and the computed data are stored in the first memory 121. The second memory 122 is used to store the data related to the application program executed by the processor 14. For example, the second memory 122 stores related data (e.g., program data) required by a convolutional neural network application program running on the processor 14. In this embodiment, the processor 14 transmits the data for operation to the accelerator 16. The accelerator 16 receives the data via the first bus 181 and writes the data to the first memory 121. The computed data generated by the accelerator 16 is directly written to the first memory 121.
The processor 14 and the accelerator 16 can share the first memory 121. The processor 14 can write the data into the first memory 121 and read the data from the first memory 121 via the accelerator 16. The accelerator 16 has priority over the processor 14 when accessing the first memory 121.
In the first embodiment, the electronic device further includes a flash memory controller 24 and a display controller 26 coupled to the second bus 182. The flash memory controller 24 is configured to be coupled to a flash memory 20 external to the electronic device. The display controller 26 is configured to be coupled to a display device 260 external to the electronic device. That is, the electronic device can be coupled to the flash memory 20 to achieve an external memory access function and coupled to the display device 260 to achieve a display function.
The system control unit 22 is coupled to the processor 14 via the first bus 181. The system control unit 22 can manage system resources and control activities between the processor 14 and other components. In another embodiment, the system control unit 22 can be integrated into the processor 14 as a component of the processor 14. Specifically, the system control unit 22 can control the processor clock, or operational frequency, of the processor 14. In the present disclosure, the system control unit 22 is used to lower the processor clock or completely disable the clock so that the processor 14 enters the power saving mode from the operation mode. Similarly, the system control unit 22 is used to increase the processor clock to the common clock frequency so that the processor 14 returns to the operation mode from the power saving mode. In another aspect, when the accelerator 16 performs the operation, a firmware driver may be used to send a wait-for-interrupt (WFI) instruction to the processor 14 to put the processor 14 into the idle state.
Referring to
The register 78 is coupled to the processor 14 via the bus 18. A bus coupled to the register 78 and a bus coupled to the reader/writer 76 can be different buses. That is, the register 78 and the reader/writer 76 can be coupled to the processor 14 via different buses. When the processor 14 executes, for example, the neural network application program and the firmware driver, some parameters may be written to the register 78. These parameters are related to the neural network operation, such as data width, data depth, kernel width, kernel depth, and loop count. The register 78 may also store some control logic parameters. For example, a parameter CR_REG includes a Go bit, a Relu bit, a Pave bit, and a Pmax bit. According to the Go bit, the controller 72 determines whether to perform the neural network operation. Whether the neural network operation contains a ReLu operation, an Average Pooling operation, or a Max Pooling operation is determined according to the Relu bit, the Pave bit, and the Pmax bit, respectively.
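As a hedged illustration only, since the disclosure does not define an actual register map, the parameters and the CR_REG control bits described above might be modeled in firmware as follows; every field name, width, and bit position here is an assumption.

```c
#include <stdint.h>

/* Hypothetical firmware-side view of the register 78; the field names,
 * widths, and bit positions are assumptions for illustration only. */
typedef struct {
    uint32_t data_width;    /* width of the input data                */
    uint32_t data_depth;    /* depth of the input data                */
    uint32_t kernel_width;  /* convolution kernel width               */
    uint32_t kernel_depth;  /* convolution kernel depth               */
    uint32_t loop_count;    /* number of loops to execute             */
    uint32_t cr_reg;        /* control logic parameter CR_REG         */
} accel_regs_t;

/* Assumed bit positions inside CR_REG. */
#define CR_GO    (1u << 0)  /* start the neural network operation     */
#define CR_RELU  (1u << 1)  /* operation contains a ReLu operation    */
#define CR_PAVE  (1u << 2)  /* operation contains Average Pooling     */
#define CR_PMAX  (1u << 3)  /* operation contains Max Pooling         */
```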
The controller 72 is coupled to the register 78, the reader/writer 76, and the arithmetic unit 74. The controller 72 is configured to operate based on the parameters stored in the register 78 to determine whether to control the reader/writer 76 to access the memory 12, and to control operation flow of the arithmetic unit 74. The controller 72 can be implemented by a finite-state machine (FSM), a micro control unit (MCU), or other types of controllers.
The arithmetic unit 74 can perform an operation related to the neural network, such as Convolution operation, ReLu operation, Average Pooling operation, and Max Pooling operation. Basically, the arithmetic unit 74 includes a multiply-accumulator which can multiply each record of the data by a weight coefficient and sum them up. In the present disclosure, the arithmetic unit 74 may have different configurations based on different applications. For example, the arithmetic unit 74 may include various types of operation logic and may include an adder, a multiplier, an accumulator, or their combinations. The arithmetic unit 74 may support various data types that may include unsigned integer, signed integer, and floating-point numbers, but are not limited thereto.
The arithmetic unit 74 includes a multiply array 82, an adder 84, and a carry-lookahead adder (CLA) 86. During computation, the arithmetic unit 74 first reads the data and the corresponding weights from the memory 12. The data can be an input of a zeroth layer or an output from a previous layer in the neural network. Next, the data and the weights, expressed as binary numbers, are input to the multiply array 82 to perform a multiply operation. For example, if a record of the data is represented by a1a2 and its corresponding weight is represented by b1b2, the multiply array 82 obtains a1b1, a1b2, a2b1, and a2b2. The adder 84 is used to calculate the sum of the products, i.e., D1=a1b1+a1b2+a2b1+a2b2. The result is then output to the carry-lookahead adder 86. The multiply array 82 and the adder 84 can sum the products in a single pass, which avoids intermediate calculations and thus reduces the time spent accessing the memory 12. Next, a similar operation is performed on a next record of the data and its corresponding weight to obtain D2. The carry-lookahead adder 86 sums up the output values from the adder 84 (i.e., S1=D1+D2), taking the accumulated sum as an input and adding to it the next value output by the adder 84 (e.g., S2=S1+D3). Finally, the carry-lookahead adder 86 adds the accumulated value to a bias value read from the memory 12, for example, Sn+b, where b is the bias.
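The running sums described above (D1, S1=D1+D2, S2=S1+D3, and finally Sn+b) can be summarized by the following sketch; it is only a software mirror of the data flow, not a model of the multiply array or carry-lookahead adder hardware, and all names are assumptions.

```c
#include <stddef.h>

/* Software mirror of the accumulation flow in the arithmetic unit 74:
 * D[i] stands for the product of one data record and its weight (produced
 * by the multiply array 82 and the adder 84); S accumulates the partial
 * sums the way the carry-lookahead adder 86 does, without writing any
 * intermediate result back to the memory 12. */
static long accumulate_with_bias(const long *D, size_t n, long b)
{
    long S = 0;
    for (size_t i = 0; i < n; i++)
        S += D[i];      /* S1 = D1 + D2, S2 = S1 + D3, ...           */
    return S + b;       /* final value Sn + b, stored back only once */
}
```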
During the computation, the arithmetic unit 74 of the present disclosure does not have to store the results of intermediate calculations to the memory 12 and read them back to proceed with subsequent calculations. Accordingly, the present disclosure avoids frequent accesses to the memory 12, decreasing computing time while improving computational efficiency.
In step S90, data is received. The data is the data to be computed using the accelerator 16. For example, a sensor is used to capture sensing data such as ECG data. The sensing data can be used as the input data as-is or further processed with filtering, noise reduction, and/or time-frequency domain conversion before being used as the data.
In step S92, the processor 14 is utilized to execute a CNN application program. After receiving the data, the processor 14 can execute the CNN application program based on an interrupt request.
In step S94, during execution of the CNN application program, the data is stored in the memory 12 and a first signal is sent to the accelerator 16. In this step, the CNN application program writes the data, the weights, and the biases into the memory 12. The CNN application program can accomplish these copy operations through the firmware driver. The firmware driver may further copy the parameters (e.g., pointer, data width, data depth, kernel width, kernel depth, and computation types) required by the computation to the register 78. When all necessary data are ready, the firmware driver can send the first signal to the accelerator 16 to start the accelerator 16 to perform the operation. The first signal is an operation request signal. For example, the firmware driver may set the Go bit to true to start the CNN operation. The Go bit is contained in CR_REG of the register 78 of the accelerator 16.
Meanwhile, the firmware driver may send a wait-for-interrupt (WFI) instruction to the processor 14 to put the processor 14 into an idle state to save power. In this way, when the accelerator 16 performs the operation, the processor 14 runs in a lower power state. The processor 14 may exit the idle state and return to the operation mode when receiving an interrupt signal.
The firmware driver can also send a signal to the system control unit 22. Based on this signal, the system control unit 22 can selectively lower the processor clock or completely disable it so as to transition the processor 14 from the operation mode into the power saving mode. For example, the firmware driver can decide whether to lower or disable the processor clock by determining whether the number of loops of the requested CNN operation is larger than a pre-set threshold.
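Combining step S94 with the power-saving handling just described, a firmware driver might be sketched as follows; every register address, helper function, and threshold value is an assumption introduced for illustration, not the actual driver of the disclosure.

```c
#include <stdint.h>

/* Hypothetical register address, bit, and helper functions; all of these
 * names are assumptions and not part of the disclosure. */
#define ACCEL_CR_REG  (*(volatile uint32_t *)0x40010014u)  /* assumed address */
#define CR_GO         (1u << 0)                            /* assumed Go bit  */

void copy_data_weights_biases_to_memory(void);  /* assumed copy helper         */
void write_parameters_to_register78(void);      /* assumed parameter setup     */
void system_ctrl_lower_clock(void);             /* assumed system control hook */
void cpu_wait_for_interrupt(void);              /* issues a WFI instruction    */

/* Step S94 as a firmware sketch: stage the data, program the parameters,
 * set the Go bit (the first signal), and idle the processor 14 while the
 * accelerator 16 performs the CNN operation. */
void start_cnn_operation(uint32_t loop_count, uint32_t clock_threshold)
{
    copy_data_weights_biases_to_memory();
    write_parameters_to_register78();
    ACCEL_CR_REG |= CR_GO;                 /* first signal: start the operation */

    if (loop_count > clock_threshold)      /* large job: worth lowering clock   */
        system_ctrl_lower_clock();         /* enter the power saving mode       */
    cpu_wait_for_interrupt();              /* idle until the second signal      */
}
```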
In step S96, the accelerator 16 is used to perform the CNN operation to generate computed data. For example, when the controller 72 of the accelerator 16 detects that the Go bit in CR_REG of the register 78 is true, the controller 72 controls the arithmetic unit 74 to perform the CNN operation on the data to generate the computed data. The CNN operation may include a Convolution operation, a ReLu operation, an Average Pooling operation, and a Max Pooling operation. The arithmetic unit 74 may support various data types, which may include unsigned integer, signed integer, and floating point, but are not limited thereto.
In step S98, the accelerator 16 sends a second signal to the processor 14 after the CNN operation is accomplished. When the CNN operation is accomplished, the firmware driver may set the Go bit of CR_REG of the register 78 to false to terminate the CNN operation. Meanwhile, the firmware driver can inform the system control unit 22 to restore the processor clock to the common clock frequency, and the accelerator 16 sends an interrupt request to the processor 14 such that the processor 14 returns to the operation mode from the idle state.
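One possible, simplified firmware handling of step S98 is sketched below; the ordering is simplified, and the names and addresses are assumptions consistent with the earlier sketch.

```c
#include <stdint.h>

#define ACCEL_CR_REG  (*(volatile uint32_t *)0x40010014u)  /* assumed address */
#define CR_GO         (1u << 0)                            /* assumed Go bit  */

void system_ctrl_restore_clock(void);   /* assumed system control hook */

/* Hypothetical handler run when the accelerator 16 raises the completion
 * interrupt (the second signal): clear the Go bit to terminate the CNN
 * operation and restore the common clock frequency so the processor 14
 * returns to the operation mode. */
void accel_done_irq_handler(void)
{
    ACCEL_CR_REG &= ~CR_GO;
    system_ctrl_restore_clock();
}
```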
In step S100, the processor 14 continues executing the CNN application program. After restoring back to the operation mode, the processor 14 continues executing the rest of the application program.
In step S102, the processor 14 determines whether to run the accelerator 16 again. If yes, the processor 14 sends a third signal to the accelerator 16 and the process returns to step S94. If no, the process is terminated. Specifically, the CNN application program determines whether there are more data to be processed using the accelerator 16. If yes, the third signal is sent to the accelerator 16 and the input data are copied to the memory 12 for performing the CNN operation; the third signal is an operation request signal. If no, the accelerating process is terminated.
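Under the same assumptions as the sketches above, the loop formed by steps S94 through S102 could be outlined as follows; whether more data remain is decided by the CNN application program, and the loop-count and threshold values are placeholders.

```c
#include <stdbool.h>
#include <stdint.h>

bool cnn_more_data_pending(void);       /* assumed application-level check     */
void continue_cnn_application(void);    /* remainder of the CNN program (S100) */
void start_cnn_operation(uint32_t loop_count, uint32_t clock_threshold); /* from the step S94 sketch */

/* Hypothetical outline of steps S94 through S102: start the accelerator,
 * resume the application after the completion interrupt, and loop back to
 * step S94 while data remain to be processed. */
void cnn_accelerating_loop(void)
{
    do {
        start_cnn_operation(128u, 64u);     /* steps S94-S98 (assumed values) */
        continue_cnn_application();         /* step S100                      */
    } while (cnn_more_data_pending());      /* step S102: third signal if yes */
}
```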
In conclusion, while the preferred embodiments of the present disclosure have been illustrated and described in detail, various modifications and alterations can be made by persons skilled in this art. The embodiments of the present disclosure are therefore described in an illustrative rather than restrictive sense. It is intended that the present disclosure shall not be limited to the particular forms as illustrated, and that all modifications and alterations that maintain the spirit and scope of the present disclosure are within the scope as defined in the appended claims.
Foreign Application Priority Data: Taiwan (TW) Application No. 106142473, filed December 2017.