The present invention relates to an improved depth-first Winograd convolution technology. More particularly, the present invention provides a control method of a consecutive 3*3-kernel convolution structure, wherein the disclosed control method is characterized by integrating the Winograd convolution with a line-interleaved process, so as to effectively reduce the array area of multiply-accumulate circuits and the power consumption thereof.
In recent years, artificial intelligence (AI) technologies have developed significantly and rapidly, and Convolutional Neural Networks (CNNs) have accordingly been widely used in the related AI industries. In general, CNNs have demonstrated superior quality for computational imaging applications such as super-resolution, denoising and deblurring. Although it is acknowledged that CNNs achieve high quality in computational imaging, the relatively high complexity of their network architectures remains an unsolved issue, owing to ever-increasing image resolutions and the resulting computation time of the networks. As such, in order to achieve real-time computation and reduce the model latency of a convolutional neural network, related research on computation acceleration of convolutional neural network algorithms has been pursued in the existing technologies.
Regarding general AI algorithm technologies, it is known that convolutional neural networks require large amounts of computing resources for both training and inference, primarily because the convolution layers in the network structures are computationally intensive. Fast convolution algorithms, such as the Winograd convolution, can greatly reduce the computational cost of these layers. On the other hand, it is also known that on-chip memory resource and off-chip memory bandwidth are key indices, and both deteriorate considerably as the resolution and frame rate of displays in consumer electronics increase. Certain recent works have proposed an advanced layer-fusion method relying on line-based depth-first processing, which can significantly reduce the on-chip memory resource and further shorten the network latency of the convolutional neural network in a low-bandwidth (or zero-bandwidth) manner. However, in such a line-based depth-first processing flow, each line of the intermediate feature of each convolution layer is processed independently in a depth-first schedule, which is incompatible with the block-wise, pairwise multiplication characteristic of the Winograd convolution. Although the line-based processing flow can be scheduled in a non-depth-first manner to fit the input and output block sizes of the Winograd convolution, doing so cancels out the advantages of low network latency and low on-chip memory requirement. In view of the above, how to combine the advantages of the Winograd convolution and the line-based depth-first processing flow remains an open issue to be solved.
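For illustration purposes only, the pairwise (element-wise) multiplication characteristic of the Winograd convolution mentioned above can be sketched with the one-dimensional minimal-filtering case F(2,3), which computes two outputs of a 3-tap convolution with four multiplications instead of six. The matrix names BT, G, AT and the function name winograd_f23 below are illustrative and form no part of the claimed structure.

```python
import numpy as np

# Winograd F(2,3): two outputs of a 3-tap filter with 4 multiplications
# instead of 6; the 2D case F(2*2,3*3) nests this transform in both axes.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """d: 4-element input tile, g: 3-tap kernel -> 2 convolution outputs."""
    U = G @ g            # transformed kernel (precomputable per layer)
    V = BT @ d           # transformed input tile
    return AT @ (U * V)  # element-wise product, then inverse transform

d = np.array([1., 2., 3., 4.])
g = np.array([1., 2., 3.])
direct = np.array([d[0:3] @ g, d[1:4] @ g])  # sliding-window reference
assert np.allclose(winograd_f23(d, g), direct)
```

The element-wise product U * V is the pairwise multiplication step that requires a full 4-line input tile at once, which is why a purely line-by-line depth-first schedule does not fit it directly.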
As a result, it should be apparent that there is indeed an urgent need for those skilled in the field to develop a novel and inventive convolution algorithm, so as to solve the above-mentioned issues existing in the current technologies.
In order to overcome the above-mentioned issues, one major objective in accordance with the present invention is to provide an improved Winograd convolution algorithm. According to the present invention, the disclosed improved convolution algorithm is based on a depth-first processing flow and integrates a line-interleaved computing method, so as to achieve a minimum latency of the convolutional neural network model.
Another objective in accordance with the present invention is to provide a novel control method of a consecutive convolution structure with a kernel size equal to 3*3. By employing the proposed control method of the consecutive 3*3-kernel convolution structure, the residual connection data as well as the activation data that need to be temporarily stored in each layer while performing the algorithm can be reduced. By employing the technical solution disclosed in the present invention, the requirement for static random-access memory (SRAM) can be effectively reduced, such that both power and area waste can be suppressed.
To be specific, according to one embodiment of the present invention, a control method of a consecutive convolution structure is provided. The disclosed control method of the consecutive convolution structure comprises the following steps of:
In another aspect, the present invention also provides a control method of a consecutive convolution structure, comprising the following steps of:
In view of the various embodiments of the present invention, it is also provided that at least one padding line can be selectively adopted so as to provide dummy data, zero-point data or duplicate image data for performing the Winograd convolution.
Alternatively, the at least one padding line may also be adopted for providing image data identical to the first line of the input data in order to perform the Winograd convolution. Various implementations are feasible to those skilled in the art.
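For illustration purposes only, the two padding alternatives described above, namely zero-point data or a duplicate of the first line, can be sketched as follows; the function name add_padding_line and the array layout are illustrative assumptions, not the claimed structure.

```python
import numpy as np

def add_padding_line(lines, mode="zero"):
    """lines: (H, W) stack of feature-map rows. Prepend one padding line.
    mode="zero"      -> zero-point data
    mode="replicate" -> duplicate of the first (row "0") line"""
    pad = np.zeros_like(lines[:1]) if mode == "zero" else lines[:1].copy()
    return np.concatenate([pad, lines], axis=0)

rows = np.arange(12.).reshape(3, 4)          # three feature rows
assert add_padding_line(rows, "zero").shape == (4, 4)
assert add_padding_line(rows, "zero")[0].sum() == 0.0
assert np.array_equal(add_padding_line(rows, "replicate")[0], rows[0])
```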
As a result, it should be noted that, according to the disclosed embodiments of the present invention, alternatives and modifications will be apparent to those skilled in the art, once informed by the present disclosure, and the present disclosure certainly claims and covers such alternatives and modifications and their equivalents.
The present invention should not be limited to the disclosed embodiments provided herein. In other words, the provided technical contents of the invention may also be widely utilized in view of other and various alternatives and modifications, which will be apparent to those skilled in the art once informed by the present disclosure. As a result, it is to be understood that both the foregoing general description and the following detailed description are exemplary and are intended to provide further explanation of the invention as claimed. These and other objectives of the present invention will become obvious to those of ordinary skill in the art after reading the following detailed description of the preferred embodiments.
The accompanying drawings are included to provide a further understanding of the present invention, and are incorporated in and constitute a part of this specification. The drawings illustrate the embodiments of the present invention and, together with the following descriptions, serve to explain the principles of the present invention. In the drawings:
Reference will now be made in detail to embodiments illustrated in the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or like parts. In the drawings, the shape and thickness may be exaggerated for clarity and convenience. This description will be directed in particular to elements forming part of, or cooperating more directly with, methods and apparatus in accordance with the present disclosure. It is to be understood that elements not specifically shown or described may take various forms well known to those skilled in the art. Many alternatives and modifications will be apparent to those skilled in the art, once informed by the present disclosure.
Unless otherwise specified, conditional words such as “can”, “could”, “might”, or “may” are generally intended to express that an embodiment of the invention may include a particular feature, element, or step, while such a feature, element, or step may not be required. In other embodiments, these features, elements, or steps may be omitted.
Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Certain terms are used throughout the description and the claims to refer to particular components. One skilled in the art appreciates that a component may be referred to by different names. This disclosure does not intend to distinguish between components that differ in name but not in function. In the description and in the claims, the term “comprise” is used in an open-ended fashion, and thus should be interpreted to mean “include, but not limited to.” The phrases “be coupled to,” “couples to,” and “coupling to” are intended to encompass any indirect or direct connection. Accordingly, if this disclosure mentions that a first device is coupled with a second device, it means that the first device may be directly or indirectly connected to the second device through electrical connections, wireless communications, optical communications, or other signal connections with or without other intermediate devices or connection means.
The invention is particularly described with the following examples, which are for instance only. Those skilled in the art will readily observe that numerous modifications and alterations of the device and method may be made while retaining the teachings of the invention. Accordingly, the following disclosure should be construed as limited only by the metes and bounds of the appended claims. Throughout the present application and the claims, unless the context clearly dictates otherwise, the articles “a” and “the” include the meaning of “one or at least one” of the element or component. Moreover, unless a plurality is obviously excluded according to the context, singular articles also cover a plurality of elements or components. Unless the contents clearly specify otherwise, the term “wherein” also includes the meanings of “wherein” and “whereon”. Every term used in the present claims and specification takes its usual meaning known to one skilled in the art unless additionally annotated. Some terms used to describe the invention are discussed below to guide practitioners regarding the invention. No example in the present specification limits the claimed scope of the invention.
In the following descriptions, a control method of a consecutive convolution structure will be provided. The proposed control method is applicable to a consecutive convolution structure having a kernel size, for instance, equal to 3*3. Alternatively, the disclosed control method may also be applied to consecutive convolution structures having various other kernel sizes. It should be noted that the present invention is certainly not limited thereto.
Please refer to
As can be seen in
Next on, please refer to
As can be seen, in order to perform the following Winograd convolution between the first output layer LO1 and the second output layer LO2, a third padding line X3 is provided to the first output layer LO1. According to the embodiment of the present invention, the third padding line X3 is adopted for providing zero-point data, or for providing image data identical to the first line of the first output layer queue, i.e. the first row line “0” of the first output layer LO1.
As a result, according to the third padding line X3, the first row line “0”, the second row line “1” and the third row line “2” of the first output layer queue in the first output layer LO1, the Winograd convolution between the first output layer LO1 and the second output layer LO2 can be performed so as to generate the second output layer queue. As can be seen in
In the same manner, in order to perform the following Winograd convolution between the second output layer LO2 and the third output layer LO3, a fourth padding line X4 and a fifth padding line X5 are provided to the second output layer LO2. According to the embodiment of the present invention, the fourth padding line X4 and the fifth padding line X5 can be adopted for providing zero-point data, or alternatively for providing image data identical to the first line of the second output layer queue, i.e. the first row line “0” of the second output layer LO2.
Therefore, according to the fourth padding line X4 and the fifth padding line X5, as well as the generated first row line “0” and the second row line “1” of the second output layer queue in the second output layer LO2, the Winograd convolution between the second output layer LO2 and the third output layer LO3 can be performed so as to generate the third output layer queue. As can be seen in
Later on, please refer to
As can be seen in the drawing, in order to perform the following Winograd convolution between the third output layer LO3 and the fourth output layer LO4, a sixth padding line X6 is provided to the third output layer LO3. According to the embodiment of the present invention, the sixth padding line X6 is adopted for providing zero-point data. Alternatively, the sixth padding line X6 may also be adopted for providing image data identical to the first line of the third output layer queue, which is the first row line “0” of the third output layer LO3.
As a result, according to the sixth padding line X6, the first row line “0”, the second row line “1” and the third row line “2” of the third output layer LO3, the Winograd convolution between the third output layer LO3 and the fourth output layer LO4 can be performed so as to generate the fourth output layer queue. As can be seen in
In the same manner, in order to perform the following Winograd convolution between the fourth output layer LO4 and the fifth output layer LO5, a seventh padding line X7 and an eighth padding line X8 are provided to the fourth output layer LO4. According to the embodiment of the present invention, the seventh padding line X7 and the eighth padding line X8 are adopted for providing zero-point data. According to other practicable embodiments of the present invention, the seventh padding line X7 and the eighth padding line X8 may alternatively be adopted for providing image data identical to the first line of the fourth output layer queue, which is the first row line “0” of the fourth output layer LO4.
Therefore, according to the seventh padding line X7 and the eighth padding line X8, as well as the generated first row line “0” and the second row line “1” of the fourth output layer LO4, the Winograd convolution between the fourth output layer LO4 and the fifth output layer LO5 can be performed so as to generate the fifth output layer queue. As can be seen in
Later on,
And then,
To sum up, the present invention discloses a control method of a consecutive convolution structure, which integrates the Winograd convolution with a line-interleaved process so as to enhance the algorithm efficiency. The disclosed improved convolution algorithm of the present invention is based on a depth-first processing flow and integrates a line-interleaved computing method. Therefore, a minimum latency of the convolutional neural network model is effectively achieved. In addition, in view of the disclosed process flow as provided in
Moreover, according to the disclosed process flow as provided in
For illustrating a first operation period, please refer to the steps S1102, S1104, S1106 in
Later, a second operation period is carried out, as indicated by the steps of S1108 and S1110 in
And successively, a third operation period will be performed, as indicated by the steps of S1112, S1114 and S1116 in
According to the embodiment of the present invention, when performing the steps S1102, S1108 and S1114 while the input data is received, the first line, the second line, the third line, the fourth line, the fifth line and the sixth line of the input data can be temporarily stored in a buffer for the sequential processing flow. Since only a four-line buffer is required in each data layer at one time in order to temporarily store the input data, the conventional SRAM size can be significantly reduced by employing the disclosed technical contents of the present invention.
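For illustration purposes only, the bounded per-layer storage described above can be sketched as a fixed-depth first-in-first-out line buffer, in which newly received lines evict the oldest ones so that only a fixed number of lines is resident at one time. The class name LineBuffer and the depth parameter are illustrative assumptions, not the claimed circuit.

```python
from collections import deque

class LineBuffer:
    """Fixed-depth FIFO of feature-map lines; only `depth` lines are
    resident per layer at any time (depth=4, matching the four-line
    buffer described in the text)."""
    def __init__(self, depth=4):
        self.buf = deque(maxlen=depth)  # oldest line auto-evicted

    def push(self, line):
        self.buf.append(line)

    def window(self, n):
        """Return the most recent n lines once enough have arrived."""
        if len(self.buf) < n:
            return None
        return list(self.buf)[-n:]

lb = LineBuffer(depth=4)
for i in range(6):               # stream six input lines
    lb.push(f"line{i}")
assert len(lb.buf) == 4          # older lines evicted; SRAM stays bounded
assert lb.window(4) == ["line2", "line3", "line4", "line5"]
```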
In addition to the above disclosed method, the present invention, in another aspect, also proposes a control method of the consecutive convolution structure for obtaining the generated output layer queue in the even output layers, for instance, the second output layer LO2, the fourth output layer LO4, the sixth output layer LO6, and so on. In the following sections, in order to provide a detailed and comprehensive description of the process flows of the disclosed control method for obtaining the generated output layer queue in the even output layers, the Applicants of the present invention simply take the first output layer LO1, the second output layer LO2, and the Winograd convolution performed between the first output layer LO1 and the second output layer LO2 as an exemplary embodiment for reference.
As we can see in
In a first operation period of the disclosed control method, please refer to the steps S1502, S1504 and S1506 in
And next, for a second operation period of the disclosed control method, please refer to
And successively, a third operation period will be performed, as indicated by the steps of S1514, S1516 and S1518 in
In addition, according to such an alternative embodiment of the present invention, when performing the steps S1502, S1510 and S1516 while the input data is received, the first line, the second line, the third line, the fourth line, the fifth line, the sixth line and the seventh line of the input data can be temporarily stored in a buffer for the sequential processing flow. Since this alternative embodiment likewise requires only a four-line buffer in each data layer at one time to temporarily store the input data, the conventional SRAM size is believed to be significantly reduced by employing the disclosed technical contents of the present invention.
Furthermore, the Applicants of the present invention additionally provide comparison data in the following so as to verify that the disclosed technical contents of the present invention are effective. Please refer to Table 1 below, which compares the present invention with the conventional arts when each of the conventional arts and the proposed invention is respectively applied to a convolutional neural network structure.
From the above comparison in Table 1, since the proposed invention replaces the conventional 3*3 convolution with the F(2*2,3*3) Winograd convolution, it can be found that 56% of the array area of the multiply-accumulate (MAC) circuits can be effectively saved when adopting the proposed invention. Moreover, the model latency of the convolutional neural network structure can be maintained approximately the same, without being increased, while employing the proposed invention. As a result, an improved convolution algorithm integrating a line-interleaved computing method has been provided by the Applicants of the present invention.
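For reference, the 56% figure is consistent with a simple multiplication count: a direct 3*3 convolution produces a 2*2 output block with 4 x 9 = 36 multiplications, whereas F(2*2,3*3) requires only one 4*4 transformed tile, i.e. 16 element-wise multiplications, so the saving is 1 - 16/36, approximately 55.6%. The following arithmetic sketch merely verifies this ratio and is not the invention's own circuit-area measurement.

```python
# Multiplication-count arithmetic behind the ~56% figure in Table 1:
direct_mults   = 4 * 9   # 2*2 output block, 9 multiplications each = 36
winograd_mults = 4 * 4   # F(2*2,3*3) uses one 4*4 transformed tile = 16
saving = 1 - winograd_mults / direct_mults
assert abs(saving - 5/9) < 1e-12   # 5/9 is approximately 55.6%, i.e. ~56%
```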
According to the present invention, the foregoing disclosed control method of a consecutive convolution structure is illustrated as being applied to convolution structures with a kernel size of 3*3, for example. Yet the invention is not limited to such a 3*3 convolution structure; alternative (n*n) kernel-sized convolution structures are also compatible.
Therefore, based on at least one embodiment provided above, the proposed control method of a consecutive convolution structure of the present invention is characterized by integrating the Winograd convolution with a line-interleaved process. By employing the proposed control method, the present invention is beneficial in reducing the array area of the multiply-accumulate circuits and the power consumption thereof. As a result, when compared to the prior arts, the present invention shows more effective performance. In addition, the present invention is believed to be distinctive, effective and highly competitive for the IC technology and industries in the market nowadays, whereby having extraordinary availability and competitiveness for future industrial developments.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the scope or spirit of the invention. In view of the foregoing, it is intended that the present invention cover modifications and variations of this invention provided they fall within the scope of the invention and its equivalent.