The present disclosure relates generally to data processing in machine-learning applications. More particularly, the present disclosure relates to systems and methods for efficiently performing arithmetic operations in fully connected network (FCN) layers using compute circuits.
Machine learning is a subfield of artificial intelligence that enables computers to learn by example without being explicitly programmed in the conventional sense. Numerous machine learning applications utilize a Convolutional Neural Network (CNN), i.e., a supervised network that is capable of solving complex classification or regression problems, for example, for image or video processing applications. A CNN uses as input large amounts of multi-dimensional training data, such as image or sensor data, to learn prominent features therein. A trained network can be fine-tuned to learn additional features. In an inference phase, i.e., once training or learning is completed, the CNN uses unsupervised operations to detect or interpolate previously unseen features or events in new input data to classify objects, or to compute an output such as a regression. For example, a CNN model may be used to automatically determine whether an image can be categorized as comprising a person or an animal. The CNN applies a number of hierarchical network layers and sub-layers to the input image when making its determination or prediction. A network layer is defined, among other parameters, by kernel size. A convolutional layer may use several kernels that apply a set of weights to the pixels of a convolution window of an image. For example, a two-dimensional convolution operation involves the generation of output feature maps for a layer by using data in a two-dimensional window from a previous layer.
One particularly useful operation is the fully connected (FC) operation, also known as a linear layer or Multi-Layer Perceptron (MLP). Although CNNs primarily make use of convolutional operations, an FC layer is often used as the last layer, where it may be called the “classification” layer. A common technique for increasing the utilization of both computation time and storage space for weights in many network layers is made possible by the fact that all nodes for a filter can share the same set of weights. This technique involves weight-sharing, i.e., reusing the same weights for each combination of input and output frames. However, such techniques are not applicable to complex FCN layers, in which one weight is required for each combination of input and output pixel. Accordingly, the computational complexity of FCN layers and the excessive power consumption associated therewith make hardware acceleration and power-saving systems and methods particularly desirable.
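By way of illustration only, the following sketch compares the weight count of a weight-shared convolutional layer with that of an FC layer over the same input; the numbers and function names are hypothetical examples and are not taken from this disclosure.

```python
def conv_weight_count(in_channels, out_channels, kernel_h, kernel_w):
    # Weight sharing: the same kernel weights are reused at every spatial position.
    return in_channels * out_channels * kernel_h * kernel_w

def fc_weight_count(in_h, in_w, in_channels, out_elements):
    # No sharing: one weight per input pixel/channel per output element.
    return in_h * in_w * in_channels * out_elements

# Example for a 32x32, 3-channel input:
print(conv_weight_count(3, 16, 3, 3))    # 432 weights, independent of image size
print(fc_weight_count(32, 32, 3, 16))    # 49152 weights, grows with image size
```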
References will be made to embodiments of the invention, examples of which may be illustrated in the accompanying figures. These figures are intended to be illustrative, not limiting. Although the invention is generally described in the context of these embodiments, it should be understood that it is not intended to limit the scope of the invention to these particular embodiments. Items in the figures are not to scale.
In the following description, for purposes of explanation, specific details are set forth in order to provide an understanding of the invention. It will be apparent, however, to one skilled in the art that the invention can be practiced without these details. Furthermore, one skilled in the art will recognize that embodiments of the present invention, described below, may be implemented in a variety of ways, such as a process, an apparatus, a system, a device, or a method on a tangible computer-readable medium.
Components, or modules, shown in diagrams are illustrative of exemplary embodiments of the invention and are meant to avoid obscuring the invention. It shall also be understood that throughout this discussion components may be described as separate functional units, which may comprise sub-units, but those skilled in the art will recognize that various components, or portions thereof, may be divided into separate components or may be integrated together, including integrated within a single system or component. It should be noted that functions or operations discussed herein may be implemented as components. Components may be implemented in software, hardware, or a combination thereof.
Furthermore, connections between components or systems within the figures are not intended to be limited to direct connections. Rather, data between these components may be modified, re-formatted, or otherwise changed by intermediary components. Also, additional or fewer connections may be used. It shall also be noted that the terms “coupled,” “connected,” or “communicatively coupled” shall be understood to include direct connections, indirect connections through one or more intermediary devices, and wireless connections.
Reference in the specification to “one embodiment,” “preferred embodiment,” “an embodiment,” or “embodiments” means that a particular feature, structure, characteristic, or function described in connection with the embodiment is included in at least one embodiment of the invention and may be in more than one embodiment. Also, the appearances of the above-noted phrases in various places in the specification are not necessarily all referring to the same embodiment or embodiments.
The use of certain terms in various places in the specification is for illustration and should not be construed as limiting. A service, function, or resource is not limited to a single service, function, or resource; usage of these terms may refer to a grouping of related services, functions, or resources, which may be distributed or aggregated.
The terms “include,” “including,” “comprise,” and “comprising” shall be understood to be open terms, and any lists that follow are examples and not meant to be limited to the listed items. Each reference mentioned in this patent document is incorporated by reference herein in its entirety.
It shall be noted that embodiments described herein are given in the context of embedded machine learning accelerators, but one skilled in the art shall recognize that the teachings of the present disclosure are not so limited and may equally reduce power consumption in related or other devices.
In this document, the terms “memory,” “memory device,” and “register” are used interchangeably. Similarly, the terms “kernel,” “weight,” “parameter,” and “weight parameter” are used interchangeably. “Neural network” includes any neural network known in the art. The term “hardware accelerator” refers to any type of electrical or optical circuit that may be used to perform mathematical operations and related functions, such as auxiliary control functions.
In operation, microcontroller 110 performs arithmetic operations for convolutions in software. Machine learning accelerator 114 typically uses weight data to perform matrix multiplications and related convolution computations on input data. The weight data may be unloaded from accelerator 114, for example, to load new or different weight data prior to accelerator 114 performing a new set of operations using the new weight data. More commonly, the weight data remains unchanged, and for each new computation, new input data is loaded into accelerator 114 to perform the computations. Machine learning accelerator 114 lacks hardware acceleration for at least some of a number of possible neural network computations. These missing operators are typically emulated in software by using software functions embedded in microcontroller 110. However, such approaches are very costly in terms of both power and time; and for many computationally intensive applications (e.g., real-time applications), general-purpose computing hardware is unable to perform the necessary operations in a timely manner, as the rate of calculations is limited by the computational resources and capabilities of existing hardware designs.
Further, using arithmetic functions of microcontroller 110 to generate intermediate results comes at the expense of computing time due to the added steps of transmitting data, allocating storage, and retrieving intermediate results from memory locations to complete an operation. For example, many conventional multipliers are scalar machines that use CPUs or GPUs as their computation unit and that use registers and a cache to process data stored in non-volatile memory, relying on a series of software and hardware matrix manipulation steps, such as address generation, transpositions, bit-by-bit addition and shifting, converting multiplications into additions, and outputting the result into some internal register. In practice, these repeated read/write operations performed on, for example, a significant amount of weight parameters and input data with large dimensions and/or large channel counts result in undesirable data movements in the data path and, thus, increase power consumption. There exist no mechanisms that efficiently select and use data while avoiding generating redundant data and avoiding accessing data in a redundant fashion. Software must access the same locations of a standard memory and read, re-fetch, and write the same data over and over again, even when performing simple arithmetic operations, which is computationally very burdensome and creates a bottleneck that curbs the benefits of machine learning applications.
As the amount of data subject to convolution operations increases and the complexity of operations continues to grow, the added steps of storing and retrieving intermediate results from memory to complete an arithmetic operation present only some of the shortcomings of existing designs. In short, conventional hardware and methods are not well-suited for accelerating computationally intensive FC layers or performing a myriad of other complex processing steps that involve efficiently processing large amounts of data.
Accordingly, what is needed are systems and methods that allow existing hardware, such as conventional two-dimensional hardware accelerators, to perform arithmetic operations on FCNs and other network layers in an energy-efficient manner and without increasing hardware cost.
An FCN operates on one weight per input pixel since each pixel (and channel) on the input has its own weight when being connected to each pixel (and channel) on the output. For similar reasons, FCNs are also relatively harder to train than typical network layers in a CNN, especially on deep neural networks (DNNs) used in modern image processing. In comparison, a typical CNN layer operates on one set of weights per input channel, thus rendering conventional hardware accelerators unsuitable for FCN operations.
Therefore, various embodiments presented herein enable the desired one-weight-per-pixel relationship on conventional hardware accelerator architectures by associating each channel with one pixel, such that applying one weight per pixel is equivalent to applying one weight per channel, which conventional hardware accelerators are capable of handling. Certain embodiments accomplish this by using a “flattening” method that involves converting a number of channels, each associated with a number of pixels, into a number of channels that equals the total number of pixels.
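The flattening itself may be pictured as a simple reinterpretation of the input data. The following is a minimal NumPy sketch, offered only by way of illustration and not taken from this disclosure, in which a hypothetical input of C channels of H×W pixels is reinterpreted as C·H·W channels of a single pixel each, so that one weight per channel of the flattened view corresponds to one weight per pixel of the original view.

```python
import numpy as np

# Hypothetical input: C channels, each holding H x W pixels.
C, H, W = 3, 2, 2
x = np.arange(C * H * W, dtype=np.float32).reshape(C, H, W)

# Flattened view: C*H*W channels, each holding exactly one pixel.
x_flat = x.reshape(C * H * W, 1, 1)

# One weight per flattened channel is therefore one weight per original pixel.
w = np.random.rand(C * H * W).astype(np.float32)
y = np.sum(w * x_flat[:, 0, 0])                         # per-channel weighting
assert np.isclose(y, np.sum(w.reshape(C, H, W) * x))    # equals per-pixel weighting
```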
As depicted in
In embodiments, a number of input channels 202-204, each associated with, e.g., four input pixels, may be flattened into an array of twelve channels 220, each associated with one pixel. In this manner, flattening the input data of three different input channels 202-204 results in a one-pixel-per-channel flattened view 220. Stated differently, input channels 202-204 may be converted to input channel sizes that each corresponds to one pixel. As a result, one weight (e.g., 232) in the set of twelve weights 230 may be used per input channel or pixel (e.g., 222) per output channel or pixel (e.g., 246). As illustrated in
Flattened input data 220 in
In embodiments, once input data is flattened in this manner, a conventional two-dimensional convolutional accelerator (not shown in
In embodiments, flattened data 220 may be used, for example, by a two-dimensional convolutional accelerator that reads the first data point associated with input channel 202; reads the first data point 232 associated with a first set of weight data; and then multiplies the two data points to obtain a first partial result. The accelerator also uses the first data point associated with input channel 202 and multiplies it with a first data point 242 associated with a second set of weight data to obtain a second partial result, and so on. As a person of skill in the art will appreciate, the convolutional accelerator may further perform different or additional operations such as, e.g., two-dimensional matrix multiplications that enable three-dimensional convolution operations.
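By way of illustration only, the following sketch mimics the accumulation order described above for the flattened view: each flattened input value is fetched once and multiplied against the corresponding weight of every weight set, updating one partial result per output pixel. The data values and the two weight sets are hypothetical stand-ins for elements 220, 230, and 240 and are not prescribed by this disclosure.

```python
import numpy as np

x_flat = np.arange(12, dtype=np.float32)          # stand-in for flattened input 220
w = np.random.rand(2, 12).astype(np.float32)      # stand-ins for weight sets 230, 240

partial = np.zeros(2, dtype=np.float32)           # one partial result per output pixel
for i, value in enumerate(x_flat):                # e.g., first value of channel 202
    for o in range(2):                            # e.g., weight 232, then weight 242
        partial[o] += value * w[o, i]             # accumulate partial results

assert np.allclose(partial, w @ x_flat)           # matches the linear / FC result
```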
It is noted that although two sets of weights 230, 240 are shown to generate two output pixels 246, 248, this is not intended as a limitation on the scope of the present disclosure. As a person of skill in the art will appreciate, any number of sets of weights may be applied to any number of input channels to obtain output channels or pixels. For example, instead of using input channels having a 2×2 format or size, any other dimension may be processed.
In embodiments, input representation 254 replicates the input channel data in
In detail, in embodiments, input data 254 that has a size or shape HWC, where C represents the number of channels, each having a height coordinate H and a width coordinate W, is interpreted as a number of H×W×C channels, each channel having a height of 1 and a width of 1. In embodiments, doing so increases the number of input channels and allows input data 202-206 to be flattened into a string or concatenated data array of flattened data 220.
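As a minimal sketch of this interpretation (assuming NumPy and an arbitrary example size, neither of which is prescribed by this disclosure), an HWC-shaped input may be viewed as H·W·C channels of height 1 and width 1; the element ordering of the resulting concatenated array follows whatever memory layout the implementation uses and is left as an implementation detail here.

```python
import numpy as np

H, W, C = 2, 2, 3                                   # example dimensions only
x = np.random.rand(H, W, C).astype(np.float32)      # input data of shape HWC

# Reinterpret the same data as H*W*C channels, each of height 1 and width 1.
x_flat = x.reshape(H * W * C, 1, 1)
assert x_flat.shape == (12, 1, 1)
```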
To accomplish this, in embodiments, the last column 256 of input matrix 254 in
In embodiments, the second through fourth rows in matrices 270, 280, and 290 may be filled with zeroes or interpreted as if filled with zeroes to maintain the two-dimensional matrix format, such that each column in the flattened input view comprises one pixel to accommodate the one-weight-per-channel format required by conventional hardware accelerators. One result of treating input matrix 254 as expanded into three two-dimensional matrices 270, 280, and 290 is that data in input matrix 254 may be treated as having been rearranged into a format that is compatible with an input-output combination suitable for an existing two-dimensional hardware accelerator circuit. Advantageously, such a circuit may be repurposed to process an FCN without having to implement into a system an additional, and likely underutilized, special hardware block that is customized to process FCNs. Finally, to emulate a linear FCN operation, the flattened input may be used, e.g., according to the calculations shown in
In embodiments, treating input channels as each comprising a single pixel, which changes how an existing hardware accelerator retrieves and/or reads input data, may be implemented by a flattening circuit, as will be discussed next. It is noted that, in embodiments, various different or additional implementation-specific steps may be used. Exemplary additional steps may include scaling operations, such as the scaling of output values by a predetermined factor in order to account for not having to store denominator values, which may be treated as implicit in a series of calculations.
In embodiments, flattening circuit 306 may comprise a combination of multipliers, adders, multiplexers, delay elements such as input latches, control logic such as a state machine, and other components or sub-circuits. Hardware accelerator 308 may comprise any existing computation engine known in the art, such as a conventional two-dimensional CNN accelerator, that in embodiments may comprise memory that has a two-dimensional data structure.
In operation, flattening circuit 306 may receive, fetch, load, or otherwise obtain input data from memory device 304, or data that has been output by a convolutional layer in a neural network. In embodiments, the input data may comprise, e.g., audio data, image data, or any data derived therefrom. It is understood that, in embodiments, input data may be streamed directly into flattening circuit 306 instead of being retrieved from memory device 304.
Input data may comprise input size information, such as height and width information, which may be obtained from configuration register 302, e.g., along with image data. In embodiments, flattening circuit 306 may use this information to flatten the data. In embodiments, the format of the input may be altered, e.g., by changing register values in hardware that configure the size of the input data, such as to ascertain from where to retrieve, and how to use, each next bit or pixel when flattening is activated or enabled.
In embodiments, flattening circuit 306 may be enabled, e.g., by setting a configuration bit, such that flattening may be performed virtually, i.e., without having to physically move data around, e.g., without copying the data into a string and then moving the data. As a person of skill in the art will appreciate, in embodiments, virtualization may be accomplished by using proper allocation of target addresses, e.g., such that several pieces of data may be loaded and subsequently used without having to explicitly reconfigure target addresses, pointers, and the like. Unlike address or data mechanisms used in conventional software implementations, which invariably move data in and out of memory devices and intermediate data storage, various embodiments herein, advantageously, aid in significantly reducing data movement and power consumption.
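One way such virtual flattening could be realized, offered here only as an assumption for purposes of illustration and not as the disclosed addressing scheme, is to have an address generator translate each flattened channel index into the address of the corresponding pixel; with a contiguous channel-major layout the translation reduces to a linear offset, so no data needs to be copied or moved.

```python
# Illustrative address translation for virtual flattening (an assumption, not
# the disclosed addressing scheme): map a flattened channel index back to the
# address of the corresponding pixel in a contiguous channel-major (CHW) layout.

def flat_channel_address(flat_index, base_addr, H, W, elem_size=1):
    pixels_per_channel = H * W
    c = flat_index // pixels_per_channel       # original channel
    p = flat_index % pixels_per_channel        # pixel offset within that channel
    return base_addr + (c * pixels_per_channel + p) * elem_size

# For a contiguous layout this is simply base_addr + flat_index * elem_size, so
# enabling flattening only changes how the accelerator interprets the data
# (one pixel per channel); the data itself stays where it is.
```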
In embodiments, the output of flattening circuit 306 may be provided to hardware accelerator 308 that may process the output of flattening circuit 306, e.g., using an FC operation to obtain an inference result. It is understood that components in
At step 404, the flattening circuit may, based on the received configuration information, convert the input data into a one-dimensional data format, e.g., as illustrated in
At step 406, the flattening circuit may output the converted data comprising a one-dimensional data format to be further processed, e.g., by one or more layers of a neural network. In embodiments, such further processing may be performed by the flattening circuit itself, e.g., by using a sub-circuit of the flattening circuit. Alternatively, a different circuit, e.g., a separate hardware accelerator, may be used, such as the hardware accelerator depicted in
At step 504, the flattening circuit may use the received configuration information to convert the input data into a one-dimensional data format, as illustrated in
At step 506, the converted data may be used to process at least one fully connected network layer to obtain a result, e.g., the result of an inference or related operation.
Finally, at step 508, the result may be output.
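Purely as a compact illustration of this flow, the steps above may be summarized as: flatten the input according to its configured dimensions, apply one weight per flattened channel per output element, and output the result. The function and variable names below are hypothetical, and the flattening order is an assumed implementation detail.

```python
import numpy as np

def run_fc_layer(x, weights, H, W, C):
    # Step 504: flatten the input into a one-dimensional view (one pixel per
    # channel) using the configured dimensions H, W, and C.
    x_flat = x.reshape(H * W * C)
    # Step 506: process the fully connected layer over the flattened view;
    # 'weights' has shape (out_elements, H*W*C), one weight per channel per output.
    result = weights @ x_flat
    # Step 508: output the result, e.g., an inference result.
    return result

# Example usage with arbitrary sizes:
H, W, C, OUT = 2, 2, 3, 4
y = run_fc_layer(np.random.rand(C, H, W), np.random.rand(OUT, H * W * C), H, W, C)
```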
As illustrated in
A number of controllers and peripheral devices may also be provided, as shown in
In the illustrated system, all major system components may connect to a bus 616, which may represent more than one physical bus. However, various system components may or may not be in physical proximity to one another. For example, input data and/or output data may be remotely transmitted from one physical location to another. In addition, programs that implement various aspects of the disclosure may be accessed from a remote location (e.g., a server) over a network. Such data and/or programs may be conveyed through any of a variety of machine-readable media including, but not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store or to store and execute program code, such as ASICs, programmable logic devices (PLDs), flash memory devices, and ROM and RAM devices.
Aspects of the present disclosure may be encoded upon one or more non-transitory computer-readable media with instructions for one or more processors or processing units to cause steps to be performed. It shall be noted that the one or more non-transitory computer-readable media shall include volatile and non-volatile memory. It shall be noted that alternative implementations are possible, including a hardware implementation or a software/hardware implementation. Hardware-implemented functions may be realized using ASIC(s), programmable arrays, digital signal processing circuitry, or the like. Accordingly, the “means” terms in any claims are intended to cover both software and hardware implementations. Similarly, the term “computer-readable medium or media” as used herein includes software and/or hardware having a program of instructions embodied thereon, or a combination thereof. With these implementation alternatives in mind, it is to be understood that the figures and accompanying description provide the functional information one skilled in the art would require to write program code (i.e., software) and/or to fabricate circuits (i.e., hardware) to perform the processing required.
It shall be noted that embodiments of the present disclosure may further relate to computer products with a non-transitory, tangible computer-readable medium that have computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present disclosure, or they may be of the kind known or available to those having skill in the relevant arts. Examples of tangible computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store or to store and execute program code, such as ASICs, PLDs, flash memory devices, and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher level code that are executed by a computer using an interpreter. Embodiments of the present disclosure may be implemented in whole or in part as machine-executable instructions that may be in program modules that are executed by a processing device. Examples of program modules include libraries, programs, routines, objects, components, and data structures. In distributed computing environments, program modules may be physically located in settings that are local, remote, or both.
One skilled in the art will recognize that no computing system or programming language is critical to the practice of the present disclosure. One skilled in the art will also recognize that a number of the elements described above may be physically and/or functionally separated into sub-modules or combined together.
It will be appreciated by those skilled in the art that the preceding examples and embodiments are exemplary and not limiting to the scope of the present disclosure. It is intended that all permutations, enhancements, equivalents, combinations, and improvements thereto that are apparent to those skilled in the art upon a reading of the specification and a study of the drawings are included within the true spirit and scope of the present disclosure. It shall also be noted that elements of any claims may be arranged differently, including having multiple dependencies, configurations, and combinations.