This disclosure relates generally to deep neural networks (DNNs), and more specifically, to approximating activation functions in DNNs with look-up tables (LUTs) having hybrid architectures.
DNNs are used extensively for a variety of artificial intelligence applications ranging from computer vision to speech recognition and natural language processing due to their ability to achieve high accuracy. However, the high accuracy comes at the expense of significant computation cost. DNNs have extremely high computing demands as each inference can require hundreds of millions of MAC (multiply-accumulate) operations as well as a large amount of data to read and write. DNN inference also requires computation of activation functions. Therefore, techniques to improve efficiency of DNNs are needed.
Embodiments will be readily understood by the following detailed description in conjunction with the accompanying drawings. To facilitate this description, like reference numerals designate like structural elements. Embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.
Overview
The last decade has witnessed a rapid rise in AI (artificial intelligence) based data processing, particularly based on DNNs. DNNs are widely used in the domains of computer vision, speech recognition, image, and video processing mainly due to their ability to achieve beyond human-level accuracy. The significant improvements in DNN model size and accuracy coupled with the rapid increase in computing power of execution platforms have led to the adoption of DNN applications even within resource constrained mobile and edge devices that have limited energy availability.
Activation functions are important parts of DNNs. An activation function can decide whether a neuron should or should not be activated by computing the weighted sum of activations and adding a bias. An important purpose of activation functions is to introduce non-linearity to the output of neurons. Considering the complexity of some of the non-linear activation functions used in many DNNs, hardware implementation may require approximation within a certain level of accuracy. Piece-wise linear approximation is one approach to approximate complex non-linear activation functions. Piece-wise linear approximation is usually based on approximating complex non-linear curves using several linear segments. Each linear segment could be represented using a slope and an intercept. The complete range of a non-linear activation function may be divided into smaller regions such that each region could be approximated using a linear segment. These regions could be of variable range, but executing the linear functions, even though there can be a greater number of linear functions, can be more efficient than executing the non-linear activation function itself. The slope and intercept of linear segments can be stored in a LUT. Increasing accuracy usually requires more entries in the LUT. A LUT address generation logic is usually used to generate the address of the slopes and intercepts within the LUT that correspond to the linear segment within which the input lies.
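For purposes of illustration only, the piece-wise linear approach described above can be sketched as follows. The sketch assumes a sigmoid activation function, hand-picked breakpoints, and secant lines through the segment end points; the names build_lut, lut_address, approx, BREAKPOINTS, and LUT are hypothetical and are not taken from the present disclosure.

```python
import bisect
import math


def build_lut(fn, breakpoints):
    """Return one (slope, intercept) entry per segment between consecutive breakpoints."""
    lut = []
    for lo, hi in zip(breakpoints[:-1], breakpoints[1:]):
        # Secant line through the two end points of the segment (an assumption;
        # other fitting methods are possible).
        slope = (fn(hi) - fn(lo)) / (hi - lo)
        intercept = fn(lo) - slope * lo
        lut.append((slope, intercept))
    return lut


def lut_address(x, breakpoints):
    """Address generation: index of the segment containing x, clamped to the LUT range."""
    idx = bisect.bisect_right(breakpoints, x) - 1
    return min(max(idx, 0), len(breakpoints) - 2)


def approx(x, lut, breakpoints):
    """Approximate the activation by evaluating the linear function of the matching segment."""
    slope, intercept = lut[lut_address(x, breakpoints)]
    return slope * x + intercept


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical variable-width segments: narrower where the curve bends more.
BREAKPOINTS = [-8.0, -4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0, 8.0]
LUT = build_lut(sigmoid, BREAKPOINTS)
```

In this sketch, adding breakpoints adds LUT entries and reduces the approximation error, which mirrors the accuracy and LUT size trade-off noted above. Inputs outside the breakpoint range reuse the first or last entry.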
For a DNN accelerator to be versatile, flexible, and future proof, it can be important to have DNN accelerators with the capability to be programmed for various types of activation functions as the need arises. Many currently available approaches for approximating activation functions require either a Digital Signal Processor (DSP) or dedicated LUTs with all entries directly accessible to all PPEs. These approaches suffer from inflexible LUT architecture with fixed area and performance, which can come with area penalty. Such area penalty can be a problem in resource constrained small form factor edge/client devices.
Embodiments of the present disclosure provide systems and methods for approximating activation functions with LUTs having hybrid architectures. An example LUT having a hybrid architecture may have a dedicated portion and a shared portion. The dedicated portion is accessible by a single PPE group, while the shared portion is accessible by multiple PPE groups. For instance, the dedicated portion of the LUT is connected to a data transfer path that connects to a single PPE group, while the shared portion of the LUT is connected to multiple data transfer paths that connect to multiple PPE groups. The hybrid architecture of the LUT may be determined using one or more parameters, which may indicate how many entries are attributed to the dedicated portion or how many entries are attributed to the shared portion. The parameters may be determined based on statistical analysis of the input data elements of the activation function. The dedicated portion may be used to store parameters of one or more linear functions that approximate an activation function for one or more selected input segments, i.e., input segments into which more input data elements of the activation function fall than into an unselected input segment. The shared portion may be used to store parameters of one or more linear functions that approximate an activation function for one or more unselected input segments.
In various embodiments of the present disclosure, a processing unit may receive data elements to be input into a non-linear activation function (“input data elements”). The input data elements may have a floating-point data type, such as FP32, FP16, BF16, FP8, and so on. FP stands for floating point. BF stands for brain floating point. The processing unit may apply the non-linear activation function on the input data elements to compute output data elements. The output data elements may constitute the output of the non-linear activation function. The output data elements may also have a floating-point data type. In some embodiments, the data type of the output data elements may be the same as the data type of the input data elements. In other embodiments, the data type of the output data elements may be different from the data type of the input data elements. For example, the output data elements may have a data type that has a higher precision than the data type of the input data elements. When the non-linear activation function is approximated by linear functions, the processing unit would apply the linear functions on the input data elements to compute output data elements. These output data elements may constitute the approximated output of the activation function. The approximated output may be the same as or similar to the real output of the activation function.
The processing unit may include LUTs and PPE groups. Each PPE group may include one or more PPEs. The input range of the non-linear activation function may be divided into smaller regions. The smaller regions are referred to as input segments. Each of the input segments includes one or more input data elements of the non-linear activation function. One or more input segments may be selected based on a total number of input data elements of the activation function that fall into each selected input segment. A parameter of a first linear function that approximates the activation function for at least part of a selected input segment may be stored in a first portion of a first LUT. The first portion of the first LUT is dedicated to a first PPE group that computes an approximated output of the activation function using the parameter of the first linear function. A parameter of a second linear function that approximates the non-linear activation function for at least part of an unselected input segment may be stored in a shared pool of LUT entries. The shared pool of LUT entries includes a second portion of the first LUT and a portion of a second LUT and is shared by the first PPE group and a second PPE group.
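For purposes of illustration only, the selection of input segments described above can be sketched as follows. The sketch assumes that selection is done by counting profiled input data elements per segment and keeping the most populated segments in the dedicated portion; the ranking rule, the sample data, and the names segment_index, partition_segments, and lookup are hypothetical and are not taken from the present disclosure.

```python
import bisect
from collections import Counter


def segment_index(x, breakpoints):
    """Index of the input segment containing x, clamped to the valid range."""
    idx = bisect.bisect_right(breakpoints, x) - 1
    return min(max(idx, 0), len(breakpoints) - 2)


def partition_segments(samples, breakpoints, num_dedicated):
    """Split segment indices into (dedicated, shared) by how many samples fall in each."""
    counts = Counter(segment_index(x, breakpoints) for x in samples)
    all_segments = range(len(breakpoints) - 1)
    # Most populated segments first; ties broken by segment index (an assumption).
    ranked = sorted(all_segments, key=lambda s: (-counts[s], s))
    dedicated = sorted(ranked[:num_dedicated])
    shared = sorted(ranked[num_dedicated:])
    return dedicated, shared


def lookup(segment, dedicated_portion, shared_pool):
    """Return (parameters, source); the dedicated portion is checked first."""
    if segment in dedicated_portion:
        return dedicated_portion[segment], "dedicated"
    return shared_pool[segment], "shared"


# Hypothetical profiling data in which most inputs cluster near zero.
BREAKPOINTS = [-4.0, -2.0, 0.0, 2.0, 4.0]
SAMPLES = [-0.5, 0.3, 0.1, -1.2, 1.5, 0.7, -3.0, 3.5, 0.9, -0.1]
DEDICATED, SHARED = partition_segments(SAMPLES, BREAKPOINTS, num_dedicated=2)
```

With this sample data, the two segments around zero would be placed in the dedicated portion and the two outer segments in the shared pool. The value of num_dedicated stands in for the parameter that indicates how many entries are attributed to the dedicated portion.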
The present disclosure provides a LUT architecture-based framework to achieve an optimal balance between area and performance. In many cases, the hybrid LUT architecture can reduce area consumed by the processing unit, compared with many currently available processing units for approximating activation functions. The parameters of the framework are driven by statistical analysis of output activations. The hybrid LUT architecture can also reduce or even minimize look-up latency for the most frequently occurring activations while saving area by allowing rarely occurring output activations to have higher latency. The LUT architecture-based framework in the present disclosure can mitigate performance loss for rarely occurring activations using a parallel-write, serial-read command queue arbiter.
For purposes of explanation, specific numbers, materials and configurations are set forth in order to provide a thorough understanding of the illustrative implementations. However, it will be apparent to one skilled in the art that the present disclosure may be practiced without the specific details and/or that the present disclosure may be practiced with only some of the described aspects. In other instances, well known features are omitted or simplified in order not to obscure the illustrative implementations.
Further, references are made to the accompanying drawings that form a part hereof, and in which is shown, by way of illustration, embodiments that may be practiced. It is to be understood that other embodiments may be utilized, and structural or logical changes may be made without departing from the scope of the present disclosure. Therefore, the following detailed description is not to be taken in a limiting sense.
Various operations may be described as multiple discrete actions or operations in turn, in a manner that is most helpful in understanding the claimed subject matter. However, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations may not be performed in the order of presentation. Operations described may be performed in a different order from the described embodiment. Various additional operations may be performed or described operations may be omitted in additional embodiments.
For the purposes of the present disclosure, the phrase “A or B” or the phrase “A and/or B” means (A), (B), or (A and B). For the purposes of the present disclosure, the phrase “A, B, or C” or the phrase “A, B, and/or C” means (A), (B), (C), (A and B), (A and C), (B and C), or (A, B, and C). The term “between,” when used with reference to measurement ranges, is inclusive of the ends of the measurement ranges.
The description uses the phrases “in an embodiment” or “in embodiments,” which may each refer to one or more of the same or different embodiments. The terms “comprising,” “including,” “having,” and the like, as used with respect to embodiments of the present disclosure, are synonymous. The disclosure may use perspective-based descriptions such as “above,” “below,” “top,” “bottom,” and “side” to explain various features of the drawings, but these terms are simply for ease of discussion, and do not imply a desired or required orientation. The accompanying drawings are not necessarily drawn to scale. Unless otherwise specified, the use of the ordinal adjectives “first,” “second,” and “third,” etc., to describe a common object, merely indicates that different instances of like objects are being referred to and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking or in any other manner.
In the following detailed description, various aspects of the illustrative implementations will be described using terms commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art.
The terms “substantially,” “close,” “approximately,” “near,” and “about,” generally refer to being within +/−20% of a target value as described herein or as known in the art. Similarly, terms indicating orientation of various elements, e.g., “coplanar,” “perpendicular,” “orthogonal,” “parallel,” or any other angle between the elements, generally refer to being within +/−5-20% of a target value as described herein or as known in the art.
In addition, the terms “comprise,” “comprising,” “include,” “including,” “have,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a method, process, device, or DNN accelerator that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such method, process, device, or DNN accelerators. Also, the term “or” refers to an inclusive “or” and not to an exclusive “or.”
The systems, methods and devices of this disclosure each have several innovative aspects, no single one of which is solely responsible for all desirable attributes disclosed herein. Details of one or more implementations of the subject matter described in this specification are set forth in the description below and the accompanying drawings.
The convolutional layers 110 summarize the presence of features in the input image 105. The convolutional layers 110 function as feature extractors. The first layer of the DNN 100 is a convolutional layer 110. In an example, a convolutional layer 110 performs a convolution on an input tensor 140 (also referred to as input feature map (IFM) 140) and a filter 150. As shown in
The convolution includes MAC operations with the input elements in the IFM 140 and the weights in the filter 150. The convolution may be a standard convolution 163 or a depthwise convolution 183. In the standard convolution 163, the whole filter 150 slides across the IFM 140. All the input channels are combined to produce an output tensor 160 (also referred to as output feature map (OFM) 160). The OFM 160 is represented by a 5×5 2D matrix. The 5×5 2D matrix includes 5 output elements (also referred to as output points) in each row and 5 output elements in each column. For the purpose of illustration, the standard convolution includes one filter in the embodiments of
The multiplication applied between a kernel-sized patch of the IFM 140 and a kernel may be a dot product. A dot product is the elementwise multiplication between the kernel-sized patch of the IFM 140 and the corresponding kernel, which is then summed, always resulting in a single value. Because it results in a single value, the operation is often referred to as the “scalar product.” Using a kernel smaller than the IFM 140 is intentional as it allows the same kernel (set of weights) to be multiplied by the IFM 140 multiple times at different points on the IFM 140. Specifically, the kernel is applied systematically to each overlapping part or kernel-sized patch of the IFM 140, left to right, top to bottom. The result from multiplying the kernel with the IFM 140 one time is a single value. As the kernel is applied multiple times to the IFM 140, the multiplication result is a 2D matrix of output elements. As such, the 2D output matrix (i.e., the OFM 160) from the standard convolution 163 is referred to as an OFM.
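For purposes of illustration only, the sliding dot product described above can be sketched for a single input channel as follows. The sketch assumes a stride of one and no padding; the names dot_patch and conv2d_single_channel are hypothetical and are not taken from the present disclosure.

```python
def dot_patch(ifm, kernel, row, col):
    """Dot product of the kernel with the kernel-sized patch whose top-left corner is (row, col)."""
    k = len(kernel)
    return sum(
        ifm[row + i][col + j] * kernel[i][j]
        for i in range(k)
        for j in range(k)
    )


def conv2d_single_channel(ifm, kernel):
    """Apply the kernel to every patch, left to right and top to bottom (stride 1, no padding)."""
    k = len(kernel)
    out_h = len(ifm) - k + 1
    out_w = len(ifm[0]) - k + 1
    return [
        [dot_patch(ifm, kernel, r, c) for c in range(out_w)]
        for r in range(out_h)
    ]
```

Each call to dot_patch yields a single value, and the collection of values over all patch positions forms the 2D output matrix. A standard convolution over several input channels would additionally sum these per-channel results.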
In the depthwise convolution 183, the input channels are not combined. Rather, MAC operations are performed on an individual input channel and an individual kernel and produce an output channel. As shown in
The OFM 160 is then passed to the next layer in the sequence. In some embodiments, the OFM 160 is passed through an activation function. An example activation function is rectified linear unit (ReLU). ReLU is a calculation that returns the value provided as input directly, or the value zero if the input is zero or less. The convolutional layer 110 may receive several images as input and calculate the convolution of each of them with each of the kernels. This process can be repeated several times. For instance, the OFM 160 is passed to the subsequent convolutional layer 110 (i.e., the convolutional layer 110 following the convolutional layer 110 generating the OFM 160 in the sequence). The subsequent convolutional layer 110 performs a convolution on the OFM 160 with new kernels and generates a new feature map. The new feature map may also be normalized and resized. The new feature map can be kernelled again by a further subsequent convolutional layer 110, and so on.
In some embodiments, a convolutional layer 110 has four hyperparameters: the number of kernels, the size F of the kernels (e.g., a kernel is of dimensions F×F×D pixels), the step S with which the window corresponding to the kernel is dragged on the image (e.g., a step of one means moving the window one pixel at a time), and the zero-padding P (e.g., adding a black contour of P pixels thickness to the input image of the convolutional layer 110). The convolutional layers 110 may perform various types of convolutions, such as 2-dimensional convolution, dilated or atrous convolution, spatial separable convolution, depthwise separable convolution, transposed convolution, and so on. The DNN 100 includes 16 convolutional layers 110. In other embodiments, the DNN 100 may include a different number of convolutional layers.
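For purposes of illustration only, the effect of the kernel size F, the step S, and the zero-padding P on the output size can be sketched with the commonly used relation (W − F + 2P)/S + 1, where W is the input width or height. The relation is standard for convolutions but is not stated in the present disclosure, and the name conv_output_size is hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Number of output elements along one spatial dimension of a convolution."""
    span = input_size + 2 * padding - kernel_size
    if span < 0:
        raise ValueError("kernel is larger than the padded input")
    # Integer division drops positions where the window would overrun the input.
    return span // stride + 1
```

For instance, a hypothetical input of width 7 with F = 3, S = 1, and P = 0 would give an output of width 5, and the same input with S = 2 would give an output of width 3.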
The pooling layers 120 down-sample feature maps generated by the convolutional layers, e.g., by summarizing the presence of features in the patches of the feature maps. A pooling layer 120 is placed between two convolutional layers 110: a preceding convolutional layer 110 (the convolutional layer 110 preceding the pooling layer 120 in the sequence of layers) and a subsequent convolutional layer 110 (the convolutional layer 110 subsequent to the pooling layer 120 in the sequence of layers). In some embodiments, a pooling layer 120 is added after a convolutional layer 110, e.g., after an activation function (e.g., ReLU, etc.) has been applied to the OFM 160.
A pooling layer 120 receives feature maps generated by the preceding convolutional layer 110 and applies a pooling operation to the feature maps. The pooling operation reduces the size of the feature maps while preserving their important characteristics. Accordingly, the pooling operation improves the efficiency of the DNN and avoids over-learning. The pooling layers 120 may perform the pooling operation through average pooling (calculating the average value for each patch on the feature map), max pooling (calculating the maximum value for each patch of the feature map), or a combination of both. The size of the pooling operation is smaller than the size of the feature maps. In various embodiments, the pooling operation is 2×2 pixels applied with a stride of two pixels, so that the pooling operation reduces each dimension of a feature map by a factor of 2, i.e., the number of pixels or values in the feature map is reduced to one quarter of the original number. In an example, a pooling layer 120 applied to a feature map of 6×6 results in an output pooled feature map of 3×3. The output of the pooling layer 120 is inputted into the subsequent convolutional layer 110 for further feature extraction. In some embodiments, the pooling layer 120 operates upon each feature map separately to create a new set of the same number of pooled feature maps.
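For purposes of illustration only, the 2×2 pooling operation with a stride of two described above can be sketched as follows. The name pool2d and its mode argument are hypothetical and are not taken from the present disclosure.

```python
def pool2d(fmap, size=2, stride=2, mode="max"):
    """Down-sample a 2D feature map with max or average pooling."""
    if mode == "max":
        reduce_patch = max
    else:
        reduce_patch = lambda values: sum(values) / len(values)
    pooled = []
    for r in range(0, len(fmap) - size + 1, stride):
        row = []
        for c in range(0, len(fmap[0]) - size + 1, stride):
            # Gather the size x size patch whose top-left corner is (r, c).
            patch = [fmap[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(reduce_patch(patch))
        pooled.append(row)
    return pooled
```

Applied to a 6×6 feature map, this sketch returns a 3×3 pooled feature map, so the 36 input values are reduced to 9.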
The fully-connected layers 130 are the last layers of the DNN. The fully-connected layers 130 may be convolutional or not. The fully-connected layers 130 may also be referred to as linear layers. In some embodiments, a fully-connected layer 130 (e.g., the first fully-connected layer in the DNN 100) may receive an input operand. The input operand may define the output of the convolutional layers 110 and pooling layers 120 and includes the values of the last feature map generated by the last pooling layer 120 in the sequence. The fully-connected layer 130 may apply a linear transformation to the input operand through a weight matrix. The weight matrix may be a kernel of the fully-connected layer 130. The linear transformation may include a tensor multiplication between the input operand and the weight matrix. The result of the linear transformation may be an output operand. In some embodiments, the fully-connected layer may further apply a non-linear transformation (e.g., by using a non-linear activation function) on the result of the linear transformation to generate an output operand. The output operand may contain as many elements as there are classes: element i represents the probability that the image belongs to class i. Each element is therefore between 0 and 1, and the elements sum to one. These probabilities are calculated by the last fully-connected layer 130 by using a logistic function (binary classification) or a SoftMax function (multi-class classification) as an activation function.
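For purposes of illustration only, the logistic and SoftMax functions mentioned above can be sketched as follows. The subtraction of the maximum value is a common numerical-stability measure and is an assumption of this sketch; the names logistic and softmax are hypothetical.

```python
import math


def logistic(x):
    """Logistic function: maps a real value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    """SoftMax: maps a list of real values to probabilities that sum to one."""
    largest = max(values)
    # Subtracting the maximum avoids overflow and does not change the result.
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]
```

The output of softmax has as many elements as its input, so an input with one value per class yields one probability per class.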
In some embodiments, the fully-connected layers 130 classify the input image 105 and return an operand of size N, where N is the number of classes in the image classification problem. In the embodiments of
The DNN module 201 facilitates generation and deployment of DNNs. In some embodiments, the DNN module 201 may generate and train DNNs. For instance, the DNN module 201 can define the layered architecture of a DNN. The DNN module 201 can also determine the internal parameters of the DNN through a DNN training process. The DNN module 201 may also determine one or more hyperparameters that define how the DNN is trained. An example hyperparameter is a sparsity ratio that defines the sparsity level of one or more deep learning tensors for the DNN. The DNN module 201 may also compress DNNs, e.g., during or after training.
The DNN module 201 may deploy trained, compressed, or validated DNNs for use in deep learning applications. In some embodiments, the DNN module 201 may distribute trained, compressed, or validated DNNs to devices or systems which may use the DNNs to perform tasks (e.g., image classification, motion planning, etc.) for which the DNNs were trained. In other embodiments, the DNN module 201 may facilitate deployment of the DNNs using the DNN accelerator 202. For instance, the DNN module 201 may receive data from a device or system coupled with the DNN system 200 and input the received data (or data generated by the DNN module 201, e.g., based on the received data) into a DNN. The DNN module 201 may generate instructions (e.g., configuration files) that control the operation of the DNN accelerator 202 during the DNN execution. The DNN module 201 may receive an output of the DNN from the DNN accelerator 202. The DNN module 201 may transmit the output of the DNN (or a result of processing the output of the DNN by the DNN module 201) to the device or system.
The DNN module 201 may control execution processes of trained, compressed, or validated DNNs. For instance, the DNN module 201 may facilitate approximation of non-linear activation functions with other functions including linear functions. The non-linear activation functions may be executed, e.g., by the PPE array 260, by executing these other functions. The outputs of these other functions may be approximated outputs of the non-linear activation functions and may be used in subsequent deep learning operations in the DNNs. The DNN module 201 may partition the input range of a non-linear activation function into multiple segments and approximate the non-linear activation function within the segments using various linear functions. In some embodiments, the DNN module 201 may generate a configuration descriptor for a non-linear activation function. The configuration descriptor may store information to be used for approximating the non-linear activation function. For instance, the configuration descriptor may include a LUT storing slopes and intercepts of linear functions. Certain aspects of the DNN module 201 are provided below in conjunction with
The DNN accelerator 202 executes DNNs provided by the DNN module 201. For instance, the DNN accelerator 202 can perform DNN execution, e.g., by running deep learning operations in the DNNs, for training DNNs or for using the trained/compressed/validated DNNs to perform tasks. As shown in
The memory 210 stores data associated with deep learning operations performed by the DNN accelerator. In some embodiments, the memory 210 may store data to be used by the data processing units 230 for DNN execution. For example, the memory 210 may store weights, such as weights of convolutional layers, which are determined by training DNNs. As another example, the memory 210 may store inputs to DNNs or outputs of DNNs. The memory 210 may also store data generated by the data processing units 230 from performing deep learning operations in DNNs. Example deep learning operations include convolutions (also referred to as “convolutional operations”), pooling operations, elementwise operations, activation functions, other types of deep learning operations, or some combination thereof. The memory 210 may be a main memory of the DNN accelerator 202. In some embodiments, the memory 210 includes one or more dynamic random-access memories (DRAMs).
The DMA engine 220 facilitates data transfer between the memory 210 and local memories of the data processing units 230. For example, the DMA engine 220 can read data from the memory 210 and write data into a local memory of a data processing unit 230. As another example, the DMA engine 220 can read data from a local memory of a data processing unit 230 and write data into the memory 210. The DMA engine 220 provides a DMA feature that allows the data processing unit 230 to initiate data transfer between the memory 210 and the local memories of the data processing units 230 and to perform other operations while the data transfer is being conducted. In some embodiments, the DMA engine 220 may read tensors from the memory 210 and modify the tensors in a way that is optimized for the data processing unit 230 before writing the tensors into the local memories of the data processing units 230.
The data processing units 230 can perform deep learning operations in DNNs. For instance, a data processing unit 230 may run a deep learning operation in a DNN layer, or a portion of the deep learning operation, at a time. The data processing units 230 may be capable of running various types of deep learning operations, such as activation functions, convolution, pooling, elementwise operation, linear operation, non-linear operation, and so on. In an example, a data processing unit 230 may perform convolutions, e.g., standard convolution or depthwise convolution. In some embodiments, the data processing unit 230 receives an input tensor and one or more convolutional kernels and performs a convolution with the input tensor and convolutional kernels. The result of the convolution may be an output tensor, which can be further computed, e.g., by the data processing unit 230 or another data processing unit 230. In some embodiments, the operations of the DNN layers may be run by multiple data processing units 230 in parallel. For instance, multiple data processing units 230 may each perform a portion of a workload for a convolution. Data may be shared between the data processing units 230. A data processing unit 230 may also be referred to as a compute tile. In some embodiments, each data processing unit 230 may be a processing unit.
In the embodiments of
The local memory 240 is local to the corresponding data processing unit 230. In the embodiments of
In some embodiments, the local memory 240 includes one or more static random-access memories (SRAMs). The local memory 240 may be byte-addressable, and each memory address identifies a single byte (eight bits) of storage. In some embodiments, the local memory 240 may include memory banks. The number of memory banks in the local memory 240 may be 16, 64, 128, 256, 512, 1024, 2048, or other numbers. A memory bank may include a plurality of storage units. In an example, a memory bank may include 8, 16, 64, or a different number of storage units. A memory bank or a storage unit in a memory bank may have a memory address. In an example, a storage unit may store a single byte, and data larger than a single byte may be stored in storage units with consecutive memory addresses, i.e., adjacent storage units. For instance, a single storage unit can store an integer number in the INT8 format, whereas two storage units may be needed to store a number in the FP16 or BF16 format, which has 16 bits. In some embodiments, 16 bits can be transferred from the local memory 240 in a single read cycle. In other embodiments, 16 bits can be transferred from the local memory 240 in multiple read cycles, such as two cycles.
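For purposes of illustration only, the byte-addressable storage described above can be sketched with a byte array in which each index plays the role of a memory address. The sketch assumes little-endian byte order and covers only INT8 and FP16; the names memory, store_int8, store_fp16, and load_fp16 are hypothetical and are not taken from the present disclosure.

```python
import struct

# Hypothetical 16-byte memory; each index is the address of one storage unit.
memory = bytearray(16)


def store_int8(mem, addr, value):
    """Store an INT8 value; returns the number of storage units used."""
    mem[addr:addr + 1] = struct.pack("<b", value)
    return 1


def store_fp16(mem, addr, value):
    """Store an FP16 value in two adjacent storage units (little-endian is an assumption)."""
    mem[addr:addr + 2] = struct.pack("<e", value)
    return 2


def load_fp16(mem, addr):
    """Read an FP16 value back from two consecutive addresses."""
    return struct.unpack("<e", bytes(mem[addr:addr + 2]))[0]
```

In this sketch an INT8 value occupies one address while an FP16 value occupies two consecutive addresses.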
The sparse cell array 250 may include sparse cells arranged in columns, or columns and rows. Each sparse cell may include an array of MAC units that can perform MAC operations. In some embodiments (e.g., embodiments where the data processing unit 230 executes a convolutional layer), a computation in an MAC unit may be an MAC operation on an activation operand and a weight operand. The activation operand is an activation tensor that may include one or more activations in the input tensor of the convolution. Different activations may be in different input channels. The weight operand is a weight tensor that may include one or more weights in the filter of the convolution. The values of the weights are determined through training the DNN. The weights in the weight operand may be in different input channels.
In some embodiments, an MAC unit includes one or more multipliers for performing multiplications. An MAC unit may also include one or more accumulators (“adders”) for performing accumulations. A column of MAC units is referred to as an MAC column. An MAC column may be associated with one or more MAC lanes. An MAC lane is a path for loading data into an MAC column. An MAC lane may be also referred to as a data transmission lane or data loading lane. An MAC column may have multiple MAC lanes. The loading bandwidth of the MAC column is an aggregation of the loading bandwidths of all the MAC lanes associated with the MAC column. With a certain number of MAC lanes, data can be fed into the same number of independent MAC units simultaneously. In some embodiments where an MAC column has four MAC lanes for feeding activations or weights into the MAC column and each MAC lane may have a bandwidth of 16 bytes, the four MAC lanes can have a total loading bandwidth of 64 bytes.
In some embodiments, the sparse cell array 250 may be capable of depthwise convolution, standard convolution, or both. In a depthwise convolution, an MAC unit may perform an MAC operation that includes a sequence of multiplications for an input operand and a weight operand. Each multiplication in the sequence (also referred to as a cycle) is a multiplication of a different activation in the input operand with a different weight in the weight operand. The activation and weight in the same cycle may correspond to the same channel. The sequence of multiplications produces a product operand that includes a sequence of products. The MAC operation may also include accumulations in which multiple product operands are accumulated to produce an output operand of the MAC unit. The sparse cell array 250 may output multiple output operands at a time, each of which is generated by a different MAC unit. In a standard convolution, MAC operations may include accumulations across the channels. For instance, as opposed to generating an output operand, an MAC unit may accumulate products across different channels to generate a single output point.
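For purposes of illustration only, the difference between the two accumulation patterns described above can be sketched as follows. Each operand is modeled as a list with one value per channel; the names depthwise_mac and standard_mac are hypothetical and are not taken from the present disclosure.

```python
def depthwise_mac(activation_operands, weight_operands):
    """Accumulate product operands channel by channel; channels are not combined."""
    num_channels = len(activation_operands[0])
    output = [0] * num_channels
    for activations, weights in zip(activation_operands, weight_operands):
        # One multiplication per channel yields one product operand.
        products = [a * w for a, w in zip(activations, weights)]
        output = [o + p for o, p in zip(output, products)]
    return output


def standard_mac(activation_operands, weight_operands):
    """Accumulate the same products across channels into a single output point."""
    return sum(depthwise_mac(activation_operands, weight_operands))
```

The depthwise variant returns an output operand with one element per channel, whereas the standard variant returns a single value.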
In some embodiments, the sparse cell array 250 may include sparsity acceleration logic for facilitating sparsity acceleration. For instance, each sparse cell in the sparse cell array 250 may include one or more sparsity modules. In an example, each MAC column or each MAC row may have a corresponding sparsity module that accelerates MAC operations in the MAC column or MAC row. In some embodiments, a sparsity module accelerates computations in the sparse cell array 250 based on sparsity in activations or sparsity in weights. The sparsity module may include a storage unit that stores a sparsity tensor. The sparsity tensor may be an activation sparsity tensor, a weight sparsity tensor, or a combination of both.
The sparsity module may use the sparsity tensor to identify which data elements of the dense tensor correspond to data elements of the sparse tensor. Each identified data element of the dense tensor and the corresponding data element of the sparse tensor may constitute an activation-weight pair for an MAC operation. For instance, the identified data element of the dense tensor will be multiplied with the corresponding data element of the sparse tensor in the MAC operation. The sparsity module may select one or more data elements of the dense tensor based on one or more sparsity elements of the sparsity tensor that correspond to one or more nonzero valued data elements of the dense format of the sparse tensor. The sparsity module can forward the identified activation-weight pairs to the MAC units. Other data elements of the dense tensor would be skipped and not computed by the MAC units to accelerate computation in the sparse cell array 250, as these data elements will not contribute to the result of the MAC operation.
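As a rough software analogy of this selection, the sketch below assumes that the sparsity tensor is a bitmap with one bit per element of the dense format of the sparse tensor and that the sparse tensor is stored in compressed form (nonzero values only). The function name and data layout are hypothetical.

```python
def sparse_mac(dense, compressed_sparse, bitmap):
    """Accumulate products only for positions marked nonzero in the bitmap.

    dense: data elements of the dense tensor (e.g., activations)
    compressed_sparse: nonzero data elements of the sparse tensor, in order
    bitmap: 1 where the dense format of the sparse tensor is nonzero, else 0
    """
    acc = 0.0
    nonzeros = iter(compressed_sparse)
    for element, bit in zip(dense, bitmap):
        if bit:
            # Activation-weight pair forwarded to the MAC unit.
            acc += element * next(nonzeros)
        # bit == 0: the element is skipped; it cannot change the result.
    return acc
```

The result equals the full dot product, but only as many multiplications are performed as there are nonzero elements in the sparse tensor.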
The PPE array 260 processes outputs of the sparse cell array 250. In some embodiments, the PPE array 260 executes activation functions, including non-linear activation functions. The PPE array 260 may receive outputs of the sparse cell array 250 as inputs to the activation functions. An input to an activation function may be a tensor including a plurality of input data elements. The tensor may be an output tensor of a DNN layer. In some embodiments, the data elements to be input into an activation function may be in a range, which is the input range of the activation function. The PPE array 260 may compute outputs of non-linear activation functions by using linear functions that approximate the non-linear activation functions. For instance, in the execution of a non-linear activation function, the PPE array 260 may apply a linear function on some or all input data elements and use the outputs of the linear function as the approximated outputs of the non-linear activation function. To apply the linear function on input data elements, the PPE array 260 may use data stored in a group of LUTs. The LUTs may be included in the PPE array 260. A LUT may be programmable. In some embodiments, the LUTs are configured by the DNN module 201, such as the activation function module 350 in the DNN module 201. For instance, the data stored in the LUTs may be determined by the DNN module 201.
In some embodiments, the PPE array 260 may transmit the outputs of the activation functions to the local memory 240. The outputs of the activation functions may be retrieved later by the sparse cell array 250 from the local memory 240 for further computation. For instance, the PPE array 260 may receive an output tensor of a DNN layer from the sparse cell array 250 and compute one or more activation functions on the output tensor. The results of the computation by the PPE array 260 may be stored in the local memory 240 and later used as an input tensor of the next DNN layer.
In addition or alternative to activation functions, the PPE array 260 may perform other types of post processing on outputs of the sparse cell array 250. For example, the PPE array 260 may apply a bias or scale on an output of the sparse cell array 250 before executing activation function(s). As another example, the PPE array 260 may apply a bias or scale on approximated outputs of activation function(s). Certain aspects of the PPE array 260 are described below in conjunction with
The interface module 310 facilitates communications of the DNN module 201 with other modules or systems. For example, the interface module 310 establishes communications between the DNN module 201 with an external database to receive data that can be used to train DNNs or input into DNNs to perform tasks. As another example, the interface module 310 supports the DNN module 201 to distribute DNNs to other systems, e.g., computing devices configured to apply DNNs to perform tasks.
The training module 320 trains DNNs by using a training dataset. The training module 320 forms the training dataset. In an embodiment where the training module 320 trains a DNN to recognize objects in images, the training dataset includes training images and training labels. The training labels describe ground-truth classifications of objects in the training images. In some embodiments, each label in the training dataset corresponds to an object in a training image. In some embodiments, a part of the training dataset may be used to initially train the DNN, and the rest of the training dataset may be held back as a validation subset used by the validating module 340 to validate performance of a trained DNN. The portion of the training dataset not including the tuning subset and the validation subset may be used to train the DNN.
The training module 320 also determines hyperparameters for training the DNN. Hyperparameters are variables specifying the DNN training process. Hyperparameters are different from parameters inside the DNN (e.g., weights of filters). In some embodiments, hyperparameters include variables determining the architecture of the DNN, such as number of hidden layers, etc. Hyperparameters also include variables which determine how the DNN is trained, such as batch size, number of epochs, etc. A batch size defines the number of training samples to work through before updating the parameters of the DNN. The batch size is the same as or smaller than the number of samples in the training dataset. The training dataset can be divided into one or more batches. The number of epochs defines how many times the entire training dataset is passed forward and backward through the entire network, i.e., the number of times that the deep learning algorithm works through the entire training dataset. One epoch means that each training sample in the training dataset has had an opportunity to update the parameters inside the DNN. An epoch may include one or more batches. The number of epochs may be 1, 5, 10, 50, 100, 500, 1000, or even larger.
The training module 320 defines the architecture of the DNN, e.g., based on some of the hyperparameters. The architecture of the DNN includes an input layer, an output layer, and a plurality of hidden layers. The input layer of a DNN may include tensors (e.g., a multidimensional array) specifying attributes of the input image, such as the height of the input image, the width of the input image, and the depth of the input image (e.g., the number of bits specifying the color of a pixel in the input image). The output layer includes labels of objects in the input layer. The hidden layers are layers between the input layer and output layer. The hidden layers include one or more convolutional layers and one or more other types of layers, such as pooling layers, fully-connected layers, normalization layers, SoftMax or logistic layers, and so on. The convolutional layers of the DNN abstract the input image to a feature map that is represented by a tensor specifying the feature map height, the feature map width, and the feature map channels (e.g., red, green, blue images include 3 channels). A pooling layer is used to reduce the spatial volume of the input image after convolution. It is used between two convolution layers. A fully-connected layer involves weights, biases, and neurons. It connects neurons in one layer to neurons in another layer. It is used to classify images between different categories by training.
In the process of defining the architecture of the DNN, the training module 320 also adds an activation function to a hidden layer or the output layer. An activation function of a layer transforms the weighted sum of the input of the layer to an output of the layer. The activation function may be, for example, a ReLU activation function, a tangent activation function, or other types of activation functions.
After the training module 320 defines the architecture of the DNN, the training module 320 inputs a training dataset into the DNN. The training dataset includes a plurality of training samples. An example of a training sample includes an object in an image and a ground-truth label of the object. The training module 320 modifies the parameters inside the DNN (“internal parameters of the DNN”) to minimize the error between labels of the training objects that are generated by the DNN and the ground-truth labels of the objects. The internal parameters include weights of filters in the convolutional layers of the DNN. In some embodiments, the training module 320 uses a cost function to minimize the error.
The training module 320 may train the DNN for a predetermined number of epochs. The number of epochs is a hyperparameter that defines the number of times that the deep learning algorithm will work through the entire training dataset. One epoch means that each sample in the training dataset has had an opportunity to update internal parameters of the DNN. After the training module 320 finishes the predetermined number of epochs, the training module 320 may stop updating the parameters in the DNN. The DNN having the updated parameters is referred to as a trained DNN.
The compressing module 330 compresses DNNs. For instance, the compressing module 330 may add pruning operations to DNN layers to reduce computational complexity or memory usage. A pruning operation may prune weight tensors of a DNN layer by changing one or more nonzero valued weights of the layer to zeros. The modification may be done before, during, or after training. Weights may be pruned during training, during inference, or a combination of both. The compressing module 330 may determine a sparsity ratio for a DNN layer. The sparsity ratio may be a ratio of the number of zero-valued weights to the total number of weights in the layer. The compressing module 330 may perform the pruning operation until the sparsity ratio of the DNN layer meets a target sparsity ratio, such as 10%, 20%, 30%, 50%, and so on.
In some embodiments, the compressing module 330 may select one or more layers in a DNN and modify each selected layer with a pruning operation. For instance, the compressing module 330 may select computationally complex layers, such as layers with large filters. For a pruning operation of a layer or of a type of layer, the compressing module 330 may determine a weight threshold that would not cause a loss of the accuracy of the DNN to exceed an accuracy loss constraint. A pruning operation may modify weights having absolute values below the weight threshold to zeros and leave the other weights unchanged. The weight pruning can reduce memory storage as zero-valued weights may not be stored. Also, the number of operations in the layer can be reduced as computations on zero-valued weights can be skipped without impacting the output of the layer. In some embodiments, the compressing module 330 may also measure energy saving, final DNN accuracy, or layer-wise sparsity caused by pruning operations.
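A minimal sketch of magnitude-based pruning toward a target sparsity ratio is given below. It assumes that the weights with the smallest absolute values are zeroed first and that the threshold is raised step by step through the observed magnitudes; the function names are hypothetical, and the accuracy loss constraint is not modeled.

```python
def sparsity_ratio(weights):
    # Ratio of zero-valued weights to the total number of weights.
    return sum(1 for w in weights if w == 0.0) / len(weights)


def prune_to_target(weights, target_ratio):
    # Raise the weight threshold through the sorted magnitudes until the
    # layer reaches the target sparsity ratio.
    pruned = list(weights)
    for threshold in sorted({abs(w) for w in weights}):
        if sparsity_ratio(pruned) >= target_ratio:
            break
        pruned = [0.0 if abs(w) <= threshold else w for w in weights]
    return pruned
```

In practice the threshold would also be checked against the accuracy of the DNN, which requires running the network and is outside the scope of this sketch.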
After compressing a DNN, the compressing module 330 may fine tune the DNN, e.g., through a retraining process. The compressing module 330 may fine tune DNNs after weights are pruned. In some embodiments, the fine-tuning process is a retraining or further training process. For instance, after weights in a DNN are pruned, the compressing module 330 may further train the DNN by inputting a training dataset into the DNN. The values of the unpruned weights in the DNN may be modified based on outputs of the DNN and ground-truth labels of the training samples in the training dataset. In some embodiments, the values of the pruned weights (i.e., zero) are not changed during the fine-tuning process. For instance, the compressing module 330 may place a mask over a pruned weight block and the mask can prevent values in the pruned weight blocks from being changed during the fine-tuning process. In other embodiments, the values of all weights, including the pruned weights, may be changed during the fine-tuning process. After one or more cycles of retraining and weight changing by the compressing module 330, the compressing module 330 may perform a new pruning process, e.g., by selecting weight blocks and pruning the selected weight blocks. In some embodiments, the weight pruning process may be repeated multiple times before the fine-tuning process is done.
In some embodiments, the number of epochs in the fine-tuning process may be different from the number of epochs in the training process in which the pre-pruning values of the weights are determined. For instance, the fine-tuning process may have fewer epochs than the training process. In an example, the number of epochs in the fine-tuning process may be relatively small, such as 2, 3, 5, and so on.
The validating module 340 verifies accuracy of trained or compressed DNNs. In some embodiments, the validating module 340 inputs samples in a validation dataset into a trained DNN and uses the outputs of the DNN to determine the model accuracy. In some embodiments, a validation dataset may be formed of some or all the samples in the training dataset. Additionally or alternatively, the validation dataset includes additional samples, other than those in the training sets. In some embodiments, the validating module 340 may determine an accuracy score measuring the precision, recall, or a combination of precision and recall of the DNN. The validating module 340 may use the following metrics to determine the accuracy score: Precision=TP/(TP+FP) and Recall=TP/(TP+FN), where precision may be how many the DNN correctly predicted (TP or true positives) out of the total it predicted (TP+FP or false positives), and recall may be how many the DNN correctly predicted (TP) out of the total number of objects that did have the property in question (TP+FN or false negatives). The F-score (F-score=2*PR/(P+R)) unifies precision and recall into a single measure.
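The three metrics can be computed directly from the counts of true positives, false positives, and false negatives. The sketch below is only an illustration; the function name and the convention of returning zero when a denominator is zero are assumptions.

```python
def precision_recall_f(tp, fp, fn):
    # Precision = TP / (TP + FP), Recall = TP / (TP + FN).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-score = 2 * P * R / (P + R).
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score
```

For example, 8 true positives, 2 false positives, and 8 false negatives give a precision of 0.8 and a recall of 0.5.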
The validating module 340 may compare the accuracy score with a threshold score. In an example where the validating module 340 determines that the accuracy score of the DNN is less than the threshold score, the validating module 340 instructs the training module 320 to re-train the DNN. In one embodiment, the training module 320 may iteratively re-train the DNN until the occurrence of a stopping condition, such as the accuracy measurement indicating that the DNN may be sufficiently accurate, or a number of training rounds having taken place.
The activation function module 350 programs LUTs for piece-wise linear approximation of non-linear activation functions in DNNs. The LUTs may be implemented in a data processing unit, such as the data processing unit 230. In an example, the LUTs may be implemented in the PPE array 260. A linear function may be denoted as y=ax+b, where a denotes the slope of the linear function, b denotes the intercept of the linear function, x denotes the input of the linear function, and y denotes the output of the linear function.
In an example process of approximating a non-linear activation function, the activation function module 350 may determine the input range of the non-linear activation function. The input range may be a range that includes some or all possible data elements that will be input into the non-linear activation function. The input range may depend on the datatypes (or data formats) of the input data elements. The activation function module 350 may support various data formats, including floating-point formats, such as FP32, FP16, BF16, FP8, and so on. The input data elements may be computed in a DNN layer, such as a convolutional layer, fully-connected layer, pooling layer, and so on. In an example, the input data elements may be output activations of a convolution in the DNN. The activation function module 350 may identify the exponents in the input range and divide the input range into input segments. In an example, an input segment is a portion of the input range that has input data elements having the same exponent. The input segments may correspond to the identified exponents, respectively.
The activation function module 350 may determine a linear function for an input segment and evaluate the accuracy of the linear function. The activation function module 350 may measure the accuracy of the linear function by comparing outputs of the linear function with real outputs of the non-linear activation function for inputs falling into the input segment.
The activation function module 350 may determine whether the accuracy of the linear function meets a desired accuracy, e.g., whether the accuracy is no less than the desired accuracy. In embodiments where the accuracy meets the desired accuracy, the activation function module 350 may store parameters of the linear function (e.g., slope and intercept) into a LUT. In embodiments where the accuracy does not meet the desired accuracy, the activation function module 350 may divide the input segment into multiple smaller input segments and determine a linear function for each of the smaller input segments. The activation function module 350 may further generate a configuration descriptor that includes the parameters of all the determined linear functions. The configuration descriptor may be provided to a PPE array (e.g., the PPE array 260) and stored in the LUTs of the PPE array 260. In some embodiments, one or more parameters (e.g., intercept and slope) of a single linear function may be stored in a single entry of a LUT (“LUT entry”).
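The refinement loop described above can be sketched as a recursive subdivision. The sketch makes several assumptions that are not stated in this disclosure: the linear function is fitted through the two end points of the segment, accuracy is measured as the maximum absolute error over evenly spaced samples, a segment is split into two halves, and tanh is used purely as an example activation function. All function names are hypothetical.

```python
import math


def fit_segment(f, lo, hi):
    # Linear function y = a*x + b through the two end points of the segment.
    slope = (f(hi) - f(lo)) / (hi - lo)
    intercept = f(lo) - slope * lo
    return slope, intercept


def max_error(f, lo, hi, slope, intercept, samples=64):
    # Compare the linear function with the real outputs inside the segment.
    points = (lo + (hi - lo) * i / samples for i in range(samples + 1))
    return max(abs(f(x) - (slope * x + intercept)) for x in points)


def build_entries(f, lo, hi, tolerance, depth=0, max_depth=12):
    slope, intercept = fit_segment(f, lo, hi)
    error = max_error(f, lo, hi, slope, intercept)
    if error <= tolerance or depth >= max_depth:
        return [(lo, hi, slope, intercept)]  # one LUT entry per segment
    mid = (lo + hi) / 2.0
    # Accuracy not met: divide into smaller segments and fit each one.
    return (build_entries(f, lo, mid, tolerance, depth + 1, max_depth)
            + build_entries(f, mid, hi, tolerance, depth + 1, max_depth))


# Example: the input segment [0.5, 1.0), i.e., one exponent, for tanh.
entries = build_entries(math.tanh, 0.5, 1.0, tolerance=1e-3)
```

The returned list plays the role of the configuration descriptor: each tuple holds the bounds of a segment together with the slope and intercept to be written into a LUT entry.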
In some embodiments, the activation function module 350 defines hybrid architectures of the LUTs in the PPE array 260 and allocates the parameters of the linear functions based on the hybrid architectures. In some embodiments, the activation function module 350 may determine one or more parameters that represent the hybrid architecture of one or more LUTs (“LUT architecture parameters”). The LUT architecture parameters may include, for example, the total number of PPE groups in the PPE array 260 (denoted as G), the total number of LUT entries in a single LUT (denoted as N), the total number of LUT entries in the dedicated portion of a LUT (denoted as K), and so on. The activation function module 350 may determine the one or more LUT architecture parameters based on statistical analysis of the input data elements of the activation function.
In some embodiments, the activation function module 350 may analyze statistics of the input data elements with respect to the input segments. For instance, the activation function module 350 may determine distribution frequencies of the input segments. A distribution frequency of an input segment may indicate how many input data elements of the activation function fall into the input segment. For instance, the distribution frequency may be a ratio between the total number of input data elements in the input segment and the total number of input data elements in the input range. For each input segment, the activation function module 350 may determine one or more linear functions that can approximate the activation function. In some embodiments, the activation function module 350 may determine one linear function for one input segment. In other embodiments, the activation function module 350 may determine multiple linear functions for one input segment. The linear functions may be used to approximate the activation function for different portions of the input segment. A linear function may have one or more parameters that will be used to compute approximated output of the activation function. Examples of the parameters include intercepts, slopes, or other types of parameters of the linear functions.
In some embodiments, the activation function module 350 may divide the input range into input segments. The activation function module 350 may assign indices to the input segments. Each input segment may have a different index. An index may include three components that encode sign, exponent, and mantissa, respectively. The activation function module 350 may also determine distribution frequencies of the input segments. In an example, the activation function module 350 may associate the index of an input segment with all the input data elements falling into the input segment. The activation function module 350 may bin input data elements of the activation function into corresponding indices and track the count in each index bin. The activation function module 350 may determine the distribution frequency of the input segment based on a count of the index bin. In some embodiments, the activation function module 350 may rank the input segments based on the distribution frequencies. The activation function module 350 may generate a frequency table that lists the input segments in an order in accordance with their rankings.
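The binning step can be sketched as follows, assuming FP16 inputs and an index formed from the sign bit, the 5-bit exponent field, and an optional number of leading mantissa bits. The function names and the `mantissa_bits` parameter are hypothetical; the half-precision bit pattern is obtained with the standard `struct` format code `e`.

```python
import struct
from collections import Counter


def segment_index(x, mantissa_bits=0):
    # Reinterpret x as an FP16 bit pattern: 1 sign, 5 exponent, 10 mantissa.
    bits = struct.unpack("<H", struct.pack("<e", x))[0]
    sign = bits >> 15
    exponent = (bits >> 10) & 0x1F
    # Keep only the leading mantissa bits (none when mantissa_bits is 0).
    mantissa = (bits & 0x3FF) >> (10 - mantissa_bits)
    return (sign, exponent, mantissa)


def frequency_table(values, mantissa_bits=0):
    # Bin each input data element into its index and count per bin.
    counts = Counter(segment_index(v, mantissa_bits) for v in values)
    total = len(values)
    # Ranked by distribution frequency, highest first.
    return [(index, count / total) for index, count in counts.most_common()]
```

Increasing `mantissa_bits` subdivides each exponent segment into finer input segments, which corresponds to using more than one linear function per exponent.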
The activation function module 350 may select one or more input segments having higher distribution frequencies. The parameters of linear functions that approximate the activation function for the selected input segments may be stored in the dedicated portion of each LUT, and the parameters of linear functions that approximate the activation function for the unselected input segments may be stored in the shared portions of the LUTs. The dedicated portion of a LUT may be coupled to a single PPE group which computes the approximated outputs of the activation function using the input data elements in the selected input segments and parameters of linear functions stored in the dedicated portion of the LUT. The shared portions of the LUTs may constitute a shared LUT entry pool of the PPE array 260. The shared LUT entry pool may be coupled to and shared by multiple PPE groups which compute the approximated outputs of the activation function using the input data elements in the unselected input segments and parameters of linear functions stored in the shared portions of the LUTs.
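The allocation can be illustrated with a short sketch. It assumes that the selected (most frequent) segments are replicated in the dedicated portion of every LUT and that the remaining segments are spread over the shared portions in round-robin order; the round-robin placement and the function name are assumptions, not part of this disclosure.

```python
def allocate_entries(ranked_segments, k, num_luts):
    """ranked_segments: segment indices ordered by distribution frequency,
    most frequent first."""
    dedicated = ranked_segments[:k]   # copied into every dedicated portion
    remaining = ranked_segments[k:]   # goes to the shared LUT entry pool
    return [
        {"dedicated": list(dedicated), "shared": remaining[i::num_luts]}
        for i in range(num_luts)
    ]
```

Every PPE group can then serve the frequent segments from its own dedicated portion without arbitration, while requests for the less frequent segments go through the shared pool.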
For each LUT, the activation function module 350 may determine the size of the dedicated portion of the LUT and the size of the shared portion of the LUT based on the LUT architecture parameters. In an example, the activation function module 350 may determine that the dedicated portion has K entries and the shared portion has K+(N−L)/G entries. In some embodiments, the activation function module 350 may use the same hybrid architecture for all the LUTs in the PPE array 260. In other embodiments, the activation function module 350 may determine different hybrid architectures for different LUTs.
To determine the values of the LUT architecture parameters, the activation function module 350 may conduct an iterative process starting with preliminary values of the LUT architecture parameters, e.g., preliminary values of G, N, and K. The activation function module 350 may estimate the area needed by the LUTs and the performance of the PPE array with the preliminary LUT architecture. The activation function module 350 may determine whether the combination of the estimated area and the estimated performance is optimal. In response to determining that the combination of the estimated area and the estimated performance is not optimal, the activation function module 350 may adjust one or more of the LUT architecture parameters. The activation function module 350 may re-estimate the area needed by the LUTs and the performance of the PPE array with the adjusted LUT architecture parameters. This process may continue until the optimal combination of the estimated area and the estimated performance is achieved.
In some embodiments, the optimal combination of the estimated area and the estimated performance may be an optimal balance between the estimated area and the estimated performance. An increase in the area consumed by the LUTs may cause an increase in the performance of the PPE array 260. The activation function module 350 may determine an area parameter indicating the estimated area. The activation function module 350 may also determine a performance parameter indicating the estimated performance. The activation function module 350 may determine a weighted sum of the two parameters to determine whether the optimized combination or optimal balance is achieved. In an example, the area parameter may have a positive value or a positive weight while the performance parameter may have a negative value or a negative weight. In another example, the performance parameter may have a positive value or a positive weight while the area parameter may have a negative value or a negative weight.
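One way to picture the weighted-sum criterion is the sketch below. It replaces the iterative adjustment with an exhaustive scan over a list of candidate parameter sets, and it takes the area and performance estimators as caller-supplied functions; the function name, the default weights, and the scan strategy are all assumptions for illustration.

```python
def search_architecture(candidates, estimate_area, estimate_performance,
                        area_weight=-1.0, performance_weight=1.0):
    # Score each candidate by a weighted sum in which the area parameter
    # carries a negative weight and the performance parameter a positive one.
    best, best_score = None, float("-inf")
    for params in candidates:
        score = (area_weight * estimate_area(params)
                 + performance_weight * estimate_performance(params))
        if score > best_score:
            best, best_score = params, score
    return best, best_score
```

With toy estimators in which the area grows linearly with K and the performance saturates as K grows, the scan settles on an intermediate K rather than on either extreme.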
The datastore 360 stores data received, generated, used, or otherwise associated with the DNN module 201. For example, the datastore 360 stores the datasets used by the training module 320 and validating module 340. The datastore 360 may also store data generated by the training module 320 and validating module 340, such as the hyperparameters for training DNNs, internal parameters of trained DNNs (e.g., weights, etc.), data for sparsity acceleration (e.g., sparsity bitmap, etc.), and so on. The datastore 360 may store LUT architecture parameters and configuration parameters generated by the activation function module 350. In the embodiment of
Each PPE group 410 includes one or more PPEs. A PPE group 410 may receive input data and compute output data to be used as outputs of activation functions. The output data may be approximated outputs of non-linear activation functions. In some embodiments, a PPE group 410 may also include one or more register files. A PPE may be configured to execute linear functions, including linear functions used to approximate non-linear activation functions. A PPE may include one or more multipliers and one or more accumulators. A register file in a PPE group may be used to store data input into the PPE group 410, such as input data elements of non-linear activation functions, slopes and intercepts of linear functions approximating the non-linear activation functions, and so on. The register file(s) in a PPE may store data computed by the PPE group 410, which may be output data elements of non-linear activation function.
The LUTs 430 store parameters of linear functions executed by the PPE groups 410. A LUT 430 may include a certain number of entries. The total number of entries in a LUT 430 may indicate the size of a LUT 430. The sizes of all the LUTs 430 may indicate the size of the area taken by the LUTs 430. The LUTs 430 may be programmable. For instance, the architectures of the LUTs 430 may be configured by the activation function module 350. In the embodiment of
The shared portions 435 of all the LUTs 430 are coupled to the arbiter 420 through data transfer paths 460. The arbiter 420 is coupled to all the PPE groups 410 through data transfer paths 450. The shared portions 435 of all the LUTs 430 may constitute a pool of LUT entries shared by all the PPE groups 410. Each PPE group 410 may access all the LUT entries in the shared pool. The arbiter 420 may be a memory arbiter that can decide, for each data transfer cycle, which PPE group 410 may be allowed to access the shared pool of LUT entries.
In some embodiments, the entries of the LUTs 430 can be configured, e.g., by the activation function module 350. The entries of the LUTs 430 may be reconfigured and updated so that the LUTs 430 can be used for approximating various activation functions. The LUTs 430 may support various floating-point data types. The datatype of a LUT 430 may also be configured by the activation function module 350. In some embodiments, the LUTs 430 may have entries of different data types. Even though
The PPE groups 510 may be the same or similar as the PPE groups 410 in
In some embodiments, a PPE may be implemented with a range index decoder that determines LUT_ID and LUT_Index based on the input data element that is received. Each input data element has a value with a corresponding exponent and mantissa. The decoder may use the exponent and mantissa to determine the LUT_Index. This decoder may have a LUT_Index to LUT_ID mapping that is predetermined based on analysis conducted by the activation function module 350. The command router 523 may decode the LUT_ID and send the command to the command queue 524 corresponding to the LUT specified by the LUT_ID. The commands can go to the same command queue 524 or different command queues 524. If there are no commands or responses in flight and the LUT_Index corresponds to top-K entries, the arbiter 520 may be bypassed.
The number of elements (“slots”) in a command queue 524 may equal the number of PPE groups 510 in the PPE array 500. For instance, for 4 PPE groups, the command queue 524 has 4 slots where the command can be entered. Each slot may have an ID (“SLOT_ID”) corresponding to an ID of the corresponding PPE group 510 (“PPE_ID”). The slots may be written to in parallel. The command queue 524 may be drained in a round robin fashion.
The LUT 530 may receive the command from the corresponding command queue 524 and push the response into the response queue 526. The response queue 526 may be a FIFO (first in first out). When the response queue 526 is popped, the response router 525 may decode the PPE_ID, which may be part of command metadata that is looped back on the response.
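The command and response flow can be modeled in software as shown below. This is only a behavioral sketch: the class and function names are hypothetical, a command is reduced to a LUT_Index held in the slot of its PPE group, and the LUT is modeled as a dictionary from LUT_Index to a parameter pair.

```python
from collections import deque


class CommandQueue:
    def __init__(self, num_ppe_groups):
        # One slot per PPE group; SLOT_ID equals PPE_ID.
        self.slots = [None] * num_ppe_groups
        self.next_slot = 0

    def write(self, ppe_id, lut_index):
        self.slots[ppe_id] = lut_index

    def drain_one(self):
        # Round-robin: start after the slot served most recently.
        n = len(self.slots)
        for offset in range(n):
            slot = (self.next_slot + offset) % n
            if self.slots[slot] is not None:
                command = (slot, self.slots[slot])
                self.slots[slot] = None
                self.next_slot = (slot + 1) % n
                return command
        return None


def serve(queue, lut):
    # The LUT answers each command; PPE_ID is looped back on the response.
    responses = deque()  # FIFO response queue
    while True:
        command = queue.drain_one()
        if command is None:
            return responses
        ppe_id, lut_index = command
        responses.append((ppe_id, lut[lut_index]))
```

Popping the response queue from the left yields responses in the order in which the commands were drained, and the looped-back PPE_ID tells the response router where to deliver each one.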
Example Approximation of Activation Function with Linear Function
In the embodiments of
The LUT portion 620 receives the configuration signal 602 and is configured to store the parameters of the linear functions. The configuration signal 602 may be received through a configuration bus. In some embodiments, the parameters of a single linear function are stored as a single entry in the LUT portion 620. For instance, the entry may start with the intercept of the linear function, followed by the slope of the linear function. In an example, an entry may include 32 bits, of which the intercept has 16 bits and the slope has 16 bits. The entries may have specific addresses and can be retrieved by the PPEs 630 based on the addresses.
The PPEs 630 receive an input signal 603. In some embodiments, the PPEs 630 may receive different input signals in parallel. In other embodiments, the PPEs 630 may share the input signal 603. The input signal 603 may be received through a data port, such as a data input port. The input signal 603 may include one or more input data elements of the activation function. The PPEs 630 process the input data elements in the input signal 603 to compute, using the parameters of linear functions in the LUT portion 620, outputs of the linear functions as approximated outputs of the activation function. In the embodiments of
In an example operation cycle for processing an input data element in the input signal 603, the LUT entry including the intercept and slope of the linear function for the input segment including the input data element may be identified. For instance, the address of the entry may be determined based on the value (or exponent) of the input data element. The intercept and slope are retrieved from the LUT portion 620 based on the address. The PPE 630 receives the intercept and slope. The PPE 630 also receives an increment value x−x0, which is a difference between the value of the input data element x and a segment start value x0 of the input segment including the input data element. In some embodiments, the PPE 630 may include a subtractor that computes the increment value using the input data element and the segment start value. The multiplier 640 may compute a product of the increment value and the slope. The adder 650 receives the product from the multiplier 640 and the intercept of the linear function. The adder 650 accumulates the product and the intercept and computes the approximated output data element of the activation function. The output of the adder 650 may be sent out from the PPE 630 through a data port (e.g., data output port) as a data element in an output signal 604 of the activation function. In some embodiments, the PPEs 630 may be in the same PPE group.
y = s(x − x0) + yi,
where y denotes the output of the linear function, s denotes the slope (also referred to as “multiplier”) of the linear function, x denotes the input of the linear function, x0 denotes the segment start value of the input segment that includes x, and yi denotes the intercept (also referred to as “offset”) of the linear function.
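The multiply-and-add operation described above can be illustrated in software. The following is a minimal sketch, not the disclosed hardware: the function names `build_lut` and `ppe_evaluate`, the choice of a sigmoid, the segment boundaries, and the chord-based fit of each segment are all assumptions made for illustration.

```python
import math
from bisect import bisect_right


def build_lut(func, boundaries):
    """Build one (x0, slope, intercept) entry per segment (hypothetical layout)."""
    lut = []
    for x0, x1 in zip(boundaries[:-1], boundaries[1:]):
        slope = (func(x1) - func(x0)) / (x1 - x0)  # chord through both ends
        lut.append((x0, slope, func(x0)))          # intercept yi is the value at x0
    return lut


def ppe_evaluate(lut, x):
    """Compute y = s * (x - x0) + yi for the segment that contains x."""
    starts = [entry[0] for entry in lut]
    # Address generation: last segment whose start is <= x, clamped to the table.
    idx = max(0, min(bisect_right(starts, x) - 1, len(lut) - 1))
    x0, slope, intercept = lut[idx]
    increment = x - x0            # subtractor
    product = slope * increment   # multiplier
    return product + intercept    # adder


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


# Illustrative segment boundaries; finer segments near zero, coarser far away.
LUT = build_lut(sigmoid, [-8.0, -4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0, 8.0])
APPROX = ppe_evaluate(LUT, 0.5)
```

In this sketch the approximation is exact at each segment start and deviates inside a segment by an amount that shrinks as the segments become narrower, which is why more LUT entries generally yield higher accuracy.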
In Step 1210, activations 1201 are received. The activations 1201 may be computed in a DNN layer, e.g., a convolutional layer. The activations 1201 may be input data elements of an activation function that can be approximated by linear functions. The range of the activations 1201 may be the input range of the activation function. The data type of the activations 1201 is converted. In an embodiment, the data type of the activations 1201 may be converted to FP16. In another embodiment, the data type of the activations 1201 may be converted to a different data type. In yet another embodiment, the data type conversion may be bypassed.
In Step 1220, LUT indices are encoded. A LUT index may include one or more components, e.g., components that encode sign, exponent, and mantissa. Each LUT index may correspond to an input segment, i.e., a portion of the input range. The activations 1201 may be associated with the LUT indices based on the sign, exponent, or mantissa of the values of the activations 1201. Also, a count indicating how many activations are associated with each LUT index may be determined.
In Step 1230, a frequency table is created. The frequency table may list the LUT indices in Step 1220 and distribution frequencies corresponding to the LUT indices. In some embodiments, the distribution frequencies are represented by the counts of activations associated with each LUT index.
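Steps 1220 and 1230 can be sketched as follows. This is only one possible encoding, assumed for illustration: the LUT index is formed from the sign bit and the five exponent bits of the FP16 representation, and the mantissa is ignored. The function names `lut_index` and `frequency_table` and the sample activations are hypothetical.

```python
import struct
from collections import Counter


def lut_index(value):
    """Encode a 6-bit index from the FP16 sign and exponent (assumed encoding)."""
    # The 'e' format packs an IEEE 754 half-precision float into two bytes.
    bits = struct.unpack("<H", struct.pack("<e", value))[0]
    sign = bits >> 15
    exponent = (bits >> 10) & 0x1F
    return (sign << 5) | exponent


def frequency_table(activations):
    """Count how many activations are associated with each LUT index."""
    return Counter(lut_index(a) for a in activations)


# Illustrative activations only; real values would come from a DNN layer.
TABLE = frequency_table([1.0, 1.5, 1.75, 2.0, -1.0, 0.5])
```

With this encoding, all values in [1, 2) share one index because they share an exponent, so each index covers an input segment whose width doubles with each exponent step.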
In Step 1240, LUT generation is conducted with preliminary architecture parameters 1202. The preliminary architecture parameters may be preliminary values of LUT architecture parameters, such as the LUT architecture parameters described above. Hybrid architecture of one or more LUTs may be determined based on the preliminary architecture parameters. The hybrid architecture may include one or more dedicated LUT portions and a shared pool of LUT entries. Also, parameters of linear functions may be stored in the one or more LUTs using the frequency table created in Step 1230. For instance, one or more input segments with higher distribution frequencies may be selected from the frequency table. The parameters of linear functions for approximating the activation function for the selected input segment(s) may be stored in the dedicated LUT portion(s), while the parameters of linear functions for approximating the activation function for the unselected input segment(s) may be stored in the shared pool of LUT entries.
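The placement of parameters in Step 1240 can be sketched as a split of the frequency table. The sketch below assumes a single architecture parameter, `dedicated_size`, giving the number of entries in a dedicated LUT portion; the function name `assign_entries` and the sample values are hypothetical.

```python
def assign_entries(freq_table, params, dedicated_size):
    """Split segment parameters between a dedicated portion and a shared pool.

    freq_table maps a LUT index to its activation count.
    params maps a LUT index to the (slope, intercept) of its linear function.
    """
    # Rank segments by descending frequency; break ties by index for determinism.
    ranked = sorted(freq_table, key=lambda i: (-freq_table[i], i))
    selected = ranked[:dedicated_size]
    dedicated = {i: params[i] for i in selected}
    shared = {i: p for i, p in params.items() if i not in dedicated}
    return dedicated, shared


# Illustrative counts and parameters only.
FREQ = {0: 5, 1: 50, 2: 20, 3: 1}
PARAMS = {0: (0.0, 0.0), 1: (0.25, 0.5), 2: (0.1, 0.7), 3: (0.0, 1.0)}
DEDICATED, SHARED = assign_entries(FREQ, PARAMS, dedicated_size=2)
```

Here segments 1 and 2, which receive most of the activations, land in the dedicated portion, while the rarely used segments 0 and 3 go to the shared pool.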
In Step 1250, it may be determined whether the area and performance are optimal. The area may be an estimated area consumed by the LUTs with the hybrid architecture. The performance may be an estimated performance of the PPE array that includes the LUTs. The area and performance may be optimal when there is an optimal balance between the area and the performance.
When the area and performance are optimal, the LUT architecture 1203 may be determined and used to configure the LUTs. When the area and performance are not optimal, the architecture parameters are modified in Step 1260. After the architecture parameters are modified, Steps 1240 and 1250 may be performed again until the optimal balance between the area and the performance is found.
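The loop over Steps 1240 through 1260 can be pictured as a search over architecture parameters. The following sketch is not the disclosed procedure: it swaps the iterative modification for an exhaustive sweep over one parameter, and both the area model and the performance model are toy assumptions, as are the names `explore` and `area_weight`.

```python
def explore(freq_table, num_groups, area_weight):
    """Sweep the dedicated-portion size and keep the best-scoring split."""
    counts = sorted(freq_table.values(), reverse=True)
    total = sum(counts)
    best = None
    for dedicated in range(len(counts) + 1):
        shared = len(counts) - dedicated
        # Toy area model: dedicated entries are replicated once per PPE group,
        # while shared entries are stored only once.
        area = num_groups * dedicated + shared
        # Toy performance model: fraction of activations served from a
        # dedicated portion without contending for the shared pool.
        hit_rate = sum(counts[:dedicated]) / total
        score = hit_rate - area_weight * area
        if best is None or score > best["score"]:
            best = {"dedicated": dedicated, "shared": shared, "score": score}
    return best


# Illustrative frequency table: one segment dominates the distribution.
RESULT = explore({0: 90, 1: 5, 2: 3, 3: 2}, num_groups=4, area_weight=0.02)
```

Under these assumed models, dedicating a single entry captures 90% of the activations, and each further dedicated entry costs more area than it gains in hit rate, so the sweep settles on one dedicated entry and three shared entries.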
The activation function module 350 partitions 1310 an input range of the activation function into input segments. An input segment is a region in the input range. In some embodiments, the activation function is a non-linear activation function. The input range is a range of input data elements of the non-linear activation function. In some embodiments, the input data elements are output activations of a deep learning operation, e.g., a convolution. In some embodiments, the input range depends on the data type of the input data elements of the activation function.
The activation function module 350 selects 1320, from the input segments, one or more input segments based on a total number of input data elements of the activation function that fall into each selected input segment. In some embodiments, the activation function module 350 determines frequencies of the input segments based on a total number of input data elements in each of the input segments. The activation function module 350 selects the one or more input segments based on the frequencies. In some embodiments, the frequency of a selected input segment is higher than the frequency of an unselected input segment.
In some embodiments, the activation function module 350 assigns indices to the input segments. Each index corresponds to a different input segment. The activation function module 350 associates an index with one or more input data elements that fall into a corresponding input segment. The activation function module 350 determines the frequencies of the input segments based on counts of the indices.
The activation function module 350 divides 1330 a first LUT into a first portion and a second portion, the first portion of the first LUT dedicated to a first group of PPEs that computes an approximated output of the activation function for a selected input segment. In some embodiments, the first portion of the first LUT comprises a predetermined number of entries in the first LUT. The activation function module 350 divides the first LUT by determining the predetermined number based on an estimated size of an area consumed by a LUT set that includes the first LUT and a second LUT. In some embodiments, the activation function module 350 determines the predetermined number further based on an estimated performance of a PPE array that includes the first group of PPEs and a second group of PPEs.
In some embodiments, the second portion of the first LUT or the portion of the second LUT comprises a predetermined number of entries in the first LUT. The activation function module 350 determines the predetermined number based on a total number of groups of PPEs in a PPE array that includes the first group of PPEs and the second group of PPEs. The predetermined number is further dependent on the total number of entries in the first LUT or in the second LUT.
The activation function module 350 stores 1340, in the first portion of the first LUT, a parameter of a first linear function that approximates the activation function for at least part of the selected input segment. In some embodiments, the first linear function has two parameters, such as an intercept and a slope. The two parameters are stored as a single entry in the first portion of the first LUT. In some embodiments, the activation function module 350 determines multiple linear functions for the selected input segment and stores the parameters of all the linear functions in the first portion of the first LUT.
The activation function module 350 stores 1350, in a pool of LUT entries, a parameter of a second linear function that approximates the activation function for at least part of an unselected input segment, the pool of LUT entries comprising the second portion of the first LUT and a portion of a second LUT, the pool of LUT entries shared by the first group of PPEs and a second group of PPEs. In some embodiments, another portion of the second LUT is dedicated to the second group of PPEs. In some embodiments, the other portion of the second LUT has the same number of entries as the first portion of the first LUT.
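The resulting layout, with two LUTs that each hold a dedicated portion and contribute a portion to the shared pool, can be modeled in software as follows. This is a minimal sketch under stated assumptions: the class name `HybridLUTs`, its methods, the slot allocation order, and the sample parameters are hypothetical, and the model ignores access arbitration between PPE groups.

```python
class HybridLUTs:
    """Two LUTs, each split into a dedicated portion and a shared portion."""

    def __init__(self, dedicated_size, lut_size):
        self.dedicated_size = dedicated_size
        self.shared_size = lut_size - dedicated_size
        # luts[g] is the LUT whose dedicated portion serves PPE group g.
        self.luts = [[None] * lut_size for _ in range(2)]
        self.dedicated_map = [{}, {}]  # per group: segment -> slot
        self.pool_map = {}             # segment -> (LUT id, slot)

    def store_dedicated(self, group, segment, params):
        slot = len(self.dedicated_map[group])
        if slot >= self.dedicated_size:
            raise ValueError("dedicated portion is full")
        self.luts[group][slot] = params
        self.dedicated_map[group][segment] = slot

    def store_shared(self, segment, params):
        used = len(self.pool_map)
        if used >= 2 * self.shared_size:
            raise ValueError("shared pool is full")
        # Fill the shared portion of the first LUT, then that of the second.
        lut_id, offset = divmod(used, self.shared_size)
        slot = self.dedicated_size + offset
        self.luts[lut_id][slot] = params
        self.pool_map[segment] = (lut_id, slot)

    def lookup(self, group, segment):
        """Return (slope, intercept), preferring the group's dedicated portion."""
        if segment in self.dedicated_map[group]:
            return self.luts[group][self.dedicated_map[group][segment]]
        lut_id, slot = self.pool_map[segment]
        return self.luts[lut_id][slot]


# Illustrative configuration: 1 dedicated entry and 2 shared entries per LUT.
H = HybridLUTs(dedicated_size=1, lut_size=3)
H.store_dedicated(0, 5, (0.25, 0.5))   # frequent segment, copy for group 0
H.store_dedicated(1, 5, (0.25, 0.5))   # frequent segment, copy for group 1
H.store_shared(1, (0.1, 0.0))
H.store_shared(2, (0.2, 0.1))
H.store_shared(3, (0.3, 0.2))
```

In this model, the frequent segment is duplicated so that each group reads it from its own dedicated portion, whereas each infrequent segment is stored once and is reachable by either group through the pool.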
Example Computing Device
The computing device 1400 may include a processing device 1402 (e.g., one or more processing devices). The processing device 1402 processes electronic data from registers and/or memory to transform that electronic data into other electronic data that may be stored in registers and/or memory. The computing device 1400 may include a memory 1404, which may itself include one or more memory devices such as volatile memory (e.g., DRAM), nonvolatile memory (e.g., read-only memory (ROM)), high bandwidth memory (HBM), flash memory, solid state memory, and/or a hard drive. In some embodiments, the memory 1404 may include memory that shares a die with the processing device 1402. In some embodiments, the memory 1404 includes one or more non-transitory computer-readable media storing instructions executable to perform operations for executing activation functions in DNNs, e.g., the process 1200 described above in conjunction with
In some embodiments, the computing device 1400 may include a communication chip 1412 (e.g., one or more communication chips). For example, the communication chip 1412 may be configured for managing wireless communications for the transfer of data to and from the computing device 1400. The term “wireless” and its derivatives may be used to describe circuits, devices, systems, methods, techniques, communications channels, etc., that may communicate data through the use of modulated electromagnetic radiation through a nonsolid medium. The term does not imply that the associated devices do not contain any wires, although in some embodiments they might not.
The communication chip 1412 may implement any of a number of wireless standards or protocols, including but not limited to Institute of Electrical and Electronics Engineers (IEEE) standards including Wi-Fi (IEEE 802.11 family), IEEE 802.16 standards (e.g., IEEE 802.16-2005 Amendment), Long-Term Evolution (LTE) project along with any amendments, updates, and/or revisions (e.g., advanced LTE project, ultramobile broadband (UMB) project (also referred to as “3GPP2”), etc.). IEEE 802.16 compatible Broadband Wireless Access (BWA) networks are generally referred to as WiMAX networks, an acronym that stands for worldwide interoperability for microwave access, which is a certification mark for products that pass conformity and interoperability tests for the IEEE 802.16 standards. The communication chip 1412 may operate in accordance with a Global System for Mobile Communication (GSM), General Packet Radio Service (GPRS), Universal Mobile Telecommunications System (UMTS), High Speed Packet Access (HSPA), Evolved HSPA (E-HSPA), or LTE network. The communication chip 1412 may operate in accordance with Enhanced Data for GSM Evolution (EDGE), GSM EDGE Radio Access Network (GERAN), Universal Terrestrial Radio Access Network (UTRAN), or Evolved UTRAN (E-UTRAN). The communication chip 1412 may operate in accordance with Code-division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Digital Enhanced Cordless Telecommunications (DECT), Evolution-Data Optimized (EV-DO), and derivatives thereof, as well as any other wireless protocols that are designated as 3G, 4G, 5G, and beyond. The communication chip 1412 may operate in accordance with other wireless protocols in other embodiments. The computing device 1400 may include an antenna 1422 to facilitate wireless communications and/or to receive other wireless communications (such as AM or FM radio transmissions).
In some embodiments, the communication chip 1412 may manage wired communications, such as electrical, optical, or any other suitable communication protocols (e.g., the Ethernet). As noted above, the communication chip 1412 may include multiple communication chips. For instance, a first communication chip 1412 may be dedicated to shorter-range wireless communications such as Wi-Fi or Bluetooth, and a second communication chip 1412 may be dedicated to longer-range wireless communications such as global positioning system (GPS), EDGE, GPRS, CDMA, WiMAX, LTE, EV-DO, or others. In some embodiments, a first communication chip 1412 may be dedicated to wireless communications, and a second communication chip 1412 may be dedicated to wired communications.
The computing device 1400 may include battery/power circuitry 1414. The battery/power circuitry 1414 may include one or more energy storage devices (e.g., batteries or capacitors) and/or circuitry for coupling components of the computing device 1400 to an energy source separate from the computing device 1400 (e.g., AC line power).
The computing device 1400 may include a display device 1406 (or corresponding interface circuitry, as discussed above). The display device 1406 may include any visual indicators, such as a heads-up display, a computer monitor, a projector, a touchscreen display, a liquid crystal display (LCD), a light-emitting diode display, or a flat panel display, for example.
The computing device 1400 may include an audio output device 1408 (or corresponding interface circuitry, as discussed above). The audio output device 1408 may include any device that generates an audible indicator, such as speakers, headsets, or earbuds, for example.
The computing device 1400 may include an audio input device 1418 (or corresponding interface circuitry, as discussed above). The audio input device 1418 may include any device that generates a signal representative of a sound, such as microphones, microphone arrays, or digital instruments (e.g., instruments having a musical instrument digital interface (MIDI) output).
The computing device 1400 may include a GPS device 1416 (or corresponding interface circuitry, as discussed above). The GPS device 1416 may be in communication with a satellite-based system and may receive a location of the computing device 1400, as known in the art.
The computing device 1400 may include another output device 1410 (or corresponding interface circuitry, as discussed above). Examples of the other output device 1410 may include an audio codec, a video codec, a printer, a wired or wireless transmitter for providing information to other devices, or an additional storage device.
The computing device 1400 may include another input device 1420 (or corresponding interface circuitry, as discussed above). Examples of the other input device 1420 may include an accelerometer, a gyroscope, a compass, an image capture device, a keyboard, a cursor control device such as a mouse, a stylus, a touchpad, a bar code reader, a Quick Response (QR) code reader, any sensor, or a radio frequency identification (RFID) reader.
The computing device 1400 may have any desired form factor, such as a handheld or mobile computer system (e.g., a cell phone, a smart phone, a mobile internet device, a music player, a tablet computer, a laptop computer, a netbook computer, an ultrabook computer, a personal digital assistant (PDA), an ultramobile personal computer, etc.), a desktop computer system, a server or other networked computing component, a printer, a scanner, a monitor, a set-top box, an entertainment control unit, a vehicle control unit, a digital camera, a digital video recorder, or a wearable computer system. In some embodiments, the computing device 1400 may be any other electronic device that processes data.
The following paragraphs provide various examples of the embodiments disclosed herein.
Example 1 provides a method for approximating an activation function in a neural network, the method including partitioning an input range of the activation function into input segments, in which an input segment is a region in the input range; selecting, from the input segments, one or more input segments based on a total number of input data elements of the activation function that fall into each selected input segment; dividing a first LUT into a first portion and a second portion, the first portion of the first LUT dedicated to a first group of PPEs that computes an approximated output of the activation function for a selected input segment; storing, in the first portion of the first LUT, a parameter of a first linear function that approximates the activation function for at least part of the selected input segment; and storing, in a pool of LUT entries, a parameter of a second linear function that approximates the activation function for at least part of an unselected input segment, in which the pool of LUT entries includes a second portion of the first LUT and a portion of a second LUT, and the pool of LUT entries is shared by the first group of PPEs and a second group of PPEs.
Example 2 provides the method of example 1, in which selecting the one or more input segments includes determining frequencies of the input segments based on a total number of input data elements in each of the input segments; and selecting the one or more input segments based on the frequencies.
Example 3 provides the method of example 2, in which the frequency of the selected input segment is higher than a frequency of the unselected input segment.
Example 4 provides the method of example 2 or 3, in which the first portion of the first LUT includes a predetermined number of entries in the first LUT, and the predetermined number is determined based on an estimated size of an area consumed by a LUT set that includes the first LUT and the second LUT.
Example 5 provides the method of example 4, in which the predetermined number is further determined based on an estimated performance of a PPE array that includes the first group of PPEs and the second group of PPEs.
Example 6 provides the method of any one of examples 1-5, further including assigning indices to the input segments, each index corresponding to a different input segment; associating an index with one or more input data elements that fall into a corresponding input segment; and determining the frequencies of the input segments based on counts of the indices.
Example 7 provides the method of any one of examples 1-6, in which another portion of the second LUT is dedicated to the second group of PPEs.
Example 8 provides the method of example 7, in which the another portion of the second LUT has a same number of entries as the first portion of the first LUT.
Example 9 provides the method of any one of examples 1-8, in which the second portion of the first LUT or the portion of the second LUT includes a predetermined number of entries in the first LUT, and the predetermined number is dependent on a total number of groups of PPEs in a PPE array that includes the first group of PPEs and the second group of PPEs.
Example 10 provides the method of example 9, in which the predetermined number is further dependent on a total number of entries in the first LUT or in the second LUT.
Example 11 provides one or more non-transitory computer-readable media storing instructions executable to perform operations for approximating an activation function in a neural network, the operations including partitioning an input range of the activation function into input segments, in which an input segment is a region in the input range; selecting, from the input segments, one or more input segments based on a total number of input data elements of the activation function that fall into each selected input segment; dividing a first LUT into a first portion and a second portion, the first portion of the first LUT dedicated to a first group of PPEs that computes an approximated output of the activation function for a selected input segment; storing, in the first portion of the first LUT, a parameter of a first linear function that approximates the activation function for at least part of the selected input segment; and storing, in a pool of LUT entries, a parameter of a second linear function that approximates the activation function for at least part of an unselected input segment, in which the pool of LUT entries includes a second portion of the first LUT and a portion of a second LUT, and the pool of LUT entries is shared by the first group of PPEs and a second group of PPEs.
Example 12 provides the one or more non-transitory computer-readable media of example 11, in which selecting the one or more input segments includes determining frequencies of the input segments based on a total number of input data elements in each of the input segments; and selecting the one or more input segments based on the frequencies, in which the frequency of the selected input segment is higher than a frequency of the unselected input segment.
Example 13 provides the one or more non-transitory computer-readable media of example 12, in which the first portion of the first LUT includes a predetermined number of entries in the first LUT, and the predetermined number is determined based on an estimated size of an area consumed by a LUT set, which includes the first LUT and the second LUT, and an estimated performance of a PPE array that includes the first group of PPEs and the second group of PPEs.
Example 14 provides the one or more non-transitory computer-readable media of any one of examples 11-13, in which another portion of the second LUT is dedicated to the second group of PPEs, and the another portion of the second LUT has a same number of entries as the first portion of the first LUT.
Example 15 provides the one or more non-transitory computer-readable media of any one of examples 11-14, in which the second portion of the first LUT or the portion of the second LUT includes a predetermined number of entries in the first LUT, and the predetermined number is dependent on a total number of groups of PPEs in a PPE array that includes the first group of PPEs and the second group of PPEs.
Example 16 provides the one or more non-transitory computer-readable media of example 15, in which the predetermined number is further dependent on a total number of entries in the first LUT or in the second LUT.
Example 17 provides an apparatus, including a computer processor for executing computer program instructions; and a non-transitory computer-readable memory storing computer program instructions executable by the computer processor to perform operations for approximating an activation function in a neural network, the operations including partitioning an input range of the activation function into input segments, in which an input segment is a region in the input range; selecting, from the input segments, one or more input segments based on a total number of input data elements of the activation function that fall into each selected input segment; dividing a first LUT into a first portion and a second portion, the first portion of the first LUT dedicated to a first group of PPEs that computes an approximated output of the activation function for a selected input segment; storing, in the first portion of the first LUT, a parameter of a first linear function that approximates the activation function for at least part of the selected input segment; and storing, in a pool of LUT entries, a parameter of a second linear function that approximates the activation function for at least part of an unselected input segment, in which the pool of LUT entries includes a second portion of the first LUT and a portion of a second LUT, and the pool of LUT entries is shared by the first group of PPEs and a second group of PPEs.
Example 18 provides the apparatus of example 17, in which selecting the one or more input segments includes determining frequencies of the input segments based on a total number of input data elements in each of the input segments; and selecting the one or more input segments based on the frequencies, in which the frequency of the selected input segment is higher than a frequency of the unselected input segment.
Example 19 provides the apparatus of example 18, in which the first portion of the first LUT includes a predetermined number of entries in the first LUT, and the predetermined number is determined based on an estimated size of an area consumed by a LUT set, which includes the first LUT and the second LUT, and an estimated performance of a PPE array that includes the first group of PPEs and the second group of PPEs.
Example 20 provides the apparatus of any one of examples 17-19, in which another portion of the second LUT is dedicated to the second group of PPEs, and the another portion of the second LUT has a same number of entries as the first portion of the first LUT.
The above description of illustrated implementations of the disclosure, including what is described in the Abstract, is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. While specific implementations of, and examples for, the disclosure are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the disclosure, as those skilled in the relevant art will recognize. These modifications may be made to the disclosure in light of the above detailed description.