Deep learning (DL) algorithms are implemented using neural networks such as convolutional neural networks (CNNs), deep neural networks (DNNs), recurrent neural networks (RNNs), and the like. For example, CNNs are a class of artificial neural networks that learn how to perform tasks such as computer vision, driver assistance, image recognition, natural language processing, and game play. A CNN architecture includes a stack of layers that implement functions to transform an input volume (such as a digital image) into an output volume (such as labeled features detected in the digital image). The layers in a CNN are separated into convolutional layers, pooling layers, and fully connected layers. Multiple sets of convolutional, pooling, and fully connected layers are interleaved to form a complete CNN. The functions implemented by the layers in a CNN are explicit (i.e., known or predetermined) or hidden (i.e., unknown). A DNN is a type of artificial neural network that performs deep learning using multiple hidden layers. For example, a DNN that is used to implement computer vision includes explicit functions (such as orientation maps) and multiple hidden functions in the hierarchy of vision flow. An RNN is a type of artificial neural network that forms a directed graph of connections between nodes along a temporal sequence and exhibits temporal dynamic behavior. For example, an RNN uses an internal state to process sequences of inputs. RNNs are typically applied to deep learning problems such as handwriting recognition and speech recognition. In some cases, an RNN is implemented with one or more long short-term memory (LSTM) units that provide feedback connections.
The present disclosure is better understood, and its numerous features and advantages made apparent to those skilled in the art by referencing the accompanying drawings. The use of the same reference symbols in different drawings indicates similar or identical items.
Neural networks, including convolutional neural networks (CNNs), deep neural networks (DNNs), and recurrent neural networks (RNNs), implement algebraic arithmetic units that perform operations such as addition and multiplication, as well as transcendental arithmetic units that implement functions such as the binary logarithm (LOG2) and the binary anti-logarithm (EXP2). The transcendental arithmetic units are used to realize activation functions (e.g., sigmoid and tanh in CNNs) and long short-term memory (LSTM) network layers in RNNs and DNNs. Floating-point formats are used to implement the transcendental arithmetic units, and different formats make different trade-offs between hardware cost (e.g., area) and accuracy. For example, single-precision floating-point formats use one sign bit, eight exponent bits, and 23 mantissa bits to achieve high precision at a relatively high hardware cost. For another example, half-precision floating-point formats use one sign bit, five exponent bits, and 10 mantissa bits to reduce the hardware cost while tolerating a loss in output accuracy. Applications implemented in RNNs typically require high accuracy and therefore frequently use the single-precision floating-point format. This requirement is particularly pertinent to LSTMs and other such implementations because the current output of the RNN depends on the previous output of the RNN; errors in the computation of transcendental functions for the RNN therefore accumulate over iterations of the RNN. Furthermore, conventional techniques for computing single-precision (or double-precision) floating-point transcendental functions require numerous computation cycles to achieve the required accuracy, which adds performance-sapping latency to the computation.
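For illustration only, the following Python sketch unpacks a single-precision value into the one sign bit, eight exponent bits, and 23 mantissa bits described above (the half-precision split into one, five, and ten bits is analogous); the helper name fields_fp32 and the use of the struct module are incidental to this example and not part of any disclosed hardware.

```python
import struct

def fields_fp32(x: float):
    """Split a float into its IEEE-754 single-precision bit fields."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31                 # 1 sign bit
    exponent = (bits >> 23) & 0xFF    # 8 exponent bits (biased by 127)
    mantissa = bits & 0x7FFFFF        # 23 mantissa bits (hidden leading 1 is not stored)
    return sign, exponent, mantissa

sign, exp, man = fields_fp32(1.5)
print(sign, exp - 127, hex(man))      # 0 0 0x400000, i.e. 1.5 = +1.1b x 2^0
```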
Addresses of the LUTs associated with the partitions are generated using the bits in the corresponding partitions. For example, a first address of a first LUT associated with a first partition of the bits representative of an input is generated using the bits in the first partition, a second address of a second LUT associated with a second partition of the input bits is generated using the bits in the second partition, and a third address of a third LUT associated with a third partition of the input bits is generated using the bits in the third partition. Entries in the LUTs represent values of outputs corresponding to the input to the transcendental arithmetic unit. Some embodiments of the transcendental arithmetic unit implement a binary logarithm function and the entries in the LUTs map bits in the partitions of the input mantissa to mantissas of the binary logarithms of the inputs (with values between one and two when represented in the floating-point format). Some embodiments of the transcendental arithmetic unit implement a binary anti-logarithm function and the entries in the LUTs map bits representing a fraction of the input to a mantissa of a fraction of the anti-logarithm of the inputs (with values between zero and one when represented in the floating-point format).
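As a concrete, purely illustrative sketch of this addressing scheme, the following Python fragment splits a 23-bit input into three partitions and uses each partition, by itself, as the address of its own LUT. The 10/9/4 split shown here matches the binary-logarithm example discussed later; other partition widths work the same way.

```python
def partition_addresses(mantissa23: int):
    """Form one LUT address per partition of a 23-bit input."""
    addr_first = (mantissa23 >> 13) & 0x3FF   # bits 22..13 -> 10-bit address of the first LUT
    addr_second = (mantissa23 >> 4) & 0x1FF   # bits 12..4  ->  9-bit address of the second LUT
    addr_third = mantissa23 & 0xF             # bits  3..0  ->  4-bit address of the third LUT
    return addr_first, addr_second, addr_third

print(partition_addresses(0b1010101010_111100001_1001))   # (682, 481, 9)
```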
The processing system 100 includes a graphics processing unit (GPU) 115 that renders images for presentation on a display 120. For example, the GPU 115 renders objects to produce values of pixels that are provided to the display 120, which uses the pixel values to display an image that represents the rendered objects. Some embodiments of the GPU 115 are used to implement deep learning operations including CNNs, DNNs, and RNNs, as well as to perform other general-purpose computing tasks. In the illustrated embodiment, the GPU 115 implements multiple processing elements 116, 117, 118 (collectively referred to herein as “the processing elements 116-118”) that execute instructions concurrently or in parallel. In the illustrated embodiment, the GPU 115 communicates with the memory 105 over the bus 110. However, some embodiments of the GPU 115 communicate with the memory 105 over a direct connection or via other buses, bridges, switches, routers, and the like. The GPU 115 executes instructions stored in the memory 105 and the GPU 115 stores information in the memory 105 such as the results of the executed instructions. For example, the memory 105 stores a copy 125 of instructions from a program code that is to be executed by the GPU 115.
The processing system 100 also includes a central processing unit (CPU) 130 that implements multiple processing elements 131, 132, 133, which are collectively referred to herein as “the processing elements 131-133.” The processing elements 131-133 execute instructions concurrently or in parallel. The CPU 130 is connected to the bus 110 and therefore communicates with the GPU 115 and the memory 105 via the bus 110. The CPU 130 executes instructions such as program code 135 stored in the memory 105 and the CPU 130 stores information in the memory 105 such as the results of the executed instructions. The CPU 130 is also able to initiate graphics processing by issuing draw calls to the GPU 115.
An input/output (I/O) engine 140 handles input or output operations associated with the display 120, as well as other elements of the processing system 100 such as keyboards, mice, printers, external disks, and the like. The I/O engine 140 is coupled to the bus 110 so that the I/O engine 140 communicates with the memory 105, the GPU 115, or the CPU 130. In the illustrated embodiment, the I/O engine 140 reads information stored on an external storage component 145, which is implemented using a non-transitory computer readable medium such as a compact disk (CD), a digital video disc (DVD), and the like. The I/O engine 140 also writes information to the external storage component 145, such as the results of processing by the GPU 115 or the CPU 130.
Artificial neural networks, such as a CNN, DNN, or RNN, are represented as program code that is configured using a corresponding set of parameters. The artificial neural network is therefore executed on the GPU 115, the CPU 130, or other processing units including field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), processing in memory (PIM), and the like. If the artificial neural network implements a known function, the artificial neural network is trained (i.e., the values of the parameters that define the artificial neural network are established) by providing input values from a known training dataset to the artificial neural network executing on the GPU 115 or the CPU 130 and then comparing the output values of the artificial neural network to the labeled output values in the known training dataset. Error values are determined based on the comparison and back propagated to modify the values of the parameters that define the artificial neural network. This process is iterated until the values of the parameters satisfy a convergence criterion.
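The following Python sketch illustrates this training loop in miniature (forward pass, comparison against labeled outputs, back-propagated parameter updates, iteration until a convergence criterion is met). A deliberately tiny linear model stands in for the network, and the learning rate and convergence threshold are illustrative placeholders rather than parameters of any particular network described herein.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))                      # known training inputs
y = X @ np.array([0.5, -1.0, 2.0, 0.3]) + 0.1     # labeled output values
w, b, lr = np.zeros(4), 0.0, 0.05                 # parameters to be established

for step in range(10_000):
    pred = X @ w + b                              # forward pass over the training set
    err = pred - y                                # compare outputs to the labels
    grad_w, grad_b = X.T @ err / len(X), err.mean()
    w -= lr * grad_w                              # back-propagated parameter updates
    b -= lr * grad_b
    if np.abs(err).max() < 1e-6:                  # convergence criterion
        break

print(step, w.round(3), round(b, 3))              # recovers roughly [0.5, -1.0, 2.0, 0.3] and 0.1
```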
As discussed herein, neural networks in the processing system 100 include transcendental arithmetic units that implement transcendental functions such as a binary logarithm (LOG2) or a binary anti-logarithm (EXP2). The transcendental arithmetic units are used, for example, to realize activation functions (e.g., sigmoid and tanh in CNNs) and long short-term memory (LSTM) network layers in RNNs and DNNs. Floating-point formats are used to implement the transcendental arithmetic units, and different formats make different trade-offs between hardware cost (e.g., area) and accuracy. In order to achieve high precision and a throughput as high as one sample per clock cycle in an architecture that provides a reasonable area-accuracy trade-off, the processing system 100 implements multipartite LUTs to calculate the transcendental functions. Some embodiments of the multipartite LUTs include a set of LUTs, and each LUT in the set maps a partition of the bits representative of an input number to a value of a transcendental function of the input number in a floating-point format. The values of the transcendental function are combined to produce an output number in a floating-point format, which is the same as or different from the floating-point format of the input number. The value of the output number is equal to the transcendental function of the input number. In some embodiments, the set of LUTs 150 is stored in the memory 105 and addresses of the LUTs 150 in the memory 105 are indicated by the partitions of the bits representative of the input number.
The DNN 200 includes convolutional layers 220 that implement a convolutional function that is defined by a set of parameters, which are trained with one or more training datasets. The parameters include a set of learnable filters (or kernels) that have a small receptive field and extend through the full depth of the input volume of the convolutional layers 220. The parameters also include, but are not limited to, a depth parameter, a stride parameter, and a zero-padding parameter that control the size of the output volume of the convolutional layers 220. The convolutional layers 220 apply a convolution operation to input values and provide the results of the convolution operation to a subsequent layer in the DNN 200. For example, the portion 205 of the image 210 is provided as input 225 to the convolutional layers 220, which apply the convolution operation to the input 225 based on the set of parameters to generate a corresponding output value 230. In some embodiments, the convolutional layers 220 are identified as a subnetwork of the DNN 200. The subnetwork then represents a convolutional neural network (CNN). However, in some embodiments, the convolutional layers 220 are a part of a larger subnetwork of the DNN 200 or the convolutional layers 220 are further subdivided into multiple subnetworks of the DNN 200.
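For illustration, the following NumPy sketch shows the kind of convolution operation the convolutional layers 220 apply, reduced to a single channel and a single filter; the filter values, stride, and padding used here are arbitrary examples rather than trained parameters of the DNN 200.

```python
import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    """2-D convolution (cross-correlation, as commonly used in CNN layers)."""
    image = np.pad(image, pad)                    # zero-padding parameter
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1      # output height set by stride and padding
    ow = (image.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)    # dot product of filter and receptive field
    return out

edge_filter = np.array([[1., 0., -1.]] * 3)       # one illustrative 3x3 learnable filter
print(conv2d(np.arange(25.).reshape(5, 5), edge_filter, stride=1, pad=1).shape)  # (5, 5)
```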
Results generated by the convolutional layers 220 are provided to pooling layers 235 in the DNN 200. The pooling layers 235 combine outputs of neuron clusters at the convolutional layers 220 into a smaller number of neuron clusters that are output from the pooling layers 235. The pooling layers 235 typically implement known (or explicit) functions. For example, pooling layers 235 that implement maximum pooling assign a maximum value of values of neurons in a cluster that is output from the convolutional layers 220 to a single neuron that is output from the pooling layers 235. For another example, pooling layers 235 that implement average pooling assign an average value of the values of the neurons in the cluster that is output from the convolutional layers 220 to a single neuron that is output from the pooling layers 235. The known (or explicit) functionality of the pooling layers 235 is trained using predetermined training datasets. In some embodiments, the pooling layers 235 are identified as a subnetwork of the DNN 200. However, in some cases, the pooling layers 235 are a part of a larger subnetwork of the DNN 200 or the pooling layers 235 are further subdivided into multiple subnetworks of the DNN 200.
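The following NumPy sketch illustrates the explicit pooling functions described above, combining each 2×2 cluster of neuron outputs into a single output neuron by either maximum or average pooling; the 2×2 cluster size is an illustrative choice.

```python
import numpy as np

def pool2x2(activations, mode="max"):
    """Combine each 2x2 neuron cluster into a single output neuron."""
    h, w = activations.shape
    blocks = activations[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

a = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.]])
print(pool2x2(a, "max"))    # [[4. 8.]]  -> maximum pooling
print(pool2x2(a, "mean"))   # [[2.5 6.5]] -> average pooling
```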
In the illustrated embodiment, the DNN 200 also includes additional convolutional layers 240 that receive input from the pooling layers 235 and additional pooling layers 245 that receive input from the additional convolutional layers 240. However, the additional convolutional layers 240 and the additional pooling layers 245 are optional and are not present in some embodiments of the DNN 200. Furthermore, some embodiments of the DNN 200 include larger numbers of convolutional and pooling layers. The additional convolutional layers 240 and the additional pooling layers 245 are identified as subnetworks of the DNN 200, portions of subnetworks of the DNN 200, or they are subdivided into multiple subnetworks of the DNN 200.
Output from the additional pooling layers 245 is provided to fully connected layers 250, 255. The neurons in the fully connected layers 250, 255 are connected to every neuron in another layer, such as the additional pooling layers 245 or the other fully connected layers. The fully connected layers 250, 255 typically implement functionality that represents the high-level reasoning that produces the output values 215. For example, if the DNN 200 is trained to perform a deep learning operation such as computer vision, driver assistance, image recognition, natural language processing, or game play, the fully connected layers 250, 255 implement the functionality that labels portions of the image that have been “recognized” by the DNN 200. Examples of labels include names of people whose faces are detected in the image 210, types of objects detected in the image, and the like. The functions implemented in the fully connected layers 250, 255 are represented by values of parameters that are determined using a training dataset, as discussed herein. The fully connected layers 250, 255 are identified as subnetworks of the DNN 200, portions of subnetworks of the DNN 200, or they are subdivided into multiple subnetworks of the DNN 200.
1≤A,B,C<2
1≤A+B+C<2
Within these ranges, the transcendental function is approximated as:
f(A+B+C)≈f(A)+f(B)+f(C)
A similar approximation is also applied in a binary anti-logarithm converter 500 shown in FIG. 5.
The input number 401 is partitioned into exponent bits 411 and partitions 412, 413, 414 of the mantissa bits of the input number 401 in the floating-point format. Some embodiments of the input number 401 are represented as an IEEE floating-point number, N:
$N = (-1)^{\text{Sign}} \times 2^{\text{Exponent} + \text{Bias}} \times (H.M)_2$
where Sign is the sign bit of the input number 401, Exponent is its true exponent value, Bias is the bias introduced by the IEEE standard, H is the hidden bit of the significand that is either 1 or 0 depending on whether the number is a normal number or a subnormal number, and M is the mantissa of the input number 401. For the IEEE single-precision format, the input number 401 includes one sign bit, eight exponent bits, and 23 mantissa bits. If the hidden bit is included in the mantissa, the significand of the mantissa becomes a 24-bit value.
Some embodiments represent the significand of the mantissa in real format for calculating the binary logarithm of the significand. For example, the binary logarithm of the significand 1.M is expressed in real format as:

$\log_2(1+M) = \log_2(1 + M_{22} \times 2^{-1} + M_{21} \times 2^{-2} + \ldots + M_{0} \times 2^{-23})$

which is approximated as:
$\log_2(1+M) \approx 0 + \log_2(M_{22}M_{21}\ldots M_{13}\,0000000000000) + \log_2(0000000000\,M_{12}M_{11}\ldots M_{4}\,0000) + \log_2(0000000000000000000\,M_{3}M_{2}M_{1}M_{0})$
This approximation is used to determine the binary logarithm of the input number 401. In some embodiments, the values of the arguments to the above formula for the binary logarithm serve as addresses for corresponding LUTs 430, 431, 432 (which are collectively referred to herein as “the LUTs 430-432”) in the multipartite LUT 405.
The exponent bits 411 are provided to an exponent extractor 415 that extracts the exponent from the input number 401, and the result is provided to an incrementor 420, which generates an incremented exponent 425. The exponent of a floating-point number N is extracted by the exponent extractor 415 as:
Extracted Exponent=(Exponent+Bias)−(Bias+1)
Values of the partitions 412-414 of the mantissa bits are provided to the corresponding LUTs 430-432 in the multipartite LUT 405. Unlike previous multipartite implementations that determine address bits into the multipartite LUT 405 using bits from the multiple partitions, the binary logarithm converter 400 generates address bits of each of the individual LUTs 430-432 from a single corresponding partition of the input bits. For example, the address bits into the LUT 430 are generated based on the partition 412, the address bits into the LUT 431 are generated based on the partition 413, and the address bits into the LUT 432 are generated based on the partition 414. This approach reduces the number of words in the LUTs 430-432, as well as the size of each word in the LUTs 430-432, depending on the contents of the LUTs 430-432 obtained from the approximation discussed above.
The LUTs 430-432 contain the mantissas of the binary logarithms of inputs in the range of one to two, when represented in the floating-point format. In some embodiments, the contents of the LUTs 430-432 are determined using the following:
LUT 430: $1 + (i \times 2^{-10}), \forall i \in [0, 1023]$
LUT 431: $1 + (i \times 2^{-19}), \forall i \in [0, 511]$
LUT 432: $1 + (i \times 2^{-23}), \forall i \in [0, 15]$
The values generated by the LUTs 430-432 based on the partitions 412-414 of the mantissa bits are provided to an adder 435, which combines the values to form a LUT output sum 440.
The binary logarithm converter 400 includes ABS circuitry 445 that combines the LUT output sum 440 with the incremented true exponent 425 to generate a fixed point representation of an output value, which is provided to a normalizer 450. Output compute circuitry 455 uses a normalized output value generated by the normalizer 450 to compute the output in floating-point format. The output compute circuitry 455 also configures the precision format of the output according to a return type generated by a type generator 460. The output number is then provided to a buffer 465. Some embodiments of the output compute circuitry 455 are implemented as hard-coded logic, programmable logic, or a combination thereof, as well as supporting analog circuitry (e.g., resistors, capacitors, inductors, terminals, vias, contacts, leads or traces, and the like).
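The following Python sketch models the data flow of the binary logarithm converter 400 numerically. It assumes the 10/9/4 mantissa split described above and table entries equal to the binary logarithms of the values 1 + i×2^-10, 1 + i×2^-19, and 1 + i×2^-23; the fixed-point widths, rounding, normalization, and output formatting of the actual circuitry (ABS circuitry 445, normalizer 450, output compute circuitry 455) are not modeled, and the result is returned as an ordinary real number.

```python
import math

# Illustrative table contents: log2 of the partition values listed above.
LUT_HI = [math.log2(1 + i * 2 ** -10) for i in range(1024)]  # mantissa bits M22..M13
LUT_MID = [math.log2(1 + i * 2 ** -19) for i in range(512)]  # mantissa bits M12..M4
LUT_LO = [math.log2(1 + i * 2 ** -23) for i in range(16)]    # mantissa bits M3..M0

def log2_multipartite(x: float) -> float:
    exponent = math.floor(math.log2(x))                      # plays the role of the exponent extractor 415
    mantissa = int((x / 2.0 ** exponent - 1.0) * 2 ** 23)    # 23-bit mantissa M of the significand 1.M
    hi, mid, lo = mantissa >> 13, (mantissa >> 4) & 0x1FF, mantissa & 0xF
    lut_sum = LUT_HI[hi] + LUT_MID[mid] + LUT_LO[lo]         # adder forming the LUT output sum
    return exponent + lut_sum                                # combine the sum with the extracted exponent

print(log2_multipartite(1.2345), math.log2(1.2345))          # agree to about three decimal places
```

The remaining error comes from the sum-of-logarithms approximation itself, which is why the table contents and partition widths trade off area against accuracy as discussed above.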
An integer/fraction extractor 515 accesses the bits 511-513 that represent the input number 501 and extracts an integer (e.g., one or more integer bits) and a fraction (e.g., one or more fraction bits) of the input number 501 from the exponent bits 512 and the mantissa bits 513. An integer-to-exponent converter 520 generates an output exponent based on the integer provided by the integer/fraction extractor 515 and the sign bit 511.
The integer/fraction extractor 515 provides bits representing the fraction to fraction/address circuitry 525 that adjusts the number of bits in the fraction and generates addresses into a set of LUTs 526, 527, 528 (which are collectively referred to as “the LUTs 526-528”) in the multipartite LUT 505. The fraction/address circuitry 525 adjusts the number of bits in the fraction to a format that includes the target number of bits, e.g., by adding one or more trailing zeros so that the total number of bits in the fraction is equal to the target number, such as 23 bits. The fraction/address circuitry 525 partitions the bits in the bit-adjusted fraction into partitions 530, 531, 532, which are collectively referred to as “the partitions 530-532.”
For example, if the input number 501 is represented as N=I.F, where I is the integer portion and F is the fraction portion of the input number 501, the addresses of the LUTs 526-528 are generated by adjusting the positions of the bits of F to form a number having a target number of bits such as 23 bits. In that case,
EXP2(N) = EXP2(I.F) = EXP2(I.0 + 0.F)
Representing the above expression with the bit fields produces:
EXP2(I.0 + 0.F) = EXP2(I.0) × EXP2(0.F)
The binary anti-logarithm of the fraction is determined using the information in the LUTs 526-528 after adjusting the fraction to the target number of bits, which in this case is 23 bits. This leads to the expression:
$\text{EXP2}(0.F) = \text{EXP2}(F_{22}F_{21}F_{20}\ldots F_{1}F_{0})$
The address values for the LUTs 526-528 are then determined based on the partitions 530-532, e.g., using:
$\text{EXP2}(0.F) \approx \text{EXP2}(F_{22}F_{21}\ldots F_{13}\,0000000000000) + \text{EXP2}(0000000000\,F_{12}F_{11}\ldots F_{3}\,000) + \text{EXP2}(00000000000000000000\,F_{2}F_{1}F_{0})$
The LUTs 526-528 contain the mantissa of the fraction of the anti-logarithm of the inputs between zero and one, when represented in the floating-point format. In some embodiments, the contents of the LUTs 526-528 are determined using the following:
LUT 526: $1 + (i \times 2^{-10}), \forall i \in [0, 1023]$
LUT 527: $1 + (i \times 2^{-20}), \forall i \in [0, 1023]$
LUT 528: $1 + (i \times 2^{-23}), \forall i \in [0, 7]$
The values generated by the LUTs 526-528 based on the partitions 530-532 of the fraction bits are provided to an adder 535, which combines the values and provides the combined values to a converter 540 that converts the LUT output to a mantissa value for the output number. Output compute circuitry 545 uses a mantissa value generated by the converter 540 and the exponent value generated by the integer-to-exponent converter 520 to compute the output in floating-point format. The output number is then provided to a buffer 550. Some embodiments of the output compute circuitry 545 are implemented as hard-coded logic, programmable logic, or a combination thereof, as well as supporting analog circuitry (e.g., resistors, capacitors, inductors, terminals, vias, contacts, leads or traces, and the like).
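The following Python sketch models the binary anti-logarithm converter 500 in the same spirit. It assumes the 10/10/3 split of the 23 fraction bits shown above and table entries equal to the fractional parts (mantissas) of the anti-logarithms of the partition values; as with the previous sketch, the fixed-point details of the hardware are omitted and the result is returned as an ordinary real number.

```python
import math

# Illustrative table contents: fractional part (mantissa) of 2^v for each partition value v.
LUT_HI = [2 ** (i * 2 ** -10) - 1 for i in range(1024)]  # fraction bits F22..F13
LUT_MID = [2 ** (i * 2 ** -20) - 1 for i in range(1024)] # fraction bits F12..F3
LUT_LO = [2 ** (i * 2 ** -23) - 1 for i in range(8)]     # fraction bits F2..F0

def exp2_multipartite(x: float) -> float:
    integer = math.floor(x)                    # integer part I, which sets the output exponent
    frac = int((x - integer) * 2 ** 23)        # fraction 0.F adjusted to 23 bits
    hi, mid, lo = frac >> 13, (frac >> 3) & 0x3FF, frac & 0x7
    mantissa = 1.0 + LUT_HI[hi] + LUT_MID[mid] + LUT_LO[lo]   # sum of the three LUT outputs
    return mantissa * 2.0 ** integer           # combine the mantissa with the exponent 2^I

print(exp2_multipartite(3.7), 2 ** 3.7)        # roughly 12.993 vs 12.996
```

The small residual difference reflects the sum-of-anti-logarithms approximation rather than the table resolution.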
Applications or functions implemented in a deep learning operation often require more than one transcendental function. For example, some operations implemented by the processing system 100 shown in FIG. 1 use both the binary logarithm and the binary anti-logarithm functions.
As discussed above, the LUTs 430-432 shown in FIG. 4 contain values of the function log2(1+x), and the LUTs 526-528 shown in FIG. 5 contain values of the function 2^x−1. These two functions deviate from the identity function x by amounts that nearly mirror each other, as expressed by the following differences and net residue, which are illustrated by the plot 700:
$\Delta x_1 = \log_2(1+x) - x$
$\Delta x_2 = x - (2^x - 1)$
$\epsilon = \Delta x_1 - \Delta x_2$
The plot 700 indicates that the residue ranges from 0 to 0.08 in the range 0 ≤ x < 1. The asymmetry indicated by the net residue can therefore be safely ignored in relatively error-resilient applications that implement both binary logarithm and binary anti-logarithm functions. In that case, a single set of LUTs for one of the functions log2(1+x) and 2^x−1 is used to derive values for both types of functions, as needed.
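The following Python sketch checks this relationship numerically: over 0 ≤ x < 1, each of the deviations Δx1 and Δx2 peaks near 0.086, while their difference (the net residue) stays within roughly ±0.012, which is the asymmetry that error-resilient applications can ignore.

```python
import math

# Deviations of log2(1+x) and 2^x - 1 from the identity over 0 <= x < 1, and their net residue.
xs = [i / 4096 for i in range(4096)]
dx1 = [math.log2(1 + x) - x for x in xs]        # log2(1+x) - x
dx2 = [x - (2 ** x - 1) for x in xs]            # x - (2^x - 1)
eps = [a - b for a, b in zip(dx1, dx2)]         # net residue between the two deviations

print(round(max(dx1), 4), round(max(dx2), 4), round(max(map(abs, eps)), 4))
# approximately 0.0861 0.0861 0.0117
```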
A computer readable storage medium includes any non-transitory storage medium, or combination of non-transitory storage media, accessible by a computer system during use to provide instructions and/or data to the computer system. Such storage media includes, but is not limited to, optical media (e.g., compact disc (CD), digital versatile disc (DVD), Blu-Ray disc), magnetic media (e.g., floppy disc, magnetic tape, or magnetic hard drive), volatile memory (e.g., random access memory (RAM) or cache), non-volatile memory (e.g., read-only memory (ROM) or Flash memory), or microelectromechanical systems (MEMS)-based storage media. The computer readable storage medium is embedded in the computing system (e.g., system RAM or ROM), fixedly attached to the computing system (e.g., a magnetic hard drive), removably attached to the computing system (e.g., an optical disc or Universal Serial Bus (USB)-based Flash memory), or coupled to the computer system via a wired or wireless network (e.g., network accessible storage (NAS)).
In some embodiments, certain aspects of the techniques described above are implemented by one or more processors of a processing system executing software. The software includes one or more sets of executable instructions stored or otherwise tangibly embodied on a non-transitory computer readable storage medium. The software includes the instructions and certain data that, when executed by the one or more processors, manipulate the one or more processors to perform one or more aspects of the techniques described above. The non-transitory computer readable storage medium includes, for example, a magnetic or optical disk storage device, solid state storage devices such as Flash memory, a cache, random access memory (RAM), or other volatile or non-volatile memory device or devices, and the like. The executable instructions stored on the non-transitory computer readable storage medium are in source code, assembly language code, object code, or another instruction format that is interpreted or otherwise executable by one or more processors.
Note that not all of the activities or elements described above in the general description are required, that a portion of a specific activity or device may not be required, and that one or more further activities may be performed, or elements included, in addition to those described. Still further, the order in which activities are listed is not necessarily the order in which they are performed. Also, the concepts have been described with reference to specific embodiments. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present disclosure as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present disclosure.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any feature(s) that cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature of any or all the claims. Moreover, the particular embodiments disclosed above are illustrative only, as the disclosed subject matter may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. No limitations are intended to the details of construction or design herein shown, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified and all such variations are considered within the scope of the disclosed subject matter. Accordingly, the protection sought herein is as set forth in the claims below.