The invention relates to artificial neural networks. More specifically, the invention relates to predicting the sign of an activation function in an artificial neural network.
Artificial neural networks, such as convolutional neural networks (CNNs), are utilized for many tasks. Among those tasks is learning to accurately make predictions. For example, a CNN can receive a large amount of image data and learn, through machine learning (ML), to classify content in images.
The figures are not to scale. Instead, the thickness of the layers or regions may be enlarged in the drawings. In general, the same reference numbers will be used throughout the drawing(s) and accompanying written description to refer to the same or like parts.
Unless specifically stated otherwise, descriptors such as “first,” “second,” “third,” etc., are used herein without imputing or otherwise indicating any meaning of priority, physical order, arrangement in a list, and/or ordering in any way, but are merely used as labels and/or arbitrary names to distinguish elements for ease of understanding the disclosed examples. In some examples, the descriptor “first” may be used to refer to an element in the detailed description, while the same element may be referred to in a claim with a different descriptor such as “second” or “third.” In such instances, it should be understood that such descriptors are used merely for identifying those elements distinctly that might, for example, otherwise share a same name.
As used herein, the phrase “in communication,” including variations thereof, encompasses direct communication and/or indirect communication through one or more intermediary components, and does not require direct physical (e.g., wired) communication and/or constant communication, but rather additionally includes selective communication at periodic intervals, scheduled intervals, aperiodic intervals, and/or one-time events. As used herein, “processor circuitry” is defined to include (i) one or more special purpose electrical circuits structured to perform specific operation(s) and including one or more semiconductor-based logic devices (e.g., electrical hardware implemented by one or more transistors), and/or (ii) one or more general purpose semiconductor-based electrical circuits programmed with instructions to perform specific operations and including one or more semiconductor-based logic devices (e.g., electrical hardware implemented by one or more transistors). Examples of processor circuitry include programmed microprocessors, Field Programmable Gate Arrays (FPGAs) that may instantiate instructions, Central Processor Units (CPUs), Graphics Processor Units (GPUs), Digital Signal Processors (DSPs), XPUs, or microcontrollers and integrated circuits such as Application Specific Integrated Circuits (ASICs). For example, an XPU may be implemented by a heterogeneous computing system including multiple types of processor circuitry (e.g., one or more FPGAs, one or more CPUs, one or more GPUs, one or more DSPs, etc., and/or a combination thereof) and application programming interface(s) (API(s)) that may assign computing task(s) to whichever one(s) of the multiple types of the processing circuitry is/are best suited to execute the computing task(s).
Artificial neural networks, such as convolutional neural networks (CNNs), are utilized for many tasks. Among those tasks is learning to accurately make predictions. For example, a CNN can receive a large amount of image data and learn, through machine learning (ML), to classify content in images. In a CNN, the processes of image recognition and image classification commonly utilize a rectified linear unit (ReLU) as an activation function in practice. For a given node (also referred to as a layer) in a CNN, when fitting input data for recognition or classification, the node calculates the convolution of the input data with weight and bias parameter values and applies the ReLU activation function to the result. Whether these values are floating point, fixed point, or integer based, there is an overhead associated with such calculations. In a complex neural network that has a large number of nodes, the overhead increases accordingly. Some of this overhead is wasted because any convolution result that is negative is clamped to zero by the ReLU and never contributes to the CNN's output.
In some examples, input data, weight data, and bias data utilized in a CNN are in a 32-bit floating point (FP32) data type format. The FP32 data type format includes a sign bit (bit [31]), a set of exponent bits (bits [30:23]), and a set of mantissa bits (bits [22:0]). In other examples, one or more other data types may be utilized, such as fixed point or 8-bit integer data types, among others. The examples described below will largely be utilizing FP32, but any one or more other data types might be utilized in practice (e.g., double precision floating point (FP64), 8-bit integer, 16-bit integer, 32-bit integer, 64-bit integer, etc.). See
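For purposes of illustration only, the following Python sketch unpacks a value into the FP32 fields noted above; the function name and test value are arbitrary and are not part of the described examples.

```python
import struct

def fp32_fields(value):
    """Split a Python float (stored as FP32) into its sign, exponent, and mantissa bit fields."""
    bits = struct.unpack('<I', struct.pack('<f', value))[0]
    sign = (bits >> 31) & 0x1          # bit  [31]
    exponent = (bits >> 23) & 0xFF     # bits [30:23], biased by 127
    mantissa = bits & 0x7FFFFF         # bits [22:0]
    return sign, exponent, mantissa

print(fp32_fields(-6.5))               # (1, 129, 5242880): -1.625 x 2^(129 - 127)
```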
Typical CNNs utilize an activation function per node to map the input data to a series of weights and biases for image training and/or classification purposes. One of the most common activation functions in practice is the ReLU activation function. The examples described below will largely be utilizing the ReLU function for ease of explanation. In other examples, other activation functions that have similar behaviors to the ReLU function may be implemented in addition to or in place of the ReLU function (e.g., the leaky ReLU function) in some or all of the CNN nodes that use an activation function.
In some examples, the ReLU function consumes the output of a convolution layer in a CNN. The ReLU function clamps all the negative output values to zero (i.e., all the operations performed during the convolution layer that result in negative values are neutralized/discarded). Although the ReLU function is efficient from a storage perspective because calculated convolution values with negative results are thrown out, there are still inefficiencies. For example, since the ReLU function throws out negative value results, significant volumes of convolution calculations end up never being used.
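To make the wasted-work point concrete, the short sketch below applies a ReLU to a handful of hypothetical convolution outputs and counts how many fully calculated results are simply clamped to zero; the values shown are invented for illustration.

```python
# Hypothetical convolution outputs for one node (values invented for illustration).
conv_outputs = [0.8, -1.2, 0.05, -0.4, 2.3, -0.01, -3.7, 1.1]

relu_outputs = [max(0.0, v) for v in conv_outputs]   # ReLU clamps negatives to zero
discarded = sum(1 for v in conv_outputs if v < 0)

print(relu_outputs)
print(f"{discarded} of {len(conv_outputs)} fully calculated results were discarded by the ReLU")
```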
If the result of each convolution calculation were able to be accurately predicted, the processing circuitry calculating the convolutions could be instructed to ignore calculations that end up as negative values. Thus, one purpose of predicting a sign (i.e., positive or negative) of a convolution result is to allow the hardware accelerator(s) performing the calculations to discontinue further calculations on input values that will have a negative ReLU result.
The hardware accelerator(s) process image data (and/or other data) layer by layer through the CNN in a tiled fashion. A tile is herein defined as a group of elements, each of which is a portion of the tile. For example, data from an image may be segmented into a series of 4×4 blocks of pixels, which also may be referred to as a 4×4 tile of (pixel) data elements. In some examples, each element is a base input data building block with which larger structures may be grouped, such as tiles. In some examples, hardware accelerators process data through a CNN in a tiled manner because each element in the tile is not dependent upon any calculated results of the other elements.
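As an illustration of the tiling described above, the following sketch (assuming NumPy and a 4x4 tile size purely for demonstration) segments a toy 2-D array into tiles of independent elements.

```python
import numpy as np

def tile_4x4(image):
    """Segment a 2-D array into 4x4 tiles (assumes both dimensions are multiples of 4)."""
    h, w = image.shape
    return [image[r:r + 4, c:c + 4] for r in range(0, h, 4) for c in range(0, w, 4)]

image = np.arange(8 * 8).reshape(8, 8)   # toy 8x8 "image" of pixel values
tiles = tile_4x4(image)
print(len(tiles), tiles[0].shape)        # 4 tiles, each a 4x4 block of independent elements
```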
In the illustrated example in
In some examples, circuitry comprising tile processing logic encapsulated in box 118 of
In the illustrated example in
In some examples, for tile based FP32 operations at the nodes of a CNN, the output of each convolution node can be predicted by performing a partial FP32 calculation instead of performing a full FP32 calculation. More specifically, for a given example node that performs a ReLU function (or another activation function similar to ReLU), a partial FP32 calculation on the input data and the weight data in certain circumstances can lead to an accurate prediction of the sign (i.e., positive or negative) of the result. For a function like ReLU, predicting the sign of the result can lead to a more efficient flow of calculations of the tile of input data because all predicted negative results allow for discontinuing any remaining FP32 calculations.
For FP32 data type calculations, each example input data value and weight data value can be divided into two distinct groups/segments of bits (e.g., two subsets of the 32-bit total). In some examples, a first group includes sign bit (600 in
In some examples, the size of a tile of the input data may be utilized to help determine an efficient division between the mantissa bits that make up the upper mantissa bits and the mantissa bits that make up the lower mantissa bits. An example mathematical proof to determine an efficient division of mantissa bits is described below following the description of
While the examples described largely utilize a mantissa separated into two sections (an upper mantissa and a lower mantissa), it should be appreciated that in other examples the mantissa could be split into additional sections, such as in three sections (a lower mantissa, a middle mantissa section, and an upper mantissa section) or more.
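One way to picture the rearrangement into bit groups is the sketch below, which separates an FP32 word into the four groups discussed above. The 4-bit upper / 19-bit lower mantissa division is an assumption carried through these illustrations, not a requirement, and the sketch models only the two-section split.

```python
import struct

UPPER_MANTISSA_BITS = 4   # assumed split: 4 upper / 19 lower mantissa bits

def fp32_groups(value):
    """Split an FP32 word into the sign, exponent, upper-mantissa, and lower-mantissa groups."""
    bits = struct.unpack('<I', struct.pack('<f', value))[0]
    sign = (bits >> 31) & 0x1
    exponent = (bits >> 23) & 0xFF
    upper_mantissa = (bits >> (23 - UPPER_MANTISSA_BITS)) & ((1 << UPPER_MANTISSA_BITS) - 1)
    lower_mantissa = bits & ((1 << (23 - UPPER_MANTISSA_BITS)) - 1)
    return sign, exponent, upper_mantissa, lower_mantissa

print(fp32_groups(-6.5))   # (1, 129, 10, 0) with the assumed 4/19 split
```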
In the illustrated example in
In some examples, the preprocessor circuitry 102A-102C calculates a partial convolution of the data using the first subset of FP32 bits for each of the input data elements and weight data elements at a given node. More specifically, in some examples, the following preprocessing operations are performed on the first subset of FP32 bits of the input data and the weight data by preprocessor circuitry 102A-102C:
1) XOR of sign bit
2) Perform multiplication on exponent bits (i.e., addition of exponents)
3) Perform multiplication on upper mantissa bits
Performing this set of operations on the first group of bits is herein referred to as calculating a partial convolution value (using the input data and weight data to do so). The value is a partial convolution because only a subset of FP32 bits that make up an input value and a weight value are used. Thus, in some examples, using the sign bit, the 8-bit exponent, and a 4-bit upper mantissa (bits [31:19]) from each of the input data and weight data values, the preprocessor circuitry 102A-102C calculates the partial convolution value. The result of the calculation will produce a value that can be positive or negative (or zero), herein referred to as the predicted sign. In some examples, the preprocessor circuitry 102A-102C can then send the predicted sign to control and decode circuitry 106.
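A minimal software model of this partial convolution is sketched below. It assumes a 4-bit upper mantissa and approximates the hardware behavior by truncating each FP32 operand to its first bit group before multiplying, which mirrors XOR-ing the sign bits, adding the exponents, and multiplying the upper mantissas; the function names and the treatment of the bias term are assumptions for illustration.

```python
import struct

def keep_first_group(value, n_upper=4):
    """Model the first bit group: keep the sign, exponent, and n_upper most significant
    mantissa bits of an FP32 value; zero out the remaining lower mantissa bits."""
    bits = struct.unpack('<I', struct.pack('<f', value))[0]
    bits &= 0xFFFFFFFF ^ ((1 << (23 - n_upper)) - 1)
    return struct.unpack('<f', struct.pack('<I', bits))[0]

def predict_sign(inputs, weights, bias=0.0):
    """Partial convolution of one element using only the first bit group of each operand.
    Multiplying the truncated operands mirrors XOR-ing the signs, adding the exponents,
    and multiplying the upper mantissas."""
    partial = sum(keep_first_group(i) * keep_first_group(w)
                  for i, w in zip(inputs, weights)) + bias
    return ('negative' if partial < 0 else 'non-negative'), partial

inputs = [0.9, -1.3, 0.2, -0.7]
weights = [0.5, 0.8, -0.1, 0.4]
print(predict_sign(inputs, weights))   # predicted sign of the ReLU input for this element
```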
In some example versions of a ReLU activation function or another similar function, the convolution data results are utilized for subsequent nodes in the CNN only if the result for a given node is positive. In other example versions of a ReLU or similar activation function, a zero result may default to a utilized result, thus in those versions the CNN nodes send the convolution results to subsequent nodes as long as the results are non-negative. Either version can be utilized for this process, but for simplicity the examples will focus around a non-negative convolution result being utilized.
In some examples, the predicted sign (also herein referred to as a sign indicator) may be a flag register, a designated bit in a hardware or software register, a communication packet, or any other type of signal meant to communicate a piece of information (e.g., information designating that the calculated partial convolution value is positive or negative). The sign information is referred to as “predicted” instead of known because the reduced number of mantissa bits utilized in the calculation introduces a certain amount of variability/error vs. the true/ideal value calculation utilizing all FP32 bits.
In some examples, the control and decode circuitry 106 (also referred to herein as the control 106) has logic that controls the flow of much of the system illustrated in
In the illustrated example in
In some examples, the control 106 includes logic to fetch at least input data and weight data from the higher level memory circuitry 110. As described above, in some examples, the input data and weight data that is fetched is in the FP32 format. Once the input data and weight data have been fetched, they can be stored into the L1 memory circuitry 108. In some examples, the control 106 performs and/or triggers a process to rearrange the FP32 data format into the portions that will be operated on independently. The control 106 then stores/loads the example rearranged data in L1 memory circuitry 108.
In the illustrated example in
Returning to the illustrated example in
In some examples, the control 106 loads the IBC 112 and the KWBC 114 with input data and weight data, respectively, retrieved from the L1 memory circuitry 108. In some examples, the control 106 initially loads a subset of input data and weight data associated with the sign bit, the exponent bits, and the upper mantissa bits into the IBC 112 and the KWBC 114, respectively (e.g., the first three groupings of bits associated with the rearranged FP32 input data). In some examples, during a single data load into the IBC 112 and the KWBC 114, the amount of data loaded includes the three groupings of bits associated with all the elements of a tile of data. In other examples, during a single data load into the IBC 112 and the KWBC 114, the amount of data loaded includes the three groupings of bits associated with a single element of a tile. In yet other examples, during a single data load into the IBC 112 and the KWBC 114, the amount of data loaded includes the three groupings of bits associated with more than one tile, which may be up to and including loading all tiles of an image.
In some examples, the weight buffer information may not need to be updated once the CNN is trained. Thus, in some examples, the weight data for all four groupings of bits associated with the FP32 rearranged data is loaded once into the KWBC 114 at the beginning of the process for a tile and may be utilized across a series of partial convolution calculations involving multiple input data elements across one or more tiles (e.g., potentially for an entire image of input data calculations).
In the illustrated example of
In some examples, the control 106 includes logic that can receive indicators of certain conditions and act on those conditions (e.g., the control 106 can trigger processes to occur in other logic blocks in
In the illustrated example in
In some examples, the preprocessor circuitries 102A-102C store the partial convolution result value in a data distribution circuitry (DDC) 116. In some examples, the partial convolution result value is stored in the DDC 116 only if the predicted sign is determined to be non-negative. In some examples, the DDC 116 is a portion of a memory in the system in
Using the ReLU activation function as the example, if the predicted sign indicator (determined/calculated by the preprocessor circuitries 102A-102C and sent to the control 106) is non-negative, then the control 106 performs one or more resulting functions. In some examples, the control 106 will trigger (e.g., cause through some form of indicator/communication) one or more of the remainder processing circuitries 104A-104C to calculate the remaining portion of the convolution value using the remaining bits of the input data and weight data that were not used by the one or more preprocessor circuitries 102A-102C. For example, if the preprocessor circuitries 102A-102C calculated the partial convolution value from the sign bit, the 8-bit exponent, and a 4-bit upper mantissa (e.g., the most significant 13 bits total of the original FP32 operand), then the remainder processing circuitries 104A-104C calculate the convolution value of the 19-bit lower mantissa.
The example remainder processing circuitries 104A-104C combine the result of the 19-bit lower mantissa calculation with the partial convolution result of the most significant 13 bits stored in the DDC 116 to create a full convolution value. In the illustrated example in
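The recombination can be illustrated in software as follows: splitting each operand's mantissa into upper and lower parts, the full product is the upper-times-upper partial term (the preprocessor's contribution) plus the cross terms and the lower-times-lower term (the remainder contribution), scaled by the combined sign and exponent. The 4/19 split and helper names below are assumptions for illustration, and the sketch handles normal, nonzero FP32 values only.

```python
import struct

N_UPPER = 4   # assumed upper-mantissa width

def split_operand(value):
    """Return (sign, unbiased exponent, upper mantissa, lower mantissa) such that
    value == (-1)**sign * 2**exp * (upper + lower). Normal, nonzero values only."""
    bits = struct.unpack('<I', struct.pack('<f', value))[0]
    sign = (bits >> 31) & 0x1
    exp = ((bits >> 23) & 0xFF) - 127
    mant = bits & 0x7FFFFF
    upper = 1.0 + (mant >> (23 - N_UPPER)) * 2.0 ** -N_UPPER   # implicit leading 1 lives here
    lower = (mant & ((1 << (23 - N_UPPER)) - 1)) * 2.0 ** -23
    return sign, exp, upper, lower

def recombined_product(i, w):
    si, ei, ui, li = split_operand(i)
    sw, ew, uw, lw = split_operand(w)
    sign = -1.0 if si ^ sw else 1.0
    partial = ui * uw                        # contribution of the preprocessor circuitry
    remainder = ui * lw + li * uw + li * lw  # contribution of the remainder circuitry
    return sign * 2.0 ** (ei + ew) * (partial + remainder)

x, w = 0.375, -2.8125                        # exactly representable FP32 values
print(recombined_product(x, w), x * w)       # both print -1.0546875
```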
In some examples, if the predicted sign of the partial convolution value calculated by the preprocessor circuitries 102A-102C is negative, then the control 106 does not trigger a further calculation by the remainder processing circuitries 104A-104C and the partial convolution value is discarded from further use. In some examples, the negative predicted sign partial convolution value is not stored in the DDC 116. In other examples, the negative predicted sign partial convolution value is stored in the DDC 116, but upon determining the sign is negative, the control 106 flags the partial convolution value as invalid and the data can then subsequently be overwritten.
In some examples, the triggering process takes place on an entire tile of input data at the same time, across a group of remainder processing circuitries 104A-104C. In other examples, the triggering process can take place separately per element (i.e., per remainder processing circuitry). In some examples, for ReLU or similar activation functions, remainder processing circuitries 104A-104C that do not receive triggers will not calculate the lower mantissa bits of a given convolution, thus saving processing cycles.
A more detailed set of possible example implementations of the circuitry logic blocks shown in
While an example manner of implementing the apparatus that predicts signs for the ReLU activation function with partial data is illustrated in
A flowchart representative of example hardware logic circuitry, machine readable instructions, hardware implemented state machines, and/or any combination thereof for implementing the apparatus and system of
The machine readable instructions described herein may be stored in one or more of a compressed format, an encrypted format, a fragmented format, a compiled format, an executable format, a packaged format, etc. Machine readable instructions as described herein may be stored as data or a data structure (e.g., as portions of instructions, code, representations of code, etc.) that may be utilized to create, manufacture, and/or produce machine executable instructions. For example, the machine readable instructions may be fragmented and stored on one or more storage devices and/or computing devices (e.g., servers) located at the same or different locations of a network or collection of networks (e.g., in the cloud, in edge devices, etc.). The machine readable instructions may require one or more of installation, modification, adaptation, updating, combining, supplementing, configuring, decryption, decompression, unpacking, distribution, reassignment, compilation, etc., in order to make them directly readable, interpretable, and/or executable by a computing device and/or other machine. For example, the machine readable instructions may be stored in multiple parts, which are individually compressed, encrypted, and/or stored on separate computing devices, wherein the parts when decrypted, decompressed, and/or combined form a set of machine executable instructions that implement one or more operations that may together form a program such as that described herein.
In another example, the machine readable instructions may be stored in a state in which they may be read by processor circuitry, but require addition of a library (e.g., a dynamic link library (DLL)), a software development kit (SDK), an application programming interface (API), etc., in order to execute the machine readable instructions on a particular computing device or other device. In another example, the machine readable instructions may need to be configured (e.g., settings stored, data input, network addresses recorded, etc.) before the machine readable instructions and/or the corresponding program(s) can be executed in whole or in part. Thus, machine readable media, as used herein, may include machine readable instructions and/or program(s) regardless of the particular format or state of the machine readable instructions and/or program(s) when stored or otherwise at rest or in transit.
The machine readable instructions described herein can be represented by any past, present, or future instruction language, scripting language, programming language, etc. For example, the machine readable instructions may be represented using any of the following languages: C, C++, Java, C#, Perl, Python, JavaScript, HyperText Markup Language (HTML), Structured Query Language (SQL), Swift, etc.
As mentioned above, the example operations of
“Including” and “comprising” (and all forms and tenses thereof) are used herein to be open ended terms. Thus, whenever a claim employs any form of “include” or “comprise” (e.g., comprises, includes, comprising, including, having, etc.) as a preamble or within a claim recitation of any kind, it is to be understood that additional elements, terms, etc., may be present without falling outside the scope of the corresponding claim or recitation. As used herein, when the phrase “at least” is used as the transition term in, for example, a preamble of a claim, it is open-ended in the same manner as the term “comprising” and “including” are open ended. The term “and/or” when used, for example, in a form such as A, B, and/or C refers to any combination or subset of A, B, C such as (1) A alone, (2) B alone, (3) C alone, (4) A with B, (5) A with C, (6) B with C, or (7) A with B and with C. As used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, or (3) at least one A and at least one B. Similarly, as used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, or (3) at least one A and at least one B. As used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, or (3) at least one A and at least one B. Similarly, as used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, or (3) at least one A and at least one B.
As used herein, singular references (e.g., “a”, “an”, “first”, “second”, etc.) do not exclude a plurality. The term “a” or “an” object, as used herein, refers to one or more of that object. The terms “a” (or “an”), “one or more”, and “at least one” are used interchangeably herein. Furthermore, although individually listed, a plurality of means, elements or method actions may be implemented by, e.g., the same entity or object. Additionally, although individual features may be included in different examples or claims, these may possibly be combined, and the inclusion in different examples or claims does not imply that a combination of features is not feasible and/or advantageous.
In the illustrated example of
The example process continues at block 302 with the control 106 populating the IBC 112 with a subset of the input data. In some examples, the data loaded has been rearranged into groups from an initial FP32 format. Thus, in some examples, the sign bit, the exponent bits, and a group of upper mantissa bits make up the subset of input data loaded into the IBC 112.
The example process continues at block 304 with the control 106 populating the KWBC 114 with a subset of the weight data. Similarly to the group of data loaded into the IBC 112 in block 302 above, in some examples, the sign bit, the exponent bits, and a group of upper mantissa bits make up the subset of weight data loaded into the KWBC 114.
The example process continues at block 306 when one or more of the preprocessor circuitries 102A-102C calculate a partial convolution value using at least a portion of the input data subset and the weight data subset. In some examples, the partial convolution calculation uses the entire subset of the sign bit, the exponent bits, and the upper mantissa bits. In other examples, an initial partial convolution calculation uses only the sign bit and the exponent bits to calculate a first partial convolution value. In some examples, it is possible to predict the sign of the partial convolution using only the values of the sign bit and the exponent bits of the input data and weight data. In these situations, the entirety of the FP32 mantissa (both upper and lower portions) is not significant enough to possibly change the predicted sign.
The example process continues at block 308 when one or more of the preprocessor circuitries 102A-102C predict the sign of the partial convolution value calculated in block 306. In some examples, if the predicted sign is negative, the sign cannot turn positive no matter what subset of additional less significant bits is utilized in subsequent calculations of the convolution value; thus, a negative result is known. In some examples, if the predicted sign is positive, the sign still may possibly turn negative once additional less significant bits are considered in subsequent calculations.
The example process continues at block 310 when one or more of the preprocessor circuitries 102A-102C send the predicted sign of the partial convolution value to the control 106. At this point the process flow of
In the illustrated example of
The example process continues at block 402 when one or more of the preprocessor circuitries 102A-102C perform an exponent addition with the sign and exponent bits from the input data populated in the memory and the corresponding sign and exponent bits of the weight data.
The example process continues at block 404 when one or more of the preprocessor circuitries 102A-102C checks the result of the exponent addition in block 402 for a predicted negative value of the partial convolution result for a ReLU activation function.
If the predicted result of the exponent addition is negative, then the example process continues at block 406 when one or more of the preprocessor circuitries 102A-102C sends the element negative flag to the control 106. The element negative flag received by the control 106 indicates that no more processing of the element will be done because the value input to the ReLU function will be negative; thus, the ReLU function discards the data.
If the predicted result of the exponent addition is non-negative, then the example process continues at block 408 when one or more of the preprocessor circuitries 102A-102C stores the partial compute data (e.g., a partial convolution value) into the memory (i.e., in response to the non-negative value). In some examples, the partially computed data is only stored into the memory when the predicted result determined in block 404 is a non-negative value. In other examples, the partially computed data is stored into the memory at a location in the process flow of the flowchart immediately above block 404. In these examples, the partially computed data from the exponent addition block 402 is stored into the memory regardless of the predicted sign.
The example process continues at block 410 when one or more of the preprocessor circuitries 102A-102C perform a mantissa multiplication with one or more of the upper mantissa bits (e.g., one or more of the most significant mantissa bits) of the input data populated in the memory and the same relevant bits for the weight data.
The example process continues at block 412 when one or more of the preprocessor circuitries 102A-102C checks the result of the upper mantissa multiplication for a predicted negative value of the partial convolution result for a ReLU activation function. In some examples, the preprocessor circuitries 102A-102C that check for a predicted negative value utilize the exponent addition result value(s) (stored in memory as partial compute data in block 408) with the upper mantissa multiplication result value(s) from block 410 to determine the new combined value (i.e., the partial convolution value of the input and weight sign bits, exponent bits, and upper mantissa bits).
If the predicted result of the upper mantissa multiplication is negative, then the example process continues at block 406 when one or more of the preprocessor circuitries 102A-102C sends the element negative flag to the control 106.
If the predicted result of the upper mantissa multiplication is non-negative, then the example process continues at block 414 when one or more of the preprocessor circuitries 102A-102C stores the partial compute data (i.e., the partial convolution value of the input and weight sign bits, exponent bits, and upper mantissa bits) into the memory.
The example process continues at block 416 when one or more of the remainder circuitries 104A-104C perform a mantissa multiplication with one or more of the lower mantissa bits (e.g., the remaining mantissa bits not utilized in the upper mantissa calculation from block 410) of the input data populated in the memory and the same relevant bits for the weight data. In some examples, the mantissa multiplication is performed in response to the control 106 causing one or more of the remainder circuitries 104A-104C to perform. In some examples, the control 106 triggers one or more of the remainder circuitries 104A-104C to calculate the mantissa for the remaining bits not utilized in the upper mantissa calculation (e.g., a remaining subset of bits not used to calculate the upper mantissa partial convolution result), where the control initiates the trigger in response to receiving a non-negative predicted result from one or more of the preprocessor circuitries 102A-102C.
The example process continues at block 418 when one or more of the preprocessor circuitries 102A-102C checks the result of the lower mantissa multiplication for a negative value of the whole convolution result for a ReLU activation function. In some examples, the preprocessor circuitries 102A-102C that check for the negative value utilize the exponent addition result value(s) (stored in memory as partial compute data in block 408) and the upper mantissa multiplication result value(s) (stored in memory as partial compute data in block 414) with the lower mantissa multiplication result value(s) from block 416 to determine the new combined value (i.e., the full convolution value of the input and weight sign bits, exponent bits, upper mantissa bits, and lower mantissa bits). At this point, the sign is no longer a prediction because all 32 bits of the original FP32 format data are utilized in the calculation. Therefore, the sign of the actual convolution result can be determined.
If the result of the lower mantissa multiplication is negative, then the example process continues at block 406 when one or more of the preprocessor circuitries 102A-102C sends the element negative flag to the control 106.
If the result of the lower mantissa multiplication is non-negative, then the example process continues at block 420 when one or more of the preprocessor circuitries 102A-102C store the full compute data (i.e., the full convolution value of the input and weight sign bits, exponent bits, upper mantissa bits, and lower mantissa bits) into the memory.
Returning to block 406 in the example process, once the element negative flag is sent to the control 106, then the example process continues at block 422 when the control 106 checks whether all elements have been processed in the input data tile. If all elements in the tile have been processed, then the example process is finished.
If there are still additional elements to be processed in the input data tile, then the control 106 triggers one or more of the processing element array circuitries (100A-100C), and, more specifically, one or more of the preprocessor circuitries 102A-102C, to begin processing next element(s) in the input data tile and the process repeats.
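A simplified, software-only model of the per-element flow just described is sketched below. The three stages (exponent only, upper mantissa, full mantissa), the 4-bit upper mantissa, and the early exit on a predicted negative follow the flowchart description, with the caveat that an early exit at the first two stages reflects a prediction rather than a guaranteed result; the function names are illustrative assumptions.

```python
import struct

def truncate(value, n_mant_bits):
    """Keep only the n most significant mantissa bits of an FP32 value (0 <= n <= 23)."""
    bits = struct.unpack('<I', struct.pack('<f', value))[0]
    bits &= 0xFFFFFFFF ^ ((1 << (23 - n_mant_bits)) - 1)
    return struct.unpack('<f', struct.pack('<I', bits))[0]

def process_element(inputs, weights, bias=0.0, upper_bits=4):
    """Software model of the per-element flow: refine the convolution estimate in three
    stages and stop as soon as a negative result is predicted (blocks 402-420)."""
    # Stage 1 (blocks 402/404): sign and exponent bits only, no mantissa bits.
    stage1 = sum(truncate(i, 0) * truncate(w, 0) for i, w in zip(inputs, weights)) + bias
    if stage1 < 0:
        return 'element negative flag (after exponent stage)'
    # Stage 2 (blocks 410/412): include the upper mantissa bits.
    stage2 = sum(truncate(i, upper_bits) * truncate(w, upper_bits)
                 for i, w in zip(inputs, weights)) + bias
    if stage2 < 0:
        return 'element negative flag (after upper-mantissa stage)'
    # Stage 3 (blocks 416/418): full mantissa; the actual sign is now known.
    full = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 'element negative flag (full result)' if full < 0 else f'store full result {full}'

print(process_element([0.6, -0.9, 0.3], [0.2, 1.1, -0.4]))
```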
The example preprocessor circuitries 102A-102C perform the exponent addition at block 402 in
In some examples, when performing block 408 of the flowchart in
The example preprocessor circuitries 102A-102C perform the upper mantissa multiplication at block 410 in
In some examples, when performing block 410 of the flowchart in
The example preprocessor circuitries 102A-102C perform the lower mantissa multiplication at block 416 in
In some examples, when performing block 416 of the flowchart in
The mantissa bits that are used to predict a ReLU activation function result begin with the most significant bits of the mantissa value (i.e., the upper bits; the upper mantissa value). The mantissa bits that are not used for partial convolution value prediction include a series of consecutive mantissa bits from the least significant bit (bit [0]) up to the bit immediately below the least significant bit of the upper mantissa value. In some examples, the prediction of the ReLU activation function result utilizes the sign value 600, the exponent value 602, and the upper mantissa value 604. Removing the lower mantissa value from a calculation reduces the precision of the result.
Consider examining a 32-bit value. In an example first examination of the value, all 32 bits are visible/available, therefore predicting the value is not necessary because the entire value is known (i.e., an ideal calculation using all mantissa bits). In an example second examination of the value, the most significant 13 bits of the value are visible (i.e., the least significant 19 bits are not visible leading to a reduced precision of the value). The reduced precision of the value may include an error of up to the maximum size of the not visible least significant bits.
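A brief numeric illustration of this bound, using an arbitrary 32-bit pattern and the 13-visible/19-hidden split described above:

```python
value = 0b1011_0110_1101_0111_0010_1010_1100_0101   # arbitrary 32-bit pattern
visible = value & ~((1 << 19) - 1)                  # keep bits [31:19], hide bits [18:0]
max_error = (1 << 19) - 1                           # largest value the 19 hidden bits can hold

print(value - visible, '<=', max_error)             # the hidden bits bound the uncertainty
```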
Returning to calculating a partial sum of a convolution, the error corresponds to a region of interest where there may be a discrepancy between a calculated ideal partial sum value of the convolution (using all mantissa bits in the calculation) and a calculated partial sum value of the convolution using a reduced number of mantissa bits. In some examples, the partial sum that utilizes the reduced number of mantissa bits may have a different sign than the ideal partial sum. In some examples, the absolute value of the actual mantissa will be greater than or equal to the absolute value of the predicted mantissa.
As shown in
In some examples, performing convolution using a reduced number of mantissa bits can produce an erroneous ReLU prediction only because of the mantissa bits omitted from positive elements. The omission on negative elements only works in favor of the ReLU clamping decision and hence does not contribute to the final error.
In some examples, it can be determined mathematically that a subset of the entire input data of the FP32 data type can be utilized to sufficiently predict negative values for convolutional matrix multiplications involving input data and weights. Thus, not all 32 bits of FP32 data are needed to accurately predict negative results. Below is a series of mathematical proofs that show some examples of the region of interest, the maximum possible error in prediction, and conditions to be checked to qualify the predictions. When those conditions are met, in some examples, a significant reduction in the number of bits utilized to accurately predict the sign of a partial convolution value is achievable.
For the following description, let:
This can also be represented as,
X_S^Ideal = X_S + X^Ideal   (Equation 1)
X_S^Reduced = X_S + X^Reduced   (Equation 2)
In some examples, reducing the number of mantissa bits in a floating-point number results in the number having a lower absolute magnitude. However, the sign remains unaffected as the sign bit is unchanged. Hence, if
X^Ideal < 0
⇒ X^Reduced > X^Ideal
⇒ X_S + X^Reduced > X_S + X^Ideal
In some examples, Equations 1 and 2 show that
X_S^Reduced > X_S^Ideal   (Equation 3)
In some examples, Equation 3 shows that if X_S^Reduced < 0, then X_S^Ideal < 0. An error due to the addition of a negative value cannot alter the sign of the sum from positive to negative. Therefore,
if X^Ideal > 0,
then X^Reduced < X^Ideal
then X_S + X^Reduced < X_S + X^Ideal
Again, in some examples, Equations 1 and 2 show that
X_S^Reduced < X_S^Ideal   (Equation 4)
In some examples, for Equation 4, X_S^Reduced < 0 does not guarantee X_S^Ideal < 0. Thus, errors due to the addition of positive values will contribute towards a possible sign change from positive to negative. These errors can be utilized to determine a threshold value to compare against to conclude that the convolution sum is negative when calculating a partial convolution value using a reduced number of mantissa bits.
In some examples, if a positive term in the convolution sum is given by CMut=2E
In some examples, for any floating-point number given by
N = (−1)^S × 2^E × M
where S, E, M represent the sign, unbiased exponent and mantissa value, the maximum possible error when only n mantissa bits are included is given by
E_Max = −2^(E−n) × (−1)^S   (Equation 5)
Consider an activation input (I) and weight (W) of a convolution layer. They are represented as
I = (−1)^(S_I) × 2^(E_I) × M_I   (Equation 6)
W = (−1)^(S_W) × 2^(E_W) × M_W   (Equation 7)
From Equation 5, in some examples, the most erroneous values that could result from reducing the number of mantissa bits to n in I (from Equation 6) and W (from Equation 7) are given by
I_Reduced = (−1)^(S_I) × 2^(E_I) × (M_I − 2^(−n))   (Equation 8)
W_Reduced = (−1)^(S_W) × 2^(E_W) × (M_W − 2^(−n))   (Equation 9)
In some examples, the convolution term, when I (from Equation 6) and W (from Equation 7) are multiplied, is given by
C_Ideal = (−1)^(S_I + S_W) × 2^(E_I + E_W) × M_I × M_W   (Equation 10)
In some examples, with reduced mantissa in the convolution step, (Equation 8) and (Equation 9) give
C_Reduced = I_Reduced × W_Reduced = (−1)^(S_I + S_W) × 2^(E_I + E_W) × (M_I − 2^(−n)) × (M_W − 2^(−n))
Thus,
C_Reduced = 2^(E_I + E_W) × (M_I × M_W − 2^(−n) × (M_I + M_W) + 2^(−2n))   (Equation 11)
In some examples, the error in convolution terms due to reduced mantissa can be obtained from (Equation 10) and (Equation 11)
C_Error = C_Ideal − C_Reduced = 2^(E_I + E_W) × (2^(−n) × (M_I + M_W) − 2^(−2n))   (Equation 12)
In some examples, because 2^(−2n) is always positive,
C_Error ≤ 2^(E_I + E_W) × 2^(−n) × (M_I + M_W)
Since MI and MW represent the mantissa values,
1 ≤ M_I, M_W < 2
⇒ M_I + M_W ≤ 2 × M_I × M_W
Therefore, (Equation 12) can be rewritten as
C_Error ≤ 2^(E_I + E_W) × 2^(−n+1) × M_I × M_W
In some examples, (Equation 10) provides
C_Error ≤ 2^(−n+1) × C_Ideal   (Equation 13)
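As a purely illustrative numerical check of this bound, the sketch below (with n = 4 and randomly drawn positive operands, both arbitrary choices) truncates the mantissas, forms the products, and verifies the inequality of Equation 13; the helper names are assumptions.

```python
import random
import struct

def to_f32(x):
    """Round a Python float to its nearest FP32 value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def reduce_mantissa(value, n):
    """Zero out all but the n most significant mantissa bits of an FP32 value."""
    bits = struct.unpack('<I', struct.pack('<f', value))[0]
    bits &= 0xFFFFFFFF ^ ((1 << (23 - n)) - 1)
    return struct.unpack('<f', struct.pack('<I', bits))[0]

n = 4
random.seed(1)
for _ in range(100_000):
    i = to_f32(random.uniform(0.001, 100.0))     # positive term: S_I = S_W = 0
    w = to_f32(random.uniform(0.001, 100.0))
    c_ideal = i * w
    c_error = c_ideal - reduce_mantissa(i, n) * reduce_mantissa(w, n)
    assert c_error <= 2 ** (-n + 1) * c_ideal    # Equation 13
print('Equation 13 held for all sampled positive terms')
```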
In some examples, Theorem 1 illustrates that only positive terms will contribute to errors that can contribute to incorrectly identifying a negative value. Hence, S_I + S_W = 0 (either both I and W are positive or both are negative).
In (Equation 10), C_Ideal can be rewritten as
C_Ideal = 2^(E_Mul) × M_Mul   (Equation 14)
where E_Mul = E_I + E_W and M_Mul = M_I × M_W.
Thus, in some examples, the maximum error in a positive term in the convolution sum is
C_ErrMax = 2^(−n+1) × 2^(E_Mul) × M_Mul   (Equation 15)
In some examples, the convolution sum before the ReLU activation layer is given by C_Tot = (−1)^(S_Tot) × 2^(E_Tot) × M_Tot (Equation 16).
In some examples, the sum of all product terms in the convolution is given by
In some examples, from (Equation 15), the maximum error due to positive terms in the convolution is given by C_ErrMax_i = 2^(−n+1) × 2^(E_Mul_i) × M_Mul_i.
In some examples, unlike other terms in the convolution sum, the bias does not involve multiplication of reduced mantissa numbers. Thus, the maximum error for bias values will be lower. However, in some examples, the same error is considered (as an upper bound) to simplify calculations.
In some examples, the sum of positive terms (including bias) in the convolution sum is represented as
In some examples, using (Equation 18), the total error in (Equation 17) can be rewritten as,
C_ErrTot = 2^(−n+1) × C_Pos   (Equation 19)
In some examples, to conclude that a convolution sum is zero/negative, the following two conditions should hold:
|C_Tot| ≥ |C_ErrTot|   (Equation 20)
S_Tot = 1   (Equation 21)
In some examples, (Equation 20) can be expanded using (Equation 16) and (Equation 18) to give
2^(E_Tot) × M_Tot ≥ 2^(E_Pos − n + 1) × M_Pos   (Equation 22)
In some examples, note that if E_Tot = E_Pos − n + 1, then the condition M_Tot ≥ M_Pos must hold (as the total convolution sum (C_Tot) must be greater than or equal to the sum of positive convolution terms and bias (C_Pos)).
Thus, in some examples, (Equation 22) now becomes
E_Tot ≥ E_Pos − n + 1   (Equation 23)
⇒ E_Tot > E_Pos − n   (Equation 24)
Therefore, from (Equation 21) and (Equation 24), in some examples, it holds that a convolution sum computed using reduced mantissa bits is negative (and the ReLU output is zero) if S_Tot = 1, M_Tot ≥ M_Pos, and E_Tot > E_Pos − n.
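The derived conditions can likewise be exercised numerically. The sketch below is an illustrative check only: it assumes n = 4, small random dot products, and, for simplicity, computes C_Pos from the full-precision positive terms plus any positive bias (assumptions not dictated by the text), then confirms that results flagged negative under the conditions are not, in fact, positive.

```python
import math
import random
import struct

N = 4   # assumed number of upper mantissa bits used in the reduced calculation

def to_f32(x):
    return struct.unpack('<f', struct.pack('<f', x))[0]

def reduce_mantissa(value, n=N):
    bits = struct.unpack('<I', struct.pack('<f', value))[0]
    bits &= 0xFFFFFFFF ^ ((1 << (23 - n)) - 1)
    return struct.unpack('<f', struct.pack('<I', bits))[0]

def s_e_m(x):
    """Sign bit, unbiased exponent E, and mantissa M in [1, 2) with x = (-1)^S * 2^E * M."""
    m, e = math.frexp(abs(x))                    # abs(x) = m * 2**e with m in [0.5, 1)
    return (1 if x < 0 else 0), e - 1, 2 * m

random.seed(2)
false_negatives = 0
for _ in range(50_000):
    inputs = [to_f32(random.uniform(-1, 1)) for _ in range(9)]
    weights = [to_f32(random.uniform(-1, 1)) for _ in range(9)]
    bias = to_f32(random.uniform(-1, 1))

    ideal_terms = [i * w for i, w in zip(inputs, weights)]
    c_ideal = sum(ideal_terms) + bias                                 # full-precision sum
    c_tot = sum(reduce_mantissa(i) * reduce_mantissa(w)
                for i, w in zip(inputs, weights)) + bias              # reduced-mantissa sum
    c_pos = sum(t for t in ideal_terms if t > 0) + max(bias, 0.0)     # positive terms + bias

    s_tot, e_tot, m_tot = s_e_m(c_tot)
    _, e_pos, m_pos = s_e_m(c_pos) if c_pos > 0 else (0, -200, 1.0)

    if s_tot == 1 and m_tot >= m_pos and e_tot > e_pos - N:           # derived conditions
        if c_ideal > 0:
            false_negatives += 1

print('false negatives:', false_negatives)   # expected: 0 in this experiment
```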
The processor platform 700 of the illustrated example includes processor circuitry 712. The processor circuitry 712 of the illustrated example is hardware. For example, the processor circuitry 712 can be implemented by one or more integrated circuits, logic circuits, FPGAs, microprocessors, CPUs, GPUs, DSPs, and/or microcontrollers from any desired family or manufacturer. The processor circuitry 712 may be implemented by one or more semiconductor based (e.g., silicon based) devices. In this example, the processor circuitry 712 implements the example processing element array circuitries 100A-100C (including the example preprocessor circuitries 102A-102C and the example remainder processing circuitries 104A-104C), the example control 106 circuitry, the example L1 memory circuitry 108, the example higher level memory circuitry 110, the example IBC 112, the example KWBC 114, and/or the example DDC 116. In some examples, tile processing logic 118 and the circuitry within (shown in greater detail in
The processor circuitry 712 of the illustrated example includes a local memory 713 (e.g., a cache, registers, etc.). The processor circuitry 712 of the illustrated example is in communication with a main memory including a volatile memory 714 and a non-volatile memory 716 by a bus 718. The volatile memory 714 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS® Dynamic Random Access Memory (RDRAM®), and/or any other type of RAM device. The non-volatile memory 716 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 714, 716 of the illustrated example is controlled by a memory controller 717.
The processor platform 700 of the illustrated example also includes interface circuitry 720. The interface circuitry 720 may be implemented by hardware in accordance with any type of interface standard, such as an Ethernet interface, a universal serial bus (USB) interface, a Bluetooth® interface, a near field communication (NFC) interface, a PCI interface, and/or a PCIe interface.
In the illustrated example, one or more input devices 722 are connected to the interface circuitry 720. The input device(s) 722 permit(s) a user to enter data and/or commands into the processor circuitry 712. The input device(s) 722 can be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, an isopoint device, and/or a voice recognition system.
One or more output devices 724 are also connected to the interface circuitry 720 of the illustrated example. The output devices 724 can be implemented, for example, by display devices (e.g., a light emitting diode (LED), an organic light emitting diode (OLED), a liquid crystal display (LCD), a cathode ray tube (CRT) display, an in-place switching (IPS) display, a touchscreen, etc.), a tactile output device, a printer, and/or speaker. The interface circuitry 720 of the illustrated example, thus, typically includes a graphics driver card, a graphics driver chip, and/or graphics processor circuitry such as a GPU.
The interface circuitry 720 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or a network interface to facilitate exchange of data with external machines (e.g., computing devices of any kind) by a network 726. The communication can be by, for example, an Ethernet connection, a digital subscriber line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-site wireless system, a cellular telephone system, an optical connection, etc.
The processor platform 700 of the illustrated example also includes one or more mass storage devices 728 to store software and/or data. Examples of such mass storage devices 728 include magnetic storage devices, optical storage devices, floppy disk drives, HDDs, CDs, Blu-ray disk drives, redundant array of independent disks (RAID) systems, solid state storage devices such as flash memory devices, and DVD drives.
The machine executable instructions 732, which may be implemented by the machine readable instructions of
The cores 802 may communicate by an example bus 804. In some examples, the bus 804 may implement a communication bus to effectuate communication associated with one(s) of the cores 802. For example, the bus 804 may implement at least one of an Inter-Integrated Circuit (I2C) bus, a Serial Peripheral Interface (SPI) bus, a PCI bus, or a PCIe bus. Additionally or alternatively, the bus 804 may implement any other type of computing or electrical bus. The cores 802 may obtain data, instructions, and/or signals from one or more external devices by example interface circuitry 806. The cores 802 may output data, instructions, and/or signals to the one or more external devices by the interface circuitry 806. Although the cores 802 of this example include example local memory 820 (e.g., Level 1 (L1) cache that may be split into an L1 data cache and an L1 instruction cache), the microprocessor 800 also includes example shared memory 810 that may be shared by the cores (e.g., Level 2 (L2) cache) for high-speed access to data and/or instructions. Data and/or instructions may be transferred (e.g., shared) by writing to and/or reading from the shared memory 810. The local memory 820 of each of the cores 802 and the shared memory 810 may be part of a hierarchy of storage devices including multiple levels of cache memory and the main memory (e.g., the main memory 714, 716 of
Each core 802 may be referred to as a CPU, DSP, GPU, etc., or any other type of hardware circuitry. Each core 802 includes control unit circuitry 814, arithmetic and logic (AL) circuitry (sometimes referred to as an ALU) 816, a plurality of registers 818, the L1 cache 820, and an example bus 822. Other structures may be present. For example, each core 802 may include vector unit circuitry, single instruction multiple data (SIMD) unit circuitry, load/store unit (LSU) circuitry, branch/jump unit circuitry, floating-point unit (FPU) circuitry, etc. The control unit circuitry 814 includes semiconductor-based circuits structured to control (e.g., coordinate) data movement within the corresponding core 802. The AL circuitry 816 includes semiconductor-based circuits structured to perform one or more mathematic and/or logic operations on the data within the corresponding core 802. The AL circuitry 816 of some examples performs integer based operations. In other examples, the AL circuitry 816 also performs floating point operations. In yet other examples, the AL circuitry 816 may include first AL circuitry that performs integer based operations and second AL circuitry that performs floating point operations. In some examples, the AL circuitry 816 may be referred to as an Arithmetic Logic Unit (ALU). The registers 818 are semiconductor-based structures to store data and/or instructions such as results of one or more of the operations performed by the AL circuitry 816 of the corresponding core 802. For example, the registers 818 may include vector register(s), SIMD register(s), general purpose register(s), flag register(s), segment register(s), machine specific register(s), instruction pointer register(s), control register(s), debug register(s), memory management register(s), machine check register(s), etc. The registers 818 may be arranged in a bank as shown in
Each core 802 and/or, more generally, the microprocessor 800 may include additional and/or alternate structures to those shown and described above. For example, one or more clock circuits, one or more power supplies, one or more power gates, one or more cache home agents (CHAs), one or more converged/common mesh stops (CMSs), one or more shifters (e.g., barrel shifter(s)) and/or other circuitry may be present. The microprocessor 800 is a semiconductor device fabricated to include many transistors interconnected to implement the structures described above in one or more integrated circuits (ICs) contained in one or more packages. The processor circuitry may include and/or cooperate with one or more accelerators. In some examples, accelerators are implemented by logic circuitry to perform certain tasks more quickly and/or efficiently than can be done by a general purpose processor. Examples of accelerators include ASICs and FPGAs such as those discussed herein. A GPU or other programmable device can also be an accelerator. Accelerators may be on-board the processor circuitry, in the same chip package as the processor circuitry and/or in one or more separate packages from the processor circuitry.
More specifically, in contrast to the microprocessor 800 of
In the example of
The interconnections 910 of the illustrated example are conductive pathways, traces, vias, or the like that may include electrically controllable switches (e.g., transistors) whose state can be changed by programming (e.g., using an HDL instruction language) to activate or deactivate one or more connections between one or more of the logic gate circuitry 908 to program desired logic circuits.
The storage circuitry 912 of the illustrated example is structured to store result(s) of the one or more of the operations performed by corresponding logic gates. The storage circuitry 912 may be implemented by registers or the like. In the illustrated example, the storage circuitry 912 is distributed amongst the logic gate circuitry 908 to facilitate access and increase execution speed.
The example FPGA circuitry 900 of
Although
In some examples, the processor circuitry 712 of
From the foregoing, it will be appreciated that example apparatus, methods, and articles of manufacture have been disclosed that predict results of activation functions in convolutional neural networks.
To test the proficiency of the system illustrated in
The dataset used was the ImageNet inference dataset from ILSVRC2012, which is 50,000 images from 1,000 classes. As can be seen, a significant number of results were clamped to zero. Specifically, 61.14% of the outputs of the ReLU layers were zero for the ResNet-50 architecture with pretrained ImageNet weights. Additionally, as can be observed in
From the foregoing, it will be appreciated that example systems, methods, apparatus, and articles of manufacture have been disclosed that predict the sign of an activation function in a neural network. The disclosed systems, methods, apparatus, and articles of manufacture improve the efficiency of using a computing device by predicting the sign of an activation function used for classification in a neural network prior to calculating all bits of the mantissa. Predicting the sign of an activation function accurately with less than full mantissa calculations reduces the amount of compute cycles required to run a neural network. The disclosed systems, methods, apparatus, and articles of manufacture are accordingly directed to one or more improvement(s) in the operation of a machine such as a computer or other electronic and/or mechanical device.
Although certain example apparatus and articles of manufacture have been disclosed herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all systems, methods, apparatus, and articles of manufacture fairly falling within the scope of the claims of this patent. Further examples and combinations thereof include the following:
[EXAMPLE PARAGRAPHS MAPPING TO ALL CLAIMS WILL BE INSERTED WHEN A VERSION OF THE CLAIMS HAVE BEEN APPROVED]
The following claims are hereby incorporated into this Detailed Description by this reference, with each claim standing on its own.