The disclosure generally relates to circuits for accelerating softmax and log softmax computations.
The softmax function, applied to a vector X of n+1 real values (indexed 0 . . . n), normalizes the values into a probability distribution of n+1 probabilities proportional to the exponentials of the input values. Some input elements may be negative or greater than one. Each element in the tensor resulting from application of the softmax function is in the interval (0, 1), and the elements sum to 1. The softmax function on element xt (subscripts alternatively denoted “x_t” herein) of an input tensor can be stated as:
softmax(xt) = e^(x_t) / SUM(y=0 . . . n, e^(x_y))
In order to avoid overflow, the element having the greatest value (xmax) in the input tensor is subtracted from the exponents in the calculation.
softmax(xt) = e^(x_t − x_max) / SUM(y=0 . . . n, e^(x_y − x_max))
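The max-subtraction identity above can be checked with a short numeric sketch (the function name is illustrative, not part of the disclosure):

```python
import math

def softmax(x):
    """Numerically stable softmax: subtract the greatest element before
    exponentiating so that no exponent overflows."""
    x_max = max(x)                              # greatest value in the input
    exps = [math.exp(v - x_max) for v in x]     # e**(x_t - x_max)
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, -3.0])
```

Shifting every exponent by x_max scales the numerator and denominator by the same factor, so the resulting distribution is unchanged.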
In applications involving large tensors, traversing all elements of the tensor to find the maximum value can consume a considerable amount of time, and until the maximum of the tensor elements is found, the exponential function calculation will be blocked.
For many neural networks, such as convolutional neural networks (CNNs), recurrent neural networks, etc., softmax is applied only in the final layer. Transformer neural networks are now providing encouraging results in applications previously dominated by CNNs. Notably, each attention layer of a transformer neural network can have softmax and dropout operations in addition to the standard matrix multiplications of fully-connected layers. Transformer networks are large, having hundreds of millions to hundreds of billions of parameters, and the softmax function can have a significant negative impact on their performance.
A disclosed method includes transforming in parallel, elements of each group of a plurality of groups of elements of a tensor X into respective power-of-two elements by a processor circuit (202). The respective power-of-two element from element xt of the tensor is pt, pt=(xt*log2e), and pt has an integer part and a fraction part. The method includes determining respective group-level biases for the groups by a comparison circuit (204), wherein the group-level bias of groupm is dm, and dm is an integer part of a maximum of the power-of-two elements of groupm. The method further includes determining a greatest one of the respective group-level biases by the comparison circuit (206) to be a tensor-level bias, dmax.
A disclosed circuit arrangement includes a processor circuit configured to transform in parallel, elements of each group of a plurality of groups of elements of a tensor X into respective power-of-two elements. The respective power-of-two element from element xt of the tensor is pt, pt=(xt*log2e), and pt has an integer part and a fraction part. A first comparison circuit is configured to determine respective group-level biases for the groups. The group-level bias of groupm is dm, and dm is an integer part of a maximum of the power-of-two elements of groupm. A second comparison circuit is configured to determine a greatest one of the respective group-level biases to be a tensor-level bias, dmax.
Other features will be recognized from consideration of the Detailed Description and Claims, which follow.
Various aspects and features of the methods and circuits will become apparent upon review of the following detailed description and upon reference to the drawings in which:
In the following description, numerous specific details are set forth to describe specific examples presented herein. It should be apparent, however, to one skilled in the art, that one or more other examples and/or variations of these examples may be practiced without all the specific details given below. In other instances, well known features have not been described in detail so as not to obscure the description of the examples herein. For ease of illustration, the same reference numerals may be used in different diagrams to refer to the same elements or additional instances of the same element.
The disclosed approaches provide methods and circuitry that address the aforementioned issues. The methods and circuits are useful in neural network inference and training. According to the disclosed approaches, the exponential functions of softmax are transformed into 2^x form. The transformation is explained as follows.
a^b = e^(b*ln(a))
For a = 2 and b*ln(a) = x, so that b = x*log2e:
e^x = 2^(x*log2e)
A bias, dmax, is computed to prevent overflow and underflow and to align terms for summing:
dmax = [xmax*log2e]
where [⋅] is a floor operation. The softmax function can be restated as:
softmax(xt) = 2^(xt*log2e − dmax) / SUM(y=0 . . . n, 2^(xy*log2e − dmax))
where xt is element t of tensor X, and the summation is over all elements of X.
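The restated power-of-two form can be sketched numerically as follows (a software model with illustrative names): e^x becomes 2^(x*log2e), and the floor-of-max bias dmax is subtracted from every base-2 exponent.

```python
import math

LOG2E = math.log2(math.e)

def softmax_pow2(x):
    """Softmax computed in the 2**(x*log2e - dmax) form described above."""
    p = [v * LOG2E for v in x]       # power-of-two exponents p_t = x_t*log2e
    d_max = math.floor(max(p))       # tensor-level bias: floor of max p_t
    terms = [2.0 ** (pt - d_max) for pt in p]
    total = sum(terms)
    return [t / total for t in terms]
```

Because dmax is an integer, subtracting it only shifts the binary exponent of each term, which is what makes the later exponent-adjustment circuits cheap.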
The term, 2^(xt*log2e), can be decomposed as:
2^(xt*log2e) = 2^(xt_k + xt_j) = 2^(xt_k) * 2^(xt_j)
where xt_k is the integer part, xt_j is the fractional part of (xt*log2e), and xt_j is in the interval [0, 1). To calculate the softmax function according to the disclosed approaches, three components are calculated: xt_k, 2^(xt_j), and dmax. As xt_j is in the interval [0, 1), 2^(xt_j) can be approximated by polynomial fitting with acceptable precision and degree. The value, 2^(xt_j), is a floating-point number and can have an 8-bit exponent, for example. After polynomial fitting, the exponent is modified by xt_k − dmax.
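The integer/fraction split and the polynomial approximation of 2^(xt_j) can be sketched as follows. The degree-3 coefficients here are a simple Taylor-style fit chosen for illustration, not the disclosed hardware's fitted polynomial:

```python
import math

LOG2E = math.log2(math.e)
LN2 = math.log(2.0)

def split_power_of_two(x_t):
    """Split p_t = x_t*log2e into integer part k and fraction j in [0, 1)."""
    p_t = x_t * LOG2E
    k = math.floor(p_t)
    return k, p_t - k

def pow2_frac_poly(j):
    """Degree-3 approximation of 2**j on [0, 1) (illustrative coefficients)."""
    return 1.0 + LN2 * j + (LN2 ** 2 / 2.0) * j ** 2 + (LN2 ** 3 / 6.0) * j ** 3

k, j = split_power_of_two(2.5)
approx = pow2_frac_poly(j) * 2.0 ** k   # reassembled value, close to e**2.5
```

Multiplying the fitted 2^j value by 2^k amounts to adding k to its exponent field, which is the adjustment the text describes.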
The disclosed methods and circuitry significantly reduce the time expended in computing dmax by dividing an input tensor into several groups, converting tensor elements into power-of-two values, determining group-level biases, adjusting the power-of-two values according to the group-level biases, and summing the adjusted values of the groups. The number of tensor elements in each group is set according to the desired level of computational parallelism. In the exemplary methods and circuitry, each group has 8 tensor elements, though different implementations can have more or fewer tensor elements depending on hardware capabilities.
The groups of elements can be input one group at a time, and a processor circuit is configured to multiply the elements of the group by log2e in parallel (xt*log2e for t=0 . . . 7). The purpose of multiplying xt by log2e is to transform e^(xt) to the form 2^y (“power-of-two form”), where y = xt*log2e, per the derivations above. The products produced from each group m are used to determine the group-level bias, dm. The dm of group m is the integer part of the greatest one of the products of the group ([max(xt*log2e for t=0 . . . 7)]). A tensor-level bias, dmax, is determined by finding the greatest of the dm values as the groups are successively processed.
The dm, along with the xt_k and 2^(xt_j) values, are used to adjust the computed products and prevent overflow and underflow relative to the group. The 2^(xt_j) values for a group are determined by polynomial fitting, and the power-of-two values are adjusted by xt_k − dm + (the exponent bits of 2^(xt_j)). As each group is computed, the group-biased power-of-two values, e^(xt)*2^(−dm), are stored in buffer2, shown by dashed block 104, in association with the group-level dm. Each e^(xt)*2^(−dm) is a floating point value having an exponent equal to xt_k − dm + (the exponent bits of 2^(xt_j)), and a mantissa equal to the mantissa of 2^(xt_j). Buffer2 can be an on-chip or off-chip RAM (relative to the computational circuitry), and the group-level dm values and associated group-biased power-of-two values can be input by a streaming or direct memory access (DMA) interface.
The adjusted power-of-two values are accumulated into a group-level sum (2^(−dm)*summ = SUMgroup_m = sum(e^(xt)*2^(−dm)) for all t in groupm) as the adjusted power-of-two values are computed. The group-level sums are accumulated into a tensor-level sum as each group completes. Once dm is determined, the group-level dm is compared to the current running bias, d′max, and the group-level sum 2^(−dm)*summ is aligned with the current sum (2^(−d′max)*sum′) according to the current value of d′max. Once aligned, the group-level sum is added to the current sum to produce a new current sum.
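The running group accumulation described above can be sketched as a software model (the function name is illustrative). Each group's partial sum carries its own bias dm, and the running sum is re-aligned whenever a larger bias appears:

```python
import math

LOG2E = math.log2(math.e)

def accumulate_groups(x, group_size=8):
    """Group-wise biased accumulation: each group sum is scaled by 2**(-d_m);
    the running sum stays aligned to the running maximum bias d'_max."""
    d_run = None      # running bias d'_max
    sum_run = 0.0     # running sum, scaled by 2**(-d_run)
    for g in range(0, len(x), group_size):
        group = x[g:g + group_size]
        p = [v * LOG2E for v in group]
        d_m = math.floor(max(p))                    # group-level bias
        s_m = sum(2.0 ** (pt - d_m) for pt in p)    # group sum * 2**(-d_m)
        if d_run is None:
            d_run, sum_run = d_m, s_m
        else:
            d_new = max(d_run, d_m)
            # align both partial sums to the new bias before adding
            sum_run = (sum_run * 2.0 ** (d_run - d_new)
                       + s_m * 2.0 ** (d_m - d_new))
            d_run = d_new
    return d_run, sum_run   # sum of e**x_t equals sum_run * 2**d_run
```

Only the group maxima are compared, so the full-tensor maximum scan that blocks the exponential pipeline in the conventional formulation is avoided.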
The tensor-level dmax and tensor-level sum, 2^(−dmax)*sum (which is 2^(−dmax)*sum(2^(xt*log2e)) over all elements of the tensor), are available once all groups have been processed.
Once dmax has been determined, the group-biased power-of-two values (e^(xt)*2^(−dm)) are tensor-wise adjusted based on the tensor-level dmax value. The tensor-wise adjustments for elements in groupm are made by retrieving from buffer2 the dm and the exponents of the associated group-wise-adjusted power-of-two values e^(xt)*2^(−dm). The exponents of e^(xt)*2^(−dm) in groupm are added to (dm − dmax) to generate the exponents of the e^(xt)*2^(−dmax) values, which are illustrated in the column 106 of blocks. The mantissas of the e^(xt)*2^(−dmax) values are the same as the mantissas of the corresponding values from buffer2. Though not shown in
A group of p+1 tensor elements (xt, t=0 . . . p) is read from buffer1 102 and input in parallel to processor circuitry 202. Processor circuit 202 computes products (“power-of-two elements”) of xt*log2e for t=0 . . . p in parallel. The p+1 power-of-two elements are provided on parallel signal lines to circuit 204, which compares values of the p+1 power-of-two elements and extracts and provides the integer portion of the greatest one of the values as dm. The compare-and-select circuit 206 compares the dm value from circuit 204 to the current dmax value in register 208 and selects the greater of the two values to update the contents of the register.
The power-of-two elements computed by processor circuits 202 are floating point values, and the integer portions (groupm xt_k) and fraction portions (groupm xt_j) of the values are determined from the mantissas and exponents. The integer portions are provided to the subtraction circuits 210, and the fraction portions are provided to the processor circuitry 212, which can be a vector processor that performs multiply-and-accumulate (“MAC”) operations in parallel.
The subtraction circuits 210 compute in parallel the differences between the integer portions and the group-level bias, dm (xt_k − dm for t=0 . . . p). The processor circuitry 212 computes in parallel 2^(xt_j) for t=0 . . . p by polynomial fitting of the fraction portions, xt_j. The tensor elements of the next group (groupm+1) can be input to the processor circuitry 202 for computing the power-of-two elements while circuit 204 determines the group-level bias dm, the subtraction circuits 210 compute the differences (xt_k − dm for t=0 . . . p), and the processor circuitry 212 computes 2^(xt_j) for t=0 . . . p for groupm.
The differences and the exponents of the 2^(xt_j) values are input to adder circuits 214, which compute in parallel the exponents of the group-biased power-of-two elements. Each e^(xt)*2^(−dm) is a floating point value having an exponent equal to xt_k − dm + (the exponent bits of 2^(xt_j)), and a mantissa equal to the mantissa of 2^(xt_j). The group-biased power-of-two values for the group are stored in buffer2 104 in association with the group-level bias dm.
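A minimal model of what the adder circuits 214 produce: the float e^(xt)*2^(−dm) is assembled by adding (xt_k − dm) to the exponent of the polynomial-fit value 2^(xt_j). Here `math.ldexp` stands in for writing the exponent field directly, and the helper name is illustrative:

```python
import math

def group_biased_value(k, j_pow, d_m):
    """Assemble e**x_t * 2**(-d_m) from integer part k, fraction value
    j_pow = 2**j, and group bias d_m: the exponent shift is k - d_m."""
    return math.ldexp(j_pow, k - d_m)   # j_pow * 2**(k - d_m)
```

`math.ldexp(m, e)` returns m * 2**e, which mirrors an integer addition into the float's exponent field while leaving the mantissa untouched.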
The group-biased power-of-two values for the group are input to summing circuit 216, which sums the group-biased power-of-two values into a group-level sum (SUMm = sum(e^(xt)*2^(−dm)) for all xt in groupm).
The update circuit 218 accumulates the group-level sums as each group-level sum is provided by summing circuit 216. The update circuit 218 inputs the group-level sum from summing circuit 216, the current greatest bias value, dmax from register 208, and the current accumulated SUM from register 220. The update circuit aligns the group-level sum and the current accumulated SUM according to dmax and produces a new SUM that is stored in register 220.
Once all groups of tensor elements of a tensor (e.g., group0 . . . groupi of a tensor having i+1 groups) have been processed and a final tensor-level sum has been computed, control circuit 222 can activate the final softmax circuitry 224. The final softmax circuitry generates final softmax values group-by-group, with the p+1 softmax values generated in parallel. The final softmax circuitry inputs the tensor-level bias, dmax, from register 208, the final tensor-level SUM from register 220 (SUM = 2^(−dmax)*sum(2^(xt*log2e))), and reads the group-biased power-of-two elements of groupm and the associated group-level bias dm from buffer2 104.
The subtractor circuit 226 of the final softmax circuitry determines the difference between the group-level bias, dm, and the tensor-level bias, dmax (dm − dmax). The difference from subtractor circuit 226 and the exponents of the group-biased power-of-two elements (exp(e^(xt)*2^(−dm))) are input to adder circuits 228. The adder circuits 228 compute in parallel sums of the difference and the exponents of the e^(xt)*2^(−dm) terms from buffer2. The sums from exponent adders 228 are exponents that are paired with the corresponding mantissas of the e^(xt)*2^(−dm) terms from buffer2 to provide the tensor-biased terms, “xt_dmax,” as dividends to the divider circuit 230. The exponent of xt_dmax is exp(xt_dmax) = (dm − dmax) + exp(e^(xt)*2^(−dm)), and the mantissa of xt_dmax is man(xt_dmax) = man(e^(xt)*2^(−dm)).
Divider circuitry 230, which can be a vector division circuit, computes in parallel the final softmax values (xt_dmax/SUM for t=0 . . . p).
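The final pass can be modeled in software as follows (illustrative names: `groups` stands in for buffer2's stored pairs of group bias and group-biased values, and `total` for the tensor-level SUM already scaled by 2^(−dmax)):

```python
def final_softmax(groups, d_max, total):
    """Re-bias each stored group from its own d_m to the tensor-level d_max,
    then divide by the tensor-level SUM, as the divider circuit does."""
    out = []
    for d_m, vals in groups:             # vals hold e**x_t * 2**(-d_m)
        scale = 2.0 ** (d_m - d_max)     # exponent adjustment d_m - d_max
        out.extend(v * scale / total for v in vals)
    return out
```

The common 2^(−dmax) factor appears in both the numerators and the SUM, so it cancels in the division and the outputs equal the unbiased softmax values.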
The control circuit 222 controls activation of the final softmax circuit 224 and final log_softmax circuit 232. The final softmax circuit and the final log_softmax circuit can be operated alone or in parallel with one another. For example, in response to a state of mode control signals, the control circuit 222 can activate the final softmax circuit 224 and deactivate final log_softmax circuit 232, deactivate the final softmax circuit 224 and activate final log_softmax circuit 232, or activate both the final softmax circuit 224 and the final log_softmax circuit 232 to operate in parallel. The control circuit 222 can gate clock signals to the final softmax circuit 224 and final log_softmax circuit 232 to reduce power consumption when only one of the circuits is activated.
The formula for log(softmax) follows from the derivation above:
log(softmax(xt)) = (xt − dmax) − log(2^(−dmax)*SUM(y=0 . . . n, 2^(xy*log2e)))
The term, 2^(−dmax)*SUM(2^(xy*log2e)), is the tensor-level SUM computed as described above.
The input variable of the natural log function is a floating point number, which is represented in the form:
(−1)^s * (1 + M) * 2^(E − E0)
where s is the sign bit, M is the mantissa, and E is the exponent, which is shifted by a constant bias E0. The log function on such a value y can be written as:
log(y) = (Ey + log2(1 + My)) / log2e
where Ey is the unbiased exponent, My is in the interval [0, 1), and log2(1 + My) can be calculated by polynomial fitting.
The final log_softmax circuit 232 is activated once the tensor-level dmax is available. A group of p+1 tensor elements (xt, t=0 . . . p) is read from buffer1 102 and input in parallel to the subtraction circuits 234. The subtraction circuits compute in parallel the differences (xt − dmax for t=0 . . . p).
The mantissa (MSUM) of the SUM is input to the processor circuitry 212, which is configured to compute log2(1+MSUM). The exponent of the SUM (ESUM) is input to circuit 236, which converts ESUM to a floating point value. Adder 238 sums the values output from circuits 212 and 236 (float(ESUM)+log2(1+MSUM)), and the sum is input to processor circuitry 240. Processor circuitry 240 is configured to compute:
SUMlog = (float(ESUM) + log2(1 + MSUM)) / log2e
The processor circuitry 240 can be processor circuitry dedicated to computing SUMlog, or circuitry 212.
The SUMlog and the p+1 differences (xt − dmax for t=0 . . . p) from subtraction circuits 234 are input to subtraction circuits 242. Subtraction circuits 242 compute in parallel (xt − dmax) − SUMlog for t=0 . . . p, and the p+1 output terms are log(softmax(xt)).
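The SUMlog path can be sketched numerically as follows. Here `math.frexp` stands in for reading the float's exponent and mantissa fields, and the helper name is illustrative; note that in this floating-point sketch the dmax bias re-enters scaled by ln(2) when it leaves the base-2 domain, so the result is the exact natural-log softmax:

```python
import math

LOG2E = math.log2(math.e)

def log_softmax_from_sum(x, d_max, biased_sum):
    """log(softmax) from the tensor-level SUM represented as (1+M)*2**E:
    SUMlog = (E + log2(1+M)) / log2e, i.e. the natural log of SUM."""
    m, e = math.frexp(biased_sum)        # biased_sum = m * 2**e, m in [0.5, 1)
    one_plus_m, exp2 = 2.0 * m, e - 1    # rewrite as (1+M) * 2**E, 1+M in [1, 2)
    sum_log = (exp2 + math.log2(one_plus_m)) / LOG2E   # = ln(SUM)
    ln_total = d_max * math.log(2.0) + sum_log         # = ln(sum of e**x_t)
    return [v - ln_total for v in x]
```

Only log2(1+M) on [1, 2) needs a polynomial; the exponent contributes through an addition, matching the circuit 236 / adder 238 split described above.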
The example shows the relative timing of operations involved in processing groups 0, 1, m, m+1, and i of i+1 groups of tensor elements (1 < m < i). In time slot t0, the p+1 tensor elements of group0 are input to the circuit arrangement. In time slot t1, the tensor elements of group0 are multiplied by log2e, and in parallel therewith, the tensor elements of group1 are input. In time slot t2, the group-level bias, d0, is determined, along with polynomial fitting of the xt_j terms and differences between the xt_k terms and d0. Also in time slot t2, the tensor elements of group1 are multiplied by log2e. Though not shown, the tensor elements of group2 would be input in time slot t2.
In time slot t3, the differences and the exponents of the 2^(xt_j) values computed from group0 are summed into the group-biased power-of-two elements and stored in association with the group-level bias d0. The group-biased power-of-two values for group0 are summed into a group-level sum (“SUMgroup0”). Also in time slot t3, the group-level bias, d1, is determined, along with polynomial fitting of the xt_j terms and differences between the xt_k terms and d1.
In time slot t4, the group-level bias, d0, is compared to the current tensor-level bias, dmax, and the current tensor-level dmax is updated to the value of d0, since d0 is the first maximum computed. Also during time slot t4, the differences and the exponents of the 2^(xt_j) values computed from group1 are summed into the group-biased power-of-two elements and stored in association with the group-level bias d1. The group-biased power-of-two values for group1 are summed into a group-level sum (“SUMgroup1”).
In time slot t5, the group-level sum is accumulated with the current SUM. The group-level sum SUMgroup0 is aligned with the current accumulated SUM according to the current dmax, and the aligned values are added to produce a new SUM. Also in time slot t5, the group-level bias, d1, is compared to the current tensor-level bias, dmax. If d1 > dmax, then the current tensor-level bias, dmax, is updated to the value of d1. Otherwise, dmax remains unchanged.
In time slot t6, the group-level sum is accumulated with the current SUM. The group-level sum SUMgroup1 is aligned with the current accumulated SUM according to the current dmax, and the aligned values are added to produce a new SUM.
Processing of the final groupi of tensor elements commences in time slot t0+i, and the processing is similar to that described above for time slots t0+i through t0+i+5. In time slot t0+i+6, the final operations of softmax processing begin.
In time slot t0+i+6, the group-biased power-of-two elements of group0 and the associated group-level bias d0 are input, and the e^(xt)*2^(−dmax) values, “xt_dmax,” are computed for group0 as described above. Each e^(xt)*2^(−dmax) is a floating point value having an exponent equal to xt_k − dmax + (the exponent bits of 2^(xt_j)), and a mantissa equal to the mantissa of 2^(xt_j). In time slot t0+i+7, the p+1 softmax values of group0 are computed as (xt_dmax/SUM for t=0 . . . p) and then output. Though not shown, the operations in time slots t0+i+6 and t0+i+7 would be performed for group1 . . . groupi in ensuing time slots. For example, in time slot t0+i+7, the group-biased power-of-two elements of group1 and the associated group-level bias d1 are input, and e^(xt)*2^(−dmax) values are computed for group1. In time slot t0+i+8, the p+1 softmax values of group1 are computed as (xt_dmax/SUM for t=0 . . . p) and then output.
In time slot t0+i+5, the p+1 tensor elements of group0 are input, and the parallel subtraction circuits compute the differences (xt − dmax for t=0 . . . p).
In timeslot t0+i+6, log2(1+MSUM) is computed from the mantissa (MSUM) of the SUM, and the exponent of the SUM (ESUM) is converted to a floating point value.
In timeslot t0+i+7, the log2(1+MSUM) and float(ESUM) values are summed.
In timeslot t0+i+8, the SUMlog term is computed from the log2(1+MSUM) and float(ESUM) values as:
SUMlog = (float(ESUM) + log2(1 + MSUM)) / log2e
In timeslot t0+i+9, the p+1 log(softmax) values of group0 are computed in parallel as (xt−dmax)−SUMlog for t=0 . . . p, and then output.
The operations are the same as those described in
In time slot t0+i+7, the p+1 softmax values of group0 are computed and then output. In parallel with the final softmax operation in time slot t0+i+7, the log2(1+MSUM) and float(ESUM) values are summed for log(softmax). The log(softmax) operations in time slots t0+i+8 and t0+i+9 are as described in
Referring to the PS 602, each of the processing units includes one or more central processing units (CPUs) and associated circuits, such as memories, interrupt controllers, direct memory access (DMA) controllers, memory management units (MMUs), floating point units (FPUs), and the like. The interconnect 616 includes various switches, busses, communication links, and the like configured to interconnect the processing units, as well as interconnect the other components in the PS 602 to the processing units.
The OCM 614 includes one or more RAM modules, which can be distributed throughout the PS 602. For example, the OCM 614 can include battery-backed RAM (BBRAM), tightly coupled memory (TCM), and the like. The memory controller 610 can include a DRAM interface for accessing external DRAM. The peripherals 608, 615 can include one or more components that provide an interface to the PS 602. For example, the peripherals can include a graphics processing unit (GPU), a display interface (e.g., DisplayPort, high-definition multimedia interface (HDMI) port, etc.), universal serial bus (USB) ports, Ethernet ports, universal asynchronous receiver-transmitter (UART) ports, serial peripheral interface (SPI) ports, general-purpose input/output (GPIO) ports, serial advanced technology attachment (SATA) ports, PCIe ports, and the like. The peripherals 615 can be coupled to the MIO 613. The peripherals 608 can be coupled to the transceivers 607. The transceivers 607 can include serializer/deserializer (SERDES) circuits, MGTs, and the like.
Various logic may be implemented as circuitry to carry out one or more of the operations and activities described herein and/or shown in the figures. In these contexts, a circuit or circuitry may be referred to as “logic,” “module,” “engine,” or “block.” It should be understood that logic, modules, engines and blocks are all circuits that carry out one or more of the operations/activities. In certain implementations, a programmable circuit is one or more computer circuits programmed to execute a set (or sets) of instructions stored in a ROM or RAM and/or operate according to configuration data stored in a configuration memory.
Though aspects and features may in some cases be described in individual figures, it will be appreciated that features from one figure can be combined with features of another figure even though the combination is not explicitly shown or explicitly described as a combination.
The methods and circuits are thought to be applicable to a variety of systems that compute softmax and log(softmax) functions. Other aspects and features will be apparent to those skilled in the art from consideration of the specification. The methods and circuits may be implemented as one or more processors configured to execute software, as an application-specific integrated circuit (ASIC), or as logic on a programmable logic device. It is intended that the specification and drawings be considered as examples only, with a true scope of the invention being indicated by the following claims.