Hardware accelerators may be implemented to perform certain operations more efficiently than such operations would be performed on a general-purpose processor such as a central processing unit (CPU). For example, a matrix multiplication accelerator (MMA) may be implemented to perform matrix mathematical operations more efficiently than these operations would be performed on a general-purpose processor. Machine learning algorithms can be expressed as matrix operations that tend to be performance-dominated by matrix multiplication. Accordingly, machine learning is an example of an application area in which an MMA may be implemented to perform matrix mathematical operations such as matrix multiplication.
In hardware implementations of matrix multiplication such as by an MMA, calculations may be performed in a parallel, pipelined computation that may involve nearly-simultaneous evaluations of multiplications, dot product summations, and accumulations. Such computations generally involve a substantial amount of hardware components that operate at relatively high signal transition frequencies. For example, some computing systems that include MMAs may execute about 4096 to about 8192 matrix multiplications per clock cycle at gigahertz rates. The amount of hardware components and/or signal transition frequencies involved in hardware implemented matrix multiplication may contribute to relatively high current demand while computations involving an MMA are active (e.g., during an active cycle).
During some phase of program execution (e.g., during an idle cycle), a computing system including an MMA may not need to perform matrix mathematical operations such that computations involving the MMA are inactive. For example, the computing system may not need to perform matrix mathematical operations due to program structure or transient resource dependencies (e.g., cache misses). While computations involving the MMA are inactive (e.g., during an idle cycle) current demand may be low (e.g., about leakage level current in the MMA) relative to current demand while computations involving an MMA are active (e.g., during an active cycle).
Accordingly, a relatively high transient current (di/dt) can occur when computations involving an MMA start (e.g., when the MMA transitions from an idle cycle to an active cycle) and stop (e.g., when the MMA transitions from an active cycle to an idle cycle).
In examples, a device comprises control logic configured to detect an idle cycle, an operand generator configured to provide a synthetic operand responsive to the detection of the idle cycle, and a computational circuit. The computational circuit is configured to, during the idle cycle, perform a first computation on the synthetic operand. The computational circuit is configured to, during an active cycle, perform a second computation on an architectural operand.
The same reference numbers or other reference designators are used in the drawings to designate the same or similar (functionally and/or structurally) features.
As described above, relatively high transient current (di/dt) can occur when computations involving a computational circuit, such as a matrix multiplication accelerator (MMA), start and the circuit transitions from an idle cycle to an active cycle. Relatively high transient current can also occur when computations involving the computational circuit stop and the circuit transitions from an active cycle to an idle cycle. High transient current that occurs when a computational circuit transitions between active cycles and idle cycles can increase inductance sensitivity of a package (or board) design. For example, the voltage developed across an inductance is proportional to the transient current (V = L·di/dt), so the voltage drop across package or board inductances increases when a magnitude (|di/dt|) of the transient current increases and decreases when the magnitude of the transient current decreases.
Increased transient current drawn by an MMA or other computational circuit when transitioning between active cycles and idle cycles can also increase package design complexity and production costs. For example, package design complexity can increase because the response of a power distribution network supplying current drawn by an MMA is flattened to avoid resonances that may be excited by the narrow current demand pulse widths associated with such increases in transient current. In another example, power distribution network components involved in supplying current to an MMA are typically hardened to accommodate such increases in transient current, which can increase production costs.
Aspects of this description relate to managing transient current in a device during parallel matrix computations using activity leveling. In at least one example, the device includes an operand generator that is configured to provide synthetic operands. Generally, an operand can be the object of a mathematical operation or a computer instruction. Operands can include architectural operands and synthetic operands. Architectural operands can represent operands that are processed, manipulated, transformed, or created during some phase of program execution by a general-purpose processor, such as a central processing unit (CPU) or application control logic (ACL). Synthetic operands can represent operands that are generated or created by an operand generator outside of any phase of program execution by a general-purpose processor, in accordance with various examples, and the results of computations on synthetic operands may be discarded without being used by any program.
Computations involving a given computational circuit can be performed on synthetic operands provided by the operand generator during otherwise idle cycles to consume power. Power consumed by performing computations on synthetic operands provided by the operand generator during idle cycles can reduce a magnitude of transient current drawn by the circuit (referred to herein as activity leveling) when transitioning between active cycles and idle cycles. Reducing transient current drawn by the circuit when transitioning between active cycles and idle cycles can avoid increases in package design inductance sensitivity, complexity, and production costs associated with increases in such transient current.
In operation, the processor 110 is configured to provide control signals at the command interface 137, which cause the MMA 120 to control operation of the input data formatter 121, the output data formatter 123, the buffer controller 125, and the matrix multiplier array 127. In some examples, the MMA 120 may store a data structure (not expressly shown) that determines the manner in which the input data formatter 121, the output data formatter 123, the buffer controller 125, and the matrix multiplier array 127 are to operate, and the processor 110 may control the contents of the data structure via the command interface 137. The control signals that the processor 110 provides at the command interface 137 can include opcode instructions, stall signals, formatting instructions, and other signals that modify operation of the MMA 120. The opcode instructions can include an opcode instruction that defines a matrix mathematical operation, such as matrix multiplication operations, direct vector-by-matrix multiplication (which may be useful to perform matrix-by-matrix multiplication), convolution, and other parallel matrix computations. The opcode instructions can also include an opcode instruction that defines a non-matrix mathematical operation, such as a matrix transpose operation, a matrix initialization operation, and other matrix related operations that do not involve a matrix mathematical operation. The formatting instructions can include formatting instructions that define how the MMA 120 is to interpret input data provided at the first source data bus 131 or at the second source data bus 133. The formatting instructions can also include formatting instructions that define how the MMA 120 is to present results to the processor 110 as output data provided at the results data bus 135.
The input data formatter 121 is configured to use formatting instructions provided at the command interface 137 to transform data provided at the first source data bus 131 and data provided at the second source data bus 133 into architectural operands for internal use within the MMA 120. The output data formatter 123 is configured to use formatting instructions provided at the command interface 137 to transform results data generated by computations involving the MMA 120 into output data provided at the results data bus 135. The buffer controller 125 is configured to provide and/or manage memory for storing architectural operands provided by the input data formatter 121 and for storing results data provided by the matrix multiplier array 127. The matrix multiplier array 127 is configured to perform parallel matrix computations using operands provided by the input data formatter 121. The matrix multiplier array 127 is also configured to provide results data generated by parallel matrix computations to the buffer controller 125 for storage. The control logic 129 is configured to modify, responsive to receiving control signals provided by the processor 110 at the command interface 137, operation of the input data formatter 121, the output data formatter 123, the buffer controller 125, and the matrix multiplier array 127. The control logic 129 is also configured to provide signals indicative of a status of the MMA 120 or indicative of a status of an operation performed by the MMA 120 at the status interface 139 for interrogation by the processor 110.
In some examples, the input data formatter 121, the output data formatter 123, the buffer controller 125, and the control logic 129 are implemented using hardware circuit logic. For instance, any suitable hardware circuit logic that is configured to manipulate data bits to facilitate the specific operations attributed herein to the input data formatter 121, the output data formatter 123, the buffer controller 125, and the control logic 129 may be useful. Taking the output data formatter 123 as an example, an example 8-bit by 8-bit vector multiplication yields a 16-bit result. There may be multiple such 16-bit results that are to be summed together, and overflow (e.g., two 16-bit numbers being summed producing a 17-bit result) should be considered. Accordingly, the accumulation may be performed at a 32-bit precision. However, in an example implementation in which the output is to have 8 bits, the output data formatter 123 may be hardware-configured to select which eight bits of the 32-bit sum are to be provided as an output. The output data formatter 123 may also be hardware-configured to perform other operations on data to be output, such as scaling and saturation operations.
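The 32-bit accumulation with 8-bit output selection described above can be sketched in Python; the function name `format_output` and the right-shift scaling amount are illustrative assumptions, not details of the output data formatter 123 hardware:

```python
def format_output(acc32, shift=8):
    """Select an 8-bit field from a 32-bit accumulator with saturation.

    acc32: non-negative 32-bit accumulated sum of 16-bit products.
    shift: assumed bit position of the least significant output bit.
    """
    scaled = acc32 >> shift          # scaling: drop low-order bits
    return min(scaled, 0xFF)         # saturate to the 8-bit output range

# Several 16-bit products accumulated at 32-bit precision, as in the text.
products = [250 * 200, 180 * 120, 90 * 300]
acc = sum(products)                  # 98600, wider than 16 bits
out = format_output(acc, shift=8)    # shifted value exceeds 255, so saturates
```

Here the shifted accumulator value exceeds the 8-bit range, so the sketch saturates the output to 255, illustrating the saturation operation the formatter may perform.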
The device 100 also includes an operand generator 140 that is configured to provide synthetic operands when activity leveling is enabled. Computations involving the MMA 120 can be performed on synthetic operands provided by the operand generator 140 during otherwise idle cycles to consume power. Power consumed by performing computations on synthetic operands provided by the operand generator 140 during idle cycles can reduce a magnitude of transient current drawn by the MMA 120 when transitioning between active cycles and idle cycles. In at least one example, a magnitude of transient current drawn by the MMA 120 when transitioning between active cycles and idle cycles can be further reduced when the operand generator 140 provides synthetic operands having statistical similarity with architectural operands provided by the processor 110. Computations performed on synthetic operands provided by the operand generator 140 during idle cycles can be architecturally transparent (e.g., without a discernible impact on device architecture, such as memory) by discarding any results data generated by such computations without modifying memory that the buffer controller 125 provides for storing results data.
In some examples, the operand generator 140 includes any suitable hardware circuit logic that is configured to perform the actions attributed herein to the operand generator 140.
The term “statistical similarity” refers to the similarity between synthetic operands and architectural operands that facilitates a relatively consistent amount of current draw from the MMA 120. More specifically, the current demand of a multiplier may depend on how the inputs to that multiplier are changing. For example, if the same data is provided to the inputs of a multiplier every clock cycle, then that multiplier may consume nearly zero power per clock cycle, because in static complementary metal oxide semiconductor (CMOS) technologies, a circuit consumes significant amounts of power only if the inputs to that circuit change (neglecting leakage power). However, a multiplier that has every input change during each clock cycle will consume a maximum amount of power each clock cycle. It is desirable to maintain a consistent current draw from the MMA 120. However, because the MMA 120's current draw over time is dependent on the sequence of input operands, the sequences of the synthetic and architectural operands should be made to look similar. Thus, for instance, if the architectural operands had, on average, 3 of 8 bits changing each clock cycle, then the synthetic operands should also have, on average, 3 of 8 bits changing each clock cycle.
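The toggle rate described above can be quantified as an average Hamming distance between consecutive operand values. The following Python sketch, with an assumed 8-bit operand width, illustrates the 3-of-8-bits example:

```python
def avg_bits_toggling(operands, width=8):
    """Average Hamming distance between consecutive operand values."""
    mask = (1 << width) - 1
    # XOR adjacent values, then count the 1 bits (bits that toggled).
    flips = [bin((a ^ b) & mask).count("1")
             for a, b in zip(operands, operands[1:])]
    return sum(flips) / len(flips)

# A sequence in which exactly 3 of 8 bits change every clock cycle.
arch = [0b00000000, 0b00000111, 0b00000000, 0b00000111]
rate = avg_bits_toggling(arch)   # 3.0 bits per cycle
```

A synthetic operand sequence with the same average toggle rate would, per the description, draw a similar current from the multipliers.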
The buffer controller 125 can be configured to include and/or manage memory having a two-stage pipeline structure including buffers for storing architectural operands provided by the input data formatter 121 and for storing results data provided by the matrix multiplier array 127. The buffer controller 125 may also include additional circuitry, such as circuitry to manage those buffers.
The MMA 120 loads, responsive to the control logic 129 receiving the opcode instruction, data corresponding to a row of a multiplier matrix from the first source data bus 131. The input data formatter 121 transforms the data that the MMA 120 loads from the first source data bus 131 into an architectural multiplier operand. The input data formatter 121 provides the architectural multiplier operand to the buffer controller 125 to store in a foreground multiplier buffer 411. Multiple dot product computations are computed in parallel within the matrix multiplier array 127 using elements of the architectural multiplier operand stored in the foreground multiplier buffer 411 and columns of a multiplicand operand stored in a foreground multiplicand buffer 412 (the contents of which are provided by a background multiplicand buffer 422, which is populated as described below). The matrix multiplier array 127 provides a result of those multiple dot product computations to the buffer controller 125. During an active cycle, the buffer controller 125 stores the result provided by the matrix multiplier array 127 in a row 414 of a foreground product buffer 413 (e.g., as the result of an addition assignment operation, denoted by the symbol “+=”).
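One such row-times-matrix step, with the “+=” accumulation into a product row, can be sketched as follows; the buffer contents are illustrative values, not data from the described hardware:

```python
def mma_row_step(multiplier_row, multiplicand, product_row):
    """One active cycle: dot products of one multiplier row against every
    multiplicand column, accumulated into the product row ('+=')."""
    cols = len(multiplicand[0])
    for j in range(cols):
        # In hardware these dot products are computed in parallel.
        product_row[j] += sum(a * b for a, b in
                              zip(multiplier_row, (r[j] for r in multiplicand)))
    return product_row

multiplicand = [[1, 2], [3, 4]]      # columns from the foreground multiplicand buffer
row = [5, 6]                         # one row from the foreground multiplier buffer
acc = [0, 0]                         # a row of the foreground product buffer
mma_row_step(row, multiplicand, acc) # acc becomes [5*1+6*3, 5*2+6*4] = [23, 34]
```

Repeating this step for each multiplier row yields the full matrix-by-matrix product, which is why direct vector-by-matrix multiplication is useful for matrix multiplication.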
A first background data transfer occurs between the buffer controller 125 and the input data formatter 121 while computations occur within the matrix multiplier array 127 using the foreground multiplier buffer 411 and the foreground multiplicand buffer 412. The first background data transfer involves the input data formatter 121 providing formatted data to the buffer controller 125 to store in a background multiplicand buffer 422 using data that the MMA 120 loads from the second source data bus 133. A second background data transfer also occurs between the buffer controller 125 and the output data formatter 123 while those computations occur within the matrix multiplier array 127. The second background data transfer involves the buffer controller 125 providing the output data formatter 123 with data stored in a background product buffer 423 (which receives its contents from the foreground product buffer 413).
In this example implementation, the control logic 129 receives a control signal provided by the processor 110 at the command interface 137 while computations involving the MMA 120 are active. The control signal that the control logic 129 receives causes the computations involving the MMA 120 to stop.
The control logic 129 detects, responsive to receiving the control signal provided by the processor 110 at the command interface 137, an idle cycle. The control logic 129 enables, responsive to detecting the idle cycle, activity leveling in the MMA 120 by asserting the leveling signal IDLE. The operand generator 140 provides, responsive to the control logic 129 enabling activity leveling, a synthetic operand on the synthetic data bus 142 for storage in the foreground multiplier buffer 411. In at least one example, providing the synthetic operand involves the operand generator 140 selecting the synthetic operand from a sample buffer storing a set of sampled architectural operands (e.g., architectural multiplier operands) using a circular index or a pseudo-random index. In at least one example, the operand generator 140 constructs the set of sampled architectural operands by sampling architectural multiplier operands that the input data formatter 121 provides to the buffer controller 125 over a number of active cycles that precede the idle cycle detected by the control logic 129 to determine a pattern or trend in the architectural operands. In at least one example, the synthetic operand provided by the operand generator 140 has a statistical similarity with architectural operands provided by the processor 110, such as a synthetic operand provided by any example implementation of the operand generator 140 described herein.
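One way to realize such a sample buffer with circular or pseudo-random indexing is sketched below; the class name, the capacity of 16, and the use of Python's PRNG in place of hardware pseudo-random logic are assumptions for illustration:

```python
import random

class OperandSampler:
    """Sample buffer of recently observed architectural operands."""

    def __init__(self, capacity=16, seed=0):
        self.samples = []
        self.capacity = capacity
        self.write_i = 0
        self.read_i = 0
        self.rng = random.Random(seed)   # stand-in for hardware PRNG

    def observe(self, operand):
        """Record an architectural operand during an active cycle."""
        if len(self.samples) < self.capacity:
            self.samples.append(operand)
        else:
            self.samples[self.write_i] = operand   # overwrite oldest slot
        self.write_i = (self.write_i + 1) % self.capacity

    def synthetic(self, pseudo_random=False):
        """Select a synthetic operand during an idle cycle."""
        if pseudo_random:
            return self.rng.choice(self.samples)   # pseudo-random index
        op = self.samples[self.read_i % len(self.samples)]  # circular index
        self.read_i += 1
        return op
```

Because each synthetic operand is a previously observed architectural operand, the synthetic sequence naturally resembles the architectural one, which supports the statistical-similarity goal.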
The MUX 502 couples, responsive to the control logic 129 asserting the activity leveling signal IDLE, the synthetic data bus 142 to the buffer controller 125. The buffer controller 125 stores, responsive to the MUX 502 coupling the synthetic data bus 142 to the buffer controller 125, the synthetic operand in the foreground multiplier buffer 411. Multiple dot product computations are computed in parallel within the matrix multiplier array 127, during the idle cycle with activity leveling enabled, using elements of the synthetic operand stored in the foreground multiplier buffer 411 and columns of a multiplicand operand stored in the foreground multiplicand buffer 412. The matrix multiplier array 127 provides a result of those multiple dot product computations to the buffer controller 125. During the idle cycle with activity leveling enabled, the buffer controller 125 discards the result provided by the matrix multiplier array 127 without modifying the foreground product buffer 413. As described in greater detail below, performing computations involving the MMA 120 using synthetic operands provided by the operand generator 140 with activity leveling enabled can reduce a magnitude of transient current drawn by the MMA 120 when transitioning between active cycles and idle cycles.
The shift circuit 608 is configured to update an average Hamming distance value stored in the averaging register 610 for an architectural operand element once every 2^n active cycles, where n is a natural number. Updating an average Hamming distance value stored in the averaging register 610 for an architectural operand element involves the shift circuit 608 performing a bitwise right shift operation (by n bit positions, dividing by 2^n) on a Hamming distance value stored in the accumulation register 606 for the architectural operand element. The shift circuit 608 is also configured to reset or clear, responsive to updating the average Hamming distance value stored in the averaging register 610, the Hamming distance value stored in the accumulation register 606. In at least one example, the logic gate 604, the accumulation register 606, the shift circuit 608, and/or the averaging register 610 can be replicated to increase a sampling rate of a Hamming distance or population count of architectural operand elements provided by the processor 110 for an active cycle.
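The accumulate-then-shift averaging can be modeled as follows; the comments map to the described registers, while the class structure and the default n=3 are illustrative assumptions:

```python
class HammingAverager:
    """Accumulate per-cycle Hamming distances; average every 2**n cycles."""

    def __init__(self, n=3):
        self.n = n
        self.cycle = 0
        self.accum = 0      # models accumulation register 606
        self.average = 0    # models averaging register 610

    def step(self, prev, curr):
        """One active cycle: XOR then population count, then accumulate."""
        self.accum += bin(prev ^ curr).count("1")
        self.cycle += 1
        if self.cycle == (1 << self.n):          # every 2**n active cycles
            self.average = self.accum >> self.n  # right shift = divide by 2**n
            self.accum = 0                       # clear the accumulator
            self.cycle = 0
        return self.average
```

Using a power-of-two window makes the division a simple right shift, which is why a shift circuit (rather than a divider) suffices in hardware.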
The thermometer encoder 612 is configured to convert an average Hamming distance value stored in the averaging register 610 from binary to an 8-bit thermometer coded value having a population count equal to the average Hamming distance. A pseudo-random number provided by the pseudo-random number generator 616 can control the shuffling circuit 614 to generate a synthetic operand element having statistical similarity with an architectural operand element using the thermometer code provided by the thermometer encoder 612. Generating the synthetic operand element can involve the shuffling circuit 614 randomly shuffling the 8-bit thermometer coded value using a shuffling algorithm (e.g., a Fisher-Yates algorithm or a Knuth algorithm) controlled using the pseudo-random number provided by the pseudo-random number generator 616. The operand generator 140 can use the synthetic operand element generated by the shuffling circuit 614 to generate a synthetic operand for the matrix multiplier array 127 to process during an idle cycle.
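A software sketch of the thermometer encoding and Fisher-Yates shuffling, with Python's PRNG standing in for the pseudo-random number generator 616, might look like:

```python
import random

def thermometer(avg_hamming, width=8):
    """Binary average -> thermometer code with avg_hamming ones set."""
    return [1] * avg_hamming + [0] * (width - avg_hamming)

def synthetic_element(avg_hamming, rng, width=8):
    """Shuffle the thermometer code so the element keeps the same
    population count but at pseudo-random bit positions."""
    bits = thermometer(avg_hamming, width)
    for i in range(width - 1, 0, -1):       # Fisher-Yates shuffle
        j = rng.randrange(i + 1)
        bits[i], bits[j] = bits[j], bits[i]
    return int("".join(map(str, bits)), 2)

rng = random.Random(1234)        # illustrative stand-in for PRNG 616
elem = synthetic_element(3, rng) # 8-bit value with exactly three 1 bits
```

The shuffle permutes bit positions without changing the population count, so successive synthetic elements toggle roughly the same number of bits as the observed architectural elements.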
The mask generator 730 is configured to compute a binary mask from the average value for the architectural operand element stored in the averaging register 720. Computing the binary mask involves the mask generator 730 identifying a most significant set bit in the average value stored in the averaging register 720. Computing the binary mask also involves the mask generator 730 setting each bit between the most significant set bit and a least significant bit of the average value stored in the averaging register 720. The logic gate 740 is configured to generate a synthetic operand element having statistical similarity with the architectural operand element. Generating the synthetic operand element involves the logic gate 740 performing a bitwise AND logic operation on a binary mask provided by the mask generator 730 and on a pseudo random number provided by the pseudo-random number generator 616. The operand generator 140 can use the synthetic operand element generated by the logic gate 740 to generate a synthetic operand for the matrix multiplier array 127 to process during an idle cycle.
An output of each LFSR of the pseudo-random number generator 616 is coupled to an input of a different bit reverse register. For example, an output of the first LFSR 811 is coupled to an input of a first bit reverse register 821, an output of the second LFSR 812 is coupled to an input of a second bit reverse register 822, an output of the third LFSR 813 is coupled to an input of a third bit reverse register 823, and an output of the fourth LFSR 814 is coupled to an input of a fourth bit reverse register 824. Each bit reverse register of the pseudo-random number generator 616 can perform a bit reversal operation on a pseudo-random value provided at an input of the bit reverse register to provide a pseudo-random value at an output of the bit reverse register.
The pseudo-random number generator 616 also includes a logic circuit 830 with multiple logic gates.
For example, the first XOR gate 831 is configured to provide a first pseudo-random number (prng[n][31:0]) responsive to a bitwise XOR logic operation performed on data provided at an output of the first LFSR 811 and on data provided at an output of the first bit reverse register 821 that is driven by data provided at an output of the second LFSR 812. In another example, the second XOR gate 832 is configured to provide a second pseudo-random number (prng[n+1][31:0]) responsive to a bitwise XOR logic operation performed on data provided at an output of the second LFSR 812 and on data provided at an output of the second bit reverse register 822 that is driven by data provided at an output of the third LFSR 813.
In another example, the third XOR gate 833 is configured to provide a third pseudo-random number (prng[n+2][31:0]) responsive to a bitwise XOR logic operation performed on data provided at an output of the third LFSR 813 and on data provided at an output of the third bit reverse register 823 that is driven by data provided at an output of the fourth LFSR 814. In another example, the fourth XOR gate 834 is configured to provide a fourth pseudo-random number (prng[n+3][31:0]) responsive to a bitwise XOR logic operation performed on data provided at an output of the fourth LFSR 814 and on data provided at an output of the fourth bit reverse register 824 that is driven by data provided at an output of the first LFSR 811.
An LFSR having an output that provides data to a bitwise XOR logic operation of an XOR gate can form a pair of counter-rotating LFSRs with another LFSR that provides data for driving a bit reverse register that provides data to the bitwise XOR logic operation of the XOR gate. For example, the first LFSR 811 and the second LFSR 812 can form a pair of counter-rotating LFSRs with respect to the first XOR gate 831. In another example, the second LFSR 812 and the third LFSR 813 can form a pair of counter-rotating LFSRs with respect to the second XOR gate 832. In another example, the third LFSR 813 and the fourth LFSR 814 can form a pair of counter-rotating LFSRs with respect to the third XOR gate 833. In another example, the fourth LFSR 814 and the first LFSR 811 can form a pair of counter-rotating LFSRs with respect to the fourth XOR gate 834.
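A behavioral sketch of one counter-rotating pair follows; the 32-bit Galois LFSR form and the tap polynomial 0x80000057 are illustrative assumptions, not values specified by this description:

```python
def lfsr32_step(state, taps=0x80000057):
    """One 32-bit Galois LFSR step (taps are an assumed polynomial)."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= taps
    return state

def bit_reverse32(x):
    """Bit reverse register: mirror the 32-bit value end for end."""
    return int(f"{x:032b}"[::-1], 2)

def counter_rotating_pair(state_a, state_b):
    """One prng output: LFSR A XORed with the bit reversal of LFSR B,
    so the two shift directions effectively oppose each other (a sketch
    of the XOR gates 831-834)."""
    state_a = lfsr32_step(state_a)
    state_b = lfsr32_step(state_b)
    return state_a ^ bit_reverse32(state_b), state_a, state_b
```

Calling `counter_rotating_pair` repeatedly with the returned states yields a pseudo-random sequence; in the described generator, four LFSRs are combined pairwise this way to produce four such sequences.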
In at least one example, using counter-rotating LFSRs to provide pseudo-random numbers to the operand generator 140 for generating synthetic operands can reduce cycle-to-cycle correlation within a sequence of the pseudo-random numbers. Reducing such cycle-to-cycle correlation can mitigate electromagnetic interference (EMI) associated with performing matrix mathematical operations. In at least one example, using counter-rotating LFSRs to provide pseudo-random numbers to the operand generator 140 for generating synthetic operands can reduce die size by reducing a footprint of the pseudo-random number generator 616.
At time 912, an idle cycle 914 commences as the computations involving the MMA 120 stop. For example, the computations involving the MMA 120 may stop responsive to the MMA 120 receiving an opcode instruction from the processor 110 that defines a non-matrix mathematical operation. Between the active cycle 908 and the idle cycle 914, the waveform 902 decreases from the first power level 910 to a second power level 916. The second power level 916 approximates static leakage power of the MMA 120. Between the active cycle 908 and the idle cycle 914, the waveform 904 decreases from the first power level 910 to a third power level 918. While less than the first power level 910, the third power level 918 is higher than the second power level 916. Accordingly, a variance in power consumption by the MMA 120 with activity leveling enabled when transitioning between the active cycle 908 and the idle cycle 914 is less than a variance in power consumption by the MMA 120 with activity leveling disabled.
At time 920, an active cycle 922 commences as computations (e.g., matrix multiplications) involving the MMA 120 start. For example, the computations involving the MMA 120 may start responsive to the MMA 120 receiving an opcode instruction from the processor 110 that defines a matrix mathematical operation. Between the idle cycle 914 and the active cycle 922, the waveforms 902 and 904 each approach the first power level 910 that approximates full rate power of the MMA 120. Between the idle cycle 914 and the active cycle 922, the waveform 902 increases from the second power level 916 to the first power level 910. Between the idle cycle 914 and the active cycle 922, the waveform 904 increases from the third power level 918 to the first power level 910. The difference between the third power level 918 and the first power level 910 is less than the difference between the second power level 916 and the first power level 910. Accordingly, when transitioning between the idle cycle 914 and the active cycle 922, a variance in power consumption by the MMA 120 with activity leveling enabled is less than a variance in power consumption by the MMA 120 with activity leveling disabled. The diagram 900 shows that variations in power consumption by the MMA 120 when transitioning between active and idle cycles can be reduced by enabling activity leveling.
At time 1006, each implementation of the MMA 120 transitions from an active cycle 1008 to an idle cycle 1010 when computations (e.g., matrix multiplications) involving the MMA 120 stop. For example, the computations involving the MMA 120 may stop when the processor 110 asserts a stall signal provided to the MMA 120 responsive to the processor 110 encountering a stall condition, such as stall conditions related to program structure or transient resource dependencies (e.g., cache misses). Between the active cycle 1008 and the idle cycle 1010, the waveform 1002 decreases from a first power level 1012 to a second power level 1014. The first power level 1012 approximates full rate power of the MMA 120. The second power level 1014 approximates static leakage power of the MMA 120. Between the active cycle 1008 and the idle cycle 1010, the waveform 1004 decreases from the first power level 1012 to a third power level 1016. The difference between the first power level 1012 and the third power level 1016 is less than the difference between the first power level 1012 and the second power level 1014. Accordingly, when transitioning between the active cycle 1008 and the idle cycle 1010, a variance in power consumption by the MMA 120 with activity leveling enabled is less than a variance in power consumption by the MMA 120 with activity leveling disabled.
While examples are provided of an MMA 120 performing operations on synthetic operands, the principle of performing statistical analysis on a set of architectural operands to determine a corresponding set of synthetic operands to use during idle cycles applies equally to any suitable computational circuit, such as a CPU, a graphics processing unit (GPU), a fast Fourier transform (FFT) accelerator, a digital signal processor (DSP), or other signal processing circuit.
The term “couple” is used throughout the specification. The term may cover connections, communications, or signal paths that enable a functional relationship consistent with this description. For example, if device A generates a signal to control device B to perform an action, in a first example device A is coupled to device B, or in a second example device A is coupled to device B through intervening component C if intervening component C does not substantially alter the functional relationship between device A and device B such that device B is controlled by device A via the control signal generated by device A.
A device that is “configured to” perform a task or function may be configured (e.g., programmed and/or hardwired) at a time of manufacturing by a manufacturer to perform the function and/or may be configurable (or re-configurable) by a user after manufacturing to perform the function and/or other additional or alternative functions. The configuring may be through firmware and/or software programming of the device, through a construction and/or layout of hardware components and interconnections of the device, or a combination thereof.
Unless otherwise stated, “about,” “approximately,” or “substantially” preceding a value means +/−10 percent of the stated value. Modifications are possible in the described examples, and other examples are possible within the scope of the claims.
The present application claims priority to U.S. Provisional Patent Application No. 63/392,528, which was filed Jul. 27, 2022, is titled “IDLE-TIME TRANSIENT CURRENT MANAGEMENT,” and is hereby incorporated herein by reference in its entirety.