Embodiments described herein are related to computation engines that assist processors and, more particularly, to computation engines that evaluate transcendental functions.
A variety of workloads being performed in modern computing systems rely on significant use of transcendental functions. For example, certain long short term memory (LSTM) learning algorithms are used in a variety of contexts such as language detection, card readers, natural language processing, handwriting processing, and machine learning, among other things. LSTM processing includes numerous evaluations of select transcendental functions in the front end (initialization) portion of the processing, up to about 15% of the instructions executed.
A transcendental function is an analytic function that does not satisfy a polynomial equation. That is, a transcendental function cannot be expressed in terms of a finite sequence of the algebraic operations of addition, multiplication, and root extraction. Examples of transcendental functions include the exponential function, the logarithm, and the trigonometric functions (e.g. sine, cosine, etc.). Thus, accurate computation of transcendental functions over the entire valid input range is complex and time-consuming. However, if the entire input range is divided into intervals, the transcendentals can be approximated with high accuracy using relatively low-order polynomials. Different polynomials are used in different intervals. Thus, a high performance mechanism to select the polynomial for an input to the transcendental function and to evaluate the transcendental function can improve the performance of workloads that use significant amounts of transcendental function evaluation. The performance of such operations on a general purpose central processing unit (CPU) is often very low, while the power consumption is very high. Low performance, high power workloads are problematic for any computing system, but are especially problematic for battery-powered systems.
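The piecewise-polynomial approach described above may be illustrated in software. The following sketch (with hypothetical interval bounds and a degree-2 Taylor polynomial per interval; not the disclosed hardware itself) approximates exp(x) on [0, 1) by selecting the interval containing the input and evaluating that interval's low-order polynomial:

```python
import math

# Hypothetical illustration: approximate exp(x) on [0, 1) with a
# degree-2 polynomial per interval, expanded about each interval's
# midpoint. The bounds list plays the role of a range table.
BOUNDS = [0.0, 0.25, 0.5, 0.75, 1.0]           # b0 .. bN

def taylor_coeffs(mid):
    # Taylor coefficients of exp about `mid`: c0 + c1*t + c2*t^2, t = x - mid
    e = math.exp(mid)
    return (e, e, e / 2.0)

TABLE = [taylor_coeffs((lo + hi) / 2.0)
         for lo, hi in zip(BOUNDS, BOUNDS[1:])]

def approx_exp(x):
    # Select the interval containing x, then evaluate its polynomial.
    for i, (lo, hi) in enumerate(zip(BOUNDS, BOUNDS[1:])):
        if lo <= x < hi:
            mid = (lo + hi) / 2.0
            c0, c1, c2 = TABLE[i]
            t = x - mid
            return c0 + t * (c1 + t * c2)      # Horner form
    raise ValueError("input outside the table's range")
```

Even with only four intervals and a degree-2 polynomial, the approximation is accurate to roughly four decimal places over the covered range, illustrating why narrow intervals permit low-order polynomials.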
In an embodiment, a computation engine may offload work from a processor (e.g. a CPU) and efficiently perform transcendental functions. The computation engine may implement a range instruction that may be included in a program being executed by the CPU. The CPU may dispatch the range instruction to the computation engine. The range instruction may take an input operand (that is to be evaluated in a transcendental function, for example) and may reference a range table that defines a set of ranges for the transcendental function. The range instruction may identify one of the set of ranges that includes the input operand. For example, the range instruction may output an interval number identifying which interval of an overall set of valid input values contains the input operand. In an embodiment, the range instruction may take an input vector operand and output a vector of interval identifiers.
In an embodiment, the interval identifier(s) produced by the range instruction may be provided as index(es) into a lookup table. The lookup table may include, e.g. the coefficients for polynomials corresponding to each interval of a transcendental function, thereby selecting the polynomial for evaluation in the computation engine. While the range instruction may be used for transcendental function evaluation in one use case, such use is merely exemplary and numerous other uses of the range instruction are possible.
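The two-step flow described above — a range step producing interval identifiers, which then index a coefficient lookup table — may be sketched in software as follows. The bounds, coefficients, and the all-ones sentinel value are hypothetical; the sketch models only the instruction semantics, not the hardware:

```python
import bisect

# Hypothetical range table (b0 .. bN) and out-of-range sentinel.
BOUNDS = [0.0, 1.0, 2.0, 4.0]
OUT_OF_RANGE = 0xFF                        # e.g. a value of all binary ones

def range_op(vec):
    """Model of the range instruction: input vector in, interval ids out."""
    ids = []
    for x in vec:
        if BOUNDS[0] <= x < BOUNDS[-1]:
            # Index of the interval [b_i, b_i+1) containing x.
            ids.append(bisect.bisect_right(BOUNDS, x) - 1)
        else:
            ids.append(OUT_OF_RANGE)
    return ids

# Hypothetical per-interval polynomial coefficients (one entry per interval).
COEFFS = [(1.0, 0.5), (1.6, 0.9), (3.0, 1.2)]

vec = [0.3, 1.7, 3.9, 5.0]
ids = range_op(vec)
# Interval ids index the lookup table; out-of-range elements select nothing.
selected = [COEFFS[i] if i != OUT_OF_RANGE else None for i in ids]
```

Note that the last element (5.0) lies outside every interval and receives the sentinel identifier rather than a table index, consistent with the out-of-range handling described later in this disclosure.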
In an embodiment, determining intervals for input operands using the range instruction may contribute to a high performance, low power solution to various workloads executed by the CPU in a system. For example, the range instruction may be part of performing transcendental operations in certain workloads. LSTM workloads for machine learning tasks may benefit in the initialization section of the LSTM processing, in one particular use case. The initialization section may be up to 15% of the instructions executed to implement LSTM, as mentioned previously. For energy constrained systems (e.g. battery-operated mobile systems) and/or thermally-constrained systems (e.g. rack servers), improved performance and/or enhanced capabilities in the machine learning area may result.
The following detailed description makes reference to the accompanying drawings, which are now briefly described.
While embodiments described in this disclosure may be susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean including, but not limited to. As used herein, the terms “first,” “second,” etc. are used as labels for nouns that they precede, and do not imply any type of ordering (e.g., spatial, temporal, logical, etc.) unless specifically stated.
Within this disclosure, different entities (which may variously be referred to as “units,” “circuits,” other components, etc.) may be described or claimed as “configured” to perform one or more tasks or operations. This formulation—[entity] configured to [perform one or more tasks]—is used herein to refer to structure (i.e., something physical, such as an electronic circuit). More specifically, this formulation is used to indicate that this structure is arranged to perform the one or more tasks during operation. A structure can be said to be “configured to” perform some task even if the structure is not currently being operated. A “clock circuit configured to generate an output clock signal” is intended to cover, for example, a circuit that performs this function during operation, even if the circuit in question is not currently being used (e.g., power is not connected to it). Thus, an entity described or recited as “configured to” perform some task refers to something physical, such as a device, circuit, memory storing program instructions executable to implement the task, etc. This phrase is not used herein to refer to something intangible. In general, the circuitry that forms the structure corresponding to “configured to” may include hardware circuits. The hardware circuits may include any combination of combinatorial logic circuitry, clocked storage devices such as flops, registers, latches, etc., finite state machines, memory such as static random access memory or embedded dynamic random access memory, custom designed circuitry, analog circuitry, programmable logic arrays, etc. Similarly, various units/circuits/components may be described as performing a task or tasks, for convenience in the description. Such descriptions should be interpreted as including the phrase “configured to.”
The term “configured to” is not intended to mean “configurable to.” An unprogrammed FPGA, for example, would not be considered to be “configured to” perform some specific function, although it may be “configurable to” perform that function. After appropriate programming, the FPGA may then be configured to perform that function.
Reciting in the appended claims a unit/circuit/component or other structure that is configured to perform one or more tasks is expressly intended not to invoke 35 U.S.C. § 112(f) interpretation for that claim element. Accordingly, none of the claims in this application as filed are intended to be interpreted as having means-plus-function elements. Should Applicant wish to invoke Section 112(f) during prosecution, it will recite claim elements using the “means for” [performing a function] construct.
In an embodiment, hardware circuits in accordance with this disclosure may be implemented by coding the description of the circuit in a hardware description language (HDL) such as Verilog or VHDL. The HDL description may be synthesized against a library of cells designed for a given integrated circuit fabrication technology, and may be modified for timing, power, and other reasons to result in a final design database that may be transmitted to a foundry to generate masks and ultimately produce the integrated circuit. Some hardware circuits or portions thereof may also be custom-designed in a schematic editor and captured into the integrated circuit design along with synthesized circuitry. The integrated circuits may include transistors and may further include other circuit elements (e.g. passive elements such as capacitors, resistors, inductors, etc.) and interconnect between the transistors and circuit elements. Some embodiments may implement multiple integrated circuits coupled together to implement the hardware circuits, and/or discrete elements may be used in some embodiments. Alternatively, the HDL design may be synthesized to a programmable logic array such as a field programmable gate array (FPGA) and may be implemented in the FPGA.
As used herein, the term “based on” or “dependent on” is used to describe one or more factors that affect a determination. This term does not foreclose the possibility that additional factors may affect the determination. That is, a determination may be solely based on specified factors or based on the specified factors as well as other, unspecified factors. Consider the phrase “determine A based on B.” This phrase specifies that B is a factor that is used to determine A or that affects the determination of A. This phrase does not foreclose that the determination of A may also be based on some other factor, such as C. This phrase is also intended to cover an embodiment in which A is determined based solely on B. As used herein, the phrase “based on” is synonymous with the phrase “based at least in part on.”
This specification includes references to various embodiments, to indicate that the present disclosure is not intended to refer to one particular implementation, but rather a range of embodiments that fall within the spirit of the present disclosure, including the appended claims. Particular features, structures, or characteristics may be combined in any suitable manner consistent with this disclosure.
Turning now to
The computation engine 10 may be configured to perform one or more transcendental operations. Specifically, in an embodiment, the computation engine 10 may perform the low order polynomial evaluations corresponding to the transcendental operation, based on the interval that includes each input value to be evaluated. In an embodiment, the compute circuit 30 may perform the polynomial evaluations. The interval for each input value may be determined by executing a range instruction prior to an instruction to evaluate the polynomial. The range instruction may be performed by the range circuit 34. While the range circuit 34 and the compute circuit 30 are illustrated separately in
In one embodiment, the transcendental operations may be performed on vectors of input operands. For example, an embodiment receives vectors of operands (e.g. in the X memory 24 and the Y memory 26). The compute circuit 30 may include an array of circuits to perform the evaluation. Each circuit may receive vector elements from the X memory 24 or the Y memory 26, and may evaluate the polynomial corresponding to the selected vector element. Different vector elements may be included in different intervals. Accordingly, each circuit may receive the polynomial coefficients based on the interval identifier determined from a preceding range instruction.
In an embodiment, the computation engine 10 may support various data types and data sizes. For example, floating point and integer data types may be supported. The floating point data type may include 16 bit, 32 bit, and 64 bit sizes. The integer data types may include 16 bit and 32 bit sizes, and both signed and unsigned integers may be supported. Other embodiments may include a subset of the above sizes, additional sizes, or a subset of the above sizes and additional sizes (e.g. larger or smaller sizes).
In one embodiment, the large data sizes may include fewer intervals than the smaller data sizes of the same data type. That is, the number of intervals may be inversely dependent on the data size, where the maximum number of intervals decreases as the data size increases (and vice versa). In an embodiment, a range table that stores the bounds of the intervals may have a fixed size. Since the range bounds may be the same data size and data type to facilitate comparison, a range bound at a larger data size may consume more of the fixed size than a range bound at a smaller data size. Thus, more range bounds at the smaller data size may be stored in the range table.
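The inverse relationship between data size and interval count follows directly from the fixed table size. As a hypothetical illustration (the 128-byte table capacity is an assumption, not a disclosed value), N intervals require N + 1 bounds, so:

```python
# Hypothetical: a fixed 128-byte range table holds N + 1 bounds of the
# operand's data size, so the maximum interval count shrinks as the
# element size grows.
TABLE_BYTES = 128

def max_intervals(element_bytes):
    # N intervals require N + 1 bounds at the given element size.
    return TABLE_BYTES // element_bytes - 1
```

With these assumed numbers, 16-bit bounds permit 63 intervals, 32-bit bounds permit 31, and 64-bit bounds permit 15 — the inverse dependence described above.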
When the range instruction is used, e.g., to identify intervals for polynomial evaluation of transcendental functions, the input range may be limited in many cases such as LSTM initialization processing. Even though the data size can accommodate a larger range, the input for the given use case may be guaranteed to be in a subrange of the larger range. Additionally, argument reduction may be applied prior to polynomial approximation. The argument reduction may cause the reduced range to fall into ranges that may be identified via the range instruction.
Results for the polynomial evaluations may be stored in the Z memory 28. Similarly, results of the range instruction may be stored in the Z memory 28, or alternatively in one of the X memory 24 and/or Y memory 26. In an embodiment, the computation engine 10 may be configured to accumulate transcendental evaluations, and the current value in the Z memory 28 may be provided to the compute circuit 30 to be added to the result of the polynomial evaluation.
In an embodiment, the instructions executed by the computation engine 10 may also include memory instructions (e.g. load/store instructions). The load instructions may transfer vectors from a system memory (not shown) to the X memory 24, Y Memory 26, or Z memory 28. The store instructions may write the vectors from the Z memory 28 to the system memory. Other embodiments may also include store instructions to write vectors from the X and Y memories 24 and 26 to system memory. The system memory may be a memory accessed at the bottom of the cache hierarchy that includes the caches 14, 16, and 18. The system memory may be formed from a random access memory (RAM) such as various types of dynamic RAM (DRAM) or static RAM (SRAM). A memory controller may be included to interface to the system memory. In an embodiment, the computation engine 10 may be cache coherent with the processor 12. In an embodiment, the computation engine 10 may have access to the data cache 16 to read/write data. Alternatively, the computation engine 10 may have access to the lower level cache 14 instead, and the lower level cache 14 may ensure cache coherency with the data cache 16. In yet another alternative, the computation engine 10 may have access to the memory system, and a coherence point in the memory system may ensure the coherency of the accesses. In yet another alternative, the computation engine 10 may have access to the caches 14 and 16.
In some embodiments, the computation engine 10 may include a cache 32 to store data recently accessed by the computation engine 10. The choice of whether or not to include cache 32 may be based on the effective latency experienced by the computation engine 10 and the desired level of performance for the computation engine 10. The cache 32 may have any capacity, cache line size, and configuration (e.g. set associative, direct mapped, etc.).
In the illustrated embodiment, the processor 12 is responsible for fetching the range instructions and computation instructions and transmitting the instructions to the computation engine 10 for execution. The overhead of the “front end” of the processor 12 fetching, decoding, etc. the instructions may be amortized over the computations performed by the computation engine 10. In one embodiment, the processor 12 may be configured to propagate the instructions down the pipeline (illustrated generally in
Generally, an instruction may be non-speculative if it is known that the instruction is going to complete execution without exception/interrupt. Thus, an instruction may be non-speculative once prior instructions (in program order) have been processed to the point that the prior instructions are known to not cause exceptions/speculative flushes in the processor 12 and the instruction itself is also known not to cause an exception/speculative flush. Some instructions may be known not to cause exceptions based on the instruction set architecture implemented by the processor 12 and may also not cause speculative flushes. Such instructions become non-speculative as soon as the prior instructions have been determined to be exception-free and flush-free.
In the case of memory instructions that are to be transmitted to the computation engine 10, the processing in the processor 12 may include translating the virtual address of the memory operation to a physical address (including performing any protection checks and ensuring that the memory instruction has a valid translation).
The instruction buffer 22 may be provided to allow the computation engine 10 to queue instructions while other instructions are being performed. In an embodiment, the instruction buffer 22 may be a first in, first out buffer (FIFO). That is, the computation instructions may be processed in program order. Other embodiments may implement other types of buffers.
The X memory 24 and the Y memory 26 may each be configured to store at least one vector of input operands defined for the range instruction. Similarly, the Z memory 28 may be configured to store at least one computation result. The result may be an array of results at the result size (e.g. 16 bit elements or 32 bit elements). In some embodiments, the X memory 24 and the Y memory 26 may be configured to store multiple vectors and/or the Z memory 28 may be configured to store multiple result vectors. Each vector may be stored in a different bank in the memories, and operands for a given instruction may be identified by bank number.
The processor 12 fetches instructions from the instruction cache (ICache) 18 and processes the instructions through the various pipeline stages 20A-20N. The pipeline is generalized, and may include any level of complexity and performance enhancing features in various embodiments. For example, the processor 12 may be superscalar and one or more pipeline stages may be configured to process multiple instructions at once. The pipeline may vary in length for different types of instructions (e.g. ALU instructions may have schedule, execute, and writeback stages while memory instructions may have schedule, address generation, translation/cache access, data forwarding, and miss processing stages). Stages may include branch prediction, register renaming, prefetching, etc.
Generally, there may be a point in the processing of each instruction at which the instruction becomes non-speculative. The pipeline stage 20M may represent this stage for computation instructions, which are transmitted from the non-speculative stage to the computation engine 10. The retirement stage 20N may represent the stage at which a given instruction's results are committed to architectural state and can no longer be “undone” by flushing the instruction or reissuing the instruction. The instruction itself exits the processor at the retirement stage, in terms of the presently-executing instructions (e.g. the instruction may still be stored in the instruction cache). Thus, in the illustrated embodiment, retirement of computation engine instructions occurs when the instruction has been successfully transmitted to the computation engine 10.
The instruction cache 18 and data cache (DCache) 16 may each be a cache having any desired capacity, cache line size, and configuration. Similarly, the lower level cache 14 may have any capacity, cache line size, and configuration. The lower level cache 14 may be any level in the cache hierarchy (e.g. the last level cache (LLC) for the processor 12, or any intermediate cache level).
Turning now to
When a range instruction is executed in the computation engine 10, the range circuit 34 may determine which interval I0 to IN−1 includes each vector element, and may output an identifier for the interval in the same vector position as the vector element in the output vector.
As the example in
The range table 40 may be a separate table provided to the range circuit 34, or may be an entry in the X memory 24 or Y memory 26. In an embodiment, the range table 40 may be sourced from the same memory 24 or 26 as the input vector 44 for the range operation.
The range bounds may form a set of non-overlapping intervals between b0 and bN. However, depending on the values of b0 and bN and the potential input values to the transcendental function, there may be input values that are not included in any of the intervals (e.g. values less than b0 and values greater than or equal to bN). The range instruction may be defined to cause an output of a value that does not identify any of the intervals (e.g. a value of all binary ones). This value may be used to identify vector elements that are not evaluated via the polynomials, for example. In other embodiments, depending on the values of b0 to bN, one or more intervals may overlap.
Furthermore, an input vector 62 shown in
A multiplexor (mux) 64 is shown in
It is noted that different implementations of determining the range and the corresponding polynomial coefficients for a transcendental function and evaluating the function may be used.
The computation engine 10 may evaluate a variety of transcendental functions. The range table 40 and the lookup table 60 may be programmed for a given transcendental function, and then reprogrammed for a different transcendental function, as desired.
Turning now to
As illustrated at reference numeral 70, the operation illustrated in
The computation engine 10 may find the first interval containing the element, where the intervals are defined in the range table 40 (block 72). The intervals may be viewed as ordered from left to right as shown in
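The first-match selection described above may be sketched as an ordered scan: the intervals are tested left to right, and the identifier of the first interval containing the element is returned. Because the scan is ordered, overlapping intervals (where permitted) resolve to the leftmost match; an element contained in no interval receives the all-ones identifier. The bounds below are illustrative:

```python
# Sketch of first-match interval selection. Intervals are [b_i, b_i+1),
# scanned in order; ALL_ONES marks an element outside every interval.
ALL_ONES = 0xFF

def first_interval(x, bounds):
    for i in range(len(bounds) - 1):
        if bounds[i] <= x < bounds[i + 1]:
            return i                 # identifier of first containing interval
    return ALL_ONES                  # not contained in any interval
```

A sequential scan is shown for clarity; hardware may compare the element against all bounds in parallel and priority-encode the result, which yields the same first-match semantics.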
The memory operations for the computation engine 10 may include load and store instructions. Specifically, in the illustrated embodiment, there are load and store instructions for the X, Y, and Z memories, respectively. In the case of the Z memory 28, a size parameter may indicate which element size is being used and thus which rows of the Z memory are written to memory or read from memory (e.g. all rows, every other row, every fourth row, etc.). In an embodiment, the X and Y memories may have multiple banks for storing different vectors. In such an embodiment, there may be multiple instructions to read/write the different banks or there may be an operand specifying the bank affected by the load/store X/Y instructions. In each case, an X memory bank may store a pointer to memory from/to which the load/store is performed. The pointer may be virtual, and may be translated by the processor 12 as discussed above. Alternatively, the pointer may be physical and may be provided by the processor 12 post-translation.
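One possible mapping from the size parameter to the Z-memory rows touched may be sketched as follows. The row count, base element size, and the assumption that the row stride grows with the element size are all hypothetical, chosen only to illustrate the all-rows / every-other-row / every-fourth-row pattern mentioned above:

```python
# Hypothetical illustration of the Z-memory size parameter: assuming
# an 8-row Z memory and a 16-bit base element size, larger elements
# touch every other row, every fourth row, etc.
NUM_ROWS = 8

def z_rows_touched(element_size_bits, base_size_bits=16):
    stride = element_size_bits // base_size_bits
    return list(range(0, NUM_ROWS, stride))
```

Under these assumptions, 16-bit elements touch all rows, 32-bit elements every other row, and 64-bit elements every fourth row.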
The range instruction may determine the interval for each vector element in the vector in X memory entry Xn. A vector from a Y memory entry (e.g. Yn) may also be specified. Additionally, a source for the range table may be specified (implicitly or explicitly as an operand of the instruction). If the range table is explicitly specified, multiple range tables may be in the X memory 24 and Y memory 26 concurrently. Thus, for example, range tables for multiple different transcendental operations may be stored.
The compute instruction may perform a computation on the vector elements in the X and Y vectors and may sum the resulting elements with the corresponding elements of the Z memory 28, in some embodiments. For example, in the case of a transcendental evaluation, the polynomial coefficients corresponding to each vector element may be multiplied by that vector element and the multiplication results may be summed to evaluate the polynomial for that vector element. Other compute instructions may be defined in various embodiments (e.g. a matrix multiply operation, etc.). The optional table operand may specify the lookup table if the input vectors use elements that are smaller than the implemented size.
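The per-element evaluate-and-accumulate behavior described above may be sketched as follows. The sketch assumes each element already has its interval's coefficients selected (highest-degree coefficient first) and uses Horner's form for the polynomial; the accumulation with the current Z value matches the accumulate behavior described earlier:

```python
# Sketch of the compute step: evaluate each element's selected
# polynomial (Horner form, highest-degree coefficient first) and
# accumulate the result with the corresponding Z element.
def compute_accumulate(x_vec, coeff_per_elem, z_vec):
    out = []
    for x, coeffs, z in zip(x_vec, coeff_per_elem, z_vec):
        acc = 0.0
        for c in coeffs:             # acc = (...((c_n)*x + c_n-1)*x ...) + c_0
            acc = acc * x + c
        out.append(z + acc)          # accumulate with current Z value
    return out
```

For example, coefficients (1, 3, 5) with input 2 evaluate the polynomial x² + 3x + 5 = 15, which is then added to the current Z element.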
The peripherals 154 may include any desired circuitry, depending on the type of system 150. For example, in one embodiment, the system 150 may be a computing device (e.g., personal computer, laptop computer, etc.), a mobile device (e.g., personal digital assistant (PDA), smart phone, tablet, etc.), or an application specific computing device capable of benefiting from the computation engine 10 (e.g., neural networks, LSTM networks, other machine learning engines including devices that implement machine learning, etc.). In various embodiments of the system 150, the peripherals 154 may include devices for various types of wireless communication, such as WiFi, Bluetooth, cellular, global positioning system, etc. The peripherals 154 may also include additional storage, including RAM storage, solid state storage, or disk storage. The peripherals 154 may include user interface devices such as a display screen, including touch display screens or multitouch display screens, keyboard or other input devices, microphones, speakers, etc. In other embodiments, the system 150 may be any type of computing system (e.g. desktop personal computer, laptop, workstation, net top, etc.).
The external memory 158 may include any type of memory. For example, the external memory 158 may be SRAM, dynamic RAM (DRAM) such as synchronous DRAM (SDRAM), double data rate (DDR, DDR2, DDR3, etc.) SDRAM, RAMBUS DRAM, low power versions of the DDR DRAM (e.g. LPDDR, mDDR, etc.), etc. The external memory 158 may include one or more memory modules to which the memory devices are mounted, such as single inline memory modules (SIMMs), dual inline memory modules (DIMMs), etc. Alternatively, the external memory 158 may include one or more memory devices that are mounted on the IC 152 in a chip-on-chip or package-on-package implementation.
Generally, the electronic description 162 of the IC 152 stored on the computer accessible storage medium 160 may be a database which can be read by a program and used, directly or indirectly, to fabricate the hardware comprising the IC 152. For example, the description may be a behavioral-level description or register-transfer level (RTL) description of the hardware functionality in a high level design language (HDL) such as Verilog or VHDL. The description may be read by a synthesis tool which may synthesize the description to produce a netlist comprising a list of gates from a synthesis library. The netlist comprises a set of gates which also represent the functionality of the hardware comprising the IC 152. The netlist may then be placed and routed to produce a data set describing geometric shapes to be applied to masks. The masks may then be used in various semiconductor fabrication steps to produce a semiconductor circuit or circuits corresponding to the IC 152. Alternatively, the description 162 on the computer accessible storage medium 160 may be the netlist (with or without the synthesis library) or the data set, as desired.
While the computer accessible storage medium 160 stores a description 162 of the IC 152, other embodiments may store a description 162 of any portion of the IC 152, as desired (e.g. the computation engine 10 and/or the processor 12, as mentioned above).
Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.