The present disclosure generally relates to computer processor architecture, and more specifically to instruction dispatch mechanisms within a computer processor architecture.
A processor faces many challenges that may decrease its efficiency. For example, instructions need to be decoded, and data for the instructions needs to be retrieved from cache or memory. The decoding of instructions and retrieval of data add latency to the overall execution of the instructions. Processors generally include a collection of computational elements along with a mechanism to move data to the inputs, and from the outputs, of the computational elements. Processors also have a mechanism for dispatching instructions to the computational elements at the correct time relative to the data input and output.
Prior processor architectures have used instruction dispatch mechanisms that are inefficient for several reasons, such as: inefficient duplication and redundancy of instruction and data circuitry, inconvenient alignment of the timing relationship between the instruction bus and data bus, large storage overhead in main memory code space, and high channel bandwidth required to transfer instructions from main memory to the processor. Typically, such processor architectures use a large area of the semiconductor die as a data cache memory and a second large area of the die as an instruction cache as part of the processor system in an attempt to mitigate some of these shortcomings of the architecture.
Embodiments are directed to a processor comprising a computational array of computational elements and an instruction dispatch circuit. The computational elements receive data operands via data lanes extending along a first dimension, and process the operands based upon instructions received from the instruction dispatch circuit via instruction lanes extending along a second dimension. The instruction dispatch circuit receives raw instructions, and comprises an instruction dispatch unit (IDU) processor that processes a set of raw instructions to generate processed instructions for dispatch to the computational elements, where the number of processed instructions is not equal to the number of instructions of the set of raw instructions. The processed instructions are dispatched to columns of the computational array via a plurality of instruction queues, wherein an instruction vector of instructions is shifted between adjacent instruction queues in a first direction, and instructions are dispatched from the instruction queues to the computational elements in a second direction.
In accordance with some embodiments, a processor comprises a computational array comprising an array of computational elements and an instruction dispatch circuit configured to provide instructions to the computational elements of the computational array. The instruction dispatch circuit comprises an instruction buffer memory configured to receive, during each of a plurality of compute cycles, a set of input instructions, and an instruction dispatch unit (IDU) processor. The IDU processor is configured to, during each of the plurality of compute cycles, process the set of input instructions received by the instruction buffer memory to generate a plurality of processed instructions, wherein a number of instructions of the plurality of processed instructions is not equal to a number of instructions of the set of input instructions. The instruction buffer memory further comprises a plurality of memory output locations connected to respective sets of computational elements of the computational array, from which processed instructions are dispatched to the computational elements of the computational array. In some embodiments, the input instructions are raw instructions, and the IDU processor processes the input instructions by performing unpacking, decompression, and/or decryption on the input instructions. In some embodiments, the IDU processor generates a set of duplicate instructions based upon an input instruction.
In accordance with some embodiments, a processor comprises an arrangement of computational elements, and a memory storing data operands configured to, during each of a plurality of time intervals, provide stored data operands to the arrangement of computational elements via one or more data lines that flow in a first direction. The processor further comprises an instruction dispatch circuit configured to receive an array of instructions, and configured to, during each of the plurality of time intervals, shift one or more instruction vectors of the array of instructions along the first direction parallel to the direction of flow of data in the processor, and shift the one or more instruction vectors in a second direction perpendicular to the flow direction to provide an instruction from each instruction vector to the arrangement of computational elements. The arrangement of computational elements is configured to process data operands provided from the memory based upon the provided instructions.
The figures depict embodiments of the present disclosure for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles, or benefits touted, of the disclosure described herein.
Embodiments are directed to processor configurations that provide an efficient mechanism for dispatching instructions to computational elements, and improve on many of the shortcomings of prior instruction dispatch mechanisms.
Embodiments are directed to a processor configuration having multiple arrays of computational elements. Instructions enter into a queue associated with each array, and are sequentially applied to a subset of the computational elements of each array. The instructions are then shifted across a different subset of the computational elements of each array.
In some embodiments, a processor comprises a plurality of computational elements (CEs), also referred to as tiles, which may correspond to arithmetic-logic units (ALUs), multiplier-accumulators (MACs), permutation units, or a memory array, by way of example. Data operands to be processed by the processor flow on a temporal axis, and instructions to control the operation of the compute elements using the data operands are dispatched along a spatial axis of the processor array. To illustrate, the data operands initially flow along the temporal axis from the memory to the array. Results may flow along the temporal axis to other elements of the computational array or may be returned to memory. Instructions, either Single Instruction Multiple Data (SIMD) or Multiple Instruction Multiple Data (MIMD), are dispatched to columns of processing elements. In other embodiments, instructions are dispatched to a partial column of processing elements. In both embodiments, instructions are dispatched along the spatial axis. An array of instructions is presented to the processor during each time increment and shifted along a first direction parallel to the primary direction of flow of the rows of data in the processor, while concurrently select portions of the instruction vector may be shifted in a second direction perpendicular to the flow of data in the processor. It is within the scope of the disclosed configuration to allow, during certain timing increments, the instructions to move only in the row direction, or, during other timing increments, only in the column direction, to further improve the flexibility and efficiency of instruction dispatch.
In some embodiments, the instructions flow in a first direction, called the temporal direction because they generally move in step with a timing increment. Concurrently, a subset of the instructions also moves in a second direction, perpendicular to the first direction, called the spatial direction because flow in the spatial direction moves the instructions to be dispatched from or to a different space in the array of computational elements in the processor.
In some embodiments, a compiler for the processor calculates timing of the hardware configuration of the processor, and configures the timing of data and instruction flows such that corresponding data and instructions are received at each computational element with a predetermined temporal relationship (e.g., during the same cycle, separated by a predetermined delay, etc.). The predetermined temporal relationship may be based upon the hardware of the processor, a type of instruction, and/or the like. Because the temporal relationship between when data and instructions are received is known, the operand data received by a tile may not need to include any metadata indicating what the data is to be used for. Instead, each tile receives instructions, and based upon the predetermined timing, performs the instruction on the corresponding data that is received in accordance with the predetermined timing, allowing for the data and instructions to flow through the processor more efficiently. For example, because the data and instructions to the computational elements of the processor flow with predetermined timing (e.g., as specified by the compiler), the flow of data and instructions within the processor is deterministic. Due to this determinism, the processor may not need to contain any arbitration logic and/or traffic control logic. Instead, the data and instructions may automatically move along respective data and instruction lanes to reach the computational elements of the processor, which perform instructions on the received data based upon the timing at which the data and instruction reach the computational element. In some embodiments, the flow of data and the flow of instructions within the processor occur separately and independently from each other, and the computational elements of the processor are able to perform the intended instructions on the correct data due to the predetermined timing at which the data and instructions are received, as determined by the compiler.
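To make the compiler-determined timing concrete, the following is a minimal Python sketch; the schedules, operation names, and CE identifiers are hypothetical and not actual compiler output. It shows the key property described above: each computational element simply applies whatever instruction arrives in a cycle to whatever operand arrives in that same cycle, with no operand metadata, arbitration, or traffic control.

```python
# Hypothetical compile-time schedules: (cycle, ce_id) -> operand value / instruction.
data_schedule = {(0, "CE0"): 3, (1, "CE1"): 3}
instr_schedule = {(0, "CE0"): "add1", (1, "CE1"): "mul2"}

# Illustrative operations a CE might perform.
OPS = {"add1": lambda x: x + 1, "mul2": lambda x: x * 2}

def run(cycles=2):
    results = {}
    for t in range(cycles):
        for ce in ("CE0", "CE1"):
            if (t, ce) in data_schedule and (t, ce) in instr_schedule:
                # No metadata travels with the operand: correctness relies purely on
                # the compiler having aligned the data and instruction schedules.
                op = OPS[instr_schedule[(t, ce)]]
                results[(t, ce)] = op(data_schedule[(t, ce)])
    return results

print(run())  # {(0, 'CE0'): 4, (1, 'CE1'): 6}
```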
Architectural Overview
As illustrated in
The control circuit 106 is configured to control operations of the memory 102 and instruction dispatch circuit 108 to provide data and instructions to the computational array 104. In some embodiments, the control circuit 106 receives a compiled program from the compiler, which specifies an order and timing of data and instructions to be provided to the computational array 104. Because the compiler calculates timing based on the hardware configuration of the processor 100, the compiler compiles the program such that corresponding data and instructions are received at each computational element of the computational array 104 with a predetermined temporal relationship. As the data and instructions to the computational elements of the processor flow with predetermined timing (e.g., as determined by the compiler), the flow of data and instructions within the processor is deterministic and repeatable.
The memory 102 is configured to store data to be processed by the array of computational elements 104. In some embodiments, the data comprises data operands corresponding to weights or activations used to implement a model (e.g., a machine learning model). In some embodiments, the memory 102 comprises an array of memory tiles, each memory tile comprising an SRAM for on-chip storage. In some embodiments, the array of memory tiles is arranged in a plurality of columns and a plurality of rows corresponding to rows of the computational array 104. The rows of memory tiles may transmit data to the rows of the computational array 104 via respective data lanes (not shown). In addition, the memory tiles may receive computed results data generated by the computational array 104. The received results data may be stored back in the memory tiles or into other memory tiles of the memory 102, and may be used as data for subsequent computation.
In some embodiments, the memory 102 may be coupled to lane switching circuitry (not shown) allowing for data to be routed from one row of the memory or computational array to any other row. In some embodiments, the lane switching circuitry is implemented as a crossbar switch.
In some embodiments, the memory 102 may further store instructions to be executed on the data by the computational elements 104. For example, during operation of the processor 100, the instruction dispatch circuit 108 may retrieve instruction data from the memory 102, and use the retrieved instructions to populate one or more instruction queues, which provide instructions to columns of computational elements of the computational array 104 via respective instruction lanes.
The computational array 104 comprises an array of computational elements (CEs), such as arithmetic-logic units (ALUs), multiplier-accumulators (MAC), and/or the like. In some embodiments, the computational elements of the computational array 104 are uniform and are configured to be able to perform the same types of instructions on received data operands. In other embodiments, the computational array 104 comprises different types of computational elements able to perform different types of instructions. In some embodiments, the computational array 104 corresponds to a systolic array used for matrix multiplication, performing convolution, and/or the like. In some embodiments, a computational array is used to implement a machine learning model.
The computational elements of the computational array 104 are configured to receive data operands from the memory 102 in a first, horizontal direction. For example, data operands are transmitted from the memory 102 through data lines that run along each row of the computational array 104. In some embodiments, a computational element is configured to, over a time period (e.g., a clock cycle or compute cycle), receive a data operand via the data lane corresponding to the row of the computational element, perform operations on the received data in accordance with received instructions, and output a result to the data lane to be processed by a subsequent computational element or to be returned for storage in the memory 102. As such, a data operand from the memory 102 may “travel” down a row of computational elements over a plurality of time periods/cycles, being processed by a different computational element during each time period/cycle. Operations performed by a computational element on received data operands are discussed in greater detail below in relation to
In some embodiments, the computational elements of the computational array 104 are configured to receive instructions via instruction lanes (not shown in
The instruction dispatch circuit 108 dispatches instructions (e.g., retrieved from the memory 102) to one or more instruction queues. In some embodiments, the instruction dispatch circuit 108 comprises a respective instruction queue for each column of the computational array 104. Each instruction queue is configured to provide instructions to the computational elements of a corresponding column via a respective instruction lane. In addition, in some embodiments, each instruction queue may be configured to shift its enqueued instructions to another instruction queue (e.g., an adjacent instruction queue in the row direction). Thus, the instruction dispatch circuit 108 may be configured to shift instructions in two directions, e.g., in the row direction to be provided to instruction queues of subsequent columns of the computational array, and in the column direction to be provided to the computational elements of a column of the computational array. Additional details relating to the instruction dispatch circuit and how the instruction dispatch circuit populates the one or more instruction queues are discussed below in relation to
Instruction Flow
In addition,
In some embodiments, over each subsequent time period, the instruction dispatch circuit 108 shifts the instructions of each instruction queue to a next instruction queue (e.g., in the row direction), as well as causes each instruction queue to provide/dequeue a first instruction to the computational elements of its respective column (e.g., in the column direction).
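The two-direction movement described above can be illustrated with a minimal Python sketch; the queue layout, instruction names, and per-cycle ordering are assumptions for illustration only. Each compute cycle, every queue's instruction vector is shifted to the adjacent queue in the row direction, and each queue dequeues its head instruction toward its column of computational elements.

```python
from collections import deque

NUM_COLUMNS = 4
queues = [deque() for _ in range(NUM_COLUMNS)]            # one instruction queue per column
next_instructions = deque([["LOAD", "MUL", "ADD"],          # instruction vectors waiting to enter
                           ["LOAD", "MUL", "SUB"]])

def step():
    """One compute cycle: shift in the row direction, then dispatch in the column direction."""
    # Shift each queue's contents to the next queue (the last queue's contents retire).
    for col in range(NUM_COLUMNS - 1, 0, -1):
        queues[col] = queues[col - 1]
    queues[0] = deque(next_instructions.popleft()) if next_instructions else deque()

    # Dequeue the head of every queue and dispatch it to that queue's column.
    return [q.popleft() if q else None for q in queues]

for cycle in range(3):
    print(f"cycle {cycle}: dispatched {step()}")
```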
As such, the computational elements of the computational array 104 are configured to receive data operands (e.g., from the memory 102) via data lines extending along the rows of the computational array in a first direction (e.g., horizontal direction), and receive instructions (e.g., from the instruction dispatch circuit 108) via instruction lines extending along the columns of the computational array in a second direction (e.g., vertical direction). In some embodiments, each computational element of a first column of the computational array receives a respective data operand from the memory 102, processes the received data operand based on a received instruction, and passes the result to a next computational element in its row. Thus, the timing of data and instruction transmission may be configured such that a first set of data operands (e.g., set of data operands received by the computational elements of the first column of the computational array 104) will be processed based on each instruction of the first instruction vector in sequence. For example, in relation to
It is understood that in some embodiments, operations of the instruction dispatch circuit 108 may be performed in a different order than that described above. For example, while the above figures illustrate each instruction vector shifted to a next instruction queue and then dequeuing a first instruction, it is understood that in other embodiments, the instruction dispatch circuit 108 may cause each instruction queue to dequeue its first instruction before shifting the instruction vector to a next queue.
In addition, while
In embodiments where instructions are received by the computational elements of each column in a staggered fashion, transmission of data operands via data lanes across the rows of the computational array may be similarly staggered (e.g., with transmission of data operands to computational elements of upper rows occurring after transmission of data operands to lower rows) to preserve the relative timing at which each computational element receives data and instructions. In addition, it is understood that in other embodiments, a dequeued instruction may be provided to a set of computational elements of the column each cycle (instead of a single computational element), and transmitted via the instruction lane to subsequent sets of computational elements of the column over subsequent cycles.
As such, whether instructions are dispatched to an entire column at once or to a set of one or more computational elements at a time, the order of operations performed by the instruction dispatch circuit 108 may remain the same (e.g., shifting instructions to a subsequent queue, and dispatching/dequeuing a first instruction in each queue each cycle). Instead, the timing of data operand dispatch across the rows of the computational array is adjusted to be staggered in a manner that matches the transmission of dequeued instructions along columns of the computational array, to preserve the relative timing at which each computational element receives data operands and instructions. Under this configuration, each computational element maintains the ability to process a received data operand each cycle based on a received instruction.
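A minimal sketch of this staggered timing relationship is shown below, assuming (hypothetically) that a dequeued instruction propagates up a column one computational element per cycle and that the matching operands are launched with the same per-row offset, so every computational element still sees its instruction and its operand together.

```python
def arrival_cycles(dispatch_cycle, num_rows):
    """Cycle at which each row's CE sees the dispatched instruction and its matching operand."""
    return [(row, dispatch_cycle + row, dispatch_cycle + row) for row in range(num_rows)]

for row, instr_cycle, data_cycle in arrival_cycles(dispatch_cycle=5, num_rows=4):
    assert instr_cycle == data_cycle  # relative timing preserved despite the stagger
    print(f"row {row}: instruction at cycle {instr_cycle}, operand at cycle {data_cycle}")
```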
While the above figures illustrate, for purpose of example, instructions dispatched by the instruction dispatch circuit 108 as arithmetic instructions, it is understood that the types of instructions that may be managed by the instruction dispatch circuit 108 and processed by the computational elements may include other types of instructions, such as logic instructions, pass-through instructions, etc.
As shown in the sequence of figures above, the instructions in the “Next Instructions” array 206 move one position to the right in the row direction during each timing increment, while concurrently the instructions after the first column also move one position up in the column direction during each timing increment. The figures further exemplify the dispatch of instructions to the computational elements in the array of ALUs that comprise the processor.
It may be appreciated that prior approaches that do not make use of this configuration may have used control line configuration mechanisms instead of the inventive instruction queue to determine the behavior of the computational array. The control line configuration approach has limited flexibility that allows for configuration but not programmability. The instruction dispatch mechanism of this configuration increases the functionality of the processor block relative to prior approaches because new instructions can be shifted in along the row direction, or they can be shifted in along the column direction, or the instructions can be shifted in with concurrent row direction and column direction movement. It is a benefit of the present configuration that instructions can be applied to the processor at a rapid rate, allowing for greater flexibility and programmability, where the time increment between the application of different instructions may be faster than reconfiguration mechanisms that may have existed in prior approaches.
While the above figures illustrate the instruction vectors in each instruction queue shifted to a next instruction queue each cycle, in other embodiments, instructions of each instruction vector are shifted only in the column direction. In such embodiments, instead of enqueuing next instructions from the “Next Instructions” array 206 as a new instruction vector into the first instruction queue 204 (and shifting each instruction vector to a next queue), the instruction dispatch circuit enqueues additional instructions onto the instruction vectors for each queue, maintaining a length of each queue as instructions are dispatched from the queue to the computational array.
Process Flow
The processor executes the compiled program to provide data operands and instructions to the computational array with the predetermined timing as specified by the program. At 820, the processor provides data operands to the computational array via data lines extending in a first direction. In some embodiments, the first direction corresponds to a row direction. In some embodiments, for each row of the computational array, data operands from memory are provided to a first computational element of the row for processing based on a received instruction, the results of which may then be provided to a next computational element along the data line (e.g., next computational element of the row) for processing based on a subsequent instruction. As such, each data operand may “travel” across the computational array, being processed in accordance with received instructions at each computational element.
The processor further concurrently provides instructions to the computational array. At 830, the processor provides instructions to an instruction dispatch circuit as a plurality of instruction vectors. In some embodiments, the plurality of instruction vectors may correspond to an instruction array. The instruction dispatch circuit may manage the provision of instructions to the computational array using a plurality of instruction queues. In some embodiments, each instruction queue is configured to provide instructions to a respective subset (e.g., a respective column) of the computational array. At 840, the instruction dispatch circuit shifts an instruction vector to an instruction queue in the first direction. For example, the instruction dispatch circuit may provide an instruction vector to a first instruction queue. In addition, instructions within the instruction queue are shifted to a next instruction queue in the first direction (e.g., from the first instruction queue corresponding to a first column of the computational array to a second instruction queue corresponding to a second column). At 850, the instruction dispatch circuit further causes the instruction queues to provide instructions to the computational array via instruction lines extending in the second direction. The second direction is perpendicular to the first direction, and may correspond to a column direction. In some embodiments, each instruction queue dequeues a first instruction to be provided to the computational array. Each instruction may be provided to all computational elements within a column of the array, or to a subset thereof.
The provision of data operands and instructions to the computational array is performed with predetermined timing. At 860, the computational elements of the computational array process data operands received via the data lines from the first direction, based on instructions received via the instruction lines from the second direction, with a predetermined timing relationship (e.g., received during a same cycle, or with a predetermined timing offset).
It is understood that while the above describes certain structures and methods for managing the flow of data and instructions in the processor, different configurations for providing data and instructions may be used in other embodiments. For example, the specific mechanisms for providing data operands and instructions to computational elements may be different and may be configured independently from each other. However, so long as the timing of how the data operands and instructions are provided is known by the compiler, the compiler may be able to construct a program to maintain the relative timings at which the computational elements receive data operands and instructions, maintaining the deterministic and repeatable nature of the program.
Computational Element Operations
As illustrated in
As shown in
The CEs of the column 904 are configured to process received data (e.g., loaded onto the registers 908 and 910 via respective data lanes) based upon received instructions 914. As discussed above, in some embodiments the instruction 914 may be provided to a plurality of CEs of the column 904 during a single cycle, while in other embodiments, the instruction 914 may be provided to a first CE of the column (e.g., CE 0) during a first cycle, which propagates the instruction to a next CE of the column during a subsequent cycle (such that each CE of the column receives the instruction in a staggered fashion). Although
At time T2, the CE 1002 loads an operand value OP1 transmitted along the data lane in accordance with a received operand load instruction. Similar to the weight load instruction discussed above, the operand load instruction instructs the CE 1002 to load data transmitted along the data lane at a particular time as an operand value, but does not indicate or identify the actual value OP1. The CE 1002 loads the operand value OP1 onto a register 1008. As illustrated in
At time T3, the CE 1002 processes the weight value W1 and operand value OP1 from the registers 1008 and 1010 in accordance with a data processing instruction to generate a result value R1. The generated result value R1 may be stored in a results register 1012. The data processing instruction may comprise a logical instruction, arithmetic instruction, or some combination thereof. In some embodiments, the data processing instruction may simply be a bypass instruction, in which the CE 1002 outputs the same operand value OP1 without processing.
At time T4, the CE 1002 outputs the result value R1 onto the data lane in accordance with a result output instruction. The result value R1 may be transmitted along the data lane to be received by another CE within the processor (e.g., another CE in the same row, or another CE in a different row via lane-switching circuitry). In addition, because the CE 1002 simply receives data transmitted along the data lane without needing to request or fetch the data from memory, the CE 1002 may also load a next operand value OP2 from the data lane in parallel with outputting the result value R1. As such, the CE 1002 is able to load a subsequent operand value as it outputs a result value, allowing for the operand values to be “streamed” through the CEs of the computational array.
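The per-cycle behavior described above (weight load, operand load, compute, result output with a parallel operand load) can be summarized in a small Python sketch; the class, register names, and instruction mnemonics are hypothetical. The CE holds weight, operand, and result registers and executes whatever instruction arrives each cycle against whatever value is currently on its data lane.

```python
class ComputationalElement:
    def __init__(self):
        self.weight = None    # register holding a loaded weight value
        self.operand = None   # register holding a loaded operand value
        self.result = None    # results register

    def cycle(self, instruction, lane_in):
        """Execute one instruction against the value on the data lane; return the lane output."""
        lane_out = None
        if instruction == "LOAD_WEIGHT":
            self.weight = lane_in
        elif instruction == "LOAD_OPERAND":
            self.operand = lane_in
        elif instruction == "MULTIPLY":
            self.result = self.weight * self.operand
        elif instruction == "OUTPUT_RESULT":
            lane_out = self.result
            self.operand = lane_in   # next operand loaded in parallel with the output
        return lane_out

ce = ComputationalElement()
for instr, lane in [("LOAD_WEIGHT", 3), ("LOAD_OPERAND", 4),
                    ("MULTIPLY", None), ("OUTPUT_RESULT", 5)]:
    print(instr, "->", ce.cycle(instr, lane))
```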
While
Instruction Dispatch Circuit
The raw instruction source 1102 illustrated in
The instruction buffer memory 1104 comprises a memory configured to store instructions received from the raw instruction source 1102 for processing and queuing prior to dispatch to the CEs of the computational array, and may include the instruction queues illustrated in
In some embodiments, the instruction buffer memory 1104 dispatches instructions to columns of the computational array 104 via a plurality of memory output locations 1108. The memory output locations may correspond to portions of the instruction buffer memory 1104 adjacent to the CEs of the computational array 104. In some embodiments, the memory output locations 1108 correspond to the heads of the instruction queues within the instruction buffer memory 1104, from which instructions are dispatched to the computational array 104 when they reach the head of the queue. The instruction buffer memory 1104 may correspond to embedded memory, SRAM, DRAM, shift registers, D-type flip-flops, latches, non-volatile memory, Flash, ROM, CAM, TCAM, or some combination thereof.
The instruction dispatch unit (IDU) processor 1106 of
In some embodiments, each mapping relationship defines a location in the instruction buffer circuit corresponding to an instruction of the set of Y instructions during a cycle t, and defines how the instruction is to be mapped to another location in the instruction buffer as an instruction of the set of X instructions for a next cycle t+1. In some embodiments, a different set of mapping relationships may be established for each compute cycle or sequence of compute cycles (e.g., after a prior sequence of compute cycles has completed, but prior to the commencement of a next sequence of compute cycles). For example, in some embodiments, the IDU 1106 establishes a different set of mappings after each sequence of n cycles (where n>1).
In some embodiments, the instruction dispatch circuit loads raw instructions from the raw instruction source 1102 to a first section of the instruction buffer memory 1104. The raw instructions are shifted to a second section of the instruction buffer memory 1104 accessed by the IDU processor 1106. The IDU processor 1106 receives the raw instructions from the second section, processes the raw instructions to generate processed instructions, and outputs the processed instructions to the instruction buffer memory (e.g., a third section of the instruction buffer memory corresponding to instruction queues for the columns of the computational array).
Although
In some embodiments, the IDU processor 1106 processes a received instruction by unpacking, decompressing, and/or decrypting the instruction. For example, as discussed above, the raw instructions may be packed such that each memory word may include more than one instruction and/or a portion of an instruction. The IDU processor 1106 unpacks the instructions and maps the unpacked instructions to distinct locations in the instruction buffer memory 1104.
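As a minimal illustration of the unpacking step, the sketch below assumes 32-bit memory words each carrying four 8-bit instructions; the actual word width and packing format are not specified here and these values are hypothetical. Each raw word is split into individual instructions before they are mapped into distinct locations of the instruction buffer memory.

```python
def unpack_words(words, word_bits=32, instr_bits=8):
    """Split each packed memory word into its fixed-width instruction fields."""
    instructions = []
    mask = (1 << instr_bits) - 1
    for word in words:
        for shift in range(0, word_bits, instr_bits):
            instructions.append((word >> shift) & mask)
    return instructions

packed = [0x04030201, 0x08070605]   # two packed words, four 8-bit instructions each
print(unpack_words(packed))         # [1, 2, 3, 4, 5, 6, 7, 8]
```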
In some embodiments, the IDU processor 1106 unrolls a received instruction to generate one or more instructions for dispatch. For example, the IDU processor 1106 may perform temporal and/or spatial expansion on a received instruction to generate multiple instructions, by replicating the instruction to be dispatched over multiple time cycles (temporal expansion), and/or replicating the instruction to be dispatched to multiple sets of CEs (e.g., multiple columns) (spatial expansion). In some embodiments, the IDU processor 1106 modifies an address, register, or data argument for replicated instances of the instruction. For example, in some embodiments, an instruction may indicate a register address of a CE on which to load data received via a data lane (such as a received operand value OP1 loaded onto an operand register of the CE as shown in
In some embodiments, the IDU processor replicates and modifies an instruction by looping, wherein in each loop iteration, the IDU processor replicates the instruction and modifies an argument of the instruction by a set amount (e.g., an offset value) relative to the argument of a previous instruction of the loop. In some embodiments, an instruction may be replicated to generate an array of instructions to be dispatched to an array of compute elements, by performing a nested loop that modifies the replicated instructions in two dimensions.
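The loop-based replication can be sketched as follows in Python; the instruction encoding, register naming, and offset parameters are hypothetical, and the nested loop corresponds to the two-dimensional expansion described above (spatial expansion across columns, temporal or per-row expansion within a column).

```python
def unroll(base_instr, base_reg, cols, rows, col_offset, row_offset):
    """Replicate one instruction across a cols x rows block, bumping its register
    argument by a fixed offset in each dimension (nested-loop / 2D expansion)."""
    expanded = []
    for c in range(cols):            # spatial expansion across columns
        for r in range(rows):        # expansion down a column
            reg = base_reg + c * col_offset + r * row_offset
            expanded.append((base_instr, f"r{reg}", c, r))
    return expanded

# One raw instruction plus loop parameters expands into six processed instructions.
for instr in unroll("LOAD_OPERAND", base_reg=0, cols=2, rows=3, col_offset=8, row_offset=1):
    print(instr)
```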
In some embodiments, the IDU processor 1106 unrolls an instruction based on a received meta instruction. For example, the raw instruction source 1102 loads onto the instruction buffer memory 1104 a first instruction and a meta instruction, where the meta instruction indicates to the IDU 1106 how the first instruction is to be unrolled, and is not mapped by the IDU processor 1106 to an instruction to be dispatched. For instructions where the IDU processor performs looping to replicate an instruction, the instruction may be associated with a meta instruction that indicates one or more parameters such as a number of loop iterations, boundary values, one or more offset values, etc.
In some embodiments, during each cycle, the instruction buffer memory 1104 outputs instructions to the columns of the computational array (e.g., via one or more instruction queues), where each CE of a column receives the same instruction. As such, the computational array may operate in a SIMD mode. In other embodiments, the instruction buffer memory 1104 outputs instructions to a first CE of each column of the computational array (e.g., a bottommost CE), whereupon the instruction is propagated to the remaining CEs of the column over subsequent cycles (e.g., to an adjacent CE each cycle, such that a second CE of the column receives the instruction during a second cycle, a third CE receives the instruction during a third cycle, etc.). This may be referred to as “staggered” SIMD.
In sum, in some embodiments, the instruction buffer memory may function as a queue having I instruction instances. Each cycle (e.g., each compute cycle), the instruction buffer memory outputs a set of X instruction instances to the computational array (e.g., one instruction for each column of the array), and receives a set of Y instruction instances from the raw instruction store. The received instructions are processed by the IDU processor, which unpacks, decompresses, decrypts, and/or unrolls the received instructions to generate processed instructions that are put back into the instruction buffer memory for dispatch to the computational array. As such, the number of instruction instances Y received by the instruction buffer memory during each cycle may be different from the number of instruction instances X dispatched to the computational array. For example, the processed instructions generated by the IDU processor may comprise duplicated instructions, and/or replicated instructions having different arguments. Further, the Y instructions received from the raw instruction store may include meta instructions that are consumed by the IDU processor itself when generating the processed instructions. In some embodiments, the IDU processor receives the Y instructions as input values, and maps the received Y instructions to memory locations in the instruction buffer memory to output the processed instructions, and may fill in a greater number of memory locations than the number of input values received by the IDU processor.
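The following minimal sketch (with assumed instruction strings and expansion behavior) illustrates why the number of instruction instances Y received per cycle can differ from the number X dispatched: meta instructions are consumed without ever being dispatched, while replication expands one raw instruction into several processed instructions placed back into the buffer.

```python
from collections import deque

buffer = deque()

def ingest(raw_instructions):
    """Process Y raw instructions into zero or more buffered (dispatchable) instructions."""
    for instr in raw_instructions:
        if instr.startswith("META"):
            continue                                  # consumed by the IDU, never dispatched
        if instr.startswith("REPLICATE"):
            _, op, count = instr.split(":")
            buffer.extend([op] * int(count))          # one raw instruction -> many processed
        else:
            buffer.append(instr)

ingest(["META:loop=4", "REPLICATE:ADD:4", "MUL"])     # Y = 3 raw instruction instances
dispatched = [buffer.popleft() for _ in range(4)]     # X = 4 instances dispatched this cycle
print(dispatched, "remaining:", list(buffer))
```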
In some embodiments, the instruction buffer memory may comprise a plurality of queues (e.g., an instruction queue corresponding to each column of the computational array, as illustrated in
The instruction dispatch circuit further processes 1204, during each cycle at an instruction dispatch unit (IDU) processor, instructions received by the instruction buffer memory (e.g., from the raw instruction source) and generates processed instructions, where the number of generated processed instructions is not equal to the number of received instructions. In some embodiments, the IDU processor performs unpacking, decompression, and/or decryption on the received instructions, which may change the number of processed instructions generated from the received instructions. In addition, the IDU processor may unroll a received instruction to generate multiple instructions to be dispatched to the CEs of the computational array, e.g., replicated instructions over multiple cycles (temporal expansion) and/or over multiple CEs (spatial expansion). In some embodiments, the IDU processor replicates an instruction by executing one or more loops, each loop producing a replicated instruction with at least one offset relative to a previous instruction. In some embodiments, the IDU processor consumes meta instructions in order to determine one or more looping parameters, e.g., loop boundary conditions, offset amounts, etc.
The instruction dispatch circuit further, at the IDU processor, maps 1206 the generated processed instructions to locations in the instruction buffer memory. In some embodiments, the IDU processor maps the generated instructions to locations in the instruction buffer memory corresponding to instruction queues associated with columns of the computational array. In some embodiments, the IDU processor utilizes a particular set of mappings over multiple cycles within a sequence of cycles, and utilizes different sets of mappings over different sequences of cycles.
The instruction dispatch circuit dispatches 1208 instructions from locations in the instruction buffer memory to CEs of the computational array. In some embodiments, the instruction dispatch circuit dispatches an instruction to each column of the computational array, such that the same instruction is applied at all CEs of a particular column. In other embodiments, the instruction dispatch circuit dispatches an instruction to a first CE of a column of the computational array, which propagates the instruction along the column over subsequent cycles. In some embodiments, the instruction buffer memory of the instruction dispatch circuit comprises a plurality of instruction queues corresponding to the columns of the computational array, where the processed instructions generated by the IDU processor are enqueued into an instruction queue corresponding to a particular column for dispatch to the CEs of the column (e.g., as part of an instruction vector). In other embodiments, generated processed instructions are shifted between different instruction queues over one or more cycles, prior to reaching a head of a queue and being dispatched to the CEs of the computational array (e.g., as described in relation to
Additional Configuration Information
The foregoing description of the embodiments of the disclosure has been presented for the purpose of illustration; it is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Persons skilled in the relevant art can appreciate that many modifications and variations are possible in light of the above disclosure.
Some portions of this description describe the embodiments of the disclosure in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times, to refer to these arrangements of operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combinations thereof.
Any of the steps, operations, or processes described herein may be performed or implemented with one or more hardware or software modules, alone or in combination with other devices. In one embodiment, a software module is implemented with a computer program product comprising a computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.
Embodiments of the disclosure may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, and/or it may comprise a general-purpose computing device selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a non-transitory, tangible computer readable storage medium, or any type of media suitable for storing electronic instructions, which may be coupled to a computer system bus. Furthermore, any computing systems referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the disclosure be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of the embodiments is intended to be illustrative, but not limiting, of the scope of the disclosure, which is set forth in the following claims.
This application is a divisional of U.S. patent application Ser. No. 16/951,938, filed Nov. 18, 2020, entitled “PROCESSOR INSTRUCTION DISPATCH CONFIGURATION”, which claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application Ser. No. 62/937,123, titled “Processor Instruction Dispatch Configuration,” filed on Nov. 18, 2019. The above noted applications are hereby expressly incorporated by reference herein in their entireties.
Notice of Allowance received for U.S. Appl. No. 16/117,763 dated Jun. 8, 2020, 5 pages. |
Notice of Intent to Grant for European Patent Application Serial No. 19765954.3 dated Feb. 17, 2023, 41 pages. |
Notice of Intent to Grant for European Patent Application No. 19765954.3 dated Oct. 17, 2022, 41 pages. |
Communication Pursuant to Article 94(3) EPC received for European Patent Application Serial No. 19765954.3 dated Feb. 23, 2022, 8 pages. |
Notice of Allowance received for U.S. Appl. No. 17/105,976, Feb. 3, 2022, 28 pages. |
Notice of Allowance received for U.S. Appl. No. 17/684,337, Apr. 13, 2023, 50 pages. |
Sotiropoulos et al., “Enhancing the Performance of Tiled Loop Execution on to Clusters using Memory Mapped Network Interfaces and Pipelined Schedules”, 2002, citation 1 page. |
Notice of Allowance received for U.S. Appl. No. 17/697,201 dated Feb. 23, 2023, 37 pages. |
Notice of Allowance received for U.S. Appl. No. 17/697,201 dated Mar. 7, 2023, 4 pages. |
Non-Final Office Action received for U.S. Appl. No. 16/132,243 dated Dec. 31, 2019, 28 pages. |
Notice of Allowance received for U.S. Appl. No. 16/132,243 dated Jun. 22, 2021, 47 pages. |
Notice of Allowance received for U.S. Appl. No. 16/132,243 dated Sep. 30, 2021, 42 pages. |
Notice of Allowance received for U.S. Appl. No. 16/132,243 dated Dec. 15, 2021, 19 pages. |
Notice of Allowance received for U.S. Appl. No. 16/526,966 dated Feb. 8, 2021, 45 pages. |
Notice of Allowance received for U.S. Appl. No. 16/526,966 dated Jun. 21, 2021, 28 pages. |
Notice of Allowance received for U.S. Appl. No. 16/526,966 dated Oct. 15, 2021, 30 pages. |
Notice of Allowance received for U.S. Appl. No. 16/526,966 dated Jan. 5, 2022, 18 pages. |
Communication Pursuant to Article 94(3) EPC received for European Patent Application Serial No. 19827878.0 dated May 22, 2023, 5 pages. |
Decision to Grant received for Japanese Patent Application Serial No. 2021-527941 dated Mar. 28, 2023, 5 pages (Including English Translation). |
Written Decision on Registration received for Korean Patent Application Serial No. KR20217012323 dated Apr. 24, 2023, 12 pages (Including English Translation). |
Notice of Allowance received for U.S. Appl. No. 17/203,214 dated Jul. 3, 2024, 113 pages. |
Non Final Office Action received for U.S. Appl. No. 18/482,558 dated May 9, 2024, 67 pages. |
Non Final Office Action received for U.S. Appl. No. 18/351,916 dated Jun. 20, 2024, 96 pages. |
Non Final Office Action received for U.S. Appl. No. 18/405,203 dated Jul. 12, 2024, 75 pages. |
Kye et al., “CPU-based Real-time Maximum Intensity Projection Via Fast Matrix Transposition Using Parallelization Operations with AVX Instruction Set”, Multimedia Tools and Applications, vol. 77, 2018 , pp. 15971-15994. |
International Search Report and Written Opinion received for International PCT Application Serial No. PCT/US2019/062303 dated Mar. 25, 2020, 14 pages. |
Non-Final Office Action received for U.S. Appl. No. 16/686,864 dated Jun. 1, 2021, 22 pages. |
Non-Final Office Action received for U.S. Appl. No. 16/686,866 dated Sep. 23, 2021, 25 pages. |
Non-Final Office Action received for U.S. Appl. No. 16/686,858 dated Jan. 25, 2022, 32 pages. |
Non-Final Office Action received for U.S. Appl. No. 17/519,425 dated Jan. 26, 2023, 17 pages. |
Bustamam et al., “Fast Parallel Markov Clustering in Bioinformatics Using Massively Parallel Computing on GPU with CUDA and ELLPACK-R Sparse Format”, IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, No. 3, Mar. 22, 2012, pp. 679-692. |
Bouaziz et al., “Parallel Long Short-Term Memory for Multi-Stream Classification”, IEEE Spoken Language Technology Workshop, Dec. 13-16, 2016, pp. 218-223. |
Fuchs et al., “Parallel Vectors Criteria for Unsteady Flow Vortices”, IEEE Transactions on Visualization and Computer Graphics, vol. 14, No. 3, May-Jun. 2008, pp. 615-626. |
Gelder et al., “Using PVsolve to Analyze and Locate Positions of Parallel Vectors”, IEEE Transactions on Visualization and Computer Graphics, vol. 15, No. 4, Jul.-Aug. 2009, pp. 682-695. |
Gil-Cacho et al., “Nonlinear Acoustic Echo Cancellation Based on a Parallel-Cascade Kernel Affine Projection Algorithm”, IEEE International Conference on Acoustics, Speech and Signal Processing, Mar. 25-30, 2012, pp. 33-36. |
Office Action received for Japanese Patent Application No. 2021-527941 dated Dec. 20, 2022, 11 pages (Including English Translation). |
Request for the Submission of an Opinion received for Korean Patent Application Serial No. 10-2021-7012323 dated Aug. 29, 2022, 10 pages (Including English Translation). |
Rodrigues et al., “SIMDization of Small Tensor Multiplication Kernels for Wide SIMD Vector Processors”, 4th Workshop on Programming Models for SIMD Nector Processing, Feb. 2018, pp. 1-8. |
Suh et al., “A Performance Analysis of PIM, Stream Processing, and Tiled Processing on Memory-Intensive Signal Processing Kernels”, 30th Annual International Symposium on Computer Architecture, Jun. 2003, 10 pages. |
Office Action received for Taiwan Patent Application Serial No. 108142039 dated Jan. 3, 2023, 28 pages (Including English Translation). |
Non-Final Office Action received for U.S. Appl. No. 16/686,870 dated May 27, 2022, 61 pages. |
Final Office Action received for U.S. Appl. No. 16/686,858 dated Jun. 29, 2022, 23 pages. |
Notice of Allowance received for U.S. Appl. No. 17/519,425 dated Mar. 15, 2023, 25 pages. |
Notice of Allowance received for U.S. Appl. No. 17/519,425 dated Jun. 20, 2023, 60 pages. |
Notice of Allowance received for U.S. Appl. No. 17/203,214 dated Jul. 19, 2023, 50 pages. |
Non-Final office action received for U.S. Appl. No. 18/083,388 dated Jul. 14, 2023, 50 pages. |
Notice of Allowance received for U.S. Appl. No. 17/684,337 dated Jul. 3, 2023, 91 pages. |
Decision to Grant a Patent received for European Patent Application Serial No. 19765954.3 dated Jun. 29, 2023, 2 pages. |
Notice of Allowance received for U.S. Appl. No. 17/397,158 dated Aug. 23, 2023, 82 pages. |
Notice of Allowance received for U.S. Appl. No. 16/951,938 dated Sep. 5, 2023, 81 pages. |
Notice of Allowance received for U.S. Appl. No. 18/083,388 dated Aug. 31, 2023, 25 pages. |
Notice of Allowance received for U.S. Appl. No. 17/582,895 dated Aug. 16, 2023, 39 pages. |
Office Action received for Taiwan Patent Application Serial No. 11220743060 dated Aug. 1, 2023, 4 pages. |
Office Action received for Chinese Patent Application Serial No. 201880006508.9 dated Jul. 19, 2023, 7 pages. |
Notice of Allowance received for U.S. Appl. No. 17/203,214 dated Aug. 16, 2023, 5 pages. |
Notice of Allowance received for U.S. Appl. No. 17/397,158 dated Oct. 2, 2023, 10 pages. |
Notice of Allowance received for U.S. Appl. No. 16/951,938 dated Sep. 27, 2023, 102 pages. |
Notice of Allowance received for U.S. Appl. No. 18/083,388 dated Oct. 4, 2023, 10 pages. |
Notice of Allowance received for U.S. Appl. No. 17/582,895 dated Oct. 4, 2023, 12 pages. |
First Office Action received for Chinese Patent Application Serial No. 201980074328.9 dated Aug. 14, 2023, 6 pages (Including English Translation). |
Non-Final office action received for U.S. Appl. No. 17/203,214 dated Dec. 1, 2023, 82 pages. |
Notice of Allowance received for U.S. Appl. No. 17/397,158 dated Dec. 1, 2023, 10 pages. |
Notice of Allowance received for U.S. Appl. No. 16/951,938 dated Nov. 22, 2023, 15 pages. |
Notice of Allowance received for U.S. Appl. No. 16/951,938 dated Dec. 11, 2023, 10 pages. |
Notice of Allowance received for U.S. Appl. No. 18/083,388 dated Nov. 24, 2023, 10 pages. |
Notice of Allowance received for U.S. Appl. No. 17/582,895 dated Dec. 1, 2023, 6 pages. |
Office Action received for Taiwan Patent Application Serial No. 108142038 dated Mar. 31, 2023, 11 pages. |
Second Office Action received for Chinese Patent Application Serial No. 201980074328.9 dated Mar. 23, 2024, 7 pages(Including English Translation). |
Notice of Allowance received for Taiwanese Patent Application Serial No. 108142041 dated Jan. 12, 2024, 4 pages (Original Copy Only). |
Request for the Submission of an Opinion received for Korean Patent Application Serial No. 10-2023-7025325 dated Feb. 28, 2024, 6 pages (Including English Translation). |
Notification to Grant Patent Right for Invention received for Chinese Patent Application Serial No. 201980074328 dated Jun. 17, 2024, 4 pages(Including English Translation). |
Office Action received for Taiwan Patent Application Serial No. 108142040 dated May 28, 2024, 12 pages(Including English Translation). |
| Number | Date | Country |
| --- | --- | --- |
| 20240118922 A1 | Apr. 2024 | US |
| Number | Date | Country |
| --- | --- | --- |
| 62937123 | Nov. 2019 | US |
| | Number | Date | Country |
| --- | --- | --- | --- |
| Parent | 16951938 | Nov. 2020 | US |
| Child | 18389984 | | US |