In many modern processor architectures, data loads are performance bottlenecks. For example, in a central processing unit (CPU) that can service multiple outstanding load requests per cycle from the data cache, some of these loads may fetch smaller data sizes than the respective scalar and vector register widths available, e.g., due to variable data sizes. For example, for a CPU with 256-bit wide vector registers, 32-bit, 64-bit, and/or 128-bit loads may result in load requests that do not fully utilize available load/store throughput and bandwidth.
A more detailed understanding can be had from the following description, given by way of example in conjunction with the accompanying drawings wherein:
Modern processors have separate instruction and/or µop schedulers and execution lanes for vector and scalar instructions and/or µops. In some implementations, these execution lanes include dedicated and limited address generation units (AGUs) which generate addresses and issue data load requests from the cache. To better utilize the AGUs and reduce the performance bottleneck of requesting cached data, in some implementations, a load fusion mechanism detects two or more loads of the same type (vector or scalar) to consecutive memory locations and fuses them into a single load, or into fewer loads, respectively. In this context, fusing refers to replacing a plurality of loads with a single load, e.g., of a different size. In some implementations, this approach reduces load/store unit (LSU) utilization, which in some cases has the advantage of providing a speedup when the LSU is the bottleneck of the workload.
Some implementations provide a processor configured for load fusion. The processor includes circuitry configured to replace a first load operation and a second load operation, in a stream of operations, with a single load operation. The processor also includes circuitry configured to insert one or more operations, configured to move and shift a value stored in a destination register of the single load operation, after the single load operation in the stream of operations.
In some implementations, the stream of operations comprises a stream of instructions, or a stream of µops. In some implementations, the one or more operations configured to move and shift the value stored in the destination register of the single load operation comprises a single operation configured to move and shift the value stored in the destination register of the single load operation. In some implementations, the one or more operations configured to move and shift the value stored in the destination register of the single load operation comprises a first operation configured to move the value stored in the destination register of the single load operation and a second operation configured to shift the value stored in the destination register of the single load operation. In some implementations, a size of a data load indicated by the single load operation is less than, or equal in size to, a sum of a data load indicated by the first load operation and of a data load indicated by the second load operation. In some implementations, the first load operation loads data of a size smaller than a width of a destination register of the first load operation, and the second load operation loads data of a size smaller than a width of a destination register of the second load operation. In some implementations, replacing the first load operation and the second load operation comprises replacing the first load operation with the single load operation and either replacing the second load operation with a null operation (nop) or eliminating the second load operation. In some implementations, the one or more operations configured to move and shift the value stored in the destination register of the single load operation comprise a register move operation that copies data from a destination register of the first load operation to a second register.
In some implementations, the one or more operations configured to move and shift the value stored in the destination register of the single load operation comprise a register shift operation that shifts bits of the second register. In some implementations, the one or more operations configured to move and shift the value stored in the destination register of the single load operation comprise a register shift operation that shifts bits of the second register by a number of bits which aligns data corresponding to the second load operation.
Some implementations provide a method for load fusion. A first load operation and a second load operation, in a stream of operations, are replaced with a single load operation. One or more operations, configured to move and shift a value stored in a destination register of the single load operation, are inserted after the single load operation in the stream of operations. In some implementations, the stream of operations comprises a stream of instructions, or a stream of µops.
In some implementations, the one or more operations configured to move and shift the value stored in the destination register of the single load operation comprises a single operation configured to move and shift the value stored in the destination register of the single load operation. In some implementations, the one or more operations configured to move and shift the value stored in the destination register of the single load operation comprises a first operation configured to move the value stored in the destination register of the single load operation and a second operation configured to shift the value stored in the destination register of the single load operation. In some implementations, a size of a data load indicated by the single load operation is less than, or equal in size to, a sum of a data load indicated by the first load operation and of a data load indicated by the second load operation. In some implementations, the first load operation loads data of a size smaller than a width of a destination register of the first load operation, and the second load operation loads data of a size smaller than a width of a destination register of the second load operation. In some implementations, replacing the first load operation and the second load operation comprises replacing the first load operation with the single load operation and either replacing the second load operation with a null operation (nop) or eliminating the second load operation. In some implementations, the one or more operations configured to move and shift the value stored in the destination register of the single load operation comprise a register move operation that copies data from a destination register of the first load operation to a second register. In some implementations, the one or more operations configured to move and shift the value stored in the destination register of the single load operation comprise a register shift operation that shifts bits of the second register.
In some implementations, the one or more operations configured to move and shift the value stored in the destination register of the single load operation comprise a register shift operation that shifts bits of the second register by a number of bits which aligns data corresponding to the second load operation.
Some implementations include a processor configured for load fusion. The processor includes circuitry configured to insert information of a first load operation into a tracking table. The processor also includes circuitry configured to insert information of a second load operation into the tracking table responsive to an address of the second load operation being within a range. The processor also includes circuitry configured to execute a load operation from an address indicated by the first load operation and from an address indicated by the second load operation based on the tracking table.
In some implementations, the first load operation and the second load operation are executed by different address generation units of an execution unit (EX). In some implementations, the load from the address indicated by the first load operation and the load from the address indicated by the second load operation are queued in a same entry of a load queue (LQ). In some implementations, the load from the address indicated by the first load operation and the load from the address indicated by the second load operation are queued in a same entry of an LQ, at an offset. Some implementations include circuitry configured to route the load from the address indicated by the first load operation and the load from the address indicated by the second load operation from the LQ to different architectural registers. Some implementations include circuitry configured to route the load from the address indicated by the first load operation and the load from the address indicated by the second load operation from subsets of the entry of the LQ to different architectural registers.
Some implementations include a method for load fusion. Information of a first load operation is inserted into a tracking table. Information of a second load operation is inserted into the tracking table responsive to an address of the second load operation being within a range. A load operation is executed from an address indicated by the first load operation and from an address indicated by the second load operation based on the tracking table.
In some implementations, the first load operation and the second load operation are executed by different address generation units of an EX. In some implementations, the load from the address indicated by the first load operation and the load from the address indicated by the second load operation are queued in a same entry of an LQ. In some implementations, the load from the address indicated by the first load operation and the load from the address indicated by the second load operation are queued in a same entry of an LQ, at an offset. Some implementations include circuitry configured to route the load from the address indicated by the first load operation and the load from the address indicated by the second load operation from the LQ to different architectural registers. Some implementations include circuitry configured to route the load from the address indicated by the first load operation and the load from the address indicated by the second load operation from subsets of the entry of the LQ to different architectural registers.
In various alternatives, the processor 102 includes a central processing unit (CPU), a graphics processing unit (GPU), a CPU and GPU located on the same die, or one or more processor cores, wherein each processor core can be a CPU or a GPU. In various alternatives, the memory 104 is located on the same die as the processor 102, or is located separately from the processor 102. The memory 104 includes a volatile or non-volatile memory, for example, random access memory (RAM), dynamic RAM, or a cache.
The storage 106 includes a fixed or removable storage, for example, a hard disk drive, a solid-state drive, an optical disk, or a flash drive. The input devices 108 include, without limitation, a keyboard, a keypad, a touch screen, a touch pad, a detector, a microphone, an accelerometer, a gyroscope, a biometric scanner, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals). The output devices 110 include, without limitation, a display device 118, a display connector/interface (e.g., an HDMI or DisplayPort connector or interface for connecting to an HDMI or Display Port compliant device), a speaker, a printer, a haptic feedback device, one or more lights, an antenna, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals).
The input driver 112 communicates with the processor 102 and the input devices 108, and permits the processor 102 to receive input from the input devices 108. The output driver 114 communicates with the processor 102 and the output devices 110, and permits the processor 102 to send output to the output devices 110. It is noted that the input driver 112 and the output driver 114 are optional components, and that the device 100 will operate in the same manner if the input driver 112 and the output driver 114 are not present. The output driver 114 includes an accelerated processing device (“APD”) 116 which is coupled to a display device 118. The APD 116 accepts compute commands and graphics rendering commands from processor 102, processes those compute and graphics rendering commands, and provides pixel output to display device 118 for display. As described in further detail below, the APD 116 includes one or more parallel processing units to perform computations in accordance with a single-instruction-multiple-data (“SIMD”) paradigm. Thus, although various functionality is described herein as being performed by or in conjunction with the APD 116, in various alternatives, the functionality described as being performed by the APD 116 is additionally or alternatively performed by other computing devices having similar capabilities that are not driven by a host processor (e.g., processor 102) and provide graphical output to a display device 118. For example, it is contemplated that any processing system that performs processing tasks in accordance with a SIMD paradigm may perform the functionality described herein. Alternatively, it is contemplated that computing systems that do not perform processing tasks in accordance with a SIMD paradigm can also perform the functionality described herein.
During operation of a typical processor, such as a CPU, multiple load instructions (or micro-operations (µops), depending on the architecture) may be issued to read from contiguous addresses in memory. For example, the pseudocode of Table 1 illustrates example vector load request instructions issued to contiguous addresses in memory:
Although the examples presented herein are discussed with respect to load instructions, it is noted that in some implementations, the same or similar techniques are applicable to load µops. For example, in some complex instruction set computer (CISC) architectures, instructions may be decoded and converted into micro-operations (µops) that are passed through CPU pipelines. In some reduced instruction set computer (RISC) architectures, instructions may not be converted into µops but may be decoded and passed through subsequent stages of the CPU pipeline.
In this example, vmovdq (“vector move double quad”) is a vector instruction indicating a double quad word width load instruction. In this example, the scale of the argument %rax is indicated as 8, however this is example context, and any suitable scaling (or no scaling) of the index register %rax is possible in other examples. Since the instruction is of double quad width, the width of the loaded value is 128 bits. In Table 1, the first vmovdq instruction loads 128 bits from memory at the address indicated by base register %rdx and index register %rax to the architectural register %ymm6. The second vmovdq instruction loads 128 bits from an address indicated by base register %rdx and index register %rax, offset by 16 bytes (128 bits), to the architectural register %ymm7. It is noted that architectural registers %ymm6 and %ymm7 are each 256 bits wide in this example. It is also noted that the instructions, instruction formats, data, and so forth in all example pseudocode herein are only exemplary, and any suitable instructions, formats, data, and so forth are usable in other implementations. It is also noted that pseudocode herein uses 64-bit x86 register set nomenclature, however any other register set or instruction set architecture is usable in other implementations. As used herein, an architectural register is a register name or label, defined by the instruction set architecture (ISA), that can be referenced directly by software (e.g., user applications, compilers, etc.), and a physical register is a physical hardware element that stores the data values and execution results used in the CPU pipeline; mapping architectural registers onto physical registers enables microarchitectural optimizations such as register renaming.
The only difference between the two load request instructions shown in Table 1 is the offset of the requested word and the destination architectural register. After both instructions are executed, two 256-bit architectural registers (%ymm6 and %ymm7) will each be partially occupied (each by 128 bits).
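The effective-address arithmetic behind these two loads can be sketched as follows; this is a minimal illustration, and the concrete register values are assumptions chosen for the sketch, not values from Table 1:

```python
# Sketch of x86-style effective-address computation for Table 1's two vmovdq
# loads: base (%rdx) + index (%rax) * scale, with the second load displaced
# by 16 bytes (128 bits). Register values here are illustrative assumptions.
def effective_address(base, index, scale=8, displacement=0):
    """Compute an effective address: base + index*scale + displacement."""
    return base + index * scale + displacement

rdx, rax = 0x1000, 4                                   # assumed register values
load1 = effective_address(rdx, rax)                    # first 128-bit load
load2 = effective_address(rdx, rax, displacement=16)   # second load, +16 bytes

# The two requests target contiguous 16-byte (128-bit) chunks of memory.
assert load2 - load1 == 16
```

The sketch shows only the address arithmetic; the actual width of each data transfer (128 bits) is determined by the instruction, not by the addressing mode.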
Similarly, the pseudocode of Table 2 illustrates example scalar load requests issued to contiguous addresses in memory (scalar load requests can also be referred to as integer load requests):
In this example, movd (“move double”) is a scalar instruction indicating a double word width load instruction. In this example, the scale of the index register %rax is indicated as 8, however this is example context, and any suitable scaling (or no scaling) of the index register %rax is possible in other examples. Since the instruction is of double word width, the width of the loaded value is 32 bits. In Table 2, the first movd instruction loads 32 bits from memory at the address indicated by base register %rdx and index register %rax to the architectural register %r15. The second movd instruction loads 32 bits from an address indicated by base register %rdx and index register %rax, offset by 4 bytes (32 bits), to the architectural register %r16. It is noted that architectural registers %r15 and %r16 are each 64 bits wide in this example.
The only difference between the two load request instructions shown in Table 2 is the offset of the requested word and the destination architectural register. After both instructions are executed, two 64-bit physical registers corresponding to architectural registers %r15 and %r16 will each be partially occupied (each by 32 bits).
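The partial occupancy described above can be illustrated with a short sketch; the memory contents and load address below are assumptions for illustration only:

```python
# Sketch of Table 2's two 32-bit movd loads: each value lands zero-extended
# in a 64-bit register, so each register is only half occupied.
memory = bytes(range(64))            # hypothetical backing memory

def load32(mem, addr):
    """Return a 32-bit little-endian value, zero-extended to 64 bits."""
    return int.from_bytes(mem[addr:addr + 4], "little")

addr = 8                              # base + index*scale, assumed
r15 = load32(memory, addr)            # first movd
r16 = load32(memory, addr + 4)        # second movd, offset by 4 bytes

# Each 64-bit register holds only 32 bits of payload.
assert r15 < (1 << 32) and r16 < (1 << 32)
```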
In both the vector example of Table 1, and the scalar example of Table 2, occupying two physical registers reduces the number of effective physical registers available in the physical register file (PRF). Further, this also utilizes additional address generation units (AGUs) and load store units (LSUs) as compared with a single load instruction and/or pop.
In some cases, unnecessary load instructions and/or pops (and consequent additional utilization of the subsystems used for the load instructions) prevent or impair additional performance boosts, e.g., in memory-intensive applications where memory reads are frequently used. Accordingly, it may be desired to provide load fusion mechanisms to reduce the total effective number of loads (i.e., number of loads executed by a processor regardless of the number of corresponding loads specified in software). Some implementations may be configured to determine whether memory reads are frequent enough to warrant employment of the load fusion mechanisms.
Some solutions exploit spatiotemporal memory accesses to fetch full cache lines (e.g., 64-byte cache lines) into dedicated on-chip buffers, and employ a predictive mechanism to determine if a cache line is accessed frequently enough to be fetched into a dedicated cache line access register. However, in some implementations this incurs significant additional hardware to implement buffers and access prediction hardware. Accordingly, it may be desired to provide instruction and/or pop fusion such as described herein. In some implementations, such instruction and/or pop fusion has the advantage of providing a straightforward mechanism for efficient data loads.
Some solutions exploit a compiler-based approach to attempt to optimize source code to reduce the number of issued loads by merging them. In certain situations, such as when a load address is dynamically generated and not known at compile time, compilers do not attempt to optimize the load instructions and instead leave them in their original state. Furthermore, in some cases, compilers provide only a static analysis of source code and are not aware of the LSU utilization over the duration of the workload. Accordingly, it may be desired to provide instruction and/or pop fusion such as described herein. In some implementations, such instruction and/or pop fusion has the advantage of detecting high utilization and adjusting dynamically to begin fusing loads to reduce the LSU pressure.
Some implementations fuse loads and track the corresponding registers for dependency information to be propagated to younger dependent instructions and/or µops. In some implementations, hardware structures and modifications to current hardware are provided. Some implementations include dynamically altering the instructions and/or µops (e.g., dynamically adjusting the instruction and/or µop stream) to explicitly issue a fused load. Some implementations include a transparent microarchitectural approach.
In some implementations where the instruction and/or µop stream is altered to issue a fused load, the stream is dynamically altered to fuse multiple loads, and additional instructions (e.g., move and/or shift instructions) are added to write values to the correct registers (i.e., to provide the same output that would have resulted from execution of the original instruction stream). This may be referred to as “explicit instruction fusion”.
Table 3 shows an example code modification which fuses two 32b loads into one 64b load, and adds further instructions to provide the same output that would have resulted from execution of the original instructions.
Here, the original instructions include two scalar double-word load (movd) instructions which each load 32 bits. The first movd instruction loads 32 bits from memory at the address indicated by registers %rdx and %rax to the register %r15d (i.e., the least significant 32 bits of register r15). The second movd instruction loads 32 bits from the same address indicated by rdx and rax, but offset by 4 bytes (32 bits), to the register %r16d (i.e., the least significant 32 bits of register r16).
As discussed earlier, this requires two separate load instructions, which may cause performance issues in some cases (e.g., where performance is memory constrained). Accordingly, in order to accomplish the loading of the 64 bits by a single instruction, in some implementations, the two movd instructions are replaced by a single scalar quad-word load (movq) instruction, which loads 64 bits from the address indicated by rdx and rax, to a single register r15 (i.e., the entire 64 bits of register r15). In order to replicate the output of the original instruction stream, the entire 64 bits are copied from r15 to r16 (within the register file), e.g., using a register move (mov) instruction mov r15, r16, and the contents of register r16 are shifted right by 32 bits, e.g., using a shift right (shr) instruction shr r16, 32, in order to replicate the offset loading of the second movd instruction in the original instruction stream. This example assumes that the upper 32 bits of r15 are not used by any subsequent instructions (otherwise the upper 32 bits would also have to be zeroed out for full equivalence).
In this example, the two loads (movd) have been fused into a single load instruction (movq) followed by a register move and shift. In some implementations, this approach has the advantage of not consuming additional registers, and of reusing the original load destination registers. In some implementations, this preserves any true dependencies in the original source code. In this example, the register move and shift move the upper 32 bits of r15 to the lower 32 bits of r16.
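A minimal software model of this rewrite, under assumed memory contents, illustrates that the fused movq/mov/shr sequence reproduces the results of the original two movd loads:

```python
# Model of the Table 3 rewrite: one 64-bit load replaces two 32-bit loads,
# then a register move and a 32-bit right shift recover the second value.
# Memory contents and addresses are assumptions for illustration.
MASK64 = (1 << 64) - 1

def load(mem, addr, size):
    """Little-endian load of `size` bytes from `mem` at `addr`."""
    return int.from_bytes(mem[addr:addr + size], "little")

memory = (0x1111_2222_3333_4444).to_bytes(8, "little")

# Original stream: two movd loads into r15 and r16.
r15_orig = load(memory, 0, 4)          # movd -> %r15d
r16_orig = load(memory, 4, 4)          # movd (offset 4) -> %r16d

# Fused stream: movq; mov r15, r16; shr r16, 32.
r15 = load(memory, 0, 8)               # movq -> %r15
r16 = r15 & MASK64                     # mov r15, r16
r16 >>= 32                             # shr r16, 32

assert r16 == r16_orig                 # second value recovered in r16
assert r15 & 0xFFFF_FFFF == r15_orig   # low half of r15 matches first load
```

As the final assertions show, the only deviation from the original stream is that the upper half of r15 now holds live data, matching the assumption stated above.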
In some implementations, more than two loads are fused into a single load instruction. Table 4 shows an example code modification which fuses four 16b loads into one 64b load, and adds further instructions to provide the same output that would have resulted from execution of the original instructions. In the example of Table 4, each load in the original, pre-modified instruction stream is replaced by one move and one shift with a modified immediate:
In the example of Table 4, the updated instruction stream loads the 64b value into %r8 once, and moves the value from %r8 into each destination register, shifting by an appropriate number of bits to yield the same output as the original instruction stream.
The example of Table 4 shows a scalar example. It is noted that a similar approach is applicable to vector loads. For example, when fusing four 128b vector loads into one 512b vector load, each register would be shifted by 128b, 256b, and 384b, respectively.
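This vector variant can be sketched as follows; the lane values are assumptions chosen for illustration, while the 128b/256b/384b shift pattern follows the description above:

```python
# Sketch of the per-destination move-and-shift pattern of Table 4, applied to
# the vector case: four 128-bit lanes of one 512-bit load are recovered by
# shifting the loaded value by 0, 128, 256, and 384 bits, respectively.
lanes = [0xAAAA, 0xBBBB, 0xCCCC, 0xDDDD]   # hypothetical 128-bit lane values
packed = sum(v << (128 * i) for i, v in enumerate(lanes))  # one 512-bit load

MASK128 = (1 << 128) - 1
# Each destination receives a copy of the packed value, shifted into place
# (one move and one shift per destination, each with a different immediate).
dests = [(packed >> shift) & MASK128 for shift in (0, 128, 256, 384)]

assert dests == lanes
```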
Table 5 shows another example code modification which fuses four 16b loads into one 64b load, and adds further instructions to provide the same output that would have resulted from execution of the original instructions. In the example of Table 5, each load in the original, pre-modified instruction stream is replaced by one move and one shift with a similar immediate:
In the example of Table 5, the updated instruction stream loads the 64b value into %r8 once, and moves the value from %r8 into the first destination register, shifting by an appropriate number of bits, moves the value from the first destination register to the second destination register, shifting by an appropriate number of bits, and so forth, to yield the same output as the original instruction stream.
Table 5 shows a scalar example. It is noted that a similar approach is applicable to vector loads, as with Table 4.
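The chained variant of Table 5 can be sketched as follows; the 16-bit lane width and lane values are assumptions for illustration, and only the chained move-and-shift pattern is taken from the description above:

```python
# Sketch of Table 5's chained rewrite: rather than shifting the originally
# loaded value by a different immediate each time (as in Table 4), each
# destination is derived from the previous destination with the same shift.
SHIFT = 16                                   # assumed lane width in bits
lanes = [0x1111, 0x2222, 0x3333, 0x4444]
packed = sum(v << (SHIFT * i) for i, v in enumerate(lanes))  # one fused load

dests = [packed]                             # mov fused value -> first dest
for _ in range(3):
    dests.append(dests[-1] >> SHIFT)         # mov prev -> next; shr next, SHIFT

MASK = (1 << SHIFT) - 1
# Each destination's low lane matches the corresponding original load.
assert [d & MASK for d in dests] == lanes
```

A design trade-off worth noting: the chained form creates a serial dependency between the injected moves, whereas the Table 4 form allows the moves and shifts to execute in parallel.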
In some implementations, the instruction stream would need to be dynamically modified within the pipeline. Accordingly, some implementations include hardware modifications within the decode stage of the pipeline. For example, the instruction cache (IC) and op cache (OC) are accessed prior to the decode stage; accordingly, in some implementations, e.g., to ensure that the instruction stream is correctly modified, an additional bypass is added from the rename stage to the IC and OC to fetch, from the op cache, the ops used to modify the instruction stream.
In some implementations, circuitry configured to perform checks on the bounds of load fusion candidates' addresses is also provided in the decode and/or rename stage. For example, in some implementations, circuitry is provided to confirm that the total size of the memory fetched by the load instructions proposed for fusion are not larger than the maximum size of the destination register (e.g., 64 bits for scalar loads, or 512 bits for vector loads in the examples herein), or otherwise to ensure that if a memory fetch is larger than the maximum size of the destination register, the loads are not fused.
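The bounds check described above can be sketched as a simple predicate, using the register-width limits from the examples herein (the function name and table layout are illustrative assumptions):

```python
# Sketch of the fusion bounds check: candidate loads are fused only if their
# combined size fits within the widest destination register. The limits
# below follow the examples in the text (64b scalar, 512b vector).
MAX_WIDTH_BITS = {"scalar": 64, "vector": 512}

def can_fuse(load_sizes_bits, kind):
    """Return True if the loads' total size fits the maximum register width."""
    return sum(load_sizes_bits) <= MAX_WIDTH_BITS[kind]

assert can_fuse([32, 32], "scalar")              # two 32b -> one 64b load
assert not can_fuse([64, 64], "scalar")          # 128b exceeds a 64b register
assert can_fuse([128, 128, 128, 128], "vector")  # four 128b -> one 512b load
```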
In order to determine whether two load instructions and/or µops can be replaced or “fused” in this way, in some implementations, information about loads in an instruction and/or µop stream or other collection of instructions and/or µops is collected. Such information is collected in any suitable manner, such as by storing it in a table or other data structure. In some implementations, the information is stored in a dedicated buffer or other device.
For example, load instructions and/or µops that are candidates for load fusion (e.g., load instructions and/or µops that are below the maximum size of the respective destination register) may not all arrive in the rename stage within the same clock cycle. Accordingly, some implementations include circuitry configured to track and fuse loads across different instruction bundles, where instruction bundles are instructions that are decoded in the same clock cycle.
This may be done in any suitable way. For example, in some implementations, to support fusing loads from different instruction bundles, a table is used to track the candidate loads or information about the candidate loads. Such a table is referred to herein as a Fused Load Tracking Table (FLTT). In some implementations, the FLTT is a hardware structure. In some implementations, the FLTT indicates the logical destination registers from the original instruction stream and the physical register assigned to the first decoded/renamed load (e.g., in a buffer). In some implementations, the table resides, in a logical sense, between rename stages (e.g., by adding an additional rename stage), or within one of the rename stages, of the CPU pipeline. In some implementations, this facilitates population of the FLTT as load addresses and register destinations become available, and allows the instruction stream to be modified before dispatching to the necessary instruction and/or µop queue.
V field 204 indicates whether entry 200 is valid, and is set to indicate that entry 200 is valid when a new entry is created (e.g., when a load instruction is detected in the rename stage). V field 204 is cleared to indicate that entry 200 is invalid if the fused load has been fetched from memory and marked complete as indicated by a signal from the respective load queue (LQ) entry, or if age field 206 has saturated (e.g., reached a maximum, or threshold, value) and the F field 208 does not indicate that the load instruction has been fused (i.e., indicates that further loads have not been detected to fuse with this load).
In some implementations, V field 204 includes a single bit that is set to indicate validity, and cleared to indicate invalidity, however, any suitable implementation of the V field 204 is usable in other implementations.
Age field 206 is used to invalidate unfused entries after a period of time (e.g., after a threshold number of cycles), as described above, to allow later loads to be fused instead.
F field 208 indicates whether the entry represents a fused load. In some implementations, F field 208 includes a single bit that is set to indicate that the entry represents a fused load, and cleared to indicate that the entry represents a load that is not fused, however, any suitable implementation of the F field 208 is usable in other implementations.
LR1 field 210 indicates the logical register to which a candidate load instruction stores its loaded information, and LR2 field 212 indicates the logical register to which a second candidate load instruction stores its loaded information. It is noted that some implementations include additional LR fields (e.g., in order to fuse more than two load instructions).
PR1 field 214 indicates the physical register to which the fused load instruction, or the first load instruction, was renamed in the same decode bundle or a past decode bundle, respectively.
Load source field 216 indicates the register indicating the address, or base address, from which the fused load instruction, or the first load instruction, loads its information.
In some implementations, the load source field 216 is used for future loads to check if they are fetching data within a bound of this entry's load (e.g., an 8 byte or 64 byte bound for scalar or vector loads, respectively) allowing the loads to be fused.
It is noted that in some implementations a FLTT entry includes additional fields (e.g., additional LR fields to track further load candidates in order to fuse more than two loads), or includes fewer than, or a subset of these fields (e.g., where the information is tracked in other ways, or is not tracked).
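One possible software model of an FLTT entry, using the fields described above, is sketched below; the field encodings, widths, and example values are assumptions, since the FLTT is a hardware structure:

```python
# Hypothetical model of one Fused Load Tracking Table (FLTT) entry 200.
from dataclasses import dataclass
from typing import Optional

@dataclass
class FLTTEntry:
    v: bool = False            # V field 204: entry is valid
    age: int = 0               # age field 206: time spent awaiting a partner
    f: bool = False            # F field 208: a second load has been fused
    lr1: Optional[str] = None  # LR1 field 210: first load's logical register
    lr2: Optional[str] = None  # LR2 field 212: second load's logical register
    pr1: Optional[str] = None  # PR1 field 214: physical register from rename
    load_source: int = 0       # load source field 216: (base) load address

entry = FLTTEntry(v=True, lr1="%r15", pr1="p42", load_source=0x1000)
assert entry.v and not entry.f   # valid entry, no fusion partner yet
```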
As shown in
When the next instruction bundle passes through the pipeline stage, if a load is detected, the load source is checked against all load source fields within the FLTT. If no address that is within the correct range is found, then a new entry can be created. However, if an address within the correct range is found within the FLTT, the load is tracked and fused using the same FLTT entry, and the F bit is set to one.
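This lookup-or-allocate behavior can be sketched as follows; the 8-byte scalar bound follows the description of the load source field above, while the table representation and remaining details are simplifying assumptions:

```python
# Sketch of the FLTT lookup: a newly detected load either fuses into an
# existing in-range entry (setting its F bit) or allocates a fresh entry.
SCALAR_BOUND = 8  # bytes within which two scalar loads are fusible (assumed)

def track_load(fltt, addr, logical_reg):
    """Fuse the load into a matching FLTT entry, or create a new entry."""
    for entry in fltt:
        in_range = 0 < abs(addr - entry["src"]) < SCALAR_BOUND
        if entry["v"] and not entry["f"] and in_range:
            entry["f"] = True              # mark the entry as a fused load
            entry["lr2"] = logical_reg     # record the second logical register
            return entry
    entry = {"v": True, "f": False, "src": addr,
             "lr1": logical_reg, "lr2": None}
    fltt.append(entry)
    return entry

fltt = []
track_load(fltt, 0x1000, "%r15")       # first load: allocates a new entry
e = track_load(fltt, 0x1004, "%r17")   # +4 bytes: fused into the same entry
assert len(fltt) == 1 and e["f"] and e["lr2"] == "%r17"
```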
As shown in
Since a load fusion is now indicated by the FLTT entry, the load from bundle 300, which is in the next stage of the pipeline, is modified to accommodate the size of both load instructions. In this example, since the load from bundle 300 is a scalar double-word load (movd), it is changed to a scalar quad-word load (movq).
The load from bundle 400 is changed to a null operation (nop) instruction (or in some implementations, is simply eliminated), and a move and shift (e.g., as discussed with respect to Table 3) are injected into the instruction stream. In this example, mov r15, r17 and shr r17, 32 are injected into the instruction stream after the nop. In some implementations, this has the advantage of retaining the register dependencies across instruction bundles 300 and 400. The subtract instruction (sub r17, r18) of instruction bundle 400 is shown for example context.
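The register-level effect of this rewrite can be sketched in a few lines. This is a minimal model, not the actual microarchitecture: it assumes the two 32-bit values sit at consecutive little-endian addresses so the quad-word load places the first value in the low half and the second in the high half, and it uses the example's register names (r15 as the first load's destination, r17 as the second's).

```python
def fused_load_effects(lo32, hi32):
    """Model movq r15 followed by the injected mov and shr instructions."""
    r15 = (hi32 << 32) | lo32   # movq: one 64-bit load covers both values
    r17 = r15                   # injected move: copy r15 into r17
    r17 >>= 32                  # injected shr r17, 32: align second value
    return r15, r17
```

After the injected move and shift, r17 holds exactly the value the eliminated second load would have produced, and the low 32 bits of r15 hold the first load's value, preserving the dependencies that instruction bundle 400's subtract relies on.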
In some implementations, the instruction stream is not modified, and instead, the load fusion occurs within the backend of the processor pipeline (e.g., beyond the dispatch stage; for example, in some implementations, load instructions are fused in the execution (EX) stage and/or the load queue (LQ) stage). In some implementations, it is indicated to the LQ that an amount of data (e.g., 64 bits or 512 bits; the amount of data loaded is indicated by the size of the new load and is not limited to the maximum register size, e.g., two 16-bit loads are fusible into one 32-bit load in some implementations) will be loaded into one LQ entry, and the register state is modified to the correct values. Such implementations can be referred to as reflecting a transparent microarchitectural approach, or as including implicit load fusion.
Since implicit load fusion approaches may not include modifications to the instruction stream to represent a single destination register for the fused load, in some implementations, the FLTT is modified (or otherwise differs from the implementations which alter the instruction stream) to preserve register dependencies.
A difference between the FLTT entry for the implicit and explicit load fusion mechanisms is the second physical register field (PR2 field 515). In some implementations of implicit load fusion, for each load that can be fused, there is one corresponding PR field in the FLTT entry. The example of
Implicit load fusion approaches do not alter the instruction stream as in explicit, instruction modification approaches. Accordingly, instead of dynamically converting instructions within the stream, the implicit mechanism allows both original loads to execute and have their addresses computed, e.g., by address generation units (AGUs). After the address generation is completed, in some implementations, the second load will not allocate a load queue (LQ) entry, but rather, will utilize the same LQ entry as the first load of the fused load. In some implementations, this is done by checking the PR fields of the FLTT to confirm that the physical registers of the load instructions correspond to the correct LQ entry.
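The LQ-entry sharing described above can be sketched as follows. The `LoadQueue` class, its field names, and the `fuse_into` helper are illustrative assumptions for this sketch; the point shown is only that the second fused load widens the first load's existing entry instead of allocating its own.

```python
class LoadQueue:
    """Toy load queue: each entry records an address, a total size in
    bytes, and the physical registers that will receive the data."""

    def __init__(self):
        self.entries = []

    def allocate(self, addr, size, pr):
        # First load of a fused pair: allocate a fresh LQ entry.
        entry = {"addr": addr, "size": size, "prs": [pr]}
        self.entries.append(entry)
        return entry

    def fuse_into(self, entry, extra_size, pr):
        # Second fused load: reuse the first load's entry, widening it
        # and recording the additional destination physical register.
        entry["size"] += extra_size
        entry["prs"].append(pr)
        return entry
```

In this model, checking the FLTT's PR fields corresponds to finding which existing entry's `prs` list the second load's physical register belongs with.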
In the example shown in
When the next instruction bundle passes through the pipeline stage, if a load is detected, the load source is checked against all load source fields within the FLTT. If no address within the correct range is found, then a new entry can be created. However, if an address within the correct range is found within the FLTT, the load is tracked and fused using the same FLTT entry, and the F bit is set to one.
As shown in
At this point, the FLTT entry shown in
As shown in
After the address generation is completed, in some implementations, a LQ entry is filled with data, and in some implementations the FLTT entry 500 is invalidated (e.g., by setting V bit 504 as invalid). In some implementations, the second load will not allocate a separate LQ entry, but rather, will utilize the same LQ entry as the first load of the fused load. This is shown in
In some implementations, indications of the corresponding sizes of the loads (e.g., 32 bits and 128 bits) are moved into the physical registers based on the order of the physical registers set within the FLTT entry. In some implementations, additional hardware (e.g., additional wires from the LQ to the physical register file (PRF)) is implemented to route subsets of the LQ entry to the correct physical registers. In some implementations, this has the advantage of fetching data from memory based on only one load instruction, allowing the loads to complete faster than if two loads were issued. In some implementations, this also has the advantage of reducing the number of LQ entries occupied by co-located loads, thus allowing more loads to be serviced and greater instruction-level parallelism.
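The routing of LQ-entry subsets to physical registers can be sketched as a bit-slicing step. The helper below is an illustrative assumption, not the routing hardware: it treats the fused LQ entry's data as one integer and splits it, low bits first, into chunks whose sizes follow the order of the physical registers recorded in the FLTT entry.

```python
def route_to_prf(lq_data, sizes_in_bits):
    """Split lq_data into low-to-high chunks of the given bit sizes,
    one chunk per destination physical register."""
    values, shift = [], 0
    for size in sizes_in_bits:
        values.append((lq_data >> shift) & ((1 << size) - 1))
        shift += size
    return values
```

For example, a fused entry holding a 32-bit value in its low bits and an 8-bit value above it yields the two original load results in FLTT order.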
In 1110, a first load operation and a second load operation, e.g., of an operation stream, are replaced with a single load operation. In 1120, a register move operation is inserted after the single load operation, e.g., in the operation stream. In 1130, a register shift operation is inserted after the register move operation, e.g., in the operation stream. In 1140, the operations are executed, e.g., as part of the operation stream.
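The numbered steps above can be sketched as a single pass over an operation stream. The tuple encoding of operations, the register names, and the fixed 32-bit load pattern are illustrative assumptions for this sketch.

```python
def fuse_pass(ops):
    """Replace back-to-back 32-bit loads into r1 and r2 with one 64-bit
    load into r1 (1110), then insert a move (1120) and a shift (1130)."""
    first = ("load", "r1", 32)
    second = ("load", "r2", 32)
    out, i = [], 0
    while i < len(ops):
        if ops[i] == first and i + 1 < len(ops) and ops[i + 1] == second:
            out.append(("load", "r1", 64))   # 1110: single wider load
            out.append(("mov", "r2", "r1"))  # 1120: register move
            out.append(("shr", "r2", 32))    # 1130: register shift
            i += 2
        else:
            out.append(ops[i])
            i += 1
    return out                               # 1140: out is then executed
```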
In some implementations, the stream of operations comprises a stream of instructions, or a stream of micro-operations (μops). In some implementations, the one or more operations configured to move and shift the value stored in the destination register of the single load operation comprise a single operation configured to move and shift the value stored in the destination register of the single load operation. In some implementations, the one or more operations configured to move and shift the value stored in the destination register of the single load operation comprise a first operation configured to move the value stored in the destination register of the single load operation and a second operation configured to shift the value stored in the destination register of the single load operation. In some implementations, a size of a data load indicated by the single load operation is less than, or equal in size to, a sum of a data load indicated by the first load operation and of a data load indicated by the second load operation. In some implementations, the first load operation loads data of a size smaller than a width of a destination register of the first load operation, and the second load operation loads data of a size smaller than a width of a destination register of the second load operation. In some implementations, replacing the first load operation and the second load operation comprises replacing the first load operation with the single load operation and either replacing the second load operation with a null operation (nop) or eliminating the second load operation. In some implementations, the one or more operations configured to move and shift the value stored in the destination register of the single load operation comprise a register move operation that copies data from a destination register of the first load operation to a second register.
In some implementations, the one or more operations configured to move and shift the value stored in the destination register of the single load operation comprise a register shift operation that shifts bits of the second register. In some implementations, the one or more operations configured to move and shift the value stored in the destination register of the single load operation comprise a register shift instruction that shifts bits of the second register by a number of bits which aligns data corresponding to the second load operation.
In 1210, information of a first load operation (e.g., a load instruction in a first instruction bundle) is inserted into a tracking table (e.g., an FLTT). On condition 1220 that a second load operation (e.g., of a second instruction bundle) has an address within a range, information of the second load operation is inserted into the tracking table in 1230, and a load from an address indicated by the first load operation and from an address indicated by the second load operation are executed based on the tracking table in 1240; otherwise, the process begins again at 1210. In some implementations, the address is within the range if the addresses are consecutive, if the distance or offset between the load addresses is less than or equal to a size of the destination register, or if the addresses are within any other suitable range.
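The condition checked at 1220 can be sketched as a small predicate. The 64-bit destination width used as the default bound is an illustrative assumption; the two tests shown (consecutive addresses, and offset within the destination register size) follow the text.

```python
def in_fusible_range(addr1, size1, addr2, dest_bits=64):
    """True if the second load's address is consecutive with the first
    load, or its offset fits within the destination register size."""
    offset = addr2 - addr1
    if offset == size1:                    # consecutive addresses
        return True
    return 0 <= offset <= dest_bits // 8   # offset within register size
```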
In some implementations, the first load operation and the second load operation are executed by different address generation units of an execution (EX) stage. In some implementations, the load from the address indicated by the first load operation and the load from the address indicated by the second load operation are queued in a same entry of a load queue (LQ). In some implementations, the load from the address indicated by the first load operation and the load from the address indicated by the second load operation are queued in a same entry of a LQ, at an offset. Some implementations include circuitry configured to route the load from the address indicated by the first load operation and the load from the address indicated by the second load operation from the LQ to different architectural registers. Some implementations include circuitry configured to route the load from the address indicated by the first load operation and the load from the address indicated by the second load operation from subsets of the entry of the LQ to different architectural registers.
V field 1304 indicates whether entry 1300 is valid, and is set to indicate that entry 1300 is valid when a new entry is created (e.g., when a load operation is detected in the rename stage). The valid bit is set to indicate that entry 1300 is invalid if the fused load has been fetched from memory and marked complete as indicated by a signal from the respective load queue (LQ) entry, or if age field 1306 has saturated (e.g., reached a maximum, or threshold value) and the F field 1308 does not indicate that the load operation has been fused (i.e., indicates that further loads have not been detected to fuse with this load).
In some implementations, V field 1304 includes a single bit that is set to indicate validity, and cleared to indicate invalidity, however, any suitable implementation of the V field 1304 is usable in other implementations.
Age field 1306 is used to invalidate unfused entries after a period of time (e.g., after a threshold number of cycles), as described above, to allow later loads to be fused instead.
F field 1308 indicates whether the entry represents a fused load. In some implementations, F field 1308 includes a single bit that is set to indicate that the entry represents a fused load, and cleared to indicate that the entry represents a load that is not fused, however, any suitable implementation of the F field 1308 is usable in other implementations.
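The interaction of fields 1304, 1306, and 1308 described above can be sketched as a per-cycle update. The `AGE_MAX` saturation threshold and the dictionary representation of an entry are illustrative assumptions for this sketch.

```python
AGE_MAX = 7  # assumed saturation threshold for the age counter

def tick(entry, load_complete):
    """Advance one cycle; clear the valid bit per the invalidation rules."""
    if not entry["valid"]:
        return
    if load_complete:            # LQ signals the fused load has completed
        entry["valid"] = False
        return
    entry["age"] = min(entry["age"] + 1, AGE_MAX)
    if entry["age"] == AGE_MAX and not entry["fused"]:
        entry["valid"] = False   # aged out without fusing: free the entry
```

An unfused entry is invalidated once its age saturates, freeing the slot so later loads can be fused instead, while a fused entry remains valid until its load completes.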
LR1 field 1310 indicates the logical register to which a candidate load operation stores its loaded information, LR2 field 1311 indicates the logical register to which a second candidate load operation stores its loaded information, LR3 field 1312 indicates the logical register to which a third candidate load operation stores its loaded information, and LR4 field 1313 indicates the logical register to which a fourth candidate load operation stores its loaded information. It is noted that some implementations include additional LR fields (e.g., in order to fuse more than four load operations), or include fewer LR fields (e.g., in order to fuse fewer than four load operations).
PR1 field 1314 indicates the physical register to which the fused load operation, or the first load operation, was renamed in the same decode bundle or past decode bundle, respectively.
Load source field 1316 indicates the register indicating the address, or base address, from which the fused load operation, or the first load operation, loads its information.
In some implementations, the load source field 1316 is used by future loads to check if they are fetching data within a bound of this entry's load (e.g., an 8-byte or 64-byte bound for scalar or vector loads, respectively), allowing the loads to be fused.
It is noted that in some implementations a FLTT entry includes additional fields (e.g., additional LR fields to track further load candidates in order to fuse more than four loads), or includes fewer than, or a subset of these fields (e.g., where the information is tracked in other ways, or is not tracked).
A difference between the FLTT entry for the implicit and explicit load fusion mechanisms is the second, third, and fourth physical register fields (PR2, PR3, and PR4 fields 1415, 1417, 1418). In some implementations of implicit load fusion, for each load that can be fused, there is one corresponding PR field in the FLTT entry. The example of
It should be understood that many variations are possible based on the disclosure herein. Although features and elements are described above in particular combinations, each feature or element can be used alone without the other features and elements or in various combinations with or without other features and elements.
The various functional units illustrated in the figures and/or described herein (including, but not limited to, the processor 102, the input driver 112, the input devices 108, the output driver 114, the output devices 110, the accelerated processing device 116, the scheduler 136, the graphics processing pipeline 134, the compute units 132, and the SIMD units 138) may be implemented as a general purpose computer, a processor, or a processor core, or as a program, software, or firmware, stored in a non-transitory computer readable medium or in another medium, executable by a general purpose computer, a processor, or a processor core. The methods provided can be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors can be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) operations and other intermediary data including netlists (such operations capable of being stored on a computer readable media). The results of such processing can be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements features of the disclosure.
The methods or flow charts provided herein can be implemented in a computer program, software, or firmware incorporated in a non-transitory computer-readable storage medium for execution by a general purpose computer or a processor. Examples of non-transitory computer-readable storage mediums include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs).