Various configurations of the current invention relate generally to apparatus, systems, and methods for storing items in a queue. More particularly, the apparatus, systems, and methods relate to a queue that tracks the age of items within a queue. Specifically, the apparatus, systems, and methods provide for a queue that allows for items to be removed from the queue in a different order than how the items were placed in the queue.
In a processor, buffers are often provided between different functional units. In many cases, these buffers are implemented as a queue, which has an implicit relative ordering of slots in the queue. Items to be buffered arrive (or are stored) serially to the queue. In such a queue, the relative order of the items in the queue represents a relative order of arrival (i.e., for every item in the queue, it is possible to determine whether any other item arrived earlier or later than that item simply by that item's relative position in the queue).
Other buffers may implement a First In First Out (FIFO) priority scheme. However, in situations where items may become ready for further processing out of an order in which they arrive, maintaining FIFO priority delays further processing of some items that are ready to be used within a processor. Thus, it may be desirable to be able to pick items from a buffer out of FIFO order. However, at times it may be desirable to pick an oldest item from among items that are ready to be output from the buffer.
One way to maintain the relative age of items in a queue in which items may leave the queue out of FIFO order is to compact later-arriving items into the slot(s) that were vacated. As long as the relative order of the items does not change, the order continues to represent the correct arrival order of the items. Newly arriving items are placed in the first empty slot at the back of the queue. However, such compaction consumes power and time to move items through the queue. Also, it is generally the case that items close to the front of the queue are more likely to become ready for retirement or removal from the queue, so as a queue becomes larger, items may need to be repeatedly shifted to the front.
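By way of illustration only, the compaction approach described above may be modeled in software as follows. This is a minimal sketch and not a description of any particular hardware; the names `remove_and_compact` and `append_item`, and the use of `None` to mark an empty slot, are assumptions made for this example. The sketch counts how many items must move when a single item leaves out of FIFO order.

```python
def remove_and_compact(slots, index):
    """Remove the item at `index` and shift every later item one slot forward.

    Returns the removed item and the number of items that had to move.
    Empty slots are represented by None (an assumption of this sketch).
    """
    removed = slots[index]
    moves = 0
    i = index
    # Each later-arriving item is copied into the slot in front of it.
    while i + 1 < len(slots) and slots[i + 1] is not None:
        slots[i] = slots[i + 1]
        moves += 1
        i += 1
    slots[i] = None  # the last occupied slot is now empty
    return removed, moves


def append_item(slots, item):
    """Place a newly arriving item in the first empty slot at the back."""
    for i, current in enumerate(slots):
        if current is None:
            slots[i] = item
            return i
    raise OverflowError("queue is full")


queue = ["a", "b", "c", "d", None, None]
removed, moves = remove_and_compact(queue, 1)  # "b" leaves out of FIFO order
# queue is now ["a", "c", "d", None, None, None]; two items were moved
slot = append_item(queue, "e")                 # "e" lands in slot 3
```

In this model the relative position of the remaining items still encodes their arrival order, but every out-of-order removal costs one move per younger item, which is the power and time penalty noted above.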
Another way to track relative age of items in a queue is to maintain a counter for each slot in the queue. For example, if the counter for a slot starts to be incremented when an item enters that slot, then the item in the slot with the highest counter value is the oldest. When an item leaves a slot, the counter for that slot is reset. When a new item arrives, an empty slot can be selected and then the counter for that slot again starts to be incremented. Implementing such a counter scheme requires maintaining a counter value for each queued item. In practice, the registers that store each count must be sized appropriately. The count should not roll over while items age, since that would corrupt the aging information. Thus, the register holding the count needs to be sized according to an expected maximum number of cycles that a given item may remain in the queue. If a queue has only a few slots, and a maximum delay is small, then implementing such a counter scheme is relatively low cost. However, for a larger queue, or in situations where a maximum delay is potentially large, implementing a counter scheme is expensive. What is needed is a better queue.
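For illustration, the counter scheme described above may be sketched as follows. The sketch assumes that the counter of every occupied slot advances once per cycle, which is one possible reading of the scheme; the names `counter_bits` and `CounterQueue` are hypothetical and are introduced only for this example.

```python
def counter_bits(max_cycles):
    """Smallest register width that can hold 0..max_cycles without rollover."""
    return max(1, max_cycles.bit_length())


class CounterQueue:
    """Software model of a queue with one age counter per slot."""

    def __init__(self, num_slots, max_cycles):
        self.width = counter_bits(max_cycles)
        self.items = [None] * num_slots   # None marks an empty slot
        self.count = [0] * num_slots

    def insert(self, item):
        slot = self.items.index(None)     # raises ValueError if the queue is full
        self.items[slot] = item
        self.count[slot] = 0              # counting starts again for this slot
        return slot

    def tick(self):
        """Advance one cycle: every occupied slot ages by one."""
        for slot, item in enumerate(self.items):
            if item is not None:
                self.count[slot] += 1
                if self.count[slot] >= (1 << self.width):
                    raise OverflowError("age counter rolled over")

    def remove(self, slot):
        item = self.items[slot]
        self.items[slot] = None
        self.count[slot] = 0              # counter is reset when the item leaves
        return item

    def oldest_slot(self):
        occupied = [s for s, item in enumerate(self.items) if item is not None]
        if not occupied:
            return None
        return max(occupied, key=lambda s: self.count[s])
```

Under these assumptions, an item that may wait up to 1000 cycles needs a 10-bit counter per slot (`counter_bits(1000)` returns 10), and the register width grows with the maximum delay regardless of how many items are actually waiting. Items that enter in the same cycle would have equal counts in this model, so a real design would need an additional tie-breaking rule.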
One embodiment is an apparatus for tracking the age of items stored within a queue in a processor. In one configuration, an apparatus includes an item storage array and an array of age-tracking bits. The item storage array stores data associated with valid items stored in the queue. Age-tracking bits associated with a subset of items in the queue are set to a first value when the subset of items is older than other items in the queue. The younger items in the queue correspond to the age-tracking bits set to the first value. Other age-tracking bits associated with the subset of items in the queue are set to a second value when the subset of items is younger than other items in the queue. Older queue items correspond to the age-tracking bits set to the second value. The queue may include picker logic for finding an oldest item in the queue based on the array of age-tracking bits. In other configurations, the subset of items in the queue may correspond to single items within the queue.
Another embodiment is a method of tracking items in a queue which may be part of a microprocessor. The method begins by storing a particular item into an item storage array portion of the queue that stores data associated with valid items stored in the queue. For example, an opcode ID, an address, a ready bit, a valid bit and/or other data associated with an item may be stored as an entry in the item storage array. In one configuration, the queue may be part of a load and store unit and may store parts of addresses and other portions of load and store instructions. Age-tracking bits associated with the particular item are set to a first value to indicate the particular item is older than other items (or entries) in the queue. Younger queue items correspond to the age-tracking bits set to the first value. Similarly, other age-tracking bits associated with the particular item are set to a second value to indicate the particular item is younger than other items in the queue. Older queue items correspond to the age-tracking bits set to the second value. The values may be binary values of zero “0” and one “1”. An age of the particular item in the queue is determined based, at least in part, on the age-tracking bits. As discussed below, Boolean logic in combination with comparators may be used to analyze the age-tracking bits to determine the oldest item in the queue or the age of any item in the queue relative to other items in the queue.
One or more preferred embodiments that illustrate the best mode(s) are set forth in the drawings and in the following description. The appended claims particularly and distinctly point out and set forth the invention.
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate various example methods and other example embodiments of various aspects of the invention. It will be appreciated that the illustrated element boundaries (e.g., boxes, groups of boxes, or other shapes) in the figures represent one example of the boundaries. One of ordinary skill in the art will appreciate that in some examples one element may be designed as multiple elements or that multiple elements may be designed as one element. In some examples, an element shown as an internal component of another element may be implemented as an external component and vice versa. Furthermore, elements may not be drawn to scale.
Similar numbers refer to similar parts throughout the drawings.
Queue 10 of
Item storage array 16 is the portion of queue 10 that stores data associated with an item being stored in queue 10. For example, if queue 10 is implemented as part of a load and store unit, then addresses or partial addresses and other information associated with a load or store instruction may be stored in corresponding entries 1-4 of item storage array 16.
As illustrated, array of age-tracking bits 12 is a 4 by 4 array of bits. As illustrated, one diagonal line of bits from the top left corner to the bottom right corner of array 12 is unimplemented and is marked with Xs through those locations. These bits are unimplemented because each bit in each row of the array of age-tracking bits 12 indicates whether the queue entry of that row is older or younger than another row's entry, so the diagonal bits would only indicate whether an item is younger or older than itself and are not needed. For simplicity, a 4 by 4 array of age-tracking bits 12 is illustrated; however, in other configurations, this array may be any size, N×N. Notice that the array of age-tracking bits 12 is an efficient way of keeping track of the age of items in an array. For a 32-entry queue, 32 × 32 = 1024 (1K) bits are needed, minus the 32 unimplemented diagonal bits, for a total of 992 bits. Later, a way of grouping items in the array together is explained that further reduces the number of age-tracking bits needed to track the age of each entry in a queue.
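The bit count given above follows directly from the shape of the array. As a small illustrative calculation (the function names `age_bits_needed` and `layout` are assumptions of this sketch, not terms used elsewhere in this description):

```python
def age_bits_needed(num_entries):
    """Implemented bits in an N x N age-tracking array without its diagonal."""
    return num_entries * num_entries - num_entries


def layout(num_entries):
    """Text picture of the array: 'X' marks an unimplemented diagonal bit."""
    return [
        " ".join("X" if row == col else "b" for col in range(num_entries))
        for row in range(num_entries)
    ]


for line in layout(4):
    print(line)
print(age_bits_needed(4))    # 12 implemented bits for the 4 by 4 example
print(age_bits_needed(32))   # 992 implemented bits for a 32-entry queue
```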
As mentioned, each bit in the array of age-tracking bits 12 indicates the age of an item in queue 10 with respect to other entries in queue 10. The row of age-tracking bits of the array of age-tracking bits 12 to the left of each item stored in item storage array 16 of queue 10 contains age information of other entries in the queue with respect to the item stored in that row.
In the configuration of
When an item is ready to retire or otherwise ready to be removed from queue 10, picker logic 32 has the capability to find the oldest item in queue 10 or to find an item in queue 10 that may be ready to retire out of order and to place that item on output bus 38. Picker logic 32 may compare different age-tracking bits of the array of age-tracking bits 12 as discussed above to determine which entry in queue 10 is the oldest and may be a candidate to retire. Alternatively, picker logic 32 may be provided other information about an item in queue 10 that is to retire out of order. Picker logic 32 uses information about the oldest entry in queue 10 or information about an entry in queue 10 to be removed from queue 10 out of order to select the appropriate entry in queue 10 and may place that entry on output bus 38 as it is removed/retired/cleared from queue 10.
In some embodiments, queue maintenance logic 34 may assist placement logic 30 and picker logic 32 in placing and picking items from queue 10 and/or performing other useful functions. For example, when queue 10 is part of a load and store unit, addresses may be one item stored in queue 10. When provided an address, queue maintenance logic 34 may compare that address to addresses stored in queue 10 to determine if one or more addresses in queue 10 match that address. When one or more queue addresses match, it may be necessary for a store instruction associated with a matching queue address to forward/bypass its data to another instruction associated with the address to which it was matched. In other embodiments, portions or all of the queue maintenance logic 34 may be part of placement logic 30 and/or picker logic 32. Placement logic 30, picker logic 32, and/or queue maintenance logic 34 may implement comparison functions or other functionality as understood by those of ordinary skill in the art.
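The address comparison described above may be pictured, for illustration only, as comparing one incoming address against every valid entry at once. In the following sketch the `Entry` fields and the function names `match_vector` and `matching_stores` are hypothetical; the actual contents of an entry in item storage array 16 may differ.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    valid: bool      # entry currently holds an item
    is_store: bool   # True for a store instruction, False for a load
    address: int     # full or partial address stored with the item


def match_vector(entries, address):
    """One match flag per entry, as a row of parallel comparators would produce."""
    return [e.valid and e.address == address for e in entries]


def matching_stores(entries, address):
    """Indices of valid store entries whose address equals `address`."""
    return [
        i for i, e in enumerate(entries)
        if e.valid and e.is_store and e.address == address
    ]


entries = [
    Entry(valid=True, is_store=True, address=0x40),
    Entry(valid=True, is_store=False, address=0x40),
    Entry(valid=False, is_store=True, address=0x40),
    Entry(valid=True, is_store=True, address=0x80),
]
hits = matching_stores(entries, 0x40)   # only entry 0 is a valid matching store
```

When more than one store matches, age information for the matching entries would be needed to decide which store should forward its data; that selection is not modeled here.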
In one configuration, placement logic 30, picker logic 32, and/or queue maintenance logic 34, when picking the oldest entry from queue 10, do not need to compare information of one queue entry to any other queue entry. Rather, each individual entry can independently look at its own row of age bits to determine if there are any other entries that are older than it. If so, it outputs “0”, indicating it is not the oldest entry. Otherwise, it outputs an indicator, such as its row number or its associated data, indicating it is the oldest entry. The outputs of the entries may then simply be ORed together so that the oldest value is read out. This kind of logic may be implemented essentially with AND gates and OR gates, which results in a very small number of gates and is very efficient in terms of area and speed.
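A minimal software model of this row-wise selection is sketched below. It rests on several assumptions that are not taken from the figures: a set bit j in row i is taken to mean that entry j is older than entry i, each entry's output is a one-hot indicator rather than a row number (so that row 0 is distinguishable from "not oldest"), and selection is restricted to entries marked ready. The class name `AgeMatrixQueue` and its method names are hypothetical.

```python
class AgeMatrixQueue:
    """Software model of a queue with an N x N array of age-tracking bits."""

    def __init__(self, num_entries):
        self.n = num_entries
        self.valid = [False] * num_entries
        self.ready = [False] * num_entries
        self.data = [None] * num_entries
        # older[i] is row i as a bit mask: bit j set means entry j is older
        # than entry i (the convention assumed for this sketch).
        self.older = [0] * num_entries

    def place(self, item):
        slot = self.valid.index(False)    # raises ValueError if the queue is full
        # Every entry that is already valid is older than the new item.
        self.older[slot] = sum(1 << j for j in range(self.n) if self.valid[j])
        # The new item is younger than everything: clear its column elsewhere.
        for i in range(self.n):
            if i != slot:
                self.older[i] &= ~(1 << slot)
        self.valid[slot] = True
        self.ready[slot] = False
        self.data[slot] = item
        return slot

    def mark_ready(self, slot):
        self.ready[slot] = True

    def pick_oldest_ready(self):
        """Return a one-hot mask of the oldest ready entry, or 0 if none."""
        candidates = sum(
            1 << j for j in range(self.n) if self.valid[j] and self.ready[j]
        )
        picked = 0
        for i in range(self.n):
            is_candidate = (candidates >> i) & 1
            has_older = self.older[i] & candidates   # AND of row with candidates
            if is_candidate and not has_older:
                picked |= 1 << i                     # OR of per-entry outputs
        return picked

    def remove(self, slot):
        item = self.data[slot]
        self.valid[slot] = False
        self.ready[slot] = False
        self.data[slot] = None
        return item
```

In this model each row is examined independently, using only an AND of the row with the candidate mask followed by an OR across the per-entry outputs. Bits left behind by a removed entry are harmless because the candidate mask excludes invalid entries, and the column is cleared when the slot is reused.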
While using the array of age-tracking bits 112 to track multiple entries per group reduces its size, there may need to be some implied ordering as to how slots A/B within a group are written to and removed from array 110. In one configuration, and as discussed below, once the first slot, A, of a group is written to with a valid item, the next item written to array 110 must be written to slot B. Similarly, once slot A or B is removed from a group, no other item may be written to that group before both slots A and B are removed from that group. Of course, those of ordinary skill in the art will appreciate that in other configurations the group sizes may be larger than two slots and that array 110 and an array of age-tracking bits 112 may be other sizes than what is illustrated and described herein.
Similar to array 10 of
To maintain implicit ordering, the next item to be entered into queue 110 will be loaded into group 2, slot A, because it is the only empty group with two valid bits with a value of “0”.
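The placement rule for two-slot groups may be sketched as follows, for illustration only. The sketch assumes zero-based group indices, two slots (A and B) per group, and one age-tracking bit per ordered pair of distinct groups; the names `GroupedSlots` and `grouped_age_bits` are hypothetical.

```python
def grouped_age_bits(num_entries, group_size=2):
    """Age-tracking bits when ages are tracked per group rather than per entry."""
    num_groups = num_entries // group_size
    return num_groups * num_groups - num_groups


class GroupedSlots:
    """Tracks valid bits for groups of two slots and chooses where to place items."""

    def __init__(self, num_groups):
        self.valid = [[False, False] for _ in range(num_groups)]
        self.open_group = None   # group whose slot A was just written

    def place(self):
        # Once slot A of a group is written, the next item must go to slot B.
        if self.open_group is not None:
            group = self.open_group
            self.valid[group][1] = True
            self.open_group = None
            return group, "B"
        # Otherwise only a group with both valid bits at 0 may be started.
        for group, (a, b) in enumerate(self.valid):
            if not a and not b:
                self.valid[group][0] = True
                self.open_group = group
                return group, "A"
        raise OverflowError("no empty group available")

    def remove(self, group, slot):
        self.valid[group]["AB".index(slot)] = False
        if self.open_group == group:
            self.open_group = None   # group emptied before slot B was used
```

Under these assumptions a 32-entry queue arranged as 16 groups of two needs `grouped_age_bits(32)`, or 240, age-tracking bits, because arrival order within a group is implied by the A-then-B write rule. A group that has lost one of its two items is skipped by `place` until its other slot is also removed.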
Example methods may be better appreciated with reference to flow diagrams. While, for purposes of simplicity of explanation, the illustrated methodologies are shown and described as a series of blocks, it is to be appreciated that the methodologies are not limited by the order of the blocks, as some blocks can occur in different orders and/or concurrently with other blocks from that shown and described. Moreover, less than all the illustrated blocks may be required to implement an example methodology. Blocks may be combined or separated into multiple components. Furthermore, additional and/or alternative methodologies can employ additional, not illustrated blocks.
The fetch logic 752 pre-fetches software instructions from memory that the processor 750 will execute. These pre-fetched instructions are placed in an instruction cache 754. These instructions are later removed from the instruction cache 754 by the decode and rename logic 756 and decoded into instructions that the processor can process. These instructions are also renamed and placed in the instruction queue 758. The decode and rename logic 756 also provides information associated with branch instructions to the branch predictor and Instruction Translation Lookaside Buffers (ITLBs) 760. The branch predictor and ITLBs 760 predict branches and provide this branch prediction information to the fetch logic 752 so instructions of predicted branches are fetched.
A re-order buffer 762 stores results of speculatively completed instructions that may not be ready to retire in programming order. The re-order buffer 762 may also be used to unroll mispredicted branches. The reservation station(s) 768 provides a location to which instructions can write their results without requiring a register to become available. The reservation station(s) 768 also provide for register renaming and dynamic instruction rescheduling. The commit unit 764 determines when instruction data values are ready to be committed/loaded into one or more registers in the register file 772. The load and store unit 766 monitors load and store instructions to be sure accesses to and from memory follow sequential program order, even though the processor 750 is speculatively executing instructions out of order. For example, the load and store unit 766 will not allow a load to load data from a memory location that a pending older store instruction has not yet written.
Instructions are executed in one or more out-of-order pipeline(s) 770 that are not required to execute instructions in programming order. In general, instructions eventually write their results to the register file 772.
Modern general purpose processors regularly require in excess of two billion transistors to be implemented, while graphics processing units may have in excess of five billion transistors. Such transistor counts are likely to increase. Such processors have used these transistors to implement increasingly complex operation reordering, prediction, more parallelism, larger memories (including more and bigger caches) and so on. As such, it becomes necessary to be able to describe or discuss technical subject matter concerning such processors, whether general purpose or application specific, at a level of detail appropriate to the technology being addressed. In general, a hierarchy of concepts is applied to allow those of ordinary skill to focus on details of the matter being addressed.
For example, high-level features, such as what instructions a processor supports, convey architectural-level detail. When describing high-level technology, such as a programming model, such a level of abstraction is appropriate. Microarchitecture detail describes high-level detail concerning an implementation of architecture (even as the same microarchitecture may be able to execute different ISAs). Yet, microarchitecture detail typically describes different functional units and their interrelationship, such as how and when data moves among these different functional units. As such, referencing these units by their functionality is also an appropriate level of abstraction, rather than addressing implementations of these functional units, since each of these functional units may themselves comprise hundreds of thousands or millions of gates. When addressing some particular feature of these functional units, it may be appropriate to identify constituent functions of these units, and abstract those, while addressing in more detail the relevant part of that functional unit.
Eventually, a precise logical arrangement of the gates and interconnect (a netlist) implementing these functional units (in the context of the entire processor) can be specified. However, how such logical arrangement is physically realized in a particular chip (how that logic and interconnect is laid out in a particular design) still may differ in different process technology and for a variety of other reasons. Many of the details concerning producing netlists for functional units as well as actual layout are determined using design automation, proceeding from a high-level logical description of the logic to be implemented (e.g., a “hardware description language”).
The term “circuitry” does not imply a single electrically connected set of circuits. Circuitry may be fixed function, configurable, or programmable. In general, circuitry implementing a functional unit is more likely to be configurable, or may be more configurable, than circuitry implementing a specific portion of a functional unit. For example, an Arithmetic Logic Unit (ALU) of a processor may reuse the same portion of circuitry differently when performing different arithmetic or logic operations. As such, that portion of circuitry is effectively circuitry or part of circuitry for each different operation, when configured to perform or otherwise interconnected to perform each different operation. Such configuration may come from or be based on instructions, or microcode, for example.
In all these cases, describing portions of a processor in terms of its functionality conveys structure to a person of ordinary skill in the art. In the context of this disclosure, the term “unit” refers, in some implementations, to a class or group of circuitry that implements the function or functions attributed to that unit. Such circuitry may implement additional functions, and so identification of circuitry performing one function does not mean that the same circuitry, or a portion thereof, cannot also perform other functions. In some circumstances, the functional unit may be identified, and then functional description of circuitry that performs a certain feature differently, or implements a new feature, may be described. For example, a “decode unit” refers to circuitry implementing decoding of processor instructions. The description explicates that in some aspects such decode unit, and hence circuitry implementing such decode unit, supports decoding of specified instruction types. Decoding of instructions differs across different architectures and microarchitectures, and the term makes no exclusion thereof, except for the explicit requirements of the claims. For example, different microarchitectures may implement instruction decoding and instruction scheduling somewhat differently, in accordance with design goals of that implementation. Similarly, there are situations in which structures have taken their names from the functions that they perform. For example, a “decoder” of program instructions, that behaves in a prescribed manner, describes structure supporting that behavior. In some cases, the structure may have permanent physical differences or adaptations from decoders that do not support such behavior. However, such structure also may be produced by a temporary adaptation or configuration, such as one caused under program control, microcode, or other source of configuration.
Different approaches to design of circuitry exist. For example, circuitry may be synchronous or asynchronous with respect to a clock. Circuitry may be designed to be static or be dynamic. Different circuit design philosophies may be used to implement different functional units or parts thereof. Absent some context-specific basis, “circuitry” encompasses all such design approaches.
Although circuitry or functional units described herein may be most frequently implemented by electrical circuitry, and more particularly by circuitry that primarily relies on a transistor implemented in a semiconductor as a primary switch element, this term is to be understood in relation to the technology being disclosed. For example, different physical processes may be used in circuitry implementing aspects of the disclosure, such as optical, nanotubes, micro-electrical mechanical elements, quantum switches or memory storage, magneto resistive logic elements, and so on. Although a choice of technology used to construct circuitry or functional units according to the technology may change over time, this choice is an implementation decision to be made in accordance with the then-current state of technology. This is exemplified by the transitions from using vacuum tubes as switching elements to using circuits with discrete transistors, to using integrated circuits, and advances in memory technologies, in that while there were many inventions in each of these areas, these inventions did not necessarily change how computers fundamentally worked. For example, the use of stored programs having a sequence of instructions selected from an instruction set architecture was an important change from a computer that required physical rewiring to change the program, but subsequently, many advances were made to various functional units within such a stored-program computer.
Functional modules may be composed of circuitry where such circuitry may be a fixed function, configurable under program control or under other configuration information, or some combination thereof. Functional modules themselves thus may be described by the functions that they perform to helpfully abstract how some of the constituent portions of such functions may be implemented.
In some situations, circuitry and functional modules may be described partially in functional terms and partially in structural terms. In some situations, the structural portion of such a description may be described in terms of a configuration applied to circuitry or to functional modules, or both.
Although some subject matter may have been described in language specific to examples of structural features and/or method steps, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to these described features or acts. For example, a given structural feature may be subsumed within another structural element, or such feature may be split among or distributed to distinct components. Similarly, an example portion of a process may be achieved as a byproduct or concurrently with performance of another act or process, or may be performed as multiple, separate acts in some implementations. As such, implementations according to this disclosure are not limited to those that have a 1:1 correspondence to the examples depicted and/or described.
Above, various examples of computing hardware and/or software programming were explained, as well as examples of how such hardware/software can intercommunicate. These examples of hardware or hardware configured with software and such communication interfaces provide means for accomplishing the functions attributed to each of them. For example, a means for performing implementations of software processes described herein includes machine-executable code used to configure a machine to perform such process. Some aspects of the disclosure pertain to processes carried out by limited configurability or fixed-function circuits and in such situations, means for performing such processes include one or more of special purpose and limited-programmability hardware. Such hardware can be controlled or invoked by software executing on a general purpose computer.
Implementations of the disclosure may be provided for use in embedded systems, such as televisions, appliances, vehicles, personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, game consoles, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, and the like.
In addition to hardware embodiments (e.g., within or coupled to a Central Processing Unit (“CPU”), microprocessor, microcontroller, digital signal processor, processor core, System on Chip (“SOC”), or any other programmable or electronic device), implementations may also be embodied in software (e.g., computer-readable code, program code, instructions and/or data disposed in any form, such as source, object or machine language) disposed, for example, in a computer usable (e.g., readable) medium configured to store the software. Such software can enable, for example, the function, fabrication, modeling, simulation, description, and/or testing of the apparatus and methods described herein. For example, this can be accomplished through the use of general programming languages (e.g., C, C++), GDSII databases, hardware description languages (HDL) including Verilog HDL, VHDL, SystemC Register Transfer Level (RTL), and so on, or other available programs, databases, and/or circuit (i.e., schematic) capture tools. Embodiments can be disposed in computer usable medium including non-transitory memories such as memories using semiconductor, magnetic disk, optical disk, ferrous, resistive memory, and so on.
As specific examples, it is understood that implementations of disclosed apparatuses and methods may be implemented in a semiconductor intellectual property core, such as a microprocessor core, or a portion thereof, embodied in a Hardware Description Language (HDL), that can be used to produce a specific integrated circuit implementation. A computer readable medium may embody or store such description language data, and thus constitute an article of manufacture. A non-transitory machine readable medium is an example of computer-readable media. Examples of other embodiments include computer readable media storing Register Transfer Language (RTL) description that may be adapted for use in a specific architecture or microarchitecture implementation. Additionally, the apparatus and methods described herein may be embodied as a combination of hardware and software that configures or programs hardware.
Also, in some cases, terminology has been used herein because it is considered to more reasonably convey salient points to a person of ordinary skill, but such terminology should not be considered to imply a limit as to a range of implementations encompassed by disclosed examples and other aspects. A number of examples have been illustrated and described in the preceding disclosure. By necessity, not every example can illustrate every aspect, and the examples do not illustrate exclusive compositions of such aspects. Instead, aspects illustrated and described with respect to one figure or example can be used or combined with aspects illustrated and described with respect to other figures. As such, a person of ordinary skill would understand from these disclosures that the above disclosure is not limiting as to constituency of embodiments according to the claims, and rather the scope of the claims define the breadth and scope of inventive embodiments herein. The summary and abstract sections may set forth one or more but not all exemplary embodiments and aspects of the invention within the scope of the claims.
In the foregoing description, certain terms have been used for brevity, clearness, and understanding. No unnecessary limitations are to be implied therefrom beyond the requirement of the prior art because such terms are used for descriptive purposes and are intended to be broadly construed. Therefore, the invention is not limited to the specific details, the representative embodiments, and illustrative examples shown and described. Thus, this application is intended to embrace alterations, modifications, and variations that fall within the scope of the appended claims.
Moreover, the description and illustration of the invention is an example and the invention is not limited to the exact details shown or described. References to “the preferred embodiment”, “an embodiment”, “one example”, “an example” and so on, indicate that the embodiment(s) or example(s) so described may include a particular feature, structure, characteristic, property, element, or limitation, but that not every embodiment or example necessarily includes that particular feature, structure, characteristic, property, element, or limitation.