1. Field of the Invention
The invention generally relates to microprocessors and is of particular relevance to microprocessors that employ a pipeline with a branch target buffer (BTB).
2. Related Art
A BTB is typically a small cache of memory associated with a pipeline in a processor. A BTB is used to predict the target of a branch that is likely to be taken by comparing an instruction address against previously executed instruction addresses that have been stored in the BTB. This can save time in processing because it allows the processor to “skip” the step of computing a target address; instead it can just look it up in the BTB. Accordingly, the frequency with which a BTB can generate a “hit” for the target address directly impacts the speed with which an instruction can be executed. That is, the speed of execution is directly related to the number of entries a BTB can store. Traditionally, the only way to increase the number of entries a BTB could store was by increasing the size of the buffer.
Given that space is at a premium in modern microprocessors, it would be desirable to increase BTB performance without having to increase the size of the buffer itself. Accordingly, what is needed is an improved BTB with an optimized hit rate and improved performance relative to previous buffers.
To that end, embodiments of the present disclosure relate to improved BTBs and methods of processing data that address these concerns. The improved BTBs facilitate improved power usage, faster execution, and more efficient return prediction. According to various embodiments, a BTB is provided that includes a non-return buffer, a return buffer, and a multiplexer. The non-return buffer is designed to store a plurality of non-return entries. Each non-return entry corresponds to a non-return type instruction (e.g., an unconditional jump, a conditional branch, etc.). The return buffer is designed to store a plurality of return entries that each correspond to a return type instruction. Additionally, the return buffer may generate a control signal. The multiplexer receives the control signal and outputs either data from the non-return buffer or data from a return prediction stack (RPS). Whether the multiplexer outputs data from the non-return buffer or the RPS depends on the control signal.
According to various embodiments, the return buffer determines whether one of the plurality of return entries contains a tag that corresponds to an instruction address. Further, the return buffer generates the control signal such that it causes the multiplexer to output data from the head of the RPS when it determines that a tag corresponds to the instruction address, and to output data from the non-return buffer when it determines that none of the plurality of return entries contains a tag that corresponds to the instruction address. The non-return buffer may also determine whether one of the plurality of non-return entries corresponds to the instruction address.
According to various embodiments, a method of fetching an address using a BTB is provided. According to the method, data relating to an instruction address is received. It can then be determined whether one of a plurality of return entries stored in a return buffer corresponds to the instruction address. Data can be output from one of a return prediction stack (RPS) and a non-return buffer based on the determination.
The determination of whether a return entry corresponds to the instruction address includes determining whether one of the plurality of return entries contains a tag that corresponds to the instruction address. Additionally, a control signal may be generated based on the determination. The control signal causes data from the RPS to be output when it is determined that one of the return entries corresponds to the instruction address. Conversely, the control signal may be generated to cause data from the non-return buffer to be output when it is determined that none of the return entries corresponds to the instruction address.
BRIEF DESCRIPTION OF THE DRAWINGS/FIGURES
The accompanying drawings, which are incorporated herein and form part of the specification, illustrate the present invention and, together with the description, further serve to explain the principles of the invention and to enable a person skilled in the pertinent art to make and use the invention.
Features and advantages of the invention will become more apparent from the detailed description of embodiments of the invention set forth below when taken in conjunction with the drawings in which like reference characters identify corresponding elements throughout. In the drawings, like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements. The drawing in which an element first appears is indicated by the leftmost digit(s) in the corresponding reference number.
The following detailed description of embodiments of the invention refers to the accompanying drawings that illustrate exemplary embodiments. Embodiments described herein relate to a low power multiprocessor. In particular, the processor described herein has the benefit of using even less power than existing multiprocessors due to the improved scheme provided below. Other embodiments are possible, and modifications can be made to the embodiments within the spirit and scope of this description. Therefore, the detailed description is not meant to limit the embodiments described below.
It should be apparent to one of skill in the relevant art that the embodiments described below can be implemented in many different embodiments of software, hardware, firmware, and/or the entities illustrated in the figures. Any actual software code with the specialized control of hardware to implement embodiments is not limiting of this description. Thus, the operational behavior of embodiments will be described with the understanding that modifications and variations of the embodiments are possible, given the level of detail presented herein.
Operation O2 is depicted in
One way in which delay can be avoided is to employ the use of a branch target buffer (BTB) 302 as depicted in
According to various embodiments, BTB 302 functions by comparing an instruction address against the tag portion 306T of its various entries, e.g., 3041, 3042, 3043 . . . 304N, to determine whether any of the entries 3041, 3042, 3043 . . . 304N corresponds to the instruction address. If there is a match (or “hit,” as it is sometimes called), then the associated data portion 306D of that entry can be used to determine the target address of the branch. This saves the pipeline any delay associated with calculating the target address.
The instruction address is then compared with the various entries (e.g., 3041, 3042, 3043 . . . 304N). In particular, according to various embodiments, the tag portion 306T of the entries is used to compare the entries to the instruction address.
At step 408, method 400 determines whether any of the tag portions 306T match or correspond to the instruction address. If it is determined that there is a match at step 408, then BTB 302 uses data portion 306D to determine the appropriate target address for the instruction. If, however, it is determined that there is not a match at step 408, then the instruction fetcher 102 is forced to calculate the target address normally, which can incur a delay according to various embodiments. At step 414, method 400 ends.
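By way of non-limiting illustration, the tag comparison of method 400 can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the disclosed hardware: the names `BTBEntry`, `btb_lookup`, `next_fetch_target`, and `slow_path` are hypothetical, the tag is assumed to be the full instruction address, and the "calculate normally" fallback is modeled by an arbitrary stand-in function.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class BTBEntry:
    tag: int   # models tag portion 306T: identifies a previously executed branch
    data: int  # models data portion 306D: target address recorded for that branch


def btb_lookup(entries: List[BTBEntry], instruction_address: int) -> Optional[int]:
    """Compare the address against every tag; return the target on a hit, else None."""
    for entry in entries:
        if entry.tag == instruction_address:  # assumption: full-address tag match
            return entry.data
    return None


def next_fetch_target(
    entries: List[BTBEntry],
    instruction_address: int,
    compute_target: Callable[[int], int],
) -> Tuple[int, bool]:
    """Return (target, hit). On a miss the slower calculation path is used."""
    target = btb_lookup(entries, instruction_address)
    if target is not None:
        return target, True  # hit: the target calculation is skipped
    return compute_target(instruction_address), False  # miss: delay is incurred


# Hypothetical contents and a hypothetical stand-in for the normal calculation.
btb = [BTBEntry(tag=0x1000, data=0x2000), BTBEntry(tag=0x1010, data=0x3000)]
slow_path = lambda pc: pc + 4
hit_result = next_fetch_target(btb, 0x1010, slow_path)
miss_result = next_fetch_target(btb, 0x1020, slow_path)
```

In this sketch, address 0x1010 matches a stored tag and yields the recorded target, whereas address 0x1020 matches nothing and falls through to the stand-in calculation.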
An interesting situation arises when return-type instructions are part of BTB 302. Return type instructions comprise register-indirect branches and can, therefore, have dynamic target prediction. That is, for the same program counter, the next fetch address could be different, depending on the instruction code path on which the return instruction was fetched and executed. This property of return type instructions puts pressure on BTB 302 sizing. However, it is possible to divide BTB 302 into a dedicated return buffer and a dedicated non-return buffer to reduce this pressure. Such a scheme is illustrated in
According to various embodiments, return buffer 504 is configured to store a number of entries that correspond to return type instructions. As shown in
Non-return buffer 506 contains a number of entries M relating to non-return type instructions. In an embodiment, each entry contains a tag portion 506T and a data portion 506D. Tag portion 506T can contain information that identifies a previously executed instruction and the data portion 506D contains information that identifies the target address of the corresponding previously executed instruction. According to some embodiments, the number of entries M in the non-return buffer 506 may be greater than the number of entries P in the return buffer 504.
Multiplexer 508 multiplexes between data received from non-return buffer 506 and RPS 510 according to various embodiments. The multiplexer 508 may, for instance, receive control signal 516 from return buffer 504 and, based on the control signal, send either non-return data 506D or data from RPS 510 to output 514. Return buffer 504 generates control signal 516 that causes multiplexer 508 to output data from RPS 510 when it has an entry that corresponds to an input instruction address. Conversely, return buffer 504 generates control signal 516 that causes multiplexer 508 to output data 506D from non-return buffer 506 when there are no entries that correspond to an input instruction address in return buffer 504.
Return prediction stack (RPS) 510 contains a number of entries that act as a mechanism for predicting return instructions. In an embodiment, each entry in RPS 510 corresponds to a return type instruction and includes a target address of the associated instruction. As noted above, to improve the speed of a hit from return buffer 504 and thus the BTB 502, none of the return buffer's entries P contain target addresses for the corresponding instructions. Instead, the target addresses for return type instructions are stored in the RPS 510. Accordingly, when there is a hit in return buffer 504, the target address is taken from the head of the RPS 510. This is why multiplexer 508 may receive control signal 516 that causes it to output data (e.g., a target address) from the RPS when such a hit occurs.
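For illustration only, the relationship among return buffer 504, non-return buffer 506, multiplexer 508, and RPS 510 can be modeled in software as shown below. This is a hedged sketch rather than the disclosed circuit: the class `SplitBTB` and its member names are hypothetical, tags are assumed to be full instruction addresses, the head of the RPS is assumed to be the last list element, and the sketch reads the head without modeling how entries are pushed or removed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class SplitBTB:
    # Models return buffer 504: tags only, no target addresses are stored.
    return_tags: Set[int] = field(default_factory=set)
    # Models non-return buffer 506: tag 506T -> data 506D (target address).
    non_return: Dict[int, int] = field(default_factory=dict)
    # Models RPS 510; assumption: the last list element is the head.
    rps: List[int] = field(default_factory=list)

    def control_signal(self, instruction_address: int) -> bool:
        """Models control signal 516: True selects the RPS, False the non-return buffer."""
        return instruction_address in self.return_tags

    def output(self, instruction_address: int) -> Optional[int]:
        """Models multiplexer 508 choosing between its two data inputs."""
        select_rps = self.control_signal(instruction_address)
        rps_head = self.rps[-1] if self.rps else None
        non_return_data = self.non_return.get(instruction_address)
        return rps_head if select_rps else non_return_data


# Hypothetical contents: one return at 0x40, one branch at 0x10.
split_btb = SplitBTB(return_tags={0x40}, non_return={0x10: 0x80}, rps=[0x104, 0x204])
out_return = split_btb.output(0x40)  # return-buffer hit: head of the RPS is selected
out_branch = split_btb.output(0x10)  # no return-buffer hit: non-return data is selected
```

Because the return side holds only tags, the sketch reflects why a return-buffer hit needs no stored target of its own: the selected value always comes from the head of the stack.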
At step 606, the method determines whether the received instruction address is in return buffer 504. According to various embodiments, the determination of whether the received address is in return buffer 504 can be made by determining whether any of the tags stored in return buffer 504 correspond to the received instruction address.
If, at step 606, the determination is made that the instruction address corresponds to one of the entries in return buffer 504, then, at step 608, return buffer 504 generates control signal 516, which causes multiplexer 508 to output data from RPS 510.
At step 610, the appropriate data can be output based on the control signal. Namely, because return buffer 504 has detected that the instruction address corresponds to one of its entries (e.g., a “hit”) it generates an appropriate control signal to cause multiplexer 508 to output data from RPS 510. The data from RPS 510 corresponds to the target address appropriate for the instruction address. Once the data from RPS 510 is output by multiplexer 508, the process can end at step 612.
However, if, at step 606, the determination is made that the instruction address corresponds to none of the entries in the return buffer, then it is determined whether any of the entries in non-return buffer 506 corresponds to the instruction address at step 614. According to various embodiments, this determination can be made by comparing tag portion 506T of the non-return buffer with the instruction address to determine if there is a corresponding entry.
If it is determined that the instruction address corresponds to one of the entries in non-return buffer 506 (e.g., if there is a “hit”), then a control signal can be generated to output data from non-return buffer 506 at step 616. At step 610, the multiplexer, based on the control signal, outputs data 506D from non-return buffer 506.
If, at step 614, it is determined that there is no “hit” in non-return buffer 506, then the instruction fetch stage 102 must calculate the target address and incur a delay, as discussed above. The method 600 ends at step 612.
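The decision flow of method 600 can be summarized, purely as an illustrative sketch, by the following function. The function name `method_600`, its parameters, and the string labels in the result are hypothetical; the sketch assumes full-address tags, assumes the head of the RPS is the last list element and that the RPS is non-empty on a return-buffer hit, and models the fetch stage's target calculation with a caller-supplied stand-in.

```python
from typing import Callable, Dict, List, Set, Tuple


def method_600(
    instruction_address: int,
    return_tags: Set[int],
    non_return: Dict[int, int],
    rps: List[int],
    compute_target: Callable[[int], int],
) -> Tuple[int, str]:
    """Return (target address, source of the target)."""
    # Step 606: is the address in the return buffer?
    if instruction_address in return_tags:
        # Step 608: the control signal selects the RPS.
        # Step 610: the multiplexer outputs the head of the RPS.
        return rps[-1], "rps"
    # Step 614: is the address in the non-return buffer?
    if instruction_address in non_return:
        # Step 616: the control signal selects the non-return buffer.
        # Step 610: the multiplexer outputs the stored data.
        return non_return[instruction_address], "non_return"
    # No hit in either buffer: the target is calculated, incurring a delay.
    return compute_target(instruction_address), "computed"


# Hypothetical contents and a hypothetical stand-in for the calculation.
tags = {0x40}
branches = {0x10: 0x80}
stack = [0x104, 0x204]
fallback = lambda pc: pc + 4

result_return = method_600(0x40, tags, branches, stack, fallback)
result_branch = method_600(0x10, tags, branches, stack, fallback)
result_miss = method_600(0x20, tags, branches, stack, fallback)
```

The three calls exercise the three exits of the flow: a return-buffer hit, a non-return-buffer hit, and a miss in both buffers.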
Method 600 depicts determining whether there is a “hit” in the non-return buffer when there is no hit in the return buffer at step 606. However, it is also possible to simply assume a “hit” in the non-return buffer according to various embodiments.
At step 706, the method determines whether the received instruction address is in return buffer 504. According to various embodiments, the determination of whether the received address is in return buffer 504 can be made by determining whether any of the tags stored in the return buffer 504 correspond to the received instruction address.
If, at step 706, the determination is made that the instruction address corresponds to one of the entries in return buffer 504, then, at step 708, return buffer 504 generates control signal 516, which causes multiplexer 508 to output data from RPS 510.
At step 710, the appropriate data can be output based on the control signal. Namely, because return buffer 504 has detected that the instruction address corresponds to one of its entries (e.g., a “hit”) it generates an appropriate control signal to cause multiplexer 508 to output data from RPS 510. The data from RPS 510 corresponds to the target address appropriate for the instruction address. Once the data from RPS 510 is output by multiplexer 508, the process ends at step 712.
If, at step 706, the determination is made that the instruction address corresponds to none of the entries in the return buffer, then it can be assumed that the non-return buffer will have a hit and the control signal can be set based on that assumption. Accordingly, control signal 516 can be set to cause multiplexer 508 to output data 506D from non-return buffer 506. At step 710, the appropriate data can then be output.
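As a hedged illustration of the simplification in method 700, the sketch below derives the select signal from the return buffer alone and performs no separate hit check on the non-return side. The function name `method_700` and its parameters are hypothetical. The sketch assumes full-address tags and that the head of the RPS is the last list element, and it returns `None` when the assumed non-return hit does not in fact exist, a case the description above does not address.

```python
from typing import Dict, List, Optional, Set


def method_700(
    instruction_address: int,
    return_tags: Set[int],
    non_return: Dict[int, int],
    rps: List[int],
) -> Optional[int]:
    """Select the RPS on a return-buffer hit; otherwise assume a non-return hit."""
    # Steps 706 and 708: the select signal depends only on the return buffer.
    select_rps = instruction_address in return_tags
    if select_rps:
        return rps[-1]  # step 710: output the head of the RPS
    # No return-buffer hit: a non-return hit is assumed rather than checked,
    # so whatever the non-return buffer presents is passed to the output.
    return non_return.get(instruction_address)


# Hypothetical contents.
tags_700 = {0x40}
branches_700 = {0x10: 0x80}
stack_700 = [0x104, 0x204]

out_700_return = method_700(0x40, tags_700, branches_700, stack_700)
out_700_branch = method_700(0x10, tags_700, branches_700, stack_700)
out_700_absent = method_700(0x20, tags_700, branches_700, stack_700)
```

Compared with the three-way flow of method 600, this variant has a single decision point, which is the practical appeal of assuming a hit in the non-return buffer.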
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example, and not limitation. It will be apparent to persons skilled in the relevant computer arts that various changes in form and detail can be made therein without departing from the spirit and scope of the invention. Furthermore, it should be appreciated that the detailed description of the present invention provided herein, and not the summary and abstract sections, is intended to be used to interpret the claims. The summary and abstract sections may set forth one or more, but not all, exemplary embodiments of the present invention as contemplated by the inventors.
For example, in addition to implementations using hardware (e.g., within or coupled to a Central Processing Unit (“CPU”), microprocessor, microcontroller, digital signal processor, processor core, System on Chip (“SOC”), or any other programmable or electronic device), implementations may also be embodied in software (e.g., computer readable code, program code, instructions and/or data disposed in any form, such as source, object or machine language) disposed, for example, in a computer usable (e.g., readable) medium configured to store the software. Such software can enable, for example, the function, fabrication, modeling, simulation, description, and/or testing of the apparatus and methods described herein. For example, this can be accomplished through the use of general programming languages (e.g., C, C++), GDSII databases, hardware description languages (HDL) including Verilog HDL, VHDL, SystemC Register Transfer Level (RTL) and so on, or other available programs, databases, and/or circuit (i.e., schematic) capture tools. Embodiments can be disposed in any known non-transitory computer usable medium including semiconductor, magnetic disk, optical disk (e.g., CD-ROM, DVD-ROM, etc.).
It is understood that the apparatus and method embodiments described herein may be included in a semiconductor intellectual property core, such as a microprocessor core (e.g., embodied in HDL) and transformed to hardware in the production of integrated circuits. Additionally, the apparatus and methods described herein may be embodied as a combination of hardware and software. Thus, the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents. It will be appreciated that embodiments using a combination of hardware and software may be implemented or facilitated by or in cooperation with hardware components enabling the functionality of the various software routines, modules, elements, or instructions, e.g., the components noted above.
The embodiments herein have been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries may be defined so long as the specified functions and relationships thereof are appropriately performed.
The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others may, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of the present invention. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.