1. Field of the Invention
The invention is generally related to microprocessors.
2. Related Art
An instruction fetch unit of a microprocessor is responsible for continually providing the next appropriate instruction to the execution unit of the microprocessor. A conventional instruction fetch unit typically employs a large instruction cache that is always enabled in order to provide instructions to the execution unit as quickly as possible. While conventional fetch units work for their intended purpose, they consume a significant amount of the total power of a microprocessor. This makes microprocessors having conventional fetch units undesirable and/or impractical for many applications.
An embodiment provides a method of fetching data from a cache. The method begins by preparing to fetch a first set of one or more cache ways for a first data word of a first cache line using a first microprocessor thread. Next, in parallel, a second set of one or more cache ways for a first data word of a second cache line is prepared to be fetched using a second microprocessor thread, and data associated with each cache way of the first set of cache ways is fetched using the first microprocessor thread. Also performed in parallel, data associated with each cache way of the second set of cache ways is fetched using the second microprocessor thread, and a third set of one or more cache ways for a second data word of the first cache line is prepared to be fetched using the first microprocessor thread. Preparing to fetch the third set of one or more cache ways is based on a selected cache way, the selected cache way being selected from the first set of cache ways by the first microprocessor thread.
A system for fetching data from a cache is also provided. The system includes a multiway instruction cache configured to perform the following: preparing to fetch a first set of one or more cache ways for a first data word of a first cache line using a first microprocessor thread. Next, in parallel, a second set of one or more cache ways for a first data word of a second cache line is prepared to be fetched using a second microprocessor thread, and data associated with each cache way of the first set of cache ways is fetched using the first microprocessor thread. Also performed in parallel, data associated with each cache way of the second set of cache ways is fetched using the second microprocessor thread, and a third set of one or more cache ways for a second data word of the first cache line is prepared to be fetched using the first microprocessor thread. Preparing to fetch the third set of one or more cache ways is based on a selected cache way, the selected cache way being selected from the first set of cache ways by the first microprocessor thread.
The accompanying drawings, which are incorporated herein and form part of the specification, illustrate the present invention and, together with the description, further serve to explain the principles of the invention and to enable a person skilled in the pertinent art to make and use the invention.
Features and advantages of the invention will become more apparent from the detailed description of embodiments of the invention set forth below when taken in conjunction with the drawings, in which like reference characters identify corresponding elements throughout. In the drawings, like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements. The drawing in which an element first appears is indicated by the leftmost digit(s) of the corresponding reference number.
The following detailed description of embodiments of the invention refers to the accompanying drawings that illustrate exemplary embodiments. Embodiments described herein relate to a low power multiprocessor. Other embodiments are possible, and modifications can be made to the embodiments within the spirit and scope of this description. Therefore, the detailed description is not meant to limit the embodiments described below.
It should be apparent to one of skill in the relevant art that the embodiments described below can be implemented in many different embodiments of software, hardware, firmware, and/or the entities illustrated in the figures. Any actual software code with the specialized control of hardware to implement embodiments is not limiting of this description. Thus, the operational behavior of embodiments will be described with the understanding that modifications and variations of the embodiments are possible, given the level of detail presented herein.
Execution unit 102 preferably implements a load-store (RISC) architecture with single-cycle arithmetic logic unit operations (e.g., logical, shift, add, subtract, etc.). In one embodiment, execution unit 102 includes 32-bit general purpose registers (not shown) used for scalar integer operations and address calculations. Optionally, one or more additional register file sets can be included to minimize context switching overhead, for example, during interrupt and/or exception processing. Execution unit 102 interfaces with fetch unit 104 and load/store unit 108.
Fetch unit 104 provides instructions to execution unit 102. In one embodiment, fetch unit 104 includes control logic for multiway instruction cache 110, a recoder for recoding compressed-format instructions, dynamic branch prediction, an instruction buffer to decouple operation of fetch unit 104 from execution unit 102, and an interface to a scratch pad 130. Fetch unit 104 interfaces with execution unit 102, memory management unit 112, multiway instruction cache 110, and bus interface unit 116.
As used herein, a scratch pad 130 is a memory that provides instructions that are mapped to one or more specific regions of an instruction address space. The one or more specific address regions of a scratch pad 130 may be pre-configured or configured programmatically while the microprocessor is running. An address region is a contiguous range of addresses that may be specified, for example, by a base address and a region size. When a base address and region size are used, the base address specifies the start of the address region, and the region size is added to the base address to specify the end of the address region. Once an address region is specified for a scratch pad 130, all instructions corresponding to the specified address region are retrieved from the scratch pad 130.
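For illustration only, the following C sketch models the base-plus-size region check described above; the type and function names, and the use of an exclusive upper bound, are assumptions rather than details of any particular implementation:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical scratch pad region: a base address plus a region size,
 * as described above. Assumes base + size does not wrap around. */
typedef struct {
    uint32_t base;  /* start of the address region */
    uint32_t size;  /* region size; base + size is the end (exclusive) */
} scratchpad_region_t;

/* Returns true if addr should be retrieved from the scratch pad. */
static bool in_scratchpad(const scratchpad_region_t *r, uint32_t addr)
{
    return addr >= r->base && addr < r->base + r->size;
}

int main(void)
{
    scratchpad_region_t r = { 0x8000, 0x1000 }; /* illustrative values */
    return in_scratchpad(&r, 0x8800) && !in_scratchpad(&r, 0x9000) ? 0 : 1;
}
```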
Load/store unit 108 performs data loads and stores, and includes data cache control logic. Load/store unit 108 interfaces with data cache 114 and other memory such as, for example, a scratch pad 130 and/or a fill buffer (not shown). Load/store unit 108 also interfaces with memory management unit 112 and bus interface unit 116.
Memory management unit 112 translates virtual addresses to physical addresses for memory access. In one embodiment, memory management unit 112 includes a translation lookaside buffer (TLB) and may include a separate instruction TLB and a separate data TLB. Memory management unit 112 interfaces with fetch unit 104 and load/store unit 108.
Multiway instruction cache 110 is an on-chip memory array organized as a multi-way set associative cache such as, for example, a 2-way set associative cache, a 4-way set associative cache, an 8-way set associative cache, et cetera. Multiway instruction cache 110 is preferably virtually indexed and physically tagged, thereby allowing virtual-to-physical address translations to occur in parallel with cache accesses. In one embodiment, the tags include a valid bit and optional parity bits in addition to physical address bits. As described in more detail below, it is a feature of the present invention that components of multiway instruction cache 110 can be selectively enabled and disabled to reduce the total power consumed by processor core 100. Multiway instruction cache 110 interfaces with fetch unit 104.
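The organization just described can be illustrated with a minimal software model. In the following C sketch, the number of sets and ways, the index derivation, and all names are illustrative assumptions; it shows only how a virtually derived set index and physically tagged comparison (with valid bits) interact:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_WAYS 4      /* e.g., a 4-way set associative cache */
#define NUM_SETS 256    /* illustrative size */

typedef struct {
    bool     valid;     /* valid bit stored with the tag */
    uint32_t ptag;      /* physical address bits */
} tag_entry_t;

/* Tag RAMs: one tag entry per way per set. */
static tag_entry_t tag_ram[NUM_SETS][NUM_WAYS];

/* Virtually indexed: the set index comes from the virtual fetch address,
 * so it can be formed in parallel with address translation. Assumes
 * 32-byte lines for the >> 5 shift. */
static unsigned set_index(uint32_t vaddr) { return (vaddr >> 5) & (NUM_SETS - 1); }

/* Physically tagged: compare the translated physical tag against each
 * way's stored tag; return the hitting way, or -1 on a miss. */
static int tag_compare(uint32_t vaddr, uint32_t ptag)
{
    unsigned set = set_index(vaddr);
    for (int way = 0; way < NUM_WAYS; way++)
        if (tag_ram[set][way].valid && tag_ram[set][way].ptag == ptag)
            return way;
    return -1;
}

int main(void)
{
    tag_ram[set_index(0x1000)][2] = (tag_entry_t){ true, 0xABCD };
    return tag_compare(0x1000, 0xABCD) == 2 ? 0 : 1;
}
```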
Data cache 114 is also an on-chip memory array. Data cache 114 is preferably virtually indexed and physically tagged. In one embodiment, the tags include a valid bit and optional parity bits in addition to physical address bits. Data cache 114 interfaces with load/store unit 108.
Bus interface unit 116 controls external interface signals for processor core 100. In one embodiment, bus interface unit 116 includes a collapsing write buffer used to merge write-through transactions and gather writes from uncached stores.
To illustrate aspects of embodiments, FIG. 2 depicts stages used by multiway instruction cache 110.
The following is intended to be a brief description of the different stages shown in FIG. 2.
As would be appreciated by one having skill in the relevant art(s), given the description herein, the following list describes phases used by multiway instruction cache 110 in embodiments described with reference to FIG. 2.
Instruction Prepare to Fetch (IPF) Stage 260
As would be appreciated by one having skill in the relevant art(s), given the description herein, in IPF stage 260, several operations are performed to prepare for fetching an instruction from data RAM cache 262. These operations include accessing a cache way predictor 261 to determine which ways 210A-D of data RAM cache 262 to prepare for fetching. The result of this stage is an address and control signals presented to the instruction cache RAM arrays. As used herein, preparing to fetch an instruction can also be termed "enabling" the instruction.
As described in the '191 patent, a multiway instruction cache can use tag RAMs 212 from tag RAM cache 265 to store the physical address for tag comparison to select the applicable cache way.
As noted above, way prediction is performed at the instruction prepare to fetch (IPF) stage. In IPF stage 260, way predictor 261 is used to select instructions to enable to be fetched in IF stage 270. Each enabled instruction becomes a cache way 210A-D to be fetched during IF stage 270. Information that improves way prediction is used at this stage. The more accurate the way prediction, the fewer ways 210A-D need to be fetched during IF stage 270.
Parallel access of all way data RAMs and tag RAMs achieves the highest performance, but because a large amount of extra data is retrieved, it also requires the highest access energy of the approaches discussed herein.
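The tradeoff just described can be sketched as an enable mask computed at the IPF stage. In this hypothetical C model (names and encodings are assumptions), an absent prediction enables every data RAM plus the tag RAMs, while a confident prediction enables a single data RAM and skips the tag RAMs, as the embodiments below describe:

```c
#include <stdint.h>

#define NUM_WAYS 4
#define ALL_WAYS ((1u << NUM_WAYS) - 1)  /* parallel access: every data RAM enabled */

/* Hypothetical IPF-stage output: a bit per way data RAM, plus a flag
 * for the tag RAMs. Only enabled RAMs are read (and expend energy)
 * in the following IF stage. */
typedef struct {
    uint8_t way_enable;  /* bit i set => way i's data RAM is enabled */
    int     tag_enable;  /* nonzero => tag RAMs are read for way selection */
} fetch_enables_t;

/* With no prediction, enable everything (highest performance, highest
 * access energy); with a confident prediction, enable a single way and
 * skip the tag RAMs. */
static fetch_enables_t prepare_to_fetch(int predicted_way)
{
    fetch_enables_t e;
    if (predicted_way < 0) {          /* no usable prediction */
        e.way_enable = ALL_WAYS;
        e.tag_enable = 1;             /* tags needed to pick among ways */
    } else {
        e.way_enable = (uint8_t)(1u << predicted_way);
        e.tag_enable = 0;             /* single way: no tag compare needed */
    }
    return e;
}

int main(void)
{
    fetch_enables_t no_pred = prepare_to_fetch(-1); /* all ways + tags */
    fetch_enables_t pred    = prepare_to_fetch(2);  /* single way, no tags */
    return (no_pred.way_enable == ALL_WAYS && pred.way_enable == 0x4) ? 0 : 1;
}
```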
Instruction Fetch (IF) Stage 270
In IF stage 270, the retrieval of tag RAMs 212 and one or more enabled data ways 210A-D causes multiway instruction cache 110 to expend energy. For example, to increase performance and reduce the likelihood of a cache mis-predict, in one approach to implementing multiway instruction cache 110, all four way 210A-D data RAMs are accessed in parallel with cache tag RAMs 212 during IF stage 270. As compared to embodiments described herein, this approach expends a large amount of energy.
Reducing the quantity of ways 210A-D and tag RAMs 212 that are retrieved at this IF stage can reduce the power expended by multiway instruction cache 110. In embodiments described below with the descriptions of FIGS. 4-7, way prediction is used to reduce the amount of data retrieved at this stage.
Instruction Selection (IS) Stage 280
After tag comparison completes, the applicable cache way is selected. Physical address 255 is received at tag comparator 250. Physical address 255 is compared to fetched tag RAMs 212, and one of the fetched cache ways 210A-D is selected by way selector 208 and forwarded as selected way 285 to instruction buffer 204.
Dispatch (IT) Stage 290
As would be appreciated by one having skill in the relevant art(s), given the description herein, in IT stage 290 an instruction stored in instruction buffer 204 is dispatched, as dispatched instruction 295, to execution unit 102 for execution. Embodiments described herein relate to populating instruction buffer 204 with instructions, so IT stage 290 is not discussed further.
Embodiments described herein use way prediction. Way prediction can be based on known characteristics of the data as cached. These known characteristics allow for a prediction of the placement of a fetch word based on the location of a previously fetched word. For example, as shown in FIG. 3, fetch words 352A-D of cache line 355 are stored together, so the cache way that holds one fetch word predicts the way that holds subsequent fetch words of the same cache line.
In addition, it would be appreciated by one having skill in the relevant art(s), given the description herein, that way prediction can rely on other conditions. For example, writes to cache lines 355, 365 may have to be monitored to ensure that prior tag states stored in tag RAM cache 265 are still valid.
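As a minimal sketch of the way prediction just described (all names are hypothetical, and a real design would track predictions per cache line rather than a single entry), the following C model records a selected way for a cache line, reuses it for later fetch words of that line, and invalidates it when a write to the line is observed:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical same-cache-line way predictor: once a way has been
 * selected for one fetch word of a line (IS stage), later fetch words
 * of that same line are predicted to live in the same way. */
typedef struct {
    bool     valid;
    uint32_t line_addr;     /* cache line address the prediction covers */
    int      selected_way;  /* way chosen by an earlier tag comparison */
} line_way_predictor_t;

static void record_selected_way(line_way_predictor_t *p, uint32_t line_addr, int way)
{
    p->valid = true;
    p->line_addr = line_addr;
    p->selected_way = way;
}

/* Returns the predicted way, or -1 when all ways must be fetched. */
static int predict_way(const line_way_predictor_t *p, uint32_t line_addr)
{
    return (p->valid && p->line_addr == line_addr) ? p->selected_way : -1;
}

/* Called when a write to a cached line is observed, since prior tag
 * state may no longer be valid. */
static void on_line_write(line_way_predictor_t *p, uint32_t line_addr)
{
    if (p->valid && p->line_addr == line_addr)
        p->valid = false;
}

int main(void)
{
    line_way_predictor_t p = { false, 0, -1 };
    record_selected_way(&p, 0x2000, 3);  /* IS stage picked way 3 */
    int w1 = predict_way(&p, 0x2000);    /* next word: predicts 3 */
    on_line_write(&p, 0x2000);           /* write invalidates it */
    int w2 = predict_way(&p, 0x2000);    /* back to -1 (all ways) */
    return (w1 == 3 && w2 == -1) ? 0 : 1;
}
```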
Cycles 401-406 are described below:
Cycle 401: In this cycle, in IPF 410A, ways 210A-D are enabled as ways to access fetch word 352A. Cache ways 210A-D and tag RAMs 212 associated with ways 210A-D are enabled for fetching in IF 412A. As described with cycles 402-406 below, the approach described with reference to FIG. 4 accesses all tag and way RAMs until a first way selection is available, and uses way prediction to reduce access energy for later fetch words.
Cycle 402: In this cycle, in IF 412A, tag RAMs 212 associated with selecting ways 210A-D, and ways 210A-D themselves, are fetched. At this cycle, because all of the associated tag RAMs and ways 210A-D are fetched, the power expended at this phase can be termed 100% of the possible access energy expenditure for a non-way-predicted approach (hereinafter "possible access energy expenditure"). It should be noted that, as used herein, estimates of possible access energy expenditure are based on the following values: assuming four cache ways 210A-D can be fetched, each cache way uses 20% of the possible access energy expenditure, and retrieving tag RAMs 212 associated with the cache ways uses an additional 20% of the possible access energy expenditure. One having skill in the relevant art(s), given the description herein, will appreciate that estimating access energy can be based on different values and factors.
This fetching of tag RAMs 212 and ways at the same time is termed “parallel fetching.” Also, in this cycle, in IPF 410B, similar to cycle 401 above, cache ways 210A-D are enabled as ways to access fetch word 352B.
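The estimates described above can be captured in a small C helper. The 20% figures are the illustrative values given in the text, not measurements, and the function name is an assumption:

```c
#include <assert.h>

/* Access-energy bookkeeping using the text's illustrative figures:
 * each of four data ways costs 20% of the possible access energy,
 * and the tag RAMs cost a further 20%. */
static int fetch_energy_pct(int ways_fetched, int tags_fetched)
{
    return 20 * ways_fetched + (tags_fetched ? 20 : 0);
}

int main(void)
{
    assert(fetch_energy_pct(4, 1) == 100); /* all ways + tags, as in IF 412A */
    assert(fetch_energy_pct(1, 0) == 20);  /* one predicted way, no tags */
    return 0;
}
```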
Cycle 403: In this cycle, in IS 414A, physical address 255 associated with fetch word 352A is received at tag comparator 250. Tag comparator 250 compares received physical address 255 with tag RAMs 212 to select one of ways 210A-D. The data retrieved with selected way 285 are forwarded to instruction buffer 204. In one approach, selected way 285 can improve way prediction during the IPF stage of other fetch words in cache line 355. Because ways associated with fetch word 352B have already been predicted in IPF 410B of cycle 402, selected way 285 does not improve that prediction. Like IF 412A described above, because selected way 285 was not available at cycle 402 for IPF 410B, IF 412B uses 100% of possible access energy expenditure.
In IPF 410C, however, for fetch word 352C, selected way 285 improves way prediction. Selected way 285 reduces the amount of data that is enabled during IPF 410C for fetch word 352C. In some circumstances, selected way 285 allows only a single way 210A to be enabled for fetching at this stage.
In addition, because selected way 285 is available, only one way needs to be enabled, and tag RAMs 212 do not need to be retrieved to select from multiple retrieved ways. This reduction in the amount of data fetched results in a power savings for fetching associated with fetch word 352C in cycle 404, IF 412C.
Cycle 404: In IPF 410D, similar to IPF 410C above, for fetch word 352D, selected way 285 improves way prediction. This way information reduces the amount of data that is enabled during IPF 410D. Selected way 285 allows only a single way 210A to be enabled for fetching at this stage. In addition, because selected way 285 is available, only one way needs to be enabled, and tag RAMs 212 do not need to be retrieved to select from multiple retrieved ways. This will result in a power savings for fetching associated with fetch word 352D in cycle 405, IF 412D.
As noted in cycle 403 above, during cycle 404, in IF 412C, the enabled way 210A is fetched. This fetch of a single predicted way 210A uses less power than IF 412A described with cycle 402 above. Because tag RAMs are not retrieved and only a single predicted way is retrieved, based on the estimate calculation outlined above, the access energy expended by this stage is estimated at 20% of the possible access energy expenditure.
Also in this cycle, at IS 414B, fetch word 352B is selected and forwarded to instruction buffer 204.
Cycle 405: As noted in cycle 404 above, during cycle 405, in IF 412D, the enabled way 210A is fetched. This fetch of a single predicted way uses power similar to IF 412C described with cycle 404 above. Because tag RAMs are again not retrieved and only a single predicted way is retrieved, the power expended by this stage is estimated at 20% of the possible access energy expenditure.
Also in this cycle, at IS 414C, fetch word 352C is selected and forwarded to instruction buffer 204.
Cycle 406: In this cycle, at IS 414D, fetch word 352D is selected and forwarded to instruction buffer 204.
As described with cycles 401-406 above, a pipelined structure that provides a fetch address, accesses the cache RAMs, selects a suitable cache way, and stores selected instructions inside an instruction buffer has inherent latencies before a way selection is calculated. Selected way 285 was not determined until cycle 403, and only improved way selection for fetch words 352C-D. Until the first way calculation completes, all tag and way RAMs are accessed, e.g., for fetch words 352A-B.
Because fetch words 352A-B each used 100% access energy and fetch words 352C-D each used 20% access energy, the aggregate access energy estimate for this approach is (100% + 100% + 20% + 20%) / 4 = 60% of the maximum possible expenditure.
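Given those per-word figures, the aggregate can be checked with a few lines of C (values copied from cycles 402-405 above; this is a worked check, not part of any implementation):

```c
int main(void)
{
    /* Per-word IF-stage energy for fetch words 352A-D (illustrative):
     * 352A-B fetched before a way selection exists (100% each);
     * 352C-D fetched with a single predicted way (20% each). */
    int energy[4] = { 100, 100, 20, 20 };
    int total = 0;
    for (int i = 0; i < 4; i++)
        total += energy[i];
    /* (100 + 100 + 20 + 20) / 4 = 60% of the maximum expenditure */
    return (total / 4 == 60) ? 0 : 1;
}
```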
The example in FIG. 4 describes fetching using a single thread of execution.
As described with reference to FIG. 5, embodiments can interleave fetching between multiple threads using a multithreaded multiway instruction cache 550.
In an example shown on FIG. 6, cycles 601-610 illustrate multithreaded fetch operations performed by multithreaded multiway instruction cache 550.
Rather than a single thread of execution being in each stage, thread stages (IPF 260, IF 270, IS 280) are interleaved between two threads 320 and 330. In an embodiment, as described with reference to FIG. 6, this interleaving introduces a delay that allows a way selected by one thread's IS stage to be available in time for that thread's next IPF stage.
With multithreaded operation of fetch unit 104, each thread processes independent address ranges and access requests. For example, as shown in FIG. 6, thread 320 fetches fetch words 352A-D from cache line 355, and thread 330 fetches fetch words 362A-D from cache line 365.
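Before walking through cycles 601-610, the scheduling idea can be sketched in C. This is a deliberately simplified model (the thread scheduling, stage timing, and constant selected way are assumptions): two threads alternate IPF stages, so each thread's first way selection is ready before that thread's next IPF stage runs:

```c
#include <stdio.h>

/* Simplified interleaving of a three-stage fetch pipeline (IPF -> IF
 * -> IS) between two threads. Each thread starts a new fetch word
 * every other cycle, so a thread's IS result for one word is available
 * before its IPF for the next word, as in cycles 601-610. */
enum { NUM_THREADS = 2, WORDS_PER_THREAD = 4 };

int main(void)
{
    int selected_way[NUM_THREADS] = { -1, -1 }; /* -1: no selection yet */
    for (int cycle = 1; cycle <= NUM_THREADS * WORDS_PER_THREAD; cycle++) {
        int t = (cycle - 1) % NUM_THREADS;      /* thread issuing IPF  */
        int word = (cycle - 1) / NUM_THREADS;   /* fetch word index    */
        int ways = (selected_way[t] < 0) ? 4 : 1;
        printf("cycle %d: thread %d IPF word %d, enabling %d way(s)\n",
               cycle, t, word, ways);
        /* The real IS stage completes two cycles later; because the
         * thread's next IPF is also two cycles away, recording the
         * selection now models its timely availability. */
        selected_way[t] = 0; /* illustrative: way 0 selected */
    }
    return 0;
}
```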
Cycle 601: In this cycle, in IPF 610A, ways 210A-D are enabled as ways to access fetch word 352A. Because of this, ways 210A-D and tag RAMs associated with ways 210A-D are enabled for fetching in IF 612A. As with cycles 400 described with reference to FIG. 4, all ways are enabled because no way selection is yet available.
Cycle 602: In this cycle, in IF 612A using thread 320, the enabled tag RAMs 212 associated with selecting ways 210A-D, and ways 210A-D themselves, are fetched. As with cycle 402 above, because all of the associated tag RAMs and ways 210A-D are fetched, the power expended is 100% of the possible access energy expenditure. As in cycle 402 above, in this embodiment of multithreaded multiway instruction cache 550, when required, tag RAMs 212 and data RAMs are still fetched in parallel.
In contrast to cycles 400 from FIG. 4, also in this cycle, in IPF 620A, thread 330 enables ways 210A-D and associated tag RAMs to access fetch word 362A from cache line 365.
Cycle 603: In this cycle, in IS 615A using thread 320, a physical address 255 associated with fetch word 352A is received at tag comparator 250. Tag comparator 250 compares received physical address 255 with tag RAMs 359 to select one of ways 210A-D. The data retrieved with selected way 285 are forwarded to the instruction buffer 515A associated with thread 320. In one approach, as noted with cycle 403 above, selected way 285 can improve way prediction during the IPF stage of other fetch words in the same cache line.
Unlike cycle 403 above, where ways associated with fetch word 352B had already been enabled, at cycle 603 the ways for fetch word 352B are not yet predicted, so selected way 285 can improve this prediction and reduce the access energy required to fetch fetch word 352B. Thus, in IPF 610B, based on selected way 285, thread 320 only enables a single data RAM and does not enable tag RAMs 359 for retrieval. It should be noted that, in cycle 602, interleaving in IPF 620A by thread 330 caused a delay that allowed selected way 285 to be generated in time for IPF 610B of fetch word 352B.
In IF 622A, thread 330 fetches the enabled tag RAMs 212 and data RAMs associated with fetch word 362A. As in cycle 602, in the first IPF stage associated with fetch word 362A, all associated tag RAMs 212 and data RAMs were enabled. Thus IF 622A, like IF 612A for fetch word 352A, uses 100% of the possible access energy expenditure.
Cycle 604: In this cycle, in IS 625A using thread 330, a physical address 255 associated with fetch word 362A is received at tag comparator 250. Tag comparator 250 compares received physical address 255 with tag RAMs 212 to select one of ways 210A-D. The data retrieved with selected way 285 are forwarded to the instruction buffer 515B associated with thread 330. As noted herein, this selected way 285 will assist with performing IPF stages associated with the same thread 330.
Thus, in IPF 620B, based on selected way 285 from IS 625A, for fetch word 362B, thread 330 only enables a single data RAM and does not enable tag RAMs 212 for retrieval. As with thread 320 in cycle 602, interleaving threads 320, 330 causes a delay that allows selected way 285 to be generated in time for IPF 620B for fetch word 362B.
In IF 612B, using thread 320, the enabled way associated with fetch word 352B from IPF 610B is fetched. As noted in cycle 603 above, because selected way 285 was available for IPF 610B, IF 612B only needs to fetch a single way and no tag RAMs 212. Thus, in contrast to cycle 602 described above, fetching fetch word 352B in cycle 604, IF 612B, is estimated to use 20% of possible access energy expenditure, as compared to 100% in IF 612A of cycle 602.
Cycles 605 through 610: As would be appreciated by one having skill in the relevant art(s), given the description herein, as shown on FIG. 6, fetching of the remaining fetch words 352C-D and 362C-D proceeds in the same interleaved fashion, with each IF stage after the first for a given cache line fetching only a single predicted way.
The original fetch energy reduction scheme described with reference to FIG. 4 only improved way prediction for fetch words 352C-D. In contrast, the interleaved approach of FIG. 6 allows selected way 285 to improve way prediction for every fetch word after the first in a cache line.
In an embodiment of multithreaded multiway instruction cache 550, where instruction cache tag RAMs and data RAMs are serialized, access energy usage can be further reduced.
As with cycles 600, with multithreaded operation of fetch unit 104 using cycles 700, each thread 320, 330 processes independent address ranges and access requests. For example, as shown in FIG. 7, thread 320 fetches fetch words 352A-D from cache line 355, and thread 330 fetches fetch words 362A-D from cache line 365.
Cycle 701: In contrast to cycle 601 from the description of FIG. 6, in this cycle, in IPF 757 using thread 320, only tag RAMs 359 associated with cache line 355 are enabled; no data RAMs are enabled during this cycle.
Cycle 702: In this cycle, in IF 758 using thread 320, the enabled tag RAMs 359 are fetched. Though not an exact measurement, retrieving tag RAMs 359 is estimated to use 20% of possible access energy expenditure, as compared to 100% for fetching both the associated tag RAMs 359 and the associated data RAMs.
Also in this cycle, in IPF 767 thread 330 enables all tag RAMs 369 associated with cache line 365. As with IPF 757 from cycle 701 above, no data RAMs are enabled during this cycle.
Cycle 703: In this cycle, in IS 759 using thread 320, the enabled tag RAMs 359 are compared to received physical address 255 associated with fetch word 352A. Thereafter, thread 320, now having a selected data RAM way, can proceed, requesting further fetches without requiring the enabling of tag RAMs 359 and additional ways during IPF stages.
Further in this cycle, in IF 768 using thread 330, the enabled tag RAMs 369 are fetched. Retrieving tag RAMs 369 is estimated to use 20% of possible access energy expenditure, as compared to 100% for fetching both the associated tag RAMs 369 and the associated data RAMs.
Using the way selected by IS 759 described above, in IPF 710A using thread 320, a data RAM associated with fetch word 352A is enabled. As noted above, this contrasts with cycle 601 of FIG. 6, in which all way data RAMs and tag RAMs for fetch word 352A were enabled.
Cycle 704: In this cycle, in IS 769 using thread 330, the enabled tag RAMs 369 are compared to received physical address 255 associated with fetch word 362A. Thereafter, thread 330, now having a selected data RAM way, can proceed, requesting further fetches without requiring the enabling of tag RAMs 369 and additional ways during IPF stages.
Further in this cycle, in IF 712A using thread 320, the enabled data RAM from IPF 710A is fetched. Because selected way 285 was available for IPF 710A, IF 712A only needs to fetch a single way and no tag RAMs 359. Thus, in contrast to cycles 400 and 600 described with respective FIGS. 4 and 6, even the first data fetch for cache line 355 is estimated to use only 20% of possible access energy expenditure.
Using the way selected by IS 769 described above, in IPF 720A using thread 330, a data RAM associated with fetch word 362A is enabled. As noted above, this contrasts with cycles 600 of FIG. 6, in which all way data RAMs and tag RAMs for fetch word 362A were enabled.
Cycle 705: In this cycle, in IS 715A using thread 320, fetch word 352A is selected and forwarded to the instruction buffer 515A associated with thread 320.
Further in this cycle, in IF 722A using thread 330, the enabled data RAM from IPF 720A is fetched. Because a data RAM way was already selected by IS 769, IF 722A only needs to fetch a single way and no tag RAMs 369, and is estimated to use 20% of possible access energy expenditure.
Also in this cycle, in IPF 710B using thread 320, a data RAM associated with fetch word 352B is enabled based on the way selected by IS 759.
Cycles 706 through 712: As would be appreciated by one having skill in the relevant art(s), given the description herein, as shown on FIG. 7, the remaining fetch words 352B-D and 362B-D are fetched in the same serialized, interleaved fashion, with each data RAM fetch retrieving only a single selected way.
It should be noted that the 20% access energy expenditure associated with retrieving tag RAMs 359, 369 in cycle 702, IF 758, and cycle 703, IF 768, can be considered as respectively distributed across the four fetch word 352A-D and 362A-D fetches. The true access power expenditure depends on the number of cache lines implemented, the physical address bits used for tag comparison, and process technology parameters. In an example, a typical 32K byte cache was observed to reach the 20% combined tag energy assumption used herein.
Thus, because each fetch word is estimated at 20% of potential access energy expended, the total access energy per fetch word is 20% + (20% / 4) = 25%, accounting for both the data access power and ¼ of the tag access power (assuming the cache line has 4 fetch words and all of them are accessed). In contrast to the embodiments described with respect to FIGS. 4 and 6, this serialized approach reduces the access energy expended for every fetch word in a cache line.
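The 25% per-word figure can be checked with a short C sketch using the same illustrative numbers: one 20% tag RAM fetch amortized over four fetch words, plus one 20% data way fetch per word. The structure of the loop mirrors the serialized scheme, not any specific hardware:

```c
#include <stdio.h>

/* Serialized scheme: the tag RAMs for a cache line are fetched and
 * compared once (IF 758), then a single selected data RAM way is
 * fetched per fetch word. Figures are the text's illustrative 20%
 * per RAM group. */
int main(void)
{
    const int WORDS_PER_LINE = 4;
    int total = 0;

    total += 20;                              /* one tag RAM fetch per line */
    for (int w = 0; w < WORDS_PER_LINE; w++)
        total += 20;                          /* one selected data way per word */

    /* 20% data per word + 20% tag / 4 words = 25% per fetch word */
    printf("per-word access energy: %d%%\n", total / WORDS_PER_LINE);
    return 0;
}
```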
Interlacing multiple threads to serialize tag and way RAM access as described with reference to FIG. 7 can reduce access energy expenditure, with the interleaving of threads 320, 330 hiding the additional latency introduced by serializing the tag and data RAM accesses.
In the example of FIG. 7, threads 320 and 330 are treated with equal priority.
In another embodiment where thread priority is used to control aspects of embodiments, thread priority can be used to select between the multithreaded fetching approaches described with reference to FIGS. 6 and 7.
In another embodiment, the approaches of FIGS. 6 and 7 can be combined.
In an example of this combination approach, thread 320 is a relatively high priority thread, and thread 330 is a relatively low priority thread. This example starts with thread 320 performing the IPF 610A of fetch word 352A described with reference to FIG. 6, while lower priority thread 330 uses the serialized tag RAM enabling described with reference to FIG. 7.
As noted above with respect to FIGS. 2 and 4-7, some embodiments use way predictor 261 at the instruction prepare to fetch (IPF) stage to identify one or more ways 210A-D from data RAM cache 262 for use by the instruction fetch (IF) stage 270.
Different approaches to way prediction can be used by different embodiments. An example way predictor 261, as described in the embodiments of FIGS. 4-7, predicts that a fetch word will be found in the cache way selected for a previously fetched word from the same cache line.
As noted above with the description of FIG. 3, way prediction can be based on known characteristics of the data as cached, for example, the placement of a fetch word relative to a previously fetched word in the same cache line.
In another embodiment of way predictor 261, a micro-tag array (also termed a "micro-tag cache" (MTC)) is used for way prediction during the IPF phase. Use of a micro-tag array for way selection by an embodiment can further reduce cache access energy expenditure. The micro-tag array stores base address data bits or base register data bits, offset data bits, a carry bit, and way selection data bits. When fetch word 352A is sought to be fetched, the instruction address is compared to data stored in the micro-tag array. If a micro-tag array hit occurs, the micro-tag array generates a cache dataram enable signal. This signal enables only a single dataram of the cache. If a micro-tag array hit occurs, a signal is also generated that disables the cache tagram.
An example of a micro-tag array that can be used by embodiments is described in U.S. Pat. No. 7,650,465 (the '465 patent), filed on Aug. 18, 2006, and issued on Jan. 19, 2010, entitled "Micro Tag Array Having Way Selection Bits for Reducing Data Cache Access Power," which is incorporated by reference herein in its entirety, although the invention is not limited to this example.
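For illustration, the following C sketch models a micro-tag array lookup using the fields listed above. The array size, field widths, and function names are assumptions; the actual structure is described in the '465 patent:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical micro-tag array entry holding base address (or base
 * register) data bits, offset data bits, a carry bit, and way
 * selection data bits. Field widths are illustrative. */
typedef struct {
    bool     valid;
    uint32_t base_bits;    /* base address data bits */
    uint32_t offset_bits;  /* offset data bits */
    uint8_t  carry;        /* carry bit */
    uint8_t  way_select;   /* way selection data bits */
} micro_tag_entry_t;

#define MTC_ENTRIES 8  /* illustrative size */
static micro_tag_entry_t mtc[MTC_ENTRIES];

/* On a hit, *dataram_enable selects a single data RAM and the tag RAMs
 * are disabled, as described above. On a miss, all ways and the tag
 * RAMs are enabled instead. */
static bool mtc_lookup(uint32_t base_bits, uint32_t offset_bits, uint8_t carry,
                       uint8_t *dataram_enable, int *tagram_enable)
{
    for (int i = 0; i < MTC_ENTRIES; i++) {
        if (mtc[i].valid && mtc[i].base_bits == base_bits &&
            mtc[i].offset_bits == offset_bits && mtc[i].carry == carry) {
            *dataram_enable = (uint8_t)(1u << mtc[i].way_select);
            *tagram_enable = 0;   /* micro-tag hit: disable the tag RAMs */
            return true;
        }
    }
    *dataram_enable = 0x0F;       /* miss: enable all four ways */
    *tagram_enable = 1;           /* and the tag RAMs for selection */
    return false;
}

int main(void)
{
    uint8_t de; int te;
    mtc[0] = (micro_tag_entry_t){ true, 0x12, 0x3, 0, 2 };
    bool hit = mtc_lookup(0x12, 0x3, 0, &de, &te);
    return (hit && de == 0x4 && te == 0) ? 0 : 1;
}
```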
Micro-Tag Array with Multithreaded Fetch Operations
When a micro-tag array is used with multithreaded multiway instruction cache 550 from FIG. 5, access energy expenditure can be further reduced.
A micro-tag array can be beneficially used at IPF 610A. In IPF 610A, for example, instead of enabling four (4) cache ways 210A-D for fetching by IF 612A, a micro-tag array hit can allow only a single way 210A to be enabled. In addition, instead of enabling tag RAMs 359 for parallel fetching with ways 210A-D at IF 612A, a micro-tag array hit at IPF 610A allows an embodiment to avoid enabling tag RAMs 359. Thus, at cycle 601, using a micro-tag array allows the potential for significant access energy expenditure savings.
When a micro-tag array hit occurs at IPF 610A, no update of the micro-tag array is required based on selected way 285. As noted above, based on a micro-tag array hit, only one way was enabled at IPF 610A, and this way is fetched at IF 612A and selected at IS 615A without the use of tag RAMs 359.
When no micro-tag array hit occurs at IPF 610A, the operation of an embodiment proceeds as with cycle 601 from the description of FIG. 6: all ways 210A-D and the associated tag RAMs are enabled for fetching.
As described above, when used at an initial IPF stage, a micro-tag array hit can significantly reduce the access energy expenditure of the associated IF stage. Without a micro-tag array hit, the access energy expenditure is comparable to approaches using different way prediction approaches, e.g., the simple approach described above with reference to FIG. 4.
Micro-Tag Array with Multithreaded, Serialized Fetch Operations
A micro-tag array can be beneficially used with the multithreaded, serialized fetch operations described with reference to FIG. 7.
Use of a micro-tag array with multithreaded, serialized fetch operations can significantly reduce the access energy expenditure while increasing performance. This approach combines the potential benefit of skipping from IPF 757 to IF 712A on a micro-tag array hit with the general benefits that can result from the multithreaded, serialized approach.
Without a micro-tag array hit, the access energy expenditure is comparable to access energy expenditures associated with different way prediction approaches, e.g., the less complex approach described above with reference to FIG. 4.
Stages 830A-B are performed in parallel. For example, the example stages below are performed at cycle 602 on FIG. 6. In stage 830A, a second set of one or more cache ways for a first data word of a second cache line is prepared to be fetched using a second microprocessor thread. For example, at cycle 602, using thread 330, IPF 620A prepares to fetch a set of ways associated with fetch word 362A from cache line 365. In stage 830B, data associated with each cache way of the first set of cache ways is fetched using the first microprocessor thread. For example, at cycle 602, using thread 320, IF 612A fetches the ways enabled by IPF 610A for fetch word 352A.
Stages 840A-B are also performed in parallel. For example, the example stages below are performed at cycle 603 on FIG. 6. In stage 840A, data associated with each cache way of the second set of cache ways is fetched using the second microprocessor thread. For example, at cycle 603, using thread 330, IF 622A fetches the tag RAMs and data RAMs enabled for fetch word 362A.
In stage 840B, a third set of one or more cache ways for a second data word of the first cache line is prepared to be fetched using the first microprocessor thread. This third set of cache ways is prepared to be fetched based on a selected cache way, the selected cache way selected from the first set of cache ways by the first microprocessor thread. For example, at cycle 603, using thread 320, IPF 610B prepares to fetch a third set of ways, e.g., a single way 210A. These ways are associated with fetch word 352B from cache line 355. IPF 610B is based on the selection of selected way 285 by IS 615A. Once stages 840A-B are completed, the method ends at stage 850.
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example, and not limitation. It will be apparent to persons skilled in the relevant computer arts that various changes in form and detail can be made therein without departing from the spirit and scope of the invention. Furthermore, it should be appreciated that the detailed description of the present invention provided herein, and not the summary and abstract sections, is intended to be used to interpret the claims. The summary and abstract sections may set forth one or more but not all exemplary embodiments of the present invention as contemplated by the inventors.
For example, in addition to implementations using hardware (e.g., within or coupled to a Central Processing Unit (“CPU”), microprocessor, microcontroller, digital signal processor, processor core, System on Chip (“SOC”), or any other programmable or electronic device), implementations may also be embodied in software (e.g., computer readable code, program code, instructions and/or data disposed in any form, such as source, object or machine language) disposed, for example, in a computer usable (e.g., readable) medium configured to store the software. Such software can enable, for example, the function, fabrication, modeling, simulation, description, and/or testing of the apparatus and methods described herein. For example, this can be accomplished through the use of general programming languages (e.g., C, C++), GDSII databases, hardware description languages (HDL) including Verilog HDL, VHDL, SystemC Register Transfer Level (RTL) and so on, or other available programs, databases, and/or circuit (i.e., schematic) capture tools. Embodiments can be disposed in any known non-transitory computer usable medium including semiconductor, magnetic disk, optical disk (e.g., CD-ROM, DVD-ROM, etc.).
It is understood that the apparatus and method embodiments described herein may be included in a semiconductor intellectual property core, such as a microprocessor core (e.g., embodied in HDL) and transformed to hardware in the production of integrated circuits. Additionally, the apparatus and methods described herein may be embodied as a combination of hardware and software. Thus, the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
During IPF stage 970, micro-tag array 960 is used for way prediction. Based on this way prediction, preparations are made to fetch specific RAMs from data RAM cache 262 for ways 210A-D during IF stage 972. By comparing 945 a partial base address from program counter 950, micro-tag array 960 can identify one or more ways 210A-D in data RAM cache 262.
IF stage 972 includes data RAM cache 262 and tag RAMs from tag RAM cache 265. IS stage 974 includes way selector 208 coupled to tag comparator 250. Tag comparator 250 receives physical address 255. When a micro-tag array hit occurs using a partial address during the IPF stage, to verify 955 the enabled way, the full physical address 255 is compared to micro-tag array 960. Way selector 208 provides selected way 285 to instruction buffer 204. IT stage 976 includes dispatched instruction 295 from instruction buffer 204.
In an embodiment, with the examples described with respect to FIG. 9, micro-tag array 960 compares a portion of the base address data bits of a fetch address to base address data bits stored in a base register of micro-tag array 960.
When the portion of the base address data bits matches the base address data bits stored in the base register of micro-tag array 960, micro-tag array 960 is configured to output an enable signal that enables a dataram of the cache specified by way selection data bits stored in the way selection register of the micro-tag array.
An embodiment of the partial address compare micro-tag array uses lower order bits of the base address (after the cache line address). As would be appreciated by one having skill in the relevant art(s), given the description herein, this approach is more likely to lead to a micro-tag array hit, but also more likely to lead to a mis-prediction. Instead of a single way resulting from a micro-tag array hit, multiple entries may match the submitted partial base address. In one approach to selecting from multiple ways found from a partial base address match, an embodiment only enables the most recently installed micro-tag array entry.
In an embodiment, because of the increased likelihood of mis-prediction, during the IF stage, when the full address is available, a micro-tag array comparison of the full address is performed to check that the predicted way is not a mis-prediction. When a mis-prediction is detected, a replay of the request is performed, reading all tags and datarams.
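A minimal C sketch of this verify-and-replay flow follows. The entry layout, mask, and names are hypothetical, and the printf calls stand in for the hardware's replay and enable signals:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical entry: low-order base address bits make the IPF-stage
 * prediction; the full address is kept for the later verification. */
typedef struct {
    uint32_t partial_bits;  /* lower order base address bits */
    uint32_t full_addr;     /* full address for the IF-stage check */
    int      way;           /* predicted (enabled) way */
} partial_mtc_entry_t;

/* Partial match predicts a way; a full-address mismatch afterward is a
 * mis-prediction, triggering a replay that reads all tags and datarams. */
static bool verify_and_fetch(const partial_mtc_entry_t *e,
                             uint32_t fetch_addr, uint32_t partial_mask)
{
    if ((fetch_addr & partial_mask) != e->partial_bits)
        return false;                     /* no predicted way at all */
    if (fetch_addr != e->full_addr) {     /* full-address verification */
        printf("mis-prediction: replay, reading all tags and datarams\n");
        return false;
    }
    printf("hit: fetch only way %d\n", e->way);
    return true;
}

int main(void)
{
    partial_mtc_entry_t e = { 0x34, 0x1234, 1 };
    verify_and_fetch(&e, 0x1234, 0xFF);  /* partial and full match: single way */
    verify_and_fetch(&e, 0x5634, 0xFF);  /* partial match, full mismatch: replay */
    return 0;
}
```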
Embodiments described herein relate to a low power multiprocessor. The summary and abstract sections may set forth one or more but not all exemplary embodiments of the present invention as contemplated by the inventors, and thus, are not intended to limit the present invention and the claims in any way.
The embodiments herein have been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries may be defined so long as the specified functions and relationships thereof are appropriately performed.
The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others may, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of the present invention. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.
This patent application claims the benefit of U.S. Provisional Patent Application No. 61/436,931 filed on Jan. 27, 2011, entitled “Power Reduction instruction Cache in a Multi-Thread Processor Core,” which is incorporated by reference herein in its entirety.