The present disclosure generally relates to microprocessor architecture, and more specifically to modularization of a cache structure within a microprocessor.
A Level-1 (L1) cache of a microprocessor is commonly implemented as a tightly coupled unit of tag and data arrays. Whether the L1 cache is implemented as a single-port cache array or a dual-port cache array, the processes of fetching an address and fetching data are tightly coupled. If there is an address or data conflict between at least two consecutive memory access instructions, then one of the memory access instructions needs to stall in a pipeline; this stalls one or more following instructions and degrades performance of the L1 cache as well as of the microprocessor in which the L1 cache is incorporated. Stalling instructions due to address and/or data conflicts is further undesirable because it substantially reduces the effective clock frequency at which the L1 cache operates. In addition, there is a negative performance impact if the tag array is not accessed as early as possible.
A data array of the L1 cache is typically much larger than a tag array. The data array may be, for example, a large compiled memory that typically requires multiple clock cycles (e.g., two clock cycles) per memory access, such as a data load/store. Because of the multi-cycle memory access, the data array is typically implemented using multiple memory banks that can be accessed simultaneously. If, for example, two consecutive memory access instructions request access to two different memory banks, no bank conflict exists, and addresses in the separate memory banks can be accessed simultaneously without any conflict. On the other hand, if two consecutive memory access instructions request access to the same memory bank (for either the same or different data), then a bank conflict exists (e.g., a data or address conflict), and one of the memory access instructions (e.g., the later of the two) needs to be stalled. Since instructions are executed in order, one or more instructions following the stalled instruction can also be stalled, which is undesirable since it negatively affects performance of the L1 cache as well as of a microprocessor incorporating the L1 cache. Generally, the multi-cycle access of a data array causes more bank conflicts than access of a smaller tag array.
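To make the bank conflict condition concrete, the following is a minimal C++ sketch. The bank count, the assumption that the bank index is taken from low-order address bits, and all names here are illustrative choices, not details taken from the disclosure.

```cpp
#include <cstdint>

// Illustrative parameters: 8 banks, 8-byte bank words. The disclosure does
// not specify these values; they are assumptions for the sketch.
constexpr uint64_t kNumBanks = 8;
constexpr uint64_t kBankShift = 3;  // skip the 3 byte-offset bits of a word

// Bank index derived from low-order address bits, so consecutive words map
// to different banks.
inline uint64_t bankIndex(uint64_t address) {
  return (address >> kBankShift) & (kNumBanks - 1);
}

// Two consecutive accesses conflict when they target the same bank while a
// multi-cycle access to that bank is still in flight.
inline bool bankConflict(uint64_t firstAddr, uint64_t secondAddr) {
  return bankIndex(firstAddr) == bankIndex(secondAddr);
}
```

Under this illustrative mapping, two loads to addresses 0x100 and 0x108 fall in different banks and can proceed simultaneously, while loads to 0x100 and 0x140 both map to bank 0 and conflict.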
Certain embodiments of the present disclosure support implementation of a Level-1 (L1) cache in a microprocessor based on data arrays and tag arrays that are accessed independently. No stall pipeline mechanism is required for the L1 cache implementations presented herein, since stalling of instructions is avoided even when a memory conflict occurs.
Example embodiments of the present disclosure include configurations that may include structures and processes within a microprocessor. For example, a configuration may include a data array in a cache of the microprocessor interfaced with one or more data index queues, a tag array, and circuitry coupled to the data array and the tag array. The one or more data index queues can store, upon occurrence of a conflict between at least one instruction requesting access to the data array and at least one other instruction that accessed the data array, at least one data index for accessing the data array associated with the at least one instruction. The tag array in the cache is coupled with a tag queue that stores one or more tag entries associated with one or more data outputs read from the data array based on accessing the data array. The circuitry in the cache coupled to the data array and the tag array is configured for independent access of the data array and the tag array.
Example embodiments of the present disclosure include configurations that may include structures and processes within a microprocessor. For example, a configuration may include issuing one or more instructions per clock cycle for accessing a data array in a cache of the microprocessor interfaced with one or more data index queues, keeping, in the one or more data index queues upon occurrence of a conflict between at least one instruction requesting access to the data array and at least one other instruction that accessed the data array, at least one data index for accessing the data array associated with the at least one instruction, accessing, for the at least one instruction and the at least one other instruction, a tag array in the cache interfaced with a tag queue independently of accessing the data array, and storing, in the tag queue based on accessing the tag array, one or more tag entries associated with one or more data outputs read from the data array in response to accessing the data array.
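As a rough software model of the configuration summarized in the two paragraphs above, the following C++ sketch names the main structures: a data index queue that holds pending data array accesses across conflicts, and a tag queue that holds tag results awaiting their data outputs. The field layout and the robId matching scheme are assumptions made for illustration only, not the disclosed hardware.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical entry of a data index queue: enough information to (re)issue
// a data array access without touching the tag array or DTLB again.
struct DataIndex {
  uint64_t bank;      // target memory bank of the data array
  uint64_t setIndex;  // row/set within the bank
  unsigned robId;     // illustrative tag for matching results to instructions
};

// Hypothetical entry of the tag queue: the tag-array result that will be
// paired with the data output read from the data array.
struct TagEntry {
  bool hit;        // tag hit/miss for the associated access
  unsigned way;    // hitting way, meaningful only when hit is true
  unsigned robId;  // matches the corresponding DataIndex
};

struct ModularL1Model {
  std::deque<DataIndex> dataIndexQueue;  // data side, accessed independently
  std::deque<TagEntry> tagQueue;         // tag side, accessed independently

  // On issue, the tag side proceeds immediately; the data index waits in the
  // queue if its bank is busy, instead of stalling the pipeline.
  void issue(const DataIndex& di, const TagEntry& te) {
    dataIndexQueue.push_back(di);
    tagQueue.push_back(te);
  }
};
```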
The figures depict embodiments of the present disclosure for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles, or benefits touted, of the disclosure described herein.
For certain embodiments, an L1 cache comprises a tag array unit (e.g., the tag arrays 104 of the L1 cache 100 shown in FIG. 1).
A data array access (e.g., an access of the data array 202) is typically slower than a tag array access (e.g., an access of the tag arrays 204) due to the aforementioned multi-cycle access of large data arrays. The tag arrays 204, as well as a data translation look-aside buffer (DTLB) 210 associated with the tag arrays 204, are substantially smaller than the data array (data memory banks) 202 and may be accessed within a single clock cycle. In one or more embodiments of the present disclosure, as illustrated in FIG. 2, the tag arrays 204 and the DTLB 210 may therefore be accessed independently of the data array 202.
For some embodiments, a read pointer 214 of the data index queue 206 may be manipulated (changed) to resend an instruction (e.g., a data access) when a conflict is removed, without accessing the tag arrays 204 and the DTLB 210 again. In contrast, for an L1 cache implementation based on a stall pipeline (e.g., the L1 cache 100 illustrated in FIG. 1), the pipeline stalls until the conflict is removed, which delays the stalled instruction and any instructions that follow it.
For some embodiments, in the case of a tag miss, if the data index associated with a data access is still in the data index queue 206, then this entry (data index) may be invalidated and removed from the data index queue 206 based on tag hit/miss information 216. Hence, in the case of a tag miss, accessing of the data array 202 may be cancelled, which may lower power consumption of the L1 cache 200. In general, if the corresponding entry in the data index queue 206 has not yet been issued to the data array 202, cancelling the data array access and/or the tag array access in this way can reduce power consumption of the L1 cache. In some embodiments, information 216 about a translation look-aside buffer (TLB) miss in relation to the DTLB 210 can be used to cancel a corresponding entry in the data index queue 206. The instruction with the TLB miss and any subsequent instruction(s) may then be cancelled and replayed from the LS input queue 218. A microprocessor may attempt to fetch a TLB entry from a Level-2 TLB cache (not shown in FIG. 2).
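Continuing the illustrative ModularL1Model sketch from above, the cancellation path might look as follows; the search by a hypothetical robId is an assumption for the sketch, not the disclosed mechanism.

```cpp
// On a tag (or DTLB) miss, invalidate and remove the matching entry if it is
// still waiting in the data index queue, so the data array is never read for
// it. Returns false when the entry was already issued and cannot be cancelled.
inline bool cancelOnTagMiss(ModularL1Model& l1, unsigned missRobId) {
  for (auto it = l1.dataIndexQueue.begin(); it != l1.dataIndexQueue.end(); ++it) {
    if (it->robId == missRobId) {
      l1.dataIndexQueue.erase(it);  // access cancelled: the bank is never enabled
      return true;
    }
  }
  return false;  // too late to cancel; the data array access already started
}
```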
For some embodiments, the tag array 204 and the DTLB 210 may be configured to detect a tag hit or miss and send the tag hit/miss information 216 to a reorder buffer (ROB), not shown in FIG. 2.
In accordance with embodiments of the present disclosure, instructions from a Load/Store (LS) input queue 218 may flow into the data index queue 206 and the tag queue 212 as long as there is enough space in the data index queue 206 and the tag queue 212. In one or more embodiments, the LS input queue 218 may directly communicate with the tag queue 212 in order to prevent overflowing of the tag queue 212. In one or more embodiments, the interface circuit 208 may control the read pointer 214 and a write pointer (not shown in FIG. 2) of the data index queue 206.
Because of the modularized implementation of the single-port L1 cache 200 illustrated in FIG. 2, the tag arrays 204 and the DTLB 210 can be accessed independently of the data array 202, and no stall pipeline mechanism is required.
In accordance with embodiments of the present disclosure, as discussed above, an L1 cache (e.g., the single-port L1 cache 200 illustrated in FIG. 2) may issue one memory access instruction per clock cycle for accessing the data array.
As illustrated in FIG. 3, a dual-port L1 cache 300 may support issuing two memory access instructions per clock cycle for accessing the data array 302.
For some embodiments of the present disclosure, a store instruction does not need to access the data array 302 (data memory banks) when the store instruction is issued from a decode unit. Therefore, the store instruction does not have an entry in the data index queue 308. Since no store operation ever stalls, performance of an L1 cache (e.g., the dual-port L1 cache 300 illustrated in FIG. 3) may be improved.
In some embodiments, a data output queue 406 associated with all data banks 402 may match entries in a tag output queue 408 that stores tag entries from tag arrays 410 and/or a DTLB 412. In one embodiment, in the case of returning result data to the ROB in order, the first entry in the data output queue 406 may bypass directly to a result data register and then to the ROB. In another embodiment, in the case of returning result data to the ROB out of order, output data from the data banks 402 may be known two cycles early, and any entry with valid result data can bypass to the result data register and then to the ROB.
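The two return modes just described might be modeled as below; the DataOutput layout and the use of std::optional are illustrative assumptions. In hardware the out-of-order case would be a parallel priority select over valid entries, which the linear scan here only approximates.

```cpp
#include <cstdint>
#include <deque>
#include <optional>

// Hypothetical data output queue entry awaiting return to the ROB.
struct DataOutput {
  uint64_t data = 0;
  unsigned robId = 0;  // pairs this output with its tag output queue entry
  bool valid = false;  // result data present and ready to bypass
};

// In-order return: only the first (oldest) entry may bypass directly to the
// result data register and then to the ROB.
std::optional<DataOutput> bypassInOrder(std::deque<DataOutput>& outQ) {
  if (!outQ.empty() && outQ.front().valid) {
    DataOutput d = outQ.front();
    outQ.pop_front();
    return d;
  }
  return std::nullopt;
}

// Out-of-order return: any entry with valid result data may bypass.
std::optional<DataOutput> bypassOutOfOrder(std::deque<DataOutput>& outQ) {
  for (auto it = outQ.begin(); it != outQ.end(); ++it) {
    if (it->valid) {
      DataOutput d = *it;
      outQ.erase(it);
      return d;
    }
  }
  return std::nullopt;
}
```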
In accordance with embodiments of the present disclosure, if there is a new input (i.e., a new data index) to be written into the data index queue 502, the new input may be written into the location of the data index queue 502 indicated by the write pointer 504. After that, the write pointer 504 may increment and point to the next write location in the data index queue 502. If an entry (data index) is about to be read from the data index queue 502 because a request to access a data array is being sent, the data index may be read from the location of the data index queue 502 indicated by the read pointer 506. The read pointer 506 then increments and moves to the next read location of the data index queue 502, whereas the saved read pointer 508 does not yet increment and still points to the location that the read pointer 506 pointed to before incrementing. If the request for data access is incomplete (i.e., if there is a memory bank conflict), the read pointer 506 may be restored from the value of the saved read pointer 508. If the request for data access is complete (i.e., if there is no memory bank conflict), the saved read pointer 508 also increments. In one or more embodiments, if the difference between the write pointer 504 and the saved read pointer 508, measured in locations of the data index queue 502, equals a predefined threshold value (e.g., Q-full), then the data index queue 502 is full and a new data index entry cannot be written into it.
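The pointer discipline described above can be captured in a small C++ sketch; the queue depth, entry type, and use of free-running counters are illustrative assumptions.

```cpp
#include <array>
#include <cstddef>

// Circular queue with a write pointer, a read pointer, and a saved read
// pointer, so a read can be replayed after a memory bank conflict. Pointers
// are free-running counters; the slot used is pointer % N.
template <typename T, std::size_t N>
class PointerQueue {
  std::array<T, N> slots_;
  std::size_t write_ = 0;      // next write location (write pointer 504)
  std::size_t read_ = 0;       // next read location (read pointer 506)
  std::size_t savedRead_ = 0;  // last confirmed read (saved read pointer 508)

 public:
  // Q-full: the write pointer is N locations ahead of the saved read pointer.
  bool full() const { return write_ - savedRead_ == N; }
  bool empty() const { return read_ == write_; }

  bool push(const T& v) {  // write at the write pointer, then increment it
    if (full()) return false;
    slots_[write_ % N] = v;
    ++write_;
    return true;
  }

  bool pop(T& out) {              // read at the read pointer; the saved read
    if (empty()) return false;    // pointer does not move until the access
    out = slots_[read_ % N];      // completes without a bank conflict
    ++read_;
    return true;
  }

  void confirm() {  // access complete (no bank conflict): advance saved read
    if (savedRead_ < read_) ++savedRead_;
  }
  void restore() { read_ = savedRead_; }  // bank conflict: replay the read
};
```

Because a conflicting access can be replayed by restoring the read pointer alone, the tag arrays and DTLB need not be accessed again, matching the behavior described for the read pointer 214 earlier in this section.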
It should be noted that the performance improvement shown in FIG. 6 is provided for purposes of illustration only.
In some embodiments, the instruction cache 716 or the data cache 718 may correspond to the single-port L1 cache 200 illustrated in FIG. 2 or the dual-port L1 cache 300 illustrated in FIG. 3.
Additional Considerations
The foregoing description of the embodiments of the disclosure has been presented for the purpose of illustration; it is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Persons skilled in the relevant art can appreciate that many modifications and variations are possible in light of the above disclosure.
Some portions of this description describe the embodiments of the disclosure in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combinations thereof.
Any of the steps, operations, or processes described herein may be performed or implemented with one or more hardware or software modules, alone or in combination with other devices. In one embodiment, a software module is implemented with a computer program product comprising a computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.
Embodiments of the disclosure may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, and/or it may comprise a general-purpose computing device selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a non-transitory, tangible computer readable storage medium, or any type of media suitable for storing electronic instructions, which may be coupled to a computer system bus. Furthermore, any computing systems referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
Embodiments of the disclosure may also relate to a product that is produced by a computing process described herein. Such a product may comprise information resulting from a computing process, where the information is stored on a non-transitory, tangible computer readable storage medium and may include any embodiment of a computer program product or other data combination described herein.
Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the disclosure be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of the embodiments is intended to be illustrative, but not limiting, of the scope of the disclosure, which is set forth in the following claims.