The field of invention pertains generally to computing systems, and, more specifically, to a memory controller for a multi-level system memory having a sectored cache.
Computing systems typically include system memory (or main memory) that contains the data and program code of the software that the system's processor(s) are currently executing. A pertinent bottleneck in many computer systems is the system memory. Here, as is understood in the art, a computing system operates by executing program code stored in system memory. The program code, when executed, reads and writes data from/to system memory. As such, system memory is heavily utilized with many program code and data reads as well as many data writes over the course of the computing system's operation. Finding ways to speed up system memory is therefore a motivation of computing system engineers.
One of the ways to speed up system memory without significantly increasing power consumption is to have a multi-level system memory.
In the case where near memory 113 is used as a memory side cache, near memory 113 is used to store data items that are expected to be more frequently called upon by the computing system. The near memory cache 113 has lower access times than the lower tiered far memory 114 region. By storing the more frequently called upon items in near memory 113, the system memory will be observed as faster because the system will often read items that are being stored in faster near memory 113.
According to some embodiments, for example, the near memory 113 exhibits reduced access times by having a faster clock speed than the far memory 114. Here, the near memory 113 may be a faster, volatile system memory technology (e.g., high performance dynamic random access memory (DRAM)) or faster non volatile memory. By contrast, far memory 114 may be either a volatile memory technology implemented with a slower clock speed (e.g., a DRAM component that receives a slower clock) or, e.g., a non volatile memory technology that is inherently slower than volatile/DRAM memory or whatever technology is used for near memory.
For example, far memory 114 may be comprised of an emerging non volatile byte addressable random access memory technology such as, to name a few possibilities, a phase change based memory, a ferro-electric based memory (e.g., FRAM), a magnetic based memory (e.g., MRAM), a spin transfer torque based memory (e.g., STT-RAM), a resistive based memory (e.g., ReRAM), a Memristor based memory, universal memory, Ge2Sb2Te5 memory, programmable metallization cell memory, amorphous cell memory, Ovshinsky memory, etc.
Such emerging non volatile random access memory technologies typically have some combination of the following: 1) higher storage densities than DRAM (e.g., by being constructed in three-dimensional (3D) circuit structures (e.g., a crosspoint 3D circuit structure)); 2) lower power consumption densities than DRAM (e.g., because they do not need refreshing); and/or 3) access latency that is slower than DRAM yet still faster than traditional non volatile memory technologies such as FLASH. The latter characteristic in particular permits an emerging non volatile memory technology to be used in a main system memory role rather than a traditional storage role (which is the traditional architectural location of non volatile storage).
Regardless of whether far memory 114 is composed of a volatile or non volatile memory technology, in various embodiments far memory 114 acts as a true system memory in that it supports finer grained data accesses (e.g., cache lines) rather than the larger block based accesses associated with traditional, non volatile storage (e.g., solid state drive (SSD), hard disk drive (HDD)), and/or, otherwise acts as an (e.g., byte) addressable memory that the program code being executed by the processor(s) of the CPU operates out of.
Because near memory 113 acts as a cache, near memory 113 may not have its own individual addressing space. Rather, only far memory 114 includes the individually addressable memory space of the computing system's main memory. In various embodiments near memory 113 truly acts as a cache for far memory 114 rather than acting as a last level CPU cache (generally, a CPU level cache is able to keep cache lines across the entirety of the system memory addressing space that is made available to the processing cores 117 that are integrated on a same semiconductor chip as the memory controller 116).
For example, in various embodiments, system memory is implemented with dual in-line memory module (DIMM) cards where a single DIMM card has both DRAM and (e.g., emerging) non volatile memory chips disposed on it. The DRAM chips effectively act as an on board cache for the non volatile memory chips on the DIMM card. Ideally, the more frequently accessed cache lines of any particular DIMM card will be found on that DIMM card's DRAM chips rather than its non volatile memory chips. Given that multiple DIMM cards are typically plugged into a working computing system and each DIMM card is only given a section of the system memory addresses made available to the processing cores 117 of the semiconductor chip that the DIMM cards are coupled to, the DRAM chips are acting as a cache for the non volatile memory that they share a DIMM card with rather than a last level CPU cache.
In other configurations DIMM cards having only DRAM chips may be plugged into a same system memory channel (e.g., a DDR channel) with DIMM cards having only non volatile system memory chips. Ideally, the more frequently used cache lines of the channel will be found in the DRAM DIMM cards rather than the non volatile memory DIMM cards. Thus, again, because there are typically multiple memory channels coupled to a same semiconductor chip having multiple processing cores, the DRAM chips are acting as a cache for the non volatile memory chips that they share a same channel with rather than as a last level CPU cache. Although the above example referred to packaging solutions that included DIMM cards, it is pertinent to note that this is just one example and other embodiments may use other packaging solutions (e.g., stacked chip technology, one or more DRAM and phase change memories integrated on a same semiconductor die or at least within a same package as the processing core(s), etc.).
In yet other embodiments, near memory 113 may act as a CPU level cache.
The architecture of the near memory cache 113 may also vary from embodiment to embodiment. According to one approach, the near memory cache 113 is implemented as a direct mapped cache in which multiple system memory addresses map to one cache line slot in near memory 113. Other embodiments may implement other types of cache structures (e.g., set associative, etc.). Regardless of the specific cache architecture, different cache lines may compete for the same cache resources in near memory 113.
For example, in the case of a direct mapped cache, when requests for two or more cache lines whose respective addresses map to the same near memory 113 cache line slot are concurrently received by the memory controller 116, the memory controller 116 will keep one of the cache lines in near memory cache 113 and cause the other cache line to be kept in far memory 114.
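By way of a minimal sketch, the direct mapped mapping can be expressed as follows, assuming purely illustrative field widths (a 64 byte cache line and 2^20 near memory slots) and hypothetical helper names (nm_slot, nm_tag) that are not part of the description above:

    #include <stdint.h>

    /* Hypothetical direct mapped geometry: the slot index comes from the
     * middle bits of the request address and the remaining upper bits
     * form the tag.  Field widths are illustrative assumptions only. */
    #define LINE_BITS  6                    /* 64 byte cache line offset */
    #define SLOT_BITS  20                   /* 2^20 near memory slots    */
    #define NUM_SLOTS  (1u << SLOT_BITS)

    static inline uint32_t nm_slot(uint64_t addr)
    {
        /* All addresses that share these middle bits compete for one slot. */
        return (uint32_t)((addr >> LINE_BITS) & (NUM_SLOTS - 1));
    }

    static inline uint64_t nm_tag(uint64_t addr)
    {
        /* The tag identifies which competing address occupies the slot. */
        return addr >> (LINE_BITS + SLOT_BITS);
    }

Two addresses that agree in the slot bits but differ in the tag compete for the same near memory slot, which is the situation the eviction/fill behavior described below must handle.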
Whenever a request for a cache line is received by the memory controller 116, the memory controller first checks for the cache line in near memory cache 113. If the result is a cache hit, the memory controller 116 services the request from the version of the cache line in near memory 113 (in the case of a read request, the version of the cache line in near memory cache is forwarded to the requestor; in the case of a write, the version of the cache line in near memory cache is written over and kept in the near memory cache). In the case of a cache miss, for both read and write requests, the cache line that is targeted by the request is called up from far memory 114 and stored in near memory cache 113. In order to make room for the new cache line in near memory cache 113, another cache line that competes with the targeted cache line is evicted from near memory cache 113 and sent to far memory 114.
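This service flow can be sketched with a toy model in which near memory is a small direct mapped cache in front of a flat far memory array; the names and sizes below are illustrative assumptions, not the controller's actual datapath:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define SLOTS 4
    #define LINE  64
    #define ADDRS (SLOTS * 4)    /* four competing addresses per slot */

    static uint8_t  far_mem[ADDRS][LINE];
    static uint8_t  near_mem[SLOTS][LINE];
    static uint64_t cached_addr[SLOTS]; /* full address stands in for the tag */
    static bool     valid[SLOTS];

    static void service(bool is_write, uint64_t addr, uint8_t *buf)
    {
        uint32_t slot = (uint32_t)(addr % SLOTS);      /* direct mapped index */
        if (valid[slot] && cached_addr[slot] == addr) {          /* cache hit */
            if (is_write) memcpy(near_mem[slot], buf, LINE); /* overwrite cached version */
            else          memcpy(buf, near_mem[slot], LINE); /* forward cached version   */
            return;
        }
        /* Cache miss: evict the competing occupant, then fill with the target. */
        if (valid[slot])
            memcpy(far_mem[cached_addr[slot]], near_mem[slot], LINE); /* eviction */
        memcpy(near_mem[slot], far_mem[addr], LINE);  /* call line up from far memory */
        cached_addr[slot] = addr;
        valid[slot]       = true;
        if (is_write) memcpy(near_mem[slot], buf, LINE);
        else          memcpy(buf, near_mem[slot], LINE);
    }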
Data consistency problems may arise if care is not taken in handling cache lines while an old cache line is being evicted from near memory 113 to far memory 114 and the space created in near memory cache 113 by the eviction is being filled with the new cache line whose read or write request just suffered a cache miss. For example, if the evicted cache line is dirty (meaning it contains the most recent, up to date version of the cache line's data) and a write request is received for the evicted cache line before it is actually written to far memory 114, the memory controller 116 needs to take appropriate action to make sure the dirty cache line is updated with the new data.
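One way to guard against this hazard, shown as a minimal sketch below, is to check each incoming write against the set of in flight evictions and merge the new data into the not yet written copy; the buffer structure, sizes, and names here are assumptions for illustration only:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define NTRACK      8       /* in flight eviction buffers (illustrative) */
    #define SUPER_BYTES 1024    /* assumed super-line size */

    /* One buffer per eviction that has left near memory but has not
     * yet been written to far memory. */
    struct inflight {
        bool     valid;
        uint64_t addr;                  /* evicted super-line address  */
        uint8_t  data[SUPER_BYTES];     /* evicted super-line contents */
    };
    static struct inflight evict_q[NTRACK];

    /* Returns true if the write was absorbed by an in flight eviction,
     * keeping the soon to be written far memory copy up to date. */
    static bool merge_into_inflight(uint64_t addr, unsigned offset,
                                    const void *src, unsigned len)
    {
        for (int i = 0; i < NTRACK; i++) {
            if (evict_q[i].valid && evict_q[i].addr == addr) {
                memcpy(&evict_q[i].data[offset], src, len);
                return true;
            }
        }
        return false;   /* no match; service the write normally */
    }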
Before beginning the discussion of
The data consistency problems mentioned just above are especially likely to occur in the case of a sectored cache that moves entire super-lines between near memory and far memory (as opposed to more traditional, nominally sized cache lines). For example, with the much larger size of a super-line, there is more data to move from near memory to far memory in the case of an eviction from near memory cache to far memory. This may result in more propagation delay, e.g., physically reading all of the data from near memory and then forwarding this data within the memory controller to a far memory interface. Additionally, again with the expansive size of the super-line, there is a greater chance that an incoming write request will target a cache line within the super-line. Thus, an incoming write request that targets a cache line while the cache line is moving between near memory and far memory becomes a more likely event in the case of a super-line and a sectored cache.
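For concreteness, the following sketch shows one hypothetical way a byte address decomposes into a super-line address and a sector index within that super-line; the sizes (sixteen 64 byte sectors per 1 KB super-line) are assumptions for illustration only:

    #include <stdint.h>

    /* Hypothetical sectored cache geometry: a super-line is a group of
     * contiguous cache line "sectors" moved between near and far memory
     * as one unit.  Sizes are illustrative. */
    #define LINE_BYTES   64
    #define SECTORS      16                         /* cache lines per super-line */
    #define SUPER_BYTES  (LINE_BYTES * SECTORS)     /* 1 KB super-line */

    static inline uint64_t super_addr(uint64_t byte_addr)
    {
        return byte_addr / SUPER_BYTES;             /* which super-line */
    }

    static inline uint32_t sector_idx(uint64_t byte_addr)
    {
        /* which cache line sector within the super-line */
        return (uint32_t)((byte_addr / LINE_BYTES) % SECTORS);
    }

The larger the super-line, the more byte addresses fall within it, which is why an incoming write is more likely to land on a super-line that is in transit.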
As such, the discussion below will generally refer to a super-line although the reader should understand the approach of
As observed in
Each tracker circuit 206 includes register space to hold state information for both the evicted and the filling super-lines. The state information may be kept, e.g., in memory such as a dedicated (non cache) part of near memory 202. In an embodiment, the state information identifies a particular super-line by its address and two items of metadata that indicate whether the particular super-line is still formally residing in near memory cache 202, and whether the particular super-line is in a modified state (M). Note that a single address may be used to identify a particular super-line (as suggested in
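A minimal sketch of one possible layout for a tracker entry follows; the field names are assumptions, and only the address and the two items of metadata described above are modeled:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative layout of one state tracker entry.  Each tracker
     * follows one eviction/fill pair: the "old" super-line being
     * evicted and the "new" super-line that replaces it. */
    struct superline_state {
        uint64_t addr;      /* single address identifying the super-line */
        bool     cached;    /* C: a version still resides in near memory */
        bool     modified;  /* M: dirty; holds the most up to date data  */
    };

    struct tracker {
        struct superline_state old_line;  /* super-line being evicted    */
        struct superline_state new_line;  /* super-line filling the slot */
        bool in_use;                      /* tracker allocated to a slot */
    };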
As is known in the art, a super-line in the M state is essentially a “dirty” super-line in that it holds the most recent, up to date data for the super-line. As will be described more clearly below, a pertinent feature of the memory controller 201 of
In an embodiment, a tag array (not shown) resides within a memory region (not depicted) of the memory controller 201 to indicate whether or not a cache hit has resulted for any particular super-line. A tag array essentially includes an entry for each slot in the sectored near memory cache and keeps the “tag” (e.g., upper) address bits for the particular super-line that is presently occupying the slot in the sectored near memory cache. For each incoming request, hashing or lookup circuitry (also not shown) associated with the tag array respectively performs a hash or lookup operation to map the address of the request to the particular entry in the tag array that the address maps to. If the tag held in the entry of the tag array matches the corresponding tag of the request the result is a cache hit. Otherwise the result is a cache miss. Note that, in an embodiment, the “old” entries in the state tracker circuits 206 may mimic the address tag information in the tag array. If the number of state tracker circuits is less than the number of slots in the cache, information from the tag array is used to “fill” the “old” entries of the state tracker circuits 206.
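The tag array check described above might look, in outline, like the following sketch; a simple modulo mapping stands in for the hash/lookup circuitry, and the sizes and names are illustrative assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define NM_SLOTS 1024   /* illustrative number of super-line slots */

    /* One tag array entry per near memory slot; it holds the upper
     * address bits of the super-line presently occupying the slot. */
    static uint64_t tag_array[NM_SLOTS];
    static bool     tag_valid[NM_SLOTS];

    static bool is_cache_hit(uint64_t super_addr)
    {
        uint32_t entry = (uint32_t)(super_addr % NM_SLOTS); /* hash/lookup */
        uint64_t tag   = super_addr / NM_SLOTS;   /* remaining upper bits */
        return tag_valid[entry] && tag_array[entry] == tag;
    }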
Continuing then with the present example, in contrast to slot 207_2, each of the other slots 207_1 and 207_3 through 207_5 has been newly targeted by a memory access request that resulted in a cache miss. As such, each of slots 207_1 and 207_3 through 207_5 has a corresponding old super-line that needs to be evicted and a new super-line that will fill the space created in near memory cache 202 by the evicted super-line. No actual eviction/filling activity has taken place as of
As observed in
As observed in
Note that because the four super-lines being evicted have not actually been evicted yet (they are still on the host side in the memory controller 201 and have not yet been written to far memory 203), their corresponding tracker entries still show each super-line in the C state. That is, each of these super-lines still has a version of itself resident in its corresponding slot in near memory cache 202.
Note that because super-lines ADDR_3 and ADDR_5 are dirty, they should be evicted into far memory 203. Whether or not the super-lines ADDR_1 and ADDR_4 should actually be evicted depends on implementation. Specifically, super-lines that are not dirty (such as the super-lines having addresses ADDR_1 and ADDR_4) need only actually be evicted into far memory 203 if a copy of them does not already exist in far memory 203. Here, systems may differ with respect to the relationship between the contents of the near memory cache 202 and far memory 203. Some systems may keep a copy in far memory 203 of any super-line in near memory cache 202. For these systems, it is not necessary to write back to far memory 203 an evicted super-line that is not in the M state. Other systems, however, may not keep a copy in far memory 203 of a super-line that is cached in near memory 202. These systems, by contrast, should write back "clean" (non M state) evicted super-lines to far memory 203 as observed in
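The write back decision just described reduces to a small predicate; the sketch below uses a hypothetical compile time switch (INCLUSIVE_FAR_MEMORY) to represent the two classes of systems:

    #include <stdbool.h>

    /* Assumed policy switch: nonzero if far memory always keeps a copy
     * of every super-line that is cached in near memory. */
    #define INCLUSIVE_FAR_MEMORY 0

    static bool needs_writeback(bool modified)
    {
        if (modified)
            return true;        /* dirty lines are always written back   */
    #if INCLUSIVE_FAR_MEMORY
        return false;           /* a copy already exists in far memory   */
    #else
        return true;            /* clean lines have no far memory copy   */
    #endif
    }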
An applications processor or multi-core processor 550 may include one or more general purpose processing cores 515 within its CPU 501, one or more graphical processing units 516, a memory management function 517 (e.g., a memory controller) and an I/O control function 518. The general purpose processing cores 515 typically execute the operating system and application software of the computing system. The graphics processing units 516 typically execute graphics intensive functions to, e.g., generate graphics information that is presented on the display 503. The memory control function 517 interfaces with the system memory 502. The system memory 502 may be a multi-level system memory such as the multi-level system memory discussed at length above. The memory controller may include tracker circuitry as described at length above. During operation, data and/or instructions are typically transferred between deeper non volatile (e.g., "disk") storage 520 and system memory 502. The power management control unit 512 generally controls the power consumption of the system 500.
Each of the touchscreen display 503, the communication interfaces 504-507, the GPS interface 508, the sensors 509, the camera 510, and the speaker/microphone codec 513, 514 can be viewed as a form of I/O (input and/or output) relative to the overall computing system including, where appropriate, an integrated peripheral device as well (e.g., the camera 510). Depending on implementation, various ones of these I/O components may be integrated on the applications processor/multi-core processor 550 or may be located off the die or outside the package of the applications processor/multi-core processor 550.
Embodiments of the invention may include various processes as set forth above. The processes may be embodied in machine-executable instructions. The instructions can be used to cause a general-purpose or special-purpose processor to perform certain processes. Alternatively, these processes may be performed by specific hardware components that contain hardwired logic for performing the processes, or by any combination of programmed computer components and custom hardware components.
Elements of the present invention may also be provided as a machine-readable medium for storing the machine-executable instructions. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs, and magneto-optical disks, FLASH memory, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, propagation media or other type of media/machine-readable medium suitable for storing electronic instructions. For example, the present invention may be downloaded as a computer program which may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.