The field of invention pertains generally to the computing sciences, and, more specifically, to a multi-level system memory having near memory space capable of behaving as near memory cache or fast addressable system memory depending on system state.
Computing systems typically include a system memory (or main memory) that contains data and program code of the software that the system's processor(s) are currently executing. A pertinent issue in many computer systems is the system memory. Here, as is understood in the art, a computing system operates by executing program code stored in system memory. The program code, when executed, reads and writes data from/to system memory. As such, system memory is heavily utilized, with many program code and data reads as well as many data writes over the course of the computing system's operation. Finding ways to improve system memory is therefore a motivation of computing system engineers.
A better understanding of the present invention can be obtained from the following detailed description in conjunction with the following drawings, in which:
1.a. Multi-Level System Memory Overview
One of the ways to improve system memory performance is to have a multi-level system memory.
The use of cache memories for computing systems is well-known. In the case where near memory 113 is used as a cache, near memory 113 is used to store an additional copy of those data items in far memory 114 that are expected to be more frequently called upon by the computing system. The near memory cache 113 has lower access times than the lower tiered far memory 114 region. By storing the more frequently called upon items in near memory 113, the system memory 112 will be observed as faster because the system will often read items that are being stored in faster near memory 113. For an implementation using a write-back technique, the copy of data items in near memory 113 may contain data that has been updated by the central processing unit (CPU), and is thus more up-to-date than the data in far memory 114. The process of writing back ‘dirty’ cache entries to far memory 114 ensures that such changes are not lost.
According to some embodiments, for example, the near memory 113 exhibits reduced access times by having a faster clock speed than the far memory 114. Here, the near memory 113 may be a faster (e.g., lower access time), volatile system memory technology (e.g., high performance dynamic random access memory (DRAM)) and/or SRAM memory cells co-located with the memory controller 116. By contrast, far memory 114 may be either a volatile memory technology implemented with a slower clock speed (e.g., a DRAM component that receives a slower clock) or, e.g., a non volatile memory technology that may be slower (e.g., longer access time) than volatile/DRAM memory or whatever technology is used for near memory.
For example, far memory 114 may be comprised of an emerging non volatile random access memory technology such as, to name a few possibilities, a phase change based memory, three dimensional crosspoint memory device, or other byte addressable nonvolatile memory devices, “write-in-place” non volatile main memory devices, memory devices that use chalcogenide phase change material (e.g., glass), single or multiple level flash memory, multi-threshold level flash memory, a ferro-electric based memory (e.g., FRAM), a magnetic based memory (e.g., MRAM), a spin transfer torque based memory (e.g., STT-RAM), a resistor based memory (e.g., ReRAM), a Memristor based memory, universal memory, Ge2Sb2Te5 memory, programmable metallization cell memory, amorphous cell memory, Ovshinsky memory, etc.
Such emerging non volatile random access memory technologies typically have some combination of the following: 1) higher storage densities than DRAM (e.g., by being constructed in three-dimensional (3D) circuit structures (e.g., a crosspoint 3D circuit structure)); 2) lower power consumption densities than DRAM (e.g., because they do not need refreshing); and/or, 3) access latency that is slower than DRAM yet still faster than traditional non-volatile memory technologies such as FLASH. The latter characteristic in particular permits various emerging non volatile memory technologies to be used in a main system memory role rather than a traditional mass storage role (which is the traditional architectural location of non volatile storage).
Regardless of whether far memory 114 is composed of a volatile or non volatile memory technology, in various embodiments far memory 114 acts as a true system memory in that it supports finer grained data accesses (e.g., cache lines) rather than the larger, block based accesses associated with traditional, non volatile mass storage (e.g., solid state drive (SSD), hard disk drive (HDD)), and/or, otherwise acts as an (e.g., byte) addressable memory that the program code being executed by the processor(s) of the CPU operates out of. However, far memory 114 may be inefficient when accessed for a small number of consecutive bytes (e.g., less than 128 bytes) of data, the effect of which may be mitigated by the presence of near memory 113 operating as a cache, which is able to efficiently handle such requests.
Because near memory 113 acts as a cache, near memory 113 may not have formal addressing space. Rather, in some cases, far memory 114 defines the individually addressable memory space of the computing system's main memory. In various embodiments near memory 113 acts as a cache for far memory 114 rather than acting as a last level CPU cache. Generally, a CPU cache is optimized for servicing CPU transactions, and will add significant penalties (such as cache snoop overhead and cache eviction flows in the case of a hit) to other memory users such as Direct Memory Access (DMA)-capable devices in a Peripheral Control Hub. By contrast, a memory side cache is designed to handle all accesses directed to system memory, irrespective of whether they arrive from the CPU, from the Peripheral Control Hub, or from some other device such as a display controller.
For example, in various embodiments, system memory is implemented with dual in-line memory module (DIMM) cards where a single DIMM card has both volatile (e.g., DRAM) and (e.g., emerging) non volatile memory semiconductor chips disposed in it. The DRAM chips effectively act as an on board cache for the non volatile memory chips on the DIMM card. Ideally, the more frequently accessed cache lines of any particular DIMM card will be accessed from that DIMM card's DRAM chips rather than its non volatile memory chips. Given that multiple DIMM cards may be plugged into a working computing system and each DIMM card is only given a section of the system memory addresses made available to the processing cores 117 of the semiconductor chip that the DIMM cards are coupled to, the DRAM chips are acting as a cache for the non volatile memory that they share a DIMM card with rather than a last level CPU cache.
In other configurations DIMM cards having only DRAM chips may be plugged into a same system memory channel (e.g., a DDR channel) with DIMM cards having only non volatile system memory chips. Ideally, the more frequently used cache lines of the channel are in the DRAM DIMM cards rather than the non volatile memory DIMM cards. Thus, again, because there are typically multiple memory channels coupled to a same semiconductor chip having multiple processing cores, the DRAM chips are acting as a cache for the non volatile memory chips that they share a same channel with rather than as a last level CPU cache.
In yet other possible configurations or implementations, a DRAM device on a DIMM card can act as a memory side cache for a non volatile memory chip that resides on a different DIMM and is plugged into a different channel than the DIMM having the DRAM device. Although the DRAM device may potentially service the entire system memory address space, entries into the DRAM device are based in part on reads performed on the non volatile memory devices and not just on evictions from the last level CPU cache. As such the DRAM device can still be characterized as a memory side cache.
In another possible configuration, a memory device such as a DRAM device functioning as near memory 113 may be assembled together with the memory controller 116 and processing cores 117 onto a single semiconductor device or within a same semiconductor package. Far memory 114 may be formed by other devices, such as slower DRAM or non volatile memory, and may be attached to, or integrated in, that device or package.
As described at length above, near memory 113 may act as a cache for far memory 114. In various embodiments, the memory controller 116 and/or near memory 113 may include local cache information (hereafter referred to as “Metadata”) 120 so that the memory controller 116 can determine whether a cache hit or cache miss has occurred in near memory 113 for any incoming memory request. The metadata may also be stored in near memory 113.
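By way of illustration only, the following C sketch shows one way such Metadata might be organized and consulted to make the hit/miss determination. The direct-mapped organization, the field names (tag, valid, dirty), the slot count and the 64-byte cache line size are assumptions of the sketch, not a description of any particular embodiment.

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative, direct-mapped metadata for a near memory cache: one entry
     * per near memory cache line slot.  All names and sizes are assumptions. */
    #define NM_NUM_SLOTS   (1u << 20)    /* hypothetical near memory capacity in cache lines */
    #define CACHE_LINE_SZ  64u

    struct nm_metadata {
        uint64_t tag;      /* upper system address bits of the line cached in this slot */
        bool     valid;    /* the slot currently holds a cached copy */
        bool     dirty;    /* the cached copy is newer than the copy in far memory */
    };

    static struct nm_metadata metadata[NM_NUM_SLOTS];

    /* Decompose a system memory address into a slot index and a tag. */
    static inline uint32_t nm_slot(uint64_t addr) { return (uint32_t)((addr / CACHE_LINE_SZ) % NM_NUM_SLOTS); }
    static inline uint64_t nm_tag(uint64_t addr)  { return (addr / CACHE_LINE_SZ) / NM_NUM_SLOTS; }

    /* The hit/miss determination the memory controller performs per incoming request. */
    static inline bool nm_hit(uint64_t addr)
    {
        const struct nm_metadata *m = &metadata[nm_slot(addr)];
        return m->valid && m->tag == nm_tag(addr);
    }

A set associative variant would keep several such entries per slot and compare the tag of the incoming request against each of them.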
In the case of an incoming write request, if there is a cache hit, the memory controller 116 writes the data (e.g., a 64-byte CPU cache line) associated with the request directly over the cached version in near memory 113. Likewise, in the case of a cache miss, in an embodiment, the memory controller 116 also writes the data associated with the request into near memory 113, potentially first having fetched from far memory 114 any missing parts of the data required to make up the minimum size of data that can be marked in Metadata as being valid in near memory 113, in a technique known as ‘underfill’. However, if the entry in the near memory cache 113 that the content is to be written into has been allocated to a different system memory address and contains newer data than held in far memory 114 (i.e. it is dirty), the data occupying the entry must be evicted from near memory 113 and written into far memory 114.
In the case of an incoming read request, if there is a cache hit, the memory controller 116 responds to the request by reading the version of the cache line from near memory 113 and providing it to the requestor. By contrast, if there is a cache miss, the memory controller 116 reads the requested cache line from far memory 114 and not only provides the cache line to the requestor but also writes another copy of the cache line into near memory 113. In many cases, the amount of data requested from far memory 114 and the amount of data written to near memory 113 will be larger than that requested by the incoming read request. Using a larger data size from far memory or to near memory increases the probability of a cache hit for a subsequent transaction to a nearby memory location.
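Continuing the illustration, the sketch below outlines the write and read flows described above under the same assumed direct-mapped metadata (repeated here so the sketch stands on its own). Full cache line writes are assumed, so the 'underfill' read of missing bytes is not shown, and the routines far_read, far_write, near_read and near_write are hypothetical stand-ins for the memory controller's channel operations rather than the interface of any real device.

    #include <stdbool.h>
    #include <stdint.h>

    #define NM_NUM_SLOTS   (1u << 20)
    #define CACHE_LINE_SZ  64u

    struct nm_metadata { uint64_t tag; bool valid; bool dirty; };
    static struct nm_metadata metadata[NM_NUM_SLOTS];

    static inline uint32_t nm_slot(uint64_t a) { return (uint32_t)((a / CACHE_LINE_SZ) % NM_NUM_SLOTS); }
    static inline uint64_t nm_tag(uint64_t a)  { return (a / CACHE_LINE_SZ) / NM_NUM_SLOTS; }
    static inline uint64_t nm_victim_addr(uint32_t slot, uint64_t tag)
    {
        return (tag * NM_NUM_SLOTS + slot) * CACHE_LINE_SZ;
    }

    /* Hypothetical stand-ins for the actual near/far memory channel operations. */
    void far_read(uint64_t addr, uint8_t *line);
    void far_write(uint64_t addr, const uint8_t *line);
    void near_read(uint32_t slot, uint8_t *line);
    void near_write(uint32_t slot, const uint8_t *line);

    /* Write the current occupant of a slot back to far memory if it is newer. */
    static void writeback_if_dirty(uint32_t slot)
    {
        struct nm_metadata *m = &metadata[slot];
        if (m->valid && m->dirty) {
            uint8_t line[CACHE_LINE_SZ];
            near_read(slot, line);
            far_write(nm_victim_addr(slot, m->tag), line);
            m->dirty = false;
        }
    }

    /* Read path: a hit is served from near memory; a miss is served from far
     * memory and the returned line is also installed into near memory. */
    void handle_read(uint64_t addr, uint8_t *out)
    {
        uint32_t slot = nm_slot(addr);
        struct nm_metadata *m = &metadata[slot];

        if (m->valid && m->tag == nm_tag(addr)) {   /* cache hit */
            near_read(slot, out);
            return;
        }
        writeback_if_dirty(slot);                   /* make room for the new line */
        far_read(addr & ~(uint64_t)(CACHE_LINE_SZ - 1), out);
        near_write(slot, out);
        m->tag = nm_tag(addr); m->valid = true; m->dirty = false;
    }

    /* Write path: the data is written into near memory on both a hit and a miss;
     * on a miss to an occupied slot, a dirty occupant is first written back. */
    void handle_write(uint64_t addr, const uint8_t *data)
    {
        uint32_t slot = nm_slot(addr);
        struct nm_metadata *m = &metadata[slot];

        if (!(m->valid && m->tag == nm_tag(addr)))
            writeback_if_dirty(slot);
        near_write(slot, data);
        m->tag = nm_tag(addr); m->valid = true; m->dirty = true;
    }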
In general, cache lines may be written to and/or read from near memory and/or far memory at different levels of granularity, e.g., at cache line granularity (where writes and/or reads only occur at cache line granularity and byte addressability for writes and/or reads is handled internally within the memory controller), at byte granularity (true byte addressability in which the memory controller writes and/or reads only an identified one or more bytes within a cache line), or at granularities in between. Additionally, note that the size of the cache line maintained within near memory and/or far memory may be larger than the cache line size maintained by CPU level caches. Different types of near memory caching architecture are possible (e.g., direct mapped, set associative, etc.).
In still other embodiments, at least some portion of near memory 113 has its own system address space apart from the system addresses that have been assigned to far memory 114 locations. In this case, the portion of near memory 113 that has been allocated its own system memory address space acts, e.g., as a higher priority level of system memory (because it is faster than far memory) rather than as a memory side cache. In other or combined embodiments, some portion of near memory 113 may also act as a last level CPU cache.
Here, as can be appreciated from the XXX component of the address range associated with each of the aforementioned segments, each segment contains multiple separately addressable system memory storage regions. That is, the critical bits that define which segment is being addressed (e.g., 000 for segment 221, 001 for segment 222, etc.) are higher order address bits that allow for lower ordered address bits (XXX) to uniquely identify/address any one of multiple separately accessible data units (e.g., cache lines) that are kept within the segment. For example, if XXX corresponds to any three bit pattern, then, e.g., eight different cache lines may be separately stored in each segment. Note that three lower ordered bits XXX as depicted in
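For illustration only, the following sketch assumes 64-byte cache lines, a three bit XXX field that selects one of eight cache lines within a segment, and three segment selecting bits immediately above the XXX field; the particular widths and bit positions are assumptions of the sketch rather than a description of any particular embodiment.

    #include <stdint.h>

    /* Illustrative address split; all widths are assumptions. */
    #define LINE_OFFSET_BITS  6u   /* 64-byte cache line offset */
    #define XXX_BITS          3u   /* lower ordered bits: one of 8 cache lines per segment */
    #define SEGMENT_BITS      3u   /* higher order bits: which segment (000, 001, ...) */

    /* The XXX field: which cache line within the segment is being addressed. */
    static inline uint32_t line_in_segment(uint64_t addr)
    {
        return (uint32_t)((addr >> LINE_OFFSET_BITS) & ((1u << XXX_BITS) - 1));
    }

    /* The segment selecting bits (e.g., 000 for the first segment, 001 for the second). */
    static inline uint32_t segment_of(uint64_t addr)
    {
        return (uint32_t)((addr >> (LINE_OFFSET_BITS + XXX_BITS)) & ((1u << SEGMENT_BITS) - 1));
    }

    /* Example: for addr = (1 << 9) | (5 << 6), segment_of() returns 1 and
     * line_in_segment() returns 5, i.e., the sixth cache line of the second segment. */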
In various embodiments, each of segments 221 through 225 corresponds to an amount of system memory storage space used to store a page of system memory data. Here, as is known in the art, system software such as a virtual machine monitor (VMM), operating system instance or application software program organizes its data and/or instructions into pages of information that are separately moved between system memory and mass block/sector non volatile storage (such as a disk drive or solid state disk). Typically, when software believes it will need the data/instructions on a particular page, it will fetch the page from mass storage and store it into system memory. From there, the system software operates out of system memory when referencing the data/instructions stored on the page. If the software believes the page's data/instructions are no longer needed, the page may be moved back down to mass storage from system memory.
The particular set of segments 221 through 225 of
As mentioned above, the near memory segment 221 may act as a faster region of system memory, or, as a region to cache pages from any of segments 222 through 225. Here, as described in more detail further below, the system intelligently chooses between the two different uses of segment 221 based on system state. Additionally, as also will be described in more detail below, the system may swap the relationship between a logical address that identifies a specific page that is kept within one of the segments of the group 201 and the actual physical address of the segment where the page is stored.
Here, traditionally, software identifies a specific page by its virtual address. The virtual address, in turn, is then translated (e.g. in hardware with a translation look-aside buffer (TLB) and/or in software with a virtual machine monitor (VMM)) ultimately to a specific physical address in system memory where the page is located. The different logical addresses used to identify different pages in the segments of group 201 of
Here, referring to
By contrast, in other embodiments, the logical address that identifies a particular page may correspond to a virtual address and the assignment of the particular page to a particular segment within the group 201 corresponds to some or all of the translation of a virtual address to a physical address. For example, if page A were to be stored in segment 223 (a possibility that is described in more detail further below), the process of assigning page A to a segment other than a segment having the same base address may be part of the overall translation performed by the system of virtual addresses into physical addresses.
As such, the processes and techniques described herein may be performed entirely in software, entirely in hardware, or some combination of the two. More specifically, referring briefly back to
In the state of
Recognizing the state of
Referring to
As observed in
Each of the fields 511 also contains the logical address of the specific page that is allocated to operate out of its corresponding segment. From entry 510 in
The table entry 510 also includes a dedicated memory addressing structure 512 that includes a one bit field for each segment in the segment group that indicates whether its corresponding segment is supporting a separate system memory address range that the system is currently operating out of. As indicated in
A counter 513 is also included in the table entry 510 whose value(s) indicate which of the pages in the segment group 501 are most active. That is, counter 513 keeps one or more values that indicate which one of the pages in segments 1 through 5 is receiving the most read/write hits. The counter 513 can be used, for instance, to place the page having the most hits into the near memory segment (segment 1). The counter 513 may keep a separate count value for each segment/page, or use competing counter structures to reduce the size of the tracked count values. In the case of a competing counter structure, the counter structure 513 keeps a first value that is a relative comparison of the number of hits between two pages (e.g., each hit to the first page increments the counter and each hit to the second page decrements the counter).
The counter structure 513 also includes information that counts hits for each competing page pair as a whole so that a final determination can ultimately be made as to which page received the most hits amongst all pages in the segment group. In the case of a segment group having five segments, there would exist two competing counters to account for four of the pages and a third competing counter that tallies a competing count between the fifth page and one of the other four pages. Two additional counters may also be included to determine relative hit counts between the three competing counters so that the page having the most hits can be identified.
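A simplified sketch of the two counting options described above follows: a separate saturating count per page, and a single competing (up/down) counter for one pair of pages. The counter widths and the zero-based page indexing are assumptions; the full arrangement of two pair counters, a third counter for the fifth page and two additional resolution counters would be built from the same up/down primitive.

    #include <stdint.h>

    #define PAGES_PER_GROUP 5   /* the pages kept in segments 1 through 5 */

    /* Option 1: a separate saturating hit counter for each page in the group. */
    struct per_page_counters {
        uint16_t hits[PAGES_PER_GROUP];
    };

    static void per_page_hit(struct per_page_counters *c, int page /* 0..4 */)
    {
        if (c->hits[page] < UINT16_MAX)
            c->hits[page]++;
    }

    /* Identify the page with the most hits, i.e., the candidate for the near memory segment. */
    static int per_page_hottest(const struct per_page_counters *c)
    {
        int best = 0;
        for (int p = 1; p < PAGES_PER_GROUP; p++)
            if (c->hits[p] > c->hits[best])
                best = p;
        return best;
    }

    /* Option 2: a competing counter that tracks only the relative number of hits
     * between two pages: hits to the first page increment, hits to the second
     * decrement, and the sign indicates which of the two pages is hotter. */
    struct competing_counter {
        int8_t diff;
    };

    static void competing_hit(struct competing_counter *c, int hit_was_first_page)
    {
        if (hit_was_first_page) { if (c->diff < INT8_MAX) c->diff++; }
        else                    { if (c->diff > INT8_MIN) c->diff--; }
    }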
The table entry 510 also includes a cache bit 514 that indicates whether near memory is being used as a cache for the content of another segment. As observed in FIG. 5a, segment 1 (the near memory segment) is not being used as a cache for another segment (i.e., the CH value is set to 0).
The table entry 510 also includes a dirty bit 515 which indicates whether a cached page in the near memory segment (segment 1) has been written to or not. As stated just above, in the system state of
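Gathering the fields discussed above, one possible layout of such a table entry is sketched below; the field widths, names and the five segment group size are assumptions of the sketch rather than the actual format of entry 510.

    #include <stdbool.h>
    #include <stdint.h>

    #define SEGMENTS_PER_GROUP 5   /* one near memory segment plus four far memory segments */

    /* Illustrative per segment group table entry; all widths and names are assumptions. */
    struct segment_group_entry {
        /* Fields 511: the logical address of the page allocated to each segment. */
        uint64_t page_logical_addr[SEGMENTS_PER_GROUP];

        /* Structure 512: bit i is set if segment i currently backs a separately
         * addressable system memory range that the system is operating out of. */
        uint8_t  addressable_bitmap;

        /* Counter 513: activity tracking used to decide which page should occupy
         * the near memory segment (a simple per page count is shown here). */
        uint16_t hit_count[SEGMENTS_PER_GROUP];

        /* Cache bit 514: the near memory segment is caching another segment's page. */
        bool     cache_mode;

        /* Dirty bit 515: the page cached in the near memory segment has been written to. */
        bool     dirty;
    };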
Referring to
Referring to
Referring to
Furthermore, the cache bit 514 is updated to reflect that the segment group 501 is now acting in a caching mode. In caching mode, the far memory version of the page that is in the near memory cache is not accessed. Rather, all read/write activity for the page is directed to its near memory version in segment 1. As such, the version of the page that is in far memory is not uniquely supporting a system memory address range that the system is operating out of. Accordingly, the section of the dedicated memory address structure 512 that corresponds to the dormant version of page D that is presently kept in far memory segment 4 is changed to a value of 0.
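Continuing the illustrative table entry layout sketched earlier, the routine below shows, under the same assumptions (segment index 0 is the near memory segment, and only the fields the routine touches are repeated), the bookkeeping performed when the group enters caching mode.

    #include <stdbool.h>
    #include <stdint.h>

    /* Only the fields touched by the mode change are repeated from the earlier sketch. */
    struct segment_group_entry {
        uint64_t page_logical_addr[5];   /* fields 511 */
        uint8_t  addressable_bitmap;     /* structure 512 */
        bool     cache_mode;             /* cache bit 514 */
        bool     dirty;                  /* dirty bit 515 */
    };

    /* Switch the group into caching mode: the page held in far memory segment
     * 'src_segment' is now cached in, and serviced from, the near memory segment. */
    void enter_caching_mode(struct segment_group_entry *e, int src_segment)
    {
        e->page_logical_addr[0] = e->page_logical_addr[src_segment];
        e->cache_mode = true;
        e->dirty = false;                /* the freshly cached copy starts out clean */

        /* The dormant far memory copy no longer uniquely backs an addressable
         * range, so its bit in structure 512 is cleared; the near memory segment's
         * bit is set here on the assumption that it now backs the range being used. */
        e->addressable_bitmap &= (uint8_t)~(1u << src_segment);
        e->addressable_bitmap |= (uint8_t)(1u << 0);
    }

Reverting out of caching mode would reverse these updates and, if the dirty bit is set, would presumably first require writing the cached page back to its far memory segment.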
When in caching mode, as reflected by the assertion of field 514, the system will know to resolve a conflict of two identical logical addresses in the segment fields 511 in favor of the entry in near memory segment 1. That is, from the state of
Page B may be automatically written into near memory as depicted in
Although embodiments discussed above were directed to embodiments where near memory was assumed to have smaller access times than far memory, in other embodiments, near memory may possess other characteristics, on top of or in lieu of smaller access times, that make it more attractive for handling, e.g., higher priority or more active pages, such as having a higher bandwidth and/or consuming less power than far memory. Also, although embodiments described above were directed to a 1:N mapping of near memory pages to far memory pages in a particular page group, other page group embodiments may be designed to include 2:N, 3:N, etc. near memory to far memory page mappings.
An applications processor or multi-core processor 750 may include one or more general purpose processing cores 715 within its CPU 701, one or more graphical processing units 716, a memory management function 717 (e.g., a memory controller) and an I/O control function 718. The general purpose processing cores 715 typically execute the operating system and application software of the computing system. The graphics processing units 716 typically execute graphics intensive functions to, e.g., generate graphics information that is presented on the display 703. The memory control function 717 interfaces with the system memory 702. The system memory 702 may be a multi-level system memory such as the multi-level system memory discussed at length above. The host side processing cores 715 and/or memory controller 717 may be designed to switch near memory resources of the multi-level system memory between acting as a cache for far memory and acting as separately addressable system memory address space depending on system state as discussed at length above.
Each of the touchscreen display 703, the communication interfaces 704-707, the GPS interface 708, the sensors 709, the camera 710, and the speaker/microphone codec 713, 714 all can be viewed as various forms of I/O (input and/or output) relative to the overall computing system including, where appropriate, an integrated peripheral device as well (e.g., the camera 710). Depending on implementation, various ones of these I/O components may be integrated on the applications processor/multi-core processor 750 or may be located off the die or outside the package of the applications processor/multi-core processor 750.
Embodiments of the invention may include various processes as set forth above. The processes may be embodied in machine-executable instructions. The instructions can be used to cause a general-purpose or special-purpose processor to perform certain processes. Alternatively, these processes may be performed by specific hardware components that contain hardwired logic for performing the processes, or by any combination of software or instruction programmed computer components or custom hardware components, such as application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), or field programmable gate array (FPGA).
Elements of the present invention may also be provided as a machine-readable medium for storing the machine-executable instructions. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs, and magneto-optical disks, FLASH memory, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, propagation media or other type of media/machine-readable medium suitable for storing electronic instructions. For example, the present invention may be downloaded as a computer program which may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.