Computing system designers are continually seeking ways to improve the performance of the computing systems they design. An area of increasing attention is memory performance. Here, even if processor performance continues to increase as a consequence of manufacturing improvements (e.g., reduced minimum feature size) and/or architectural improvements, the computing system as a whole will not reach its computational potential if the performance of the memory used by the processor is unable to keep pace with the computational logic.
A better understanding of the present invention can be obtained from the following detailed description in conjunction with the drawings.
As each CPU core calls for data and/or instructions, it first looks through a hierarchy of CPU caches. The last CPU cache in the hierarchy is the last level cache (LLC) 104. If the sought-for data/instruction is not found in the LLC 104, a request is made to the main memory controller 105 for the data/instruction.
As can be seen, the memory controller is coupled to external main memory by way of multiple double data rate (DDR) memory channels 106 such as an industry standard DDR memory channel (e.g., a DDR standard promulgated by the Joint Electron Device Engineering Council (JEDEC) (e.g., DDR4, DDR5, etc.)). Each channel is coupled to one or more memory modules 107 (e.g., a dual in-line memory module (DIMM) having dynamic random access memory (DRAM) memory chips). The address of the sought-for data/instruction is resolved to a particular memory channel and module that is plugged into that memory channel. The desired information, in the case of a read, is then obtained from the module over the channel and provided to the CPU core that requested it.
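For illustration only, the following minimal C sketch shows one way such an address resolution could be modeled in software; the bit field positions, widths and function names are assumptions made for the sketch, not details of any JEDEC standard or of any particular memory controller.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical decode: the bit positions below are illustrative
 * assumptions, not taken from any standard or real controller. */
#define CHANNEL_SHIFT 6   /* interleave channels at cache line granularity */
#define CHANNEL_BITS  2   /* four DDR channels */
#define MODULE_SHIFT  8
#define MODULE_BITS   1   /* two modules (DIMMs) per channel */

static unsigned decode_channel(uint64_t addr) {
    return (unsigned)((addr >> CHANNEL_SHIFT) & ((1u << CHANNEL_BITS) - 1));
}

static unsigned decode_module(uint64_t addr) {
    return (unsigned)((addr >> MODULE_SHIFT) & ((1u << MODULE_BITS) - 1));
}

int main(void) {
    uint64_t addr = 0x12345678;
    printf("addr 0x%llx -> channel %u, module %u\n",
           (unsigned long long)addr, decode_channel(addr), decode_module(addr));
    return 0;
}
```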
That is, whereas traditional non-volatile memory (e.g., flash memory) has been relegated to non-volatile mass storage because it is only capable of accesses and/or erasures at larger granularities (e.g., page, block, sector) and therefore cannot operate as byte addressable main memory, newer emerging NVRAM technologies are capable of being accessed at byte level granularity (and/or cache line granularity) and therefore can operate as main memory.
Emerging NVRAM memory technologies are often composed of three dimensional arrays of storage cells that are formed above a semiconductor chip's substrate amongst/within the chip's interconnect wiring. Such cells are commonly resistive and store a particular logic value by imposing a particular resistance through the cell (e.g., a first resistance corresponds to a first stored logical value and a second resistance corresponds to a second logical value). Examples of such memory include, among possible others, Optane™ memory from Intel Corporation, 3D XPoint™ memory from Micron Corporation, QuantX™ memory from Micron Corporation, phase change memory, resistive random access memory, dielectric random access memory, ferroelectric random access memory (FeRAM), magnetic random access memory, and spin transfer torque random access memory (STT-RAM).
The use of emerging NVRAM memory in a main memory role can offer advantages for the overall computing system (such as the elimination of internal traffic congestion and power consumption concerning “write-backs” or “commitments” of main memory content back to mass storage). However, such emerging NVRAM memory nevertheless tends to be slower than dynamic random access memory (DRAM), which has been the traditional technology used for main memory.
In order to compensate for the increased main memory access latencies that would be observed if main memory were entirely implemented with emerging NVRAM memory, as observed in FIG. 2, a memory side cache (MSC) composed of DRAM memory 208 can be placed on each memory channel 206 to cache the content of the channel's slower NVRAM memory 209.
According to one approach, on each channel, only the capacity of the NVRAM memory 209 is viewed as the system memory address space of that channel. By contrast, the capacity of the DRAM memory side cache 208 on the channel is largely not reserved as system memory address space but, rather, serves as a store for the data/instructions in the NVRAM memory space 209 of the MSC's channel that are most frequently accessed (alternate implementations allow the MSC of one channel to cache data/instructions of another channel). By keeping the most frequently accessed NVRAM items in the faster DRAM memory side cache 208, continued use of such items can be serviced from the DRAM memory side cache 208 rather than from the slower NVRAM memory 209.
Here, a memory side cache 208 is different from a CPU last level cache in that a memory side cache 208 attempts to store the items that are most frequently accessed in main memory as a whole, rather than, as with the CPU last level cache, the items that are most frequently accessed by a particular component or type of component (the CPU cores). That is, the memory side cache will cache the items most in demand in main memory, which can be requested by any component in the system that uses main memory. Thus, if a GPU or a networking interface or both are generating large amounts of main memory requests, the memory side cache will be apt to keep the items associated with these components as well as those of the CPU cores.
As observed in the improved approach of FIG. 3, rather than implementing the memory side cache with an external DRAM memory module, the memory side cache 310 is implemented with embedded DRAM (eDRAM) 312 that is integrated within the same semiconductor chip package as the SOC.
In various embodiments, the memory side cache 310 of either of FIG. 3 or FIG. 4 includes an interface 311 that communicates with the SOC CPU memory controller and supports out-of-order transactions (e.g., an NVDIMM-P interface).
In the approach of FIG. 3, the memory side cache 310 is implemented with eDRAM that is integrated within the SOC itself, whereas, in the approach of FIG. 4, the memory side cache is implemented with one or more separate eDRAM die that reside within the same package as the SOC.
Notably, in various embodiments, as depicted in FIGS. 3 and 4, the memory side cache 310 resides within the SOC package and therefore can be accessed through fewer external physical connections, and at correspondingly higher signaling speeds, than an external memory module.
Here, generalizing, the highest frequency that can be propagated along a signal path will be reduced with each external physical connection that exists along that signal path. In the case of an external memory module 208/308 that is coupled to a memory channel 206/306 that emanates from a SOC package, there are four external physical connections: 1) the physical connection from the packaged die I/Os to the package substrate; 2) the physical connection from the package I/Os to the memory channel; 3) the physical connection from the memory channel to the memory module I/Os; and, 4) the physical connection from the memory module substrate to the I/Os of the targeted memory chip.
By contrast, with the memory side cache 310 being implemented with eDRAM 312 within the SOC package, at most there are only two external physical connections. In the case where the eDRAM memory side cache 310 is implemented internally within the SOC (as in FIG. 3), there are no external physical connections along the signal path at all, whereas, in the case where the eDRAM memory side cache is implemented with its own die within the SOC package (as in FIG. 4), there are only the two physical connections between the respective die and the package substrate.
As such, the improved memory side cache 310 can respond to requests in less time than an external memory side cache 208/308. As is understood in the art, eDRAM 312 integrates DRAM on a high density logic die (as compared to a traditional DRAM memory die, which has limited logic integration capability).
As discussed above, the memory side cache 310 includes an interface 311 that communicates with the SOC CPU memory controller and supports out-of-order transactions. That is, for instance, if the interface that emanates from the SOC CPU memory controller is an NVDIMM-P interface, the memory side cache component 310 likewise includes an NVDIMM-P interface 311.
As the memory controller services the requests it receives, it issues memory access requests over the interface 311 to the memory side cache 310. The internal cache hit/miss logic 313 of the memory side cache 310 then snoops the eDRAM cache 312 for the requested data item. If there is a hit, in the case of a read, the data is fetched from the eDRAM cache 312 and returned to the memory controller. In the case of a write, the targeted data item in the eDRAM cache 312 is overwritten with the new information that was included with the write request.
In the case of a cache miss, the internal logic of the memory side cache 310 invokes a “back-end” interface 314 that couples to one or more external memory channels and the corresponding memory modules (e.g., DIMMs) that plug into these memory channels. Here, the back-end interfaces 314 may correspond to industry standard DDR memory channels (e.g., JEDEC DDR4, JEDEC DDR5, etc.). As such, from the perspective of the external memory modules, the CPU package “appears” as a traditional CPU package (the memory modules couple to industry standard memory channels that emanate from the CPU package).
Thus, in the case of a cache miss, assuming there is no additional (second level) memory side cache memory module 208/308 as discussed above with respect to FIG. 2, the request is directed over the appropriate memory channel 306 to the NVRAM memory module 309 whose memory space the request's address maps to.
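The hit/miss flow just described (snoop the eDRAM, service hits locally, forward misses over the back end and fill the cache) can be illustrated with a minimal, self-contained C sketch. The direct-mapped organization, the sizes and all names here are illustrative assumptions rather than details of any actual implementation; a tiny array stands in for the NVRAM modules 309, and write handling and eviction are omitted (see the sketches further below).

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINE_BYTES 64
#define NUM_LINES  16          /* tiny illustrative eDRAM cache */
#define MEM_LINES  256         /* tiny simulated NVRAM backing store */

struct line { bool valid; uint32_t tag; uint8_t data[LINE_BYTES]; };

static struct line edram[NUM_LINES];                /* stands in for 312 */
static uint8_t nvram[MEM_LINES][LINE_BYTES];        /* stands in for 309 */

static void msc_read(uint32_t line_addr, uint8_t *buf) {
    struct line *l = &edram[line_addr % NUM_LINES];
    if (l->valid && l->tag == line_addr) {          /* hit: serve from eDRAM */
        memcpy(buf, l->data, LINE_BYTES);
        return;
    }
    memcpy(buf, nvram[line_addr % MEM_LINES], LINE_BYTES);  /* miss: back end */
    l->valid = true;                                /* fill the cache on miss */
    l->tag = line_addr;
    memcpy(l->data, buf, LINE_BYTES);
}

int main(void) {
    uint8_t buf[LINE_BYTES];
    nvram[5][0] = 42;
    msc_read(5, buf);    /* miss: fetched over the back end, then cached */
    msc_read(5, buf);    /* hit: serviced directly from the eDRAM */
    printf("%d\n", buf[0]);
    return 0;
}
```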
Extended embodiments can also include additional (2nd level) memory side cache functionality. For example, according to one approach, each NVRAM memory module 309 also includes an on-board DRAM cache to cache the most frequently requested items of the particular memory module.
According to another approach, which can be combined with the approach described just above, a 2nd level DRAM memory side cache module 308, like the memory side cache module 208 discussed above with respect to FIG. 2, plugs into a memory channel 306 and caches items on behalf of the NVRAM memory modules 309 that plug into the same (and/or other) memory channels.
In this case, the logic to perform the cache lookup into the 2nd level memory side cache module 308 can be located on the 2nd level memory side cache module 308 itself, or can be located on the back-end of the memory side cache function 310 that is integrated in the SOC package. If there is a cache hit, in the case of a read, the desired data is read from the 2nd level memory side cache module 308, provided to the memory side cache function 310 and forwarded to the SOC. In the case of a write, the new data that is included in the request is written over the targeted data item in the 2nd level memory side cache module 308. Note that the memory side cache function 310 can also include embedded logic to keep both the request transaction on the SOC interface 311 and the request transaction on the memory channel 306 active and/or otherwise operable according to their respective protocols.
According to an embodiment where the hit/miss cache logic for determining hits/misses in the 2nd level memory side cache module 308 resides in the memory side cache function 310, the memory side cache function first performs a read of the 2nd level memory side cache module 308 to determine whether there is a hit or a miss. Here, for instance, the address of any request maps to only one (or a limited plurality) of “slots” in the memory space of the 2nd level memory side cache module 308. A tag that is, e.g., some segment of the address of a data item is kept with the data item in the 2nd level memory side cache 308 and is read from the 2nd level cache 308 along with the data item itself. From the tag that is returned with the data item, the memory side cache function 310 can determine whether a hit or miss has occurred. In the case of a hit, the request is serviced with the data item that was read from the 2nd level memory module 308. In the case of a miss, the request is directed over the memory channel 306 to the appropriate NVRAM memory module 309.
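A minimal sketch of the tag check described above follows, assuming a direct mapped organization in which the low address bits select a slot and the remaining bits form the tag that is stored, and compared, alongside the data item. The slot count and field widths are invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define SLOT_BITS 10                      /* hypothetical: 1024 slots */
#define NUM_SLOTS (1u << SLOT_BITS)

struct slot { uint64_t tag; uint64_t data; };

/* One read returns the stored data together with its tag (as the text
 * notes, the tag is kept with the data item). */
static struct slot read_slot(const struct slot *cache, uint64_t addr) {
    return cache[addr & (NUM_SLOTS - 1)]; /* low bits select the slot */
}

/* The remaining (upper) address bits form the tag to compare against. */
static bool is_hit(struct slot s, uint64_t addr) {
    return s.tag == (addr >> SLOT_BITS);
}

int main(void) {
    static struct slot cache[NUM_SLOTS];
    uint64_t addr = 0x12345;
    cache[addr & (NUM_SLOTS - 1)] = (struct slot){ addr >> SLOT_BITS, 99 };
    struct slot s = read_slot(cache, addr);
    return is_hit(s, addr) ? 0 : 1;       /* hit: serviced from the 2nd level */
}
```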
Returning to a discussion of the (first level) memory side cache function 310, note that such a cache may be implemented according to various architectures such as direct mapped, set-associative or fully associative. Here, because the eDRAM can be integrated on a high density logic process, set-associative or fully associative caching architectures are feasible. This stands in contrast, e.g., to external DRAM memory side cache module solutions 208/308 that do not include cache hit/miss logic (e.g., to keep power consumption of the module within defined limits). Such solutions have been known to implement a direct mapped cache in order to limit access to the external DRAM cache 308 to one access per request.
As such, the hit/miss logic of the eDRAM 312 memory side cache function 310 may include tag array(s) and/or other logic that supports associative or set-associative caches to track which data items are stored in which cache slots. Additionally, the eDRAM, through banking or other schemes, can be designed to have sufficient bandwidth to support a read-before-write scheme for either or both of a tag array and a data array. An extended discussion of a possible memory address to cache slot mapping approach is provided in more detail below with respect to FIG. 5.
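As a rough illustration of a tag array supporting a set-associative organization, the C sketch below checks every way of a set for a matching tag; the 4-way, 256-set geometry and all names are assumptions made for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4
#define SETS 256

struct tag_entry { bool valid; uint64_t tag; };
static struct tag_entry tag_array[SETS][WAYS];

/* Returns the way that holds the line, or -1 on a miss. */
static int lookup(uint64_t line_addr) {
    unsigned set = (unsigned)(line_addr % SETS);  /* set index from address */
    uint64_t tag = line_addr / SETS;              /* remaining bits are tag */
    for (int way = 0; way < WAYS; way++)
        if (tag_array[set][way].valid && tag_array[set][way].tag == tag)
            return way;
    return -1;
}

int main(void) {
    uint64_t line_addr = 0x1234;
    tag_array[line_addr % SETS][2] =
        (struct tag_entry){ .valid = true, .tag = line_addr / SETS };
    return lookup(line_addr) == 2 ? 0 : 1;
}
```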
In various embodiments, interface 311 is an out-of-order interface because the presence of the memory side cache 310 can result in later requests that hit in the eDRAM cache 312 completing before earlier requests that missed in the memory side cache 310. This possibility can generally exist even in implementations that do not include non-volatile memory modules coupled to an external memory channel 306. That is, even if all the memory modules that are coupled to the external memory channels 306 are DRAM memory modules, the comparatively faster eDRAM memory side cache 310 could result in out-of-order request completion.
If a 2nd level memory side cache exists, either as a stand-alone DRAM memory module 308 that acts as a cache for NVRAM modules that plug into the same (and/or other) memory channels, or as a DRAM cache that resides on an NVRAM module to store more frequently accessed items on a per module basis, the possibility of out-of-order request completion can also exist on the memory channels 306. In this case, the back-end interface 314 should also support out-of-order processing (such as JEDEC's NVDIMM-P version of DDR).
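The essence of such out-of-order operation is that each request carries an identifier that its completion echoes back, so completions can be matched to requests regardless of arrival order. The following C sketch illustrates only the idea; the structures shown are assumptions and not the NVDIMM-P wire format.

```c
#include <stdint.h>

struct request    { uint16_t id; uint64_t addr; };
struct completion { uint16_t id; uint8_t  data[64]; };

/* The requester matches a completion to its request by ID rather than by
 * arrival order, so a later hit may legitimately complete first. */
static int match_completion(struct completion c,
                            const struct request *pending, int n) {
    for (int i = 0; i < n; i++)
        if (pending[i].id == c.id)
            return i;             /* completion answers pending[i] */
    return -1;
}

int main(void) {
    struct request pending[2] = { {7, 0x1000}, {8, 0x2000} };
    struct completion c = { .id = 8 };   /* the later request completed first */
    return match_completion(c, pending, 2) == 1 ? 0 : 1;
}
```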
With respect to the replacement policy of the eDRAM memory side cache function 310, various embodiments are possible. According to one approach, a miss in the eDRAM cache 312 for a particular data item results in that data item being entered into the eDRAM cache 312 after it has been called up from its external memory module. Generally, after extended runtimes and heavy memory usage, entry of such a data item into the eDRAM cache 312 will result in the eviction of another data item from the eDRAM cache 312 back to an external memory module in order to make room for the new entry. Here, various eviction policies can be used, such as least frequently used (LFU), least recently used (LRU), etc.
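By way of example, a least recently used policy for one set of the cache could be modeled as below, with a timestamp per way and the oldest way chosen as the eviction victim; the geometry and names are illustrative assumptions.

```c
#include <stdint.h>

#define WAYS 4

static uint64_t last_use[WAYS];  /* timestamp of each way's latest access */
static uint64_t now;

static void touch(int way) { last_use[way] = ++now; }

/* Evict the least recently used way of the set. */
static int pick_victim(void) {
    int victim = 0;
    for (int way = 1; way < WAYS; way++)
        if (last_use[way] < last_use[victim])
            victim = way;
    return victim;
}

int main(void) {
    touch(0); touch(1); touch(2); touch(3);
    touch(0);                        /* way 1 is now the least recently used */
    return pick_victim() == 1 ? 0 : 1;
}
```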
Also, the memory side cache function 310 may support various types of write modes. A first mode, referred to as “write-through”, writes a copy of a data item that has been newly written/updated in the eDRAM cache 312 back to its corresponding location in an external memory module. According to this approach, the most recent version of a data item will not only be in the eDRAM cache 312 but will also be in an external memory module. Another mode, referred to as “write-back”, does not write data that has been newly written/updated in the eDRAM cache 312 back to a memory module. Instead, the hit/miss logic 313 of the memory side cache function 310 keeps track of which of its data items are dirty and which are clean. If a data item is never written to after it is first entered into the eDRAM cache 312, it is clean and need not be written back to its external memory module if it is subsequently evicted from the eDRAM cache 312. By contrast, if data is updated with new data after it is first written into the eDRAM cache 312, the data is marked as dirty and will be written back to its corresponding memory module if it is subsequently evicted from the eDRAM cache 312.
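The two write modes can be sketched in C as follows, with a stub standing in for a write over the back-end interface 314; all names and sizes are assumptions. In write-through mode every write is propagated immediately, while in write-back mode only a dirty bit is set and the write-back is deferred until eviction.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct line { bool valid, dirty; uint32_t tag; uint8_t data[64]; };

enum write_mode { WRITE_THROUGH, WRITE_BACK };

/* Stand-in for a write over the back-end DDR channel (interface 314). */
static void backend_write(uint32_t line_addr, const uint8_t *buf) {
    printf("back-end write of line %u (first byte %d)\n", line_addr, buf[0]);
}

static void msc_write_hit(struct line *l, uint32_t line_addr,
                          const uint8_t *buf, enum write_mode mode) {
    memcpy(l->data, buf, sizeof l->data);
    if (mode == WRITE_THROUGH)
        backend_write(line_addr, buf);  /* copy propagates immediately */
    else
        l->dirty = true;                /* deferred until eviction */
}

static void msc_evict(struct line *l, uint32_t line_addr) {
    if (l->valid && l->dirty)           /* clean lines need no write-back */
        backend_write(line_addr, l->data);
    l->valid = l->dirty = false;
}

int main(void) {
    struct line l = { .valid = true, .tag = 9 };
    uint8_t buf[64] = { 42 };
    msc_write_hit(&l, 9, buf, WRITE_BACK);  /* no back-end traffic yet */
    msc_evict(&l, 9);                       /* dirty: written back now */
    return 0;
}
```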
In various embodiments, the memory side cache function includes register space to allow configurability of various modes of operation for the eDRAM cache. For example, the register space may specify which caching policy is to be applied (e.g., LFU, LRU, etc.) and/or which write mode is to be applied (e.g., write-through, write-back, etc.).
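A hypothetical encoding of such register space is sketched below; the bit positions and field names are invented for illustration and do not correspond to any actual register layout.

```c
#include <stdint.h>

/* Invented register layout for illustration only. */
enum evict_policy { EVICT_LRU = 0, EVICT_LFU = 1 };
enum write_policy { WP_WRITE_BACK = 0, WP_WRITE_THROUGH = 1 };

#define CFG_EVICT_SHIFT 0   /* bit 0: eviction policy */
#define CFG_WMODE_SHIFT 1   /* bit 1: write mode      */

static uint32_t cfg_reg;    /* stands in for a memory-mapped config register */

static void msc_configure(enum evict_policy p, enum write_policy w) {
    cfg_reg = ((uint32_t)p << CFG_EVICT_SHIFT) |
              ((uint32_t)w << CFG_WMODE_SHIFT);
}

int main(void) {
    msc_configure(EVICT_LRU, WP_WRITE_THROUGH);   /* cfg_reg == 0x2 */
    return 0;
}
```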
The interface 416 between the back-end logic chip 415 and the memory side cache 410 can be any high speed communication link having sufficiently high throughput (e.g., Direct Media Interface (DMI), PCIe, etc.).
Note that even in the approach of FIG. 4, in which the memory side cache 410 is implemented with its own die, the signal path to the memory side cache 410 traverses fewer external physical connections than the signal path to an external memory module 208/308, such that the memory side cache 410 can still respond to requests in less time than an external memory side cache.
In various embodiments, where intra-package chip to chip communication exists (e.g., interface 311 and/or interface 416 in FIG. 4), such communication can be implemented with any suitable high speed chip to chip interconnect technology.
Additionally, although embodiments described above have stressed the presence of the first level memory side cache within the same package as a SOC, in yet other embodiments, a first level memory side cache resides outside any SOC package but is implemented on a same CPU module (or “socket”) as one or more SOCs. Here, for instance, one or more packaged SOCs may be integrated onto a module that plugs into, e.g., a larger system motherboard. Memory DIMMs, including potentially a second level DRAM memory side cache DIMM and one or more NVRAM DIMMs, are plugged into memory channels that reside on the larger system motherboard. The first level memory side cache, by contrast, resides on the module with the packaged SOC(s). Because communications to/from the first level memory side cache do not propagate through the module/motherboard interconnects, the first level memory side cache should exhibit faster access times than any DIMMs that are plugged into the motherboard.
Note that although embodiments above have stressed a package having two SOCs per package, other embodiments may have more than two SOCs per package, or, have only one SOC per package. Moreover, although embodiments above have stressed implementation of the teachings above toward a main memory solution, other embodiments may be implemented elsewhere, such as the local memory for a high performance co-processor (e.g., an Artificial Intelligence co-processor, a vector co-processor, an image processor, a graphics processor, etc.).
From these page size configurations, system memory 517 is organized into page groups 518_1, 518_2, 518_3, etc., each composed of, e.g., 256 contiguous memory pages 519 (e.g., 1M per page group for 4 KB pages, or 2M per page group for 8 KB pages). The first level cache 512, in turn, is organized into a plurality of cache pages 515.
According to an embodiment, each page 515 in the cache 512 is assigned to a particular one or more of the page groups (page groups assigned to a same cache page can be referred to as “siblings”). Memory pages 519 in a same page group 518 compete for the page(s) in cache 512 that have been assigned to the page group 518. If these same page(s) in the cache 512 have been assigned to additional page groups in memory, the pool of pages in memory that will compete for these cache pages expands (ideally, the most frequently accessed pages in memory will most frequently occupy the pages in cache). For example, if cache page 515 has been assigned to page groups 518_1, 518_2 and 518_3, all the pages in “sibling” page groups 518_1, 518_2 and 518_3 compete for cache page 515.
Caching Quality of Service (QoS) is effected for different pages 519 in memory 517 by adjusting how many pages 515 in the first level cache 512 serve the page groups 518 they belong to, and how many other page groups share those cache pages. That is, for example, a page group 518 whose pages 519 are to receive a relatively high QoS has fewer page group “siblings” that it competes with for the same page(s) in the cache 512. Likewise, a page group whose pages are to receive a relatively low QoS has more page group siblings that it competes with for the same page(s) in the cache.
Said another way, a page in cache 512 that is to service higher QoS pages in memory 517 is assigned fewer page groups 518, while a page in cache 512 that is to service lower QoS pages in memory is assigned a greater number of page groups 518. For example, a highest QoS level may assign one or more pages in cache 512 to only one particular page group (e.g., page group 518_1), while a lowest QoS level may assign another page in cache 512 to many (e.g., tens or hundreds of) other page groups in the memory 517. Here, by assigning specific pages in cache 512 to a specific number of page groups 518, the amount of competition amongst pages in memory 517 for same page(s) in cache 512 can be precisely configured, thereby establishing with some precision the relative caching QoS amongst all the pages in memory 517.
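The QoS mechanism can be made concrete with a small C sketch in which each page group is statically assigned a cache page; the fewer sibling groups that share a cache page, the less competition its pages face. The group counts and the particular assignment below are illustrative assumptions.

```c
#include <stdio.h>

#define NUM_GROUPS      8
#define NUM_CACHE_PAGES 4   /* pages 515 available in the cache */

/* Hypothetical assignment: group 0 gets cache page 0 to itself (high QoS);
 * groups 4..7 all share cache page 3 (low QoS). */
static const int cache_page_of_group[NUM_GROUPS] = {0, 1, 1, 2, 3, 3, 3, 3};

/* Number of page groups competing for the same cache page, incl. itself. */
static int siblings(int group) {
    int n = 0;
    for (int g = 0; g < NUM_GROUPS; g++)
        if (cache_page_of_group[g] == cache_page_of_group[group])
            n++;
    return n;
}

int main(void) {
    printf("group 0 competes with %d group(s)\n", siblings(0));  /* 1: high QoS */
    printf("group 4 competes with %d group(s)\n", siblings(4));  /* 4: low QoS  */
    return 0;
}
```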
In various embodiments, the logical and/or physical addresses of pages for inclusion in a same page group are determined by applying some function to a specific set of address bits. In a simplest case, pages in a same page group have the same bit pattern in a specific section of the address space. In other embodiments, some function may be applied to a same section of address space to determine which page group a page belongs to. Here, referring briefly back to FIG. 3, such mapping logic can be implemented, e.g., as part of the hit/miss logic circuitry 313 of the memory side cache function 310.
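In the simplest case just described, the page group determination reduces to extracting a fixed field of address bits, e.g., as in the following sketch (the bit positions and widths are assumptions):

```c
#include <stdint.h>

#define GROUP_SHIFT 20   /* assumed: 1M larger pages */
#define GROUP_BITS  8    /* assumed: 256 page groups */

/* Pages whose addresses share this bit field belong to the same page group. */
static unsigned page_group_of(uint64_t phys_addr) {
    return (unsigned)((phys_addr >> GROUP_SHIFT) & ((1u << GROUP_BITS) - 1));
}

int main(void) {
    /* two addresses inside the same 1M region map to the same page group */
    return page_group_of(0x12345678) == page_group_of(0x12300000) ? 0 : 1;
}
```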
An operating system (OS), operating system instance and/or virtual machine monitor (VMM) or hypervisor can readily configure the eDRAM space of the cache 312 for particular page sizes in the cache and configure the system memory for particular page sizes, number of pages per page group and number of page groups in the memory. As such, any of an OS, OS instance and/or VMM/hypervisor can readily configure, e.g., different applications, or different kinds of data within an application or amongst applications, to varying degrees of first level caching QoS as described above.
The aforementioned mapping logic, in various embodiments, can include configuration register space to establish the page size in the cache 312 while the “back end” logic circuitry 314 or associated logic circuitry (including but not limited to the SOC memory controller) can include configuration register space to establish any/all of page size in memory, number of pages per page group and number of page groups in system memory.
Apart from configuring page size in the cache 512, the mapping logic circuitry of the hit/miss logic circuitry and/or associated logic circuitry, in various embodiments, can also establish “ways” of pages in the first level cache. That is, groups of pages in the first level cache 512, rather than single pages, are assigned to the same page group(s) in memory.
Note that in various embodiments the aforementioned memory pages of FIG. 5 are larger pages (e.g., 1M per page) that each contain a plurality of the smaller pages (e.g., 4 KB pages) that an OS or OS instance traditionally operates with.
By so doing, the OS or OS instance is free to refer to such smaller pages as per normal/traditional runtime operation, including, for example, demoting certain smaller pages from system memory to mass storage and promoting certain smaller pages from mass storage to system memory. Generally, system hardware (e.g., memory management unit (MMU) and/or translation look-aside buffer (TLB) logic circuitry) can be designed to provide physical addresses to smaller pages of a same application (e.g., same software thread and virtual address range) so that they will map into a same larger memory page used for memory side cache QoS treatment as described above with respect to FIG. 5.
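The underlying address arithmetic can be sketched as follows: all smaller pages whose physical addresses share the same upper bits fall within the same larger page and therefore receive the same memory side cache QoS treatment. The page sizes below are the example values, not requirements.

```c
#include <stdint.h>
#include <stdio.h>

#define SMALL_PAGE_SHIFT 12   /* 4 KB OS page (example value from the text) */
#define LARGE_PAGE_SHIFT 20   /* 1M QoS page (assumed example value) */

static uint64_t small_page_of(uint64_t phys) { return phys >> SMALL_PAGE_SHIFT; }
static uint64_t large_page_of(uint64_t phys) { return phys >> LARGE_PAGE_SHIFT; }

int main(void) {
    uint64_t a = 0x00100000, b = 0x001FF000;  /* two different 4 KB pages */
    printf("small pages %llu and %llu, same large page: %d\n",
           (unsigned long long)small_page_of(a),
           (unsigned long long)small_page_of(b),
           large_page_of(a) == large_page_of(b));   /* prints 1: same QoS */
    return 0;
}
```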
Comparing the improved QoS approach of FIG. 5 with a traditional memory side cache, note that assigning specific cache pages to specific, configurable numbers of page groups allows different regions of memory to receive different degrees of caching service, whereas a traditional memory side cache applies the same caching treatment to all of memory.
An applications processor or multi-core processor 650 may include one or more general purpose processing cores 615 within its CPU 601, one or more graphics processing units 616, a memory management function 617 (e.g., a memory controller) and an I/O control function 618. The general purpose processing cores 615 typically execute the system and application software of the computing system. The graphics processing unit 616 typically executes graphics intensive functions to, e.g., generate graphics information that is presented on the display 603. The memory management function 617 interfaces with the system memory 602 to write/read data to/from system memory 602.
The system/main memory 602 can be implemented as a multi-level system memory having an “in-package” memory side cache such as the memory side cache 310 described at length above. The external memory of other components (e.g., one or more high performance co-processors) may also have an “in package” memory side cache as described at length above.
The touchscreen display 603, the communication interfaces 604-607, the GPS interface 608, the sensors 609, the camera(s) 610, and the speaker/microphone codec 613, 614 can each be viewed as various forms of I/O (input and/or output) relative to the overall computing system including, where appropriate, an integrated peripheral device as well (e.g., the one or more cameras 610). Depending on implementation, various ones of these I/O components may be integrated on the applications processor/multi-core processor 650 or may be located off the die or outside the package of the applications processor/multi-core processor 650. The power management control unit 612 generally controls the power consumption of the system 600.
Embodiments of the invention may include various processes as set forth above. The processes may be embodied in machine-executable instructions. The instructions can be used to cause a general-purpose or special-purpose processor to perform certain processes. Alternatively, these processes may be performed by specific/custom hardware components that contain hardwired logic circuitry or programmable logic circuitry (e.g., FPGA, PLD) for performing the processes, or by any combination of programmed computer components and custom hardware components.
Elements of the present invention may also be provided as a machine-readable medium for storing the machine-executable instructions. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs, and magneto-optical disks, FLASH memory, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, propagation media or other type of media/machine-readable medium suitable for storing electronic instructions. For example, the present invention may be downloaded as a computer program which may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.