Embodiments generally relate to memory structures.
Computing system memory architectures may be structured as various levels of host processor-side caches (e.g., level one/L1 cache, level two/L2 cache, last level cache/LLC) and a system memory that includes a memory-side cache (e.g., “near memory”) and additional memory (e.g., “far memory”) that is slower to access than the memory-side cache.
When a search for data in the near memory is unsuccessful (e.g., a memory-side cache miss occurs), the requested data may be retrieved from the far memory. Frequent misses in the near memory may reduce performance and increase power consumption due to the retrieval of data from the relatively slow far memory. While hit-miss prediction techniques may exist for relatively small host processor-side caches, such techniques may not be scalable to larger memory-side caches in terms of accuracy or predictor size/overhead.
The various advantages of the embodiments will become apparent to one skilled in the art by reading the following specification and appended claims, and by referencing the following drawings, in which:
Turning now to
Of particular note is that the illustrated prediction table 14 does not track tag hits in the near memory. The miss predictor 10 may prevent the prediction table 14 from tracking tag hits by, for example, bypassing any entry allocations and/or bit allocations in the prediction table 14 to tag hits. Accordingly, the size and/or overhead of the prediction table 14 may be significantly smaller than that of a conventional processor-side cache hit-miss predictor that tracks both hits and misses in the processor-side cache. The illustrated miss predictor 10 therefore enables greater scalability, which may be particularly useful in two level memory (2LM) architectures having a relatively large (e.g., 32 GB or greater) memory-side cache. As will be discussed in greater detail, the miss predictor 10 may also increase prediction accuracy.
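As a rough illustration of why omitting hit tracking keeps the table small, the following sketch models a miss-only entry as a full page address, one valid bit, and a few replacement policy bits, and then estimates total storage. The field widths, the entry count, and the names `MissEntry` and `table_bits` are assumptions chosen for illustration only; the description does not specify them.

```python
from dataclasses import dataclass


@dataclass
class MissEntry:
    """One hypothetical prediction table entry; note there is no hit-tracking field."""
    page_address: int = 0      # full page address of a page that missed
    valid: bool = False        # set on a tag miss, cleared on a tag hit
    replacement_bits: int = 0  # e.g., recency state for the replacement policy


def table_bits(num_entries: int, page_address_bits: int, replacement_bits: int) -> int:
    """Storage in bits: each entry holds the address, 1 valid bit, and policy bits."""
    return num_entries * (page_address_bits + 1 + replacement_bits)


# Assumed parameters: 1024 entries, 36-bit page addresses, 2 policy bits.
total = table_bits(num_entries=1024, page_address_bits=36, replacement_bits=2)
print(total, "bits =", total // 8, "bytes")
```

Under these assumed parameters the table occupies 39936 bits (4992 bytes); different widths or entry counts would scale the estimate linearly.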
For example, computer program code to carry out operations shown in the method 20 may be written in any combination of one or more programming languages, including an object oriented programming language such as JAVA, SMALLTALK, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. Additionally, logic instructions might include assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, state-setting data, configuration data for integrated circuitry, state information that personalizes electronic circuitry and/or other structural components that are native to hardware (e.g., host processor, central processing unit/CPU, microcontroller, etc.).
Illustrated processing block 22 provides for maintaining a prediction table that tracks missed page addresses with respect to a first memory. Block 22 may include, for example, updating one or more replacement policy bits (e.g., to indicate recent use) of a prediction table entry associated with a valid page address if an access request corresponds to the valid page address in the prediction table. A determination may be made at block 24 as to whether a request to access the near memory has been received. If not, the illustrated method 20 continues to maintain the prediction table by returning to block 22. If an access request has been received, illustrated block 26 determines whether the access request corresponds to (e.g., matches) any valid page address in the prediction table. If the access request does not correspond to any valid page address in the prediction table, block 28 may send the access request to the near memory only and the illustrated method 20 returns to block 22.
If, however, it is determined at block 26 that the access request corresponds to a valid page address in the prediction table, a near memory tag miss is predicted and illustrated block 30 sends the access request to the near memory and the far memory in parallel. The near memory may generally be associated with a shorter access time than the far memory. Thus, because the far memory access is already in flight, the latency impact of a tag miss in the near memory may be reduced if a tag miss does in fact occur in the near memory. Moreover, the illustrated block 26 improves accuracy by looking up full page addresses in the prediction table. By contrast, a conventional processor-side hit-miss predictor may use hash tables and/or partial tags, which may be subject to aliasing and, therefore, lower accuracy.
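The request routing of blocks 26, 28 and 30 can be sketched as follows. This is a minimal software model, not the hardware logic itself: the class `MissPredictor`, the method `route`, the dictionary keyed by full page address, and the use of a counter as a stand-in for the replacement policy bits are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entry:
    valid: bool = True
    last_used: int = 0  # stand-in for the replacement policy bits (assumed)


class MissPredictor:
    def __init__(self) -> None:
        self.table: Dict[int, Entry] = {}  # keyed by full page address
        self.clock = 0

    def route(self, page_address: int) -> List[str]:
        """Return the memories that should receive the access request."""
        self.clock += 1
        entry = self.table.get(page_address)
        if entry is not None and entry.valid:
            entry.last_used = self.clock  # block 22: mark the entry recently used
            return ["near", "far"]        # block 30: predicted miss, send in parallel
        return ["near"]                   # block 28: no prediction, near memory only


predictor = MissPredictor()
predictor.table[0x1234] = Entry()  # pretend this page missed earlier
print(predictor.route(0x1234))     # matches a valid entry
print(predictor.route(0x9999))     # no matching entry
```

Because the lookup key is the full page address rather than a hash or partial tag, two different pages cannot alias to the same entry in this model.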
Illustrated processing block 34 provides for detecting a tag miss in near memory, wherein a determination may be made at block 36 as to whether the tag miss corresponds to any valid page address in a prediction table such as, for example, the prediction table 14 (
Illustrated processing block 48 provides for detecting a tag hit in the near memory, wherein a determination may be made at block 50 as to whether the tag hit resulted from an access request that corresponds to any valid page address in the prediction table. If so, a miss was incorrectly predicted in the near memory and block 52 may clear a valid bit of the prediction table entry associated with the valid page address in response to the tag hit. Otherwise, the illustrated method 46 terminates.
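The tag hit handling of blocks 50 and 52 can be sketched as below. The function name `on_tag_hit` and the representation of the prediction table as a dictionary mapping a full page address to its valid bit are assumptions for illustration.

```python
from typing import Dict

# Hypothetical table: full page address -> valid bit.
PredictionTable = Dict[int, bool]


def on_tag_hit(table: PredictionTable, page_address: int) -> bool:
    """Handle a near memory tag hit; return True if a misprediction was corrected."""
    if table.get(page_address, False):  # block 50: request matched a valid entry
        table[page_address] = False     # block 52: clear the valid bit
        return True
    return False                        # no matching valid entry: nothing to do


table = {0x10: True}
print(on_tag_hit(table, 0x10))  # a miss had been predicted, so the entry is invalidated
print(on_tag_hit(table, 0x20))  # an untracked page that hits allocates nothing
```

Note that a hit on an untracked page leaves the table unchanged, which is consistent with the table not tracking hits.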
Turning now to
The system memory 70 may include a first memory 74 (e.g., near memory, memory-side cache) and a second memory 76 (e.g., far memory that may be either volatile memory or non-volatile memory), wherein the first memory 74 may be accessed more quickly than the second memory 76. The first memory 74 may include a memory controller 78 that may generally implement one or more aspects of the method 20 (
As already noted, the first memory 74 may be referred to as “near memory” in a two level memory/2LM architecture. In one example, the first memory 74 is a set associative (e.g., 4-way) structure, although other structures may be used. Moreover, the first memory 74 and/or the second memory 76 may include either volatile memory or non-volatile memory. Non-volatile memory is a storage medium that does not require power to maintain the state of data stored by the medium. In one embodiment, the system memory 70 is a block addressable memory device, such as those based on NAND or NOR technologies. A memory device may also include future generation nonvolatile devices, such as a three dimensional (3D) crosspoint memory device, or other byte addressable write-in-place nonvolatile memory devices. In one embodiment, the memory device may be or may include memory devices that use silicon-oxide-nitride-oxide-silicon (SONOS) memory, electrically erasable programmable read-only memory (EEPROM), chalcogenide glass, multi-threshold level NAND flash memory, NOR flash memory, single or multi-level Phase Change Memory (PCM), a resistive memory, nanowire memory, ferroelectric transistor random access memory (FeTRAM), anti-ferroelectric memory, magnetoresistive random access memory (MRAM) memory that incorporates memristor technology, resistive memory including the metal oxide base, the oxygen vacancy base and the conductive bridge Random Access Memory (CB-RAM), or spin transfer torque (STT)-MRAM, a spintronic magnetic junction memory based device, a magnetic tunneling junction (MTJ) based device, a DW (Domain Wall) and SOT (Spin Orbit Transfer) based device, a thyristor based memory device, or a combination of any of the above, or other memory. The memory device may refer to the die itself and/or to a packaged memory product.
In some embodiments, 3D crosspoint memory may comprise a transistor-less stackable cross point architecture in which memory cells sit at the intersection of word lines and bit lines and are individually addressable and in which bit storage is based on a change in bulk resistance. In particular embodiments, a memory module with non-volatile memory may comply with one or more standards promulgated by the Joint Electron Device Engineering Council (JEDEC), such as JESD218, JESD219, JESD220-1, JESD223B, JESD223-1, or other suitable standard (the JEDEC standards cited herein are available at jedec.org).
Volatile memory is a storage medium that requires power to maintain the state of data stored by the medium. Examples of volatile memory may include various types of random access memory (RAM), such as dynamic random access memory (DRAM) or static random access memory (SRAM). One particular type of DRAM that may be used in a memory module is synchronous dynamic random access memory (SDRAM). In particular embodiments, DRAM of the memory modules complies with a standard promulgated by JEDEC, such as JESD79F for Double Data Rate (DDR) SDRAM, JESD79-2F for DDR2 SDRAM, JESD79-3F for DDR3 SDRAM, or JESD79-4A for DDR4 SDRAM (these standards are available at www.jedec.org). Such standards (and similar standards) may be referred to as DDR-based standards and communication interfaces of the storage devices that implement such standards may be referred to as DDR-based interfaces.
Unexpectedly advantageous evaluation results include, for example, an average prediction accuracy of approximately 99.6%, a correction of approximately 75.1% of near memory miss requests, and an increase in far memory bandwidth consumption of only about 3.68%.
Example 1 may include a semiconductor package apparatus comprising one or more substrates and logic coupled to the one or more substrates, wherein the logic is implemented at least partly in one or more of configurable logic or fixed functionality hardware logic, the logic coupled to the one or more substrates to maintain a prediction table that tracks missed page addresses with respect to a first memory, send, if an access request does not correspond to any valid page addresses in the prediction table, the access request to the first memory, send, if the access request corresponds to a valid page address in the prediction table, the access request to the first memory and a second memory in parallel, wherein the first memory is associated with a shorter access time than the second memory, update, if the access request corresponds to the valid page address in the prediction table, one or more replacement policy bits of a prediction table entry associated with the valid page address to maintain the prediction table, detect a tag hit in the first memory, wherein the tag hit is to be associated with the access request, clear, if the access request corresponds to the valid page address in the prediction table, a valid bit of the prediction table entry associated with the valid page address in response to the tag hit, and prevent the prediction table from tracking hits with respect to the first memory.
Example 2 may include the apparatus of Example 1, wherein, to maintain the prediction table, the logic coupled to the one or more substrates is further to detect a tag miss in the first memory, create, if the tag miss does not correspond to any valid page address in the prediction table, a new entry in the prediction table in response to the tag miss, set a valid bit of the new entry, and update one or more replacement policy bits of the new entry.
Example 3 may include the apparatus of Example 2, wherein the logic coupled to the one or more substrates is to replace an invalid entry in the prediction table to create the new entry.
Example 4 may include the apparatus of Example 2, wherein the logic coupled to the one or more substrates is to replace a valid entry in the prediction table based on one or more replacement policy bits of the valid entry to create the new entry.
Example 5 may include a two-level memory-based system comprising a processor to issue an access request, a first memory, a second memory, wherein the first memory is associated with a shorter access time than the second memory, and a memory controller coupled to the processor, the first memory and the second memory, the memory controller to maintain a prediction table that tracks missed page addresses with respect to the first memory, send, if the access request does not correspond to any valid page addresses in the prediction table, the access request to the first memory, and send, if the access request corresponds to a valid page address in the prediction table, the access request to the first memory and the second memory in parallel.
Example 6 may include the system of Example 5, wherein, to maintain the prediction table, the memory controller is to detect a tag miss in the first memory, create, if the tag miss does not correspond to any valid page address in the prediction table, a new entry in the prediction table in response to the tag miss, set a valid bit of the new entry, and update one or more replacement policy bits of the new entry.
Example 7 may include the system of Example 6, wherein the memory controller is to replace an invalid entry in the prediction table to create the new entry.
Example 8 may include the system of Example 6, wherein the memory controller is to replace a valid entry in the prediction table based on one or more replacement policy bits of the valid entry to create the new entry.
Example 9 may include the system of Example 5, wherein, if the access request corresponds to the valid page address in the prediction table, the memory controller is to update one or more replacement policy bits of a prediction table entry associated with the valid page address to maintain the prediction table.
Example 10 may include the system of Example 5, wherein the memory controller is to detect a tag hit in the first memory, wherein the tag hit is to be associated with the access request, and clear, if the access request corresponds to the valid page address in the prediction table, a valid bit of a prediction table entry associated with the valid page address in response to the tag hit.
Example 11 may include the system of any one of Examples 5 to 10, wherein the memory controller is to prevent the prediction table from tracking hits with respect to the first memory.
Example 12 may include a semiconductor package apparatus comprising one or more substrates, and logic coupled to the one or more substrates, wherein the logic is implemented at least partly in one or more of configurable logic or fixed functionality hardware logic, the logic coupled to the one or more substrates to maintain a prediction table that tracks missed page addresses with respect to a first memory, send, if an access request does not correspond to any valid page addresses in the prediction table, the access request to the first memory, and send, if the access request corresponds to a valid page address in the prediction table, the access request to the first memory and a second memory in parallel, wherein the first memory is associated with a shorter access time than the second memory.
Example 13 may include the apparatus of Example 12, wherein, to maintain the prediction table, the logic coupled to the one or more substrates is to detect a tag miss in the first memory, create, if the tag miss does not correspond to any valid page address in the prediction table, a new entry in the prediction table in response to the tag miss, set a valid bit of the new entry, and update one or more replacement policy bits of the new entry.
Example 14 may include the apparatus of Example 13, wherein the logic coupled to the one or more substrates is to replace an invalid entry in the prediction table to create the new entry.
Example 15 may include the apparatus of Example 13, wherein the logic coupled to the one or more substrates is to replace a valid entry in the prediction table based on one or more replacement policy bits of the valid entry to create the new entry.
Example 16 may include the apparatus of Example 12, wherein, if the access request corresponds to the valid page address in the prediction table, the logic coupled to the one or more substrates is to update one or more replacement policy bits of a prediction table entry associated with the valid page address to maintain the prediction table.
Example 17 may include the apparatus of Example 12, wherein the logic coupled to the one or more substrates is to detect a tag hit in the first memory, wherein the tag hit is to be associated with the access request, and clear, if the access request corresponds to the valid page address in the prediction table, a valid bit of a prediction table entry associated with the valid page address in response to the tag hit.
Example 18 may include the apparatus of any one of Examples 12 to 17, wherein the logic coupled to the one or more substrates is to prevent the prediction table from tracking hits with respect to the first memory.
Example 19 may include a method of operating a semiconductor package apparatus, comprising maintaining a prediction table that tracks missed page addresses with respect to a first memory, sending, if an access request does not correspond to any valid page addresses in the prediction table, the access request to the first memory, and sending, if the access request corresponds to a valid page address in the prediction table, the access request to the first memory and a second memory in parallel, wherein the first memory is associated with a shorter access time than the second memory.
Example 20 may include the method of Example 19, wherein maintaining the prediction table includes detecting a tag miss in the first memory, creating, if the tag miss does not correspond to any valid page address in the prediction table, a new entry in the prediction table in response to the tag miss, setting a valid bit of the new entry, and updating one or more replacement policy bits of the new entry.
Example 21 may include the method of Example 20, wherein creating the new entry includes replacing an invalid entry in the prediction table.
Example 22 may include the method of Example 20, wherein creating the new entry includes replacing a valid entry in the prediction table based on one or more replacement policy bits of the valid entry.
Example 23 may include the method of Example 19, wherein maintaining the prediction table includes, if the access request corresponds to the valid page address in the prediction table, updating one or more replacement policy bits of a prediction table entry associated with the valid page address.
Example 24 may include the method of Example 19, further including detecting a tag hit in the first memory, wherein the tag hit is associated with the access request, and clearing, if the access request corresponds to the valid page address in the prediction table, a valid bit of a prediction table entry associated with the valid page address in response to the tag hit.
Example 25 may include the method of any one of Examples 19 to 24, further including preventing the prediction table from tracking hits with respect to the first memory.
Example 26 may include a semiconductor package apparatus comprising means for maintaining a prediction table that tracks missed page addresses with respect to a first memory, means for sending, if an access request does not correspond to any valid page addresses in the prediction table, the access request to the first memory, and means for sending, if the access request corresponds to a valid page address in the prediction table, the access request to the first memory and a second memory in parallel, wherein the first memory is associated with a shorter access time than the second memory.
Example 27 may include the apparatus of Example 26, wherein the means for maintaining the prediction table includes means for detecting a tag miss in the first memory, means for creating, if the tag miss does not correspond to any valid page address in the prediction table, a new entry in the prediction table in response to the tag miss, means for setting a valid bit of the new entry, and means for updating one or more replacement policy bits of the new entry.
Example 28 may include the apparatus of Example 27, wherein the means for creating the new entry includes means for replacing an invalid entry in the prediction table.
Example 29 may include the apparatus of Example 27, wherein the means for creating the new entry includes means for replacing a valid entry in the prediction table based on one or more replacement policy bits of the valid entry.
Example 30 may include the apparatus of Example 26, wherein the means for maintaining the prediction table includes, if the access request corresponds to the valid page address in the prediction table, means for updating one or more replacement policy bits of a prediction table entry associated with the valid page address.
Example 31 may include the apparatus of Example 26, further including means for detecting a tag hit in the first memory, wherein the tag hit is to be associated with the access request, and means for clearing, if the access request corresponds to the valid page address in the prediction table, a valid bit of a prediction table entry associated with the valid page address in response to the tag hit.
Example 32 may include the apparatus of any one of Examples 26 to 31, further including means for preventing the prediction table from tracking hits with respect to the first memory.
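The tag miss update recited in Examples 2 to 4 (and their system, method, and means counterparts) can be sketched as follows. The class `PredictionTable`, the fixed `capacity`, and the choice of a least recently used policy driven by a simple counter are assumptions for illustration; the examples require only "one or more replacement policy bits" and do not name a policy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    page_address: int = 0
    valid: bool = False
    last_used: int = 0  # stand-in for replacement policy bits (assumed LRU)


class PredictionTable:
    def __init__(self, capacity: int) -> None:
        self.entries: List[Entry] = [Entry() for _ in range(capacity)]
        self.clock = 0

    def on_tag_miss(self, page_address: int) -> None:
        self.clock += 1
        for e in self.entries:
            if e.valid and e.page_address == page_address:
                return  # already tracked; this case is not sketched here
        # Prefer an invalid entry (Example 3); otherwise evict the least
        # recently used valid entry (Example 4, assuming an LRU policy).
        victim = next((e for e in self.entries if not e.valid), None)
        if victim is None:
            victim = min(self.entries, key=lambda e: e.last_used)
        victim.page_address = page_address
        victim.valid = True            # set the valid bit of the new entry
        victim.last_used = self.clock  # update the replacement policy bits


table = PredictionTable(capacity=2)
for address in (0xA, 0xB, 0xC):  # the third miss forces a replacement
    table.on_tag_miss(address)
tracked = sorted(e.page_address for e in table.entries if e.valid)
print([hex(a) for a in tracked])
```

In this run the entry for page 0xA is the oldest when 0xC misses, so it is the one replaced.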
Low overhead technology described herein may therefore significantly reduce memory access latencies and, in turn, improve performance. For example, by predicting near memory misses earlier, the technology enables far memory to be accessed more quickly and more reliably.
Embodiments are applicable for use with all types of semiconductor integrated circuit (“IC”) chips. Examples of these IC chips include but are not limited to processors, controllers, chipset components, programmable logic arrays (PLAs), memory chips, network chips, systems on chip (SoCs), SSD/NAND controller ASICs, and the like. In addition, in some of the drawings, signal conductor lines are represented with lines. Some may be different, to indicate more constituent signal paths, have a number label, to indicate a number of constituent signal paths, and/or have arrows at one or more ends, to indicate primary information flow direction. This, however, should not be construed in a limiting manner. Rather, such added detail may be used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit. Any represented signal lines, whether or not having additional information, may actually comprise one or more signals that may travel in multiple directions and may be implemented with any suitable type of signal scheme, e.g., digital or analog lines implemented with differential pairs, optical fiber lines, and/or single-ended lines.
Example sizes/models/values/ranges may have been given, although embodiments are not limited to the same. As manufacturing techniques (e.g., photolithography) mature over time, it is expected that devices of smaller size could be manufactured. In addition, well known power/ground connections to IC chips and other components may or may not be shown within the figures, for simplicity of illustration and discussion, and so as not to obscure certain aspects of the embodiments. Further, arrangements may be shown in block diagram form in order to avoid obscuring embodiments, and also in view of the fact that specifics with respect to implementation of such block diagram arrangements are highly dependent upon the platform within which the embodiment is to be implemented, i.e., such specifics should be well within purview of one skilled in the art. Where specific details (e.g., circuits) are set forth in order to describe example embodiments, it should be apparent to one skilled in the art that embodiments can be practiced without, or with variation of, these specific details. The description is thus to be regarded as illustrative instead of limiting.
The term “coupled” may be used herein to refer to any type of relationship, direct or indirect, between the components in question, and may apply to electrical, mechanical, fluid, optical, electromagnetic, electromechanical or other connections. In addition, the terms “first”, “second”, etc. may be used herein only to facilitate discussion, and carry no particular temporal or chronological significance unless otherwise indicated.
As used in this application and in the claims, a list of items joined by the term “one or more of” may mean any combination of the listed terms. For example, the phrase “one or more of A, B or C” may mean A; B; C; A and B; A and C; B and C; or A, B and C.
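The seven combinations listed above are the non-empty subsets of the three listed terms, which the following short sketch enumerates. The variable names `items` and `subsets` are illustrative only.

```python
from itertools import combinations

items = ["A", "B", "C"]

# Every non-empty subset of the listed terms, smallest subsets first.
subsets = [
    combo
    for size in range(1, len(items) + 1)
    for combo in combinations(items, size)
]

for combo in subsets:
    print(" and ".join(combo))
```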
Those skilled in the art will appreciate from the foregoing description that the broad techniques of the embodiments can be implemented in a variety of forms. Therefore, while the embodiments have been described in connection with particular examples thereof, the true scope of the embodiments should not be so limited since other modifications will become apparent to the skilled practitioner upon a study of the drawings, specification, and following claims.