Disclosed embodiments are directed to configuring memory structures for high speed, low power applications. More particularly, exemplary embodiments are directed to configuring large Dynamic Random Access Memory (DRAM) structures for use as cache memory.
Computer processing systems generally comprise several levels of memory. Closest to the processing core or Central Processing Unit (CPU) are caches, such as first-level cache, and furthest away from the CPU is the main memory. Caches have requirements of high speed, and small sizes, especially if the caches are close to the CPU and are placed on-chip. Accordingly, caches closest to the CPU are usually formed from Static Random Access Memory (SRAM), which features high speeds. However, SRAM also comes at a high cost. On the other hand, Dynamic Random Access Memory (DRAM) is slower than SRAM, but also less expensive. Accordingly, DRAM has historically found a place further away from the CPU and closer to main memory.
Recent advances in technology have made it feasible to manufacture DRAM systems with large storage capacity and low power features. For example, wide input/output (I/O) interfaces, and energy efficient stacking have enabled manufacture of DRAM systems with large storage capacity (as high as 2 GB), high bandwidth data transfers and also lower latencies than were previously known for DRAM.
Accordingly, the relatively large storage may make it possible for Low Power Stacked DRAM systems to act as on-chip main memory systems in some low power embedded systems and handheld device applications. However, such Low Power Stacked DRAM systems may not be a suitable replacement for main memory in high performance processing systems, as their storage capacity may not be large enough to meet the needs of main memory.
On the other hand, Low Power Stacked DRAM systems, featuring low energy and high speeds, may now be more attractive for caches close to the CPU. For example, the Low Power Stacked DRAM systems may be configured as caches for conventional DRAM systems which may be too slow to be placed close to the CPU. Accordingly, the Low Power Stacked DRAM systems may provide higher storage capacity in cache memories close to the CPU than were previously known.
However, currently available off-the-shelf Low Power Stacked DRAM models may suffer from several limitations which may restrict their ready applicability to such cache memory applications close to the CPU. For example, off-the-shelf Low Power Stacked DRAM systems may not be equipped with features like error-correcting codes (ECC). DRAM cells may be leaky and highly prone to errors. Therefore, a lack of error detection and error correction capability, such as ECC mechanisms, may render the Low Power Stacked DRAM systems unsuitable for their use in caches close to the CPU, or as any other kind of storage in an error-resistant system.
Another obstacle in configuring off-the-shelf Low Power Stacked DRAM systems for use as cache memory is their lack of support for features which enable high speed data access, such as tagging mechanisms. As is well known, cache memories include tagging mechanisms which specify the memory address corresponding to each copied line in the cache. Efficient tag structures enable high speed lookups for requested data in the cache memories. However, off-the-shelf Low Power Stacked DRAM systems do not feature tagging mechanisms, thereby rendering them unsuitable for use as caches, in the absence of alternate techniques for tag storage. Designing suitable tagging mechanisms for use in conjunction with DRAMs presents several challenges. For example, in the case of large DRAMs (2 GB, for example) tag fields themselves would require several MB of storage space. This large tag space overhead gives rise to several challenges in the placement and organization of tags on-chip.
Additionally, the design of tagging mechanisms for Low Power Stacked DRAMs is complicated by the implicit balance involved in sacrificing tag space for larger set-associativity, thus inviting problems of high miss rates. Similarly, challenges are also presented in designing Low Power Stacked DRAM systems to include intelligence associated with directory information or other memory coherency information for multi-processor environments.
Accordingly, in order to advantageously exploit Low Power Stacked DRAM systems for use in cache memory applications close to the CPU, there is a need to overcome challenges created by sensitivity to errors, lack of efficient tagging mechanisms and related intelligence features in conventional DRAM systems.
Exemplary embodiments of the invention are directed to systems and methods for configuring large Dynamic Random Access Memory (DRAM) structures for use as cache memory.
For example, an exemplary embodiment is directed to a memory device without pre-existing dedicated metadata comprising a page based memory, wherein each page is divided into a first portion and a second portion, such that the first portion comprises data, and the second portion comprises metadata corresponding to the data in the first portion. In exemplary embodiments, the metadata may comprise at least one of error-correcting code (ECC), address tags, directory information, memory coherency information, or dirty/valid/lock information.
Another exemplary embodiment is directed to a method of configuring a page-based memory device without pre-existing dedicated metadata, the method comprising: reading metadata from a metadata portion of a page of the memory device, and determining a characteristic of the page, based on the metadata.
Yet another exemplary embodiment is directed to a memory system comprising: a page-based memory device without pre-existing metadata, wherein a page of the memory device comprises a first storage means and a second storage means, metadata stored in the first storage means, and data stored in the second storage means, wherein the metadata in the first storage means is associated with the data in the second storage means.
Another exemplary embodiment is directed to a non-transitory computer-readable storage medium comprising code, which, when executed by a processor, causes the processor to perform operations for configuring a page-based memory device without pre-existing dedicated metadata, the non-transitory computer-readable storage medium comprising: code for reading metadata from a metadata portion of a page of the memory device, and code for determining a characteristic of the page, based on the metadata.
The accompanying drawings are presented to aid in the description of embodiments of the invention and are provided solely for illustration of the embodiments and not limitation thereof.
Aspects of the invention are disclosed in the following description and related drawings directed to specific embodiments of the invention. Alternate embodiments may be devised without departing from the scope of the invention. Additionally, well-known elements of the invention will not be described in detail or will be omitted so as not to obscure the relevant details of the invention.
The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments. Likewise, the term “embodiments of the invention” does not require that all embodiments of the invention include the discussed feature, advantage or mode of operation.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of embodiments of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises”, “comprising”, “includes” and/or “including”, when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Further, many embodiments are described in terms of sequences of actions to be performed by, for example, elements of a computing device. It will be recognized that various actions described herein can be performed by specific circuits (e.g., application specific integrated circuits (ASICs)), by program instructions being executed by one or more processors, or by a combination of both. Additionally, these sequences of actions described herein can be considered to be embodied entirely within any form of computer readable storage medium having stored therein a corresponding set of computer instructions that upon execution would cause an associated processor to perform the functionality described herein. Thus, the various aspects of the invention may be embodied in a number of different forms, all of which have been contemplated to be within the scope of the claimed subject matter. In addition, for each of the embodiments described herein, the corresponding form of any such embodiments may be described herein as, for example, “logic configured to” perform the described action.
As previously presented, currently available off-the-shelf DRAM systems such as Low Power Stacked DRAM systems may be highly error prone and therefore, may not meet critical standards of data fidelity required in cache memories. Accordingly, exemplary embodiments comprise configurations of such Low Power Stacked DRAM systems wherein error detection and error correction features, such as ECC mechanisms, are introduced. Embodiments also include efficient utilization of data storage space and page-based memory architecture of Low Power Stacked DRAM systems, in order to introduce ECC bits with minimal storage space overhead and high speed access.
Exemplary embodiments also recognize that available off-the-shelf Low Power Stacked DRAM architectures lack built-in tagging mechanisms for fast data searches. DRAM systems conventionally store data in pages. For example, a DRAM system may comprise data stored in 1 KB page sizes. Embodiments realize a conversion of page-based DRAM memory into cache-like memory with tagging mechanisms, by treating each page as a set in a set-associative cache. Hereafter, without loss of generality, the description will focus on a single page of a Low Power Stacked DRAM system, configured as a cache with a single set, for ease of understanding. Each line in the page may then be treated as a way of the set-associative cache, and tags may be applied to each line. The tags comprise bits required to identify whether a particular memory location is present in the cache. Commonly, memory addresses are configured such that a few selected bits of the memory address may be used to identify bytes in a line, a few other selected bits may be used to identify the set to which the memory address corresponds, and the remaining address bits may be utilized for forming the tag. However, the tag fields that are introduced by this process present challenges in their storage, organization, and placement.
Firstly, the tag fields require significant storage space. For example, an embodiment may involve configuration of a Low Power Stacked DRAM with page sizes of 1 Kilo Byte (KB) as a cache memory. Accordingly, the 1 KB page may be configured as a 16-way cache with 64 Byte (B) lines. Assuming the physical memory is of size 1 Terabyte (TB), 40-bits may be required for addressing the physical memory space. Accordingly, 6-bits may be required to identify a byte in a 64 B line and 21-bits to identify the set. Thus, 40−(6+21), or 13-bits may be required to form a tag for each line in a 1 KB 16-way DRAM cache with approximately 2 million sets. Therefore, for a 16-way cache with one cache line per each of the 16-ways, the number of tag bits may be 13×16 or 208-bits. As will be appreciated, 208-bits of tags for each 1 KB page size of DRAM data presents a significant tag space overhead.
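The arithmetic above can be reproduced with a short sketch. The function name and parameters below are hypothetical, and the 2 GB cache capacity is an assumption taken from the capacity figure mentioned earlier, which yields the approximately 2 million sets referred to above.

```python
def tag_overhead(phys_bytes, cache_bytes, page_bytes, line_bytes):
    """Return (offset_bits, set_bits, tag_bits, tag_bits_per_page).

    All sizes are assumed to be powers of two, so bit_length() - 1
    gives the exact base-2 logarithm.
    """
    addr_bits = phys_bytes.bit_length() - 1                   # 1 TB -> 40
    offset_bits = line_bytes.bit_length() - 1                 # 64 B -> 6
    set_bits = (cache_bytes // page_bytes).bit_length() - 1   # one set per page
    ways = page_bytes // line_bytes                           # lines per page
    tag_bits = addr_bits - (offset_bits + set_bits)
    return offset_bits, set_bits, tag_bits, tag_bits * ways


# 1 TB physical memory, 2 GB DRAM cache (assumed), 1 KB pages, 64 B lines
offset_bits, set_bits, tag_bits, per_page = tag_overhead(2**40, 2**31, 1024, 64)
print(offset_bits, set_bits, tag_bits, per_page)
```

With these inputs the sketch yields 6 offset bits, 21 set bits, 13 tag bits per line, and 208 tag bits per page, matching the figures given above.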
Secondly, it will be recognized that in order to reduce the tag space, cache line size may be increased and the number of cache line entries may be decreased, such that the overall storage capacity of the cache remains unaltered. However, increasing the cache line size at the expense of decreasing the number of cache entries, may increase the miss rate. Further, increasing the cache line size also has the effect of increasing the amount of data that is transferred when a cache line is filled or read out. Further, intelligent organization of the cache lines and pages has significant implications on the number of pages which may need to be accessed in the process of searching for requested data. Accordingly, exemplary embodiments will describe efficient solutions for challenges involved in the tag space overhead and organization. For example, certain embodiments include tag fields corresponding to data in a page, within the page itself, such that on a page read, if the tags indicate a hit, then the corresponding data may be accessed while the page is still open. Additionally, exemplary embodiments also take into account the need for efficient configuration of directory information and memory coherency information for multi-processor environments.
As used herein, the term “metadata” inclusively refers to the various bits of information and error correcting codes that correspond to data introduced in the DRAM systems in exemplary embodiments. For example, ECC-bits, tag information (including dirty/valid/locked mode information, as is known in the art), directory information, and other memory coherency information may be collectively referred to as metadata. Exemplary embodiments are directed to techniques for introducing metadata in DRAM systems which lack such metadata. The embodiments are further directed to efficient storage, organization, and access of the metadata, in order to configure the DRAM systems as reliable and high efficiency cache systems.
It will be appreciated that, while reference and focus are on configuring Low Power Stacked DRAM systems as above, embodiments described herein are not so limited, but may be easily extended to converting any memory system without metadata to a memory system which includes metadata.
The following describes an exemplary process of configuring a DRAM system, such as a Low Power Stacked DRAM system, lacking error detection/correction features, into an exemplary DRAM system comprising efficient ECC implementations. With reference to
As previously discussed, DRAM system 100 is volatile because the bit cells' capacitors are leaky. Constant refreshing of the capacitors is required in order to retain the information stored therein. Moreover, the information is susceptible to errors introduced by various external factors, such as fluctuations in electro-magnetic fields. Therefore, error detection and correction is crucial for assuring fidelity of stored data.
A common technique for error detection and correction involves the use of ECC bits. ECC bits represent a level of redundancy associated with data bits. This redundancy is used to check the consistency of data. ECC bits are initially calculated based on original data values which are known to be correct. As a simple example, ECC bits may represent a parity value, such that the parity value may indicate if the number of logic “ones” present in the original data is odd or even. At a later point in time, a parity value may be generated again on data then present, and compared with the ECC bits. If there is a mismatch, it may be determined that at least one error has been introduced in the original data. More complex algorithms are well known in the art for sophisticated analysis of errors and subsequent correction of errors if detected, using the basic principles of ECC. Detailed explanations of such algorithms will not be provided herein, as skilled persons will recognize suitable error detection/correction algorithms for particular applications which are enabled by exemplary embodiments.
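The simple parity example may be sketched as follows. This is a minimal illustration only; the function names are hypothetical, and a single parity bit detects an odd number of bit flips but cannot correct them or detect an even number of flips.

```python
def parity_bit(data: bytes) -> int:
    """Return 1 if the number of logic ones in data is odd, else 0."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones & 1


def has_error(data: bytes, stored_parity: int) -> bool:
    """Recompute parity on the data now present and compare with the stored bit."""
    return parity_bit(data) != stored_parity


original = bytes([0b10110010, 0b00000001])        # five ones in total
stored = parity_bit(original)                     # computed while data is known good

one_flip = bytes([original[0] ^ 0b00000100, original[1]])   # one bit corrupted
two_flips = bytes([original[0] ^ 0b00000110, original[1]])  # two bits corrupted

print(has_error(original, stored))    # no mismatch
print(has_error(one_flip, stored))    # mismatch: at least one error
print(has_error(two_flips, stored))   # no mismatch: the double flip goes unnoticed
```

The last case illustrates why more elaborate codes are used when correction, or detection of multiple errors, is required.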
Returning now, to DRAM system 100 of
With reference now to
According to exemplary embodiments, lines L1-L15 may first be filled with data to be stored in page 202. ECC bits may then be calculated for each of the lines of data L1-L15, and the ECC bits may be stored in fields E1-E15 respectively. As shown above, 11-bits of ECC may be sufficient for SEC/DED of each of the 512-bit lines L1-L15. In this example, 11 of the 32-bits in each of fields E1-E15 may be occupied by ECC bits, thus making available 21-bits for use by other metadata information pertaining to lines L1-L15, as described further below. Regarding field E0, ECC information pertaining to fields E1-E15 may be made available in field E0, such that the metadata fields may also be afforded protection from possible errors. In certain implementations, field E0 may be set to a zero-value for performing ECC calculations. Skilled persons will recognize efficient implementation details of ECC for particular applications, based on the above detailed technique.
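As a rough check of the 11-bit figure, the sketch below counts check bits under the assumption that SEC/DED is realized with an extended Hamming code, which is one common choice; the description does not mandate a particular code, and the function name is hypothetical.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for an extended Hamming (SEC/DED) code over data_bits.

    Single-error correction needs the smallest r with 2**r >= data_bits + r + 1;
    one additional overall parity bit provides double-error detection.
    """
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


FIELD_BITS = 32                      # width of each metadata field E1-E15
ecc_bits = secded_check_bits(512)    # one 64-byte line is 512 bits
spare_bits = FIELD_BITS - ecc_bits   # bits left for tags and other metadata
print(ecc_bits, spare_bits)
```

Under this assumption a 512-bit line needs 11 check bits, leaving 21 bits of each 32-bit field for other metadata, in agreement with the figures above.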
Description will now be provided for efficient implementations of tagging mechanisms for fast searching of data in page 202 of
With continuing reference to
In exemplary embodiments, when a data request is directed to page 202 of DRAM system 200, page 202 is first opened for inspection. Next, line L0 is accessed, and metadata including tags in fields E1-E15 are analyzed. If there is a hit in one of the tags in fields E1-E15, the line L1-L15 corresponding to the tag which caused a hit, will be determined to be the line comprising requested data. The data line comprising requested data may then be read out, for example, in a read operation. On the other hand, if there is no hit in any of the tags stored in fields E1-E15, it may be quickly determined that page 202 does not comprise the requested data, and page 202 may be promptly closed. Alternatively, if the requested data is not present in the cache and will cause a miss, leading to the data being subsequently placed in the cache, the appropriate page is opened for the new line, and also for any evicted line that may also need to be written back as a result of the miss. As each page is treated as a set in exemplary embodiments, once it is determined that page 202 does not comprise the requested data and page 202 is closed, it may be determined that the requested data is not present in DRAM system 200. Thereafter, embodiments may then initiate access to main memory to service the data request. Thus, it will be appreciated that configuring data and corresponding tags in the same page, obviates the need for separate degenerate accesses to a tag database followed by access to stored data, thus improving access speeds and energy efficiency.
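The lookup flow described above may be sketched in software as follows. This is a simplified behavioral model rather than a hardware description; the Page class, the dictionary-based cache and main memory, and all names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

WAYS = 15  # lines L1-L15 hold data; line L0 holds the metadata fields


@dataclass
class Page:
    """One DRAM page modeled as one cache set (hypothetical model)."""
    tags: List[Optional[int]] = field(default_factory=lambda: [None] * WAYS)
    lines: List[bytes] = field(default_factory=lambda: [b""] * WAYS)


def lookup_in_page(page: Page, tag: int) -> Optional[bytes]:
    """Model of opening the page, reading the metadata line, and comparing tags."""
    for way, stored in enumerate(page.tags):
        if stored == tag:
            return page.lines[way]   # hit: data is read while the page is open
    return None                      # no tag matched: the page can be closed


def read(cache: Dict[int, Page], set_index: int, tag: int,
         main_memory: Dict[Tuple[int, int], bytes]) -> Tuple[bytes, str]:
    """Serve a request from the DRAM cache, falling back to main memory."""
    data = lookup_in_page(cache[set_index], tag)
    if data is not None:
        return data, "hit"
    # Each page is a set, so a miss in this page is a miss in the whole cache.
    return main_memory[(set_index, tag)], "miss"


page = Page()
page.tags[3] = 0x0ABC
page.lines[3] = b"cached"
cache = {7: page}
main_memory = {(7, 0x0123): b"from-memory"}
print(read(cache, 7, 0x0ABC, main_memory))
print(read(cache, 7, 0x0123, main_memory))
```

Because tags and data share the page in this model, a hit requires only the one page to be opened, which is the property the description relies on.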
Several optimizations to the organization of metadata in exemplary embodiments will now be described, in order to further improve speed and efficiency. Memory accesses may be pipelined in processing systems, such that a memory access operation may be broken down into several steps, with each step executed in a single cycle of the system clock. Such steps may be expressed as “beats”, wherein a first beat of a memory operation may be performed in a first clock cycle, a second beat performed in the next clock cycle, and so on. The metadata may be organized such that more critical information is made available during the first few beats. Such an organization may enable a prompt determination of the usefulness of a particular page which has been opened for inspection.
For example, in an embodiment, the least significant 8-bits of the 13-bit tags may be placed in fields E1-E15 in such a manner as to be made available in the first beat after page 202 is opened. These least significant 8-bits of the tags provide a very good estimation of the likelihood of a hit or miss for requested data within page 202. In a case wherein only one of the tags in fields E1-E15 presents a hit in the least significant 8-bits, it may be determined that the hit is less likely to be spurious (if on the other hand, multiple hits are presented, then it is likely that the least significant 8-bits may be insufficient to accurately determine the presence of requested data in page 202). Accordingly, if a single hit is determined in the first beat, an early fetch request may be issued for the corresponding data.
Thereafter, the remaining bits of the tag may be accessed in a second beat, and studied in conjunction with the least significant 8-bits of the tag accessed in the first beat. The complete tag may then be analyzed for a hit or miss, and action may be taken accordingly. For example, if it is determined in the second beat that the hit indication in the first beat is spurious, then any issued early fetch requests may be aborted. Alternately, if a hit is determined or confirmed in the second beat, a fetch request may be initiated or sustained, respectively. A miss indication in the first and second beats may trigger the search process to proceed to a different page within DRAM system 200. Skilled persons will recognize various alternative implementations on similar lines as described above, without departing from the scope of exemplary embodiments.
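The two-beat tag comparison may be sketched as below. The function names and the returned action strings are hypothetical; the sketch only models which tag bits are visible in each beat and the resulting decision.

```python
from typing import List, Optional, Tuple

LOW_BITS = 8                        # tag bits available in the first beat
LOW_MASK = (1 << LOW_BITS) - 1


def two_beat_lookup(stored_tags: List[int],
                    requested_tag: int) -> Tuple[Optional[int], Optional[int]]:
    """Return (early_fetch_way, confirmed_way) for a 13-bit requested tag."""
    # Beat 1: compare only the least significant 8 bits of each stored tag.
    low = requested_tag & LOW_MASK
    partial = [way for way, tag in enumerate(stored_tags)
               if (tag & LOW_MASK) == low]
    # An early fetch is issued only when exactly one way matches partially.
    early = partial[0] if len(partial) == 1 else None
    # Beat 2: the remaining bits arrive, so the complete tag can be compared.
    confirmed = next((way for way in partial
                      if stored_tags[way] == requested_tag), None)
    return early, confirmed


def action(early: Optional[int], confirmed: Optional[int]) -> str:
    """Map the two-beat result onto the actions described above."""
    if confirmed is not None:
        return "sustain fetch" if early == confirmed else "initiate fetch"
    if early is not None:
        return "abort early fetch"   # the first-beat hit was spurious
    return "miss"


tags = [0x1A5, 0x0A5, 0x333]        # example 13-bit tags (illustrative values)
for request in (0x333, 0x1A5, 0x133, 0x001):
    print(hex(request), action(*two_beat_lookup(tags, request)))
```

In this example, 0x333 is fetched early and confirmed, 0x1A5 matches two ways in the first beat and is fetched only after the second beat, 0x133 triggers an early fetch that is then aborted, and 0x001 misses in both beats.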
Further optimizations may include placing ECC bits in fields E1-E15, such that they may be accessed in later beats after critical tag information. This is because ECC bits may not be relevant for quick determination of the presence of requested data in page 202. In certain embodiments, ECC bits may also be determined for the metadata itself (and stored, for example, in field E0). If such ECC bits reveal that an error may have occurred in the tags, then the previous determination of hits/misses in earlier beats may need to be suitably revised. Speculative fetching of data based on hit/miss determination in earlier beats may be suitably metered in embodiments based on acceptable trade-offs between speed and power requirements, as speculative fetches may improve speed at the cost of burning power in the case of misprediction.
With reference now to
Further beneficial features may be included in certain embodiments. For example, embodiments may derive further advantages from retaining metadata in the same page as corresponding data, as will now be described. Conventional indexing schemes rely on least significant bits for forming tags, such that consecutively addressed lines are organized in consecutive sets in a set-associative cache structure. Extending such conventional indexing principles to exemplary embodiments would imply that a new page may need to be opened on consecutive misses on consecutively addressed lines, because each page has been configured as a set. In order to minimize the negative impacts associated with such consecutive misses, embodiments may utilize middle bits of the tag for indexing, as opposed to the least significant bits. Thus, it will be ensured that misses on consecutively addressed lines may fall within the same DRAM page, and multiple DRAM pages need not be successively opened.
As an illustrative example, the least significant 6-bits of the 40-bit memory address in exemplary embodiments may be used to address individual bytes in a 64-Byte line. Therefore, instead of using the least significant bits as in conventional techniques, higher order bits in positions 8-29 may be used for indexing in exemplary embodiments, which would facilitate consecutively addressed lines to belong to the same set, thereby causing misses on consecutively addressed lines to fall within the same DRAM page. While such an organization of lines within the DRAM page-cache may increase conflict pressure among the various lines in a page, such organizations would advantageously improve latency. As will be recognized, the 16 lines in page 202 have been configured to form a 15-way cache (lines L1-L15; line L0 is used for metadata).
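The effect of indexing with higher order bits may be illustrated with the following sketch. The exact bit layout is an assumption: the set index is taken to be 21 bits wide, as in the earlier arithmetic, and to start at bit position 8, so that address bits 6-7 become part of the tag. Function names are hypothetical.

```python
OFFSET_BITS = 6     # byte within a 64-byte line
SET_BITS = 21       # approximately 2 million sets (pages)
SET_MASK = (1 << SET_BITS) - 1


def split_conventional(addr: int):
    """Set index taken from the bits immediately above the byte offset."""
    set_index = (addr >> OFFSET_BITS) & SET_MASK
    tag = addr >> (OFFSET_BITS + SET_BITS)
    return set_index, tag


def split_shifted(addr: int, index_lo: int = 8):
    """Set index taken from higher bits; the skipped bits join the tag."""
    skipped = index_lo - OFFSET_BITS                  # address bits 6-7 here
    set_index = (addr >> index_lo) & SET_MASK
    low_tag = (addr >> OFFSET_BITS) & ((1 << skipped) - 1)
    high_tag = addr >> (index_lo + SET_BITS)
    return set_index, (high_tag << skipped) | low_tag


base = 0x1234567800                                   # 256-byte aligned address
lines = [base + 64 * i for i in range(4)]             # four consecutive lines

conventional_sets = {split_conventional(a)[0] for a in lines}
shifted_sets = {split_shifted(a)[0] for a in lines}
shifted_tags = {split_shifted(a)[1] for a in lines}
print(len(conventional_sets), len(shifted_sets), len(shifted_tags))
```

Under these assumptions the four consecutive lines map to four different sets (pages) with conventional indexing, but to a single set with four distinct tags when the index is shifted upward, so consecutive misses stay within one open page.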
Further advantageous aspects may be included in exemplary embodiments, based on unused metadata space which may be available in fields E0-E15. As has been described with respect to page 202, each of the fields E0-E15 comprises 32-bits. The ECC bits occupy 11-bits, and tag information including state bits (representing valid/dirty/locked states) occupy 13+3=16-bits. This leaves 5-bits of unused space in the metadata fields. As previously described, directory information and other cache-coherency related information may be stored in the remaining 5-bits of metadata. Further, “valid,” “dirty,” and “locked” bits may also be introduced in the metadata fields. Valid and dirty bits may assist in tracking and replacing outdated/modified cache lines. Sometimes, defective parts may be recovered by designating a related DRAM cache line as invalid and locked. Other information, such as information to facilitate more efficient replacement policies or prefetch techniques, may also be introduced in the metadata fields. Various other forms of intelligence may be included in the metadata fields, and skilled persons will be able to recognize suitable configurations of metadata, based on exemplary descriptions provided herein.
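The 32-bit budget described above may be illustrated by packing one metadata field. The bit ordering chosen here (tag in the low bits, followed by state, spare, and ECC bits) is purely an assumption for illustration, as are the function names; the description only fixes the widths.

```python
TAG_BITS, STATE_BITS, SPARE_BITS, ECC_BITS = 13, 3, 5, 11   # 32 bits in total

TAG_SHIFT = 0
STATE_SHIFT = TAG_SHIFT + TAG_BITS        # bit 13
SPARE_SHIFT = STATE_SHIFT + STATE_BITS    # bit 16
ECC_SHIFT = SPARE_SHIFT + SPARE_BITS      # bit 21


def _check(name: str, value: int, bits: int) -> int:
    """Reject values that do not fit in the given number of bits."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name} does not fit in {bits} bits")
    return value


def pack_field(tag, valid, dirty, locked, ecc, spare=0) -> int:
    """Pack one metadata field (hypothetical layout) into a 32-bit word."""
    state = (int(valid) << 2) | (int(dirty) << 1) | int(locked)
    return (_check("tag", tag, TAG_BITS) << TAG_SHIFT
            | state << STATE_SHIFT
            | _check("spare", spare, SPARE_BITS) << SPARE_SHIFT
            | _check("ecc", ecc, ECC_BITS) << ECC_SHIFT)


def unpack_field(word: int) -> dict:
    """Inverse of pack_field."""
    state = (word >> STATE_SHIFT) & ((1 << STATE_BITS) - 1)
    return {
        "tag": (word >> TAG_SHIFT) & ((1 << TAG_BITS) - 1),
        "valid": bool(state & 0b100),
        "dirty": bool(state & 0b010),
        "locked": bool(state & 0b001),
        "spare": (word >> SPARE_SHIFT) & ((1 << SPARE_BITS) - 1),
        "ecc": (word >> ECC_SHIFT) & ((1 << ECC_BITS) - 1),
    }


word = pack_field(tag=0x1ABC, valid=True, dirty=False, locked=True,
                  ecc=0x5A5, spare=0b10110)
print(hex(word), unpack_field(word))
```

The 5 spare bits in this sketch stand in for the directory or coherency information mentioned above; how those bits are encoded is left open.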
Additionally, exemplary embodiments may also be configured to cache metadata separately, such that information related to frequently accessed cache lines corresponding to the cached metadata may be retrieved speedily. Implementations may involve separate caching structures for caching such metadata, or alternately, such caching may be performed in one or more pages of a DRAM system such as DRAM system 200. As a further optimization, only the metadata related to pages which are currently known to be open may be cached when it is known that corresponding cache lines in the open pages have a high likelihood of future access, based on the nature of applications being executed on the memory system.
From the above disclosure of exemplary embodiments, it will be seen that a page based memory device, (such as DRAM system 200 in
It will also be appreciated that embodiments include various methods for performing the processes, functions and/or algorithms disclosed herein. For example, as illustrated in
Further, it will be appreciated that Low Power Stacked DRAM such as DRAM system 200 may be accessed by a master device such as a processing core through a wide input/output interface, through-silicon via (TSV) interface, or a stacked interface.
Those of skill in the art will appreciate that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Further, those of skill in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The methods, sequences and/or algorithms described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor.
Accordingly, an embodiment of the invention can include a computer-readable medium embodying a method for configuring a memory device for use as a cache. Accordingly, the invention is not limited to illustrated examples and any means for performing the functionality described herein are included in embodiments of the invention.
While the foregoing disclosure shows illustrative embodiments of the invention, it should be noted that various changes and modifications could be made herein without departing from the scope of the invention as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the embodiments of the invention described herein need not be performed in any particular order. Furthermore, although elements of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.