1. Technical Field
The present invention relates generally to memory systems, and more specifically to a memory system providing greater efficiency and performance in accessing sparsely stored data items.
2. Description of the Related Art
Many modern computer systems rely on caching as a means of improving memory performance. A cache is a section of memory used to store data that is accessed more frequently than data in storage locations that take longer to reach. Processors typically use caches to reduce the average time required to access memory, since cache memory is typically constructed of a faster (but more expensive or bulky) variety of memory (such as static random access memory, or SRAM) than is used for main memory (such as dynamic random access memory, or DRAM). When a processor wishes to read or write a location in main memory, the processor first checks whether that memory location is present in the cache. If the processor finds that the memory location is present in the cache, a cache hit has occurred; otherwise, a cache miss has occurred. As a result of a cache miss, the processor reads the data from main memory into a cache line within the cache (or writes the data to such a cache line). A cache line is a location in the cache that has a tag containing the index of the data in main memory that is stored in the cache. Cache lines are also sometimes referred to as cache blocks.
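By way of illustration only, the lookup just described may be modeled in software as follows. This is a minimal sketch of a generic direct-mapped cache, written in C; the structure, names, and sizes are hypothetical and form no part of any particular embodiment.

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SIZE 64        /* bytes per cache line */
    #define NUM_LINES 1024      /* number of lines in the cache */

    /* One cache line: a valid bit, a tag identifying which block of
     * main memory the line currently holds, and the cached data.   */
    struct cache_line {
        bool     valid;
        uint64_t tag;
        uint8_t  data[LINE_SIZE];
    };

    static struct cache_line cache[NUM_LINES];

    /* Returns true on a cache hit, false on a cache miss. */
    bool cache_lookup(uint64_t addr)
    {
        uint64_t block = addr / LINE_SIZE;    /* main-memory block number     */
        uint64_t index = block % NUM_LINES;   /* line to which the block maps */
        uint64_t tag   = block / NUM_LINES;   /* remaining high-order bits    */

        return cache[index].valid && cache[index].tag == tag;
    }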
Caches generally rely on two principles known as temporal locality and spatial locality: temporal locality assumes that the most recently used data will be re-used soon, while spatial locality assumes that data close in memory to currently accessed data will be accessed in the near future. In many instances, these assumptions are valid. For instance, a single-dimensional array that is traversed in order follows these principles, since a memory access to one element of the array will likely be followed by an access to the next element in the array (which resides in the next adjacent memory location). In other situations, these principles have less application. For instance, a column-major traversal of a two-dimensional array stored in row-major order results in successive memory accesses to locations that are not adjacent to each other. In situations such as this, where sparsely-stored data must be accessed, the performance benefits associated with caching may be significantly offset by the many successive cache misses likely to be triggered by the widely spaced memory accesses.
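The in-order traversal mentioned above may be sketched as follows (a hypothetical illustration in C; the array size is arbitrary):

    #define N 1024

    static double a[N];

    /* In-order traversal of a single-dimensional array: a[i] and
     * a[i+1] occupy adjacent memory locations, so a single fetched
     * cache line satisfies several successive accesses.            */
    double sum_in_order(void)
    {
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            sum += a[i];
        return sum;
    }

A column-major traversal of a two-dimensional, row-major array would instead advance by an entire row between accesses, as illustrated in the discussion of code fragment 300 below.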
The present invention provides a virtual address scheme for improving performance and efficiency of memory accesses of sparsely-stored data items in a cached memory system. In a preferred embodiment of the present invention, a special address translation unit is used to translate sets of non-contiguous addresses in real memory into contiguous blocks of addresses in an “intermediate address space.” This intermediate address space is a fictitious or “virtual” address space, but it is distinct from the effective address space that is visible to application programs and used in user-level memory operations. Effective addresses seen and manipulated by application programs are translated into intermediate addresses by an additional address translation unit for memory caching purposes. This scheme allows non-contiguous data items in memory to be assembled into contiguous cache lines for more efficient caching and access (due to the perceived spatial proximity of the data from the perspective of the processor).
The foregoing is a summary and thus contains, by necessity, simplifications, generalizations, and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting. Other aspects, inventive features, and advantages of the present invention, as defined solely by the claims, will become apparent in the non-limiting detailed description set forth below.
The present invention may be better understood, and its numerous objects, features, and advantages made apparent to those skilled in the art by referencing the accompanying drawings, wherein:
The following is intended to provide a detailed description of an example of the invention and should not be taken to be limiting of the invention itself. Rather, any number of variations may fall within the scope of the invention, which is defined in the claims following the description.
The skilled reader will understand that, in the present art, most memory caches are indexed according to the physical addresses in main memory to which each cache line in the cache corresponds (generally through the use of a plurality of “tag bits,” which are a portion of that physical address denoting the location of the cache line's contents in main memory). Caches 106 and 108 in this preferred embodiment, however, are indexed according to a fictitious or “virtual” address space referred to herein as the “intermediate address space,” which is described in more detail below. Each of processing units 102 and 104 is equipped with an “intermediate address translation unit” (IATU) (110 and 112, respectively), which translates effective addresses in the virtual memory space in which the processor operates into intermediate addresses in the intermediate address space. The skilled reader will recognize that this function is essentially identical to that performed by conventional address translation units in existing virtual memory systems, except that instead of translating virtual addresses into real (physical) addresses, IATUs 110 and 112 translate the user-level virtual addresses (here called “effective addresses”) into intermediate addresses.
A memory controller unit 118, positioned between system bus 114 and main memory 116, serves as an intermediary between caches 106 and 108 and main memory 116, managing the actual memory caching and preserving consistency of data between caches 106 and 108. In addition to memory controller unit 118, there is also included a “real address translation unit” (RATU) 120, which defines a mapping between intermediate addresses (in the fictitious “intermediate address space”) and real addresses in physical memory (main memory 116). RATU 120, as its name indicates, translates intermediate addresses into real addresses for use in accessing main memory 116.
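Conceptually, a memory access in this embodiment therefore passes through two translations: effective address to intermediate address at the processor (IATU), and intermediate address to real address at the memory controller (RATU). The following sketch models the two translations as simple page-granular table lookups in C; the names and structures are hypothetical, the actual units are hardware, and the RATU of a preferred embodiment additionally supports the fine-grained mapping described below.

    #include <stdint.h>

    typedef uint64_t eaddr_t;  /* effective address (seen by programs)      */
    typedef uint64_t iaddr_t;  /* intermediate address (indexes the caches) */
    typedef uint64_t raddr_t;  /* real address (location in main memory)    */

    #define PAGE_SHIFT 12
    #define PAGE_MASK  ((1ull << PAGE_SHIFT) - 1)
    #define NUM_PAGES  256

    /* Hypothetical translation tables standing in for the hardware
     * units (bounds checks omitted for brevity).                    */
    static iaddr_t iatu_pages[NUM_PAGES]; /* effective page -> intermediate page */
    static raddr_t ratu_pages[NUM_PAGES]; /* intermediate page -> real page      */

    /* IATU 110/112: effective -> intermediate, on the processor side. */
    static iaddr_t iatu_translate(eaddr_t ea)
    {
        return (iatu_pages[ea >> PAGE_SHIFT] << PAGE_SHIFT) | (ea & PAGE_MASK);
    }

    /* RATU 120: intermediate -> real, applied by memory controller 118 */
    /* when a cache miss must be satisfied from main memory 116.        */
    static raddr_t ratu_translate(iaddr_t ia)
    {
        return (ratu_pages[ia >> PAGE_SHIFT] << PAGE_SHIFT) | (ia & PAGE_MASK);
    }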
The conceptual operation of intermediate addresses in the context of a preferred embodiment of the present invention is shown in FIG. 2.
With regard to the address mapping provided by RATU 206, it is important to note the manner in which the addresses are mapped in order to appreciate many of the advantages provided by a preferred embodiment of the invention. Firstly, in a preferred embodiment, the mapping between intermediate addresses and real addresses is bijective. That is, the mapping is “one-to-one” and “onto”: each address in real address space 208 corresponds to one and only one address in intermediate address space 204, and vice versa.
Secondly, the mapping is fine-grained. In other words, the mapping is from individual memory address to individual memory address. This fine-grained mapping permits individual non-contiguous memory locations in real address space 208 to be mapped into contiguous memory locations in intermediate address space 204 by RATU 206. The particular mapping between intermediate address space 204 and real address space 208 can be defined or modified by system software (e.g., an operating system, hypervisor, or other firmware). For example, system software may direct RATU 206 to map every “Nth” memory location in real memory starting at real memory address “A” to a corresponding address in a contiguous block of addresses in the intermediate address space starting at intermediate address “B.” This ability makes it possible to effectively “re-arrange” the contents of main memory without performing any actual manipulation of the physical data. This facility is useful for processing data that is stored in the form of a matrix or data that is stored in an interleaved format (e.g., video/graphics data).
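For instance, the “every Nth location” mapping just described reduces to a pair of mutually inverse functions, sketched below in C. The names are hypothetical; A, B, and N are the parameters named above, and addresses are treated at the granularity of individual data items.

    #include <stdint.h>

    /* Parameters of one strided RATU mapping, as programmed by system
     * software: real locations A, A+N, A+2N, ... appear at the
     * contiguous intermediate addresses B, B+1, B+2, ...              */
    struct ratu_map {
        uint64_t A;  /* first real address of the strided set     */
        uint64_t B;  /* first intermediate address of the block   */
        uint64_t N;  /* stride between successive real locations  */
    };

    /* intermediate -> real: used when a cache line must be filled. */
    static uint64_t ratu_to_real(const struct ratu_map *m, uint64_t ia)
    {
        return m->A + (ia - m->B) * m->N;
    }

    /* real -> intermediate: the inverse function.  Each real location
     * in the set has exactly one intermediate address and vice versa,
     * so the mapping is bijective on its domain.                      */
    static uint64_t ratu_to_intermediate(const struct ratu_map *m, uint64_t ra)
    {
        return m->B + (ra - m->A) / m->N;
    }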
An example of an application to which a preferred embodiment of the present invention is well suited is provided in FIGS. 3 and 4.
Because the array is stored in row-major order in real memory 302, the sequence of successive memory accesses performed by the doubly-nested loop in code fragment 300 will be at non-contiguous locations in real memory 302. In this example, it is presumed that the rows of the matrix are of a size on the order of the size of the cache lines employed in cache 308. Thus, in this example, each successive memory access requires a different cache line first to be retrieved from real memory 302 by memory controller 304, transmitted over system bus 306, and placed into cache 308 before processing on that memory location may proceed. This is inefficient, because each retrieval of a cache line from main memory takes time and consumes space within cache 308.
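Code fragment 300 itself appears only in the drawing; a doubly-nested loop of the kind it describes might read as follows (a hypothetical reconstruction in C with arbitrary dimensions):

    #define ROWS 512
    #define COLS 512

    /* The matrix, stored in row-major order in real memory. */
    static double matrix[ROWS][COLS];

    /* Column-by-column traversal: successive accesses are an entire
     * row apart, so each access typically falls in a different cache
     * line and forces a fresh line to be fetched over the system bus. */
    double process_by_columns(void)
    {
        double acc = 0.0;
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                acc += matrix[i][j];
        return acc;
    }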
Execution of code fragment 400, however, is much more efficient, as fewer cache lines need be retrieved. Because RATU 404 maps the non-contiguous data items making up a single column of the matrix in real memory 402 into a contiguous block (a row of the transposed matrix) in the intermediate address space, those data items are assembled into a single contiguous cache line. Since RATU 404 makes the data items appear contiguous in the intermediate address space, fewer cache lines need be transmitted over system bus 406 and entered into cache 408, because each cache line retrieved contains only data items that will be used immediately. This results not only in a performance increase (due to fewer cache misses), but also in a savings of resources, since fewer cache lines need be loaded into cache 408.
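Code fragment 400 likewise appears only in the drawing. Under the mapping described above, the same work reduces to a unit-stride loop over the intermediate-space view of a column, which may be sketched as follows (hypothetical; column_view stands for the contiguous intermediate-space alias of one matrix column established by system software):

    #define ROWS 512

    /* One column of the matrix, made to appear as a contiguous block
     * of addresses in the intermediate address space.                */
    extern double column_view[ROWS];

    /* Unit-stride traversal: consecutive accesses fall in the same
     * cache line until it is exhausted, so far fewer lines need
     * cross the system bus or occupy space in the cache.            */
    double process_column(void)
    {
        double acc = 0.0;
        for (int i = 0; i < ROWS; i++)
            acc += column_view[i];
        return acc;
    }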
While particular embodiments of the present invention have been shown and described, it will be obvious to those skilled in the art that, based upon the teachings herein, changes and modifications may be made without departing from this invention and its broader aspects. Therefore, the appended claims are to encompass within their scope all such changes and modifications as are within the true spirit and scope of this invention. Furthermore, it is to be understood that the invention is solely defined by the appended claims. It will be understood by those with skill in the art that if a specific number of an introduced claim element is intended, such intent will be explicitly recited in the claim, and in the absence of such recitation no such limitation is present. For non-limiting example, as an aid to understanding, the following appended claims contain usage of the introductory phrases “at least one” and “one or more” to introduce claim elements. However, the use of such phrases should not be construed to imply that the introduction of a claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an;” the same holds true for the use in the claims of definite articles. Where the word “or” is used in the claims, it is used in an inclusive sense (i.e., “A and/or B,” as opposed to “either A or B”).