1. Field of the Invention
The present invention generally relates to a method and system for providing an improved store-in cache, and more particularly, to the operation of stores in a cache system and the reliable maintenance of locally modified data in such a cache system.
2. Description of the Conventional Art
Caches are categorized according to many different parameters, each of which has its own implications on performance, power, design complexity, and limitations of use. One of the major parameters used is the Store Policy, which determines how stores to the cache are handled. Such a Store Policy includes two basic approaches, called Store-In and Store-Through.
When storing into a Store-In cache, that is all that one needs to do: store into it. This is exceedingly simple. However, the directory entry for any line that has been stored to must have a status bit (sometimes called a “dirty bit”) to indicate that the contents of the line have been changed. When a store has not been percolated into the rest of the cache hierarchy, but the line has simply been stored into locally, the local cache has the most recent, and hence the only valid, copy of the new data.
This means that if a remote processor attempts to reference this line, it will miss in its local cache, and it must get the only valid copy from the only place that exists—which is the local cache of the processor that last stored into the line. It further means that if a cache selects a line for replacement that has its “dirty bit” set, the modified line cannot simply be overwritten. First, the modified line has to be written back to the next cache level in the hierarchy. This operation is called a “Castout.”
Usually, a Castout is done by moving the modified line into a “Castout Buffer,” then waiting for the bus (to the next level in the cache hierarchy) to become available (because it should be busy bringing in the new line to replace the Castout), and then moving the line out of the Castout Buffer and over the bus to the next cache level. While a Castout sounds like it is a lot of trouble because it is a new operation that needs to be done, in fact the effect of Castouts is to reduce the overall traffic. This is because most lines that get modified get modified repeatedly. The Castout essentially aggregates these multiple modifications into a single transfer—unlike what occurs in the second approach to the Store Policy, which is a Store-Through approach.
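For purposes of illustration only, the Store-In behavior described above (local stores, a dirty bit per line, and a Castout on replacement of a modified line) can be sketched as follows. All names are hypothetical, and this toy model is not part of any claimed design:

```python
# Illustrative sketch: a minimal store-in cache model showing the dirty bit
# and the Castout path. Not the patented design; names are hypothetical.

class StoreInCache:
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = {}           # tag -> data
        self.dirty = {}           # tag -> dirty bit
        self.castout_buffer = []  # modified lines awaiting transfer upstream

    def store(self, tag, data):
        # A store-in cache only updates its local copy and sets the dirty
        # bit; nothing is sent to the next cache level yet.
        if tag not in self.lines and len(self.lines) >= self.num_lines:
            self._evict()
        self.lines[tag] = data
        self.dirty[tag] = True

    def _evict(self):
        victim = next(iter(self.lines))  # oldest line, for simplicity
        if self.dirty[victim]:
            # A modified line cannot simply be overwritten: it goes to the
            # Castout Buffer and is later written to the next cache level.
            self.castout_buffer.append((victim, self.lines[victim]))
        del self.lines[victim]
        del self.dirty[victim]

cache = StoreInCache(num_lines=2)
cache.store("A", "old-A")
cache.store("B", "old-B")
cache.store("A", "new-A")   # repeated stores aggregate in the same line
cache.store("C", "new-C")   # forces eviction; victim "A" is dirty
```

Note how the repeated stores to line “A” are aggregated into a single Castout transfer, which is exactly the traffic-reduction effect described above.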
In a Store-Through cache, when data is stored into the local cache, it is also “stored through” the cache, which means that it is stored into the next level of cache too. Thus the total store bandwidth coming out of a Store-Through cache is higher, since every store goes through it. It is noted that a Store-In cache has the effect of aggregating multiple stores made to the same location. It is also noted that, with a Store-Through cache, not only does the local cache have the most recent copy of the stored data, but the next layer of cache in the cache hierarchy has it as well. This means that remote misses can be serviced directly from the next layer of cache in the hierarchy (which may be quicker), and it also means that soft errors occurring in the lower level of cache are not fatal, since valid data exists in the next level above it.
Conventionally, server processors used for reliable applications all have Store-Through L1 caches, which means that each store made by the processor is done to both its L1 cache and to the next cache level in the hierarchy. This is precisely to protect against soft errors in modified L1 lines, which works because there is a recoverable copy of the data further up in the cache hierarchy.
Of course, having a Store-Through L1 cache would not be a requirement for reliability if Error Correcting Codes (ECC) were used at the L1 level, but this is very difficult to do for the following reason. Many stores in database applications are single byte stores. Maintaining ECC on a byte granularity requires 3 additional bits per byte, which is quite costly.
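A quick arithmetic sketch of these overheads, assuming a 128-byte line (a linesize mentioned elsewhere in this disclosure), shows why doubleword ECC is attractive:

```python
# Back-of-envelope comparison of the check-bit overheads discussed above,
# for an assumed 128-byte cache line. Illustrative only.
LINE_BYTES = 128
DOUBLEWORD_BYTES = 8

byte_parity_bits = LINE_BYTES * 1                    # 1 parity bit per byte
byte_ecc_bits = LINE_BYTES * 3                       # ~3 extra bits per byte
dw_ecc_bits = (LINE_BYTES // DOUBLEWORD_BYTES) * 8   # 8 ECC bits per doubleword

# Doubleword ECC costs the same number of bits as byte parity,
# while byte-granularity ECC triples the overhead.
print(byte_parity_bits, byte_ecc_bits, dw_ecc_bits)  # 128 384 128
```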
The alternative to using byte-ECC is to use doubleword (8 byte) ECC, which requires 8 bits per doubleword—the same overhead as byte parity. However, doubleword ECC would require a longer pipeline for byte store instructions, because the ECC would need to be regenerated for the entire doubleword containing the byte. Doing a byte store would no longer simply be a matter of storing a byte. Instead, it first would require reading out the original doubleword, then doing an ECC check to verify that the data in the doubleword is good, then merging the new byte into the doubleword, then regenerating the ECC for the modified doubleword, and finally, storing the new doubleword back. The performance lost to this longer pipeline can be significant.
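The lengthened read-merge-write sequence just described can be sketched as follows. The XOR “ECC” here is a stand-in used only to make the five steps runnable; a real design would use a SECDED code:

```python
# Sketch of the lengthened byte-store pipeline described above.
# The XOR checksum stands in for a real doubleword SECDED code.

def gen_ecc(doubleword: bytes) -> int:
    ecc = 0
    for b in doubleword:
        ecc ^= b
    return ecc

def byte_store_with_dw_ecc(memory, ecc_store, dw_addr, byte_offset, new_byte):
    # 1. Read out the original doubleword.
    dw = bytearray(memory[dw_addr])
    # 2. Do an ECC check to verify that the doubleword is good.
    assert gen_ecc(dw) == ecc_store[dw_addr], "ECC check failed"
    # 3. Merge the new byte into the doubleword.
    dw[byte_offset] = new_byte
    # 4. Regenerate the ECC for the modified doubleword.
    ecc_store[dw_addr] = gen_ecc(dw)
    # 5. Store the new doubleword back.
    memory[dw_addr] = bytes(dw)

memory = {0: bytes(8)}                   # one zeroed doubleword
ecc_store = {0: gen_ecc(memory[0])}
byte_store_with_dw_ecc(memory, ecc_store, 0, 3, 0xAB)
```

Every byte store pays for all five steps, which is the pipeline-lengthening cost the text refers to.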
In some cases, for performance reasons it is more desirable to have the L1 be a Store-In cache. In a Store-In cache, stores do not percolate through the L1 into the rest of the hierarchy, but instead are accumulated in the L1 lines. The only event in which data is written up to the next level in the hierarchy is if a modified line is chosen (by the L1) for replacement, i.e., for a Castout. In this case, the entire line is written out to the next cache level in the hierarchy.
One reason that this is desirable is that the higher levels in the hierarchy are shielded from the raw store bandwidth. Another reason is that certain optimizations can be made in higher levels of the hierarchy if they need only deal with a single store quantum (e.g., just lines as opposed to both lines and doublewords).
In conventional systems and methods, even when a Store-In cache is preferable, such is not an option if reliable operation is a requirement. The present invention overcomes the above problems.
Some conventional Store-In and Store-Through cache implementations are described below.
In the exemplary arrangement illustrated in
When there is a cache miss, the cache 101 sends the miss transaction to the Bus Interface Unit (BIU) 102. The “transaction” includes the miss address and the desired state of the miss data (meaning shared or exclusive). The BIU 102 forwards this information to the next cache level in the hierarchy as a “miss request.” In the meantime, if the cache selects a line for replacement (to be replaced by the line brought in by the miss) that has been modified locally, the modified line needs to be sent to the next cache level to update its copy of the line. To prepare for this, the cache 101 moves the modified line into the Castout Buffer (COB) 103, which notifies the BIU 102 that it has a Castout.
Typically, by the time that the modified line is moved from the cache 101 to the Castout Buffer 103, the BIU 102 will be in the process of handling the incoming line from the miss request, and putting it into the cache 101. Once the incoming line has been completely transferred, the BIU 102 will send the modified line from the Castout Buffer 103 up to the next cache level in the hierarchy (not shown).
Note that the processor 100 interacts with the cache on either a doubleword granularity for fetches, or on a byte granularity for stores.
For purposes of this disclosure, “byte granularity” generally means that the stores can be as small as a single byte, but they can also be multiple bytes, up to a doubleword.
On the other hand, the Bus Interface Unit 102, hence the next cache in the hierarchy (not shown), only works with cache lines, which are typically 128 bytes. That is to say that all transactions to the next cache level in the hierarchy (not shown) are either line fetches or line stores. This means that the next level in the cache hierarchy can be optimized to handle only lines.
The rate of transactions (which are all line transactions) to the next cache level in the hierarchy is the basic L1 miss rate (which produces line fetch requests) plus the Castout rate (which produces line store requests). Since only a fraction of the misses will cause Castouts, the Castout rate will be a fraction of the miss rate.
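With assumed, purely illustrative numbers, the observation reads:

```python
# Worked example of the transaction-rate statement above.
# Both values are assumptions chosen for illustration, not measured data.
miss_rate = 0.02          # L1 misses per instruction (assumed)
castout_fraction = 0.3    # fraction of misses whose victim is dirty (assumed)

castout_rate = miss_rate * castout_fraction
total_line_transactions = miss_rate + castout_rate
print(total_line_transactions)
```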
The PSB 203 deals only in doublewords. Within the PSB, a doubleword Error Correcting Code (ECC) is generated for the doubleword sent by the processor (not shown), and the (now protected) doubleword is buffered until the instruction that did the store operation has been completed.
Typically the ECC is a Single Error Correcting, Double Error Detecting (SECDED) code, which does just what it says: if a single bit is flipped, the ECC will be able to determine which bit it was, and it will correct it; if two bits are flipped, the ECC will be able to detect that the data is bad, but it will not be able to correct the data.
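As an illustration only, a toy extended Hamming(8,4) code over 4 data bits exhibits the same SECDED behavior. Real caches use much wider codes (e.g., 8 check bits per 64-bit doubleword), but the single-correct/double-detect principle is identical:

```python
# Toy SECDED example: extended Hamming(8,4) on 4 data bits.
# Illustrative only; not the code used in any particular product.

def encode(data_bits):
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    code = [0, p1, p2, d1, p3, d2, d3, d4]  # index = Hamming position
    code[0] = sum(code) % 2                 # overall parity bit
    return code

def decode(code):
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall_ok = sum(code) % 2 == 0
    if syndrome and not overall_ok:
        # Single-bit error: the syndrome names the flipped position.
        code = code.copy()
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome and overall_ok:
        # Two bits flipped: detectable, but not correctable.
        return None, "uncorrectable"
    else:
        status = "ok"
    return [code[3], code[5], code[6], code[7]], status

data = [1, 0, 1, 1]
codeword = encode(data)
```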
When the store instruction is completed, the processor 200 notifies the PSB 203 that the stored data should be sent to the next cache level in the hierarchy (not shown). The PSB 203 sends a doubleword store request to the BIU 202, which will send the modified doubleword up to the next cache level in the hierarchy (not shown).
Meanwhile, as was the case with the Store-In cache of
Note that in this case, there are two granularities of data that are used in the next cache level in the hierarchy. For misses, there are line-oriented fetch requests sent to the next level. These requests occur at the L1 cache 201 miss rate. And for every store issued by the processor 200, there is a doubleword store request sent to the next level in the hierarchy. Thus, the next cache level cannot be optimized for a single data granularity, since it must deal both with lines and with doublewords. Further, the next cache level is subjected to the full store-bandwidth of the processor 200.
In view of the foregoing and other exemplary problems, drawbacks, and disadvantages of the related art methods and structures, an exemplary feature of the present invention is to provide a system, method, and framework for providing an improved store-in cache, and more particularly, to the operation of stores in a cache system and the reliable maintenance of locally modified data in such a cache system.
More particularly, an exemplary feature of the present invention provides a Store-In cache with an additional mechanism, which includes an “Ancillary Store-Only Cache” (ASOC). The ASOC according to the present invention can provide and ensure robust reliability of the system. One skilled in the art would recognize that the ASOC according to the present invention, and how it operates, can include many variations.
An exemplary feature of the present invention provides a method and apparatus for protecting a Store-In cache, which may hold the only valid copy of recently stored data, from soft errors. This allows the use of a Store-In policy when it is desirable for performance reasons, without sacrificing the robust recovery capability that is normally sacrificed when the Store-In policy is used.
Conventionally, server processors used for reliable applications all have store-through caches, which means that each store is done to both the L1 cache and to the next cache level in the hierarchy. This is so that if soft errors occur in modified L1 lines, there is a recoverable copy of the data further up in the hierarchy.
Of course, this would not be necessary if Error Correcting Codes were used at the L1 level, but this is very difficult to do for the following reasons.
Many stores in database applications are single byte stores. Maintaining ECC on a byte granularity requires 3 additional bits per byte, which can be quite costly. An alternative is to use doubleword (8 byte) ECC, which requires 8 bits per doubleword—the same overhead as byte parity. However, doubleword ECC would require a longer pipeline for byte store instructions. Such a store would require reading out the original doubleword, doing an ECC check, merging in the new byte, regenerating the ECC, and storing back the new doubleword. Hence, the performance lost to this longer pipeline can be significant.
In some cases, for performance reasons, it is more desirable to have the L1 be a store-in cache. In a store-in cache, stores do not percolate through the L1 into the rest of the hierarchy, but instead are accumulated in the L1 lines. The only event in which data is written up to the next level in the hierarchy is if a modified line is chosen (by the L1) for replacement. In this case, the entire line is written out to the next cache level in the hierarchy. One reason that this is desirable is that the higher levels in the hierarchy are shielded from the raw store bandwidth. Another reason is that certain optimizations can be made in higher levels of the hierarchy if they need only deal with a single store quantum (e.g., just lines as opposed to both lines and doublewords).
Conventionally, even when a store-in L1 is preferable, it is not an option if reliable operation is a requirement. The present invention provides a system and method that overcomes such problems with the conventional methods and systems.
For example, the exemplary aspects of the present invention can provide a store-in L1 with an additional means, called an “Ancillary Store-Only Cache” (ASOC), which ensures robust reliability of the system. According to the exemplary aspects of the present invention, the ASOC, and how it operates, can include many variations. Several exemplary aspects of the present invention are described below. However, anyone skilled in the art will recognize that the present invention is not limited to the examples provided below.
For purposes of the present invention, the ASOC generally is defined as a small cache (e.g., 8-16 lines) having the same linesize as the L1. The ASOC can be a cache of the most-recently stored-to lines. Lines in the ASOC can be kept with doubleword ECC.
When a line is first stored-to, the line can be fetched from the L1, and copied into the ASOC, while generating doubleword ECC during the transfer. For byte stores, the object doubleword can be read from the L1 during store pretest, and it can be parity-checked. When the store is committed, the new byte can be written into the L1 with its parity. This is the path that the pipeline “sees.” What the pipeline does not see is that the byte is then merged into the object doubleword (that was previously fetched), ECC is generated, and the new doubleword is written into the ASOC.
If no parity errors are encountered in the L1, the contents of the ASOC may not be used. However, if there is a parity error, the correct data can be recovered from the ASOC.
When a line ages out of the ASOC, the exemplary aspects of the invention can, for example, do either of two things. First, the exemplary aspects of the invention can write the line out into the hierarchy (and mark it “unmodified” in the L1). Alternatively, the exemplary aspects of the invention can just write the line back into the L1. According to the present invention, the L1 should have the valid data contents, since the present invention would be updating it all along. However, what the line does not have is an ECC—which it needs if it is to remain in the L1 (but not in the ASOC) in a “modified” state. Thus, all that the exemplary aspects of the invention would need to do is to write back the ECC.
The exemplary aspects of the invention take advantage of the fact that doubleword ECC is the same number of bits as byte parity. When the present invention ages a line out of the ASOC, the exemplary aspects of the invention can overwrite the parity bits in the L1 with the corresponding doubleword ECC bits, and set a new state bit to indicate that the check bits for the line are ECC bits, and not parity.
Alternatively, the exemplary aspects of the invention can allocate space in the L1 cache for both ECC and for parity. This is a relatively low cost overhead. The exemplary aspects of the invention also need to indicate whether the ECC bits are valid. For unmodified lines, they will not be. However, according to the exemplary aspects of the invention, once a line has become modified, the ECC bits should be valid.
It is also noted that the exemplary aspects of the invention do not actually need to copy the entire contents of a line from the L1 to the ASOC when the line is first put into the ASOC. Instead, the present invention need only maintain the doublewords that are actually stored to. The exemplary aspects of the invention treat the doublewords in an ASOC line as sectors, and use a “presence” bit for each stored doubleword.
When storing the sector ECCs back to the L1, the exemplary aspects of the invention can indicate (within the L1) which of the sectors (doublewords) have actually been modified, so that it can be known that the checkbits associated with those sectors are actually ECC bits. Alternatively, in a case in which there is room for both, the exemplary aspects of the invention can indicate which ones have actually been set.
If it is desirable to keep both byte-parity and doubleword ECC, but the full overhead of ECC for all doublewords (an additional bit per byte) is undesirable, the exemplary aspects of the invention can instead allocate space for only a subset of the doublewords in a line (e.g., 2, 3, or 4) with an indication of which doublewords these are associated with. In this last exemplary case, lines having more than this many doublewords modified can be cast out (to the hierarchy) when this threshold is exceeded.
The practice of these exemplary methods, together with the exemplary apparatuses described above, enables store-in behavior (as seen by the pipeline and by the rest of the cache hierarchy) while providing the robust protection of a store-through cache.
In one exemplary aspect of the invention, a hardened store-in cache mechanism includes a store-in cache having lines of a first linesize stored with checkbits. The checkbits have byte-parity bits. The hardened store-in cache mechanism also includes an ancillary store-only cache (ASOC) that holds a copy of most recently stored-to lines of the store-in cache. The ancillary store-only cache (ASOC) includes fewer lines than the store-in cache. Each line of the ancillary store-only cache (ASOC) has the first linesize stored with the checkbits, and the checkbits of the ancillary store-only cache (ASOC) are doubleword Error Correcting Code (ECC) for each doubleword within the stored-to lines. The stored-to lines are marked as being modified within the store-in cache when the stored-to lines are stored to using a modified indicator.
In another exemplary aspect of the invention, a hardened store-in cache mechanism includes a store-in cache having lines of a first linesize stored with checkbits, wherein the checkbits include byte-parity bits, and storing means for holding a copy of most recently stored-to lines of the store-in cache, wherein the storing means includes fewer lines than the store-in cache, each line of the storing means having the first linesize stored with the checkbits, the checkbits of the storing means being doubleword Error Correcting Code (ECC) for each doubleword within the stored-to lines, and the stored-to lines being marked as being modified within the store-in cache when the stored-to lines are stored to using a modified indicator.
Another exemplary aspect of the invention is directed to a method of controlling, storing, and recovering data in a store-in cache system having a store-in cache having lines of a first linesize stored with checkbits, wherein the checkbits are byte-parity bits, and an ancillary store-only cache (ASOC) that holds a copy of most recently stored-to lines of the store-in cache, wherein the ancillary store-only cache (ASOC) includes fewer lines than the store-in cache, each line of the ancillary store-only cache (ASOC) having the first linesize stored with the checkbits, the checkbits of the ancillary store-only cache (ASOC) being doubleword Error Correcting Code (ECC) for each doubleword within the stored-to lines, and the stored-to lines being marked as being modified within the store-in cache when the stored-to lines are stored to using a modified indicator. The exemplary method includes storing the most recently stored-to lines of the store-in cache into the ancillary store-only cache (ASOC) with doubleword Error Correcting Codes, reading data stored into the ancillary store-only cache only when the corresponding copy of that data is found to have parity errors in the store-in cache, and using the read data from the ancillary store-only cache to overwrite the data having parity errors in the store-in cache.
The foregoing and other exemplary purposes, aspects and advantages will be better understood from the following detailed description of exemplary aspects of the invention with reference to the drawings, in which:
Referring now to the drawings, and more particularly to
The present invention relates to a method and system for providing an improved store-in cache, and more particularly, to the operation of stores in a cache system and the reliable maintenance of locally modified data in such a cache system.
Thus, the ASOC 304 and its basic function provide an important advantage over the conventional systems. Of course, those skilled in the art would recognize that there are many variations on the specifics of how the exemplary ASOC 304 can be used and/or managed. Some examples of such variations are provided below. The basic broad function of the ASOC 304 is explained first.
Fundamentally, the ASOC 304 is another cache that is (logically) operated in parallel with the Store-In cache 301. However, the exemplary ASOC 304 may only hold lines that have been stored into (modified) locally. Further, the ASOC 304 can hold those lines with doubleword ECC, whereas the Store-In cache 301 holds the same lines with byte parity (at least in this first exemplary aspect).
Furthermore, for purposes of this disclosure, the term “doubleword” generally is used as a proxy for any quantum that is larger than a byte but smaller than a cache line. That is, by using the word “doubleword,” the present invention is not restricted exclusively to an 8-byte quantum; that size is used for exemplary purposes only. For example, an optimization that happens to be particular to an 8-byte quantum will be described below.
When a store is issued by the processor 300 in the exemplary “Hardened Store-In Cache” system of
However, at the same time, the line to which the store is issued can be copied into the ASOC 304, and doubleword ECC can be generated for the line in this exemplary process. When the store is first issued, the doubleword into which the byte is stored can be prefetched into the processor 300 just as was done in the case of the Store-Through cache illustrated in
In essence, this merging of the new byte into the object doubleword, and generation of doubleword ECC can be similar to the undesirable operation of dealing with doublewords and ECC that was described in the Background section above. For example, in the Background section, it was explained that such was undesirable because it lengthened the pipeline associated with the store operation, which can have a deleterious effect on performance. However, such problems can be avoided or overcome with the ASOC, according to the exemplary aspects of the present invention.
According to the present invention, the ASOC is not part of the processor's store pipeline, hence the merging of bytes and the generation of doubleword ECC have no effect on the performance of the processor's pipeline. The processor pipeline involves only the Store-In cache 301, and is similar to (or the same as) the processor pipeline of
Thus, the ASOC is simply a small cache that keeps a copy of all of the locally-modified lines, and it keeps those lines with ECC. This allows the processor to work with the Store-In cache of
The exemplary “Hardened Store-In Cache” system of
Those of ordinary skill in the art will recognize that variations on the above exemplary aspects of the invention can include any or all variations in how the ASOC is actually managed, and what is actually kept in both the ASOC and the main cache.
Further Exemplary Aspects of the Invention
As mentioned above, for purposes of the present application, the ASOC generally is defined as a small cache (e.g., 8-16 lines) having the same line size as the L1. The ASOC can be a cache of the most-recently stored-to lines. Lines in the ASOC can be kept with doubleword ECC.
When a line is first stored-to, the line is fetched from the L1, and copied into the ASOC, while generating doubleword ECC during the transfer. For byte stores, the object doubleword is read from the L1 during store pretest, and it is parity-checked. When the store is committed, the new byte is written into the L1 with its parity. This is the path that the pipeline “sees.” What the pipeline does not see is that the byte is then merged into the object doubleword (that was previously fetched during the store pretest), ECC is generated, and the new doubleword is written into the ASOC.
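The dual path just described — the pipeline-visible byte-with-parity store into the L1, and the off-pipeline merge into an ECC-protected ASOC doubleword — can be sketched as follows. All names are hypothetical, and the XOR checksum is a stand-in for a real doubleword SECDED code:

```python
# Sketch of the ASOC store and recovery paths described above.
# Illustrative only; names and the toy checksum are hypothetical.

def byte_parity(b):
    return bin(b).count("1") % 2

def dw_ecc(dw):                   # stand-in for a real doubleword SECDED code
    ecc = 0
    for b in dw:
        ecc ^= b
    return ecc

class ASOCPath:
    def __init__(self):
        self.l1_data = {}         # (line, byte offset) -> (byte, parity)
        self.asoc = {}            # (line, dw index) -> (doubleword, ecc)

    def store_byte(self, line, offset, new_byte, object_dw):
        # Pipeline-visible path: the new byte and its parity go into the L1.
        self.l1_data[(line, offset)] = (new_byte, byte_parity(new_byte))
        # Off-pipeline path: merge the byte into the previously fetched
        # object doubleword, generate ECC, write into the ASOC.
        dw = bytearray(object_dw)
        dw[offset % 8] = new_byte
        self.asoc[(line, offset // 8)] = (bytes(dw), dw_ecc(dw))

    def recover(self, line, offset):
        # On an L1 parity error, recover the correct byte from the ASOC.
        dw, ecc = self.asoc[(line, offset // 8)]
        assert dw_ecc(dw) == ecc
        return dw[offset % 8]

path = ASOCPath()
path.store_byte(line=0, offset=11, new_byte=0x5A, object_dw=bytes(8))
```

The store pipeline only ever touches `l1_data`; the ASOC work happens off to the side, which is why it adds no pipeline stages.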
If no parity errors are encountered in the L1, the contents of the ASOC generally are not used. However, if there is a parity error, the correct data can be recovered from the ASOC.
When a line ages out of the ASOC, the exemplary aspects of the present invention can do either of two things. First, the present invention can write the line out into the hierarchy (and mark it “unmodified” in the L1). Alternatively, the present invention can write the line back into the L1.
Recall that the L1 should have the valid data contents, since the present invention can be updating it all along. However, what the line in the L1 does not have is an ECC, which it needs if it is to remain in the L1 (but not in the ASOC) in a “modified” state. Thus, the present invention merely needs to write back the ECC into the checkbits that had originally held parity.
The present invention takes advantage of the fact that doubleword ECC is the same number of bits as byte parity. When the present invention ages a line out of the ASOC, the present invention can overwrite the parity bits in the L1 with the corresponding doubleword ECC bits, and set a new state bit to indicate that the check bits for the line are ECC bits, and not parity.
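For illustration only, the age-out idea above can be sketched as follows; it works because doubleword ECC occupies exactly the same number of check bits as byte parity (8 bits per 8-byte doubleword). Names are hypothetical:

```python
# Sketch of the age-out path: the parity field of an L1 line is overwritten
# with the doubleword ECC bits from the ASOC, and a state bit is flipped.

DW_PER_LINE = 16   # e.g., a 128-byte line of 8-byte doublewords

class L1LineCheckbits:
    def __init__(self, parity_bits):
        self.checkbits = parity_bits    # 8 bits per doubleword, as byte parity
        self.checkbits_are_ecc = False  # the new state bit described above

    def age_out_from_asoc(self, dw_ecc_bits):
        # Same overhead, so the ECC drops straight into the parity field.
        assert len(dw_ecc_bits) == len(self.checkbits)
        self.checkbits = dw_ecc_bits
        self.checkbits_are_ecc = True

line = L1LineCheckbits(parity_bits=[0] * DW_PER_LINE)
line.age_out_from_asoc([1] * DW_PER_LINE)
```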
Note that, strictly speaking, the present invention does not actually need this new state bit. The ordinarily skilled artisan would recognize that, if the cache were managed in this way, all modified lines in the L1 that are not in the ASOC must have ECC.
Alternatively, the present invention can allocate space in the L1 cache for both ECC bits and for parity bits. This is a relatively low cost overhead. In this exemplary aspect, the present invention also could indicate whether the ECC bits are valid. It is noted that, for unmodified lines, they will not be. However, once a line has become modified, the ECC bits should be valid.
It also is noted that this exemplary aspect of the present invention does not actually need to copy the entire contents of a line from the L1 to the ASOC when the line is first put into the ASOC. Instead, this exemplary aspect of the present invention would only need to maintain the doublewords that are actually stored to. The present invention can treat the doublewords in an ASOC line as sectors, and can use a “presence” bit for each stored doubleword.
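The sector (presence-bit) optimization just described can be sketched as follows; all names are hypothetical:

```python
# Sketch: an ASOC line holds only the doublewords actually stored to,
# with one "presence" bit per sector. Illustrative only.

DW_PER_LINE = 16   # e.g., a 128-byte line of 8-byte doublewords

class ASOCLine:
    def __init__(self):
        self.present = [False] * DW_PER_LINE  # one presence bit per sector
        self.sectors = [None] * DW_PER_LINE

    def store_doubleword(self, index, dw):
        self.sectors[index] = dw
        self.present[index] = True

    def modified_sectors(self):
        # On writeback, these are the L1 checkbits that are really ECC bits.
        return [i for i, p in enumerate(self.present) if p]

line = ASOCLine()
line.store_doubleword(2, b"\x00" * 8)
line.store_doubleword(7, b"\xff" * 8)
```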
When the sector ECCs are stored back to the L1, the exemplary aspect of the present invention can indicate (within the L1) which of the sectors (doublewords) have actually been modified, so that it is known that the check bits associated with those sectors are actually ECC bits. Alternatively, in the exemplary case in which there is room for both, the present invention can provide an indication of which ones have actually been set.
If it is desirable to keep both byte-parity and doubleword ECC, but it is not desirable to have the full overhead of ECC for all doublewords (an additional bit per byte), space can instead be allocated for only a subset of the doublewords in a line (e.g., 2, 3, or 4) with an indication of which doublewords these are associated with. In this last exemplary case, lines having more than this many doublewords modified are cast out (to the hierarchy) when such a threshold is exceeded.
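The threshold behavior just described can be sketched as follows; the slot count of 4 is an assumed value for illustration:

```python
# Sketch: ECC space exists for only a few doublewords per line (assumed 4);
# modifying a fifth distinct doubleword forces a Castout of the whole line.

MAX_ECC_SLOTS = 4   # assumed subset size

def store_to_line(modified_dws, dw_index, castouts):
    modified_dws = set(modified_dws)
    modified_dws.add(dw_index)
    if len(modified_dws) > MAX_ECC_SLOTS:
        # Threshold exceeded: cast the line out to the hierarchy and
        # clear the per-line modification tracking.
        castouts.append(sorted(modified_dws))
        return set(), castouts
    return modified_dws, castouts

mods, castouts = set(), []
for dw in [1, 3, 5, 7, 9]:   # the fifth distinct doubleword exceeds 4 slots
    mods, castouts = store_to_line(mods, dw, castouts)
```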
According to the exemplary aspects of the invention, the ASOC need not actually contain the doubleword data. Instead, the exemplary ASOC can simply be a cache that just contains the ECC bits for the modified lines in the L1.
The practice of these exemplary methods, together with the exemplary apparatuses described above, can enable store-in behavior (as seen by the pipeline and by the rest of the cache hierarchy) while providing the robust protection of a store-through cache.
While the invention has been described in terms of several exemplary aspects, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the appended claims.
Further, it is noted that Applicants' intent is to encompass equivalents of all claim elements, even if amended later during prosecution.