This disclosure generally relates to cache memory systems and more particularly, but not exclusively, to the use of a victim cache to provide protection from side-channel attacks.
In a processor-based system, a cache memory is used to temporarily store information including data or instructions to enable more rapid access by processing elements of the system such as one or more processors, graphics devices and so forth. Modern processors include internal cache memories that act as repositories for frequently used and recently used information. Because this cache memory is within a processor package and typically on a single semiconductor die with one or more cores of the processor, much more rapid access is possible than from more remote locations of a memory hierarchy, which include system memory.
To enable maintaining the most relevant information within a cache, some type of replacement mechanism is used. Many systems implement a type of least recently used algorithm to maintain information. More specifically, each line of a cache is associated with metadata information relating to the relative age of the information such that when a cache line is to be replaced, an appropriate line for eviction can be determined.
Over the years, caches have become a source of information leakage exposed to “side-channel” attacks, whereby a malicious agent is able to infer sensitive data (e.g., cryptographic keys) that is processed by a victim software process. Typically, cache-based side-channel attacks, which exploit cache-induced timing differences of memory accesses, are used to break Advanced Encryption Standard (AES), Rivest-Shamir-Adleman (RSA) or other cryptographic protections, to bypass address-space layout randomization (ASLR), or to otherwise access critical information. As the number and variety of side-channel attacks continue to increase, there is an increasing demand for improved protections for cache memory systems.
The various embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
Embodiments discussed herein variously provide techniques and mechanisms for a victim cache to operate in conjunction with another cache to help mitigate the risk of a side-channel attack. Certain features of various embodiments are described herein with reference to the provisioning of a victim cache in support of another cache (which is referred to herein as a “primary” cache). However, other embodiments incorporate a similar use of a victim resource in support of a corresponding primary resource of a same resource type. For example, other embodiments variously provision one “victim” coherence directory in support of another “primary” coherence directory (e.g., a victim snoop filter in support of a corresponding primary snoop filter).
Embodiments described herein variously supplement operation of a primary cache with another cache (referred to herein as a “victim cache”) which is to be an at least temporary repository of a line that is evicted from the primary cache. Such a line is subject to being moved back (or “reinserted”) into the primary cache—e.g., in response to a request by an executing process to access the line. It is to be noted that the term “primary” is used herein in the limited context of data being inserted in such a cache as one primary (or “preliminary”) condition for said data potentially being inserted, subsequently, in a corresponding victim cache. More particularly, various embodiments are not limited as to whether or not some other cache might also have received such data prior to the primary cache.
Certain features of some embodiments are described herein with reference to the use of a victim cache to supplement a primary cache which operates as a shared cache of a processor. However, in alternative embodiments, such a primary cache instead operates as any of various other types of caches including, but not limited to, a lowest level (L0) cache, an L1 cache, an L2 cache, a cache which is external to a processor, or the like.
As used herein with respect to the storing of information at a cache, “line” refers to information which is to be so stored at a given location (e.g., by particular memory cells in a set) of that cache. As variously used herein, a label of the type “LK” represents a given line which is addressable with a corresponding address labeled “K”—e.g., wherein a line LC corresponds to an address C, a line LV corresponds to an address V, a line LX corresponds to an address X, a line LY corresponds to an address Y, etc.
As used herein, “skewed cache” refers to a cache which is partitioned into multiple divisions comprising respective ways (which, in turn, each comprise respective sets). In some embodiments, randomization of a victim cache is provided, for example, by the use of one or more (pseudo)random indices each for a respective set of the victim cache—e.g., wherein a cipher block or hash function generates one or more indices based on address information and a corresponding one or more key values. In some embodiments, indexing of a victim cache is regularly updated by replacing or otherwise modifying such one or more key values.
Additionally or alternatively, randomization of a primary cache (e.g., a skewed cache) is provided, for example, by the use of (pseudo)random indexing for one or more sets of the primary cache—e.g., wherein a cipher block or hash function generates indices based on address information and a corresponding one or more other key values. In some embodiments, indexing of a primary cache is regularly updated by replacing or otherwise modifying such one or more other key values.
In some embodiments, access to a randomized victim cache and/or a randomized primary cache—such as a randomized skewed cache (RSC)—is provided with encryption functionality or hashing functionality in a memory pipeline. In one such embodiment, a set of one or more keys is generated (e.g., randomly) upon a boot-up or other initialization process. Such a set of one or more keys is stored for use in determining cache indices, and (for example) is to be changed at some regular—e.g., configurable—interval.
In various embodiments, an encryption scheme (or hash function) used to calculate per-division set indices is sufficiently lightweight—as determined by implementation-specific details—to accommodate critical time constraints of cache operations. However, the encryption scheme (or hash function) is preferably strong enough, for example, to prevent malicious agents from easily finding addresses that have cache-set contention. QARMA, PRINCE, and SPECK are examples of some types of encryption schemes which are variously adaptable to facilitate generation of set indices in certain embodiments.
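By way of illustration, the set-index derivation described above can be sketched in software as follows. This is a minimal model only: a keyed BLAKE2b hash stands in for a lightweight block cipher such as QARMA, PRINCE, or SPECK, and the set count, key width, and function name are assumed values for illustration, not parameters of any particular embodiment.

```python
import hashlib

S = 64  # number of sets (assumed power of two)

def set_index(address: int, key: bytes, sets: int = S) -> int:
    """Derive a pseudorandom set index from a cache line address and a secret key.

    A keyed BLAKE2b hash stands in for a lightweight block cipher such as
    QARMA, PRINCE, or SPECK; a hardware design would use the cipher directly.
    """
    digest = hashlib.blake2b(address.to_bytes(8, "little"),
                             key=key, digest_size=8).digest()
    # Slice out log2(sets) bits of the encrypted (hashed) address as the index.
    return int.from_bytes(digest, "little") & (sets - 1)

key = b"\x01" * 16
idx = set_index(0x7FFF_2040, key)
assert 0 <= idx < S
```

Because the set count is a power of two, masking with `sets - 1` is equivalent to slicing out the low log2 s bits of the hashed address; changing the key changes the address-to-set mapping.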
In various embodiments, the victim cache is implemented as a fully associative or as a set-associative cache that is to be accessed using an independently randomized mapping. In one such embodiment, the victim cache is to be searched (for example) in parallel with a search of the primary cache. In one such embodiment, the primary cache and the victim cache have substantially the same hit latency (e.g., where one such latency is within 10% of the other latency).
In an embodiment, the primary cache and/or the victim cache include (or otherwise operate based on) controller circuitry that is able to automatically access the primary cache to reinsert lines as described herein. Additionally or alternatively, arbitration circuitry is provided to arbitrate between use of the primary cache for reinsertion of lines from the victim cache, and use of the primary cache by the core.
In providing a victim cache with functionality to reinsert lines from the victim cache to a primary cache, some embodiments make it significantly more difficult for a malicious agent to observe cache contention for the primary cache. As a result, such embodiments enable a relatively large interval at which encryption keys (and/or other information for securing cache accesses) should be updated to effectively protect from contention-based cache attacks.
As shown in
A given core 102 supports one or more instruction sets such as the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, Calif.; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, Calif.; a RISC-V instruction set; or the like. It should be understood that, in some embodiments, a core 102 supports multithreading—i.e., executing two or more parallel sets of operations or threads—and (for example) does so in any of a variety of ways including time sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that physical core is simultaneously multithreading), or a combination thereof (e.g., time sliced fetching and decoding combined with simultaneous multithreading thereafter such as in the Intel® Hyper-Threading technology).
In one example embodiment, processor 100 is a general-purpose processor, such as a Core™ i3, i5, i7, 2 Duo and Quad, Xeon™, Itanium™, XScale™ or StrongARM™ processor, which are available from Intel Corporation, of Santa Clara, Calif. Alternatively, processor 100 is from another company, such as ARM Holdings, Ltd, MIPS, etc. In other embodiments, processor 100 is a special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, co-processor, embedded processor, or the like. In various embodiments, processor 100 is implemented on one or more chips. Alternatively, or in addition, processor 100 is a part of and/or implemented on one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS. In some embodiments, a system on a chip (SoC) includes processor 100.
In general, execution circuitry 110 of core 102a operates to fetch instructions, decode the instructions, execute the instructions and retire the instructions. In one such embodiment, some of these instructions—e.g., including user-level instructions or privileged level instructions—are encoded to allocate data or instructions into a cache, and/or to access the cache to read said data or instructions.
Processor 100 comprises one or more levels of cache—e.g., wherein some or all of cores 102a, 102b, . . . , 102n each include a respective one or more cache levels, and (for example) wherein one or more caches of processor 100 are shared by various ones of cores 102a, 102b, . . . , 102n. In the illustrative embodiment shown, core 102a comprises a lowest level (L0) cache 130, and a next higher level of cache, namely a level 1 (L1) cache 140 which is coupled to L0 cache 130. Some or all of cores 102a, 102b, . . . , 102n are each coupled to shared cache circuitry 150 that, in turn, is coupled to a system agent 160, also referred to as uncore circuitry, which can include various components of a processor such as power control circuitry, memory controller circuitry, interfaces to off-chip components and the like. Although shown at this high level in the embodiment of
As seen in
In one example embodiment, shared cache circuitry 150 comprises memory regions to provide respective caches 152, 154—e.g., wherein cache 154 is to function as a random victim cache for cache 152. In an embodiment, control circuitry 120 and shared cache circuitry 150 operate to provide a partitioning of cache 152—e.g., where such partitioning facilitates operation of cache 152 as a skewed cache. For example, control circuitry 120 provides functionality to partition cache 152 into one or more divisions—e.g., including the illustrative divisions D0, D1 shown—which are each arranged into respective columns (or “ways”), which in turn each comprise respective sets. The various ways of cache 152 provide multiple respective degrees of set associativity. To help protect against side-channel attacks which target cache 152, control circuitry 120 operates victim cache 154, in some embodiments, as a repository to receive lines which are evicted from cache 152. In one such embodiment, a given one of said lines is subsequently evicted from cache 154 for reinsertion into cache 152.
For example, cache look-up circuitry 122 of control circuitry 120 illustrates any of a variety of one or more of microcontrollers, application-specific integrated circuits (ASICs), state machines, programmable gate arrays, mode registers, fuses, and/or other suitable circuit resources which are configured to identify a particular location—at one of caches 152, 154—to which (or from which) a given line is to be cached, read, evicted, reinserted or otherwise accessed. In one such embodiment, cache look-up circuitry 122 performs a calculation, look-up or other operation to identify an index for the cache location—e.g., based on an address which has been identified as corresponding to a given line. For example, the index is determined based on an encryption (or hash) operation which uses the address. In some embodiments, cache look-up circuitry 122 also provides functionality to introduce at least some further randomization with respect to how cache locations are each to correspond to a respective index, and/or with respect to how lines are each to be stored to a respective cache location. In one such embodiment, cache look-up circuitry 122 further supports a regular updating of indices which are to be used to access cache 152—e.g., wherein cache look-up circuitry 122 updates encryption keys or hash functions which are variously used to determine said indices. In various embodiments, cache look-up circuitry 122 performs one or more operations which, for example, are adapted from conventional techniques for partitioning, accessing or otherwise operating a cache (such as a randomized and/or skewed cache).
Cache insertion circuitry 124 of control circuitry 120 illustrates any of a variety of one or more of microcontrollers, ASICs, state machines, programmable gate arrays, mode registers, fuses, and/or other suitable circuit resources which are configured to store a given line into a location of one of caches 152, 154. For example, cache insertion circuitry 124 operates (e.g., in conjunction with cache look-up circuitry 122) to identify a location at cache 152 which is to receive a line which is to be evicted from cache 154 or, for example, a line which is retrieved from an external memory (not shown) that is to be coupled to processor 100. Alternatively or in addition, cache insertion circuitry 124 operates to identify a location at cache 154 which is to receive a line which is to be evicted from cache 152, for example. In various embodiments, cache insertion circuitry 124 performs one or more operations which are adapted from conventional techniques for storing lines to a cache.
Cache eviction circuitry 126 of control circuitry 120 illustrates any of a variety of one or more of microcontrollers, ASICs, state machines, programmable gate arrays, mode registers, fuses, and/or other suitable circuit resources which are configured to evict a line from one of caches 152, 154. For example, cache eviction circuitry 126 operates—e.g., in conjunction with cache look-up circuitry 122—to identify a location at cache 152 which stores a line to be evicted to cache 154 (or, for example, to be evicted to an external memory). Alternatively or in addition, cache eviction circuitry 126 operates to identify a location at cache 154 which stores a line to be evicted to cache 152, for example. In various embodiments, cache eviction circuitry 126 performs one or more operations which are adapted from conventional techniques for evicting lines from a cache.
In various embodiments, access to victim cache 154 is also randomized—e.g., wherein victim cache 154 is a fully associative cache with wVC ways, or (for example) a set-associative cache with wVC ways and sVC sets—i.e., with NVC=(sVC·wVC) lines. In one such embodiment, a randomized, set-associative victim cache (similar to a randomized, set-associative primary cache) uses a cryptographic function based on a block cipher E (or a hash function H) and a secret key KVC to derive the victim cache's set index from the cache line address. Lines that are evicted from the primary cache are put into the victim cache. Lines that hit in the victim cache can be reinserted into the primary cache but may also stay in the victim cache without loss of security. To introduce randomness into replacement decisions, either the victim cache or the primary cache uses a randomized replacement policy. For a good performance-security trade-off, the primary cache will, in some embodiments, use a well-performing replacement algorithm like (Quad-Age/Octa-Age)-LRU, whereas the victim cache will use random replacement, for example.
In an illustrative scenario according to one embodiment, cache 152 is operated as a primary cache which comprises N=(s·w) lines (where the integer s is a number of sets, and the integer w is a number of ways). For example, the N lines are organized into d divisions by grouping together w/d ways in each division (wherein 1≤d≤w, and wherein 1≤w/d≤w). Cache 152 is skewed by such divisions—e.g., wherein indices are used to variously select corresponding sets each in a different respective one of the d divisions. Such indices are variously derived (for example) using a block cipher, or a keyed hash function within the memory pipeline.
However, it is to be noted that some embodiments are not limited with respect to cache 152 being a skewed cache. For example, in an alternative embodiment, cache 152 includes only one division D0—e.g., providing an otherwise traditional set-associative cache that, in some embodiments, uses a randomized set index (e.g., derived by a cryptographic scheme using a block cipher E or a keyed hash function H).
By way of illustration and not limitation, in some embodiments, a look-up operation to access a given set of cache 152 comprises cache look-up circuitry 122 obtaining d differently encrypted (or, for example, hashed) values Cenc,0, Cenc,1, . . . , Cenc,d−1 each based on an address C and a different respective key. In one such embodiment, for a line LC of data which corresponds to the address C, d different encryptions of the address C are performed by cache look-up circuitry 122, where each such encryption is based on a block cipher E and on a different respective one of d keys K0, K1, . . . , Kd−1. Alternatively, d different hashes of the address C are calculated by cache look-up circuitry 122, each based on a hash function H and on a different respective one of the d keys K0, K1, . . . , Kd−1. Subsequently, d different indices idx0, idx1, . . . , idxd−1 are determined—e.g., by identifying, for each of the d encrypted (or hashed) address values Cenc,0, Cenc,1, . . . , Cenc,d−1, a respective slice of log2 s bits.
In one such embodiment, accessing cache 152 comprises cache look-up circuitry 122 performing lookups in each of d sets—e.g., in parallel with each other—to determine if a given line LC corresponding to the indicated address C can be found. If line LC is not found by said lookups, cache insertion circuitry 124 chooses one of the d sets (for example, at random) for inserting the line LC. Any of a variety of replacement algorithms can be used—e.g., according to conventional cache management techniques—to store the line LC within the chosen set. Typically, since a pseudorandom mapping of an address C to indices idx0, idx1, . . . , idxd−1 is at risk of being learned over time by a malicious agent, the keys K0, K1, . . . , Kd−1 should be updated regularly to mitigate the risk of contention-based (or other) side-channel attacks.
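The skewed look-up and insertion flow of the preceding two paragraphs can be sketched as a toy software model. The geometry (8 sets, 2 ways per division, d=2 divisions), the keyed-hash indexing, and all class and function names here are illustrative assumptions rather than elements of any claimed embodiment.

```python
import hashlib, random

SETS, WAYS_PER_DIV, KEYS = 8, 2, [b"\x01" * 16, b"\x02" * 16]  # toy geometry, d=2

def div_index(address: int, key: bytes) -> int:
    # Keyed hash stands in for block cipher E; slice log2(SETS) bits as the index.
    h = hashlib.blake2b(address.to_bytes(8, "little"), key=key, digest_size=8).digest()
    return int.from_bytes(h, "little") & (SETS - 1)

class SkewedCache:
    """Minimal d-division skewed cache; each division is indexed with its own key."""
    def __init__(self, keys):
        self.keys = keys  # one key per division: K0, K1, ..., Kd-1
        self.sets = [[[] for _ in range(SETS)] for _ in keys]

    def lookup(self, address: int) -> bool:
        # Check all d candidate sets (performed in parallel in hardware).
        return any(address in self.sets[d][div_index(address, k)]
                   for d, k in enumerate(self.keys))

    def insert(self, address: int) -> None:
        if self.lookup(address):
            return
        d = random.randrange(len(self.keys))   # choose one division at random
        target = self.sets[d][div_index(address, self.keys[d])]
        if len(target) >= WAYS_PER_DIV:
            target.pop(random.randrange(len(target)))  # random replacement
        target.append(address)

cache = SkewedCache(KEYS)
cache.insert(0x1000)
assert cache.lookup(0x1000)
```

Replacing the keys in `KEYS` remaps every address to new candidate sets, which is the re-keying step discussed above.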
Various features of processor 100 described herein help reduce the risk of side-channel attacks which target a primary cache. For example, fine-grained contention-based cache attacks exploit contention in cache sets, wherein malicious agents typically need some minimal set of addresses that map to the same cache set (a so-called eviction set) as the victim address of interest. Some types of primary caches (for example, randomized and/or skewed caches) significantly increase the complexity of finding such eviction sets, due to the use of a pseudo-random address-to-set mapping and cache skewing. Namely, addresses that collide with a victim address in all d divisions are very unlikely—i.e., occurring with a probability in proportion to s^−d—forcing a malicious agent to use more likely partial collisions. Such partially conflicting addresses collide with the victim, e.g., in a single division only, but also have a smaller probability to evict the victim address (or observe a victim access), i.e., d^−2 if an address collides with the victim address in a single division.
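As a numeric illustration of these probabilities, assume a toy skewed cache with s=8 sets per division and d=2 divisions (parameter values chosen for illustration only):

```python
# Illustrative collision probabilities for a skewed cache with s sets per
# division and d divisions (toy values; not parameters of any embodiment).
s, d = 8, 2

p_full_collision = s ** -d   # address collides with the victim in all d divisions
p_partial_evict = d ** -2    # single-division collider evicts/observes the victim

assert p_full_collision == 1 / 64
assert p_partial_evict == 1 / 4
assert p_partial_evict > p_full_collision  # partial collisions are far more likely
```

The partial-eviction term d^−2 reflects two independent 1/d choices: the attacker line must be inserted into the colliding division, and the victim line must reside there.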
One technique by which side-channel attacks find partially conflicting addresses includes (1) priming a cache with a set of candidate attacker addresses, (2) removing candidate addresses that miss in the cache (pruning step), (3) triggering the victim to access the address of interest, and (4) probing the remaining set of candidate addresses. A candidate address missing in the cache has a conflict with the victim in at least one division. While this interactive profiling technique does not break such caches entirely, it demands that keys be refreshed at relatively high rates, which impacts system performance.
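The four profiling steps above can be simulated against a deliberately simplified cache model. Everything here is a toy assumption: sets hold one line each, the secret address-to-set mapping is a random dictionary standing in for a keyed cipher, and the names are hypothetical.

```python
import random

NUM_SETS = 16
mapping = {}  # secret pseudorandom address-to-set mapping (stand-in for a keyed cipher)

def to_set(addr):
    if addr not in mapping:
        mapping[addr] = random.randrange(NUM_SETS)
    return mapping[addr]

class ToyCache:
    """One line per set: an access evicts whatever previously occupied the set."""
    def __init__(self):
        self.sets = {}
    def access(self, addr):
        s = to_set(addr)
        hit = self.sets.get(s) == addr
        self.sets[s] = addr
        return hit

def profile(victim_addr, candidates, cache):
    # (1) Prime the cache with the candidate attacker addresses.
    for a in candidates:
        cache.access(a)
    # (2) Pruning step: drop candidates that miss (they conflict among themselves).
    survivors = [a for a in candidates if cache.access(a)]
    # (3) Trigger the victim to access the address of interest.
    cache.access(victim_addr)
    # (4) Probe: a surviving candidate that now misses conflicts with the victim.
    return [a for a in survivors if not cache.access(a)]

victim = 1000
found = profile(victim, list(range(100)), ToyCache())
# Every reported address truly maps to the victim's (secret) set.
assert all(to_set(a) == to_set(victim) for a in found)
```

Reinserting-victim-cache designs discussed below break this procedure precisely because a probe miss no longer proves contention with the victim in the primary cache.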
To mitigate the threat and/or impact of these attacks, some embodiments variously provide a type of cache—referred to herein as a reinserting victim cache (VC) or, for brevity, simply “VC”—which is available for use, in combination with a primary cache, such as a randomized cache (RC) and, in some embodiments, a randomized skewed cache (RSC). In one such embodiment, a given line which is evicted from a primary cache is put into a VC. Subsequently, said line is reinserted into the primary cache—e.g., by swapping lines between respective entries of the primary cache and the VC. Benefits variously provided by such embodiments include, but are not limited to, an increased effective amount of cache associativity, a reduced observability of cache contention, a decoupling of actual evictions from cache-set contention, and increased side-channel security at longer re-keying intervals. Some embodiments variously provide cache randomization functionality in combination with cache skewing to improve security against contention-based cache attacks (where, for example, use of a victim cache enables a reduction in the rate of key refreshes).
For example, in various embodiments, cache eviction circuitry 126 variously evicts lines from the cache 152 over time, and puts them into cache 154. In one such embodiment, cache insertion circuitry 124 variously reinserts some or all such evicted lines into the cache 152 at different times. For example, when there is a hit for a line in cache 154, that line is automatically evicted from cache 154 by cache eviction circuitry 126, and reinserted into cache 152 by cache insertion circuitry 124. In one such embodiment, control circuitry 120 maintains two indices idxVC,insert and idxVC,reinsert which identify (respectively) an item that has been most recently inserted into cache 154, and an item that has been most recently reinserted into cache 152. In providing a VC with cache management functionality that automatically reinserts lines from cache 154 to cache 152, some embodiments variously hide contention in cache 152 by decoupling evictions in cache 154 from evictions in cache 152.
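One possible reading of the idxVC,insert / idxVC,reinsert bookkeeping is sketched below as a ring buffer of recently evicted lines. The ring-buffer behavior, capacity, and names are assumptions for illustration; the disclosure only states that the two indices track the most recent insertion and reinsertion.

```python
class VictimCacheRing:
    """Toy reinserting victim cache tracked with two cursors, as a sketch of the
    idxVC,insert / idxVC,reinsert bookkeeping (assumed ring-buffer semantics)."""
    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.idx_insert = 0    # most recent insertion from the primary cache
        self.idx_reinsert = 0  # most recent reinsertion back to the primary cache

    def insert(self, line):
        # A line evicted from the primary cache lands at the next insert slot.
        self.idx_insert = (self.idx_insert + 1) % len(self.slots)
        evicted = self.slots[self.idx_insert]
        self.slots[self.idx_insert] = line
        return evicted  # line pushed out of the VC toward memory, if any

    def hit(self, line):
        # On a VC hit, remove the line so it can be reinserted in the primary cache.
        i = self.slots.index(line)
        self.slots[i] = None
        self.idx_reinsert = i
        return line

vc = VictimCacheRing(4)
assert vc.insert("LX") is None
assert vc.hit("LX") == "LX"
```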
Some embodiments variously provide a randomized victim cache (VC) for at least one other corresponding “primary” cache to efficiently mitigate security risks such as contention-based side-channel attacks. In some embodiments, a primary cache is also a randomized cache (RC). In one such embodiment, a given line is evicted from a primary cache, and is put into a corresponding VC—e.g., before a subsequent eviction of said line into memory. The VC may be implemented as a fully associative cache, or as a set-associative cache, that independently uses a secret, randomized mapping. Either the primary cache or the VC implements a randomized replacement policy to add noise to an attacker's cache observations. The proposed solution thus hinders observations of cache contention to reduce the re-keying rates, increase the security margin, and maintain cache performance. In various embodiments, some re-keying is performed to generate one or more new cryptographic keys which are to be used for accessing a victim cache (e.g., in addition to that which is performed to generate new keys for accessing a primary cache).
Some fine-grained contention-based cache attacks exploit contention in cache sets, wherein an attacker typically needs some minimal set of addresses that map to the same cache set as the victim address of interest, a so-called eviction set. Randomization of a primary cache increases the complexity of finding such eviction sets due to the pseudo-random address-to-set mapping. Primary cache randomization on its own is often insufficient against many types of attacks—e.g., wherein regular encryption key updates are further relied upon. However, encryption key updates are often problematic for the performance of cache memory systems.
Some embodiments variously mitigate a long-standing security problem by closing cache-based side and covert channels that have been used in recent speculative execution attacks, to break cryptographic code, and (for example) bypass ASLR. Such embodiments increase the security of products at low performance overheads, and/or lower implementation complexity.
The use of a VC according to some embodiments hinders profiling techniques by breaking a direct observability of cache-set contention. For example, an attacker address X, which conflicts with a victim access C, can be evicted into a VC after a line V is evicted from the VC into memory. The attacker can then observe a miss for the line V, but this line is completely unrelated to the cache-set contention between X and C in the primary cache. As a result, the attacker's attempts using X do not contribute to an eviction set.
It might be possible for an attacker to improve visibility into cache-set contention in a primary cache—e.g., by trying to flush the VC. However, such flushing of the VC could be done only indirectly by creating contention in the primary cache—e.g., by causing lines to move to the VC and evict other lines from there. Moreover, such an approach creates noise in the attacker's profiling observations, wherein such noise is proportional to the size of the victim cache. For example, the fraction of lines sampled by the attacker that are truly conflicting (and not noise) is
This extends the attacker's profiling effort proportional to NVC, the number of lines in the VC. In some embodiments, a frequency of encryption key updating can thus be reduced, in proportion to NVC, for improved cache performance without substantially increased security risk.
Some embodiments extend the concept of cache randomization for use in the accessing of a victim cache—e.g., to increase the security against contention-based cache attacks and to reduce the required key refresh rate. In various embodiments, randomization of access to a cache (e.g., access to a victim cache and/or access to a primary cache) is implemented with an encryption or hashing functionality in the memory pipeline. In one such embodiment, a set of keys is generated—e.g., randomly—upon startup, wherein said keys are stored, and (for example) changed at a configurable interval. For example, some embodiments provide cryptographic protection using a QARMA block cipher, a PRINCE block cipher, a SPECK block cipher, or any of various other suitable block ciphers.
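The startup key generation and configurable-interval re-keying described above can be sketched as follows. The class, the 128-bit key width, and the interval value are illustrative assumptions; a hardware implementation would hold the keys in dedicated registers rather than software state.

```python
import secrets, time

REKEY_INTERVAL_S = 3600.0  # configurable re-key interval (assumed value)

class KeySchedule:
    """Sketch of per-division (and victim-cache) key management: keys are drawn
    randomly at startup and replaced at a configurable interval."""
    def __init__(self, num_keys: int, interval: float = REKEY_INTERVAL_S):
        self.interval = interval
        self.last_rekey = time.monotonic()
        self.keys = [secrets.token_bytes(16) for _ in range(num_keys)]

    def maybe_rekey(self) -> bool:
        # Replacing the keys invalidates any learned address-to-set mapping.
        now = time.monotonic()
        if now - self.last_rekey >= self.interval:
            self.keys = [secrets.token_bytes(16) for _ in self.keys]
            self.last_rekey = now
            return True
        return False

sched = KeySchedule(num_keys=3)  # e.g., K0, K1 for two divisions plus KVC
assert len(sched.keys) == 3 and not sched.maybe_rekey()
```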
In some embodiments, a VC is a set-associative (or alternatively, a fully associative) cache that, for example, is looked up in parallel with a look-up of the primary cache. The accessing of a set-associative VC is based (for example) on the derivation of a secret set index using cryptography and a secret key. In one embodiment, such derivation is performed using the same cryptographic primitive as is used for accessing the primary cache. For example, some embodiments use the same cryptographic functionality to derive indices for both the primary cache and the VC, but where the caches are accessed using different respective keys. In some embodiments (e.g., wherein the block size of a cryptographic primitive is significantly large), non-overlapping subsets of the output bits from a single cryptographic computation are each used to obtain a different respective set index, one for the VC and another for the primary cache.
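The last variant, slicing non-overlapping bit ranges of one wide output to index both caches, can be sketched as follows. The keyed BLAKE2b hash again stands in for a wide block cipher, and the set counts are assumed toy values.

```python
import hashlib

S_PRIMARY, S_VC = 64, 16  # set counts for the primary cache and the VC (assumed)

def both_indices(address: int, key: bytes):
    """Derive the primary-cache and VC set indices from non-overlapping bit
    ranges of a single keyed-hash output (standing in for a wide block cipher)."""
    out = int.from_bytes(
        hashlib.blake2b(address.to_bytes(8, "little"), key=key,
                        digest_size=16).digest(), "little")
    idx_primary = out & (S_PRIMARY - 1)        # lowest log2(64) = 6 bits
    idx_vc = (out >> 6) & (S_VC - 1)           # next, non-overlapping log2(16) = 4 bits
    return idx_primary, idx_vc

p, v = both_indices(0x40A0, b"\x0f" * 16)
assert 0 <= p < S_PRIMARY and 0 <= v < S_VC
```

Because the two slices share no bits, knowledge of one index reveals nothing about the other, which is the property that motivates this construction.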
For example, one illustrative embodiment provides a randomized cache (RC) as a primary cache. Such an RC comprises (for example) N=s·w lines—wherein s is a number of sets, and w is a number of ways—which are organized in 1≤d≤w divisions by grouping together w/d ways in each division. By way of illustration and not limitation, for one such primary cache, N=32, s=8, w=4, and d=2, in an embodiment. The primary cache is skewed by such divisions—e.g., wherein accessing the cache comprises using a different index to select the set in each division. These indices are derived via a cryptographic scheme which is based (for example) on a block cipher E or a keyed hash function H within the memory pipeline. Namely, as a given cache line address C is looked up in the primary cache, it is first encrypted (or hashed) using a block cipher E (or hash function H) with d different keys K0, K1, . . . , Kd−1 to obtain d differently encrypted (hashed) addresses Cenc,0, Cenc,1, . . . , Cenc,d−1. Slicing out log2 s bits from these encrypted (hashed) addresses gives d different indices idx0, idx1, . . . , idxd−1 to select the sets S0,idx0, . . . , Sd−1,idxd−1 in the respective divisions.
It is to be noted that the primary cache is not a skewed cache, in some embodiments. Namely, for some embodiments, a primary cache may be realized using d=1 divisions with w ways, in which case the primary cache is a traditional set-associative cache that (for example) uses a randomized set index that is derived by a cryptographic scheme using a block cipher E or a keyed hash function H.
Some embodiments further comprise a victim cache (VC) that, for example, is realized as a fully associative victim cache—with some wVC ways—or as a randomized, set-associative cache with wVC ways and sVC sets, i.e., with NVC=sVC·wVC lines. In one embodiment, such a randomized, set-associative VC uses a cryptographic function based on a block cipher E (hash function H) and a secret key KVC to derive the VC's set index from the cache line address. Lines that are evicted from the primary cache are put into the VC. Lines that hit in the VC can be reinserted into the primary cache but may also stay in the VC without loss of security. To introduce randomness into replacement decisions, either the VC or the primary cache uses a randomized replacement policy. For a good performance-security trade-off, the primary cache in some embodiments uses a well-performing replacement algorithm like (Quad-Age/Octa-Age)-LRU, whereas the victim cache will use random replacement.
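The contrasting replacement policies in this paragraph can be sketched with two small victim-selection functions. The age field is a simplified stand-in for Quad-Age/Octa-Age-LRU metadata, and all names here are illustrative.

```python
import random

def lru_victim(set_lines):
    """Primary cache: evict the line with the oldest age metadata (a simple
    stand-in for a Quad-Age/Octa-Age-LRU style algorithm)."""
    return max(range(len(set_lines)), key=lambda i: set_lines[i]["age"])

def random_victim(set_lines):
    """Victim cache: random replacement adds noise to an attacker's observations."""
    return random.randrange(len(set_lines))

lines = [{"addr": 0x10, "age": 3}, {"addr": 0x20, "age": 7}, {"addr": 0x30, "age": 1}]
assert lines[lru_victim(lines)]["addr"] == 0x20   # the oldest line is chosen
assert 0 <= random_victim(lines) < len(lines)     # any way may be chosen
```

The trade-off noted above is that deterministic age-based selection performs well in the primary cache, while the VC's random selection supplies the randomness needed for security.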
Supplementing a primary cache with a VC as described herein increases the primary cache's security margin for contention-based cache side-channel attacks, as the VC obfuscates contention in the primary cache and thus allows the primary cache's re-key interval to be extended. Moreover, a VC with random replacement helps achieve better performance of the randomized cache architecture through longer re-keying intervals and (for example) by allowing the secure use of least recently used (LRU) replacement inside the primary cache. Note that primary caches with a VC are not only applicable to actual caches, but also to coherence directories—e.g., including snoop filters—and, for example, to combinations of a cache and a coherence directory.
In an illustrative scenario according to one embodiment, on a memory request to a line C, cache controller circuitry maps the respective address to one or more locations (e.g., sets) in the victim cache—for example, via a cipher E—and further maps the respective address to one or more sets in a primary cache. The cache controller circuitry then looks up the line C in both the primary cache set(s) and the VC location(s). If the line C is found in the primary cache, the line is simply returned. If the line C hits in the VC, the line is returned and (in some embodiments) reinserted into one of the primary cache sets previously determined for line C via the cipher E. Alternatively, the line that was hit in the VC is further maintained in the VC—e.g., without being reinserted into the primary cache. If a reinsertion of C into the primary cache conflicts with another line Y in the primary cache, then (in some embodiments) this other line Y is evicted to memory. In other embodiments, this other line Y is instead evicted from the primary cache to where line C was in the VC, i.e., wherein lines C and Y are swapped. In some embodiments, such swapping of lines C and Y takes place only where they both map to the same set in the VC, which is automatically the case for a fully associative VC. Otherwise, line Y must be inserted into the VC set to which it maps.
In another scenario, if a request to line C misses in both the primary cache and the VC, the line C is fetched from memory and inserted in one of the sets of the primary cache as previously determined for line C via the cipher E. If this insertion conflicts with some line X previously stored in the primary cache, the line X is moved from the primary cache to the VC. For a set-associative VC, this first requires deriving the set index of the line X in the set-associative VC via the cipher E. Moving the line X to the VC may cause eviction of another line V from the victim cache to the memory. Note that some embodiments use a scrubber that automatically walks the sets of the VC to randomly evict lines from the VC to the memory, to ensure free slots are available whenever a line X is evicted from the primary cache to the VC.
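The miss path just described can be illustrated with a short software sketch. This is a simplified model, not the hardware implementation: the set sizes, line names, and the use of random victim selection in place of the replacement policies described herein are all illustrative.

```python
import random

WAYS_PC, WAYS_VC = 4, 4  # illustrative associativities

def handle_miss(pc_sets, vc_set, line_c, memory_writes):
    """Sketch of the miss path: line C is fetched and inserted into one of
    its mapped primary-cache sets; a conflicting line X is moved to the VC,
    which in turn may evict a line V from the VC to memory.
    pc_sets: candidate sets (lists of lines), one per division.
    vc_set: the VC set to which line X maps."""
    target = random.choice(pc_sets)           # pick a division at random
    if len(target) < WAYS_PC:                 # a free way is available
        target.append(line_c)
        return
    victim_x = target.pop(random.randrange(WAYS_PC))   # evict X from PC
    if len(vc_set) == WAYS_VC:                # VC set full: evict V to memory
        victim_v = vc_set.pop(random.randrange(WAYS_VC))
        memory_writes.append(victim_v)
    vc_set.append(victim_x)                   # X now lives in the VC
    target.append(line_c)                     # C takes X's former slot

# Toy run: one full primary-cache set and a full VC set, so one line
# ends up written back to memory.
pc = [["A", "B", "C", "D"]]
vc = ["P", "Q", "R", "S"]
mem = []
handle_miss(pc, vc, "NEW", mem)
```

In this sketch the scrubber mentioned above is omitted; its effect would be to keep `vc_set` below capacity so the memory write-back happens off the critical path.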
Some embodiments provide address mapping of the primary cache, wherein all lines in the primary cache and a VC are initially invalid, for example. As used herein, the label Si,j denotes a set j in a division i of a primary cache, wherein Ki and idxi denote, respectively, a key and a set index which correspond to division i. A given cache line address C is mapped to various primary cache entries by an encryption of C with d different keys K0, K1, . . . , Kd−1, from which log2 s bits are sliced out to obtain indices idx0, idx1, . . . , idxd−1. Said indices are variously used to access d divisions, which results in d cache sets S0,idx0, S1,idx1, . . . , Sd−1,idxd−1 comprising w/d entries each.
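By way of a non-limiting software sketch, the derivation of the d set indices can be modeled as follows. Here a SHA-256 based keyed hash stands in for the block cipher E (or keyed hash H) of the disclosure, and the key values and address below are illustrative only.

```python
import hashlib

def set_indices(addr: int, keys: list[bytes], s: int) -> list[int]:
    """Map a cache line address to one set index per division: encrypt
    (here: keyed-hash) the address under each division's key, then slice
    out log2(s) bits of the result to form that division's index."""
    assert s & (s - 1) == 0, "number of sets must be a power of two"
    indices = []
    for key in keys:
        digest = hashlib.sha256(key + addr.to_bytes(8, "little")).digest()
        # Slice log2(s) bits out of the keyed digest via a mask.
        indices.append(int.from_bytes(digest[:4], "little") & (s - 1))
    return indices

# d = 2 divisions, s = 1024 sets per division (illustrative geometry)
idxs = set_indices(0xDEADBEEF, [b"K0", b"K1"], 1024)
```

Re-keying, as described elsewhere herein, corresponds simply to replacing the key list, which remaps every address to fresh sets.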
In some embodiments, a cache line C is mapped to VC entries by encrypting C with the key KVC, log2 sVC bits of which are sliced out to obtain the index idxVC. In one such embodiment, said index idxVC is used to access the cache set SVC,idxVC of the set-associative victim cache.
When a given cache line address C is requested, it is mapped to cache sets S0,idx0, S1,idx1, . . . , Sd−1,idxd−1 of the primary cache and to the set SVC,idxVC of the victim cache, and each of these sets is looked up for the corresponding line.
In one example scenario, line C is fetched from memory for insertion into one of multiple sets S0,idx0, S1,idx1, . . . , Sd−1,idxd−1 of the primary cache which have been mapped to the address C.
In another example scenario, a cache request hits in the VC, wherein a line V is returned by the hit. In one such embodiment, the line V is reinserted into the primary cache due to the hit—e.g., using any of various primary cache reinsert operations described herein. In another such embodiment, the line V is instead kept in the VC, but the replacement bits for the location (e.g., the set) of line V in the VC are updated as needed.
In another example scenario, wherein a line V in a VC is to be reinserted into a primary cache, a division d̂ ∈ {0, . . . , d−1} of the primary cache is selected—e.g., randomly—and the line address V is encrypted with the respective division's key Kd̂. From the resulting ciphertext Venc,d̂, log2 s bits are sliced out to obtain an index idxd̂, which is used to select the cache set Sd̂,idxd̂ in division d̂. Within the set Sd̂,idxd̂, a victim line is selected—e.g., according to the set's replacement policy—and, where said victim line is valid, the line V and the victim line are swapped between the primary cache and the VC.
In some embodiments, access to a victim cache is randomized by the use of an encryption calculation or a hash calculation to identify (e.g., including calculating an index of) a particular set of a set-associative victim cache, for example. In one such embodiment, an encryption or hash calculation which is performed to identify a given set of the primary cache is different than another encryption or hash calculation which is performed to identify a given set of the victim cache. Such calculations use different respective encryption keys or different respective hash functions, for example.
In one embodiment, a first form in which information is cached to a given entry of the primary cache is different than a second form in which that same information is cached, at a different time, to an entry of the victim cache. For example, one (e.g., only one) of the first form or the second form is a plaintext form—for example, wherein the other of the first form or the second form is an encrypted form. In another embodiment, the first form and the second form correspond to different respective encryption types which, for example, are each based on a different respective encryption key, or a different respective cipher. The particular forms of the cached information will determine whether—and if so how—a given line is to be decrypted and/or (re)encrypted if, for example, the line is evicted from the primary cache to the victim cache, or if the line is reinserted from the victim cache to the primary cache.
In still another embodiment, a single cipher block is used to generate a single cryptographic primitive, but various (e.g., non-overlapping) bits of the primitive are used to select different respective cache sets including—for example—one or more sets of the primary cache and/or one or more sets of the victim cache. For example, a single cryptographic block cipher generates a 32-bit output, wherein the most significant 16 bits are used to facilitate randomized access of a primary cache set, and the least significant 16 bits are used to facilitate randomized access of a victim cache set.
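Such slicing of a single 32-bit cipher output into a primary cache portion and a victim cache portion can be sketched as follows; the input value below is an arbitrary illustration:

```python
def split_indices(cipher_out_32: int) -> tuple[int, int]:
    """Split one 32-bit cipher output into two non-overlapping 16-bit
    slices: the most significant half seeds the primary cache index,
    the least significant half seeds the victim cache index. Each slice
    would then be masked down to the respective cache's set count."""
    pc_bits = (cipher_out_32 >> 16) & 0xFFFF   # most significant 16 bits
    vc_bits = cipher_out_32 & 0xFFFF           # least significant 16 bits
    return pc_bits, vc_bits

pc_idx, vc_idx = split_indices(0xABCD1234)
# pc_idx == 0xABCD, vc_idx == 0x1234
```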
As shown in
Method 200 further comprises operations 205 which are performed based on the first message which is received at 210. In an embodiment, operations 205 comprise (at 212) identifying a first location of a primary cache—e.g., wherein cache look-up circuitry 122 performs operations to identify one or more sets of the primary cache which have been mapped to the indicated first address. For example, cache look-up circuitry 122 performs a look-up, encryption, hash or other operation to identify one of multiple indices which are based on the first address, and which each correspond to a different respective set of the primary cache.
In some embodiments, the primary cache is operated as a random and/or skewed cache. For example, the first location is identified at 212 based on a selected one of first indices which are determined—by cache look-up circuitry 122 (or other suitable cache controller logic)—based on the first address and a first plurality of key values. The first indices—generated using a cipher block or hash function, for example—each correspond to a different respective one of multiple sets of the primary cache (e.g., where the sets are each in a different respective division of the primary cache). At some other time, access to the multiple sets is instead provided with second indices which are determined based on the first address and a second plurality of alternative key values—e.g., wherein control circuitry 120 changes indexing of the primary cache to help protect from side-channel (or other) attacks.
Operations 205 further comprise (at 214) performing a calculation to identify a second location of a victim cache, wherein the calculation is based on one of an encryption key or a hash function. In one example embodiment, determining the second location comprises mapping the first address to a set of the victim cache—e.g., wherein the mapping comprises calculating an index for the set based on the first address and the one of the encryption key or the hash function. In one such embodiment, an address C is mapped to a particular set of a victim cache by encrypting C with a key KVC, slicing out log2(sVC) bits to obtain an index idxVC, and using this index to access a set SVC,idxVC of the set-associative victim cache.
Operations 205 further comprise (at 216) moving a second line from the first location to the second location of the victim cache. For example, moving the second line to the second location at 216 is based on cache insertion circuitry 124 (or other suitable cache controller logic) detecting a valid state of the second line. In some embodiments, operations 205 further comprise evicting a third line from the second location to a memory (e.g., an external memory which is coupled to processor 100) before moving the second line from the first location to the second location. In one such embodiment, evicting the third line is based on cache insertion circuitry 124 (or other suitable cache controller logic) detecting a valid state of the third line. Operations 205 further comprise (at 218) storing the first line to the first location, which (for example) is available after the moving at 216.
Although some embodiments are not limited in this regard, method 200 additionally or alternatively comprises operations to reinsert a line from the victim cache to the primary cache. For example, in one such embodiment, method 200 further comprises (at 220) receiving a second message indicating a second address which corresponds to the second line. The second message (e.g., received by control circuitry 120) is, for example, a request to search for the second line in the primary cache and the victim cache. Based on the second message received at 220, method 200 (at 222) moves the second line from the second location to the primary cache.
In one such embodiment, moving the second line at 222 comprises swapping the second line and a third line between the primary cache and the victim cache—e.g., based on the detecting of a valid state of the third line. In an alternative scenario (e.g., wherein the third line is invalid), the moving at 222 simply writes over the third line. In some embodiments, the moving at 222 comprises identifying the second location in the victim cache—e.g., by performing operations, such as those referred to above, to calculate or otherwise identify the corresponding index idxVC based on an encryption key or a hash function.
As shown in
In an example scenario according to one embodiment, system 300 receives or otherwise communicates a request 330 which provides an address C to indicate, at least in part, that a line LC corresponding to the address C is to be cached, read, evicted, or otherwise accessed. For example, request 330 is communicated by execution circuitry 110 to control circuitry 120 to determine whether either of caches 152, 154 stores some cached version of line LC.
In one such embodiment, the address C is mapped (e.g., by circuit blocks 332, 333 provided, for example, with cache look-up circuitry 122) to multiple sets of PC 310, where each such set is at a different respective division of PC 310. Such mapping is based, for example, on a cipher E (or alternatively, based on a hash function H) and a set of keys including, for example, the illustrative keys K0, K1 shown. Furthermore, the address C is also mapped (e.g., by a circuit block 340 provided, for example, with cache look-up circuitry 122) to a set of VC 320. Such mapping is based, for example, on a cipher E (or alternatively, based on a hash function H) and another key, such as the illustrative key KVC shown. For example, circuit blocks 332, 333 generate respective indices idx0, idx1 to variously access sets which are each in a different respective one of divisions D0, D1—e.g., wherein circuit block 340 generates another index idxVC to access a set in VC 320. In one such embodiment, indices idx0, idx1, idxVC are variously generated each based on the same cipher E (or the same hash function, for example).
Based on a request for the line LC, a look-up operation is performed to determine whether a mapped set in VC 320, or any of the mapped sets in PC 310, includes the line LC. Where the line LC is found in PC 310, said line LC is simply returned in a response to the request. However, where the line LC is found in VC 320, said line LC—in addition to being returned in response to the request—is also evicted from VC 320 and reinserted into PC 310 (e.g., at one of those sets of PC 310 which are mapped to the address C).
In an embodiment, if reinsertion of line LC from VC 320 to PC 310 conflicts with some other line which is currently in PC 310, then that other line is swapped with the line LC—e.g., wherein the line LC is inserted into the location where PC 310 stored the other line, and where that other line is evicted to the location where VC 320 stored the line LC.
If the request for the line LC misses—that is, where neither of PC 310 or VC 320 currently has line LC—then the line LC is fetched from an external memory (not shown) and, for example, is inserted into one of the sets of PC 310—e.g., as determined based on the address C and a cipher E (or a hash function H, for example). If an insertion of the line LC conflicts with some second line LX which is currently stored in PC 310, then that second line LX is evicted from PC 310 to VC 320. In some cases, such eviction of the line LX from PC 310 to VC 320 results in eviction of some third line LV from VC 320 to memory. In some embodiments, reinsertion of a given line from a victim cache to a primary cache comprises the cache management circuitry trying to reinsert the line into a different cache way (and, for example, a different division) other than the one from which the line was originally evicted.
In various embodiments, use of VC 320 (and randomized access thereto) with PC 310 hinders some side-channel attacks by breaking the direct observability of cache-set contention. In an illustrative scenario according to one embodiment, an attacker uses a line LX which corresponds to an attacker address X that conflicts with an address C of a line LC. Servicing of a request for line LC results in the line LX being evicted from PC 310 into VC 320, which in turn results in an eviction of another line LV from VC 320 into memory (as indicated by the line transfer operations labeled “1” and “2” in
For example, in a first example situation, a later request for line LX would result in line LX being reinserted from VC 320 back into the same division where it was originally stored in PC 310. This would result in the line LC (corresponding to address C) being selected for eviction to VC 320, i.e., wherein line LX and the line LC are swapped. As a result, line LX and the line LC are stored in PC 310 and VC 320 (respectively), making the previous contention in PC 310 invisible.
In a second example situation, the line LX is instead reinserted from VC 320 into a division of PC 310 other than the one from which it was previously evicted. As a result, the line LC is not selected for eviction from PC 310—e.g., where (for example) some other line LY, previously stored in PC 310, is instead swapped with the line LX (as indicated by the line swap operation labeled “3” in
Evictions observed by a malicious agent thus appear to be unrelated to the victim address C, or to have only an indirect relationship to the victim address. An indirectly contending address Y might arise in the following situation: line LY evicts line LX, and then line LX evicts line LC. Hence, line LX contends with both line LY and line LC, depending on the division. Indirectly contending addresses like Y remain sufficiently hard to exploit in practice for multiple reasons. For example, malicious agents cannot distinguish whether they sampled such an indirectly contending address, or an unrelated one. This aspect effectively increases the number of addresses required to identify an eviction set.
Furthermore, malicious agents typically do not know what the address X is, because it cannot be observed. However, an indirectly contending address Y is only valuable to a malicious agent if they know address X and insert line LX beforehand—i.e., line LX is a required proxy for line LY to evict line LC. Other addresses are unlikely to contribute to this kind of multi-collision—i.e., it is unlikely for many addresses X′ to contend with both address C and address Y in different divisions. In general, there is roughly a probability of w2s−2 (e.g., 1 in 4096 addresses for a cache with 1024 sets and 16 ways) that some address would have this property. There is a special case where lines LX, LY collide with line LC in the same division (which can occur, for example, where d<w). In one such situation, line LY could directly evict line LC from PC 310, but VC 320 then triggers automatic reinsertion of line LC (effectively preventing the eviction).
Further still, even if malicious agents know address X, they have low probability of evicting the victim line LC from PC 310. For example, line LC and line LX must both have been placed in the correct cache ways beforehand, line LY must be inserted so as to evict line LX, and line LX must be reinserted so as to evict line LC to VC 320. In general, such a sequence for line placements has a probability of roughly w−4. This significantly increases the number of addresses needed in the eviction set to obtain a good eviction probability.
Further still, the malicious agent would typically need to flush VC 320 with lines for random addresses once line LC was evicted from PC 310 into VC 320. This process requires additional contention in PC 310 and adds “noise” to the probing process of a side-channel attack.
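The probability figures cited above can be checked directly for the example geometry of 1024 sets and 16 ways:

```python
# Worked check of the contention probabilities stated above,
# for a primary cache with s = 1024 sets and w = 16 ways.
w, s = 16, 1024
p_indirect = w**2 * s**-2    # w2s−2: chance an address contends with
                             # both C and Y in different divisions
p_placement = w**-4          # w−4: chance of the full placement
                             # sequence needed to evict line LC
assert p_indirect == 1 / 4096
```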
As shown in
By way of illustration and not limitation, encrypted address values are determined each based on the address C and a different respective one of d keys K0, K1, . . . , Kd−1 (where a given encryption key Ki corresponds to a particular division i of the primary cache). Indices idx0, idx1, . . . , idxd−1 are then obtained—e.g., each by identifying a respective slice of log2 spc bits from a corresponding one of the encrypted address values (where a given set index idxi corresponds to a particular division i of the primary cache). Said indices are available to be used each to access a respective one of the d divisions—e.g., to access any of d cache sets S0,idx0, S1,idx1, . . . , Sd−1,idxd−1 (where, for a given set Si,j, index i indicates a particular division, and index j indicates a particular set of that division i). In some embodiments, the d cache sets S0,idx0, S1,idx1, . . . , Sd−1,idxd−1 comprise w/d entries each, for example.
Method 400 further comprises (at 412) receiving a request—such as the request 330 communicated with system 300—which indicates the address C, and (at 414) identifying, based on the request which is received at 412, one or more first sets of the primary cache which have been mapped to the address C. For example, at some point after the mapping of primary cache sets, a request to access the line LC provides the corresponding address C, which is used by control circuitry 120 (or other suitable cache control logic) to determine the corresponding mapped one or more sets—e.g., sets S0,idx0, S1,idx1, . . . , Sd−1,idxd−1—of the primary cache.
Based on the request received at 412—and further based on an encryption key or a hash function—method 400 (at 416) identifies one or more second sets of a VC as corresponding to the address C. By way of illustration and not limitation, an encrypted address value is determined based on both the address C and another key KVC. In some embodiments, an index idxVC is thus obtained—e.g., by identifying a slice of some log2 sVC bits from the encrypted address value. Said index is available to be used to access a corresponding set SVC,idxVC of the (set-associative, for example) victim cache.
Method 400 further comprises (at 418) searching the one or more first sets which were identified at 414—as well as the one or more second sets which were identified at 416—for the line LC (where said searching is based on the request received at 412). For example, look-ups of the mapped sets of the PC and VC are then performed (in parallel, for example) to search for the line LC. In some embodiments, if the request hits in some mapped set of the primary cache, then the line LC is returned in a response to the access request, and—in some embodiments—replacement bits in the set Si,idxi are updated. If the request hits in the VC, then a VC hit operation is performed (as described herein). If the request misses in both the VC and the PC, a primary cache insert operation is performed (as described herein).
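The parallel look-up of the mapped primary cache sets and victim cache set can be modeled, by way of illustration, as a simple membership test; here sets are represented as lists and the returned labels are illustrative only:

```python
def lookup_line(addr, pc_mapped_sets, vc_mapped_set):
    """Sketch of the search step: check each mapped primary-cache set
    and the mapped VC set for the requested line. The result selects
    which follow-up operation (hit handling or insert) is performed."""
    for pc_set in pc_mapped_sets:
        if addr in pc_set:
            return "pc_hit"      # triggers a primary-cache hit path
    if addr in vc_mapped_set:
        return "vc_hit"          # triggers a VC hit operation
    return "miss"                # triggers a primary cache insert

result = lookup_line(7, [[1, 7], [3]], [5])
# result == "pc_hit"
```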
For example, method 400 determines (at 420) whether or not the searching at 418 has resulted in a cache hit—i.e., wherein the line LC has been found in either of the primary cache or the victim cache. Where a cache hit is detected at 420, method 400 (at 422) performs operations to access one or more lines of the primary cache or the victim cache based on the cache hit. In various embodiments, the cache hit operations at 422 include moving one line between respective sets of the primary cache and the victim cache. In one such embodiment, the cache hit operations at 422 further comprise moving a line between an external memory and one of the primary cache or the victim cache. In various embodiments, cache hit operation 422 includes some or all of the method 500 shown in
Where it is instead determined at 420 that no such cache hit is indicated, method 400 (at 424) performs operations to cache a line from an external memory to the primary cache based on the detected cache miss. In various embodiments, the cache miss operations at 424 include moving another line between respective sets of the primary cache and the victim cache. In one such embodiment, the cache miss operations at 424 further comprise moving a line between an external memory and one of the primary cache or the victim cache. In various embodiments, cache miss operation 424 includes some or all of the method 600 shown in
In various embodiments, a victim cache hit operation—in response to a data request hitting a line which is stored in a victim cache—comprises control circuitry 120 (or other suitable cache control logic) returning the line in a response to the request, and also reinserting that line into a primary cache using a primary cache reinsert routine. By way of illustration and not limitation, reinsertion of a line from the victim cache into the primary cache is automatically triggered if, for example, the index idxVC,reinsert is less than the index idxVC,insert. Subsequently, the index idxVC,reinsert is incremented or otherwise updated. In one such embodiment, reinsertion of the requested line from the victim cache into the primary cache includes randomly selecting or otherwise identifying a division d̂ ∈ {0, . . . , d−1}, where a line address C corresponding to the requested line is encrypted with the respective division's key Kd̂. From a resulting ciphertext Cenc,d̂, log2 spc bits are sliced out to obtain index idxd̂, which is used to select the cache set Sd̂,idxd̂ in division d̂. Within Sd̂,idxd̂, a victim line LX is selected—e.g., according to a replacement policy for the cache set. The requested line in the victim cache and line LX in the primary cache are then swapped (e.g., if line LX is valid) and the replacement bits in Sd̂,idxd̂ are updated. Afterwards, the requested line and the line LX will be in the primary cache and the victim cache, respectively.
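A non-limiting software sketch of such a victim cache hit operation follows. Here the division is chosen pseudorandomly, pc_index_fn stands in for the slicing of set-index bits from the encryption of the address under the division's key, invalid ways are modeled as None, and all names are illustrative.

```python
import random

def vc_hit_reinsert(vc_set, line_idx, pc_divisions, pc_index_fn):
    """Sketch of the VC-hit path: the requested line is returned and
    reinserted into a randomly chosen primary-cache division; a valid
    victim line LX displaced from the primary cache takes the requested
    line's former VC slot (i.e., the two lines are swapped)."""
    line = vc_set[line_idx]
    d_hat = random.randrange(len(pc_divisions))      # random division d̂
    pc_set = pc_divisions[d_hat][pc_index_fn(d_hat, line)]
    way = random.randrange(len(pc_set))              # victim way in the set
    if pc_set[way] is not None:                      # valid victim LX: swap
        vc_set[line_idx] = pc_set[way]
    else:                                            # invalid way: vacate slot
        vc_set[line_idx] = None
    pc_set[way] = line                               # requested line enters PC
    return line                                      # ...and is returned
```

A trivial usage, with one division, one single-way set, and a constant index function in place of the cipher-based mapping: after `vc_hit_reinsert(["HIT"], 0, [[["OLD"]]], lambda d, a: 0)` the primary cache holds "HIT" and the VC holds "OLD".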
For example,
As shown in
By contrast, where it is instead determined at 510 that the detected cache hit was not at the primary cache, but rather at a victim cache which corresponds to the primary cache, method 500 performs a victim cache hit process which includes (at 512) identifying the location in the victim cache from which the line LC is to be evicted. Furthermore, method 500 (at 514) identifies a location in the primary cache which is to receive the line LC from the victim cache. In one example embodiment, determining a particular location where the victim cache (or, for example, a primary cache) is to receive a given line C comprises mapping the line C in question to an entry of the victim (or primary) cache. For example, a cache line C is mapped to a particular victim cache entry by encrypting C with the key KVC, slicing out log2(sVC) bits to obtain an index idxVC, and using this index to access a cache set SVC,idxVC in the set-associative victim cache. In various embodiments wherein a victim cache is fully associative, the victim cache has only one set, SVC,0, which can be directly obtained by selecting idxVC=0—i.e., without a need to derive a set index through encryption.
In one such embodiment, the victim cache hit process of method 500 further determines (at 516) whether or not the location which is identified at 514 currently stores any line LX which is classified as being valid. Where it is determined at 516 that the primary cache location does currently store some valid line LX, method 500 (at 518) swaps the lines LC, LX between the two cache locations which are variously identified at 512 and at 514. Where it is instead determined at 516 that the primary cache location does not currently store any valid line LX, method 500 (at 520) simply moves the line LC from the location identified at 512 to the location identified at 514—e.g., wherein such moving simply writes over data stored in the location identified at 514.
In either case, the victim cache hit process includes, or is performed in conjunction with, the providing of the line LC (at 522) in a response to the previously-detected request which indicates the address C. However, some embodiments omit (or provide support for selectively disabling) functionality to reinsert a given line from a victim cache to a primary cache. In one such embodiment, such a line remains in the victim cache until (for example) it is evicted to memory to make room for another line being evicted from the primary cache to the victim cache.
In some embodiments, a primary cache insert process comprises fetching a line LC from a data storage device, and storing the line LC to one of the d sets S0,idx0, S1,idx1, . . . , Sd−1,idxd−1 of the primary cache which have been mapped to the corresponding address C. For example, one division d̂ ∈ {0, . . . , d−1} is chosen (e.g., randomly) from among the d divisions of the primary cache, wherein a set Sd̂,idxd̂ of the division d̂ is selected, and a victim line LX in the set Sd̂,idxd̂ is designated for eviction—e.g., according to a policy which (for example) is adapted from any of various conventional cache replacement techniques. The designated line LX is then replaced by the line LC—e.g., wherein replacement bits in Sd̂,idxd̂ are updated accordingly. If the evicted line LX is a valid one at the time of eviction, idxVC,insert is incremented (or otherwise updated) and line LX is inserted into the VC at position idxVC,insert. If some valid line LV is currently at the position idxVC,insert where line LX is to be inserted, then that line LV is evicted from the VC, and written to a higher level cache, to a data storage device (e.g., disk storage), or the like.
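The victim cache side of such a primary cache insert process can be sketched as follows, with the VC modeled as a flat list of positions, idxVC,insert as a round-robin pointer carried in a small state dictionary, and invalid lines as None; all names are illustrative.

```python
def vc_insert(vc, state, evicted_x, memory_writes):
    """Sketch of inserting an evicted primary-cache line X into the VC:
    the idxVC,insert pointer is advanced, and any valid line V already
    occupying that position is first evicted to the next memory level."""
    state["idx_insert"] = (state["idx_insert"] + 1) % len(vc)
    pos = state["idx_insert"]
    if vc[pos] is not None:              # a valid line V occupies the slot
        memory_writes.append(vc[pos])    # evict V toward memory
    vc[pos] = evicted_x                  # X takes position idxVC,insert

# Toy run: the pointer advances from 0 to 1, which is a free (invalid)
# slot, so line X is placed there without any eviction to memory.
vc = ["V0", None, "V2"]
st = {"idx_insert": 0}
mem = []
vc_insert(vc, st, "X", mem)
```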
For example,
As shown in
A primary cache insert process of method 600 comprises (at 612) identifying a location in the primary cache which is to receive the line LC that is retrieved at 610. Furthermore, method 600 determines (at 614) whether or not the location which is identified at 612 currently stores any line LX which is classified as being valid. Where it is determined at 614 that the primary cache location does currently store some valid line LX, method 600 (at 616) identifies a location in the victim cache which is to receive the line LX from the primary cache. In various embodiments, identifying the location at 616 is based on an address C which corresponds to the line LC, and is further based on an encryption key or hash function. For example, an encryption calculation is performed based on the address C and an encryption key KVC as part of operations to determine an index value for a cache set SVC,idxVC in the set-associative victim cache. Subsequently, the primary cache insert process determines (at 618) whether or not the location which is identified at 616 currently stores any other line LV which is classified as being valid.
Where it is determined at 618 that the victim cache location does currently store some valid line LV, method 600 (at 620) evicts the line LV from the victim cache for writing to a higher level cache, to a data storage device (e.g., disk storage), or the like. Subsequently (at 622), the line LX is evicted from the location in the primary cache which is identified at 612, and stored at the location in the victim cache which is identified at 616. By contrast, where it is instead determined at 618 that the victim cache location does not currently store any valid line LV, method 600 performs the evicting at 622, but foregoes the evicting at 620. In either case, method 600 (at 624) caches the line LC to the location which is identified at 612.
In some embodiments, where it is instead determined at 614 that the primary cache location—which was identified at 612—does not currently store any valid line LX, method 600 simply performs a caching of the line to that primary cache location (at 624). Regardless, in some embodiments, a cache miss process includes (or is performed in conjunction with) the providing of the line LC (at 626) in a response to a previously-detected request for line LC.
Described below are exemplary computer architectures. Other system designs and configurations known in the arts for laptop, desktop, and handheld personal computers (PCs), personal digital assistants, engineering workstations, servers, disaggregated servers, network devices, network hubs, switches, routers, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, micro controllers, cell phones, portable media players, hand-held devices, and various other electronic devices, are also suitable. In general, a variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are suitable.
Processors 770 and 780 are shown including integrated memory controller (IMC) circuitry 772 and 782, respectively. Processor 770 also includes as part of its interconnect controller point-to-point (P-P) interfaces 776 and 778; similarly, second processor 780 includes P-P interfaces 786 and 788. Processors 770, 780 may exchange information via the point-to-point (P-P) interconnect 750 using P-P interface circuits 778, 788. IMCs 772 and 782 couple the processors 770, 780 to respective memories, namely a memory 732 and a memory 734, which may be portions of main memory locally attached to the respective processors.
Processors 770, 780 may each exchange information with a chipset 790 via individual P-P interconnects 752, 754 using point to point interface circuits 776, 794, 786, 798. Chipset 790 may optionally exchange information with a coprocessor 738 via an interface 792. In some examples, the coprocessor 738 is a special-purpose processor, such as, for example, a high-throughput processor, a network or communication processor, compression engine, graphics processor, general purpose graphics processing unit (GPGPU), neural-network processing unit (NPU), embedded processor, or the like.
A shared cache (not shown) may be included in either processor 770, 780 or outside of both processors, yet connected with the processors via P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.
Chipset 790 may be coupled to a first interconnect 716 via an interface 796. In some examples, first interconnect 716 may be a Peripheral Component Interconnect (PCI) interconnect, or an interconnect such as a PCI Express interconnect or another I/O interconnect. In some examples, one of the interconnects couples to a power control unit (PCU) 717, which may include circuitry, software, and/or firmware to perform power management operations with regard to the processors 770, 780 and/or co-processor 738. PCU 717 provides control information to a voltage regulator (not shown) to cause the voltage regulator to generate the appropriate regulated voltage. PCU 717 also provides control information to control the operating voltage generated. In various examples, PCU 717 may include a variety of power management logic units (circuitry) to perform hardware-based power management. Such power management may be wholly processor controlled (e.g., by various processor hardware, and which may be triggered by workload and/or power, thermal or other processor constraints) and/or the power management may be performed responsive to external sources (such as a platform or power management source or system software).
PCU 717 is illustrated as being present as logic separate from the processor 770 and/or processor 780. In other cases, PCU 717 may execute on a given one or more of cores (not shown) of processor 770 or 780. In some cases, PCU 717 may be implemented as a microcontroller (dedicated or general-purpose) or other control logic configured to execute its own dedicated power management code, sometimes referred to as P-code. In yet other examples, power management operations to be performed by PCU 717 may be implemented externally to a processor, such as by way of a separate power management integrated circuit (PMIC) or another component external to the processor. In yet other examples, power management operations to be performed by PCU 717 may be implemented within BIOS or other system software.
Various I/O devices 714 may be coupled to first interconnect 716, along with a bus bridge 718 which couples first interconnect 716 to a second interconnect 720. In some examples, one or more additional processor(s) 715, such as coprocessors, high-throughput many integrated core (MIC) processors, GPGPUs, accelerators (such as graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays (FPGAs), or any other processor, are coupled to first interconnect 716. In some examples, second interconnect 720 may be a low pin count (LPC) interconnect. Various devices may be coupled to second interconnect 720 including, for example, a keyboard and/or mouse 722, communication devices 727 and a storage circuitry 728. Storage circuitry 728 may be one or more non-transitory machine-readable storage media as described below, such as a disk drive or other mass storage device which may include instructions/code and data 730 in some examples. Further, an audio I/O 724 may be coupled to second interconnect 720. Note that other architectures than the point-to-point architecture described above are possible. For example, instead of the point-to-point architecture, a system such as multiprocessor system 700 may implement a multi-drop interconnect or other such architecture.
Processor cores may be implemented in different ways, for different purposes, and in different processors. For instance, implementations of such cores may include: 1) a general purpose in-order core intended for general-purpose computing; 2) a high-performance general purpose out-of-order core intended for general-purpose computing; 3) a special purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general purpose in-order cores intended for general-purpose computing and/or one or more general purpose out-of-order cores intended for general-purpose computing; and 2) a coprocessor including one or more special purpose cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as a CPU; 3) the coprocessor on the same die as a CPU (in which case, such a coprocessor is sometimes referred to as special purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as special purpose cores); and 4) a system on a chip (SoC) that may include on the same die as the described CPU (sometimes referred to as the application core(s) or application processor(s)), the above described coprocessor, and additional functionality. Exemplary core architectures are described next, followed by descriptions of exemplary processors and computer architectures.
Thus, different implementations of the processor 800 may include: 1) a CPU with the special purpose logic 808 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores, not shown), and the cores 802A-N being one or more general purpose cores (e.g., general purpose in-order cores, general purpose out-of-order cores, or a combination of the two); 2) a coprocessor with the cores 802A-N being a large number of special purpose cores intended primarily for graphics and/or scientific (throughput); and 3) a coprocessor with the cores 802A-N being a large number of general purpose in-order cores. Thus, the processor 800 may be a general-purpose processor, coprocessor or special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general purpose graphics processing unit circuitry), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), embedded processor, or the like. The processor may be implemented on one or more chips. The processor 800 may be a part of and/or may be implemented on one or more substrates using any of a number of process technologies, such as, for example, complementary metal oxide semiconductor (CMOS), bipolar CMOS (BiCMOS), P-type metal oxide semiconductor (PMOS), or N-type metal oxide semiconductor (NMOS).
A memory hierarchy includes one or more levels of cache unit(s) circuitry 804A-N within the cores 802A-N, a set of one or more shared cache unit(s) circuitry 806, and external memory (not shown) coupled to the set of integrated memory controller unit(s) circuitry 814. The set of one or more shared cache unit(s) circuitry 806 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, such as a last level cache (LLC), and/or combinations thereof. While in some examples ring-based interconnect network circuitry 812 interconnects the special purpose logic 808 (e.g., integrated graphics logic), the set of shared cache unit(s) circuitry 806, and the system agent unit circuitry 810, alternative examples use any number of well-known techniques for interconnecting such units. In some examples, coherency is maintained between one or more of the shared cache unit(s) circuitry 806 and cores 802A-N.
In some examples, one or more of the cores 802A-N are capable of multi-threading. The system agent unit circuitry 810 includes those components coordinating and operating cores 802A-N. The system agent unit circuitry 810 may include, for example, power control unit (PCU) circuitry and/or display unit circuitry (not shown). The PCU may be or may include logic and components needed for regulating the power state of the cores 802A-N and/or the special purpose logic 808 (e.g., integrated graphics logic). The display unit circuitry is for driving one or more externally connected displays.
The cores 802A-N may be homogenous in terms of instruction set architecture (ISA). Alternatively, the cores 802A-N may be heterogeneous in terms of ISA; that is, a subset of the cores 802A-N may be capable of executing an ISA, while other cores may be capable of executing only a subset of that ISA or another ISA.
By way of example, the exemplary register renaming, out-of-order issue/execution architecture core of
The front end unit circuitry 930 may include branch prediction circuitry 932 coupled to an instruction cache circuitry 934, which is coupled to an instruction translation lookaside buffer (TLB) 936, which is coupled to instruction fetch circuitry 938, which is coupled to decode circuitry 940. In one example, the instruction cache circuitry 934 is included in the memory unit circuitry 970 rather than the front-end circuitry 930. The decode circuitry 940 (or decoder) may decode instructions, and generate as an output one or more micro-operations, micro-code entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode circuitry 940 may further include an address generation unit (AGU, not shown) circuitry. In one example, the AGU generates an LSU address using forwarded register ports, and may further perform branch forwarding (e.g., immediate offset branch forwarding, LR register branch forwarding, etc.). The decode circuitry 940 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), etc. In one example, the core 990 includes a microcode ROM (not shown) or other medium that stores microcode for certain macroinstructions (e.g., in decode circuitry 940 or otherwise within the front end circuitry 930). In one example, the decode circuitry 940 includes a micro-operation (micro-op) or operation cache (not shown) to hold/cache decoded operations, micro-tags, or micro-operations generated during the decode or other stages of the processor pipeline 900. The decode circuitry 940 may be coupled to rename/allocator unit circuitry 952 in the execution engine circuitry 950.
The execution engine circuitry 950 includes the rename/allocator unit circuitry 952 coupled to a retirement unit circuitry 954 and a set of one or more scheduler(s) circuitry 956. The scheduler(s) circuitry 956 represents any number of different schedulers, including reservations stations, central instruction window, etc. In some examples, the scheduler(s) circuitry 956 can include arithmetic logic unit (ALU) scheduler/scheduling circuitry, ALU queues, arithmetic generation unit (AGU) scheduler/scheduling circuitry, AGU queues, etc. The scheduler(s) circuitry 956 is coupled to the physical register file(s) circuitry 958. Each of the physical register file(s) circuitry 958 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating-point, packed integer, packed floating-point, vector integer, vector floating-point, status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. In one example, the physical register file(s) circuitry 958 includes vector registers unit circuitry, writemask registers unit circuitry, and scalar register unit circuitry. These register units may provide architectural vector registers, vector mask registers, general-purpose registers, etc. The physical register file(s) circuitry 958 is coupled to the retirement unit circuitry 954 (also known as a retire queue or a retirement queue) to illustrate various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer(s) (ROB(s)) and a retirement register file(s); using a future file(s), a history buffer(s), and a retirement register file(s); using a register maps and a pool of registers; etc.). The retirement unit circuitry 954 and the physical register file(s) circuitry 958 are coupled to the execution cluster(s) 960. 
The execution cluster(s) 960 includes a set of one or more execution unit(s) circuitry 962 and a set of one or more memory access circuitry 964. The execution unit(s) circuitry 962 may perform various arithmetic, logic, floating-point or other types of operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar integer, scalar floating-point, packed integer, packed floating-point, vector integer, vector floating-point). While some examples may include a number of execution units or execution unit circuitry dedicated to specific functions or sets of functions, other examples may include only one execution unit circuitry or multiple execution units/execution unit circuitry that all perform all functions. The scheduler(s) circuitry 956, physical register file(s) circuitry 958, and execution cluster(s) 960 are shown as being possibly plural because certain examples create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating-point/packed integer/packed floating-point/vector integer/vector floating-point pipeline, and/or a memory access pipeline that each have their own scheduler circuitry, physical register file(s) circuitry, and/or execution cluster—and in the case of a separate memory access pipeline, certain examples are implemented in which only the execution cluster of this pipeline has the memory access unit(s) circuitry 964). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.
In some examples, the execution engine unit circuitry 950 may perform load store unit (LSU) address/data pipelining to an Advanced Microcontroller Bus (AMB) interface (not shown), and address phase and writeback, data phase load, store, and branches.
The set of memory access circuitry 964 is coupled to the memory unit circuitry 970, which includes data TLB circuitry 972 coupled to a data cache circuitry 974 coupled to a level 2 (L2) cache circuitry 976. In one example, the memory access circuitry 964 may include a load unit circuitry, a store address unit circuitry, and a store data unit circuitry, each of which is coupled to the data TLB circuitry 972 in the memory unit circuitry 970. The instruction cache circuitry 934 is further coupled to the level 2 (L2) cache circuitry 976 in the memory unit circuitry 970. In one example, the instruction cache 934 and the data cache 974 are combined into a single instruction and data cache (not shown) in L2 cache circuitry 976, a level 3 (L3) cache circuitry (not shown), and/or main memory. The L2 cache circuitry 976 is coupled to one or more other levels of cache and eventually to a main memory.
The core 990 may support one or more instructions sets (e.g., the x86 instruction set architecture (optionally with some extensions that have been added with newer versions); the MIPS instruction set architecture; the ARM instruction set architecture (optionally with optional additional extensions such as NEON)), including the instruction(s) described herein. In one example, the core 990 includes logic to support a packed data instruction set architecture extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.
In some examples, the register architecture 1100 includes writemask/predicate registers 1115. For example, in some examples, there are 8 writemask/predicate registers (sometimes called k0 through k7) that are each 16-bit, 32-bit, 64-bit, or 128-bit in size. Writemask/predicate registers 1115 may allow for merging (e.g., allowing any set of elements in the destination to be protected from updates during the execution of any operation) and/or zeroing (e.g., zeroing vector masks allow any set of elements in the destination to be zeroed during the execution of any operation). In some examples, each data element position in a given writemask/predicate register 1115 corresponds to a data element position of the destination. In other examples, the writemask/predicate registers 1115 are scalable and consist of a set number of enable bits for a given vector element (e.g., 8 enable bits per 64-bit vector element).
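The merging and zeroing behaviors described above can be sketched as follows. This is an illustrative, element-wise model only (the function name `apply_writemask` and the use of Python lists are assumptions for exposition, not architecture-exact semantics).

```python
# Illustrative sketch of merging vs. zeroing writemask semantics: each
# mask bit selects whether a destination element receives the new result,
# keeps its old value (merging), or is cleared (zeroing).
def apply_writemask(dest, result, mask, zeroing=False):
    out = []
    for d, r, m in zip(dest, result, mask):
        if m:
            out.append(r)                    # mask bit 1: take the new element
        else:
            out.append(0 if zeroing else d)  # mask bit 0: zero it, or keep old value
    return out

merged = apply_writemask([1, 2, 3, 4], [10, 20, 30, 40], [1, 0, 1, 0])
zeroed = apply_writemask([1, 2, 3, 4], [10, 20, 30, 40], [1, 0, 1, 0], zeroing=True)
```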
The register architecture 1100 includes a plurality of general-purpose registers 1125. These registers may be 16-bit, 32-bit, 64-bit, etc. and can be used for scalar operations. In some examples, these registers are referenced by the names RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, and R8 through R15.
In some examples, the register architecture 1100 includes scalar floating-point (FP) register 1145 which is used for scalar floating-point operations on 32/64/80-bit floating-point data using the x87 instruction set architecture extension or as MMX registers to perform operations on 64-bit packed integer data, as well as to hold operands for some operations performed between the MMX and XMM registers.
One or more flag registers 1140 (e.g., EFLAGS, RFLAGS, etc.) store status and control information for arithmetic, compare, and system operations. For example, the one or more flag registers 1140 may store condition code information such as carry, parity, auxiliary carry, zero, sign, and overflow. In some examples, the one or more flag registers 1140 are called program status and control registers.
Segment registers 1120 contain segment pointers for use in accessing memory. In some examples, these registers are referenced by the names CS, DS, SS, ES, FS, and GS.
Machine specific registers (MSRs) 1135 control and report on processor performance. Most MSRs 1135 handle system-related functions and are not accessible to an application program. Machine check registers 1160 consist of control, status, and error reporting MSRs that are used to detect and report on hardware errors.
One or more instruction pointer register(s) 1130 store an instruction pointer value. Control register(s) 1155 (e.g., CR0-CR4) determine the operating mode of a processor (e.g., processor 770, 780, 738, 715, and/or 800) and the characteristics of a currently executing task. Debug registers 1150 control and allow for the monitoring of a processor or core's debugging operations.
Memory (mem) management registers 1165 specify the locations of data structures used in protected mode memory management. These registers may include a GDTR, an IDTR, a task register, and an LDTR register.
Alternative examples may use wider or narrower registers. Additionally, alternative examples may use more, fewer, or different register files and registers. The register architecture 1100 may, for example, be used in physical register file(s) circuitry 958.
An instruction set architecture (ISA) may include one or more instruction formats. A given instruction format may define various fields (e.g., number of bits, location of bits) to specify, among other things, the operation to be performed (e.g., opcode) and the operand(s) on which that operation is to be performed and/or other data field(s) (e.g., mask). Some instruction formats are further broken down through the definition of instruction templates (or sub-formats). For example, the instruction templates of a given instruction format may be defined to have different subsets of the instruction format's fields (the included fields are typically in the same order, but at least some have different bit positions because there are fewer fields included) and/or defined to have a given field interpreted differently. Thus, each instruction of an ISA is expressed using a given instruction format (and, if defined, in a given one of the instruction templates of that instruction format) and includes fields for specifying the operation and the operands. For example, an exemplary ADD instruction has a specific opcode and an instruction format that includes an opcode field to specify that opcode and operand fields to select operands (source1/destination and source2); and an occurrence of this ADD instruction in an instruction stream will have specific contents in the operand fields that select specific operands. In addition, though the description below is made in the context of the x86 ISA, it is within the knowledge of one skilled in the art to apply the teachings of the present disclosure to another ISA.
Examples of the instruction(s) described herein may be embodied in different formats. Additionally, exemplary systems, architectures, and pipelines are detailed below. Examples of the instruction(s) may be executed on such systems, architectures, and pipelines, but are not limited to those detailed.
The prefix(es) field(s) 1201, when used, modifies an instruction. In some examples, one or more prefixes are used to repeat string instructions (e.g., 0xF2, 0xF3), to provide segment overrides (e.g., 0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65), to perform bus lock operations (e.g., 0xF0), and/or to change operand (e.g., 0x66) and address sizes (e.g., 0x67). Certain instructions require a mandatory prefix (e.g., 0x66, 0xF2, 0xF3, etc.). Certain of these prefixes may be considered “legacy” prefixes. Other prefixes, one or more examples of which are detailed herein, indicate and/or provide further capability, such as specifying particular registers. The other prefixes typically follow the “legacy” prefixes.
The opcode field 1203 is used to at least partially define the operation to be performed upon a decoding of the instruction. In some examples, a primary opcode encoded in the opcode field 1203 is one, two, or three bytes in length. In other examples, a primary opcode can be a different length. An additional 3-bit opcode field is sometimes encoded in another field.
The addressing field 1205 is used to address one or more operands of the instruction, such as a location in memory or one or more registers.
The content of the MOD field 1342 distinguishes between memory access and non-memory access modes. In some examples, when the MOD field 1342 has a binary value of 11 (11b), a register-direct addressing mode is utilized, and otherwise register-indirect addressing is used.
The register field 1344 may encode either the destination register operand or a source register operand, or may encode an opcode extension and not be used to encode any instruction operand. The content of register index field 1344, directly or through address generation, specifies the locations of a source or destination operand (either in a register or in memory). In some examples, the register field 1344 is supplemented with an additional bit from a prefix (e.g., prefix 1201) to allow for greater addressing.
The R/M field 1346 may be used to encode an instruction operand that references a memory address or may be used to encode either the destination register operand or a source register operand. Note the R/M field 1346 may be combined with the MOD field 1342 to dictate an addressing mode in some examples.
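The MOD, reg, and R/M fields described above occupy fixed bit positions within the ModR/M byte (bits 7:6, 5:3, and 2:0, respectively, in the conventional x86 layout). A minimal decoding sketch, with the helper name `decode_modrm` assumed for illustration:

```python
# Sketch: splitting a ModR/M byte into its MOD (bits 7:6), reg (bits 5:3),
# and R/M (bits 2:0) fields as described above.
def decode_modrm(byte):
    mod = (byte >> 6) & 0b11
    reg = (byte >> 3) & 0b111
    rm = byte & 0b111
    return mod, reg, rm

# 0xD8 = 11 011 000b: MOD = 11b, so a register-direct addressing mode is used.
mod, reg, rm = decode_modrm(0xD8)
```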
The SIB byte 1304 includes a scale field 1352, an index field 1354, and a base field 1356 to be used in the generation of an address. The scale field 1352 indicates a scaling factor. The index field 1354 specifies an index register to use. In some examples, the index field 1354 is supplemented with an additional bit from a prefix (e.g., prefix 1201) to allow for greater addressing. The base field 1356 specifies a base register to use. In some examples, the base field 1356 is supplemented with an additional bit from a prefix (e.g., prefix 1201) to allow for greater addressing. In practice, the content of the scale field 1352 allows for the scaling of the content of the index field 1354 for memory address generation (e.g., for address generation that uses 2^scale*index+base).
Some addressing forms utilize a displacement value to generate a memory address. For example, a memory address may be generated according to 2^scale*index+base+displacement, index*scale+displacement, r/m+displacement, instruction pointer (RIP/EIP)+displacement, register+displacement, etc. The displacement may be a 1-byte, 2-byte, 4-byte, etc. value. In some examples, a displacement 1207 provides this value. Additionally, in some examples, a displacement factor usage is encoded in the MOD field of the addressing field 1205 that indicates a compressed displacement scheme for which a displacement value is calculated and stored in the displacement field 1207.
In some examples, an immediate field 1209 specifies an immediate value for the instruction. An immediate value may be encoded as a 1-byte value, a 2-byte value, a 4-byte value, etc.
Instructions using the first prefix 1201(A) may specify up to three registers using 3-bit fields depending on the format: 1) using the reg field 1344 and the R/M field 1346 of the Mod R/M byte 1302; 2) using the Mod R/M byte 1302 with the SIB byte 1304 including using the reg field 1344 and the base field 1356 and index field 1354; or 3) using the register field of an opcode.
In the first prefix 1201(A), bit positions 7:4 are set as 0100. Bit position 3 (W) can be used to determine the operand size but may not solely determine operand width. For example, when W=0, the operand size is determined by a code segment descriptor (CS.D), and when W=1, the operand size is 64-bit.
Note that the addition of another bit allows for 16 (2^4) registers to be addressed, whereas the MOD R/M reg field 1344 and MOD R/M R/M field 1346 alone can each only address 8 registers.
In the first prefix 1201(A), bit position 2 (R) may be an extension of the MOD R/M reg field 1344 and may be used to modify the ModR/M reg field 1344 when that field encodes a general-purpose register, a 64-bit packed data register (e.g., an SSE register), or a control or debug register. R is ignored when Mod R/M byte 1302 specifies other registers or defines an extended opcode.
Bit position 1 (X) may modify the SIB byte index field 1354.
Bit position 0 (B) may modify the base in the Mod R/M R/M field 1346 or the SIB byte base field 1356; or it may modify the opcode register field used for accessing general purpose registers (e.g., general purpose registers 1125).
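The bit layout of the first prefix described above (bits 7:4 fixed at 0100b, then W, R, X, B) can be sketched as a small decoder. The helper name `decode_first_prefix` is assumed for illustration:

```python
# Sketch of decoding the first prefix described above: bit positions 7:4
# must be 0100b; bit 3 is W, bit 2 is R, bit 1 is X, bit 0 is B.
def decode_first_prefix(byte):
    assert (byte >> 4) == 0b0100, "not a first-prefix byte"
    return {
        "W": (byte >> 3) & 1,  # operand-size bit
        "R": (byte >> 2) & 1,  # extends the ModR/M reg field
        "X": (byte >> 1) & 1,  # extends the SIB index field
        "B": byte & 1,         # extends ModR/M R/M or the SIB base field
    }

fields = decode_first_prefix(0b0100_1001)  # W=1, R=0, X=0, B=1
```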
In some examples, the second prefix 1201(B) comes in two forms—a two-byte form and a three-byte form. The two-byte second prefix 1201(B) is used mainly for 128-bit, scalar, and some 256-bit instructions; while the three-byte second prefix 1201(B) provides a compact replacement of the first prefix 1201(A) and 3-byte opcode instructions.
Instructions that use this prefix may use the Mod R/M R/M field 1346 to encode the instruction operand that references a memory address or encode either the destination register operand or a source register operand.
Instructions that use this prefix may use the Mod R/M reg field 1344 to encode either the destination register operand or a source register operand, be treated as an opcode extension and not used to encode any instruction operand.
For instruction syntax that supports four operands, vvvv, the Mod R/M R/M field 1346, and the Mod R/M reg field 1344 encode three of the four operands. Bits[7:4] of the immediate 1209 are then used to encode the third source register operand.
Bit[7] of byte 2 1617 is used similar to W of the first prefix 1201(A) including helping to determine promotable operand sizes. Bit[2] is used to dictate the length (L) of the vector (where a value of 0 is a scalar or 128-bit vector and a value of 1 is a 256-bit vector). Bits[1:0] provide opcode extensionality equivalent to some legacy prefixes (e.g., 00=no prefix, 01=66H, 10=F3H, and 11=F2H). Bits[6:3], shown as vvvv, may be used to: 1) encode the first source register operand, specified in inverted (1s complement) form and valid for instructions with 2 or more source operands; 2) encode the destination register operand, specified in 1s complement form for certain vector shifts; or 3) not encode any operand, the field is reserved and should contain a certain value, such as 1111b.
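The payload byte layout just described (W-like bit in bit 7, inverted vvvv in bits 6:3, L in bit 2, pp in bits 1:0) can be sketched as follows; the helper name `decode_payload_byte` is assumed for illustration:

```python
# Sketch of the payload byte layout described above: bit 7 is the W-like
# bit, bits 6:3 are vvvv (stored inverted, i.e. 1s complement), bit 2 is
# the vector length L, and bits 1:0 are the opcode-extension field pp
# (00=no prefix, 01=66H, 10=F3H, 11=F2H).
def decode_payload_byte(byte):
    return {
        "W": (byte >> 7) & 1,
        "vvvv": (~(byte >> 3)) & 0b1111,  # un-invert the register specifier
        "L": (byte >> 2) & 1,             # 0 = scalar/128-bit, 1 = 256-bit
        "pp": byte & 0b11,
    }

f = decode_payload_byte(0b1_1110_1_01)  # stored vvvv field 1110b names register 1
```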
Instructions that use this prefix may use the Mod R/M R/M field 1346 to encode the instruction operand that references a memory address or encode either the destination register operand or a source register operand.
Instructions that use this prefix may use the Mod R/M reg field 1344 to encode either the destination register operand or a source register operand, be treated as an opcode extension and not used to encode any instruction operand.
For instruction syntax that supports four operands, vvvv, the Mod R/M R/M field 1346, and the Mod R/M reg field 1344 encode three of the four operands. Bits[7:4] of the immediate 1209 are then used to encode the third source register operand.
The third prefix 1201(C) can encode 32 vector registers (e.g., 128-bit, 256-bit, and 512-bit registers) in 64-bit mode. In some examples, instructions that utilize a writemask/opmask (see discussion of registers in a previous figure, such as
The third prefix 1201(C) may encode functionality that is specific to instruction classes (e.g., a packed instruction with “load+op” semantic can support embedded broadcast functionality, a floating-point instruction with rounding semantic can support static rounding functionality, a floating-point instruction with non-rounding arithmetic semantic can support “suppress all exceptions” functionality, etc.).
The first byte of the third prefix 1201(C) is a format field 1711 that has a value, in one example, of 62H. Subsequent bytes are referred to as payload bytes 1715-1719 and collectively form a 24-bit value of P[23:0] providing specific capability in the form of one or more fields (detailed herein).
In some examples, P[1:0] of payload byte 1719 are identical to the low two mmmmm bits. P[3:2] are reserved in some examples. Bit P[4] (R′) allows access to the high 16 vector register set when combined with P[7] and the ModR/M reg field 1344. P[6] can also provide access to a high 16 vector register when SIB-type addressing is not needed. P[7:5] consist of an R, X, and B which are operand specifier modifier bits for vector register, general purpose register, memory addressing and allow access to the next set of 8 registers beyond the low 8 registers when combined with the ModR/M register field 1344 and ModR/M R/M field 1346. P[9:8] provide opcode extensionality equivalent to some legacy prefixes (e.g., 00=no prefix, 01=66H, 10=F3H, and 11=F2H). P[10] in some examples is a fixed value of 1. P[14:11], shown as vvvv, may be used to: 1) encode the first source register operand, specified in inverted (1s complement) form and valid for instructions with 2 or more source operands; 2) encode the destination register operand, specified in 1s complement form for certain vector shifts; or 3) not encode any operand, the field is reserved and should contain a certain value, such as 1111b.
P[15] is similar to W of the first prefix 1201(A) and second prefix 1201(B) and may serve as an opcode extension bit or operand size promotion.
P[18:16] specify the index of a register in the opmask (writemask) registers (e.g., writemask/predicate registers 1115). In one example, the specific value aaa=000 has a special behavior implying no opmask is used for the particular instruction (this may be implemented in a variety of ways, including the use of an opmask hardwired to all ones or hardware that bypasses the masking hardware). When merging, vector masks allow any set of elements in the destination to be protected from updates during the execution of any operation (specified by the base operation and the augmentation operation); in one example, preserving the old value of each element of the destination where the corresponding mask bit has a 0. In contrast, when zeroing, vector masks allow any set of elements in the destination to be zeroed during the execution of any operation (specified by the base operation and the augmentation operation); in one example, an element of the destination is set to 0 when the corresponding mask bit has a 0 value. A subset of this functionality is the ability to control the vector length of the operation being performed (that is, the span of elements being modified, from the first to the last one); however, it is not necessary that the elements that are modified be consecutive. Thus, the opmask field allows for partial vector operations, including loads, stores, arithmetic, logical, etc. While examples are described in which the opmask field's content selects one of a number of opmask registers that contains the opmask to be used (and thus the opmask field's content indirectly identifies that masking to be performed), alternative examples instead or additionally allow the mask write field's content to directly specify the masking to be performed.
P[19] can be combined with P[14:11] to encode a second source vector register in a non-destructive source syntax which can access the upper 16 vector registers using P[19]. P[20] encodes multiple functionalities, which differ across different classes of instructions and can affect the meaning of the vector length/rounding control specifier field (P[22:21]). P[23] indicates support for merging-writemasking (e.g., when set to 0) or support for zeroing and merging-writemasking (e.g., when set to 1).
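As a rough illustration of the bit positions described above, selected payload fields can be extracted with shifts and masks. The helper below is an illustrative sketch following the text, not an authoritative or complete decoder, and the `payload` value it takes is hypothetical:

```python
def extract_fields(payload):
    """Extract selected bit fields from a prefix payload value.

    Field positions follow the description in the text; this is a
    sketch for illustration, not a complete decoder.
    """
    bit = lambda p, n: (p >> n) & 1
    return {
        "mmmmm_low": payload & 0b11,             # P[1:0], low two mmmmm bits
        "R_prime":   bit(payload, 4),            # P[4], high-16 register access
        "pp":        (payload >> 8) & 0b11,      # P[9:8], legacy-prefix equivalent
        "vvvv":      (~(payload >> 11)) & 0xF,   # P[14:11], stored in 1s-complement form
        "W":         bit(payload, 15),           # P[15], opcode extension / size promotion
        "aaa":       (payload >> 16) & 0b111,    # P[18:16], opmask register index
        "z":         bit(payload, 23),           # P[23], zeroing vs. merging
    }

# Example: vvvv stored as 0101b decodes (after 1s-complement inversion) to register 10.
fields = extract_fields((0b0101 << 11) | (1 << 15) | 0b10)
```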
Exemplary encodings of registers in instructions using the third prefix 1201(C) are detailed in the following tables.
Program code may be applied to input information to perform the functions described herein and generate output information. The output information may be applied to one or more output devices, in known fashion. For purposes of this application, a processing system includes any system that has a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a microprocessor, or any combination thereof.
The program code may be implemented in a high-level procedural or object-oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
Examples of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Examples may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
One or more aspects of at least one example may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores,” may be stored on a tangible, machine-readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.
Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks, any other type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), phase change memory (PCM), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
Accordingly, examples also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines structures, circuits, apparatuses, processors and/or system features described herein. Such examples may also be referred to as program products.
In some cases, an instruction converter may be used to convert an instruction from a source instruction set architecture to a target instruction set architecture. For example, the instruction converter may translate (e.g., using static binary translation, dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction to one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on processor, off processor, or part on and part off processor.
In one or more first embodiments, an integrated circuit comprises first circuitry to receive a first message which indicates a first address which corresponds to a first line of data, and identify a first location of a primary cache based on the first message, second circuitry coupled to the first circuitry, wherein, based on the first message, the second circuitry is to identify a second location of a victim cache, comprising the second circuitry to perform a calculation based on one of an encryption key or a hash function, move a second line from the first location to the second location, and store the first line to the first location.
In one or more second embodiments, further to the first embodiment, the second circuitry to identify the second location comprises the second circuitry to determine, based on the calculation, an index of a set of the victim cache.
In one or more third embodiments, further to the first embodiment or the second embodiment, the primary cache comprises a skewed cache.
In one or more fourth embodiments, further to any of the first through third embodiments, the first circuitry is further to receive a second message which indicates a second address which corresponds to the second line, and wherein the second circuitry is further to move the second line from the second location to the primary cache based on the second message.
In one or more fifth embodiments, further to the fourth embodiment, the second circuitry to move the second line from the second location to the primary cache comprises the second circuitry to swap the second line and a third line between the primary cache and the victim cache.
In one or more sixth embodiments, further to the fifth embodiment, the second line and the third line are swapped based on a valid state of the third line.
In one or more seventh embodiments, further to the fourth embodiment, based on the first message, the second circuitry is further to evict a third line from the second location to a memory before the second circuitry is to move the second line from the first location to the second location.
In one or more eighth embodiments, further to the seventh embodiment, the second circuitry is to evict the third line based on a valid state of the third line.
In one or more ninth embodiments, further to the fourth embodiment, the second circuitry is to move the second line to the second location based on a valid state of the second line.
In one or more tenth embodiments, a method at a processor comprises receiving a first message which indicates a first address which corresponds to a first line of data, identifying a first location of a primary cache based on the first message, identifying, based on the first message, a second location of a victim cache, comprising performing a calculation based on one of an encryption key or a hash function, moving a second line from the first location to the second location, and storing the first line to the first location.
In one or more eleventh embodiments, further to the tenth embodiment, identifying the second location comprises determining, based on the calculation, an index of a set of the victim cache.
In one or more twelfth embodiments, further to the tenth embodiment or the eleventh embodiment, the primary cache comprises a skewed cache.
In one or more thirteenth embodiments, further to any of the tenth through twelfth embodiments, the method further comprises receiving a second message which indicates a second address which corresponds to the second line, and moving the second line from the second location to the primary cache based on the second message.
In one or more fourteenth embodiments, further to the thirteenth embodiment, moving the second line from the second location to the primary cache comprises swapping the second line and a third line between the primary cache and the victim cache.
In one or more fifteenth embodiments, further to the fourteenth embodiment, the second line and the third line are swapped based on a valid state of the third line.
In one or more sixteenth embodiments, further to the thirteenth embodiment, the method further comprises, based on the first message, evicting a third line from the second location to a memory before the second line is moved from the first location to the second location.
In one or more seventeenth embodiments, further to the sixteenth embodiment, the third line is evicted based on a valid state of the third line.
In one or more eighteenth embodiments, further to the thirteenth embodiment, the second line is moved to the second location based on a valid state of the second line.
In one or more nineteenth embodiments, a system comprises an integrated circuit (IC) chip comprising first circuitry to receive a first message which indicates a first address which corresponds to a first line of data, and identify a first location of a primary cache based on the first message, second circuitry coupled to the first circuitry, wherein, based on the first message, the second circuitry is to identify a second location of a victim cache, comprising the second circuitry to perform a calculation based on one of an encryption key or a hash function, move a second line from the first location to the second location, and store the first line to the first location, and a display device coupled to the IC chip, the display device to display an image based on a signal communicated with the IC chip.
In one or more twentieth embodiments, further to the nineteenth embodiment, the second circuitry to identify the second location comprises the second circuitry to determine, based on the calculation, an index of a set of the victim cache.
In one or more twenty-first embodiments, further to the nineteenth embodiment or the twentieth embodiment, the primary cache comprises a skewed cache.
In one or more twenty-second embodiments, further to any of the nineteenth through twenty-first embodiments, the first circuitry is further to receive a second message which indicates a second address which corresponds to the second line, and wherein the second circuitry is further to move the second line from the second location to the primary cache based on the second message.
In one or more twenty-third embodiments, further to the twenty-second embodiment, the second circuitry to move the second line from the second location to the primary cache comprises the second circuitry to swap the second line and a third line between the primary cache and the victim cache.
In one or more twenty-fourth embodiments, further to the twenty-third embodiment, the second line and the third line are swapped based on a valid state of the third line.
In one or more twenty-fifth embodiments, further to the twenty-second embodiment, based on the first message, the second circuitry is further to evict a third line from the second location to a memory before the second circuitry is to move the second line from the first location to the second location.
In one or more twenty-sixth embodiments, further to the twenty-fifth embodiment, the second circuitry is to evict the third line based on a valid state of the third line.
In one or more twenty-seventh embodiments, further to the twenty-second embodiment, the second circuitry is to move the second line to the second location based on a valid state of the second line.
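The fill flow recited in the embodiments above — computing a keyed index into the victim cache, moving the displaced primary-cache line there, and installing the new line — can be sketched as a toy software model. The cache geometry, the key, and the use of a keyed hash (standing in for whichever encryption-key or hash-function calculation an implementation performs) are all hypothetical assumptions for illustration:

```python
import hashlib

class RandomizedVictimCache:
    """Toy model: a direct-mapped primary cache backed by a keyed-index victim cache."""

    def __init__(self, primary_sets=8, victim_sets=4, key=b"secret"):
        self.primary = [None] * primary_sets   # each entry: (address, data) or None
        self.victim = [None] * victim_sets
        self.key = key

    def _primary_index(self, addr):
        return addr % len(self.primary)

    def _victim_index(self, addr):
        # Keyed hash stands in for the encryption-key/hash-function calculation,
        # so an adversary without the key cannot predict victim-cache placement.
        digest = hashlib.sha256(self.key + addr.to_bytes(8, "little")).digest()
        return digest[0] % len(self.victim)

    def fill(self, addr, data):
        """Install a new line; displace the old primary line into the victim cache."""
        p = self._primary_index(addr)
        displaced = self.primary[p]
        if displaced is not None:
            # A valid line already at the victim location would be evicted
            # to memory here (not modeled).
            self.victim[self._victim_index(displaced[0])] = displaced
        self.primary[p] = (addr, data)

    def lookup(self, addr):
        p = self._primary_index(addr)
        if self.primary[p] and self.primary[p][0] == addr:
            return self.primary[p][1]
        v = self._victim_index(addr)
        if self.victim[v] and self.victim[v][0] == addr:
            # Victim hit: move the line back to the primary cache and
            # displace the resident primary line into the victim cache.
            line = self.victim[v]
            self.victim[v] = None
            displaced = self.primary[p]
            self.primary[p] = line
            if displaced is not None:
                self.victim[self._victim_index(displaced[0])] = displaced
            return line[1]
        return None   # miss in both caches
```

In this sketch, addresses 0 and 8 conflict in the primary cache, so filling both displaces the first into the victim cache; a later lookup of the displaced line hits in the victim cache and swaps it back.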
Numerous details are described herein to provide a more thorough explanation of the embodiments of the present disclosure. It will be apparent to one skilled in the art, however, that embodiments of the present disclosure may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring embodiments of the present disclosure.
Note that in the corresponding drawings of the embodiments, signals are represented with lines. Some lines may be thicker, to indicate a greater number of constituent signal paths, and/or have arrows at one or more ends, to indicate a direction of information flow. Such indications are not intended to be limiting. Rather, the lines are used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit or a logical unit. Any represented signal, as dictated by design needs or preferences, may actually comprise one or more signals that may travel in either direction and may be implemented with any suitable type of signal scheme.
Throughout the specification, and in the claims, the term “connected” means a direct connection, such as electrical, mechanical, or magnetic connection between the things that are connected, without any intermediary devices. The term “coupled” means a direct or indirect connection, such as a direct electrical, mechanical, or magnetic connection between the things that are connected or an indirect connection, through one or more passive or active intermediary devices. The term “circuit” or “module” may refer to one or more passive and/or active components that are arranged to cooperate with one another to provide a desired function. The term “signal” may refer to at least one current signal, voltage signal, magnetic signal, or data/clock signal. The meaning of “a,” “an,” and “the” include plural references. The meaning of “in” includes “in” and “on.”
The term “device” may generally refer to an apparatus according to the context of the usage of that term. For example, a device may refer to a stack of layers or structures, a single structure or layer, a connection of various structures having active and/or passive elements, etc. Generally, a device is a three-dimensional structure with a plane along the x-y direction and a height along the z direction of an x-y-z Cartesian coordinate system. The plane of the device may also be the plane of an apparatus which comprises the device.
The term “scaling” generally refers to converting a design (schematic and layout) from one process technology to another process technology and subsequently being reduced in layout area. The term “scaling” generally also refers to downsizing layout and devices within the same technology node. The term “scaling” may also refer to adjusting (e.g., slowing down or speeding up—i.e. scaling down, or scaling up respectively) of a signal frequency relative to another parameter, for example, power supply level.
The terms “substantially,” “close,” “approximately,” “near,” and “about,” generally refer to being within +/−10% of a target value. For example, unless otherwise specified in the explicit context of their use, the terms “substantially equal,” “about equal” and “approximately equal” mean that there is no more than incidental variation among things so described. In the art, such variation is typically no more than +/−10% of a predetermined target value.
It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in other orientations than those illustrated or otherwise described herein.
Unless otherwise specified, the use of the ordinal adjectives “first,” “second,” and “third,” etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking or in any other manner.
The terms “left,” “right,” “front,” “back,” “top,” “bottom,” “over,” “under,” and the like in the description and in the claims, if any, are used for descriptive purposes and not necessarily for describing permanent relative positions. For example, the terms “over,” “under,” “front side,” “back side,” “top,” “bottom,” and “on” as used herein refer to a relative position of one component, structure, or material with respect to other referenced components, structures or materials within a device, where such physical relationships are noteworthy. These terms are employed herein for descriptive purposes only and predominantly within the context of a device z-axis and therefore may be relative to an orientation of a device. Hence, a first material “over” a second material in the context of a figure provided herein may also be “under” the second material if the device is oriented upside-down relative to the context of the figure provided. In the context of materials, one material disposed over or under another may be directly in contact or may have one or more intervening materials. Moreover, one material disposed between two materials may be directly in contact with the two layers or may have one or more intervening layers. In contrast, a first material “on” a second material is in direct contact with that second material. Similar distinctions are to be made in the context of component assemblies.
The term “between” may be employed in the context of the z-axis, x-axis or y-axis of a device. A material that is between two other materials may be in contact with one or both of those materials, or it may be separated from both of the other two materials by one or more intervening materials. A material “between” two other materials may therefore be in contact with either of the other two materials, or it may be coupled to the other two materials through an intervening material. A device that is between two other devices may be directly connected to one or both of those devices, or it may be separated from both of the other two devices by one or more intervening devices.
As used throughout this description, and in the claims, a list of items joined by the term “at least one of” or “one or more of” can mean any combination of the listed terms. For example, the phrase “at least one of A, B or C” can mean A; B; C; A and B; A and C; B and C; or A, B and C. It is pointed out that those elements of a figure having the same reference numbers (or names) as the elements of any other figure can operate or function in any manner similar to that described, but are not limited to such.
In addition, the various elements of combinatorial logic and sequential logic discussed in the present disclosure may pertain both to physical structures (such as AND gates, OR gates, or XOR gates), or to synthesized or otherwise optimized collections of devices implementing the logical structures that are Boolean equivalents of the logic under discussion.
Techniques and architectures for operating a cache are described herein. In the above description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of certain embodiments. It will be apparent, however, to one skilled in the art that certain embodiments can be practiced without these specific details. In other instances, structures and devices are shown in block diagram form in order to avoid obscuring the description.
Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.
Some portions of the detailed description herein are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the computing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the discussion herein, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
Certain embodiments also relate to apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs) such as dynamic RAM (DRAM), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, and coupled to a computer system bus.
The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear from the description herein. In addition, certain embodiments are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of such embodiments as described herein.
Besides what is described herein, various modifications may be made to the disclosed embodiments and implementations thereof without departing from their scope. Therefore, the illustrations and examples herein should be construed in an illustrative, and not a restrictive sense. The scope of the invention should be measured solely by reference to the claims that follow.
The present application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application Ser. No. 63/335,671 filed Apr. 27, 2022 and entitled “DEVICE, METHOD AND SYSTEM TO SUPPLEMENT A CACHE WITH A RANDOMIZED VICTIM CACHE,” which is herein incorporated by reference in its entirety.