1. Field of the Invention
The present invention is generally directed to computing operations performed in multi-processor computer systems. More particularly, the present invention is directed to reducing processor cache-coherency probe traffic resulting from false sharing of data in multi-processor computer systems and applications thereof.
2. Background
A multiprocessor computing system includes a main memory and a plurality of processors. Each processor can read from and write to the main memory. In addition to the main memory, each processor includes a cache memory, or simply a cache. The cache of a processor can be accessed by that processor faster than that processor can access the main memory. Thus, each processor stores frequently accessed data in its cache.
Consequently, multiple processors in a multi-processor system can each hold a copy of data corresponding to a single location in the main memory. Because each processor can access its own cache faster than it can access the main memory, each processor has the potential to update its local copy of the data before the updated data is stored in the main memory. If one of the processors modifies its local copy of the data and the other processors do not receive those modifications, the local copy of the data in each of the other processors may be out-of-date.
Conventional processors in a multiprocessor system implement one or more cache-coherency protocols to signal changes to cached data shared by multiple processors. Example cache-coherency protocols include MOESI, MESI, MESIF, and others. The signals that are broadcast are termed probes or snoops.
Unfortunately, the sharing of cached data between processors in a multiprocessor system can lead to false sharing. False sharing occurs when multiple processors each store a local copy of a cache line, but each processor accesses a different data object/memory block of the cache line.
For example, a first processor and a second processor may each store a local copy of a cache line that includes two data objects (a data object A and a data object B), wherein the first processor accesses only the data object A and the second processor accesses only the data object B. Conventionally, if the first processor modifies data object A of its local copy of the cache line, the first processor will send a probe to the second processor, causing the second processor to update its local copy of the cache line even though the second processor is not accessing data object A. The first and second processors in this example are involved in false sharing because, although they each store local copies of the same cache line, they are each accessing different data objects of the cache line. False sharing is inefficient, adds performance overhead, and is therefore undesirable.
Conventional solutions for dealing with false sharing are software-based solutions. One such software-based solution is to pad data to ensure that data objects that are accessed by two different processors do not fall on the same cache line. For example, if the first processor accessed only data object A and the second processor accessed only data object B, then this conventional solution would be to pad the data so that data object A falls on one cache line and data object B falls on another cache line.
This type of conventional solution is problematic for several reasons. For example, padding the data increases the memory footprint, thereby affecting performance because worthless data (i.e., the padding data) must be moved on a system's data buses.
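The padding approach and its cost can be illustrated with a minimal sketch. It assumes a 64-byte cache line and two 8-byte data objects; these sizes and the names `LINE_SIZE`, `cache_line_index`, and `layout` are hypothetical and chosen only for illustration.

```python
LINE_SIZE = 64  # assumed cache-line size in bytes (illustrative, not from the text)

def cache_line_index(offset, line_size=LINE_SIZE):
    """Return the index of the cache line that holds the byte at `offset`."""
    return offset // line_size

def layout(object_size, pad_to_line):
    """Place data objects A and B one after the other.

    If `pad_to_line` is true, each object is padded out to a full cache line.
    Assumes object_size <= LINE_SIZE.
    Returns (offset_a, offset_b, total_footprint_in_bytes).
    """
    stride = LINE_SIZE if pad_to_line else object_size
    offset_a = 0
    offset_b = offset_a + stride
    return offset_a, offset_b, offset_b + stride

# Without padding, A and B land on the same cache line.
a, b, unpadded_size = layout(8, pad_to_line=False)
unpadded_same_line = cache_line_index(a) == cache_line_index(b)

# With padding, A and B land on different lines, at the cost of a larger footprint.
a, b, padded_size = layout(8, pad_to_line=True)
padded_same_line = cache_line_index(a) == cache_line_index(b)
```

Under these assumptions, padding moves the two objects onto separate cache lines but grows the footprint from 16 bytes to 128 bytes, most of which is padding.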
Given the foregoing, what is needed is an improved manner for dealing with false sharing in multiprocessor systems.
Embodiments of the present invention meet the above-described needs by providing improvements for reducing cache probe traffic resulting from false data sharing in multiprocessor systems and applications thereof.
For example, an embodiment of the present invention provides a processing unit for use in a multi-processing unit system. In this embodiment, the multi-processing unit system includes a main memory and another processing unit. The processing unit comprises a cache and logic. The cache is configured to store data from the main memory. The logic is configured to maintain an entry in a directory of the cache. The entry indicates whether either of the processing unit and the other processing unit accesses a data object of a cache line for which the processing unit is a home node.
Another embodiment of the present invention provides a system, including a main memory, a first processing unit, and a second processing unit. The first processing unit and the second processing unit are coupled to the main memory. The first processing unit includes a cache and logic. The cache is configured to store data from the main memory. The logic is configured to maintain an entry in a directory of the cache. The entry indicates whether either of the first processing unit and the second processing unit accesses a data object of a cache line for which the first processing unit is a home node.
A further embodiment of the present invention provides a method implemented in a computing system, wherein the computing system includes a first processing unit, a second processing unit, and a main memory. According to this embodiment, the method includes storing data from the main memory in a cache line, wherein the cache line comprises one or more data objects. The method further includes maintaining an entry in a cache of the first processing unit, wherein the entry indicates whether either of the first processing unit and the second processing unit accesses a data object of the cache line.
Further features and advantages of the invention, as well as the structure and operation of various embodiments of the invention, are described in detail below with reference to the accompanying drawings. It is noted that the invention is not limited to the specific embodiments described herein. Such embodiments are presented herein for illustrative purposes only. Additional embodiments will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein.
The accompanying drawings, which are incorporated herein and form part of the specification, illustrate the present invention and, together with the description, further serve to explain the principles of the invention and to enable a person skilled in the relevant art(s) to make and use the invention.
The features and advantages of the present invention will become more apparent from the detailed description set forth below when taken in conjunction with the drawings, in which like reference characters identify corresponding objects throughout. In the drawings, like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements. The drawing in which an element first appears is indicated by the leftmost digit(s) in the corresponding reference number.
I. Overview
Embodiments of the present invention are directed to reducing cache probe traffic resulting from false data sharing and applications thereof. In the detailed description that follows, references to “one embodiment,” “an embodiment,” “an example embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
Embodiments of the present invention are directed to filtering broadcast probes used to maintain cache coherency on multiprocessor/multi-node systems. According to an embodiment of the present invention, probe-filter (PF) logic uses a portion of a level-three (L3) cache to store a directory of entries that track cache lines. Each node maintains a separate directory and tracks lines cached anywhere in the multiprocessor/multi-node system for which it is the home node. Based on whether a cache line is present in the directory, the PF logic can either generate a directed probe or handle a data-access request without generating any probes.
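The directory lookup described above can be sketched roughly as follows. This is a toy model only: the class `ProbeFilterDirectory`, its methods, and the choice to store a set of caching nodes per line address are assumptions made for illustration, not the structure of the PF logic itself.

```python
class ProbeFilterDirectory:
    """Toy directory kept by a home node: line address -> set of caching nodes."""

    def __init__(self):
        self._entries = {}

    def record(self, line_addr, node):
        """Note that `node` holds a copy of the line at `line_addr`."""
        self._entries.setdefault(line_addr, set()).add(node)

    def probe_targets(self, line_addr, requester):
        """Return the nodes to probe for a request to `line_addr`.

        An empty list means the line is not tracked as cached elsewhere,
        so the request can be handled without generating any probes.
        """
        holders = self._entries.get(line_addr, set())
        return sorted(holders - {requester})

directory = ProbeFilterDirectory()
directory.record(0x1000, node=2)

directed = directory.probe_targets(0x1000, requester=1)  # line is tracked
no_probe = directory.probe_targets(0x2000, requester=1)  # line is not tracked
```

In this sketch a tracked line yields a directed probe to the node that holds it, while an untracked line yields no probes at all.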
Before describing additional details regarding PF logic in accordance with an embodiment of the present invention, it is first helpful to present an example system in which such PF logic may be implemented.
II. An Example System
Referring to
As illustrated in
Execution unit 122 comprises one or more arithmetic logic units (ALUs) for executing instructions, as is well known in the art.
Each cache 124 is configured to store data and/or instructions. By storing the data and/or instructions in cache 124, processor 120 can access the data and/or instructions faster than if it had to retrieve the data and/or instructions from main memory 104. As illustrated in
Bus interface 126 includes a memory controller for controlling access to main memory 104, as is known in the art. In addition, bus interface 126 includes PF logic for filtering probes in accordance with an embodiment of the present invention as described in more detail below.
In the embodiment of
In the embodiment of
A processor is the home node if a cache line originates from an address space in main memory 104 that is assigned to the processor. For example,
Based on whether a cache line is present in the directory or not, PF logic 310 either generates a directed probe or handles the request without generating any probes. In an embodiment, each PF entry comprises a bit mask whose size depends on the number of processors/nodes in the system and the data-object/memory block granularity at which the system tracks false sharing.
For example, in an eight-processor system with 64-byte cache lines that tracks false sharing for every 16 bytes of data, the bit mask is 8 times 4 bits wide for each cache line. The “8” in this example bit mask corresponds to the eight processors in the system. The “4” in this example bit mask corresponds to the fact that there are four 16-byte chunks in a 64-byte cache line (since 64÷16=4). Each of the four 8-bit portions of the bit mask indicates whether a particular processor accesses (i.e., reads/writes) one of the four 16-byte chunks of the 64-byte cache line. A set bit indicates that the chunk is accessed, and an unset bit indicates that the chunk is never accessed (i.e., read or written) by a particular processor. However, it is to be appreciated that a set bit could indicate that a chunk is not accessed, and an unset bit indicates that a chunk is accessed, as would be understood by a person skilled in the art.
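The sizing arithmetic in this example can be written out as a short sketch. The function names and the chunk-major bit ordering (one group of eight processor bits per 16-byte chunk) are assumptions for illustration; the text fixes only the overall width and the meaning of a set bit.

```python
def mask_width(num_processors, line_bytes, chunk_bytes):
    """Number of bits in the bit mask of one probe-filter entry."""
    chunks_per_line = line_bytes // chunk_bytes
    return num_processors * chunks_per_line

def bit_index(processor, chunk, num_processors):
    """Bit position for (processor, chunk); chunk-major ordering is assumed."""
    return chunk * num_processors + processor

def set_access(mask, processor, chunk, num_processors):
    """Return `mask` with the bit for (processor, chunk) set."""
    return mask | (1 << bit_index(processor, chunk, num_processors))

def accesses(mask, processor, chunk, num_processors):
    """True if the mask records that `processor` accesses `chunk`."""
    return bool(mask & (1 << bit_index(processor, chunk, num_processors)))

# Eight processors, 64-byte lines, 16-byte tracking granularity: 8 x 4 = 32 bits.
width = mask_width(num_processors=8, line_bytes=64, chunk_bytes=16)

# Record that processor 3 (zero-based) accesses chunk 2 of the line.
mask = set_access(0, processor=3, chunk=2, num_processors=8)
```

Here `width` evaluates to 32, matching the 8 times 4 bits stated above.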
Example operation of such a bit mask is presented below.
III. Example Operation
For illustrative purposes, and not limitation, an example method for filtering probes in accordance with an embodiment of the present invention is described below in the context of an example two-processor system. In this example two-processor system, it is assumed for illustrative purposes that each cache line has two data objects, A and B, of 32 bytes each and that the two-processor system tracks false sharing at a granularity of 32-byte chunks. It is to be appreciated that the present invention is not limited to this example two-processor system or to any other two-processor system. Based on the description provided herein, a person skilled in the relevant art(s) will understand how to practice probe filtering in accordance with embodiments of the present invention in processor systems including more than two processors.
In this example two-processor system, a first processor, processor 1, and a second processor, processor 2, each hold a copy of the same cache line. For illustrative purposes, it is assumed that processor 1 is the home node for this cache line, and it is further assumed that processor 1 reads and writes both data object A and data object B of the cache line and that processor 2 reads and writes only data object B of the cache line. In this example, PF logic of processor 1 maintains a bit mask in the L3 cache of processor 1. The bit mask for this particular cache line can be represented as follows:
|0|1|, |1|1|
wherein (i) bit 0 (on the extreme right) is set to indicate processor 1 accesses (reads/writes) data object B, (ii) bit 1 (second from the right) is set to indicate processor 2 accesses (reads/writes) data object B, (iii) bit 2 (second from the left) is set to indicate processor 1 accesses (reads/writes) data object A, and (iv) bit 3 (extreme left) is not set to indicate processor 2 does not access (read/write) data object A.
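The four-bit mask just described can be decoded with a few lines of code. The constant names and the helper `is_set` are hypothetical; the bit positions follow the description above, with bit 0 as the rightmost bit.

```python
# Bit positions as described in the text (bit 0 is the rightmost bit).
P1_B, P2_B, P1_A, P2_A = 0, 1, 2, 3

# |0|1|, |1|1| read left to right as bits 3, 2, 1, 0.
mask = 0b0111

def is_set(mask, bit):
    """True if the given bit position is set in the mask."""
    return (mask >> bit) & 1 == 1

decoded = {
    "processor 1 accesses A": is_set(mask, P1_A),
    "processor 2 accesses A": is_set(mask, P2_A),
    "processor 1 accesses B": is_set(mask, P1_B),
    "processor 2 accesses B": is_set(mask, P2_B),
}
```

Only the entry for processor 2 and data object A is false, which is the one unset bit in the mask.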
In a step 504, the PF logic of processor 1 determines whether another processor (e.g., processor 2) is accessing a data object of a cache line that is modified by processor 1 and for which processor 1 is the home node. This determination can be made based on the bit mask described above.
If in step 504, it is determined that no other processors access the data object of the cache line that was modified, then the PF logic of processor 1 is configured not to send any probes to the other processors as indicated in a step 506. For the example bit mask above, if processor 1 modifies object A of the cache line, the PF logic of processor 1 does NOT send a probe broadcast to processor 2 to indicate that the cache line is now dirty because processor 2 is NOT accessing object A.
If, on the other hand, it is determined in step 504 that another processor is accessing the data object of the cache line, then when processor 1 modifies the data object of the cache line, the PF logic of processor 1 is configured to send a probe only to the processors that access that data object of the cache line, as indicated in a step 508. For the example bit mask above, if processor 1 modifies object B of the cache line, the PF logic of processor 1 sends a probe broadcast to processor 2 to indicate that the cache line is now dirty because processor 2 IS accessing object B.
Thus, according to method 500, when a first processor modifies a data object of a cache line that is shared by another processor, the PF logic of the first processor sends a probe to the other processor only if the other processor is accessing the data object of that cache line. In contrast, conventional cache-coherency protocols require a first processor to send a probe whenever the first processor modifies a data object of a shared cache line even though the other processor(s) may not be accessing the data object modified by the first processor. Thus, by implementing the probe filtering of an embodiment of the present invention, the probe traffic between processors is reduced.
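The decision made in steps 504 through 508 can be sketched for the two-processor example as follows. The function `probe_targets_on_write` and the constants are hypothetical names; the mask layout is the one from the example above (bits 0 and 1 for data object B, bits 2 and 3 for data object A).

```python
OBJECTS = ("B", "A")  # object index 0 is B, index 1 is A, matching the mask layout
NUM_PROCESSORS = 2

def probe_targets_on_write(mask, writer, obj):
    """Return the processors (numbered from 1) that must be probed.

    A processor is probed only if it is not the writer and its bit for the
    modified data object is set in the mask.
    """
    base = OBJECTS.index(obj) * NUM_PROCESSORS
    return [p for p in range(1, NUM_PROCESSORS + 1)
            if p != writer and (mask >> (base + p - 1)) & 1]

mask = 0b0111  # processor 1 accesses A and B; processor 2 accesses only B

targets_a = probe_targets_on_write(mask, writer=1, obj="A")  # step 506: no probe
targets_b = probe_targets_on_write(mask, writer=1, obj="B")  # step 508: probe
```

With this mask, a write by processor 1 to data object A produces no probe targets, while a write to data object B targets processor 2 only.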
Returning to a step 510 of method 500 of
If in step 510, it is determined that this is NOT the first time that the processor is accessing the data object of the cache line, then the processor does not send a probe as indicated in a step 512.
If, on the other hand, it is determined in step 510 that this is the first time that the processor is accessing the data object of the cache line, then the processor sends a probe to the owner of that cache line, requesting the latest copy of that cache line, as indicated in a step 514. For example, suppose processor 1 reads a data object that was not previously accessed by processor 1. In this example, the PF logic of processor 1 tracks the first time a bit in a non-zero bit mask is set and requests the owner of the cache line to provide the latest copy.
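The first-access check of steps 510 through 514 can be sketched as a small function. The name `on_access` and its return convention are assumptions for illustration; the example uses processor 2 touching data object A (bit 3 of the example mask), which is the one access not yet recorded there.

```python
def on_access(mask, bit):
    """Return (new_mask, probe_needed) for an access tracked by `bit`.

    A probe to the owner of the cache line is needed only when the bit was
    previously clear, i.e. on the first access to that data object.
    """
    first_access = ((mask >> bit) & 1) == 0
    return mask | (1 << bit), first_access

mask = 0b0111  # example mask: processor 2 has not yet accessed data object A

# First access by processor 2 to data object A (bit 3): probe the owner.
mask, first_probe = on_access(mask, bit=3)

# Any later access to the same data object: no probe.
mask, second_probe = on_access(mask, bit=3)
```

After the first call the mask becomes `0b1111`, so the second call reports that no probe is needed.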
IV. Example Computer Implementation
Embodiments of the present invention may be implemented using hardware, software, or a combination thereof and may be implemented in one or more computer systems or other processing systems. Like computing system 100 of
Like processors 120 of
Computer system 600 includes a display interface 602 that forwards graphics, text, and other data from communication infrastructure 606 (or from a frame buffer not shown) for display on display unit 630.
Computer system 600 also includes a main memory 608 (like main memory 104 of
In alternative embodiments, secondary memory 610 may include other similar devices for allowing computer programs or other instructions to be loaded into computer system 600. Such devices may include, for example, a removable storage unit 622 and an interface 620. Examples of such may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an erasable programmable read only memory (EPROM), or programmable read only memory (PROM)) and associated socket, and other removable storage units 622 and interfaces 620, which allow software and data to be transferred from the removable storage unit 622 to computer system 600.
Computer system 600 may also include a communications interface 624. Communications interface 624 allows software and data to be transferred between computer system 600 and external devices. Examples of communications interface 624 may include a modem, a network interface (such as an Ethernet card), a communications port, a Personal Computer Memory Card International Association (PCMCIA) slot and card, etc. Software and data transferred via communications interface 624 are in the form of signals 628 which may be electronic, electromagnetic, optical or other signals capable of being received by communications interface 624. These signals 628 are provided to communications interface 624 via a communications path (e.g., channel) 626. This channel 626 carries signals 628 and may be implemented using wire or cable, fiber optics, a telephone line, a cellular link, a radio frequency (RF) link and other communications channels.
In this document, the term “computer-readable storage medium” is used to generally refer to media such as removable storage drive 614 and a hard disk installed in hard disk drive 612. These computer program products provide software to computer system 600.
Computer programs (also referred to as computer control logic) are stored in main memory 608 and/or secondary memory 610. Computer programs may also be received via communications interface 624. Such computer programs, when executed, enable the computer system 600 to perform the features of the present invention, as discussed herein. Accordingly, such computer programs represent controllers of the computer system 600.
In an embodiment, the software may be stored in a computer program product and loaded into computer system 600 using removable storage drive 614, hard drive 612 or communications interface 624. The control logic (software), when executed by the processor 604, causes the processor 604 to perform the functions of embodiments of the invention as described herein.
V. Example Software Implementations
In addition to hardware implementations of processing units (e.g., processors 120, 140, and/or 604), such processing units may also be embodied in software disposed, for example, in a computer-readable medium configured to store the software (e.g., a computer-readable program code). The program code causes the enablement of embodiments of the present invention, including the following embodiments: (i) the functions of the systems and techniques disclosed herein (such as, method 500 illustrated in
This can be accomplished, for example, through the use of general-programming languages (such as C or C++), hardware-description languages (HDL) including Verilog HDL, VHDL, Altera HDL (AHDL) and so on, or other available programming and/or schematic-capture tools (such as, circuit-capture tools). The program code can be disposed in any known computer-readable medium including semiconductor, magnetic disk, or optical disk (such as, CD-ROM, DVD-ROM). As such, the code can be transmitted over communication networks including the Internet and internets. It is understood that the functions accomplished and/or structure provided by the systems and techniques described above can be represented in a core (such as a processing-unit core) that is embodied in program code and may be transformed to hardware as part of the production of integrated circuits.
VI. Conclusion
Disclosed above are embodiments for reducing cache probe traffic resulting from false data sharing and applications thereof. It is to be appreciated that the Detailed Description section, and not the Summary and Abstract sections, is intended to be used to interpret the claims. The Summary and Abstract sections may set forth one or more but not all exemplary embodiments of the present invention as contemplated by the inventor(s), and thus, are not intended to limit the present invention and the appended claims in any way.
Number | Date | Country | |
---|---|---|---|
20120005432 A1 | Jan 2012 | US |