Embodiments relate to remote atomic operations in a processor-based system.
An atomic memory operation is one during which a processor core can read a location, modify it, and write it back in what appears to other cores as a single operation. An atomic memory operation in a multi-core system is one that cannot be divided into any smaller parts, or appears to other cores in the system to be a single operation. Read-modify-write is one of a class of atomic memory operations that both reads a memory location and writes a new value into it as part of the same operation, at least as it appears to other cores in the multi-core system. Atomic operations are prevalent in a diverse set of applications, including packet processing, high-performance computing, and machine learning, among others.
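For purposes of illustration only (this sketch is not part of the embodiments), the C11 fragment below performs such a read-modify-write: the load of the shared counter, the addition, and the write-back appear to all other cores as a single indivisible operation.

#include <stdatomic.h>
#include <stdint.h>

/* Illustrative only: a conventional atomic read-modify-write. The load of
 * packet_count, the addition, and the write-back appear to other cores as
 * one indivisible operation. */
static _Atomic uint64_t packet_count;

void count_packet(void)
{
    atomic_fetch_add(&packet_count, 1);
}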
One or more central processing unit (CPU) cores can exist within a processor, which can occupy one of multiple sockets in a multi-socket system. Execution of atomic operations can suffer inefficiencies, especially with contention for a same memory address among multiple sockets in a multi-socket system.
In various embodiments, remote atomic operations (RAO) can be performed in a wide class of devices. As used herein, a remote atomic operation is an operation that is executed outside a central processing unit (CPU). More specifically herein, such RAO may be executed externally to core circuitry of a processor, for example, at a home agent, a cache controller, a memory controller or other core-external circuitry. In different use cases, a destination agent that executes a RAO may be a remote socket, a coherent or non-coherent device coupled to a processor socket, a remote memory or so forth. In use cases herein, these destination agents may themselves locally perform a remote atomic operation, e.g., in a home agent, accelerator device, or so forth. In this way, the overhead incurred in obtaining a requested cache line and bringing it back to a requesting agent is avoided. Embodiments enable such RAO operations to be sent to external load/store devices such as Peripheral Component Interconnect Express (PCIe) devices and destination agents coupled via cache coherent protocols such as Intel® Ultra Path Interconnect (UPI) or Compute Express Link (CXL) protocols. In addition, embodiments enable RAO operations to be issued to uncacheable memory and other such devices.
Embodiments described herein allow the latency of atomic memory operations to be reduced (in the contended case, by queuing operations at a single location and performing them in place). These remote atomic operations include performing a read-modify-write operation on a memory location atomically. Further, memory accesses performed by RAO instructions can be weakly ordered yet are built on top of a more strongly-ordered memory model. (As used herein, regular memory loads and stores have more strongly-ordered memory semantics, while RAO instructions have weaker memory semantics with respect to other RAOs and regular instructions. Such weak semantics allow more memory interleavings and, hence, better performance.) Embodiments disclosed herein thus allow the latency of atomic operations to be hidden (via weak ordering semantics).
Some RAOs allow for computations to be performed within a memory subsystem. Such operation is in contrast to classic atomics (lock prefix), where data is pulled into the requesting core and hence ping-pongs between cores, which is extremely costly under contention. With RAO, data stays put and operations are instead sent to a destination agent (e.g., a memory subsystem). In addition, RAOs that do not have any data return (also referred to as posted RAOs or fire-and-forget RAOs) execute with weak ordering semantics. This further reduces cycle cost at the core level and also reduces the number of transactions on an interconnect.
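As a minimal sketch of this contrast (the posted-RAO intrinsic name below is hypothetical and not an existing API), a classic atomic pulls the cache line into the requesting core, whereas a posted RAO sends the operation to the data and returns nothing.

#include <stdatomic.h>
#include <stdint.h>

/* Classic atomic (e.g., a LOCK-prefixed add): the cache line holding
 * *counter is pulled into the requesting core and ping-pongs between
 * cores under contention. */
void classic_add(_Atomic uint64_t *counter, uint64_t v)
{
    atomic_fetch_add(counter, v);
}

/* Hypothetical posted-RAO intrinsic (illustrative name only): the opcode
 * and operand travel to the destination agent, the data stays put, and no
 * result is returned, so the operation is weakly ordered and fire-and-forget. */
extern void rao_add_posted(uint64_t *dst, uint64_t v);

void remote_add(uint64_t *counter, uint64_t v)
{
    rao_add_posted(counter, v);
}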
With embodiments, user-level RAO instructions may be provided to devices coupled to a processor over uncached memory such as PCIe or other load/store devices. In different use cases, such devices may be internal to a given processor socket (e.g., on a same silicon die as a multicore processor or within the same package) or external to the processor socket (e.g., in a different package). Still further, embodiments may be used to reduce overhead of RAO transactions by sending such transactions to a destination agent that currently hosts the cache line. In this way, write back/cached memory transactions may be improved. Such destination agents may include, as examples, agents coupled to a source agent via a CXL or Intel® UPI interconnect. Understand that embodiments are applicable to a wide variety of multi-socket architectures that are fully cache coherent.
The various embodiments are not limited in this regard. Different use cases may include any of the following: a source agent is a core within a CPU of a SoC, and a target agent is an SoC internal device; a source agent is a core within a CPU of a SoC, and a target agent is an external device; a source agent is a core within a CPU of a SoC, and a target agent is an external CPU; a source agent is an SoC internal device, and a target agent is within a CPU of the SoC; a source agent is an external device, and a target agent is within a CPU of the SoC; a source agent is an external device, and a target agent is another external device; a source agent is an SoC internal device, and a target agent is another SoC internal device; or a source agent is an SoC internal device, and a target agent is an external device. Of course other use cases are possible.
Referring now to
Core cluster 102, according to the embodiment of
In some embodiments, core cluster 102 includes a load/store unit (LSU) 106. As shown, LSU 106 includes buffers including a load buffer (LB), a store data buffer (SD), and a store buffer (SB) to hold data transfers between circuitry 104 and L1/L2 caches 108. In some embodiments, each of the entries of the LB, SD, and SB buffers is 64 bytes wide.
As shown, core cluster 102 includes L1/L2 caches 108. A cache hierarchy in core cluster 102 contains a first level instruction cache (L1 ICache), a first level data cache (L1 DCache) and a second level (L2) cache. When circuitry 104 implements multiple logical processors, they share the L1 DCache. The L2 cache is shared by instructions and data. In some embodiments, the L1 and L2 data caches are non-blocking and so can handle multiple simultaneous cache misses.
As shown, core cluster 102 includes a bus interface unit (BIU) 110, which, in operation, handles transfer of data and addresses by sending out addresses, fetching instructions from a code storage, reading data from ports and memory, and writing data to ports and memory.
CCPI 112, according to the embodiment of
As further shown in
As further illustrated, a device 160 couples to multicore processor 101 via an interconnect 150. In different embodiments, device 160 may be a non-coherent device such as a PCIe or other load/store device (and in such case, interconnect 150 may be a non-coherent interconnect). In other embodiments, device 160 may be a coherent device such as another processor or another coherent device (in which case, interconnect 150 may be a coherent interconnect). As non-limiting examples, device 160 may be an accelerator, a network interface circuit (NIC), storage or any other type of device that may couple to a processor socket, e.g., by way of an input/output interface. In any event, device 160 may include an execution circuit such as a micro-ALU to perform RAO operations as described herein.
Referring now to
Note that cache control circuits 202, 204, 206, and 208 are logical representations of cache control circuitry, such as a CHA that includes several physical components. Similarly, LLCs 202X, 204X, 206X, and 208X are logical representations of last level cache circuitry that have multiple components and circuitry, potentially divided into partitions.
As illustrated, sockets 0-3 are connected in a cross-bar configuration, allowing direct connections among cache control circuits 202, 204, 206, and 208 in accordance with some embodiments. In some embodiments, the cache control circuit in each of the sockets 0-3 conducts discovery to learn the topology of the system. Understand that embodiments may be used in other interconnect topologies such as rings, meshes and so forth.
In some embodiments, sockets 0-3 are each disposed on a printed circuit board, the sockets being connected in a cross-bar configuration. In some embodiments, two or more processors operating in accordance with embodiments disclosed herein are plugged into the sockets. A multi-socket system as illustrated in
In various embodiments, a source device implemented as a coherent agent such as a core or coherent device (e.g., a caching agent) may issue an RAO transaction to a destination device. With embodiments herein, instead of fetching a destination data operand of the RAO transaction to the source agent (e.g., last level cache), the RAO transaction, including an opcode and partial data for use in the transaction, is carried to the location of the destination data. In different use cases this destination location may be a local socket LLC, a remote socket home agent or memory, or an external device cache. Thereafter, an execution circuit of the destination agent may perform the RAO transaction and update the destination data element, avoiding the need for additional communications between source and destination. In different use cases intermediate communication protocols including intra-die interconnect protocols, Intel® UPI, PCIe and CXL protocols may support flows as described herein.
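As a rough software model of the transaction described above (all field names and widths are assumptions made for this sketch and are not definitions from the embodiments), the request carried toward the destination data might look like the following.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative model only: an RAO transaction as carried from the source
 * agent to wherever the destination data currently resides. */
typedef struct {
    uint8_t  opcode;       /* which atomic operation to perform               */
    uint64_t dest_addr;    /* address of the destination data element         */
    uint8_t  payload[64];  /* partial data supplied by the source agent       */
    uint8_t  payload_len;  /* valid payload bytes (0 for increment/decrement) */
    bool     posted;       /* true: no data is returned to the source         */
} rao_request_t;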
Understand that in different embodiments a variety of different RAO operations may be performed. As examples, RAO operations may include operations in which memory is updated such as by way of an add operation or a result of a logical operation (such as an AND, OR or XOR operation). As additional examples, RAO operations may include compare and exchange operations, including such operations that provide for add, increment or decrement operations. As still further examples, RAO operations may include ADD operations where a destination data is updated with a data payload of an RAO transaction. Similarly, data payloads may be used for atomic logical operations (e.g., AND, OR or XOR operations). Other RAO transactions may avoid data payloads, such as atomic increment or decrement operations.
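The following is a minimal sketch of how a destination-side execution circuit (e.g., a micro-ALU) could apply the operation classes listed above; the opcode names and encoding are assumptions for illustration, and the compare-and-exchange variants are elided for brevity.

#include <stdint.h>

/* Assumed opcode names mirroring the operation classes listed above. */
typedef enum {
    RAO_ADD, RAO_AND, RAO_OR, RAO_XOR,  /* carry a data payload */
    RAO_INC, RAO_DEC                    /* no payload needed    */
} rao_opcode_t;

/* Software stand-in for the destination execution circuit: combine the
 * current destination data with the payload and return the new value to
 * be written back in place. */
uint64_t rao_execute(rao_opcode_t op, uint64_t dest, uint64_t payload)
{
    switch (op) {
    case RAO_ADD: return dest + payload;
    case RAO_AND: return dest & payload;
    case RAO_OR:  return dest | payload;
    case RAO_XOR: return dest ^ payload;
    case RAO_INC: return dest + 1;
    case RAO_DEC: return dest - 1;
    }
    return dest;
}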
Referring now to
As illustrated, method 300 begins by receiving a RAO request from a requester agent (block 310). As one example, a cache controller of a local processor socket may receive an RAO request from a core or other device of the processor socket. Understand that this request may, in some cases, include at least a portion of the data to be operated on in the RAO, such as according to a data push model. In other cases, the cache controller may request at least a portion of the data to be operated on using a data pull model.
In any event, next at diamond 320 it is determined whether destination data for the request is present in the local caching agent, such as a cache memory associated with the cache controller. If not, control passes to block 330, where the RAO request (and at least a portion of the data associated with the request) is sent to a remote socket. Understand that this remote socket may be another processor socket to which the first processor socket is coupled by way of a coherent interconnect. In other cases, this remote socket may be a non-coherent agent, such as a load/store device.
Understand that this remote device, socket or other remote agent performs the RAO using the received data along with destination data, such as data located in a memory location local to the remote device. After the RAO is performed, this memory location is in turn updated with the RAO result.
Also as part of performing the RAO operation, the remote device may send a completion back to the cache controller. Thus as illustrated in
Still with reference to
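A condensed sketch of the cache controller flow of method 300 follows; the helper functions and the request type are hypothetical names introduced only for illustration, and the local-hit path is shown as local execution, which is an assumption consistent with the embodiments described elsewhere herein.

#include <stdbool.h>

/* Hypothetical helpers; none of these names are defined by the embodiments. */
struct rao_req;
extern bool destination_data_is_local(const struct rao_req *req); /* diamond 320 */
extern void execute_rao_locally(struct rao_req *req);
extern void send_rao_to_remote_socket(struct rao_req *req);       /* block 330 */
extern void await_remote_completion(struct rao_req *req);
extern void complete_to_requester(struct rao_req *req);

void cache_controller_handle_rao(struct rao_req *req)  /* block 310: request received */
{
    if (destination_data_is_local(req)) {
        execute_rao_locally(req);
    } else {
        send_rao_to_remote_socket(req);  /* remote agent executes and updates its memory */
        await_remote_completion(req);
    }
    complete_to_requester(req);
}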
Referring now to
As illustrated, method 400 begins by receiving a RAO request from a remote socket in a target agent (block 410). In different cases, this target agent may be a coherent device or a non-coherent device. In any case control passes to diamond 420 where it is determined whether destination data is present in the device. If not, control passes to block 430 where the destination data may be obtained from a remote socket. Control passes next to block 440 where the RAO is executed in the device itself. Then the RAO result is stored in the destination location (block 450). Finally, at block 460 a completion may be sent to the remote socket that requested performance of the RAO transaction. Note that depending upon implementation and request type, this completion may include data (e.g., in the case of a non-posted RAO request) or not include data (e.g., for a posted RAO request). Understand while shown at this high level in
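A corresponding sketch of the destination-side flow of method 400 is shown below, again with hypothetical helper names used only for illustration.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers introduced only for this sketch. */
struct rao_req;
extern bool     destination_data_present(const struct rao_req *req);     /* diamond 420 */
extern uint64_t fetch_destination_data_from_remote(struct rao_req *req); /* block 430 */
extern uint64_t local_destination_data(const struct rao_req *req);
extern uint64_t execute_rao(const struct rao_req *req, uint64_t dest);   /* block 440 */
extern void     store_result(struct rao_req *req, uint64_t result);      /* block 450 */
extern void     send_completion(struct rao_req *req, bool with_data);    /* block 460 */
extern bool     is_posted(const struct rao_req *req);

void target_agent_handle_rao(struct rao_req *req)  /* block 410: request received */
{
    uint64_t dest = destination_data_present(req)
                        ? local_destination_data(req)
                        : fetch_destination_data_from_remote(req);

    uint64_t result = execute_rao(req, dest);
    store_result(req, result);
    send_completion(req, /*with_data=*/!is_posted(req));
}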
Referring now to
As illustrated, transaction flow diagram 500 begins with core 510 issuing a RAO request to caching agent 520. Assume that this request is for a destination address that misses in a local cache such as a LLC. In this implementation example, caching agent 520 may in turn issue a write pull to core/device 510 to obtain the data that is to be used as an operand in the RAO. As seen, when provided with the data, caching agent 520 sends the RAO request with this data to home agent 530. As discussed above, embodiments may be used in data push and data pull models. As such, in other cases core 510 may issue the RAO request with the data. Home agent 530 may access a snoop filter to determine whether a destination address of the RAO request hits. In this example, assume that there is not a hit. As such, home agent 530 sends the RAO request and data to memory 540, which may execute the RAO locally (e.g., within an execution unit present in its memory subsystem).
As a result of this local execution of the RAO, improved efficiency is realized. Understand that the RAO result may be stored directly in the destination address in memory 540. In addition, as further illustrated in
Referring now to
As illustrated, transaction flow diagram 600 begins with core 610 issuing a RAO request to caching agent 620. Assume that this request is for a destination address that misses in a local cache. In this implementation example, caching agent 620 may in turn issue a write pull to core/device 610 to obtain the data that is to be used as an operand in the RAO. Caching agent 620 sends the RAO with this data to home agent 630. Since a destination address of the RAO request hits in a snoop filter of home agent 630, the RAO request and data are sent directly to caching agent 640, which may execute the RAO locally (e.g., within an execution unit present in its memory subsystem). As a result of this local execution of the RAO, improved efficiency is realized (with the RAO result stored directly in caching agent 640). Note that this RAO result remains in the caching agent and is not written back to memory until eviction occurs. In addition, as further illustrated in
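A brief sketch of the home-agent routing decision common to transaction flow diagrams 500 and 600 follows (the helper names are hypothetical): on a snoop filter miss the request and data are sent toward memory, and on a hit they are sent to the caching agent that holds the line.

#include <stdbool.h>

/* Hypothetical helpers used only for this sketch of the home-agent decision. */
struct rao_req;
extern bool snoop_filter_hit(const struct rao_req *req);
extern void send_to_caching_agent(struct rao_req *req); /* flow 600: execute in caching agent */
extern void send_to_memory(struct rao_req *req);        /* flow 500: execute near memory      */

void home_agent_route_rao(struct rao_req *req)
{
    if (snoop_filter_hit(req))
        send_to_caching_agent(req);
    else
        send_to_memory(req);
}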
In embodiments, load/store agents also may issue RAO transactions to destination agents (including coherent agents). In such use cases, a coherency bridge circuit that couples the non-coherent load/store device and coherent agent (e.g., a remote processor socket) may be used to convert load/store semantics to coherent processing flows. With such use cases, this coherency bridge circuit may convert or translate load/store-sourced RAO transactions to coherent RAO transactions to be executed by the remote coherent agent.
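A minimal sketch of the coherency bridge conversion described above follows, with hypothetical names: a non-coherent, load/store-sourced RAO request is translated into a coherent RAO request and forwarded toward the caching agent that owns the destination address.

/* Hypothetical request types and helpers for this sketch only. */
struct ls_rao_req;        /* RAO request in load/store (e.g., PCIe) semantics */
struct coherent_rao_req;  /* RAO request in coherent-protocol semantics       */
extern struct coherent_rao_req *translate_to_coherent(struct ls_rao_req *req);
extern void forward_to_caching_agent(struct coherent_rao_req *req);
extern void send_completion_to_source(struct ls_rao_req *req);

void coherency_bridge_handle_rao(struct ls_rao_req *req)
{
    struct coherent_rao_req *creq = translate_to_coherent(req);
    forward_to_caching_agent(creq);  /* destination agent executes the RAO */
    send_completion_to_source(req);  /* e.g., once the result is globally observed */
}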
Referring now to
As illustrated, transaction flow diagram 700 begins with requester agent 710 issuing a RAO request to coherency bridge circuit 720. This request is a posted request and it includes data to be used in the RAO operation, e.g., according to a data push model. In turn, coherency bridge circuit 720 sends the RAO request to caching agent 730. According to a data pull model as illustrated in
Referring now to
It is also possible in embodiments to have a load/store device be a destination for (and thus execute) a RAO operation. Referring now to
It is further possible for multiple non-coherent devices, e.g., two load/store devices, to implement RAO transactions. Referring now to
Embodiments can be used in many different use cases. Several example use cases for RAO operations as described herein include counters/statistics for network processing and producer/consumer synchronization applications.
Counter and statistics handling is a common task for many applications and system software. One example use case is a scenario in the networking space, but similar scenarios are also present in many other domains. Statistics handling, although a simple operation, typically accounts for around 10% of an IP stack with application processing. For packet processing, the different layers contain a large set of counters, which both count common flows, such as the number of packets processed and the number of bytes processed for different scenarios, and provide a wide range of error counters. The counters may be updated at a high frequency, sometimes in a bursty manner, and some will be relatively contended whereas others are touched significantly less frequently. Commonly these statistics are handled within one compute unit (e.g., a CPU). But with the introduction of smart network interface circuits (NICs), there is a push to divide processing between the CPU and the device. Based on different flows, the counters can therefore be updated both from the CPU (e.g., slow-path flows) and from the device (e.g., fast-path flows), but the state is to be made globally visible in an efficient way for fast access. Furthermore, a total counter dataset can be very large when the number of flows is large and hence may not necessarily fit into the memory of the device. RAO in accordance with an embodiment may solve these problems by pushing the updates to one location, e.g., the CPU, where statistics-based decisions can then be taken efficiently.
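As a sketch of this usage (the posted-RAO intrinsic and the table layout below are assumptions made for illustration), both the CPU slow path and the device fast path can push per-flow counter updates toward a single statistics table kept at one location, such as CPU-side memory.

#include <stdint.h>

#define MAX_FLOWS 1024  /* assumed table size for the sketch */

/* Per-flow statistics kept at one location (e.g., CPU-side memory). */
struct flow_stats {
    uint64_t packets;
    uint64_t bytes;
};

static struct flow_stats stats[MAX_FLOWS];

/* Hypothetical posted-RAO add intrinsic; data stays in place, no return. */
extern void rao_add_posted(uint64_t *dst, uint64_t v);

/* Invoked from either the CPU slow path or the device fast path. */
void account_packet(uint32_t flow_id, uint64_t packet_bytes)
{
    if (flow_id >= MAX_FLOWS)
        return;
    rao_add_posted(&stats[flow_id].packets, 1);
    rao_add_posted(&stats[flow_id].bytes, packet_bytes);
}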
Another usage area is for producer/consumer transactions. This is a classic problem arising in a number of applications. As one example, assume a producer writes new data at known locations in memory. A consumer reads this data. The synchronization problem is to ensure that the producer does not overwrite unconsumed data and the consumer does not read old data. Classical solutions result in several overheads either because of the need to avoid “races” or because of scalability issues. Embodiments may provide a solution to a multiple-producer multiple-consumer synchronization problem.
Referring now to
Assume token counter 1140 and queue pointers have been initialized. A producer 1110 pushes data, and once the data is globally observed, it sends a posted RAO (e.g., an atomic increment). Ensuring global observation depends on the producer type: load/store (PCIe) agents may rely on posted ordering to flush data before the RAO; CXL.cache agents have semantics to determine global observability; and CPU producers may execute a store fence before the RAO. In turn, a consumer 1120 may poll token counter 1140 with a CMPXCHG instruction (note that polling overheads can be reduced with "snapshot read" instructions). If token counter 1140 is not zero, the consumer performs the CMPXCHG instruction. If the CMPXCHG is successful, consumer 1120 may consume the data; otherwise it does not.
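The following is a minimal sketch of this flow using standard C11 atomics as a stand-in for the CMPXCHG polling described above; the producer's posted increment is represented by the same hypothetical intrinsic used earlier, the CMPXCHG is assumed to claim a token by decrementing the counter, and queue/ring-buffer management is omitted.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static _Atomic uint64_t token_counter;  /* stand-in for token counter 1140 */

/* Hypothetical posted-RAO intrinsic (illustrative name and signature only). */
extern void rao_add_posted(_Atomic uint64_t *dst, uint64_t v);

/* Producer side: after its data is globally observed (e.g., after a store
 * fence for a CPU producer), it sends a posted atomic increment. */
void producer_publish(void)
{
    /* ... write data to the queue and ensure global observation ... */
    rao_add_posted(&token_counter, 1);
}

/* Consumer side: poll the counter; if it is nonzero, attempt to claim a
 * token with a compare-and-exchange. Consume the data only on success. */
bool consumer_try_consume(void)
{
    uint64_t seen = atomic_load(&token_counter);
    if (seen == 0)
        return false;                     /* nothing to consume yet */
    if (atomic_compare_exchange_strong(&token_counter, &seen, seen - 1)) {
        /* ... read and process the corresponding queue entry ... */
        return true;
    }
    return false;                         /* lost the race; retry later */
}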
The following examples pertain to further embodiments.
In one embodiment, a processor includes at least one core and a cache control circuit coupled to the at least one core. The cache control circuit is to: receive a RAO request from a requester; send the RAO request and data associated with the RAO request to a destination device, where the destination device is to execute the RAO using the data and destination data obtained by the destination device and store a result of the RAO to a destination location; and receive a completion for the RAO from the destination device.
In an example, the cache control circuit is, in response to the RAO request, to perform a write pull to obtain the data.
In an example, the cache control circuit is to perform the write pull in response to a miss for the data in a cache memory associated with the cache control circuit.
In an example, when the RAO request comprises a non-posted request, the completion comprises the result.
In an example, when the RAO request comprises a posted request, the completion does not include the result.
In an example, the destination device comprises a remote device coupled to the processor via a CXL interconnect.
In an example, the remote device comprises a home agent to send the RAO request and the data to a memory in response to a snoop filter miss for the destination data in the home agent.
In an example, the remote device comprises a home agent to send the RAO request and the data to a caching agent in response to a snoop filter hit for the destination data in the home agent.
In an example, the destination device comprises a remote processor socket coupled to the processor via a cache coherent interconnect.
In an example, the at least one core is the requester, and the at least one core is to send the data with the RAO request.
In an example, the processor further comprises one or more devices, where a first non-coherent device of the one or more devices comprises the destination device.
In an example, the processor further comprises a token counter, where a first core is to send the RAO request to the token counter to cause an increment to the token counter, after writing a first data to a queue, and a first device is to consume the first data based on a value of the token counter.
In another example, a method comprises: receiving, in a coherency bridge circuit coupled between a source agent and a destination agent, a RAO request, the coherency bridge circuit to translate coherent transactions to non-coherent transactions and translate non-coherent transactions to coherent transactions; sending, from the coherency bridge circuit, the RAO request to the destination agent to cause the destination agent to execute the RAO using destination data stored in a destination address owned by the destination agent and store a result of the RAO at the destination address; and receiving a completion for the RAO from the destination agent to indicate that the RAO has been completed.
In an example, the method further comprises translating a non-coherent request comprising the RAO request to a coherent request comprising the RAO request and sending the coherent request to the destination agent via a caching agent associated with the destination address.
In an example, the method further comprises: sending the RAO request to the caching agent; receiving a write pull request from the caching agent; and sending a datum to the caching agent in response to the write pull request, to cause the caching agent to execute the RAO further using the datum.
In an example, the method further comprises sending the completion with the result to the source agent, where the RAO request comprises a non-posted RAO request.
In an example, the method further comprises: pushing, by the source agent, a first data element to a queue; sending the RAO request to the coherency bridge circuit after the first data element is globally observed, where the destination data comprises a value of a token counter; and consuming, by a consuming circuit, the first data element in response to determining that the destination data comprising the value of the token counter matches a first value.
In another example, a computer readable medium including instructions is to perform the method of any of the above examples.
In a further example, a computer readable medium including data is to be used by at least one machine to fabricate at least one integrated circuit to perform the method of any one of the above examples.
In a still further example, an apparatus comprises means for performing the method of any one of the above examples.
In another example, a system comprises a SoC and a remote device coupled to the SoC via at least one interconnect. The SoC may include: at least one core; one or more devices coupled to the at least one core; and a cache control circuit coupled to the at least one core and the one or more devices. The cache control circuit is to: receive a RAO request from a requester; send the RAO request and data associated with the RAO request to the remote device, where the remote device is to execute the RAO using the data and destination data and store a result of the RAO to a destination location; and receive a completion for the RAO. The remote device may comprise an execution circuit to receive the RAO request and the data, obtain the destination data from the destination location, execute at least one operation of the RAO request using the data and the destination data and store the result to the destination location.
In an example, the SoC further comprises a coherency bridge circuit coupled to the at least one core, the requester comprises the at least one core and the remote device comprises a non-coherent device, the coherency bridge circuit to convert the RAO request to a non-coherent RAO request and send the non-coherent RAO request to the remote device, and send the completion to the cache control circuit in response to a global observation of the result.
In an example, the remote device is to send a second RAO request to the SoC to cause the SoC to execute at least one second RAO operation of the second RAO request using a second data stored in a second destination location identified in the second RAO request and store a second result of the at least one second RAO operation to the second destination location.
Understand that various combinations of the above examples are possible.
Note that the terms "circuit" and "circuitry" are used interchangeably herein. As used herein, these terms and the term "logic" are used to refer to, alone or in any combination, analog circuitry, digital circuitry, hard wired circuitry, programmable circuitry, processor circuitry, microcontroller circuitry, hardware logic circuitry, state machine circuitry and/or any other type of physical hardware component. Embodiments may be used in many different types of systems. For example, in one embodiment a communication device can be arranged to perform the various methods and techniques described herein. Of course, the scope of the present invention is not limited to a communication device, and instead other embodiments can be directed to other types of apparatus for processing instructions, or one or more machine readable media including instructions that in response to being executed on a computing device, cause the device to carry out one or more of the methods and techniques described herein.
Embodiments may be implemented in code and may be stored on a non-transitory storage medium having stored thereon instructions which can be used to program a system to perform the instructions. Embodiments also may be implemented in data and may be stored on a non-transitory storage medium, which if used by at least one machine, causes the at least one machine to fabricate at least one integrated circuit to perform one or more operations. Still further embodiments may be implemented in a computer readable storage medium including information that, when manufactured into a SoC or other processor, is to configure the SoC or other processor to perform one or more operations. The storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, solid state drives (SSDs), compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present invention.