Examples described herein are generally related to techniques associated with data movement within a caching hierarchy for a multi-die, disaggregated system.
In some server use cases, increases in core counts for some types of system on chips (SoCs) have led to the use of disaggregated dies in these types of SoCs. The disaggregated dies are coupled or connected together using a high-speed package interface, such as, for example, an embedded multi-die interconnect bridge (EMIB). One or more input/output (I/O) agents are often disposed on one die while one or more processor cores, sometimes referred to as core building blocks (CBBs), are disposed on a separate die. Each individual die has its own cache hierarchy. A memory or a large memory side cache is typically shared across the cache hierarchies associated with each of the dies. Data communications between a processor core on the one die and an I/O agent on the separate die are typically conducted via the memory or the large monolithic memory side cache shared across the cache hierarchies associated with the two different dies. Movement of data from the I/O agent to the processor core can involve multiple data movements across the interconnect fabric and EMIB boundaries. The multiple data movements may result in relatively high data access latencies as well as relatively high interconnect power consumption. In addition, relatively high consumption of both memory bandwidth and die-to-die interconnect (EMIB) bandwidth may occur.
As contemplated by this disclosure, disaggregated dies included in a system on chip (SoC) may include one or more core building block (CBB) dies, each including a plurality of processing cores, and one or more I/O agents on at least one other die. The one or more I/O agents can include, but are not limited to, an accelerator, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), or a network interface controller/infrastructure processing unit (NIC/IPU). These types of SoCs may be used in or deployed in a network server for executing networking and packet processing workloads, which can be referred to as networking and edge (NEX) workloads. As core counts in these types of SoCs increase, scalability and performance issues can arise from a significantly large variance among different cores trying to access low level cache lines while processing a NEX workload. This can result in undesired behavior that negatively impacts most NEX workloads. NEX workloads are also sensitive to how fast and how predictably each network packet can be processed by one or more cores, often in conjunction with on-SoC I/O agents such as accelerators.
As mentioned above, disaggregated dies in an SoC used in a network server can be constructed by putting a certain number of I/O agents on one or more dies of the SoC and a certain number of cores on another one or more dies. Each individual die typically has its own cache hierarchy. Data which needs to be communicated between a core and an I/O agent can be communicated using a next common level of cache between the different caching hierarchies, such as a memory side cache. In the absence of the memory side cache, the data to be communicated will be written to system memory (e.g., dynamic random access memory (DRAM)).
Communication of data between the core and the I/O agent on separate dies through system memory or a shared memory side cache suffers from several disadvantages. A first example disadvantage is that the latency of access to data produced by the I/O agent (e.g., incoming network NIC data) and consumed by the core (e.g., for network packet processing) is high. This high latency will likely affect packet processing bandwidth significantly, as packet processing bandwidth is typically determined by the time each stage of a packet processing pipeline takes to execute. The execution can take place either on the core or on on-die/external I/O agents. A second example disadvantage is an extraneous consumption of die-to-die interconnect bandwidth that can result in data congestion and extra power consumption on the die-to-die interconnect, for example, when all networking packet data is pushed to the core but only the headers of each packet need to be processed by the core (e.g., 5G user plane function (UPF) workloads). A third example disadvantage arises in cases where there is no memory side cache for the SoC, which results in data being written to system memory/DRAM. Writing data to system memory can result in wasted memory bandwidth and power. A fourth example disadvantage arises when there are differences in how a networking workload is performed in relation to each task for packet processing. In some instances, a packet header can be pushed as close as possible to the processing core or I/O agent for reduced processing latency, while in other instances the entire packet may need to be processed by the core or I/O agent (e.g., IPsec, web servers/proxies). This disclosure describes techniques for data movement to a cache in a disaggregated die system that avoid or minimize the need to write data to system memory or use a memory side cache.
According to some examples, communications within CBB 102 on compute die(s) 144 and within I/O domain 104 on I/O die(s) 146 may be in accordance with on-die or intra-die communication protocols, such as, but not limited to, the Intel® Intra-Die Interconnect (IDI) protocol. Communications between CBB 102, I/O domain 104, and logic and/or features resident on intermediate cache die(s) 148 across or via interconnect network 110 may be supported by inter-die or inter-chip communication protocols such as, but not limited to, the Intel® Ultra Path Interconnect (UPI) protocol or the Universal Chiplet Interconnect Express (UCIe) protocols as described in the UCIe 1.0 specification, published in March of 2022 by the UCIe™ organization (“the UCIe specification”). Data communications between the CBB 102 and the I/O domain 104, in some examples, can be facilitated by HA 106 and/or I/O stack logic 145. Also, communications between HA 106 and/or MC 141 and memory 108 across memory channels 112 can be in accordance with one or more memory access protocols, such as described in various Joint Electronic Device Engineering Council specifications for double data rate (DDR) memory access. For example, JEDEC DDR specifications include, but are not limited to, DDR3, DDR4, DDR5, LPDDR3, LPDDR4, LPDDR5, high bandwidth memory (HBM), HBM2 or HBM3. In other examples, memory access protocols via memory channel(s) 112 can be in accordance with other types of memory channel protocols such as CXL.mem memory channel protocols used in accordance with the compute express link (CXL) 3.0 specification, published in August of 2022 by the CXL™ organization (“the CXL specification”).
System 100 may include additional components that facilitate the operation of the system 100. Furthermore, while interconnect network 110 and memory channel(s) 112 illustrate an example of the coupling between the disaggregated dies of system 100, alternative networks and/or memory channel configurations can be used to couple dies and/or components of system 100.
According to some examples, CBB 102 may include one or more core(s) 114, a CBB shared cache hierarchy 118, and a CBB caching agent 116. An example of a CBB shared cache hierarchy 118 is a level 3 (L3) cache. For these examples, CBB shared cache hierarchy 118 may be shared by and accessible to core(s) 114 in the CBB 102. Each core from among core(s) 114 includes a hardware circuit, such as a control circuit 120, to execute core operations and includes a core cache hierarchy 122. According to some examples, core cache hierarchy 122 includes an L1 cache and an L2 cache. CBB caching agent 116 may manage operations associated with CBB shared cache hierarchy 118. To this end, the CBB caching agent 116 includes a hardware circuit, such as a control circuit 124, to manage operations associated with CBB shared cache hierarchy 118. CBB 102 is not limited to the components shown in
In some examples, I/O domain 104 can include one or more I/O device(s) 126, one or more I/O agent(s) 128, and one or more I/O caching agent(s) 132. Each I/O agent of I/O agent(s) 128 may be coupled to a respective I/O device from among I/O device(s) 126. Each I/O device of I/O device(s) 126 may include a hardware circuit, such as a control circuit 134, to manage I/O device operations. Each I/O agent from among I/O agent(s) 128 may include a hardware circuit, such as a control circuit 136, to manage I/O agent operations and an internal cache 138. Internal cache 138 may also be referred to as a write buffer or write cache (Wr$). Examples of I/O agents include, but are not limited to, accelerator device instances such as a data streaming accelerator or a host processor with multiple I/O device(s) 126 connected downstream. I/O domain 104 is not limited to the components shown in
According to some examples, IO$ 130 resident on intermediate I/O cache die(s) 148 may be coupled to I/O agent(s) 128 via interconnect network 110. As described more below, IO$ 130 may be a type of shared last level or L3 cache arranged as a logical standalone “scratchpad” memory that acts as an intermediate delivery stop for data received from I/O agent(s) 128.
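For purposes of illustration only, the caching structures described above can be modeled as in the following C sketch. The type and field names (and the core count) are assumptions introduced for explanation and do not correspond to actual hardware data structures.

```c
/* Illustrative-only model of the disaggregated cache topology described
 * above; names and sizes are assumptions, not hardware definitions. */

struct core_cache_hierarchy {      /* core cache hierarchy 122 */
    void *l1;                      /* per-core L1 cache */
    void *l2;                      /* per-core L2 cache */
};

struct cbb {                       /* CBB 102 on compute die(s) 144 */
    struct core_cache_hierarchy cores[64]; /* core(s) 114; count is illustrative */
    void *shared_l3;               /* CBB shared cache hierarchy 118 (L3) */
};

struct io_domain {                 /* I/O domain 104 on I/O die(s) 146 */
    void *wr_cache;                /* internal cache 138, also referred to as Wr$ */
};

struct intermediate_io_cache_die { /* intermediate I/O cache die(s) 148 */
    void *io_cache;                /* IO$ 130: shared "scratchpad" last level cache */
};
```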
According to some examples, logic and/or features at intermediate I/O cache die(s) 148 such as HA 106, HSF 143 or I/O stack logic 145 can be supported or included in circuitry resident on intermediate I/O cache die(s) 148. The circuitry, for example, can include one or more of an FPGA, an ASIC, a processor circuit, a multi-core processor or one or more cores of the multi-core processor. The logic and/or features at intermediate I/O cache die(s) 148 such as HA 106, HSF 143 or I/O stack logic 145 can be arranged to implement data direct I/O (DDIO) modes associated with the use of IO$ 130 and CBB shared cache 118 to meet performance objectives that may vary based on a type of workload being executed by I/O agent(s) 128, I/O device(s) 126 or core(s) 114. The performance objectives, for example, can balance reducing latency, increasing processing bandwidth and attempting to reduce or minimize power usage. These different DDIO modes, for example, can be related to Intel® DDIO technology that can allow for core(s) 114 to operate on inbound data received from I/O agent(s) 128 without invoking a memory access to memory 108. Consistent with Intel® DDIO technology, a write to memory 108 only occurs upon eviction from a highest level cache. The highest level cache, depending on the selected DDIO mode, can be either CBB shared cache 118 or IO$ 130. Also, as described more below, DDIO modes can be statically configured/selected for system 100 through a UEFI BIOS such as BIOS 101 or can be dynamically configured/selected by using packet hints provided by I/O agent(s) 128. The packet hints, for example, can be included in PCI Express (PCIe) transaction layer packets (TLPs) as described in the PCIe Base Specification, Rev. 6.0, published in January of 2022 by the Peripheral Component Interconnect Special Interest Group (PCI-SIG) (“the PCIe specification”). The different DDIO modes can allow for flexibility for different networking workloads, whether those networking workloads are deployed in a network function virtualization (NFV) environment, a bare metal containers environment, or as public cloud native, service mesh or virtualized containers orchestrated by a Kubernetes environment that has less hardware-specific awareness.
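A minimal C sketch of the behavior described in this paragraph is shown below, assuming illustrative enum and function names. The mapping from a selected DDIO mode to an initial allocation target and to the highest level cache (whose eviction triggers a write to memory 108) is an assumption made for explanation, not a definitive implementation.

```c
/* Minimal sketch of DDIO mode behavior: which cache inbound I/O data is first
 * allocated to, and which cache is the highest level cache whose eviction
 * triggers a write to memory 108. The mode-to-cache mappings are assumptions. */
#include <stdbool.h>

typedef enum {
    DDIO_DYNAMIC_ALLOCATION,    /* data may be placed in IO$ 130 or CBB shared cache 118 */
    DDIO_CORE_CACHE_ALLOCATION, /* data is pushed to CBB shared cache (L3) 118 only */
    DDIO_IO_CACHE_ALLOCATION    /* data is placed in IO$ 130 only */
} ddio_mode_t;

typedef enum { TARGET_IO_CACHE, TARGET_CBB_SHARED_CACHE } cache_target_t;

/* Pick the initial allocation target for inbound data from I/O agent(s) 128. */
cache_target_t initial_target(ddio_mode_t mode, bool clos_id_present)
{
    switch (mode) {
    case DDIO_CORE_CACHE_ALLOCATION:
        return TARGET_CBB_SHARED_CACHE;
    case DDIO_IO_CACHE_ALLOCATION:
        return TARGET_IO_CACHE;
    case DDIO_DYNAMIC_ALLOCATION:
    default:
        /* Either cache may be used; here a supplied CLOS ID steers data
         * toward the CBB shared cache (an illustrative policy). */
        return clos_id_present ? TARGET_CBB_SHARED_CACHE : TARGET_IO_CACHE;
    }
}

/* Memory 108 is written only upon eviction from the highest level cache,
 * which depends on the selected DDIO mode (assumed mapping shown here). */
cache_target_t highest_level_cache(ddio_mode_t mode)
{
    return (mode == DDIO_IO_CACHE_ALLOCATION) ? TARGET_IO_CACHE
                                              : TARGET_CBB_SHARED_CACHE;
}
```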
In some examples, compute domain 102 is disposed on compute die(s) 144 and I/O domain 104 is disposed on I/O die(s) 146. Also, in some examples, IO$ 130, HA 106, MC 141, HSF 143 and CSR 147 are disposed on one or more intermediate I/O cache die(s) 148 and memory 108 is disposed on one or more memory die(s) 150. In alternative examples, IO$ 130, HA 106, MC 141, HSF 143 and CSR 147 may be disposed on one of compute die(s) 144, I/O die(s) 146, or memory die(s) 150. In alternative examples, multiple CBBs 102 may be disposed on a single compute die from among compute die(s) 144 and/or multiple I/O domains 104 may be disposed on a single I/O die from among I/O die(s) 146.
According to some examples, memory 108 at memory die(s) 150 may include various types of volatile and/or non-volatile memory. Volatile types of memory may include, but are not limited to, dynamic random access memory (DRAM), static random access memory (SRAM), thyristor RAM (TRAM) or zero-capacitor RAM (ZRAM). Non-volatile types of memory may include byte or block addressable types of non-volatile memory having a 3-dimensional (3-D) cross-point memory structure that includes chalcogenide phase change material (e.g., chalcogenide glass) hereinafter referred to as “3-D cross-point memory”. Non-volatile types of memory may also include other types of byte or block addressable non-volatile memory such as, but not limited to, multi-threshold level NAND flash memory, NOR flash memory, single or multi-level phase change memory (PCM), resistive memory, nanowire memory, ferroelectric transistor random access memory (FeTRAM), magnetoresistive random access memory (MRAM) that incorporates memristor technology, spin transfer torque MRAM (STT-MRAM), or a combination of any of the above.
According to some examples, a BIOS of a system (e.g., BIOS 101 of system 100) can configure a CAT mask table such as CAT mask table 300 to map each 4-bit binary value for a respective IO RDT CLOS ID to a 2-bit binary value for a respective IO$ partition ID of an IO$ such as IO$ 130 to generate a 6-bit IO CLOS value. A configured CAT mask table 300 can be included in a data structure maintained by I/O stack logic at an intermediate I/O cache die such as I/O stack logic 145 at intermediate I/O cache die(s) 148. Alternatively, the BIOS may set or program one or more registers at the intermediate I/O cache die such as register(s) included in CSR 147. The set or programmed register(s) of CSR 147 may then be accessible by I/O stack logic 145 to access the configured CAT mask table 300. For these examples, the IO$ partition IDs shown in
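As an illustration of the mapping described above, the following C sketch forms a 6-bit IO CLOS value from a 4-bit IO RDT CLOS ID and the 2-bit IO$ partition ID returned by a modeled CAT mask table. The table contents and the bit ordering of the concatenation are assumptions for illustration.

```c
/* Sketch of the CAT mask table mapping described above. The 6-bit IO CLOS
 * value is assumed, for illustration, to be the 4-bit IO RDT CLOS ID
 * concatenated with the 2-bit IO$ partition ID. */
#include <stdint.h>

/* BIOS-configured mapping: index = 4-bit IO RDT CLOS ID,
 * value = 2-bit IO$ partition ID (contents are illustrative). */
static uint8_t cat_mask_table[16] = {
    0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3,
    0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3,
};

/* Build the 6-bit IO CLOS value for a given IO RDT CLOS ID. */
uint8_t io_clos_value(uint8_t io_rdt_clos_id)
{
    uint8_t clos_id   = io_rdt_clos_id & 0xF;          /* 4-bit IO RDT CLOS ID */
    uint8_t partition = cat_mask_table[clos_id] & 0x3; /* 2-bit IO$ partition ID */
    return (uint8_t)((clos_id << 2) | partition);      /* 6-bit IO CLOS value */
}
```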
In some examples, at 5.1, IPU/NIC 510 generates an upstream PCIe write. For these examples, the upstream PCIe write may be targeted to a core included in CBB 102-1 such as core 114-1-1. Although not shown in
According to some examples, at 5.2, logic and/or features of HA 106 such as I/O stack logic 145 may utilize a CAT mask table such as example CAT mask table 300 that applies IO RDT to determine where to place the data associated with the upstream PCIe write in IO$ 130 or may utilize an I/O cache allocation scheme such as I/O cache allocation 200 shown in
In some examples, at 5.3, logic and/or features of HA 106 such as I/O stack logic 145 causes the data associated with the upstream PCIe write sent from IPU/NIC 510 to be stored to partition ID P0 of IO$ 130 based on the determination made by I/O stack logic 145 at 5.2.
According to some examples, at 5.4, HA 106 at intermediate I/O cache die 148 can be arranged to cause CBB caching agent 116-1 to read from partition P0 of IO$ 130 to pull the data associated with the upstream PCIe write sent from IPU/NIC 510 to CBB shared cache 118-1. The data pulled from IO$ 130 is routed via die-to-die or chip-to-chip communication links through an interconnect network, such as communication links routed through interconnect network 110.
In some examples, at 5.5, CBB caching agent 116-1 notifies core 114-1-1 that the data associated with the upstream PCIe write sent from IPU/NIC 510 destined for processing by core 114-1-1 has been placed into CBB shared cache 118-1. Core 114-1-1 may then read the data from CBB shared cache 118-1 and place the data in its core cache hierarchy 122 for eventual processing of the data. Data movement scheme 500 then comes to an end.
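The placement and pull steps of data movement scheme 500 can be mirrored, at a software-model level only, by the hedged C sketch below. The buffers, sizes, and function names are assumptions; the actual movement is performed in hardware by I/O stack logic 145, HA 106 and CBB caching agent 116-1.

```c
/* Software-only model (assumptions throughout) of steps 5.2-5.5: data from an
 * upstream PCIe write is placed into a partition of a modeled IO$, pulled into
 * a modeled CBB shared cache, and the core is then notified. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_IO_PARTITIONS 4   /* illustrative number of IO$ 130 partitions */
#define LINE_SIZE         64  /* illustrative cache line size in bytes */

static uint8_t io_cache[NUM_IO_PARTITIONS][LINE_SIZE]; /* models IO$ 130 (P0-P3) */
static uint8_t cbb_shared_l3[LINE_SIZE];               /* models a line of CBB shared cache 118-1 */

void deliver_upstream_write(const uint8_t *data, size_t len, unsigned partition_id)
{
    if (len > LINE_SIZE || partition_id >= NUM_IO_PARTITIONS)
        return;
    /* 5.3: store the write data to the IO$ partition chosen at 5.2. */
    memcpy(io_cache[partition_id], data, len);
    /* 5.4: CBB caching agent 116-1 pulls the data from IO$ 130 into CBB
     * shared cache 118-1 across interconnect network 110. */
    memcpy(cbb_shared_l3, io_cache[partition_id], len);
    /* 5.5: CBB caching agent 116-1 notifies core 114-1-1 that the data is
     * ready; the core then reads it into its core cache hierarchy 122. */
    printf("core 114-1-1: data ready in CBB shared cache (from IO$ partition P%u)\n",
           partition_id);
}
```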
In some examples, as shown in
Also, as shown in
Also, as shown in
Also, as shown in
In some examples, TPH bits for HV and PH of a TLP request packet in the example format of TLP format 600 may be set to indicate a possible selection of DDIO modes on a dynamic or static basis. As shown in
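A hedged C sketch of decoding such hint bits into a DDIO mode selection is shown below. The bit positions, the treatment of HV as a hint-valid flag, and the PH encodings are assumptions made for illustration of TLP format 600 and are not taken from the PCIe specification.

```c
/* Hedged decode of the HV/PH hint bits of TLP format 600 into a DDIO mode
 * selection. Bit positions and encodings are assumptions for illustration. */
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    DDIO_SEL_STATIC_DEFAULT,   /* no hint: fall back to the BIOS-configured mode */
    DDIO_SEL_DYNAMIC_ALLOCATION,
    DDIO_SEL_CORE_CACHE_ALLOCATION,
    DDIO_SEL_IO_CACHE_ALLOCATION
} ddio_mode_sel_t;

ddio_mode_sel_t decode_tph_hint(uint8_t tph_bits)
{
    bool    hv = (tph_bits >> 2) & 0x1; /* assumed "hint valid" flag */
    uint8_t ph = tph_bits & 0x3;        /* assumed 2-bit processing hint */

    if (!hv)
        return DDIO_SEL_STATIC_DEFAULT; /* static selection via BIOS 101 */
    switch (ph) {                       /* dynamic selection via the packet hint */
    case 0x1: return DDIO_SEL_CORE_CACHE_ALLOCATION;
    case 0x2: return DDIO_SEL_IO_CACHE_ALLOCATION;
    default:  return DDIO_SEL_DYNAMIC_ALLOCATION;
    }
}
```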
According to some examples, dynamic allocation DDIO mode can be selected in various usage scenarios as shown in
According to some examples, CBB shared cache 118 at CBB 102 and IO$ 130 are depicted as being below a dashed line between the arrow for IDI/CXL.$ protocols and the UCIe protocols to indicate that data flowing into IO$ 130 from I/O agent 128 and then into CBB shared cache 118 may both be routed through interconnect network 110 via communication links such as communication links 812-1 and 812-2. Also, CA 116 at CBB 102 and I/O CA 132 at I/O domain 104 may communicate with HAs 106-1 to 106-N through interconnect network 110 via at least one of communication links 812-1 to 812-N. As shown in
As described below for various data flows, HAs 106-1 to 106-N and HSFs 143-1 to 143-N resident on intermediate I/O cache die(s) 148 can be arranged to facilitate use of IO$ 130 or CBB shared cache (L3) 118 based on a selected DDIO mode. HAs 106-1 to 106-N and HSFs 143-1 to 143-N can be arranged to communicate with CA 116 or I/O CA 132 through interconnect network 110 via communication links 812-3 and 812-4 according to the selected DDIO mode, be it a dynamically selected DDIO mode or a statically configured DDIO mode.
Similar to CA 116 and I/O CA 132, as shown in
According to some examples, at 9.2, logic and/or features of HA 106-N (e.g., I/O stack logic 145) utilizes CAT mask table 300 to identify an IO CLOS value based on the IO RDT CLOS ID (e.g., 0000) received with the ownership request. For these examples, HA 106-N forwards the identified IO CLOS value to HSF 143-N. HSF 143-N utilizes CBB ID table 400 to determine that a CBB ID hit exists for the identified IO CLOS value. For example, the identified IO CLOS value is 000000 and an RMID for application 0 is included in CBB ID table 400 that corresponds to or maps to an IO CLOS value of 000000 as shown in
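The HSF lookup at 9.2 can be sketched in C as a search of a small table modeled after CBB ID table 400, as shown below. The structure layout and table contents are illustrative assumptions.

```c
/* Sketch of the HSF lookup at 9.2: map a 6-bit IO CLOS value to a CBB ID
 * (an RMID) using a table modeled after CBB ID table 400. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct cbb_id_entry {
    uint8_t  io_clos_value; /* 6-bit IO CLOS value */
    uint16_t rmid;          /* CBB ID expressed as an RMID */
};

/* Modeled CBB ID table 400 (e.g., IO CLOS 000000 maps to the RMID for application 0). */
static const struct cbb_id_entry cbb_id_table[] = {
    { 0x00 /* 000000 */, 0 /* RMID for application 0 */ },
};

/* Returns true on a CBB ID hit and writes the matching RMID to *rmid. */
bool hsf_lookup_cbb_id(uint8_t io_clos_value, uint16_t *rmid)
{
    for (size_t i = 0; i < sizeof(cbb_id_table) / sizeof(cbb_id_table[0]); i++) {
        if (cbb_id_table[i].io_clos_value == io_clos_value) {
            *rmid = cbb_id_table[i].rmid;
            return true;  /* hit: a line can be allocated in CBB shared cache (L3) 118 */
        }
    }
    return false;         /* miss: no CBB ID associated with this IO CLOS value */
}
```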
In some examples, at 9.3, logic and/or features of HA 106-N communicate with CA 116 through interconnect network 110 via communication links 812-1 and 812-N utilizing UCIe protocols to allocate a cache line in CBB shared cache (L3) 118. For these examples, the cache line to allocate is associated with the CBB ID (RMID for application 0) as determined by the HSF hit.
According to some examples, at 9.4, I/O CA 132 causes writeback data from Wr$ 138 to be sent to HA 106-N through interconnect network 110 via communication links 812-2 and 812-N, for example, utilizing UCIe protocols.
In some examples, at 9.5, logic and/or features of HA 106-N cause the writeback data from Wr$ 138 to be pushed to a cache line address in CBB shared cache (L3) 118 associated with the identified CBB ID. For these examples, the writeback data from Wr$ 138 can be routed through interconnect network 110 via communication links 812-N and 812-1 utilizing UCIe protocols. Data flow 900 then comes to an end.
According to some examples, at 10.2, logic and/or features of HA 106-N (e.g., I/O stack logic 145) utilizes CAT mask table 300 to identify an IO CLOS value based on the IO RDT CLOS ID received with the ownership request. Rather than having HSF 143-N use a CBB ID table such as CBB ID table 400 to look for a CBB ID as described above for data flow 900, the identified IO CLOS value is used by logic and/or features of HA 106-N based on static information to determine a CBB ID. The static information, for example, can be based on a predetermined affinity between a core and the I/O agent generating the data to be processed by the core, as mentioned above for the description of the core$ allocation DDIO mode in DDIO mode table 700. The static information, for example, can be determined by BIOS 101 or OS 103 during initialization of system 100, and that static information can be provided to HA 106-N (e.g., via setting or programming register(s) of CSR 147 or populating a data structure accessible to HA 106-N). Once a CBB ID is identified, the ownership request is granted by HA 106-N.
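The static determination at 10.2 can be sketched in C as a lookup into a BIOS/OS-programmed affinity table, as shown below. The table layout and names are assumptions for illustration.

```c
/* Sketch of the static determination at 10.2: the CBB ID comes from an
 * affinity table programmed by BIOS 101 or OS 103 at initialization
 * (e.g., via register(s) of CSR 147). Layout and names are assumptions. */
#include <stdint.h>

#define NUM_IO_CLOS_VALUES 64 /* 6-bit IO CLOS value space */

/* Indexed by IO CLOS value; holds the CBB ID of the core with a
 * predetermined affinity to the I/O agent producing the data. */
static uint16_t static_cbb_affinity[NUM_IO_CLOS_VALUES];

uint16_t lookup_static_cbb_id(uint8_t io_clos_value)
{
    return static_cbb_affinity[io_clos_value & (NUM_IO_CLOS_VALUES - 1)];
}
```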
According to some examples, at 10.3, I/O CA 132 causes writeback data from Wr$ 138 to be sent to HA 106-N through interconnect network 110 via communication links 812-2 and 812-N, for example, utilizing UCIe protocols.
In some examples, at 10.4, logic and/or features of HA 106-N cause the writeback data from Wr$ 138 to be pushed to a cache line address in CBB shared cache (L3) 118 associated with the identified CBB ID for which ownership was granted to I/O CA 132. For these examples, the writeback data from Wr$ 138 can be routed through interconnect network 110 via communication links 812-N and 812-1 utilizing UCIe protocols. Data flow 1000 then comes to an end.
According to some examples, at 11.2, logic and/or features of HA 106-N (e.g., I/O stack logic 145) identifies the IO$ partition ID based on I/O cache allocation 200 that allocates IO$ 130 based on a data traffic source type (e.g., see data traffic source types in
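A C sketch of the partition selection at 11.2 is shown below, assuming IO$ 130 is partitioned by data traffic source type per I/O cache allocation 200. The specific source-type-to-partition assignments are illustrative only.

```c
/* Sketch of the partition selection at 11.2: choose an IO$ 130 partition ID
 * based on the data traffic source type. Assignments are illustrative. */
typedef enum {
    SRC_IPU_NIC,
    SRC_IOMMU,
    SRC_ACCELERATOR,
    SRC_CXL_TYPE1_OR_TYPE2
} traffic_source_t;

/* Return an IO$ 130 partition ID (P0-P3) for a given data traffic source type. */
unsigned io_cache_partition_for_source(traffic_source_t src)
{
    switch (src) {
    case SRC_IPU_NIC:            return 0; /* P0 */
    case SRC_IOMMU:              return 1; /* P1 */
    case SRC_ACCELERATOR:        return 2; /* P2 */
    case SRC_CXL_TYPE1_OR_TYPE2: return 3; /* P3 */
    default:                     return 0;
    }
}
```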
In some examples, at 11.3, HA 106-N also sends a notification to CA 116 at CBB 102 to indicate that cache line ownership has been granted for I/O agent 128 to writeback data to IO$ 130. The notification can include the cache address for the cache line included in the identified IO$ partition ID that was also provided to I/O CA 132. For these examples, CA 116 causes core 114 to pull writeback data placed at the cache address by I/O agent 128 from IO$ 130. The writeback data may be pulled utilizing UCIe protocols. As mentioned above, IO$ 130 can be located on a same die or chip as HA 106-N, so communication links 812-N and 812-1 can be used to pull the writeback data from IO$ 130 to at least CBB shared cache (L3) 118 at CBB 102. Data flow 1100 then comes to an end.
Included herein is a set of logic flows representative of example methodologies for performing novel aspects of the disclosed architecture. While, for purposes of simplicity of explanation, the one or more methodologies shown herein are shown and described as a series of acts, those skilled in the art will understand and appreciate that the methodologies are not limited by the order of acts. Some acts may, in accordance therewith, occur in a different order and/or concurrently with other acts from that shown and described herein. For example, those skilled in the art will understand and appreciate that a methodology could alternatively be represented as a series of interrelated states or events, such as in a state diagram. Moreover, not all acts illustrated in a methodology may be required for a novel implementation.
A logic flow may be implemented in software, firmware, and/or hardware. In software and firmware examples, a logic flow may be implemented by computer executable instructions stored on at least one non-transitory computer readable medium or machine readable medium, such as an optical, magnetic or semiconductor storage. The examples are not limited in this context.
In some examples, as shown in
According to some examples, logic flow 1200 at 1204 may grant the request for ownership of the cache line based, at least in part, on a data traffic source type associated with the I/O agent, the granted request to cause the writeback data to be stored to a cache line address for the cache line. For these examples, logic and/or features of HA 106 can grant the request based, at least in part, on a data traffic source type associated with the I/O agent. For example, the data traffic source type can be an IOMMU, an accelerator, an IPU/NIC, or a CXL type 1 or type 2 device.
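Logic flow 1200 can be summarized by the hedged C sketch below. The request/grant structures, the grant policy, and the use of CLOS ID presence to select between the two caches are illustrative assumptions drawn from the examples described herein.

```c
/* Hedged sketch of logic flow 1200: receive an ownership request from an I/O
 * agent, then grant it (at 1204) based, at least in part, on the data traffic
 * source type. Structures and policy are illustrative assumptions. */
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    SRC_IOMMU, SRC_ACCELERATOR, SRC_IPU_NIC, SRC_CXL_TYPE1_OR_TYPE2
} traffic_source_t;

struct ownership_request {
    traffic_source_t source;      /* data traffic source type of the I/O agent */
    bool             has_clos_id; /* a CLOS ID indicates the CBB shared cache is targeted */
    uint8_t          clos_id;     /* 4-bit IO RDT CLOS ID, when present */
};

struct ownership_grant {
    bool     granted;
    bool     to_cbb_shared_cache; /* true: CBB shared cache (L3) 118; false: IO$ 130 */
    uint64_t cache_line_address;  /* where the writeback data will be stored */
};

struct ownership_grant logic_flow_1200(const struct ownership_request *req)
{
    struct ownership_grant g = { 0 };

    /* 1204: grant based, at least in part, on the data traffic source type. */
    switch (req->source) {
    case SRC_IOMMU:
    case SRC_ACCELERATOR:
    case SRC_IPU_NIC:
    case SRC_CXL_TYPE1_OR_TYPE2:
        g.granted = true;
        break;
    default:
        return g;   /* unrecognized source type: request not granted */
    }

    /* Illustrative target selection: presence of a CLOS ID steers the
     * writeback toward the first cache (CBB shared cache 118), its absence
     * toward the second cache (IO$ 130); see Examples 6 and 8 below. */
    g.to_cbb_shared_cache = req->has_clos_id;
    g.cache_line_address  = 0; /* the actual address is resolved by HA 106 / CA 116 */
    return g;
}
```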
While various examples described herein could use System-on-a-Chip or System-on-Chip (“SoC”) to describe a device or system having a processor and associated circuitry (e.g., Input/Output (“I/O”) circuitry, power delivery circuitry, memory circuitry, etc.) integrated monolithically into a single integrated circuit (“IC”) die, or chip, the present disclosure is not limited in that respect. For example, in various examples of the present disclosure, a device or system could have one or more processors (e.g., one or more processor cores) and associated circuitry (e.g., I/O circuitry, power delivery circuitry, etc.) arranged in a disaggregated collection of discrete dies, tiles and/or chiplets (e.g., one or more discrete processor core die arranged adjacent to one or more other die such as memory die, I/O die, etc.). In such disaggregated devices and systems the various dies, tiles and/or chiplets could be physically and electrically coupled together by a package structure including, for example, various packaging substrates, interposers, interconnect bridges and the like. Also, these disaggregated devices can be referred to as a system-on-a-package (SoP).
One or more aspects of at least one example may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores,” may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.
Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks, any other type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), phase change memory (PCM), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
Accordingly, various examples also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines structures, circuits, apparatuses, processors and/or system features described herein. Such examples may also be referred to as program products.
In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation, dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction to one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on processor, off processor, or part on and part off processor.
Various examples may be implemented using hardware elements, software elements, or a combination of both. In some examples, hardware elements may include devices, components, processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, ASICs, PLDs, DSPs, FPGAs, memory units, logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth. In some examples, software elements may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, APIs, instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof.
Determining whether an example is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints, as desired for a given implementation.
Some examples may be described using the expression “in one example” or “an example” along with their derivatives. These terms mean that a particular feature, structure, or characteristic described in connection with the example is included in at least one example. The appearances of the phrase “in one example” in various places in the specification are not necessarily all referring to the same example.
Some examples may be described using the expression “coupled” and “connected” along with their derivatives. These terms are not necessarily intended as synonyms for each other. For example, descriptions using the terms “connected” and/or “coupled” may indicate that two or more elements are in direct physical or electrical contact with each other. The term “coupled,” however, may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other.
The following examples pertain to additional examples of technologies disclosed herein.
Example 1. An example apparatus can include a first die arranged to be included in a multi-die system and circuitry. The circuitry can receive a request for ownership of a cache line for an I/O agent of an I/O device to writeback data to the cache line for a core of a processor to obtain and process the writeback data. The cache line can be associated with a first cache resident on a second die of the multi-die system that also includes the core or is associated with a second cache resident on the first die. The circuitry can also grant the request for ownership of the cache line based, at least in part, on a data traffic source type associated with the I/O agent, the granted request to cause the writeback data to be stored to a cache line address for the cache line.
Example 2. The apparatus of example 1, the first cache resident on the second die can include an L3 cache shared by the core with other cores of the processor.
Example 3. The apparatus of example 1, the I/O agent and I/O device can be resident on a third die of the multi-die system.
Example 4. The apparatus of example 3, the circuitry can be communicatively coupled to the I/O agent resident on the third die via a chip-to-chip interconnect network arranged to operate according to the UCIe specification.
Example 5. The apparatus of example 1, the request for ownership can be received via a PCIe TLP request, the PCIe request TLP to include information to indicate the request for ownership is to the first cache resident on the second die.
Example 6. The apparatus of example 5, the information to indicate the request for ownership is to the first cache can be a CLOS ID for the first cache. The CLOS ID can be associated with the cache line address to store the writeback data, to grant the request for ownership of the cache line to the first cache can be based on the data traffic source type and the CLOS ID.
Example 7. The apparatus of example 1, the request for ownership can be received via a PCIe TLP request, the PCIe request TLP to include information to indicate the request for ownership can be to the second cache resident on the first die.
Example 8. The apparatus of example 7, the information to indicate the request for ownership is to the second cache can be an absence of a CLOS ID for the first cache. The CLOS ID can be associated with a cache line address at the first cache, to grant the request for ownership of the cache line to the second cache can be based on the data traffic source type and the absence of the CLOS ID.
Example 9. The apparatus of example 1, the request for ownership can be received via a PCIe TLP. The PCIe request TLP can include information in TPH bits to indicate a dynamic or static selection of a DDIO mode to cause the writeback data to be stored to the cache line address for the cache line. The selected DDIO mode can include one of a dynamic allocation that includes granting the request for ownership of the cache line for either the first cache or for the second cache, a core cache allocation that includes granting the request for ownership of the cache line for only the first cache, or an I/O cache allocation that includes granting the request for ownership of the cache line for only the second cache.
Example 10. The apparatus of example 1, the data traffic source type associated with the I/O agent can be an IOMMU, a CXL type 1 or type 2 device, an accelerator, a NIC or an IPU.
Example 11. An example method can include receiving, by circuitry at a first die of a multi-die system, a request for ownership of a cache line for an I/O agent of an I/O device to writeback data to the cache line for a core of a processor to obtain and process the writeback data. The cache line can be associated with a first cache resident on a second die of the multi-die system that also includes the core or can be associated with a second cache resident on the first die. The method can also include granting the request for ownership of the cache line based, at least in part, on a data traffic source type associated with the I/O agent. The granted request can cause the writeback data to be stored to a cache line address for the cache line.
Example 12. The method of example 11, the first cache resident on the second die can be an L3 cache shared by the core with other cores of the processor.
Example 13. The method of example 11, the I/O agent and I/O device can be resident on a third die of the multi-die system.
Example 14. The method of example 13, the circuitry at the first die can be communicatively coupled to the I/O agent resident on the third die via a chip-to-chip interconnect network arranged to operate according to the UCIe specification.
Example 15. The method of example 11, the request for ownership can be received via a PCIe TLP request. The PCIe request TLP can include information to indicate the request for ownership is to the first cache resident on the second die.
Example 16. The method of example 15, the information to indicate the request for ownership is to the first cache can be a CLOS ID for the first cache. The CLOS ID can be associated with the cache line address to store the writeback data, granting the request for ownership of the cache line to the first cache can be based on the data traffic source type and the CLOS ID.
Example 17. The method of example 11, the request for ownership can be received via a PCIe TLP request. The PCIe request TLP can include information to indicate the request for ownership is to the second cache resident on the first die.
Example 18. The method of example 17, the information to indicate the request for ownership is to the second cache can be an absence of a CLOS ID for the first cache. The CLOS ID can be associated with a cache line address at the first cache, granting the request for ownership of the cache line to the second cache can be based on the data traffic source type and the absence of the CLOS ID.
Example 19. The method of example 11, the request for ownership can be received via a PCIe TLP. The PCIe request TLP can include information in TPH bits to indicate a dynamic or static selection of a DDIO mode to cause the writeback data to be stored to the cache line address for the cache line. The selected DDIO mode can include one of a dynamic allocation that includes granting the request for ownership of the cache line for either the first cache or for the second cache, a core cache allocation that includes granting the request for ownership of the cache line for only the first cache, or an I/O cache allocation that includes granting the request for ownership of the cache line for only the second cache.
Example 20. The method of example 11, the data traffic source type associated with the I/O agent can be an IOMMU, a CXL type 1 or type 2 device, an accelerator, a NIC or an IPU.
Example 21. An example at least one machine readable medium can include a plurality of instructions that in response to being executed by a circuitry of a multi-die system can cause the circuitry to carry out a method according to any one of examples 11 to 20.
Example 22. An example apparatus can include means for performing the methods of any one of examples 11 to 20.
Example 23. An example at least one machine readable medium can include a plurality of instructions that in response to being executed by circuitry at a first die of a multi-die system, can cause the circuitry to receive a request for ownership of a cache line for an I/O agent of an I/O device to writeback data to the cache line for a core of a processor to obtain and process the writeback data. The cache line can be associated with a first cache resident on a second die of the multi-die system that also includes the core or is associated with a second cache resident on the first die. The instructions can also cause the circuitry to grant the request for ownership of the cache line based, at least in part, on a data traffic source type associated with the I/O agent. The granted request can cause the writeback data to be stored to a cache line address for the cache line.
Example 24. The at least one machine readable medium of example 23, the first cache resident on the second die can be a L3 cache shared by the core with other cores of the processor.
Example 25. The at least one machine readable medium of example 23, wherein the I/O agent and I/O device can be resident on a third die of the multi-die system.
Example 26. The at least one machine readable medium of example 25, the circuitry at the first die can be communicatively coupled to the I/O agent resident on the third die via a chip-to-chip interconnect network arranged to operate according to the UCIe specification.
Example 27. The at least one machine readable medium of example 23, the request for ownership can be received via a PCIe TLP request. The PCIe request TLP can include information to indicate the request for ownership is to the first cache resident on the second die.
Example 28. The at least one machine readable medium of example 27, the information to indicate the request for ownership is to the first cache can be a CLOS ID for the first cache. The CLOS ID can be associated with the cache line address to store the writeback data. Granting the request for ownership of the cache line to the first cache can be based on the data traffic source type and the CLOS ID.
Example 29. The at least one machine readable medium of example 23, the request for ownership can be received via a PCIe TLP request. The PCIe request TLP can include information to indicate the request for ownership is to the second cache resident on the first die.
Example 30. The at least one machine readable medium of example 29, the information to indicate the request for ownership is to the second cache can be an absence of a CLOS ID for the first cache. The CLOS ID can be associated with a cache line address at the first cache. Granting the request for ownership of the cache line to the second cache can be based on the data traffic source type and the absence of the CLOS ID.
Example 31. The at least one machine readable medium of example 23, the request for ownership can be received via a PCIe TLP. The PCIe request TLP can include information in TPH bits to indicate a dynamic or static selection of a DDIO mode to cause the writeback data to be stored to the cache line address for the cache line. The selected DDIO mode can include one of a dynamic allocation that includes granting the request for ownership of the cache line for either the first cache or for the second cache, core cache allocation that includes granting the request for ownership of the cache line for only the first cache, or an I/O cache allocation that includes granting the request for ownership of the cache line for only the second cache.
Example 32. The at least one machine readable medium of example 23, the data traffic source type associated with the I/O agent can be an IOMMU, a CXL type 1 or type 2 device, an accelerator, a NIC or an IPU.
Example 33. An example multi-die system can include a first die arranged to have an I/O agent of an I/O device resident on the first die. The multi-die system can also include a second die arranged to have a core of a processor and a first cache resident on the second die. The multi-die system can also include a third die arranged to have circuitry and a second cache resident on the third die. For these examples, the circuitry can receive a request for ownership of a cache line for the I/O agent to writeback data to the cache line for the core of the processor to obtain and process the writeback data. The cache line can be associated with the first cache or can be associated with the second cache. The circuitry may also grant the request for ownership of the cache line based, at least in part, on a data traffic source type associated with the I/O agent, the granted request to cause the writeback data to be stored to a cache line address for the cache line.
Example 34. The multi-die system of example 33, the first cache can be an L3 cache arranged to be shared by the core with other cores of the processor.
Example 35. The multi-die system of example 33, the circuitry can be arranged to be communicatively coupled to the I/O agent resident on the third die via a chip-to-chip interconnect network arranged to operate according to the UCIe specification.
Example 36. The multi-die system of example 33, wherein the request for ownership can be received via a PCIe TLP request. The PCIe request TLP can include information to indicate the request for ownership is to the first cache.
Example 37. The multi-die system of example 36, the information to indicate the request for ownership is to the first cache can be a CLOS ID for the first cache. The CLOS ID can be associated with the cache line address to store the writeback data. Granting the request for ownership of the cache line to the first cache can be based on the data traffic source type and the CLOS ID.
Example 38. The multi-die system of example 33, the request for ownership can be received via a PCIe TLP request. The PCIe request TLP can include information to indicate the request for ownership is to the second cache.
Example 39. The multi-die system of example 38, the information to indicate the request for ownership is to the second cache can be an absence of a CLOS ID for the first cache. The CLOS ID can be associated with a cache line address at the first cache. Granting the request for ownership of the cache line to the second cache can be based on the data traffic source type and the absence of the CLOS ID.
Example 40. The multi-die system of example 33, the request for ownership can be received via a PCIe TLP. The PCIe request TLP can include information in TPH bits to indicate a dynamic or static selection of a DDIO mode to cause the writeback data to be stored to the cache line address for the cache line. The selected DDIO mode can include one of a dynamic allocation that includes granting the request for ownership of the cache line for either the first cache or for the second cache, a core cache allocation that includes granting the request for ownership of the cache line for only the first cache, or an I/O cache allocation that includes granting the request for ownership of the cache line for only the second cache.
Example 41. The multi-die system of example 33, the data traffic source type associated with the I/O agent can be an IOMMU, a CXL type 1 or type 2 device, an accelerator, a NIC or an IPU.
It is emphasized that the Abstract of the Disclosure is provided to comply with 37 C.F.R. Section 1.72(b), requiring an abstract that will allow the reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, it can be seen that various features are grouped together in a single example for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed examples require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed example. Thus, the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separate example. In the appended claims, the terms “including” and “in which” are used as the plain-English equivalents of the respective terms “comprising” and “wherein,” respectively. Moreover, the terms “first,” “second,” “third,” and so forth, are used merely as labels, and are not intended to impose numerical requirements on their objects.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.