This disclosure pertains to computing systems, and in particular (but not exclusively) to point-to-point interconnects.
Advances in semiconductor processing and logic design have permitted an increase in the amount of logic that may be present on integrated circuit devices. As a corollary, computer system configurations have evolved from a single or multiple integrated circuits in a system to multiple cores, multiple hardware threads, and multiple logical processors present on individual integrated circuits, as well as other interfaces integrated within such processors. A processor or integrated circuit typically comprises a single physical processor die, where the processor die may include any number of cores, hardware threads, logical processors, interfaces, memory, controller hubs, etc.
As a result of the greater ability to fit more processing power in smaller packages, smaller computing devices have increased in popularity. Smartphones, tablets, ultrathin notebooks, and other user equipment have grown exponentially. However, these smaller devices rely on servers both for data storage and for complex processing that exceeds the capabilities of their form factor. Consequently, the demand in the high-performance computing market (i.e., the server space) has also increased. For instance, in modern servers, there is typically not only a single processor with multiple cores, but also multiple physical processors (also referred to as multiple sockets) to increase the computing power. But as the processing power grows along with the number of devices in a computing system, the communication between sockets and other devices becomes more critical.
In fact, interconnects have grown from more traditional multi-drop buses that primarily handled electrical communications to full-blown interconnect architectures that facilitate fast communication. Unfortunately, as the demand for future processors to consume data at even higher rates grows, a corresponding demand is placed on the capabilities of existing interconnect architectures.
In the following description, numerous specific details are set forth, such as examples of specific types of processors and system configurations, specific hardware structures, specific architectural and microarchitectural details, specific register configurations, specific instruction types, specific system components, specific measurements/heights, specific processor pipeline stages and operation, etc., in order to provide a thorough understanding of the present disclosure. It will be apparent, however, to one skilled in the art that these specific details need not be employed to practice the embodiments of the present disclosure. In other instances, well-known components or methods, such as specific and alternative processor architectures, specific logic circuits/code for described algorithms, specific firmware code, specific interconnect operation, specific logic configurations, specific manufacturing techniques and materials, specific compiler implementations, specific expression of algorithms in code, specific power down and gating techniques/logic, and other specific operational details of computer systems have not been described in detail in order to avoid unnecessarily obscuring the present disclosure.
Although the following embodiments may be described with reference to efficient high-speed data transmission and configurability in specific integrated circuits, such as in computing platforms or microprocessors, other embodiments are applicable to other types of integrated circuits and logic devices. Similar techniques and teachings of embodiments described herein may be applied to other types of circuits or semiconductor devices that may also benefit from better energy efficiency and energy conservation. For example, the disclosed embodiments may be applied to computing systems embodied as servers, blades, desktop computer systems, system on chip (SoC) devices, handheld devices, tablets, set top boxes, in-vehicle computing systems, computer vision systems, gaming systems, machine learning systems, and embedded applications. As will become readily apparent in the description below, the embodiments of methods, apparatuses, and systems described herein (whether in reference to hardware, firmware, software, or a combination thereof) are beneficial to the development of high-performance computer interconnects and their respective systems.
As computing systems are advancing, the components therein are becoming more complex. As a result, the interconnect architecture to couple and communicate between the components is also increasing in complexity to ensure bandwidth requirements are met for optimal component operation. Furthermore, different market segments demand different aspects of interconnect architectures to suit the market's needs. For example, servers require higher performance, while the mobile ecosystem is sometimes able to sacrifice overall performance for power savings. Yet, it is a singular purpose of most fabrics to provide the highest possible performance with maximum power savings. Below, a number of interconnects are discussed, which would potentially benefit from aspects of the solutions described herein.
One example interconnect fabric architecture includes the Peripheral Component Interconnect (PCI) Express (PCIe) architecture. A primary goal of PCIe is to enable components and devices from different vendors to inter-operate in an open architecture, spanning multiple market segments: Clients (Desktops and Mobile), Servers (Standard and Enterprise), and Embedded and Communication devices. PCI Express is a high performance, general purpose I/O interconnect defined for a wide variety of future computing and communication platforms. Some PCI attributes, such as its usage model, load-store architecture, and software interfaces, have been maintained through its revisions, whereas previous parallel bus implementations have been replaced by a highly scalable, fully serial interface. The more recent versions of PCI Express take advantage of advances in point-to-point interconnects, Switch-based technology, and packetized protocols to deliver new levels of performance and features. Power Management, Quality of Service (QoS), Hot-Plug/Hot-Swap support, Data Integrity, and Error Handling are among some of the advanced features supported by PCI Express.
Traditional streaming interfaces to couple fabric to protocol agents have generally included proprietary interfaces (e.g., Intel™ On-chip System Fabric (IOSF™)), interfaces developed for coherent or unordered protocols, and other interfaces that are poorly adapted to scaling to handle the evolving data rates in modern protocols and architectures. For instance, proprietary interfaces may carry custom or use-case-specific information or features that prevent standardization of the interface or that fail to scale to next generation bandwidths. Other traditional interfaces may be defined in a more generic manner, for instance, as a data bus for carrying packets. However, such generic bus definitions and interfaces may lead to receiver decode complexity, particularly in the presence of multiple flow control classes or virtual channels, especially as data rates increase and more packets are able to be processed per clock cycle. As an example, if four (or even more) packets of any channel or flow control class can potentially arrive in a given clock cycle, and these packets access shared buffers, then a corresponding four (or more) logical write ports may need to be provisioned in the receiver, resulting in excess surface area dedicated to providing such logic (and buffers). In some instances, traditional interfaces address use cases where multiple packets per cycle (of different flow control classes) are to be transferred simply by stamping out multiple copies of the interface (e.g., one for each flow control class), leading to high pin counts. Additionally, traditional streaming interfaces have header and data packets following each other on the same physical wires, limiting the potential for latency optimizations. Some traditional interfaces fail to provide effective, flexible mechanisms for crediting flows, among other example shortcomings.
In some implementations, an improved, scalable streaming interface may be defined between agent logic on a device and a fabric, such as between the protocol layer and other devices coupled to a fabric (e.g., a CPU, endpoint device, switch, etc.). The streaming interface may support a load/store protocol, such as PCIe or Compute Express Link (CXL) (e.g., CXL.io), among other load/store protocols. The improved streaming interface may define interface rules and channels of the interface to enable significant chip area and latency advantages during implementation, while providing the power-efficient bandwidth scaling advantages that will become ever more critical, particularly as protocols approach higher speeds, such as the move to 32.0 GT/s in PCIe Gen 5, or to 64.0 GT/s data rates and beyond starting with PCIe Gen 6 and CXL 3.0, among other examples. Such an interface may strike the best balance of pin count versus receiver decoding complexity. In some implementations, the improved streaming interface discussed herein may allow a modest number of logical write ports on receiver buffers, where the receiver buffers are shared amongst multiple virtual channels and flow control classes. Further, an improved streaming interface may bifurcate the header and data of packets into independent physical channels (e.g., a header channel and a data channel), thereby allowing the receiver to start processing headers while data is still streaming in, which helps reduce overall latency as well as buffer sizing and complexity. Further, the improved streaming interface discussed herein may be standardized to enable ecosystems of IP blocks to adopt and develop to a scalable, standardized interface, rather than traditional proprietary interfaces, and allow more options for interoperability, among other example features and advantages, such as discussed herein.
Turning to the simplified block diagram 100 of
Compute blocks (e.g., 110, 115, 120, 125, 130, 135, 140, 145) of an example SoC 105 may be interconnected by an SoC fabric (e.g., 150). The fabric 150 may itself be implemented using a set of one or more IP blocks facilitating communication between compute blocks (e.g., 110, 115, 120, 125, 130, 135, 140, 145). In some implementations, the fabric 150 may be implemented as a network on chip (NOC), such as a NOC implemented using one or more circuitry blocks.
Communication by the various blocks (e.g., 110, 115, 120, 125, 130, 135, 140, 145) may be facilitated through protocol agents (e.g., 160a-h) provided on the blocks (e.g., 110, 115, 120, 125, 130, 135, 140, 145). Each agent (e.g., 160a-h) may include logic (e.g., implemented in hardware circuitry, firmware, and/or software) to implement all or a subset of layers of one or more interconnect protocols (e.g., PCIe, Compute Express Link (CXL), Gen-Z, OpenCAPI, In-Die Interface, Cache Coherent Interconnect for Accelerators (CCIX), UltraPath Interconnect (UPI), etc.) through which the corresponding compute block is to communicate with other compute blocks in the system. As discussed herein, the agents may couple to the fabric 150 via a respective interface. While such agents may have traditionally coupled to fabrics via proprietary wire interfaces, one or more agents (e.g., 160a-h) may utilize respective instances of a configurable, flexible on-die wire interface, which may be deployed to support the multiple different protocols of multiple different agents of the SoC 105. In other instances, interfaces between agents (e.g., 160a-h) may be implemented to support non-coherent and/or load/store streaming protocols, and corresponding streaming fabric interfaces may be defined and implemented on the blocks (e.g., 110, 115, 120, 125, 130, 135, 140, 145) and the fabric 150, among other example implementations.
As introduced above, an improved streaming fabric interface architecture (SFI) may be provided in components of a system (e.g., IP blocks and components implementing the fabric of the system) to map Load/Store protocols (e.g., PCIe, CXL.io) between an agent and a fabric. An SFI interface may provide a scalable streaming interface that can sustain the high bandwidth requirements of Load/Store protocols, including emerging next generation speeds for such protocols. An SFI interface may enable ease of implementation on both the transmit and receive sides when transmitting at such high data rates. Additionally, the logic implementing the SFI interface may embody, realize, and enforce rules for communications on the interface (e.g., beyond those defined in the protocols supported by the interface) to greatly simplify storage overhead in the context of read/write ports on the receiver, among other example advantages.
An SFI interface may be employed either in the context of a host CPU (e.g., through the root complex) or in the context of a device endpoint. In both cases, SFI serves to carry protocol layer (transaction layer) specific information between different processing entities. As an example, on the device side, SFI can be used to interface between the PCIe controller and the application layer (e.g., the fabric or a gasket layer between the controller and the fabric). Similarly, on the host side, SFI can be used to interface between the PCIe Root Port and the CPU fabric. Configurable parameters may be defined in an SFI interface to allow instances of the interface to be parameterized to be wide enough and carry multiple packets in a single transfer according to the supported protocols and the system use case(s). On a given SFI interface, data transfer may be unidirectional. Accordingly, in some implementations, a pair of SFI interface instances may be provided (one in each direction) to facilitate implementations utilizing bidirectional data transfer between communicating blocks. Accordingly, many of the examples herein discuss a transmitter (TX) and receiver (RX) pair for a single instance of an SFI interface.
Different configurations can be enabled using SFI as the intermediate interface. For instance, an SFI interface may make no assumptions around protocol- or application-specific responsibilities of the transmitter and receiver of the interface. Rather, an SFI interface may simply provide a mechanism and rules for high bandwidth packet transfer. For instance,
While some implementations of SFI may utilize semantics and header formats of a PCIe-based protocol, SFI is not limited to supporting PCIe-based protocols. Further, SFI does not contain a new protocol definition. SFI semantics can be used to support a variety of different protocols, provided the protocol can be mapped to or adapted to the flow control (FC) and virtual channel (VC) semantics that SFI provides, among other example features. For instance, SFI supports advertisement of 0 or more shared credit pools for the receiver queues (such as discussed in more detail below).
Turning to
Among the example features adopted in an example, improved SFI interface, receiver decoding may be simplified, with the interface scaling to support a wide range of data payloads (e.g., from as small as 4B to as large as 4KB (or larger)). An improved streaming interface may allow multiple packets to be delivered in the same cycle, allowing a scalable interface across a variety of payload sizes while maintaining a common set of semantics and ordering (e.g., PCIe-based, etc.). Configurable parameters may include the number of logical write ports at the receiver (e.g., 1 or 2), which may be supported by defining rules for the interface that restrict the number of different packets or headers transmitted in a clock cycle to a corresponding number of flow control classes and/or virtual channels. Reducing the number of logical write ports at the receiver may save significant area and complexity. Additionally, as noted above, an improved streaming interface may enable header processing (e.g., of headers received over a dedicated header channel) at the receiver to begin while data is streaming in to improve latency (e.g., in the case of a CPU host, to help overlap ownership request latency with an incoming data stream).
Compute Express Link, or CXL, is a low-latency, high-bandwidth discrete or on-package link that supports dynamic protocol multiplexing (or muxing) of a coherency protocol (CXL.cache), memory access protocol (CXL.mem), and I/O protocol (CXL.io). CXL.cache is an agent coherency protocol that supports device caching of host memory, CXL.mem is a memory access protocol that supports device-attached memory, and CXL.io is a PCIe-based non-coherent I/O protocol with enhancements for accelerator support. CXL is intended to thereby provide a rich set of protocols to support a vast spectrum of devices, such as accelerator devices. Depending on the particular accelerator usage model, all of the CXL protocols (CXL.io, CXL.mem, CXL.cache) or only a subset may be enabled to provide a low-latency, high-bandwidth path for a corresponding computing block or device (e.g., an accelerator) to access the system.
As noted above, in some implementations, agents utilized to implement a CXL.io protocol may couple to system fabric utilizing an SFI interface, such as described herein. For instance, turning to
Continuing with the example of
As shown in
In some implementations, an improved streaming interface may be implemented that is adapted to support a load/store protocol based at least in part on PCIe or PCIe semantics (e.g., PCIe or CXL.io). For instance, a supported protocol may utilize packet formats based on PCIe-defined formats. Additionally, Flow Control/Virtual Channel notions may be extended from PCIe definitions. It should be appreciated that other, additional protocols (e.g., non-PCIe or CXL protocols) may also be supported by such SFI interfaces. Indeed, while many of the examples discussed herein reference PCIe- or CXL.io-based protocols and implementations, it should be appreciated that the principles, features, and solutions discussed herein may be more generally applied, for instance, to a variety of other streaming or load/store protocols, among other example systems.
In some implementations, an SFI interface may have separate Header (HDR) and Data buses or channels, each of which can carry multiple packets' headers or payloads concurrently. Further, formalized rules may be set and adopted in logic of the agent to govern how packets are packed/unpacked on the header and data interfaces. For instance, an additional metadata channel, or bus, may be provided on the improved interface to carry metadata to enable the receiver to identify how to unpack the headers/data sent on the separate header and payload data channels respectively. Through separate, parallel header and data channels a system (e.g., the root complex of a CPU host) may enjoy latency benefits, for instance, by receiving potentially multiple headers before the corresponding payload is received. This resulting lead time may be used by the system to process the headers and start fetching ownership for the cache lines for multiple header requests, while the data of those requests is still streaming in. This helps overlap latencies and helps reduce buffer residency, among other example advantages.
Turning to
Each of the HDR and DATA channels can carry multiple packets on the same cycle of transfer. Since most Load/Store protocols rely on ordering semantics, SFI assumes implicit ordering when multiple packets are sent on the same cycle. Packets may be ordered, for instance, from the least significant position to the most significant position. For example, if TLP 0 begins from byte 0 of the header signal 505 and TLP 1 begins from byte 16 of the header signal 505, then the receiver considers TLP 1 to be ordered behind TLP 0 when such ordering rules are applied. For transfers across different clock cycles, the ordering rules of the relevant protocol are followed (e.g., SFI carries over all PCIe ordering rules when used for PCIe). In cases of link subdivision (e.g., dividing the overall lanes of the link into two or more smaller-width links (e.g., associated with respective root ports)), the different ports from the controller perspective map to different virtual channels on the SFI. For instance, in such cases, implementations can support multiple port configurations within the same physical block (e.g., implemented as an agent or controller). In these cases, the same physical channel of SFI can be used to transfer packets for different ports, with each port mapped to its own set of virtual channels (e.g., 1 or more virtual channels per port), among other example implementations.
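The implicit same-cycle ordering rule can be sketched in a small Python model. The 16-byte slot width, signal names, and list-based valid representation here are illustrative assumptions for the sketch, not part of any SFI definition:

```python
# Hypothetical model of same-cycle ordering on an SFI-style HDR channel.
HDR_SLOT_BYTES = 16  # assumed per-header slot width on the header signal

def ordered_headers(hdr_bytes: bytes, hdr_valid: list) -> list:
    """Split one cycle's raw header signal into per-slot headers and
    return the valid ones ordered from the least significant slot to the
    most significant, reflecting the implicit same-cycle ordering rule
    (e.g., a TLP starting at byte 0 is ordered ahead of one at byte 16)."""
    headers = []
    for slot, valid in enumerate(hdr_valid):
        if valid:
            start = slot * HDR_SLOT_BYTES
            headers.append(hdr_bytes[start:start + HDR_SLOT_BYTES])
    return headers
```

Under this sketch, `ordered_headers(signal, [True, True, False, False])` on a 64-byte signal yields TLP 0 (bytes 0-15) ahead of TLP 1 (bytes 16-31).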
A set of parameters may be defined for an instance of an SFI interface to configure aspects of the instance. For instance, metadata signals of the HDR and DATA channels may be based on one or more of the configurable parameters. For instance, parameters may identify how the metadata signals carry metadata to convey information about the position of different packets within a single transfer, among other example information. In SFI, for a packet header that has data associated with it, the packet header is sent on the HDR channel and the associated data is sent separately on the DATA channel. There may be no timing relationship guarantee between the DATA and HDR channel transfers. It is assumed that the receiver tracks the associated data length for each received header and only processes the relevant data size. The data size may be sent with the packet header information (e.g., a PCIe implementation using a PCIe packet header format identifies the amount of data in the length field of the PCIe TLP header to indicate how many 4-byte chunks of data are associated with that header). Information in the metadata sent over the metadata signals may also be used by the receiver to determine which headers map to which data (e.g., through flow control and virtual channel ID combinations), parity information, and information about the header format (e.g., the header size), among other example information.
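The receiver-side length tracking described above can be illustrated with a simplified reading of a PCIe TLP header's Length field (a dword count in bits [9:0] of the first header dword). This is a sketch only; it ignores the zero-length special case (Length of 0 encodes 1024 dwords) and whether the TLP actually carries data:

```python
def tlp_payload_bytes(hdr: bytes) -> int:
    """Return the payload size in bytes implied by a PCIe TLP header's
    Length field: a 10-bit count of 4-byte dwords spanning the low two
    bits of header byte 2 and all of header byte 3 (simplified; the
    Length == 0 => 1024-dword case is not handled here)."""
    length_dw = ((hdr[2] & 0x3) << 8) | hdr[3]
    return length_dw * 4  # receiver expects this many bytes on DATA
```

A receiver model could record this value per received header and count down as payload chunks arrive on the DATA channel.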
A global layer or channel of signals (e.g., 550) may carry signals that apply across all physical channels of the interface 205, such as control signals, vendor-defined signals, and other signals enabling other example functionality. For instance, the global channel 550 may carry the signals that are also used for initialization and shutdown of the interface (such as in the examples discussed below). Table 1 describes an example implementation of signals of a global channel of an example SFI interface.
The HDR channel carries the header of request messages from the transmitter to the receiver. A variety of information may be encapsulated in the (protocol-specific) fields of a header transmitted using the HDR channel, including address and other protocol-level command information. Table 2 describes an example implementation of signals of an HDR channel of an example SFI interface.
The header size may be a predetermined parameter based on the peak sustained bandwidth expected or required of the system. An SFI interface (and corresponding logic) may enforce rules for the HDR channel such as having a packet header begin and end on the same cycle of transfer. Multiple packet headers may nonetheless be sent on the same cycle by sending one of the packet headers on a first subset of the header signal lanes and the other packet header on another subset of the header signal lanes. The interface may define, however, that the first packet on a valid header transfer starts on the lanes of the header signal corresponding to byte 0 of the header field (logically represented by the header signal lanes).
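The HDR channel packing rules above (a header begins and ends in the same cycle; the first header of a valid transfer starts at byte 0) can be modeled as a simple transmitter-side packer. The fixed 16-byte slot size and the per-slot valid encoding are illustrative assumptions:

```python
def pack_headers(headers, signal_bytes=64, slot_bytes=16):
    """Pack one cycle's worth of fixed-size headers onto the header
    signal. Rules sketched from the text: the first header starts at
    byte 0 of the header field, each subsequent header occupies the
    next slot, and no header straddles a cycle boundary.
    Returns (signal, hdr_valid) with one valid bit per occupied slot."""
    if len(headers) * slot_bytes > signal_bytes:
        raise ValueError("headers would straddle a cycle boundary")
    signal = bytearray(signal_bytes)
    hdr_valid = 0
    for slot, hdr in enumerate(headers):
        if len(hdr) != slot_bytes:
            raise ValueError("each header must fill exactly one slot")
        signal[slot * slot_bytes:(slot + 1) * slot_bytes] = hdr
        hdr_valid |= 1 << slot
    return bytes(signal), hdr_valid
```

Packing two headers in one cycle yields a valid mask of 0b11, with the second header occupying bytes 16-31 of the header signal.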
The header valid signals (hdr_valid) may be asserted to indicate corresponding valid values on the lanes of the header signal. In some implementations, the number of lanes of the header signal may be logically divided into byte-wise subsets (e.g., 16 bytes or 32 bytes of lane width in each subset) corresponding to the size of one of the protocol headers to be carried on the header signal. Further, each header valid lane may be mapped to one of the subsets to indicate that valid header data is being sent on a corresponding one of the subsets of lanes of the header signal. Additionally, the header metadata signal (hd_info_bytes) may carry metadata (e.g., aligned with one of the headers carried on the header signal) to describe key attributes that can be used by the receiver to decode the corresponding header.
A DATA physical channel of an SFI interface may be used to carry payload data for all requests that have data associated with it. In SFI, there may be no explicit timing relationship or requirement between the HDR channel and associated data carried on the DATA channel. However, transmitters may be equipped with logic to check both HDR channel and DATA channel credits before scheduling either header data on the HDR channel or payload data on the DATA channel. Table 3 describes an example implementation of signals of a DATA channel of an example SFI interface.
In implementations of an SFI interface, payload data may be sent on the data signal of the DATA channel according to a multi-byte granularity (e.g., 4-byte granularity). Accordingly, the data for any payload may be identified as ending at a particular "chunk" of data (e.g., a particular 4-byte chunk). As an example, if the width of the data signal D is 64 bytes, the number of potential data end positions is DE = 64/4 = 16, with data_end[0] corresponding to data bytes[3:0], data_end[1] corresponding to data bytes[7:4], and so on, up to data_end[DE-1] for data bytes[D-1:D-4]. The start of data signal (data_start) may utilize the same or a different granularity than the end of data signal. An instance of an SFI interface may be parameterized to support (and limit the number of payload starts according to) a maximum number of starts DS in a clock cycle. As an example, if the width of the data signal bus D is 64 bytes and the instance of the SFI interface is configured to limit the number of starts in a cycle to 2, DS = 2, effectively dividing the data bus into two 32-byte chunks in which a new payload may begin being sent. For instance, in an example where D = 64 and DS = 2, data_start[0] would correspond to a chunk of data starting at data byte[0] and data_start[1] to a chunk of data starting at data byte[32], among other examples (including examples with lower or higher granularity in the start of data and end of data chunks (e.g., DS > 2), smaller or larger data bus sizes, etc.).
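The data_start/data_end position arithmetic above can be written out directly. The parameter values mirror the example in the text (D = 64, 4-byte end granularity, DS = 2); the mask-decoding helpers are an illustrative model, not a defined SFI API:

```python
D = 64            # assumed data signal width in bytes
CHUNK = 4         # assumed payload end granularity in bytes
DE = D // CHUNK   # potential end positions: 64/4 = 16
DS = 2            # configured maximum payload starts per cycle

def end_byte_ranges(data_end: int) -> list:
    """Map asserted data_end bits to the 4-byte chunk each marks:
    data_end[i] covers data bytes [4*i + 3 : 4*i]."""
    return [(4 * i, 4 * i + 3) for i in range(DE) if (data_end >> i) & 1]

def start_byte_offsets(data_start: int) -> list:
    """With DS starts per cycle on a D-byte bus, a new payload may start
    only at byte j * (D // DS): for D = 64 and DS = 2, byte 0 or byte 32."""
    return [j * (D // DS) for j in range(DS) if (data_start >> j) & 1]
```

For instance, an asserted data_end[15] marks a payload ending at data bytes[63:60], and an asserted data_start[1] marks a payload beginning at data byte[32].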
In one example implementation of a DATA channel of an SFI interface, the width of the data start signal may be equal to DS and the signal may effectively act as a mask to identify each corresponding chunk of data on the data signal (e.g., aligned in the same clock cycle) that corresponds to the start of a respective payload. Further, each data start bit may have an associated data_info_byte signal sent with it that indicates metadata for the corresponding payload. In some implementations, the data_info_byte is sent only once for a given payload (e.g., with the corresponding data start chunk and data_start bit), while in other instances the metadata may be sent (e.g., repeated) to correspond with every chunk of data in the same payload, among other example implementations. In one implementation, the data_info_byte signal may indicate the respective FC ID and the VC ID of the corresponding packet (e.g., with 4 bits (e.g., data_info_byte[3:0]) carrying the FC ID and another 4 bits (e.g., data_info_byte[7:4]) carrying the VC ID), among other example information for use by the receiver in processing the data payloads sent over the data signal bus.
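The nibble layout of the example data_info_byte encoding (FC ID in bits [3:0], VC ID in bits [7:4]) decodes in one line; this helper is a sketch of that one example encoding only:

```python
def decode_data_info(info_byte: int):
    """Split a data_info_byte into (fc_id, vc_id) under the example
    encoding in the text: bits [3:0] carry the FC ID and bits [7:4]
    carry the VC ID."""
    return info_byte & 0xF, (info_byte >> 4) & 0xF
```

For example, an info byte of 0x21 decodes to FC ID 1 (e.g., Non-Posted under PCIe-style semantics) on VC ID 2.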
Unlike the HDR channel, in some implementations of a DATA channel, data chunks from the same packet can be transferred over multiple cycles. For example, the raw data bus width could be implemented as 64 B per cycle, allowing a 128 B data packet to be transferred over 2 clock cycles. In some implementations, once a payload has begun transmission, the transmitter may guarantee that all the relevant data chunks in the payload are transferred consecutively from LSB to MSB and across successive clocks (e.g., without any gaps or bubbles). In some implementations, only one packet of a particular FC ID/VC ID combination may be sent on the interface at a time (with the FC ID/VC ID combination only reused after the preceding packet using the combination finishes sending). In some implementations, packets with different FC ID/VC ID combinations may be interleaved on an SFI interface (e.g., with a packet of one FC ID/VC ID combination being interrupted to send at least a portion of a packet with another FC ID/VC ID combination), among other examples.
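The multi-cycle transfer behavior described above can be modeled with a small reassembly sketch for the non-interleaved variant, where chunks of one FC ID/VC ID combination arrive LSB to MSB across consecutive cycles and the combination is not reused until the packet completes. The class and method names are illustrative:

```python
class PayloadAssembler:
    """Reassemble payloads that span multiple cycles on the DATA channel,
    keyed by (fc_id, vc_id). Models the variant where chunks for one
    FC/VC combination arrive consecutively, with no gaps, until done."""
    def __init__(self):
        self.partial = {}  # (fc_id, vc_id) -> bytearray in progress

    def push(self, fc_vc, chunk, last):
        """Append one cycle's chunk; return the full payload when the
        chunk carrying the asserted data_end arrives, else None."""
        buf = self.partial.setdefault(fc_vc, bytearray())
        buf.extend(chunk)
        if last:
            return bytes(self.partial.pop(fc_vc))
        return None
```

For example, a 128B packet on a 64B-per-cycle bus arrives as two pushes, with the payload returned on the second.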
The granularity of credits on the data channel may also be configurable (e.g., at design compile time) and may correspond to a multiple of N bytes. For instance, in one example, the granularity may be required to be a multiple of 4 bytes. If the credit granularity is chosen to be 16 bytes, then even a 4-byte data packet transferred uses one 16-byte credit, among other example implementations.
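The credit cost implied by the configurable granularity is a ceiling division, which makes the rounding behavior in the 16-byte example explicit (the function name is illustrative):

```python
def credits_needed(payload_bytes: int, credit_granularity: int = 16) -> int:
    """Number of data channel credits a payload consumes when one credit
    covers credit_granularity bytes. Rounds up, so even a 4-byte packet
    consumes one whole 16-byte credit."""
    return -(-payload_bytes // credit_granularity)  # ceiling division
```

With a 16-byte granularity, a 4-byte packet costs 1 credit and a 17-byte packet costs 2; with a 4-byte granularity, a 64-byte packet costs 16.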
As discussed above, an SFI interface (and the corresponding logic and buffers/trackers utilized by the transmitter and/or receiver to implement its half of the interface) may enable pipelining of header processing while data is streaming. Indeed, the latency savings realized through this pipelining, in terms of header processing, directly translate to saved buffers in the receiver. In the context of Load/Store protocols, it is assumed that a receiver will separate the header and data internally anyway, as the headers are heavily consumed by the control path, whereas data for the most part is isolated to the data path. By splitting the header and data channels on an example SFI interface, headers of later requests may even bypass data of earlier requests, and this can allow the receiver to start processing headers while data transfer is being completed. In the context of a host CPU processing inbound (device to host) writes, this may translate to a head start in obtaining ownership of the relevant cache lines, among other example use cases and advantages. Indeed, since fetching ownership is one of the most significant drivers of latency when processing writes, overlapping this while data streams in can help reduce overall latency and buffers in the CPU. Deadlock is avoided by ensuring that the transmitter checks for both header and data credits before sending either header or data.
In some implementations, each VC and FC defined for an SFI interface is to use a credit for sending any message and collect credit returns from the receiver. The source may consume the full credits required for a message to complete. Transmitters check for both HDR channel and DATA channel credits before sending corresponding messages on the respective channels to the receiver. The granularity of HDR and DATA channel credits is predetermined between the TX and RX. For instance, the granularity of credits on the data channel may be configured (e.g., at design compile time) to only be a multiple of N bytes. For instance, in one example, the granularity may be required to be a multiple of 4 bytes. If the credit granularity is chosen to be 16 bytes, then even a 4-byte data packet transferred uses one 16-byte credit, among other example implementations. In one example, FC IDs may be based on PCIe semantics (e.g., 4′h0=Posted, 4′h1=Non-Posted, 4′h2=Completions), among other example implementations. Further, each of the physical channels (e.g., DATA and HDR) may be outfitted with dedicated credit return wires (which, unlike the remaining signals, flow from the receiver to the transmitter). For instance, during operation, the receiver returns credits whenever it has processed the message (or guaranteed a buffer position for the next transaction).
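The deadlock-avoiding transmitter gate, checking HDR and DATA credits together before scheduling either channel, can be sketched as follows (the dictionary-keyed credit counters and function signature are modeling assumptions):

```python
def can_send(hdr_credits: dict, data_credits: dict,
             fc_vc, hdr_cost: int, data_cost: int) -> bool:
    """Transmitter-side scheduling gate: a packet is sent only if credits
    for BOTH its HDR and DATA channel transfers are available for its
    FC/VC, so a header is never committed while its data would stall
    (avoiding deadlock across the split channels)."""
    return (hdr_credits.get(fc_vc, 0) >= hdr_cost and
            data_credits.get(fc_vc, 0) >= data_cost)
```

Only when this gate passes would the transmitter decrement both counters and place the header and payload on their respective channels.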
In some implementations, SFI allows two schemes for supporting sharing of buffers between different FC and VC IDs. In both schemes, the receiver is to advertise the minimum number of dedicated resources needed for a forward progress guarantee. For large packet transfers, this means that the maximum payload size is based on the dedicated credit advertisement. If shared credits are used, the transmitter and receiver are to predetermine which of the credit types, or schemes, is to be used. This determination may be made at design time, in some implementations. In alternative implementations, the credit scheme may be dynamically determined (e.g., based on parameters written to corresponding configuration registers), among other examples.
A first one of the two schemes for credit sharing may be transmitter-managed. In this scheme, the transmitter is responsible for managing shared buffers in the receiver. One or more shared credit pools are advertised or consumed with spare VC ID/FC ID encodings. When the transmitter consumes the shared credit pool credit, it sends the packet using the corresponding VC ID/FC ID encoding. When the receiver deallocates a transaction that used the shared credit, it does a credit return on the corresponding VC/FC ID combination. In some implementations, a bit may be provided in the header (along with a corresponding signal on the HDR channel) to indicate whether the credit is a shared credit or not. Accordingly, the receiver may have to further decode the header packet to explicitly determine the real VC ID or FC ID of the packet, among other examples.
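As a rough illustration of transmitter-managed sharing, the sketch below shows the order of checks a transmitter might apply before sending. The helper name `pick_credit` and the spare `SHARED_VC_FC` encoding are hypothetical; real spare VC ID/FC ID encodings are implementation-defined.

```python
# Hypothetical sketch of transmitter-managed credit selection: dedicated
# credits are tried first; otherwise a shared-pool credit is consumed and
# the packet is sent with the spare VC/FC encoding reserved for the pool.
SHARED_VC_FC = ("VC_SHARED", "FC_SHARED")  # spare encoding (assumption)

def pick_credit(vc, fc, dedicated, shared_pool):
    """Return the VC/FC encoding to send with, or None if no credit."""
    if dedicated.get((vc, fc), 0) > 0:
        dedicated[(vc, fc)] -= 1
        return (vc, fc)                    # real encoding, dedicated credit
    if shared_pool["count"] > 0:
        shared_pool["count"] -= 1
        return SHARED_VC_FC                # shared credit, spare encoding
    return None                            # stall until credits return

dedicated = {("VC0", "P"): 1}
pool = {"count": 1}
assert pick_credit("VC0", "P", dedicated, pool) == ("VC0", "P")
assert pick_credit("VC0", "P", dedicated, pool) == SHARED_VC_FC
assert pick_credit("VC0", "P", dedicated, pool) is None
```

Because the packet travels under the spare encoding, the receiver must decode the header to recover the real VC ID/FC ID, as the passage notes.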
In one example implementation of transmitter-managed credit sharing, the mapping of example shared credit pools advertised by the receiver (e.g., in a PCIe-based implementation) may support two VCs on the link and adopt the following example mapping shown in Table 4:
The other of the two credit-sharing schemes may be receiver-managed. In a receiver-managed scheme, the receiver is responsible for managing shared buffers. Only the dedicated credits are advertised to the transmitter. Typically, the advertised dedicated credits cover the point-to-point credit loop across the SFI, and the shared credits are used to cover the larger credit loops (e.g., the CPU fabric or Application Layer latencies). After a particular FC/VC ID transaction is received, if shared credits are available, a credit can be returned for that FC/VC ID combination (e.g., without waiting for the transaction to deallocate from the receiver queue). This implicitly gives a shared buffer spot to that FC/VC ID. Internally, the receiver tracks the credits returned to the transmitter on a FC/VC basis and further tracks the credits currently consumed by the transmitter. With this tracking, the receiver can enforce the maximum number of buffers used per FC/VC. The receiver may guarantee the required dedicated resources for a forward progress guarantee, among other example implementations.
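The early-credit-return behavior of the receiver-managed scheme can be modeled as below. The class and method names are illustrative assumptions; the point is that a credit is returned as soon as a shared slot can be lent, rather than when the queue entry deallocates.

```python
class ReceiverManagedCredits:
    """Hypothetical model of receiver-managed sharing: when a transaction
    for an FC/VC arrives and shared buffer space remains, a credit for that
    FC/VC is returned immediately (before the entry deallocates),
    implicitly lending the FC/VC a shared buffer slot."""
    def __init__(self, shared_slots):
        self.shared_free = shared_slots
        self.borrowed = {}            # per-FC/VC shared slots in use

    def on_receive(self, fcvc):
        """Return True if a credit can be returned right away."""
        if self.shared_free > 0:
            self.shared_free -= 1
            self.borrowed[fcvc] = self.borrowed.get(fcvc, 0) + 1
            return True               # early credit return on this FC/VC
        return False                  # wait for queue deallocation instead

    def on_deallocate(self, fcvc):
        if self.borrowed.get(fcvc, 0) > 0:
            self.borrowed[fcvc] -= 1
            self.shared_free += 1     # shared slot freed for reuse

rx = ReceiverManagedCredits(shared_slots=1)
assert rx.on_receive(("VC0", "P")) is True    # shared slot lent out
assert rx.on_receive(("VC0", "P")) is False   # pool exhausted
rx.on_deallocate(("VC0", "P"))
assert rx.on_receive(("VC0", "P")) is True
```

The `borrowed` map corresponds to the passage's per-FC/VC tracking that lets the receiver cap the buffers any one FC/VC can consume.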
Error handling for illegal flow control cases may result in undefined behavior. Accordingly, SFI interface logic on the agents and fabric may check for illegal cases to trigger assertions in RTL and also log/signal fatal errors to allow for post-silicon debug. For instance, SFI may maintain consistency between the HDR and DATA streams, meaning that the transmitter is to send the data payloads in the same order it is sending the corresponding headers and vice versa. In some implementations, receiver logic may include functionality to detect and flag fatal errors for violations, among other example error handling features. In some implementations, SFI provisions for data poisoning to be sent at the end of a data transfer. In the case of such errors, the ownership request could be discarded/written back without modification, or the host can choose to poison the relevant cache lines and write the updated data, among other examples.
Turning to
In implementations of an SFI interface, a number of maximum packet headers that can be transmitted in 1 cycle on the interface may be predetermined (e.g., and recorded in a configurable parameter of the interface). The maximum packet headers per cycle may be determined by the width (or number of lanes) (H) of the header signal and the maximum packet header size. An SFI interface may be implemented (and designed) such that the header width (H) allows the common case usage to sustain maximum throughput. As an example, assuming the common case application header size is 16 bytes (e.g., mapping to 4 D-Word headers in PCIe), and that the interface is to sustain 2 headers per cycle, H=2*(16)=32 bytes. A corresponding valid signal (and lane) may be included in the HDR channel to correspond to the number of desired headers per cycle. As an example, if it is desired for the interface to sustain up to 2 headers per cycle, a corresponding M=2 number of valid lanes may be defined to support one valid signal for each of the potential 2 headers in a cycle (e.g., with hdr_valid[0] corresponding to a header starting in byte 0 of the header signal, and hdr_valid[1] corresponding to a header starting in byte 16 of the header signal). In some instances, one or more of the header formats of a supported protocol may be too large to be sent in only one of the subsets of lanes defined in the header signal (and assigned to a respective one of the valid signal lanes), meaning that such headers may utilize two or more of the subsets of lanes in the header signal for transmission (and only a first (least significant bit) one of the two or more associated valid signals may be asserted). In such instances, when the maximum headers per cycle is set to 2, if a larger header format is to be sent on the header signal, only 1 header can be transferred in that cycle and hdr_valid[1] is not asserted, among other examples.
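The packing rule for large headers can be modeled as follows. The function name `hdr_valids`, the 16-byte slot size, and the 2-slot bus match the worked example above but are otherwise assumptions, not fixed by the interface.

```python
def hdr_valids(header_sizes, slot_bytes=16, slots=2):
    """Hypothetical packing of one cycle's headers into hdr_valid bits.
    Each slot is slot_bytes wide; a header larger than one slot occupies
    consecutive slots but asserts only the valid bit of its first (least
    significant) slot."""
    valids = [0] * slots
    slot = 0
    for size in header_sizes:
        needed = -(-size // slot_bytes)       # slots the header spans
        if slot + needed > slots:
            break                             # does not fit this cycle
        valids[slot] = 1                      # only the first slot's valid
        slot += needed
    return valids

# Two 16-byte headers fill both slots: both valids assert.
assert hdr_valids([16, 16]) == [1, 1]
# A 20-byte header spans both slots: only hdr_valid[0] asserts.
assert hdr_valids([20]) == [1, 0]
```

This reproduces the rule stated above: an oversized header costs a whole cycle's worth of header lanes, so only one header moves in that cycle.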
Continuing with the example of
Transmitters may utilize credits associated with FCs, VCs, or FC-VC combinations to determine whether a packet may be sent over the channel. For instance, if a packet header has data associated with it, the packet header is sent on the HDR channel and the associated data is sent on the DATA channel. Prior to sending the header or payload data, the transmitter may check (e.g., a tracking record in local memory) for available credits for both headers and payload data (and the corresponding HDR and DATA channels) before scheduling the header or payload data transfer. In some implementations, the credit granularity for the Header channel may be set to the maximum supported header size. For example, if the maximum header size supported is 20 bytes, then 1 credit on the Header channel may correspond to 20 bytes worth of storage at the receiver. In some instances, even if only a 16-byte header is to be sent, 1 full credit is consumed corresponding to the full 20 bytes, among other examples and similar alternative flow control and crediting implementations.
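The check-before-schedule rule can be sketched as a single gate. `try_send`, the dict-based counters, and the granularities are hypothetical scaffolding; what matters is that both channels' credits are verified before either is consumed, which is the deadlock-avoidance rule stated earlier.

```python
def try_send(pkt, hdr_credits, data_credits, data_gran=16, hdr_credit=1):
    """Hypothetical transmit gate: a packet is scheduled only when both the
    HDR credit (one full max-header-size credit, even for a smaller header)
    and enough DATA credits for its payload are available."""
    payload = pkt["payload_bytes"]
    data_needed = -(-payload // data_gran) if payload else 0
    if hdr_credits["count"] >= hdr_credit and data_credits["count"] >= data_needed:
        hdr_credits["count"] -= hdr_credit
        data_credits["count"] -= data_needed
        return True
    return False

hdr, data = {"count": 1}, {"count": 1}
# A 32-byte payload needs 2 data credits but only 1 remains: nothing is
# sent, so neither channel's credit is consumed.
assert try_send({"payload_bytes": 32}, hdr, data) is False
assert hdr["count"] == 1 and data["count"] == 1
assert try_send({"payload_bytes": 16}, hdr, data) is True
```

Note the atomicity: if either channel lacks credits, the transfer stalls without partially consuming the other channel's credits.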
Turning to
Continuing with the example of
In the particular, simplified example of
In cycle 4, the header of another TLP, TLP4, is to be transmitted. In this example, the size of the header of TLP4 requires transport over both of the header bus subsections 820, 835 in order to communicate the header over the HDR channel in a single clock cycle. For instance, the headers (e.g., 840, 845, 850, 855) of TLPs 0-3 may have been of size HDR_SIZE=4, while the size of the TLP4 header is HDR_SIZE=5. Accordingly, in this example, the bytes of the TLP4 header (860a-b) are transmitted on the lanes of both header bus subsections 820 and 835. In this example, only the valid signal 810 corresponding to the subsection (or bytes) of the header bus carrying the beginning of the header (or the least significant bytes) is asserted high (at 890), while the other valid signal 825 remains deasserted in clock cycle 4. Similarly, only one of the header metadata signals (e.g., 815) may be used to carry the metadata information for the TLP4 header, with the metadata signal (e.g., 830) corresponding to the most significant bytes of the header carrying a null or other signal. In one example, the headers of TLPs 0-4 may be according to a PCIe-based protocol. In such instances, the TLP Hdr bytes follow the format described in the PCI Express Base Specification. In this example, hdr_start[0] is associated with header byte[0] and hdr_start[1] is always associated with header byte[16], among other example implementations.
In some implementations, an SFI interface may be implemented as a synchronous interface, where both sides of the interface run on the same clock. This notwithstanding, transmitters and receivers may not be required to coordinate resets at each respective device. Instead, in some implementations, an initialization flow defined for the interface may define a separate handshake to ensure transmitter and receiver exchange information about interface reset and flow control before traffic begins on the interface.
Turning to
In some implementations of an SFI DATA channel, a start of data (or data_start) signal may be provided, which is implemented on a set of lanes to implement a corresponding number of bits of the data_start signal. For instance, the data_start signal may be implemented as a bit vector with a corresponding data_start lane (e.g., 925, 926, 928, etc.) being mapped to a respective byte or span of bytes in the data bus. For instance, each data_start lane (e.g., 925, 926, 928, etc.) may map to a corresponding one of the X+1 subsections of the data bus. For instance, in an example where there are 8 subsections of the data bus, the start of data signal may be composed of 8 bits or lanes, with each bit mapped to one of the subsections. When a first byte (e.g., as measured from the least significant byte to the most significant byte) of a payload is communicated in a particular clock cycle, the corresponding start of data signal (e.g., 925) may be asserted (e.g., at 954) to identify the subsection (or chunk) of the data bus in which that first payload byte can be found. Through this, a receiver may identify a boundary between two payloads communicated on the channel.
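The chunk-to-bit mapping can be illustrated with a small sketch. The 64-byte bus and 8-chunk split match the 8-subsection example above; the function name is an assumption.

```python
def data_start_vector(start_byte, bus_bytes=64, chunks=8):
    """Hypothetical encoding of the data_start bit vector: one bit per
    bus chunk, set when a payload's first byte lands in that chunk."""
    chunk_bytes = bus_bytes // chunks
    bits = [0] * chunks
    bits[start_byte // chunk_bytes] = 1   # mark the chunk holding byte 0
    return bits

# On a 64-byte bus split into 8 chunks, a payload starting at byte 16
# asserts data_start bit 2 (the chunk covering bytes 16..23).
assert data_start_vector(16) == [0, 0, 1, 0, 0, 0, 0, 0]
```

A receiver scanning this vector each cycle can locate every payload boundary without parsing the data bytes themselves.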
As in the example of an HDR channel, an SFI DATA channel may also carry metadata on dedicated metadata (data_info) signal lanes (e.g., 930, 935) to describe corresponding payload data sent on the data bus. In some implementations, metadata for a payload may be communicated on the DATA channel in association with the start of that payload (e.g., aligned with the first byte of the payload and the corresponding data_start signal). Indeed, multiple metadata signals may be defined and carried on the DATA channel, one corresponding to each of a corresponding number of subsections of the data bus (e.g., 915, 920). The subsections or chunks, in some implementations, may correspond to the same logical chunks utilized in the data_start signal (and/or the data_end signal 940). For instance, when a particular chunk carries the first bytes of a new payload, a corresponding one of the metadata signals (e.g., 930, 935) is responsible for carrying the corresponding metadata for that payload. As an example, as shown in
Continuing with the example of
Continuing with the example of
In some implementations, a state machine or other logic may be provided on agent and fabric devices to participate in defined connect and disconnect flows for an SFI interface. For instance, such flows may be invoked during boot/reset and when going into a low power mode, among other example states or events. In some implementations, SFI defines an initialization phase where information about credit availability in the receiver (RX) is communicated to the transmitter (TX) after a connection is established. In some instances, reset can independently de-assert between the agent and fabric sides of SFI. For independent reset, the initialization signals may be driven (e.g., on the Global channel) to the disconnected condition when in reset and no traffic may be sent until initialization reaches the connected state. The disconnect flow may be additionally supported by agents, for instance, to reconfigure credits and achieve power saving. Without this flow, all SFI credits may be configured to a final value before the first connection can proceed.
In initializations, the transmitter and receiver sides (e.g., the agent and fabric sides) of an SFI interface may be brought out of reset close to or at the same time. One end of the interface (e.g., after coming out of reset) may not have implicit requirements for when the other end should come out of reset. In some implementations, SFI may define an explicit handshake during initialization between the agent and fabric to ensure that both endpoints (and all pipeline stages between them) are out of reset before any credits or transactions are sent on the SFI interface. Accordingly, after reset, the receiver may begin sending credits for use by the transmitter.
Signaling rules may be defined for a Global initialization signal set. In one example, the txcon_req signal may be defined such that a transition from 0 to 1 reflects a connection request and a transition from 1 to 0 reflects a disconnection request. Credit return signals may be provided, for instance, with a credit valid (crd_valid) signal and a credit shared (crd_shared) signal. In one example, crd_valid=1 may be defined to mean it is releasing the dedicated message credits for a protocol ID and a virtual channel ID, while crd_shared=1 means it is releasing a shared credit (which can happen in parallel with a dedicated message credit return). In some implementations, a credit return behaves in the same way during the first initialization of credits as it does during runtime return of credits. The rx_empty signal indicates all channel credits returned from the receiver and all receiver queues are empty (although this may not account for messages that are in flight or in intermediate buffers such as clock crossing queues, among other example issues). In some implementations, a transmitter may check rx_empty before initiating a disconnect. By checking, it increases the probability that the disconnect is quickly accepted (e.g., in the absence of possible in-flight requests that have not yet registered at the receiver). In some implementations, to further increase the probability of disconnect acceptance, the transmitter may implement a timer delay after the last valid message sent such that the receiver pipeline would have time to drain into the receiver queues, among other example features. In some implementations, during initialization, the transmitter sends messages as soon as any credits are available and is not dependent on a rx_empty assertion. Alternatively, a transmitter may stall the sending of any packets after initialization until rx_empty is asserted; in this case, the transmitter can use the credits received as an indication of the total credits a receiver has advertised.
In an example implementation of an SFI interface, a transmitter can send packets when it receives sufficient credits from the receiver. The transmitter may identify the packet is to be transmitted and determine that there are respectively sufficient HDR and Data credits for the packet before the transmission begins.
As further examples of signaling rules, which may be defined in an SFI implementation, connection ACKs may be defined to always follow connection requests. As noted above, a connection request may be signaled by txcon_req transitioning from 0→1. This transition serves as an indication that the transmitter Tx is ready to receive credits and is in normal operation. An ACK may be signaled by rxcon_ack transitioning from 0→1. An ACK may be stalled for an arbitrary time until a receiver is ready to complete. Similarly, disconnect ACKs or NACKs may be defined to follow disconnect requests. A disconnect request may be signaled by a txcon_req transition from 1→0. A disconnect ACK may be signaled by an rxcon_ack transition from 1→0. A disconnect NACK may be signaled by an rxdiscon_nack transitioning from 0→1. A rule may be defined to require a receiver to either respond with an ACK or NACK to each disconnect request it receives, among other example policies and implementations.
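The request/ACK/NACK rules above can be condensed into a small state machine. This is a minimal sketch, not the normative flow: the state names and the `next_state` helper are assumptions, and real implementations track additional conditions (credits, resets, timeouts).

```python
# Sketch of the SFI connect/disconnect state machine driven by the
# txcon_req / rxcon_ack / rxdiscon_nack rules described above.
def next_state(state, txcon_req, rxcon_ack, rxdiscon_nack=0):
    if state == "Disconnected" and txcon_req:
        state = "Connecting"
    if state == "Connecting" and rxcon_ack:
        state = "Connected"            # receiver ACKed; credits may flow
    if state == "Connected" and not txcon_req:
        state = "Disconnecting"
    if state == "Disconnecting":
        if rxdiscon_nack:
            state = "Connected"        # receiver denied the disconnect
        elif not rxcon_ack:
            state = "Disconnected"     # receiver ACKed by dropping its ack
    return state

s = "Disconnected"
s = next_state(s, txcon_req=1, rxcon_ack=0)      # connection request raised
s = next_state(s, txcon_req=1, rxcon_ack=1)      # ACK observed
assert s == "Connected"
s = next_state(s, txcon_req=0, rxcon_ack=1)      # disconnect requested
s = next_state(s, txcon_req=0, rxcon_ack=0)      # ACK: rxcon_ack dropped
assert s == "Disconnected"
```

The NACK branch captures the rule that a denied disconnect returns the link to the Connected state rather than leaving it in limbo.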
Turning to
To enter a connected state, once the transmitter is out of reset, it may assert the txcon_req signal 1120 to identify the request to the receiver. Similarly, when the receiver is out of reset, it waits for a connection request on the txcon_req signal 1120. The assertion of the connection request can be an arbitrary number of cycles after the reset (e.g., 1130) asserts. Until the connection is complete, the txcon_req signal 1120 is to remain asserted and is to only de-assert as part of the disconnect flow. Upon receiving a connection request on the txcon_req signal 1120, the receiver may assert the rxcon_ack signal 1115 to acknowledge the request. The rxcon_ack signal 1115 may be asserted after the resets of receiver and transmitter and the assertion of the txcon_req signal 1120. The rxcon_ack signal 1115 is to remain asserted and is to be first deasserted only in a disconnect flow.
This sequence may allow the initialization link state 1105 to progress from a Disconnected to a Connecting to the Connected state. Upon entering the Connected state (and sending the rxcon_ack signal) the receiver may immediately begin returning credits (e.g., on credit return wires 1125). Indeed, the receiver may start to return credits simultaneously with the assertion of the rxcon_ack signal 1115. Accordingly, the transmitter (e.g., the agent) is prepared to accept credit returns upon asserting the txcon_req signal 1120 (e.g., at clock cycle x4), for instance, because credit returns might be observed before observation of A2F_rxcon_ack due to intermediate buffering or clock crossings. After the minimum credits are received to send packets, the transmitter can start sending packets or messages over the channel. The reconnect flow may be implemented similarly to the connect from reset flow discussed herein; however, to start a new credit initialization, the receiver will first reset its credit counters to reset values and the transmitter is to reset its credits available counters to zero, among other example implementations.
Turning to
While the diagram 1200 of
In some implementations, the connect and disconnect flows are expected to complete within a few microseconds after initiation. In some implementations, a timeout may be defined, explicitly or implicitly. For instance, a receiver may be configured to reply with an ACK or NACK within a defined or recommended window of time. For instance, the agent, fabric, or system (e.g., SoC) can define a timeout or time window to enforce this expectation.
In some instances, an agent or fabric element may reset while the SFI interface is in a connected state, resulting in a surprise reset. For instance, the defined or recommended flow may be to enter Disconnect before Reset. As one example, an rxcon_ack signal may transition 1→0 because of a surprise reset on the receiver side of the link while the value of the transmitter's txcon_req signal is 1. In such a case, the transmitter may force itself to a disconnected state and restart initialization. If this happens when the transmitter is in an idle state, it can recover without loss of messages. As another example of a surprise reset, if the txcon_req signal transitions 1→0 because of a surprise reset on the transmitter side of the link while the rxcon_ack is 1, the standard disconnect flow may be followed. If this happens when the receiver is in an idle state, the disconnect should receive an ACK and cleanly reach a disconnected state provided the transmitter stays in reset. If the disconnect is denied (NACK) by the receiver, however, a fatal or illegal link state may result (e.g., an unrecoverable error). In cases of surprise resets, if traffic is active (e.g., not idle), a loss of protocol messages can result and may be fatal to continued normal operation.
As discussed above, an SFI interface in a system may be configurable according to a variety of parameters. For instance, a set of parameters may be specifically defined in accordance with the use case, features, protocols, and topology of a given system, such as a particular SoC design. Such parameters may define, for instance, the maximum number of headers that can be transmitted in a single cycle, the maximum header size, the maximum number of payloads of different packets that may be sent in a single cycle, among other example parameters. Parameter values may be defined and saved, for instance, in a configuration register or other data structure for use and reference by the agent and fabric components connected through the interface. Table 6 presents an example of parameters, which may be set in one example of an SFI interface.
As introduced above, SFI may utilize PCIe Flit Mode (FM) header formats and semantics, even when PCIe/CXL.io links train to non-Flit Mode (NFM). For instance, PCIe Flit Mode may define a transaction layer packet (TLP) grammar with: (1) zero or more one data word (1DW) local vendor-defined TLP prefixes followed by (2) a TLP header base with size indicated by Type[7:0] field, followed by zero to 7 DW of Orthogonal Header Content (OHC) as indicated by the OHC[4:0] field in the TLP header base. (3) TLP data payload of 0 to 1024DW may follow the TLP header base, followed by (4) a TLP Trailer (if present as indicated by TS[2:0] field of the header base), and then (5) zero or more 1DW end-to-end suffixes. When links train to NFM, I/O fabrics or interconnects that use SFI rely on the Transaction Layer to perform the FM/NFM conversions to ensure that only FM formats are carried over SFI. However, because not all NFM fields have a FM equivalent mapping, including NFM Reserved fields, this can compromise attempts to encrypt the corresponding data (e.g., according to CXL TLP Integrity and Data Encryption (IDE), etc.). In conventional implementations, NFM-to-NFM communication with IDE encryption is only available with non-streaming interfaces. In some implementations, logic (e.g., implemented in circuitry implementing the interface) may enable two NFM-trained links to communicate while preserving the benefits of the SFI fabric and not compromising packet integrity while preserving the benefits of cut-through routing and receiver decoding simplicity of Flit Mode formats, among other example benefits. For instance, flit format extensions may be defined to tunnel NFM-unique header information through FM header structures to enable NFM-to-FM-to-NFM end-to-end encryption. 
This may include all NFM Reserved fields that don't have a FM equivalent, all NFM non-Reserved fields that don't have a FM equivalent, and a new hint to notify the destination of the changes to the packet formats to be decoded accordingly, among other example features.
Turning to
Conventional devices do not support selective IDE streams for an NFM device (e.g., 1410) communicating with another NFM device (e.g., 1415) through an SFI-based fabric (e.g., 1405). Converter logic (e.g., 1420, 1425) may be provided at such devices to tunnel all NFM-unique fields through existing SFI capabilities to allow end-to-end encryption while preserving the streaming benefits of the SFI interface and not adding decoding complexity for the fabric. In one example, the NFM-to-FM converter circuitry (e.g., 1420) is to identify all NFM format Reserved fields (e.g., as defined in PCIe 6.0 or later), identify all (e.g., PCIe 6.0) NFM format fields without a FM equivalent, and define SFI format extensions to tunnel NFM fields through to the destination device (e.g., 1415).
In accordance with one example implementation, Table 7 lists PCIe 6.0 NFM Reserved fields for all formats and prefixes and how to map them to SFI formats:
Similarly, Table 8 lists PCIe 6.0 NFM fields for all formats and prefixes that do not have a FM equivalent and how to map them to SFI formats in one example:
Turning to
To tunnel the additional bits of an example NFM packet through to a destination using SFI, the NFM Prefix (e.g., defined as a new PCIe FM vendor defined local TLP prefix) may be utilized. For instance, when tunneling is supported, the SFI NFM Prefix (e.g., 1505) is to be inserted after any other local TLP prefixes and before the rest of the protocol header (e.g., PCIe flit mode base header). In one example, an NFM field 1506 is included in the NFM prefix 1505. When the NFM field is encoded with a value of “1”, the Address[1:0] position in the base header is to have the same definition as NFM formats (e.g., for all format types). Further, when NFM=1, the Address Type (AT) bits for the flit are provided from the AT field 1507 of the SFI NFM Prefix 1505. However, if the NFM field 1506 is encoded with a “0”, the base header and OHC formats of the flit are to strictly follow the defined (e.g., PCIe) FM formats.
In some implementations, in addition to the definition of a new prefix, NFM/FM conversion in SFI may be further supported through the definition of new SFI-specific extensions of PCIe 6.0 FM formats, which may be used when the NFM field 1506 of the NFM prefix indicates an NFM format of the data (e.g., when NFM=1). For instance,
As introduced above, SFI may support a mix of shared and dedicated credits for communications between an SFI receiver and an SFI transmitter over an SFI interface. An SFI Receiver that implements shared buffers and operates in block size operations to utilize the streaming benefits of SFI may, in some implementations, be paired with and communicate with an SFI Transmitter that is incapable of sharing credits, and instead relies exclusively on dedicated credits. While a store and forward approach could be attempted to address such a situation, such a solution would be both costly from an area and latency perspective (e.g., storing and forwarding of at least minimum packet size-sized credits for every supported combination of flow control (FC) or virtual channel (VC)). For instance, in PCIe, six types of information (e.g., Posted Request headers (PH); Posted Request Data payload (PD); Non-Posted Request headers (NPH); Non-Posted Request Data payload (NPD); Completion headers (CplH); and Completion Data payload (CplD)) may be tracked by flow control for each virtual channel, resulting in six credit types to be tracked per VC.
In an improved implementation, a lightweight credit conversion gasket (e.g., implemented in circuitry of a device implementing the SFI interface) may be provided to manage the conversion of the Receiver's shared credit pool into dedicated credits for the Transmitter while being buffer-less and implementing anti-starvation controls as well as QoS and bandwidth shaping algorithms. For instance, an SFI Receiver may advertise shared credits to credit gasket logic. The credit gasket may accumulate credit returns from the Receiver and track the credits required by each enabled FC/VC at the Transmitter to achieve a desired level of link utilization. The credit gasket may identify and consider the Receiver occupancy and throttle dedicated credit returns at an FC/VC granularity for anti-starvation protection. By tracking and assigning Receiver credits in block size, the credit gasket may eliminate the need for storage and minimize the impact to header/data latency. Link utilization monitoring may also be used to adjust credit assignments and control dynamic bandwidth allocation to reduce receiver storage requirements. Such a solution may be provided as a plug-and-play extension for Receivers to provide flexibility in communicating with agents of varying crediting capabilities while minimizing area, latency costs, and implementation complexity.
The credit gasket 1715 may expose a programmable interface (e.g., to system software or firmware) to allow the credit gasket to be informed of which FC/VC combinations at the SFI Transmitter are active, and how many dedicated credits should be respectively allocated to each (e.g., credits for the HDR and DATA layers). The number of dedicated credits allocated may depend on the link utilization rate desired for each FC/VC and the delay properties of the link, among other example considerations. For instance, as a simplified example, if only one FC/VC is active for an SFI link configured to send two headers per cycle with a credit loop latency of 4 cycles and no additional constraints, the gasket would be configured to track 8 dedicated credits for that FC/VC, among other (potentially much more complex) examples.
To facilitate the conversion of shared credits to dedicated credits, the credit gasket 1715 may track, for each FC/VC, the credit deficit and pending credit returns, together with the size of the remaining shared credit pool, among other example information. The credit “deficit” is the mechanism the credit gasket uses to track how many credits it should provide for a given FC/VC and relates to the difference between the credits currently demanded by the FC/VC and those allocated. Once the deficit is satisfied, no other credits should be assigned (until another deficit is identified). When the deficit is increased for a given FC/VC, the credit gasket determines that more credits should be allocated to the FC/VC than previously or originally allocated. In one example, the credit gasket 1715 tracks this per-FC/VC deficit by initializing counters to the programmed values, and then decrementing on every credit return and incrementing on every valid SFI header or data transfer for that FC/VC combination.
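The counter discipline described in this paragraph can be sketched directly. The class name `DeficitTracker` is hypothetical; the increment/decrement rules follow the passage: initialize to the programmed value, decrement on each credit return issued, increment on each valid transfer observed.

```python
class DeficitTracker:
    """Hypothetical per-FC/VC deficit counter in the credit gasket."""
    def __init__(self, programmed):
        self.deficit = programmed     # credits still owed to this FC/VC

    def on_credit_return(self, n=1):
        self.deficit -= n             # gasket released credits to the TX

    def on_transfer(self, n=1):
        self.deficit += n             # TX spent credits; they are owed back

    def in_deficit(self):
        return self.deficit > 0       # arbitrate for more shared credits

t = DeficitTracker(programmed=8)
t.on_credit_return(8)                 # initial allocation satisfied
assert not t.in_deficit()
t.on_transfer(2)                      # two transfers consume credits
assert t.in_deficit() and t.deficit == 2
```

A positive deficit is the gasket's trigger to arbitrate for more shared-pool credits on behalf of that FC/VC; at zero, no further credits are assigned.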
For instance,
As noted above, the credit gasket 1715 may identify the number of credits initially allocated to each FC/VC combination and continually monitor whether the FC/VC falls into a credit deficit. For instance, if any FC/VC combinations are determined to be in deficit based on programming, the credit gasket may then act to arbitrate credit assignments by assigning additional block-sized chunks of credits. In the example of
Turning to
Turning to
Continuing with the example of
In some implementations, a credit gasket may include a configurable credit return arbitrator 2020 to assess the status of the credit gasket counters, to arbitrate the issuance of credits to the various FC/VC combinations supported by the transmitter, and to ensure the shared credit pool supported by the receiver is dynamically allocated in an efficient manner. For instance, the credit return arbitrator 2020 may determine from the pending return credit counter (at 2075) and the shared pool counter (at 2080) whether to issue a credit return back to the transmitter (at 2085), for instance, when an update to the credit deficit counter of the FC/VC occurs. The credit return arbitrator 2020 may determine to either grant the credit return (at 2050) or, instead, pull back credits from those allocated to the FC/VC (at 2035), among other example implementations. The credit return arbiter 2020 may be implemented in hardware circuitry and may provide an interface through which software or firmware is able to configure the algorithms utilized by the credit return arbiter 2020 in determining how to arbitrate credit returns by the credit gasket, among other example implementations.
In some implementations, the initialized deficit for an enabled FC/VC should at least fully cover one packet for deadlock avoidance (e.g., 1 header credit, maximum payload size (MPS)-sized data credits) and can be static or dynamic (e.g., as directed by software). In the static case, the initialized deficit should be set as high as needed to achieve the desired link utilization rate. If multiple FC/VC combinations are active, this could result in larger storage demands for the receiver. This can be compounded if the per-FC/VC activity is not uniform, and some FC/VC combinations experience periods of high activity and periods of low activity. To mitigate this, in some implementations, the credit gasket may dynamically adjust the tracked credit deficit over time depending on FC/VC activity to reduce Receiver storage demands and save area. This allows for dynamic credit and bandwidth allocation between the active FC/VCs. In this scenario, the deficit would be initialized to only cover the MPS-sized packet at reset to allow for initial traffic flow. As the link is used by an FC/VC (e.g., detected by counting number of valid packets from a given FC/VC in a configurable time window), the credit gasket can choose to maintain, increase, or decrease the deficit. If the deficit is increased due to increased activity from an FC/VC, it may result in the credit gasket arbitrating for more shared pool credit resources to assign to the FC/VC. If the deficit is decreased due to reduced activity from an FC/VC, it will result in the credit gasket reducing the number of credits assigned to the FC/VC. For instance, the timing diagram 2100 illustrated in
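One way to sketch the dynamic-deficit policy described above is a simple windowed rule. The thresholds, step size, and function name are illustrative assumptions, not values from the source; the one-packet floor reflects the stated deadlock-avoidance requirement.

```python
def adjust_deficit(deficit, packets_in_window, low=2, high=8, step=1,
                   floor=1):
    """Hypothetical dynamic-deficit policy: grow the tracked deficit when
    an FC/VC is busy within the observation window, shrink it (never below
    the one-MPS-packet floor needed for deadlock avoidance) when it idles."""
    if packets_in_window >= high:
        return deficit + step               # busy: arbitrate for more credits
    if packets_in_window <= low:
        return max(floor, deficit - step)   # quiet: give credits back
    return deficit                          # steady state: leave it alone

assert adjust_deficit(4, packets_in_window=10) == 5
assert adjust_deficit(1, packets_in_window=0) == 1   # floor holds
assert adjust_deficit(4, packets_in_window=5) == 4
```

Raising the deficit causes the gasket to arbitrate for more shared-pool credits for that FC/VC; lowering it shrinks the FC/VC's share over time, reducing Receiver storage pressure.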
In cases where the credit gasket determines that credits were initially overallocated to a given FC/VC, any credits that have already been released to an FC/VC undergoing a deficit reduction cannot be retrieved with existing SFI mechanics. If the implementation guarantees future packet transfers from the reduced FC/VC, it can choose to simply track that credit reduction and have it deducted from future credit returns. Alternatively, the credits may be retrieved on demand in different ways with extensions of SFI. For instance, a receiver-to-transmitter pull may be defined, incorporating new signals added to SFI to allow the receiver to request the credits back from the transmitter and the transmitter to acknowledge the pull (or reduction). In one example, as shown in the example of
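The receiver-to-transmitter pull extension described above might behave as in the following sketch, in which the transmitter surrenders as many of the requested credits as it currently holds unused; the class and method names are hypothetical, and the acknowledgment encoding is an assumption:

```python
# Illustrative model of a transmitter acknowledging a receiver's credit
# pull request; names and semantics are hypothetical assumptions.
class Transmitter:
    def __init__(self, unused_credits):
        self.unused_credits = unused_credits

    def handle_pull(self, requested):
        """Acknowledge a pull by surrendering up to 'requested' unused
        credits; the return value models the acknowledged reduction."""
        granted = min(requested, self.unused_credits)
        self.unused_credits -= granted
        return granted
```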
In one example implementation, the wires used to implement signals 2205, 2220, 2225, etc. may be shared between credit returns and credit pulls, allowing only one of those two events to occur on any given cycle (e.g., *crd_rtn_pull and *crd_rtn_valid are mutually exclusive events) in such instances. As a result, in such implementations, the receiver 1710 may only have one outstanding pull request at a time. In an alternative implementation, deallocation of credits from an FC/VC may instead be initiated by the transmitter, for instance, through a transmitter-to-receiver credit return signal. For instance, a transmitter-to-receiver credit return signal may be added to SFI to allow a transmitter to initiate credit returns back to the receiver if it individually detects or predicts decreased activity. In some implementations, the transmitter-to-receiver credit return signal may mirror existing RX→TX credit return signals, but would be in the reverse direction. With this option, the receiver plays a more passive role in credit retrieval and relies on the transmitter to self-monitor, among other example implementations.
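The mutual exclusivity of credit returns and credit pulls on shared wires, as described above, may be modeled as follows; the prioritization of returns over pulls is an illustrative policy choice, not a behavior mandated by SFI:

```python
# Illustrative single-cycle driver for shared return/pull wires; at most
# one of the two events is asserted per cycle. The policy is hypothetical.
def drive_cycle(want_return, want_pull):
    if want_return:
        return {"crd_rtn_valid": 1, "crd_rtn_pull": 0}
    if want_pull:
        return {"crd_rtn_valid": 0, "crd_rtn_pull": 1}
    return {"crd_rtn_valid": 0, "crd_rtn_pull": 0}
```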
Traditional streaming interfaces, such as defined in existing versions of the SFI Specification, define 1-to-1 physical interfaces to couple a single transmitter with a single receiver. In some implementations, a 1-to-many streaming interface may be implemented utilizing an arbiter, or arbitration circuitry. A buffered arbiter may be developed to facilitate such interfaces; however, the use of a buffered arbiter may be area- and latency-intensive due to its store-and-forward architecture. For instance, a buffered arbiter stores and forwards MPS-sized credits in order to account for transmitters that may send packets whenever a credit is available and to handle burst rules for the one-to-many interface, among other complexities. In an improved implementation, a bufferless arbiter may instead be utilized to enable I/O fabrics to implement 1-to-many connections using SFI mechanics without the use of buffers to store and forward. For instance, a bufferless arbiter may utilize established SFI mechanics of early valid, block, and data interleaving to allow lightweight, bufferless, time-division multiplexing of many transmitters onto one receiver without the need to store and forward. Indeed, a bufferless arbiter may represent a lightweight solution through a common interface that uses existing SFI interface mechanics and extremely low-area fabric switches, and assists in the scalability of system on chip (SoC) devices, among other example advantages.
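One way to picture the time-division multiplexing performed by such a bufferless arbiter is the round-robin grant sketch below, in which each asserted request corresponds to a transmitter's early valid and the granted transmitter drives the shared channel directly, with nothing stored and forwarded; the function and its round-robin policy are illustrative assumptions:

```python
# Illustrative round-robin grant for a bufferless 1-to-many arbiter.
# 'requests' holds one flag per transmitter (e.g., its early valid).
def bufferless_arbiter(requests, last_grant):
    n = len(requests)
    for i in range(1, n + 1):
        candidate = (last_grant + i) % n   # rotate starting after last grant
        if requests[candidate]:
            return candidate               # this source drives the channel
    return last_grant                      # no requests: hold the grant
```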
For instance,
Turning to
In implementations where SFI's *early valid is used as an arbitration request, this may place an additional requirement on the SFI transmitter to be more efficient with early valid assertion, asserting it closer to the time of actual packet transmission, in order to reduce the risk of efficiency loss. For instance,
Similar principles may be applied to arbitrate a 1-to-many DATA physical channel using a bufferless arbiter 2305, such as illustrated in the diagram 2400d of
Note that the apparatuses, methods, and systems described above may be implemented in any electronic device or system as aforementioned. As specific illustrations, the figures below provide exemplary systems (e.g., SoCs, computing blocks, fabric blocks, etc.) for utilizing the solutions described herein. As the systems below are described in more detail, a number of different interconnects, use cases, topologies, and applications are disclosed, described, and revisited from the discussion above. And as is readily apparent, the advances described above may be applied to any of those interconnects, fabrics, or architectures and their composite components.
For instance, the computing platforms illustrated in the examples of
Referring to
In one embodiment, a processing element refers to hardware or logic to support a software thread. Examples of hardware processing elements include: a thread unit, a thread slot, a thread, a process unit, a context, a context unit, a logical processor, a hardware thread, a core, and/or any other element, which is capable of holding a state for a processor, such as an execution state or architectural state. In other words, a processing element, in one embodiment, refers to any hardware capable of being independently associated with code, such as a software thread, operating system, application, or other code. A physical processor (or processor socket) typically refers to an integrated circuit, which potentially includes any number of other processing elements, such as cores or hardware threads.
A core often refers to logic located on an integrated circuit capable of maintaining an independent architectural state, wherein each independently maintained architectural state is associated with at least some dedicated execution resources. In contrast to cores, a hardware thread typically refers to any logic located on an integrated circuit capable of maintaining an independent architectural state, wherein the independently maintained architectural states share access to execution resources. As can be seen, when certain resources are shared and others are dedicated to an architectural state, the line between the nomenclature of a hardware thread and core overlaps. Yet often, a core and a hardware thread are viewed by an operating system as individual logical processors, where the operating system is able to individually schedule operations on each logical processor.
Physical processor 2500, as illustrated in
As depicted, core 2501 includes two hardware threads 2501a and 2501b, which may also be referred to as hardware thread slots 2501a and 2501b. Therefore, software entities, such as an operating system, in one embodiment potentially view processor 2500 as four separate processors, e.g., four logical processors or processing elements capable of executing four software threads concurrently. As alluded to above, a first thread is associated with architecture state registers 2501a, a second thread is associated with architecture state registers 2501b, a third thread may be associated with architecture state registers 2502a, and a fourth thread may be associated with architecture state registers 2502b. Here, each of the architecture state registers (2501a, 2501b, 2502a, and 2502b) may be referred to as processing elements, thread slots, or thread units, as described above. As illustrated, architecture state registers 2501a are replicated in architecture state registers 2501b, so individual architecture states/contexts are capable of being stored for logical processor 2501a and logical processor 2501b. In core 2501, other smaller resources, such as instruction pointers and renaming logic in allocator and renamer block 2530 may also be replicated for threads 2501a and 2501b. Some resources, such as re-order buffers in reorder/retirement unit 2535, ILTB 2520, load/store buffers, and queues may be shared through partitioning. Other resources, such as general purpose internal registers, page-table base register(s), low-level data-cache and data-TLB 2515, execution unit(s) 2540, and portions of out-of-order unit 2535 are potentially fully shared.
Processor 2500 often includes other resources, which may be fully shared, shared through partitioning, or dedicated by/to processing elements. In
Core 2501 further includes decode module 2525 coupled to fetch unit 2520 to decode fetched elements. Fetch logic, in one embodiment, includes individual sequencers associated with thread slots 2501a, 2501b, respectively. Usually core 2501 is associated with a first ISA, which defines/specifies instructions executable on processor 2500. Often machine code instructions that are part of the first ISA include a portion of the instruction (referred to as an opcode), which references/specifies an instruction or operation to be performed. Decode logic 2525 includes circuitry that recognizes these instructions from their opcodes and passes the decoded instructions on in the pipeline for processing as defined by the first ISA. For example, as discussed in more detail below, decoders 2525, in one embodiment, include logic designed or adapted to recognize specific instructions, such as a transactional instruction. As a result of the recognition by decoders 2525, the architecture or core 2501 takes specific, predefined actions to perform tasks associated with the appropriate instruction. It is important to note that any of the tasks, blocks, operations, and methods described herein may be performed in response to a single or multiple instructions, some of which may be new or old instructions. Note decoders 2526, in one embodiment, recognize the same ISA (or a subset thereof). Alternatively, in a heterogeneous core environment, decoders 2526 recognize a second ISA (either a subset of the first ISA or a distinct ISA).
In one example, allocator and renamer block 2530 includes an allocator to reserve resources, such as register files to store instruction processing results. However, threads 2501a and 2501b are potentially capable of out-of-order execution, where allocator and renamer block 2530 also reserves other resources, such as reorder buffers to track instruction results. Unit 2530 may also include a register renamer to rename program/instruction reference registers to other registers internal to processor 2500. Reorder/retirement unit 2535 includes components, such as the reorder buffers mentioned above, load buffers, and store buffers, to support out-of-order execution and later in-order retirement of instructions executed out-of-order.
Scheduler and execution unit(s) block 2540, in one embodiment, includes a scheduler unit to schedule instructions/operations on execution units. For example, a floating point instruction is scheduled on a port of an execution unit that has an available floating point execution unit. Register files associated with the execution units are also included to store instruction processing results. Exemplary execution units include a floating point execution unit, an integer execution unit, a jump execution unit, a load execution unit, a store execution unit, and other known execution units.
Lower level data cache and data translation buffer (D-TLB) 2550 are coupled to execution unit(s) 2540. The data cache is to store recently used/operated-on elements, such as data operands, which are potentially held in memory coherency states. The D-TLB is to store recent virtual/linear to physical address translations. As a specific example, a processor may include a page table structure to break physical memory into a plurality of virtual pages.
Here, cores 2501 and 2502 share access to higher-level or further-out cache, such as a second level cache associated with on-chip interface 2510. Note that higher-level or further-out refers to cache levels increasing or getting further away from the execution unit(s). In one embodiment, higher-level cache is a last-level data cache—last cache in the memory hierarchy on processor 2500—such as a second or third level data cache. However, higher level cache is not so limited, as it may be associated with or include an instruction cache. A trace cache—a type of instruction cache—instead may be coupled after decoder 2525 to store recently decoded traces. Here, an instruction potentially refers to a macro-instruction (e.g., a general instruction recognized by the decoders), which may decode into a number of micro-instructions (micro-operations).
In the depicted configuration, processor 2500 also includes on-chip interface module 2510. Historically, a memory controller, which is described in more detail below, has been included in a computing system external to processor 2500. In this scenario, on-chip interface 2510 is to communicate with devices external to processor 2500, such as system memory 2575, a chipset (often including a memory controller hub to connect to memory 2575 and an I/O controller hub to connect peripheral devices), a memory controller hub, a northbridge, or other integrated circuit. And in this scenario, bus 2505 may include any known interconnect, such as multi-drop bus, a point-to-point interconnect, a serial interconnect, a parallel bus, a coherent (e.g. cache coherent) bus, a layered protocol architecture, a differential bus, and a GTL bus.
Memory 2575 may be dedicated to processor 2500 or shared with other devices in a system. Common examples of types of memory 2575 include DRAM, SRAM, non-volatile memory (NV memory), and other known storage devices. Note that device 2580 may include a graphic accelerator, processor or card coupled to a memory controller hub, data storage coupled to an I/O controller hub, a wireless transceiver, a flash device, an audio controller, a network controller, or other known device.
Recently however, as more logic and devices are being integrated on a single die, such as an SoC, each of these devices may be incorporated on processor 2500. For example, in one embodiment, a memory controller hub is on the same package and/or die with processor 2500. Here, a portion of the core (an on-core portion) 2510 includes one or more controller(s) for interfacing with other devices such as memory 2575 or a graphics device 2580. The configuration including an interconnect and controllers for interfacing with such devices is often referred to as an on-core (or uncore) configuration. As an example, on-chip interface 2510 includes a ring interconnect for on-chip communication and a high-speed serial point-to-point link 2505 for off-chip communication. Yet, in the SoC environment, even more devices, such as the network interface, co-processors, memory 2575, graphics processor 2580, and any other known computer devices/interfaces may be integrated on a single die or integrated circuit to provide a small form factor with high functionality and low power consumption.
In one embodiment, processor 2500 is capable of executing a compiler, optimization, and/or translator code 2577 to compile, translate, and/or optimize application code 2576 to support the apparatus and methods described herein or to interface therewith. A compiler often includes a program or set of programs to translate source text/code into target text/code. Usually, compilation of program/application code with a compiler is done in multiple phases and passes to transform high-level programming language code into low-level machine or assembly language code. Yet, single pass compilers may still be utilized for simple compilation. A compiler may utilize any known compilation techniques and perform any known compiler operations, such as lexical analysis, preprocessing, parsing, semantic analysis, code generation, code transformation, and code optimization.
Larger compilers often include multiple phases, but most often these phases are included within two general phases: (1) a front-end, e.g., generally where syntactic processing, semantic processing, and some transformation/optimization may take place, and (2) a back-end, e.g., generally where analysis, transformations, optimizations, and code generation take place. Some compilers refer to a middle, which illustrates the blurring of delineation between a front-end and back end of a compiler. As a result, reference to insertion, association, generation, or other operation of a compiler may take place in any of the aforementioned phases or passes, as well as any other known phases or passes of a compiler. As an illustrative example, a compiler potentially inserts operations, calls, functions, etc. in one or more phases of compilation, such as insertion of calls/operations in a front-end phase of compilation and then transformation of the calls/operations into lower-level code during a transformation phase. Note that during dynamic compilation, compiler code or dynamic optimization code may insert such operations/calls, as well as optimize the code for execution during runtime. As a specific illustrative example, binary code (already compiled code) may be dynamically optimized during runtime. Here, the program code may include the dynamic optimization code, the binary code, or a combination thereof.
Similar to a compiler, a translator, such as a binary translator, translates code either statically or dynamically to optimize and/or translate code. Therefore, reference to execution of code, application code, program code, or other software environment may refer to: (1) execution of a compiler program(s), optimization code optimizer, or translator either dynamically or statically, to compile program code, to maintain software structures, to perform other operations, to optimize code, or to translate code; (2) execution of main program code including operations/calls, such as application code that has been optimized/compiled; (3) execution of other program code, such as libraries, associated with the main program code to maintain software structures, to perform other software related operations, or to optimize code; or (4) a combination thereof.
Referring now to
While shown with only two processors 2670, 2680, it is to be understood that the scope of the present disclosure is not so limited. In other embodiments, one or more additional processors may be present in a given processor.
Processors 2670 and 2680 are shown including integrated memory controller units 2672 and 2682, respectively. Processor 2670 also includes as part of its bus controller units point-to-point (P-P) interfaces 2676 and 2678; similarly, second processor 2680 includes P-P interfaces 2686 and 2688. Processors 2670, 2680 may exchange information via a point-to-point (P-P) interface 2650 using P-P interface circuits 2678, 2688. As shown in
Processors 2670, 2680 each exchange information with a chipset 2690 via individual P-P interfaces 2652, 2654 using point to point interface circuits 2676, 2694, 2686, 2698. Chipset 2690 also exchanges information with a high-performance graphics circuit 2638 via an interface circuit 2692 along a high-performance graphics interconnect 2639.
A shared cache (not shown) may be included in either processor or outside of both processors; yet connected with the processors via P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.
Chipset 2690 may be coupled to a first bus 2616 via an interface 2696. In one embodiment, first bus 2616 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCIe bus or another third generation I/O interconnect bus, although the scope of the present disclosure is not so limited.
As shown in
Computing systems can include various combinations of components. These components may be implemented as ICs, portions thereof, discrete electronic devices, or other modules, logic, hardware, software, firmware, or a combination thereof adapted in a computer system, or as components otherwise incorporated within a chassis of the computer system. However, it is to be understood that some of the components shown may be omitted, additional components may be present, and different arrangement of the components shown may occur in other implementations. As a result, the solutions described above may be implemented in any portion of one or more of the interconnects illustrated or described herein.
A processor, in one embodiment, includes a microprocessor, multi-core processor, multithreaded processor, an ultra low voltage processor, an embedded processor, or other known processing element. In the illustrated implementation, processor acts as a main processing unit and central hub for communication with many of the various components of the system. As one example, processor is implemented as a system on a chip (SoC). As a specific illustrative example, processor includes an Intel® Architecture Core™-based processor such as an i3, i5, i7 or another such processor available from Intel Corporation. However, understand that other low power processors, such as those available from Advanced Micro Devices, Inc. (AMD) of Sunnyvale, CA, a MIPS-based design from MIPS Technologies, Inc. of Sunnyvale, CA, an ARM-based design licensed from ARM Holdings, Ltd. or a customer thereof, or their licensees or adopters, may instead be present in other embodiments, such as an Apple A5/A6 processor, a Qualcomm Snapdragon processor, or a TI OMAP processor. Note that many of the customer versions of such processors are modified and varied; however, they may support or recognize a specific instruction set that performs defined algorithms as set forth by the processor licensor. Here, the microarchitectural implementation may vary, but the architectural function of the processor is usually consistent. Certain details regarding the architecture and operation of processor in one implementation will be discussed further below to provide an illustrative example.
Processor, in one embodiment, communicates with a system memory. As an illustrative example, in an embodiment the system memory can be implemented via multiple memory devices to provide for a given amount of system memory. As examples, the memory can be in accordance with a Joint Electron Devices Engineering Council (JEDEC) low power double data rate (LPDDR)-based design such as the current LPDDR2 standard according to JEDEC JESD 209-2E (published April 2009), or a next generation LPDDR standard to be referred to as LPDDR3 or LPDDR4 that will offer extensions to LPDDR2 to increase bandwidth. In various implementations the individual memory devices may be of different package types such as single die package (SDP), dual die package (DDP), or quad die package (QDP). These devices, in some embodiments, are directly soldered onto a motherboard to provide a lower profile solution, while in other embodiments the devices are configured as one or more memory modules that in turn couple to the motherboard by a given connector. And of course, other memory implementations are possible, such as other types of memory modules, e.g., dual inline memory modules (DIMMs) of different varieties including but not limited to microDIMMs and MiniDIMMs. In a particular illustrative embodiment, memory is sized between 2 GB and 16 GB, and may be configured as a DDR3LM package or an LPDDR2 or LPDDR3 memory that is soldered onto a motherboard via a ball grid array (BGA).
To provide for persistent storage of information such as data, applications, one or more operating systems and so forth, a mass storage may also couple to processor. In various embodiments, to enable a thinner and lighter system design as well as to improve system responsiveness, this mass storage may be implemented via a SSD. However in other embodiments, the mass storage may primarily be implemented using a hard disk drive (HDD) with a smaller amount of SSD storage to act as a SSD cache to enable non-volatile storage of context state and other such information during power down events so that a fast power up can occur on re-initiation of system activities. A flash device may be coupled to processor, e.g., via a serial peripheral interface (SPI). This flash device may provide for non-volatile storage of system software, including a basic input/output software (BIOS) as well as other firmware of the system.
In various embodiments, mass storage of the system is implemented by a SSD alone or as a disk, optical or other drive with an SSD cache. In some embodiments, the mass storage is implemented as a SSD or as a HDD along with a restore (RST) cache module. In various implementations, the HDD provides for storage of between 320 GB-4 terabytes (TB) and upward while the RST cache is implemented with a SSD having a capacity of 24 GB-256 GB. Note that such SSD cache may be configured as a single level cache (SLC) or multi-level cache (MLC) option to provide an appropriate level of responsiveness. In a SSD-only option, the module may be accommodated in various locations such as in a mSATA or NGFF slot. As an example, an SSD has a capacity ranging from 120 GB-1 TB.
While the present disclosure has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present disclosure.
A design may go through various stages, from creation to simulation to fabrication. Data representing a design may represent the design in a number of manners. First, as is useful in simulations, the hardware may be represented using a hardware description language or another functional description language. Additionally, a circuit level model with logic and/or transistor gates may be produced at some stages of the design process. Furthermore, most designs, at some stage, reach a level of data representing the physical placement of various devices in the hardware model. In the case where conventional semiconductor fabrication techniques are used, the data representing the hardware model may be the data specifying the presence or absence of various features on different mask layers for masks used to produce the integrated circuit. In any representation of the design, the data may be stored in any form of a machine readable medium. A memory or a magnetic or optical storage such as a disc may be the machine readable medium to store information transmitted via optical or electrical wave modulated or otherwise generated to transmit such information. When an electrical carrier wave indicating or carrying the code or design is transmitted, to the extent that copying, buffering, or re-transmission of the electrical signal is performed, a new copy is made. Thus, a communication provider or a network provider may store on a tangible, machine-readable medium, at least temporarily, an article, such as information encoded into a carrier wave, embodying techniques of embodiments of the present disclosure.
A module as used herein refers to any combination of hardware, software, and/or firmware. As an example, a module includes hardware, such as a micro-controller, associated with a non-transitory medium to store code adapted to be executed by the micro-controller. Therefore, reference to a module, in one embodiment, refers to the hardware, which is specifically configured to recognize and/or execute the code to be held on a non-transitory medium. Furthermore, in another embodiment, use of a module refers to the non-transitory medium including the code, which is specifically adapted to be executed by the microcontroller to perform predetermined operations. And as can be inferred, in yet another embodiment, the term module (in this example) may refer to the combination of the microcontroller and the non-transitory medium. Often module boundaries that are illustrated as separate commonly vary and potentially overlap. For example, a first and a second module may share hardware, software, firmware, or a combination thereof, while potentially retaining some independent hardware, software, or firmware. In one embodiment, use of the term logic includes hardware, such as transistors, registers, or other hardware, such as programmable logic devices.
Use of the phrase ‘to’ or ‘configured to,’ in one embodiment, refers to arranging, putting together, manufacturing, offering to sell, importing and/or designing an apparatus, hardware, logic, or element to perform a designated or determined task. In this example, an apparatus or element thereof that is not operating is still ‘configured to’ perform a designated task if it is designed, coupled, and/or interconnected to perform said designated task. As a purely illustrative example, a logic gate may provide a 0 or a 1 during operation. But a logic gate ‘configured to’ provide an enable signal to a clock does not include every potential logic gate that may provide a 1 or 0. Instead, the logic gate is one coupled in some manner that during operation the 1 or 0 output is to enable the clock. Note once again that use of the term ‘configured to’ does not require operation, but instead focus on the latent state of an apparatus, hardware, and/or element, where in the latent state the apparatus, hardware, and/or element is designed to perform a particular task when the apparatus, hardware, and/or element is operating.
Furthermore, use of the phrases ‘capable of/to,’ and or ‘operable to,’ in one embodiment, refers to some apparatus, logic, hardware, and/or element designed in such a way to enable use of the apparatus, logic, hardware, and/or element in a specified manner. Note as above that use of to, capable to, or operable to, in one embodiment, refers to the latent state of an apparatus, logic, hardware, and/or element, where the apparatus, logic, hardware, and/or element is not operating but is designed in such a manner to enable use of an apparatus in a specified manner.
A value, as used herein, includes any known representation of a number, a state, a logical state, or a binary logical state. Often, the use of logic levels, logic values, or logical values is also referred to as 1's and 0's, which simply represents binary logic states. For example, a 1 refers to a high logic level and 0 refers to a low logic level. In one embodiment, a storage cell, such as a transistor or flash cell, may be capable of holding a single logical value or multiple logical values. However, other representations of values in computer systems have been used. For example, the decimal number ten may also be represented as a binary value of 1010 and a hexadecimal letter A. Therefore, a value includes any representation of information capable of being held in a computer system.
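The equivalence of the representations named above can be checked directly; this snippet simply renders the decimal number ten in binary and hexadecimal:

```python
# Decimal ten in the alternative representations discussed above.
ten = 10
binary = format(ten, "b")   # binary digits of ten
hexa = format(ten, "X")     # hexadecimal digit of ten
```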
Moreover, states may be represented by values or portions of values. As an example, a first value, such as a logical one, may represent a default or initial state, while a second value, such as a logical zero, may represent a non-default state. In addition, the terms reset and set, in one embodiment, refer to a default and an updated value or state, respectively. For example, a default value potentially includes a high logical value, e.g., reset, while an updated value potentially includes a low logical value, e.g., set. Note that any combination of values may be utilized to represent any number of states.
The embodiments of methods, hardware, software, firmware or code set forth above may be implemented via instructions or code stored on a machine-accessible, machine readable, computer accessible, or computer readable medium which are executable by a processing element. A non-transitory machine-accessible/readable medium includes any mechanism that provides (e.g., stores and/or transmits) information in a form readable by a machine, such as a computer or electronic system. For example, a non-transitory machine-accessible medium includes random-access memory (RAM), such as static RAM (SRAM) or dynamic RAM (DRAM); ROM; magnetic or optical storage medium; flash memory devices; electrical storage devices; optical storage devices; acoustical storage devices; other form of storage devices for holding information received from transitory (propagated) signals (e.g., carrier waves, infrared signals, digital signals); etc., which are to be distinguished from the non-transitory mediums that may receive information there from.
Instructions used to program logic to perform embodiments of the disclosure may be stored within a memory in the system, such as DRAM, cache, flash memory, or other storage. Furthermore, the instructions can be distributed via a network or by way of other computer-readable media. Thus a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, Compact Disc Read-Only Memories (CD-ROMs), magneto-optical disks, Read-Only Memories (ROMs), Random Access Memories (RAMs), Erasable Programmable Read-Only Memories (EPROMs), Electrically Erasable Programmable Read-Only Memories (EEPROMs), magnetic or optical cards, flash memory, or a tangible, machine-readable storage used in the transmission of information over the Internet via electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Accordingly, the computer-readable medium includes any type of tangible machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
The following examples pertain to embodiments in accordance with this Specification. Example 1 is an apparatus including: protocol circuitry to implement an input/output (I/O) interconnect protocol, where the I/O interconnect protocol includes a flit mode and a non-flit mode, where a set of flit mode header formats are used when in the flit mode and a set of non-flit mode header formats are used when in the non-flit mode, and the set of non-flit mode header formats include one or more non-flit mode fields; and interface circuitry to implement an interface to couple to a fabric, where the interface circuitry is to: determine that a link is trained to the non-flit mode; generate a header according to the set of flit mode header formats, where the header includes a field to indicate that a corresponding packet originated as a non-flit mode packet, and one or more fields of the set of flit mode header formats are repurposed in the header to carry the one or more non-flit mode fields; and send the header over the interface.
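The header handling of Example 1 can be sketched as follows. The sketch is purely illustrative and forms no part of the disclosed embodiments; the specific field names (e.g., "ep", "td"), the dictionary-based mapping, and the function names are all hypothetical, chosen only to show how non-flit mode fields might be carried in repurposed flit mode header fields alongside an origin flag:

```python
from dataclasses import dataclass, field

# Hypothetical non-flit mode fields with no counterpart in the
# flit mode header formats (illustrative names only).
NON_FLIT_ONLY_FIELDS = ("ep", "td")

@dataclass
class FlitModeHeader:
    # Fields common to both modes (illustrative subset).
    fmt_type: int = 0
    length: int = 0
    # Field indicating the packet originated as a non-flit mode packet.
    non_flit_origin: bool = False
    # Flit mode fields repurposed to carry the non-flit-only fields.
    repurposed: dict = field(default_factory=dict)

def to_flit_mode(non_flit_header: dict) -> FlitModeHeader:
    """Generate a flit mode header for a packet that originated in
    non-flit mode, carrying non-flit-only fields in repurposed
    flit mode fields and setting the origin flag."""
    hdr = FlitModeHeader(
        fmt_type=non_flit_header["fmt_type"],
        length=non_flit_header["length"],
        non_flit_origin=True,
    )
    for name in NON_FLIT_ONLY_FIELDS:
        if name in non_flit_header:
            hdr.repurposed[name] = non_flit_header[name]
    return hdr
```

For instance, converting a hypothetical non-flit mode header `{"fmt_type": 0x4A, "length": 8, "ep": 1}` yields a flit mode header flagged as non-flit in origin whose repurposed fields carry `{"ep": 1}`.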
Example 2 includes the subject matter of example 1, where the one or more non-flit mode fields are not included in the set of flit mode header formats.
Example 3 includes the subject matter of any one of examples 1-2, where the I/O interconnect protocol includes a load/store interconnection protocol.
Example 4 includes the subject matter of example 3, where the I/O interconnect protocol includes one of a Peripheral Component Interconnect Express (PCIe)-based protocol or a Compute Express Link (CXL)-based protocol.
Example 5 includes the subject matter of any one of examples 1-4, where the interface includes: a header channel implemented on a first subset of a plurality of physical lanes, where the first subset of lanes includes first lanes to carry packet headers based on the interconnect protocol and second lanes to carry metadata for the packet headers; and a data channel implemented on a separate second subset of the plurality of physical lanes, where the second subset of lanes includes third lanes to carry packet payloads and fourth lanes to carry metadata for the packet payloads, where the header is sent over the header channel.
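The channel layout of Example 5 can be pictured as a static partition of the interface's physical lanes. The sketch below is purely illustrative and not part of the disclosed embodiments; the lane counts are arbitrary placeholders, not values from any specification:

```python
# Hypothetical 16-lane interface partitioned per Example 5.
TOTAL_LANES = list(range(16))

# Header channel: first subset of lanes, split between header
# content and header metadata.
HEADER_LANES = TOTAL_LANES[0:6]       # first lanes: packet headers
HEADER_META_LANES = TOTAL_LANES[6:8]  # second lanes: header metadata

# Data channel: separate second subset, split between payload
# content and payload metadata.
DATA_LANES = TOTAL_LANES[8:14]        # third lanes: packet payloads
DATA_META_LANES = TOTAL_LANES[14:16]  # fourth lanes: payload metadata

# The two channels occupy disjoint lanes: a header never shares a
# physical lane with payload data.
header_channel = set(HEADER_LANES + HEADER_META_LANES)
data_channel = set(DATA_LANES + DATA_META_LANES)
assert not header_channel & data_channel
```

The point of the partition is that headers and payloads can progress independently, each with its own sideband metadata lanes.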
Example 6 includes the subject matter of any one of examples 1-5, where the flit mode and the non-flit mode are based on a PCIe-based protocol.
Example 7 includes the subject matter of example 6, where the one or more non-flit mode fields are carried in the one or more fields of the set of flit mode header formats based on a mapping.
Example 8 includes the subject matter of any one of examples 6-7, where the set of flit mode header formats include one or more orthogonal content headers and a particular field in a particular one of the one or more orthogonal content headers is to carry a particular field in the one or more non-flit mode fields.
Example 9 includes the subject matter of any one of examples 6-8, where the set of flit mode header formats include one or more prefixes and a particular field in a particular one of the one or more prefixes is to carry a particular field in the one or more non-flit mode fields.
Example 10 includes the subject matter of example 9, where the particular prefix includes a mode field to indicate that the corresponding packet originated as a non-flit mode packet.
Example 11 includes the subject matter of any one of examples 1-10, where end-to-end encryption is to be provided on the link based on the flit mode.
Example 12 includes the subject matter of any one of examples 1-11, where the interface is based on a Streaming Fabric Interface (SFI) specification.
Example 13 is a method including: identifying a header of a packet, where the header of the packet is based on a non-flit mode format of a load/store interconnect protocol, where the load/store interconnect protocol further defines a flit mode; generating a flit mode version of the header of the packet, where the flit mode version of the header of the packet is based on a flit mode format, a first subset of fields in the non-flit mode format are also provided in the flit mode format, and a second subset of fields in the non-flit mode format are excluded in the flit mode format, where generating the flit mode version of the header of the packet includes carrying one or more fields in the second subset of fields in repurposed fields defined in the flit mode format; sending the flit mode version of the header of the packet to a fabric over an interface, where the flit mode version of the header of the packet is sent on a header channel implemented on a first plurality of physical lanes; and sending payload data of the packet to the fabric over the interface, where the payload data of the packet is sent over a data channel implemented on a separate, second plurality of physical lanes.
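The two sends at the end of Example 13 can be modeled as independent transfers over disjoint lane groups. The sketch is illustrative only and not part of the disclosed embodiments; the one-byte-per-lane-per-cycle beat model and all function names are hypothetical simplifications:

```python
def chunk_into_beats(data: bytes, lanes: int) -> list:
    """Split a byte stream into per-cycle beats, assuming one byte
    per lane per cycle (an illustrative simplification)."""
    return [data[i:i + lanes] for i in range(0, len(data), lanes)]

def send_packet(header: bytes, payload: bytes,
                header_lanes: int, data_lanes: int) -> dict:
    """Model Example 13's two sends: header beats travel only on the
    header channel's lanes, payload beats only on the separate data
    channel's lanes."""
    return {
        "header_channel": chunk_into_beats(header, header_lanes),
        "data_channel": chunk_into_beats(payload, data_lanes),
    }
```

For example, an 8-byte header over 4 header lanes takes two beats, while a 10-byte payload over 8 data lanes takes two beats of its own, and neither transfer occupies the other's lanes.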
Example 14 includes the subject matter of example 13, where the interface is defined according to an SFI specification and the load/store protocol includes one of PCIe or CXL.io.
Example 15 includes the subject matter of any one of examples 13-14, where the flit mode version of the header of the packet is sent on a first subset of the first plurality of physical lanes, and the method includes: sending header metadata on the interface using a second subset of the first plurality of physical lanes of the header channel.
Example 16 includes the subject matter of any one of examples 13-15, where the interface includes: a header channel implemented on a first subset of a plurality of physical lanes, where the first subset of lanes includes first lanes to carry packet headers based on the interconnect protocol and second lanes to carry metadata for the packet headers; and a data channel implemented on a separate second subset of the plurality of physical lanes, where the second subset of lanes includes third lanes to carry packet payloads and fourth lanes to carry metadata for the packet payloads, where the header is sent over the header channel.
Example 17 includes the subject matter of any one of examples 13-16, where the flit mode and the non-flit mode are based on a PCIe-based protocol.
Example 18 includes the subject matter of example 17, where the second subset of fields are carried in the one or more fields of the set of flit mode header formats based on a mapping.
Example 19 includes the subject matter of any one of examples 17-18, where the set of flit mode header formats include one or more orthogonal content headers and a particular field in a particular one of the one or more orthogonal content headers is to carry a particular field in the second subset of fields.
Example 20 includes the subject matter of any one of examples 17-19, where the set of flit mode header formats include one or more prefixes and a particular field in a particular one of the one or more prefixes is to carry a particular field in the second subset of fields.
Example 21 includes the subject matter of example 20, where the particular prefix includes a mode field to indicate that the corresponding packet originated as a non-flit mode packet.
Example 22 includes the subject matter of any one of examples 13-21, further including providing end-to-end encryption based on the flit mode.
Example 23 is a system including means to perform the method of any one of examples 13-22.
Example 24 is a system including: a fabric; and a plurality of compute blocks communicatively coupled through the fabric, where a particular compute block in the plurality of compute blocks includes: agent circuitry to support a load/store interconnect protocol; and interface circuitry to implement an interface to couple to the fabric, where the interface circuitry is to: determine that a link is trained to the non-flit mode; generate a header according to the set of flit mode header formats, where the header includes a field to indicate that a corresponding packet originated as a non-flit mode packet, and one or more fields of the set of flit mode header formats are repurposed in the header to carry the one or more non-flit mode fields; and send the header over the interface.
Example 25 includes the subject matter of example 24, further including a bufferless arbiter to facilitate a one-to-many connection on the interface.
Example 26 includes the subject matter of any one of examples 24-25, further including a credit gasket to convert dedicated credits used by a transmitter on a first one of the compute blocks to shared credits used by a receiver on a second one of the compute blocks.
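The credit gasket of Example 26 can be sketched as a small translation layer that retires a transmitter's dedicated per-channel credits against a receiver's shared pool. The class, its policy, and the virtual channel names below are hypothetical illustrations, not part of the disclosed embodiments:

```python
class CreditGasket:
    """Illustrative conversion between dedicated credits (per virtual
    channel, as used by a transmitter) and shared credits (a single
    pool, as used by a receiver)."""

    def __init__(self, dedicated: dict, shared_pool: int):
        self.dedicated = dict(dedicated)  # e.g., {"vc0": 4, "vc1": 2}
        self.shared = shared_pool

    def consume(self, vc: str) -> bool:
        # A transfer needs both a dedicated credit on its own channel
        # and one credit from the receiver's shared pool.
        if self.dedicated.get(vc, 0) > 0 and self.shared > 0:
            self.dedicated[vc] -= 1
            self.shared -= 1
            return True
        return False  # stall: insufficient credits

    def release(self, vc: str) -> None:
        # As the receiver drains its buffers, both credit types return.
        self.dedicated[vc] = self.dedicated.get(vc, 0) + 1
        self.shared += 1
```

In this model a transmitter with one dedicated credit on a channel can send once, stalls on a second attempt, and proceeds again after the receiver releases the credit; a channel with no dedicated credits stalls regardless of the shared pool.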
Example 27 includes the subject matter of any one of examples 24-26, where the fabric includes an interconnect fabric of a system on chip (SoC) device, and the SoC device includes the plurality of compute blocks.
Example 28 includes the subject matter of any one of examples 24-27, where the interface includes a header channel including a set of dedicated physical lanes to communicate packet headers, and the flit mode is to be used for headers communicated on the header channel.
Example 29 is an apparatus including: fabric circuitry to implement a fabric, where the fabric is to support communications according to an input/output (I/O) interconnect protocol, where the I/O interconnect protocol includes a flit mode and a non-flit mode, where a set of flit mode header formats are used when in the flit mode and a set of non-flit mode header formats are used when in the non-flit mode, and the set of non-flit mode header formats include one or more non-flit mode fields; and interface circuitry to implement an interface to couple to an agent, where the interface circuitry is to: determine that a link is trained to the non-flit mode; generate a header according to the set of flit mode header formats, where the header includes a field to indicate that a corresponding packet originated as a non-flit mode packet, and one or more fields of the set of flit mode header formats are repurposed in the header to carry the one or more non-flit mode fields; and send the header over the interface.
Example 30 includes the subject matter of example 29, where the one or more non-flit mode fields are not included in the set of flit mode header formats.
Example 31 includes the subject matter of any one of examples 29-30, where the I/O interconnect protocol includes a load/store interconnection protocol.
Example 32 includes the subject matter of example 31, where the I/O interconnect protocol includes one of a Peripheral Component Interconnect Express (PCIe)-based protocol or a Compute Express Link (CXL)-based protocol.
Example 33 includes the subject matter of any one of examples 29-32, where the interface includes: a header channel implemented on a first subset of a plurality of physical lanes, where the first subset of lanes includes first lanes to carry packet headers based on the interconnect protocol and second lanes to carry metadata for the packet headers; and a data channel implemented on a separate second subset of the plurality of physical lanes, where the second subset of lanes includes third lanes to carry packet payloads and fourth lanes to carry metadata for the packet payloads, where the header is sent over the header channel.
Example 34 includes the subject matter of any one of examples 29-33, where the flit mode and the non-flit mode are based on a PCIe-based protocol.
Example 35 includes the subject matter of example 34, where the one or more non-flit mode fields are carried in the one or more fields of the set of flit mode header formats based on a mapping.
Example 36 includes the subject matter of any one of examples 34-35, where the set of flit mode header formats include one or more orthogonal content headers and a particular field in a particular one of the one or more orthogonal content headers is to carry a particular field in the one or more non-flit mode fields.
Example 37 includes the subject matter of any one of examples 34-36, where the set of flit mode header formats include one or more prefixes and a particular field in a particular one of the one or more prefixes is to carry a particular field in the one or more non-flit mode fields.
Example 38 includes the subject matter of example 37, where the particular prefix includes a mode field to indicate that the corresponding packet originated as a non-flit mode packet.
Example 39 includes the subject matter of any one of examples 34-38, where end-to-end encryption is to be provided on the link based on the flit mode.
Example 40 includes the subject matter of any one of examples 34-39, where the interface is based on a Streaming Fabric Interface (SFI) specification.
Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment.
Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In the foregoing specification, a detailed description has been given with reference to specific exemplary embodiments. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense. Furthermore, the foregoing use of embodiment and other exemplary language does not necessarily refer to the same embodiment or the same example, but may refer to different and distinct embodiments, as well as potentially the same embodiment.