Examples of the present disclosure generally relate to integrated circuits (ICs) and, more particularly, to low latency gigabit transceiver (GT) PHY-based signal switching for emulation, prototyping, and high performance computing (HPC).
Multiple configurable/programmable integrated circuits (ICs), such as field-programmable gate arrays (FPGAs), may be interconnected to provide a configurable high-speed computing (HSC) platform. The HSC platform may be useful, for example, to emulate, prototype, and/or simulate operation of a circuit design (e.g., for a system-on-chip (SoC)). Emulation may be useful for verifying the circuit design. Prototyping may be useful for validating the circuit design. For emulation and/or prototyping, silicon components of the circuit design are synthesized and mapped to equivalent hardware resources within programmable circuitry (i.e., fabric) of the ICs. If the circuit design does not fit within the fabric of a single IC, the circuit design is partitioned, and the partitions are implemented in the fabric of respective ICs. Signals between the partitions (cut nets) may be routed amongst the respective ICs via gigabit transceivers (GTs) of the ICs. In some situations, the number of cut nets that cross between the ICs can be in a range of tens of thousands, which exceeds the number of GTs.
Techniques for low latency gigabit transceiver (GT) PHY-based signal switching for emulation, prototyping, and high performance computing (HPC) are described. One example is an integrated circuit that includes receiver circuitry that de-serializes and extracts data from a received signal, transmitter circuitry that serializes and transmits outgoing data, functional circuitry that receives the extracted data and provides the outgoing data, and bypass circuitry that provides the extracted data from the receiver circuitry to the transmitter circuitry as the outgoing data, bypassing the functional circuitry, in a bypass mode.
Another example described herein is a system that includes multiple integrated circuits (ICs), where a first one of the ICs includes functional circuitry, a receiver that receives a signal from a second one of the ICs, a transmitter that transmits outgoing data to a third one of the ICs, and a bypass circuit that selectively provides an output of the receiver to one of the functional circuitry and the transmitter.
Another example described herein is a method that includes receiving a signal from a first IC at a second IC, de-serializing the received signal at the second IC, extracting data from the de-serialized signal at the second IC, and selectively routing the extracted data to one of functional circuitry of the second IC and a transmitter of the second IC.
Another example described herein is an integrated circuit (IC) device that includes first, second, and third ICs. The third IC includes first and second transceivers. The first transceiver includes a first receiver, a first transmitter, and a first loopback path between the first receiver and the first transmitter. The second transceiver includes a second receiver, a second transmitter, and a second loopback path between the second receiver and the second transmitter. The third IC further includes a bypass link between the first and second loopback paths. The third IC is configurable to receive a signal from the first IC at the first receiver, route the signal from the first receiver to the second transmitter via the bypass link, and transmit the signal from the second transmitter to the second IC.
So that the manner in which the above recited features can be understood in detail, a more particular description, briefly summarized above, may be had by reference to example implementations, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical example implementations and are therefore not to be considered limiting of its scope.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures. It is contemplated that elements of one example may be beneficially incorporated in other examples.
Various features are described hereinafter with reference to the figures. It should be noted that the figures may or may not be drawn to scale and that the elements of similar structures or functions are represented by like reference numerals throughout the figures. It should be noted that the figures are only intended to facilitate the description of the features. They are not intended as an exhaustive description of the features or as a limitation on the scope of the claims. In addition, an illustrated example need not have all the aspects or advantages shown. An aspect or an advantage described in conjunction with a particular example is not necessarily limited to that example and can be practiced in any other examples even if not so illustrated, or if not so explicitly described.
Embodiments herein describe low latency gigabit transceiver (GT) PHY-based signal switching for emulation, prototyping, and high performance computing (HPC).
Unless indicated otherwise herein, the terms emulation, prototyping, and simulating may be used interchangeably.
A SoC may include approximately 1 billion application-specific integrated circuit (ASIC) gates. In order to map such a SoC to a SoC prototyping platform (e.g., a FPGA-based prototyping platform), the platform may need approximately 60 integrated circuits (e.g., FPGAs). For such a computing platform, approximately 1000 cables may be needed to connect the 60 integrated circuits (ICs) at an IO bank level to provide a mesh amongst the 60 ICs. Moreover, such a mesh would not necessarily provide point-to-point connections between each pair of the FPGAs. Rather, communications between some pairs of the ICs may be routed through one or more other ICs, which increases latency.
Where the number of cut nets exceeds the number of available pins, the ICs may employ pin-multiplexing techniques. As SoCs become increasingly complex, even with multiplexing, the finite number of GTs may result in a signal from one IC/partition being routed through GTs of one or more intervening ICs (i.e., multiple hops) to reach a destination IC/partition, which increases latency.
Disclosed herein are techniques to reduce latency associated with multiple hops, including techniques to bypass data processing circuitry (e.g., media access control circuitry) and functional circuitry of intervening ICs. Such techniques may be referred to as bypass switching or PHY mode operation.
In PHY mode, bypass circuitry of an IC (e.g., an FPGA) may couple an output of receive-side physical layer (PHY) circuitry to an input of transmit-side PHY circuitry. In such a configuration, the receive-side PHY circuitry extracts data from a signal received from another IC (e.g., another FPGA), and the bypass circuitry provides the extracted data to the transmit-side PHY circuitry for transmission to another IC, bypassing (i.e., avoiding latency associated with) data processing circuitry and functional circuitry of the IC.
As an example, a chassis may include 8 FPGAs, each including 8 GT quads (i.e., 8×QSFP28 connectors). Multiple such chassis may be interconnected as disclosed herein to provide a mesh of 512 FPGAs. A maximum hop routing latency between any two FPGA nodes of the mesh may be, for example, approximately 50 ns (~25 ns × 2). Deployment of such a mesh may be relatively simple and inexpensive.
As another example, a FPGA-based computing platform may include approximately 64 FPGAs. Each FPGA may include 8 GT quads and a low-latency bypass switch (at a GT PHY level), and may employ GT-based pin-multiplexing. The FPGAs may be interconnected with approximately 256 high-speed cable pairs (e.g., QSFP28 type passive copper cable containing four high-speed copper pairs, each operating at data rates of up to 28 gigabits per second (Gbps)). In this example, the low-latency bypass switches at the GT PHY level may reduce/minimize routing through intervening FPGAs, and may reduce latency on the order of approximately 25 nanoseconds (ns).
Techniques disclosed herein may be useful to reduce/minimize hop latency, system complexity, and costs associated with manufacturing, deployment, and maintenance. For example, a rack-based, FPGA-based prototyping platform may include approximately 1000 custom cables, which are costly. Techniques disclosed herein may reduce cabling needs of such a computing platform to, for example and without limitation, within a range of approximately 400 to 500 cables, which reduces costs. Moreover, fewer cables mean fewer cable faults, which may reduce deployment and maintenance costs, and may improve system up-time.
Bypass switching, or PHY mode operation, as disclosed herein, may be employed alone and/or in combination with other latency-reducing techniques disclosed herein such as, without limitation, pin-multiplexing and/or operating PHY circuitry in a “raw” mode.
Techniques disclosed herein may be useful in other applications such as, without limitation, datacenter switch and connectivity, large scale inter-connected FPGA-based acceleration for high performance computing (HPC), and communications amongst heterogeneous die, chips, and/or cards.
Receiver circuitry 104 includes receive-side physical layer (PHY) circuitry 108 that de-serializes a received signal 110 to provide a de-serialized signal 112. Receive-side PHY circuitry 108 may include analog front-end circuitry and/or digital front-end circuitry. The analog front-end circuitry may include physical medium attachment (PMA) circuitry. The digital front-end circuitry may include physical coding sublayer (PCS) circuitry.
Receiver circuitry 104 may receive signal 110 over a channel 140 (e.g., a gigabit channel), which may include a physical link (e.g., a cable). Receiver circuitry 104 may receive signal 110 as a differential signal over a twisted pair of wires.
Receiver circuitry 104 further includes data extraction circuitry 114 that extracts data 116 from de-serialized signal 112. Where received signal 110 is packetized, data extraction circuitry 114 de-packetizes de-serialized signal 112.
IC 100 further includes data processing circuitry 118 that processes extracted data 116, and provides resultant processed data 120 to functional circuitry 102. Data processing circuitry 118 may perform one or more of a variety of processes such as, without limitation, buffering, decoding, and/or protocol formatting. Data processing circuitry 118 may verify frame check sequences of a sender, and may strip off a preamble and padding of the sender before passing data up to higher layers. Receive-side data processing circuitry 118 may represent a receive-side media access controller or a portion thereof.
Functional circuitry 102 may perform one or more of a variety of functions with respect to processed data 120 and/or other data, examples of which are provided further below.
IC 100 further includes transmit-side data processing circuitry 124 that processes outgoing data 122 received from functional circuitry 102. Outgoing data 122 may be related or unrelated to processed data 120. Outgoing data 122 may be un-packetized, non-serialized data. Transmit-side data processing circuitry 124 may perform one or more of a variety of processes such as, without limitation, clock edge detection and/or data acquisition. Transmit-side data processing circuitry 124 may represent a transmit-side media access controller, or a portion thereof.
Transmitter circuitry 106 includes framing circuitry 128 that frames processed outgoing data 126 for transport. A frame is a digital data transmission unit. In a packet switched environment, a frame may represent a container for a packet. Framing circuitry 128 may packetize processed outgoing data 126. Framing circuitry 128 may provide processed outgoing data 126 with a pre-defined header, data beats, end-of-frame bit(s), a parity block, and/or an error correction code (ECC) block. Framing circuitry 128 outputs a framed version of processed outgoing data 126 as outgoing data 130. Framing circuitry 128 may represent a portion of a transmit-side media access controller.
Transmitter circuitry 106 further includes transmit-side physical layer (PHY) circuitry 132 that converts outgoing data 130 to an output signal 134. Transmit-side PHY circuitry 132 transmits output signal 134 over channel 142 (e.g., a gigabit channel), which may include a physical link (e.g., a cable). Transmit-side PHY circuitry 132 may transmit output signal 134 as a differential signal over a twisted pair of wires. Transmit-side PHY circuitry 132 may serialize outgoing data 130 for transmission.
IC 100 further includes a bypass link 136 that provides extracted data 116 to transmitter circuitry 106, bypassing receive-side data processing circuitry 118, functional circuitry 102, and transmit-side data processing circuitry 124. In the example of
Bypass link 136 is not limited to the example of
IC 100 may further include bypass control circuitry 138 that selectively provides extracted data 116 to functional circuitry 102 (via data processing circuitry 118), or to transmitter circuitry 106 (via bypass link 136). Bypass control circuitry 138 may determine to provide extracted data 116 to functional circuitry 102 or to transmitter circuitry 106 based on, for example and without limitation, a destination identifier (ID) or destination address associated with extracted data 116 (e.g., a destination ID or address extracted from received signal 110).
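As an illustrative and non-limiting sketch, the selection performed by bypass control circuitry 138 may be modeled as a comparison between a destination ID carried with the extracted data and the ID of the local IC. The names ExtractedData, dest_id, local_id, and route are hypothetical and are introduced only for illustration; the disclosure does not specify how the destination ID is encoded or compared.

```python
from dataclasses import dataclass


@dataclass
class ExtractedData:
    dest_id: int    # hypothetical destination ID carried with the data
    payload: bytes  # extracted data (e.g., extracted data 116)


def route(data: ExtractedData, local_id: int) -> str:
    """Return the path taken by the extracted data (behavioral model only)."""
    if data.dest_id == local_id:
        # Data is for this IC: pass it up through data processing
        # circuitry to the functional circuitry.
        return "functional"
    # Data is for another IC: hand it to the transmitter over the
    # bypass link, skipping data processing and functional circuitry.
    return "bypass"
```

In this model, only data addressed to the local IC incurs the latency of the data processing and functional circuitry; all other data takes the bypass link.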
IC 100 may include fixed function circuitry (i.e., non-configurable/non-programmable, or hardened circuitry) and/or programmable/configurable circuitry. As an example, and without limitation, receive-side PHY circuitry 108 and transmit-side PHY circuitry 132 may be implemented in fixed function circuitry, and remaining circuitry (i.e., functional circuitry 102, data extraction circuitry 114, bypass control circuitry 138, receive-side data processing circuitry 118, transmit-side data processing circuitry 124, and framing circuitry 128) may be implemented in programmable/configurable circuitry. In an embodiment, receive-side PHY circuitry 108 and transmit-side PHY circuitry 132 include configurable or selectable features, which may be bypassed to further reduce latency, examples of which are provided further below.
Multiple instances of IC 100 may interconnect to provide a high-performance computing (HPC) platform, such as described in examples below. Such a computing platform may be useful for a variety of applications including, without limitation, emulating, prototyping, and/or simulating operation of a circuit design by partitioning the circuit design and configuring functional circuitry 102 of the multiple instances of IC 100 based on respective partitions of the circuit design.
IC 100-2 includes receiver circuitry 104-2, transmitter circuitry 106-2, and a bypass link 136-2. IC 100-2 may further include receive-side data processing circuitry, functional circuitry, and transmit-side data processing circuitry, such as described further above. Receiver circuitry 104-2 receives signal 134-1 from IC 100-1 over channel 142-1 and outputs extracted data 116-2, which is provided to transmitter circuitry 106-2 via bypass link 136-2. Transmitter circuitry 106-2 converts extracted data 116-2 to an output signal 134-2, and transmits output signal 134-2 to IC 100-4 over a channel 142-2, such as described above with reference to
In
IC 100 and/or computing platform 500 may be implemented as described in one or more examples below. IC 100 and computing platform 500 are not, however, limited to the following examples.
Circuit boards 604 include ICs 606 disposed thereon. ICs 606 may include configurable/programmable circuitry (fabric), such as, without limitation, field-programmable gate arrays (FPGAs). ICs 606 may include system-on-chips (SoCs), application-specific integrated circuits (ASICs), and/or other types of ICs that include configurable/programmable circuitry. One or more circuit boards 604 may include multiple ICs 606.
ICs 606 further include transceivers 608. Transceivers 608 may provide relatively high-speed serial communications (e.g., 28 gigabits per second (Gbps)), and may be referred to as gigabit transceivers (GTs). ICs 606 may further include serializer/deserializer (SERDES) circuitry that serializes data to be transmitted by transceivers 608 and de-serializes data received by transceivers 608. Circuit boards 604 may further include multiplexing circuitry to multiplex cut nets of the circuit design through transceivers 608. Computing platform 600 further includes cables 610 that provide communication paths/channels amongst transceivers 608.
ICs 606 may represent instances of IC 100 in
Computing platform 600 may also be useful for cost reduction from pin-multiplexing, described further above, which increases the number of signals communicated amongst circuit boards 604 for a given number of transceivers 608 and cables 610 (e.g., without increasing the number of transceivers 608 and cables 610, and/or without using cabling for select IO).
Computing platform 600 may be communicatively linked to a data processing system (not shown) and operate in coordination with, and/or under control of, such data processing system executing appropriate software. An example of a data processing system is described herein in connection with
PHY 706 includes a physical medium attachment sublayer (PMA) 708, a buffer 710, and a physical coding sublayer (PCS) 712. PHY 706 may be subdivided into two portions corresponding to a transmit PHY and a receive PHY. For example, each of PMA 708 and PCS 712 may include a transmit portion and a receive portion. PCS 712 may be coupled to a PCS of a transceiver 608 of another IC 606 via a communication channel 714 (e.g., over one of cables 610). Communication channel 714 may include a serial communication channel. Communication channel 714 may include a serial transmit channel 716 and a serial receive channel 718. Communication channels 716 and 718 may utilize differential signaling. In other words, channels 716 and 718 may each include two pins and corresponding wires. Communication channel 714 may maintain cycle accurate features of computing platform 600 at boundaries of IC 606. In other words, data may be sent via communication channel 714 from a partition implemented in IC 606-1 to a destination partition in another IC 606, with the data being presented to the destination partition as expected in the same manner as if the two partitions were directly connected (e.g., in a same IC 606).
ICs 606 may include configuration data that specifies the portion of the circuit design being emulated/prototyped, and may further include configuration details for the various PHYs 706 of transceivers 608. In an example, TX circuit 702 and RX circuit 704 may be implemented using programmable circuitry and may be coupled to PHY 706 as illustrated.
Transceiver 608 may be operated in a “raw mode,” in which transceiver 608 sends and receives raw data. Raw data is data that is transmitted “as-is” (e.g., with one or more features of transceiver 608 disabled or bypassed). Raw mode may be useful to reduce latency within and/or amongst transceivers 608. Raw mode may include, for example and without limitation, bypassing line encoding circuitry (e.g., without 8b/10b or 64b/66b encoding), buffers, memory, and/or other available features of transceiver 608. In the example of
Where PCS 712 includes alignment logic, the alignment logic may be disabled to further reduce latency in PHY 706. Where PCS 712 includes enumeration logic that locates byte boundaries for channel alignment, the enumeration logic may be architected so that alignment is limited (e.g., limited to a 32-bit (i.e., 4-byte) boundary). If alignment cannot be achieved, the alignment starts anew. Such an architecture may help to ensure minimum and predictable latency. When bypassing buffers, such as buffer 710, configurable/programmable logic of the respective IC 606 may perform phase alignment. The phase alignment may be performed by a respective partition of the circuit design that interfaces with TX circuit 702 and/or RX circuit 704.
Further in
In
Framing circuit 804 samples data from signals of partitioned nets 814, and packetizes the data. Framing circuit 804 may compute and add error-detection code to the packets. The error-detection code may include, without limitation, cyclic redundancy checks (CRCs) and/or parity bit(s).
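As an illustrative and non-limiting sketch, the addition and checking of an error-detection code may be modeled in software as follows. The sketch assumes a 32-bit CRC (computed with the zlib.crc32 function of the Python standard library) appended to the sampled data, and an even parity bit; the disclosure does not specify a CRC polynomial, CRC width, or packet layout, and the function names are hypothetical.

```python
import zlib


def parity_bit(payload: bytes) -> int:
    """Even-parity bit: 1 if the payload holds an odd number of 1 bits."""
    return bin(int.from_bytes(payload, "big")).count("1") & 1


def packetize(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 to the sampled data (assumed layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(packet: bytes) -> bool:
    """Recompute the CRC on the receive side and compare with the trailer."""
    payload, crc = packet[:-4], packet[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == crc
```

A receiver that recomputes the CRC over the received data and finds a mismatch can flag the packet as corrupted.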
Scrambler circuit 806 scrambles the packetized data. Scrambling may be useful for DC balancing and clock data recovery (CDR). Scrambler circuit 806 may apply additive or multiplicative scrambling to the packetized data. Additive scrambling requires a receiver to be synchronized with a known pattern, whereas multiplicative scrambling is self-synchronizing and needs no such synchronization. Multiplicative scrambling may be suitable where an environment in which computing platform 600 operates is not unduly harsh or noisy. Transceivers 608 may synchronize with one another based on a synchronization (synch) pattern. Scrambler circuit 806 in TX circuit 702 and a descrambler circuit of an RX circuit of another transceiver may be reset at periodic intervals to adjust for drift during periods of relatively extended operation.
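As an illustrative and non-limiting sketch, the self-synchronizing property of multiplicative scrambling may be modeled bit-serially as follows. The sketch assumes the polynomial 1 + x^39 + x^58, which is the scrambler polynomial used with 64b/66b Ethernet coding; the disclosure does not specify a polynomial for scrambler circuit 806, and the function names are hypothetical.

```python
MASK = (1 << 58) - 1  # 58-bit shift register (assumed polynomial degree)


def _taps(state: int) -> int:
    # XOR of the bits that were shifted in 39 and 58 steps ago.
    return ((state >> 38) & 1) ^ ((state >> 57) & 1)


def scramble(bits, state=0):
    """Multiplicative scrambler: the register holds past OUTPUT bits."""
    out = []
    for b in bits:
        s = b ^ _taps(state)
        out.append(s)
        state = ((state << 1) | s) & MASK
    return out


def descramble(bits, state=0):
    """Descrambler: the register holds past RECEIVED bits, so it aligns
    with the scrambler once 58 bits have been received."""
    out = []
    for s in bits:
        out.append(s ^ _taps(state))
        state = ((state << 1) | s) & MASK
    return out
```

Because the descrambler register is filled from the received stream itself, a descrambler that starts from an arbitrary state recovers the original data after the first 58 received bits, without a separate synchronization pattern.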
Before transceiver 608 is able to communicate user emulation data to another transceiver coupled to communication channel 714, the transceivers need to be enumerated and achieve block lock. In an example implementation, framing circuit 804, e.g., upon power on or upon reset, is capable of transmitting signals as a training pattern referred to as TP1 via transmit channel 716 to another transceiver coupled to transmit channel 716. In response to the other transceiver (e.g., the RX circuit thereof) receiving TP1 and aligning with TP1, the TX circuit of the other transceiver transmits a block lock training pattern referred to as TP2 to transceiver 608 (e.g., to RX circuit 704). In response to receiving TP2, transceiver 608 is ready to begin transmitting user data. In an example implementation, as a precautionary measure, the enumeration process described above may be repeated multiple successive times (e.g., 3 times) to avoid accidental data alignment and block lock corresponding to accidental detection of TP2.
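As an illustrative and non-limiting sketch, the repeated TP1/TP2 exchange may be modeled as a loop that declares the channel enumerated only after a number of successive successful exchanges. The Peer class, the enumerate_link function, and the max_attempts limit are hypothetical and are introduced only for illustration; the count of 3 follows the example given above.

```python
REQUIRED_ROUNDS = 3  # the text gives 3 successive repetitions as an example


class Peer:
    """Hypothetical far-end transceiver that answers TP1 with TP2 once
    its RX circuit has aligned with TP1."""

    def __init__(self, aligned=True):
        self.aligned = aligned

    def receive(self, pattern):
        return "TP2" if pattern == "TP1" and self.aligned else None


def enumerate_link(peer, rounds=REQUIRED_ROUNDS, max_attempts=10):
    """Return True once `rounds` successive TP1/TP2 exchanges succeed."""
    streak = 0
    for _ in range(max_attempts):
        reply = peer.receive("TP1")  # send TP1 and wait for the reply
        streak = streak + 1 if reply == "TP2" else 0  # any miss restarts
        if streak == rounds:
            return True  # block lock achieved; user data may be sent
    return False
```

Requiring successive successes means that a single accidental detection of TP2 does not, by itself, result in block lock.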
The enumeration logic described above (e.g., TX circuit 702 and RX circuit 704) requires few resources and has a small footprint on IC 606, thereby leaving most of the circuit resources of IC 606 available for emulation. Once communication channel 714 is enumerated, emulation data (e.g., user data) may be transmitted. Transmission of emulation data via communication channel 714 may begin with edge detector circuit 802 detecting an active edge of emulation clock 808 (e.g., either a rising or falling edge). In response to detecting an active edge, edge detector circuit 802 notifies framing circuit 804. In response, framing circuit 804 latches incoming signals, e.g., data, on partitioned nets 814. Data from partitioned nets 814 is sampled in the transceiver clock domain. Framing circuit 804 is capable of packetizing the emulation data before sending to scrambler circuit 806 and PHY 706. In one aspect, each packet may be structured to include a Start of Frame (SOF), data, and an End of Frame (EOF). As noted, framing circuit 804 may also be configured to add an error-detection code to each packet. In the example of
In one aspect, as part of the design flow to implement the circuit design in computing platform 600, any nets crossing from the emulation clock domain, e.g., partitioned nets 814, are timed with delay constraints such as “set_max_delay” constraints. The “set_max_delay” constraint establishes a data valid window that allows the signal to be stable before the signal is latched in the transceiver clock domain. The delay constraints serve to reduce latency in the resulting circuitry as signals cross from the emulation clock domain to the transceiver clock domain. Since the “set_max_delay” with “data_path_only” flag does not account for clock skew, additional margin may be included before data is captured by framing circuit 804.
The approach described herein, where emulation clock 808 is received by edge detector circuit 802, eliminates the need for clock domain crossing circuits such as First-In-First-Out (FIFO) memories and/or Block Random Access Memories (BRAMs) designed for a multi-bit bus. Such is the case as the data received over partitioned nets 814 is aligned with emulation clock 808. Having received data that is time aligned with emulation clock 808, there is no need for clock domain crossing circuitry to address meta-stability since stability of the data may be accurately predicted in the transceiver clock domain and circuitry therein may be timed to latch stable data.
Electronic Design Automation (EDA) tools use multiple approaches for emulating circuit designs. For example, some EDA vendors use PLLs to generate design/emulation clocks, whereas other vendors use fixed, high-frequency clocks for all sequential logic coupled with low-speed data enables. The active edge detection logic described herein as implemented in edge detector circuit 802 is capable of detecting the start of a cycle of emulation clock 808 when present. Edge detector circuit 802 is also capable of successfully detecting a start of a cycle in cases where emulation clock enable 812 is present. Once the start of frame is detected, edge detector circuit 802 is capable of triggering framing circuit 804 to start packetization and transmission. Edge detector circuit 802 is also capable of generating the necessary enables for latching data by framing circuit 804.
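As an illustrative and non-limiting sketch, active edge detection may be modeled as a comparison of successive samples of the emulation clock (or of the emulation clock enable) taken in the transceiver clock domain. The function name active_edges and the list-of-samples representation are hypothetical and are introduced only for illustration.

```python
def active_edges(samples, rising=True):
    """Return the sample indices at which an active edge is observed.

    `samples` is a sequence of 0/1 values of the emulation clock (or
    clock enable) as sampled on successive transceiver clock cycles.
    """
    edges = []
    prev = samples[0] if samples else 0
    for i, cur in enumerate(samples[1:], start=1):
        if rising and prev == 0 and cur == 1:
            edges.append(i)  # rising edge: start of an emulation cycle
        elif not rising and prev == 1 and cur == 0:
            edges.append(i)  # falling edge used as the active edge
        prev = cur
    return edges
```

Each returned index corresponds to a transceiver clock cycle on which the framing circuit could be triggered to latch data and begin packetization.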
As an illustrative and non-limiting example, consider the case where N=512 and M=8. In this example, the transceiver clock runs at 8 times the frequency of the emulation clock providing 8 slots on which the received emulation data may be sampled. That is, for a given cycle of the emulation clock, there are 8 slots (e.g., 8 cycles) of the transceiver clock. Thus, the emulation data may be divided into 8 groups, where each group is captured on a different slot. In the example, a signal “din” (corresponding to partitioned nets 814) is received. Din is 512 bits in width (e.g., N=512). In the example, din is organized into 8 groups, where each group includes 64 bits of the 512-bit signal. At slot (e.g., clock cycle) 0, bits 0:63 are sampled. At slot 1, bits 64:127 are sampled and so forth as illustrated in
It should be appreciated that groups may be formed to include other numbers of signals. For example, while
Referring again to
The timing constraints that are applied to partitioned nets 814 in consequence of the slots used by transceivers 608 may be leveraged by the EDA tools including the partitioner. During partitioning performed on the circuit design, for example, the partitioner may allocate timing critical nets of partitioned nets 814 with high timing delays to later slots while nets of partitioned nets 814 that are not critical or are less critical and have low timing delays may be assigned to earlier slots. Other signals may be assigned to respective groups based on logic delays or logic levels to improve performance (e.g., reduce timing violations).
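As an illustrative and non-limiting sketch, the grouping of an N-bit signal into M slots, and the assignment of nets to slots based on timing delay, may be modeled as follows. The function names, the even split of nets across slots, and the use of a per-net delay estimate in nanoseconds are assumptions made for illustration; the disclosure does not specify the partitioner's algorithm.

```python
def slot_bit_ranges(n_bits, n_slots):
    """Inclusive (first, last) bit indices captured on each slot.

    Assumes n_bits is a multiple of n_slots (e.g., N=512, M=8).
    """
    width = n_bits // n_slots
    return [(s * width, (s + 1) * width - 1) for s in range(n_slots)]


def assign_slots(net_delays, n_slots):
    """Map each net name to a slot: low-delay nets go to earlier slots,
    high-delay (timing critical) nets go to later slots."""
    ordered = sorted(net_delays, key=net_delays.get)  # ascending delay
    per_slot = -(-len(ordered) // n_slots)            # ceiling division
    return {net: i // per_slot for i, net in enumerate(ordered)}
```

For N=512 and M=8, slot_bit_ranges yields bits 0:63 for slot 0, bits 64:127 for slot 1, and so forth up to bits 448:511 for slot 7.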
Partitioned nets 814 may be constrained in the circuit design using “max_delay” constraints and introducing necessary delay setups so that nets assigned to slot 0 have the highest timing penalty while nets assigned to slot 7 have the lowest timing penalty. By applying constraints as described, place and route tools are better able to reach a solution as circuit components generating signals assigned to higher slots may be located farther away from transceiver 608. Since PHY 706 is an asynchronous interface, there is no need to constrain pins of PHY 706. By comparison, when using Select I/Os, the Select I/Os are timed for input and output delays.
Select I/O refers to a class of input/output pins that can be driven high (VCC) or low (GND) directly through Register Transfer Level (RTL) code. In some ICs, Select I/O pins may be grouped in clusters called banks. The Select I/Os may be configured to operate at different voltages thereby allowing the IC to communicate with a range of different devices. Select I/Os are limited in terms of speed of operation to a range of approximately 500 MHz to 1.6 GHz. By comparison, the examples described herein utilizing transceivers are capable of operating at speeds ranging from approximately 500 MHz to 28 GHz.
Referring to
In the example of
The example implementation of
The example of
Referring to
Were a timing violation to occur without the architectures of
Descrambler circuit 1104 is capable of performing the inverse operation performed by scrambler circuit 806. Extractor circuit 1106 is capable of de-multiplexing the received emulation data and sending the de-multiplexed emulation data as signals on partitioned nets 814 to the circuitry 1112 in IC 606 that is emulating the circuit design.
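As an illustrative and non-limiting sketch, the de-multiplexing performed on the receive side may be modeled as reassembling the per-slot words into a single N-bit value, which is the inverse of the slot-by-slot sampling on the transmit side. The function names and the convention that bit 0 is the least significant bit are assumptions made for illustration.

```python
def multiplex(value, n_slots, width):
    """TX-side model: split an N-bit value into per-slot words.

    Slot s carries bits s*width through (s+1)*width - 1.
    """
    mask = (1 << width) - 1
    return [(value >> (s * width)) & mask for s in range(n_slots)]


def demultiplex(slot_words, width):
    """RX-side model: place each slot word back at its bit offset."""
    mask = (1 << width) - 1
    value = 0
    for slot, word in enumerate(slot_words):
        value |= (word & mask) << (slot * width)
    return value
```

With 8 slots of 64 bits each, demultiplex(multiplex(v, 8, 64), 64) returns the original 512-bit value v.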
In the example of
Extractor circuit 1106 may also include a RAM 1110. In the example of
In one aspect, PHY 706 is configurable to operate in a 32-bit mode or a 64-bit mode. The 32-bit mode may be used with lower line rates, while the 64-bit mode may be used with higher line rates. Operation of PHY 706 may be limited to the 32-bit and 64-bit modes to bypass circuits such as any TX and/or RX up/down-size circuits, as such circuits introduce additional latency into the signal path.
Referring to
With reference to
In the examples, the SOF and EOF may be implemented as special characters set with a specific value. In such an arrangement, detecting SOF and EOF does not require full 32-bit or 64-bit comparators, as the case may be. Instead, comparators may be designed that need only evaluate a few bytes/nibbles to successfully detect SOF and EOF. This configuration for implementing comparators to detect SOF and/or EOF requires fewer resources in ICs 606.
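As an illustrative and non-limiting sketch, a reduced comparator may be modeled as a comparison of only the most significant byte of a received word against a tag value. The tag values 0xFB and 0xFD, the tag position, and the function names are hypothetical; the disclosure does not specify the values of the special characters. The sketch further assumes that ordinary data words cannot carry these tag values in the compared position.

```python
SOF_TAG = 0xFB  # hypothetical start-of-frame tag value
EOF_TAG = 0xFD  # hypothetical end-of-frame tag value
WORD_BITS = 32  # 32-bit mode; 64 would model the 64-bit mode


def top_byte(word):
    """Extract the only byte the reduced comparator looks at."""
    return (word >> (WORD_BITS - 8)) & 0xFF


def is_sof(word):
    return top_byte(word) == SOF_TAG  # 8-bit compare, not 32-bit


def is_eof(word):
    return top_byte(word) == EOF_TAG
```

In this model, each detector compares 8 bits rather than 32 or 64 bits, which corresponds to a smaller comparator in the fabric.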
In the examples of
In the case where the net is partitioned using cut 1, the net is broken near FF 1502. Accordingly, FF 1502 is located in the driving IC 606 (TX side). Combinatorial logic 1504 and FF 1506 are located in the destination IC 606 (RX side). Using cut 1 for the partition causes the driving IC including FF 1502 to have minimum timing impact as there are no logic levels. Accordingly, the net may be scheduled to slot 0. As discussed, the nets assigned to slot 0 on the TX side have a high timing penalty and must adhere to one slot clock cycle. Slot 0 on the RX side, however, has the highest timing margin of the slots since nets assigned to slot 0 arrive the earliest thereby allowing for relaxed timing (e.g., more time to reach the load). Accordingly, referring to the prior example clock speeds, using cut 1 with the net assigned to slot 0, the setup time on the TX side will be 5 ns. In the destination IC on the RX side, the net may be scheduled with relaxed timing to allow the signal on the net time to traverse through combinatorial logic 1504 to FF 1506. The setup time on the RX side will be up to 40 ns.
In the case where cut 2 is used for the partitioning, combinatorial logic 1504 is subdivided so that a portion of combinatorial logic 1504 is located on the TX side and the other portion of combinatorial logic 1504 is located on the RX side. In that case, the net may be assigned to an intermediate slot such as slot 3. Slot 3 offers a balanced timing penalty with respect to both the TX and RX sides.
In the case where cut 3 is used for the partitioning, the net on the driving side is scheduled to slot 7 so that timing is more relaxed on the TX side. On the RX side, however, slot 7 results in the highest timing penalty with the minimum setup time.
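As an illustrative and non-limiting sketch, the trade-off between TX-side and RX-side timing for the three cuts may be modeled with a simple linear budget per slot. The model assumes 8 slots and a 5 ns slot clock cycle (40 ns emulation clock period), consistent with the 5 ns and 40 ns figures given for cut 1; the linear form of the budgets and the function names are assumptions made for illustration and are not taken from the disclosure.

```python
SLOT_NS = 5.0  # one slot clock cycle, per the cut 1 example
N_SLOTS = 8    # slots 0 through 7


def tx_budget_ns(slot):
    """Time available on the TX side before the slot is sampled."""
    return (slot + 1) * SLOT_NS


def rx_budget_ns(slot):
    """Time remaining on the RX side after the slot's data arrives."""
    return (N_SLOTS - slot) * SLOT_NS


def feasible_slots(tx_delay_ns, rx_delay_ns):
    """Slots whose TX and RX budgets both cover the given logic delays."""
    return [s for s in range(N_SLOTS)
            if tx_delay_ns <= tx_budget_ns(s)
            and rx_delay_ns <= rx_budget_ns(s)]
```

Under these assumptions, a net cut near the driving flip-flop (little TX-side logic) fits slot 0, a net cut near the load fits slot 7, and a net whose logic is split roughly evenly fits an intermediate slot such as slot 3.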
In a conventional emulation system that uses Select I/O based pin multiplexing, the partitioning tool spends a significant amount of time finding the lowest multiplexing ratios to keep the emulation clock frequency high. Recall that the lower the pin multiplexing ratio, the higher the emulation clock frequency. Within conventional emulation systems using Select I/O, moving from one multiplexing ratio to the next incurs a significant performance penalty. This penalty may be as low as a 10% slow-down in the emulation clock frequency, but is often more. Because of the significant performance penalty incurred, the partitioner tends to be particularly vigilant in finding a partitioning solution for the circuit design having the lowest multiplexing ratios. It is not uncommon for a partitioner to run for many hours to partition a complex circuit design.
In accordance with the inventive arrangements described within this disclosure, since the slots are at the 32/64-bit boundary and are transmitted at the PHY line rate (e.g., up to 26 Gbps), the penalty of moving to the next multiplexing ratio is typically a reduction in emulation clock frequency of about 5% or less. In many cases, the penalty is closer to a 1% slow-down. The lower penalty means that the partitioner may move to a next higher multiplexing ratio without incurring a noticeable performance degradation. As such, the partitioner may be less strict. Further, in increasing the multiplexing ratio, the partitioner may have more than enough slots available and may therefore leave one or more of the lower slots, such as slot 0, unused. The inputs to circuitry corresponding to slot 0, for example, may be tied to ground by the EDA tools. The partitioner would start assigning partitioned nets to slot 1 and proceed to assign signals to slots 2, 3, 4, 5, 6 and 7, with slot 0 being unused. The largely penalty-free ability to move to a next higher multiplexing ratio means that the partitioner is able to generate a partitioning of a larger circuit design in much less time than would otherwise be the case. A partitioner configured to operate as described within this disclosure using the transceiver architectures described may complete a partitioning of a circuit design hours before a partitioner using conventional techniques. Table 1 below illustrates example data points showing the performance penalties incurred with respect to emulation clock speed as the multiplexing ratio is increased for a Select I/O solution and the inventive arrangements described herein (the transceiver solution).
The example implementations described herein also provide lower latencies compared to other emulation systems. Table 2 below illustrates total latency achieved for various line rates.
Bus 1606 represents one or more of any of a variety of communication bus structures. By way of example, and not limitation, bus 1606 may be implemented as a Peripheral Component Interconnect Express (PCIe) bus. Computer 1600 typically includes a variety of computer system readable media. Such media may include computer-readable volatile and non-volatile media and computer-readable removable and non-removable media.
Program/utility 1614 may be stored in memory 1604. By way of example, program/utility 1614 may include program code corresponding to an operating system, one or more application programs, other executable instructions and/or scripts, and/or program data. Program/utility 1614, when executed by processor 1602, generally carries out the functions and/or methodologies of the example implementations described within this disclosure. Program/utility 1614 and any data items used, generated, and/or operated upon by computer 1600 are functional data structures that impart functionality when employed by computer 1600.
Computer 1600 may include one or more Input/Output (I/O) interfaces 1618 communicatively linked to bus 1606. I/O interface(s) 1618 allow computer 1600 to communicate with external devices, couple to external devices that allow user(s) to interact with computer 1600, couple to external devices that allow computer 1600 to communicate with other computing devices, and the like. For example, computer 1600 may be communicatively linked to a display 1620 and to external system 1622 through I/O interface(s) 1618. In an example, external system 1622 may be computing platform 600. Computer 1600 may be coupled to other external devices such as a keyboard (not shown) via I/O interface(s) 1618. Examples of I/O interfaces 1618 may include, but are not limited to, network cards, modems, network adapters, hardware controllers, etc.
Computer 1600 is an example of a data processing system and/or computer hardware that is capable of performing various operations described herein. Computer 1600 can be practiced as a standalone computer system such as a server, as part of a computer cluster (e.g., one or more interconnected or networked computers), or in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices. The example of
Computer 1600 may include fewer components than shown or additional components not illustrated in
Computer 1600 is also an example implementation of one or more EDA tools including a partitioner. Program/utility 1614 may include program code that is capable of performing partitioning of a circuit design and a design flow (e.g., synthesis, placement, routing, and/or configuration data generation) on the partitioned circuit design as described herein. In this regard, computer 1600 serves as an example of one or more EDA tools or a system that is capable of processing circuit designs and/or generating configuration data that may be loaded into ICs 606 to emulate the circuit design in computing platform 600.
At 1702, computer 1600 determines a cut of a net of the circuit design. Computer 1600 may cut the net as part of a partitioning process performed to emulate the circuit design using an emulation system. Each resulting partition of the circuit design may be assigned to, and emulated by, circuitry in a different IC 606.
At 1704, computer 1600 assigns the net to a slot selected from a plurality of slots corresponding to a transceiver clock of a transceiver in an IC 606 of computing platform 600. In one aspect, the selected slot is selected based on a location of the cut along the net. For example, the system may select a slot as described in connection with
In another aspect, the plurality of emulation nets may be organized into the plurality of groups with each group being allocated to one of the plurality of slots. The plurality of slots corresponds to the transceiver clock. Each of the emulation nets may be assigned to one of the groups based on a location of a cut of the emulation net. For example, referring to
At 1706, computer 1600 assigns a first timing constraint (e.g., one or more timing constraints) to a first portion of the net extending from a driver of the net to the cut. For example, computer 1600 may assign one or more timing constraints to the signal path from FF 1502 to the cut, whether cut 1, cut 2, or cut 3. Computer 1600 may assign a second timing constraint (e.g., one or more timing constraints) to a second portion of the net extending from the cut to a load of the net. For example, computer 1600 may assign one or more timing constraints to the signal path from the cut (e.g., cut 1, cut 2, or cut 3) to FF 1506.
Regardless of the cut, the first and second timing constraints are generated to depend on the slot to which the net is assigned. Assignment of timing constraints is described in connection with
At 1710, computer 1600 implements partitions of the circuit design including the net using the first and second timing constraints. Computer 1600 may, for example, perform synthesis, placement, and routing of the partitions for implementation in different ICs 606 of computing platform 600. Once a design flow has been performed using the timing constraints, the resulting configuration data may be loaded into the respective ICs 606 of computing platform 600 to emulate the circuit design.
Method 1700 may further include changing the slot of the net post implementation of the circuit design in computing platform 600. For example, the slot of the net may be exchanged or swapped with another slot to alleviate a timing violation of the net.
Method 1700 may further include assigning the net to a slot by excluding one or more slots from consideration. In assigning the net to a slot, for example, slot 0 may be omitted from consideration by the system, leaving only slots 1-7 for assigning the net.
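The two optional refinements of method 1700, excluding slots from consideration and exchanging slots after implementation, can be sketched as simple bookkeeping operations. The function names and the dictionary representation of the net-to-slot assignment are hypothetical and are used only for illustration.

```python
def candidate_slots(num_slots: int = 8, excluded=(0,)):
    """List the slots the partitioner may use after exclusions.

    With the defaults, slot 0 is omitted, leaving slots 1 through 7.
    """
    return [slot for slot in range(num_slots) if slot not in excluded]


def swap_slots(assignment: dict, net_a: str, net_b: str) -> dict:
    """Return a new net-to-slot mapping with the slots of two nets exchanged.

    This models a post-implementation slot change made to alleviate a
    timing violation; the original mapping is left unmodified.
    """
    updated = dict(assignment)
    updated[net_a], updated[net_b] = assignment[net_b], assignment[net_a]
    return updated
```

For example, `swap_slots({"net_a": 1, "net_b": 5}, "net_a", "net_b")` moves `net_a` to slot 5 and `net_b` to slot 1 without touching any other net.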
At 1802, IC 100-1 receives signal 110-1 from IC 100-3.
At 1804, receiver circuitry 104-1 de-serializes signal 110-1.
At 1806, receiver circuitry 104-1 extracts data 116-1 from de-serialized signal 112.
At 1808, bypass control circuitry 138 routes extracted data 116-1 to functional circuitry 102 (
Method 1800 may further include processing extracted data 116-1 with receive-side data processing circuitry 118 when extracted data 116-1 is routed to functional circuitry 102. Receive-side data processing may include converting extracted data 116-1 to a protocol of functional circuitry 102.
Method 1800 may further include framing and serializing extracted data 116-1 when bypass control circuitry 138 routes extracted data 116-1 to transmitter circuitry 106-1.
Method 1800 may further include disabling selectable features of receive-side physical layer circuitry within receiver circuitry 104-1, and disabling selectable features of transmit-side physical layer circuitry within transmitter circuitry 106-1 when extracted data 116-1 is routed to transmitter circuitry 106-1.
Method 1800 may further include multiplexing multiple streams of outgoing data to transmitter circuitry 106-1.
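The routing decision of method 1800 can be modeled behaviorally as shown below. This is a software sketch of the selection logic only; the `Mode` enumeration and the function names are assumptions introduced for the example and do not correspond to named elements of the disclosure.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = 0  # extracted data goes to the functional circuitry
    BYPASS = 1  # extracted data goes straight to the transmitter


def route_extracted_data(data: bytes, mode: Mode):
    """Return a (to_functional, to_transmitter) pair for the extracted data.

    Exactly one element of the pair carries the data; the other is None.
    """
    if mode is Mode.BYPASS:
        return None, data
    return data, None


def select_tx_source(mode: Mode, bypass_data: bytes, functional_data: bytes) -> bytes:
    """Model the transmit-side multiplexer that picks the outgoing data."""
    return bypass_data if mode is Mode.BYPASS else functional_data
```

In bypass mode the extracted data never reaches the functional circuitry, which is what allows an IC to act as a low latency switch between two other ICs.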
The foregoing and other implementations can each optionally include one or more of the following features, alone or in combination. Some example implementations include all the following features in combination.
An emulation system can include a first IC including first circuitry and a first transceiver. The first circuitry is configured to emulate a first partition of a circuit design. The first circuitry is clocked by an emulation clock and the first transceiver is clocked by a transceiver clock asynchronous with the emulation clock. The transceiver clock has a higher frequency than the emulation clock. The emulation system can include a second IC configured to emulate a second partition of the circuit design. The second IC includes a second transceiver. The first transceiver is configured to generate multiplexed emulation data by multiplexing a plurality of nets that cross from the first partition to the second partition of the circuit design. The first transceiver is configured to send the multiplexed emulation data over a serial communication channel to the second transceiver. The multiplexed emulation data includes a clock signal of the first transceiver embedded therein.
In one aspect, the first transceiver includes a physical layer circuit configured to send the multiplexed emulation data over the serial communication channel using differential signaling. The plurality of nets belong to a same clock domain.
In another aspect, the plurality of nets are organized into a plurality of groups, wherein each group is allocated to one of a plurality of slots, the plurality of slots corresponding to the transceiver clock. Each net is assigned to one of the groups based on a location of a cut of the net.
In another aspect, the plurality of nets are organized into a plurality of groups, wherein each group is allocated to one of a plurality of slots, the plurality of slots corresponding to the transceiver clock. The first partition of the circuit design may be implemented in the first IC using timing constraints for the plurality of nets that depend on the slot to which each net is assigned.
In another aspect, the second partition of the circuit design is implemented in the second IC using timing constraints for the plurality of nets that depend on the slot to which each net is assigned.
In another aspect, the plurality of nets are organized into a plurality of groups, wherein each group is allocated to one of a plurality of slots corresponding to the transceiver clock. The circuit design may be partitioned into the first partition and the second partition by preventing any of the plurality of groups from being allocated to an earliest slot of the plurality of slots.
In another aspect, the plurality of nets are organized into a plurality of groups, wherein each group is allocated to one of a plurality of slots corresponding to the transceiver clock. Subsequent to implementation of the first partition of the circuit design within the first integrated circuit, at least one of the plurality of groups is re-allocated to a different slot.
In another aspect, the first transceiver includes a framing circuit block configured to generate packets of the multiplexed emulation data and generate an error-detection code that is included with each packet for sending to the second transceiver.
In another aspect, the packets are sent to the second transceiver using raw mode.
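The multiplexing and framing aspects above can be illustrated with a short sketch that packs net values into 32-bit slot words and wraps them in a packet carrying an error-detection code. The disclosure does not specify the code or the marker values; CRC-32 (via the standard `zlib` module), the `SOF` and `EOF` constants, and the packet layout are assumptions chosen for the example.

```python
import struct
import zlib

SOF = 0xFB000000  # hypothetical start-of-frame word
EOF = 0xFD000000  # hypothetical end-of-frame word


def pack_slot(net_values) -> int:
    """Multiplex up to 32 single-bit net values into one 32-bit slot word."""
    if len(net_values) > 32:
        raise ValueError("a 32-bit slot holds at most 32 nets")
    word = 0
    for bit_index, value in enumerate(net_values):
        word |= (value & 1) << bit_index
    return word


def build_packet(slot_words) -> bytes:
    """Frame slot words as SOF | payload | CRC-32 | EOF (big-endian words)."""
    payload = struct.pack(f">{len(slot_words)}I", *slot_words)
    crc = zlib.crc32(payload)  # assumed error-detection code
    return struct.pack(">I", SOF) + payload + struct.pack(">II", crc, EOF)


def parse_packet(packet: bytes):
    """Check framing and CRC, then return the list of slot words."""
    (sof,) = struct.unpack(">I", packet[:4])
    crc, eof = struct.unpack(">II", packet[-8:])
    payload = packet[4:-8]
    if sof != SOF or eof != EOF:
        raise ValueError("bad framing")
    if zlib.crc32(payload) != crc:
        raise ValueError("CRC mismatch")
    return list(struct.unpack(f">{len(payload) // 4}I", payload))
```

A receiver built this way can reject a corrupted packet rather than forwarding incorrect net values into the second partition.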
An IC can include first circuitry configured to emulate a partition of a circuit design. The first circuitry is clocked by an emulation clock. The IC includes a transceiver coupled to the first circuitry. The transceiver is clocked by a transceiver clock that is asynchronous with the emulation clock and that has a higher frequency than the emulation clock. The transceiver can include an edge detector circuit configured to detect edges of the emulation clock and a framing circuit configured to generate multiplexed emulation data by multiplexing a plurality of nets of the first circuitry. The framing circuit further generates packets of the multiplexed emulation data. The framing circuit is operative responsive to the edge detector circuit. The transceiver can include a scrambler circuit configured to scramble the packets from the framing circuit. The transceiver also can include a physical layer circuit (PHY) configured to send the scrambled packets over a serial communication channel. The scrambled packets include a clock signal of the transceiver embedded therein.
In one aspect, the PHY is configured to send the multiplexed emulation data over the serial communication channel using differential signaling. The plurality of nets belong to a same clock domain.
In another aspect, the plurality of nets are organized into a plurality of groups, wherein each group is allocated to one of a plurality of slots corresponding to the transceiver clock. The partition of the circuit design is implemented using timing constraints that depend on the slot to which each net is assigned.
In another aspect, the plurality of nets are organized into a plurality of groups, wherein each group is allocated to one of a plurality of slots corresponding to the transceiver clock. The circuit design is partitioned by preventing any of the plurality of groups from being allocated to an earliest slot of the plurality of slots.
In another aspect, the plurality of nets are organized into a plurality of groups, wherein each group is allocated to one of a plurality of slots corresponding to the transceiver clock. Subsequent to implementation of the partition of the circuit design, at least one of the plurality of groups is re-allocated to a different slot.
In another aspect, the framing circuit is configured to generate an error-detection code that is included with each packet for sending over the serial communication channel.
In another aspect, the packets are sent over the serial communication channel using raw mode.
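Two of the transceiver blocks named above, the edge detector and the scrambler, can be modeled in a few lines. The sketch assumes that the emulation clock is sampled in the faster transceiver clock domain and that the scrambler is an additive scrambler driven by a 16-bit linear feedback shift register. The disclosure does not specify a scrambler polynomial or seed, so the taps and the default seed here are illustrative assumptions.

```python
def rising_edges(samples):
    """Return sample indices where the sampled emulation clock goes 0 -> 1."""
    return [i for i in range(1, len(samples))
            if samples[i - 1] == 0 and samples[i] == 1]


def lfsr_bytes(seed: int, count: int):
    """Yield `count` key bytes from a 16-bit Fibonacci LFSR (assumed taps)."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("LFSR seed must be non-zero")
    for _ in range(count):
        key = 0
        for _bit in range(8):
            feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
            state = (state >> 1) | (feedback << 15)
            key = (key << 1) | feedback
        yield key


def scramble(data: bytes, seed: int = 0xACE1) -> bytes:
    """XOR data with the LFSR key stream; applying it twice restores the data."""
    return bytes(b ^ k for b, k in zip(data, lfsr_bytes(seed, len(data))))
```

Scrambling breaks up long runs of identical bits, which helps a receiver recover the embedded clock from the serial stream. Because the operation is an XOR with a reproducible key stream, the receiver can descramble by calling `scramble` with the same seed.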
IC 100, ICs 606, IC 1902, and/or IC 1904 may include one or more of a variety of types of configurable circuit blocks, such as described below with reference to
One or more tiles may include a programmable interconnect element (INT) 2011 having connections to input and output terminals 2020 of a programmable logic element within the same tile and/or to one or more other tiles. A programmable INT 2011 may include connections to interconnect segments 2022 of another programmable INT 2011 in the same tile and/or another tile(s). A programmable INT 2011 may include connections to interconnect segments 2024 of general routing resources between logic blocks (not shown). The general routing resources may include routing channels between logic blocks (not shown) including tracks of interconnect segments (e.g., interconnect segments 2024) and switch blocks (not shown) for connecting interconnect segments. Interconnect segments of general routing resources (e.g., interconnect segments 2024) may span one or more logic blocks. Programmable INTs 2011, in combination with general routing resources, may represent a programmable interconnect structure.
A CLB 2002 may include a configurable logic element (CLE) 2012 that can be programmed to implement user logic. A CLB 2002 may also include a programmable INT 2011.
A BRAM 2003 may include a BRAM logic element (BRL) 2013 and one or more programmable INTs 2011. The number of interconnect elements included in a tile may depend on a height of the tile. A BRAM 2003 may, for example, have a height of five CLBs 2002. Other numbers (e.g., four) may also be used.
A DSP block 2006 may include a DSP logic element (DSPL) 2014 in addition to one or more programmable INTs 2011. An IOB 2004 may include, for example, two instances of an input/output logic element (IOL) 2015 in addition to one or more instances of a programmable INT 2011. An I/O pad connected to, for example, an I/O logic element 2015, is not necessarily confined to an area of the I/O logic element 2015.
A logic block (e.g., programmable or fixed-function) may disrupt a columnar structure of configurable circuitry 2000. For example, processor 2010 spans several columns of CLBs 2002 and BRAMs 2003. Processor 2010 may include one or more of a variety of components ranging from, without limitation, a single microprocessor to a complete programmable processing system of microprocessor(s), memory controllers, and/or peripherals.
In the preceding, reference is made to embodiments presented in this disclosure. However, the scope of the present disclosure is not limited to specific described embodiments. Instead, any combination of the described features and elements, whether related to different embodiments or not, is contemplated to implement and practice contemplated embodiments. Furthermore, although embodiments disclosed herein may achieve advantages over other possible solutions or over the prior art, whether or not a particular advantage is achieved by a given embodiment is not limiting of the scope of the present disclosure. Thus, the preceding aspects, features, embodiments and advantages are merely illustrative and are not considered elements or limitations of the appended claims except where explicitly recited in a claim(s).
As will be appreciated by one skilled in the art, the embodiments disclosed herein may be embodied as a system, method or computer program product. Accordingly, aspects may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium is any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present disclosure are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments presented in this disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various examples of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
While the foregoing is directed to specific examples, other and further examples may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.