Modern semiconductor chips include a variety of circuits and components to facilitate fast and efficient computation. When transferring information between functional blocks in a semiconductor chip, electrical signals are typically sent on metal traces. Transmitters in a first functional block send the electrical signals across the metal traces. Receivers in a second functional block receive the electrical signals. In some cases, the two functional blocks are within a same die. In other cases, the two functional blocks are on separate dies.
The processing speed of information processing systems and devices continues to increase as new systems and devices are developed. When data signals and corresponding clock signals are sent between functional blocks, the signals can become misaligned with respect to one another. Realigning the signals typically introduces a significant amount of latency.
The advantages of the methods and mechanisms described herein may be better understood by referring to the following description in conjunction with the accompanying drawings, in which:
In the following description, numerous specific details are set forth to provide a thorough understanding of the methods and mechanisms presented herein. However, one having ordinary skill in the art should recognize that the various implementations may be practiced without these specific details. In some instances, well-known structures, components, signals, computer program instructions, and techniques have not been shown in detail to avoid obscuring the approaches described herein. It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements.
Various systems, apparatuses, methods, and computer-readable mediums for implementing a deskewing method for a physical layer interface on a multi-chip module using a synchronous clock-domain crossing with reduced latency are disclosed. In one implementation, a circuit connected to a plurality of communication lanes trains each lane to synchronize a local clock of the lane with a corresponding global clock at a beginning of a timing window. Next, the circuit symbol rotates each lane by a single step responsive to determining that all of the plurality of lanes have an incorrect symbol alignment. Responsive to determining that some but not all of the plurality of lanes have a correct symbol alignment, the circuit symbol rotates lanes which have an incorrect symbol alignment by a single step. When the end of the timing window has been reached, the circuit symbol rotates lanes which have a correct symbol alignment and adjusts a phase of a corresponding global clock to compensate for missed symbol rotations. The circuit samples a plurality of data signals using a plurality of local clocks to generate a plurality of data sequences responsive to determining that all of the plurality of lanes have a correct symbol alignment.
In various implementations, techniques for implementing a synchronous clock-domain crossing with reduced latency are disclosed. In one implementation, a circuit generates a local clock while receiving a global clock and a data signal. The circuit includes a register which samples the local clock with the global clock. The circuit also includes a barrel shifter which generates phase-shifted versions of the local clock in single unit interval (UI) step sizes. The circuit further includes control logic which uses the barrel shifter to sweep the local clock across all phases until an edge transition is detected. When the edge transition is detected, this indicates that the local clock is aligned with the global clock. Then, the control logic adjusts a phase of the local clock to meet setup and hold requirements for sampling the data signal. Next, the data signal is sampled with the phase-adjusted local clock to generate a data sequence.
Referring now to
Transmitter 105 and receiver 110 can be any type of devices depending on the implementation. For example, in one implementation, transmitter 105 is a processing unit (e.g., central processing unit (CPU), graphics processing unit (GPU)) and receiver 110 is a memory device. The memory device can be any type of memory, such as dynamic random access memory (DRAM), synchronous DRAM (SDRAM), double data rate (DDR, DDR2, DDR3, etc.) SDRAM (including mobile versions of the SDRAMs such as mDDR3, etc., and/or low power versions of the SDRAMs such as LPDDR2, etc.), RAMBUS DRAM (RDRAM), static RAM (SRAM), etc. One or more memory devices can be coupled onto a circuit board to form memory modules such as single inline memory modules (SIMMs), dual inline memory modules (DIMMs), etc. Alternatively, the memory devices can be mounted within a system on chip (SoC) or integrated circuit (IC) in a chip-on-chip configuration, a package-on-package configuration, or a multi-chip module (MCM) configuration.
In another implementation, transmitter 105 is an input/output (I/O) fabric and receiver 110 is a peripheral device. The peripheral devices can include devices for various types of wireless communication, such as wifi, Bluetooth, cellular, global positioning system, etc. The peripheral devices can also include additional storage, including RAM storage, solid state storage, or disk storage. The peripheral devices can also include user interface devices such as a display screen, including touch display screens or multitouch display screens, keyboard or other input devices, microphones, speakers, etc. In other implementations, transmitter 105 and receiver 110 are other types of devices. It is noted that system 100 can be any type of system, such as an IC, SoC, MCM, and so on.
Turning now to
Referring now to
In various implementations, master PCS 305 includes or is coupled to a phase-locked loop (PLL) 310 for generating a clock which is coupled to clock transmit module 325 and to other modules within master PCS 305. Master PCS 305 includes an array of N receivers 315 and an array of M transmitters 335, where N and M are positive integers, and where the values of N and M vary according to the implementation. A receive clock gating (RXCG) module 320 is coupled to the array of receivers 315 and a transmit clock gating (TXCG) module 330 is coupled to the array of transmitters 335. In one implementation, a clock is forwarded to master PCS 305 and is used as a reference clock for PLL 310. The clock generated by PLL 310 is forwarded over channel 340 to be used as a reference clock for the PLL 375 (of slave PCS 380) and downstream logic. In one implementation, a half rate architecture is utilized by master PCS 305 and slave PCS 380 in which the clock is half the frequency of the data rate and the data is sampled on rising and falling edges of the clock.
In one implementation, the array of transmitters 335 of master PCS 305 is connected to corresponding lanes in channel 340 that span a width which is greater than 1000 microns (i.e., micrometers) between the furthest lanes. Also, in one implementation, the array of receivers 315 of master PCS 305 is connected to corresponding lanes in channel 340 that span a width which is greater than 1000 microns between the furthest lanes. This results in a drift between the clock edges of the different clocks transmitted on the lanes of channel 340.
In one implementation, a synchronous clock domain crossing is achieved across the array of transmitters 335 using one or more of the methods and mechanisms described herein. For example, even if the lanes of the array of transmitters 335 are separated by a distance greater than 1000 microns, a synchronous clock domain crossing is achieved using the techniques presented herein while adding a relatively small amount of latency (as compared to traditional approaches) to the interface. In one implementation, to achieve the synchronous clock domain crossing, the local clock of slave PCS 380 is trained to become aligned with the forwarded controller clock. As used herein, the term “local clock” is defined as a divided version of a relatively fast PLL clock. Also, as used herein, the term “controller clock” is defined as a system on chip (SoC) master clock which is distributed across the lanes of a channel (e.g., channel 340).
In one implementation, delay is added with 1 unit interval (UI) granularity to minimize the transmit channel skew between lanes. In one implementation, an optional first-in, first-out (FIFO) mode is implemented to increase the timing margin available for achieving the synchronous crossing. With the FIFO mode enabled, more delay is available at the expense of more latency; however, this provides additional tuning range for reducing the transmit channel skew. In one implementation, prior to training, the symbol starting points are spread across the various transmit lanes of the channel 340. After training, the symbol starting points on the various transmit lanes of channel 340 are aligned within 1 UI of each other.
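To make the 1 UI deskew step concrete, the following is a minimal behavioral sketch (provided for illustration only and not taken from the described implementation) of how a per-lane delay, in whole UI, could be chosen so that every lane's symbol starting point lines up with the latest-starting lane. The 10 UI symbol period, the function name, and the example offsets are assumptions.

```python
# Illustrative sketch only: pick a per-lane delay (in UI) that moves every
# lane's symbol starting point onto the latest-starting lane, so that after
# training the starting points agree to within 1 UI.  The 10 UI symbol period
# and the example offsets are assumed for illustration.
SYMBOL_UI = 10

def deskew_delays(start_offsets):
    """Return the extra delay, in UI, to add to each lane so that all symbol
    starting points align with the latest lane's starting point."""
    target = max(start_offsets)
    return [(target - offset) % SYMBOL_UI for offset in start_offsets]

# Example: four transmit lanes whose symbol starts are spread across the
# symbol period before training.
print(deskew_delays([0, 3, 7, 2]))   # -> [7, 4, 0, 5]
```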
Turning now to
In one implementation, a training method of two stages is performed to train the phase of a local clock to match a corresponding controller clock. In the first stage, when the phase relationship between the local clock and the controller clock is unknown, a training step is performed to establish the phase relationship for each lane's local clock with respect to the controller clock. In the second stage of the training method, the phase of each local clock is adjusted with respect to the controller clock to meet the setup and hold requirements.
In one implementation, the controller clock is used to sample the local clock at sync register 417 and the output of sync register 417 is provided to clock train finite state machine (FSM) 410. Using barrel shifter 415, the local clock is swept across all of the phases until the sampler output (i.e., the output of sync register 417) is detected going from 0 to 1. When the sampler output is detected going from 0 to 1, this indicates that the local clock is aligned within +/−1 unit interval (UI) of the controller clock. In one implementation, 1 UI is 1/10th of the 1× clock period, and 1 UI is the minimum resolution for sweeping the clock. In one implementation, additional averaging is performed in clock train FSM 410 to account for deltas due to low-frequency phase offsets. Once the local clock is aligned to the controller clock, the local clock's phase is adjusted to meet setup and hold timing requirements. This method is repeated for all of the lanes so that data coming from the different lanes is synchronized into the controller clock domain with relatively low latency. Additionally, in one implementation, delay is added between the lanes to minimize the skew caused by the transmitter and the channel.
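The following behavioral sketch (a simplified model, not the RTL of clock train FSM 410) illustrates the two training stages just described under idealized assumptions: both clocks are perfect square waves, 1 UI equals 1/10th of the 1× clock period, sampling is modeled as reading the local clock's level at the controller clock's rising edge, and a fixed 2 UI offset stands in for the setup/hold adjustment. All names and example values are illustrative.

```python
# Behavioral sketch of the two-stage clock training, under simplifying
# assumptions: ideal square-wave clocks, 1 UI = 1/10 of the 1x clock period,
# and sampling modeled as reading the local clock's level at the controller
# clock's rising edge.  Names and the 2 UI margin are illustrative only.
UI_PER_CYCLE = 10

def local_clock_level(phase_ui, skew_ui):
    """Level (0 or 1) of the local clock, shifted by the barrel-shifter setting
    phase_ui, as seen at the controller clock's sampling edge given an unknown
    skew skew_ui (both in UI).  The clock is high for the first half period."""
    return 1 if (phase_ui + skew_ui) % UI_PER_CYCLE < UI_PER_CYCLE // 2 else 0

def sweep_for_alignment(skew_ui, samples_per_step=4):
    """Stage one: sweep the local clock in 1 UI steps and return the phase at
    which the sampled value is first seen going from 0 to 1.  Taking several
    samples per step stands in for the averaging used against low-frequency
    phase offsets."""
    previous = None
    for step in range(UI_PER_CYCLE + 1):
        phase = step % UI_PER_CYCLE
        votes = sum(local_clock_level(phase, skew_ui)
                    for _ in range(samples_per_step))
        sampled = 1 if votes > samples_per_step // 2 else 0
        if previous == 0 and sampled == 1:
            return phase                     # aligned within +/- 1 UI
        previous = sampled
    return 0                                 # no transition in this idealized model

# Stage two: offset the aligned phase by an assumed 2 UI so the data-sampling
# edge meets setup and hold requirements.
aligned = sweep_for_alignment(skew_ui=7)
adjusted = (aligned + 2) % UI_PER_CYCLE
print(aligned, adjusted)                      # -> 3 5
```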
Using the approach shown in
Referring now to
In one implementation, in a default mode, the local clock is adjusted with 1 UI resolution based on the initial average difference between the controller clock and the local clock in order to guarantee adequate timing margin for the expected controller clock jitter. In one implementation, in optional FIFO mode, a half-rate version of the controller clock acts as a write pointer while a half-rate version of the local clock is used as a read pointer. The clock training algorithm determines the initial average write/read pointer separation for the programmable placement of the read pointer. In one implementation, the read pointer placement is adjusted with 1 UI resolution in order to guarantee adequate timing margin while also minimizing latency at the parallel data capture interface based on the expected controller clock jitter magnitude. In one implementation, if there is adequate margin in the pointer separation, the read pointer is further adjusted after initial clock training with a 1 UI step size to reduce lane-to-lane skew. Once the data is crossed to the local clock domain, the data is then serialized and provided to the channel through the transmit lanes.
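The two-entry FIFO crossing can be pictured with the minimal event-level sketch below. Writes stand in for half-rate controller clock edges and reads for half-rate local clock edges, so the 1 UI pointer-separation tuning described above appears only as the choice of when reads begin relative to writes. The class and method names are assumptions for illustration.

```python
# Minimal event-level sketch of the optional two-entry FIFO crossing.  Writes
# stand in for half-rate controller-clock edges and reads for half-rate
# local-clock edges; the actual 1 UI read-pointer placement is abstracted into
# when read() is first called relative to write().  Names are illustrative.
class TwoEntryFifo:
    DEPTH = 2

    def __init__(self):
        self.storage = [None] * self.DEPTH
        self.write_ptr = 0    # advanced in the controller-clock domain
        self.read_ptr = 0     # advanced in the local-clock domain

    def write(self, word):
        self.storage[self.write_ptr] = word
        self.write_ptr = (self.write_ptr + 1) % self.DEPTH

    def read(self):
        word = self.storage[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % self.DEPTH
        return word

# With one write of separation before the first read, data crosses domains
# with a single half-rate cycle of added latency; a larger separation gives
# more jitter margin at the cost of more latency.
fifo = TwoEntryFifo()
fifo.write(0x3FF)
assert fifo.read() == 0x3FF
```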
In one implementation, the input data is fed through flops 510 in the physical coding sublayer (PCS) 505 domain. The controller clock is provided to the clock inputs of flops 510. In one implementation, the input data is 10 bits wide. However, in other implementations, the input data can have other bit-widths. Similarly, the bit-widths of other paths in circuit 500 can vary in other implementations. The outputs of flops 510 are provided in parallel to flops 515 and 520 in the transmit macro domain 507. A half rate controller clock is provided to the clock inputs of flops 515 and 520. The outputs of flops 515 and 520 are provided to the inputs of multiplexer 525. The outputs of multiplexer 525 are coupled to the inputs of flops 535. A 1× rate local clock is coupled to the clock inputs of flops 535. The outputs of flops 535 are provided to a serializer (not shown).
In one implementation, a high-speed 5× rate clock is provided as an input to word clock barrel shifter 540. In one implementation, word clock barrel shifter 540 generates 10 unique phases of a local 1× rate clock. A phase-adjusted output of word clock barrel shifter 540 is inverted and then coupled to the clock port of flop 530. The output of flop 530 is inverted and then coupled back to the input of flop 530 to generate a phase-adjusted, half-rate local clock. The phase-adjusted, half-rate local clock is coupled to the select input of multiplexer 525. The 1× controller clock is inverted and coupled to the clock input of flop 545. The output of flop 545 is inverted and then coupled back to the input of flop 545 to generate a half rate controller clock. It is noted that the flops shown in circuit 500 (of
Turning now to
As shown for circuit 600, the incoming serial data is sampled on both edges of a half-rate clock by flops 615A-B to obtain two bits of data which are then passed through multiplexers 620 and 622 to flops 645. In one implementation, after deserialization, the data is in the local clock domain and is then moved back to the controller clock domain. This is accomplished by the synchronous clock-domain crossing as explained earlier in the discussion of
For the operation of circuit 600, the input data is received from channel 605 and coupled to two sets of flops 615A-B. Clock generation unit 610 generates a first clock with a phase of 0 degrees which is coupled to the clock input of flop 615A. Clock generation unit 610 also generates a second clock with a phase of 180 degrees which is coupled to the clock input of flop 615B. The second clock coupled to flop 615B is therefore 180 degrees out of phase with respect to the first clock coupled to flop 615A. The output of flop 615A is labeled Data1 and is coupled to the “1” inputs of multiplexers 620 and 622. The output of flop 615B is labeled Data0 and is coupled to the “0” inputs of multiplexers 620 and 622. The select signals for multiplexers 620 and 622 are generated by the shift data counter 625 which flips on an odd count.
Logic 630, which checks whether a symbol rotation is needed, generates the shift data for rotating the clocks. The output from logic 630 is coupled to shift data counter 625 and barrel shifter 635. A 5× rate clock is also coupled to barrel shifter 635. Barrel shifter 635 generates a plurality of 1× rate clocks with different phases based on the 5× rate clock input. The plurality of 1× clocks with different phases generated by barrel shifter 635 are provided to different flops among flops 645. The outputs of flops 645 are coupled to the inputs of flops 650. Flops 650 are clocked with the local clock, and the outputs of flops 650 are coupled to flops 655, which are clocked with the controller clock. In the case of an odd shift, the sample from the “0” flop 645 is delayed, and so there is an extra flop 640 to ensure that the data does not get overwritten by the incoming bit.
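As a loose abstraction of the double-edge capture described above (not a model of the actual flop-level timing), the sketch below pairs up bits of the serial stream, two per half-rate clock cycle, and represents the swap performed by multiplexers 620 and 622 for an odd symbol shift as a simple flag. The function name and the example bit patterns are assumptions.

```python
# Loose behavioral sketch of the double-edge capture: one bit is taken on each
# edge of the half-rate clock, and for an odd symbol shift the pair ordering
# presented downstream is swapped (the role played by multiplexers 620/622).
# The function name and bit-level framing are illustrative assumptions.
def capture_bit_pairs(serial_bits, odd_shift=False):
    """Group a serial bit stream (oldest bit first) into two-bit pairs, one bit
    captured on each edge of the half-rate clock, optionally swapping the pair
    ordering when an odd symbol rotation is in effect."""
    pairs = []
    for i in range(0, len(serial_bits) - 1, 2):
        first, second = serial_bits[i], serial_bits[i + 1]
        if odd_shift:
            first, second = second, first   # odd shift reverses the pair order
        pairs.append((first, second))
    return pairs

print(capture_bit_pairs([1, 0, 1, 1, 0, 0]))                  # [(1, 0), (1, 1), (0, 0)]
print(capture_bit_pairs([1, 0, 1, 1, 0, 0], odd_shift=True))  # [(0, 1), (1, 1), (0, 0)]
```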
Referring now to
Turning now to
Referring now to
Turning now to
Before link training, the synchronous clock domain crossing scheme is used to train each lane so that the local clock timing relative to the controller clock is at the beginning of the timing window (i.e., maximum hold time, minimum setup time). This step is repeated until the link is established. As used herein, the term “timing window” is defined as a duration of time during which corresponding data is valid. The timing window is defined by the leading and trailing edges of a data signal. Then, after the link is established, each lane is checked and one of the following options is used. If none of the lanes has the correct symbol alignment, then state diagram 1000 moves from reset state 1005 to out-eye state 1015. In out-eye state 1015, each lane is symbol rotated by 1 step using a barrel shifter and the controller clock is also rotated by 1 step to maintain the established timing relationship. Then all lanes are checked. If some lanes are symbol locked, then the state diagram moves to state 1010. In state 1010, the local clocks for the lanes that are not symbol locked are shifted by 1 UI. In one implementation, lanes which are not symbol locked are shifted 1 UI using a barrel shifter. For the timing relationship between the controller clock and the local clock, the setup time increases by 1 UI and the hold time decreases by 1 UI since the controller clock is static. This is continued until all of the lanes are symbol locked, up to a timing margin window. In the case where some of the lanes are still not symbol locked by the time the timing margin window is reached, state diagram 1000 moves back to the reset state 1005. Otherwise, if all of the lanes are symbol locked, then state diagram 1000 moves to lock state 1030.
If, in reset state 1005, some lanes are symbol locked, then state diagram 1000 moves to state 1010. From state 1010, the state diagram 1000 moves to analyze state 1025. Then, state diagram 1000 moves to begin_end_eye state 1020 where the local clock is shifted for the lanes that are symbol aligned. This method is used until the end of the timing window (i.e., minimum hold time, maximum setup time) is reached. At the end of the timing window, all locked lanes and the controller clock are rotated to compensate for the missed rotations. If a newly locked lane is locked on a deskewed symbol, then the existing locked lanes except for the latest locked lane and the controller clock are rotated to compensate for the missed rotations.
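A much-simplified sketch of the decision loop behind state diagram 1000 is shown below. Each lane is reduced to the number of 1 UI symbol rotations it still needs, the timing margin window is a plain counter, and the begin_end_eye compensation path is omitted. The 4 UI window, the function name, and the example values are assumptions for illustration.

```python
# Much-simplified sketch of the lane-locking loop of state diagram 1000.  Each
# lane is reduced to the number of 1 UI symbol rotations it still needs; the
# timing window is a plain counter and the begin_end_eye compensation path is
# omitted.  The 4 UI window and all names are illustrative assumptions.
SYMBOL_UI = 10

def lock_all_lanes(rotations_needed, window_ui=4):
    """Run the none/some/all decision loop until every lane is symbol locked
    (lock state 1030) or the timing window is exhausted (back to reset 1005).
    Returns (controller-clock rotations, margin used) on success, else None."""
    needed = list(rotations_needed)
    controller_rotations = 0
    margin_used = 0
    while True:
        locked = [n == 0 for n in needed]
        if all(locked):
            return controller_rotations, margin_used          # lock state 1030
        if not any(locked):
            # Out-eye state 1015: rotate every lane and the controller clock
            # together, preserving the established timing relationship.
            needed = [(n - 1) % SYMBOL_UI for n in needed]
            controller_rotations += 1
            continue
        if margin_used >= window_ui:
            return None                                       # reset state 1005
        # State 1010: rotate only the lanes that are not yet locked; the
        # controller clock stays put, so timing margin is consumed.
        needed = [(n - 1) % SYMBOL_UI if n else 0 for n in needed]
        margin_used += 1

print(lock_all_lanes([2, 3, 5]))   # -> (2, 3)
```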
Referring now to
Due to mismatched lengths of traces connected to lanes 1105A-D and/or due to other factors, the symbol eye for each lane will typically differ for the different lanes 1105A-D prior to training. The IN_EYE starting point for reset is illustrated at the top of diagram 1100 to represent one example of an implementation of control logic performing a method associated with state diagram 1000 (of
Turning now to
In one implementation, after establishing the lock, the input bit stream is deserialized into parallel data. In one implementation, the input serial bit stream is deserialized into 10-bit parallel data. However, in other implementations, other numbers of bits of parallel data are generated.
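A trivial sketch of this deserialization step is shown below, assuming the 10-bit word width described above; the helper name and example stream are illustrative.

```python
# Trivial sketch of the deserialization step: group the recovered serial bit
# stream into 10-bit parallel symbols (the width described above; other widths
# are possible).  The helper name is an assumption.
WORD_BITS = 10

def deserialize(serial_bits):
    """Group a serial bit stream (oldest bit first) into WORD_BITS-wide words;
    any trailing partial word is held back, as a real deserializer would wait
    for the symbol to complete."""
    return [serial_bits[i:i + WORD_BITS]
            for i in range(0, len(serial_bits) - WORD_BITS + 1, WORD_BITS)]

stream = [1, 0] * 15                       # 30 bits -> three 10-bit symbols
assert len(deserialize(stream)) == 3
```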
In one implementation, after deserialization, the data is in the local clock domain and needs to be moved to the controller clock domain. In one implementation, this is accomplished using the previously described synchronous clock-domain crossing techniques as well as through the use of circuit 1200. In one implementation, the clock domain crossing between the local clock (i.e., word clock) and the potentially high-jitter controller clock is accomplished with two-entry FIFO circuit 1200. A half rate local clock is used as the write pointer and a half-rate version of the controller clock acts as the read pointer. The write pointer and the local clock are both generated from a common high-speed clock. In one implementation, the write pointer placement is adjusted with 1 UI resolution.
In one implementation, the input serial data is coupled to the inputs of flops 1205A-N. In one implementation, the input data includes 10 input lanes. However, it should be understood that in other implementations, other numbers of input lanes can be supported by circuit 1200 by adjusting the number of flops 1205A-N and the bit-widths of the other components of circuit 1200. In one implementation, word clock barrel shifter 1207 receives a high-speed 5× rate clock which it uses to generate a 1× rate clock output with 5 unique phases and with a 2 UI resolution. These 1× rate clocks with different phases are coupled to the clock ports of flops 1205A-N. The data inputs to flops 1205A-N are sampled using the phase-shifted clocks.
In one implementation, the outputs from flops 1205A-N are coupled to the input ports of flops 1210. Word clock barrel shifter 1207 also generates a local 1× rate clock which is coupled to the clock ports of flops 1210. The outputs of flops 1210 are coupled to the input ports of flops 1215 and 1220. The high-speed 5× rate clock is also coupled to write pointer generator 1240. Write pointer generator 1240 generates a half rate local clock which is coupled to the clock ports of flops 1215. The half rate local clock of write pointer generator 1240 is negated and coupled to the clock ports of flops 1220. The outputs of flops 1215 are coupled to the “0” input of multiplexer 1225 while the outputs of flops 1220 are coupled to the “1” input of multiplexer 1225. A half rate controller clock, generated by flop 1245, is coupled to the select port of multiplexer 1225. The output lanes of multiplexer 1225 are coupled to the input ports of flops 1235. The 1× rate controller clock is coupled to the clock ports of flops 1235. The parallel data outputs from flops 1235 are provided to a controller (not shown) or other component for further processing.
Referring now to
A circuit receives a global clock signal and a data signal (block 1305). It is noted that the global clock can also be referred to herein as a “controller clock”. In some implementations, the circuit receives a plurality of global clocks and a plurality of data signals. In one implementation, the global clock and the data signal are generated in a separate functional unit from the receiving functional unit. The circuit also generates a local clock signal (block 1310). For example, in one implementation, the circuit generates the local clock signal using a local PLL.
The circuit sweeps the local clock across phases in 1 UI steps with a barrel shifter while sampling the local clock with a register clocked by the global clock so as to detect an edge transition (block 1315). In one implementation, an output of the register is coupled to control logic which detects an edge transition. In one implementation, the control logic includes a finite state machine (FSM). In one implementation, the edge transition is a rising edge transition (i.e., from a low voltage to a high voltage). Next, the control logic determines that the local clock is aligned with the global clock responsive to detecting the edge transition (block 1320). Then, a phase of the local clock is adjusted with respect to the global clock to meet setup and hold timing requirements (block 1325). Next, the data signal is sampled using the phase-adjusted local clock (block 1330). Then, the sampled data is provided to a subsequent stage for additional processing and/or storage (block 1335). In one implementation, the sampled data is provided to a deserializer. After block 1335, method 1300 ends.
Turning now to
Next, the control logic determines that each local clock is aligned with a corresponding global clock responsive to detecting the edge transition (block 1420). Then, a phase of each local clock is adjusted with respect to a corresponding global clock to meet setup and hold timing requirements (block 1425). Next, each data signal is sampled using the corresponding phase-adjusted local clock (block 1430). Then, the sampled data is provided to a subsequent stage for additional processing and/or storage (block 1435). In one implementation, the sampled data is provided to a deserializer. After block 1435, method 1400 ends.
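As a small sketch of the per-lane adjustment and sampling setup in blocks 1420 through 1430, the snippet below applies an assumed fixed 2 UI setup/hold margin to each lane's detected alignment phase to arrive at the final barrel-shifter setting for that lane. The margin value, the lane phases, and the function name are assumptions rather than values from the implementation.

```python
# Small sketch of the per-lane adjustment of blocks 1420-1430: offset each
# lane's detected alignment phase by an assumed 2 UI margin so the capture
# edge meets setup and hold requirements.  All values and names are
# illustrative assumptions.
UI_PER_CYCLE = 10
SETUP_HOLD_MARGIN_UI = 2

def adjust_lane_phases(aligned_phase_by_lane):
    """Return the final barrel-shifter setting (in UI) for each lane's
    phase-adjusted local clock."""
    return {lane: (phase + SETUP_HOLD_MARGIN_UI) % UI_PER_CYCLE
            for lane, phase in aligned_phase_by_lane.items()}

# Example: four lanes whose sweeps detected different alignment phases.
print(adjust_lane_phases({0: 3, 1: 9, 2: 0, 3: 6}))
# -> {0: 5, 1: 1, 2: 2, 3: 8}
```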
Referring now to
If some of the lanes have a correct symbol alignment (conditional block 1515, “some lanes” leg), then lanes which are not locked are symbol rotated by a single step (block 1525). In one implementation, blocks 1520 and 1525 are performed using circuit 600 (of
Turning now to
Non-transitory computer-readable storage medium 1600 can include any of various appropriate types of memory devices or storage devices. Medium 1600 can be an installation medium (e.g., a thumb drive, CD-ROM), a computer system memory or random access memory (e.g., DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM), a non-volatile memory (e.g., a Flash, magnetic media, a hard drive, optical storage), registers, or other types of memory elements. Medium 1600 can include other types of non-transitory memory as well or any combinations thereof. Medium 1600 can include two or more memory mediums which reside in different locations (e.g., in different computer systems that are connected over a network).
In various implementations, circuit representation 1605 is specified using any of various appropriate computer languages, including hardware description languages such as, without limitation: VHDL, Verilog, SystemC, SystemVerilog, RHDL, etc. Circuit representation 1605 is usable by circuit fabrication system 1610 to fabricate at least a portion of one or more of integrated circuits 1615A-N. The format of circuit representation 1605 is recognizable by at least one circuit fabrication system 1610. In some implementations, circuit representation 1605 includes one or more cell libraries which specify the synthesis and/or layout of the integrated circuits 1615A-N.
Circuit fabrication system 1610 includes any of various appropriate elements configured to fabricate integrated circuits. This can include, for example, elements for depositing semiconductor materials (e.g., on a wafer, which can include masking), removing materials, altering the shape of deposited materials, modifying materials (e.g., by doping materials or modifying dielectric constants using ultraviolet processing), etc. Circuit fabrication system 1610 can also perform testing of fabricated circuits for correct operation.
In various implementations, integrated circuits 1615A-N operate according to a circuit design specified by circuit representation 1605, which can include performing any of the functionality described herein. For example, integrated circuits 1615A-N can include any of various elements shown in the circuits illustrated herein and/or multiple instances of the circuit illustrated herein. Furthermore, integrated circuits 1615A-N can perform various functions described herein in conjunction with other components. For example, integrated circuits 1615A-N can be coupled to voltage supply circuitry that is configured to provide a supply voltage (e.g., as opposed to including a voltage supply itself). Further, the functionality described herein can be performed by multiple connected integrated circuits.
As used herein, a phrase of the form “circuit representation that specifies a design of a circuit . . . ” does not imply that the circuit in question must be fabricated in order for the element to be met. Rather, this phrase indicates that the circuit representation describes a circuit that, upon being fabricated, will be configured to perform the indicated actions or will include the specified components.
In various implementations, program instructions are used to implement the methods and/or mechanisms described herein. For example, program instructions are written that describe the behavior or design of hardware. In one implementation, such program instructions are represented by a hardware description language (HDL) such as Verilog. In various implementations, the program instructions are stored on any of a variety of non-transitory computer readable storage mediums. The storage medium is accessible by a computing system during use to provide the program instructions to the computing system for circuit fabrication, program execution, or otherwise. Generally speaking, such a computing system includes at least one or more memories and one or more processors configured to execute program instructions.
It should be emphasized that the above-described implementations are only non-limiting examples of implementations. Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.
This application is a continuation of U.S. patent application Ser. No. 16/709,472, entitled “DESKEWING METHOD FOR A PHYSICAL LAYER INTERFACE ON A MULTI-CHIP MODULE”, filed Dec. 10, 2019, which is a continuation of U.S. patent application Ser. No. 16/397,848, entitled “DESKEWING METHOD FOR A PHYSICAL LAYER INTERFACE ON A MULTI-CHIP MODULE”, filed Apr. 29, 2019, the entirety of which is incorporated herein by reference.
|        | Number   | Date     | Country |
|--------|----------|----------|---------|
| Parent | 16709472 | Dec 2019 | US      |
| Child  | 17128720 |          | US      |
| Parent | 16397848 | Apr 2019 | US      |
| Child  | 16709472 |          | US      |