This invention relates generally to single-package processors with at least two separate processing cores.
To increase the operating frequency of a bus shared by several agents, such as two processing cores and a chipset, the trace lengths between any two agents on the bus can be shortened. An agent can be a processing core, a chipset, or another device coupled to the bus. Shortening the trace lengths can help satisfy the setup time requirements between all the agents on the bus. The bus agents can be connected in a daisy-chain topology, for example from a first processing core to a second processing core and from the second processing core to a chipset. The inputs and outputs of the two end agents, for example a processing core and a chipset, can provide bus termination circuits that the other agents on the bus do not provide. A bus termination circuit can comprise resistors matching the effective impedance of the bus.
To avoid any timing violation caused by possible race conditions, however, the trace lengths between any two bus agents cannot be too short. A race condition occurs when data is sent from an agent on a bus and another agent on the bus receives the data before the agent is ready to receive the data, thus violating a hold time requirement. Placing the two processing cores close to each other on a single package can create such a hold time violation between the two cores. In a daisy-chain topology, the hold time requirement between the two processing cores can limit how short the overall bus length can be.
The trace length between an end agent and an intermediate agent can be increased to avoid the timing violation while maintaining the overall bus length between the agents at the ends of the bus. This can result in a star topology, where there are at least three segments of traces originating from a location on the bus and connecting each agent to the bus. This can create a stub which branches off the main bus between the two end agents to connect the intermediate agent to the bus. This stub can cause ring-back due to an impedance mismatch at the branch-off point of the bus. When the voltage and current wave from one trace branch arrives at the branch-off point, it sees two traces in parallel, which introduces an inherent impedance mismatch. If a stub is unterminated, for example to maintain the same direct current (DC) operating condition as in the original daisy-chain topology bus, it can result in increased amounts of ring-back when the current that flows through the bus to an open circuit is reflected back into the bus. When ring-back is present, the frequency of the bus can be lowered to reduce the effects of ring-back. This can cause the bus and the system to operate at a lower frequency.
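By way of illustration only, the impedance mismatch at the branch-off point can be sketched with the standard transmission-line reflection coefficient. The 50 ohm characteristic impedance and the function names below are assumptions chosen for the example and are not taken from this description; all three trace segments are assumed to share the same impedance.

```python
import math


def parallel(*impedances: float) -> float:
    """Equivalent impedance of several impedances connected in parallel."""
    return 1.0 / sum(1.0 / z for z in impedances)


def reflection_coefficient(z_load: float, z_line: float) -> float:
    """Voltage reflection coefficient for a wave on z_line arriving at z_load."""
    if math.isinf(z_load):
        return 1.0  # open circuit: the full wave is reflected
    return (z_load - z_line) / (z_load + z_line)


Z0 = 50.0  # assumed characteristic impedance of every trace segment, in ohms

# A wave arriving at the branch-off point sees the two other traces in parallel.
gamma_branch = reflection_coefficient(parallel(Z0, Z0), Z0)

# An unterminated stub ends in an open circuit.
gamma_open_stub = reflection_coefficient(math.inf, Z0)

print(f"branch-off point: {gamma_branch:+.3f}")
print(f"open stub end:    {gamma_open_stub:+.3f}")
```

Under these assumptions about one third of the incident voltage is reflected with inverted polarity at the branch-off point, and the open end of an unterminated stub reflects the whole wave, which is consistent with the ring-back described above.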
To reduce such ring-back and increase the bus frequency, all endpoints of a star topology bus can be terminated by a bus termination circuit. Additional termination circuits, however, reduce the direct current (DC) voltage range available for the bus operation and can result in less noise margin than the daisy-chain topology with the terminations at the two end points.
In an embodiment, the processing cores can have their own inputs and outputs coupled to a common electrical bus. A processing core or chipset can communicate with other agents through the common electrical bus. A package can hold processing cores and can comprise a bus to connect inputs and outputs of agents in the package and package pins. The bus in the package can be a trace. The package pins can connect to traces on the platform, such as a printed circuit board. The traces can connect the package pins to the chipset inputs and outputs to allow communication between the processing cores or agents in the package and a chipset. In one embodiment, the chipset can communicate with other subsystems in the platform such as system memory, graphics, display, and/or other input/output devices through separate inputs and outputs for each subsystem.
In order to increase the bus speed, the clock to output time of the driving agent, the flight time of the signal, and the setup time of the receiving agent can be reduced. The driving agent can be the agent that is sending data on a bus and the receiving agent can be an agent that is latching the data being sent. The clock to output time is the time between the common reference clock received by the driving agent and the new data appearing at the pin of the driving agent. The setup time is the least amount of time for the new data to be valid at the receiving agent before its clock edge for a successful data transfer to occur. The hold time is the time the new data must be valid at the receiving agent after a clock edge for a successful data transfer to occur. The data setup time plus the hold time can be called the valid data window. The flight time is the time a data signal takes to travel from one agent to another agent on a bus.
There is a limitation on how much clock to output time and data setup time can be reduced to reduce the overall timing between the processing core furthest from the chipset and the chipset on the bus. In some embodiments, both the difference between the minimum and maximum clock to output times, and the data setup time plus the hold time, are fixed. Reducing the clock to output time and the setup time can cause hold violations between two agents, for example between the two processing core agents which can be placed close to each other on a single processor package.
The bus's operating frequency can be increased by reducing the system clock period, that is, the time to complete one cycle on the bus. The frequency cannot be increased beyond the value at which the length of one cycle becomes less than the clock to output time plus the flight time plus the data setup time plus the clock skew plus the clock jitter. A setup requirement violation can occur at the receiving agent if the frequency is increased above this value. Clock skew is the time difference between the clock edges received at the agents on the bus, which can be caused by differences in the time for the clock signals to reach the bus agents from the clock generation chip or by the clock chip itself. Clock jitter can also be introduced by the clock chip itself or by board effects such as noise, creating a clock period shorter than the intended value. For example, a clock period of 9.8 nanoseconds can occur in some clock cycles while the intended period is 10 nanoseconds. In this case a clock jitter of 0.2 nanoseconds can be subtracted from the period, and the bus can be designed for 9.8 nanoseconds to compensate for the clock jitter.
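The setup constraint described above can be sketched as a short calculation. Only the 10 nanosecond period and the 0.2 nanosecond jitter come from the example in the text; every other timing value and all function names below are hypothetical and serve only to show how the terms combine.

```python
def min_clock_period_ns(t_co_max: float, t_flight: float, t_setup: float,
                        t_skew: float, t_jitter: float) -> float:
    """Shortest cycle that still satisfies the setup requirement."""
    return t_co_max + t_flight + t_setup + t_skew + t_jitter


def max_frequency_mhz(period_ns: float) -> float:
    """Bus frequency in MHz that corresponds to a period in nanoseconds."""
    return 1000.0 / period_ns


# Jitter example from the text: intended period 10 ns, jitter 0.2 ns.
design_period_ns = 10.0 - 0.2  # the bus is designed for the shortened cycle

# Hypothetical timing budget for the path from a driving to a receiving agent.
period_ns = min_clock_period_ns(t_co_max=3.0, t_flight=2.5, t_setup=1.5,
                                t_skew=0.3, t_jitter=0.2)

print(f"design period with jitter removed: {design_period_ns:.1f} ns")
print(f"minimum period: {period_ns:.1f} ns")
print(f"maximum frequency: {max_frequency_mhz(period_ns):.1f} MHz")
```

With these assumed values, shortening the flight time or the maximum clock to output time lowers the minimum period directly, which is why shorter traces permit a higher bus frequency.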
On the other hand, to avoid hold violations, the clock to output time of the driving agent plus the flight time has to be greater than the hold time of the receiving agent plus the clock skew. Reducing the maximum clock to output time and the setup time can reduce the minimum clock to output time and increase the hold time, making it more difficult to meet the hold requirement.
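The hold condition can likewise be sketched as a margin calculation. The numeric values and the function name are hypothetical. The example assumes two cores placed close together, so the flight time is short, and it models an added output delay as an increase in the minimum clock to output time.

```python
def hold_margin_ns(t_co_min: float, t_flight: float,
                   t_hold: float, t_skew: float) -> float:
    """Positive when the hold requirement is met, negative on a violation."""
    return (t_co_min + t_flight) - (t_hold + t_skew)


# Hypothetical values for two closely spaced cores (very short flight time).
without_delay = hold_margin_ns(t_co_min=0.5, t_flight=0.1,
                               t_hold=0.6, t_skew=0.3)

# Same path with an assumed 0.5 ns of delay added in the driver's output path.
with_delay = hold_margin_ns(t_co_min=0.5 + 0.5, t_flight=0.1,
                            t_hold=0.6, t_skew=0.3)

print(f"margin without added delay: {without_delay:+.2f} ns")
print(f"margin with added delay:    {with_delay:+.2f} ns")
```

In this sketch the unmodified path violates the hold requirement and the added delay restores a positive margin.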
For data transmitted from one agent to appear at other agents in the valid data window, the following variables can be considered for both hold and setup cases: the setup time, the hold time, the clock skew, the clock jitter, the flight time, and the minimum and maximum clock to output time.
A delay can be added to the input and output paths of the agents on the bus to increase the clock to output time, to increase the setup time and to decrease the hold time. Delay lines can be used to meet these timing requirements between two agents on a bus. In one embodiment, a delay line can be a series of gates comprising transistors. A delay line can be created with a delay amount that can be digitally controlled. For example, a delay line can be created to adjust from no delay through sixteen or more levels of delay elements.
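A digitally controlled delay line of this kind can be modeled, purely as an illustration, as a count of active delay elements. The class name, the per-element delay of 0.05 nanoseconds, and the choice of sixteen as the upper limit are assumptions for the example; the description allows sixteen or more levels.

```python
from dataclasses import dataclass


@dataclass
class DelayLine:
    """Behavioral model of a delay line with a digitally selected level."""
    element_delay_ns: float  # assumed delay contributed by one delay element
    max_level: int = 16      # assumed number of selectable delay elements
    level: int = 0           # level 0 means no added delay

    def set_level(self, level: int) -> None:
        """Select how many delay elements are placed in the signal path."""
        if not 0 <= level <= self.max_level:
            raise ValueError(f"level must be between 0 and {self.max_level}")
        self.level = level

    @property
    def delay_ns(self) -> float:
        """Total delay currently added to the path."""
        return self.level * self.element_delay_ns


line = DelayLine(element_delay_ns=0.05)
line.set_level(4)
print(f"level {line.level}: {line.delay_ns:.2f} ns of added delay")
```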
With reference to the figures,
The bus connection terminal 110 may couple to a bus termination circuit 120, input sense amplifier 108, and output driver 118. The bus termination circuit 120 can be deactivated by the link 122 when the bus connection terminal 110 of the processing core 100 is not located at the end of a bus. A link is a circuit component that is designed to allow modifications after semiconductor processing. For example, the link can be a fusible connection that burns off when a relatively high current is applied. The link can also be a software-controlled circuit component which can be configured to be either an open or a short circuit.
The processing core 100 can comprise configurable delay lines 102 and 112. The configurable delay line 102 can be located in the input path of processing core 100 between the input sense amplifier 108 and the input latch 106. Link 104 can be used to adjust the amount of delay in the delay line 102. Input latch 106 can store data received from the bus connection terminal 110 through the sense amplifier 108. The input sense amplifier 108 senses the input voltage on the bus and outputs a digital signal to the input latch 106.
The configurable delay line 112 can be located in the output path of the processing core 100 between the output latch 116 and the output driver 118. Output latch 116 can store data that is waiting to be output to the bus through the output driver 118. The output driver 118 senses the data in the output latch 116 and amplifies the signal for transmission on a bus. The configurable delay lines 102 and 112 can include different amounts of delay that can be adjusted using links 104, 114.
In this embodiment, the links 160, 162, 164 can be connected or disconnected to select paths 170, 172, 174, 176, or 178 depending on the appropriate amount of delay to be added to an input or output path. Although four delay elements 150, 152, 154 and 156, five bypass paths 170, 172, 174, 176, and 178, and three links 160, 162, and 164 are depicted as one embodiment in
Within the processing cores 100 and 200 and the chipset 250 are bus termination circuits 120, 220, and 270. The bus termination circuits 120, 220, and 270 can be coupled to the bus 224 at the bus connection terminals 110, 210, and 260 with links 122, 222, and 272. Also coupled to the bus 224 at the bus connection terminals 110 and 210 for the processing cores 100 and 200 are sense amplifiers 108 and 208 and output drivers 118 and 218. The delay lines 102 and 202 connect the input sense amplifiers 108 and 208 to the input latches 106 and 206. The delay lines 112 and 212 connect the output drivers 118 and 218 to the output latches 116 and 216. The chipset 250 may also include input sense amplifiers 258, input latches 256, output drivers 268, and output latches 266. The processing cores 100 and 200 and the chipset 250 are depicted with components to illustrate one embodiment but the processing cores 100 and 200 and the chipset 250 can include additional components.
The semiconductor manufacturing cost can be reduced in one embodiment by manufacturing all processing cores from masks with approximately the same layout for the processing core 100 and 200, and then using links to turn different circuit components off or on to create multiple configurations of the processing cores based on their locations within the processor package.
The location of the processing cores 100 and 200 within package 226 can be determined after the processing cores are manufactured. In an embodiment, the location of a processing core within the package 226 can be detected by the processing core itself by the state of a pin 128 or 228 on a processing core 100 or 200 respectively. After the processing cores 100 and 200 are installed in a package 226, the package 226 can be designed to pull the package pins 128 and 228 to ground or supply voltage rail, for example. Each processing core can have an internal logic circuit to read the package pins 128 and 228 and determine which components remain active. For example, when the package pin 128 is pulled to ground as shown in the figure, the logic can turn off bus termination circuit 120 and a delay can be set in the input and output path delay lines 102 and 112. The delay lines 102 and 112 can be adjusted to avoid the possibility of hold timing problems between the processing core 100 and processing core 200 on bus 224.
The amount of delay can be determined by the location of the processing core 100 in relation to the processing core 200. In one embodiment, the amount of the delay to be added to the processing core 100 can be approximately the same as the flight time of the bus between terminals 210 and 110 of the two processing cores as shown in
When pin 228 is pulled to the supply voltage rail, for example in the processing core 200 which is at the end of a bus, the signal delay can be minimized in the input and output paths to increase the frequency of the bus 224.
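The pin-based selection described above can be sketched as a small decision function. This is a behavioral illustration only, not the internal logic circuit itself; the type names, the function name, and the delay level of 6 are hypothetical. Keeping the termination active for the end-of-bus core is inferred from the earlier statement that end agents provide bus termination.

```python
from dataclasses import dataclass
from enum import Enum


class PinState(Enum):
    GROUND = 0  # core is an intermediate agent on the bus
    SUPPLY = 1  # core is at the end of the bus


@dataclass(frozen=True)
class CoreConfig:
    termination_enabled: bool
    input_delay_level: int
    output_delay_level: int


def configure_core(pin: PinState, intermediate_delay_level: int) -> CoreConfig:
    """Choose the active components of a core from its location pin."""
    if pin is PinState.GROUND:
        # Intermediate core: termination off, delay added in both paths.
        return CoreConfig(False, intermediate_delay_level,
                          intermediate_delay_level)
    # End-of-bus core: termination on (inferred), delay minimized.
    return CoreConfig(True, 0, 0)


core_100 = configure_core(PinState.GROUND, intermediate_delay_level=6)
core_200 = configure_core(PinState.SUPPLY, intermediate_delay_level=6)
print(core_100)
print(core_200)
```

Because both cores run the same function, identical dies can take on different configurations once the package sets the pin state.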
To create a bus that can reduce the hold time risk between processing core 100 and processing core 200 without reducing the bus frequency, delay lines 102, 112, 202, and 212 can be adjusted in the input and output paths of the processing cores between the input and output latches 106, 116, 206, and 216 and the bus 224. The delay lines 102, 112, 202, and 212 can create different delay lengths to compensate for the relative location of the processing core along the bus 224.
The delay line 102 can be adjusted using links 104 to increase or decrease the delay between the input latch 106 and the bus connection terminal 110. The links 104 can be added or broken to increase or decrease the amount of delay created by gates and transistors in the delay lines. In one embodiment, the delay line 112 can be adjusted to increase the delay between the output latch 116 and the bus connection terminal 110. The increased delay from the input and output latches 106 and 116 to the bus connection terminal 110 can increase the time that it takes to transmit data between processing core 100 and processing core 200. The increase in time allows data to be valid at a time corresponding to the receipt of a clock signal.
Although only two processing cores, processing core 100 and processing core 200, are depicted, additional processing cores can be used as well. The additional processing cores can include delay lines tuned to maintain the frequency of the bus 224 while creating a delay between the processing cores 100 and 200 so that data sent between the processing cores appears in the valid data window.
In one embodiment, timing measurements can be used after the semiconductor manufacturing to determine the optimal amount of delay to be added to the input and output paths of the processing core 100. This amount can depend on the relative locations of the two processing cores within the package 226. Manufacturing process variation can cause the delay per delay element to differ from one processing core to another, and testing after semiconductor manufacturing can compensate for such variation.
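One way such compensation could work, offered only as a sketch, is to divide the desired delay by the per-element delay measured on each core and select the nearest available level. The function name, the 0.30 nanosecond target, the measured element delays, and the limit of sixteen levels are all hypothetical.

```python
def choose_delay_level(target_delay_ns: float,
                       measured_element_delay_ns: float,
                       max_level: int = 16) -> int:
    """Pick the delay level whose total delay is closest to the target."""
    if measured_element_delay_ns <= 0:
        raise ValueError("measured element delay must be positive")
    level = round(target_delay_ns / measured_element_delay_ns)
    return max(0, min(max_level, level))  # clamp to the available levels


target_ns = 0.30  # assumed desired delay, e.g. near the inter-core flight time

# Two hypothetical cores whose delay elements differ due to process variation.
fast_core_level = choose_delay_level(target_ns, measured_element_delay_ns=0.045)
slow_core_level = choose_delay_level(target_ns, measured_element_delay_ns=0.060)

print(f"fast core: level {fast_core_level}")
print(f"slow core: level {slow_core_level}")
```

A core with faster delay elements needs more of them to reach the same total delay, so the selected level differs between cores even though the target is the same.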
For example, a timing tester placed at the bus connection terminals 110 and 210 of processing cores 100 and 200 can record the clock to output time, the input times, and the setup and hold times of both processing cores. Then, the delay lines 102, 112, 202, and 212 of processing cores 100 and 200 can be adjusted so that the clock to output and setup times of the two cores are matched at the processor package pins when the processing cores are different distances from the package pins. The package pins can be used to connect the package 226 to a printed circuit board, a cable or another electrical connector. The adjustment of the delay lines 102, 112, 202, and 212 can be an automated process, and the adjustment can take place right after the semiconductor processing is complete.
In another embodiment, a timing test can be done after the two cores have been packaged into a package 226. A timing tester can be placed at a package 226 pin and can record two sets of input and output timings, one set when driven by the processing core 100 and another set when driven by the processing core 200. The delay lines 102, 112, 202, and 212 can be configured such that the two sets of timings are matched. For example, the clock to output timing when driven by the processing core 100 can be adjusted to be similar to the clock to output timing when driven by the processing core 200 in a single package.
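The matching step can be sketched as follows, assuming the tester reports one clock to output time per driving core at the package pin. The function name and the measured values are hypothetical. The sketch delays only the earlier core, which is one possible policy and not necessarily the one used in a given embodiment.

```python
def matching_output_delays(t_co_core_a_ns: float,
                           t_co_core_b_ns: float) -> tuple[float, float]:
    """Extra output delay for each core so both timings match at the pin.

    Only the core whose data arrives earlier is delayed; the later core
    receives no added delay, so the bus timing is not stretched further.
    """
    latest = max(t_co_core_a_ns, t_co_core_b_ns)
    return latest - t_co_core_a_ns, latest - t_co_core_b_ns


# Hypothetical measurements at a package pin: core 100 sits closer to the pin,
# so its data appears earlier than data driven by core 200.
delay_100_ns, delay_200_ns = matching_output_delays(1.2, 1.5)

print(f"added output delay, core 100: {delay_100_ns:.2f} ns")
print(f"added output delay, core 200: {delay_200_ns:.2f} ns")
```

The resulting delay values would then be quantized to the nearest available setting of the corresponding delay line.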
References throughout this specification to “one embodiment” or “an embodiment” mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation encompassed within the present invention. Thus, appearances of the phrase “one embodiment” or “in an embodiment” are not necessarily referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be instituted in other suitable forms other than the particular embodiment illustrated and all such forms may be encompassed within the claims of the present application.
While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of the present invention.