This application relates to memories, and more particularly to a memory controller and its routing to a plurality of distributed endpoints.
A memory controller for external dynamic random access memory (DRAM) must meet certain strict timing relationships as required, for example, under the Joint Electron Device Engineering Council (JEDEC) standards. For example, the memory controller must satisfy the write latency (WL) requirement between the write data (DQ) to be written to the DRAM and the corresponding command and address (CA) signals. In other words, a DRAM cannot receive the write data in the same memory clock cycle over which the DRAM receives a write command. Instead, the write data is presented the write latency number of clock cycles after the presentation of the write command. With regard to enforcing the write latency, the memory controller digital core interfaces to the corresponding DRAM(s) through input/output (I/O) circuits that may also be designated as endpoints or endpoint circuits.
In applications such as for a personal computer (PC), the routing between the memory controller and its endpoints is relatively simple. In that regard, a PC microprocessor integrated circuit is mounted onto a motherboard that also supports various other integrated circuits such as those required for networking, graphics processing, and so on. A series of dynamic random access memory (DRAM) integrated circuits are also mounted onto the motherboard and accessed through a motherboard memory slot. The memory controller for the DRAMs is typically located within a memory controller integrated circuit that couples between the microprocessor bus and the DRAMs. The PC memory controller and its endpoints are relatively co-located within the memory controller integrated circuit, which simplifies routing the CA signals and the DQ signals to the endpoints with the proper signal integrity. Should the memory controller instead be integrated with the microprocessor, the memory controller may still be relatively co-located with the corresponding endpoints such that routing issues between the memory controller and the endpoints are mitigated.
But the memory controller design is quite different for a system on a chip (SoC) integrated circuit such as developed for the burgeoning smartphone/wearable market in which a package-on-package (PoP) LPDDR DRAM configuration is used for many products. In such PoPs, different DRAM pins may need to be accessed from different sides of the SoC. The endpoints (I/O circuits) are thus located on the periphery of the SoC die. In contrast, the memory controller is located more centrally within the SoC die so that the trace lengths for the buses from the memory controller to the various endpoints may be more readily matched. The memory controller in an SoC is therefore located relatively far from its endpoints, and the CA and DQ signals from an SoC memory controller must traverse relatively long propagation paths over the corresponding buses from the SoC memory controller to the endpoints. Should metal traces alone be used to form these relatively long propagation paths across the SoC die, the CA and DQ signals would be subject to significant propagation losses, delay, and noise. It is thus conventional to insert a plurality of buffers into the CA and DQ buses from the memory controller to the endpoints. The buffers boost the CA and DQ signals and thus address the losses and noise. In addition, the propagation delay along a metal trace is proportional to a product of its capacitance and resistance. Both these factors will tend to linearly increase as the propagation path length is extended such that the propagation delay becomes quadratically proportional to the path length. The shorter paths between the consecutive buffers on the buffered buses thus reduce the propagation delay that would otherwise occur across an un-buffered path having the same length as a buffered bus.
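The quadratic dependence of trace delay on length, and the benefit of splitting a long trace with buffers, may be illustrated with a minimal sketch. The function names, the per-millimeter resistance and capacitance values, and the lumped delay figure (total resistance times total capacitance) are assumptions chosen for illustration only; they are not taken from any particular process or from the disclosure.

```python
def wire_delay(r_per_mm, c_per_mm, length_mm):
    """Delay figure for an unbuffered trace: total resistance times total capacitance."""
    total_r = r_per_mm * length_mm  # resistance grows linearly with length
    total_c = c_per_mm * length_mm  # capacitance grows linearly with length
    return total_r * total_c        # so the product grows with length squared


def buffered_delay(r_per_mm, c_per_mm, length_mm, num_segments, buffer_delay=0.0):
    """Delay figure when buffers split the trace into equal-length segments."""
    segment_mm = length_mm / num_segments
    wires = num_segments * wire_delay(r_per_mm, c_per_mm, segment_mm)
    buffers = (num_segments - 1) * buffer_delay  # one buffer between adjacent segments
    return wires + buffers


if __name__ == "__main__":
    # Hypothetical unit values: 1.0 resistance unit and 1.0 capacitance unit per mm.
    print(wire_delay(1.0, 1.0, 4.0))         # 16.0
    print(buffered_delay(1.0, 1.0, 4.0, 4))  # 4.0
```

In this simplified model, dividing a trace into N equal segments reduces the wire contribution by a factor of N, at the cost of the delay added by each buffer.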
Since the buses carry high-frequency signals with tight timing requirements, the metal traces are typically subject to non-default routing (NDR) rules to minimize propagation delay, signal deterioration, and crosstalk. The NDR rules specify a larger wire width, larger spacing, and also shielding wires running in parallel with the signal wires to mitigate crosstalk and related issues. The resulting NDR routing between the memory controller and its endpoints in a conventional SoC demands significant area usage and complicates the routing of other signals.
As an alternative to the use of buffered buses and NDR routing, the CA and DQ buses may each be pipelined using a series of registers. The resulting routing for the pipelined paths need no longer follow NDR rules and is thus more compact as compared to the buffered routing approach. But the registers add a significant pipeline delay to each path. For example, if the CA and DQ buses are each pipelined with eight registers, it may require four clock cycles to drive a CA or DQ signal from the memory controller to an endpoint (assuming half the registers are clocked with the rising clock edges and half are clocked with the falling clock edges). But the CA bus carries both the read and the write commands. The SoC processor and other execution engines will thus be undesirably subjected to the pipeline delays every time they issue a read command. The increased delay for read data can negatively affect the performance of the various execution engines in the SoC. An SoC designer is then forced to choose between the area demands of bulky buffered CA and DQ buses or the increased delay of pipelined CA and DQ buses.
Accordingly, there is a need in the art for improved memory controller architectures for system on a chip applications such as used in PoP packages.
To improve density without suffering from increased delay, an integrated circuit is provided with a memory controller that drives a command and address (CA) write signal over a buffered CA bus and that drives a data (DQ) signal over a pipelined DQ bus. Since the buffered CA bus is not pipelined, the CA write signal will be received at a CA endpoint circuit in the same memory clock cycle as when it was launched from the memory controller. In contrast, the pipelined DQ bus has a pipeline delay corresponding to P cycles of the clock signal such that the DQ signal will be received at a DQ endpoint circuit P clock cycles after it was launched by the memory controller (P being a positive integer). In turn, the DQ endpoint circuit will launch the received DQ signal to an external memory having a write latency (WL) period requirement that equals WL clock cycles (WL also being a positive integer). To assure that the write latency period requirement is satisfied at the external memory, the memory controller is configured to launch the DQ signal a modified write latency period after the launching of the write command, where the modified write latency period equals (WL-P) clock cycles.
The resulting integrated circuit is relatively compact. In addition, a processor in the integrated circuit may issue read and write commands without suffering from the delays of a pipelined architecture. These and other advantageous features may be better appreciated through the following detailed description.
The various aspects of the present disclosure and their advantages are best understood by referring to the detailed description that follows. It should be appreciated that like reference numerals are used to identify like elements illustrated in one or more of the figures.
To increase density and operating speed, a memory controller is provided in which the command and address (CA) bus between the memory controller and its endpoints is buffered whereas the data (DQ) buses between the memory controller and its endpoints are pipelined with registers. Since there may be only one buffered CA bus for a relatively large number of pipelined DQ paths, the area demands from any non-default routing (NDR) of the metal traces for the buffered CA bus are minimal. In addition, the buffered CA bus increases memory operating speed. Since the data signals carried on the DQ buses will now be delayed by the clock cycles corresponding to the number of pipeline registers in each DQ bus whereas the CA signals will be unhindered by any pipelining, the write latency between the generation of the CA signals and the generation of the DQ signals within the memory controller is decoupled. In particular, the memory controllers disclosed herein launch their DQ signals with regard to a modified write latency that is shorter than the write latency required by the external memory.
An example system-on-a-chip (SoC) 100 including a memory controller 101 is shown in
In addition, memory controller 101 drives a plurality of pipelined data (DQ) buses 125 that are received by a corresponding plurality of DQ endpoints 145. Each pipelined DQ bus 125 includes a plurality of pipeline registers that are clocked by the memory write clock distributed by memory controller 101 to DQ endpoints 145. The corresponding clock paths and clock source are not shown for illustration clarity. Each DQ bus 125 may be deemed to comprise a means for propagating a DQ signal from the memory controller 101 to a DQ endpoint 145 with a pipeline delay. The pipeline registers may alternate as rising-edge clocked registers 115 and falling-edge clocked registers 120. The delay between a consecutive pair of registers 115 and 120 thus corresponds to one half cycle of the memory clock signal. The total delay in clock cycles across each pipelined DQ bus 125 thus depends upon how many pipeline stages formed by pairs of registers 115 and 120 are included. For example, if there are six registers 115 (and thus six registers 120) included in each pipelined DQ bus 125, the total pipeline delay in clock cycles for the DQ signals to propagate from memory controller 101 to the corresponding DQ endpoint 145 would be six clock cycles. In alternative implementations, pipelined DQ bus 125 may be responsive to just one clock edge (rising or falling) such that its registers would be all rising-edge triggered or all falling-edge triggered. As will be explained further herein, memory controller 101 is configured to use this pipeline delay with regard to launching the DQ data signals with respect to a modified or pseudo write latency period. For example, if the pipelining delay is six clock cycles whereas the desired write latency is eight clock cycles, memory controller 101 may launch the DQ signals two clock cycles after the launch of the corresponding write command.
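The relationship between the number of pipeline registers and the resulting pipeline delay may be sketched as follows. The function name and its parameters are hypothetical. The sketch assumes that each register in an alternating rising-edge/falling-edge chain contributes one half clock cycle, and that each register in a single-edge chain contributes one full clock cycle.

```python
from fractions import Fraction


def pipeline_delay_cycles(num_registers, alternate_edges=True):
    """Return the pipeline delay in clock cycles for a chain of registers.

    alternate_edges=True models registers that alternate between rising-edge
    and falling-edge clocking (one half cycle per register). Otherwise every
    register is assumed to use the same clock edge (one full cycle per register).
    """
    if num_registers < 0:
        raise ValueError("num_registers must be non-negative")
    if alternate_edges:
        return Fraction(num_registers, 2)
    return Fraction(num_registers)


if __name__ == "__main__":
    # Six rising-edge plus six falling-edge registers: twelve registers in total.
    print(pipeline_delay_cycles(12))                        # 6
    # Six registers that all trigger on the same clock edge.
    print(pipeline_delay_cycles(6, alternate_edges=False))  # 6
```

A fractional result would indicate a chain that ends on a half-cycle boundary, which this sketch reports rather than rounds.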
More generally, the pipelining delay may be represented by a variable P whereas the write latency required by the external memory may be represented as the variable WL (both delays being some integer number of clock cycles). The memory controller may thus launch the DQ signals WL−P clock cycles after the launch of the corresponding write command, that is, after a delay equal to the difference between the write latency and the pipelining delay. The write command is subjected to no pipelining delay on buffered CA bus 110 such that it arrives at CA endpoint 130 in the same clock cycle as when it was launched. In contrast, the DQ signals will be delayed by the pipelining delay. Since the DQ signals were launched WL−P clock cycles after the write command, the DQ signals thus arrive at their DQ endpoints 145 with a delay of WL−P+P=WL clock cycles after the launch of the CA write command. The desired write latency is thus maintained despite the lack of pipelining for the CA write command.
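The timing relationship above may be checked with a small cycle-by-cycle model. This is a sketch only: the function name is hypothetical, the pipelined DQ bus is modeled as a simple shift queue with one slot per clock cycle of delay, and the buffered CA bus is modeled as delivering the write command in the cycle in which it is launched.

```python
from collections import deque


def simulate_write(wl, p, launch_cycle=0):
    """Return (ca_arrival_cycle, dq_arrival_cycle) at the endpoints.

    wl: write latency required by the external memory, in clock cycles.
    p:  pipeline delay of the DQ bus, in clock cycles.
    """
    if wl < p:
        raise ValueError("write latency is shorter than the pipeline delay")

    dq_launch = launch_cycle + (wl - p)  # modified write latency
    pipe = deque([None] * p)             # models P cycles of pipeline delay
    ca_arrival = None
    dq_arrival = None

    for cycle in range(launch_cycle, launch_cycle + wl + 1):
        if cycle == launch_cycle:
            ca_arrival = cycle           # buffered CA bus: no pipeline delay
        pipe.append("DQ" if cycle == dq_launch else None)
        if pipe.popleft() == "DQ":       # value that entered P cycles earlier
            dq_arrival = cycle
    return ca_arrival, dq_arrival


if __name__ == "__main__":
    ca, dq = simulate_write(wl=8, p=6)
    print(ca, dq, dq - ca)  # 0 8 8
```

With WL equal to eight and P equal to six, the model launches the DQ signal at cycle two and observes it at the endpoint at cycle eight, so the endpoint sees the full write latency.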
Note that the required write latency for DRAMs such as specified by the JEDEC specification may depend upon the clock rate. The clock rate may be changed depending upon the mode of operation. For example, the clock rate may be slowed down in a low power mode of operation as compared to the rate used in a high performance mode of operation. In that regard, the JEDEC specification requires a write latency of eight clock cycles at a clock rate of 988 MHz but reduces the required write latency to be just three clock cycles at a clock rate of 400 MHz. The resulting change in clock rate may thus result in the changed write latency being less than the pipelining delay for each DQ bus 125. For example, if the pipelining delay was six clock cycles but the new value for the write latency was three clock cycles, memory controller 101 could not satisfy the required write latency even if it launched the DQ data signals in the same clock cycle as it launched the corresponding CA write command.
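The effect of a mode change on the modified write latency may be illustrated with the following sketch. The table holds only the two operating points mentioned above, the pipeline delay of six clock cycles is the example value used in this description, and the function name and table name are hypothetical. A result of None marks a mode in which the write latency is less than the pipeline delay so that no non-negative launch delay exists.

```python
# Write latency (in clock cycles) per clock rate (in MHz), as given in the text above.
WRITE_LATENCY_BY_CLOCK_MHZ = {988: 8, 400: 3}


def modified_latency_by_mode(pipeline_delay, wl_table):
    """Map each clock rate to WL - P, or to None when WL is less than P."""
    report = {}
    for clock_mhz, wl in wl_table.items():
        slack = wl - pipeline_delay
        report[clock_mhz] = slack if slack >= 0 else None
    return report


if __name__ == "__main__":
    print(modified_latency_by_mode(6, WRITE_LATENCY_BY_CLOCK_MHZ))
    # {988: 2, 400: None}
```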
To account for any changing of the write latency such as with regard to modes of operation, each pipelined DQ bus 125 in system 100 may be replaced by an adaptive pipelined DQ bus 140 as shown in
Note that each DQ signal carried on a corresponding pipelined DQ bus 125 or 140 is a multi-bit word just like the corresponding CA write command. Each pipelined DQ bus 125 or 140 may thus comprise a plurality of metal layer traces corresponding to the width in bits of the DQ signals they carry. These individual traces are not shown for illustration clarity. Each of registers 115 and 120 would thus comprise a plurality of such registers, one for each individual bit in the corresponding DQ signal.
A more detailed view of SoC 100 is shown in
A DQ generation circuit 210 is configured to calculate the delay difference between the write latency and the pipeline delay, which in this example would be two clock cycles. This delay difference may be considered to be a “modified write latency period” in that DQ generation circuit 210 launches the DQ signals responsive to the expiration of the delay difference period analogously to how a conventional memory controller would launch its DQ signals at the expiration of the write latency period following the launch of the write command. DQ timers 215 are configured accordingly to time this two clock cycle difference so that DQ generation circuit 210 launches the corresponding DQ signals two clock cycles after timing and command generation circuit 200 launched the write command. DQ generation circuit 210 may comprise a plurality of logic gates such as to implement a finite state machine configured to perform the necessary DQ generation and timing functions. The write latency with regard to the CA generation (in this example, eight clock cycles) and the modified write latency with regard to the DQ generation (in this example, two clock cycles) are thus decoupled. Although DQ buses 125 are pipelined, note that the read data buses from DQ endpoints 145 to memory controller 101 may be buffered so as to minimize the read latency. DQ generation circuit 210 may be considered to comprise a means for determining a delay difference period between a write latency period for an external memory and the pipeline delay and for driving the DQ signal into DQ bus 125 upon the expiration of the delay difference period.
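The timing function performed by a DQ timer may be sketched in software as a simple countdown state machine. This is a behavioral illustration only, not the logic of DQ timers 215 or DQ generation circuit 210: the class name, method names, and the convention that the timer is ticked once per clock cycle beginning with the cycle in which the write command is launched are all assumptions.

```python
class DQTimer:
    """Countdown that signals when the DQ signal should be launched."""

    def __init__(self, write_latency, pipeline_delay):
        if write_latency < pipeline_delay:
            raise ValueError("write latency is shorter than the pipeline delay")
        # Delay difference period (modified write latency), in clock cycles.
        self.delay = write_latency - pipeline_delay
        self.remaining = None  # None means the timer is idle

    def start(self):
        """Arm the timer in the cycle in which the write command is launched."""
        self.remaining = self.delay

    def tick(self):
        """Advance one clock cycle; return True in the cycle to launch DQ."""
        if self.remaining is None:
            return False
        if self.remaining == 0:
            self.remaining = None  # return to idle after firing
            return True
        self.remaining -= 1
        return False


if __name__ == "__main__":
    timer = DQTimer(write_latency=8, pipeline_delay=6)
    timer.start()  # write command launched at cycle 0
    launch_cycles = [cycle for cycle in range(5) if timer.tick()]
    print(launch_cycles)  # [2]
```

In this sketch a single timer tracks one outstanding write command; a controller that overlaps several write commands would presumably need one such countdown per outstanding command.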
The resulting latency between the launching of the CA write command and the write data (DQ) is shown in tabular form in
A method of operation will now be discussed with regard to the flowchart shown in
As those of some skill in this art will by now appreciate and depending on the particular application at hand, many modifications, substitutions and variations can be made in and to the materials, apparatus, configurations and methods of use of the devices of the present disclosure without departing from the spirit and scope thereof. In light of this, the scope of the present disclosure should not be limited to that of the particular implementations illustrated and described herein, as they are merely by way of some examples thereof, but rather, should be fully commensurate with that of the claims appended hereafter and their functional equivalents.