Universal memory interface with dynamic bidirectional data transfers

Information

  • Patent Grant
  • Patent Number
    12,204,468
  • Date Filed
    Wednesday, May 1, 2024
  • Date Issued
    Tuesday, January 21, 2025
Abstract
Semiconductor devices, packaging architectures and associated methods are disclosed. In one embodiment, a memory chiplet is disclosed. The memory chiplet includes at least one memory die of a first memory type. Memory control circuitry is coupled to the at least one memory die. An interface circuit is for coupling to a host IC chiplet. The interface circuit includes data input/output (I/O) circuitry for coupling to multiple data lanes. Link directional control circuitry selects, for a first memory transaction, a first subset of the multiple data lanes to transfer data between the memory chiplet and the host IC chiplet.
Description
TECHNICAL FIELD

The disclosure herein relates to semiconductor devices, packaging and associated methods.


BACKGROUND

As integrated circuit (IC) chips such as system on chips (SoCs) become larger, the yields realized in manufacturing the chips become smaller. Decreasing yields for larger chips increases overall costs for chip manufacturers. To address the yield problem, chiplet architectures have been proposed that favor a modular approach to SoCs. The solution employs smaller sub-processing chips, each containing a well-defined subset of functionality. Chiplets thus allow for dividing a complex design, such as a high-end processor or networking chip, into several small die instead of one large monolithic die.


When accessing memory, traditional chiplet architectures often employ relatively large and complex die-to-die (D2D) interfaces for transferring data between the chiplet and a specific memory type. While beneficial in certain circumstances, conventional D2D interfaces are typically designed to support a wide variety of applications. Using such generic interfaces for memory applications in a chiplet context is often non-optimal, sacrificing latency and power efficiency in the interest of wider interface applicability.





BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:



FIG. 1 illustrates a high-level top plan view of a multi-chip module (MCM), including a first integrated circuit (IC) chiplet that is coupled to a memory chiplet via a universal memory interface (UMI).



FIG. 2 illustrates one embodiment of respective receive and transmit pipelines coupled to common clock logic in the memory chiplet of FIG. 1.



FIG. 3 illustrates further detail relating to the common clock logic shown in FIG. 2.



FIG. 4 illustrates one embodiment of steps employed in a coarse training sequence for the common clock logic of FIG. 3.



FIG. 5 illustrates one embodiment of steps employed in a fine training sequence for the common clock logic of FIG. 3.





DETAILED DESCRIPTION

Semiconductor devices, packaging architectures and associated methods are disclosed. In one embodiment, a memory chiplet is disclosed. The memory chiplet includes at least one memory die of a first memory type. Memory control circuitry is coupled to the at least one memory die. An interface circuit is for coupling to a host IC chiplet. The interface circuit includes data input/output (I/O) circuitry for coupling to multiple data lanes. Link directional control circuitry selects, for a first memory transaction, a first subset of the multiple data lanes to transfer data between the memory chiplet and the host IC chiplet. By providing link directional control capability, such that multiple groups of links can change directions independently, arbitration efficiencies may be improved as compared to fixed link mapping.


Throughout the disclosure provided herein, the term multi-chip module (MCM) is used to represent a semiconductor device that incorporates multiple semiconductor die or sub-packages in a single unitary package. An MCM may also be referred to as a system in a package (SiP). The die or sub-packages that are interconnected in an MCM or SiP are referred to herein as chiplets. Packaged die that are disposed external to an MCM or SiP, such as being mounted on a printed circuit board (PCB), are referred to herein as chips.



FIG. 1 illustrates one embodiment of an MCM, generally designated 100, that employs a package substrate 102 for mounting a host integrated circuit (IC) chiplet 104 and a memory chiplet 106. An interface circuit 108 provides a low-latency and arbitration-efficient communications protocol between the host IC chiplet 104 and the memory chiplet 106. For one embodiment, the interface circuit 108 is packet-based and operates via a packet protocol. As explained in further detail below, embodiments of the communications protocol described herein enhance memory transaction efficiency while reducing power consumption.


Further referring to FIG. 1, the package substrate 102 may take a variety of forms, depending on the application. For some embodiments, the package substrate 102 may be realized as a “standard” package substrate, formed with an organic non-silicon material and incorporating a relatively sparse trace density appropriate for standard ball grid array (BGA) contact arrays (such as on the order of approximately one hundred to one hundred fifty microns). In other embodiments, the package substrate 102 may take the form of an “advanced” package substrate, such as a silicon interposer or silicon bridge-based substrate that provides a trace density on the order of approximately twenty-five to fifty-five microns.


With continued reference to FIG. 1, the host IC chiplet 104 generally includes processor circuitry 110 or other logic that performs operations on data, with the need to periodically carry out read and write data transfers with the memory chiplet 106. The processor circuitry 110 may take the form of one or more processors such as a central processing unit (CPU), graphics processing unit (GPU), tensor processing unit (TPU), artificial intelligence (AI) processing circuitry, field-programmable gate array (FPGA) circuitry or other form of host chiplet with a need to access memory.


Further referring to FIG. 1, the first IC chiplet 104 includes a communications fabric 112 for controlling not only on-chip communications, but also how the first IC chiplet 104 communicates off-chip with other chiplets, such as the memory chiplet 106. For one embodiment, the communications fabric 112 includes network-on-chip (NoC) circuitry, such as that disclosed in U.S. patent application Ser. No. 18/528,702, filed Dec. 4, 2023, titled: “UNIVERSAL NETWORK-ATTACHED MEMORY ARCHITECTURE”, owned by the assignee of the instant application and expressly incorporated herein by reference.


With continued reference to FIG. 1, the IC chiplet 104 includes a “primary” interface sub-circuit 114 that forms a portion of the overall interface 108. For one embodiment, the primary interface sub-circuit 114 includes minion link directional control circuitry 116 to control a selection of lanes for data transmission between the IC chiplet 104 and the memory chiplet 106. As explained more fully below, lane selection may be based on a variety of factors, including data prioritization, relative flow control between read and write pipelines, and other criteria. Clock generation and cycle count circuitry 118 provides a system clock signal SCLK to serve as a forwarded clock to synchronize overall timing throughout the interface 108, as well as providing a cycle count for dynamic bidirectional switching synchronization. Flow control circuitry 120 provides traffic regulation at the host IC chiplet 104 end to maximize data pipeline efficiencies. Further details pertaining to each of the above features of the primary interface sub-circuit 114 are more fully described below.


Further referring to FIG. 1, for one embodiment, the memory IC chiplet 106 includes memory control circuitry 122 for controlling the scheduling of memory transactions between the host IC chiplet 104 and memory of a specific standard or type, such as high-bandwidth memory (HBM), double-data rate (DDR) memory, low-power double data rate (LPDDR) memory or graphics double data rate (GDDR) memory, to name but a few. Positioning the memory control circuitry 122 on the memory chiplet 106 removes the need for the host IC chiplet 104 (often a costly application-specific integrated circuit) to know the type of memory being accessed. The memory-agnostic feature of the primary interface sub-circuit 114 allows the host IC chiplet 104 to be paired with a variety of memory types, hence taking the form of a universally applicable memory interface (UMI).


For one embodiment, the memory chiplet 106 includes a second portion of the interface 108, referred to herein as a “secondary” interface sub-circuit 124. For one embodiment, the secondary interface sub-circuit 124 includes master link directional control circuitry 126, lane allocation circuitry 128, cycle count circuitry 130 and I/O circuitry 132. In some embodiments, register storage 134 may be provided on the memory chiplet 106 to store configurable parameters, such as turnaround time, relating to bidirectional link control, among other things. A memory chiplet-side flow control circuit 136 cooperates with the host IC chiplet-side flow control circuit 120 and provides traffic regulation at the memory chiplet 106 end to maximize data pipeline efficiencies.


For some embodiments, the memory chiplet 106 may take the form of a single-die chiplet that includes the memory control circuitry 122 and the features of the secondary interface sub-circuit 124. The single-die chiplet may then be employed as a base die upon which are stacked memory die 135 for a stacked memory implementation, such as for HBM. Other embodiments may employ the single die as a buffer or intermediary between the IC chiplet 104 and memory die disposed proximate the single die on the package substrate 102 or off-MCM (not shown).


For one embodiment, the I/O circuitry 108 of the host IC chiplet 104 connects to the I/O circuitry 132 of the memory chiplet 106 via multiple lanes 136. For one embodiment, the multiple lanes 136 are configured (from the perspective of the IC chiplet 104) with memory transactions in mind, employing memory-centric features and functionality to reduce the latency and power consumption that might otherwise result from use of a generic D2D interface designed for a wide range of applications. For one embodiment, the multiple lanes 136 are configured or partitioned (from the perspective of the primary interface sub-circuit 114) into an egress link 138, an ingress link 140, a data link 142 and a forwarding clock link 144. For one specific embodiment, the data link 142 may be partitioned into multiple data links DATA1 and DATA2. For some embodiments, a bidirectional sideband link 146 may be employed for out-of-band communications. Further details regarding specific embodiments of the multiple lanes 136 are disclosed in copending U.S. patent application Ser. No. 18/652,675, filed May 1, 2024, titled “UNIVERSAL MEMORY INTERFACE”, owned by the assignee of the instant application and expressly incorporated herein by reference.
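
As an illustrative (non-limiting) sketch, the lane partitioning above may be modeled as configuration data. The thirty-seven-lane width of each data subset matches the implementation described below; the egress, ingress, forwarded-clock and sideband widths are assumptions used only for illustration, as they are not specified here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LinkPartition:
    name: str
    lanes: int            # number of physical lanes in this link
    bidirectional: bool   # True if the direction may be switched at runtime

# Partitioning of the multiple lanes 136, per FIG. 1 (widths partly assumed)
UMI_LINKS = (
    LinkPartition("EGRESS", lanes=16, bidirectional=False),    # host -> memory requests
    LinkPartition("INGRESS", lanes=16, bidirectional=False),   # memory -> host responses
    LinkPartition("DATA1", lanes=37, bidirectional=True),      # dynamically switched data
    LinkPartition("DATA2", lanes=37, bidirectional=True),      # dynamically switched data
    LinkPartition("FWD_CLK", lanes=1, bidirectional=False),    # forwarded system clock SCLK
    LinkPartition("SIDEBAND", lanes=1, bidirectional=True),    # out-of-band communications
)
```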


Further referring to FIG. 1, and as noted above, for one specific embodiment, the partitioned data link 142 takes the form of two subsets of bidirectional data lanes DATA1 and DATA2. Each subset of bidirectional data lanes includes a number of lanes sufficient to transfer a sixty-four byte cacheline of data within an acceptable turnaround time constraint. One implementation utilizes thirty-seven lanes for each subset of bidirectional data lanes for transferring the sixty-four byte cacheline of data. The time interval required to transfer the cacheline of data is referred to herein as a slot in the data link. Rather than assigning the subsets of bidirectional data lanes DATA1 and DATA2 to a fixed lane mapping (such as having one set operate only in a write direction, and the other set operate only in a read direction), the two subsets of lanes are individually dynamically configurable by the master link directional control circuitry 126 during operation to, as an example only, transfer a first cacheline of data (such as read data) in one direction, and subsequently transfer a second cacheline of data (such as write data) in the opposite direction. Dynamically configuring the subsets of bidirectional data lanes DATA1 and DATA2 in a memory-centric manner allows for arbitration efficiencies that typically exceed those of fixed lane mapping.
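
The slot sizing implied by these numbers can be checked with simple arithmetic. The sketch below assumes one bit per lane per unit interval (UI), which is a common serial-link convention not explicitly stated here.

```python
import math

CACHELINE_BITS = 64 * 8   # sixty-four byte cacheline, from the text
SUBSET_LANES = 37         # lanes per bidirectional subset, from the text

# UIs needed to move one cacheline across one subset (one bit/lane/UI assumed):
uis_per_slot = math.ceil(CACHELINE_BITS / SUBSET_LANES)          # -> 14
spare_bit_times = SUBSET_LANES * uis_per_slot - CACHELINE_BITS   # -> 6

print(uis_per_slot, spare_bit_times)  # 14 UIs per slot, 6 spare bit-times
```

The six spare bit-times per slot suggest room for per-slot framing or metadata, though how any spare capacity is used is left unspecified here.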


In an effort to maximize the arbitration efficiency involved in communications between the IC chiplet 104 and the memory chiplet 106, the minion link directional control circuitry 116 of the “primary” interface sub-circuit 114 and the master link directional control circuitry 126 of the “secondary” interface sub-circuit 124 are configured to cooperate in selecting between the subsets of lanes DATA1 and DATA2 for a given memory operation based on a variety of factors. For one embodiment, the secondary interface sub-circuit 124 is configured as the default owner of the bidirectional data links DATA1 and DATA2. Being positioned on the memory chiplet 106, the secondary interface sub-circuit 124 is closer to the memory 135 and the memory control circuitry 122 than any circuitry of the host IC chiplet 104. This allows fast access to link availability information involving, for example, a read data pipeline in the memory chiplet 106, and specifically information regarding when read operations have been scheduled by the memory control circuitry 122, and thus when read data is available to traverse one or more of the data links DATA1 and DATA2. The proximity also drastically lowers the area and power needed to add dedicated control signals between the memory control circuitry 122 and the master link directional control circuitry 126, and reduces any latency and area penalty that may be involved in obtaining the read operation information.


As explained above, with the read pipeline information readily available and in close proximity, for one embodiment, the master link directional control circuitry 126 on the memory chiplet 106 uses the link availability information as at least one factor in selecting which of the two subsets of bidirectional data lanes DATA1 or DATA2 to use for a given write operation or read operation. In addition to read pipeline information, the master link directional control circuitry 126 may track the difference between write requests received and write data received to determine the number of outstanding write data transfers. It can then use this information to determine how aggressively to allocate slots for write data transfers.
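
A minimal sketch of that bookkeeping follows. The class and the threshold policy are hypothetical; only the request-minus-data difference itself is described above.

```python
class OutstandingWriteTracker:
    """Tracks outstanding write data transfers as the difference between
    write requests received and write data payloads received."""

    def __init__(self):
        self.requests_received = 0
        self.data_received = 0

    def on_write_request(self):
        self.requests_received += 1

    def on_write_data(self):
        self.data_received += 1

    @property
    def outstanding(self):
        # Write requests whose data has not yet crossed the link
        return self.requests_received - self.data_received

    def write_slot_pressure(self, threshold=8):
        # Illustrative policy: allocate write slots aggressively once the
        # backlog exceeds a threshold (the threshold value is an assumption).
        return self.outstanding >= threshold
```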


For some embodiments, selecting between the two subsets of bidirectional data lanes DATA1 or DATA2 to use for a given write operation or read operation may be achieved in a variety of ways. For one specific embodiment, the link direction is controlled by “link availability” information generated by the master link directional control circuitry 126 and specified in a field of a response packet, which is issued from the memory chiplet 106 by the “secondary” interface sub-circuit 124 for transfer along the ingress link 140 to the host IC chiplet 104. Regardless of the implementation employed, a balance between link bandwidth utilization and read latency should be observed. For instance, based upon memory scheduling alone there may not be open timing intervals in the read pipeline that are large enough to allow for write data transfers, or the open timing intervals might not be large enough to minimize bandwidth lost to turnaround cycles. In these cases, the master link directional control circuitry 126 may need to temporarily delay some read data returns to allow for better link utilization and to prevent a large number of write data transactions from being queued in the host IC chiplet 104. The queuing of write data transactions should be limited to avoid two possible negative performance effects: (a) if the write data queues are ever filled, processing in the host IC chiplet 104 must be stalled to prevent overflow, and (b) read-after-write conflict logic often requires writes to be flushed to the DRAM, so a conflicting read may need to wait for a long time if the offending write data has not already been transferred to the memory chiplet 106.
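
The per-slot trade-off described above might be sketched as follows. This is not the arbitration logic itself; the thresholds and the notion of "open read slots ahead" are assumptions used only to illustrate the bandwidth-versus-latency balance.

```python
def choose_direction(read_scheduled, outstanding_writes, open_read_slots,
                     turnaround_slots=1, min_write_burst=2,
                     write_pressure_limit=8):
    """Decide the direction of one bidirectional lane subset for the next
    slot. Returns "READ" or "WRITE". All numeric limits are illustrative."""
    # Under heavy write pressure, temporarily delay read returns so the
    # host-side write queues do not fill (negative effect (a) above).
    if outstanding_writes >= write_pressure_limit:
        return "WRITE"
    # Scheduled read data normally wins the slot.
    if read_scheduled:
        return "READ"
    # Only turn the link around for writes when the open interval is long
    # enough that turnaround cycles do not dominate the bandwidth gained.
    if outstanding_writes and open_read_slots >= turnaround_slots + min_write_burst:
        return "WRITE"
    return "READ"  # default ownership favors the read direction
```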


In addition to specifying link availability information, for one embodiment, response packets that control link direction also specify a clock synchronization parameter referred to herein as a cycle count. The cycle count represents a count of the system clock cycles at a given timing instant and at a given location in the system. For one specific embodiment, and as described more fully below, the cycle count is generated by the cycle count circuitry 130 of the memory chiplet 106, thus providing a common timing reference point for both chiplets. Distributing the cycle count so that it is known by both the primary interface sub-circuit 114 and the secondary interface sub-circuit 124 allows for performing synchronized operations across the link, particularly those involving bidirectional bus direction switchovers.



FIG. 2 illustrates one embodiment of a pipeline architecture, generally designated 200, that includes a cycle counter 202 which may be employed by both the primary interface sub-circuit 114 and the secondary interface sub-circuit 124 to generate the clock cycle count that supports synchronized transfers between the two interface sub-circuits 114 and 124 while employing dynamic bidirectional capability for the data links DATA1 and DATA2. For one embodiment, the pipeline architecture 200 includes a receive pipeline 204 with sampling circuitry 206 to sample signals received from the link partner interface. The sampled signals are then fed to deserialization circuitry 208 to convert the serialized stream into parallel signals. The parallel signals are then forwarded to a receive buffer 210, which is coupled to common clock logic 212 that includes the cycle counter 202. Running in a direction opposite to that of the receive pipeline 204, a transmit pipeline 214 includes a transmit buffer 216 coupled to the common clock logic 212. The transmit buffer 216 feeds serialization circuitry 218, which serializes parallel signals from the transmit buffer for transmission to the chiplet link partner by driver circuitry 220.
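
As a simple software illustration of the deserialization step performed by circuitry 208, the generator below gathers serial samples into parallel words; the eight-bit width and LSB-first ordering are assumptions made for the sketch.

```python
def deserialize(bitstream, width=8):
    """Collect serial bit samples into parallel words (LSB-first assumed)."""
    word, nbits = 0, 0
    for bit in bitstream:
        word |= (bit & 1) << nbits
        nbits += 1
        if nbits == width:
            yield word        # forward one parallel word to the receive buffer
            word, nbits = 0, 0

# Example: eight sampled bits become the parallel word 0x2D
assert list(deserialize([1, 0, 1, 1, 0, 1, 0, 0])) == [0x2D]
```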



FIG. 3 illustrates an overall data path and associated logic for one specific embodiment of the common clock logic 212. The common clock logic 212 includes majority function logic 302 that performs a majority function on received pattern information (during a training procedure described more fully below). A first count register 304 stores a received reflected count that is received by the majority function logic 302. A second count register 306 stores a received partner count that is also received by the majority function logic 302. A third count register 308 stores a snapshot count value from the cycle counter 202. An invariant cycle counter 310 couples to the cycle counter 202. Histogram tracking logic 312 is coupled to the invariant cycle counter 310 and provides histogram information to controller logic 314.


Prior to operation, the multiple lanes 136 undergo a lane initialization or calibration sequence to deskew relative signal propagation times between the various lanes. The deskew process may take the form of one from a variety of methods, with the underlying goal to have all signals for a given clock cycle or unit interval (UI) of a given packet arrive at the intended receiver circuitry in a predetermined alignment.


Further referring to FIG. 3 and FIG. 4, following the lane training and word/byte alignment training, the cycle count value may be generated by carrying out a cycle count training process. The cycle count training process may be performed as a one-time procedure during an initialization mode of operation, or periodically during a run-time mode of operation. For one embodiment, the training process involves an initial coarse training sequence to generate a coarse calibrated value followed by a fine training sequence that further refines the coarse calibrated value into a finely tuned calibrated value.


To support the cycle count training process, the entire interface 108 is configured in a training mode of operation with sets of lanes associated with the ingress and egress lanes 140 and 138 designated for transferring specific patterns that include information regarding certain cycle counts. For example, in one embodiment, two specific training patterns are employed by each interface sub-circuit 114 and 124 for concurrent transmission along a certain number of lanes. One pattern may include a repeating 8-bit (byte) cycle count value, while the other pattern may include a repeating 8-bit received partner count value. Each pattern is sent across three lanes, for a total of six lanes in each direction. Comma values may be spaced every thirty-two repetitions of the count values.
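
One way to produce such a pattern stream is sketched below. The comma byte value (0xBC, the 8b/10b K28.5 code) is an assumption; only the spacing of one comma per thirty-two count repetitions is specified above.

```python
def training_pattern(read_count, comma=0xBC, reps_per_comma=32):
    """Yield an endless training byte stream: thirty-two repetitions of the
    current 8-bit count value, then a comma byte. `read_count` is a
    hypothetical callable returning the count to embed (the local cycle
    count for the first pattern, the received partner count for the second)."""
    while True:
        for _ in range(reps_per_comma):
            yield read_count() & 0xFF  # count value updates in real time
        yield comma
```

Each of the two patterns would then be replicated across its three assigned lanes, for six lanes in each direction.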


Referring now to FIG. 4, for one embodiment, the coarse training sequence starts with the primary interface sub-circuit 114 sending a start event, at 402, to the secondary interface sub-circuit 124 via a write transaction executed over the sideband link 146. The write transaction essentially serves as a command for the secondary interface sub-circuit 124 to begin its cycle count training sequence. The primary interface sub-circuit 114 then starts its own cycle count training sequence, at 404. At the start of the cycle count training sequence, each interface sub-circuit, or “PHY”, at 406, sets its own cycle counter to zero, and begins transmitting the first pattern (which includes that PHY's cycle count value) along a first set of three lanes, and the second pattern (which includes the other PHY's cycle count value) along a second set of three lanes. During the initial clock cycles of the training sequence, the received partner count value will be zero for several clock cycles until a count is received from the other PHY. As the training sequence progresses, real-time count value updates are made to the patterns.


With the patterns running, two values will be available to the common clock logic 212 on the receive path of each PHY: the received partner count value from the remote PHY, stored in the second register 306 of the common clock logic 212, and the local PHY's own forwarded cycle count value, known as the received reflected count value, captured in the first register 304 of the common clock logic 212 after undergoing a full round trip of delay. Once a valid received reflected count value is received on the forwarded channel, at 408, the common clock logic 212 for that chiplet immediately creates a snapshot copy of its own cycle count value, known as a cycle count snapshot, and loads the value into the third register 308 of the common clock logic 212, at 410. The cycle count snapshot thus represents the number of clock cycles that it took for the forwarded cycle count value to propagate to the partner chiplet and back, i.e., a round-trip propagation delay. At the secondary interface sub-circuit 124, the common clock logic 212 periodically compares, at 412, whether half the difference between the received reflected count value and the cycle count snapshot matches the difference between the received partner count and the cycle count snapshot. If the comparison results in a non-match, then the common clock signal fed to the common clock logic is incrementally delayed or advanced by a delay circuit, at 414, to reduce the cycle count difference by at least one clock cycle for a subsequent comparison. The comparison and adjustment steps are iterated until the comparison, at 412, results in a match.
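
The comparison at step 412 can be expressed directly from the three register values. The sign convention below (counter values increase during the round trip) is an assumption made to keep the sketch concrete.

```python
def coarse_match(snapshot, reflected, partner):
    """Step 412: half the round trip (snapshot - received reflected count)
    should equal the apparent offset of the partner's counter
    (snapshot - received partner count) once the two PHYs are aligned."""
    round_trip = snapshot - reflected
    return round_trip % 2 == 0 and round_trip // 2 == snapshot - partner

# Step 414, iterated until a match (delay_step() is hypothetical hardware):
# while not coarse_match(reg308.value, reg304.value, reg306.value):
#     delay_step()   # incrementally delay or advance the common clock
```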


Once the comparison operation results in a consistent matching condition, the cycle count training sequence ends the coarse training sequence and begins the fine training sequence with a series of steps shown in FIG. 5. The fine training sequence begins, at 502, by setting a fine tune counter to 20479 (in terms of system clock cycles), such that it counts down to zero, defining a fine count interval. At the start of the fine count interval, the invariant cycle counter will be loaded from the cycle counter, at 504. The invariant cycle count will free-run from this point on and will not be subject to any incrementing or decrementing adjustments. Based on the timing difference between the cycle count and the invariant cycle count, a histogram is created, at 506, that counts the number of cycles where the timing difference is equal to 0, +1, −1, +2, −2, and Other.


Further referring to FIG. 5, once the fine tune count interval ends, a delay will be applied to the system clock such that the cycle count for the secondary interface sub-circuit is adjusted, at 508, to coincide with the median of the histogram. No further adjustments are made. If any samples of the histogram corresponded to the “Other” category, at 510, then the fine tune sequence has failed, and a new sequence is initiated, at 502. If the sequence consistently fails, then an error condition is generated and sent to the primary interface sub-circuit 114. If the sequence succeeds, then the training sequence ends, at 512, and the interface 108 is placed into a runtime mode of operation.
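
A software model of one fine training pass might look like the following; `sample_difference` is a hypothetical callable standing in for the hardware comparison of the cycle count against the invariant cycle count.

```python
from collections import Counter

FINE_INTERVAL = 20480  # fine tune counter starts at 20479 and counts down to 0

def fine_training_pass(sample_difference):
    """Build the histogram of steps 504-506 and return the median
    adjustment of step 508, or None if any sample fell in "Other" (510)."""
    bins = Counter()
    for _ in range(FINE_INTERVAL):
        d = sample_difference()  # cycle count minus invariant cycle count
        bins[d if d in (0, 1, -1, 2, -2) else "Other"] += 1
    if bins["Other"]:
        return None  # sequence failed; a new sequence is initiated
    samples = sorted(k for k, n in bins.items() for _ in range(n))
    return samples[len(samples) // 2]  # median of the histogram
```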


During runtime operation, the data links may be selectively switched to transfer data in either direction to allow for optimized traffic conditions. For some embodiments, the data traffic conditions may be additionally optimized through use of a flow control process that regulates packet transfers to prevent buffer circuits at each end of the links from overflowing or starving while at the same time optimizing the filling of the buffer circuits to reduce latency. For one embodiment, a credit-based flow control scheme is employed that maintains credit counters in each chiplet. The use of such counters provides a predictive indication of the buffer usage at the other end of the link, without the need to wait for an actual acknowledgment that a given packet was received. As one example, where the memory chiplet 106 has a maximum buffer space, during the initialization mode the buffer space may be advertised to the host IC chiplet 104. In response, the host IC chiplet may configure a credit counter with a number of “credits” that corresponds to the available buffer space of the memory chiplet 106. Having a positive count of credits in the credit counter provides an indication to a transmitter in the host IC chiplet 104 that the destination receiver has room for additional data. As packets are sent to the destination receiver, the credit counter may decrement to account for the reduced buffer availability. A value of “0” generally corresponds to no availability in the buffer. When the packets are received and validated at the destination, the memory chiplet 106 may send a response packet confirming receipt of the packets. The credit counter at the host IC chiplet 104 may then increment the count back up upon receiving the response packet.
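
A minimal host-side model of that credit scheme is sketched below, under the assumption that one credit corresponds to one packet-sized buffer slot (the credit granularity is not fixed above).

```python
class CreditCounter:
    """Host-side credit counter: seeded from the buffer space advertised by
    the memory chiplet, decremented on transmit, replenished on response."""

    def __init__(self, advertised_slots):
        self.credits = advertised_slots  # set during the initialization mode

    def can_send(self, packets=1):
        return self.credits >= packets   # zero credits: no buffer availability

    def on_send(self, packets=1):
        assert self.can_send(packets), "would overrun the remote buffer"
        self.credits -= packets          # predictively account for buffer use

    def on_response(self, packets_confirmed):
        self.credits += packets_confirmed  # receipt confirmed by response packet
```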


While the dynamic bidirectional switching, cycle count, and flow control circuitry and techniques described above cooperate to maximize arbitration efficiency with minimal latency, further improvements may be realized by including the ability to dispatch partial transfers between the host IC chiplet 104 and the memory chiplet 106, and reassembling the partial transfers into a whole transfer at the receiving end. For one embodiment, this may be accomplished by using the lane allocation circuitry 128 to tap into the read and write pipeline information and identify potential intervals or “holes” along a given link where a full or partial cacheline of data may be inserted for transfer. This allows for maximizing the transfer efficiency of the link by fully packing the pipeline of interest with data. Partial transfers may be tracked through use of tag information that may be included in packet fields. For one embodiment, the tag information for multiple partial packet payloads is unique, allowing for reassembly of the partial payloads into a full payload, such as an entire cacheline of data, at the receiving end of the link. The lane allocation circuitry 128 may also include logic to allocate lane availability based on one or more prioritization schemes.
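
Receive-side reassembly of tagged partial transfers might be modeled as below. The byte-offset fragment format is an assumption; the description above requires only that the tag information be unique across the partial payloads of a transfer.

```python
class PartialTransferReassembler:
    """Reassemble uniquely tagged partial payloads into full cachelines."""

    def __init__(self, cacheline_bytes=64):
        self.cacheline_bytes = cacheline_bytes
        self.pending = {}  # tag -> {byte offset: fragment bytes}

    def on_fragment(self, tag, offset, payload):
        frags = self.pending.setdefault(tag, {})
        frags[offset] = payload
        if sum(len(p) for p in frags.values()) == self.cacheline_bytes:
            cacheline = b"".join(frags[o] for o in sorted(frags))
            del self.pending[tag]
            return cacheline   # full payload recovered at the receiving end
        return None            # partial; wait for more fragments
```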


When received within a computer system via one or more computer-readable media, data and/or instruction-based expressions of the above-described circuits may be processed by a processing entity (e.g., one or more processors) within the computer system in conjunction with execution of one or more other computer programs including, without limitation, net-list generation programs, place and route programs and the like, to generate a representation or image of a physical manifestation of such circuits. Such representation or image may thereafter be used in device fabrication, for example, by enabling generation of one or more masks that are used to form various components of the circuits in a device fabrication process.


In the foregoing description and in the accompanying drawings, specific terminology and drawing symbols have been set forth to provide a thorough understanding of the present invention. In some instances, the terminology and symbols may imply specific details that are not required to practice the invention. For example, any of the specific numbers of bits, signal path widths, signaling or operating frequencies, component circuits or devices and the like may be different from those described above in alternative embodiments. Also, the interconnection between circuit elements or circuit blocks shown or described as multi-conductor signal links may alternatively be single-conductor signal links, and single-conductor signal links may alternatively be multi-conductor signal links. Signals and signaling paths shown or described as being single-ended may also be differential, and vice-versa. Similarly, signals described or depicted as having active-high or active-low logic levels may have opposite logic levels in alternative embodiments. Component circuitry within integrated circuit devices may be implemented using metal oxide semiconductor (MOS) technology, bipolar technology or any other technology in which logical and analog circuits may be implemented. With respect to terminology, a signal is said to be “asserted” when the signal is driven to a low or high logic state (or charged to a high logic state or discharged to a low logic state) to indicate a particular condition. Conversely, a signal is said to be “deasserted” to indicate that the signal is driven (or charged or discharged) to a state other than the asserted state (including a high or low logic state, or the floating state that may occur when the signal driving circuit is transitioned to a high impedance condition, such as an open drain or open collector condition). A signal driving circuit is said to “output” a signal to a signal receiving circuit when the signal driving circuit asserts (or deasserts, if explicitly stated or indicated by context) the signal on a signal line coupled between the signal driving and signal receiving circuits. A signal line is said to be “activated” when a signal is asserted on the signal line, and “deactivated” when the signal is deasserted. Additionally, the prefix symbol “/” attached to signal names indicates that the signal is an active low signal (i.e., the asserted state is a logic low state). A line over a signal name is also used to indicate an active low signal. The term “coupled” is used herein to express a direct connection as well as a connection through one or more intervening circuits or structures. Integrated circuit device “programming” may include, for example and without limitation, loading a control value into a register or other storage circuit within the device in response to a host instruction and thus controlling an operational aspect of the device, establishing a device configuration or controlling an operational aspect of the device through a one-time programming operation (e.g., blowing fuses within a configuration circuit during device production), and/or connecting one or more selected pins or other contact structures of the device to reference voltage lines (also referred to as strapping) to establish a particular device configuration or operational aspect of the device. The term “exemplary” is used to express an example, not a preference or requirement.


While the invention has been described with reference to specific embodiments thereof, it will be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention. For example, features or aspects of any of the embodiments may be applied, at least where practicable, in combination with any other of the embodiments or in place of counterpart features or aspects thereof. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Claims
  • 1. A chiplet-based multi-chip module (MCM) to couple to a base substrate, comprising: a package substrate that is separate from the base substrate; a host integrated circuit (IC) chiplet coupled to the package substrate and comprising: at least one processing element; a primary memory interface to transfer memory information from the at least one processing element via a communications fabric; a memory chiplet comprising: a secondary memory interface coupled to the primary memory interface via multiple data lanes; a memory port comprising a memory physical interface to access memory storage; and link directional control circuitry to select, for a first memory transaction, a first subset of the multiple data lanes to transfer data between the host IC chiplet and the memory chiplet.
  • 2. The chiplet-based MCM of claim 1, wherein the memory chiplet further comprises: a read data pipeline; and wherein the link directional control circuitry selects the first subset of the multiple data lanes based on usage information relating to the read data pipeline.
  • 3. The chiplet-based MCM of claim 1, wherein the memory chiplet further comprises: a write data pipeline; and wherein the link directional control circuitry selects the first subset of the multiple data lanes based on usage information relating to the write data pipeline.
  • 4. The chiplet-based MCM of claim 3, wherein: the usage information relating to the write data pipeline comprises difference information between received write data requests and received write data corresponding to the received write data requests.
  • 5. The chiplet-based MCM of claim 1, wherein: the link directional control circuitry is to select, for a memory write operation following a memory read operation, the first subset of the multiple data lanes to receive write data from the host IC chiplet.
  • 6. The chiplet-based MCM of claim 1, wherein: the link directional control circuitry is to select, for a memory write operation performed concurrent with a memory read operation, a second subset of the multiple data lanes to receive write data from the host IC chiplet.
  • 7. The chiplet-based MCM of claim 1, wherein: the multiple data lanes comprise multiple bidirectional data lanes.
  • 8. The chiplet-based MCM of claim 1, wherein: the multiple data lanes correspond to a full set of memory channels associated with the memory chiplet.
  • 9. The chiplet-based MCM of claim 1, wherein: the secondary memory interface and the memory port of the memory chiplet are formed in a base die; and the memory storage comprises at least one memory die stacked on the base die.
  • 10. The chiplet-based MCM of claim 1, wherein: the memory storage comprises high bandwidth memory (HBM), double-data rate (DDR) memory, low-power double-data rate (LPDDR) memory or graphics double-data rate (GDDR) memory.
  • 11. A memory chiplet, comprising: at least one memory die; memory control circuitry coupled to the at least one memory die; an interface circuit comprising multiple input/output (I/O) circuits to couple to a host IC chiplet via multiple data lanes; and link directional control circuitry to select, for a first memory transaction, a first subset of the multiple data lanes to transfer data between the memory chiplet and the host IC chiplet.
  • 12. The memory chiplet of claim 11, further comprising: a read data pipeline; and wherein the link directional control circuitry selects the first subset of the multiple data lanes based on usage information relating to the read data pipeline.
  • 13. The memory chiplet of claim 11, further comprising: a write data pipeline; and wherein the link directional control circuitry selects the first subset of the multiple data lanes based on usage information relating to the write data pipeline.
  • 14. The memory chiplet of claim 13, wherein: the usage information relating to the write data pipeline comprises difference information between received write data requests and received write data.
  • 15. The memory chiplet of claim 11, wherein: the link directional control circuitry is to select, for a memory write transaction following a memory read operation, the first subset of the multiple data lanes to receive write data from the host IC chiplet.
  • 16. The memory chiplet of claim 11, wherein: the link directional control circuitry is to select, for a memory write operation performed concurrent with a memory read operation, a second subset of the multiple data lanes to receive write data from the host IC chiplet.
  • 17. The memory chiplet of claim 11, wherein: the multiple I/O circuits comprise multiple bidirectional I/O circuits.
  • 18. The memory chiplet of claim 11, wherein: the interface circuit and the memory control circuitry of the memory chiplet are formed in a base die; and the at least one memory die is stacked on the base die.
  • 19. The memory chiplet of claim 18, wherein: the at least one memory die comprises high bandwidth memory (HBM), double-data rate (DDR) memory, low-power double-data rate (LPDDR) memory or graphics double-data rate (GDDR) memory.
  • 20. A host integrated circuit (IC) chiplet, comprising: at least one processing element; a communications fabric switchably coupled to the at least one processing element; and a primary memory interface to transfer memory information from the at least one processing element via the communications fabric, the primary memory interface comprising data input/output (I/O) circuitry to couple to a memory chiplet via multiple data lanes; and link directional control circuitry responsive to control information from the memory chiplet to configure, for a first memory transaction, a first subset of the data I/O circuitry corresponding to a first subset of the multiple data lanes to transfer data between the host IC chiplet and the memory chiplet.
CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a Non-Provisional that claims priority to U.S. Provisional Application No. 63/543,517, filed Oct. 11, 2023, titled “UNIVERSAL MEMORY INTERFACE (UMI) WITH HALF-DUPLEX BIDIRECTIONAL D2D & C2C PHYS FOR PACKET-BASED MEMORY TRAFFIC TRANSFER”, which is incorporated herein by reference in its entirety.

Provisional Applications (1)
Number Date Country
63543517 Oct 2023 US