The present disclosure generally relates to the field of electronics. More particularly, various embodiments of the invention relate to memory stacking and/or transferring data from stacked memory, for example, through die-to-die vias.
Memory access times may be a performance bottleneck in some computing systems. For example, when data stored in a memory is accessed through a shared bus, memory accesses may need to be synchronized with edges of a synchronization clock signal. Since the clock edges may occur at certain intervals, data accesses may need to wait for one or more clock periods before data communication can commence, even if the data is otherwise ready for transfer. Also, memory accesses through a shared bus may be further delayed, for example, because the bus may not be available until data transfers by other devices sharing the same bus are complete.
Generally, memory may include a dynamic random access memory (DRAM) chip. A DRAM chip may be organized as a two-dimensional matrix and each memory location may be accessed using a row address and column address. The total access time for a memory chip may correspond to three components: row access time, column access time, and data transfer time.
For each memory access, a row may be activated (or opened) and the row data may be moved to a page buffer. Subsequently, a column address may be used to select data from the page buffer. Furthermore, a DRAM chip may include sense amplifiers to amplify signals corresponding to data bits stored in a row. These sense amplifiers may be implemented as differential sense amplifiers and may consume more power than some of the other components of a DRAM, and their operation may increase memory latency. Accordingly, each time a row is activated, memory latency may be increased and additional power may be consumed by the corresponding sense amplifiers.
To reduce the memory access latency, an activated (or open) row may remain activated until another row is accessed. This policy may be referred to as an “open page” policy, which may work efficiently if successive operations access the same memory row. However, keeping a row open may result in additional power consumption.
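The three access-time components and the effect of an open page policy can be illustrated with a small model. The following is a minimal sketch only: the class name `DramTimingModel`, the helper `total_latency`, and all nanosecond values are hypothetical and chosen for illustration; they are not taken from this disclosure or from any particular DRAM device.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class DramTimingModel:
    """Toy latency model; every timing value here is an illustrative assumption."""
    row_access_ns: float = 15.0      # assumed cost of activating (opening) a row
    column_access_ns: float = 15.0   # assumed cost of selecting a column
    transfer_ns: float = 5.0         # assumed cost of moving the data out
    open_row: Optional[int] = None   # row currently held in the page buffer

    def access(self, row: int) -> float:
        """Return the latency of one access under an open page policy."""
        latency = self.column_access_ns + self.transfer_ns
        if self.open_row != row:
            # Row miss: the row must first be activated into the page buffer.
            latency += self.row_access_ns
            self.open_row = row
        return latency


def total_latency(rows: Iterable[int]) -> float:
    """Sum the latency of a sequence of row accesses on a fresh model."""
    model = DramTimingModel()
    return sum(model.access(row) for row in rows)
```

Under these assumed values, three successive accesses to the same row pay the row access time once, whereas alternating between two rows pays it on every access. The sketch models latency only; it does not capture the additional power consumed while a row is held open.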
The detailed description is provided with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of various embodiments. However, some embodiments may be practiced without the specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to obscure the particular embodiments.
Some of the embodiments discussed herein may provide efficient mechanisms for transferring data from a stacked memory chip through a dedicated (or non-shared) interconnect, such as die-to-die vias. In an embodiment, data may be transferred (or prefetched) through vias to reduce memory latency and/or power consumption in devices or systems that include multiple dies, such as those discussed with reference to
In an embodiment, vias 106 may be constructed with material such as aluminum, copper, silver, gold, combinations thereof, or other electrically conductive material. Moreover, each of the dies 102 and 104 may include circuitry corresponding to various components of a computing system, such as the components discussed with reference to
As illustrated in
In an embodiment, the processor 302-1 may include one or more processor cores 306-1 through 306-M (referred to herein as “cores 306,” or more generally as “core 306”), a cache 308 (which may be a shared cache or a private cache), and/or a router 310. The processor cores 306 may be implemented on a single integrated circuit (IC) chip (e.g., one of the dies 102 or 104 of
In one embodiment, the router 310 may be used to communicate between various components of the processor 302-1 and/or system 300. Moreover, the processor 302-1 may include more than one router 310. Furthermore, the multitude of routers (310) may be in communication to enable data routing between various components inside or outside of the processor 302-1. For example, the router 310 may communicate through the vias 106 and/or 206 of
The cache 308 may store data (e.g., including instructions) that are utilized by one or more components of the processor 302-1, such as the cores 306. For example, the cache 308 may locally cache data stored in a memory 314 for faster access by the components of the processor 302. As shown in
In an embodiment, the cache 308 (that may be shared) may be a last level cache (LLC). Also, each of the cores 306 may include a level 1 (L1) cache (316-1) (generally referred to herein as “L1 cache 316”). Furthermore, the processor 302-1 may include a mid-level cache that is shared by several cores (306). Various components of the processor 302-1 may communicate with the cache 308 directly, through a bus (e.g., the bus 312), and/or a memory controller or hub.
As illustrated in
In one embodiment, the system 400 may include an optional page cache 410 and an optional page cache controller 412. The page cache 410 may store data that is transferred (or prefetched) from the memory 314, and subsequently provided to the cache 308, as will be further discussed with reference to some of the operations of
Referring to
In an embodiment, if the corresponding data of the operation 504 is absent from the cache 308, the page cache controller 412 may determine if the corresponding data is present in the page cache 410 at an operation 508. If the page cache 410 includes the corresponding data, the data may be copied from the page cache 410 into the cache 308 (e.g., including one or more of the caches 402) at an operation 510, for example, by the controllers 404 and/or 412.
In one embodiment, after the operation 504 determines that the data is absent from the cache 308, the cache controller 404 may generate a cache miss signal, and, in response to the cache miss signal, the logic 408 may generate one or more memory access (or prefetch) requests at an operation 512. The memory controller 406 may receive the memory access (or prefetch) requests through the vias 106 and/or interconnection 304 and open one or more corresponding pages (e.g., by activating one or more rows) in the memory 314 at an operation 514.
In an embodiment, at an operation 516, data may be copied from the memory 314 into a buffer such as the page cache 410, for example, by the controllers 404 and/or 412. At an operation 518, data may be copied through vias 106 from the page cache 410 and/or the memory 314 into the cache 308 (e.g., including one or more of the caches 402), for example, by the controllers 404, 406, and/or 412. After copying the data into the page cache 410 or the cache 308 (at operations 516 or 518, respectively), the opened memory pages of the operation 514 may be closed at an operation 520, for example, by the memory controller 406. As illustrated in
In an embodiment, upon occurrence of a cache miss (e.g., as determined at operation 504), one or more memory pages may be opened (514) to copy the corresponding data from the memory 314 into a buffer (such as the page cache 410 and/or cache 308) through the vias 106. The opened memory pages may then be closed at operation 520, e.g., to conserve power, such as by turning off one or more corresponding sense amplifiers in the memory 314. In one embodiment, data copied through the vias 106 may include both data from a memory location in the memory 314 that corresponds to the memory access request of operation 502 and additional data, for example, from one or more neighboring or adjacent memory locations, such as preceding or succeeding memory locations, rows, or pages. Accordingly, data copied through the vias 106 may include data from at least two contiguous memory locations, rows, or pages, in accordance with various embodiments of the invention.
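The sequence of operations 502 through 520 may be easier to follow as a software analogy. The sketch below is a simplified illustration under stated assumptions, not the disclosed hardware: the class `StackedMemorySketch`, its attribute names, the list-backed memory, and the page size of four locations are all hypothetical, and only the requested location is drained into the cache at the step standing in for operation 518.

```python
class StackedMemorySketch:
    """Software analogy of operations 502-520; all names are hypothetical."""

    def __init__(self, memory, page_size=4):
        self.memory = list(memory)   # stands in for the memory 314
        self.cache = {}              # stands in for the cache 308
        self.page_cache = {}         # stands in for the page cache 410
        self.page_size = page_size   # assumed number of locations per page
        self.open_pages = set()      # pages whose rows are currently activated
        self.page_activations = 0    # counts how often a page had to be opened

    def read(self, addr):
        # Operation 504: is the data already in the cache?
        if addr in self.cache:
            return self.cache[addr]
        # Operations 508/510: is it in the page cache? If so, copy it over.
        if addr in self.page_cache:
            self.cache[addr] = self.page_cache[addr]
            return self.cache[addr]
        # Operations 512/514: issue a request and open the corresponding page.
        page = addr // self.page_size
        self.open_pages.add(page)
        self.page_activations += 1
        # Operation 516: copy the whole page, including neighboring locations.
        base = page * self.page_size
        for a in range(base, min(base + self.page_size, len(self.memory))):
            self.page_cache[a] = self.memory[a]
        # Operation 518: copy the requested data into the cache.
        self.cache[addr] = self.page_cache[addr]
        # Operation 520: close the page once the copy is complete.
        self.open_pages.discard(page)
        return self.cache[addr]
```

For example, after `m = StackedMemorySketch(range(100, 116))`, a call to `m.read(5)` opens one page, while a following `m.read(6)` is served from the page cache without another activation, and no page remains open after either call.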
In an embodiment, the memory access request of the operation 502 may correspond to a 64 byte block of data within the memory 314, and the techniques discussed herein may be utilized to instead copy a 1 kilo-byte block of data (e.g., including preceding or subsequent memory locations, or a full page) through the vias 106 into the cache 308 (or its various levels (402)), e.g., without closing the corresponding opened page(s) before the data transfer operations are completed. As discussed with reference to
In one embodiment, a buffer such as the page cache 410 may be utilized to temporarily store the transferred (or prefetched) data from the memory 314 before the data is drained or copied into the cache 308 (or its various levels), e.g., for access by the cores 306. In an embodiment, the page cache 410 may include less expensive data storage elements than those utilized for the memory 314. Furthermore, more open pages may be maintained in the page cache 410 than in the memory 314 (e.g., to improve performance), for example, because the data storage elements of the page cache 410 consume less power than those of the memory 314.
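The arithmetic behind expanding a 64 byte request into a 1 kilo-byte transfer can be sketched as follows. This is an illustration under an assumption the description does not state: that the larger block is aligned to a 1 kilo-byte boundary containing the requested address. The function names `prefetch_range` and `lines_in_block` are hypothetical.

```python
REQUEST_BYTES = 64    # size of the original memory access request
BLOCK_BYTES = 1024    # size of the block copied through the vias


def prefetch_range(request_addr, block_bytes=BLOCK_BYTES):
    """Return (start, end) of the aligned block containing request_addr.

    Alignment to a block boundary is an assumption made for this sketch.
    """
    start = (request_addr // block_bytes) * block_bytes
    return start, start + block_bytes


def lines_in_block(request_addr, line_bytes=REQUEST_BYTES, block_bytes=BLOCK_BYTES):
    """List the start addresses of the request-sized lines in the block."""
    start, end = prefetch_range(request_addr, block_bytes)
    return list(range(start, end, line_bytes))
```

With these sizes, one opened page supplies sixteen 64 byte lines, so up to fifteen later requests to neighboring locations could be served from the buffer without activating the row again.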
A chipset 606 may also communicate with the interconnection network 604. The chipset 606 may include a memory control hub (MCH) 608. The MCH 608 may include a memory controller 610 that communicates with a memory 612 (which may be the same or similar to the memory controller 406 of
The MCH 608 may also include a graphics interface 614 that communicates with a graphics accelerator 616. In one embodiment of the invention, the graphics interface 614 may communicate with the graphics accelerator 616 via an accelerated graphics port (AGP). In an embodiment of the invention, a display (such as a flat panel display) may communicate with the graphics interface 614 through, for example, a signal converter that translates a digital representation of an image stored in a storage device such as video memory or system memory into display signals that are interpreted and displayed by the display. The display signals produced by the display device may pass through various control devices before being interpreted by and subsequently displayed on the display.
A hub interface 618 may allow the MCH 608 and an input/output control hub (ICH) 620 to communicate. The ICH 620 may provide an interface to I/O devices that communicate with the computing system 600. The ICH 620 may communicate with a bus 622 through a peripheral bridge (or controller) 624, such as a peripheral component interconnect (PCI) bridge, a universal serial bus (USB) controller, or other types of peripheral bridges or controllers. The bridge 624 may provide a data path between the CPU 602 and peripheral devices. Other types of topologies may be utilized. Also, multiple buses may communicate with the ICH 620, e.g., through multiple bridges or controllers. Moreover, other peripherals in communication with the ICH 620 may include, in various embodiments of the invention, integrated drive electronics (IDE) or small computer system interface (SCSI) hard drive(s), USB port(s), a keyboard, a mouse, parallel port(s), serial port(s), floppy disk drive(s), digital output support (e.g., digital video interface (DVI)), or other devices.
The bus 622 may communicate with an audio device 626, one or more disk drive(s) 628, and a network interface device 630 (which is in communication with the computer network 603). Other devices may communicate via the bus 622. Also, various components (such as the network interface device 630) may communicate with the MCH 608 in some embodiments of the invention. In addition, the processor 602 and the MCH 608 may be combined to form a single chip. Furthermore, the graphics accelerator 616 may be included within the MCH 608 in other embodiments of the invention.
Furthermore, the computing system 600 may include volatile and/or nonvolatile memory (or storage). For example, nonvolatile memory may include one or more of the following: read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a disk drive (e.g., 628), a floppy disk, a compact disk ROM (CD-ROM), a digital versatile disk (DVD), flash memory, a magneto-optical disk, or other types of nonvolatile machine-readable media that are capable of storing electronic data (e.g., including instructions).
As illustrated in
In an embodiment, the processors 702 and 704 may be one of the processors 602 discussed with reference to
At least one embodiment of the invention may be provided within the processors 702 and 704. For example, one or more of the cores 306 and/or cache 308 of
The chipset 720 may communicate with a bus 740 using a PtP interface circuit 741. The bus 740 may have one or more devices that communicate with it, such as a bus bridge 742 and I/O devices 743. Via a bus 744, the bus bridge 742 may communicate with other devices such as a keyboard/mouse 745, communication devices 746 (such as modems, network interface devices, or other communication devices that may communicate with the computer network 603), audio I/O device, and/or a data storage device 748. The data storage device 748 may store code 749 that may be executed by the processors 702 and/or 704.
In various embodiments of the invention, the operations discussed herein, e.g., with reference to
Additionally, such computer-readable media may be downloaded as a computer program product, wherein the program may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a bus, a modem, or a network connection). Accordingly, herein, a carrier wave shall be regarded as comprising a machine-readable medium.
Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one implementation. The appearances of the phrase “in one embodiment” in various places in the specification may or may not all refer to the same embodiment.
Also, in the description and claims, the terms “coupled” and “connected,” along with their derivatives, may be used. In some embodiments of the invention, “connected” may be used to indicate that two or more elements are in direct physical or electrical contact with each other. “Coupled” may mean that two or more elements are in direct physical or electrical contact. However, “coupled” may also mean that two or more elements may not be in direct contact with each other, but may still cooperate or interact with each other.
Thus, although embodiments of the invention have been described in language specific to structural features and/or methodological acts, it is to be understood that claimed subject matter may not be limited to the specific features or acts described. Rather, the specific features and acts are disclosed as sample forms of implementing the claimed subject matter.