Conventional computer architectures (e.g., the von Neumann architecture) are well known in the art.
As seen, off-chip data movement costs vastly more energy and more CPU cycles than computation. Consequently, if data is being moved, such as to or from main memory, it would be desirable if computations on that data could be performed with little silicon, power, or time cost. It should be appreciated that 64-bit operations are merely exemplary and that other operations of other dimensions are contemplated.
In addition, some circuits are now reaching such density that, if all the circuits are on at the same time, the energy flux density may exceed that of a nuclear reactor. Hence, the notion of “dark silicon”, i.e., circuits which are on only part of the time, is acceptable. Thus, it would be desirable if a memory controller, which performs operations that are invoked only part of the time, could perform useful operations which do not add significant time delay or energy costs during essential but energy- and time-intensive data streams into and out of memory. It would also be desirable that the operations be flexible and require a common, simple architecture.
Several embodiments of the present application comprising systems and methods of reduced data representation memory controllers and related chiplet sets are disclosed.
In the several embodiments, a data transformation is performed on the data as it passes through the traditional memory controller electronics. In one embodiment, a transform may be construed as a useful, universal data transformation which can be performed by a memory side controller. In one embodiment, a transform memory controller is disclosed, the memory controller being a part of a computer system further comprising a central processor and a hierarchy of computer memory elements, the transform memory controller comprising: an input, the input receiving data signals associated with the computer memory elements; a set of logic and arithmetic elements, the set of logic and arithmetic elements configured to perform a transform operation on the data signals associated with the computer memory elements, wherein the transform operation performs a desired computation on the data signals without the need of the desired computation being performed by the CPU of the computer system; and an output, the output of the transform operation sending results of the computation to the computer memory elements.
In another set of embodiments, a method is disclosed for performing transform operations on data residing in desired levels of slower memory elements, the steps of said method comprising: receiving an instruction for an operation on data in the computer system; determining the cost of the operation on the data to be performed at the central processor; and if the cost of the operation is above a desired threshold, then performing the data operation at the transform memory controller instead of at the central processor.
Other features and advantages of the present system are presented below in the Detailed Description when read in connection with the drawings presented within this application.
Features and advantages of the present system and method are presented below in the Detailed Description when read in connection with the drawings presented within this application.
All references, publications, patents, and patent applications, cited herein and/or cited in any accompanying Information Disclosure Statement (IDS), are hereby incorporated herein by reference in their entirety for all purposes.
The accompanying figures, in which like reference numerals refer to identical or functionally similar elements throughout the separate views, and which together with the detailed description below are incorporated in and form part of the specification, serve to illustrate various examples and to explain various principles and advantages all in accordance with the present disclosure, in which:
As required, detailed embodiments are disclosed herein; however, it is to be understood that the disclosed embodiments are merely examples and that the devices, systems, and methods described herein can be embodied in various forms. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one of ordinary skill in the art to variously employ the disclosed subject matter in virtually any proprietary detailed structure and function. Further, the terms and phrases used herein are not intended to be limiting, but rather, to provide an understandable description. Additionally, unless otherwise specifically expressed or clearly understood from the context of use, a term as used herein describes the singular and/or the plural of that term.
As mentioned, data movement now takes more energy, time, and silicon real estate than computation. Thus, a memory controller which performs operations that are invoked only part of the time can perform useful operations as the data streams into and out of memory. Various embodiments of the present application describe many useful functions that a memory controller can perform in addition to the usual functions performed by conventional memory controllers. Such usual functions of a typical memory controller may include the following: read by address, read by content, write by address, write by content, error correction, and, more recently, encryption/decryption. In addition, there are also various addressing modes, such as block transfer, streaming, or page transfer.
In one embodiment of the present application, the memory controller functionality may be expanded to perform operations in parallel with such basic memory functions, and these operations may comprise useful functions that improve overall system performance. These operations can be programmatically invoked on the data as it is coming into and out of memory, or can eliminate the transfer to the CPU altogether. In some embodiments, the results of these operations may be stored in other parts of main memory or in memory controller registers. In such cases, the memory controller functions may shorten data flow paths from memory to the CPU, may process data flows from other data sources, and may process data flows to and from the edge (e.g., sensors and/or actuators).
In the context of the present application, the term “transform”, as applied to operations, algorithms, firmware, and/or hardware, refers to additional and/or auxiliary functions and/or hardware that may be employed to reduce the cost of moving data between storage and the CPU.
In many embodiments, it suffices for the purposes of the present application that the operations on data of a desired size/dimension be sufficiently costly in terms of energy consumption and time delay that it would be desirable to perform the data operations in a transform memory controller as described herein, as opposed to transferring the data signals to the processor (e.g., CPU, GPU, etc.) of the computer system to perform them. In many embodiments, the transform memory controller may be configured to perform operations on data signals whose energy cost and/or time delay (either actual or predicted) would exceed a threshold.
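By way of non-limiting illustration only, the threshold decision described above may be sketched in software as follows. The per-byte cost constants, the `Operation` record, and the function names are hypothetical and are introduced solely for this example; an actual embodiment may predict or measure cost in any suitable manner.

```python
from dataclasses import dataclass

# Hypothetical per-byte costs of moving data from main memory to the CPU.
# These are placeholder values for illustration, not measurements.
ENERGY_PER_BYTE_PJ = 100.0
DELAY_PER_BYTE_NS = 0.5


@dataclass(frozen=True)
class Operation:
    name: str
    bytes_moved: int  # data that would have to travel to the CPU


def cpu_cost(op):
    """Return (energy in pJ, delay in ns) predicted for running op at the CPU."""
    return op.bytes_moved * ENERGY_PER_BYTE_PJ, op.bytes_moved * DELAY_PER_BYTE_NS


def choose_executor(op, energy_threshold_pj, delay_threshold_ns,
                    always_offload=frozenset()):
    """Pick where to run op: at the CPU or at the transform memory controller."""
    # Instructions pre-determined to run at the controller skip the cost check.
    if op.name in always_offload:
        return "transform_memory_controller"
    energy, delay = cpu_cost(op)
    # Offload when either predicted cost exceeds its threshold.
    if energy > energy_threshold_pj or delay > delay_threshold_ns:
        return "transform_memory_controller"
    return "cpu"
```

In this sketch, an operation touching a single 64-byte block would remain at the CPU under thresholds of 1e6 pJ and 1e4 ns, whereas an operation touching one megabyte would be routed to the transform memory controller.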
As may be seen in
In many embodiments, the useful operations depend on the application. Because designing, fabricating, testing, and packaging these DRAM memory arrays is very expensive, the ability to customize the memory side computations could be quite useful. The various embodiments of the present application enable the system designer to locate system limiting functions in a “chiplet”, i.e., a customizable, memory side processor which optimizes various memory operations depending on the predominant system usage cases, without requiring the additional high cost of redesigning a memory chip fabrication process.
In many embodiments, such chiplets may be added to the DRAM die through the use of a silicon interposer, solder bumps, or other high bandwidth chip interconnect technologies. Moreover, customizing memory through the use of chiplets permits logic-optimized fabrication and testing for the chiplet and memory-optimized fabrication for the DRAM memory. This may be desirable, as the optimum fabrication processes for logic and memory are well known to be somewhat incompatible.
For one example of a transform operation,
In 506, another example involves high dimensional matrix multiplications. Random linear algebra of unitary matrices may employ transform matrices, SUaT and SUa, as shown. For a subspace-preserving embedding, if Ua is an orthogonal matrix, then SUa is approximately orthogonal. It should be appreciated that many of the firmware/hardware embodiments described herein may be able to process such reduced linear algebra operations as described. Thus, multiplying data by a sketching matrix as the data leaves or enters the memory can preserve many of the properties of the data in a reduced, sketched form. Subsequent linear algebra operations can be rapidly approximated on the sketched data. Only if the full details of the data are required will it be necessary to access all of the data and incur the energy and access time costs.
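By way of non-limiting illustration only, the subspace-preserving property described above may be checked numerically as in the following sketch. A Gaussian sketching matrix with entries of variance 1/m is assumed here purely for the example; the present description does not limit how the sketching matrix S is chosen. The function names and the dimensions (n = 1024, m = 256, d = 2) are hypothetical.

```python
import math
import random


def orthonormal_columns(n, d, rng):
    """Build d orthonormal vectors of length n by Gram-Schmidt on random vectors."""
    cols = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for u in cols:
            proj = sum(a * b for a, b in zip(u, v))
            v = [b - proj * a for a, b in zip(u, v)]
        norm = math.sqrt(sum(x * x for x in v))
        cols.append([x / norm for x in v])
    return cols  # list of d columns


def sketch_columns(cols, m, rng):
    """Apply an m x n Gaussian sketching matrix S (entries N(0, 1/m)) to each column."""
    n = len(cols[0])
    scale = 1.0 / math.sqrt(m)
    out = [[0.0] * m for _ in cols]
    for i in range(m):
        row = [rng.gauss(0.0, scale) for _ in range(n)]  # one row of S
        for j, c in enumerate(cols):
            out[j][i] = sum(s * x for s, x in zip(row, c))
    return out


def gram_deviation(cols):
    """Largest entry of |C^T C - I|; zero when the columns are orthonormal."""
    worst = 0.0
    for a in range(len(cols)):
        for b in range(len(cols)):
            dot = sum(x * y for x, y in zip(cols[a], cols[b]))
            worst = max(worst, abs(dot - (1.0 if a == b else 0.0)))
    return worst


rng = random.Random(0)
U = orthonormal_columns(n=1024, d=2, rng=rng)   # exactly orthonormal basis
SU = sketch_columns(U, m=256, rng=rng)          # sketched basis, 4x shorter
```

In this sketch, `gram_deviation(U)` is zero up to rounding, while `gram_deviation(SU)` is small but nonzero, which reflects that the sketched columns are only approximately orthogonal. The deviation is expected to shrink as the sketch dimension m grows.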
In one embodiment, as the data flows to and from the main memory under the control of the address and address decoder, the data may be multiplied by the transform matrix, or other operations may be performed on it. The transform matrix may be preloaded under the control of the memory controller instructions specifying the operations to be performed by the memory controller and where to store the results in the transform storage memory. Many exemplary architectures may be optimized for matrix/matrix or matrix/tensor multiplication, either dense or sparse, and implemented as a transforming memory controller. In merely one embodiment, an FPGA block may be programmable and allow for a variety of functionality which would work well for flow processing as a transform chiplet.
In another embodiment, it should be appreciated that the transform memory controller could be made such that the set of logic elements is constructed integrally with a processor to comprise a System on a Chip (SoC). In yet another embodiment, the system may comprise a set of arithmetic units, a possible set of registers, and possibly some program memory, so that it has components of a processor (e.g., CPU, GPU, etc.) but may not have an entire set of processor logic. In some embodiments, the system may be constructed similarly to a digital signal processing (DSP) unit which may convolve an input data stream with a kernel located in the DSP registers.
9C, 10 and 11A and B depict several embodiments of transform processes/operations that may be effected by the architectures described herein.
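By way of non-limiting illustration only, the data path described above may be modeled in software as in the following sketch. The class name `TransformPath`, its methods, and the use of a dictionary keyed by address as the transform storage memory are hypothetical and are introduced solely for this example.

```python
class TransformPath:
    """Toy model of a memory data path with a preloaded transform matrix."""

    def __init__(self):
        self.matrix = None            # transform matrix, preloaded by instruction
        self.transform_storage = {}   # stands in for the transform storage memory

    def preload(self, matrix):
        """Load the transform matrix before any data flows through."""
        self.matrix = [list(row) for row in matrix]

    def transfer(self, address, block):
        """Pass a block through unchanged while storing its transform on the side."""
        if self.matrix is not None:
            if any(len(row) != len(block) for row in self.matrix):
                raise ValueError("block length does not match transform matrix")
            # Matrix-vector product computed as the block streams past.
            self.transform_storage[address] = [
                sum(m * x for m, x in zip(row, block)) for row in self.matrix
            ]
        return block
```

For example, after `preload([[1, 1, 1, 1], [1, -1, 1, -1]])`, a call to `transfer(0x10, [1, 2, 3, 4])` returns the block unchanged and leaves `[10, -2]` in the transform storage under address `0x10`. The ordinary transfer is not altered in this sketch; the transform result is produced alongside it.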
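By way of non-limiting illustration only, the DSP-like arrangement described above, in which an input data stream is convolved with a kernel held in registers, may be sketched as follows. The function name `stream_convolve` is hypothetical, and the sketch assumes that the delay line starts at zero and that one output sample is produced per input sample.

```python
from collections import deque


def stream_convolve(samples, kernel):
    """Convolve a stream with a fixed kernel, yielding one output per input sample."""
    taps = list(kernel)                                 # kernel held in "registers"
    window = deque([0] * len(taps), maxlen=len(taps))   # delay line, newest first
    for s in samples:
        window.appendleft(s)  # oldest sample falls off the right-hand end
        yield sum(k * w for k, w in zip(taps, window))
```

For example, `list(stream_convolve([1, 2, 3, 4], [1, 1]))` evaluates to `[1, 3, 5, 7]`. Because the function is a generator, each output is produced as its input sample arrives, which mirrors processing the data while it streams rather than after it has been fully transferred.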
EE1: In a computer system comprising a central processor and a hierarchy of computer memory elements and further comprising a transform memory controller, the transform memory controller performing operations on data residing in desired levels of slower memory elements, a method for performing transform operations on data residing in desired levels of slower memory elements, the steps of said method comprising:
receiving an instruction for an operation on data in the computer system;
determining the cost of the operation on the data to be performed at the central processor; and
if the cost of the operation is above a desired threshold, then performing the data operation at the transform memory controller instead of at the central processor.
EE1.2: The method of EE1 wherein the step of receiving an instruction further comprises determining whether the instruction is among a set of instructions that are pre-determined to be performed at the transform memory controller.
EE1.3: The method of EE1 wherein, in the step of determining the cost of the operation on the data to be performed at the central processor, the cost is a function of the energy consumption of the received instruction to be performed at the central processor of the computer system.
EE1.4: The method of EE1 wherein, in the step of determining the cost of the operation on the data to be performed at the central processor, the cost is a function of the time delay of the received instruction to be performed at the central processor of the computer system.
EE2: A transform memory controller, the controller being a part of a computer system further comprising a processor and a hierarchy of computer memory elements, the transform memory controller comprising:
an input, the input receiving data signals associated with the computer memory elements;
a set of logic elements, the set of logic elements configured to perform a transform operation on the data signals associated with the computer memory element wherein the transform operation performs a desired computation on the data signals without the need of the desired computation being performed by the processor of the computer system; and
an output, the output of the transform operation sending results of the computation to the computer memory elements.
EE2.1: The transform memory controller of EE2 wherein:
the set of logic elements comprises a chiplet, said chiplet configured to be in electronic communications with neighboring computer memory elements.
EE2.2: The transform memory controller of EE2.1 wherein:
the chiplet is mechanically mated to a substrate, the substrate comprising computer memory elements mechanically mated to the substrate.
Now that various embodiments have been herein disclosed, it is also to be appreciated that any one or more of the particular tasks, steps, processes, methods, functions, elements and/or components described herein may suitably be implemented via hardware, software, firmware or a combination thereof. In particular, various modules, components and/or elements may be embodied by processors, electrical circuits, computers and/or other electronic data processing devices that are configured and/or otherwise provisioned to perform one or more of the tasks, steps, processes, methods and/or functions described herein. For example, a controller, a processor, computer or other electronic data processing device embodying a particular element may be provided, supplied and/or programmed with a suitable listing of code (e.g., such as source code, interpretive code, object code, directly executable code, and so forth) or other like instructions or software or firmware, such that when run and/or executed by the controller, processor, computer or other electronic data processing device one or more of the tasks, steps, processes, methods and/or functions described herein are completed or otherwise performed. Suitably, the listing of code or other like instructions or software or firmware is implemented as and/or recorded, stored, contained or included in and/or on a non-transitory computer and/or machine readable storage medium or media so as to be providable to and/or executable by the computer or other electronic data processing device. 
For example, suitable storage mediums and/or media can include but are not limited to: floppy disks, flexible disks, hard disks, magnetic tape, or any other magnetic storage medium or media, CD-ROM, DVD, optical disks, or any other optical medium or media, a RAM, a ROM, a PROM, an EPROM, a FLASH-EPROM, or other memory or chip or cartridge, or any other tangible medium or media from which a computer or machine or electronic data processing device can read and use. In essence, as used herein, non-transitory computer-readable and/or machine-readable mediums and/or media comprise all computer-readable and/or machine-readable mediums and/or media except for a transitory, propagating signal.
Optionally, any one or more of the particular tasks, steps, processes, methods, functions, elements and/or components described herein may be implemented on and/or embodied in one or more general purpose computers, special purpose computer(s), a programmed microprocessor or microcontroller and peripheral integrated circuit elements, an ASIC or other integrated circuit, a digital signal processor, a hardwired electronic or logic circuit such as a discrete element circuit, a programmable logic device such as a PLD, PLA, FPGA, graphics processing unit (GPU), or PAL, or the like. In general, any device capable of implementing a finite state machine that is in turn capable of implementing the respective tasks, steps, processes, methods and/or functions described herein can be used.
A detailed description of one or more embodiments of the application, read along with accompanying figures, that illustrate the principles of the application has now been given. It is to be appreciated that the application is described in connection with such embodiments, but the application is not limited to any embodiment. The scope of the application is limited only by the claims and the application encompasses numerous alternatives, modifications and equivalents. Numerous specific details have been set forth in this description in order to provide a thorough understanding of the application. These details are provided for the purpose of example and the application may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the application has not been described in detail so that the application is not unnecessarily obscured.
| | Number | Date | Country |
|---|---|---|---|
| Parent | 17712137 | Apr 2022 | US |
| Child | 18809778 | | US |