The present disclosure relates to integrated circuits and computer systems, and more particularly to a method and system for sharing memory resources in a final level cache system.
It is widely known in the data center industry that a significant portion, estimated to be up to 75%, of the DRAM (dynamic random access memory) main memory deployed in modern cloud CPU (central processing unit) servers is unused. This occurs because the minimum size of the DRAM main memory allocated in each CPU server is almost always determined by the most memory-demanding applications, as that would give them the most return on their investment. Stated another way, memory size is selected based on the applications that could be run on that server, regardless of the actual applications that will be or are running on the server.
Furthermore, the number of cores in a CPU socket is also determined by the most CPU core demanding applications, regardless of how many cores will be in use in a server for a particular application build out. Unfortunately, these constraints do not necessarily overlap each other, and data center designers had to make the hard choice of having too many CPU cores, or too much DRAM capacity, or not enough CPU/memory resources for those that could afford to pay more for the resources.
Intuitively, the memory or CPU core utilization problem is solvable if one could simply build a system with an extremely large DRAM main memory combined with many more CPU sockets/cores than we have today. Such a system could then rely on the probabilities/statistics of large numbers of users and applications running on a shared pool of CPU sockets and DRAMs to achieve much higher CPU core and/or DRAM utilization.
Unfortunately, pooling CPU and main memory resources is easier said than done. With modern server CPU sockets being delivered with 64 or more cores, even today's single CPU socket with a dedicated local DRAM main memory interface is already being taxed beyond its capability to provide the needed bandwidth for that single CPU socket. Adding a gigantic main memory pool to a high-end server CPU socket only moves the CPU-to-memory bandwidth limitation from the CPU socket to the gigantic main memory pooling resources. For example, if a current single high-end CPU socket already requires 256 GB/s of main memory bandwidth, connecting a thousand such CPU sockets to a shared memory pool would require a DRAM main memory pool with the ability to provide 256 TB/s of bandwidth. Certainly, even if it is one day technically possible to build this hypothetical DRAM memory pooling system, the cost would be prohibitive considering that even the most advanced internet network core router silicon used in data centers today (costing tens of thousands of dollars each) can barely deliver 10 Tb/s (not 10 TB/s) of bandwidth.
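The arithmetic in this example can be checked with a short sketch. The socket count, per-socket bandwidth, and router figure are taken from the paragraph above; the function names are illustrative assumptions, and decimal units (1 TB = 1000 GB, 8 bits per byte) are assumed.

```python
# Illustrative check of the pooled-bandwidth example (decimal units assumed).

def pool_bandwidth_gb_per_s(sockets, per_socket_gb_per_s):
    """Bandwidth a shared pool must supply if every socket sends all of
    its main-memory traffic to the pool."""
    return sockets * per_socket_gb_per_s


def terabits_to_terabytes(terabits_per_s):
    """Convert Tb/s (bits) to TB/s (bytes)."""
    return terabits_per_s / 8


# A thousand sockets at 256 GB/s each, expressed in TB/s.
required_tb_per_s = pool_bandwidth_gb_per_s(1000, 256) / 1000

# A 10 Tb/s router expressed in TB/s for a like-for-like comparison.
router_tb_per_s = terabits_to_terabytes(10)

# How many times the required bandwidth exceeds the router figure.
shortfall_factor = required_tb_per_s / router_tb_per_s
```

Under these assumptions the pool would need 256 TB/s, while the router figure corresponds to 1.25 TB/s, a gap of roughly two hundred times.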
A further challenge presented to the industry for memory input and output is limitations on the memory bandwidth between the memory and a host (entity requesting memory read/write). Numerous computing environments are bottlenecked in total processing capability by memory input and output limitations. This hindrance is particularly limiting in artificial intelligence processing, which relies on a large amount of data processing when formulating an intelligent output. Even memory designs focused on high bandwidth can be improved.
To overcome the drawbacks of the prior art and provide additional benefits, a data storage and access system for use with a processor is disclosed. In one embodiment, the system includes a processor having processor cache, such that the processor is configured to generate a data request for data. Also part of this embodiment is a final level cache (FLC) cache system that is configured to function as main memory and receive the data request. The FLC cache system comprises a first FLC module having a first FLC controller and first memory. The first FLC module is configured to process the data request from the processor. The FLC cache system also comprises a second FLC module having a second FLC controller and a second memory such that the second FLC module, responsive to the first FLC module not having the data requested by the processor, receives and processes the data request from the first FLC module. A storage drive is connected to the FLC cache system, as is a switch accessible memory, which connects through a switch. The storage drive or the switch accessible memory receives the data request responsive to the second FLC module not having the data, and the storage drive, switch accessible memory, or both, are shared by additional FLC cache systems as a shared memory pool.
In one embodiment, this system further comprises DRAM or SRAM memory connected to the second FLC cache system. It is contemplated that the DRAM or SRAM memory comprises low power double data rate (LPDDR) memory and the LPDDR memory is shared with one or more additional FLC cache systems which connect to the LPDDR. In one configuration, the data request includes a physical address and the first FLC controller includes a look-up table configured to translate the physical address to a first virtual address. For example, if the first FLC controller look-up table does not contain the physical address, the first FLC controller is configured to forward the data request with the physical address to the second FLC controller. The second FLC controller also includes a look-up table configured to translate the physical address to a second virtual address.
In one embodiment, the first FLC module is faster and has lower power consumption than the second FLC module. As shown herein, the second FLC module accesses the switch accessible memory through a network interface and a PCI bus. The system may further comprise a second processor connected to the FLC cache system. It is contemplated that the first FLC module, the second FLC module, or both are configured to perform predictive fetching of data stored at addresses expected to be accessed in the future.
Also disclosed is a method of operating a data access system, wherein the data access system comprises a processor having processor cache, switch connected memory, a first final level cache (FLC) module which includes a first FLC controller and a first DRAM, and a second FLC module which includes a second FLC controller and a second DRAM. Using this system, the method comprises generating, with the processor, a request for data which includes a physical address and providing the request for data to the first FLC module. With the first FLC module, determining if the first FLC controller contains the physical address, and responsive to the first FLC controller containing the physical address, retrieving the data from the first DRAM and providing the data to the processor. Responsive to the first FLC controller not containing the physical address, forwarding the request for data and the physical address to the second FLC module, and at the second FLC controller, determining if the second FLC controller contains the physical address. Responsive to the second FLC controller not containing the physical address, forwarding the request for data and the physical address to the switch connected memory, retrieving the data from the switch connected memory, and providing the data to the second FLC module, the first FLC module, and the processor.
In one embodiment, the switch connected memory is a shared memory resource for additional FLC modules. This method may further comprise, responsive to the second FLC controller not containing the physical address, retrieving the data from a RAM type memory that is external to but connected to the second FLC module. The method of operation may further comprise performing a look-up in a look up table to determine whether the data is in the switch connected memory or an SSD connected to the data access system. For example, the step of determining if the first FLC controller contains the physical address may include accessing an address cache storing address entries in the first FLC controller to reduce time taken for determining. In one embodiment, the method further comprises, responsive to the first FLC controller containing the physical address and providing the data to the processor, updating a status register reflecting the recent use of a cache line containing the data.
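The method summarized above can be illustrated with a minimal software sketch. It is a model for explanation only, not the disclosed hardware: the class `FLCModule`, the function `read_data`, least-recently-used replacement, and the refill of the first module on a second-module hit are all assumptions of this sketch. The ordered dictionary stands in for both the controller look-up table and the status register that records recent use.

```python
from collections import OrderedDict


class FLCModule:
    """Hypothetical model of one FLC module: a controller look-up table
    (physical address -> virtual address) plus DRAM indexed by virtual address."""

    def __init__(self, capacity_lines):
        self.capacity_lines = capacity_lines  # assumed to be at least 1
        self.lookup = OrderedDict()  # physical -> virtual, oldest use first
        self.dram = {}               # virtual -> data

    def read(self, physical):
        """Return (hit, data); a hit also records the line as recently used."""
        if physical not in self.lookup:
            return False, None
        self.lookup.move_to_end(physical)  # stands in for the status register
        return True, self.dram[self.lookup[physical]]

    def fill(self, physical, data):
        """Install a line, reusing the least recently used slot when full
        (LRU replacement is an assumption of this sketch)."""
        if physical in self.lookup:
            virtual = self.lookup[physical]
            self.lookup.move_to_end(physical)
        elif len(self.lookup) >= self.capacity_lines:
            _, virtual = self.lookup.popitem(last=False)  # evict oldest entry
            self.lookup[physical] = virtual
        else:
            virtual = len(self.lookup)  # next unused slot
            self.lookup[physical] = virtual
        self.dram[virtual] = data


def read_data(physical, flc1, flc2, switch_memory):
    """Return (data, source) following the FLC1 -> FLC2 -> switch memory order."""
    hit, data = flc1.read(physical)
    if hit:
        return data, "FLC1"
    hit, data = flc2.read(physical)
    if hit:
        flc1.fill(physical, data)  # assumption: FLC1 is refilled on an FLC2 hit
        return data, "FLC2"
    data = switch_memory[physical]  # miss in both modules
    flc2.fill(physical, data)       # data is provided to the second module,
    flc1.fill(physical, data)       # the first module, and then the processor
    return data, "switch"
```

For instance, with a two-line first module and a four-line second module, a first read of an address is served from the switch connected memory, a repeat read is served from the first module, and a line evicted from the first module is later served from the second module.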
Also disclosed is a data storage and access system for use with a processor such that the processor includes, or is associated with, a processor cache. This embodiment includes a first final level cache (FLC) cache system, in communication with the processor, configured to function as main memory cache and receive a data request for data from the processor. A network connected memory pool, accessible by the FLC cache system, is configured to store data, including data that is not stored in the cache, such that the memory pool is shared by other FLC cache systems as a shared memory resource.
This system may further comprise a second FLC cache system, connected between the FLC cache system and the network connected memory pool. The second FLC cache system is configured to function as a second main memory cache and receive the data request for the data, if the data is not located in the first FLC cache system, and if the second FLC cache system does not contain the data, forward the data request to the network connected memory pool. The system may further comprise a system bus and the processor communicates with the first FLC cache system over the system bus.
In another embodiment, the memory storage and access system comprises two or more processors, each having a processor cache, the two or more processors configured to generate data requests for data. This system also includes two or more final level cache (FLC) cache systems, each configured to receive the data requests. Each FLC cache system comprises a first FLC module having a first FLC controller and first memory, such that the first FLC module processes the data requests from the processor, and a second FLC module having a second FLC controller and second memory. The second FLC module, responsive to the first FLC module not having the data requested by the processor, receives and processes the data requests from the first FLC module. Also provided are two or more switch fabrics, of which two or more are connected to switch fabric accessible memory, such that each of the two or more switch fabrics connects to at least one of the two or more FLC cache systems, wherein the switch fabric accessible memory is configured to receive the data requests from the second FLC module responsive to the second FLC module not having the data, and the switch fabric accessible memory is shared by the two or more FLC cache systems as a shared memory pool.
In one embodiment, each of the two or more switch fabrics has a switch fabric accessible memory attached thereto. It is contemplated that each processor may have two or more ports, and two or more of the two or more ports connect to an FLC cache system. The shared memory pool may comprise SSD memory, DDR memory, or both. The system may further comprise a shared local memory pool that is accessible by at least two of the two or more FLC cache systems.
It is contemplated that this system may further comprise additional memory directly connected to, and accessible by, the data storage and access system. In one configuration, if the data is not contained in the first FLC cache system, then the data request is sent to the network connected memory pool to retrieve the data from the network connected memory pool, and the network connected memory pool is shared with and accessible by other FLC cache systems associated with other processors. In one arrangement, the FLC cache system comprises a FLC controller and a memory. It is contemplated that more than one processor may connect to the first FLC cache system.
The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. In the figures, like reference numerals designate corresponding parts throughout the different views.
In the drawings, reference numbers may be reused to identify similar and/or identical elements.
To resolve both the local and memory pooling bandwidth dilemma, it is proposed to change the architecture of the main memory. To be more precise, it is proposed to lower the bandwidth required of, or imposed on, the memory pool by each CPU socket to a bare minimum, such as for example, less than 1% or 2% of what is currently in use. Interestingly, the only viable way to do this is to utilize a new main memory architecture proposed with FLC (final level cache) technology.
In this embodiment, the FLC system 104 connects to a host CPU 108. The host CPU 108 may be any type processor, microprocessor, controller, ASIC, DSP, GPU, or similar device currently available or developed in the future. The host CPU 108 processes data and executes machine readable code. The host CPU 108 includes processor cache, such as L0, L1, L2, L3 cache which is part of the host CPU 108. The host CPU 108 requests data (data and/or machine readable code) from the FLC system 104 as would be typical in prior art. However, the configuration and arrangement of the FLC system 104 is different than a typical memory system for prior art computers and servers.
Also connected to the FLC system 104 is external LPDDR memory 120A, 120B, one or more solid state drives (SSD memory) 116, and a CXL memory/storage pool 112 (hereafter switch accessible memory (SAM)). The LPDDR 120A, 120B represents low-power DDR DRAM memory that is external to the FLC system 104. The SSD 116 may comprise any type of memory but is typically one or more SSD drives of any size. The switch accessible memory (SAM) 112 is memory or storage space which is accessible through a switch fabric, such as but not limited to CXL memory.
The benefit to the SAM 112 is that in the rare event the data requested by the host CPU 108 is not available in the FLC system 104, the data may be quickly accessed in the SAM. In some embodiments, a data access operation to the SAM 112 is faster than a data access to the SSD 116. In addition, the SAM 112 may connect to a vast amount of memory thereby providing a vast memory pool, that may be shared with the host CPU 108. As a result, in the event the host CPU requires more memory than that provided by the FLC system, the additional memory resources may be accessed through the SAM 112.
It is also noted that the FLC system 104 is a two stage cached memory with a cache miss rate of about or less than 0.1%, and as a result, it is rare that a memory call needs to be made to the SAM 112 or the SSD 116. If there is a miss in the first FLC cache, then the request is sent to the second FLC cache, and in the event of a miss at the second FLC cache, the request is sent to the SAM 112 or other memory such as the SSD. In the event a cache miss does occur, the read/write speed of the SAM 112 is very fast, such as for example, a 200 nanosecond read/write time to access another FLC's DRAM memory compared to a 10 microsecond read/write time for an SSD drive. Thus, with these example figures, the SAM 112 is about 50 times faster than a traditional SSD drive read/write operation, and in other configurations may be 100 times faster. Thus, not only is a nearly unlimited amount of memory accessible through the SAM 112, but the speed of memory access is also significantly faster than prior art memory reads from the SSD. It is further disclosed that any system discussed and shown herein may be implemented with pre-fetching to further speed operation and CPU data access.
Associated with the FLC system 104 are numerous elements, which are described below. The host CPU 108 is in communication via PCIe bus or communication path 124 with a FLC L1 controller 128, which in this embodiment is a dual channel device that functions as a cache controller. Although shown as a dual channel configuration, it is contemplated that a single channel configuration or more than two channels may be enabled. The FLC controller 128 operates as discussed above and includes a DRAM memory controller configured to interface with multiple-channel in-package memory (MC-IPM) 130A, 130B, which in turn connects to in package memory (IPM) 132A, 132B as shown. This is the memory resource for the FLC L1 controller 128. As can be seen, this first level cache system is very fast, using dual channel high speed memory having an exemplary speed of 32 Gb/s. Although high speed memory is more expensive than standard speed memory, only 128 MB is used for each channel for a total of 256 MB of FLC L1 cache memory. Various exemplary bus speeds are shown, and it is contemplated that these speeds will increase over time. There are also numerous different types or formats of buses which may be used between the host CPU 108 and the FLC system 104.
Also provided is a FLC L2 stage comprising two or more FLC L2 cache controllers 134, memory controllers 138A, 138B and associated LPDDR memory 120A, 120B. Two FLC L2 controllers increase bandwidth as compared to a single FLC L2 system. The FLC L2 cache controllers 134 function as cache controllers to receive a data request from the FLC L1 controller 128 in the event of a miss by the FLC L1 controller. The FLC L2 controller 134 processes the data request and attempts to retrieve the requested data from the LPDDR memory 120A, 120B via the memory controllers 138A, 138B. The memory 120A, 120B is low power DDR memory operating, in this embodiment, at 16 GB/s and has a capacity of less than or equal to 16 GB as shown.
In the rare instance of a cache miss by both the FLC L1 controller 128 and the FLC L2 controllers 134, the data request may be forwarded to the SAM 112 via one or more buses shown in this example embodiment as a generation 5 PCIe bus 142A, 142B. A queue manager 144 is provided to oversee and control traffic on the bus 142A, 142B. In this example embodiment, each bus 142A, 142B has a bandwidth of 16 GB/s although other parameters may be enabled in different embodiments. Any type of memory may connect to or be part of the SAM 112.
Also shown in
As discussed herein, a cache miss at the second FLC (final level cache) forwards a data request to the memory 150, or any of the other memories such as those accessible through a switch fabric. In addition, the memory 150 is also accessible by other systems (see
A memory controller is provided for memory 150 or as part of the FLC system to update one or more memory address tables. This system design operates efficiently and without bandwidth bottlenecks because the first FLC system and the second FLC system together have over a 99% cache hit rate (often up to or greater than a 99.9% hit rate), thereby requiring access to memory 150 (or switch fabric 112 accessible memory) for only a small percentage of all CPU data requests, which prevents the memory 150 or the switch fabric 112 from being overloaded with processor requests. For example, the first FLC memory cache may have a hit rate of 99% for all data requests from the CPU, while the second FLC memory cache may also have a 99% hit rate. Thus, for a million data requests from the CPU, only 10,000 are passed to the second FLC memory cache, and of those 10,000 requests from the first FLC system to the second FLC system, only 100 are provided to the memory 150 or the switch fabric 112. Hence only 100 requests out of 1 million are not satisfied by the caches, which equates to 100/1,000,000, or 0.01%; that is, 99.99% of requests are satisfied by the first FLC memory and the second FLC memory. This results in a very low burden for memory 150 or switch fabric 112.
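The two-stage hit-rate example can be reproduced with a short sketch. The 99% hit rates and the one million requests are the figures from the paragraph above; the function name is an illustrative assumption, and exact fractions are used so that the counts are not disturbed by floating-point rounding.

```python
from fractions import Fraction


def requests_past_caches(total_requests, hit_rates):
    """Number of requests that miss every cache stage in hit_rates, in order."""
    remaining = Fraction(total_requests)
    for hit_rate in hit_rates:
        remaining *= 1 - Fraction(hit_rate)  # only the misses move on
    return remaining


ninety_nine_percent = Fraction(99, 100)

# Requests forwarded from the first FLC stage to the second.
past_first_stage = requests_past_caches(1_000_000, [ninety_nine_percent])

# Requests forwarded from the second FLC stage to memory 150 or the switch fabric.
past_second_stage = requests_past_caches(
    1_000_000, [ninety_nine_percent, ninety_nine_percent]
)

# Fraction of all CPU requests that neither stage could satisfy.
combined_miss_rate = past_second_stage / 1_000_000
```

With these inputs, 10,000 requests reach the second stage and 100 reach the shared memory, for a combined miss rate of 1 in 10,000 (0.01%).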
The FLC cache hit rates may be made even higher by increasing the amount of cache memory or pre-caching data. The FLC memory cache, separate from the CPU die/package, is in addition to the traditional cache memory that is part of or associated with the CPU.
The systems based on the designs disclosed herein are very scalable, allowing thousands of CPUs with FLC cache systems to connect to one or more switch fabrics. Sharing each memory 150 from many different linked systems creates a massive memory pool that can be shared and accessed by numerous different systems. The FLC L1 cache 128 and FLC L2 cache 134 take the load off of memory resources 150, 112. Having the memory 150 integrated increases speed, and DIMMs already have very high bandwidth, which would otherwise not be fully used by only one CPU 108. Further, some CPUs 108 may need more memory than other CPUs.
Operation of the FLC cache system is described herein and as such is not described again. In this embodiment, the operation is supplemented with additional memory resources which are accessible over a SAM 112. This increases memory resources available to the host CPU 108 and makes the memory resources of the FLC system 104 available to other systems, which require additional memory capacity. In addition, data which in the past was duplicated on many different computers or servers may now be stored at a single location and accessed through the SAM 112 by numerous different host CPUs 108. The switch fabric allows for sharing of any memory resources with any other system. For example, in an extreme case, there could be thousands of processors connected to petabytes of memory, and such a system could be used by a thousand people or one person. In one example environment of use, the system disclosed herein may be configured as virtual servers, such as in a data center.
In addition, although specific examples of memory types and interface types are provided to guide the reader, it is contemplated that other types of memory may be used or other types of memory interfaces. For example, although LPDDR is shown in
It is further contemplated and disclosed that any feature or configuration of one embodiment or figure may be combined, in any combination with any other feature or embodiment of different drawings. Thus, various combinations are contemplated that draw on the features and configuration disclosed across all the figures.
Any number of hosts may be connected to the switch 212. Each host 208, 220A, 220B may be the same or a different type of system or configuration, thereby allowing interaction between different systems. In addition, the hosts 208, 220A, 220B may be located in the same case or housing, adjacent each other, in different rooms, different buildings in the same city, or at remote locations. As discussed herein, each host may have multiple ports, such as for example eight ports, each of which may connect to a FLC system or to a switch. In other embodiments, each CPU may have 8, 12, 16, or 64 ports (such as with or without multiple CPU systems) allowing for very large memory systems.
This configuration allows the host CPU 208 to access the memory of its associated FLC system 204 as well as all the other resources available through the switch fabric 212, such as the memory of the other hosts 220A, 220B, and the memory resources 224, 228, 232. As a result, if the host CPU 208 requires more memory, it can utilize any additional memory accessible through the switch fabric 212. Also shown in
In this embodiment, the other host CPUs 220A, 220B can access the memory resources of the FLC memory 204. As a result, data can be easily and quickly shared, and data that is rarely used, but occasionally required, may be stored in one location and accessed by numerous host CPUs, thereby clearing space in the memory of each host. In addition, an amount of memory may be dedicated to one particular host CPU and FLC system, and the remaining memory may be designated as shared memory and thus accessible by other host CPUs. Although shown with two additional hosts, it is contemplated that any number of additional host CPUs and FLC systems 220A, 220B may be connected to the switch fabric 212. In addition, it is also contemplated that the switch fabric 212 may connect to another switch fabric (not shown) to further expand the memory access and capacity capability.
The interface between the CPU and the FLC1, such as for example between the CPU 208 and the FLC system 204, may be a lower power interface due to the close proximity of these two elements, such as in a chiplet, in the same integrated circuit, or on a common circuit board. The distance may be 18 inches or less. In contrast, the interface (serdes) between the CXL switch 212 and the FLC systems 204, 220A, 220B may be of higher power, due to the longer distance or range that the signal has to travel. In one embodiment, the higher power may be used when the distance is over 18 inches, such as for example, 1 meter. Using a lower power interface reduces power consumption and heat generation for the serdes (serializer/deserializer, PHY), while higher power interfaces enable greater scalability by allowing the FLC system to be located further away from the switch fabric 212. In addition, the interfaces may be optical.
The various embodiments disclosed herein are well suited to streaming of data, such as video, to multiple users from a CPU 208 over a network interface 260. In this mode of operation, the data to be streamed may be prefetched from a DDR memory 232 or SSD memory 224 into FLC memory. The CPU can then quickly access the data and forward the data to multiple different users via the network interface 260. The CPU 208 can service many users, all of whom may be streaming the same video (movie) but at different locations. The entire movie may be loaded into the FLC memory, providing rapid access to the movie by the CPU for the numerous users. In this embodiment, the memory associated with the FLC2 or FLC1 may be large, such as 4 DIMMs of 32 GB totaling 128 GB of memory, which is sufficient space for multiple movies.
Shown above is the sharing of pools of memory in or associated with different FLC modules. It is also contemplated that the memory 224, 228, and 232 will be shared between various FLC modules (and CPUs) to further expand available resources for systems that require an additional data path between the FLC modules 204, 220A, 220B and the SSD 224 or the DDR 232. This is shown by dashed lines 330 and 334. Additionally, the CPU 208 may also access the data stored in any of the memory 224, 228, and 232. The memory 224, 228, 232 functions as a shared pool available to any CPU.
This arrangement provides several benefits, including full bandwidth efficiency for the switch fabric system as shown. Because the FLC system 204, or any FLC system associated with other hosts, only has cache misses for 0.1% of memory requests, each FLC system will utilize very little switch fabric bandwidth, thus preventing the switch fabric from becoming a bottleneck and increasing the number of host CPUs which can efficiently be connected to the switch fabric. It is also contemplated that the memory resources can be assigned different designations and/or priorities. For example, some FLC memory resources may be dedicated to the host CPU with which the FLC system is associated. In addition, other memory resources may be designated as shared resources. This designation may dynamically change based on CPU usage or allocations. This ensures that sufficient resources are available for a host CPU, while allowing other resources to be shared, thereby allowing use of otherwise unused (wasted) memory capacity. The same allocation principles may be applied to other memory resources. For example, and without limitation, the memory 224, 228, 232 may have resources which are dedicated to a particular host CPU. It is also contemplated that there may be striped local DRAM space (LPDDR or DDR).
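One way to picture the dedicated and shared designations is a small table that records, for each memory resource, whether it is reserved for one host or open to all. The following is a hypothetical sketch only: the class names, the region names, the host labels, and the sizes are invented for illustration, and the disclosure does not specify how designations are stored or enforced.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Region:
    size_gb: int
    owner: Optional[str] = None  # None marks the region as shared


class MemoryDesignations:
    """Hypothetical table of memory regions and their current designation."""

    def __init__(self):
        self.regions: Dict[str, Region] = {}

    def add(self, name, size_gb, owner=None):
        """Register a region as dedicated (owner given) or shared (owner None)."""
        self.regions[name] = Region(size_gb, owner)

    def can_access(self, name, host):
        """A host may use a region if it is shared or dedicated to that host."""
        owner = self.regions[name].owner
        return owner is None or owner == host

    def redesignate(self, name, owner=None):
        """Change a designation dynamically, e.g. in response to CPU usage."""
        self.regions[name].owner = owner

    def shared_capacity_gb(self):
        """Total capacity currently available to every host."""
        return sum(r.size_gb for r in self.regions.values() if r.owner is None)
```

In such a model, a region dedicated to one host is invisible to the others until it is redesignated as shared, at which point its capacity joins the common pool.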
Also shown in
In one embodiment, CPU port 0 connects to fabric port 0, CPU port 1 connects to fabric port 1, CPU port 2 connects to fabric port 2, up to CPU port N connecting to switch fabric port N. If each fabric takes care of 256 ports, there are N times 256 ports capacity into the system or fabric capacity. Each CPU will have many ports, but each fabric may only communicate with one CPU port. For example, a CPU may have many cores shared among many ports, such as 32 ports shared between 128 cores. A CPU core can typically communicate with any CPU port. By connecting each CPU port to a different switch fabric, the system has more capacity and is more scalable. In relation to a scaled system, any FLC cache system can access any DIMM 830 within any connected FLC system.
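The port-to-fabric arrangement can be sketched as a simple wiring table. This is an illustration under stated assumptions: N is read as the number of switch fabrics (one per CPU port), each fabric has 256 ports as in the example above, and the choice of which port on a fabric a given CPU uses (here, the CPU's own index) is invented for the sketch. The function names are hypothetical.

```python
def total_fabric_ports(num_fabrics, ports_per_fabric=256):
    """Aggregate port capacity when each CPU port has its own switch fabric."""
    return num_fabrics * ports_per_fabric


def wire(num_cpus, ports_per_cpu, ports_per_fabric=256):
    """Map (cpu, cpu_port) -> (fabric, fabric_port).

    CPU port i always connects to fabric i, so each fabric sees exactly one
    port from each CPU. Using the CPU index as the fabric port is an
    assumption of this sketch.
    """
    if num_cpus > ports_per_fabric:
        raise ValueError("more CPUs than ports on a single fabric")
    return {
        (cpu, port): (port, cpu)
        for cpu in range(num_cpus)
        for port in range(ports_per_cpu)
    }


# Example: four CPUs, each with 32 ports, and therefore 32 fabrics.
wiring = wire(num_cpus=4, ports_per_cpu=32)
capacity = total_fabric_ports(num_fabrics=32)
```

With 32 ports per CPU and 256 ports per fabric, the combined fabrics offer 32 × 256 = 8192 ports, and no fabric carries more than one port of any CPU.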
The DIMM 830 capacity is selected to suit the needs of the CPU and the needs of the other CPUs which can access the DIMM. The memory SSD 3DNAND, 3D XPoint, and DDR memory shown on the right-hand side of
In
It is contemplated that this system may be well suited for use with HBM (high bandwidth memory), which has many channels (such as 16 or more) and a higher bandwidth than existing memory. In the figures shown in this application, the HBM memory may replace or supplement the IPM memory. The HBM may be on the die 850 but shared by multiple FLC cache systems on the die. Each FLC cache system may have its own dedicated HBM or the HBM may be shared. In one configuration, eight IPMs have the same performance as one HBM. By integrating multiple FLC cache systems into one die with the HBM memory, the switch fabric connection for all the multi-channel FLC systems can be collapsed to one, all while accessing the high speed HBM. This multiplies the bandwidth of the fabric even further by allowing a greater number of FLC cache systems to connect to a switch fabric having a fixed number of ports. It is also contemplated that fiber optic cables may be used in connection with an optic switch to further increase bandwidth.
Integrating two level FLC main memory caching technology into each CPU socket provides numerous benefits. One benefit is that the disclosed systems and methods provide a massive amount of bandwidth to a CPU socket from the 1st level FLC in-package wide bandwidth DRAM memory (IPM or in-package memory). In addition, since each FLC module uses a small amount of memory for the 1st level FLC (small in comparison to a typical DRAM main memory size, but very large when compared to typical CPU caches), it reduces cost and size. The IPM device could furthermore be optimized for extreme low power operation, thus delivering significantly higher system energy efficiency. The 1st level FLC may also be optimized for extremely fast operation. The low power consumption of the disclosed system is a significant advantage over prior art memory systems, which consume a large amount of power and consequently generate a correspondingly large amount of heat, both of which are undesirable.
For example, this interface could be a D2D (die-to-die) interface between chiplets contained in the same package, such as but not limited to IPM (in package memory). Conversely, the devices may be in the same die, or located in separate packages. As shown, dashed line 1082 may represent the IC (chip) that includes a CPU 1008, the memory interface 1024, and the FLC1 1028, or the dashed line 1082 may represent the package that includes chiplets 1086 and 1088. In addition, the interface between the CPU and the memory interface 1024 may be a CXL fabric.
The first FLC unit operates as described herein and provides the benefits outlined herein. In this embodiment, one or more second FLC units 1034 are connected to the first FLC unit 1028. Although shown with two second FLC units 1034, it is contemplated that only one second FLC unit may be provided, or that more than two second FLC units may be present. It is also contemplated that only one FLC unit may be present, or that more than a first and second FLC unit 1028, 1034 may be provided.
In this embodiment, the second FLC units 1034 connect to memory interfaces 1038A, 1038B, which in turn connect to and provide access to external memory 1020A, 1020B. As shown in
In this embodiment, the memory interface and/or network interface 1070 connects to internal memory 1050, which may comprise any type of memory configured to store data or other information. The memory interface and/or network interface 1070 may also optionally connect to a memory I/O interface that connects to external memory 1050. It is contemplated that one or more additional FLC units (not shown) may connect to the network accessible external memory.
Also connected to the memory interface and/or network interface 1070 is network accessible memory 1012. The network accessible memory 1012 may be any type of memory. As with the other external memory, it is contemplated that additional FLC units may connect to the network accessible memory 1012. Also contemplated is that one or more SSD memory drives 1016 may be accessed via the memory interface and/or network interface 1070. Multiple SSD drives may be provided and may likewise be accessed by other FLC units.
Additional benefits and optional configurations are contemplated. The system can be configured to provide large cache capacity for the 2nd level FLC using a fraction of the amount of DRAM used in a typical server CPU today. Even if the system were configured with ⅛th of the DRAM normally allocated in existing servers (around 128 GB), this would provide 16 GB of 2nd level FLC.
The system enables highly efficient main memory pooling because the bandwidth needed to and from the main memory pool would now only be the bandwidth of the cache misses from such a 2nd level FLC, which has a very low miss rate. With such a large capacity 2nd level FLC (FLC2), the miss rate is often much lower than 0.1%, resulting in a memory pool bandwidth that is less than 0.1% of the bandwidth of a main memory pooling system that does not use an FLC system. From the CPU socket point of view, this effectively multiplies the main memory pooling bandwidth by more than three orders of magnitude.
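The bandwidth reduction described above can be illustrated with a short calculation. This is only a sketch: it assumes, as the passage does, that pool traffic scales in proportion to the FLC2 miss rate, and it reuses the illustrative figures of 256 GB/s per socket and one thousand sockets from the background discussion. The function name pool_bandwidth_gb_per_s is hypothetical.

```python
def pool_bandwidth_gb_per_s(per_socket_gb_per_s, sockets, miss_rate):
    """Bandwidth the shared memory pool must serve, in GB/s.

    Assumption: only FLC2 misses reach the pool, so pool traffic is the
    aggregate CPU demand multiplied by the miss rate.
    """
    return per_socket_gb_per_s * sockets * miss_rate


# Without an FLC every access reaches the pool (miss rate of 1.0).
without_flc = pool_bandwidth_gb_per_s(256, 1000, 1.0)

# With an FLC2 miss rate of 0.1%, only a thousandth of the traffic remains.
with_flc = pool_bandwidth_gb_per_s(256, 1000, 0.001)

reduction_factor = without_flc / with_flc
```

Under these assumptions the pool requirement drops from about 256 TB/s to about 256 GB/s, which is the three orders of magnitude mentioned above.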
Alternatively, we could move the external shared main memory (DDR for example) from the fabric into the FLC controller module as shown in
On top of that, an SSD interface may be included in the FLC controller locally to also enable a fully shared and distributed SSD deployment through the same CXL interface between the FLC controller and the external CXL fabric.
In
It is also contemplated and disclosed that cache lines of sizes other than 4 KB may be used. For example, it is contemplated and disclosed that cache line sizes of 0.5 KB, 1 KB, 2 KB, 3 KB, 8 KB, 12 KB or 16 KB may be used, or any variation of these values. In one embodiment, if a smaller IPM were used, then a smaller cache line size may also be used. A smaller cache line size may reduce the cost of the system, although bandwidth/transfer speeds may be reduced. A fully associative cache system would still be implemented, however.
By striping multiple FLC controllers across distinct 4 KB page boundaries, we could now attach multiple (N) CXL fabrics without concern of page data crossing between the different CXL fabrics. This effectively increases the CXL fabric interconnect and bandwidth capacity by the number of FLC channels that are connected to the CPU devices. For example, if a given CPU socket is equipped with 8 channels of x8 CXL ports that could be connected to the FLC controllers, up to 8 CXL fabric devices could be used to build the interconnect for the memory pooling.
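As an illustration of assigning distinct 4 KB page boundaries to FLC channels, the sketch below maps a physical address to a channel. The disclosure does not specify the mapping function, so the round-robin (modulo) interleave used here, and the names PAGE_SIZE and fabric_channel, are assumptions made for illustration only.

```python
PAGE_SIZE = 4096  # 4 KB page boundary, as in the text


def fabric_channel(phys_addr, num_channels):
    """Return the FLC channel (and thus CXL fabric) that owns this address.

    Assumption: pages are interleaved round-robin across the channels.
    Every byte of a given page lands on the same channel, so page data
    never crosses between fabrics.
    """
    page_index = phys_addr // PAGE_SIZE
    return page_index % num_channels


# All addresses within page 3 (0x3000 through 0x3FFF) map to one channel.
channels_for_page_3 = {fabric_channel(a, 8) for a in range(0x3000, 0x4000)}
```

Consecutive pages fall on consecutive channels, so a long sequential transfer is spread across all of the attached fabrics.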
Moreover, an even larger memory pooling capacity could be obtained simply by grouping multiple CPU sockets into a coherent system, wherein each of the CXL CPU ports of the group is assigned dedicated 4 KB page address boundaries. A coherent CPU network of 8 CPU sockets, with each socket supporting 8 channels of CXL ports, could for example support 64 independent CXL fabrics. With 128×4-port CXL fabrics, the system would therefore be able to support 64*128 channels of FLC controllers. Even assuming only 64 GB/s of bandwidth for each FLC controller, a cache bandwidth of 64*128*64 GB/s would be available to such a system. Assuming each CPU core needs 8 GB/s of sustained bandwidth, such a configuration would easily support at least 64*128*64/8 cores, or about 65K CPU cores.
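The arithmetic in the preceding example can be checked step by step. The sketch below simply restates the figures given in the text; the variable names are illustrative.

```python
sockets = 8                      # CPU sockets in the coherent group
cxl_channels_per_socket = 8      # CXL channels per socket
flc_channels_per_fabric = 128    # FLC controller channels per fabric
flc_bandwidth_gb_per_s = 64      # assumed bandwidth per FLC controller
core_bandwidth_gb_per_s = 8      # sustained bandwidth needed per CPU core

fabrics = sockets * cxl_channels_per_socket                    # 64
flc_controllers = fabrics * flc_channels_per_fabric            # 8192
cache_bandwidth_gb_per_s = flc_controllers * flc_bandwidth_gb_per_s
supported_cores = cache_bandwidth_gb_per_s // core_bandwidth_gb_per_s
```

This yields 524,288 GB/s of aggregate cache bandwidth and 65,536 supported cores, matching the 65K figure above.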
The above method of striping the FLC controllers across distinct page boundaries effectively multiplies the bandwidth of the CXL fabric without resorting to the impractical task of building a fabric device with tens of thousands of CXL ports.
On the surface, moving FLC into the CPU fabric means limiting the number of IPM dies that could be packaged into a fabric. That in turn implies a smaller number of CPU cores that could be integrated into a single fabric. Yet on the other hand, this limitation is actually a benefit, as today the number of applications that could coherently use all 64 CPU cores in a single socket is practically non-existent.
The question is then why modern high-end CPU sockets are designed with more CPU cores than the end users need. Surprisingly, the only reason that high-end CPU sockets are built with increasingly more CPU cores is to reduce the data center footprint, as data center real estate cost is not insignificant.
However, by integrating FLC into the CPU fabric, the innovation now drastically reduces the footprint of the CPU socket. As a result, we could now reduce the number of cores needed in a single socket, thus improving the chip manufacturing yield and improving the speed of the CPU cores, as we could improve the CPU power supply stability with fewer cores per socket. It would furthermore improve the cooling efficiency of the CPU socket, drastically reducing the cooling cost of future data centers.
FLC caching technology in combination with memory pooling allows CPU sockets to be grouped only with other high energy consuming devices (other CPUs and GPUs, for example) in dedicated racks having a dedicated cooling system. These grouped logic devices can now be operated at significantly higher temperatures compared to today's data center architecture, which reduces cooling costs for these elements because less cooling is required due to the ability of these grouped elements to operate at a higher temperature than the memory devices. Similarly, the FLC controllers and the associated memory pools can now be cooled with a much smaller cooling system since these devices generate less heat, and yet still operate at lower temperatures to keep the memory devices functioning reliably.
A data center with compute and memory cooling partitioning would enable the compute devices to be cooled with much cheaper refrigeration systems, or even without any refrigeration cooling systems, in practically any environment with temperatures cooler than 40 degrees C.
The following provides additional details and various embodiments of the FLC system referenced above. The following provides examples of various types of FLC modules and systems that may be used with the embodiments shown in
The SoC 1112 can include one or more image processing devices 1120, a system bus 1122 and a memory controller 1124. Each of the image processing devices 1120 can include, for example: a control module 1126 with a central processor (or central processing unit (CPU)) 1128; a graphics processor (or graphics processing unit (GPU)) 1130; a video recorder 1132; a camera image signal processor (ISP) 1134; an Ethernet interface such as a gigabit (Gb) Ethernet interface 1136; a serial interface such as a universal serial bus (USB) interface 1138 and a serial advanced technology attachment (SATA) interface 1140; and a peripheral component interconnect express (PCIe) interface 1142. The image processing devices 1120 access the DRAMs 1114 via the system bus 1122 and the memory controller 1124. The DRAMs 1114 are used as main memory. For example, one of the image processing devices 1120 provides a physical address to the memory controller 1124 when accessing a corresponding physical location in one of the DRAMs 1114. The image processing devices 1120 can also access the storage drives 1116 via the system bus 1122.
The SoC 1112 and/or the memory controller 1124 can be connected to the DRAMs 1114 via one or more access ports 1144 of the SoC 1112. The DRAMs 1114 store user data, system data, and/or programs. The SoC 1112 can execute the programs using first data to generate second data. The first data can be stored in the DRAMs 1114 prior to the execution of the programs. The SoC 1112 can store the second data in the DRAMs 1114 during and/or subsequent to execution of the programs. The DRAMs 1114 can have a high-bandwidth interface and low-cost-per-bit memory storage capacity and can handle a wide range of applications.
The SoC 1112 includes cache memory, which can include one or more of a level zero (L0) cache, a level one (L1) cache, a level two (L2) cache, or a level three (L3) cache. The L0-L3 caches are arranged on the SoC 1112 in close proximity to the corresponding ones of the image processing devices 1120. In the example shown, the control module 1126 includes the central processor 1128 and L1-L3 caches 1150. The central processor 1128 includes a L0 cache 1152. The central processor 1128 also includes a memory management unit (MMU) 1154, which can control access to the caches 1150, 1152.
As the level of cache increases, the access latency and the storage capacity of the cache increase. For example, L1 cache typically has less storage capacity than L2 cache and L3 cache. However, L1 cache typically has lower latency than L2 cache and L3 cache.
The caches within the SoC 1112 are typically implemented as static random access memories (SRAMs). Because of the close proximity of the caches to the image processing devices 1120, the caches can operate at the same clock frequencies as the image processing devices 1120. Thus, caches exhibit shorter latency periods than the DRAMs 1114.
The number and size of the caches in the SoC 1112 depends upon the application. For example, an entry level handset (or mobile phone) may not include an L3 cache and can have smaller sized L1 cache and L2 cache than a personal computer. Similarly, the number and size of each of the DRAMs 1114 depends on the application. For example, mobile phones currently have 4-12 gigabytes (GB) of DRAM, personal computers currently have 8-32 GB of DRAM, and servers currently have 32 GB-512 GB of DRAM. In general, cost increases with large amounts of main memory as the number of DRAM chips increases.
In addition to the cost of DRAM, it is becoming increasingly difficult to decrease the package size of DRAM for the same amount of storage capacity. Also, as the size and number of DRAMs incorporated in a device increase, the capacitances of the DRAMs increase, the number and/or lengths of conductive elements associated with the DRAMs increase, and buffering associated with the DRAMs increases. In addition, as the capacitances of the DRAMs increase, the operating frequencies of the DRAMs decrease and the latency periods of the DRAMs increase.
During operation, programs and/or data are transferred from the DRAMs 1114 to the caches in the SoC 1112 as needed. These transfers have higher latency as compared to data exchanges between (i) the caches, and (ii) the corresponding processors and/or image processing devices. For this reason, accesses to the DRAMs 1114 are minimized by building SoCs with larger L3 caches. Yet despite having larger and larger L3 caches, every year computing systems still need more and more DRAM (larger main memory). With all else being equal, a computer with a larger main memory will have better performance than a computer with a smaller main memory. With today's operating systems, a modern PC with a 4 GB main memory would in fact perform extremely poorly even if it were equipped with the fastest and best processor. The reason why computer main memory size keeps increasing over time is explained next.
During boot up, programs can be transferred from the storage drives 1116 to the DRAMs 1114. For example, the central processor 1128 can transfer programs from the storage drive 1116 to the DRAMs 1114 during the boot up. Only when the programs are fully loaded into the DRAMs can the central processor 1128 execute the instructions stored in the DRAMs. If the CPU only needed to run one program at a time, and the user were willing to wait while the CPU killed the previous program before launching a new program, the computer system would indeed require a very small amount of main memory. However, this would be unacceptable to consumers, who are now accustomed to instant response times when launching new programs and switching between programs on the fly. This is why computers need more DRAM every year, and why DRAM companies prioritize manufacturing larger DRAMs.
At least some of the following examples include final level cache (FLC) modules and storage drives. The FLC modules are used as main memory cache, and the storage drives are used as physical storage for user files, with a portion of the storage drive partitioned for use by the FLC modules as the actual main memory. This is in contrast to traditional computers, where the actual main memory is made of DRAMs. Data is first attempted to be read from or written to the DRAM of the FLC modules, with the main memory portion of the physical storage drive providing the last resort backup in the event of misses from the FLC modules. Lookup tables in the FLC modules are referred to herein as content addressable memory (CAM). FLC controllers of the FLC modules control access to the memory in the FLC modules and the storage drives using various CAM techniques described below. The CAM techniques and other disclosed features reduce the required storage capacity of the DRAM in a device while maximizing memory access rates and minimizing power consumption. The device may be a mobile computing device, desktop computer, server, network device or a wireless network device. Examples of devices include but are not limited to a computer, a mobile phone, a tablet, a camera, etc. The DRAM in the following examples is generally not used as main memory, but rather is used as a cache of the much slower main memory that is now located in a portion of the storage drive. Thus, the partition of the storage drive is the main memory and the DRAM is a cache of the main memory.
Tasks described below as being performed by a processing device may be performed by, for example, the central processor 1273 and/or the MMU 1279.
The processing devices 1272 are connected to the FLC module 1276 via the system bus 1274. The processing devices 1272 are connected to the storage drive 1278 via the bus and interfaces (i) between the processing devices 1272 and the system bus 1274, and (ii) between the system bus 1274 and the storage drive 1278. The interfaces may include, for example, Ethernet interfaces, serial interfaces, PCIe interfaces and/or embedded multi-media controller (eMMC) interfaces. The storage drive 1278 may be located anywhere in the world away from the processing devices 1272 and/or the FLC controller 1280. The storage drive 1278 may be in communication with the processing devices 1272 and/or the FLC controller 1280 via one or more networks (e.g., a WLAN, an Internet network, or a remote storage network (or cloud)).
The FLC module 1276 includes a FLC controller 1280, a DRAM controller 1282, and a DRAM IC 1284. The terms DRAM IC and DRAM are used interchangeably. Although referenced for understanding as DRAM, other types of memory could be used, including any type of RAM, SRAM, DRAM, or any other memory that performs as described herein but with a different name. The DRAM IC 1284 is used predominantly as virtual and temporary storage, while the storage drive 1278 is used as physical and permanent storage. This implies that generally a location in the DRAM IC has no static/fixed relationship to the physical address that is generated by the processor module. The storage drive 1278 may include a partition that is reserved for use as main memory, while the remaining portion of the storage drive is used as traditional storage drive space to store user files. This is different from prior art demand paging operations that would occur when the computer is out of physical main memory space in the DRAM. In that case, large blocks of data/programs are transferred between the DRAM and the hard disk drive. This also entails deallocating and reallocating physical address assignments, which is done by the MMU and the operating system (OS). This is a slow process because the OS has neither sufficient nor precise information on the relative importance of the data/programs that are stored in the main memory. The processing devices 1272 address the DRAM IC 1284 and the main memory partition of the storage drive 1278 as if they were a single main memory device. A user does not have access to and cannot view data or files stored in the main memory partition of the storage drive, in the same way that a user cannot see the files stored in RAM during computer operation. While reading and/or writing data, the processing devices 1272 send access requests to the FLC controller 1280.
The FLC controller 1280 accesses the DRAM IC 1284 via the DRAM controller 1282 and/or accesses the storage drive 1278. The FLC controller 1280 may access the storage drive directly (as indicated by the dashed line) or via the system bus 1274. From the point of view of the processor and the programmer, accesses to the storage partition dedicated as main memory are done through processor native load and store operations and not as I/O operations.
Various examples of the data access system 1270 are described herein. In a first example, the FLC module 1276 is implemented in a SoC separate from the processing devices 1272, the system bus 1274 and the storage drive 1278. In another embodiment, the elements are on different integrated circuits. In a second example, one of the processing devices 1272 is a CPU implemented processing device. The one of the processing devices 1272 may be implemented in a SoC separate from the FLC module 1276 and the storage drive 1278. As another example, the processing devices 1272 and the system bus 1274 are implemented in a SoC separate from the FLC module 1276 and the storage drive 1278. In another example, the processing devices 1272, the system bus 1274 and the FLC module 1276 are implemented in a SoC separate from the storage drive 1278. Other examples of the data access system 1270 are disclosed below.
The DRAM IC 1284 may be used as a final level of cache. The DRAM IC 1284 may have various storage capacities. For example, the DRAM IC 1284 may have 1-2 GB of storage capacity for mobile phone applications, 4-8 GB of storage capacity for personal computer applications, and 16-64 GB of storage capacity for server applications.
The storage drive 1278 may include NAND flash SSD or other non-volatile memory such as Resistive RAM and Phase Change Memory. The storage drive 1278 may have more storage capacity than the DRAM IC 1284. For example, the storage drive 1278 may include 8-16 times more storage than the DRAM IC 1284. The DRAM IC 1284 may include high-speed DRAM, and the storage drive 1278 may, even in the future, be made of ultra-low cost and low-speed DRAM if low task-switching latency is important. Ultimately, a new class of high capacity serial/sequential large-page DRAM (with limited random accessibility) could be built for the final main memory. Such a serial DRAM device could be at least two times more cost effective than traditional DRAM, as its die size could be at least two times smaller than that of traditional DRAM. In one embodiment, the serial DRAM would have a minimum block (chunk) size which could be retrieved or written at a time, such as one cache line (4 KB), but in other embodiments other minimum block sizes could be established. Thus, data may not be read from or written to an arbitrary location, but instead only from or to certain blocks. Such serial DRAM could furthermore be packaged with an ultra-high speed serial interface to enable high capacity DRAM to be mounted far away from the processor devices, which would enable processors to run at their full potential without worrying about overheating. As shown, a portion of the storage drive 1278 is partitioned to serve as main memory and thus is utilized by the FLC controller 1280 as an extension of the FLC DRAM 1284.
The cache lines stored in the DRAM IC 1284 may be data that is accessed most recently, most often, and/or has the highest associated priority level. The cache lines stored in the DRAM IC 1284 may include cache lines that are locked in. A locked in cache line refers to data that is always kept in the DRAM IC 1284. A locked in cache line cannot be kicked out by other cache lines even if the locked in cache line has not been accessed for a long period of time. A locked in cache line, however, may be updated (written). In one embodiment, defective DRAM cells (and their corresponding cache lines) may be locked out (mapped out) from the FLC system by removing a DRAM address entry that has defective cell(s), to prevent the FLC address lookup engine from assigning a cache line entry to that defective DRAM location. The defective DRAM entries are normally found during device manufacturing. In yet another embodiment, the operating system may use the map out function to place a portion of DRAM into a temporary state where it is unusable by the processor for normal operations. Such a function allows the operating system to issue commands to check the health of the mapped out DRAM, one section at a time, while the system is running actual applications. If a section of the DRAM is found to have weak cells, the operating system may then proactively disable the cache line that contains the weak cell(s) and take the so called “weak cache line” out of service. In one embodiment, the FLC engine could include hardware diagnostic functions to offload the processor from performing DRAM diagnostics in software.
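The lock in and map out behavior described above can be sketched as a small slot allocator. This is a hypothetical illustration, not the disclosed hardware: the class name FlcSlots, its methods, and the least-recently-used victim choice are all assumptions, and a real controller would write back a dirty line before taking its slot out of service.

```python
from collections import OrderedDict


class FlcSlots:
    """Toy model of DRAM cache-line slots with lock-in and map-out."""

    def __init__(self, num_slots, mapped_out=()):
        # Slots found defective (e.g., at manufacturing) are never allocated.
        self.mapped_out = set(mapped_out)
        self.free = [s for s in range(num_slots) if s not in self.mapped_out]
        self.lines = OrderedDict()  # page -> slot, least recently used first
        self.locked = set()         # pages that must never be evicted

    def lock(self, page):
        """Mark a page as locked in; it can still be updated, never evicted."""
        self.locked.add(page)

    def map_out(self, slot):
        """Take a slot out of service (simplified: cached line is dropped)."""
        self.mapped_out.add(slot)
        if slot in self.free:
            self.free.remove(slot)
        for page, used_slot in list(self.lines.items()):
            if used_slot == slot:
                del self.lines[page]
                self.locked.discard(page)

    def access(self, page):
        """Return the slot holding the page, allocating or evicting if needed."""
        if page in self.lines:
            self.lines.move_to_end(page)  # mark as most recently used
            return self.lines[page]
        if self.free:
            slot = self.free.pop(0)
        else:
            # Assumed policy: evict the least recently used unlocked page.
            victim = next((p for p in self.lines if p not in self.locked), None)
            if victim is None:
                raise RuntimeError("all usable slots hold locked-in lines")
            slot = self.lines.pop(victim)
        self.lines[page] = slot
        return slot
```

For example, with three slots of which slot 1 is mapped out, locking the oldest page forces the next eviction to fall on a newer, unlocked page, and slot 1 is never handed out.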
In some example embodiments, the data stored in the DRAM IC 1284 does not include software applications, fonts, software code, alternate code and data to support different spoken languages, etc., that are not frequently used (e.g., accessed more than a predetermined number of times over a predetermined period of time). This can aid in minimizing size requirements of the DRAM IC 1284. Software code that is used very infrequently or never at all could be considered “garbage code” as far as the FLC is concerned. Such code may not be loaded by the FLC during the boot up process, and if it did get loaded and was used only once, for example, it may be purged by the FLC and never loaded again, thus freeing up the space of the DRAM IC 1284 for truly useful data/programs. As the size of the DRAM IC 1284 decreases, DRAM performance increases and power consumption, capacitance and buffering decrease. As capacitance and buffering decrease, latencies decrease. Also, by consuming less power, the battery life of a corresponding device is increased. Overall system performance would of course increase with a bigger DRAM IC 1284, but that comes at the expense of increased cost and power.
The FLC controller 1280 performs CAM techniques in response to receiving requests from the processing devices 1272. The CAM techniques include converting first physical addresses of the requests provided by the processing devices 1272 to virtual addresses. These virtual addresses are independent of and different from the virtual addresses originally generated by the processing devices 1272 and mapped to the first physical addresses by the processing devices 1272. The DRAM controller 1282 converts (or maps) the virtual addresses generated by the FLC controller 1280 to DRAM addresses. If the DRAM addresses are not in the DRAM IC 1284, the FLC controller 1280 may (i) fetch the data from the storage drive 1278, or (ii) indicate to (or signal) the corresponding one of the processing devices 1272 that a cache miss has occurred. Fetching the data from the storage drive 1278 may include mapping the first physical addresses received by the FLC controller 1280 to second physical addresses to access the data in the storage drive 1278. A cache miss may be detected by the FLC controller 1280 while translating a physical address to a virtual address.
FLC controller 1280 may then signal one of the processing devices 1272 of the cache miss as it accesses the storage drive 1278 for the data. This may include accessing the data in the storage drive 1278 based on the first (original) physical addresses through mapping of the first/original physical address to a storage address and then accessing the storage drive 1278 based on the mapped storage addresses.
CAM techniques are used to map the first physical addresses to virtual addresses in the FLC controller. The CAM techniques provide fully associative address translation. This may include logically comparing the processor physical addresses to all virtual address entries stored in a directory of the FLC controller 1280. Set associative address translation should be avoided, as it would result in much higher miss rates, which in turn would reduce processor performance. A hit rate of data being located in the DRAM IC 1284 with a fully associative and large cache line architecture (FLC) after initial boot up may be as high as 99.9%, depending on the size of the DRAM IC 1284. The DRAM IC 1284 in general should be sized to assure a near 100% medium term (minutes of time) average hit rate with minimal idle time of a processor and/or processing device. For example, this may be accomplished using a 1-2 GB DRAM IC for mobile phone applications, 4-8 GB DRAM ICs for personal computer applications, and 16-64 GB DRAM ICs for server applications.
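A minimal software sketch of fully associative translation with 4 KB cache lines follows. It is an illustration under stated assumptions rather than the disclosed hardware: a dictionary lookup stands in for the CAM's parallel comparison against all directory entries, the FLC virtual address and DRAM address steps are collapsed into one, least-recently-used replacement is assumed, and the names FlcDirectory and translate are hypothetical.

```python
from collections import OrderedDict

LINE_SIZE = 4096  # 4 KB cache line


class FlcDirectory:
    """Toy fully associative directory: any page may occupy any DRAM line."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.directory = OrderedDict()  # physical page -> DRAM line index
        self.hits = 0
        self.misses = 0

    def translate(self, phys_addr):
        """Map a processor physical address to a DRAM address."""
        page, offset = divmod(phys_addr, LINE_SIZE)
        if page in self.directory:
            self.hits += 1
            self.directory.move_to_end(page)  # most recently used
        else:
            self.misses += 1
            if len(self.directory) < self.num_lines:
                line = len(self.directory)    # next unused DRAM line
            else:
                # Assumed policy: reuse the least recently used line.
                _, line = self.directory.popitem(last=False)
            # A real controller would fill this line from the storage drive.
            self.directory[page] = line
        return self.directory[page] * LINE_SIZE + offset
```

Because any page can be placed in any line, a miss occurs only when the page is absent from the whole directory, never because of a conflict within a set.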
Each of the DRAM entries 00-XY may have, for example, 4 KB of storage capacity. Each of the drive entries 00-MN may also have 4 KB of storage granularity. If data is to be read from or written to one of the DRAM entries 00-XY and the one of the DRAM entries 00-XY is full and/or does not have all of the data associated with a request, a corresponding one of the drive entries 00-MN is accessed. Thus, the DRAM IC 1384 and the storage drive 1378 are divided up into memory blocks of 4 KB. Each block of memory in the DRAM IC 1384 may have a respective one or more blocks of memory in the storage drive 1378. This mapping and division of memory may be transparent to the processing devices 1272 of
During operation, one of the processing devices 1272 may generate a request signal for a block of data (or a portion of it). If a block of data is not located in the DRAM IC 1384, the FLC controller 1280 may access the block of data in the storage drive 1378. While the FLC controller 1280 is accessing the data from the storage drive 1378, the FLC controller 1280 may send an alert signal (such as a bus error signal) back to the processing device that requested the data. The alert signal may indicate that the FLC controller 1280 is in the process of accessing the data from a slow storage device and, as a result, the system bus 1274 is not ready for transfer of the data to the processing device 1272 for some time. If a bus error signal is used, the transmission of the bus error signal may be referred to as a “bus abort” from the FLC module 1276 to the processing device and/or SoC of the processing device 1272. The processing device 1272 may then perform other tasks while waiting for the FLC storage transaction to be ready. The other processor tasks may then proceed using data already stored in, for example, one or more caches (e.g., L0-L3 caches) in the SoC of the processing device and other data already stored in the FLC DRAM. This also minimizes idle time of a processor and/or processing device.
If sequential access is performed, the FLC controller 1280 and/or the DRAM controller 1282 may perform predictive fetching of data stored at addresses expected to be accessed in the future. This may occur during a boot up and/or subsequent to the boot up. The FLC controller 1280 and/or the DRAM controller 1282 may: track data and/or software usage; evaluate upcoming lines of code to be executed; track memory access patterns; and based on this information predict next addresses of data expected to be accessed. The next addresses may be addresses of the DRAM IC 1384 and/or the storage drive 1378. As an example, the FLC controller 1280 and/or the DRAM controller 1282, independent of and/or without previously receiving a request for data, may access the data stored in the storage drive 1378 and transfer the data to the DRAM IC 1384.
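One simple form of predictive fetching for sequential access can be sketched as a next-page prefetcher. The disclosure does not fix a prediction algorithm, so the rule used here (prefetch page N+1 once pages N-1 and N have been accessed back to back) and the name SequentialPrefetcher are assumptions for illustration only.

```python
class SequentialPrefetcher:
    """Toy next-page prefetcher for a cache of 4 KB pages."""

    def __init__(self):
        self.last_page = None
        self.resident = set()   # pages currently held in the DRAM cache
        self.prefetched = []    # pages fetched ahead of any request

    def access(self, page):
        """Record a demand access and prefetch the next page if sequential."""
        self.resident.add(page)
        sequential = self.last_page is not None and page == self.last_page + 1
        self.last_page = page
        if sequential and page + 1 not in self.resident:
            # Stands in for a transfer from the storage drive to the DRAM
            # that happens before the processor requests the data.
            self.resident.add(page + 1)
            self.prefetched.append(page + 1)
```

A run of accesses to pages 4, 5 and 6 triggers prefetches of pages 6 and 7, while an isolated jump to page 20 triggers none.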
The above-described examples may be implemented via servers in a network (may be referred to as a “cloud”). Each of the servers may include a FLC module (e.g., the FLC module 1276) and communicate with each other. The servers may share DRAM and/or memory stored in the DRAM ICs and the storage drives. Each of the servers may access the DRAMs and/or storage drives in other servers via the network. Each of the FLC modules may operate similar to the FLC module of
The above-described examples may also be implemented in a data access system including: a multi-chip module having multiple chips; a switch; and a primary chip having a primary FLC module. The multi-chip module is connected to the primary chip module via the switch. Each of the FLC modules may operate similar to the FLC module of
As an example, each of the secondary DRAMs in the multi-chip module and the primary DRAM in the primary chip may have 1 GB of storage capacity. A storage drive in the primary chip may have, for example, 64 GB of storage capacity. As another example, the data access system may be used in an automotive vehicle. The primary chip may be, for example, a central controller, a module, a processor, an engine control module, a transmission control module, and/or a hybrid control module. The primary chip may be used to control corresponding aspects of related systems, such as a throttle position, spark timing, fuel timing, transitions between transmission gears, etc. The secondary chips in the multi-chip module may each be associated with a particular vehicle system, such as a lighting system, an entertainment system, an air-conditioning system, an exhaust system, a navigation system, an audio system, a video system, a braking system, a steering system, etc. and used to control aspects of the corresponding systems.
As yet another example, the above-described examples may also be implemented in a data access system that includes a host (or SoC) and a hybrid drive. The host may include a central processor or other processing device and communicate with the hybrid drive via an interface. The interface may be, for example, a GE interface, a USB interface, a SATA interface, a PCIe interface, or other suitable interfaces. The hybrid drive includes a first storage drive and a second storage drive. The first storage drive includes an FLC module (e.g., the FLC module 1276 of
As a further example, the above-described examples may also be implemented in a storage system that includes a SoC, a first high speed DRAM cache (faster than the second DRAM cache), a second larger DRAM cache (larger than the first DRAM cache), and a non-volatile memory (storage drive). The SoC is separate from the first DRAM, the second DRAM and the non-volatile memory. The first DRAM may store high-priority and/or frequently accessed data. A high percentage of data access requests may be directed to data stored in the first DRAM. As an example, 99% or more of the data access requests may be directed to data stored in the first DRAM, 0.9% or less of the data access requests may be directed to data stored in the second DRAM, and less than 0.1% of the data access requests may be directed to the non-volatile memory (main memory partition in the storage drive). Low-priority and/or less frequently accessed data may be stored in the second DRAM and/or the non-volatile memory. As an example, a user may have multiple web browsers open which are stored in the first DRAM (high speed DRAM). The second DRAM, on the other hand, has a much higher capacity to store the numerous idle applications (such as idle web browser tabs) or applications that have low duty cycle operation. The second DRAM should therefore be optimized for low cost by using commodity DRAM. As such, it would only have commodity DRAM performance and would also exhibit longer latency than the first DRAM. Contents for the truly old applications that would not fit in the second DRAM would then be stored in the non-volatile memory. Moreover, only dirty cache line contents of the first and/or second DRAM could be written to the non-volatile memory prior to deep hibernation. Upon wakeup from deep hibernation, only the immediately needed contents would be brought back to the second and first FLC DRAM caches.
As a result, wakeup time from deep hibernation could be orders of magnitude faster than computers using traditional DRAM main memory solution.
The SoC may include one or more control modules, an interface module, a cache (or FLC) module, and a graphics module. The cache module may operate similar to the FLC module of
The first DRAM may have a first portion with a same or higher hierarchical level than the L3 cache, the L4 cache, and/or the highest-level cache. A second portion of the first DRAM may have a same or lower hierarchical level than the second DRAM and/or the non-volatile memory. The second DRAM may have a higher hierarchical level than the first DRAM. The non-volatile memory may have a same or higher hierarchical level than the second DRAM. The control modules may change hierarchical levels of portions or all of each of the first DRAM, the second DRAM, and/or the non-volatile memory based on, for example, caching needs.
The control modules, a graphics module connected to the interface module, and/or other devices (internal or external to the SoC) connected to the interface module may send request signals to the cache module to store and/or access data in the first DRAM, the second DRAM, and/or the non-volatile memory. The cache module may control access to the first DRAM, the second DRAM, and the non-volatile memory. As an example, the control modules, the graphics module, and/or other devices connected to the interface module may be unaware of the number and/or size of DRAMs that are connected to the SoC.
The cache module may convert the first processor physical addresses and/or requests received from the control modules, the graphics module, and/or other devices connected to the interface module to virtual addresses of the first DRAM and the second DRAM, and/or storage addresses of the non-volatile memory. The cache module may store one or more lookup tables (e.g., fully set associative lookup tables) for the conversion of the first processor physical addresses to the virtual addresses of the first and second DRAMs and/or conversion of the first processor physical addresses to storage addresses. As a result, the cache module and one or more of the first DRAM, the second DRAM, and the non-volatile memory (main memory partition of the storage drive) may operate as a single memory (main memory) relative to the control modules, the graphics module, and/or other devices connected to the interface module. The graphics module may control output of video data from the control modules and/or the SoC to a display and/or another video device.
The control modules may swap (or transfer) data, data sets, programs, and/or portions thereof between (i) the cache module, and (ii) the L1 cache, L2 cache, and L3 cache. The cache module may swap (or transfer) data, data sets, programs and/or portions thereof between two or more of the first DRAM, the second DRAM and the non-volatile memory. This may be performed independent of the control modules and/or without receiving control signals from the control modules to perform the transfer. The storage location of data, data sets, programs and/or portions thereof in one or more of the first DRAM, the second DRAM and the non-volatile memory may be based on the corresponding priority levels, frequency of use, frequency of access, and/or other parameters associated with the data, data sets, programs and/or portions thereof. The transferring of data, data sets, programs and/or portions thereof may include transferring blocks of data. Each of the blocks of data may have a predetermined size. As an example, a swap of data from the second DRAM to the first DRAM may include multiple transfer events, where each transfer event includes transferring a block of data (e.g., 4 KB of data).
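The block-wise transfer described above can be sketched as follows. This is only an illustrative model: the two DRAMs are represented as Python bytearrays, the 4 KB block size is the example value from the text, and the function name `transfer_in_blocks` is hypothetical.

```python
BLOCK_SIZE = 4 * 1024  # 4 KB block, the example size given in the text


def transfer_in_blocks(src, dst, offset, length, block_size=BLOCK_SIZE):
    """Copy `length` bytes from src to dst as a series of block transfer events.

    Returns the number of transfer events performed. Hypothetical model:
    real hardware would move the blocks over a memory bus, not between
    bytearrays, and would do so without involving the control modules.
    """
    events = 0
    pos = offset
    end = offset + length
    while pos < end:
        chunk = min(block_size, end - pos)   # last block may be partial
        dst[pos:pos + chunk] = src[pos:pos + chunk]
        pos += chunk
        events += 1
    return events


# Usage: moving 16 KB from the "second DRAM" to the "first DRAM".
second_dram = bytearray(range(256)) * 64     # 16 KB of sample data
first_dram = bytearray(len(second_dram))
events = transfer_in_blocks(second_dram, first_dram, 0, len(second_dram))
```

With these sample sizes the swap is carried out as four transfer events of 4 KB each.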
For best performance, the cache module of the first DRAM must be fully associative with large cache line sizes (FLC cache solution). However, for applications that could tolerate much higher miss rates, a set associative architecture could alternatively be used, but only for the first level DRAM cache. Even then, it would still have large cache line sizes to reduce the number of entries in the cache controller tables. As for the second level DRAM cache, a fully associative cache with large cache lines is used, as anything else may shorten the life of the non-volatile main memory.
The first DRAM may have a first predetermined amount of storage capacity (e.g., 0.25 GB, 0.5 GB, 1 GB, 4 GB or 8 GB). A 0.5 GB first DRAM is 512 times larger than a typical L2 cache. The second DRAM may have a second predetermined amount of storage capacity (e.g., 2-8 GB or more for non-server based systems or 16-64 GB or more for server based systems). The non-volatile memory may have a third predetermined amount of storage capacity (e.g., 16-256 GB or more). The non-volatile memory may include solid-state memory, such as flash memory or magneto-resistive random access memory (MRAM), and/or rotating magnetic media. The non-volatile memory may include a SSD and a HDD. Although the storage system has the second DRAM and the non-volatile memory (main memory partition of the storage drive), either of the second DRAM and the non-volatile memory may not be included in the storage system.
As a further example, the above-described examples may also be implemented in a storage system that includes a SoC and a DRAM IC. The SoC may include multiple control modules (or processors) that access the DRAM IC via a ring bus. The ring bus may be a bi-directional bus that minimizes access latencies. If cost is more important than performance, the ring bus may be a unidirectional bus. Intermediary devices may be located between the control modules and the ring bus and/or between the ring bus and the DRAM IC. For example, the above-described cache module may be located between the control modules and the ring bus or between the ring bus and the DRAM IC.
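The latency difference between a bi-directional and a unidirectional ring bus can be illustrated by counting hops between ring bus stops. The sketch below is an assumption-laden model: the stops are numbered 0 to `num_stops - 1` around the ring, each hop is assumed to cost the same time, and the function name `ring_hops` is hypothetical.

```python
def ring_hops(src, dst, num_stops, bidirectional=True):
    """Number of hops needed to move data from stop `src` to stop `dst`.

    On a unidirectional ring, data can only travel forward. On a
    bi-directional ring, data takes whichever direction is shorter.
    """
    forward = (dst - src) % num_stops
    if not bidirectional:
        return forward
    return min(forward, num_stops - forward)


# Usage: on an 8-stop ring, stop 7 is one hop behind stop 0.
hops_bidirectional = ring_hops(0, 7, 8)
hops_unidirectional = ring_hops(0, 7, 8, bidirectional=False)
```

In this model the worst case on a bi-directional ring is half the ring, while on a unidirectional ring it is the whole ring minus one stop, which is the cost-versus-latency tradeoff the text describes.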
The control modules may share the DRAM IC and/or have designated portions of the DRAM IC. For example, a first portion of the DRAM IC may be allocated as cache for the first control module. A second portion of the DRAM IC may be allocated as cache for the second control module. A third portion of the DRAM IC may be allocated as cache for the third control module. A fourth portion of the DRAM IC may not be allocated as cache.
As a further example, the above-described examples may also be implemented in a server system. The server system may be referred to as a storage system and include multiple servers. The servers include respective storage systems, which are in communication with each other via a network (or cloud). One or more of the storage systems may be located in the cloud. Each of the storage systems may include respective SoCs.
The SoCs may have respective first DRAMs, second DRAMs, solid-state non-volatile memories, non-volatile memories and I/O ports. The I/O ports may be in communication with the cloud via respective I/O channels, such as peripheral component interconnect express (PCIe) channels, and respective network interfaces, such as peripheral component interconnect express (PCIe) channels. The I/O ports, I/O channels, and network interfaces may be Ethernet ports, channels and network interfaces and transfer data at predetermined speeds (e.g., 1 gigabit per second (Gb/s), 10 Gb/s, 50 Gb/s, etc.). Some of the network interfaces may be located in the cloud. The connection of multiple storage systems provides a low-cost, distributed, and scalable server system. Multiples of the disclosed storage systems and/or server systems may be in communication with each other and be included in a network (or cloud).
The solid-state non-volatile memories may each include, for example, NAND flash memory and/or other solid-state memory. The non-volatile memories may each include solid-state memory and/or rotating magnetic media. The non-volatile memories may each include a SSD and/or a HDD.
The architecture of the server system provides DRAMs as caches. The DRAMs may be allocated as L4 and/or highest level caches for the respective SoCs and have high bandwidth and large storage capacity. The stacked memory may include, for example, DDR memory, low power double data rate type four (LPDDR4) memory, wide-I/O2 memory, HMC memory, IPM (in package memory) and/or any other type of memory now in existence or developed in the future. The stacked memory may be DRAM or other memory types. Each of the SoCs may have one or more control modules. The control modules communicate with the corresponding DRAMs via respective ring buses. The ring buses may be bi-directional buses. This provides high bandwidth and minimal latency between the control modules and the corresponding DRAMs.
Each of the control modules may access data and/or programs stored: in control modules of the same or different SoC; in any of the DRAMs; in any of the solid-state non-volatile memories; and/or in any of the non-volatile memories.
The SoCs and/or ports of the SoCs may have medium access controller (MAC) addresses. The control modules (or processors) of the SoCs may have respective processor cluster addresses. Each of the control modules may access other control modules in the same SoC or in another SoC using the corresponding MAC address and processor cluster address. Each of the control modules of the SoCs may access the DRAMs. A control module of a first SoC may request data and/or programs stored in a DRAM connected to a second SoC by sending a request signal having the MAC address of the second SoC and the processor cluster address of a second control module in the second SoC.
Each of the SoCs and/or the control modules in the SoCs may store one or more address translation tables. The address translation tables may include and/or provide translations for: MAC addresses of the SoCs; processor cluster addresses of the control modules; processor physical addresses of memory cells in the DRAMs, the solid-state non-volatile memories, and the non-volatile memories; and/or physical block addresses of memory cells in the DRAMs, the solid-state non-volatile memories, and the non-volatile memories. In one embodiment, the DRAM controller generates DRAM row and column address bits from a virtual address.
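As a rough illustration of how row and column address bits might be generated from a virtual address, the sketch below simply slices the address into two bit fields. The field widths (10 column bits, 16 row bits) and the function name `split_dram_address` are assumptions made for the example; the text does not specify a bit layout, and an actual DRAM controller would typically also derive bank and rank bits.

```python
def split_dram_address(virtual_addr, column_bits=10, row_bits=16):
    """Split a virtual address into (row, column) bit fields.

    Assumed layout, lowest bits first: [column_bits][row_bits].
    Bits above the row field are ignored in this simplified model.
    """
    column = virtual_addr & ((1 << column_bits) - 1)
    row = (virtual_addr >> column_bits) & ((1 << row_bits) - 1)
    return row, column


# Usage: 0x12345 has 0x345 in its low 10 bits and 0x48 in the next field.
row, column = split_dram_address(0x12345)
```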
As an example, data and programs may be stored in the solid-state non-volatile memories and/or the non-volatile memories. The data and programs and/or portions thereof may be distributed over the network to the SoCs and control modules. Programs and/or data needed for execution by a control module may be stored locally in the DRAMs, a solid-state non-volatile memory, and/or a non-volatile memory of the SoC in which the control module is located. The control module may then access and transfer the programs and/or data needed for execution from the DRAMs, the solid-state non-volatile memory, and/or the non-volatile memory to caches in the control module. Communication between the SoCs and the network and/or between the SoCs may include wireless communication.
As a further example, the above-described examples may also be implemented in a server system that includes SoCs. Some of the SoCs may be incorporated in respective servers and may be referred to as server SoCs. Some of the SoCs (referred to as companion SoCs) may be incorporated in a server of a first SoC or may be separate from the server of the first SoC. The server SoCs include respective: clusters of control modules (e.g., central processing modules); intra-cluster ring buses, FLC modules, memory control modules, FLC ring buses, and one or more hopping buses. The hopping buses extend (i) between the server SoCs and the companion SoCs via inter-chip bus members and corresponding ports and (ii) through the companion SoCs. A hopping bus may refer to a bus extending to and from hopping bus stops, adaptors, or nodes and corresponding ports of one or more SoCs. A hopping bus may extend through the hopping bus stops and/or the one or more SoCs. A single transfer of data to or from a hopping bus stop may be referred to as a single hop. Multiple hops may be performed when transferring data between a transmitting device and a receiving device. Data may travel between bus stops each clock cycle until the data reaches a destination. Each bus stop disclosed herein may be implemented as a module and include logic to transfer data between devices based on a clock signal. Also, each bus disclosed herein may have any number of channels for the serial and/or parallel transmission of data.
Each of the clusters of control modules has a corresponding one of the intra-cluster ring buses. The intra-cluster ring buses are bi-directional and provide communication between the control modules in each of the clusters. The intra-cluster ring buses may have ring bus stops for access by the control modules to data signals transmitted on the intra-cluster ring buses. The ring bus stops may perform as signal repeaters and/or access nodes. The control modules may be connected to and access the intra-cluster ring buses via the ring bus stops. Data may be transmitted around the intra-cluster ring buses from a first control module at a first one of the ring bus stops to a second control module at a second one of the ring bus stops. Each of the control modules may be a central processing unit or processor.
Each of the memory control modules may control access to the respective one of the FLC modules. The FLC modules may be stacked on the server SoCs. Each of the FLC modules includes a FLC (or DRAM) and may be implemented as and operate similar to any of the FLC modules disclosed herein. The memory control modules may access the FLC ring buses at respective ring bus stops on the FLC ring buses and transfer data between the ring bus stops and the FLC modules. Alternatively, the FLC modules may directly access the FLC ring buses at respective ring bus stops. Each of the memory control modules may include memory clocks that generate memory clock signals for a respective one of the FLC modules and/or for the bus stops of the ring buses and/or the hopping buses. The bus stops may receive the memory clock signals indirectly via the ring buses and/or the hopping buses or directly from the memory control modules. Data may be cycled through the bus stops based on the memory clock signal.
The FLC ring buses may be bi-directional buses and have two types of ring bus stops, SRB and SRH. Each of the ring bus stops may perform as a signal repeater and/or as an access node. The ring bus stops SRB are connected to devices other than hopping buses. The devices may include: an inter-cluster ring bus; the FLC modules and/or memory control modules; and graphics processing modules. The inter-cluster ring bus provides connections (i) between the clusters, and (ii) between intersection ring bus stops. The intersection ring bus stops provide access to and may connect the inter-cluster ring bus to ring bus extensions that extend between (i) the clusters and (ii) ring bus stops. The ring bus stops are on the FLC ring buses. The inter-cluster ring bus and the intersection ring bus stops provide connections (iii) between the first cluster and the ring bus stop of the second FLC ring bus, and (iv) between the second cluster and the ring bus stop of the first FLC ring bus. This allows the control modules of the first cluster to access the FLC of the second FLC module and the control modules of the second cluster to access the FLC of the first FLC module.
The inter-cluster ring bus may include intra-chip traces and inter-chip traces. The intra-chip traces extend internal to the server SoCs and between (i) one of the ring bus stops and (ii) one of the ports. The inter-chip traces extend external to the server SoCs and between respective pairs of the ports.
The ring bus stops SRH of each of the server SoCs are connected to corresponding ones of the FLC ring buses and hopping buses. Each of the hopping buses has multiple hopping bus stops SHB, which provide respective interfaces access to a corresponding one of the hopping buses. The hopping bus stops SHB may perform as signal repeaters and/or as access nodes.
The first hopping bus, a ring bus stop, and first hopping bus stops provide connections between (i) the FLC ring bus and (ii) a liquid crystal display (LCD) interface in the server SoC and interfaces of the companion SoCs. The LCD interface may be connected to a display and may be controlled via the graphics processing module (GPM). The interfaces of the companion SoC include a serial attached small computer system interface (SAS) interface and a PCIe interface. The interfaces of the companion SoC may be image processor (IP) interfaces.
The interfaces are connected to respective ports, which may be connected to devices, such as peripheral devices. The SAS interface and the PCIe interface may be connected respectively to a SAS compatible device and PCIe compatible device via the ports. As an example, a storage drive may be connected to the port. The storage drive may be a hard disk drive, a solid-state drive, or a hybrid drive. The ports may be connected to image processing devices. Examples of image processing devices are disclosed above. The fourth SoC may be daisy chained to the third SoC via the inter-chip bus member (also referred to as a daisy chain member). The inter-chip bus member is a member of the first hopping bus. Additional SoCs may be daisy chained to the fourth SoC via a port, which is connected to the first hopping bus. The server SoC, the control modules, and the FLC module may communicate with the fourth SoC via the FLC ring bus, the first hopping bus and/or the third SoC. As an example, the SoCs may be southbridge chips and control communication and transfer of interrupts between (i) the server SoC and (ii) peripheral devices connected to the ports.
The second hopping bus provides connections, via a ring bus stop and second hopping bus stops, between (i) the FLC ring bus and (ii) interfaces in the server SoC. The interfaces in the server SoC may include an Ethernet interface, one or more PCIe interfaces, and a hybrid (or combination) interface. The Ethernet interface may be a 10GE interface and is connected to a network via a first Ethernet bus. The Ethernet interface may communicate with the second SoC via the first Ethernet bus, the network and a second Ethernet bus. The network may be an Ethernet network, a cloud network, and/or other Ethernet compatible network. The one or more PCIe interfaces may include as examples a third generation PCIe interface PCIe3 and a mini PCIe interface (mPCIe). The PCIe interfaces may be connected to solid-state drives. The hybrid interface may be SATA and PCIe compatible to transfer data according to SATA and/or PCIe protocols to and from SATA compatible devices and/or PCIe compatible devices. As an example, the PCIe interface may be connected to a storage drive, such as a solid-state drive or a hybrid drive. The interfaces have respective ports for connection to devices external to the server SoC.
The third hopping bus may be connected to the ring bus via a ring bus stop and may be connected to a LCD interface and a port via a hopping bus stop. The LCD interface may be connected to a display and may be controlled via the GPM. The port may be connected to one or more companion SoCs. The fourth hopping bus may be connected to (i) the ring bus via a ring bus stop, and (ii) interfaces via hopping bus stops. The interfaces may be Ethernet, PCIe and hybrid interfaces. The interfaces have respective ports.
The server SoCs and/or other server SoCs may communicate with each other via the inter-cluster ring bus. The server SoCs and/or other server SoCs may communicate with each other via respective Ethernet interfaces and the network.
The companion SoCs may include respective control modules. The control modules may access and/or control access to the interfaces via the hopping bus stops. In one embodiment, the control modules are not included. The control modules may be connected to and in communication with the corresponding ones of the hopping bus stops and/or the corresponding ones of the interfaces.
As a further example, the above-described examples may also be implemented in a circuit of a mobile device. The mobile device may be a computer, a cellular phone, or another wireless network device. The circuit includes SoCs. One of the SoCs may be referred to as a mobile SoC, and another may be referred to as a companion SoC. The mobile SoC includes: a cluster of control modules; an intra-cluster ring bus; a FLC module; a memory control module; a FLC ring bus; and one or more hopping buses. The hopping bus extends (i) between the mobile SoC and the companion SoC via an inter-chip bus member and corresponding ports and (ii) through the companion SoC.
The intra-cluster ring bus is bi-directional and provides communication between the control modules. The intra-cluster ring bus may have ring bus stops for access by the control modules to data signals transmitted on the intra-cluster ring bus. The ring bus stops may perform as signal repeaters and/or access nodes. The control modules may be connected to and access the intra-cluster ring bus via the ring bus stops. Data may be transmitted around the intra-cluster ring bus from a first control module at a first one of the ring bus stops to a second control module at a second one of the ring bus stops. Data may travel between bus stops each clock cycle until the data reaches a destination. Each of the control modules may be a central processing unit or processor.
The memory control module may control access to the FLC module. In one embodiment, the memory control module is not included. The FLC module may be stacked on the mobile SoC. The FLC module may include a FLC (or DRAM) and may be implemented as and operate similar to any of the FLC modules disclosed herein. The memory control module may access the FLC ring bus at a respective ring bus stop on the FLC ring bus and transfer data between the ring bus stop and the FLC module. Alternatively, the FLC module may directly access the FLC ring bus at a respective ring bus stop. The memory control module may include a memory clock that generates a memory clock signal for the FLC module, the bus stops of the ring bus and/or the hopping buses. The bus stops may receive the memory clock signal indirectly via the ring bus and/or the hopping buses or directly from the memory control module. Data may be cycled through the bus stops based on the memory clock signal.
The FLC ring bus may be a bi-directional bus and have two types of ring bus stops SRB and SRH. Each of the ring bus stops may perform as a signal repeater and/or as an access node. The ring bus stops SRB are connected to devices other than hopping buses. The devices may include: the cluster; the FLC module and/or the memory control module; and a graphics processing module.
The ring bus stops SRH of the mobile SoC are connected to the FLC ring bus and a corresponding one of the hopping buses. Each of the hopping buses has multiple hopping bus stops SHB, which provide respective interfaces access to a corresponding one of the hopping buses. The hopping bus stops SHB may perform as signal repeaters and/or as access nodes.
The first hopping bus, a ring bus stop, and first hopping bus stops are connected between (i) the FLC ring bus and (ii) a liquid crystal display (LCD) interface, a video processing module (VPM), and interfaces of the companion SoC. The LCD interface is in the mobile SoC and may be connected to a display and may be controlled via the GPM. The interfaces of the companion SoC include a cellular interface, a wireless local area network (WLAN) interface, and an image signal processor (ISP) interface. The cellular interface may include a physical layer device for wireless communication with other mobile and/or wireless devices. The physical layer device may operate and/or transmit and receive signals according to long-term evolution (LTE) standards and/or third generation (3G), fourth generation (4G), and/or fifth generation (5G) mobile telecommunication standards. The WLAN interface may operate according to Bluetooth®, Wi-Fi®, and/or other WLAN protocols and communicate with other network devices in a WLAN of the mobile device. The ISP interface may be connected to image processing devices (or image signal processing devices) external to the companion SoC, such as a storage drive or other image processing device. The interfaces may be connected to devices external to the companion SoC via respective ports. The ISP interface may be connected to devices external to the mobile device.
The companion SoC may be connected to the mobile SoC via the inter-chip bus member. The inter-chip bus member is a member of the first hopping bus. Additional SoCs may be daisy chained to the companion SoC via a port, which is connected to the first hopping bus. The mobile SoC, the control modules, and the FLC module may communicate with the companion SoC via the FLC ring bus and the first hopping bus.
The second hopping bus provides connections via a ring bus stop and second hopping bus stops between (i) the FLC ring bus and (ii) interfaces in the mobile SoC. The interfaces in the mobile SoC may include an Ethernet interface, one or more PCIe interfaces, and a hybrid (or combination) interface. The Ethernet interface may be a 10GE interface and is connected to an Ethernet network via a port. The one or more PCIe interfaces may include as examples a third generation PCIe interface PCIe3 and a mini PCIe interface (mPCIe). The PCIe interfaces may be connected to solid-state drives. The hybrid interface may be SATA and PCIe compatible to transfer data according to SATA and/or PCIe protocols to and from SATA compatible devices and/or PCIe compatible devices. As an example, the PCIe interface may be connected to a storage drive via a port. The storage drive may be a solid-state drive or a hybrid drive. The interfaces have respective ports for connection to devices external to the mobile SoC.
The companion SoC may include a control module. The control module may access and/or control access to the VPM and the interfaces via the hopping bus stops. In one embodiment, the control module is not included. The control module may be connected to and in communication with the hopping bus stops, the VPM, and/or the interfaces.
In this example embodiment, a cache line size of 4 KBytes is selected. In other embodiments, other cache line sizes may be utilized. One benefit of using a cache line of this size is that it matches the memory page size that the operating system typically assigns, as the smallest memory allocation size, to an application or program. As a result, the 4 KByte cache line size aligns with the operating system's memory allocation size.
A processor typically reads or writes only 64 Bytes at a time. The FLC cache line size is much larger, 4 KBytes in this example. As a result, when a write or read request results in a miss at an FLC module, the system first reads a complete 4 KByte cache line from the storage drive (i.e., the final level of main memory in the storage drive partition). After that occurs, the system can write the processor data to the retrieved cache line, and this cache line is stored in a DRAM. Cache lines are identified by virtual addresses. Entire cache lines are pulled from memory at a time. Further, the entire cache line is forwarded, such as from the FLC-SS module to the FLC-HS module. There could be 100,000 or even 1 million or more cache lines in an operational system.
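The miss behavior described above (fetch the whole 4 KByte line first, then apply the 64 Byte processor write to it) can be sketched with a toy model. The class name `FlcLineStore`, the use of a dict for the DRAM and a bytearray for the storage drive partition, and the assumption that an access never crosses a cache line boundary are all illustrative choices, not details from the text.

```python
LINE_SIZE = 4 * 1024   # FLC cache line size in this example embodiment
CPU_ACCESS = 64        # bytes a processor typically reads or writes at a time


class FlcLineStore:
    """Toy model: cache lines held in a dict (DRAM), backed by a bytearray (drive)."""

    def __init__(self, storage):
        self.storage = storage
        self.lines = {}        # line index -> bytearray of LINE_SIZE bytes
        self.line_fills = 0    # number of complete lines read from the drive

    def _line_for(self, addr):
        index = addr // LINE_SIZE
        if index not in self.lines:                  # miss: fetch the entire line
            start = index * LINE_SIZE
            self.lines[index] = bytearray(self.storage[start:start + LINE_SIZE])
            self.line_fills += 1
        return self.lines[index]

    def write(self, addr, data):
        offset = addr % LINE_SIZE
        # Simplifying assumption: the access stays within one cache line.
        assert len(data) <= CPU_ACCESS and offset + len(data) <= LINE_SIZE
        line = self._line_for(addr)                  # line is fetched before the write
        line[offset:offset + len(data)] = data       # then the processor data is merged

    def read(self, addr, size=CPU_ACCESS):
        offset = addr % LINE_SIZE
        assert offset + size <= LINE_SIZE
        return bytes(self._line_for(addr)[offset:offset + size])


# Usage: one 64-byte write to an uncached line triggers one 4 KB line fill.
drive = bytearray(b"\xaa" * (2 * LINE_SIZE))
flc = FlcLineStore(drive)
flc.write(LINE_SIZE + 128, b"\x01" * CPU_ACCESS)
```

After the write, the rest of the line still holds the data read from the drive, and later accesses to the same line are served without another fill. Write-back of modified lines to the drive is left out of this sketch.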
Comparing the FLC module caching to the CPU cache, these elements are separate and distinct caches. The CPU cache (processor cache) is part of the processor device as shown and is configured as in the prior art. The FLC modules act as cache, serve as the main memory, and are separate and distinct from the CPU caches. The FLC module cache tracks all the data that is likely to be needed over several minutes of operation, much as a main memory and associated controller would. However, the CPU cache only tracks and stores what the processor needs or will use in the next few microseconds or perhaps a millisecond.
Fully associative look up enables massive numbers of truly random processor tasks/threads to semi-permanently (when measured in seconds to minutes of time) reside in the FLC caches. This is a fundamental feature, as the thousands of tasks or threads that the processors are working on could otherwise easily thrash (disrupt) the numerous tasks/threads that are supposed to be kept in the FLC caches. Fully associative look up is, however, costly in terms of silicon area, power, or both. Therefore, it is also important that the FLC cache line sizes are maximized to minimize the number of entries in the fully associative look up tables. In fact, the FLC cache line size should be much bigger than the CPU cache line size, which is currently 64 B. At the same time, the cache line sizes should not be too big, as that would cause undue hardship to the operating system (OS). Since a modern OS typically uses a 4 KB page size, the FLC cache line size is, in one example embodiment, set at 4 KB. If, in the future, the OS page size is increased to, say, 16 KB, then the FLC cache line size could theoretically be made 16 KB as well.
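The effect of cache line size on the number of look up table entries is simple arithmetic: one entry per cache line. The sketch below works this out for a 0.5 GB DRAM, one of the example capacities mentioned earlier, taking 1 GB as 2^30 bytes (an assumption about the unit). The function name `lookup_entries` is hypothetical.

```python
GB = 1 << 30  # assumption: GB is treated as 2**30 bytes


def lookup_entries(dram_bytes, line_bytes):
    """Number of look up table entries needed: one per cache line."""
    return dram_bytes // line_bytes


dram = GB // 2                                  # 0.5 GB first DRAM
entries_64b = lookup_entries(dram, 64)          # CPU-sized 64 B lines
entries_4kb = lookup_entries(dram, 4 * 1024)    # 4 KB FLC lines
entries_16kb = lookup_entries(dram, 16 * 1024)  # possible future 16 KB lines
```

Under these assumptions, 64 B lines would need 8,388,608 entries, 4 KB lines need 131,072 entries (64 times fewer), and 16 KB lines would need 32,768 entries.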
In order to hide the energy cost of the fully associative address look up process, in one embodiment, an address cache for the address translation table is included in the FLC controller. It is important to note that the address cache does not cache any processor data. Instead, it caches only the most recently seen address translations and the translations of physical addresses to virtual addresses. As such, the optional address cache does not have to be fully associative. A simple set associative cache is sufficient for the address cache, as even a 5% miss rate would already reduce the number of fully associative look ups by a factor of twenty. The address cache would additionally result in lower address translation latency, as a simple set associative cache could typically translate an address in one clock cycle. This is approximately ten to twenty times faster than the fastest multi-stage hashing algorithm that could perform the CAM-like address translation operation.
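A minimal sketch of such an address cache is shown below. It places a small set associative cache in front of a plain dict that stands in for the fully associative translation table, and counts how often the full look up is actually needed. The class name `AddressCache`, the set and way counts, the modulo set index, and the least-recently-used replacement policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class AddressCache:
    """Set associative cache of physical-to-virtual cache line translations."""

    def __init__(self, full_table, num_sets=64, ways=4):
        self.full_table = full_table   # stands in for the fully associative table
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.full_lookups = 0          # times the expensive look up was needed

    def translate(self, phys_line):
        entries = self.sets[phys_line % self.num_sets]
        if phys_line in entries:                 # address cache hit
            entries.move_to_end(phys_line)       # mark as most recently used
            return entries[phys_line]
        self.full_lookups += 1                   # miss: do the full look up
        virt_line = self.full_table[phys_line]
        entries[phys_line] = virt_line
        if len(entries) > self.ways:
            entries.popitem(last=False)          # evict the least recently used
        return virt_line


# Usage: 80 translations over 4 hot lines need only 4 full look ups.
table = {p: p + 1000 for p in range(16)}
cache = AddressCache(table, num_sets=4, ways=2)
for _ in range(20):
    for p in (0, 1, 2, 3):
        cache.translate(p)
```

In this run the miss rate is 4 out of 80, or 5%, so the full look up is performed twenty times less often than the number of translations, matching the ratio given in the text.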
The storage drive 1378 may be a traditional non-volatile storage device, such as a magnetic disk drive, solid state drive, hybrid drive, optical drive or any other type of storage device. The DRAM associated with the FLC modules, as well as a partitioned portion of the storage drive, serves as main memory. In the embodiment disclosed herein, the amount of DRAM is less than in a traditional prior art computing system. This provides the benefits of less power consumption, lower system cost, and reduced space requirements. In the event additional main memory is required for system operation, a portion of the storage drive 1378 is allocated or partitioned (reserved) for use as additional main memory. The storage drive 1378 is understood to have a storage drive controller and the storage drive controller will process requests from the processing device 1500 (
This method starts at a step 1408 where the system may be initialized. At a step 1412, the FLC controller receives a read or write request from the processing device (processor). The request includes a physical address that the processor uses to identify the location of the data or where the data is to be written.
At a decision step 1416, a determination is made whether the physical address provided by the processor is located in the FLC controller. The memory (SRAM) of the FLC controller stores physical to virtual address map data. The physical address being located in the FLC controller is designated as a hit, while the physical address not being located in the FLC controller is designated as a miss. The processor's request for data (with physical address) can only be satisfied by the FLC module if the FLC controller has the physical address entry in its memory. If the physical address is not stored in the memory of the FLC controller, then the request must be forwarded to the storage drive.
If, at decision step 1416 the physical address is identified in the FLC controller, then the request is considered a hit and the operation advances to a step 1420. At step 1420 the FLC controller translates the physical address to a virtual address based on a look-up operation using a look-up table stored in a memory of the FLC controller or memory that is part of the DRAM that is allocated for use by the FLC controller. The virtual address may be associated with a physical address in the FLC DRAM. The FLC controller may include one or more translation mapping tables for mapping physical addresses (from the processor) to virtual addresses.
After translation of the physical address to a virtual address, the operation advances to a decision step 1424. If at decision step 1416, the physical address is not located in the FLC controller, a miss has occurred and the operation advances to step 1428. At step 1428, the FLC controller allocates a new (in this case empty) cache line in the FLC controller for the data to be read or written and which is not already in the FLC module (i.e., the DRAM of the FLC module). An existing cache line could be overwritten if space is not otherwise available. Step 1428 includes updating the memory mapping to include the physical address provided by the processor, thereby establishing the FLC controller as having that physical address. Next, at a step 1432 the physical address is translated to a storage drive address, which is an address used by the storage drive to retrieve the data. In this embodiment, the FLC controller performs this step but in other embodiments other devices, such as the storage drive, may perform the translation. The storage drive address is an address that is used by or understood by the storage drive. In one embodiment, the storage drive address is a PCI-e address.
At a step 1436, the FLC controller forwards the storage address to the storage drive, for example, a PCI-e based device, an NVMe (non-volatile memory express) type device, a SATA SSD device, or any other storage drive now known or developed in the future. As discussed above, the storage drive may be a traditional hard disk drive, SSD, or hybrid drive and a portion of the storage drive is used in the traditional sense to store files, such as documents, images, videos, or the like. A portion of the storage drive is also used and partitioned as main memory to supplement the storage capacity provided by the DRAM of the FLC module(s).
Advancing to a step 1440, the storage drive controller (not shown) retrieves the cache line, at the physical address provided by the processor, from the storage drive and the cache line is provided to the FLC controller. The cache line, identified by the cache line address, stores the requested data or is designated to be the location where the data is written. This may occur in a manner that is known in the art. At a step 1444, the FLC controller writes the cache line to the FLC DRAM and it is associated with the physical address, such that this association is maintained in the look-up table in the FLC controller.
Also part of step 1444 is an update to the FLC status register to designate the cache line or data as most recently used. The FLC status register, which may be stored in DRAM or a separate register, is a register that tracks when a cache line or data in the FLC DRAM was last used, accessed or written by the processor. As part of the cache mechanism, recently used cache lines are maintained in the cache so that recently used data is readily available for the processor again when requested. Cache lines that are least recently used, accessed or written to by the processor are overwritten to make room for more recently used cache lines/data. In this arrangement, the cache operates on a least recently used, first out basis. After step 1444, the operation advances to step 1424.
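The least recently used, first out behavior of the status register can be sketched as follows. This is a minimal Python illustration that uses an ordered dictionary in place of a hardware status register; the class name, method name, and capacity parameter are hypothetical.

```python
from collections import OrderedDict


class LruStatus:
    """Tracks recency of cache lines; the oldest entry is overwritten first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()    # physical address -> cache line data

    def touch(self, phys_addr, data=None):
        """Mark a line most recently used, inserting it if absent.

        Returns the physical address of the evicted line, or None.
        """
        evicted = None
        if phys_addr in self.lines:
            self.lines.move_to_end(phys_addr)
            if data is not None:
                self.lines[phys_addr] = data
        else:
            if len(self.lines) >= self.capacity:
                # No free space: overwrite the least recently used line.
                evicted, _ = self.lines.popitem(last=False)
            self.lines[phys_addr] = data
        return evicted
```

In this sketch, a line that is read or written again moves to the most recently used position, so it is the last candidate to be overwritten.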
At decision step 1424 the request from the processor is evaluated as a read request or a write request. If the request is a write request, the operation advances to step 1448 and the write request is sent with the virtual address to the FLC DRAM controller. As shown in
Alternatively, if at decision step 1424 it is determined that the request from the processor is a read request, then the operation advances to step 1464 and the FLC controller sends the read request with the virtual address to the FLC DRAM controller for processing by the DRAM controller. Then at step 1468, the DRAM controller generates DRAM row and column address bits from the virtual address, which are used at a step 1472 to read (retrieve) the data from the FLC DRAM so that data can be provided to the processor. At a step 1476, the data retrieved from FLC DRAM is provided to the processor to satisfy the processor read request. Then, at a step 1480, the FLC controller updates the FLC status register for the data (address) to reflect the recent use of the data that was read from the FLC DRAM. Because the physical address is mapped into the FLC controller memory mapping, the FLC controller maintains the physical address in the memory mapping as readily available if again requested by the processor.
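The read path on a hit (translation, row and column generation, and data retrieval) might be sketched as below. The bit widths, the function names, and the dictionaries standing in for the look-up table and the FLC DRAM are assumptions chosen for illustration; the disclosure does not specify how the virtual address is split into row and column bits.

```python
ROW_BITS = 14    # assumed row address width
COL_BITS = 10    # assumed column address width


def split_row_col(virt_addr):
    """Derive DRAM row and column address bits from a virtual address."""
    col = virt_addr & ((1 << COL_BITS) - 1)
    row = (virt_addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    return row, col


def read_hit(lookup_table, dram, phys_addr):
    """Serve a read that hits: translate, derive row/column, fetch the data."""
    virt_addr = lookup_table[phys_addr]     # physical-to-virtual translation
    row, col = split_row_col(virt_addr)     # work done by the DRAM controller
    return dram[(row, col)]                 # data returned to the processor
```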
The above-described tasks of
As discussed above, status registers maintain the states of cache lines which are stored in the FLC module. It is contemplated that several aspects regarding cache lines and the data stored in cache lines may be tracked. One such aspect is the relative importance of the different cache lines in relation to pre-set criteria or in relation to other cache lines. In one embodiment, the most recently accessed cache lines would be marked or defined as most important while least recently used cache lines are marked or defined as least important. The cache lines that are marked as the least important, such as for example, least recently used, would then be eligible for being kicked out of the FLC or overwritten to allow new cache lines to be created in FLC or new data to be stored. The steps used for this task are understood by one of ordinary skill in the art and thus not described in detail herein. However, unlike traditional CPU cache controllers, an FLC controller would additionally track cache lines that had been written by CPU/GPU. This occurs so that the FLC controller does not accidentally write to the storage drive, such as an SSD, when a cache line that had only been used for reading is eventually purged out of FLC. In this scenario, the FLC controller marks an FLC cache line that has been written as “dirty”.
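The role of the dirty marking can be illustrated with a short Python sketch. The class and attribute names are hypothetical, and a list stands in for writes to the storage drive; the point is only that a purge writes back a line when, and only when, that line was written by the processor.

```python
class DirtyTrackingCache:
    """Cache that writes back to storage only lines marked dirty (illustrative)."""

    def __init__(self):
        self.lines = {}             # physical address -> data
        self.dirty = set()          # addresses written by the CPU/GPU
        self.storage_writes = []    # stand-in for writes to the storage drive

    def read_fill(self, phys_addr, data):
        """Install a line fetched for reading; it stays clean."""
        self.lines[phys_addr] = data

    def write(self, phys_addr, data):
        """Processor write: update the line and mark it dirty."""
        self.lines[phys_addr] = data
        self.dirty.add(phys_addr)

    def purge(self, phys_addr):
        """Remove a line, writing it to storage only if it is dirty."""
        data = self.lines.pop(phys_addr)
        if phys_addr in self.dirty:
            self.dirty.discard(phys_addr)
            self.storage_writes.append((phys_addr, data))
```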
In one embodiment, certain cache lines may be designated as locked FLC cache lines. Certain cache lines in FLC could be locked to prevent accidental purging of such cache lines out of FLC. This may be particularly important for keeping the addresses of data in the FLC controller when such addresses/data cannot tolerate a delay for retrieval, and thus will be locked and maintained in FLC, even if least recently used.
It is also contemplated that a time out timer for locked cache lines may be implemented. In this configuration, a cache line may be locked, but only for a certain period of time as tracked by a timer. The timer may reset after a period of time from lock creation or after use of the cache line. The amount of time may vary based on the cache line, the data stored in the cache line, or the application or program assigned to the cache line.
Additionally, it is contemplated a time out bit is provided to a locked cache line for the following purposes: to allow locked cache lines to be purged out of FLC after a very long period of inactivity or to allow locked cache lines to be eventually purged to the next stage or level of FLC module and at the same time inherit the locked status bit in the next FLC stage to minimize the time penalty for cache line/data retrieval resulting from the previously locked cache line being purged from the high speed FLC module.
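A locked cache line with a time out can be sketched as follows. This Python example is only an illustration under stated assumptions: time is passed in explicitly as a number, the class and function names are hypothetical, and the inheritance of the locked status bit by the next FLC stage is not modeled.

```python
class LockTable:
    """Tracks locked cache lines and when each lock lapses (illustrative)."""

    def __init__(self):
        self.expiry = {}    # physical address -> time at which the lock lapses

    def lock(self, phys_addr, now, duration):
        self.expiry[phys_addr] = now + duration

    def is_locked(self, phys_addr, now):
        deadline = self.expiry.get(phys_addr)
        if deadline is None:
            return False
        if now >= deadline:
            del self.expiry[phys_addr]    # the lock has timed out
            return False
        return True


def pick_victim(lru_order, locks, now):
    """Return the least recently used address that is not locked, or None.

    lru_order lists addresses from least to most recently used.
    """
    for addr in lru_order:
        if not locks.is_locked(addr, now):
            return addr
    return None
```

While the lock is active, the locked line is skipped even if it is the least recently used; once the timer lapses it becomes eligible for purging like any other line.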
Also part of the embodiment of
In embodiments with an FLC as shown in
Although the main memory partition of the storage drive 1578 is slower than RAM for I/O operation, the hit rate for the FLC modules is so high, such as 99% or higher, that I/O to the main memory partition in the storage drive rarely occurs and thus does not degrade performance. This discussion of the storage drive 1578 and its main memory partition applies to storage drives shown in the other figures. In all embodiments shown and described, the contents of the main memory partition of the storage drive may be encrypted. Encryption may occur to prevent viewing of personal information, Internet history, passwords, documents, emails, and images that are stored in the main memory partition of storage drive 1578 (which is non-volatile). With encryption, should the computing device ever be discarded, recycled, or lost, this sensitive information could not be read. Unlike the RAM, which does not maintain the data stored in it when powered down, the storage drive will maintain the data even upon a power down event.
As shown in
It is noted that both the FLC-HS controller and the DRAM-HS are optimized for low power consumption, high bandwidth, and low latency (high speed). Thus, both elements provide the benefits described above. On the other hand, both the FLC-SS controller and the DRAM-SS are optimized for lower cost. In one configuration, the look-up tables of the FLC-HS controller are located in the FLC-HS controller and utilize SRAM or other high speed/lower power memory. However, for the FLC-SS, the look-up tables may be stored in the DRAM-SS. While this configuration is slower than having the look-up tables stored in the FLC-SS controller, it is more cost effective to partition a small portion of the DRAM-SS for the look-up tables needed for the FLC-SS. In one embodiment, to reduce the time penalty of accessing the look-up table stored in the DRAM-SS, a small SRAM cache of the DRAM-SS look-up table may be included to cache the most recently seen (used) address translations. Such an address cache does not have to be fully associative as only the address translation tables are being cached. A set associative cache such as that used in a CPU L2 and L3 cache is sufficient, as even a 5% miss rate already reduces the need to perform the address translation in the DRAM by a factor of 20×. This may be achieved with only a small percentage, such as 1000 out of 64,000, of the look-up table entries cached. The address cache may also be based on least recently used/first out operation.
In this embodiment the FLC module 1540 includes an FLC-HS controller 1532 and a DRAM-HS memory 1528 with associated memory controller 1544. The FLC module 1542 includes an FLC-SS controller 1536 and a DRAM-SS memory 1524 with associated memory controller 1548. The FLC-HS controller 1532 connects to the processing device 1500. It also connects to the DRAM-HS 1528 and to the FLC-SS controller 1536 as shown. The outputs of the FLC-SS controller 1536 connect to the DRAM-SS 1524 and also to the storage drive 1578.
The controllers 1544, 1548 of each DRAM 1528, 1524 operate as understood in the art to guide and control read and write operations to the DRAM, and as such these elements and related operation are not described in detail. Although shown as DRAM, it is contemplated that any type of RAM may be utilized. The connections between the controllers 1544, 1548 and the DRAM 1528, 1524 enable communication between these elements and allow for data to be retrieved from and stored to the respective DRAM.
In this example embodiment, the FLC controllers 1532, 1536 include one or more look-up tables storing physical memory addresses which may be translated to addresses which correspond to locations in the DRAM 1528, 1524. For example, the physical address may be converted to a virtual address and the DRAM controller may use the virtual address to generate DRAM row and column address bits. The DRAM 1528, 1524 function as cache memory. In this embodiment the look-up tables are fully associative, thus having a one to one mapping, and permit data to be stored in any cache block, which leads to no conflicts between two or more memory addresses mapping to a single cache block.
As shown in
In general, during operation of a memory read event, a data request with a physical address for the requested data is sent from the processing device 1500 to the FLC-HS controller 1532. The FLC-HS controller 1532 stores one or more tables of memory addresses accessible by the FLC-HS controller 1532 in the associated DRAM-HS 1528. The FLC-HS controller 1532 determines if its memory tables contain a corresponding physical address. If the FLC-HS controller 1532 contains a corresponding memory address in its table, then a hit has occurred and the FLC-HS controller 1532 retrieves the data from the DRAM-HS 1528 (via the controller 1544), which is in turn provided back to the processing device 1500 through the FLC-HS controller.
Alternatively, if the FLC-HS controller 1532 does not contain a matching physical address, the outcome is a miss, and the request is forwarded to the FLC-SS controller 1536. This process repeats at the FLC-SS controller 1536 such that if a matching physical address is located in the memory address look-up table of the FLC-SS controller 1536, then the request is translated or converted into a virtual memory address and the data is pulled from the DRAM-SS 1524 via the memory controller 1548. The DRAM controller generates DRAM row and column address bits from the virtual address. In the event that a matching physical address is not located in the memory address look-up table of the FLC-SS controller 1536, then the data request and physical address are directed by the FLC-SS controller 1536 to the storage drive.
If the requested data is not available in the DRAM-HS 1528, but is stored in and retrieved from the DRAM-SS, then the retrieved data is backfilled in the DRAM-HS when provided to the processor by being transferred to the FLC-SS controller 1536, then to the FLC-HS controller, and then to the processor 1500. When backfilling the data, if space is not available in a DRAM-SS or DRAM-HS, then the least recently used data or cache line will be removed or the data therein overwritten. In one embodiment, data removed from the high speed cache remains in the standard speed cache until additional space is needed in the standard speed cache. It is further contemplated that in some instances data may be stored in only the high speed FLC module and not the standard speed FLC module, or vice versa.
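The cascaded look-up and backfill can be summarized in a brief Python sketch. Dictionaries stand in for the DRAM-HS, the DRAM-SS, and the storage drive, and the function name is hypothetical. Capacity limits and least recently used replacement are deliberately left out, and backfilling both levels after a storage drive access is only one of the options contemplated.

```python
def cascaded_read(phys_addr, hs_cache, ss_cache, storage):
    """Return (data, source), backfilling faster levels on a miss."""
    if phys_addr in hs_cache:
        return hs_cache[phys_addr], "DRAM-HS"
    if phys_addr in ss_cache:
        data = ss_cache[phys_addr]
        hs_cache[phys_addr] = data    # backfill the high speed level
        return data, "DRAM-SS"
    # Miss in both FLC levels: fetch from the storage drive.
    data = storage[phys_addr]
    ss_cache[phys_addr] = data        # backfill both levels (assumed policy)
    hs_cache[phys_addr] = data
    return data, "storage drive"
```

A second read of the same address is then served from the DRAM-HS, which is the behavior the backfill is intended to produce.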
If the requested data is not available in the DRAM-HS 1528 and also not available in the DRAM-SS 1524 and is thus retrieved from the storage drive 1578, then the retrieved data is backfilled in the DRAM-HS, the DRAM-SS, or both when provided to the processor. Thus, the most recently used data is stored in the DRAMs 1528, 1524 and over time, the DRAM content is dynamically updated with the most recently used data. Least often used data is discarded from or overwritten in the DRAM 1528, 1524 to make room for more recently used data. These back-fill paths are shown in
In
The state machine 1560 connects to memory 1576, such as for example, SRAM. The memory 1576 stores look-up tables which contain physical addresses stored in the FLC controller. These physical addresses can be translated or mapped to virtual addresses which identify cache lines accessible by FLC controller 1532. The memory 1576 may store address maps and multiple hash tables. Using multiple hash tables reduces power consumption and operational delay.
The state machine 1560 and the memory 1576 operate together to translate a physical address from the processing device to a virtual address. The virtual address is provided to the DRAM over a hit I/O line 1568 when a ‘hit’ occurs. If the state machine 1560 determines that its memory 1576 does not contain the physical address entry, then a miss has occurred. If a miss occurs, then the FLC logic unit state machine provides the request with the physical address over a miss I/O line 1572 which leads to the storage drive or to another FLC controller.
Operation of the embodiment of
If the physical address is located at step 1708, then the outcome is a hit and the operation advances to a step 1712. At step 1712, the read request is sent with the virtual address to the DRAM-HS controller. As shown in
Alternatively, if at step 1708 the physical address is not identified in the FLC-HS, then the operation advances to step 1732 and a new (empty) cache line is allocated in the FLC-HS controller, such as the memory look-up table and the DRAM-HS. Because the physical address was not identified in the FLC-HS module, space must be created for a cache line. Then, at a step 1736, the FLC-HS controller forwards the data request and the physical address to the FLC-SS module.
As occurs in the FLC-HS module, at a decision step 1740 a determination is made whether the physical address is identified in the FLC-SS. If the physical address is in the FLC-SS module, as revealed by the physical address being present in a look-up table of the FLC-SS controller, then the operation advances to a step 1744. At step 1744, the read request is sent with the virtual address to the DRAM-SS controller. At a step 1748, the DRAM-SS controller generates DRAM row and column address bits from the virtual address, which are used at a step 1752 to read (retrieve) the data from the DRAM-SS. The virtual address of the FLC-HS is different than the virtual address of the FLC-SS so a different conversion of the physical address to virtual address occurs in each FLC controller.
At a step 1724 the FLC-SS controller forwards the requested cache line to the FLC-HS controller, which in turn provides the cache line (with data) to the DRAM-HS so that it is cached in the FLC-HS module. Eventually, the data is provided from the FLC-HS to the processor. Then, at a step 1760, the FLC-HS controller updates the FLC status register for the data (address) to reflect the recent use of the data provided to the FLC-HS and then to the processor.
If at step 1740 the physical address is not identified in the FLC-SS, then a miss has occurred in the FLC-SS controller and the operation advances to a step 1764 and a new (empty) cache line is allocated in the FLC-SS controller. Because the physical address was not identified in the FLC-SS controller, space must be created for a cache line. At a step 1768 the FLC-SS controller translates the physical address to a storage drive address, such as for example a PCI-e type address. The storage drive address is an address understood by or used by the storage drive to identify the location of the cache line. Next, at a step 1772, the storage drive address, resulting from the translation, is forwarded to the storage drive, for example, PCI-e, NVMe, or SATA SSD. At a step 1776, using the storage drive address, the storage drive controller retrieves the data and the retrieved data is provided to the FLC-SS controller. At a step 1780, the FLC-SS controller writes the data to the FLC-SS DRAM and updates the FLC-SS status register. As discussed above, updating the status register occurs to designate the cache line as recently used, thereby preventing it from being overwritten until it becomes least recently used. Although least recently used status is tracked on a cache line basis, it is contemplated that least recently used status could be tracked for individual data items within cache lines, but this would add complexity and additional overhead burden.
In one embodiment, a cache line is retrieved from the storage drive as shown at steps 1764 and 1752. The entire cache line is provided to the FLC-HS controller. The FLC-HS controller stores the entire cache line in the DRAM-HS. The data requested by the processor is stored in this cache line. To satisfy the processor's request, the FLC-HS controller extracts the data from the cache line and provides the data to the processor. This may occur before or after the cache line is written to the DRAM-HS. In one configuration, only the cache line is provided from the FLC-SS controller to the FLC-HS controller, and then the FLC-HS controller extracts the data requested by the processor from the cache line. In another embodiment, the FLC-SS controller provides first the requested data and then the cache line to the FLC-HS controller. The FLC-HS controller can then provide the data to the processor and then, or concurrently, write the cache line to the FLC-HS. This may be faster as the extracted data is provided to the FLC-HS controller first.
As mentioned above, the virtual addresses of the FLC-HS controller are not the same as the virtual addresses of the FLC-SS controller. The look-up tables in each FLC controller are distinct and have no relationship between them. As a result, each FLC controller's virtual address set is also unique. It is possible that virtual addresses could, by chance, have the same bits between them, but the virtual addresses are different as they are meant to be used in their respective DRAM (DRAM-HS and DRAM-SS).
As shown in
One or more of the FLC modules 1820 may be configured as high speed FLC modules, which have high speed/low latency/low power DRAM or the FLC modules may be standard speed modules with standard speed DRAM. This allows for different operational speed for different FLC modules. This in turn accommodates the processing modules 1500 directing important data read/write requests to the high-speed FLC module while less important read/write requests are routed to the standard speed FLC modules.
In one embodiment, each FLC slice (FLCa, FLCb, FLCc) connects to a SoC bus and each FLC slice is assigned an address by the processing device. Each FLC slice is a distinct element and has separate and distinct memory look-up tables. A bus address look-up table or hash table may be used to map memory addresses to FLC slices. In one configuration, certain bits in the physical address define which FLC slice is assigned to the address. In another embodiment, a bi-directional multiplexer (not shown) may be provided between the FLC slices and the processing unit 1500 to control access to each FLC slice, but this arrangement may create a bottleneck which slows operation.
It is also contemplated that the embodiments of
In this embodiment, multiple FLC slices are established to increase FLC capacity and bandwidth. Each FLC slice is allocated to a portion of the system bus memory address space (regions). Moreover, these memory regions are interleaved among the FLC slices. The interleaving granularity is set to match the FLC cache line size to prevent unwanted duplication (through overlapping) of FLC look-up table entries in the different FLC controller slices and ultimately to maximize the FLC hit rates.
In one example embodiment, the mapping assigns, in interleaved order, address blocks of FLC cache line size to the FLC modules. For example, for an FLC implementation with cache line sizes of 4 KB and four different FLCs (FLCa, FLCb, FLCc, FLCd), the mapping (assignment) of memory, identified by the physical addresses, to the FLCs is as follows:
This memory mapping assignment scheme continues following this pattern. This may be referred to as memory mapping with cache line boundaries to segregate the data to different FLC modules. In this manner, the memory addresses used by the processing device are divided among the FLC slices, thereby creating a parallel arranged FLC system that allows for increased performance without any bottlenecks. This allows multiple different programs to utilize only one FLC module, or spread their memory usage among all the FLC modules, which increases operational speed and reduces bottlenecks.
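The interleaved assignment of cache line sized blocks to slices can be expressed compactly. The following Python sketch assumes, for illustration, that the first 4 KB block starting at address zero maps to FLCa and that subsequent blocks rotate through FLCb, FLCc, and FLCd; the function and constant names are hypothetical.

```python
CACHE_LINE_SIZE = 4 * 1024                   # 4 KB FLC cache lines
SLICES = ["FLCa", "FLCb", "FLCc", "FLCd"]    # assumed rotation order


def slice_for_address(phys_addr):
    """Map a physical address to an FLC slice, interleaved per cache line."""
    block = phys_addr // CACHE_LINE_SIZE     # index of the 4 KB block
    return SLICES[block % len(SLICES)]
```

Because both the cache line size and the slice count are powers of two in this example, the selection is equivalent to reading address bits 12 and 13, which is one way that certain bits of the physical address can define the slice. Every address within one cache line maps to the same slice, so no look-up table entry is duplicated across slices.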
In one embodiment, each FLC slice corresponds to a memory address. In this example method of operation, there are four FLC slices, defined as FLCa, FLCb, FLCc, and FLCd. Each FLC slice has a unique code that identifies the FLC slice. For example, exemplary memory addresses are provided below with FLC slice assignments:
where the x's are any combinations of “0” and “1”. In other embodiments, other address mapping schemes may be utilized.
Any other address block mapping schemes with integer number of FLC cache line size could be used. With partial or non-integer block sizes there could be duplicates of look up table entries in the different FLC slices. While this may not be fatal it would nonetheless result in a smaller number of distinct address look up table entries and ultimately impact FLC cache hit performance.
Returning to
A first output from the bypass module 2004 connects to the FLC-HS controller 1532. A second output from the bypass module 2004 connects to a multiplexer 2008. The multiplexer 2008 also receives a control signal on a control input 2012. The multiplexer 2008 may be any type of switch configured to, responsive to the control signal, output one of the input signals at a particular time. The output of the multiplexer 2008 connects to the standard speed FLC controller 1536 of the standard speed FLC module 1542.
Operation of the bypass module 2004 and multiplexer 2008, in connection with the cascaded FLC modules as shown in
In some embodiments, bypass data is data that is not used often enough to qualify, from a performance standpoint, for storage in the high speed DRAM. In other embodiments, certain physical addresses from the processing devices are designated as bypass addresses which the bypass module routes to the bypass path. This is referred to as fixed address mapping whereby certain addresses or blocks of addresses are directed to the bypass path. Similarly, the bypass decision could be based on data type as designated by the processor or other software/hardware function.
The bypass designation could also be based on a task ID, which is defined as the importance of a task. The task ID, defining the task importance, may be set by a fixed set of criteria or vary over time based on the available capacity of the DRAM-HS or other factors. A software engine or algorithm could also designate the task ID. The bypass module may also be configured to reserve space in the DRAM-HS such that only certain task IDs can be placed in the reserved DRAM-HS memory space. To avoid never ending or needless blocking of caching to the DRAM-HS based on bypass module control, the task IDs or designation may time out, meaning the bypass designation is terminated after a fixed or programmable timer period. Task IDs could furthermore be used to define DRAM-HS cache line allocation capacity on a per task ID basis. This is to prevent greedy tasks/threads from purging non-greedy tasks/threads and ultimately to enable a more balanced overall system performance. Operating systems could also change the cache line allocation capacity table over time to reflect the number of concurrent tasks/threads that need to operate simultaneously during a given period of time.
By way of example, a screen display showing active video play (movie) has a constantly changing screen display, but when not playing video, the screen display is static. As a result, the bypass module may be configured to bypass the active video display to the bypass path due to the video not being re-displayed more than once or twice to the screen. However, for a paused movie or during non-video play when the screen is static, the display data may be cached (not bypassed) since it is re-used over and over when refreshing the screen. Thus, it is best to have the data forming the static display in the FLC-HS module because the FLC-HS module has lower power consumption. Detection of whether the screen is a repeating screen display can be done in software or hardware.
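A per task ID cache line allocation capacity table might be sketched as follows. This Python example is a simplified illustration: the class name, the method names, and the choice to refuse an allocation once a task reaches its capacity are assumptions, and the disclosure does not specify what happens to a request that exceeds its allocation.

```python
class TaskQuota:
    """Per task ID limit on DRAM-HS cache lines (illustrative)."""

    def __init__(self, capacity_by_task):
        # task ID -> maximum number of DRAM-HS cache lines for that task
        self.capacity = dict(capacity_by_task)
        self.in_use = {task: 0 for task in capacity_by_task}

    def try_allocate(self, task_id):
        """Return True if the task may take one more DRAM-HS cache line."""
        used = self.in_use.get(task_id, 0)
        if used >= self.capacity.get(task_id, 0):
            return False    # at capacity; unknown tasks have zero capacity
        self.in_use[task_id] = used + 1
        return True

    def release(self, task_id):
        """Record that one of the task's cache lines was purged."""
        if self.in_use.get(task_id, 0) > 0:
            self.in_use[task_id] -= 1
```

Because each task draws only on its own allocation, a greedy task cannot consume the cache lines reserved for other tasks in this sketch.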
In one embodiment, the bypass module includes algorithms and machine learning engines that monitor, over time, which data (rarely used or used only once) should be bypassed away from the high speed FLC module toward the standard speed FLC module. Over time the machine learning capability with artificial intelligence of the bypass module determines which data, for a particular user, is rarely used, or used only once, and thus should be bypassed away from the high speed FLC module. If the user, over time, uses that data more often, then the machine learning aspects of the bypass module will adjust and adapt to the change in behavior to direct that data to the high speed FLC module to be cached to maximize performance.
In one embodiment, the bypass module does not use machine learning or adapt to the user's behavior. Instead, the data or addresses that are bypassed away from the high speed FLC module are fixed, user programmable, or software controlled. This is a less complicated approach.
It is also contemplated that the processing device may designate data to be bypass type data. As such, the request (read or write) from the processing device to the bypass module would include a designation as bypass type data. This provides a further mechanism to control which data is stored in the high speed FLC module, which has the flexibility of software control.
It is also contemplated and disclosed that the bypass designation for data may have a timer function which removes the bypass designation after a period of time, or which requires that the bypass designation be renewed after a period of time to remain active. This prevents the bypass designation from being applied to data that should no longer have the bypass designation.
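Several of the bypass criteria discussed above (fixed address mapping, a designation supplied by the processing device, and a designation that lapses after a period of time) can be combined in one decision function. The Python sketch below is illustrative only; the class, its method names, and the explicit `now` parameter are assumptions, and the machine learning variant is not modeled.

```python
class BypassModule:
    """Decides whether a request bypasses the high speed FLC module (illustrative)."""

    def __init__(self):
        self.fixed_ranges = []    # (start, end) half-open ranges, always bypassed
        self.timed = {}           # physical address -> time the designation lapses

    def add_fixed_range(self, start, end):
        self.fixed_ranges.append((start, end))

    def designate(self, phys_addr, now, lifetime):
        """Apply or renew a bypass designation that lapses after lifetime."""
        self.timed[phys_addr] = now + lifetime

    def should_bypass(self, phys_addr, now, flagged_by_processor=False):
        if flagged_by_processor:
            return True           # request itself carries a bypass designation
        if any(start <= phys_addr < end for start, end in self.fixed_ranges):
            return True           # fixed address mapping
        expiry = self.timed.get(phys_addr)
        if expiry is not None:
            if now < expiry:
                return True
            del self.timed[phys_addr]    # designation timed out
        return False
```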
Returning to
Alternatively, if at decision step 2116 the bypass module determines that the data should be bypassed, then the operation advances to step 2124 and the data request with physical address is routed from the bypass module to the bypass multiplexer. In other embodiments, the data request and physical address may be routed to a bypass multiplexer. The bypass multiplexer (as well as other multiplexers disclosed herein) is a bi-directional multiplexer that, responsive to a control signal, passes one of its inputs to its output, which in this embodiment connects to the standard speed FLC module. The other input to the bypass multiplexer is from the high speed FLC controller as shown in
At a step 2128, responsive to the control signal to the bypass multiplexer, the bypass multiplexer routes the data request and physical address to the standard speed FLC-SS module. In other embodiments, the data request and physical address from the bypass multiplexer may be transferred to a different location, such as a different high speed FLC module or directly to the storage drive. Then, at a step 2132, the data request and physical address is processed by the standard speed FLC-SS module in the manner described in
In this embodiment, a portion of the DRAM-SS 1524 is partitioned to be reserved as non-cacheable memory. In the non-cacheable data partition of the DRAM-SS, non-cacheable data is stored. As such, the non-cacheable data partition operates as traditional processor/DRAM. If the processor requests non-cacheable data, such as a video file which is typically viewed once, then the file is retrieved by the processor over the file I/O path 1520 from the storage drive 1578 and provided to the non-cacheable partition of the DRAM-SS. This data now stored in the DRAM-SS may then be retrieved by the processor in smaller blocks, over the non-cacheable data path. A video file, such as a movie, is typically very large and is typically only watched once, and thus not cached because there would be no performance benefit to caching data used only once. Partitioning a portion of a memory is understood by one of ordinary skill in the art and as such, this process is not described in detail herein. The non-cacheable data could also be stored in the storage drive 1578.
In this embodiment the bypass module 2004 is further configured to analyze the read request and determine if the read request is for data classified as non-cacheable data. If so, then the data read request from the processing device 1500 is routed to the second multiplexer 2208 through non-cacheable data path 2204. The second multiplexer 2208, responsive to the control signal, determines whether to pass, to the DRAM-SS 1524 either the non-cacheable data read request or the request from the standard speed FLC-SS controller 1536. Because the data is non-cacheable, after the data is provided to the processor, the data is not cached in either the DRAM-HS 1528 or the DRAM-SS 1524, but could be stored in the non-cacheable data partition of the DRAM-SS.
Thereafter, at a step 2320, responsive to a control signal provided to the bypass multiplexer, the data request with physical address is routed from the bypass multiplexer to the FLC-SS module. Then at step 2324 the FLC-SS module processes the data request and physical address as described in
Alternatively, if at decision step 2312 it is determined that the bypass criteria were not satisfied, then the operation advances to decision step 2328 where it is determined if the request is a cacheable memory request. A cacheable memory request is a request from the processing device for data that will be cached in one of the FLC modules, while a non-cacheable memory request is for data that will not be cached. If the request is for cacheable memory, then the operation advances to step 2332 and the process of
Alternatively, if at step 2328 the requested data is determined to be non-cacheable, then the operation advances to step 2336. At step 2336 the non-cacheable data request including the physical address is routed from the bypass module to a second multiplexer. The second multiplexer may be configured and operate generally similar to the bypass multiplexer. At a step 2340, responsive to a second multiplexer control signal, the data request and physical address from the second multiplexer are provided to the DRAM-SS controller which directs the request to a partition of the DRAM-SS reserved for non-cacheable data. At a step 2344 the FLC-SS controller retrieves the non-cacheable data from the DRAM-SS non-cacheable data partition and at step 2348 the FLC-SS controller provides the non-cacheable data to the processing device. The retrieved data is not cached in the DRAM-HS cache or the DRAM-SS cache, but may be maintained in the non-cacheable partition of the DRAM-SS. As such, it is not accessible through the FLC-SS module but is instead accessed through the non-cacheable data path.
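The routing decisions of steps 2312 through 2348 can be summarized in a short sketch. The following Python is illustrative only: the `Route` enumeration, the function `route_request`, and its two boolean inputs are hypothetical names introduced here, and the sketch does not attempt to model how the bypass criteria or the cacheable classification are actually evaluated.

```python
from enum import Enum


class Route(Enum):
    """Destinations named in the flow; the labels are illustrative."""
    FLC_SS = "standard speed FLC-SS module (steps 2320-2324)"
    CACHEABLE = "cacheable request processing (step 2332)"
    NON_CACHEABLE = "non-cacheable DRAM-SS partition (steps 2336-2348)"


def route_request(bypass_criteria_met: bool, cacheable: bool) -> Route:
    """Return the path a data request takes (hypothetical helper)."""
    if bypass_criteria_met:      # decision step 2312
        return Route.FLC_SS
    if cacheable:                # decision step 2328
        return Route.CACHEABLE
    return Route.NON_CACHEABLE   # routed through the second multiplexer


if __name__ == "__main__":
    # A large video file viewed once: no bypass, not cacheable.
    print(route_request(bypass_criteria_met=False, cacheable=False).name)
```

In this sketch the bypass decision takes priority over the cacheable decision, mirroring the order in which the two decision steps are described.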
It is contemplated and disclosed that any of the embodiments, elements or variations described above may be assembled or arranged in any combination to form new embodiments. For example, as shown in
It is also understood that although the flow charts and methods of operation are shown and discussed in relation to sequential operation, various operations may occur in parallel. This increases the speed of operation and bandwidth, and reduces latency in the system.
The wireless communication aspects described in the present disclosure can be conducted in full or partial compliance with IEEE standard 802.11-2012, IEEE standard 802.16-2009, IEEE standard 802.20-2008, and/or Bluetooth Core Specification v4.0. In various implementations, Bluetooth Core Specification v4.0 may be modified by one or more of Bluetooth Core Specification Addendums 2, 3, or 4. In various implementations, IEEE 802.11-2012 may be supplemented by draft IEEE standard 802.11ac, draft IEEE standard 802.11ad, and/or draft IEEE standard 802.11ah.
Although the terms first, second, third, etc. may be used herein to describe various chips, modules, signals, elements, and/or components, these items should not be limited by these terms. These terms may be only used to distinguish one item from another item. Terms such as “first,” “second,” and other numerical terms when used herein do not imply a sequence or order unless clearly indicated by the context. Thus, a first item discussed below could be termed a second item without departing from the teachings of the examples.
Also, various terms are used to describe the physical relationship between components. When a first element is referred to as being “connected to”, “engaged to”, or “coupled to” a second element, the first element may be directly connected, engaged, disposed, applied, or coupled to the second element, or intervening elements may be present. In contrast, when an element is referred to as being “directly connected to”, “directly engaged to”, or “directly coupled to” another element, there may be no intervening elements present. Stating that a first element is “connected to”, “engaged to”, or “coupled to” a second element implies that the first element may be “directly connected to”, “directly engaged to”, or “directly coupled to” the second element. Other words used to describe the relationship between elements should be interpreted in a like fashion (e.g., “between” versus “directly between”, “adjacent” versus “directly adjacent”, etc.).
The foregoing description is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. The broad teachings of the disclosure can be implemented in a variety of forms. Therefore, while this disclosure includes particular examples, the true scope of the disclosure should not be so limited since other modifications will become apparent upon a study of the drawings, the specification, and the following claims. As used herein, the phrase at least one of A, B, and C should be construed to mean a logical (A OR B OR C), using a non-exclusive logical OR, and should not be construed to mean “at least one of A, at least one of B, and at least one of C.” It should be understood that one or more steps within a method may be executed in different order (or concurrently) without altering the principles of the present disclosure.
In this application, including the definitions below, the term ‘module’ or the term ‘controller’ may be replaced with the term ‘circuit.’ The term ‘module’ and the term ‘controller’ may refer to, be part of, or include: an Application Specific Integrated Circuit (ASIC); a digital, analog, or mixed analog/digital discrete circuit; a digital, analog, or mixed analog/digital integrated circuit; a combinational logic circuit; a field programmable gate array (FPGA); a processor circuit (shared, dedicated, or group) that executes code; a memory circuit (shared, dedicated, or group) that stores code executed by the processor circuit; other suitable hardware components that provide the described functionality; or a combination of some or all of the above, such as in a system-on-chip.
A module or a controller may include one or more interface circuits. In some examples, the interface circuits may include wired or wireless interfaces that are connected to a local area network (LAN), the Internet, a wide area network (WAN), or combinations thereof. The functionality of any given module or controller of the present disclosure may be distributed among multiple modules and/or controllers that are connected via interface circuits. For example, multiple modules and/or controllers may allow load balancing. In a further example, a server (also known as remote, or cloud) module or (remote, or cloud) controller may accomplish some functionality on behalf of a client module and/or a client controller.
The term code, as used above, may include software, firmware, and/or microcode, and may refer to programs, routines, functions, classes, data structures, and/or objects. The term shared processor circuit encompasses a single processor circuit that executes some or all code from multiple modules and/or controllers. The term group processor circuit encompasses a processor circuit that, in combination with additional processor circuits, executes some or all code from one or more modules and/or controllers. References to multiple processor circuits encompass multiple processor circuits on discrete dies, multiple processor circuits on a single die, multiple cores of a single processor circuit, multiple threads of a single processor circuit, or a combination of the above. The term shared memory circuit encompasses a single memory circuit that stores some or all code from multiple modules and/or controllers. The term group memory circuit encompasses a memory circuit that, in combination with additional memories, stores some or all code from one or more modules and/or controllers.
The term memory circuit is a subset of the term computer-readable medium. The term computer-readable medium, as used herein, does not encompass transitory electrical or electromagnetic signals propagating through a medium (such as on a carrier wave); the term computer-readable medium may therefore be considered tangible and non-transitory. Non-limiting examples of a non-transitory, tangible computer-readable medium are non-volatile memory circuits (such as a flash memory circuit, an erasable programmable read-only memory circuit, or a mask read-only memory circuit), volatile memory circuits (such as a static random access memory circuit or a dynamic random access memory circuit), magnetic storage media (such as an analog or digital magnetic tape or a hard disk drive), and optical storage media (such as a CD, a DVD, or a Blu-ray Disc).
The apparatuses and methods described in this application may be partially or fully implemented by a special purpose computer created by configuring a general purpose computer to execute one or more particular functions embodied in computer programs. The functional blocks and flowchart elements described above serve as software specifications, which can be translated into the computer programs by the routine work of a skilled technician or programmer.
The computer programs include processor-executable instructions that are stored on at least one non-transitory, tangible computer-readable medium. The computer programs may also include or rely on stored data. The computer programs may encompass a basic input/output system (BIOS) that interacts with hardware of the special purpose computer, device drivers that interact with particular devices of the special purpose computer, one or more operating systems, user applications, background services, background applications, etc.
The computer programs may include: (i) descriptive text to be parsed, such as HTML (hypertext markup language) or XML (extensible markup language), (ii) assembly code, (iii) object code generated from source code by a compiler, (iv) source code for execution by an interpreter, (v) source code for compilation and execution by a just-in-time compiler, etc. As examples only, source code may be written using syntax from languages including C, C++, C#, Objective C, Haskell, Go, SQL, R, Lisp, Java®, Fortran, Perl, Pascal, Curl, OCaml, Javascript®, HTML5, Ada, ASP (active server pages), PHP, Scala, Eiffel, Smalltalk, Erlang, Ruby, Flash®, Visual Basic®, Lua, and Python®.
None of the elements recited in the claims are intended to be a means-plus-function element within the meaning of 35 U.S.C. § 112(f) unless an element is expressly recited using the phrase “means for,” or in the case of a method claim using the phrases “operation for” or “step for.”
U.S. Provisional Patent Application No. 62/686,333 titled Multi-Path or Multi-Stage Cache Improvement filed on Jun. 18, 2018, is incorporated by reference in its entirety herein and the contents of the incorporated reference, including figures, should be considered as being part of this patent application.
Also disclosed herein are example embodiments which implement the final level cache (FLC) system such that the memory that is used in the FLC system is integrated with the FLC logic controller, or stated another way, the FLC system is integrated with the memory. One type of DRAM that is well suited to this combination is HBM3 DRAM (or any type of HBM technology) that is produced by the DRAM industry. With modification to a base die of an HBMx DRAM, the HBMx DRAM can be beneficially paired with the FLC innovation disclosed herein. The ‘x’ designates any type or generation of HBM memory. In addition, although referred to herein as HBM type memory, it is disclosed that any memory structure will benefit from use with the FLC system. Specifically, in the existing HBMx device there is a “base die” that is used for testing/repairing and provides an ultra-high speed link layer to the processor from the stacked DRAM die arrays (and vice versa). High Bandwidth Memory (HBM) is a computer memory interface for 3D-stacked synchronous dynamic random-access memory (SDRAM). It is used in conjunction with high-performance graphics accelerators, network devices, high-performance datacenter processing systems, and on-package RAM in upcoming CPUs, FPGAs and in some supercomputers. HBM achieves higher bandwidth than DDR or GDDR while using less power, and in a substantially smaller form factor. This is achieved by stacking up to eight DRAM dies and an optional base die which can include buffer circuitry and test logic, as well as, in the embodiments disclosed herein, an FLC controller. The stack is often connected to the memory controller on a GPU or CPU through a substrate, such as a silicon interposer. Alternatively, the memory die could be stacked directly on the CPU or GPU chip. Within the stack, the dies are vertically interconnected by through-silicon vias (TSVs) and micro-bumps.
An HBM memory bus is very wide in comparison to other DRAM memories such as DDR or GDDR type memory. An HBM stack of four DRAM dies (4-Hi) has two 128-bit channels per die for a total of 8 channels and a width of 1024 bits in total. A graphics card/GPU with four 4-Hi HBM stacks (4 memory dies each) would therefore have a memory bus with a width of 4096 bits. In comparison, the bus width of GDDR memories is 32 bits, with 16 channels for a graphics card with a 512-bit memory interface. The larger number of connections to the memory, relative to DDR or GDDR, would benefit from a new method of connecting the HBM memory to the GPU (or other processor). Some designs utilize purpose-built silicon chips, called interposers, to connect the memory and GPU. This interposer has the added advantage of requiring the memory and processor to be physically close, decreasing memory path distances.
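The bus width figures above follow from simple multiplication, which the short Python sketch below reproduces. The per-die channel count and channel width are taken from the preceding paragraph; the function names are hypothetical and are used here only for illustration.

```python
BITS_PER_CHANNEL = 128   # channel width stated in the text
CHANNELS_PER_DIE = 2     # channels per DRAM die stated in the text


def stack_bus_width(dies_per_stack: int) -> int:
    """Total bus width, in bits, of one HBM stack."""
    return dies_per_stack * CHANNELS_PER_DIE * BITS_PER_CHANNEL


def card_bus_width(stacks: int, dies_per_stack: int) -> int:
    """Total bus width, in bits, of a device with several identical stacks."""
    return stacks * stack_bus_width(dies_per_stack)


if __name__ == "__main__":
    print(stack_bus_width(4))     # one 4-Hi stack: 1024
    print(card_bus_width(4, 4))   # four 4-Hi stacks: 4096
    print(16 * 32)                # GDDR comparison, 16 channels of 32 bits: 512
```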
The HBM type memory has various different versions or generations. At this time the various types include, but are not limited to, HBM3 Gen2, HBM3, HBM2E, and HBM2. The numerous various types of HBM memory are referred to herein as HBMx memory where the x denotes any type or generation of HBM type memory. HBM3 is referred to in some of the following discussions to aid in understanding. The newest version of HBM type memory is HBM3 Gen2 which has a maximum capacity of 24 GB, a maximum bandwidth per pin of 9.2 GT/s, with a capability of stacking up to 8 DRAM dies (integrated circuits) on top of one another. HBM3 Gen2 also realizes an effective bus width of 1024 bits and a bandwidth per stack of 1.2 terabytes per second. While the HBM3 Gen2 is faster than prior versions, it is also more complex and costly.
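The per-stack bandwidth quoted above is consistent with the stated bus width and per-pin rate, as the following Python sketch shows. It assumes one bit is transferred per pin per transfer and uses decimal units (1 TB/s = 1000 GB/s); the function name is hypothetical.

```python
def stack_bandwidth_gb_per_s(bus_width_bits: int, rate_gt_per_s: float) -> float:
    """Peak bandwidth in GB/s: pins x transfers per second / 8 bits per byte."""
    return bus_width_bits * rate_gt_per_s / 8


if __name__ == "__main__":
    bw = stack_bandwidth_gb_per_s(1024, 9.2)
    print(f"{bw:.1f} GB/s")          # 1177.6 GB/s
    print(f"{bw / 1000:.2f} TB/s")   # 1.18 TB/s, roughly the 1.2 TB/s quoted
```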
It is proposed to modify a base die of the HBM structure (or any type of memory) to include an FLC controller and a look-up engine and integrate the path to process the look up misses to an outside environment, such as a different memory resource using for example the standard LP or DDR memory or even a CXL interface. This would provide a virtual HBMx device having a significantly smaller DRAM die stack (as few as a 4-die stack in the current HBMx DRAM die implementation) while providing the same bandwidth, such as up to or greater than 1 TB/s bandwidth. Because the FLC look up engine is integrated in the new base die, this design will virtually expand the HBMx capacity using a significantly lower cost solution, including using 3DXP memory die or any other type of memory access layout specification.
An additional benefit of implementing the FLC system into the base die of HBMx is that this configuration would skip the protocol conversion that would otherwise be needed, for example and without limitation, if the request for data had to be sent to the external memory or off the HBM/FLC chiplet. This in turn provides lower latency. In addition, this implementation allows for an easier die-to-die chiplet interface thus enabling use or integration of the FLC system with any CPU/GPU application. Finally, it is also proposed to incorporate CXL memory pooling behind the FLC lookup engine to expand the capacity of the HBMx memory that is in the chiplet.
Interconnecting each memory die 2416A, 2416B is a memory interconnect 2420, as shown. The stacked memory 2408 connects to an interposer layer or base die 2424. On or within the base die 2424 is an FLC controller 2412, which may also be referred to as a tag look up engine and which has numerous conductors that extend to the HBMx memory 2408 as shown. The FLC controller 2412 functions, as described herein, as a cache memory controller to receive and process data requests from a processor or other data requesting element, and to determine if the requested data is located in the HBMx memory 2408. Based on whether the requested data is in the HBMx memory, the FLC controller 2412 retrieves the requested data either from the HBMx memory or from a different memory, such as an external or shared pooled memory. Providing an interface between the processor, which requests the data, and the FLC controller 2412, is a D2D (die to die) physical layer 2430 that has a plurality of input/output leads 2434 that electrically connect the base die 2424 to a processor or other device requesting data for reading or writing. In one or more embodiments, the elements 2434, 2430 may be configured as a die to die interface or UCIe interface.
Located on the base die 2424 are a DRAM analog/MiMcap 2450 (metal-insulator-metal capacitors) and test/repair elements and ECC elements (collectively) 2454, which may include error correction functions, test functions and/or memory repair functions. Also located on the base die 2424 is an external memory physical layer with memory controller 2440, which in this example embodiment is a DXP/LP PHY with associated memory controller. In other embodiments, other interface types, standards, and configurations may be provided to interface the HBMx system 2404 with external memory (not shown in
Operation of the embodiment shown in
In the case of an eight to sixteen memory die stack as shown in arrangement 2720, the cost jumps up significantly, such that the cost may range from $175 to $700. By using a smaller number of memory dies in an arrangement, such as shown in arrangements 2708, 2712, and 2716, configured with FLC controllers therein, less memory is required, due to the high hit rate of the cache arrangement, as compared to the prior art. This reduces the cost of the system and size/power requirements. In data centers, there are some companies which run 100,000 servers. If the amount of HBMx memory can be reduced from 8 dies to 2 dies, this will result in a cost saving of $155 per HBMx system. This results in a total cost savings of $15.5 million, plus the cost savings over the life of the server in reduced energy consumption and reduced cooling costs. As a result, incorporating the FLC controller and functionality into the HBMx type memory systems results in substantial savings in cost, space, and power.
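The fleet-level figure above can be checked with a one-line calculation, sketched below in Python. The sketch assumes one HBMx system per server, which is what the stated total implies; the function name and the `systems_per_server` parameter are hypothetical.

```python
def fleet_savings_usd(servers: int, saving_per_system_usd: int,
                      systems_per_server: int = 1) -> int:
    """Up-front memory cost saving across a fleet (energy savings excluded)."""
    return servers * systems_per_server * saving_per_system_usd


if __name__ == "__main__":
    total = fleet_savings_usd(servers=100_000, saving_per_system_usd=155)
    print(f"${total:,}")   # $15,500,000
```

A server carrying several HBMx systems would scale the saving proportionally through `systems_per_server`.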
Also within or on the base die is an HBMx controller 2832 that functions as a memory controller for the one or more dies in the HBMx stacks 2840. It is also contemplated that the HBMx controller 2832 may be incorporated directly into the one or more HBMx dies that form the HBMx stacks 2840. The HBMx controller 2832 functions as a memory controller, which is an element typically associated with a memory.
The interposer layer 2836 also facilitates communication between the processor 2808 and printed circuit board (PCB) 2804 over one or more communication paths 2812. The PCB 2804 may include one or more data buses, or other communication paths.
Also shown in
This method starts at a step 2908 at which a processor sends a data request, which may include a physical address corresponding to the data, to the HBMx FLC system. At a step 2912 the FLC controller on the base die for the HBMx memory receives a read or write request from the processing device (processor). The request includes a physical address that the processor uses to identify the location of the data or where the data is to be written.
At a decision step 2916, a determination is made whether the physical address provided by the processor is located in the FLC controller of the HBMx FLC system. The memory (SRAM) of the FLC controller stores physical-to-virtual address map data. The physical address being located in the FLC controller is designated as a hit, while the physical address not being located in the FLC controller is designated as a miss. The processor's request for data (with physical address) can only be satisfied by the HBMx FLC system if the FLC controller has the physical address entry in its memory. If the physical address is not stored in the memory of the FLC controller, then the request must be forwarded to a different memory, such as a storage drive, local memory, a shared memory pool, or any other type of memory.
If, at decision step 2916, the physical address is identified in the memory of the FLC controller (such as in a look-up table), then the request is considered a hit and the operation advances to a step 2920. At step 2920 the FLC controller in the base die of the HBMx FLC system translates the physical address to a virtual address based on a look-up operation using a look-up table stored in a memory of the FLC controller or the HBMx memory that is allocated for use by the FLC controller. The virtual address may be associated with a physical address in the HBMx memory. The FLC controller may include one or more translation mapping tables for mapping physical addresses (from the processor) to virtual addresses.
After translation of the physical address to a virtual address, the operation advances to a decision step 2924. If at decision step 2916, the physical address is not located in the FLC controller, a miss has occurred and the operation advances to step 2928. At step 2928, the FLC controller allocates a new (in this case empty) cache line in the FLC controller for the data to be read or written, and which is not already in the FLC module (i.e., the HBMx of the HBMx FLC system). An existing cache line could be overwritten if space is not otherwise available such that the cache line corresponding to the least recently used data may be overwritten. Step 2928 includes updating the memory mapping to include the physical address provided by the processor, thereby establishing the FLC controller as having that physical address. Next, at a step 2932 the physical address is translated to a non-HBM memory address (storage address), which is an address used by the non-HBM memory (storage drive) to retrieve the data. In this embodiment, the FLC controller performs this step but in other embodiments, other devices such as the non-HBM memory may perform the translation. The non-HBM memory address is an address that is used by or understood by the non-HBM memory. In one embodiment, the non-HBM memory drive address is a PCI-e address.
At a step 2936, the FLC controller forwards the storage address to the non-HBM memory, for example and without limitation, an external memory, PCI-e based device, a NVMe (non-volatile memory express) type device, a shared or pooled memory, a SATA SSD device, or any other storage drive now known or developed in the future. As discussed above, the non-HBM memory may be a traditional hard disk drive, SSD, or hybrid drive and a portion of the storage drive is used in the traditional sense to store files, such as documents, images, videos, or the like. A portion of the storage drive is also used and partitioned as main memory to supplement the storage capacity provided by the HBMx memory of the HBMx FLC system.
Advancing to a step 2940, a controller (not shown) for the non-HBM memory retrieves the cache line, at the physical address provided by the processor, from the non-HBM memory and the cache line is provided to the FLC controller in the base die of the HBMx FLC system. The cache line, identified by the cache line address, stores (contains) the requested data or is designated to be the location where the data is written. This may occur in a manner that is known in the art. At a step 2944, the FLC controller writes the cache line to the HBMx memory of the HBMx FLC system and it is associated with the physical address, such that this association is maintained in the look-up table in the FLC controller.
Also part of step 2944 is an update to the FLC status register to designate the cache line or data as most recently used data as part of the HBMx memory cache which contains the most recently used data. The FLC status register, which may be stored in the HBMx or a separate register, is a register that tracks when a cache line or data in the HBMx FLC system was last used, accessed or written by the processor. As part of the cache mechanism, recently used cache lines are maintained in the cache so that recently used data is readily available (in the HBMx memory) for the processor again when requested. Cache lines that are least recently used, accessed or written to by the processor are overwritten to make room for more recently used cache lines/data. In this arrangement, the cache operates on a least recently used, first out basis. After step 2944, the operation advances to step 2924.
At decision step 2924 the request from the processor is evaluated as a read request or a write request. If the request is a write request, the operation advances to step 2948 and the write request is sent with the virtual address to the controller or interface for the HBMx memory. As is understood in the art, HBMx memory has an associated memory controller to oversee read/write operations to the HBMx memory. At a step 2952, the HBMx memory controller generates HBMx row and column address bits (and/or die identifier) from the virtual address, which are used at a step 2956 to write the data from the processor (processor data) to the HBMx memory. Then, at a step 2960, the FLC controller updates the FLC status register for the cache line or data to reflect the recent use of the cache line/data just written to the HBMx memory. Because the physical address is mapped into the FLC controller memory mapping, that FLC controller (on or in the base die) now possesses that physical address if requested by the processor.
Alternatively, if at decision step 2924 it is determined that the request from the processor is a read request, then the operation advances to step 2964 and the FLC controller sends the read request with the virtual address to the HBMx memory controller for processing by the HBMx memory controller. Then at step 2968, the HBMx memory controller generates die, row and column address bits from the virtual address, which are used at a step 2972 to read (retrieve) the data from the HBMx memory so that data can be provided to the processor. At a step 2976, the data retrieved from HBMx memory is provided to the processor to satisfy the processor read request. Then, at a step 2980, the FLC controller updates the FLC status register for the data (address) to reflect the recent use of the data that was read from the HBMx memory. Because the physical address is mapped into the FLC controller memory mapping, that FLC controller maintains the physical address in the memory mapping as readily available within the HBMx memory if again requested by the processor. This increases memory bandwidth by having the data most likely to be requested in the high speed and closely located HBMx memory, while also allowing for use of less memory than in prior art embodiments.
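The hit/miss flow of steps 2908 through 2980 can be illustrated with a small software model. The Python below is a toy sketch under stated assumptions, not the disclosed hardware: the class `HbmxFlcSketch` and all of its members are hypothetical names, an ordered dictionary stands in for both the look-up table and the FLC status register, a plain dictionary keyed by physical address stands in for the non-HBM memory (so the storage address translation of step 2932 is skipped), and an evicted line is always written back for simplicity.

```python
from collections import OrderedDict


class HbmxFlcSketch:
    """Toy model of the FLC hit/miss flow; every name here is hypothetical."""

    def __init__(self, num_lines: int, backing: dict):
        self.backing = backing               # stand-in for the non-HBM memory
        self.table = OrderedDict()           # physical -> virtual, oldest first
        self.hbm = {}                        # virtual address -> cache line data
        self.free = list(range(num_lines))   # unused virtual line slots

    def _lookup_or_fill(self, phys):
        if phys in self.table:               # decision step 2916: hit
            return self.table[phys]          # step 2920: translate
        if self.free:                        # step 2928: allocate a line
            virt = self.free.pop()
        else:                                # no room: overwrite the LRU line
            victim, virt = self.table.popitem(last=False)
            self.backing[victim] = self.hbm[virt]   # simplification: always write back
        self.table[phys] = virt              # update the memory mapping
        self.hbm[virt] = self.backing.get(phys)     # steps 2936-2944: fetch and fill
        return virt

    def read(self, phys):
        virt = self._lookup_or_fill(phys)
        data = self.hbm[virt]                # steps 2964-2976
        self.table.move_to_end(phys)         # step 2980: mark most recently used
        return data

    def write(self, phys, data):
        virt = self._lookup_or_fill(phys)
        self.hbm[virt] = data                # steps 2948-2956
        self.table.move_to_end(phys)         # step 2960: mark most recently used


if __name__ == "__main__":
    flc = HbmxFlcSketch(2, {0x10: "a", 0x20: "b", 0x30: "c"})
    flc.read(0x10)
    flc.read(0x20)
    flc.write(0x10, "A")
    flc.read(0x30)             # cache is full, so the least recently used line (0x20) is replaced
    print(sorted(flc.table))   # physical addresses currently held: [16, 48]
```

Keeping recency in the insertion order of one ordered dictionary is a convenience of the sketch; the disclosure describes the status register and the look-up table as separate structures.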
The above-described tasks of
As discussed above, status registers maintain the states of cache lines which are stored in the FLC module. It is contemplated that several aspects regarding cache lines and the data stored in cache lines may be tracked. One such aspect is the relative importance of the different cache lines in relation to pre-set criteria or in relation to other cache lines. In one embodiment, the most recently accessed cache lines would be marked or defined as most important while least recently used cache lines are marked or defined as least important. The cache lines that are marked as the least important, such as for example, least recently used, would then be eligible for being kicked out of the FLC or overwritten to allow new cache lines to be created in FLC or new data to be stored. The steps used for this task are understood by one of ordinary skill in the art and thus not described in detail herein. However, unlike traditional CPU cache controllers, an FLC controller would additionally track cache lines that had been written by CPU/GPU. This occurs so that the FLC controller does not accidentally write to the storage drive, such as an SSD, when a cache line that had only been used for reading is eventually purged out of FLC. In this scenario, the FLC controller marks an FLC cache line that has been written as “dirty”.
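The effect of the “dirty” marking can be sketched as follows. This Python example is a minimal illustration with hypothetical names (`CacheLine`, `write_line`, `evict`), and a dictionary stands in for the storage drive: a line is written to storage on eviction only if it was written while in the FLC.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    data: bytes
    dirty: bool = False   # set once the CPU/GPU has written the line


def write_line(line: CacheLine, data: bytes) -> None:
    """Processor write: update the data and mark the line dirty."""
    line.data = data
    line.dirty = True


def evict(line: CacheLine, address: int, storage: dict) -> bool:
    """Purge a line; return True only if a storage write was performed."""
    if line.dirty:
        storage[address] = line.data   # write back modified data
        line.dirty = False
        return True
    return False                       # read-only line: nothing to write


if __name__ == "__main__":
    storage = {}
    clean = CacheLine(b"read only")
    print(evict(clean, 0x100, storage))   # False, storage is untouched
    written = CacheLine(b"old")
    write_line(written, b"new")
    print(evict(written, 0x200, storage))  # True, 0x200 now holds b"new"
```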
In one embodiment, certain cache lines may be designated as locked FLC cache lines. Certain cache lines in FLC could be locked to prevent accidental purging of such cache lines out of FLC. This may be particularly important for keeping the addresses of data in the FLC controller when such addresses/data cannot tolerate a delay for retrieval, and thus will be locked and thus maintained in FLC, even if it was least recently used.
It is also contemplated that a time out timer for locked cache lines may be implemented. In this configuration, a cache line may be locked, but only for a certain period of time as tracked by a timer. The timer may reset after a period of time from lock creation or after use of the cache line. The amount of time may vary based on the cache line, the data stored in the cache line, or the application or program assigned to the cache line.
Additionally, it is contemplated that a time out bit is provided to a locked cache line for the following purposes: to allow locked cache lines to be purged out of FLC after a very long period of inactivity or to allow locked cache lines to be eventually purged to the next stage or level of FLC module and at the same time inherit the locked status bit in the next FLC stage to minimize the time penalty for cache line/data retrieval resulting from the previously locked cache line being purged from the high speed FLC module.
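One way the locked status and its time out could interact with least recently used replacement is sketched below in Python. This is an assumption-laden illustration rather than the disclosed mechanism: the names `LineState`, `is_evictable`, and `pick_victim` are hypothetical, the time out is modeled as a period of inactivity measured from the last use of the line, and inheritance of the locked bit by a next FLC stage is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LineState:
    last_used: float                      # time of last access, in seconds
    locked: bool = False
    lock_timeout: Optional[float] = None  # inactivity limit; None = lock never lapses


def is_evictable(state: LineState, now: float) -> bool:
    """Unlocked lines are always evictable; locked lines only after time out."""
    if not state.locked:
        return True
    if state.lock_timeout is None:
        return False
    return now - state.last_used >= state.lock_timeout


def pick_victim(lines: dict, now: float) -> Optional[int]:
    """Return the address of the least recently used evictable line, if any."""
    candidates = [(s.last_used, addr) for addr, s in lines.items()
                  if is_evictable(s, now)]
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    lines = {
        0xA: LineState(last_used=1.0, locked=True, lock_timeout=100.0),
        0xB: LineState(last_used=5.0),
        0xC: LineState(last_used=9.0),
    }
    print(hex(pick_victim(lines, now=10.0)))    # 0xb: the locked line is skipped
    print(hex(pick_victim(lines, now=200.0)))   # 0xa: the lock has timed out
```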
HBMx FLC System with Other Disclosed Systems
It is contemplated and disclosed that the embodiments that combine an HBMx system with an FLC controller may be incorporated and utilized in any of the embodiments and configurations shown in
Number | Date | Country | |
---|---|---|---|
63467835 | May 2023 | US |