The field of invention pertains generally to the computing sciences, and, more specifically, to a data center environment with customizable software caching levels.
With the growing importance of cloud-computing services and network and/or cloud storage services, the data center environments from which such services are provided are under increasing demand to utilize their underlying hardware resources more efficiently so that better performance and/or customer service is realized from the underlying hardware resources.
A better understanding of the present invention can be obtained from the following detailed description in conjunction with the following drawings, in which:
The server computer 102 can be viewed as a peripheral component that relies on various centralized functions of the data center 103. For example, the software programs 101 may rely on the data center 103 for various cloud-like services such as: 1) Internet and/or other network access; 2) one or more persisted databases and/or non volatile mass storage resources 105, 106; 3) load balancing of incoming new requests (e.g., received from the Internet) directed to the software programs 101; 4) failover protection for any of the server computers that are coupled to the data center 103; 5) security; and/or, 6) management and statistics monitoring.
In the specific caching architecture of FIG. 1, the software programs 101 are serviced by a traditional hierarchy of L1, L2 and L3 caching levels.
High performance software programs have traditionally been monolithic or, said another way, largely self-contained, in terms of the logic and processes that they utilize to effect their respective functions. In a sense, the overall traditional implementation 100 of FIG. 1 is similarly coarse-grained.
Because of the coarse-grained nature of the overall implementation 100, the caching functions themselves are relatively simplistic. Essentially, caching for all software programs includes all caching levels (L1, L2 and L3), and the levels are utilized/accessed in strict sequential order. That is, if an item of data is not found in a particular caching level it is looked for in the immediately next lower caching level, or, similarly, if an item of data is evicted from a particular caching level it is entered into the immediately next lower caching level. This simple caching function is followed for all software processes, including each of the multiple and various different kinds of software processes that can exist within the monolithic software bodies 101 themselves. The traditional caching structure of FIG. 1 is therefore both coarse-grained and unilateral.
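By way of contrast with the customized approach described later, the strict sequence just described can be modeled in a few lines. The sketch below is purely illustrative (the map-based cache model and all names are assumptions, not anything from the embodiments): every request, from every program, walks the levels in the same fixed order.

```cpp
#include <array>
#include <cstdint>
#include <optional>
#include <unordered_map>

// One caching level, modeled (for illustration only) as an address -> line map.
struct CacheLevel {
    std::unordered_map<uint64_t, uint64_t> lines;
    std::optional<uint64_t> lookup(uint64_t addr) const {
        auto it = lines.find(addr);
        if (it == lines.end()) return std::nullopt;
        return it->second;
    }
};

// Traditional behavior: every request from every program walks L1 -> L2 -> L3
// in strict sequence; a miss at one level is always looked up at the next.
std::optional<uint64_t> traditional_lookup(const std::array<CacheLevel, 3>& levels,
                                           uint64_t addr) {
    for (const auto& level : levels)
        if (auto line = level.lookup(addr))
            return line;               // hit at this level
    return std::nullopt;               // missed L1, L2 and L3: go to memory
}
```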
Two emerging changes, however, one in software structure and another in hardware caching level structure, provide an opportunity to at least partially remove the coarse-grained and unilateral caching service and replace it with a more fine-grained and customized caching service approach.
Referring to FIG. 2, a first change is that application software programs are increasingly being decomposed into smaller bodies of software.
The smaller bodies of software can, in various instances, support the software logic of more than one application software program. Here, functions that are common or fundamental to many different types of application software programs (e.g., user identification, user location tracking, cataloging, order processing, marketing, etc.) are being instantiated as “micro-services” 210 within the overall software solution 201 that the respective custom logic of each application software program 211 calls upon and utilizes. As such, whereas older generation application programs were written with custom code that internally performed these services, by contrast, newer generation application software 211 is becoming more and more composed of just the custom logic that is specific to the application with embedded functional calls as needed to the micro-services 210 that have been instantiated within a lower level software platform.
A second change is the increased number of caching levels offered by the hardware and/or data center architecture. With respect to the actual hardware, advances in the physical integration of DRAM memory, such as embedded DRAM (eDRAM) and die stacking technologies (e.g., High Bandwidth Memory (HBM)) and/or the integration of emerging byte addressable non volatile memory technology as a replacement for DRAM in system memory have resulted in additional CPU level caches (e.g., L4 and/or L5 caches) and/or “memory side” caches 212 that behave as a front-end cache of the system memory.
The new lower level (L4, L5) CPU level cache(s) architecturally reside beneath the traditional SRAM L3 cache of FIG. 1.
Emerging byte addressable non volatile memory as a replacement for DRAM in system memory 209 has resulted in multi-level system memory architectures in which, e.g., a higher level of DRAM acts as a memory side cache 212_1, 212_2 for the slower emerging non volatile memory which is allocated the system memory address space of the computer. Here, the memory side cache 212 can be viewed as a “front-end” cache for system memory that speeds up system memory performance for all components that use system memory (e.g., the CPU cores, GPUs, peripheral controllers, network interfaces, etc.). Nevertheless, because CPU cores heavily utilize system memory, memory side caches can be viewed as a caching level in the hardware architecture from the perspective of a CPU core even though such memory side caches are not strictly CPU caches (because they do not strictly cache data only for CPU cores).
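To make the front-end role of a memory side cache concrete, consider the following minimal software model (a direct-mapped organization and all names are assumptions chosen for brevity, not anything prescribed by the embodiments):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A DRAM memory side cache in front of non volatile system memory: every
// agent that reads system memory (core, GPU, peripheral) goes through read().
struct MemorySideCache {
    struct Slot { bool valid = false; uint64_t tag = 0; uint64_t line = 0; };
    std::vector<Slot> slots;              // small, fast DRAM
    const std::vector<uint64_t>* nvm;     // large, slow non volatile memory

    MemorySideCache(std::size_t n_slots, const std::vector<uint64_t>& backing)
        : slots(n_slots), nvm(&backing) {}

    uint64_t read(uint64_t line_addr) {
        Slot& s = slots[line_addr % slots.size()];
        if (!s.valid || s.tag != line_addr) {
            // Miss: pay the slow non volatile access once, then keep the line.
            s = Slot{true, line_addr, (*nvm)[line_addr]};
        }
        return s.line;                    // hit path runs at DRAM speed
    }
};
```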
For simplicity, much of the remainder of the discussion will assume only one additional lower CPU caching level (an L4 cache). However, the reader should understand that more than one additional CPU caching level (e.g., an L5 cache) is possible and that the teachings below apply to such implementations.
Additionally or in the alternative, in systems where system memory is implemented with dual in line memory modules (DIMMs) that plug into the system, one or more memory side caches may be structured into the DIMMs. For example, one or more DRAM DIMMs may plug into a same memory channel as one or more emerging non volatile memory DIMMs. Here, the DRAM DIMMs may act as a memory side cache on the memory channel for the non volatile DIMMs. In yet other implementations the entire combined capacity of the DRAM DIMMs may be treated as a single cache such that a DIMM on one channel can cache data stored on a non volatile DIMM on another channel.
Additionally or in the alternative a single DIMM may have both DRAM and non volatile memory where the DRAM acts as a memory side cache on the DIMM for the non volatile memory. Alternatively the DRAM may be used as a memory side cache for the DIMM's memory channel or for all of system memory.
Regardless, note the potential for many more caching levels including more than one memory side cache. For example, a single system may have three active memory side caches (e.g., stacked DRAM that caches all of system memory as a highest memory side cache level, DRAM DIMMs that act as a memory side cache for their respective memory channel as a middle memory side cache level, and DIMMs having both DRAM and non volatile memory where the DRAM acts as a memory side cache for just the DIMM as a lowest memory side cache level). For simplicity, much of the remainder of the discussion will assume only one memory side cache level. However, the reader should understand that multiple memory side caching levels are possible and that the teachings below apply to such implementations.
Further still, a DIMM is just one type of pluggable memory component having memory capacity with integrated memory chips that can plug into a fixture, e.g., of a system motherboard or CPU socket, to expand the memory capacity of the system it is plugged into. Over the years, other types of pluggable memory components may emerge (e.g., having a different form factor than a DIMM). Here, the customizable caching resources (and possibly the look-up and gateway circuitry) may also reside on a pluggable memory component.
A further data caching improvement is the presence of a data center edge cache 213. Here, the data center itself caches frequently accessed data items at the "edge" of the data center 203 so that, e.g., the penalty of accessing an inherently slower database 205, 206 or mass storage resource that resides within the data center is avoided. The edge cache 213 can be seen as a data cache that caches the items that are most frequently requested of the data center. Thus, the edge cache 213 may collectively cache items that are persisted in different databases, different mass storage devices and/or other devices located within the data center.
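As a rough illustration of the edge cache's role (not the embodiments' actual mechanism), the following sketch keeps the most recently requested items at the data center's edge with a simple LRU policy; all names are assumptions:

```cpp
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Edge cache sketch: holds the items most frequently requested of the data
// center, in front of whichever database or storage device persists them.
class EdgeCache {
    using Item = std::pair<std::string, std::string>;   // key -> value
    std::size_t capacity_;
    std::list<Item> lru_;                                // front = most recent
    std::unordered_map<std::string, std::list<Item>::iterator> index_;

public:
    explicit EdgeCache(std::size_t capacity) : capacity_(capacity) {}

    // Hit: refresh recency and serve from the edge. Miss: the caller goes
    // deeper into the data center (database 205/206 or mass storage).
    std::optional<std::string> get(const std::string& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return std::nullopt;
        lru_.splice(lru_.begin(), lru_, it->second);     // move to front
        return lru_.front().second;
    }

    void put(const std::string& key, const std::string& value) {
        auto it = index_.find(key);
        if (it != index_.end()) {                        // overwrite in place
            lru_.erase(it->second);
            index_.erase(it);
        }
        lru_.emplace_front(key, value);
        index_[key] = lru_.begin();
        if (lru_.size() > capacity_) {                   // evict the coldest
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
    }
};
```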
Thus, returning to a comparison of FIGS. 1 and 2, the environment of FIG. 2 offers many more caching levels that can be customized on a per program basis. FIG. 3 shows an embodiment of such a customizable caching architecture. Here, the L1 caching level continues to operate as a traditional CPU cache whose treatment is not customized.
By contrast, all caching levels beneath the L1 cache level can be customized. As such, the L2 cache level includes a gateway function 301 that determines, for each cache miss from a higher L1 cache, whether the miss is to be serviced by the L2 cache. Here, as is known in the art, each request for data from a cache essentially requests a cache line of data identified by a particular system memory address. The gateway logic 301 of the L2 cache includes internal information that identifies which system memory address ranges are to receive L2 cache treatment and which ones are not. If an incoming request from an L1 miss specifies a system memory address that is within one of the ranges that the L2 cache is configured to support, the request is passed to the look-up logic of the L2 cache which performs a look-up for the requested cache line.
Here, as is known in the art, software programs are allocated system memory address space. In various embodiments, if the address of the requested cache line falls within one of the address ranges that the L2 cache is configured to support, the address range that the request falls within corresponds to the address range (or portion thereof) that has been allocated to the software program that presently needs the requested data. Thus, by configuring the allocated system memory address range (or portion thereof) of the software program that has issued the request for the cache line's data into the gateway 301 of the L2 cache, the software program is effectively configured with L2 cache service. Software programs (or portions thereof) that are not to be configured with L2 cache service do not have their corresponding system memory address ranges programmed into the L2 cache gateway 301 for purposes of determining whether or not L2 cache service is to be provided.
Continuing with the present example, assuming that the incoming request is for a software program that has been configured with L2 cache service, the request's address will fall within an address range that has been programmed into the L2 cache gateway for L2 cache service. If the requested cache line is found in the L2 cache, the cache line is returned to the requestor (the pipeline that requested the data).
If the cache line is not found in the L2 cache, or if the request's address is not within an address range that has been configured for L2 cache service (e.g., the software thread that issued the cache line request belongs to a software program that has not been configured to receive L2 cache service), the gateway logic 301 of the L2 cache determines which cache level is the next appropriate cache level for the request. Thus, in the particular embodiment of FIG. 3, the gateway logic 301 is programmed with information that identifies, from the request's address, which lower caching level (e.g., L3, L4, the memory side cache (MSC) or system memory) the request is to be forwarded to next.
As such, a request that is not serviced at the L2 level can skip one or more intervening caching levels and be directed to whichever lower caching level has been configured for the software program that issued the request.
Ideally, the gateway logic of any of the lower cache levels (L3, L4 and MSC) need not determine whether cache treatment is appropriate. That is, because the gateway logic 301 of the L2 level sends all lower requests to their correct cache level, the recipient level need not ask whether the request is to be processed at that level (the answer is always yes). As such, the gateway logic of the lower L3, L4 and MSC levels need only determine the next correct lower level in the case of a cache miss at the present level. Evictions from a particular cache level are handled similarly, in that an address range that the evicted cache line is associated with is programmed into the cache level's gateway, which informs the gateway as to which lower level cache the evicted cache line is to be directed to.
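The routing role of the gateways described above lends itself to a short illustration. The sketch below is a software model only, with assumed type names and a table-driven design; the embodiments describe gateway logic circuitry, not code. It captures the two decisions a gateway makes: whether its level services a request, and which lower level a miss or eviction goes to next.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

enum class Level { L2, L3, L4, MSC, SystemMemory };

struct Range { uint64_t base = 0, limit = 0; };  // [base, limit)

static bool contains(const Range& r, uint64_t a) {
    return a >= r.base && a < r.limit;
}

// Gateway logic of one caching level: programmed with (a) the address ranges
// it services and (b) per range, the next lower level for misses/evictions.
class Gateway {
    std::vector<Range> serviced_;                 // ranges cached at this level
    std::vector<std::pair<Range, Level>> next_;   // range -> next lower level
public:
    Gateway(std::vector<Range> serviced, std::vector<std::pair<Range, Level>> next)
        : serviced_(std::move(serviced)), next_(std::move(next)) {}

    bool services(uint64_t addr) const {
        for (const auto& r : serviced_)
            if (contains(r, addr)) return true;
        return false;
    }

    // Destination for a miss at this level, or for a cache line evicted from
    // this level: requests can skip intervening caching levels entirely.
    Level next_level(uint64_t addr) const {
        for (const auto& [r, lvl] : next_)
            if (contains(r, addr)) return lvl;
        return Level::SystemMemory;  // unconfigured addresses bypass all caches
    }
};
```

Under this model, a program configured for, say, L2 and MSC service only would appear in the L2 gateway's serviced ranges and map to Level::MSC in its routing table, so an L2 miss or eviction skips the L3 and L4 levels entirely.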
The pathways observed in FIG. 3 depict these different possible miss and eviction routes between the caching levels down to system memory.
As is known in the art, lower level software, such as an operating system instance or virtual machine monitor understands which software programs have been allocated which system memory address space ranges. As such, the software “knows” if a needed item of data is within system memory or not. In cases where a needed item of data is known to not be physically present in system memory, the software instead asks deeper non volatile mass storage for one or more “pages” of data that include the needed data to be moved from mass storage to system memory.
Referring briefly back to FIG. 2, recall that the improved data center environment also includes an edge cache 213 that caches the items most frequently requested of the data center.
As observed in FIG. 4, requests that are sent into the data center can be processed by a switch core 402 that determines whether a requested item is present in the edge cache or, instead, must be serviced by a deeper database or mass storage resource 405.
Further still, the emergence of byte addressable non volatile memory as a replacement for DRAM in system memory has blurred the lines between traditional system memory and traditional storage. As such, conceivably, system memory may be deemed to include the address space of the mass non volatile storage 405, and/or the data access granularity at the edge cache and/or mass storage device(s) 405 may be a cache line or at least something less than one or more pages of data (or at least something smaller than one traditional 4 kB page of data). In the case of the former (the mass storage device 405 is deemed a system memory component), the edge cache becomes, e.g., another CPU level cache (e.g., an L5 cache). In this case, the switch core 402 can be designed to be programmed with the kind of functionality described above for the gateway logic of the caching levels of FIG. 3.
In reference to the exemplary system of FIG. 5, note that the gateways of the various caching levels can be programmed, and reprogrammed, during runtime in view of the overall state of the system.
For example, if the state of the overall system is such that a few of the currently executing programs are high performance programs (highly sensitive to L2, L3 or L4 cache misses) while the remaining executing programs are relatively low performance programs (indifferent to L2, L3 or L4 cache misses), then the configuration software 502 may change the settings of the L2, L3 and L4 gateways to provide as much of the L2, L3 and L4 caching resources as possible to the high performance programs but not to the low performance programs. Here, the aforementioned state of the overall system (execution of a few high performance programs alongside the remaining low performance programs) may be detected by management software 501 that oversees operation of the overall system, including recognition of actively executing programs, cache utilization levels, statistics tracking, etc. By reporting its observations to the caching configuration software 502, the caching configuration software can "tweak" which actively executing programs are allocated to which caching levels. Thus, the addresses that are programmed into the gateways change over time. Although described as software, the management 501 and configuration 502 functions can also be implemented in hardware or as combinations of software and hardware, partially or wholly.
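The division of labor just described (management software 501 observes, configuration software 502 reprograms the gateways) might be sketched as follows; the data structures and the 50% utilization threshold are illustrative assumptions, not anything specified by the embodiments:

```cpp
#include <cstdint>
#include <vector>

struct Range { uint64_t base = 0, limit = 0; };  // [base, limit)

// What the management software (501) reports for one caching level.
struct Observation {
    double utilization;              // fraction of the level's capacity in use
    std::vector<Range> high_perf;    // ranges of latency-sensitive programs
    std::vector<Range> low_perf;     // ranges of latency-indifferent programs
};

// What the configuration software (502) programs into that level's gateway.
std::vector<Range> retune(const Observation& obs) {
    std::vector<Range> serviced = obs.high_perf;  // always serve high performers
    if (obs.utilization < 0.5)                    // spare capacity: admit others
        serviced.insert(serviced.end(),
                        obs.low_perf.begin(), obs.low_perf.end());
    return serviced;
}
```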
In further or related embodiments, different configuration settings are programmed into the gateways pre-runtime, and which configuration settings are utilized depends on, e.g., caching level utilization. For example, a gateway may be configured to allocate only a small percentage of the address space for service at a particular caching level for each of a large number of different software programs under high capacity utilization of the caching level. However, the gateway is also programmed to allocate more address space per program as the capacity utilization of the caching level recedes.
Alternatively or in combination, a gateway may be configured to not permit caching service for certain programs while utilization levels are high. However, as utilization of the caching level recedes, the respective address spaces of these programs are programmed into the gateway to open up caching service at the caching level for these programs. Here, the utilization levels and address space ranges can be programmed into the gateway pre-runtime, and the gateway has logic to use the correct address ranges based on the utilization of its respective cache level.
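This pre-runtime variant can be sketched as a threshold table held by the gateway itself (names and the tier structure are assumed): the gateway selects the active range set from its own current utilization, with no runtime reprogramming needed.

```cpp
#include <cstdint>
#include <vector>

struct Range { uint64_t base = 0, limit = 0; };  // [base, limit)

// One pre-programmed tier: the range set to use while utilization is at or
// below the tier's threshold. Tiers are ordered from generous to restrictive.
struct Tier {
    double max_utilization;        // tier applies while utilization <= this
    std::vector<Range> serviced;   // low-utilization tiers admit more ranges
};

// Selected by the gateway's own logic from its current capacity utilization.
// `tiers` is assumed non-empty and sorted by max_utilization ascending.
const std::vector<Range>& active_ranges(const std::vector<Tier>& tiers,
                                        double utilization) {
    for (const auto& t : tiers)
        if (utilization <= t.max_utilization)
            return t.serviced;
    return tiers.back().serviced;  // fully loaded: most restrictive range set
}
```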
Where the caching circuitry of FIG. 6 is used to implement a caching level, the gateway and look-up logic of the level can likewise be configured to effect customized caching treatment on a per program basis.
The following different kinds of software micro-services and/or other bodies of more granular code may make use of customized caching level treatment with, e.g., the below suggested customized caching configurations (consolidated in the configuration sketch that follows the list).
1. Software that provides information for immediate display to a user (e.g., a product catalog micro-service, an on-line order micro-service, etc.) may be configured at least with the lowest latency caches (e.g., L1, L2, L3, L4) if not all caching levels to ensure potential customers do not become annoyed with slower performance of, e.g., an on-line service.
2. Statistics collection software tends to run as background processes that have no immediate need for data. As such, such processes tend to be indifferent to data access latency and can be "left out" of the lowest latency caching levels if not all caching levels (e.g., be configured with little or no caching level support).
3. Machine learning software processes, or other processes that rely on sets of references that must be accessed with low latency, may be configured to consume large amounts of L1, L2, L3 and L4 caching level support, at least to ensure that the references are on-die or just off-die. Here, at a minimum, the system memory addresses of these references may be programmed into each of the L1, L2, L3 and L4 caching levels to ensure the references receive caching treatment at these levels.
5. Software processes that use tiled data structures (e.g., graphics processing software threads that break an image down into smaller, rectangular tiles), where such tiles are called up once from memory/storage, operated upon by the software and then written back with little/no access thereafter, may be configured to have the lowest latency caching levels (e.g., L1, L2, L3) but no lower level caching support (e.g., L4, MSC and edge cache). Here, e.g., after being operated on at the L1, L2 and L3 levels, each tile is not utilized again. As such, an eviction path from the L3 level to the L4, MSC and/or edge cache levels would only consume these caching resources with little/no access activity being issued to them. The tiles can therefore be written directly back to mass storage or system memory without consuming/wasting any of the L4, MSC or edge cache resources.
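Pulling the above suggestions together, one purely hypothetical way to express them is as a per program-type policy table that configuration software could translate into gateway settings; the program kinds and flags below are illustrative only, and the L1 column is always on because, as described above, the L1 level is not customized:

```cpp
#include <string>
#include <vector>

// Which caching levels a program type's address ranges are admitted to.
struct CachePolicy { bool l1, l2, l3, l4, msc, edge; };

struct SuggestedPolicy { std::string program_kind; CachePolicy levels; };

const std::vector<SuggestedPolicy> kSuggested = {
    // user-facing micro-services: every level, for lowest perceived latency
    {"catalog / on-line ordering",  {true, true,  true,  true,  true,  true }},
    // background statistics collection: indifferent to latency
    {"statistics collection",       {true, false, false, false, false, false}},
    // machine learning reference sets: on-die or just-off-die residency
    {"machine learning references", {true, true,  true,  true,  false, false}},
    // tiled data: touched once, then written back past the lower levels
    {"graphics tiles",              {true, true,  true,  false, false, false}},
};
```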
Note that exclusive caches can also be easily implemented with the above described architecture. Here, an exclusive cache is a cache that is dedicated to a particular entity, such as a particular software application, such that competing requests for a same cache item and/or cache slot are not possible. Here, traditional caches include coherency logic to deal with the former and snoop/hashing logic (e.g., logic that hashes a request's address to identify its cache slot) to deal with the latter. Coherency logic and snoop logic are generally associated with the look-up logic 602 of FIG. 6.
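In gateway terms, exclusivity can be expressed by programming exactly one owner range into a caching level, as in the minimal, assumed sketch below; with a single owner, competing requests for the same item or slot cannot arise at that level:

```cpp
#include <cstdint>

struct Range { uint64_t base = 0, limit = 0; };  // [base, limit)

// A caching level made exclusive to one program: its gateway admits only the
// owner's address range, so no other program can contend for lines or slots.
struct ExclusiveGateway {
    Range owner;
    bool services(uint64_t addr) const {
        return addr >= owner.base && addr < owner.limit;
    }
};
```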
An applications processor or multi-core processor 750 may include one or more general purpose processing cores 715 within its CPU 701, one or more graphical processing units 716, a memory management function 717 (e.g., a memory controller) and an I/O control function 718. The general purpose processing cores 715 typically execute the operating system and application software of the computing system which may include micro-service software programs as described above. Even lower levels of software may be executed by the processing cores such as, e.g., a virtual machine monitor.
The graphics processing unit 716 typically executes graphics intensive functions to, e.g., generate graphics information that is presented on the display 703. The memory control function 717 (e.g., a system memory controller) interfaces with the system memory 702 to write/read data to/from system memory 702. The power management control unit 712 generally controls the power consumption of the system 700.
Each of the touchscreen display 703, the communication interfaces 704-707, the GPS interface 708, the sensors 709, the camera(s) 710, and the speaker/microphone codec 713, 714 all can be viewed as various forms of I/O (input and/or output) relative to the overall computing system including, where appropriate, an integrated peripheral device as well (e.g., the one or more cameras 710). Depending on implementation, various ones of these I/O components may be integrated on the applications processor/multi-core processor 750 or may be located off the die or outside the package of the applications processor/multi-core processor 750.
Different caching levels of the system (e.g., the L1, L2, L3 and L4 levels of a processor chip that contains the processing cores 715, the memory controller 717 and the I/O controller 718 (also referred to as a peripheral controller)) may have a gateway function for determining which requests are to receive local cache treatment and/or which lower cache level is the appropriate cache miss or eviction destination. The gateway function and associated look-up circuitry may be implemented with any of hardware logic circuitry, programmable logic circuitry (e.g., SRAM, DRAM, FPGA, PLD, PLA, etc.) and/or logic circuitry that is designed to execute some form of program code (e.g., an embedded processor, an embedded controller, etc.). The local cache resources that are associated with the gateway and look-up circuitry may be implemented with any information retention circuitry (e.g., DRAM circuitry, SRAM circuitry, non volatile memory circuitry, etc.).
Embodiments of the invention may include various processes as set forth above. The processes may be embodied in machine-executable instructions. The instructions can be used to cause a general-purpose or special-purpose processor to perform certain processes. Alternatively, these processes may be performed by specific/custom hardware components that contain hardwired logic circuitry or programmable logic circuitry (e.g., FPGA, PLD) for performing the processes, or by any combination of programmed computer components and custom hardware components.
Elements of the present invention may also be provided as a machine-readable medium for storing the machine-executable instructions. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs, and magneto-optical disks, FLASH memory, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, propagation media or other type of media/machine-readable medium suitable for storing electronic instructions. For example, the present invention may be downloaded as a computer program which may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.