Examples described herein are generally related to reducing power consumption during a system idle state.
Client platforms such as laptops, notebooks, and Chromebooks use either double data rate (DDR) or low power DDR (LPDDR) based Dynamic Random-Access Memory (DRAM) technology for system memory. Recent trends in client platforms show that memory is growing in bit density, which in turn means higher capacity. For example, system memory supported in laptops ranges between 8 GB and 64 GB.
During standby states such as Windows® Modern Standby, Chrome® Lucid Sleep and S3, the lowest power mode the DRAM memory currently enters is self-refresh. Self-refresh means that the capacitance in the memory must be supplied with power periodically so that the data is retained. This contributes to higher power consumption. The higher the memory capacity, the higher the self-refresh power. The impact of memory power is significant when a client platform is in low power standby states that cause other platform components such as the system-on-chip (SoC), devices, etc. to consume relatively little power compared to the power consumed by system memory.
Currently, memory power consumption (especially for DDR memory) is high in client platform standby power states (e.g., about 20-40% of client platform power). The higher power associated with keeping memory in self-refresh has a significant impact on battery drain in laptops and on meeting energy regulations, which are becoming increasingly stringent for some types of desktops.
In some examples, different kinds of pages may be hosted by an operating system (OS). Amongst these different kinds of pages are pinned/locked pages. Locked pages are pages that are always physically present in the system's random access memory (RAM) (hereinafter referred to as “system memory”). Pinned pages are pages that are always present at a specific address location of the system memory. Devices utilizing system memory may request locked or pinned pages via direct memory access (DMA). The OS and other software may also keep certain sections of their code in locked pages for performance purposes. When pages requested by an OS or software are not physically present in system memory, the system causes a page fault, which impacts performance. When pages used by devices for DMA are not in memory, the result can be memory corruption, device errors and failures. In some examples, locked pages are scattered across multiple system memory regions and/or physically stored to multiple DRAM devices or chips.
During a traditional suspend-to-RAM type of operation, the pinned and locked pages are retained in a dynamic random-access memory (DRAM) along with other data. Retaining the pages in the DRAM incurs power due to the DRAM self-refresh operation. During a traditional suspend-to-disk operation, the entire contents of the DRAM are moved to a slower storage disk. The slower storage disk does not incur the DRAM self-refresh power, but the system is slow to wake while all the contents of the DRAM are restored from the slower storage disk. To achieve both lower DRAM self-refresh power and lower system wake times, the pinned and locked pages in the DRAM can be moved to one segment (e.g., maintained in one or more DRAM chips) of the system memory, which would be self-refreshed. Since the movement of the pinned and locked pages is from DRAM to DRAM, the system wake latency would see a significant improvement over suspend-to-disk.
The Joint Electron Device Engineering Council (JEDEC) memory standard supports memory power management features such as Maximum Power Save Mode (MPSM) and Partial Array Self-Refresh (PASR), where self-refresh to portions of memory such as ranks and banks can be turned off. However, the standard does not provide any suitable implementation of such features. Further, the issues mentioned above related to locked/pinned pages spread among multiple memory devices may inhibit or even prohibit the shutting down of portions of memory. This inhibition/prohibition of shutting down portions of memory may result in higher than desired power consumption in a system idle power state. For example, in an ACPI S3 standby power state, entire system-on-chip (SoC) power is approximately 5 mW and platform power is approximately 270 mW. Memory consumes a significant share of this power. For instance, 16 GB DDR5 in self-refresh consumes about 160 mW, which is about 60% of the platform power consumption. Likewise, 16 GB LPDDR5 in self-refresh consumes about 22 mW. Turning off the self-refresh power for 16 GB DDR5, while keeping one segment in self-refresh, saves about 60 mW. Turning off the self-refresh power for 16 GB LPDDR5, while keeping one segment in self-refresh, saves about 8 mW. If more than the one segment needs to stay in self-refresh, power savings will be reduced.
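The figures quoted above can be checked with a few lines of arithmetic. The following sketch is only a worked calculation over the numbers stated in this description. The variable and function names are illustrative assumptions and are not part of any described implementation.

```python
# Worked arithmetic over the power figures quoted in the description (milliwatts).
# All names are illustrative assumptions, not part of any described implementation.

PLATFORM_S3_MW = 270.0          # approximate platform power in ACPI S3
DDR5_SELF_REFRESH_MW = 160.0    # 16 GB DDR5 in self-refresh
DDR5_SAVINGS_MW = 60.0          # savings when only one segment stays in self-refresh
LPDDR5_SELF_REFRESH_MW = 22.0   # 16 GB LPDDR5 in self-refresh
LPDDR5_SAVINGS_MW = 8.0         # savings when only one segment stays in self-refresh


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


ddr5_share_of_platform = percent(DDR5_SELF_REFRESH_MW, PLATFORM_S3_MW)
ddr5_saved_fraction = percent(DDR5_SAVINGS_MW, DDR5_SELF_REFRESH_MW)
lpddr5_saved_fraction = percent(LPDDR5_SAVINGS_MW, LPDDR5_SELF_REFRESH_MW)
platform_after_ddr5_savings = PLATFORM_S3_MW - DDR5_SAVINGS_MW

print(f"DDR5 share of platform power: {ddr5_share_of_platform:.1f}%")    # 59.3%
print(f"DDR5 self-refresh power saved: {ddr5_saved_fraction:.1f}%")      # 37.5%
print(f"LPDDR5 self-refresh power saved: {lpddr5_saved_fraction:.1f}%")  # 36.4%
print(f"Platform power after DDR5 savings: {platform_after_ddr5_savings:.0f} mW")  # 210 mW
```

The calculation shows that the quoted DDR5 self-refresh figure is roughly 59% of the quoted platform power, consistent with the "about 60%" stated above.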
According to some examples, solutions implemented by a Chrome® OS or a Linux OS use a concept of setting up various types of zones in system memory to include a “zone movable”, a “zone normal” and a “zone DMA”. Zone normal holds all pinned pages, zone DMA holds DMA pages and zone movable holds pageable pages. When a system using these zones triggers/enters PASR, memory devices with addresses mapped to zone normal and zone DMA are maintained in self-refresh and memory devices with addresses mapped to zone movable are turned off. This solution may work for Chrome® OS or a Linux OS but does not work for Windows® OS, for various reasons that add an unacceptable level of complexity to mapping system memory to include these types of zones. Hence, Windows® OS memory managers lack the concept of these zones.
In some examples, another solution may include use of a mechanism where the locked/pinned pages are saved and restored by a hardware accelerator which is transparent to the OS. For these examples, the addition of a hardware accelerator to a lower cost client platform may add an unacceptable level of complexity that exceeds the benefits of using the hardware accelerator to reduce power consumption and improve exit times from a standby state. It is with respect to these challenges that the examples described herein are needed.
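The zone-based approach can be illustrated with a minimal sketch. The device names, the dictionary layout and the rule that a device holding a mix of zones must stay in self-refresh are assumptions made for illustration only. They are not taken from any Chrome® OS or Linux implementation.

```python
# Hypothetical sketch of zone-based PASR selection.
# Device names and the data layout are illustrative assumptions.

ZONE_MOVABLE = "zone movable"   # pageable pages
ZONE_NORMAL = "zone normal"     # pinned pages
ZONE_DMA = "zone DMA"           # DMA pages


def devices_to_turn_off(device_zones):
    """Return the devices whose addresses map only to zone movable.

    Assumption: a device that also holds zone normal or zone DMA
    addresses must be kept in self-refresh.
    """
    return sorted(
        device
        for device, zones in device_zones.items()
        if set(zones) == {ZONE_MOVABLE}
    )


example = {
    "dram0": {ZONE_DMA, ZONE_NORMAL},
    "dram1": {ZONE_NORMAL, ZONE_MOVABLE},
    "dram2": {ZONE_MOVABLE},
    "dram3": {ZONE_MOVABLE},
}
print(devices_to_turn_off(example))  # ['dram2', 'dram3']
```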
According to some examples, each processor 103 is a die, dielet, or chiplet. Here the term “die” generally refers to a single continuous piece of semiconductor material (e.g. silicon) where transistors or other components making up a processor core may reside. Multi-core processors may have two or more processors on a single die, but alternatively, the two or more processors may be provided on two or more respective dies. In some examples, dies are of the same size and functionality i.e., symmetric cores. However, dies can also be asymmetric. For example, some dies have different size and/or function than other dies. Each processor 103 may also be a dielet or chiplet. Here the term “dielet” or “chiplet” generally refers to a physically distinct semiconductor die, typically connected to an adjacent die in a way that allows the fabric across a die boundary to function like a single fabric rather than as two distinct fabrics. Thus at least some dies may be dielets.
In some examples, fabric 104 may be a collection of interconnects or a single interconnect that allows the various dies to communicate with one another. Here the term “fabric” generally refers to a communication mechanism having a known set of sources, destinations, routing rules, topology, and other properties. The sources and destinations may be any type of data handling functional unit such as, but not limited to, power management units or mapping agents. Fabrics can be two-dimensional, spanning along an x-y plane of a die, and/or three-dimensional (3D), spanning along an x-y-z space of a stack of vertically and horizontally positioned dies. A single fabric may span multiple dies. A fabric can take any topology such as a mesh topology, a star topology, or a daisy chain topology. A fabric may be part of a network-on-chip (NoC) with multiple agents. These agents can be any functional unit.
According to some examples, each processor 103 may include a number of processor cores. One such example is illustrated with reference to processor 103-10. In this example, processor 103-10 includes a plurality of processor cores 106-1 through 106-M, where M is a number. For the sake of simplicity, a processor core is referred to by the general label 106. Here, the term “processor core” generally refers to an independent execution unit that can run one program thread at a time in parallel with other cores. In some examples, all processor cores are of the same size and functionality, i.e., symmetric cores. However, processor cores can also be asymmetric. For example, some processor cores may have different sizes and/or functions than other processor cores. A processor core may be a virtual processor core or a physical processor core. Processor 103-10 may include an integrated voltage regulator (IVR) 107, a phase locked loop (PLL) and/or frequency locked loop (FLL) 109 and/or a power control unit (P-unit) 111. The various blocks of processor 103-10 may be coupled via an interface or fabric.
In some examples, as described more below, cores 106 of processor 103 may be partitioned or mapped by a non-uniform memory access (NUMA) mapping agent 108 into a plurality of virtual NUMA nodes to facilitate directing memory requests associated with pinned/locked pages of data to a portion of system memory (e.g., segment(s) maintained in DRAM 110). As shown in
According to some examples where NUMA mapping agent 108 is included on processor 103-10, NUMA mapping agent 108 may be coupled to OS 102 via an interface. Also, P-unit 111 may similarly be coupled to OS 102 via an interface. As used herein, the term “interface” generally refers to software and/or hardware at processor 103-10 used to communicate with an interconnect. An interface may include logic and an I/O driver/receiver to send and receive data over the interconnect or one or more wires.
In some examples, each processor 103 is coupled to a power supply via a voltage regulator. The voltage regulator may be internal to single socket processor system 101 (e.g., on the package of single socket processor system 101) or external to single socket processor system 101. In some examples, each processor 103 includes IVR 107 that receives a primary regulated voltage from the voltage regulator of single socket processor system 101 and generates an operating voltage for agents, logic and/or features of processor 103. The agents of processor 103 are the various components of processor 103 including cores 106, IVR 107, NUMA mapping agent 108, PLL/FLL 109.
Accordingly, an implementation of IVR 107 may allow for fine-grained control of voltage and thus power and performance of each individual core 106. As such, each core 106 can operate at an independent voltage and frequency, enabling great flexibility and affording wide opportunities for balancing power consumption with performance. In some embodiments, the use of multiple IVRs enables the grouping of components into separate power planes, such that power is regulated and supplied by the IVR to only those components in the group. For example, each core 106 may include an IVR to manage power supply to that core where that IVR receives input power supply from the regulated output of IVR 107 or voltage regulator of single socket processor system 101. During power management, a given power domain of one IVR may be powered down or off when the processor core 106 is placed into a certain low power state, while another power domain of another IVR remains active, or fully powered. As such, an IVR may control a certain domain of a logic or processor core 106. Here the term “power domain” generally refers to a logical or physical perimeter that has similar properties (e.g., supply voltage, operating frequency, type of circuits or logic, and/or workload type) and/or is controlled by a particular power agent. For example, a power domain may be a group of logic units or function units that are controlled by a particular power supervisor. A power domain may also be referred to as an Autonomous Perimeter (AP). A power domain can be an entire system-on-chip (SoC) or part of the SoC and is governed by a p-unit.
In some examples, each processor 103 includes its own p-unit 111. P-unit 111 controls the power and/or performance of processor 103. P-unit 111 may control power and/or performance (e.g., IPC, frequency) of each individual core 106. In various examples, p-unit 111 of each processor 103 is coupled via fabric 104. As such, the p-units 111 of each processor 103 communicate with one another and with OS 102 to determine the optimal power state of single socket processor system 101 by controlling power states of individual cores 106 under their domain.
P-unit 111 may include circuitry including hardware, software and/or firmware to perform power management operations with regard to processor 103. In some embodiments, p-unit 111 provides control information to the voltage regulator of processor system 101 via an interface to cause the voltage regulator to generate the appropriate regulated voltage. In some embodiments, p-unit 111 provides control information to IVRs of cores 106 via another interface to control the operating voltage generated (or to cause a corresponding IVR to be disabled in a low power mode). In some embodiments, p-unit 111 may include a variety of power management logic units to perform hardware-based power management. Such power management may be wholly processor controlled (e.g., by various processor hardware, and which may be triggered by workload and/or power, thermal or other processor constraints) and/or the power management may be performed responsive to external sources (such as a platform or power management source or system software). In some embodiments, p-unit 111 is implemented as a microcontroller. The microcontroller can be an embedded microcontroller that serves as a dedicated controller or a general-purpose controller. In some embodiments, p-unit 111 is implemented as control logic configured to execute its own dedicated power management code, here referred to as pCode. In some embodiments, power management operations to be performed by p-unit 111 may be implemented externally to a processor 103, such as by way of a separate power management integrated circuit (PMIC) or other component external to processor system 101. In yet other embodiments, power management operations to be performed by p-unit 111 may be implemented within BIOS or other system software. In some embodiments, p-unit 111 of a processor 103 may assume a role of a supervisor or supervisee.
Here the term “supervisor” generally refers to a power controller, or power management, unit (a “p-unit”), which monitors and manages power and performance related parameters for one or more associated power domains, either alone or in cooperation with one or more other p-units. Power/performance related parameters may include but are not limited to domain power, platform power, voltage, voltage domain current, die current, load-line, temperature, device latency, utilization, clock frequency, processing efficiency, current/future workload information, and other parameters. It may determine new power or performance parameters (limits, average operational, etc.) for the one or more domains. These parameters may then be communicated to supervisee p-units, or directly to controlled or monitored entities such as VR or clock throttle control registers, via one or more fabrics and/or interconnects. A supervisor learns of the workload (present and future) of one or more dies, power measurements of the one or more dies, and other parameters (e.g., platform level power boundaries) and determines new power limits for the one or more dies. These power limits are then communicated by supervisor p-units to the supervisee p-units via one or more fabrics and/or interconnects. In examples where a die has one p-unit, a supervisor (Svor) p-unit may also be referred to as supervisor die.
Here the term “supervisee” generally refers to a power controller, or power management, unit (a “p-unit”), which monitors and manages power and performance related parameters for one or more associated power domains, either alone or in cooperation with one or more other p-units, and receives instructions from a supervisor to set power and/or performance parameters (e.g., supply voltage, operating frequency, maximum current, throttling threshold, etc.) for its associated power domain. In examples where a die has one p-unit, a supervisee (Svee) p-unit may also be referred to as a supervisee die. Note that a p-unit may serve as a Svor p-unit, as a Svee p-unit, or as both a Svor and a Svee p-unit.
In various embodiments, p-unit 111 executes firmware (referred to as pCode) that communicates with OS 102. In various embodiments, each processor 103 includes a PLL or FLL 109 that generates a clock for each core 106 from p-unit 111 and an input clock (or reference clock). Cores 106 may include or be associated with independent clock generation circuitry such as one or more PLLs to control the operating frequency of each core 106 independently.
According to some examples, storage 120 may include one or more types of non-volatile memory. The one or more types of non-volatile memory may include byte or block addressable types of non-volatile memory having a 3-dimensional (3-D) cross-point memory structure that includes, but is not limited to, chalcogenide phase change material (e.g., chalcogenide glass) hereinafter referred to as “3-D cross-point memory”. Non-volatile types of memory may also include other types of byte or block addressable non-volatile memory such as, but not limited to, multi-threshold level NAND flash memory, NOR flash memory, single or multi-level phase change memory (PCM), resistive memory, nanowire memory, ferroelectric transistor random access memory (FeTRAM), anti-ferroelectric memory, resistive memory including a metal oxide base, an oxygen vacancy base and a conductive bridge random access memory (CB-RAM), a spintronic magnetic junction memory, a magnetic tunneling junction (MTJ) memory, a domain wall (DW) and spin orbit transfer (SOT) memory, a thyristor based memory, a magnetoresistive random access memory (MRAM) that incorporates memristor technology, spin transfer torque MRAM (STT-MRAM), or a combination of any of the above.
As mentioned above, there are two JEDEC standard features to save DRAM power when the memory is in self-refresh: partial array self-refresh (PASR) and maximum power saving mode (MPSM). PASR allows suspension of the self-refresh operation on selected banks to save power during a system idle state, and MPSM allows suspension of an entire rank to save power during a system idle state. PASR and MPSM features, for example, may be available in LPDDR5 DRAMs and in DDR5 DRAMs.
When using either the PASR or MPSM feature by turning off the self-refresh operation in a portion of DRAM 110, there will be a loss of data in the banks and/or ranks of DRAM 110 included in the portion that will not be refreshed. To prevent loss of the data, pages associated with the data will need to be moved from the non-refreshed portion of DRAM 110 to a persistent/non-volatile memory (NVM), or consolidated in another portion of DRAM 110 that will continue to be refreshed following a PASR/MPSM trigger event (e.g., entering a system idle state). One goal is to maintain only data associated with critical pages (e.g., pinned/locked pages) in the refreshed portion of DRAM 110 following a PASR/MPSM trigger event. The data associated with the critical pages may then be used for fast system exit from an idle state while still saving power by possibly reducing the number of banks and/or ranks that will need to be refreshed to prevent loss of data associated with the critical pages.
Logic flow 400 begins at block 410 where logic and/or features of NUMA mapping agent 108 may partition or map cores 106 of processor 103 into virtual NUMA nodes. According to some examples, as shown in
Moving from block 410 to block 420, logic and/or features of NUMA mapping agent 108 may cause a partition of system memory into 8 segments and then cause separate portions of the 8 segments to be separately mapped to different virtual local memories. As mentioned above, DRAM 110 is arranged to be system memory for a system that includes processor 103. According to some examples, the partitioned system memory includes segment-0 of DRAM 110 being partitioned separately from segment-1 to segment-7. The logic and/or features of NUMA mapping agent 108 may cause segment-0 to be mapped to virtual local memory 310-1 and cause segment-1 to segment-7 to be mapped to virtual local memory 310-2.
Moving from block 420 to block 430, logic and/or features of NUMA mapping agent 108 may then map virtual local memory 310-1 to virtual NUMA Node-0 to cause segment-0 to be assigned to virtual NUMA Node-0 and map virtual local memory 310-2 to virtual NUMA Node-1 to cause segment-1 to segment-7 to be assigned to virtual NUMA Node-1.
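The partitioning described for blocks 410 to 430 can be sketched as a small data structure. This is a minimal illustration under stated assumptions: the class name, the function name and its parameters are hypothetical, and the two-core split for Node-0 simply follows the example of cores 106-1 and 106-2 used elsewhere in this description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VirtualNumaNode:
    """Hypothetical record of one virtual NUMA node."""
    node_id: int
    cores: List[str] = field(default_factory=list)
    # Segments reached through the node's virtual local memory.
    segments: List[int] = field(default_factory=list)


def build_virtual_numa_nodes(core_ids, num_segments=8, node0_cores=2, node0_segments=1):
    """Split cores and DRAM segments between two virtual NUMA nodes.

    Node-0 receives the first cores and the first segment(s);
    Node-1 receives everything else. No core or segment is shared.
    """
    if not 0 < node0_cores < len(core_ids):
        raise ValueError("each node needs at least one core")
    if not 0 < node0_segments < num_segments:
        raise ValueError("each node needs at least one segment")
    segments = list(range(num_segments))
    node0 = VirtualNumaNode(0, list(core_ids[:node0_cores]), segments[:node0_segments])
    node1 = VirtualNumaNode(1, list(core_ids[node0_cores:]), segments[node0_segments:])
    return [node0, node1]


nodes = build_virtual_numa_nodes(["106-1", "106-2", "106-3", "106-4"])
for node in nodes:
    print(node.node_id, node.cores, node.segments)
```

With the defaults, Node-0 holds segment 0 only and Node-1 holds segments 1 through 7, matching the assignment described above.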
Moving from block 430 to block 440, logic and/or features of NUMA mapping agent 108 may cause local and remote paths to be established from virtual NUMA Node-0 and Node-1 to respective virtual local memories 310-1 and 310-2. In some examples, as shown in
Moving from block 440 to block 450, logic and/or features of NUMA mapping agent 108 may cause system memory requests (e.g., from applications or drivers) that include hints/flags for pinning or locking pages to be directed to virtual NUMA Node-0. According to some examples, as described more below, causing the system memory requests having these hints/flags to be directed to virtual NUMA Node-0 results in data stored to this allocated system memory being stored to segment-0 of DRAM 110. Segment-0 may be arranged to keep self-refresh operations active/on during a system idle state and segment-1 to segment-7 are arranged to turn off/deactivate self-refresh operations during the system idle state. When PASR/MPSM is triggered, the non-critical data from segment-1 to segment-7 can be moved to storage. Since the pinned/locked pages will already be in segment-0, they do not need to be moved, which results in power savings as well as reduced latency in entering and exiting PASR/MPSM.
According to some examples, OS 102 may build a mapping table 510 based on SRAT/SLIT information 501 to determine PXM 1 and PXM 0 characteristics. For example, table 510 for NUMA PXM 0 indicates cores 106-1 and 106-2 and a size of segment-0, and for NUMA PXM 1 indicates cores 106-3 to 106-M and a size of segment-1 to segment-7. For these examples and as described more below, OS 102 may provide applications and/or drivers (enlightened software) a mechanism to query NUMA information included in mapping table 510 via NUMA API(s) 502, and then the applications or drivers may give hints/flags to OS 102 that these applications and/or drivers intend to pin/lock pages in PXM 0 that includes segment-0 when they request a system memory allocation. The applications and/or drivers may also just indicate that a target of the request is via virtual NUMA Node-0 to cause pinned/locked pages to be stored to segment-0, knowing that this segment of DRAM 110 will not have self-refresh operations turned off during an idle system state.
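The per-segment self-refresh arrangement can be pictured as a bit mask with one bit per segment. The sketch below uses a hypothetical encoding chosen for illustration, in which a set bit means self-refresh is turned off for that segment. It is not the JEDEC mode register layout, and the function name is an assumption.

```python
def self_refresh_off_mask(refreshed_segments, num_segments=8):
    """Return a mask with bit i set when segment i has self-refresh turned off.

    Hypothetical encoding for illustration only; not a JEDEC register format.
    """
    for seg in refreshed_segments:
        if not 0 <= seg < num_segments:
            raise ValueError(f"segment {seg} is out of range")
    mask = 0
    for seg in range(num_segments):
        if seg not in refreshed_segments:
            mask |= 1 << seg
    return mask


# Segment-0 stays in self-refresh; segment-1 to segment-7 are turned off.
mask = self_refresh_off_mask({0})
print(f"{mask:#010b}")  # 0b11111110
```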
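A mapping table of this kind, and the query an enlightened application or driver might make against it, can be sketched as follows. The table layout, the field names, the helper functions, the four-core example and the 2 GiB segment size are all assumptions made for illustration. They do not represent the SRAT/SLIT format or any actual OS NUMA API.

```python
# Hypothetical in-memory form of a mapping table keyed by proximity domain (PXM).
SEGMENT_SIZE_BYTES = 2 * 1024**3  # assumption: 16 GiB of DRAM split into 8 equal segments

MAPPING_TABLE = {
    0: {"cores": ["106-1", "106-2"], "segments": [0], "pinnable": True},
    1: {"cores": ["106-3", "106-4"], "segments": [1, 2, 3, 4, 5, 6, 7], "pinnable": False},
}


def query_pinnable_domains(table):
    """Return the PXM ids whose segments stay in self-refresh during idle."""
    return [pxm for pxm, entry in sorted(table.items()) if entry["pinnable"]]


def domain_size_bytes(table, pxm, segment_size=SEGMENT_SIZE_BYTES):
    """Return the amount of system memory assigned to one PXM."""
    return len(table[pxm]["segments"]) * segment_size


pinnable = query_pinnable_domains(MAPPING_TABLE)
print(pinnable)                                           # [0]
print(domain_size_bytes(MAPPING_TABLE, 0) // 1024**3)     # 2
print(domain_size_bytes(MAPPING_TABLE, 1) // 1024**3)     # 14
```

An enlightened caller would use the returned domain as the target of its request, or attach a pin/lock hint, when asking for memory it intends to pin.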
Logic flow 600 begins at block 610 where logic and/or features of NUMA agent 108 and/or OS 102 may respond to a query for NUMA information for pinnable memory. According to some examples, the query may be received from applications or drivers 503 that are aware of ways to use NUMA API(s) 502 to query for NUMA information. As mentioned previously, these NUMA API aware applications or drivers may be deemed enlightened software. For these examples, the logic and/or features of NUMA agent 108 and/or OS 102 may provide the information included in mapping table 510 to indicate that NUMA PXM 0, which includes segment-0 assigned to virtual NUMA Node-0, is allocated for use as pinnable memory (e.g., self-refresh operations for segment-0 are not stopped during an idle state).
Moving from block 610 to block 620, logic and/or features of NUMA agent 108 and/or OS 102 may receive one or more system memory requests for an allocation of system memory.
Moving from block 620 to decision block 630, logic and/or features of NUMA agent 108 and/or OS 102 may determine whether the system memory request includes pinnable hints/flags or targets NUMA Node-0. In some examples, enlightened software that is aware that a segment of DRAM 110 has been allocated for pinnable memory may indicate an intention to pin/lock memory for a duration of time by providing hints/flags. Alternatively and/or in addition to hints/flags, the enlightened software may indicate in the request that virtual NUMA Node-0 is targeted to be used to access the requested allocation of system memory (e.g., to execute at least a portion of a workload for the application). Thus, by targeting NUMA Node-0 that is assigned to segment-0, the enlightened software knows that the system memory request will be for an allocation to a pinnable memory of DRAM 110. For these examples, if the system memory request includes pinnable hints/flags or targets NUMA Node-0, logic flow 600 moves to block 640. Otherwise, logic flow 600 moves to block 650.
Moving from decision block 630 to block 640, logic and/or features of NUMA agent 108 and/or OS 102 may allocate system memory, responsive to the system memory request having hints/flags or targeting virtual NUMA Node-0, to memory addresses associated with segment-0 of DRAM 110.
Moving from decision block 630 to block 650, logic and/or features of NUMA agent 108 and/or OS 102 may allocate system memory, responsive to the system memory request not having hints/flags or targeting virtual NUMA Node-0, to memory addresses associated with one or more of segments 1-7 of DRAM 110.
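The branch taken at decision block 630 can be sketched as a small routing function. The request type, its field names and the constants below are hypothetical and exist only to illustrate the decision: a request that carries a pin/lock hint or targets virtual NUMA Node-0 is served from segment-0, and any other request is served from segments 1-7.

```python
from dataclasses import dataclass
from typing import List, Optional

PINNABLE_NODE = 0                            # virtual NUMA Node-0
PINNABLE_SEGMENTS = [0]                      # kept in self-refresh during idle
PAGEABLE_SEGMENTS = [1, 2, 3, 4, 5, 6, 7]    # self-refresh turned off during idle


@dataclass
class MemoryRequest:
    """Hypothetical system memory request."""
    size_bytes: int
    pin_hint: bool = False               # hint/flag that pages will be pinned or locked
    target_node: Optional[int] = None    # explicitly targeted virtual NUMA node, if any


def candidate_segments(request: MemoryRequest) -> List[int]:
    """Return the segments an allocation for this request may come from."""
    if request.pin_hint or request.target_node == PINNABLE_NODE:
        return PINNABLE_SEGMENTS    # corresponds to block 640
    return PAGEABLE_SEGMENTS        # corresponds to block 650


print(candidate_segments(MemoryRequest(4096, pin_hint=True)))   # [0]
print(candidate_segments(MemoryRequest(4096, target_node=0)))   # [0]
print(candidate_segments(MemoryRequest(4096)))                  # [1, 2, 3, 4, 5, 6, 7]
```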
Moving from block 650 to decision block 660, logic and/or features of NUMA agent 108 and/or OS 102 may check whether locked/pinned page requests have been received for pages in segments 1-7 of DRAM 110. In some examples, requests for unenlightened locked/pinned pages may be related to applications or drivers that are not aware of or capable of utilizing NUMA API(s) 502. Hence, system memory requests from these types of unenlightened software may cause pinned/locked pages to be allocated to segments 1-7 instead of segment-0. For these examples, periodically checking for pinned/locked page requests may minimize an amount of data that needs to be moved to segment-0 when transitioning to an idle state. In some examples, if requests for unenlightened locked/pinned pages are detected, logic flow 600 moves to block 620. Otherwise, logic flow 600 moves to decision block 670.
Moving from decision block 660 to decision block 670, logic and/or features of NUMA agent 108 and/or OS 102 may determine whether a PASR or MPSM is triggered that indicates a transition to an idle state. If a PASR or MPSM was triggered, logic flow 600 moves to decision block 680. Otherwise, logic flow 600 moves back to decision block 660.
Moving from decision block 670 to decision block 680, logic and/or features of NUMA agent 108 and/or OS 102 may determine if any locked/pinned pages are located in segments 1-7. If locked/pinned pages are located in segments 1-7, logic flow 600 moves to block 685. Otherwise, logic flow 600 moves to block 690.
Moving from decision block 680 to block 685, logic and/or features of NUMA agent 108 and/or OS 102 may move the unenlightened locked/pinned pages to segment-0 to prevent possible loss of data associated with these pages.
Moving from decision block 680 or from block 685 to block 690, logic and/or features of NUMA agent 108 and/or OS 102 may move data in segments 1-7 to a persistent memory (e.g., storage 120).
Moving from block 690, logic and/or features of NUMA agent 108 and/or OS 102 may cause self-refresh operations to be turned off or deactivated for segments 1-7.
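The idle-entry portion of logic flow 600 (decision block 680, block 685, block 690 and the subsequent deactivation of self-refresh) can be sketched as follows. This is a simplified model under stated assumptions: the Page type, the dictionary of segments and the function name are hypothetical, pages are represented by name only, and the capacity of segment-0 is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Page:
    """Hypothetical page record."""
    name: str
    pinned: bool = False


def enter_idle(segments: Dict[int, List[Page]],
               refreshed_segment: int = 0) -> Tuple[List[Page], Dict[int, bool]]:
    """Consolidate pinned pages, move the rest to storage, and pick refresh states.

    Returns the pages moved to persistent storage and a map of
    segment id to whether self-refresh stays on.
    """
    storage: List[Page] = []
    refreshed = segments.setdefault(refreshed_segment, [])
    for seg_id in sorted(segments):
        if seg_id == refreshed_segment:
            continue
        for page in segments[seg_id]:
            if page.pinned:
                refreshed.append(page)   # block 685: move to the refreshed segment
            else:
                storage.append(page)     # block 690: move to persistent storage
        segments[seg_id] = []            # nothing left that needs refreshing
    self_refresh_on = {seg_id: seg_id == refreshed_segment for seg_id in segments}
    return storage, self_refresh_on


segs = {
    0: [Page("driver-buffer", pinned=True)],
    1: [Page("legacy-dma", pinned=True), Page("app-heap")],
    2: [Page("file-cache")],
}
moved, refresh = enter_idle(segs)
print([p.name for p in segs[0]])   # ['driver-buffer', 'legacy-dma']
print([p.name for p in moved])     # ['app-heap', 'file-cache']
print(refresh)                     # {0: True, 1: False, 2: False}
```

In this model the only DRAM-to-DRAM copy is the unenlightened pinned page found in segment 1, which illustrates why directing pinned allocations to segment-0 ahead of time reduces the work needed at idle entry.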
According to some examples, apparatus 700 may be supported by circuitry 720 and apparatus 700 may be located as part of a single socket multi-core processor system and/or as part of an operating system executed by one or more cores of the single socket multi-core processor system (e.g., cores 106 of single socket processor system 101). Circuitry 720 may be arranged to execute one or more software or firmware implemented logic, components, agents, or modules 722-a (e.g., implemented, at least in part, by a controller of a memory device). It is worth noting that “a” and “b” and “c” and similar designators as used herein are intended to be variables representing any positive integer. Thus, for example, if an implementation sets a value for a=4, then a complete set of software or firmware for logic, components, agents, or modules 722-a may include logic 722-1, 722-2, 722-3 or 722-4. Also, at least a portion of “logic” may be software/firmware stored in computer-readable media, or may be implemented, at least in part in hardware and although the logic is shown in
In some examples, apparatus 700 may include a node map logic 722-1. Node map logic 722-1 may be a logic and/or feature executed by circuitry 720 to map a first portion of cores of a single socket multi-core processor to a first virtual NUMA node and map a second portion of cores of the single socket multi-core processor to a second virtual NUMA node. For these examples, the second portion of cores does not include any cores included in the first portion of cores. Node map logic 722-1 may utilize information included in core info. 705 to determine how to map the cores of the single socket multi-core processor (e.g., total number and types of logical processors/cores).
According to some examples, apparatus 700 may also include a partition logic 722-2. Partition logic 722-2 may be a logic and/or feature executed by circuitry 720 to partition a DRAM device into multiple segments, each segment capable of having self-refresh operations separately deactivated or activated. For these examples, partition logic 722-2 may use information included in DRAM info. 710 to determine how to partition the DRAM into the multiple segments.
In some examples, apparatus 700 may also include a segment map logic 722-3. Segment map logic 722-3 may be a logic and/or feature executed by circuitry 720 to map at least one segment from among the multiple segments to a first virtual local memory of the first virtual NUMA node to cause the at least one segment to be assigned to the first virtual NUMA node. Segment map logic 722-3 may also map remaining segments from among the multiple segments to a second virtual local memory of the second virtual NUMA node to cause the remaining segments to be assigned to the second virtual NUMA node. For these examples, segment map logic 722-3 may provide the NUMA mapping information to an operating system or to an application executed by the single socket multi-core processor via NUMA info. 715.
According to some examples, apparatus 700 may also include an allocation logic 722-4. Allocation logic 722-4 may be a logic and/or feature executed by circuitry 720 to cause a memory request to allocate memory for a pinned or a locked page of data to be directed to the first virtual NUMA node. For these examples, the memory request may be received via memory request 730 and allocation(s) 735 may indicate how the memory was allocated.
According to some examples, as shown in
In some examples, logic flow 800 at block 804 may map a second portion of the cores to a second virtual NUMA node, the second portion of cores to not include any cores included in the first portion of cores. For these examples, node map logic 722-1 may map the second portion of cores to the second virtual NUMA node.
According to some examples, logic flow 800 at block 806 may partition a DRAM device into multiple segments, each segment capable of having self-refresh operations separately deactivated or activated. For these examples, partition logic 722-2 may partition the DRAM device.
According to some examples, logic flow 800 at block 808 may map at least one segment from among the multiple segments to a first virtual local memory of the first virtual NUMA node. For these examples, segment map logic 722-3 maps the at least one segment to the first virtual local memory of the first virtual NUMA node.
In some examples, logic flow 800 at block 810 may map remaining segments from among the multiple segments to a second virtual local memory of the second virtual NUMA node. For these examples, segment map logic 722-3 maps the remaining segments to the second virtual local memory of the second virtual NUMA node.
According to some examples, logic flow 800 at block 812 may cause a memory request to allocate memory for a pinned or a locked page of data to be directed to the first virtual NUMA node. For these examples, allocation logic 722-4 may cause the memory request to be directed to the first virtual NUMA node.
The set of logic flows shown in
A logic flow may be implemented in software, firmware, and/or hardware. In software and firmware embodiments, a logic flow may be implemented by computer executable instructions stored on at least one non-transitory computer readable medium or machine readable medium, such as an optical, magnetic or semiconductor storage. The embodiments are not limited in this context.
According to some examples, processing components 1040 may execute at least some processing operations or logic for apparatus 700 based on instructions included in storage media that includes storage medium 900. Processing components 1040 may include various hardware elements, software elements, or a combination of both. Examples of hardware elements may include devices, logic devices, components, processors, microprocessors, management controllers, companion dice, circuits, processor circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, ASICs, programmable logic devices (PLDs), digital signal processors (DSPs), FPGAs, memory units, logic gates, registers, semiconductor devices, chips, microchips, chip sets, and so forth. Examples of software elements may include software components, programs, applications, computer programs, application programs, device drivers, system programs, software development programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (APIs), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Determining whether an example is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints, as desired for a given example.
According to some examples, processing components 1040 may include an infrastructure processing unit (IPU) or a data processing unit (DPU) or may be utilized by an IPU or a DPU. An xPU may refer at least to an IPU, DPU, graphics processing unit (GPU), or general-purpose GPU (GPGPU). An IPU or DPU may include a network interface with one or more programmable or fixed function processors to perform offload of workloads or operations that could have been performed by a CPU. The IPU or DPU can include one or more memory devices (not shown). In some examples, the IPU or DPU can perform virtual switch operations, manage storage transactions (e.g., compression, cryptography, virtualization), and manage operations performed on other IPUs, DPUs, servers, or devices.
In some examples, other platform components 1050 may include common computing elements, memory units (that include system memory), chipsets, controllers, peripherals, interfaces, oscillators, timing devices, video cards, audio cards, multimedia input/output (I/O) components (e.g., digital displays), power supplies, and so forth. Examples of memory units or memory devices included in other platform components 1050 may include without limitation various types of computer readable and machine readable storage media in the form of one or more higher speed memory units, such as read-only memory (ROM), random-access memory (RAM), dynamic RAM (DRAM), Double-Data-Rate DRAM (DDRAM), synchronous DRAM (SDRAM), static RAM (SRAM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, polymer memory such as ferroelectric polymer memory, ovonic memory, phase change or ferroelectric memory, silicon-oxide-nitride-oxide-silicon (SONOS) memory, magnetic or optical cards, an array of devices such as Redundant Array of Independent Disks (RAID) drives, solid state memory devices (e.g., USB memory), solid state drives (SSD) and any other type of storage media suitable for storing information.
In some examples, communications interface 1060 may include logic and/or features to support a communication interface. For these examples, communications interface 1060 may include one or more communication interfaces that operate according to various communication protocols or standards to communicate over direct or network communication links. Direct communications may occur via use of communication protocols or standards described in one or more industry standards (including progenies and variants) such as those associated with the PCIe specification, the NVMe specification or the I3C specification. Network communications may occur via use of communication protocols or standards such as those described in one or more Ethernet standards promulgated by the Institute of Electrical and Electronics Engineers (IEEE). For example, one such Ethernet standard promulgated by IEEE may include, but is not limited to, IEEE 802.3-2018, Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications, published in August 2018 (hereinafter “IEEE 802.3 specification”). Network communication may also occur according to one or more OpenFlow specifications such as the OpenFlow Hardware Abstraction API Specification. Network communications may also occur according to one or more Infiniband Architecture specifications.
Compute device 1000 may be coupled to a computing device that may be, for example, user equipment, a computer, a personal computer (PC), a desktop computer, a laptop computer, a notebook computer, a netbook computer, a tablet, a smart phone, embedded electronics, a gaming console, a server, a server array or server farm, a web server, a network server, an Internet server, a work station, a mini-computer, a main frame computer, a supercomputer, a network appliance, a web appliance, a distributed computing system, multiprocessor systems, processor-based systems, or combination thereof.
Functions and/or specific configurations of compute device 1000 described herein may be included or omitted in various embodiments of compute device 1000, as suitably desired.
The components and features of compute device 1000 may be implemented using any combination of discrete circuitry, ASICs, logic gates and/or single chip architectures. Further, the features of compute device 1000 may be implemented using microcontrollers, programmable logic arrays and/or microprocessors or any combination of the foregoing where suitably appropriate. It is noted that hardware, firmware and/or software elements may be collectively or individually referred to herein as “logic”, “circuit” or “circuitry.”
It should be appreciated that the exemplary compute device 1000 shown in the block diagram of
Although not depicted, any system can include and use a power supply such as but not limited to a battery, AC-DC converter at least to receive alternating current and supply direct current, renewable energy source (e.g., solar power or motion based power), or the like.
One or more aspects of at least one example may be implemented by representative instructions stored on at least one machine-readable medium which represents various logic within a processor, processor circuit, ASIC, or FPGA which when read by a machine, computing device or system causes the machine, computing device or system to fabricate logic to perform the techniques described herein. Such representations may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the processor, processor circuit, ASIC, or FPGA.
According to some examples, a computer-readable medium may include a non-transitory storage medium to store or maintain instructions that when executed by a machine, computing device or system, cause the machine, computing device or system to perform methods and/or operations in accordance with the described examples. The instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, and the like. The instructions may be implemented according to a predefined computer language, manner or syntax, for instructing a machine, computing device or system to perform a certain function. The instructions may be implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.
Some examples may be described using the expression “in one example” or “an example” along with their derivatives. These terms mean that a particular feature, structure, or characteristic described in connection with the example is included in at least one example. The appearances of the phrase “in one example” in various places in the specification are not necessarily all referring to the same example.
Some examples may be described using the expression “coupled” and “connected” along with their derivatives. These terms are not necessarily intended as synonyms for each other. For example, descriptions using the terms “connected” and/or “coupled” may indicate that two or more elements are in direct physical or electrical contact with each other. The term “coupled,” however, may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other.
The following examples pertain to additional examples of technologies disclosed herein.
Example 1. An example apparatus may include circuitry coupled with a single socket multi-core processor. For this example, the circuitry may map a first portion of cores of the single socket multi-core processor to a first virtual NUMA node and map a second portion of the cores to a second virtual NUMA node. For this example, the second portion of the cores does not include any cores included in the first portion of cores. The circuitry may also partition a DRAM device into multiple segments, each segment capable of having self-refresh operations separately deactivated or activated. The circuitry may also map at least one segment from among the multiple segments to a first virtual local memory of the first virtual NUMA node and map remaining segments from among the multiple segments to a second virtual local memory of the second virtual NUMA node. The circuitry may also cause a memory request to allocate memory for a pinned or a locked page of data to be directed to the first virtual NUMA node.
Example 2. The apparatus of example 1, the memory request may be received from an application to be executed by the single socket multi-core processor. For this example, the circuitry may also cause the memory request for the pinned or locked page of data to be directed to the first virtual NUMA node based on an indication in the memory request that a pinned or locked page is to be allocated to the at least one segment mapped to the first virtual local memory or based on the memory request targeting an allocation of memory for use by the first virtual NUMA node to execute at least a portion of a workload for the application.
Example 3. The apparatus of example 2, the indication in the memory request that the pinned or locked page of data is to be allocated to the at least one segment mapped to the first virtual local memory or based on the memory request targeting the allocation of memory for use by the first virtual NUMA node may be based on the application being capable of using a NUMA API to query an operating system executed by the single socket multi-core processor to determine that the at least one segment is mapped to the first virtual local memory.
Example 4. The apparatus of example 1, the memory request may be received from an application to be executed by the single socket multi-core processor. For this example, the circuitry may cause a second memory request to allocate memory to a page of data to be directed to the second virtual NUMA node based on the second memory request not including an indication that the page of data is to be pinned or locked and based on the memory request targeting an allocation of memory for use by the second virtual NUMA node to execute at least a portion of the application.
Example 5. The apparatus of example 1, the circuitry may also receive an indication that a system that includes the single socket multi-core processor is to enter an idle power state. The circuitry may also determine whether any pinned or locked pages of data have been allocated to the remaining segments mapped to the second virtual local memory. The circuitry may also move pinned or locked pages of data based on the determination to the at least one segment mapped to the first virtual local memory. The circuitry may also move non-pinned or non-locked pages of data to a persistent storage device coupled with the single socket multi-core processor. The circuitry may also turn off self-refresh operations for the remaining segments mapped to the second virtual local memory.
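The idle-entry sequence of example 5 can be sketched as follows. This is a minimal model under stated assumptions: `Page`, `Segment`, and `enter_idle` are hypothetical names, persistent storage is modeled as a plain list, and capacity limits of the retained segment are ignored.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    page_id: int
    pinned_or_locked: bool


@dataclass
class Segment:
    index: int
    pages: List[Page] = field(default_factory=list)
    self_refresh: bool = True


def enter_idle(
    first: List[Segment], remaining: List[Segment], storage: List[Page]
) -> None:
    """Consolidate pages, then turn off self-refresh for remaining segments."""
    for seg in remaining:
        for page in seg.pages:
            if page.pinned_or_locked:
                # Pinned or locked pages must stay resident in DRAM.
                first[0].pages.append(page)
            else:
                # Other pages are moved to persistent storage.
                storage.append(page)
        seg.pages.clear()
        seg.self_refresh = False  # segment contents no longer need retention


first = [Segment(0)]
remaining = [
    Segment(1, [Page(10, True), Page(11, False)]),
    Segment(2, [Page(12, False)]),
]
storage: List[Page] = []
enter_idle(first, remaining, storage)
```

After the call, only the segment mapped to the first virtual local memory still has self-refresh active in this model.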
Example 6. An example method may include mapping a first portion of cores of a single socket multi-core processor to a first virtual NUMA node and mapping a second portion of the cores to a second virtual NUMA node. For this example, the second portion of the cores does not include any cores included in the first portion of cores. The method may also include partitioning a DRAM device into multiple segments, each segment capable of having self-refresh operations separately deactivated or activated. The method may also include mapping at least one segment from among the multiple segments to a first virtual local memory of the first virtual NUMA node and mapping remaining segments from among the multiple segments to a second virtual local memory of the second virtual NUMA node. The method may also include causing a memory request to allocate memory for a pinned or a locked page of data to be directed to the first virtual NUMA node.
Example 7. The method of example 6, the memory request may be received from an application to be executed by the single socket multi-core processor. For this example, the method may further include causing the memory request for the pinned or locked page of data to be directed to the first virtual NUMA node based on an indication in the memory request that a pinned or locked page is to be allocated to the at least one segment mapped to the first virtual local memory or based on the memory request targeting an allocation of memory for use by the first virtual NUMA node to execute at least a portion of a workload for the application.
Example 8. The method of example 7, the indication in the memory request that the pinned or locked page of data is to be allocated to the at least one segment mapped to the first virtual local memory or based on the memory request targeting the allocation of memory for use by the first virtual NUMA node may be based on the application being capable of using a NUMA API to query an operating system executed by the single socket multi-core processor to determine that the at least one segment is mapped to the first virtual local memory.
Example 9. The method of example 6, the memory request may be received from an application to be executed by the single socket multi-core processor. For this example, the method may further include causing a second memory request to allocate memory to a page of data to be directed to the second virtual NUMA node based on the second memory request not including an indication that the page of data is to be pinned or locked and based on the memory request targeting an allocation of memory for use by the second virtual NUMA node to execute at least a portion of the application.
Example 10. The method of example 6 may also include receiving an indication that a system that includes the single socket multi-core processor is to enter an idle power state. The method may also include determining whether any pinned or locked pages of data have been allocated to the remaining segments mapped to the second virtual local memory. The method may also include moving pinned or locked pages of data based on the determination to the at least one segment mapped to the first virtual local memory. The method may also include moving non-pinned or non-locked pages of data to a persistent storage device coupled with the single socket multi-core processor. The method may also include turning off self-refresh operations for the remaining segments mapped to the second virtual local memory.
Example 11. An example system may include a DRAM device, a single socket multi-core processor coupled with the DRAM device, and an operating system to be executed by the single socket multi-core processor. For this example, the operating system may include logic to map a first portion of cores of the single socket multi-core processor to a first virtual NUMA node and map a second portion of the cores to a second virtual NUMA node. For this example, the second portion of the cores does not include any cores included in the first portion of cores. The logic may also partition the DRAM device into multiple segments, each segment capable of having self-refresh operations separately deactivated or activated. The logic may also map at least one segment from among the multiple segments to a first virtual local memory of the first virtual NUMA node and map remaining segments from among the multiple segments to a second virtual local memory of the second virtual NUMA node. The logic may also cause a memory request to allocate memory for a pinned or a locked page of data to be directed to the first virtual NUMA node.
Example 12. The system of example 11, the memory request may be received from an application to be executed by the single socket multi-core processor. For this example, the operating system logic may cause the memory request for the pinned or locked page of data to be directed to the first virtual NUMA node based on an indication in the memory request that a pinned or locked page is to be allocated to the at least one segment mapped to the first virtual local memory or based on the memory request targeting an allocation of memory for use by the first virtual NUMA node to execute at least a portion of a workload for the application.
Example 13. The system of example 12, the indication in the memory request that the pinned or locked page of data is to be allocated to the at least one segment mapped to the first virtual local memory or based on the memory request targeting the allocation of memory for use by the first virtual NUMA node may be based on the application being capable of using a NUMA API to query an operating system executed by the single socket multi-core processor to determine that the at least one segment is mapped to the first virtual local memory.
Example 14. The system of example 11, the memory request may be received from an application to be executed by the single socket multi-core processor. For this example, the operating system logic may cause a second memory request to allocate memory to a page of data to be directed to the second virtual NUMA node based on the second memory request not including an indication that the page of data is to be pinned or locked and based on the memory request targeting an allocation of memory for use by the second virtual NUMA node to execute at least a portion of the application.
Example 15. The system of example 11, the operating system logic may also receive an indication that a system that includes the single socket multi-core processor is to enter an idle power state. The logic may also determine whether any pinned or locked pages of data have been allocated to the remaining segments mapped to the second virtual local memory. The logic may also move pinned or locked pages of data based on the determination to the at least one segment mapped to the first virtual local memory. The logic may also move non-pinned or non-locked pages of data to a persistent storage device coupled with the single socket multi-core processor. The logic may also cause self-refresh operations for the remaining segments mapped to the second virtual local memory to be turned off.
Example 16. An example at least one machine readable medium may include a plurality of instructions that in response to being executed by a system, may cause the system to map a first portion of cores of a single socket multi-core processor to a first virtual NUMA node and map a second portion of the cores to a second virtual NUMA node. For this example, the second portion of the cores does not include any cores included in the first portion of cores. The instructions may also cause the system to partition a DRAM device into multiple segments, each segment capable of having self-refresh operations separately deactivated or activated. The instructions may also cause the system to map at least one segment from among the multiple segments to a first virtual local memory of the first virtual NUMA node and map remaining segments from among the multiple segments to a second virtual local memory of the second virtual NUMA node. The instructions may also cause the system to cause a memory request to allocate memory for a pinned or a locked page of data to be directed to the first virtual NUMA node.
Example 17. The at least one machine readable medium of example 16, the memory request may be received from an application to be executed by the single socket multi-core processor. For this example, the instructions may further cause the system to cause the memory request for the pinned or locked page of data to be directed to the first virtual NUMA node based on an indication in the memory request that a pinned or locked page is to be allocated to the at least one segment mapped to the first virtual local memory or based on the memory request targeting an allocation of memory for use by the first virtual NUMA node to execute at least a portion of a workload for the application.
Example 18. The at least one machine readable medium of example 17, the indication in the memory request that the pinned or locked page of data is to be allocated to the at least one segment mapped to the first virtual local memory or based on the memory request targeting the allocation of memory for use by the first virtual NUMA node may be based on the application being capable of using a NUMA API to query an operating system executed by the single socket multi-core processor to determine that the at least one segment is mapped to the first virtual local memory.
Example 19. The at least one machine readable medium of example 16, the memory request may be received from an application to be executed by the single socket multi-core processor. For this example, the instructions may further cause the system to cause a second memory request to allocate memory to a page of data to be directed to the second virtual NUMA node based on the second memory request not including an indication that the page of data is to be pinned or locked and based on the memory request targeting an allocation of memory for use by the second virtual NUMA node to execute at least a portion of the application.
Example 20. The at least one machine readable medium of example 16, the memory request may be received from an application to be executed by the single socket multi-core processor. For this example, the instructions may further cause the system to receive an indication that a system that includes the single socket multi-core processor is to enter an idle power state. The instructions may also cause the system to determine whether any pinned or locked pages of data have been allocated to the remaining segments mapped to the second virtual local memory. The instructions may also cause the system to move pinned or locked pages of data based on the determination to the at least one segment mapped to the first virtual local memory. The instructions may also cause the system to move non-pinned or non-locked pages of data to a persistent storage device coupled with the single socket multi-core processor. The instructions may also cause the system to turn off self-refresh operations for the remaining segments mapped to the second virtual local memory.
It is emphasized that the Abstract of the Disclosure is provided to comply with 37 C.F.R. Section 1.72(b), requiring an abstract that will allow the reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, it can be seen that various features are grouped together in a single example for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed examples require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed example. Thus, the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separate example. In the appended claims, the terms “including” and “in which” are used as the plain-English equivalents of the respective terms “comprising” and “wherein,” respectively. Moreover, the terms “first,” “second,” “third,” and so forth, are used merely as labels, and are not intended to impose numerical requirements on their objects.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.