Portable computing devices (e.g., cellular telephones, smart phones, tablet computers, personal digital assistants (PDAs), portable game consoles, wearable devices, and other battery-powered devices) and other computing devices continue to offer an ever-expanding array of features and services, and provide users with unprecedented levels of access to information, resources, and communications. To keep pace with these service enhancements, such devices have become more powerful and more complex. Portable computing devices now commonly include a system on chip (SoC) comprising a plurality of memory clients embedded on a single substrate (e.g., one or more central processing units (CPUs), a graphics processing unit (GPU), digital signal processors, etc.). The memory clients may read data from and store data in a memory system electrically coupled to the SoC via a memory bus.
The energy efficiency and power consumption of such portable computing devices may be managed to meet performance demands, workload types, etc. For example, existing methods for managing power consumption of multiprocessor devices may involve dynamic clock and voltage scaling (DCVS) techniques. DCVS involves selectively adjusting the frequency and/or voltage applied to the processors, hardware devices, etc. to yield the desired performance and/or power efficiency characteristics. Furthermore, a memory frequency controller may also adjust the operating frequency of the memory system to control memory bandwidth.
Busy time in processing cores comprises two main components: (1) a core execution time, in which a processing core actively executes instructions and processes data; and (2) a core stall time, in which the processing core waits for a data read/write in memory in the case of a cache miss. When there are many cache misses, the processing core waits for memory read/write access, which increases the core stall time due to memory access. An increased stall time percentage significantly decreases energy efficiency. As known in the art, the power overhead penalty depends on various factors, including the types of processing cores, the operating frequency, temperature, and leakage of the cores, and the stall time duration and/or percentage. Existing energy efficiency solutions pursue the lowest operating frequency in memory based on bandwidth voting by the processing core(s).
Existing solutions may reduce execution time by increasing the operating frequency of the processing core, but this does not address core stall time. The core stall time may be reduced by increasing the operating frequency of the memory bus (shorter cache misses and refill overhead) or by increasing the size of the cache (reducing cache misses). However, these approaches do not address core execution times.
Accordingly, there is a need for improved systems and methods for controlling power efficiency in a multi-processor system.
Systems, methods, and computer programs are disclosed for controlling power efficiency in a multi-processor system. The method comprises determining a core stall time due to memory access for one of a plurality of cores in a multi-processor system. A core execution time is determined for the one of the plurality of cores. A ratio of the core stall time versus the core execution time is calculated. A frequency vote for a memory bus is dynamically scaled based on the ratio of the core stall time versus the core execution time.
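The method above may be sketched in Python as follows. This is a minimal, non-limiting illustration: the function name, the clamp limits, and the linear scaling rule are assumptions for the sketch, not details taken from the disclosure.

```python
def scale_memory_vote(base_vote_mhz, core_stall_time, core_exec_time,
                      min_vote_mhz=200.0, max_vote_mhz=2000.0):
    """Scale a memory-bus frequency vote by the core stall/execution ratio.

    A larger stall-to-execution ratio suggests the core spends more time
    waiting on memory, so the vote is raised; a smaller ratio lets it fall.
    The linear rule and clamp limits here are illustrative assumptions.
    """
    if core_exec_time <= 0:
        return max_vote_mhz  # core is entirely stalled; vote for the ceiling
    ratio = core_stall_time / core_exec_time
    scaled = base_vote_mhz * (1.0 + ratio)
    return max(min_vote_mhz, min(max_vote_mhz, scaled))
```

A controller could re-evaluate this vote each sampling window, so the memory-bus frequency tracks how memory-bound the workload currently is.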
Another embodiment is a system comprising a dynamic random access memory (DRAM) and a system on chip (SoC) electrically coupled to the DRAM via a double data rate (DDR) bus. The SoC comprises a plurality of processing cores, a cache, and a DDR frequency controller. The DDR frequency controller is configured to dynamically scale a frequency vote for the DDR bus based on a calculated ratio of a core stall time versus a core execution time for one of the plurality of processing cores.
In the Figures, like reference numerals refer to like parts throughout the various views unless otherwise indicated. For reference numerals with letter character designations such as “102A” or “102B”, the letter character designations may differentiate two like parts or elements present in the same Figure. Letter character designations for reference numerals may be omitted when it is intended that a reference numeral encompass all parts having the same reference numeral in all Figures.
The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
In this description, the term “application” may also include files having executable content, such as: object code, scripts, byte code, markup language files, and patches. In addition, an “application” referred to herein, may also include files that are not executable in nature, such as documents that may need to be opened or other data files that need to be accessed.
The term “content” may also include files having executable content, such as: object code, scripts, byte code, markup language files, and patches. In addition, “content” referred to herein, may also include files that are not executable in nature, such as documents that may need to be opened or other data files that need to be accessed.
As used in this description, the terms “component,” “database,” “module,” “system,” and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device may be a component. One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers. In addition, these components may execute from various computer readable media having various data structures stored thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the Internet with other systems by way of the signal).
In this description, the terms “communication device,” “wireless device,” “wireless telephone”, “wireless communication device,” and “wireless handset” are used interchangeably. With the advent of third generation (“3G”) and fourth generation (“4G”) wireless technologies, greater bandwidth availability has enabled more portable computing devices with a greater variety of wireless capabilities. Therefore, a portable computing device may include a cellular telephone, a pager, a PDA, a smartphone, a navigation device, or a hand-held computer with a wireless connection or link.
As illustrated in
Each processing core 106, 108, and 110 may comprise one or more processing units (e.g., a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), a video encoder, a modem, or other memory clients requesting read/write access to the memory system). The system 100 further comprises a high-level operating system (HLOS) 120.
The DRAM controller 114 controls the transfer of data over DDR bus 122. Cache 112 is a component that stores data so future requests for that data can be served faster. In an embodiment, cache 112 may comprise a multi-level hierarchy (e.g., L1 cache, L2 cache, etc.) with a last-level cache that is shared among the plurality of memory clients.
RPM 116 comprises various functional blocks for managing system resources, such as, for example, clocks, regulators, bus frequencies, etc. RPM 116 enables each component in the system 100 to vote for the state of system resources. As known in the art, RPM 116 may comprise a central resource manager configured to manage data related to the processing cores 106, 108, and 110. In an embodiment, RPM 116 may maintain a list of the types of processing cores 106, 108, and 110, as well as the operating frequency, temperature, and leakage of each core. As described below in more detail, RPM 116 may also update a stall time duration and/or percentage (e.g., a moving average) of each core. For each core, RPM 116 may collect a core stall time due to memory access and a core execution time. The core stall time and core execution times may be explicitly provided or estimated via one or more counters. For example, in an embodiment, cache miss counters associated with cache 112 may be used to estimate the core stall time.
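The counter-based estimation described above may be sketched as follows. The function names, the constant miss penalty, and the exponential form of the moving average are assumptions for illustration; the disclosure states only that stall times may be estimated via counters and tracked as a moving average.

```python
def estimate_stall_time_ns(cache_miss_count, avg_miss_penalty_ns):
    """Estimate core stall time from a cache-miss counter reading,
    assuming a roughly constant average penalty per miss."""
    return cache_miss_count * avg_miss_penalty_ns

def stall_percentage(stall_ns, busy_ns):
    """Stall time as a percentage of the core's total busy time."""
    return 100.0 * stall_ns / busy_ns if busy_ns else 0.0

def update_moving_average(prev_avg, new_sample, alpha=0.25):
    """Exponential moving average of a per-core stall percentage
    (one possible way to maintain the running value RPM 116 keeps)."""
    return alpha * new_sample + (1.0 - alpha) * prev_avg
```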
RPM 116 may be configured to calculate a power/energy penalty overhead of stall duration per core. In an embodiment, the power/energy penalty overhead may be calculated by multiplying a power consumption during stall time by the stall duration. RPM 116 may calculate a total stall time power penalty (energy overhead) of all processing cores in the system 100. RPM 116 may be further configured to calculate the memory system power consumption for operating frequency level(s) one level higher and lower than a current level. Based on this information, RPM 116 may determine whether the overall SoC power consumption (e.g., DRAM 104 and processing cores 106, 108, and 110) may be further reduced by increasing the memory operating frequency. In this regard, power reduction may be achieved by running DRAM 104 at a higher frequency and reducing stall time power overhead on the core side.
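The energy-tradeoff decision described above can be sketched numerically. The function names and the simple greater-than comparison are illustrative assumptions; the disclosure describes the overhead as power during stall multiplied by stall duration, summed over cores, and compared against memory power at adjacent frequency levels.

```python
def stall_energy_overhead_j(stall_power_w, stall_duration_s):
    """Energy overhead of one core's stall time: power times duration."""
    return stall_power_w * stall_duration_s

def total_stall_overhead_j(cores):
    """Sum the stall-time energy overhead over (power, duration) pairs,
    one pair per processing core."""
    return sum(stall_energy_overhead_j(p, d) for p, d in cores)

def should_raise_memory_freq(core_overhead_now_j, core_overhead_higher_j,
                             dram_energy_now_j, dram_energy_higher_j):
    """Raise the memory frequency only when the stall-time energy saved
    on the core side exceeds the added DRAM energy at the higher level."""
    core_saving = core_overhead_now_j - core_overhead_higher_j
    dram_cost = dram_energy_higher_j - dram_energy_now_j
    return core_saving > dram_cost
```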
In the embodiment of
By receiving both the core stall time and the core execution time for each processing core, the workload analyzer 202 may distinguish workload tasks with a relatively larger stall time (e.g., workload type B 304) due to, for example, cache misses. In such cases, RPM 116 may maintain the current core frequency (or perhaps slightly increase the core frequency with minimal power penalty) while increasing the memory frequency to decrease the core stall time without degrading performance. As illustrated in
System 600 may also comprise other processing devices, such as, for example, a graphics processing unit (GPU) 606 and a digital signal processor (DSP) 608. Because the performance and power penalty can vary depending on the core types, different scaling factors may be applied for different cores and/or clusters. Functional scaling blocks 610, 612, 614, and 616 may be used to dynamically scale an instantaneous memory bandwidth vote for Little CPUs 602, Big CPUs 604, GPU 606, and DSP 608, respectively. The “original IB votes” provided to blocks 610, 612, 614, and 616 comprise original instantaneous votes (e.g., in units of Mbyte/sec). It should be appreciated that an original instantaneous vote represents the amount of peak read/write traffic that the core (or other processing device) may generate over a predetermined short time duration (e.g., tens or hundreds of nanoseconds). Each scaling block may be configured with a dedicated scaling factor matched to the corresponding processing device. Functional scaling blocks 610, 612, 614, and 616 scale the original instantaneous bandwidth vote up or down to a higher or lower value depending on the core stall percentage. In an embodiment, the scaling may be implemented via a simple multiplication, a look-up table, or a mathematical conversion function. The outputs of the functional scaling blocks 610, 612, 614, and 616 are provided to the DDR frequency controller 206 along with, for example, corresponding average bandwidth votes. As further illustrated in
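The per-core IB-vote scaling performed by blocks 610 through 616 may be sketched as a simple multiplication, one of the implementations the description contemplates. The specific scaling-factor values and the stall-weighted formula below are hypothetical; the disclosure says only that each core type may have its own dedicated factor.

```python
# Hypothetical per-core-type scaling factors; the disclosure states only
# that different factors may apply to different cores and/or clusters.
SCALING_FACTORS = {"little_cpu": 1.0, "big_cpu": 1.5, "gpu": 2.0, "dsp": 0.8}

def scale_ib_vote(core_type, original_ib_vote_mbps, stall_pct):
    """Scale an instantaneous bandwidth (IB) vote up or down using a
    per-core factor weighted by the core's stall-time percentage."""
    factor = SCALING_FACTORS[core_type]
    # A higher stall percentage pushes the vote toward more bandwidth.
    return original_ib_vote_mbps * (1.0 + factor * stall_pct / 100.0)
```

A look-up table or other conversion function could replace the multiplication without changing the surrounding data flow into the DDR frequency controller 206.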
It should be appreciated that the information regarding the core stall time versus the core execution time may be used to enhance various system controls (e.g., core DCVS, memory frequency control, big.LITTLE scheduling, and cache allocation).
S = [100%]/(100% − core stall time %)    Equation 1
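Equation 1 may be implemented directly; the range check is an added safeguard for the sketch, since the factor diverges as the stall percentage approaches 100%.

```python
def scaling_factor(core_stall_pct):
    """Equation 1: S = 100% / (100% - core stall time %).

    S grows without bound as the stall percentage approaches 100%,
    so the input is restricted to [0, 100) in this sketch.
    """
    if not 0.0 <= core_stall_pct < 100.0:
        raise ValueError("core stall percentage must be in [0, 100)")
    return 100.0 / (100.0 - core_stall_pct)
```

For example, a core that stalls half the time yields S = 2, doubling its bandwidth vote, while a core with no stalls yields S = 1 and leaves the vote unchanged.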
Graph 670 illustrates corresponding values (lines 672, 674, 676, and 678) for the scaled IB vote (W) along the line 662 in graph 660. Point 664 in graph 660 corresponds to line 674 in graph 670. Point 666 in graph 660 corresponds to line 678 in graph 670. As illustrated, line 674 is steeper than line 678. One of ordinary skill in the art will appreciate that line 674 may represent the case in which there is a relatively large core stall time percentage and a higher DRAM frequency is desired. Line 678 may represent the case in which there is a relatively smaller core stall time percentage and a lower DRAM frequency is desired. In this regard, the functional scaling block 650 may dynamically adjust the memory frequency between the lines illustrated in graph 670.
The workload analyzer 202 receives core stall time data from GPU 606 on an interface 712. The workload analyzer 202 receives core stall time data from CPUs 602/604 on an interface 714. The workload analyzer 202 may also receive cache miss ratio data from dedicated caches 702 and 704 on an interface 710. The workload analyzer 202 may calculate core execution time percentages and core stall time percentages for GPU 606 and CPUs 602/604. As further illustrated in
The workload analyzer 202 may provide the core stall time percentage on an interface 718 to the DDR frequency controller 206. In response to memory traffic profile data received on an interface 732, the DDR frequency controller 206 may initiate memory frequency scaling on an interface 734. The shared cache allocator 508 may interface with the workload analyzer 202 and, based on the ratio of core stall time versus core execution time, may allocate more or less cache to the GPU 606 and/or the CPUs 602/604.
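One way the shared cache allocator 508 could act on the stall data is to apportion cache capacity in proportion to the stall percentages. The way-based partitioning and proportional rule below are purely illustrative assumptions; the disclosure says only that more or less cache may be allocated based on the stall/execution ratio.

```python
def allocate_cache_ways(total_ways, gpu_stall_pct, cpu_stall_pct):
    """Split shared-cache ways between the GPU and the CPUs roughly in
    proportion to their stall percentages (more stalling, more cache).
    Returns (gpu_ways, cpu_ways)."""
    combined = gpu_stall_pct + cpu_stall_pct
    if combined == 0:
        gpu_ways = total_ways // 2  # no stalls observed: split evenly
    else:
        gpu_ways = round(total_ways * gpu_stall_pct / combined)
    return gpu_ways, total_ways - gpu_ways
```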
One of ordinary skill in the art will readily appreciate that the scheme(s) described for dynamically scaling memory frequency may be further extended and/or applied in alternative embodiments, such as, for example, for a plurality of heterogeneous cores such as a modem core, a DSP core, a video codec core, a camera core, an audio codec core, and a display processor core.
As mentioned above, the system 100 may be incorporated into any desirable computing system.
A display controller 328 and a touch screen controller 330 may be coupled to the CPU 802. In turn, the touch screen display 606 external to the on-chip system 322 may be coupled to the display controller 328 and the touch screen controller 330.
Further, as shown in
As further illustrated in
As depicted in
It should be appreciated that one or more of the method steps described herein may be stored in the memory as computer program instructions, such as the modules described above. These instructions may be executed by any suitable processor in combination or in concert with the corresponding module to perform the methods described herein.
Certain steps in the processes or process flows described in this specification naturally precede others for the invention to function as described. However, the invention is not limited to the order of the steps described if such order or sequence does not alter the functionality of the invention. That is, it is recognized that some steps may be performed before, after, or in parallel (substantially simultaneously) with other steps without departing from the scope and spirit of the invention. In some instances, certain steps may be omitted or not performed without departing from the invention. Further, words such as “thereafter”, “then”, “next”, etc. are not intended to limit the order of the steps. These words are simply used to guide the reader through the description of the exemplary method.
Additionally, one of ordinary skill in programming is able to write computer code or identify appropriate hardware and/or circuits to implement the disclosed invention without difficulty based on the flow charts and associated description in this specification, for example.
Therefore, disclosure of a particular set of program code instructions or detailed hardware devices is not considered necessary for an adequate understanding of how to make and use the invention. The inventive functionality of the claimed computer implemented processes is explained in more detail in the above description and in conjunction with the Figures which may illustrate various process flows.
In one or more exemplary aspects, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that may be accessed by a computer. By way of example, and not limitation, such computer-readable media may comprise RAM, ROM, EEPROM, NAND flash, NOR flash, M-RAM, P-RAM, R-RAM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that may be used to carry or store desired program code in the form of instructions or data structures and that may be accessed by a computer.
Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (“DSL”), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium.
Disk and disc, as used herein, include compact disc (“CD”), laser disc, optical disc, digital versatile disc (“DVD”), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
Alternative embodiments will become apparent to one of ordinary skill in the art to which the invention pertains without departing from its spirit and scope. Therefore, although selected aspects have been illustrated and described in detail, it will be understood that various substitutions and alterations may be made therein without departing from the spirit and scope of the present invention, as defined by the following claims.