Embodiments are directed to processors, and more particularly to processor microarchitectures that scale the processor clock frequency in response to a cache miss.
The clock tree of a processor can consume a major component of the total power consumed by the processor. For example, for some modem processor designs it has been estimated that the clock tree dynamic power can be as high as 15% to 20% of the total processor core power. Assuming that the processor design is completely clock gated, for such an example the processor will always dissipate a non-appreciable amount of power while running regardless of whether the processor is active or idle when waiting for data from a memory sub-system.
Exemplary embodiments of the invention are directed to systems and method for for effective clock scaling at exposed cache stalls.
[I typically complete this section in the final draft after the claims have been approved.]
The accompanying drawings are presented to aid in the description of embodiments of the invention and are provided solely for illustration of the embodiments and not limitation thereof.
Embodiments of the invention are disclosed in the following description and related drawings. Alternate embodiments may be devised without departing from the scope of the invention. Additionally, well-known elements of the invention will not be described in detail or will be omitted so as not to obscure the relevant details of the invention.
The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments. Likewise, the term “embodiments of the invention” does not require that all embodiments of the invention include the discussed feature, advantage or mode of operation.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of embodiments of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises”, “comprising,”, “includes” and/or “including”, when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Further, many embodiments are described in terms of sequences of actions to be performed by, for example, elements of a computing device. It will be recognized that various actions described herein can be performed by specific circuits (e.g., application specific integrated circuits (ASICs)), by program instructions being executed by one or more processors, or by a combination of both. Additionally, these sequence of actions described herein can be considered to be embodied entirely within any form of computer readable storage medium having stored therein a corresponding set of computer instructions that upon execution would cause an associated processor to perform the functionality described herein. Thus, the various aspects of the invention may be embodied in a number of different forms, all of which have been contemplated to be within the scope of the claimed subject matter. In addition, for each of the embodiments described herein, the corresponding form of any such embodiments may be described herein as, for example, “logic configured to” perform the described action.
A processor according to an embodiment identifies when it is most likely stalled while waiting for data from system memory, and as a result scales down its clock frequency while waiting for the data to return from a memory sub-system (e.g., off-chip system memory). The processor returns to full clock frequency when the cache stall condition is lifted. This mechanism is aimed at reducing the power consumed in a clock tree without appreciably affecting performance.
The memory 110 represents off-chip memory that may include system memory, caches at a higher level than the instruction cache 104 or the data cache 106, or any combinations thereof. For example, the memory 110 may represent a memory hierarchy that includes L2 (level 2) cache, and other system memory components that may include both volatile and non-volatile memory.
Embodiments make use of one or more of the three registers shown in the register file 108: the register 112, referred to as the exposed load register 112; the register 114, referred to as the miss status handling register 114 (MSHR 114); and the register 116, referred to as the cache miss return counter 116. In practice, there may be more than one MSHR. Accordingly, the term “MSHRs 114” may be used to indicate a plurality of miss status handling registers. The state machine 118 has access to the registers 112, 114, and 116, and receives the cache miss signal at the input port 122 and the data return signal at the input port 124. As will be described in more detail below, the state machine 118 sets the clock 120 to a low frequency or a high-frequency depending upon the state stored in the state machine 118, the values stored in one or more of the registers 112, 114, and 116, and the cache miss signal and the data return signal.
Because the processor 100 may be viewed as a state machine, the states of the state machine 118 as described below may also be viewed as possible states of the processor 100.
The clock 120 in
When the state machine 118 is in one of the states 202, 204, or 206, the clock 120 is operated at the high frequency, whereas when the state machine 118 is in the state 208 the clock 120 is operated at the low frequency. Initially, the state machine 100 is in the HF0 state, so that this state may also be referred to as the initial state. The state transition 210 from the state 202 (the HF0 or initial state) to the state 204 (the HF1 state) occurs when a candidate load instruction is detected.
A candidate load instruction is a load instruction that causes a last level cache miss, such that the load instruction is not in the shadow of an earlier executed load instruction that is causing a dispatch stall due to a last level cache miss. (A dispatch stall is sometimes referred to as a cache stall.) That is, a candidate load instruction is a load instruction that causes a last level cache miss when there are no other outstanding load instructions in the pipeline 102 that caused a last level cache miss. The “last level” cache refers to that cache having the highest level in the memory hierarchy represented by the memory 110. For example, the last level cache in the memory 110 may be an L2 (Level 2) cache. In some embodiments, the last level cache may be integrated in the processor 100. Different embodiments for detecting a candidate load instruction are described later.
In response to detecting a candidate load instruction, the pipeline 102 stores the load instruction ID (identification) in the field 126 of the exposed load register 112, and sets the field 128 of the exposed load register 112 to indicate that the content of the exposed load register 112 is valid. The field 128 may be referred to as a valid field, or valid bit. This response to detecting a candidate load instruction is indicated within the parentheses next to the state transition 210.
The state transition 212 from the HF1 state to the HF2 state occurs in response to the processor 100 determining that the candidate load instruction is the oldest load instruction that has not yet retired. The oldest load instruction may be determined by accessing the load queue 130. However, note the state transition 211 from the HF1 state to the HF0 state. The state transition 211 occurs when the number of clock cycles since the state machine 118 entered the HF1 state exceeds a threshold, denoted as N1 in
The register 130, referred to as the counter_HF register in
The state transition 214 from the HF2 state to the LF state occurs in response to the processor 100 detecting that a dispatch stall variable TSTALL has reached M1 consecutive clock cycles. In one embodiment, the dispatch stall variable TSTALL begins counting from the time the candidate load instruction becomes the oldest load instruction, where the dispatch stall variable TSTALL is in units of processor clock cycles. That is, the dispatch stall variable TSTALL is initialized when or sometime before the state machine 118 entered the HF2 state, and is incremented thereafter for each processor clock cycle, whereupon the LF state is entered if the stall variable TSTALL reaches M1. The value of TSTALL may be stored in the register 132, where for example the state machine 118 resets the value of the register 132 to zero at the beginning of each dispatch stall.
When entering the LF state, the state machine 118 sets the clock 120 (or gates the processor 100) to the low frequency so as to achieve power savings without an appreciable loss in performance. However, note the state transition 213 from the HF2 state to the HF0 state, which occurs when the number of clock cycles since the state machine 118 entered the HF2 state exceeds a threshold, denoted as N2 in
Accordingly, the state transition 214 occurs only if N2 processor clock cycles have not elapsed since the state machine 118 transitioned from the HF1 state to the HF2 state. As before, the register 130 may be used for counting the number of clock cycles since the state machine 118 transitioned from the HF1 state to the HF2 state.
The state transition 218 from the LF state to the HF0 state occurs in response to a memory return in which data from the memory 110 is returned from the target memory location of the load instruction, or when there is a pipeline flush. In response to the state transition 218, the field 128 is cleared to indicate that the content of the exposed load register 112 is no longer valid.
In another embodiment, the HF2 state may be skipped as indicated by the dashed line for the state transition 216. In such an embodiment, the candidate load instruction need not be determined to be the oldest load instruction as indicated by the state transition 212. Rather, the state machine 118 transitions from the HF1 state directly to the LF state in response to detecting that the dispatch stall variable TSTALL has reached M2 consecutive clock cycles, where in this case the dispatch stall variable TSTALL begins counting when the last level cache miss occurred, that is, when the state machine 118 entered the HF1 state. The integer M1 need not equal the integer M2. But again, a necessary condition for the state transition 216 is that the number of processor clock cycles since the state machine 118 transitioned from the HF0 state to the HF1 state does not exceed N1.
In the embodiment illustrated in
In the embodiment illustrated in
Embodiments may find application in a number of devices, such as for example a cellular phone, laptop, or computer server, or a power efficient appliance with Internet connectivity, to name just a few examples.
Those of skill in the art will appreciate that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Further, those of skill in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or a combination of computer software and hardware. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The methods, sequences and/or algorithms described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or a combination of computer software and hardware, executed by a processor (it being understood that “processor” may include multiple processors or multiple processor cores) and electronic circuits. A software module for implementing part of an embodiment may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor.
Accordingly, an embodiment of the invention can include a computer readable media embodying a method for effective clock scaling at exposed cache stalls. Accordingly, the invention is not limited to illustrated examples and any means for performing the functionality described herein are included in embodiments of the invention.
While the foregoing disclosure shows illustrative embodiments of the invention, it should be noted that various changes and modifications could be made herein without departing from the scope of the invention as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the embodiments of the invention described herein need not be performed in any particular order. Furthermore, although elements of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.