The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
The illustrative embodiments provide a system and method for performing power simulations on complex designs running complex software applications. The illustrative embodiments may be used with any device having a sufficiently complex architecture for which power estimation using software simulation is prohibitive. One such multiprocessor system for which the illustrative embodiments may be implemented is the Cell Broadband Engine (CBE) architecture available from International Business Machines Corporation of Armonk, N.Y. The CBE architecture will be used as an example multiprocessor processing system that may be a device under test with which the illustrative embodiments are implemented for purposes of this description. However, it should be appreciated that the illustrative embodiments are not limited to use with the CBE architecture and may be used with other multiprocessor devices without departing from the spirit and scope of the present invention.
With reference now to the drawings,
As shown in
The CBE 100 may be a system-on-a-chip such that each of the elements depicted in
The SPEs 120-134 are coupled to each other and to the L2 cache 114 via the EIB 196. In addition, the SPEs 120-134 are coupled to MIC 198 and BIC 197 via the EIB 196. The MIC 198 provides a communication interface to shared memory 199. The BIC 197 provides a communication interface between the CBE 100 and other external buses and devices.
The PPE 110 is dual threaded. Together with the eight SPEs 120-134, this makes the CBE 100 capable of handling 10 simultaneous threads and over 128 outstanding memory requests. The PPE 110 acts as a controller for the eight SPEs 120-134, which handle most of the computational workload. The PPE 110 may be used to run conventional operating systems while the SPEs 120-134 perform vectorized floating point code execution, for example.
Each of the SPEs 120-134 comprises a synergistic processing unit (SPU) 140-154, a memory flow control unit 155-162, a local memory or store 163-170, and an interface unit 180-194. The local memory or store 163-170, in one exemplary embodiment, comprises a 256 KB instruction and data memory which is visible to the PPE 110 and can be addressed directly by software.
The PPE 110 may load the SPEs 120-134 with small programs or threads, chaining the SPEs together to handle each step in a complex operation. For example, a set-top box incorporating the CBE 100 may load programs for reading a DVD, video and audio decoding, and display, and the data would be passed off from SPE to SPE until it finally ended up on the output display. At 4 GHz, each SPE 120-134 gives a theoretical 32 GFLOPS of performance with the PPE 110 having a similar level of performance.
The memory flow control units (MFCs) 155-162 serve as an interface for an SPU to the rest of the system and other elements. The MFCs 155-162 provide the primary mechanism for data transfer, protection, and synchronization between main storage and the local storages 163-170. There is logically an MFC for each SPU in a processor. Some implementations can share resources of a single MFC between multiple SPUs. In such a case, all the facilities and commands defined for the MFC must appear independent to software for each SPU. The effects of sharing an MFC are limited to implementation-dependent facilities and commands.
Processor architectures are becoming very complex. One might say that the architectures are becoming “huge”; however, the physical chips themselves are becoming smaller relative to the number of functional components being fabricated on the die area. For example, the Cell Broadband Engine (CBE) architecture is an architecture that extends the 64-bit Power Architecture™ technology. “Power Architecture” is a trademark of International Business Machines Corporation in the United States, other countries, or both. Ideal for computation-intensive tasks like gaming, multimedia, and physics- or life-sciences and related workloads, the CBE architecture is a single-chip multiprocessor no bigger than a fingernail, with eight or more processors operating on a shared, coherent memory. The CBE processor contains one or more Power Architecture™-based control processors (PPUs) augmented with seven or more Synergistic Processor Units (SPUs) and a rich set of DMA commands for efficient communications between processing elements.
The heat generated by high power devices may cause failures if cooling systems are insufficient. While most software may run smoothly without overheating, some code may burden the processor, causing high power consumption and, hence, heat generation. For example, a long loop with highly computation-intensive code may cause a processor to overheat. If a portion of software code causes the processor to generate more heat than can be handled by the cooling system, the processor may fail.
A microprocessor, particularly a multiple core heterogeneous processor such as, for example, the CBE architecture described above with reference to
A software simulator is a software application that simulates the execution of a hardware design. A software simulator accepts a simulation model in the form of a hardware description language (HDL) model. A software simulator may execute on a single computer, on a cluster of computers, or perhaps using grid computing technology. An example of a known software simulator is the MESA simulator, which is a VHDL simulator.
Estimation of power consumption may begin with breaking a design into smaller analytic components. The smaller components are referred to as “macros,” which are essentially smaller block portions of a larger circuit. Examining smaller components of a chip allows for convenience in modeling. Once the processor architecture is broken down into macros, engineers may develop an energy model for each macro. One conventional method is to estimate a switching factor for all blocks in a design. Vectors based on this switching factor are then applied to all blocks, and their average power is calculated. These averages are then aggregated to calculate total chip power. Estimations based on these methods may yield an overall average power consumption; however, these methods do not accurately model the fine grain clock gating that is required in a number of microprocessors today. They also do not provide the time variation of power that is essential for determining peak power and for modeling the noise on the power distribution network.
Full chip simulations for complex processor architectures, such as the CBE architecture described above with reference to
In accordance with the illustrative embodiments, a power estimation system uses a hardware accelerated simulator to advance simulation to a point of interest for power estimation. The hardware accelerated simulator generates a checkpoint file, which is then used by a software simulator to initiate simulation of the processor design model for power estimation. An on-the-fly power estimator provides power calculations in memory. Thus, the power estimation system described herein isolates instruction sequences to determine portions of software code that may consume excess power or generate noise and to provide a more accurate power estimate on the fly.
Checkpointing is a function provided by known hardware accelerated simulators and software simulators. Checkpointing saves the states of all the latches and other inputs that have been set to a desired value at a specified point in time. The state of all the combinational logic does not need to be preserved, because the state of the latches and input/output (I/O) will propagate through the combinational logic at the time the checkpoint is restored. In other words, checkpoint file 220 is a snapshot of the state of the simulation model at a particular point in time.
A point-of-interest checkpoint file is a checkpoint file that stores the state of the simulation model at a point of interest. The point of interest may be a point within the software application being executed. For example, a point-of-interest checkpoint file may be a checkpoint file taken when a particular instruction address is encountered. Alternatively, a point-of-interest checkpoint file may be taken at other points of interest. For example, a point-of-interest checkpoint file may store state information for the simulation model at a particular point in time, such as after running the software application for a predetermined number of hours.
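The checkpointing idea described above can be illustrated with a toy sketch. Everything here is hypothetical: `ToyModel`, its two latches, and the helper names are stand-ins invented for illustration and do not reflect the interface of any actual hardware accelerated or software simulator. The sketch only shows that saving latch state at an instruction address of interest is sufficient, because combinational values are re-derived from the latches on restore.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    """Hypothetical stand-in for a simulation model; only latches hold state."""
    latches: dict = field(default_factory=lambda: {"pc": 0x100, "acc": 0})

    def combinational(self):
        # Derived purely from latch values, so it never needs to be saved.
        return {"acc_is_even": self.latches["acc"] % 2 == 0}

    def step(self):
        # One simulated cycle: advance the instruction address and accumulate.
        self.latches["pc"] += 4
        self.latches["acc"] += 3


def run_to_point_of_interest(model, target_pc, max_cycles=1_000_000):
    """Advance until the instruction address of interest, then snapshot latches."""
    for _ in range(max_cycles):
        if model.latches["pc"] == target_pc:
            return copy.deepcopy(model.latches)  # the point-of-interest checkpoint
        model.step()
    raise RuntimeError("point of interest not reached")


def restore(checkpoint):
    """Rebuild a model from a checkpoint; combinational values re-derive."""
    return ToyModel(latches=copy.deepcopy(checkpoint))
```

In this sketch, one model instance could play the role of the fast simulator that produces the checkpoint, and a second instance created by `restore` could play the role of the simulator that resumes from it.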
For complex processor architectures, the startup process of doing power on reset, self test, a serial flush of all latches, register initialization, and starting functional clocks is a complicated and time-consuming task. Power on reset checkpoint file 202 allows the simulation to begin at the end of this process. Engineers who specialize in this testing may create power on reset checkpoint file 202.
The next step of the simulation process is to get the software application 204 loaded and the processor's execution of this application started. Software application 204 is a workload software application to be executed on the device under test. Simulation model 206 represents the processor hardware. Simulation model 206 may be represented using a hardware description language, such as VHDL, for example. Loading an application is a very lengthy operation if the workload application is loaded by the serial process used in a lab. Even with hardware acceleration, loading the workload application 204 would be very time consuming and prone to error. A loader may be provided to accelerate the loading of the workload application into the memory of the chip architecture as generally known in the art. The loader may be included as a module in a run time executable (RTX). The use of an RTX loader may reduce the loading time from hours or days to a few minutes.
Run time executable (RTX) components are the controlling software of the simulation environment. This software can have a wide variety of functions and interactions with the design under test. When using hardware accelerated simulation, there is a significant penalty for probing the model of the design under test. To receive the greatest performance from the simulator, a reduced function RTX can be used when it is not necessary to check or modify the design's behavior during the simulation. When the application workload is loaded onto the design, a larger, fuller function RTX is used to initialize the design and memory with the application workload, and the software simulator is used.
The workload application itself requires its own setup and initialization, which may require millions of simulation cycles to be run before processing cores are running the instructions for which power measurements are to be performed. Hardware accelerated simulator 210 focuses on running software application 204 on simulation model 206 with a higher performance than that of a software simulator. Hardware accelerated simulator 210 may display instruction addresses periodically—every two thousand cycles, for example—to show that the simulation is progressing.
In this context, “software simulator” refers to the entire simulation environment, which includes the simulator itself and all controlling software, such as RTX components. The simulator itself allows RTX components to add functionality, such as software loaders, for example. In the illustrated embodiment, software simulator 230 loads on-the-fly calculator 232, which is a controlling software component, described in further detail below.
An operator may identify checkpoint file 220, generated by hardware accelerated simulator 210, to be used to begin software simulation for power estimation. The operator may examine instruction addresses to determine whether the hardware accelerated simulation has advanced to a portion of code that is of interest. A software simulator, such as software simulator 230, is faster for creating traces. Therefore, software simulator 230 receives checkpoint file 220, common power analysis methodology (CPAM) data 222, and simulation model 206 to begin software simulation. Software simulator 230 may be a known software simulator, such as the MESA simulator. Simulation model 206 and simulation model 226 may be the same model, such as a VHDL model for instance; however, simulation model 206 may be compiled for hardware accelerated simulator 210 and simulation model 226 may be compiled for software simulator 230.
Software simulator 230 also receives and loads on-the-fly power calculator 232. As software simulator 230 runs simulation cycles, it also runs on-the-fly power calculator 232 to generate power consumption numbers on a cycle-by-cycle basis. Software simulator 230 outputs the cycle-by-cycle power consumption numbers as power estimations 240.
On-the-fly power calculator 232 provides accurate, cycle-by-cycle power estimates, which are needed because of the heavy use of fine grain clock gating. On-the-fly power calculator 232 provides an accurate transistor-level power simulation for a high percentage of custom macros with unique circuit topologies including arrays and dynamic circuits. Software simulator 230 simulates thousands of cycles to estimate power for different workloads. This provides a high throughput register transfer level (RTL) simulation to verify the RTL and circuit implementation of the design and to estimate active workload-dependent power.
Switching power of a circuit in a given cycle is defined by the following equation:
$$P = \tfrac{1}{2}\, C\, V^{2} f$$
where C is the total node capacitance switched, V is the power supply voltage, and f is the clock frequency. The factors affecting switching node capacitance (C) are input switching and clock gating in the circuit.
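As a minimal sketch, the switching power relation can be evaluated directly. The function name and the numeric values in the usage note are illustrative assumptions, not figures from the design under test.

```python
def switching_power(c_switched_farads, v_supply_volts, f_clock_hz):
    """Switching power in watts: one half of C times V squared times f."""
    return 0.5 * c_switched_farads * v_supply_volts ** 2 * f_clock_hz
```

For instance, with an assumed 1 nF of switched capacitance, a 1.0 V supply, and a 4 GHz clock, `switching_power(1e-9, 1.0, 4e9)` evaluates to about 2 W. Halving the switched capacitance, as clock gating might, halves the result, which is why C must be tracked cycle by cycle.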
As seen in
On-the-fly power calculator 320 uses the switching and clock gating information to calculate power for each macro instance to get total chip power for all macros. Power due to signal interconnect capacitance may be estimated using signal switching information and an interconnect capacitance estimate obtained using Steiner routes 302 or three-dimensional (3D) extraction 312. Total power is equal to macro power plus net switching power. The on-the-fly power calculator repeats this calculation for every cycle and outputs cycle-by-cycle power estimates 322.
A macro is defined as the lowest level block of the design hierarchy in a floorplan. A macro may range from hundreds to thousands of gates. The macro power model may be created using the Common Power Analysis Methodology (CPAM) tool, for example, which is available from International Business Machines Corporation. The macro power model may be area based 304 or schematic based 314. Input switching factor is defined as the percent of inputs switching state between two consecutive clock cycles.
CPAM, for example, runs random vectors on the schematic 314 using multiple switching factors under two conditions. The first condition is all clock buffers turned on for fully clock active power. The second condition is all clock buffers forced off to get fully clock gated power.
Register transfer level (RTL) simulations are done using a software simulator, such as, for example, the MESA simulator from International Business Machines Corporation. For each macro instance, the state of each input is monitored at cycle boundaries to measure the input switching factor. The switching of each global net is monitored to calculate interconnect switching power.
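The input switching factor measurement can be sketched as follows, assuming the input states of a macro instance at two consecutive cycle boundaries are available as equal-length sequences. The function name and data layout are hypothetical.

```python
def input_switching_factor(prev_inputs, curr_inputs):
    """Percent of inputs whose state differs between two consecutive cycles."""
    if len(prev_inputs) != len(curr_inputs):
        raise ValueError("input vectors must have the same length")
    if not prev_inputs:
        return 0.0
    toggled = sum(1 for a, b in zip(prev_inputs, curr_inputs) if a != b)
    return 100.0 * toggled / len(prev_inputs)
```

With four inputs going from `[0, 1, 1, 0]` to `[1, 1, 0, 0]`, two inputs change state, so the switching factor is 50 percent.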
Clock activity for custom macros is measured by monitoring all clock buffers that are turned on in the macro. The designers provide a table (not shown) with relative power weights for each clock buffer. The clock activity is determined by adding the weights of the clock buffers that are turned on. For synthesized macros, clock activity is measured by the percent of latch bits that are active in the given cycle.
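The two clock activity measurements may be sketched as below. The weight table format (a mapping from clock buffer name to relative power weight) and the function names are assumptions made for illustration; the designer-provided table itself is not shown in this description.

```python
def custom_macro_clock_activity(buffer_weights, buffers_on):
    """Sum the relative power weights of the clock buffers that are turned on."""
    return sum(w for name, w in buffer_weights.items() if name in buffers_on)


def synthesized_macro_clock_activity(latch_bits_active):
    """Percent of latch bits that are active in the given cycle."""
    if not latch_bits_active:
        return 0.0
    active = sum(1 for bit in latch_bits_active if bit)
    return 100.0 * active / len(latch_bits_active)
```

For example, with hypothetical weights `{"a": 0.5, "b": 0.25, "c": 0.25}` and buffers `a` and `c` turned on, the custom macro clock activity is 0.75.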
Using clock activity and input switching factors for each macro instance in a cycle, the total power in a given cycle C may be calculated by the following equation:
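$$\text{Total Power}(C) = \sum \text{BlkPwr}(SF, CLK) + \tfrac{1}{2}\, C_{net}(C)\, V^{2} f$$
where the sum is taken over all macro instances, SF and CLK are the input switching factor and clock activity of each macro instance in cycle C, $C_{net}$ is the total interconnect capacitance switched, V is the power supply voltage, and f is the clock frequency.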
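A single-cycle evaluation of the total power equation might look like the sketch below. The form of BlkPwr(SF, CLK) is not specified in this description, so the sketch assumes a linear blend between the fully clock gated and fully clock active characterizations, with clock activity normalized to the range 0 to 1. That blend, the dictionary layout, and all names are hypothetical.

```python
def block_power(sf, clk, gated_power, active_power):
    """Assumed BlkPwr(SF, CLK): blend of clock gated and clock active power.

    gated_power and active_power are callables mapping a switching factor
    (percent) to watts; clk is clock activity normalized to [0, 1].
    """
    return (1.0 - clk) * gated_power(sf) + clk * active_power(sf)


def total_power(macros, c_net_switched, v_supply, f_clock):
    """Total power for one cycle: macro power plus net switching power."""
    macro_power = sum(
        block_power(m["sf"], m["clk"], m["gated"], m["active"]) for m in macros
    )
    net_power = 0.5 * c_net_switched * v_supply ** 2 * f_clock
    return macro_power + net_power
```

An actual macro power model would supply the two characterization curves; the linear blend is used here only to make the summation concrete.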
Accordingly, blocks of the flowchart illustrations support combinations of means for performing the specified functions, combinations of steps for performing the specified functions and program instruction means for performing the specified functions. It will also be understood that each block of the flowchart illustrations, and combinations of blocks in the flowchart illustrations, can be implemented by special purpose hardware-based computer systems which perform the specified functions or steps, or by combinations of special purpose hardware and computer instructions.
With particular reference to
Then, the on-the-fly power calculator uses the switching and clock gating information to calculate power for each macro instance to get total macro power for the processor architecture (block 506). The on-the-fly power calculator estimates power due to signal interconnect capacitance (block 508). The on-the-fly power calculator determines the total power to be the total macro power plus net switching power (block 510).
The on-the-fly power calculator determines whether the current cycle is the last cycle for software simulation and power estimation (block 512). If the current cycle is not the last cycle, operation returns to block 506 to calculate power for the next cycle. If the current cycle is the last cycle in block 512, then operation ends.
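The per-cycle loop of blocks 506 through 512 can be sketched as a generator that keeps results in memory rather than writing traces. The callables `macro_power_fn` and `net_power_fn`, the cycle records, and the `summarize` helper are hypothetical; the summary illustrates how cycle-by-cycle output makes average and peak power available.

```python
def estimate_power_per_cycle(cycles, macro_power_fn, net_power_fn):
    """Yield (cycle_index, total_power) for each simulated cycle."""
    for index, cycle in enumerate(cycles):
        macro_power = macro_power_fn(cycle)   # block 506: total macro power
        net_power = net_power_fn(cycle)       # block 508: interconnect power
        yield index, macro_power + net_power  # block 510: total power
    # Block 512: the loop ends once the last cycle has been processed.


def summarize(estimates):
    """Return average power, peak power, and the cycle where the peak occurs."""
    powers = [power for _, power in estimates]
    if not powers:
        raise ValueError("no cycles were simulated")
    peak_cycle = max(range(len(powers)), key=powers.__getitem__)
    return {
        "average": sum(powers) / len(powers),
        "peak": powers[peak_cycle],
        "peak_cycle": peak_cycle,
    }
```

Because the generator produces one value per cycle, a caller may stop early, stream the values to a file, or reduce them as shown without storing per-signal traces.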
Thus, the illustrative embodiments address the disadvantages of the prior art by providing a power estimation system that uses a hardware accelerated simulator to advance simulation to a point of interest for power estimation. The hardware accelerated simulator generates a checkpoint file, which is then used by a software simulator to initiate simulation of the processor design model for power estimation. An on-the-fly power estimator provides power calculations in memory. The power estimation system described herein thereby isolates instruction sequences to determine portions of software code that may consume excess power or generate noise and provides a more accurate power estimate on the fly.
It should be appreciated that the illustrative embodiments described above may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Furthermore, the illustrative embodiments may take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium may be any apparatus that may contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
As described previously above, a data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements may include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.