The present invention relates in general to the field of multi-core microprocessors, and particularly to monitoring instruction execution therein.
Modern microprocessors are extremely complex, and the task of debugging them is a difficult one. Microprocessor designers commonly use a software functional model that simulates the architectural behavior of the microprocessor as a debugging tool. A software functional model can be very useful because it can simulate the execution of a large number of instructions quickly relative to other software models, such as a Verilog simulator. A software functional model executes a single instruction at a time according to the architectural definition. A software functional model is very useful for debugging a single core processor.
The software functional model may also be used to debug a multi-core processor. A different respective instance of the software functional model may be used to simulate the execution of instructions on each of the cores. This works well as long as the cores are not interacting with one another. However, there are some multi-core processor bugs that only manifest in the context of memory access interactions between the multiple cores, particularly when the cores are sharing a memory location, such as a software semaphore. The memory accesses to a shared memory location are essentially asynchronous to each other. For example, consider the case in which a first core loops reading a semaphore while waiting for a second core to write the semaphore. Unless the two instances of the software functional model execute their instructions in a manner that sufficiently approximates the order in which the actual processor executes instructions when the bug manifests, the software functional model tool may not be very useful in debugging the dual-core processor. Therefore, what is needed is a way to control the order in which the simulated cores execute instructions relative to one another so that it approximates the order of the post-silicon multi-core processor.
In one aspect the present invention provides a method for debugging a microprocessor having a plurality of cores. The method includes causing the microprocessor to perform an actual execution of instructions and obtaining from the microprocessor heartbeat information that specifies an actual execution sequence of the instructions by the plurality of cores relative to one another. The method also includes commanding a corresponding plurality of instances of a software functional model of the plurality of cores to execute the instructions according to the actual execution sequence specified by the heartbeat information to generate simulated results of the execution of the instructions. The method also includes comparing the simulated results with actual results of the execution of the instructions to determine whether they match.
In another aspect, the present invention provides a microprocessor. The microprocessor includes a plurality of processing cores, each configured to output an instruction execution indicator that indicates the number of instructions executed by the core during each clock cycle of the core. The microprocessor also includes a heartbeat generator coupled to receive the instruction execution indicator from each of the plurality of processing cores. The heartbeat generator is configured to generate a heartbeat indicator for each of the plurality of processing cores on a bus external to the microprocessor in response to the instruction execution indicators. The heartbeat indicator indicates the number of instructions executed by each of the plurality of processing cores during each clock cycle of the external bus.
In yet another aspect, the present invention provides a microprocessor. The microprocessor includes a plurality of processing cores, each configured to generate an indication of the number of instructions executed by the core during each clock cycle of the core. The microprocessor also includes a memory array configured to store the indications generated by the plurality of processing cores for a sequence of core clock cycles. The microprocessor also includes a bus interface unit, configured to couple to a bus external to the microprocessor. The bus interface unit is configured to write the indications stored in the memory array to a memory external to the microprocessor.
Described herein are embodiments of a multi-core processor configured to generate heartbeat signals that indicate the rate at which each core is executing instructions relative to one another. The processor designer captures the heartbeat signals as the processor operates and uses the captured heartbeat information to dynamically control the rate at which the software functional model executes instructions for each core. In this way, the heartbeat signals provide visibility into the inner workings of the multi-core processor needed by the software functional model to control the order in which the simulated cores execute instructions relative to one another that approximates the order of the actual multi-core processor that is exhibiting a bug. In some embodiments, the processor provides the heartbeat signal information on the architectural processor bus. However, because this may affect the timing of program execution on the multi-core processor, it may cause some bugs to go away when the heartbeats are enabled. Therefore, in preferred embodiments, the processor non-invasively provides the heartbeat signals on an external sideband bus rather than on the architectural processor bus.
Referring now to
The dual-core processor 102 also includes a heartbeat generator 103, coupled to each of the cores 104. Specifically, core A 104A generates an instruction execution indicator 105A that indicates the number of instructions it has executed in the given core clock cycle, and core B 104B generates an instruction execution indicator 105B that indicates the number of instructions it has executed in the given core clock cycle. The heartbeat generator 103 generates the heartbeat signals 106 to indicate that the cores 104 have executed instructions in response to the instruction execution indicators 105. In one embodiment, the cores 104 perform speculative execution of instructions, and the instruction execution indicators 105 indicate to the heartbeat generator 103 that instructions have been retired, i.e., have updated the architectural state of the core 104 as opposed to merely speculatively executed.
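The relationship between the per-core-clock instruction execution indicators 105 and a heartbeat reported once per external bus clock may be sketched as follows. This is a minimal illustration only: it assumes, hypothetically, that each bus clock spans a fixed integer number of core clocks, and the names `accumulate_heartbeats` and `core_clocks_per_bus_clock` are invented for the sketch rather than taken from the embodiments.

```python
from typing import List


def accumulate_heartbeats(retired_per_core_clock: List[int],
                          core_clocks_per_bus_clock: int) -> List[int]:
    """Sum one core's per-core-clock retire counts into per-bus-clock counts.

    Assumption: the bus clock period is an integer multiple of the core
    clock period, so each bus clock covers a fixed window of core clocks.
    """
    if core_clocks_per_bus_clock <= 0:
        raise ValueError("core_clocks_per_bus_clock must be positive")
    window = core_clocks_per_bus_clock
    return [
        sum(retired_per_core_clock[start:start + window])
        for start in range(0, len(retired_per_core_clock), window)
    ]


# Core A retires 1, 0, 2, 1, 0, 0, 3 instructions over seven core clocks;
# with two core clocks per bus clock the heartbeat reports four values.
core_a_heartbeat = accumulate_heartbeats([1, 0, 2, 1, 0, 0, 3], 2)
print(core_a_heartbeat)  # [1, 3, 0, 3]
```

Summing, rather than sampling, preserves the total number of retired instructions, which is what a consumer of the heartbeat would need in order to replay the same number of instructions per core.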
The computing system 100 also includes memory 112 coupled to the dual-core processor 102. Each core 104 of the processor 102 may be programmed to periodically stop executing user program instructions, dump its current state to a predetermined location in memory 112, and flush its caches to memory 112, which is referred to herein as a checkpoint. The core 104 state includes the state of its internal registers, which is referred to herein as a checkpoint state. More specifically, each core 104 may be programmed by the designer to continuously execute a predetermined number of instructions (e.g., 100,000), stop executing instructions and dump a checkpoint state and flush its caches, resume executing instructions until it has again executed the predetermined number of instructions, stop executing instructions and dump a checkpoint state and flush its caches, and so forth.
The computing system 100 also includes a logic analyzer 108. In one embodiment, the logic analyzer 108 comprises one of the cores 104 within the multi-core processor 102. The logic analyzer 108 monitors the processor bus 114 and captures transactions thereon, including the transactions that write the checkpoint state to memory 112 and flush the caches. The logic analyzer 108 also monitors and captures the heartbeat signals 106. The logic analyzer 108 saves the captured information to a file 116, such as on a disk drive. The file 116 includes the captured processor bus transaction information 118 and the captured heartbeat signal information 122. In one embodiment, the heartbeat signals 106 are provided on a sideband bus that is also the JTAG bus for the processor 102. In one embodiment, the sideband bus is also used by a separate service processor within the dual-core processor 102 chip.
The computing system 100 also includes a software functional model simulation environment 124. Typically, the software functional model simulation environment 124 comprises one or more computing systems distinct from the computer that includes the processor 102. The software functional model simulation environment 124 uses the captured processor bus transaction information 118 and heartbeat signal information 122 stored in the file 116 to simulate the operation of the dual-core processor 102, as described in more detail below.
Referring now to
The simulated initial state generator 202 receives as input the captured processor bus transaction information 118, which it uses to generate a simulated initial memory image 212, a simulated initial state of core A 214A, and a simulated initial state of core B 214B. Subsequently, the simulated initial memory image 212 is copied to a simulated result memory image 232, the simulated initial state of core A 214A is copied to a simulated result state of core A 234A, and the simulated initial state of core B 214B is copied to a simulated result state of core B 234B. For ease of description, assume each core 104 has dumped a first checkpoint state (which includes the state of its internal registers, as discussed above) and flushed its caches, resumed operation and executed the predetermined number of instructions, and dumped a second checkpoint state and flushed its caches; and further assume that the transaction information 118 includes the bus transactions for both the first and second checkpoints and all bus transactions in between, which are caused by the execution of the predetermined number of instructions. See U.S. Provisional Application Ser. No. 61/297,505, filed Jan. 22, 2010 for a description of a method of synchronizing checkpoints between the two cores 104.
According to one embodiment, the simulated initial state generator 202 generates the simulated initial memory image 212 by: (1) detecting each transaction in between the first and second checkpoints in which the processor 102 reads a location in memory 112; (2) determining whether the read transaction is the first read from the location in between the first and second checkpoints; (3) if so, creating a memory location record for the transaction that includes the memory location address and the value of the data read. By this method, the simulated initial state generator 202 generates a sparse simulated initial memory image 212; however, the sparse image is sufficient to supply the needs of the software functional model instances 206, since the software functional model instances 206, during the interval between the first and second checkpoints, will only need to read the memory locations created by this method; otherwise, this indicates a bug in the actual processor 102.
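The first-read rule described above may be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of the simulated initial state generator 202: the `BusTransaction` record and its `kind`, `address`, and `data` fields are hypothetical stand-ins for whatever format the captured transaction information 118 actually uses.

```python
from typing import Dict, Iterable, NamedTuple


class BusTransaction(NamedTuple):
    """Hypothetical record for one captured processor bus transaction."""
    kind: str      # "read" or "write"
    address: int
    data: int


def build_initial_memory_image(
        transactions: Iterable[BusTransaction]) -> Dict[int, int]:
    """Keep the data returned by the first read of each address.

    The input is assumed to hold only the transactions between the first
    and second checkpoints, in capture order.
    """
    image: Dict[int, int] = {}
    for txn in transactions:
        if txn.kind == "read" and txn.address not in image:
            image[txn.address] = txn.data   # first read of this location
    return image


captured = [
    BusTransaction("read", 0x100, 7),
    BusTransaction("write", 0x100, 9),
    BusTransaction("read", 0x100, 9),   # not the first read; ignored
    BusTransaction("read", 0x200, 1),
]
print(build_initial_memory_image(captured))  # {256: 7, 512: 1}
```

The resulting dictionary is sparse in the sense used above: it holds entries only for the locations that were actually read during the interval.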
The simulated initial state generator 202 generates the simulated initial state of core A 214A directly from the first checkpoint state captured in the transaction information 118. According to one embodiment, as mentioned above, at each checkpoint each core 104 writes its state information according to a predetermined format to a respective predetermined location in memory 112, which enables the simulated initial state generator 202 to find the first checkpoint state of core A 104A within the transaction information 118. The simulated initial state generator 202 generates the simulated initial state of core B 214B directly from the first checkpoint state captured in the transaction information 118 in a similar manner.
The actual result generator 208 receives as input the captured processor bus transaction information 118 of
According to one embodiment, the actual result generator 208 generates the actual result memory image 222 by: (1) detecting each transaction in between the first and second checkpoints in which the processor 102 writes a location in memory 112, which includes the writes by each core 104 to memory locations to flush its internal caches at the second checkpoint; (2) determining whether the write transaction is the last write to the location in between the first and second checkpoints; (3) if so, creating a memory location record for the transaction that includes the memory location address and the value of the data written. By this method, the actual result generator 208 generates a sparse actual result memory image 222; however, the sparse image is sufficient to supply the needs of the software functional model instances 206, since the software functional model instances 206, during the interval between the first and second checkpoints, will only need to write to the memory locations created by this method; otherwise, this indicates a bug in the actual processor 102. As discussed below, the comparison function 226 will compare the actual result memory image 222 with a simulated result memory image 232.
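The last-write rule described above may be sketched in the same way. Again this is only a sketch under assumptions: the `BusTransaction` record and its fields are hypothetical, and the input is assumed to contain, in capture order, the transactions between the first and second checkpoints, including the cache-flush writes at the second checkpoint.

```python
from typing import Dict, Iterable, NamedTuple


class BusTransaction(NamedTuple):
    """Hypothetical record for one captured processor bus transaction."""
    kind: str      # "read" or "write"
    address: int
    data: int


def build_actual_result_memory_image(
        transactions: Iterable[BusTransaction]) -> Dict[int, int]:
    """Keep the data of the last write to each address.

    Iterating in capture order and overwriting on every write leaves
    exactly the final written value for each location.
    """
    image: Dict[int, int] = {}
    for txn in transactions:
        if txn.kind == "write":
            image[txn.address] = txn.data
    return image


captured = [
    BusTransaction("write", 0x100, 3),
    BusTransaction("read", 0x100, 3),
    BusTransaction("write", 0x100, 4),   # last write to 0x100
    BusTransaction("write", 0x300, 8),
]
print(build_actual_result_memory_image(captured))  # {256: 4, 768: 8}
```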
The rate controller 204 receives as input the captured heartbeat signal information 122 of
Each software functional model instance 206 simulates the architectural behavior of a core 104. The software functional model instance for core A 206A reads and writes the simulated result state of core A 234A and the software functional model instance for core B 206B reads and writes the simulated result state of core B 234B. Additionally, each of the software functional model instances 206 reads and/or writes the simulated result memory image 232 as it executes memory access instructions, as commanded by the rate controller 204. In particular, data written to the simulated result memory image 232 by software functional model instance for core A 206A is seen by software functional model instance for core B 206B, and vice versa, which affects the simulated result state 234 of the respective core instances 206. At the completion of the execution of the predetermined number of instructions (e.g., 100,000) by each of the software functional model instances 206, the simulated initial state of core A 214A that was copied to the simulated result state of core A 234A will have been updated to become the true simulated result state of core A 234A, and the simulated initial state of core B 214B that was copied to the simulated result state of core B 234B will have been updated to become the true simulated result state of core B 234B. The comparison function 226 compares the simulated result state of core A 234A with the actual result state of core A 224A, and the comparison function 226 compares the simulated result state of core B 234B with the actual result state of core B 224B to determine whether the actual processor 102 manifested the bug during the interval between the first and second checkpoints, as indicated by the pass/fail indicator 228.
Additionally, at the completion of the execution of the predetermined number of instructions (e.g., 100,000) by each of the software functional model instances 206, the value of the simulated initial memory image 212 that was copied to the simulated result memory image 232 will have been updated to become the true simulated result memory image 232. The comparison function 226 compares the simulated result memory image 232 to the actual result memory image 222 to determine whether the actual processor 102 manifested the bug during the interval between the first and second checkpoints, as indicated by the pass/fail indicator 228.
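The comparisons performed by the comparison function 226 may be sketched as follows. The sketch makes two assumptions that the description does not state: core states are represented as dictionaries keyed by register name, and the memory images are compared only at the addresses present in the sparse actual result memory image, since the simulated image may also hold locations that were read but never written. The name `compare_results` is hypothetical.

```python
from typing import Dict


def compare_results(sim_states: Dict[str, Dict[str, int]],
                    actual_states: Dict[str, Dict[str, int]],
                    sim_memory: Dict[int, int],
                    actual_memory: Dict[int, int]) -> bool:
    """Return True (pass) only if core states and memory results match."""
    states_match = sim_states == actual_states
    # Assumption: check every location the actual processor wrote.
    memory_match = all(
        sim_memory.get(address) == value
        for address, value in actual_memory.items()
    )
    return states_match and memory_match


sim_states = {"A": {"r0": 5}, "B": {"r0": 1}}
actual_states = {"A": {"r0": 5}, "B": {"r0": 1}}
sim_memory = {0x100: 4, 0x200: 1}      # 0x200 was read but never written
actual_memory = {0x100: 4}

print(compare_results(sim_states, actual_states, sim_memory, actual_memory))
# True
print(compare_results(sim_states, actual_states, {0x100: 9}, actual_memory))
# False
```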
Thus, advantageously, through the medium of the rate controller 204, the heartbeat signal information 122 is used to dynamically control the rate at which each software functional model instance 206 executes instructions. That is, the rate controller 204 controls the order in which the software functional model instances 206 execute instructions relative to one another, such that the instructions are executed in proper order relative to memory accesses by each core 104 to accurately simulate the behavior that the actual processor 102 performed, or should have performed, from the actual initial state of each core 104 and the actual initial state of memory 112. This enables the comparison function 226 to compare the behavior of the actual processor 102 to its simulated behavior.
Referring now to
At block 302, the rate controller 204 receives the heartbeat signal information 122 from the file 116. Flow proceeds to block 304.
At block 304, for the next clock cycle of the heartbeat signals 106 indicated in the heartbeat signal information 122, the rate controller 204 examines the value of the heartbeat signals 106 for each of the cores 104. The values of the heartbeat signal 106 according to various embodiments are discussed below with respect to the remaining Figures. Flow proceeds to decision block 306.
At decision block 306, the rate controller 204 determines, for each of the cores 104, whether a heartbeat was generated for core N, where core N is either core A 104A or core B 104B. If so, flow proceeds to block 308; otherwise, flow returns to block 304 to examine the next clock cycle.
At block 308, the rate controller 204 commands 218 the software functional model instance for core N 206 to execute one or more instructions based on the heartbeat information determined at decision block 306, as discussed below with respect to the remaining Figures. Flow proceeds to block 312.
At block 312, the software functional model instance for core N 206 executes the next instruction or instructions based on the simulated result memory image 232 and the simulated result state of core N 234. If the instruction is a memory read instruction, the software functional model instance for core N 206 reads the simulated result memory image 232. If the instruction is a memory write instruction, the software functional model instance for core N 206 updates the simulated result memory image 232. Flow returns to block 304 to examine the next clock cycle.
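The flow of blocks 302 through 312 may be sketched as follows. This is a minimal sketch under stated assumptions, not the rate controller 204 itself: the heartbeat information is assumed to be a list with one entry per heartbeat clock, each entry mapping a core name to the number of instructions it retired in that clock; `ModelInstance`, `run_rate_controller`, `read_sem`, and `write_sem` are hypothetical names; and cores that retire instructions in the same clock are simply stepped in the order they appear in the entry.

```python
from typing import Callable, Dict, List

Instruction = Callable[[dict, dict], None]  # (core state, shared memory)


class ModelInstance:
    """Hypothetical stand-in for one software functional model instance."""

    def __init__(self, program: List[Instruction], memory: dict) -> None:
        self.program = program
        self.memory = memory                 # shared simulated memory image
        self.state = {"pc": 0, "seen": []}   # simulated state of this core

    def step(self) -> None:
        """Execute exactly one instruction, if any remain."""
        pc = self.state["pc"]
        if pc < len(self.program):
            self.program[pc](self.state, self.memory)
            self.state["pc"] = pc + 1


def run_rate_controller(heartbeats: List[Dict[str, int]],
                        instances: Dict[str, ModelInstance]) -> None:
    for clock in heartbeats:                   # block 304: next clock cycle
        for core, retired in clock.items():    # block 306: heartbeat for N?
            for _ in range(retired):           # blocks 308 and 312
                instances[core].step()


def read_sem(state: dict, memory: dict) -> None:
    state["seen"].append(memory["sem"])        # memory read instruction


def write_sem(state: dict, memory: dict) -> None:
    memory["sem"] = 1                          # memory write instruction


memory = {"sem": 0}
instances = {
    "A": ModelInstance([read_sem, read_sem], memory),
    "B": ModelInstance([write_sem], memory),
}
heartbeats = [{"A": 1, "B": 0}, {"A": 0, "B": 1}, {"A": 1, "B": 0}]
run_rate_controller(heartbeats, instances)
print(instances["A"].state["seen"])  # [0, 1]
```

In this example core A reads the semaphore once before and once after core B writes it, because the heartbeat sequence places core B's write between the two reads; a different heartbeat sequence would yield a different value in `seen`.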
Various embodiments of the instruction execution indicators 105, heartbeat generator 103, heartbeat signals 106, and their uses by the rate controller 204 will now be described.
Referring now to
Referring now to
Referring now to
Referring now to
Referring now to
Referring now to
Referring now to
Referring now to
Referring now to
Referring now to
Relative advantages and disadvantages of the embodiments described above will now be discussed. The embodiments of
Referring now to
At block 1402, the actual result generator 208 uses the bus transaction information 118 to generate the actual result memory image 222 and actual result state of the cores 224 of the execution of the predetermined number of instructions by the dual-core processor 102 of
At block 1404, the simulated initial state generator 202 uses the bus transaction information 118 to generate the simulated initial memory image 212 and simulated initial state of the cores 214, as described above, mainly with respect to
At block 1406, the simulated initial memory image 212 is copied to the simulated result memory image 232, the simulated initial state of the core A 214A is copied to the simulated result state of the core A 234A, and the simulated initial state of the core B 214B is copied to the simulated result state of the core B 234B. Subsequently, the rate controller 204 and software functional models 206 use and update the simulated result memory image 232 and simulated result state of the cores 234, as described above with respect to
At block 1408, the comparison function 226 compares the simulated results generated at block 1406 with the actual results generated at block 1402. Flow proceeds to decision block 1412.
At decision block 1412, the comparison function 226 determines whether the results compared at block 1408 match. If so, flow proceeds to block 1414; otherwise, flow proceeds to decision block 1416.
At block 1414, the comparison function 226 generates a pass value on the pass/fail indicator 228. Flow ends at block 1414.
At decision block 1416, the simulation environment 124 determines whether there are other possible memory access orderings that have not yet been used to perform block 1406. If so, flow returns to block 1406 to use one of the unused orderings; otherwise, flow proceeds to block 1418.
At block 1418, the comparison function 226 generates a fail value on the pass/fail indicator 228. Flow ends at block 1418.
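The retry behavior of blocks 1406 through 1418 may be sketched as follows. The sketch assumes, hypothetically, that the candidate memory access orderings can be enumerated up front and that a `simulate` callable returns the simulated result for a given ordering; neither `check_interval` nor `simulate` is a name used by the embodiments, and the toy example models only two cores writing the same address within one heartbeat clock.

```python
from itertools import permutations
from typing import Callable, Dict, Iterable, Tuple

Ordering = Tuple[str, ...]


def check_interval(orderings: Iterable[Ordering],
                   simulate: Callable[[Ordering], Dict[int, int]],
                   actual_result: Dict[int, int]) -> str:
    """Try each ordering until the simulated result matches the actual one."""
    for ordering in orderings:                     # block 1406, via 1416
        if simulate(ordering) == actual_result:    # blocks 1408 and 1412
            return "pass"                          # block 1414
    return "fail"                                  # block 1418


# Toy case: both cores write address 0x40 in the same heartbeat clock,
# so the heartbeat alone does not say which write landed last.
writes = {"A": (0x40, 1), "B": (0x40, 2)}


def simulate(ordering: Ordering) -> Dict[int, int]:
    memory: Dict[int, int] = {}
    for core in ordering:
        address, value = writes[core]
        memory[address] = value
    return memory


orderings = list(permutations(["A", "B"]))
print(check_interval(orderings, simulate, {0x40: 1}))  # pass
print(check_interval(orderings, simulate, {0x40: 3}))  # fail
```

A fail value is produced only after every candidate ordering has been tried, which mirrors the requirement that flow reach block 1418 only when decision block 1416 finds no unused ordering.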
Referring now to
Referring now to
The embodiment of
Although embodiments have been described in which each core 104 executes a single thread, other embodiments are contemplated in which each core 104 is configured to simultaneously execute multiple threads, and the heartbeat information indicates the thread of the retired instructions.
Additionally, although embodiments have been described in which both cores 104 have the same core clock rate, other embodiments are contemplated in which the two cores 104 have different core clock rates. The heartbeat signal information 122 indicates the two clock rates and the rate controller 204 takes this into account when generating the commands 218.
An alternative to the heartbeat embodiments described herein is to use a Verilog simulator of the processor design in place of the actual processor. Using the Verilog simulator essentially enables the debugger to access any net of the processor at any time, including signals that indicate the times at which each core executes instructions and accesses memory. This would enable the debugger to provide that information to the software functional model so that it can execute instructions and perform memory accesses at the same times as the actual processor, or at least as the Verilog simulation of the actual processor. However, there are three disadvantages to this approach. First, the Verilog simulator approach requires a very large amount of computing power/time, depending upon the number of clock cycles/instructions that must be simulated. The large amount of computing power/time potentially makes the Verilog simulation an impractical solution, at least for some classes of bugs. Second, there is always the possibility that the actual processor is behaving differently from the Verilog simulation. Third, the Verilog simulation solution requires the processor to be designed with perfect state-per-clock replay capability, which is difficult to implement. Broadly speaking, a microprocessor with perfect state-per-clock replay is capable of being loaded with an input state that defines the entire state of the processor; stated alternatively, there is no state of the processor that cannot be initialized by loading the input state. Advantageously, the heartbeat embodiments described herein do not suffer from these disadvantages of the Verilog simulation solution.
While various embodiments of the present invention have been described herein, it should be understood that they have been presented by way of example, and not limitation. It will be apparent to persons skilled in the relevant computer arts that various changes in form and detail can be made therein without departing from the scope of the invention. For example, software can enable the function, fabrication, modeling, simulation, description and/or testing of the apparatus and methods described herein. This can be accomplished through the use of general programming languages (e.g., C, C++), hardware description languages (HDL) including Verilog HDL, VHDL, and so on, or other available programs. Such software can be disposed in any known computer usable medium such as magnetic tape, semiconductor, magnetic disk, or optical disc (e.g., CD-ROM, DVD-ROM, etc.), a network, wire line, wireless or other communications medium. Embodiments of the apparatus and method described herein may be included in a semiconductor intellectual property core, such as a microprocessor core (e.g., embodied in HDL) and transformed to hardware in the production of integrated circuits. Additionally, the apparatus and methods described herein may be embodied as a combination of hardware and software. Thus, the present invention should not be limited by any of the exemplary embodiments described herein, but should be defined only in accordance with the following claims and their equivalents. Specifically, the present invention may be implemented within a microprocessor device which may be used in a general purpose computer. Finally, those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the scope of the invention as defined by the appended claims.
This application claims priority based on U.S. Provisional Application Ser. No. 61/314,253, filed Mar. 16, 2010, entitled MULTI-CORE PROCESSOR WITH EXTERNAL INSTRUCTION EXECUTION RATE HEARTBEAT, and on U.S. Provisional Application Ser. No. 61/297,505, filed Jan. 22, 2010, entitled SIMULTANEOUS EXECUTION RESUMPTION OF MULTIPLE PROCESSOR CORES AFTER CORE STATE INFORMATION DUMP TO FACILITATE DEBUGGING VIA MULTI-CORE PROCESSOR SIMULATOR USING THE STATE INFORMATION, which are hereby incorporated by reference in their entirety.
Number | Date | Country
---|---|---
20110185160 A1 | Jul 2011 | US

Number | Date | Country
---|---|---
61297505 | Jan 2010 | US
61314253 | Mar 2010 | US