This invention relates generally to a method of modeling a processor for system-level simulation, and more particularly to Cycle-Count-Accurate (CCA) processor modeling, which offers high simulation speed and accuracy and benefits system design tasks.
As both system-on-a-chip (SoC) design complexity and time-to-market pressure increase relentlessly, system-level simulation emerges as a crucial design approach for non-recurring engineering (NRE) cost saving and design cycle reduction. With system components, such as processors and busses, modeled at a proper abstraction level, system simulation enables early architecture performance analysis and functionality verification before real hardware implementation.
To construct a proper system platform for simulation, models for system components of various abstraction levels are proposed for simulation accuracy and performance trade-off. For example, Cycle-accurate (CA) models are proposed to eliminate detailed pins and wires to improve simulation performance while preserving cycle timing accuracy. CA models are suitable for micro-architecture verification. The verification of correctness involves detailed states, such as values of register contents at every cycle. In practice, the simulation speeds of CA models are slow because of the enormous number of simulated states and are not satisfactory for system-level simulation.
To further increase simulation performance while sacrificing timing accuracy, cycle-approximate (CX) models apply simple fixed, approximated delays to represent timing behaviors. CX models achieve significant simulation performance speedup and are useful for architecture performance estimation at early design stages. Nevertheless, the approximated timing is inadequate for system simulation such as HW/SW co-simulation or multi-processor simulation. Without precise timing information, both performance evaluation and functionality verification cannot be accurate.
A newer modeling approach, the cycle-count-accurate (CCA) approach, has received considerable attention lately. It offers a substantial simulation speedup over CA models by eliminating unnecessary timing details while keeping only the system timing information that is needed. Compared to CX models, the CCA technique preserves accurate cycle count information of execution behaviors, and the preserved accuracy is adequate for system-level simulation.
A CCA processor modeling technique is disclosed in the present invention. The idea is essentially based on the observation that, if the timing and functional behaviors of every access (such as bus access) on a component interface are correct, the effects from the component to the simulated system behaviors will remain correct. In other words, unnecessary internal component details can be eliminated to achieve better simulation performance while maintaining accurate system behaviors, as long as the interface behaviors are correct.
The disclosed CCA processor model of the present invention preserves accurate cycle count information between any two consecutive external interface accesses through pre-abstracted processor pipeline and cache timing information using static analysis.
The present invention discloses a Cycle-Count-Accurate (CCA) processor modeling, hereinafter called a CCA processor modeling, for system-level simulation. The CCA processor modeling achieves both fast and accurate simulation for a system-on-a-chip (SoC) design. The CCA processor modeling for system-level simulation mainly includes a pipeline subsystem model, hereinafter called PSM, and a cache subsystem model, hereinafter called CSM. In one embodiment, the CCA processor modeling further includes a branch predictor and a bus interface model.
Instead of observing all internal states at every clock cycle, the PSM analyzes all possible pipeline execution behaviors, hereinafter called PEBs, of a plurality of basic blocks of a given program. First, the PSM statically pre-analyzes the possible PEBs of each basic block of the given program. Then, during simulation, the PSM dynamically calculates the actual time point of an access event by adding a time offset to the starting execution time of a target basic block. The time offset is pre-analyzed by the static PEB analysis.
In one embodiment, the PSM identifies only a potentially missed instruction fetch as an instruction access event for simulation, since only such a fetch causes an external instruction fetch and affects the behavior of the processor interface. The PSM checks the time point of a data access event when a memory load/store or an input/output instruction is scheduled in the execution stages. In addition, the PSM dynamically adds additional delay cycles to the target basic block when a cache miss happens in simulation.
The CSM returns correct access delay values, depending on hit or miss conditions, to the PSM at the clock cycle when an access event is issued from the PSM, and triggers external accesses accurately via a processor interface.
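The dynamic timing calculation described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the names `AccessEvent`, `run_block`, `block_cycles`, `access_delay`, and `hit_delay` are hypothetical, and the sketch assumes that the static PEB already accounts for a hit delay, so that only the cycles beyond a hit are added as extra delay to the rest of the block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessEvent:
    kind: str    # "ifetch" or "data" (hypothetical labels)
    offset: int  # pre-analyzed cycle offset from the start of the basic block


def run_block(start_time, events, block_cycles, access_delay, hit_delay=1):
    """Return (issue_times, end_time) for one basic block.

    access_delay(event, time) stands in for the CSM and returns the access
    delay in cycles. Assumption: the static PEB already includes hit_delay,
    so only the cycles beyond a hit are added to the block.
    """
    extra = 0          # additional delay cycles accumulated from misses
    issue_times = []
    for ev in sorted(events, key=lambda e: e.offset):
        # actual time point = block start time + pre-analyzed offset (+ stalls)
        t = start_time + ev.offset + extra
        issue_times.append(t)
        extra += max(0, access_delay(ev, t) - hit_delay)
    return issue_times, start_time + block_cycles + extra
```

For example, with a block starting at cycle 100, events at offsets 0 and 3, and a first access that takes 5 cycles instead of 1, the second event is issued 4 cycles later than its static offset suggests, and the block ends 4 cycles later as well.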
In one embodiment, the CSM includes a hierarchical cache system. The hierarchical cache system issues all external accesses at accurate time points and returns correct access delays to the PSM, depending on hit or miss results of the first and the second level caches.
In one embodiment, the CSM returns a delay of only one cycle to the PSM if the first level cache hits. Conversely, if the first level cache misses, the CSM returns a delay of X+1 cycles to the PSM, because the first level cache requires X cycles before, and one cycle after, an additional handshake with the second level cache. X is an integer that depends on the processor model. If a miss occurs in the CSM, the CSM triggers an external memory access according to the pre-analyzed timing.
The bus interface model is used to simulate the behavior of the processor interface, which transfers data, via an external bus, to and from external components, such as ROM, RAM or other hardware, when the CSM issues a miss signal. Only the timing and functional behaviors of the bus interface at the clock cycles in which data are transferred to or from the external components are extracted for system-level simulation. If the timing and functional behaviors of every bus access on a component interface are correct, the effects of the component on the simulated system behaviors will remain correct. In other words, unnecessary internal component details can be eliminated to achieve fast and accurate system simulation, as long as the interface behaviors are correct.
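The hit and miss delays described above can be sketched as a single procedure call. This is only an illustration under stated assumptions: the direct-mapped organization, the class `L1Cache`, the function `csm_access_delay`, and the parameters `num_lines`, `line_size`, and `x_cycles` are hypothetical and are not taken from the disclosure; only the returned delays (one cycle on a hit, X+1 cycles on a miss) follow the text.

```python
class L1Cache:
    """Tiny direct-mapped first level cache model (hypothetical organization)."""

    def __init__(self, num_lines=4, line_size=16):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines

    def lookup(self, addr):
        """Return True on a hit; on a miss, fill the line and return False."""
        line = addr // self.line_size
        index = line % self.num_lines
        hit = self.tags[index] == line
        self.tags[index] = line
        return hit


def csm_access_delay(l1, addr, x_cycles):
    """Delay returned to the PSM: 1 cycle on an L1 hit, X+1 cycles on a miss."""
    return 1 if l1.lookup(addr) else x_cycles + 1
```

With `x_cycles = 3`, a first access to an address returns 4 cycles, and a repeated access to the same line returns 1 cycle.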
The above objects, and other features and advantages of the present invention will become more apparent after reading the following detailed description when taken in conjunction with the drawings, in which:
a) illustrates a system-on-a-chip architecture which includes a processor, a bus, and several components outside the processor.
b) illustrates a sample timing diagram of the bus transfer.
a) illustrates a Cycle-accurate (CA) model, which captures all the concurrent behaviors of the processor by updating every process state at every clock cycle.
b) illustrates an abstract processor model, such as a CCA processor model, which has different internal execution details compared to the CA model, but has the same effects on the system by providing equivalent bus access behaviors.
a) illustrates a basic block of a program.
b) illustrates the pipeline execution behavior (PEB) of a basic block.
a) illustrates a program segment, which contains a basic block C (BBC).
b) illustrates a control flow graph (CFG) of the program.
c) illustrates the pipeline execution behavior (PEB) of the basic block C alone.
d) illustrates the pipeline execution behavior (PEB) of basic block C following basic block A (BBA).
a) illustrates a control flow graph (CFG) of the program.
b) illustrates an example of static analysis of access events in a pipeline execution behavior (PEB).
c) illustrates an example for dynamic timing calculation.
a) shows a processor with two hierarchical caches, L1 and L2, and the clocked finite state machine (CFSM) of L1 to describe the cycle-by-cycle state transition behavior of the L1 cache.
b) illustrates the CFSM being converted into a compressed computation tree. The two paths of the computation tree correspond to the two types of cache timing behavior, i.e., hit and miss.
c) illustrates the CCA cache model implemented by a procedure call. Different paths in the computation tree are represented by different control flow branches.
The method of a Cycle-Count-Accurate (CCA) processor modeling is described below. In the following description, more detailed descriptions are set forth in order to provide a thorough understanding of the present invention, and the scope of the present invention is expressly not limited except as specified in the accompanying claims.
The key idea of the CCA modeling technique is to leverage limited observability of component internal states and speed up simulation by eliminating unnecessary internal modeling details without affecting overall system simulation accuracy. In the following, we first discuss the observability property of processor models and then propose a CCA processor model.
For a processor component, only the behaviors on its interface are directly observable to the system (or specifically, to the rest of the system). In other words, a system cannot directly observe and interact with a processor except through the interface.
As shown in
In one embodiment, when an instruction inside the pipeline requests writing data to the HW 1300, the transferred data passes through the cache 1120, triggers a bus transfer action on the bus interface (BIF) 1130, and is written to the HW 1300 via an external bus 1200. A sample timing diagram of the bus transfer is shown in
In one embodiment, as shown in
For a processor, all external accesses are initiated from the processor pipeline and then pass through the caches to the processor interface. Hence, as shown in
The modeling of pipeline subsystem model (PSM) 310 is described in detail below. In one embodiment, with respect to the pipeline subsystem model (PSM) 310, all possible pipeline execution behaviors (PEBs) of each basic block (BB) of a given program are statically analyzed before a simulation in order to eliminate unnecessary simulation details of the PSM 310. Then at simulation, the actual time points of issuing access events to the CSM 320 are calculated based on the pre-analyzed PEBs. Basic blocks usually form the vertices or nodes in a control flow graph (CFG). Compilers usually decompose programs into their basic blocks as a first step in the analysis process. As shown in
In one embodiment, the pipeline subsystem model (PSM) 310 captures the target pipeline architecture, so the pipeline execution of any given fixed sequence of instructions can be statically determined. Nevertheless, a complete program cannot be statically analyzed because it contains branches determinable only at runtime. Hence, the pipeline subsystem model (PSM) 310 first statically pre-analyzes each basic block of the program, since a basic block contains no branches. As shown in
In one embodiment, as shown in
In one embodiment, a basic block may have several possible PEBs because its execution could be affected by the executions of its precedent basic blocks. Considering the above-mentioned situation, the CCA processor modeling 300 includes a branch predictor 340, as shown in
In one embodiment, the PEB 530 is the case when the branch predictor 340 fails the branch prediction and the pipeline is flushed and hence the basic block C 501 is executed alone. However, if the branch prediction succeeds, the basic block C 501 is executed immediately following the basic block A 502, as shown in
In one embodiment, for efficient PSM simulation, all possible PEBs of every basic block are pre-analyzed. Given a program's CFG, the static analysis finds all strings of precedent blocks (or upward combinations of consecutive precedent blocks) that may induce different PEBs. Owing to the limited length of the pipeline 1110, the number of PEBs is bounded by the pipeline length as well. Therefore, if a precedent block is too far away from the currently analyzed block, the instructions of the two basic blocks cannot be executed simultaneously in the pipeline, and no new PEB is created.
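The bounded search for strings of precedent blocks can be sketched as follows. This is a simplified illustration, not the disclosed analysis: the function `precedent_strings`, the predecessor map `preds`, the instruction counts `sizes`, and the cutoff rule are assumptions. The sketch assumes that a farther precedent block can still overlap the analyzed block only while fewer than `pipeline_depth - 1` instructions lie between them.

```python
def precedent_strings(preds, sizes, block, pipeline_depth):
    """Enumerate strings of precedent blocks that may share the pipeline with `block`.

    preds: dict mapping a block name to a list of its CFG predecessors
    sizes: dict mapping a block name to its instruction count (>= 1)
    Returns a list of tuples ordered oldest block first; the empty tuple is
    the case where `block` executes alone (e.g. after a pipeline flush).
    """
    results = [()]

    def extend(chain, head, budget):
        for p in preds.get(head, ()):
            new_chain = (p,) + chain
            results.append(new_chain)
            remaining = budget - sizes[p]
            # Recurse only while an even older block could still be in flight;
            # the budget strictly shrinks, so loops in the CFG terminate.
            if remaining > 0:
                extend(new_chain, p, remaining)

    extend((), block, pipeline_depth - 1)
    return results
```

For a hypothetical CFG with edges b3 to b0, b0 to b2, and b1 to b2, block sizes b0=2, b1=6, b3=4, and a five-stage pipeline, the strings for b2 are the empty string, (b0), (b3, b0), and (b1). Each string corresponds to one candidate PEB to pre-analyze.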
In one embodiment, the basic block D 503 in
In one embodiment, for efficient PSM simulation, the access timing behavior of each PEB is statically analyzed by identifying both instruction and data access events at their corresponding execution time points. For instruction access events, each instruction at the instruction fetch (IF) stage in the PEB is checked to identify the time point at which an instruction cache (I-cache) access occurs. Only instruction accesses which may potentially cause cache misses should be identified as access events for simulation, since only they could cause external accesses and affect interface behaviors.
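One way to pick out instruction fetches that may potentially miss is sketched below. The selection rule is an assumption made for illustration and is not stated in the text: with sequential fetches inside a basic block, a fetch is treated as a potential miss only when it touches a cache line different from that of the previous fetch. The function `ifetch_events` and the parameters `line_size` and `instr_size` (fixed-width instructions) are hypothetical.

```python
def ifetch_events(start_addr, num_instrs, line_size=16, instr_size=4):
    """Return (instruction_index, address) pairs for fetches that may miss.

    Assumption: a fetch can miss only if it is the first fetch in the block
    to touch its cache line; later fetches from the same line are hits.
    """
    events = []
    last_line = None
    for i in range(num_instrs):
        addr = start_addr + i * instr_size
        line = addr // line_size
        if line != last_line:
            events.append((i, addr))
            last_line = line
    return events
```

For a five-instruction block starting at address 0x108 with 16-byte lines, only the fetches of instruction 0 (0x108) and instruction 2 (0x110) are kept as access events; the other three fetches cannot change the interface behavior under this assumption.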
In one embodiment, as shown in
In one embodiment, the method to analyze the PEB 620 is disclosed in
In one embodiment, the dynamic simulation behavior of the PSM 310 is described below. During dynamic simulation, the PSM 310 issues the access events based on the pre-analyzed PEBs. As shown in
As shown in
In one embodiment, as shown in
In one embodiment, the CFSM 720 is converted into a compressed computation tree 730 as in
In one embodiment, the CSM 320 is implemented by a procedure call as in
In one embodiment, as shown in
A CCA processor modeling 300, including the PSM 310 and the CSM 320 and optionally including the bus interface model 330 and the branch predictor 340, shows high simulation speed and accuracy in experiments. The experimental results are shown in
For accuracy verification, the simulated clock times of bus accesses from the generated CCA processor modeling 300 are checked against those of the target RTL model. Also, each test case run on the generated CCA processor modeling 300 has the same execution cycle count as on the RTL model.
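The accuracy check described above amounts to comparing two bus access traces. A minimal sketch follows, assuming each trace is a list of (cycle, operation, address) tuples; the trace format and the function name `first_mismatch` are hypothetical and are not taken from the experiments.

```python
def first_mismatch(cca_trace, rtl_trace):
    """Return the index of the first differing bus access, or None if identical.

    Each trace is assumed to be a list of (cycle, operation, address) tuples.
    """
    for i, (a, b) in enumerate(zip(cca_trace, rtl_trace)):
        if a != b:
            return i
    if len(cca_trace) != len(rtl_trace):
        # One trace is a strict prefix of the other.
        return min(len(cca_trace), len(rtl_trace))
    return None
```

A result of `None` means every bus access occurs at the same cycle with the same operation and address in both models.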
Simulation speeds are shown in million cycles per second (MCPS) for comparison. The proposed model, CCA processor modeling 300, is on average 50 times faster than the Traditional CA simulator, an interpretive ISS with a CA timing model. In comparison, Compiled CA, which uses the compiled ISS technique with the CA timing model, is barely twice the speed of the Traditional CA approach. This shows that no significant simulation speed-up can be achieved when only using a fast ISS technique with the CA timing model, because the CA timing simulation contributes a great portion of simulation time.
The
Although preferred embodiments of the present invention have been described, it will be understood by those skilled in the art that the present invention should not be limited to the described preferred embodiments. Rather, various changes and modifications can be made within the spirit and scope of the present invention, as defined by the following Claims.