The present invention relates generally to the design of integrated circuits. More specifically, it relates to improved techniques for simulating integrated circuit designs.
Application specific integrated circuits (ASICs) require testing and functional verification using simulators prior to mask generation and chip fabrication. The steps followed today in the design of ASICs are shown in the prior art flow diagram 100 of the drawing.
Simulators for HDLs run very slowly; for example, a simulation model for a 200 MHz graphics chip might run at 100 Hz, a slowdown factor of 2 million. At that rate, simulating 1 second of the operation of such a graphics chip takes 2 million seconds, or about 23 days.
In the processors that run such simulations, a cache stores recently accessed memory locations in blocks of high-speed memory.
Cache replacement policy decides where in the cache a copy of a particular entry of main memory will go. In a fully-associative cache, the replacement policy is free to choose any entry in the cache to hold the copy. Alternatively, if each entry in main memory can go in just one place in the cache, the cache is direct mapped. Many caches implement a compromise, and are described as set-associative. Associativity is a trade-off. If there are eight places the replacement policy can put a new cache entry, then when the cache is checked for a hit, all eight places must be searched. Checking more places takes more power, area, and potentially time. On the other hand, caches with more associativity suffer fewer misses, so less time is spent servicing those misses.
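As a minimal sketch of this mapping, assuming illustrative sizes and names (64-byte lines, 512 sets, 8 ways; none of these figures are from the text above), an address splits into a tag and a set id, and a W-way cache must compare W tags per lookup:

    #include <stdint.h>

    #define LINE_SIZE 64u    /* bytes per cache line (assumed) */
    #define NUM_SETS  512u   /* S: number of sets (assumed)    */
    #define NUM_WAYS  8u     /* W: associativity (assumed)     */

    typedef struct {
        uint64_t tag[NUM_WAYS];   /* one tag per way */
        int      valid[NUM_WAYS];
    } cache_set;

    static int lookup(cache_set sets[NUM_SETS], uint64_t addr)
    {
        uint64_t set_id = (addr / LINE_SIZE) % NUM_SETS;  /* which set      */
        uint64_t tag    = (addr / LINE_SIZE) / NUM_SETS;  /* remaining bits */
        /* with W ways, all W tags of the selected set must be compared */
        for (unsigned w = 0; w < NUM_WAYS; w++)
            if (sets[set_id].valid[w] && sets[set_id].tag[w] == tag)
                return 1;   /* hit  */
        return 0;           /* miss */
    }

A direct mapped cache is the W=1 case of this sketch, and a fully-associative cache is the single-set case.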
One of the advantages of a direct mapped cache is that it allows simple and fast speculation. Once the address has been computed, the one cache index, which might have a copy of that datum, is known. That cache entry can be read, and the processor can continue to work with that data before it finishes checking that the tag actually matches the requested address.
The idea of having the processor use the cached data before the tag match completes can be applied to associative caches as well. A subset of the tag, called a hint, can be used to pick just one of the possible cache entries mapping to the requested address. This datum can then be used in parallel with checking the full tag.
Working sets of large programs (e.g. multi-million-gate simulations) overflow processor caches, making the caches ineffective. State-of-the-art software allocates memory addresses to program data without regard to how they map onto the cache. Central data structures (e.g. those used by the scheduler of a simulator) compete for cache space with large user data that regularly spills the cache. Measurement of the execution cost of central routines (e.g. a scheduler) as a function of simulation size demonstrates that the execution cost rises dramatically.
Modern microprocessors execute an instruction In in stages; the sequence of stages is called a pipeline. There is usually more than one type of pipeline in a microprocessor. For example, the pipeline used to execute a “floating-point add” is different from the one used to execute a conditional branch instruction. The number of stages in each pipeline can also vary; a microprocessor's average pipeline depth is denoted D here. Instructions move from one stage to the next in a fixed amount of time, represented by a clock period. The clock period of a modern microprocessor is often under 1 nanosecond. The reciprocal of the clock period determines a processor's operating frequency; for example, a 3 GHz microprocessor moves instructions between stages every 0.33 nanoseconds. In addition to pipelining instructions, modern microprocessors execute several consecutive instructions in parallel. For example, a modern family of microprocessors issues up to 4 instructions in parallel using pipelines whose average depth D is 14 stages.
The result of executing an instruction In can be needed by subsequent instructions Ij (where j>n). If an instruction In begins to execute at clock period c, it will not be finished executing, on the average, until clock period c+D, where D is the processor's pipeline depth. If instruction I(n+1) uses the result of instruction In, but starts to execute at clock period c+1, it has to wait as long as D−1 clock periods before the result is available. The stage of instruction I(n+1) that needs the result of instruction In is usually not the first stage, so the wait need not be the full D−1 clock periods. Since such inter-instruction dependencies are common, microprocessors issue instructions to execution pipelines on a speculative basis. For example, consider the code
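The following C fragment is a hedged reconstruction of such a sequence, assuming a variable var being incremented and compared against 17; the comments note the corresponding machine instructions:

    static int var;

    void example(void)
    {
        var = var + 1;        /* ADD: increment var            */
        if (var == 17)        /* CMP: compare var with 17      */
            goto var_is_17;   /* JE: jump if equal             */
        return;               /* fall-through (not-taken) path */
    var_is_17:
        var = 0;              /* taken path (body assumed)     */
    }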
Consider a scenario where the ADD and CMP (compare) instructions enter the first stage of the execution pipe at clock period 5, and the conditional control transfer JE (jump if equal) enters the execution pipe's first stage at clock period 6. By the time the JE instruction is ready to fetch conditionally the instruction at label var_is_17, it is very likely that the result of the CMP instruction, and possibly even that of the ADD, is not ready. Rather than wait for the instructions to complete, the JE instruction makes a guess, or predicts, whether or not the branch will be taken. If the prediction is correct, a lot of waiting is avoided. However, if the prediction is incorrect, all of the instructions being executed in the mispredicted path need to be discarded. Backtracking in this manner is very expensive. Because of the high cost of mispredicted conditional branches, modern microprocessors expend a lot of logic to implement good prediction algorithms. These algorithms all look at the past behavior of a branch, perhaps in the context of other temporally precedent branches, to predict the behavior of the current branch.
Accordingly, there is a need to develop a method to allow simulators to run faster by improving cache utilization, predication methods, and block selection path determination methods.
To overcome the current teachings in the art, a method of improved simulator processing is provided. The method according to the current invention includes grouping frequently accessed data into one set id to improve memory hierarchy performance. The method further includes simulating predication in a non-predicated architecture to improve CPU performance. The simulated predication includes pseudo-predicated implementation of read-operation vector element access, pseudo-predicated implementation of write-operation vector element access, and predicated implementation of multi-way branches with assignment statements having a same left-hand-side (lhs). The method further includes determining a selection path in a multi-sensitive “always” block. The multi-sensitive “always” block selection path determination includes generating instance-specific code to save port allocation storage, and generating inlined instance-specific code to combine sensitive actions. The method further includes regenerating code affected by an assignment statement to implement value-change callback.
In one aspect of the invention, the data used by a simulation scheduler is allocated by probing a line size of a processor cache, providing a software override of the value of the probed line size, and selecting a core routine algorithm and/or data structure for the scheduler such that the sum of the sizes of the most frequently used data structures is no more than d1_linesize, where d1_linesize is the line size of the level 1 data cache. In a further aspect, a start address of the data structures is aligned to an address that is a multiple of the d1_linesize, whereby the data structures are ensured to occupy a minimum number of sets of the data cache.
In another aspect of the invention, a user specifies a set id of a class of central routines as either a fixed value in the range 0 to S−1 inclusive, where S is the number of sets in the cache, or as a randomly chosen value in the same range.
In a further aspect of the invention, the method further includes applying the technique to programming constructs that are unique to hardware description languages (HDLs), whereby predication is simulated on the non-predicated architecture.
In yet another aspect of the invention, target machine microarchitecture characteristics are measured, from which a maximum pseudo-predicated instruction sequence length is determined, and the measure may be overridden by a compiler-user-specified parameter.
In a further aspect, a first phantom element at index −1 of each vector is introduced, where the element is initialized to the value X (“unknown”), enabling pseudo-predicated evaluation of reads of the vector.
In another aspect of the invention, a second phantom element at index −2 of each vector is introduced, such that when the index expression has X/Z bits the −2 index serves as a temporary storage location for the suppressed write.
In another aspect of the invention, if there is no “else”/“default” clause and the number of terms in a multi-way branch is below a programmable threshold, the value of expr_else is set to lhs, whereby the statements of the multi-way branch are converted to allow for predication.
In yet another aspect of the invention, statements without an “else” clause are otherwise not converted to predicated form, and when inlining the “always” block, the sensitive assignment that triggered the “always” block is tracked.
In one aspect, code is inlined for each instance of a small module such that the need for a per-port pointer is obviated, and each inlined module can directly encode an actual parameter address.
In one aspect, all the instances of a module are viewed at compile time, whereby a chain of actions triggered by an assignment is inlined without consideration of instance boundaries, as long as the number of machine instructions in the inlined sequence is less than a parameter value.
In a further aspect, the presence of X/Z bits is checked, where, if X/Z bits are present, a separate code area is branched to for handling.
In another aspect of the invention, condition checks are done only by mainline code, while code for the statement bodies of each condition is stored in a separate code area. Further, nesting of condition checks is provided, in which case the separate code areas are likewise nested.
In a further aspect of the invention, the regenerated code for an assignment that is affected by a call to an acc_vcl_add( ) command is generated at the time the acc_vcl_add( ) command is executed.
In another aspect, the invention further comprises assigning a unique id to each format specifier. In this aspect, an I/O command only sends the format specifier id and data values to an I/O subsystem. In this embodiment, the I/O subsystem runs on a separate processor/thread to offload a main simulation processor.
The objectives and advantages of the present invention will be understood by reading the following detailed description in conjunction with the drawing.
Although the following detailed description contains many specifics for the purposes of illustration, anyone of ordinary skill in the art will readily appreciate that many variations and alterations to the following exemplary details are within the scope of the invention. Accordingly, the following preferred embodiment of the invention is set forth without any loss of generality to, and without imposing limitations upon, the claimed invention.
The current invention improves memory hierarchy performance by grouping frequently accessed data into one set id. The working set of large programs, for example multi-million-gate simulations, overflows processor caches and makes the caches ineffective. Current software techniques allocate memory addresses to program data without regard to how they map onto the cache, causing central data structures, for example those used by the scheduler of a simulator, to compete for cache space with large user data that regularly spills the cache. Measurements of the execution cost of central routines, such as a scheduler, as a function of simulation size show that the execution cost rises dramatically.
The current invention allocates data used by a particular class of central routines, such as a scheduler, by probing the line size of the target processor's level-1 D-cache (call it d1_linesize), where a software override for the probed value is also provided. The current invention selects the core routine's algorithm and/or data structures, such as a scheduler's, so that the sum of the sizes of the most frequently used data structures is no more than d1_linesize. According to the current invention, to ensure that the data structures mentioned above do not stride a level-1 D-cache line, the start address of the data structures is aligned at an address that is a multiple of d1_linesize. In addition to grouping the data structures for a particular class of central routines into a level-1 D-cache line, the set id of the memory addresses can also be set in the range 0 to S−1, where S is the number of sets in the cache. For example, the user is allowed to specify the set id of each class of central routines as either a fixed value between 0 and S−1 inclusive, or as a randomly chosen value in the same range.
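A minimal sketch of this allocation strategy, assuming a Linux/glibc environment (sysconf with _SC_LEVEL1_DCACHE_LINESIZE) and a hypothetical scheduler structure:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* hypothetical scheduler core state, sized to fit one D-cache line */
    typedef struct {
        void *event_head;     /* most frequently touched fields first */
        void *event_tail;
        long  current_time;
    } sched_core;

    int main(void)
    {
        long d1_linesize = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);  /* probe */
        if (d1_linesize <= 0)
            d1_linesize = 64;           /* software override / fallback */

        /* align the start address to a multiple of d1_linesize so the
         * structure never straddles a level-1 D-cache line */
        sched_core *core;
        if (posix_memalign((void **)&core, (size_t)d1_linesize,
                           sizeof(*core)) != 0)
            return 1;
        printf("line size %ld, core at %p\n", d1_linesize, (void *)core);
        free(core);
        return 0;
    }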
The current invention improves CPU performance by simulating predication in a non-predicated architecture. The deep pipelines of modern microprocessors result in many instructions in flight simultaneously; for example, up to 120 instructions can be in flight in a modern microprocessor. All in-flight instructions are discarded on a branch mispredict. The need to avoid mispredicted branches is well known. A commonly accepted way of avoiding branches altogether is the use of predication. Predication has been built into the instruction set of some newer microprocessor architectures. Branch avoidance has also been applied to simple expressions such as
X=(A<B)?CONST1:CONST2.
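On a non-predicated architecture, an expression of this shape can be evaluated branch-free; a minimal sketch, assuming 32-bit two's-complement integers:

    #include <stdint.h>

    /* branch-free equivalent of X = (A < B) ? CONST1 : CONST2;
     * -(a < b) is all-ones when the condition holds, all-zeros otherwise */
    static int32_t select_const(int32_t a, int32_t b,
                                int32_t const1, int32_t const2)
    {
        int32_t mask = -(int32_t)(a < b);
        return (const1 & mask) | (const2 & ~mask);
    }

On the 80x86, compilers also commonly emit a conditional-move (CMOV) instruction for this pattern.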
The current invention simulates predication in non-predicated architectures, for example the 80x86, by applying the technique to programming constructs unique to hardware description languages (HDL). One example is the Verilog semantics of assignments to vector elements, described in detail below.
A further example is a multi-part “if . . . then . . . else” statement or a “case . . . endcase” statement, with each part containing only an assignment statement “lvaluei=rvaluei” such that all lvaluei are the same. Such constructs are not implemented using pseudo-predication in state-of-the-art compilers.
The simulation of predication on non-predicated architectures like the 80x86 is termed pseudo-predication according to the current invention. The current invention determines maximum pseudo-predicated instruction sequence length by considering target machine microarchitecture characteristics, such as the average number of in-flight instructions. Further, the current invention allows this measure to be overridden by a compiler-user-specified parameter.
The current invention provides pseudo-predicated implementation of vector element access for a read operation. The Verilog semantics of the read of element index_expr of the vector v in an expression like lhs=v[index_expr] state that the value of v[index_expr] should be X (i.e. “unknown”) if index_expr itself has any X or Z (i.e. “high-impedance”) bits in it. An obvious implementation is:
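(A sketch of such a branching implementation, assuming helper names index_has_xz and index_value and an all-X constant ALL_X:)

    /* lhs = v[index_expr] with X-propagation, branching version */
    if (index_has_xz(index_expr))
        lhs = ALL_X;                       /* any X/Z bit in the index
                                              makes the whole result X */
    else
        lhs = v[index_value(index_expr)];  /* ordinary indexed read    */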
but this requires conditional branches, which are subject to misprediction.
According to the current invention, introduction of the phantom index permits the pseudo-predicated evaluation of vector index 404.
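A sketch of the branch-free read enabled by the phantom element, assuming the index is represented by a value word and an X/Z-mask word (all names illustrative): the vector is allocated with one extra element in front, v[−1] is pre-initialized to all-X, and the index is forced to −1 whenever it contains X/Z bits.

    #include <stdint.h>

    /* vec points one element past the start of its allocation, so that
     * vec[-1] is valid storage holding the all-X pattern */
    static uint32_t read_element(const uint32_t *vec,
                                 uint32_t idx_value, uint32_t idx_xz_bits)
    {
        /* all-ones when any X/Z bit is set in the index, else all-zeros */
        intptr_t xz = -(intptr_t)(idx_xz_bits != 0);
        /* index becomes -1 (the phantom element) when X/Z bits are present */
        intptr_t idx = ((intptr_t)idx_value & ~xz) | xz;
        return vec[idx];    /* no conditional branch on the X/Z test */
    }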
The current invention provides pseudo-predicated implementation of vector element access for a write operation. The Verilog semantics of the write of element index_expr of the vector v in an expression like v[index_expr]=rhs_expr state that no store should occur to v if index_expr itself has any X (i.e. “unknown”) or Z (i.e. “high impedance”) bits in it. An obvious implementation is:
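(A sketch of such a branching implementation, under the same assumed helper names:)

    /* v[index_expr] = rhs_expr: suppress the store entirely when the
     * index contains X/Z bits, branching version */
    if (!index_has_xz(index_expr))
        v[index_value(index_expr)] = rhs_expr;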
but this requires conditional branches (which are subject to misprediction).
According to the current invention, introducing the phantom index permits the pseudo-predicated evaluation of vector index 504.
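A sketch of the corresponding branch-free write, under the same assumed representation: suppressed stores are redirected to the scratch slot at index −2, so the store instruction always executes. A separate slot is used so that the all-X constant kept at −1 for reads is never overwritten.

    #include <stdint.h>

    /* vec points two elements past the start of its allocation; vec[-2]
     * is a scratch slot that absorbs suppressed stores */
    static void write_element(uint32_t *vec, uint32_t idx_value,
                              uint32_t idx_xz_bits, uint32_t rhs)
    {
        intptr_t xz = -(intptr_t)(idx_xz_bits != 0);
        /* index becomes -2 (the scratch phantom) when X/Z bits are present */
        intptr_t idx = ((intptr_t)idx_value & ~xz) | ((intptr_t)-2 & xz);
        vec[idx] = rhs;     /* the store always executes; no branch */
    }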
The current invention provides predicated implementation of multi-way branches with an assignment statement having the same left-hand-side.
The following example is provided of a prior art multi-way branch statement of the form:
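(An illustrative rendering of such a statement, in C-like syntax with assumed names:)

    /* multi-way branch in which every arm assigns the same lhs */
    if (cond1)
        lhs = expr1;
    else if (cond2)
        lhs = expr2;
    else
        lhs = expr_else;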
This multi-way branch statement is converted, as shown below, by the methods of the current invention, to allow for predication.
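A sketch of the converted form, under the same assumed names; each arm becomes a conditional select on the same lhs, which a compiler can map to conditional-move instructions rather than branches:

    lhs = expr_else;             /* or lhs itself when there is no else */
    lhs = cond2 ? expr2 : lhs;   /* one conditional move, no branch     */
    lhs = cond1 ? expr1 : lhs;   /* first arm applied last, so it wins  */

Applying the selects in reverse order preserves the priority of the original if/else chain.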
Accordingly, if there is no “else”/“default” clause, and the number of terms in the multi-way branch is below a programmable threshold, the value of expr_else is set to lhs. Otherwise, statements without else clauses are not converted to predicated form.
The current invention reduces taken branches by determining the selection path in a multi-sensitive “always” block. A common coding style in HDLs, for example Verilog, is

    always @(reset or clock) begin
Here, the activation of the reset command is infrequent, yet the code requires jumping around (or predicating through) the reset block. According to the invention, when inlining this “always” block, the current invention tracks the sensitive assignment that triggered this “always” block. If the assignment variable is not present in the condition expression of the top-level “if . . . else”, the corresponding if branch body is considered unlikely to execute.
The current invention further provides reduction of taken branches by determining the selection path in a multi-sensitive “always” block. As an example, clock=0 generates the sensitive inline sequence:
However, reset=0 generates the sensitive inline sequence
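A hedged sketch in C of how the two sequences could differ, assuming the “always” block body is the classic pattern “if (reset) count = 0; else if (clock) count = count + 1;” (all names and bodies are assumptions): for the clock=0 trigger, reset is not the changed variable, so the reset branch is judged unlikely and moved out of the mainline.

    static unsigned reset, clock, count;

    /* inline sequence generated for the assignment clock = 0 (sketch) */
    static void assign_clock_0(void)
    {
        clock = 0;
        if (reset)                /* judged unlikely for this trigger */
            goto reset_body;      /* body moved out of the mainline   */
        count = count + 1;        /* likely path stays sequential     */
    resume:
        return;

    reset_body:                   /* out-of-line code */
        count = 0;
        goto resume;
    }

For the reset=0 trigger, by contrast, reset appears in the top-level condition, so its branch body would be kept in the mainline sequence.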
The current invention saves port allocation storage by generating instance-specific code. The discrepancy between access times of on-chip caches and off-chip memory makes high data-cache utilization a critical code generation objective. Each port declaration in a module requires a pointer allocation per port to point at an actual parameter. This is a particular issue for very small modules that have many ports. The current invention inlines the code for each instance of a small module. This method obviates the need for a per-port pointer, since each inlined module can encode the actual parameter address directly.
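A sketch of the saving (module and signal names are assumptions): generic module code reaches its actuals through per-port pointers, while instance-specific inlined code encodes the actual parameter addresses directly.

    /* generic, shared code: one pointer per port, per instance */
    typedef struct {
        unsigned *d;    /* pointer to the actual for port d */
        unsigned *q;    /* pointer to the actual for port q */
    } dff_ports;

    static void dff_eval(dff_ports *p) { *p->q = *p->d; }

    /* instance-specific inlined code: no port pointers are stored;
     * the actual parameter addresses are encoded in the code itself */
    static unsigned top_u1_d, top_u1_q;
    static void dff_eval_u1(void) { top_u1_q = top_u1_d; }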
The current invention combines sensitive actions by generating inlined, instance-specific code. Static scheduling of actions that are triggered by, or are sensitive to, an assignment is key to increasing simulation performance. For example, an assignment such as clk=0 generates code equivalent to an inline sequence of the actions sensitive to clk.
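A minimal sketch (instance and routine names are assumptions): the compiler knows at code-generation time which blocks are sensitive to clk, so the calls are scheduled statically rather than discovered at run time.

    static unsigned clk;

    extern void eval_negedge_block_u1(void);   /* sensitive action, u1 */
    extern void eval_negedge_block_u2(void);   /* sensitive action, u2 */

    /* code generated for the assignment clk = 0 (sketch) */
    static void assign_clk_0(void)
    {
        clk = 0;
        eval_negedge_block_u1();   /* statically scheduled */
        eval_negedge_block_u2();
    }

With the cross-instance inlining described next, these calls are themselves flattened into the sequence when the instruction-count budget allows.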
If clk is bound to a port in one or more module instances, a subroutine has to be called in the module's code to process sensitivities to clk within the module. This approach is expensive because the assignment may not activate anything within the module instance, and this fact may be determinable statically; e.g. if a submodule only has “always @(posedge clk)” references, the “clk=0” assignment will never trigger in the submodule, yet the subroutine call and return are still executed, which is expensive in terms of CPU time. This is especially so since a module instance triggered by the “clk=0” assignment may in turn trigger a chain of nested actions.
The current invention looks at all module instances at compile time, and inlines the chain of actions triggered by an assignment, without regard to instance boundaries. This is done as long as the number of machine instructions in the inline sequence is less than a parameter value.
Packed 4-value data representation reduces execution cost and improves cache utilization. Here, data is packed in a way that is compatible with the target architecture's registers.
Encoding mode and value bits simultaneously often saves instructions, such as those needed to compute the conditional var===2, or to compute bitwise operations (a|b, a&b, . . . ). Encoding mode and value bits simultaneously also maximizes use of scarce register resources.
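A sketch of one packed 4-value encoding, modeled on the common aval/bval scheme (the exact encoding used is an assumption): value and mode bits are kept in parallel machine words, so 32 four-state bits occupy two 32-bit registers.

    #include <stdint.h>

    /* packed 4-state word: for bit i, (aval[i], bval[i]) encodes
     * 0 = (0,0), 1 = (1,0), Z = (0,1), X = (1,1)   (assumed encoding) */
    typedef struct { uint32_t aval, bval; } fourstate;

    /* case equality (===) compares value and mode bits simultaneously */
    static int case_eq(fourstate a, fourstate b)
    {
        return a.aval == b.aval && a.bval == b.bval;
    }

    /* bitwise AND: a result bit is 0 wherever either operand bit is a
     * known 0; any other combination involving X/Z yields X */
    static fourstate and4(fourstate a, fourstate b)
    {
        fourstate r;
        uint32_t known0 = (~a.aval & ~a.bval) | (~b.aval & ~b.bval);
        r.bval = (a.bval | b.bval) & ~known0;   /* X where unknown    */
        r.aval = (a.aval & b.aval) | r.bval;    /* X encoded as (1,1) */
        return r;
    }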
Compacting data items in a module improves cache utilization. For example, compaction can use a scheme that sorts data items in a module according to decreasing size, or a scheme that further allocates memory according to “natural” boundaries, but only up to min(bus_width, reg_size) of the simulation target processor, where bus_width is the number of data bits fetched from memory in a single clock, and reg_size is the maximum number of bits in a general register of the processor. Note that the allocation scheme permits the bus_width value to be specified as a compiler flag. This scheme maximizes cache utilization by reordering data to improve packing. However, it avoids the problem of “split bus transactions” by aligning at least up to the width of a memory bus transfer. An example follows:
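(A hedged example, assuming min(bus_width, reg_size) is 8 bytes and illustrative field sizes:)

    #include <stdint.h>

    /* data items sorted by decreasing size and aligned to natural
     * boundaries only up to 8 bytes: no interior padding is needed,
     * and no item straddles an 8-byte bus transfer */
    struct module_data {
        uint8_t  wide_vec[16];   /* offset  0: largest item first */
        uint64_t r64;            /* offset 16: 8-byte aligned     */
        uint32_t r32;            /* offset 24                     */
        uint16_t r16;            /* offset 28                     */
        uint8_t  r8a;            /* offset 30                     */
        uint8_t  r8b;            /* offset 31: 32 bytes total     */
    };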
Affinity-based code grouping improves cache utilization and reduces control hazards. Not all code related to an algorithm is equally likely to execute; in Verilog, for example, the code that handles X/Z bit values is rarely exercised in practice.
The current invention checks for X/Z bits and, if they are present, branches to a separate code area that handles them.
According to the current invention, mainline code only does condition checks. Code for the statement bodies for each condition is stored in a separate code area. Since statements can be nested, for example the body of an if . . . if-else . . . statement can itself contain an if . . . if-else . . . , the current invention allows nesting of the separate code area. According to the current invention, the possibly nested series of separate code areas are called shove areas. The current invention keeps code that is expected to execute sequentially in a single code area, and moves expected non-sequential code to a separate shove area to reduce taken branches, and improve the efficiency of instruction prefetchers present in modern microprocessors.
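A sketch of the mainline/shove-area layout, with assumed helper names:

    static unsigned a, b, y;
    extern int  has_xz(unsigned v);       /* assumed helper           */
    extern void handle_xz_case(void);     /* rarely executed handling */

    static void eval_assign(void)
    {
        /* mainline: condition check only; the rarely executed body
         * lives in a separate shove area past the sequential code */
        if (has_xz(a) || has_xz(b))
            goto shove_xz;
        y = a & b;                 /* expected sequential path */
    resume:
        return;

    shove_xz:                      /* shove area, out of the mainline */
        handle_xz_case();
        goto resume;
    }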
The current invention implements a value-change callback by regenerating code. Value-change callback is a Verilog feature that allows a foreign language such as C/C++ to register a callback at runtime. The callback is activated whenever a value change occurs during simulation; this is the function of the Verilog PLI routine acc_vcl_add( ). Since the Verilog object being monitored is not known prior to the simulation run, every Verilog object would need to be monitored. Monitoring every object is prohibitively expensive, and state-of-the-art Verilog compilers require the user to specify a list of objects that need monitoring. The current invention provides a method of regenerating the code for assignment statements affected by a call to acc_vcl_add( ) at the time the acc_vcl_add( ) is executed, which is optimum in the context of a dynamic code-generation platform according to the current invention. Since many acc_vcl_add( ) calls can be made in a single PLI routine, code regeneration is best deferred to the point where the PLI routine exits.
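A sketch of what the regenerated code for an assignment to a monitored signal v might look like; every name other than acc_vcl_add( ) is an assumption:

    static unsigned v;

    extern unsigned eval_rhs(void);            /* assumed */
    extern void vcl_dispatch(void *obj,
                             unsigned oldval,
                             unsigned newval); /* assumed */

    /* regenerated code for the assignment "v = rhs_expr" */
    static void assign_v(void)
    {
        unsigned oldval = v;
        unsigned newval = eval_rhs();
        v = newval;
        if (newval != oldval)          /* only a real change fires */
            vcl_dispatch(&v, oldval, newval);
    }

Assignments that no acc_vcl_add( ) call has touched keep their original, check-free code.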
The current invention provides a method of compressing and improving throughput of formatted I/O in Verilog. Simulations often result in many gigabytes of generated data, for example, in Verilog, using $display( ), or the io_printf( ) call. Formatting $display( ) and io_printf( ) text is time consuming. Storage size and network/DASD traffic are real performance issues, especially when used in arrays of simulators or “simulation farms.” Instead of compressing formatted data, the current invention assigns a unique id to each format specifier. The I/O command sends only the format specifier id and the data values to the I/O subsystem. Optionally, the I/O subsystem runs on a separate processor/thread to offload the main simulation processor.
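A sketch of the format-id scheme; the table, record layout, and names are assumptions:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define MAX_FORMATS 1024

    static const char *format_table[MAX_FORMATS];  /* id -> format string */
    static int format_count;

    /* register a format specifier once; returns its unique id
     * (table-overflow handling omitted in this sketch) */
    static int format_id(const char *fmt)
    {
        for (int i = 0; i < format_count; i++)
            if (strcmp(format_table[i], fmt) == 0)
                return i;
        format_table[format_count] = fmt;
        return format_count++;
    }

    /* instead of formatting, emit (id, value) to the I/O subsystem; a
     * separate thread or process can expand ids back to text offline */
    static void log_value(FILE *channel, const char *fmt, uint32_t value)
    {
        uint32_t record[2] = { (uint32_t)format_id(fmt), value };
        fwrite(record, sizeof record, 1, channel);
    }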
The present invention has now been described in accordance with several exemplary embodiments using the hardware description language (HDL) called Verilog, which are intended to be illustrative in all aspects, rather than restrictive. Thus, the present invention is capable of many variations in detailed implementation, which may be derived from the description contained herein by a person of ordinary skill in the art. For example, the same scenarios are applicable to the hardware description languages VHDL and SystemVerilog, as well as to the modeling language SystemC.
All such variations are considered to be within the scope and spirit of the present invention as defined by the following claims and their legal equivalents.
This application is cross-referenced to and claims the benefit from U.S. Provisional Patent Application 60/923,389 filed Apr. 12, 2007, which is hereby incorporated by reference.