1. Field of the Invention
The present invention relates to the electronic design of integrated circuits, and more specifically to a method for the hardware accelerated functional verification of a target integrated circuit design modeled in a hardware description language such as Verilog, VHDL, System Verilog, or System C.
2. Related Art
Functional verification is one of the steps in the design of integrated circuits. Functional verification generally refers to determining whether a design representing an integrated circuit performs the function it is designed for. The inventors have previously disclosed functional verification systems (U.S. Pat. Nos. 6,691,287, 6,629,297, 6,629,296, 6,625,786, 6,480,988, 6,470,480, and 6,138,266) in which a target design is partitioned into many combinational logic blocks connected by sequential elements. The state tables corresponding to the logic blocks are evaluated and stored in multiple random access storage devices (RASDs). Such an approach may have several disadvantages. For example, some logic blocks may exceed the convenient width of typical RASDs. Some target designs may contain functional blocks such as user specific memories, or simply require many more logic blocks and internal signals than can be practically accommodated. Accordingly, the embodiments of the previous patents may not be suitable in some environments. Furthermore, conventional verification environments do not scale with the rapidly expanding size of chips and the complexity of designs deploying reusable silicon intellectual property. Thus it can be appreciated that what is needed is a hardware simulation system for electronic circuit design that efficiently uses a large number of processors physically distributed among multiple units, which requires accommodating transfer delay. Accordingly, what is needed is a method of compiling a hardware description to execute on a scalable architecture comprising a plurality of processors with non-uniform transfer delay.
The present invention is a method embodied in a compiler for translating a hardware description of an electronic circuit to evaluation instructions and optimizing the instructions to efficiently utilize a plurality of processors distributed across a plurality of units.
The present invention is a system for verifying electronic circuit designs in anticipation of fabrication by simulation and emulation. The system uses
The present invention further comprises a method for scalably emulating the electronic circuit description, tangibly embodied as program instructions on a computer-readable medium controlling the operation of one or more processors, the method comprising the steps of
executing program instructions on a plurality of evaluation processors and on a plurality of canvassing processors resulting in the transfer of results of selected evaluation processor evaluations available to and read by selected evaluation processors to perform further evaluations; and
updating one or more circuit signal values,
wherein updating in an embodiment comprises the steps of
reading a circuit signal value,
transferring a circuit signal value, and
storing circuit signal value data in circuit signal value storage media;
compiling one or more hardware descriptions to processor instructions, wherein compiling comprises
The present invention is further disclosed in
The means for transferring an instruction or a circuit signal value among one or more processors, and one or more storage devices, include but are not limited to
Each evaluation processor is coupled to a plurality of other evaluation processors and through a canvassing processor to a medium coupled to all other evaluation processors in the system. The evaluation processor is further coupled to an instruction storage device and to a circuit value storage device. The evaluation processor is blocked from executing the instruction until all the necessary circuit values it requires as inputs are validated by a data checking circuit.
Each canvassing processor is coupled to the outputs of a plurality of evaluation processors and is coupled to certain transfer circuits of the medium. Under the control of a canvassing instruction scheduled by the compiler, it deterministically transfers a certain evaluated circuit signal value to a certain reading circuit coupled to a certain evaluation processor requiring the circuit signal value for further evaluation.
A model of a circuit written in a hardware description language is converted to instructions executable by a plurality of evaluation processors located on a plurality of evaluation units interconnected by canvassing processors. The present invention is embodied in the compiler which reads a hardware description language file and emits executable instruction files for the evaluation processors and the canvassing processors.
The present invention, for compiling a circuit description to evaluation instructions for a plurality of evaluation processors within a plurality of units, is a method comprising the steps of
The method of selecting instructions comprises building a table of available instruction templates appropriate to the evaluation processor, reading a hardware description of a circuit, and selecting instructions from available instruction templates according to speed, capacity requirements, and cost.
The method of clustering critical paths comprises creating an uncuttable group of related critical paths referred to as a fascine, assigning a cost to each communication edge between instructions, and tracing from the inputs of every register backward through instructions to an output of a register to identify a critical path with a greater number of communication edges than other paths.
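By way of illustration only (the patent does not prescribe an implementation), the trace that identifies the path with the greatest number of communication edges could be sketched in Python as follows; the adjacency map and node names are hypothetical, and the sketch traces forward from a register output, which is equivalent for an acyclic netlist.

```python
# Hypothetical netlist fragment as an adjacency map (driver -> readers).
# "REG_OUT_*" nodes are register outputs; "REG_IN_*" are register inputs.
edges = {
    "REG_OUT_a": ["n1"],
    "n1": ["n2", "n3"],
    "n2": ["REG_IN_x"],
    "n3": ["n4"],
    "n4": ["REG_IN_x"],
}

def longest_path_from(node, edges):
    """Return (length, path): the path from node to a sink with the
    greatest number of communication edges, found by depth-first
    search over the acyclic netlist."""
    best_len, best_path = 0, [node]
    for succ in edges.get(node, []):
        l, p = longest_path_from(succ, edges)
        if l + 1 > best_len:
            best_len, best_path = l + 1, [node] + p
    return best_len, best_path

length, path = longest_path_from("REG_OUT_a", edges)
```

In this toy netlist the branch through n3 and n4 carries more communication edges than the branch through n2, so it is the critical path that would be gathered into a fascine.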
The method of partitioning among a plurality of units comprises distributing a graph among units and ensuring that send and receive nodes are not on critical paths, so as to balance computation across all available hardware resources and to minimize the overall critical path of the system. The method for maximizing parallelism comprises allocating instructions to processors in a balanced way, minimizing communication congestion on critical paths, keeping critical paths on the same node by assigning to each communication edge between instructions a cost that reflects criticality, and generating an uncuttable group of related critical paths referred to as a fascine.
The method of scheduling in reverse order comprises partitioning sending and receiving nodes on critical paths to be close rather than remote; scheduling an instruction for a sending node that must be remote from a receiving node earlier, to allow propagation of results; and ensuring that every send node is computed before its results are required at a receive node, by scheduling in reverse order from outputs to inputs and synthesizing canvassing processor instructions.
The invention further includes a critical path optimizing method comprising assigning a cost value to every path, assigning a higher cost value to critical paths, assigning nodes to units, adding additional cost to paths which traverse unit to unit, computing the overall cost to determine if a critical path has been cut, and canceling the assignment if the effect is deleterious.
The invention further includes a unit assignment compacting method comprising levelizing evaluation instructions with respect to registers of the design, folding levels into flights constrained by the processor resources, inserting noops to space evaluation instructions within a fold, packing non-critical evaluation instructions to replace noops, grouping signals to be communicated into packets, and encoding constraints in the netlist on the order in which packets are sent, so as to ensure that the transmission ordering constraint imposed by the order in which signals are received does not conflict with other constraints on the order in which signals transmit, whereby the compiler can schedule backward in time by grouping signals that are to be received together before determining exactly when they will be sent.
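The levelizing, folding, and noop-insertion steps above admit a minimal sketch. The following Python is illustrative only; the dependency map, node names, and processor count are hypothetical, and packing of non-critical instructions into the noop slots is left out for brevity.

```python
NOOP = "noop"

def levelize(deps):
    """Assign each instruction a level = 1 + the maximum level of its
    inputs, so that a level depends only on earlier levels."""
    levels = {}
    def level(n):
        if n not in levels:
            levels[n] = 1 + max((level(d) for d in deps.get(n, [])), default=0)
        return levels[n]
    for n in deps:
        level(n)
    return levels

def fold_into_flights(levels, procs):
    """Fold each level into flights of at most `procs` instructions,
    padding the final flight of a level with noops to space the
    evaluation instructions within the fold."""
    by_level = {}
    for n, l in sorted(levels.items()):
        by_level.setdefault(l, []).append(n)
    flights = []
    for l in sorted(by_level):
        insns = by_level[l]
        for i in range(0, len(insns), procs):
            flight = insns[i:i + procs]
            flight += [NOOP] * (procs - len(flight))
            flights.append(flight)
    return flights

levels = levelize({"a": [], "b": [], "c": ["a", "b"], "d": ["a"]})
flights = fold_into_flights(levels, procs=3)
```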
The invention further includes estimating transfer delay as one of a uniform transfer delay or a plurality of quantized transfer delays, comprising the steps of selecting an edge of a directed acyclic graph of the design pseudo-randomly, inserting a quantum of delay associated with breaking the path, determining if it becomes a critical path, measuring the topological interconnection between two critical paths, and assigning both paths to a fascine of critical paths with uniform transfer delay if the potential communication traffic is above average.
The invention further includes a meta function evaluation method comprising selecting an evaluation with input width greater than the capacity of a single processor, assigning the evaluation to a canvassing processor, setting an address of a canvassing processor storage to one of the possible input values of the evaluation, and storing a result of a meta function evaluation into the canvassing processor storage so as to cause retrieval of a result of a meta function evaluation from a canvassing processor storage by applying the evaluation inputs as the address of a canvassing processor storage.
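The meta function evaluation described above amounts to a lookup memory: every possible input value is applied as an address and the precomputed result is stored there. A minimal Python sketch, with a hypothetical 10-bit parity function standing in for an evaluation wider than a single processor's capacity:

```python
def build_meta_table(fn, width):
    """Precompute fn over every `width`-bit input; the result stored
    at address i is fn applied to the input value i, mirroring a
    canvassing processor storage loaded at compile time."""
    return [fn(i) for i in range(1 << width)]

def eval_meta(table, inputs_value):
    """Retrieve a result by applying the evaluation inputs as the
    address of the storage."""
    return table[inputs_value]

# Hypothetical wide evaluation: parity of a 10-bit input.
parity = lambda v: bin(v).count("1") & 1
table = build_meta_table(parity, 10)
result = eval_meta(table, 0b1011)
```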
Scheduler
The process of scheduling ensures that every send node is computed before its results are required at a receive node, and does so efficiently using available resources and with minimum delay. Sending nodes that must be remote from their receiving nodes must have their instructions scheduled earlier to allow propagation of results. By partitioning sending and receiving nodes on critical paths to be physically close, the present invention simplifies scheduling.
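A minimal sketch of such backward scheduling, for illustration only: slots are assigned from outputs toward inputs, and a producer whose consumer sits on a different unit is pushed earlier by a hypothetical remote transfer delay (here 3 slots versus 1 for a local transfer).

```python
def schedule_reverse(deps, unit_of, remote_delay=3):
    """Schedule nodes backward from outputs: a node's slot is early
    enough that its result propagates (with remote_delay if the
    consumer is on a different unit) before every consumer runs.
    Slots are non-positive; outputs run at slot 0."""
    consumers = {}
    for n, ds in deps.items():
        for d in ds:
            consumers.setdefault(d, []).append(n)
    slot = {}
    def visit(n):
        if n in slot:
            return slot[n]
        t = 0  # a node with no consumers is an output at time 0
        for c in consumers.get(n, []):
            delay = remote_delay if unit_of[c] != unit_of[n] else 1
            t = min(t, visit(c) - delay)
        slot[n] = t
        return t
    for n in deps:
        visit(n)
    return slot

# Hypothetical chain: src (unit 1) feeds mid feeds out (both unit 0).
slots = schedule_reverse(
    {"out": ["mid"], "mid": ["src"], "src": []},
    {"out": 0, "mid": 0, "src": 1})
```

The remote producer `src` lands four slots ahead of the output, leaving room for its result to cross the unit boundary.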
The present invention further comprises a method of coordinating the evaluation of logic and transfer of logic evaluation results on a bus to eliminate the possibility of deadlock wherein results cannot reach the logic which requires input data.
The present invention further comprises a method for managing unit to unit data transfer. Such a transfer takes several cycles, so it must be scheduled within a window ahead of when the data is needed in a target unit. Only a limited number of transfers can be handled "in transit" at once, so some logic may be held for evaluation until bandwidth is available. The method is data driven, i.e., not strictly synchronous, thereby tolerating some flexibility in promptness.
Initially every transfer is assumed to be at its worst case of being unit to unit. Assigning an edge to intra-unit transfer simplifies the scheduling of the bus resource and reduces the time spent in transit. An edge on the critical path is randomly chosen to be placed within a unit. If the critical path is still critical, the step repeats; otherwise another critical path is calculated. Optimization stops when all of the physical resources for clusters in a unit are consumed. In conventional systems there is effectively one unit and therefore no method of optimizing assignment across units.
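The iteration above can be sketched as a greedy loop, for illustration only; the paths, edge names, delay values, and capacity are all hypothetical.

```python
import random

def optimize_edges(paths, delays, intra, inter, capacity, seed=0):
    """Greedy sketch: repeatedly find the current critical (longest
    delay) path, pseudo-randomly pick one of its still-interunit
    edges, and reassign it to intra-unit transfer, stopping when the
    per-unit cluster capacity is consumed."""
    rng = random.Random(seed)
    placed = 0
    def path_delay(p):
        return sum(delays[e] for e in p)
    while placed < capacity:
        critical = max(paths, key=path_delay)
        candidates = [e for e in critical if delays[e] == inter]
        if not candidates:
            break  # every edge of the critical path is already on-unit
        delays[rng.choice(candidates)] = intra
        placed += 1
    return delays

# Hypothetical: two paths, all edges initially worst-case interunit.
delays = optimize_edges(
    paths=[["e1", "e2"], ["e3"]],
    delays={"e1": 5, "e2": 5, "e3": 5},
    intra=1, inter=5, capacity=2)
```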
The present invention further comprises a method for bus management to avoid deadlock. A window of several cycles is required to propagate evaluation output data to the subscribing evaluation inputs. Scheduling a data receive to drive a specific cluster therefore means that a data transmit must be done with some margin before it, and that the logic evaluation that drives the bus must occur in a cluster at an earlier time.
It is not the case that transfers can occur in any order. Suppose that nodes A and B are on unit X and need to send data to unit Y. It is not necessarily the case that the data from nodes A and B can be sent from X to Y in the same cluster: for example, A may drive B, so A needs to be evaluated before B. If the compiler were scheduling forward in time, this would not be an issue. However, the compiler schedules backward in time, so it needs to group signals that are to be received together before it determines exactly when they will be sent. Therefore, to prevent deadlock, the unit assigner method comprises the step of grouping signals to be communicated into packets and encoding constraints in the netlist on the order in which packets are sent, to make sure that the transmission ordering constraint imposed by the order in which signals are received does not conflict with other constraints on the order in which signals transmit.
A deadlock would occur if two units were to send too much data to each other without receiving anything, causing execution of both units to block each other. To prevent deadlock, the compiler method further comprises the step of tracking the amount of communication in progress from each unit to each other unit. If this amount might exceed the transmission FIFO memory, the compiler method further comprises the step of avoiding scheduling receives until transmits have been scheduled. If necessary, the compiler method further comprises modifying the netlist to allow a transmission to be scheduled immediately.
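The compile-time bookkeeping described above might look like the following sketch; the FIFO depth, unit numbering, and word counts are hypothetical, and the real compiler's reordering and netlist modification are reduced to a refusal return value.

```python
class TransferTracker:
    """Track scheduled-but-not-yet-received words per (src, dst) unit
    pair, and refuse to schedule a transmit that could overflow the
    transmission FIFO between those units."""
    def __init__(self, fifo_depth):
        self.fifo_depth = fifo_depth
        self.in_flight = {}

    def schedule_transmit(self, src, dst, words):
        key = (src, dst)
        if self.in_flight.get(key, 0) + words > self.fifo_depth:
            return False  # would overflow: compiler must reorder first
        self.in_flight[key] = self.in_flight.get(key, 0) + words
        return True

    def schedule_receive(self, src, dst, words):
        # A scheduled receive drains the in-flight count for the pair.
        key = (src, dst)
        self.in_flight[key] = self.in_flight.get(key, 0) - words
```

With a depth-4 FIFO, a 3-word transmit succeeds, a further 2-word transmit is refused until a receive drains the FIFO, which is exactly the blocking condition the compiler avoids by scheduling the receive first.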
The present invention, embodied in a compiler, is a method of compiling a hardware description language description of a circuit to efficient parallel instructions for use in an array of processors, comprising the steps of assigning instructions to processors, scheduling instructions in reverse order, optimizing critical paths in the topology of the design, and translating a hardware description of a circuit to a plurality of canvassing instructions and a plurality of evaluation instructions, whereby all evaluations are executed in advance of when their propagated results are required for subsequent evaluations and the circuit is simulated in the least time.
The present invention, a method of compiling a circuit description to processor instructions, comprises the following steps:
assigning evaluation instructions to certain processors, wherein assigning comprises the steps of: packing non-critical evaluation instructions efficiently to replace noops, balancing the load of evaluation instructions among processors, and minimizing data transfer volume and delay; scheduling instructions in reverse order,
wherein scheduling comprises the steps of:
levelizing evaluation instructions with respect to registers of the design,
folding levels into flights constrained by the processor resources, and
inserting noops to space evaluation instructions within a fold;
optimizing critical paths in the topology of the design,
wherein optimizing comprises the steps of:
estimating the effect of transfer delay on critical paths, assembling a fascine of critical paths to optimize data transfer, and breaking paths not included in a fascine of critical paths; and translating a hardware description of a circuit to evaluation and canvassing instructions.
The method further comprises the following steps:
The optimizing method further comprises the steps of:
estimating the effect of a transfer delay on a critical path, assembling a fascine of critical paths to optimize data transfer, and breaking a path not included in a fascine of critical paths.
The transfer delay may be estimated either as a uniform transfer delay or as a plurality of quantized transfer delays. The optimizing method comprises selecting an edge of a directed acyclic graph of the design pseudo-randomly, inserting a quantum of delay associated with breaking the path, and determining if it becomes a critical path. The optimizing method further comprises measuring the topological interconnection between two critical paths and assigning them to a fascine of critical paths with uniform transfer delay if the potential communication traffic is above average.
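The select-an-edge, insert-a-quantum, re-test-criticality step can be sketched as follows, for illustration only; the paths, edge delays, and quantum size are hypothetical, and the fascine-assembly step is omitted.

```python
import random

def insert_delay_quantum(paths, delays, quantum, seed=0):
    """Pseudo-randomly select an edge of the design graph, charge it
    one quantum of transfer delay (the cost of breaking the path
    there), and report whether the path containing it has become the
    critical path."""
    rng = random.Random(seed)
    edge = rng.choice(sorted(delays))
    delays[edge] += quantum
    def length(p):
        return sum(delays[e] for e in p)
    critical = max(paths, key=length)
    return edge, (edge in critical)

# Hypothetical graph: two paths with small edge delays.
edge, became_critical = insert_delay_quantum(
    paths=[["a", "b"], ["c"]],
    delays={"a": 1, "b": 1, "c": 3},
    quantum=5)
```

In this toy example the quantum is large enough that whichever edge is broken, the path through it becomes critical.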
Critical Path Reducer
The present invention further comprises a method of selecting and reassigning nodes or nets within the critical path of a design to efficiently assign physical resources and communication bandwidth.
The method of critical path merging comprises the steps of
1. For each node v, computing the length of the longest path from v to a register or primary output. Since the netlist is a directed acyclic graph, the longest path exists and is finite. Call this value the back rank of v.
2. Computing the length of the longest path in the domain. This length multiplied by the intraboard delay is a lower bound on the time required to evaluate the domain. This value is the goal path length.
3. For each node v, working from inputs to outputs, computing an estimated execution time as follows:
4. Merging u and v if the estimated execution times of u and v as computed in step 3 above differ by less than the interunit delay and u drives v.
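Steps 1 and 4 above admit a compact sketch; the netlist, estimated execution times, and interunit delay below are hypothetical, and step 3's estimation itself is taken as given.

```python
def back_ranks(succs):
    """Back rank of v = length of the longest path from v to a sink
    (register or primary output) in the acyclic netlist; sinks have
    back rank 0."""
    ranks = {}
    def rank(v):
        if v not in ranks:
            ranks[v] = max((rank(s) + 1 for s in succs.get(v, [])), default=0)
        return ranks[v]
    for v in succs:
        rank(v)
    return ranks

def merge_candidates(succs, est_time, interunit_delay):
    """Pairs (u, v) where u drives v and their estimated execution
    times differ by less than the interunit delay, per step 4."""
    return [(u, v) for u, vs in succs.items() for v in vs
            if abs(est_time[u] - est_time[v]) < interunit_delay]
```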
The scheduling method further comprises the steps of:
levelizing evaluation instructions with respect to registers of the design,
folding levels into flights constrained by the processor resources, and
inserting noop instructions to space evaluation instructions within a fold.
The optimization further comprises the steps of: replacing a noop instruction with a non-critical evaluation instruction, balancing the load of evaluation instructions among processors, and minimizing data transfer volume and delay.
The compiler may generate a canvassing instruction when a data transfer requires crossing a unit boundary to transfer results between two evaluation instructions.
The present invention may be tangibly embodied as program instructions on a computer-readable medium for controlling the operation of one or more processors, comprising the steps of
executing program instructions on a plurality of evaluation processors and on a plurality of canvassing processors resulting in the transfer of results of selected evaluation processor evaluations available to and read by selected evaluation processors to perform further evaluations; and
updating one or more circuit signal values, wherein updating comprises the steps of
transferring a circuit signal value,
reading a circuit signal value, and
storing circuit signal value data in circuit signal value storage media, these steps performed in any order or simultaneously;
controlling the transfer of signal values, wherein controlling comprises the steps of
composing canvassing instructions to pass the results of a selected evaluation processor to those evaluation processors which require those results to execute their evaluation instructions;
compiling one or more hardware descriptions to processor instructions,
translating the electronic circuit description into executable evaluation instructions, and
analyzing the circuit value transfers inherent to the electronic circuit description.
A single-user simulation acceleration verification center comprising a fiber-based interconnection topology 200 is shown in
For each of the evaluation module units there may be a plurality of evaluation transmitters and receivers 210 allowing each evaluation module unit to communicate with every other evaluation module unit within its chassis as well as to an evaluation module unit in another chassis. An evaluation module unit may also have a plurality of host transmitters and host receivers 230 and connect to the first evaluation module unit in a chassis and thence to the host through high speed serial links 250.
In an embodiment each evaluation module unit may be attached by a plurality of evaluation transmitter physical links, a plurality of evaluation receiver physical links, a plurality of local evaluation receiver links, a plurality of host transmitter physical links and a plurality of host receiver physical links.
A simulation acceleration appliance 300 is shown in
Evaluation Unit—An embodiment of the present invention further comprises a control processor, a plurality of octal combinational logic operation evaluators, a trace unit and a data unit attached to the interconnect network.
An evaluation module unit 400 shown in
Although particular embodiments of the present invention have been shown and described, it will be obvious to those skilled in the art that changes and modifications may be made without departing from the present invention in its broader aspects, and therefore, the appended claims are to encompass within their scope all such changes and modifications that fall within the true scope of the present invention.
The present invention is a method for adapting a design description to a process executable by a plurality of processors in a plurality of units comprising the steps of assigning domains, analyzing critical paths, assigning units, and scheduling; wherein assigning domains comprises dividing a graph representing a design description into at least one of a part controlled by an identifiably distinct clocking entity and a part shared with a second identifiably distinct clocking entity, wherein analyzing critical paths comprises identifying the logic and communication delay path dependencies of the design description and finding the longest paths in the design description, wherein assigning units comprises allocating a graph element to a processor unit based on a solution of the communication/process allocation constraint problem, and wherein scheduling comprises allocating an instruction and a meta function to a process slot and to a processor, so as to satisfy the space and time constraints represented in a design graph.
The present invention further comprises the step of optimizing critical paths, wherein optimizing critical paths comprises identifying the logic and communication delay path dependencies of the design description and finding at least one longest path in the design description so as to ensure that the longest path may be kept within a single unit whenever possible.
In the event that the design description contains memories as well as logic, the invention further comprises allocating memory comprising the step of allocating physical memories and assigning a design memory to a physical memory based on constraints such as size (width and depth) and cost of access.
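A minimal sketch of such constraint-based memory assignment, for illustration only: each design memory is placed in the cheapest physical memory whose width and depth can hold it. The memory names, dimensions, and costs are hypothetical, and capacity sharing between design memories is ignored for brevity.

```python
def assign_memories(design_mems, phys_mems):
    """design_mems: name -> (width, depth).
    phys_mems: name -> (width, depth, access_cost).
    Assign each design memory to the cheapest physical memory that
    satisfies its width and depth constraints."""
    assignment = {}
    for name, (w, d) in design_mems.items():
        fits = [(cost, pname)
                for pname, (pw, pd, cost) in phys_mems.items()
                if pw >= w and pd >= d]
        if fits:  # unassignable memories are simply left out here
            assignment[name] = min(fits)[1]
    return assignment

assignment = assign_memories(
    {"m1": (8, 1024)},
    {"p_small": (8, 512, 1), "p_big": (16, 2048, 4), "p_mid": (8, 2048, 2)})
```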
Simulating a large design requires more processors than can be located in a single unit, with necessary transfer delay between units; therefore the invention has the capability of scheduling interunit communications, comprising selecting the process slots which produce inter-unit data and placing the slots which receive such data.
The method further emits loadable code, comprising generating sequencing engine code, constructing a final machine image, and writing a file in a form suitable for loading into at least one memory of a unit.
A necessary step is to expand a design description into instructions selected from a list of instructions available to a processor, by decomposing design functions into at least one of a hardware instruction, a meta function, and a machine operation (in one embodiment, a memory access), and optimizing using at least one of eliminating dead code, propagating constants, and combining common subexpressions (CSE).
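Two of the named cleanups, constant propagation and dead code elimination, can be sketched as follows; the instruction encoding (name -> (op, operands)) is hypothetical, and only a single-input `not` operation is modeled for brevity.

```python
def propagate_and_dce(insns, outputs):
    """insns: name -> (op, args); for 'const' the args hold the
    literal value, for 'not' a single operand name. Fold 'not' over
    constants until a fixed point, then drop every instruction not
    reachable from the listed outputs (dead code)."""
    # constant propagation for the one op modeled here
    changed = True
    while changed:
        changed = False
        for n, (op, args) in list(insns.items()):
            if op == "not" and insns[args[0]][0] == "const":
                insns[n] = ("const", [1 - insns[args[0]][1][0]])
                changed = True
    # dead code elimination: keep only instructions live at an output
    live, stack = set(), list(outputs)
    while stack:
        n = stack.pop()
        if n in live:
            continue
        live.add(n)
        op, args = insns[n]
        if op != "const":
            stack.extend(args)
    return {n: i for n, i in insns.items() if n in live}

result = propagate_and_dce(
    {"c0": ("const", [1]), "n1": ("not", ["c0"]),
     "n2": ("not", ["n1"]), "dead": ("not", ["c0"])},
    outputs=["n2"])
```

The double inversion folds to the constant 1 and the unused `dead` node is eliminated, leaving a single constant driving the output.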
The present invention addresses the issue of scalability of emulation and simulation of electronic circuits in the design of more complex products in a timely manner.
The present invention provides means for electronics design engineers to efficiently execute instructions compiled from a hardware description language functional model of a hypothetical system prior to fabrication on a plurality of processors.
The present application claims the benefit of priority under 35 U.S.C. § 119(e) from U.S. provisional patent application 60/595,057, filing date Jun. 2, 2005, first named inventor Ganesan, titled: “Massively parallel platform for accelerated verification of hardware and software.” The present application is a continuation in part of U.S. patent application Ser. No. 11/307,198, filing date Jan. 26, 2006, first named inventor Ganesan, titled: “A scalable system for simulation and emulation of electronic circuits using asymmetrical evaluation and canvassing instruction processors.”
Provisional Application
Number | Date | Country
---|---|---
60/595,057 | Jun. 2005 | US

Related U.S. Applications
Relation | Number | Date | Country
---|---|---|---
Parent | 11/307,198 | Jan. 2006 | US
Child | 11/427,945 | Jun. 2006 | US