1. Field of the Invention
This invention relates to hardware accelerated simulation engines, and particularly to hardware accelerated simulation engines that employ a field of special purpose ASIC chips to simulate pieces of the design under test where the simulation engine establishes a communication network to other such ASIC chips.
2. Description of Background
In the process of circuit design the designer first defines the design by describing it in a formal hardware description language. Such definition takes the form of a data file.
One of the subsequent phases on the road to physical realization of the design is logic verification. In the logic verification phase the logic designer tests the design to determine if the logic design meets the specifications/requirements. One method of logic verification is simulation.
During the process of simulation a software program or a hardware engine (the simulator) is employed to imitate the running of the circuit design. During simulation the designer can get snapshots of the dynamic state of the design under test. The simulator will imitate the running of the design significantly slower than the final realization of the design. This is especially true for a software simulator where the speed could be a prohibitive factor.
To achieve close to real-time simulation speeds, special purpose hardware accelerated simulation engines were developed. These engines consist of a computer, an attached hardware unit, a compiler, and a runtime facilitator program.
Hardware accelerated simulation engine vendors developed two main types of engines: FPGA based and ASIC based.
Field Programmable Gate Array (FPGA) based simulation engines employ a field of FPGA chips placed on multiple boards, connected by a network of IO lines. Each FPGA chip is preprogrammed to simulate a particular segment of the design. While these engines achieve close to real-time speeds, their capacity is limited by the size of the FPGA.
Application-Specific Integrated Circuit (ASIC) based simulation engines employ a field of ASIC chips placed on one or more boards. These chips include two major components: the Logic Evaluation Unit (LEU) and the Instruction Memory (IM). The LEU acts as an FPGA that is programmed using instructions stored in the IM. The simulation of a single time step of the design is achieved in multiple simulator steps. In each of these simulator steps an instruction row is read from the IM and used to reconfigure the LEU. The simulator step is concluded by allowing the LEU thus configured to take a single step and evaluate the design piece it represents.
ASIC based simulation engines need to perform multiple steps to simulate a single design time step; hence they are inherently slower than FPGA based engines, though the gap is shrinking. In exchange, their capacity is larger.
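The multi-step evaluation described above can be illustrated with a minimal sketch. It is not the engine's actual instruction format: the row layout `(op, src_a, src_b, dst)`, the `GATES` table, and the function `simulate_time_step` are hypothetical names introduced only for illustration, and the LEU is assumed to evaluate a single two-input gate per simulator step.

```python
import operator

# Hypothetical gate functions the LEU can be configured to evaluate.
GATES = {"AND": operator.and_, "OR": operator.or_, "XOR": operator.xor}


def simulate_time_step(instruction_memory, registers):
    """Simulate one design time step as a sequence of simulator steps.

    Each instruction row (an assumed layout: op, src_a, src_b, dst)
    reconfigures the LEU, which then takes a single evaluation step.
    """
    for op, src_a, src_b, dst in instruction_memory:
        configured_gate = GATES[op]  # the row read from the IM configures the LEU
        registers[dst] = configured_gate(registers[src_a], registers[src_b])
    return registers


# A half adder expressed as two instruction rows, i.e. two simulator steps.
half_adder = [("XOR", "a", "b", "sum"), ("AND", "a", "b", "carry")]
state = simulate_time_step(half_adder, {"a": 1, "b": 1})
```

In this sketch a design time step costs as many simulator steps as there are instruction rows, which is the source of the speed penalty relative to FPGA based engines.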
The LEU has two major functions: to simulate the design piece it is programmed for and to route various signals of the design under test to other LEUs on the simulator engine. The latter task is achieved by employing, among other hardware elements, programmable cross-points.
A programmable cross-point is a hardware element that consists of an array of input signals, an array of output signals, and an array of command signals. Assuming a fixed set of values on the command signals, the programmable cross-point behaves as if the output signals were directly connected to the input signals using some permutation. A different set of values on the command signals results in a different permutation.
A typical implementation of a programmable cross-point would employ multiple multiplexers. Each output would have its private multiplexer that connects it with one of the inputs based on the values of the command signals of the multiplexer.
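The multiplexer-based implementation can be sketched as follows. This is a behavioral illustration only; the function names `mux` and `mux_cross_point` and the representation of the command signals as one integer select value per output are assumptions made for the example.

```python
def mux(inputs, select):
    """A multiplexer: forward the input chosen by the select value."""
    return inputs[select]


def mux_cross_point(inputs, commands):
    """Cross-point built from one private multiplexer per output.

    commands[k] is the (assumed) select value driving the multiplexer
    of output k, so the command array as a whole fixes the permutation.
    """
    return [mux(inputs, select) for select in commands]


signals = ["a", "b", "c", "d"]
first = mux_cross_point(signals, [2, 0, 3, 1])
second = mux_cross_point(signals, [3, 2, 1, 0])
```

With the first command set the outputs carry `c, a, d, b`; changing only the command values yields the reversed order `d, c, b, a`, with no change to the inputs.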
The capacity of an ASIC based hardware accelerated simulation engine is determined by the number of ASIC chips it employs, by the size of the IM, by the size of an instruction row, and by the size of the design piece the LEU can simulate in a single simulator step. Many of these factors are bound by technology constraints.
Clearly, a need exists to increase the capacity of an ASIC based hardware accelerated simulation engine.
Our invention effectively reduces the instruction row size. This is accomplished through an alternative implementation of the programmable cross-point that uses fewer command signals, thereby reducing the size of the instruction row. The saving in instruction row size is achieved by utilizing the special requirements dictated by the hardware accelerated simulation engine environment. These are, in detail:
(2) The logic implementing the programmable cross-point runs at a significantly higher frequency than the cross-point itself. In one particular embodiment the logic of the LEU, and hence the logic of the cross-point, had a step rate of 1 nanosecond (ns) while the cross-point was expected to propagate a new set of input signals to the appropriate output signals only every 32 ns.
(3) The cross-point does not propagate all the input signals to the appropriate output signals with the same latency. The cross-point only achieves a given average data throughput. In the above mentioned embodiment the cross-point propagation latency varied between 1 ns and 128 ns, averaging 64 ns.
An ASIC based hardware accelerated simulation engine as described herein is a special purpose massively parallel computer designed to accelerate the process of logic verification of integrated circuit designs utilizing a field of ASIC chips. These ASIC chips are interconnected by direct connections; hence the communication between these chips has to be accomplished by switching technology internal to the chips. The switching technology employs programmable cross-points, that is, hardware elements with input, output and command ports. The programmable cross-points propagate signals from their input ports to their output ports following a given permutation determined by the values on the command port.
To program the various logic elements of an ASIC chip, the ASIC chip contains an instruction memory. During the regular operation of the ASIC chip, instruction rows are read out of the instruction memory in a sequential manner, and the instruction rows thus read (after a decoding process) provide the command bits for the command ports of the various logic elements (the programmable cross-points among them) of the ASIC chip. As the size of the instruction memory directly influences the capacity and the usability of the ASIC based hardware accelerated simulation engine, it is desirable to reduce the number of required command bits.
The invention described herewith provides a conveyor belt based implementation of the programmable cross-point that has a reduced command bit requirement compared to prior art solutions.
The cross-point described herein provides a solution that requires four times fewer command bits in the instruction word for driving the programmable cross-point.
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention.
The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
The detailed description explains the preferred embodiments of the invention, together with advantages and features, by way of example with reference to the drawings.
The major components of an ASIC based hardware accelerated simulation engine are depicted in
An additional low speed communication network, the host bus 141 and host interface 143, is provided to exchange data between the ASIC chips 111, 113, 115, 117 and the host computer 103. The host bus 141 is typically inactive, or its functionality is severely limited, while the ASIC chips 111, 113, 115, 117 are active, i.e., performing simulation.
The interconnect network 121 consists of direct connections between the IO pins of the ASIC chips 111, 113, 115, 117 and that of the memory modules 131 and user supplied devices. Every connection has a pre-determined data flow direction designating one of its ends as input and the other as output. In accordance with this designation, the pins of the ASIC chips can be categorized as either input or output.
To synchronize the data transfer on the interconnect, clock signals are used. In the typical embodiment a 32 ns step rate was used on the interconnect 121. The operation of the ASIC chip can be based on a different clock. A typical embodiment uses a 1 ns step rate.
As depicted on
In phase 2 the LEU 221 will route signals from its input pins to its internal storage registers, it will simulate the running of a piece of the design under test using its internal registers as stimuli, and will route signals from its internal storage registers to its output. The LEU 221 performs the listed three actions guided by the values stored on its command bits.
In the preferred embodiment of the invention the aforementioned two phases are performed in parallel in a pipelined manner.
As illustrated in
The Gate Evaluation Processors receive their command bits from the instruction row decoder 501, illustrated in
The number of registers on each of the conveyor belts 403, 405 is the same and is also equal to the number of input and output signals. A segment of the programmable cross-point consists of four registers: an input register, one register from the left oriented conveyor belt, one register from the right oriented conveyor belt, and an output register.
To facilitate the placement and removal of signals to/from the belts, segments are equipped with read ports 411 and write ports 413. Each of these ports 411, 413 has an enable command line and a selection command line. Hence, each of the segments requires four command lines. The write ports 413 function as follows: if the enable command line 421 is active then, based on the selection command line 423, one of the belt registers is updated from its neighbor while the other is updated from the input register of the segment; if the enable line 421 is inactive then both belt registers of the segment are updated from their respective neighbor registers on the belt.
The read ports 411 function as follows: if the enable command line 421 is active then, based on the selection command line 423, the output register of the segment is updated from one of the belt registers; if the enable command line is inactive then the output register retains its value from the previous LEU cycle.
The propagation of a signal from the input registers to one of the output registers requires the following phases. In some LEU step, the segment's write port 413 has to be enabled so that the signal is moved onto one of the belts. It is desirable to select the belt whose orientation results in the faster delivery. Once the signal is on the belt, the segment that contains the target output register has to remove it by having its read port enabled and its selection command line select the appropriate belt.
As the step rate of the LEU 221 is higher than that of the interconnect, the compiler has a time window in which to initiate the propagation. If the write port of the segment that contains the signal does not receive a write enable command within the allotted time window, the signal is overwritten by the next signal arriving on the interconnect. Once the signal is placed on the belt, it is passed to neighboring belt registers. After a given number of LEU instructions, the signal arrives at one of the belt registers of the receiving segment. The receiving segment's read port has to be enabled at that LEU step.
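The write, travel, and read phases described above can be modeled with a short behavioral sketch. Several details are assumptions made for the example and are not taken from the specification: the belts are treated as circular, the left oriented belt is taken to move data toward lower segment indices, all registers update simultaneously from the previous LEU step, and the names `Command`, `ConveyorCrossPoint`, and `route` are hypothetical.

```python
from dataclasses import dataclass

LEFT, RIGHT = 0, 1  # assumed encodings of the selection command line


@dataclass
class Command:
    """The four command lines of one segment for one LEU step."""
    write_enable: bool = False
    write_select: int = LEFT
    read_enable: bool = False
    read_select: int = LEFT


class ConveyorCrossPoint:
    def __init__(self, n):
        self.n = n
        self.inputs = [None] * n
        self.left = [None] * n   # assumed: data moves toward lower indices
        self.right = [None] * n  # assumed: data moves toward higher indices
        self.outputs = [None] * n

    def step(self, commands):
        """Advance one LEU step; commands maps segment index -> Command."""
        n, idle = self.n, Command()
        # Default behavior: every belt register is updated from its neighbor.
        new_left = [self.left[(i + 1) % n] for i in range(n)]
        new_right = [self.right[(i - 1) % n] for i in range(n)]
        new_out = list(self.outputs)  # outputs retain their value by default
        for i in range(n):
            cmd = commands.get(i, idle)
            if cmd.write_enable:  # place the input register on the selected belt
                target = new_left if cmd.write_select == LEFT else new_right
                target[i] = self.inputs[i]
            if cmd.read_enable:   # copy the selected belt register to the output
                belt = self.left if cmd.read_select == LEFT else self.right
                new_out[i] = belt[i]
        self.left, self.right, self.outputs = new_left, new_right, new_out


def route(xp, src, dst, value):
    """Deliver value from segment src to segment dst over the shorter belt."""
    left_dist = (src - dst) % xp.n
    right_dist = (dst - src) % xp.n
    # Assumed tie-break: a destination exactly half-way goes right.
    belt, dist = (LEFT, left_dist) if left_dist < xp.n // 2 else (RIGHT, right_dist)
    xp.inputs[src] = value
    xp.step({src: Command(write_enable=True, write_select=belt)})
    for _ in range(dist):
        xp.step({})  # the signal travels one belt register per LEU step
    xp.step({dst: Command(read_enable=True, read_select=belt)})
    return belt, dist


xp = ConveyorCrossPoint(8)
first_route = route(xp, 1, 6, "a")   # three positions to the left
second_route = route(xp, 2, 6, "b")  # four positions to the right
```

In this model the compiler's job reduces to choosing the LEU step at which each write enable and each read enable is issued. The exact cycle at which a read must be issued depends on the register timing, which is assumed here.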
In the typical embodiment, each conveyor belt contained 256 registers, realizing a 256×256 programmable cross-point. Its 256 segments, with four command lines each, required 1024 command lines. As the LEU was running at a clock speed 32 times faster than that of the interconnect, the time window to forward a signal from the input register was 32 LEU steps. The implementation chose the belt that resulted in the lowest travel time: if the destination was 0-127 positions to the left then the left oriented conveyor belt was selected, while if the destination was 1-128 positions to the right then the right oriented belt was selected. Assuming a uniform distribution of the signal targets, we concluded that on average a signal had to travel 64 LEU steps, that is, for the duration of two interconnect steps.
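The figures quoted for the typical embodiment can be checked with a few lines of arithmetic. The sketch below assumes circular belts and uniformly distributed destinations, as stated above; the function name `travel_distance` and the variable names are introduced only for this illustration.

```python
N = 256                # registers per belt, inputs, outputs, and segments
COMMAND_LINES_PER_SEGMENT = 4
INTERCONNECT_STEP_NS = 32
LEU_STEP_NS = 1


def travel_distance(offset_right, n=N):
    """Belt registers traversed for a destination offset_right positions
    to the right of the source (taken modulo n), using the shorter belt."""
    left_dist = (-offset_right) % n
    if left_dist < n // 2:    # 0-127 positions to the left: left belt
        return left_dist
    return offset_right % n   # 1-128 positions to the right: right belt


command_lines = N * COMMAND_LINES_PER_SEGMENT         # 1024
window = INTERCONNECT_STEP_NS // LEU_STEP_NS          # 32 LEU steps
distances = [travel_distance(k) for k in range(N)]
average_steps = sum(distances) / N                    # 64.0
average_interconnect_steps = average_steps / window   # 2.0
```

The left belt covers distances 0 through 127 and the right belt distances 1 through 128, which sum to 16384 over 256 destinations, giving the stated average of 64 LEU steps.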
Finally
The capabilities of the present invention can be implemented in hardware. Additionally, the invention or various implementations of it may be implemented in software. When implemented in software, at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform the capabilities of the present invention, can be provided to carry the program code.
The circuit diagrams depicted herein are just examples. There may be many variations to these diagrams or the steps (or operations) described therein without departing from the spirit of the invention. For instance, the number of conveyor belts within a programmable cross-point may be 4 or 8 instead of 2. Another variation to the concept described herein is to define a segment as the collection of 2 or more registers of a conveyor belt instead of just 1. All of these variations are considered a part of the claimed invention.
While the preferred embodiment of the invention has been described, it will be understood that those skilled in the art, both now and in the future, may make various improvements and enhancements which fall within the scope of the claims which follow. These claims should be construed to maintain the proper protection for the invention first described.