1. Field of the Invention
The present invention relates generally to VLIW (very long instruction word) processors, including for example simulation processors that may be used in hardware acceleration systems for simulation of the design of semiconductor integrated circuits, also known as semiconductor chips.
2. Description of the Related Art
Simulation of the design of a semiconductor chip typically requires high processing speed and a large number of execution steps due to the large amount of logic in the design, the large amount of on-chip and off-chip memory, and the high speed of operation typically present in the designs for modern semiconductor chips. The typical approach for simulation is software-based simulation (i.e., software simulators). In this approach, the logic and memory of a chip (which shall be referred to as user logic and user memory for convenience) are simulated by computer software executing on general purpose hardware. The user logic is simulated by the execution of software instructions that mimic the logic function. The user memory is simulated by allocating main memory in the general purpose hardware and then transferring data back and forth from these memory locations as needed by the simulation. Unfortunately, software simulators typically are very slow. The simulation of a large amount of logic on the chip requires that a large number of operands, results and corresponding software instructions be transferred from main memory to the general purpose processor for execution. The simulation of a large amount of memory on the chip requires a large number of data transfers and corresponding address translations between the address used in the chip description and the corresponding address used in main memory of the general purpose hardware.
Another approach for chip simulation is hardware-based simulation (i.e., hardware emulators). In this approach, user logic and user memory are mapped on a dedicated basis to hardware circuits in the emulator, and the hardware circuits then perform the simulation. User logic is mapped to specific hardware gates in the emulator, and user memory is mapped to specific physical memory in the emulator. Unfortunately, hardware emulators typically require high cost because the number of hardware circuits required in the emulator increases according to the size of the simulated chip design. For example, hardware emulators typically require the same amount of logic as is present on the chip, since the on-chip logic is mapped on a dedicated basis to physical logic in the emulator. If there is a large amount of user logic, then there must be an equally large amount of physical logic in the emulator. Furthermore, user memory must also be mapped onto the emulator, which likewise requires a dedicated mapping from the user memory to the physical memory in the hardware emulator. Typically, emulator memory is instantiated and partitioned to mimic the user memory. This can be quite inefficient as each memory uses physical address and data ports. Typically, the amount of user logic and user memory that can be mapped depends on emulator architectural features, but both user logic and user memory require physical resources in the emulator that scale upwards with the design size. This drives up the cost of the emulator. It also slows down the performance and complicates the design of the emulator. Emulator memory typically is high-speed but small. A large user memory may have to be split among many emulator memories. This then requires synchronization among the different emulator memories.
Still another approach for logic simulation is hardware-accelerated simulation. Hardware-accelerated simulation typically utilizes a specialized hardware simulation system that includes processor elements configurable to emulate or simulate the logic designs. A compiler is typically provided to convert the logic design (e.g., in the form of a netlist or RTL (Register Transfer Language)) to a program containing instructions which are loaded to the processor elements to simulate the logic design. Hardware-accelerated simulation does not have to scale proportionally to the size of the logic design, because various techniques may be utilized to partition the logic design into smaller portions (or domains) and load these domains to the simulation processor. As a result, hardware-accelerated simulators typically are significantly less expensive than hardware emulators. In addition, hardware-accelerated simulators typically are faster than software simulators due to the hardware acceleration produced by the simulation processor.
However, hardware-accelerated simulators typically require coordination between overall simulation control and the simulation of a specific domain that occurs within the accelerated hardware simulator. For example, if the user design is simulated one domain at a time, some control is required to load the current state of a domain into the hardware simulator, have the hardware simulator perform the simulation of that domain, and then swap out the revised state of the domain (and possibly also additional data such as results or error messages) in exchange for loading the state of the next domain to be simulated. As another example, commands for functions other than simulation may also need to be coordinated with the hardware simulator. Reporting, interrupts and errors, and branching within the simulation are some examples.
These functions preferably are implemented in a resource-efficient manner and with low overhead. For example, swapping state spaces for different domains preferably occurs without unduly delaying the simulation. Therefore, there is a need for an approach to hardware-accelerated functional simulation of chip designs that overcomes some or all of the above drawbacks.
In one aspect, the present invention overcomes the limitations of the prior art by performing logic simulation of a chip design on a domain-by-domain basis, but storing a history of the state space of the domain during simulation. In this way, additional information beyond just the end result can be reviewed in order to debug or otherwise analyze the design.
In one approach, logic simulation occurs on a hardware accelerator that includes a VLIW simulation processor and a program memory. The program memory stores instructions for simulating different domains and also stores the state spaces for the domains. The state space for a first domain is loaded from the program memory into a local memory of the simulation processor. The VLIW instructions for simulating the first domain are also loaded from the program memory and executed by the simulation processor, thus simulating the domain. During the logic simulation, the state space changes. The history of the state space is stored by transferring the state spaces for different simulated times from the local memory of the simulation processor to a memory external to the simulation processor. In different embodiments, the state space history can be transferred from the local memory to the program memory, to a separate storage memory, to a main memory of the host computer, or to memory on extension cards. In various embodiments, the state space can be saved at every time step, can be saved periodically and/or can be saved when requested by the user.
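The history-saving behavior described above can be sketched as follows (an illustrative Python model only; the function names, the dictionary-based state space, the update rule, and the save interval are all hypothetical stand-ins for VLIW execution and the memory transfers):

```python
# Illustrative sketch only: storing a history of a domain's state space.
# simulate_domain stands in for executing the domain's VLIW instructions;
# the history list stands in for external memory (program memory, storage
# memory, host main memory, or extension-card memory).

def simulate_domain(state, step):
    # Hypothetical state update; a real system executes VLIW instructions.
    return {k: v + step for k, v in state.items()}

def run_with_history(initial_state, num_steps, save_every=1):
    """Simulate one domain, snapshotting local memory to external storage."""
    local_memory = dict(initial_state)   # state space loaded into local memory
    history = []                         # stand-in for external memory
    for t in range(num_steps):
        local_memory = simulate_domain(local_memory, t)
        if t % save_every == 0:          # periodic saving, per one embodiment
            history.append((t, dict(local_memory)))
    return local_memory, history

final, hist = run_with_history({"q": 0}, num_steps=4, save_every=2)
# snapshots are kept for t=0 and t=2; otherwise only the final state survives
```

With `save_every=1` every time step is recorded; the interval (or a list of user-requested time steps) trades history detail against transfer bandwidth.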
In another aspect of the invention, the chip design is divided into different clock domains and simulated on a clock domain by clock domain basis. In one approach, the clock domains are selected for simulation in an order based on the chronological order of the clock edges for the clock domains. The order for simulation may or may not exactly follow the chronological order of the clock edges. For example, if two clock domains are independent of each other, they may be simulated out of order to reduce the amount of state space swapping required.
In one method, the state space of the clock domain currently selected for simulation is loaded into the local memory of the simulation processor. The corresponding instructions are executed on the simulation processor in order to simulate the logic of the selected clock domain. At some point, the state space of the selected clock domain is swapped out from the local memory, for example when a different clock domain is to be loaded in and simulated. These steps are repeated for one clock domain after another until the logic simulation is completed. In a specific implementation, the chip design is divided into a global clock domain and many local clock domains. Local clock domains are dominated by a specific clock and largely do not affect each other. The global clock domain is introduced to account for interaction between local clock domains. Simulating a specific clock domain includes simulating the corresponding local clock domain as well as the global clock domain.
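The load-simulate-swap cycle described above can be sketched as follows (a minimal Python illustration under assumed data structures; `run_schedule`, the dictionary standing in for the program memory, and the integer state are hypothetical):

```python
# Illustrative sketch only: simulating one clock domain at a time, swapping
# state spaces between a local memory and a dict that stands in for the
# program memory.

def run_schedule(program_memory, schedule, simulate):
    """schedule is an ordered list of (time_step, domain); simulate(state)
    stands in for executing the domain's instruction set."""
    loaded = None          # domain whose state space is in local memory
    local = None
    swaps = 0
    for t, domain in schedule:
        if domain != loaded:
            if loaded is not None:
                program_memory[loaded] = local   # swap out previous domain
            local = program_memory[domain]       # swap in selected domain
            loaded = domain
            swaps += 1
        local = simulate(local)                  # simulate this clock tick
    if loaded is not None:
        program_memory[loaded] = local
    return swaps

mem = {"CK1": 0, "CK3": 0}
swaps = run_schedule(mem, [(0, "CK1"), (1, "CK3"), (2, "CK3")],
                     simulate=lambda s: s + 1)
# CK3 is simulated for two consecutive ticks without swapping its state out
```

Note that consecutive ticks of the same domain incur no swap, which is the property the scheduling optimizations described later exploit.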
Depending on the hardware accelerator architecture, state space swapping of the clock domains may be initiated by a software driver running on a host computer. If a state space swap is required at every clock tick, the overhead for communication between the software driver and the hardware accelerator (simulation processor) may become unnecessarily large. This can be reduced by reducing the number of state space swaps and/or by initiating state space swaps by the hardware accelerator rather than the software driver. As an example of the former, independent clock domains may be simulated out of order in order to reduce the number of state space swaps. As an example of the latter, the software driver may initiate state space swaps only for some of the more important clocks, with the hardware accelerator initiating state space swaps for the other clocks. In both of these cases, if the state space is retained in local memory for multiple clock ticks, it may be useful to store the intermediate history of the state space before it is overwritten on the next clock tick.
In another aspect of the invention, the program memory is architected to support multi-threading by the VLIW simulation processor. The VLIW simulation processor has many processor units coupled to many memory controllers. The memory controllers access a program memory that is implemented as a number of program memory instances. In one example, each program memory instance includes one or more memory chips. Each memory controller controls a corresponding program memory instance. The program memory is logically organized into memory slices, and each program memory instance represents one of the memory slices. The program memory contains instructions for execution by the processor units and also contains data for use by the processor units.
By dividing the program memory into slices, each of which is accessed by a separate controller, processor clustering and multi-threading can be supported. The processor units can be logically organized into processor clusters, each of which includes one memory controller and accesses one memory slice. If all processor clusters access the same address in program memory, then an entire VLIW word will be accessed. If different processor clusters access different addresses, the processor clusters can operate fairly independently of each other. Thus, multi-threading is supported.
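The slice organization might be modeled as follows (a hedged sketch; the slice count, the slice contents, and the fetch interface are invented for illustration and are not the actual memory-controller design):

```python
# Illustrative sketch only: a program memory organized as memory slices,
# one slice per memory controller / processor cluster.

NUM_SLICES = 4

def build_memory(depth):
    # slice s at address a holds one sub-word of a VLIW instruction
    return [[f"slice{s}@addr{a}" for a in range(depth)]
            for s in range(NUM_SLICES)]

def fetch(memory, addresses):
    """addresses[s] is the address issued by cluster s's memory controller."""
    return [memory[s][addresses[s]] for s in range(NUM_SLICES)]

mem = build_memory(8)
vliw_word = fetch(mem, [3, 3, 3, 3])  # same address: one full VLIW word
threads = fetch(mem, [0, 5, 2, 7])    # different addresses: independent threads
```

When all controllers issue the same address the fetch reconstitutes an entire VLIW word; when they issue different addresses, each cluster follows its own thread.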
Other aspects of the invention include methods, devices, systems and applications corresponding to the approaches described above. Further aspects of the invention include the VLIW techniques described above but applied to applications other than logic simulation.
The invention has other advantages and features which will be more readily apparent from the following detailed description of the invention and the appended claims, when taken in conjunction with the accompanying drawings, in which:
The figures depict embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.
The system shown in
For further descriptions of example compilers 108, see U.S. Patent Application Publication No. US 2003/0105617 A1, “Hardware Acceleration System for Simulation,” published on Jun. 5, 2003, which is incorporated herein by reference. See especially paragraphs 191-252 and the corresponding figures. The instructions in program 109 are initially stored in memory 112.
The simulation processor 100 includes a plurality of processor elements 102 for simulating the logic gates of the user logic, and a local memory 104 for storing instructions and/or data for the processor elements 102. In one embodiment, the HW simulator 130 is implemented on a generic PCI-board using an FPGA (Field-Programmable Gate Array) with PCI (Peripheral Component Interconnect) and DMA (Direct Memory Access) controllers, so that the HW simulator 130 plugs naturally into a general computing system, such as the host computer 110. The simulation processor 100 forms a portion of the HW simulator 130. The simulation processor 100 has direct access to the main memory 112 of the host computer 110, with its operation being controlled by the host computer 110 via the API 116. The host computer 110 can direct DMA transfers between the main memory 112 and the memories 121, 122 on the HW simulator 130, although the DMA between the main memory 112 and the memory 122 may be optional.
The host computer 110 takes simulation vectors (not shown) specified by the user and the program 109 generated by the compiler 108 as inputs, and generates board-level instructions 118 for the simulation processor 100. The simulation vectors (not shown) include values of the inputs to the netlist 106 that is simulated. The board-level instructions 118 are transferred by DMA from the main memory 112 to the program memory 121 of the HW simulator 130. The storage memory 122 stores user memory data. Simulation vectors (not shown) and results 120 can be stored in either program memory 121 or storage memory 122, for transfer with the host computer 110.
The memory interfaces 142, 144 provide interfaces for the processor elements 102 to access the memories 121, 122, respectively. The processor elements 102 execute the instructions 118 and, at some point, return simulation results 120 to the host computer 110 also by DMA. Intermediate results may remain on-board for use by subsequent instructions. Executing all instructions 118 simulates the entire netlist 106 for one simulation vector.
For a simulation processor 100 containing n processor units, each having 2 inputs, 2n signals must be selectable in the crossbar for a non-blocking architecture. If each processor unit is identical, each preferably will supply two variables into the crossbar. This yields a 2n×2n non-blocking crossbar. However, this architecture is not required. Blocking architectures, non-homogeneous architectures, optimized architectures (for specific design styles), and shared architectures (in which processor units either share the address bits, or share either the input or the output lines into the crossbar) are some examples where an interconnect system 101 other than a non-blocking 2n×2n crossbar may be preferred.
Each of the processor units 103 includes a processor element (PE) 302, a local cache 308 (implemented as a shift register in some implementations), and a corresponding part 326 of the local memory 104 as its dedicated local memory. Each processor unit 103 can be configured to simulate at least one logic gate of the user logic and store intermediate or final simulation values during the simulation. The processor unit 103 also includes multiplexers 304, 306, 310, 312, 314, 316, 320, and flip flops 318, 322. The processor units 103 are controlled by the VLIW instruction 118. In this example, the VLIW instruction 118 contains individual PE instructions 218A-218K, one for each processor unit 103.
The PE 302 is a configurable ALU (Arithmetic Logic Unit) that can be configured to simulate any logic gate with two or fewer inputs (e.g., NOT, AND, NAND, OR, NOR, XOR, constant 1, constant 0, etc.). The type of logic gate that the PE 302 simulates depends upon the PE instruction 218, which programs the PE 302 to simulate a particular type of logic gate.
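One common way to realize such a configurable two-input function, shown here purely as an illustrative sketch, is a 4-bit truth table indexed by the two input bits; the encoding below is a hypothetical example and not the actual format of the PE instruction 218:

```python
# Illustrative sketch only: a two-input gate programmed as a 4-bit truth
# table. The bit at index (a << 1) | b gives the output for inputs a, b.

GATES = {
    "CONST0": 0b0000, "NOR":   0b0001, "NOT_A": 0b0011, "XOR":    0b0110,
    "NAND":   0b0111, "AND":   0b1000, "OR":    0b1110, "CONST1": 0b1111,
}

def pe_eval(truth_table, a, b):
    """Evaluate the programmed gate on input bits a and b."""
    return (truth_table >> ((a << 1) | b)) & 1
```

Under this scheme a 4-bit instruction field suffices to select any of the sixteen possible two-input Boolean functions, which is one reason a small configurable ALU can stand in for arbitrary two-input gates.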
The multiplexers 304 and 306 select input data from one of the 2n bus lines of the crossbar 101 in response to selection signals in the PE instruction 218. In the example of
The output of the PE 302 can be routed to the crossbar 101 (via multiplexer 316 and flip flop 318), the local cache 308 or the dedicated local memory 326. The local cache 308 is implemented as a shift register and stores intermediate values generated while the PEs 302 in the simulation processor 100 simulate a large number of gates of the logic design 106 in multiple cycles.
On the output side of the local cache 308, the multiplexers 312 and 314 select one of the memory cells of the local cache 308 as specified in the relevant fields of the PE instruction 218. Depending on the state of multiplexers 316 and 320, the selected outputs can be routed to the crossbar 101 for consumption by the data inputs of processor units 103.
The dedicated local memory 326 allows handling of a much larger design than the local cache 308 alone can handle. Local memory 326 has an input port DI and an output port DO for storing data, permitting the limited-size local cache 308 to spill over into it. In other words, the data in the local cache 308 may be loaded from and/or stored into the memory 326. The number of intermediate signal values that may be stored is limited by the total size of the memory 326. Since memories 326 are relatively inexpensive and fast, this scheme provides a scalable, fast and inexpensive solution for logic simulation. The memory 326 is addressed by fields in the PE instruction 218.
The input port DI is coupled to receive the output of the PE 302. In a separate data path, values that are transferred to local cache 308 can be subsequently moved to memory 326 by outputting them from the local cache 308 to crossbar 101 and then re-entering them through a PE 302 to the memory 326. The output port DO is coupled to the multiplexer 320 for possible presentation to the crossbar 101.
The dedicated local memory 326 also has a second output port 327, which can access both the storage memory 122 and the program memory 121. This patent application concentrates more on reading and writing data words 540 between port 327 and the program memory 121. For more details on reading and writing data words 540 to the storage memory 122, see for example U.S. patent application Ser. No. ______ [***attorney docket 10656], “Hardware Acceleration System for Simulation of Logic and Memory,” filed Dec. 1, 2005 by Verheyen and Watt, which is incorporated herein by reference.
For further details and example of various aspects of processor unit 103, see for example U.S. patent application Ser. No. 11/238,505, “Hardware Acceleration System for Logic Simulation Using Shift Register as Local Cache,” filed Sep. 28, 2005; U.S. patent application Ser. No. ______ [***attorney docket 10769], “Hardware Acceleration System for Logic Simulation Using Shift Register as Local Cache with Path for Bypassing Shift Register,” filed Nov. 30, 2005; U.S. patent application Ser. No. ______ [***attorney docket 10656], “Hardware Acceleration System for Simulation of Logic and Memory,” filed Dec. 1, 2005; and U.S. Provisional Patent Application Ser. No. 60/732,078, “VLIW Acceleration System Using Multi-State Logic,” filed Oct. 31, 2005. The teachings of all of the foregoing are incorporated herein by reference.
A simulator can be event-driven or cycle-based. An event-driven simulator evaluates a logic gate (or a block of statements) whenever the state of the simulation changes in a way that could affect the evaluation of the logic gate, for example if an input to the logic gate changes value or if a variable which otherwise affects the logic gate (e.g., a tri-state enable) changes value. This change in value is called an event. A cycle-based simulator partitions a circuit according to clock domains and evaluates the subcircuit in a clock domain once at each triggering edge of the clock. The event count therefore affects the speed at which a simulator runs. A circuit with low event counts runs faster on event-driven simulators, whereas a circuit with high event counts runs faster on cycle-based simulators. In practice, most circuits have high enough event counts that cycle-based simulators outperform their event-driven counterparts. The following description first explains how the current architecture can be used to map a cycle-based simulator and then explains how to implement control flow to handle event-driven simulators.
Typically, a software simulator running on the host CPU 114 controls which portions of the logic circuit are simulated by the hardware accelerator 130. The logic that is mapped onto the hardware accelerator 130 can be viewed as a black box in the software simulator. The connectivity to the logic mapped onto the hardware accelerator can be modeled through input and output signals connecting through this black box. This is modeled similarly for both internal and external signals, i.e., all internal signals (e.g., "probes") are also brought out as input and output signals for the black box. For convenience, these signals will be referred to as the primary input (PI) and primary output (PO) for the black box. Note that this can be a superset of the primary input and primary output of a specific chip design if the black box represents the entire chip design. Usually, system tasks and other logic (e.g., assertions) are also included, and often a part of the test bench is also included in the black box.
When any of the primary input signals changes in the software simulator, this causes an event that directly affects the black box. The software simulator sends the stimulus to the black box interface, which in this example is a software driver. The driver can send this event directly to the hardware accelerator, or accrue the stimulus. Accrual occurs when the hardware accelerator operates on a cycle-based principle. For synchronous clock domains, only events on clock signals require the hardware accelerator to compute the PO values. However, for combinational paths throughout the design, any event on an input typically will require the hardware accelerator to compute PO values. The software driver in this case updates the PI changes and logs which clock signals have events. At the end of evaluation of the current time step, before the simulator moves to the next time step, the software driver is called again, but now to compute the PO values for the black box. This will be referred to as a simulation event. Note that there will typically be only one simulation event per time point, although it is possible for the software simulator to re-evaluate the black box if combinational feedback paths exist. At this point, the software driver analyzes the list of clock signals that have changed and directs the hardware accelerator to compute the new PO values for those domains. Other domains, for which the clocks did not change, typically need not be updated. This leads to better efficiency. To support combinational logic as well as clock domain interaction, a combinational clock domain is introduced which is evaluated regardless of clock events.
At each simulation event, the accrued changes are copied from main memory 112 into program memory 121, using DMA methods. After the DMA completes, the software driver holds a list of which clock domains to execute and the sequence in which to execute them. This list can be used to invoke the hardware accelerator 130 to update the POs for each clock domain, one domain at a time, or the list can be sent to the hardware accelerator in its entirety so that the hardware control routine executes the selected clock domains all at once, in a given sequence. Combinations of these approaches are also possible.
In one embodiment, the program memory 121 is arranged as depicted in
For instance, if CK1 and CK2 are asynchronous clocks operating at CK1=250 MHz (4.0 ns period) and CK2≈303 MHz (3.3 ns period), respectively, events would occur at t=3.3 ns (first clock edge on CK2), t=4.0 ns (first clock edge on CK1), t=6.6 ns (2nd clock edge for CK2), t=8.0 ns (2nd clock edge for CK1), and so on. At each of these events, the GCLK domain is also evaluated.
As a different example, assume that CK1 and CK2 are synchronous clocks, e.g., CK2 is the gated divide-by-two (half-frequency) clock of CK1. Assume CK2=125 MHz and CK1=250 MHz. Then events would occur at t=4.0 ns for CK1 and GCLK, at t=8.0 ns for CK1, CK2 and GCLK, at t=12.0 ns for CK1 and GCLK, at t=16.0 ns for CK1, CK2 and GCLK, and so on.
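The chronological event list in the asynchronous example can be reproduced with a short sketch (Python; the function names and the merge-and-sort approach are illustrative assumptions, not the scheduling algorithm of the actual system):

```python
# Illustrative sketch only: building the chronological event list for
# asynchronous clocks, here with periods 4.0 ns (CK1) and 3.3 ns (CK2).

def edge_times(period_ns, t_end_ns):
    t, edges = period_ns, []
    while t <= t_end_ns + 1e-9:          # tolerance for float round-off
        edges.append(round(t, 1))
        t += period_ns
    return edges

def merged_schedule(clocks, t_end_ns):
    """clocks maps clock name -> period in ns; returns (time, name) events
    in chronological order. The GCLK domain would run at every event."""
    events = [(t, name) for name, period in clocks.items()
              for t in edge_times(period, t_end_ns)]
    return sorted(events)

sched = merged_schedule({"CK1": 4.0, "CK2": 3.3}, t_end_ns=8.0)
# → [(3.3, 'CK2'), (4.0, 'CK1'), (6.6, 'CK2'), (8.0, 'CK1')]
```

The simulation order may then follow this list edge by edge, or be reordered as described below for independent domains.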
Information about the different domains is stored in the program memory 121. Each domain has an instruction set (IS) and a state space (SS). The instruction set is the group of instructions 118 used to simulate that domain. The state space is the current state of the variables in that clock domain. For convenience, the state spaces for the local domains CK1 SS, CK2 SS, etc. are stored together, as shown in
During simulation of a specific clock domain, the state space for the clock domain is stored in local memory 104 and the instructions 118 simulating the clock domain are fetched and executed. As shown in
During simulation, the instructions used to simulate the clock domain CKn (including the instructions for global clock domain GCLK) are fetched and executed by the PEs 102.
The same clock domain usually will be loaded into local memory 104 more than once to simulate different time steps.
In a straightforward implementation, at time step t0 of the simulation, clock domain CK1 is simulated. The state space for CK1 is loaded into local memory 104 and the instruction set for CK1 (and GCLK) is executed. At time step t1 of the simulation, the state space for CK1 is stored back to the program memory 121 and the state space for clock domain CK3 is loaded into local memory 104. Once the simulation of CK3 is completed, the next time step t2 in the simulation is taken. Clock domain CK2 is loaded to the local memory 104 and simulated. This continues for all clock edges in the simulation. Note that time steps t6 and t7 are both clock edges for CK3 and there are no intervening clock edges for other clock domains. Hence, clock domain CK3 can be simulated for two consecutive time steps without swapping out the state space.
State spaces that are being swapped out of local memory 104 can be written back to their original memory locations in program memory 121, overwriting the old state space information. This conserves memory but, unless the old state space was preserved somewhere else, the history of the state space will be lost. In one approach, the old state space can be copied from program memory 121 to the host computer's main memory 112 before being overwritten. In an alternate approach, rather than overwriting the old state space, the new state space is written to a different address in program memory 121. In
In addition, not every time step need be recorded. Only every jth time step or only selected time steps (e.g., specific time steps that are requested by the user) may be recorded instead. Information other than the state space may also be preserved. Data can also be written at times other than as part of a state space swap. In
As another variation, the clock domains can be simulated out of order if there are no dependencies. For example, assume that CK3 does not depend on CK2. The simulation could proceed as follows. First simulate CK1 at time step t0, then CK3 at time steps t1 and t3, then CK2 at time step t2, and so on. By simulating CK3 at time steps t1 and t3 consecutively and then simulating CK2 at time step t2, the number of state space swaps is reduced. The out of order simulation can be performed because CK3 does not depend on CK2. Since the simulation of CK2 at t2 does not affect the simulation of CK3 at t3, the order of these simulations can be reversed. Note that dependence is not necessarily symmetric: even though CK3 does not depend on CK2, CK2 may still depend on CK3.
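One possible way to derive such an out-of-order schedule, purely as an illustrative sketch, is a greedy pass that swaps adjacent entries whenever doing so groups edges of the same domain and no dependency forbids the reordering (the `depends` predicate and the data layout are assumptions, not the actual compiler's algorithm):

```python
# Illustrative sketch only: group consecutive edges of the same clock
# domain when an intervening, independent domain's edge can be deferred.
# depends(a, b) means "domain a depends on domain b" and is assumed given.

def reduce_swaps(schedule, depends):
    """schedule: list of (time_step, domain) in chronological order."""
    order = list(schedule)
    changed = True
    while changed:
        changed = False
        for i in range(len(order) - 2):
            a, b, c = order[i], order[i + 1], order[i + 2]
            if (a[1] == c[1] != b[1]
                    and not depends(c[1], b[1]) and not depends(b[1], c[1])):
                # move c's edge next to a's; b's edge is simulated afterwards
                order[i + 1], order[i + 2] = c, b
                changed = True
    return order

sched = [(0, "CK1"), (1, "CK3"), (2, "CK2"), (3, "CK3")]
reordered = reduce_swaps(sched, depends=lambda a, b: False)
# → [(0, 'CK1'), (1, 'CK3'), (3, 'CK3'), (2, 'CK2')]: one fewer swap
```

This sketch checks independence in both directions before reordering, which is conservative; a real scheduler could use finer-grained dependency information.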
One advantage of this out of order execution is that control overhead can be reduced in some architectures. In many cases, the state space swap is initiated by a software driver, which is a software program running on the host computer 110 and typically has access to the DMA interfaces to the hardware simulator 130. In the chronological, edge-by-edge approach, at time t0, the software driver would instruct the simulation processor 100 to load the CK1 state space and simulate CK1. Then at time t1, the driver would instruct the simulation processor to swap out the CK1 state space for the CK3 state space and then simulate CK3, and so on. For each clock edge shown in
In a related approach, the simulation may be driven only from certain clock edges rather than by all clock edges. For example, if the CK2 and CK3 domains do not depend on each other, then the simulation could be triggered by the CK1 edges. The software driver interacts with the simulation processor 100 only on the CK1 edges (and not on all clock edges). At CK1 clock edge t0, the software driver instructs the simulation processor 100 to simulate one clock tick for CK1, one clock tick for CK2 and two clock ticks for CK3 (i.e., the simulations for clock edges t0-t3). At t4, the software driver instructs the simulation processor 100 to simulate another clock tick for CK1, two clock ticks for CK2 and two clock ticks for CK3, and so on. The clock ticks can be executed in order or, if further optimization is possible, out of order. One advantage is that the overhead for interaction with the software driver is reduced. This example uses only three interactions (at t0, t4 and t9) whereas the edge-by-edge approach would use thirteen interactions (for t0-t12). Saving the state space for every time step (or after certain time steps) is useful in this approach since the simulation processor and software driver do not interact at every time step.
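The batching described above can be sketched as follows (the edge schedule below is hypothetical but consistent with the t0-t12 narrative, with CK1 edges at t0, t4 and t9; the grouping function is an illustrative assumption):

```python
# Illustrative sketch only: batching driver interactions on CK1 edges so
# that 13 edge-by-edge interactions collapse into 3.

def batch_by_trigger(schedule, trigger):
    """Group (time_step, domain) edges into one driver interaction per
    edge of the trigger clock."""
    batches = []
    for t, dom in schedule:
        if dom == trigger or not batches:
            batches.append([])           # new interaction at a trigger edge
        batches[-1].append((t, dom))
    return batches

edges = [("t0", "CK1"), ("t1", "CK3"), ("t2", "CK2"), ("t3", "CK3"),
         ("t4", "CK1"), ("t5", "CK2"), ("t6", "CK3"), ("t7", "CK3"),
         ("t8", "CK2"), ("t9", "CK1"), ("t10", "CK2"), ("t11", "CK3"),
         ("t12", "CK3")]
batches = batch_by_trigger(edges, "CK1")
# the driver interacts once per batch: at t0, t4 and t9
```

Each batch is handed to the hardware accelerator as a whole, which is why saving intermediate state spaces matters: the driver never sees the time steps inside a batch.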
The interface in
Reads and writes from and to the storage memory 122 occur through the processor 810 and co-processor 820. For a write to storage memory, the storage memory address flows from read register 520 to write FIFO 812 to interface 850 to read FIFO 824 to memory controller 828. The data flows along the same path, finally being written to the storage memory 122. For a read from storage memory, the storage memory address flows along the same path as before. However, data from the storage memory 122 flows through memory controller 828 to write FIFO 822 to interface 850 to read FIFO 814 to write register 510 to simulation processor 100.
The operating frequency for executing instructions on the simulation processor 100 and the data transfer frequency (bandwidth) for access to the storage memory 122 generally differ. In practice, the operating frequency for instruction execution is typically limited by the bandwidth to the program memory 121 since instructions are fetched from the program memory 121. The data transfer frequency to/from the storage memory 122 typically is limited by either the bandwidth to the storage memory 122 (e.g., between controller 828 and storage memory 122), the access to the simulation processor 100 (via read register 510 and write register 520) or by the bandwidth across interface 850.
In one implementation designed for logic simulation, the program memory 121 and storage memory 122 have different bandwidths and access methods. The program memory 121 connects directly to the main processor 810 and is realized with a bandwidth of over 200 billion bits per second. Storage memory 122 connects to the co-processor 820 and is realized with a bandwidth of over 20 billion bits per second. As storage memory 122 is not directly connected to the main processor 810, latency (including interface 850) is a factor. In one specific design, program memory 121 is physically realized as a reg [2,560] mem [8M], and storage memory 122 is physically realized as a reg [256] mem [125M] but is further divided by hardware and software logic into a reg [64] mem [500M]. Relatively speaking, program memory 121 is wide (2,560 bits per word) and shallow (8 million words), whereas storage memory 122 is narrow (64 bits per word) and deep (500 million words). This should be taken into account when deciding which DMA transfer (to either the program memory 121 or the storage memory 122) to use for which amount and frequency of data transfer. For this reason, the VLIW processor can be operated in co-simulation mode or stimulus mode.
In co-simulation mode, a software simulator is being executed on the host CPU 114, using the main memory 112 for internal variables. When the hardware mapped portion needs to be simulated, the software simulator invokes a request for response data from the hardware mapped portion, based on the current input data (at that time-step). In this mode, the software driver transfers the current input data (single stimulus vector) from the software simulator to the hardware simulator 130 by using DMA into program memory 121. Upon completion of the execution for this input data set, the requested response data (single response vector) is stored in program memory 121. The software driver then uses DMA to retrieve the response data from the program memory 121 and communicate it back to the software simulator. In an alternate approach, state space history and other data is transferred back to the host computer 110 via interface 842 (shown as a PCI interface in this example) and the path 520-812-850-824-842.
In stimulus mode, no software simulator need be executed on the host CPU 114. Only the software driver is used. In this mode, the hardware accelerator 130 can be viewed as a data-driven machine that prepares stimulus data (DMA from the host computer 110 to the hardware simulator 130), executes (issues a start command), and obtains the stimulus response (DMA from the hardware simulator 130 to the host computer 110).
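The three-step cycle just described (prepare stimulus, issue start, obtain response) can be sketched as a host-side driver loop. The API names below (`dma_to_device`, `start`, `wait_done`, `dma_from_device`) are hypothetical, invented for illustration; they do not reflect an actual driver interface.

```python
def run_stimulus(accelerator, stimulus_vectors):
    """Drive a data-driven accelerator through one DMA/execute/DMA
    cycle per stimulus vector, collecting the responses."""
    results = []
    for vector in stimulus_vectors:
        accelerator.dma_to_device(vector)    # 1. prepare stimulus data
        accelerator.start()                  # 2. issue start command
        accelerator.wait_done()
        results.append(accelerator.dma_from_device())  # 3. obtain response
    return results

class MockAccelerator:
    """Stand-in device for illustration: echoes each vector incremented."""
    def dma_to_device(self, v): self._v = v
    def start(self): self._out = [x + 1 for x in self._v]
    def wait_done(self): pass
    def dma_from_device(self): return self._out

responses = run_stimulus(MockAccelerator(), [[1, 2], [3]])
```

Note that the driver never interprets the data; it only moves vectors and issues commands, which is what makes the accelerator "data-driven" from the host's perspective.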
The two usage models have different characteristics. In co-simulation with a software simulator, the run time and communication time of the software simulator itself can introduce significant overhead. The software simulator generates, or reads, a vast amount of stimulus data based on execution on CPU 114. At any one time, the data set to be transferred to the hardware simulator 130 reflects the I/O to the logic portion mapped onto the hardware simulator 130. There typically will be many DMA requests in and out of the hardware simulator 130, but the data sets typically will be small. Therefore, use of program memory 121 is preferred over storage memory 122 for this data communication because the program memory 121 is wide and shallow.
In stimulus mode, interactions with the software simulator may be non-existent (e.g., software driver only) or may occur at a higher level (e.g., protocol boundaries rather than vector/clock boundaries). In this mode, the amount of data being transferred to/from the host computer 110 typically will be much larger. Storage memory 122 is therefore typically the preferred location for this larger amount of data (e.g., stimulus and response vectors) because it is narrow and deep. Data, including state space history, can thus be accumulated in the storage memory 122 rather than (or in addition to) the program memory 121, for later transfer to the host computer 110.
In an expansion mode, interface 852 allows expansion to another card. Data, including state space history, can be transferred to the other card for additional processing or storage. In one implementation, this second card compresses the data. An analogous approach can be used to transfer data from other cards back to the co-processor 820.
Furthermore, as shown in
With this architecture, each memory slice 621 can be accessed and controlled separately. Controller 610A uses Address 1, Control 1 and Data 1. Control 1 indicates that data should be read from Address 1 within memory slice 621A. Control 2 might indicate that data should be written to Address 2 within memory slice 621B. Control 3 might indicate an instruction fetch (a type of data read) from Address 3 within memory slice 621C, and so on. In this way, each memory slice 621 can operate independently of the others. The memory slices 621 can also operate together. If the address and control for all memory slices 621 are the same, then an entire word of D bits will be written to (or read from) a single address within program memory 121.
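The per-slice behavior described above, including the degenerate case where all slices receive the same address and control, can be modeled in a few lines. The slice count, depth, and method names below are illustrative assumptions, not taken from the text.

```python
class SlicedMemory:
    """Model of a memory built from independently controlled slices,
    each with its own address, control, and data signals."""
    def __init__(self, num_slices, depth):
        self.slices = [[0] * depth for _ in range(num_slices)]

    def access(self, ops):
        """ops: one (address, control, data) tuple per slice.
        control is 'read' or 'write'; each slice acts independently.
        An instruction fetch is simply a kind of read."""
        results = []
        for mem, (addr, ctrl, data) in zip(self.slices, ops):
            if ctrl == "write":
                mem[addr] = data
                results.append(None)
            else:
                results.append(mem[addr])
        return results

    def access_word(self, addr, ctrl, word=None):
        """Same address and control for every slice: a full
        D-bit word is read from (or written to) one address."""
        if ctrl == "write":
            return self.access([(addr, "write", part) for part in word])
        return self.access([(addr, "read", None) for _ in self.slices])

mem = SlicedMemory(num_slices=3, depth=16)
# Slice 0 reads, slice 1 writes, slice 2 fetches -- all independently.
mem.access([(1, "read", None), (2, "write", 0xAB), (3, "read", None)])
```

The same interface thus covers both operating styles: independent per-slice traffic, and whole-word access when every slice is given identical address and control.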
Typically, the instruction word width for each processor cluster, e.g. D1, is limited by physical realization, whereas the number of instruction bits per PE and the number of data bits for storage are determined by architecture choices. As a result, D1 may not correspond exactly to the PE-level instruction width times the number of PEs in the processor cluster. Furthermore, additional bits typically are used to program various cluster-level behavior. If it is assumed that at least one of the PEs is idle in each cluster, then those PE-level instruction bits become available to program cluster-level behavior. The widths of the cluster-level instructions can be deliberately designed to optimize this mapping. As a result, cluster-level instructions for different processor clusters may have different widths.
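A short bit-budget calculation illustrates the reuse of an idle PE's instruction slot. All widths below are invented for illustration; the text specifies no concrete values.

```python
pe_instr_bits = 36      # assumed instruction width per PE
pes_per_cluster = 8     # assumed number of PEs in one cluster
cluster_ctrl_bits = 24  # assumed cluster-level control field

# Naive layout: cluster-level bits widen the cluster instruction word.
naive_width = pe_instr_bits * pes_per_cluster + cluster_ctrl_bits

# Packed layout: if at least one PE is guaranteed idle, its instruction
# slot carries the cluster-level bits instead, so no extra width is needed.
packed_width = pe_instr_bits * pes_per_cluster

# The cluster-level field must fit within a single PE instruction slot.
assert cluster_ctrl_bits <= pe_instr_bits
```

Under these assumed numbers the packed layout saves 24 bits per cluster instruction, which is the kind of mapping optimization the cluster-level instruction widths can be designed around.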
Although the present invention has been described above with respect to several embodiments, various modifications can be made within the scope of the present invention. For example, although the present invention is described in the context of PEs that are the same, alternate embodiments can use different types of PEs and different numbers of PEs. The PEs also are not required to have the same connectivity. PEs may also share resources. For example, more than one PE may write to the same shift register and/or local memory. The reverse is also true: a single PE may write to more than one shift register and/or local memory.
In another aspect, the simulation processor 100 of the present invention can be realized in ASIC (Application-Specific Integrated Circuit) or FPGA (Field-Programmable Gate Array) or other types of integrated circuits. It also need not be implemented on a separate circuit board or plugged into the host computer 110. There may be no separate host computer 110. For example, referring to
Although the present invention is described in the context of logic simulation for semiconductor chips, the VLIW processor architecture presented here can also be used for other applications. For example, the processor architecture can be extended from single bit, 2-state, logic simulation to 2 bit, 4-state logic simulation, to fixed width computing (e.g., DSP programming), and to floating point computing (e.g. IEEE-754). Applications that have inherent parallelism are good candidates for this processor architecture. In the area of scientific computing, examples include climate modeling, geophysics and seismic analysis for oil and gas exploration, nuclear simulations, computational fluid dynamics, particle physics, financial modeling and materials science, finite element modeling, and computer tomography such as MRI. In the life sciences and biotechnology, computational chemistry and biology, protein folding and simulation of biological systems, DNA sequencing, pharmacogenomics, and in silico drug discovery are some examples. Nanotechnology applications may include molecular modeling and simulation, density functional theory, atom-atom dynamics, and quantum analysis. Examples of digital content creation include animation, compositing and rendering, video processing and editing, and image processing.
As a specific example, if the PEs are capable of integer or floating point arithmetic (as described in U.S. Provisional Patent Application Ser. No. 60/732,078, “VLIW Acceleration System Using Multi-State Logic,” filed Oct. 31, 2005, hereby incorporated by reference in its entirety), the VLIW architecture described above enables a general purpose data driven computer to be created. For example, the stimulus data might be raw data obtained by computer tomography. The hardware accelerator 130 is an integer or floating point accelerator which produces the output data, in this case the 3D images that need to be computed.
Depending on the specifics of the application, the hardware accelerator can be event-driven or cycle-based (or, more generally, domain-based). In the domain-based approach, the problem of computing the required 3D images is subdivided into “subproblems” (e.g., perhaps local FFTs). These “subproblems” are analogous to the clock domains described above, and the techniques described above with respect to clock domains (e.g., state space swap, state space history, out of order execution) can also be applied to this situation. Loops can be implemented by unrolling the loop into a flat, deterministic program. Alternately, loop control can be implemented by the host software, which determines which domains are loaded into the hardware accelerator for evaluation based on the state of the loop. Branching can be implemented similarly. Alternatively, loop control can be implemented by a hardware state machine which loops or branches depending on values in one or more of the SS sets. This can improve performance by reducing the communication back to the host software between loop iterations.
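The loop-unrolling alternative mentioned above can be shown with a minimal sketch. The instruction templates here are invented placeholders; the point is only that the unrolled program is flat and deterministic, containing no loop-control instructions.

```python
def unroll(body, iterations):
    """Expand a loop body into a flat instruction list, substituting
    the iteration index into each instruction template."""
    flat = []
    for i in range(iterations):
        flat.extend(op.format(i=i) for op in body)
    return flat

# A three-instruction body unrolled over three iterations yields a
# flat nine-instruction program with no branches.
program = unroll(
    ["load r0, x[{i}]", "mul r0, r0, k", "store y[{i}], r0"],
    iterations=3,
)
```

The trade-off is the usual one: the flat program needs no loop control in hardware or host software, at the cost of program size growing linearly with the iteration count.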
The multi-threading and clustering techniques described in
The concepts of co-simulation mode and stimulus mode also apply to applications other than logic simulation. In co-simulation mode for a VLIW math hardware accelerator, host software controls usage of the hardware accelerator. When a specific math function is to be evaluated, the host software invokes a request for the evaluation. The software driver transfers the relevant input data to the hardware accelerator, the hardware accelerator executes the VLIW instructions to evaluate the math function, and the output data (i.e., calculated result) is made available to the host software. In stimulus mode, the hardware accelerator can be viewed as a data-driven machine that receives input data (e.g., via DMA from the host computer 110 to the hardware simulator 130), executes the math evaluation, and obtains the corresponding result (for DMA to the host computer 110).
Accordingly, the disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.