The disclosed embodiments are generally directed to automated performance verification in integrated circuit design.
Digital integrated circuit (IC) design generally consists of electronic system level (ESL) design, register transfer level (RTL) design and physical design. The ESL design step creates a user functional specification that is converted in the RTL design step into an RTL description. The RTL description describes, for example, the behavior of the digital circuits on the chip. The physical design step takes the RTL description along with a library of available logic gates and generates a chip design.
The RTL design step is where functional verification is performed. As noted above, the user functional specification is translated into hundreds of pages of detailed text and thousands of lines of computer code, and all potential paths need to be performance verified. However, arbitrary decisions on performance evaluation are usually made during verification, and the verification tools are selected ad hoc rather than systematically. Moreover, in some situations, manual procedures are used to schedule verification jobs, which requires tracking the task execution process and running one task after another by hand. As a result, gaps arise between consecutive tasks because the tasks do not run continuously. All of this leads to a limited number of executed verification steps from which only minimal performance data can be extracted, making it difficult to analyze the actual performance of a system.
A method and apparatus for automated performance verification for integrated circuit design is described herein. In some embodiments, the method includes a test preparation stage and an automated verification stage. The test preparation stage generates design feature-specific performance tests to meet expected performance goals under certain workloads using a variety of optimization approaches and for different design configurations. The automated verification stage is implemented by integrating three functional, automated modules into a verification infrastructure. These modules include a register transfer level (RTL) simulation module, a performance evaluation module and a performance publish module. The RTL simulation module schedules performance testing jobs, runs a series of performance tests on simulation logic nearly simultaneously and generates performance counters for each functional unit. The performance evaluation module consists of three sub-functions: a functional comparison between the actual results and a reference file containing the expected results; performance measurements such as throughput, execution time and latency values; and performance analysis. The performance publish module generates and publishes performance results and analysis reports, for example, onto a web page or into a database.
A more detailed understanding may be had from the following description, given by way of example in conjunction with the accompanying drawings.
Described herein is a method and apparatus for automated performance verification for integrated circuit design. In some embodiments, the method includes a test preparation stage and an automated verification stage. The test preparation stage generates design feature-specific performance tests to meet expected performance goals under certain workloads using a variety of optimization approaches and for different design configurations. The automated verification stage is implemented by integrating three functional, automated modules into a verification infrastructure. These modules include a register transfer level (RTL) simulation module, a performance evaluation module and a performance publish module. The RTL simulation module schedules performance testing jobs, runs a series of performance tests on simulation logic nearly simultaneously and generates performance counters for each functional unit. The performance evaluation module consists of three sub-functions: a functional comparison between the actual results and a reference file containing the expected results; performance measurements such as throughput, execution time and latency values; and performance analysis. The performance publish module generates and publishes performance results and analysis reports, for example, onto a web page or into a database.
An automated verification stage 110 uses the performance tests to verify the functionality of the unit. This verification process is implemented by integrating three consecutive functional modules into a verification infrastructure. All three modules are fully automated so that there are no timing gaps in the testing. The functional modules include an RTL simulation module 115, which performs the actual testing and passes the results to a performance evaluation module 120. If the performance evaluation module 120 determines that the unit has met expectations (123), the performance results are sent to a performance publish module 125, which publishes and presents the performance results in tabular and graphical formats on a web page or in a database. If the performance evaluation fails (124), the process starts over again at the test preparation stage 105. For example, this may include debugging the unit, adjusting the performance tests and then retesting the unit.
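By way of non-limiting illustration only, the following simplified Python sketch shows how the three automated modules may be chained with the fail-back path to the test preparation stage. The function names (prepare_tests, run_rtl_simulation, evaluate, publish) are hypothetical placeholders supplied by the verification infrastructure and are not part of the described design.

    # Illustrative sketch only; the callables are hypothetical placeholders
    # standing in for the test preparation stage 105 and modules 115, 120 and 125.
    def verify_unit(prepare_tests, run_rtl_simulation, evaluate, publish,
                    max_iterations=5):
        for _ in range(max_iterations):
            tests = prepare_tests()                    # test preparation stage 105
            counters = run_rtl_simulation(tests)       # RTL simulation module 115
            passed, report = evaluate(counters)        # performance evaluation module 120
            if passed:                                 # expectations met (123)
                publish(report)                        # performance publish module 125
                return True
            # evaluation failed (124): tests are adjusted and the unit is retested
        return False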
The test preparation stage 200 also needs to account for specified workload conditions, as performance requirements or expectations will vary depending on the activity level of the unit, the size of the unit or the size of the IC (210). For example, under certain scenarios, the performance tests may need minimum workloads to hide instruction latency issues. Improper workload adjustments will skew the results in the wrong direction. In another example, the workload may need to be adjusted to obtain a reasonable RTL simulation time for certain design versions while at the same time guaranteeing an expected performance measurement window. It may also be necessary to evenly distribute the workload across a number of functional units, which differ between design versions, so that accurate performance data may be obtained. The performance tests are updated and revised automatically based on actual performance and analysis and are fine tuned using the automated verification system. This increases the reliability and the value of the performance data analysis.
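A minimal Python sketch of one possible workload adjustment, assuming a hypothetical per-unit minimum needed to hide instruction latency and to guarantee a stable measurement window, is shown below; the parameter values are illustrative only.

    # Illustrative sketch only; the minimum item count per unit is an assumption.
    def distribute_workload(total_items, num_units, min_items_per_unit=64):
        # Enlarge the workload if it is too small to hide instruction latency
        # and to guarantee the expected performance measurement window.
        total_items = max(total_items, num_units * min_items_per_unit)
        base, remainder = divmod(total_items, num_units)
        # Spread any remainder so that no unit receives more than one extra item.
        return [base + (1 if i < remainder else 0) for i in range(num_units)]

    print(distribute_workload(1000, 8))   # [125, 125, 125, 125, 125, 125, 125, 125]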
The performance tests may also be optimized to improve and match the performance requirements (215). These optimization techniques may include, but are not limited to, padding a hull shader to avoid local data storage bank conflicts, aligning addresses, not allowing a Shader seQuence Cache (SQC) request to split a cache, avoiding sending a primitive to two shader engines, warming a cache for tests with virtual memory settings, and properly setting a memory channel mapping register for different kinds of memory clients to avoid unexpected remote memory requests with very long latency. These optimizations help distinguish whether a performance issue is software-setting related or hardware-design related. These optimizations are updated and revised based on actual performance and analysis and are fine tuned using the automated verification system.
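As one hypothetical illustration, such optimizations may be carried as switches in a test configuration so that a performance issue can later be attributed to a software setting or to the hardware design; the flag names below are invented for illustration and do not correspond to any particular register or tool.

    # Illustrative sketch only; flag names are hypothetical.
    DEFAULT_OPTIMIZATIONS = {
        "pad_hull_shader": True,          # avoid local data storage bank conflicts
        "align_addresses": True,
        "allow_sqc_request_split": False, # keep an SQC request within one cache
        "single_shader_engine_per_primitive": True,
        "warm_cache_for_virtual_memory_tests": True,
        "remap_memory_channels": True,    # avoid long-latency remote memory requests
    }

    def apply_optimizations(test_config, overrides=None):
        config = dict(test_config)
        config.update(DEFAULT_OPTIMIZATIONS)
        if overrides:
            config.update(overrides)
        return config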
The performance evaluation module 310 receives the results from the RTL simulation module 305 and performs a functional comparison between the actual results and a reference file containing the expected results (330). The performance evaluation module 310 then determines performance measurements for throughput, execution time, latency values and other measurement parameters (332) and performs a performance analysis on the performance measurements (334). The analysis from the performance evaluation module 310 is sent to the performance publish module, which publishes the performance results and analysis report (340).
The functional comparison module 405 determines, on a rolling basis, whether the simulation run is done for the unit (420). If the simulation run is done, the functional comparison module 405 compares the actual output results with a reference to determine whether the functional behavior of the unit meets expectations (422). If the unit's functional behavior meets expectations (424), the flow continues to the performance measurements module 410. If the functional behavior does not pass, the process starts over again at the test preparation stage 105.
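A simplified Python sketch of the functional comparison (422), assuming the actual results and the expected results are plain text files compared line by line, is given below; the file-based format is an assumption for illustration.

    # Illustrative sketch only; a line-by-line text comparison is assumed.
    def functional_comparison(actual_path, reference_path):
        with open(actual_path) as actual, open(reference_path) as reference:
            actual_lines = [line.rstrip("\n") for line in actual]
            expected_lines = [line.rstrip("\n") for line in reference]
        mismatches = [
            (i + 1, a, e)
            for i, (a, e) in enumerate(zip(actual_lines, expected_lines))
            if a != e
        ]
        if len(actual_lines) != len(expected_lines):
            mismatches.append(("line count", len(actual_lines), len(expected_lines)))
        return len(mismatches) == 0, mismatches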
The performance measurements module 410 collects or extracts performance measurement data for the completed simulation run for each unit. The performance data is much easier and clearer to analyze after all of it has been extracted systematically from the various performance counters generated by the RTL simulation. The comprehensive and valuable performance data generated by the performance tools includes, but is not limited to, throughput, execution time, register settings for correct design configuration, latency information for memory devices, and starve/stall values or workload balance values for each working unit. For example, the data collected may include, but is not limited to, throughput data (430), execution time information (432), latency information (434), starve/stall values (436) and other performance parameters. This information is then used by the performance analysis module 415. The analyses from these modules are automatically fed back to the performance test generation modules to increase the overall reliability and value of the performance data.
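The sketch below illustrates, under assumed counter names and an assumed clock frequency, how such measurements might be derived from the per-unit performance counters produced by the RTL simulation; none of the names reflect actual counter definitions.

    # Illustrative sketch only; counter names and clock frequency are assumptions.
    def collect_measurements(counters, clock_mhz=1000.0):
        cycles = (counters["busy_cycles"] + counters["stall_cycles"]
                  + counters["starve_cycles"])
        return {
            "throughput": counters["items_completed"] / max(cycles, 1),   # (430)
            "execution_time_us": cycles / clock_mhz,                      # (432)
            "avg_memory_latency": counters["memory_latency_total"]
                                  / max(counters["memory_requests"], 1),  # (434)
            "starve_ratio": counters["starve_cycles"] / max(cycles, 1),   # (436)
            "stall_ratio": counters["stall_cycles"] / max(cycles, 1),
        }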
The performance analysis module 415 calculates the theoretical peak rate, compares it with the measured data and analyzes the difference between them. This includes calculating a theoretical peak rate value (440), computing the actual peak rate from the measured data (442) and performing a comparison between the theoretical and actual numbers (444). If the unit's peak rate performance passes (446), the test has been successfully completed and the process flows to the performance publish module 125.
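A minimal Python sketch of this comparison, assuming the theoretical peak rate is the product of the number of units and an assumed per-unit rate, and assuming an illustrative pass tolerance, follows.

    # Illustrative sketch only; the peak-rate formula and tolerance are assumptions.
    def peak_rate_check(items_completed, busy_cycles,
                        num_units, items_per_unit_per_cycle, tolerance=0.95):
        theoretical_peak = num_units * items_per_unit_per_cycle   # (440)
        actual_rate = items_completed / max(busy_cycles, 1)       # (442)
        passed = actual_rate >= tolerance * theoretical_peak      # (444), (446)
        return passed, theoretical_peak, actual_rate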
If the unit did not meet the desired or expected peak rate (448), the performance analysis module 415 analyzes the data to identify the bottleneck. This analysis may include analyzing the starve/stall value for each unit (450), analyzing the latency information (452), verifying the bandwidth usage for memory devices (454) and checking the workload balance for each unit (456). After the analysis is complete, the flow returns to the test preparation stage 105.
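The following Python sketch illustrates one possible form of this bottleneck analysis; the thresholds and the measurement field names are illustrative assumptions that reuse the hypothetical names from the earlier sketches.

    # Illustrative sketch only; thresholds and field names are assumptions.
    def find_bottlenecks(per_unit, bandwidth_used, bandwidth_available,
                         starve_limit=0.10, stall_limit=0.10,
                         latency_limit=500, imbalance_limit=0.15):
        findings = []
        for i, m in enumerate(per_unit):
            if m["starve_ratio"] > starve_limit:                  # starve/stall (450)
                findings.append(f"unit {i}: starved {m['starve_ratio']:.0%} of cycles")
            if m["stall_ratio"] > stall_limit:
                findings.append(f"unit {i}: stalled {m['stall_ratio']:.0%} of cycles")
            if m["avg_memory_latency"] > latency_limit:           # latency (452)
                findings.append(f"unit {i}: high average memory latency")
        if bandwidth_used > 0.9 * bandwidth_available:            # bandwidth usage (454)
            findings.append("memory bandwidth nearly saturated")
        work = [m["throughput"] for m in per_unit]
        if work and max(work) - min(work) > imbalance_limit * max(work):
            findings.append("workload imbalance between units")   # workload balance (456)
        return findings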
Group verification tasks 505 are tasks that run multiple tests in parallel for a unit. The group verification tasks 505 may include the test verification tasks 510. Group verification tasks 505 are scheduled once and executed simultaneously, and no extra execution time is wasted between any group verification tasks 505 because they run in parallel. The verification infrastructure is similar to an Integrated Development Environment, which provides comprehensive facilities for creating verification systems. All verification tasks are integrated into the verification infrastructure as a single flow to make sure that all the required tasks are executed continuously one after another. The automated performance verification method is fast for both group verification tasks 505 and test verification tasks 510 because all the verification tasks run continuously under the automated verification system.
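By way of illustration, a group verification task may be sketched in Python as submitting all of its test verification tasks at once and running them in parallel, so that no idle time is introduced between tests; the thread-pool mechanism shown is an assumption, not a requirement of the described infrastructure.

    # Illustrative sketch only; a thread pool is assumed as the parallel mechanism.
    from concurrent.futures import ThreadPoolExecutor

    def run_group(test_tasks, max_workers=8):
        # Each element of test_tasks is a callable that runs one test verification
        # task 510 and returns its result.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [pool.submit(task) for task in test_tasks]  # scheduled once
            return [future.result() for future in futures]        # run in parallel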
Problems in existing verification systems include the lack of a systematic verification method. This leads to limited coverage of verification steps and limits the valuable performance data that can be extracted. Another problem is that most verification systems are manual-operation intensive, requiring a larger workforce to supervise the task execution process. Extra execution time is also required to finish identical jobs when they are manually operated. Personnel need to manually run tasks one after another, which generates gaps between two consecutive tasks because they are not running continuously; extra execution time is needed to finish the verification work and more personnel are required to engage in the verification process. Practical verification results show that at least one extra hour is consumed per test under existing verification processes, and this is amplified when running massive verification tasks with more than 3,000 tests. As stated herein, a systematic verification method for the verification system improves the overall work efficiency and allows greater contributions to a project by fewer team members.
Moreover, these manually operated verification methods have limitations in the scope of coverage with respect to performance verification and analysis. For example, arbitrary decisions on performance evaluation may be made in the verification process due to manual operations. The verification tools are selected ad hoc rather than systematically. This leads to limited coverage because some or all verification steps are not executed, which in turn limits or decreases the amount of valuable performance data that is available or could be extracted. Analysis of a limited set of performance data provides little or no basis for measuring performance against expectations.
The automated verification system as described herein may save one to three personnel on the verification work for each project, as all the necessary verification tasks can be submitted once, run on simulation logic simultaneously and be finished and evaluated automatically without surveillance. Practical verification experience shows that at least one hour can be saved per test during the verification process. This savings is amplified when running massive verification tasks with more than 3,000 tests.
All test verification tasks are scheduled in an executing queue 702 and run one by one. When execution reaches a call simulation module task 705, an execution request 707 is sent to the RTL simulation module 710. The RTL simulation module 710 executes and returns the results back to the executing queue 702 when the simulation function is complete (715). The next task is then executed. For example, the call evaluation module task 720 sends an execution request 722 to the performance evaluation module 725. The performance evaluation module 725 executes and returns the results back to the executing queue 702 when the evaluation function is complete (730). The process repeats for the publish module 740. In particular, the call publish module task 735 sends an execution request 737 to the publish module 740. The publish module 740 executes and returns the results back to the executing queue 702 when the publish function is complete (745).
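A simplified Python sketch of the executing queue 702 is given below; the task list and module callables are hypothetical placeholders for the call-module tasks 705, 720 and 735 and the modules 710, 725 and 740.

    # Illustrative sketch only; the tasks and modules shown are placeholders.
    def run_task_queue(tasks, modules):
        # tasks: ordered list of (module_name, request) pairs, for example
        #   [("simulation", req1), ("evaluation", req2), ("publish", req3)]
        # modules: mapping from module name to a callable implementing that module.
        results = []
        for module_name, request in tasks:
            result = modules[module_name](request)  # execution request and return
            results.append((module_name, result))   # the next task then executes
        return results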
As described hereinabove, the performance results are illustrated using tables and figures and are published on a web page or written into a database. This makes it easy for a system architect to review the overall performance of the system or for a marketing engineer to show the performance of the product to the public.
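A minimal Python sketch of one possible publish step, assuming the results are written both as an HTML table for a web page and as rows in a small SQLite database, follows; the file names and table schema are illustrative assumptions.

    # Illustrative sketch only; file names and schema are assumptions.
    import sqlite3

    def publish_results(results, html_path="performance.html", db_path="performance.db"):
        # results: list of (test_name, throughput, execution_time_us) tuples.
        rows = "".join(
            f"<tr><td>{name}</td><td>{tp:.3f}</td><td>{t:.1f}</td></tr>"
            for name, tp, t in results
        )
        with open(html_path, "w") as f:
            f.write("<table><tr><th>Test</th><th>Throughput</th>"
                    "<th>Execution time (us)</th></tr>" + rows + "</table>")
        with sqlite3.connect(db_path) as db:
            db.execute("CREATE TABLE IF NOT EXISTS results "
                       "(test TEXT, throughput REAL, execution_time_us REAL)")
            db.executemany("INSERT INTO results VALUES (?, ?, ?)", results)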
The processor 902 may include a central processing unit (CPU), a graphics processing unit (GPU), a CPU and GPU located on the same die, or one or more processor cores, wherein each processor core may be a CPU or a GPU. The memory 904 may be located on the same die as the processor 902, or may be located separately from the processor 902. The memory 904 may include a volatile or non-volatile memory, for example, random access memory (RAM), dynamic RAM, or a cache.
The storage 906 may include a fixed or removable storage, for example, a hard disk drive, a solid state drive, an optical disk, or a flash drive. The input devices 908 may include a keyboard, a keypad, a touch screen, a touch pad, a detector, a microphone, an accelerometer, a gyroscope, a biometric scanner, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals). The output devices 910 may include a display, a speaker, a printer, a haptic feedback device, one or more lights, an antenna, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals).
The input driver 912 communicates with the processor 902 and the input devices 908, and permits the processor 902 to receive input from the input devices 908. The output driver 914 communicates with the processor 902 and the output devices 910, and permits the processor 902 to send output to the output devices 910. It is noted that the input driver 912 and the output driver 914 are optional components, and that the device 900 will operate in the same manner if the input driver 912 and the output driver 914 are not present.
In general and in accordance with some embodiments, a method for verifying performance of a unit in an integrated circuit is described herein. Design feature-specific performance tests are generated to meet expected performance goals that account for workloads, optimization techniques and different integrated circuit design configurations. A register transfer level (RTL) simulation is run using the performance tests to generate actual performance results. The actual performance results are then verified to meet the expected performance results. The verification includes performing a functional comparison between the actual performance results and the expected performance results, determining performance measurements based on the actual performance results, and analyzing the performance measurements. The actual performance results are published in a visual, organized format.
It should be understood that many variations are possible based on the disclosure herein. Although features and elements are described above in particular combinations, each feature or element may be used alone without the other features and elements or in various combinations with or without other features and elements.
The methods provided may be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Array (FPGA) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors may be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions being capable of storage on a computer-readable medium). The results of such processing may be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the embodiments.
The methods or flow charts provided herein, to the extent applicable, may be implemented in a computer program, software, or firmware incorporated in a computer-readable storage medium for execution by a general purpose computer or a processor. Examples of computer-readable storage mediums include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs).