The present invention relates to programmable pipeline fabrics, and more particularly to methods and apparatuses for real-time capture of the internal state of individual processing elements within the fabric for a desired step of a program implemented by the fabric.
A programmable pipeline fabric has been developed that dramatically advanced the state of the art of microprocessors. Details regarding the construction and operation of this type of processor may be found in Schmit, et al, “PipeRench: a virtualized programmable data path in 0.18 Micron Technology”, in Proceedings of the IEEE Custom Integrated Circuits Conference (CICC), 2002, the entirety of which is hereby incorporated by reference, Schmit, “PipeRench: a reconfigurable architecture and compiler”, IEEE Computer, pages 70-76 (April 2000), the entirety of which is hereby incorporated by reference, Schmit, “Incremental Reconfiguration for Pipelined Applications”, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pp. 47-55, 1997, the entirety of which is hereby incorporated by reference, Schmit et al, “PipeRench: A Coprocessor for Streaming Multimedia Acceleration”, International Symposium on Computer Architecture, pp. 38-49, 1999, the entirety of which is hereby incorporated by reference, and Schmit, et al, “Managing Pipeline-Reconfigurable FPGAs” published in ACM 6th International Symposium on FPGAs, February 1998, the entirety of which is hereby incorporated by reference. Certain additional novel aspects of this technology have been described in U.S. Pat. No. 7,131,017 and Ser. No. 10/222,645, the contents of which are incorporated herein by reference.
While the above architecture is in many ways superior to existing architectures, improvements are still possible. For example, compared to conventional processors and architectures, it is not as straightforward to debug applications or emulate performance of the programmable pipeline fabric/architecture described above. Whereas typical CPUs are supported by tools such as debuggers, and embedded processors can be debugged using tools such as in-circuit emulators (ICE) to provide real-time capture of internal state, similar tasks are not as straightforward in the pipelined architecture described above, especially in a single-instruction, multiple-data (SIMD) programming flow, and further when the pipeline fabric is embedded within an integrated circuit.
More specifically, as shown in
Accordingly, it would be desirable to have a scheme for emulating performance of a pipelined architecture, including means for providing real-time capture of internal state of desired elements and programming steps within the architecture at any given time.
The present invention allows emulation of a programmable pipeline processor fabric or architecture. According to certain aspects, the invention permits real-time capture of state information for any given stage of a processing flow performed by the fabric or architecture. According to other aspects, the invention allows a particular stage and data set of a SIMD flow to be analyzed. According to other aspects, the invention utilizes an independent clocking domain for the capture of state information.
In accordance with these and other aspects, an apparatus according to the invention includes a real-time capture block that controllably captures internal state from selected programmable elements in a pipeline fabric. In further accordance with these and other aspects, a method according to the invention includes controllably capturing internal state from selected programmable elements in a pipeline fabric in real time.
These and other aspects and features of the present invention will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments of the invention in conjunction with the accompanying figures, wherein:
The present invention will now be described in detail with reference to the drawings, which are provided as illustrative examples of the invention so as to enable those skilled in the art to practice the invention. Notably, the figures and examples below are not meant to limit the scope of the present invention to a single embodiment, but other embodiments are possible by way of interchange of some or all of the described or illustrated elements. Moreover, where certain elements of the present invention can be partially or fully implemented using known components, only those portions of such known components that are necessary for an understanding of the present invention will be described, and detailed descriptions of other portions of such known components will be omitted so as not to obscure the invention. In the present specification, an embodiment showing a singular component should not be considered limiting; rather, the invention is intended to encompass other embodiments including a plurality of the same component, and vice-versa, unless explicitly stated otherwise herein. Moreover, applicants do not intend for any term in the specification or claims to be ascribed an uncommon or special meaning unless explicitly set forth as such. Further, the present invention encompasses present and future known equivalents to the known components referred to herein by way of illustration.
In general, the present invention allows real-time capture of internal state of any processor(s) within any given pipeline stage(s) and/or data set(s) (i.e. vector) in a programmable pipeline fabric. A block diagram illustrating certain general aspects of the invention is shown in
In the example of
In general, the capture select data input to block 304 via the stage ID, vector number and/or serial data includes data that selects the pipeline stage and input data for which state data is to be captured. This allows precise identification of both the processing to be analyzed and the data that the processing acts upon, which can be important in certain processing flows such as single-instruction, multiple-data (SIMD) flows. It should be noted that for other types of processing flows, just one of the instruction stage and data set may need to be identified, and the invention encompasses such embodiments. The selection data can further include information regarding the particular type of data to be captured (for example, a particular subset of state information corresponding to a particular processing element within a physical stripe corresponding to the pipeline stage).
In embodiments, fabric 302 can be implemented by a programmable pipeline fabric such as a Kilocore fabric from Rapport, Incorporated of Redwood City, Calif., aspects of which are described in the prior publications, patents and applications referred to above. Accordingly, as described in those references, the configuration data includes instructions to be executed by processing elements in stripes 0 to n, using input data supplied to the fabric 302, and resulting in output data that is output from the fabric 302. The selection data provided to block 304, as made possible by adding the functionality and circuitry of the present invention to fabric 302, thus allows the internal state of a particular processing element that is executing a particular set of instructions provided in the configuration data on a particular vector provided in the input data to be captured, in a manner that has not been previously possible. In one example using the Kilocore fabric, the vector number associated with the input is conventionally provided and propagated within the fabric 302, and so the invention taps this existing information within the fabric. The manner of identifying the particular set of instructions to view via a stage ID bit will become more apparent from the descriptions below.
Regardless of the particular implementation details of fabric 302 (i.e. whether it is a Kilocore or other fabric), the operating principles of the invention can be substantially the same. A program is loaded into the processor by loading the configuration data with the instructions and other information. In conjunction with this step, the instruction/pipeline stage to be analyzed is identified, and this identification information (i.e. stage ID) is loaded together with the configuration data. In certain applications, it is also desirable to analyze the execution of instructions with certain sets of data or vectors (i.e. the set of data operated on by a stripe during a given pipeline stage or processing sequence). Accordingly, included with the input data is a vector number associated with each set of data that is provided to the stripes. Concurrently with or prior to program execution, capture select data can further be clocked into the real-time capture block 304 using the shift clock. In embodiments, the data is provided serially, and the shift clock operates in an emulation clock domain or operation that is separate from the real-time system clock.
During program operation, the fabric 302 is configured with configuration data and input data is provided to the fabric 302 in real-time according to the system clock. When the real-time capture block 304 detects that both (1) the stage ID provided with the configuration data and (2) the vector number provided with the input data match the stored capture select data, the capture operation is triggered, and the desired state information is extracted from the desired pipeline processor in real-time. The captured data can then be clocked out using the shift clock either concurrently with or following complete program execution.
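By way of illustration only, the trigger condition described above can be sketched in software. The following Python fragment is a minimal behavioral sketch, not a description of the actual circuitry; the function name `find_capture_cycle` and the representation of each system clock cycle as a `(stage_id_bit, vector_number)` pair are assumptions introduced solely for this example.

```python
from typing import Iterable, Optional, Tuple


def find_capture_cycle(
    trace: Iterable[Tuple[bool, int]],
    selected_vector: int,
) -> Optional[int]:
    """Return the first system-clock cycle at which capture would trigger.

    Hypothetical model: each trace entry is (stage_id_bit, vector_number)
    as seen in one cycle. Capture triggers only when the stage ID bit is
    set AND the vector number equals the stored capture select value.
    """
    for cycle, (stage_id_bit, vector_number) in enumerate(trace):
        if stage_id_bit and vector_number == selected_vector:
            return cycle
    return None  # the selected stage/vector combination never occurred
```

In this sketch, a cycle in which only one of the two conditions holds does not trigger capture, which mirrors the requirement that both the stage and the data set be identified in a SIMD flow.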
An example implementation of a pipeline fabric including real-time capture functionality according to embodiments of the invention is illustrated in more detail in
As shown in
More particularly, in one example implementation, input data is provided to the first stripe (i.e. stripe 0) in the pipeline fabric and propagated from stripe to stripe via internal interconnections. It should be apparent that although input data is shown in
In any event, in the original Kilocore fabric and other architectures, a vector number associated with the input data is provided together with the input data. In the present invention, capture block 404 in each stripe also receives this vector number, thereby allowing it to uniquely identify the data currently being processed by the associated stripe.
As further shown in
For example, in a Kilocore or similar fabric, a configuration store (corresponding to the cache in
As discussed above, capture select data is clocked into the capture block 404 circuitry serially with a clock (not shown) that is separate from the system clock for the fabric, and captured data is clocked out using the same separate clock. As shown in this example implementation, the serial data is first provided to the capture block 404-0 associated with the first stripe, clocked through that block, and then to capture block 404-1 associated with the second stripe, and continuing on in succession to the capture block 404-n associated with the last stripe. The serial data output from the last capture block 404-n associated with the last stripe can include data captured during an emulation operation.
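The serial daisy chain described above can be modeled, purely as an illustrative sketch, as a series of shift-register segments clocked by a common shift clock. The class `CaptureBlockSegment` and the functions `shift_clock` and `shift_stream` are hypothetical names chosen for this example, and the segment lengths are arbitrary assumptions rather than values from any actual implementation.

```python
from collections import deque
from typing import Iterable, List


class CaptureBlockSegment:
    """One capture block's portion of the serial path (hypothetical model)."""

    def __init__(self, length: int) -> None:
        # bits[0] is nearest the serial input, bits[-1] nearest the output.
        self.bits = deque([0] * length, maxlen=length)

    def shift(self, bit_in: int) -> int:
        bit_out = self.bits[-1]       # bit leaving toward the next block
        self.bits.appendleft(bit_in)  # maxlen discards the old last bit
        return bit_out


def shift_clock(chain: List[CaptureBlockSegment], bit_in: int) -> int:
    """Apply one shift-clock cycle to blocks 404-0 through 404-n in order."""
    for segment in chain:
        bit_in = segment.shift(bit_in)
    return bit_in  # serial data output of the last block


def shift_stream(chain: List[CaptureBlockSegment],
                 bits: Iterable[int]) -> List[int]:
    """Clock a sequence of bits in; return the bits that came out."""
    return [shift_clock(chain, b) for b in bits]
```

In this model, a bit clocked into the first block emerges from the last block only after a number of shift clock cycles equal to the total length of the chain, which is why the same clock can serve both to load capture select data and to unload captured data.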
It should be noted that many variations in the illustrated implementation are possible. For example, even though data will only be captured from one stripe, capture select data can be commonly provided to all stripes. This would be possible because the stage ID bit is used to trigger capture from a particular stripe, and thereby allows simplification of the implementation because it is immaterial what control data is provided to the other stripes. Moreover, since real-time data capture will only be triggered for one stripe at a time, capture data can be output from all stripes and OR-ed to provide the desired capture data. It should be noted, however, that the present invention can be implemented in various additional or alternative ways.
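The OR-ing variation mentioned above can be illustrated with a short sketch. It assumes, as a simplification not stated in the description, that every stripe whose capture was not triggered drives an all-zero output word; the function name `merge_capture_outputs` is hypothetical.

```python
from functools import reduce
from operator import or_
from typing import Iterable


def merge_capture_outputs(stripe_outputs: Iterable[int]) -> int:
    """Bitwise-OR the capture outputs of all stripes into one word.

    Assumption for this sketch: untriggered stripes output zero, so the
    result equals the output of the single triggered stripe.
    """
    return reduce(or_, stripe_outputs, 0)
```

Because at most one stripe is triggered at a time, the OR does not mix data from different stripes under this assumption.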
An example implementation of a real-time capture block 404 in accordance with embodiments of the invention is illustrated in more detail in
As shown in
As further shown in
Controls block 502 preferably stores certain of the capture select data received for the associated stripe. This stored select data includes the desired vector number corresponding to the state information to be captured. Controls block 502 parses or copies the vector number from the received capture select data and provides it to match tag register 506, which in turn provides it to one input of comparator 508.
As shown, controls block 502 is in the serial shift path of the serial data input clocked into block 404 by the shift clock. In one example, block 502 is implemented as a shift register or series of flip-flops or similar structures. Since the number of bits held by block 502 will be known, as will be the number of flip-flops 512 in the chain of block 404, it is straightforward to shift the desired data into precisely the desired locations in block 502 and flip-flops 512 simply based on the number of cycles of the shift clock. Those skilled in the art will further understand how to cause the desired data to be shifted into the proper locations and for the desired block among the stripes simply by knowing the number of flops 512 and the size of block 502 in each block 404, as well as the interconnections between the stripes. Thus, any desired subset of data can be selected from any one or more PEs in any stripe.
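As a hedged illustration of how the known bit counts determine the serial stream, the following sketch builds the sequence of bits to clock in so that a chosen block receives a chosen payload. The functions `build_select_stream` and `shift_in`, the per-block lengths, and the convention that all other blocks receive zeros are assumptions made only for this example.

```python
from typing import List, Sequence


def build_select_stream(block_lengths: Sequence[int], target: int,
                        payload: Sequence[int]) -> List[int]:
    """Return bits, in shift order, that place payload in block `target`.

    block_lengths[i] is the total number of serial bits in block 404-i
    (controls block plus flip-flops); payload[0] is the bit nearest the
    serial input of the target block. Other blocks are filled with zeros.
    """
    if len(payload) != block_lengths[target]:
        raise ValueError("payload must exactly fill the target block")
    image: List[int] = []  # desired final chain contents, input side first
    for index, length in enumerate(block_lengths):
        image.extend(payload if index == target else [0] * length)
    # The first bit shifted in travels farthest, so send the image reversed.
    return image[::-1]


def shift_in(stream: Sequence[int], total_bits: int) -> List[int]:
    """Model the whole chain as one register and clock the stream in."""
    register = [0] * total_bits
    for bit in stream:
        register = [bit] + register[:-1]
    return register
```

The length of the returned stream equals the total number of bits in the chain, which is the number of shift clock cycles needed to position the data.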
The stage ID bit included with the configuration information for a stripe is received by configuration latch 504 along with the configuration data provided to the stripe in any given cycle. When received, latch 504 drives an enable signal to comparator 508.
The vector number associated with the input data that is propagated to the stripe (from a previous stage or stripe, or from an input buffer, for example) is stored in tag register 510. This vector number is provided as the other input to comparator 508. When enabled by the configuration bit from configuration latch 504, comparator 508 compares its inputs to each other and if there is a match, it drives the clocks for the capture chain. Based on the control information clocked into the flip-flops 512, the selected subset of information for the desired PE will then be driven to the appropriate flip-flop 512 output. The captured data can then be clocked out of the block using the shift clock.
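The interaction of the elements just described can be summarized in a small behavioral sketch. It is a simplified software model under stated assumptions, not the circuit itself: the class `StripeCaptureBlock` and its method names are hypothetical, the PE state is represented as a plain list of bits, and the selection of a subset of PE state by control information is omitted.

```python
from typing import List, Optional, Sequence


class StripeCaptureBlock:
    """Hypothetical model of one capture block 404 with two clock domains."""

    def __init__(self, match_tag: int, width: int) -> None:
        self.match_tag = match_tag       # models match tag register 506
        self.tag: Optional[int] = None   # models tag register 510
        self.enable = False              # models configuration latch 504
        self.flops: List[int] = [0] * width  # models flip-flops 512

    def system_clock(self, stage_id_bit: bool, vector_number: int,
                     pe_bits: Sequence[int]) -> bool:
        """One system-clock cycle; returns True if capture fired."""
        if len(pe_bits) != len(self.flops):
            raise ValueError("pe_bits must match the capture chain width")
        self.enable = bool(stage_id_bit)
        self.tag = vector_number
        # Models comparator 508: compares only when enabled by the latch.
        fired = self.enable and self.tag == self.match_tag
        if fired:
            self.flops = list(pe_bits)  # capture chain is clocked once
        return fired

    def shift_clock(self, bit_in: int = 0) -> int:
        """One shift-clock cycle; returns the bit shifted out."""
        bit_out = self.flops[-1]
        self.flops = [bit_in] + self.flops[:-1]
        return bit_out
```

In this sketch the system clock and the shift clock are represented by separate methods, reflecting that capture occurs in real time while the captured bits are read out independently, last flip-flop first.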
As described previously, the shift clock provides a separate clock domain for shifting data into and out of block 404. More particularly, the shift clock clocks the capture select data into the capture chain, controls block 502 and match tag register 506. After data has been captured, the shift clock clocks the captured data out of the capture chain. Those skilled in the art will understand how to recover the captured data from all the serial data output from all the blocks.
Although the present invention has been particularly described with reference to the preferred embodiments thereof, it should be readily apparent to those of ordinary skill in the art that changes and modifications in the form and details may be made without departing from the spirit and scope of the invention.
For example, it should be noted that the capture block circuitry could also be adapted to cause data to be loaded into the processing elements via the serial chain. In such an example, further circuitry or functionality could also be added to cause a hold or interrupt to occur when the desired point in a program is reached, at which point data is loaded into, and/or captured from, the programmable fabric.
Many other alternatives and adaptations of the present invention will occur to those skilled in the art after being taught by this disclosure, and it is intended that the appended claims encompass such changes and modifications.