The invention relates generally to embedded microprocessor architecture and more specifically to systems and methods for synchronizing the operation of multiple processing engines in a microprocessor-based system.
Processor extension logic is utilized to extend a microprocessor's capability.
Typically, this logic operates in parallel with, and is accessible by, the main processor pipeline. It is often used to perform specific, repetitive, computationally intensive functions, thereby freeing up the main processor pipeline.
A design issue that must be addressed in microprocessor architectures, and in microprocessor-based systems in general, that employ processor extension logic, such as an extended instruction pipeline that is distinct from the main instruction pipeline, is synchronization and control. It is difficult to balance the competing interests of simplifying implementation and debugging while maximizing parallelism.
Thus, there exists a need for a parallel pipeline architecture that can fully exploit the advantages of parallelism without suffering from the design complexity of loosely or completely decoupled pipelines.
At least one embodiment of the invention may provide a method for synchronization of multiple processing engines in an extended processor core. The method according to this embodiment may comprise placing direct memory access (DMA) functionality in a single instruction multiple data (SIMD) pipeline, where the DMA functionality comprises a data-in engine and a data-out engine, and each DMA engine is allowed to buffer at least one instruction issued to it in a queue without stopping the SIMD pipeline. The method may also comprise, when a DMA engine's queue is full and a new DMA instruction attempts to enter the queue, blocking the SIMD pipeline from executing any instructions that follow until the current DMA operation is complete, thereby allowing the DMA engines and the SIMD pipeline to maximize parallel operation while still remaining synchronized.
Another embodiment of the invention provides a method for synchronizing multiple processing engines of a microprocessor. The method according to this embodiment comprises coupling an extended instruction pipeline to a main instruction pipeline, coupling direct memory access (DMA) engines to the extended instruction pipeline, buffering at least one instruction in a queue in the DMA engine without stopping the extended instruction pipeline, and blocking the extended instruction pipeline from further execution when a DMA engine queue is full and a new DMA instruction arrives at the queue until a current DMA operation is complete.
A further embodiment of the invention provides a multi-processing engine architecture for a microprocessor. The multi-processing engine architecture according to this embodiment comprises a main instruction pipeline, an extended instruction pipeline coupled to the main instruction pipeline via an instruction queue, and direct memory access (DMA) engines coupled to the extended instruction pipeline, the DMA engines comprising a data-in engine and a data-out engine, wherein each of the data-in and data-out engines comprises an instruction queue adapted to buffer at least one instruction without stopping the extended instruction pipeline.
An additional embodiment of the invention provides, in a microprocessor having a main instruction pipeline and processor extension logic comprising an extended instruction pipeline that is coupled to the main instruction pipeline via an instruction queue, wherein the extended instruction pipeline is adapted to be selectively decoupled from the main instruction pipeline to perform autonomous operation, and where the extended instruction pipeline is further coupled to DMA engines for moving data into and out of a local memory, a method for maximizing simultaneous operation of the extended instruction pipeline and the DMA engines. The method according to this embodiment comprises executing an instruction from the extended instruction pipeline that requires a DMA engine, buffering the instruction if sufficient queue space is available in the DMA engine, and, if insufficient queue space is available, preventing the extended instruction pipeline from further execution until a current DMA operation is complete, freeing up a space in the queue to accept the blocked DMA instruction from the instruction pipeline, and thereafter resuming execution of the extended instruction pipeline.
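By way of illustration only, the enqueue-or-stall hand-off summarized above may be sketched in software. The following C fragment is a minimal behavioral model, not the actual hardware logic; the identifiers (dma_engine_t, dma_issue, dma_complete) and the queue depth are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdio.h>

#define QUEUE_DEPTH 1              /* each engine buffers at least one instruction */

typedef struct {
    int slots[QUEUE_DEPTH];        /* buffered DMA instructions */
    int count;                     /* occupied queue entries */
} dma_engine_t;

/* Called when the extended pipeline issues a DMA instruction.
 * Returns true if the instruction was buffered and the pipeline may
 * continue; false if the queue is full and the pipeline must block
 * until the current DMA operation completes. */
bool dma_issue(dma_engine_t *e, int insn)
{
    if (e->count < QUEUE_DEPTH) {
        e->slots[e->count++] = insn;   /* buffered without stopping the pipeline */
        return true;
    }
    return false;                      /* queue full: block further execution */
}

/* Called when the engine completes its current operation; a queue slot
 * is freed so a blocked DMA instruction can enter, after which the
 * extended pipeline resumes execution. */
void dma_complete(dma_engine_t *e)
{
    for (int i = 1; i < e->count; i++)
        e->slots[i - 1] = e->slots[i];
    if (e->count > 0)
        e->count--;
}

int main(void)
{
    dma_engine_t in = { .count = 0 };
    printf("issue #1 buffered: %d\n", dma_issue(&in, 1)); /* 1: queued  */
    printf("issue #2 buffered: %d\n", dma_issue(&in, 2)); /* 0: blocked */
    dma_complete(&in);                                    /* frees a slot */
    printf("issue #2 retried:  %d\n", dma_issue(&in, 2)); /* 1: queued  */
    return 0;
}
```

Increasing QUEUE_DEPTH in this sketch corresponds to deepening the non-blocking instruction queue discussed below, which permits double, triple, or more buffering.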
These and other embodiments and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention.
In order to facilitate a fuller understanding of the present disclosure, reference is now made to the accompanying drawings, in which like elements are referenced with like numerals. These drawings should not be construed as limiting the present disclosure, but are intended to be exemplary only.
The following description is intended to convey a thorough understanding of the embodiments described by providing a number of specific embodiments and details involving microprocessor architecture and systems and methods for synchronizing multiple processing engines in a microprocessor-based system. It should be appreciated, however, that the present invention is not limited to these specific embodiments and details, which are exemplary only. It is further understood that one possessing ordinary skill in the art, in light of known systems and methods, would appreciate the use of the invention for its intended purposes and benefits in any number of alternative embodiments, depending upon specific design and other needs.
Commonly assigned U.S. patent application Ser. No. ______ titled “System and Method for Selectively Decoupling a Parallel Extended Processor Pipeline,” filed concurrently with this application is hereby incorporated by reference in its entirety into the disclosure of this application.
Referring now to
In various embodiments, a single instruction issued by the processor pipeline 12 may cause up to sixteen 16-bit elements to be operated on in parallel through the use of the 128-bit data path 55 in the media engine 50. In various embodiments, the SIMD engine 50 utilizes closely coupled memory units. In various embodiments, the SIMD data memory 52 (SDM) is a 128-bit wide data memory that provides low-latency access for loads to and stores from the 128-bit vector register file 51. The SDM contents are transferable via a DMA unit 54, freeing both the processor core 10 and the SIMD core 50 from performing the transfers. In various embodiments, the DMA unit 54 comprises a DMA in engine 61 and a DMA out engine 62. In various embodiments, both the DMA in engine 61 and the DMA out engine 62 may comprise instruction queues (labeled Q in the Figure) for buffering one or more instructions. In various embodiments, a SIMD code memory 56 (SCM) allows the SIMD unit to fetch instructions from a localized code memory, allowing the SIMD pipeline to dynamically decouple from the processor core 10, resulting in truly parallel operation between the processor core and the SIMD media engine, as discussed in commonly assigned U.S. patent application Ser. No. ______, titled “Systems and Methods for Recording Instruction Sequences in a Microprocessor Having a Dynamically Decoupleable Extended Instruction Pipeline,” filed concurrently herewith, the disclosure of which is hereby incorporated by reference in its entirety.
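The organization just described can be summarized in a structural sketch. The C declarations below are illustrative only; the register-file depth, queue depth, and all identifiers are assumptions, and the actual design is hardware rather than C structures. Note that a single instruction reading two 128-bit registers touches sixteen 16-bit elements, consistent with the parallelism described above.

```c
#include <stdint.h>

#define VREG_COUNT  32             /* vector register file depth: assumed */
#define LANES        8             /* 128 bits = 8 x 16-bit lanes; an op on
                                      two such registers covers 16 elements */
#define DMA_Q_DEPTH  1             /* at least one buffered instruction */

typedef struct { uint16_t lane[LANES]; } vreg128_t;

typedef struct {
    uint32_t queue[DMA_Q_DEPTH];   /* instruction queue (Q in the Figure) */
    int      count;
} dma_engine_t;

typedef struct {
    vreg128_t    vrf[VREG_COUNT];  /* 128-bit vector register file 51 */
    uint16_t    *sdm;              /* SIMD data memory 52, 128 bits wide */
    uint32_t    *scm;              /* SIMD code memory 56 */
    dma_engine_t dma_in;           /* DMA in engine 61 */
    dma_engine_t dma_out;          /* DMA out engine 62 */
} media_engine_t;
```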
Thus, the microprocessor architecture according to various embodiments of the invention may permit the processor to operate in both closely coupled and decoupled modes of operation. In the closely coupled mode of operation, the SIMD program code fetch and program stream supply are handled exclusively by the processor core 10. In the decoupled mode of operation, the SIMD pipeline 53 executes code from a local memory 56 independently of the processor core 10. The processor core 10 may control the SIMD pipeline 53 to execute multimedia tasks such as audio processing, entropy encoding/decoding, discrete cosine transforms (DCTs) and inverse DCTs, motion compensation, and de-block filtering.
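A hedged sketch of the two operating modes follows; the enum, the fetch helpers, and their stub bodies are invented for illustration, and the actual mode control interface is described in the incorporated application.

```c
#include <stdint.h>

typedef enum {
    MODE_CLOSELY_COUPLED,   /* core 10 supplies the SIMD program stream */
    MODE_DECOUPLED          /* SIMD pipeline 53 fetches from the SCM 56 */
} simd_mode_t;

/* Stub fetch sources; in hardware these would be the instruction queue
 * fed by the core and the local SIMD code memory, respectively. */
static uint32_t next_insn_from_core(void) { return 0; /* placeholder */ }
static uint32_t next_insn_from_scm(void)  { return 0; /* placeholder */ }

/* One fetch step of the SIMD pipeline under each operating mode. */
uint32_t fetch_simd_insn(simd_mode_t mode)
{
    return (mode == MODE_CLOSELY_COUPLED) ? next_insn_from_core()
                                          : next_insn_from_scm();
}
```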
With continued reference to the microprocessor architecture in
As discussed above, operating the main pipeline, the extended pipeline, and the DMA engines in parallel introduces the problem of synchronization. For example, a SIMD code segment may have to wait for a DMA operation, kicked off by the instruction just preceding it, to finish transferring data into the SDM. Conversely, the DMA engine cannot start transferring data out of the SDM until the previously issued SIMD code has executed. This type of synchronization is normally performed by using software to probe status bits toggled by these engines, or by using interrupts and their associated service routines to kick off the dependent processes. Both of these solutions incur large overheads, in cycles as well as in coding effort, to achieve the desired synchronization.
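For contrast, the status-bit probing that the text identifies as costly might look like the following sketch; the flag name and the kernel are hypothetical.

```c
#include <stdbool.h>

volatile bool dma_in_done;         /* status bit toggled by the DMA engine */

/* Conventional software synchronization: the SIMD code cannot touch the
 * SDM until it has spun on the status bit, burning cycles on the
 * busy-wait that the queue-based approach below eliminates. */
void simd_kernel_polled(void)
{
    while (!dma_in_done)
        ;                          /* busy-wait on the DMA status bit */
    /* ... operate on the data just transferred into the SDM ... */
}
```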
In order to reduce these overheads, in various embodiments of the invention, the DMA engines 61, 62 are placed in the SIMD pipeline 53 itself, but each DMA engine is allowed to buffer one or more instructions issued to it in a queue without stopping SIMD pipeline execution. When a DMA engine's instruction queue is full, the SIMD pipeline 53 is blocked from executing further instructions only when another DMA instruction arrives at that engine. This allows the software to be re-organized so that SIMD code rarely has to wait for a DMA operation to complete, or vice versa, as long as a double (or more) buffering approach is used, that is, two or more buffers are used to allow data transfer and data computation to overlap.
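As a hedged illustration of such a reorganization, the loop below double-buffers the SDM: while the SIMD pipeline computes on one half, the buffered DMA-in instruction fills the other, so in the steady state neither engine waits. The buffer layout and the dma_in/dma_out/simd_compute stand-ins are assumptions for the sketch, not the actual instruction set.

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_WORDS 64
static uint16_t sdm[2][BLOCK_WORDS];          /* two halves of the SDM */

/* Stand-ins for the queued DMA instructions and the SIMD kernel,
 * modeled here as traces so the overlap pattern is visible. */
static void dma_in(uint16_t *dst, int blk)  { (void)dst; printf("dma_in  block %d\n", blk); }
static void dma_out(uint16_t *src, int blk) { (void)src; printf("dma_out block %d\n", blk); }
static void simd_compute(uint16_t *buf)     { (void)buf; printf("compute\n"); }

void process(int n_blocks)
{
    dma_in(sdm[0], 0);                        /* prefetch the first block */
    for (int i = 0; i < n_blocks; i++) {
        if (i + 1 < n_blocks)
            dma_in(sdm[(i + 1) & 1], i + 1);  /* buffered in the queue, so
                                                 the pipeline keeps going */
        simd_compute(sdm[i & 1]);             /* compute overlaps the transfer */
        dma_out(sdm[i & 1], i);               /* queued drain of the results */
    }
}

int main(void) { process(4); return 0; }
```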
With continued reference to the processor architecture of
Referring to
This approach avoids the need for the main processor core to intervene continuously in order to achieve synchronization between the DMA unit and the SIMD pipeline. However, the processor core 10 does need to ensure that the instruction sequences it sends exploit this functionality to achieve the best performance by parallelizing SIMD and DMA operations. Thus, an advantage of this approach is that it facilitates the synchronization of SIMD and DMA operations in a multi-engine video processing core with minimal interaction with the main control processor core. This approach can be extended by increasing the depth of the DMA non-blocking instruction queue so as to allow more DMA instructions to be buffered in the DMA channels, permitting double, triple, or more buffering.
Referring now to
The embodiments of the present inventions are not to be limited in scope by the specific embodiments described herein. For example, although many of the embodiments disclosed herein have been described with reference to systems and methods for synchronizing multiple processing engines in a microprocessor-based system having a main instruction pipeline and an extended instruction pipeline, the principles herein are equally applicable to other aspects of microprocessor design and function. Indeed, various modifications of the embodiments of the present inventions, in addition to those described herein, will be apparent to those of ordinary skill in the art from the foregoing description and accompanying drawings. Thus, such modifications are intended to fall within the scope of the following appended claims. Further, although some of the embodiments of the present invention have been described herein in the context of a particular implementation in a particular environment for a particular purpose, those of ordinary skill in the art will recognize that its usefulness is not limited thereto and that the embodiments of the present inventions can be beneficially implemented in any number of environments for any number of purposes. Accordingly, the claims set forth below should be construed in view of the full breadth and spirit of the embodiments of the present inventions as disclosed herein.
This application claims priority to U.S. Provisional Patent Application No. 60/721,108 titled “SIMD Architecture and Associated Systems and Methods,” filed Sep. 28, 2005, the disclosure of which is hereby incorporated by reference in its entirety.