The present invention relates to microprocessors and more particularly to microcoded compute engines and even more particularly to extensible microcoded compute engines.
In the past, it has been well known to use microcoded microprocessors such as the Advanced-Architecture Microprocessor (AAMP) as described in an article by the same name in the August 1982 issue of IEEE MICRO, which is incorporated herein in its entirety by this reference.
In simplest terms, the control unit of a microprogrammed digital machine comprises a control store and a microsequencer. Typically, the control store (a control store implemented with RAM is a Writable Control Store (WCS)) is a wide memory that contains the microprogram. Each line of code in the microprogram is referred to as a microword or microinstruction. One or more microwords are collectively referred to as microcode.
To access a given microinstruction from the control store, the microsequencer must issue the microaddress associated with the desired microinstruction. It follows that each memory location in the control store contains a microinstruction. We sequence through a series of microinstructions to accomplish some task.
Each microinstruction contains individual fields that directly control a digital machine's primitive data flow and sequencing functions. Consisting of one or more bits each, these fields are known as microorders, microoperations, or microcommands. Representative microcommands for an ALU would be OR, AND, and ADD. If the ALU had eight commands, the ALU microcommand field would be (optimally) 3 bits wide.
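The field extraction described above can be sketched as follows. This is an illustrative model only, not the patent's actual microword layout: the bit position of the ALU field and the command encodings are assumptions.

```python
# Hypothetical 3-bit ALU microcommand field within a microword.
# Three bits suffice to encode up to eight ALU commands.
ALU_CMDS = {0b000: "OR", 0b001: "AND", 0b010: "ADD"}  # assumed encodings

def alu_microcommand(microword, shift=5, width=3):
    """Extract the ALU microcommand field from a microword (assumed layout)."""
    field = (microword >> shift) & ((1 << width) - 1)
    return ALU_CMDS.get(field, "NOP")
```

For example, a microword with the value 0b010 in the assumed field position would decode to the ADD microcommand.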
Microcommands should not be confused with the more familiar assembly code instructions of processors such as the Pentium and PowerPC. In a microprogrammed processor, execution of an assembly code instruction is accomplished by execution of one or more microinstructions on its behalf.
Now referring to
Basic data flow is from the register file, through the ALUs, and then back to the register file. Referring to the data side 110 of
Various types of functional subunits may be attached to the A, B, and C buses. The number of subunits is limited by the width and encoding capability of the control store. In addition, depending on their function, the subunits may or may not have both a data source and a data sink.
For vision processing, the MicroCore has been augmented with a Data Address Generator (DAG) on the left side of UC16 (called MicroCore w/ DAG in
Another enhancement is the provision for accessing one or more scratchpad memories via the external buses. Read and write addresses for these memories are provided by dedicated registers (WAx and RAx) in the address-side register file.
Still another enhancement to the basic concept of the AAMP was the provision of a Look-ahead Fetch Unit (LFU) in the AAMP5 sold by Rockwell Collins Inc.
As illustrated in
1) To distinguish it from AAMP5 microcode
2) By virtue of the LFU being a much simpler compute engine with respect to the AAMP5.
While the AAMP line of microprocessors has been quite successful, providing utility in many specialized applications, it has evolved through successive iterations of improvements. Years of research and development went into the several variations of the AAMP.
There has been a need in the microcoded microprocessor industry to enhance the flexibility and increase the utility of such processors by decreasing the design effort and engineering time often required in redesigning such existing microprocessors to include increased computational power and functionality.
The present invention is directed toward providing such improvement in microprocessors and particularly in meeting some of the need for the ability to rapidly and efficiently create high performance low power consumption designs.
It is an object of the present invention to increase the computational power of a microcoded processor.
It is a feature of the present invention to include nested hierarchical microcoded compute engines.
It is an advantage of the present invention to provide for ease in computational power expansion by nesting a compute engine between the C and B buses or between the E and G buses of the next higher compute engine.
The present invention is intended to achieve the above-described object and include the aforementioned feature, and provide the previously stated advantage.
Accordingly the present invention is:
a system of hierarchical interconnected nested microprocessors comprising:
a first microcoded compute engine;
a first source bus and a first sink bus, each coupled to said first microcoded compute engine;
a loadable micro control store for storing a microorder;
a passive functional unit disposed between said first source bus and said first sink bus; said passive functional unit being addressed by microorders comprising only data and control path primitives;
a second microcoded compute engine disposed in a nested configuration at a lower hierarchy level than said first microcoded compute engine and in parallel with said passive functional unit and coupled to said first source bus and said first sink bus; said second microcoded compute engine having a second source bus and a second sink bus coupled thereto;
said first microcoded compute engine, said passive functional unit, said second microcoded compute engine, said first source bus and said first sink bus all being configured, so that said passive functional unit could be replaced by an expansion microcoded compute engine, in which data exchange between hierarchy levels is through memory and FIFO access microorders that do not differentiate between action upon a passive functional unit and action upon a compute engine.
In the above description of the present invention, a new hierarchy level is created by disposing a microcoded compute engine and a passive functional unit between the sink and source buses of a microcoded compute engine at the next higher hierarchy level, with the possibility of replacing said passive functional unit with yet another microcoded compute engine. This replacement would be to increase the effective computation power of the encompassing hierarchy level.
Having stated that, it is important to note that the present invention does not require a minimum number of either passive functional units or microcoded compute engines at any level of the hierarchy. Replacement of a passive functional unit with a compute engine is intended to provide extra computation power, and it could be the case that, at any given level, the initial system design called for no passive functional unit at all, and only one or more microcoded compute engines are needed at that level. We have motivated discussion of the present invention with the notion of “replacement” to highlight the interchangeability of passive functional units and microcoded compute engines from the standpoint of microorders common to both.
Finally, we note that progression to a lower hierarchy level requires at least one microcoded compute engine at the current hierarchy level. If such were not the case, there would be no sink and source buses for disposition of the lower level compute engine.
Now referring to the drawings, wherein like numerals refer to like matter throughout, and more particularly referring now to
This drawing emphasizes the extension of the B, C, E, and G buses outside of the UC16 block. (Note: no program memory is shown on
In
Now referring to
We create a simple computational hierarchy by hanging a “nanocore” off the B and C or E and G buses of one of our microcores, as shown in
A nanocore is treated by its controlling microcore as just another functional unit akin to a scratchpad memory or FIFO. For example, to read a value from scratchpad memory #0, add 5 to that value, and write the result to scratchpad memory #1, the following line of microcode might be used:
CNSTA READ0B Add WRT1C 5
Suppose now that what the microcore 100 views as memory 1 is actually the nanocore 610 of
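The effect of the example microinstruction can be sketched in software. The names mem0, mem1, and the read/write pointer registers follow the text; everything else here is illustrative, and the same model applies whether mem1 is a true scratchpad memory or a nanocore, since the microorders do not differentiate between the two.

```python
# Hedged sketch of "CNSTA READ0B Add WRT1C 5":
# read mem0 at the read address, add the constant 5, write mem1 at the write address.
def step(mem0, mem1, ra0, wa1, constant=5):
    value = mem0[ra0]          # READ0B: scratchpad 0 drives the B bus
    result = value + constant  # Add: the ALU sums the B bus and the constant
    mem1[wa1] = result         # WRT1C: the C bus result is written to scratchpad 1
    return result
```

With mem0 holding the value 7 at its read address, this step writes 12 to mem1.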
Using the scheme just described, compute engines may be nested to any depth. Now referring to
By virtue of being in the middle level of the hierarchy, the nanocore 610 transmits and receives data to and from both the microcore 100 and picocore 710. In communicating with the microcore, the nanocore performs some task on behalf of the microcore 100, so it receives data to process and then returns the result. Conversely, in communicating with the picocore 710, the nanocore 610 sends data for the picocore 710 to process and then consumes the result. In this basic hierarchical arrangement, the picocore 710 serves the nanocore 610 and the nanocore 610 serves the microcore 100.
Above it has been explained how to construct a simple flat array with mailboxes and a simple hierarchy with B/C bus interconnects. We now discuss various combinations of these processor arrangements.
Arrays of microcores are typically interconnected with mailboxes. This was discussed in the section above labeled SIMPLE ARRAY and is a straightforward concept. An important aspect of the present invention is the possibility of fully or partially interconnecting nanocores and/or picocores using the same type of mailboxes.
If we suppose that each microcore in a 2×2 array (such as that shown in

Every level of a hierarchy may be interconnected in this fashion, allowing us to have an arbitrary number of parallel m×n arrays of processing cores. Note also that not all hierarchical levels need have mailbox connections, nor do all cores within a given level need be interconnected. For example, given 3 levels of hierarchy, level 1 could be fully interconnected, level 2 not connected at all, and level 3 just have N/S connections.
Mailboxes of one hierarchy level may communicate with mailboxes at a different level of hierarchy. In
Mailbox connections may skip any number of hierarchy levels, as shown in
In summary, given arrays of processing cores nested to any level, mailboxes of any core may be connected to mailboxes of any other core, regardless of a given mailbox's hierarchy level.
Multiple processing cores may be embedded in a host core. This is illustrated by
The number of cores embedded in another core is limited only by the width of the host core's control store. As before, the embedded cores' mailboxes may be connected to other mailboxes at any array position and hierarchy level.
Previous sections have referred to embedded cores (nano, pico, etc.) as becoming less complex the deeper we traverse into the hierarchy. Architecturally, this is typical, but not required. That is, any processing core, no matter where embedded or what its array position, may be of arbitrary complexity. The sole requirement for a core is that it is microprogrammable and has source/sink buses (C/B) as described in Section MICROCORE.
1. Microcore array served by an embedded DMA nanocore array.
2. Microcore array with a communications security protocol enforced by an embedded nanocore array.
Germane to construction of the processing architecture discussed previously is the ability to substitute an active processing core for a host core's passive functional unit (typically a memory) without changing the microorders used to communicate with that unit. In other words, interface microorders view a functional unit as some form of memory whether or not that is its sole or intended purpose.
The interface logic architecture that enables this easy substitution is discussed in the following sections.
As discussed in Section MICROCORE, scratchpad memories may be attached to a microcore's external C/B and/or G/E buses. Communication to these memories is controlled via dedicated logic and read/write pointer registers on the address side of the microcore, as illustrated in
A typical microcore has up to 4 scratchpad memories referred to as mem0, mem1, mem2, and mem3 (represented by memX in
In
Having substituted a nanoengine for a memory block, it remains to define the rules of behavior for data exchange between the encompassing microcore and the nanoengine.
1. Writing one or more data words to the nanoengine should initiate some sort of nanoprocessing.
2. Processing results are accessible via the block's read port.
3. Different write addresses may be associated with different nano-algorithms.
4. If rule 2 is followed, one would expect to read the results from different read addresses.
5. There are several possibilities for synchronizing read access to the processing results:
Typically, the FIFO port opposite that attached to the microcore is connected to an external data source or sink, which affords the microcore a communication channel to outside resources.
Nanoengine for FIFO substitution is intuitively more satisfying than the memory block replacement discussed in Sections MEMORY INTERFACE LOGIC and MEMORY-BASED DATA EXCHANGE, since no addressing is involved and the FIFO interface necessarily provides “hold” logic (Scratchpad memories are usually synchronous RAMs with single-clock read and write access).
Now referring to
Now referring to
We start with empty r0 and r1 buffer registers, and TAKin inactive (meaning there's no attempt to read from the miniFIFO). If the microcore issues a write FIFO microorder (such as FIFOC in
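The miniFIFO's buffering behavior can be sketched as a two-deep queue, in which a write that finds both buffer registers occupied causes the writing microcore to hold. This is an illustrative model of the r0/r1 mechanism described above, not actual interface logic.

```python
# Sketch of a two-deep "miniFIFO": buffer registers r0 and r1,
# with a hold signaled when a write finds both registers occupied.
class MiniFifo:
    def __init__(self):
        self.regs = []          # holds at most two words (r0, r1)

    def write(self, word):
        """Returns True if the word is accepted, False if the writer must hold."""
        if len(self.regs) == 2:
            return False        # full: the writing microcore is held
        self.regs.append(word)
        return True

    def take(self):
        """TAKin-style read: returns a word, or None if empty (reader holds)."""
        return self.regs.pop(0) if self.regs else None
```

A third consecutive write with TAKin inactive is refused, modeling the hold condition.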
Now referring to
The preceding description of FIFOin and FIFOout shows how easy it is to view a nanoengine as a FIFO in which the empty and full flags serve as indicators of a processing engine's “state of completion.” Therefore, having substituted a nanoengine for a FIFO, the rules of behavior (listed below) for data exchange between the encompassing microcore and the nanoengine are simpler (and thus more gratifying) than those for memory block substitution.
1. Writing one or more data words to the nanoengine initiates nanoprocessing.
2. Processing results are accessible via the block's read port.
3. Read access to the processing results is automatically synchronized, since the FIFO “empty” flag may be used as a processing completion flag, and a microcore is automatically held if it attempts to read from an empty FIFO.
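The three rules above can be sketched as a nanoengine wrapped in a FIFO interface, where a write initiates processing and the empty flag doubles as a "not done" indicator. The class and names below are illustrative assumptions, not the patent's interface logic.

```python
# Hedged sketch: a nanoengine viewed as a FIFO, per the three rules above.
class NanoEngineFifo:
    def __init__(self, nano_algorithm):
        self.process = nano_algorithm
        self.results = []

    def write(self, word):            # rule 1: a write initiates nanoprocessing
        self.results.append(self.process(word))

    @property
    def empty(self):                  # rule 3: empty == processing not complete
        return not self.results

    def read(self):                   # rule 2: results come out the read port
        if self.empty:
            return None               # the microcore would be held here
        return self.results.pop(0)
```

A read attempted while the engine is empty models the automatic hold of the reading microcore.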
We note in passing that FIFO-interface-based data exchange may occur at the same or different hierarchy levels.
Mailboxes were discussed in the section on MICROENGINE HIERARCHY. Since a mailbox is simply a shallow in/out FIFO pairing, FIFO-related nanoengine interconnect and substitution approaches are equally applicable to mailboxes. For example, a nanoengine could be connected between the South_In port of one microcore and the North_Out port of another microcore, which would allow some kind of pre- or post-processing on the data moving between those two microcores. Furthermore, these interconnects can skip hierarchy levels, as discussed in the section on INTER-LEVEL MAILBOX CONNECTIONS.
Communication with a nanoengine via memory access microorders need not imply that there is an “addressable” memory living between the microcore and nanoengine. For example, the RAR and/or WAR for that memory could be ignored if FIFO-like behavior is preferred. Better yet, FIFO-like behavior could be inferred for some address ranges but not for others.
In this section we present a flexible synchronization method for our cooperating hierarchy of microcores.
Barriers are synchronization objects used to block and release two or more processing threads, and are a well-known signaling mechanism in the domain of high-level operating systems.
To use a barrier, one must create a barrier object and define which threads constitute a “quorum” for that barrier. Once a barrier is created, members of its quorum may “wait_for” it. If a thread waits on a barrier, the operating system blocks its execution until all member threads of that barrier's quorum have also performed a wait_for on the barrier. In other words, after the final thread does a wait_for, all threads of the quorum in question are released simultaneously by the operating system. Release of all quorum threads also “clears” the quorum, thus preparing the barrier object for another round of wait_fors by the same group of threads.
The following subsections describe how the barrier concept may be applied to a group of cooperating microcores.
For a microcoded environment, a barrier object comprises bit mapped “release” and “quorum” registers of arbitrary width. When a microcore waits for the barrier, its bit in the release register is set. The quorum register identifies the microcores that belong to the barrier's quorum, with a set bit identifying the corresponding microcore as a member. If the release register=the quorum register, all microcores have checked in and are immediately released. At the same time, the release register is reset.
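The release/quorum register mechanism just described can be sketched directly in software. The bit assignments below are illustrative; the behavior (set the caller's release bit, release and clear on equality with the quorum register) follows the text.

```python
# Sketch of a bit-mapped barrier object: one release bit per microcore;
# the quorum register selects members; equality releases all and clears.
class Barrier:
    def __init__(self, quorum_mask):
        self.quorum = quorum_mask    # set bits identify member microcores
        self.release = 0

    def wait_for(self, core_bit):
        """Set the caller's release bit; True means the quorum was released."""
        self.release |= core_bit
        if self.release == self.quorum:
            self.release = 0         # clearing prepares the next round of wait_fors
            return True              # all member microcores released simultaneously
        return False                 # caller remains blocked (held)
```

With a quorum of two microcores, the first wait_for blocks and the second releases both.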
In
A microcore performs a wait_for on a barrier by executing a microorder. Now referring to
Barriers may be nested by allowing a bit in a barrier's release register to be set by virtue of another barrier being released. If barriers are “nested” in this manner, logic must be provided to determine whether or not the “inner” barrier (that is, the barrier whose release sets an “outer” barrier's release bit) should be released immediately. The other choice is to hold the inner barrier until one or more of the outer barriers in the hierarchy are released.
Nesting of barriers suggests a powerful extension to the interpretation of release bits, namely, the possibility of associating the output of any combinational or sequential logic with one or more release bits. This is equivalent to equating a “quorum” with some arbitrary set of preconditions. Representative preconditions are listed below.
1. Barrier wait_for (the “original” intent)
2. FIFO full and empty flags
3. Timer alarms
4. Expired countdowns
5. A particular finite state machine event
6. Input discrete
Henceforth, “member” of a quorum may refer to a satisfied precondition as well as a microcore performing a wait_for.
Release bits may be mapped to preconditions at any level of a microcore hierarchy in a heterogeneous fashion. In other words, a single quorum can refer to a mix of preconditions from different hierarchy levels. Obviously, if this is done, the barrier in question must be global in scope relative to the hierarchy levels involved. It follows that a barrier's scope should be limited to the levels represented in the quorum. One implementation approach would be to build scoping into the barrier pointer addressing decode logic.
Waiting on a barrier is meant to be a “blocking” action, and the suggested way to do this in the microcore environment is with “hold” logic like that described in Section DATA FLOW. This implementation is straightforward, because waiting on a barrier that has not achieved quorum is like trying to read from an empty FIFO. For example, executing UBARRIER on a barrier without quorum will result in a Hold just like executing a FIFOB on an empty FIFO (doing a wait_for that results in a quorum will NOT hold the microcore in question).
Achieving barrier quorum is a release event and, as described previously, unblocks the quorum's microcores. Another release action would be to force quorum microcores to execute a particular sequence of microcode in response to achieving quorum.
In this approach, doing a wait_for would NOT block a microcore. Release bits would be set as before, but the quorum microcores would continue executing until quorum is achieved, at which point each microcore would be vectored to a jam microaddress associated with the barrier in question. The “jam” approach allows microcores to make progress on some other task while waiting for quorum conditions to be satisfied.
It is possible to mix blocks and jams in one barrier.
One way to construct a complex control scheme in a microcore hierarchy is to group a sequential progression of barrier objects into one data structure, as shown in
When a quorum is achieved and the microcores released, the barrier pointer is automatically advanced to the next barrier in the piano roll. In
Barrier quorums in a piano roll are typically, but need not be, different. Furthermore, the piano roll's scope must be as broad as the most global barrier quorum. As before, any mix of hierarchy layer and level is possible, as well as jam and hold protocols.
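The piano-roll structure can be sketched as a list of quorum masks stepped through by an automatically advancing barrier pointer. This is a minimal illustrative model under the assumptions above, not a definitive implementation.

```python
# Sketch of a "piano roll": a sequence of barrier quorum masks, with the
# barrier pointer advancing automatically whenever a quorum is achieved.
class PianoRoll:
    def __init__(self, quorum_masks):
        self.quorums = quorum_masks
        self.pointer = 0             # barrier pointer into the roll
        self.release = 0

    def wait_for(self, core_bit):
        self.release |= core_bit
        if self.release == self.quorums[self.pointer]:
            self.release = 0
            self.pointer = (self.pointer + 1) % len(self.quorums)  # advance
            return True
        return False
```

Each release clears the release register and moves the pointer to the next barrier in the roll.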
Section NANOENGINE FOR BIT PLANE CODER presents a nanoengine accelerator for JPEG2000 EBCOT Tier 1 processing. This section contains background material for that example.
Depending on options selected, there are several types of JPEG2000 data flow. For the purpose of this discussion, the data flow of
There are two parts to Tier 1 processing: bit plane coding (the pass/update functions) and arithmetic (MQ) coding 2222. The bit plane coder outputs a 6-bit context/decision pair based on image data and state information derived from previous bit processing. The arithmetic coder, in turn, outputs a stream of bytes based on the context and decision bits from the bit plane coder.
Due to the intensely bit-oriented nature of EBCOT processing, a naïve implementation on the MicroCore platform would be computationally prohibitive. Now referring to
Now referring to
Stripes are transferred one at a time from program memory to the sign and bit memories in the Tier 1 logic block. Once a stripe is resident in the Tier 1 logic block, it is processed one column at a time as detailed in
Now referring to
1. A considerable number of clock cycles are expended on both bit plane coding and MQ coding.
2. The context/decision pairs are processed by the MQ Coder in order and one at a time.
3. Generation of (CX,D) pairs is independent of the MQ Coder; i.e., any number of (CX,D) pairs can be processed before being MQ coded.
4. MQ coding may proceed in parallel with bit plane coding.
To take advantage of this parallel processing opportunity, we add a "NanoEngine" 2610 and "(CX,D)-FIFO" 2620 to the EBCOT logic block 2600. The NanoEngine 2610 generates (CX,D) pairs and the FIFO 2620 buffers the (CX,D) pairs for subsequent use by the MQ Coder.
As discussed in previous sections, the NanoEngine exists at a lower hierarchical level of the MicroCore, and is a miniaturized version of the microsequencer and micro control store from
In the meantime, the higher level microprogram performs MQ coding on the (CX,D) pairs being written to the FIFO by the NanoEngine. MQ coding progresses until the NanoEngine enters an idle state, at which point the MicroCore microprogram transfers another stripe to the EBCOT logic block and kicks off the NanoEngine once again.
The net effect of this microarchitecture is to parallelize MQ and bit plane coding so that the Bit Plane Coder does not have to wait for the MQ Coder after the generation of each (CX,D) pair. As shown in
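The decoupling of the two coders through the (CX,D)-FIFO can be sketched as a producer/consumer pair. The stand-in coding functions below are placeholders (real bit plane coding and MQ coding are far more involved); only the FIFO-mediated structure follows the text.

```python
from collections import deque

# Sketch: the NanoEngine fills a (CX, D) FIFO while the microprogram
# drains it for MQ coding; the coding functions are illustrative stand-ins.
def bit_plane_code(stripe):
    """Stand-in nanoprogram: emit one (context, decision) pair per input word."""
    return [(word % 8, word & 1) for word in stripe]

def mq_code(pairs_fifo):
    """Stand-in microprogram: consume (CX, D) pairs in order, one at a time."""
    coded = []
    while pairs_fifo:
        cx, d = pairs_fifo.popleft()
        coded.append((cx, d))        # real MQ coding would emit bytes here
    return coded

fifo = deque(bit_plane_code([3, 4, 5]))  # NanoEngine writes pairs to the FIFO
output = mq_code(fifo)                   # MQ coder drains the FIFO independently
```

Because the FIFO buffers any number of (CX,D) pairs, neither coder must wait for the other after each individual pair.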
We emphasize that bit plane coding is performed by the nanoprogram 2704, while MQ coding is performed by the microprogram 2702.
The following terms, when used in the claims, are hereafter defined to have their ascribed meanings:
Microorder shall mean: a singular control field of a microinstruction and shall specifically exclude the following: assembly language instructions of a microprocessor.
Microcode shall mean: a sequence of microinstructions and shall specifically exclude the following: any combination of assembly code instructions.
Microcoded compute engine shall mean: any compute engine which directly uses microinstructions to perform computational tasks.
Passive functional unit shall mean: memory or logic which does not perform computations.
Nested hierarchical levels shall mean: a compute engine coupled to source and sink buses of another compute engine.
Number | Name | Date | Kind |
---|---|---|---|
4050058 | Garlic | Sep 1977 | A |
4527237 | Frieder et al. | Jul 1985 | A |
6691206 | Rubinstein | Feb 2004 | B1 |
7376811 | Kizhepat | May 2008 | B2 |
20090228686 | Koenck | Sep 2009 | A1 |
20090228693 | Koenck et al. | Sep 2009 | A1 |
Entry |
---|
Sperry et al., "A Microprogrammed Signal Processor", Apr. 1981, IEEE International Conference on ICASSP'81, Acoustics, Speech, and Signal Processing, vol. 6, pp. 579-582. |
IEEE, "IEEE 100 The Authoritative Dictionary of IEEE Standards Terms", Feb. 2007, 7th Ed., pp. 450-451. |
IEEE Micro AAMP Article, "An Advanced-Architecture CMOS/SOS Microprocessor", by David W. Best, Charles E. Kress, Nick M. Mykris, Jeffrey D. Russell, and William J. Smith, published in Aug. 1982, pp. 11-26, IEEE MICRO. |