The disclosure relates generally to a processor and, more particularly, to a processing cluster.
Generally, system-on-a-chip designs (SoCs) are based on a combination of programmable processors (central processing units (CPUs), microcontrollers (MCUs), or digital signal processors (DSPs)), application-specific integrated circuit (ASIC) functions, and hardware peripherals and interfaces. Typically, processors implement software operating environments, user interfaces, user applications, and hardware-control functions (e.g., drivers). ASICs implement complex, high-level functionality such as baseband physical-layer processing, video encode/decode, etc. In theory, ASIC functionality (unlike physical-layer interfaces) can be implemented by a programmable processor; in practice, ASIC hardware is used for functionality that is generally beyond the capabilities of any actual processor implementation.
Compared to ASIC implementations, programmable processors provide a great deal of flexibility and development productivity, but with a large amount of implementation overhead. The advantages of processors, relative to ASICs are:
Not surprisingly, a motivation for ASICs (other than hardware interfaces or physical layers) is to overcome the weaknesses of processor-based solutions. However, ASIC-based designs also have weaknesses that mirror the advantages of processor-based designs. The advantages of ASICs, relative to processors are:
Parallel processing, though very simple in concept, is very difficult to use effectively. It is easy to draw analogies to real-world examples of parallelism, but computing does not share the same underlying characteristics, even though superficially it might appear to. There are many obstacles to executing programs in parallel, particularly on a large number of cores.
Turning to
However, when code sequences 120 and 122 are converted from serial program 102 to parallel program 104 so as to be executed on two processors, several issues arise. First, sequences 120 and 122 are controlled by two separate program counters so that if the sequences 120 and 122 are left “as is” there is generally no way to ensure that the value for variable x is valid on the attempted read in sequence 122. In fact, in the simplest case, assuming both code sequences 120 and 122 execute sequentially starting at the same time, the value for variable x is not defined in time, because there are many more instructions to the definition of variable x in sequence 120 than there are to the use of variable x in sequence 122. Second, the value for variable x cannot be transmitted through a register or local cache because, although code sequences 120 and 122 have a common view of the address for variable x, the local caches map these addresses to two, physically distinct memory locations. Third, although not shown directly in the
For at least these reasons, the serial program 102 should be extensively modified to achieve correct parallel execution. First, sequence 122 should wait until sequence 120 signals that variable x has been written, which causes code sequence 122 to incur delay 112. Delay 112 is generally a combination of the cycles that sequence 120 takes to write variable x and delay 110 (the cycles to generate and transmit the signal). This signal is usually a semaphore or similar mechanism using shared memory that incurs the delay of writing and reading shared memory along with delays incurred for exclusive access to the semaphore. The write of variable x in sequence 120 also is subject to a barrier in that sequence 122 cannot be enabled to read variable x until sequence 122 can obtain the correct value for variable x. Generally, there can be no ordering hazards between writing the value and signaling that it has been written, caused by buffering, caching, and so forth, which usually delays execution in sequence 120 some number of cycles (represented by delay 114) compared to writes of unshared data directly into a local cache.
Second, sequence 122 generally cannot read its local cache directly to obtain variable x because the write of variable x by sequence 120 would have caused an invalidation of the cache line containing variable x. Sequence 122 incurs additional delay 116 to obtain the correct value from level-2 (L2) cache for sequence 120 or from shared memory. Third, sequence 122 generally imposes additional delays (due in part to delay 118) on sequence 120 before any subsequent write by sequence 120 so that all reads in sequence 122 are complete before sequence 120 changes the value of variable x. This not only can stall the progress of sequence 120 but can also delay the new value of variable x such that sequence 122 has to wait again for the new value. Because of the number of cycles that sequence 122 spends obtaining the value for variable x, sequence 120 could potentially be ahead in subsequent iterations even though it was behind in the first iteration, but synchronization between sequences 120 and 122 tends to serialize both programs so there is little, if any, overlap.
The operations used to synchronize and ensure exclusive access to shared variables normally are not safe to implement directly in application code because of the hazards that can be introduced (e.g., timing-dependent deadlock). Thus, these operations are usually implemented by system calls, which causes delays due to procedure call and return and, possibly, context switching. The net effect is that a simple operation in sequential code (i.e., serial program 102) can be transformed into a much more complex set of operations in the “parallel” code (i.e., parallel program 104), and have a much longer execution time. The result is that parallel programming is limited to applications that do not incur significant overhead for parallel execution. This implies that: 1) there is essentially no data interaction between programs (e.g., web servers); 2) the amount of data shared is a small portion of the datasets used in computing (e.g., finite-element analysis); or 3) the number of computing cycles is very large in proportion to the amount of data shared (e.g., graphics).
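The handshake described above, in which one sequence writes variable x and then signals while the other waits for the signal before reading, can be sketched in C++ as follows. This is a minimal illustration only, assuming two host threads and a std::atomic flag standing in for the semaphore; the names shared_x, x_ready, producer_sequence, consumer_sequence, and run_handoff are hypothetical and do not appear in the figures.

```cpp
#include <atomic>
#include <thread>

// Hypothetical stand-ins for variable x and the "x has been written" signal.
static int shared_x = 0;
static std::atomic<bool> x_ready{false};

// Plays the role of sequence 120: define x, then signal.
void producer_sequence() {
    shared_x = 42;  // write of the shared variable
    // Release ordering keeps the write of x from being reordered after the signal.
    x_ready.store(true, std::memory_order_release);
}

// Plays the role of sequence 122: wait for the signal, then use x.
int consumer_sequence() {
    // Acquire ordering pairs with the release store; the spin loop is the wait.
    while (!x_ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return shared_x;
}

// Runs both sequences on separate threads and returns the value the consumer read.
int run_handoff() {
    shared_x = 0;
    x_ready.store(false, std::memory_order_relaxed);
    int observed = -1;
    std::thread consumer([&observed] { observed = consumer_sequence(); });
    std::thread producer(producer_sequence);
    producer.join();
    consumer.join();
    return observed;
}
```

Even in this small sketch, the consumer does no useful work while it spins, and the release/acquire pair constrains how the producer's write may be buffered, which corresponds loosely to the waiting and ordering delays discussed above.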
Even if the overhead of parallel execution is small enough to make it worthwhile, overhead can significantly limit the benefit. This is especially true for parallel execution on more than two cores. This limitation is captured in a simplified equation for the effect, known as Amdahl's Law, which compares the performance of single-core execution to that of multiple-core execution. According to Amdahl's Law, a certain percentage of single-core execution cannot feasibly be executed in parallel because the overhead is too high. Namely, the overhead incurred is the sum of the percentage of time spent without parallel execution and the percentage of time spent for synchronization and communication.
Turning to
Further limiting the applicability of parallel processing is the cost of multiple cores. In
Turning to
There is another dimension to the difficulty of parallel computing; namely, the question of how the potential parallelism in an application is expressed by a programmer. Programming languages are inherently serial and text-based. Transforming a serial language into a large number of parallel processes is a well-studied problem that has yielded very little in actual results.
Turning to
As shown, this example illustrates the use of several directives, which are embedded in the text following the headers (“#pragma omp”). Specifically, these directives include loops 506 and 508 and function 510, and each of loops 506 and 508 respectively employs functions 512 and 514. This source code 502 is shown as a parallel implementation 504 and is executed on four threads over four processors. Since these threads are created by serial operating-system code, the threads are not generally created at exactly the same time, and this lack of overlap increases the overall execution time. Also, the input and result vectors are shared. Reading the input vectors generally can require synchronization periods 516-1, 516-3, 516-5, and 516-7 to ensure there are no writers updating the data (a relatively short operation). Writing the results in write periods 518, 520, 522, 524, 526, 528, 530, and 532 can require synchronization periods 516-2, 516-4, 516-6, and 516-8 because only one thread can be updating the result vectors at any given time (even though in this case the vector elements being updated are independent, serializing writes is a general operation that applies to shared variables). After another synchronization and communication period 516-9, 516-10, 516-11, and 516-12, the threads obtain multiple copies of the result vectors and compute function 510.
As shown, there can be significant overhead to parallel execution and a lack of parallel overlap, which is why parallel execution is made conditional on the vector length. It might be uncommon for the compiler to choose to implement the code in parallel, as a function of the system and the average vector length. However, when the code is implemented in parallel, there are a couple of subtle issues related to the way the code is written. To improve efficiency, the programmer should recognize that the expression for function 510 can be executed by multiple threads and obtain the same value and should explicitly declare function 510 as a private variable even though the expression that assigns function 510 contains only shared variables. Declaring function 510 as shared would result in four threads serializing to perform the same, lengthy computation to update the shared variable function 510 with the same value. This serialization time is on the order of four times the amount of time taken to complete the earlier, parallel vector adds, making it impossible to benefit from parallel execution and making vector length the wrong criterion for implementing the code in parallel since this serialization time is directly proportional to vector length. Furthermore, whether or not function 510 can be private is a function of the expression that assigns the value. For example, assume that function 510 is later changed to include a shared variable “offset” as follows:
(1) scale=sum(a,0,n)+sum(z,0,n)+offset++;
In this case, function 510 should be declared as shared, but that declaration alone is insufficient. This change implies that the code should not be allowed to execute in parallel because of serialization overhead. Code development and maintenance not only includes the target functionality, but also how changes in the functionality affect and interact with the surrounding parallelism constructs.
There is another issue with the code 502 in this example, namely, an error introduced for the purpose of illustration. The loop termination variable n is declared as private, which is correct because variable n is effectively a constant in each thread. However, private variables are not initialized by default, so variable n should be declared as shared so that the compiler initializes the value for all threads. This example works well when the compiler chooses a serial implementation but fails for a parallel one. Since this code 502 is conditionally parallel, the error is not easy to test for.
This example is a very simple error because it will usually fail, assuming that the code can be forced to execute in parallel (depending on how uninitialized variables are handled). However, there are an almost infinite number of synchronization and communication errors that can be introduced with OpenMP directives (this example is a communication error), and many of these can result in intermittent failures depending on the relative timing and performance of the parallel code, as well as the execution order chosen by the scheduler.
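Because source code 502 is not reproduced here, the following is only a sketch of the kind of conditionally parallel vector add under discussion, not the code from the figure. The function name add_vectors and the threshold kParallelThreshold are hypothetical. The sketch uses the OpenMP firstprivate clause, which gives each thread a private copy of n initialized from the original value; this is one alternative to declaring n as shared, and it avoids the uninitialized-private error described above.

```cpp
#include <vector>

// Hypothetical vector length below which the loop is left serial.
constexpr int kParallelThreshold = 1000;

// Element-wise sum of two vectors, parallel only when the vectors are long enough.
std::vector<double> add_vectors(const std::vector<double>& a,
                                const std::vector<double>& z) {
    // Use the shorter length so that neither input is read out of range.
    int n = static_cast<int>(a.size() < z.size() ? a.size() : z.size());
    std::vector<double> result(n);

    // private(n) here would leave n uninitialized in every thread;
    // firstprivate(n) copies the value computed above into each thread.
    #pragma omp parallel for if(n > kParallelThreshold) firstprivate(n)
    for (int i = 0; i < n; ++i) {
        result[i] = a[i] + z[i];  // each iteration writes a distinct element
    }
    return result;
}
```

When compiled without OpenMP support the directive is ignored and the loop runs serially, which mirrors the observation that such an error stays hidden whenever the serial implementation is chosen.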
Thus, there is a need for an improved processing cluster and associated tool chain.
An embodiment of the present disclosure, accordingly, provides a method. The method comprises receiving source code, wherein the source code includes an algorithm module that encapsulates an algorithm kernel within a class declaration; traversing the source code with a system programming tool to generate hosted application code from the source code for a hosted environment; allocating compute and memory resources of a processor based at least in part on the source code with the system programming tool, wherein the processor includes a plurality of processing nodes and a processing core; generating node application code for a processing environment based at least in part on the allocated compute and memory resources of the processor with the system programming tool; and creating a data structure in the processor based at least in part on the allocated compute and memory resources with the system programming tool.
An embodiment of the present disclosure, accordingly, provides an apparatus. The apparatus comprises address leads; data leads; a host processor coupled to the address leads and the data leads; memory circuits coupled to the address leads and the data leads; and processing cluster circuitry coupled to the address leads and the data leads, the processing cluster circuitry including: control node circuitry having address inputs coupled to the address leads, data inputs coupled to the data leads, and serial messaging leads; and parallel processing circuitry coupled to the serial messaging leads.
An embodiment of the present disclosure, accordingly, provides an apparatus. The apparatus comprises address leads; data leads; a host processor coupled to the address leads and the data leads; memory circuits coupled to the address leads and the data leads; and processing cluster circuitry coupled to the address leads and the data leads, the processing cluster circuitry including: global load store circuitry having external data inputs and outputs coupled to the data leads, and node data leads; and parallel processing circuitry coupled to the node data leads.
An embodiment of the present disclosure, accordingly, provides an apparatus. The apparatus comprises address leads; data leads; a host processor coupled to the address leads and the data leads; memory circuits coupled to the address leads and the data leads; and processing cluster circuitry coupled to the address leads and the data leads, the processing cluster circuitry including: shared function memory circuitry having data inputs and outputs coupled to the data leads; and parallel processing circuitry coupled to the data leads.
An embodiment of the present disclosure, accordingly, provides an apparatus. The apparatus comprises address leads; data leads; a host processor coupled to the address leads and the data leads; memory circuits coupled to the address leads and the data leads; and processing cluster circuitry coupled to the address leads and the data leads, the processing cluster circuitry including node circuitry having parallel processing circuitry coupled to the data leads.
An embodiment of the present disclosure, accordingly, provides an apparatus. The apparatus comprises address leads; data leads; a host processor coupled to the address leads and the data leads; memory circuits coupled to the address leads and the data leads; and processing cluster circuitry coupled to the address leads and the data leads, the processing cluster circuitry including first circuitry, second circuitry, and third circuitry coupled to the data leads, serial messaging leads connected between the first circuitry, the second circuitry, and the third circuitry, and the first, second, and third circuitry each including messaging circuitry for sending and receiving messages.
An embodiment of the present disclosure, accordingly, provides an apparatus. The apparatus comprises address leads; data leads; a host processor coupled to the address leads and the data leads; memory circuits coupled to the address leads and the data leads; and processing cluster circuitry coupled to the address leads and the data leads, the processing cluster circuitry including reduced instruction set computing (RISC) processor circuitry for executing program instructions in a first context and a second context and the RISC processor circuitry executing an instruction to shift from the first context to the second context in one cycle.
The foregoing has outlined rather broadly the features and technical advantages of the present disclosure in order that the detailed description of the disclosure that follows may be better understood. Additional features and advantages of the disclosure will be described hereinafter which form the subject of the claims of the disclosure. It should be appreciated by those skilled in the art that the conception and the specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present disclosure. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the disclosure as set forth in the appended claims.
For a more complete understanding of the present disclosure, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
Refer now to the drawings in which depicted elements are, for the sake of clarity, not necessarily shown to scale and in which like or similar elements are designated by the same reference numeral through the several views.
Turning to
However, the source code for the serial program 601 is structured for autogeneration. When structured for autogeneration, an iterate-over-read thread module 624 is generated to perform system reads for parallel module 626 (which is comprised of parallel iterations of serial module 610), and the outputs from parallel module 626 are provided to parallel module 630 (which is generally comprised of parallel iterations of the serial modules 612 and 618). This parallel module 630 can then use parallel modules 628 and 630 (which are generally comprised of parallel iterations of serial module 616) to generate outputs for read thread 620.
With the parallel implementation 603, there are several desirable features. First, data dependencies are generally resolved by hardware. Second, there are no objects; instead standalone programs with “global” variables in private contexts are employed. Third, programs can communicate using hardware pointers and symbolic linkage of “externs” in source programs. Fourth, there is variable allocation of computing resources, and sources can be merged (e.g. modules 612 and 618) for efficiency.
In order to implement such a parallel processing environment, a new architecture is generally desired. In
In
Preferably, dataflow for hardware 722 is designed to minimize the cost of data communication and synchronization. Input variables to a parallel program can be assigned directly by a program executing on another core. Synchronization operates such that an access of a variable implies both that the data is valid, and that it has been written only once, in order, by the most recent writer. The synchronization and communication operations require no delay. This is accomplished using a context-management state, which can introduce interlocks for correctness. However, dataflow is normally overlapped with execution and managed so that these stalls rarely, if ever, occur. Furthermore, techniques of system 700 generally minimize the hardware costs of parallelism by enabling nearly unlimited processor customization, to maximize the number of operations sustained per cycle, and by reducing the cost of programming abstractions—both high-level language (HLL) and operating system (OS) abstractions—to zero.
One limitation on processor customization is that the resulting implementation should remain an efficient target of an HLL (e.g., C++) optimizing compiler, which is generally incorporated into compiler 706. The benefits typically associated with binary compatibility are obtained by having cores remain source-code compatible within a particular set of applications, as well as designing them to be efficient targets of a compiler (i.e., compiler 706). The benefits of generality are obtained by permitting any number of cores to have any desired features. A specific implementation has only the required subset of features, but across all implementations, any general set of features is possible. This can include unusual data types that are not normally associated with general-purpose processors.
Data and control flow are performed off “critical” paths of the operations used by the application software. This uses superscalar techniques at the node level, and uses multi-tasking, dataflow techniques, and messaging at the system level. Superscalar techniques permit loads, stores, and branches to be performed in parallel with the operational data path, with no cycle overhead. Procedure calls are not required for the target applications, and the programming model supports extensive in-lining even though applications are written in a modular form. Loads and stores from/to system memory and peripherals are performed by a separate, multi-threaded processor. This enables reading program inputs, and writing outputs, with no cycle overhead. The microarchitecture of nodes 808-1 to 808-N also supports fine-grained multi-tasking over multiple contexts with 0-cycle context switch time. OS-like abstractions, for scheduling, synchronization, memory management, and so forth are performed directly in hardware by messages, context descriptors, and sequencing structures.
Additionally, processing flow diagrams are normally developed as part of application development, whether programmed or implemented by an ASIC. Typically, however, these diagrams are used to describe the functionality of the software, the hardware, the software processes interacting in a host environment, or some combination thereof. In any case, the diagrams describe and document the operation of the hardware and/or software. System 700, instead, directly implements specifications, without requiring users to see the underlying details. This also maintains a direct correspondence between the graphical representation and the implementation, in that nodes and arcs in the diagram have corresponding programs (or hardware functions) and dataflow in the implementation. This provides a large benefit to verification and debug.
Typically, “parallelism” refers to performing multiple operations at the same time. All useful applications perform a very large number of operations, but mainstream programming languages (such as C++) express these operations using a sequential model of execution. A given program statement is “executed” before the next, at least in appearance. Furthermore, even applications that are implemented by multiple “threads” (separately executed binaries) are forced by an OS to conform to an execution model of time-multiplexing on a single processor, with a shared memory that is visible to all threads and which can be used for communication—this fundamentally imposes some amount of serialization and resource contention on the implementation.
To achieve a high level of parallelism, it should be possible to overlap any operations expressed by the original application program or programs, regardless of where in the HLL source operations appear. The only useful measure of overlap counts only the operations that matter to the end result of the application, not those that are required for flow control, abstractions, or to achieve correctness in a parallel system. The correct measure of parallelism effectiveness is throughput—the number of results produced per unit time—not utilization, or the relative amount of time that resources are kept busy doing something.
Ideally, the degree of overlap should be determined only by two fundamental factors: data dependencies and resources. Data dependencies capture the constraint that operations cannot have correct results unless they have correct inputs, and that no operation can be performed in zero time. Resources capture the constraint of cost—that it's not possible, in general, to provide enough hardware to execute all operations in parallel, so hardware such as functional units, registers, processors, and memories should be re-used. Ideally, the solution should permit the maximum amount of overlap permitted by a given resource allocation and a given degree of data interaction between operations. Parallel operations can be derived from any scope within an application, from small regions of code to the entire set of programs that implement the application. In rough terms, these correspond to the concepts of fine-, medium-, and coarse-grained parallelism.
“Instruction parallelism” generally refers to the overlapped execution of operations performed by instructions from a small region of a program. These instruction sequences are short, generally not more than a few tens of instructions. Moreover, an instruction normally executes in a small number of cycles, usually a single cycle. Finally, the operations are highly dependent, with at least one input of every operation, on average, depending on a previous operation within the region. As a result, executing instructions in parallel can require very high-bandwidth, low-latency data communication between operations: on the order of the number of parallel operations times the number of operands per operation, communicated in a single cycle via registers or direct forwarding. This data bandwidth makes it very expensive to execute a large number of instructions in parallel using this technique, which is the reason its scope is limited to a small region of the program.
Supporting a high degree of processor customization, to enable efficient multi-core systems, can reduce the effectiveness, or even feasibility, of compiler code generation. For a feature of the processor to be useful, the compiler 706 should be able to recognize a mapping from source code to the instruction set, to emit instructions using the feature. Furthermore, to the degree allowed by the processor resources, the compiler 706 should be able to generate code that has a high execution rate, or the number of desired operations per cycle.
Nodes 808-1 to 808-N are generally the basic target template for compiler 706 for code generation. Typically, these nodes 808-1 to 808-N (which are discussed in greater detail below) include two processing units, arranged in a superscalar organization: a general-purpose, 32-bit reduced instruction set computing (RISC) processor; and a specialized operational data path customized for the application. An example of this RISC processor is described below. The RISC processor is typically the primary target for compiler 706 but normally performs a very small portion of the application because it has the inefficiencies of any general-purpose processor. Its main purpose is to generally ensure correct operation regardless of source code (though not necessarily efficient in cycle count), to perform flow control (if any), and to maintain context desired by the operational data path.
Most of the customization for the application is in the operational data path. This has a dedicated operand data memory, with a variable number of read and write ports (accomplished using a variable number of banks), with loads to and stores from a register file with a variable number of registers. The data path has a number of functional units, in a very long instruction word (VLIW) organization—up to an operation per functional unit per cycle. The operational data path is completely overlapped with the RISC processor execution and operand-memory loads and stores. Operations are executed at an upper limit of the rate permitted by data dependencies and the number of functional units.
The instruction packet for a node 808-1 to 808-N generally comprises a RISC processor instruction, a variable number of load/store instructions for the operand memory, and a variable number of instructions for the functional units in the data path (generally one per functional unit). The compiler 706 schedules these instructions using techniques similar to those used for an in-order superscalar or VLIW microarchitecture. This can be based on any form of source code, but, in general, coding guidelines are used to assist the compiler in generating efficient code. For example, conditional branches should be used sparingly or not at all, procedures should be in-line, and so on. Also, intrinsics are used for operations that cannot be mapped well from standard source code.
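The composition of such an instruction packet can be pictured with a small C++ data structure. This is a sketch only: the text does not give instruction encodings, so every type, field, and function name below (RiscInstruction, LoadStoreInstruction, FunctionalUnitOp, InstructionPacket, fits_node) is a hypothetical placeholder, and the port and unit counts are parameters because they vary per node.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical encodings; the actual instruction formats are not specified.
struct RiscInstruction { std::uint32_t word; };
struct LoadStoreInstruction { bool is_store; std::uint32_t address; std::uint8_t reg; };
struct FunctionalUnitOp { std::uint8_t unit; std::uint32_t word; };

// One packet: one RISC instruction plus variable numbers of operand-memory
// loads/stores and functional-unit operations.
struct InstructionPacket {
    RiscInstruction risc;
    std::vector<LoadStoreInstruction> load_stores;
    std::vector<FunctionalUnitOp> unit_ops;
};

// Returns true if the packet respects a node's resources: no more loads/stores
// than operand-memory ports, and at most one operation per functional unit.
bool fits_node(const InstructionPacket& packet,
               std::size_t memory_ports,
               std::size_t functional_units) {
    if (packet.load_stores.size() > memory_ports) return false;
    std::vector<bool> unit_busy(functional_units, false);
    for (const FunctionalUnitOp& op : packet.unit_ops) {
        if (op.unit >= functional_units) return false;  // no such unit
        if (unit_busy[op.unit]) return false;           // unit already has an op
        unit_busy[op.unit] = true;
    }
    return true;
}
```

A scheduler in the style described above would, under these assumptions, only emit packets for which fits_node returns true for the target node's configuration.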
There is also another dimension of instruction parallelism. It is possible to replicate the operational data path in a single instruction, multiple data (SIMD) organization, if appropriate to the application, to support a higher number of operations per cycle. This dimension is generally hidden from the compiler 706 and is not usually expressed directly in the source code, allowing the hardware 722 to be sized for the application.
“Thread parallelism” generally refers to the overlapped execution of operations in a relatively large span of instructions. The term “thread” refers to sequential execution of these instructions, where parallelism is accomplished by overlapping multiples of these instruction sequences. This is a broad classification, because it includes entire programs executed in parallel, code at different levels of program abstraction (applications, libraries, run-time calls, OS, etc.), or code from different procedures within the same level of abstraction. These all share the characteristic that only moderate data bandwidth is required between parallel operations (i.e., for function parameters or to communicate through shared data structures). However, thread parallelism is very difficult to characterize for the purposes of data-dependency analysis and resource allocation, and this introduces a lot of variation and uncertainty in the benefits of thread parallelism.
Thread parallelism is typically the most difficult type of parallelism to use effectively. The basic problem is that the term “thread” means nothing more than a sequence of instructions, and threads have no other, generalized characteristics in common with other threads. Typically, a thread can be of any length, but there is little advantage to parallel execution unless the parallel threads have roughly the same execution times. For example, overlapping a thread that executes in a million cycles with one that executes in a thousand cycles is generally pointless because there is a 0.1% benefit assuming perfect overlap and no interaction or interference.
Additionally, threads can have any type of dependency relationship, from very frequent access to shared, global variables, to no interaction at all. Threads also can imply exclusion, as when one thread calls another as a procedure, which implies that the caller does not resume execution until the callee is complete. Furthermore, there is not necessarily anything in the thread itself to describe these dependencies. The dependencies should be detected by the threads' address sequences, or the threads should perform explicit operations such as using lock mechanisms to generally provide correct ordering and dependency resolution.
Finally, a thread can be any sequence of any instructions, and all instructions have resource dependencies of some sort, often at several levels in the system such as caches and shared memories. It is impossible, in general, to schedule thread overlap so there is no resource contention. For example, sharing a cache between two threads increases the conflict misses in the cache, which has an effect similar to reducing the size of the cache for a single thread by a factor of four, so what is overlapped consists of a much higher percentage of cache reload time due both to higher conflict misses and to an increased reload time resulting from higher demand on system memory. This is one of the reasons that “utilization” is a poor measure of the effectiveness of overlapped execution, as opposed to throughput. Overlapped stalls increase utilization but do nothing for throughput, which is what users care about.
System 700, however, uses a specific form of “thread” parallelism, which is based on objects, that avoids these difficulties, as illustrated in
Objects serve as a basic unit for scheduling overlapped execution because each object module (i.e., 904, 906, and 908) can be characterized by execution time and resource utilization. Objects implement specific functionality, instead of control flow, and execution time can be determined from parameters such as buffer size and/or the degree of loop iteration. As a result, objects (i.e., 904, 906, and 908) can be scheduled onto available resources with a high degree of control over the effectiveness of overlapped execution.
Objects also typically have well-defined data dependencies given directly by the pointers to input data structures of other objects. Inputs are typically read-only. Outputs are typically write-only, and general read/write access is generally only allowed to variables contained within the objects (i.e., 904, 906, and 908). This provides a very well-structured mechanism for dependency analysis. It has benefits to parallelism similar to those of functional languages (where functional languages can communicate through procedure parameters and results) and closures (where closures are similar to functional languages except that a closure can have local state that is persistent from one call to the next, whereas in functional languages local variables are lost at the end of a procedure). However, there are advantages to using objects for this purpose instead of parameter-passing to functions, namely
“Data Parallelism” generally refers to the overlapped execution of operations which have very few (or no) data dependencies, or which have data dependencies that are very well structured and easy to characterize. To the degree that data communication is required at all, performance is normally sensitive only to data bandwidth, not latency. As a side effect, the overlapped operations are typically well balanced in terms of execution time and resource requirements. This category is sometimes referred to as “embarrassingly parallel.” Typically, there are four types of data parallelism that can be employed: client-server, partitioned-data, pipelined, and streaming.
In client-server systems, computing and memory resources are shared for generally unrelated applications for multiple clients (a client can be a user, a terminal, another computing system, etc.). There are few data dependencies between client applications, and resources can be provided to minimize resource conflicts. The client applications typically require different execution times, but all clients together can present a roughly constant load to the system that, combined with OS scheduling, permits efficient use of parallelism.
In partitioned-data systems, computing operates on large, fixed-size datasets that are mostly contained in private memory. Data can be shared between partitions, but this sharing is well structured (for example, leftmost and rightmost columns of arrays in adjacent datasets), and is a small portion of the total data involved in the computation. Computing is naturally overlapped, since all compute nodes perform the same operations on the same amount of data.
In pipelined systems, there is a large amount of data sharing between computations, but the application can be divided into long phases that operate on large amounts of data and that are independent of each other for the duration of the phase. At the end of a phase, data is passed to the next phase. This can be accomplished either by copying data directly, by exchanging pointers to the data, or by leaving the data in place and swapping to the program for the next phase to operate on the data. Overlap is accomplished by designing the phases, and the resource allocation, so that each phase can require approximately the same execution time.
In streaming systems, there is a large amount of data sharing between computations, but the application can be divided into short phases that operate on small amounts of input data. Data dependencies are satisfied by overlapping data transmission with execution, usually with a small amount of buffering between phases. Overlap is accomplished by matching each phase to the overall requirements of end-to-end throughput.
The framework of system 700 generally encompasses all of these levels of parallel execution, enabling them to be utilized in any combination to increase throughput for a given application (the suitability of a particular granularity depends on the application). This uses a structured, uniform set of techniques for rapid development, characterization, robustness, and re-use.
Turning now to
Even though this example in
The dependency mechanism generally ensures that destination objects do not execute until all input data is valid and that sources do not over-write input data until it is no longer desired. In system 700, this mechanism is implemented by the dataflow protocol. This protocol operates in the background, overlapped with execution, and normally adds no cycles to parallel operation. It depends on compiler support to indicate: 1) the point in execution in which a source has provided all output data, so that destinations can begin execution; and 2) the point in execution where a destination no longer can require input data, so it can be over-written by sources. Since programs generally behave such that inputs are consumed early in execution, and outputs are provided late, this permits the maximum amount of overlap between sources and destinations—destinations are consuming previous inputs while sources are computing new inputs.
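The two ordering rules of the dependency mechanism can be sketched in software as a single-entry channel between one source and one destination. This is only an illustrative model under stated assumptions: the type ChannelSketch and its members are hypothetical, and the actual dataflow protocol operates in the background with compiler support rather than through explicit calls such as these.

```cpp
#include <cassert>

// Hypothetical single-entry channel between one source and one destination.
struct ChannelSketch {
    bool input_valid = false;  // set once the source has provided all output
    int data = 0;

    // Source side: writing is allowed only after the destination has
    // released the previous input, so live input is never over-written.
    bool try_write(int value) {
        if (input_valid) return false;
        data = value;
        input_valid = true;   // rule 1: all output data has been provided
        return true;
    }

    // Destination side: execution may begin only when the input is valid.
    bool try_read(int& value) {
        if (!input_valid) return false;
        value = data;
        input_valid = false;  // rule 2: input is no longer required
        return true;
    }
};
```

In this model a failed call simply returns false; the point of the sketch is the ordering of the two signalling points, not the mechanism used to wait.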
The dataflow protocol results in a fully general streaming model for data parallelism. There is no restriction on the types of, or the total size of, transferred data. Streaming is based on variables declared in source code (i.e., C++), which can include any user-defined type. This allows execution modules to be executed in parallel, for example modules 1004 and 1006, and also allows overall system throughput to be limited by the block that has the longest latency between successive outputs (the longest cycle time from one iteration to the next). With one exception, this permits the mapping of any data-parallel style onto a system 700.
An exception to mapping data-parallel systems arises in partitioned-data parallelism as shown in
As already mentioned, data parallelism is not effective unless the overlapped threads have roughly the same execution time. This problem is overcome in system 700 using static scheduling to balance execution time within throughput requirements (assuming there are sufficient resources). This scheduling increases the throughput of long threads (with the same effect as reducing execution time) by replicating objects and partitioning data, and increases the effective execution time of short threads by having them share computing resources—either multi-tasking on a shared compute node, or by physically combining source code into a single thread.
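As a rough illustration of the balancing step, the number of replicated object instances for a long thread can be estimated from its execution time and the interval at which results are desired. The function name replicas_needed is hypothetical, and the sketch assumes that data partitions evenly across replicas and that sufficient resources exist.

```cpp
#include <cassert>
#include <cstdint>

// Number of replicated object instances needed so that an object taking
// exec_cycles per iteration can deliver one result every target_cycles.
// target_cycles must be nonzero.
std::uint32_t replicas_needed(std::uint32_t exec_cycles,
                              std::uint32_t target_cycles) {
    return (exec_cycles + target_cycles - 1) / target_cycles;  // ceiling division
}
```

A thread that already meets the target returns 1, which corresponds to the case in which short threads are candidates for sharing a compute node instead of being replicated.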
An example of an application for an SOC that performs parallel processing can be seen in
There are a variety of processing operations that can be performed by the SOC 1300 (as employed in imaging device 1250). In
In
Turning to
In
Multi-cast threads are also possible. Multi-cast threads are generally any combination of the above types, with the limitation that the same source data is sent to all destinations. If the source data is not homogeneous for all destinations, then the multiple-output capability of the destination descriptors is used instead, and output-instruction identifiers are used to distinguish destinations. Destination descriptors can have mixed types of destinations, including nodes, hardware accelerators, write threads, and multi-cast threads.
Processing cluster 1400 generally uses a “push” model for data transfers. The transfers generally appear as posted writes, rather than request-response types of accesses. This has the benefit of reducing occupation on global interconnect (i.e., data interconnect 814) by a factor of two compared to request-response accesses because data transfer is one-way. There is generally no desire to route a request through the interconnect 814, followed by routing the response to the requestor, resulting in two transitions over the interconnect 814. The push model generates a single transfer. This is important for scalability because network latency increases as network size increases, and this invariably reduces the performance of request-response transactions.
The push model, along with the dataflow protocol (i.e., 812-1 to 812-N), generally minimizes global data traffic to that used for correctness, while also generally minimizing the effect of global dataflow on local node utilization. There is normally little to no impact on node (i.e., 808-i) performance even with a large amount of global traffic. Sources write data into global output buffers (discussed below) and continue without requiring an acknowledgement of transfer success. The dataflow protocol (i.e., 812-1 to 812-N) generally ensures that the transfer succeeds on the first attempt to move data to the destination, with a single transfer over interconnect 814. The global output buffers (which are discussed below) can hold up to 16 outputs (for example), making it very unlikely that a node (i.e., 808-i) stalls because of insufficient instantaneous global bandwidth for output. Furthermore, the instantaneous bandwidth is not impacted by request-response transactions or replaying of unsuccessful transfers.
Finally, the push model more closely matches the programming model, namely programs do not “fetch” their own data. Instead, their input variables and/or parameters are written before being invoked. In the programming environment, initialization of input variables appears as writes into memory by the source program. In the processing cluster 1400, these writes are converted into posted writes that populate the values of variables in node contexts.
The global input buffers (which are discussed below) are used to receive data from source nodes. Since the data memory for each node 808-1 to 808-N is single-ported, the write of input data might conflict with a read by the local SIMD. This contention is avoided by accepting input data into the global input buffer, where it can wait for an open data memory cycle (that is, there is no bank conflict with the SIMD access). The data memory can have 32 banks (for example), so it is very likely that the buffer is freed quickly. However, the node (i.e., 808-i) should have a free buffer entry because there is no handshaking to acknowledge the transfer. If desired, the global input buffer can stall the local node (i.e., 808-i) and force a write into the data memory to free a buffer location, but this event should be extremely rare. Typically, the global input buffer is implemented as two separate random access memories (RAMs), so that one can be in a state to write global data while the other is in a state to be read into the data memory. The messaging interconnect is separate from the global data interconnect but also uses a push model.
At the system level, nodes 808-1 to 808-N are replicated in processing cluster 1400 analogous to SMP or symmetric multi-processing with the number of nodes scaled to the desired throughput. The processing cluster 1400 can scale to a very large number of nodes. Nodes 808-1 to 808-N are grouped into partitions 1402-1 to 1402-R, with each having one or more nodes. Partitions 1402-1 to 1402-R assist scalability by increasing local communication between nodes, and by allowing larger programs to compute larger amounts of output data, making it more likely to meet desired throughput requirements. Within a partition (i.e., 1402-i), nodes communicate using local interconnect, and do not require global resources. The nodes within a partition (i.e., 1402-i) also can share instruction memory (i.e., 1404-i), with any granularity: from each node using an exclusive instruction memory to all nodes using common instruction memory. For example, three nodes can share three banks of instruction memory, with a fourth node having an exclusive bank of instruction memory. When nodes share instruction memory (i.e., 1404-i), the nodes generally execute the same program synchronously.
The processing cluster 1400 also can support a very large number of nodes (i.e., 808-i) and partitions (i.e., 1402-i). The number of nodes per partition, however, is usually limited to 4 because having more than 4 nodes per partition generally resembles a non-uniform memory access (NUMA) architecture. In this case, partitions are connected through one (or more) crossbars (which are described below with respect to interconnect 814) that have a generally constant cross-sectional bandwidth. Processing cluster 1400 is currently architected to transfer one node's width of data (for example, 64, 16-bit pixels) every cycle, segmented into 4 transfers of 16 pixels per cycle over 4 cycles. The processing cluster 1400 is generally latency-tolerant, and node buffering generally prevents node stalls even when the interconnect 814 is nearly saturated (note that this condition is very difficult to achieve except by synthetic programs).
Typically, processing cluster 1400 includes global resources that are shared between partitions:
Because nodes 808-1 to 808-N can be targeted to scan-line-based, pixel-processing applications, the architecture of the node processors 4322 (described below) can have many features that address this type of processing. These include features that are very unconventional, for the purpose of retaining and processing large portions of a scan-line.
In
As shown in this example, each processing stage operates on a region of the image. For a given computed pixel, the input data is a set of pixels in the neighborhood of that pixel's position. For example, the right-most Gb pixel result from the 2D noise filter is computed using the 5×5 region of input pixels surrounding that pixel's location. The input dataset for each pixel is unique to that pixel, but there is a large amount of re-use of input data between neighboring pixels, in both the horizontal and vertical directions. In the horizontal direction, this re-use implies sharing data between the memories used to store the data, in both left and right directions. In the vertical direction, this re-use implies retaining the content of memories over large spans of execution.
In this example, 28 pixels are output using a total of 780 input pixels (2.5×312), with a large amount of re-use of input data, arguing strongly for retaining most of this context between iterations. In a steady state, 39 pixels of input are required to generate 28 pixels of output, or, stated another way, output is not valid in 11 pixel positions with respect to the input, after just two processing stages. This invalid output is recovered by recomputing the output using a slightly different set of input data, offset so that the re-computed output data is contiguous with the output of the first computed output data. This second pass provides additional output, but can require additional cycles, and, overall, the computation is around 72% efficient in this example.
This inefficiency directly affects pixel throughput, because invalid outputs create the desire for additional computing passes. The inefficiency is inversely related to the width of the input dataset, because the number of invalid output pixels depends on the algorithms and not on the width. In this example, tripling the output width to 84 pixels (input width 95 pixels) increases efficiency from 72% to about 88% (over 2× reduction in inefficiency, from 28% to about 12%). Thus, efficient use of resources is directly related to the width of the image that these resources are processing. The hardware should be capable of storing wide regions of the image, with nearly unrestricted sharing of pixel contexts both in the horizontal and vertical directions within these regions.
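The steady-state efficiency discussed in this example can be expressed as the ratio of valid output positions to input positions, where the number of invalid positions (11 after two processing stages here) is fixed by the algorithms. The following is a minimal sketch of that arithmetic; the function name pass_efficiency is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Fraction of input positions that yield valid output in one pass.
// invalid_positions is fixed by the algorithms and independent of width.
double pass_efficiency(int output_width, int invalid_positions) {
    return static_cast<double>(output_width) /
           static_cast<double>(output_width + invalid_positions);
}
```

Because the invalid count stays constant, the ratio approaches one as the output width grows, which is the reason wide regions are favored.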
“Top-level programming” refers to a program that describes the operation of an entire use-case at the system level, including input from memory 1416 and/or peripherals 1414. Namely, top-level programming generally defines a general input/output topology of algorithm modules, possibly including intermediate system memory buffers and hardware accelerators, and output to memory 1416 and/or peripherals 1414.
A very simple, conceptual example, for a memory-to-memory operation using a single algorithm module is shown in
In this example, the top-level program source code 1502 generally corresponds to flow graph 1504. As shown, code 1502 includes an outer FOR loop that iterates over an image in the vertical direction, reading from de-interleaved system frame buffers (R[i], Gr[i], Gb[i], B[i]) and writing algorithm module inputs. The inputs are four circular buffers in the algorithm object's input structure, containing the red (R), green near red (Gr), green near blue (Gb), and blue (B) pixels for the iteration. Circular buffers are used to retain state in the vertical direction from one invocation to the next, using a fixed amount of statically-allocated memory. Circular addressing is expressed explicitly in this example, but nodes (i.e., 808-i) directly support circular addressing, without the modulus function, for example. After the algorithm inputs are written, the algorithm kernel is called through the procedure “run” defined for the algorithm class. This kernel iterates single-pixel operations, for all input pixels, in the horizontal direction. This horizontal iteration is part of the implementation of the “Line” class. Multiple instances of the class (not relevant to this example) can be used to distinguish their contexts. Execution of the algorithm writes algorithm outputs into the input structure of the write thread (Wr_Thread_input). In this case, the input to the write thread is a single circular buffer (Pixel_Out). After completion of the algorithm, the write thread copies the new line from its input buffer to an output frame buffer in memory (G_Out[i]).
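The explicit circular addressing mentioned above can be sketched as follows. This is not code 1502 itself; the names PixelLine and CircularLines and the constants kLinesSketch and kWidthSketch are hypothetical, and the modulus is written out only because a general host environment lacks the direct circular addressing that the nodes provide.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLinesSketch = 3;  // hypothetical circular-buffer depth
constexpr std::size_t kWidthSketch = 4;  // hypothetical line width in pixels

using PixelLine = std::array<int, kWidthSketch>;

// Fixed, statically sized storage that retains the most recent lines.
struct CircularLines {
    std::array<PixelLine, kLinesSketch> slot{};

    // The line for vertical index i is stored in slot (i mod depth).
    void write(std::size_t i, const PixelLine& line) {
        slot[i % kLinesSketch] = line;
    }
    const PixelLine& read(std::size_t i) const {
        return slot[i % kLinesSketch];
    }
};
```

With a depth of three, writing line 4 replaces line 1 while lines 2 and 3 remain available, which is how vertical state is retained from one invocation to the next in a fixed amount of memory.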
Turning to
Looking now to
A foundation for the programming abstractions of system 700, object-based thread parallelism, and resource allocation is the algorithm module 1802, which is shown in
Turning to
The kernel 1808 is written as a standalone procedure and can include other procedures to implement the algorithm. However, these other procedures are not intended to be called from outside the kernel 1808, which is called through the procedure “simple_ISP3.” The keyword SUBROUTINE is defined (using the #define directive elsewhere in the source code) depending on whether the source-code compilation is targeted to a host. For this example, SUBROUTINE is defined as “static inline.” The compiler 706 can expand these procedures in-line for pixel processing when the architecture (i.e., processing cluster 1400) may not provide for procedure calls, due to cost in cycles and hardware (memory). In other host environments, the keyword SUBROUTINE is blank and has no effect on compilation. The included file “simple_ISP_def.h” is also described below.
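One way the SUBROUTINE keyword might be set up is sketched below. The build flag TARGET_HAS_CALLS and the procedures clamp_pixel_sketch and kernel_sketch are hypothetical and are used only to show a kernel composed of helper procedures; the actual definitions and the condition that selects between them are not reproduced here.

```cpp
#include <cassert>

// TARGET_HAS_CALLS is a hypothetical build flag, not taken from the source.
#ifdef TARGET_HAS_CALLS
#define SUBROUTINE                /* blank: no effect on compilation */
#else
#define SUBROUTINE static inline  /* allow expansion in-line */
#endif

// Helper procedure that is meant to be used only from within the kernel.
SUBROUTINE int clamp_pixel_sketch(int v) {
    return v < 0 ? 0 : (v > 1023 ? 1023 : v);
}

// Stand-in for a kernel entry procedure built from helper procedures.
SUBROUTINE int kernel_sketch(int a, int b) {
    return clamp_pixel_sketch(a + b);
}
```

The same source text compiles in either configuration, which is the property the keyword is intended to provide.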
Intrinsics are used to provide direct access to pixel-specific data types and supported operations. For example, the data type “uPair” is an unsigned pair of 16-bit pixels packed into 32 bits, and the intrinsic “_pcmv” is a conditional move of this packed structure to a destination structure based on a specific condition tested for each pixel. These intrinsics enable the compiler 706 to directly emit the appropriate instructions, instead of having to recognize the use from generalized source code matching complex machine descriptions for the operations. This generally can require that the programmer learn the specialized data types and operations, but hides all other details such as register allocation, scheduling, and parallelism. General C++ integer operations can also be supported, using 16-bit short and 32-bit long integers.
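For development outside the target, a packed pair and a per-pixel conditional move could be emulated along the following lines. The names UPairSketch and pcmv_sketch are hypothetical stand-ins, and the assumed semantics (one condition per 16-bit pixel, with the destination kept where the condition is false) are an illustration rather than the definition of the “_pcmv” intrinsic.

```cpp
#include <cassert>
#include <cstdint>

// Host-side stand-in: two unsigned 16-bit pixels packed into 32 bits.
struct UPairSketch {
    std::uint32_t bits;

    static UPairSketch pack(std::uint16_t lo, std::uint16_t hi) {
        return {static_cast<std::uint32_t>(lo) |
                (static_cast<std::uint32_t>(hi) << 16)};
    }
    std::uint16_t lo() const { return static_cast<std::uint16_t>(bits & 0xFFFFu); }
    std::uint16_t hi() const { return static_cast<std::uint16_t>(bits >> 16); }
};

// Assumed semantics: each pixel of dst is replaced by the matching pixel of
// src only when the condition tested for that pixel is true.
UPairSketch pcmv_sketch(UPairSketch dst, UPairSketch src,
                        bool cond_lo, bool cond_hi) {
    return UPairSketch::pack(cond_lo ? src.lo() : dst.lo(),
                             cond_hi ? src.hi() : dst.hi());
}
```

An emulation of this kind lets the same algorithm source be exercised on a host, while the compiler for the target would emit the corresponding instruction directly.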
An advantage of this programming style is that the programmer does not deal with: (1) the parallelism provided by the SIMD data paths; (2) the multi-tasking across multiple contexts for efficient execution in the presence of dependencies on a horizontal scan line (for image processing); or (3) the mechanics of parallel execution across multiple nodes (i.e., 808-i). Furthermore, the programs (which are generally written in C++) can be used in any general development environment, with full functional equivalence. The application code can be used in an outside environment for development and testing, with little knowledge of the specifics of system 700 and without requiring the use of simulators. This code also can be used in a SystemC model to achieve cycle-approximate behavior without underlying processor models.
Inputs to algorithm modules are defined as structures—declared using the “struct” keyword—containing all the input variables for the module. Inputs are not generally passed as procedure parameters because this implies that there is a single source for inputs (the caller). To map to ASIC-style data flows, there should be a provision for multiple source modules to provide input to a given destination, which implies that object inputs are independent public variables that can be written independently. However, these variables are not declared independently, but instead are placed in an input data structure. This is to avoid naming conflicts, as described below.
The input and output data structures for the application are defined by the programmer in a global file (global for the application) that contains the structure declarations. An example of an input/output (IO) structure 2000, which shows the definitions of these structures for the “simple_ISP” example image pipeline, can be seen in
An API generally documents a set of uniquely-named procedures whose parameter names are not necessarily unique because the parameters appear within the scope of the uniquely-named procedure. As discussed above, algorithm modules (i.e., 1802) cannot generally use procedure-call interfaces, but structures provide a similar scoping mechanism. Structures allow inputs to have the scope of public variables but encapsulate the names of member variables within the structure, similar to procedure declarations encapsulating parameter names. This is generally not an issue in the hosted environment because the public variables (i.e., 1804) are also encapsulated in an object instance that has a unique name. Instead, as explained below, this is an issue related to potential name conflicts because system programming tool 718 removes the object encapsulation in order to provide an opportunity to generally optimize the resource allocation. The programming abstractions provided by objects are preserved, but the implementation allows algorithm code to share memory usage with other, possibly unrelated, code. This results in public variables having the scope of global variables, and this introduces the requirement for public variables (i.e., 1804) to have globally-unique names between object instances. This is accomplished by placing these variables into a structure variable that has a globally unique name. It should also be noted that using structures to avoid name conflicts in this way does not generally have all the benefits of procedure parameters. A source of data has to use the name of the structure member, whereas a procedure parameter can pass a variable of any name, as long as it has a compatible type.
Nodes 808-1 to 808-N also have two different destination memories: the processor data memory and the SIMD data memory (both of which are discussed in detail below). The processor data memory generally contains conventional data types, such as “short” and “int” (named in the environment as “shortS” and “intS” to denote abstract, scalar data memory data in nodes 808-1 to 808-N, which is generally used to distinguish this data from other conventional data types and to associate the data with a unique context identifier). There can also be a special 32-bit (for example) data type called “Circ” that is used to control the addressing of circular buffers (which is discussed in detail below). SIMD data memory generally contains what can be considered either vectors of pixels (“Line”), using image processing as an example, or words containing two signed or unsigned values (“Pair” and “uPair”). Scalar and vector inputs have to be declared in two separate structures because the associated memories are addressed independently, and structure members are allocated in contiguous addresses.
To autogenerate source code for a use-case, it is strongly preferred that system programming tool 718 can instantiate instances of objects, and form associations between object outputs and inputs, without knowing the underlying class variables, member functions, and datatypes. It is cumbersome to maintain this information in system programming tool 718 because any change in the underlying implementation by the programmer should generally be reflected in system programming tool 718. This is avoided using naming conventions in the source code, for public variables, functions, and types that are used for autogeneration. Other, internal variables and so on can be named by the programmer.
Turning to
Both input and output types are defined by the same naming convention, appending the algorithm name with “_INS” for scalar input to processor data memory, “_INV” for vector input to SIMD data memory, and “_OUT” for output. If a module has multiple inputs (which can vary by use-case), input variables—different members of the input structure—can be set independently by source objects.
If a module has multiple output types, each is defined separately, appending the algorithm name with “_OUT0,” “_OUT1,” and so forth, as shown in the IO data type module 2200 of
Turning now to
Typically, the processor data memory input associated with the algorithm contains configuration variables, of any general type—with the exception of the “Circ” type to control the addressing of circular buffers in the SIMD data memory (which is described below). This input data structure follows a naming convention, appending the algorithm name with “_inputS” to indicate the scalar input structure to processor data memory. The SIMD data memory input is a specified type, for example “Line” variables in the “simple_ISP3_input” structure (type “ycc”). This input data structure follows a similar naming convention, appending the algorithm name with “_inputV” to indicate the vector input structure to SIMD data memory. Additionally, the processor data memory context is associated with the entire vector of input pixels, whatever width is configured. Here, this width can span multiple physical contexts, possibly in multiple nodes 808-1 to 808-N. For example, each associated processor data memory context contains a copy of the same scalar data, even though the vector data is different (since it is logically different elements of the same vector). The GLS unit 1408 provides these copies of scalar parameters and maintains the state of “Circ” variables. The programming model provides a mechanism for software to signal the hardware to distinguish different types of data. Any given scalar or vector variable is placed at the same address offsets in all contexts, in the associated data memory.
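The naming conventions for input and output types and for the input structures can be illustrated with a small sketch. Only the suffixes (“_INS”, “_INV”, “_OUT”, “_inputS”, and “_inputV”) follow the conventions described; the member names, the width kIoWidthSketch, and the procedure simple_ISP3_sketch are hypothetical and do not reproduce the actual declarations.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kIoWidthSketch = 4;  // hypothetical vector width

// Scalar input type, destined for processor data memory.
struct simple_ISP3_INS { std::int16_t gain; };
// Vector input type, destined for SIMD data memory.
struct simple_ISP3_INV { std::int16_t y[kIoWidthSketch]; };
// Output type.
struct simple_ISP3_OUT { std::int16_t y[kIoWidthSketch]; };

// Public inputs live in structure variables with globally unique names, so
// member names such as "gain" cannot collide with those of other modules.
simple_ISP3_INS simple_ISP3_inputS;
simple_ISP3_INV simple_ISP3_inputV;

// Hypothetical kernel stand-in: reads the two input structures and writes
// through an output pointer.
void simple_ISP3_sketch(simple_ISP3_OUT* out) {
    for (int i = 0; i < kIoWidthSketch; ++i) {
        out->y[i] = static_cast<std::int16_t>(simple_ISP3_inputV.y[i] *
                                              simple_ISP3_inputS.gain);
    }
}
```

Because scalar and vector members sit in separate structures, each structure can be allocated contiguously in its own memory, and any source can set an individual member by naming the structure and the member.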
Turning to
Turning now to
The file “simple_ISP3_input.h”, for example, is included as declaration 2618 to define the public input variables of the object. This is a somewhat unusual place to include a header file, but it provides a convenient way to define inputs in multiple environments using a single source file. Otherwise, additional maintenance would be required to keep multiple copies of these declarations consistent between the multiple environments. A public function 2620 is declared, named “run”, that is used to invoke the algorithm instance. This hides the details of the calling sequence to the algorithm kernel (i.e., 1808), in this case the number of output pointers that are passed to the kernel (i.e., 1808). The calls “_set_simd_size(simd_size)” and “_set_ctx_id(ctx_id)”, for example, define the width of “Line” variables and uniquely identify the SIMD data memory variable contexts for the object instance. These are used during the execution of the algorithm kernel (i.e., 1808) for this instance. Finally, the algorithm kernel “simple_ISP3.cpp” or 1808 is included as member function 2622. This is also somewhat unconventional, including a “.cpp” file in a header file instead of vice versa, but is done for reasons already described, namely to permit common, consistent source code between multiple environments.
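The shape of such a class can be sketched as follows, with the included files replaced by in-line members so that the example is self-contained. All names here (AlgoSketch, set_simd_size_sketch, set_ctx_id_sketch, and the globals) are hypothetical stand-ins, and the kernel body is a placeholder rather than the contents of “simple_ISP3.cpp”.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the calls that set line width and context id.
static int g_simd_size = 0;
static int g_ctx_id = 0;
void set_simd_size_sketch(int n) { g_simd_size = n; }
void set_ctx_id_sketch(int id) { g_ctx_id = id; }

class AlgoSketch {
public:
    AlgoSketch(int simd_size, int ctx_id)
        : simd_size_(simd_size), ctx_id_(ctx_id) {}

    std::vector<int> input;              // public input variable
    std::vector<int>* output = nullptr;  // points at the destination's input

    // Public entry point: hides the calling sequence of the kernel.
    void run() {
        set_simd_size_sketch(simd_size_);
        set_ctx_id_sketch(ctx_id_);
        kernel(output);
    }

private:
    // Placeholder kernel: one single-pixel operation per input position.
    void kernel(std::vector<int>* out) {
        out->assign(static_cast<std::size_t>(g_simd_size), 0);
        for (std::size_t i = 0; i < out->size() && i < input.size(); ++i) {
            (*out)[i] = input[i] + 1;
        }
    }

    int simd_size_;
    int ctx_id_;
};
```

A caller sets the public input, points the output at a destination, and invokes run without knowing how many output pointers the kernel takes.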
4.2. Autogeneration from Source Code in a Hosted Environment
In
As shown, the algorithm class and instance declarations 1702 and 1704 are generally straightforward cases. The first section (class declarations) includes the files that declare the algorithm object classes for each component on the use-case diagram (i.e., 1000), using the naming conventions of the respective classes to locate the included files. The second section (instance declarations) declares pointers to instances of these objects, using the instance names of the components. The code 2702 in this example also shows the inclusion of the file 2600, which is “simple_ISP_def.h” that defines constant values. This file is normally (but not necessarily) included in algorithm kernel code 1808. It is included here for completeness, and the file “simple_ISP_def.h” includes a “#ifndef” pre-processor directive to generally ensure that the file is included only once. This is a conventional programming practice, and many pre-processor directives have been omitted from these examples for clarity.
The initialization section 1706 includes the initialization code for each programmable node. The included files are named by the corresponding components in the use-case diagram (i.e., 1000 and described below). Programmable nodes are typically initialized in the following order: iterators→read threads→write threads. These are passed parameters, similar to function calls, to control their behavior. Programmable nodes do not generally support a procedure-call interface; instead, initialization is accomplished by writing into the respective object's scalar input data structure, similar to other input data.
In this example, most of the variables set during initialization are based on variables and values determined by the programmer. An exception is the circular-buffer state. This state is set by a call to “_init_circ”. The parameters passed to “_init_circ”, in the order shown, are:
(1) a pointer to the “circ_s” structure for this buffer;
(2) the initial pointer into the buffer, which depends on “delay_offset” and the buffer size;
(3) the size of the buffer in number of entries;
(4) the size of an entry in number of elements;
(5) “delay_offset”, which determines how many iterations are required before the buffer generates valid outputs;
(6) a bit to protect against invalid output (initialized to 1); and
(7) the offset from the top boundary for the first data received (initialized to 0).
This approach permits both the programmer and system programming tool 718 to determine buffer parameters, and to populate the “c_s” array so that the read thread can manage all circular buffers in the use-case, as a part of data transfer, based on frame parameters. It also permits multiple buffers within the same algorithm class to have independent settings depending on the use-case.
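The seven parameters listed above can be illustrated with a hypothetical state structure and initialization routine. The names CircStateSketch and init_circ_sketch and the field layout are assumptions made for this sketch; the actual “_init_circ” routine and the encoding of the circular-buffer state are not reproduced, and the initial pointer is simply passed in by the caller.

```cpp
#include <cassert>

// Hypothetical circular-buffer addressing state for one buffer.
struct CircStateSketch {
    int pointer;         // current pointer into the buffer
    int size;            // number of entries
    int entry_elems;     // elements per entry
    int delay_offset;    // iterations before outputs become valid
    bool invalid_guard;  // protects against invalid output
    int top_offset;      // offset from the top boundary of first data
};

// Parameters appear in the same order as the list in the text.
void init_circ_sketch(CircStateSketch* state, int initial_pointer, int size,
                      int entry_elems, int delay_offset, int invalid_guard,
                      int top_offset) {
    state->pointer = initial_pointer;
    state->size = size;
    state->entry_elems = entry_elems;
    state->delay_offset = delay_offset;
    state->invalid_guard = (invalid_guard != 0);
    state->top_offset = top_offset;
}
```

Keeping one such record per buffer in an array is what would allow a read thread to manage every circular buffer of a use-case uniformly.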
The traverse function 1708 is generally the inner loop of the iterator 602, created by code autogeneration. Typically, it updates circular-buffer addressing states for the iteration, and then calls each algorithm instance in an order that satisfies data dependencies. Here, the traverse function 1708 is shown for “simple_ISP”. This function 1708 is passed four parameters:
(1) an index (idx) indicating the vertical scan line for the iteration;
(2) the height of the frame division;
(3) the number of circular buffers in the use-case (“circ_no”); and
(4) the array of circular-buffer addressing state for the use-case, “c_s”.
Before calling the algorithm instances, traverse function 1708 calls the function “_set_circ” for each element in the “c_s” array, passing the height and scan-line number (for example). The “_set_circ” function updates the values of all “Circ” variables in all instances, based on this information, and also updates the state of array entries for the next iteration. After the circular-buffer addressing state has been set, traverse function 1708 calls the execution member functions (“run”) in each algorithm instance. The read thread (i.e., 904) is passed a parameter (i.e., the index into the current scan-line).
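The two phases of the traverse function described above can be sketched as shown below. This is an illustration under stated assumptions: `CircEntrySketch`, `set_circ_sketch`, and `traverse_sketch` are hypothetical names, the modulo update stands in for the unspecified behavior of “_set_circ”, and every stage is passed the scan-line index for simplicity.

```cpp
#include <functional>
#include <vector>

// Hypothetical per-buffer addressing state (one element of the "c_s" array).
struct CircEntrySketch {
    int index = 0;        // current entry within the buffer
    int num_entries = 3;  // buffer size in entries
};

// Stand-in for "_set_circ". The actual update rule is not given in the
// text; simple modulo stepping is assumed here.
void set_circ_sketch(CircEntrySketch& c, int height, int idx) {
    (void)height;  // a fuller model would use the height for boundary handling
    c.index = idx % c.num_entries;
}

// Phase 1: update the addressing state of every circular buffer.
// Phase 2: call each stage's "run" in an order that satisfies data dependencies.
void traverse_sketch(int idx, int height, int circ_no,
                     std::vector<CircEntrySketch>& c_s,
                     const std::vector<std::function<void(int)>>& run_in_order) {
    for (int i = 0; i < circ_no; ++i) {
        set_circ_sketch(c_s[i], height, idx);
    }
    for (const auto& run : run_in_order) {
        run(idx);
    }
}
```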
The hosted-program function 1710 is called by a user-supplied testbench (or other routine) to execute the use-case on an entire frame (or frame division) of user-supplied data. This can be used to verify the use-case and to determine quality metrics for algorithms. As shown in this example, the hosted-function 1710 is used for “simple_ISP”. This function 1710 is passed two parameters indicating the “height” and width (“simd_size”) of the frame, for example. The function 1710 is also passed a variable number of parameters that are pointers to instances of the “Frame” class, which describe system-memory buffers or other peripheral input. The first set of parameters is for the read thread(s) (i.e., 904), and the second is for the write thread(s) (i.e., 908). The number of parameters in each set depends on the input and output data formats, including information such as whether or not system data is interleaved. In this example, the input format is interleaved Bayer, and the output is de-interleaved YCbCr. Parameters are declared in the order of their declarations in the respective threads. The corresponding system data is provided in data structures provided by the user in the surrounding testbench, with pointers passed to the hosted function.
Hosted-program function 1710 also includes creation of object instances 1712. The first statement in this example is a call to the function “_set_simd_size”, which defines the width of the SIMD contexts (normally, an entire scan-line). This is used by “Frame” and “Line” objects to determine the degree of iteration within the objects (in the horizontal direction). This is followed by an instantiation of the read thread (i.e., 906). This thread is constructed with parameters indicating the height and width of the frame. Here, the width is expressed as “simd_size”, and the third parameter is used in frame-division processing. It might appear that the iterator (i.e., 602) has to know the height, since iteration is over all scan-lines. However, the number of iterations is generally somewhat higher than the number of scan-lines, to take into account the delays caused by dependent circular buffers. The total number of iterations is sufficient to fill all buffers and provide all valid outputs. However, the read thread (i.e., 904) should not iterate beyond the bottom of the frame, so it should know the height in order to conditionally disable the system access. Following this, there is a series of paired statements, where the first sets a unique value for the context identifier of the object that is about to be instantiated and where the second instantiates the object. The context identifier is used in the implementation of the “Line” class to differentiate the contexts of different SIMD instantiations. A unique identifier is associated with all “Line” variables that are created as part of an object instance. The read thread (i.e., 904) does not generally desire a context identifier because it reads directly from the system to the context(s) of other objects. The write thread (i.e., 908) does generally desire a context identifier because it has the equivalent of a buffer to store outputs from the use-case before they are stored into the system.
After the algorithm objects have been instantiated, their output pointers can be set according to the use-case diagram 1714. This relies on all objects consistently naming the output pointers. It also relies on the algorithm modules defining type names for input structures according to the class name, rather than a meaningful name for the underlying type (the meaningful name can still be used in algorithm coding). Otherwise, the association of component outputs to inputs directly follows the connectivity in the use-case graph (i.e., 1000).
Additionally, the hosted-program function 1710 includes the object initialization section 1716 for the “simple_ISP” use-case, for example. The first statement creates the array of “circ_s” values, one array element per circular buffer, and initializes the elements (this array is local to the hosted function, and passed to other functions as desired). The initialization values relevant here are the pointers to the “Circ” variables in the object instances. These pointers are used during execution to update the circular-addressing state in the instances. Following this, the initialization function provided (and named by) the programmer is called for each instance. The initialization functions are passed:
(1) a pointer to the scalar input structure of the instance;
(2) a pointer to the “c_struct” array entry for the corresponding circular buffer; and
(3) the relative “delay_offset” of the instance.
An initiation 1718 of an instance of the iterator “frame_loop” can be seen. This initiation 1718 uses the name from the use-case diagram. The constructor for this instance sets the height of the frame, a parameter indicating the number of circular buffers (four buffers in this case), and a pointer to the “c_struct” array. This array is not used directly by the iterator (i.e., 602), but is passed to the traverse function 1708, along with the number of circular buffers. The number of circular buffers is also used to increase the number of iterations; for example, four buffers would require three additional iterations to generate all valid outputs. The read and write thread (i.e., 904 and 908, respectively) are constructed with the height of the frame, so the correct amount of system data is read and written despite the additional iterations. The remaining statements create a pointer to the traverse function 1708 and call the iterator (i.e., 602) with this pointer. The pointer is used to call traverse function 1708 within the main body of the iterator (i.e., 602).
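The relationship between frame height, the number of circular buffers, and the iteration count can be sketched as follows. The names `total_iterations`, `system_access_enabled`, and `frame_loop_sketch` are hypothetical, and the formula of one extra iteration per buffer beyond the first is an assumption generalized from the single example in the text (four buffers, three additional iterations).

```cpp
// Pointer type for the traverse function; the opaque c_s pointer stands in
// for the array of circular-buffer addressing state.
using TraverseFn = void (*)(int idx, int height, int circ_no, void* c_s);

// One iteration per scan-line, plus (circ_no - 1) extra iterations so that
// the last dependent buffer can produce all of its valid outputs (assumed).
int total_iterations(int height, int circ_no) {
    return height + (circ_no > 0 ? circ_no - 1 : 0);
}

// The read thread knows the frame height so that it can disable system
// access once the iteration index passes the bottom of the frame.
bool system_access_enabled(int idx, int height) {
    return idx < height;
}

// Outer loop of the iterator: calls the traverse function through the
// pointer once per iteration.
void frame_loop_sketch(int height, int circ_no, void* c_s, TraverseFn traverse) {
    const int n = total_iterations(height, circ_no);
    for (int idx = 0; idx < n; ++idx) {
        traverse(idx, height, circ_no, c_s);
    }
}
```

For a frame of 8 scan-lines and four circular buffers, this model performs 11 iterations, and the read thread's system access is disabled for the last three.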
Finally, the hosted-program function 1710 includes a delete object instances function 1720. This function 1720 simply de-allocates the object instances and frees the memory associated with them, preventing memory leaks for repeated calls to the hosted function.
As can be seen in
A read thread 904 or write thread 908 is specified by thread name, the class name, and the input or output format. The thread name is used as the name of the instance of the given class in the source code, and the input or output format is used to configure the GLS unit 1408 to convert the system data format (for example, interleaved pixels) into the de-interleaved formats required by SIMD nodes (i.e., 808-i). Messaging supports passing a general set of parameters to a read thread 904 or write thread 908. In most cases, the thread class determines basic characteristics such as buffer addressing patterns, and the instances are passed parameters to define things such as frame size, system address pointers, system pixel formats, and any other relevant information for the thread 904 or 908. These parameters are specified as input parameters to the thread's member function and are passed to the thread by the host processor based on application-level information. Multiple instances of multiple thread classes can be used for different addressing patterns, system data types, and so forth.
An iterator 602 is generally defined by iterator name and class name. As with read threads 904 and write threads 908, the iterator 602 can be passed parameters, specified in the iterator's function declaration. These parameters are also passed by the host processor based on application information. An iterator 602 can be logically considered an “outer loop” surrounding an instance of a read thread 904. In hardware, other execution is data-driven by the read thread 904, so the iterator 602 effectively is the “outer loop” for all other instances that are dependent on the read thread—either directly or indirectly, including write threads 908. There is typically one iterator 602 per read thread 904. Different read threads 904 can be controlled by different instances of the same iterator class, or by instances of different iterator classes, as long as the iterators 602 are compatible in terms of causing the read threads 904 to provide data used by the use-case.
An algorithm-module instance (i.e., 1802), associated with a programmable node module 2902, is specified by module instance name, the class name, and the name of the initialization header file. These names are used to locate source files, instantiate objects, to form pointers to inputs for source objects, and to initialize object instances. These all rely on the naming conventions described above. Each algorithm class has associated meta-data, shown in the
Accelerators (from 1418) are identified by accelerator name in accelerator module 2904. The system programming tool 718 cannot allocate these resources, but can create the desired hardware configuration for dataflow into and out of any accelerators. It is assumed that the accelerators can support the throughput.
Multi-cast modules 290 permit any object's outputs to be routed to multiple destinations. There is generally no associated software; a multi-cast module provides connectivity information to system programming tool 718 for setting up multi-cast threads in the GLS unit 1408. Multi-cast threads can be used in particular use-cases, so that an algorithm can be completely independent of various dataflow scenarios. Multi-cast threads also can be inserted temporarily into a use-case, for example so that an output can be “probed” by multi-casting to a write thread 908, where it can be inspected in memory 1416, as well as to the destination required by the use-case.
Turning to
Here, diagram 3000 shows two types each of data and control flow. Explicit dataflow is represented by solid arrows. Implicit or user-defined dataflow, including passing parameters and initialization, is represented by dashed arrows. Direct control flow, determined by the iterator 602, is represented by the arrow marked “Direct Iteration (outer loop).” Implied control flow, determined by data-driven execution, is represented by dashed arrows. Internal data and control flow, from stage 3006 output to 3012 input, is accomplished by the node programming flow (as described below). All other data and control flow is accomplished by the global LS threads.
Additionally, the source code that is converted to autogenerated source code (i.e., 2702) by system programming tool 718 is generally free-form C++ code, including procedure calls and objects. The overhead in cycle count is usually acceptable because iterations typically result in the movement of a very large amount of data relative to the number of cycles spent in the iteration. For example, consider a read thread (i.e., 904) that moves interleaved Bayer data into three node contexts. In each context, this data is represented as four lines of 64 pixels each, one line each for R, Gr, B, and Gb. Across the three contexts, this is twelve 64-pixel lines total, or 768 pixels. Assuming that all 16 threads are active and presenting roughly equivalent execution demand (this is very rarely the case), and a throughput of one pixel per cycle (a likely upper limit), each iteration of a thread can use 768/16=48 cycles. Setting up the Bayer transfer can require on the order of six instructions (three each for R-Gr and Gb-B), so there are 42 cycles remaining in this extreme case for loop overhead, state maintenance, and so forth.
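The cycle budget in this example can be reproduced with a short calculation. The function names below are hypothetical, and the sketch assumes, as the example does, one cycle per setup instruction and an even split of cycles across active threads.

```cpp
// Pixels moved per iteration: contexts x lines per context x pixels per line.
constexpr int pixels_per_iteration(int contexts, int lines_per_context,
                                   int pixels_per_line) {
    return contexts * lines_per_context * pixels_per_line;
}

// Cycles available to one thread per iteration, assuming the given
// throughput and an even split across all active threads.
constexpr int cycles_per_thread(int pixels, int pixels_per_cycle,
                                int active_threads) {
    return pixels / pixels_per_cycle / active_threads;
}

// Cycles left for loop overhead and state maintenance after transfer setup,
// assuming one cycle per setup instruction.
constexpr int remaining_cycles(int cycles, int setup_instructions) {
    return cycles - setup_instructions;
}
```

With three contexts, four lines per context, and 64 pixels per line, `pixels_per_iteration` gives 768; at one pixel per cycle across 16 threads, `cycles_per_thread` gives 48; and after six setup instructions, `remaining_cycles` gives 42.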
Turning to
Turning to
In
Circular buffers can be used extensively in pixel and signal processing, to manage local data contexts such as a region of scan lines or filter-input samples. Circular buffers are typically used to retain local pixel context (for example), offset up or down in the vertical direction from a given central scan line. The buffers are programmable, and can be defined to have an arbitrary number of entries, each entry of arbitrary size, in any contiguous set of data memory locations (the actual location is determined by compiler data-structure layout). In some respects, this functionality is similar to circular addressing in the C6x.
However, there are a few issues introduced by the application of circular buffers here. Pixel processing (for example) can require boundary processing at the top and bottom edges of the frame. This provides data in place of “missing” data beyond the frame boundary. The form of this processing, and the number of “missing” scan lines provided, depends on the algorithm. The implementation provided here of a circular buffer is generally independent of the actual location of the buffer in the dataflow. Dependent buffers are generally “filled” at the top of a frame and “drained” at the bottom. The actual state of any particular buffer depends on where it is located in the dataflow relative to other buffers.
Turning to
The first iteration provides input data at the first scan-line of the frame (top) to buffer 3402-1. In this example, this is not sufficient for buffer 3402-1 to generate valid output. The circular buffers 3402-1 to 3402-3 have three entries each, implying that entries from three scan-lines are used to calculate an output value. At this point, the buffer index points to the entry that is logically one line before the first scan-line (above the frame). Neither buffer 3402-2 nor buffer 3402-3 has valid input at this point. The second iteration provides data at the second scan-line (top+1) to buffer 3402-1, and the index points to the first scan-line. In this example, boundary processing can provide the equivalent of three scan-lines of data because the second scan-line is logically reflected above the top boundary. The entry after the index generally serves two purposes, providing data to represent a value at top−1 (above the boundary), and actual data at top+1 (the second scan-line). This is sufficient to provide output data to buffer 3402-2, but this data is not sufficient for buffer 3402-2 to generate valid output so that buffer 3402-3 has no input. The third iteration provides three scan-line inputs to buffer 3402-1, which provides a second input to buffer 3402-2. At this point, buffer 3402-2 uses boundary processing to generate output to buffer 3402-3. On the fifth iteration, all stages 3402-1 to 3402-3 have valid datasets for generating output, but each is offset by a scan-line due to the delays in filling the buffers through the processing stages. For example, in the fifth iteration, buffer 3402-1 generates output at top+3, buffer 3402-2 at top+2, and buffer 3402-3 at top+1.
Generally, it is not possible for algorithm kernels (i.e., 1808) to completely specify initial settings or the behavior of their circular buffers (i.e., 3402-1) because, among other things, this depends on how many stages removed they are from input data. This information is available from the system programming tool 718, based on the use-case diagram. However, the system programming tool 718 also does not completely specify the behavior of circular buffers (i.e., 3402-1) because, for example, the size of the buffers and the specifics of boundary processing depend on the algorithm. Thus, the behavior of circular buffers (i.e., 3402-1) is determined by a combination of information known to the application and to system programming tool 718. Furthermore, the behavior of a circular buffer (i.e., 3402-1) also depends on the position of the buffer relative to the frame, which is information known to the read thread (i.e., 906), at run time.
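The fill behavior at the top of the frame can be summarized by a simple offset rule, sketched below. The function name `output_line` is hypothetical, and the rule is an assumption fitted to this specific example (a serial chain of three-entry buffers with reflection at the top boundary); it does not model draining at the bottom of the frame.

```cpp
// For a 1-based iteration number and a 1-based stage number in a serial
// chain of three-entry circular buffers, returns the scan-line (0 = top,
// 1 = top+1, ...) for which that stage generates output, or -1 if the
// stage cannot yet generate valid output.
int output_line(int iteration, int stage) {
    const int line = iteration - stage - 1;
    return line < 0 ? -1 : line;
}
```

Under this rule, the first stage produces its first output (at top) on the second iteration, each later stage starts one iteration after the stage before it, and on the fifth iteration the three stages produce output at top+3, top+2, and top+1, respectively.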
SIMD data memory and node processor data memory (i.e., 4328 and which is described below in detail) are partitioned into a variable number of contexts, of variable size. Data in the vertical frame direction is retained and re-used within the context itself, using circular buffers. Data in the horizontal frame direction is shared by linking contexts together into a horizontal group (in the programming model, this is represented by the datatype Line). It is important to note that the context organization is mostly independent of the number of nodes involved in a computation and how they interact with each other. A purpose of contexts is to retain, share, and re-use image data, regardless of the organization of nodes that operate on this data.
Turning to
Variable allocation is provided for the number of contexts, and sizes of contexts, to object instances in which contexts (i.e., 3502-1) allocated to the same object class can be considered separate object instances. Also, context allocation can include both scalar and vector (i.e., SIMD) data, where scalar data can include parameters, configuration data, and circular-buffer state. Additionally, there are several ways of overlapping data transfer with computation: (1) using two or more contexts for double-buffering (or deeper buffering); (2) compiler flags that indicate when input state is no longer desired, so that the next transfer can proceed in parallel with completing execution; and (3) addressing modes that permit the implementation of circular buffers (e.g., first-in-first-out buffers or FIFOs). Data transfer at the system level can look like variable assignment in the programming model, with the system 700 matching context offsets during a “linking” phase. Moreover, multi-tasking can be used to most efficiently schedule node resources so as to run whatever contexts are ready, with system-level dependency checking that enforces a correct task order and registers that can be saved and restored in a single cycle, so that multi-tasking incurs no overhead.
Turning to
Typically, a variable number of contexts (i.e., 3502-1), of variable sizes, are allocated to a variable number of programs. For a given program, all contexts are generally the same size, as provided by the system programming tool 718. SIMD data memory not allocated to contexts is available for access from all contexts, using a negative offset from the bottom of the data memory. This area is used as a compiler 706 spill/fill area 3610 for data that does not have to be preserved across task boundaries, which generally avoids the requirement that this memory be allocated to each context separately.
Each descriptor 3702 for node processor data memory (4328 and which is described below in detail) can contain a field (i.e., 3703-1 and 3703-2) that specifies the base address of the associated context (which can be seen in
Turning to
SIMD data memory descriptors 3704 are usually organized as linear lists, with a bit in the descriptor indicating that it is the last entry in the list for the associated program. When a program is scheduled, part of the scheduling message indicates the base context number of the program. For example, the message scheduling program B (object instance 1802-2) in the
Turning to
Typically, the horizontal group begins on the left at a left boundary, and terminates on the right at a right boundary. Boundary processing applies to these contexts for any attempt to access left-side or right-side context. Boundary processing is valid at the actual left and right boundaries of the image. However, if an entire scan-line does not fit into the horizontal group, the left- and right-boundary contexts can be at intermediate points in the scan-line, and boundary processing does not produce correct results. This means that any computation using this context generates an invalid result, and this invalid data propagates for every access of side context. This is compensated for by fetching horizontal groups with enough overlap to create valid final results. This reflects the inefficiency discussed earlier that is partially compensated for by wide horizontal groups (relatively small overlap is required, compared to the total number of pixels in the horizontal group).
Note that the side-context pointers generally permit the right boundary to share side context with the left boundary. This is valid for computing that progresses horizontally across scan lines. However, since in this configuration contexts are used for multiple horizontal segments, this does not permit sharing of data in the vertical direction. If this data is required, this implies a large amount of system-level data movement to save and restore these contexts.
A context (i.e., 3602-1) can be set so that it is not linked to a horizontal group, but instead is a standalone context providing outputs based on inputs. This is useful for operations that span multiple regions of the frame, such as gathering statistics, or for operations that don't depend specifically on a horizontal location and can be shared by a horizontal group. A standalone context is threaded, so that input data from sources, and output data to destinations, is provided in scan-line order.
Turning back to
Node addresses are generally structures of two identifiers. One part of the structure is a “Segment_ID”, and the second part is a “Node_ID”. This permits nodes (i.e., 808-i) with similar functionality to be grouped into a segment, and to be addressed with a single transfer using multi-cast to the segment. The “Node_ID” selects the node within the segment. Null connections are indicated by Segment_ID.Node_ID=00.0000′b. Valid bits are not required because invalid descriptors are not referenced. The first word of the descriptor indicates the base address of the context in SIMD data memory. The next word contains bits 3706 and 3707 indicating the last descriptor on the list of descriptors allocated to a program (Bk=1 for the last descriptor) and whether the context is a standalone, threaded context (Th=1). The second word also specifies horizontal position from the left boundary (field 3708), whether the context depends on input data (field 3710), and the number of data inputs in field 3709, with values 0-7 representing 1-8 inputs, respectively (input data can be provided by up to four sources, but each source can provide both scalar and vector data). The third and fourth words contain the segment, node, and context identifiers for the contexts sharing data on the left and right sides, respectively, called the left-context pointer and right-context pointer in fields 3711 to 3718.
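Two of the encodings described above, the null connection and the input-count field, can be illustrated with the following sketch. The type and function names are hypothetical, the 2-bit and 4-bit identifier widths are inferred from the null-connection pattern rather than stated in the text, and no bit positions within the descriptor words are assumed.

```cpp
#include <cstdint>

// Hypothetical two-part node address. The widths (2-bit segment, 4-bit
// node) are an inference from the "00.0000" null pattern.
struct NodeAddressSketch {
    std::uint8_t segment_id;  // selects a group of similar nodes
    std::uint8_t node_id;     // selects the node within the segment
};

// A null connection is indicated by both identifiers being zero.
bool is_null_connection(NodeAddressSketch a) {
    return a.segment_id == 0 && a.node_id == 0;
}

// Field 3709 stores the number of data inputs minus one:
// encoded values 0-7 represent 1-8 inputs.
int decode_num_inputs(unsigned field_3709) {
    return static_cast<int>(field_3709 & 0x7u) + 1;
}
```

The offset-by-one encoding lets a three-bit field cover the full range of 1 to 8 inputs, which matches up to four sources each providing both scalar and vector data.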
The context-state RAM or memory also has up to four entries describing context outputs, in a structure called a destination descriptor (the format of which can be seen in
In
A context (i.e., 3502-1) normally has at least one destination for output data, but it is also possible that a single program in a context (i.e., 3502-1) can output several different sets of data, of different types, to different destinations. The capability for multiple outputs is generally employed in two situations:
Destination descriptors support a generalized system dataflow and can be seen in
In basic node (i.e., 808-i) allocation, throughput is met by adjusting and balancing the effective cycle counts so that data sources produce output at the required rate. This is determined by true dependencies between source and destination programs. For example, scan-based pixel processing has a much more complex set of dependencies than those between serially-connected sources and destinations, and the potential stalls introduced should be analyzed by system programming tool 718. As discussed in this section, this can be done after resource allocation, because it depends on context configurations, but has to occur before compiling source code, because the compiler uses information from system programming tool 718 to avoid these stalls.
In scan-based processing, data is shared not only between outputs and inputs, but also between contexts that are cooperating on different segments of a horizontal group. This sharing is essential to meet throughput, so that the number of pixels output by a program can be adjusted according to the cycle count (increasing cycles implies increasing pixels output, to maintain the required throughput in terms of pixels per cycle). To accomplish this, the program executes in multiple contexts, either in parallel or multi-tasked, and these contexts should logically appear as a single program operating on the total width of allocated contexts. Input and intermediate data associated with the scan lines are shared across the cooperating contexts, in both left-to-right and right-to-left directions.
To meet throughput for scan-line-based applications, all dependencies should be considered, including those reflected through shared side-contexts. Nodes (i.e., 808-i) use task and program pre-emption (i.e., 3802, 3804, and 3806) to reduce the impact of these dependencies, but this is not generally sufficient to prevent all dependency stalls, as shown in
These side-context stalls are a complex function of task sizes (cycles between task boundaries, determined by the source code and code generation), the task sequence in the presence of task pre-emption, the number of tasks, the number of contexts, and the context organization (intra-node or inter-node). There is no closed-form expression that can predict whether or not stalls can occur. Instead, the system programming tool 718 builds the dependency graph, as shown in the figure, to determine whether or not there is a likelihood of side-context dependency stalls. The meta-data that the compiler 706 provides, as a result of compiling algorithm modules as stand-alone programs, includes a table of the tasks and their relative cycle counts. The system programming tool 718 uses this information to construct the graph, after resource allocation determines the number of contexts and their organizations. This graph also comprehends task pre-emption (but not program pre-emption, for simplicity).
If the graph does indicate the possibility of one or more dependency stalls, system programming tool 718 can eliminate the stalls by introducing artificial task boundaries to balance dependencies with resource utilization. In this example, the problem is the size of tasks 3306-1 to 3306-6 (for node 808-i) with respect to subsequent, dependent tasks; an outlier in terms of task size is usually the cause since it causes the node 808-i to be occupied for a length of time that does not satisfy the dependencies of contexts in previous nodes (i.e., 808-(i−1)), which are dependent on right-side context from subsequent nodes. The stall is removed by splitting each of tasks 3306-1 to 3306-6 into two sub-tasks. This task boundary has to be communicated to the compiler 706 along with the source files (concatenating task tables for merged programs). The compiler 706 inserts the task boundary because SIMD registers are not live across these boundaries, and so the compiler 706 allocates registers and spill/fill accordingly. This can alter the cycle count and the relative location of the task boundary, but task balancing is not very sensitive to the actual placement of the artificial boundary. After compilation, the system programming tool 718 reconstructs the dependency graph as a check on the results.
Dependency checking can be complex, given the number of contexts across all nodes that possibly share data, the fact that data is shared both through node input/output (I/O) and side-context sharing, and the fact that node I/O can include system memory, peripherals, and hardware accelerators. Dependency checking should properly handle: 1) true dependencies, so that program execution does not proceed unless all required data is valid; and 2) anti-dependencies, so that a source of data does not over-write a data location until it is no longer desired by the local program. There are no output dependencies; outputs are usually in strict program and scan-line order.
Since there are many styles of sharing data, terminology is introduced to distinguish the types of sharing and the protocols used to generally ensure that dependency conditions are met. The list below defines the terminology in the
Local context management controls dataflow and dependency checking between local shared contexts on the same node (i.e., 808-i) or logically adjacent nodes. This concerns shared left side contexts 3602 or right side contexts 3606, copied into the left-side or right-side context RAMs or memories
Contexts that are shared in the horizontal direction have dependencies in both the left and right directions. A context (i.e., 3502-1) receives Llc and Rlc data from the contexts on its left and right, and also provides Rlc and Llc data to those contexts. This introduces circularity in the data dependencies: a context should receive Llc data from the context on its left before it can provide Rlc data to that context, but that context desires Rlc data from this context, on its right, before it can provide the Llc context.
This circularity is broken using fine-grained multi-tasking. For example, tasks 3306-1 to 3306-6 (from
The figure also shows two nodes, each having the same task set and context configuration (part of the sequence is shown for node 808-(i+1)). Assume that task 3306-1 is at the left boundary for illustration, so it has no Llc dependencies. Multi-tasking is illustrated by tasks executing in different time slices on the same node (i.e., 808-i); the tasks 3306-1 to 3306-6 are spread horizontally to emphasize the relationship to the horizontal position in the frame.
As task 3306-1 executes, it generates left local context data for task 3306-2. If task 3306-1 reaches a point where it can require right local context data, it cannot proceed, because this data is not available. Its Rlc data is generated by task 3306-2 executing in its own context, using the left local context data generated by task 3306-1 (if required). Task 3306-2 has not executed yet because of hardware contention (both tasks execute on the same node 808-i). At this point, task 3306-1 is suspended, and task 3306-2 executes. During the execution of task 3306-2, it provides left local context data to task 3306-3, and also Rlc data to task 3308-1, where task 3308-1 is simply a continuation of the same program, but with valid Rlc data. This illustration is for intra-node organizations, but the same issues apply to inter-node organizations. Inter-node organizations are simply generalized intra-node organizations, for example replacing node 808-i with two or more nodes.
A program can begin executing in a context (i.e., 3502-1) when all Lin, Cin, and Rin data is valid for that context (if required), as determined by the Lvin, Cvin, and Rvin states. During execution, the program creates results using this input context, and updates Llc and Clc data—this data can be used without restriction. The Rlc context is not valid, but the Rvlc state is set to enable the hardware to use Rin context without stalling. If the program encounters an access to Rlc data, it cannot proceed beyond that point, because this data may not have been computed yet (the program to compute it cannot necessarily execute because the number of nodes is smaller than the number of contexts, so not all contexts can be computed in parallel). On the completion of the instruction before Rlc data is accessed, a task switch occurs, suspending the current task and initiating another task. The Rvlc state is reset when the task switch occurs.
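The start condition described above can be sketched as a predicate over the input-valid states. The structure and its field names are hypothetical; in particular, the `needs_*` flags are an assumed way of modeling the "if required" qualification, for example for a context at the left boundary.

```cpp
// Hypothetical view of the input-valid state of one context.
struct ContextInputStateSketch {
    bool lvin = false;  // left-side input context is valid
    bool cvin = false;  // center input context is valid
    bool rvin = false;  // right-side input context is valid
    bool needs_left = true;    // false when left input is not required
    bool needs_center = true;  // false when center input is not required
    bool needs_right = true;   // false when right input is not required
};

// A program can begin executing in a context once every required input
// is marked valid.
bool can_begin(const ContextInputStateSketch& s) {
    return (!s.needs_left || s.lvin) &&
           (!s.needs_center || s.cvin) &&
           (!s.needs_right || s.rvin);
}
```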
The task switch is based on an instruction flag set by the compiler 706, which recognizes that right-side intermediate context is being accessed for the first time in the program flow. The compiler 706 can distinguish between input variables and intermediate context, and so can avoid this task switch for input data, which is valid until no longer desired. The task switch frees up the node to compute in a new context, normally the context whose Llc data was updated by the first task (exceptions to this are noted later). This task executes the same code as the first task, but in the new context, assuming Lvin, Cvin, and Rvin are set—Llc data is valid because it was copied earlier into the left-side context RAM. The new task creates results which update Llc and Clc data, and also update Rlc data in the previous context. Since the new task executes the same code as the first, it will also encounter the same task boundary, and a subsequent task switch will occur. This task switch signals the context on its left to set the Rvlc state, since the end of the task implies that all Rlc data is valid up to that point in execution.
At the second task switch, there are two possible choices for the next task to schedule. A third task can execute the same code in the next context to the right, as just described, or the first task can resume where it was suspended, since it now has valid Lin, Cin, Rin, Llc, Clc, and Rlc data. Both tasks should execute at some point, but the order generally does not matter for correctness. The scheduling algorithm normally attempts to choose the first alternative, proceeding left-to-right as far as possible (possibly all the way to the right boundary). This satisfies more dependencies, since this order generates both valid Llc and Rlc data, whereas resuming the first task would generate only Llc data as it did before. Satisfying more dependencies maximizes the number of tasks that are ready to resume, making it more likely that some task will be ready to run when a task switch occurs.
It is important to maximize the number of tasks ready to execute, because multi-tasking is also used to optimize utilization of compute resources. Here, there are a large number of data dependencies interacting with a large number of resource dependencies. There is no fixed task schedule that can keep the hardware fully utilized in the presence of both dependencies and resource conflicts. If a node (i.e., 808-i) cannot proceed left-to-right for some reason (generally because dependencies are not satisfied yet), the scheduler will resume the task in the first context, that is, the left-most context on the node (i.e., 808-i). Any of the contexts on the left should be ready to execute, but resuming in the left-most context maximizes the number of cycles available to resolve those dependencies that caused this change in execution order, because this enables tasks to execute in the maximum number of contexts. As a result, pre-empts (i.e., pre-empt 3802), which are times during which the task schedule is modified, can be used.
Turning to
To summarize, tasks start with the left-most context with respect to their horizontal position, proceed left-to-right as far as possible until encountering either a stall or the right-most context, then resume in the left-most context. This maximizes node utilization by minimizing the chance of a dependency stall (a node, like node 808-i, can have up to eight scheduled programs, and tasks from any of these can be scheduled).
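As an illustration only, the scheduling preference summarized above can be sketched in Python. The function name next_context, the list-of-flags representation of context readiness, and the choice to return None when nothing is ready are assumptions made for this sketch and are not part of the described hardware. The fallback scans from the left-most context up to the current one, which generalizes the stated rule of resuming in the left-most context.

```python
from typing import List, Optional


def next_context(current: int, ready: List[bool]) -> Optional[int]:
    """Pick the context to run after a task switch in context `current`.

    `ready[i]` is True when the task in context i has its dependencies
    satisfied (hypothetical representation). Contexts are indexed
    left-to-right, with 0 being the left-most context on the node.
    """
    right = current + 1
    # Preferred choice: keep proceeding left-to-right.
    if right < len(ready) and ready[right]:
        return right
    # Stall or right-most context reached: resume as far left as possible.
    for index in range(current + 1):
        if ready[index]:
            return index
    # Assumption: no runnable task in this sketch means the node idles.
    return None
```

For example, with three contexts that are all ready, a task switch in context 0 selects context 1, and a task switch in the right-most context 2 returns to context 0.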
The discussion on side-context dependencies so far has focused on true dependencies, but there is also an anti-dependency through side contexts. A program can write a given context location more than once, and normally does so to minimize memory requirements. If a program reads Llc data at that location between these writes, this implies that the context on the right also desires to read this data, but since the task for this context hasn't executed yet, the second write would overwrite the data of the first write before the second task has read it. This dependency case is handled by introducing a task switch before the second write, and task scheduling ensures that the task executes in the context on the right, because scheduling assumes that this task has to execute to provide Rlc data. In this case, however, the task boundary enables the second task to read Llc data before it is modified a second time.
The left-side context RAM is typically read-only with respect to a program executing in a local context. It is written by two write buffers which receive data from other sources, and which are used by the local node to perform dependency checking. One write buffer is for global input data, Lin, based on data written as Cin data in the context on the left. The Lin buffer has a single entry. The second buffer is for Llc data supplied by operations within the same context on the left. The Llc buffer has 6 entries, roughly corresponding to the 2 writes per cycle that can be executed by a SIMD instruction, with a 3-entry queue for each of the 2 writes (this is conceptual—the actual organization is more general). These buffers are managed differently, though both perform the function of separating data transfer from RAM write cycles and providing setup time for the RAM write.
The Lin buffer stores input data sent from the context on the left, and holds this data for an available write cycle into the left-side context RAM. The left-side context RAM is typically a single-port RAM and can read or write in a cycle (but not both). These cycles are almost always available because they are unavailable only in the case of a left-side context access within the same bank (on one of the 4 read ports, 32 banks), which is statistically very infrequent. This is why there is usually one buffer entry: it is very unlikely that the buffer is occupied when a second Lin transfer happens, because at the system level there are at least four cycles between two Cin transfers, and usually many more than four cycles. The hardware checks this condition, and forces the buffer to empty if desired, but this is generally to ensure correctness; it is nearly impossible to create this condition in normal operation.
An example of a format for the Lin buffer 3807 can be seen in
Dependency checking on the Lin buffer 3807 can be based on the signal sent by the context on the left when it has received Set_Valid signals from all of its sources (i.e., sources which have not signaled Input_Done). This sets the Lvin state. If Lvin is not set for a context, and the SIMD instruction attempts to access left-side context, the node (i.e., 808-i) stalls until the Lvin state is set. The Lvin state is ignored if there is no left-side context access. Also, as will be discussed below, there is a system-level protocol that prevents anti-dependencies on Lin data, so there is almost no situation where the context on the left will attempt to overwrite Lin data before it has been used.
The Llc write buffer stores local data from the context on the left, to wait for available RAM cycles. The format and use of an Llc buffer entry is similar to the Lin buffer entry and can be a hardware-only structure. Some differences with the Lin buffer are that there are multiple entries—six instead of one—and the context offset field, in addition to specifying the offset for writing the left-side RAM, is used also to detect hits on entries in the buffer and forward from the buffer if desired. This bypasses the left-side context RAM, so that the data can be used with virtually no delay.
As described above, Llc data is updated in the left-side context RAMs in advance of a task switch to compute Rlc data using—or to ensure that Llc data is used in—the context on the right. Llc data can be used immediately by the node on the right, though the nodes are not necessarily executing a synchronous instruction sequence. In almost all cases, these nodes are physically adjacent: within a partition, this is true by definition; between partitions, this can be guaranteed by node allocation with the system programming tool 718. In these cases, data is copied into the Llc write buffers feeding the left-side context RAMs quickly enough that data can be used without stalls, which can be an important property for performance and correctness of synchronous nodes.
Llc data can be transferred from source to destination contexts in a single cycle, and there is no penalty between update and use. Llc dependency checking can be done concurrently with execution, to properly locate and forward data as described below, and to check for stall conditions. The design goal is to transmit Llc data within one cycle for adjacent contexts, either on the same node or a physically adjacent node.
Forwarding from the Llc write buffer can be performed when the buffer is written with data destined for the current context (that is, a task is executing in the context concurrently with data transfer from the source). Concurrent contexts arise when the last context on one node is sharing data concurrently with the first context on the adjacent node to the right (for example, in
For a given configuration of context descriptors, the right-context pointer of a source context forms a fixed relationship with its destination context. Thus each destination context has a static association with the source, for the duration of the configuration. This static property can be important because, even if the source context is potentially concurrent, the source node can be executing ahead of, synchronously with, behind, or non-concurrently with, the destination context, since different nodes can have private program counters or PCs and private instruction memories. The detection of potential concurrency is based on static context relationships, not actual task states. For example, a task switch can occur into a potentially concurrent context from a non-concurrent one and should be able to perform dependency checking even if the source context has not yet begun execution.
If the source context is not concurrent with the destination, then there is no dependency checking or forwarding in the Llc buffer. An entry is allocated for each write from the source, and the information in the entry is used to write the left-side context RAM. The order of writes from the source is generally unimportant with respect to writes into the destination context. These writes simply populate the destination context with data that will be used later, and the source cannot write a given location twice without a context switch that permits the destination to read the value first. For this reason, the Llc buffer can allocate any entries, in any order, for any writes from the source.
Also, regardless of the order in which they were allocated, the buffer can empty any two entries which target non-accessed banks (that is, when there are no left-side context accesses to the banks). Six entries are provided (compared to the single entry for the Lin buffer) because SIMD writes are much more frequent than global data writes. Despite this, there statistically are still many available write cycles, since any two entries can be written in any order to any set of available banks, and since the left-side RAM banks are available more frequently than center RAM banks, because they are free except when the SIMD reads left-side context (in contrast to the center context which is usually accessed on a read). It is very unlikely that the write buffer will encounter an overflow condition, though the hardware does check for this and forces writes if desired. For example, six entries can be specified so that the Llc buffer can be managed as a first-in-first-out (FIFO) of two writes per cycle, over three cycles, if this simplifies the implementation. Another alternative can be to reduce the number of entries and use random allocation and de-allocation.
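By way of a hedged example, the emptying of the Llc write buffer for a non-concurrent source can be modeled as below. The Entry fields, the drain function, and the rule that at most one buffered write targets a given bank per cycle are assumptions of this sketch; the text only states that any two entries targeting non-accessed banks can be emptied in any order.

```python
from dataclasses import dataclass
from typing import List, Set

CAPACITY = 6          # Llc write buffer entries (from the description)
WRITES_PER_CYCLE = 2  # entries that can be emptied per cycle (from the description)


@dataclass
class Entry:
    bank: int    # left-side context RAM bank targeted by the write
    offset: int  # context offset within the left-side context RAM
    data: int


def drain(buffer: List[Entry], read_banks: Set[int]) -> List[Entry]:
    """Remove and return up to two entries whose banks are free this cycle.

    `read_banks` holds the banks read by the SIMD in this cycle; a
    single-port bank that is being read cannot also be written.
    """
    written: List[Entry] = []
    busy = set(read_banks)
    for entry in list(buffer):
        if len(written) == WRITES_PER_CYCLE:
            break
        if entry.bank not in busy:
            written.append(entry)
            busy.add(entry.bank)  # assumption: one write per bank per cycle
            buffer.remove(entry)
    return written
```

In this model, entries blocked by a left-side read simply remain in the buffer and are retried on a later cycle, which mirrors the observation that allocation order does not constrain the order of RAM writes.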
When the non-concurrent source task suspends, this is signaled to the destination context and sets the Lvlc state in that context. This state indicates that the context should not use the dependency checking mechanism for concurrent contexts. It also is used for anti-dependency checking. The source context cannot again write into the destination context until it has been processed and its task has ended, resetting the Lvlc state. This condition is checked because task pre-emption can re-order execution, so that the source node resumes execution before the destination node has used the Llc data. This is a stall condition that the scheduler attempts to work around by further pre-emption.
Since adjacent nodes (i.e., 808-i and 808-(i+1)) can use different program counters or PCs and instruction memories and since these adjacent nodes have different dependencies and resource conflicts, a source of Llc data does not necessarily execute synchronously with its destination, even if it is potentially concurrent. Potentially concurrent tasks might or might not execute at the same time, and their relative execution timing changes dynamically, based on system-level scheduling and dependencies. The source task may: 1) have executed and suspended before the destination context executes; 2) be any number of instructions ahead of, or exactly synchronous with, the destination context; 3) be any number of instructions behind the destination context; or 4) execute after the destination context has completed. The last case occurs when the destination task does not access new Llc context from the source, but instead is supplying Rlc context to a future task and/or using older Llc context.
The Llc dependency checking generally operates correctly regardless of the actual temporal relationship of the source and destination tasks. If the source context executes and suspends before the destination, the Llc buffer effectively operates as described above for non-concurrent tasks, and this situation is detected by the Lvlc state being set when the destination task begins. If the Lvlc state is not set when a concurrent task begins execution, Llc buffer dependency checking should provide correct data (or stall the node) even though the source and destination nodes are not at the same point in execution. This is referred to as real-time Llc dependency checking.
Real-time Llc dependency checking generally operates in one of two modes of operation, depending on whether or not the source is ahead of the destination. If the source is ahead of the destination (or synchronous with it), source data is valid when the destination accesses it, either from the Llc write buffer or the left-side context RAM. If the destination is ahead of the source, it should stall and wait on source data when it attempts to read data that has not yet been provided by the source. It cannot stall on just any Llc access, because this might be an access for data that was provided by some previous task, in which case it is valid in the left-side RAM and will not be written by the source. Dependency checking should be precise, to provide correct data and also prevent a deadlock stall waiting for data that will never arrive, or to avoid stalling a potentially large number of cycles until the source task completes and sets the Lvlc state, which releases the stall, but very inefficiently.
To understand how real-time dependencies are resolved, note that, though the source and destination contexts can be offset in time, the contexts are executing the same instruction sequence and generating the same SIMD data memory write sequence. To some degree, the temporal relationship does not matter because there is a lot of information available to the destination about what the source will do, even if the source is behind: 1) writes appear at the same relative locations in the instruction sequence; 2) write offsets are identical for corresponding writes; and 3) a write to a dependent Llc location can occur once within the task.
For real-time dependency checking, the temporal relationship of the source and destination is determined by a relative count of the number of active write cycles, that is, cycles in which one or more writes occur (the number of writes per cycle is generally unimportant). For example, there can be two 16-bit counters in each node (i.e., 808-i), associated with Llc dependency checking. One counter, the source write counter, is incremented for an active write cycle received from a source context, regardless of the source or destination contexts. When a source task completes, the counter is reset to 0, and begins counting again when the next source task begins. The second counter, the destination write counter, is incremented for an active write cycle in the destination context, but only while the source task has not completed during execution of the destination task (determined by the Lvlc state). These counters, along with other information, determine the temporal relationship of source and destination and how dependency checking is accomplished.
When a destination task begins and Lvlc state is not set, this indicates that the source task has not completed (and may not have begun). The destination task can execute as long as it does not depend on source data that has not been provided, and it should stall if it is actually dependent on the source. Furthermore, this dependency checking should operate correctly even in extreme cases such as when the source has not begun execution when the destination does, but does start at a later point in time and then moves ahead of the destination. The destination generally checks the following conditions:
It is relatively easy for the destination to detect that the source is active, because the contexts have a fixed relationship. The source context can signal when it is in execution, because its context descriptor is currently active. If the source is active, whether or not it is ahead is determined by the relationship of the source and destination write counters. If the source counter is greater than the destination counter, the source is ahead. If the source counter is less than the destination counter, it is behind. If the source counter is equal to the destination counter, the source and destination contexts are executing synchronously (at least temporarily). If a destination context is behind or synchronous with the source context, then it accesses valid data either from the left-side RAM or the Llc write buffer. If the destination context is ahead of the source context, it should keep track of future source context writes and stall on an Llc access to a location that hasn't been written yet. This is accomplished by writing into the left-side RAM (the value is unimportant), and resetting a valid bit in the written location. Because dependent writes are unique, any number of locations can be written in this way to indicate true dependencies, and there are no output dependencies (i.e. there are no multiple writes to be ordered for destination reads).
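The counter comparison and valid-bit mechanism described above can be illustrated with the following simplified Python model. The class LlcChecker, its method names, the use of a set to stand in for reset valid bits, and the restriction to one write offset per active write cycle are assumptions made for this sketch; counter width, the Lvlc state, and forwarding from the write buffer are not modeled.

```python
from enum import Enum


class Relation(Enum):
    SOURCE_AHEAD = "source ahead"
    SYNCHRONOUS = "synchronous"
    SOURCE_BEHIND = "source behind"


class LlcChecker:
    """Simplified model of real-time Llc dependency checking."""

    def __init__(self) -> None:
        self.source_count = 0  # source write counter
        self.dest_count = 0    # destination write counter
        self.invalid = set()   # offsets whose valid bit has been reset

    def relation(self) -> Relation:
        if self.source_count > self.dest_count:
            return Relation.SOURCE_AHEAD
        if self.source_count < self.dest_count:
            return Relation.SOURCE_BEHIND
        return Relation.SYNCHRONOUS

    def source_write(self, offset: int) -> None:
        """Active write cycle received from the source context."""
        self.source_count += 1
        self.invalid.discard(offset)  # data arrived, location is valid again

    def dest_write(self, offset: int) -> None:
        """Active write cycle in the destination context.

        The source performs the same write sequence, so if the destination
        is now ahead, the corresponding source write is still in the future
        and the location is marked as not yet valid.
        """
        self.dest_count += 1
        if self.dest_count > self.source_count:
            self.invalid.add(offset)

    def dest_read_stalls(self, offset: int) -> bool:
        """True when a destination Llc read of `offset` should stall."""
        return offset in self.invalid
```

Because a dependent Llc location is written once within the task, a single valid bit per location is sufficient in this model; reads of locations supplied by an earlier task never appear in the invalid set and therefore do not stall.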
So Llc real-time dependency checking generally operates as follows:
As described above, Rlc data is provided by task sequencing. There will usually be a task switch between the write and the read, and, in most cases, the next task will not desire this Rlc data, because task scheduling prefers tasks that generate both Llc data and Rlc data, rather than a previous task that uses Rlc data.
Rlc dependencies cannot generally be checked in real time because the source and destination tasks do not execute the same instructions (the code is sequential, not concurrent), and this is a key property enabling real-time dependency checking for Llc data. It is required that the source task has suspended, setting the Rvlc state, before the destination task can access right-side context (it stalls on an attempted access of this context if Rvlc is reset). This can stall a task unnecessarily, because it does not detect that the read is actually dependent on a recent write, but there is no way to detect this condition. This is one reason for providing task pre-emption, so that the SIMD can be used efficiently even though tasks are not allowed to execute until it is known that all right-side source data should have been written. When the destination task suspends, it resets the Rvlc state, so it should be set again by the source after it provides a new set of Rlc context. There are write buffers for Rin and Rlc data, to avoid contention for RAM banks on the right-side context RAM. These buffers have the same entry format and size as the Lin and Llc write buffers. However, the Rlc write buffer is not used for forwarding as the Llc write buffer is.
Global context management relates to node input and output at the system level. It generally ensures that data transfer into and out of nodes is overlapped as much as possible with execution, ideally completely overlapped so there are no cycles spent waiting on data input or stalled for data output. A feature of processing cluster 1400 is that no cycles are spent, in the critical path of computation, to perform loads or stores, or related synchronization or communication. This can be important, for example, for pixel processing, which is characterized by very short programs (a few hundred instructions) having a very large amount of data interaction both between nodes whose contexts relate through horizontal groups, and between nodes that communicate with each other for various stages of the processing chain. In nodes (i.e., 808-i), loads and stores are performed in parallel with SIMD operations, and the cycles do not appear in series with pixel operations. Furthermore, global-context management operates so that these loads and stores also imply that the data is globally coherent, without any cycles taken for synchronization and communication. Coherency handles both true and anti-dependencies, so that valid data is usually used correctly and retained until it is no longer desired.
In general, input data is provided by a system peripheral or memory, flows into node contexts, is processed by the contexts, possibly including dataflow between nodes and hardware accelerators, and results are output to system peripherals and memory. Contexts can have multiple input sources, and can output to multiple destinations, either independently to different destinations or multi-casting the same data to multiple destinations. Since there are possibly many contexts on many nodes, some contexts are normally receiving inputs, while other contexts are executing and producing results. There is a large amount of potential overlap of these operations, and it is very likely that node computing resources can approach full utilization, because nodes execute on one set of contexts at a time out of the many contexts available. The system-coherency protocols guarantee correct operation at all times. Even though hardware can be kept fully busy in steady state, this cannot always be guaranteed, especially during startup phases or transitions between different use-cases or system configurations.
Data into and out of the processing cluster 1400 is under control of the GLS unit 1408, which generates read accesses from the system into the node contexts, and writes context output data to the system. These accesses are ultimately determined by a program (from a hosted environment) whose data types reflect system and data which is compiled onto the GLS processor 5402 (described in detail below). The program copies system variables into node program-input variables, and invokes the node program by asserting Set_Valid. The node program computes using input and retained private variables, producing output which writes to other processing cluster 1400 contexts and/or to the system. The programs are structured so that they can be compiled in a cross-hosted development (i.e., C++) environment, and create correct results when executed sequentially. When the target is the processing cluster 1400, these programs are compiled as separate GLS processor 5402 (described below) and node programs, and executed in parallel, with fine-grained multi-tasking to achieve the most efficient use of resources and to provide the maximum overlap between input/output and computation.
Because context-input data is contained in program variables, the input is fully general, representing any data types with any layout in data memory. The GLS processor 5402 program marks the point at which the code performs the last output to the node program. This in turn marks the final transfer into the node with a Set_Valid signal (either scalar data to node processor data memory, vector data to SIMD data memory, or both). Output is conditional on program flow, so different iterations of the GLS processor 5402 program can output different combinations of vector and scalar data, to different combinations of variables and types.
The context descriptor indicates the number of input sources, from one to four sources. There is usually one Set_Valid for every unique input—scalar and/or vector input from each source. The context should receive an expected number of Set_Valid signals from each source before the program can begin execution. The maximum number of Set_Valid signals can (for example) be eight, representing both scalar and vector from four sources. The minimum number of Set_Valid signals can (for example) be zero, indicating that no new input is expected for the next program invocation.
Set_Valid signals can (for example) be recorded using a two-bit valid-input flag, ValFlag, for each source: the MSB of this flag is set to indicate that a vector Set_Valid is expected from the source, and the LSB is set to indicate that a scalar Set_Valid is expected. When a context is enabled to receive input (described below), valid-flag bits are set according to the number of sources: one pair is set if there is one source, two pairs if there are two sources, and so on, indicating the maximal dependency on each source. Before input is received from a source, that source sends a Source Notification message (described below) indicating that the source is ready to provide data, and indicating whether its type is scalar, vector, both, or none (for the current input set): the type is determined by the DataType field in the source's destination descriptor, and updates the ValFlag field from its initial value (the initial value is set to record a dependency before the nature of the dependency is known). As Set_Valid signals are received from a source (synchronous with data), the corresponding ValFlag bits are reset. The receipt of all Set_Valid signals is indicated by all ValFlag bits being zero.
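A minimal sketch of this ValFlag bookkeeping is given below for illustration. The class InputTracker and its method names are hypothetical, and sources are identified here by a simple index; only the two-bit encoding (MSB for vector, LSB for scalar), the initial maximal dependency, the update on Source Notification, and the reset on Set_Valid are taken from the description.

```python
from typing import List

VECTOR = 0b10  # MSB: a vector Set_Valid is expected from the source
SCALAR = 0b01  # LSB: a scalar Set_Valid is expected from the source


class InputTracker:
    """Tracks expected Set_Valid signals for one context."""

    def __init__(self, num_sources: int) -> None:
        if not 1 <= num_sources <= 4:
            raise ValueError("a context has from one to four input sources")
        # Maximal dependency is recorded before the data type is known.
        self.val_flag: List[int] = [VECTOR | SCALAR] * num_sources

    def source_notification(self, source: int, data_type: int) -> None:
        """Record the actual type (scalar, vector, both, or none) for a source."""
        self.val_flag[source] = data_type & (VECTOR | SCALAR)

    def set_valid(self, source: int, kind: int) -> None:
        """Reset the ValFlag bit for a received Set_Valid of the given kind."""
        self.val_flag[source] &= ~kind

    def all_inputs_valid(self) -> bool:
        """All Set_Valid signals are received when every ValFlag bit is zero."""
        return all(flag == 0 for flag in self.val_flag)
```

With two sources, where the first provides both vector and scalar data and the second provides scalar data only, three Set_Valid signals are needed before all_inputs_valid() returns True.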
When the desired number of Set_Valid signals has been received, the context can set Cvin and also can use side-context pointers to set Rvin and Lvin of the contexts shared to the left and right (
A similar process for transfer of input data from GLS unit 1408 can be used for input from other nodes. Nodes output data using an instruction which transfers data to the Global Output buffer. This instruction indicates which of the destination-descriptor entries is to be used to specify the destination of the data. Based on a compiler-generated flag in the instruction which performs the final output, the node signals Set_Valid with this output. The compiler can detect which variables represent output, and also can determine at what point in the program there is no more output to a given destination. The destination does not generally distinguish between data sent by the GLS unit 1408 and data sent by another node; both are treated the same, and affect the count of inputs in the same way. If a program has multiple outputs to multiple destinations, the compiler 706 marks the final output data for each output in the same way, both scalar and vector output as applicable.
Because of conditional program flow, it is possible that the initial Source Notification message indicates expected data that is not generally provided, because the data is output under program conditions that are not satisfied. In this case, the source signals Input_Done in a scalar data transfer, indicating that all input has been provided from the source despite the initial notification: the data in this transfer is not valid, and is not written into data memory. The Input_Done signal resets both ValFlag bits, indicating valid data from the corresponding source. In this case, data that was previously provided is used instead of new input data.
The compiler 706 marks the final output depending on the program flow-control that generates the output to a given destination. If the output does not depend on flow-control, there is no Input_Done signal, since the Set_Valid is usually signaled with the final data transfer. If the output does depend on flow-control, Input_Done follows the last output in the union of all paths that perform output, of either scalar or vector data. This uses an encoding of the instruction that normally outputs scalar data, but the accompanying data is not valid. The use of this encoding can be to signal to the destination that there is no more current output from the source.
As mentioned previously, context input data can be of any type, in any location, and accessed randomly by the node program. The point at which the hardware, without assistance, can detect that input data is no longer desired is when the program ends (all tasks have executed in the context). However, most programs generally read input data relatively early in execution, so that waiting until the program ends makes it likely that there are a significant number of cycles that could be used for input which go unused instead.
This inefficiency can be avoided using a compiler-generated flag, Release_Input, to indicate the point in the program where input data is no longer desired. This is similar in concept to the detection of the Set_Valid point, except that it is based on the compiler recognizing at what point in the code input variables will not generally be accessed again. This is the earliest point at which new inputs can be accepted, maximizing potential overlap of data transfer and computation.
The Release_Input flag resets the Cvin, Lvin, and Rvin of the local context (
Once a context receives all required Set_Valid signals indicating that all input data is valid, it cannot receive any more input data until the program indicates that input data is no longer desired. It is undesirable to stall the source node using in-band handshaking signals during an unwanted transfer, since this would tie up global interconnect resources for an extended period of time—potentially with hundreds of rejected transfers before an accepted one. Considering the number of source and destination contexts that can be in this situation, it is very likely that global interconnect 814 would be consumed by repeated attempts to transfer, with a large, undesired use of global resources and power consumption.
Instead, processing cluster 1400 implements a dataflow protocol that uses out-of-band messages to send permissions to source contexts, based on the availability of destination contexts to receive inputs. This protocol also enables ordering of data to and from threads, which includes transfers to and from system memory, peripherals, hardware accelerators, and threaded node contexts—the term thread is used to indicate that the dataflow should have sequential ordering. The protocol also enables discovery of source-destination pairs, because it is possible for these to change dynamically. For example, a fetch sequence from system memory by the GLS unit 1408 is distributed to a horizontal group of contexts, though neither the program for the GLS processor (discussed below) nor the GLS unit 1408 has any knowledge of the destination context configuration. The context configuration is reflected in distributed context descriptors, programmed by Tsys based on memory-allocation requirements. This configuration can vary from one use-case to another even for the same set of programs.
For node contexts, source and destination associations are formed by the sources' destination descriptors, indicating for each center-context pointer where that output is to be sent. For example, the left-most source context is configured to send to a left-most destination context (it can be either on the same node or another). This abstracts input/output from the context configurations, and distributes the implementation, so there is no centralized point of control for dependencies and dataflow, which would likely be a bottleneck limiting scalability and throughput.
In
Image context (for example) generally cannot be retained and re-used in a frame unless there is an equivalent number of node contexts at all stages of processing. There is a one-to-one relationship between the width of the frame and the width of the contexts, and data cannot be retained for re-use unless this relationship is preserved. For this reason, the figure shows all node groups implementing twelve contexts. Since the number of contexts is constant, the association of contexts is fixed for the duration of the configuration.
The dataflow protocol operates by source and destination contexts exchanging messages in advance of actual data transfer.
The center-context pointer for node 808-a, context 0, points to node 808-e, context 4, and the center-context pointer for node 808-a (the same node, though shown separately), context 1, points to node 808-e (also the same destination node shown separately), context 5. When each context is ready to begin execution, its pointer is used to send a Source Notification (SN) message to the destination context, indicating that the source is ready to transmit data. Nodes become ready to execute independently, and there is no guaranteed order to these messages. The SN message is addressed to the destination context using its Segment_ID.Node_ID and context number, collectively called the destination identifier (ID). The message also contains the same information for the source context, called the source identifier (ID). When the destination context is ready to accept data, it replies with a Source Permission (SP) message, enabling the source context to generate outputs. The source context also updates the destination descriptor with the destination ID received in the SP message: there are cases, described later, where the SP is received from a context different from the one to which the SN was sent, and in this case the SP is received from the actual intended destination.
Once the source output is set valid, the source context can no longer transmit data to the destination (note that normally the node does not stall, but instead executes other tasks and/or programs in other contexts). When the source context becomes ready to execute again, it sends a second SN message to the destination context. The destination context responds to the SN message with an SP message when InEn is set. This enables the source context to send data, up to the point of the next Set_Valid, at which point the protocol should be used again for every set of data transfers, up to the point of program termination in the source context.
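The SN/SP exchange described above can be sketched in software as follows. This is a minimal illustration only, not the hardware implementation: the class and method names (`Destination`, `Source`, `on_sn`, `on_sp`, `enable_input`) are hypothetical, and the choice to clear the input-enable flag when Set_Valid arrives is an assumption made for the sketch.

```python
class Destination:
    """Hypothetical destination context; in_en stands in for InEn."""

    def __init__(self, dest_id):
        self.dest_id = dest_id
        self.in_en = False   # input enabled
        self.pending = []    # sources whose SN arrived while not ready
        self.received = []

    def on_sn(self, source):
        if self.in_en:
            source.on_sp(self.dest_id)    # reply with Source Permission
        else:
            self.pending.append(source)   # record the SN, answer later

    def enable_input(self):
        self.in_en = True
        while self.pending:
            self.pending.pop(0).on_sp(self.dest_id)

    def on_data(self, data, set_valid):
        self.received.append(data)
        if set_valid:
            self.in_en = False   # assumption: input closes at Set_Valid


class Source:
    """Hypothetical source context with a single destination descriptor."""

    def __init__(self, source_id, dest):
        self.source_id = source_id
        self.dest = dest
        self.dest_descriptor = dest.dest_id
        self.out_enabled = False

    def begin_execution(self):
        self.dest.on_sn(self)             # Source Notification

    def on_sp(self, dest_id):
        self.dest_descriptor = dest_id    # SP carries the destination ID
        self.out_enabled = True

    def output(self, data, set_valid=False):
        if not self.out_enabled:
            raise RuntimeError("output attempted before SP was received")
        self.dest.on_data(data, set_valid)
        if set_valid:
            self.out_enabled = False      # a new SN/SP exchange is needed
```

In this sketch, a source that calls `begin_execution()` before the destination is enabled simply has its SN recorded; the SP is issued later by `enable_input()`, and output is then permitted up to the next Set_Valid.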
A context can output to several destinations and also receive data from multiple sources. The dataflow protocol is used for every combination of source-destination pairs. Sources originate SN messages for every destination, based on destination IDs in the context descriptor. Destinations can receive multiples of these messages and should respond to every one with an SP message to enable input. The SN message contains a destination tag field (Dst_Tag) identifying the corresponding destination descriptor: for example, a context with three outputs has three values for the Dst_Tag field, numbered 0-2, corresponding to the first, second, and third destination descriptors. The SP uses this field to indicate to the source which of its destinations is being enabled by the message. The SN message also contains a source tag field (Src_Tag) to uniquely identify the source to the destination. This enables the destination to maintain state information for each source.
Both the Src_Tag and the Dst_Tag fields should be assigned sequential values, starting with 0. This maintains a correspondence between the range of these values and fields that specify the number of sources and/or destinations. For example, if a context has three sources, it can be inferred that the Src_Tag values have the values 0-2.
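As an illustration of the tag numbering, the following sketch builds one SN message per destination descriptor and checks that a set of tags is sequential from 0. The `SN` tuple layout and the helper names are hypothetical, as is the assumption that each destination descriptor records the Src_Tag by which the destination knows this source.

```python
from typing import NamedTuple


class SN(NamedTuple):
    """Hypothetical layout of the SN fields named in the text."""
    dst_id: str
    src_id: str
    dst_tag: int
    src_tag: int


def make_sns(src_id, dest_descriptors):
    """dest_descriptors: list of (dst_id, src_tag) in descriptor order.

    Dst_Tag is the position of the descriptor, so tags run 0..n-1.
    """
    return [SN(dst_id, src_id, dst_tag, src_tag)
            for dst_tag, (dst_id, src_tag) in enumerate(dest_descriptors)]


def tags_are_sequential(tags):
    """True if the tags are exactly 0..len(tags)-1 in some order."""
    return sorted(tags) == list(range(len(tags)))
```

With three destination descriptors, `make_sns` yields Dst_Tag values 0, 1, and 2, matching the three-output example given above.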
Destinations can maintain source state for each source, because source SN messages and input data are not synchronized among sources. In the extreme, a source can send an SN, the destination can respond with an SP message, and the source can provide input, up to the point of Set_Valid, before any other source has sent even an SN message (this is not common, but cannot be prevented). Under these conditions, the source can provide a second SN message for a subsequent input, and this should be distinguished from SN messages that will be received for current input. This is accomplished by keeping two bits of state information for each source, as shown in
As a result of the dataflow protocol, contexts can output data in any order, there is no timing relationship between them, and transfers are known to be successful ahead of time. There are no stalls or retransmissions on the interconnect. A single exchange of dataflow messages enables all transfers from source to destination, over the entire span of execution in the context, so the frequency of these messages is very low compared to the amount of data exchange that is enabled. Since there is no retransmission, the interconnect is occupied for the minimum duration required to transfer data. It is especially important not to occupy the interconnect for exchanges that are rejected because the receiving context is not ready, since this would quickly saturate the available bandwidth. Also, because data transfers between contexts have no particular ordering with other contexts, and because the nodes provide a large amount of buffering in the global input and global output buffers, it is possible to operate the interconnect at very high utilization without stalling the nodes. Because it enables execution to be dataflow-driven, the dataflow protocol tends to distribute data traffic evenly at the processing cluster 1400 level. This is because, in steady state, transfers between nodes tend to throttle to the level of input data from the system, meaning that interconnect traffic will relate to the relatively small portion of the image data received from the system at any given time. This is an additional benefit permitting efficient utilization of the interconnect.
Data transfer between node contexts has no ordering with respect to transfers between other contexts. From a conceptual, programming standpoint: 1) input variables of a program are set to their correct values before a program is invoked; 2) both the writer and the reader are sequential programs; and 3) the read order does not matter with respect to the write order. In the system, inputs to different contexts are distributed in time, but the Set_Valid signal achieves functionality that is logically equivalent to the programming view of a procedure call invoking the destination program. Data is sent as a set of random accesses to destinations, similar to writing function input parameters, and the Set_Valid signal marks the point at which the program would have been “called” in a sequential order of execution.
The out-of-order nature of data transfer between nodes cannot be maintained for data involving transfers to and from system memory, peripherals, hardware accelerators, and threaded node (standalone) contexts. Outside of the processing cluster 1400, data transfers are normally highly ordered, for example tied to a sequential address sequence that writes a memory buffer or outputs to a display. Within the processing cluster 1400, data transfer can be ordered to accommodate a mismatch between node context organizations. For example, ordering provides a means for data movement between horizontal groups and single, standalone contexts or hardware accelerators.
It can be difficult and costly to reconstruct the ordering expected and supplied by system devices using the dataflow mechanisms that transfer data out-of-order between nodes, because this could require a very large amount of buffering to re-order data (roughly the number of contexts times the amount of input and output data per context). Instead, it is much simpler to use the dataflow protocol to keep node input/output in order when communicating with these devices. This reduces complexity and hardware requirements.
To understand how ordering can be imposed, consider context outputs that are being sent to a hardware accelerator. The accelerator wrapper that interfaces the processing cluster 1400 to hardware accelerators can be designed specifically to adapt to that set of accelerators, to permit re-use of existing hardware. Accelerators often operate sequentially on a small amount of context, very different than nodes operating in parallel on large contexts. For node-to-node transfers, exchanges of dataflow messages set up context associations and impose flow control to satisfy dependencies for entire programs in all contexts. For an accelerator, the flow control should be on a per-context, per-node basis so that the accelerator can operate on data in the expected order.
The term thread is used to describe ordered data transfer to and from system memory 1416, peripherals, hardware accelerators, and standalone node contexts, referring to the sequential nature of the transfer. Horizontal groups contain information related to the ordering required by threads, because contexts are ordered through right-context pointers from the left boundary to the right boundary. However, this information is distributed among the contexts and is not available in one particular location. As a result, contexts should transmit information through the right-context pointers, in co-operation with the dataflow protocol, to impose the proper ordering.
Data received from a thread into a horizontal group of contexts is written starting at the left boundary. Conceptually, data is written into this context before transfers occur to the next context on its right (in reality, these can occur in parallel and still retain the ordering information). That context, in turn, receives data from the thread before transfers occur to the context on its right. This continues up to the right boundary, at which point the thread is notified to sequence back to the left boundary for subsequent input.
Analogously, data output from a horizontal group of contexts to a thread begins at the left boundary. Conceptually, data is sent from this context before output occurs from the context on its right (though, again, in reality these can occur in parallel). That context, in turn, sends data to the thread before transfers occur from the context on its right. This continues up to the right boundary, at which point the output sequences back to the left boundary for subsequent output.
When the thread is ready to provide input data, it sends an SN message to the left-boundary context (which is identified by a static entry in its destination descriptor). This SN indicates that the source is a thread (setting a bit in the message, Th=1). The SN message normally enables the destination context to indicate that it is ready for input, but a node context is ready by definition after initialization. In response to the SN message, the destination sends an SP message to the thread. This enables output to the context, and also provides the destination ID for this data (in general, the data is transferred to a context other than the one that receives the original SN message, as described below, though at start-up both the message and the data are sent to the left-boundary context). The thread records the destination ID in the destination descriptor, and uses this for transmitting data.
When the thread is ready to transmit data to the next ordered context, it sends a second SN to the left-boundary context (this occurs, at the latest, after the Set_Valid point, as shown in the figure, but can occur earlier as described below). This message has a bit set (Rt), indicating that the receiving context should forward the SN message to the next ordered context. This is accomplished by the receiving context notifying the context given by the right-context pointer that this context is going to receive data from a thread, including the thread source ID (segment, node, and thread IDs) and Src_Tag. This uses local interconnect, using the same path to the right-side context that is used to transmit side-context data.
The context to the right of the left boundary responds to this notification by sending its own SP to the thread, containing its own destination ID. This information, and the fact that the permission has been received, is stored in the thread's destination descriptor, replacing the destination ID of the left-boundary context (which is now either unused or stored in a private data buffer).
For read threads that access the system, the forwarded SN message can be transmitted before the Set_Valid point, in order to overlap system transfers and mitigate the effects of system latency (node thread sources cannot overlap because they execute sequential programs). If sufficient local buffering is available and system accesses are independent (e.g. no de-interleaving is required), the thread can initiate a transfer to the next context using the forwarded SP message, up to the point of having all reads pending for all contexts. The thread sends a number of SN messages to the sequence of destination contexts, depending on buffer availability. When all input to a context is complete, with Set_Valid, buffers are freed, and the transfer for the next destination ID can begin using the available buffers.
This process repeats up to the right-boundary context. The SP message contains a bit to indicate that the responding context is at the right boundary (Rt=1), and this indicates to the read thread the location of the boundary. At this point, the thread normally increments to the next vertical scan-line (a constant offset given by the width of the image frame, and independent of the context organization). It then repeats the protocol starting with an SN message, except in this case the SP messages are used to indicate that the destination contexts (center and side) are ready to receive data, in addition to notifying the thread of the context order. If a context receives a forwarded SN message and is not enabled for input, it records the SN message, and responds when it is ready.
When the thread is ready to transmit data for the next line, it repeats the protocol starting with an SN message, except in this case the SN message is sent to the right-boundary context with Rt=1. This is forwarded to the left-boundary context. Even though the right-boundary context does not provide side-context data to the left-boundary context, its right-context pointer points back to the left-boundary context, so that the thread can use an SN message to the right-boundary context to enable forwarding back to the left boundary.
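The ordering that results from forwarding SN messages through right-context pointers can be sketched as follows. This is a simplified model under stated assumptions: the names `Ctx`, `make_group`, `send_sn`, and `fill_order` are hypothetical, message exchange is reduced to a function call that returns the SP contents, and each context's ID doubles as its index in the group.

```python
class Ctx:
    """Hypothetical context in a horizontal group."""

    def __init__(self, ctx_id):
        self.ctx_id = ctx_id
        self.right = None            # right-context pointer
        self.right_boundary = False


def make_group(n):
    ctxs = [Ctx(i) for i in range(n)]
    for i, c in enumerate(ctxs):
        c.right = ctxs[(i + 1) % n]  # right boundary points back to the left
    ctxs[-1].right_boundary = True
    return ctxs


def send_sn(ctx, rt):
    """Send an SN to ctx and return the SP as (destination ID, Rt bit).

    With rt set, ctx forwards the SN through its right-context pointer
    and the neighbouring context answers with its own destination ID.
    """
    responder = ctx.right if rt else ctx
    return responder.ctx_id, responder.right_boundary


def fill_order(group, lines):
    """Return (line, destination ID) pairs in the order a thread writes."""
    order = []
    dest_id, at_right = send_sn(group[0], rt=False)   # start-up exchange
    for line in range(lines):
        while True:
            order.append((line, dest_id))
            was_right = at_right
            # The SN with Rt=1 goes to the current destination and is
            # forwarded to the next ordered context.
            dest_id, at_right = send_sn(group[dest_id], rt=True)
            if was_right:
                break    # SP reported the right boundary: next scan-line
    return order
```

For a group of three contexts and two scan-lines, the thread visits contexts 0, 1, 2 and then wraps to context 0 for the second line, without holding any description of the group layout itself.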
Node thread contexts should have two destination descriptors for any given set of destination contexts. The first of these contains the destination ID of the left-boundary context, and does not change during operation. The second contains the destination ID for the current output, and is updated during operation according to information received in SP messages. Since a node has four destination descriptors, this usually allows two outputs for thread contexts. The left-boundary destination IDs are contained in the first two words, and the destination IDs for the current output are in the second two words. A Dst_Tag value of 0 selects the first and third words, and a Dst_Tag value of 1 selects the second and fourth words.
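The word selection described above can be expressed as a small index calculation. The sketch below assumes the four descriptor words are held in a list indexed 0 to 3; the function names are hypothetical.

```python
def words_for_tag(dst_tag):
    """Return (left-boundary word index, current-output word index)."""
    if dst_tag not in (0, 1):
        raise ValueError("this sketch supports Dst_Tag 0 or 1 only")
    # Dst_Tag 0 -> first and third words; Dst_Tag 1 -> second and fourth.
    return dst_tag, dst_tag + 2


def on_sp(descriptor_words, dst_tag, sp_dest_id):
    """Record the destination ID from an SP in the current-output word."""
    _, current = words_for_tag(dst_tag)
    descriptor_words[current] = sp_dest_id   # left-boundary word is untouched
    return descriptor_words
```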
When the source outputs the final data, with Set_Valid, it forwards the SN message to the context given by the right-context pointer, indicating that the context should send an SN message to the thread, including the thread's destination ID and Dst_Tag (these are used to update the destination descriptor, because a previous value may be stale). This uses local interconnect, using the same path to the right-side context that is used to transmit side-context data. This context then sends an SN message to the thread when it is ready to output, with its own source ID, and the thread responds with an SP message when it is ready. As with all SP message responses, this contains a destination ID that the source places in its destination descriptor; the responding destination can be different from the one the original SN message is sent to (destinations can be re-routed). This SP message enables output from the source, also including a P_Incr value.
When the context at the right boundary sends an SN message to the thread, it indicates that the source context is at a right boundary (the Rt bit is set). This can cause the thread to sequence to the next scan-line, for example. Furthermore, the right-context pointer of the right-boundary context points back to the left-boundary context. This is not used for side-context data transfer, but instead permits the right-boundary context to forward the SN message for the thread to the left-boundary context.
Unlike thread sources, which can enable multiple contexts to receive data to mitigate system latency, thread destinations can be enabled for one source at a time. As long as the destination thread has sufficient input bandwidth, it should not affect performance of processing cluster 1400. Threads that output to the system should provide enough buffering to ensure that performance is generally not affected by instantaneous system bandwidth. Buffer availability is communicated using P_Incr, so the buffer can be less than the total transfer size.
If a program attempts to output to a destination that is not enabled for output, it is undesirable to stall, because this could consume execution resources for a long period of time. Instead, there is a special form of task-switch instruction that tests for the output being enabled for a particular Dst_Tag (this is executed on the scalar core and is very unlikely to affect performance). The node processor (i.e., 4322) compiler generates this instruction before any output with the given Dst_Tag, and this causes a task switch if output is not enabled, so that the scheduler can attempt to execute another program. This task switch usually cannot be implemented in hardware alone, because SIMD registers are not preserved across the task boundary, and the compiler should allocate registers accordingly.
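The effect of the output-enable test on scheduling can be sketched with a simple round-robin queue. This is an illustrative model only: the function name `run`, the representation of a task as a `(name, dst_tag)` pair, and the `sp_schedule` mapping of step number to SP arrival are all assumptions introduced for the sketch.

```python
from collections import deque


def run(tasks, sp_schedule, max_steps=1000):
    """Return task names in the order their outputs are performed.

    tasks: list of (name, dst_tag) pairs, each waiting to output.
    sp_schedule: {step: (name, dst_tag)} giving when each SP arrives.
    """
    enabled = set()
    queue = deque(tasks)
    done = []
    step = 0
    while queue:
        if step >= max_steps:
            raise RuntimeError("output never enabled for remaining tasks")
        if step in sp_schedule:
            enabled.add(sp_schedule[step])   # SP enables this output
        task = queue.popleft()
        if task in enabled:
            done.append(task[0])             # test passes: output proceeds
        else:
            queue.append(task)               # test fails: task switch
        step += 1
    return done
```

A task whose output is not yet enabled gives up the processor instead of stalling, so a task whose SP has already arrived can complete first.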
The combination of dependencies and ordering restrictions creates a potential deadlock condition that is avoided by special treatment during code generation. When a program attempts to access right-side context, and the data is not valid, there is a task switch so that the context on the right can execute and produce this data. However, one of these contexts can be enabled for output to a thread, normally the one on the left (or neither). If the context on the right attempts output, it cannot make progress because output is not enabled, but the context on the left cannot be enabled to execute until the one on the right produces right-context data and sets Rvlc.
To avoid this, code generation collects all output to a particular destination within the same task interval, the interval with the final output (Set_Valid). This permits the context on the left to forward the SN and enable output for the context on the right, avoiding this deadlock. The context on the right also produces output in the same task interval, so all such side-context deadlock is avoided within the horizontal group.
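The code-generation step can be sketched as a transformation over task intervals. The instruction encoding here is hypothetical: an output is the tuple `("out", dst_tag, set_valid)` and any other instruction is `("op", name)`. The sketch moves outputs from earlier intervals into the interval holding the final Set_Valid output for the same Dst_Tag; it does not model register allocation or value lifetimes.

```python
def collect_outputs(intervals):
    """Gather all outputs for a Dst_Tag into its Set_Valid interval.

    intervals: list of task intervals, each a list of instructions.
    """
    # Locate the interval holding the final (Set_Valid) output per Dst_Tag.
    final = {}
    for i, interval in enumerate(intervals):
        for ins in interval:
            if ins[0] == "out" and ins[2]:
                final[ins[1]] = i

    deferred = {tag: [] for tag in final}
    result = []
    for i, interval in enumerate(intervals):
        new = []
        for ins in interval:
            is_out = ins[0] == "out"
            if is_out and ins[1] in final and i < final[ins[1]]:
                deferred[ins[1]].append(ins)   # hold until the final interval
            else:
                if is_out and ins[1] in deferred:
                    # First output in the final interval: emit held outputs.
                    new.extend(deferred.pop(ins[1]))
                new.append(ins)
        result.append(new)
    return result
```

Outputs for a Dst_Tag that has no Set_Valid output anywhere are left in place, since the sketch has no final interval to move them to.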
Note that there are two task-switch instructions involved in this case: the one that begins the task interval for the side-context dependency and the one that tests for output being enabled. These usually cannot be the same instruction because the test for output enables is conditional on the output being enabled. The output-enable test and output instructions should be grouped as closely as possible, ideally in sequence. This provides the maximum time for the context on the right to receive the forwarded SN, exchange SN-SP messages with the destination, and enable output before the output-enable test. The round trip from SN to SP is typically 6-10 cycles, so this benefits all but very short task intervals.
Delaying the outputs to occur in the same interval usually does not affect performance, because the final output is the one that enables the destination, and the timing of this instruction is not changed by moving the others (if required) to occur in the same task interval. However, there is a slight cost in memory and register pressure, because output values have to be preserved until the corresponding output instructions can be executed, except when the instructions already naturally occur in the same interval.
Dataflow in processing cluster 1400 programs is initiated at system inputs and terminates at system outputs. There can be any number of programs, in any number of contexts, operating between the system input and output: the relative delay of a program output from system inputs is given by the OutputDelay field in the context descriptor(s) for that program (this field is set by the system programming tool 718). In addition to feed-forward dataflow paths from system input to output, there can also be feedback paths from a program to another program that precedes it in the feed-forward path (the OutputDelay of the feedback source is larger than the OutputDelay of the destination). A simple example of program feedback is illustrated in
The intent in this case is for A and B to execute after the first set of inputs from the system. It is generally impossible for the output of C to be provided to B for this first set of inputs, because C depends on input from B before it can execute. Instead of operating on input from C, B should use some initial value for this input, which can be provided by the same program that provides system input: it can write any variable in B at any point in execution, so during initialization it can write data that's normally written as feedback from C. However, B has to ignore the dependency on C up to the point where C can provide data.
It is usually sufficient for correctness for B to ignore the dependency on C the first time it executes, but this is undesirable from a performance standpoint. This would permit B (and A) to execute, providing input to C, but then B would be waiting for C to complete its feedback output before executing again. This has the effect of serializing the execution of B with C: B executes and provides input to C, then waits for C to provide feedback output before it executes again (this also serializes A, because C permits input from A when it is enabled to receive new input).
The desired behavior, for performance, is to execute A and B in parallel, pipelined with C and D. To accomplish this, B should ignore the lack of input from C until the third set of input from the system, which is received along with valid data from C. At this point, all four programs can execute in parallel: A and B on new system input, and C and D pipelined using the results of previous system input.
The feedback from C to B is indicated by FdBk=1 bit in C's destination descriptor for B. This enables C to satisfy the dependencies of B without actually providing valid data. Normally, C sends an SN message to B after it begins execution. However, if FdBk is set, C sends an SN to B as soon as it is scheduled to execute (all contexts scheduled for C send SNs to their feedback destinations). These SNs indicate a data type of “none” (00′b), which has the effect of resetting both ValFlag bits for this input to B, enabling it for execution once it receives system input.
The SP from B in response to the SN enables C to transmit another SN, with type set to 00′b, for the next set of inputs. The total number of these initial SNs is determined by the OutputDelay field in the context descriptor for C. C maintains a DelayCount field to track the number of initial SN-SP exchanges that have occurred. When DelayCount is equal to OutputDelay, C is enabled to execute using valid inputs by definition, and the SN messages reflect the actual output of C given by the destination-descriptor DataType field.
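The counting of initial feedback SN messages can be sketched as follows. The function name and the integer encoding of the actual data type are hypothetical; only the "none" type (00′b) and the DelayCount/OutputDelay comparison are taken from the description above.

```python
TYPE_NONE = 0b00   # data type "none", used for initial feedback SNs


def feedback_sn_types(output_delay, data_type, exchanges):
    """Return the Type field of each SN sent over a run of SN-SP exchanges.

    output_delay: the OutputDelay field of the feedback source.
    data_type: the destination-descriptor DataType (encoding assumed).
    """
    delay_count = 0
    types = []
    for _ in range(exchanges):
        if delay_count < output_delay:
            types.append(TYPE_NONE)   # satisfies the dependency, no data
            delay_count += 1          # counted when the SP is received
        else:
            types.append(data_type)   # valid output from this point on
    return types
```

With an OutputDelay of 2, the first two SN messages carry type 00′b and every later SN reflects the actual output type.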
This technique supports any number of feedback paths from any program to any previous program. In almost all cases, the OutputDelay is determined by the number of program stages from system input to the context's program output, regardless of the number and span of feedback paths from the program. The value of OutputDelay determines how many sets of system inputs are required before the feedback data is valid.
Source contexts maintain output state for each destination to control the enabling of outputs to the destination, and to order outputs to thread destinations. There are two bits of state for each output: one bit is used for output to non-threads (ThDst=0), and both bits are used for outputs to threads (ThDst=1). Outputs to threads are more complex because of the desire to both forward SNs and to hold back SNs to the thread until ordering restrictions are met. To simplify the discussion, these are presented as separate state sequences.
The output-state transitions for ThDst=0 are shown in
If the output is feedback, this triggers an SN message with Type=00′b as long as the value of DelayCount is less than OutputDelay. DelayCount is incremented for every SP received, until it reaches the value OutputDelay. At this point, the output state is 01′b, which enables output for normal execution (the final SP is a valid SP even though it's a response to a feedback output). By the definition of OutputDelay, the context receives valid input at this point and is enabled to execute. The program has to execute an END instruction before it is enabled to send a subsequent SN, which occurs when the program executes again.
The output-state transitions for ThDst=1 are shown in
When the final vector output occurs, with Set_Valid, the context forwards the SN message for the Dst_Tag using the right-context pointer. In most cases, the next event is that the program executes an END instruction, and the output state transitions back into the state where it is waiting for a forwarded SN message. However, the forwarded SN message enables other contexts to output and also forward SNs, so there is nothing to prevent a race condition where the context that just forwarded the SN receives a subsequent SN while it is still executing. This SN message should be recorded and held for subsequent execution. This is accomplished by the state 10′b, which records the forwarded SN message and waits until the program executes an END instruction before entering the state 00′b, where the SN is sent when the program begins execution again.
If the output to the thread is feedback, this triggers an SN message with Type=00′b as long as the value of DelayCount is less than OutputDelay. Since the output is to a thread destination, all dependencies for the horizontal group can be released by the left-most context, so this is the context that transmits feedback SN messages. DelayCount is incremented for every SP message received in the state 00′b, until it reaches the value OutputDelay. At this point, the output state is 01′b, which enables left-most context output for normal execution (the final SP message is a valid SP even though it is a response to a feedback output). By the definition of OutputDelay, the context receives valid input at this point and is enabled to execute. When the final vector output occurs, with Set_Valid, the context forwards the SN message, and normal operation begins.
The output-state transitions for Th=1, ThDst=0 are shown in
If the output is feedback, this triggers an SN message with Type=00′b as long as the value of DelayCount is less than OutputDelay. However, in this case the SN message has to be forwarded to all destination contexts, and the DelayCount value has to reflect an SN message to all of these contexts. Since the context isn't executing, it cannot distinguish, in the state 00′b, whether the SN message should have Rt set. Instead, the state 10′b is used in the feedback case to send the SN message with Rt=1, at which point the state transitions to 11′b and the context waits for the SP message from the next context: in this state, if Rt=1 in the previous SP message, indicating the right-boundary context, DelayCount is incremented. The next SP message causes a transition to the 01′b state. The transition 01′b→10′b→11′b→01′b continues until an SN message with Rt=1 has been sent to the right-boundary context, and DelayCount has then been incremented to the value OutputDelay. At this point, the output state is 01′b, which enables output for normal execution (the final SP message is a valid SP message, from the left-boundary context, even though it is a response to a feedback output). By the definition of OutputDelay, the context receives valid input at this point and is enabled to execute. When the program signals Set_Valid, it transitions to the state 00′b and normal operation resumes.
The output-state transitions for Th=1, ThDst=1 are shown in
If the output to the thread is feedback, this triggers an SN message with Type=00′b as long as the value of DelayCount is less than OutputDelay. DelayCount is incremented for every SP message received in the state 00′b, until it reaches the value OutputDelay. At this point, the output state is 01′b, which enables context output for normal execution (the final SP message is a valid SP message even though it's a response to a feedback output). By the definition of OutputDelay, the context receives valid input at this point and is enabled to execute. The program has to execute an END instruction before it's enabled to send a subsequent SN message, which occurs when the program executes again.
Programs can be configured to iterate on dataflow, in that they continue to execute on input datasets as long as these datasets are provided. This eliminates the burden of explicitly scheduling the program for every new set of inputs, but creates the requirement for data sources to signal the termination of source data, which in turn terminates the destination program. To support this, the dataflow protocol includes Output Termination messages that are used to signal the termination of a source program or a GLS read thread.
Output Termination (OT) messages are sent to the output destinations of a terminating context, at the point of termination, to indicate to the destination that the source will generate no more data. These messages are transmitted by contexts in turn as they terminate, in order to terminate all dataflow between contexts. Messages are distributed in time, as successive contexts terminate, and terminated contexts are freed as early as possible for new programs or inputs. For example, a new scan-line at the top of a frame boundary can be fetched into left-most contexts as right-side contexts are finishing execution at the bottom boundary of the previous frame.
Typically, dataflow termination is ultimately determined by a software condition, for example the termination of a FOR loop that moves data from a system buffer. Software execution is usually highly decoupled from data transfer, but the termination condition is detected after the final data transfer in hardware. Normally, the GLS processor 5402 (which is discussed in detail below) task that initiates the transfer is suspended while hardware completes the transfer, to enable other tasks to execute for other transfers. The task is re-scheduled when all hardware transfers are complete, and only after being re-scheduled can the termination condition be detected, resulting in OT messages.
When the destination receives the OT, it can be in one of two states: either still executing on previous input, or finished execution by executing an END instruction and waiting on new input. In the first case, the OT is recorded in a context-state bit called Input Termination (InTm), and the program terminates when it executes an END instruction. In the second case, the execution of the END instruction is recorded in a context-state bit called End, and the program terminates when it receives an OT. To properly detect the termination condition, the context should reset End at the earliest indication that it is going to execute at least one more time: this is when it receives any input data, either scalar or vector, from the interconnect, and before any local data buffering. This generally cannot be based on receiving an SN, which is usually an earlier indication that data is going to be received, because it's possible to receive an SN from a program that does not provide output due to program conditions that cause it to terminate before outputting data.
It also should not matter whether a source producing data is also the one that sends the OT. All sources terminate at the same logical point in execution, and all are required to hold their OT until after they complete output for the final transfer and terminate. Thus, at least one input arrives before any OT.
Receipt of any termination signal is sufficient to terminate a program in the receiving context when it executes an END instruction. Other termination signals can be received by the context before or after termination, but they are ignored after the first one has been received.
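The interaction between the InTm and End bits described above can be sketched as follows. This is a minimal illustration, not the hardware implementation; the class name ContextState and its method names are hypothetical, and only the two context-state bits and their transitions are taken from the description.

```python
from dataclasses import dataclass


@dataclass
class ContextState:
    """Hypothetical model of the termination-related context-state bits."""
    in_tm: bool = False       # InTm: an OT was recorded while still executing
    end: bool = False         # End: END executed, context is waiting on input
    terminated: bool = False

    def receive_input(self) -> None:
        # Any scalar or vector input data means at least one more execution,
        # so End is reset before the data is buffered locally.
        self.end = False

    def receive_ot(self) -> None:
        if self.terminated or self.in_tm:
            return  # termination signals after the first one are ignored
        if self.end:
            self.terminated = True   # second case: END was already executed
        else:
            self.in_tm = True        # first case: still executing prior input

    def execute_end(self) -> None:
        if self.terminated:
            return
        if self.in_tm:
            self.terminated = True   # OT arrived earlier; terminate at END
        else:
            self.end = True          # wait for either new input or an OT
```

In this sketch, a context that executes END, then receives new input, and then receives an OT does not terminate immediately, because the input reset End; it terminates at its next END instruction.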
Turning to
Additionally, the dataflow protocol can be implemented using information stored in the context-state RAM. An example for a program allocated five contexts is shown in
The remaining entries of the context-state RAM are used to buffer information related to the dataflow protocol and to control operation in the context. The first of these entries is a table of pending SP messages, which are to be sent once the context is free for new input, in a pending permission table. The second is a set of control information related to context dependencies and the dataflow protocol, called the dataflow state.
In
Looking first to the pending permissions 4202, which can be seen in
Now looking to the dataflow state 4210, which can be seen in
The node wrapper (i.e., 810-i), which is described below, schedules active, resident programs on the node (i.e., 808-i) using a form of pre-emptive multi-tasking. This generally optimizes node resource utilization in the presence of unresolved dependencies on input or output data (including side contexts). In effect, the execution order of tasks is determined by input and output dataflow. Execution can be considered data-driven, although scheduling decisions are usually made at instruction-specified task boundaries, and tasks cannot be pre-empted at any other point in execution.
The node wrapper (i.e., 810-i) can include an 8-entry queue, for example, for active resident programs scheduled by a Schedule Node Program message. This queue 4206, which can be seen in
Scheduling decisions are usually made at task boundaries because SIMD-register context is not preserved across these boundaries and the compiler 706 allocates registers and spill/fill accordingly. However, the system programming tool 718 can force the insertion of task boundaries to increase the possibility of optimum task-scheduling decisions, by increasing the opportunities for the node wrapper to make scheduling decisions.
Real-time scheduling typically prioritizes programs in queue order (mostly round-robin), but actual execution is data-dependent. Based on dependency stalls known to exist in the next sequential task to be scheduled, the scheduler can pre-empt this task to execute the same program (a subsequent task) in an earlier context, and can also pre-empt a program to execute another program further down in the program queue. Pre-empted tasks or programs are resumed at the earliest opportunity once the dependencies are resolved.
Tasks are generally maintained in queue order as long as they have not terminated. Normally, the wrapper (i.e., 810-i) schedules a program to execute all tasks in all contexts before scheduling the next entry on the queue. At this point, the program that has just completed all tasks in all contexts can either remain resident on the queue or can terminate, based on a bit in the original scheduling message (Te). If the program remains resident, it is terminated eventually by an Output Termination message; this allows the same program to iterate based on dataflow rather than constantly being rescheduled. If it terminates early, based on the Te bit, this can be used to perform finer-grained scheduling of task sequences using the control node 1406 for event ordering.
Generally, hardware maintains, in the context-state RAM, an identifier of the program-queue entry associated with the context. Program-queue entries are assigned by hardware as a result of scheduling messages. This identifier is generally used by hardware to remove the program-queue entry when all execution has terminated in all contexts. This is indicated by Bk=1 in the descriptor of the context that encounters termination. The End bit in the program queue is a hint that a previous context has encountered an END instruction, and it is used to control scheduling decisions for the final context (where Bk=1), when the program is possibly about to be removed from the queue 4230. Each context transmits its own set of Output Termination messages when the context terminates, but a Node Program Termination message is not sent to the control node 1406 until all associated contexts have completed execution.
When a program is scheduled, the base context number is used to detect whether or not any output of the program is a feedback output, and the queue-entry FdBk bit is set if any destination descriptor has FdBk set. This indicates that all associated context descriptors should be used to satisfy feedback dependencies before the program executes. When there is no feedback, the dataflow protocol doesn't start operating until the program begins execution.
Assuming no dependency stalls, program execution begins at the first entry of the task queue, at the initial program counter or PC and base context given by this entry (received in the original scheduling message). When the program encounters a task boundary, the program uses the initial PC to begin execution in the next sequential context (the previous task's PC is stored in the context save area of processor data memory, since it is part of the context for the previous task). This proceeds until the context with the Bk bit set is executed—at this point, execution resumes in the base context, using the PC from that context save area (along with other processor data memory context). Execution normally proceeds in this fashion, until all contexts have ended execution. At this point, if the Te bit is set, the program terminates and is removed from the program queue—otherwise it remains on the queue. In the latter case, new inputs are received into the program's contexts, and scheduling at some point will return to this program in the updated contexts.
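The stall-free execution order described above can be illustrated with a short sketch. The function run_program and its parameters are hypothetical; the list saved_pc stands in for the context save area, the last context index plays the role of the context with Bk set, and a task index stands in for the PC.

```python
def run_program(num_contexts, tasks_per_context, te):
    """Trace task execution across contexts, assuming no dependency stalls.

    Returns (trace, remains_on_queue), where trace is a list of
    (context_number, task_index) pairs in execution order.
    """
    saved_pc = [0] * num_contexts      # every context starts at the initial PC
    trace = []
    while any(pc < tasks_per_context for pc in saved_pc):
        # Walk from the base context to the Bk context, then wrap around.
        for ctx in range(num_contexts):
            pc = saved_pc[ctx]
            if pc < tasks_per_context:
                trace.append((ctx, pc))
                saved_pc[ctx] = pc + 1   # saved at the task boundary
    # Te set: the program leaves the queue; otherwise it stays resident.
    return trace, not te
```

For three contexts and two tasks per context, the trace visits contexts 0, 1, 2 for the first task and then 0, 1, 2 again for the second task.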
As just described, tasks normally execute contexts from left to right, because this is the order of context allocation in the descriptors and implemented by the dataflow protocol. As explained above, this is a better match to the system dataflow for input and outputs, and satisfies the largest set of side-context dependencies. However, at the boundaries between nodes (i.e., between nodes 808-i and 808-(i+1)), it is possible that the task which provides Rlc data, in an adjacent node, has not begun execution yet. It is also possible, for example, because of data rates at the system level, that a context has not received a Set_Valid or a Source Permission message to allow it to begin execution. The scheduler first uses task pre-emption to attempt to schedule around the dependency, then, in a more general case, uses program pre-emption to attempt to schedule around the dependency. Task and program pre-emption are described below.
Now, referring back to
There is usually one entry on the program queue to track pre-emptive contexts, so task pre-emption is effectively nested one-deep. If a stalled context is encountered when there is a valid entry in the Pre-empt_Ctx# field (the Pre bit is set), the scheduler cannot use task pre-emption to schedule around the stall, and uses program pre-emption instead. In this case, the program-queue entry remains in its current state, so that it can be properly resumed when the dependency is resolved.
If the scheduler cannot avoid stalls using task pre-emption, it attempts to use program pre-emption instead. The scheduler searches the program queue, in order, for another program that is ready to execute, and schedules the first program that has a ready task. Analogous to task pre-emption, the scheduler will schedule the pre-empted program at the earliest task boundary after the pre-empted program becomes ready. At this point, execution returns to round-robin order within the program queue until the next point of program pre-emption.
To summarize, the scheduler prefers scheduling tasks in context order given by the descriptors, until all contexts have completed execution, followed by scheduling programs in program-queue order. However, it can schedule tasks or programs out of order (first attempting tasks and then programs), restoring the original order as soon as possible. Data dependencies keep programs in a correct order, so actual order doesn't matter for correctness. However, preferring this scheduling order is likely the most efficient in terms of matching system-level input and output.
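The preference order summarized above can be sketched as a single decision function. This is an illustrative simplification: the function name choose_next, its arguments, and its return labels are hypothetical, and readiness is reduced to booleans rather than derived from the context-state RAM.

```python
def choose_next(next_task_ready, preemptive_task_ready, pre_bit_set,
                other_programs_ready):
    """Pick what to run at the next task boundary.

    other_programs_ready lists readiness of the other program-queue
    entries, in queue order.
    """
    if next_task_ready:
        return ("sequential_task", None)
    # Task pre-emption is nested one-deep: it is unavailable while the
    # Pre bit already marks a valid Pre-empt_Ctx# entry.
    if not pre_bit_set and preemptive_task_ready:
        return ("task_preemption", None)
    # Otherwise search the program queue, in order, for a ready program.
    for index, ready in enumerate(other_programs_ready):
        if ready:
            return ("program_preemption", index)
    return ("stall", None)
```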
The scheduler uses pointers into the program queue that indicate both the next program in sequential order and the pre-emptive program. It is possible that all programs are executed in the pre-emptive sequence without the pre-empted program becoming ready, and in this case the pre-emptive pointer is allowed to wrap across the sequential program (but the sequential program retains priority whenever it becomes ready). This wrapping can occur any number of times. This case arises because system programming tool 718 sometimes has to increase the node allocation for a program to provide sufficient SIMD data memory, rather than because of throughput requirements. However, increasing the node allocation also increases throughput for the program (i.e., more pixels per iteration than required)—by a factor determined by the number of additional nodes (i.e., using three nodes instead of one triples the potential throughput of this program). This means that the program can consume input and produce output much faster than it can be provided or consumed, and the execution rate is throttled by data dependencies. Pre-emption has the effect in this case of allowing the node allocation to make progress around the stalled program, effectively bringing the pre-empted program back down to the overall throughput for the use-case.
The scheduler also implements pre-emption at task boundaries, but makes scheduling decisions in advance of these boundaries. It is important that scheduling add no overhead cycles, and so scheduling cannot wait until the task boundary to determine the next task or program to execute—this can take multiple accesses of the context-state RAM. There are two concurrent algorithms used to decide between task pre-emption and program pre-emption. Since task boundaries are generally imperative—determined by the program code—and since the same code executes in multiple contexts, the scheduler can know the interval between task boundaries in the current execution sequence. The left-most context determines this value, and enables the hardware to count the number of cycles between the beginning of a task in this context and the next task switch. This value is placed in the program queue (it varies from task to task).
During execution in the current context, the scheduler can also inspect other entries on the program queue in the background, assuming that the context-state RAM is not desired for other purposes. If either the base, next, or pre-emptive context is ready in another program, the task-queue entry for that program is set ready (Rdy=1). At that point, this background scheduling operation returns to the next sequential program, and repeats the search: this keeps ready tasks in roughly round-robin order. By counting down the current task interval, the scheduler can determine when it is several cycles in advance of the next task boundary. At this point it can inspect the next task in the current program, and, if that task is not ready, it can decide on task pre-emption, if there is a pre-emptive task that can be run, or it can decide to schedule the next ready program in the program queue. In this manner, the scheduling decision is known with reasonably high accuracy by the time the task boundary is encountered. This also provides sufficient time to prepare for the task switch by fetching the program counter or PC for the next task from the context save area.
Turning to
Typically, loads and stores (from load store unit 4318-i) move data between SIMD data-memory locations and SIMD local registers, which can, for example, represent up to 64, 16-bit pixels. SIMD loads and stores use shared registers 4320-i for indirect addressing (direct addressing is also supported), but SIMD addressing operations read these registers: addressing context is managed by the core 4320. The core 4320 has a local memory 4328 for register spill/fill, addressing context, and input parameters. There is a partition instruction memory 1404-i provided per node, where it is possible for multiple nodes to share partition instruction memory 1404-i, to execute larger programs on datasets that span multiple nodes.
Node 808-i also incorporates several features to support parallelism. The global input buffer 4316-i and global output buffer 4310-i (which in conjunction with Lf and Rt buffers 4314-i and 4312-i generally comprise input/output (IO) circuitry for node 808-i) decouple node 808-i input and output from instruction execution, making it very unlikely that the node stalls because of system IO. Inputs are normally received well in advance of processing (by SIMD data memory 4306-1 to 4306-M and functional units 4308-1 to 4308-M), and are stored in SIMD data memory 4306-1 to 4306-M using spare cycles (which are very common). SIMD output data is written to the global output buffer 4310-i and routed through the processing cluster 1400 from there, making it unlikely that a node (i.e., 808-i) stalls even if the system bandwidth approaches its limit (which is also unlikely). SIMD data memories 4306-1 to 4306-M and the corresponding SIMD functional units 4308-1 to 4308-M are each collectively referred to as a “SIMD unit.”
SIMD data memory 4306-1 to 4306-M is organized into non-overlapping contexts, of variable size, allocated either to related or unrelated tasks. Contexts are fully shareable in both horizontal and vertical directions. Sharing in the horizontal direction uses read-only memories 4330-i and 4332-i, which are typically read-only for the program but writeable by the write buffers 4302-i and 4304-i, load/store (LS) unit 4318-i, or other hardware. These memories 4330-i and 4332-i can also be about 512×2 bits in size. Generally, these memories 4330-i and 4332-i correspond to pixel locations to the left and right relative to the central pixel locations operated on. These memories 4330-i and 4332-i use a write-buffering mechanism (i.e. write buffers 4302-i and 4304-i) to schedule writes, where side-context writes are usually not synchronized with local access. The buffer 4302-i generally maintains coherence with adjacent pixel (for example) contexts that operate concurrently. Sharing in the vertical direction uses circular buffers within the SIMD data memory 4306-1 to 4306-M; circular addressing is a mode supported by the load and store instructions applied by the LS unit 4318-i. Shared data is generally kept coherent using system-level dependency protocols described above.
Context allocation and sharing is specified by SIMD data memory 4306-1 to 4306-M context descriptors, in context-state memory 4326, which is associated with the node processor 4322. This memory 4326 can be, for example, a 16×16×32 bit or 2×16×256 bit RAM. These descriptors also specify how data is shared between contexts in a fully general manner, and retain information to handle data dependencies between contexts. The Context Save/Restore memory 4324 is used to support 0-cycle task switching (which is described above), by permitting registers 4320-i to be saved and restored in parallel. SIMD data memory 4306-1 to 4306-M and processor data memory 4328 contexts are preserved using independent context areas for each task.
SIMD data memory 4306-1 to 4306-M and processor data memory 4328 are partitioned into a variable number of contexts, of variable size. Data in the vertical frame direction is retained and re-used within the context itself. Data in the horizontal frame direction is shared by linking contexts together into a horizontal group. It is important to note that the context organization is mostly independent of the number of nodes involved in a computation and how they interact with each other. The primary purpose of contexts is to retain, share, and re-use image data, regardless of the organization of nodes that operate on this data.
Typically, SIMD data memory 4306-1 to 4306-M contains (for example) pixel and intermediate context operated on by the functional units 4308-1 to 4308-M. SIMD data memory 4306-1 to 4306-M is generally partitioned into (for example) up to 16 disjoint context areas, each with a programmable base address, with a common area accessible from all contexts that is used by the compiler for register spill/fill. The processor data memory 4328 contains input parameters, addressing context, and a spill/fill area for registers 4320-i. Processor data memory 4328 can have (for example) up to 16 disjoint local context areas that correspond to SIMD data memory 4306-1 to 4306-M contexts, each with a programmable base address.
Typically, the nodes (i.e., node 808-i), for example, have three configurations: 8 SIMD registers (first configuration); 32 SIMD registers (second configuration); and 32 SIMD registers plus three extra execution units in each of the smaller functional units (third configuration).
As an example,
Looking first to the processor core, the node processor 4322 generally executes all the control related instructions and holds all the address register values and special register values for SIMD units shown in register files 4340 and 4342 (respectively). Up to six (for example) memory instructions can be calculated in a cycle. For address register values, the address source operands are sent to node processor 4322 from the SIMD unit shown, and the node processor 4322 sends back the register values, which are then used by SIMD unit for address calculation. Similarly, for special register values, the special register source operands are sent to node processor 4322 from the SIMD unit shown, and the node processor 4322 sends back the register values.
Node processor 4322 can have (for example) 15 read ports and six write ports for SIMD. Typically, the 15 read ports include (for example) 12 read ports that accommodate two operands (i.e., lssrc and lssrc2) for each of six memory instructions and three ports for special register file 4342. Typically, special register file 4342 includes two registers named RCLIPMIN and RCLIPMAX, which should be provided together and which are generally restricted to the lower four registers of the 16 entry register file 4342. RCLIPMAX and RCLIPMIN registers are then specified directly in the instruction. The other special registers RND and SCL are specified by a 4-bit register identifier and can be located anywhere in the 16 entry register file 4342. Additionally, node processor 4322 includes a program counter execution unit 4344, which can update the instruction memory 1404-i.
Turning now to the LS unit 4318-i and SIMD unit, the general structure for each can be seen in
Additionally, for the three example configurations for a node (i.e., node 808-i), the sizes of some components (i.e., logic unit 4352-1) or the corresponding instruction may vary, while others may remain the same. The LS data memory 4339, lookup table, and histogram remain relatively the same. Preferably, the LS data memory 4339 can be about 512×32 bits with the first 16 locations holding the context base addresses and the remaining locations being accessible by the contexts. The lookup table or LUT (which is generally within the PC execution unit 4344) can have up to 12 tables with a memory size of 16 Kb, wherein four bits can be used to select a table and 14 bits can be used for addressing. Histograms (which are also generally located in the PC execution unit 4344) can have 4 tables, where the histogram shares the 4-bit ID with the LUT to select a table and uses 8 bits for addressing. In Table 1 below, the instruction sizes for each of the three example configurations can be seen, which can correspond to the sizes of various components.
Looking first to
Turning to
As shown in
In
As shown, the functional unit (referred to here as 4338) includes a multiplexer or mux 4602, register file (referred to here as 4358), execution unit 4603, and mux 4644. Mux 4602 (which can be referred to as a pixel mux for imaging applications) includes muxes 4648 and 4650 (which are each, for example, 7:1 muxes). As shown, the register file 4358 generally comprises muxes 4604, 4606, 4608, and 4610 (which are each, for example, 4:1 muxes) and registers 4612, 4614, 4618, and 4620. Execution unit 4603 generally comprises muxes 4622, 4624, 4626, 4628, 4630, 4632, 4634, 4638, and 4640 (which are each, for example, one of a 2:1, 4:1, or 5:1 mux), multiply unit (referred to here as 4354), left logic unit (referred to here as 4352), and right logic unit (referred to here as 4656). Muxes 4244 and 4246 (which can, for example, be 4:1 muxes) are also included. Typically, the mux 4602 can perform pixel selection (for example) based on an address that is provided. In Table 2 below, an example of pixel selection and pixel address can be seen.
In operation, functional unit 4338 performs operations in several stages. In the first stage, instructions are loaded from instruction memory (i.e., 1404-i) to an instruction register (i.e., LS register file 4340). These instructions are then decoded (by LS decoder 4334, for example). In the next few stages, there are typically pipeline delays that are one or more cycles in length. During this delay, several of the special registers from file 4342 (such as CLIP, RND) can be read. Following the pipeline delays, the register file (i.e., register file 4342) is read while the operands are muxed, followed by execution and write back to the functional unit registers (i.e., SIMD register file 4358), with the result being forwarded to a parallel store instruction.
As an example (which is shown in
Generally, the SIMD pipeline for the nodes (i.e., 808-i) is an eight stage pipeline. In the first stage, an Instruction Packet is fetched from instruction memory (i.e., 1404-i) by the node processor (i.e., 4322). This Instruction Packet is then decoded in the second stage (where addresses are calculated and registers for address are read). In the third stage, bank conflicts are resolved and addresses are sent to the bank (i.e., SIMD data memory 4306-1 to 4306-M). In the fourth stage, data is loaded to the banks (i.e., SIMD data memory 4306-1 to 4306-M). A cycle can then be introduced (in the fifth stage) to provide flexibility in the placement of data into the banks (i.e., SIMD data memory 4306-1 to 4306-M). SIMD execution is performed in the sixth stage, and data is stored in stages seven and eight.
The addresses for SIMD loads and SIMD stores are calculated using registers 4320-i. These registers 4320-i are read in the decode stage, while address calculations are also performed. The address calculation can use an immediate address, register plus immediate addressing, or circular buffer addressing. The circular buffer addressing can also do boundary processing for loads. No boundary processing takes place for stores. Also, SIMD loads can indicate if the functional unit is accessing its central pixels or its neighboring pixels. The neighboring pixels can be its immediate 2 pixels on the left and right. Thus a SIMD register can (for example) receive 6 pixels: 2 central pixels, 2 pixels on the left of the 2 central pixels, and 2 pixels on the right of the 2 central pixels. The pixel mux is then used to steer the appropriate pixels into the low and high portion of the SIMD register. The address can be the same for the entire center context and side context memories (that is, all 512 bits of center context, 32 bits of left context, and 32 bits of right context memory are accessed using this address), and there are 4 such loads. The data that gets loaded into the 16 functional units can be different, as the data in the SIMD DMEMs are different.
All addresses generated by SIMD and processor 4322 are relative offsets. They are made absolute by the addition of a base. The base for SIMD data memory is called the context base; it is provided by node wrapper 810-i and added to the offset generated by SIMD. This absolute address is what is used to access SIMD data memory. The context base is stored in the context descriptors as described above and is maintained by node wrapper 810-i based on which context is executing. Similarly, all processor 4322 addresses go through this transformation. The base address is kept in the top 8 locations of the data memory 4328, and again node wrapper 810-i provides the appropriate base to processor 4322 so that all addresses processor 4322 provides have this base added to their offset.
There is also a global area reserved for spills in SIMD data memory. Following instructions can be used to access the global area:
LD *uc9, ua6, dst
ST dst, *uc9, ua6
Here, uc9 is taken from uc9[8:0]. When uc9[8] is set, the context base from the node wrapper is not added to calculate the address; the address is simply uc9[8:0]. If uc9[8] is 0, then the context base from the wrapper is added. Using this support, variables can be stored from the SIMD DMEM top address and grow downward like a stack by manipulating uc9.
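The base-plus-offset translation and the global spill-area bypass can be sketched as follows. The function names are hypothetical, the ua6 operand is not modeled, and the sketch assumes only what is stated above: offsets are relative to the context base unless uc9[8] is set.

```python
def simd_address(offset, context_base):
    """Absolute SIMD data memory address for a context-relative offset."""
    return context_base + offset


def spill_address(uc9, context_base):
    """Address for the LD/ST forms that take a 9-bit uc9 operand."""
    uc9 &= 0x1FF                 # keep uc9[8:0]
    if uc9 & 0x100:              # uc9[8] set: global area, base not added
        return uc9
    return context_base + uc9    # uc9[8] clear: context-relative access
```

With a context base of 0x40, an operand of 0x010 resolves to 0x50, while an operand of 0x1F0 resolves to 0x1F0 regardless of which context is executing.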
SIMD loads/SIMD stores, scalar output, vector output instructions have 3 different addressing modes—immediate mode, register plus immediate mode, and circular buffer addressing mode. The circular buffer addressing mode is controlled by the Vertical Index Parameter (VIP) that is held in one of the registers 4320-i and has the following format shown in
LD .LS1-.LS4 *lssrc(lssrc2),sc4, ua6, dst
Circular buffer address calculation is done as follows:
Circular buffer address calculation is:
In addition to boundary processing at the top and bottom, mirroring/repeating also affects what gets loaded into SIMD registers at the left and right boundaries, because there is no valid data when neighboring pixels are accessed at those boundaries.
When the frame is at the left or right edge, the descriptor will have Lf or Rt bits set. At the edges, the side context memories do not have valid data and hence the data from center context is either mirrored or repeated. Mirroring or repeating is indicated by mode bits in the VIP register where: Mirror when mode bits=01; and Repeat when mode bits=10. Pixels at the left and right edges are mirrored/repeated as shown below in
When Max_mode is indicated and (TF=1) or (BF=1), then register gets loaded with max value of 16′h 7FFF. When Lf=1 or Rt=1 and max_mode is indicated, then again if side pixels are being accessed, the register gets loaded with max value of 16′h 7FFF. Note that both horizontal boundary processing (Lf=1 or Rt=1) and vertical boundary processing (TF=1 or BF=1 and mode!=2′b00) can happen at same time. Addresses do not matter when max_mode is indicated.
Now, looking to the node wrapper 810-i, it is used to schedule programs that reside in partition instruction memory 1404-i, signal events on the node 808-i, initialize the node configuration, and support node debug. The node wrapper 810-i has been described above with respect to scheduling, using its program queue 4230-i. Here, however, the hardware structure for the node wrapper 810-i is generally described.
In
As shown in
In
Node wrapper 810-i generally comprises buffers for messaging, descriptor memory (which can be about 16×256 bits), and program queue 4230-i. Generally, node wrapper 810-i interprets messages and interacts with the SIMDs (SIMD data memories and functional units) for input/outputs, as well as performing task scheduling and providing the PC to node processor 4322.
Within node wrapper 810-i is a message wrapper. This message wrapper has a multi-entry (i.e., 2-entry) buffer that is used to hold messages, and when this buffer becomes full and the target is busy, the target can be stalled to empty the buffer. If the target is busy and the buffer is not full, then the buffer holds on to the message, waiting for an empty cycle to update the target.
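The buffering policy of the message wrapper can be sketched as a small cycle-level model. The class MessageWrapper and its method names are hypothetical; the sketch only assumes the behavior stated above (hold while the target is busy and an entry is free, stall the target once the buffer is full).

```python
from collections import deque


class MessageWrapper:
    """Hypothetical model of the 2-entry message buffer."""

    def __init__(self, depth=2):
        self.depth = depth
        self.buffer = deque()

    def is_full(self):
        return len(self.buffer) >= self.depth

    def accept(self, message):
        """Queue a message; returns False if no entry is free."""
        if self.is_full():
            return False
        self.buffer.append(message)
        return True

    def cycle(self, target_busy):
        """Advance one cycle; returns (delivered_message, target_stalled)."""
        if not self.buffer:
            return None, False
        if not target_busy:
            return self.buffer.popleft(), False   # empty cycle: update target
        if self.is_full():
            # Full buffer and busy target: stall the target to drain an entry.
            return self.buffer.popleft(), True
        # Busy target with room left: hold the message for a later cycle.
        return None, False
```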
Typically, the control node 1406 provides messages to the node wrapper 810-i. The messages from control node can follow this example pipeline:
Turning to
Within a SIMD, the left-most pixels are associated with functional units, with F7 being the left-most functional unit, then higher addresses going to F6, F5, etc. The SIMD preset values, which identify the functional unit and SIMD, are set as follows, where pixel_position is an 8-bit value that is in the descriptor context, preset_simd is a 4-bit number identifying the SIMD number, and the least significant 4 bits are the functional unit number, ranging from 0 through f:
f0_preset0_data={pixel_position, preset_simd, 4′hf};
f0_preset1_data={pixel_position, preset_simd, 4′he};
f1_preset0_data={pixel_position, preset_simd, 4′hd};
f1_preset1_data={pixel_position, preset_simd, 4′hc};
f2_preset0_data={pixel_position, preset_simd, 4′hb};
f2_preset1_data={pixel_position, preset_simd, 4′ha};
f3_preset0_data={pixel_position, preset_simd, 4′h9};
f3_preset1_data={pixel_position, preset_simd, 4′h8};
f4_preset0_data={pixel_position, preset_simd, 4′h7};
f4_preset1_data={pixel_position, preset_simd, 4′h6};
f5_preset0_data={pixel_position, preset_simd, 4′h5};
f5_preset1_data={pixel_position, preset_simd, 4′h4};
f6_preset0_data={pixel_position, preset_simd, 4′h3};
f6_preset1_data={pixel_position, preset_simd, 4′h2};
f7_preset0_data={pixel_position, preset_simd, 4′h1};
f7_preset1_data={pixel_position, preset_simd, 4′h0};
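The concatenations listed above each form a 16-bit value from an 8-bit, a 4-bit, and a 4-bit field. A minimal sketch of that packing follows; the function name preset_data is hypothetical, and the field widths are the ones given in the text.

```python
def preset_data(pixel_position, preset_simd, unit_number):
    """Pack {pixel_position[7:0], preset_simd[3:0], unit_number[3:0]}."""
    return (((pixel_position & 0xFF) << 8)
            | ((preset_simd & 0xF) << 4)
            | (unit_number & 0xF))


# f0_preset0_data uses unit number 4'hf; f7_preset1_data uses 4'h0.
f0_preset0_data = preset_data(0x12, 0x3, 0xF)
f7_preset1_data = preset_data(0x12, 0x3, 0x0)
```

The pixel position 0x12 and SIMD number 0x3 are arbitrary example inputs chosen only to make the field boundaries visible in the packed result.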
The global IO buffer (i.e., 4310-i and 4316-i) is generally comprised of two parts: a data structure (which is generally a 16×256 bit structure) and a control structure (which is generally a 4×18 bit structure). Generally, four entries are used for the data structure, since the data structure is 16 entries deep and each line of data occupies four entries. The control structure can be updated in two bursts with the first sets of data and, for example, can have the following fields:
[8:0]: data memory offset
[12:9]: destination context number
[12]: set_valid
[13]: reserved
[15:14]: memory type
[16]: fill
[17]: reserved
[18]: output/input killed
[25:19]: shared function-memory offset
[31:26]: reserved
Typically, the data structure of the global IO buffer (i.e., 4310-i and 4316-i) can, for example, be made up of six 16×256 bit buffers. When input data is received from data interconnect 814, the input data is placed in, for example, 4 entries of the first buffer. Once the first buffer is written, the next input will be placed in the second buffer. This way, when the first buffer is being read to update SIMD data memory (i.e., 4306-1), the second buffer can receive data. The third through sixth buffers are used (for example) for outputs, lookup tables, and miscellaneous operations like scalar output and node state read data. The third through sixth buffers are generally operated as one entity, and data is loaded horizontally into one entry, while the first and second buffers each take 4 entries. The third through sixth buffers are generally designed to be the width of the 4 SIMDs, which reduces the time it takes to push output values or a lookup table value into the output buffers to one cycle, rather than the four cycles it would have taken if there had been one buffer that was loaded vertically like the first and second buffers.
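The alternation between the first and second input buffers can be sketched as a simple ping-pong structure. The class InputPingPong and its methods are hypothetical; the sketch assumes only that each input line occupies four entries and that successive inputs alternate between the two buffers so that one can be read while the other is written.

```python
class InputPingPong:
    """Hypothetical model of the first and second global input buffers."""

    ENTRIES_PER_LINE = 4

    def __init__(self):
        self.buffers = [[None] * self.ENTRIES_PER_LINE for _ in range(2)]
        self.write_index = 0     # buffer that receives the next input line

    def write_line(self, entries):
        """Store one four-entry input line; returns the buffer used."""
        if len(entries) != self.ENTRIES_PER_LINE:
            raise ValueError("each input line occupies four entries")
        target = self.write_index
        self.buffers[target] = list(entries)
        self.write_index ^= 1    # the next input goes to the other buffer
        return target

    def read_line(self, index):
        """Read a buffer, for example to update SIMD data memory."""
        return list(self.buffers[index])
```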
An example of the write pipeline for the example arrangement described above is as follows. On the first clock cycle, a command and data (i.e., burst) are presented, which are accepted on the rising edge of the second clock cycle. In the third clock cycle, the data is sent to all of the nodes (i.e., 4 nodes) of the partition (i.e., 1402-i). On the rising edge of the fourth clock cycle, the first entry of the first buffer from the global IO buffer (i.e., 4310-i and 4316-i) is updated. Thereafter, the remaining three entries are updated during the successive three clock cycles. Once entries for the first buffer are written, subsequent writes can be performed for the second buffer. There is a 2-bit (for example) counter that points to the appropriate buffer (i.e., first through sixth) to be written into, which is, for example, cycle seven for the second buffer, and twelve for the third buffer. Typically, four of the buffers can be unified into (for example) a 16×37 bit structure with the following fields:
Turning now to the communication between the global IO buffer (i.e., 4310-i and 4316-i) and the SIMD data structures of the nodes (i.e., 808-i), global IO buffer read and update of SIMD generally has three phases, which are as follows: (1) center context update; (2) right side context update; and (3) left side context update. To do this, the descriptor is first read using the context number that is stored in the control structure, which can be performed in the first two clock cycles (for example). If the descriptor is busy, then the read of the descriptor is stalled until the descriptor can be read. When the descriptor is read in a third clock cycle (for example), the following example information can be obtained from the descriptor:
(1) a 4-bit Right Context;
(2) a 4-bit Right node;
(3) a 4-bit Left Context;
(4) a 4-bit Left node;
(5) a Context Base; and
(6) Lf and Rt bits to see if side context updates should be done.
Typically, the context base is also added to SIMD data memory in this third cycle, and the above information is stored in a fourth cycle. Additionally, in the third clock cycle, a read for a buffer within global IO buffer (i.e., 4310-i and 4316-i) is set up, and the read is performed in the fourth cycle, reading, for example, 256 bits of data. This data is then muxed and flopped in a fifth clock cycle, and the center context can be set up to be updated in a sixth clock cycle. If there is a bank conflict, then the update can be stalled. At the same time, the rightmost two pixels can be sent for update using the right context pointer (which generally consists of a context number and a node number). The right context pointer can be examined to see if there is a direct update to a neighboring node (if the node number of the current node + 1 = the right context node number, then it is a direct update), a local update to itself (if the node number of the current node = the right context node number, then it is a local update to its own memories), or a remote update to a node that is not a neighbor (if it is not direct or local, then it is a remote update).
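The three-way classification of a right side context update described above can be sketched as follows. This is a minimal illustration; the function and parameter names are hypothetical, and only the node-number comparison stated in the text is modeled.

```python
def classify_right_update(current_node, right_context_node):
    """Classify a right side context update by comparing node numbers.

    direct: the right context lives in the immediately neighboring node
    local:  the right context lives in the current node's own memories
    remote: anything else (a node that is not a neighbor)
    """
    if current_node + 1 == right_context_node:
        return "direct"
    if current_node == right_context_node:
        return "local"
    return "remote"
```

For example, with `current_node = 2`, a right context pointer naming node 3 is a direct update, node 2 is a local update, and any other node is a remote update.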
Looking first to direct/local updates, in the fifth clock cycle described above, various pieces of information are sent out on the bus (which can be 115 bits wide). This bus is generally wide enough to carry two stores' worth of information for the two stores that are possible in each cycle. Typically, the composition of the bus is as follows:
[3:0]: DIR_CONT (context number);
[7:4]: DIR_CNTR (counter value used for dependency checking);
[16:8]: DIR_ADDR0 (address);
[48:17]: DIR_DATA0 (data);
[49]: DIR_EN0 (enable);
[51:50]: DIR_LOHI0;
[60:52]: DIR_ADDR1 (address);
[92:61]: DIR_DATA1 (data);
[93]: DIR_EN1 (enable);
[95:94]: DIR_LOHI1;
[96]: DIR_FWD_NOT_EN (forwarded notification enable);
[97]: DIR_INP_EN (input initiated side context updates);
[98]: SET_VIN (set_valid of right or left side contexts);
[99]: RST_VIN (reset state bits);
[100]: SET_VLC (set Valid Local state);
[101]: SN_FWD_BUSY;
[102]: INP_KILLED;
[103]: INP_BUF_FULL (indication of a full buffer);
[104]: OE_FWD_BUSY;
[105]: OT_FWD_BUSY;
[106]: SV_TH_BUSY;
[107]: SV_SNRT_BUSY;
[108]: WB_FULL;
[109]: REM_R_FULL;
[110]: REM_L_FULL;
[111]: LOC_LBUF_FULL;
[112]: LOC_RBUF_FULL;
[113]: LOC_RST_BUSY;
[114]: LOC_LST_BUSY;
[118:115]: ACT_CONT; and
[119]: ACT_CONT_VAL
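By way of a hedged illustration, the store-carrying portion of the bus listed above (bits 95:0, covering the context number, the counter, and the two store slots) can be packed and unpacked as shown below. The field names and bit ranges are taken from the list; the helper names `pack_bus` and `unpack_bus` are assumptions made for this sketch, and the status bits above bit 95 are left out for brevity.

```python
# (name, high bit, low bit) for the two store slots and their shared header.
STORE_FIELDS = [
    ("DIR_CONT", 3, 0),
    ("DIR_CNTR", 7, 4),
    ("DIR_ADDR0", 16, 8),
    ("DIR_DATA0", 48, 17),
    ("DIR_EN0", 49, 49),
    ("DIR_LOHI0", 51, 50),
    ("DIR_ADDR1", 60, 52),
    ("DIR_DATA1", 92, 61),
    ("DIR_EN1", 93, 93),
    ("DIR_LOHI1", 95, 94),
]


def pack_bus(values):
    """Pack a dict of field values into one integer; missing fields are 0."""
    word = 0
    for name, high, low in STORE_FIELDS:
        width = high - low + 1
        value = values.get(name, 0)
        if value >> width:
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << low
    return word


def unpack_bus(word):
    """Split a packed bus word back into its named fields."""
    return {name: (word >> low) & ((1 << (high - low + 1)) - 1)
            for name, high, low in STORE_FIELDS}
```

Packing both slots in one word mirrors the statement above that the bus carries two stores' worth of information per cycle.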
Turning to
When data is made available through data interconnect 814, the data can include a Set_Valid flag on the thirteenth bit ([12]), as detailed above. A program can be dependent on several inputs, which are recorded in the descriptor, namely the In and #Inp bits. The In bit indicates that this program may desire input data and the #In bit indicates the number of streams. Once all the streams are received, the program can begin executing. It is important to remember that for a context to begin executing, Cvin, Rvin and Lvin should be set to 1. When a Set_Valid is received, the descriptor is checked to see if the number of Set_Valid's received is equal to the number of inputs. If the number of Set_Valid's is not equal to the number of inputs, then the SetValC field (a two-bit field that indicates how many Set_Valid's have been received) is updated. When the number of Set_Valid's is equal to the number of inputs, then the Cvin state of descriptor memory is set to 1. When the center context data memory is updated, this will spawn side context updates on the left and right using the left and right context pointers. The side contexts will obtain a context number, which will be used to read the descriptor to obtain the context base to be added to the data memory offset. At about the same point, the side context will obtain the #Inputs and SetValR, SetValL and update Rvin and Lvin in a similar manner to Cvin.
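The Set_Valid counting described above can be sketched as a small state holder. This is an illustrative model only; the class and attribute names are hypothetical, and the sketch does not model whether SetValC is cleared once Cvin is set, since that is not stated here.

```python
class ContextDescriptor:
    """Tracks received Set_Valid's for the center context of one descriptor."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs  # number of input streams (#Inputs)
        self.set_val_c = 0            # Set_Valid's received so far (SetValC)
        self.cvin = 0                 # center input valid state (Cvin)

    def receive_set_valid(self):
        """Apply one Set_Valid and return the resulting Cvin."""
        received = self.set_val_c + 1
        if received == self.num_inputs:
            # All streams have arrived: the center context inputs are valid.
            self.cvin = 1
        else:
            # Still waiting on other streams: only the count is updated.
            self.set_val_c = received
        return self.cvin
```

The same pattern would apply to SetValR and SetValL for Rvin and Lvin, under the assumption that they are counted in the same way.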
Turning now to remote updates of side contexts, remote updates are sent through a partition's BIU (i.e., 4710-i). For remote paths (as shown in
Typically, there are two types of remote transactions: master transactions and slave transactions. For master transactions, the buffer in the BIU (i.e., 4710-i) is generally two entries deep, where each entry is the full bus width wide. For example, each entry can be 115 bits wide, as this buffer can be used for side context updates for stores, of which there can be two every cycle. For slave transactions, however, the buffer in the BIU (i.e., 4710-i) is generally three entries deep, with each entry being about two stores wide (for example, 115 bits).
Additionally, each partition does interact with the shared function-memory 1410, but this interaction is described below.
The dependency checking is based on address (typically 9 bits) match and context (typically 4 bits) match. All addresses are offsets for address comparison. Once the write buffer is read, the context base is added to offset from write buffer and then used for bank conflict detection with other accesses like loads.
When performing dependency checking, though, there are several properties that are to be considered. The first property is that real time dependency checking should be done for left contexts. A reason is that sharing is typically performed in real-time using left contexts. When a right context is to be accessed, then a task switch should take place so that a different context can produce the right context data. The second property is that one write can be performed for a memory location; that is, two writes should not be performed in a context to the same address. If there is a necessity to perform two writes, then a task switch should take place. A reason is that the destination can be behind the source. If the source performs a write followed successively by a read and a write again, then at the destination, the read will see the second write's value rather than the first write's value. Using the one write property, the dependency checking relies on the fact that matches will be unique in the write buffers, and no prioritization is required as there are no multiple matches. The right context memory write buffers generally serve as a holding place before the context memory is updated; no forwarding is provided. By design, when a right context load executes, the data is already in side context memory. For inputs, both left and right side contexts can be accessed any time.
When center context stores are updated, the side context pointers are used to update the left and right contexts. The stores pointed to by the right context pointer update the left context memory pointed to by the right context pointer. These stores enter, for example, a six entry Source Write Buffer at the destination. Two stores can enter this buffer every cycle, and two stores can be read out to update left context memory. The source node sends these stores and updates the Source Write Buffer at the destination.
As described above, dependency checking is related to the relative location of the destination node with respect to the source node. If the Lvlc bit is set, it means that the source node is done, and all the data the destination desires has been computed. When the source node executes stores, these stores update the left context memory of the destination node, and this is the data that should be provided when side context loads access the left context memory at the destination. The left context memory is not updated by the destination node; it is updated by the source node. If the source node is ahead, then data has already been produced, and the destination can readily access this data. If the source node is behind, then data is not ready; therefore, the destination node stalls. This is done by using counters, which are described above. The counters indicate whether the source or the destination is ahead or behind.
The source and destination nodes can both execute two stores in a cycle. The counters should count at the right time in order to determine the dependency checking. For example, if both the counters are at 0, the destination node can execute the stores (source has not started or is synchronous), and after two delay slots, the destination node can execute a left side context load. To implement this scheme, the destination node writes a 0 into left context memory (33rd bit or valid bit) so that when the load executes, it will see a 0 on the valid bit, which should stall the load. Since the store indication from the source takes a few cycles to reach its destination, it is difficult to synchronize the source and destination write counters. Therefore, the stores at the destination node enter a Destination Write buffer from where the stores will update a 0 into the left context memory. Note that normally a node does not update its left context memory; it is usually updated by a different node that is sharing the left context. But, to implement dependency checking, the destination node writes a 0 into the valid bit or 33rd bit of the left context memory. When a load now matches against the destination write buffer, the load is stalled. The stalling destination counter value is saved, and when the source counter is equal to or greater than the saved stalled destination counter, the load is unstalled.
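The counter comparison that releases a stalled load can be sketched as follows. This is a simplified model under stated assumptions: the class and method names are hypothetical, counters are plain integers, and counter wrap-around is not modeled.

```python
class LeftLoadStall:
    """Holds the destination counter saved when a left context load stalls."""

    def __init__(self):
        self.saved_destination_counter = None  # None means no stall pending

    def stall(self, destination_counter):
        """Record the destination counter value at the time of the stall."""
        self.saved_destination_counter = destination_counter

    def try_unstall(self, source_counter):
        """Return True if the load may proceed for this source counter."""
        if self.saved_destination_counter is None:
            return True
        if source_counter >= self.saved_destination_counter:
            # The source has caught up with the stalled destination store.
            self.saved_destination_counter = None
            return True
        return False
```

In this sketch, a load stalled with a saved destination count of 4 remains stalled while the source counter is 2 and is released once the source counter reaches 4.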
Now, if the source begins producing stores with the same address, then, when stores enter the source write buffer with good data, the stores are compared against the destination write buffer. If the stores match, the “kill” bit is set in the destination write buffer, which prevents the destination store from updating side context memory with a 0 valid bit, because the source write buffer has good data with which to update the side context memory. If no store arrives from the source, the write at the destination will update the left side context memory with a 0 in the valid bit or 33rd bit. If a load accesses that address, then it will see a 0 and stall (note that the store is no longer in the destination write buffer). Thus a load can stall because: (1) it matches against the destination write buffer without the kill bit set (if the kill bit is set, then most likely the data is in the source write buffer, from where it can be forwarded); or (2) it does not match the destination write buffer but finds a valid bit of 0 in the side context load data. As mentioned, loads at the destination node can forward from the source write buffer or take data from side context memory provided the 33rd bit or valid bit is 1. If the source write counter is greater than or equal to the destination counter, then the stores will not enter the destination write buffer.
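The two stall conditions and the forwarding path described above can be illustrated with the following sketch. It assumes, for illustration only, that each write buffer is modeled as a dictionary keyed by (context number, address offset), which is workable because the one-write property makes matches unique; the function name and the dictionary layouts are hypothetical.

```python
def resolve_left_load(address, context, source_wb, destination_wb, side_memory):
    """Decide how a left side context load is satisfied.

    source_wb:      {(context, address): data}
    destination_wb: {(context, address): {"kill": bool}}
    side_memory:    {(context, address): (data, valid_bit)}
    Returns ("forward", data), ("memory", data) or ("stall", None).
    """
    key = (context, address)
    dest_entry = destination_wb.get(key)
    if dest_entry is not None and not dest_entry["kill"]:
        # Stall case (1): destination write buffer match, kill bit not set.
        return ("stall", None)
    if key in source_wb:
        # Good data from the source can be forwarded directly.
        return ("forward", source_wb[key])
    data, valid = side_memory.get(key, (0, 0))
    if valid == 1:
        return ("memory", data)
    # Stall case (2): no usable match and the valid (33rd) bit is 0.
    return ("stall", None)
```

The (context, address) key corresponds to the context match (typically 4 bits) and address match (typically 9 bits) used for dependency checking.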
It should be noted that, in operation, loads first generate addresses, followed by accessing data memory (namely, SIMD data memory) and an update of the register file with the subsequent results. However, stalls can occur, and when a stall occurs, it occurs between the accessing of data memory and the update of the register file. Generally, this stall can be due to: (1) a match against the destination write buffer; or (2) no match against the destination write buffer, but the load result has its valid bit set to 0. This stall also generally coincides with address generation from a subsequent packet of loads. For the load that has stalled, its information is saved so that the load can be recycled; once the load is successfully completed, any following loads can proceed. Typically, the saved information generally comprises information used to restart the load, such as an address (i.e., an offset and context base), offset alone, pixel address, and so forth.
Following the update of the register file, data memory can be updated. Initially, indicators (i.e., dmem6_sten and dmem7_sten) can be used to indicate that stores are being set up to update data memory, and if the write buffers are full, then the stores will not be sent in the following cycle. However, if the write buffers are not full, the stores can be sent to the direct neighboring node, and the write buffer can be updated at the end of this cycle. Additionally, addresses can be compared against write buffers; node wrappers (i.e., 810-i) from two nodes are generally close to each other (not more than a 1000 μm route, as an example). A new counter value is also reflected in this cycle, for example, a “2” if two stores are present.
Typically, there are two local buffers (for example) which are filled from the write buffers when empty. For example, if there is one entry in write buffer, one gets filled. Since, for example, there are two write buffers, the write buffers can be read in a round-robin fashion if destination write buffer is valid; otherwise, the source write buffer is read every time the local buffer is empty. During a write buffer read so as to provide entries for the local buffers, an offset can be added to the context base. If a local buffer contains data, bank conflict detection can be performed with 4 loads. If there are no bank conflicts, both can set up the side context memories.
For the left side context memory, there is one more write buffer used for local and remote stores. Both remote and local stores can happen at about same time, but local stores are given higher priority compared to remote stores. To accommodate this feature, local stores follow same pipeline as direct stores, namely:
For the left side context, there can, for example, be three buffers: a left source write buffer, a left destination write buffer, and a left local-remote write buffer. Each of these buffers can, for example, be six entries deep. Typically, the left source write buffer includes data, address offset, context base, lo_hi, and context number, where the context number and offset can be used for dependency checking. Additionally, forwarding of data can be provided with this left source write buffer. The left destination write buffer generally includes an address offset, context number, and context base, which can be used for dependency checking for concurrent tasks. The left local-remote write buffer generally includes data, address offset, context base, and lo_hi, but no forwarding is provided because the left local-remote write buffer is generally shared between local and remote paths. Round-robin filling occurs between the 3 write buffers, with the left destination write buffer and the left local-remote write buffer sharing the round robin bit. Typically, there is one round robin bit; whenever the destination write buffer or the left local-remote write buffer is occupied, then the round robin bit is 0. These buffers can update SIMD data memory, and every cycle the round robin bit can be flipped between 0 and 1.
For the right side context, there can, for example, be two write buffers: a direct traffic write buffer and a right local-remote write buffer. Each of these write buffers can, for example, be six entries deep. Typically, the direct traffic write buffer includes data, address offset, context base, lo_hi, and context number, while the right local-remote write buffer can include data, address offset, context base, and lo_hi. These buffers do not generally have dependency checking or forwarding. Write and read of these buffers is similar to the left context write buffer. Generally, the priority between the right context write buffer and the input write buffer is similar to left side context memory; input write buffer updates go on the second port of the two write ports. Additionally, a separate round robin bit is used to decide between the two write buffers on the right side.
A reason for separate local-remote write buffers is that there can be concurrent traffic between direct and local, between direct and remote, and between local and remote. Managing all of this concurrent traffic becomes difficult without having the ability to update a write buffer with several (i.e., 4 to 6) stores in one cycle. Building a write buffer that can update these stores in one cycle is difficult from a timing standpoint, and such a write buffer will generally have an area of a size similar to that of separate write buffers.
Anytime there is any write buffer stall, other writes can be stalled. For example, if a node (i.e., 808-i) is updating direct traffic on the left and right side contexts and one of the buffers becomes full, traffic on both paths would be stalled. A reason is that, when the SIMD unstalls, the SIMD re-issues stores. It is generally important, though, to ensure that stores are not re-issued again to a write buffer. Due to the pipeline of write buffer allocation, full is indicated when there are several (i.e., 4) writes in the write buffer, even though two entries are still available because they are empty. This way, if there are two stores coming in, they can skid into the available write buffer entries. Using exact full detection would have required eight write buffers, with two buffers for skid. Also note that a stall does not distinguish between one write buffer entry being available and two being available; it just stalls on the assumption that two stores were coming from the core and two entries were not available.
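The early "full" indication with skid space can be sketched as below. This is an illustrative model only: the class name is hypothetical, the six-entry depth and the threshold of 4 follow the example values in the text, and the sketch assumes the buffer is at the threshold occupancy when the two in-flight stores arrive.

```python
class SkidWriteBuffer:
    DEPTH = 6           # entries, per the six-entry example in the text
    FULL_THRESHOLD = 4  # "full" is signalled at this occupancy

    def __init__(self):
        self.entries = []

    def full(self):
        """Early full indication: two entries are still free at this point."""
        return len(self.entries) >= self.FULL_THRESHOLD

    def push(self, stores):
        """Accept up to two stores, including ones already in flight."""
        if len(stores) > 2:
            raise ValueError("at most two stores arrive per cycle")
        if len(self.entries) + len(stores) > self.DEPTH:
            raise OverflowError("write buffer overrun")
        self.entries.extend(stores)

    def drain(self, count=2):
        """Read out up to `count` of the oldest stores."""
        drained, self.entries = self.entries[:count], self.entries[count:]
        return drained
```

The two entries between the threshold and the depth are what allow a pair of stores that were issued before the stall took effect to land without being re-issued.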
The write buffers should maintain context numbers so that context bases can be added to offsets received from other nodes for updating SIMD data memory. The write buffers generally maintain context bases so that write buffers are not flushed when there is a task switch, as flushing would be detrimental to performance. Also, it is possible that there could be stores from several different contexts in a write buffer, which means that it is desirable either to store all of these context bases or to read the descriptor after reading the stores out of the write buffer (which can also be bad, as the pipeline for emptying write buffers becomes longer). In order to avoid stalling the write buffer allocation for lack of a context base, descriptors should be read for the various paths as soon as tasks are ready to execute; this is done speculatively, and the architectural copy is updated in various parts of the pipeline.
As soon as a program has been updated, the program counter or PC is available as well as the base context. The base context can be used to: (1) fetch a SIMD context base from a descriptor; (2) fetch a processor data memory context base from a processor data memory; and (3) save side context pointers. This is done speculatively, and, once the program begins executing, the speculative copies are updated into architectural copies.
Architectural copies are updated as follows:
Speculative copies are updated at two points:
Task switches are indicated by software using (for example) a 2-bit flag. The flag can indicate nop, release input context, set valid for outputs, or task switch. The 2-bit flag is decoded in a stage of instruction memory (i.e., 1404-i). For example, it can be assumed that a first clock cycle of Task 1 can result in a task switch in a second clock cycle, and, in the second clock cycle, a new instruction from instruction memory (i.e., 1404-i) is fetched for Task 2. The 2-bit flag is on a bus called cs_instr. Additionally, the PC can generally originate from two places: (1) from the node wrapper (i.e., 810-i) from a program if the tasks have not encountered the BK bit; and (2) from context save memory if BK has been seen and task execution has wrapped back.
Task pre-emption can be explained using two nodes 808-i and 808-(i+1) of
There are relationships between the various contexts in node 808-k and reception of set_valid. When set_valid is received for context0, it sets Cvin for context0 and sets Rvin for context1. Since Lf=1 indicates the left boundary, nothing should be done for the left context; similarly, if Rf is set, no Rvin should be propagated. Once context1 receives Cvin, it propagates Rvin to context0, and since Lf=1, context0 is ready to execute. Context1 should generally have Rvin, Cvin and Lvin set to 1 before execution, and, similarly, the same should be true for context2. Additionally, for context2, Rvin can be set to 1 when node 808-(k+1) receives a set_valid.
Rvlc and Lvlc are generally not examined until Bk=1 is reached, after which task execution wraps around, and at this point Rvlc and Lvlc should be examined. Before Bk=1 is reached, the PC originates from another program, and, afterward, the PC originates from context save memory. Concurrent tasks can resolve left context dependencies through write buffers, which have been described above, and right context dependencies can be resolved using programming rules described above.
The valid locals are treated like stores and can be paired with stores as well. The valid locals are transmitted to the node wrapper (i.e., 810-i), and, from there, the direct, local or remote path can be taken to update valid locals. These bits can be implemented in flip-flops, and the bit that is set is SET_VLC in the bus described above. The context number is carried on DIR_CONT. The resetting of VLC bits is done locally using the previous context number that was saved away prior to the task switch, using a one cycle delayed version of the CS_INSTR control.
As described above, there are various parameters that are checked to determine whether a task is ready. For now task pre-emption will be explained using input valids and local valids. But, this can be expanded to other parameters as well. Once Cvin, Rvin and Lvin are 1, a task is ready to execute (if Bk=1 has not been seen). Once task execution wraps around, in addition to Cvin, Rvin and Lvin, Rvlc and Lvlc can be checked. For concurrent tasks, Lvlc can be ignored as real time dependency checking takes over.
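The readiness rule described above can be written out as a short predicate. This is a hedged sketch: the function and parameter names are hypothetical, and only input valids and local valids are modeled, in line with the simplification stated in the text.

```python
def task_ready(cvin, rvin, lvin, rvlc, lvlc, bk_seen, concurrent=False):
    """Return True if a task may execute, given its valid bits.

    bk_seen:    True once Bk=1 has been reached and execution has wrapped.
    concurrent: True for concurrent tasks, where Lvlc is ignored because
                real time dependency checking takes over.
    """
    inputs_ready = cvin == 1 and rvin == 1 and lvin == 1
    if not bk_seen:
        # Before wrap-around only the input valids are examined.
        return inputs_ready
    if concurrent:
        return inputs_ready and rvlc == 1
    return inputs_ready and rvlc == 1 and lvlc == 1
```

Other readiness parameters mentioned in this description could be added as further conjuncts in the same way.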
Also, when transitioning between tasks (i.e., Task1 and Task2), the Lvlc for Task1 can be set when Task0 encounters a context switch. At this point, when the descriptor for Task1 is examined just before Task0 is about to complete using the Task Interval counter, Task1 will not be ready as Lvlc is not set. However, Task1 is assumed to be ready knowing that the current task is 0 and the next task is 1. Similarly, when Task2 is, say, returning to Task1, then again Rvlc for Task1 can be set by Task2; Rvlc can be set when the context switch indication is present for Task2. Therefore, when Task1 is examined before Task2 is complete, Task1 will not be ready. Here again, Task1 is assumed to be ready knowing that the current context is 2 and the next context to execute is 1. Of course, all the other variables (like input valids and the valid locals) should be set.
The task interval counter indicates the number of cycles a task is executing, and this data can be captured when the base context completes execution. Using Task0 and Task1 again in this example, when Task0 executes, the task interval counter is not valid. Therefore, after Task0 executes (during stage 1 of Task0 execution), speculative reads of the descriptor and processor data memory are set up. The actual read happens in a subsequent stage of Task0 execution, and the speculative valid bits are set in anticipation of a task switch. During the next task switch, the speculative copies update the architectural copies as described earlier. Accessing the next context's information is not as ideal as using the task interval counter, as checking immediately whether the next context is valid may result in a not ready task, while waiting until the end of task completion may actually ready the task, since more time has been given for task readiness checks. But, since the counter is not valid, nothing else can be done. If there is a delay due to waiting for the task switch before checking to see if a task is ready, then the task switch is delayed. It is generally important that all decisions (such as which task to execute, and so forth) are made before the task switch flags are seen, so that, when they are seen, the task switch can occur immediately. Of course, there are cases where, after the flag is seen, the task switch cannot happen as the next task is waiting for input, and there is no other task/program to go to.
Once the counter is valid, several (i.e., 10) cycles before the task is to be completed, the next context to execute is checked to see whether it is ready. If it is not ready, then task pre-emption can be considered. If task pre-emption cannot be done because task pre-emption has already been done (one level of task pre-emption can be done), then program pre-emption can be considered. If no other program is ready, then the current program can wait for the task to become ready.
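The order in which the alternatives above are considered can be sketched as a simple decision function. This is a simplification under stated assumptions: the names are hypothetical, and whether a task pre-emption candidate is itself ready is not modeled.

```python
def next_action(next_context_ready, task_preemption_used, other_program_ready):
    """Pick the action taken when the readiness probe is made.

    Only one level of task pre-emption is allowed, so once it has been
    used the scheduler falls through to program pre-emption, and then to
    waiting for the next task to become ready.
    """
    if next_context_ready:
        return "task_switch"
    if not task_preemption_used:
        return "task_preemption"
    if other_program_ready:
        return "program_preemption"
    return "wait"
```

For example, `next_action(False, True, False)` returns `"wait"`, which corresponds to the current program waiting for the task to become ready.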
When a task is stalled, then it can be awakened by valid inputs or local valid for context numbers that are in Nxt context number as described above. The Nxt context number can be copied with Base Context number when the program is updated. Also, when program pre-emption takes place, the pre-empted context number is stored in Nxt context number. If Bk has not been seen and task pre-emption takes place, then again Nxt context number has the next context that should execute. The wakeup condition initiates the program, and the program entries are checked one by one starting from entry 0 until a ready entry is detected. If no entry is ready, then the process continues until a ready entry is detected which will then cause a program switch. The wakeup condition is a condition which can be used for detecting program pre-emption. When the task interval counter is several (i.e., 22) cycles (programmable value) before the task is going to complete, each program entry is checked to see if it is ready or not. If ready, then ready bits are set in the program which can be used if there are no ready tasks in current program.
Looking to task preemption, a program can be written as a first-in-first-out (FIFO) and can be read out in any order. The order can be determined by which program is ready next. The program readiness is determined several (i.e., 22) cycles before the currently executing task is going to complete. The program probes (i.e., 22 cycles) should complete before the final probe for the selected program/task is made (i.e., 10 cycles). If no tasks or programs are ready, then anytime a valid input or valid local comes in, the probe is re-started to figure out which entry is ready.
The PC value to the node processor 4322 is several (i.e., 17) bits wide, and this value is obtained by shifting the several (i.e., 16) bits from Program left by (for example) 1 bit. When performing task switches using the PC from context save memory, no shifting is required.
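The PC formation described above amounts to a one-bit left shift for the program path and a pass-through for the context save path. The sketch below uses the example widths of 16 and 17 bits; the function names are hypothetical.

```python
PC_MASK = (1 << 17) - 1  # 17-bit PC, per the example width in the text


def pc_from_program(program_pc16):
    """Form the 17-bit PC by shifting the 16-bit program value left by 1."""
    if not 0 <= program_pc16 < (1 << 16):
        raise ValueError("program value must fit in 16 bits")
    return (program_pc16 << 1) & PC_MASK


def pc_from_context_save(saved_pc17):
    """A PC restored from context save memory is used without shifting."""
    return saved_pc17 & PC_MASK
```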
When a context begins executing, the context first sends a Source Notification (SN) to see if the destination is a thread or not, which is indicated by a Source Permission (SP). The reasoning behind this first mode of operation out of reset is that, when first starting, a node does not know if the output is to a thread (ordering required) or a node (no ordering required). Therefore, it starts out by sending an SN message. The Lf=1 node generally does this. It will get back an SP message indicating it is not a thread. The SN and SP messages are tied together by a two bit src_tag when it comes to nodes. The Lf=1 node sends out the SN message after it examines the output enables, which are the most significant bit of the output destination descriptor. For every destination descriptor, an SN is sent. Note that the destination can be changed in the SP from what was indicated in the destination descriptor; therefore, the destination information is usually taken from the SP message. Pipeline for this is as follows:
Assume this program has 1-0, 1-1 and 1-2 tasks with Bk=1 set on 1-2. Then the Lf=1 context, which is 1-0, sends SN for, say, two outputs enabled. Then the SP message comes in for 1-0, which then forwards the “enable” to 1-1. When SP comes in for 1-1, OE for 1-1 is set to 1. Now that SP messages have been sent, outputs can be executed. If outputs are encountered before the OEs are set, then the SIMDs are stalled. This stall is like a bank conflict stall encountered in stage 3. Once the OEs are set, the stall goes away.
The program can then issue a set_valid using the 2 bit compiler flag which will reset the OE. Once the OE has been reset and we go back to executing 1-0, 1-1 etc, all contexts will now know that they are not a thread and hence can send a SN message. That is 1-0 which is Lf=1 context plus 1-1 and 1-2 will now send a SN message for outputs enabled. They will each receive a SP which will set their OE's and this time around they will not forward their SP messages like out of reset described earlier.
If the SP message indicates it is threaded, then OE is updated and data is provided to the destination. Note that the destination can be changed in the SP message from what was indicated in the destination descriptor; therefore, the destination information is usually taken from the SP message. When set_valid is executed by a node, the node will then forward the SP message it received to the right context pointer, which will then send the SN to the destination. The forwarding takes place when the output is read from the output buffer, so that stalls in the SIMD can be avoided when there are back to back set_valid's. The set_valid for vector outputs is what causes the forwarding to happen. Scalar outputs do not do the forwarding; however, both will reset the OEs.
The ua6[5:0] field (for scalar and vector outputs) carries the following information:
Ua6[5]: set_valid
Ua6[4:3]: indicates size for scalar output
Ua6[2:0]: output number (for nodes/SFM—bits 1:0 are used)
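As a minimal sketch, the ua6 field layout listed above can be decoded as follows. The function name and dictionary keys are hypothetical; the bit positions come from the list.

```python
def decode_ua6(ua6):
    """Split the 6-bit ua6 field into its three sub-fields."""
    if not 0 <= ua6 < (1 << 6):
        raise ValueError("ua6 is a 6-bit field")
    return {
        "set_valid": (ua6 >> 5) & 0x1,    # ua6[5]
        "scalar_size": (ua6 >> 3) & 0x3,  # ua6[4:3]
        "output_number": ua6 & 0x7,       # ua6[2:0]; nodes/SFM use bits 1:0
    }
```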
Scalar outputs are also sent on message bus 1420 and carry set_valid, etc., on the following MReqInfo bits: (1) Bit 0: set_valid (internally remapped to bit 29 of message bus); and (2) Bit 1: output_killed (internally remapped to bit 26 of message bus).
An SP message is sent when CVIN, LRVIN and RLVIN are all 0's, in addition to looking at the states for InSt. SN messages send a 2 bit dst_tag field on bits 5:4 of payload data. These bits come from bits 14:13 of the destination descriptors, which have been initialized by the TSys tool and are static. The InSt bits are 2 bits wide, and since there can be 4 outputs, there are 8 such bits; these occupy 15:8 of word 13 and replace the older pending permission bits and source thread bits. When the SN message comes in, dst_tag is used to index the 4 destination descriptors; if dst_tag is 00, then the InSt0 bits are read out, and if pending permissions are to be updated, word 8 is updated. InSt0 bits are 9:8 and InSt1 bits are 11:10 and so on. If the InSt bits are 00, then SP is sent and the InSt bits are set to 11. If an SN message now comes to the same dst_tag, then the InSt bits are moved to 10 and no SP message is sent. When CVIN is being set to 1, the InSt bits are checked: if they are 11, they are moved to 00, and if they are 10, they are moved to 01. State 01 is equivalent to having a pending permission. When release_input comes, the SP is sent (provided CVIN, LRVIN and RLVIN are all 0's), the state bits are moved to 11, and the process repeats. Note that when release_input comes and LRVIN and/or RLVIN are not 0, then, when other contexts execute a release_input, LRVIN and RLVIN will get locally reset as the other contexts forward the release_input to reset LRVIN/RLVIN; at that point, the 3 bits are checked again to see if they will be 0. If they are going to be 0, then pending permissions will be sent. When InSt=00 and CVIN, LRVIN and RLVIN are not 0's, then the InSt bits move to 01, from where pending permissions are sent when release_input is executed.
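The InSt transitions described above can be modeled, for one destination tag, as a small state machine. This is an interpretive sketch rather than a definitive implementation: the class and method names are hypothetical, SP transmission is represented only by a counter, and behavior for state and event combinations not mentioned in the text is left as "no change".

```python
class InStTracker:
    """Two-bit InSt state for a single dst_tag."""

    def __init__(self):
        self.state = 0b00
        self.sp_sent = 0  # number of SP messages sent so far

    @staticmethod
    def _inputs_clear(cvin, lrvin, rlvin):
        return cvin == 0 and lrvin == 0 and rlvin == 0

    def on_source_notification(self, cvin, lrvin, rlvin):
        if self.state == 0b00:
            if self._inputs_clear(cvin, lrvin, rlvin):
                self.sp_sent += 1     # permission granted immediately
                self.state = 0b11
            else:
                self.state = 0b01     # permission becomes pending
        elif self.state == 0b11:
            self.state = 0b10         # second SN: no SP is sent

    def on_cvin_set(self):
        if self.state == 0b11:
            self.state = 0b00
        elif self.state == 0b10:
            self.state = 0b01         # equivalent to a pending permission

    def on_release_input(self, cvin, lrvin, rlvin):
        if self.state == 0b01 and self._inputs_clear(cvin, lrvin, rlvin):
            self.sp_sent += 1         # pending permission is sent
            self.state = 0b11
```

In hardware there would be four such two-bit states (InSt0 through InSt3), selected by dst_tag.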
Following are sources of stalls in SIMD:
A task within a node level program (that describes an algorithm) is a collection of instructions that starts when the side context of an input is valid and task switches when the side context of a variable computed during the task is desired. Below is an example of a node level program:
For
Turning to
Nodes (i.e., 808-i) in this example can use two's complement representation for signed values and target ISP6 functionality. A difference between ISP5 and ISP6 functionalities is the width of operators. For ISP5, the width is generally 24 bits, and for ISP6, the width may change to 26 bits. For packed instructions, some registers can be accessed in two halves, <register>.lo and <register>.hi; these halves are generally 12 bits wide.
Each functional unit (i.e., 4338-1) has 32 registers each of which is 32 bits wide, which can be accessed as 16 bit values (unpacked) or 32 bit values (packed).
Each node (i.e., 808-i) is typically a 10-instruction issue machine, with the 11 units each capable of issuing a single instruction in parallel. The eleven units are labeled as follows: .LS1, .LS2, .LS3, .LS4, .LS5, .LS6, .LS7, and .LS8 for node processor 4322; .M1 for multiply unit 4348; .L1 for logic unit 4346; and .R1 for round unit 4350. The instruction set is partitioned across these units, with instruction types assigned to a particular unit. In some cases a provision has been made to allow more than one unit to execute the same instruction type. For example, ADD may be executed on either .L1 or .R1, or both. The unit designators (.LS1, .LS2, .LS3, .LS4, .LS5, .LS6, .LS7, .LS8, .M1, .L1, and .R1), which follow the mnemonic, indicate to the assembler what unit is executing the instruction type. An example is as follows:
In this example two add instructions are issued in parallel, one executing on the round unit 4350 and one executing on the logic unit 4346. It should also be noted that if parallel instructions write results to the same destination, the result is unspecified. The value in the destination is implementation dependent.
Since the nodes (i.e., 808-i) are VLIW machines, the compiler 706 should move independent instructions into the delay slots for branch instructions. The hardware is set up for SIMD instructions with direct load/store data from LS data memory 4339. The compiler 706 will see LS data memory 4339 as a large register file for data, for example:
It should also be noted that the value RA will remain until another load or SIMD instruction writes to its register (i.e., register 4612). It is generally not desired to store value RC if the value is used locally within the next instructions. The value RC will remain until another load or SIMD instruction writes to its register (i.e., 4618). Value RE should be used locally and not written back to LS data memory 4339.
The pipeline is set up so that the compiler 706 can see banks of SIMD data memory (i.e., 4306-1) as a huge register file. There is no store-to-load forwarding; loads will usually take data from the SIMD data memory (i.e., 4306-1). There should be two delay slots between a store and a dependent load.
An output instruction is executed as a store instruction. The constant ua6 can be recoded to do the following:
Ua6[5:4]=00 will indicate Store
Vector output instructions output the lower 16 SIMD registers to a different node—it can be shared function-memory 1410 (described below) as well. All 32 bits can be updated.
Scalar outputs output a register value on the message interconnect bus (to control node 1406). Lower 16, upper 16, or entire 32 bits of data can be updated in the remote processor data memory 4328. The sizes are indicated on ua6[3:2], where 01 is the lower 16 bits, 10 is upper 16 bits, 11 is all 32 bits, and 00 is reserved. Additionally, there can be four output destination descriptors. Output instructions use ua6[1:0] to indicate which destination descriptor to use. The most significant bit of ua6 can be used to perform a set_valid indication which signals completion of all data transfers for a context from a particular input, which can trigger execution of a context in the remote node. Address offsets can be 16 bits wide when outputs are to shared function-memory 1410—else node to node offsets are 9 bits wide.
There is a global area reserved for spills in SIMD data memory (i.e., 4306-1). The following instructions can be used to access the global area:
LD *uc9, ua6, dst
ST dst, *uc9, ua6
where uc9 is from variable uc9[8:0]. When uc9[8] is set, the context base from node wrapper (i.e., 810-i) is not added to calculate the address; the address is simply uc9[8:0]. If uc9[8] is 0, then the context base from wrapper (i.e., 810-i) is added. Using this support, variables can be stored from the top address of SIMD data memory (i.e., 4306-1) and grow downward like a stack by manipulating uc9.
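The address selection for the spill area can be illustrated with a short sketch. The function name and the context_base parameter are hypothetical; the sketch only assumes what the text states, namely that bit 8 of the 9-bit constant decides whether the context base is added.

```python
def spill_address(uc9, context_base):
    """Return the SIMD data memory address for a global-area LD/ST.

    uc9 is the 9-bit constant from the instruction; context_base is the
    base supplied by the node wrapper (name chosen for illustration).
    """
    uc9 &= 0x1FF                 # keep uc9[8:0]
    if uc9 & 0x100:              # uc9[8] set: absolute address, no base added
        return uc9
    return context_base + uc9    # uc9[8] clear: context-relative address
```

For example, with a context base of 0x40, a constant of 0x010 resolves to 0x50, while a constant of 0x1FF resolves to 0x1FF regardless of the base.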
When the frame is at the left or right edge, the descriptor will have Lf or Rt bits set. At the edges, the side context memories do not have valid data, and, hence, the data from center context is either mirrored or repeated. Mirroring or repeating can be indicated by bit lssrc2[13] (circular buffer addressing mode).
Mirror when lssrc2[13]=0
Repeat when lssrc2[13]=1
Pixels at the left and right edges are mirrored/repeated. Boundaries are at pixel 0 and N. For example, if side context pixel −1 is accessed, pixel at location 1 or B is returned. Similarly for side context pixels −2, N and N+1.
The LS data memory 4339 (which can have a size of about 256×12 bit) can have the following regions:
Instructions that can move data between node processor 4322 and SIMD (i.e., SIMD unit including SIMD data memory 4306-1 and functional unit 4308-1) are indicated in Table 3 below:
More explanation of companion instructions for node processor 4322 is provided below.
6.8.10. LDSFMEM and STFMEM
The instructions LDSFMEM and STFMEM can access shared function-memory 1410. LDSFMEM reads a SIMD register (i.e., within 4338-1) for the address and sends this over several cycles (i.e., 4) to shared function-memory 1410. Shared function-memory 1410 will return (for example) 64 pixels of data over 4 cycles, which are then written into a SIMD register 16 pixels at a time. These loads for LDSFMEM instructions typically have a latency of 10 cycles, but are pipelined so that (for example) results for the second LDSFMEM should come immediately after the first one completes. To obtain high performance, four LDSFMEM instructions should be issued well ahead of their usage. Both LDSFMEM and STFMEM will stall if the IO buffers (i.e., within 4310-i and 4316-i) become full in the node wrapper (i.e., 810-i).
The assembler syntax for the nodes (i.e., 808-i) can be seen in Table 4 below:
Abbreviations used for instructions can be seen in Table 5 below:
An example instruction set for each node (i.e., 808-i) can be seen in Table 6 below.
Within processing cluster 1400, general-purpose RISC processors serve various purposes. For example, node processor 4322 (which can be a RISC processor) can be used for program flow control. Examples of RISC architectures are described below.
Turning to
Turning to
There are typically two executable delay slots for instructions which modify the program counter. Instructions which exhibit branching behavior are not permitted in either delay slot of a branch. Instructions which are illegal in the delay slot of a branch may be identified by tooling using ProfAPI. If an instruction record's action field contains the keyword “BR”, this instruction is illegal in either of the two delay slots of a branch. Load instructions can exhibit a one cycle load use delay. This delay is generally managed by software (i.e., there is no hardware interlock to enforce the associated stall). An example is:
In this case the ADD will use the contents of R2 resulting from the SUB and not the results of the load. The MUL will use the contents of R2 resulting from the load. Loads which calculate an address, or have a register based address access data memory (i.e., 4328) after address calculation has been completed in execution stage 5310. Loads with address operands fully expressed as an immediate value exhibit “zero” cycles of load use delay relative to the execution pipe stage, i.e. these instructions access data memory (i.e., 4328) from decode stage 5308 rather than the execution stage 5310. The compiler 706 is generally responsible for appropriately scheduling access to data memory (i.e., 4328), and register values in the presence of these two types of loads.
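The one-cycle load use delay can be illustrated with a small software model. This is a sketch under stated assumptions rather than a description of processor 5200: the tuple-based instruction format, the register names, and the operand values are all hypothetical, and the model simply makes a load's result visible only after the next instruction has read its operands.

```python
def run(program, regs, mem):
    """Execute instructions in order with a one-instruction load use delay.

    Each instruction is (op, dst, a, b). For "LD", a is a memory address
    and b is unused; for other ops, a and b are source register names.
    """
    pending = None  # (dst, value) produced by the previous load, not yet visible
    for op, dst, a, b in program:
        # Operands are read before the previous load's result lands.
        if op == "LD":
            result = mem[a]
        else:
            x, y = regs[a], regs[b]
            result = {"ADD": x + y, "SUB": x - y, "MUL": x * y}[op]
        # The previous load's result becomes visible now.
        if pending is not None:
            regs[pending[0]] = pending[1]
            pending = None
        if op == "LD":
            pending = (dst, result)
        else:
            regs[dst] = result
    if pending is not None:
        regs[pending[0]] = pending[1]
    return regs


# Hypothetical sequence: SUB writes R2, a load targets R2, then ADD and MUL read R2.
program = [
    ("SUB", "R2", "R0", "R1"),
    ("LD",  "R2", 0, None),
    ("ADD", "R3", "R2", "R1"),   # sees R2 from the SUB
    ("MUL", "R4", "R2", "R1"),   # sees R2 from the load
]
final = run(program, {"R0": 10, "R1": 3, "R2": 0, "R3": 0, "R4": 0}, {0: 100})
```

In the model, the ADD reads the value left by the SUB and the MUL reads the loaded value, which is the behavior the paragraph above attributes to the load use delay.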
Primary input risc_mode[1:0] controls T20's behavior on exit from reset. When risc_mode is set to 2′b00, after the completion of reset, processor 5200 will perform a data memory (i.e., 4328) load from address 0, the reset vector. The value contained there is loaded into the PC, causing an effective absolute branch to the address contained in the reset vector. When risc_mode is set to 2′b01, the processor 5200 remains stalled until the assertion of force_pcz. The reset vector is not loaded in this case.
Boundary pins, however, can also indicate stall conditions. Generally, there are four stall conditions signaled by entity boundary pins: instruction memory stall, data memory stall, context memory stall, and function-memory stall. De-assertion of any of these pins will stall processor 5200 under the following conditions:
(1) Instruction memory stall (imem_rdy)
(2) Data memory stall (dmem_rdy)
(3) Context memory stall (cmem_rdy)
(4) Vector-memory stall (vmem_rdy)
Turning to
A decoder 5221 (which is part of the decode stage 5308 and processing unit 5202) decodes the instruction(s) from the instruction fetch 5204. The decoder 5221 generally includes an operator format circuit 5223-1 and 5223-2 (to generate intermediates) and a decode circuit 5225-1 and 5225-2 for the B-side and A-side, respectively. The output from the decoder 5221 is then received by the decode-to-execution unit 5220 (which is also part of the decode stage 5308 and processing unit 5202). The decode-to-execution unit 5220 generates command(s) for the execution unit 5227 that correspond to the instruction(s) received through the fetch packet.
The A-side and B-side of the execution unit 5227 are also subdivided. Each of the B-side and A-side of the execution unit 5227 respectively includes a multiply unit 5222-1/5222-2, a Boolean unit 5226-1/5226-2, an add/subtract unit 5228-1/5228-2, and a move unit 5330-1/5330-2. The B-side of the execution unit 5227 also includes a load/store unit 5224 and a branch unit 5232. The multiply unit 5222-1/5222-2, Boolean unit 5226-1/5226-2, add/subtract unit 5228-1/5228-2, and move unit 5330-1/5330-2 can then, respectively, perform a multiply operation, a logical Boolean operation, an add/subtract operation, and a data movement operation on data loaded into the general purpose register file 5206 (which also includes read addresses for each of the A-side and B-side). Move operations can also be performed in the control register file 5216.
The load/store unit 5224 can load and store data to processor data memory (i.e., 4328). In Table 8 below, loads for bytes, halfwords, and words and stores for bytes, unsigned bytes, halfwords, unsigned halfwords, and words can be seen.
The branch unit 5232 executes branch operations in instruction memory (i.e., 1404-1). The branch unit instructions are typically Bcc, CALL, DCBNZ, and RET, where RET generally has three executable delay slots and the remaining instructions generally have two. Additionally, a load or store cannot generally be in the first delay slot during the read of a RET.
Turning now to
For processor 5200, there can be a single scalar instruction slot, therefore ‘unaligned’ has no relevance. Alternatively, aligned instructions can be provided for processor 5200. However, the benefit of unaligned instruction support on code size is reduced by new support for branches to the middle of fetch packets containing two twenty bit instructions. The additional branch support potentially provides both improved loop performance and code size reduction. The additional support for unaligned instructions potentially marginalizes the performance gain and has minimal benefit to code size.
20-bit instructions may also be executed serially. Generally, bit 19 of the fetch packet functions as the P-bit or parallel bit. This bit, when set (i.e. set to “1”), can indicate that the two 20-bit instructions form an execute packet. Non-parallel 20 bit instructions may also be placed on either half of the fetch packet, which is reflected in the setting of the P-bit or bit 19 of the fetch packet. Additionally, for a 40-bit instruction, the P-bit cannot be set, so either hardware or the system programming tool 718 can enforce this condition.
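The role of the P-bit can be sketched as a simple field extraction. The function name is hypothetical, and the layout is an assumption made for illustration: the low-side instruction is taken to occupy bits [19:0] of the 40-bit fetch packet and the high-side instruction bits [39:20]. The sketch does not attempt to recognize a 40-bit instruction, since that encoding is not given here.

```python
def split_fetch_packet(packet):
    """Split a 40-bit fetch packet into (low, high, parallel).

    Assumed layout: low half in bits [19:0], high half in bits [39:20].
    Bit 19 of the fetch packet is the P-bit (parallel bit).
    """
    low = packet & 0xFFFFF
    high = (packet >> 20) & 0xFFFFF
    parallel = bool((packet >> 19) & 1)   # set: the two halves form an execute packet
    return low, high, parallel
```

With the P-bit clear, the two 20-bit halves would be treated as serial instructions; with it set, they would issue together as one execute packet.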
Turning to
In the first instruction, a load (on the B-side) to R0 (in the general purpose register file 5206) is performed, which is followed by a no operation or nop. The last instruction is a register (location R0) to register (location R1) add with R0 as the destination. All these instructions execute serially, and, in this example, prior to execution, register location R0 contains 0x456, while register location R1 contains 0x1. The value from the load is 0x123 in this example. As shown, in the first cycle, the load instruction is in the fetch stage 5306. In the second cycle, the decode for the load instruction is performed, while the nop instruction enters the fetch stage 5306. In the third cycle, the load instruction is executed, which loads an address into the processor data memory. Additionally, the add instruction enters the fetch stage 5306 in the third cycle. In the fourth cycle, the add instruction enters the decode stage 5308, and data is read from the processor data memory (which corresponds to the address loaded in the third cycle) and moved to register location R0. Finally, in the fifth and sixth cycles, the add instruction is executed, where the value 0x123 (from R0) and 0x1 (from R1) are added together and stored in location R0.
Since load (and store) instructions often calculate the effective RAM address, the RAM address is sent to the RAM in the execute stage 5310. A full cycle is usually allowed for RAM access, creating a 1 cycle penalty (which can be seen in
Additionally, the GLS processor 5402 supports branches whose target is the high side of a fetch packet. An example is shown below:
Lines 1A and 1B represent the first fetch packet in the loop. On first entry into the loop, Line 1A and Line 1B are executed. On subsequent loop iterations, Line 1B is executed. Note that the branch target “&(LOOP+1)” specifies a high side branch. Offsets in GLS processor 5402 (for this example) are natively even; odd offsets specify the high side of a fetch packet. Labels are limited to even offsets; the LOOP+1 syntax specifies the high side of the target fetch packet. It should also be noted that specifying a high side target to a fetch packet containing a single 40 bit instruction is not generally permitted. Also, for high side branches, the high side of the target fetch packet is executed. This is usually true regardless of whether the target fetch packet contains two parallel or two serial instructions.
There is also a small set of loads which do not usually require an address computation since the load address is completely specified by an immediate operand, and these loads are specified to have a zero load use penalty. With these loads, a NOP need not be inserted for the load use penalty (the NOP shown is not in place to enforce a load use delay; the NOP simply disables the A-side for the purposes of explanation):
The top two waveforms show the pipeline advance of the two instructions through fetch, decode and execute. Note that the RAM address is sent to data memory in the load's decode stage 5308 phase. Otherwise the process is the same but with a performance benefit. However there is now an instruction scheduling requirement placed on code generation and validation when no hazard handling logic is included in processor 5200. All instructions which access data memory should be scheduled such that there is no contention for the data memory interface. This includes loads, stores, CALL, RET, LDRF, STRF, LDSYS and STSYS, where LDSYS and STSYS are instructions for the GLS processor 5402. A CALL combines the semantics of a store and a branch; it pushes the return PC value to the stack (in data memory) and branches to the CALL target. A RET combines the semantics of a load and a branch; it loads the return target from the stack (again, in DMEM) and then branches. In spite of the fact that these instructions do not update any internal state of the processor 5200, LDSYS and STSYS have load semantics similar to loads with 1 cycle of load use penalty and utilize the data memory interface in execution stage 5310.
Turning now to
LDW .SB *+R5, R0; 1 cycle load use, uses data memory in execution stage 5310
LDW .SB *+U24, R1; 0 cycle load use, uses data memory in decode stage 5308
Contention can occur since the second load's decode stage 5308 cycle overlaps the first load's execution stage 5310 cycle; these instructions attempt to use the data memory interface in the same clock cycle. Replacing the first load with a store, CALL, RET, LDRF, STRF, LDSYS or STSYS will cause the same situation, and in
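Because no hazard handling logic is assumed, a code generator or validation tool would have to detect this kind of overlap itself. The following is a minimal sketch of such a check under simplifying assumptions: one instruction issues per cycle, each instruction passes through fetch, decode and execute in consecutive cycles, and each instruction is classified (by a hypothetical caller) as using data memory in the execute stage ("exec"), in the decode stage ("decode"), or not at all (None).

```python
def find_dmem_conflicts(schedule):
    """Return pairs of instruction indices that use data memory in the same cycle.

    schedule[i] is "exec", "decode", or None for the instruction issued
    in cycle i (fetch in cycle i, decode in i + 1, execute in i + 2).
    """
    first_user = {}   # cycle -> index of the first instruction using data memory
    conflicts = []
    for i, kind in enumerate(schedule):
        if kind is None:
            continue
        cycle = i + 2 if kind == "exec" else i + 1
        if cycle in first_user:
            conflicts.append((first_user[cycle], i))
        else:
            first_user[cycle] = i
    return conflicts
```

For the two loads above, the schedule ["exec", "decode"] reports a conflict between instructions 0 and 1, while placing any instruction that does not access data memory between them removes it.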
On execution of a CALL instruction the computed return address is written to the address contained in the stack pointer. The computed return address is a fixed positive offset from the current PC. The fixed offset is usually 3 fetch packets from the PC value of the CALL instruction.
Additionally, branch instructions or instructions which exhibit branch behavior, like CALL, have two executable delay slots before the branch occurs. The RET instruction has 3 executable delay slots. The delay slot count is usually measured in execution cycles. Serial instructions in the delay slots of a branch count as one delay slot per serial instruction. An example is shown below:
The instructions above are labeled by their fetch packet, F#1 and their execute packet, Ex#1. The CALL is followed by two serial instructions and then a pair of parallel instructions. In this example the MUL∥SHL fetch packet is not executed. Even though the ADD Ex#2 and the SUB Ex#3 occupy the same fetch packet they are serial so they consume the delay slot cycles in the shadow of the CALL. Rewriting the above code in a functionally equivalent, fully parallel form, makes this explicit:
There is a difference in fetch behavior and code size, but the two fragments result in the same machine state after all delay slots have been executed.
Below is another example of non-parallel instructions, this time where the branch is located on the low side of the packet.
The fetch packet boundaries are explicitly commented. In this case the branch will execute before the ADD. Therefore the ADD counts as one executable delay slot and the SUB/MUL counts as the second executable delay slot. Finally the same example with no parallel instructions.
The branch and the ADD execute as before, with the ADD counting as the first executable delay slot. However in this example the SUB is executed since it is serial in relationship to the MUL, and counts as the second executable delay slot.
As stated above, the general purpose register file 5206 can be a 16-entry by 32-bit general purpose register file. The widths of the general purpose registers (GPRs) can be parameterized. Generally, when processor 5200 is used for nodes (i.e., 808-i), there are 4+15 (15 are controlled by boundary pins) read ports and 4+6 (6 are controlled by boundary pins) write ports, while processor 5200 used for GLS unit 1408 has 4 read ports and 4 write ports.
Generally, all registers within the control register file 5216 are conventionally 16 bits wide; however, not all bits in each register are implemented and parameterization exists to extend or reduce the width of most registers. Twelve registers can be implemented in the control register file 5216. Address space is made available in the instruction set for processor 5200 (in the MVC instructions) for up to 32 control registers for future extensions. Generally, when processor 5200 is used for nodes (i.e., 808-i), there are 2 read ports and 2 write ports, while processor 5200 used for GLS unit 1408 has 4 read ports and 4 write ports. In the general case, the control register file is accessed by using the MVC instruction. MVC is generally the primary mechanism for moving the contents of registers between the register file 5206 and the control register file. MVC instructions are generally single cycle instructions which complete in the execute stage 5310. The register access is similar to that of a register file with by-passing for read-after-write dependency. Direct modification of the control register file entries is generally limited to a few special case instructions. For example, forms of the ADD and SUB instructions can directly modify the stack pointer to improve code execution performance (i.e., other instructions modify the condition code bits, etc.). In Table 9 below, the registers that can be included in control register file 5216 are described.
The stack pointer generally specifies a byte address in processor data memory (i.e., 4328). By convention, the stack pointer can contain the next available address in processor data memory (i.e., 4328) for temporary storage. The LDRF instruction (which is pre-incremented) and the STRF instruction (which is post-decremented) can indirectly modify this register, storing or retrieving register file contents. The CALL instruction (which is post-decremented) and the RET instruction (which is pre-incremented) indirectly modify this register, storing and retrieving the program counter or PC 5218. The stack pointer may be directly updated by software using the MVC instruction. The programmer is generally responsible for ensuring the correct alignment of the SP. Other instructions can be used to directly modify the stack pointer.
The control status register can contain control and status bits. Processor 5200 generally defines (for example) two sets of status bits, one set for each issue slot (i.e., A and B). As shown in the example in Table 7 above, instructions which execute on the A-side update and read status bits CSR [4:0]. Instructions which execute on the B-side update and read status bits CSR [9:5]. All bits can be directly readable or writeable from either side using the MVC instructions. In Table 10 below, the bits for the control status register illustrated in Table 8 above are described.
Execution of compare instructions will enforce a one-hot condition for greater than/less than/equal to (GT/LT/EQ). However, the condition code bits GT, LT, EQ are generally not required to be one-hot but may be set in any combination using the MVC or by combinations of CMP and instructions which update the EQ bit. Having more than one bit set will not affect conditional branch execution, as each branch compares the respective condition bits (i.e., BGE .SA uses CSR[2] and CSR[0] to determine if the branch is taken). The remaining condition bits have no effect on BGE .SA.
This register generally responds to register moves but has no effect on interrupts. The interrupt enable register (which can be about 16 bits) generally combines the functions of an interrupt status register, interrupt set register, interrupt clear register and interrupt mask register into a single register. The interrupt enable register's “E” bits can control individual enable and disable (masking) of interrupts. A one written to an interrupt enable bit (i.e., E0 at [0] for int0 and E1 at [2] for int1) enables that interrupt. The interrupt enable register's “C” bits can provide status and control for the associated interrupts (i.e., C0 at [1] for int0 and C1 at [3] for int1). When an interrupt has been accepted, the associated C bit is set and the remaining C bits are cleared. On execution of a RETI instruction, all C bit values are cleared. The C bits can also be used to mimic the initiation of an interrupt. A 1 written to a C bit that is currently cleared initiates interrupt processing as if the associated interrupt pin had been asserted. All other processing steps and restrictions can be the same as for a pin asserted interrupt (GIE should be set, the associated E bit should be set, etc.). It should also be noted that if software wishes to use bit C1 (associated with int1) for this purpose, external hardware should generally ensure that a valid value is driven onto new_pc and the force_pcz signal is held high before writing to bit C1.
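The E and C bit behavior can be modeled with a few bit operations. This is an illustrative sketch, not the register's implementation: the function names are hypothetical, the enable bit for int0 is called E0 here by analogy with E1, and only the four bit positions named above ([0] through [3]) are modeled.

```python
# Bit positions as described in the text: E0 [0], C0 [1], E1 [2], C1 [3].
E0, C0, E1, C1 = 1 << 0, 1 << 1, 1 << 2, 1 << 3
C_MASK = C0 | C1


def is_enabled(ier, n):
    """True if interrupt n (0 for int0, 1 for int1) has its E bit set."""
    return bool(ier & (E0 if n == 0 else E1))


def accept(ier, n):
    """Accepting interrupt n sets its C bit and clears the remaining C bits."""
    return (ier & ~C_MASK) | (C0 if n == 0 else C1)


def reti(ier):
    """Executing RETI clears all C bits and leaves the E bits unchanged."""
    return ier & ~C_MASK
```

The sketch keeps the E bits untouched across accept and RETI, which matches the text's description of the C bits as the only bits changed by those events.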
This register (which can also be 16 bits) generally responds to register moves but has no effect on interrupts. The interrupt return pointer can contain the address of the first instruction in the program flow that was not executed due to occurrence of an interrupt. The value contained in the interrupt return pointer can be copied directly to the PC 5218 upon execution of a BIRP instruction.
The load base register (which can also be 16 bits) can contain a base address used in some load instruction types. This register generally contains a 16 bit base address which when combined with general purpose register contents or immediate values, provides a flexible method to access global data.
The store base register can contain a base address used in some store instruction types. This register generally contains a 16 bit base address which when combined with general purpose register contents or immediate values, provides a flexible method to access global data.
The program counter or PC 5218 is generally an architectural register (i.e., it contains machine state of execution unit 4344 but is not directly accessible through the instruction set). Instruction execution has an effect on the PC 5218, but the current PC value cannot be read or written explicitly. The PC 5218 is (for example) 16 bits wide, representing the instruction word address of the current instruction. Internally, the PC 5218 can contain an extra LSB, the half word instruction address bit. This bit indicates (for example) the high or low half of an instruction word for 20-bit serially executed instructions (i.e. p-bit=0). This extra LSB is generally not visible, nor can the state of this bit be manipulated through program or external pin control. For example, a force_pcz event implicitly clears the half word instruction address bit.
Processor 5200 generally includes instructions which use a circular addressing mode to access buffers in memory. These instructions can be the six forms of OUTPUT and the CIRC instruction, which can, for example, include:
(1) (V)OUTPUT .SB R4, R4, S8, U6, R4
(2) (V)OUTPUT .SB R4, S14, U6, R4
(3) (V)OUTPUT .SB U18, U6, R4
(4) CIRC .SB R4, S8, R4
These instructions are generally 40 bits wide, and the VOUTPUT instructions are generally the vector/SIMD equivalent of the scalar OUTPUT instructions. Circular addressing instructions generally use a buffer control register to determine the results of a circular address calculation, and an example of the register format can be seen in Table 11 below.
The boundary pins new_ctx_data and cmem_wdata can be used to move machine state to and from the processor 5200 core. This movement is initiated by the assertion of force_ctxz. External logic can initiate a context switch by driving force_ctxz low and simultaneously driving new_ctx_data with the new machine state. Processor 5200 detects force_ctxz on the rising edge of the clock. Assertion of force_ctxz can cause processor 5200 to begin saving its current state and load the data driven on new_ctx_data into the internal processor 5200 registers. Subsequently processor 5200 can assert the signal cmem_wdata_valid and drive the previous state onto the cmem_wdata bus. While the context switch can occur immediately, there can be a two cycle delay between detection of force_ctxz assertion, and the assertion by processor 5200 of cmem_wdata_valid and cmem_wdata. These two cycles generally allow instructions in the decode stage 5308 and execute stage 5310 at the assertion of force_ctxz, to properly update the machine state before this machine state is written to the context memories. Processor 5200 can continue to assert cmem_wdata_valid and cmem_wdata until the assertion of cmem_rdy. Typically, cmem_rdy is asserted, but this allows external control logic to determine how long processor 5200 should keep cmem_wdata_valid and cmem_wdata valid. The format of the new_ctx_data and cmem_wdata buses is shown in Table 12 below.
Nodes (i.e., 808-i) can require access to the general purpose registers of processor 5200 as part of the SIMD instruction set. A pin is provided which will cause processor 5200 to drive the general purpose register contents onto cmem_wdata, which is normally held at a constant value to reduce switching power consumption and is active during write back of the machine state of processor 5200 as a side effect of a context switch (force_ctxz assertion). The input pin cmem_gpr_renz is generally provided to allow external logic to read the current value of the register file 5206. This input pin is used combinatorially by processor 5200 to drive the register file 5206 onto bits cmem_wdata[511:0].
Processor 5200 can support four externally signaled interrupts: reset (rst0z), a non-maskable interrupt (nmi), a maskable interrupt (int0) and an externally managed maskable interrupt (int1). int1 is typically the output of an external interrupt controller. In addition to reset, other events can be treated as interrupts by the hardware, for example, execution of a SWI (software interrupt) instruction and detection by the hardware of an undefined instruction. Table 13 below illustrates a summary of example interrupts for processor 5200, and the logical timings for these interrupts can be seen in
The debug module for the processor 5200 (which is a part of the processing unit 5202) utilizes the wrapper interface (i.e., node wrapper 810-i) to simplify the design of the debug module. The boundary pins for debug support are listed above in Table 7. The debug register set is summarized below in Table 14.
Generally, the DBG_CNTRL register implements a single bit which re-enables event capture after the detection of an IDLE instruction. Processor 5200 indicates that it is in the IDLE state by the assertion of boundary pin risc_is_idle. To avoid counting irrelevant events, event capture and counting are halted when the processor 5200 is in the idle state. DBG_CNTRL[0] is a sticky bit which indicates that an IDLE state has been detected. A write of 0x0 to DBG_CNTRL can be used to clear this bit. Once the processor 5200 has been moved out of the IDLE state, DBG_CNTRL[0]=0 will re-enable event counting.
There are also four instruction memory address break- or trace-point registers. A break- or trace-point match is indicated by assertion of the risc_brk_trc_match pin. A trace-point match is indicated by further assertion of risc_trc_pt_match. External logic can detect a break point by:
break point match=risc_brk_trc_match & !risc_trc_pt_match.
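The decoding that external logic would perform on these two pins can be written out as a short sketch. The function name and the returned labels are chosen for illustration only; the logic follows the expression above.

```python
def classify_match(risc_brk_trc_match, risc_trc_pt_match):
    """Classify the two match pins as "none", "trace", or "break"."""
    if not risc_brk_trc_match:
        return "none"                  # no break- or trace-point match
    if risc_trc_pt_match:
        return "trace"                 # match qualified as a trace point
    return "break"                     # risc_brk_trc_match & !risc_trc_pt_match
```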
In cases where multiple BRKx registers are programmed identically, the BRKx register with the lowest address will control assertion of the risc_trc_pt_match_id; BRK0 will have precedence over BRK1, etc. Behavior is undetermined when two or more BRKx registers are identical with the exception of the TM bit. This is considered an illegal condition and should be avoided.
There are also 8 event counters and 8 associated event counter control registers. Each event counter can be programmed to count one event type. There are 11 internal event types and 16 user defined event types. User events are supplied to the debug module via the pins wp_events. User defined events are expected to be single cycle per event and active high on the wp_events bus. The ECC0-ECC7 registers consist of a mux select field [6:0] and an enable bit [7]. The event count registers EC0-EC7 simply contain the count values for the events programmed by the associated ECC0-ECC7 registers. EC0-EC7 are 16 bit registers which are cleared on reset. The upper 16 bits are not writeable and read as zeros.
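The ECC register layout can be illustrated by unpacking its two fields. The function name and the dictionary keys are hypothetical; only the field positions (mux select in bits [6:0], enable in bit [7]) come from the text.

```python
def decode_ecc(value):
    """Unpack an event counter control (ECC) register value."""
    return {
        "enable": bool((value >> 7) & 1),   # bit [7]
        "mux_select": value & 0x7F,         # bits [6:0], selects the event type
    }
```

For instance, a value of 0x85 would describe an enabled counter with mux select 5; which event that select value maps to is not specified here.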
Table 15 below illustrates an example of an instruction set architecture for processor 5200, where:
8. RISC Processor Core with a Vector Processing Module Example
A RISC processor with a vector processing module is generally used with shared function-memory 1410. This RISC processor is largely the same as the RISC processor used for processor 5200, but it includes a vector processing module to extend the computation and load/store bandwidth. This module can contain 16 vector units that are each capable of executing a 4-operation execute packet per cycle. A typical execute packet generally includes a data load from the vector memory array, two register-to-register operations, and a result store to the vector memory array. This type of RISC processor generally uses an instruction word that is 80 bits wide or 120 bits wide, which generally constitutes a “fetch packet” and which may include unaligned instructions. A fetch packet can contain a mixture of 40-bit and 20-bit instructions, which can include vector unit instructions and scalar instructions similar to those used by processor 5200. Typically, vector unit instructions can be 20 bits wide, while other instructions can be 20 bits or 40 bits wide (similar to processor 5200). Vector instructions can also be presented on all lanes of the instruction fetch bus, but, if the fetch packet contains both scalar and vector unit instructions, the vector instructions are presented (for example) on instruction fetch bus bits [39:0] and the scalar instruction(s) are presented (for example) on instruction fetch bus bits [79:40]. Additionally, unused instruction fetch bus lanes are padded with NOPs.
An “execute packet” can then be formed from one or more fetch packets. Partial execute packets are held in the instruction queue until completed. Typically, complete execute packets are submitted to the execute stage (i.e., 5310). Four vector unit instructions (for example), two scalar instructions (for example), or a combination of 20-bit and 40-bit instructions (for example) may execute in a single cycle. Back-to-back 20-bit instructions may also be executed serially. If bit 19 of the current 20-bit instruction is set, this indicates that the current instruction and the subsequent 20-bit instruction form an execute packet. Bit 19 can be generally referred to as the P-bit or parallel bit. If the P-bit is not set, this indicates the end of an execute packet. Back-to-back 20-bit instructions with the P-bit not set cause serial execution of the 20-bit instructions. It should also be noted that this RISC processor (with a vector processing module) may include any of the following constraints:
Turning to
This RISC processor (which includes processor 5200 and a vector module) can also be accessed through boundary pins; an example of each is described in Table 16 (with “z” denoting active low pins).
Within the vector units, up to (for example) four instructions can execute simultaneously. This set of four instructions includes at most one load and one store and up to two other instructions. Alternatively, up to four non-load and non-store instructions (for example) can be executed. All vector units can execute the same execute packet (the same set of up to four vector instructions, for example), but do so using their local register files.
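The packet composition rule above can be sketched, purely for illustration, as a C++ check. The names VecOp, kMaxOpsPerPacket, and is_valid_vector_packet are hypothetical, and the limit of four operations is the example value given in the description rather than a fixed requirement.

```cpp
#include <array>
#include <cstddef>

// Coarse classification of a vector unit instruction (hypothetical for this sketch).
enum class VecOp { Load, Store, Other };

constexpr std::size_t kMaxOpsPerPacket = 4;  // example limit from the description

// Returns true when the packet has at most four operations,
// at most one load, and at most one store.
template <std::size_t N>
constexpr bool is_valid_vector_packet(const std::array<VecOp, N>& ops) {
    if (N > kMaxOpsPerPacket) return false;
    int loads = 0;
    int stores = 0;
    for (std::size_t i = 0; i < N; ++i) {
        if (ops[i] == VecOp::Load) ++loads;
        if (ops[i] == VecOp::Store) ++stores;
    }
    return loads <= 1 && stores <= 1;
}
```

The typical packet described earlier (one load, two register-to-register operations, one store) passes this check, as does a packet of four non-load, non-store operations.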
The general purpose register file is similar to register file 5206 described above.
The control register file here is similar to the control register file 5216 described above; however, the control register file here includes several more registers. In Table 17 below, the registers that can be included in this control register file are described, and the additional registers are described in the following sections.
The HG_SIZE register can be written by external logic using the debug interface. HG_SIZE can be used as an implied operand in some instructions.
8.6. Horizontal Position Register (HG_POSN)
The HG_POSN register can be written by external logic using the debug interface. HG_POSN can be used as an implied operand in some instructions. It should also be noted that HG_POSN has a special property: if the value to be written to HG_POSN is larger than the current value of the HG_SIZE register, then HG_POSN is written with zero.
In conjunction with the interrupt behavior described with respect to node processor 4322 above, this RISC processor also includes a GIE bit or global interrupt enable bit. If the GIE bit is cleared, assertions on pins nmi, int0, and int1 are ignored. In addition, pins int0 and int1 each have an associated enable bit in the interrupt enable register, which individually masks the associated input. The “reset interrupt” (input pin rstz0), software interrupts (SWI instruction), and UNDEF interrupts (detection of an undefined instruction) are usually enabled. These interrupts are generally not affected by the GIE bit and do not have entries in the interrupt enable register.
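A minimal C++ sketch of the HG_POSN write behavior follows. The struct HorizontalGroupRegs and the function write_hg_posn are hypothetical names, and the 32-bit register width is an assumption made for this sketch.

```cpp
#include <cstdint>

// Hypothetical model of the two horizontal-group control registers.
struct HorizontalGroupRegs {
    std::uint32_t hg_size;
    std::uint32_t hg_posn;
};

// A value larger than the current HG_SIZE is replaced by zero on the write.
constexpr HorizontalGroupRegs write_hg_posn(HorizontalGroupRegs regs,
                                            std::uint32_t value) {
    return HorizontalGroupRegs{regs.hg_size,
                               value > regs.hg_size ? 0u : value};
}
```

A value equal to HG_SIZE is stored unchanged in this sketch, because the description specifies the wrap to zero only for values larger than HG_SIZE.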
Reset is generally considered the highest priority interrupt and can be used to halt the processing unit (i.e., 5202) and return it to a known state. Some of the characteristics of reset interrupt can be:
Here, two maskable interrupts (i.e., int0 and int1) can be supported. Assuming that a maskable interrupt does not occur during the delay slot of a branch, the following conditions should be met to process a maskable interrupt:
For maskable interrupts the IRP register is loaded with the return address of the next instruction to execute after the maskable interrupt service routine terminates. To exit a maskable interrupt service routine the BIRP instruction is used. (Note BIRP has a 2 cycle delay slot which is also executed before returning control.) Execution of BIRP causes T80 to copy the contents of the IRP register to the PC. For int0 and int1, assuming the GIE bit is set, and the associated interrupt enable register bit is also set, the following actions can be performed:
A non-maskable interrupt or NMI is generally considered the second-highest priority interrupt and is generally used to alert of a serious hardware problem. For NMI processing to occur, the global interrupt enable (GIE) bit in the interrupt enable register (IER) should be set to 1. This simplifies the external control logic typically desired to block NMIs during power on or reset. Processing of an NMI is similar to maskable interrupt processing, except that there is no requirement that an IER bit be set (NMI has no such bit). Otherwise, the same steps are taken for entry and exit from the interrupt service routines.
The software interrupt or SWI instruction is used to trigger the software interrupt. Decoding of an SWI instruction generally causes the SWI IST entry to be loaded into the program counter (i.e., 5218). Control can be returned to the instruction immediately following the SWI instruction on the execution of a BIRP within the software interrupt service routine. Decode of an SWI instruction causes a store to the IRP register with the return address of the next instruction to execute after the SWI service routine is complete. To exit an SWI service routine, the BIRP instruction is used.
An UNDEF interrupt is triggered by decode stage (i.e., 5308) whenever an undefined instruction is detected. Detection of an undefined instruction causes the UNDEF IST entry to be loaded into the program counter (i.e., 5218). Control is returned to the instruction immediately following the UNDEF on the execution of a BIRP within the UNDEF interrupt service routine. Decode of an undefined instruction causes a load of the IRP register with the return address of the next instruction to execute after the UNDEF service routine is complete. For the purposes of next instruction address calculations, UNDEF instructions are treated as narrow instructions, where narrow instructions occupy a single instruction word, whereas wide instructions occupy two instruction words. In many cases the UNDEF interrupt is an indication of a severe problem in the contents of the instruction memory; however, provisions are available to recover from an UNDEF interrupt.
A processor 5200 that includes a vector module (such as the processor for the shared function memory 1410, which is discussed in detail below) can support scalar-initiated loads and stores to the function-memory (discussed below); these instructions use vector implied addressing. Address calculation and assertion of function-memory control signals are handled by instructions executing on the processor 5200. The source data (for vector implied stores) and the destination register (for vector implied loads) are sourced/received by the vector units. A handshake interface is present in processor 5200 (with a vector module) between the processor 5200 and the vector units. This interface provides operand information to the vector units. An example of a vector implied load can be seen in
The debug module for the processor 5200 (which is a part of the processing unit 5202) utilizes the wrapper interface (i.e., node wrapper 810-i) to simplify the design of the debug module. The boundary pins for debug support are listed above in Table 16. The debug register set is summarized below in Table 19.
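The enable conditions described for the various interrupt sources can be summarized, as a sketch under stated assumptions, by the following C++ predicate. The names IrqSource, InterruptEnables, and is_accepted are hypothetical; the sketch covers only the GIE and per-pin enable gating and does not model priority, delay-slot restrictions, or the entry and exit sequences.

```cpp
// Interrupt sources named in the description (hypothetical enum for this sketch).
enum class IrqSource { Reset, Nmi, Int0, Int1, Swi, Undef };

// Relevant interrupt enable register bits (hypothetical field names).
struct InterruptEnables {
    bool gie;          // global interrupt enable
    bool int0_enable;  // individual enable for pin int0
    bool int1_enable;  // individual enable for pin int1
};

// Returns true when the enable bits allow the source to be processed.
constexpr bool is_accepted(IrqSource source, InterruptEnables en) {
    switch (source) {
        case IrqSource::Nmi:  return en.gie;                    // no per-source bit
        case IrqSource::Int0: return en.gie && en.int0_enable;
        case IrqSource::Int1: return en.gie && en.int1_enable;
        case IrqSource::Reset:
        case IrqSource::Swi:
        case IrqSource::Undef: return true;                     // not gated by GIE
    }
    return false;  // not reached for valid enumerators
}
```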
Table 20 below illustrates an example of an instruction set architecture for a RISC processor having a vector processing module:
The GLS unit 1408 can map a general C++ model of data types, objects, and assignment of variables to the movement of data between the system memory 1416, peripherals 1414, and nodes, such as node 808-i, (including hardware accelerators if applicable). This enables general C++ programs which are functionally equivalent to operation of processing cluster 1400, without requiring simulation models or approximations of system Direct Memory Access (DMA). The GLS unit can implement a fully general DMA controller, with random access to system data structures and node data structures, and which is a target of a C++ compiler. The implementation is such that, even though the data movement is controlled by a C++ program, the efficiency of data movement approaches that of a conventional DMA controller, in terms of utilization of available resources. However, it generally avoids the desire to map between system DMA and program variables, avoiding possibly many cycles to pack and unpack data into DMA payloads. It also automatically schedules data transfers, avoiding overhead for DMA register setup and DMA scheduling. Data is transferred with almost no overhead and no inefficiency due to schedule mismatches.
Turning now to
For GLS unit 1408, there can be three main interfaces (i.e., system interface 5416, node interface 5420, and messaging interface 5418). For the system interface 5416, there is typically a connection to the system L3 interconnect, for access to system memory 1416 and peripherals 1414. This interface 5416 generally has two buffers (in a ping-pong arrangement) large enough to store (for example) 128 lines of 256-bit L3 packets each. For the messaging interface 5418, the GLS unit 1408 can send/receive operational messages (i.e., thread scheduling, signaling termination events, and Global LS-Unit configuration), can distribute fetched configurations for processing cluster 1400, and can transmit scalar values to destination contexts. For node interface 5420, the global IO buffer 5406 is generally coupled to the global data interconnect 814. Generally, this buffer 5406 is large enough to store 64 lines of node SIMD data (each line, for example, can contain 64 pixels of 16 bits). The buffer 5406 can also, for example, be organized as 256×16×16 bits to match the global transfer width of 16 pixels per cycle.
Now, turning to the memories 5403, 5405, and 5410, each contains information that is generally pertinent to resident threads. The GLS instruction memory 5405 generally contains instructions for all resident threads, regardless of whether the threads are active or not. The GLS data memory 5403 generally contains variables, temporaries, and register spill/fill values for all resident threads. The GLS data memory 5403 can also have an area hidden from the thread code which contains thread context descriptors and destination lists (analogous to destination descriptors in nodes). There is also a scalar output buffer 5412 which can contain outputs to destination contexts; this data is generally held in order to be copied to multiple destination contexts in a horizontal group, and pipelines the transfer of scalar data to match the processing cluster 1400 processing pipeline. The dataflow state memory 5410 generally contains dataflow state for each thread that receives scalar input from the processing cluster 1400, and controls the scheduling of threads that depend on this input.
Typically, the data memory for the GLS unit 1408 is organized into several portions. The thread context area of data memory 5403 is visible to programs for GLS processor 5402, while the remainder of the data memory 5403 and context save memory 5414 remain private. The Context Save/Restore area or context save memory is usually a copy of GLS processor 5402 registers for all suspended threads (i.e., 16×16×32-bit register contents). The two other private areas in the data memory 5403 contain context descriptors and destination lists.
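The example buffer dimensions above are mutually consistent, as the following C++ compile-time arithmetic illustrates. All constant names are hypothetical, and the figures are the example values from the description rather than required sizes.

```cpp
#include <cstdint>

// System interface: each of the two ping-pong buffers holds 128 lines of 256 bits.
constexpr std::uint64_t kSystemBufferBits = 128ull * 256;   // 32768 bits
constexpr std::uint64_t kSystemBufferBytes = kSystemBufferBits / 8;

// Global IO buffer viewed as 64 lines of 64 pixels at 16 bits per pixel.
constexpr std::uint64_t kNodeBufferBits = 64ull * 64 * 16;  // 65536 bits

// The same buffer viewed as 256 rows of 16 pixels at 16 bits per pixel.
constexpr std::uint64_t kNodeBufferAltBits = 256ull * 16 * 16;

// With 16 pixels moved per cycle, one 64-pixel line takes 4 cycles (4 rows).
constexpr std::uint64_t kCyclesPerLine = 64 / 16;

static_assert(kSystemBufferBytes == 4096, "4 KiB per ping-pong buffer");
static_assert(kNodeBufferBits == kNodeBufferAltBits, "both organizations match");
static_assert(64 * kCyclesPerLine == 256, "64 lines occupy 256 rows");
```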
The Request Queue and Control 5408 generally monitors load and store accesses for the GLS processor 5402 outside of the GLS data memory 5403. These load and store accesses are performed by threads to move system data to the processing cluster 1400 and vice versa, but data usually does not physically flow through the GLS processor 5402, and it generally does not perform operations on the data. Instead, the Request Queue 5408 converts thread “moves” into physical moves at the system level, matching load with store accesses for the move, and performing address and data sequencing, buffer allocation, formatting, and transfer control using the system L3 and processing cluster 1400 dataflow protocols.
The Context Save/Restore Area or context save memory 5414 is generally a wide RAM that can save and restore all registers for the GLS processor 5402 at once, supporting 0-cycle context switch. Thread programs can require several cycles per data access for address computation, condition testing, loop control, and so forth. Because there are a large number of potential threads and because the objective is to keep all threads active enough to support peak throughput, it can be important that context switches can occur with minimum cycle overhead. It should also be noted that thread execution time can be partially offset by the fact that a single thread “move” transfers data for all node contexts (e.g., 64 pixels per variable per context in the horizontal group). This can allow a reasonably large number of thread cycles while still supporting peak pixel throughputs.
Now, turning to the thread-scheduling mechanism, this mechanism generally comprises message list processing 5402 and thread wrappers 5404. The thread wrappers 5404 typically receive incoming messages, into mailboxes, to schedule threads for GLS unit 1408. Generally, there is a mailbox entry per thread, which can contain information (such as the initial program count for the thread and the location in processor data memory (i.e., 4328) of the thread's destination list). The message also can contain a parameter list that is written starting at offset 0 into the thread's processor data memory (i.e., 4328) context area. The mailbox entry also is used during thread execution to save the thread program count when the thread is suspended, and to locate destination information to implement the dataflow protocol.
In addition to messaging, the GLS unit also performs configuration processing. Typically, this configuration processing can implement a Configuration Read thread, which fetches a configuration for processing cluster 1400 (containing programs, hardware initialization, and so forth) from memory and distributes it to the remainder of processing cluster 1400. Typically, this configuration processing is performed over the node interface 5420. Additionally, the GLS data memory 5403 can generally comprise sections or areas for context descriptors, destination lists, and thread contexts. Typically, the thread context area can be visible to the GLS processor 5402, but the remaining sections or areas of the GLS data memory 5403 may not be visible.
The context descriptors contain the base addresses, in GLS data memory 5403, of contexts for all resident threads, whether active or not. A resident thread generally has the associated code located somewhere in GLS instruction memory 5405. The base address is generally located somewhere in the thread context area; this is generally the available portion of the GLS data memory 5403, not including words in the context descriptor area, and not including whatever portion of the GLS data memory 5403 is taken by the destination lists (variable). Context areas are generally provided for resident threads whether or not they have been scheduled to execute because a resident thread can be scheduled at any time, and its context should be available at that time.
Turning to
A destination list provides the capability for a read thread to output to multiple destinations. The structure of entries on the destination list depends on the use of the list. Read-thread programs access entries on the destination list as an array, analogous to node destination descriptors. For hardware access, when Output_Terminate (OT) has to be signaled to destinations, the destination list is organized as a sequential list of destination entries (there is no active program in this situation). In
As an example, the message that schedules a read thread contains the base address of the thread's array of destination entries (this is a halfword address). Each output of the read thread has a corresponding destination-tag identifier (Dst_Tag), which is the index into this array. When hardware accesses the list, it sends OT signals to all initial destinations identified by the list with OTe=1, starting at the first entry, up to and including the entry with Bk set.
Typically, destination-list entries contain two sets of related fields, containing information for destination segment identifiers, node identifiers, and context numbers or thread identifiers. The first halfword (i.e., bits 15:0) can contain information for the initial destination, set by the thread scheduling message; these fields do not generally change during execution. The second halfword (i.e., bits 31:16) can contain information for the next destination; these fields are updated by the dataflow protocol to enable the next transfer and to indicate the destination information for this transfer. The initial destination information is used as the destination for Output Termination messages from the thread (the destination context forwards this to other contexts in the horizontal group). It also can be used to sequence back to the first context when the right boundary is encountered as a destination (the Rt bit is set in the Source Permission), except that this information can also be obtained by enabling forwarding of a Source Notification to the right-boundary context.
Destination-list entries can also contain a Src_Tag field to identify this source to the destination, and a PermissionCount field to store the enabled number of transfers for thread destinations (this field is set to 1111′b for non-thread destinations, enabling an unlimited number of transfers). The Bk and OTe bits can control OT signals when the thread terminates. Some destinations are defined so that a read thread can provide initialization data to programs that don't participate in the main dataflow from the thread. These destinations should not receive an OT from the read thread, but instead from their own dataflow sources. Upon termination, hardware transmits an OT to every enabled destination (OTe=1), up to the entry with Bk=1.
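The termination walk over the destination list can be sketched in C++ as follows. The struct DestinationEntry and the function ot_targets are hypothetical, the entry is reduced to its OTe and Bk bits, and the returned bit mask is merely a convenient way for this sketch to report which entries would receive an Output_Terminate.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Only the two control bits of a destination-list entry are modeled here.
struct DestinationEntry {
    bool ote;  // OTe: entry is enabled to receive Output_Terminate
    bool bk;   // Bk: entry is the last one examined on termination
};

// Walks the list from the first entry up to and including the entry with
// Bk set; bit i of the result is set when entry i has OTe=1.
template <std::size_t N>
constexpr std::uint32_t ot_targets(const std::array<DestinationEntry, N>& list) {
    static_assert(N <= 32, "the result mask is 32 bits wide in this sketch");
    std::uint32_t mask = 0;
    for (std::size_t i = 0; i < N; ++i) {
        if (list[i].ote) mask |= (std::uint32_t{1} << i);
        if (list[i].bk) break;  // entries after Bk are not examined
    }
    return mask;
}
```

For a four-entry list in which entries 0 and 2 have OTe=1, entry 2 has Bk=1, and entry 3 has OTe=1, only entries 0 and 2 are signaled; entry 3 lies beyond the Bk entry.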
In this example, each entry on the list can be updated with new destination information returned in Source Permission messages. The Source Permission contains the Thread_ID and Dst_Tag of the read or multi-cast thread, sent originally with the Source Notification. The Thread_ID selects the destination-list base address from the corresponding mailbox entry. The Dst_Tag selects the position of the entry relative to the base address. Dst_Tag 0 identifies the first list entry, and so on.
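As a hedged sketch, the update of a destination-list entry on receipt of a Source Permission can be modeled in C++ by treating each 32-bit entry as two opaque halfwords. The function names are hypothetical, and the sketch assumes that the whole of bits 31:16 is replaced by the returned next-destination information; the actual field-level layout is not reproduced here.

```cpp
#include <cstdint>

// Bits 15:0 hold the initial destination, which does not change during execution.
constexpr std::uint16_t initial_destination(std::uint32_t entry) {
    return static_cast<std::uint16_t>(entry & 0xFFFFu);
}

// Bits 31:16 hold the next destination, updated by the dataflow protocol.
constexpr std::uint16_t next_destination(std::uint32_t entry) {
    return static_cast<std::uint16_t>(entry >> 16);
}

// Replaces the next-destination halfword and preserves the initial halfword.
constexpr std::uint32_t update_next_destination(std::uint32_t entry,
                                                std::uint16_t next) {
    return (entry & 0x0000FFFFu) | (static_cast<std::uint32_t>(next) << 16);
}
```

In use, the Thread_ID would select the list base address from the mailbox entry and the Dst_Tag would index the entry to be updated, with Dst_Tag 0 selecting the first entry.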
In order for the program for GLS processor 5402 to function correctly, it should have a view of memory that is generally consistent with other 32-bit processors in the processing cluster 1400, and also generally consistent with the node processors (i.e., node processor 4322) and SFM processor 7614 (which is described below). Generally, it is straightforward for GLS processor 5402 to have common addressing modes with the processing cluster 1400 because it is a general-purpose, 32-bit processor, with comparable addressing modes for system variables and data structures as other processors and peripherals (i.e., 1414). The issues can arise with software for the GLS processor 5402 operating correctly with data types and context organizations, and correctly performing data transfers using a C++ programming model.
Conceptually, the GLS processor 5402 can be considered a special form of vector processor (where vectors are, for example, in the form of all pixels on a scan line in a frame or, for example, in the form of a horizontal group within the node contexts). These vectors can have a variable number of elements, depending on the frame width and context organization. The vector elements also can be of variable size and type, and adjacent elements do not necessarily have the same type because pixels, for example, can be interleaved with other types of pixels on the same line. The program for the GLS processor 5402 can convert system vectors into the vectors used by node contexts; this is not a general set of operations but usually involves movement and formatting of these vectors, with the dataflow protocol assisting in ordering and keeping the program for the GLS processor 5402 abstracted from the node-context organization for a particular use-case.
System data can have many different formats, which can reflect different pixel types, data sizes, interleaving patterns, packing, and so on. In a node (i.e., 808-i), SIMD data memory pixel data is, for example, in wide, de-interleaved formats of 64 pixels, aligned 16 bits per pixel. The correspondence between system data and node data is further complicated by the fact that a “system access” is intended to provide input data for all input contexts of a horizontal group: the configuration of this group, and its width, depend on factors outside the application program. It is generally very undesirable to expose this level of detail—either the format conversions to and from the specific node formats, or the variable node-context organization—to the application program. These are typically very complex to handle at the application level, and the details are implementation-dependent.
In source code for GLS processor 5402, value assignment of a system variable to a local variable generally can require that the system variable have a data type that can be converted to a local data type, and vice versa. Examples of basic system data types are characters and short integers, which can be converted to 8-, 10-, or 12-bit pixels. System data also can have synthetic types such as packed arrays of pixels, in either interleaved or de-interleaved formats, and pixels can have various formats, such as Bayer, RGB, YUV, and so forth. Examples of basic local data types are integers (32 bits), short integers (16 bits), and paired short integers (two 16-bit values packed into 32 bits). Variables of the basic system and local data types can appear as elements in arrays, structures, and combinations of these. System data structures can contain compatible data elements in combination with other C++ data types. Local data structures usually can contain local data types as elements. Nodes (i.e., 808-i) provide a unique type of array that implements a circular buffer directly in hardware, supporting vertical context sharing, including top- and bottom-edge boundary processing. Typically, the GLS processor is included in the GLS unit 1408 to (1) abstract the above details from users, using C++ object classes; (2) provide dataflow to and from the system that maps to the programming model; (3) perform the equivalent of very general, high-performance direct memory access that conforms to the data-dependency framework of processing cluster 1400; and (4) schedule dataflow automatically for efficient processing cluster 1400 operation.
Application programs use objects of a class, called Frame, to represent system pixels in an interleaved format (the format of an instance is specified by an attribute). Frames are organized as an array of lines, with the array index specifying the location of a scan-line at a given vertical offset. Different instances of a Frame object can represent different interleaved formats of different pixel types, and multiples of these instances can be used in the same program. Assignment operators in Frame objects perform de-interleaving or interleaving operations appropriate to the format, depending on whether data is being transferred to or from processing cluster 1400.
The details of local data types and context organization are abstracted by introducing the concept of a class Line (in GLS unit 1408, Block data is treated as an array of Line data, with explicit iteration providing multiple lines to the block). Line objects, as implemented by the program for GLS processor 5402, generally support no operations other than variable assignment from, or assignment to, compatible system data-types. Line objects usually encapsulate all the attributes of system/local data correspondence, such as: pixel types, both node inputs and outputs; whether data is packed or not, and how data is packed and unpacked; whether data is interleaved or not, and the interleaving and de-interleaving patterns; and context configurations of the nodes.
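The paired short integer local data type mentioned above can be illustrated by the following C++ sketch. The function names are hypothetical, and the placement of the first value in the low halfword is an assumption of this sketch, since the description does not state which half holds which value.

```cpp
#include <cstdint>

// Packs two 16-bit values into one 32-bit word (low half first, by assumption).
constexpr std::uint32_t pack_pair(std::int16_t low, std::int16_t high) {
    return static_cast<std::uint32_t>(static_cast<std::uint16_t>(low)) |
           (static_cast<std::uint32_t>(static_cast<std::uint16_t>(high)) << 16);
}

// Recovers the value stored in bits 15:0.
constexpr std::int16_t low_half(std::uint32_t packed) {
    return static_cast<std::int16_t>(static_cast<std::uint16_t>(packed & 0xFFFFu));
}

// Recovers the value stored in bits 31:16.
constexpr std::int16_t high_half(std::uint32_t packed) {
    return static_cast<std::int16_t>(static_cast<std::uint16_t>(packed >> 16));
}
```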
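The de-interleaving performed on assignment from a Frame can be sketched generically in C++ as the extraction of one component from a line of K interleaved components. The template Pixels and the function deinterleave are hypothetical and are not the Frame class itself; actual formats (Bayer, RGB, YUV, and so forth) and bit-level packing are not modeled.

```cpp
#include <cstddef>
#include <cstdint>

// Fixed-length run of 16-bit pixels (hypothetical helper type).
template <std::size_t N>
struct Pixels {
    std::uint16_t px[N];
};

// Extracts component `c` (0 <= c < K) from a line in which K pixel
// components are interleaved, e.g. c0 c1 c0 c1 ... for K == 2.
template <std::size_t K, std::size_t N>
constexpr Pixels<N / K> deinterleave(const Pixels<N>& in, std::size_t c) {
    static_assert(K > 0 && N % K == 0, "line length must be a multiple of K");
    Pixels<N / K> out{};
    for (std::size_t i = 0; i < N / K; ++i) {
        out.px[i] = in.px[i * K + c];
    }
    return out;
}
```

For example, with K equal to 2 and an input line of 10, 20, 11, 21, 12, 22, component 0 yields 10, 11, 12 and component 1 yields 20, 21, 22.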
Turning to
The GLS processor 5402 processes vectors of pixels in either system formats or node-context formats. However, the datapath for the GLS processor 5402 in this example does not directly perform any operations on these vectors. The operations that can be supported by the programming model in this example are assignment from Frame to Line or shared function-memory 1410 Block types, and vice versa, performing any formatting required to achieve the equivalent of direct operation on Frame objects by processing cluster nodes operating on Line or Block objects.
The size of a frame is determined by several parameters, including the number of pixel types, pixel widths, padding to byte boundaries, and the width and height of the frame in number of pixels per scan-line and number of scan-lines, which can vary according to the resolution. A frame is mapped to processing cluster 1400 contexts, normally organized as horizontal groups less wide than the actual image, frame divisions, which are swapped into processing cluster 1400 for processing as Line or Block types. This processing produces results: when a result is another Frame, that result normally is reconstructed from the partial intermediate results of processing cluster 1400 operation on frame divisions.
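A simple C++ sketch of dividing a frame into frame divisions follows. The names FrameDivisions and split_frame are hypothetical, and the sketch assumes non-overlapping divisions of a fixed pixel width with a nonzero frame width and division width; any overlap between adjacent divisions is not modeled.

```cpp
#include <cstdint>

// Result of splitting a scan-line into fixed-width frame divisions.
struct FrameDivisions {
    std::uint32_t count;       // number of divisions needed to cover the width
    std::uint32_t last_width;  // width in pixels of the final division
};

// Both arguments are assumed to be nonzero.
constexpr FrameDivisions split_frame(std::uint32_t frame_width,
                                     std::uint32_t division_width) {
    return FrameDivisions{
        (frame_width + division_width - 1) / division_width,  // ceiling division
        frame_width % division_width == 0 ? division_width
                                          : frame_width % division_width};
}
```

Under these assumptions, a 1920-pixel scan-line processed in 512-pixel divisions needs four divisions, the last of which is 384 pixels wide.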
In a cross-hosted C++ programming environment, an object of class Line is considered to be the entire width of an image in this example, to generally eliminate the complexity required in hardware to process frame divisions. In this environment, an instance of a Line object includes the iteration in the horizontal direction, across the entire scan-line. The details of Frame objects are abstracted not only by the object implementation, but also by intrinsics within the Frame objects, to hide the bit-level formatting required for de-interleaving and interleaving and to enable translation to instructions for the GLS processor 5402. This permits a cross-hosted C++ program to obtain results equivalent to execution in the environment of the processing cluster 1400, independent of the environment for processing cluster 1400.
In the code-generation environment for the processing cluster 1400, a Line is a scalar type (generally equivalent to an integer), except that code generation supports addressing attributes that correspond to horizontal pixel offsets for access from SIMD data memory. Iteration on scan-lines in this example is accomplished by a combination of parallel operation in the SIMD, iteration between contexts on a node (i.e., 808-i), and parallel operation of nodes. Frame divisions can be controlled by a combination of host software (which knows the parameters of the frame and frame division), GLS software (using parameters passed by the host), and hardware (detecting right-most boundaries using the dataflow protocol). A Frame is an object class implemented by GLS programs, except that most of the class implementation is accomplished directly by instructions for GLS processor 5402, as described below. Access functions defined for Frame objects have a side-effect of loading the attributes of a given instance into hardware, so that hardware can control access and formatting operations. These operations would generally be much too inefficient to implement in software at the desired throughputs, especially with multiple threads active.
Since there can be several active instances of Frame objects, it is expected that there are several configurations active in hardware at any given point in time. When an object is instantiated, the constructor associates attributes to the object. Access of a given instance loads the attributes of that instance into hardware, similar in concept to hardware registers defining the instance's data type. Since each instance has its own attributes, multiple instances can be active, each with their own hardware settings to control formatting.
Read threads and write threads are written as independent programs, so each can be scheduled independently based on their respective control and dataflow. The following two sections provide examples of a read thread and a write thread, showing the thread code, the Frame class declaration, and how these are used to implement very large data transfers, with very complex pixel formatting, using a very small number of instructions.
A read thread assigns variables representing system data to variables representing the input to processing cluster 1400 programs. These variables can be of any type, including scalar data. Conceptually, a read thread executes some form of iteration, for example in the vertical direction within a fixed-width frame division. Within the loop, pixels within Frame objects are assigned to Line objects, with the details of the Frame, and the organization of the frame division (the width of the Line), hidden from the source code. There also can be assignments of other vector or scalar types. At the end of each loop iteration, the destination processing cluster 1400 program(s) is/are invoked using Set_Valid. A loop iteration normally executes very quickly with respect to the hardware transfer of data. Loop execution configures hardware buffers and control to perform the desired transfer. At the end of an iteration, the thread execution is suspended (by a task switch instruction) while the hardware transfer continues. This frees the GLS processor 5402 to execute other threads, which can be important because there can be a single GLS processor 5402 controlling up to (for example) 16 thread transfers. The suspended thread is enabled to execute again once the hardware transfers are complete.
Turning to
In a cross-hosted environment for the example of
The example in
Turning to
In the source code 5702, the Line returned by the call to f_in->get(sys_in, Gr) is assigned to the node input variable nsf_in->Gr[i %3] (a Line in a circular buffer). In the generated code, this vector assignment to an extern variable results in a vector output instruction, VOUTPUT, using as a source register the virtual register loaded by the preceding LDSYS, and specifying the offset for nsf_in->Gr[i %3] in the destination context (the offset for nsf_in->Gr[0] is linked into the code after compilation, and the actual offset is computed using circular addressing compatible with the destination addressing). An example of the execution of this instruction is illustrated in
In the example of
Turning to
After the thread is suspended at the end of the loop, GLS processor 5402 can execute other threads in parallel with this thread's hardware transfers. The hardware detects the final transfer using the HG_Size parameter (or Block_Width for Block transfers). At this point, the thread can be re-enabled to execute the next loop iteration. If the loop terminates instead, the thread executes an END instruction, resulting in an Output_Terminate signal to the first (left-most) destination context. This context propagates the termination to all other contexts in the horizontal group, as well as to dependent destination contexts of that group. When the thread executes an END instruction, and all hardware transfers to TPIC are complete, the thread sends a Thread Termination message.
A write thread assigns variables representing output from processing cluster 1400 programs to variables representing system data. These variables can be of any type, including scalar data, but this section shows an example of assigning pixels in Line objects to Frame objects, since this is the most complex example of the operation of a write thread. A write thread typically is data-driven, in that it moves input data to the system as long as this data is provided. In most cases, this data is processing cluster 1400 output that is the ultimate result of read-thread input to processing cluster 1400, so the write thread effectively executes within the same iteration loop as the read thread. Within the write thread for an example application of image processing, pixels of Line objects are assigned to Frame objects, with the organization of the frame division (the width of the Line), and the details of the Frame, hidden from the source code. As with read threads, an iteration of a write thread normally executes very quickly with respect to the hardware transfer of data. Thread execution configures hardware buffers and control to perform the desired transfer. At the end of an iteration, the thread execution is suspended (by a task switch instruction) while the hardware transfer continues. This frees the GLS processor 5402 to execute other threads, which is important because there is a single GLS processor 5402 controlling up to 16 thread transfers. The suspended thread is enabled to execute again once the hardware transfers are complete.
Turning to
In a cross-hosted environment, the put function in the Frame class simply calls the intrinsic _STSYS, passing input parameters plus the attribute attr. This intrinsic inserts all the pixels from the input Line parameter, the entire width of the frame, into the associated positions at the given address. This insertion is done for each call to put, for each pixel type. As with the _LDSYS intrinsic, this implementation is functionally equivalent to processing cluster 1400's, but performance is unacceptably slow. The remainder of this section describes how the source code, Frame class, and _STSYS intrinsic are used to perform very high-throughput transfers with a very small number of instructions. When the write thread is first scheduled, it cannot execute right away because input data has not been provided. The thread remains idle until a processing cluster 1400 context outputs data, identifying the GLS unit 1408 as the destination node and the write thread as the destination thread. This enables the write thread to execute, as shown in
The example in
The second instruction, STSYS, is a straightforward translation of the intrinsic STSYS resulting from the call to put.
Other inputs can be identified before they can be interleaved into the frame and the result written to the system. This is accomplished by the other instructions in the loop, with the steady-state result shown in
As shown in this example, there is no guaranteed order between VINPUT and STSYS instructions for different accesses, and virtual-register identifiers are not necessarily unique. However, the instruction order does satisfy dependencies, so that the Request Queue 5408 can match write-thread inputs with system positions and addresses by pairing virtual register IDs, despite the order of instructions and despite the re-use of these IDs.
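The pairing behavior described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each VINPUT for a given virtual register precedes the STSYS that consumes it (the dependency order mentioned in the text), and that a per-register first-in-first-out list is sufficient to disambiguate re-used IDs. The class name `RequestQueueSketch` and its methods are hypothetical and do not describe the actual Request Queue 5408 hardware.

```python
from collections import defaultdict, deque


class RequestQueueSketch:
    """Hypothetical model: pairs write-thread inputs with system addresses
    by virtual-register ID, tolerating interleaving and ID re-use."""

    def __init__(self):
        # One FIFO of not-yet-paired inputs per virtual-register ID.
        self._pending_inputs = defaultdict(deque)
        self.pairs = []  # (input descriptor, system address), in pairing order

    def vinput(self, vreg, input_descriptor):
        """Record an input that targets virtual register `vreg`."""
        self._pending_inputs[vreg].append(input_descriptor)

    def stsys(self, vreg, system_address):
        """Pair a store with the oldest unpaired input for the same register."""
        fifo = self._pending_inputs[vreg]
        if not fifo:
            raise ValueError(f"STSYS on v{vreg} has no matching VINPUT")
        self.pairs.append((fifo.popleft(), system_address))


rq = RequestQueueSketch()
rq.vinput(1, "Gr")
rq.vinput(2, "R")
rq.stsys(2, 0x100)   # different access, issued out of order
rq.stsys(1, 0x200)
rq.vinput(1, "B")    # virtual register 1 is re-used
rq.stsys(1, 0x300)
```

In this sketch the re-use of virtual register 1 is harmless because the first pairing has already consumed the earlier input by the time the second input is recorded.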
At the end of the loop, the thread is suspended while hardware transfers are completed. The hardware detects the final transfer because Set_Valid is asserted for the source context that has Rt=1 in its Source Notification message. At this point, the thread is in a condition to be re-enabled to execute the next loop iteration, but is not actually enabled to execute until new data is received. The thread has to detect the combination of Set_Valid and Rt=1 in order to distinguish data from a previous iteration from data for a new iteration, so that it is enabled to execute for new input. In addition to being enabled by new input, the thread is also enabled to execute when it receives an Output Termination message. This causes the loop condition to end the loop. When the thread executes an END instruction, all hardware transfers to the system should complete before the thread can send a Thread Termination message.
GLS unit 1408 generally conforms to the dataflow protocol between processing nodes (i.e., 808-i), but the internal implementation is significantly different from that in the nodes (i.e., 808-i) and SFM 1410. GLS unit 1408 transfers can be highly parallel and overlapped, as defined by a program performing data movement to and from GLS processor 5402 virtual registers, converted by hardware into large transfers of system data to and from the processing cluster 1400, with de-interleaving and interleaving as required or desired. In contrast, node and SFM transfers are generally synchronous with program execution, and normally represent a relatively small amount of activity with respect to the entire program. Furthermore, because of conditional program execution, there can be a large variability in the output created by different iterations of a read thread. Output can be to a different set of variables at a given destination, of a different set of types, and the order of output instructions can be different. On top of this variability, an iteration can also output to a different set of destinations. This variability is handled by the GLS dataflow protocol.
The destination-list entries for a read thread enable a large amount of overlap between the dataflow protocol and data transfer, and between transfers to different destinations on the list. The dataflow protocol does not generally appear in series with data transfers into the contexts associated with a particular destination, and each destination can be provided with data at the maximum rate permitted by the destination. The destination list buffers an identifier for the next destination context while the current transfer is being serviced. When the current transfer is complete, this identifier can be used to transition immediately to the next destination context. In parallel, the thread can send a Source Notification to the destination context, which forwards the notification. The context receiving the forwarded Source Notification responds with a Source Permission when it is ready to receive data, and the read thread stores the identifier from the permission in the destination-list entry. This protocol operates independently for each set of destination contexts, that is, for each entry on the destination list. There is generally no serialization or synchronization between independent destinations.
Turning to
In state 10′b, at any time during a current transfer, the thread can send a Source Notification (SN) to the current destination, enabling the destination to forward the SN to the next destination (Rt=1), up to the right-boundary context. The read thread determines the number of node destination contexts using the HG_Size parameter, which is provided to hardware on the GLS Data Interface (it is contained in the vertical-index parameter of the VOUTPUT instruction). Thus, the SN is sent up to the point where HG_Size sets of outputs have been done. After the SN is sent, the next two events can occur in any order:
The dataflow protocol for Line output to shared function-memory 1410 is similar to that for Line output to a node (the two are distinguished by a datatype field in the VOUTPUT instruction, which appears on the GLS Data Interface). However, there are several differences required by the SFM destination, since it is a single destination context, possibly in a continuation group (
To properly address the data in the destination context, the GLS unit 1408 can increment the offsets of successive transfers (for example, by 32 pixels each transfer), so that SFM input is directly addressed. Line transfers to node contexts are to the same address in SIMD data memory, but in different contexts. GLS unit 1408 also indicates the last line in a circular buffer, using Fill (from Data Interface), so that SFM 1410 can distinguish the final transfer of LineArray data.
Turning to
Usually, a single SN (or source notification) is sent for all blocks sent to a destination context. This is sent in state 00′b, after the thread suspends, to all destinations that have output in that iteration. When the output is enabled, block data is transferred such that the same column in all blocks is transferred, with Set_Valid after the final block transfer at each column position. Addressing in the destination context is accomplished by incrementing offsets by (for example) 32 pixels for each column position.
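The column-by-column ordering of block transfers can be sketched as follows. This is an illustrative model only: the function name `block_transfer_schedule` is hypothetical, block widths are assumed to be expressed as a number of column transfers, and the 32-pixel offset step is the example value given above.

```python
def block_transfer_schedule(block_widths, pixels_per_column=32):
    """Return a list of (column, block_index, dest_offset, set_valid) tuples.

    At each column position the same column of every block that still has
    columns left is transferred; Set_Valid accompanies the final block
    transfer at that column position.
    """
    schedule = []
    for column in range(max(block_widths, default=0)):
        # Blocks narrower than the current column receive no further output.
        active = [b for b, width in enumerate(block_widths) if column < width]
        for b in active:
            dest_offset = column * pixels_per_column
            schedule.append((column, b, dest_offset, b == active[-1]))
    return schedule


# Two blocks, the first two columns wide and the second one column wide.
example = block_transfer_schedule([2, 1])
```

With these assumed widths, both blocks are served at column 0, and only the wider block is served at column 1, at a destination offset of 32 pixels.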
Because of the possible existence of continuation contexts, the SP received on the transition from state 00′b to 10′b updates the initial-destination ID in the destination-list entry, as well as the next-destination ID. The initial-destination ID is updated to transition continuation contexts, and the next-destination ID is used to route transfers. The initial-destination ID is also used to send an OT, because this should be sent to the last continuation context to receive data. Blocks of different widths can also be output. When the number of column transfers for any given block reaches its Block_Width, no more output to that block is done. However, output continues to wider blocks, up to the block or blocks with the greatest width. The number of columns output, with Set_Valid, usually cannot exceed the number permitted by the PermissionCount field of the destination list. This field is incremented by the P_Incr field in SPs that are received during the transfer, and decremented for each Set_Valid. This is required so that SFM 1410 can control the relative rates of different inputs, if desired, to perform dependency checking.
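The PermissionCount accounting described above behaves like a credit counter. The following minimal sketch assumes a count that starts at zero and treats an attempt to signal Set_Valid without remaining permission as an error; the class and method names are hypothetical and are not taken from the destination-list hardware.

```python
class PermissionCounter:
    """Hypothetical model of the PermissionCount field of a destination-list entry."""

    def __init__(self, initial=0):
        self.count = initial

    def on_source_permission(self, p_incr):
        """An SP received during the transfer adds P_Incr column permissions."""
        self.count += p_incr

    def can_output_column(self):
        """A column (ending in Set_Valid) may be output only while permission remains."""
        return self.count > 0

    def on_set_valid(self):
        """Each Set_Valid consumes one permission."""
        if self.count <= 0:
            raise RuntimeError("Set_Valid without remaining permission")
        self.count -= 1


pc = PermissionCounter()
pc.on_source_permission(2)   # destination grants two columns
pc.on_set_valid()
pc.on_set_valid()
blocked = not pc.can_output_column()   # output stalls until the next SP
pc.on_source_permission(0)   # an SP with P_Incr=0 grants no columns
```

An SP with P_Incr equal to zero leaves the count unchanged in this model, so the source remains unable to output further columns until a later SP carries a non-zero increment.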
When output of all columns in an iteration is complete to all blocks, the thread is re-scheduled to execute. This occurs in state 10′b and output is still enabled. This iteration results in a new set of VOUTPUT instructions, which set new values for offsets in the destination context: these offsets are to the first columns in the next rows of the output blocks. This is not necessarily the same set of rows that was output in the previous iteration, because program conditions can be used to stop output to blocks that have fewer rows than others. However, the same techniques as just described are used to output whatever blocks have a corresponding VOUTPUT.
At the end of all iterations, the thread signals Block_End to the given destination. This is a special encoding of VOUTPUT, to properly order this signal to come after any prior data, but should not initiate a block transfer. Instead, the GLS UNIT 1408 performs a single dummy transfer with the Block_End encoding, and transitions to the state 00′b. The thread doesn't necessarily terminate at this point: subsequent iterations can perform block output either to the same destination, the continuation context of this destination, or another destination entirely.
A write thread iterates on the receipt of data, up to the point where an OT signal is received. This is based on a WHILE loop testing for the absence of termination. Set_Valid, though set by sources, is mostly irrelevant, because write threads process data and transmit to the system as it is received, and do not have to wait for an entire context to be valid. Once software execution has initiated a transfer, transfers from all source contexts are performed by hardware, using the dataflow protocol to perform flow control and to order inputs. Set_Valid is relevant for detecting the final transfer of an iteration (based on HG_Size or Block_Width). The final source context sends an OT after it has completed the final transfer. The OT schedules the write thread to execute, and the hardware provides a termination status that can be tested as a bit in the Condition Status Register for the GLS processor 5402. This causes the loop condition not to be met, so that the write thread no longer iterates, and instead terminates. For Block output to GLS unit 1408, the source can signal Block_End with a transfer after the final Set_Valid. This can be ignored.
In addition to vector (including pixel vector) data to SIMD data memory for the nodes (i.e., 4306-1) and shared function contexts (which are discussed in greater detail below), the read thread can also provide scalar data to node contexts for processor data memory (i.e., 4328). This can be either data that is explicitly coded in the application program, or implicit data such as parameters, initialization and/or configuration data, and control words for circular buffers (controlling boundary conditions, buffer latency, etc.). Buffering in the GLS units 1408 limits the number of vector outputs to four sets of destination contexts (each with a separate destination-list entry, identified by source tag). However, there can be up to sixteen (for example) outputs for scalar data, to provide a means for a read thread to perform initialization and control functions even to contexts where it has no direct, explicit involvement in dataflow (the initialization and control code is added to the read thread by the system programming tool 718, depending on the use-case, and is not explicitly coded into the read-thread applications code).
There is generally no particular order to scalar outputs with respect to their source-tag fields or with respect to vector outputs; this order generally depends on the source program and code generation. There can be any combination of outputs, with any source tag, in any number. The final scalar output at each source tag is flagged with Set_Valid. The outputs are queued in the order received in the Scalar Output Buffer (i.e., within global IO buffer 5406). This buffer contains scalar outputs from all threads that are in process, with each thread having pointers to the head and tail entries for its specific set of outputs in the buffer. Each entry includes the scalar data, their offsets in the destination contexts, and their Dst_Tag values.
Scalar data is generally provided to all destination contexts that are associated with a given Dst_Tag. Unlike vector data, which is different for every destination context, the same scalar data is copied to each destination context associated with the Dst_Tag. Scalar data is transferred over the messaging interconnect or bus 1420, using Update messages.
Destination-list entries can control both vector and scalar transfers, because a Source Permission from a destination context applies to both. Outputs of scalar-only data can proceed independent of any other vector or scalar transfers, but outputs of both scalar and vector data to a given set of destination contexts has to be synchronized with the dataflow protocol of the destination contexts, as reflected in the destination list. Because vector data is generally much larger than scalar data, it generally controls the rate of transfer and thus the rate of the dataflow protocol. Scalar transfers remain in the Scalar Output Buffer (i.e., within global IO buffer 5406) until all outputs to all destinations have been performed. When a vector output occurs to a given destination context, the Scalar Output Buffer (i.e., within global IO buffer 5406) is scanned for any scalar transfers with the given Dst_Tag field, and, if any entry has a matching Dst_Tag, the scalar transfer is performed. These transfers occur in parallel with the vector transfers.
Scalar output (if applicable) occurs along with vector outputs to all destination contexts, using repeated scans of the queue entries in the Scalar Output Buffer (i.e., within global IO buffer 5406), for example one for each context. If there are no vector outputs at a given Dst_Tag, the scalar output is accomplished the same way, but isn't synchronized with vector output, and uses a different dataflow-protocol sequence. By scanning all entries associated with the read thread, and by matching Dst_Tag fields of these entries with the Dst_Tag of the destination contexts, all data is correctly transferred to all destinations regardless of the order and number of output instructions from the read-thread code.
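The scan of the Scalar Output Buffer can be sketched as follows. The sketch assumes a simple in-order list of entries for one read thread and one scan per destination context; the names `ScalarEntry` and `scan_scalar_buffer` and the tuple layout of the resulting updates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScalarEntry:
    dst_tag: int             # Dst_Tag of the destination set
    offset: int              # offset in the destination context
    data: int                # scalar value
    set_valid: bool = False  # flags the final scalar output for this tag


def scan_scalar_buffer(entries, dst_tag, destination_contexts):
    """Return (context, offset, data, set_valid) updates for one Dst_Tag.

    The buffer is scanned once per destination context, in queue order, so
    every context with the tag receives a copy of the same scalar data
    regardless of how the output instructions were interleaved.
    """
    updates = []
    for context in destination_contexts:
        for entry in entries:
            if entry.dst_tag == dst_tag:
                updates.append((context, entry.offset, entry.data, entry.set_valid))
    return updates


buffer_entries = [
    ScalarEntry(dst_tag=1, offset=0x10, data=7),
    ScalarEntry(dst_tag=2, offset=0x20, data=9, set_valid=True),
    ScalarEntry(dst_tag=1, offset=0x14, data=8, set_valid=True),
]
updates = scan_scalar_buffer(buffer_entries, dst_tag=1, destination_contexts=["ctx0", "ctx1"])
```

Entries with a different Dst_Tag are skipped by the scan and remain in the buffer for their own destinations.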
Scalar input is treated as separate from vector input by node destination contexts. Each is specified separately by the ValFlag LSB in the dataflow state. Scalar transfers have Set_Valid signals, on the messaging interconnect 1420, separate from Set_Valid for vector data on the global data interconnect. These signals are accounted for independently in the ValFlag fields in the node dataflow-state entries. There is also a separate Input_Done encoding of the scalar transfer from GLS that has the same effect as Set_Valid without providing new data (this is encoded in the scalar OUTPUT instruction).
If scalar data is provided along with vector data for a given destination, the scalar output is synchronized with vector output, and the vector dataflow protocol controls both. If scalar data is provided without vector data, then another set of state transitions is used to control output, and this is performed independently from other vector output.
In
In state 10′b scalar data is transferred usually once to a thread destination (SFM Line or Block), but is transferred to every data memory (i.e., 5403) context in a horizontal group (the same data is provided to all contexts). In the first case, as soon as all data has been transferred, with Set_Valid, the state transitions to 00′b for subsequent output from the thread (because Th=1). The second case—output to a horizontal group—is described below.
For a non-threaded destination, in state 10′b, an SN is sent for forwarding if the most recent SP was not received from a right-boundary context (Rt=1). This SN is forwarded at the destination to the next destination context, resulting in an SP from that context: this updates the next-destination ID. As with Line output, this SP can come before or after the Set_Valid indicating the final transfer to the current destination. The state 11′b records the SP, re-enabling output after Set_Valid occurs, and the state 01′b records the Set_Valid and waits for the SP before re-enabling output. In both cases the next state is 10′b. This continues until an SP is received from the right-boundary context, at which point a Set_Valid causes a transition to 00′b to wait for subsequent output from the thread.
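The state sequence for scalar output to a horizontal group can be sketched as a small state machine. This is an interpretation of the transitions described above, not a reproduction of the state diagram: it assumes that the initial SP moves the entry from 00′b to 10′b, and the class name `ScalarGroupOutput` and its fields are hypothetical.

```python
class ScalarGroupOutput:
    """Hypothetical model of states 00, 10, 11 and 01 for a non-threaded destination."""

    def __init__(self):
        self.state = "00"
        self.current_is_right_boundary = False  # Rt of the SP for the current context
        self._recorded_rt = False               # Rt of an SP recorded in state 11

    def on_sp(self, rt):
        if self.state in ("00", "01"):
            # Output is (re-)enabled toward the context that sent the SP.
            self.state = "10"
            self.current_is_right_boundary = rt
        elif self.state == "10":
            # SP arrived before the final Set_Valid: record it.
            self.state = "11"
            self._recorded_rt = rt
        else:
            raise RuntimeError("unexpected SP in state 11")

    def on_set_valid(self):
        if self.state == "10":
            # Final transfer to the current context: stop at the right boundary,
            # otherwise wait for the SP from the next context.
            self.state = "00" if self.current_is_right_boundary else "01"
        elif self.state == "11":
            self.state = "10"
            self.current_is_right_boundary = self._recorded_rt
        else:
            raise RuntimeError("unexpected Set_Valid in state " + self.state)


fsm = ScalarGroupOutput()
trace = []
for event in [("SP", False), ("SP", False), ("SV",), ("SV",), ("SP", True), ("SV",)]:
    if event[0] == "SP":
        fsm.on_sp(event[1])
    else:
        fsm.on_set_valid()
    trace.append(fsm.state)
```

The example walks a three-context horizontal group: the second SP arrives early (state 11), the third arrives late (state 01), and the Set_Valid at the right-boundary context returns the entry to state 00.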
Program control flow can cause variability in read-thread output from one iteration to the next. Each thread has an iteration queue (which can be part of the thread wrapper 5404) that records information from the thread as it executes instructions for the iteration, and controls output for that iteration. This recording starts when the thread is scheduled, and stops when it is suspended. Each entry of the queue has a two-bit type flag for each of the eight possible destinations, recording the type of output to the destination for that iteration (none, scalar, vector, or both). The entry also contains the iteration's head and tail pointers into the Scalar Output Buffer 5412 for all scalar output (if any), to all destinations. The iteration queue is managed as a First-in-First-Out or FIFO queue, with the most recent iteration writing the tail of the FIFO, and entries being removed from the head once all transfers for an iteration are complete.
Vector output is normally controlled by the entry at the tail of the iteration queue, with this and other entries controlling scalar data. The reason for this is to support output of scalar parameters to programs that do not receive vector data directly from the thread, as illustrated in
This serialization can be avoided by having read threads input to the same level of the processing pipeline (programs with the same value of OutputDelay in the context descriptors), so that the read thread operates at the pipeline stage of its output. This costs an additional read thread for every level of input: this is acceptable for vector input, because there are generally a limited number of stages where vector data is input from the system. However, it is likely that every program can require scalar parameters to be updated for each iteration, either from the system or computed by a read thread (for example, vertical-index parameters that control circular buffers in each processing stage). This would require a read thread for every pipeline stage, placing too much demand on the number of read threads.
Since scalar data can require much less memory than vector data, the GLS unit 1408 stores the scalar data from each iteration in the Scalar Output Buffer 5412, and, using the iteration queue, can provide this data as required to support the processing pipeline. This usually is not feasible for vector data, because the buffering required would be on the order of the size of all node SIMD memory.
Pipelining of scalar output from the GLS unit 1408 is illustrated in
Subsequent programs execute as they receive input, skewing in time to reflect the execution pipeline. Until each program signals Release_Input during the first iteration, the read thread cannot output scalar data to the destination contexts. For this reason Scalar B2—Scalar D2 are retained in the Scalar Output Buffer 5412 until the destination contexts enable input with an SP. The duration of this data in the Scalar Output Buffer 5412 is indicated by the grey dashed arrows, showing scalar data synchronized with vector input from source programs. During this time, data for other iterations is also accumulated in the Scalar Output Buffer, up to the depth of the processing pipeline, in this example roughly four iterations. Each of these iterations has an iteration-queue entry that records data types, destinations, and location of scalar data in the Scalar Output Buffer for the successive iterations.
When scalar output is completed to each destination, that fact is recorded in the iteration queue (by setting the type flag to 00′b—the LSB will be 1). When all type flags are 0, this indicates that all output from the iteration is complete, and the iteration-queue entry can be freed. At this point, the content of the Scalar Output Buffer 5412 is discarded for this iteration, and the memory freed for allocation by subsequent thread execution.
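The bookkeeping performed by the iteration queue can be sketched as follows. The sketch assumes that bit 0 of each two-bit type flag marks scalar output and bit 1 marks vector output, which is an assumption made for illustration only, and it represents the Scalar Output Buffer region of an iteration by its head and tail pointers. The class name `IterationQueue` and its methods are hypothetical.

```python
from collections import deque

NUM_DESTINATIONS = 8          # "eight possible destinations" in the text
SCALAR, VECTOR = 0b01, 0b10   # assumed bit assignment within the type flag


class IterationQueue:
    """Hypothetical FIFO of per-iteration output records."""

    def __init__(self):
        self._fifo = deque()

    def record_iteration(self, type_flags, scalar_head, scalar_tail):
        """Append the most recent iteration at the tail of the FIFO."""
        if len(type_flags) != NUM_DESTINATIONS:
            raise ValueError("one type flag per destination is expected")
        self._fifo.append(
            {"flags": list(type_flags), "head": scalar_head, "tail": scalar_tail}
        )

    def complete(self, position, destination, kind):
        """Clear one output type for one destination of the entry at `position`
        (0 is the head). Returns the (head, tail) buffer regions freed as a result."""
        self._fifo[position]["flags"][destination] &= ~kind
        freed = []
        # Entries leave the head only when every type flag is zero.
        while self._fifo and not any(self._fifo[0]["flags"]):
            entry = self._fifo.popleft()
            freed.append((entry["head"], entry["tail"]))
        return freed


q = IterationQueue()
q.record_iteration([SCALAR | VECTOR] + [0] * 7, scalar_head=0, scalar_tail=4)
q.record_iteration([SCALAR] + [0] * 7, scalar_head=4, scalar_tail=6)
first = q.complete(1, 0, SCALAR)    # newer iteration finishes first: nothing freed yet
second = q.complete(0, 0, VECTOR)   # older iteration still owes scalar output
third = q.complete(0, 0, SCALAR)    # head completes, so both entries drain in order
```

Because entries are removed only from the head, a later iteration that completes early keeps its Scalar Output Buffer region until all earlier iterations have also completed.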
Nodes (i.e., 808-i) can provide scalar input to GLS threads to control system data movement. For example, a node can set block dimensions, determined by a region of interest based on pixel analysis, for a GLS read thread to fetch the block into a shared function-memory continuation context. For this reason, GLS unit 1408 can implement the dataflow protocol for scalar input to threads. This is a small subset of what is required for processing and SFM nodes: there are no side contexts nor forwarding of SNs. The GLS thread simply can track SN messages for up to four sources, and count Set_Valid signals from each source.
When a thread is scheduled, and In=1 in the context descriptor, the thread first waits for scalar input: it should receive the required number of inputs, each signaled with Set_Valid, before it can execute. If In=0, the thread can be scheduled for execution any time after the scheduling message is received.
In
In state 00′b, if an SN is received with InEn=0, the state transitions to 01′b to indicate that there is a valid SN recorded in the pending permission. If an SN was received from this source before other data was received, the pending permission cannot be used to generate an SP until all other input has been received, indicated by #SetVal=#Inp and resetting InEn. Input is re-enabled when the program signals Release_Input, which sets InEn, and the state transitions to 11′b. It is also possible for a source to signal Input_Done for scalar data, which indicates that the scalar data isn't updated, because of program conditions, but that the previous data should be considered valid. This is equivalent to a Set_Valid except that the scalar data is not updated.
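The counting condition for scalar input (#SetVal=#Inp) can be sketched as a simple gate. The sketch assumes that Input_Done is counted exactly like Set_Valid, as the text states the two are equivalent apart from the data update; the class name `ScalarInputGate` and its fields are hypothetical.

```python
class ScalarInputGate:
    """Hypothetical model of the scalar-input condition for a GLS thread."""

    def __init__(self, in_flag, num_inputs):
        self.in_flag = in_flag        # In bit from the context descriptor
        self.num_inputs = num_inputs  # #Inp: number of expected scalar inputs
        self.set_val_count = 0        # #SetVal: signals received so far

    def on_signal(self, kind):
        """Count a Set_Valid, or an Input_Done (previous data remains valid)."""
        if kind not in ("Set_Valid", "Input_Done"):
            raise ValueError("unknown signal: " + kind)
        self.set_val_count += 1

    def ready(self):
        """A thread with In=0 does not wait; otherwise all inputs must be signaled."""
        return (not self.in_flag) or self.set_val_count >= self.num_inputs

    def release_input(self):
        """Release_Input re-enables input for the next set of scalar data."""
        self.set_val_count = 0


gate = ScalarInputGate(in_flag=True, num_inputs=2)
before = gate.ready()
gate.on_signal("Set_Valid")
gate.on_signal("Input_Done")
after = gate.ready()
```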
Write threads should have special treatment for scalar input, because they also receive vector input, and these should be handled differently. Scalar input is received before the thread executes, but vector input is received after the thread executes. If input is enabled, scalar data is guaranteed to have memory allocation in data memory (i.e., 5403), but vector data should have a buffer allocation that can receive all input at a given column or horizontal position, before it can enable input. This causes a circularity in the dataflow protocol. The thread should send an SP if the SN Type indicates scalar data, to enable this scalar input; however, the source might also provide vector data, and this cannot be enabled until the thread executes and the required buffer allocation is determined.
To resolve this circularity, if Type[0]=1, the thread responds with an SP, but with P_Incr=0. The permission count should not apply to scalar output, so this enables the scalar output but does not permit the source to output vector data. Because the scalar data controls the output of vector data, it has to precede the output of vector data, so the source program can make progress even though vector output is disabled (if it were to output vector data first, it would deadlock, but this style of output isn't useful).
A similar issue applies in determining when to enable the SP response to the next SN. This SP can occur after all vector output for the previous SN has been received, and new buffers allocated for the next input. This condition is hardware-specific, and is indicated by the condition “vector data received” in the state-transition diagram, on the arcs that enable the SP.
Read-thread iterations complete very quickly compared to the data transfers that are initiated by the iteration, and the program enters a suspended state as the hardware completes the transfers. The thread is re-scheduled once all of these hardware transfers have been performed. In most cases, the program executes another iteration and initiates a new set of transfers. However, after the final iteration, there are no transfers indicated, and the program terminates instead. At this point, to signal that there are no more transfers from the thread, the hardware sends Output_Terminate (OT) signals to all destinations that are enabled to receive OT from the thread (these are normally destinations that receive data during thread iterations, rather than destinations that just receive initialization data at the beginning of the thread). Hardware transmits an OT to every destination on the destination list enabled by OTe=1, up to the entry with Bk=1.
GLS threads are scheduled by Schedule Read Thread and Schedule Write Thread messages. If the thread does not depend on scalar input (read or write thread) or vector input (write thread), it becomes ready to execute when the scheduling message is received: otherwise the thread becomes ready when Vin is set, for threads that depend on scalar input, or when vector data is received over the global interconnect (write thread). Ready threads are enabled to execute in round-robin order.
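Round-robin selection among ready threads can be sketched as follows. The sketch assumes sixteen thread slots (the example count used elsewhere in this description) and a search that starts just after the thread that ran most recently; the function name `next_ready_thread` is hypothetical.

```python
def next_ready_thread(ready, last_run, num_threads=16):
    """Return the next ready thread ID after `last_run`, or None if none is ready.

    `ready` is the set of thread IDs currently ready to execute.
    """
    for step in range(1, num_threads + 1):
        candidate = (last_run + step) % num_threads
        if candidate in ready:
            return candidate
    return None


ready_threads = {2, 5, 11}
order = []
last = 0
for _ in range(4):
    last = next_ready_thread(ready_threads, last)
    order.append(last)
```

Starting the search after the most recently executed thread gives every ready thread a turn before any thread runs a second time.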
When a thread begins executing, it continues to execute until all transfers have been initiated for a given iteration, at which point the thread is suspended by an explicit task-switch instruction while the hardware transfers complete. The task switch is determined by code generation, depending on variable assignments and flow analysis. For a read thread, all vector and scalar assignments to processing cluster 1400, to all destinations, have to be complete at the point of thread suspension (this typically is after the final assignment along any code path within an iteration). The task-switch instruction causes Set_Valid to be asserted for the final transfer to each destination (based on hardware knowing the number of transfers). For a write thread, the analysis is similar, except that the assignment is to the system, and Set_Valid is not explicitly set. When the thread is suspended, hardware saves all context for the suspended thread, and schedules the next ready thread, if any.
Once a thread is suspended, it remains suspended until hardware has completed all data transfers initiated by the thread. This is indicated several different ways, depending on transfer conditions:
When a thread is re-enabled to execute, it can either initiate another set of transfers, or terminate. A read thread terminates by executing an END instruction, which results in OT signals to all destinations that have OTe=1, using the initial-destination IDs. A write thread generally terminates because it receives an OT from one or more sources, but isn't considered fully terminated until it executes an END instruction: it's possible that the while loop terminates but the program continues with a subsequent while loop based on termination. In either case, the thread can send a Thread Termination message after it executes END, all data transfers are complete, and all OTs have been transmitted.
Read threads can have two forms of iteration: an explicit FOR loop or other explicit iteration, or a loop on data input from processing cluster 1400, similar to a write thread (looping on the absence of termination). In the first case, any scalar inputs are not considered to be released until all loop iterations have been executed—the scalar input applies to the entire span of execution for the thread. In the second case, inputs are released (Release_Input signaled) after each iteration, and new input should be received, setting Vin, before the thread can be scheduled for execution. The thread terminates on dataflow, as a write thread does, after receiving an OT.
The GLS processor 5402 can include a dedicated interface to support hardware control based on read- and write-thread operation. This interface can permit the hardware to distinguish specific or specialized accesses from normal accesses for the GLS processor 5402 to GLS data memory 5403. Additionally, there can be instructions for the GLS processor 5402 to control this interface, which are as follows:
The GLS unit 1408 for this example can have any of the following features:
Table 21 below shows the list of pins and input/output (I/O) signals for an example of the GLS unit 1408 instantiated in the processing cluster 1400.
Turning to
Turning first to read thread data flow, a read thread is processed by the GLS unit 1408 when data should be transferred from the OCP connection 1412 onto the interconnect 814. A read thread is scheduled by a Schedule Read Thread message, and once the thread is scheduled, the GLS unit 1408 can trigger the GLS processor 5402 to obtain the parameters (i.e., pixel parameters) for the thread and can access the OCP connection 1412 to fetch the data (i.e., pixel data). Once the data has been fetched, it can be deinterleaved and upsampled according to the stored configuration information (which is received from the GLS processor 5402) and sent to the proper destination via the data interconnect 814. The dataflow is maintained using the Source Notification, Source Permission, and output termination messages until the thread is terminated (as informed by the GLS processor 5402). The scalar data flow is maintained using an update data memory message.
Another data flow is the configuration read thread. The configuration read thread is processed by the GLS unit 1408 when configuration data should be transferred from the OCP connection 1412 to either GLS instruction memory 5405 or to other modules within the processing cluster 1400. A configuration read thread is scheduled by a Schedule Configuration Read message, and, once the message has been scheduled, the OCP connection 1412 is accessed to obtain the basic configuration information. The basic configuration information is decoded to obtain the actual configuration data, which is sent to the proper destination (via the data interconnect 814 if the destination is an external module within the processing cluster 1400).
Yet another data flow is the write thread. A write thread is processed by GLS unit 1408 when data should be transferred from the data interconnect 814 to the OCP connection 1412. A write thread is scheduled by a Schedule Write Thread message, and, once the thread is scheduled, the GLS unit 1408 triggers the GLS processor 5402 to obtain the parameters (i.e., pixel parameters) for the thread. After that, the GLS unit 1408 waits for the data (i.e., pixel data) to arrive via the data interconnect 814, and, once the data from data interconnect 814 has been received, it is interleaved and downsampled according to the stored configuration information (received from the GLS processor 5402) and sent to the OCP connection 1412. The dataflow is maintained using the Source Notification, Source Permission, and output termination messages until the thread is terminated (as informed by the GLS processor 5402). The scalar data flow is maintained using the update data memory message.
Now, turning to the organization for the GLS data memory 5403 (which generally comprises a data memory RAM 6007 and a data memory arbiter 6008), this memory 5403 is configured to store the various variables, temporaries, and register spill/fill values for all resident threads. It can also have an area hidden from the thread code which contains thread context descriptors and destination lists (analogous to destination descriptors in nodes). Specifically, for this example, the first 8 locations of the data memory RAM 6007 are allocated for the context descriptors so as to hold 16 context descriptors (where an example of the general structure for a context descriptor 5502 can be seen in
The GLS data memory 5403 can be accessed by multiple sources. The multiple sources are internal logic for the GLS unit 1408 (i.e., interfaces to the OCP connection 1412 and data interconnect 814), debug logic for the GLS processor 5402 (which can modify data memory 5403 contents during a debug mode of operation), messaging interface 5418 (both the slave messaging interface 6003 and the master messaging interface 6004), and the GLS processor 5402. The data memory arbiter 6008 is able to arbitrate access to the data memory RAM 6007. As an example (which is shown in
Turning now to the context save memory 5414 (which generally comprises a context state RAM 6014 and a context state arbiter 6015), this memory 5414 can be used by the GLS processor 5402 to save context information when a context switch is done in the GLS unit 1408. The context memory has a location for each thread (i.e., 16 in total supported). Each context save line is, for example, 609 bits, and an example of the organization of each line is detailed above. The arbiter 6015 arbitrates access to the context state RAM 6014 for accesses from the GLS processor 5402 and debug logic for the GLS processor 5402 (which can modify the contents of context state RAM 6014 during a debug mode of operation). Typically, a context switch occurs whenever a read or write thread is scheduled by the GLS wrapper.
With the instruction memory 5405 (which generally comprises an instruction memory RAM 6005 and an instruction memory arbiter 6006), it can store an instruction for the GLS processor 5402 in every line. Typically, arbiter 6006 can arbitrate access to the instruction memory RAM 6005 for accesses from GLS processor 5402 and debug logic for the GLS processor 5402 (which can modify the contents of instruction memory RAM 6005 during a debug mode of operation). The instruction memory 5405 is usually initialized as a result of the configuration read thread message, and, once the instruction memory 5405 is initialized, the program can be accessed using the Destination List Base address present in the schedule read thread or write thread message. The address in the message is used as the instruction memory 5405 starting address for the thread whenever the context switch occurs.
Turning now to the scalar output buffer 5412 (which generally comprises a scalar RAM 6001 and arbiter 6002), the scalar output buffer 5412 (and the scalar RAM 6001, in particular) stores the scalar data that is written by the GLS processor 5402 and the messaging interface 5418 via a data memory update message, and the arbiter 6002 can arbitrate these sources. As part of the scalar output buffer 5412, there is also associated logic, and the architecture for this scalar logic can be seen in
In
In another parallel process for this example (which usually occurs for scalar-only read threads), when a source permission message is received for a scheduled read thread (in response to a source notification previously sent by the GLS unit 1408), the mailbox 6013 is updated with information extracted from the message. It should be noted that the source notification message can (for example) be sent by the scalar output buffer 5412 for a read thread which has scalar-only transfer enabled. For read threads with both scalar and vector enabled, the source notification message may not be sent. The pending permission table can then be read to determine whether the DST_TAG sent in the source permission message matches the one stored for that thread ID (the previous source notification message would have written the DST_TAG). Once a match is obtained, the bits of the pending permission table for that thread for the scalar finite state machine (FSM) 6031 are updated. Then, the GLS data memory 5403 is updated with the new destination node and segment ID along with the thread ID. The GLS data memory 5403 is read to obtain the PINCR value from the destination list entry and update it. It is assumed that for scalar transfer the PINCR value sent by the destination will be ‘0’. Then the thread ID is latched into the Thread ID FIFO 6030 along with a status indication of whether it is the left-most thread or not.
Now, GLS unit 1408 has permission to transfer scalar data to the destination. The thread FIFO 6030 is read to extract the latched thread ID. The extracted thread ID along with the destination tag is used as an index to fetch the proper data from the scalar RAM 6001. Once the data is read out, the destination index present in the data is extracted and matched with the destination tag stored in the request queue. Once a match is obtained, the extracted thread ID is used to index into the mailbox 6013 to fetch the GLS data memory 5403 destination address. The matched DST_TAG is then added to the GLS data memory 5403 destination address to determine the final address to the GLS data memory 5403. The GLS data memory 5403 is then accessed to fetch the destination list entry. The GLS unit 1408 sends an update GLS data memory 5403 message to the destination node (identified by the node ID and segment ID extracted from the GLS data memory 5403) with data from the scalar RAM 6001, which is repeated until the entire data for the iteration is sent. Once the end of the data for the thread is reached, the GLS unit 1408 moves on to the next thread ID (if that thread has been pushed into the FIFO as active) and indicates to the global interconnect logic that the end of the thread has been reached. This update sequence can be seen in
The scalar data contained in the execution is either from the program itself, fetched from a peripheral 1414 via OCP connection 1412, or received from other blocks in the processing cluster 1400 via an update data memory message if scalar dependency is enabled. When the scalar is to be fetched from OCP connection 1412 by the GLS processor 5402, the GLS processor 5402 sends an address (for example) from 0->1M on its data memory address lines. The GLS unit 1408 translates that access to an OCP connection 1412 master read access (i.e., burst of 1-word). Once the GLS unit 1408 reads the word, it passes it to the GLS processor 5402 (i.e., 32 bits; which 32 bits depends on the address sent by the GLS processor 5402), which sends the data to the scalar RAM 6001.
In case the scalar data should be received from another processing cluster 1400 module, the scalar dependency bit will be set in the context descriptor for that thread. When the input dependency bit is set, the number of sources that would be sending the scalar data is also set in the same descriptor. Once the GLS unit 1408 has received the scalar data from all the sources and stored it in the GLS data memory 5403, the scalar dependency is met. Once the dependency is met, the GLS processor 5402 is triggered. At this point, the GLS processor 5402 will then read the stored data and write it to the scalar RAM 6001 using the OUTPUT instruction (normally for read threads).
The GLS processor 5402 may also choose to write the data (or any data) to the OCP connection 1412. When the data should be written to the OCP connection 1412 by the GLS processor 5402, the GLS processor 5402 sends (for example) an address from 0->1M on its GLS data memory 5403 address lines. The GLS unit 1408 translates that access to an OCP connection master write access (i.e., burst of 1-word) and writes the (for example) 32 bits to the OCP connection 1412.
The mailbox 6013 in the GLS unit 1408 can be used to handle information flow between the messaging, scanner, and the data path. When a schedule read thread, schedule config read thread, or schedule write thread message is received by the GLS unit 1408, the values extracted from the message are stored in the mailbox 6013. Then the corresponding thread is put in the scheduled state (for schedule read thread or schedule write thread) so that the scanner can move it to the execution state to trigger the GLS processor 5402. The mailbox 6013 also latches values from the source notification message (for write threads) and the source permission message (for read threads) to be used by the GLS unit 1408. Interactions among various internal blocks of the GLS unit 1408 update the mailbox 6013 at various points in time (as shown in
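The scalar dependency check described above can be sketched as a small model. This is only an illustration under stated assumptions: the class name `ScalarDependency`, the source identifiers, and the choice to count distinct sources rather than individual messages are hypothetical and are not taken from the disclosure.

```python
class ScalarDependency:
    """Hypothetical model of the scalar dependency for one thread."""

    def __init__(self, num_sources):
        # Number of sources recorded in the context descriptor (assumed field).
        self.num_sources = num_sources
        # Sources that have already delivered scalar data.
        self.received = set()

    def on_update(self, source_id):
        """Record one update data memory message; return True once met."""
        self.received.add(source_id)
        return self.is_met()

    def is_met(self):
        # Assumption: the dependency is met when every source has delivered.
        return len(self.received) >= self.num_sources


dep = ScalarDependency(num_sources=2)
first = dep.on_update("node_a")   # one of two sources has delivered
second = dep.on_update("node_b")  # all sources delivered; processor may be triggered
```

In this sketch the point at which `on_update` first returns True corresponds to the point at which the GLS processor 5402 would be triggered.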
The ingress message processor 6010 handles the messages received from the control node 1406, and Table 22 shows the list of messages received by the GLS unit 1408. The GLS can be accessed in the processing cluster 1400 subsystem with Seg_ID, Node_ID as {3,1} respectively.
Turning to
In
Turning to
In
Turning to
In
Turning to
In
Turning to
Turning to
Turning to
Turning to
Turning to
Turning to
Turning to
Turning to
Turning to
Turning to
Turning to
Turning to
The read thread is generally responsible for several functions in the GLS unit 1408, namely: (1) scheduling a read thread when the message is received by the GLS unit 1408; (2) sending source notification to destinations based on information stored in the data memory 5403; (3) managing data transmission to various nodes/shared function-memory 1410 based on PINCR sent by the destinations in the source permission message; (4) reading data from peripherals (i.e., system memory 1416) and sending it to various destinations using the global interconnect master interface; (5) de-interleaving (and/or upsampling) the image data; and (6) sending scalar data to destinations as required. The data flow protocol for a read thread is initiated when the GLS unit 1408 receives a schedule read thread message. The following steps are performed within the GLS unit 1408 upon receipt of the message:
For read threads used with the GLS processor 5402, there are several instructions associated with the read threads: LDSYS, VOUTPUT, OUTPUT, END, and TASKSW.
Looking first to the LDSYS instruction, this is a load instruction. When the GLS processor 5402 executes the LDSYS instruction, the GLS processor 5402 asserts the following signals on its ports or boundary pins: (1) gls_is_ldsys is set to ‘1’; (2) gls_vreg (4-bits); (3) gls_sys_addr; and (4) gls_posn (3-bits). When gls_is_ldsys=‘1’, the GLS unit 1408 will latch gls_vreg, and it will use it to cross-reference with the VOUTPUT instruction executed later. The GLS unit 1408 latches the gls_sys_addr to the image address of PARAMETER RAM as pointed to by the previously stored Context ID (from mailbox 6013). The format bits are obtained from the data lines of data memory 5403 when the GLS processor 5402 reads the data memory 5403 in response to the LDSYS instruction and are also stored in the PARAMETER RAM. The POSN is also captured and stored to be used for storing the DMEM_OFFSET that emerges from the VOUTPUT instruction.
Now turning to the VOUTPUT instruction, this is a vector output instruction. When the GLS processor 5402 executes the VOUTPUT instruction, it asserts the following output signals on its boundary pins: (1) risc_is_voutput is set to ‘1’; (2) risc_output_wd (4-bits) drives the VREG to cross-reference with the VREG obtained from the LDSYS instruction; (3) risc_output_wa (18-bits) provides data memory offset information; (4) risc_output_pa (6-bits), from which the DST tag is extracted from bits 2:0; and (5) risc_vip_size (8-bits) provides an 8-bit HG_SIZE value. The VREG information stored as a result of LDSYS execution is cross-referenced with the VREG from VOUTPUT. If they match, then the DMEM_OFFSET information is written into the Parameter RAM. The POSN obtained from the LDSYS instruction is used as an index to store the DMEM_OFFSET. It should be noted that there is no relation between the VREG value and the 64-bit pair present in the PARAMETER RAM. The GLS unit 1408 stores the 64-bit pair based on the time-order in which the VREG emerges from the GLS processor 5402.
The OUTPUT instruction is used by the GLS processor 5402 to load scalar information to the scalar RAM 6001. When the OUTPUT instruction is executed, the GLS processor 5402 asserts the following signals: (1) risc_is_output is set to ‘1’; (2) risc_output_wd (32-bits)->Scalar data to be written to the scalar RAM 6001; (3) risc_output_wa (11-bits)->Lower 9-bits are the data memory offset that should be written to the scalar RAM 6001; (4) risc_output_pa with bit 2:0->DST_TAG to be latched into the scalar RAM, bits 4:3 as ‘11’ (Hi=‘1’, Lo=‘1’), ‘10’ (Hi=‘0’, Lo=‘1’), or ‘00’ (Hi=‘0’, Lo=‘0’), and bit 5 set to ‘valid’; and (5) risc_store_disable. The risc_store_disable is sent by the GLS processor 5402 to be transmitted along with the scalar data to the destination (via MREQINFO). This bit informs the destination not to store the scalar data but to process the set_valid sent normally. The set_valid bit is also sent as part of MREQINFO to indicate the last scalar data for the thread.
The END instruction from GLS processor 5402 is asserted when the GLS processor 5402 determines that there is no more data to be read from the OCP connection 1412. When the END instruction is encountered, the GLS processor 5402 will assert the risc_is_end signal on its interface. This indicates to the GLS unit 1408 to start sending OT messages to all the destinations for the context, followed by thread termination.
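The LDSYS/VOUTPUT cross-reference described above can be illustrated with a minimal sketch. It assumes a single latched VREG at a time and models the PARAMETER RAM as a plain mapping from POSN to DMEM_OFFSET; the class and method names (`ParameterRamModel`, `on_ldsys`, `on_voutput`) are hypothetical.

```python
class ParameterRamModel:
    """Hypothetical model of the LDSYS/VOUTPUT cross-reference."""

    def __init__(self):
        self.image_addr = None      # latched from gls_sys_addr on LDSYS
        self.dmem_offsets = {}      # POSN -> DMEM_OFFSET
        self._latched_vreg = None   # VREG captured on LDSYS
        self._latched_posn = None   # POSN captured on LDSYS

    def on_ldsys(self, vreg, sys_addr, posn):
        # LDSYS: latch VREG and POSN and record the image address.
        self._latched_vreg = vreg
        self._latched_posn = posn
        self.image_addr = sys_addr

    def on_voutput(self, vreg, dmem_offset):
        # VOUTPUT: store DMEM_OFFSET only if the VREG matches the latched one.
        if vreg != self._latched_vreg:
            return False
        self.dmem_offsets[self._latched_posn] = dmem_offset
        return True


ram = ParameterRamModel()
ram.on_ldsys(vreg=3, sys_addr=0x1000, posn=2)
matched = ram.on_voutput(vreg=3, dmem_offset=0x40)     # VREG matches
mismatched = ram.on_voutput(vreg=5, dmem_offset=0x80)  # VREG differs, ignored
```

The sketch ignores the format bits, the BK bit, and HG_SIZE, which the text describes as being captured separately.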
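As an illustration, the risc_output_pa fields listed above can be unpacked as follows. The bit assignment for Hi and Lo is an inference from the three listed encodings (bit 4 read as Lo and bit 3 read as Hi, so that ‘10’ gives Hi=0, Lo=1); it is an assumption, and the function name `decode_output_pa` is hypothetical.

```python
def decode_output_pa(pa):
    """Unpack a 6-bit risc_output_pa value (hypothetical helper)."""
    return {
        "dst_tag": pa & 0x7,         # bits 2:0, DST_TAG
        "hi": (pa >> 3) & 0x1,       # bit 3, assumed to be Hi
        "lo": (pa >> 4) & 0x1,       # bit 4, assumed to be Lo
        "set_valid": (pa >> 5) & 1,  # bit 5, marks the last scalar data
    }


fields = decode_output_pa(0b110101)
```

With the value above, bits 4:3 are ‘10’, so the sketch reports Hi=0 and Lo=1, DST_TAG 5, and set_valid 1.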
The TASKSW instruction is a task switch instruction, and the TASKSW instruction asserts the risc_is_task_sw signal on the GLS processor interface. This signal is captured and it serves as the BK bit for the parameter RAM. It also serves as set_valid signal for the GLS logic to indicate that the last word for the PARAMETER RAM has been written by the GLS processor.
When the data from the OCP connection 1412 (i.e., from system memory 1416 or peripherals 1414) is passed to interconnect 814, it should be deinterleaved, upsampled, repeated, and/or zero-inserted. After these operations are performed, the data should be ready to be transmitted to the destinations via interconnect 814. The data in the peripheral (i.e., over OCP connection 1412) is fetched (for example) 128 bits at a time. From these 128-bit words, pixels (for example) should be extracted, and the actions mentioned above (deinterleaving, upsampling, repetition, and/or zero-insertion) should be performed. The format and type of operation that should be performed by the block are provided in the format information stored in the parameter RAM, which can be seen in
The first step performed by the GLS unit 1408 is to extract the pixels according to their bit-widths irrespective of the colors. Once that is done, the pixels are collected as per phase and interval settings in the format. The interval setting in the format allows the GLS unit 1408 to select blocks of N pixels (N is number of colors) and apply the phase setting to it.
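The two steps above (extraction by bit width, then collection by phase and interval) can be sketched as follows. This is a simplified illustration only: it assumes pixels are packed least-significant first in the 128-bit word, and it interprets the interval as the block size N and the phase as the position picked within each block. Neither assumption is confirmed by the text, and the function names are hypothetical.

```python
def extract_pixels(word, pixel_width, word_bits=128):
    """Split a fetched word into pixels of pixel_width bits (LSB-first, assumed)."""
    mask = (1 << pixel_width) - 1
    count = word_bits // pixel_width
    return [(word >> (i * pixel_width)) & mask for i in range(count)]


def select_by_phase(pixels, interval, phase):
    """Pick one pixel per block of `interval` pixels, at position `phase`."""
    return pixels[phase::interval]


# Build a 128-bit word holding eight 16-bit pixels with values 0..7.
word = sum(value << (16 * value) for value in range(8))
pixels = extract_pixels(word, pixel_width=16)
one_color = select_by_phase(pixels, interval=4, phase=1)
```

Here `select_by_phase` with an interval of 4 and a phase of 1 collects the second pixel of each four-pixel block, which is one plausible reading of how a single color component is gathered.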
In the GLS unit 1408, the write thread is generally responsible for: (1) scheduling a write thread when the message is received by the GLS unit 1408; (2) receiving source notifications; (3) responding with a source permission message for the source notification message sent by a node (i.e., node 808-i); (4) sending a PINCR value according to the buffer space available in the GLS unit 1408 for receiving data; (5) updating and managing the GLS pending permission table; (6) receiving data from the nodes on the data interconnect slave interface and storing it in the interconnect IO RAM (i.e., in buffer 5406); (7) interleaving (and/or downsampling) the received data and sending it to the peripheral (i.e., system memory 1416) based on the information from the parameter RAM; and (8) synchronizing and updating data memory 5403 with scalar data received from nodes (if enabled). The following steps are performed within the GLS unit 1408 upon the reception of the schedule write thread message:
Each DST Context ID# has a corresponding entry in the table which is implemented as (for example) an 80×16 Word RAM. There are (for example) five 32-bit words for each context ID that is assigned for the write thread. The first 4 words store information extracted from the source notification message and are indexed using the DST_TAG received. The 5th word displays the internal status of the GLS processing that context ID.
A 2-state finite state machine is implemented for each Src_Tag received in the source notification message.
Once the FSM reaches the state to send a source permission message, the GLS unit 1408 determines the amount of buffer space it has to store the write thread data for that context. It executes a lookup procedure to determine the buffer space available in the Global Interconnect IO RAM (i.e., buffer 5406), determines the PINCR value to be used in the source permission message, constructs the source permission message using that PINCR value, and sends it to the {SEG_ID, NODE_ID} destination. The GLS processor 5402 is triggered (context switch) with the context base address extracted from the write thread message. In response to the context switch, the GLS processor 5402 executes the program which corresponds to the write thread. As a result, the program writes the information shown in
The GLS processor 5402 can write up to (for example) four 64-bit pairs (up to 4 SRC-tags) for a write thread. Each 64-bit pair contains the following information that will be used by the GLS unit 1408 to send the write thread data to the peripheral (i.e., system memory 1416). The address is the starting address in the peripheral (i.e., system memory 1416) for the data corresponding to the Src_Tag (or image line) to be written. The offset is the data memory offset that will be used by the source to identify the color component of an image line (part of the MREQINFO sent by the source node on the interconnect 814 along with the data). BK identifies the last 64-bit pair for the write thread.
Once the GLS processor 5402 completes writing the information, the GLS processor 5402 performs a task switch which is interpreted by the GLS unit 1408 as the last word in the PARAMETER RAM (BK=1). A source permission message is sent for each source notification message received if there is buffer space to receive data from the source. If there is no buffer space, the source notification message received is kept in pending state until there is room in the buffer 5406 to receive data. The mailbox status is updated so that the GLS processor 5402 is not triggered repeatedly for subsequent source notification messages until the thread is terminated.
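The pending behavior described above (grant a source permission when buffer space exists, otherwise hold the source notification until space is freed) can be sketched as below. The disclosure does not spell out how PINCR is computed from the free space, so this sketch assumes a fixed grant size per permission; `PermissionGate`, `grant_size`, and the method names are hypothetical.

```python
from collections import deque


class PermissionGate:
    """Hypothetical model of source-permission gating on buffer space."""

    def __init__(self, free_entries, grant_size):
        self.free_entries = free_entries  # free space in the IO buffer (assumed unit)
        self.grant_size = grant_size      # assumed fixed PINCR per permission
        self.pending = deque()            # source notifications awaiting space
        self.permissions = []             # (src_tag, pincr) permissions sent

    def on_source_notification(self, src_tag):
        self.pending.append(src_tag)
        self._drain()

    def on_buffer_freed(self, entries):
        self.free_entries += entries
        self._drain()

    def _drain(self):
        # Send permissions in arrival order while enough space remains.
        while self.pending and self.free_entries >= self.grant_size:
            src_tag = self.pending.popleft()
            self.free_entries -= self.grant_size
            self.permissions.append((src_tag, self.grant_size))


gate = PermissionGate(free_entries=2, grant_size=2)
gate.on_source_notification(0)  # space available: permission sent
gate.on_source_notification(1)  # no space: kept pending
held = list(gate.pending)
gate.on_buffer_freed(2)         # space returns: pending permission sent
```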
A tag ID for OCP transmissions is also allocated for the write thread. The allocated tag ID will be used to write data to the peripheral. A new tag ID is allocated for each SRC_TAG that would be used by the write thread (identified, for example, by the number of 64-bit pairs written by the GLS processor 5402). Once the source permission is sent, the write thread is put in a suspended state until the data arrives from the source. When the source(s) start sending the data, the data is sent in bursts of (for example) two 256-bit beats. Along with the data, the source(s) send the following information in the MREQINFO:
SRC_TAG->Used to index into the pending permissions table as well as parameter RAM as well as update the 2-state finite state machine;
The two beats of data are stored in the interconnect RAM and passed on to the interleaver 6025 to interleave the data. Once the interleaved data for a SRC_TAG (or image line) is (for example) 128 bits wide (the format of the interleaved data has already been written by the GLS processor 5402 to the parameter RAM), it is transferred to the buffer 6024. Once the buffer 6024 accumulates (for example) 8 beats' worth of data (or less if there is no more data to send), the beats are burst to the peripheral via the OCP connection 1412 using the previously assigned tag ID. At the same time, the parameter RAM is updated with the new word offset (the word offset in the parameter RAM is maintained by the GLS unit 1408). The updated word offset will be added to the base address for subsequent data transfers. This process is repeated until set_valid is received for the SRC_TAG whose RT-bit was set in the source notification message or until HG_SIZE is equal to the internal counter value. When that condition occurs, the thread is terminated with a thread termination message sent to the processing cluster 1400 sub-system via the messaging interconnect, and the thread state is moved to the “non-executable state”.
When the context descriptor is accessed upon reception of the schedule write thread message, the descriptor indicates whether the thread depends upon reception of scalar input. When the In bit is set to ‘1’ in the thread's context descriptor, the thread will also receive scalar input from nodes, which is to be written into the data memory 5403 at the address specified. The number of scalar inputs received for the thread is provided by the #Inp bits in the context descriptor. The GLS unit 1408 should keep track of this as well. The scalar input will be received by the GLS unit 1408 using the update data memory message. The data memory address to update the (for example) 32-bit scalar word (16 bits at a time depending upon the HI/LO setting in the message) is extracted from the message as well. This extracted address is added to the address in the context descriptor to determine the final address. This can be seen in
When the source has no more data to send, it normally sends an output termination message. When this message is received by the GLS unit 1408, the destination context ID is extracted from the message and the GLS pending permission table is accessed to extract the information stored for the context. A scan of the table for the destination context is then performed to match the stored source information with the information received in the message. If a match is found, it means that the source has no more output to send. The InTm bit is set to ‘1’ in the pending table. The GLS processor 5402 is informed that the thread has been terminated by driving the wrp_terminate signal. The GLS processor 5402 executes the END instruction, and the GLS unit 1408 detects the END instruction and terminates the thread in the mailbox 6013. A thread termination message is then sent to the processing cluster 1400 sub-system.
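The accumulate-and-burst behavior of buffer 6024 can be modeled with a short sketch. It assumes that the word offset is kept in bytes and that each 128-bit beat occupies 16 bytes; those units, and the names `BurstBuffer`, `push`, and `flush`, are hypothetical.

```python
class BurstBuffer:
    """Hypothetical model of beat accumulation and bursting to the peripheral."""

    def __init__(self, base_addr, burst_beats=8, beat_bytes=16):
        self.base_addr = base_addr
        self.burst_beats = burst_beats  # beats per full burst (8 in the example)
        self.beat_bytes = beat_bytes    # 128-bit beat = 16 bytes (assumed unit)
        self.word_offset = 0            # running offset added to the base address
        self.beats = []                 # beats accumulated so far
        self.bursts = []                # (address, beats) bursts issued

    def push(self, beat):
        self.beats.append(beat)
        if len(self.beats) == self.burst_beats:
            self.flush()

    def flush(self):
        # Issue whatever has accumulated (fewer beats if no more data remains).
        if not self.beats:
            return
        address = self.base_addr + self.word_offset
        self.bursts.append((address, list(self.beats)))
        self.word_offset += len(self.beats) * self.beat_bytes
        self.beats.clear()


buf = BurstBuffer(base_addr=0x1000)
for beat in range(10):
    buf.push(beat)
buf.flush()  # end of data: send the remaining partial burst
```

Ten beats produce one full burst of eight beats at the base address and one partial burst of two beats at the base address plus the updated offset.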
The relevant instructions for the GLS processor 5402 are VINPUT, STSYS, END, and TASKSW. When the GLS processor 5402 executes the VINPUT instruction, it asserts: risc_is_vinput (set to ‘1’); gls_sys_addr; gls_vreg (4-bits); and risc_vip_size (8-bits). The GLS unit 1408 captures gls_vreg when risc_is_vinput is set to ‘1’. The gls_vreg is a 4-bit index which serves as a cross-reference to latch values that result from execution of the STSYS instruction by the GLS processor 5402. The gls_sys_addr is also captured, and its value is the DMEM_OFFSET value that is to be latched into the Parameter RAM.
When the GLS processor 5402 executes the STSYS instruction, it asserts: gls_is_stsys (set to ‘1’); gls_vreg (4 bits that will be cross-referenced with the stored value from VINPUT); gls_sys_addr (image address); and gls_posn (3-bits). When gls_is_stsys=‘1’, the GLS unit 1408 will compare against the previously latched gls_vreg value and, if a match is obtained, it latches the gls_sys_addr to the image address of PARAMETER RAM as pointed to by the previously stored Context ID (from mailbox 6013). The format bits are obtained from the data memory data lines when the GLS processor 5402 reads the data memory 5403. POSN is used as an index to write the DMEM_OFFSET value into the proper bits of the parameter RAM. It should also be noted that there is no relation between the VREG value and the 64-bit pair present in the PARAMETER RAM. The GLS unit 1408 (for example) stores the 64-bit pair based on the time-order in which the VREG emerges from the GLS processor 5402.
The END instruction from the GLS processor 5402 is asserted in response to an Output Termination indication by the GLS unit 1408. When the END instruction is encountered, the GLS processor 5402 will assert the risc_is_end signal on its interface. This indicates to the GLS unit 1408 to move the thread to the HALTED state as well as update the GLS pending permissions table.
The TASKSW instruction asserts the risc_is_task_sw signal on the GLS processor 5402 interface. This signal is captured and it serves as the BK bit for the parameter RAM. It also serves as set_valid signal for the GLS logic to indicate that the last word for the PARAMETER RAM has been written by the GLS processor 5402.
The interleaver 6025 is generally responsible for interleaving the data from the nodes/partitions so that it can be sent on the OCP connection 1412.
In the example shown in FIG. 60BA, the NUM_OF_COLORS is 4. This means that the interleaver 6025 should create an image line with 4 color components, with each pixel of “PIXEL_WIDTH” length. The transmitter will first send data on the interconnect 814 with DMEM_OFFSET0 (possibly). The interleaver 6025 is responsible for extracting the pixels based on the pixel width (also dropping the leading 0s) and using the downsampling information to latch the extracted pixels at the appropriate offset. In the above example, the downsampling setting=“0101”. This means that when data with DMEM_OFFSET0 is transmitted, the pixels extracted from the (for example) 256-bit word occupy the outgoing pixel location-0, 2, and so forth. Once the data with DMEM_OFFSET1 is received, the zero-insertion/repetition bit is examined. In either case, the pixels are picked up from the appropriate locations (after extraction) and latched at appropriate offsets. In the above example, the pixels extracted for DMEM_OFFSET1 are latched in pixel location 1, 5, and so forth. When data with DMEM_OFFSET2 is received, the pixels are latched into appropriate offsets. In the above example, the pixels extracted for DMEM_OFFSET2 are latched in pixel location 2, 6, and so forth. As explained above, once data worth (for example) 128 bits is formed, the interleaved data is transferred to the buffer 6024.
The GLS unit 1408 supports multicasting of read thread data and write thread data. The multicast option for a thread is enabled when a Schedule multicast message is received by the GLS unit 1408. A multicast thread can either receive data from the OCP connection 1412 (read thread) or receive data from the global interconnect (write thread). During a write thread, when the data is received via interconnect 814 and the thread has already received a schedule multicast message, the GLS unit 1408 extracts the previously stored DESTINATION_LIST_BASE from the mailbox 6013 for the thread (it would have been written by the multicast message). Then the data memory 5403 is scanned to determine the list of destinations. A source notification message is then sent to all the destinations present in the list which are not write threads. The destination can also include a write thread which is not “multicast”. When a source permission message is received from the destinations for which the source notification messages were sent, the data received via interconnect 814 is sent to the destination. If the destination happens to be a write thread, then the data is sent to the interleaver 6025 in the GLS unit 1408 for transfer to the OCP connection 1412. When the data has been transferred to all destinations, the buffer 5406 is freed to receive new data.
The primary source is the asynchronous reset provided to the GLS unit 1408. This reset fans out to all the modules of the GLS unit 1408.
There is limited clock gating in the GLS unit 1408. The GLS unit 1408 has the ability to gate its messaging clock interface when the clock enable from the control node indicates so. The control node 1406 sends a MESSAGE_CLK_ENABLE signal which, when set to ‘1’, enables the internal clock to the ingress and egress messaging interface. When it is set to ‘0’, the clocks to these modules are disabled.
The interconnect monitor is (for example) a 32-bit counter which monitors the interconnect 814 to detect activity on the data bus 1422. Whenever there is no interconnect activity, the counter starts counting up to 0x1fff_ffff. Whenever there is activity, the counter is reset back to ‘0’. When the counter reaches the max count (0x1fff_ffff), a “no activity” signal is sent to the control node 1406. When the control node 1406 receives this signal, it initiates the power-down sequence to power down the processing cluster 1400 sub-system.
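The counter behavior described above can be sketched as a cycle-by-cycle model. The class name `InterconnectMonitor` and the per-cycle `tick` interface are hypothetical, and the sketch assumes the “no activity” indication stays asserted for as long as the counter sits at its maximum.

```python
class InterconnectMonitor:
    """Hypothetical cycle-level model of the interconnect idle counter."""

    MAX_COUNT = 0x1FFF_FFFF  # maximum count given in the text

    def __init__(self, max_count=MAX_COUNT):
        self.max_count = max_count
        self.count = 0

    def tick(self, activity):
        """Advance one cycle; return True when 'no activity' is signalled."""
        if activity:
            self.count = 0  # any activity resets the counter
            return False
        if self.count < self.max_count:
            self.count += 1  # count idle cycles, saturating at the maximum
        return self.count == self.max_count


# A small maximum is used here only to keep the example short.
monitor = InterconnectMonitor(max_count=3)
signals = [monitor.tick(activity=False) for _ in range(3)]
after_activity = monitor.tick(activity=True)
```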
As shown in
In Table 23 below, an example of a list of IO signals of the Control Node 1406 that interacts with two partitions (labeled partition-0 and partition-1) can be seen.
Turning to
As shown in
Turning to
Typically, the input slave interfaces 6134-1 to 6134-(R+1) are generally responsible for handling all the ingress slave accesses from the upstream modules (i.e., GLS unit 1408). An example of the protocol between the slave and master can be seen in
The message pre-processors 6136-1 to 6136-(R+1) are generally responsible for determining whether the control node 1406 should act upon the current message or forward it. This is determined by first decoding the latched header byte. Table 24 below shows examples of the messages that the control node 1406 can decode and act upon when received from the upstream master.
As shown, when the {SEG_ID, NODE_ID} combination indicates a valid output port, the message is forwarded to the proper egress node.
The control node data memory initialization message is employed for action RAM initialization. As an example, when the control node 1406 receives this message, the control node 1406 examines the #Entries information contained in the data field. The #Entries field usually indicates the number of action list entries excluding the termination headers. For example, if the number of action list entries to be updated is 1 (i.e., action_list_0), then #Entries=1; if action_list_0 and action_list_1 should be updated, then #Entries=2. Therefore, the valid range of #Entries is 1 to 246. There are cases where the number of action list entries makes the total number of beats exceed the maximum beat count (for example, 32). For example, if the number of action list entries is 15, then the total number of data beats for the message is 1 (#Entries)+8 (node termination header)+8 (thread termination header)+20 (15 action list entries translate to 20 beats)=37 beats. The upstream is supposed to divide this into two packets (32 beats in the first packet and 5 beats in the next packet).
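As a non-limiting illustration, the beat arithmetic in the example above can be sketched as follows. The sketch assumes that each action list entry is 41 bits wide and that entries are packed back-to-back into 32-bit beats, which is consistent with 15 entries occupying 20 beats; the function names are hypothetical.

```python
ENTRY_BITS = 41  # assumed action list entry width
BEAT_BITS = 32   # assumed beat width
MAX_BEATS = 32   # example maximum beat count per packet


def init_message_beats(num_entries, node_hdr_beats=8, thread_hdr_beats=8):
    """Total beats: #Entries beat + termination headers + packed entries."""
    entry_bits = num_entries * ENTRY_BITS
    entry_beats = (entry_bits + BEAT_BITS - 1) // BEAT_BITS  # round up
    return 1 + node_hdr_beats + thread_hdr_beats + entry_beats


def split_into_packets(total_beats, max_beats=MAX_BEATS):
    """Divide a message into packets of at most max_beats beats each."""
    packets = []
    while total_beats > 0:
        size = min(total_beats, max_beats)
        packets.append(size)
        total_beats -= size
    return packets
```

For 15 entries, this arithmetic gives 37 beats, which are divided into a 32-beat packet followed by a 5-beat packet.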
Registers 6144 are generally comprised of several registers, and a list of examples of some of the registers 6144 can be seen below in Table 25.
The sequential processor or sequencer 6140 sequences the access to the control node memory 6114 based at least in part on the indications it receives from the various message pre-processors 6136-1 to 6136-(R+1). After the sequencer 6140 completes the actions that are generally used for a termination message, it indicates to the message forwarder or master interfaces 6138-1 to 6138-(R+1) that a message is ready for transmission. Once the message forwarder (i.e., 6138-1) accepts the message and releases the sequencer 6140, the sequencer 6140 moves to the next termination message. At the same time, it also indicates to the message pre-processor (i.e., 6136-1) that the actions have been completed for the termination message. This in turn triggers the message pre-processor (i.e., 6136-1) to release the message buffer for accepting new messages.
The message forwarder (i.e., 6138-1) forwards all the messages it receives from its message pre-processor (i.e., 6136-1) as well as from the sequencer 6140. The message forwarder (i.e., 6138-1) can communicate with the master egress blocks to send the message constructed or forwarded by the control node 1406. Once the corresponding master indicates the completion of the transmission, the message forwarder (i.e., 6138-1) should then release the corresponding message pre-processor (i.e., 6136-1), which will in turn release the message buffer.
Turning to
In most cases, the control node 1406 does not act upon the message (i.e., 6104) except to forward it to the correct destination master port. The control node 1406 can, however, take action when a message contains a segment ID 6110 and node ID 6112 combination that is addressed to it. Table 27 below shows an example of the various segment ID 6110 and node ID 6112 combinations that can be supported by the control node 1406.
Turning to
In
Base_Address=Action_table_base+(Prog_ID*2); or
Base_Address=Action_table_base+(Prog_ID*4)
Bit-8 of the header word 6406 can control the multiplier (i.e., 0 for *2 and 1 for *4), while Prog_ID can be extracted from the program termination message. Then, the base address can be used to extract action lists 6116 from the memory 6114. Each extracted word (for example, a 41-bit word) is divided into a header word and a data word to be sent as a message to the destination nodes.
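By way of example only, the base address selection and the division of an action list word can be sketched as follows. The function names are hypothetical. The split of the 41-bit word into an upper 9-bit header and a lower 32-bit data word is an assumption made for illustration, based on the bits 31:0 data field and the upper 9 bits mentioned elsewhere in this description.

```python
def action_list_base(action_table_base, prog_id, header_word):
    """Compute Base_Address; bit 8 of the header word selects *2 or *4."""
    multiplier = 4 if (header_word >> 8) & 1 else 2
    return action_table_base + prog_id * multiplier


def split_action_word(word41):
    """Split a 41-bit word into (header, data); field widths are assumed."""
    header = (word41 >> 32) & 0x1FF  # assumed upper 9 bits
    data = word41 & 0xFFFF_FFFF      # assumed lower 32 bits
    return header, data
```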
Turning to
An “action list end” encoding (as shown in Table 28 above) generally signifies the end of action list messages. Typically, for this encoding the control node 1406 can determine if the message ID and segment ID are equal to “0.” If not, then the header and data word are sent; otherwise an end is reached.
“Next list entry” and “message continuation” encodings (as shown in Table 28 above) can be used when the number of messages exceeds the allowable entry list. Typically, for the “next list entry” encoding, the control node 1406 can determine if the message ID and segment ID are equal to “0.” If not, then the header and data word are sent; otherwise, there is a move to the next entry. If node_ID is equal to 4′b1000 (for example), the information for “next list entry” is extracted to set the base address to a new address in control node memory 6114. If node_ID is equal to “1,” however, then the encoding is “message continuation,” causing the next address to be read.
The “host interrupt info end” encoding (as shown in Table 28 above) is generally a special encoding to interrupt a host processor. When this encoding is decoded by the control node 1406, the contents of the encoded word bits (i.e., bits 31:0) can be written to an internal register and a host interrupt is asserted. The host would read the status register and clear the interrupt. An example for the message opcode 6502, a segment ID 6504, and a node ID 6506 can be 000′b, 00′b, and 0010′b, respectively.
The “debug notification info end” encoding (as shown in Table 28 above) is generally similar to the “host interrupt info end” encoding. A difference, however, is that when this type of encoding is encountered, a debug interrupt is asserted. The debugger would read the status register and clear the interrupt. An example for the message opcode 6502, a segment ID 6504, and a node ID 6506 can be 000′b, 00′b, and 0010′b, respectively.
An ACTION_LIST_END encoding signifies the end of action list messages, and turning to
The NEXT_LIST_ENTRY and MESSAGE_CONTINUATION encodings can be used when the number of messages exceeds the allowable entry list. These three encodings are used together to form a linked list of messages as shown in the flow diagram of
The HOST_INTERRUPT_INFO_END encoding is a special encoding to interrupt the host processor 1316. When this encoding is decoded by the control node 1406, the contents of the encoded word bits 31:0 are written to an internal register (ACTION_HOST_INTR register), and a host interrupt is asserted. The host processor 1316 would read the status register and clear the interrupt. An example of which is shown in
The DEBUG_NOTIFICATION_INFO_END encoding is similar to the HOST_INTERRUPT_INFO_END encoding. A difference between the two, however, is that when this type of encoding is encountered, a debug interrupt is asserted. The debugger would read the status register and clear the interrupt. An example of which is shown in
The header word received is a master address sent by the source master on the ingress side. On the egress side, there are typically two cases to consider: forwarding and termination. With forwarding, the buffered master address can be forwarded on the egress master if the message should be forwarded. For termination, if the ingress message is a termination message, then the egress master address can be the combination of message, segment, and node IDs. Additionally, the data word on the ingress side can be extracted from the slave data bus of the ingress port. On the egress side, there are (again) typically two cases to consider: forwarding and termination. For forwarding, the data word on the egress side can be the buffered message from the ingress side, and for termination, a (for example) 32-bit message payload can be forwarded.
The control node 1406 can handle a series of action list entries with no payload count. Namely, a sequence of action list entries with no payload count or link list entry can be handled by the control node 1406. It is assumed that an action list end message will be inserted somewhere at the end. In this scenario, the control node 1406 will generally send the first series of payload as a burst until it encounters the first “NEW Action list Entry”. Then the subsequent sub-set is sent as a burst. This process is repeated until an action list end is encountered. The above sequence can be stored in the control node memory 6114. An exception to this sequence can occur when there are single beat sequences to send. In this case, an action list end should be added after every beat. Examples of which can be seen in
Using the Next list entry, the control node provides a way to create linked entries of arbitrary lengths. Whenever a next list entry is encountered, the read pointer is updated with the new address and the control node continues processing normally. For this situation, it is assumed that at the end somewhere an action list end message will be inserted. Additionally, the control node 1406 can continually adjust its internal pointers as pointed by next list entry. This process can be repeated until an action list end is encountered or a new series of entries start. The above sequence can be stored in the control node memory 6114. Examples of which can be seen in
The control node 1406 can also handle multiple payload counts. If multiple payload counts are encountered within a series of messages without encountering an action list end or new series of entries, the control node 1406 can update its internal burst counter length automatically.
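As a non-limiting illustration, the linked-entry traversal described above (following next list entries until an action list end is reached) can be sketched as follows. The entry tags "msg", "next", and "end" and the dictionary-based memory model are hypothetical stand-ins for the encodings of Table 28 and for the control node memory 6114; they are not the actual bit-level formats.

```python
def walk_action_list(memory, base):
    """Collect message payloads by following linked action list entries.

    memory maps an address to a (tag, value) pair, where tag is one of the
    hypothetical tags "msg", "next", or "end".
    """
    sent = []
    visited = set()
    ptr = base
    while True:
        if ptr in visited:
            raise ValueError("action list loops back on itself")
        visited.add(ptr)
        tag, value = memory[ptr]
        if tag == "end":
            break                # action list end terminates processing
        if tag == "next":
            ptr = value          # read pointer is updated with the new address
        else:
            sent.append(value)   # ordinary entry: send and move to next address
            ptr += 1
    return sent
```

The visited-address check is a safeguard added for the software model only; the description assumes that an action list end is always present.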
The maximum number of beats handled by the control node 1406 can (for example) be 32. If for some reason the beat length is greater than 32, then in the case of termination messages, the control node 1406 can break the beats into smaller subsets. Each subset (for this example) can have a maximum of 32 beats. This scenario is typically encountered when the payload count is set to a value greater than 32, when multiple payload counts are encountered, or when a series of message continuation messages is encountered without an action list end or new sequence start. For example, if the payload count in a sequence is set to 48, then the control node 1406 can break this into a 32-beat sequence followed by a 17-beat sequence (16+1) and send it to the same egress node.
Message pre-processors 6136-1 to 6136-(R+1) can also handle the HALT_ACK, Breakpoint, Tracepoint, NodeState Response, and processor data memory read response messages. When a partition (i.e., 1402-1) sends one of these messages, the message pre-processor (i.e., 6136-1) can extract the data and store it in the debugger FIFO to be accessed by either the debugger or the host. The format of the HALT_ACK, Breakpoint, Tracepoint, and NodeState Response messages can be seen in
Looking first to
In
Turning to
In
Turning to
The sequential processor 6140 generally sequences the access to the control node memory 6114 based at least in part on the indications it receives from the various message pre-processors 6136-1 to 6136-(R+1). Processor 6140 initiates sequential access to the control node memory 6114. After the sequencer completes its actions for a termination message, it indicates to the message forwarder that a message is ready for transmission. Once the message forwarder accepts the message and releases the sequencer 6140, the sequencer 6140 moves to the next termination message. At the same time, it also indicates to the message pre-processor (i.e., 6136-1) that the actions have been completed for the termination message. This in turn triggers the message pre-processor to release the message buffer for accepting new messages.
The message forwarder, as the name indicates, forwards all the messages it receives from the message pre-processors 6136-1 to 6136-(R+1) (forwarding messages) as well as from the sequencer 6140. The message forwarder block communicates with the OCP master egress block to send the message constructed or forwarded by the control node. Once the corresponding OCP master indicates the completion of the transmission, the message forwarder will then release the corresponding message pre-processor, which will in turn release the message buffer.
The host interface and configuration register module provides the slave interfaces for the host processor 1316 to control the control node 1406. The host interface 1405 is a non-burst single read/write interface to the host processor 1316. It handles both posted and non-posted OCP writes in the same non-posted write manner. In
The entries in the action lists 6116 are generally memory mapped for host read or for host write (normally not done). When the entries are to be written, the control node 1406 sends the contents in a “packed” form, which can be seen in
The control node 1406 would also generally handle the dual writes in certain cases (for example, action list entry-1 bits 20:0 and bits 40:21 of entries 7104 and 7106). Entry-1 bits 7104 are written first by the host along with entry-0 bits 7102. In this example, the control node 1406 will first write the entry-0 data 7102 followed by the entry-1 data 7104. The host sresp is usually sent after the two writes have been completed.
Additionally, termination headers for nodes 7202 to 7212 and for threads 7214 to 722, which should be written by the host and which is generally a 10-bit header, can be seen in
The debugger interface 6133 is similar to the host or system interface 1405. It, however, generally has lower priority than the host interface 1405. Thus, whenever there is an access collision between the host interface 1405 and the debugger interface 6133, the host interface 1405 controls. The control node 1406 generally will not send any accept or response signal until the host has completed its access to the control node 1406.
The control node 1406 can support a message queue 6102 that is capable of handling messages related to updates of control node memory 6114 and forwarding of messages that are sent in a packed format by one of the ingress ports or by the host/debugger. The message queue 6102 can be accessed by the host or debugger by writing packed format messages to the MESSAGE_QUEUE_WRITE Register. The ingress ports can also access the message queue 6102 by setting the master address to “b100_11_0001” (OPCODE=4, SEG_ID=3, NODE_ID=1). The message queue 6102 generally expects the payload data (i.e., action_0 to action_N) to be in the packed format shown in
Typically, the upper 9-bits in each action (i.e., action_0 to action_N) can indicate to the message queue 6102 what type of action the message queue 6102 should take. As shown in
Additionally, the message queue 6102 handles a special action update message 7500 for control node memory 6114 as shown in
Turning to
Looking to the FIFO 7513, it generally includes a general message entry FIFO (i.e., up to 3 header bytes, up to 8 bytes of payload, and up to 2 bytes of timestamp) and an extension timestamp FIFO (i.e., configurable depth that can support up to 6 additional bytes of timestamp). Typical messages from processing cluster 1400 should have a maximum (for example) of 2 beats of payload and (for example) between 2-3 bytes of header. If a timestamp is present in dense traffic, less than (for example) 14 bits of LSB are likely to have changed since the last time it was transmitted. An extension timestamp FIFO can be used to hold up to (for example) 42 additional bits which may be desired in case of a sync request. The number of rows can be 4, 8, or 16, for example. The number of rows in the general message FIFO can, for example, be 32+2, 64+2, or 128+2. The area used can be 466 bytes. A minimum of 32 rows can be employed to ensure two consecutive processing cluster 1400 messages of 32 beats of payload each can be transmitted. The additional 2 rows are to buffer data in case of consecutive synchronization messages being inserted into the data stream. The transmission byte order can also be: H0→H1 (if present)→H2 (if present)→M(beat0) LS byte 0→M(beat0) LS byte 1→M(beat0) LS byte 2→M(beat0) LS byte 3→(if present) M(beat1) LS byte 0→ . . . →M(beat1) LS byte 3→TS(7:0) (if present)→TS(15:8) (if present)→(if present) TS(23:16) . . . TS(63:56) (if present).
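By way of non-limiting illustration, the transmission byte order described above can be sketched as follows. The function name and parameters are hypothetical, and the sketch assumes that “LS byte 0” denotes the least significant byte of a 32-bit beat, so that each beat and the timestamp are emitted least significant byte first.

```python
def serialize_trace_message(header, beats, timestamp=None, ts_bytes=0):
    """Return the byte stream for one message (illustrative model only).

    header:    1 to 3 header bytes (H0, then H1 and H2 if present)
    beats:     list of 32-bit payload beats
    timestamp: optional integer timestamp, emitted as ts_bytes bytes
    """
    if not 1 <= len(header) <= 3:
        raise ValueError("header must be 1 to 3 bytes")
    out = bytearray(header)
    for beat in beats:
        out += beat.to_bytes(4, "little")  # LS byte 0 first, LS byte 3 last
    if timestamp is not None and ts_bytes:
        out += timestamp.to_bytes(ts_bytes, "little")  # TS(7:0) first
    return bytes(out)
```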
Turning back to the sync message generator 7514, as stated above, the sync message generator 7514 performs periodic synchronization. Periodic synchronization can use a count of message bytes transmitted (including timestamp as applicable) to be used to determine when sync markers should be added to the datastream. Sync markers are added at message boundaries and the byte count is used as a hint to determine when the markers are desired. Periodic Synchronization is enabled by the following programmable register:
Trace messages are typically comprised of a trace header and a trace body. These trace messages can support any number of message continuation fragments so as to support arbitrarily long message payloads. The message header for the first or only fragment of a message is a minimum of one byte in length. A second byte is required when the segment and node identifier pair cannot be inferred. A third byte should be sent to transmit the mreqinfo information, if required.
To preserve the order of the header bytes the following combinations are allowed for a trace message:
The message header for any fragment of a multi-fragment message other than the first fragment can, for example, be one byte in length. This implementation can reduce the bandwidth overhead of splitting multiple beat (greater than 2) payloads across message fragments and can also optimize the header of single fragment messages to reduce bandwidth requirements. This implementation also encodes the timestamp after a message payload in order to eliminate transmission of an additional header with the timestamp. A timestamp is optionally present after the payload of the last fragment of a multi-fragment message or after the first and only fragment of a single fragment message. The trace header is typically comprised of three bytes (examples of which are shown in
A trace message may (for example) have up to 32 beats of payload, where each beat can be 32 bits of data. Typically, the FIFO memory can be organized for steady state operation in which typical messages are 1 beat in length, and the length of synchronization sequences (which generally entails breaking up infrequent messages with long payloads with known patterns that allow the sync pattern to be reduced in length) can be reduced. This is due to there being no control over the contents of message payloads, which could in essence be, from a trace perspective, arbitrary sequences of ‘0’s and ‘1’s. Additionally, a trace message less than or equal to (for example) 2 beats can be comprised of a single fragment of the message with a payload of up to 2 beats and/or a variable length timestamp. A trace message that is (for example) longer than 2 beats can be comprised of a first fragment of the message with a payload of up to 2 beats; second and subsequent continuation fragments with a payload of up to 2 beats; a last fragment with a payload of up to 2 beats; and a variable length timestamp payload. Examples of trace messages with a 1-beat payload and a one-byte header, a 1-beat payload and a two-byte header, a 2-beat payload and three-byte header, and a 6-beat payload, all with no timestamps, can be seen in
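As a non-limiting illustration, the division of a trace payload into fragments of up to 2 beats can be sketched as follows. The function name and the fragment labels are hypothetical; header and timestamp encoding are omitted.

```python
def fragment_payload(beats, max_beats=2):
    """Split payload beats into labeled fragments of at most max_beats."""
    chunks = [beats[i:i + max_beats] for i in range(0, len(beats), max_beats)]
    if len(chunks) <= 1:
        # A short message travels as a single fragment.
        return [("single", chunk) for chunk in chunks]
    labels = ["first"] + ["continuation"] * (len(chunks) - 2) + ["last"]
    return list(zip(labels, chunks))
```

Under these assumptions, a 6-beat payload yields a first fragment, one continuation fragment, and a last fragment of 2 beats each.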
There can be two sources of reset to the control node 1406. The primary source is generally the asynchronous reset provided to the control node 1406. The second source is generally the internal soft reset performed by the host/debugger.
The control node 1406 generally operates in a single clock domain, which is shown in
The control node 1406 generally controls the clocks of the downstream module (as shown in
The control node 1406 typically includes two interrupt lines. These interrupts are generally, active low interrupts and, for example, are a host interrupt and a debug interrupt. An example of a generic integration can be seen in
The host interrupt can be asserted because of the following events: if the action list encoding at the end of a series of action list actions is action list end with host interrupt; if the actions processed by the message queue have an action list end with host interrupt; or if the event translator indicates an underflow or overflow status. In these cases, the host, apart from reading the HOST_IRQSTATUS_RAW Register and HOST_IRQSTATUS Register, can also read the FIFO accessible by reading the ACTION_HOST_INTR Register for interrupts caused by action events. For events caused by the event translator, the host (i.e., 1316) reads the ET_HOST_INTR register. The interrupt can be enabled by writing ‘1’ to the HOST_IRQENABLE_SET Register. The enabled interrupt can be disabled by writing ‘1’ to the HOST_IRQSTATUS_CLR Register. When the host has completed processing the interrupt, it is generally expected to write ‘0’ to the HOST_IRQ_EOI Register. In addition to these, the interrupt can be asserted for test purposes by writing a ‘1’ to the bits of the HOST_IRQSTATUS_RAW Register (after enabling the interrupt using the HOST_IRQENABLE_SET Register). In order to clear the interrupt, the host should write a ‘1’ to the HOST_IRQSTATUS Register. This is normally used to test the assertion and deassertion of the interrupt. In normal mode, the interrupt should stay asserted as long as the FIFOs pointed to by the ACTION_HOST_INTR register and ET_HOST_INTR register are not empty. Software is generally responsible for reading all the words from the FIFO and can obtain the status of the FIFOs by reading either the CONTROL_NODE_STATUS register or ET_STATUS register.
The debug interrupt can be asserted because of the following events: if the action list encoding at the end of a series of action list actions is action list end with debug interrupt; if the actions processed by the message queue have an action list end with debug interrupt; or if the event translator indicates an underflow or overflow status. In these cases, the host/debugger, apart from reading the DEBUG_IRQSTATUS_RAW Register and DEBUG_IRQSTATUS Register, can also read the FIFO accessible by reading the DEBUG_HOST_INTR Register for interrupts caused by action events. For events caused by the event translator, the host (i.e., 1316) reads the ET_DEBUG_INTR register. In these cases, the debugger, apart from reading the DEBUG_IRQSTATUS_RAW Register and DEBUG_IRQSTATUS Register, can also read the FIFO accessible by reading the DEBUG_READ_PART Register. The interrupt should be enabled by writing ‘1’ to one of the bits in the DEBUG_IRQENABLE_SET Register. The enabled interrupt can be disabled by writing ‘1’ to the DEBUG_IRQENABLE_CLR Register. When the debugger has completed processing the interrupt, it is generally expected to write ‘1’ to the DEBUG_IRQ_EOI Register. In addition to these, the interrupt can be asserted for test purposes by writing a ‘1’ to the bits of the DEBUG_IRQSTATUS_RAW Register (after enabling the interrupt using the DEBUG_IRQENABLE_SET Register). In order to clear the interrupt, the host should write a ‘1’ to the corresponding bit in the DEBUG_IRQSTATUS Register. This is normally used to test the assertion and deassertion of the interrupt. In normal mode, the interrupt should remain asserted as long as the FIFOs pointed to by the DEBUG_HOST_INTR register and ET_DEBUG_INTR register are not empty. Software is generally responsible for reading all the words from the FIFO and can obtain the status of the FIFOs by reading either the CONTROL_NODE_STATUS register or ET_STATUS register.
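By way of example only, the test-mode assertion and clearing sequence described above for the host and debug interrupts can be modeled as follows. This is a simplified behavioral sketch: the class and method names are hypothetical, the masking of the raw status by the enable bits is an assumption, and the end-of-interrupt register and the FIFO-driven behavior of normal mode are not modeled.

```python
class IrqModel:
    """Simplified model of one interrupt line's register set (hypothetical)."""

    def __init__(self):
        self.raw = 0     # models the *_IRQSTATUS_RAW register
        self.enable = 0  # models the enable state set and cleared by writes

    def write_enable_set(self, bits):
        self.enable |= bits   # writing '1' enables the interrupt

    def write_enable_clr(self, bits):
        self.enable &= ~bits  # writing '1' disables the interrupt

    def write_status_raw(self, bits):
        self.raw |= bits      # writing '1' asserts the interrupt for test

    def write_status(self, bits):
        self.raw &= ~bits     # writing '1' clears the interrupt

    @property
    def status(self):
        return self.raw & self.enable  # assumed: status is raw masked by enable

    @property
    def asserted(self):
        return self.status != 0
```

A typical test sequence under these assumptions is to enable a bit, write the same bit to the raw status register, observe the assertion, and then clear it through the status register.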
The event translator, whenever it detects an overflow or underflow condition while handling interrupts from external IP, will assert et_interrupt_en along with the vector number and overflow/underflow indication to the control node. The control node 1406 buffers these indications in a FIFO for host or debugger to read. When an overflow/underflow indication comes from the ET block, the control node 1406 stores the overflow/underflow indication along with the vector number in the FIFO and indicates to the host/debugger via interrupt an error has occurred. The host or debugger is responsible for reading the corresponding FIFOs. An example of error handling by the event translator (which is described in detail below) can be seen in
Turning to
Turning to
Turning to
Turning to
Turning to
Turning to
Turning to
Turning to
Turning to
Turning to
The function-memory 7602 and vector-memory 7603 are generally “shared” in the sense that all processing nodes (i.e., 808-i) can access function-memory 7602 and vector-memory 7603. Data provided to the function-memory 7602 can be accessed via the SFM wrapper (typically in a write-only manner). This sharing is also generally consistent with the context management described above for processing nodes (i.e., 808-i). Data I/O between processing nodes and shared function-memory 1410 also uses the dataflow protocol, and processing nodes, typically, cannot directly access vector-memory 7603. The shared function-memory 1410 can also write to the function-memory 7602, but not while it is being accessed by processing nodes. Processing nodes (i.e., 808-i) can read and write common locations in function-memory 7602, but (usually) either as read-only LUT operations or write-only histogram operations. It is also possible for a processing node to have read-write access to a function-memory 7602 region, but this should be exclusive for access by a given program.
In Table 29 below, an example of a partial list of example IO signals, pins, or leads of the shared function-memory 1410 can be seen.
In Table 30 below, an example of a partial list of example slave OCP ports of the shared function-memory 1410 can be seen.
In Table 31 below, an example of a partial list of example slave OCP port configurations of the shared function-memory 1410 can be seen.
In Table 32 below, an example of a partial list of example master OCP ports of the shared function-memory 1410 can be seen.
In Table 33 below, an example of a partial list of example master OCP port configurations of the shared function-memory 1410 can be seen.
In
The function-memory 7602 organization in this example has 16 banks containing 16, 16-bit pixels each. It can be assumed that there is a lookup table or LUT of 256 entries, aligned starting at bank 7608-1. The nodes present input vectors of pixel values (16 pixels per cycle, 4 cycles for an entire node), and the table is accessed in one cycle using vector elements to access the LUT. Since this table is represented on a single line of each bank (i.e., 7608-1 to 7608-J), all nodes can perform a simultaneous access because no element of any vector can create a bank conflict. The result vector is created by replicating table values into elements of the result vector. For each element in the result vector, the result value is determined by the LUT entry selected by the value of the corresponding element of the input vector. If, at any given bank (i.e., 7608-1 to 7608-J), input vectors from two nodes create different LUT indexes into the same bank, the bank access is prioritized in favor of the least recent input, or, if all inputs occur at the same time, the left-most port input. Bank conflicts are not expected to occur very often, or to have much if any effect on throughput. There are several reasons for this:
Within a partition, one node (i.e., node 808-i) usually accesses the function memory 7602 at any given time, but this should not have a significant effect on performance. Nodes (i.e., node 808-i) executing the same program are at different points in the program, and distribute access to a given LUT in time. Even for nodes executing different programs, LUT access frequency is low, and there is a very low probability of simultaneous access to different LUTs. If this does occur, the impact is generally minimized because the compiler schedules LUT access as far as possible from the use of the results.
Nodes in different partitions can access function memory 7602 at the same time, assuming no bank conflicts, but this should rarely occur. If, at any given bank, input vectors from two partitions create different LUT indexes into the same bank, the bank access is prioritized in favor of the least recent input, or, if all inputs occur at the same time, the left-most port input (e.g. Port 0 is prioritized over Port 1).
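As a non-limiting illustration, the vector LUT access and the bank prioritization rule described above can be sketched as follows. The function names and the representation of a request as a (port, arrival cycle) pair are hypothetical; the mapping of LUT entries to banks is not modeled.

```python
def lut_lookup(lut, input_vector):
    """Build the result vector: element i is the LUT entry selected by input i."""
    return [lut[index] for index in input_vector]


def arbitrate(requests):
    """Pick the winning port among conflicting requests to one bank.

    requests is a list of (port, arrival_cycle) pairs. The least recent
    request wins; ties go to the left-most (lowest-numbered) port.
    """
    port, _ = min(requests, key=lambda r: (r[1], r[0]))
    return port
```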
Histogram access is similar to LUT access, except that there is no result returned to the node. Instead, the input vectors from the nodes are used to access histogram entries, these entries are updated by an arithmetic operation, and the result placed back into the histogram entries. If multiple elements of the input vector select the same histogram entry, this entry is updated accordingly: for example, if three input elements select a given histogram entry, and the arithmetic operation is a simple increment, the histogram entry can be incremented by 3. Histogram updates can typically take one of three forms:
The format of the LUT and histogram table descriptors 7700 is shown in
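By way of non-limiting illustration, the histogram update described above, in which several elements of one input vector select the same entry, can be sketched as follows. The function name and the weight parameter are hypothetical, and only a simple increment-style operation is shown.

```python
from collections import Counter


def histogram_update(hist, input_vector, weight=1):
    """Update histogram entries selected by the elements of input_vector.

    Elements selecting the same entry are combined, so an entry selected
    three times with a simple increment is incremented by 3.
    """
    for entry, hits in Counter(input_vector).items():
        hist[entry] += hits * weight
    return hist
```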
Turning back to
As shown, SFM processor 7614 uses a RISC processor (as described in sections 7 and 8 above) for 32-bit (for example) scalar processing (i.e., two-issue in this case), and extends the instruction set architecture to support vector and array processing (as described in section 8 above) in (for example) 16, 32-bit datapaths, which can also operate on packed, 16-bit data for up to twice the operational throughput, and on packed, 8-bit data for up to four times the operational throughput. The SFM processor 7614 permits the compilation of any C++ program, while making available the ability to perform operations (for example) on wide pixel contexts, compatible with pixel datatypes (Line, Pair, and uPair). SFM processor 7614 also can provide more general data movement between (for example) pixel positions, rather than the limited side-context access and packing provided by process 4322, including both in the horizontal and vertical directions. This generality, compared to node processor 4322, is possible because SFM processor 7614 uses the 2-D access capability of the functional memory 7302, and because it can support a load and a store every cycle instead of four loads and two stores.
SFM processor 7614 can perform operations such as motion estimation, resampling, and discrete-cosine transform, and more general operations such as distortion correction. Instruction packets can be 120 bits wide (as described in section 8 above), providing for up to parallel issue of two scalar and four vector operations in a single cycle. In code regions where there is less instruction parallelism, scalar and vector instructions can be executed in any combination less than six wide, including serial issue of one instruction per cycle. Parallelism is detected using an instruction bit to indicate parallel issue with the preceding instruction, and instructions are issued in-order. There are two forms of load and store instructions for the SIMD datapath, depending on whether the generated function-memory address is linear or two-dimensional. The first type of access of function-memory 7602 is performed in the scalar datapath, and the second in the vector datapaths. In the latter case, the addresses can be completely independent, based on (for example) 16-bit register values in each datapath half (to access up to, for example, 32 pixels from independent addresses).
The node wrapper 7626 and control structures of the SFM processor 7614 are similar to those of node processor 4322 (as described in section 8 above), and share many common components, with some exceptions. The SFM processor 7614 can support (for example) very general pixel access in the horizontal direction, and the side-context management techniques used for nodes (i.e., 808-i) is generally not possible. For example, the offsets used can be based on program variables (in node processor 4322, pixel offsets are typically instruction immediates), so the compiler 706 cannot generally detect and insert task boundaries to satisfy side-context dependencies. For node processor 4322, the compiler 706 should know the location of these boundaries and can ensure that register values are not expected to live across these boundaries. For the SFM processor 7614, hardware determines when task switching should be performed and provides hardware support to save and restore all registers, in both the scalar and the SIMD vector units. Typically, the hardware used for save and restore is the context save restore circuitry 7610 and the context-state circuit 7612 (which can be, for example 16×256 bits). This circuitry 7610 (for example) comprises a scalar context save circuits (which can be, for example, 16×16×32 bits) and 32 vector context save circuits (which can each, for example, be 16×512 bits), which can be used to save and restore SIMD registers. Generally, the vector-memory 7603 does not support side-context RAMs, and, since pixel offsets (for example) can be variables, it does not generally permit the same dependency mechanisms used in node processor 4322 (and as described in section 7 above). Instead, pixels (for example) within a region of a frame are within the same context, rather than distributed across contexts. This provides functionality similar to node contexts, except that the contexts should not be shared horizontally across multiple, parallel nodes. 
The shared function-memory 1410 also generally comprises an SFM data memory 7618, SFM instruction memory 7616, and a global IO buffer 7620. Additionally, the shared function-memory 1410 includes an interface 7606 that can perform prioritization, bank select, index select and result assembly and that is coupled to the node ports (i.e., 7624-1 to 7624-4) through partition BIUs (i.e., 4710-i).
Turning to
In
Turning back to
Vector-implied datatypes are generally SIMD-implemented vectors of either 8-bit chars, 16-bit halfwords, or 32-bit ints, operated on individually by each SIMD data path (i.e.,
The SFM processor 7614 SIMD generally operates within vector memory 7603 contexts similar to node processor 4322 contexts, with descriptors having a base address aligned to the sets of banks 7802-1, and sufficiently large to address the entire vector memory 7603 (i.e., 13 bits for the size of 1024 kBytes). Each half of a SIMD data path is numbered with a 6-bit identifier (POSN), starting at 0 for the left-most data path. For vector-implied addressing, the LSB of this value is generally ignored, and the remaining five bits are used to align the vector memory 7603 addresses generated by the data path to the respective words in the vector memory 7603.
In
These addresses access values aligned to a bank from each set 7802-1 to 7802-L (i.e., four of the sixteen banks), and the access can occur in a single cycle. No bank conflicts occur, since all addresses are based on the same scalar register and/or immediate values, differing in the POSN value in the LSBs.
Vector-packed addressing modes generally permit the SFM processor 7614 SIMD data paths to operate on datatypes that are compatible with (for example) packed pixels in nodes (808-i). The organization of these datatypes is significantly different in function-memory 7602 compared to the organization in node data memory (i.e., 4306-1). Instead of storing horizontal groups across multiple contexts, these groups can be stored in a single context. The SFM processor 7614 can take advantage of the vector memory 7603 organization to pack (for example) pixels from any horizontal or vertical location into data path registers, based on variable offsets, for operations such as distortion correction. In contrast, nodes (i.e., 808-i) access pixels in the horizontal direction using small, constant offsets, and these pixels are all in the same scan-line. Addressing modes for shared function-memory 1410 can support one load and one store per cycle, and performance is variable depending on vector memory bank (i.e., 7608-1) conflicts created by the random accesses.
Vector-packed addressing modes generally employ addressing analogous to the addressing of two-dimensional arrays, where the first dimension corresponds to the vertical direction within the frame and the second to the horizontal. To access a pixel (for example) at a given vertical and horizontal index, the vertical index is multiplied by the width of the horizontal group, in the case of a Line, or by the width of a Block. This results in an index to the first pixel located at that vertical offset; to this is added the horizontal index to obtain the vector memory 7603 address of the accessed pixel within the given data structure.
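The two-dimensional index calculation just described can be sketched as follows. This is a minimal illustration, assuming that the width is expressed in pixels and that the result is an index relative to the start of the data structure; the function and parameter names are hypothetical.

```python
def packed_index(v_index, h_index, width):
    """Row-major index of a pixel within a Line or Block structure.

    v_index: vertical index (scan-line within the structure)
    h_index: horizontal index within that scan-line
    width:   width of the horizontal group (Line) or of the Block,
             assumed here to be in pixels
    """
    first_pixel_of_row = v_index * width
    return first_pixel_of_row + h_index
```

With a width of 64 pixels, for example, the pixel at vertical index 3 and horizontal index 5 lies at index 3 × 64 + 5 = 197 within the structure.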
The vertical index calculation is based on a programmed parameter, an example of which is shown in
Turning to
In
Turning to
As shown in this example, addresses for each buffer increase linearly in the vertical direction (downward) from the respective base address. In the node (i.e., 808-i), this address indexes the circular buffer, and the horizontal group for a given scan-line appears at the same index, across multiple contexts that are associated by left-context and right-context pointers. In shared function-memory 1410, this address indexes a two-dimensional array, implemented by vector-packed addressing modes. The first dimension of this array is the circular-buffer index, and the second dimension is the relative position of the pixels in the horizontal group (HG_POSN) relative to the left-most node context. The size of this second dimension is variable, depending on the size of the horizontal group (HG_Size), and is specified in the shared function-memory context descriptor configured by system programming tool 718. The value HG_POSN is maintained by hardware for the context, to mimic node iteration across horizontal groups; however, in this case, the iteration is serial within a single context instead of possibly parallel. The function-memory 7602 generally does not permit dependency checking between contexts in the horizontal direction.
This mapping of horizontal groups in the shared function-memory context in this example permits the SFM processor 7614 SIMD to access pixels at any position in the vertical and horizontal directions. The circular-buffer index has the same values as the related node index, to permit input and output between contexts using the same values. When a source generates output to a circular buffer, it specifies the offset in the destination context of the buffer base address, with a separate circular index into the buffer; this index is usually zero for other types of output. In the shared function-memory context, this circular-buffer index is multiplied by HG_Size to index to the first 64 pixels in the horizontal group at that index. At that point, HG_POSN is used to index into the horizontal group, and POSN aligns a data path half to a unique pixel in the group. This unique pixel is the current central pixel for the data path half. Note that the central pixel can be at any circular-buffer index for the data path half—each half of the data path can compute this index independently.
Node processor (i.e., 4322) typically uses the same vertical-index parameter as shared function-memory 1410 to access circular buffers, except that HG_Size is usually zero because the buffer is effectively one-dimensional within the context (the second dimension is introduced by other contexts in the horizontal group). For output from a node (i.e., 808-i) to shared function-memory 1410 contexts, the node (i.e., 808-i) context has a vertical-index parameter for the shared function-memory 1410 circular buffer, and this parameter has HG_Size set to the width of the horizontal group (in increments of 32 pixels, for example). For code generation, node Line and shared function-memory Line are different datatypes (though compatible for assignment), and the width of the horizontal group is known: this permits code generation to form the appropriate vertical-index parameter for local node (i.e., 808-i) and shared function-memory 1410 accesses and for I/O between node (i.e., 808-i) and shared function-memory 1410. For output from node (i.e., 808-i) to shared function-memory 1410, the node (808-i) can directly address the shared function-memory 1410 input using Horiz_Position to form the two-dimensional address. For output from shared function-memory 1410 to node (i.e., 808-i), shared function-memory 1410 uses one-dimensional addressing (i.e., HG_Size is 0 for node Line data), and the second dimension is implemented by the dataflow protocol because the SFM context is threaded, and provides output in scan-line order.
To mimic node (i.e., 808-i) hardware iteration over horizontal groups, in multiple node contexts, shared function-memory contexts generally implement hardware iteration using HG_POSN to center the SIMD datapath on a particular (for example) 32-pixel element corresponding to a node context. This iteration is implicit in that it is not generally expressed directly in the source code. Instead, the code is written, as for nodes (i.e., 808-i), as an inner loop with the iteration controlled by dataflow. Shared function-memory 1410 hardware increments HG_POSN at the end of each iteration, and a new iteration is started based on new input data being received. Both shared function-memory 1410 and node (i.e., 808-i) iterate in the vertical direction using vertical-index parameters that are supplied by a system-level iterator, typically in the GLS unit 1408.
Turning to
In
Vector-packed accesses for Line data should perform or enable the following operations:
Turning to
To support boundary processing and dependency checking, there is “hidden” state written by these instructions to be used during the vector memory 7603 access. Even though this state is written as a side-effect, it conforms to the register allocation done for the other operands, and it is saved and restored on context switches, so it does not generally require special treatment. The first item of state is a bit, VB, that indicates that boundary processing was performed during the vertical-index calculation. This state applies to each datapath half, and is stored in the MSB of the result register half (the maximum V_Index is a 14-bit value). The other state is the values for Md, SD, and HG_Size from the vertical-index parameter. This state applies to all results, and is written to a “shadow” register associated with all SIMD registers having the same identifier. To limit the number of vector shadow registers, and to provide for an 8-bit immediate s_idx, the destination vector registers are limited to the range of V0-V3, so that two bits can be used in the instruction to encode the register identifier.
Turning to
The first pair of operations add the buffer base address to the vertical index, to form a buffer vertical index. The second pair of operations form a horizontal index; this index is generally computed by adding the position of the datapath half, which is a concatenation of HG_POSN and POSN, to the horizontal s_offset. The result of this add is the horizontal index, H_Index. The address of the given pixel, relative to the context base address, is formed by adding the buffer vertical index to the horizontal index. This in turn is added to the context base address to form the vector memory 7603 address of the pixel, where the pixel address is shown (for example) as bits 19:1 because it is usually a halfword address with respect to vector memory 7603. The pixel at this address is either loaded into the target register half or stored from the source register half, subject to boundary processing and dependency checking. The latter are controlled by the hidden state written during the vertical-index calculation.
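The sequence of additions described above can be sketched as follows. This is an illustrative model rather than a definition of the hardware: the function name is hypothetical, all quantities are assumed to be in pixel (halfword) units, and the concatenation is assumed to place POSN in the low-order bits with the 6-bit width given for POSN earlier in this description.

```python
POSN_BITS = 6  # POSN is described as a 6-bit identifier per datapath half

def line_pixel_address(context_base, buffer_base, v_index,
                       hg_posn, posn, s_offset):
    """Form the vector-memory address of a Line pixel for one datapath half.

    Returns (address, h_index) so that h_index remains available for
    boundary processing and dependency checking.
    """
    # Buffer base address plus vertical index gives the buffer vertical index.
    buffer_v_index = buffer_base + v_index
    # Position of the datapath half: HG_POSN concatenated with POSN
    # (bit order assumed, POSN in the low bits).
    position = (hg_posn << POSN_BITS) | posn
    # Horizontal index: position plus the signed horizontal offset.
    h_index = position + s_offset
    # Address relative to the context base, then the absolute address.
    relative = buffer_v_index + h_index
    return context_base + relative, h_index
```

As a usage example under these assumptions, a context base of 0x1000, a buffer base of 0x40, a vertical index of 128, HG_POSN of 1, POSN of 3, and an s_offset of −2 give an H_Index of 65 and a pixel address of 4353.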
Because the addresses generated by vector-packed operations are random, and can span a large range of vector memory 7603 addresses, there are many potential store-to-load dependencies in the SIMD pipeline. These are generally not checked by hardware because it would entail comparing (for example) each of the 32 load addresses, in each stage of the load pipeline, against all 32 store addresses in every stage of the store pipeline. Given the immense complexity, the compiler instead schedules vector-packed loads from a given buffer so that vector-packed loads cannot appear sooner than a number of cycles after a vector-packed store into the same buffer. The number of cycles is TBD but is likely on the order of 3 or 4 cycles. Vector-packed stores are rarely interspersed with loads from the same buffer; typically, vector-packed loads are used to access input data, with vector-implied or vector-packed stores placing results in different buffers. Since these accesses are to different variables, they are independent by definition, and there are no store-to-load delays.
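The compiler-side scheduling constraint can be illustrated with a simple checker. This sketch is an assumption about how such a rule might be verified, not a description of compiler 706: the function name, the tuple encoding of the schedule, and the `min_delay` parameter are hypothetical, and `min_delay` is left as a parameter because the actual number of cycles is not fixed in this description.

```python
def violates_store_load_delay(schedule, min_delay):
    """Check a schedule of vector-packed accesses for store-to-load hazards.

    schedule:  iterable of (cycle, op, buffer) with op "load" or "store"
    min_delay: minimum number of cycles required between a store to a
               buffer and a later load from the same buffer (placeholder)
    Returns True if any load follows a store to the same buffer too soon.
    """
    last_store = {}
    # Process in cycle order, with stores ahead of loads in the same cycle
    # so that a same-cycle store/load pair is seen as a zero-cycle delay.
    ordered = sorted(schedule, key=lambda e: (e[0], e[1] != "store"))
    for cycle, op, buf in ordered:
        if op == "store":
            last_store[buf] = cycle
        elif buf in last_store and cycle - last_store[buf] < min_delay:
            return True
    return False
```

Loads and stores to different buffers never trigger the check, which mirrors the observation that accesses to different variables are independent.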
Boundary processing provides predictable values for Line accesses that lie outside of a frame in the vertical direction, or outside of a frame division in the horizontal direction. Nodes (i.e., 808-i) perform boundary processing directly in the ISA of node processor 4322, and this is limited in scope because vertical indexing is one-dimensional and horizontal offsets are instruction constants in the range of (for example) −2/+2, where horizontal boundary processing is performed in the left- and right-boundary contexts. Shared function-memory 1410 boundary processing is more complex, because shared function-memory 1410 Line accesses are two-dimensional, and because vertical and horizontal indexing is more general.
In the shared function-memory 1410, vertical boundary processing is performed both during the vertical-index calculation and during the vector-packed access. Horizontal boundary processing is performed during the vector-packed access. Both are controlled by the Md field in the vertical-index parameter (the encoding 00′b specifies a shared function-memory 1410 Block, in which case boundary processing does not generally apply).
Turning to
Boundary processing applies when one of the following conditions is detected during the vertical-index calculation: 1) TF=1 and TBOffset+s_offset<0 (a negative offset is beyond the first scan-line), or 2) BF=1 and s_offset>TBOffset (a positive offset is beyond the last scan-line). Boundary processing is accomplished as follows:
Regardless of the type of boundary processing performed, the VB bits are set in the vector destination register halves. Each VB bit is used to suppress stores from the corresponding datapath half during a vector-packed store. Stores are invalid outside of the boundaries, and create incorrect results in vector memory 7603 if a store is performed using a vertical index modified for boundary processing.
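The two detection conditions stated above for vertical boundary processing, and the resulting VB bit, can be sketched as follows. The function name is hypothetical, and the sketch covers only detection for a single datapath half; it does not model how the vertical index is modified once a boundary is detected.

```python
def vertical_boundary_bit(tf, bf, tb_offset, s_offset):
    """Return the VB bit (1 or 0) for one datapath half.

    tf, bf:    top-of-frame and bottom-of-frame flags (1 when set)
    tb_offset: TBOffset from the vertical-index parameter
    s_offset:  signed vertical offset of the access
    """
    # Condition 1: a negative offset reaches beyond the first scan-line.
    beyond_top = tf == 1 and tb_offset + s_offset < 0
    # Condition 2: a positive offset reaches beyond the last scan-line.
    beyond_bottom = bf == 1 and s_offset > tb_offset
    return 1 if (beyond_top or beyond_bottom) else 0
```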
Turning to
If the vector-packed access is a store, the store is suppressed if boundary processing applies. This is indicated either by VB=1 (vertical boundary processing) or by a horizontal boundary-processing condition being met. (The store is also suppressed if SD=1 in the vector shadow register.)
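The store-suppression rule can be illustrated with a small model of a vector-packed store. This is a sketch under stated assumptions: vector memory is modeled as a Python dictionary, each datapath half is a dictionary with hypothetical keys `addr`, `value`, `vb`, and `hb` (where `hb` stands for a horizontal boundary-processing condition having been met), and SD is passed as a single value taken from the vector shadow register.

```python
def packed_store(memory, lanes, sd):
    """Perform a vector-packed store with boundary suppression.

    memory: dict mapping address -> value (stand-in for vector memory)
    lanes:  list of dicts, one per datapath half, with keys
            "addr", "value", "vb" (0 or 1) and "hb" (bool)
    sd:     SD value from the vector shadow register (0 or 1)
    """
    for lane in lanes:
        suppressed = sd == 1 or lane["vb"] == 1 or lane["hb"]
        if not suppressed:
            memory[lane["addr"]] = lane["value"]
    return memory
```

In this model, a half with VB=1 or with a horizontal boundary condition leaves memory untouched, while the remaining halves store normally.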
Shared function-memory 1410 Block datatypes represent fixed, rectangular regions of a frame, providing addressing of pixels (for example) in both vertical and horizontal directions. These are not directly compatible with Line datatypes, because they do not use implicit iteration, and do not support circular addressing and boundary processing. However, the Block datatypes are similar in that they are implemented using vector-packed addressing, and any pixel from any location can be loaded into (or stored from) a vector register half.
Iteration on Block data is explicit in the source code. Accesses use absolute, unsigned offsets from the relative position [0,0] in the block (the top, right-hand corner with respect to the frame), and iteration can explicitly modify these offsets. For example, iteration within the block can be accomplished by nested FOR loops, with the outer loop indexing the vertical direction, and the inner loop indexing in the horizontal direction at the given vertical index. This is just one example—any general form of indexing can be used.
Turning to
In
The index into a block, Blk_Index, is formed by adding the vertical index to an unsigned offset, u_offset, which is the same as H_Index in this case. The Blk_Index is added to the buffer base address to form a buffer index: this is the address of the given pixel, relative to the context base address. This in turn is added to the context base address to form the VMEM address of the pixel (the pixel address is shown as (for example) bits 19:1 because it is a halfword address with respect to vector memory 7603). The pixel at this address is either loaded into the target register half or stored from the source register half. As with Line data, the compiler schedules vector-packed loads from a given buffer so that they cannot appear sooner than a number of cycles (TBD) after a vector-packed store into the same buffer.
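The Block address formation, combined with explicit nested-loop iteration, can be sketched as follows. The names are hypothetical, all quantities are assumed to be in pixel (halfword) units, and the vertical index is assumed to be the row number multiplied by the block width, consistent with the two-dimensional addressing described earlier.

```python
def block_pixel_address(context_base, buffer_base, row, col, block_width):
    """Vector-memory address of pixel [row, col] in a Block."""
    v_index = row * block_width          # vertical index
    blk_index = v_index + col            # col plays the role of u_offset
    buffer_index = buffer_base + blk_index
    return context_base + buffer_index

def block_addresses(context_base, buffer_base, height, width):
    """Explicit iteration: outer loop vertical, inner loop horizontal."""
    return [block_pixel_address(context_base, buffer_base, r, c, width)
            for r in range(height)
            for c in range(width)]
```

Iterating a block in this order produces linearly increasing addresses, which reflects the statement that the resulting address is linear within any given block.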
Vector-packed addressing permits block vertical and horizontal offsets to be based on vector-implied variables. Also, each datapath half can access its own POSN value to create this vector-implied data. This enables partitioning the SIMD to operate on separate regions of a block, because the position can be used by each datapath half to form its own set of vertical and horizontal indexes into the block. For example, a block of 32×32 pixels can be partitioned into four regions of 16×16 pixels, each operated on by four SIMD datapaths (eight datapath halves). In this case, for example, each group of eight SIMDs would be positioned, respectively, at pixels [0,0], [0,16], [16,0], and [16,16]. These vertical and horizontal base coordinates can be formed independently using the base POSN value for the datapath halves in each SIMD partition, and each region iterated independently using these base coordinates to form V_Index and H_Index offsets within the region.
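The partitioning example can be sketched in a few lines. This illustration assumes 32 datapath halves numbered by POSN from 0 to 31 and assigns consecutive groups of eight halves to the four regions; the assignment rule and the function name are assumptions, since the description states only the resulting base coordinates.

```python
HALVES_PER_REGION = 8   # four SIMD datapaths, two halves each
REGION_SIZE = 16        # each region is 16x16 pixels
REGIONS_PER_ROW = 2     # a 32x32 block holds 2x2 regions

def region_base(posn):
    """Base [vertical, horizontal] coordinate for a datapath half."""
    region = posn // HALVES_PER_REGION
    base_row = (region // REGIONS_PER_ROW) * REGION_SIZE
    base_col = (region % REGIONS_PER_ROW) * REGION_SIZE
    return [base_row, base_col]
```

Each half can then add its own offsets within the region to these base coordinates to form its vertical and horizontal indexes.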
A subset of the shared function-memory 1410 Block datatype can be considered to be an array of Line data, a datatype called LineArray. The distinction is that the LineArray data is in a linear array, rather than a circular buffer, and can be operated on using explicit iteration. This can require that the vertical dimension of the circular buffer in nodes (i.e., 808-i), which provides input to the array, be the same as the first dimension of the array. Each iteration through the circular buffer, from absolute index 0 to the maximum index, provides input to a single array, and the next iteration provides input to a new array instance. This new input can be either in the same shared function-memory 1410 context as the first (after input is released), or in a different context, to provide overlapped I/O and/or parallelism.
Nodes (i.e., 808-i) implement Block datatypes in function-memory 4702, though the implementation of node (i.e., 808-i) Block data is different than the implementation of shared function-memory 1410 Block data. For example, the vertical- and horizontal-index calculations are not available in the ISA for the nodes (i.e., 808-i), so these addresses should be formed explicitly by other instructions (for example, the horizontal position of a datapath is available to each datapath, but this should be explicitly added to the horizontal index). Furthermore, the node wrapper (i.e., 810-i) does not generally support dependency checking on Block input, which can be significantly different than node (i.e., 808-i) Line input. Instead, a shared function-memory 1410 context is used to do this dependency checking and enable the node context to execute.
Since the SFM processor 7614 performs processing operations analogous to a node (i.e., 808-i), it is scheduled and sequenced much like a node, with analogous context organization and program scheduling. However, unlike a node, data is not necessarily shared between contexts horizontally across a scan line. Instead, the SFM processor 7614 can operate on much larger, standalone contexts. Additionally, because side contexts may not be dynamically shared, there is no requirement to support fine-grained multi-tasking between contexts, though the scheduler can still use program pre-emption to schedule around dataflow stalls.
Turning to
SFM processor 7614 can also support fully general task switch, with full context save and restore, including SIMD registers. The Context Save/Restore RAMs support a 0-cycle context switch. This is similar to the SFM processor 7614 Context Save/Restore RAM, except in this case there are 16 additional memories to save and restore SIMD registers. This allows program pre-emption to occur with no penalty, which is important for supporting dataflow into and out of multiple SFM processor 7614 programs. The architecture uses pre-emption to permit execution on partially-valid blocks, which can optimize resource utilization since blocks can require a large amount of time to transfer in their entirety. The Context State RAM is analogous to the node (i.e., 808-i) Context State RAM, and provides similar functionality. There are some differences in the context descriptors and dataflow state, reflecting the differences in SFM functionality, and these differences are described below. The destination descriptors and pending permissions tables are usually the same as for nodes (808-i). SFM contexts can be organized in a number of ways, supporting dependency checking on various types of input data and the overlap of Line and Block input with execution.
In
Unlike node (i.e., 808-i) contexts, an SFM context can receive a large amount of vector data, from multiple sources, for each set of scalar input data received. To permit operation on partially-valid vector input, SFM dataflow-state entries track vector and scalar input separately, with vector input summarized by the V_Input, HG_Input, and Blk_Input fields of the context descriptor. Turning to
SFM contexts typically receive a large amount of data for processing, compared to the operational bandwidth of the SIMD for SFM processor 7614. It is generally inefficient for the processor to wait until all input has been received (or even a single scan-line) before processing begins. This would serialize the transfer into the context with processing by the context, severely limiting the amount of potential overlap. To permit input to overlap with execution, SFM program scheduling permits programs to execute using inputs that are partially valid (either Line or Block input).
Dependency analysis usually recognizes when an access within the input region, by any SIMD datapath, attempts to access data that has not yet been received. When desired for Line input, this assumes that contexts are threaded, so that input, even if from multiple processing node contexts, is provided first for the top, left-most input (with respect to the frame) and proceeds in scan-line order to the bottom, right-most input. It also assumes that Block input is from programs that iterate from left-to-right and top-to-bottom with respect to the frame (since the input is in-order because of serial program execution, the SFM context is not necessarily threaded, though it can be). With these restrictions, this provides a significant opportunity to overlap SFM Line and Block input with execution. It permits the context to track valid input regions using valid-index pointers that specify the range of valid data in any input data structure.
For Line input, the dependency checking should account for wrapping of addresses within the circular buffer. For this reason, two valid-index pointers are provided in the dataflow state: one specifying the vertical index of valid input, and one specifying the horizontal index. Any scalar input is provided once per scan-line, unless it is provided once for the entire program, as indicated by Input_Done.
For Block input, dependency checking uses a single valid-index pointer for all input, regardless of the size of the input (different block inputs can have different sizes). Accesses into blocks still use two-dimensional addressing, but the resulting address is linear within any given block. Any scalar input is provided once per block, unless it is provided once for the entire program, as indicated by Input_Done.
SFM dataflow state can track either Line or Block input, but not both. However, as described later, it is possible to overlay multiple context-state entries to track input to a program that mixes Line and Block input, so that dependencies are checked for each type independently.
To track vector input, the context should know the number of vector sources. A source signals Set_Valid whenever it has provided all data from an iteration, either implicit (Line) or explicit (Block). However, this usually is not sufficient to determine to what degree input is valid—this is determined by the valid-index pointers. In order to maintain these pointers, the context should know how many vector inputs to consider in updating the pointers: for example, if there are three vector sources, the context should receive a Set_Valid from each source in order to increment the valid-index pointer to increase the range of valid input.
The number of vector inputs is detected after initialization, as the context receives the first set of inputs. During this time, the #InpV field counts the number of initial Set_Valid signals received from independent vector sources, based on independent Src_Tag values. The #SetValV[n] fields are used to count all Set_Valid signals from each vector source. The context is enabled to execute when all of the first set of inputs has been received, determined by #Inputs, and, when this condition is met, #InpV indicates the number of vector sources. Following this, the #InpV field is not updated.
In
The Buffer_Base_Address is available in the source context by linking the offset in the destination context during final code generation. The Circ_Index and HG_Size are determined by the vertical-index parameter at the source, and Horiz_Position is contained in the source's context descriptor. In the SFM context, this index is added to the context base address, and the input is written starting at the resulting address, 16 pixels per cycle (for example). The resulting address selects an even bank of vector-memory 7603, and updates all entries of this bank and the next odd bank.
The parameter Valid_Input is initialized to zero, and is updated as inputs arrive, based on the dataflow protocol. The following discussion starts by assuming that Line input is from a single set of source contexts (a single horizontal group), so that the basic concepts of dependency checking can be understood. In reality, input can be from multiple sources which provide data at different rates. Furthermore, the width of input data can be different for different sources: even though all Line data corresponds to the same region of a frame, data elements can be of different sizes, for example when some input is sub-sampled with respect to other input. Dependency checking should comprehend these more general cases.
In
In the first step of the sequence shown, a Source Notification message (SN) is received from the left-boundary node context, and the SFM context responds with a Source Permission (SP). The P_Incr field in the SP has the value 1111′b, because the context is guaranteed to have enough VMEM allocated for all input. (Block input uses a different P_Incr sequence; this difference is based on the Blk bit being set in the context descriptor.)
The SP enables output from the source context, with Set_Valid indicating the final output, as shown in the second step in the figure (Set_Valid is assumed to be to the buffer shown in the example, though it can be to any buffer receiving input from the source contexts). The Set_Valid increments Valid_Input and causes the source context to forward the SN to the next source context, which in turn sends an SN to the destination SFM context. This sequence continues, providing inputs to the first scan-line, shown in the third and fourth steps. At the end of the scan-line, the SN from the node context has Rt=1. The resulting Set_Valid sets the entire scan-line valid, and disables dependency checking using Valid_Input.
Execution in the context is enabled as long as there is valid input at the position of current execution on the line, HG_POSN. This is indicated by Valid_Input>HG_POSN. Before the scan-line is filled, dependency checking is performed during execution by comparing the H_Index values of relative vector-packed accesses to Valid_Input. The condition tested is whether H_Index is on or beyond the current input set (H_Index≧Valid_Input). If this condition is met, dependency checking fails.
If horizontal boundary processing applies, dependency checking uses H_Index as modified for boundary processing. However, if the boundary processing is specified to return a saturated value, this disables dependency checking because this value does not depend on input.
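The execution-enable and dependency-check conditions can be sketched as two predicates. This is a simplified model: the function names and the `line_filled` and `saturated` flags are hypothetical, and H_Index and Valid_Input are assumed to be expressed in the same units.

```python
def execution_enabled(valid_input, hg_posn):
    """Execution is enabled while valid input exists at HG_POSN."""
    return valid_input > hg_posn

def dependency_fails(h_index, valid_input,
                     line_filled=False, saturated=False):
    """Return True when a vector-packed access touches unreceived input.

    h_index:     horizontal index, already modified for boundary
                 processing if that applies
    line_filled: the scan-line is complete, so checking is disabled
    saturated:   boundary processing returns a saturated value, which
                 does not depend on input
    """
    if line_filled or saturated:
        return False
    # Failure when the access is on or beyond the current input set.
    return h_index >= valid_input
```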
As mentioned above, dependency checking doesn't detect whether entire scan-lines of input are invalid (for example, all but the first line in the figure). Software handles these cases by special treatment of circular buffers at the top and bottom of frame boundaries.
After the scan-line is filled, Valid_Input is incremented to the value HG_Size. Since dependency checking is disabled, Valid_Input is used instead to indicate when a new scan-line can be accepted. This is illustrated in
The conditions for enabling new input are that: Release_Input is signaled, HG_POSN=Valid_Input, and input is disabled (InEn=0 or all ValFlag bits are 0). At this point, InEn is set, Valid_Input is reset to 0, and the SP response is enabled (the SP is sent immediately if an SN has been previously received). Before this set of conditions is satisfied, Release_Input is signaled by every program at other values of HG_POSN, but this has no effect on the dataflow protocol. When input is enabled, the ValFlag[n] bits are set to reflect the number of sources (#Sources) to ensure that an SN is received from each source, setting the ValFlag field with the Type, before dependency checking is fully operational.
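The conditions for enabling new input can be sketched as a state update. The dictionary representation of the dataflow state and the key `SP_enabled` are assumptions made for illustration; the sketch omits the subsequent setting of the ValFlag[n] bits from #Sources.

```python
def try_enable_input(state, release_input):
    """Enable new Line input if all three conditions hold.

    state: dict with keys "InEn", "ValFlag" (list of bits),
           "HG_POSN", "Valid_Input" and "SP_enabled"
    Returns True if input was enabled by this call.
    """
    input_disabled = state["InEn"] == 0 or not any(state["ValFlag"])
    at_end_of_line = state["HG_POSN"] == state["Valid_Input"]
    if release_input and at_end_of_line and input_disabled:
        state["InEn"] = 1
        state["Valid_Input"] = 0
        state["SP_enabled"] = True
        return True
    # Release_Input at other positions has no effect on the protocol.
    return False
```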
The final three steps in the figure are similar to the steps shown in
This iteration over input scan-lines continues until terminated by an Output_Terminate signal (OT). The OT can be received at any point during the final scan-line input, but does not take effect until the program ends.
The description above assumed input from a single set of source contexts, in order to describe how the valid-input pointer is managed and how it is used to check dependencies on Line input. In the more general case, input can come from multiple sets of source contexts, and each set of sources can supply data at different rates. The dataflow protocol orders data from each set of sources, but there is no mechanism to synchronize the sets of sources with each other, and this would be undesirable because it is generally inefficient to stall one or more sources in order to synchronize them with other sources. Moreover, the data from multiple sets of sources can be of different effective HG_Size, even though they represent pixels from the same set of scan-lines. This can occur when pixels represent different sampling rates: for example, it is common for chroma YUV data to be sampled at half the rate of luma data, in which case two de-interleaved chroma inputs are half the width of luma input.
To track Line input from multiple sets of sources, the number of Set_Valid signals from each set of sources is counted independently, using the #SetValV[n] entries in the dataflow state. The valid-input pointer cannot be updated until each source at a given position has signaled Set_Valid, because all data up to the valid-input pointer is considered valid. When the last Set_Valid is received at a given horizontal position, allowing the pointer to be incremented, other sets of source contexts might be significantly ahead in providing input.
When Set_Valid is received with vector data, the Src_Tag accompanying the data is used to increment the corresponding #SetValV[n] field (n=Src_Tag). Another source context with the same Src_Tag can be enabled to input after Set_Valid, so the respective #SetValV[n] can be incremented multiple times with respect to other sources with different Src_Tag values. Vector sources are indicated by ValFlag[n,1]=1, and this indicates which of the #SetValV[n] fields are counting vector Set_Valid signals. Each successive source context sends an SN which updates the ValFlag bits, but, because each SN sets ValFlag to the same value, the MSB still indicates which #SetValV fields are active.
The first set of vector inputs from all sources is valid when the final expected Set_Valid is received for the left-most input (Valid_Input=0). This is indicated by all active #SetValV[n] fields having non-zero values (the final input increments the corresponding #SetValV field from 0 to 1). This condition captures the fact that a Set_Valid has been received from all vector sources (unique Src_Tag values) at the left boundary. At this point Valid_Input is incremented, and the #SetValV[n] fields are decremented to account for the incrementing of Valid_Input: the valid-input pointer captures the fact that a vector Set_Valid has been received for each vector Src_Tag at the respective input position.
For input at each successive value of Valid_Input, the process just described is used to determine when all inputs are valid at the respective horizontal position. The valid-input pointer is incremented when all #SetValV[n] fields with ValFlag[n,1]=1 are non-zero. At this point, Valid_Input is incremented, and the #SetValV[n] fields are decremented to reflect the new values of the pointer.
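By way of illustration only, the counting scheme described above can be sketched in Python. The class and method names below are hypothetical and do not appear in the disclosure; the sketch also assumes that an SN from a vector source sets ValFlag[n,1], and it omits right-boundary handling and the 4-bit width of the counters.

```python
class LineInputState:
    """Illustrative model of Valid_Input, #SetValV[n], and ValFlag[n,1]."""

    def __init__(self, num_tags):
        self.valid_input = 0                     # models Valid_Input
        self.set_val_v = [0] * num_tags          # models #SetValV[n]
        self.vector_active = [False] * num_tags  # models ValFlag[n,1]

    def source_notification(self, src_tag):
        # Assumption: an SN from a vector source sets ValFlag[n,1].
        self.vector_active[src_tag] = True

    def set_valid(self, src_tag):
        # Count the Set_Valid under its Src_Tag (n = Src_Tag).
        self.set_val_v[src_tag] += 1
        active = [n for n, on in enumerate(self.vector_active) if on]
        # Advance the pointer only when every active counter is non-zero,
        # then decrement those counters to reflect the new pointer value.
        if active and all(self.set_val_v[n] > 0 for n in active):
            self.valid_input += 1
            for n in active:
                self.set_val_v[n] -= 1


state = LineInputState(num_tags=2)
state.source_notification(0)
state.source_notification(1)
for _ in range(3):
    state.set_valid(0)   # source 0 runs ahead of source 1
state.set_valid(1)       # final expected Set_Valid at position 0
print(state.valid_input, state.set_val_v)
```

In this run the pointer stays at 0 while only source 0 has signaled, and advances to 1 when source 1 signals; the counter for source 0 then reads 2, recording that this source is two positions ahead of the pointer.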
Inputs that have smaller HG_Size than others encounter the right-boundary source context at smaller horizontal positions with respect to the others. This position, for each Src_Tag, is indicated by Rt=1 in the SN message (outputs with the same Src_Tag are in the same horizontal group and should have the same effective HG_Size). When a Set_Valid is received at this position, ValFlag[n,1] is reset, and the value of the corresponding #SetValV[n] field is no longer considered in updating Valid_Input. However, the #SetValV[n] field might be non-zero at this point, depending on the current position of other sources, even though it is no longer considered for updating the valid-input pointer. When Valid_Input passes this position of input, the corresponding #SetValV[n] field is decremented to zero by definition, because Valid_Input reflects all Set_Valid signals beyond that position. Beyond this point, the condition for updating the valid-input pointer is the same as before, with a smaller number of non-zero #SetValV[n] expected, still indicated by corresponding ValFlag[n,1]=1, so the valid-input pointer increments beyond this point. Any access to the smaller input passes horizontal dependency checking by definition in this state, because it cannot generate (without boundary processing) an access with H_Index larger than Valid_Input. The source of this input can send an SN for new input, but this is recorded in the pending-permission entry, and the SP is held until all current input is received and the conditions for enabling new input are met.
This process is repeated until all sources have provided data from right-boundary contexts. At this point, all ValFlag[n,1] bits are 0, and all #SetValV[n] fields have been decremented to zero. Valid_Input is not incremented, and its value defines the final value of HG_POSN when iterating over the horizontal group.
The value of the #SetValV[n] field for any source cannot be allowed to wrap from 1111′b to 0000′b. This shouldn't be common, but should be explicitly avoided for correct operation of dependency checking based on counting Set_Valid signals. To prevent this, the SFM context withholds the SP to the next source under conditions where the pointer might wrap. This is handled by InSt sequencing.
Scalar data provided to an SFM context processing Line data falls into one of three categories: 1) parameter data, provided without vector data from the source; 2) scalar data provided along with vector data from a GLS source thread, provided once per iteration; and 3) scalar data from processing node source contexts, provided along with vector data from all contexts per iteration. Each of these cases is handled differently by dependency checking on scalar input.
Scalar parameter data is indicated by Type=01′b in the SN from the source. This updates the ValFlag field with a value that prevents the source from participating in vector input-dependency checking, since the MSB is 0. When Set_Valid is signaled for the scalar input, ValFlag[n,0] is reset, and, since both valid-input flags are 0, all dependencies are released for that source.
GLS scalar data, provided with vector data per iteration, is provided once per destination context. This data is provided to all destination node contexts, but once to an SFM context. It is received by the SFM context at the beginning of each input scan-line, when Valid_Input=0. The scalar Set_Valid from GLS resets ValFlag[n,0], releasing the scalar dependency even though vector data from GLS can still be participating in vector input-dependency checking.
Node scalar data, provided with vector data per iteration, is provided from each source context, and so is received multiple times. The SN from each source context provides the same Type field, setting the ValFlag bits the same way, and new scalar input is provided by each source context. Execution is enabled when all scalar Set_Valid signals have been received from all sources, resetting the corresponding ValFlag[n,0] bits. The scalar input doesn't necessarily correspond to the source context at the current valid-input pointer, because some sources can be ahead of this position, but in this case all source contexts provide the same values for scalar input, so this lack of correspondence usually does not matter.
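As a non-limiting sketch, the two valid-input flags per source can be modeled as a 2-bit value, with the MSB standing for ValFlag[n,1] (vector input pending) and the LSB for ValFlag[n,0] (scalar input pending). The constant and function names are hypothetical, and the bit layout is an assumption made for illustration.

```python
VECTOR_PENDING = 0b10  # models ValFlag[n,1]
SCALAR_PENDING = 0b01  # models ValFlag[n,0]


def on_scalar_set_valid(val_flag):
    """Reset the scalar bit when a scalar Set_Valid is received."""
    return val_flag & ~SCALAR_PENDING


def scalar_dependencies_released(val_flags):
    """True when no source still has scalar input pending."""
    return all((flag & SCALAR_PENDING) == 0 for flag in val_flags)


# Parameter-only source: the MSB is 0, so after Set_Valid both flags are 0.
param_flag = on_scalar_set_valid(0b01)

# Source that also supplies vector data: the scalar dependency is released
# while the vector bit keeps participating in vector dependency checking.
mixed_flag = on_scalar_set_valid(0b11)

print(param_flag, mixed_flag,
      scalar_dependencies_released([param_flag, mixed_flag]))
```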
Dependency checking of SFM Block input is conceptually similar to dependency checking of Line input, with two major differences. First, Block input uses linear addressing in the SFM context, in contrast to the modulus used for circular-buffer addressing of scan-lines. This means that dependency checking with the valid-input pointer can cover both vertical and horizontal indexes. Second, source data is provided from single contexts or threads (node, SFM, or GLS). These sources have explicit iteration to provide block input (in GLS, this is in hardware, based on block parameters, instead of software). There is a single exchange of SN and SP messages at the beginning of the program, and then a Set_Valid to mark the end of output from each iteration without any additional SN-SP exchanges. This is in contrast to Line data, where there is a one-to-one correspondence between SN-SP message-exchange and Set_Valid from the source contexts.
At the source, the end of block output is determined by the end of all iterations that output block data. Set_Valid is used to mark the individual output of each iteration, so another method is desired to signal that all iterations are complete. This is based on a separate signal, Block_End, emitted in the code after all block output from the source, which is the point in the control flow after all iterations and conditional statements that perform block output. Since Block_End is based on control flow, it's awkward for it to be accompanied by valid data: for example, the last valid transfer would have to be moved beyond the end of an iteration loop, meaning that the loop would have to be written with one remaining output to be done. Instead, Block_End is handled similarly to Input_Done. This uses an encoding of the instruction that normally outputs vector data, but the accompanying data is not valid. The use of this encoding is to signal to the destination that there is no more current block output from the source.
Turning now to
As with Line input, Set_Valid signals are counted in the #SetValV[n] fields for block input from each source, and these fields are used to determine when Valid_Input can be incremented. And, as with Line input, the #SetValV[n] fields cannot be allowed to wrap from the value 1111′b to 0000′b. However, since there's a single SN-SP exchange for all block input, the destination SFM context cannot limit the output from a source, and the number of Set_Valid signals, by withholding an SP message. Instead, for Block input, the context uses P_Incr to limit output. This is denoted in the figure by P_Incr=E′h (1110′b). P_Incr=E′h limits each source to 14 sets of block outputs (14 elements for each block), to prevent the potential overflow of #SetValV[n] for the corresponding source, in the extreme case where it gets very far ahead of other sources. (The value F′h enables an unlimited number of outputs, and so doesn't restrict output from a source.) Blocks often require more than 14 outputs, but this is handled by updating P_Incr during execution.
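A minimal sketch of how a source might honor P_Incr is given below. The class name and the accumulation rule (each P_Incr value adds to a remaining-permission count) are assumptions for illustration; the text states only that E′h limits a source to 14 sets of outputs, that F′h is unlimited, and that P_Incr is updated during execution.

```python
UNLIMITED = 0xF  # P_Incr value that does not restrict output


class OutputPermission:
    """Illustrative source-side permission count for one destination."""

    def __init__(self):
        self.remaining = 0
        self.unlimited = False

    def grant(self, p_incr):
        # Assumption: each P_Incr adds to the remaining permission count.
        if p_incr == UNLIMITED:
            self.unlimited = True
        else:
            self.remaining += p_incr

    def try_set_valid(self):
        # Returns True if one more Set_Valid output is permitted.
        if self.unlimited:
            return True
        if self.remaining == 0:
            return False
        self.remaining -= 1
        return True


perm = OutputPermission()
perm.grant(0xE)
sent = sum(perm.try_set_valid() for _ in range(20))
print(sent)
```

Of the 20 attempted outputs, only 14 are permitted until a further P_Incr update arrives, which keeps a 4-bit count of outstanding Set_Valid signals below the wrap point.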
Block inputs arrive in order, due to restrictions in the programming model that iteration is linear in the horizontal direction, then linear in the vertical (if this restriction cannot be met, other forms of dependency checking apply, as described later, but block input cannot be overlapped with execution). Each 32-pixel (for example) input is accompanied by a context number and an offset into the context for a specific block element. The offset of the element is computed directly at the source, using a vertical-index parameter for the destination (this parameter specifies Block_Width). In the SFM context, this offset is added to the context base address, and the input is written starting at the resulting address, 16 pixels per cycle. The resulting address selects an even VMEM bank, and updates all entries of this bank and the next odd bank.
As shown, Valid_Input marks the block index at which at least one input is not yet valid (the block index, Blk_Index, is computed during an absolute vector-packed access). This valid-input pointer applies to all input blocks. Valid_Input is initialized to zero, and is updated as inputs arrive. The context expects block input for all sources that have ValFlag[n,1]=1. When all corresponding #SetValV[n] fields are non-zero, this indicates that a vector Set_Valid has been received from all sources at the current Valid_Input position. At this point, Valid_Input is incremented, and the #SetValV[n] fields are decremented to reflect the new value for Valid_Input.
Before all input is received, dependency checking is performed by comparing the index into a block of an absolute vector-packed access, Blk_Index, to Valid_Input. The condition tested is whether Blk_Index is on or beyond the current set of valid input (Blk_Index≧Valid_Input). If this condition is met, dependency checking fails.
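The comparison can be expressed as a one-line predicate. The function name is hypothetical; the condition itself is the one stated above.

```python
def block_access_passes(blk_index, valid_input):
    """Return True if the access may proceed, False if it must stall."""
    # Dependency checking fails when Blk_Index >= Valid_Input, that is,
    # when the access is on or beyond the current set of valid input.
    return blk_index < valid_input


# With three block elements valid, indexes 0..2 pass and index 3 stalls.
results = [block_access_passes(i, valid_input=3) for i in range(5)]
print(results)
```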
Inputs of smaller blocks generally complete sooner than other inputs, as illustrated in the third step in the figure. The completion of block input is indicated by Block_End from the source. At this point, the ValFlag[n,1] bit is reset, removing this source from block input-dependency checking, and when Blk_Input passes this point of this input, the corresponding #SetValV[n] field will be decremented to zero (by definition, because Valid_Input reflects all Set_Valid signals from the sources). Beyond this point, the condition for updating Valid_Input is based on non-zero #SetValV[n] fields for sources that have ValFlag[n,1]=1, so that other sources increment the pointer beyond this point. Any access to the smaller input passes dependency checking, because it cannot generate an access with Blk_Index larger than Valid_Input.
This process is repeated until all sources have provided data and signaled Block_End. At this point, all #SetValV[n] fields have been decremented to zero, and all ValFlag bits are 0. There are no more expected Set_Valid signals, and dependency checking is disabled.
It is possible to receive block input with Output_Kill signaled, as a result of SD=1 in the source's vertical-index parameter. In this case, the input data is not written, and the block input state is not updated.
It has so far been assumed for these examples that a source provides a single block input. This is not a restriction on the programming model, because a program can contain a number of different iteration loops for different block output. However, the block output from the final set of iteration loops signals Set_Valid, because in the program flow these loops contain the final output in the program to the given destination. At this point, previous input is already valid, and so dependency checking is undesired—it applies to the final block. This limits the potential for overlap, but does not restrict the structure of programs.
SFM program scheduling is based on active contexts, and does not use a scheduling queue. The program-scheduling message identifies the context that the program executes in, and the program identifier is equivalent to the context number. If more than one context executes the same program, each context is scheduled separately. Scheduling a program in a context causes the context to become active, and it remains active until it terminates, either by executing an END instruction with Te=1 in the scheduling message, or by dataflow termination.
Active contexts are ready to execute as long as Valid_Input>HG_POSN, for Line input, or Blk_Input>0. Ready contexts are scheduled in round-robin priority, and each context executes until it encounters a dataflow stall or until it executes an END instruction. A dataflow stall occurs when a program attempts to read invalid input data, as determined by valid-input pointers, or when a program attempts to execute an output instruction and the output hasn't been enabled by a Source Permission. In either case, if there is another ready program, the stalled program is suspended and its state is stored in the Context Save/Restore RAM. The scheduler schedules the next ready context in round-robin order, providing time for the stall condition to be resolved. All ready contexts are scheduled before the suspended context is resumed.
If there is a dataflow stall and no other program is ready, the program remains active in the stalled condition. It remains stalled until either the stall condition is resolved, in which case it resumes from the point of the stall, or until another context becomes ready, in which case it is suspended to execute the ready program. If the program is suspended for input, it should receive at least one more set of inputs (incrementing Valid_Input) before it can become ready for execution again.
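The readiness test for Line input and the round-robin selection can be sketched as follows. The function names, and the representation of readiness as a list of booleans indexed by context number, are assumptions for illustration; context save and restore is not modeled.

```python
def line_context_ready(valid_input, hg_posn):
    """An active Line context is ready while Valid_Input > HG_POSN."""
    return valid_input > hg_posn


def next_ready_context(ready, current):
    """Pick the next ready context after `current` in round-robin order.

    Returns None when no other context is ready, in which case the
    stalled program remains active in the stalled condition.
    """
    n = len(ready)
    for step in range(1, n):
        candidate = (current + step) % n
        if ready[candidate]:
            return candidate
    return None


ready = [True, False, True, False]
print(next_ready_context(ready, current=2))                  # wraps to 0
print(next_ready_context([False, False, True], current=2))   # no other
```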
There are four major attributes of an SFM context, supporting various types of data and control flow for vector-memory 7603/function-memory 7602 and SFM and node processing:
Non-threaded contexts provide the capability for a one-to-one mapping between SFM contexts and node or other SFM contexts, as shown in
A threaded SFM context receives Line input from a node horizontal group, and permits constructing the output of an entire node horizontal group within a single SFM context, permitting node-compatible operations on Line data. The system-level dataflow into and out of the threaded context is shown in
Even though
In
In the state 01′b, one of two events can occur next (both occur eventually unless there's an output termination). The context can receive an SN from the left-boundary context for the next input phase, in which case it should be stored in the pending permissions until input is enabled: this is the transition to 10′b. Or, input can be re-enabled: on the transition of InEn from 0 to 1, the state transitions to 00′b to wait on the next SN (termination might occur instead of an SN).
In the state 10′b, where the context has received an SN and is waiting for input to be re-enabled, it's possible for Set_Valid to be received for the right-boundary input of the previous input phase. The reason for this is that the source forwards an SN to the left-boundary context after it signals Set_Valid, but there's no ordering at the destination between the SN received as a result of the forwarded SN and the vector data received with Set_Valid. These transfers occur on different interconnect and have different buffering at source and destination, and on the interconnect. Thus, a Set_Valid received in state 10′b also resets ValFlag[n,1] (Set_Valid cannot be received in state 10′b if it was received in state 01′b).
In state 10′b, when input is re-enabled, the context sends an SP using the pending-permission entry. Though it's an unlikely corner case, it's possible for the original SN to have Rt=1, in which case the state transitions to 01′b to record this boundary. (After initialization, or if input is enabled before the SN is received, the state is 00′b when the SN is received, but transitions immediately to 01′b after the SN is received.) Otherwise, if Rt=0, the state transitions to 00′b.
The transitions to 00′b from states 01′b and 10′b that depend on input being enabled occur on the transition of InEn from 0 to 1 (InEn→1), rather than on InEn=1. When any given source completes its input, it is possible that InEn is still 1 because other sources have not yet completed. InEn should first be reset to ensure that all current input data, from all sources, is used in execution. When this input is no longer desired, the program signals Release_Input, causing InEn→1 and enabling the next set of input. It is at this point that the context can respond with SP and permit previous input to be over-written.
The state 11′b is used to hold an SP response to an SN if the resulting Set_Valid might cause the value of #SetValV[n] to wrap from F′h to 0′h, which would lead to incorrect operation of input-dependency checking. Because of the lack of ordering between messages and vector data, the SP is held if an SN is received with #SetValV[n]=E′h, instead of the actual condition to be avoided. The reason for this is that the SN can be received because of a forwarded SN at the source of vector data, received before the Set_Valid that triggered the forwarded SN. If this transition were based on #SetValV[n]=F′h, it would be possible to receive the Set_Valid after the SN, causing the value to wrap. Basing the transition on the value E′h means that, in this worst-case scenario, #SetValV[n] increments to F′h, but the held SP prevents any further Set_Valid. From the state 11′b, once #SetValV[n] is decremented (based on other input from other sources), the state transitions either to 00′b or 01′b, based on the Rt bit in the SN that originally caused the transition to 11′b.
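The reason for holding the SP at E′h rather than F′h can be checked with a small worst-case calculation, in which one Set_Valid arrives after the SN that triggered the hold. The function name and argument layout are hypothetical.

```python
def worst_case_after_sn(counter, threshold):
    """Model an SN arriving at `counter`, followed by one late Set_Valid.

    Returns (sp_held, counter_after), where the counter is 4 bits wide.
    """
    sp_held = counter == threshold
    counter_after = (counter + 1) & 0xF  # late Set_Valid increments it
    return sp_held, counter_after


# Holding at E'h: the late Set_Valid raises the counter to F'h, no wrap.
print(worst_case_after_sn(0xE, threshold=0xE))

# Holding only at F'h: the late Set_Valid wraps the counter to 0'h.
print(worst_case_after_sn(0xF, threshold=0xF))
```

In both cases the SP is held, but only the E′h threshold leaves room for the one Set_Valid that can still be in flight.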
Turning to
When the SP is received in response to the SN, the state transitions to 01′b, where output is enabled for Dst_Tag n, for the program iteration with HG_POSN=0 (the identifier in the SP updates the destination descriptor, as it usually does, which has the effect of re-initializing the descriptor). When the output to that destination is set valid, the state transitions back to 00′b, causing an SN to the original destination with Rt=1. The destination forwards this SN, and the resulting SP identifies the next destination context: this updates the destination descriptor and enables output for the iteration with HG_POSN=1. This process repeats until the program terminates. Even though program iteration is based on the effective HG_Size of the largest input context, the destination contexts can have a different effective HG_Size. The dataflow protocol routes data to the correct destinations by virtue of the forwarded SNs even when HG_POSN does not correspond to the relative horizontal position of the destination context.
Feedback loops require special treatment beyond what is required for nodes (i.e., 808-i), because the SFM context should release the dependencies of all contexts in the destination horizontal group, and the DelayCount value applies to all of these contexts. If FdBk is set when the program is scheduled, the context immediately sends an SN to the first destination context (using the identifier in the shadow destination descriptor). When the SP is received, the state transitions to 01′b. At this point, the context should send an SN with Rt=1 so that it can be forwarded to the next destination context. However, this should not be done in state 00′b because there is nothing to distinguish this SN from the first one sent. Instead, if feedback is enabled, the state transitions to 10′b, where the SN is sent for forwarding, then the state transitions to 11′b to wait for the SP response.
This process continues until an SP is received with Rt=1, indicating the right-boundary destination. At this point, the state is 01′b, the state transitions to 10′b, the forwarded SN is sent, and the state transitions to 11′b. Here, because the earlier SP had Rt=1, DelayCount is incremented, and the next SP is from the left-boundary context, because of forwarding from the right-boundary context. If there are multiple feedback destinations, all should meet the condition to increment DelayCount before it's incremented.
As long as DelayCount hasn't reached the value of OutputDelay, subsequent iterations of this process continue to release dependencies, based on receiving SP messages from all destination contexts, until DelayCount=OutputDelay. At this point, an SP received from the left-boundary context enables output to that context, and the SFM context becomes ready for execution when it receives valid input (by the definition of OutputDelay). This execution results in Set_Valid and a transition to 00′b, where normal operation begins. Because this isn't the first execution, the SN sent in this state has Rt=1, as required.
Line data input to an SFM context is relatively small compared to the total data retained by the context, because this input is provided one scan-line at a time. Most of the data in the circular buffer remains valid, and this provides significant opportunity to overlap execution with data transfer. In contrast, Block data is input and operated on an entire block at a time, with the block being discarded upon Release_Input.
Because block transfer and execution times are potentially very large, it is undesirable to serialize data transfer with execution. To avoid this, the SFM context descriptor provides the capability to define a pointer to a continuation context. A continuation context is associated with the defining context, in that it participates in the same dataflow and executes the same program. The continuation context can in turn define its own continuation context, and so contexts can be organized as a context group that participates in the same dataflow and executes the same program.
Continuation contexts permit overlapping dataflow with execution, by providing multiple buffers (contexts) for dataflow independent of execution. This supports the streaming of large amounts of block data into multiple contexts while execution is performed on the blocks. A high degree of overlapped execution is possible, because execution is permitted on partially-valid blocks as they are being filled, assuming dependency checking passes, and on fully-valid blocks as other continuation contexts receive input.
Continuation contexts provide two degrees of freedom to match the computation rate to the dataflow rate:
Turning to
After the entire block is valid, the next SN received by the context is forwarded to the next continuation context, using the continuation pointer in the context descriptor. This forwarding uses the messaging interconnect, and, for the receiving context, is functionally equivalent to receiving the SN from the next source context (which can be different than the previous source, due to source contexts doing their own forwarding to provide thread input). The forwarding context is enabled to execute because all of its input is valid, and this execution can (and should) be overlapped with block input to the next context.
In
The dataflow protocol supports complex transitions between source and destination contexts that are required for transfers between continuation contexts and threads for Block input and output, or node horizontal groups for Line input and output. Since continuation contexts are used to overlap input of linear-addressed blocks, rather than circular buffers, Line input is for the subset block type of an array of Line data (LineArray). The following two sections describe operation in these cases.
Turning to
In
Block input isn't required to use a continuation context, though it's normally more efficient. Setting Cn=0 in the context descriptor is functionally equivalent to setting Cn=1 and setting the continuation context ID to the current context ID. In this case, the continuation context and the defining context are the same, with the effect that overlapped input and execution are defined by the behavior of the program in a single context. Either encoding can be used, but the second alternative is more compatible with the encoding of LineArray input: in this case Blk=0 to enable Line input, but Cn=1 indicates that the context operation is on Block data. In this case, if there is a single context, the context ID has to be the same as the defining context.
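The equivalence of the two encodings can be illustrated with a hypothetical helper that resolves which context receives the next block. The descriptor representation (a mapping from context ID to a pair of Cn and continuation context ID) is an assumption for illustration.

```python
def continuation_target(context_id, cn, continuation_id):
    """Context that receives the next block after `context_id` is full."""
    if cn == 0:
        return context_id       # no continuation: reuse the same context
    return continuation_id      # follow the continuation pointer


def input_order(start_id, descriptors, count):
    """List the contexts that receive `count` successive blocks."""
    order = []
    ctx = start_id
    for _ in range(count):
        order.append(ctx)
        cn, cont = descriptors[ctx]
        ctx = continuation_target(ctx, cn, cont)
    return order


# Three contexts linked as a group: 0 -> 1 -> 2 -> 0.
group = {0: (1, 1), 1: (1, 2), 2: (1, 0)}
print(input_order(0, group, 5))

# A single context: Cn=0 and Cn=1 with a self-pointer behave the same.
print(input_order(5, {5: (0, 0)}, 3), input_order(5, {5: (1, 5)}, 3))
```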
In
The SPs sent in state 00′b eventually enable all block input, signaled by Block_End. After this, the source can generate an SN for new input, or might forward an SN. Since the SN message and the Block_End signal are not ordered at the destination, either one can occur first, and either signals the end of the block input, causing a transition to state 01′b to record the end of the block. However, Block_End should be received before ValFlag[n] is reset, because this is the guarantee that the final data has been received (it is ordered to be received after the final block input).
The transitions from the state 01′b implement the behavior required if there is a continuation context, and determine the ordering of SN and Block_End from the previous input (if there is an SN, it should be recorded and handled correctly). The two cases, without a continuation context or with, are described separately (the continuation context can be the same as the current context):
In
For the context with 1st=1, the state is initialized to 00′b, and, as soon as the context program begins execution, the context sends an SN to the initial destination context. This uses the shadow destination descriptor, because it is possible that the destination descriptor has a stale value from previous execution: this case arises when the program is re-scheduled in the context without re-initializing the context. When the SP is received in response to the SN, the state transitions to 01′b, where output is enabled for Dst_Tag n, up to the number of Set_Valid transfers specified by P_Incr (the identifier in the SP updates the destination descriptor, as it usually does, which has the effect of re-initializing the descriptor). During execution, the context can receive SPs which update the permission count. When the block output is set valid with Block_End, the state transitions to 10′b, where an SN is sent on behalf of the continuation context, if Cn=1, or the current context, if Cn=0 (the continuation pointer can also be to the current context if Cn=1). At this point, the state transitions to 11′b, where an SP should be received (from a forwarded SN) before output can be re-enabled for Dst_Tag n: this SP updates the destination descriptor with the new destination ID. The context can be enabled to execute by new input at any point, but cannot output to a destination unless enabled by OutSt[n]=01′b. It's also possible that the program terminates after forwarding the SN, in which case an OT is sent from the context to the most recent destination.
Feedback dependencies are handled by the context with 1st=1. If FdBk is set when the program is scheduled, the context immediately sends an SN to the first destination context (using the identifier in the shadow destination descriptor). When the SP is received, the state transitions to 01′b and the DelayCount value is incremented (this is based on the value not already being equal to OutputDelay, to prevent incrementing DelayCount in normal operation). After incrementing DelayCount, if the value has not reached OutputDelay, the state transitions back to 00′b where another SN is sent. If there are multiple feedback destinations, all should meet the condition to increment DelayCount before it is incremented.
As long as DelayCount has not reached the value of OutputDelay, subsequent iterations of this process continue to release dependencies, based on receiving SP messages, until DelayCount=OutputDelay. At this point, the state is 01′b, and the SP just received enables output to that context. The SFM context becomes ready for execution when it receives valid input (by the definition of OutputDelay). This execution results in Block_End and a transition to 10′b, where normal operation begins.
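A simplified trace of the feedback release sequence for a single feedback destination is sketched below. The function name is hypothetical, the sketch assumes OutputDelay is at least 1, and it models only the 00′b and 01′b states visited before output is enabled.

```python
def feedback_release_trace(output_delay):
    """Return (states visited, final DelayCount) until output is enabled."""
    if output_delay < 1:
        raise ValueError("sketch assumes OutputDelay >= 1")
    delay_count = 0
    trace = []
    while True:
        trace.append(0b00)  # SN sent to the destination, waiting for SP
        trace.append(0b01)  # SP received
        # DelayCount is incremented only if not already at OutputDelay.
        if delay_count != output_delay:
            delay_count += 1
        if delay_count == output_delay:
            # The SP just received enables output to that context.
            return trace, delay_count
        # Otherwise the state returns to 00'b and another SN is sent.


print(feedback_release_trace(2))
```

With OutputDelay equal to 2, two SN-SP exchanges occur before the context is left in state 01′b with output enabled.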
Feedback dependencies can be released in multiple destination contexts in this manner when the destination is a continuation group. SP messages in response to feedback SNs update the destination descriptors so that subsequent SNs are sent to the proper destination contexts. Each destination context enabled to execute by the release of feedback dependencies executes a valid program even though there is no data provided by the feedback source for OutputDelay iterations.
As previously discussed, a subset of a Block datatype, LineArray, is a linear array of Line data, in contrast to a circular buffer. This data is provided as input from or output to a node horizontal group, using processing node circular buffers with the same vertical dimension as the SFM LineArray block. The width of a LineArray input is the same as the width of the source horizontal group, but input can be accepted, into different LineArray variables, from sources of different widths. LineArray data is distinguished from more general Block data in that the source and/or destination node or processing node contexts are non-threaded. This type of input is encoded by Blk=0 (encoding Line input), and Cn=1 (enabling a continuation context, which usually applies to SFM Block data: this encoding can require a continuation context, which can be the same as the current context if a single context is allocated).
The dataflow protocol for LineArray input and output is a hybrid of the protocol for Line and Block data. The program explicitly iterates on the input as a Block (the program datatype), and there's no notion of Line boundaries even though the source contexts provide output as Line. For this reason, the input usually does not wait at the right boundary for other input and for execution to begin (there is no boundary, though there is a right-boundary indication from the source). Instead, the end of input for the current program is indicated by a signal that accompanies the input data, called Fill, which indicates the last line in a circular buffer (the vertical index is equal to the buffer size). Input is overlapped with execution using the valid-index pointer to check dependencies, but this pointer is updated and used as for Block input. When the last set of inputs is received from a source, the next set of inputs is directed to the continuation context. The continuation context can receive new input while the current context continues processing. The input remains valid until Release_Input is signaled, when the entire block is released.
Turning to
In
In
In
Turning to
In both of the above cases, if an SN is forwarded, the state is still 01′b after the sequence. However, there can be no Set_Valid in this condition, so state transitions are used to order the events of: 1) input being re-enabled, and 2) an SN being received as a result of forwarding from another (or the current) SFM context. If input is re-enabled first (InEn→1), the state transitions to 00′b to wait on the SN. If the SN is received first, the state transitions to 10′b, and the possible event at this point is for input to be re-enabled, at which point the state transitions to 00′b.
Normal operation for the contexts with 1st=0 begins in state 11′b when an SP is received. The context receives this SP without sending an SN, because the SN was sent on its behalf by another continuation context (the SP updates the destination descriptor, as usual). This SP enables output whether or not the context is ready to execute, but this output does not begin until sufficient input is provided for the program to be scheduled; the order of these two events does not matter. During execution, the transitions 01′b→00′b→01′b are used to send the SN to be forwarded by the destination context, and receive the SP as a result of this SN to enable output to the next context.
This continues until the program signals Block_End, indicating that output is complete in the current context and should be passed to the continuation context. As mentioned already, the transfer with Block_End signaled is suppressed (the accompanying data is invalid, and the destination does not desire this signal). Instead, Block_End causes a transition to 10′b, where an SN is sent on behalf of the continuation context (which can be set to the current context). At this point, the state transitions to 11′b, where the context waits again for an SP resulting from an SN sent by another context in state 10′b.
One continuation context in the group usually receives an Output_Terminate signal (OT); this is the context that receives the final block input. For block input received from one or more node contexts, the OT is sent by the context that performs the final input (for horizontal groups, this is the right-boundary context), and it is sent after the block has been set valid. For block input received from a read thread, the OT can be received at any time after the final set of inputs; it is recorded (InTm) and does not take effect until the entire block is set valid and the program completes execution with an END instruction (it is possible, but unlikely, that the END will occur before OT, with the same effect).
When this context terminates, it sends an OT to each destination. If the destination is a write thread, this occurs after the final input to the thread. If the destination is a processing node horizontal group, the OT is sent to the left-boundary context, whose destination ID is in the shadow destination descriptor. This is not the context that received final data, but in any case the receiving context treats the OT in the usual manner. Once the left-boundary context terminates (either it executes an END, or it has already executed an END), it sends OT to any non-threaded destination, and forwards the OT to the right-side context for any threaded destination. This forwarding continues as contexts terminate, up to the left-boundary context, which then sends the OT on to any thread destination.
Since the SFM continuation contexts are threaded, one is enabled for output at any given time, and this is the one that receives and sends the OTs. Other contexts in the group have ended output at this point, and will not execute again, but don't receive an OT. In this case, the terminating context transmits a Node Program Termination message, which can result in other contexts in the group being re-initialized and/or re-scheduled, with the same effect as termination. To avoid having to predict which context receives the OT, the Control Node should be configured so that termination in each of the contexts has the same effect.
If an SN sets ValFlag[n,1:0] to 01′b, the input is for scalar-data. This occurs in situations where a source provides scalar data such as vertical-index parameters, with vector data being provided by other sources. If a source provides both scalar and vector data, the InSt transitions for vector input also cover scalar input. For scalar-only input, there are no vector transfers, but the vector input-state transitions can be used by treating this input as a special case of vector input. The special casing uses the following rules:
Note that treating scalar-only input as a special case of vector input also properly sequences the dataflow protocol for continuation contexts, which also applies to scalar-only input, though it is defined for Block input.
Unlike processing nodes (i.e., 808-i), which support program loops, the shared function-memory 1410 supports conditional statements (such as if statements). Some applications require that output be performed within conditional statements, so that destination programs are enabled to execute, or not, based on control flow. This is similar in concept to a switch statement where the case statements invoke the destination programs (though the control flow is more general). This form of output puts more pressure on the desired number of destinations, because the number of outputs is a function of the combination of program conditions, not just the number of destinations.
Because of this, shared function-memory 1410 can support up to eight destinations (for example), using an extended context. If Ext=1 in the context descriptor, the program can use the destination descriptors and dataflow state of both the current and next sequential context-state entries. Dst_Tag values 0-3 use the current descriptor, and values 4-7 use the next sequential descriptor. The current descriptor defines all other attributes, such as the continuation context (note that other contexts in a continuation group should also have extended contexts).
An SFM context can be configured to perform synchronization operations for blocks that are operated on in other contexts. A synchronization context is used when other dependency mechanisms cannot be used. There are two cases where this applies. The first is to provide Block input to function-memory 7602, to be operated on by a processing node (using LUT accesses). Processing node contexts do not generally support dependency checking on function-memory 7602, so the synchronization context is used instead to enable node execution. The second case is to provide Block input to vector-memory 7603 to be operated on by another SFM context on the same node, when the block input is randomly addressed instead of sequentially. Neither case should permit overlap of input and execution, but both still support parallel execution between nodes.
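The Dst_Tag selection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name, the representation of a context-state entry by its number, and the idea that a tag maps to slot `dst_tag % 4` within the selected entry are hypothetical and are not taken from the description.

```python
from typing import Tuple


def select_descriptor(dst_tag: int, context: int, ext: bool) -> Tuple[int, int]:
    """Return (context-state entry number, slot within that entry) for a Dst_Tag.

    Hypothetical model: the slot numbering inside an entry is an assumption.
    """
    if not 0 <= dst_tag <= 7:
        raise ValueError("Dst_Tag must be in 0..7")
    if dst_tag >= 4 and not ext:
        raise ValueError("Dst_Tag values 4-7 require Ext=1")
    # Tags 0-3 use the current entry; tags 4-7 use the next sequential entry.
    entry = context if dst_tag < 4 else context + 1
    return entry, dst_tag % 4
```

For example, with Ext=1 in context 5, tag 2 resolves to entry 5 and tag 6 resolves to entry 6.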
In
To properly handle the dependencies for the node context, the SFM context performs the dataflow protocol on behalf of the node context, forwarding SNs to the node context and forwarding SP replies from the node context back to the source. When all input has been provided, the source signals Block_End. This normally enables the SFM context to execute, but, since it is null, it effectively executes nothing, but provides “output” to the node context by signaling Set_Valid (Set_Valid is used instead of Block_End because node contexts do not generally interpret Block_End). This enables the node context to execute (depending on other input into the context), and prevents further input using the dataflow protocol until Release_Input. Since there is no execution in the synchronization context, a synchronization context has no continuation context. However, if the destination is an SFM context (for random vector-memory 7603 block input, with Fm=0), that context can be part of a continuation group to provide overlap with execution, though not on partially-valid blocks.
SFM context-state entries can be shared for use by a program, to provide more general forms of dependency checking and input sequencing than is possible with a single entry. A context is configured to share another context-state entry by setting the Shr bit in the context descriptor, and setting both the vector-memory 7603 and data memory context base addresses to the same value. In this configuration, the descriptor entry that is used to specify a continuation-context node ID is used instead to specify a share pointer indicating the context number of the shared entry. Continuation contexts are still possible, because shared contexts by definition are on the same node, so the Cn_Cntx# field is desired to specify the continuation context.
The basic use of a shared SFM context is to enable input dependency checking on both Line and Block input as shown in
As shown, the Line input descriptor points to the Block input descriptor. Normally, the block input is provided once, with input complete upon Block_End from all sources, and the Line data is provided as recurring input, with implicit iteration on the input. In this case, the Block input context is null, and the program is scheduled for the Line context. In any case, the non-null context contains the share pointer, and Release_Input releases input in this context. Input in the null context is released when the scheduled program terminates in the non-null context.
If both Cn and Shr bits are set in a context descriptor, the descriptor contains both a pointer to a continuation context and to a shared context-state entry, both on the same node. Since continuation contexts are used for block input, and since block dimensions are specified by a program, one descriptor is desired to check dependencies on any given set of inputs. Instead, the share pointer is used to control the persistence of input state, by controlling which dataflow state, and associated input, is affected by a Release_Input executed within the context.
Because shared continuation contexts execute the same program within the same address space, and share input and intermediate data, execution should be exclusive, such that the program executes in one context at a time, and runs to completion in that context. This is accomplished by scheduling the program in one of the continuation contexts, determined by how many sets of input are required before the program can begin execution. Once this program completes execution, it is scheduled to execute in the next context as determined by the continuation pointer.
Turning to
The share pointer of A points to A itself, so when the program signals Release_Input, block A is released. If the input to B is complete, A can receive new input while it completes execution. If A completes execution first, the program scheduling information is copied to B and execution begins on that input, possibly overlapped with the completion of input to B. The second step of the sequence shows the case where B input is complete and B is executing, while A receives input. The third step shows the completion of the ping-pong cycle, the same as the first step.
In
Turning to
Turning to
SFM node wrapper 7626 is a component of shared function-memory 1410 that implements the control and dataflow around the SFM processor 7614. SFM node wrapper 7626 generally implements the interface of the SFM to other nodes in processing cluster 1400. Namely, the SFM wrapper 7626 can implement the following functions: initialization of the node configuration (IMEM, LUT); context management; program scheduling, switching, and termination; input dataflow and enables for input dependency checking; output dataflow and enables for output dependency checking; handling dependencies between contexts; and signaling events on the node and supporting node-debug operations.
SFM wrapper 7626 typically has 3 main interfaces to other blocks in processing cluster 1400: a messaging interface, a data interface, and a partition interface. The message interface is on the OCP interconnect, where input and output messages map to the slave and master ports of the message interconnect, respectively. The input messages from the interface are written into (for example) a 4-deep message buffer to decouple message processing from the OCP interface. Unless the message buffer is full, the OCP burst is accepted and processed offline. If the message buffer becomes full, the OCP interconnect is stalled until more messages can be accepted. The data interface is generally used for exchanging vector data (input and output), as well as for initialization of instruction memory 7616 and function-memory LUTs. The partition interface generally includes at least one dedicated port in shared function-memory 1410 for each partition.
The initialization of instruction memory 7616 is done using a node instruction memory initialization message. The message sets up the initialization process, and the instruction lines are sent on the data interconnect. The initialization data is sent by GLS unit 1408 in multiple bursts. MReqInfo[15:14]=“00” (for example) can identify the data on data interconnect 814 as instruction memory initialization data. In each burst, the starting instruction memory location is sent on MreqInfo[20:19] (MSBs) and MreqInfo[8:0] (LSBs). Within a burst, the address is internally incremented with each beat. Mdata[119:0] (for example) carries the instruction data. A portion of instruction memory 7616 can be reinitialized by providing a starting address to reinitialize a selected program.
The initialization of function-memory 7602 lookup tables or LUTs is generally performed using an SFM function-memory initialization message. The message sets up the initialization process, and the data word lines are sent on data interconnect 814. The initialization data is sent by GLS unit 1408 in multiple bursts. MReqInfo[15:14]=“10” can identify the data on data interconnect 814 as function-memory 7602 initialization data. In each burst, the starting function-memory address location is sent on MreqInfo[25:19] (MSBs) and MreqInfo[8:0] (LSBs). Within a burst, the address is internally incremented with each beat. A portion of function-memory 1410 can be reinitialized by providing a starting address. Function-memory 1410 initialization access to memory has lower priority than partition access to function-memory 1410.
Various control settings of the SFM are initialized using an SFM control initialization message. This initializes context descriptors, the function-memory table descriptor, and destination descriptors. Since the number of words required to initialize the SFM control is expected to exceed the maximum burst length of the message OCP interconnect, this message can be split into multiple OCP bursts. The message bursts for control initialization can be contiguous, with no other message type in between. The total number of words for control initialization should be (1+#Contexts/2+#Tables+4*#Contexts). The SFM control initialization should be completed before any input or program scheduling to shared function-memory 7616.
Now, turning to input dataflow and dependency checking, the input dataflow sequence generally starts with a Source Notification (SN) message from the source. The SFM destination context processes the Source Notification message and responds with Source Permission (SP) messages to enable data from the source. The source then sends data on the respective interconnect, followed by Set_Valid (encoded on an MReqInfo bit on the interconnect). The scalar data is sent using an update data memory message to be written into data memory 7618. The vector data is sent on data interconnect 814 to be written into vector-memory 7603 (or function-memory 7602 for a synchronization context with Fm=1). SFM wrapper 7626 also maintains dataflow state variables, which are used to control the dataflow and also to enable the dependency checking in SFM processor 7614.
The input vector data from OCP interconnect 1412 is first written into (for example) two 8-entry global input buffers 7620; consecutive data is written into and read from alternate buffers in a ping-pong arrangement. Unless the input data buffer is full, the OCP burst is accepted and processed offline. The data is written into vector-memory 7603 (or function-memory 7602) in a spare cycle when the SFM processor 7614 (or partition) is not accessing the memory. If the global input buffer 7620 becomes full, then the OCP interconnect 1412 is stalled until more data can be accepted. In the input-buffer-full condition, SFM processor 7614 is also stalled in order to write into the data memory and avoid stalling the interconnect 1412. The scalar data on the OCP message interconnect is also written into (for example) a 4-entry message buffer, to decouple message processing from the OCP interface. Unless the message buffer is full, the OCP burst is accepted and data is processed offline. The data is written to data memory 7618 in a spare cycle when SFM processor 7614 is not accessing the data memory 7618. If the message buffer becomes full, then the OCP interconnect 1412 is stalled until more messages can be accepted, and SFM processor 7614 is stalled to write into memory 7618.
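The burst addressing described above for instruction-memory and function-memory initialization can be illustrated with a short Python sketch. It assumes that the MSB field is concatenated directly above the 9-bit LSB field to form the starting address; the description gives only the field positions, so this concatenation, along with the function names and the returned labels, is hypothetical.

```python
def bits(value, hi, lo):
    """Extract the bit field value[hi:lo] (inclusive) as an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


IMEM_INIT = 0b00  # MReqInfo[15:14] for instruction memory initialization data
FMEM_INIT = 0b10  # MReqInfo[15:14] for function-memory initialization data


def decode_init_burst(mreqinfo):
    """Return (target, starting address) for an initialization burst.

    Assumption: start address = MSB field concatenated above MReqInfo[8:0].
    """
    kind = bits(mreqinfo, 15, 14)
    lsbs = bits(mreqinfo, 8, 0)
    if kind == IMEM_INIT:
        return "imem", (bits(mreqinfo, 20, 19) << 9) | lsbs
    if kind == FMEM_INIT:
        return "fmem", (bits(mreqinfo, 25, 19) << 9) | lsbs
    return "other", None


def beat_addresses(start, beats):
    """Within a burst, the address is incremented with each beat."""
    return [start + i for i in range(beats)]
```

Under these assumptions, an instruction-memory burst with MSBs of 2 and LSBs of 5 starts at address 1029 and proceeds through 1030, 1031, and so on, one address per beat.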
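As a worked illustration of the word-count expression above, the following Python sketch computes the total and the number of OCP bursts it would occupy. The rounding of #Contexts/2 (rounded up here) and the maximum burst length used in the example are assumptions; neither is specified in the description.

```python
def control_init_words(num_contexts, num_tables):
    """1 + #Contexts/2 + #Tables + 4*#Contexts.

    Assumption: #Contexts/2 is rounded up when #Contexts is odd.
    """
    return 1 + (num_contexts + 1) // 2 + num_tables + 4 * num_contexts


def bursts_needed(words, max_burst_len):
    """Number of contiguous bursts needed for the message (ceiling division)."""
    return -(-words // max_burst_len)
```

For instance, 8 contexts and 4 tables give 1 + 4 + 4 + 32 = 41 words, which would need 3 bursts if the maximum burst length were 16 words (a hypothetical value).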
Input dependency checking is employed to generally ensure that the vector data being accessed by SFM processor 7614 from vector memory 7603 is valid data (already received from input). The input dependency check is done for vector packed load instructions. Wrapper 7626 maintains a pointer (valid_inp_ptr) to the largest valid index in the memory 7618. The dependency check fails in an SFM processor 7614 vector unit if H_Index is greater than valid_input_ptr (RLD) or Blk_Index is greater than valid_index_ptr (ALD). Wrapper 7626 also provides a flag to indicate that the complete input has been received and dependency checking is not desired. An input dependency check failure in SFM processor 7614 also causes a stall or context switch: processor 7614 signals the dependency check failure to the wrapper, and the wrapper does a task switch to another ready program (or stalls processor 7614 if there are no ready programs). After a dependency check failure, the same context program can be executed again after at least one more input has been received (so that dependency checking may pass). When the context program is enabled to execute again, the same instruction packet has to be re-executed. This employs special handling in processor 7614 because the input dependency check failure is detected in the execute stage of the pipeline, which means that the other instructions in the instruction packet have already executed before processor 7614 stalls due to the dependency check failure. To handle this special case, wrapper 7626 provides a signal to processor 7614 (wp_mask_non_vpld_instr) when it is re-enabling a context program to execute after a previous dependency check failure. The vector packed load access is usually in a specific slot in the instruction packet, so the instruction in that one slot is re-executed the next time, and instructions in other slots are masked from execution. Below is sample logic for input dependency check:
Turning now to Release_Input, once the complete input is received for an iteration, no more inputs can be accepted from sources. The source permission is not sent to the sources to enable more input. Programs may release the inputs before the end of an iteration, so that the input for the next iteration can be received. This is done through a Release_Input instruction, and signaled to processor 7614 through flag risc_is_release.
HG_POSN is the position for current execution for Line data. For a Line data context, HG_POSN is used for relative addressing of a pixel. HG_POSN is initialized to 0 and is incremented on the execution of a branch instruction (TBD) in processor 7614. The execution of the instruction is indicated to the wrapper by the flag risc_inc_hg_posn. HG_POSN wraps to 0 after it reaches the right-most pixel (HG_Size) and an increment flag is received from instruction execution.
The wrapper 7626 also provides program scheduling and switching. A Schedule Node Program message is generally used for program scheduling, and the program scheduler performs the following functions: maintains a list of scheduled programs (active contexts) and the data structure from the “schedule node program” message; maintains a list of ready contexts, marking a program as “ready” when the context becomes ready to execute (an active context becomes ready on receiving sufficient inputs); schedules a ready program for execution (based on round-robin priority); provides the program counter (Start_PC) to processor 7614 for a program being scheduled to execute for the first time; and provides dataflow variables to processor 7614 for dependency checking, as well as some state variables for execution. The scheduler can also continuously look for the next ready context (the next ready context in priority after the currently executing context).
SFM wrapper 7626 can also maintain a local copy of descriptor and state bits for the currently executing context for instant access; these bits normally reside in data memory 7618 or context descriptor memory. It keeps the local copy coherent when state variables in context descriptor memory are updated. For the executing context, the following bits are usually used by processor 7614 for execution: data memory context base address; vector-memory context base address; input dependency check state variables; output dependency check state variables; HG_POSN; and a flag for hg_posn != hg_size. The SFM wrapper also maintains a local copy of descriptor and state bits for the next ready context. When a different context becomes the “next ready context”, it again loads the required state variables and configuration bits from data memory 7618 and context descriptor memory. This is done so that context switching is efficient and does not wait to retrieve settings from memory.
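As an illustration only (this is a hypothetical Python sketch written for this discussion, not the wrapper's actual logic), the comparison and the failure handling described above can be modeled as follows. The function names and the representation of ready programs as a list are assumptions.

```python
def input_dependency_ok(index, valid_ptr, input_complete):
    """Model of the check for a vector packed load.

    index is H_Index (RLD) or Blk_Index (ALD); valid_ptr is the wrapper's
    pointer to the largest valid index.
    """
    # When the complete input has been received, checking is bypassed.
    if input_complete:
        return True
    # The check fails when the accessed index is greater than the pointer.
    return index <= valid_ptr


def on_check_failure(ready_contexts):
    """Hypothetical policy model: task switch if a program is ready, else stall."""
    if ready_contexts:
        return ("task_switch", ready_contexts[0])
    return ("stall", None)
```

In this model, an access at index 5 with a valid pointer of 4 fails until either another input advances the pointer or the complete-input flag is set.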
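The HG_POSN update described above can be sketched in Python as a small next-state function. The function name is hypothetical, and the sketch assumes the wrap happens on the increment received while HG_POSN equals HG_Size.

```python
def next_hg_posn(hg_posn, hg_size, inc_flag):
    """Return the next HG_POSN given the increment flag (risc_inc_hg_posn)."""
    if not inc_flag:
        return hg_posn  # no increment signaled, position is unchanged
    # Wrap to 0 after the right-most position (HG_Size) has been reached.
    return 0 if hg_posn == hg_size else hg_posn + 1
```

With HG_Size equal to 2, successive increments move the position through 0, 1, 2 and back to 0.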
Task switching suspends the current executing program and moves the processor 7614 execution to “next ready context”. Shared function-memory 1410 dynamically does a task switch in case of dataflow stall (examples of which can be seen in
Turning now to the output data protocol for different datatypes: in general, at the start of a program execution, SFM wrapper 7626 sends a Source Notification message to all destinations. The destinations are programmed in destination descriptors, and destinations respond with Source Permission to enable output. For vector output, the P_Incr field in the Source Permission message indicates the number of transfers (vector set_valid) permitted to be sent to the respective destination. The OutSt state machine governs the behaviour of output dataflow. Two types of outputs can be produced by SFM 1410: scalar output and vector output. Scalar output is sent on message bus 1420 using an update data memory message, and vector output is sent on data interconnect 814 (over data bus 1422). Scalar output is the result of execution of an OUTPUT instruction in processor 7614, and processor 7614 provides an output address (computed), a control word (U6 instruction immediate), and an output data word (32-bit from GPR). The format of (for example) a 6-bit control word is Set_Valid ([5]), Output Data Type ([4:3], which is Input Done (00), node line (01), Block (10), or SFM Line (11)), and destination number ([2:0], which can be 0-7). Vector output occurs by execution of a VOUTPUT instruction in processor 7614, and processor 7614 provides an output address (computed) and a control word (U6 instruction immediate). The output data is provided by a vector unit (i.e., 512-bit, [32-bit per T80 vector unit GPR]*16 vector units) within processor 7614. The format of (for example) a 6-bit control word for VOUTPUT is the same as for OUTPUT. The output data, address, and controls from processor 7614 can be first written into a (for example) 8-entry global output buffer 7620. SFM wrapper 7626 reads the outputs from global output buffer 7620 and drives them on the bus 1422. This scheme is used so that processor 7614 can continue execution while output data is being sent out on the interconnect.
If the interconnect 814 is busy and the global output buffer 7620 becomes full, then processor 7614 can be stalled.
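The example 6-bit control word described above for OUTPUT and VOUTPUT (Set_Valid in bit 5, Output Data Type in bits 4:3, destination number in bits 2:0) can be packed and unpacked as in the following Python sketch. The class and function names are hypothetical; only the bit positions and type encodings come from the description.

```python
from typing import NamedTuple

# Output Data Type encodings for bits [4:3].
DATA_TYPES = {0b00: "Input Done", 0b01: "node line", 0b10: "Block", 0b11: "SFM Line"}


class ControlWord(NamedTuple):
    set_valid: bool
    data_type: str
    destination: int


def decode_control_word(word):
    """Split a 6-bit control word into its three fields."""
    if not 0 <= word < 64:
        raise ValueError("control word must fit in 6 bits")
    return ControlWord(bool((word >> 5) & 1), DATA_TYPES[(word >> 3) & 0b11], word & 0b111)


def encode_control_word(set_valid, type_code, destination):
    """Pack Set_Valid, a 2-bit type code, and a 3-bit destination number."""
    return (int(set_valid) << 5) | ((type_code & 0b11) << 3) | (destination & 0b111)
```

For example, the word 0b110011 decodes to Set_Valid asserted, datatype Block, destination 3.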
For output dependency checking, the processor 7614 is allowed to execute output if the respective destination has given permission to the SFM source context for sending data. If processor 7614 encounters an OUTPUT or VOUTPUT instruction when the output to the destination is not enabled, the result is an output dependency check failure, which causes a task switch. SFM wrapper 7626 provides two flags to processor 7614 as enables, per destination, for scalar and vector output respectively. Processor 7614 flags an output dependency check failure to SFM wrapper 7626 to start the task switch sequence. An output dependency check failure is detected in the decode pipeline stage of processor 7614, and processor 7614 enters IDLE and flushes the fetch and decode pipeline if it encounters an output dependency check failure. Typically, 2 delay slots are employed between OUTPUT or VOUTPUT instructions with Set_Valid so as to update the OutSt state machine based on Set_Valid and update the output_enable to processor 7614 before the next Set_Valid.
SFM wrapper 7626 also handles the program termination for SFM contexts. There are typically two mechanisms for program termination in processing cluster 1400. If the schedule node program message had Te=1, then the program terminates on an END instruction. The other mechanism is based on dataflow termination. With dataflow termination, the program terminates when it has finished execution on all the input data. This allows the same program to run multiple iterations before termination (multiple ENDs and multiple iterations of input data). A source signals Output Termination (OT) to its destinations when it has no more data to send (that is, no more program iterations). The destination context stores the OT signal and terminates at the end of the last iteration (END), when it has completed execution on the last iteration of input data. Alternatively, it may receive the OT signal after finishing execution of the last iteration, in which case it terminates immediately.
The source signals the OT through the same interconnect path as the last output data (scalar or vector). If the last output data from the source was scalar, then the output termination is signalled by a scalar output termination message on message bus 1420 (same as scalar output). If the last output data from the source was vector, then the output termination is signalled by a vector termination packet on data interconnect 814 or bus 1422 (same as data). This is to generally ensure that the destination never receives the OT signal before the last data. On termination, an executing context sends an OT message to all its destinations. The OT is sent on the same interconnect as the last output from this program. After it finishes sending the OT, the context sends a node program termination message to control node 1406.
The InTm state machine can also be used for termination. In particular, the InTm state machine can be used to store the Output Termination message and sequence the termination. SFM 1410 uses the same InTm state machine as the nodes, but uses the “first set_valid” for state transitions instead of any set_valid as in the nodes. The following sequence orderings are possible between input (set valid), OT, and END at the destination context: Input Set_Valid → OT → END: terminate on END; Input Set_Valid → END → OT: terminate on OT; Input Set_Valid (iter n−1) → Release_Input → Input Set_Valid (iter n) → OT → END → END: terminate on 2nd END (last iteration); Input Set_Valid (iter n−1) → Release_Input → Input Set_Valid (iter n) → END → OT → END: terminate on 2nd END (last iteration); and Input Set_Valid (iter n−1) → Release_Input → Input Set_Valid (iter n) → END → END → OT: terminate on OT.
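The five orderings listed above can be reproduced with a simple counting model, sketched below in Python. This is not the InTm state machine itself: it is a simplified, hypothetical model in which each "set_valid" event stands for the first Set_Valid of an iteration, and the context terminates on whichever of END or OT arrives last once the OT has been recorded and every received iteration has ended. The class, method, and event names are assumptions.

```python
class TerminationTracker:
    """Counting model of dataflow termination at a destination context."""

    def __init__(self):
        self.iterations = 0        # iterations whose first Set_Valid has arrived
        self.ends = 0              # END instructions executed
        self.ot_seen = False       # Output Termination recorded
        self.terminated_on = None  # "end" or "ot" once terminated

    def event(self, name):
        if self.terminated_on is not None:
            return self.terminated_on
        if name == "set_valid":
            self.iterations += 1
        elif name == "end":
            self.ends += 1
        elif name == "ot":
            self.ot_seen = True
        elif name != "release_input":
            raise ValueError("unknown event: " + name)
        # Terminate when OT is recorded and all received iterations have ended.
        if name in ("end", "ot") and self.ot_seen and self.ends == self.iterations:
            self.terminated_on = name
        return self.terminated_on


def run(events):
    """Feed a sequence of events and report which event caused termination."""
    tracker = TerminationTracker()
    for name in events:
        tracker.event(name)
    return tracker.terminated_on, tracker.ends
```

The returned pair gives the terminating event and the number of ENDs executed by then, so the third ordering yields termination on the second END.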
In Table 34 below, an example of a partial list of IO pins or signals of the wrapper 7626 can be seen.
A node state write message can update instruction memory 7616 (i.e., 256 bits wide), data memory 7618 (i.e., 1024 bits wide), and the SIMD register (i.e., 1024 bits wide). Example lengths of the bursts for these can be as follows: instruction memory, 9 beats; data memory, 33 beats; and SIMD register, 33 beats. In the partition biu (i.e., 4710-i), there is a counter called debug_cntr that increments for every data beat received. Once the count reaches (for example) 7, which means 8 data beats (the first header beat, which carries the data count, is not counted), debug_stall is asserted, which disables cmd_accept and data_accept until the write to the destination is done. The debug_stall is a state bit that is set in partition_biu and reset by node_wrapper when the write is done by the node wrapper (i.e., 810-1); the unstall comes on the nodex_unstall_msg_in input (for partition 1402-x) in partition biu 4710-x. An example of 32 data beats sent from partition biu 4710-x to node wrapper on bus:
When a node state read message comes in, the appropriate slave (instruction memory, SIMD data memory, or SIMD register) is read, and the data is then placed into the (for example) 16×1024 bit global output buffer 7620. From there the data is sent to the partition biu (i.e., 4710-1), which then pumps the data out to message bus 1420. When global output buffer 7620 is read, the following signals can (for example) be enabled out of the node wrapper. These buses typically carry traffic for vector outputs but are overloaded to carry node state read data as well; therefore, not all bits of nodeX_io_buffer_ctrl are typically pertinent:
Reading of data memory is similar to a node state read: the appropriate slave is read, the data is placed into the global output buffer, and from there it goes to the partition biu. For example, bits 32:31 of nodeX_io_buffer_ctrl are set to 01, and the message to be sent can (for example) be 32 bits wide and is sent as a data memory read response. Bits 16:14 should also indicate IOBUF_CNTL_OP_DEB. The slaves can (for example) be:
The context save memory 7610 that holds the state for processor 7614 also can have (for example) address offsets as follows:
When a Halt message is received, the halt_acc signal is enabled, which then sets the state halt_seen. This is then sent on a bus 1420 as follows:
When the resume message is received, halt_risc[2] is enabled, which will then restore the context; a force_pcz is then asserted to continue execution from the PC in the context state. Processor 7614 uses force_pcz to enable cmem_wdata_valid, which is disabled by the node wrapper if the force_pcz is due to a resume. The Resume_seen signal also resets various states, for example halt_seen and the fact that the halt ack message was sent.
When the step N instruction message is received, the number of instructions to step comes on (for example) bits 20:16 of message data payload. Using this—imem_rdy is throttled. The way the throttling works is as below:
1. reload everything from context state as debugger could have changed state
2. mem_rdy is disabled for a clock—one instruction is fetched and executed
3. then pipe_stall[0] is examined—to see if instruction has completed execution
4. once pipe_stall[0] is asserted high (meaning the pipes are drained), the context is saved; the process is repeated until the step counter goes to 0, and once it reaches 0, a halt acknowledge message is sent
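The stepping sequence above can be summarized with the following Python sketch. It is a behavioral outline only: the function names and the action labels are hypothetical, the pipeline-drain check on pipe_stall[0] is represented simply by the return of the `execute_one` callback, and the sketch assumes that the whole sequence, including the reload from context state, repeats for every step.

```python
def step_count(payload):
    """Bits 20:16 of the message data payload carry the number of steps."""
    return (payload >> 16) & 0x1F


def run_steps(payload, execute_one):
    """Return the ordered list of actions for a step-N request."""
    actions = []
    remaining = step_count(payload)
    while remaining > 0:
        actions.append("reload_context")  # debugger may have changed state
        execute_one()                     # one instruction fetched and executed
        actions.append("execute_one")
        actions.append("save_context")    # after the pipes have drained
        remaining -= 1
    actions.append("halt_ack")            # sent once the step counter reaches 0
    return actions
```

A payload with the value 2 in bits 20:16 therefore produces two reload, execute, and save rounds followed by a single halt acknowledge.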
Breakpoint match/tracepoint matches can be indicated (for example) as follows:
Shared function-memory 1410 program scheduling is generally based on active contexts, and does not use a scheduling queue. The program scheduling message can identify the context that the program executes in, and the program identifier is equivalent to the context number. If more than one context executes the same program, each context is scheduled separately. Scheduling a program in a context causes the context to become active, and it remains active until it terminates, either by executing an END instruction with Te=1 in the scheduling message, or by dataflow termination.
Active contexts are ready to execute as long as HG_Input>HG_POSN. Ready contexts can be scheduled in round-robin priority, and each context can execute until it encounters a dataflow stall or until it executes an END instruction. A dataflow stall can occur when the program attempts to read invalid input data, as determined by HG_POSN and the relative horizontal-group position of the access with respect to HG_Input, or when the program attempts to execute an output instruction and the output has not been enabled by a Source Permission. In either case, if there is another ready program, the stalled program is suspended and its state is stored in the context save/restore circuit 7610. The scheduler can schedule the next ready context in round-robin order, providing time for the stall condition to be resolved. All ready contexts should be scheduled before the suspended context is resumed.
If there is a dataflow stall and no other program is ready, the program remains active in the stalled condition. It remains stalled until either the stall condition is resolved, in which case it resumes from the point of the stall, or until another context becomes ready, in which case it is suspended to execute the ready program.
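The round-robin selection and the stall handling described above can be sketched in Python as follows. This is a minimal model under stated assumptions: active contexts are represented as a dictionary from context number to an (HG_Input, HG_POSN) pair, round-robin order is taken to be ascending context number, and the function names and returned labels are hypothetical.

```python
def next_ready(contexts, current):
    """Return the next ready context after `current` in round-robin order.

    A context is ready while HG_Input > HG_POSN. Returns None if no other
    context is ready.
    """
    ids = sorted(contexts)
    if not ids:
        return None
    start = ids.index(current) + 1 if current in contexts else 0
    for i in range(len(ids)):
        cid = ids[(start + i) % len(ids)]
        if cid == current:
            continue
        hg_input, hg_posn = contexts[cid]
        if hg_input > hg_posn:
            return cid
    return None


def on_dataflow_stall(contexts, current):
    """Suspend and switch if another program is ready, else remain stalled."""
    nxt = next_ready(contexts, current)
    if nxt is not None:
        return ("suspend_and_switch", nxt)
    return ("stay_stalled", current)
```

With contexts 0, 1, and 2 active and only 0 and 2 ready, a stall in context 2 switches to context 0, while a stall in a lone context leaves it stalled in place.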
As described above, all system-level control is accomplished by messages. Messages can be considered system-level instructions or directives that apply to a particular system configuration. In addition, the configuration itself, including program and data memory initialization—and the system response to events within the configuration—can be set by a special form of messages called initialization messages.
With respect to the shared function-memory 1410, there are several types of messages that can be used, which can be seen in
Turning to
Turning to
Turning to
Turning to
The SFM controller is the physical memory controller that implements at least some of the functionality of the shared function-memory 1410. It can be used in the context of a higher-level instantiation which includes OCP interfaces and memory instances. An example of a supported port mapping is: PORT 0: Node 1; PORT 1: Node 2; PORT 2: Global Data; PORT 3: read; and PORT 4: write. The signal interface is generic so the memory controller functionality can be maximized. OCP interfacing will usually limit the bandwidth of the memory controller function by having all data to be available at the same time. The interface supports partial accesses for flexibility, however. For SIMD operations all data can be returned at the same time, but the flexibility exists at the interface regardless. The context of the SFM controller is shown in
The SFM controller is capable of high-bandwidth read memory accesses. Each port access is capable of (for example) 16 unique memory accesses. Port addresses are structured for SIMD operations. However, other sources can utilize the ports as desired. For SIMD operations, it is expected that all addresses are used and are returned at the same time. There is flexibility to support partial port addresses and partial data (i.e., fewer than the 16 addresses used for any port) for non-SIMD operations. Each port can support reads, writes, or a histogram increment function. Reads return a 16-bit element for each address (generally, a pixel location). Writes store (for example) a 16-bit element directly into memory for each address. Histogram functions increment the value of the data at the memory location by the data on the write bus. If there are multiple histogram accesses to a given memory location, all of the increments are applied for that access. In order to support the high bandwidth requirement for servicing multiple ports with minimized conflicts, the memories are banked every (for example) 32 bytes. This corresponds to the data size of all of the addresses provided by a port.
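By way of a non-limiting illustration, the three port operations and the 32-byte banking described above can be sketched in Python. The number of banks, the interleaved bank mapping, and the wrap-around of histogram counts at 16 bits are assumptions made for this sketch and are not stated in the text; the names bank_of and port_access are hypothetical.

```python
BANK_BYTES = 32   # banking granularity given as an example in the text
NUM_BANKS = 8     # assumption: the bank count is not stated in the text
MAX_ADDRS = 16    # up to 16 addresses per port access


def bank_of(byte_addr):
    # Assumed interleaved mapping: consecutive 32-byte regions map to
    # consecutive banks, wrapping after NUM_BANKS.
    return (byte_addr // BANK_BYTES) % NUM_BANKS


def port_access(mem, op, addrs, data=None):
    """Apply one port access to `mem`, a dict of element address -> 16-bit value.

    op is "read", "write", or "hist" (histogram increment)."""
    if len(addrs) > MAX_ADDRS:
        raise ValueError("a port carries at most 16 addresses")
    if op == "read":
        return [mem.get(a, 0) for a in addrs]
    for a, d in zip(addrs, data):
        if op == "write":
            mem[a] = d & 0xFFFF
        elif op == "hist":
            # Repeated addresses accumulate; 16-bit wrap-around is an assumption.
            mem[a] = (mem.get(a, 0) + d) & 0xFFFF
        else:
            raise ValueError("unknown operation: " + op)
    return None


mem = {}
port_access(mem, "write", [0, 1], [5, 7])
port_access(mem, "hist", [1, 1], [1, 1])   # two increments to the same location
print(port_access(mem, "read", [0, 1]))
```

Sixteen 16-bit elements occupy 32 bytes, which is why one full port access corresponds to one bank-sized region in this sketch.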
Address formats can be seen in
The SFM controller also performs read arbitration. Read arbitration can occur in three stages: (1) arbitration between port addresses; (2) arbitration between all resulting addresses; and (3) temporal arbitration. The first stage of arbitration allows for SIMD elements across nodes to compete for the same memory resource. For example, SIMD0 for Node1 arbitrates directly with SIMD0 of Node2. This allows for SIMDs in a Node to be serviced together. However, if the accesses from Node1 and Node2 do not conflict, they are both serviced. The second stage of arbitration resolves conflicts on a single bank between the individual address elements. The arbitration priority is based on element number. For example, PORT0 has highest priority, then PORT1, etc. The secondary priority is given to ADDR0, then ADDR1 and so forth. The third stage of arbitration is temporal ordering. All of the priorities are resolved for each cycle before advancing to the next cycle. It is not possible for a higher priority port to starve other ports. An example of read arbitration for the first two sequences is shown in
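By way of a non-limiting illustration, the second-stage priority order (port number first, then address element) and the temporal stage can be sketched in Python. The sketch makes the simplifying assumption that each bank grants one request per cycle; it does not model the first-stage arbitration between nodes or the servicing of several requests that fall within the same bank region. The function name arbitrate and the tuple layout are hypothetical.

```python
def arbitrate(requests):
    """Order bank requests into cycles.

    requests: list of (port, addr_index, bank) tuples issued together.
    Returns a list of cycles, each a list of the requests granted in it."""
    # Tuple ordering gives PORT0 before PORT1, then ADDR0 before ADDR1.
    pending = sorted(requests)
    cycles = []
    while pending:
        granted, deferred, busy_banks = [], [], set()
        for req in pending:
            bank = req[2]
            if bank in busy_banks:
                deferred.append(req)      # bank conflict: retry next cycle
            else:
                busy_banks.add(bank)
                granted.append(req)
        cycles.append(granted)
        # Temporal stage: every deferred request is resolved before any
        # newer request would be considered, so no port is starved.
        pending = deferred
    return cycles


reqs = [(1, 0, 3), (0, 1, 3), (0, 0, 3), (1, 1, 5)]
for n, cycle in enumerate(arbitrate(reqs)):
    print(n, cycle)
```

In this example three requests target bank 3 and are granted over three cycles in priority order, while the non-conflicting request to bank 5 is granted in the first cycle.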
Although ports and element addresses compete for arbitration, it is still possible to service requests if the resulting addresses are within the region of a memory bank. In
The SFM controller also performs write arbitration. The arbitration for writes can also occur in three stages: (1) arbitration between ports; (2) arbitration between all resulting addresses; and (3) temporal arbitration. Unlike reads, writes are arbitrated in the first stage immediately, according to port. The memory system is usually capable of managing a single write from any port at any time. The second stage of arbitration resolves conflicts on a single bank between the individual address elements. The arbitration priority is based on element number. For example, PORT0 has highest priority, then PORT1, etc. The secondary priority is given to ADDR0, then ADDR1 and so forth. The third stage of arbitration is temporal ordering. All of the priorities are resolved for each cycle before advancing to the next cycle. It is not usually possible for a higher priority port to starve other ports. The write arbitration for the first two sequences is shown in
Histogram accesses utilize the write arbitration flow, as shown
The SFM pipeline allows for back to back reads and writes as shown in the example of
In Table 37 below, an example of a partial list of IO pins or signals for the SFM controller can be seen. For these examples, inputs are prefixed by “gl_”, outputs are prefixed by “finem_”, synchronous signals are suffixed by “_{t/n}r” (t=active high, n=active low, r=rising edge), and asynchronous signals are suffixed by “_{t/n}a” (t=active high, n=active low, a=asynchronous). Busses which reflect multiple ports identify the lower-numbered port in the lower bits. For example, PORT0 is identified by req_tr(0) and addr_tr(255:0), and PORT1 is identified by req_tr(1) and addr_tr(511:256).
For reset timing, there is a single asynchronous reset, gl_reset_na. All outputs are typically inactive during reset. An example of a port interface read with no conflicts can be seen in
For benchmarking timing, the following signals can be used to indicate event causes in the memory controller: event_bank_stall_tr (bank conflict); event_source_stall_tr (source conflict); event_hist_stall_tr (histogram updating conflict); and event_stream_tr (data has been streamed from another access). For each cycle in which the system undergoes a stall, the event should be active for one cycle. At least one of the stall signals should be active whenever the port interface is not acknowledging input requests. Informational events (like event_stream) should be active whenever the rd_data_valid signal is active. An example of memory interface timing can also be seen in
For power saving in SFM 1410, the memories are implemented using PM signals that chain all memory banks, allowing the PRCM (described below) to power a particular memory on or off. The power chain allows proper power on and power off.
Turning to
Typically, data interconnect 814 crossbar uses “wormhole” routing, based on the Segment_ID and Node_ID of the destination. The source's Segment_ID and Node_ID are also transmitted, along with the Set_Valid signal if applicable. Nodes (i.e., 808-1) within a partition (i.e., 1402-1) can communicate locally without using the data interconnect 814 (as described above). Within a partition, one node can be using the global interconnect at any given time. This simplifies the interconnect within the partition, and the partition's connection to the data interconnect 814. Data can be transferred concurrently within partitions, or between partitions, if there are no resource conflicts on different interconnects.
The messaging interconnect can also be considered a crossbar (of sorts), but designed for lower cost than the data interconnect 814, since message throughput is much lower than data throughput. In a partition, there is separate message input interconnect and output interconnect. All nodes within a partition share this interconnect, so one node can use either interconnect at a time, although two nodes can be sending and receiving at the same time. It is also possible for the same node to be sending and receiving messages at the same time. Essentially, the message interconnect can logically be considered an N×N crossbar, implemented by the control node 1406.
Generally, the interconnects are hierarchical and, to achieve high utilization, it is important that mcmd_accept and sdata_accept are not used to back off the interconnect. Instead, they should normally be high to accept accesses into a buffer at the destination, and the buffer can then update a target (for example, load/store data memory in a node) when that target is free. If the buffer becomes full, the SIMD is stalled and the buffer is drained to make room for incoming data. This way, interconnect data does not have higher priority over SIMD accesses and does not usually stall the SIMD: the buffer attempts to find an idle cycle, and it stalls the SIMD only when it becomes full. Most of the time, an empty cycle should be available to update the target. Note that the buffer should be easily configurable from 1 entry to multiple entries so that performance studies can be used to choose the depth, although area should be kept in mind because these buffers are flop-based. In a partition there is a (for example) 16×512 global IO buffer to absorb pixel data, which is part of the micro-architecture. The node wrappers have a 2-entry buffer for messages to tolerate the SIMD being busy for one cycle, and most of the control messages are typically 1 to 2 pieces of data. The longer messages are typically initialization messages, during which time the SIMDs are idle anyway.
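By way of a non-limiting illustration, the buffering policy described above (always accept, drain in idle cycles, stall the SIMD only when full) can be sketched in Python. The class name IngressBuffer, the one-entry-per-cycle drain rate, and the cycle-level interface are assumptions made for this sketch.

```python
from collections import deque


class IngressBuffer:
    """Destination-side buffer that keeps the accept signal effectively high."""

    def __init__(self, depth=2):
        self.depth = depth          # configurable from 1 entry upward
        self.entries = deque()

    def cycle(self, incoming=None, simd_uses_memory=False):
        """Advance one cycle; return True if the SIMD had to be stalled."""
        stalled = False
        if self.entries and not simd_uses_memory:
            # Idle cycle: drain one entry into the target memory for free.
            self.entries.popleft()
        elif incoming is not None and len(self.entries) == self.depth:
            # Buffer full and new data arriving: stall the SIMD to drain.
            self.entries.popleft()
            stalled = True
        if incoming is not None:
            self.entries.append(incoming)   # the access is always accepted
        return stalled


buf = IngressBuffer(depth=2)
print(buf.cycle("a", simd_uses_memory=True))   # buffered, no stall
print(buf.cycle("b", simd_uses_memory=True))   # buffered, no stall
print(buf.cycle("c", simd_uses_memory=True))   # full: SIMD stalled once
print(buf.cycle(None, simd_uses_memory=False)) # idle cycle drains an entry
```

The depth parameter corresponds to the configurable buffer depth mentioned above, so the same sketch can be used to compare stall counts for different depths.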
In processing cluster 1400, sources and destinations negotiate through source notifications and permissions; therefore, pushes or writes will usually succeed (that is, there is usually space). There are write buffers for side contexts in the node wrappers of every node. These can become full but, here as well, if the write buffer is full and a new store arrives, space is made by stalling the SIMDs (if the SIMD is busy) so that the write buffer can update side context memory. Therefore, it can be important to make sure that these interconnects behave as though they are tied high. Of course, there could be cases where multiple sources send to the same destination, in which case there has to be enough buffering to make sure that the sources are not stalled. The destination also has to make sure that it has enough buffering to accept the data. Examples of such cases are the control node and the data interconnect. Typically, though, there is enough space in nodes and GLS unit 1408, as they both negotiate data transfers and have large global IO buffers.
For SRMD protocol, the command and data should be driven in the same cycle by the master. Data should not be driven before command. Master will probably issue command2 after it has sent the last piece of data for command1/data1. Slaves should be able to either accept command2 while the last packet of command1/data1 is still pending or slave should be able to not accept command2 while the last packet of command1/data1 is still pending.
All OCP ports should have a signal or pin called OCP_CLKEN, which is used to indicate to a master that is running at a higher frequency when to sample slave data or drive data to the slave. A master sampling data from a slave (which is running at half the master clock) is shown in
In Table 38 below, an example of a partial list of IO pins or signals for the data interconnect 814 can be seen.
In Table 39 below, an example of a partial list of IO pins or signals for the left context interconnect can be seen.
In Table 40 below, an example of a partial list of IO pins or signals for the left context interconnect can be seen.
In Table 41 below, an example of a partial list of IO pins or signals for the LUT interconnect can be seen.
In Table 42 below, an example of a partial list of IO pins or signals for the host slave port can be seen.
In Table 43 below, an example of a partial list of IO pins or signals for the OCP interconnect port can be seen.
Turning to
As part of initialization, initialization messages 9604, 9606, and 9608 are generally used to initialize instruction memories and the function-memory 7602. In particular, messages 9604 and 9606 can be used to inform nodes (i.e., 808-i) and the shared function-memory 1410 that the next transfers over the data interconnect 814 are lines of instructions, with instructions being written to consecutive locations starting at location 0 and continuing until a Set_Valid is received. Also, message 9608 can inform the shared function-memory 1410 that the next transfers over the data interconnect 814 are for function-memory 7602, with instructions being written to consecutive locations starting at location 0 and LUT entries being bank-aligned, continuing until a Set_Valid is received.
In
The configuration read thread is responsible for initializing the instruction memories 5403, 7618, and 1401-1 to 1401-R as well as the LUT of the shared function-memory 1410. The information regarding which destinations are initialized is contained in the data stored in the system memory 1416.
Turning to
In
In
In
The GLS unit 1408 can perform the following example steps once the first configuration structure is accessed. The encoding type is examined to determine what type of init message is stored. If the encoding type is 3, then LUT initialization is requested. If the encoding type is 2, then IMEM initialization is requested. If the encoding type is 4, then control node action list initialization is requested. If the Cn bit=0, then the number of lines to initialize is the NUMBER_OF_LINES or NUMBER_OF_BLOCKS given in the message structure. If Cn=1, then the current NUMBER_OF_LINES or NUMBER_OF_BLOCKS is added to the previous value. The destination SEG_ID and NODE_ID are also latched. The system address and start offset values are latched into the request queue RAM along with internal offset parameters. A tag is assigned for reading data from the assigned SYSTEM_BASE_ADDRESS, and the read commences. The node instruction memory init message is sent to the latched destination if the destination is not GLS unit 1408 or control node 1406. Data is also written to the proper destination either directly (for the GLS instruction memory case), via the egress message processor (for a control node action list update), or via interconnect 814. If the destination is instruction memory 5403, then 40 bits (for example) are extracted at a time from the data latched in the buffer 6024 and written into the instruction memory 5403 as shown in
The rest of the information sent on the interconnect 814 is similar to SFM IMEM INIT (for each burst, the DMEM_OFFSET is incremented by the burst size, even for the partition instruction memory init case, as instruction memory data is 252 bits for a partition). As shown in
The egress processor will accumulate (for example) up to 32 beats worth of data and send it to the control node 1406 via the messaging bus 1420. Until the count given by the number of instructions/number of blocks/number of entries field in the entry list is reached, the GLS unit 1408 keeps sending initialization data to the destination. Once the max count is reached, the GLS unit 1408 moves on to process the next entry. When the GLS unit 1408 encounters 3′b110 in the encoding field for an entry, the GLS unit 1408 terminates the initialization routine. The allocated tag ID for reading the config word is also released to the general pool of free tag IDs. An example of this can be seen in
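By way of a non-limiting illustration, the walk over the configuration entry list can be sketched in Python. The dictionary keys, the function name walk_config, and the reading of "previous" as the count of the immediately preceding entry are assumptions made for this sketch; only the encoding values 2, 3, 4, and 3′b110 are taken from the text.

```python
# Encoding values taken from the text; the string labels are hypothetical.
ENCODINGS = {2: "IMEM_INIT", 3: "LUT_INIT", 4: "ACTION_LIST_INIT"}
END_OF_LIST = 0b110


def walk_config(entries):
    """Return the initialization plan for a list of configuration entries.

    Each entry is a dict with the hypothetical keys 'encoding', 'cn',
    'count', 'seg_id', and 'node_id'."""
    plan = []
    previous = 0
    for entry in entries:
        if entry["encoding"] == END_OF_LIST:
            break                      # terminate the initialization routine
        kind = ENCODINGS.get(entry["encoding"])
        if kind is None:
            raise ValueError("unsupported encoding type")
        # Cn=0: use the count as given; Cn=1: add it to the previous count
        # (assumed here to be the count of the preceding entry).
        count = entry["count"] + previous if entry["cn"] else entry["count"]
        previous = count
        plan.append((kind, entry["seg_id"], entry["node_id"], count))
    return plan


entries = [
    {"encoding": 2, "cn": 0, "count": 10, "seg_id": 1, "node_id": 2},
    {"encoding": 3, "cn": 1, "count": 5, "seg_id": 1, "node_id": 3},
    {"encoding": 0b110},
    {"encoding": 4, "cn": 0, "count": 7, "seg_id": 0, "node_id": 0},
]
print(walk_config(entries))
```

The entry after the terminating encoding is never processed, which mirrors the termination of the initialization routine described above.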
Transfers are generally performed by write and read threads. There can be up to 16 active thread transfers, using their own sets of sources and destinations, with independent addressing. Each GLS unit 1408 thread, executed by GLS processor 5402, can implement an independent read or write thread, forming various types of processing flows: read thread; write thread; or read and write thread with intermediate processing. In the dataflow protocol, the fields used to identify nodes and contexts instead identify the GLS unit 1408 (Segment_ID, Node_ID), with the context-number field identifying the thread number instead.
Turning to
In this example of
In
Turning now to
Turning now to
Turning to
As with a write thread, the dataflow protocol can perform ordering and flow control, so that all destinations can be ordered regardless of type (some can be write threads), and because it can take several cycles to process the multi-cast list and send data to all destinations. The source node 808-i does not distinguish the multi-cast thread from other types of output, and in fact can have multiple outputs including node-to-node, write, and multi-cast threads. There are two cases for source data. In the first, a multi-cast read thread (a), the GLS unit 1408 can perform a system read and place the data into a buffer. This operation is generally the same as for a read thread. In the second, a multi-cast write thread (b), the source node outputs data which identifies the GLS unit 1408 node and the thread number of the multi-cast thread. This operation is generally the same as for a write thread. Once source data is received by the GLS unit 1408 buffer, the GLS unit 1408 accesses the thread's multi-cast list and transmits the data to all destinations, which can be any combination of nodes or write threads on the GLS unit 1408. A multi-cast read thread allows a single system access to provide input data to multiple programs, and a multi-cast write thread can be used when a node program writes a single set of output variables that have multiple destinations (for example, the destination node input is also copied to memory). In contrast, multiple node outputs, specified by the node context descriptors, are used when the program outputs multiple sets of variables, each to a unique destination context (program).
Resource allocation in processing cluster 1400 is analogous in many ways to resource allocation in an optimizing compiler, particularly a compiler that schedules operations on a VLIW or superscalar microarchitecture. However, instead of allocating registers, functional units, and memory to generate an instruction sequence to optimize performance (or memory usage, and so forth), system programming tool 718 can allocate “processors” and memory to generate binaries and messages to optimize the use of resources based on throughput. The objective is to use a minimum, or near-minimum, allocation to accomplish the objectives. This permits scalability: that is, area and power are adjusted to performance requirements, nearly linearly. For example, doubling throughput doubles the resources employed.
A characteristic of processing cluster 1400 that simplifies resource allocation is that nodes of a specific type, such as node 808-i, are generally uniform. Also, nodes can be designed to support a very fine grain of resource allocation—for example in the definition of contexts, context descriptors, and fine-grained multi-tasking. Because of this general uniformity, generality, and flexibility, relatively simple allocation strategies can be employed to achieve optimum, or nearly optimum, allocations.
Resource allocation, in general, involves a circularity between the available resources, the allocation of those resources, data dependencies, and the resulting performance of the chosen allocation. Typically, these circularities are broken by ignoring certain constraints in early stages, generating an optimistic (and usually unrealistic) allocation as a starting point. From that starting point the allocation is refined by introducing successive constraints, and iterating on the allocation until a solution is found (or the allocation fails, meaning that there are not sufficient resources for the specified use-case).
In system programming tool 718, the initial assumptions are that there is an unlimited number of nodes of the required type (i.e. customization), each with unlimited instruction and data memory. From this starting point, allocation determines a bounded number of nodes and amount of memory. This bounded allocation assumes that each algorithm module executes in a dedicated set of compute nodes (i.e., node 808-i). That is, no two modules share the same hardware, and a criterion is that sufficient nodes are allocated that each module satisfies the throughput requirement. This allocation most likely uses more than the available number of nodes; it is, typically, the starting point for node allocation. However, the allocation fails if the number of nodes used by a single module, to achieve the specified throughput, is more than the available number of nodes (this should not be common).
Once the initial allocation is set, optimization can be performed. The system programming tool 718 iterates on the allocation, attempting to find shared allocations of nodes and contexts. The result of this allocation is either an organization of nodes and contexts that meets the desired requirements, or a failure to find a suitable allocation.
Initial node allocation begins by allocating each module a number of nodes of the required type that meets or exceeds the throughput requirements, based on number of cycles taken to execute that module (this information is provided by the compiler, based on compiling the module as a stand-alone program). Desired throughput requirements can be expressed in terms of cycles taken per pixel output: for example, in processing cluster 1400, if the output rate is 200 Mpixel/second, and a node (i.e., 808-i) operates at 400 MHz, the desired throughput requirement should be 2 cycles/pixel (400 Mcycles/sec÷200 Mpixel/sec). To meet the desired throughput requirements, the node allocation should output a number of pixels, in parallel, so that no more than 2 cycles are taken in the module for every pixel output. For example, a program that takes 58 cycles should generate at least 29 output pixels to maintain a rate of 2 cycles/pixel.
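By way of a non-limiting illustration, the throughput arithmetic above can be sketched in Python. The function names and the node_width parameter (the number of pixels one node produces per program iteration) are hypothetical and introduced only for this sketch.

```python
import math


def cycles_per_pixel(clock_hz, pixel_rate_hz):
    # Cycle budget per output pixel, e.g. 400 MHz / 200 Mpixel/s = 2.
    return clock_hz / pixel_rate_hz


def min_parallel_pixels(module_cycles, cpp):
    # Pixels that must be produced in parallel so that the module
    # spends no more than `cpp` cycles per output pixel.
    return math.ceil(module_cycles / cpp)


def min_nodes(module_cycles, cpp, node_width):
    # node_width is a hypothetical parameter: pixels output per node
    # for each iteration of the module.
    return math.ceil(min_parallel_pixels(module_cycles, cpp) / node_width)


cpp = cycles_per_pixel(400_000_000, 200_000_000)
print(cpp)                           # cycle budget per pixel
print(min_parallel_pixels(58, cpp))  # pixels needed for a 58-cycle program
print(min_nodes(58, cpp, 32))        # nodes needed at an assumed width of 32
```

With the figures from the text, a 58-cycle program needs at least 29 parallel output pixels to sustain 2 cycles/pixel; the node count then follows from whatever output width a node provides.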
Turning to
The second step in node allocation is to analyze the relationships between individual modules, determined from the use-case graph 1100 of
Each path segment (i.e., 10802 and 10804) generally has its own natural throughput, based on the resource allocation of that segment, and this is likely different than the throughput of the system interfaces 1405 and of the hardware accelerators 1418. For this reason, the allocation is considered separately for each path segment, to decompose the analysis. As discussed later, resources can be shared between modules (i.e., 1004) on different path segments, but the allocation of resources is based on independent analysis of each segment—otherwise there can be an intractable interaction between the path segments, owing to their different natural throughput rates and resulting allocation tradeoffs.
Additionally, each path in a segment (i.e., 10802 and 10804) can have several paths through the programmable blocks, as shown in
Turning to
In
Critical_Path_Cycles + Critical_Slack_Cycles ≤ (Node_Width * Min_Nodes − Lost_Pixels ÷ #Contexts) * (Cycles/Pixel)
The term “Lost_Pixels” generally captures the reduction in output width allocated to the path segment. It is based on a parameter given by the user which specifies the end-to-end reduction, because system programming tool 718 cannot estimate it from the programmable components alone. This parameter can be an estimate, rather than being precise, at a potential loss in allocation efficiency. The number of contexts that can be used to meet this condition is evaluated for all path segments individually, and the path segment with the largest number of contexts sets the number for all path segments. To properly share data within contexts, the number of contexts should be the same for all programmable components.
Turning to
In
As with most allocation problems, optimizing resources generally involves tradeoffs. Typically, the longest programs use the minimum number of parallel nodes, but these nodes can be shared by one or more other modules. Slack cycles generally indicate the degree to which this sharing can occur, and sharing increases path cycles because of time-multiplexing between modules. However, sharing can be beneficial when path cycles are not increased within a path segment (i.e., 10802) to the point where the “critical path” (which may change due to sharing) exceeds the original length of the “critical path” plus the critical slack cycles. If this does occur, the question becomes whether the net benefit gained by sharing (reducing nodes) is greater or less than the additional node(s) that should be added to compensate for the increase in the critical path length beyond the original slack time available for it.
Sharing nodes also interacts with the memory allocation. In the initial allocation, the Critical_Cycles parameter can determine the choice of the number of contexts. Reducing the number of slack cycles by sharing nodes can increase the number of contexts. Furthermore, modules that share nodes can increase the number of contexts on those shared nodes, which increases the amount of data memory (i.e., SIMD data memory 4306-1) allocated to those nodes. If the total allocated data memory exceeds that available, one or more nodes should be added to provide sufficient data memory, and these additional nodes can change the optimum node allocation from a performance standpoint.
Resource allocation can be further complicated by combining source code for modules within a path segment into a larger program in a more efficient manner so as to affect sharing of resources. The larger program can be optimized by the compiler 706 to reduce cycles and data memory by scheduling resource usage over a larger program scope. Resources then can be allocated using these larger (but more efficient) programs.
There are a number of approaches that can be used for optimization, including exhaustive searches and constraints already imposed by throughput.
Turning to
At this point, the updated slack cycles can be used to refine the context allocation. The original context allocation was based on each program having its own node allocation, and the term “Critical_Slack_Cycles” that was used in context allocation has a different value after allocation due to node sharing. Furthermore, node sharing can complicate the determination of a value for Critical_Slack_Cycles, based on whether or not the sharing modules are from the same path segment. Modules (i.e., module 1014) that do not share nodes generally use the original slack time. Modules that share nodes, but which are in different path segments, can independently use the slack cycles for those nodes (e.g., modules 1022/1006 and 1008/1016 in this example). Slack cycles can be based on the largest number of cycles within the node allocation. For example, module 1010 uses one node (of the two allocated for modules 1004/1010), but the slack cycles are determined by the sum of the cycles of modules 1004 and 1010. For context allocation, “Critical_Cycles” (the sum of cycles and slack cycles of nodes in the “critical path”) can be affected in two ways. First, the term can be reduced because a module in the “critical path” is sharing a node with a module that is not in the “critical path.” For example, the path from module 1004 to module 1022 can include critical cycles reduced by the cycle count of module 1006. Second, if two or more modules in a “critical path” share a node allocation, the slack cycles of this allocation can be counted once in the critical path. For example, the path from module 1004 to module 1010 counts the slack cycles for modules 1004 and 1008 but not module 1010, and, furthermore, the slack cycles of module 1008 are reduced by sharing with module 1016. 
The resulting values for Critical_Cycles in each path segment (i.e., 10802 and 10804) can be used in the context allocation equation from the set of equations for basic context allocation 11204 to determine the number of contexts required by the shared node allocation.
In
Deadlock conditions, however, should not occur in processing cluster 1400 because execution is data-driven. Programs or modules are generally scheduled to execute if input data is valid. So, in this example, module 1010 should become ready at half the rate of module 1004, as desired. However, to efficiently use computing resources, module 1010 should execute in an inter-node organization, so that each iteration of module 1010 executes on nodes 808-(j+3) and 808-(j+4) at about the same time, enabling module 1010 to compute twice as many pixels at half the rate. This allocation for modules 1010 and 1004 can be seen in
In section 4 above, autogeneration of hosted application code by the system programming tool 718 is described, but the ultimate target of the code is the processing cluster 1400. The structure of the code targeted for the processing cluster 1400 depends on resource allocation decisions, as discussed above in section 15. At one extreme, all application source code is compiled as a single program and executed on a single compute node; at the other extreme, code is compiled as separate programs executing on a parallel allocation of multiple nodes, up to the total number of nodes available in the system 1400. Compiling sources for programmable nodes is generally not sufficient to complete the application. Node execution is data-driven, but nodes (i.e., 808-i) by themselves have no mechanism for data and control flow. This is performed instead by mapping the iterator 602 and read/write threads 904/908 to sources compiled for the GLS processor 5402, which is discussed at least in part in section 5 above. Following this, the system programming tool 718 can generate a configuration structure which is used by a configuration read thread 9402 to load programs and LUT images and to perform initialization of all other hardware for the use-case.
Autogeneration for programmable nodes (i.e., 808-i) in the environment for processing cluster 1400 generally follows a process similar to that used to generate source code for the hosted environment (section 4 above). This code can also follow the same serial execution model, but the concept of objects is eliminated from node programs. Instead, sources are compiled more like conventional, standalone C programs, and mimic the object model by executing in dedicated node contexts. Global and local variables can appear as public and private variables because these variables are not generally accessible by other programs except being written by known sources of input data, to variables that are read-only at the destinations. The iterator 602, read thread 904, and write thread 908 do preserve the concept of objects. This abstracts the interfaces to the node programs—node programs in contexts are treated as objects even though they execute in distributed nodes with separate program counters.
Turning to
To complete code generation for a use-case, the system programming tool 718 creates the source code for the iterator 602, read thread 904, and write thread 908. Turning back to
Unlike most node programs, source code for the GLS processor 5402 is free-form C++ code, including procedure calls and objects. The overhead in cycle count is acceptable because iterations typically result in the movement of a very large amount of data relative to the number of cycles spent in the iteration. For example, for a read thread that moves interleaved Bayer data into three node contexts, this data is represented as four lines of 64 pixels each in each context. Across the three contexts, this is twelve 64-pixel lines in total, or 768 pixels. Assuming that all threads (i.e., 16) are active and presenting roughly equivalent execution demand (this is very rarely the case), and a throughput of one pixel per cycle (a likely upper limit), each iteration of a thread can use 48 cycles. Setting up the Bayer transfer generally can require on the order of six instructions, so there are 42 cycles remaining in this extreme case for loop overhead, state maintenance, and so forth.
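By way of a non-limiting illustration, the cycle budget in the example above can be reproduced in Python. The function name thread_cycle_budget and its parameter names are hypothetical, and the sketch assumes that each setup instruction takes one cycle.

```python
def thread_cycle_budget(contexts, lines_per_context, pixels_per_line,
                        active_threads, pixels_per_cycle, setup_instructions):
    """Return (pixels moved, cycles per iteration, cycles left after setup)."""
    pixels = contexts * lines_per_context * pixels_per_line
    # Cycles needed to move the data, shared evenly among active threads.
    cycles_per_iteration = pixels / pixels_per_cycle / active_threads
    # Assumes one cycle per setup instruction.
    remaining = cycles_per_iteration - setup_instructions
    return pixels, cycles_per_iteration, remaining


pixels, per_iter, remaining = thread_cycle_budget(
    contexts=3, lines_per_context=4, pixels_per_line=64,
    active_threads=16, pixels_per_cycle=1, setup_instructions=6)
print(pixels, per_iter, remaining)
```

With fewer active threads, or a throughput below one pixel per cycle, the per-iteration budget grows, which is why the 16-thread case is described above as the extreme.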
Since the read thread 904 is logically embedded within the iterator 602, they can be merged into one program source (independent iterators and read threads can be combined in any functionally-correct combination). The system programming tool 718 generates this source code in a manner very similar to the hosted program (as described in sections 4 and 5 above), traversing the use-case diagram as a graph, and emitting source text strings within sections of a code template 11902, shown in the example of
The read thread, as written by the programmer, contains the code that moves data from the system to algorithm objects. There is, typically, no provision for parameter initialization, managing circular buffer state, and so forth. Instead, this code is added to the source code by system programming tool 718 based on the use-case. Variable declarations are added to the read thread, with output identifiers, so that the thread has access to the scalar input variables of all node programs. Code is also added to initialize these programs and to manage their circular-buffer state.
Also, as shown in
This programming model currently has a limitation caused by potential name conflict of input variables. These conflicts can occur when the iterator/read thread provides data to more than one program from the same algorithm class. Each of these programs can use the same name for input variables, so these cannot be independently declared in the source program. Consequently, these programs would generally require a unique read thread (though possibly within another instance of the same iterator). The best workaround for this problem is to use script tools to re-name these input variables. This approach could also relax the requirement to embed input variables within structures. If these improvements are implemented, existing code would remain compatible.
In
The initialization section 11912 can include the initialization code for each programmable node. The included files are typically named by the corresponding components in the use-case diagram. Programmable nodes are generally initialized in this way: iterators, read threads, and write threads are passed parameters, similar to function calls, to control their behaviour. Programmable nodes usually do not support a procedure-call interface; instead, initialization can be accomplished by writing into the respective object's scalar input data structure, similar to other input data. In the hosted environment, the initialization functions are typically called, whereas, in the environment for the processing cluster 1400, initialization functions are expanded in-line. The writes to input parameters, in the generated code, generally result in output instructions identifying the destination and an offset of the parameter in the destination context. These are scalar variables and, unlike vector variables, are copied into each processor data memory 4328 context associated with a horizontal group. These contexts are typically “discovered” using the dataflow protocol.
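By way of illustration only, the replication of a scalar parameter into every context of a horizontal group can be modelled in software as follows. The types Context and HorizontalGroup and the function output_scalar are hypothetical names introduced for this sketch; in the processing cluster 1400 the copying is performed by output instructions and the dataflow protocol rather than by a loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sketch {

// Assumed model: one context's scalar data memory as an array of words.
using Context = std::vector<std::int32_t>;
// Assumed model: all contexts that belong to one horizontal group.
using HorizontalGroup = std::vector<Context>;

// Models a scalar output identified by a destination and an offset:
// the same value is written at the same offset in every context.
inline void output_scalar(HorizontalGroup& destination, std::size_t offset,
                          std::int32_t value) {
    for (Context& ctx : destination) {
        assert(offset < ctx.size());
        ctx[offset] = value;
    }
}

}  // namespace sketch
```

In this model, a single write of an initialization parameter leaves every context in the group holding an identical copy, which is the property that distinguishes scalar inputs from vector inputs in the description above.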
The composite_read function 11914, which is the inner loop of the iterator, can also be created by code autogeneration. The name generally reflects that the function performs both implicit dataflow (in this case, to maintain circular-buffer state) and explicit dataflow as implemented by the read-thread object. The hosted program can call each algorithm instance in an order that satisfies data dependencies, but in the environment for processing cluster 1400, calling the read thread alone is usually sufficient to accomplish the same logical functionality. However, in the environment for processing cluster 1400, execution can be highly parallel, implemented by data-driven execution as determined by node allocation, context organization, destination descriptors, and the operation of the dataflow protocol between source and destination contexts. The composite_read function 11914 can be passed the same parameters as the traverse function in the hosted environment, for example: 1) an index (idx) indicating the vertical scan line for the iteration, 2) the height of the frame division, 3) the number of circular buffers in the use-case (circ_no), and 4) the array of circular-buffer addressing state for the use-case, c_s. Before calling the read thread, composite_read function 11914 can call the function _set_circ for each element in the c_s array, passing the height and scan-line number. The _set_circ function can update the values of all Circ variables in all contexts, based on this information, and can also update the state of array entries for the next iteration. Circ variables are generally written using pointers to the extern scalar input structures. This results, in the generated code, in output instructions identifying the destination and an offset of the Circ variable in the destination context. As with scalar parameters, these variables can be copied into each context associated with a horizontal group, based on the dataflow protocol.
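By way of illustration only, the control flow of the composite_read function 11914 can be sketched as follows. The fields of CircState, the wrapping rule inside set_circ (a stand-in for _set_circ), and the use of a callable for the read thread's run function are assumptions made for this sketch; the disclosure does not specify the contents of a c_s entry.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

namespace sketch {

// Hypothetical circular-buffer addressing state for one c_s entry.
struct CircState {
    int depth;      // lines held by the circular buffer (assumed field)
    int position;   // slot that receives the current scan line (assumed)
    bool boundary;  // first or last scan line of the frame division (assumed)
};

// Stand-in for _set_circ: derive the state for scan line idx.
inline void set_circ(CircState& s, int height, int idx) {
    assert(s.depth > 0 && idx >= 0 && idx < height);
    s.position = idx % s.depth;
    s.boundary = (idx == 0) || (idx == height - 1);
}

// Stand-in for composite_read: implicit dataflow (circular-buffer state)
// first, then explicit dataflow (the read thread's run function).
inline void composite_read(int idx, int height, int circ_no,
                           std::vector<CircState>& c_s,
                           const std::function<void(int)>& read_thread_run) {
    assert(circ_no >= 0 && static_cast<std::size_t>(circ_no) <= c_s.size());
    for (int i = 0; i < circ_no; ++i) {
        set_circ(c_s[static_cast<std::size_t>(i)], height, idx);
    }
    read_thread_run(idx);  // moves system data for scan line idx
}

}  // namespace sketch
```

An iterator would call composite_read once per scan line of the frame division, so that the circular-buffer state is always updated before the corresponding line of data is moved.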
After the circular-buffer addressing state has been set, composite_read function 11914 can call the execution member-function (run) of the read thread. The read thread is passed a parameter, the index into the current scan-line, to perform addressing. The output identifier associated with the read-thread output selects a destination, and the call to the read thread results in system data being moved to all destination contexts—a different portion of the scan line into every context. This behaviour is distinguished from the output of scalar data by virtue of the data types being moved, for example: Frame objects in the system into Line objects in the programmable nodes. The destination contexts are provided data in scan-line order by virtue of the dataflow protocol. Additionally, dataflow pointers can be seen in section 11918.
The iterator and read thread are implemented in a function 11926 (here called ISP_iter_read) intended to be called by a host processor that interfaces to the processing cluster 1400. The call generally executes the use-case on a unit of input data, such as a frame division for imaging, with system input and output. The ISP_iter_read function 11926 is not usually called directly. Instead, the host maps an API call into a Schedule Read Thread message and passes the required parameters in the message, structured as they would be passed by a conventional procedure call. The function prototype can be used in the API implementation to indicate which parameters are passed, and their types. When the GLS unit 1408 receives the scheduling message, it copies these parameters into the thread's context, starting at location 0, and this effectively serves as the top of a stack containing the parameters for the host “call” (though this is not the same stack used by the GLS processor 5402 code for internal procedure calls). This function 11926 can pass, for example, four parameters: the first two indicate the height and width of the frame, and the second two contain a pointer to the memory buffer containing Bayer data (in this case) and a pixel offset into the buffer (FD_offset). The height, width, and buffer pointer can be used by the read thread as for the hosted case. However, an additional parameter can be used in the environment of processing cluster 1400, where the width of the context allocation in hardware is generally less than the width of the frame, and frame-division processing is used. Frame-division processing generally can require fetching overlapped regions of the input data to generate contiguous output data. The amount of overlap is algorithm-dependent, and the FD_offset parameter is used by the read thread to determine the amount of overlap by specifying an offset with respect to the buffer pointer.
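By way of illustration only, the copying of the host parameters into the thread's context, starting at location 0, can be modelled as follows. The structure IterReadParams, the 32-bit field widths, the context size, and the function names are assumptions made for this sketch; the sketch also assumes that the buffer pointer is expressed in units of pixels, so that FD_offset can be added to it directly.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace sketch {

// Hypothetical parameter block carried by a Schedule Read Thread message,
// in the order described for ISP_iter_read (field widths are assumed).
struct IterReadParams {
    std::uint32_t height;     // frame height
    std::uint32_t width;      // frame width
    std::uint32_t buffer;     // address of the Bayer data buffer
    std::uint32_t fd_offset;  // pixel offset into the buffer (FD_offset)
};

constexpr std::size_t kContextWords = 16;  // assumed context size
using ThreadContext = std::array<std::uint32_t, kContextWords>;

// Models the copy of the message parameters into the thread's context,
// starting at location 0.
inline void schedule_read_thread(ThreadContext& ctx, const IterReadParams& p) {
    ctx[0] = p.height;
    ctx[1] = p.width;
    ctx[2] = p.buffer;
    ctx[3] = p.fd_offset;
}

// Models the read thread locating the first pixel of its frame division:
// the buffer pointer plus FD_offset.
inline std::uint32_t division_start(const ThreadContext& ctx) {
    return ctx[2] + ctx[3];
}

}  // namespace sketch
```

Scheduling the same function again with a different FD_offset would select the next, overlapping frame division of the same buffer without changing the other parameters.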
Also shown in
The initialization section 11920 can set the circ_s array, containing state for maintaining the values of Circ variables. In this case, pointers to the external variables are used, instead of pointers to public variables as in the hosted environment. This section 11920 then calls each initialization function, which in the environment for processing cluster 1400 results in this code being expanded in-line.
The code in
Section 11924 can deallocate the read thread and iterator object instances and free the memory associated with them. When the function ends, it remains resident and can be called again by the host, for example to operate on another frame division within the frame. Deleting objects prevents memory leaks from one invocation to the next.
Turning to
To summarize the generation of programs for the environment for processing cluster 1400, these are the operations that are usually performed by the system programming tool 718:
Turning to
The Power Clock Reset Management Subsystem (PRCM) generally controls the clock and reset distribution in the processing cluster 1400. Typically, the processing cluster 1400 has several power domains: the Control Node Power Domain (CTRL_PD); the Global LS Power Domain (GLS_PD); the Shared Functional Memory Power Domain (SFM_PD); and Partition 0 Power Domain (Part0_PD) to Partition x Power Domain (Partx_PD). The internal interconnects (Interconnect 814, Right and Left Context Interconnects 5702 and 4704) are part of the GLS power domain because, any time there is traffic between the different nodes, the GLS unit 4708 is involved, and thus the interconnects and the GLS unit 4708 should be on. The messaging infrastructure below shows the logical paths the PRCM should follow to each power domain. Clocking for the processing cluster 1400 can be seen in
An example of the IO signals or pins for the PRCM can be seen in Table 45 below.
The PRCM typically resides inside the Control Node 1406 and is responsible for providing clocks to all the power domains except its own. The Control Node 1406 receives the SoC-level clock (gl_clk_in) and wakes up based on the wakeup instructions from a SoC-level Master module. The Control Node 1406 initiates the internal PRCM on wakeup, following which the PRCM starts clock and reset generation and propagation to the processing cluster 1400 and its submodules. The following are example features of the PRCM:
The Event Translator (ET) is designed to accept events and translate them to processing cluster 1400 messages, as well as to accept processing cluster 1400 messages and translate them to events. Within processing cluster 1400, the ET interfaces directly with the Control Node 1406. When an event is received from a hardware (HW) accelerator outside of the processing cluster 1400 boundary, that event is translated to a TPIC message and sent to the Control Node 1406 over an OCP interface. In the case where the Control Node 1406 sends a message to the ET over a separate OCP interface, the event information is extracted from that message and sent out of the processing cluster 1400 boundary to the HW accelerator. In addition to the OCP interfaces between the ET and the Control Node 1406, there is a signal sent by the ET to the Control Node 1406 when an event overflow or underflow occurs, indicating which event bit caused it. This signal indicates that a particular event in the ET has overflowed or underflowed and that processing cluster 1400 is issuing an interrupt. The ET does not generate the external interrupt itself; once the Control Node 1406 receives the information about an overflow or underflow, the Control Node 1406 is responsible for generating the external interrupt.
Turning to
Having thus described the present disclosure by reference to certain of its preferred embodiments, it is noted that the embodiments disclosed are illustrative rather than limiting in nature and that a wide range of variations, modifications, changes, and substitutions are contemplated in the foregoing disclosure and, in some instances, some features of the present disclosure may be employed without a corresponding use of the other features. Accordingly, it is appropriate that the appended claims be construed broadly and in a manner consistent with the scope of the disclosure.
This application claims priority to: U.S. Patent Provisional Application Ser. No. 61/415,210, entitled “PROGRAMMABLE IMAGE CLUSTER (PIC),” filed on Nov. 18, 2010; and U.S. Patent Provisional Application Ser. No. 61/415,205, entitled “SYSTEM PROGRAMMING TOOL AND COMPILER,” filed on Nov. 18, 2010. Each application is hereby incorporated by reference for all purposes.
Number | Date | Country
---|---|---
61415210 | Nov 2010 | US
61415205 | Nov 2010 | US