Computer systems typically employ a von Neumann architecture. This generally includes a central processing unit (CPU) and attached memory, usually with some form of input/output to allow useful operations. The CPU executes a set of machine instructions that check for various data conditions sequentially, as determined by the programming of the CPU. The input stream is processed sequentially, according to the CPU program.
In contrast, it is possible to implement a ‘semantic’ processing architecture, in which the processor or processors respond directly to the semantics of an input stream. The execution of instructions is selected by the input stream itself. This allows for fast and efficient processing, especially when processing packets of data.
Many devices communicate, either over networks or backplanes, by broadcast or point-to-point, using bundles of data called packets. Packets have a header, which provides information about the nature of the data inside the packet, and the data itself, usually carried in a segment of the packet referred to as the payload. Semantic processing, where the semantics of the header drive the processing of the payload as necessary, fits especially well with packet processing.
In some packet processors, there may be several processing engines. Efficient dispatching of the tasks to these engines can further increase the speed and efficiency advantages of semantic processors.
One embodiment is a dispatcher module that operates inside a semantic processor having multiple semantic processing units. The dispatcher includes one or more queues to store task requests. It also includes a task arbiter to select a current task for assignment from the task requests, and a unit arbiter to identify an available processing unit and assign the current task to it, such that the current task is not assigned to a previously-assigned processing unit.
Another embodiment is a semantic processor system having a dispatcher, a parser, an ingress buffer and an egress buffer.
Another embodiment is a method to assign tasks among several processing units.
Embodiments of the invention may be best understood by reading the disclosure with reference to the drawings, wherein:
FIGS. 4a-4b show embodiments of status circuitry.
When a packet is received at the buffer 100, the buffer notifies the parser 200 that a packet has been received by placing the packet in the queue 202. The parser also has a queue 204 that is linked to the CPU 600; the CPU initializes the parser through the queue 204. The parser then parses the packet header and determines what tasks need to be accomplished for the packet. The parser associates with the task a program counter, referred to here as a semantic processing unit (SPU) entry point (SEP), identifying the location of the instructions to be executed by whichever SPU is assigned the task, and transfers it to the dispatcher 300. The dispatcher determines which SPU is going to be assigned the task, as will be discussed in more detail later.
The dispatcher 300 broadcasts information to the SPU cluster, comprised of SPUs such as processing unit P0 402 through processing unit Pn 404, where n is any desired number of processors, via three busses such as 406: disp_allspu_res_vld; disp_allspu_res_spuid; and disp_allspu_res_isa. Each SPU in the cluster sends SPU(n)_IDLE status to the dispatcher to avoid a new task assignment while working on a previously assigned, uncompleted task.
The SPUs may employ a semantic code table (S-CODE) 408 to acquire the necessary instructions that they are to execute. The SPUs may already contain the instructions needed, or they may request them from the S-CODE table 408. A request is transmitted from the processing unit to the queues such as 410, where each SPU has a corresponding queue. The CPU has its own queue 412 through which it initializes the S-CODE RAM with SPU instructions. The S-CODE RAM broadcasts the requested instruction stream along with the SPU ID of the requesting SPU. Each processor decodes the ‘addressee’ of the broadcast message such that the requesting processing unit receives its requested code.
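The broadcast-with-addressee scheme described above can be sketched in software as follows. This is an illustrative behavioral model, not the actual hardware: the S-CODE store broadcasts a requested instruction stream together with the requesting SPU's ID, and every SPU decodes the ID so that only the requester latches the code. All class and function names here are hypothetical.

```python
# Behavioral sketch (not the patented hardware) of the S-CODE broadcast:
# the instruction stream is broadcast with the requester's SPU ID, and
# each SPU decodes the 'addressee' to decide whether to keep the payload.

class SPU:
    def __init__(self, spu_id):
        self.spu_id = spu_id
        self.code = None     # instructions latched from the broadcast

    def on_broadcast(self, addressee_id, instructions):
        # Keep the payload only if the broadcast is addressed to this SPU.
        if addressee_id == self.spu_id:
            self.code = instructions

def broadcast(spus, addressee_id, instructions):
    # The S-CODE RAM sends the same message to every SPU in the cluster.
    for spu in spus:
        spu.on_broadcast(addressee_id, instructions)

spus = [SPU(i) for i in range(4)]
broadcast(spus, 2, ["LOAD", "PARSE", "EXIT"])
```

Only `spus[2]` latches the code; the other SPUs ignore the message.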
The assignment of the tasks determined by the parser 200 to the SPUs is handled by the dispatcher 300, which examines the contents of several pending task queues 302, 902, 904, 906. Queue 902 stores requests from the parser to the SPUs. Queue 904 stores requests between SPUs. An SPU assigned a particular task may need to spawn further tasks to be executed by other SPUs or the CPU, and those requests may be stored in queue 906. Messages for the SPU-to-SPU and SPU-to-CPU queues are written by arbiter 510, which may also provide access to the cryptographic key and next hop routing database 910 within the array machine context data (AMCD) memory 912.
The dispatcher 300 monitors these queues and the status of the SPU array 400 to determine if tasks need to be assigned and to which processor. An embodiment of a dispatcher is shown in
Each subqueue has a connection to the task arbiter 306. While there are two connections shown, and the logic gate 304 is shown external to the task arbiter, there may be one connection and the logic gate 304 may be included in the task arbiter. For ease of discussion, however, the gate is shown separately. The task arbiter receives the task contents from the queues and determines their assignment. The logic gate 304 receives the task requests and provides an output signal indicating that there is a pending task request. The pending task request is gated with the SPU_AVAILABLE signal from the gate 310 to produce the signal DISP_ALLSPU_RES_VLD.
The unit allocation arbiter 308 receives that signal and determines which SPU should be assigned the task, based upon the availability signals SPU(n)_IDLE from the various SPUs, and outputs this as DISP_ALLSPU_RES_SPUID. This will be discussed in more detail below.
In addition to the valid response signal, the dispatcher sends out a signal identifying the ‘place’ in the instructions the SPU is to execute the necessary operations. This is referred to as the SPU Entry Point (SEP). When the task is from the parser to the SPU, for example, the dispatcher provides the initial SEP address (ISA) as a program counter as well as an offset into the ingress buffer to allow the SPU to access the data upon which the operation is to be performed. The offset may be provided as a byte address offset into the ingress buffer. When the task is from the CPU to the SPU, for example, the program counter and the arguments may be provided to the SPU. When the task is from one SPU to another SPU, the dispatcher may pass the arguments and the program counter as well. This information is provided as the signal DISP_ALLSPU_RES_ISA.
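The three broadcast fields named above can be grouped, purely for illustration, into a single software record. The field grouping and names below are an assumption for readability; only the signal names (DISP_ALLSPU_RES_VLD, DISP_ALLSPU_RES_SPUID, DISP_ALLSPU_RES_ISA) come from the description.

```python
# Illustrative grouping of the dispatcher's broadcast fields; the actual
# hardware carries these on three separate busses.

from dataclasses import dataclass
from typing import Optional

@dataclass
class DispatchMessage:
    valid: bool                          # DISP_ALLSPU_RES_VLD
    spu_id: int                          # DISP_ALLSPU_RES_SPUID
    sep: int                             # DISP_ALLSPU_RES_ISA: SEP program counter
    buffer_offset: Optional[int] = None  # ingress-buffer byte offset, when the
                                         # task originates from the parser

msg = DispatchMessage(valid=True, spu_id=3, sep=0x40, buffer_offset=128)
```

For CPU-to-SPU or SPU-to-SPU tasks, arguments would travel in place of the ingress-buffer offset, per the description above.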
One embodiment of circuitry to queue and detect unassigned pending tasks is shown in
In
The write pointer and the read pointer may be one bit wider than necessary. For example, if the addresses are 3 bits, the pointers will be 4 bits wide. If the pointers are identical, there are no pending tasks; if they differ, there is a pending task. The extra bit is used to detect a wrap-around condition when the queue is full, allowing the system to stall on writing requests until the number of pending entries has decreased. For example, if the 3 address bits are the same, such as ‘000’, but the fourth bit is different, the queue is full and has wrapped around back to 000. If the read pointer and write pointer differ in any manner, the task queue has a pending task.
The comparison is done by a pair of comparators 926a and 926b, with the output of comparator 926b indicating whether or not the queue is full and the output of comparator 926a indicating whether or not the queue is empty. The queue empty signal is inverted by inverter 930 and combined with a write enable signal to assert the write enable signal used by the queue; thus, if the queue is not empty, the write enable signal is asserted.
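The extra-bit pointer comparison described above can be sketched as follows. This is a behavioral model assuming a queue with 3-bit addresses (eight entries) and 4-bit pointers, matching the example in the text; the function names are illustrative.

```python
# Behavioral sketch of full/empty detection using pointers one bit wider
# than the queue address, as described above. 3 address bits, 4-bit pointers.

DEPTH_BITS = 3
ADDR_MASK = (1 << DEPTH_BITS) - 1        # low 3 bits: the queue address

def queue_empty(rd_ptr, wr_ptr):
    # Pointers identical, including the wrap bit: no pending tasks.
    return rd_ptr == wr_ptr

def queue_full(rd_ptr, wr_ptr):
    # Same address bits but different wrap bits: the write pointer has
    # wrapped around back to the read address, so the queue is full.
    same_addr = (rd_ptr & ADDR_MASK) == (wr_ptr & ADDR_MASK)
    diff_wrap = (rd_ptr >> DEPTH_BITS) != (wr_ptr >> DEPTH_BITS)
    return same_addr and diff_wrap
```

For instance, read pointer 0b0000 and write pointer 0b1000 share the address ‘000’ but differ in the fourth bit, so the queue is full; any other difference between the pointers simply indicates a pending task.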
In addition to monitoring task requests from the queues so the task arbiter knows that at least one request is waiting, the dispatcher 300 of
The output of the dispatcher for a task is provided to the decoder 406 of SPU 402.
The use of SPU 402 for this example is merely for discussion purposes. Any processing unit may have a state machine using this type of logic circuitry that allows it to determine if there is a task being assigned to it. The dispatcher provides a signal that indicates that there is a task to be assigned, DISP_ALLSPU_RES_VLD, and the address or other identifier of the SPU, DISP_ALLSPU_RES_SPUID. The identifier is sent to a decoder 406 and the decoder determines if the identifier matches that of the processing element 402. The output of the decoder is provided to a logic gate 420.
If either PWR_RESET is detected or the SPU pipeline detects that it has executed an ‘EXIT’ instruction, gate 420 will set SPU(n)_IDLE at flip-flop 412 to inform the dispatch hardware that this SPU is now a candidate to execute pending task requests. If the address is for the current SPU, and the dispatcher response is valid, as determined by AND gate 410, the flip-flop outputs that the SPU is not idle. It must be noted that this is just one possible combination of gates and storage to indicate the state of the SPU. Any combination of logic and storage may be used to provide the state of the SPU to the dispatcher and will be within the scope of the claims.
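The next-state behavior of the idle flip-flop just described can be sketched as a single function. This is a behavioral model, not the claimed circuit; the function and parameter names are illustrative.

```python
# Behavioral sketch of the SPU(n)_IDLE state element: reset or an 'EXIT'
# sets idle; a valid dispatcher response addressed to this SPU clears it;
# otherwise the flip-flop holds its previous state.

def next_idle(idle, pwr_reset, exit_executed, resp_valid, addr_match):
    if pwr_reset or exit_executed:
        return True          # SPU becomes a candidate for pending tasks
    if resp_valid and addr_match:
        return False         # task accepted: SPU is no longer idle
    return idle              # hold current state
```

A valid response addressed to a different SPU leaves this SPU's idle state unchanged, which is why the decoder output is ANDed with the valid signal before reaching the flip-flop.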
As tasks are processed from the subqueues of
FIG. 4b shows an embodiment of circuitry that causes the SPU to load an instruction. The signal DISP_TO_ME, or a signal depending upon the DISP_TO_ME signal, is used as a multiplexer enable signal for multiplexer 430 to select the new initial SEP address (ISA) result from
At 500, the dispatcher monitors the task queues to determine if there is a task request asserted from one of the queues. If there is a task pending, a queue containing a task is selected at 502, and that selection is remembered at 504. The selected task queue is ‘remembered’ to assist in the selection of the next task queue, and the result is fed back to 502.
During this process of task selection, the identification of an available SPU is performed at 510. If the SPU_IDLE signal is asserted for at least one SPU, that SPU is available to be assigned a task. If there is no SPU with SPU_IDLE asserted, the process waits until an SPU is ready.
If one or more tasks are pending and one or more SPUs are available, the dispatcher will select the next task at 512 and assign it to the next selected SPU, advance the read pointer for the selected task queue at 522, and remove the selected SPU from subsequent task assignment at 514 until the currently assigned task is completed. The advanced pointer is then used as described above to determine if there is a pending task request.
Returning to 502 and 512, if there is more than one SPU available, the highest priority SPU is assigned. In a round-robin task/SPU arbiter, the currently available SPU that was most recently allocated a task that has completed will be the lowest priority SPU to be allocated a task. For example, assume there were three SPUs, P0, P1 and P2. If P0 is assigned a task, then P1 and P2 would have higher priority for the next task.
Upon assignment, the processor assigned becomes the ‘previously assigned’ processor. When P1 is assigned a task, the priority becomes P2, P0 and then P1. Some tasks will take longer than others to complete, so the assignments may not be in order after some period of time. Based upon the assignment at 512, the last SPU assigned to a task, once finished with the task, is the lowest priority SPU to receive a new task assignment. The process then returns to monitoring the task queues and SPU availability.
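The round-robin priority rotation in the P0/P1/P2 example above can be sketched in software. This is a minimal behavioral model under the stated scheme, with illustrative names: the most recently assigned SPU rotates to the lowest priority.

```python
# Behavioral sketch of the round-robin task/SPU arbiter described above:
# the highest-priority idle SPU is assigned, and the 'previously assigned'
# SPU drops to lowest priority for the next assignment.

class RoundRobinArbiter:
    def __init__(self, n):
        self.priority = list(range(n))   # head of list = highest priority

    def assign(self, idle):
        # 'idle' is a list of SPU_IDLE flags indexed by SPU number.
        for spu in self.priority:
            if idle[spu]:
                # Rotate the assigned SPU to lowest priority.
                self.priority.remove(spu)
                self.priority.append(spu)
                return spu
        return None                      # no SPU available; wait

arb = RoundRobinArbiter(3)
arb.assign([True, True, True])           # P0 assigned; priority: P1, P2, P0
arb.assign([True, True, True])           # P1 assigned; priority: P2, P0, P1
```

After the two assignments, the priority order is P2, P0, P1, matching the example in the text.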
In this manner, the dispatcher can monitor both the incoming task requests and the status of the processing resources to allow efficient dispatch of tasks for processing. Implementing this in hardware structures and signals substantially reduces the number of cycles it takes the dispatcher to determine which processors are available and whether or not tasks are waiting. In one comparison, monitoring tasks and status using software may take 100 instruction cycles, while the above implementation took only one instruction cycle. This increase in efficiency further capitalizes on the advantages of the semantic processing architecture and methodology.
The embodiments provide a novel hardware dispatch mechanism to rapidly and efficiently assign pending tasks to a pool of available packet processors. The hardware evenly distributes pending task requests across the pool of available processors to reduce packet processing latency, maximize bandwidth and concurrency, and equalize the distribution of power and heat. The dispatch mechanism can scale to serve large numbers of pending task requests and large numbers of processing units. The mechanism described here performs one process dispatch per cycle; the approach can easily be extended to higher rates of process dispatch.
Thus, although there has been described to this point a particular embodiment of a method and apparatus to perform hardware dispatch in a semantic processor, it is not intended that such specific references be considered as limitations upon the scope of this invention except in-so-far as set forth in the following claims.
Copending U.S. patent application Ser. No. 10/351,030, titled “Reconfigurable Semantic Processor,” filed by Somsubhra Sikdar on Jan. 24, 2003, is incorporated herein by reference.