The field of invention relates generally to networking equipment and, more specifically but not exclusively relates to techniques for arbitrating and scheduling thread usage in multi-thread compute engines.
Network devices, such as switches and routers, are designed to forward network traffic, in the form of packets, at high line rates. One of the most important considerations for handling network traffic is packet throughput. To accomplish this, special-purpose processors known as network processors have been developed to efficiently process very large numbers of packets per second. In order to process a packet, the network processor (and/or network equipment employing the network processor) needs to extract data from the packet header indicating the destination of the packet, class of service, etc., store the payload data in memory, perform packet classification and queuing operations, determine the next hop for the packet, select an appropriate network port via which to forward the packet, etc. These operations are generally referred to as “packet processing” or “packet forwarding” operations.
Modern network processors perform packet processing using multiple multi-threaded processing elements (e.g., processing cores) (referred to as microengines or compute engines in network processors manufactured by Intel® Corporation, Santa Clara, Calif.), wherein each thread performs a specific task or set of tasks in a pipelined architecture. During packet processing, numerous accesses are performed to move data between various shared resources coupled to and/or provided by a network processor. For example, network processors commonly store packet metadata and the like in static random access memory (SRAM) stores, while storing packets (or packet payload data) in dynamic random access memory (DRAM)-based stores. In addition, a network processor may be coupled to cryptographic processors, hash units, general-purpose processors, and expansion buses, such as the PCI (peripheral component interconnect) and PCI Express bus.
In general, the various packet-processing compute engines of a network processor, as well as other optional processing elements, will function as embedded specific-purpose processors. In contrast to conventional general-purpose processors, the compute engines do not employ an operating system to host applications, but rather directly execute “application” code using a reduced instruction set. For example, the microengines in Intel's® IXP2xxx family of network processors are 32-bit RISC processing cores that employ an instruction set including conventional RISC (reduced instruction set computer) instructions with additional features specifically tailored for network processing.
The foregoing aspects and many of the attendant advantages of this invention will become more readily appreciated as the same becomes better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein like reference numerals refer to like parts throughout the various views unless otherwise specified:
a is a state diagram used in conjunction with a non-pre-emptive round-robin arbitration scheme for activating threads;
b is a state diagram used in conjunction with a pre-emptive round-robin arbitration scheme for activating threads;
Embodiments of methods and apparatus for arbitrating and scheduling thread usage in multi-threaded compute engines are described herein. In the following description, numerous specific details are set forth to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Modern network processors, such as Intel's® IXP2xxx family of network processors, employ multiple multi-threaded processing cores (e.g., microengines) to facilitate line-rate packet processing operations. Some of the operations on packets are well-defined, with minimal interface to other functions or strict order implementation. Examples include update-of-packet-state information, such as the current address of packet data in a DRAM buffer for sequential segments of a packet, updating linked-list pointers while enqueuing/dequeuing for transmit, and policing or marking packets of a connection flow. In these cases the operations can be performed within the predefined-cycle stage budget. In contrast, difficulties may arise in keeping operations on successive packets in strict order and at the same time achieving cycle budget across many stages. A block of code performing this type of functionality is called a context pipe stage.
In a context pipeline, different functions are performed on different microengines (MEs) as time progresses, and the packet context is passed between the functions or MEs, as shown in
Under a context pipeline, each thread in an ME is assigned a packet, and each thread performs the same function but on different packets. As packets arrive, they are assigned to the ME threads in strict order. For example, eight threads are typically assigned to an Intel IXP2800® ME context pipe stage. Each of the eight packets assigned to the eight threads must complete its first pipe stage within the arrival rate of all eight packets. Under the nomenclature illustrated in
Under a functional pipeline, the context remains with an ME while different functions are performed on the packet as time progresses. The ME execution time is divided into n pipe stages, and each pipe stage performs a different function. As with the context pipeline, packets are assigned to the ME threads in strict order. There is little benefit to dividing a single ME execution time into functional pipe stages. The real benefit comes from having more than one ME execute the same functional pipeline in parallel.
A block diagram corresponding to one embodiment of a microengine architecture 200 is shown in
Architecture 200 supports n hardware contexts. For example, in one embodiment n=8, while in other embodiments n=16 or n=4. Each hardware context has its own register set, program counter (PC), condition codes, and context-specific local control and status registers (CSRs) 222. Unlike the software-based contexts common to modern multi-threaded operating systems, which share a single set of registers among multiple threads using software-based context swapping, providing a copy of the context parameters per context (thread) eliminates the need to move context-specific information to or from shared memory and registers to perform a context swap. Fast context swapping allows a thread to do computation while other threads wait for input/output (IO) resources (typically external memory accesses) to complete or for a signal from another thread or hardware unit.
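The benefit of per-context register copies can be illustrated with a short software model. The following Python sketch is purely illustrative (the class and field names are assumptions, not part of any embodiment): because each context keeps its own PC and register file, a context swap merely changes which context is marked Active, and no register contents are copied to or from memory.

```python
from dataclasses import dataclass, field

# Hypothetical model of one hardware context: its own PC, register file,
# and state.  Nothing here is moved to shared memory on a swap.
@dataclass
class Context:
    ctx_id: int
    pc: int = 0
    registers: list = field(default_factory=lambda: [0] * 32)
    state: str = "Inactive"   # Inactive | Ready | Active | Sleep

class Engine:
    def __init__(self, n_contexts=8):
        self.contexts = [Context(i) for i in range(n_contexts)]
        self.active = None

    def swap_to(self, ctx_id):
        # Fast context swap: only the active index changes; every
        # context's PC and registers persist in their own copies.
        if self.active is not None:
            self.contexts[self.active].state = "Ready"
        self.active = ctx_id
        self.contexts[ctx_id].state = "Active"

engine = Engine(8)
engine.contexts[0].pc = 100   # context 0 has its own PC copy
engine.swap_to(0)
engine.swap_to(3)             # swap away; context 0's PC is untouched
```

In a software-swapped design, the `swap_to` step would instead have to spill and reload the shared register file, which is exactly the cost the hardware contexts avoid.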
Under the embodiment illustrated in
In order to perform efficient pipeline-based processing, there needs to be a mechanism for controlling thread execution. Although each thread has its own context, only one thread (the active thread) is executing at any point in time. This mechanism is provided by thread arbiter/scheduler 220 in microengine architecture 200.
Under respective embodiments, thread arbiter/scheduler 220 supports various thread arbitration policies, facilitated by corresponding modes. These include: 1) non-pre-emptive (cooperative) round-robin; 2) priority-based round-robin with pre-emption; 3) time division; 4) cooperative round-robin with time division; and 5) priority-based round-robin with pre-emption and time division. Arbitration policies employing aspects of combinations of these modes may also be implemented in view of the teachings disclosed herein.
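For exposition, the five modes above can be thought of as values of a mode register (e.g., mode register 1216 discussed below). The following Python enumeration is a hypothetical encoding — the names and numeric values are assumptions, not a documented register layout:

```python
from enum import Enum

# Hypothetical encoding of the thread arbitration mode register.
# The numeric values are illustrative only.
class ArbMode(Enum):
    COOP_RR = 0                            # non-pre-emptive (cooperative) round-robin
    PRIORITY_RR_PREEMPT = 1                # priority-based round-robin with pre-emption
    TIME_DIVISION = 2                      # time division
    COOP_RR_TIME_DIVISION = 3              # cooperative round-robin with time division
    PRIORITY_RR_PREEMPT_TIME_DIVISION = 4  # priority round-robin + pre-emption + time division

mode = ArbMode(3)   # decoding a stored mode value back to a policy
```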
Non-Pre-Emptive (Cooperative) Round Robin
The non-pre-emptive round-robin mode employs a conventional round-robin thread execution scheme currently employed by microengines in network processors manufactured by Intel® Corporation (e.g., IXP2xxx). Although this technique is known, it is described herein to provide a better understanding of how the cooperative round-robin with time division policy may be implemented. Under the round-robin policy, threads ready to execute are activated in a round-robin manner. However, the length of execution of a particular thread is variable: once a thread becomes active, it executes until it relinquishes control (e.g., by issuing a Context Arbitration instruction).
a shows a state diagram illustrating the context state transitions for one embodiment of the non-pre-emptive round-robin mode. Each context will be in one of four states: 1) Inactive; 2) Ready; 3) Active; and 4) Sleep. (As used below, the terms “context” and “thread” are used interchangeably.) At most, one context can be in the Active state at one time, while any number of contexts can be in any of the other states.
A context is in the Inactive state when it is not used. This can be accomplished, e.g., by having a CSR with enable bits for each Context and leaving the enable bit for an unused Context as a ‘0’.
A context is in the Active state when it is executing instructions. This Context is called the “Active Context”. The Active Context's PC is used to fetch instructions from control store 212. In non-pre-emptive round-robin mode, a context will stay in this state until it executes a special Context Arbitration instruction, which causes it to relinquish execution and go to the Sleep state. The key point is that there is no hardware interrupt or pre-emption; context swapping is completely under software control.
In the Ready state, a context is ready to execute, but is not executing because a different context is currently the Active Context. In the non-pre-emptive round-robin mode, when the current Active Context goes to the Sleep state, thread arbiter/scheduler 220 selects the next context to go to the Active state from among all the contexts in the Ready state, using round-robin selection. In one embodiment, a circular pointer scheme is employed to facilitate round-robin selection, as depicted by circular pointer 300. A context in the Active state will go to the Sleep state when it executes a Context Arbitration instruction. The Context will remain in the Sleep state until all of the external events that it is waiting upon complete, upon which it will go to the Ready state.
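The selection step just described can be sketched in a few lines of Python. This is an illustrative model only (the function name and state encoding are assumptions); in the hardware it is performed by thread arbiter/scheduler 220:

```python
# Cooperative round-robin selection: starting one past the context that
# just went to Sleep, the circular pointer scans for the next context in
# the Ready state.  Nothing pre-empts the Active context in this mode.
def next_active(states, pointer):
    """states: per-context state strings; pointer: index of the context
    that was last Active.  Returns the index of the next Active context,
    or None if no context is Ready."""
    n = len(states)
    for step in range(1, n + 1):
        candidate = (pointer + step) % n
        if states[candidate] == "Ready":
            return candidate
    return None

# Context 2 just executed a Context Arbitration instruction and went to
# Sleep; contexts 0, 4, and 6 are Ready; context 3 is unused (Inactive).
states = ["Ready", "Sleep", "Sleep", "Inactive", "Ready",
          "Sleep", "Ready", "Sleep"]
chosen = next_active(states, 2)   # scans 3 (Inactive), then 4 (Ready)
```

Note that Inactive and Sleep contexts are simply skipped; the arbiter only ever chooses from the Ready set.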
A timeline diagram illustrating thread activity corresponding to an exemplary sequence of thread events on a microengine employing eight threads (contexts) using the non-pre-emptive round-robin arbitration mode is shown in
Priority-Based Round-Robin with Pre-Emption
b shows a state diagram illustrating the context state transitions for one embodiment of the priority-based round-robin with pre-emption mode. Overall, this arbitration mode employs the same states as the non-pre-emptive round-robin mode shown in
Under the priority aspect, each context in Ready state will be arbitrated using one of two or more priority levels. In the exemplary embodiment illustrated in
Under one embodiment of the priority-based round-robin with pre-emption mode, a thread context having a higher priority level may pre-empt execution of a context having a lower priority level. For example, suppose thread 9 is the current Active thread, as shown in
After a higher-priority thread releases control, arbitration of the threads in the Ready state begins anew. In the present example, suppose that at time t2 thread 1 explicitly releases control, and that none of threads 0 or 2-7 changes to the Ready state while thread 1 is Active. In this instance, there will be no round-robin arbitration of the high-priority threads, because none are in the Ready state. Accordingly, round-robin arbitration proceeds to the next priority level. In this case, thread 9 is selected for the Active state, since it is the thread currently pointed to by circular pointer 306 and it is in the Ready state.
Continuing at time t3 in
At time point t4, the state of thread 3 is changed to Ready. Since thread 3 is at a higher priority level than the current Active thread 11, thread 11 is pre-empted by thread 3, which becomes the Active thread. At time t5, thread 3 explicitly releases control. At this point, each of threads 0, 2, and 4 is in the Ready state. Accordingly, round-robin arbitration is performed for these threads. This entails incrementing circular pointer 304, which then points to thread 4, one of the Ready threads. In response, thread 4 becomes the Active thread.
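The pre-emption and pointer behavior described in this example can be modeled with a small Python class. This is a software sketch under stated assumptions (class structure, method names, and the two-pool configuration are all illustrative, not a hardware specification); note in particular that a pre-empted thread returns to Ready and its pool's circular pointer is not advanced, whereas an explicit release does advance the pointer:

```python
# Illustrative model of priority-based round-robin with pre-emption.
class PriorityArbiter:
    def __init__(self, pools):
        self.pools = pools                    # index 0 = highest priority
        self.pointers = [0] * len(pools)      # circular pointer per pool
        self.states = {c: "Sleep" for pool in pools for c in pool}
        self.active = None

    def level_of(self, ctx):
        return next(l for l, pool in enumerate(self.pools) if ctx in pool)

    def activate(self, ctx):
        lvl = self.level_of(ctx)
        self.pointers[lvl] = self.pools[lvl].index(ctx)  # pointer tracks selection
        self.states[ctx] = "Active"
        self.active = ctx

    def set_ready(self, ctx):
        self.states[ctx] = "Ready"
        if self.active is None:
            self.activate(ctx)
        elif self.level_of(ctx) < self.level_of(self.active):
            # Pre-emption: the displaced thread stays Ready, and its
            # pool's pointer keeps pointing at it.
            self.states[self.active] = "Ready"
            self.activate(ctx)

    def release(self):
        # Active thread executes a Context Arbitration instruction:
        # it goes to Sleep and its pool's pointer advances by one.
        lvl = self.level_of(self.active)
        self.states[self.active] = "Sleep"
        self.active = None
        self.pointers[lvl] = (self.pointers[lvl] + 1) % len(self.pools[lvl])
        for l, pool in enumerate(self.pools):          # highest pool first
            for step in range(len(pool)):
                cand = pool[(self.pointers[l] + step) % len(pool)]
                if self.states[cand] == "Ready":
                    self.activate(cand)
                    return cand
        return None

# Replay the example: thread 9 (low pool) is Active, thread 1 (high pool)
# becomes Ready and pre-empts it; when thread 1 releases, arbitration
# falls back to thread 9, which the low-pool pointer still points to.
arb = PriorityArbiter([[0, 1, 2, 3, 4, 5, 6, 7],        # high-priority pool
                       [8, 9, 10, 11, 12, 13, 14, 15]])  # low-priority pool
arb.set_ready(9)
arb.set_ready(1)    # pre-empts thread 9
arb.release()       # thread 1 sleeps; thread 9 resumes
```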
Time Division Scheduling
During time slot th0, thread 0 is the Active thread. At the completion of time slot th0 (which coincides with the start of time slot th1), control is handed off to the next thread, thread 1, if that thread is in the Ready state. Thus, thread 1 is active during time slot th1. However, at the start of time slot th2, thread 2 is not in the Ready state. Accordingly, in one embodiment no thread is active during this instance of time slot th2. At time slot th3, its corresponding thread 3 becomes the Active thread since it is in the Ready state. At time slot th4, thread 4 is not in the Ready state, so it does not become the Active thread. At time slot th5, thread 5 is Ready and thus becomes the Active thread. In the illustrated case, thread 5 explicitly releases control prior to the completion of time slot th5. Under one embodiment, the remainder of the time slot is unused by any thread. This time-slot thread activation sequence continues with activation of threads 6, 7, 0, and 1 in order.
Under one embodiment, full usage of all time slots is provided. For example, in the example of
In a similar manner, in one embodiment explicit release of control causes the time slot to advance to the next time slot. For instance, when thread 5 in
In general, a time division scheme may be implemented using one of many well-known timing mechanisms, such as clocks, counters, etc. In one embodiment, a counter 800 is used in conjunction with a time slot length register 802 and a circular pointer 804, as shown in
In response to the time slot change event, circular pointer 804 is incremented by one to point to the next thread in the sequence. This causes the Active context to change to the applicable thread, and sends a reset to the counter to start the count over again. This cycle is then repeated on an ongoing basis.
In one embodiment, the time slot is advanced when a current Active thread releases control using a Context Arbitration instruction. In response to this event, counter 800 is cleared, which produces the same result as occurs when the counter reaches 0 (if counting down) or the time slot length value (if counting up). Thus, the time slot is immediately incremented to the next time slot in the sequence.
In a similar manner, in one embodiment counter 800 is cleared when a given thread corresponding to the current time slot is not ready. Accordingly, the time slot is immediately incremented to the next time slot in the sequence.
The net result of the foregoing implementations is the following thread activation behavior. In one embodiment, threads are run in order, wherein time slots allocated for threads that are not Ready are lost. In another embodiment, the “lost” slots are filled by skipping the non-Ready threads, such that a thread is always executing during each time slot.
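Both behaviors can be captured in a short Python sketch. The function below is an illustrative model (the name and the Ready-flag representation are assumptions); the counter/length-register mechanism is abstracted into a per-slot step, with the second embodiment's slot-skipping enabled by a flag:

```python
# Time-division scheduling over one pass of time slots.  In the first
# embodiment a slot whose thread is not Ready is simply lost (None); in
# the second, the next Ready thread is borrowed so no slot goes idle.
def run_time_division(ready, n_slots, skip_not_ready=False):
    """ready: per-thread Ready flags.  Returns the thread active in each
    of n_slots consecutive time slots (None = slot unused)."""
    n = len(ready)
    schedule, pointer = [], 0
    for _ in range(n_slots):
        if ready[pointer]:
            schedule.append(pointer)
        elif skip_not_ready:
            # Second embodiment: skip forward to the next Ready thread.
            for step in range(1, n + 1):
                cand = (pointer + step) % n
                if ready[cand]:
                    schedule.append(cand)
                    pointer = cand
                    break
            else:
                schedule.append(None)   # nothing Ready at all
        else:
            schedule.append(None)       # first embodiment: slot is lost
        pointer = (pointer + 1) % n     # advance to the next time slot
    return schedule

# Threads 2 and 4 are not Ready, as in the earlier timeline example.
ready = [True, True, False, True, False, True, True, True]
lost = run_time_division(ready, 8)            # slots th2 and th4 go unused
filled = run_time_division(ready, 8, True)    # non-Ready threads skipped
```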
Cooperative Round-Robin with Time Division
In accordance with further aspects of embodiments of the invention, the characteristics of the foregoing thread arbitration/scheduling schemes may be combined to form additional thread activation policies. For example, in one embodiment a cooperative round-robin with time division mode is employed. Under this mode, a combination of features from the cooperative (non-pre-emptive) round-robin and time division schemes is implemented in a single thread arbitration/scheduling scheme.
The sequence starts with activation of thread 0 during time slot th01 (i.e., the first instance of time slot th0). Thread 0 remains Active through time slot th01, whereupon thread 1 becomes the Active thread during time slot th11. It is noted that thread 0 did not complete its task, so it is returned to the Ready state rather than the Sleep state. At the close of time slot th11, arbitration begins from among threads 2-7. For convenience, it is presumed that the applicable circular pointer used to identify the current round-robin selection points to thread 2. Since this thread is Ready, it becomes the Active thread.
At the beginning of time slot th31, thread 2 remains the Active thread, since time slots th2-th7 are allocated to threads 2-7 (via appropriate arbitration among this thread pool). This likewise is the situation at the beginning of time slots th41 and th51. During time slot th51, thread 2 explicitly releases control at time t1. This causes the next Ready thread in the round-robin sequence to be activated, as depicted by the activation of thread 3 at time t1. As before, thread 3 remains active through the remainder of time slot th51 and time slots th61 and th71.
At the start of time slot th02, execution of thread 3 is pre-empted in favor of thread 0, the thread assigned to time slot th0. As a result, thread 3 is returned to the Ready state. As before, thread 0 remains active through the end of time slot th02, followed by activation of thread 1 during time slot th12. Toward the end of this time slot, thread 1 explicitly releases control, causing its state to change to Sleep. Under the illustrated embodiment, no thread is active during the remainder of time slot th12. In another embodiment, the following time slot th22 immediately commences. In either case, thread activation is returned to the round-robin pool (threads 2-7) at the start of time slot th22. Since thread 3 did not complete, it did not return to the Sleep state, and thus the circular pointer was not incremented. As a result, since the pointer still points to thread 3 and thread 3 is Ready, thread 3 becomes the Active thread during time slot th22 and any following time slots until either thread 3 explicitly releases control or a next instance of time slot th0 is encountered.
In the illustrated example, thread 3 explicitly releases control at time t2, causing the circular pointer to advance to point to thread 4. Since this thread is Ready, it becomes the Active thread during the remainder of time slot th32, time slots th42 and th52, and the first portion of time slot th62. At time t3, thread 4 explicitly releases control, and thread 5 becomes the Active thread. Thread 5 remains active through time slot th72, at which point it is pre-empted in favor of thread 0, which becomes the new Active thread. Thread arbitration proceeds over time in a similar manner.
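The combined behavior can be approximated with a coarse Python model. This sketch makes several simplifying assumptions beyond the description above: slots th0 and th1 are dedicated to threads 0 and 1, slots th2-th7 are shared by the pool of threads 2-7, and an explicit release is treated as taking effect at the following slot boundary (the mid-slot handoff at t1 is not modeled). Only an explicit release advances the pool's circular pointer; pre-emption at slot th0 leaves it in place:

```python
# Coarse model of cooperative round-robin combined with time division.
def schedule(slot_sequence, release_after, pool):
    """slot_sequence: slot numbers 0..7 in order; release_after: set of
    positions after which the running pool thread explicitly releases
    control; pool: threads sharing slots 2..7."""
    pool = list(pool)
    ptr = 0                           # pool circular pointer
    out = []
    for i, slot in enumerate(slot_sequence):
        if slot in (0, 1):
            out.append(slot)          # dedicated slot: thread == slot number
        else:
            out.append(pool[ptr])     # shared slot: current pool selection
        if i in release_after:
            ptr = (ptr + 1) % len(pool)   # only explicit release advances
    return out

# Two passes through slots th0..th7; the pool thread releases after
# positions 5 and 11.  Thread 3, pre-empted at the second th0, resumes
# in the shared slots because the pointer never moved past it.
slots = [0, 1, 2, 3, 4, 5, 6, 7] * 2
active = schedule(slots, {5, 11}, range(2, 8))
```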
Priority-Based Round-Robin with Pre-Emption and Time Division
Timelines illustrating thread arbitration corresponding to respective embodiments of priority-based round robin with pre-emption and time division modes are shown in
The timeline example shown in
At time t1, thread 2 becomes Ready. Since it is in the high-priority pool, it pre-empts thread 10 and becomes the Active thread. It continues as the active thread until time t2, at which point it explicitly releases control and thread arbitration of the high-priority pool is initiated. In this instance the next thread (thread 3) is in the Ready state, and thus becomes the new Active thread. Thread 3 continues until the end of time slot th71, at which point it is pre-empted in favor of thread 0, which is assigned to time slot th02. Similarly, thread 1 is Active during time slot th12.
At the start of time slot th22, re-arbitration of the high-priority pool commences. Since thread 3 was pre-empted, it is still Ready and the circular pointer still points to it. Thus, thread 3 becomes the Active thread. At time t3, thread 3 explicitly releases control, and re-arbitration selects thread 4 as the next thread to activate. At time t4, thread 4 explicitly releases control. At this point, there are no other threads in the high-priority pool that are Ready. Accordingly, arbitration of the low-priority pool is commenced.
Since thread 10 was pre-empted at time t1, the low-priority pool circular pointer still points to thread 10, and it is in the Ready state. Thus, thread 10 becomes the Active thread, and remains so until time t5. At this point, thread 10 explicitly releases control, and re-arbitration of the low-priority pool selects thread 11 to activate. At time slot th03, thread 11 is pre-empted in favor of thread 0, followed by activation of thread 1.
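The selection rule underlying this combined mode reduces to a fixed precedence order, which the following Python sketch makes explicit (the function signature and data representation are assumptions for illustration): a Ready thread assigned to the current time slot wins outright; otherwise the highest-priority pool containing a Ready thread is arbitrated round-robin starting at that pool's circular pointer:

```python
# Combined selection rule: slot owner first, then pools by priority,
# round-robin within a pool starting at its circular pointer.
def select(slot_owner, slot_owner_ready, pools, pointers, ready):
    """slot_owner: thread assigned to the current slot (None if the slot
    is pooled); pools: thread lists, index 0 = highest priority;
    pointers: per-pool circular pointer indices; ready: set of Ready
    threads.  Returns the thread to activate, or None."""
    if slot_owner is not None and slot_owner_ready:
        return slot_owner
    for lvl, pool in enumerate(pools):
        n = len(pool)
        for step in range(n):
            cand = pool[(pointers[lvl] + step) % n]
            if cand in ready:
                return cand
    return None

pools = [[2, 3, 4], [10, 11]]     # high- and low-priority pools
# A pooled slot with the high pool's pointer at thread 3, which is Ready.
chosen = select(None, False, pools, [1, 0], {3, 10, 11})
```

Because the scan starts at each pool's pointer (inclusive), a pre-empted thread that is still pointed to is re-selected first, matching the behavior described for thread 3 and thread 10 above.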
The embodiment of
This situation is illustrated in
At the end of time slot th12, the thread pools are re-arbitrated. In this instance, there are no Ready threads in either the high- or medium-priority pools. Thus, arbitration of the low-priority pool is performed, and thread 10 becomes the Active thread. At time t3, thread 4 becomes Ready, causing thread 10 to be pre-empted. At time t4, thread 4 explicitly releases control, returning activation to thread 10 via the associated priority pool arbitration. At time t5, thread 10 explicitly releases control, leading to activation of thread 11. Thread 11 is then pre-empted by thread 0 in concurrence with the beginning of time slot th03.
In general, various thread-specific information is maintained in respective CSRs for each thread. As depicted by CSRs 222, a given set of CSRs for a compute engine is partitioned into respective groups of CSRs such that each thread has its own group of CSRs. In addition to conventional CSR usage (e.g., that employed by an Intel® IXP2xxx network processor), each group of CSRs includes register space for storing the thread's current state 1206, priority level 1208 (if applicable), and time slot assignment 1210. During ongoing operations, CSRs 222 are read and updated by thread arbiter/scheduler 220.
Time slot generator 1202 is used to generate time slots. In one embodiment, time slot generator 1202 employs components similar to those shown in
Round-robin/pre-emption logic 1204 includes logic for implementing the thread arbitration schemes discussed herein. It includes logic to implement pre-emption policies 1212, and provides a round-robin pointer 1214 (e.g., similar to circular pointers 300, 304, 306 and 804) for each priority level supported by thread arbiter/scheduler 220. The particular thread selection policy to implement is controlled by data stored in a mode register 1216.
In general, the logic for implementing the various block functionality and components depicted in the figures herein may be implemented via hardware, software, or a combination of hardware and software. Typically, programmed logic in hardware will be used to implement the block functionality. However, some of the block functionality may be facilitated via execution of software, as described below.
The round-robin aspects of the foregoing thread arbitration schemes refer to basic round-robin schemes for purpose of illustration. It will be understood that these are merely examples of a round-robin-based scheme that may be implemented for performing this aspect of the thread arbitration. For example, a weighted round-robin scheme may be employed using one of many well-known weighted round-robin algorithms. Other types of round-robin-based schemes may also be employed.
Network processor 1300 includes n microengines 200. In one embodiment, n=8, while in other embodiments n=16, 24, or 32. Other numbers of microengines 200 may also be used. In the illustrated embodiment, 16 microengines 200 are shown grouped into two clusters of 8 microengines, including an ME cluster 0 and an ME cluster 1.
In the illustrated embodiment, each microengine 200 executes instructions (microcode) that are stored in a local control store 1308. Included among the instructions for one or more microengines are thread arbiter/scheduler setup instructions 1310 that are employed to set up the various thread arbitration and scheduling operations described herein. In one embodiment, the thread arbiter/scheduler setup instructions are written in the form of a microcode macro.
Each of microengines 200 is connected to other network processor components via sets of bus and control lines referred to as the processor “chassis”. For clarity, these bus sets and control lines are depicted as an internal interconnect 1312. Also connected to the internal interconnect are an SRAM controller 1314, a DRAM controller 1316, a general purpose processor 1318, a media switch fabric interface 1320, a PCI (peripheral component interconnect) controller 1321, scratch memory 1322, and a hash unit 1323. Other components not shown that may be provided by network processor 1300 include, but are not limited to, encryption units, a CAP (Control Status Register Access Proxy) unit, and a performance monitor.
The SRAM controller 1314 is used to access an external SRAM store 1324 via an SRAM interface 1326. Similarly, DRAM controller 1316 is used to access an external DRAM store 1328 via a DRAM interface 1330. In one embodiment, DRAM store 1328 employs DDR (double data rate) DRAM. In other embodiments, DRAM store 1328 may employ Rambus DRAM (RDRAM) or reduced-latency DRAM (RLDRAM).
General-purpose processor 1318 may be employed for various network processor operations. In one embodiment, control plane operations are facilitated by software executing on general-purpose processor 1318, while data plane operations are primarily facilitated by instruction threads executing on microengines 200.
Media switch fabric interface 1320 is used to interface with the media switch fabric for the network element in which the line card is installed. In one embodiment, media switch fabric interface 1320 employs a System Packet Level Interface 4 Phase 2 (SPI4-2) interface 1332. In general, the actual switch fabric may be hosted by one or more separate line cards, or may be built into the chassis backplane. Both of these configurations are illustrated by switch fabric 1334.
PCI controller 1321 enables the network processor to interface with one or more PCI devices that are coupled to backplane interface 1304 via a PCI interface 1336. In one embodiment, PCI interface 1336 comprises a PCI Express interface.
During initialization, coded instructions (e.g., microcode) to facilitate various packet-processing functions and operations are loaded into control stores 1308. Thread arbiter/scheduler setup instructions 1310 are also loaded at this time. In one embodiment, the instructions are loaded from a non-volatile store 1338 hosted by line card 1302, such as a flash memory device. Other examples of non-volatile stores include read-only memories (ROMs), programmable ROMs (PROMs), and electronically erasable PROMs (EEPROMs). In one embodiment, non-volatile store 1338 is accessed by general-purpose processor 1318 via an interface 1340. In another embodiment, non-volatile store 1338 may be accessed via an interface (not shown) coupled to internal interconnect 1312.
In addition to loading the instructions from a local (to line card 1302) store, instructions may be loaded from an external source. For example, in one embodiment, the instructions are stored on a disk drive 1342 hosted by another line card (not shown) or otherwise provided by the network element in which line card 1302 is installed. In yet another embodiment, the instructions are downloaded from a remote server or the like via a network 1344 as a carrier wave.
In general, programs to implement the packet-processing functions and operations, as well as the thread arbitration/scheduler setup operations, may be stored on some form of machine-readable or machine-accessible media and executed on some form of processing element, such as a microprocessor or the like. Thus, embodiments of this invention may be used as or to support a software program executed upon some form of processing core (such as the CPU of a computer) or otherwise implemented or realized upon or within a machine-readable or machine-accessible medium. A machine-accessible medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a machine-accessible medium can include a read-only memory (ROM), a random access memory (RAM), magnetic disk storage media, optical storage media, a flash memory device, etc. In addition, a machine-accessible medium can include propagated signals such as electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.).
The above description of illustrated embodiments of the invention, including what is described in the Abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize.
These modifications can be made to the invention in light of the above detailed description. The terms used in the following claims should not be construed to limit the invention to the specific embodiments disclosed in the specification and the drawings. Rather, the scope of the invention is to be determined entirely by the following claims, which are to be construed in accordance with established doctrines of claim interpretation.