This disclosure relates generally to multi-core processing systems and more particularly to low latency data delivery within multi-core processing systems.
Multi-core processing systems often perform operations on packet data, where those operations are performed as tasks. Various cores executing a particular program perform tasks assigned to them by a task manager. A task may have time periods in which another resource, such as a hardware accelerator, is performing a portion, or job, of the task, so that the core is not actually involved with that task. In such case, the core can be used to execute another task while the job is being executed by the accelerator. When the hardware accelerator, for example, completes the job, the core eventually needs to continue the task. Thus it is important that the core be aware of the last known state of the task. This type of operation, in which context information is used to allow a core to switch tasks prior to completing a task, is generally referred to as context switching. Context switching provides the benefit of greater utilization of the cores in a given amount of time. However, one cost associated with context switching is the delay incurred when transferring between jobs, due to loading the context information of a previous task as it becomes the current task for the core. Also, there is a continuous desire for increased efficiency in performing tasks more quickly and with fewer resources.
In processing systems, such as Advanced I/O Processor (AIOP) processing systems, accelerator modules are often provided input data from a workspace (such as a memory mapped random access memory (RAM)). After completing the job for which the input data was provided, output data is written back to the workspace. When the output data is written back to the workspace, a data consumer (such as a processor core) often needs to be notified that the output data has been written to the workspace. For reduced latency and increased performance, it is important that the data consumer be notified of completion as early as possible. A plurality of techniques is known to provide the notification. These techniques include providing a separate notification interface for completion; providing sideband signals, associated with an address/data bus, which are snooped by the consumer; and providing an additional status transaction after the output data is written to the workspace. However, these techniques can add additional routing and area to the processing system.
The present invention may be better understood, and its numerous objects, features and advantages made apparent to those skilled in the art by referencing the accompanying drawings. The use of the same reference number throughout the several figures designates a like or similar element.
In certain embodiments, the present invention relates to a method comprising: assigning a task to a processing core; identifying a job within the task to be performed via an accelerator; performing and completing the job via the accelerator; generating output data including associated status information via the accelerator, the status information including an associated inactive write strobe; snooping the status information to determine when the job being performed by the accelerator is completed; and continuing executing the task using the output data associated with the status information.
In other embodiments, the invention relates to a data processing system comprising: a processing core, the processing core performing a task; an accelerator, the processing core identifying a job to be performed by the accelerator; and an interconnect circuit coupled to the processing core and the accelerator, the accelerator generating output data including associated status information, the status information including an associated inactive write strobe, the processing core snooping the status information to determine when the job being performed by the accelerator is completed, and the processing core continuing executing the task using the output data associated with the status information.
In other embodiments, the invention relates to an apparatus comprising an interconnect coupled to a processing core and an accelerator, the accelerator generating output data including associated status information, the status information including an associated inactive write strobe, the processing core snooping the status information to determine when the job being performed by the accelerator is completed, and the processing core continuing executing the task using the output data associated with the status information.
In certain embodiments, the interconnect 150 comprises an Advanced Microcontroller Bus Architecture (AMBA) Advanced eXtensible Interface (AXI) interconnect. The AMBA interconnect is an open standard, on-chip interconnect specification for the connection and management of functional blocks. The AXI portion of the standard further provides separate address/control and data phases, supports unaligned data transfers using byte strobes, supports burst-based transactions with only the start address issued, allows the issuing of multiple outstanding addresses with out-of-order responses, and allows the addition of register stages to provide timing closure.
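By way of illustration only, the role of the byte strobes may be sketched in C as follows. This is a simplified software model, not the AXI protocol itself; the names AxiWriteBeat, DATA_BYTES, and valid_byte are assumptions made for the sketch, and a 64-bit (8-byte) data bus is assumed.

    #include <stdbool.h>
    #include <stdint.h>

    #define DATA_BYTES 8            /* assume a 64-bit (8-byte) data bus */

    /* Simplified model of a single AXI write data beat. */
    typedef struct {
        uint8_t data[DATA_BYTES];   /* one byte per data lane */
        uint8_t wstrb;              /* one write strobe bit per byte lane */
        bool    last;               /* final beat of the burst */
    } AxiWriteBeat;

    /* A byte lane updates memory only when its strobe bit is set; this is
     * how AXI supports unaligned and partial data transfers. */
    bool valid_byte(const AxiWriteBeat *beat, unsigned lane)
    {
        return (beat->wstrb >> lane) & 1u;
    }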
Examples of accelerators 140 include direct memory access (DMA), table look-up (TLU), parse/classify/distribute (PCD), reassembly unit, security (SEC), work scheduler, and task termination. Included within the task manager 114 is a task status information module 130 which maintains the status of each task. For each task there is a core that is assigned to perform the task, a context ID, and a status. The status may be one of four possibilities as follows: ready, executing, inhibited, and invalid. Ready means that the task is waiting to be scheduled to a core. Executing means the core is actively working on the task. Inhibited means the task is waiting for something else, such as an accelerator, to finish its job. Invalid means the task is not a valid task.
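By way of illustration only, the per-task record maintained by the task status information module 130 might be represented as follows; the type and field names are assumptions made for the sketch rather than part of the disclosure.

    #include <stdint.h>

    /* The four task states described above. */
    typedef enum {
        TASK_READY,      /* waiting to be scheduled to a core */
        TASK_EXECUTING,  /* a core is actively working on the task */
        TASK_INHIBITED,  /* waiting on another resource, e.g., an accelerator */
        TASK_INVALID     /* not a valid task */
    } TaskStatus;

    /* Illustrative per-task entry: assigned core, context ID, and status. */
    typedef struct {
        uint8_t    core_id;     /* core assigned to perform the task */
        uint32_t   context_id;  /* context ID associated with the task */
        TaskStatus status;
    } TaskStatusEntry;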
In operation, the queue manager 110 provides a frame descriptor to the work scheduler 112, which in turn defines a plurality of tasks to be performed under the direction of the task manager 114. The task manager 114 assigns tasks to the cores 120. The cores 120 begin executing the assigned tasks, which may include a first task assigned to one core 120a and other tasks assigned to other cores 120b, 120c, 120d. The first task may include a job that is a software operation that the core 120a may perform on its own. The first task may also include a job that makes use of an accelerator such as accelerator 140a. In such case, the core 120a requests use of an accelerator from the task manager 114 and stores the context information for that stage of the task in a context storage buffer in the core 120a. If the accelerator 140a can perform the job, the task manager 114 may assign the job to the accelerator 140a. After the task manager 114 assigns the job to the accelerator 140a, the core 120a is then available for the task manager 114 to assign it a second task. While the accelerator 140a is executing the job it has been assigned, the core 120a may begin the second task or it may be inhibited as it waits for the accelerator 140a to complete the job. When the accelerator 140a finishes its assigned job, the accelerator 140a provides an output pointer and completion status information to the task manager 114. The core 120a may still be performing the second task if it was not inhibited. Another core, such as core 120b, may be available for performing tasks at this point. In such case, the task manager 114 fetches the context information from the first core 120a and assigns the first task to the other core 120b while also providing the context information to the other core 120b. With the core 120b now having the context information, the core 120b can continue with the first task. When a context is switched to a different core, the task status information 130 is updated to indicate that the other core 120b is now assigned to the first task. Also, the executing of the task by the other core 120b will be entered in the task status information 130.
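By way of illustration only, the migration sequence described above might be sketched as follows, reusing the TaskStatusEntry type from the earlier sketch. The names fetch_context, store_context, and migrate_task are assumptions made for the sketch; in practice the task manager 114 would perform these steps with dedicated logic rather than software.

    /* Opaque saved context for a task (illustrative). */
    typedef struct {
        unsigned char buf[256];
    } Context;

    extern Context fetch_context(uint8_t from_core, uint32_t context_id);
    extern void    store_context(uint8_t to_core, const Context *ctx);

    void migrate_task(TaskStatusEntry *entry, uint8_t to_core)
    {
        /* 1. Fetch the saved context from the core last assigned the task. */
        Context ctx = fetch_context(entry->core_id, entry->context_id);

        /* 2. Provide the context to the destination core. */
        store_context(to_core, &ctx);

        /* 3. Update the task status information 130: new core, now
         *    executing. The cores take no part in initiating this move. */
        entry->core_id = to_core;
        entry->status  = TASK_EXECUTING;
    }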
When the task manager 114 accesses the context information from a core to move a task from one core to another core, the task manager 114 also receives other information relative to the task that is to be continued. For example, if an accelerator 140 is to be used next in executing the task, the additional information beyond the context information that would be passed from the core to the task manager 114 includes identification of the particular type of accelerator; additional information, if any, about the attributes of the accelerator; inband information, if any, that would be passed to the accelerator as output pointers or command attributes; and input/output pointers.
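By way of illustration only, the information accompanying the context during such a handoff might be represented as follows; the field names and types are assumptions made for the sketch.

    #include <stdint.h>

    /* Hypothetical record passed from a core to the task manager 114 when
     * an accelerator 140 is to be used next in executing the task. */
    typedef struct {
        uint8_t  accel_type;   /* particular type of accelerator (DMA, TLU, ...) */
        uint32_t accel_attrs;  /* attributes of the accelerator, if any */
        uint64_t inband_info;  /* inband information for the accelerator, if any */
        void    *input_ptr;    /* input pointer for the job */
        void    *output_ptr;   /* output pointer for the job */
        /* ...plus the context information itself */
    } TaskHandoff;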
Thus it is seen that packet data is processed in the form of tasks, in which context switching is not just implemented for a single core but is able to move context information from one core to another to provide more efficient execution of the tasks. In effect, when the task manager 114 detects that the situation is right to transfer context from one core to another, the task manager 114 migrates tasks in the ready state between cores without the cores' knowledge. A core 120 may not have information about other cores or tasks in the system and in such case cannot initiate the migration. The task manager 114 accesses the context information from one core and transfers it to a second core, which then executes the task. Thus, the task manager 114 may be viewed as migrating execution of a task from one core to another, which includes transferring the context information.
When packet data is received, the IOP 142 provides the frame information to the queue manager 110 and loads the data into memory 146. The packet data is processed by the cores 120, which access the memory 146 as needed. When packet data is output by the IOP 142, the data is read from the memory 146 and formatted using frame information provided by the queue manager 110.
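By way of illustration only, the receive path might be sketched as follows; the FrameDescriptor fields and the queue_manager_enqueue interface are assumptions made for the sketch, not the actual queue manager 110 interface.

    #include <stdint.h>

    /* Hypothetical frame descriptor: where the packet data resides in
     * memory 146 and the frame information associated with it. */
    typedef struct {
        uint64_t buffer_addr;  /* address of the packet data in memory 146 */
        uint32_t length;       /* length of the packet data in bytes */
        uint32_t frame_info;   /* frame information used for formatting */
    } FrameDescriptor;

    extern void queue_manager_enqueue(const FrameDescriptor *fd);

    /* On receive, the IOP 142 loads the packet data into memory 146 and
     * provides the frame information to the queue manager 110. */
    void iop_receive(uint64_t addr, uint32_t len, uint32_t info)
    {
        FrameDescriptor fd = { addr, len, info };
        queue_manager_enqueue(&fd);
    }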
The output data further includes at least one byte of status information 240. The byte of status information 240 is identified by setting a corresponding data byte write strobe inactive (e.g., by setting the data byte strobe low). This byte of status information 240 may be considered a “zero-byte” data beat. By providing the output data associated with the accelerator with a zero-byte data beat, the data processing system 100 uses an existing capability of certain interconnect protocols (albeit in a new capacity), without adding additional area, to indicate when an accelerator completes a job. Additionally, by providing the output data associated with the accelerator with a zero-byte data beat, a near-instantaneous notification of job completion is provided to a snooping consumer when the data has been written to the workspace 121 (i.e., when the job completes execution by the accelerator 140). The amount of information that can be passed via the interconnect 150 using zero-byte data beats is not limited.
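By way of illustration only, a snooping consumer might recognize the status byte as follows, reusing the AxiWriteBeat model from the earlier sketch. The on_job_complete callback is an assumption made for the sketch, and the sketch assumes that, by convention, inactive strobes on writes to the workspace 121 are used only to carry status bytes.

    extern void on_job_complete(uint8_t status);  /* assumed consumer hook */

    /* Snoop one write data beat. A lane whose strobe bit is inactive does
     * not update the workspace, so its contents can instead carry the
     * status information 240: the "zero-byte" data beat described above. */
    void snoop_beat(const AxiWriteBeat *beat)
    {
        for (unsigned lane = 0; lane < DATA_BYTES; lane++) {
            if (!valid_byte(beat, lane)) {
                /* Inactive strobe: interpret this lane as status, giving
                 * near-instantaneous notification of job completion. */
                on_job_complete(beat->data[lane]);
            }
        }
    }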
Although the invention is described herein with reference to specific embodiments, various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below. For example, resources other than accelerators may be used by the cores in accomplishing tasks. Also for example, while the example shows adjacent data bytes with a status byte as the last byte of the output data, it will be appreciated that the bytes need not necessarily be adjacent and that the status byte need not be the last byte of the output data.
Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. Any benefits, advantages, or solutions to problems that are described herein with regard to specific embodiments are not intended to be construed as a critical, required, or essential feature or element of any or all of the claims.
Consequently, the invention is intended to be limited only by the spirit and scope of the appended claims, giving full cognizance to equivalents in all respects.