Computer architects implement accelerators to improve performance by introducing hardware specialized for performing specific application tasks. A program submits to the accelerator a task to be accelerated. The accelerator computes and returns the result for the program to consume. Communication between the program, the accelerator, and associated systems may incur overhead, depending on the particular implementation. For instance, task off-loading, completion notification, computation latency, and queue delays may reduce realized performance increases. In some cases, the performance of an accelerator is reduced by the communication overhead. For instance, if a task has a small computation granularity, the benefits of using an accelerator may be negated by the time used for off-loading the task and/or by queue delays. Multiple small computation granularity tasks may also generate a large amount of traffic in a processor core, potentially polluting caches and generating coherence traffic.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Methods, systems, and apparatuses are described for a distributed accelerator. The distributed accelerator includes a plurality of accelerator slices, including a coordinator slice and one or more subordinate slices. A command that includes instructions for performing a task is received. Sub-tasks of the task are determined to generate a set of sub-tasks. For each sub-task of the set of sub-tasks, an accelerator slice of the plurality of accelerator slices is allocated, and sub-task instructions for performing the sub-task are determined. Sub-task instructions are transmitted to the allocated accelerator slice for each sub-task. Each allocated accelerator slice is configured to generate a corresponding response indicative of the allocated accelerator slice having completed a respective sub-task.
In a further aspect, responses are received from each allocated accelerator slice and a coordinated response indicative of the responses is generated.
Further features and advantages of the embodiments, as well as the structure and operation of various embodiments, are described in detail below with reference to the accompanying drawings. It is noted that the claimed subject matter is not limited to the specific examples described herein. Such embodiments are presented herein for illustrative purposes only. Additional embodiments will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein.
The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate embodiments and, together with the description, further serve to explain the principles of the embodiments and to enable a person skilled in the pertinent art to make and use the embodiments.
Embodiments will now be described with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Additionally, the left-most digit(s) of a reference number identifies the drawing in which the reference number first appears.
The following detailed description discloses numerous example embodiments. The scope of the present patent application is not limited to the disclosed embodiments, but also encompasses combinations of the disclosed embodiments, as well as modifications to the disclosed embodiments.
References in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
In the discussion, unless otherwise stated, adjectives such as “substantially” and “about” modifying a condition or relationship characteristic of a feature or features of an embodiment of the disclosure, are understood to mean that the condition or characteristic is defined to within tolerances that are acceptable for operation of the embodiment for an application for which it is intended.
If the performance of an operation is described herein as being “based on” one or more factors, it is to be understood that the performance of the operation may be based solely on such factor(s) or may be based on such factor(s) along with one or more additional factors. Thus, as used herein, the term “based on” should be understood to be equivalent to the term “based at least on.”
The example embodiments described herein are provided for illustrative purposes and are not limiting. The examples described herein may be adapted to any type of method or system for hardware acceleration, including distributed accelerators. Further structural and operational embodiments, including modifications/alterations, will become apparent to persons skilled in the relevant art(s) from the teachings herein.
Numerous exemplary embodiments are now described. Any section/subsection headings provided herein are not intended to be limiting. Embodiments are described throughout this document, and any type of embodiment may be included under any section/subsection. Furthermore, embodiments disclosed in any section/subsection may be combined with any other embodiments described in the same section/subsection and/or a different section/subsection in any manner.
A hardware accelerator (“accelerator”) is a separate unit of hardware from a processor (e.g., a central processing unit (CPU)) that is configured to perform functions for a computer program executed by the processor upon request by the program, and optionally in parallel with operations of the program executing in the processor. Computer architects implement accelerators to improve performance by introducing such hardware specialized for performing specific application tasks. A program executed by a processor submits to the accelerator a task to be accelerated. The accelerator computes and returns the result for the program to consume. The accelerator includes function-specific hardware, allowing for faster computation speeds while being energy efficient.
Programs may interface with the accelerator synchronously or asynchronously. In synchronous operation, the program waits for the accelerator to return a result before advancing. In asynchronous operation, the program may perform other tasks after submitting the task to the accelerator. In this scenario, to provide notification of completion, the accelerator may interrupt the program, or the program may poll the accelerator. In some embodiments, both asynchronous and synchronous operations may be used.
Communication between the program, the accelerator, and associated systems may incur overhead, depending on the particular implementation. For instance, task off-loading, completion notification, computation latency, and queue delays may reduce or offset realized performance increases. In some cases, the increased performance of the accelerator is negated by the communication overhead. For instance, if a task has a small computation granularity, the benefits of using an accelerator may be negated due to the time used to off-load the task and/or by queue delays. However, multiple small computation granularity tasks may generate a large amount of traffic in a processor core, potentially polluting caches and generating coherence traffic.
Embodiments of the present disclosure present a distributed accelerator. A distributed accelerator may achieve higher degrees of parallelism and increased bandwidth for data access. A distributed accelerator includes a plurality of separate accelerator slices in a computing system that each can perform hardware acceleration on a portion of a task of a computer program. In accordance with an embodiment, each accelerator slice has an independent interface. Different accelerator slices may implement similar or different functions.
Accelerator slices may be distributed in a computing system in various ways. For instance, an accelerator slice may be integrated as a network-attached device, an off-chip input/output (IO) device, an on-chip IO device, an on-chip processing element, a specialized instruction in the instruction set architecture (ISA), and/or the like, depending on the particular implementation. In some embodiments, accelerator slices of a distributed accelerator may be integrated in different ways. For instance, in accordance with an embodiment, a distributed accelerator includes accelerator slices integrated in corresponding on-chip IO devices and on-chip processing elements. The particular configuration may be determined based on computation-to-communication ratio, number of shared users, cost, frequency of use within a program, complexity, characteristics of the computation, and/or other factors as would be understood by a person of skill in the relevant art(s) having the benefit of this disclosure. For instance, an accelerator slice implemented as an extension of an ISA may utilize CPU resources such as available memory bandwidth. In another example, an accelerator slice implemented as an off-chip device may be shared between more users. In accordance with an embodiment, a distributed accelerator may dynamically select accelerator slices based on the assigned task and accelerator integration type.
Distributed accelerators may be configured to operate in memory regions of various sizes. For instance, a distributed accelerator in accordance with an implementation may operate in a large memory region. In this context, the memory region is divided into multiple page-sized chunks aligned to page boundaries. The distributed accelerator or a processor core may determine page sizes and/or boundaries, depending on the particular implementation.
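For purposes of illustration and not limitation, the following Python sketch shows one way such page-aligned chunking might be computed. The function name, the 4 KB page size, and the tuple layout are assumptions chosen for this example, not features of any described embodiment.

    def page_chunks(base: int, length: int, page_size: int = 4096):
        """Split the region [base, base + length) into page-sized chunks.

        Interior chunk boundaries fall on multiples of page_size; the first
        and last chunks may be partial pages when the region is unaligned.
        """
        chunks = []
        addr, end = base, base + length
        while addr < end:
            # Next page boundary strictly above addr.
            boundary = (addr // page_size + 1) * page_size
            chunk_end = min(boundary, end)
            chunks.append((addr, chunk_end - addr))
            addr = chunk_end
        return chunks

    # Example: a 10,000-byte region starting 100 bytes into a page.
    for start, size in page_chunks(0x1064, 10_000):
        print(hex(start), size)

Each (start, size) pair could then be handed to a different accelerator slice, which is what allows the slices to operate on the region independently.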
Embodiments of distributed accelerators are configured to accelerate functions of a computer processing system. For instance, a distributed accelerator may be configured to process data instructions (e.g., data movement instructions, encryption instructions, synchronization instructions, CRC instructions, etc.) and accelerate functions of a computer processing system. For example, in an example implementation, a data movement function may be accelerated via a distributed accelerator with twice the bandwidth (or an even greater bandwidth, depending on the particular implementation) of the processor core of the computer processing system. A distributed accelerator distributes the data movement function across the computer processing system. For instance, accelerator slices coupled to the computer processing system interconnect, within components of the computer processing system (e.g., a cache controller), and/or coupled to IO devices of the computer processing system may be used to distribute traffic across system resources, improving data movement speed.
Distributed accelerators may be utilized in various applications. For instance, a distributed accelerator in accordance with an embodiment is shared across multiple users in a system that includes active virtual machines. Each virtual machine includes multiple active containers. In this context, tens, hundreds, thousands, or even greater numbers of users may invoke the distributed accelerator. The distributed accelerator is configured to enable sharing between users.
Distributed accelerators may be configured in various ways. For instance, FIG. 1 shows a block diagram of a processing system 100 that includes a processor core 102 and a distributed accelerator 104 communicatively coupled by a communication link 106, according to an example embodiment.
Processor core 102 is configured to execute programs, transmit commands to distributed accelerator 104, receive responses from distributed accelerator 104, and perform other tasks associated with processing system 100. For example, processor core 102 transmits a command 114 to distributed accelerator 104 via communication link 106. Command 114 includes instructions for performing a task. Distributed accelerator 104 performs the task according to the instructions and generates a response 118 that is transmitted to processor core 102. Processor core 102 receives response 118 from distributed accelerator 104 via communication link 106.
Command 114 may be a message including one or more processes to be completed, source addresses, destination address, and/or other information associated with the task. In accordance with an embodiment, processor core 102 stores the command in memory of processing system 100 (e.g., a memory device, a register, and/or the like) and notifies distributed accelerator 104 of the location of the command. In accordance with an embodiment, processor core 102 generates command 114 in response to executing an instruction. Command 114 may be a complex command including multiple sub-tasks. Command 114 may be identified with a program using a command identifier (CID). The CID may include a number associated with processor core 102, a program identifier (e.g., an address space identifier (ASID)), and/or other identifying information associated with command 114.
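As a non-limiting illustration, a command of this general shape may be modeled as follows in Python. The field names and types are hypothetical, chosen only to mirror the information described above (a CID built from a core number and an ASID, a process to perform, source addresses, and a destination address).

    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class CommandId:
        # Identifying information as described above: the issuing
        # processor core and an address space identifier (ASID).
        core: int
        asid: int

    @dataclass
    class Command:
        cid: CommandId
        op: str                                       # process to be completed
        sources: list = field(default_factory=list)   # source addresses
        destination: int = 0                          # destination address

    cmd = Command(CommandId(core=0, asid=42), "memmove",
                  sources=[0x1000], destination=0x8000)
    print(cmd)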
Distributed accelerator 104 is configured to receive commands from processor core 102, perform tasks, and generate responses. Distributed accelerator 104 may be discovered and configured via an operating system (OS) and operated in a user-mode, depending on the particular implementation. Distributed accelerator 104 includes a plurality of accelerator slices 108. Each accelerator slice of accelerator slices 108 includes an independent interface for accessing data. Accelerator slices 108 may implement similar or different functions, depending on the particular implementation. As shown in FIG. 1, accelerator slices 108 include a coordinator slice 110 and subordinate slices 112A-112N.
Coordinator slice 110 is configured to receive commands from processor core 102, divide tasks into sub-tasks, and distribute sub-tasks to accelerator slices of accelerator slices 108. For instance, coordinator slice 110 receives command 114 from processor core 102, and is configured to decode command 114 into instructions for performing a task, and determine if the task is to be completed by coordinator slice 110, one or more of subordinate slices 112A-112N, or a combination of coordinator slice 110 and one or more of subordinate slices 112A-112N. For example, in accordance with an embodiment, coordinator slice 110 divides the task associated with command 114 into a set of sub-tasks and allocates an accelerator slice of accelerator slices 108 to each sub-task. Sub-tasks may be distributed across accelerator slices 108 based on the type of the sub-task, the address range the sub-task operates on, or other criteria, as would be understood by a person of skill in the relevant art(s) having benefit of this disclosure. In accordance with an embodiment, each allocated accelerator slice may transmit results regarding execution of its respective sub-task directly to processor core 102 (e.g., as response 118). In accordance with another embodiment, coordinator slice 110 receives responses from each allocated accelerator slice and generates a coordinated response. In this context, coordinator slice 110 transmits the generated coordinated response to processor core 102 as response 118.
Accelerator slices 108 may be configured to communicate to each other in various ways. For instance, accelerator slices 108 may communicate through distributed accelerator registers, system memory, system interconnects, and/or other communication methods described herein or otherwise understood by a person of ordinary skill in the relevant art(s) having benefit of this disclosure. Accelerator slices 108 may be cache coherent, which reduces coherence traffic.
Distributed accelerator 104, as depicted in FIG. 1, may be configured in various ways to perform tasks for processor core 102, as described further below.
Processor core 102 may operate synchronously or asynchronously to distributed accelerator 104. In synchronous operation, processor core 102 waits for distributed accelerator 104 to provide response 118, indicating the task is completed. In asynchronous operation, processor core 102 may perform other tasks after transmitting command 114, while distributed accelerator 104 executes command 114.
In asynchronous operation, processor core 102 may receive response 118 in a variety of ways, depending on the particular implementation. In a first example embodiment, processor core 102 transmits a poll signal 116 to distributed accelerator 104 to check if distributed accelerator 104 has completed the task. If distributed accelerator 104 has completed the task, distributed accelerator 104 transmits response 118 to processor core 102 in response to poll signal 116. In this context, processor core 102 may transmit poll signal 116 periodically, as part of another operation of processing system 100, or at the direction of a user associated with processing system 100. In a second example embodiment, distributed accelerator 104 transmits an interrupt signal 120 to processor core 102 to interrupt the current operation of processor core 102. After processor core 102 acknowledges the interrupt, distributed accelerator 104 transmits response 118.
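For purposes of illustration, the polling pattern of the first example embodiment may be sketched in Python as follows, with a background thread standing in for distributed accelerator 104. The class, its methods, and the sleep intervals are assumptions made only to keep the sketch self-contained and runnable.

    import threading
    import time

    class ToyAccelerator:
        """Stand-in for an accelerator that completes a task asynchronously."""

        def __init__(self):
            self._done = threading.Event()
            self._response = None

        def submit(self, task):
            # Simulate off-loading: the work completes on a background thread.
            def run():
                time.sleep(0.05)                 # pretend computation latency
                self._response = f"result of {task}"
                self._done.set()
            threading.Thread(target=run, daemon=True).start()

        def poll(self):
            # Return the response if the task has completed, else None.
            return self._response if self._done.is_set() else None

    acc = ToyAccelerator()
    acc.submit("task-1")
    while (resp := acc.poll()) is None:
        time.sleep(0.01)                         # the core may do other work here
    print(resp)

An interrupt-driven variant of the second example embodiment would instead run a callback when the task completes, trading polling traffic for interrupt-handling overhead.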
Processing systems including distributed accelerators may be configured in various ways. For instance, FIG. 2 shows a block diagram of a processing system 200 that includes processor cores 202A-202N, cache controllers 204A-204N, memory controllers 206A-206N, IO controllers 208A-208N, and a distributed accelerator 210, communicatively coupled by an interconnect 224, according to an example embodiment.
Processor cores 202A-202N are further embodiments of processor core 102 of FIG. 1.
Cache controllers 204A-204N are configured to store and access copies of frequently accessed data. Cache controllers 204A-204N include respective coherence engines 220A-220N and respective caches 222A-222N. Caches 222A-222N store data managed by respective cache controllers 204A-204N. Coherence engines 220A-220N are configured to maintain data consistency of respective caches 222A-222N.
Memory controllers 206A-206N are configured to manage data stored in memory devices of processing system 200 (not shown in FIG. 2).
IO controllers 208A-208N are configured to manage communication between processor cores 202A-202N and peripheral devices (e.g., USB (universal serial bus) devices, SATA (Serial ATA) devices, ethernet devices, audio devices, HDMI (high-definition media interface) devices, disk drives, etc.).
Distributed accelerator 210 is a further embodiment of distributed accelerator 104 of FIG. 1. As shown in FIG. 2, distributed accelerator 210 includes a coordinator slice 212, subordinate slices 214A-214N, subordinate slices 216A-216N, and subordinate slices 218A-218N distributed across processing system 200.
Coordinator slice 212 is a further embodiment of coordinator slice 110 of FIG. 1.
Subordinate slices 214A-214N, subordinate slices 216A-216N, and subordinate slices 218A-218N are further embodiments of subordinate slices 112A-112N of FIG. 1. Subordinate slices 214A-214N are subordinate accelerator slices configured as on-chip accelerator slices coupled to interconnect 224.
Subordinate slices 216A-216N are subordinate accelerator slices configured as off-chip accelerator slices coupled to IO controller 208A. Subordinate slices 216A-216N may be expandable accelerator slices. For instance, off-chip accelerator slices may be coupled to interconnect 224 via an IO controller, such as IO controller 208A in FIG. 2. In this context, additional off-chip accelerator slices may be added to expand distributed accelerator 210.
Subordinate slices 218A-218N are subordinate accelerator slices configured as components of respective cache controllers 204A-204N. In this context, each of subordinate slices 218A-218N may utilize respective coherence engines 220A-220N and respective caches 222A-222N. For instance, subordinate slices 218A-218N may use coherence engines 220A-220N for data movement functions, as described further below with respect to FIGS. 12 and 13.
Distributed accelerator 210 utilizes coordinator slice 212, subordinate slices 214A-214N, subordinate slices 216A-216N, and subordinate slices 218A-218N to perform tasks associated with commands received from processor cores 202A-202N. Distributing tasks across multiple accelerator slices utilizes spatial parallelism of multiple attach points to reduce hotspots.
Distributed accelerator 210 is depicted in FIG. 2 as having a single coordinator slice 212; however, it is contemplated herein that multiple coordinator slices may be used. For instance, any of subordinate slices 214A-214N, subordinate slices 216A-216N, and/or subordinate slices 218A-218N may be replaced with or configured as a coordinator slice, depending on the particular implementation. In accordance with an embodiment, a processing system may have a number of accelerator slices equal to the number of cache controllers and memory controllers in the processing system.
Distributed accelerator 210 may provide responses to processor cores 202A-202N in various ways. For instance, each accelerator slice allocated by coordinator slice 212 may transmit a response to the processor core that issued the command. In another example embodiment, coordinator slice 212 receives each response from the allocated accelerator slices and generates a coordinated response as an aggregate of the received responses. In this context, coordinator slice 212 transmits the coordinated response to the processor core that issued the command. In another example embodiment, distributed accelerator 210 may store responses in caches (e.g., one or more of caches 222A-222N) or memory (e.g., a memory associated with one or more of memory controllers 206A-206N). In this context, distributed accelerator 210 or the associated controller may alert the processor core that issued the command (e.g., via an interrupt) or the processor may poll the cache controller or memory controller for a response.
Processing system 200 may include additional components (not shown in FIG. 2 for ease of illustration), depending on the particular implementation.
Distributed accelerators may be configured in various ways. For instance, FIG. 3 shows a block diagram of a distributed accelerator 300 that includes a coordinator slice 304 and subordinate slices 306A-306N, according to an example embodiment.
Coordinator slice 304 is a further embodiment of coordinator slice 110 of FIG. 1. As shown in FIG. 3, coordinator slice 304 includes an interface 308, a slice controller 310, a command manager 312, a slice coordinator 314, response and communication registers 316, execution engines 318, and data buffers 320.
Response and communication registers 316 may be any type of registers that are described herein, and/or as would be understood by a person of skill in the relevant art(s) having the benefit of this disclosure. Response and communication registers 316 may include one or more registers for communicating with processor core 102 and/or subordinate slices 306A-306N. For instance, response and communication registers 316 may be used to communicate messages to and from subordinate slices 306A-306N. Results of coordinator slice 304 completing tasks may be communicated to processor core 102 via response and communication registers 316. Response and communication registers 316 are communicatively coupled to interface 308 via response bus 342.
Data buffers 320 may be any type of data buffer that are described herein, and/or as would be understood by a person of skill in the relevant art(s) having the benefit of this disclosure. Data buffers 320 may be used to store data to be processed by or data that has been processed by coordinator slice 304. Data buffers 320 receive data to be processed from interface 308 via data bus 356. Interface 308 receives data processed by coordinator slice 304 from data buffers 320 via data bus 356.
Slice controller 310 is configured to manage coordinator slice 304 and components thereof. For example, slice controller 310 receives control signals from processor core 102 and provides status updates to processor core 102 via control and status bus 338. Slice controller 310 is further configured to configure components of coordinator slice 304 via configuration and status bus 346. Slice controller 310 includes a status manager 322, an abort task manager 324, and a coordinated response generator 326. Status manager 322 is configured to monitor the operation status of coordinator slice 304 and subordinate slices 306A-306N via configuration and status bus 346. Status manager 322 may poll allocated accelerator slices for sub-task or task status (e.g., via slice coordinator 314), may detect errors or exceptions in accelerator slice operation (e.g., via configuration and status bus 346), and/or otherwise monitor the operation status of coordinator slice 304 and subordinate slices 306A-306N, as described elsewhere herein.
Abort task manager 324 is configured to abort tasks or sub-tasks managed by coordinator slice 304. For instance, abort task manager 324 may be configured to abort a task or sub-task in response to an abort command from processor core 102, abort a task or sub-task due to an error or exception, and/or otherwise abort a task or sub-task managed by coordinator slice 304, as described elsewhere herein.
Coordinated response generator 326 is configured to generate coordinated responses to send to processor core 102. For instance, coordinator slice 304 receives a corresponding response from each allocated accelerator slice indicative of the allocated accelerator slice having completed a respective sub-task. Coordinated response generator 326 generates a coordinated response 366 indicative of the corresponding responses. In accordance with an embodiment, coordinated response 366 is transmitted via configuration and status bus 346 to execution engines 318, which store coordinated response 366 in response and communication registers 316 via response bus 354. Coordinator slice 304 transmits coordinated response 366 to processor core 102. Coordinated response 366 may be transmitted to or received by processor core 102 in various ways, as described elsewhere herein.
Command manager 312 is configured to manage commands received by coordinator slice 304. For instance, coordinator slice 304 receives command 360 from processor core 102. Command manager 312 receives command 360 via command bus 340. Command manager 312 is configured to determine if distributed accelerator 300 is capable of performing the task associated with command 360 and manage commands coordinated by coordinator slice 304. Command manager 312 includes a completion time estimator 328 and a command queue 330. Completion time estimator 328 is configured to estimate a completion time of the task associated with command 360. Command manager 312 may determine if distributed accelerator 300 is capable of performing the task based on the estimated completion time. For instance, if the completion time is greater than a threshold, command manager 312 may reject command 360. If the completion time is lower than the threshold, command manager 312 adds command 360 to command queue 330. Command queue 330 is configured to store commands waiting to be processed by coordinator slice 304. Queued commands may include information such as buffer sizes, command latency, and/or other information associated with queued commands. Command manager 312 is configured to generate an instruction to execute commands in command queue 330.
Slice coordinator 314 is configured to coordinate accelerator slices of distributed accelerator 300 to perform commands in command queue 330. For instance, slice coordinator 314 receives an instruction to execute a command from command manager 312 via instruction bus 348 and coordinates accelerator slices to perform the command. Slice coordinator 314 includes a sub-task generator 332, a slice allocator 334, and a sub-instruction generator 336. Sub-task generator 332 is configured to receive the command from command manager 312 and determine one or more sub-tasks of the task to generate a set of sub-tasks. Sub-tasks may be determined in various ways. For instance, a task may be divided based on bandwidth needed to complete a task, size of data to be moved or manipulated, types of steps to be performed, or as described elsewhere herein.
Slice allocator 334 is configured to allocate an accelerator slice of distributed accelerator 300 to perform a sub-task. For instance, slice allocator 334 may allocate coordinator slice 304, one or more of subordinate slices 306A-306N, or a combination of coordinator slice 304 and one or more of subordinate slices 306A-306N. In embodiments, an accelerator slice may be allocated to perform a single sub-task or multiple sub-tasks. Slice allocator 334 may allocate an accelerator slice based on the type of accelerator slice, the type of sub-task, the latency of the sub-task, a load of the accelerator slice, or other factors described elsewhere herein.
Sub-instruction generator 336 is configured to determine, for each sub-task, sub-task instructions for performing the sub-task of the set of sub-tasks. Generated sub-instructions are transmitted to their respective allocated slices. For instance, sub-task instructions allocated to coordinator slice 304 are transmitted via engine instruction bus 350 to execution engines 318 and sub-task instructions 362A-N allocated to subordinate slices 306A-306N are transmitted to response and communication registers 316 via subordinate instruction bus 352.
Execution engines 318 are configured to perform sub-tasks allocated to coordinator slice 304 by slice allocator 334. For instance, execution engines 318 receive allocated sub-tasks via engine instruction bus 350. Execution engines 318 access corresponding responses 364A-N from other allocated slices via response bus 354 and access data stored in data buffers 320 via execution data bus 358. Execution engines 318 generate responses indicative of completing a task. Generated responses may either be transmitted to coordinated response generator 326 via configuration and status bus 346 or to response and communication registers 316 via response bus 354, depending on the particular implementation. For instance, if coordinator slice 304 is generating an individual response, execution engines 318 may transmit the response to response and communication registers 316 for storage and communication to processor core 102.
Subordinate slices 306A-306N are further embodiments of subordinate slices 112A-112N of FIG. 1.
Subordinate slices may be configured in various ways. For instance, subordinate slices may include components similar to those of coordinator slice 304.
For example, as illustrated in FIG. 3, subordinate slice 306A includes an interface 368, a slice controller 370, a command queue 372, response and communication registers 374, execution engines 376, and data buffers 378.
While subordinate slice 306A is illustrated in FIG. 3 as including these components, subordinate slices may include more, fewer, and/or different components, depending on the particular implementation.
Note that coordinator slice 304 as illustrated in FIG. 3 may operate in various ways, in embodiments. For instance, FIG. 4 shows a flowchart 400 of a process for performing a task via a distributed accelerator, according to an example embodiment. In an embodiment, flowchart 400 may be performed by coordinator slice 304 of FIG. 3. For purposes of illustration, flowchart 400 is described below with continued reference to FIG. 3.
Flowchart 400 starts with step 402. In step 402, a command that includes instructions for performing a task is received. For instance, coordinator slice 304 of FIG. 3 receives command 360, which includes instructions for performing a task, from processor core 102, as described above with respect to command manager 312.
In step 404, one or more sub-tasks of the task are determined to generate a set of sub-tasks. For instance, sub-task generator 332 of FIG. 3 determines one or more sub-tasks of the task associated with command 360 to generate a set of sub-tasks, as described above.
In step 406, for each sub-task of the set of sub-tasks, an accelerator slice of a plurality of accelerator slices of the distributed accelerator is allocated to perform the sub-task. For instance, slice allocator 334 of FIG. 3 allocates an accelerator slice of distributed accelerator 300 (e.g., coordinator slice 304 and/or one or more of subordinate slices 306A-306N) to perform each sub-task, as described above.
In step 408, for each sub-task of the set of sub-tasks, sub-task instructions are determined for performing the sub-task. For instance, sub-instruction generator 336 of FIG. 3 determines sub-task instructions for performing each sub-task of the set of sub-tasks, as described above.
In step 410, for each sub-task of the set of sub-tasks, the sub-task instructions are transmitted to the allocated accelerator slice. For instance, slice coordinator 314 of FIG. 3 transmits the determined sub-task instructions to the respective allocated accelerator slices (e.g., via engine instruction bus 350 and/or subordinate instruction bus 352), as described above.
In step 412, a corresponding response is received from each allocated accelerator slice. Each corresponding response is indicative of the allocated accelerator slice having completed a respective sub-task. For instance, coordinated response generator 326 of coordinator slice 304 of FIG. 3 receives a corresponding response (e.g., corresponding responses 364A-N) from each allocated accelerator slice, as described above.
In step 414, a coordinated response indicative of the corresponding responses is generated. For instance, coordinated response generator 326 is configured to generate a coordinated response 366 indicative of the corresponding responses received in step 412. Coordinated response 366 may be stored in response and communication registers 316. In embodiments, coordinated response 366 may be transmitted to processor core 102 in various ways, as described elsewhere herein.
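For purposes of illustration and not limitation, the steps of flowchart 400 may be summarized by the following Python sketch. Splitting the task into equal address ranges, round-robin allocation, and the dictionary shapes are assumptions chosen for simplicity rather than features of any particular embodiment.

    def coordinate(command, slices, portions=3):
        """Toy model of steps 402-414: split a task into sub-tasks, transmit
        them to allocated slices, and aggregate the per-slice responses."""
        # Step 404: determine sub-tasks (here, equal address ranges).
        size = command["length"] // portions
        subtasks = [
            {"src": command["src"] + i * size,
             "dst": command["dst"] + i * size,
             "length": size if i < portions - 1
                       else command["length"] - size * (portions - 1)}
            for i in range(portions)
        ]
        responses = []
        for i, sub in enumerate(subtasks):
            slice_fn = slices[i % len(slices)]    # step 406: allocate (round robin)
            instr = ("copy", sub["src"], sub["dst"], sub["length"])   # step 408
            responses.append(slice_fn(instr))     # steps 410-412
        # Step 414: coordinated response indicative of all responses.
        return {"done": all(r == "ok" for r in responses), "parts": responses}

    toy_slice = lambda instr: "ok"    # stand-in slice that always completes
    cmd = {"src": 0x1000, "dst": 0x9000, "length": 30_000}
    print(coordinate(cmd, [toy_slice, toy_slice, toy_slice]))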
In embodiments, distributed accelerator 300 of FIG. 3 may be configured to provide status updates for tasks and sub-tasks. For example, FIG. 5 shows a block diagram of a distributed accelerator including a coordinator slice 504 and subordinate slices 506A-506N, and FIG. 6 shows a flowchart 600 of a process for generating a coordinated status update, according to example embodiments. Flowchart 600 is described below with reference to FIG. 5.
Flowchart 600 begins with step 602. In step 602, a status update command that includes a request for progression of a task is received. For example, interface 508 of coordinator slice 504 receives a status update command 520 from processor core 102. Interface 508 may store status update command 520 in a register, convert status update command 520 to another format, and/or otherwise process status update command 520, depending on the particular implementation. Command manager 510 receives status update command 520 from interface 508. Command manager 510 may process status update command 520 in various ways. For instance, command manager 510 may place status update command 520 in a command queue (e.g., command queue 330 of FIG. 3) and/or otherwise process status update command 520, as described elsewhere herein.
In step 604, a status update instruction is transmitted to the allocated accelerator slices. For instance, slice coordinator 512 is configured to transmit status update instructions to allocated accelerator slices based on status update command 520. For example, slice coordinator 512 receives status update command 520 from command manager 510. Slice coordinator 512 determines sub-tasks associated with status update command 520 and accelerator slices allocated to determined sub-tasks. If coordinator slice 504 is an allocated accelerator slice, slice coordinator 512 transmits a status update instruction 524 to status manager 514. If one or more subordinate slices 506A-506N are allocated accelerator slices, slice coordinator 512 stores one or more status update instructions 526A-N in response and communication registers 516. Interface 508 receives one or more status update instructions 526A-N from response and communication registers 516 and transmits the instructions to corresponding subordinate slices 506A-506N.
In step 606, a corresponding status update response is received from each allocated accelerator slice. Each corresponding status update is indicative of the progression of the allocated accelerator slice performing the respective sub-task. For instance, coordinated response generator 518 is configured to receive a corresponding status update from each allocated accelerator slice. For example, if coordinator slice 504 is an allocated accelerator slice, status manager 514 receives status update instruction 524, determines the progression of coordinator slice 504 in performing the respective sub-task, and generates corresponding status update 528. If one or more of subordinate slices 506A-506N are allocated accelerator slices, the allocated accelerator slices receive respective status update instructions 526A-N, determine the progression of respective sub-tasks, and generate corresponding status updates 530A-N. Interface 508 of coordinator slice 504 receives corresponding status updates 530A-N and stores the updates in response and communication registers 516. Status manager 514 receives corresponding status updates 530A-N from response and communication registers 516. In accordance with an embodiment, status manager 514 is configured to evaluate or otherwise process corresponding status updates 530A-N. For instance, status manager 514 may check for errors in corresponding status updates 530A-N. Coordinated response generator 518 is configured to receive corresponding status update 528 and corresponding status updates 530A-N from status manager 514.
In step 608, a coordinated status update indicative of the one or more received status update responses is generated. For instance, coordinated response generator 518 is configured to generate a coordinated status update 532 indicative of corresponding status updates 528 and 530A-N. In accordance with an embodiment, coordinated response generator 518 stores coordinated status update 532 in a register of interface 508, e.g., a status register. Processor core 102 may receive coordinated status update 532 from coordinator slice 504 asynchronously or synchronously, as described elsewhere herein.
Thus, a process for generating a coordinated status update has been described with respect to FIGS. 5 and 6.
In embodiments, distributed accelerator 300 of FIG. 3 may be configured to abort tasks and/or sub-tasks. For example, FIG. 7 shows a block diagram of a distributed accelerator 700 including a coordinator slice 704 and subordinate slices 706A-706N, and FIG. 8 shows a flowchart 800 of a process for aborting one or more sub-tasks, according to example embodiments. Flowchart 800 is described below with reference to FIG. 7.
Flowchart 800 begins at step 802. In step 802, an abort condition is identified. For instance, abort condition identifier 722 is configured to identify an abort condition. An abort condition may be an abort command, an error in the operation of distributed accelerator 700, or another condition for aborting one or more sub-tasks performed by distributed accelerator 700. For instance, in accordance with an embodiment, interface 708 of coordinator slice 704 receives an abort command 726 from processor core 102 of FIG. 1, and abort condition identifier 722 identifies the abort condition based on abort command 726.
In accordance with an embodiment, abort condition identifier 722 is configured to identify an abort condition by identifying an error in the operation of distributed accelerator 700. The error may be detected in the coordinator slice 704 or one or more of subordinate slices 706A-706N. For instance, status manager 720 is configured to monitor the operation status of execution engines 718 via engine status signals 734 and subordinate slices 706A-706N via subordinate status signals 740A-740N. Status manager 720 may generate a status indication signal 738 indicative of the operation status of execution engines 718 and/or subordinate slices 706A-706N. Abort condition identifier 722 is configured to determine if status indication signal 738 indicates an abort condition. For instance, abort condition identifier 722 may determine that status indication signal 738 indicates a failure in the operation of execution engines 718, another component of coordinator slice 704, one or more of subordinate slices 706A-706N, communication between coordinator slice 704 and subordinate slices 706A-706N, and/or the like.
In accordance with an embodiment, abort condition identifier 722 may determine an exception has occurred. An exception is an error that an accelerator slice is unable to resolve. For instance, an exception may occur due to a fault in the accelerator slice, an error in data associated with a sub-task, a communication error, or other error condition in performing a sub-task. Coordinator slice 704 may reallocate a sub-task that resulted in an exception to another accelerator slice of distributed accelerator 700 or report the exception to processor core 102 for processing. For instance, an exception resulting from a page fault may be reported to processor core 102 for handling as a regular page fault.
In step 804, one or more sub-tasks of a set of sub-tasks are determined to be aborted. For instance, abort task determiner 724 is configured to determine one or more sub-tasks to be aborted based on the abort condition identified in step 802. Abort task determiner 724 is further configured to generate an abort set signal 728 indicative of the one or more sub-tasks to be aborted. A sub-task may be identified by a CID, an allocated accelerator slice, a type of sub-task, and/or other criteria described herein. For instance, abort command 726 may include the CID of a command to be aborted. In this context, abort task determiner 724 determines to abort each sub-task associated with the CID.
In step 806, an abort instruction is transmitted to each allocated accelerator slice associated with the determined one or more sub-tasks to be aborted. For instance, slice coordinator 714 transmits abort instructions to each allocated accelerator slice associated with the one or more sub-tasks to be aborted determined in step 804. For example, slice coordinator 714 receives abort set signal 728 from abort task determiner 724. Slice coordinator 714 determines which accelerator slices are allocated to the one or more sub-tasks to be aborted. If coordinator slice 704 is allocated to a sub-task to be aborted, slice coordinator 714 transmits an abort instruction 730 to execution engines 718. If one or more of subordinate slices 706A-706N are allocated to a sub-task to be aborted, slice coordinator 714 stores abort instructions 732A-N in response and communication registers 716. Interface 708 receives abort instructions 732A-N from response and communication registers 716 and transmits abort instructions 732A-N to corresponding subordinate slices 706A-706N.
In accordance with an embodiment, distributed accelerator 700 is configured to update processor core 102 after one or more sub-tasks have been aborted. For instance, status manager 720 is configured to monitor the operation status of execution engines 718 via engine status signals 734 and subordinate slices 706A-706N via subordinate status signals 740A-740N. Status manager 720 generates an abort complete signal 736 indicative that each sub-task determined in step 804 has been aborted. Abort complete signal 736 may include data such as which accelerator slices were aborted, progress of aborted sub-tasks, data associated with aborted sub-tasks, the abort condition identified in step 802, and/or the like. For example, in accordance with an embodiment, abort complete signal 736 includes states of aborted tasks and/or sub-tasks. In this example, processor core 102 receives abort complete signal 736 and utilizes the states of aborted tasks and/or sub-tasks for debugging and/or resuming aborted tasks.
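As a non-limiting sketch of steps 802-806, the following Python fragment aborts every outstanding sub-task whose command identifier matches an abort command. The allocation-table layout and names are assumptions made for illustration only.

    def abort_by_cid(allocations, abort_cid):
        """Toy model of steps 804-806: determine the sub-tasks to abort
        and produce one abort instruction per affected allocated slice."""
        # Step 804: all sub-tasks whose CID matches are to be aborted.
        to_abort = [a for a in allocations if a[0] == abort_cid]
        # Step 806: one abort instruction per allocated accelerator slice.
        return [("abort", slice_name, subtask)
                for _, slice_name, subtask in to_abort]

    # (cid, allocated slice, sub-task) records; shapes assumed for the sketch.
    allocations = [
        (7, "subordinate-A", "copy[0:10MB]"),
        (7, "subordinate-B", "copy[10MB:20MB]"),
        (9, "subordinate-C", "crc[buffer-2]"),
    ]
    for instr in abort_by_cid(allocations, abort_cid=7):
        print(instr)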
In embodiments, completion time estimator 328 of FIG. 3 may be implemented in various ways. For example, FIG. 9 shows a block diagram of a completion time estimator 900, and FIG. 10 shows a flowchart 1000 of a process for estimating a completion time of a command, according to example embodiments. Completion time estimator 900 includes a command analyzer 902, a load analyzer 904, an estimated completion time determiner 906, a threshold analyzer 908, a latency log updater 910, and a command latency log 912. Flowchart 1000 is described below with reference to FIG. 9.
Flowchart 1000 starts with step 1002. In step 1002, an estimated completion time of a command is determined based on a command load of the distributed accelerator. For instance, completion time estimator 900 receives a command 914 from a processor, such as processor core 102 of FIG. 1. Command analyzer 902 is configured to analyze command 914 and generate a command analysis signal 918 indicative of characteristics of the task associated with command 914.
Load analyzer 904 is configured to analyze a current workload of distributed accelerator 300. For instance, load analyzer 904 is configured to receive status signal 936 from status manager 322 (not shown in FIG. 9) indicative of the operation status of distributed accelerator 300. Load analyzer 904 generates a load analysis signal 924 indicative of the current workload of distributed accelerator 300.
Estimated completion time determiner 906 is configured to receive command analysis signal 918 from command analyzer 902 and load analysis signal 924 from load analyzer 904. Estimated completion time determiner 906 determines an estimated completion time of the task associated with command 914 based on command analysis signal 918 and load analysis signal 924. For instance, estimated completion time determiner 906 may analyze resources available to perform the task associated with command 914, commands queued in command queue 330, estimated completion time of queued commands, command latencies, and other data to generate an estimated completion time 926.
In step 1004, the estimated completion time is compared to a wait threshold. For instance, threshold analyzer 908 receives estimated completion time 926 and compares it to a wait threshold. In accordance with an embodiment, the wait threshold is included with command 914. For example, processor core 102 may include a wait threshold indicative of a deadline to complete the task associated with command 914. In accordance with another embodiment, the wait threshold is a predetermined threshold. For instance, the wait threshold may be a maximum number of clock cycles after command 914 was received by completion time estimator 900. If estimated completion time 926 is below the wait threshold, flowchart 1000 proceeds to step 1006. Otherwise, flowchart 1000 proceeds to step 1008. It is contemplated herein that, if estimated completion time 926 is at the wait threshold, flowchart 1000 may proceed to either step 1006 or step 1008, depending on the particular implementation.
In step 1006, the received command is positioned in a command queue. For instance, threshold analyzer 908 is configured to generate, if estimated completion time 926 is below the wait threshold, a positioning signal 928. Positioning signal 928 includes command 914. Depending on the particular implementation, positioning signal 928 may include additional information such as command latency, estimated completion time 926, buffer size, and other information related to command 914. Command queue 330 receives positioning signal 928 and positions command 914 accordingly. In accordance with an embodiment, positioning signal 928 includes instructions to position command 914 in a particular position of command queue 330. For instance, positioning signal 928 may include instructions to position command 914 at the beginning of command queue 330, at the end of command queue 330, before or after a particular command in command queue 330, and/or the like.
In step 1008, a rejection response is generated. For instance, threshold analyzer 908 is configured to generate, if estimated completion time 926 is at or above the wait threshold, a rejection response 930. Rejection response 930 may be stored in a register, such as response and communication registers 316 of FIG. 3, for communication to the processor core that issued command 914.
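The decision of steps 1002-1008 may be illustrated by the following Python sketch. The per-command latency table and the simple sum over queued work are assumptions standing in for command analyzer 902, load analyzer 904, and estimated completion time determiner 906.

    from collections import deque

    latency_log = {"memmove": 120, "crc": 40}    # assumed cycles per command type

    def try_enqueue(queue, command, wait_threshold):
        """Toy model of steps 1002-1008: estimate completion time from the
        current command load, then queue or reject the command."""
        # Step 1002: queued work ahead of the command plus its own latency.
        queued_work = sum(latency_log.get(c, 100) for c in queue)
        estimate = queued_work + latency_log.get(command, 100)
        # Steps 1004-1008: compare the estimate to the wait threshold.
        if estimate < wait_threshold:
            queue.append(command)                # step 1006: position in queue
            return f"queued (estimate {estimate} cycles)"
        return f"rejected (estimate {estimate} >= {wait_threshold})"   # step 1008

    q = deque(["memmove", "crc"])
    print(try_enqueue(q, "memmove", wait_threshold=400))   # queued
    print(try_enqueue(q, "memmove", wait_threshold=200))   # rejected

Replacing the fixed table with measured latencies, as described next with respect to latency log updater 910 and command latency log 912, would make the estimate adaptive.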
As stated above, completion time estimator 900 includes a latency log updater 910 and command latency log 912. Latency log updater 910 and command latency log 912 may enable dynamic command latency estimation. For instance, in accordance with an embodiment, status manager 322 of distributed accelerator 300 of FIG. 3 monitors completed commands and their completion times. Latency log updater 910 receives measured command latencies (e.g., via status signal 936) and updates command latency log 912 accordingly. In this manner, estimated completion time determiner 906 may base estimated completion times on observed command latencies, depending on the particular implementation.
Coordinator slices may be configured in various ways, in embodiments. For instance, a coordinator slice may include hardware and/or firmware specialized for performing particular tasks. Coordinator slices specialized for performing different tasks may be included in the same distributed accelerator. For example, FIG. 11 shows a block diagram of a processing system 1100 that includes a distributed accelerator 1104 with multiple types of coordinator slices, according to an example embodiment.
Processing system 1100 includes processor cores 1102A-1102N and distributed accelerator 1104. Processor cores 1102A-1102N and distributed accelerator 1104 are communicatively coupled by interconnect 1106. Processor cores 1102A-1102N and interconnect 1106 are further embodiments of processor cores 202A-202N and interconnect 224 of FIG. 2, respectively.
Distributed accelerator 1104 is a further embodiment of distributed accelerator 210 of FIG. 2. As shown in FIG. 11, distributed accelerator 1104 includes a coordinator slice 1108, a data mover coordinator slice 1110, a synchronization coordinator slice 1112, a crypto coordinator slice 1114, a CRC coordinator slice 1116, a complex computation coordinator slice 1118, and subordinate slices 1120A-1120N.
For instance, data mover coordinator slice 1110 is configured to perform data movement sub-tasks, such as copying a data buffer to another memory location, initializing memory with a data pattern, comparing two memory regions to produce a difference in a third data buffer, computing and appending a checksum to a data buffer, applying previously computed differences to a buffer, moving data in a buffer to a different cache level (e.g., L2, L3, or L4), and/or other data movement functions as would be understood by a person of skill in the relevant art(s) having benefit of this disclosure. For instance, data mover coordinator slice 1110, in accordance with an embodiment, is configured to coordinate data movement tasks requiring large bandwidths. Data mover coordinator slice 1110 may allocate accelerator slices of distributed accelerator 1104 to move portions of data associated with a data movement task. In this way, data movement traffic is distributed across processing system 1100, reducing hotspots in communication traffic (e.g., interconnect traffic, IO interface traffic, controller interface traffic). In accordance with an embodiment, data mover coordinator slice 1110 may include a coherence engine to perform data transfer within memory of processing system 1100.
Synchronization coordinator slice 1112 is configured to accelerate atomic operations that operate on small amounts of data (e.g., a few words of data). Synchronization coordinator slice 1112 may perform an atomic update of a variable, an atomic exchange of two variables based on the value of a third, and/or perform other synchronization functions, as would be understood by a person of skill in the relevant art(s) having benefit of this disclosure. Synchronization coordinator slice 1112 is configured to return data values in addition to task statuses. In accordance with an embodiment, synchronization coordinator slice 1112 may store a final result in a local cache of a processor core (e.g., one or more of processor cores 1102A-1102N).
Crypto coordinator slice 1114 is configured to perform cryptography sub-tasks, such as implementing encryption and decryption functions. Encryption and decryption functions may be based on various standards. Crypto coordinator slice 1114 may be configured to encrypt and/or decrypt data used by other accelerator slices of distributed accelerator 1104. CRC coordinator slice 1116 is configured to perform CRC sub-tasks. For instance, CRC coordinator slice 1116 may detect errors in data or communication between components of processing system 1100.
Complex computation coordinator slice 1118 is configured to perform complex computations. Complex computation coordinator slice 1118 may be configured to perform complex computations alone or in coordination with other accelerator slices of distributed accelerator 1104. For instance, complex computation coordinator slice 1118 may include hardware and/or firmware specialized for performing encryption and data movement tasks. In this context, complex computation coordinator slice 1118 may perform tasks including encryption and data movement sub-tasks. In another embodiment, complex computation coordinator slice 1118 includes hardware and/or firmware for managing data coherence and receives a data movement command. In this example, complex computation coordinator slice 1118 allocates itself for managing coherence of the data movement and data mover coordinator slice 1110 for moving data.
Processing system 1100 may include additional components not shown in FIG. 11 for ease of illustration, such as cache controllers, memory controllers, and/or IO controllers, as described above with respect to FIG. 2.
Data mover coordinator slice 1110 may operate in various ways to move data, in embodiments. For example, FIG. 12 shows a block diagram of a data mover coordinator slice 1204 coordinating data movement with data mover subordinate slices, and FIG. 13 shows a flowchart 1300 of a process for moving data, according to example embodiments. As shown in FIG. 12, a processor core 1202 is communicatively coupled to data mover coordinator slice 1204, which coordinates with a data mover subordinate slice 1206 and a data mover subordinate slice 1208. Flowchart 1300 is described below with reference to FIG. 12.
Flowchart 1300 begins with step 1302. In step 1302, a set of portions of data are determined. For instance, processor core 1202 generates a data movement command 1210 including instructions to move data from a first location to a second location. Data mover coordinator slice 1204 receives data movement command 1210 and determines a set of portions of the data. Data mover coordinator slice 1204 may separate the data into multiple portions based on the size of data to be moved, bandwidth of available accelerator slices, the number of accelerator slices that may be allocated to move data, location of accelerator slices, location of data to be moved, and/or other criteria described elsewhere herein. For instance, in a non-limiting example, data movement command 1210 includes instructions to move 30 MB of data. Data mover coordinator slice 1204 separates the 30 MB of data into three 10 MB portions of data.
In step 1304, for each portion of the set of portions of the data, a sub-task for moving the portion is determined. For instance, data mover coordinator slice 1204 determines, for each portion of the set of portions of the data determined in step 1302, a sub-task for moving the portion. Determined sub-tasks may be transmitted to allocated accelerator slices, as described with respect to steps 406-410 of flowchart 400 of FIG. 4.
As illustrated in FIG. 12, data mover coordinator slice 1204, data mover subordinate slice 1206, and data mover subordinate slice 1208 each perform a respective sub-task to move a respective portion of the data (e.g., a respective 10 MB portion in the example above).
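Continuing the non-limiting 30 MB example (scaled down so the sketch runs quickly), the following Python fragment moves each portion with a separate helper standing in for one data mover slice. In an embodiment the three copies could proceed in parallel on distinct hardware slices; here they run sequentially for simplicity.

    def move_portion(memory, src, dst, length):
        # Stand-in for one data mover slice copying one portion.
        memory[dst:dst + length] = memory[src:src + length]

    SCALE = 1024                       # 30 KB here in place of 30 MB
    memory = bytearray(96 * SCALE)
    memory[0:30 * SCALE] = bytes(range(256)) * (30 * SCALE // 256)

    total, portions = 30 * SCALE, 3
    part = total // portions           # three equal portions (step 1302)
    for i in range(portions):          # one move sub-task per portion (step 1304)
        move_portion(memory, src=i * part, dst=60 * SCALE + i * part, length=part)

    assert memory[0:total] == memory[60 * SCALE:60 * SCALE + total]
    print("moved", total, "bytes in", portions, "portions")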
Embodiments of data mover coordinator slices, such as data mover coordinator slice 1204 of FIG. 12, may distribute data movement traffic across accelerator slices and other system resources, increasing effective bandwidth and reducing hotspots, as described elsewhere herein.
As described above, coordinator slices may include components similar to components of coordinator slice 304 of FIG. 3. For example, FIG. 14 shows a block diagram of a CRC coordinator slice 1400, according to an example embodiment.
As illustrated in FIG. 14, CRC coordinator slice 1400 includes an interface 1402, a slice controller 1404, a command manager 1406, a slice coordinator 1408, response and communication registers 1410, and execution engines 1412.
In embodiments, these components of CRC coordinator slice 1400 operate similarly to the corresponding components of coordinator slice 304 of FIG. 3, with execution engines 1412 configured to perform CRC computations.
Distributed accelerators may be configured to perform complex computations in various ways, in embodiments. For example, FIG. 15 shows a block diagram of a complex computation coordinator slice 1504 coupled to a CRC subordinate slice 1506, and FIG. 16 shows a flowchart 1600 of a process for performing an encrypt and CRC command, according to example embodiments. As shown in FIG. 15, a processor core 1502 transmits an encrypt and CRC command 1518 to complex computation coordinator slice 1504. Flowchart 1600 is described below with reference to FIG. 15.
Flowchart 1600 begins with step 1602. In step 1602, an encrypt and CRC command including data is received. For instance, interface 1508 of complex computation coordinator slice 1504 receives encrypt and CRC command 1518. Interface 1508 may store the encrypt and CRC command 1518 in a register. In accordance with an embodiment, the included data is stored in a data buffer (not shown in FIG. 15) of complex computation coordinator slice 1504.
In step 1604, an encrypt sub-task and a CRC sub-task are determined. For instance, slice coordinator 1512 receives encrypt and CRC command 1518 and determines an encrypt sub-task and a CRC sub-task. Slice coordinator 1512 may determine the encrypt and CRC sub-tasks using a sub-task generator, such as sub-task generator 332 of FIG. 3.
In step 1606, complex computation coordinator slice 1504 is allocated to perform the encrypt sub-task and CRC subordinate slice 1506 is allocated to perform the CRC sub-task. For instance, slice coordinator 1512 is configured to allocate complex computation coordinator slice 1504 to perform the encrypt sub-task and CRC subordinate slice 1506 to perform the CRC sub-task. Slice coordinator 1512 may allocate accelerator slices using a slice allocator, such as slice allocator 334 of FIG. 3.
In step 1608, encrypt sub-task instructions and CRC sub-task instructions are determined. For instance, slice coordinator 1512 is configured to determine encrypt sub-task instructions 1520 and CRC sub-task instructions 1522. Slice coordinator 1512 may determine sub-task instructions using a sub-instruction generator, such as sub-instruction generator 336 of FIG. 3.
In step 1610, encrypt sub-task instructions are performed by encrypting the included data. For instance, encryption engine 1514 is configured to perform encrypt sub-task instructions 1520 by encrypting the data included in encrypt and CRC command 1518 to generate encrypted data 1524. Encryption engine 1514 may access included data from a register or data buffer of complex computation coordinator slice 1504. As illustrated in FIG. 15, encryption engine 1514 transmits encrypted data 1524 to response and communication registers 1516.
In step 1612, the CRC sub-task instructions and the encrypted data are transmitted to the CRC subordinate slice. For instance, response and communication registers 1516 receive CRC sub-task instructions 1522 from slice coordinator 1512 and encrypted data 1524 from encryption engine 1514. Response and communication registers 1516 transmit a CRC sub-command 1526 including CRC sub-task instructions 1522 and encrypted data 1524 to interface 1508, which transmits CRC sub-command 1526 to CRC subordinate slice 1506.
CRC subordinate slice 1506 is configured to process encrypted data 1524 and append a CRC value to it. As illustrated in FIG. 15, CRC subordinate slice 1506 receives CRC sub-command 1526, computes a CRC value for encrypted data 1524, and appends the CRC value to encrypted data 1524.
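For purposes of illustration and not limitation, the encrypt-then-CRC flow of flowchart 1600 may be sketched in Python as follows. The XOR keystream is a placeholder for whatever cipher an encryption engine implements, and zlib.crc32 is a stand-in CRC function; both are assumptions made only to keep the example self-contained.

    import zlib
    from itertools import cycle

    def toy_encrypt(data: bytes, key: bytes) -> bytes:
        # Placeholder cipher: XOR with a repeating key (NOT real encryption).
        return bytes(b ^ k for b, k in zip(data, cycle(key)))

    def encrypt_and_crc(data: bytes, key: bytes) -> bytes:
        # Step 1610: the coordinator-side encrypt sub-task.
        encrypted = toy_encrypt(data, key)
        # Step 1612 onward: the CRC sub-task computes a CRC over the
        # encrypted data and appends it, mirroring CRC subordinate slice 1506.
        crc = zlib.crc32(encrypted)
        return encrypted + crc.to_bytes(4, "big")

    payload = encrypt_and_crc(b"hello, accelerator", b"key")
    body, crc = payload[:-4], int.from_bytes(payload[-4:], "big")
    assert zlib.crc32(body) == crc     # receiver-side integrity check
    print(payload.hex())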
Thus, an example embodiment of a complex computation coordinator slice and a flowchart of a process for performing a complex computation have been described with respect to FIGS. 15 and 16.
As noted above, systems and devices, including distributed accelerators, coordinator slices, and subordinate slices, may be configured in various ways to perform tasks. Accelerator slices have been described as network-attached devices, off-chip devices, on-chip devices, on-chip processing elements, or as specialized instructions in an ISA, in embodiments. Various types of coordinator slices have been described herein; however, it is contemplated herein that subordinate slices may include specialized hardware for performing particular tasks, as would be understood by persons of skill in the relevant art(s) having the benefit of this disclosure. For instance, a subordinate slice may include hardware specialized for data movement, synchronization, CRC, cryptography, complex computations, and/or the like. Furthermore, embodiments of the present disclosure may be configured to support coherent caches, increased bandwidth, quality of service monitoring, and/or metering (e.g., for billing), depending on the particular implementation.
Embodiments of distributed accelerators may support virtual memory. A distributed accelerator in accordance with an embodiment translates a virtual address received with a command (e.g., a logical block address) to a physical address of a memory device. The physical address may be used for write operations, read operations, or other operations associated with the physical memory (e.g., handling page faults). In an example embodiment, a distributed accelerator stores translated addresses in a cache to minimize translation overheads.
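A minimal Python sketch of such a translation cache follows, assuming a page-granular mapping and a hypothetical translate() fallback standing in for a page-table walk; both assumptions are made only for illustration.

    PAGE = 4096
    tlb = {}                                     # virtual page -> physical page

    def translate(vpage: int) -> int:
        # Stand-in for a page-table walk; here just a fixed offset.
        return vpage + 0x100

    def virt_to_phys(vaddr: int) -> int:
        """Translate a virtual address, caching per-page translations
        to minimize translation overheads, as described above."""
        vpage, offset = divmod(vaddr, PAGE)
        if vpage not in tlb:                     # miss: walk and cache
            tlb[vpage] = translate(vpage)
        return tlb[vpage] * PAGE + offset

    print(hex(virt_to_phys(0x1234)))             # miss, then cached
    print(hex(virt_to_phys(0x1FFF)))             # hit: same page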
Embodiments of the present disclosure may be configured to accelerate task performance. For instance, in a non-limiting example, a distributed accelerator in accordance with an embodiment is configured to process commands without a local address translation. In this context, the processor core translates a virtual address to a physical address and transmits a command to the distributed accelerator with the physical address. Such implementations may reduce the complexity and/or the size of the accelerator.
Moreover, according to the described embodiments and techniques, any components of processing systems, distributed accelerators, coordinator slices, and/or subordinate slices and their functions may be caused to be activated for operation/performance thereof based on other operations, functions, actions, and/or the like, including initialization, completion, and/or performance of those operations, functions, actions, and/or the like.
In some example embodiments, one or more of the operations of the flowcharts described herein may not be performed. Moreover, operations in addition to or in lieu of the operations of the flowcharts described herein may be performed. Further, in some example embodiments, one or more of the operations of the flowcharts described herein may be performed out of order, in an alternate sequence, or partially (or completely) concurrently with each other or with other operations.
The further example embodiments and advantages described in this Section may be applicable to any embodiments disclosed in this Section or in any other Section of this disclosure.
The embodiments described herein and/or any further systems, sub-systems, devices and/or components disclosed herein may be implemented in hardware (e.g., hardware logic/electrical circuitry), or any combination of hardware with software (computer program code configured to be executed in one or more processors or processing devices) and/or firmware.
Processor core 102, distributed accelerator 104, accelerator slices 108, coordinator slice 110, subordinate slices 112A-112N, processor cores 202A-202N, cache controllers 204A-204N, memory controllers 206A-206N, IO controllers 208A-208N, distributed accelerator 210, coordinator slice 212, subordinate slices 214A-214N, subordinate slices 216A-216N, subordinate slices 218A-218N, coherence engines 220A-220N, caches 222A-222N, interconnect 224, coordinator slice 304, subordinate slices 306A-306N, interface 308, slice controller 310, command manager 312, slice coordinator 314, response and communication registers 316, execution engines 318, data buffers 320, status manager 322, abort task manager 324, coordinated response generator 326, completion time estimator 328, command queue 330, sub-task generator 332, slice allocator 334, sub-instruction generator 336, interface 368, slice controller 370, command queue 372, response and communication registers 374, execution engines 376, data buffers 378, flowchart 400, coordinator slice 504, subordinate slices 506A-506N, interface 508, command manager 510, slice coordinator 512, status manager 514, response and communication registers 516, coordinated response generator 518, flowchart 600, coordinator slice 704, subordinate slices 706A-706N, interface 708, command manager 710, abort task manager 712, slice coordinator 714, response and communication registers 716, execution engines 718, status manager 720, abort condition identifier 722, abort task determiner 724, flowchart 800, completion time estimator 900, command analyzer 902, load analyzer 904, estimated completion time determiner 906, threshold analyzer 908, latency log updater 910, command latency log 912, flowchart 1000, processor cores 1102A-1102N, distributed accelerator 1104, interconnect 1106, coordinator slice 1108, data mover coordinator slice 1110, synchronization coordinator slice 1112, crypto coordinator slice 1114, CRC coordinator slice 1116, complex computation coordinator slice 1118, subordinate slices 1120A-1120N, processor core 1202, data mover coordinator slice 1204, data mover subordinate slice 1206, data mover subordinate slice 1208, flowchart 1300, CRC coordinator slice 1400, interface 1402, slice controller 1404, command manager 1406, slice coordinator 1408, response and communication registers 1410, execution engines 1412, processor core 1502, complex computation coordinator slice 1504, CRC subordinate slice 1506, interface 1508, command manager 1510, slice coordinator 1512, encryption engine 1514, response and communication registers 1516, and/or flowchart 1600 may be implemented in hardware, or hardware with any combination of software and/or firmware, including being implemented as computer program code configured to be executed in one or more processors and stored in a computer readable storage medium, or being implemented as hardware logic/electrical circuitry, such as being implemented in a system-on-chip (SoC). The SoC may include an integrated circuit chip that includes one or more of a processor (e.g., a microcontroller, microprocessor, digital signal processor (DSP), etc.), memory, one or more communication interfaces, and/or further circuits and/or embedded firmware to perform its functions.
As shown in FIG. 17, system 1700 includes a processing unit 1702 and a bus 1706 that couples various system components, including system memory, to processing unit 1702.
System 1700 also has one or more of the following drives: a hard disk drive 1714 for reading from and writing to a hard disk, a magnetic disk drive 1716 for reading from or writing to a removable magnetic disk 1718, and an optical disk drive 1720 for reading from or writing to a removable optical disk 1722 such as a CD ROM, DVD ROM, or other optical media. Hard disk drive 1714, magnetic disk drive 1716, and optical disk drive 1720 are connected to bus 1706 by a hard disk drive interface 1724, a magnetic disk drive interface 1726, and an optical drive interface 1728, respectively. The drives and their associated computer-readable media provide nonvolatile storage of computer-readable instructions, data structures, program modules and other data for the computer. Although a hard disk, a removable magnetic disk and a removable optical disk are described, other types of hardware-based computer-readable storage media can be used to store data, such as flash memory cards and drives (e.g., solid state drives (SSDs)), digital video disks, RAMs, ROMs, and other hardware storage media.
A number of program modules or components may be stored on the hard disk, magnetic disk, optical disk, ROM, or RAM. These program modules include an operating system 1730, one or more application programs 1732, other program modules 1734, and program data 1736. In accordance with various embodiments, the program modules may include computer program logic that is executable by processing unit 1702 to perform any or all the functions and features of coherence engines 220A-220N, slice controller 310, command manager 312, slice coordinator 314, response and communication registers 316, status manager 322, abort task manager 324, coordinated response generator 326, completion time estimator 328, sub-task generator 332, slice allocator 334, sub-instruction generator 336, slice controller 370, command manager 510, slice coordinator 512, status manager 514, coordinated response generator 518, command manager 710, abort task manager 712, slice coordinator 714, response and communication registers 716, execution engines 718, status manager 720, abort condition identifier 722, abort task determiner 724, completion time estimator 900, command analyzer 902, load analyzer 904, estimated completion time determiner 906, threshold analyzer 908, latency log updater 910, command latency log 912, slice controller 1404, command manager 1406, slice coordinator 1408, response and communication registers 1410, execution engines 1412, command manager 1510, slice coordinator 1512, and/or encryption engine 1514 (including any suitable steps of flowcharts 400, 600, 800, 1000, 1300, and/or 1600).
A user may enter commands and information into the system 1700 through input devices such as keyboard 1738 and pointing device 1740. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, touch screen and/or touch pad, a voice recognition system to receive voice input, a gesture recognition system to receive gesture input, or the like. These and other input devices are often connected to processing unit 1702 through a serial port interface 1742 that is coupled to bus 1706, but may be connected by other interfaces, such as a parallel port, game port, or a universal serial bus (USB).
A display screen 1744 is also connected to bus 1706 via an interface, such as a video adapter 1746. Display screen 1744 may be external to, or incorporated in, system 1700. Display screen 1744 may display information, and may also serve as a user interface for receiving user commands and/or other information (e.g., by touch, finger gestures, virtual keyboard, etc.). In addition to display screen 1744, system 1700 may include other peripheral output devices (not shown) such as speakers and printers.
System 1700 is connected to a network 1748 (e.g., the Internet) through an adapter or network interface 1750, a modem 1752, or other means for establishing communications over the network. Modem 1752, which may be internal or external, may be connected to bus 1706 via serial port interface 1742, as shown in FIG. 17.
As used herein, the terms “computer program medium,” “computer-readable medium,” and “computer-readable storage medium” are used to refer to physical hardware media such as the hard disk associated with hard disk drive 1714, removable magnetic disk 1718, removable optical disk 1722, other physical hardware media such as RAMs, ROMs, flash memory cards, digital video disks, zip disks, MEMS-based storage devices, nanotechnology-based storage devices, and further types of physical/tangible hardware storage media. Such computer-readable storage media are distinguished from, and non-overlapping with, communication media (i.e., they do not include communication media). Communication media embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wireless media such as acoustic, RF, infrared, and other wireless media, as well as wired media. Embodiments are also directed to such communication media, which are separate from and non-overlapping with embodiments directed to computer-readable storage media.
As noted above, computer programs and modules (including application programs 1732 and other program modules 1734) may be stored on the hard disk, magnetic disk, optical disk, ROM, RAM, or other hardware storage medium. Such computer programs may also be received via network interface 1750, serial port interface 1742, or any other interface type. Such computer programs, when executed or loaded by an application, enable system 1700 to implement features of the embodiments described herein. Accordingly, such computer programs represent controllers of system 1700.
Embodiments are also directed to computer program products comprising computer code or instructions stored on any computer-readable medium. Such computer program products include hard disk drives, optical disk drives, memory device packages, portable memory sticks, memory cards, and other types of physical storage hardware. In accordance with various embodiments, the program modules may include computer program logic that is executable by processing unit 1702 to perform any or all of the functions and features of processor core 102 and/or distributed accelerator 104, as described above in reference to FIG. 1.
In an embodiment, a processing system includes a distributed accelerator including a plurality of accelerator slices. The plurality of accelerator slices includes one or more subordinate slices and a coordinator slice. The coordinator slice is configured to receive a command that includes instructions for performing a task. The coordinator slice is configured to determine one or more sub-tasks of the task to generate a set of sub-tasks. For each sub-task of the set of sub-tasks, the coordinator slice is configured to allocate an accelerator slice of the plurality of accelerator slices to perform the sub-task, determine sub-task instructions for performing the sub-task, and transmit the sub-task instructions to the allocated accelerator slice. Each allocated accelerator slice is configured to generate a corresponding response indicative of the allocated accelerator slice having completed a respective sub-task.
In an embodiment, the coordinator slice is further configured to receive, from each allocated accelerator slice, the corresponding response indicative of the allocated accelerator slice having completed a respective sub-task. The coordinator slice is configured to generate a coordinated response indicative of the corresponding responses.
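For purposes of illustration and not limitation, the coordinator flow of the two preceding embodiments may be modeled in software. The following Python sketch assumes hypothetical Command, SubTask, Response, SubordinateSlice, and CoordinatorSlice names and a round-robin allocation policy; none of these names or policies is mandated by the embodiments described herein.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Command:            # hypothetical command: a task name and its payload
        task: str
        payload: list

    @dataclass
    class SubTask:            # hypothetical per-slice instructions and data
        instructions: str
        data: list

    @dataclass
    class Response:           # indicates a slice completed its sub-task
        slice_id: int
        done: bool

    class SubordinateSlice:
        def __init__(self, slice_id: int):
            self.slice_id = slice_id

        def execute(self, sub_task: SubTask) -> Response:
            # A hardware slice would run its execution engines here.
            return Response(self.slice_id, done=True)

    class CoordinatorSlice:
        def __init__(self, subordinates: List[SubordinateSlice]):
            self.subordinates = subordinates

        def handle(self, command: Command) -> dict:
            # Determine sub-tasks: here, one sub-task per payload element.
            sub_tasks = [SubTask(command.task, [x]) for x in command.payload]
            responses = []
            for i, sub_task in enumerate(sub_tasks):
                # Allocate a slice (round-robin) and transmit the sub-task.
                allocated = self.subordinates[i % len(self.subordinates)]
                responses.append(allocated.execute(sub_task))
            # Coordinated response indicative of the per-slice responses.
            return {"task": command.task,
                    "completed": all(r.done for r in responses)}

    # Example: four sub-tasks spread over two subordinate slices.
    coordinator = CoordinatorSlice([SubordinateSlice(0), SubordinateSlice(1)])
    print(coordinator.handle(Command("sum", [1, 2, 3, 4])))

In this sketch, the coordinated response folds the per-slice completion indications into a single result, which is one of many ways such aggregation may be performed.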
In an embodiment, the command is received from a processor core. Each allocated accelerator slice transmits the corresponding response indicative of the allocated accelerator slice having completed the respective sub-task to the processor core.
In an embodiment, the plurality of accelerator slices includes a plurality of coordinator slices.
In an embodiment, the processing system includes an interconnect network configured to transfer signals between the coordinator slice and the one or more subordinate slices. At least one accelerator slice of the plurality of accelerator slices is directly coupled to the interconnect network.
In an embodiment, the coordinator slice is one of: a data mover coordinator slice, a synchronization coordinator slice, a crypto coordinator slice, a cyclic redundancy check (CRC) coordinator slice, or a complex computation coordinator slice.
In an embodiment, the processing system includes a cache controller. The cache controller includes the coordinator slice. The task includes instructions to move data from a first location to a second location. The coordinator slice is a data mover coordinator slice configured to determine the one or more sub-tasks of the task by determining a set of portions of the data and determining, for each portion of the set of portions of the data, a sub-task for moving the portion.
In an embodiment, the coordinator slice is a complex computation coordinator slice configured to receive an encrypt and cyclic redundancy check (CRC) command including data. The complex computation coordinator slice is configured to determine an encrypt sub-task and a CRC sub-task, allocate the coordinator slice to perform the encrypt sub-task and a CRC subordinate slice of the one or more subordinate slices to perform the CRC sub-task, and determine encrypt sub-task instructions and CRC sub-task instructions. The complex computation coordinator slice is configured to perform the encrypt sub-task instructions by encrypting the included data and transmit the CRC sub-task instructions and the encrypted data to the CRC subordinate slice.
In an embodiment, the coordinator slice is further configured to receive a status update command that includes a request for progression of the task, transmit a status update instruction to the allocated accelerator slices, and receive, from each allocated accelerator slice, a corresponding status update response. The corresponding status update response is indicative of the progression of the allocated accelerator slice performing the respective sub-task. The coordinator slice is configured to generate a coordinated status update indicative of one or more received status update responses.
In an embodiment, the coordinator slice includes a data buffer, and the received command designates a physical address of the data buffer.
In an embodiment, the coordinator slice is further configured to determine, based on a command load of the distributed accelerator, an estimated completion time of the command. If the estimated completion time is below a wait threshold, the coordinator slice is configured to position the received command in a command queue. If the estimated completion time is above the wait threshold, the coordinator slice is configured to generate a rejection response.
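For purposes of illustration and not limitation, the admission decision may be sketched as follows in Python; the linear load model and the parameter names are assumptions made solely for this example.

    def admit_or_reject(command, queue, cycles_per_command, wait_threshold):
        # Estimate completion time from the current command load; a simple
        # linear model is assumed here purely for illustration.
        estimated = (len(queue) + 1) * cycles_per_command
        if estimated < wait_threshold:
            queue.append(command)   # position the command in the command queue
            return "queued"
        return "rejected"           # generate a rejection response instead

    # Example: with 3 queued commands at 100 cycles each, a wait threshold of
    # 500 admits the new command (estimated 400 cycles); a threshold of 300
    # rejects it.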
In an embodiment, the coordinator slice is further configured to identify an abort condition, determine one or more sub-tasks of the set of sub-tasks to be aborted, and transmit an abort instruction to each allocated accelerator slice associated with the determined one or more sub-tasks to be aborted.
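For purposes of illustration and not limitation, the abort propagation may be sketched as follows in Python, assuming a hypothetical abort() interface on each allocated slice and a mapping from sub-task identifiers to their allocated slices.

    def abort_sub_tasks(aborted_sub_task_ids, allocations):
        # allocations maps sub-task id -> allocated slice. Transmit an abort
        # instruction to each slice whose sub-task was determined to be aborted.
        for sub_task_id in aborted_sub_task_ids:
            allocations[sub_task_id].abort(sub_task_id)  # hypothetical interface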
In an embodiment, a method for performing a task by a distributed accelerator is provided. The method includes receiving a command that includes instructions for performing a task. One or more sub-tasks of the task are determined to generate a set of sub-tasks. For each sub-task of the set of sub-tasks, an accelerator slice of a plurality of accelerator slices of the distributed accelerator is allocated to perform the sub-task. For each sub-task of the set of sub-tasks, sub-task instructions are determined for performing the sub-task. For each sub-task of the set of sub-tasks, the sub-task instructions are transmitted to the allocated accelerator slice. A corresponding response is received from each allocated accelerator slice. Each corresponding response is indicative of the allocated accelerator slice having completed a respective sub-task. A coordinated response indicative of the corresponding responses is generated.
In an embodiment, the task includes instructions to move data from a first location to a second location. The determining the one or more sub-tasks of the task includes: determining a set of portions of the data and determining, for each portion of the set of portions of the data, a sub-task for moving the portion.
In an embodiment, a status update command that includes a request for progression of the task is received. A status update instruction is transmitted to the allocated accelerator slices. A corresponding status update response is received from each allocated accelerator slice. Each corresponding status update response is indicative of the progression of the allocated accelerator slice performing the respective sub-task. A coordinated status update indicative of the one or more received status update responses is generated.
In an embodiment, an estimated completion time of the command is determined based on a command load of the distributed accelerator. If the estimated completion time is below a wait threshold, the received command is positioned in a command queue. If the estimated completion time is above the wait threshold, a rejection response is generated.
In an embodiment, an abort condition is identified. One or more sub-tasks of the set of sub-tasks are determined to be aborted. An abort instruction is transmitted to each allocated accelerator slice associated with the determined one or more sub-tasks to be aborted.
In an embodiment, a coordinator slice is configured to allocate accelerator slices of a plurality of accelerator slices of a distributed accelerator to perform a task. The plurality of accelerator slices includes the coordinator slice. The coordinator slice is further configured to receive a command that includes instructions for performing the task and determine one or more sub-tasks of the task to generate a set of sub-tasks. For each sub-task of the set of sub-tasks, the coordinator slice is configured to allocate an accelerator slice of the plurality of accelerator slices of the distributed accelerator to perform the sub-task, determine sub-task instructions for performing the sub-task, and transmit the sub-task instructions to the allocated accelerator slice. The coordinator slice is configured to receive, from each allocated accelerator slice, a corresponding response indicative of the allocated accelerator slice having completed a respective sub-task. The coordinator slice is configured to generate a coordinated response indicative of the corresponding responses.
In an embodiment, the task includes instructions to move data from a first location to a second location. The coordinator slice is configured to determine the one or more sub-tasks of the task to generate the set of sub-tasks by determining a set of portions of the data and determining, for each portion of the set of portions of the data, a sub-task for moving the portion.
In an embodiment, the coordinator slice is further configured to receive a status update command that includes a request for progression of the task and transmit a status update instruction to the allocated accelerator slices. The coordinator slice is further configured to receive, from each allocated accelerator slice, a corresponding status update response indicative of the progression of the allocated accelerator slice performing the respective sub-task and generate a coordinated status update indicative of the one or more received status update responses.
While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be apparent to persons skilled in the relevant art that various changes in form and detail can be made therein without departing from the spirit and scope of the embodiments. Thus, the breadth and scope of the embodiments should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.