Multi-processor computer architectures capable of parallel computing operations were originally developed for supercomputers. Today, with modern microprocessors containing multiple processor “cores,” the principles of parallel computing have become relevant to both on-chip and distributed computing environments.
For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.
Semiconductor chips that include multiple computer processors have increased in complexity and scope to the point that on-chip communications may benefit from a routed packet network within the semiconductor chip. By using a same packet format on-chip as well as off-chip, a seamless fabric is created for high data throughput computation that does not require data to be re-packed and re-transmitted between devices.
To facilitate such an architecture, a multi-core chip may include a top level (L1) packet router for moving data inside the chip and between chips. All data packets are preceded by a header containing routing data. Routing to internal parts of the chip may be done by fixed addressing rules. Routing to external ports may be done by comparing the packet header against a set of programmable tables and/or registers. The same hardware can route internal-to-internal packets (loopback), internal-to-external packets (outbound), external-to-internal packets (inbound) and external-to-external packets (pass through). The routing framework supports a wide variety of geometries of chip connections, and allows execution-time optimization of the fabric to adapt to changing data flows.
However, as the number of processing elements within a system increases, there are several engineering challenges that need to be addressed. Two of the challenges are minimizing the processing bottlenecks and latency delays caused by multiple processors accessing memory at a same time, and assigning processing threads to processing elements. Early solutions placed the burden of assigning threads to processors on the software compiler. However, as the number of processing cores in a system may vary, compiler solutions are somewhat less flexible at run-time. Runtime solutions typically use one or more processors as dispatchers, keeping track of which processing elements are busy and which are free, and sending tasks to free processors for execution. Using a runtime solution, the burden on the compiler is reduced, since the compiler need only designate which threads can be run in parallel and which threads must be run sequentially.
While runtime solutions provide better utilization of processing elements, implementation can actually exacerbate the bottlenecks created by multiple processors overloading the memory bus with read requests. Specifically, each time a processing element is assigned to a new thread by a dispatcher, the processing element must fetch (or be sent) the executable code necessary to execute the thread specified by the dispatcher. The end result is a performance trade-off: improving the load balance across processors comes at the cost of the bus and memory bottlenecks created by the resulting code transfers.
Multiple first-in-first-out (FIFO) input and output hardware queues 118 are provided on the chip 100, each of which is assignable to serve as an input queue or an output queue. When configured as an input queue, the queue 118 is associated with a single “task.” A task comprises multiple executable instructions, such as the instructions for a routine, subroutine, or other complex operation.
Defined tasks are each assigned a task identifier or “tag.” When a task identifier is invoked during execution of a program by a processing element 134, a task descriptor is sent to a task distributor 114. The task descriptor includes the task identifier, any needed operands or data, and an address where the task result should be returned. The task distributor 114 identifies a nearby queue associated with one or more processing elements 134 configured to perform the task. The assigned queue may be on a same chip 100 as the processing element 134 running the software that invoked the task, or may be on another chip. Since the processing elements subscribed to input queues repeatedly perform the same tasks, they can locally store and execute the same code over-and-over, substantially reducing the communication bottlenecks created when a processing element must go and fetch code (or be sent code) for execution.
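For illustration only, a task descriptor of the kind described above might be represented as in the following sketch; the field names, widths, and fixed-size operand array are assumptions made for this example, not the actual descriptor format.

```c
#include <stdint.h>

/* Hypothetical layout of a task descriptor; field names, widths, and the
 * fixed-size operand array are assumptions made for illustration. */
typedef struct {
    uint16_t task_id;         /* task identifier ("tag") naming the task          */
    uint16_t operand_count;   /* number of valid entries in operands[]            */
    uint64_t return_address;  /* global address where the task result is returned */
    uint32_t operands[8];     /* operands and/or data needed by the task          */
} task_descriptor_t;
```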
Each input queue is affiliated with at least one subscribed processing element 134. The processing elements 134 affiliated with the input queues may each be loaded with a small scheduler program that is invoked after the processing element is idle for (or longer than) a specified/preset/predetermined duration (which may vary in length in accordance with the complexity of the task of the queue to which the processing element is currently affiliated/subscribed). When the scheduler program is invoked, the processing element 134 may unsubscribe from the input queue it was servicing and subscribe to a different input queue. In this way, processing elements can self-load balance independent of any central dispatcher.
In other words, it is not up to the main software program or a central dispatcher to assign work to a particular core (or possibly even to a particular chip). Instead, the chip 100 has some queues at a top level (in the network hierarchy), with each queue supporting one type of task at any time. To get a task done, a program deposits a descriptor of the task that needs to be done with a task distributor 114, which deposits the descriptor into the appropriate queue 118. The processing elements affiliated with the queue do the work, and typically produce output to some other queue (e.g., a queue 118 configured as an output queue).
Each hardware queue 118 has at least one event flag attached, so a processor core can sleep while waiting for a task to be placed in the queue, powering down and/or de-clocking operations. After a task descriptor is enqueued, at least one of the cores affiliated with that queue is awakened by the change in state of the event flag, causing the processor core to retrieve (dequeue) the descriptor and to start processing the operands and/or data it contains, using the locally-stored executable task code.
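A minimal sketch of that sleep-until-enqueued behavior is shown below, reusing the task_descriptor_t type from the earlier sketch. The wait_for_event_flag(), dequeue(), run_local_task(), and send_result() primitives are hypothetical stand-ins for the event flag hardware, the queue dequeue operation, the locally stored task code, and the packet write of the result.

```c
#include <stdint.h>

/* Hypothetical hardware primitives; names and signatures are assumptions. */
extern void     wait_for_event_flag(int queue_id);            /* sleep/de-clock until the flag changes */
extern void     dequeue(int queue_id, task_descriptor_t *d);  /* pop a descriptor from the queue       */
extern uint32_t run_local_task(const task_descriptor_t *d);   /* execute locally stored task code      */
extern void     send_result(uint64_t return_address, uint32_t result);

/* Main loop of a core subscribed to an input queue. */
void service_input_queue(int queue_id)
{
    for (;;) {
        wait_for_event_flag(queue_id);          /* core sleeps until a descriptor is enqueued */
        task_descriptor_t d;
        dequeue(queue_id, &d);                  /* retrieve (dequeue) the descriptor          */
        uint32_t result = run_local_task(&d);   /* process the operands and/or data           */
        send_result(d.return_address, result);  /* return the result per the descriptor       */
    }
}
```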
As noted, the hardware queues 118 may be configured as input queues or output queues. Dedicated input queues and dedicated output queues may also/instead be provided. When a task is finished, the last processing element to execute a portion of the assigned task or chain of tasks may deposit the results in an output queue. These output queues can generate event flags that produce externally visible (e.g., electrical) signals, so a host processor or other hardware (e.g., logic in an FPGA) can retrieve the finished result.
In the example in
Memory within a system including the processor chip 100 may also be hierarchical, and memory of different tiers may be physically different types of memory. Each processing element 134 may have a local program memory containing instructions that will be fetched by the core's micro-sequencer and loaded into the instruction registers for execution in accordance with a program counter. Processing elements 134 within a cluster 128 may also share a cluster memory 136, such as a shared memory serving a cluster 128 including eight processor cores 134a-134h. While a processor core may experience no latency (or a latency of one-or-two cycles of the clock controlling timing of the instruction pipeline) when accessing its own operand registers, accessing global addresses external to a processing element 134 may experience a larger latency due to (among other things) the physical distance between the addressed component and the processing element 134. As a result of this additional latency, the time needed for a processor core to access an external main memory, a shared cluster memory 136, and the registers of other processing elements may be greater than the time needed for a core to access its own execution registers.
Each tier in the architecture hierarchy may include a router. The top-level (L1) router 102 may have its own clock domain and be connected by a plurality of asynchronous data busses to multiple clusters of processor cores on the chip. The L1 router may also be connected to one or more external-facing ports that connect the chip to other chips, devices, components, and networks. The chip-level router (L1) 102 routes packets destined for other chips or destinations through the external ports 103 over one or more high-speed serial busses 104a, 104b. Each serial bus 104 comprises at least one media access control (MAC) port 105a, 105b and a physical layer hardware transceiver 106a, 106b.
The L1 router 102 routes packets to and from a primary general-purpose memory for the chip through a supervisor port 107 to a memory supervisor 108 that manages the general-purpose memory. Packets to-and-from lower-tier components are routed through internal ports 121.
Each of the superclusters 122a-122d may be interconnected via an inter-supercluster router (L2) 120 which routes transactions between superclusters and between a supercluster 122 and the chip-level router (L1) 102. Each supercluster 122 may include an inter-cluster router (L3) 126 which routes transactions between each cluster 128 in the supercluster 122, and between a cluster 128 and the inter-supercluster router (L2) 120. Each cluster 128 may include an intra-cluster router (L4) 132 which routes transactions between each processing element 134 in the cluster 128, and between a processing element 134 and the inter-cluster router (L3) 126. The level 4 (L4) intra-cluster router 132 may also direct packets between processing elements 134 of the cluster and a cluster memory 136. Tiers may also include cross-connects (not illustrated) to route packets between elements in a same tier in the hierarchy.
When data packets arrive in one of the routers, the router examines the header at the front of each packet to determine the destination of the packet's data payload. Each chip 100 is assigned a unique device identifier (“device ID”). Packet headers received via the external ports 103 each identify a destination chip by including the device ID in an address contained in the packet header. Packets that are received by the L1 router 102 that have a device ID matching that of the chip containing the L1 router are routed within the chip using a fixed pipeline to the supervisor 108 or through one of the internal ports 121 linked to a cluster of processor cores within the chip. When packets are received with a non-matching device ID by the L1 router 102, the L1 router 102 uses programmable routing to select an external port and relay the packet back off the chip.
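The two routing paths can be summarized by the following sketch; the header structure, the THIS_DEVICE_ID value, and the routing hooks are illustrative assumptions rather than the router's actual interface.

```c
#include <stdint.h>

/* Hypothetical packet header fields and routing hooks; names are illustrative. */
typedef struct {
    uint16_t device_id;    /* destination chip ("device ID")      */
    uint8_t  cluster_id;   /* destination cluster within the chip */
    uint8_t  pe_id;        /* destination processing element      */
} packet_header_t;

extern uint16_t THIS_DEVICE_ID;                               /* this chip's unique device ID        */
extern void route_internal_fixed(const packet_header_t *h);  /* fixed pipeline: supervisor/internal */
extern int  lookup_external_port(const packet_header_t *h);  /* programmable tables/registers       */
extern void route_external(int port, const packet_header_t *h);

void l1_route(const packet_header_t *h)
{
    if (h->device_id == THIS_DEVICE_ID)
        route_internal_fixed(h);                     /* matching device ID: route within the chip */
    else
        route_external(lookup_external_port(h), h);  /* non-matching: relay back off the chip     */
}
```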
When a program invokes a task, the invoking processing element 134 sends a packet comprising a task descriptor to the local task distributor 114. The L1 router 102 and/or the L2 router 120 include a task port 113a/113b and a queue port 115a/115b. The routers route the packet containing the task descriptor via the task port 113 to the task distributor 114, which examines the task identifier included in the descriptor and determines the queue 118 to which to assign the task. The assigned queue 118 may be on the chip 100, or may be a queue on another chip. If the task is to be deposited in a queue on the chip, the task distributor 114 transfers the descriptor to the queue router 116, which deposits the descriptor in the assigned queue. Otherwise, the task descriptor is routed to the other chip that contains the assigned queue 118.
The queue port 115a is used by the L1 router 102 to route descriptors that have been assigned by the task distributor 114 on another chip to the designated input queue 118 via the queue router 116. When queues 118 are configured as output queues, the processing elements 134 may retrieve task results from the output queue via the queue port 115a/115b using read/get requests, routed via the L1 and/or L2 routers.
Cross-connects (not illustrated) may be provided to signal when there is data enqueued in the I/O queues, which the processing elements 134 can monitor. For example, an eight-bit bus may be provided, where each bit of the bus corresponds to one of the I/O queues 118a-118f. When a queue is configured as an output queue, a processing element 134 may monitor the bit line corresponding to the queue while awaiting task results, and retrieve the results after the bit line is asserted. When a queue is configured as an input queue, subscribed processing elements 134 may monitor the bit line corresponding to the queue for the availability of tasks awaiting processing.
Selecting the task input queue and obtaining its address may be performed as plural steps or may be combined as a single step. Depending upon how the task distributor 114/114′ is implemented, a task input queue may be selected and then an address/identifier may be obtained for the selected task input queue, or the addresses/identifiers of one or more task input queues may be obtained and then the task input queue may be selected. Process combinations may also be used to select a queue and obtain a queue address/identifier, such as selecting several candidate input task queues, obtaining their addresses/identifiers, and then selecting an input task queue based on its address/identifier.
In the example in
The parser 270 may optionally include additional functionality. For example, it is possible to compress the task descriptor 230a (e.g., using Huffman encoding). In such a case, the parser 270 may be responsible for de-compressing any data that precedes the task identifier 232 to find the offset at which the task identifier 232 starts, then transmitting the task identifier 232 to the CAM 252. In such a design, the CAM 252 might use either the compressed or un-compressed form of the task identifier 232 as its key. In the latter case, the parser 270 would also be responsible for de-compressing the task identifier 232 prior to transmitting it to the CAM 252.
The assembler 274 is roughly a mirror image of the parser 270. Where the parser 270 extracts a task identifier 232 that indirectly refers to a task queue, the assembler 274 re-assembles an output packet (queue assignment 242) that describes the task with a header 202b that includes a physical or virtual address of a selected queue based on the address/identifier 210, where the header address is for the selected input queue that can carry out the type of task denoted by the task identifier 232. The payload of the output packet comprises the parameters and data 233. The assembler 274 receives the address/identifier 210 of the selected input queue from the CAM 252 and the task parameters and data 233 from the parser 270. Various approaches may be used by the assembler 274 to assemble the output packet 242. For example, the parser 270 may send the task descriptor 230a to the assembler, and the assembler 274 may overwrite the bits corresponding to the task identifier 232 with the header address, or the assembler 274 may concatenate the header address with the task parameters and data 233.
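As a sketch, and under the assumption that the routable descriptor 230b is simply the resolved queue address (header 202b) followed by the unchanged parameters and data 233, the assembler's concatenation approach reduces to the following; the flat byte-buffer layout is an assumption made only for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Assemble an output packet by concatenating the selected queue's address
 * (from the CAM lookup) with the task parameters and data. */
size_t assemble_queue_assignment(uint64_t queue_address,
                                 const uint8_t *params, size_t params_len,
                                 uint8_t *out)
{
    memcpy(out, &queue_address, sizeof queue_address);       /* header: selected input queue */
    memcpy(out + sizeof queue_address, params, params_len);  /* payload: parameters and data */
    return sizeof queue_address + params_len;
}
```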
The assembler 274 may also include additional functionality. For example, if a compressed format is being used, the assembler 274 may re-compress some or all the task parameters and data 233 contained in the routable task descriptor 230b. The assembler 274 could also rearrange the data, or carry out other transformations such as converting a compressed format to an uncompressed format or vice versa.
In
In the case of a large system, the hash table 220 may be a distributed hash table, so one type of task has queues distributed throughout the system. A task request 240 causes the controller 216 to apply a distributed hash function to produce a hash that would find a “nearby” queue for that task, where the nearby queue should be reachable from the task distributor with a latency that is less than (or tied for smallest) the latency to reach other queues associated with the same task. Expected latency may be determined, among other ways, based on the minimum number of “hops” (e.g., intervening routers) to reach each queue from the task distributor 114′. The controller 216 outputs a packet containing the queue assignment 242, replacing the destination address in the header with the address of the assigned queue, as discussed in connection with
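A sketch of that “nearby” queue selection is below; the candidate list and the hop_count() helper (standing in for the routing-table-derived hop metric) are assumptions used only to make the minimum-hop choice explicit.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical candidate entry produced by the distributed hash lookup. */
typedef struct {
    uint64_t queue_address;  /* global address of a queue serving this task */
} queue_candidate_t;

extern unsigned hop_count(uint64_t from_address, uint64_t to_address); /* routing-table-derived */

/* Select the reachable queue with the fewest intervening routers ("hops")
 * from the task distributor; assumes at least one candidate, and ties
 * resolve to the first candidate found. */
uint64_t select_nearby_queue(uint64_t distributor_address,
                             const queue_candidate_t *candidates, size_t n)
{
    uint64_t best = candidates[0].queue_address;
    unsigned best_hops = hop_count(distributor_address, best);
    for (size_t i = 1; i < n; i++) {
        unsigned h = hop_count(distributor_address, candidates[i].queue_address);
        if (h < best_hops) {
            best = candidates[i].queue_address;
            best_hops = h;
        }
    }
    return best;
}
```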
“Hop” information may be determined, among other ways, from a routing table. The routing table may, for example, be used to allocate addresses that indicate locality to at least some degree. Distributed hashing frequently uses a very large (and very sparse) address space “overlaid” on top of a more typical addressing scheme like Internet Protocol (IP). For example, a hash might produce a 160-bit “address” that is later transformed to a 32-bit IPv4 address. With a logical address space like this, the allocation of addresses may be tailored to the system topology, such that the address itself provides an indication of that node's locality (e.g., assuming a backbone plus branches, the address could directly indicate a node's position on the backbone and distance from the backbone on its branch).
Hop information can be used with the CAM 252 as well. However, given the expense of storage in a CAM and the advantage of keeping that data to a minimum, each CAM 252 will ordinarily store just one “best” result for a given tag lookup.
Resolution process 290b may be used by an address resolver 272a (an example of address resolver 272 in
Since the number of nodes/chips 100 in a system may vary dynamically, when a node is added or removed, a distributed hash function (e.g., 280, 281) may be recomputed and redistributed to all the task distributors 114′. Other options include leaving the function 280/281 itself unchanged but modifying data that it uses internally (not illustrated, but which may also be stored in registers/memory 250), or leaving the function 280/281 alone but modifying the address lookup data (e.g., hash table 220). Choosing between modifying the hash function's data and modifying the lookup data is often a fairly open choice, and depends in part on how the hash function is structured and implemented (e.g., implemented in hardware, implemented as processor-executed software, etc.).
To optimize results for locality within the system, it is advantageous to produce a final address result that is based on location (relative to the topology of interconnected devices 100). The hash functions 280/281 used by the task distributors 114′ may be the same throughout the system, or may be localized, depending upon whether localization data is updated by updating the hash function 280/281, its internal data, or its lookup table 220. For example, the distributed hash tables 220, sorted tables 221, and/or data used by the functions stored in one or more registers may be updated each time a device/node 100 is added or removed from the system.
As an alternative to a hash function, a lookup table may be used to store a tag 232, and with it an address/queue identifier 210. If the table is sorted by tag 232, an interpolating search 282 may be used to search a small table, or a binary search 283 may be used to search a large table. Resolution processes 290d may be used by an address resolver 272c (an example of address resolver 272 in
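For the binary-search variant, a sketch of the tag lookup over a table sorted by tag is shown below; the entry layout is an assumption.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sorted-table entry: a task tag paired with the
 * address/identifier of an input queue configured for that task. */
typedef struct {
    uint16_t tag;            /* task identifier 232    */
    uint64_t queue_address;  /* address/identifier 210 */
} tag_entry_t;

/* Binary search over a table sorted by tag; returns 1 and writes the queue
 * address to *out when the tag is found, and returns 0 otherwise. */
int lookup_queue(const tag_entry_t *table, size_t n, uint16_t tag, uint64_t *out)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].tag < tag)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < n && table[lo].tag == tag) {
        *out = table[lo].queue_address;
        return 1;
    }
    return 0;
}
```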
If the hash function 280/281 or search function 282/283 is implemented in hardware, the logic providing the function 280-283 may be fixed, with updates being to table values (e.g., 220/221) and/or to other registers storing values used by the function, separate from the logic. If the function is implemented as processor-executed software, the software (as stored in memory) may be updated, table values (e.g., 220/221) may be updated, and/or registers storing values used by the function may be updated. Also, the type of function and nature of the tables may be changed as the system scales, selecting a function 280-283 optimized for the scale of the topology.
Choosing between address resolution techniques depends on factors that are not specific to the task queues 118 themselves and that are fairly well known in the art. Hash tables 220 typically have O(1) expected lookup complexity but O(N) worst case, and deletion is often more expensive and sometimes unsupported entirely. Sorted tables 221 with binary search 283 offer O(log N) lookup, and O(N) insertion or deletion. Sorted tables 221 with interpolating search 282 improve search complexity to O(log log N), but insertion or deletion is still typically O(N). A self-balancing binary search tree may be used for O(log N) insertion, deletion, or lookup. In a small system, all of the table-based address resolution approaches should be adequate, as the tables involved are relatively small.
Each time the data and/or functions used by the controllers 214/216 is updated, one-or-more processing elements 134 on the chip 100 may load and launch a queue update program. In conjunction with the task distributor 114/114′, the queue update program may determine the input queue address/identifier 210 for each possible task ID 232, and determine whether any of those addresses/identifiers are for I/O queues 118 on the device 100 containing the task distributor 114/114′. The queue update program then configures each queue for the assigned task (if not already configured), and configures at least one processing element 134 to subscribe to each input queue.
As illustrated in
The structure of the physical address 310 in the packet header 302 may vary based on the tier of memory being addressed. For example, at a top tier (e.g., M=1), a device-level address 310a may include a unique device identifier 312 identifying the processor chip 100 and an address 320a corresponding to a location in main-memory. At a next tier (e.g., M=2), a cluster-level address 310b may include the device identifier 312, a cluster identifier 314 (identifying both the supercluster 122 and cluster 128), and an address 320b corresponding to a location in cluster memory 136. At the processing element level (e.g., M=3), a processing-element-level address 310c may include the device identifier 312, the cluster identifier 314, a processing element identifier 316, an event flag mask 318, and an address 320c of the specific location in the processing element's operand registers, program memory, etc. Global addressing may accommodate both physical and virtual addresses.
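The three formats might be modeled as in the following sketch; the field widths and orderings are assumptions, since the actual bit-level packing is not specified here.

```c
#include <stdint.h>

/* Illustrative modeling of the tiered address formats; widths are assumptions. */
typedef struct {               /* device-level address 310a (M=1)              */
    uint16_t device_id;        /* 312: unique chip identifier                  */
    uint32_t main_mem_addr;    /* 320a: location in main memory                */
} device_level_addr_t;

typedef struct {               /* cluster-level address 310b (M=2)             */
    uint16_t device_id;        /* 312                                          */
    uint8_t  cluster_id;       /* 314: supercluster and cluster                */
    uint32_t cluster_mem_addr; /* 320b: location in cluster memory 136         */
} cluster_level_addr_t;

typedef struct {               /* processing-element-level address 310c (M=3)  */
    uint16_t device_id;        /* 312                                          */
    uint8_t  cluster_id;       /* 314                                          */
    uint8_t  pe_id;            /* 316: processing element                      */
    uint8_t  event_flag_mask;  /* 318: which event flag (if any) to set        */
    uint16_t local_addr;       /* 320c: operand register, program memory, etc. */
} pe_level_addr_t;
```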
The event flag mask 318 may be used by a packet to set an “event” flag upon arrival at its destination. Special purpose registers within the execution registers of each processing element may include one or more event flag registers, which may be used to indicate when specific data transactions have occurred. So, for example, a packet header designating an operand register of a processing element 134 may indicate to set an event flag upon arrival at the destination processing element. A single event flag bit may be associated with all the registers, or with a group of registers. Each processing element 134 may have multiple event flag bits that may be altered in such a manner. Which flag is triggered may be configured by software, with the flag to be triggered designated within the arriving packet. A packet may also write to an operand register without setting an event flag, if the packet event flag mask 318 does not indicate to change an event flag bit.
The normal return indicator 434 and error reporting address 436 may indicate a memory or register address, the address of an output queue, or the address of any reachable component within the system. “Returning” results data to a location specified by the normal return indicator 434 includes causing the results data to be written, saved, stored, and/or enqueued to the location.
An additional task bit 433a is appended onto the first task identifier 432a, and indicates that there are additional tasks after the first task. An additional task bit 433b is appended onto the second task identifier 432b, and indicates that there are additional tasks after the second task. An additional task bit 433c is appended onto the third task identifier 432c, and indicates that there are no further tasks after the third task. The use of task chaining using the task descriptor format 430b will be discussed further below in connection with
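As a sketch, assuming each chain entry packs a task identifier above a low-order additional-task bit (an encoding chosen only for illustration), walking such a chain looks like the following.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative packing: bits [15:1] hold the task identifier 432 and
 * bit [0] is the additional-task bit 433 (1 = more tasks follow). */
#define MORE_TASKS_BIT 0x1u

static inline uint16_t entry_task_id(uint16_t entry) { return entry >> 1; }

/* Count the tasks in a chain by walking entries until an entry's
 * additional-task bit is clear (no further tasks follow). */
size_t count_chained_tasks(const uint16_t *entries)
{
    size_t n = 0;
    do {
        n++;
    } while (entries[n - 1] & MORE_TASKS_BIT);
    return n;
}
```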
The general purpose registers 760 include a front pointer register 762 containing the front pointer 532, a back pointer register 763 containing the back pointer 533, a depth register 764 containing the current depth of the queue, and several event flag registers 764. Among the event flag registers is an empty flag 765, indicating that the queue is empty. When the empty flag 765 is de-asserted, indicating that there is at least one descriptor enqueued in the queue 118, a data-enqueued interrupt signal may be sent to subscribed processors (input queue) or a processor awaiting results (output queue), signaling them to wake and dequeue a descriptor or result. The data-enqueued interrupt signal can be generated by an inverter (not illustrated) that has its input tied to the output of the AND gate 755 or to the empty flag 765. Another event flag 764 is the full flag 766. When the full flag 766 is asserted, the data transaction interface 720 can output a back-pressure signal to the queue router 116. Assertion of a back-pressure signal may result in error reporting (in accordance with the error reporting address 436) if a task arrives for a full queue. The queue router 116 may also include an arbiter to reassign the descriptor received for the full queue to another input queue attached to the queue router 116 that is configured to perform a same task (if such a queue exists).
If configured as an output queue, the event flags 764 may be masked so that when results data is enqueued, an interrupt is generated indicating to a waiting (or sleeping) processing element 134 that a result has arrived. Likewise, processing elements subscribed to an input queue can set a mask so that a data-enqueued signal from the subscribed queue causes an interrupt, but data-enqueued signals from other queues are ignored. Instead of an “empty” flag register 765, a “data available” flag register may be used, replacing the AND gate 755 with a NAND gate. In that case, the data-enqueued interrupt signal can be generated in accordance with the output of the NAND gate, or the state of the data available flag register.
The input queue registers 767 are used by processing elements to subscribe and unsubscribe to the queue. A register 768 indicates how many processing elements 134 are subscribed to the queue. Each queue always has at least one subscribed processing element, so if an idle processing element goes to unsubscribe but it is the only subscribed processing element, then the processing element remains subscribed. When new processing elements subscribe to the queue, the number in the register 768 is incremented. Also, when a new processing element subscribes to a queue, it determines the start address where the executable instructions for the task are in memory (e.g., 780) from a program memory address register 769. The newly subscribed processing element then loads the task program into its own program memory.
When a descriptor 430 or the address 440/450 of a descriptor is received by the queue 118 for enqueuing, a data transaction interface 720 asserts a put signal 731, causing a write circuit 733 to save/store the descriptor 430 or address 440/450 into the stack 570 at a write address 734 determined based on the back pointer 533. For example, the back pointer 533 may specify the most significant bits corresponding to the slot 572 where the descriptor 430 is to be stored. The write circuit 733 may write (i.e., save/store) an entirety of a descriptor 430 as a parallel write operation, or may write the descriptor in a series of operations (e.g., one word at a time), toggling a write strobe 735 and incrementing the least significant bits of the write address 734 until an entirety of the descriptor 430 is stored.
After the descriptor 430 or descriptor address 440/450 is stored, the data transaction interface 720 de-asserts the put signal 731, causing a counter 737 to increment the back pointer on the falling edge of the put signal 731 and causing a counter 757 to increase the depth value. The counter 737 counts up in a loop, with the maximum count equaling the number of slots 572 in the stack 570. When the count exceeds the maximum count, a carry signal may be used to reset the counter 737, such that the counter 737 operates in a continual loop.
When a descriptor 430 is to be dequeued by a subscribing processing element 134, the data transaction interface 720 asserts a get signal 741, causing a read circuit 743 to read the descriptor 430 or descriptor address 530 from the stack at a read address 744 determined based on the front pointer 532. For example, the front pointer 532 may specify the most significant bits corresponding to the slot 572 where the descriptor 430 is stored. The read circuit 743 may read an entirety of the descriptor 430 as a parallel read operation, or may read the descriptor 430 as a series of reads (e.g., one word at a time).
After the descriptor 430 or descriptor address 440/450 is dequeued, the data transaction interface 720 de-asserts the get signal 741, causing a counter 747 to increment the front pointer 532 on the falling edge of the get signal 741 and causing a counter 757 to decrease the depth value. The counter 747 counts up in a loop, with the maximum count equaling the number of slots 572. When the count exceeds the maximum count, a carry signal may be used to reset the counter 747, such that the counter 747 operates in a continual loop.
The empty flag 765 may be set by a circuit composed of a comparator 753, an inverter 754, and an AND gate 755. The comparator 753 determines when the front pointer 532 equals the back pointer 533. The inverter 754 receives the queue-full signal as input. The AND gate 755 receives the outputs of the comparator 753 and the inverter 754. When the front and back pointers are equal and the full signal is not asserted, the output of the AND gate 755 is asserted, indicating that the queue is empty. Depending upon how the counters 737, 747, 757 manage their output while asserting their “carry” signals, it may be possible for the front and back pointers to be equal when the queue is full. The inverter 754 and AND gate 755 provide for that eventuality, so that when the front and back pointers are equal and the full signal is also asserted, the output of the AND gate 755 is de-asserted, indicating that the queue is not empty. As an alternative way to determine when the queue is empty, a comparator may compare the depth 764 to zero. The full flag 766 may be set by the carry output of the counter 757, or a comparator may compare the depth 764 to the depth value corresponding to full.
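A software analogue of the put/get behavior described above (write at the back pointer then increment, read at the front pointer then increment, with a depth counter driving the empty and full indications as in the depth-based alternative) is sketched below; the slot count and descriptor width are assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Software analogue of the hardware queue; SLOT_COUNT and the 64-bit
 * descriptor slot are assumptions made for this sketch. */
#define SLOT_COUNT 16

typedef struct {
    uint64_t slots[SLOT_COUNT];  /* stack 570: one descriptor (or address) per slot */
    unsigned front;              /* front pointer 532 */
    unsigned back;               /* back pointer 533  */
    unsigned depth;              /* depth register    */
} fifo_t;

static bool fifo_full(const fifo_t *q)  { return q->depth == SLOT_COUNT; }
static bool fifo_empty(const fifo_t *q) { return q->depth == 0; }

/* Write-and-then-increment, mirroring the put signal behavior. */
bool fifo_put(fifo_t *q, uint64_t descriptor)
{
    if (fifo_full(q))
        return false;                        /* would assert back-pressure  */
    q->slots[q->back] = descriptor;          /* write at the back pointer   */
    q->back = (q->back + 1) % SLOT_COUNT;    /* counter 737 wraps in a loop */
    q->depth++;                              /* counter 757 increments      */
    return true;
}

/* Read-and-then-increment, mirroring the get signal behavior. */
bool fifo_get(fifo_t *q, uint64_t *descriptor)
{
    if (fifo_empty(q))
        return false;
    *descriptor = q->slots[q->front];        /* read at the front pointer   */
    q->front = (q->front + 1) % SLOT_COUNT;  /* counter 747 wraps in a loop */
    q->depth--;                              /* counter 757 decrements      */
    return true;
}
```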
Although the queue 118 uses a write-and-then-increment arrangement for the back pointer and a read-and-then-increment arrangement for the front pointer, the queue may instead use an increment-and-then-write and increment-and-then-read arrangement. In that case, the counter 737 increments on the leading edge of the put signal 731, and the counter 747 increments on the leading edge of the get signal 741.
Also, instead of having both the counters 737 and 747 increment on the falling edge or increment on the leading edge, one may increment on the falling edge while the other increments on the leading edge. For example, the front pointer 532 may be incremented on the falling edge of the get signal 741, such that the front pointer 532 points to the slot that is currently at the front of the queue, whereas the back pointer 533 may be incremented on the leading edge of the put signal 731, such that the back pointer 533 points to one slot behind the slot that will be used for the next write. In such an arrangement, when the stack 570 is empty, the front pointer and back pointer will not be equal. As a consequence, a comparison of the front and back pointers by the comparator 753 will not indicate whether the stack 570 is empty. In that case, whether the stack 570 is or is not empty may be determined from the depth 764 (e.g., comparing the depth value to zero).
Whether the counter 757 increments and decrements the depth on the falling or leading edges may be independent of the arrangement used by the counters 737 and 747. If the counter 757 increments and decrements on the leading put/get signal edges, subscribed or monitoring processing elements 134 may begin to dequeue a descriptor or descriptor address while it is being enqueued, since the data-enqueued interrupt signal may be generated before enqueuing is complete, thereby accelerating the enqueuing and dequeuing process. To accommodate simultaneous enqueuing and dequeuing from a same slot of the stack 570, the memory/registers used for the stack 570 may be dual-ported. Dual-ported memory cells/registers can be read via one port and written to via another port at a same time. In comparison, if the counter 757 increments and decrements on the falling put/get signal edges (as illustrated in
The front pointer 532, the back pointer 533, the depth value, empty flag, and full flag are illustrated in
Although
A data transaction interface 872 sends and receives packets and connects the processor core 890 to its associated program memory 874. The processor core 890 may be of a conventional “pipelined” design, and may be coupled to sub-processors such as an arithmetic logic unit 894 and a floating point unit 896. The processor core 890 includes a plurality of execution registers 880 that are used by the core 890 to perform operations. The registers 880 may include, for example, instruction registers 882, operand registers 884, and various special purpose registers 886. These registers 880 are ordinarily for the exclusive use of the core 890 for the execution of operations. Instructions and data are loaded into the execution registers 880 to “feed” an instruction pipeline 892. While a processor core 890 may experience no latency (or a latency of one-or-two cycles of the clock controlling timing of a micro-sequencer 891) when accessing its own execution registers 880, accessing memory that is external to the core 890 may produce a larger latency due to (among other things) the physical distance between the core 890 and the memory.
The instruction registers 882 store instructions loaded into the core that are being/will be executed by an instruction pipeline 892. The operand registers 884 store data that has been loaded into the core 890 that is to be processed by an executed instruction. The operand registers 884 also receive the results of operations executed by the core 890 via an operand write-back unit 898. The special purpose registers 886 may be used for various “administrative” functions, such as being set to indicate divide-by-zero errors, to increment or decrement transaction counters, to indicate core interrupt “events,” etc.
The instruction fetch circuitry of a micro-sequencer 891 fetches a stream of instructions for execution by the instruction pipeline 892 in accordance with an address generated by a program counter 893. The micro-sequencer 891 may, for example, fetch an instruction every “clock” cycle, where the clock is a signal that controls the timing of operations by the micro-sequencer 891 and the instruction pipeline 892. The instruction pipeline 892 comprises a plurality of “stages,” such as an instruction decode stage, an operand fetch stage, an instruction execute stage, and an operand write-back stage. Each stage corresponds to circuitry.
The chip's firmware may include a small scheduler program 883. When a core 890 waits too long (an exact duration may be specified in a register, e.g., based on a number of clock cycles) for a task to show up in its queue, the core 890 wakes up and runs the scheduler 883 to find some other queue with tasks for it to execute, and thereafter begins executing those tasks. The scheduler program 883 may be loaded into the instruction registers 882 of processing elements 134 subscribed to a task queue when the processing element's idle counter 887 indicates that the threshold duration of time has transpired (e.g., that the requisite number of clock cycles have elapsed). The scheduler program 883 may either be preloaded into the processing element 134, or loaded upon expiration of the idle counter 887. The idle counter 887 causes generation of an interrupt resulting in the micro-sequencer 891 executing the scheduler 883, causing the processing element 134 to search through the (currently in-use) queues, and find a queue with tasks that need execution. Once it finds a new queue, it unsubscribes from the old queue (decrementing the number in register 768), subscribes to the new queue (incrementing the number in register 768), fetches the program address from register 769 of the new queue, and loads the task program code into its own program memory 874.
After a processor 134c subscribed to the task 3 input queue 118.3d becomes free and determines from the empty flag 765 that there is a descriptor 430b waiting to be dequeued, the processor 134c retrieves 1006 the descriptor from the queue 118.3d. Upon completion of the task, the processor 134c writes 1010 (by packet) the result to an output queue 118h on the processor chip 100b in accordance with the normal return indicator 434. The output queue 118h generates an event signal 1012, waking the processor 134a (if in a low power mode), and causing the processor 134a to retrieve 1014 the results from output queue 118h.
After a processor 134c subscribed to the task 3 input queue 118.3d becomes free and determines from the empty flag 765 that there is a descriptor 430b waiting to be dequeued, the processor 134c retrieves 1106 the descriptor from the queue 118.3d. Upon completion of the task, the processor 134c writes 1110 (by packet) the result directly to operand registers 884 or program memory 874 of the processing element 134a in accordance with the normal return indicator 434.
After a processor 134e subscribed to the task 4 input queue 118.4d becomes free and determines from the empty flag 765 that there is a descriptor 430b waiting to be dequeued, the processor 134e retrieves 1206 the descriptor from the queue 118.4d. Upon completion of the task, the processor 134e writes 1210 (by packet) the result to a task distributor 114d on the processor chip 100d as a Task 1 request as part of a chained task request. The task distributor 114d sends 1212 the Task 1 assignment to the Task 1 input queue 118.1a on processor chip 100a.
After a processor 134a subscribed to the task 1 input queue 118.1a becomes free and determines from the empty flag 765 that there is a descriptor 430b waiting to be dequeued, the processor 134a retrieves 1214 the descriptor from the queue 118.1a. Upon completion of the task, the processor 134a writes 12120 (by packet) the result to an output queue 118h on processor chip 100b, in accordance with the normal return indicator 434. The output queue 118h generates an event signal 1230, waking the processor 134a (if in a low power mode), and causing the processor 134a to retrieve 1234 the results from output queue 118h.
After a processor 134e subscribed to the task 4 input queue 118.4d becomes free and determines from the empty flag 765 that there is a descriptor 430b waiting to be dequeued, the task 4 processor 134e retrieves 1406 the descriptor from the queue 118.4d. In this example, task 4 itself uses task 1 as a subroutine, resulting in the task 4 processor 134e sending 1410 a task 1 request to the task distributor 114d on the processor chip 100d. The task distributor 114d sends 1412 the task 1 assignment to the task 1 input queue 118.1a on processor chip 100a.
After a processor 134a subscribed to the task 1 input queue 118.1a becomes free and determines from the empty flag 765 that there is a descriptor 430b waiting to be dequeued, the processor 134a retrieves 1414 the descriptor from the queue 118.1a. Upon completion of the task, the task 1 processor 134a writes 1420 (by packet) the result directly to the task 4 processor 134e that issued the task 1 request. The task 4 processor 134e thereafter completes task 4, using the task 1 data. Upon completion, the task 4 processor 134e writes 1422 (by packet) the result to an output queue 118h on processor chip 100b, in accordance with the normal return indicator 434. The output queue 118h generates an event signal 1430, waking the originating processor 134a (if in a low power mode), and causing the originating processor 134a to retrieve 1434 the results from output queue 118h.
Initially, a task processor 134a is subscribed to a task 1 input queue 118.1a, which has two subscribed cores (as specified in register 768). The queue depth (from register 764) is initially zero. After the queue 118.1a receives 1520 a task, the task processor 134a dequeues 1522 the task descriptor and executes 1524 the task, returning the results in accordance with the normal return indicator 434. The task processor 134a starts 1526 its idle counter 887 and may enter into a low power mode, waiting for an interrupt from the subscribed task queue indicating that a descriptor is ready to be dequeued. When the counter expires 1528 or reaches a specified value, the task processor 134a runs the scheduler program 883, which determines 1530 whether there is more than one core subscribed to the task 1 queue 118.1a (from register 768), such that the scheduler program 883 is permitted to choose a new input queue. If there is not (1530 “No”) more than one processor subscribed to the task 1 queue 118.1a, the processor 134a continues to wait 1534 for a new task 1 descriptor to appear in the input queue 118.1a. Otherwise, the scheduler program 883 checks other input queues on the device to determine 1532 whether the depth of any of the other queues exceeds a minimum threshold depth “R”. The threshold depth is used to reduce the frequency with which processors unsubscribe and subscribe from and to input queues, since each new subscription results in memory being accessed to retrieve the task program executable code.
If none of the depths of the other input queues exceed “R” (1532 “No”), the processor remains subscribed to the task 1 queue 118.1a. Otherwise, the scheduler 883 selects 1536 a new input queue. For example, the scheduler 883 may select the input queue with the greatest depth, or select from among input queues tied for the greatest depth. The scheduler 883 unsubscribes 1538 from the task 1 queue 118.1a, decrementing register 768. The scheduler then subscribes 1540 to the task 2 input queue 118.2a, which had the largest depth of the task input queues on the device. The scheduler 883 then loads 1542 the task 2 program to the program memory 874 of the processing element 134a, based on the program address in the register 769 of the task 2 queue 118.2a. After the task 2 program is loaded, the task processor 134a resumes normal operations, retrieving 1544 a task 2 descriptor from the task 2 queue 118.2a, and executing that task 1546. The task processor 134a will continue executing that same retrieved program until such time that its idle counter expires again without a task becoming available.
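A sketch of this scheduler decision, written against hypothetical accessors for the queue registers described earlier (queue_depth() for register 764, subscriber_count() and the subscribe/unsubscribe operations for register 768, program_address() for register 769, and load_program() for the copy into program memory 874), is shown below.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical hardware accessors; names and signatures are assumptions. */
extern unsigned queue_depth(int queue);       /* current depth (register 764)     */
extern unsigned subscriber_count(int queue);  /* subscribed cores (register 768)  */
extern uint64_t program_address(int queue);   /* task code address (register 769) */
extern void     subscribe(int queue);
extern void     unsubscribe(int queue);
extern void     load_program(uint64_t addr);  /* copy task code to program memory */

/* Called when the idle counter 887 expires; R is the minimum depth that
 * justifies the cost of resubscribing and reloading task code. Returns the
 * queue the processing element is subscribed to afterward. */
int reschedule(int current_queue, const int *queues, size_t n, unsigned R)
{
    if (subscriber_count(current_queue) <= 1)
        return current_queue;                 /* sole subscriber must keep waiting */

    int best = -1;
    unsigned best_depth = R;                  /* only depths exceeding R qualify   */
    for (size_t i = 0; i < n; i++) {
        if (queues[i] == current_queue)
            continue;
        unsigned d = queue_depth(queues[i]);
        if (d > best_depth) {
            best = queues[i];
            best_depth = d;
        }
    }
    if (best < 0)
        return current_queue;                 /* no queue exceeded the threshold   */

    unsubscribe(current_queue);               /* decrements register 768           */
    subscribe(best);                          /* increments register 768           */
    load_program(program_address(best));      /* fetch task code for the new queue */
    return best;
}
```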
The scheduler program 883 may comprise executable software and/or firmware instructions, may be integrated into each task processor 134 as a sequential logic circuit, or may be a combination of sequential logic with executable instructions. For example, sequential logic included in the task processor 134 may set and start (1526) the idle counter, and determine (1528) that the task processor 134 has been idle for (or longer than) a specified/preset/predetermined duration (e.g., based on the counter expiring or based on a comparison of the count on the counter equaling or exceeding the duration value). In response to determining (1528) that the task processor 134 has been idle for (or longer than) the specified duration, the sequential logic may load a remainder of the scheduler program 883 into the instruction registers 882 from the program memory 874 or another memory, based on an address stored in a specified register such as a special purpose register 886.
The disclosed system allows for a simple, relatively easy to understand interface that accommodates chips with a large number of cores and that improves scaling of a system by decoupling logical tasks from the arrangement of physical cores. A programmer writing the main program does not need to know (or care much) about how many cores will be executing assigned tasks. The number of cores can simply increase or decrease, depending on the number of tasks needing execution. Combined with the ability of cores to sleep while waiting for input, this flexible distribution of tasks also helps to reduce power consumption.
Other addressing schemes may also be used, as well as different addressing hierarchies. Whereas a processor core 890 may directly access its own execution registers 880 using address lines and data lines, communications between processing elements through the data transaction interfaces 872 may be via bus-based or packet-based networks. The bus-based networks may comprise address lines and data lines, conveying addresses via the address lines and data via the data lines. In comparison, the packet-based network connections may comprise a single serial data-line, or plural data lines, conveying addresses in packet headers and data in packet bodies via the data line(s).
Aspects of the disclosed system, such as the scheduler 883 and the various executed software and firmware instructions, may be implemented as a computer method or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. The computer readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform processes described in the present disclosure. The computer readable storage medium may be implemented by a volatile computer memory, non-volatile computer memory, hard drive, solid-state memory, flash drive, removable disk and/or other media.
The examples discussed herein are meant to be illustrative. They were chosen to explain the principles and application of a task-queue based computer system, and are not intended to be exhaustive or to limit such a system to the disclosed topologies, hardware structures, logic states, header formats, and descriptor formats. Many modifications and variations that utilize the operating principles of task queuing may be apparent to those of skill in the art. Persons having ordinary skill in the field of computers, microprocessor design, and network architectures should recognize that components and process steps described herein may be interchangeable with other components or steps, or combinations of components or steps, and still achieve the benefits and advantages of task queuing. Moreover, it should be apparent to one skilled in the art that the disclosure may be practiced without some or all of the specific details and steps disclosed herein.
Different logic and logic elements can be interchanged for the disclosed logic while achieving the same results. For example, a digital comparator that determines whether the depth 764 is equal to zero is functionally identical to a NOR gate into which the data lines conveying the depth value are input, the output of which will be asserted when the binary value across the data lines equals zero.
To accommodate the high speeds at which the device 100 will ordinarily operate, it is contemplated that the FIFO queues 118 will be hardware queues, as discussed in connection with
“Writing,” “storing,” and “saving” are used interchangeably. “Enqueuing” includes writing/storing/saving to a queue. When data is written or enqueued to a location by a component (e.g., by a processing element, a task distributor, etc.), the operation may be directed by the component or the component may send the data to be written/enqueued (e.g., sending the data by packet, together with a write instruction). As such, “writing” and “enqueuing” should be understood to encompass “causing” data to be written or enqueued. Similarly, when a component “reads” or “dequeues” from a location, the operation may be directed by the component or the component may send a request (e.g., by packet, by asserting a signal line, etc.) that causes the data to be provided to the component. Queue management (e.g., the updating of the depth, the front pointer, and back pointer) may be performed by the queue itself, such that enqueuing and dequeuing causes queue management to occur, but does not require that the component enqueuing to or dequeuing from the queue to itself be responsible for queue management.
As used in this disclosure, the term “a” or “one” may include one or more items unless specifically stated otherwise. Further, the phrase “based on” is intended to mean “based at least in part on” unless specifically stated otherwise.