Multi-processor computer architectures capable of parallel computing operations were originally developed for supercomputers. Today, with modern microprocessors containing multiple processor “cores,” the principles of parallel computing have become relevant to both on-chip and distributed computing environments.
For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.
In a network on a chip processor, it is common for a number of processing elements to share access to a common memory, such as for program code and/or data shared between the processing elements. It is also common for processing elements to synchronize at times. Following such synchronization, the processing elements frequently re-start execution from the same point in the shared memory. When this re-start takes place, a number of processing elements will often attempt to read the same data from the same location to retrieve the next instructions each needs to execute. In such a case, the memory can form a bottleneck as the same data is repeatedly retrieved. This bottleneck will result in individual processors being delayed in accessing the desired memory location and thus will cause undesired processing delays. There is, therefore, a need for a network on a chip with a memory shared between processing elements to minimize the delay when processing elements need access to the same locations, such as when execution is restarted following synchronization. Offered are a number of chip configurations allowing multiple different processing elements simultaneous (i.e., within the same clock cycle) or substantially simultaneous (i.e., within a few clock cycles) access to data stored in memory.
In one example, a component such as a comparator may be added to a chip configuration to determine when multiple processors are requesting access to the same memory address. In such a case, a broadcast mask may be constructed at the memory output to distribute the data from the memory address to the corresponding requesting processing elements. The requested address is read from the memory, and the result is broadcast to all the requesting processing elements substantially simultaneously. In this manner, data is obtained from the memory using a single memory read operation and distributed to multiple recipients, rather than through a series of read operations (each requiring its own multiple clock cycles) needed to send the same data to multiple recipients. Other configurations, such as those disclosed below, are also possible. In this manner, distributing the same program data to multiple processors is made significantly more efficient.
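By way of a hedged illustration, the following Python sketch is a behavioral model only (the function and structure names are illustrative assumptions, not the disclosed circuitry) of how detecting duplicate addresses allows a single read to serve many requestors:

```python
from collections import defaultdict

def service_requests(memory, requests):
    """Behavioral model: one memory read per unique address, with
    the result broadcast to every processor requesting it.

    memory   -- dict mapping address -> data
    requests -- list of (processor_id, address) pairs
    Returns (processor_id -> data, number of reads performed).
    """
    # Group requesting processors by address (the comparator's job).
    by_address = defaultdict(list)
    for proc_id, addr in requests:
        by_address[addr].append(proc_id)

    results, reads = {}, 0
    for addr, procs in by_address.items():
        data = memory[addr]        # a single read operation...
        reads += 1
        for proc_id in procs:      # ...broadcast to all requestors
            results[proc_id] = data
    return results, reads

# Eight processors restarting from the same instruction address:
memory = {0x100: "instr_A"}
requests = [(p, 0x100) for p in range(8)]
results, reads = service_requests(memory, requests)
assert reads == 1 and len(results) == 8   # one read serves all eight
```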
Each processing element 170 may have direct access to some (or all) of the operand registers 284 of the other processing elements, such that each processing element 170 may read data directly from, and write data directly into, operand registers 284 used by instructions executed by the other processing elements, thus allowing the processor core 290 of one processing element to directly manipulate the operands used by another processor core for opcode execution.
An “opcode” instruction is a machine language instruction that specifies an operation to be performed by the executing processor core 290. Besides the opcode itself, the instruction may specify the data to be processed in the form of operands. An address identifier of a register from which an operand is to be retrieved may be directly encoded as a fixed location associated with an instruction as defined in the instruction set (i.e., an instruction permanently mapped to a particular operand register), or may be a variable address location specified together with the instruction.
Each operand register 284 may be assigned a global memory address comprising an identifier of its associated processing element 170 and an identifier of the individual operand register 284. The originating processing element 170 of the read/write transaction does not need to take special actions or use a special protocol to read/write to another processing element's operand register, but rather may access another processing element's registers as it would any other memory location that is external to the originating processing element. Likewise, the processing core 290 of a processing element 170 that contains a register that is being read by or written to by another processing element does not need to take any action during the transaction between the operand register and the other processing element.
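A minimal sketch of such a global address follows, assuming purely for illustration a layout in which the upper bits identify the processing element and the lower bits identify the individual operand register:

```python
PE_BITS = 8    # assumed width of the processing-element identifier
REG_BITS = 8   # assumed width of the register identifier

def global_address(pe_id: int, reg_id: int) -> int:
    """Compose a global address from a PE id and a register id."""
    return (pe_id << REG_BITS) | reg_id

def split_address(addr: int) -> tuple:
    """Recover (pe_id, reg_id) from a global address."""
    return addr >> REG_BITS, addr & ((1 << REG_BITS) - 1)

addr = global_address(pe_id=3, reg_id=17)   # operand register 17 of PE 3
assert split_address(addr) == (3, 17)
```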
Conventional processing elements commonly include two types of registers: those that are both internally and externally accessible, and those that are only internally accessible. The hardware registers 276 in the accompanying drawings are an example of the former type, being accessible both by their own processing element and by other elements of the system.
The internally accessible registers in conventional processing elements include instruction registers and operand registers, which are internal to the processor core itself. These registers are ordinarily for the exclusive use of the core for the execution of operations, with the instruction registers storing the instructions currently being executed, and the operand registers storing data fetched from hardware registers 276 or other memory as needed for the currently executed instructions. These internally accessible registers are directly connected to components of the instruction execution pipeline (e.g., an instruction decode component, an operand fetch component, an instruction execution component, etc.), such that there is no reason to assign them global addresses. Moreover, because these registers are used exclusively by the processor core, they are single “ported,” as data access is exclusive to the pipeline.
In comparison, the execution registers 280 of the processor core 290 in the accompanying drawings may be accessible both internally by their own processor core 290 and externally by other processing elements, as described above.
Communication between processing elements 170 may be performed using packets, with each data transaction interface 272 connected to one or more busses, where each bus comprises at least one data line. Each packet may include a target register's address (i.e., the address of the recipient) and a data payload. The busses may be arranged into a network, such as the hierarchical network of busses illustrated in the accompanying drawings.
For example, referring to the accompanying drawings, a packet directed to an operand register 284 of a particular processing element 170 may include a global address identifying the supercluster 130, cluster 150, and processing element 170 containing the target register, together with an identifier of the individual register, followed by the data payload to be written.
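The packet itself may be modeled minimally as follows; the field names are illustrative assumptions, capturing only what is specified above (a target register's global address and a data payload):

```python
from dataclasses import dataclass

@dataclass
class WritePacket:
    """Minimal model of a register-write packet: the target
    register's global address plus the data payload."""
    target_address: int
    payload: int

# Assumed layout: upper bits select the PE, low 8 bits the register.
# Write the value 42 into operand register 5 of processing element 6:
pkt = WritePacket(target_address=(6 << 8) | 5, payload=42)
```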
Other addressing schemes may also be used, and different addressing hierarchies may be used. Whereas a processor core 290 may directly access its own execution registers 280 using address lines and data lines, communications between processing elements through the data transaction interfaces 272 may be via a variety of different bus architectures. For example, communication between processing elements and other addressable components may be via a shared parallel bus-based network (e.g., busses comprising address lines and data lines, conveying addresses via the address lines and data via the data lines). As another example, communication between processing elements and other components may be via one or more shared serial busses.
Addressing between addressable elements/components may be packet-based, message-switched (e.g., a store-and-forward network without packets), circuit-switched (e.g., using matrix switches to establish a direct communications channel/circuit between communicating elements/components), direct (i.e., end-to-end communications without switching), or a combination thereof. In comparison to message-switched, circuit-switched, and direct addressing, a packet-based approach conveys a destination address in a packet header and a data payload in a packet body via the data line(s).
As an example of an architecture using more than one bus type and more than one protocol, inter-cluster communications may be packet-based via serial busses, whereas intra-cluster communications may be message-switched or circuit-switched using parallel busses between the intra-cluster router (L4) 160, the processing elements 170a to 170h within the cluster, and other intra-cluster components (e.g., cluster memory 162). In addition, within a cluster, processing elements 170a to 170h may be interconnected to shared resources within the cluster (e.g., cluster memory 162) via a shared bus or multiple processing-element-specific and/or shared-resource-specific busses using direct addressing (not illustrated).
The source of a packet is not limited only to a processor core 290 manipulating the operand registers 284 associated with another processor core 290, but may be any operational element, such as a memory controller 114, a data feeder (not shown), an external host processor connected to the chip 100, a field programmable gate array, or any other element communicably connected to a processor chip 100 that is able to communicate in the packet format.
The data feeder may execute programmed instructions which control where and when data is pushed to the individual processing elements 170. The data feeder may also be used to push executable instructions to the program memory 274 of a processing element 170 for execution by that processing element's instruction pipeline. The data feeder may also operate in conjunction with the arbitration component 164, discussed further below.
In addition to any operational element being able to write directly to an operand register 284 of a processing element 170, each operational element may also read directly from an operand register 284 of a processing element 170, such as by sending a read transaction packet indicating the global address of the target register to be read, and the global address of the destination address to which the reply including the target register's contents is to be copied.
A data transaction interface 272 associated with each processing element may execute such read, write, and reply operations without necessitating action by the processor core 290 associated with an accessed register. Thus, if the destination address for a read transaction is an operand register 284 of the processing element 170 initiating the transaction, the reply may be placed in the destination register without further action by the processor core 290 initiating the read request. Three-way read transactions may also be undertaken, with a first processing element 170x initiating a read transaction of a register located in a second processing element 170y, with the destination address for the reply being a register located in a third processing element 170z.
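A hedged sketch of such a read transaction, including the three-way case, follows under an assumed packet model (the field names and addresses are illustrative):

```python
from dataclasses import dataclass

@dataclass
class ReadPacket:
    """Model of a read request: the register to read, plus the
    global address to which the reply payload is to be copied."""
    target_address: int
    reply_address: int

def handle_read(registers: dict, pkt: ReadPacket):
    """Executed by the data transaction interface of the element
    owning the target register -- no action by its processor core.
    Returns (reply_address, data), i.e., the reply packet."""
    return pkt.reply_address, registers[pkt.target_address]

# Three-way read: element 170x requests a register in 170y, with
# the reply directed to a register located in a third element 170z.
registers_in_y = {0x0605: 99}                # register owned by 170y
pkt = ReadPacket(target_address=0x0605,      # read from 170y
                 reply_address=0x0702)       # deliver reply into 170z
assert handle_read(registers_in_y, pkt) == (0x0702, 99)
```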
Memory within a system including the processor chip 100 may also be hierarchical. Each processing element 170 may have a local program memory 274 containing instructions that will be fetched by the micro-sequencer 291 in accordance with a program counter 293. Processing elements 170 within a cluster 150 may also share a program memory 162, such as a shared memory serving a cluster 150 including eight processor cores 290. While a processor core 290 may experience no latency (or a latency of one or two cycles of the clock controlling timing of the instruction pipeline 292) when accessing its own execution registers 280, accesses to global addresses external to a processing element 170 may incur a larger latency due to (among other things) the physical distance between processing elements 170. As a result of this additional latency, the time needed for a processor core to access an external main memory, a shared program memory 162, and the registers of other processing elements may be greater than the time needed for a core 290 to access its own program memory 274 and execution registers 280.
Data transactions external to a processing element 170 may be implemented with a packet-based protocol carried over a router-based or switch-based on-chip network. The chip 100 in the accompanying drawings illustrates such a network, in which processing elements 170 are arranged into clusters 150, and clusters 150 are arranged into superclusters 130a-130d.
The superclusters 130a-130d may be interconnected via an inter-supercluster router (L2) 120 which routes transactions between superclusters and between a supercluster and the chip-level router (L1) 110. Each supercluster 130 may include an inter-cluster router (L3) 140 which routes transactions between each cluster 150 in the supercluster 130, and between a cluster 150 and the inter-supercluster router (L2). Each cluster 150 may include an intra-cluster router (L4) 160 which routes transactions between each processing element 170 in the cluster 150, and between a processing element 170 and the inter-cluster router (L3). The level 4 (L4) intra-cluster router 160 may also direct packets between processing elements 170 of the cluster and a cluster memory 162. Tiers may also include cross-connects (not illustrated) to route packets between elements in a same tier in the hierarchy. A processor core 290 may directly access its own operand registers 284 without use of a global address.
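The tiered routing described above may be sketched behaviorally as follows, assuming for illustration that addresses decompose into (supercluster, cluster, processing element) fields and that a packet climbs only as high in the hierarchy as the first mismatch requires:

```python
def route_hops(src: tuple, dst: tuple) -> list:
    """Return the router tiers a packet traverses between two
    elements addressed as (supercluster, cluster, pe) triples.
    Assumed behavior: climb only as far as the first mismatch."""
    src_sc, src_cl, _ = src
    dst_sc, dst_cl, _ = dst
    if src_sc != dst_sc:
        # up through L4 -> L3 -> L2, across, then back down
        return ["L4", "L3", "L2", "L3", "L4"]
    if src_cl != dst_cl:
        return ["L4", "L3", "L4"]    # stays within the supercluster
    return ["L4"]                     # intra-cluster: one router

assert route_hops((0, 1, 2), (0, 1, 5)) == ["L4"]
assert route_hops((0, 1, 2), (0, 3, 0)) == ["L4", "L3", "L4"]
assert route_hops((0, 1, 2), (2, 0, 0)) == ["L4", "L3", "L2", "L3", "L4"]
```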
Memory of different tiers may be physically different types of memory. Operand registers 284 may be among the fastest types of memory in a computing system, whereas external general-purpose memory typically has a higher latency. To improve the speed with which transactions are performed, operand instructions may be pre-fetched from slower memory and stored in a faster program memory (e.g., program memory 274 in the accompanying drawings) in advance of when the instructions are needed for execution.
Referring to the accompanying drawings, the processor core 290 of each processing element 170 may include an instruction pipeline 292 that processes instructions in a series of stages, as follows.
The program counter 293 may present the address of the next instruction in the program memory 274 to enter the pipeline for execution, with the instruction fetched by the micro-sequencer 291 in accordance with the presented address. The micro-sequencer 291 utilizes the instruction registers 282 for instructions being processed by the instruction pipeline 292. After the instruction is read on the next clock cycle of the system clock, the program counter may be incremented. A stage of the instruction pipeline 292 may decode the next instruction to be executed, and the instruction registers 282 may be used to store the decoded instructions. The same logic that implements the decode stage may also present the address(es) of the operand registers 284 of any source operands to be fetched.
An operand instruction may require zero, one, or more source operands. The source operands may be fetched from the operand registers 284 by an operand fetch stage of the instruction pipeline 292 and presented to an arithmetic logic unit (ALU) 294 of the processor core 290 on the next clock cycle. The arithmetic logic unit (ALU) may be configured to execute arithmetic and logic operations using the source operands. The processor core 290 may also include additional components for execution of operations, such as a floating point unit 296. Complex arithmetic operations may also be sent to and performed by a component or components shared among processing elements 170a-170h of a cluster via a dedicated high-speed bus, such as a shared component for executing floating-point divides (not illustrated).
An instruction execution stage of the instruction pipeline 292 may cause the ALU 294 (and/or the FPU 296, etc.) to execute the decoded instruction. Execution by the ALU 294 may require a single cycle of the system clock, with extended instructions requiring two or more. Instructions may be dispatched to the FPU 296 and/or shared component(s) for complex arithmetic operations in a single clock cycle, although several cycles may be required for execution. If an operand write will occur, an address of a register in the operand registers 284 may be set by an operand write stage of the instruction pipeline 292 contemporaneous with execution.
After execution, the result may be received by an operand write stage of the instruction pipeline 292 for write-back to one or more registers 284. The result may be provided to an operand write-back unit 296 of the processor core 290, which performs the write-back, storing the data in the operand register(s) 284. Depending upon the size of the resulting operand and the size of the registers, extended operands that are longer than a single register may require more than one clock cycle to write.
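The stages described above may be summarized in a behavioral sketch; this models only the described flow (fetch, decode, operand fetch, execute, write-back) and makes no claim about actual circuit structure or timing:

```python
def run_instruction(program, pc, operand_regs, alu_ops):
    """One pass through the modeled pipeline stages for the
    instruction at program[pc]. Instruction format (assumed):
    (op_name, src_reg_a, src_reg_b, dst_reg)."""
    op, src_a, src_b, dst = program[pc]                # fetch
    fn = alu_ops[op]                                   # decode
    a, b = operand_regs[src_a], operand_regs[src_b]    # operand fetch
    result = fn(a, b)                                  # execute (ALU)
    operand_regs[dst] = result                         # operand write-back
    return pc + 1                                      # counter increments

program = [("add", 0, 1, 2)]
regs = {0: 4, 1: 5, 2: 0}
pc = run_instruction(program, 0, regs, {"add": lambda a, b: a + b})
assert regs[2] == 9 and pc == 1
```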
Register forwarding may also be used to forward an operand result back into the execution stage of a next or subsequent instruction in the instruction pipeline 292, to be used as a source operand for execution of that instruction. For example, a compare circuit may compare the register source address of a next instruction with the register result destination address of the preceding instruction, and if they match, the execution result operand may be forwarded between pipeline stages to be used as the source operand for execution of the next instruction, such that the execution of the next instruction does not need to fetch the operand from the registers 284.
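A sketch of the compare-and-forward decision, assuming for illustration single source and destination register fields:

```python
def forward_operand(prev_dst_reg, prev_result, next_src_reg, operand_regs):
    """Model of the forwarding compare circuit: if the next
    instruction's source register matches the preceding
    instruction's destination register, bypass the register file
    and reuse the in-flight result."""
    if next_src_reg == prev_dst_reg:
        return prev_result               # forwarded between stages
    return operand_regs[next_src_reg]    # normal operand fetch

regs = {2: 111}
assert forward_operand(2, 999, 2, regs) == 999   # match: forwarded
assert forward_operand(3, 999, 2, regs) == 111   # no match: fetched
```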
To preserve data coherency, a portion of the operand registers 284 being actively used as working registers by the instruction pipeline 292 may be protected as read-only by the data transaction interface 272, blocking or delaying write transactions that originate from outside the processing element 170 which are directed to the protected registers. Such a protective measure prevents the registers actively being written to by the instruction pipeline 292 from being overwritten mid-execution, while still permitting external components/processing elements to read the current state of the data in those protected registers.
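The protection check applied by the data transaction interface to externally originated writes might be modeled as follows (the set-based bookkeeping is an illustrative assumption):

```python
class DataTransactionInterface:
    """Models external access to a PE's operand registers, with a
    set of register addresses protected as read-only while the
    pipeline is actively using them as working registers."""

    def __init__(self, operand_regs):
        self.regs = operand_regs
        self.protected = set()     # registers in active pipeline use

    def external_write(self, reg, value):
        if reg in self.protected:
            return False           # blocked (or could be delayed/queued)
        self.regs[reg] = value
        return True

    def external_read(self, reg):
        return self.regs[reg]      # reads always see the current state

dti = DataTransactionInterface({5: 0})
dti.protected.add(5)
assert not dti.external_write(5, 42)   # write blocked mid-execution
assert dti.external_read(5) == 0       # register remains readable
```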
As shown in the accompanying drawings, in one prior art shared memory configuration, a number of processors share access to a common memory, with requests to the memory handled one at a time such that processors requesting the same location must wait for one another.
In another prior art shared memory configuration, shown in the accompanying drawings, the memory is divided into a number of tiles, with repeated access to any one tile subject to a wait time.
For example, a memory, such as memory 402, may include a number of memory tiles 412-418. Incoming memory requests are gathered in a memory access pipeline 430, illustrated as holding requests R1-R4. The first incoming request R1 is processed to access the requested memory tile. The system also tracks the clock cycles since each particular tile was last accessed using tile access counters 422-428. If R1 requests an address in tile 412, the tile access counter 422 is set and then does not allow access to tile 412 until a sufficient amount of time has passed. This wait time varies between memories but may be several clock cycles (e.g., 5-10). Thus, if request R2 also requests access to memory tile 412, it will need to wait until those clock cycles have completed before accessing the memory. As can be appreciated, if requests R3 and R4 are also requesting access to tile 412, their waits will be multiples of the wait of R2.
An example of these delays is illustrated in the accompanying drawings.
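As a rough worked example, the following sketch tallies how requests to the same tile stack up; the eight-cycle reuse delay is an assumption within the several-cycle range mentioned above:

```python
TILE_REUSE_DELAY = 8   # assumed cycles before a tile can be re-read

def schedule(requests):
    """Assign a start cycle to each request (given as a tile id),
    modeling a per-tile access counter that blocks early reuse."""
    next_free = {}     # tile_id -> first cycle the tile is available
    starts = []
    for cycle, tile in enumerate(requests):
        start = max(cycle, next_free.get(tile, 0))
        starts.append(start)
        next_free[tile] = start + TILE_REUSE_DELAY
    return starts

# R1..R4 all hitting tile 412: each waits a full reuse delay longer.
print(schedule([412, 412, 412, 412]))   # [0, 8, 16, 24]
```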
Even further, however, in certain situations a memory access pipeline 430 will only allow requests that involve different memory tiles, meaning that if an incoming request is attempting to access the same tile as any of the four requests in the pipeline, the incoming request may be rejected and may need to wait until the pipeline clears. Thus, even more delays may be introduced to the system.
Such delays are generally undesirable, and particularly undesirable under certain circumstances. For example, during a system reset, processors may wish to perform a synchronization or other “boot-up” type of activity where each processor may wish to execute a same program instruction at or around the same time. The program instruction may be stored at a single location in shared memory, for example in shared program memory 162. Offered are chip configurations and methods to provide the program instruction to the requesting processors in a substantially simultaneous manner. One such configuration is shown in the accompanying drawings.
The arbitration component 164 may check the request valid bit lines 560-567 to determine which of the request valid bit lines are active at any particular time. The arbitration component 164 may determine which address to select as the “base” address against which to compare the other active addresses in a number of ways. The arbitration component 164 may proceed in order, that is, start with processor 0 510 to see if processor 0's request valid bit line is high, and if it is not, proceed to processor 1 511, and so on. When a request valid bit line is high, the arbitration component 164 may take the address indicated on that processor's address line as the base address and may compare other addresses to that address. The arbitration component 164 may also go in reverse order, i.e., start at processor 7 and go to processor 0 to obtain a base address. The arbitration component 164 may also go in round-robin fashion, starting with a certain processor during one memory access cycle and starting with another processor during another memory access cycle. Other techniques for determining a base address may also be used. The task of determining which processor's request will be selected for purposes of accessing the memory may be performed by a component such as an arbiter 530, illustrated in the accompanying drawings.
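The selection policies described above might be modeled as follows; the parameter `start` encodes the round-robin offset, and all names are illustrative assumptions:

```python
def select_base(valid_bits, addresses, start=0, reverse=False):
    """Pick the base address from the processors whose request
    valid bits are high. Scans in order from `start` (round-robin),
    or in reverse order if `reverse` is set."""
    n = len(valid_bits)
    order = range(n - 1, -1, -1) if reverse else range(n)
    for i in order:
        p = (start + i) % n
        if valid_bits[p]:
            return p, addresses[p]    # first active requester wins
    return None, None                 # no active requests this cycle

valid = [0, 1, 0, 1, 0, 0, 0, 0]
addrs = [0, 0x200, 0, 0x200, 0, 0, 0, 0]
assert select_base(valid, addrs) == (1, 0x200)            # in order
assert select_base(valid, addrs, start=2) == (3, 0x200)   # round-robin
```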
The arbitration component 164 may also be configured with comparator circuitry that can compare the address lines 520-527 for the processors whose request valid bit lines are active (high) at any particular time (i.e., clock cycle). The arbitration component 164 may compare other active request addresses to the base address to determine which processor(s) are also requesting the base address. The comparison may be performed by one or more comparator components, for example a scanner 650 illustrated in the accompanying drawings.
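Given a base address, the scanner's comparison may be sketched as producing a per-processor match mask to drive the broadcast:

```python
def match_mask(valid_bits, addresses, base_addr):
    """Compare every active request address against the base
    address; the resulting bit vector marks the processors that
    should all receive the single memory read's result."""
    return [int(v and a == base_addr)
            for v, a in zip(valid_bits, addresses)]

valid = [1, 0, 0, 1, 1, 0, 0, 0]
addrs = [0x40, 0, 0, 0x40, 0x40, 0, 0, 0]
assert match_mask(valid, addrs, 0x40) == [1, 0, 0, 1, 1, 0, 0, 0]
```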
As can be appreciated, the circuitry (whether in the arbitration component 164, shared program memory 162, or otherwise) may include a pipeline to store memory access requests/addresses in order. The pipeline may function similarly to the memory access pipeline 430 discussed above.
In another example, shown in the accompanying drawings, a multicaster 660 may be used to distribute the data read from the shared program memory 162 to the processors requesting it.
The multicaster 660 is connected to multiple data busses 670-677, each connected to the input of a respective processor. Each data bus has D lines, representing the number of lines needed to communicate data. The collective P data busses (one for each processor) may be represented by line 680. The multicaster 660 receives the output data from the shared program memory on data bus 585. The multicaster 660 then activates the data bus(ses) corresponding to the processors requesting data from the accessed memory location. Thus, for example, if processors 0, 3, and 4 are all requesting to access the same memory address that is currently being output by the shared program memory 162, the multicaster 660 will send the data received on bus 585 to busses 670, 673, and 674. That data will then be received, respectively, by processor 0 510, processor 3 513, and processor 4 514. (Though, as noted above, certain timing circuitry may be implemented to ensure that the data is output onto lines 670, 673, and 674 at a time when the respective processors are notified that the requested data is available.) As can be appreciated, the data resulting from the requested address is thus broadcast to the processors so that the processors receive the data substantially simultaneously, thus significantly reducing the delays found in the prior art discussed above.
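A behavioral sketch of the multicaster: given the match mask and the word on the memory output bus, only the selected per-processor busses are driven (the bus modeling is an illustrative assumption):

```python
def multicast(mask, data):
    """Drive the per-processor data busses named by the mask.
    Returns processor_id -> data for the activated busses only."""
    return {p: data for p, bit in enumerate(mask) if bit}

# Processors 0, 3, and 4 requested the address now on the memory
# output bus; only their busses (670, 673, 674) carry the data.
delivered = multicast([1, 0, 0, 1, 1, 0, 0, 0], data=0xDEAD)
assert delivered == {0: 0xDEAD, 3: 0xDEAD, 4: 0xDEAD}
```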
Another implementation for distributing data from the shared program memory 162 to the processors is shown in the accompanying drawings, in which multiple multicasters 660-663 may be used to service queued input requests.
For example, if each processor is capable of buffering four input requests, the first input request of each processor may be compared against each other and the bits for the processors that match may be set for multicaster 660. A state machine or other configuration may be used for a processor to determine when a read request has been accepted by the arbitration component 164 and/or arbiter 530 and when a next request is to be generated. Similarly, the second input request of each processor may be compared against each other and the bits for the processors that match may be set for multicaster 661. The same applies for the third input request of each processor and multicaster 662, and for the fourth input request of each processor and multicaster 663. For each queued input request, the scanner 650 (or other comparator) may indicate to the arbiter 530 (or other component) which address request is duplicated across processors (or a processor that is requesting the duplicated address) for each respective queued input so the arbiter 530 can select the appropriate address corresponding to each queued input/multicaster.
Configuring multiple multicasters in this manner may allow queued requests that target the same address to be serviced in successive memory access cycles, further reducing delays.
For example, an arbitration component (which may include an arbiter 530, scanner 650, multicaster 660 and/or other component) may receive, over a first request valid bit line (e.g., line 560), an indication that a first processor (e.g., processor 510) is requesting data from memory. The arbitration component may also receive, over a second request valid bit line (e.g., line 561), an indication that a second processor (e.g., processor 511) is requesting data from memory. The arbitration component may receive a first address output by the first processor (e.g., processor 510) on a first address line (e.g., address line 520/652). The arbitration component may also receive a second address output by the second processor (e.g., processor 511) on a second address line (e.g., address line 521/652). The arbitration component may compare the first address to the second address and may determine that they are the same, namely that the first processor and the second processor are both requesting the same first address. The arbitration component may then send the first address to the shared memory 162, for example on line 575/655. The shared program memory 162 may then output the data (e.g., a program instruction) corresponding to the first address on the data bus (585). The shared program memory 162 may acknowledge the request on acknowledgement line 570, which may include an active bit, or multiple bits to indicate to the arbiter which request is currently being handled.
Further, if used, a multicaster 660 may output the data corresponding to the first address on one or more data busses corresponding to the first processor and second processor. For example, the multicaster 660 may output the data onto a first data bus 670 corresponding to/connected to the first processor 510 and a second data bus 671 corresponding to/connected to the second processor 511.
The arbitration component may then set as active a first output valid bit line (e.g., 540) and a second output valid bit line (e.g., 541) (and/or other output valid bit lines) corresponding to the processors requesting data from the address being accessed by the memory 162. The first processor 510 and second processor 511 may then access the data on the connected data bus (585/670/671) in response to the corresponding output valid bit line being set as active.
In certain embodiments, a comparator or other circuitry may be configured to receive new read requests from processors and compare them to a memory address that is already in progress. If any of the new read requests match the address of a memory location currently being accessed, the comparator may update output valid bit lines (e.g., 540-547, 654, 754, or the like) to include the processor corresponding to the new read request. For example, if a first shared memory access event results in processors 0, 3, and 4 all requesting data from the same memory address, bits corresponding to those processors would be ready to be set when the data from the same memory address is output from the shared program memory 162. However, because the read operation may take multiple clock cycles, if, while the memory is being accessed, processor 1 indicates a new read request from the same memory address, the comparator may update the output valid bits (while the memory is being accessed) so that processor 1 is also given access to the data from the shared memory corresponding to the same memory address. In this manner, processors that are “late” by a few clock cycles in requesting data from a same memory location may still get the data from that location without having to initiate a further memory read operation.
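The late-join check might be modeled as follows: while a read is in flight, a newly arriving request for the same address is folded into the pending output-valid mask (the bookkeeping here is an illustrative assumption):

```python
class InFlightRead:
    """Tracks a memory read in progress and the output-valid bits
    to assert when its data appears on the memory output bus."""

    def __init__(self, address, initial_mask):
        self.address = address
        self.mask = list(initial_mask)

    def try_join(self, proc_id, address):
        """Comparator check for a new request arriving mid-read:
        if it targets the same address, add the processor to the
        broadcast instead of starting another read."""
        if address == self.address:
            self.mask[proc_id] = 1
            return True
        return False

# Processors 0, 3, and 4 started the read; processor 1 arrives late.
read = InFlightRead(0x40, [1, 0, 0, 1, 1, 0, 0, 0])
assert read.try_join(1, 0x40)                  # folded into broadcast
assert read.mask == [1, 1, 0, 1, 1, 0, 0, 0]
```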
The above aspects of the present disclosure are meant to be illustrative. They were chosen to explain the principles and application of the disclosure and are not intended to be exhaustive or to limit the disclosure. Many modifications and variations of the disclosed aspects may be apparent to those of skill in the art. Persons having ordinary skill in the field of computers, microprocessor design, and network architectures should recognize that components and process steps described herein may be interchangeable with other components or steps, or combinations of components or steps, and still achieve the benefits and advantages of the present disclosure. Moreover, it should be apparent to one skilled in the art, that the disclosure may be practiced without some or all of the specific details and steps disclosed herein.
As used in this disclosure, the term “a” or “one” may include one or more items unless specifically stated otherwise. Further, the phrase “based on” is intended to mean “based at least in part on” unless specifically stated otherwise.