The present disclosure relates to monitoring an operation state within a computing system that contains a plurality of multi-core processing devices, and, in particular, capturing meaningful state information indicating that processing or movement of data has been completed in a network of interest within the computing system.
Information-processing systems are computing systems that process electronic and/or digital information. A typical information-processing system may include multiple processing elements, such as multiple single-core computer processors or one or more multi-core computer processors capable of concurrent and/or independent operation. Such systems may be referred to as multi-processor or multi-core processing systems.
In a multi-core processing system, data may be loaded to destination processing elements for processing. In epoch-based algorithms, such as in computational fluid dynamics (CFD) or neural models, the amount of data being sent is known ahead of time, and counted wait counters can be used to indicate when the expected number of packets have arrived. These types of applications can be characterized as having fully deterministic data movement that can be calculated either at compile time or at run time, prior to the start of the data movement. In other applications (e.g., radix sort), however, the amount of data arriving at any given memory is not known at compile time or cannot be calculated prior to storing. Moreover, data may be transmitted in a computing system without guarantee of an orderly delivery, especially if transmitted to different destinations. Therefore, there is a need in the art for capturing meaningful state information indicating that processing of data or data movement has finished in a network of interest within a computing system.
The present disclosure provides systems, methods and apparatuses for operating processing elements in a computing system. In one aspect of the disclosure, a processing device may be provided. The processing device may comprise a plurality of processing elements organized into a plurality of clusters. A first cluster of the plurality of clusters may comprise a plurality of interconnect buffers coupled to a subset of the plurality of processing elements within the first cluster. Each interconnect buffer may have a respective interconnect buffer signal line and may be configured to assert the respective interconnect buffer signal line to indicate a state of the respective interconnect buffer. The first cluster may further comprise a cluster state circuit that has inputs coupled to the interconnect buffer signal lines and an output indicating a state of the first cluster, and a cluster timer with an input coupled to the output of the cluster state circuit. The cluster timer may be configured to (i) start counting when all buffers of the plurality of interconnect buffers become empty, and (ii) assert a drain state when all buffers of the plurality of interconnect buffers remain empty for a duration of the cluster timer.
In another aspect of the disclosure, a method of operating a processing device may be provided. The method may comprise transmitting data on the processing device, monitoring state information for a plurality of buffers on the processing device, determining that a drain condition is satisfied using the state information for the plurality of buffers, starting a timer in response to determining that the drain condition is satisfied and asserting a drain state in response to the drain condition remaining satisfied for a duration of the timer.
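By way of non-limiting illustration, the method may be modeled in software as in the following C sketch; the fixed buffer count, the tick-based polling, and all identifiers (poll_drain_state, DRAIN_TICKS) are hypothetical conveniences rather than elements of the disclosed hardware.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_BUFFERS 8    /* hypothetical number of monitored buffers */
    #define DRAIN_TICKS 64   /* hypothetical timer duration, in clock ticks */

    /* Hypothetical view of buffer state, updated by the monitoring logic. */
    static bool buffer_is_empty[NUM_BUFFERS];

    /* Called once per clock tick; returns true once every monitored buffer
     * has remained empty for DRAIN_TICKS consecutive ticks. */
    bool poll_drain_state(void)
    {
        static uint32_t timer = 0;
        bool drain_condition = true;

        for (int i = 0; i < NUM_BUFFERS; i++) {
            if (!buffer_is_empty[i]) {
                drain_condition = false;
                break;
            }
        }

        if (!drain_condition) {
            timer = 0;               /* condition broken: restart the wait */
            return false;
        }
        if (timer < DRAIN_TICKS) {
            timer++;                 /* condition holds: keep counting */
        }
        return timer >= DRAIN_TICKS; /* assert the drain state */
    }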
These and other objects, features, and characteristics of the present invention, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following description and the appended claims with reference to the accompanying drawings, all of which form a part of this specification, wherein like reference numerals designate corresponding parts in the various figures. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the invention. As used in the specification and in the claims, the singular form of “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise.
Certain illustrative aspects of the systems, apparatuses, and methods according to the present invention are described herein in connection with the following description and the accompanying figures. These aspects are indicative, however, of but a few of the various ways in which the principles of the invention may be employed and the present invention is intended to include all such aspects and their equivalents. Other advantages and novel features of the invention may become apparent from the following detailed description when considered in conjunction with the figures.
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. In other instances, well known structures, interfaces, and processes have not been shown in detail in order to avoid unnecessarily obscuring the invention. However, it will be apparent to one of ordinary skill in the art that those specific details disclosed herein need not be used to practice the invention and do not represent a limitation on the scope of the invention, except as recited in the claims. It is intended that no part of this specification be construed to effect a disavowal of any part of the full scope of the invention. Although certain embodiments of the present disclosure are described, these embodiments likewise are not intended to limit the full scope of the invention.
Embodiments according to the present disclosure may determine whether and/or when memory state (e.g., a single memory or a set of memories) has been transitioned to a desired state without implementing any sort of memory coherency logic to track memory state. For example, an embodiment may include one or more multi-core processors in which a plurality of processing cores may share a single memory or a set of memories but the one or more multi-core processors do not have any sort of memory coherency logic. This is in contrast to a conventional multi-processor system, such as a symmetric multiprocessing (SMP) system, in which memory and cache coherency may be used to enforce the idea that any given address has an “owner” at every instant in time.
Moreover, embodiments according to the present disclosure may determine whether and/or when memory state has been transitioned to a desired state without knowing in advance how many packets may be transmitted or without guaranteed packet ordering. For example, an embodiment may include one or more processors and at least some of the one or more processors may include a plurality of processing cores sharing a single memory or a set of memories. The various components of the embodiment may communicate data by packets; the receiving component may not know how many packets will be transmitted, and the packets may be received out of the order in which they were transmitted.
In some implementations, the processing device 102 may include 2, 4, 8, 16, 32 or another number of high speed interfaces 108. Each high speed interface 108 may implement a physical communication protocol. In one non-limiting example, each high speed interface 108 may implement the media access control (MAC) protocol, and thus may have a unique MAC address associated with it. The physical communication may be implemented using a known communication technology, for example, Gigabit Ethernet, or any other existing or future-developed communication technology. In one non-limiting example, each high speed interface 108 may implement bi-directional high-speed serial ports, such as 10 gigabits per second (Gbps) serial ports. Two processing devices 102 implementing such high speed interfaces 108 may be directly coupled via one pair or multiple pairs of the high speed interfaces 108, with each pair comprising one high speed interface 108 on one processing device 102 and another high speed interface 108 on the other processing device 102.
Data communication between different computing resources of the computing system 100 may be implemented using routable packets. The computing resources may comprise device level resources such as a device controller 106, cluster level resources such as a cluster controller or cluster memory controller, and/or the processing engine level resources such as individual processing engines and/or individual processing engine memory controllers. An exemplary packet 140 according to the present disclosure is shown in
The device controller 106 may control the operation of the processing device 102 from power on through power down. The device controller 106 may comprise a device controller processor, one or more registers and a device controller memory space. The device controller processor may be any existing or future-developed microcontroller. In one embodiment, for example, an ARM® Cortex M0 microcontroller may be used for its small footprint and low power consumption. In another embodiment, a bigger and more powerful microcontroller may be chosen if needed. The one or more registers may include one to hold a device identifier (DEVID) for the processing device 102 after the processing device 102 is powered up. The DEVID may be used to uniquely identify the processing device 102 in the computing system 100. In one non-limiting embodiment, the DEVID may be loaded on system start from a non-volatile storage, for example, a non-volatile internal storage on the processing device 102 or a non-volatile external storage. The device controller memory space may include both read-only memory (ROM) and random access memory (RAM). In one non-limiting embodiment, the ROM may store bootloader code that during a system start may be executed to initialize the processing device 102 and load the remainder of the boot code through a bus from outside of the device controller 106. The instructions for the device controller processor, also referred to as the firmware, may reside in the RAM after they are loaded during the system start.
The registers and device controller memory space of the device controller 106 may be read and written to by computing resources of the computing system 100 using packets. That is, they are addressable using packets. As used herein, the term “memory” may refer to RAM, SRAM, DRAM, eDRAM, SDRAM, volatile memory, non-volatile memory, and/or other types of electronic memory. For example, the header of a packet may include a destination address such as DEVID:PADDR, of which the DEVID may identify the processing device 102 and the PADDR may be an address for a register of the device controller 106 or a memory location of the device controller memory space of a processing device 102. In some embodiments, a packet directed to the device controller 106 may have a packet operation code, which may be referred to as a packet opcode or simply an opcode, to indicate what operation needs to be performed for the packet. For example, the packet operation code may indicate reading from or writing to the storage location pointed to by PADDR. It should be noted that the device controller 106 may also send packets in addition to receiving them. The packets sent by the device controller 106 may be self-initiated or in response to a received packet (e.g., a read request). Self-initiated packets may, for example, report status information, request data, etc.
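By way of non-limiting illustration, a packet header carrying a DEVID:PADDR destination address and a packet opcode might be represented as in the following C sketch; the field widths and opcode values are assumptions, since the disclosure leaves them open here.

    #include <stdint.h>

    /* Hypothetical header for a packet addressed to a device controller. */
    typedef struct {
        uint32_t devid;   /* DEVID: identifies the destination processing device 102 */
        uint32_t paddr;   /* PADDR: register or device controller memory location */
        uint8_t  opcode;  /* packet operation code, e.g., read or write */
    } packet_header_t;

    enum { OPCODE_READ = 0, OPCODE_WRITE = 1 }; /* illustrative values only */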
In one embodiment, a plurality of clusters 110 on a processing device 102 may be grouped together.
In another embodiment, the host may be a computing device of a different type, such as a computer processor known in the art (for example, an ARM® Cortex or Intel® x86 processor) or any other existing or future-developed processors. In this embodiment, the host may communicate with the rest of the system 100A through a communication interface, which may represent itself to the rest of the system 100A as the host by having a device ID for the host.
The computing system 100A may implement any appropriate techniques to set the DEVIDs, including the unique DEVID for the host, to the respective processing devices 102 of the computing system 100A. In one exemplary embodiment, the DEVIDs may be stored in the ROM of the respective device controller 106 for each processing device 102 and loaded into a register for the device controller 106 at power up. In another embodiment, the DEVIDs may be loaded from an external storage. In such an embodiment, the assignments of DEVIDs may be performed offline, and may be changed offline from time to time or as appropriate. Thus, the DEVIDs for one or more processing devices 102 may be different each time the computing system 100A initializes. Moreover, the DEVIDs stored in the registers for each device controller 106 may be changed at runtime. This runtime change may be controlled by the host of the computing system 100A. For example, after the initialization of the computing system 100A, which may load the pre-configured DEVIDs from ROM or external storage, the host of the computing system 100A may reconfigure the computing system 100A and assign different DEVIDs to the processing devices 102 in the computing system 100A to overwrite the initial DEVIDs in the registers of the device controllers 106.
The exemplary operations to be performed by the router 112 may include receiving a packet destined for a resource within the cluster 110 from outside the cluster 110 and/or transmitting a packet originating within the cluster 110 destined for a resource inside or outside the cluster 110. A resource within the cluster 110 may be, for example, the cluster memory 118 or any of the processing engines 120 within the cluster 110. A resource outside the cluster 110 may be, for example, a resource in another cluster 110 of the processing device 102, the device controller 106 of the processing device 102, or a resource on another processing device 102. In some embodiments, the router 112 may also transmit a packet to the router 104 even if the packet targets a resource within the cluster 110 itself. In one embodiment, the router 104 may implement a loopback path to send the packet back to the originating cluster 110 if the destination resource is within the cluster 110.
The cluster controller 116 may send packets, for example, as a response to a read request, or as unsolicited data sent by hardware for error or status report. The cluster controller 116 may also receive packets, for example, packets with opcodes to read or write data. In one embodiment, the cluster controller 116 may be any existing or future-developed microcontroller, for example, one of the ARM® Cortex-M microcontrollers, and may comprise one or more cluster control registers (CCRs) that provide configuration and control of the cluster 110. In another embodiment, instead of using a microcontroller, the cluster controller 116 may be custom made to implement any functionalities for handling packets and controlling operation of the router 112. In such an embodiment, the functionalities may be referred to as custom logic and may be implemented, for example, by a field programmable gate array (FPGA) or other specialized circuitry. Regardless of whether it is a microcontroller or implemented by custom logic, the cluster controller 116 may implement a fixed-purpose state machine encapsulating packets and memory access to the CCRs.
Each cluster memory 118 may be part of the overall addressable memory of the computing system 100. That is, the addressable memory of the computing system 100 may include the cluster memories 118 of all clusters of all devices 102 of the computing system 100. The cluster memory 118 may be a part of the main memory shared by the computing system 100. In some embodiments, any memory location within the cluster memory 118 may be addressed by any processing engine within the computing system 100 by a physical address. The physical address may be a combination of the DEVID, a cluster identifier (CLSID) and a physical address location (PADDR) within the cluster memory 118, which may be formed as a string of bits, such as, for example, DEVID:CLSID:PADDR. The DEVID may be associated with the device controller 106 as described above and the CLSID may be a unique identifier to uniquely identify the cluster 110 within the local processing device 102. It should be noted that in at least some embodiments, each register of the cluster controller 116 may also be assigned a physical address (PADDR). Therefore, the physical address DEVID:CLSID:PADDR may also be used to address a register of the cluster controller 116, in which PADDR may be an address assigned to the register of the cluster controller 116.
In some other embodiments, any memory location within the cluster memory 118 may be addressed by any processing engine within the computing system 100 by a virtual address. The virtual address may be a combination of a DEVID, a CLSID and a virtual address location (ADDR), which may be formed as a string of bits, such as, for example, DEVID:CLSID:ADDR. The DEVID and CLSID in the virtual address may be the same as in the physical addresses.
In one embodiment, the width of ADDR may be specified by system configuration. For example, the width of ADDR may be loaded into a storage location convenient to the cluster memory 118 during system start and/or changed from time to time when the computing system 100 performs a system configuration. To convert the virtual address to a physical address, the value of ADDR may be added to a base physical address value (BASE). The BASE may also be specified by system configuration, as the width of ADDR is, and stored in a location convenient to a memory controller of the cluster memory 118. In one example, the width of ADDR may be stored in a first register and the BASE may be stored in a second register in the memory controller. Thus, the virtual address DEVID:CLSID:ADDR may be converted to a physical address as DEVID:CLSID:ADDR+BASE. Note that the result of ADDR+BASE has the same width as the longer of the two.
The address in the computing system 100 may be 8 bits, 16 bits, 32 bits, 64 bits, or any other number of bits wide. In one non-limiting example, the address may be 32 bits wide. The DEVID may be 10, 15, 20, 25 or any other number of bits wide. The width of the DEVID may be chosen based on the size of the computing system 100, for example, how many processing devices 102 the computing system 100 has or may be designed to have. In one non-limiting example, the DEVID may be 20 bits wide and the computing system 100 using this width of DEVID may contain up to 2^20 processing devices 102. The width of the CLSID may be chosen based on how many clusters 110 the processing device 102 may be designed to have. For example, the CLSID may be 3, 4, 5, 6, 7, 8 bits or any other number of bits wide. In one non-limiting example, the CLSID may be 5 bits wide and the processing device 102 using this width of CLSID may contain up to 2^5 clusters. The width of the PADDR for the cluster level may be 20, 30 or any other number of bits. In one non-limiting example, the PADDR for the cluster level may be 27 bits and the cluster 110 using this width of PADDR may contain up to 2^27 memory locations and/or addressable registers. Therefore, in some embodiments, if the DEVID is 20 bits wide, the CLSID is 5 bits wide and the PADDR has a width of 27 bits, a physical address DEVID:CLSID:PADDR or DEVID:CLSID:ADDR+BASE may be 52 bits.
For performing the virtual to physical memory conversion, the first register (ADDR register) may have 4, 5, 6, 7 bits or any other number of bits. In one non-limiting example, the first register may be 5 bits wide. If the value of the 5-bit register is four (4), the width of ADDR may be 4 bits; and if the value of the 5-bit register is eight (8), the width of ADDR will be 8 bits. Regardless of whether ADDR is 4 bits or 8 bits wide, if the PADDR for the cluster level is 27 bits, then BASE may be 27 bits, and the result of ADDR+BASE may still be a 27-bit physical address within the cluster memory 118.
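By way of non-limiting illustration, the 20/5/27-bit example widths and the ADDR+BASE conversion described above might be expressed as in the following C sketch; the masking details and function names are assumptions.

    #include <stdint.h>

    /* Pack a 52-bit physical address as DEVID(20 bits):CLSID(5 bits):PADDR(27 bits). */
    uint64_t pack_physical_address(uint32_t devid, uint32_t clsid, uint32_t paddr)
    {
        return ((uint64_t)(devid & 0xFFFFFu) << 32)   /* bits 51..32 */
             | ((uint64_t)(clsid & 0x1Fu)    << 27)   /* bits 31..27 */
             |  (uint64_t)(paddr & 0x7FFFFFFu);       /* bits 26..0  */
    }

    /* Convert a virtual ADDR to a 27-bit PADDR: only addr_width bits of ADDR
     * are significant, and the sum with BASE stays within the 27-bit space. */
    uint32_t virtual_to_physical(uint32_t addr, uint32_t addr_width, uint32_t base)
    {
        uint32_t addr_mask = (1u << addr_width) - 1u;
        return ((addr & addr_mask) + base) & ((1u << 27) - 1u);
    }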
The processing engines 120A to 120H of each cluster 110 may share the data sequencer 164 that executes a data “feeder” program to push data directly to the processing engines 120A-120H. The data sequencer 164 uses an instruction set in a manner similar to that of a CPU, but the instruction set may be optimized for rapidly retrieving data from memory stores within the cluster and pushing them directly to local processing engines. The data sequencer 164 is also capable of pushing data to other destinations outside of the cluster.
The data feeder program may be closely associated with tasks running on local and remote processing engines. Synchronization may be performed via fast hardware events, direct control of execution state, and other means. Data pushed by the data sequencer 164 travels as flit packets within the processing device interconnect. The data sequencer 164 may comprise a series of feeder queues and place the outgoing flit packets into the feeder queues, where the flit packets are buffered until the interconnect is able to transport them toward their destination. In one embodiment, there are separate outgoing feeder queues, one for the unique path to each processing engine 120, as well as a dedicated feeder queue for flit packets with destinations outside of the cluster.
It should be noted that the data sequencer 164 does not replace a direct memory access (DMA) engine. In one embodiment, although not shown, each cluster 110 may also include one or more DMA engines. For example, the number of DMA engines in a cluster may depend on the number of memory blocks used, such that one DMA engine is used for accessing certain memory block(s) and another DMA engine may be used for accessing other memory block(s). These one or more DMA engines may be identical: they do not run a program of any sort but only execute as a result of a DMA packet being sent to the particular memory with which they are associated. The DMA engines may use the same paths that normal packet reads/writes use. For example, if the data sequencer 164 sends a packet (or constructs a packet) from the memory, it uses an appropriate feeder queue instead of the memory outbound port. In contrast, if a DMA read packet is sent to the memory, then the associated DMA engine performs the requested DMA operation (it does not run a program) and sends the outbound flits via the memory's outbound path (the same path that would be used for a flit read of the memory).
The AIP 114 may be a special processing engine shared by all processing engines 120 of one cluster 110. In one example, the AIP 114 may be implemented as a coprocessor to the processing engines 120. For example, the AIP 114 may implement less commonly used instructions such as some floating point arithmetic, including but not limited to, one or more of addition, subtraction, multiplication, division and square root, etc. As shown in
The grouping of the processing engines 120 on a computing device 102 may have a hierarchy with multiple levels. For example, multiple clusters 110 may be grouped together to form a super cluster.
An exemplary cluster 110 according to the present disclosure may include 2, 4, 8, 16, 32 or another number of processing engines 120.
The instructions of the instruction set may implement the arithmetic and logic operations and the floating point operations, such as those in the INTEL® x86 instruction set, using a syntax similar to or different from that of the x86 instructions. In some embodiments, the instruction set may include customized instructions. For example, one or more instructions may be implemented according to the features of the computing system 100. In one example, one or more instructions may cause the processing engine executing the instructions to generate packets directly with system wide addressing. In another example, one or more instructions may have a memory address located anywhere in the computing system 100 as an operand. In such an example, a memory controller of the processing engine executing the instruction may generate packets according to the memory address being accessed.
The engine memory 124 may comprise a program memory, a register file comprising one or more general purpose registers, one or more special registers and one or more event registers. The program memory may be a physical memory for storing instructions to be executed by the processing core 122 and data to be operated upon by the instructions. In some embodiments, portions of the program memory may be disabled and powered down for energy savings. For example, a top half or a bottom half of the program memory may be disabled to save energy when executing a program small enough that less than half of the storage may be needed. The size of the program memory may be 1 thousand (1K), 2K, 3K, 4K, or any other number of storage units. The register file may comprise 128, 256, 512, 1024, or any other number of storage units. In one non-limiting example, the storage unit may be 32-bit wide, which may be referred to as a longword, and the program memory may comprise 2K 32-bit longwords and the register file may comprise 256 32-bit registers.
The register file may comprise one or more general purpose registers for the processing core 122. The general purpose registers may serve functions that are similar or identical to the general purpose registers of an x86 architecture CPU.
The special registers may be used for configuration, control and/or status. Exemplary special registers may include one or more of the following registers: a program counter, which may be used to point to the program memory address where the next instruction to be executed by the processing core 122 is stored; and a device identifier (DEVID) register storing the DEVID of the processing device 102.
In one exemplary embodiment, the register file may be implemented in two banks—one bank for odd addresses and one bank for even addresses—to permit fast access during operand fetching and storing. The even and odd banks may be selected based on the least-significant bit of the register address if the computing system 100 is implemented in little endian, or on the most-significant bit of the register address if the computing system 100 is implemented in big endian.
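By way of non-limiting illustration, the bank selection might be expressed as in the following C sketch, assuming an illustrative 8-bit register address width.

    #include <stdbool.h>
    #include <stdint.h>

    #define REG_ADDR_BITS 8   /* illustrative register address width */

    /* Select the odd bank from the least-significant bit on a little-endian
     * system, or from the most-significant bit on a big-endian system. */
    bool select_odd_bank(uint32_t reg_addr, bool little_endian)
    {
        if (little_endian)
            return (reg_addr & 1u) != 0;
        return ((reg_addr >> (REG_ADDR_BITS - 1)) & 1u) != 0;
    }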
The engine memory 124 may be part of the addressable memory space of the computing system 100. That is, any storage location of the program memory, any general purpose register of the register file, any special register of the plurality of special registers and any event register of the plurality of event registers may be assigned a memory address PADDR. Each processing engine 120 on a processing device 102 may be assigned an engine identifier (ENGINE ID); therefore, to access the engine memory 124, any addressable location of the engine memory 124 may be addressed by DEVID:CLSID:ENGINE ID:PADDR. In one embodiment, a packet addressed to an engine level memory location may include an address formed as DEVID:CLSID:ENGINE ID:EVENTS:PADDR, in which EVENTS may be one or more bits to set event flags in the destination processing engine 120. It should be noted that when the address is formed as such, the events need not form part of the physical address, which is still DEVID:CLSID:ENGINE ID:PADDR. In this form, the events bits may identify one or more event registers to be set, but these events bits may be separate from the physical address being accessed.
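By way of non-limiting illustration, the separation of the EVENTS bits from the physical address might be expressed as in the following C sketch; the engine-level field widths are hypothetical, as the disclosure does not fix them.

    #include <stdint.h>

    #define ENGINE_PADDR_BITS 12  /* hypothetical engine-level PADDR width */
    #define EVENTS_BITS        4  /* hypothetical number of event flag bits */

    /* The EVENTS bits select event registers to set at the destination engine;
     * they accompany the address but are not part of the physical address. */
    typedef struct {
        uint32_t paddr;        /* engine-level physical address portion */
        uint32_t event_flags;  /* event register select bits, handled separately */
    } engine_target_t;

    engine_target_t split_events(uint32_t events_and_paddr)
    {
        engine_target_t t;
        t.paddr       = events_and_paddr & ((1u << ENGINE_PADDR_BITS) - 1u);
        t.event_flags = (events_and_paddr >> ENGINE_PADDR_BITS)
                      & ((1u << EVENTS_BITS) - 1u);
        return t;
    }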
The packet interface 126 may comprise a communication port for communicating packets of data. The communication port may be coupled to the router 112 and the cluster memory 118 of the local cluster. For any received packets, the packet interface 126 may directly pass them through to the engine memory 124. In some embodiments, a processing device 102 may implement two mechanisms to send a data packet to a processing engine 120. For example, a first mechanism may use a data packet with a read or write packet opcode. This data packet may be delivered to the packet interface 126 and handled by the packet interface 126 according to the packet opcode. The packet interface 126 may comprise a buffer to hold a plurality of storage units, for example, 1K, 2K, 4K, 8K or any other number. In a second mechanism, the engine memory 124 may further comprise a register region to provide a write-only, inbound data interface, which may be referred to as a mailbox. In one embodiment, the mailbox may comprise two storage units that each can hold one packet at a time. The processing engine 120 may have an event flag, which may be set when a packet has arrived at the mailbox to alert the processing engine 120 to retrieve and process the arrived packet. While this packet is being processed, another packet may be received in the other storage unit, but any subsequent packets may be buffered at the sender, for example, the router 112 or the cluster memory 118, or any intermediate buffers.
In various embodiments, data request and delivery between different computing resources of the computing system 100 may be implemented by packets.
In some embodiments, the exemplary operations in the POP field may further include bulk data transfer. For example, certain computing resources may implement a DMA feature. Exemplary computing resources that implement DMA may include a cluster memory controller of each cluster memory 118, a memory controller of each engine memory 124, and a memory controller of each device controller 106. Any two computing resources that implement DMA may perform bulk data transfer between them using packets with a packet opcode for bulk data transfer.
In addition to bulk data transfer, in some embodiments, the exemplary operations in the POP field may further include transmission of unsolicited data. For example, any computing resource may generate a status report or incur an error during operation, and the status or error may be reported to a destination using a packet with a packet opcode indicating that the payload 144 contains the source computing resource and the status or error data.
The POP field may be 2, 3, 4, 5 or any other number of bits wide. In some embodiments, the width of the POP field may be selected depending on the number of operations defined for packets in the computing system 100. Also, in some embodiments, a packet opcode value can have different meaning based on the type of the destination computing resource that receives it. By way of example and not limitation, for a three-bit POP field, a value 001 may be defined as a read operation for a processing engine 120 but a write operation for a cluster memory 118.
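By way of non-limiting illustration, a destination-dependent decode of the three-bit POP example above might look as follows in C; the enumerations are hypothetical names for the resource types and operations in the text.

    #include <stdint.h>

    /* Hypothetical decode of a 3-bit POP field whose meaning depends on the
     * destination resource type: value 001 is a read for a processing engine
     * but a write for a cluster memory, per the example above. */
    typedef enum { DEST_PROCESSING_ENGINE, DEST_CLUSTER_MEMORY } dest_type_t;
    typedef enum { POP_READ, POP_WRITE, POP_UNDEFINED } pop_operation_t;

    pop_operation_t decode_pop(uint8_t pop_field, dest_type_t dest)
    {
        if ((pop_field & 0x7u) == 0x1u) {
            return (dest == DEST_PROCESSING_ENGINE) ? POP_READ : POP_WRITE;
        }
        return POP_UNDEFINED; /* other opcode values are left open here */
    }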
In some embodiments, the header 142 may further comprise an addressing mode field and an addressing level field. The addressing mode field may contain a value to indicate whether the single address field contains a physical address or a virtual address that may need to be converted to a physical address at a destination. The addressing level field may contain a value to indicate whether the destination is at a device, cluster memory or processing engine level.
The payload 144 of the packet 140 is optional. If a particular packet 140 does not include a payload 144, the size field of the header 142 may have a value of zero. In some embodiments, the payload 144 of the packet 140 may contain a return address. For example, if a packet is a read request, the return address for any data to be read may be contained in the payload 144.
The exemplary process 600 may start with block 602, at which a packet may be generated at a source computing resource of the exemplary embodiment of the computing system 100. The source computing resource may be, for example, a device controller 106, a cluster controller 116, a super cluster controller 132 if a super cluster is implemented, an AIP 114, a memory controller for a cluster memory 118, or a processing engine 120. In one embodiment, in addition to the exemplary source computing resources listed above, a host, whether a device 102 designated as the host or a different device (such as the P_Host in system 100B), may also be the source of data packets. The generated packet may be an exemplary embodiment of the packet 140 according to the present disclosure. From block 602, the exemplary process 600 may continue to the block 604, where the packet may be transmitted to an appropriate router based on the source computing resource that generated the packet. For example, if the source computing resource is a device controller 106, the generated packet may be transmitted to a top level router 104 of the local processing device 102; if the source computing resource is a cluster controller 116, the generated packet may be transmitted to a router 112 of the local cluster 110; if the source computing resource is a memory controller of the cluster memory 118, the generated packet may be transmitted to a router 112 of the local cluster 110, or a router downstream of the router 112 if there are multiple cluster memories 118 coupled together by the router downstream of the router 112; and if the source computing resource is a processing engine 120, the generated packet may be transmitted to a router of the local cluster 110 if the destination is outside the local cluster and to a memory controller of the cluster memory 118 of the local cluster 110 if the destination is within the local cluster.
At block 606, a route for the generated packet may be determined at the router. As described herein, the generated packet may comprise a header that includes a single destination address. The single destination address may be any addressable location of a uniform memory space of the computing system 100. The uniform memory space may be an addressable space that covers all memories and registers for each device controller, cluster controller, super cluster controller if a super cluster is implemented, cluster memory and processing engine of the computing system 100. In some embodiments, the addressable location may be part of a destination computing resource of the computing system 100. The destination computing resource may be, for example, another device controller 106, another cluster controller 116, a memory controller for another cluster memory 118, or another processing engine 120, which is different from the source computing resource. The router that received the generated packet may determine the route for the generated packet based on the single destination address. At block 608, the generated packet may be routed to its destination computing resource.
Certain components of the exemplary processing device 102B may comprise buffers. For example, the router 104 may comprise buffers 204A-204C, the router 134 may comprise buffers 209A-209C, and the router 112 may comprise buffers 215A-215H. Each of the processing engines 120A-120H may have an associated buffer 225A-225H, respectively.
As used herein, buffers may be configured to accommodate communication between different components within a computing system. Alternatively, and/or simultaneously, buffers may include electronic storage, including but not limited to non-transient electronic storage. Examples of buffers may include, but are not limited to, queues, first-in-first-out buffers, stacks, first-in-last-out buffers, registers, scratch memories, random-access memories, caches, on-chip communication fabric, switches, switch fabric, interconnect infrastructure, repeaters, and/or other structures suitable to accommodate communication within a multi-core computing system and/or support storage of information. An element within a computing system that serves as the point of origin for a transfer of information may be referred to as a source.
In some implementations, buffers may be configured to store information temporarily, in particular while the information is being transferred from a point of origin, via one or more buffers, to one or more destinations. Structures in the path from a source to a buffer, including the source, may be referred to as being upstream of the buffer. Structures in the path from a buffer to a destination, including the destination, may be referred to as being downstream of the buffer. The terms upstream and downstream may be used as directions and/or as adjectives. In some implementations, individual buffers, such as but not limited to buffers 225, may be configured to accommodate communication for a particular processing engine, between two particular processing engines, and/or among a set of processing engines. Packet switching may be implemented as store-and-forward, cut-through, or a combination thereof. For example, one part of a processing device may use store-and-forward packet switching and another part of the same processing device may use cut-through packet switching. Individual ones of the one or more particular buffers may have a particular status, condition, and/or activity associated therewith, jointly referred to as a buffer state.
By way of non-limiting example, buffer states may include a buffer becoming completely full, a buffer becoming completely empty, a buffer exceeding a threshold level of fullness or emptiness (this may be referred to as a watermark), a buffer experiencing an error condition, a buffer operating in a particular mode of operation, at least some of the functionality of a buffer being turned on or off, a particular type of information being stored in a buffer, particular information being stored in a buffer, a particular level of activity, or lack thereof, upstream and/or downstream of a buffer, and/or other buffer states. In some implementations, a lack of activity may be conditioned on meeting or exceeding a particular duration, e.g. a programmable duration.
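By way of non-limiting illustration, such buffer states might be encoded as a bitmask so that several states can be reported at once; the specific flags and the drained test below are illustrative assumptions, not a definition from the disclosure.

    #include <stdint.h>

    /* Hypothetical encoding of the buffer states enumerated above. */
    enum buffer_state {
        BUF_FULL      = 1u << 0,  /* buffer completely full */
        BUF_EMPTY     = 1u << 1,  /* buffer completely empty */
        BUF_WATERMARK = 1u << 2,  /* threshold level of fullness crossed */
        BUF_ERROR     = 1u << 3,  /* buffer experiencing an error condition */
        BUF_IDLE_UP   = 1u << 4,  /* lack of activity upstream of the buffer */
        BUF_IDLE_DOWN = 1u << 5,  /* lack of activity downstream of the buffer */
    };

    /* Example check: drained here means empty with idle neighbors. */
    static inline int buffer_drained(uint32_t state)
    {
        return (state & (BUF_EMPTY | BUF_IDLE_UP | BUF_IDLE_DOWN))
            == (BUF_EMPTY | BUF_IDLE_UP | BUF_IDLE_DOWN);
    }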
Conditions, status, activities and any other information related to the operating condition of components of a computing system comprising a plurality of processing devices 102 may be generated, monitored and/or collected, and tested at any of various levels of the device and/or system. For example, one processing element (e.g., a processing engine) may write an unknown amount of data to some memory in a multi-chip machine. That data may be sent in one or more packets through FIFOs and buffers until it gets to its destination. While in flight, one or more FIFOs/buffers may hold part or all of the packet(s) being sent. When the packet(s) completely arrive at the destination, assuming there is no other activity in the system, all FIFOs/buffers will be empty and unallocated. Therefore, for this single-processing-element example, if it were possible to know the state of all FIFOs/buffers along the network or path of interest, the processing element may know that the data has “drained” out of the interconnect and arrived at its destination. In one embodiment, this may be achieved by an aggregated signal indicating those FIFOs/buffers are empty for sufficient time to cover the worst-case spacing between packets in the stream. When more processing elements and other components of a computing system are involved and more paths are being utilized, there may be more states to aggregate. That is, meaningful state may be aggregated to indicate that interesting regions of the computing system, which may include one or more of boards, processing devices, super clusters and clusters, are empty.
Each of the plurality of processing elements 914 may comprise a signal line and each of the plurality of processing elements 914 may be configured to assert its respective signal line to indicate a state of the respective processing element. For example, when a processing element 914 has finished processing a piece of data assigned to it, the respective signal line may be asserted to indicate that the processing element 914 now has no data waiting to be processed or transmitted, and thus the processing element is now in a drain state. In one embodiment, the processing element 914 may assert its signal line when both inbound and outbound conditions are met. For example, for the outbound condition to be met, any packet currently in the execute phase of the ALU of the processing element 914 must be completely sent. This may ensure that a packet associated with the currently executing instruction is taken into account even if it has not yet emerged into the cross connect with the other processing elements 914. One exemplary inbound condition may be that there is no packet being clocked into the processing element 914, nor are any packets arbitrating at the interfaces of the processing element 914.
Similarly, each of the one or more memory blocks 916, each of the optional external memory block or blocks 918, each of the plurality of interconnect buffers 922, the cluster router 920, the data sequencer 924 and feeder queues of the data sequencer 924 (and the DMA engine) may also comprise a signal line, and each of these components may be configured to assert its respective signal lines to indicate a state of the respective components. For some components, the signal lines may be asserted when there is no data in any interface buffers for these components. In one embodiment, the signal lines may be asserted when both outbound and inbound conditions are met, for example, all outbound FIFOs/buffers within these memory blocks are empty (including any data FIFOs/buffers at the back end of the memory blocks from which packets may be generated, so that packets about to enter the cluster interconnect are included in the drain state), and there is no packet being clocked in, nor are any packets arbitrating, at any of the inbound interfaces.
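By way of non-limiting illustration, the inbound and outbound conditions described above might be combined into a per-component drain predicate as in the following C sketch; the structure fields are hypothetical names for the conditions in the text.

    #include <stdbool.h>

    /* Hypothetical per-component state feeding the component's signal line. */
    typedef struct {
        bool outbound_fifos_empty;   /* all outbound FIFOs/buffers empty */
        bool packet_in_execute;      /* packet in the ALU execute phase not yet sent */
        bool inbound_clocking;       /* a packet currently being clocked in */
        bool inbound_arbitrating;    /* packets arbitrating at the interfaces */
    } component_state_t;

    bool component_drained(const component_state_t *s)
    {
        bool outbound_ok = s->outbound_fifos_empty && !s->packet_in_execute;
        bool inbound_ok  = !s->inbound_clocking && !s->inbound_arbitrating;
        return outbound_ok && inbound_ok; /* assert the component's signal line */
    }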
The signal lines from various components in a cluster may be coupled to a cluster state circuit 912 as inputs, such that the cluster state circuit 912 may generate an output to indicate a state of the cluster. In one embodiment, the cluster state circuit 912 may be implemented by one or more AND gates such that one asserted output may be generated when all inputs are asserted. For example, when all signal lines coupled to the inputs of the cluster state circuit 912 are asserted, the cluster state circuit 912 may generate a cluster drain condition. That is, the drain condition from all indicated areas of the cluster may be logically AND-ed together to generate a drain condition signal for the entire cluster. Therefore, a cluster's drain condition may be sourced exclusively by state within the cluster. This drain condition may be available for direct local use and also exported so that it can be aggregated at the supercluster and processing device levels. For example, the drain condition for each cluster may be individually sent up to an upper level (e.g., device controller block) and aggregated in an upper level register (e.g., the Cluster Raw Drain Status Register 1004 in
It should be noted that inputs to the cluster state circuit 912 may be selective; that is, the signal lines of one or more components may be selected to be passed to the cluster state circuit 912. For example, as shown in
It should be noted that each processing element 914 and the data sequencer 924 may also have an execution IDLE state. In one embodiment, a processing element 914 (or data sequencer 924) may assert its signal line only when the processing element 914 (or the data sequencer 924) is in an execution IDLE state, in addition to being drained. In another embodiment, a processing element 914 (or the data sequencer 924) may have a separate execution state signal line which may be asserted when the processing element 914 (or the data sequencer 924) is in an execution IDLE state, in addition to the drain state signal line for the processing element 914 (or the data sequencer 924). In a further embodiment, if both the drained signal line and the execution-idle signal line are implemented, a mask may be provided to select which of these signals may be passed through to the cluster state circuit 912.
The cluster 900 may further comprise a drain timer 908 and a timer register 904. The timer register 904 may store a time period to be set for the timer 908. The time period may be pre-determined and adjustable. In one embodiment, the drain timer 908 may start counting when the output of the cluster state circuit 912 is asserted, and when the time period set in the register 904 has passed, the drain timer 908 may generate a drain done signal to be held at an optional drain done signal storage 910 (e.g., a register or buffer). Thus, the cluster drain condition may control the drain timer 908. The timer 908 will run when the logic indicates that the cluster is drained, but will reset to the configured pre-load value if the drain state de-asserts before the timer 908 is exhausted. If the drain condition persists until the timer 908 is exhausted, the drain is completed, and one or more cluster events (e.g., EVF0, EVF1, EVF2, and/or EVF3) may be generated using the OR gates as shown in
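By way of non-limiting illustration, the behavior of the drain timer 908 (run while the drain condition holds, reset to the pre-load value on de-assertion, signal completion when exhausted) might be modeled in C as follows; the tick-based model and names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical model of the drain timer 908 controlled by the cluster
     * drain condition; the pre-load value comes from timer register 904. */
    typedef struct {
        uint32_t preload;     /* time period from timer register 904 */
        uint32_t remaining;   /* current count */
        bool     drain_done;  /* held at drain done signal storage 910 */
    } drain_timer_t;

    void drain_timer_tick(drain_timer_t *t, bool cluster_drain_condition)
    {
        if (!cluster_drain_condition) {
            t->remaining = t->preload;   /* de-assertion resets the timer */
            return;
        }
        if (t->remaining > 0 && --t->remaining == 0) {
            t->drain_done = true;        /* condition persisted: drain completed */
        }
    }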
The outputs from the multiplexers 1016 may be coupled to one or more logical AND gates 1006 as inputs, such that the one or more logical AND gates 1006 may collectively generate (e.g., aggregated in series) an output to indicate a drain condition for the selected status registers. The drain state monitoring circuit 1000 may also comprise a drain timer 1012, a drain timer value register 1010, a device event mask register 1020 and a plurality of AND gates 1018. The output of the one or more logical AND gates 1006 may be coupled to the drain timer 1012 as an input. When the logical AND of all selected drain conditions is asserted, the drain timer 1012 may begin to count down starting from a time period value loaded from the drain timer value register 1010. The time period value may be pre-determined and adjustable. If the drain condition de-asserts prior to the drain timer 1012 reaching zero, the timer 1012 may be reset, and the process starts over. Therefore, the drain condition may need to remain continuously asserted until the drain timer 1012 reaches zero to fulfil the drain criteria and assert the “drain done” signal 1014. The “drain done” signal 1014 may be coupled as one input to each of the AND gates 1018 (e.g., 1018.1, 1018.2, 1018.3 and 1018.4) respectively. Each of the AND gates 1018 may also have another input coupled to the device event mask register 1020 such that each of the AND gates 1018 may be configured to generate a device event (e.g., EVFD0, EVFD1, EVFD2, EVFD3) based on the “drain done” signal 1014 and a respective mask bit in the device event mask register 1020. In one embodiment, the device events generated based on drain state (e.g., EVFD0, EVFD1, EVFD2, EVFD3) on a processing device may be used by the processing device for synchronization.
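By way of non-limiting illustration, the device-level aggregation and event masking might be modeled in C as follows; the bit-per-cluster encoding and function names are assumptions of the sketch, not the disclosed circuit.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_DEVICE_EVENTS 4   /* EVFD0..EVFD3 */

    /* Hypothetical device-level drain condition: each bit of cluster_drain_bits
     * is one cluster's drain status; select_mask models the multiplexers 1016,
     * so unselected clusters do not block the aggregate condition. */
    bool device_drain_condition(uint32_t cluster_drain_bits, uint32_t select_mask)
    {
        return (cluster_drain_bits & select_mask) == select_mask;
    }

    /* Model of the AND gates 1018: a device event is raised only when drain is
     * done and the corresponding bit of the device event mask register is set. */
    void raise_device_events(bool drain_done, uint8_t event_mask,
                             bool events[NUM_DEVICE_EVENTS])
    {
        for (int i = 0; i < NUM_DEVICE_EVENTS; i++)
            events[i] = drain_done && ((event_mask >> i) & 1u);
    }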
In one embodiment, the drain timer value register 1010 and the device event mask register 1020 may be configured to receive their respective values from the advanced peripheral bus (APB) as well. Moreover,
In addition to generating device events, the “drain done” signal 1014 may be used to generate sync event signals. The sync event signals may be used, at all levels of the event hierarchy, to allow synchronization and signaling. In one embodiment, the sync events may be used to provide the highest level of events which span across multiple devices. For example, in addition to being used for drain signaling (as needed by the application and the system's size), these sync event signals can also be used for non-drain synchronization/signaling.
It should be noted that the “drain done” signal may directly cause an interrupt to the device controller processor (for example, an ARM Cortex-M0) in addition to “drain done” being able to cause device events and sync events. In a machine which uses interrupts or other signaling mechanisms rather than events, this may be another possible implementation of signaling “drain done” to report that drain is done.
The processing board 1200 may further comprise a drain timer 1212 which may be set to a period of time by a timer value register 1210. The time period value may be pre-determined and adjustable. The output of the one or more logical AND gates 1214 may be coupled to the drain timer 1212 as an input. The drain timer 1212 may start counting when all input signal lines are asserted. If any drain condition signal line de-asserts prior to the drain timer 1212 reaching zero, the timer 1212 may be reset, and the process starts over. Therefore, the drain condition may need to remain continuously asserted until the drain timer 1212 reaches zero to fulfil the drain criteria and assert the board drained signal line.
It should be noted that although
In addition to using signal lines at the cluster level, device level and/or board level as described herein to monitor whether a drain state occurred in a region being monitored, one or more internal or external interfaces may also be monitored to determine the drain state. One example would be right at the boundary of a cluster. In this example, the drain state within the cluster may be monitored, and flit packets may be coming into the cluster from outside. In addition to monitoring the interconnect (buffers) and various components within the cluster, the inbound edge of the cluster may be an interface that may provide useful information to determine the drain state. Consider a packet that is buffered somewhere else in the device (in a buffer not monitored for drain) and is trying to get into the cluster. If there are multiple sources outside of the cluster, then the interface has arbitration and will grant a request for a packet to enter the cluster. When granted, the packet may be transferred through some interface logic and into a cluster interconnect buffer (perhaps one of many buffers, depending on whether the interface has some switch/router logic in it). Another piece of state that could be used to test drain, then, is whether that cluster inbound interface has any pending requests. For example, there may be a case in which the buffers in the cluster are drained and the drain timer is running, but a request arrives at the interface for a packet that has been slow to get to the cluster.
This can be extended depending on the complexity of the interface. For example, in one implementation, the supercluster-to-supercluster interface may have multiple levels of arbitration and may also have isochronous signaling due to the very long distances the signals need to travel. Depending on the traffic density and the length of the particular path, it could take a relatively long period for a tardy packet to make it out of one supercluster and into the cluster which is monitoring its own drain state. In this case, the drain timing may be refined if an early warning can be generated from the interface to indicate that there is an incoming packet which has not yet made it to a cluster buffer.
Another example might be associated with an interface between blocks within a cluster, for example, the interface between a feeder queue and the memory. Assume that the cluster is drained based on state from all the cluster interconnect and the state of the feeder queues, but the data sequencer executed one last instruction which is in the process of fetching a packet from the memory that will be delivered to the appropriate feeder queue. There are several ways drain could handle this. First, take into account the data sequencer pipeline (execution) state as described above. Second, take into account the memory logic state. If the memory is processing a read, then the memory should not report drained. In one embodiment, if the cluster drain state is fine grained, knowledge of whether the activity is a read or a write may be needed. If it is a read, it might be important to know which path the read data will take (e.g., maybe the path out of the cluster is not interesting). The path may be determined as the read data leaves some internal FIFO or right at an egress interface. Third, take into account the interface between the memory and each feeder queue. As soon as the memory indicates that it has a packet for a feeder queue, that feeder queue can report that it is no longer drained.
The exemplary process 1300 may start with block 1302, at which data may be transmitted in a computing system. For example, one or more packets containing the data to be transmitted may be generated at a source computing resource of the exemplary embodiment of the computing system 100. The source computing resource may be, for example, a device controller 106, a cluster controller 116, a super cluster controller 132 if a super cluster is implemented, an AIP 114, a memory controller for a cluster memory 118, a processing engine 120, or a host in the computing system (e.g., P_Host in system 100B). The generated packets may be exemplary embodiments of the packet 140 according to the present disclosure.
At block 1304, state information for a plurality of circuit components in the computing system may be monitored for the transmitted data. For example, the one or more packets carrying the transmitted data may be transmitted across clusters, superclusters, processing devices, and/or processing boards. A network of interest may be determined, for example, based on the source and destination of the transmitted data or on where the transmitted data may pass through. The signal lines of the circuit components within the network of interest that may indicate drain state information may be monitored. For example, the signal lines for processing elements, cluster routers, interconnect buffers within clusters, memory blocks within clusters, supercluster routers and controllers, device level routers and controllers, board controllers, board memory blocks, and board interconnect buffers may be monitored for the network of interest.
At block 1306, the monitored state information may be aggregated and at block 1308 a timer may be started in response to determining that all circuit components being monitored are empty. At block 1310, a drain state may be asserted in response to unmasked drain conditions of the plurality of drain conditions from circuit components remaining asserted for the duration of the timer. For example, data may be transmitted from a first processing element in a first cluster of a first processing device to a second processing element in a second cluster of a second processing device. Along the path of the transmitted data, a drain region may be any region within a network of interest. For example, the network of interest may comprise the first processing device and the second processing device. The drain region may be any region within the network of interest, for example, the first cluster, the second cluster, the supercluster comprising the first cluster, the supercluster comprising the second cluster, the first processing device, the second processing device, a board hosting the first processing device, or a region comprising both the first processing device and the second processing device.
In some embodiments, the timer (e.g., drain timer 908, drain timer 1012, drain timer 1212) may be used to make sure that if there are relatively brief gaps in the stream of packets, the gaps do not cause a false-positive drain indication. For example, if something on the order of millions of packets is being sent in a bounded portion of an application, but the stream of packets is non-uniform, with spurts followed by relatively brief dead periods, embodiments according to the present disclosure may avoid having a dead period incorrectly trigger the drain signal. The timer value may be configured so that it spans a period of time greater than what is determined to be the longest dead period expected (or calculated) in the packet stream. That is, the timer(s) may be set to a value sufficient to span a period greater than the worst-case gap between packets of the bounded packet stream being monitored. If, however, the packet stream happens to be very uniform and constant, then the timer may be configured to a very short period of time, since gaps would never, or only briefly, be seen.
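By way of non-limiting illustration, choosing the timer value from the worst-case inter-packet gap might be expressed as follows in C; the safety margin is purely an assumption of the sketch.

    #include <stdint.h>

    /* Hypothetical sizing rule: the timer period must span more than the
     * worst-case gap between packets of the bounded stream being monitored.
     * The 25% margin here is illustrative, not part of the disclosure. */
    uint32_t choose_drain_timer_value(uint32_t worst_case_gap_ticks)
    {
        return worst_case_gap_ticks + worst_case_gap_ticks / 4u + 1u;
    }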
While specific embodiments and applications of the present invention have been illustrated and described, it is to be understood that the invention is not limited to the precise configuration and components disclosed herein. The terms, descriptions and figures used herein are set forth by way of illustration only and are not meant as limitations. Various modifications, changes, and variations which will be apparent to those skilled in the art may be made in the arrangement, operation, and details of the apparatuses, methods and systems of the present invention disclosed herein without departing from the spirit and scope of the invention. By way of non-limiting example, it will be understood that the block diagrams included herein are intended to show a selected subset of the components of each apparatus and system, and each pictured apparatus and system may include other components which are not shown on the drawings. Additionally, those with ordinary skill in the art will recognize that certain steps and functionalities described herein may be omitted or re-ordered without detracting from the scope or performance of the embodiments described herein.
The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. The described functionality can be implemented in varying ways for each particular application—such as by using any combination of microprocessors, microcontrollers, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), and/or System on a Chip (SoC)—but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM, flash memory, ROM, EPROM, EEPROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The methods disclosed herein comprise one or more steps or actions for achieving the described method. The method steps and/or actions may be interchanged with one another without departing from the scope of the present invention. In other words, unless a specific order of steps or actions is required for proper operation of the embodiment, the order and/or use of specific steps and/or actions may be modified without departing from the scope of the present invention.