The present invention relates generally to the formation of a 100 petaflop scale, low power, massively parallel supercomputer.
This invention relates generally to the field of high performance computing (HPC) or supercomputer systems and architectures of the type such as described in the IBM Journal of Research and Development, Special Double Issue on Blue Gene, Vol. 49, Numbers 2/3, March/May 2005; and, IBM Journal of Research and Development, Vol. 52, Numbers 1 and 2, January/March 2008, pp. 199-219.
Massively parallel computing structures (also referred to as “supercomputers”) interconnect large numbers of compute nodes, generally in the form of very regular structures, such as mesh, torus, and tree configurations. The conventional approach for the most cost-effective scalable computers has been to use standard processors configured in uni-processor or symmetric multiprocessor (SMP) configurations, wherein the SMPs are interconnected with a network to support message passing communications. Today, these supercomputing machines exhibit computing performance achieving 1-3 petaflops (see http://www.top500.org/ June 2009). However, there are two long standing problems in the computer industry with the current cluster-of-SMPs approach to building supercomputers: (1) the increasing distance, measured in clock cycles, between the processors and the memory (the memory wall problem) and (2) the high power density of parallel computers built of mainstream uni-processors or symmetric multi-processors (SMPs).
The first problem, the distance to memory (as measured by both latency and bandwidth metrics), is a key issue facing computer architects: microprocessor performance increases at a rate far beyond the rate at which memory speeds and communication bandwidth increase per year. While memory hierarchy (caches) and latency hiding techniques provide excellent solutions, these methods require the applications programmer to use very regular program and memory reference patterns to attain good efficiency (i.e., minimizing instruction pipeline bubbles and maximizing memory locality).
The second problem, high power density, relates to the high cost of facility requirements (power, cooling and floor space) for such peta-scale computers.
It would be highly desirable to provide a supercomputing architecture that will reduce latency to memory, as measured in processor cycles, exploit locality of node processors, and optimize massively parallel computing at ~100 petaOPS-scale at decreased cost, power, and footprint.
It would be highly desirable to provide a supercomputing architecture that exploits technological advances in VLSI that enable a computing model where many processors can be integrated into a single ASIC.
It would be highly desirable to provide a supercomputing architecture that comprises a unique interconnection of processing nodes for optimally achieving various levels of scalability.
It would be highly desirable to provide a supercomputing architecture that comprises a unique interconnection of processing nodes for efficiently and reliably computing global reductions, distributing data, synchronizing, and sharing limited resources.
A novel massively parallel supercomputer capable of achieving 107 petaflops with up to 8,388,608 cores, or 524,288 nodes, or 512 racks is provided. It is based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC). The ASIC nodes are interconnected by a five-dimensional torus network that optimally maximizes packet communications throughput and minimizes latency. The 5-D network includes a DMA (direct memory access) network interface.
In one aspect, there is provided a new class of massively-parallel, distributed-memory scalable computer architectures for achieving 100 peta-OPS scale computing and beyond, at decreased cost, power and footprint.
In a further aspect, there is provided a new class of massively-parallel, distributed-memory scalable computer architectures for achieving 100 peta-OPS scale computing and beyond that allows for a maximum packaging density of processing nodes from an interconnect point of view.
In a further aspect, there is provided an unprecedented-scale supercomputing architecture that exploits technological advances in VLSI that enable a computing model where many processors can be integrated into a single ASIC. Preferably, simple processing cores are utilized that have been optimized for minimum power consumption and are capable of achieving price/performance superior to that obtainable from current architectures, while having the system attributes of reliability, availability, and serviceability expected of large servers. Particularly, each computing node comprises a system-on-chip ASIC utilizing four or more processors integrated into one die, with each having full access to all system resources. Having many processors on a single die enables adaptive partitioning of the processors to functions such as compute or messaging I/O on an application-by-application basis and, preferably, enables adaptive partitioning of functions in accordance with various algorithmic phases within an application; if I/O or other processors are underutilized, they can participate in computation or communication.
In a further aspect, there is provided an ultra-scale supercomputing architecture that incorporates a plurality of network interconnect paradigms. Preferably, these paradigms include a five dimensional torus with DMA. The architecture allows parallel processing message-passing.
In a further aspect, there are provided, in a highly scalable computer architecture, key synergies that allow new and novel techniques and algorithms to be executed in the massively parallel processing arts.
In a further aspect, there is provided I/O nodes for filesystem I/O wherein I/O communications and host communications are carried out. The application can perform I/O and external interactions without unbalancing the performance of the 5-D torus nodes.
Moreover, these techniques also provide for partitioning of the massively parallel supercomputer into a flexibly configurable number of smaller, independent parallel computers, each of which retains all of the features of the larger machine. Given the tremendous scale of this supercomputer, these partitioning techniques also provide the ability to transparently remove, or map around, any failed racks or parts of racks referred to herein as “midplanes,” so they can be serviced without interfering with the remaining components of the system.
In a further aspect, there is added serviceability such as Ethernet addressing via physical location, and JTAG interfacing to Ethernet.
According to yet another aspect of the invention, there is provided a scalable, massively parallel supercomputer comprising: a plurality of processing nodes interconnected in n-dimensions, each node including one or more processing elements for performing computation or communication activity as required when performing parallel algorithm operations; and, the n-dimensional network meets the bandwidth and latency requirements of a parallel algorithm for optimizing parallel algorithm processing performance.
In one embodiment, the node architecture is based upon System-On-a-Chip (SOC) Technology wherein the basic building block is a complete processing “node” comprising a single Application Specific Integrated Circuit (ASIC). When aggregated, each of these processing nodes is termed a ‘Cell’, allowing one to define this new class of massively parallel machine constructed from a plurality of identical cells as a “Cellular” computer. Each node preferably comprises a plurality (e.g., four or more) of processing elements each of which includes a central processing unit (CPU), a plurality of floating point processors, and a plurality of network interfaces.
The SOC ASIC design of the nodes permits optimal balance of computational performance, packaging density, low cost, and power and cooling requirements. In conjunction with novel packaging technologies, it further enables scalability to unprecedented levels. The system-on-a-chip level integration allows for low latency to all levels of memory including a local main store associated with each node, thereby overcoming the memory wall performance bottleneck increasingly affecting traditional supercomputer systems. Within each node, each of multiple processing elements may be used individually or simultaneously to work on any combination of computation or communication as required by the particular algorithm being solved or executed at any point in time.
At least three modes of operation are supported. In the full virtual node mode, each of the processing cores performs its own MPI (message passing interface) processes independently. Each core runs four threads/processes and uses a sixteenth of the memory (L2 and SDRAM) of the node, while coherence among the 64 processes within the node and across the nodes is maintained by MPI. In the full SMP mode, one MPI task with 64 threads (4 threads per core) runs, using the whole node memory capacity. The third mode is called the mixed mode. Here 2, 4, 8, 16, and 32 processes run 32, 16, 8, 4, and 2 threads, respectively.
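The modes above share one constraint: processes per node multiplied by threads per process equals the 64 hardware threads of a node (16 compute cores, 4 threads each). The following minimal sketch enumerates the resulting splits. It reads the full virtual node mode as 64 single-thread processes, which is an interpretation of the text; the function name and mode labels are illustrative assumptions and not part of any system software.

```python
CORES_PER_NODE = 16      # cores used for computation on one node
THREADS_PER_CORE = 4     # the A2 core is 4-way multi-threaded
HW_THREADS = CORES_PER_NODE * THREADS_PER_CORE  # 64 hardware threads per node


def node_modes():
    """Return (processes, threads_per_process, label) for each power-of-two split."""
    modes = []
    procs = 1
    while procs <= HW_THREADS:
        threads = HW_THREADS // procs
        if procs == 1:
            label = "full SMP"            # one MPI task, 64 threads
        elif procs == HW_THREADS:
            label = "full virtual node"   # assumed: 64 processes, 1 thread each
        else:
            label = "mixed"
        modes.append((procs, threads, label))
        procs *= 2
    return modes


if __name__ == "__main__":
    for procs, threads, label in node_modes():
        print(f"{procs:2d} processes x {threads:2d} threads  ({label})")
```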
Because of the torus' DMA feature, internode communications can overlap with computations running concurrently on the nodes.
With respect to the Torus network, it is configured, in one embodiment, as a 5-dimensional design supporting hyper-cube communication and partitioning. A 4-Dimensional design allows a direct mapping of computational simulations of many physical phenomena to the Torus network. However, higher dimensionality, 5 or 6-dimensional Toroids, which allow shorter and lower latency paths at the expense of more chip-to-chip connections and significantly higher cabling costs have been implemented in the past.
Further independent networks include an external Network (such as a 10 Gigabit Ethernet) that provides attachment of input/output nodes to external server and host computers; and a Control Network (a combination of 1 Gb Ethernet and an IEEE 1149.1 Joint Test Action Group (JTAG) network) that provides complete low-level debug, diagnostic and configuration capabilities for all nodes in the entire machine, and which is under control of a remote independent host machine, called the “Service Node”. Preferably, the Control Network operates with or without the cooperation of any software executing on the nodes of the parallel machine. Nodes may be debugged or inspected transparently to any software they may be executing. The Control Network provides the ability to address all nodes simultaneously or any subset of nodes in the machine. This level of diagnostics and debug is an enabling technology for massive levels of scalability for both the hardware and software.
Novel packaging technologies are employed for the supercomputing system that enable unprecedented levels of scalability, permitting multiple networks and multiple processor configurations. In one embodiment, there is provided multi-node “Node Cards” including a plurality of Compute Nodes, plus optionally one or two I/O Nodes where the external I/O Network is enabled. In this way, the ratio of computation to external input/output may be flexibly selected by populating “midplane” units with the desired number of I/O nodes. The packaging technology permits sub-network partitionability, enabling simultaneous work on multiple independent problems. Thus, smaller development, test and debug partitions may be generated that do not interfere with other partitions.
Connections between midplanes and racks are selected to be operable based on partitioning. Segmentation creates isolated partitions; each partition owning the full bandwidths of all interconnects, providing predictable and repeatable performance. This enables fine-grained application performance tuning and load balancing that remains valid on any partition of the same size and shape. In the case where extremely subtle errors or problems are encountered, this partitioning architecture allows precise repeatability of a large scale parallel application. Partitionability, as enabled by the present invention, provides the ability to segment so that a network configuration may be devised to avoid, or map around, non-working racks or midplanes in the supercomputing machine so that they may be serviced while the remaining components continue operation.
The objects, features and advantages of the present invention will become apparent to one skilled in the art, in view of the following detailed description taken in combination with the attached drawings, in which:
The present invention is directed to a next-generation massively parallel supercomputer, hereinafter referred to as “Blue Gene” or “Blue Gene/Q”. The previous two generations were detailed in the IBM Journal of Research and Development, Special Double Issue on Blue Gene, Vol. 49, Numbers 2/3, March/May 2005; and, IBM Journal of Research and Development, Vol. 52, Numbers 1 and 2, January/March 2008, pp. 199-219, the whole contents and disclosures of which are incorporated by reference as if fully set forth herein. The system uses a proven Blue Gene architecture, exceeding by over 15× the performance of the prior generation Blue Gene/P per dual-midplane rack. Besides performance, there are additionally several novel enhancements, which will be described herein below.
A compute node of this present massively parallel supercomputer architecture and in which the present invention may be employed is illustrated in
More particularly, the basic nodechip 50 of the massively parallel supercomputer architecture illustrated in
Each FPU 53 associated with a core 52 provides a 32 B wide data path to the L1-cache 55 of the A2, allowing it to load or store 32 B per cycle from or into the L1-cache 55. Each core 52 is directly connected to a private prefetch unit (level-1 prefetch, L1P) 58, which accepts, decodes and dispatches all requests sent out by the A2. The store interface from the A2 core 52 to the L1P 58 is 32 B wide, in one example embodiment, and the load interface is 16 B wide, both operating at processor frequency. The L1P 58 implements a fully associative, 32 entry prefetch buffer, each entry holding an L2 line of 128 B size, in one embodiment. The L1P provides two prefetching schemes for the private prefetch unit 58: a sequential prefetcher, as well as a list prefetcher.
As shown in
Network packet I/O functionality at the node is provided and data throughput increased by implementing MU 100. Each MU at a node includes multiple parallel operating DMA engines, each in communication with the XBAR switch, and a Network Interface Unit 150. In one embodiment, the Network Interface Unit of the compute node includes, in a non-limiting example: 10 intra-rack and inter-rack interprocessor links 90, each operating at 2.0 GB/s (that, in one embodiment, may be configurable as a 5-D torus, for example); and, one I/O link 92 interfaced with the Network Interface Unit 150 at 2.0 GB/s (i.e., a 2 GB/s I/O link (to an I/O subsystem)) is additionally provided.
The system is expandable to 512 compute racks, each with 1024 compute node ASICs (BQC) containing 16 PowerPC A2 processor cores at 1600 MHz. Each A2 core has an associated quad-wide fused multiply-add SIMD floating point unit, producing 8 double precision operations per cycle, for a total of 128 floating point operations per cycle per compute chip. Cabled as a single system, the multiple racks can be partitioned into smaller systems by programming switch chips, termed the BG/Q Link ASICs (BQL), which source and terminate the optical cables between midplanes. Each compute rack consists of 2 sets of 512 compute nodes. Each set is packaged around a double-sided backplane, or midplane, which supports a five-dimensional torus of size 4×4×4×4×2, which is the communication network for the compute nodes, which are packaged on 16 node boards. This torus network can be extended in 4 dimensions through link chips on the node boards, which redrive the signals optically with an architecture limit of 64 to any torus dimension. The signaling rate is 10 Gb/s (8/10 encoded) over ~20 meter multi-mode optical cables at 850 nm. As an example, a 96-rack system is connected as a 16×16×16×12×2 torus, with the last ×2 dimension contained wholly on the midplane. For reliability reasons, small torus dimensions of 8 or less may be run as a mesh rather than a torus with minor impact to the aggregate messaging rate.
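The scale figures quoted for the system can be cross-checked with simple arithmetic. The sketch below recomputes the node count of the midplane and 96-rack torus shapes and the peak floating point rate of a full 512-rack system from the per-core numbers given above. The constant and function names are illustrative assumptions.

```python
from math import prod

RACKS = 512
NODES_PER_RACK = 1024
CORES_PER_NODE = 16             # compute cores per node ASIC
FLOPS_PER_CYCLE_PER_CORE = 8    # quad-wide fused multiply-add
CLOCK_HZ = 1_600_000_000        # 1600 MHz


def torus_nodes(shape):
    """Number of nodes in a torus with the given extent per dimension."""
    return prod(shape)


def total_cores(racks=RACKS):
    return racks * NODES_PER_RACK * CORES_PER_NODE


def peak_flops(racks=RACKS):
    """Peak double precision operations per second for `racks` racks."""
    return total_cores(racks) * FLOPS_PER_CYCLE_PER_CORE * CLOCK_HZ


if __name__ == "__main__":
    print(torus_nodes((4, 4, 4, 4, 2)))       # nodes on one midplane
    print(torus_nodes((16, 16, 16, 12, 2)))   # nodes in the 96-rack example
    print(total_cores())                      # cores in a 512-rack system
    print(peak_flops() / 1e15)                # peak rate in petaflops
```

A midplane torus of 4×4×4×4×2 holds 512 nodes, the 96-rack torus holds 96 × 1024 nodes, and the 512-rack peak comes to roughly 107 petaflops, matching the figures stated in the text.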
The Blue Gene/Q platform includes four kinds of nodes: compute nodes (CN), I/O nodes (ION), login nodes (LN), and service nodes (SN). The CN and ION share the same Blue Gene/Q compute ASIC.
Microprocessor Core and Quad Floating Point Unit of CN and ION
The basic node of this present massively parallel supercomputer architecture is illustrated in
The node here is based on low power A2 PowerPC cores, though the architecture can use any low power cores. The A2 is a 4-way multi-threaded 64b PowerPC implementation. Each A2 core has its own execution unit (XU), instruction unit (IU), and quad floating point unit (QPU) connected via the AXU (Auxiliary eXecution Unit) (
Compute ASIC Node
The compute chip implements 18 PowerPC compliant A2 cores and 18 attached QPU floating point units. In one embodiment, seventeen (17) cores are functional. The 18th “redundant” core is in the design to improve chip yield. Of the 17 functional units, 16 will be used for computation leaving one to be reserved for system function.
I/O Node
Besides the 1024 compute nodes per rack, there are associated I/O nodes. These I/O nodes are in separate racks, and are connected to the compute nodes through an 11th port (an I/O port such as shown in
Memory Hierarchy: L1 and L1P
The QPU has a 32 B wide data path to the L1-cache of the A2, allowing it to load or store 32 B per cycle from or into the L1-cache. Each core is directly connected to a private prefetch unit (level-1 prefetch, L1P), which accepts, decodes and dispatches all requests sent out by the A2. The store interface from the A2 to the L1P is 32 B wide and the load interface is 16 B wide, both operating at processor frequency. The L1P implements a fully associative, 32 entry prefetch buffer. Each entry can hold an L2 line of 128 B size. The L1P provides two prefetching schemes: a sequential prefetcher as used in previous Blue Gene architecture generations, as well as a novel list prefetcher. The list prefetcher tracks and records memory requests, sent out by the core, and writes the sequence as a list to a predefined memory region. It can replay this list to initiate prefetches for repeated sequences of similar access patterns. The sequences do not have to be identical, as the list processing is tolerant to a limited number of additional or missing accesses. This automated learning mechanism allows a near perfect prefetch behavior for a set of important codes that show the required access behavior, as well as perfect prefetch behavior for codes that allow precomputation of the access list.
A system, method and computer program product are provided for improving the performance of a parallel computing system, e.g., by prefetching data or instructions according to a list including a sequence of prior cache miss addresses.
In one embodiment, a parallel computing system operates at least an algorithm for prefetching data and/or instructions. According to the algorithm, with software (e.g., a compiler) cooperation, memory access patterns can be recorded and/or reused by at least one list prefetch engine (e.g., a software or hardware module prefetching data or instructions according to a list including a sequence of prior cache miss address(es)). In one embodiment, there are at least four list prefetch engines. A list prefetch engine allows iterative application software (e.g., “while” loop, etc.) to make an efficient use of general, but repetitive, memory access patterns. The recording of patterns of physical memory access by hardware (e.g., a list prefetch engine 2100 in
A list describes an arbitrary sequence (i.e., a sequence not necessarily arranged in an increasing, consecutive order) of prior cache miss addresses (i.e., addresses that caused cache misses before). In one embodiment, address lists which are recorded from L1 (level one) cache misses and later loaded and used to drive the list prefetch engine may include, for example, 29-bit, 128-byte addresses identifying L2 (level-two) cache lines in which an L1 cache miss occurred. Two additional bits are used to identify, for example, the 64-byte, L1 cache lines which were missed. In this embodiment, these 31 bits plus an unused bit compose the basic 4-byte record out of which these lists are composed.
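The 4-byte list record described above (a 29-bit L2 line address, two bits marking the missed 64-byte L1 lines, and one unused bit) can be illustrated with a small packing routine. The bit positions chosen here are an assumption made for illustration only; the text does not specify where within the 32-bit word each field sits. All function names are hypothetical.

```python
L2_LINE_BYTES = 128   # size of an L2 cache line
L1_LINE_BYTES = 64    # size of an L1 cache line (half of an L2 line)
LINE_BITS = 29        # width of the L2 line address field


def pack_record(l2_line, missed_halves):
    """Pack a list record into 32 bits.

    Assumed layout: bits 31..3 = L2 line address, bits 2..1 = mask of the
    missed 64-byte halves, bit 0 = unused.
    """
    if not 0 <= l2_line < (1 << LINE_BITS):
        raise ValueError("L2 line address does not fit in 29 bits")
    if not 0 <= missed_halves < 4:
        raise ValueError("missed_halves must be a 2-bit mask")
    return (l2_line << 3) | (missed_halves << 1)


def unpack_record(record):
    """Return (l2_line, missed_halves) from a packed 32-bit record."""
    return (record >> 3) & ((1 << LINE_BITS) - 1), (record >> 1) & 0x3


def record_for_miss(byte_address):
    """Build the record for a single L1 miss at a physical byte address."""
    l2_line, offset = divmod(byte_address, L2_LINE_BYTES)
    half = offset // L1_LINE_BYTES          # 0 = lower half, 1 = upper half
    return pack_record(l2_line, 1 << half)
```

With 29 bits of 128-byte line address, a record under this layout can name any line within a 2^36-byte physical range.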
In one embodiment, a general approach to efficiently prefetching data being requested by a L1 (level-one) cache is to prefetch data and/or instructions following a memorized list of earlier access requests. Prefetching data according to a list works well for repetitive portions of code which do not contain data-dependent branches and which repeatedly make the same, possibly complex, pattern of memory accesses. Since this list prefetching (i.e., prefetching data whose addresses appear in a list) can be understood at an application level, a recording of such a list and its use in subsequent iterations may be initiated by compiler directives placed in code at strategic spots. For example, “start_list” (i.e., a directive for starting a list prefetch engine) and “stop_list” (i.e., a directive for stopping a list prefetch engine) directives may locate those strategic spots of the code where first memorizing, and then later prefetching, a list of L1 cache misses may be advantageous.
In one embodiment, a directive called start_list causes a processor core to issue a memory mapped command (e.g., input/output command) to the parallel computing system. The command may include, but is not limited to:
The first module 2120 receives a current cache miss address (i.e., an address which currently causes a cache miss) and evaluates whether the current cache miss address is valid. A valid cache miss address refers to a cache miss address belonging to a class of cache miss addresses for which list prefetching is intended. In one embodiment, the first module 2120 evaluates whether the current cache miss address is valid or not, e.g., by checking a valid bit attached to the current cache miss address. The list prefetch engine 2100 stores the current cache miss address in the ListWrite array 2135 and/or the history FIFO. In one embodiment, the write module 2130 writes the contents of the array 2135 to a memory device when the array 2135 becomes full. In another embodiment, as the ListWrite Array 2135 is filled, e.g., by continuing L1 cache misses, the write module 2130 continually writes the contents of the array 2135 to a memory device and forms a new list that will be used on a next iteration (e.g., a second iteration of a “for” loop, etc.).
In one embodiment, the write module 2130 stores the contents of the array 2135 in a compressed form (e.g., collapsing a sequence of adjacent addresses into a start address and the number of addresses in the sequence) in a memory device (not shown). In one embodiment, the array 2135 stores a cache miss address in each element of the array. In another embodiment, the array 2135 stores a pointer pointing to a list of one or more addresses. In one embodiment, there is provided a software entity (not shown) for tracing a mapping between a list and a software routine (e.g., a function, loop, etc.). In one embodiment, cache miss addresses, which fall within an allowed address range, carry a proper pattern of translation lookaside buffer (TLB) user bits and are generated, e.g., by an appropriate thread. These cache miss addresses are stored sequentially in the ListWrite array 2135.
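The compressed form mentioned above, in which a sequence of adjacent addresses collapses into a start address and a count, can be sketched as a run-length encoding. This is a minimal illustration under the assumption that "adjacent" means consecutive line addresses differing by a fixed stride; the function names and the stride parameter are hypothetical and do not describe the actual write module.

```python
def compress_runs(addresses, stride=1):
    """Collapse runs of adjacent addresses into (start, count) pairs."""
    runs = []
    for addr in addresses:
        # Extend the current run if this address directly follows it.
        if runs and addr == runs[-1][0] + runs[-1][1] * stride:
            runs[-1][1] += 1
        else:
            runs.append([addr, 1])
    return [tuple(run) for run in runs]


def expand_runs(runs, stride=1):
    """Inverse of compress_runs: rebuild the original address sequence."""
    return [start + i * stride for start, count in runs for i in range(count)]
```

Because the list order is preserved, expanding the compressed form reproduces the original sequence exactly, including non-monotonic jumps between runs.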
In one embodiment, a processor core may allow for possible list mismatches where a sequence of load commands deviates sufficiently from a stored list that the list prefetch engine 2100 uses. Then, the list prefetch engine 2100 abandons the stored list but continues to record an altered list for later use.
In one embodiment, each list prefetch engine includes a history FIFO (not shown). This history FIFO can be implemented, e.g., by a 4-entry deep, 4 byte-wide set of latches, and can include at least four most recent L2 cache lines which appeared as L1 cache misses. This history FIFO can store L2 cache line addresses corresponding to prior L1 cache misses that happened most recently. When a new L1 cache miss, appropriate for a list prefetch engine, is determined as being valid, e.g., based on a valid bit associated with the new L1 cache miss, an address (e.g., 64-byte address) that caused the L1 cache miss is compared with the at least four addresses in the history FIFO. If there is a match between the L1 cache miss address and one of the at least four addresses, an appropriate bit in a corresponding address field (e.g., 32-bit address field) is set to indicate the half portion of the L2 cache line that was missed, e.g., the 64-byte portion of the 128-byte cache line was missed. If a next L1 cache miss address matches none of the at least four addresses in the history FIFO, an address at a head of the history FIFO is written out, e.g., to the ListWrite array 2135, and this next address is added to a tail of the history FIFO.
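The history FIFO behavior described above can be modeled in a few lines: a miss that falls in an L2 line already held in the FIFO only sets the bit for the missed 64-byte half, while a miss to a new line evicts the head entry to the list being written. This is a behavioral sketch, not the latch-level design; the class name, the `written` list standing in for the ListWrite array, and the choice to evict only when the FIFO is full are assumptions.

```python
from collections import deque


class HistoryFifo:
    """Behavioral model of a 4-entry history of recent L1 miss lines."""

    def __init__(self, depth=4):
        self.depth = depth
        self.entries = deque()   # each entry is [l2_line, half_mask]
        self.written = []        # stand-in for the ListWrite array

    def record_miss(self, byte_address):
        l2_line, offset = divmod(byte_address, 128)
        bit = 1 << (offset // 64)            # which 64-byte half was missed
        for entry in self.entries:
            if entry[0] == l2_line:
                entry[1] |= bit              # merge into the existing entry
                return
        if len(self.entries) == self.depth:
            # No match and FIFO full: the head entry is written out.
            self.written.append(tuple(self.entries.popleft()))
        self.entries.append([l2_line, bit])  # new line joins the tail
```

Merging the two halves of one L2 line into a single entry is what lets one 4-byte record describe up to two L1 misses.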
When an address is removed from one entry of the history FIFO, it is written into the ListWrite array 2135. In one embodiment, this ListWrite array 2135 is an array, e.g., 8-deep, 16-byte wide array, which is used by all or some of the list prefetch engines. An arbiter (not shown) assigns a specific entry (e.g., a 16-byte entry in the history FIFO) to a specific list prefetch engine. When this specific entry is full, it is scheduled to be written to memory and a new entry is assigned to the specific list prefetch engine.
The depth of this ListWrite array 2135 may be sufficient to allow for a time period for which a memory device takes to respond to this writing request (i.e., a request to write an address in an entry in the history FIFO to the ListWrite array 2135), providing sufficient additional space that a continued stream of L1 cache miss addresses will not overflow this ListWrite array 2135. In one embodiment, if 20 clock cycles are required for a 16-byte word of the list to be accepted to the history FIFO and addresses can be provided at the rate at which L2 cache data is being supplied (one L1 cache miss corresponds to 128 bytes of data loaded in 8 clock cycles), then the parallel computing system may need to have a space to hold 20/8≈3 addresses or an additional 12 bytes. According to this embodiment, the ListWrite array 2135 may be composed of at least four, 4-byte wide and 3-word deep register arrays. Thus, in this embodiment, a depth of 8 may be adequate for the ListWrite array 2135 to support a combination of at least four list prefetch engines with various degrees of activity. In one embodiment, the ListWrite array 2135 stores a sequence of valid cache miss addresses.
The list prefetch engine 2100 stores the current cache miss address in the array 2135. The list prefetch engine 2100 also provides the current cache miss address to the comparator 2110. In one embodiment, the engine 2100 provides the current miss address to the comparator 2110 when it stores the current miss address in the array 2135. In one embodiment, the comparator 2110 compares the current cache miss address and a list address (i.e., an address in a list; e.g., an element in the array 2135). If the comparator 2110 does not find a match between the current miss address and the list address, the comparator 2110 compares the current cache miss address with the next list addresses (e.g., the next eight addresses listed in a list; the next eight elements in the array 2135) held in the ListRead Array 2115 and selects the earliest matching address in these addresses (i.e., the list address and the next list addresses). The earliest matching address refers to a prior cache miss address whose index in the array 2115 is the smallest and which matches with the current cache miss address. An ability to match a next address in the list with the current cache miss address is a fault tolerant feature permitting addresses in the list which do not reoccur as L1 cache misses in a current running of a loop to be skipped over.
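The fault-tolerant matching described above can be sketched as a window comparison: the current miss is checked against the current list address and the next eight, the earliest match wins, skipped list entries are passed over, and a miss that matches nothing in the window is ignored. This is a simplified software model under those assumptions; the function names and the sequential search are illustrative, and the hardware comparator may well compare all candidates in parallel.

```python
def match_miss(recorded, pos, miss, lookahead=8):
    """Return the index of the earliest match in the comparison window, or None.

    The window covers the current list address at `pos` plus the next
    `lookahead` addresses.
    """
    window = recorded[pos:pos + 1 + lookahead]
    for offset, addr in enumerate(window):
        if addr == miss:
            return pos + offset
    return None


def replay(recorded, misses, lookahead=8):
    """Replay a recorded list against a new miss stream.

    Returns the list indices that were matched, in order.
    """
    pos = 0
    matched = []
    for miss in misses:
        hit = match_miss(recorded, pos, miss, lookahead)
        if hit is None:
            continue              # extra miss not in the window: ignored
        matched.append(hit)
        pos = hit + 1             # list entries before the hit are skipped
    return matched
```

For example, replaying the recorded list `[100, 200, 300, 400, 500]` against the miss stream `[100, 300, 999, 400]` skips the entry 200, ignores the unlisted miss 999, and matches indices 0, 2, and 3.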
In one embodiment, the comparator 2110 compares addresses in the list and the current cache miss address in an order. For example, the comparator 2110 compares the current cache miss address and the first address in the list. Then, the comparator may compare the current cache miss address and the second address in the list. In one embodiment, the comparator 2110 synchronizes an address in the list which the comparator 2110 matches with the current cache miss address with later addresses in the list for which data is being prefetched. For example, if the list prefetch engine 2100 finds a match with a second element in the array 2115, then the list prefetch engine 2100 prefetches data whose addresses are stored in the second element and subsequent elements of the array 2115. This separation between the address in the list which matches the current cache miss address and the address in the list being prefetched is called the prefetch depth, and in one embodiment this depth can be set, e.g., by software (e.g., a compiler). In one embodiment, the comparator 2110 includes a fault-tolerant feature. For example, when the comparator 2110 detects a valid cache miss address that does not match any list address with which it is compared, that cache miss address is dropped and the comparator 2110 waits for the next valid address. In another embodiment, a series of mismatches between the cache miss address and the list address (i.e., addresses in a list) may cause the list prefetch engine to be aborted. However, construction of a new list in the ListWrite array 2135 will continue. In one embodiment, loads (i.e., load commands) from a processor core may be stalled until a list has been read from a memory device and the list prefetch engine 2100 is ready to compare (2110) subsequent L1 cache misses with at least or at most eight addresses of the list.
In one embodiment, lists needed for a comparison (2110) by at least four list prefetch engines are loaded (under a command of individual list prefetch engines) into a register array, e.g., an array of 24 depth and 16-bytes width. These registers are loaded according to a clock frequency with data coming from the memory (not shown). Thus, each list prefetch engine can access at least 24 four-byte list entries from this register array. In one embodiment, a list prefetch engine may load these list entries into its own set of, for example, 8, 4-byte comparison latches. L1 cache miss addresses issued by a processor core can then be compared with addresses (e.g., at least or at most eight addresses) in the list. In this embodiment, when a list prefetch engine consumes 16 of the at least 24 four-byte addresses and issues a load request for data (e.g., the next 64-byte data in the list), a reservoir of the 8, 4-byte addresses may remain, permitting a single skip-by-eight (i.e., skipping eight 4-byte addresses) and subsequent reload of the 8, 4-byte comparison latches without requiring a stall of the processor core.
In one embodiment, L1 cache misses associated with a single thread may require data to be prefetched at a bandwidth of the memory system, e.g., one 32-byte word every two clock cycles. In one embodiment, if the parallel computing system requires, for example, 100 clock cycles for a read command to the memory system to produce valid data, the ListRead array 2115 may have sufficient storage so that 100 clock cycles can pass between an availability of space to store data in the ListRead array 2115 and a consumption of the remaining addresses in the list. In this embodiment, in order to conserve area in the ListRead array 2115, only 64-byte segments of the list may be requested by the list prefetch engine 2100. Since each L1 cache miss leads to a fetching of data (e.g., 128-byte data), the parallel computing system may consume addresses in an active list at a rate of one address every particular number of clock cycles (e.g., 8 clock cycles). Recognizing a size of an address, e.g., as 4 bytes, the parallel computing system may calculate that a particular lag (e.g., 100 clock cycle lag) between a request and data in the list may require, for example, 100/8*4 or a reserve of 50 bytes to be provided in the ListRead array 2115. Thus, a total storage provided in the ListRead array 2115 may be, for example, 50+64=114 bytes. Then, a total storage (e.g., 32+96=128 bytes) of the ListRead array 2115 may be close to a maximum requirement.
The prefetch unit 2105 prefetches data and/or instruction(s) according to a list if the comparator 2110 finds a match between the current cache miss address and an address on the list. The prefetch unit 2105 may prefetch all or some of the data stored in addresses in the list. In one embodiment, the prefetch unit 2105 prefetches data and/or instruction(s) up to a programmable depth (i.e., a particular number of instructions or particular amount of data to be prefetched; this particular number or particular amount can be programmed, e.g., by software).
In one embodiment, addresses held in the comparator 2110 determine prefetch addresses which occur later in the list and which are sent to the prefetch unit 2105 (with an appropriate arbitration between the at least four list prefetch engines). Those addresses (which have not yet been matched) are sent off for prefetching up to a programmable prefetch depth (e.g., a depth of 8). If an address matching (e.g., an address comparison between an L1 cache miss address and an address in a list) proceeds with a sufficient speed that a list address not yet prefetched matches the L1 cache miss address, this list address may trigger a demand to load data at the list address and no prefetch of the data is required. Instead, a demand load of the data to be returned directly to a processor core may be issued. The address matching may be done in parallel or sequentially, e.g., by the comparator 2110.
In one embodiment, the parallel computing system can estimate the largest prefetch depth that might be needed to ensure that prefetched data will be available when a corresponding address in the list turns up as an L1 cache miss address (i.e., an address that caused an L1 cache miss). Assuming that a single thread running in a processor core is consuming data as fast as the memory system can provide it (e.g., a new 128-byte prefetch operation every 8 clock cycles) and that a prefetch request requires, for example, 100 clock cycles to be processed, the parallel computing system may need to have, for example, 100/8≈12 active prefetch commands; that is, a depth of 12, which may be reasonably close to the largest available depth (e.g., a depth of 8).
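The depth estimate above is a simple ratio of request latency to consumption interval. A minimal sketch follows; the function name is hypothetical, and truncation toward zero is an assumption chosen to match the "≈12" figure in the example.

```python
def estimated_prefetch_depth(latency_cycles=100, cycles_per_prefetch=8):
    """Number of prefetches that must be in flight to cover the latency.

    Integer division is an assumption made here to mirror the example
    (100/8 = 12.5, stated as approximately 12).
    """
    return latency_cycles // cycles_per_prefetch
```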
In one embodiment, the read module 2125 stores a pointer pointing to a list including addresses whose data may be prefetched in each element. The ListRead array 2115 stores an address whose data may be prefetched in each element. The read module 2125 loads a plurality of list elements from a memory device to the ListRead array 2115. A list loaded by the read module 2125 includes, but is not limited to: a new list (i.e., a list that is newly created by the list prefetch engine 2100), an old list (i.e., a list that has been used by the list prefetch engine 2100). Contents of the ListRead array 2115 are presented as prefetch addresses to a prefetch unit 2105 to be prefetched. This presentation may continue until a pre-determined or post-determined prefetching depth is reached. In one embodiment, the list prefetch engine 2100 may discard a list whose data has been prefetched. In one embodiment, a processor (not shown) may stall until the ListRead array 2115 is fully or partially filled.
In one embodiment, there is provided a counter device in the prefetching control (not shown) which counts the number of elements in the ListRead array 2115 between that most recently matched by the comparator 2110 and the latest address sent to the prefetch unit 2105. As a value of the counter device decrements, i.e., the number of matches increments, while the matching operates with the ListRead array 2115, prefetching from later addresses in the ListRead array 2115 may be initiated to maintain a preset prefetching depth for the list.
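The interaction between address matching, the depth counter, and demand loads may be easier to follow in a small software model. The sketch below is illustrative only: the class and attribute names are hypothetical, and it searches the whole remaining list rather than a small set of comparison latches, which is a simplification of the comparator 2110.

```python
class ListPrefetchModel:
    """Toy model: keep `depth` list entries prefetched ahead of the last match."""

    def __init__(self, addresses, depth=8):
        self.addresses = list(addresses)
        self.depth = depth
        self.matched = -1  # index of the most recently matched list entry
        self.issued = -1   # index of the latest entry sent to the prefetch unit

    @property
    def outstanding(self):
        """Entries between the last match and the latest issued prefetch."""
        return self.issued - self.matched

    def on_l1_miss(self, miss_address):
        """Return (addresses to prefetch now, whether a demand load is needed)."""
        try:
            hit = self.addresses.index(miss_address, self.matched + 1)
        except ValueError:
            return [], False  # no match: nothing to do in this model
        demand = hit > self.issued  # matched before it was prefetched
        self.matched = hit
        if demand:
            self.issued = hit
        # Top up prefetches so that up to `depth` entries are ahead of the match.
        target = min(hit + self.depth, len(self.addresses) - 1)
        new = self.addresses[self.issued + 1:target + 1]
        self.issued = max(self.issued, target)
        return new, demand
```

In this model each match reduces the outstanding count by one, and the top-up step restores it to the preset depth when list entries remain.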
In one embodiment, the list prefetch engine 2100 may be implemented in hardware or reconfigurable hardware, e.g., FPGA (Field Programmable Gate Array) or CPLD (Complex Programmable Logic Device), using a hardware description language (Verilog, VHDL, Handel-C, or System C). In another embodiment, the list prefetch engine 2100 may be implemented in a semiconductor chip, e.g., ASIC (Application-Specific Integrated Circuit), using a semi-custom design methodology, i.e., designing a chip using standard cells and a hardware description language. In one embodiment, the list prefetch engine 2100 may be implemented in a processor (e.g., IBM® PowerPC® processor, etc.) as a hardware unit(s). In another embodiment, the list prefetch engine 2100 may be implemented in software (e.g., a compiler or operating system), e.g., by a programming language (e.g., Java®, C/C++, .Net, Assembly language(s), Perl, etc.).
At step 2215, the list prefetch engine evaluates whether the ListWrite array 2135 is full or not, e.g., by checking an empty bit (i.e., a bit indicating that a corresponding slot is available) of each slot of the array 2135. If the ListWrite array 2135 is not full, the control goes to step 2205 to receive a next cache miss address. Otherwise, at step 2220, the list prefetch engine stores contents of the array 2135 in a memory device.
At step 2225, the parallel computing system evaluates whether the list prefetch engine needs to stop. Such a command to stop would be issued when running list control software (not shown) issues a stop list command (i.e., a command for stopping the list prefetch engine 2100). If such a stop command has not been issued, the control goes to step 2205 to receive a next cache miss address. Otherwise, at step 2230, the prefetch engine flushes contents of the ListWrite array 2135. This flushing may set empty bits (e.g., a bit indicating that an element in an array is available to store a new value) of elements in the ListWrite array 2135 to high (“1”) to indicate that those elements are available to store new values. Then, at step 2235, the parallel computing system stops this list prefetch engine (i.e., a prefetch engine performing the steps 2200-2230).
While operating steps 2205-2230, the prefetch engine 2100 may concurrently operate steps 2240-2290. At step 2240, the list prefetch engine 2100 determines whether the current list has been created by a previous use of a list prefetch engine or some other means. In one embodiment, this is determined by a “load list” command bit set by software when the list prefetch engine 2100 is started. If this “load list” command bit is not set to high (“1”), then no list is loaded to the ListRead array 2115 and the list prefetch engine 2100 only records a list of the L1 cache misses to the history FIFO or the ListWrite array 2135 and does no prefetching.
If the list assigned to this list prefetch engine 2100 has not been created, the control goes to step 2295 to not load a list into the ListRead array 2115 and to not prefetch data. If the list has been created, e.g., by a list prefetch engine or other means, the control goes to step 2245. At step 2245, the read module 2125 begins to load the list from a memory system.
At step 2250, a state of the ListRead array 2115 is checked. If the ListRead array 2115 is full, then the control goes to step 2255 for an analysis of the next cache miss address. If the ListRead array 2115 is not full, a corresponding processor core is held at step 2280 and the read module 2125 continues loading prior cache miss addresses into the ListRead array 2115 at step 2245.
At step 2255, the list prefetch engine evaluates whether the received cache miss address is valid, e.g., by checking a valid bit of the cache miss address. If the cache miss address is not valid, the control repeats the step 2255 to receive a next cache miss address and to evaluate whether the next cache miss address is valid. A valid cache miss address refers to a cache miss address belonging to a class of cache miss addresses for which a list prefetching is intended. Otherwise, at step 2260, the comparator 2110 compares the valid cache miss address and address(es) in a list in the ListRead array 2115. In one embodiment, the ListRead array 2115 stores a list of prior cache miss addresses. If the comparator 2110 finds a match between the valid cache miss address and an address in a list in the ListRead array 2115, the list prefetch engine resets a value of a counter device which counts the number of mismatches between the valid cache miss address and addresses in list(s) in the ListRead array 2115.
Otherwise, at step 2290, the list prefetch engine compares the value of the counter device to a threshold value. If the value of the counter device is greater than the threshold value, the control goes to step 2290 to let the parallel computing system stop the list prefetch engine 2100. Otherwise, at step 2285, the list prefetch engine 2100 increments the value of the counter device and the control goes back to the step 2255.
At step 2270, the list prefetch engine prefetches data whose addresses are described in the list which included the matched address. The list prefetch engine prefetches data stored in all or some of the addresses in the list. The prefetched data may be data whose addresses are described later in the list, e.g., immediately following the matched address. At step 2275, the list prefetch engine evaluates whether the list prefetch engine reaches “EOL” (End of List) of the list. In other words, the list prefetch engine 2100 evaluates whether the prefetch engine 2100 has prefetched all the data whose addresses are listed in the list. If the prefetch engine does not reach the “EOL,” the control goes back to step 2245 to load addresses (in the list) whose data have not been prefetched yet into the ListRead array 2115. Otherwise, the control goes to step 2235. At step 2235, the parallel computing system stops operating the list prefetch engine 2100.
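The mismatch-counter behavior described in the preceding steps (reset on a match; on a mismatch, stop if the counter exceeds a threshold, otherwise increment) may be sketched as follows. The function name and return convention are hypothetical, and the list is modeled as a simple set of addresses rather than the contents of the ListRead array 2115.

```python
def run_matching(list_addresses, miss_stream, threshold):
    """Return ("stopped", index) when too many consecutive mismatches occur,
    otherwise ("running", number_of_misses_processed)."""
    listed = set(list_addresses)
    mismatches = 0
    for index, miss in enumerate(miss_stream):
        if miss in listed:
            mismatches = 0            # a match resets the counter
        elif mismatches > threshold:
            return "stopped", index   # counter exceeded the threshold
        else:
            mismatches += 1           # tolerate the mismatch and continue
    return "running", len(miss_stream)
```

For example, with a threshold of 2, an isolated run of two mismatches between matches is tolerated, while a sustained run of mismatches stops the modeled engine.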
In one embodiment, the parallel computing system allows the list prefetch engine to memorize an arbitrary sequence of prior cache miss addresses for one iteration of programming code and subsequently exploit these addresses by prefetching data stored in this sequence of addresses. This data prefetching is synchronized with an appearance of earlier cache miss addresses during a next iteration of the programming code.
In a further embodiment, the method illustrated in
The list prefetch engine can prefetch data through a use of a sliding window (e.g., a fixed number of elements in the ListRead array 2115) that tracks the latest cache miss addresses, thereby allowing data stored in a fixed number of cache miss addresses in the sliding window to be prefetched. This usage of the sliding window achieves a maximum performance, e.g., by efficiently utilizing a prefetch buffer which is a scarce resource. The sliding window also provides a degree of tolerance in that a match in the list is not necessary as long as the next L1 cache miss address is within a range of a width of the sliding window.
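The tolerance provided by the sliding window may be illustrated by the following minimal sketch, in which a miss address is accepted if it appears anywhere inside the window, so that skipped list entries do not break synchronization. The function name and the default width of eight entries are assumptions for illustration.

```python
def match_in_window(list_addresses, window_start, miss_address, window_width=8):
    """Return the new window start (just past the match), or None if the
    miss address is not inside the current window."""
    window = list_addresses[window_start:window_start + window_width]
    for offset, address in enumerate(window):
        if address == miss_address:
            return window_start + offset + 1
    return None
```

With a window width of 3 over the list 10, 20, 30, 40, 50, a miss on 30 is matched even though 10 and 20 were never seen, whereas a miss on 40 falls outside the window.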
A list of addresses can be stored in a memory device in a compressed form to reduce an amount of storage needed by the list.
Lists are indexed and can be explicitly controlled by software (user or compiler) to be invoked.
Lists can optionally be simultaneously saved while a current list is being utilized for prefetching. This feature allows an additional tolerance to actual memory references, e.g., by effectively refreshing at least one list on each invocation.
Lists can be paused through software to avoid loading a sequence of addresses that are known not to be relevant (e.g., the sequence of addresses is unlikely to be re-accessed by a processor unit). For example, data dependent branches such as occur during a table lookup may be carried out while list prefetching is paused.
In one embodiment, prefetching initiated by an address in a list is for a full L2 (Level-two) cache line. In one embodiment, the size of the list may be minimized or optimized by including only a single 64-byte address which lies in a given 128-byte cache line. In this embodiment, this optimization is accomplished, e.g., by comparing each L1 cache miss with previous four L1 cache misses and adding a L1 cache miss address to a list only if it identifies a 128-byte cache line different from those previous four addresses. In this embodiment, in order to enhance a usage of the prefetch data array, a list may identify, in addition to an address of the 128-byte cache line to be prefetched, those 64-byte portions of the 128-byte cache line which corresponded to L1 cache misses. This identification may allow prefetched data to be marked as available for replacement as soon as portions of the prefetched data that will be needed have been hit.
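The list-size optimization above, in which an L1 cache miss is recorded only if its 128-byte cache line differs from the lines of the previous four misses, may be sketched as follows. The function name is hypothetical; the recording of 64-byte portions within each line is omitted for brevity.

```python
from collections import deque

LINE_BYTES = 128  # L2 cache line size used in the example above


def build_list(miss_addresses, history=4):
    """Record one line-aligned address per 128-byte line, skipping any miss
    whose line matches one of the previous `history` misses."""
    recent = deque(maxlen=history)  # lines of the most recent misses
    recorded = []
    for address in miss_addresses:
        line = address // LINE_BYTES
        if line not in recent:
            recorded.append(line * LINE_BYTES)
        recent.append(line)
    return recorded
```

Note that a line can be recorded again once it has fallen out of the four-miss history, as the final miss in the test sequence below shows.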
There is provided a system, method and computer program product for prefetching of data or instructions in a plurality of streams while adaptively adjusting prefetching depths of each stream.
Further, the adaptation algorithm may constrain the total depth of all prefetched streams to be predetermined and consistent with the available storage resources in a stream prefetch engine.
In one embodiment, a stream prefetch engine (e.g., a stream prefetch engine 20200 in
In one embodiment, a parallel computing system operates at least one prefetch algorithm as follows:
Stream prefetching: a plurality of concurrent data or instruction streams (e.g., 16 data streams) of consecutive addresses can be simultaneously prefetched with support for up to a maximum prefetching depth (e.g., eight cache lines can be prefetched per stream) with a fully adaptive depth selection. An adaptive depth selection refers to an ability to change a prefetching depth adaptively. A stream refers to sequential data or instructions. An MPEG (Moving Picture Experts Group) movie file or an MP3 music file is an example of a stream.
In one embodiment, there are provided rules for adaptively adjusting the prefetching depth. These rules may govern a performance of the stream prefetch engine (e.g., a stream prefetch engine 20200 in
Rule 1: a stream may increase its prefetching depth in response to a prefetch to a demand fetch conversion event that is indicative of bandwidth starvation. A demand fetch conversion event refers to a hit on a line that has been established in a prefetch directory but has not yet had data returned from a switch or a memory device. The prefetch directory is described in detail below in conjunction with
Rule 2: this depth increase is performed at an expense of a victim stream whenever a sum of all prefetching depths equals a maximum capacity of the stream prefetch engine. In one embodiment, the victim stream selected is the least recently used stream with non-zero prefetching depth. In this way, less active or inactive streams may have their depths taken by more competitive hot streams, similar to stale data being evicted from a cache. This selection of a victim stream has at least two consequences: First, that victim's allowed depth is decreased by one. Second, when an additional prefetching is performed for the stream whose depth has been increased, it is possible that all or some prefetch registers may be allocated to active streams including the victim stream since the decrease in the depth of the victim stream does not imply that the actual data footprint of that stream in the prefetch data array may correspondingly shrink. Prefetch registers refer to registers working with the stream prefetch engine. Excess data resident in the prefetch data array for the victim stream may eventually be replaced by new cache lines of more competitive hot streams. This replacement is not necessarily immediate, but may eventually occur.
In one embodiment, there is provided a free depth counter which is non-zero when a sum of all prefetching depths is less than the capacity of the stream prefetch engine. In one embodiment, this counter has value 32 on reset, and per-stream depth registers are reset to zero. These per-stream depth registers store a prefetching depth for each active stream. Thus, the contents of the per-stream depth registers are changed as a prefetching depth of a stream is changed. When a stream is invalidated, its depth is returned to the free depth counter.
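The bookkeeping implied by Rules 1 and 2 and the free depth counter may be summarized in the following minimal sketch. The class and method names are hypothetical, the victim is supplied by the caller rather than chosen here, and the per-stream cap of eight is taken from the example depth given earlier.

```python
class DepthAllocator:
    """Toy model of per-stream depth registers plus a free depth counter."""

    def __init__(self, capacity=32, streams=16, max_depth=8):
        self.free = capacity          # free depth counter value on reset
        self.depth = [0] * streams    # per-stream depth registers, reset to zero
        self.max_depth = max_depth

    def grow(self, stream, victim=None):
        """Increase a stream's depth by one; return True on success."""
        if self.depth[stream] >= self.max_depth:
            return False
        if self.free > 0:
            self.free -= 1            # take depth from the free pool first
        elif victim is not None and victim != stream and self.depth[victim] > 0:
            self.depth[victim] -= 1   # otherwise take one unit from the victim
        else:
            return False
        self.depth[stream] += 1
        return True

    def invalidate(self, stream):
        """Return an invalidated stream's depth to the free depth counter."""
        self.free += self.depth[stream]
        self.depth[stream] = 0
```

In this model the sum of all per-stream depths plus the free depth counter always equals the capacity, which is one way to express the constraint on total depth.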
The prefetch directory (PFD) 20240 stores tag information (e.g., valid bits) and meta data associated with each cache line stored in the prefetch data array (PDA) 20235. The prefetch data array 20235 stores cache lines (e.g., L2 (Level two) cache lines and/or L1 (Level one) cache lines) prefetched, e.g., by the stream prefetch unit 20200. In one embodiment, the stream prefetch engine 20200 supports diverse memory latencies and a large number (e.g., 1 million) of active threads run in the parallel computing system. In one embodiment, the stream prefetching makes use of the prefetch data array 20235 which holds up to, for example, 32 128-byte level-two cache lines.
In one embodiment, an entry of the PFD 20240 includes, but is not limited to, an address valid (AVALID) bit(s), a data valid (DVALID) bit, a prefetching depth (DEPTH) of a stream, a stream ID (Identification) of the stream, etc. An address valid bit indicates whether the PFD 20240 has a valid cache line address corresponding to a memory address requested in a load request issued by the processor. A valid cache line address refers to a valid address of a cache line. A load request refers to an instruction to move data from a memory device to a register in a processor. When an address is entered as valid into the PFD 20240, corresponding data may be requested from a memory device but may be not immediately received. The data valid bit indicates whether the stream prefetch engine 20200 has received data corresponding to an AVALID bit from a memory device 20220. In other words, DVALID bit is set to low (“0”) to indicate pending data, i.e., the data that has been requested to the memory device 20220 but has not been received by the prefetch unit 20215. When the prefetch unit 20215 establishes an entry in the prefetch directory 20240 with setting the AVALID bit to high (“1”) to indicate the entry has a valid cache line address corresponding to a memory address requested in a load request, the prefetch unit 20215 may also request corresponding data (e.g., L1 or L2 cache line corresponding to the memory address) from a memory device 20220 (e.g., L1 cache memory device, L2 cache memory device, a main memory device, etc.) and set corresponding DVALID bit to low. When an AVALID bit is set to high and a corresponding DVALID bit is set to low, the prefetch unit 20215 places a corresponding load request associated with these AVALID and DVALID bits in the DFC table 20225 to wait until the corresponding data that is requested by the prefetch unit 20215 comes from the memory device 20220.
Once the corresponding data arrives from the memory device 20220, the stream prefetch engine 20200 stores the data in the PDA 20235 and sets the DVALID bit to high in a corresponding entry in the PFD 20240. Then, the load request, for which there exists a valid cache line in the PDA 20235 and a valid cache line address in the PFD 20240, is forwarded to the hit queue 20205, e.g., by the prefetch unit 20215. In other words, once the DVALID bit and the AVALID bit are set to high in an entry in the PFD 20240, a load request associated with the entry is forwarded to the hit queue 20205.
A valid address means that a request for the data for this address has been sent to a memory device, and that the address has not subsequently been invalidated by a cache coherence protocol. Consequently, a load request to that address may either be serviced as an immediate hit, for example, to the PDA 20235 when the data has already been returned by the memory device (DVALID=1), or may be serviced as a demand fetch conversion (i.e., obtaining the data from a memory device) with the load request placed in the DFC table 20225 when the data is still in flight from the memory device (DVALID=0).
Valid data means that an entry in the PDA 20235 corresponding to the valid address in the PFD 20240 is also valid. This entry may be invalid when the data is initially requested from a memory device and may become valid when the data has been returned by the memory device.
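The roles of the AVALID and DVALID bits in servicing a load request may be illustrated with the following minimal sketch. The record layout and the function name are hypothetical; a real PFD 20240 entry may carry further metadata.

```python
from dataclasses import dataclass


@dataclass
class PfdEntry:
    address: int           # cache line address held in this entry
    avalid: bool = False   # address valid: data has been requested
    dvalid: bool = False   # data valid: data has arrived in the PDA
    depth: int = 0         # prefetching depth of the owning stream
    stream_id: int = 0


def classify(entries, line_address):
    """Classify a load request as "hit", "dfc" (demand fetch conversion)
    or "miss" against a list of directory entries."""
    for entry in entries:
        if entry.avalid and entry.address == line_address:
            return "hit" if entry.dvalid else "dfc"
    return "miss"
```

An entry with AVALID=1 and DVALID=1 yields an immediate hit, an entry with AVALID=1 and DVALID=0 yields a demand fetch conversion, and anything else is treated as a miss in this model.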
In one embodiment, the stream prefetch engine 20200 is triggered by hits in the prefetch directory 20240. As a prefetching depth can vary from a stream to another stream, a stream ID field (e.g., 4-bit field) is held in the prefetch directory 20240 for each cache line. This stream ID identifies a stream for which this cache line was prefetched and is used to select an appropriate prefetching depth.
A prefetch address is computed, e.g., by selecting the first cache line within the prefetching depth that is not resident (but is a valid address) in the prefetch directory 20240. A prefetch address is an address of data to be prefetched. As this address is dynamically selected from a current state of the prefetch directory 20240, duplicate entries are avoided, e.g., by comparing this address and addresses that are stored in the prefetch directory 20240. Some tolerance to evictions from the prefetch directory 20240 is gained.
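The selection of a prefetch address from the current state of the prefetch directory may be sketched as follows. The function name is hypothetical, addresses are expressed in units of cache lines, and the directory is modeled as a plain set of resident line addresses.

```python
def next_prefetch_address(hit_line, depth, directory_lines):
    """Return the first cache line within `depth` lines after `hit_line`
    that is not already in the directory, or None if all are present."""
    for offset in range(1, depth + 1):
        candidate = hit_line + offset
        if candidate not in directory_lines:
            return candidate
    return None
```

Because the candidate is checked against the directory at selection time, a line that was evicted earlier is simply selected again, which is the tolerance to evictions noted above.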
An actual data prefetching, e.g., guided by the prefetching depth, is managed as follows: When a stream is detected, e.g., by detecting subsequent cache line misses, a sequence of “N” prefetch requests is issued in “N” or more clock cycles, where “N” is a predetermined integer between 1 and 8. Subsequent hits to this stream (whether or not the data is already present in the prefetch data array 20235) initiate a single prefetch request, provided that an actual prefetching depth of this stream is less than its allowed depth. Increases in this allowed depth (caused by hits to cache lines being prefetched but not yet resident in the prefetch data array 20235) can be exploited by this one-hit/one-prefetch policy because the prefetch line length is twice the L1 cacheline length: two hits will occur to the same prefetch line for sequential accesses. This allows two prefetch lines to be prefetched for every prefetch line consumed and depth can be extended. One-hit/one-prefetch policy refers to a policy initiating a prefetch of data or instruction in a stream per a hit in that stream.
The prefetch unit 20215 stores in a demand fetch conversion (DFC) table 20225 a load request for which a corresponding cache line has an AVALID bit set to high but a DVALID bit not (yet) set to high. Once a valid cache line returns from the memory device 20220, the prefetch unit 20215 places the load request into the hit queue 20205. In one embodiment, a switch (not shown) provides the data to the prefetch unit 20215 after the switch retrieves the data from the memory device. This (i.e., receiving data from the memory device or the switch and placing the load request in the hit queue 20205) is known as demand fetch conversion (DFC). The DFC table 20225 is sized to match a total number of outstanding load requests supported by a processor core associated with the stream prefetch engine 20200.
In one embodiment, the demand fetch conversion (DFC) table 20225 includes, but is not limited to, an array of, for example, 16 entries×13 bits representing at least 14 hypothetically possible prefetch to demand fetch conversions. A returning prefetch from the switch is compared against this array. These entries may arbitrate for access to the hit queue, waiting for free clock cycles. These entries wait until the cache line is completely entered before requesting an access to the hit queue.
In one embodiment, the prefetch unit 20215 is tied quite closely to the prefetch directory 20240 on which the prefetch unit 20215 operates and is implemented as part of the prefetch directory 20240. The prefetch unit 20215 generates prefetch addresses for a data or instruction stream prefetch. If a stream ID of a hit in the prefetch directory 20240 indicates a data or instruction stream, the prefetch unit 20215 processes address and data vectors representing “hit”, e.g., by following steps 110-140 in
When either a hit or DFC occurs, the next “N” cache line addresses may be also matched in the PFD 20240 where “N” is a number described in the DEPTH field of a cache line that matched with the memory address. A hit refers to finding a match between a memory address requested in a load request and a valid cache line address in the PFD 20240. If a cache line within the prefetching depth of a stream is not present in the PDA 20235, the prefetch unit 20215 prefetches the cache line from a cache memory device (e.g., a cache memory 20220). Before prefetching the cache line, the prefetch unit 20215 may establish a corresponding cache line address in the PFD 20240 with AVALID bit set to high. Then, the prefetch unit 20215 requests data load from the cache memory device 20220. Data load refers to reading the cache line from the cache memory device 20220. When prefetching the cache line, the prefetch unit 20215 assigns to the prefetched cache line a same stream ID which is inherited from a cache line whose address was hit. The prefetch unit 20215 looks up a current prefetching depth of that stream ID in the adaptive control block 20230 and inserts this prefetching depth in a corresponding entry in the PFD 20240 which is associated with the prefetched cache line. The adaptive control block 20230 is described in detail below.
The stream detect engine 20210 memorizes a plurality of memory addresses that caused cache misses before. In one embodiment, the stream detect engine 20210 memorizes the latest sixteen memory addresses that caused load misses. Load misses refer to cache misses caused by load requests. If a load request demands an access to a memory address which resides in a next cache line of a cache line that caused a prior cache miss, the stream detect engine 20210 detects a new stream and establishes a stream. Establishing a stream refers to prefetching data or instructions in the stream according to a prefetching depth of the stream. Prefetching data or instructions in a stream according to a prefetching depth refers to fetching a certain number of instructions or a certain amount of data in the stream within the prefetching depth before they are needed. For example, if the stream detect engine 20210 is informed that a load from “M1” memory address is a missed address, it will memorize the corresponding cache line “C1”. Later, if a processor core issues a load request reading data in “M1+N” memory address and “M1+N” address corresponds to a cache line “C1+1” which is subsequent to the cache line “C1”, the stream detect engine 20210 detects a stream which includes the cache line “C1”, the cache line “C1+1”, a cache line “C1+2”, etc. Then, the prefetch unit 20215 fetches “C1+1” and prefetches subsequent cache lines (e.g., the cache line “C1+2”, a cache line “C1+3,” etc.) of the stream detected by the stream detect engine 20210 according to a prefetching depth of the stream. In one embodiment, the stream detect engine establishes a new stream whenever a load miss occurs. The number of cache lines established in the PFD 20240 by the stream detect engine 20210 is programmable.
In one embodiment, the stream prefetch engine 20200 operates in three modes where a stream is initiated on each of the following events:
Each of these modes can be enabled/disabled independently via MMIO registers. The optimistic mode and DCBT instruction share hardware logic (not shown) with the stream detect engine 20210. So that a use of the DCBT instruction, which is only effective on an L2 cache memory device, does not unnecessarily fill a load queue (i.e., a queue storing load requests) in a processor core, the stream prefetch engine 20200 may trigger an immediate return of dummy data. This allows the DCBT instruction to be retired without incurring the latency associated with a normal extraction of data from a cache memory device, as this DCBT instruction only affects an L2 cache memory operation and the data may not be held in an L1 cache memory device by the processor core.
In one embodiment, stream detection is performed by the stream detect engine 20210 by comparing all cache misses to a table of at least 16 expected 128-byte cache line addresses. A hit in this table triggers a number n of cache lines to be established in the prefetch directory 20240 on the following n clock cycles. A miss in this table causes a new entry to be established with a round-robin victim selection (i.e., selecting a cache line to be replaced in the table in a round-robin fashion).
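The table-based detection with round-robin replacement may be modeled as follows. This is a sketch under stated assumptions: the class and method names are hypothetical, addresses are in units of cache lines, each miss records the next sequential line as the expected address, and advancing the expected address on a hit is an assumption not stated in the description.

```python
class StreamDetector:
    """Toy model of a table of expected cache line addresses."""

    def __init__(self, entries=16):
        self.expected = [None] * entries
        self.next_victim = 0  # round-robin replacement pointer

    def on_miss(self, line):
        """Return True if this miss hits an expected line (stream detected)."""
        if line in self.expected:
            slot = self.expected.index(line)
            self.expected[slot] = line + 1  # assumption: track the stream onward
            return True
        # Miss in the table: record the next sequential line in the slot
        # chosen by round-robin victim selection.
        self.expected[self.next_victim] = line + 1
        self.next_victim = (self.next_victim + 1) % len(self.expected)
        return False
```

In the test sequence below, a two-entry table detects the stream 5, 6 and later loses track of it after two unrelated misses replace both entries.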
In one embodiment, a prefetching depth does not represent an allocation of prefetched cache lines to a stream. The stream prefetch engine 20200 allows elasticity (i.e., flexibility within certain limits) that can cause this depth to differ (e.g., by up to 8) between streams. For example, when a processor core aggressively issues load requests, the processor core can catch up with a stream, e.g., by hitting prefetched cache lines whose data has not yet been returned by the switch. These prefetch-to-demand fetch conversion cases may be treated as normal hits by the stream detect engine 20210 and additional cache lines are established and fetched. A prefetch-to-demand fetch conversion case refers to a case in which a hit occurs on a line that has been established in the prefetch directory 20240 but has not yet had data returned from a switch or a memory device. Thus, the number of prefetch lines used by a stream in the prefetch directory 20240 can exceed the prefetching depth of a stream. However, the stream prefetch engine 20200 will have the number of cache lines for each stream equal to that stream's prefetching depth once all pending requests are satisfied and the elasticity is removed.
The adaptive control block 20230 includes at least two data structures: 1. Depth table storing a prefetching depth of each stream which is registered in the PFD 20240 with its stream ID; 2. LRU (Least Recently Used) table identifying the least recently used streams among the registered streams, e.g., by employing a known LRU replacement algorithm. The known LRU replacement algorithm may update the LRU table whenever a hit in an entry in the PFD 20240 and/or DFC (Demand Fetch Conversion) occurs. In one embodiment, when a DFC occurs, the stream prefetch engine 20200 increments a prefetching depth of a stream associated with the DFC.
This increment allows a deep prefetch (e.g., prefetching data or instructions in a stream according to a prefetching depth of 8) to occur when only one or two streams are being prefetched, e.g., according to a prefetching depth of up to 8. Prefetching data or instructions according to a prefetching depth of a stream refers to fetching data or instructions in the stream within the prefetching depth ahead. For example, if a prefetching depth of a stream which comprises data stored in “K” cache line address, “K+1” cache line address, “K+2” cache line address, . . . , and “K+1000” cache line address is a depth of 2 and the stream detect engine 20210 detects this stream when a processor core requests data in “K+1” cache line address, then the stream prefetch engine 20200 fetches data stored in “K+1” cache line address and “K+2” cache line address. In one embodiment, an increment of a prefetching depth is only made in response to an indicator that loads from a memory device for this stream are exceeding the rate enabled by a current prefetching depth of the stream. For example, although the stream prefetch engine 20200 prefetches data or instructions, the stream may face demand fetch conversions because the stream prefetch engine 20200 fails to prefetch enough data or instructions ahead. Then, the stream prefetch engine 20200 increases the prefetching depth of the stream to fetch data or instructions further ahead for the stream. A load refers to reading data and/or instructions from a memory device. However, by only doing this increase in response to an indicator of data starvation, the stream prefetch engine 20200 avoids unnecessary deep prefetch. For example, when only hits (e.g., a match between an address in a current load request and an address in the PFD 20240) are taken, a prefetching depth of a stream associated with the current cache miss address is not increased.
Unless the PFD 20240 has an AVALID bit set to high and a corresponding DVALID bit set to low, the prefetch unit 20215 may not increase a prefetching depth of a corresponding stream. Because depth is stolen in competition with other active streams, the stream prefetch engine 20200 can also automatically adapt to optimally support concurrent data or instruction streams (e.g., 16 concurrent streams) with a small storage capability (e.g., a storage capacity storing only 32 cache lines) and a shallow prefetching depth (e.g., a depth of 2) for each stream.
As a capacity of the PDA 20235 is limited, it is essential that active streams do not try to exceed the capacity (e.g., 32 L2 cache lines) of the PDA 20235 to prevent thrashing and substantial performance degradation. This capacity of the PDA 20235 is also called a capacity of the stream prefetch engine 20200. The adaptation algorithm of the stream prefetch engine 20200 constrains the total prefetching depth across all streams to remain at a predetermined value.
When incrementing a prefetching depth of a stream, the stream prefetch engine 20200 decrements a prefetching depth of a victim stream. A victim stream refers to a stream which is least recently used and has non-zero prefetching depth. Whenever a current active stream needs to acquire one more unit of its prefetching depth (e.g., a depth of 1), the victim stream releases one unit of its prefetching depth, thus ensuring the constraint is satisfied by forcing streams to compete for their prefetching depth increments. The constraint includes, but is not limited to: fixing a total depth of all streams.
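The depth stealing described above can be illustrated with a minimal software sketch. It is not the hardware implementation: the class name `DepthTable`, its method names, and the use of an ordered dictionary as the LRU record are assumptions made for illustration only. The sketch shows that moving one unit of depth from the victim stream to the requesting stream leaves the total depth unchanged.

```python
from collections import OrderedDict


class DepthTable:
    """Toy model: per-stream prefetching depths whose sum stays fixed."""

    def __init__(self, depths, max_depth=8):
        # Keys are kept in order from least to most recently used (assumed).
        self.depths = OrderedDict(depths)
        self.max_depth = max_depth

    def touch(self, stream_id):
        """Mark a stream as most recently used."""
        self.depths.move_to_end(stream_id)

    def find_victim(self, requester):
        """Return the least recently used other stream with non-zero depth."""
        for stream_id, depth in self.depths.items():
            if stream_id != requester and depth > 0:
                return stream_id
        return None

    def grow(self, stream_id):
        """Move one unit of depth from the victim stream to stream_id."""
        self.touch(stream_id)
        if self.depths[stream_id] >= self.max_depth:
            return False
        victim = self.find_victim(stream_id)
        if victim is None:
            return False
        self.depths[victim] -= 1
        self.depths[stream_id] += 1
        return True


table = DepthTable({"A": 2, "B": 2, "C": 2, "D": 2})
total_before = sum(table.depths.values())
table.grow("D")  # "A" is least recently used, so it gives up one unit
```

In this sketch stream "D" ends with a depth of 3, stream "A" with a depth of 1, and the sum of all depths remains 8.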
In one embodiment, there is provided a victim queue (not shown) implemented, e.g., by a collection of registers. When a stream of a given stream ID is hit, that stream ID is inserted at a head of the victim queue and a matching entry is eliminated from the victim queue. The victim queue may list streams, e.g., by a reverse time order of an activity. A tail of this victim queue may thus include the least recently used stream. A stream ID may be used when a stream is detected and a new stream reinserted in the prefetch directory 20240. Stale data is removed from the prefetch directory 20240 and corresponding cache lines are freed.
The stream prefetch engine 20200 may identify the least recently used stream with a non-zero depth as a victim stream for decrementing a depth. An empty bit in addition to stream-ID is maintained in a LRU (Least Recently Used) queue (e.g., 16×5 bit register array). The empty bit is set to 0 when a stream ID is hit and placed at a head of the queue. If decrementing a prefetching depth of a victim stream results in a prefetching depth of the victim stream becoming zero, the empty bit of the victim stream is set to 1. A stream ID of a decremented-to-zero-depth stream is distributed to the victim queue. One or more comparator(s) matches this stream ID and sets the empty bit appropriately. A decremented-to-zero-depth stream refers to a stream whose depth is decremented to zero.
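As a sketch of how such a queue of stream IDs with empty bits might behave, the following models each entry as a 5-bit value. The bit layout (bit 4 as the empty bit, bits 3 to 0 as the stream ID) and all function names are assumptions for illustration; the text above only states that a 16×5 bit register array may be used.

```python
EMPTY_BIT = 0x10  # assumed layout: bit 4 = empty flag, bits 3..0 = stream ID


def make_entry(stream_id, empty=False):
    """Pack a 4-bit stream ID and an empty flag into one 5-bit entry."""
    return (EMPTY_BIT if empty else 0) | (stream_id & 0xF)


def on_hit(queue, stream_id):
    """Move stream_id to the head of the queue and clear its empty bit."""
    rest = [e for e in queue if (e & 0xF) != stream_id]
    return [make_entry(stream_id)] + rest


def pick_victim(queue):
    """Scan from the tail (least recently used) for a non-empty stream."""
    for entry in reversed(queue):
        if not entry & EMPTY_BIT:
            return entry & 0xF
    return None


def mark_empty(queue, stream_id):
    """Set the empty bit of a stream whose depth was decremented to zero."""
    return [e | EMPTY_BIT if (e & 0xF) == stream_id else e for e in queue]


queue = [make_entry(s) for s in (3, 2, 1, 0)]  # head ... tail
queue = mark_empty(queue, 0)   # stream 0 was decremented to zero depth
victim = pick_victim(queue)    # stream 0 is skipped, stream 1 is chosen
queue = on_hit(queue, 0)       # a later hit places stream 0 at the head
```

The list comprehension in `mark_empty` plays the role of the comparators that match the stream ID and set the empty bit.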
In one embodiment, a free depth register is provided for storing depths of invalidated streams. This register stores a sum of all depth allocations matching the capacity of the prefetch data array 20235, ensuring correct bookkeeping.
In one embodiment, the stream prefetch engine 20200 may require a programmable number of clock cycles to elapse between adaptation events (e.g., the increment and/or the decrement) in order to rate-control such adaptation events. For example, this elapsing gives a tunable rate control over the adaptation events.
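A minimal sketch of this rate control, assuming a simple cycle counter, is given below. The class name `AdaptationRateLimiter` and the parameter `min_cycles` are hypothetical; the text does not specify how the programmable value is stored or compared.

```python
class AdaptationRateLimiter:
    """Allow an adaptation event only if enough cycles have elapsed."""

    def __init__(self, min_cycles):
        self.min_cycles = min_cycles  # programmable spacing (assumed name)
        self.last_event = None        # cycle of the last allowed event

    def allow(self, cycle):
        """Return True if an adaptation event may occur at this cycle."""
        if self.last_event is not None and cycle - self.last_event < self.min_cycles:
            return False
        self.last_event = cycle
        return True


limiter = AdaptationRateLimiter(min_cycles=4)
decisions = [limiter.allow(c) for c in (0, 1, 3, 4, 5, 8)]
```

With a spacing of four cycles, the requests at cycles 0, 4 and 8 are allowed and the others are suppressed.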
In one embodiment, the Depth table does not represent an allocation of a space for each stream in the PDA 20235. As the prefetch unit 20215 changes a prefetching depth of a stream, a current prefetching depth of the stream may not immediately reflect this change. Rather, if the prefetch unit 20215 recently increased a prefetching depth of a stream, the PFD 20240 may reflect this increase after the PFD 20240 receives a request for this increase and prefetched data of the stream is grown. Similarly, if the prefetch unit 20215 decreases a prefetching depth of a stream, the PFD 20240 may include too much data (i.e., data beyond the prefetching depth) for that stream. Then, when a processor core issues subsequent load requests for this stream, the prefetch unit 20215 may not trigger further prefetches and at a later time an amount of the prefetched data may represent a shrunk depth. In one embodiment, the Depth table includes a prefetching depth for each stream. An additional counter is implemented as the free depth register for spare prefetching depth. This free depth register can semantically be thought of as a dummy stream and is essentially treated as a preferred victim for purposes of depth stealing. In one embodiment, invalidated stream IDs return their depths to this free depth register. This return may require a full adder to be implemented in the free depth register.
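The role of the free depth register as a preferred victim may be sketched as follows. The function names and the dictionary representation of the Depth table are assumptions for illustration; only the behavior (spare depth is consumed first, and invalidated streams return their depth) is taken from the description above.

```python
def steal_one_unit(depths, free_depth, lru_order, requester):
    """Return updated (depths, free_depth) after requester gains one unit.

    depths: dict of stream ID -> depth.
    lru_order: stream IDs ordered from least to most recently used.
    """
    depths = dict(depths)
    if free_depth > 0:
        free_depth -= 1  # the dummy stream is the preferred victim
    else:
        victim = next(
            (s for s in lru_order if s != requester and depths[s] > 0), None
        )
        if victim is None:
            return depths, free_depth  # nothing available to steal
        depths[victim] -= 1
    depths[requester] += 1
    return depths, free_depth


def invalidate(depths, free_depth, stream_id):
    """An invalidated stream returns its whole depth to the free register."""
    depths = dict(depths)
    free_depth += depths.pop(stream_id)  # the addition noted in the text
    return depths, free_depth


depths, free = {"A": 3, "B": 2, "C": 3}, 0
depths, free = invalidate(depths, free, "B")                  # free becomes 2
depths, free = steal_one_unit(depths, free, ["A", "C"], "C")  # taken from free
```

In this sketch the sum of the stream depths and the free depth stays at 8 throughout.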
If a look-up address hits in the prefetch directory 20240, a prefetch is generated for the lowest address that is within a prefetching depth of a stream ID associated with the look-up address and that misses in, for example, an eight-bit lookahead vector over the next 8 cache line addresses, which identifies which of these addresses are already present in the PFD 20240. A look-up address refers to an address associated with a request or command. A condition called underflow occurs when the look-up address is present with a valid address (and hence has been requested from a memory device) but corresponding data has not yet become valid. This underflow condition triggers a hit stream to increment its depth and to decrement a current depth of a victim stream. A hit stream refers to a stream whose address is found in the prefetch directory 20240. As multiple hits can occur for each prefetched cache line, depths of hit streams can grow dynamically. The stream prefetch engine 20200 keeps a capacity of footprints of all or some streams fixed, avoiding many pathological performance conditions that the dynamic growing could introduce. In one embodiment, the stream prefetch engine 20200 performs a less aggressive prefetch, e.g., by stealing depths from less active streams.
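The selection of the next prefetch address from a lookahead vector might be sketched as below. This is an illustration under stated assumptions: the window is taken to start at the cache line immediately after the hit line, the directory is modeled as a set of cache line addresses, and the function names are hypothetical.

```python
def lookahead_vector(hit_line, directory, window=8):
    """Bit i is set if cache line hit_line + 1 + i is already in the directory."""
    vector = 0
    for i in range(window):
        if hit_line + 1 + i in directory:
            vector |= 1 << i
    return vector


def next_prefetch(hit_line, directory, depth, window=8):
    """Return the lowest missing line within `depth` lines ahead, or None."""
    vector = lookahead_vector(hit_line, directory, window)
    for i in range(min(depth, window)):
        if not (vector >> i) & 1:
            return hit_line + 1 + i
    return None


directory = {100, 101, 102, 104}  # cache line addresses present in the PFD
vector = lookahead_vector(100, directory)
target = next_prefetch(100, directory, depth=4)
```

Here lines 101, 102 and 104 are present, so the vector is 0b00001011 and line 103 is the lowest missing address within a depth of 4. With a depth of 2, nothing would be prefetched because both lines within the depth are already present.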
Due to outstanding load requests issued from a processor core, there is elasticity between issued requests, and those queued, pending or returned. Thus, even with the algorithm described above, a capacity of the stream prefetch engine 20200 can be exceeded by additional 4, 6 or 12 requests. The prefetching depths may be viewed as a “drive to” target depths whose sum is constrained not to exceed the capacity of a cache memory device when the processor core has no outstanding loads tying up slots of the cache memory. While the PFD 20240 does not immediately or automatically include precisely the number of cache lines for each stream corresponding to the depth of each stream, the stream prefetch engine 20200 makes its decisions about when to prefetch to try to get closer to a prefetching depth (drives towards it).
If the first memory address is present and valid in the PFD 20240 or there is a valid cache line address corresponding to the first memory address in the PFD 20240, at step 20110, the stream prefetch engine 20200 evaluates whether there exists valid data (e.g., valid L2 cache line) corresponding to the first memory address in the PDA 20235. In other words, if there is a valid cache line address corresponding to the first memory address in the PFD 20240, the stream prefetch engine 20200 evaluates whether the corresponding data is valid yet. If the data is not valid, then the corresponding data is pending, i.e., corresponding data is requested to the memory device 20220 but has not been received by the stream prefetch engine 20200. At step 20105, if the first memory address is not present or not valid in the PFD 20240, the control goes to step 20145. At step 20110, to evaluate whether there already exists the valid data in the PDA 20235, the stream prefetch engine 20200 may check a data valid bit associated with the first memory address or the valid cache line address in the PFD 20240.
If there is no valid data corresponding to the first memory address in the PDA 20235, at step 20115, the stream prefetch engine 20200 inserts the issued load request into the DFC table 20225 and awaits a return of the data from the memory device 20220. Then, the control goes to step 20120. In other words, if the data is pending, at step 20115, the stream prefetch engine 20200 inserts the issued load request into the DFC table 20225, awaits the data to be returned by the memory device (since the address was valid, the data has already been requested but not returned), and the control goes to step 20120. Otherwise, the control goes to step 20130. At step 20120, the stream prefetch engine 20200 increments a prefetching depth of a first stream that the first memory address belongs to. While incrementing the prefetching depth of the first stream, at step 20125, the stream prefetch engine 20200 determines a victim stream among streams registered in the PFD 20240 and decrements a prefetching depth of the victim stream. The registered streams refer to streams whose stream IDs are stored in the PFD 20240. To determine the victim stream, the stream prefetch engine 20200 searches for the least recently used stream having non-zero prefetching depth among the registered streams. The stream prefetch engine 20200 sets the least recently used stream having non-zero prefetching depth as the victim stream for the purpose of reallocating a prefetching depth of the victim stream.
In one embodiment, a total prefetching depth of the registered streams is a predetermined value. The parallel computing system operating the stream prefetch engine 20200 can change or program the predetermined value representing the total prefetching depth.
Returning to
At step 20140, the stream prefetch engine 20200 prefetches the additional data. Upon determining that prefetching of additional data is necessary, the stream prefetch engine 20200 may select the nearest address to the first address that is not present but is a valid address in the PFD 20240 within a prefetching depth of a stream corresponding to the first address and starts to prefetch data from the nearest address. The stream prefetch engine 20200 may also prefetch subsequent data stored in subsequent addresses of the nearest address. The stream prefetch engine 20200 may fetch at least one cache line corresponding to a second memory address (i.e., a memory address or cache line address not being present in the PFD 20240) within the prefetching depth of the first stream. Then, the control goes to step 20165.
At step 20145, the stream prefetch engine 20200 attempts to detect a stream (e.g., the first stream that the first memory address belongs to). In one embodiment, the stream prefetch engine 20200 stores a plurality of third memory addresses that caused load misses before. A load miss refers to a cache miss caused by a load request. The stream prefetch engine 20200 increments the third memory addresses. The stream prefetch engine 20200 compares the incremented third memory addresses and the first memory address. The stream prefetch engine 20200 identifies the first stream if there is a match between an incremented third memory address and the first memory address.
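The detection step above can be sketched in software as follows. The class name `StreamDetector`, the history size of 16, and the use of integer cache line addresses are assumptions for illustration; the description only states that stored miss addresses are incremented and compared with the new address.

```python
class StreamDetector:
    """Detect a stream when a miss address follows an earlier miss by one line."""

    def __init__(self, history_size=16):
        self.history_size = history_size  # assumed number of stored addresses
        self.miss_lines = []              # recent cache line addresses that missed

    def observe_miss(self, line):
        """Return True if `line` continues a stream, then record the miss."""
        detected = any(prev + 1 == line for prev in self.miss_lines)
        self.miss_lines.append(line)
        if len(self.miss_lines) > self.history_size:
            self.miss_lines.pop(0)  # drop the oldest stored address
        return detected


detector = StreamDetector()
results = [detector.observe_miss(line) for line in (40, 90, 41, 17)]
```

Only the miss at line 41 matches an incremented earlier miss address (40 + 1), so only that miss identifies a stream.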
If the stream prefetch engine 20200 succeeds in detecting a stream (e.g., the first stream), at step 20155, the stream prefetch engine 20200 starts to prefetch data and/or instructions in the stream (e.g., the first stream) according to a prefetching depth of the stream. Otherwise, the control goes to step 20150. At step 20150, the stream prefetch engine 20200 returns prefetched data and/or instructions to a processor core. The stream prefetch engine 20200 stores the prefetched data and/or instructions, e.g., in the PDA 20235, before returning the prefetched data and/or instructions to the processor core. At step 20160, the stream prefetch engine 20200 inserts the issued load request into the DFC table 20225. At step 20165, the stream prefetch engine 20200 receives a new load request issued from a processor core.
In one embodiment, the stream prefetch engine 20200 adaptively changes prefetching depths of streams. In a further embodiment, the stream prefetch engine 20200 sets a minimum prefetching depth (e.g., a depth of zero) and/or a maximum prefetching depth (e.g., a depth of eight) that a stream can have. The stream prefetch engine 20200 increments a prefetching depth of a stream associated with a load request when a memory address in the load request is valid (e.g., its address valid bit has been set to high in the PFD 20240) but data (e.g., L2 cache line stored in the PDA 20235) corresponding to the memory address is not yet valid (e.g., its data valid bit is still set to low (“0”) in the PFD 20240). In other words, the stream prefetch engine 20200 increments the prefetching depth of the stream associated with the load request when there is no valid cache line data present in the PDA 20235 corresponding to the valid memory address in the PFD (due to the data being in flight from the cache memory). To increment the prefetching depth of the stream, the stream prefetch engine 20200 decrements a prefetching depth of the least recently used stream having non-zero prefetching depth. For example, the stream prefetch engine 20200 first attempts to decrement a prefetching depth of the least recently used stream. If the least recently used stream already has zero prefetching depth (i.e., a depth of zero), the stream prefetch engine 20200 attempts to decrement a prefetching depth of a second least recently used stream, and so on. In one embodiment, as described above, the adaptive control block 20230 includes the LRU table that traces least recently used streams according to hits on streams.
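The condition under which a depth is incremented can be summarized with a small sketch. The tuple encoding of the address valid and data valid bits and the function names are assumptions for illustration; the bounds of zero and eight are the example values given above.

```python
def classify_lookup(entry):
    """Classify a look-up result.

    entry: None if the address is absent from the PFD, else (avalid, dvalid).
    """
    if entry is None or not entry[0]:
        return "miss"       # no valid address: proceed to stream detection
    if entry[1]:
        return "hit"        # data is valid in the PDA; depth is unchanged
    return "underflow"      # address valid, data still in flight: grow depth


def clamp_depth(depth, minimum=0, maximum=8):
    """Keep a prefetching depth within the configured bounds."""
    return max(minimum, min(maximum, depth))


outcome = classify_lookup((True, False))
```

Only the "underflow" outcome, in which the address valid bit is high and the data valid bit is low, leads to an increment of the prefetching depth.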
In one embodiment, the stream prefetch engine 20200 may be implemented in hardware or reconfigurable hardware, e.g., FPGA (Field Programmable Gate Array) or CPLD (Complex Programmable Logic Device), using a hardware description language (Verilog, VHDL, Handel-C, or SystemC). In another embodiment, the stream prefetch engine 20200 may be implemented in a semiconductor chip, e.g., ASIC (Application-Specific Integrated Circuit), using a semi-custom design methodology, i.e., designing a chip using standard cells and a hardware description language. In one embodiment, the stream prefetch engine 20200 may be implemented in a processor (e.g., IBM® PowerPC® processor, etc.) as a hardware unit(s). In another embodiment, the stream prefetch engine 20200 may be implemented in software (e.g., a compiler or operating system), e.g., by a programming language (e.g., Java®, C/C++, .Net, Assembly language(s), Perl, etc.).
In one embodiment, the stream prefetch engine 20200 operates with at least four threads per processor core and a maximum prefetching depth of eight (e.g., eight L2 (level two) cache lines). In one embodiment, the prefetch data array 20235 may store 128 cache lines. In another embodiment, the prefetch data array 20235 stores 32 cache lines and, by adapting the prefetching depth according to a system load, the stream prefetch engine 20200 can support the same dynamic range of memory accesses. By adaptively changing the prefetching depths of the streams sharing the PDA 20235, the prefetch data array 20235 whose capacity is 32 cache lines can also operate as an array with 128 cache lines.
In one embodiment, an adaptive prefetching is necessary to both support efficient low stream count (e.g., a single stream) and efficient high stream count (e.g., 16 streams) prefetching with the stream prefetch engine 20200. An adaptive prefetching is a technique adaptively adjusting prefetching depth per a stream as described in the steps 20120-20125 in
In one embodiment, the stream prefetch engine 20200 counts the number of active streams and then divides the PFD 20240 and/or the PDA 20235 equally among these active streams. These active streams may have an equal prefetching depth.
In one embodiment, a total depth of all active streams is predetermined and does not exceed a PDA capacity of the stream prefetch engine 20200, to avoid thrashing. An adaptive variation of a prefetching depth allows a deep prefetch (e.g., a depth of eight) for low numbers of streams (e.g., two streams), while a shallow prefetch (e.g., a depth of 2) is used for large numbers of streams (e.g., 16 streams) to keep the usage of the PDA 20235 optimal under a wide variety of load requests.
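For the equal division among active streams described above, the arithmetic can be sketched as follows, using the example capacity of 32 cache lines and the example maximum depth of eight. The function name `equal_depth` and the integer division are assumptions for illustration.

```python
def equal_depth(active_streams, capacity=32, max_depth=8):
    """Depth per stream when the capacity is shared equally among active streams."""
    if active_streams == 0:
        return 0
    return min(max_depth, capacity // active_streams)


depths = {n: equal_depth(n) for n in (1, 2, 4, 16)}
```

With these values, one, two or four active streams each receive the maximum depth of 8, while 16 active streams each receive a depth of 2, so that 16 × 2 = 32 cache lines does not exceed the capacity.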
There is provided a system, method and computer program product for improving a performance of a parallel computing system, e.g., by operating at least two different prefetch engines associated with a processor core.
At step 21110, a look-up engine (e.g., a look-up engine 21315 in
At step 21110, if the look-up engine determines that a prefetch request has not been issued for the first data, e.g., the first data address is not found in the prefetch directory 21310, at step 21120, then a normal load command is issued to a memory system.
At step 21110, if the look-up engine determines that a prefetch request has been issued for the first data, then the look-up engine determines whether the first data is present in a prefetch data array (e.g., a prefetch data array 21250 in
The look-up engine also provides the command including an address of the first data to at least two different prefetch engines simultaneously. These two different prefetch engines include, without limitation, at least one stream prefetch engine (e.g., a stream prefetch engine 21275 in
In one embodiment, the stream prefetch engine adaptively changes the prefetching depth according to a speed of each stream. For example, if a speed of a data or instruction stream is faster than speeds of other data or instruction streams (i.e., that faster stream includes data which is requested by the processor but is not yet resident in the prefetch data directory), the stream prefetch engine runs the step 21115 to convert a prefetch request for the faster stream to a demand load command described above. The stream prefetch engine increases a prefetching depth of the fastest data or instruction stream. In one embodiment, there is provided a register array for specifying a prefetching depth of each stream. This register array is preloaded by software at the start of running the prefetch system (e.g., the prefetch system 21320 in
The list prefetch engine(s) prefetch(es) third data associated with the command. In one embodiment, the list prefetch engine(s) prefetch(es) the third data (e.g., numerical data, string data, instructions, etc.) according to a list describing a sequence of addresses that caused cache misses. The list prefetch engine(s) prefetches data or instruction(s) in a list associated with the command. In one embodiment, there is provided a module for matching between a command and a list. A match is found if an address requested in the command and an address listed in the list are the same. If there is a match, the list prefetch engine(s) prefetches data or instruction(s) in the list up to a predetermined depth ahead of where the match has been found. A detail of the list prefetch engine(s) is described in connection with
The third data prefetched by the list prefetch engine or the second data prefetched by the stream prefetch engine may include data that may subsequently be requested by the processor. In other words, even if one of the engines (the stream prefetch engine and the list prefetch engine) fails to prefetch this subsequent data, the other engine may succeed in prefetching this subsequent data based on the first data that both prefetch engines use to initiate further data prefetches. This is possible because the stream prefetch engine is optimized for data located in consecutive memory locations (e.g., a streaming movie) and the list prefetch engine is optimized for a block of randomly located data that is repetitively accessed (e.g., a loop). The second data and the third data may include different sets of data and/or instruction(s).
In one embodiment, the second data and the third data are stored in an array or buffer without a distinction. In other words, data prefetched by the stream prefetch engine and data prefetched by the list prefetch engine are stored together without a distinction (e.g., a tag, a flag, a label, etc.) in an array or buffer.
In one embodiment, each of the list prefetch engine(s) and the stream prefetch engine(s) can be turned off and/or turned on separately. In one embodiment, the stream prefetch engine(s) and/or list prefetch engine(s) prefetch data and/or instruction(s) that have not been prefetched before and/or have not been listed in the prefetch directory 21310.
In one embodiment, the parallel computing system operates the list prefetch engine occasionally (e.g., when a user bit(s) are set). A user bit(s) identify a viable address to be used, e.g., by a list prefetch engine. The parallel computing system operates the stream prefetch engine all the time.
In one embodiment, if the look-up engine determines that the first data has not been prefetched, at step 21110, the parallel computing system immediately issues the load command for this first data to a memory system. However, it also provides an address of this first data to the stream prefetch engine and/or at least one list prefetch engine, which use this address to determine further data to be prefetched. The prefetched data may be consumed by the processor core 21200 in subsequent clock cycles. A method to determine and/or identify whether the further data needs to be prefetched is described herein above. Upon determining and/or identifying the further data to be prefetched, the stream prefetch engine may establish a new stream and prefetch data in the new stream or prefetch additional data in an existing stream. At the same time, upon determining and/or identifying the further data to be prefetched, the list prefetch engine may recognize a match between the address of this first data and an earlier L1 cache miss address (i.e., an address that caused a prior L1 cache miss) in a list and prefetch data from the subsequent cache miss addresses in the list separated by a predetermined “list prefetch depth”, e.g., a particular number of instructions and/or a particular amount of data to be prefetched by the list prefetch engine.
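The list matching described above might be sketched as follows. The function name `list_prefetch`, the representation of the list as a Python list of addresses, and the example addresses are all hypothetical; the sketch only illustrates matching an address against recorded miss addresses and selecting the following entries up to the list prefetch depth.

```python
def list_prefetch(miss_list, address, depth, already_prefetched=()):
    """Return addresses to prefetch after a match of `address` in miss_list."""
    if address not in miss_list:
        return []  # no match: the list engine issues nothing
    start = miss_list.index(address) + 1
    window = miss_list[start:start + depth]
    # Skip addresses that have already been prefetched.
    return [a for a in window if a not in already_prefetched]


recorded = [0x1000, 0x8F40, 0x0200, 0x77C0, 0x3300]  # earlier L1 miss addresses
to_fetch = list_prefetch(recorded, 0x8F40, depth=2)
```

A match at the second recorded address with a list prefetch depth of 2 selects the third and fourth recorded addresses, even though they are not consecutive in memory.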
A parallel computing system which has at least one stream and at least one list prefetch engine may run more efficiently if both types of prefetch engines are provided. In one embodiment, the parallel computing system allows these two different prefetch engines (i.e., list prefetch engines and stream prefetch engines) to run simultaneously without serious interference. The parallel computing system can operate the list prefetch engine, which may require a user intervention, without spoiling benefits for the stream prefetch engine.
In one embodiment, the stream prefetch engine 21275 and/or the list prefetch engine 21280 is implemented in hardware or reconfigurable hardware, e.g., FPGA (Field Programmable Gate Array) or CPLD (Complex Programmable Logic Device), using a hardware description language (Verilog, VHDL, Handel-C, or SystemC). In another embodiment, the stream prefetch engine 21275 and/or the list prefetch engine 21280 is implemented in a semiconductor chip, e.g., ASIC (Application-Specific Integrated Circuit), using a semi-custom design methodology, i.e., designing a chip using standard cells and a hardware description language. In one embodiment, the stream prefetch engine 21275 and/or the list prefetch engine 21280 is implemented in a processor (e.g., IBM® PowerPC® processor, etc.) as a hardware unit(s). In another embodiment, the stream prefetch engine 21275 and/or the list prefetch engine 21280 is/are implemented in software (e.g., a compiler or operating system), e.g., by a programming language (e.g., Java®, C/C++, .Net, Assembly language(s), Perl, etc.). When the stream prefetch engine 21275 is implemented in a compiler, the compiler adapts the prefetching depth of each data or instruction stream.
The prefetch system 21320 is a module that provides an interface between the processor core 21200 and the rest of the parallel computing system. Specifically, the prefetch system 21320 provides an interface to the switch 21305 and an interface to a computing node's DCR (Device Control Ring) and local control registers special to the prefetch system 21320. The system 21320 performs performance-critical tasks including, without limitation, identifying and prefetching memory access patterns and managing a cache memory device for data resulting from this identifying and prefetching. In addition, the system 21320 performs write combining (e.g., combining four or more write commands into a single write command) to enable multiple writes to be presented as a single write to the switch 21305, while maintaining coherency between the write combine arrays.
The processor core 21200 issues at least one command including, without limitation, an instruction requesting data. The at least one register 21205 buffers the issued command, at least one address in the command and/or the data in the command. The bypass engine 21210 allows a command to bypass the look-up queue 21220 when the look-up queue 21220 is empty.
The look-up queue 21220 receives the commands from the register 21205 and also outputs the earliest issued command among the issued commands to one or more of: the request array 21215, the stream detect engine 21260, the switch request table 21295 and the hit queue 21255. In one embodiment, the queue 21220 is implemented as a FIFO (First In First Out) queue. The request array 21215 receives at least one address from the register 21205 associated with the command. In one embodiment, the addresses in the request array 21215 are indexed to the corresponding command in the look-up queue 21220. The look-up engine 21315 receives the ordered commands from the bypass engine 21210 or the request array 21215 and compares an address in the issued commands with addresses in the prefetch directory 21310. The prefetch directory 21310 stores addresses of data and/or instructions for which prefetch commands have been issued by one of the prefetch engines (e.g., a stream prefetch engine 21275 and a list prefetch engine 21280).
The address compare engine 21270 receives addresses that have been prefetched from the at least one prefetch engine (e.g., the stream prefetch engine 21275 and/or the list prefetch engine 21280) and prevents the same data from being prefetched twice by the at least one prefetch engine. The address compare engine 21270 allows a processor core to request data not present in the prefetch directory 21310. The stream detect engine 21265 receives address(es) in the issued commands from the look-up engine 21315 and detects at least one stream to be used in the stream prefetch engine 21275. For example, if the addresses in the issued commands are “L1” and “L1+1,” the stream prefetch engine may prefetch cache lines addressed at “L1+2” and “L1+3.”
In one embodiment, the stream detect engine 21265 stores at least one address that caused a cache miss. The stream detect engine 21265 detects a stream, e.g., by incrementing the stored address and comparing the incremented address with an address in the issued command. In one embodiment, the stream detect engine 21265 can detect at least sixteen streams. In another embodiment, the stream detect engine can detect at most sixteen streams. The stream detect engine 21265 provides detected stream(s) to the stream prefetch engine 21275. The stream prefetch engine 21275 issues a request for prefetching data and/or instructions in the detected stream according to a prefetching depth of the detected stream.
The list prefetch engine 21280 issues a request for prefetching data and/or instruction(s) in a list that includes a sequence of addresses that caused cache misses. The multiplexer 21285 forwards the prefetch request issued by the list prefetch engine 21280 or the prefetch request issued by the stream prefetch engine 21275 to the switch request table 21295. The multiplexer 21290 forwards the prefetch request issued by the list prefetch engine 21280 or the prefetch request issued by the stream prefetch engine 21275 to the prefetch directory 21310. A prefetch request may include memory address(es) from which data and/or instruction(s) are prefetched. The prefetch directory 21310 stores the prefetch request(s) and/or the memory address(es).
The switch request table 21295 receives the commands from the look-up queue 21220 and the forwarded prefetch request from the multiplexer 21285. The switch request table 21295 stores the commands and/or the forwarded request. The switch 21305 retrieves the commands and/or the forwarded request from the table 21295, and transmits data and/or instructions demanded in the commands and/or the forwarded request to the switch response handler 21300. Upon receiving the data and/or instruction(s) from the switch 21305, the switch response handler 21300 immediately delivers the data to the processor core 21200, e.g., via the multiplexer 21240 and the interface logic 21325. At the same time, if the returned data or instruction(s) is the result of a prefetch request, the switch response handler 21300 delivers the data or instruction(s) from the switch 21305 to the prefetch conversion engine 21260 and delivers the data and/or instruction(s) to the prefetch data array 21250.
The prefetch conversion engine 21260 receives the commands from the look-up queue 21220 and/or information bits accompanying data or instructions returned from the switch response handler 21300. The conversion engine 21260 converts a prefetch request to a demand fetch command if the processor requests data that was the target of a prefetch request issued earlier by one of the prefetch units but that has not yet been fulfilled. The conversion engine 21260 then identifies this prefetch request, when it returns from the switch 21305 through the switch response handler 21300, as a command that was converted from a prefetch request to a demand load command. This returning prefetch data from the switch response handler 21300 is then routed to the hit queue 21255 so that it is quickly passed through the prefetch data array 21250 to the processor core 21200. The hit queue 21255 may also receive the earliest command (i.e., the earliest issued command by the processor core 21200) from the look-up queue 21220 if that command requests data that is already present in the prefetch data array 21250. In one embodiment, when issuing a command, the processor core 21200 attaches generation bits (i.e., bits representing a generation or age of a command) to the command. Values of the generation bits may increase as the number of commands issued increases. For example, the first issued command may have “0” in the generation bits. The second issued command may have “1” in the generation bits. The hit queue 21255 outputs instructions and/or data that have been prefetched to the prefetch data array 21250.
The prefetch data array 21250 stores the instructions and/or data that have been prefetched. In one embodiment, the prefetch data array 21250 is a buffer between the processor core 21200 and a local cache memory device (not shown) and stores data and/or instructions prefetched by the stream prefetch engine 21275 and/or list prefetch engine 21280. The switch 21305 may be an interface between the local cache memory device and the prefetch system 21320.
In one embodiment, the prefetch system 21320 combines multiple candidate writing commands into, for example, four writing commands when there is no conflict between the four writing commands. For example, the prefetch system 21320 combines multiple “store” instructions, which could be instructions to various individual bytes in the same 32 byte word, into a single store instruction for that 32 byte word. Then, the prefetch system 21320 stores these coalesced single writing commands to at least two arrays called write-combine buffers 21225 and 21230. These at least two write-combine buffers are synchronized with each other. In one embodiment, a first write-combine buffer 21225 called write-combine candidate match array may store candidate writing commands that can be combined or concatenated immediately as they are issued by the processor core 21200. The first write-combine buffer 21225 receives these candidate writing commands from the register 21205. A second write-combine buffer 21230 called write-combine buffer flush receives candidate writing commands that can be combined from the bypass engine 21210 and/or the request array 21215 and/or stores the single writing commands that combine a plurality of writing commands when these (uncombined) writing commands reach the tail of the look-up queue 21220. When these write-combine arrays become full or need to be flushed to make the contents of a memory system be up-to-date, these candidate writing commands and/or single writing commands are stored in an array 21235 called store data array. In one embodiment, the array 21235 may also store the data from the register 21205 that is associated with these single writing commands.
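The coalescing of byte stores into a single store per 32 byte word can be illustrated with the following sketch. It does not model the two write-combine buffers or their synchronization; the function name, the byte-enable mask, and the dictionary representation are assumptions for illustration.

```python
WORD_BYTES = 32  # size of the word into which stores are combined


def combine_stores(stores):
    """Merge byte stores into one entry per 32-byte word.

    stores: iterable of (byte_address, byte_value) in program order.
    Returns {word_address: (byte_enable_mask, {offset: value})}.
    """
    combined = {}
    for address, value in stores:
        offset = address % WORD_BYTES
        word = address - offset
        mask, data = combined.get(word, (0, {}))
        data = dict(data)
        data[offset] = value  # a later store to the same byte wins
        combined[word] = (mask | (1 << offset), data)
    return combined


stores = [(0x40, 0xAA), (0x41, 0xBB), (0x5F, 0xCC), (0x60, 0xDD), (0x40, 0xEE)]
result = combine_stores(stores)
```

In this example five byte stores collapse into two combined stores, one for the word at address 0x40 and one for the word at address 0x60.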
The switch 21305 can retrieve the candidate writing commands and/or single writing commands from the array 21235. The prefetch system 21320 also transfers the candidate writing commands and/or single writing commands from the array 21235 to local control registers 21245 or a device command ring (DCR), i.e., a register storing control or status information of the processor core. The local control register 21245 controls a variety of functions being performed by the prefetch system 21320. This local control register 21245 as well as the DCR can also be read by the processor core 21200 with the returned read data entering the multiplexer 21240. The multiplexer 21240 receives, as inputs, control bits from the local control register 21245, the data and/or instructions from the switch response handler 21300 and/or the prefetched data and/or instructions from the prefetch data array 21250. Then, the multiplexer 21240 forwards one of the inputs to the interface logic 21325. The interface logic 21325 delivers the forwarded input to the processor core 21200. All of the control bits as well as I/O commands (i.e., an instruction for performing input/output operations between a processor and peripheral devices) are memory mapped and can be accessed either using memory load and store instructions which are passed through the switch 21305 or are addressed to the DCR or local control registers 21245.
Look-Up Engine
By default, the look-up engine 21315 is in a ready state 21455 (i.e., a state ready for performing an operation). Upon receiving a request (e.g., a register write command), the look-up engine 21315 goes to a register write state 21450 (i.e., a state for updating a register in the prefetch system 21320). In the register write state 21450, the look-up engine 21315 stays in the state 21450 until receiving an SDA arbitration input 21425 (i.e., an input indicating that the write data from the SDA has been granted access to the local control registers 21245). Upon completing the register update, the look-up engine 21315 goes back to the ready state 21455. Upon receiving a DCR write request (i.e., a request to write in the DCR) from the processor core 21200, the look-up engine 21315 goes from the register write state 21450 to a DCR write wait state 21405 (i.e., a state for performing a write to DCR). Upon receiving a DCR acknowledgement from the DCR, the look-up engine 21315 goes from the DCR write wait state 21405 to the ready state 21455.
The look-up engine 21315 goes from the ready state 21455 to a DCR read wait state 21415 (i.e., a state for preparing to read contents of the DCR) upon receiving a DCR ready request (i.e., a request for checking a readiness of the DCR). The look-up engine 21315 stays in the DCR read wait state 21415 until the look-up engine 21315 receives the DCR acknowledgement 21420 from the DCR. Upon receiving the DCR acknowledgement, the look-up engine 21315 goes from the DCR read wait state 21415 to a register read state 21460. The look-up engine 21315 stays in the register read state 21460 until a processor core reload arbitration signal 21465 (i.e., a signal indicating that the DCR read data has been accepted by the interface 21325) is asserted.
The look-up engine 21315 goes from the ready state 21455 to the register read state 21460 upon receiving a register read request (i.e., a request for reading contents of a register). The look-up engine 21315 comes back to the ready state 21455 from the register read state 21460 upon completing a register read. The look-up engine 21315 stays in the ready state 21455 upon receiving one or more of: a hit signal (i.e., a signal indicating a “hit” in an entry in the prefetch directory 21310), a prefetch to demand fetch conversion signal (i.e., a signal for converting a prefetch request to a demand to a switch or a memory device), a demand load signal (i.e., a signal for loading data or instructions from a switch or a memory device), a victim empty signal (i.e., a signal indicating that there is no victim stream to be selected by the stream prefetch engine 21275), a load command for data that must not be put in cache (a non-cache signal), a hold signal (i.e., a signal for holding current data), and a noop signal (i.e., a signal indicating no operation).
The look-up engine 21315 goes from the ready state 21455 to a WCBF evict state 21500 (i.e., a state evicting an entry from the WCBF array 21230) upon receiving a WCBF evict request (i.e., a request for evicting the WCBF entry). The look-up engine 21315 goes back to the ready state 21455 from the WCBF evict state 21500 upon completing an eviction in the WCBF array 21230. The look-up engine 21315 stays in the WCBF evict state 21500 while a switch request queue (SRQ) arbitration signal 21505 is asserted.
The look-up engine 21315 goes from the ready state 21455 to a WCBF flush state 21495 upon receiving a WCBF flush request (i.e., a request for flushing the WCBF array 21230). The look-up engine 21315 goes back to the ready state 21455 from the WCBF flush state 21495 upon a completion of flushing the WCBF array 21230. The look-up engine 21315 stays in the ready state 21455 while a generation change signal (i.e., a signal indicating a generation change of data in an entry of the WCBF array 21230) is asserted.
In one embodiment, most state transitions in the state machine 21400 are done in a single cycle. Whenever a state transition is scheduled, a hold signal is asserted to prevent further advance of the look-up queue 21220 and to ensure that a register at a boundary of the look-up queue 21220 retains its value. This state transition is created, for example, by a read triggering two write combine array evictions for coherency maintenance. Generation change triggers a complete flush of the WCBF array 21230 over multiple clock cycles.
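By way of non-limiting illustration, the transitions of the state machine 21400 described above may be summarized as a table-driven sketch. The event names (e.g., `register_write_request`, `dcr_ack`) are labels invented for this sketch, and any event not listed for a state is assumed to leave the state unchanged; multi-cycle behavior and the hold signal are not modeled.

```python
from enum import Enum, auto


class State(Enum):
    READY = auto()            # ready state 21455
    REGISTER_WRITE = auto()   # register write state 21450
    DCR_WRITE_WAIT = auto()   # DCR write wait state 21405
    DCR_READ_WAIT = auto()    # DCR read wait state 21415
    REGISTER_READ = auto()    # register read state 21460
    WCBF_EVICT = auto()       # WCBF evict state 21500
    WCBF_FLUSH = auto()       # WCBF flush state 21495


# (current state, event) -> next state; event names are hypothetical labels.
TRANSITIONS = {
    (State.READY, "register_write_request"): State.REGISTER_WRITE,
    (State.REGISTER_WRITE, "register_update_done"): State.READY,
    (State.REGISTER_WRITE, "dcr_write_request"): State.DCR_WRITE_WAIT,
    (State.DCR_WRITE_WAIT, "dcr_ack"): State.READY,
    (State.READY, "dcr_read_request"): State.DCR_READ_WAIT,
    (State.DCR_READ_WAIT, "dcr_ack"): State.REGISTER_READ,
    (State.READY, "register_read_request"): State.REGISTER_READ,
    (State.REGISTER_READ, "core_reload_arbitration"): State.READY,
    (State.READY, "wcbf_evict_request"): State.WCBF_EVICT,
    (State.WCBF_EVICT, "evict_done"): State.READY,
    (State.READY, "wcbf_flush_request"): State.WCBF_FLUSH,
    (State.WCBF_FLUSH, "flush_done"): State.READY,
}


def step(state, event):
    """Return the next state; unlisted events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events, state=State.READY):
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = step(state, event)
    return state


if __name__ == "__main__":
    print(run(["dcr_read_request", "dcr_ack"]))
```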
The look-up engine 21315 outputs the following signals going to the hit queue 21255, SRT (Switch Request Table) 21295, demand fetch conversion engine 21260, and look-up queue 21220: critical word, a tag (bits attached by the processor core 21200 to allow it to identify a returning load command) indicating thread ID, 5-bit store index, a request index, a directory index indicating the location of prefetch data for the case of a prefetch hit, etc.
In one embodiment, a READ combinational logic (i.e., a combinational logic performing a memory read) returns a residency of a current address and next consecutive addresses. A STORE combinational logic (i.e., a combinational logic performing a memory write) returns a residency of a current address and next consecutive addresses and deasserts an address valid bit for any cache lines matching this current address.
Hit Queue
In one exemplary embodiment, the hit queue 21255 is implemented, e.g., by a 12-entry × 12-bit register array that holds pending hits (hits for prefetched data) for presentation to the interface 21245 of the processor core. Read and write pointers are maintained in one or two clock cycle domain. Each entry of the hit queue includes, without limitation, a critical word, a directory index and a processor core tag.
Prefetch Data Array
In one embodiment, the prefetch data array 21250 is implemented, e.g., by a dual ported 32×128-byte SRAM operating in one or two clock cycle domain. A read port is driven, e.g., by the hit queue and the write port is driven, e.g., by switch response handler 21300.
Prefetch Directory
The prefetch directory 21310 includes, without limitation, a 32×48-bit register array storing information related to the prefetch data array 21250. It is accessed by the look-up engine 21315 and written by the prefetch engines 21275 and 21280. The prefetch directory 21310 operates in one or two clock cycle domain and is timing and performance critical. There is provided a combinatorial logic associated with this prefetch directory 21310 including a replication count of address comparators.
Each prefetch directory entry includes, without limitation, an address, an address valid bit, a stream ID, and data representing a prefetching depth. In one embodiment, the prefetch directory 21310 is a data structure and may be accessed for a number of different purposes.
Look-Up and Stream Comparators
In one embodiment, at least two 32-bit addresses associated with commands are analyzed in the address compare engine 21270 as a particular address (e.g., 35th bit to 3rd bit) and their increments. A parallel comparison is performed on both of these numbers for each prefetch directory entry. The comparators evaluate both carry and result of the particular address (e.g., 2nd bit to 0th bit)+0, 1, . . . , or 7. The comparison bits (e.g., 35th bit to 3rd bit in the particular address) with or without a carry and the first three bits (e.g., 2nd bit to 0th bit in the particular address) are combined to produce a match for lines N, N+1 to N+7 in the hit queue 21255. This match is used by the look-up engine 21315 for both read and write coherency and for deciding which line to prefetch for the stream prefetch engine 21275. If a write signal is asserted by the look-up engine 21315, a matching address is invalidated and subsequent read look-ups (i.e., look-up operations in the hit queue 21255 for a read command) cannot be matched. A line in the hit queue 21255 will become unlocked for reuse once any pending hits, or pending data return if the line was in-flight, have been fulfilled.
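By way of non-limiting illustration, the combination of an upper-bit comparison (with or without carry) and a three-bit lower result to match lines N through N+7 may be sketched as follows. The sketch assumes that addresses have already been reduced to integer line numbers and that the directory is a plain list; the function name `match_lines` is hypothetical, and the hardware performs these comparisons in parallel rather than in a loop.

```python
def match_lines(n, directory):
    """Return eight flags; flag k is True if line n + k is in the directory.

    The line number is split into upper bits and the low three bits. For each
    increment k, the low three bits plus k yield a 3-bit result and a carry;
    an entry matches when its upper bits equal the upper bits of n plus the
    carry and its low three bits equal the result.
    """
    high, low = n >> 3, n & 0b111
    flags = []
    for k in range(8):
        total = low + k
        carry, result = total >> 3, total & 0b111
        flags.append(
            any((e >> 3) == high + carry and (e & 0b111) == result
                for e in directory)
        )
    return flags


if __name__ == "__main__":
    print(match_lines(13, [13, 16, 20]))
```

Because only a carry of 0 or 1 can arise from adding 0 to 7 to a three-bit field, each entry needs at most two upper-bit comparisons, which is the saving the description attributes to evaluating both carry and result.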
LIST Prefetch Comparators
In one embodiment, the address compare engine 21270 includes, for example, 32×35-bit comparators returning “hit” (i.e., a signal indicating that there exists prefetched data in the prefetch data array 21250 or the prefetch directory 21310) and “hit index” (i.e., a signal representing an index of data being “hit”) to the list prefetch engine 21280 in one or two clock cycle period(s). These “hit” and “hit index” signals are used to decide whether to service or discard a prefetch request from the list prefetch engine 21280. The prefetch system 21320 does not establish the same cache line twice. The prefetch system 21320 discards prefetched data or instruction(s) if it collides with an address in a write combine array (e.g., array 21225 or 21230).
Automatic Stream Detection, Manual Stream Touch
All or some of the read commands that cause a miss when looked up in the prefetch directory 21310 are snooped by the stream detect engine 21265. The stream detect engine 21265 includes, without limitation, a table of expected next aligned addresses based on previous misses to prefetchable addresses. If a confirmation (i.e., a stream is detected, e.g., by finding a match between an address in the table and an address forwarded by the look-up engine) is obtained (e.g., by a demand fetch issued on a same cycle), the look-up queue 21220 is stalled on a next clock cycle and a cache line is established in the prefetch data array 21250 starting from an (aligned) address to the aligned address. The new stream establishment logic is shared with at least 16 memory mapped registers, one for each stream that triggers a sequence of four cache lines to be established in the prefetch data array 21250 with a corresponding stream ID, starting with the aligned address written to the register.
When a new stream is established the following steps occur
In one embodiment, the demand fetch conversion engine 21260 includes, without limitation, an array of, for example, 16 entries×13 bits representing at least 14 hypothetically possible prefetch to demand fetch conversions (i.e., a process converting a prefetch request to a demand for data to be returned immediately to the processor core 21200). The information bits of returning prefetch data from the switch 21305 are compared against this array. If this comparison determines that this prefetch data has been converted to demand fetch data (i.e., data provided from the switch 21305 or a memory system), these entries will arbitrate for access to the hit queue 21255, waiting for free clock cycles. These entries wait until the cache line is completely entered before requesting an access to the hit queue 21255. Each entry in the array in the engine 21260 includes, without limitation, a demand pending bit indicating a conversion from a prefetch request to a demand load command when set, a tag for the prefetch, an index identifying the target location in the prefetch data array 21250 for the prefetch and a critical word associated with the demand.
ECC and Parity
In one embodiment, data paths and/or prefetch data array 21250 will be ECC protected, i.e., errors in the data paths and/or prefetch data array may be corrected by ECC (Error Correction Code). In one embodiment, the data paths will be ECC protected, e.g., at the level of 8-byte granularity. Sub 8-byte data in the data paths will be parity protected at a byte level, i.e., errors in the data paths may be identified by a parity bit. Parity bit and/or interrupts may be used for the register array 21215 which stores request information (e.g., addresses and status bits). In one embodiment, a parity bit is implemented on narrower register arrays (e.g., an index FIFO, etc.). There can be a plurality of latches in this module that may affect a program function. Unwinding logical decisions made by the prefetch system 21320 based on detected soft errors in addresses and request information may impair latency and performance. Parity bit implementation on the bulk of these decisions is possible. An error refers to a signal or datum with a mistake.
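By way of non-limiting illustration, byte-level parity protection of sub 8-byte data may be sketched as follows. The use of even parity and the function names `parity_bit`, `protect` and `find_errors` are assumptions of this sketch; the description does not specify the parity polarity. Parity of this kind identifies a single-bit error in a byte but, unlike ECC, does not correct it.

```python
def parity_bit(byte):
    """Even parity: 1 if the byte has an odd number of set bits, else 0."""
    return bin(byte & 0xFF).count("1") & 1


def protect(data):
    """Attach a parity bit to each byte of `data`."""
    return [(b, parity_bit(b)) for b in data]


def find_errors(protected):
    """Return the indices of bytes whose stored parity no longer matches."""
    return [i for i, (b, p) in enumerate(protected) if parity_bit(b) != p]


if __name__ == "__main__":
    word = protect(b"\x0f\x01\xa5")
    word[1] = (word[1][0] ^ 0x02, word[1][1])  # inject a single-bit flip
    print(find_errors(word))
```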
As shown in
As further shown in
In the multiprocessor system on a chip 22050, the “M” processors (e.g., 0 to M−1) are connected to the centralized crossbar switch 22060 through one or more pipeline latch stages. Similarly, “S” cache slices (e.g., 0 to S−1) are also connected to the crossbar switch 22060 through one or more pipeline stages. Any master “M” intending to communicate with a slave “S” sends a request 22110 to the crossbar indicating its need to communicate with the slave “S”. The arbitration device 22100 arbitrates among the multiple requests competing for the same slave “S”.
Each processor core connects to the arbitration device 22100 via a plurality of Master data ports 22061 and Master control ports 22062. At a Master control port 22062, a respective processor signal 22110 requests routing of data latched at a corresponding Master data port 22061 to a Slave device associated with a cache slice. Processor request signals 22110 are received and latched at the corresponding Master control pipeline latch devices 22064_0, . . . , 22064_M−1 for routing to the arbiter every clock cycle. The arbitration device 22100 issues arbitration grant signals 22120 to the respective requesting processor core 52. Grant signals 22120 are latched at corresponding Master control pipeline latch devices 22066_0, . . . , 22066_M−1 prior to transfer back to the processor. The arbitration device 22100 further generates corresponding Slave control signals 22130 that are communicated to slave ports 22068 via respective Slave control pipeline latch devices 22068_0, . . . , 22068_S−1, in an example embodiment. Slave control port signals inform the slaves of the arrival of the data through a respective slave data port 22069_0, . . . , 22069_S−1 in accordance with the arbitration scheme issued at that clock cycle. In accordance with arbitration grants selecting a Master Port 22061 and Slave Port 22069 combination under the arbitration scheme implemented, the arbitration device 22100 generates, in every clock cycle, multiplexor control signals 22150 for receipt at respective multiplexor devices 22065_0, . . . , 22065_S−1 to control, e.g., select by turning on, a respective multiplexor. A selected multiplexor enables forwarding of data from a master data path latch device 22063_0, . . . , 22063_S−1 associated with a selected Master Port to the selected Slave Port 22069 via a corresponding connected slave data path latch device 22067_0, . . . , 22067_S−1. In
In one example embodiment, the arbitration device 22100 arbitrates among the multiple requests competing for the same slave “S” using a two step mechanism: 1): There are “S” slave arbitration slices. Each slave arbitration slice includes arbitration logic that receives all the pending requests of various Masters to access it. It then uses a round robin mechanism that uses a single round robin priority vector, e.g., bits, to select one Master as the winner of the arbitration. This is done independently by each of the S slave arbitration slices in a clock cycle; 2): There are “M” Master arbitration slices. It is possible that multiple Slave arbitration slices have chosen the same Master in the previous step. Each master arbitration slice uses a round robin mechanism to choose one such slave. This is done independently by each of the “M” master arbitration slices. Though
This method ensures fairness, as shown in the signal timing diagram of arbitration device signals of
In this example, it takes at least 5 clock cycles 22160 before the request for Master 1 had even been granted to a slave due to the round robin scheme implemented. However, all transactions to slave 4 are scheduled by cycle 9.
This throughput performance through crossbar 22060 may be improved in a further embodiment: rather than each slave using a single round robin priority vector, each slave uses two or more round robin priority vectors. The slave cycles the use of these priority vectors every clock cycle. Thus, in the above example, slave 4 having chosen Master 0 in cycle 1, will choose Master 1 in cycle 2 using a different round robin priority vector. In cycle 2, Master 1 would choose slave 4 as it is the only slave requesting it.
In a similar vein, each Master can have two or more priority vectors and can cycle among their use every clock cycle to further increase performance.
In one example embodiment, the priority vector used by the slave, e.g., SP1, is M bits long (0 to M−1), as the slave arbitration has to choose one of M masters. Hence, only one bit would be set per cycle as the lowest priority bit, in the example. For example, if a bit 5 of the priority vector is set, then the Master 5 has the lowest priority and the Master 6 would have the highest priority, Master 7 has the second highest priority, etc. The order from highest priority to lowest priority is 6, 7, 8 . . . M−1, 0, 1, 2, 3, 4, 5 in this example priority vector. Further, for example, the Masters arbitration slices 7, 8 and 9 request the slave and Master 7 wins. The priority vector SP1 would be updated so that bit 7 would be set—resulting in priority order from highest to lowest as 8, 9, 10, . . . M−1, 0, 1, 2, 3, 4, 5, 6, 7 in the updated vector. A similar bit vector scheme is further used by the Master arbitration logic devices in determining priority values of slaves to be selected for access within a clock cycle.
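By way of non-limiting illustration, selection by a round robin priority vector with a single set bit marking the lowest-priority master may be sketched as follows. The function name `arbitrate` and the representation of the vector as the index of its set bit are assumptions of this sketch.

```python
def arbitrate(requests, lowest, m):
    """Pick the requesting master with the highest priority.

    `requests` is a set of master indices requesting the slave, `lowest` is
    the index of the set bit in the priority vector (the lowest-priority
    master), and `m` is the number of masters. Priority runs from
    lowest + 1 upward, wrapping around to `lowest` itself last.
    Returns the winner, or None if nobody is requesting.
    """
    for offset in range(1, m + 1):
        candidate = (lowest + offset) % m
        if candidate in requests:
            return candidate
    return None


if __name__ == "__main__":
    M = 16
    lowest = 5                       # bit 5 set: Master 6 has highest priority
    winner = arbitrate({7, 8, 9}, lowest, M)
    print(winner)                    # Master 7 wins, as in the example above
    lowest = winner                  # update: the winner becomes lowest priority
    print(arbitrate({7, 8, 9}, lowest, M))
```

Setting the winner's bit after each grant is what rotates priority; a slave or master that cycles among two or more such vectors simply keeps several independent values of `lowest` and alternates between them each clock cycle.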
The usage of multiple priority vectors both by the masters and slaves and cycling among them result in increased performance. For example, as a result of implementing processes at the arbitration Slave and Master arbitration slices of the example depicted in
A method and system are described that reduce latency between masters (e.g., processors) and slaves (e.g., devices having memory/cache—L2 slices) communicating with one another through a central cross bar switch.
Any master “m” desiring to communicate with a slave “s” goes through the following steps:
The latency expected for communicating among the masters, the cross bar 23102, and the slaves are shown in
Referring back to
The cross bar switch 23102 arbitrates among the multiple requests competing for the same slave “s”. In one embodiment, the cross bar switch 23102 may include an arbiter logic 23116, which makes decisions as to which master can talk to which slave. The cross bar switch 23102 may include an arbiter for each master and each slave slice, for instance, a slave arbitration slice for each slave 0 to S−1, and a master arbitration slice for each master 0 to M−1. Once it has determined that a slot is available for transferring the information from “m” to “s”, the crossbar 23102 sends the information (“info_r1”) to the slave “s”, for example, via a pipeline latch 23114b. The crossbar 23102 also sends an acknowledgement back to the master “m” that the “eager” scheduling has succeeded, for example, via a pipeline latch 23110b.
Eager scheduling latency is shown in
At 23206, the master device checks whether a request to schedule information has been received from the cross bar switch. If there is no request to schedule information, the logic flows to 23210. If a request to schedule the information has been received, the master sends the information associated with this request to schedule to the cross bar switch at 23208. The logic flow then continues to 23210.
At 23210, it is determined whether a request was sent to the crossbar “arbitration delay” cycles before the current cycle. If so, at 23212, the master device “eagerly” sends the information or data associated with the request that was sent “arbitration delay” cycles before the current cycle. The logic then continues to 23202 where it is again determined whether there is a new request to send information to the cross bar switch.
At 23214, if no request was sent to the crossbar “arbitration delay” cycles before the current cycle, then the master device drives or sends to the cross bar switch the information associated with the latest request that was sent at least “arbitration delay” cycles before the current cycle. At 23216, the master device proceeds to the next cycle and the logic returns to continue at 23202.
The master continues to drive the information associated with the latest request sent at least “A” cycles before. So as long as no new requests are sent to the switch by that master, eager scheduling success is possible even in later cycles than the one indicated in
As an implementation example, each of the slave arbitration slices may maintain M counters (counter 0 to counter M−1). Counter[m][s] indicates the number of pending requests from master “m” to slave “s”. When a master “m” sends a request to a slave “s”, counter[m][s] is incremented by the arbitration slice of that slave. When a request from that master gets scheduled (eager or non-eager), the counter is decremented. Each of the master arbitration slices also maintains the identifier of the slave to which its master last sent a request. When a request from a master “m” gets scheduled to slave “s”, the identifier of the slave to which that master last sent a request is compared with “s”. If there is a match, then eager scheduling is possible. Other implementations are possible to perform the eager scheduling described herein, and the present invention is not limited to one specific implementation.
At 23302, an arbiter, for example, a slave arbitration slice for s1, examines one or more requests from one or more masters to slave s1. At 23304, a master is selected. For instance, if there is more than one master desiring to talk to slave s1, the slave arbitration slice for s1 may use a predetermined protocol or rule to select one master. If there is only one master requesting to talk to this slave device, arbitrating for a master is not needed. Rather, that one master is selected. The predetermined protocol or rule may be to use a round robin priority selection method. Other protocols or rules may be employed for selecting a master from a plurality of masters.
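By way of non-limiting illustration, the pending-request counters and the last-slave identifier of the implementation example above may be sketched as follows. The class name `EagerTracker` and its methods are hypothetical, the per-slice state is gathered into one object for brevity, and the “arbitration delay” timing is not modeled.

```python
class EagerTracker:
    """Bookkeeping for eager scheduling, following the counter example."""

    def __init__(self, masters, slaves):
        # pending[m][s]: number of pending requests from master m to slave s
        self.pending = [[0] * slaves for _ in range(masters)]
        # last_slave[m]: slave to which master m last sent a request
        self.last_slave = [None] * masters

    def request(self, m, s):
        """Master m sends a request to slave s."""
        self.pending[m][s] += 1
        self.last_slave[m] = s

    def schedule(self, m, s):
        """A request from master m is scheduled to slave s.

        Returns True if eager scheduling is possible, i.e. the master is
        still driving the information for slave s.
        """
        if self.pending[m][s] == 0:
            raise ValueError("no pending request from this master to this slave")
        self.pending[m][s] -= 1
        return self.last_slave[m] == s


if __name__ == "__main__":
    t = EagerTracker(masters=2, slaves=3)
    t.request(0, 1)
    print(t.schedule(0, 1))   # master 0 last targeted slave 1
    t.request(0, 1)
    t.request(0, 2)
    print(t.schedule(0, 1))   # master 0 has since moved on to slave 2
```

When `schedule` returns False in this sketch, the arbiter would fall back to the non-eager path and ask the master to send the information again.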
At 23306, the slave arbitration slice sends the information that it selected a master, for example, master m1 to the master arbitration slice responsible for master m1. At 23308, it is determined whether the selected master accepted the slave arbitration slice's decision. It may be that this master has received selections or other requests to talk from more than one slave. In such cases the master may not accept the slave arbitration slice's decision to talk to it. If the selected master does not accept, for example, for that reason or other reasons, the logic flow returns to 23302 where the slave arbitration slice examines more requests.
At 23308, if the selected master has accepted the slave arbitration slice's decision to talk to it, then the priority vector of the slave arbitration slice may be updated to indicate that this master has been selected, for example, so that in the next selection process, this master does not get the highest priority of selection and another master may be selected.
Once the slot between the selected master and this slave has been made available or established for example according to the previous steps for communication, it is determined at 23310 whether the eager scheduling can succeed. That is, the slave arbitration slice determines whether the information or data is available from this master that it can send to the slave device. The information or data may be available at the cross bar switch, if the selected master has sent the information “eagerly” after waiting for an arbitration delay period even without an acknowledgment from the cross bar switch to send the information.
If at 23312, it is determined that the information can be sent to the slave, the information from the selected master is sent to the slave at 23314. The arbitration slice sends a notification to the master arbitration slice that the eager scheduling succeeded. The master arbitration slice then sends the eager scheduling success notice to the selected master. The logic returns to 23302 to continue to the next request.
If at 23312, it is determined that the information is not available to send to the slave currently, the slave arbitration slice sends a notification or request to schedule the information or data to the master at 23316, for example, via the master's arbitration slice at the cross bar switch. The logic returns to 23302 to continue to the next request.
At 23406, the master arbitration slice notifies the slave selected for communication. This establishes the communication or slot between the master and the slave. At 23408, a priority vector or the like may be updated to indicate that this slave has been selected, for example, so that this slave does not get the highest priority for selection in the next round of selections. Rather, other slaves are given a chance to communicate with this master in the next round.
Processing Unit
The complex consisting of A2, QPU and L1P is called processing unit (PU, see
The MMU 1100 comprises an SLB 1106, an SLB search logic device 1108, a TLB 1110, a TLB search logic device 1112, an Address Space Register (ASR) 1114, an SDR1 1116, a block address translation (BAT) array 1118, and a data block address translation (DBAT) array 1120. The SDR1 1116 specifies the page table base address for virtual-to-physical address translation. Block address translation and data block address translation are one possible implementation for translating an effective address to a physical address and are discussed in further detail in PEM v2.0 and U.S. Pat. No. 5,907,866.
Another implementation for translating an effective address into a physical address is through the use of an on-chip SLB, such as SLB 1106, and an on-chip TLB, such as TLB 1110. Prior art SLBs and TLBs are discussed in U.S. Pat. No. 6,901,540 and U.S. Publication No. 20090019252, both of which are incorporated by reference in their entirety. In one embodiment, the SLB 1106 is coupled to the SLB search logic device 1108 and the TLB 1110 is coupled to the TLB search logic device 1112. In one embodiment, the SLB 1106 and the SLB search logic device 1108 function to translate an effective address (EA) into a virtual address. The function of the SLB is further discussed in U.S. Publication No. 20090019252. In the PowerPC™ reference architecture, a 64 bit effective address is translated into an 80 bit virtual address. In the A2 implementation, a 64 bit effective address is translated into an 88 bit virtual address.
In one embodiment of the A2 architecture, both the instruction cache and the data cache maintain separate “shadow” TLBs called ERATs (effective to real address translation tables). The ERATs contain only direct (IND=0) type entries. The instruction I-ERAT contains 16 entries, while the data D-ERAT contains 32 entries. These ERAT arrays minimize TLB 1110 contention between instruction fetch and data load/store operations. The instruction fetch and data access mechanisms only access the main unified TLB 1110 when a miss occurs in the respective ERAT. Hardware manages the replacement and invalidation of both the I-ERAT and D-ERAT; no system software action is required in MMU mode. In ERAT-only mode, an attempt to access an address for which no ERAT entry exists causes an Instruction (for fetches) or Data (for load/store accesses) TLB Miss exception.
The purpose of the ERAT arrays is to reduce the latency of the address translation operation, and to avoid contention for the TLB 1110 between instruction fetches and data accesses. The instruction ERAT (I-ERAT) contains sixteen entries, while the data ERAT (D-ERAT) contains thirty-two entries, and all entries are shared between the four A2 processing threads. There is no latency associated with accessing the ERAT arrays, and instruction execution continues in a pipelined fashion as long as the requested address is found in the ERAT. If the requested address is not found in the ERAT, the instruction fetch or data storage access is automatically stalled while the address is looked up in the TLB 1110. If the address is found in the TLB 1110, the penalty associated with the miss in the I-ERAT shadow array is 12 cycles, and the penalty associated with a miss in the D-ERAT shadow array is 19 cycles. If the address is also a miss in the TLB 1110, then an Instruction or Data TLB Miss exception is reported.
When operating in MMU mode, the on-demand replacement of entries in the ERATs is managed by hardware in a least-recently-used (LRU) fashion. Upon an ERAT miss which leads to a TLB 1110 hit, the hardware will automatically cast-out the oldest entry in the ERAT and replace it with the new translation. The TLB 1110 and the ERAT can both be used to translate an effective or virtual address to a physical address. The TLB 1110 and the ERAT may be generalized as “lookup tables”.
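By way of non-limiting illustration, the ERAT acting as a small “shadow” lookup table in front of the TLB 1110, with least-recently-used replacement on an ERAT miss that hits in the TLB, may be sketched as follows. The class names `Erat` and `TLBMiss`, the use of page numbers as keys, and the modeling of the TLB as a plain dictionary are assumptions of this sketch; miss penalties and thread sharing are not modeled.

```python
from collections import OrderedDict


class TLBMiss(Exception):
    """Raised when the address is found in neither the ERAT nor the TLB."""


class Erat:
    def __init__(self, capacity, tlb):
        self.capacity = capacity      # e.g., 16 for I-ERAT, 32 for D-ERAT
        self.tlb = tlb                # mapping: page -> physical page
        self.entries = OrderedDict()  # ordered least to most recently used

    def translate(self, page):
        if page in self.entries:               # ERAT hit
            self.entries.move_to_end(page)     # mark as most recently used
            return self.entries[page]
        if page not in self.tlb:               # miss in both tables
            raise TLBMiss(page)
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # cast out the oldest entry
        self.entries[page] = self.tlb[page]    # install the new translation
        return self.entries[page]


if __name__ == "__main__":
    erat = Erat(capacity=2, tlb={p: p + 100 for p in range(4)})
    for page in (0, 1, 0, 2):
        erat.translate(page)
    print(list(erat.entries))  # page 1 was least recently used and is gone
```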
The TLB 1110 and TLB search logic device 1112 function together to translate virtual addresses supplied from the SLB 1106 into physical addresses. A prior art TLB search logic device 1112 is shown in
Referring to
Page identification begins with the expansion of the effective address into a virtual address. The effective address is a 64-bit address calculated by a load, store, or cache management instruction, or as part of an instruction fetch. In one embodiment of a system employing the A2 processor, the virtual address is formed by prepending the effective address with a 1-bit ‘guest space identifier’, an 8-bit ‘logical partition identifier’, a 1-bit ‘address space identifier’ and a 14-bit ‘process identifier’. The resulting 88-bit value forms the virtual address, which is then compared to the virtual addresses contained in the TLB page table entries. For instruction fetches, cache management operations, and for non-external PID storage accesses, these parameters are obtained as follows. The guest space identifier is provided by the Machine State Register bit MSR[GS]. The logical partition identifier is provided by the Logical Partition ID (LPID) register. The process identifier is provided by the Process ID (PID) register. The address space identifier is provided by MSR[IS] for instruction fetches, and by MSR[DS] for data storage accesses and cache management operations, including instruction cache management operations.
For external PID type load and store accesses, these parameters are obtained from the External PID Load Context (EPLC) or External PID Store Context (EPSC) registers. The guest space identifier is provided by EPL/SC[EGS] field. The logical partition identifier is provided by the EPL/SC[ELPID] field. The process identifier is provided by the EPL/SC[EPID] field, and the address space identifier is provided by EPL/SC[EAS].
The address space identifier bit differentiates between two distinct virtual address spaces, one generally associated with interrupt-handling and other system-level code and/or data, and the other generally associated with application-level code and/or data. Typically, user mode programs will run with MACHINE STATE REGISTER[IS,DS] both set to 1, allowing access to application-level code and data memory pages. Then, on an interrupt, MACHINE STATE REGISTER[IS,DS] are both automatically cleared to 0, so that the interrupt handler code and data areas may be accessed using system-level TLB entries (i.e., TLB entries with the TS field=0).
The TLB search logic device 1112 comprises logic block 1300 and logic block 1329. Logic block 1300 comprises ‘AND’ gates 1303 and 1323, comparators 1306, 1309, 1310, 1315, 1317, 1318 and 1322, and ‘OR’ gates 1311 and 1319. ‘AND’ gate 1303 receives input from TLBentry[ThdID(t)] (thread identifier) 1301 and ‘thread t valid’ 1302. TLBentry[ThdID(t)] 1301 identifies a hardware thread, and in one implementation there are 4 thread ID bits per TLB entry. ‘Thread t valid’ 1302 indicates which thread is requesting a TLB lookup. The output of ‘AND’ gate 1303 is 1 when the input of ‘thread t valid’ 1302 is 1 and the value of ‘thread identifier’ 1301 is 1. The output of ‘AND’ gate 1303 is coupled to ‘AND’ gate 1323.
Comparator 1306 compares the values of inputs TLBentry[TGS] 1304 and ‘GS’ 1305. TLBentry[TGS] 1304 is a TLB guest state identifier and ‘GS’ 1305 is the current guest state of the processor. The output of comparator 1306 is only true, i.e., a bit value of 1, when both inputs are of equal value. The output of comparator 1306 is coupled to ‘AND’ gate 1323.
Comparator 1309 determines if the value of the ‘logical partition identifier’ 1307 in the virtual address is equal to the value of the TLPID field 1308 of the TLB page entry. Comparator 1310 determines if the value of the TLPID field 1308 is equal to 0 (non-guest page). The outputs of comparators 1309 and 1310 are supplied to an ‘OR’ gate 1311. The output of ‘OR’ gate 1311 is supplied to ‘AND’ gate 1323. The ‘AND’ gate 1323 also directly receives an input from ‘validity bit’ TLBentry[V] 1312. The output of ‘AND’ gate 1323 is only valid when the ‘validity bit’ 1312 is set to 1.
Comparator 1315 determines if the value of the ‘address space’ identifier 1314 is equal to the value of the ‘TS’ field 1313 of the TLB page entry. If the values match, then the output is 1. The output of the comparator 1315 is coupled to ‘AND’ gate 1323.
Comparator 1317 determines if the value of the ‘Process ID’ 1324 is equal to the ‘TID’ field 1316 of the TLB page entry, indicating a private page, and comparator 1318 determines if the value of the TID field is 0, indicating a globally shared page. The outputs of comparators 1317 and 1318 are coupled to ‘OR’ gate 1319. The output of ‘OR’ gate 1319 is coupled to ‘AND’ gate 1323.
Comparator 1322 determines if the value in the ‘effective page number’ field 1320 is equal to the value stored in the ‘EPN’ field 1321 of the TLB page entry. The number of bits N in the ‘effective page number’ 1320 is calculated by subtracting log2 of the page size from the bit length of the address field. For example, if an address field is 64 bits long, and the page size is 4 KB, then the effective address field length is found according to equation 1:
EA=0 to N−1, where N=Address Field Length−log2(page size) (1)
or by subtracting log2(2^12), i.e., 12, from 64. Thus, only the first 52 bits, or bits 0 to 51 of the effective address, are used in matching the ‘effective page number’ field 1320 to the ‘EPN’ field 1321. The output of comparator 1322 is coupled to ‘AND’ gate 1323.
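Equation 1 can be checked for the page sizes discussed in this description with a short calculation. The function name and the power-of-two check are illustrative assumptions; the arithmetic itself is that of equation 1.

```python
def epn_bits(page_size_bytes, address_bits=64):
    """Return N = address field length - log2(page size), per equation 1."""
    if page_size_bytes <= 0 or page_size_bytes & (page_size_bytes - 1):
        raise ValueError("page size must be a positive power of two")
    offset_bits = page_size_bytes.bit_length() - 1  # log2 of a power of two
    return address_bits - offset_bits


KB = 1024
GB = 1024 ** 3
```

For a 4 KB page this yields 52 (bits 0:51), for a 64 KB page 48 (bits 0:47), and for a 1 GB page 34 (bits 0:33).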
Logic block 1329 comprises comparators 1326 and 1327 and ‘OR’ gate 1328. Comparator 1326 determines if the value of bits ‘n:51’ 1331 of the effective address (where n=64−log2(page size)) is greater than the value of bits n:51 of the ‘EPN’ field 1332 in the TLB entry. Normally, the LSB are not utilized in translating the EA to a physical address. When the value of bits n:51 of the effective address is greater than the value stored in the EPN field, the output of comparator 1326 is 1. Comparator 1327 determines if the TLB entry ‘exclusion bit’ 1330 is set to 1. If the ‘exclusion bit’ 1330 is set to 1, then the output of comparator 1327 is 1. The ‘exclusion bit’ 1330 functions as a signal to exclude a portion of the effective address range from the current TLB page. Applications or the operating system may then map subpages (pages smaller in size than the current page size) over the excluded region. In one example embodiment of an IBM BlueGene parallel computing system, the smallest page size is 4 KB and the largest page size is 1 GB. Other available page sizes within the IBM BlueGene parallel computing system include 64 KB, 16 MB, and 256 MB pages. As an example, a 64 KB page may have a 16 KB range excluded from the base of the page. In other implementations, the comparator may be used to exclude a memory range from the top of the page. In one embodiment, an application may map additional pages smaller in page size than the original page, i.e., smaller than 16 KB, into the area defined by the excluded range. In the example above, up to four additional 4 KB pages may be mapped into the excluded 16 KB range. Note that in some embodiments, the entire area covered by the excluded range is not always available for overlapping additional pages. It is also understood that the combination of logic gates within the TLB search logic device 1112 may be replaced by any combination of gates that result in logically equivalent outcomes.
A page entry in the TLB 1110 is only matched to an EA when all of the inputs into the ‘AND’ gate 1323 are true, i.e., all the input bits are 1. Referring back to
Referring now to
Column 1408 lists the ‘effective page number’ (EPN) bits associated with each page size. The values in column 1408 are based on the values calculated in column 1406. For example, the TLB search logic device 1112 requires all 52 bits (bits 0:51) of the EPN to look up the physical address of a 4 KB page in the TLB 1110. In contrast, the TLB search logic device 1112 requires only 34 bits (bits 0:33) of the EPN to look up the physical address of a 1 GB page in the TLB 1110. Recall that in one example embodiment, the EPN is formed by a total of 52 bits. Normally, all of the LSB (the bits after the EPN bits) are set to 0. Exclusion ranges may be carved out of large size pages in units of 4 KB, i.e., when TLBentry[X] bit 1330 is 1, the total memory excluded from the effective page is 4 KB*((value of Exclusion range bits 1440)+1). When the exclusion bit is set to 1 (X=1), even if the LSBs in the virtual page number are set to 0, a 4 KB page is still excluded from a large size page.
A 64 KB page only requires bits 0:47 within the EPN field to be set for the TLB search logic device 1112 to find a matching value in the TLB 1110. An exclusion range within the 64 KB page can be provided by setting LSBs 48:51 to any value except all ‘1’s. Note that the only page size smaller than 64 KB is 4 KB. One or more 4 KB pages can be mapped by software into the excluded memory region covered by the 64 KB page when the TLBentry[X] (exclusion) bit is set to 1. When the TLB search logic device 1112 maps a virtual address to a physical address and the TLB exclusion bit is also set to 1, the TLB search logic device 1112 will return a physical address that maps to the 64 KB page outside the exclusion range. If the TLB exclusion bit is set to 0, the TLB search logic device 1112 will return a physical address that maps to the whole area of the 64 KB page.
An application or the operating system may access the non-excluded region within a page when the ‘exclusion bit’ 1330 is set to 1. When this occurs, the TLB search logic device 1112 uses the MSB to map the virtual address to a physical address that corresponds to an area within the non-excluded region of the page. When the ‘exclusion bit’ 1330 is set to 0, then the TLB search logic device 1112 uses the MSB to map the virtual address to a physical address that corresponds to a whole page.
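The effect of the exclusion bit on a lookup can be sketched in software. The sketch assumes the exclusion is taken from the base of the page and that an access falls outside the excluded range when its 4 KB index within the page is greater than the value held in the exclusion range bits; the function names and the use of a byte offset within the page are illustrative assumptions.

```python
FOUR_KB = 4096


def excluded_bytes(x_bit, range_bits):
    """Size of the excluded range: 4 KB * (range bits + 1) when X = 1."""
    return FOUR_KB * (range_bits + 1) if x_bit else 0


def entry_covers(offset_in_page, x_bit, range_bits):
    """Return True if the large-page entry still covers this byte offset.

    With X = 0 the whole page is covered. With X = 1 (assumed semantics) only
    offsets whose 4 KB index exceeds range_bits are covered.
    """
    if not x_bit:
        return True
    return (offset_in_page // FOUR_KB) > range_bits
```

For a 64 KB page with X = 1 and exclusion range bits equal to 3, the first 16 KB are excluded: offset 0x3FFF is left for subpages, while offset 0x4000 is still translated by the 64 KB entry.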
In one embodiment of the invention, the size of the exclusion range is configurable to M×4 KB, where M=1 to (TLB entry page size in bytes/2^12)−1. The smallest possible exclusion range is 4 KB, and successively larger exclusion ranges are multiples of 4 KB. In another embodiment of the invention, such as in the A2 core, for simplicity, M is further restricted to 2^n, where n=0 to log2(TLB entry page size)−13, i.e., the possible excluded ranges are 4 KB, 8 KB, 16 KB, up to (page size)/2. Additional TLB entries may be mapped into the exclusion range. Pages mapped into the exclusion range cannot overlap and pages mapped in the exclusion range must be collectively fully contained within the exclusion range. The pages mapped into the exclusion range are known as subpages.
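The two sizing rules above can be enumerated directly. The function names are hypothetical; the bounds are those stated in the preceding paragraph.

```python
FOUR_KB = 4096


def max_general_multiple(page_size_bytes):
    """Largest M in the general embodiment: (page size / 2^12) - 1."""
    return page_size_bytes // FOUR_KB - 1


def power_of_two_exclusion_sizes(page_size_bytes):
    """Exclusion sizes when M = 2^n, n = 0 .. log2(page size) - 13."""
    log2_page = page_size_bytes.bit_length() - 1
    return [FOUR_KB << n for n in range(0, log2_page - 13 + 1)]
```

For a 64 KB page the general rule allows M up to 15, while the power-of-two rule gives exclusion ranges of 4 KB, 8 KB, 16 KB and 32 KB, the last being half the page size.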
Once a TLB page table entry has been deleted from the TLB 1110 by the operating system, the corresponding memory indicated by the TLB page table entry becomes available to store new or additional pages and subpages. TLB page table entries are generally deleted when their corresponding applications or processes are terminated by the operating system.
Referring now to
In this embodiment, the compute node 1700 is a single chip (‘nodechip’) based on low power A2 PowerPC cores, though the architecture can use any low power cores, and may comprise one or more semiconductor chips. In the embodiment depicted, the node includes 16 PowerPC A2 cores running at 1600 MHz.
More particularly, the basic compute node 1700 of the massively parallel supercomputer architecture illustrated in
Each MMU 1100 receives data accesses and instruction accesses from its associated processor core 1752 and retrieves information requested by the core 1752 from memory such as the L1 cache 1755, L2 cache 1770, external DDR3 1780, etc.
Each FPU 1753 associated with a core 1752 has a 32 B wide data path to the L1-cache 1755, allowing it to load or store 32 B per cycle from or into the L1-cache 1755. Each core 1752 is directly connected to a prefetch unit (level-1 prefetch, L1P) 1758, which accepts, decodes and dispatches all requests sent out by the core 1752. The store interface from the core 1752 to the L1P 1758 is 32 B wide and the load interface is 16 B wide, both operating at the processor frequency. The L1P 1758 implements a fully associative, 32 entry prefetch buffer. Each entry can hold an L2 line of 128 B size. The L1P 1758 provides two prefetching schemes: a sequential prefetcher as used in previous BLUEGENE™ architecture generations, as well as a list prefetcher. The prefetch unit is further disclosed in U.S. patent application Ser. No. 11/767,717, which is incorporated by reference in its entirety.
As shown in
Chip I/O functionality is provided by a direct memory access engine referred to herein as a Messaging Unit (‘MU’), such as MU 1750, with each MU including a DMA engine and a Network Device 1750 in communication with the crossbar switch 1760. In one embodiment, the compute node further includes, in a non-limiting example: 10 intra-rack and inter-rack interprocessor links 1790, each operating at 2.0 GB/s, i.e., 10*2 GB/s (e.g., configurable as a 5-D torus in one embodiment); and one I/O link 1792 to the I/O subsystem, interfaced with the MU 1750 at 2.0 GB/s. The compute node 1700 employs, or is associated and interfaced with, 8-16 GB of memory per node (not shown).
Although not shown, each A2 processor core 1752 has an associated quad-wide fused multiply-add SIMD floating point unit, producing 8 double precision operations per cycle, for a total of 128 floating point operations per cycle per compute node. A2 is a 4-way multi-threaded 64b PowerPC implementation. Each A2 processor core 1752 has its own execution unit (XU), instruction unit (IU), and quad floating point unit (QPU) connected via the AXU (Auxiliary eXecution Unit). The QPU is an implementation of the 4-way SIMD QPX floating point instruction set architecture. QPX is an extension of the scalar PowerPC floating point architecture. It defines 32 32 B-wide floating point registers per thread instead of the traditional 32 scalar 8 B-wide floating point registers.
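The per-node floating point throughput follows from the figures given for the depicted embodiment: 16 cores, a 1600 MHz clock, and a quad-wide fused multiply-add unit per core. A short calculation, with illustrative constant and function names, makes the arithmetic explicit; it counts a fused multiply-add as two operations, which is the assumption behind the figure of 8 operations per cycle.

```python
CORES_PER_NODE = 16   # cores in the depicted embodiment
SIMD_LANES = 4        # quad-wide SIMD floating point unit
OPS_PER_FMA = 2       # one multiply plus one add per fused multiply-add
CLOCK_MHZ = 1600      # core clock of the depicted embodiment


def flops_per_cycle_per_core():
    return SIMD_LANES * OPS_PER_FMA


def flops_per_cycle_per_node():
    return CORES_PER_NODE * flops_per_cycle_per_core()


def peak_gflops_per_node():
    # Operations per cycle times cycles per second, expressed in GFLOP/s.
    return flops_per_cycle_per_node() * CLOCK_MHZ / 1000
```

Under these assumptions each core delivers 8 double precision operations per cycle, and the node as a whole delivers 16 × 8 operations per cycle, which at 1600 MHz corresponds to a peak of 204.8 GFLOP/s per node.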
The instruction unit of the A2 core fetches, decodes, and issues two instructions from different threads per cycle to any combination of the one execution pipeline and the AXU interface. The instruction unit includes a branch unit which provides dynamic branch prediction using a branch history table (BHT). This mechanism greatly improves the branch prediction accuracy and reduces the latency of taken branches, such that the target of a branch can usually be run immediately after the branch itself, with no penalty.
The A2 core contains a single execution pipeline. The pipeline consists of seven stages and can access the five-ported (three read, two write) GPR file. The pipeline handles all arithmetic, logical, branch, and system management instructions (such as interrupt and TLB management, move to/from system registers, and so on) as well as all loads, stores and cache management operations. The pipelined multiply unit can perform 32-bit×32-bit multiply operations with single-cycle throughput and single-cycle latency. The width of the divider is 64 bits. Divide instructions dealing with 64-bit operands recirculate for 65 cycles, and operations with 32-bit operands recirculate for 32 cycles. No divide instructions are pipelined; they all require some recirculation. All misaligned operations are handled in hardware, with no penalty on any operation which is contained within an aligned 32-byte region. The load/store pipeline supports all operations to both big endian and little endian data regions.
The A2 core provides separate instruction and data cache controllers and arrays, which allow concurrent access and minimize pipeline stalls. The storage capacity of the cache arrays is 16 KB each. Both caches have 64-byte lines, with a 4-way set-associative I-cache and an 8-way set-associative D-cache. Both caches support parity checking on the tags and data in the memory arrays, to protect against soft errors. If a parity error is detected, the CPU will force an L1 miss and reload from the system bus. The A2 core can be configured to cause a machine check exception on a D-cache parity error. The PowerISA instruction set provides a rich set of cache management instructions for software-enforced coherency.
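The stated capacity, line size and associativity determine the number of sets in each L1 cache. The helper below is an illustrative calculation only; its name and return format are assumptions.

```python
def cache_geometry(capacity_bytes, line_bytes, ways):
    """Return (sets, line offset bits, set index bits) for a set-associative cache."""
    sets = capacity_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1  # log2 of the line size
    index_bits = sets.bit_length() - 1         # log2 of the number of sets
    return sets, offset_bits, index_bits


ICACHE = cache_geometry(16 * 1024, 64, 4)  # 4-way instruction cache
DCACHE = cache_geometry(16 * 1024, 64, 8)  # 8-way data cache
```

With 16 KB arrays and 64-byte lines, the 4-way I-cache has 64 sets and the 8-way D-cache has 32 sets; both use 6 line offset bits.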
The instruction cache controller (ICC) delivers up to four instructions per cycle to the instruction unit of the A2 core. The ICC also handles the execution of the PowerISA instruction cache management instructions for coherency.
The data cache controller (DCC) handles all load and store data accesses, as well as the PowerISA data cache management instructions. All misaligned accesses are handled in hardware, with cacheable load accesses that are contained within a double quadword (32 bytes) being handled as a single request and with cacheable store or caching inhibited load or store accesses that are contained within a quadword (16 bytes) being handled as a single request. Load and store accesses which cross these boundaries are broken into separate byte accesses by the hardware micro-code engine. When in 32 Byte store mode, all misaligned store or load accesses contained within a double quadword (32 bytes) are handled as a single request. This includes cacheable and caching inhibited stores and loads. The DCC interfaces to the AXU port to provide direct load/store access to the data cache for AXU load and store operations. Such AXU load and store instructions can access up to 32 bytes (a double quadword) in a single cycle for cacheable accesses and can access up to 16 bytes (a quadword) in a single cycle for caching inhibited accesses. The data cache always operates in a write-through manner. The DCC also supports cache line locking and “transient” data via way locking. The DCC provides for up to eight outstanding load misses, and the DCC can continue servicing subsequent load and store hits in an out-of-order fashion. Store-gathering is not performed within the A2 core.
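The rule for when a misaligned access is handled as a single request can be expressed as a containment test. The sketch assumes that “contained within” a quadword or double quadword means contained within a naturally aligned 16-byte or 32-byte block; the function and parameter names are hypothetical.

```python
def fits_in_aligned_block(addr, size, block):
    """True if bytes [addr, addr + size) lie within one aligned block."""
    return addr // block == (addr + size - 1) // block


def handled_as_single_request(addr, size, is_load, cacheable, store32_mode=False):
    """Apply the 32-byte or 16-byte containment rule described above."""
    if store32_mode or (is_load and cacheable):
        block = 32  # double quadword
    else:
        block = 16  # quadword
    return fits_in_aligned_block(addr, size, block)
```

For example, an 8-byte cacheable load at address 0x18 stays within one double quadword and is a single request, whereas the same load at 0x1C crosses a 32-byte boundary and would be broken up.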
The A2 Core supports a flat, 42-bit (4 TB) real (physical) address space. This 42-bit real address is generated by the MMU, as part of the translation process from the 64-bit effective address, which is calculated by the processor core as an instruction fetch or load/store address. Note: In 32-bit mode, the A2 core forces bits 0:31 of the calculated 64-bit effective address to zeroes. Therefore, to have a translation hit in 32-bit mode, software needs to set the effective address upper bits to zero in the ERATs and TLB. The MMU provides address translation, access protection, and storage attribute control for embedded applications. The MMU supports demand paged virtual memory and other management schemes that require precise control of logical to physical address mapping and flexible memory protection. Working with appropriate system level software, the MMU provides the following functions:
The translation lookaside buffer (TLB) is the primary hardware resource involved in the control of translation, protection, and storage attributes. It consists of 512 entries, each specifying the various attributes of a given page of the address space. The TLB is 4-way set associative. The TLB entries may be of type direct (IND=0), in which case the virtual address is translated immediately by a matching entry, or of type indirect (IND=1), in which case the hardware page table walker is invoked to fetch and install an entry from the hardware page table.
The TLB tag and data memory arrays are parity protected against soft errors; if a parity error is detected during an address translation, the TLB and ERAT caches treat the parity error like a miss and proceed to either reload the entry with correct parity (in the case of an ERAT miss, TLB hit) and set the parity error bit in the appropriate FIR register, or generate a TLB exception where software can take appropriate action (in the case of a TLB miss).
An operating system may choose to implement hardware page tables in memory that contain virtual to logical translation page table entries (PTEs) per Category E.PT. These PTEs are loaded into the TLB by the hardware page table walker logic after the logical address is converted to a real address via the LRAT per Category E.HV.LRAT. Software must install indirect (IND=1) type TLB entries for each page table that is to be traversed by the hardware walker. Alternately, software can manage the establishment and replacement of TLB entries by simply not using indirect entries (i.e. by using only direct IND=0 entries). This gives system software significant flexibility in implementing a custom page replacement strategy. For example, to reduce TLB thrashing or translation delays, software can reserve several TLB entries for globally accessible static mappings. The instruction set provides several instructions for managing TLB entries. These instructions are privileged and the processor must be in supervisor state in order for these instructions to be run.
The first step in the address translation process is to expand the effective address into a virtual address. This is done by taking the 64-bit effective address and prepending to it a 1-bit “guest state” (GS) identifier, an 8-bit logical partition ID (LPID), a 1-bit “address space” identifier (AS), and the 14-bit Process identifier (PID). The 1-bit “indirect entry” (IND) identifier is not considered part of the virtual address. The LPID value is provided by the LPIDR register, and the PID value is provided by the PID register.
The GS and AS identifiers are provided by the Machine State Register, which contains separate bits for the instruction fetch address space (MACHINE STATE REGISTER[IS]) and the data access address space (MACHINE STATE REGISTER[DS]). Together, the 64-bit effective address and the other identifiers form an 88-bit virtual address. This 88-bit virtual address is then translated into the 42-bit real address using the TLB.
The MMU divides the address space (whether effective, virtual, or real) into pages. Five direct (IND=0) page sizes (4 KB, 64 KB, 1 MB, 16 MB, 1 GB) are simultaneously supported, such that at any given time the TLB can contain entries for any combination of page sizes. The MMU also supports two indirect (IND=1) page sizes (1 MB and 256 MB) with associated sub-page sizes. In order for an address translation to occur, a valid direct entry for the page containing the virtual address must be in the TLB. An attempt to access an address for which no direct TLB entry exists results in a search for an indirect TLB entry to be used by the hardware page table walker. If neither a direct nor an indirect entry exists, an Instruction (for fetches) or Data (for load/store accesses) TLB Miss exception occurs.
To improve performance, both the instruction cache and the data cache maintain separate “shadow” TLBs called ERATs. The ERATs contain only direct (IND=0) type entries. The instruction I-ERAT contains 16 entries, while the data D-ERAT contains 32 entries. These ERAT arrays minimize TLB contention between instruction fetch and data load/store operations. The instruction fetch and data access mechanisms only access the main unified TLB when a miss occurs in the respective ERAT. Hardware manages the replacement and invalidation of both the I-ERAT and D-ERAT; no system software action is required in MMU mode. In ERAT-only mode, an attempt to access an address for which no ERAT entry exists causes an Instruction (for fetches) or Data (for load/store accesses) TLB Miss exception.
Each TLB entry provides separate user state and supervisor state read, write, and execute permission controls for the memory page associated with the entry. If software attempts to access a page for which it does not have the necessary permission, an Instruction (for fetches) or Data (for load/store accesses) Storage exception will occur.
Each TLB entry also provides a collection of storage attributes for the associated page. These attributes control cache policy (such as cachability and write-through as opposed to copy-back behavior), byte order (big endian as opposed to little endian), and enabling of speculative access for the page. In addition, a set of four, user-definable storage attributes are provided. These attributes can be used to control various system level behaviors.
L2 Cache
The 32MiB shared L2 (
The BGQ Compute ASIC incorporates support for thread-level speculative execution (TLS). This support utilizes the L2 cache to handle multiple versions of data and detect memory reference patterns from any core that violate sequential consistency. The L2 cache design tracks all loads to a cache line and checks all stores against these loads. This BGQ compute ASIC has up to 32 MiB of speculative execution state storage in L2 cache. The design supports the following speculative execution mechanisms. If a core is idle and the system is running in a speculative mode, the target design provides a low latency mechanism for the idle core to obtain a speculative work item and, if sequential consistency is violated, to cancel that work, invalidate its internal state and obtain another available speculative work item. Invalidating internal state is extremely efficient: it requires only updating a bit in a table to indicate that the thread ID is now in the “Invalid” state. Threads can have one of four states: Primary non-speculative; Speculative, valid and in progress; Speculative, pending completion of older dependencies before committing; and Invalid, failed.
In one embodiment, store instructions are issued out of order and processed in a parallel computing system without using an msync instruction as is done in the art.
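The four thread states and the table-based invalidation can be modeled with a small state table. This is an explanatory model only: the class names, the use of one table slot per speculative thread ID, and the choice of “Invalid” as the initial state are assumptions, not details given in this description.

```python
from enum import Enum


class SpecState(Enum):
    PRIMARY_NON_SPECULATIVE = 0
    SPECULATIVE_IN_PROGRESS = 1     # speculative, valid and in progress
    SPECULATIVE_PENDING_COMMIT = 2  # waiting on older dependencies
    INVALID = 3                     # invalid, failed


class SpeculationTable:
    """Hypothetical state table with one slot per speculative thread ID."""

    def __init__(self, num_ids):
        self.state = [SpecState.INVALID] * num_ids

    def start(self, thread_id):
        self.state[thread_id] = SpecState.SPECULATIVE_IN_PROGRESS

    def invalidate(self, thread_id):
        # Invalidation touches only the table entry; no cached data is walked.
        self.state[thread_id] = SpecState.INVALID

    def is_valid(self, thread_id):
        return self.state[thread_id] is not SpecState.INVALID
```

The point of the model is that cancelling a speculative work item is a single table update, independent of how much speculative data the thread has produced.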
In
In one embodiment, the processor core that issued the store instruction is a producer (i.e., a component producing or generating data). That processor core hands off the produced or generated data to, e.g., a register in, the MU 3220 (
In one embodiment, other processor cores access the updated data upon seeing the flag bit set, e.g., by using a load instruction specifying a memory location of the updated data. The store instruction may be a guarded store instruction or an unguarded store instruction. The guarded store instruction is not processed speculatively and/or is run only when its operation is guaranteed safe. The unguarded store instruction is processed speculatively and/or assumes no side effect (e.g., speculatively overwriting data in a memory location does not affect a true output) in accessing the shared cache memory device 3215. The parallel computing system runs the method steps 3400-3430 without the assistance of a synchronization instruction (e.g., an msync instruction).
At step 3540 in
In a further embodiment, a fourth request queue, associated with the MU 3220, also receives and stores the issued store instruction. The first processor may not flush this fourth request queue when flushing the first request queue. The synchronization instruction issued by a processor core may flush this fourth request queue when flushing all other request queues.
In a further embodiment, the first, second, third and fourth request queues concurrently receive the issued store instruction from the first processor core. Alternatively, the first, second, third and fourth request queues receive the issued store instruction in a sequential order.
In a further embodiment, some of the method steps described in
In one embodiment, the method steps in
Generally, in the field of synchronizing memory accesses in a multi-processor, parallel computing system, application programs are split into “threads” that can run “speculatively” in parallel. The terms “speculative,” “speculatively,” “execution” and “speculative execution” as used herein are terms of art that do not imply mental steps or manual operation. Instead, they refer to computer processors running segments of code automatically. Some segments of code are known as “threads.” If the execution of code is “speculative,” this means that the thread is run in the computer as a sort of gamble. The gamble is that any given thread will be able to do something meaningful without altering data after some other thread has altered the same data in a way that would make results from the given thread invalid. All of the operations are undertaken within the hardware on an automated basis.
There is further provided an instruction set and supporting hardware for a multiprocessor system that support speculative execution by improving synchronization of memory accesses.
Advantageously, a multiprocessor system will include a special msync unit for supporting memory synchronization requests. This unit will have a mechanism for keeping track of generations of requests and for delaying requests that exceed a maximum count of generations in flight.
Advantageously, various levels or methods of memory synchronization will also be supported, responsive to the msync unit.
The following description mentions a number of instruction and function names such as “msync,” “hwsync,” “lwsync,” “eieio,” “TLBsync,” “Mbar,” “full sync,” “non-cumulative barrier,” “producer sync,” “generation change sync,” “producer generation change sync,” “consumer sync,” and “local barrier.” These names are arbitrary and for convenience of understanding. An instruction might equally well be given any name as a matter of preference without altering the nature of the instruction or without taking the instruction or the hardware supporting it outside of the scope of the claims.
Generally, implementing an instruction will involve creating specific computer hardware that will cause the instruction to run when computer code requests that instruction. The field of Application Specific Integrated Circuits (“ASICs”) is a well-developed field that allows implementation of computer functions responsive to a formal specification. Accordingly, no specific implementation will be discussed here. Instead, the functions of instructions and units will be discussed.
As described herein, the use of the letter “B” represents a Byte quantity, e.g., 2 B, 8.0 B, 32 B, and 64 B represent Byte units. Recitations “GB” represent Gigabyte quantities. Throughout this disclosure a particular embodiment of a multi-processor system will be discussed. This embodiment includes various numerical values for numbers of components, bandwidths of interfaces, memory sizes and the like. These numerical values are not intended to be limiting, but only examples. One of ordinary skill in the art might devise other examples as a matter of design choice.
The compute node 50 is a single chip (“nodechip”) based on low power A2 PowerPC cores, though any compatible core might be used. While the commercial embodiment is built around the PowerPC architecture, the invention is not limited to that architecture. In the embodiment depicted, the node includes 17 cores 52, each core being 4-way hardware threaded. There is a shared L2 cache 70 accessible via a full crossbar switch 60, the L2 including 16 slices 72. There is further provided external memory 80, in communication with the L2 via DDR-3 controllers 78 (DDR being an acronym for Double Data Rate).
A messaging unit (“MU”) 30100 includes a direct memory access (“DMA”) engine 21, a network interface 150, and a Peripheral Component Interconnect Express (“PCIe”) unit. The MU is coupled to interprocessor links 90 and I/O link 92.
Each FPU 53 associated with a core 52 has a data path to the L1-data cache 55. Each core 52 is directly connected to a supplementary processing agglomeration 58, which includes a private prefetch unit. For convenience, this agglomeration 58 will be referred to herein as “L1P” (meaning level 1 prefetch) or “prefetch unit;” but many additional functions are lumped together in this so-called prefetch unit, such as write combining. These additional functions could be illustrated as separate modules, but as a matter of drawing and nomenclature convenience the additional functions and the prefetch unit will be illustrated herein as being part of the agglomeration labeled “L1P.” This is a matter of drawing organization, not of substance. Some of the additional processing power of this L1P group is shown in
In this embodiment, the L2 Cache units provide the bulk of the memory system caching. Main memory may be accessed through two on-chip DDR-3 SDRAM memory controllers 78, each of which services eight L2 slices.
To reduce main memory accesses, the L2 advantageously serves as the point of coherence for all processors within a nodechip. This function includes generating L1 invalidations when necessary. Because the L2 cache is inclusive of the L1s, it can remember which processors could possibly have a valid copy of every line, and can multicast selective invalidations to such processors. In the current embodiment the prefetch units and data caches can be considered part of a memory access pathway.
The units 30301 and 30302 have outputs relevant to memory synchronization, as will be discussed further below with reference to
The L2, as point of coherence, detects that the copy of the data resident in the L1D for thread β is invalid. Slice 31801 therefore queues an invalidation signal to the queue 31089 and then, via the crossbar switch, to the queue 31807 of core/L1 group 31805.
When α writes the flag, this again passes through queue 31806 to the crossbar switch 31803, but this time the write is hashed to the queue 31810 of a second slice 31802 of the L2. This flag is then stored in the slice and queued at 31811 to go through the crossbar 31803 to queue 31807 and then to the core/L1 group 31805. In parallel, thread β is repeatedly scanning the flag in its own L1D.
Traditionally, multiprocessor systems have used consistency models called “sequential consistency” or “strong consistency”, see e.g. the article entitled “Sequential Consistency” in Wikipedia. Pursuant to this type of model, if unit 31804 first writes data and then writes the flag, this implies that if the flag has changed, then the data has also changed. It is not possible for the flag to be changed before the data. The data change must be visible to the other threads before the flag changes. This sequential model has the disadvantage that threads are kept waiting, sometimes unnecessarily, slowing processing.
To speed processing, the PowerPC architecture uses a “weakly consistent” memory model. In that model, there is no guarantee as to which memory access request will first result in a change visible to all threads. It is possible that β will see the flag changing, and still not have received the invalidation message from slice 31801, so β may still have old data in its L1D.
To prevent this unfortunate result, the PowerPC programmer can insert msync instructions 31708 and 31709 as shown in
In accordance with the embodiment disclosed herein, to support concurrent memory synchronization instructions, requests are tagged with a global “generation” number. The generation number is provided by a central generation counter. A core executing a memory synchronization instruction requests that the central unit increment the generation counter and then waits until all memory operations of the previously current generation and all earlier generations have completed.
A core's memory synchronization request is complete when all requests that were in flight when the request began have completed. In order to determine this, the L1P monitors a reclaim pointer that will be discussed further below. Once it sees the reclaim pointer moving past the generation that was active at the point of the start of the memory synchronization request, then the memory synchronization request is complete.
A number of units within the nodechip queue memory access requests. These include:
Every such unit can contain some aspect of a memory access request in flight that might be impacted by a memory synchronization request.
The global OR tree 30502 per
Because the memory subsystem has paths—especially the crossbar—through which requests pass without contributing to the global OR reduce tree of
Memory access requests tagged with a generation number may be of many types, including:
The memory synchronization unit 30905 shown in
In the current embodiment, the generation counter is used to determine whether a requested generation change is complete, while the reclaim pointer is used to infer what generation has completed.
The module 30905 of
For a synchronization operation, a unit can request an increment of the current generation and wait for previous generations to complete.
The central generation counter uses a single counter 30601 to determine the next generation. As this counter is narrow, for instance 3 bits wide, it wraps frequently, causing the reuse of generation numbers. To prevent using a number that is still in flight, there is a second, reclaiming counter 30602 of identical width that points to the oldest generation in flight. This counter is controlled by a track and control unit 30606 implemented within the memory synchronization unit. Signals from the msync interface unit, discussed with reference to
The generation counter can only advance if doing so would not cause it to point to the same generation as the reclaim pointer in the next cycle. If the generation counter is stalled by this condition, it can still receive incoming memory synchronization requests from other cores and process them all at once by broadcasting the identical grant to all of them, causing them all to wait for the same generations to clear. For instance, all requests for generation change from the hardware threads can be OR′d together to create a single generation change request.
The generation counter (gen_cnt) 30601 and the reclaim pointer (rcl_ptr) 30602 both start at zero after reset. When a unit requests to advance to a new generation, it indicates the desired generation. No explicit acknowledgment is sent back to the requestor; the requestor unit determines whether its request has been processed based on the global current generation 30601, 30602. As the requested generation can be at most gen_cnt+1, requests for any other generation are assumed to have already been completed.
If the requested generation is equal to gen_cnt+1 and equal to rcl_ptr, an increment is requested, but it cannot take effect immediately because the next generation value is still in use. The gen_cnt will be incremented as soon as the rcl_ptr increments.
If the requested generation is not equal to gen_cnt+1, it is assumed completed and is ignored.
If the requested generation is equal to gen_cnt+1 and not equal to rcl_ptr, gen_cnt is incremented. However, gen_cnt is incremented at most once every 2 cycles, allowing units tracking the broadcast to see increments even in the presence of single cycle upset events.
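The advance rules for gen_cnt and rcl_ptr described above can be sketched in software. The following Python model is only an illustration under stated assumptions: the class name, the method names (request, reclaim, tick), and the cooldown mechanism used to approximate "at most every 2 cycles" are hypothetical and not taken from the embodiment, and the model decides externally when the reclaim pointer advances rather than deriving it from the OR reduce tree.

```python
class GenerationCounter:
    """Toy model of the central generation counter and reclaim pointer."""

    def __init__(self, width_bits=3):
        self.mod = 1 << width_bits  # narrow counter wraps (3 bits -> 8 values)
        self.gen_cnt = 0            # current generation, zero after reset
        self.rcl_ptr = 0            # oldest generation still in flight
        self.pending = False        # an increment is requested but not yet done
        self.cooldown = 0           # assumption: models "at most every 2 cycles"

    def request(self, requested_gen):
        """A unit asks to advance; no acknowledgment is returned."""
        if requested_gen != (self.gen_cnt + 1) % self.mod:
            return                  # any other generation is assumed completed
        self.pending = True         # requests from several units collapse (OR)

    def reclaim(self):
        """The oldest generation drained; rcl_ptr may not pass gen_cnt."""
        if self.rcl_ptr != self.gen_cnt:
            self.rcl_ptr = (self.rcl_ptr + 1) % self.mod

    def tick(self):
        """Advance the model by one clock cycle."""
        if self.cooldown:
            self.cooldown -= 1
            return
        nxt = (self.gen_cnt + 1) % self.mod
        if self.pending and nxt != self.rcl_ptr:
            self.gen_cnt = nxt      # next value is free, so advance
            self.pending = False
            self.cooldown = 1       # skip one cycle before the next increment
```

In this sketch a request for generation 0 made while gen_cnt is 7 and rcl_ptr is 0 stays pending until reclaim() moves rcl_ptr to 1, which mirrors the stall condition in the text.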
Per
The PowerPC architecture defines three levels of synchronization:
Generally it has been found that programmers overuse the heavyweight sync in their zealousness to prevent memory inconsistencies. This results in unnecessary slowing of processing. For instance, if a program contains one data producer and many data consumers, the producer is the bottleneck. Having the producer wait to synchronize aggravates this. Analogously, if a program contains many producers and only one consumer, then the consumer can be the bottleneck and forcing it to wait should be avoided where possible.
In implementing memory synchronization, it has been found advantageous to offer several levels of synchronization programmable by memory mapped I/O. These levels can be chosen by the programmer in accordance with anticipated work distribution. Generally, these levels will be most commonly used by the operating system to distribute workload. It will be up to the programmer choosing the level of synchronization to verify that different threads using the same data have compatible synchronization levels.
Seven levels or “flavors” of synchronization operations are discussed herein. These flavors can be implemented as alternatives to the msync/hwsync, lwsync, and mbar/eieio instructions of the PowerPC architecture. In this case, program instances of these categories of PowerPC instructions can all be mapped to the strongest sync, the msync, with the alternative levels then being available by memory-mapped I/O. The scope of restrictions imposed by these different flavors is illustrated conceptually in the Venn diagram of
The seven flavors disclosed herein are:
The non-cumulative barrier ensures that the generation of the last access of the requestor has completed before the requestor can proceed. This sync is not strong enough to provide cumulative ordering as required by the PowerPC synchronizing instructions. The last load issued by this processor may have received a value written by a store request of another core from the subsequent generation. Thus this sync does not guarantee that the value it saw prior to the store is visible to all cores after this sync operation. More about the distinction between non-cumulative barrier and full sync is illustrated by
The producer sync ensures that the generation of the last store access before the sync instruction of the requestor has completed before the requestor can proceed. This sync is sufficient to separate the data location updates from the guard location update for the producer in a producer/consumer queue. This type of sync is useful where the consumer is the bottleneck and where there are instructions that can be carried out between the memory access and the msync that do not require synchronization. It is also not strong enough to provide cumulative ordering as required by the PowerPC synchronizing instructions.
Generation Change Sync 31714
This sync ensures only that the requests following the sync are in a different generation than the last request issued by the requestor. This type of sync is normally requested by the consumer and puts the burden of synchronization on the producer. This guarantees that loads and stores are completed. This might be particularly useful in the case of atomic operations as defined in application 61/299,911 filed Jan. 29, 2010, which is incorporated herein by reference, and where it is desired to verify that all data is consumed.
Producer Generation Change Sync 31715
This sync is designed to slow the producer the least. This sync ensures only that the requests following the sync are in a different generation from the last store request issued by the requestor. This can be used to separate the data location updates from the guard location update for the producer in a producer/consumer queue. However, the consumer has to ensure that the data location updates have completed after it sees the guard location change. This type does not require the producer to wait until all the invalidations are finished. The term “guard location” here refers to the type of data shown in the flag of
Consumer Sync 31716
This request is run by the consumer thread. This sync ensures that all requests belonging to the current generation minus one have completed before the requestor can proceed. This sync can be used by the consumer in conjunction with a producer generation change sync by the producer in a producer/consumer queue.
Local Barrier 31717
This sync is local to a core/L1 group and only ensures that all its preceding memory accesses have been sent to the switch.
At 31105 thread β—the consumer—tests whether the ready flag is set. At 31106, thread β also tests, in accordance with a consumer sync, whether the reclaim pointer has reached the generation of the current synchronization request. When both conditions are met at 31107, then thread β can use the data at 31108.
In addition to the standard addressing and data functions 30454, 30455, when the L1P 58—shown in
To invoke the synchronizing behavior of synchronization types other than full sync, at least two implementation options are possible:
1. synchronization caused by load and store operations to predefined addresses
Synchronization levels are controlled by memory-mapped I/O accesses. As store operations can bypass load operations, synchronization operations that require preceding loads to have completed are implemented as load operations to memory-mapped I/O space, followed by a conditional branch that depends on the load return value. Simple use of the load return value may be sufficient. If the sync does not depend on the completion of preceding loads, it can be implemented as a store to memory-mapped I/O space. Some implementation issues of one embodiment are as follows. A write access to this location is mapped to a sync request which is sent to the memory synchronization unit. The write request stalls the further processing of requests until the sync completes. A load request to the location causes the same type of requests, but only the full and the consumer requests stall. All other load requests return the completion status as a value: a 0 for sync not yet complete, a 1 for sync complete. This implementation does not take advantage of all of the built-in PowerPC constraints of a core implementing the PowerPC architecture. Accordingly, more programmer attention to the order of memory access requests is needed.
2. configuring the semantics of the next synchronization instruction, e.g. the PowerPC msync, via storing to a memory mapped configuration register.
In this implementation, before every memory synchronization instruction, a store is executed that deposits a value that selects a synchronization behavior into a memory mapped register. The next executed memory synchronization instruction invokes the selected behavior and restores the configuration back to the Full Sync behavior. This reactivation of the strongest synchronization type guarantees correct execution if applications or subroutines that do not program the configuration register are executed.
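The second option amounts to a one-shot selector that falls back to the strongest behavior after every use. A minimal Python sketch of that behavior follows; the class name, method names, and the string labels used for the flavors are hypothetical and serve only to illustrate the reset-to-Full-Sync rule, not the actual register encoding.

```python
FULL_SYNC = "full"  # hypothetical label for the strongest synchronization type

class SyncConfigRegister:
    """Toy model of the memory mapped configuration register."""

    def __init__(self):
        self.flavor = FULL_SYNC     # default behavior after reset (assumption)

    def store(self, flavor):
        """Models the store that selects the behavior of the next msync."""
        self.flavor = flavor

    def msync(self):
        """Models the next memory synchronization instruction.

        Returns the behavior that was invoked and restores Full Sync, so
        code that never programs the register still gets the strongest sync.
        """
        invoked = self.flavor
        self.flavor = FULL_SYNC
        return invoked
```

A subroutine that issues msync without first storing to the register therefore always observes the Full Sync behavior in this model, which is the safety property the text relies on.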
Memory Synchronization Interface Unit
The register storing configuration will sometimes be referred to herein as “configuration register.” This control unit 30906 notifies the core 52 via 30908 when the msync is completed. The core issuing the msync drains all loads and stores, stops taking loads and stores and stops the issuing thread until the msync completion indication is received.
This control unit also exchanges information with the global generation counter module 30905. This information includes a generation count. In the present embodiment, there is only one input per L1P to the generation counter, so the L1P aggregates requests for increment from all hardware threads of the processor 52. Also, in the present embodiment, the OR reduce tree is coupled to the reclaim pointer, so the memory synchronization interface unit gets information from the OR reduce tree indirectly via the reclaim pointer.
The control unit also tracks the changes of the global generation (gen_cnt) and determines whether a request of a client has completed. Generation completion is detected by using the reclaim pointer that is fed to observer latches in the L1P. The core waits for the L1P to handle the msyncs. Each hardware thread may be waiting for a different generation to complete. Therefore each one stores what the generation for that current memory synchronization instruction was. Each then waits individually for its respective generation to complete.
For each client 30901, the unit implements a group 30903 of three generation completion detectors shown at 31001, 31002, 31003, per
For each store request generated by a client, the first 31001 of the three detectors sets its ginfl_flag 31005 and updates the last_gen latch 31004 with the current generation. This detector is updated for every store, and therefore reflects whether the last store has completed or not. This is sufficient, since prior stores will have generations less than or equal to the generation of the current store. Also, since the core is waiting for memory synchronization, it will not be making more stores until the completion indication is received.
For each memory access request, regardless whether load or store, the second detector 31002 is set correspondingly. This detector is updated for every load and every store, and therefore its flag indicates whether the last memory access request has completed.
If a client requests a full sync, the third detector 31003 is primed with the current generation, and for a consumer sync the third detector is primed with the current generation-1. Again, this detector is updated for every full or consumer sync.
Since the reclaim pointer cannot advance without everything in that generation having completed and because the reclaim pointer cannot pass the generation counter, the reclaim pointer is an indication of whether a generation has completed. If the rcl_ptr 30602 moves past the generation stored in last_gen, no requests for the generation are in flight anymore and the ginfl_flag is cleared.
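One way to picture a generation completion detector is as a pair (last_gen, ginfl_flag) that watches the broadcast counters. The Python sketch below is an assumption-laden illustration: the class and method names are hypothetical, and "moves past" is interpreted here as last_gen no longer lying in the wrap-around window from rcl_ptr to gen_cnt, which is one plausible reading for narrow wrapping counters rather than the disclosed circuit.

```python
class CompletionDetector:
    """Toy model of one generation completion detector."""

    def __init__(self, width_bits=3):
        self.mod = 1 << width_bits
        self.last_gen = 0         # generation of the last tracked request
        self.ginfl_flag = False   # set while that generation may be in flight

    def record(self, current_gen):
        """Called when a tracked request is issued in current_gen."""
        self.last_gen = current_gen
        self.ginfl_flag = True

    def observe(self, rcl_ptr, gen_cnt):
        """Called with the broadcast reclaim pointer and generation counter."""
        if not self.ginfl_flag:
            return
        window = (gen_cnt - rcl_ptr) % self.mod       # generations in flight
        offset = (self.last_gen - rcl_ptr) % self.mod
        if offset > window:
            # last_gen is outside [rcl_ptr, gen_cnt]: rcl_ptr has moved past it
            self.ginfl_flag = False
```

Because the comparison is done modulo the counter width, the same test works after the counters wrap, for example when last_gen is 7 and the window has advanced to generations 0 through 1.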
Full Sync
This sync completes if the ginfl_flag 31009 of the third detector 31003 is cleared. Until completion, it requests a generation change to the value stored in the third detector plus one.
Non-Cumulative Barrier
This sync completes if the ginfl_flag 31007 of the second detector 31002 is cleared. Until completion, it requests a generation change to the value that is held in the second detector plus one.
Producer Sync
This sync completes if the ginfl_flag 31005 of the first detector 31001 is cleared. Until completion, it requests a generation change to the value held in the first detector plus one.
Generation Change Sync
This sync completes if either the ginfl_flag 31007 of the second detector 31002 is cleared or if the last_gen 31006 of the second detector is different from gen_cnt 30601. If it does not complete immediately, it requests a generation change to the value stored in the second detector plus one. The purpose of the operation is to advance the current generation (value of gen_cnt) to at least one higher than the generation of the last load or store. The generation of the last load or store is stored in the last_gen register of the second detector.
Producer Generation Change Sync
This sync completes if either the ginfl_flag 31005 of the first detector 31001 is cleared or if the last_gen 31004 of the first detector is different from gen_cnt 30601. If it does not complete immediately, it requests a generation change to the value stored in the first detector plus one. This operates similarly to the generation change sync except that it uses the generation of the last store, rather than load or store.
Consumer Sync
This sync completes if the ginfl_flag 31009 of the third detector 31003 is cleared. Until completion, it requests a generation change to the value stored in the third detector plus one.
Local Barrier
This sync is executed by the L1P; it does not involve generation tracking.
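The completion conditions for the generation-tracked flavors can be collected into a single predicate. The Python sketch below restates those conditions under stated assumptions: the Detector dataclass, the flavor labels, and the function names are hypothetical, the modular wrap in requested_generation is assumed from the narrow counter width, and the local barrier is left out because it does not involve generation tracking.

```python
from dataclasses import dataclass

@dataclass
class Detector:
    last_gen: int = 0
    ginfl_flag: bool = False

def sync_complete(flavor, store_det, access_det, sync_det, gen_cnt):
    """Return True if a sync of the given flavor is complete.

    store_det  : first detector, updated on every store
    access_det : second detector, updated on every load or store
    sync_det   : third detector, primed by full and consumer syncs
    """
    if flavor in ("full", "consumer"):
        return not sync_det.ginfl_flag
    if flavor == "non_cumulative_barrier":
        return not access_det.ginfl_flag
    if flavor == "producer":
        return not store_det.ginfl_flag
    if flavor == "generation_change":
        return (not access_det.ginfl_flag) or access_det.last_gen != gen_cnt
    if flavor == "producer_generation_change":
        return (not store_det.ginfl_flag) or store_det.last_gen != gen_cnt
    raise ValueError("flavor not modeled: " + flavor)

def requested_generation(det, width_bits=3):
    """Generation change requested until completion: stored value plus one."""
    return (det.last_gen + 1) % (1 << width_bits)
```

The two generation change flavors differ from the others only in the extra last_gen != gen_cnt term, which lets them complete as soon as the current generation has moved on, without waiting for the earlier generation to drain.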
From the above discussion, it can be seen that a memory synchronization instruction actually implicates a set of sub-tasks. For a comprehensive memory synchronization scheme, those sub-tasks might include one or more of the following:
In implementing the various levels of synchronization herein, sub-sets of this set of sub-tasks can be viewed as partial synchronization tasks to be allocated between threads in an effort to improve throughput of the system. Therefore address formats of instructions specifying a synchronization level effectively act as parameters to offload sub-tasks from or to the thread containing the synchronization instruction. If a particular sub-task implicated by the memory synchronization instruction is not performed by the thread containing the memory synchronization instruction, then the implication is that some other thread will pick up that part of the memory synchronization function. While particular levels of synchronization are specified herein, the general concept of distributing synchronization sub-tasks between threads is not limited to any particular instruction type or set of levels.
Physical Design
The Global OR tree needs attention to layout and pipelining, as its latency affects the performance of the sync operations.
In the current embodiment, the cycle time is 1.25 ns. In that time, a signal will travel 2 mm through a wire. Where a wire is longer than 2 mm, the delay will exceed one clock cycle, potentially causing unpredictable behavior in the transmission of signals. To prevent this, a latch should be placed at each position on each wire that corresponds to 1.25 ns, in other words approximately every 2 mm. This means that every transmission distance delay of 4 ns will be increased to 5 ns, but the circuit behavior will be more predictable. In the case of the msync unit, some of the wires are expected to be on the order of 10 mm meaning that they should have on the order of five latches.
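The arithmetic in the preceding paragraph can be checked with a short calculation. This Python sketch uses only the figures given in the text (1.25 ns per cycle, 2 mm per cycle); the function names are hypothetical, and the stage count is an approximation consistent with "on the order of five latches" for a 10 mm wire.

```python
import math

CYCLE_NS = 1.25      # clock period stated in the text
MM_PER_CYCLE = 2.0   # distance a signal travels in one cycle, per the text

def pipeline_stages(wire_mm):
    """Approximate number of one-cycle segments for a wire of this length."""
    return math.ceil(wire_mm / MM_PER_CYCLE)

def latched_delay_ns(raw_delay_ns):
    """Raw wire delay rounded up to a whole number of clock cycles."""
    return math.ceil(raw_delay_ns / CYCLE_NS) * CYCLE_NS
```

Under these figures, a raw delay of 4 ns spans 3.2 cycles and is therefore rounded up to four full cycles, which gives the 5 ns stated above.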
Due to quantum mechanical effects, it is advisable to protect latches holding generation information with Error Correcting Codes (“ECC”) (4b per 3b counter data). All operations may include ECC correction and ECC regeneration logic.
The global broadcast and generation change interfaces may be protected by parity. In the case of a single cycle upset, the request or counter value transmitted is ignored, which does not affect correctness of the logic.
Software Interface
The Msync unit will implement the ordering semantics of the PPC hwsync, lwsync and mbar instructions by mapping these operations to the full sync.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The word “comprising”, “comprise”, or “comprises” as used herein should not be viewed as excluding additional elements. The singular article “a” or “an” as used herein should not be viewed as excluding a plurality of elements. Unless the word “or” is expressly limited to mean only a single item exclusive from other items in reference to a list of at least two items, then the use of “or” in such a list is to be interpreted as including (a) any single item in the list, (b) all of the items in the list, or (c) any combination of the items in the list. Ordinal terms in the claims, such as “first” and “second” are used for distinguishing elements and do not necessarily imply order of operation.
There is further provided a system and method for managing the loading and storing of data conditionally in memories of multi-processor systems.
A conventional multi-processor computer system includes multiple processing units (a.k.a. processors or processor cores) all coupled to a system interconnect, which typically comprises one or more address, data and control buses. Coupled to the system interconnect is a system memory, which represents the lowest level of volatile memory in the multiprocessor computer system and which generally is accessible for read and write access by all processing units. In order to reduce access latency to instructions and data residing in the system memory, each processing unit is typically further supported by a respective multi-level cache hierarchy, the lower level(s) of which may be shared by one or more processor cores.
Cache memories are commonly utilized to temporarily buffer memory blocks that might be accessed by a processor in order to speed up processing by reducing access latency introduced by having to load needed data and instructions from system memory. In some multiprocessor systems, the cache hierarchy includes at least two levels. The level one (L1), or upper-level cache is usually a private cache associated with a particular processor core and cannot be directly accessed by other cores in the system. Typically, in response to a memory access instruction such as a load or store instruction, the processor core first accesses the upper-level cache. If the requested memory block is not found in the upper-level cache or the memory access request cannot be serviced in the upper-level cache (e.g., the L1 cache is a store-through cache), the processor core then accesses lower-level caches (e.g., level two (L2) or level three (L3) caches) to service the memory access to the requested memory block. The lowest level cache (e.g., L2 or L3) is often shared among multiple processor cores.
A coherent view of the contents of memory is maintained in the presence of potentially multiple copies of individual memory blocks distributed throughout the computer system through the implementation of a coherency protocol. The coherency protocol entails maintaining state information associated with each cached copy of the memory block and communicating at least some memory access requests between processing units to make the memory access requests visible to other processing units.
In order to synchronize access to a particular granule (e.g., cache line) of memory between multiple processing units and threads of execution, load-reserve and store-conditional instruction pairs are often employed. For example, load-reserve and store-conditional instructions referred to as LWARX and STWCX have been implemented. Execution of a LWARX (Load Word And Reserve Indexed) instruction by a processor loads a specified cache line into the cache memory of the processor and typically sets a reservation flag and address register signifying the processor has interest in atomically updating the cache line through execution of a subsequent STWCX (Store Word Conditional Indexed) instruction targeting the reserved cache line. The cache then monitors the storage subsystem for operations signifying that another processor has modified the cache line, and if one is detected, resets the reservation flag to signify the cancellation of the reservation. When the processor executes a subsequent STWCX targeting the cache line reserved through execution of the LWARX instruction, the cache memory only performs the cache line update requested by the STWCX if the reservation for the cache line is still pending. Thus, updates to shared memory can be synchronized without the use of an atomic update primitive that strictly enforces atomicity.
Individual processors usually provide minimal support for load-reserve and store-conditional. The processors basically hand off responsibility for consistency and completion to the external memory system. For example, a processor core may treat load-reserve like a cache-inhibited load, but invalidate the target line if it hits in the L1 cache. The returning data goes to the target register, but not to the L1 cache. Similarly, a processor core may treat store-conditional as a cache-inhibited store and also invalidate the target line in the L1 cache if it exists. The store-conditional instruction stalls until success or failure is indicated by the external memory system, and the condition code is set before execution continues. The external memory system is expected to maintain load-reserve reservations for each thread, and no special internal consistency action is taken by the processor core when multiple threads attempt to use the same lock.
In a traditional, bus-based multiprocessor system, the point of memory system coherence is the bus itself. That is, coherency between the individual caches of the processors is resolved by the bus during memory accesses, because the accesses are effectively serialized. As a result, the shared main memory of the system is unaware of the existence of multiple processors. In such a system, support for load-reserve and store-conditional is implemented within the processors or in external logic associated with the processors, and conflicts between reservations and other memory accesses are resolved during bus accesses.
As the number of processors in a multiprocessor system increases, a shared bus interconnect becomes a performance bottleneck. Therefore, large-scale multiprocessors use some sort of interconnection network to connect processors to shared memory (or a cache for shared memory). Furthermore, an interconnection network encourages the use of multiple shared memory or cache slices in order to take advantage of parallelism and increase overall memory bandwidth.
It is desirable to implement synchronization based on load-reserve and store-conditional in such a large-scale multiprocessor, but it is no longer efficient to do so at the individual processors. What is needed is a mechanism to implement such synchronization at the point of coherence, which is the shared memory. Furthermore, the implementation must accommodate the individual slices of the shared memory. A unified mechanism is needed to ensure proper consistency of lock reservations across all the processors of the multiprocessor system.
In the embodiment described above, each A2 processor core has four independent hardware threads sharing a single L1 cache with a 64-byte line size. Every memory line is stored in a particular L2 cache slice, depending on the address mapping. That is, the sixteen L2 slices effectively comprise a single L2 cache, which is the point of shared memory coherence for the compute node. Those skilled in the art will recognize that the invention applies to different multiprocessor configurations including a single L2 cache (i.e. one slice), a main memory with no L2 cache, and a main memory consisting of multiple slices.
Each L2 slice has some number of reservation registers to support load-reserve/store-conditional locks. One embodiment that would accommodate unique lock addresses from every thread simultaneously is to provide 68 reservation registers in each slice, because it is possible for all 68 threads to simultaneously use lock addresses that fall into the same L2 slice. Each reservation register would contain an N-bit address (specifying a unique 64-byte L1 line) and a valid bit, as shown in
When a load-reserve occurs, the reservation register corresponding to the ID (i.e. the unique thread number) of the thread that issued the load-reserve is checked to determine if the thread has already made a reservation. If so, the reservation address is updated with the load-reserve address. If not, the load-reserve address is installed in the register and the valid bit is set. In both cases, the load-reserve continues as an ordinary load and returns data.
When a store-conditional occurs, the reservation register corresponding to the ID of the requesting thread is checked to determine if the thread has a valid reservation for the lock address. If so, then the store-conditional is considered a success, a store-conditional success indication is returned to the requesting processor core, and the store-conditional is converted to an ordinary store (updating the memory and causing the necessary invalidations to other processor cores by the normal coherence mechanism). In addition, if the store-conditional address matches any other reservation registers, then they are invalidated. If the thread issuing the store-conditional has no valid reservation or the address does not match, then the store-conditional is considered a failure, a store-conditional failure indication is returned to the requesting processor core, and the store-conditional is dropped (i.e. the memory update and associated invalidations to other cores and other reservation registers do not occur).
Every ordinary store to the shared memory searches all valid reservation address registers and simply invalidates those with a matching address. The necessary back-invalidations to processor cores will be generated by the normal coherence mechanism.
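The load-reserve, store-conditional, and ordinary store rules for one L2 slice can be sketched as follows. This Python model is illustrative only: the class and method names are hypothetical, memory is modeled as a dictionary, and clearing the requesting thread's own reservation on a successful store-conditional is an assumption (the text above only states that other matching registers are invalidated).

```python
LINE_BYTES = 64  # reservation granule: one 64-byte L1 line

class L2SliceReservations:
    """Toy model of per-thread reservation registers in one L2 slice."""

    def __init__(self, num_threads=68):
        self.addr = [0] * num_threads        # reserved line address per thread
        self.valid = [False] * num_threads   # valid bit per thread
        self.memory = {}                     # address -> value (toy memory)

    def _invalidate_matching(self, line):
        for tid in range(len(self.valid)):
            if self.valid[tid] and self.addr[tid] == line:
                self.valid[tid] = False

    def load_reserve(self, tid, address):
        # Install or overwrite the thread's reservation, then act as a load.
        self.addr[tid] = address // LINE_BYTES
        self.valid[tid] = True
        return self.memory.get(address, 0)

    def store_conditional(self, tid, address, value):
        line = address // LINE_BYTES
        if not (self.valid[tid] and self.addr[tid] == line):
            return False                     # failure: the store is dropped
        self.memory[address] = value         # success: becomes ordinary store
        # Assumption: the requester's own reservation is cleared here as well.
        self._invalidate_matching(line)
        return True

    def store(self, address, value):
        # Ordinary stores invalidate every reservation on the same line.
        self.memory[address] = value
        self._invalidate_matching(address // LINE_BYTES)
```

In this sketch two threads that reserve the same line race in the expected way: the first store-conditional succeeds and invalidates the other thread's reservation, so the second store-conditional fails and that thread must retry from its load-reserve.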
In general, a thread is not allowed to have more than one load-reserve reservation at a time. If the processor does not track reservations, then this restriction must be enforced by additional logic outside the processor. Otherwise, a thread could issue load-reserve requests to more than one L2 slice and establish multiple reservations.
When the thread executes store-conditional, the address will be matched against the appropriate register. If it matches and the register is valid, then the store-conditional protocol continues as described above. If not, then the store-conditional is considered a failure, the core is notified, and only a special notification is sent to the L2 slice holding the reservation in order to cancel that reservation. This embodiment allows the processor to continue execution past the store-conditional very quickly. However, a failed store-conditional requires the message to be sent to the L2 in order to invalidate the reservation there. The memory system must guarantee that this invalidation message acts on the reservation before any subsequent store-conditional from the same processor is allowed to succeed.
Another embodiment, shown in
A similar tradeoff exists for load-reserve followed by load-reserve, but the performance of both storage strategies is the same. That is, the reservation resulting from the earlier load-reserve address must be invalidated at L2, which can be done with a special invalidate message. Then the new reservation is established as described previously. Again, the memory system must ensure that no subsequent store-conditional can succeed before that invalidate message has had its effect.
When a load-reserve reservation is invalidated due to a store-conditional by some other thread or an ordinary store, all L2 reservation registers storing that address are invalidated. While this guarantees correctness, performance could be improved by invalidating matching lock reservation registers near the processors (
As described above, the L2 cache slices store the reservation addresses of all valid load-reserve locks. Because every thread could have a reservation and they could all fall into the same L2 slice, one embodiment, shown in
It is desirable to compare the address of a store-conditional or store to all lock reservation addresses simultaneously for the purpose of rapid invalidation. Therefore, a conventional storage array such as a static RAM or register array is preferably not used. Rather, discrete registers that can operate in parallel are needed. The resulting structure has on the order of N*68 latches and requires a 68-way fanout for the address and control buses. Furthermore, it is replicated in all sixteen L2 slices.
Because load-reserve reservations are relatively sparse in many codes, one way to address the power inefficiency of the large reservation register structure is to use clock-gated latches. Another way, as illustrated in
Although the reservation register structure in the L2 caches described thus far will accommodate any possible locking code, it would be very unusual for 68 threads to all want a unique lock since locking is done when memory is shared. A far more likely, yet still remote, possibility is that 34 pairs of threads want unique locks (one per pair) and they all happen to fall into the same L2 slice. In this case, the number of registers could be halved, but a single valid bit no longer suffices because the registers must be shared. Therefore, each register would, as represented in
With this embodiment, a store-conditional match is successful only if both the address and thread ID are the same. However, an address-only match is sufficient for the purpose of invalidation. This design uses on the order of 34*M latches and requires a 34-way fanout for the address, thread ID, and control buses. Again, the buses could be shielded behind AND gates, using the structure shown in
Because this design cannot accommodate all possible lock scenarios, a register selection policy is needed in order to cover the cases where there are no available lock registers to allocate. One embodiment is to simply drop new requests when no registers are available. However, this can lead to deadlock in the pathological case where all the registers are reserved by a subset of the threads executing load-reserve, but never released by store-conditional. Another embodiment is to implement a replacement policy such as round-robin, random, or LRU. Because, in some embodiments, it is very likely that all 34 registers in a single slice may be used, a policy that has preference for unused registers and then falls back to simple round-robin replacement will, in many cases, provide excellent results.
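The shared-register embodiment and the selection policy that prefers unused registers before falling back to round-robin replacement can be sketched as follows. This is a software illustration only; the class name, method names, and the tuple layout of an entry are assumptions made for the example.

```python
class SharedReservationFile:
    """Hypothetical model of shared registers holding (address, thread ID)."""

    def __init__(self, num_registers=34):
        self.entries = [None] * num_registers  # None marks an unused register
        self.next_victim = 0                   # round-robin pointer

    def allocate(self, address, thread_id):
        # First preference: any unused register.
        for i, entry in enumerate(self.entries):
            if entry is None:
                self.entries[i] = (address, thread_id)
                return i
        # Fallback: simple round-robin replacement of a valid reservation.
        i = self.next_victim
        self.entries[i] = (address, thread_id)
        self.next_victim = (i + 1) % len(self.entries)
        return i

    def invalidate(self, address):
        # An address-only match is sufficient for invalidation.
        for i, entry in enumerate(self.entries):
            if entry is not None and entry[0] == address:
                self.entries[i] = None

    def store_conditional_matches(self, address, thread_id):
        # Success requires both the address and the thread ID to match.
        return (address, thread_id) in self.entries
```

A thread whose reservation is replaced by the round-robin fallback will see its store-conditional fail and must retry, which is the cost of providing fewer registers than threads.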
Given the low probability of having many locks within a single L2 slice, the structure can be further reduced in size at the risk of a higher livelock probability. For instance, even with only 17 registers per slice, there would still be a total of 272 reservation registers in the entire L2 cache; far more than needed, especially if address scrambling is used to spread the lock addresses around the L2 cache slices sufficiently.
With a reduced number of reservation registers, the thread ID storage could be modified in order to allow sharing and accommodate the more common case of multiple thread IDs per register (since locks are usually shared). One embodiment is to replace the 7-bit thread ID with a 68-bit vector specifying which threads share the reservation. This approach does not mitigate the livelock risk when the number of total registers is exhausted.
Another compression strategy, which may be better in some cases, is to replace the 7-bit thread ID with a 5-bit processor ID (assuming 17 processors) and a 4-bit thread vector (assuming 4 threads per processor). In this case, a single reservation register can be used by all four threads of a processor to share a single lock. With this strategy, seventeen reservation registers would be sufficient to accommodate all 68 threads reserving the same lock address. Similarly, groups of threads using the same lock would be able to utilize the reservation registers more efficiently if they shared a processor (or processors), reducing the probability of livelock. At the cost of some more storage, the processor ID can be replaced by a 4-bit index specifying a particular pair of processors and the thread vector could be extended to 8 bits. As will be obvious to those skilled in the art, there is an entire spectrum of choices between the full vector and the single index.
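The processor ID plus thread vector encoding can be illustrated with a short sketch showing why seventeen registers suffice when all 68 threads reserve the same lock address. The mapping of a global thread number to a processor and a local thread (integer division by four) is an assumption made for the example, as are the function names.

```python
THREADS_PER_PROCESSOR = 4  # from the text: 4 threads per processor


def split_thread_id(global_tid):
    # Assumed numbering: global thread ID = 4 * processor ID + local thread.
    return divmod(global_tid, THREADS_PER_PROCESSOR)


def pack_sharers(global_tids):
    """Return {processor ID: 4-bit thread vector} for one lock address.

    Each dictionary entry stands for one reservation register.
    """
    registers = {}
    for tid in global_tids:
        processor, local = split_thread_id(tid)
        registers[processor] = registers.get(processor, 0) | (1 << local)
    return registers


all_threads = pack_sharers(range(68))
print(len(all_threads))  # number of registers used by 68 sharers
```

Under these assumptions each register tag needs 5 + 4 = 9 bits, compared with 7 bits for a single thread ID or 68 bits for a full sharing vector.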
As an example, one embodiment for the 17-processor multiprocessor is 17 reservation registers per L2 slice, each storing an L1 line address together with a 5-bit core ID and a 4-bit thread vector. This results in bus fanouts of 17.
While the embodiment herein disclosed describes a multiprocessor with the reservation registers implemented in a sliced, shared memory cache, it should be obvious that the invention can be applied to many types of shared memories, including a shared memory with no cache, a sliced shared memory with no cache, and a single, shared memory cache.
The disclosure further relates to managing speculation with respect to cache memory in a multiprocessor system with multiple threads, some of which may execute speculatively.
In a multiprocessor system with generic cores, it becomes easier to design new generations and expand the system. Advantageously, speculation management can be moved downstream from the core and first level cache. In such a case, it is desirable to devise schemes of accessing the first level cache without explicitly keeping track of speculation.
There may be more than one mode of keeping the first level cache speculation blind. Advantageously, the system will have a mechanism for switching between such modes.
One such mode is to evict writes from the first level cache, while writing through to a downstream cache. The embodiments described herein show this first level cache as being the physically first in a data path from a core processor; however, the mechanisms disclosed here might be applied to other situations. The terms “first” and “second,” when applied to the claims herein, are for convenience of drafting only and are not intended to be limiting to the case of L1 and L2 caches.
As described herein, the use of the letter “B”—other than as part of a figure number—represents a Byte quantity, while “GB” represents Gigabyte quantities. Throughout this disclosure a particular embodiment of a multi-processor system will be discussed. This discussion includes various numerical values for numbers of components, bandwidths of interfaces, memory sizes and the like. These numerical values are not intended to be limiting, but only examples. One of ordinary skill in the art might devise other examples as a matter of design choice.
The term “thread” is used herein. A thread can be either a hardware thread or a software thread. A hardware thread within a core processor includes a set of registers and logic for executing a software thread. The software thread is a segment of computer program code. Within a core, a hardware thread will have a thread number. For instance, in the A2, there are four threads, numbered zero through three. Throughout a multiprocessor system, such as the nodechip 50 of
These threads can be the subject of “speculative execution,” meaning that a thread or threads can be started as a sort of wager or gamble, without knowledge of whether the thread can complete successfully. A given thread cannot complete successfully if some other thread modifies the data that the given thread is using in such a way as to invalidate the given thread's results. The terms “speculative,” “speculatively,” “execute,” and “execution” are terms of art in this context. These terms do not imply that any mental step or manual operation is occurring. All operations or steps described herein are to be understood as occurring in an automated fashion under control of computer hardware or software.
If speculation fails, the results must be invalidated and the thread must be re-run or some other workaround found.
Three modes of speculative execution are to be supported: Speculative Execution (SE) (also referred to as Thread Level Speculation (“TLS”)), Transactional Memory (“TM”), and Rollback.
SE is used to parallelize programs that have been written as sequential programs. When the programmer writes such a sequential program, she may insert commands to delimit sections to be executed concurrently. The compiler can recognize these sections and attempt to run them speculatively in parallel, detecting and correcting violations of sequential semantics.
When referring to threads in the context of Speculative Execution, the terms older/younger or earlier/later refer to their relative program order (not the time they actually run on the hardware).
In Speculative Execution, successive sections of sequential code are assigned to hardware threads to run simultaneously. Each thread has the illusion of performing its task in program order. It sees its own writes and writes that occurred earlier in the program. It does not see writes that take place later in program order even if (because of the concurrent execution) these writes have actually taken place earlier in time.
To sustain the illusion, the L2 gives threads private storage as needed, accessible by software thread ID. It lets threads read their own writes and writes from threads earlier in program order, but isolates their reads from threads later in program order. Thus, the L2 might have several different data values for a single address. Each occupies an L2 way, and the L2 directory records, in addition to the usual directory information, a history of which thread IDs are associated with reads and writes of a line. A speculative write is not to be written out to main memory.
One situation that will break the program-order illusion is if a thread earlier in program order writes to an address that a thread later in program order has already read. The later thread should have read that data, but did not. The solution is to kill the later software thread and invalidate all the lines it has written in L2, and to repeat this for all younger threads. On the other hand, without such interference a thread can complete successfully, and its writes can move to external main memory when the line is cast out or flushed.
Not all threads need to be speculative. The running thread earliest in program order can be non-speculative and run conventionally; in particular its writes can go to external main memory. The threads later in program order are speculative and are subject to be killed. When the non-speculative thread completes, the next-oldest thread can be committed and it then starts to run non-speculatively.
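The program-order illusion described above can be sketched in software. The following model is a simplification under stated assumptions: tasks are identified by an integer giving their position in program order, the class and method names are hypothetical, and conflict detection is conservative in that it does not check whether an intervening task had already overwritten the address.

```python
class SpeculativeMemory:
    """Hypothetical model of per-task versions kept in program order."""

    def __init__(self, initial=None):
        self.committed = dict(initial or {})  # non-speculative state
        self.writes = {}                      # order -> {address: value}
        self.reads = {}                       # order -> set of addresses read
        self.killed = set()

    def read(self, order, address):
        self.reads.setdefault(order, set()).add(address)
        # A task sees its own writes and those of earlier tasks, newest first.
        for o in sorted((o for o in self.writes if o <= order), reverse=True):
            if address in self.writes[o]:
                return self.writes[o][address]
        return self.committed.get(address)

    def write(self, order, address, value):
        self.writes.setdefault(order, {})[address] = value
        # Conflict: a later task has already read this address.
        victims = [o for o, seen in self.reads.items()
                   if o > order and address in seen]
        if victims:
            self.kill_from(min(victims))

    def kill_from(self, order):
        # Kill the violating task and all younger tasks, dropping their state.
        for o in set(self.writes) | set(self.reads):
            if o >= order:
                self.killed.add(o)
                self.writes.pop(o, None)
                self.reads.pop(o, None)
```

In this sketch a killed task simply loses its recorded reads and writes; re-running the task is left to the caller.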
The following sections describe the implementation of the speculation model in the context of addressing.
When a sequential program is decomposed into speculative tasks, the memory subsystem needs to be able to associate all memory requests with the corresponding task. This is done by assigning a unique ID at the start of a speculative task to the thread executing the task and attaching the ID as tag to all its requests sent to the memory subsystem.
As the number of dynamic tasks can be very large, it may not be practical to guarantee uniqueness of IDs across the entire program run. It is sufficient to guarantee uniqueness for all IDs concurrently present in the memory system. More about the use of speculation ID's, including how they are allocated, committed, and invalidated, appears in the incorporated applications.
Transactions as defined for TM occur in response to a specific programmer request within a parallel program. Generally the programmer will put instructions in a program delimiting sections in which TM is desired. This may be done by marking the sections as requiring atomic execution. According to the PowerPC architecture: “An access is single-copy atomic, or simply “atomic”, if it is always performed in its entirety with no visible fragmentation.”
To enable a TM runtime system to use the TM supporting hardware, it needs to allocate a fraction of the hardware resources, particularly the speculation IDs that allow hardware to distinguish concurrently executed transactions, from the kernel (operating system), which acts as a manager of the hardware resources. The kernel configures the hardware to group IDs into sets called domains, configures each domain for its intended use, TLS, TM or Rollback, and assigns the domains to runtime system instances.
At the start of each transaction, the runtime system executes a function that allocates an ID from its domain, and programs it into a register that starts marking memory access as to be treated as speculative, i.e., revocable if necessary.
When the transaction section ends, the program will make another call that ultimately signals the hardware to do conflict checking and reporting. Based on the outcome of the check, all speculative accesses of the preceding section can be made permanent or removed from the system.
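The life cycle of a speculation ID within a domain, as outlined above, can be sketched as follows. The class name, method names, and return strings are illustrative assumptions; the actual allocation, commit, and invalidation mechanisms are described in the incorporated applications.

```python
class SpeculationDomain:
    """Hypothetical model of a domain of speculation IDs assigned by the kernel."""

    def __init__(self, ids, mode):
        self.mode = mode          # intended use, e.g. "TLS", "TM" or "Rollback"
        self.free = list(ids)     # IDs not currently present in the memory system
        self.active = set()       # IDs currently tagging memory accesses

    def begin(self):
        # Start of a transaction: allocate an ID from the domain.
        if not self.free:
            raise RuntimeError("no speculation ID available in this domain")
        spec_id = self.free.pop(0)
        self.active.add(spec_id)
        return spec_id

    def end(self, spec_id, conflict_detected):
        # End of the section: the outcome of the conflict check decides whether
        # the speculative accesses are made permanent or removed.
        self.active.remove(spec_id)
        self.free.append(spec_id)  # the ID may be reused once it is retired
        return "reverted" if conflict_detected else "committed"
```

Because an ID returns to the free list only after its transaction ends, uniqueness holds for all IDs concurrently in use, even though IDs repeat over the whole program run.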
The PowerPC architecture defines an instruction pair known as larx/stcx. This instruction type can be viewed as a special case of TM. The larx/stcx pair will delimit a memory access request to a single address and set up a program section that ends with a request to check whether the instruction pair accessed the memory location without interfering access from another thread. If an access interfered, the memory modifying component of the pair is nullified and the thread is notified of the conflict. More about a special implementation of larx/stcx instructions using reservation registers is to be found in co-pending application Ser. No. 12/697,799 filed Jan. 29, 2010, which is incorporated herein by reference. This special implementation uses an alternative approach to TM to implement these instructions. In any case, TM is a broader concept than larx/stcx. A TM section can delimit multiple loads and stores to multiple memory locations in any sequence, requesting a check on their success or failure and a reversal of their effects upon failure.
Rollback occurs in response to “soft errors”, temporary changes in state of a logic circuit. Normally these errors occur in response to cosmic rays or alpha particles from solder balls. The memory changes caused by a program section executed speculatively in rollback mode can be reverted and the core can, after a register state restore, replay the failed section.
Referring now to
More particularly, the basic nodechip 50 of the multiprocessor system illustrated in
The 17th core is configurable to carry out system tasks, such as
In other words, it offloads all the administrative tasks from the other cores to reduce the context switching overhead for these.
In one embodiment, there is provided 32 MB of shared L2 cache 70, accessible via crossbar switch 60. There is further provided external Double Data Rate Synchronous Dynamic Random Access Memory (“DDR SDRAM”) 80, as a lower level in the memory hierarchy in communication with the L2. Herein, “low” and “high” with respect to memory will be taken to refer to a data flow from a processor to a main memory, with the processor being upstream or “high” and the main memory being downstream or “low.”
Each FPU 53 associated with a core 52 has a data path to the L1-cache 55 of the core, allowing it to load or store from or into the L1-cache 55. The terms “L1” and “L1D” will both be used herein to refer to the L1 data cache.
Each core 52 is directly connected to a supplementary processing agglomeration 58, which includes a private prefetch unit. For convenience, this agglomeration 58 will be referred to herein as “L1P” (meaning level 1 prefetch) or “prefetch unit;” but many additional functions are lumped together in this so-called prefetch unit, such as write combining. These additional functions could be illustrated as separate modules, but as a matter of drawing and nomenclature convenience the additional functions and the prefetch unit will be grouped together. This is a matter of drawing organization, not of substance. Some of the additional processing power of this L1P group is shown in
Chip I/O functionality is provided by implementing a direct memory access (“DMA”) engine referred to herein as a Messaging Unit (“MU”), such as MU 100, with each MU including a DMA engine and Network Card interface in communication with the XBAR switch. In one embodiment, the compute node further includes: intra-rack interprocessor links 90 which may be configurable as a 5-D torus; and, one I/O link 92 interfaced with the MU. The system node employs or is associated and interfaced with an 8-16 GB memory/node, also referred to herein as “main memory.”
The term “multiprocessor system” is used herein. With respect to the present embodiment this term can refer to a nodechip or it can refer to a plurality of nodechips linked together. In the present embodiment, however, the management of speculation is conducted independently for each nodechip. This might not be true for other embodiments, without taking those embodiments outside the scope of the claims.
The compute nodechip implements a direct memory access engine DMA to offload the network interface. It transfers blocks via three switch master ports between the L2-cache slices 70 (
The L2 has ports, for instance a 256b wide read data port, a 128b wide write data port, and a request port. Ports may be shared by all processors through the crossbar switch 60.
In this embodiment, the L2 Cache units provide the bulk of the memory system caching on the BQC chip. Main memory may be accessed through two on-chip DDR-3 SDRAM memory controllers 78, each of which services eight L2 slices.
The L2 slices may operate as set-associative caches while also supporting additional functions, such as memory speculation for Speculative Execution (SE), which includes different modes such as: Thread Level Speculation (“TLS”), Transactional Memory (“TM”) and local memory rollback, as well as atomic memory transactions.
The L2 serves as the point of coherence for all processors. This function includes generating L1 invalidations when necessary. Because the L2 cache is inclusive of the L1s, it can remember which processors could possibly have a valid copy of every line, and slices can multicast selective invalidations to such processors.
Address scrambling per
The L2 stores data in 128 B wide lines, and each of these lines is located in a single L2-slice and is referenced there via a single directory entry. As a consequence, the address bits 29 to 35 only reference parts of an L2 line and do not participate in L2 slice or set selection.
To evenly distribute accesses across L2-slices for sequential lines as well as larger strides, the remaining address bits 0-28 are hashed to determine the target slice. To allow flexible configurations, individual address bits can be selected to determine the slice as well as an XOR hash on an address can be used: The following hashing is used at 40242 in the present embodiment:
For each of the slices, 25 address bits are a sufficient reference to distinguish L2 cache lines mapped to that slice.
Each L2 slice holds 2 MB of data or 16K cache lines. At 16-way associativity, the slice has to provide 1024 sets, addressed via 10 address bits. The different ways are used to store different addresses mapping to the same set as well as for speculative results associated with different threads or combinations of threads.
Again, even distribution across set indices for unit and non-unit strides is achieved via hashing, to wit:
To uniquely identify a line within the set, using a(0 to 14) is sufficient as a tag.
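The bit counts quoted above follow from the cache geometry, and the arithmetic can be checked with a few lines of code. The sketch assumes a 36-bit physical address (bits 0 to 35) and sixteen L2 slices, both of which are implied by the surrounding description; the variable names are illustrative. The hash functions themselves are not reproduced here.

```python
LINE_BYTES = 128                 # L2 line size
SLICE_BYTES = 2 * 1024 * 1024    # 2 MB of data per slice
WAYS = 16                        # associativity
NUM_SLICES = 16                  # assumed: sixteen L2 slices
ADDRESS_BITS = 36                # assumed: address bits 0 to 35

offset_bits = (LINE_BYTES - 1).bit_length()          # bits 29 to 35
lines_per_slice = SLICE_BYTES // LINE_BYTES          # 16K cache lines
sets_per_slice = lines_per_slice // WAYS             # 1024 sets
set_bits = (sets_per_slice - 1).bit_length()         # set index width
slice_bits = (NUM_SLICES - 1).bit_length()           # slice select width
line_address_bits = ADDRESS_BITS - offset_bits       # bits 0 to 28
per_slice_bits = line_address_bits - slice_bits      # reference within a slice
tag_bits = per_slice_bits - set_bits                 # a(0 to 14)

print(offset_bits, lines_per_slice, sets_per_slice, set_bits,
      line_address_bits, per_slice_bits, tag_bits)
```

The results agree with the figures given in the text: 7 offset bits, 16K lines and 1024 sets per slice, 10 set index bits, 29 line address bits, 25 bits per slice, and a 15-bit tag.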
Thereafter, the switch provides addressing to the L2 slice in accordance with an address that includes the set and way and offset within a line, as shown in
Long and Short Running Speculation
The L2 accommodates two types of L1 cache management in response to speculative threads. One is for long running speculation and the other is for short running speculation. The differences between the mode support for long and short running speculation are described in the following two subsections.
For long running transactions mode, the L1 cache needs to be invalidated to make all first accesses to a memory location visible to the L2 as an L1-load-miss. A thread can still cache all data in its L1 and serve subsequent loads from the L1 without notifying the L2 for these. This mode will use address aliasing as shown in
To reduce overhead in short running speculation mode, the embodiment herein eliminates the requirement to invalidate L1. The invalidation of the L1 allowed tracking of all read locations by guaranteeing at least one L1 miss per accessed cache line. For small transactions, the equivalent is achieved by making all load addresses within the transaction visible to the L2, regardless of L1 hit or miss, i.e. by operating the L1 in “read/write through” mode. In addition, data modified by a speculative thread is in this mode evicted from the L1 cache, serving all loads of speculatively modified data from L2 directly. In this case, the L1 does not have to use a four piece mock space as shown in
In the case of switching between memory access modes here, a register 41312 at the entry of the L1P receives an address field from the processor 40052, as if the processor 40052 were requesting a main memory access, i.e., a memory mapped input/output operation (MMIO). The L1P diverts a bit called ID_evict 41313 from the register and forwards it both back to the processor 40052 and also to control the L1 caches.
A special purpose register SPR 41315 also takes some data from the path 41311, which is then AND-ed at 41314 to create a signal that informs the L1D 41306, i.e., the data cache, whether write on evict is to be enabled. The instruction cache, L1I 41312, is not involved.
At 41403, it is determined whether current memory access is responsive to a store by a speculative thread. If so, there will be a write through from L1 to L2 at 41404, but the line will be deleted from the L1 at 41405.
If access is not a store by a speculative thread, there is a test as to whether the access is a load at 41406. If so, the system must determine at 41407 whether there is a hit in the L1. If so, data is served from L1 at 41408 and L2 is notified of the use of the data at 41409.
If there is not a hit, then data must be fetched from L2 at 41410. If L2 has a speculative version per 41411, the data should not be inserted into L1 per 41412. If L2 does not have a speculative version, then the data can be inserted into L1 per 41413.
If the access is not a load, then the system must test whether speculation is finished at 41414. If so, the speculative status should be removed from L2 at 41415.
If speculation is not finished, and none of the other conditions are met, then default memory access behavior occurs at 41416.
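The decision flow of steps 41403 through 41416 can be summarized as a single function. This is a software paraphrase for illustration only; the function name, argument names, and action strings are assumptions, and each returned string simply labels the corresponding step in the description above.

```python
def l1_access_actions(kind, speculative=False, l1_hit=False,
                      l2_has_speculative_version=False):
    """Return the actions taken for one memory access in evict-on-write mode.

    kind is "store", "load", "speculation_finished", or anything else.
    """
    if kind == "store" and speculative:                      # 41403
        return ["write_through_to_l2",                       # 41404
                "delete_line_from_l1"]                       # 41405
    if kind == "load":                                       # 41406
        if l1_hit:                                           # 41407
            return ["serve_from_l1",                         # 41408
                    "notify_l2_of_use"]                      # 41409
        if l2_has_speculative_version:                       # 41411
            return ["fetch_from_l2",                         # 41410
                    "do_not_insert_into_l1"]                 # 41412
        return ["fetch_from_l2",                             # 41410
                "insert_into_l1"]                            # 41413
    if kind == "speculation_finished":                       # 41414
        return ["remove_speculative_status_from_l2"]         # 41415
    return ["default_memory_access"]                         # 41416
```

For example, a speculative store yields a write-through followed by deletion of the line from L1, while a non-speculative store falls through to the default behavior.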
A programmer will have to determine whether or not to activate evict on write in response to application specific programming considerations. For instance, if data is to be used frequently, the addressing mechanism of
If many small sections of code without frequent data accesses are to be executed in parallel, the mechanism of short running speculation will likely be advantageous.
L1/L1P Hit Race Condition
In case of a hit in L1P or L1 for TM at 41001, a notification for this address is sent to L2 41002, flagging the line as speculatively accessed. If a write from another core at 41003 to that address reaches the L2 before the L1/L1P hit notification, and the invalidate request caused by that write has not reached the L1 or L1P before the L1/L1P hit, the core could have used stale data while flagging new data as read in the L2. The L2 sees the L1/L1P hit arriving after the write at 41004 and cannot deduce directly from the ordering whether a race occurred. However, in this case a use notification arrives at the L2 with the coherence bits of the L2 denoting that the core did not have a valid copy of the line, thus indicating a potential violation. To retain functional correctness, the L2 invalidates the affected speculation ID in this case at 41005.
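The race check performed by the L2 can be sketched as follows. The data structure and function names are hypothetical, and the coherence bits are modeled simply as the set of cores that may hold a valid copy of the line.

```python
class L2Line:
    """Hypothetical per-line state kept by the L2 directory."""

    def __init__(self, cores_with_valid_copy=()):
        self.cores_with_valid_copy = set(cores_with_valid_copy)  # coherence bits
        self.speculative_readers = set()                         # speculation IDs


def write_from_core(line, writer_core):
    # The write triggers invalidations, so every other core loses its copy.
    line.cores_with_valid_copy &= {writer_core}


def hit_notification(line, core, spec_id, invalidated_ids):
    # A use notification from a core that, per the coherence bits, holds no
    # valid copy indicates a potential race: invalidate the speculation ID.
    if core not in line.cores_with_valid_copy:
        invalidated_ids.add(spec_id)
        return False
    line.speculative_readers.add(spec_id)
    return True
```

If the hit notification arrives before the conflicting write, the core is still marked as holding a valid copy and the line is flagged as speculatively read in the normal way.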
Coherence
A thread starting a long-running speculation always begins with an invalidated L1, so it will not retain stale data from a previous thread's execution. Within a speculative domain, L1 invalidations become unnecessary in some cases:
A thread using short running speculation evicts the line it writes to from its L1 due to the proposed evict on speculative write. This line is evicted from other L1 caches as well based on the usual coherence rules. Starting from this point on, until the speculation is deemed either to be successful or its changes have been reverted, L1 misses for this line will be served from the L2 without entering the L1 and therefore no incoherent L1 copy can occur.
Between speculative domains, the usual multiprocessor coherence rules apply. To support speculation, the L2 routinely records thread IDs associated with reads; on a write, the L2 sends invalidations to all processors outside the domain that are marked as having read that address.
Access Size Signaling from the L1/L1p to the L2
Memory write access footprints are always precisely delivered to L2, as both the L1 and the L1P operate in write-through mode.
For reads, the data requested from the L2 does not always match its actual use by a thread inside the core. However, both the L1 and the L1P provide methods to separate the actual use of the data from the amount of data requested from the L2.
The L1 can be configured such that it provides on a read miss not only the 64 B line that it is requesting to be delivered, but also the section inside the line that is actually requested by the load instruction triggering the miss. It can also send requests to the L1P for each L1 hit that indicate which section of the line is actually read on each hit. This capability is activated and used for short running speculation. In long running speculation, L1 load hits are not reported and the L2 has to assume that the entire 64 B section requested has been actually used by the requesting thread.
The L1P can be configured independently from that to separate L1P prefetch requests from actual L1P data use (L1P hits). If activated, L1P prefetches only return data and do not add IDs to speculative reader sets. L1P read hits return data to the core immediately and send to the L2 a request that informs the L2 about the actual use of the thread.
This disclosure arose in the course of development of a new generation of the IBM® Blue Gene® system. This new generation included several concepts, such as managing speculation in the L2 cache, improving energy efficiency, and using generic cores that conform to the PowerPC architecture usable in other systems such as PCs; however, the invention need not be limited to this context.
An addressing scheme can allow generic cores to be used for a new generation of parallel processing system, thus reducing research, development and production costs. Also, creating a system in which prefetch units and L1D caches are shared by hardware threads within a core is energy and floor plan efficient.
This address space will have at least four pieces, 40401, 40402, 40403, and 40404, because the embodiment of the core has four hardware threads. If the core had a different number of hardware threads, there could be a different number of pieces of the address space of the L1P. This address space allows each hardware thread to act as if it is running independently of every other thread and has an entire main memory to itself. The hardware thread number indicates to the L1P which of the pieces is to be accessed.
When a line has been established by a speculative thread or a transaction, the rules for enforcing consistency change. When running purely non-speculatively, only write accesses change the memory state; in the absence of writes the memory state can be safely assumed to be constant. When a speculatively running thread commits, the memory state as observed by other threads may also change. The memory subsystem does not have the set of memory locations that have been altered by the speculative thread instantly available at the time of commit, thus consistency has to be ensured by means other than sending invalidates for each affected address. This can be accomplished by taking appropriate action when memory writes occur.
The inventor here has discovered that, surprisingly, given the extraordinary size of this type of supercomputer system, the caches, originally sources of efficiency and power reduction, have become significant power consumers, so that they themselves must be scrutinized to see how they can be improved.
The architecture of the current version of IBM® Blue Gene® supercomputer includes coordinating speculative execution at the level of the L2 cache, with results of speculative execution being stored by hashing a physical main memory address to a specific cache set—and using a software thread identification number along with upper address bits to direct memory accesses to corresponding ways of the set. The directory lookup for the cache becomes the conflict checking mechanism for speculative execution.
In a cache that has 16 ways, each memory access request for a given cache line, requires searching all 16 ways of the selected set along with elaborate conflict checking. When multiplied by the thousands of caches in the system, these lookups become energy inefficient—especially in the case where several sequential, or nearly sequential, lookups access the same line.
Thus the new generation of supercomputer gave rise to an environment where directory lookup becomes a significant component of the energy efficiency of the system. Accordingly, it would be desirable to save results of lookups in case they are needed by subsequent memory access requests.
The following document relates to write piggybacking in the context of DRAM controllers:
It would be desirable to reduce directory SRAM accesses to reduce power and increase throughput in accordance with one or both of the following methods:
These methods are especially effective if the memory access request generating unit can provide a hint whether this location might be accessed soon or if the access request type implies that other cores will access this location soon, e.g., atomic operation requests for barriers.
Throughout this disclosure a particular embodiment of a multi-processor system will be discussed. This discussion may include various numerical values. These numerical values are not intended to be limiting, but only examples. One of ordinary skill in the art might devise other examples as a matter of design choice.
The present invention arose in the context of the IBM® Blue Gene® project, which is further described in the applications incorporated by reference above.
Coherence tracking unit 4301 issues invalidations, when necessary. These invalidations are issued centrally, while in the prior generation of the Blue Gene® project, invalidations were achieved by snooping.
The request queue 4302 buffers incoming read and write requests. In this embodiment, it is 16 entries deep, though other request buffers might have more or fewer entries. The addresses of incoming requests are matched against all pending requests to determine ordering restrictions. The queue presents the requests to the directory pipeline 4308 based on ordering requirements.
The write data buffer 4303 stores data associated with write requests. This buffer passes the data to the eDRAM pipeline 4305 in case of a write hit or after a write miss resolution.
The directory pipeline 4308 accepts requests from the request queue 4302, retrieves the corresponding directory set from the directory SRAM 4309, matches and updates the tag information, writes the data back to the SRAM and signals the outcome of the request (hit, miss, conflict detected, etc.).
The L2 implements four parallel eDRAM pipelines 4305 that operate independently. They may be referred to as eDRAM bank 0 to eDRAM bank 3. The eDRAM pipeline controls the eDRAM access and the dataflow from and to this macro. When only subcomponents of a doubleword are written, or for load-and-increment or store-add operations, it is responsible for scheduling the necessary read-modify-write (RMW) cycles and providing the dataflow for insertion and increment.
The read return buffer 4304 buffers read data from eDRAM or the memory controller 78 and is responsible for scheduling the data return using the switch 60. In this embodiment it has a 32 B wide data interface to the switch. It is used only as a staging buffer to compensate for backpressure from the switch. It is not serving as a cache.
The miss handler 4307 takes over processing of misses determined by the directory. It provides the interface to the DRAM controller and implements a data buffer for write and read return data from the memory controller.
The reservation table 4306 registers and invalidates reservation requests.
In the current embodiment of the multi-processor, the bus between the L1 and the L2 is narrower than the cache line width by a factor of 8. Therefore each write of an entire L2 line, for instance, will require 8 separate transmissions to the L2 and therefore 8 separate lookups. Since there are 16 ways, that means a total of 128 way data retrievals and matches. Each lookup potentially involves all of the conflict checking that was just discussed, which can be very energy-consuming and resource intensive.
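The arithmetic behind the figure of 128 can be written out as a short sketch. The constant and function names below are illustrative only and are not taken from the embodiment.

```python
# Values quoted in the embodiment described above; the names are hypothetical.
BUS_NARROWING_FACTOR = 8   # the L1-to-L2 bus is 1/8 of the cache line width
L2_WAYS = 16               # ways examined by each directory lookup


def way_matches_per_line_write(transfers=BUS_NARROWING_FACTOR, ways=L2_WAYS):
    """Each transfer triggers one lookup, and each lookup examines every way."""
    return transfers * ways


total_way_matches = way_matches_per_line_write()
print(total_way_matches)
```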
Therefore it can be anticipated that—at least in this case—an access will need to be retained. A prefetch unit can annotate its request indicating that it is going to access the same line again to inform the L2 slice of this anticipated requirement.
Certain instruction types, such as atomic operations for barriers, might result in an ability to anticipate sequential memory access requests using the same data.
One way of retaining a lookup would be to have a special purpose register in the L2 slice that would retain an identification of the way in which the requested address was found. Alternatively, more registers might be used if it were desired to retain more accesses.
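A minimal software sketch of such a retained-lookup register follows. It models a single directory set only, omits conflict checking, and counts way comparisons as a stand-in for energy. The class name, the method names, and the `retain_hint` flag are assumptions made for illustration, not the hardware interface.

```python
class DirectorySetModel:
    """Toy model of one 16-way directory set with one retained-lookup register."""

    def __init__(self, ways=16):
        self.tags = [None] * ways   # tag held in each way of the set
        self.retained = None        # (tag, way) remembered from a previous lookup
        self.way_compares = 0       # number of way tag comparisons performed

    def install(self, way, tag):
        """Place a tag in a way; drop the retained entry if it pointed there."""
        self.tags[way] = tag
        if self.retained is not None and self.retained[1] == way:
            self.retained = None

    def lookup(self, tag, retain_hint=False):
        """Return the matching way, or None on a miss."""
        # Fast path: the register already identifies the way for this tag.
        if self.retained is not None and self.retained[0] == tag:
            return self.retained[1]
        # Slow path: a full lookup examines every way of the set.
        hit = None
        for way, stored in enumerate(self.tags):
            self.way_compares += 1
            if stored == tag:
                hit = way
        # Retain the result only when the requester hinted at a repeat access.
        if hit is not None and retain_hint:
            self.retained = (tag, hit)
        return hit


# Eight transfers of one line, as in the 8-beat write discussed above.
with_hint = DirectorySetModel()
with_hint.install(5, 0xABC)
ways_found = [with_hint.lookup(0xABC, retain_hint=True) for _ in range(8)]

without_hint = DirectorySetModel()
without_hint.install(5, 0xABC)
for _ in range(8):
    without_hint.lookup(0xABC)
```

In this model the hinted sequence performs one full lookup (16 way comparisons) and the unhinted sequence performs eight (128 way comparisons).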
Another embodiment for retaining a lookup would be to actually retain data associated with a previous lookup to be used again.
An example of the former embodiment of retaining lookup information is shown in
A traditional store-operate instruction reads from, modifies, and writes to a memory location as an atomic operation. The atomic property allows the store-operate instruction to be used as a synchronization primitive across multiple threads. For example, the store-and instruction atomically reads data in a memory location, performs a bitwise logical-and operation of data (i.e., data described with the store-and instruction) and the read data, and writes the result of the logical-and operation into the memory location. The term store-operate instruction also includes the fetch-and-operate instruction (i.e., an instruction that returns a data value from a memory location and then modifies the data value in the memory location). An example of a traditional fetch-and-operate instruction is the fetch-and-increment instruction (i.e., an instruction that returns a data value from a memory location and then increments the value at that location).
In a multi-threaded environment, the use of store-operate instructions may improve application performance (e.g., better throughput, etc.). Because atomic operations are performed within a memory unit, the memory unit can satisfy a very high rate of store-operate instructions, even if the instructions are to a single memory location. For example, a memory system of IBM® Blue Gene®/Q computer can perform a store-operate instruction every 4 processor cycles. Since a store-operate instruction modifies the data value at a memory location, it traditionally invokes a memory coherence operation to other memory devices. For example, on the IBM® Blue Gene®/Q computer, a store-operate instruction can invoke a memory coherence operation on up to 15 level-1 (L1) caches (i.e., local caches). A high rate (e.g., every 4 processor cycles) of traditional store-operate instructions thus causes a high rate (e.g., every 4 processor cycles) of memory coherence operations which can significantly occupy computer resources and thus reduce application performance.
The present disclosure further describes a method, system and computer program product for performing various store-operate instructions in a parallel computing system that reduces the number of cache coherence operations and thus increases application performance.
In one embodiment, there are provided various store-operate instructions available to a computing device to reduce the number of memory coherence operations in a parallel computing environment that includes a plurality of processors, at least one cache memory and at least one main memory. These various provided store-operate instructions are variations of a traditional store-operate instruction that atomically modify the data (e.g., bytes, bits, etc.) at a (cache or main) memory location. These various provided store-operate instructions include, but are not limited to: StoreOperateCoherenceOnValue instruction, StoreOperateCoherenceThroughZero instruction and StoreOperateCoherenceOnPredecessor instruction. In one embodiment, the term store-operate instruction(s) also includes the fetch-and-operate instruction(s). These various provided fetch-and-operate instructions thus also include, but are not limited to: FetchAndOperateCoherenceOnValue instruction, FetchAndOperateCoherenceThroughZero instruction and FetchAndOperateCoherenceOnPredecessor instruction.
In one aspect, a StoreOperateCoherenceOnValue instruction is provided that improves application performance in a parallel computing environment (e.g., IBM® Blue Gene® computing devices L/P, etc. such as described in herein incorporated U.S. Provisional Application Ser. No. 61/295,669), by reducing the number of cache coherence operations invoked by a functional unit (e.g., a functional unit 35120 in
The FetchAndOperateCoherenceOnValue instruction invokes a cache coherence operation only when a result of the fetch-and-operate instruction is a particular value or set of values. The particular value may be given by the instruction issued from a processor in the parallel computing environment. The FetchAndOperateCoherenceThroughZero instruction invokes a cache coherence operation only when data (e.g., a numerical value) in a (cache or main) memory location described in the fetch-and-operate instruction changes from a positive value to a negative value, or vice versa. The FetchAndOperateCoherenceOnPredecessor instruction invokes a cache coherence operation only when the result of a fetch-and-operate instruction (i.e., the read data value in a memory location described in the fetch-and-operate instruction) is equal to particular data (e.g., a particular numerical value) stored in a preceding memory location of a logical memory address described in the fetch-and-operate instruction.
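The three conditions can be summarized as simple predicates. This is only a sketch that follows the wording above literally; the function names are hypothetical, and the treatment of a result of exactly zero in the through-zero case is an assumption, since the text mentions only a change of sign.

```python
def coherence_on_value(result, trigger_values):
    """Invoke coherence only when the result is one of the specified values."""
    return result in trigger_values


def coherence_through_zero(old_value, new_value):
    """Invoke coherence only when the value changes sign (assumed: zero itself does not count)."""
    return (old_value > 0 and new_value < 0) or (old_value < 0 and new_value > 0)


def coherence_on_predecessor(result, predecessor_value):
    """Invoke coherence only when the result equals the value in the preceding location."""
    return result == predecessor_value
```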
A processor N (35145) includes a local cache memory device 35175. In one embodiment, the term processor may also refer to a DMA engine or a network adaptor 35155 or similar equivalent units or devices. One or more of these processors may issue load or store instructions. These load or store instructions are transferred from the issuing processors, e.g., through a cross bar switch 35110, to an instruction queue 35115 in a memory or cache unit 35105. A functional unit (FU) 35120 fetches these instructions from the instruction queue 35115, and runs these instructions. To run one or more of these instructions, the FU 35120 may retrieve data stored in a cache memory 35125 or in a main memory (not shown) via a main memory controller 35130. Upon completing the running of the instructions, the FU 35120 may transfer outputs of the run instructions to the issuing processor or network adaptor via the network 35110 and/or store outputs in the cache memory 35125 or in the main memory (not shown) via the main memory controller 35130. The main memory controller 35130 is a traditional memory controller that manages data flow between the main memory device and other components (e.g., the cache memory device 35125, etc.) in the parallel computing environment 35100.
In one embodiment, the instruction 35240 specifies at least one condition under which a cache coherence operation is invoked. For example, the condition may specify a particular value, e.g., zero.
Upon fetching the instruction 35240 from the instruction queue 35115, the FU 35120 evaluates 35200 whether the instruction 35240 is a load instruction, e.g., by checking whether the Opcode 35505 of the instruction 35240 indicates that the instruction 35240 is a load instruction. If the instruction 35240 is a load instruction, the FU 35120 reads 35220 data stored in a (cache or main) memory location corresponding to the logical address 35510 of the instruction 35240, and uses the crossbar 35110 to return the data to the issuing processor. Otherwise, the FU 35120 evaluates 35205 whether the instruction 35240 is a store instruction, e.g., by checking whether the Opcode 35505 of the instruction 35240 indicates that the instruction 35240 is a store instruction. If the instruction 35240 is a store instruction, the FU 35120 transfers 35225 the operand value 35515 of the instruction 35240 to a (cache or main) memory location corresponding to the logical address 35510 of the instruction 35240. Because a store instruction changes the value at a memory location, the FU 35120 invokes 35225, e.g. via cross bar 35110, a cache coherence operation on other memory devices such as L1 caches 35165-35175 in processors 35135-35145. Otherwise, the FU 35120 evaluates 35210 whether the instruction 35240 is a store-operate or fetch-and-operate instruction, e.g., by checking whether the Opcode 35505 of the instruction 35240 indicates that the instruction 35240 is a store-operate or fetch-and-operate instruction.
If the instruction 35240 is a store-operate instruction, the FU 35120 reads 35230 data stored in a (cache or main) memory location corresponding to the logical address 35510 of the instruction 35240, modifies 35230 the read data with the operand value 35515 of the instruction, and writes 35230 the result of the modification to the (cache or main) memory location corresponding to the logical address 35510 of the instruction. Alternatively, the FU modifies 35230 the read data with data stored in a register (e.g., accumulator) corresponding to the operand value 35515, and writes 35230 the result to the memory location. Because a store-operate instruction changes the value at a memory location, the FU 35120 invokes 35225, e.g. via cross bar 35110, a cache coherence operation on other memory devices such as L1 caches 35165-35175 in processors 35135-35145.
If the instruction 35240 is a fetch-and-operate instruction, the FU 35120 reads 35230 data stored in a (cache or main) memory location corresponding to the logical address 35510 of the instruction 35240 and returns, via the crossbar 35110, the data to the issuing processor. The FU then modifies 35230 the data, e.g., with an operand value 35515 of the instruction 35240, and writes 35230 the result of the modification to the (cache or main) memory location. Alternatively, the FU modifies 35230 the data stored in the (cache or main) memory location, e.g., with data stored in a register (e.g., accumulator) corresponding to the operand value 35515, and writes the result to the memory location. Because a fetch-and-operate instruction changes the value at a memory location, the FU 35120 invokes 35225, e.g. via cross bar 35110, a cache coherence operation on other memory devices such as L1 caches 35165-35175 in processors 35135-35145.
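The handling of the four traditional instruction types described above can be sketched in software as follows. This is an illustrative model only: the class, the opcode strings, and the use of a dictionary as the memory are assumptions, and a counter stands in for the cache coherence operations sent to the L1 caches.

```python
import operator


class FunctionalUnitModel:
    """Toy model of the evaluation order: load, store, store-operate, fetch-and-operate."""

    def __init__(self):
        self.memory = {}         # logical address -> value
        self.coherence_ops = 0   # coherence operations invoked on other caches

    def run(self, opcode, address, operand=None, op=operator.add):
        if opcode == "load":
            # A load returns data and changes nothing, so no coherence is needed.
            return self.memory.get(address, 0)
        if opcode == "store":
            self.memory[address] = operand
            self.coherence_ops += 1
            return None
        if opcode == "store_operate":
            # Read, modify with the operand, write back; nothing is returned.
            self.memory[address] = op(self.memory.get(address, 0), operand)
            self.coherence_ops += 1
            return None
        if opcode == "fetch_and_operate":
            # Return the previous value, then modify the location.
            previous = self.memory.get(address, 0)
            self.memory[address] = op(previous, operand)
            self.coherence_ops += 1
            return previous
        raise ValueError("unknown opcode: " + opcode)


fu = FunctionalUnitModel()
fu.run("store", 0x10, 5)
fu.run("store_operate", 0x10, 3)                 # 5 + 3
fetched = fu.run("fetch_and_operate", 0x10, 1)   # returns 8, stores 9
loaded = fu.run("load", 0x10)
```

In this traditional flow every modifying instruction invokes a coherence operation, which is the cost the conditional variants described below are intended to reduce.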
Otherwise, the FU 35120 evaluates 35215 whether the instruction 35240 is a StoreOperateCoherenceOnValue instruction or FetchAndOperateCoherenceOnValue instruction, e.g., by checking whether the Opcode 35505 of the instruction 35240 indicates that the instruction 35240 is a StoreOperateCoherenceOnValue instruction. If the instruction 35240 is a StoreOperateCoherenceOnValue instruction, the FU 35120 performs operations 35235 which is shown in detail in
If the instruction 35240 is a FetchAndOperateCoherenceOnValue instruction, the FU 35120 performs operations 35235 which is shown in detail in
In one embodiment, the StoreOperateCoherenceOnValue instruction 35240 described above is a StoreAddInvalidateCoherenceOnZero instruction. The value in a memory location at the logical address 35510 is considered to be an integer value. The operand value 35515 is also considered to be an integer value. The StoreAddInvalidateCoherenceOnZero instruction adds the operand value to the previous memory value and stores the result of the addition as a new memory value in the memory location at the logical address 35510. In one embodiment, a network adaptor 35155 may use the StoreAddInvalidateCoherenceOnZero instruction. In this embodiment, the network adaptor 35155 interfaces the parallel computing environment 35100 to a network 35160 which may deliver a message as out-of-order packets. A complete reception of a message can be recognized by initializing a counter to the number of bytes in the message and then having the network adaptor decrement the counter by the number of bytes in each arriving packet. The memory device 35105 is of a size that allows any location in a (cache) memory device to serve as such a counter for each message. Applications on the processors 35135-35145 poll the counter of each message to determine if a message has completely arrived. On reception of each packet, the network adaptor can issue a StoreAddInvalidateCoherenceOnZero instruction 35240 to the memory device 35105. The Opcode 35505 specifies the StoreAddInvalidateCoherenceOnZero instruction. The logical address 35510 is that of the counter. The operand value 35515 is a negative value of the number of received bytes in the packet. In this embodiment, only when the counter reaches the value 0 does the memory device 35105 invoke a cache coherence operation to the level-1 (L1) caches of the processors 35135-35145.
This improves the performance of the application, since the application demands the complete arrival of each message and is uninterested in a message for which all packets have not yet arrived, and the cache coherence operation is invoked only when all packets of the message have arrived at the network adaptor 35155. By contrast, the application performance on the processors 35135-35145 may be decreased if the network adaptor 35155 issues a traditional Store-Add instruction, since each of the processors 35135-35145 would then receive and serve an unnecessary cache coherence operation upon the arrival of each packet.
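The message-counter usage can be illustrated with a small simulation that compares the conditional instruction with a traditional store-add. The class and method names, the message size, and the packet sizes are made up for the example.

```python
class CounterMemoryModel:
    """Toy memory device holding per-message byte counters."""

    def __init__(self):
        self.cells = {}          # logical address -> integer counter
        self.coherence_ops = 0   # coherence operations sent to the L1 caches

    def store_add(self, address, operand):
        """Traditional store-add: every update invokes coherence."""
        self.cells[address] = self.cells.get(address, 0) + operand
        self.coherence_ops += 1

    def store_add_coherence_on_zero(self, address, operand):
        """Conditional variant: coherence only when the new value is 0."""
        self.cells[address] = self.cells.get(address, 0) + operand
        if self.cells[address] == 0:
            self.coherence_ops += 1


MESSAGE_BYTES = 1000
PACKET_BYTES = [256, 512, 232]   # hypothetical packets, in arrival order
COUNTER = 0x100                  # hypothetical address of the message counter

conditional = CounterMemoryModel()
conditional.cells[COUNTER] = MESSAGE_BYTES
traditional = CounterMemoryModel()
traditional.cells[COUNTER] = MESSAGE_BYTES

for size in PACKET_BYTES:
    # The operand is the negative of the number of bytes received in the packet.
    conditional.store_add_coherence_on_zero(COUNTER, -size)
    traditional.store_add(COUNTER, -size)
```

With three packets, the conditional variant invokes one coherence operation (on the final packet), while the traditional store-add invokes three.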
In one embodiment, the FetchAndOperateCoherenceOnValue instruction 35240 described above is a FetchAndDecrementCoherenceOnZero instruction. The value in a memory location at the logical address 35510 is considered to be an integer value. There is no accompanying operand value 35515. The FetchAndDecrementCoherenceOnZero instruction returns the previous value of the memory location and then decrements the value at the memory location. In one embodiment, the processors 35135-35145 may use the FetchAndDecrementCoherenceOnZero instruction to implement a barrier (i.e., a point where all participating threads must arrive, and only then can each thread proceed with its execution). The barrier uses a memory location in the memory device 35105 (e.g., a shared cache memory device) as a counter. The counter is initialized with the number of threads to participate in the barrier. Each thread, upon arrival at the barrier, issues a FetchAndDecrementCoherenceOnZero instruction 35240 to the memory device 35105. The Opcode 35505 specifies the FetchAndDecrementCoherenceOnZero instruction. The memory location of the logical address 35510 stores a value of the counter. The value “1” is returned by the FetchAndDecrementCoherenceOnZero instruction to the last thread arriving at the barrier, the value “0” is stored to the memory location, and a cache coherence operation is invoked. Given this value “1”, the last thread knows all threads have arrived at the barrier and thus the last thread can exit the barrier. For the other, earlier threads arriving at the barrier, the value “1” is not returned by the FetchAndDecrementCoherenceOnZero instruction. So, each of these threads polls the counter for the value 0, indicating that all threads have arrived. Only when the counter reaches the value “0” does the FetchAndDecrementCoherenceOnZero instruction cause the memory device 35105 to invoke a cache coherence operation to the level-1 (L1) caches 35165-35175 of the processors 35135-35145.
This FetchAndDecrementCoherenceOnZero instruction thus helps reduce computer resource usage in a barrier and thus helps improve the application performance. The polling mainly uses the L1-cache (local cache memory device in a processor; local cache memory devices 35165-35175) of each processor 35135-35145. By contrast, the barrier performance may be decreased if the barrier used a traditional Fetch-And-Decrement instruction, since then each of the processors 35135-35145 would then receive and serve an unnecessary cache coherence operation on the arrival of each thread into the barrier and thus would cause polling to communicate more with the memory device 35105 and communicate less with local cache memory devices.
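The barrier protocol can be sketched as a sequential simulation in which each call represents one thread arriving. The class name and the choice of four participants are assumptions for illustration; real threads would issue the instruction concurrently and poll their L1 copies of the counter.

```python
class BarrierCounterModel:
    """Toy shared-cache location used as a barrier counter."""

    def __init__(self, participants):
        self.counter = participants   # initialized to the number of threads
        self.coherence_ops = 0        # coherence operations sent to the L1 caches

    def fetch_and_decrement_coherence_on_zero(self):
        """Return the previous value, decrement, and invoke coherence only at 0."""
        previous = self.counter
        self.counter = previous - 1
        if self.counter == 0:
            self.coherence_ops += 1
        return previous


barrier = BarrierCounterModel(participants=4)
returned = [barrier.fetch_and_decrement_coherence_on_zero() for _ in range(4)]
last_thread_saw = returned[-1]
```

Only the last arriving thread receives the value 1, and only its arrival causes a coherence operation, so the earlier threads can keep polling their local caches undisturbed until then.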
If the instruction 35240 is a FetchAndOperateCoherenceOnPredecessor instruction, the FU 35120 performs operations 35310 which is shown in detail in
If the instruction 35240 is a FetchAndOperateCoherenceThroughZero instruction, the FU 35120 performs operations 35410 which is shown in detail in
In one embodiment, the store-operate operation described in the StoreOperateCoherenceOnValue or StoreOperateCoherenceOnPredecessor or StoreOperateCoherenceThroughZero includes one or more of the following traditional operations that include, but are not limited to: StoreAdd, StoreMin and StoreMax, each with variations for signed integers, unsigned integers or floating point numbers, Bitwise StoreAnd, Bitwise StoreOr, Bitwise StoreXor, etc.
In one embodiment, the Fetch-And-Operate operation described in the FetchAndOperateCoherenceOnValue or FetchAndOperateCoherenceOnPredecessor or FetchAndOperateCoherenceThroughZero includes one or more of the following traditional operations that include, but are not limited to: FetchAndIncrement, FetchAndDecrement, FetchAndClear, etc.
In one embodiment, the width of the memory location operated by the StoreOperateCoherenceOnValue or StoreOperateCoherenceOnPredecessor or StoreOperateCoherenceThroughZero or FetchAndOperateCoherenceOnValue or FetchAndOperateCoherenceOnPredecessor or FetchAndOperateCoherenceThroughZero includes, but is not limited to: 1 byte, 2 byte, 4 byte, 8 byte, 16 byte, and 32 byte, etc.
In one embodiment, the FU 35120 performs the evaluations 35200-35215, 35300 and 35400 sequentially. In another embodiment, the FU 35120 performs the evaluations 35200-35215, 35300 and 35400 concurrently, i.e., in parallel. For example,
In one embodiment, threads or processors concurrently may issue one of these instructions (e.g., StoreOperateCoherenceOnValue instruction, StoreOperateCoherenceThroughZero instruction, StoreOperateCoherenceOnPredecessor instruction, FetchAndOperateCoherenceOnValue instruction, FetchAndOperateCoherenceThroughZero instruction, FetchAndOperateCoherenceOnPredecessor instruction) to a same (cache or main) memory location. Then, the FU 35120 may run these concurrently issued instructions every few processor clock cycles, e.g., in parallel or sequentially. In one embodiment, these instructions (e.g., StoreOperateCoherenceOnValue instruction, StoreOperateCoherenceThroughZero instruction, StoreOperateCoherenceOnPredecessor instruction, FetchAndOperateCoherenceOnValue instruction, FetchAndOperateCoherenceThroughZero instruction, FetchAndOperateCoherenceOnPredecessor instruction) are atomic instructions that atomically implement operations on cache lines.
In one embodiment, the FU 35120 is implemented in hardware or reconfigurable hardware, e.g., FPGA (Field Programmable Gate Array) or CPLD (Complex Programmable Logic Device), e.g., by using a hardware description language (Verilog, VHDL, Handel-C, or System C). In another embodiment, the FU 35120 is implemented in a semiconductor chip, e.g., ASIC (Application-Specific Integrated Circuit), e.g., by using a semi-custom design methodology, i.e., designing a chip using standard cells and a hardware description language.
It would be desirable to allow for multiple modes of speculative execution concurrently in a multiprocessor system.
In one embodiment, a computer method includes carrying out operations in a multiprocessor system. The operations include:
In another embodiment, the operations include
In yet another embodiment, a multiprocessor system includes:
It would be desirable to prevent speculative memory accesses from going to main memory to improve efficiency of a multiprocessor system.
In one embodiment, a method for managing memory accesses in a multiprocessor system includes carrying out operations within the system. The operations include:
In another embodiment, a cache memory for use in a multiprocessor system includes:
Yet another embodiment is a cache control system for use in a multiprocessor system including
The term “thread” is used herein. A thread can be either a hardware thread or a software thread. A hardware thread within a core processor includes a set of registers and logic for executing a software thread. The software thread is a segment of computer program code. Within a core, a hardware thread will have a thread number. For instance, in the A2, there are four threads, numbered zero through three. Throughout a multiprocessor system, such as the nodechip 50 of
These threads can be the subject of “speculative execution,” meaning that a thread or threads can be started as a sort of wager or gamble, without knowledge of whether the thread can complete successfully. A given thread cannot complete successfully if some other thread modifies the data that the given thread is using in such a way as to invalidate the given thread's results. The terms “speculative,” “speculatively,” “execute,” and “execution” are terms of art in this context. These terms do not imply that any mental step or manual operation is occurring. All operations or steps described herein are to be understood as occurring in an automated fashion under control of computer hardware or software.
Speculation Model
This section describes the underlying speculation ID based memory speculation model, focusing on its most complex usage mode, speculative execution (SE), also referred to as thread level speculation (TLS). When referring to threads, the terms older/younger or earlier/later refer to their relative program order (not the time they actually run on the hardware).
Multithreading Model
In Speculative Execution, successive sections of sequential code are assigned to hardware threads to run simultaneously. Each thread has the illusion of performing its task in program order. It sees its own writes and writes that occurred earlier in the program. It does not see writes that take place later in program order even if, because of the concurrent execution, these writes have actually taken place earlier in time.
To sustain the illusion, the memory subsystem, in particular in the preferred embodiment the L2-cache, gives threads private storage as needed. It lets threads read their own writes and writes from threads earlier in program order, but isolates their reads from threads later in program order. Thus, the L2 might have several different data values for a single address. Each occupies an L2 way, and the L2 directory records, in addition to the usual directory information, a history of which threads have read or written the line. A speculative write is not to be written out to main memory.
One situation will break the program-order illusion—if a thread earlier in program order writes to an address that a thread later in program order has already read. The later thread should have read that data, but did not. A solution is to kill the later thread and invalidate all the lines it has written in L2, and to repeat this for all younger threads. On the other hand, without this interference a thread can complete successfully, and its writes can move to external main memory when the line is cast out or flushed.
Not all threads need to be speculative. The running thread earliest in program order can execute as non-speculative and runs conventionally; in particular its writes can go to external main memory. The threads later in program order are speculative and are subject to being killed. When the non-speculative thread completes, the next-oldest thread can be committed and it then starts to run non-speculatively.
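The visibility and conflict rules of this model can be sketched as follows. This is a simplified software illustration, not the L2 directory mechanism: thread numbers stand in for program order (lower is older), every write is kept as a private speculative version, and all names are hypothetical.

```python
class SpeculativeMemoryModel:
    """Toy model of program-order visibility and conflict detection."""

    def __init__(self):
        self.committed = {}     # address -> value in the main memory state
        self.spec_writes = {}   # thread -> {address: value}, private versions
        self.readers = {}       # address -> set of threads that have read it
        self.threads = set()    # every thread seen so far
        self.killed = set()     # threads killed because of a conflict

    def read(self, thread, address):
        self.threads.add(thread)
        self.readers.setdefault(address, set()).add(thread)
        # A thread sees its own writes and those of older threads, newest first.
        for writer in sorted(self.spec_writes, reverse=True):
            if writer <= thread and address in self.spec_writes[writer]:
                return self.spec_writes[writer][address]
        return self.committed.get(address)

    def write(self, thread, address, value):
        self.threads.add(thread)
        self.spec_writes.setdefault(thread, {})[address] = value
        # Conflict: a younger thread has already read this address.
        younger = [t for t in self.readers.get(address, set()) if t > thread]
        if younger:
            self._kill_from(min(younger))

    def _kill_from(self, first):
        """Kill the conflicting thread and all younger ones; discard their state."""
        for t in [t for t in self.threads if t >= first]:
            self.killed.add(t)
            self.spec_writes.pop(t, None)
            for reader_set in self.readers.values():
                reader_set.discard(t)


mem = SpeculativeMemoryModel()
mem.committed[0x40] = 1
seen_by_thread2 = mem.read(2, 0x40)    # younger thread reads the old value
mem.write(3, 0x80, 9)                  # youngest thread writes elsewhere
seen_by_thread1 = mem.read(1, 0x80)    # older thread must not see that write
mem.write(1, 0x40, 7)                  # older thread writes what thread 2 read
```

In this run the final write conflicts with the earlier read by thread 2, so threads 2 and 3 are killed and their speculative writes are discarded.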
The following sections describe a hardware implementation embodiment for a speculation model.
Speculation IDs
Speculation IDs constitute a mechanism for the memory subsystem to associate memory requests with a corresponding task, when a sequential program is decomposed into speculative tasks. This is done by assigning an ID at the start of a speculative task to the software thread executing the task and attaching the ID as a tag to all requests sent to the memory subsystem by that thread. In SE, a speculation ID should be attached to a single task at a time.
As the number of dynamic tasks can be very large, it is not practical to guarantee uniqueness of IDs across the entire program run. It is sufficient to guarantee uniqueness for all IDs assigned to TLS tasks concurrently present in the memory system.
The BG/Q memory subsystem embodiment implements a set of 128 such speculation IDs, encoded as 7 bit values. On start of a speculative task, a thread requests an ID currently not in use from a central unit, the L2 CENTRAL unit. The thread then uses this ID by storing its value in a core-local register that tags the ID on all requests sent to the L2-cache.
After a thread has terminated, the changes associated with its ID are either committed, i.e., merged with the persistent main memory state, or they are invalidated, i.e., removed from the memory subsystem, and the ID is reclaimed for further allocation. But before a new thread can use the ID, no valid lines with that thread ID may remain in the L2. It is not necessary for the L2 to identify and mark these lines immediately because the pool of usable IDs is large. Therefore, cleanup is gradual.
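The allocation and reclamation of IDs can be sketched with a small pool model. The pool size of 128 and the 7-bit encoding come from the text; the class, the state names other than speculative, committed and invalid, the lowest-free-ID allocation policy, and the explicit `cleanup_done` step are assumptions for illustration.

```python
class SpeculationIdPoolModel:
    """Toy model of a central pool of speculation IDs with gradual cleanup."""

    NUM_IDS = 128   # encoded as 7-bit values

    def __init__(self):
        self.state = ["available"] * self.NUM_IDS

    def allocate(self):
        """Hand out an ID that is currently not in use, or None if none is free."""
        for spec_id, status in enumerate(self.state):
            if status == "available":
                self.state[spec_id] = "speculative"
                return spec_id
        return None

    def commit(self, spec_id):
        """The thread finished without conflict; its lines are to be merged."""
        self.state[spec_id] = "committed"

    def invalidate(self, spec_id):
        """The thread was killed; its lines are to be removed."""
        self.state[spec_id] = "invalid"

    def cleanup_done(self, spec_id):
        """No valid lines with this ID remain in the L2, so it can be reused."""
        self.state[spec_id] = "available"


pool = SpeculationIdPoolModel()
first_id = pool.allocate()
second_id = pool.allocate()
pool.commit(first_id)
third_id = pool.allocate()      # the committed ID is not reusable yet
pool.cleanup_done(first_id)
reused_id = pool.allocate()     # after cleanup, the ID returns to circulation
```

Because the pool is large, allocation can proceed from the remaining free IDs while cleanup of a committed or invalid ID is still pending.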
Life cycle of a speculation ID
The thread starts using the ID with tagged memory requests at 50504. Such tagging may be implemented by the runtime system programming a register to activate the tagging. The application may signal the runtime system to do so, especially in the case of TM. If a conflict occurs at 50505, the conflict is noted in the conflict register of
After the ID state change from speculative to committed or invalid, the L2 slices start to merge or invalidate lines associated with the ID at 50512. More about merging lines will be described with reference to
In addition to the SE use of speculation, the proposed system can support two further uses of memory speculation: Transactional Memory (“TM”), and Rollback. These uses are referred to in the following as modes.
TM occurs in response to a specific programmer request. Generally the programmer will put instructions in a program delimiting sections in which TM is desired. This may be done by marking the sections as requiring atomic execution. According to the PowerPC architecture: “An access is single-copy atomic, or simply ‘atomic’, if it is always performed in its entirety with no visible fragmentation.” Alternatively, the programmer may put in a request to the runtime system for a domain to be allocated to TM execution. This request will be conveyed by the runtime system via the operating system to the hardware, so that modes and IDs can be allocated. When the section ends, the program will make another call that ultimately signals the hardware to do conflict checking and reporting. Reporting means in this context: provide conflict details in the conflict register and issue an interrupt to the affected thread. The PowerPC architecture has an instruction type known as larx/stcx. This instruction type can be implemented as a special case of TM. The larx/stcx pair will delimit a memory access request to a single address and set up a program section that ends with a request to check whether the memory access request was successful or not. More about a special implementation of larx/stcx instructions using reservation registers is to be found in co-pending application Ser. No. 12/697,799 filed Jan. 29, 2010, which is incorporated herein by reference. This special implementation uses an alternative approach to TM to implement these instructions. In any case, TM is a broader concept than larx/stcx. A TM section can delimit multiple loads and stores to multiple memory locations in any sequence, requesting a check on their success or failure and a reversal of their effects upon failure. TM is generally used for only a subset of an application program, with program sections before and after executing in speculative mode.
Rollback occurs in response to “soft errors.” Normally these errors occur in response to cosmic rays or alpha particles from solder balls.
Referring back to
More particularly, the basic nodechip 50 of the multiprocessor system illustrated in
The 17th core is configurable to carry out system tasks, such as
In other words, it offloads all the administrative tasks from the other cores to reduce the context switching overhead for these.
In one embodiment, there is provided 32 MB of shared L2 cache 70, accessible via crossbar switch 60. There is further provided external Double Data Rate Synchronous Dynamic Random Access Memory (“DDR SDRAM”) 80, as a lower level in the memory hierarchy in communication with the L2.
Each FPU 53 associated with a core 52 has a data path to the L1-cache 55 of the core, allowing it to load or store from or into the L1-cache 55. The terms “L1” and “L1D” will both be used herein to refer to the L1 data cache.
Each core 52 is directly connected to a supplementary processing agglomeration 58, which includes a private prefetch unit. For convenience, this agglomeration 58 will be referred to herein as “L1P” (meaning level 1 prefetch) or “prefetch unit;” but many additional functions, such as write combining, are lumped together in this so-called prefetch unit. These additional functions could be illustrated as separate modules, but as a matter of drawing and nomenclature convenience the additional functions and the prefetch unit will be illustrated herein as being part of the agglomeration labeled “L1P.” This is a matter of drawing organization, not of substance. The L1P group also accepts, decodes and dispatches all requests sent out by the core 52.
By implementing a direct memory access (“DMA”) engine referred to herein as a Messaging Unit (“MU”) such as MU 100, with each MU including a DMA engine and Network Card interface in communication with the XBAR switch, chip I/O functionality is provided. In one embodiment, the compute node further includes: intra-rack interprocessor links 90 which may be configurable as a 5-D torus; and, one I/O link 92 interfaced with the MU. The system node employs or is associated and interfaced with an 8-16 GB memory/node, also referred to herein as “main memory.”
The term “multiprocessor system” is used herein. With respect to the present embodiment this term can refer to a nodechip or it can refer to a plurality of nodechips linked together. In the present embodiment, however, the management of speculation is conducted independently for each nodechip. This might not be true for other embodiments, without taking those embodiments outside the scope of the claims.
The compute nodechip implements a direct memory access engine DMA to offload the network interface. It transfers blocks via three switch master ports between the L2-cache slices 70 (
The L2 will have ports, for instance a 256b wide read data port, a 128b wide write data port, and a request port. Ports may be shared by all processors through the crossbar switch 60.
If the hardware determines that no conflict has occurred, the speculative results of the associated thread can be made persistent.
In response to a conflict, trying again may make sense where another thread completed successfully, which may allow the current thread to succeed. If both threads restart, there can be a “livelock,” where both keep failing over and over. In this case, the runtime system may have to adopt other strategies, such as getting one thread to wait or choosing one transaction to survive and killing the others, all of which are known in the art.
Once an ID is committed, the actions taken by the thread under that ID become irreversible.
In the current embodiment, a hardware thread can only use one speculation ID at a time and that ID can only be configured to one domain of IDs. This means that if TM or TLS is invoked, which will assign an ID to the thread, then rollback cannot be used. In this case, the only way of recovering from a soft error might be to go back to system states that are stored to disk on a more infrequent basis. It might be expected in a typical embodiment that a rollback snapshot might be taken on the order of once every millisecond, while system state might be stored to disk only once every hour or two. Therefore rollback allows for much less work to be lost as a result of a soft error. Soft errors increase in frequency as chip density increases. Executing in TLS or TM mode therefore entails a certain risk.
Generally, recovery from failure of any kind of speculative execution in the current embodiment relates to undoing changes made by a thread. If a soft error occurred that did not relate to a change that the thread made, then it may nevertheless be necessary to go back to the snapshot on the disk.
As shown in
Address scrambling tries to distribute memory accesses across L2-cache slices and within L2-cache slices across sets (congruence classes). Assuming a 64 GB main memory address space, a physical address dispatched to the L2 has 36 bits, numbered from 0 (MSb) to 35 (LSb) (a(0 to 35)).
The L2 stores data in 128 B wide lines, and each of these lines is located in a single L2-slice and is referenced there via a single directory entry. As a consequence, the address bits 29 to 35 only reference parts of an L2 line and do not participate in L2 or set selection.
To evenly distribute accesses across L2-slices for sequential lines as well as larger strides, the remaining address bits are hashed to determine the target slice. To allow flexible configurations, individual address bits can be selected to determine the slice as well as an XOR hash on an address can be used: The following hashing is used in the present embodiment:
For each of the slices, 25 address bits are a sufficient reference to distinguish L2 cache lines mapped to that slice.
Each L2 slice holds 2 MB of data or 16K cache lines. At 16-way associativity, the slice has to provide 1024 sets, addressed via 10 address bits. The different ways are used to store different addresses mapping to the same set as well as for speculative results associated with different threads or combinations of threads.
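The sizes quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch only; the constant and function names are illustrative and do not come from the hardware description, and the slice and set hash functions themselves are not modeled.

```python
# Sketch of the L2 geometry arithmetic described above.
# All names are illustrative assumptions, not hardware identifiers.

ADDRESS_BITS = 36             # 64 GB physical address space
LINE_BYTES = 128              # L2 line size
SLICE_BYTES = 2 * 1024 ** 2   # 2 MB of data per L2 slice
WAYS = 16                     # associativity


def log2_exact(n: int) -> int:
    """Return log2(n) for a power of two, else raise ValueError."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


offset_bits = log2_exact(LINE_BYTES)          # address bits 29 to 35
lines_per_slice = SLICE_BYTES // LINE_BYTES   # 16K lines
sets_per_slice = lines_per_slice // WAYS      # 1024 sets
set_index_bits = log2_exact(sets_per_slice)   # 10 bits


def line_offset(addr: int) -> int:
    """Byte offset within the 128 B line (address bits 29 to 35)."""
    return addr & (LINE_BYTES - 1)


def line_address(addr: int) -> int:
    """Address bits 0 to 28, the part used for slice and set selection."""
    return addr >> offset_bits
```

Because the document numbers bits from the most significant end, bits 29 to 35 are the seven least significant bits, which is why a plain mask and shift suffice here.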
Again, even distribution across set indices for unit and non-unit strides is achieved via hashing, to wit:
Thereafter, the switch provides addressing to the L2 slice in accordance with an address that includes the set and way and offset within a line, as shown in
L2 as Point of Coherence
In this embodiment, the L2 Cache provides the bulk of the memory system caching on the BQC chip. To reduce main memory accesses, the L2 caches serve as the point of coherence for all processors. This function includes generating L1 invalidations when necessary. Because the L2 caches are inclusive of the L1s, they can remember which processors could possibly have a valid copy of every line. Memory consistency is enforced by the L2 slices by means of multicasting selective L1 invalidations, made possible by the fact that the L1s operate in write-through mode and the L2s are inclusive of the L1s.
Per the article on “Cache Coherence” in Wikipedia, there are several ways of monitoring speculative execution to see if some resource conflict is occurring, e.g.
The prior version of the IBM® Blue Gene® processor used snoop filtering to maintain cache coherence. In this regard, the following patent is incorporated by reference: U.S. Pat. No. 7,386,685, issued 10 Jun. 2008.
The embodiment discussed herein uses directory based coherence.
Coherence tracking unit 4301 issues invalidations, when necessary.
The request queue 4302 buffers incoming read and write requests. In this embodiment, it is 16 entries deep, though other request buffers might have more or fewer entries. The addresses of incoming requests are matched against all pending requests to determine ordering restrictions. The queue presents the requests to the directory pipeline 4308 based on ordering requirements.
The write data buffer 4303 stores data associated with write requests. This embodiment has a 16 B wide data interface to the switch 60 and stores sixteen 16 B wide entries. Other sizes might be devised by the skilled artisan as a matter of design choice. This buffer passes the data to the eDRAM pipeline 4305 in case of a write hit or after a write miss resolution. The eDRAMs are shown at 40101 in
The directory pipeline 4308 accepts requests from the request queue 4302, retrieves the corresponding directory set from the directory SRAM 4309, matches and updates the tag information, writes the data back to the SRAM and signals the outcome of the request (hit, miss, conflict detected, etc.). Operations illustrated at
In parallel,
The L2 implements two eDRAM pipelines 4305 that operate independently. They may be referred to as eDRAM bank 0 and eDRAM bank 1. The eDRAM pipeline controls the eDRAM access and the dataflow from and to this macro. When writing only subcomponents of a doubleword, or for load-and-increment or store-add operations, it is responsible for scheduling the necessary Read Modify Write (“RMW”) cycles and for providing the dataflow for insertion and increment.
The read return buffer 4304 buffers read data from eDRAM or the memory controller 50078 (
The miss handler 4307 takes over processing of misses determined by the directory. It provides the interface to the DRAM controller and implements a data buffer for write and read return data from the memory controller,
The reservation table 4306 registers reservation requests, decides whether a STWCX can proceed to update L2 state and invalidates reservations based on incoming stores.
Also shown in
The L2 implements a multitude of decoupling buffers for different purposes.
The L2 central unit 50203 is illustrated in
The L2 counter units 50201 of
A command execution unit 50415 coordinates operations with respect to speculation ID's. Operations associated with
The L2 slices 50072 communicate to the central unit at 50417 typically in the form of replies to commands, though sometimes the communications are not replies, and receive commands from the central unit at 50418. Other examples of what might be transmitted via the bus labeled “L2 replies” include signals from the slices indicating if a conflict has happened. In this case, a signal can go out via a dedicated broadcast bus to the cores indicating the conflict to other devices, that an ID has changed state and that an interrupt should be generated.
The L2 slices receive memory access requests at 50419 from the L1D at a request interface 50420. The request interface forwards the request to the directory pipe 4308 as shown in more detail in
Support for such functionalities includes additional bookkeeping and storage functionality for multiple versions of the same physical memory line.
These registers include 128 two bit registers 50431, each for storing the state of a respective one of the 128 possible thread IDs. The possible states are:
By querying the table on every use of an ID, the effect of instantaneous ID commit or invalidation can be achieved by changing the state associated with the ID to committed or invalid. This makes it possible to change a thread's state without having to find and update all the thread's lines in the L2 directory; also it saves directory bits.
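The benefit of the indirection can be illustrated with a small sketch. It assumes the four states named elsewhere in this description (available, speculative, committed, invalid); the numeric two-bit encodings, the class names, and the `classify_line` helper are hypothetical and serve only to show that a single table write changes how every line tagged with that ID is interpreted.

```python
from enum import Enum

NUM_IDS = 128


class IdState(Enum):
    # The two-bit encodings below are assumed, not taken from the source.
    AVAILABLE = 0
    SPECULATIVE = 1
    COMMITTED = 2
    INVALID = 3


class IdStateTable:
    """Hypothetical model of the 128 two-bit ID state registers."""

    def __init__(self):
        self.state = [IdState.AVAILABLE] * NUM_IDS

    def set_state(self, spec_id, new_state):
        self.state[spec_id] = new_state

    def classify_line(self, writer_id):
        """Interpret a directory line by looking up the state of its writer ID."""
        s = self.state[writer_id]
        if s is IdState.COMMITTED:
            return "committed data"
        if s is IdState.INVALID:
            return "discard"
        if s is IdState.SPECULATIVE:
            return "speculative"
        raise ValueError("no directory line should carry an available ID")
```

In this sketch, committing an ID is one call to `set_state`; the directory entries that carry the ID are left untouched and are cleaned up later.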
Another set of 128 registers 50432 is for encoding conflicts associated with IDs. More detail of these registers is shown at
Another register 50433 has 5 bits and is for indicating how many domains have been created.
A set of 16 registers 50434 indicates an allocation pointer for each domain. A second set of 16 registers 50435 indicates a commit pointer for each domain. A third set of 16 registers 50436 indicates a reclaim pointer for each domain. These three pointer registers are seven bits each.
ID Ordering for Speculative Execution
The numeric value of the speculation ID is used in Speculative Execution to establish a younger/older relationship between speculative tasks. IDs are allocated in ascending order and a larger ID generally means that the ID designates accesses of a younger task.
To implement in-order allocation, the L2 CENTRAL at 50413 maintains an allocation pointer 50434. A function ptr_try_allocate tries to allocate the ID the pointer points to and, if successful, increments the pointer. More about this function can be found in a table of functions listed below.
As the set of IDs is limited, the allocation pointer 50434 (
The allocation pointer is a 7b wide register. It stores the value of the ID that is to be allocated next. If an allocation is requested and the ID it points to is available, the ID state is changed to speculative, the ID value is returned to the core and the pointer content is incremented.
The notation means: if the allocation pointer is, e.g., 10, then ID 10 is the oldest, 11 second oldest, . . . , 8 second youngest and 9 youngest ID.
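The wraparound ordering and the in-order allocation step can be sketched as follows. This is an illustrative model under simplifying assumptions: a single domain spanning all 128 IDs, a boolean availability flag in place of the full ID state, and hypothetical names (`age_rank`, `Allocator`, `try_allocate`) that are not the functions provided by L2 CENTRAL.

```python
NUM_IDS = 128  # size of the 7-bit ID space, single domain assumed


def age_rank(spec_id, alloc_ptr, num_ids=NUM_IDS):
    """Rank of an ID relative to the allocation pointer.

    Rank 0 is the oldest position (the ID the pointer points to) and
    rank num_ids - 1 is the youngest (the ID just behind the pointer).
    """
    return (spec_id - alloc_ptr) % num_ids


def is_older(a, b, alloc_ptr):
    """True if ID a designates an older task than ID b."""
    return age_rank(a, alloc_ptr) < age_rank(b, alloc_ptr)


class Allocator:
    """Hypothetical in-order allocator with a wrapping 7-bit pointer."""

    def __init__(self):
        self.ptr = 0
        self.available = [True] * NUM_IDS

    def try_allocate(self):
        """Allocate the ID under the pointer, or return None if it is in use."""
        if not self.available[self.ptr]:
            return None
        spec_id = self.ptr
        self.available[spec_id] = False       # ID becomes speculative
        self.ptr = (self.ptr + 1) % NUM_IDS   # pointer wraps after 127
        return spec_id
```

Allocation fails rather than skipping ahead when the ID under the pointer is still in use, which preserves the ascending order that the younger/older comparison relies on.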
Aside from allocating IDs in order for Speculative Execution, the IDs must also be committed in order. L2 CENTRAL provides a commit pointer 50435 that provides an atomic increment function and can be used to track what ID to commit next, but the use of this pointer is not mandatory.
Per
If the commit fails or the ID was already invalid before the commit attempt at 50525, the ID the commit pointer points to needs to be invalidated along with all younger IDs currently in use at 50527. Then the commit pointer must be moved past all invalidated IDs by directly writing to the commit pointer register 50528. Then, the A-bit for all invalidated IDs the commit pointer moved past can be cleared and thus released for reallocation at 50529. The failed speculative task then needs to be restarted.
Speculation ID Reclaim
To support ID cleanup, the L2 cache maintains a Use Counter within units 50201 for each thread ID. Every time a line is established in L2, the use counter corresponding to the ID of the thread establishing the line is incremented. The use counter also counts the occurrences of IDs in the speculative reader set. Therefore, each use counter indicates the number of occurrences of its associated ID in the L2.
At intervals programmable via DCR the L2 examines one directory set for lines whose thread IDs are invalid or committed. For each such line, the L2 removes the thread ID in the directory, marks the cache line invalid or merges it with the non-speculative state respectively, and decrements the use counter associated with that thread ID. Once the use counter reaches zero, the ID can be reclaimed, provided that its A bit has been cleared. The state of the ID will switch to available at that point. This is a type of lazy cleanup. More about lazy evaluation can be found in the Wikipedia article entitled “Lazy Evaluation.”
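The interaction between use counters, the A bit, and the periodic scrub can be sketched as below. The class and method names are hypothetical, states are plain strings, and each directory set is reduced to a list of ID tags; reader-set occurrences and the actual merge or invalidate of line data are omitted.

```python
class LazyReclaimSketch:
    """Illustrative model of use counters and lazy ID reclaim."""

    def __init__(self, num_ids=128, num_sets=4):
        self.state = ["available"] * num_ids
        self.a_bit = [False] * num_ids       # allocation bit, cleared by software
        self.use_count = [0] * num_ids
        # Each set is a list of ID tags; None means the way carries no ID.
        self.sets = [[] for _ in range(num_sets)]

    def establish_line(self, set_index, spec_id):
        """A line tagged with spec_id is established in the given set."""
        self.sets[set_index].append(spec_id)
        self.use_count[spec_id] += 1

    def scrub_set(self, set_index):
        """Examine one directory set and clean lines of finished IDs."""
        lines = self.sets[set_index]
        for way, spec_id in enumerate(lines):
            if spec_id is None:
                continue
            if self.state[spec_id] not in ("committed", "invalid"):
                continue
            lines[way] = None                # tag removed from the directory
            self.use_count[spec_id] -= 1
            self._try_reclaim(spec_id)

    def clear_a_bit(self, spec_id):
        self.a_bit[spec_id] = False
        self._try_reclaim(spec_id)

    def _try_reclaim(self, spec_id):
        finished = self.state[spec_id] in ("committed", "invalid")
        if finished and self.use_count[spec_id] == 0 and not self.a_bit[spec_id]:
            self.state[spec_id] = "available"
```

The ID becomes available only when both conditions hold, in whichever order they are reached: no remaining occurrences in the directory and a cleared A bit.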
Domains
Parallel programs are likely to have known independent sections that can run concurrently. Each of these parallel sections might, during the annotation run, be decomposed into speculative threads. It is convenient and efficient to organize these sections into independent families of threads, with one committed thread for each section. The L2 allows for this by using up to the four most significant bits of the thread ID to indicate a speculation domain. The user can partition the thread space into one, two, four, eight or sixteen domains. All domains operate independently with respect to allocating, checking, promoting, and killing threads. Threads in different domains can communicate if both are non-speculative; no speculative threads can communicate outside their domain, for reasons detailed below.
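The partitioning of the 7-bit ID space by its most significant bits can be sketched as follows. The function names are illustrative only; the sketch assumes that the domain number is simply the top bits of the ID, as the paragraph above describes.

```python
ID_BITS = 7  # 128 speculation IDs


def domain_of(spec_id, num_domains):
    """Return the domain of an ID when the space is split into num_domains."""
    if num_domains not in (1, 2, 4, 8, 16):
        raise ValueError("the ID space can be split into 1, 2, 4, 8 or 16 domains")
    domain_bits = num_domains.bit_length() - 1   # 0 to 4 most significant bits
    return spec_id >> (ID_BITS - domain_bits)


def ids_in_domain(domain, num_domains):
    """Return the contiguous range of IDs that belongs to one domain."""
    size = (1 << ID_BITS) // num_domains
    return range(domain * size, (domain + 1) * size)
```

With sixteen domains, each domain holds eight IDs; with a single domain, all 128 IDs belong to domain 0.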
Per
Transactional Memory
The L2's speculation mechanisms also support a transactional-memory (TM) programming model, per
The implementation of TM uses the hardware resources for speculation. A difference between TLS and TM is that TM IDs are not ordered. As a consequence, IDs can be allocated at 50602 and committed in any order 50608. The L2 CENTRAL provides a function that allows allocation of any available ID from a pool (try_alloc_avail) and a function that allows an ID to be atomically committed regardless of any pointer state (try_commit) 50605. More about these functions appears in a table presented below.
The lack of ordering means also that the mechanism to forward data from older threads to younger threads cannot be used and both RAW as well as WAR accesses must be flagged as conflicts at 50603. Two IDs that have speculatively written to the same location cannot both commit, as the order of merging the IDs is not tracked. Consequently, overlapping speculative writes are flagged as WAW conflicts 50604.
A transaction succeeds 50608 if, while the section executes, no other thread accesses any of the addresses it has accessed, except if both threads are only reading per 50606. If the transaction does not succeed, hardware reverses its actions 50607: its writes are invalidated without reaching external main memory. The program generally loops on a return code and reruns failing transactions.
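The success rule for unordered transactions can be expressed as a set comparison. The sketch below is a software illustration only, not the directory-based hardware check: each transaction is represented by hypothetical read and write sets of addresses, and because TM IDs are unordered, RAW and WAR are reported together as a read/write overlap.

```python
def tm_conflicts(reads_a, writes_a, reads_b, writes_b):
    """Return the kinds of conflict between two transactions A and B.

    "WAW": both wrote the same address.
    "RW":  one wrote an address the other read (RAW or WAR).
    An empty result means both transactions may commit.
    """
    kinds = set()
    if writes_a & writes_b:
        kinds.add("WAW")
    if (writes_a & reads_b) or (writes_b & reads_a):
        kinds.add("RW")
    return kinds
```

Two transactions that only read a common address yield an empty set, matching the exception noted above.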
Mode Switching
Each of the three uses of the speculation facilities
Memory Consistency
This section describes the basic mechanisms used to enforce memory consistency, both in terms of program order due to speculation and memory visibility due to shared memory multiprocessing, as it relates to speculation.
The L2 maintains the illusion that speculative threads run in sequential program order, even if they do not. Per
At the L2 at 50902, the directory is marked to reflect which threads have read and written a line when necessary. Not every thread ID needs to be recorded, as explained with respect to the reader set directory, see e.g.
On a read at 50903, the L2 returns the line that was previously written by the thread that issued the read or else by the nearest previous thread in program order 50914; if the address is not in L2 50912, the line is fetched 50913 from external main memory.
On a write 50904, the L2 directory is checked for illusion-breaking reads, that is, reads by threads later in program order. More about this type of conflict checking is explained with reference to
To kill a thread (and all younger threads), the L2 sends an interrupt 50915 to the corresponding core. The core receiving the interrupt has to notify the cores running its successor threads to terminate these threads, too per 50907. It then has to mark the corresponding thread IDs invalid 50908 and restart its current speculative thread 50909.
Commit Race Window Handling
Per
To close this window, the commit process is managed in TLS, TM mode, and rollback mode 51003, 51004, 51005. Rollback mode requires equivalent treatment to transition IDs to the invalid state.
Transition to Committed State
To avoid the race, the L2 gives special handling to the period between the end of a committed thread and the promotion of the next. Per 51003 and
In the case of TM, first the thread to be committed is set to a transitional state at 51120. Then accesses from other speculative threads or non-speculative writes are blocked at 51121. If any such speculative access or non-speculative write is active, then the system has to wait at 51122. Otherwise conflicts must be checked for at 51123. If none are present, then all side effects must be registered at 51124, before the thread may be committed and writes resumed at 51125.
Thread ID Counters
A direct implementation of the thread ID use counters would require each of the 16 L2's to maintain 128 counters (one per thread ID), each 16 bits (to handle the worst case where all 16 ways in all 1024 sets have a read and a write by that thread). These counters would then be ORed to detect when a count reached zero.
Instead, groups of L2's manipulate a common group-wide-shared set of counters 50201. The architecture assigns one counter set to each set of 4 L2-slices. The counter size is increased by 2 bits to handle directories for 4 caches, but the number of counters is reduced 4-fold. The counters become more complex because they now need to simultaneously handle combinations of multiple decrements and increments.
As a second optimization, the number of counters is reduced a further 50% by sharing counters among two thread IDs. A nonzero count means that at least one of the two IDs is still in use. When the count is zero, both IDs can potentially be reclaimed; until then, neither can be reclaimed. The counter size remains the same, since the 4 L2's still can have at most 4*16*1024*3 references total.
A drawback of sharing counters is that IDs take longer to be reused: neither of the two IDs can be reused until their shared count is zero. To mitigate this, the number of available IDs is made large (128) so free IDs will be available even if several generations of threads have not yet fully cleared.
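The counter widths quoted above follow from the worst-case reference counts. The sketch below reproduces that arithmetic; the total storage figures at the end are derived here from the stated counts and widths and are not numbers given in the description.

```python
WAYS = 16
SETS = 1024
SLICES = 16
NUM_IDS = 128


def bits_for(max_count: int) -> int:
    """Number of bits needed to hold any value from 0 to max_count."""
    return max_count.bit_length()


# Direct scheme: one counter per ID per slice, read + write per way.
direct_max = WAYS * SETS * 2
direct_bits = bits_for(direct_max)

# Shared scheme: one counter set per group of 4 slices, two IDs per counter.
shared_max = 4 * WAYS * SETS * 3
shared_bits = bits_for(shared_max)

# Derived totals (not stated in the source text).
direct_total_bits = SLICES * NUM_IDS * direct_bits
shared_total_bits = (SLICES // 4) * (NUM_IDS // 2) * shared_bits
```

Under these assumptions the shared scheme needs 18-bit counters, two bits more than the direct scheme, while holding one eighth as many counters.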
After a thread count has reached zero, the thread table is notified that those threads are now available for reuse.
Conflict Handling
Conflict Recording
To detect conflicts, the L2 must record all speculative reads and writes to any memory location.
Speculative writes are recorded by allocating in the directory a new way of the selected set and marking it with the writer ID. The set contains 16 dirty bits that distinguish which double word of the 128 B line has been written by the speculation ID. If a sub-double word write is requested, the L2 treats this as a speculative read of a double word, insertion of the write data into that word followed by a full double word write.
In the base directory, 50321, there are 15 bits that represent the upper 15 address bits of the line stored at 50271. Then there is a seven bit speculative writer ID field 50272 that indicates which speculation ID wrote to this line and a flag 50273 that indicates whether the line was speculatively written. Then there is a two bit speculative read flag field 50274 indicating whether to invoke the speculative reader directory 50324, and a one bit “current” flag 50275. The current flag 50275 indicates whether the current line is assembled from more than one way or not. The core 52 does not know about the fields 50272-50275. These fields are set by the L2 directory pipeline.
If the speculative writer flag is set, then the way has been written speculatively, not taken from main memory, and the writer ID field will say what the writer ID was. If the flag is clear, the writer ID field is irrelevant.
The LRU directory indicates “age”, a relative ordering number with respect to last access. This directory is for allocating ways in accordance with the Least Recently Used algorithm.
The COH/dirty directory has two uses, and accordingly two possible formats. In the first format, 50323, known as “COH,” there are 17 bits, one for each core of the system. This format indicates, when the writer flag is not set, whether the corresponding core has a copy of this line of the cache. In the second format, 50323′, there are 16 bits. These bits indicate, if the writer flag is set in the base directory, which part of the line has been modified speculatively. The line has 128 bytes, but they are recorded at 50323′ in groups of 8 bytes, so only 16 bits are used, one for each group of eight bytes.
Speculative reads are recorded for each way from which data is retrieved while processing a request. As multiple speculative reads from different IDs for different sections of the line need to be recorded, the L2 uses a dynamic encoding that provides a superset representation of the read accesses.
In
Format 50331 indicates that no speculative reading has occurred.
If only a single TLS or TM ID has read the line, the L2 records the ID along with the left and right boundary of the line section so far accessed by the thread. Boundaries are always rounded to the next double word boundary. Format 50332 uses two bit code “01” to indicate that a single seven bit ID, α, has read in a range delimited by four bit parameters denoted “left” and “right”.
If two IDs in TM have accessed the line, the IDs along with the gap between the two accessed regions are recorded. Format 50333 uses two bit code “11” to indicate that a first seven bit ID denoted “α” has read from a boundary denoted with four bits symbolized by the word “left” to the end of the line; while a seven bit second ID, denoted “β” has read from the beginning of the line to a boundary denoted by four bits symbolized by the word “right.”
Format 50334 uses three bit code “001” to indicate that three seven bit IDs, denoted “α,” “β,” and “γ,” have read the entire line. In fact, when the entire line is indicated in this figure, it might be that less than the entire line has been read, but the encoding of this embodiment does not keep track at the sub-line granularity for more than two speculative IDs. One of ordinary skill in the art might devise other encodings as a matter of design choice.
Format 50335 uses five bit code “00001” to indicate that several IDs have read the entire line. The range of IDs is indicated by the three bit field denoted “ID up”. This range includes the sixteen IDs that share the same upper three bits. Which of the sixteen IDs have read the line is indicated by respective flags in the sixteen bit field denoted “ID set.”
If two or more TLS IDs have accessed the line, the youngest and the oldest ID along with the left and right boundary of the aggregation of all accesses are recorded.
Format 50336 uses the eight bit code “00010000” to indicate that a group of IDs has read the entire line. This group is defined by a 16 bit field denoted “IDgroupset.”
Format 50337 uses the two bit code “10” to indicate that two seven bit IDs, denoted “α” and “β” have read a range delimited by boundaries indicated by the four bit fields denoted “left” and “right.”
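As an illustration of how format 50335 identifies readers, the sketch below expands the three bit “ID up” field and the sixteen bit “ID set” field into a list of seven bit IDs. The mapping of flag positions to the low four ID bits (flag k, counted from the least significant bit, stands for low bits equal to k) is an assumption made for this example; the description does not specify the bit order.

```python
def decode_id_set(id_up: int, id_set: int) -> list:
    """Expand the "ID up" and "ID set" fields into reader IDs.

    id_up:  upper three bits shared by all sixteen candidate IDs.
    id_set: sixteen flags, one per candidate ID (bit order assumed).
    """
    if not 0 <= id_up < 8:
        raise ValueError("id_up is a three bit field")
    if not 0 <= id_set < (1 << 16):
        raise ValueError("id_set is a sixteen bit field")
    readers = []
    for low in range(16):
        if id_set & (1 << low):
            readers.append((id_up << 4) | low)
    return readers
```

The three upper bits select one of eight groups of sixteen IDs, so a single entry can name any subset of one group but cannot mix IDs from different groups.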
When doing WAR conflict checking, per
Rollback ID reads are not recorded.
If more than two TM IDs, a mix of TM and TLS IDs or TLS IDs from different domains have been recorded, only the 64 byte access resolution for the aggregate of all accesses is recorded.
In summary, then, the current bit 50275 of
Conflict Detection
For each request the L2 generates a read and write access memory footprint that describes what section of the 128 B line is read and/or written. The footprints are dependent on the type of request, the size info of the request as well as on the atomic operation code.
For example, an atomic load-increment-bounded from address A has a read footprint of the double word at A as well as the double word at A+8, and it has a write footprint of the double word at address A. The footprint is used to match the request against recorded speculative reads and writes of the line.
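The footprint example can be made concrete with bit masks. The sketch below assumes that footprints are tracked at double word granularity as a sixteen bit mask over the 128 B line, by analogy with the sixteen dirty bits; the function names are hypothetical, and operands are assumed to lie within a single line.

```python
DWORD_BYTES = 8
LINE_BYTES = 128


def dword_mask(offset: int, size: int) -> int:
    """Mask with bit i set if double word i of the line is touched."""
    if size <= 0 or offset < 0 or offset + size > LINE_BYTES:
        raise ValueError("access must lie within one 128 B line")
    first = offset // DWORD_BYTES
    last = (offset + size - 1) // DWORD_BYTES
    mask = 0
    for i in range(first, last + 1):
        mask |= 1 << i
    return mask


def load_increment_bounded_footprint(offset: int):
    """Read and write footprints of load-increment-bounded at line offset A."""
    read = dword_mask(offset, 8) | dword_mask(offset + 8, 8)
    write = dword_mask(offset, 8)
    return read, write


def overlaps(footprint_a: int, footprint_b: int) -> bool:
    """True if two footprints touch at least one common double word."""
    return (footprint_a & footprint_b) != 0
```

A request footprint produced this way can then be compared against the recorded reader ranges and dirty bits of a line with a simple bitwise AND.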
Conflict detection is handled differently for the three modes.
Per
With respect to
Per
In Rollback mode, any access to a line that has a rollback version signals a conflict and commits the rollback ID unless the access was executed with the ID of the existing rollback version.
With respect to
TLS/TM/Rollback Management
The TLS/TM/Rollback capabilities of the memory subsystem are controlled via a memory-mapped I/O interface.
Global Speculation ID Management:
The management of the ID state is done at the L2 CENTRAL unit. L2 CENTRAL also controls how the ID state is split into domains and what attributes apply to each domain. The L2 CENTRAL is accessed via MMIO by the cores. All accesses to the L2 CENTRAL are performed with cache inhibited 8 B wide, aligned accesses.
The following functions are defined in the preferred embodiment:
Processor Local Configuration:
For each thread, a speculation ID register 50401 in
When starting a transaction or speculative thread, the thread ID provided by the ID allocate function of the Global speculation ID management has to be written into the thread ID register of
In the latest IBM® Blue Gene® architecture, the point of coherence is a directory lookup mechanism in a cache memory. It would be desirable to guarantee a hierarchy of atomicity options within that architecture.
In one embodiment, a multiprocessor system includes a plurality of processors, a conflict checking mechanism, and an instruction implementation mechanism. The processors are adapted to carry out speculative execution in parallel. The conflict checking mechanism is adapted to detect and protect results of speculative execution responsive to memory access requests from the processors. The instruction implementation mechanism cooperates with the processors and conflict checking mechanism adapted to implement an atomic operation that includes load, modify, and store with respect to a single memory location in an uninterruptible fashion.
In another embodiment, a system includes a plurality of processors and at least one cache memory. The processors are adapted to issue atomicity related operations. The operations include at least one atomic operation and at least one other type of operation. The atomic operation includes sub-operations including a read, a modify, and a write. The other type of operation includes at least one atomicity related operation. The cache memory includes a cache data array access pipeline and a controller. The controller is adapted to prevent the other types of operations from entering the cache data array access pipeline, responsive to an atomic operation in the pipeline, when those other types of operation compete with the atomic operation in the pipeline for a memory resource.
In yet another embodiment, a multiprocessor system includes a plurality of processors, a central conflict checking mechanism, and a prioritizer. The processors are adapted to implement parallel speculative execution of program threads and to implement a plurality of atomicity related techniques. The central conflict checking mechanism resolves conflicts between the threads. The prioritizer prioritizes at least one atomicity related technique over at least one other atomicity related technique.
In a further embodiment, a computer method includes issuing an atomic operation, recognizing the atomic operation, and blocking other operations. The atomic operation is issued from one of the processors in a multi-processor system and defines sub-operations that include reading, modifying, and storing with respect to a memory resource. A directory based conflict checking mechanism recognizes the atomic operation. Other operations seeking to access the memory resource are blocked until the atomic operation has completed.
Three modes of speculative execution are supported in the current embodiment: Thread Level Speculation (“TLS”), Transactional Memory (“TM”), and Rollback.
TM occurs in response to a specific programmer request. Generally the programmer will put instructions in a program delimiting sections in which TM is desired. This may be done by marking the sections as requiring atomic execution. “An access is single-copy atomic, or simply “atomic”, if it is always performed in its entirety with no visible fragmentation.” IBM® Power ISA™ Version 2.06, Jan. 30, 2009. In a transactional model, the programmer replaces critical sections with transactional sections at 61601 (
Normally TLS occurs when a programmer has not specifically requested parallel operation. Sometimes a compiler will ask for TLS execution in response to a sequential program. When the programmer writes this sequential program, she may insert commands delimiting sections. The compiler can recognize these sections and attempt to run them in parallel.
Rollback occurs in response to “soft errors,” which are normally caused by cosmic rays or alpha particles from solder balls. Rollback is discussed in more detail in co-pending application Ser. No. 12/696,780, which is incorporated herein by reference.
The present invention arose in the context of the IBM® Blue Gene® project, which is further described in the applications incorporated by reference above.
The application program 36131 can also request various operation types, for instance as specified in a standard such as the PowerPC architecture. These operation types might include larx/stcx pairs or atomic operations, to be discussed further below.
As described above,
As described above,
As described above,
In
The L2 implements a multitude of decoupling buffers for different purposes.
The L2 caches may operate as set-associative caches while also supporting additional functions, such as memory speculation for Speculative Execution (SE), Transactional Memory (TM) and local memory rollback, as well as atomic memory transactions. Support for such functionalities includes additional bookkeeping and storage functionality for multiple versions of the same physical memory line.
To reduce main memory accesses, the L2 cache may serve as the point of coherence for all processors. In performing this function, an L2 central unit will have responsibilities such as defining domains of speculation IDs, assigning modes of speculation execution to domains, allocating speculative IDs to threads, trying to commit the IDs, sending interrupts to the cores in case of conflicts, and retrieving conflict information. This function includes generating L1 invalidations when necessary. Because the L2 caches are inclusive of the L1s, they can remember which processors could possibly have a valid copy of every line, and they can multicast selective invalidations to such processors. The L2 caches are advantageously a synchronization point, so they coordinate synchronization instructions from the PowerPC architecture, such as larx/stcx.
Larx/Stcx
The larx and stcx. instructions are used to perform a read-modify-write operation to storage. If the store is performed, the use of the larx and stcx. instruction pair ensures that no other processor or mechanism has modified the target memory location between the time the larx instruction is executed and the time the stcx. instruction completes.
The lwarx (Load Word and Reserve Indexed) instruction loads the word from the location in storage specified by the effective address into a target register. In addition, a reservation on the memory location is created for use by a subsequent stwcx. instruction.
The stwcx (Store Word Conditional Indexed) instruction is used in conjunction with a preceding lwarx instruction to emulate a read-modify-write operation on a specified memory location.
The L2 caches will handle lwarx/stwcx reservations and ensure their consistency. They are a natural location for this responsibility because software locking is dependent on consistency, which is managed by the L2 caches.
The A2 core basically hands responsibility for lwarx/stwcx consistency and completion off to the external memory system. Unlike the 450 core, it does not maintain an internal reservation and it avoids complex cache management through simple invalidation. Lwarx is treated like a cache-inhibited load, but invalidates the target line if it hits in the L1 cache. Similarly, stwcx is treated as a cache-inhibited store and also invalidates the target line in L1 if it exists.
The L2 cache is expected to maintain reservations for each thread, and no special internal consistency action is taken by the core when multiple threads attempt to use the same lock. To support this, a thread is blocked from issuing any L2 accesses while a lwarx from that thread is outstanding, and it is blocked completely while a stwcx is outstanding. The L2 cache will support lwarx/stwcx as described in the next several paragraphs.
Each L2 slice has 17 reservation registers. Each reservation register consists of a 25-bit address register and a 9-bit thread ID register that identifies which thread has reserved the stored address and indicates whether the register is valid (i.e. in use).
When a lwarx occurs, the valid reservation thread ID registers are searched to determine if the thread has already made a reservation. If so, the existing reservation is cleared. In parallel, the registers are searched for matching addresses. If a matching address is found, an attempt is made to add the thread ID to the thread identifier of that register. If either no address is found or the thread ID could not be added to reservation registers with matching addresses, a new reservation is established. If a register is available, it is used; otherwise a random existing reservation is evicted and a new reservation is established in its place. The larx continues as an ordinary load and returns data.
Every store searches the valid reservation address registers. All matching registers are simply invalidated. The necessary back-invalidations to cores will be generated by the normal coherence mechanism.
When a stcx occurs, the valid reservation registers 4306 are searched for entries with both a matching address and a matching thread ID. If both of these conditions are met, then the stcx is considered a success. Stcx success is returned to the requesting core and the stcx is converted to an ordinary store (causing the necessary invalidations to other cores by the normal coherence mechanism). If either condition is not met, then the stcx is considered a failure. Stcx fail is returned to the requesting core and the stcx is dropped. In addition, for every stcx any pending reservation for the requesting thread is invalidated.
To allow more than 17 reservations per slice, the actual thread ID field is encoded by the core ID and a vector of 4 bits, each representing a thread of the indicated core. When a reservation is established, a check is first made for a matching address and core number in any register. If a register has both matching address and matching core, the corresponding thread bit is activated. Only when all bits are clear is the entire register assumed invalidated and available for reallocation without eviction.
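The reservation handling described in the preceding paragraphs may be illustrated by the following behavioral sketch. It is a software model only, not the hardware implementation. The class and method names are hypothetical, the seeded random generator is merely a stand-in for the random eviction choice, and the model does not represent address widths or timing.

```python
import random
from dataclasses import dataclass

NUM_REGISTERS = 17      # reservation registers per L2 slice (from the text)
THREADS_PER_CORE = 4    # one bit per thread of the indicated core


@dataclass
class Reservation:
    address: int
    core: int
    thread_bits: int = 0  # all bits clear means the register is free


class ReservationTable:
    """Hypothetical model of one slice's reservation registers."""

    def __init__(self, rng=None):
        self.regs = [Reservation(0, 0) for _ in range(NUM_REGISTERS)]
        self.rng = rng or random.Random(0)  # assumption: any eviction choice

    def _clear_thread(self, core, thread):
        for r in self.regs:
            if r.core == core:
                r.thread_bits &= ~(1 << thread)

    def lwarx(self, address, core, thread):
        # A thread holds at most one reservation: drop any earlier one.
        self._clear_thread(core, thread)
        # Try to join a valid register with the same address and core.
        for r in self.regs:
            if r.thread_bits and r.address == address and r.core == core:
                r.thread_bits |= 1 << thread
                return
        # Otherwise take a free register, or evict a random one.
        free = [r for r in self.regs if r.thread_bits == 0]
        r = free[0] if free else self.rng.choice(self.regs)
        r.address, r.core, r.thread_bits = address, core, 1 << thread

    def store(self, address):
        # Every store invalidates all reservations on the address.
        for r in self.regs:
            if r.address == address:
                r.thread_bits = 0

    def stwcx(self, address, core, thread):
        ok = any(r.address == address and r.core == core
                 and r.thread_bits & (1 << thread) for r in self.regs)
        # Every stcx clears the requesting thread's pending reservation.
        self._clear_thread(core, thread)
        if ok:
            self.store(address)  # success: proceeds as an ordinary store
        return ok
```

In this model, two threads of the same core that reserve the same address share a single register, and the first successful stwcx causes the later stwcx of the other thread to fail.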
Atomic Operations
The L2 supports multiple atomic operations on 8 B entities. These operations are sometimes of the type that perform read, modify, and write back atomically—in other words that combine several frequently used instructions and guarantee that they can perform successfully. The operation is selected based on address bits as defined in the memory map and the type of access. These operations will typically require RAW, WAW, and WAR checking. The directory lookup phase will be somewhat different from other instructions, because both read and write are contemplated.
It is possible to feed two atomic operations to two different addresses together through the EDRAM pipe: read a, read b, then write a and b.
Thread 61601 includes three parts,
Arrow 61607 indicates that the reader set directory is active for that part. Arrow 61608 indicates that the writer set directory is active for that part.
Code block 61602 is delimited by a larx instruction 61609 and a stcx instruction 61610. Arrow 61611 indicates that the reservation table 4306 is active. When the stcx instruction executes, if there has been any read or write conflict, the whole block 61602 fails.
Atomic operation 61603 is one of the types indicated in table below, for instance “load increment.” The arrows at 61612 show the arrival of the atomic operation during the periods of time delimited by double arrows at 61607 and 61611. The atomic operation is guaranteed to complete due to the block on the EDRAM pipe for the relevant memory accesses. Accordingly, if there is a concurrent use by a TM thread 61601 and/or by a block of code protected by LARX/STCX 61602, and if those uses access the same memory location as the atomic operation 61603, a conflict will be signaled and results of the code blocks 61601 and 61602 will be invalidated. An uninterruptible, persistent atomic operation will be given priority over a reversible operation, e.g., a TM transaction, or an interruptible operation, e.g., a LARX/STCX pair.
As between blocks 61601 and 61602, which is successful and which invalidates will depend on the order of operations, if they compete for the same memory resource. For instance, in the absence of 61603, if the stcx instruction 61610 completes before the commit attempt 61606, the larx/stcx box will succeed while the TM thread will fail. Alternatively, also in the absence of 61603, if the commit attempt 61606 completes before the stcx instruction 61610, then the larx/stcx block will fail. The TM thread can actually function a bit like multiple larx/stcx pairs together.
At 61705, the miss handler treats the existence of multiple versions as a cache miss. It blocks further accesses to that set and prevents them from entering the queue, by directing them to the EDRAM decoupling buffer. With respect to the set, the EDRAM pipe is then made to carry out copy/insert operations at 61707 until the aggregation is complete at 61708. This version aggregation loop is used for ordinary memory accesses to cache lines that have multiple versions.
Once the aggregation is complete, or if there are not multiple versions, control passes to 61710 where the current access is inserted into the EDRAM queue. If there is already an atomic operation relating to this line of the cache at 61711, then, at 61711, the current operation must wait in the EDRAM decoupling buffer. Non atomic operations will similarly have to be decoupled if they seek to access a cache line that is currently being accessed by an atomic operation in the EDRAM queue. If there are no atomic operations relating to this line in the queue, then control passes to 61713 where the current operation is transferred to the EDRAM queue. Then, at 61714, the atomic operation traverses the EDRAM queue twice, once for the read and modify and once for the write. During this traversal, other operations seeking to access the same line may not enter the EDRAM pipe, and will be decoupled into the decoupling buffer.
The following atomic operations are examples that are supported in the preferred embodiment, though others might be implemented. These operations are implemented in addition to the memory mapped i/o operations in the PowerPC architecture.
For example, load increment acts similarly to a load. This instruction provides a destination address to be loaded and incremented. In other words, the load gets a special modification that tells the memory subsystem not to simply load the value, but also increment it and write the incremented data back to the same location. This instruction is useful in various contexts. For instance, if there is a workload to be distributed to multiple threads, and it is not known how many threads will share the workload or which one is ready, then the workload can be divided into chunks. A function can associate a respective integer value to each of these chunks. Threads can use load-increment to get a workload by number and process it.
Each of these operations acts like a modification of main memory. If any of the core/L1 units has a copy of the modified value, it will get a notification that the memory value has changed, and it evicts and invalidates its local copy. The next time the core/L1 unit needs the value, it has to fetch it from the L2. This process happens each time the location is modified in the L2.
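The workload distribution described above may be sketched in software as follows. This is an illustration only: a lock emulates the atomicity that the L2 provides in hardware, and the names LoadIncrementCounter and process_chunks, as well as the summation used as the work, are hypothetical.

```python
import threading


class LoadIncrementCounter:
    """Software stand-in for a memory word accessed with load-increment."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()  # emulates the atomicity of the L2

    def load_increment(self):
        # Return the old value and write back the incremented value.
        with self._lock:
            old = self._value
            self._value = old + 1
            return old


def process_chunks(chunks, num_threads):
    counter = LoadIncrementCounter()
    results = [None] * len(chunks)

    def worker():
        while True:
            i = counter.load_increment()   # claim the next chunk number
            if i >= len(chunks):
                return                     # no work left
            results[i] = sum(chunks[i])    # placeholder for the real work

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each chunk number is handed out exactly once, regardless of how many threads participate or in which order they become ready.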
A common pattern is that some of the core/L1 units will be programmed to act when a memory location modified by atomic operations reaches a specific value. When polling for the value, repeated L1 misses, fetches from L2 followed by L1 invalidations due to atomic operations occur.
Store_add_coherence_on_zero reduces the number of times the local cache is invalidated and a new copy is fetched from the L2 cache. With this atomic operation, L1 cache lines will be left incoherent and not invalidated unless the modified value reaches zero. The threads waiting for zero can then keep checking the local value in their L1 cache, even if that local value is inaccurate, until the value is actually zero. This means that one thread might modify the value as far as the L2 is concerned, without generating a miss for other threads.
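The difference between an ordinary atomic update and store_add_coherence_on_zero may be illustrated with the following simplified model. The class L2Word and its fields are hypothetical; the model tracks a single word, represents each L1 copy as an optional value, and counts fetches from the L2.

```python
class L2Word:
    """Hypothetical model of one memory word and its per-core L1 copies."""

    def __init__(self, value, num_cores):
        self.value = value
        self.l1_copy = [None] * num_cores   # None means no valid L1 copy
        self.l2_fetches = 0

    def load(self, core):
        if self.l1_copy[core] is None:      # L1 miss: fetch from the L2
            self.l1_copy[core] = self.value
            self.l2_fetches += 1
        return self.l1_copy[core]

    def store_add(self, delta):
        # Ordinary atomic update: always invalidate the L1 copies.
        self.value += delta
        self.l1_copy = [None] * len(self.l1_copy)

    def store_add_coherence_on_zero(self, delta):
        # Invalidate the L1 copies only when the new value is zero.
        self.value += delta
        if self.value == 0:
            self.l1_copy = [None] * len(self.l1_copy)
```

A core polling for zero in this model continues to hit its stale L1 copy until the value in the L2 actually reaches zero, at which point the copy is invalidated and the next load fetches the new value.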
In general, the operations in the above table, called “atomic” have an effect that the regular load and store does not have. They load, read, modify and write back in one atomic operation, even within the context of speculation. This type of operation works in the context of speculation, because of the loop back in the EDRAM pipeline. It executes conflict checking equivalent to a sequence of a load and a store. Before the atomic operation is loading, it does the version aggregation discussed further in the provisional applications incorporated by reference above.
In a further aspect, a device and method for copying performance counter data are provided. The device, in one aspect, may include at least one processor core, a memory, and a plurality of hardware performance counters operable to collect counts of selected hardware-related activities. A direct memory access unit includes a DMA controller operable to copy data between the memory and the plurality of hardware performance counters. An interconnecting path connects the processor core, the memory, the plurality of hardware performance counters, and the direct memory access unit.
A method of copying performance counter data, in one aspect, may include establishing a path between a direct memory access unit to a plurality of hardware performance counter units, the path further connecting to a memory device. The method may also include initiating a direct memory access unit to copy data between the plurality of hardware performance counter units and the memory device.
Multicore chips are those computer chips with more than a single core. The extra cores may be used to offload the work of setting up a transfer of data between the performance counters and memory without perturbing the data being generated from the running application. A direct memory access (DMA) mechanism allows software to specify a range of memory to be copied from and to, and hardware to copy all of the memory in the specified range. Many chip multiprocessors (CMP) and systems on a chip (SoC) integrate a DMA unit. The DMA engine is typically used to facilitate data transfer between network devices and the memory, or between I/O devices and memory, or between memory and memory.
Many chip architectures include a performance monitoring unit (PMU). This unit contains a number of performance counters that count a number of events in the chip. The performance counters are typically programmable to select particular events to count. This unit can count events from some or all of the processors and from other components in the system, such as the memory system, or the network system.
If software wants to use the values from performance counters, it has to read performance counters. Counters are read out using a software program which reads the memory area where performance counters are mapped by reading counters sequentially. For a system with large number of counters or with large counter access latency, executing the code to get these counter values has a substantial impact on program performance.
The mechanism of the present disclosure combines hardware and software that allows for efficient, non-obtrusive movement of hardware performance counter data between the registers that hold that data and a set of memory locations. To be able to utilize a hardware DMA unit available on the chip for copying performance counters into the memory, the hardware DMA unit is connected via paths to the hardware performance counters and registers. The DMA is initialized to perform data copy in the same way it is initialized to perform the copy of any other memory area, by specifying the starting source address, the starting destination address, and the data size of data to be copied. By offloading data copy from a processor to the DMA engine, the data transfer may occur without disturbing the core on which the measured computation or operation (i.e., monitoring and gathering performance counter data) is occurring.
A register/memory location provides the start memory location of the first destination memory address. For example, the software, or an operating system, or the like pre-allocates memory area to provide space for writing and storing the performance counter data. Additional register and/or memory location provides the start memory location of the first source memory address. This source address corresponds to the memory address of the first performance counter to be copied. Additional register and/or memory location provides the size of data to be copied, or number of performance counters to be copied.
On a multicore chip, for example, the software running on an extra core, i.e., one not dedicated to gather performance data, may decide which of the performance counters to copy, utilize the DMA engine by setting up the copy, initiate the copy, and then proceed to perform other operations or work.
Both the performance counter unit 71102 and the memory 71108 are accessible from the DMA unit 71106. An operating system or software may allocate an area in memory 71108 for storing the counter data of the performance counters 71104. The operating system or software may decide which performance counter data to copy, whether the data is to be copied from the performance counters 71104 to the memory 71108 or the memory 71108 to the performance counters 71104, and may prepare a packet for DMA and inject the packet into the DMA unit 71106, which initiates memory-to-memory copy, i.e., between the counters 71104 and memory 71108. In one aspect, the control packet for DMA may contain a packet type identification, which specifies that this is a memory-to-memory transfer, a starting source address of data to be copied, size in bytes of data to be copied, and a destination address where the data are to be copied. The source addresses may map to the performance counter device 71102, and destination address may map to the memory device 71108 for data transfer from the performance counters to the memory.
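The control packet and the resulting memory-to-memory copy may be modeled as in the sketch below. It assumes a flat address space in which the performance counters are memory mapped. The type code MEM_TO_MEM, the field names, and the addresses used in the usage lines are hypothetical and are not taken from any actual DMA unit.

```python
from dataclasses import dataclass

MEM_TO_MEM = 1  # hypothetical packet type code


@dataclass
class DmaPacket:
    packet_type: int   # identifies a memory-to-memory transfer
    source: int        # starting source address
    size: int          # number of bytes to copy
    destination: int   # starting destination address


def dma_copy(address_space: bytearray, pkt: DmaPacket) -> None:
    """Copy pkt.size bytes within a flat, memory-mapped address space."""
    if pkt.packet_type != MEM_TO_MEM:
        raise ValueError("unsupported packet type")
    limit = len(address_space)
    if pkt.source + pkt.size > limit or pkt.destination + pkt.size > limit:
        raise ValueError("transfer exceeds the address space")
    data = address_space[pkt.source:pkt.source + pkt.size]
    address_space[pkt.destination:pkt.destination + pkt.size] = data


# Usage: two 8-byte counters mapped at a hypothetical COUNTER_BASE are
# copied to a pre-allocated area at a hypothetical MEMORY_BASE.
COUNTER_BASE, MEMORY_BASE = 0x00, 0x40
space = bytearray(0x80)
space[0:16] = (7).to_bytes(8, "little") + (9).to_bytes(8, "little")
dma_copy(space, DmaPacket(MEM_TO_MEM, COUNTER_BASE, 16, MEMORY_BASE))
```

Swapping the source and destination fields in the packet gives the transfer in the opposite direction, from the memory to the counters.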
In another aspect, data transfer can be performed in both directions, not only from the performance counter unit to the memory, but also from the memory to the performance counter unit. Such a transfer may be used for restoring the value of the counter unit, for example.
Multiple cores 71112 may be running different processes, and in one aspect, the software that prepares the DMA packet and initiates the DMA data transfer may be running on a core that is separate from the process running on another core that is gathering the hardware performance monitoring data. In this way, the core running a measure computation, i.e., that gathers the hardware performance monitoring data, need not be disturbed or interrupted to perform the copying to and from the memory 71108.
A device and method for hardware supported performance counter data collection are provided. The device, in one aspect, may include a plurality of performance counters operable to collect one or more counts of one or more selected activities. A first storage element may be operable to store an address of a memory location, and a second storage element may be operable to store a value indicating whether the hardware should begin copying. A state machine is operable to detect the value in the second storage element and trigger hardware copying of data in selected one or more of the plurality of performance counters to the memory location whose address is stored in the first storage element.
The present disclosure, in one aspect, describes hardware support to facilitate transferring the performance counter data between the hardware performance counters and memory. One or more hardware capability and configurations are disclosed that allow software to specify a memory location and have the hardware engine copy the counters without the software getting involved. In another aspect, the software may specify a sequence of memory locations and have the hardware perform a sequence of copies from the hardware performance counter registers to the sequence of memory locations specified by software. In this manner, the hardware need not interrupt the software.
The mechanism of the present disclosure combines hardware and software capabilities to allow for efficient movement of hardware performance counter data between the registers that hold that data and a set of memory locations. The following description of the embodiments uses the term “hardware” interchangeably with the state machine and associated registers used for controlling the automatic copying of the performance counter data to memory. Further, the term “software” may refer to the hypervisor, operating system, or another tool that either of those layers has provided direct access to. For example the operating system could set up a mapping, allowing a tool with the correct permission, to interact directly with the hardware state machine.
A direct memory access (DMA) engine may be used to copy the values of performance monitoring counters from the performance monitoring unit directly to the memory without intervention of software. The software may specify the starting address of the memory where the counters are to be copied, and a number of counters to be copied.
After initialization of the DMA engine in the performance monitoring unit by software, other functions are performed by hardware. Events are monitored and counted, and an element such as a timer keeps track of time. After a time interval expires, or another triggering event, the DMA engine starts copying counter values to the predestined memory locations. For each performance counter, the destination memory address is calculated, and a set of signals for writing the counter value into the memory is generated. After all counters are copied to memory, the timer (or another triggering event) may be reset.
The device 72101 may be built into a microprocessor and includes a plurality of hardware performance counters 72102, which are registers used to store the counts of hardware-related activities within a computer. Examples of activities of which the counters 72102 may store counts may include, but are not limited to, cache misses, translation lookaside buffer (TLB) misses, the number of instructions completed, number of floating point instructions executed, processor cycles, input/output (I/O) requests, and other hardware-related activities and events.
Other examples may include, but are not limited to, events related to the network activity, like number of packets sent or received in each of networks links, errors when sending or receiving the packets to the network ports, or errors in the network protocol, events related to the memory activity, for example, number of cache misses for any or all cache level L1, L2, L3, or the like, or number of memory requests issued to each of the memory banks for on-chip memory, or number of cache invalidates, or any memory coherency related events. Yet more examples may include, but are not limited to, events related to one particular processor's activity in a chip multiprocessor systems, for example, instructions issued and completed, integer and floating-point, for the processor 0, or for any other processor, the same type of counter events but belonging to different processors, for example, the number of integer instructions issued in all N processors. Those are some of the examples activities and events the performance counters may collect.
A register or a memory location 72104 may specify the frequency at which the hardware state machine should copy the hardware performance counter registers 72102 to memory. Software, such as the operating system, or a performance tool the operating system has enabled to directly access the hardware state machine control registers, may set this register to the frequency at which it wants the hardware performance counter registers 72102 sampled.
Another register or memory location 72109 may provide the start memory location of the first memory address 72108. For example, the software program running in address space A, may have allocated memory to provide space to write the data. A segmentation fault may be generated if the specific memory location is not mapped writable into the user address space A, that interacted with the hardware state machine 72122 to set up the automatic copying.
Yet another register or memory location 72110 may indicate the length of the memory region to be written to. For each counter to be copied, hardware calculates the destination address, which is saved in the register 72106.
For the hardware to automatically and directly perform copy of data from the performance counters 72102 to store in the memory area 72114, the software may set a time interval in the register 72104. The time interval value is copied into the timer 72120 that counts down, which upon reaching zero, triggers a state machine 72122 to invoke copying of the data to the address of memory specified in register 72106. For each new value to be stored, the current address in register 72106 is calculated. When the interval timer reaches zero, the hardware may perform the copying automatically without involving the software.
In addition, or instead of using the time interval register 72104 and timer 72120, an external signal 72130 generated outside of the performance monitoring unit may be used to start direct copying. For example, this signal may be an interrupt signal generated by a processor, or by some other component in the system.
Optionally, a register or memory location 72128 may contain a bit mask indicating which of the hardware performance counter registers 72102 should be copied to memory. This allows software to choose a subset of critical registers. Copying and storing only a selected set of hardware performance counters may be more efficient in terms of the amount of the memory consumed to gather the desired data.
In one aspect, hardware may be responsible for ensuring that memory address is valid. In this embodiment, state machine 72122 checks for each address if it is within the memory area specified by the starting address, as specified in 72109, and length value, as specified in 72110. In the case the address is beyond that boundary, an interrupt signal for segmentation fault may be generated for the operating system.
In another aspect, software may be responsible to keep track of the available memory and to provide sufficient memory for copying performance counters. In this embodiment, for each counter to be copied, hardware calculates the next address without making any address boundary checks.
Another register or memory location 72112 may store a value that specifies the number of times to write the above specified hardware performance counters to memory 72114. This register may be decremented every time the DMA engine starts copying all, or selected, counters to the memory. After this register reaches zero, the counters are no longer copied until the next re-programming by software. Alternatively or additionally, the value may include an on or off bit which indicates whether the hardware should collect data or not.
The memory location for writing and collecting the counter data may be a pre-allocated block 72108 at the memory 72114 such as L2 cache or another with a starting address (e.g., specified in 72109) and a predetermined length (e.g., specified in 72110). In one embodiment, the block 72108 may be written once until the upper boundary is reached, after which an interrupt signal may be initialized, and further copying is stopped. In another embodiment, memory block 72108 is arranged as a circular buffer, and it is continuously overwritten each time the block is filled. In this embodiment, another register 72118 or memory location may be used to store an indication as to whether the hardware should wrap back to the beginning of the area, or stop when it reaches the end of the memory region or block specified by software. Memory device 72114 that stores the performance counter data may be an L2 cache, L3 cache, or memory.
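The interaction of the interval timer, bit mask, write count, and wrap indication described above may be summarized by the following behavioral sketch. It is a software model under stated assumptions: memory is modeled as a list with one counter value per slot, the timer counts abstract ticks, and the class name CounterSampler and its field names are hypothetical. The comments map the fields to the reference numerals used in the text.

```python
class CounterSampler:
    """Hypothetical model of the timer-driven copy state machine."""

    def __init__(self, counters, memory, start, length, interval,
                 mask, num_writes, wrap):
        self.counters = counters      # counter values (72102)
        self.memory = memory          # list standing in for memory 72114
        self.start, self.length = start, length   # 72109, 72110
        self.interval = interval      # time interval register 72104
        self.timer = interval         # countdown timer 72120
        self.mask = mask              # bit mask of counters to copy 72128
        self.num_writes = num_writes  # remaining copies 72112
        self.wrap = wrap              # wrap-or-stop indication 72118
        self.dest = start             # current destination address 72106
        self.stopped = False

    def tick(self):
        if self.stopped or self.num_writes == 0:
            return
        self.timer -= 1
        if self.timer > 0:
            return
        self.timer = self.interval    # reload for the next interval
        self.num_writes -= 1
        for i, value in enumerate(self.counters):
            if not (self.mask >> i) & 1:
                continue              # counter not selected by the mask
            if self.dest >= self.start + self.length:
                if not self.wrap:
                    self.stopped = True   # hardware would raise an interrupt
                    return
                self.dest = self.start    # circular buffer: wrap around
            self.memory[self.dest] = value
            self.dest += 1
```

For example, with three counters, a mask of 0b101, an interval of two ticks, and a four-slot block starting at slot 2, each expiry of the timer writes the first and third counters into the next two slots of the block.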
A time interval register 72204 may store a value that specifies the frequency of copying to be performed, for example, a time value that specifies to perform a copy every certain time interval. The value may be specified in seconds, milliseconds, instruction cycles, or others. A software entity such as an operating system or another application may write the value in the register 72204. The time interval value 72204 is set in the timer 72220 for the timer 72220 to begin counting the time. Upon expiration of the time, the timer 72220 notifies the state machine 72222 to trigger the copying.
The state machine 72222 reads the address value of 72206 and begins copying the data of the performance counters specified in the counter list register 72224 to the memory location 72208 of the memory 72214 specified in the address register 72206. When the copying is done, the timer 72220 is reset with the value specified in the time interval 72204, and the timer 72220 begins to count again.
The register 72224 or another memory location stores the list of performance counters, whose data should be copied to memory 72214. For example, each bit stored in the register 72224 may correspond to one of the performance counters. If a bit is set, for example, the associated performance counter should be copied. If a bit is not set, for example, the associated performance counter should not be copied.
The memory location for writing and collecting the counter data may be a set of distinct memory blocks specified by set of addresses and lengths. Another set of registers or memory locations 72209 may provide the set of start memory locations of the memory blocks 72208. Yet another set of registers or memory locations 72210 may indicate the lengths of the set of memory blocks 72208 to be written to. The starting addresses 72209 and lengths 72210 may be organized as a list of available memory locations.
A hardware mechanism, such as a finite state machine 72222 in the performance counter unit 72201, may point from memory region to memory region as each one gets filled up. The state machine may use a current pointer register or memory location 72216 to indicate where in the multiple specified memory regions the hardware is currently copying to, or which of the pairs of start address 72209 and length 72210 it is currently using from the performance counter unit 72201.
The state machine 72222 uses the current address and length registers, as specified in 72216, to calculate the destination address 72206. The value in 72216 stays unchanged until the state machine identifies that the memory block is full. This condition is identified by comparing the destination address 72206 to the sum of the start address 72209 and the memory block length 72210. Once a memory block is full, the state machine 72222 increments the current register 72216 to select a different pair of start register 72209 and length register 72210.
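The selection of successive start address and length pairs may be sketched as follows. The model is illustrative only: memory is a list with one counter value per slot, the class name MultiBlockWriter is hypothetical, and the wrap-or-stop indication is not modeled, so the writer simply reports when every block is full.

```python
class MultiBlockWriter:
    """Hypothetical model of copying into a list of memory blocks."""

    def __init__(self, memory, starts, lengths):
        self.memory = memory
        self.starts, self.lengths = starts, lengths   # 72209, 72210
        self.current = 0                # current pair selector 72216
        self.dest = starts[0]           # destination address 72206

    def write(self, value):
        """Store one counter value; return False once every block is full."""
        while self.current < len(self.starts):
            end = self.starts[self.current] + self.lengths[self.current]
            if self.dest < end:
                self.memory[self.dest] = value
                self.dest += 1
                return True
            self.current += 1           # block full: select the next pair
            if self.current < len(self.starts):
                self.dest = self.starts[self.current]
        return False
```

The block-full condition in the sketch is the comparison described above, namely the destination address reaching the sum of the start address and the block length.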
Another register or memory location 72218 may be used to store an indication as to whether the hardware should wrap back to the beginning of the area, or stop when it reaches the end of the memory region or block specified by software.
Another register or memory location 72212 may store a value that specifies the number of times to write the above specified hardware performance counters to memory 72214. Each time the state machine 72222 initiates copying and/or storing, the value of the number of writes 72212 is decremented. If the number reaches zero, the copying is not performed. Further copying from the performance counters 72202 to memory 72214 may be re-established after an intervention by software.
In another aspect, an external interrupt 72230 or another signal may trigger the state machine 72222 or another hardware component to start the copying. The external signal 72230 may be generated outside of the performance monitoring unit 72201 to start direct copying. For example, this signal may be an interrupt signal generated by a processor, or by some other component in the system.
While the above description referred to a timer element that detects the time expiration for triggering the state machine, it should be understood that other devices, elements, or methods may be utilized for triggering the state machine. For instance, an interrupt generated by another element or device may trigger the state machine to begin copying the performance counter data.
While there are many operations that need to occur as part of a context switch, this disclosure focuses the description on those that pertain to the hardware performance counter infrastructure. In preparation for performing a context switch, the operating system, which knows of the characteristics and capabilities of the computer, will have set aside memory associated with each process commensurate with the number of hardware performance control registers and data values.
One embodiment of the hardware implementation to perform the automatic saving and restoring of data may utilize two control registers associated with the infrastructure, i.e., the hardware performance counter unit. One register, R1 (for convenience of naming), 73107, is designated to hold the memory address that data is to be copied to or from. Another register, for example, a second register R2, 73104, indicates whether and how the hardware should perform the automatic copying process. The value of the second register is normally zero. When the operating system wishes to initiate a copy of the hardware performance information to memory, it writes a value in the register to indicate this mode. When the operating system wishes to initiate a copy of the hardware performance values from memory, it writes another value in the register that indicates this mode. For example, when the operating system wishes to initiate a copy of the hardware performance information to memory it may write a “1” to the register, and when the operating system wishes to initiate a copy of the hardware performance values from memory it may write a “2” to the register. Any other values or indications may be utilized. This may be an asynchronous operation, i.e., the hardware and the operating system may operate or function asynchronously. An asynchronous operation allows the operating system to continue performing other tasks associated with the context switch while the hardware automatically stores the data associated with the performance monitoring unit and sets an indication when finished that the operating system can check to ensure the process was complete. Alternatively, in another embodiment, the operation may be performed synchronously by setting a third register. For example, R3, 73108 can be set to “1” indicating that the hardware should not return control to the operating system after the write to R2 until the copying operation has completed.
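The use of the control registers during a context switch may be illustrated by the following sketch. It models only the synchronous case, in which the copy completes before control returns to the caller, and it uses the example values “1” and “2” given above. The class name PerfCounterUnit, the done flag, and the return of R2 to zero after the copy are assumptions made for illustration.

```python
IDLE, SAVE_TO_MEMORY, RESTORE_FROM_MEMORY = 0, 1, 2   # example R2 values


class PerfCounterUnit:
    """Hypothetical synchronous model of the save/restore registers."""

    def __init__(self, num_counters, memory):
        self.counters = [0] * num_counters
        self.memory = memory        # list standing in for the memory device
        self.r1_address = 0         # R1: address to copy to or from
        self.r2_mode = IDLE         # R2: whether and how to copy
        self.done = False           # completion indication (assumption)

    def write_r2(self, mode):
        # Writing R2 starts the copy; here it completes before returning.
        self.r2_mode = mode
        self.done = False
        self._run()

    def _run(self):
        n = len(self.counters)
        a = self.r1_address
        if self.r2_mode == SAVE_TO_MEMORY:
            self.memory[a:a + n] = self.counters
        elif self.r2_mode == RESTORE_FROM_MEMORY:
            self.counters = list(self.memory[a:a + n])
        self.r2_mode = IDLE         # assumption: R2 returns to zero
        self.done = True
```

An operating system model would write the address for the outgoing process into R1 and write “1” to R2, then write the address for the incoming process into R1 and write “2” to R2.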
A memory device 73114, which may be an L2 cache or other memory, stores various data related to the running of the computer system and its applications. A register 73106 stores an address location in memory 73114 for storing the hardware performance counter information associated with the switched out process. For example, when the operating system determines it needs to switch out a given process A, it looks up in its data structures the previously allocated memory addresses (e.g., in 73114) for process A's hardware performance counter information and writes the beginning value of that address range into a register 73106. A register 73107 stores an address location in memory 73114 for loading the hardware performance counter information associated with the switched in process. For example, when the operating system determines it needs to switch in a given process B, it looks up in its data structures the previously allocated memory addresses (e.g., in 73114) for process B's hardware performance counter information and writes the beginning value of that address range into a register 73107.
Context switch register 73104 stores a value that indicates the mode of copying, for example, whether the hardware should start copying, and if so, whether the copying should be from the performance counters 73112 to memory 73114, or from the memory 73114 to the performance counters 73112, for example, depending on whether the process is being context switched in or out. Table 1, for example, shows possible values that may be stored in or written into the context switch register 73104 as an indication for copying. Any other values may be used.
The operating system, for example, writes those values into the register 73104, according to which the hardware performs its copying.
A control state machine 73110 starts the context switch operation of the performance counter information when the register 73104 holds values that indicate that the hardware should start copying. If the value in the register 73104 is 1 or 2, the circuitry of the performance counter unit 73102 stores the current context (i.e., the information in the performance counters 73112) of the counters 73112 to the memory area 73114 specified in the context address register 73106. This actual data copying can be performed by a simple direct memory access engine (DMA), not shown in the picture, which generates all bus signals necessary to store data to the memory. Alternatively, this functionality can be embedded in the state machine 73110. All performance counters and their configurations are saved to the memory starting at the address specified in the register 73106. The actual arrangement of counter values and configuration values in the memory addresses can be different for different implementations, and does not change the scope of this invention.
If the value in the register 73104 is 3, or is 1 and the copy-out step described above is completed, the copy-in step starts. The new context (i.e., hardware performance counter information associated with the process being switched in) is loaded from the memory area 73114 indicated in the context address register 73107. In addition, the values of performance counters are copied from the memory back to the performance counters 73112. The exact arrangement of counter values and configuration values does not change the scope of this invention.
When the copying is finished, the state machine 73110 sets the context switch register to a value (e.g., “0”) that indicates that the copying is completed. In another embodiment, the performance counters may generate an interrupt to signal the completion of copying. The interrupt may be used to notify the operating system that the copying has completed. In one embodiment, the hardware clears the context switch register 73104. In another embodiment, the operating system resets the context switch register value 73104 (e.g., “0”) to indicate no copying.
The state machine 73110 copies the memory address stored in the context address register 73107 to the context address register 73106. Thus, the new context address is free to be used in the future for the next context switch, and the current context will be copied back to its previous memory location.
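The copy-out and copy-in behavior controlled by the context switch register can be illustrated with a small software model. The following Python sketch is not the disclosed hardware: the class name `PerfCounterUnit`, the dictionary standing in for memory 73114, and the mode encoding (0 idle, 1 copy out then copy in, 2 copy out only, 3 copy in only) are assumptions inferred from the surrounding description, and the point at which the load address is copied to the save address is likewise assumed.

```python
# Illustrative software model of the context switch state machine 73110.
# The mode encoding is an assumption: 0 idle, 1 save then restore,
# 2 save only, 3 restore only.
IDLE, SAVE_AND_RESTORE, SAVE_ONLY, RESTORE_ONLY = 0, 1, 2, 3


class PerfCounterUnit:
    def __init__(self, num_counters):
        self.counters = [0] * num_counters   # performance counters 73112
        self.save_addr = 0                   # context address register 73106
        self.load_addr = 0                   # context address register 73107
        self.mode = IDLE                     # context switch register 73104

    def run(self, memory):
        """Perform the copy requested by self.mode, then clear the mode."""
        n = len(self.counters)
        if self.mode in (SAVE_AND_RESTORE, SAVE_ONLY):
            # Copy-out: store the current context at the save address.
            for i in range(n):
                memory[self.save_addr + i] = self.counters[i]
        if self.mode in (SAVE_AND_RESTORE, RESTORE_ONLY):
            # Copy-in: load the new context from the load address.
            for i in range(n):
                self.counters[i] = memory.get(self.load_addr + i, 0)
            # The loaded context will later be saved back where it came from.
            self.save_addr = self.load_addr
        self.mode = IDLE                     # signals completion to the OS
```

In this model the operating system would set `save_addr`, `load_addr`, and `mode`, and a cleared `mode` plays the role of the completion indication.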
In another embodiment of the implementation, the second context address register 73107 may not be needed. That is, the operating system may use one context address register 73106 for indicating the memory address to copy to or to copy from, for context switching out or context switching in, respectively. Thus, for example, register 73106 may be also used for indicating a memory address from where to context switch in the hardware performance counter information associated with a process being context switched in, when the operating system is context switching back in a process that was context switched out previously.
An additional number of registers or the like, or different configurations for the hardware performance counter unit, may be used to accomplish the automatic saving and restoring of contexts by the hardware, for example, while the operating system may be performing other operations or tasks, and/or so that the operating system or the software or the like need not individually read the counters and associated controls.
Referring to
The operating system or the like may proceed in performing other operations while the hardware copies the data from the hardware performance control and data registers. At 73208, after the hardware finishes copying, the hardware resets the value at register R2, for example, to “0” to indicate that the copying is done. At 73208, prior to completing the context switch, the operating system or the like checks the value of register R2 to make sure it is “0” or another value, which indicates that the hardware has finished the copy.
For context switching back in process B, the operating system or the like may perform a similar procedure. For example, the operating system writes the beginning of the range of addresses used for storing hardware performance counter information associated with process B into register R1 (or another such designated memory location), and writes a value (e.g., “3”) into register R2 to indicate to the hardware to start copying from the memory location specified in register R1 to the hardware performance counters. The operating system or the like may proceed with other context restoring operations. Prior to returning control to the process, the operating system verifies that the hardware finished its copying function, for example, by checking the value in R2 (in this example, checking for a “0” value). In this way, the copying of the hardware performance counter information and the other operations needed when performing a context switch can be performed in parallel, or substantially in parallel.
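The operating-system side of this asynchronous handshake can be sketched as follows. This is a hedged illustration only: `FakeCopyHardware` is a hypothetical stand-in that finishes a copy after a fixed number of ticks, `switch_in` is an invented helper name, and the value 3 for copy-in follows the example above rather than any confirmed encoding.

```python
# Hypothetical stand-in for the hardware: a copy takes a few "ticks" to finish.
class FakeCopyHardware:
    def __init__(self, ticks_per_copy=3):
        self.r1 = 0            # R1: memory address to copy to or from
        self.r2 = 0            # R2: copy mode, 0 means idle or finished
        self._remaining = 0
        self._ticks_per_copy = ticks_per_copy

    def write_r2(self, mode):
        """Writing a nonzero mode starts a copy."""
        self.r2 = mode
        self._remaining = self._ticks_per_copy

    def tick(self):
        """Advance the hardware one step; clear R2 when the copy is done."""
        if self.r2 != 0:
            self._remaining -= 1
            if self._remaining <= 0:
                self.r2 = 0


def switch_in(hw, context_addr, other_tasks, copy_in_mode=3):
    """OS-side sequence: start the copy, overlap other work, verify R2 == 0."""
    hw.r1 = context_addr
    hw.write_r2(copy_in_mode)
    done = []
    for task in other_tasks:   # other context-restoring work proceeds meanwhile
        done.append(task())
        hw.tick()
    while hw.r2 != 0:          # final check before returning control
        hw.tick()
    return done
```

The final polling loop corresponds to the check of R2 made before control is returned to the process; in the interrupt-based embodiment that loop would be replaced by waiting for the completion interrupt.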
In another embodiment, rather than having the operating system check a register to determine whether the hardware completed its copying, another register, R3, may be used to indicate to the hardware whether and when the control to the operating system should be returned. For instance, if this register is set to a predetermined value, e.g., “1”, the hardware will not return control to the operating system until the copy is complete. For example, this register, or a bit in another control register, is labeled “interrupt enabled”, and it specifies that an interrupt signal should be raised when the data copy is complete. The operating system performs the other operations that are part of context switching in parallel. Once this interrupt is received, the operating system is informed that all data copying of the performance counters is completed.
The above described examples used register values such as “0”, “1”, “2”, and “3” in explaining the different modes indicated in the register value. It should be understood, however, that any other values may be used to indicate the different modes of copying.
There is further provided hardware support to facilitate the efficient hardware switching and storing of counters. Particularly, in one aspect, the hardware support of the present disclosure allows specification of a set of groups of hardware performance counters, and the ability to switch between those groups without software intervention.
In one embodiment, hardware and software are combined to allow a series of different configurations of hardware performance counter groups to be set up. The hardware may automatically switch between the different configurations at a predefined interval. For the hardware to automatically switch between the different configurations, the software may set an interval timer that counts down, which upon reaching zero, switches to the next configuration in the stored set of configurations. For example, the software may set up the set of configurations that it wants the hardware to switch between and also set a count of the number of hardware configurations it has set up. When the interval timer reaches zero, the hardware may update the currently collected set of hardware counters automatically without involving the software and set up a new group of hardware performance counters to start being collected.
In another aspect, another configuration switching trigger may be utilized instead of a timer element. For example, an interrupt or an external interrupt from another device may be set up to trigger the hardware performance counter reconfiguration or switching periodically, or at a predetermined time or event.
In one embodiment, a register or memory location specifies the number of times to perform the configuration switch. In another embodiment, rather than a count, an on/off binary value may indicate whether hardware should continue switching configurations or not.
Yet in another embodiment, the user may set a register or memory location to indicate that when the hardware switches groups, it should clear performance counters. In still yet another embodiment, a mask register or memory location may be used to indicate which counters should be cleared.
A plurality of configuration registers 74110, 74112 may each include a set of configurations that specify what activities and/or events the counters 74118 should count. For example, configuration 1 register 74110 may specify counter events related to the network activity, like the number of packets sent or received on each of the network links, the errors when sending or receiving the packets to the network ports, or the errors in the network protocol. Similarly, configuration 2 register 74112 may specify a different set of configurations, for example, counter events related to the memory activity, for instance, the number of cache misses for any or all cache levels L1, L2, L3, or the like, or the number of memory requests issued to each of the memory banks for on-chip memory, or the number of cache invalidates, or any memory coherency related events. Yet another counter configuration can include counter events related to one particular processor's activity in a chip multiprocessor system, for example, instructions issued or instructions completed, integer and floating-point instructions, for the processor 0, or for any other processor. Yet another counter configuration may include the same type of counter events but belonging to different processors, for example, the number of integer instructions issued in all N processors. Any other counter configurations are possible. In one aspect, software may set up those configuration registers to include a desired set of configurations by writing to those registers.
Initially, the state machine may be set to select a configuration (e.g., 74110 or 74112), for example, using a multiplexer or the like at 74114. A multiplexer or the like at 74116 then selects from the activities and/or events 74120, 74122, 74124, 74126, 74128, etc., the activities and/or events specified in the selected configuration (e.g., 74110 or 74112) received from the multiplexer 74114. Those selected activities and/or events are then sent to the counters 74118. The counters 74118 accumulate the counts for the selected activities and/or events.
A time interval component 74104 may be a register or the like that stores a data value. In another aspect, the time interval component 74104 may be a memory location or the like. Software such as an operating system or another program may set the data value in the time interval 74104. A timer 74106 may be another register that counts down from the value specified in the time interval register 74104. In response to the count down value reaching zero, the timer 74106 notifies a control state machine 74108. For instance, when the timer reaches zero, this condition is recognized, and a control signal connected to the state machine 74108 becomes active. Then the timer 74106 may be reset to the time interval value to start a new period for collecting data associated with the next configuration of hardware performance counters.
In response to receiving a notification from the timer 74106, the control state machine 74108 selects the next configuration register, e.g., configuration 1 register 74110 or configuration 2 register 74112 to reconfigure activities tracked by the performance counters 74118. The selection may be done using a multiplexer 74114, for example, that selects between the configuration registers 74110 and 74112. It should be noted that while two configuration registers are shown in this example, any number of configuration registers may be implemented in the present disclosure. Activities and/or events (e.g., as shown at 74120, 74122, 74124, 74126, 74128, etc.) are selected by the multiplexer 74116 based on the configuration selected at the multiplexer 74114. Each counter at 74118 accumulates counts for the selected activities and/or events.
In another embodiment, there may be a register or memory location labeled “switch” 74130 for indicating the number of times to perform the configuration switch. In yet another embodiment, the indication to switch may be provided by an on/off binary value. In the embodiment that specifies a number of permitted switches between the configurations, the initial value may be specified by software. Each time the state machine 74108 initiates state switching, the number of remaining switches is decremented. Once the number of allowed configuration switches reaches zero, all further configuration change conditions are ignored. Further switching between the configurations may be re-established after intervention by software, for instance, if the software re-initializes the switch value.
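The interplay of the time interval, the timer, the configuration multiplexer, and the switch count can be sketched in software. The Python model below is an illustration under stated assumptions: the class name `ConfigSwitcher`, the round-robin selection order, and the choice to reload the timer even after the switch budget is exhausted are not specified by the description.

```python
class ConfigSwitcher:
    """Rough model of timer-driven configuration switching."""

    def __init__(self, configs, interval, switch_count):
        self.configs = configs             # configuration registers, e.g. 74110, 74112
        self.interval = interval           # time interval register 74104
        self.timer = interval              # timer 74106, counts down
        self.switches_left = switch_count  # "switch" register 74130
        self.selected = 0                  # index chosen by multiplexer 74114

    def tick(self):
        """Advance one time unit; rotate configuration when the timer hits zero."""
        self.timer -= 1
        if self.timer > 0:
            return
        self.timer = self.interval         # reload for the next period
        if self.switches_left == 0:
            return                         # budget exhausted: ignore the trigger
        self.switches_left -= 1
        self.selected = (self.selected + 1) % len(self.configs)

    def active_events(self):
        """Events currently routed to the counters by the event multiplexer."""
        return self.configs[self.selected]
```

Re-initializing `switches_left` corresponds to the software intervention that re-enables switching.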
In addition, a register or memory location “clear” 74132 may be provided to indicate whether to clear the counters when the configuration switch occurs. In one embodiment, this register has only one bit, to indicate if all counter values have to be cleared when the configuration is switched. In another embodiment, this register has a number of bits M+1, where M is the number of performance counters 74118. These register or memory values may be a mask register or memory location for indicating which of M counters should be cleared. In this embodiment, when a configuration switching condition is identified, the state machine 74108 clears the counters and selects different counter events by setting appropriate control signals for the multiplexer 74116. If the clear mask is used, only the selected counters will be cleared. This may be implemented, for example, by AND-ing the clear mask register bits 74132 with the “clear registers” signal generated by the state machine 74108 and feeding the result to the performance counters 74118.
In addition to, or instead of, using the time interval register 74104 and timer 74106, an external signal 74140 generated outside of the performance monitoring unit may be used to start reconfiguration. For example, this signal may be an interrupt signal generated by a processor, or by some other component in the system. In response to receiving this external signal, the state machine 74108 may start reconfiguration in the same way as described above.
At 74206, in response to detecting that the time interval set in the time interval register has passed, the timer element signals or otherwise notifies the state machine controlling the configuration register selection. At 74208, the state machine selects the next configuration, for example, stored in a register. For example, the performance counters may have been providing counts for activities specified in configuration register A. After the state machine 74108 selects the next configuration, for example, configuration register B, the performance counters start counting the activities specified in configuration register B, thus reconfiguring the performance counters. Once the state machine switches configuration, the timer element again starts counting the time. For example, the timer element may again read the value from the time interval register and, for instance, start counting down from that number until it reaches zero. In the present disclosure, any number of configurations, for example, each stored in a register, can be supported.
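The masked clear can be expressed compactly in software. The following Python sketch is illustrative only: the function name `clear_on_switch` and the convention that bit i of the mask corresponds to counter i are assumptions, not details given in the description.

```python
def clear_on_switch(counters, clear_mask, clear_signal):
    """Return the counter values after a configuration switch.

    Bit i of clear_mask selects counter i. A counter is zeroed only when both
    its mask bit and the state machine's clear signal are set (logical AND).
    """
    return [
        0 if (clear_signal and (clear_mask >> i) & 1) else value
        for i, value in enumerate(counters)
    ]
```

For instance, with four counters and a mask of `0b0101`, an asserted clear signal would zero counters 0 and 2 and leave counters 1 and 3 unchanged.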
As described above, the desired time intervals for multiplexing (i.e., reconfiguring) are programmable. Further, the counter configurations are also programmable. For example, the software may set the desired configurations in the configuration registers.
There is further provided, in one aspect, hardware support to facilitate the efficient counter reconfiguration, OS switching and storing of hardware performance counters. Particularly, in one aspect, the hardware support of the present disclosure allows specification of a set of groups of hardware performance counters, and the ability to switch between those groups without software intervention. Hardware switching may be performed, for example, for reconfiguring the performance counters, for instance, to be able to collect information related to different sets of events and activities occurring on a processor or system. Hardware switching also may be performed, for example, as a result of operating system context switching that occurs between the processes or threads. The hardware performance counter data may be stored directly to memory and/or restored directly from memory, for example, without software intervention, for instance, upon reconfiguration of the performance counters, operating system context switching, and/or at a predetermined interval or time.
The description of the embodiments herein uses the term “hardware” interchangeably with the state machine and associated registers used for controlling the automatic copying of the performance counter data to memory. Further, the term “software” may refer to the hypervisor, the operating system, or another tool to which either of those layers has given direct access to the hardware. For example, the operating system could set up a mapping, allowing a tool with the correct permission to interact directly with the hardware state machine.
In one aspect, hardware and software may be combined to allow for the ability to set up a series of different configurations of hardware performance counter groups. The hardware then may automatically switch between the different configurations. For the hardware to automatically switch between the different configurations, the software may set an interval timer that counts down, which upon reaching zero, switches to the next configuration in the stored set of configurations. For example, the software may set up a set of configurations that it wants the hardware to switch between and also set a count of the number of hardware configurations it has set up. In response to the interval timer reaching zero, the hardware may change the currently collected set of hardware performance counter data automatically without involving the software and set up a new group of hardware performance counters to start being collected. The hardware may automatically copy the current value in the counters to the pre-determined area in the memory. In another aspect, the hardware may switch between configurations in response to receiving a signal from another device, or receiving an external interrupt or others. In addition, the hardware may store the performance counter data directly in memory automatically.
In one embodiment, a register or memory location specifies the number of times to perform the configuration switch. In another embodiment, rather than a count, an on/off binary value may indicate whether hardware should continue switching configurations or not. Yet in another embodiment, the user may set a register or memory location to indicate that when the hardware switches groups, it should clear performance counters. In still yet another embodiment, a mask register or memory location may be used to indicate which counters should be cleared.
A plurality of configuration registers 6110, 6112, 6113 may each include a set of configurations that specify what activities and/or events the counters 6118 should count. For example, configuration 1 register 6110 may specify counter events related to the network activity, like the number of packets sent or received on each of the network links, the errors when sending or receiving the packets to the network ports, or the errors in the network protocol. Similarly, configuration 2 register 6112 may specify a different set of configurations, for example, counter events related to the memory activity, for instance, the number of cache misses for any or all cache levels L1, L2, L3, or the like, or the number of memory requests issued to each of the memory banks for on-chip memory, or the number of cache invalidates, or any memory coherency related events. Yet another counter configuration can include counter events related to one particular process's activity in a chip multiprocessor system, for example, instructions issued or instructions completed, integer and floating-point instructions, for the process 0, or for any other process. Yet another counter configuration may include the same type of counter events but belonging to different processes, for example, the number of integer instructions issued in all N processes. Any other counter configurations are possible. In one aspect, software may set up those configuration registers to include a desired set of configurations by writing to those registers.
Initially, the state machine 6108 may be set to select a configuration (e.g., 6110, 6112, . . . , or 6113), for example, using a multiplexer or the like at 6114. A multiplexer or the like at 6116 then selects from the activities and/or events 6120, 6122, 6124, 6126, 6128, etc., the activities and/or events specified in the selected configuration (e.g., 6110 or 6112) received from the multiplexer 6114. Those selected activities and/or events are then sent to the counters 6118. The counters 6118 accumulate the counts for the selected activities and/or events.
A time interval component 6104 may be a register or the like that stores a data value. In another aspect, the time interval component 6104 may be a memory location or the like. Software such as an operating system or another program may set the data value in the time interval 6104. A timer 6106 may be another register that counts down from the value specified in the time interval register 6104. In response to the count down value reaching zero, the timer 6106 notifies a control state machine 6108. For instance, when the timer reaches zero, this condition is recognized, and a control signal connected to the state machine 6108 becomes active. Then the timer 6106 may be reset to the time interval value to start a new period for collecting data associated with the next configuration of hardware performance counters.
In another aspect, an external interrupt or another signal 6170 may trigger the state machine 6108 to begin reconfiguring the hardware performance counters 6118.
In response to receiving a notification from the timer 6106 or another signal, the control state machine 6108 selects the next configuration register, e.g., configuration 1 register 6110 or configuration 2 register 6112 to reconfigure activities tracked by the performance counters 6118. The selection may be done using a multiplexer 6114, for example, that selects between the configuration registers 6110, 6112, 6113. It should be noted that while three configuration registers are shown in this example, any number of configuration registers may be implemented in the present disclosure. Activities and/or events (e.g., as shown at 6120, 6122, 6124, 6126, 6128, etc.) are selected by the multiplexer 6116 based on the configuration selected at the multiplexer 6114. Each counter at 6118 accumulates counts for the selected activities and/or events.
In another embodiment, there may be a register or memory location labeled “switch” 6130 for indicating the number of times to perform the configuration switch. In yet another embodiment, the indication to switch may be provided by an on/off binary value. In the embodiment that specifies a number of permitted switches between the configurations, the initial value may be specified by software. Each time the state machine 6108 initiates state switching, the number of remaining switches is decremented. Once the number of allowed configuration switches reaches zero, all further configuration change conditions are ignored. Further switching between the configurations may be re-established after intervention by software, for instance, if the software re-initializes the switch value.
In addition, a register or memory location “clear” 6132 may be provided to indicate whether to clear the counters when the configuration switch occurs. In one embodiment, this register has only one bit, to indicate if all counter values have to be cleared when the configuration is switched. In another embodiment, this register has a number of bits M+1, where M is the number of performance counters 6118. These register or memory values may be a mask register or memory location for indicating which of M counters should be cleared. In this embodiment, when a configuration switching condition is identified, the state machine 6108 clears the counters and selects different counter events by setting appropriate control signals for the multiplexer 6116. If the clear mask is used, only the selected counters may be cleared. This may be implemented, for example, by AND-ing the clear mask register bits 6132 with the “clear registers” signal generated by the state machine 6108 and feeding the result to the performance counters 6118.
In addition to, or instead of, using the time interval register 6104 and timer 6106, an external signal 6170 generated outside of the performance monitoring unit may be used to start reconfiguration. For example, this signal may be an interrupt signal generated by a processor, or by some other component in the system. In response to receiving this external signal, the state machine 6108 may start reconfiguration in the same way as described above.
In addition, the software may specify a memory location 6136 and have the hardware engine copy the counters without the software getting involved. In another aspect, the software may specify a sequence of memory locations and have the hardware perform a sequence of copies from the hardware performance counter registers to the sequence of memory locations specified by software.
The hardware may be used to copy the values of performance monitoring counters 6118 from the performance monitoring unit 6102 directly to the memory area 6136 without intervention of software. The software may specify the starting address 6109 of the memory where the counters are to be copied, and a number of counters to be copied.
In hardware, events are monitored and counted, and an element such as a timer 6106 keeps track of time. After a time interval expires, or another triggering event, the hardware may start copying counter values to the predetermined memory locations. For each performance counter, the destination memory address 6148 may be calculated, and a set of signals for writing the counter value into the memory may be generated. After the specified counters are copied to memory, the timer (or another triggering event or element) may be reset.
Referring to
In another aspect, instead of a separate register or memory location 6140, the register at 6130 that specifies the number of configuration switches may be also used for specifying the number of memory copies. In this case, the number of reconfigurations and copying to memory may coincide.
Another register or memory location 6109 may provide the start memory location of the first memory address 6148. For example, the software program running in address space A, may have allocated memory to provide space to write the data. A segmentation fault may be generated if the specific memory location is not mapped writable into the user address space A that interacted with the hardware state machine 6108 to set up the automatic copying.
Yet another register or memory location 6138 may indicate the length of the memory region to be written to. For each counter to be copied, hardware calculates the destination address, which is saved in the register 6148.
For the hardware to automatically and directly copy data from the performance counters 6118 to the memory area 6134, the software may set a time interval in the register 6104. The time interval value may be copied into the timer 6106 that counts down, which upon reaching zero, triggers a state machine 6108 to invoke copying of the data to the address of memory specified in register 6148. For each new value to be stored, the current address in register 6148 is calculated. When the interval timer reaches zero, the hardware may perform the copying automatically without involving the software. The time interval register 6104 and the timer 6106 may be utilized by the performance counter unit for both counter reconfiguration and counter copy to memory, or there may be two sets of time interval registers and timers, one used for directly copying the performance counter data to memory, the other used for counter reconfiguration. In this manner, the reconfiguration of the hardware performance counters and copying of hardware performance counter data may occur independently or asynchronously.
In addition, or instead of using the time interval register 6104 and timer 6106, an external signal 6170 generated outside of the performance monitoring unit may be used to start direct copying. For example, this signal may be an interrupt signal generated by a processor or by some other component in the system.
Optionally, a register or memory location 6146 may contain a bit mask indicating which of the hardware performance counter registers 6118 should be copied to memory. This allows software to choose a subset of the registers. Copying and storing only a selected set of hardware performance counters may be more efficient in terms of the amount of the memory consumed to gather the desired data.
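The mask-selected copy and the per-counter destination address calculation can be illustrated as follows. This Python sketch rests on assumptions that the description leaves open: the function name `copy_selected_counters`, an 8-byte word per counter, the bit-i-to-counter-i mask convention, and a packed layout in which skipped counters leave no gaps.

```python
def copy_selected_counters(counters, copy_mask, start_addr, memory, word_size=8):
    """Copy the counters selected by copy_mask into memory, packed contiguously.

    Returns the next free address after the last value written.
    """
    addr = start_addr                 # running destination address (cf. 6148)
    for i, value in enumerate(counters):
        if (copy_mask >> i) & 1:      # bit i of the mask (cf. 6146) selects counter i
            memory[addr] = value
            addr += word_size
    return addr
```

With a packed layout, the memory consumed grows with the number of selected counters rather than with the total number of counters, which is the efficiency benefit noted above.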
The software is responsible for pre-allocating a region of memory sufficiently large to hold the intended data. In one aspect, if the software does not pass a large enough buffer in, a segmentation fault will occur when the hardware attempts to write the first piece of data beyond the buffer provided by the user (assuming the addressed location is unmapped memory).
Another register or memory location 6140 may store a value that specifies the number of times to write the above specified hardware performance counters to memory 6134. This register may be decremented every time the hardware state machine starts copying all, or a subset of counters to the memory. Once this register reaches zero, the counters are no longer copied until the next re-programming by software. Alternatively or additionally, the value may include an on or off bit which indicates whether the hardware should collect data or not.
The memory location for writing and collecting the counter data may be a pre-allocated block 6136 in the memory 6134, such as an L2 cache or another memory, with a starting address (e.g., specified in 6109) and a predetermined length (e.g., specified in 6138). In one embodiment, the block 6136 may be written once until the upper boundary is reached, after which an interrupt signal may be initiated and further copying is stopped. In another embodiment, memory block 6136 is arranged as a circular buffer, and it is continuously overwritten each time the block is filled. In this embodiment, another register 6144 or memory location may be used to store an indication as to whether the hardware should wrap back to the beginning of the area, or stop when it reaches the end of the memory region or block specified by software. Memory device 6134 that stores the performance counter data may be an L2 cache, L3 cache, or memory.
The memory location for writing and collecting the counter data may be a set of distinct memory blocks specified by a set of addresses and lengths. For example, the element shown at 6109 may be a set of registers or memory locations that specify the set of start memory locations of the memory blocks 6134. Similarly, the element shown at 6138 may be another set of registers or memory locations that indicate the lengths of the set of memory blocks to be written to. The starting addresses 6109 and lengths 6138 may be organized as a list of available memory locations. A hardware mechanism, such as a finite state machine 6108 in the performance counter unit 6102, may point from memory region to memory region as each one gets filled up. The state machine may use a current pointer register or memory location 6142 to indicate where in the multiple specified memory regions the hardware is currently copying to, or which of the pairs of start address 6109 and length 6138 it is currently using from the performance counter unit 6102.
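The interplay of the interval timer, the counter bit mask, the repeat count, and the wrap-or-stop indication described above can be illustrated with a small software model. The following is a minimal sketch only, not the disclosed hardware: the class name `CopyEngine`, its field names, and the choice to express addresses as offsets into a Python list are all assumptions made for illustration.

```python
class CopyEngine:
    """Toy model of the counter-to-memory copy logic (all names hypothetical)."""

    def __init__(self, counters, mask, interval, repeats, buf_len, wrap):
        self.counters = counters      # performance counter registers
        self.mask = mask              # bit i set -> counter i is copied
        self.interval = interval      # time interval register value, in cycles
        self.repeats = repeats        # number of copies still to perform
        self.wrap = wrap              # True: circular buffer, False: stop at end
        self.memory = [0] * buf_len   # pre-allocated block
        self.timer = interval         # countdown timer
        self.cursor = 0               # current address, as an offset in the block
        self.stopped = False          # set when a non-wrapping block is full

    def tick(self):
        """Advance one cycle; copy the selected counters when the timer expires."""
        if self.stopped or self.repeats == 0:
            return
        self.timer -= 1
        if self.timer > 0:
            return
        self.timer = self.interval    # reload for the next interval
        self.repeats -= 1
        for i, value in enumerate(self.counters):
            if not (self.mask >> i) & 1:
                continue              # counter not selected by the bit mask
            if self.cursor == len(self.memory):
                if not self.wrap:
                    self.stopped = True   # hardware might raise an interrupt here
                    return
                self.cursor = 0           # wrap back to the start of the block
            self.memory[self.cursor] = value
            self.cursor += 1
```

In this model, software would construct the engine once and the hardware clock would call `tick()` every cycle; no further software involvement is needed until the repeat count is exhausted or the block fills.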
At 6206, the timer register counts down the time interval value, and when the timer count reaches zero, notifies a state machine. Any other method of detecting expiration of the timer value may be utilized. At 6208, the state machine triggers copying of all or selected performance counter register values to specified address in memory. At 6210, hardware copies performance counters to the memory.
At 6212, hardware checks if the configuration of performance counters needs to be changed, by checking a value in another register. If the configuration does not need to be changed, the processing returns to 6204. At 6214, a state machine changes the configuration of the performance counter data.
While there are many operations that need to occur as part of a context switch, this disclosure focuses the description on those that pertain to the hardware performance counter infrastructure. In preparation for performing a context switch, the operating system, which knows of the characteristics and capabilities of the computer, will have set aside memory associated with each process commensurate with the number of hardware performance control registers and data values.
One embodiment of the hardware implementation to perform the automatic saving and restoring of data may utilize two control registers associated with the infrastructure, i.e., the hardware performance counter unit. One register, R1 (for convenience of naming), 6156, is designated to hold the memory address that data is to be copied to or from. Another register, for example, a second register R2, 6160, indicates whether and how the hardware should perform the automatic copying process. The value of second register may be normally a zero. When the operating system wishes to initiate a copy of the hardware performance information to memory it writes a value in the register to indicate this mode. When the operating system wishes to initiate a copy of the hardware performance values from memory it writes another value in the register that indicates this mode. For example, when the operating system wishes to initiate a copy of the hardware performance information to memory it may write a “1” to the register, and when the operating system wishes to initiate a copy of the hardware performance values from memory it may write a “2” to the register. Any other values for such indications may be utilized. This may be an asynchronous operation, i.e., the hardware and the operating system may operate or function asynchronously. An asynchronous operation allows the operating system to continue performing other tasks associated with the context switch while the hardware automatically stores the data associated with the performance monitoring unit and sets an indication when finished that the operating system can check to ensure the process was complete. Alternatively, in another embodiment, the operation may be performed synchronously by setting a third register. For example, R3, 6158, can be set to “1” indicating that the hardware should not return control to the operating system after the write to R2 until the copying operation has completed.
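The register handshake described above can be sketched in software. This is an illustrative model under stated assumptions, not the disclosed circuit: the encodings 0, 1, and 2 for R2 follow the example values given in the text, while the class `CounterUnit`, the function `context_switch`, and the use of a Python list as memory are hypothetical. The hardware copy is modeled as a synchronous call, whereas the text describes it as optionally asynchronous.

```python
IDLE, SAVE, RESTORE = 0, 1, 2   # example R2 values from the text


class CounterUnit:
    """Toy model of the performance counter unit's context-switch registers."""

    def __init__(self, n_counters):
        self.counters = [0] * n_counters
        self.r1 = 0       # memory address to copy to or from
        self.r2 = IDLE    # copy mode; cleared by the unit when the copy is done

    def step(self, memory):
        """Perform the copy requested in R2, then clear R2 to signal completion."""
        n = len(self.counters)
        if self.r2 == SAVE:
            memory[self.r1:self.r1 + n] = self.counters
        elif self.r2 == RESTORE:
            self.counters = list(memory[self.r1:self.r1 + n])
        self.r2 = IDLE


def context_switch(unit, memory, save_addr, restore_addr):
    """OS side: save the outgoing process's counters, then load the incoming ones."""
    for addr, mode in ((save_addr, SAVE), (restore_addr, RESTORE)):
        unit.r1, unit.r2 = addr, mode
        unit.step(memory)               # in hardware this could run asynchronously
        if unit.r2 != IDLE:             # OS checks the completion indication
            raise RuntimeError("copy did not complete")
```

The operating system never reads or writes individual counters in this sketch; it only programs an address and a mode and then checks that R2 has returned to zero.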
Referring to
A memory device 6134, which may be an L2 cache or other memory, stores various data related to the running of the computer system and its applications. A register 6109 stores an address location in memory 6134 for storing the hardware performance counter information associated with the switched out process. For example, when the operating system determines it needs to switch out a given process A, it looks up in its data structures the previously allocated memory addresses (e.g., in 6162) for process A's hardware performance counter information and writes the beginning value of that address range into a register 6109. A register 6156 stores an address location in memory 6134 for loading the hardware performance counter information associated with the switched in process. For example, when the operating system determines it needs to switch in a given process B, it looks up in its data structures the previously allocated memory addresses (e.g., in 6164) for process B's hardware performance counter information and writes the beginning value of that address range into a register 6156.
Context switch register 6160 stores a value that indicates the mode of copying, for example, whether the hardware should start copying, and if so, whether the copying should be from the performance counters 6118 to memory 6134, or from the memory 6134 to the performance counters 6118, for example, depending on whether the process is being context switched in or out. Table 1, for example, shows possible values that may be stored by or written into the context switch register 6160 as an indication for copying. Any other values may be used.
The operating system for example writes those values into the register 6160, according to which the hardware performs its copying.
A control state machine 6108 starts the context switch operation of the performance counter information when the signal 6170 is active, or when the timer 6106 indicates that the hardware should start copying. If the value in the register 6160 is 1 or 2, the circuitry of the performance counter unit 6102 stores the current context (i.e., the information in the performance counters 6118) of the counters 6118 to the memory area 6134 specified in the current address register 6148. All performance counters and their configurations are saved to the memory starting at the address specified in the register 6109. The actual arrangement of counter values and configuration values in the memory addresses can be different for different implementations, and does not change the scope of this invention.
If the value in the register 6160 is 3, or it is 1 and the copy-out step described above is completed, the copy-in step starts. The new context (i.e., hardware performance counter information associated with the process being switched in) is loaded from the memory area 6164 indicated in the context address 6156. In addition, the values of performance counters are copied from the memory back to the performance counters 6118. The exact arrangement of counter values and configurations values does not change the scope of this invention.
When the copying is finished, the state machine 6108 may set the context switch register to a value (e.g., “0”) that indicates that the copying is completed. In another embodiment, the performance counters may generate an interrupt to signal the completion of copying. The interrupt may be used to notify the operating system that the copying has completed. In one embodiment, the hardware clears the context switch register 6160. In another embodiment, the operating system resets the context switch register value 6160 (e.g., “0”) to indicate no copying.
The state machine 6108 copies the memory address stored in the context address register 6156 to the current address register 6148. Thus, the new context address register 6156 is free to be used for the next context switch.
In another embodiment of the implementation, the second context address register 6156 may not be needed. That is, the operating system may use one context address register 6109 for indicating the memory address to copy to or to copy from, for context switching out or context switching in, respectively. Thus, for example, register 6148 may be also used for indicating a memory address from where to context switch in the hardware performance counter information associated with a process being context switched in, when the operating system is context switching back in a process that was context switched out previously.
An additional number of registers or the like, or different configurations of the hardware performance counter unit, may be used to accomplish the automatic saving and restoring of contexts by the hardware, for example, while the operating system may be performing other operations or tasks, and/or so that the operating system or the software or the like need not individually read the counters and associated controls.
At 6402, software sets up all or some configuration registers in the performance counter unit or module 6102. Software, which may be a user-level application or an operating system, may set up several counter configurations, and one or more starting memory addresses and lengths where performance counter data will be copied. Software also writes a time interval value into a designated register, together with the information needed for switching out a given process A and switching in a process B: for example, it looks up the memory addresses allocated for process A's hardware performance counter information and writes the beginning value of that range into a register, e.g., register R1.
At 6404, a check is made as to whether an operating system context switch needs to be performed. This can be initiated by receiving an external signal to start the switch, or the operating system or the like may write to another register (e.g., register R2) to indicate that copying between the performance counters and the memory should begin. For instance, the operating system or the like writes “1” to R2.
At 6406, if no OS switch needs to be performed, hardware transfers the value into a timer register. At 6408, the timer register counts down the time interval value, and when the timer count reaches zero, notifies a state machine. Any other method of detecting expiration of the timer value may be utilized. At 6410, the state machine triggers copying of all or selected performance counter register values to specified address in memory. At 6412, hardware copies performance counters to the memory.
At 6414, hardware checks if the configuration of performance counters needs to be changed, by checking a value in another register. If the configuration does not need to be changed, the processing returns to 6404. At 6416, a state machine changes the configuration of the performance counter data, and loops back to 6404.
Going back to 6404, the operating system may indicate, for example, by storing a value, that context switching of the performance counter data should begin, and control transfers to 6418. At 6418, a state machine begins context switching the performance counter data, and copies the current context (all or some performance counter values, and all or some configuration registers) into the memory. At 6420, after values associated with process A are copied out, the values associated with process B are copied into the performance counters and configuration registers from the memory. For instance, the state machine copies data from another specified memory location into the performance counters. After the hardware finishes copying, the hardware resets the value at register R2, for example, to “0” to indicate that the copying is done. Finally, at 6416, the new configuration consistent with the process B is applied.
At 6414, the software may specify reconfiguring of the performance counters, for example, periodically or every time interval, and the hardware, for instance, the state machine, may switch configuration of the performance counters at the specified periods. The specifying of reconfiguring and the hardware reconfiguring may occur while the operating system thread is in one context in one aspect. In another aspect, the reconfiguration of the performance counters may occur asynchronously to the context switching mechanism.
At 6418, the software may also specify copying of performance counters directly to memory, for instance, periodically or at every specified time interval. For example, the software may write a value in a register that automatically triggers the state machine (hardware) to automatically perform direct copying of the hardware performance counter data to memory without further software intervention. In one aspect, the specifying of copying the performance counter data directly to memory and the hardware automatically performing the copying may occur while an operating system thread is in context. In another aspect, this step may occur asynchronously to the context switching mechanism.
In one aspect, the storage needed for the majority of performance count data is centralized, thereby achieving an area reduction. For instance, only a small number of least-significant bits are kept in the local units, thus saving area. This allows each processor to keep a large number of performance counters (e.g., 24 local counters per processor) at low resolution (e.g., 14 bits). To attain higher resolution counts, the local counter unit periodically transfers its counter values (counts) to a central unit. The central unit aggregates the counts into a higher resolution count (e.g., 64 bits). The local counters count a number of events, e.g., up to the local counter capacity. Before a local counter overflows, it transfers its count to the central unit. Thus, no counts are lost in the local counters. The count values may be stored in a memory device such as a single central Static Random Access Memory (SRAM), which provides high bit density. Using this approach, it becomes possible to have multiple performance counters supported per processor, while still providing for very large (e.g., 64 bit) counter values.
In another aspect, the memory or central SRAM may be used in multiple modes: a distributed mode, where each core or processor on a chip provides a relatively small number of counts (e.g., 24 per processor), as well as a detailed mode, where a single core or processor can provide a much larger number of counts (e.g., 116).
In yet another aspect, multiple performance counter data counts from multiple performance counters residing in multiple processing modules (e.g., cores and cache modules) may be collected via a single daisy chain bus in a predetermined number of cycles. The predetermined number of cycles depends on the number of performance counters per processing module, the number of processing modules residing on the daisy chain bus, and the number of bits that can be transferred at one time on the daisy chain. In the description herein, the example configuration of the chip supports 24 local counters in each of its 17 cores, and 16 local counters in each of its 16 L2 cache units or modules. The daisy chain bus supports 96 bits of data. Other configurations are possible, and the present invention is not limited only to that configuration.
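The split between narrow local counters and wide central counters can be illustrated with a short model. This is a sketch under stated assumptions rather than the disclosed design: the class names `LocalCounter` and `CentralUnit` are hypothetical, the 14 bit and 64 bit widths are the example values from the text, and the model assumes a local counter is cleared when its value is transferred.

```python
LOCAL_BITS = 14
LOCAL_MAX = (1 << LOCAL_BITS) - 1     # largest value a 14 bit counter can hold
MASK64 = (1 << 64) - 1                # central counters are 64 bits wide


class LocalCounter:
    """Low-resolution counter kept next to the event source."""

    def __init__(self):
        self.count = 0

    def event(self, n=1):
        self.count += n
        if self.count > LOCAL_MAX:
            raise OverflowError("local counter overflowed before transfer")

    def drain(self):
        """Hand the accumulated count to the central unit and clear it (assumed)."""
        value, self.count = self.count, 0
        return value


class CentralUnit:
    """Aggregates drained local counts into 64 bit totals (e.g., held in SRAM)."""

    def __init__(self, n_counters):
        self.totals = [0] * n_counters

    def collect(self, local_counters):
        for i, counter in enumerate(local_counters):
            self.totals[i] = (self.totals[i] + counter.drain()) & MASK64
```

As long as `collect` runs before any local counter can exceed `LOCAL_MAX`, the central totals equal the true event counts even though each local counter holds only a few bits.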
In still yet another aspect, the performance counter modules and monitoring of performance data may be programmed by user software. Counters of the present disclosure may be configured through memory access bus. The hardware modules of the present disclosure are configured as not privileged such that user program may access the counter data and configure the modules. Thus, with the methodology and hardware set up of the present disclosure, it is not necessary to perform kernel-level operations such as system calls when configuring and gathering performance counts, which can be costly. Rather, the counters are under direct user control.
Still yet in another aspect, the performance counters and associated modules are physically placed near the cores or processing units to minimize overhead and data travel distance and to provide low-latency control and configuration of the counters by the unit to which the counters are associated.
A processing node may have multiple processors or cores and associated L1 cache units, L2 cache units, a messaging or network unit, and I/O interfaces such as PCI Express. The performance counters of the present disclosure allow the gathering of performance data from such functions of a processing node and may present the performance data to software. A processing node 7100, also referred to herein as a chip, such as an application-specific integrated circuit (ASIC), may include (but is not limited to) a plurality of cores (7102a, 7102b, 7102n) with associated L1 cache prefetchers (L1P). The processing node may also include (but is not limited to) a plurality of L2 cache units (7104a, 7104b, 7104n), a messaging/network unit 7110, PCIe 7111 and Devbus 7112, connecting to a centralized counter unit referred to herein as UPC_C (7114). A core (e.g., 7102a, 7102b, 7102n), also referred to herein as a PU (processing unit), may include a performance monitoring unit or a performance counter (7106a, 7106b, 7106n) referred to herein as UPC_P. UPC_P resides in the PU complex and gathers performance data from the associated core (e.g., 7102a, 7102b, 7102n). Similarly, an L2 cache unit (e.g., 7104a, 7104b, 7104n) may include a performance monitoring unit or a performance counter (e.g., 7108a, 7108b, 7108n) referred to herein as UPC_L2. UPC_L2 resides in the L2 module and gathers performance data from it. The terminology UPC (universal performance counter) is used in this disclosure synonymously or interchangeably with general performance counter functions.
UPC_C 7114 may be a single, centralized unit within the processing node 7100, and may be responsible for coordinating and maintaining count data from the UPC_P (7106a, 7106b, 7106n) and UPC_L2 (7108a, 7108b, 7108n) units. The UPC_C unit 7114 (also referred to as the UPC_C module) may be connected to the UPC_P (7106a, 7106b, 7106n) and UPC_L2 (7108a, 7108b, 7108n) via a daisy chain bus 7130, with the start 7116 and end 7118 of the daisy chain beginning and terminating at the UPC_C 7114. The performance counter modules (i.e., UPC_P, UPC_L2 and UPC_C) of the present disclosure may operate in different modes, and depending on the operating mode, the UPC_C 7114 may inject packet framing information at the start of the daisy chain 7116, enabling the UPC_P (7106a, 7106b, 7106n) and/or UPC_L2 (7108a, 7108b, 7108n) modules or units to place data on the daisy chain bus 7130 at the correct time slot. In a similar manner, messaging/network unit 7110, PCIe 7111 and Devbus 7112 may be connected via another daisy chain bus 7140 to the UPC_C 7114.
The performance counter functionality of the present disclosure may be divided into two types of units, a central unit (UPC_C), and a group of local units. Each of the local units performs a similar function, but may have slight differences to enable it to handle, for example, a different number of counters or different event multiplexing within the local unit. For gathering performance data from the core and associated L1, a processor-local UPC unit (UPC_P) is instantiated within each processor complex. That is, a UPC_P is added to the processing logic. Similarly, there may be a UPC unit associated with each L2 slice (UPC_L2). Each UPC_L2 and UPC_P unit may include a small number of counters. For example, the UPC_P may include 24 14-bit counters, while the UPC_L2 may instantiate 16 10-bit counters. The UPC ring (shown as solid line from 7116 to 7118) may be connected such that each UPC_P (7106a, 7106b, 7106n) or UPC_L2 unit (7108a, 7108b, 7108n) may be connected to its nearest neighbor. In one aspect, the daisy chain may be implemented using only registers in the UPC units, without extra pipeline latches.
Although not shown or described, a person of ordinary skill in the art will appreciate that a processing node may include other units and/or elements. The processing node 7100 may be an application-specific integrated circuit (ASIC), or a general-purpose processing node.
The UPC of the present disclosure may operate in different modes, as described below. However, the UPC is not limited to only those modes of operation.
Mode 0 (Distributed Count Mode)
In this operating mode (also referred to as distributed count mode), counts from multiple performance counters residing in each core or processing unit and L2 unit may be captured. For example, in an example implementation of a chip that includes 17 cores each with 24 performance counters, and 16 L2 units each with 16 performance counters, 24 counts from 17 UPC_P units and 16 counts from 16 UPC_L2 units may be simultaneously captured. Local UPC_P and UPC_L2 counters are periodically transferred to a corresponding 64 bit counter residing in the central UPC unit (UPC_C), over a 96 bit daisy chain bus. Partitioning the performance counter logic into local and central units allows for logic reduction, but still maintains 64 bit fidelity of event counts. Each UPC_P or UPC_L2 module places its local counter data on the daisy chain (4 counters at a time), or passes 96 bit data from its neighbor. The design guarantees that all local counters will be transferred to the central unit before they can overflow locally (by guaranteeing a slot on the daisy chain at regular intervals). With a 14 bit local UPC_P counter, each counter is transferred to the central unit at least every 1024 cycles to prevent overflow of the local counters. In order to cover corner cases and minimize the latency of updating the UPC_C counters, each counter is transferred to the central unit every 400 cycles. For Network, DevBus and PCIe, a local UPC unit similar to UPC_L2 and UPC_P may be used for these modules.
Mode 1 (Detailed Count Mode)
In this mode, the UPC_C assists a single UPC_P or UPC_L2 unit in capturing performance data. More events can be captured in this mode from a single processor (or core) or L2 than can be captured in distributed count mode. However, only one UPC_P or UPC_L2 may be examined at a time.
The UPC_P and UPC_L2 modules may be connected to the UPC_C unit via a 96 bit daisy chain, using a packet based protocol. Each UPC operating mode may use a different protocol. For example, in Mode 0 or distributed mode, each UPC_P and/or UPC_L2 places its data on the daisy chain bus at a specific time (e.g., cycle or cycles). In this mode, the UPC_C transmits framing information on the upper bits (bits 64:95) of the daisy chain. Each UPC_P and/or UPC_L2 module uses this information to place its data on the daisy chain at the correct time. The UPC_P and UPC_L2 send their counter data in a packet on bits 0:63 of the performance daisy chain. Bits 64:95 are generated by the UPC_C module, and passed unchanged by the UPC_P and/or UPC_L2 module. Table 1-2 defines example packets sent by UPC_P. Table 1-3 defines example packets sent by UPC_L2. Table 1-4 shows framing information injected by the UPC_C. The packet formats and framing information may be pre-programmed or hard-coded in the logic of the processing.
Table 1-2 defines example packets sent by a UPC_P. Each UPC_P may follow this format, with the first UPC_P sending its packet on a first group of 16 cycles. Thus, the next UPC_P may send packets on the next 16 cycles, i.e., 16-31. The next UPC_P may send packets on the next 16 cycles, i.e., 32-47, and so forth. Table 1-5 shows an example of cycle to performance counter unit mappings.
Similar to UPC_P, the UPC_L2 may place data from its counters (e.g., 16 counters) on the daisy chain in an 8-flit packet, on daisy chain bits 0:63. This is shown in Table 1-3.
Table 1-4 shows the framing information transmitted by the UPC_C in Mode 0.
In this example format of both the UPC_P and UPC_L2 packet formats, every other flit contains no data. Flit refers to one cycle worth of information. The UPC_C uses these “dead” cycles to service memory-mapped I/O (MMIO) requests to the Static Random Access Memory (SRAM) counters or the like.
The UPC_L2 and UPC_P modules monitor the framing information produced by the UPC_C. The UPC_C transmits a repeating cycle count, ranging from 0 to 399 decimal. Each UPC_P and UPC_L2 compares this count to a value based on its logical unit number, and injects its packet onto the daisy chain when the cycle count matches the value for the given unit. The values compared by each unit are shown in Table 1-5.
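The cycle-count comparison described above amounts to a fixed time-division schedule. The sketch below shows one schedule that is consistent with the figures given in the text (17 UPC_P units using 16 cycles each, 16 UPC_L2 units using 8-flit packets, and a cycle count of 0 to 399, since 17 × 16 + 16 × 8 = 400). The ordering of units within the frame is an assumption made for illustration; the actual mapping is the one given in Table 1-5. The function names are hypothetical.

```python
N_UPC_P, P_SLOT = 17, 16      # 17 cores, 16 cycles per UPC_P packet
N_UPC_L2, L2_SLOT = 16, 8     # 16 L2 units, 8-flit packet per UPC_L2
FRAME = 400                   # UPC_C cycle count runs from 0 to 399


def start_cycle(kind, unit):
    """Cycle count at which a unit injects its packet (assumed ordering)."""
    if kind == "UPC_P":
        return unit * P_SLOT
    if kind == "UPC_L2":
        return N_UPC_P * P_SLOT + unit * L2_SLOT
    raise ValueError(kind)


def owner(cycle):
    """Which unit owns the daisy chain data bits on a given cycle."""
    cycle %= FRAME
    if cycle < N_UPC_P * P_SLOT:
        return ("UPC_P", cycle // P_SLOT)
    return ("UPC_L2", (cycle - N_UPC_P * P_SLOT) // L2_SLOT)
```

Under this assumed ordering, every unit receives exactly one slot per 400-cycle frame, which matches the statement that each counter is transferred to the central unit every 400 cycles.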
Mode 0 Support for Simultaneous Counter Stop/Start
In Mode 0 (also referred to as distributed count mode), each UPC_P and UPC_L2 may contribute counter data. It may be desirable to have the local units start and stop counting on the same cycle. To accommodate this, the UPC_C sends a counter start/stop bit on the daisy chain. Each unit can be programmed to use this signal to enable or disable their local counters. Since each unit is on a different position on the daisy chain, each unit delays a different number of cycles, depending on their position in the daisy chain, before responding to the counter start/stop command from the UPC_C. This delay value may be hard coded into each UPC_P/UPC_L2 instantiation.
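The position-dependent delay can be illustrated as follows. This sketch assumes that the start/stop bit advances one unit per cycle along the daisy chain and that positions are numbered from 0 at the unit nearest the UPC_C; neither detail is stated in the text, and the function names are hypothetical.

```python
def local_delay(position, chain_length):
    """Cycles a unit waits after seeing the start/stop bit (hard-coded per unit)."""
    return (chain_length - 1) - position


def effective_cycle(inject_cycle, position, chain_length):
    """Cycle on which a unit actually starts or stops its local counters."""
    arrival = inject_cycle + position      # assumed: one hop per cycle
    return arrival + local_delay(position, chain_length)
```

With these assumptions, units early in the chain wait longest and the last unit does not wait at all, so all units respond on the same cycle.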
Mode 1 UPC_P, UPC_L2 Daisy Chain Protocol
As described above, Mode 1 (also referred to as detailed count mode) may be used to allow more counters per processor or L2 than what the local counters provide. In this mode, a given UPC_P or UPC_L2 is selected for ownership of the daisy chain. The selected UPC_P or UPC_L2 sends 92 bits of real time performance event data to the UPC_C for counting. In addition, the local counters are transferred to the UPC_C as in Mode 0. One daisy chain wire can be used to transmit information from all the performance counters in the processor, e.g., all 24 performance counters. The majority of the remaining wires can be used to transfer events to the UPC_C for counting. The local counters may be used in this mode to count any event presented to them. Also, all local counters may be used for instruction decoding. In Mode 1, 92 events may be selected for counting by the UPC_C unit. One bit of the daisy chain is used to periodically transfer the local counters to the UPC_C, while 92 bits are used to transfer events. The three remaining bits are used to send control information and power gating signals to the local units. The UPC_C sends a rotating count from 0-399 on daisy chain bits 64:72, identically to Mode 0. The UPC_P or UPC_L2 that is selected for Mode 1 places its local counters on bits 0:63 in a similar fashion as Mode 0, e.g., when the local unit decodes a certain value of the ring counter.
Examples of the data sent by the UPC_P are shown in Table 1-6. UPC_L2 may function similarly, for example, with 32 different types of events being supplied. The specified bits may be turned on to indicate the selected events for which the count is being transmitted. Daisy chain bus bits 92-95 specify control information such as the packet start signal on a given cycle.
The UPC_P module may use the x1 and x2 clocks. It may expect the x1 and x2 clocks to be phase-aligned, removing the need for synchronization of x1 signals into the x2 domain.
UPC_P Modes
As described above, the UPC_P module 7200 may operate in distributed count mode or detailed count mode. In distributed count mode (Mode 0), a UPC_P module 7200 may monitor performance events, for example 24 performance events from its 24 performance counters. The daisy chain bus is time multiplexed so that each UPC_P module sends its information to the UPC_C in turn. In this mode, the user may count 24 events per core, for example.
In Mode 1 (detailed count mode), one UPC_P module may be selected for ownership of the daisy chain bus. Data may be combined from the various inputs (core performance bus, core trace bus, L1P events), formatted and sent to the UPC_C unit each cycle. As shown in
Edge/Level/Polarity module 7224 may convert level signals emanating from the core's Performance bus 7226 into single cycle pulses suitable for counting. Each performance bit has a configurable polarity invert, and edge filter enable bit, available via a configuration register.
Widen module 7232 converts signals from one clock domain into another. For example, the core's Performance 7226, Trace 7228, and Trigger 7230 busses all may run at the clk×1 rate, and are transitioned to the clk×2 domain before being processed by the UPC_P. Widen module 7232 performs that conversion, translating each clk×1 clock domain signal into 2 clk×2 signals (even and odd). This module is optional, and may be used if the rate at which events are output is different (e.g., faster or slower) than the rate at which events are accumulated at the performance counters.
QPU Decode module 7234 and execution unit (XU) Decode module 7236 take the incoming opcode stream from the trace bus, and decode it into groups of instructions. In one aspect, this module resides in the clk×2 domain, and there may be two opcodes (even and odd) of each type (XU and QPU) to be decoded per clk×2 cycle. To accomplish this, two QPU and two XU decode units may be instantiated. This applies to implementations where the core 7220 operates at twice the speed, i.e., outputs 2 events, per operating cycle of the performance counters, as explained above. The 2 events saved by the widen module 7232 may be processed at the two QPU and two XU decode units. The decoded instruction stream is then sent to the counter blocks for selection and counting.
Registers module 7238 implements the interface to the MMIO bus. This module may include the global MMIO configuration registers and provide the support logic (readback muxes, partial address decode) for registers located in the UPC_P Counter units. User software may program the performance counter functions of the present disclosure via the MMIO bus.
Thread Combine module 7240 may combine identical events from each thread, count them, and present a value for accumulation by a single counter. Thread Combine module 7240 may conserve counters when aggregate information across all threads is needed. Rather than using four counters (or one counter for each thread) and summing in software, summing across all threads may be done in hardware using this module. Counters may be selected to support thread combining.
The Mode 1 Compress module 7242 may combine event inputs from the core's event bus 7226, the local counters 7224a . . . 7224n, and the L1 cache prefetch (L1P) event bus 7246, 7248, and place them on the appropriate daisy chain lines for transmission to the UPC_C, using a predetermined packet format, for example, shown in Table 1-6. This module 7242 may divide the 96 bit bus into 12 Event groups, with Event Group 0-7 containing 8 events, and Event Groups 8-11 containing 7 events, for a total of 92 events. Some event group bits can be sourced by several events. Not all events may connect to all event groups. Each event group may have a single multiplexer (mux) control, spanning the bits in the event group.
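The event group arithmetic above can be checked with a short calculation: eight groups of 8 events plus four groups of 7 events give 8 × 8 + 4 × 7 = 92 events. The sketch below also derives a bit range for each group, assuming the groups are packed contiguously starting at bit 0; that packing is an assumption for illustration, and the actual bit assignment is the one defined by the packet format in Table 1-6. The names are hypothetical.

```python
GROUP_SIZES = [8] * 8 + [7] * 4    # Event Groups 0-7 and Event Groups 8-11


def group_bit_ranges(sizes=GROUP_SIZES):
    """Return (first_bit, last_bit) per event group, packed from bit 0 (assumed)."""
    ranges, start = [], 0
    for size in sizes:
        ranges.append((start, start + size - 1))
        start += size
    return ranges
```

Under this packing the twelve groups occupy bits 0 through 91, leaving four bits of the 96 bit bus for the remaining signals.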
There may be 24 UPC_P Counter units in each UPC_P module. To minimize muxing, not all counters are connected to all events. Similarly, all counters may be used to count opcodes, but this is not required. Counters may be used to capture a given core's performance event or L1P event.
Referring to
Trace (Debug) Bus 7228 may be used to collect the opcode of all committed instructions.
MMIO interface 7250 may allow configuration and interrogation of the UPC_P module by the local core unit (7220).
UPC_P Outputs
The UPC_P 7200 may include two output interfaces: a UPC_P daisy chain bus 7252, used for transfer of UPC_P data to the UPC_C, and an MMIO bus 7250, used for reading/writing of configuration and count information from the UPC_P.
UPC_L2 Module
UPC_L2 Modes
The UPC_L2 module 7400 may operate in distributed count mode (Mode 0) or detailed count mode (Mode 1). In Mode 0, each UPC_L2 module may monitor 16 performance events, on its 16 performance counters. The daisy chain bus is time multiplexed so that each UPC_L2 module sends its information to the UPC_C in turn. In this mode, the user may count 16 events per L2 slice. In Mode 1, one UPC_L2 module is selected for ownership of the daisy chain bus. In this mode, all 32 events supported by the L2 slice may be counted.
UPC_C Module
Referring back to
The UPC_C module may operate in different modes. In Mode 0, each UPC_P and UPC_L2 contribute 24 and 16 performance events, respectively. In this way, a coarse view of the entire ASIC may be provided. In this mode, the UPC_C Module 7114 sends framing information to the UPC_P and UPC_L2 modules. This information is used by the UPC_P and UPC_L2 to globally synchronize counter starting/stopping, and to indicate when each UPC_P or UPC_L2 should place its data on the daisy chain.
In Mode 1, one UPC_L2 module or UPC_P unit is selected for ownership of the daisy chain bus. All 32 events supported by a selected L2 slice may be counted, and up to 116 events can be counted from a selected PU. A set of 92 counters local to the UPC_C, and organized into Central Counter Groups, is used to capture the additional data from the selected UPC_P or UPC_L2.
The UPC_P/L2 Counter unit 7142 gathers performance data from the UPC_P and UPC_L2 units, while the Network/DMA/IO Counter unit 7144 gathers event data from the rest of the ASIC, e.g., input/output (I/O) events, network events, direct memory access (DMA) events, etc.
UPC_P/L2 Counter Unit 7142 is responsible for gathering data from each UPC_P and UPC_L2 unit, and accumulating it in the appropriate SRAM location. The SRAM is divided into 32 counter groups of 16 counters each. In Mode 0, each counter group is assigned to a particular UPC_P or UPC_L2 unit. The UPC_P unit has 24 counters, and uses two counter groups per UPC_P unit. The last 8 entries in the second counter group are unused by the UPC_P. The UPC_L2 unit has 16 counters, and fits within a single counter group. For every count data, there may exist an associated location in SRAM for storing the count data.
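The Mode 0 counter group arrangement may be illustrated with the following sketch. The assignment of UPC_P unit k to counter groups 2k and 2k+1 is a hypothetical choice made only for this example; the description does not specify which groups a given unit owns.

```python
# Illustrative mapping of UPC_P local counters onto SRAM counter groups.
# Assumption (hypothetical): UPC_P unit k owns counter groups 2k and 2k+1.
NUM_GROUPS = 32
COUNTERS_PER_GROUP = 16
UPC_P_COUNTERS = 24


def upc_p_slot(unit, counter):
    """Map (UPC_P unit index, local counter 0-23) to (counter group, entry)."""
    if not 0 <= counter < UPC_P_COUNTERS:
        raise ValueError("a UPC_P has 24 local counters")
    group = 2 * unit + counter // COUNTERS_PER_GROUP
    if not 0 <= group < NUM_GROUPS:
        raise ValueError("counter group out of range")
    return group, counter % COUNTERS_PER_GROUP


def unused_entries():
    """Entries of the second counter group left unused by a UPC_P."""
    return list(range(UPC_P_COUNTERS - COUNTERS_PER_GROUP, COUNTERS_PER_GROUP))
```

Under this assumed assignment, local counter 23 of unit 0 lands in entry 7 of group 1, and entries 8 through 15 of that group remain unused.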
Software may read or write any counter from SRAM at any time. In one aspect, data is written in 64 bit quantities, and addresses a single counter from a single counter group.
In addition to reading and writing counters, software may cause selected counters of an arbitrary counter group to be added to a second counter group, with the results stored in a third counter group. This may be accomplished by writing to special registers in the UPC_P/L2 Counter Unit 7142.
Concurrently with writing the result to memory, the result is checked for a near-overflow. If this condition has occurred, a packet is sent over the daisy chain bus, indicating the SRAM address at which the event occurred, as well as which of the 4 counters in the SRAM has reached near-overflow (each 256 bit SRAM location stores 4 64-bit counters). Note that any combination of the 4 counters in a single SRAM address can reach near-overflow on a given cycle. Because of this, the counter identifier is sent as separate bits (one bit for each counter in a single SRAM address) on the daisy chain. The UPC_P monitors the daisy chain for overflow packets coming from the UPC_C. If the UPC_P detects a near-overflow packet associated with one or more of its counters, it sets an interrupt arming bit for the identified counters. This enables the UPC_P to issue an interrupt to its local processor on the next overflow of the local counter. In this way, interrupts can be delivered to the local processor very quickly after the actual event that caused overflow, typically within a few cycles.
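The near-overflow signaling described above may be sketched as follows. The threshold value and the tuple used to represent the daisy chain packet are assumptions for illustration; the description specifies only that the packet carries the SRAM address and one bit per counter in that address.

```python
# Illustrative model of near-overflow reporting (not the hardware design).
COUNTERS_PER_WORD = 4      # four 64-bit counters per 256-bit SRAM location
NEAR_OVERFLOW = 1 << 63    # hypothetical threshold, chosen only for illustration


def sram_location(counter_index):
    """Return (SRAM address, position within the 256-bit location)."""
    return divmod(counter_index, COUNTERS_PER_WORD)


def near_overflow_packet(address, word, threshold=NEAR_OVERFLOW):
    """Return (address, mask) if any counter is near overflow, else None.

    Bit i of mask is set when counter i of the SRAM location has reached
    the threshold, so any combination of the 4 counters can be reported.
    """
    mask = 0
    for i, value in enumerate(word):
        if value >= threshold:
            mask |= 1 << i
    return (address, mask) if mask else None
```

A local unit receiving such a packet would compare the address against the locations assigned to its own counters and arm only the counters whose mask bits are set.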
Upon startup the UPC_C sends an enable signal along the daisy chain. A UPC_P/L2 unit 7600 may use this signal to synchronize the starting and stopping of its local counters. The UPC_C may also optionally send a reset signal to the UPC_P and UPC_L2, directing them to reset their local counts upon being enabled. The 96 bit daisy chain provides adequate bandwidth to support both detailed count mode and distributed count mode operation.
For operating in detailed count mode, the entire daisy chain bandwidth can be dedicated to a single processor or L2. This greatly increases the amount of information that can be sent from a single UPC_P or UPC_L2, allowing the counting of more events. The UPC_P module receives information from three sources: core unit opcodes received via the trace bus, performance events from the core unit, and events from the L1P. In Mode 1, the bandwidth of the daisy chain is allocated to a single UPC_P or UPC_L2, and used to send more information. Global resources in the UPC_C (The Mode 1 Counter unit) assist in counting performance events, providing a larger overall count capability.
The UPC_P module may contain decode units that provide roughly 50 groups of instructions that can be counted. These decode units may operate on 4 16 bit instructions simultaneously. In one aspect, instead of transferring raw opcode information, which may consume available bandwidth, the UPC_P local counters may be used to collect opcode information. The local counters are periodically transmitted to the UPC_C for aggregation with the SRAM counter, as in Mode 0. However, extra data may be sent to the UPC_C in the Mode 1 daisy chain packet. This information may include event information from the core unit and associated L1 prefetcher. Multiplexers in the UPC_P can select the events to be sent to the UPC_C. This approach may use 1 bit on the daisy chain.
The UPC_C may have 92 local counters, each associated with an event in the Mode 1 daisy chain packet. These counters are combined in SRAM with the local counters in the UPC_P or L2. They are organized into 8-counter central counter groups. In total there may be 116 counters in Mode 1 (24 counters for instruction decoding, and 92 for event counting).
The daisy chain input feeds events from the UPC_P or UPC_L2 into the Mode 1 Counter Unit for accumulation, while UPC_P counter information is sent directly to SRAM for accumulation. The protocol for merging the low order bits into the SRAM may be similar to Mode 0.
Each counter in the Mode 1 Counter Unit may correspond to a given event transmitted in the Mode 1 daisy chain packet.
The UPC counters may be started and stopped with fairly low overhead. The UPC_P modules map the controls to start and stop counters into MMIO user space for low-latency access that does not require kernel intervention. In addition, a method to globally start and stop counters synchronously with a single command via the UPC_C may be provided. For local use, each UPC_P unit can act as a separate counter unit (with lower resolution), controlled via local MMIO transactions. For example, the UPC_P Counter Data Registers may provide MMIO access to the local counter values. The UPC_P Counter Control Register may provide local configuration and control of each UPC_P counter.
All events may increment the counter by a value of 1 or more.
Software may communicate with the UPC_C via local Devbus access. In addition, UPC_C Counter Data Registers may give software access to each counter on an individual basis. UPC_C Counter Control Registers may allow software to enable each local counter independently. The UPC units provide the ability to count and report various events via MMIO operations to registers residing in the UPC units, which software may utilize via the Performance Application Programming Interface (PAPI).
A UPC_C Accumulate Control Register may allow software to add counter groups to each other, and place the result in a third counter group. This register may be useful for temporarily storing the added counts, for instance, in case the added counts should not count toward the performance data. An example of such counts would be when a processor executes instructions based on anticipated future execution flow, that is, the execution is speculative. If the anticipated future execution flow results in incorrect or unnecessary execution, the performance counts resulting from those executions should not be counted.
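A minimal software model of the accumulate operation is sketched below. The list-of-lists representation of the SRAM and the 64-bit wrap-around are assumptions made for this example; the register interface itself is not modeled.

```python
# Illustrative model of adding one counter group to another and storing
# the result in a third group. Not a description of the register interface.
MASK64 = (1 << 64) - 1
COUNTERS_PER_GROUP = 16


def accumulate_groups(sram, src_a, src_b, dest, selected=range(COUNTERS_PER_GROUP)):
    """Add selected counters of group src_a to group src_b; store in group dest.

    sram is a list of counter groups, each a list of 16 integer counts.
    Assumption: results wrap at 64 bits.
    """
    for i in selected:
        sram[dest][i] = (sram[src_a][i] + sram[src_b][i]) & MASK64
    return sram[dest]
```

Because the result is written to a third group, the two source groups are left unchanged, which is what allows speculative counts to be held separately and discarded if the speculation fails.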
At the same time or substantially the same time, the local performance counter module also monitors for near-overflow interrupt from the UPC_C at 7712. If there is an interrupt, the local performance counter module may retrieve the information associated with the interrupt from the daisy chain bus and determine whether the interrupt is for any one of its performance counters. For example, the SRAM location specified on the daisy chain associated with the interrupt is checked to determine whether that location is where the data of its performance counters are stored. If the interrupt is for any one of its performance counters, the local performance counter module arms the counter to handle the near-overflow. If a subsequent overflow of the counter in UPC_P or UPC_L2 occurs, the UPC_P or UPC_L2 may optionally freeze the bits in the specified performance counter, as well as generate an interrupt.
Miscellaneous Memory-Mapped Devices
All other devices accessed by the core or requiring direct memory access are connected via the device bus unit (DEVBUS) to the crossbar switch. The PCI express interface unit uses this path to enable PCIe devices to DMA data into main memory via the L2-caches. The DEVBUS switches requests from its slave port also to the boot eDRAM, an on-chip memory used for boot, RAS messaging and control-system background communication. Other units accessible via DEVBUS include the universal performance counter unit (UPC), the interrupt controller (BIC), the test controller/interface (TESTINT) as well as the global L2 state controller (L2-central).
Generally, hardware performance counters are extra logic added to the central processing unit (CPU) to track low-level operations or events within the processor. For example, there are counter events that are associated with the cache hierarchy that indicate how many misses have occurred at L1, L2, and the like. Other counter events indicate the number of instructions completed, number of floating point instructions executed, translation lookaside buffer (TLB) misses, and others. A typical computing system provides a small number of counters dedicated to collecting and/or recording performance events for each processor in the system. These counters consume significant logic area, and cause high-power dissipation. As such, only a few counters are typically provided. Current computer architecture allows many processors or cores to be incorporated into a single chip. Having only a handful of performance counters per processor does not provide the ability to count several events simultaneously from each processor.
Thus, in a further embodiment, there is provided a distributed trace device, that, in one aspect, may include a plurality of processing cores, a central storage unit having at least memory, and a daisy chain connection connecting the central storage unit and the plurality of processing cores and forming a daisy chain ring layout. At least one of the plurality of processing cores places trace data on the daisy chain connection for transmitting the trace data to the central storage unit. The central storage unit detects the trace data and stores the trace data in the memory.
Further, there is provided a method for distributed trace using central memory, that, in one aspect, may include connecting a plurality of processing cores and a central storage unit having at least memory using a daisy chain connection, the plurality of processing cores and the central storage unit being formed in a daisy chain ring layout. The method also may include enabling at least one of the plurality of processing cores to place trace data on the daisy chain connection for transmitting the trace data to the central storage unit. The method further may include enabling the central storage unit to detect the trace data and store the trace data in the memory.
Further, a method for distributed trace using central performance counter memory, in one aspect, may include placing trace data on a daisy chain bus connecting the processing core and a plurality of second processing cores to a central storage unit on an integrated chip. The method further may include reading the trace data from the daisy chain bus and storing the trace data in memory.
A centralized memory is used to store trace information from a processing core, for instance, in an integrated chip having a plurality of cores. Briefly, trace refers to signals or information associated with activities or internal operations of a processing core. Trace may be analyzed to determine the behavior or operations of the processing core from which the trace was obtained. In addition to a plurality of cores, each of the cores also referred to as local core, the integrated chip may include a centralized storage for storing the trace data and/or performance count data.
Each processor or core may keep a number of performance counters (e.g., 24 local counters per processor) at low resolution (e.g., 14 bits) local to it, and periodically transfer these counter values (counts) to a central unit. The central unit aggregates the counts into a higher resolution count (e.g., 64 bits). The local counters count a number of events, e.g., up to the local counter capacity, and before the counter overflow occurs, transfer the counts to the central unit. Thus, no counts are lost in the local counters.
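The local/central counting scheme may be illustrated by the following sketch. The transfer-and-clear behavior and the function name are assumptions for this example; the description states only that counts are transferred to the central unit before the local counter overflows.

```python
# Illustrative model of a 14-bit local counter feeding a 64-bit central counter.
LOCAL_BITS = 14
LOCAL_CAPACITY = (1 << LOCAL_BITS) - 1
MASK64 = (1 << 64) - 1


def simulate(total_events, transfer_period):
    """Count events locally and transfer every transfer_period events.

    Assumption: the local counter is cleared when its value is transferred.
    """
    if not 1 <= transfer_period <= LOCAL_CAPACITY:
        raise ValueError("transfer must occur before the local counter overflows")
    local = 0
    central = 0
    for _ in range(total_events):
        local += 1
        if local == transfer_period:
            central = (central + local) & MASK64
            local = 0
    # Final transfer of whatever remains in the local counter.
    central = (central + local) & MASK64
    return central
```

As long as the transfer period does not exceed the local capacity, the central count equals the number of events, consistent with the statement that no counts are lost.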
The count values may be stored in a memory device such as a single central Static Random Access Memory (SRAM), which provides high bit density. Using this approach, it becomes possible to support many performance counters per processor.
This local-central count storage device structure may be utilized to capture trace data from a single processing core (also interchangeably referred to herein as a processor or a core) residing in an integrated chip. In this way, for example, 1536 cycles of 44 bit trace information may be captured into an SRAM, for example, 256×256 bit SRAM. Capture may be controlled via trigger bits supplied by the processing core.
A core (e.g., 7102a, 7102b, 7102n), which may be also referred to herein as a PU (processing unit) may include a performance monitoring unit or a performance counter (7106a, 7106b, 7106n) referred to herein as UPC_P. UPC_P resides in the PU complex (e.g., 7102a, 7102b, 7102n) and gathers performance data of the associated core (e.g., 7102a, 7102b, 7102n). The UPC_P may be configured to collect trace data from the associated PU.
Similarly, an L2 cache unit (e.g., 7104a, 7104b, 7104n) may include a performance monitoring unit or a performance counter (e.g., 7108a, 7108b, 7108n) referred to herein as UPC_L2. UPC_L2 resides in the L2 and gathers performance data from it. The terminology UPC (universal performance counter) is used in this disclosure synonymously or interchangeably with general performance counter functions.
UPC_C 7114 may be a single, centralized unit within the processing node 7100, and may be responsible for coordinating and maintaining count data from the UPC_P (7106a, 7106b, 7106n) and UPC_L2 (7108a, 7108b, 7108n) units. The UPC_C unit 7114 (also referred to as the UPC_C module) may be connected to the UPC_P (7106a, 7106b, 7106n) and UPC_L2 (7108a, 7108b, 7108n) via a daisy chain bus 7130, with the start 7116 and end 7118 of the daisy chain beginning and terminating at the UPC_C 7114. In a similar manner, messaging/network unit 7110, PCIe 7111 and Devbus 7112 may be connected via another daisy chain bus 7140 to the UPC_C 7114.
The performance counter modules (i.e., UPC_P, UPC_L2 and UPC_C) of the present disclosure may operate in different modes, and depending on the operating mode, the UPC_C 7114 may inject packet framing information at the start of the daisy chain 7116, enabling the UPC_P (7106a, 7106b, 7106n) and/or UPC_L2 (7108a, 7108b, 7108n) modules or units to place data on the daisy chain bus at the correct time slot. In distributed trace mode, UPC_C 7114 functions as a central trace buffer.
As mentioned above, the performance counter functionality of the present disclosure may be divided into two types of units, a central unit (UPC_C), and a group of local units. Each of the local units performs a similar function, but may have slight differences to enable it to handle, for example, a different number of counters or different event multiplexing within the local unit. For gathering performance data from the core and associated L1, a processor-local UPC unit (UPC_P) is instantiated within each processor complex. That is, a UPC_P is added to the processing logic. Similarly, there may be a UPC unit associated with each L2 slice (UPC_L2). Each UPC_L2 and UPC_P unit may include a small number of counters. For example, the UPC_P may include 24 14-bit counters, while the UPC_L2 may instantiate 16 10-bit counters. The UPC ring (shown as solid line from 7116 to 7118) may be connected such that each UPC_P (7106a, 7106b, 7106n) or UPC_L2 unit (7108a, 7108b, 7108n) may be connected to its nearest neighbor. In one aspect, the daisy chain may be implemented using only registers in the UPC units, without extra pipeline latches.
For collecting trace information from a single core (e.g., 7102a, 7102b, 7102n), the UPC_C 7114 may continuously record the data coming in on the connection, e.g., a daisy chain bus, shown at 7118. In response to detecting one or more trigger bits on the daisy chain bus, the UPC_C 7114 continues to read the data (trace information) on the connection (e.g., the daisy chain bus) and records the data for a programmed number of cycles to the SRAM 7120. Thus, trace information before and after the detection of the trigger bits may be seen and recorded.
The UPC_P and UPC_L2 modules may be connected to the UPC_C unit via a 96 bit daisy chain, using a packet based protocol. In trace mode, the trace data from the core is captured into the central SRAM located in the UPC_C 7114. Bit fields 0:87 may be used for the trace data (e.g., 44 bits per cycle), and bit fields 88:95 may be used for trigger data (e.g., 4 bits per cycle).
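The trace mode packet layout may be sketched as follows. The field widths are taken from the description; treating bit field 0 as the most significant bit of the 96 bit value, and the function names, are assumptions made only for this example.

```python
# Illustrative packing of a 96-bit trace mode daisy chain packet.
# Assumption: bit field 0 is the most significant bit of the 96-bit value.
TRACE_BITS = 88    # bit fields 0:87
TRIGGER_BITS = 8   # bit fields 88:95


def pack_packet(trace, trigger):
    """Combine 88 bits of trace data and 8 bits of trigger data."""
    if trace >> TRACE_BITS or trigger >> TRIGGER_BITS:
        raise ValueError("field value does not fit its bit field")
    return (trace << TRIGGER_BITS) | trigger


def unpack_packet(packet):
    """Split a 96-bit packet back into (trace, trigger)."""
    return packet >> TRIGGER_BITS, packet & ((1 << TRIGGER_BITS) - 1)
```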
The UPC_P module may use the x1 and x2 clocks. It may expect the x1 and x2 clocks to be phase-aligned, removing the need for synchronization of x1 signals into the x2 domain. In one aspect, x1 clock may operate twice as fast as x2 clock.
Bits of trace information may be captured from the processing core 7220 and sent across the connection connecting to the UPC_C, for example, the daisy chain bus shown at 7252. For instance, one-half of the 88 bit trace bus from the core (44 bits) may be captured, replicated as the bits pass from different clock domains, and sent across the connection. In addition, 4 of the 16 trigger signals supplied by the core 7220 may be selected at 7254 for transmission to the UPC_C. The UPC_C then may store 1024 clock cycles of trace information into the UPC_C SRAM. The stored trace information may be used for post-processing by software.
Edge/Level/Polarity module 7224 may convert level signals emanating from the core's Performance bus 7226 into single cycle pulses suitable for counting. Each performance bit has a configurable polarity invert, and edge filter enable bit, available via a configuration register.
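The conversion of level signals into countable pulses may be modeled as shown below. The parameter names and the assumption that the signal is low before the first sample are introduced for this sketch only.

```python
# Illustrative model of polarity inversion and edge filtering for one
# performance bit. Assumption: the signal is low before the first sample.

def to_pulses(levels, invert=False, edge_filter=True):
    """Convert a sequence of 0/1 level samples into per-cycle count pulses.

    With edge_filter enabled, only a low-to-high transition produces a pulse,
    so a level held high for several cycles is counted once.
    """
    pulses = []
    previous = 0
    for level in levels:
        signal = (level ^ 1) if invert else level
        if edge_filter:
            pulses.append(1 if signal and not previous else 0)
        else:
            pulses.append(signal)
        previous = signal
    return pulses
```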
Widen module 7232 converts clock signals. For example, the core's Performance 7226, Trace 7228, and Trigger 7230 busses all may run at clk×1 rate, and are transitioned to the clk×2 domain before being processed. Widen module 7232 performs that conversion, translating each clk×1 clock domain signal into 2 clk×2 signals (even and odd). This module is optional, and may be used if the rate at which events are output is different (e.g., faster) than the rate at which events are accumulated at the performance counters.
QPU Decode module 7234 and execution unit (XU) Decode module 7236 take the incoming opcode stream from the trace bus, and decode it into groups of instructions. In one aspect, this module resides in the clk×2 domain, and there may be two opcodes (even and odd) of each type (XU and QPU) to be decoded per clk×2 cycle. To accomplish this, two QPU and two XU decode units may be instantiated. This applies to implementations where the core 7220 operates at twice the speed, i.e., outputs 2 events, per operating cycle of the performance counters, as explained above. The 2 events saved by the widen module 7232 may be processed at the two QPU and two XU decode units. The decoded instruction stream is then sent to the counter blocks for selection and counting.
Registers module 7238 implements the interface to the MMIO bus. This module may include the global MMIO configuration registers and provide the support logic (readback muxes, partial address decode) for registers located in the UPC_P Counter units. User software may program the performance counter functions of the present disclosure via the MMIO bus.
Thread Combine module 7240 may combine identical events from each thread, count them, and present a value for accumulation by a single counter. Thread Combine module 7240 may conserve counters when aggregate information across all threads is needed. Rather than using four counters (or number of counters for each thread), and summing in software, summing across all threads may be done in hardware using this module. Counters may be selected to support thread combining.
The Compress module 7242 may combine event inputs from the core's event bus 7226, the local counters 7224a . . . 7224n, and the L1 cache prefetch (L1P) event bus 7246, 7248, and place them on the appropriate daisy chain lines for transmission to the UPC_C, using a predetermined packet format.
There may be 24 UPC_P Counter units in each UPC_P module. To minimize muxing, not all counters need be connected to all events. All counters can be used to count opcodes. One counter may be used to capture a given core's performance event or L1P event.
Referring to
Trace (Debug) bus 7228 may be used to send data to the UPC_C for capture into SRAM. In this way, the SRAM is used as a trace buffer. In one aspect, the core whose trace information is being sent over the connection (e.g., the daisy chain bus) to the UPC_C may be configured to output trace data appropriate for the events being counted.
Trigger bus 7230 from the core may be used to stop and start the capture of trace data in the UPC_C SRAM. The user may send, for example, 4 of the 16 possible trigger events presented by the core to the UPC for SRAM start/stop control.
MMIO interface 7250 may allow configuration and interrogation of the UPC_P module by the local core unit (7220).
The UPC_P 7200 may include two output interfaces: a UPC_P daisy chain bus 7252, used for transfer of UPC_P data to the UPC_C, and an MMIO bus 7250, used for reading/writing of configuration and count information from the UPC_P.
Referring back to
The UPC_C module may operate in different modes. In trace mode, the UPC_C acts as a trace buffer, and can trace a predetermined number of cycles of a predetermined number of bit trace information from a core. For instance, the UPC_C may trace 1536 cycles of 44 bit trace information from a single core.
The UPC_P/L2 Counter unit 7142 gathers performance data from the UPC_P and/or UPC_L2 units, while the Network/DMA/IO Counter unit 7144 gathers event data from the rest of the ASIC, e.g., input/output (I/O) events, network events, direct memory access (DMA) events, etc.
UPC_P/L2 Counter Unit 7142 may accumulate the trace data received from a UPC_P in the appropriate SRAM location. The SRAM is divided into a predetermined number of counter groups of predetermined counters each, for example, 32 counter groups of 16 counters each. For every count data or trace data, there may exist an associated location in SRAM for storing the count data.
Software may read or write any counter from SRAM at any time. In one aspect, data is written in 64 bit quantities, and addresses a single counter from a single counter group.
The following illustrates the functionality of UPC_C in capturing and centrally storing trace data from one or more of the processors connected on the daisy chain bus in one embodiment of the present disclosure.
1) UPC_C is programmed with the number of cycles to capture after a trigger is detected.
2) UPC_C is enabled to capture data from the ring (e.g., daisy chain bus 7130 of
3) UPC_C receives a trigger signal from ring (sent by UPC_P). UPC_C stores the address that UPC_C was writing to when the trigger occurred. This for example allows software to know where in the circular SRAM buffer the trigger happened.
4) UPC_C then continues to capture until the number of cycles in step 1 has expired. UPC_C then stops capture and may return to an idle state. Software may read a status register to see that capture is complete. The software may then read out the SRAM contents to get the trace.
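The four steps above may be sketched as a simple software model of a circular trace buffer. The class and attribute names are hypothetical, and one buffer entry per cycle is assumed for simplicity; the actual SRAM organization is not modeled.

```python
# Illustrative model of UPC_C trace capture into a circular buffer.
# Assumptions: one buffer entry per cycle; names are hypothetical.

class TraceBuffer:
    def __init__(self, depth, post_trigger_cycles):
        self.mem = [None] * depth           # stands in for the SRAM
        self.post = post_trigger_cycles     # step 1: cycles to capture after trigger
        self.write_addr = 0
        self.trigger_addr = None            # step 3: address at time of trigger
        self.remaining = None
        self.done = False                   # status that software may poll

    def clock(self, data, trigger=False):
        """Process one cycle of data from the ring (step 2)."""
        if self.done:
            return
        self.mem[self.write_addr] = data
        if trigger and self.trigger_addr is None:
            self.trigger_addr = self.write_addr
            self.remaining = self.post
        elif self.remaining is not None:
            self.remaining -= 1
        self.write_addr = (self.write_addr + 1) % len(self.mem)
        if self.remaining == 0:             # step 4: stop after programmed cycles
            self.done = True
```

Because the buffer wraps, entries written before the trigger remain available alongside those written after it, and the stored trigger address tells software where in the buffer the trigger occurred.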
The following illustrates the functionality of UPC_P in distributed tracing of the present disclosure in one embodiment.
1) UPC_P is configured to send bits from a processor (or core), for example, either upper or lower 44 bits from processor, to UPC_C (e.g., set mode 2, enable UPC_P, set up event muxes).
2) In an implementation where the processor operates faster (e.g., twice as fast) than the rest of the performance counter components, UPC_P takes two x1 cycles of 44 bit data and widens it to 88 bits at ½ processor rate.
3) UPC_P places this data, along with trigger data sourced from the processor, or from an MMIO store to a register residing in the UPC_P or UPC_L2, on the daisy chain. For example, 88 bits are used for data, and 6 bits of trigger are passed.
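The widening in step 2 may be illustrated as follows. Placing the even (first) sample in the upper 44 bits is an assumption made for this sketch; the description does not state the ordering within the 88 bit word.

```python
# Illustrative model of widening two 44-bit x1 samples into one 88-bit word.
# Assumption: the even (first) sample occupies the upper 44 bits.
SAMPLE_BITS = 44
MASK44 = (1 << SAMPLE_BITS) - 1


def widen(samples):
    """Pair consecutive x1 samples (even, odd) into 88-bit words at half rate."""
    if len(samples) % 2:
        raise ValueError("an even number of x1 samples is required")
    return [
        ((even & MASK44) << SAMPLE_BITS) | (odd & MASK44)
        for even, odd in zip(samples[0::2], samples[1::2])
    ]
```

Each pair of x1 cycles yields one word, so the output runs at half the processor rate as described in step 2.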
At 7904, the central counter unit detects the stop trigger on the daisy chain bus. Depending on programming, the central counter unit may operate differently. For example, in one embodiment, in response to detecting the stop trigger signal on the daisy chain bus, the central counter unit may continue to read and store the trace data from the daisy chain bus for a predetermined number of cycles after detecting the stop trigger signal. In another embodiment, the central counter unit may stop reading and storing the trace data in response to detecting the stop trigger signal. Thus, the behavior of the central counter unit may be programmable. The programming may be done by the software, for instance, by writing to an appropriate register associated with the central counter unit. In another embodiment, the programming may be done by the software, for instance, by writing to an appropriate register associated with the local processing core, and the local processing core may pass this information to the central unit via the daisy chain bus.
The trace data stored in the SRAM may be read or otherwise accessed by the user, for example, via the user software. In one aspect, the hardware devices of the present disclosure allow the user software to directly access its data. No kernel system call may be needed to access the trace data, thus reducing the overhead needed to run the kernel or system calls.
The trigger may be sent by the processing cores or by software. For example, software or user program may write to an MMIO location to send the trigger bits on the daisy chain bus to the UPC_C. Trigger bits may also be pulled from the processing core bus and sent out on the daisy chain bus. The core sending out the trace information continues to place its trace data on the daisy chain bus and the central counter unit continuously reads the data on the daisy chain bus and stores the data in memory.
System Packaging
Each compute rack contains 2 midplanes, and each midplane contains 512 16-way PowerPC A2 compute processors, each on a compute ASIC. Midplanes are arranged vertically in the rack, one above the other, and are accessed from the front and rear of the rack. Each midplane has its own bulk power supply and line cord. These same racks also house I/O boards. Each passive compute midplane contains 16 node boards, each with 32 compute ASICs and 9 Blue Gene/Q Link ASICs, and a service card that provides clocks, a control bus, and power management. An I/O midplane may be formed with 16 I/O boards replacing the 16 node boards. An I/O board contains 8 compute ASICs, 8 link chips, and 8 PCIe 2.0 adapter card slots.
The midplane, the service card, the node (or I/O) boards, as well as the compute and direct current assembly (DCA) cards that plug into the I/O and node boards are described here. The BQC chips are mounted singly, on small cards with up to 72 (36) associated SDRAM-DDR3 memory devices (in the preferred embodiment, 64 (32) chips of 2 Gb SDRAM constitute a 16 (8) GB node, with the remaining 8 (4) SDRAM chips for chipkill implementation). Each node board contains 32 of these cards connected in a 5 dimensional array of length 2 (2^5=32). The fifth dimension exists only on the node board, connecting pairs of processor chips. The other dimensions are used to electrically connect 16 node boards through a common midplane forming a 4 dimensional array of length 4; a midplane is thus 4^4×2=512 nodes. Working together, 128 link chips in a midplane extend the 4 midplane dimensions via optical cables, allowing midplanes to be connected together. The link chips can also be used to space partition the machine into sub-tori partitions; a partition is associated with at least one I/O node and only one user program is allowed to operate per partition. The 10 torus directions are referred to as the +/−a, +/−b, +/−c, +/−d, +/−e dimensions. The electrical signaling rate is 4 Gb/s and a torus port is 4 bits wide per direction, for an aggregate bandwidth of 2 GB/s per port per direction. The 5-dimensional torus links are bidirectional. The raw aggregate link bandwidth is 2 GB/s*2*10=40 GB/s. The raw hardware Bytes/s:FLOP/s is thus 40:204.8=0.195. The link chips double the electrical datarate to 8 Gb/s, add a layer of encoding (8b/10b+parity), and drive directly the Tx and Rx optical modules at 10 Gb/s. Each port has 2 fibers for send and 2 for receive. The Tx+Rx modules handle 12+12 fibers, or 4 uni-directional ports, per pair, including spare fibers.
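The node count and bandwidth figures above may be reproduced with the following short calculation. The function names are introduced for this example only, and the 204.8 GFLOP/s peak rate is taken from the ratio quoted in the description.

```python
# Illustrative check of the packaging and torus bandwidth arithmetic.

def nodes_per_node_board():
    """5 dimensional array of length 2."""
    return 2 ** 5


def nodes_per_midplane():
    """4 dimensional array of length 4, times the fifth dimension of length 2."""
    return 4 ** 4 * 2


def port_bandwidth_gbytes(signal_rate_gbps=4, width_bits=4):
    """Per-port, per-direction bandwidth in GB/s."""
    return signal_rate_gbps * width_bits / 8


def aggregate_bandwidth_gbytes(port_gbytes, directions=10):
    """Send plus receive over all torus directions, in GB/s."""
    return port_gbytes * 2 * directions


def bytes_per_flop(bandwidth_gbytes, peak_gflops=204.8):
    """Raw hardware Bytes/s : FLOP/s ratio."""
    return bandwidth_gbytes / peak_gflops
```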
Hardware and software work together to seamlessly change from a failed optical fiber link, to a spare optical fiber link, without application fail.
The BQC ASIC contains a PCIe 2.0 port of width 8 (8 lanes). This port, which cannot be subdivided, can send and receive data at 4 GB/s (8b/10b encoded to 5 GB/s). It shares pins with the fifth (+/−e) torus ports. Single node compute cards can become single node I/O cards by enabling this adapter card port. Supported adapter cards include IB-QDR and dual 10 Gb Ethernet. Compute nodes communicate to I/O nodes over an I/O port, also 2+2 GB/s. Two compute nodes, each with an I/O link to an I/O node, are needed to fully saturate the PCIe bus. The I/O port is extended optically, through a 9th link chip on a node board, which allows compute nodes to communicate to I/O nodes on other racks. I/O nodes in their own racks communicate through their own 3 dimensional tori. This allows for fault tolerance in I/O nodes in that traffic may be re-directed to another I/O node, and flexibility in traffic routing in that I/O nodes associated with one partition may, software allowing, be used by compute nodes in a different partition.
A separate control host distributes at least a single 10 Gb/s Ethernet link (or equivalent bandwidth) to an Ethernet switch which in turn distributes 1 Gb/s Ethernet to a service card on each midplane. The control systems on BG/Q and BG/P are similar. The midplane service card in turn distributes the system clock, provides other rack control functions, and consolidates individual 1 Gb Ethernet connections to the node and I/O boards. On each node board and I/O board the service bus converts from 1 Gb Ethernet to local busses (JTAG, I2C, SPI) through a pair of Field Programmable Gate Array (FPGA) function blocks codenamed iCon and Palomino. The local busses of iCon and Palomino connect to the Link and Compute ASICs, local power supplies, and various sensors, for initialization, debug, monitoring, and other access functions.
Bulk power conversion is N+1 redundant. The input is 440V 3 phase, with one power supply with one input line cord and thus one bulk power supply per midplane at 48V output. Following the 48V DC stage is a custom N+1 redundant regulator supplying up to 7 different voltages built directly into the node and I/O boards. Power is brought from the bulk supplies to the node and I/O boards via cables. Additionally DC-DC converters of modest power are present on the midplane service card, to maintain persistent power even in the event of a node card failure, and to centralize power sourcing of low current voltages. Each BG/Q circuit card contains an EEPROM with Vital product data (VPD).
From a full system perspective, the supercomputer as a whole is controlled by a Service Node, which is the external computer that controls power-up of the machine, partitioning, boot-up, program load, monitoring, and debug. The Service Node runs the Control System software. The Service Node communicates with the supercomputer via a dedicated, private 1 Gb/s Ethernet connection, which is distributed via an external Ethernet switch to the Service Cards that control each midplane (half rack). Via an Ethernet switch located on this Service Card, it is further distributed via the Midplane Card to each Node Card and Link Card. On each Service Card, Node Card and Link Card, a branch of this private Ethernet terminates on a programmable control device, implemented as an FPGA (or a connected set of FPGAs). The FPGA(s) translate between the Ethernet packets and a variety of serial protocols to communicate with on-card devices: the SPI protocol for power supplies, the I2C protocol for thermal sensors and the JTAG protocol for Compute and Link chips.
On each card, the FPGA is therefore the center hub of a star configuration of these serial interfaces. For example, on a Node Card the star configuration comprises 34 JTAG ports (one for each compute or IO node) and a multitude of power supplies and thermal sensors.
Thus, from the perspective of the Control System software and the Service Node, each sensor, power supply or ASIC in the supercomputer system is independently addressable via a standard 1 Gb Ethernet network and IP packets. This mechanism allows the Service Node to have direct access to any device in the system, and is thereby an extremely powerful tool for booting, monitoring and diagnostics. Moreover, the Control System can partition the supercomputer into independent partitions for multiple users. As these control functions flow over an independent, private network that is inaccessible to the users, security is maintained.
In one embodiment, the computer utilizes a 5D torus interconnect network for various types of inter-processor communication. PCIe-2 and low cost switches and RAID systems are used to support locally attached disk storage and host (login nodes). A private 1 Gb Ethernet (coupled locally on card to a variety of serial protocols) is used for control, diagnostics, debug, and some aspects of initialization. Two types of high bandwidth, low latency networks make up the system “fabric”.
System Interconnect—Five Dimensional Torus
The Blue Gene compute ASIC incorporates an integrated 5-D torus network router. There are 11 bidirectional 2 GB/s raw data rate links in the compute ASIC, 10 for the 5-D torus and 1 for the optional I/O link. A network messaging unit (MU) implements the prior generation Blue Gene style network DMA functions to allow asynchronous data transfers over the 5-D torus interconnect. MU is logically separated into injection and reception units.
The injection side MU maintains injection FIFO pointers, as well as other hardware resources for putting messages into the 5-D torus network. Injection FIFOs are allocated in main memory and each FIFO contains a number of message descriptors. Each descriptor is 64 bytes in length and includes a network header for routing, the base address and length of the message data to be sent, and other fields, such as the packet type, for the reception MU at the remote node. A processor core prepares the message descriptors in injection FIFOs and then updates the corresponding injection FIFO pointers in the MU. The injection MU reads the descriptors and message data, packetizes messages into network packets, and then injects them into the 5-D torus network.
Three types of network packets are supported: (1) Memory FIFO packets; the reception MU writes packets including both network headers and data payload into pre-allocated reception FIFOs in main memory. The MU maintains pointers to each reception FIFO. The received packets are further processed by the cores; (2) Put packets; the reception MU writes the data payload of the network packets into main memory directly, at addresses specified in network headers. The MU updates a message byte count after each packet is received. Processor cores are not involved in data movement, and only have to check that the expected numbers of bytes are received by reading message byte counts; (3) Get packets; the data payload contains descriptors for the remote nodes. The MU on a remote node receives each get packet into one of its injection FIFOs, then processes the descriptors and sends data back to the source node.
MU resources are in memory mapped I/O address space and provide uniform access to all processor cores. In practice, the resources are likely grouped into smaller groups to give each core dedicated access. In one embodiment, 544 injection FIFOs, or 32/core, and 288 reception FIFOs, or 16/core, are supported. The reception byte counts for put messages are implemented in L2 using the atomic counters described herein below. There is effectively an unlimited number of counters, subject to the limit of available memory for such atomic counters.
The MU interface is designed to deliver close to the peak 18 GB/s (send)+18 GB/s (receive) 5-D torus nearest neighbor data bandwidth, when the message data is fully contained in the 32 MB L2. This is basically 1.8 GB/s+1.8 GB/s maximum data payload bandwidth over 10 torus links. When the total message data size exceeds the 32 MB L2, the maximum network bandwidth is then limited by the sustainable external DDR memory bandwidth.
The Blue Gene/P DMA drives the 3-D torus network, but not the collective network. On Blue Gene/Q, because the collective and I/O networks are embedded in the 5-D torus with a uniform network packet format, the MU will drive all regular torus, collective and I/O network traffic with a unified programming interface.
There is provided an architecture of a distributed parallel messaging unit (“MU”) for high throughput networks, wherein a messaging unit at one or more nodes of a network includes a plurality of messaging elements (“MEs”). In one embodiment, each ME operates in parallel and includes a DMA element for handling message transmission (injection) or message reception operations.
The top level architecture of the Messaging Unit 65100 interfacing with the Network Interface Unit 65150 is shown in
As shown in
In one embodiment, one function of the messaging unit 65100 is to ensure optimal data movement to, and from the network into the local memory system for the node by supporting injection and reception of message packets. As shown in
The MU 65100 further supports data prefetching into the L2 cache 70. On the injection side, the MU splits and packages messages into network packets, and sends packets to the network respecting the network protocol. On packet injection, the messaging unit distinguishes between packet injection and memory prefetching packets based on certain control bits in the message descriptor, e.g., such as a least significant bit of a byte of a descriptor 65102 shown in
With respect to on-chip local memory copy operation, the MU copies content of an area in the associated memory system to another area in the memory system. For memory-to-memory on chip data transfer, a dedicated SRAM buffer, located in the network device, is used. Injection of remote get packets and the corresponding direct put packets, in one embodiment, can be “paced” by software to reduce contention within the network. In this software-controlled paced mode, a remote get for a long message is broken up into multiple remote gets, each for a sub-message. The sub-message remote get is allowed to enter the network if the number of packets belonging to the paced remote get active in the network is less than an allowed threshold. To reduce contention in the network, software executing in the cores in the same nodechip can control the pacing.
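The software-controlled pacing described above may be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the names `split_remote_get` and `Pacer`, the byte-offset representation of sub-messages, and the way delivered packets are reported back are all hypothetical.

```python
def split_remote_get(offset, length, sub_len):
    """Break one long remote get into (offset, length) sub-message remote gets."""
    end = offset + length
    return [(o, min(sub_len, end - o)) for o in range(offset, end, sub_len)]


class Pacer:
    """Admits a sub-message only while active paced packets are below a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold      # allowed number of active paced packets
        self.active_packets = 0         # paced packets currently in the network

    def try_issue(self, sub_message_packets):
        # The sub-message remote get may enter the network only if the number
        # of active paced packets is less than the allowed threshold.
        if self.active_packets < self.threshold:
            self.active_packets += sub_message_packets
            return True
        return False

    def packets_delivered(self, count):
        # Hypothetical completion hook: software notes that packets have drained.
        self.active_packets -= count


subs = split_remote_get(0, 2500, 1024)
pacer = Pacer(threshold=4)
first = pacer.try_issue(4)      # admitted
second = pacer.try_issue(4)     # held back until packets drain
pacer.packets_delivered(2)
third = pacer.try_issue(4)      # admitted again
```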
The MU 65100 further includes an interface to a crossbar switch (Xbar) 65060 in additional implementations. The MU 65100 includes three (3) Xbar interface masters 65125 to sustain network traffic and one Xbar interface slave 65126 for programming. The three (3) Xbar interface masters 65125 may be fixedly mapped to the iMEs 65110, such that for example, the iMEs are evenly distributed amongst the three ports to avoid congestion. A DCR slave interface unit 65127 providing control signals is also provided.
The handover between network device 65150 and MU 65100 is performed via buffer memory, e.g., 2-port SRAMs, for network injection/reception FIFOs. The MU 65100, in one embodiment, reads/writes one port using, for example, an 800 MHz clock (operates at one-half the speed of a processor core clock, e.g., at 1.6 GHz, for example), and the network reads/writes the second port with a 500 MHz clock, for example. The handovers are handled using the network injection/reception FIFOs and FIFOs' pointers (which are implemented using latches, for example).
As shown in
As further shown in
In an alternate embodiment, to reduce size of each control register 65112 at each node, only a small portion of packet information is stored in each iME that is necessary to generate requests to switch 65060. Without holding a full packet header, an iME may require less than 100 bits of storage. Namely, each iME 65110 holds pointer to the location in the memory system that holds message data, packet size, and miscellaneous attributes.
Header data is sent from the message control SRAM 65140 to the network injection FIFO directly; thus the iME alternatively does not hold packet headers in registers. The Network Interface Unit 65150 provides signals from the network device to indicate whether or not there is space available in the paired network injection FIFO. It also writes data to the selected network injection FIFOs.
As shown in
For packet injection, the Xbar interface slave 65126 programs injection control by accepting write and read request signals from processors to program SRAM, e.g., an injection control SRAM (ICSRAM) 65130 of the MU 65100 that is mapped to the processor memory space. In one embodiment, Xbar interface slave processes all requests from the processor in-order of arrival. The Xbar interface masters generate connection to the Xbar 60 for reading data from the memory system, and transfers received data to the selected iME element for injection, e.g., transmission into a network.
The ICSRAM 65130 particularly receives information about a buffer in the associated memory system that holds message descriptors, from a processor desirous of sending a message. The processor first writes a message descriptor to a buffer location in the associated memory system, referred to herein as injection memory FIFO (imFIFO) shown in
Returning to
As further shown in
In a methodology 65200 implemented by the MU for sending message packets, ICSRAM holds information including the start address, size of the imFIFO buffer, a head address, a tail address, count of fetched descriptors, and free space remaining in the injection memory FIFO (i.e., start, size, head, tail, descriptor count and free space).
As shown in step 65204 of
The Message selection arbiter unit 65145 receives the message specific information from each of the message control SRAM 65140, and receives respective signals 65115 from each of the iME engines 65110a, 65110b, . . . , 65110n. Based on the status of each respective iME, Message selection arbiter unit 65145 determines if there is any message waiting to be sent, and pairs it to an available iME engine 65110a, 65110b, . . . , 65110n, for example, by issuing an iME engine selection control signal 65117. If there are multiple messages which could be sent, messages may be selected for processing in accordance with a pre-determined priority as specified, for example, in Bits 0-2 in virtual channel in field 65513 specified in the packet header of
Injection Operation
Returning to
Then, as indicated at 65203, once an imFIFO 65099 is updated with the message descriptor, the processor, via the Xbar interface slave 65126 in the messaging unit, updates the pointer located in the injection control SRAM (ICSRAM) 65130 to point to a new tail (address) of the next descriptor slot 65102 in the imFIFO 65099. That is, after a new descriptor is written to an empty imFIFO by a processor, e.g., imFIFO 65099, software executing on the cores of the same chip writes the descriptor to the location in the memory system pointed to by the tail pointer, and then the tail pointer is incremented for that imFIFO to point to the new tail address for receiving a next descriptor, and the “new tail” pointer address is written to ICSRAM 65130 as depicted in
As shown in the method depicting the processing at the injection side MU, as indicated at 65204 in
Next, the arbitration logic implemented in the message selection arbiter 65145 receives inputs from the message control SRAM 65140 and particularly, issues a request to process the available message descriptor, as indicated at 65209,
In one embodiment, each imFIFO 65099 has assigned a priority bit, thus making it possible to assign a high priority to that user FIFO. The arbitration logic assigns available iMEs to the active messages with high priority first (system FIFOs have the highest priority, then user high priority FIFOs, then normal priority user FIFOs). From the message control SRAM 65140, the packet header (e.g., 32 B), number of bytes, and data address are read out by the selected iME, as indicated at step 65210,
In one embodiment, as the message descriptor contains a bitmap indicating into which network injection FIFOs packets from the message may be injected (Torus injection FIFO map bits 65415 shown in
Messages from injection memory FIFOs can be assigned to and processed by any iME and its paired network injection FIFO. One of the iMEs is selected for operation on a packet-per-packet basis for each message, and an iME copies a packet from the memory system to a network injection FIFO, when space in the network injection FIFO is available. At step 65210, the iME first requests the message control SRAM to read out the header and send it directly to the network injection FIFO paired to the particular iME, e.g., network injection FIFO 65180b, in the example provided. Then, as shown at 65211,
Data reads are issued as fast as the Xbar interface master allows. For each read, the iME calculates the new data address. In one embodiment, the iME uses a start address (e.g., specified as address 65413 in
The selection of read request size is performed as follows: In the following examples, a “chunk” refers to a 32 B block that starts from 32 B-aligned address. Thus, for example, for a read request of 128 B, the iME requests 128 B block starting from address 128N (N: integer), when it needs at least the 2nd and 3rd chunks in the 128 B block (i.e., It needs at least 2 consecutive chunks starting from address 128N+32. This also includes the cases that it needs first 3 chunks, last 3 chunks, or all the 4 chunks in the 128 B block, for example.) For a read request of 64 B, the iME requests 64 B block starting from address 64N, e.g., when it needs both chunks included in the 64 B block. For read request of 32 B: the iME requests 32 B block. For example, when the iME is to read 8 data chunks from addresses 32 to 271, it generates requests as follows:
1. iME requests 128 B starting from address 0, and uses only the last 96 B;
2. iME requests 128 B starting from address 128, and uses all 128 B;
3. iME requests 32 B starting from address 256.
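The selection rules above can be sketched in a few lines. The function name `read_requests` and the inclusive byte-range interface are assumptions made for this illustration; the hardware is not stated to operate this way internally.

```python
CHUNK = 32  # a "chunk" is a 32 B block starting at a 32 B-aligned address


def read_requests(first_byte, last_byte):
    """Return (address, size) read requests covering first_byte..last_byte inclusive."""
    chunk = first_byte // CHUNK * CHUNK        # first chunk still needed
    last_chunk = last_byte // CHUNK * CHUNK    # last chunk needed
    requests = []
    while chunk <= last_chunk:
        block128 = chunk // 128 * 128
        block64 = chunk // 64 * 64
        if chunk <= block128 + 32 and block128 + 64 <= last_chunk:
            # At least the 2nd and 3rd chunks of the 128 B block are needed.
            requests.append((block128, 128))
            chunk = block128 + 128
        elif chunk == block64 and chunk + 32 <= last_chunk:
            # Both chunks of the 64 B block are needed.
            requests.append((block64, 64))
            chunk += 64
        else:
            requests.append((chunk, 32))
            chunk += 32
    return requests


example = read_requests(32, 271)
```

For the byte range 32 to 271 this sketch yields the same three requests as the numbered list above.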
It is understood that read data can arrive out of order, but returns via the Xbar interface master that issued the read, e.g., the read data will be returned to the same master port requesting the read. However, the order between read data return may be different from the request order. For example, suppose a master port requested to read address 1, and then requested to read address 2. In this case the read data for address 2 can arrive earlier than that for address 1.
iMEs are mapped to use one of the three Xbar interface masters in one implementation. When data arrives at the Xbar interface master, the iME which initiated that read request updates its byte counter of data received, and also generates the correct address bits (write pointer) for the paired network injection FIFO, e.g., network injection FIFO 65180b. Once all data initiated by that iME are received and stored to the paired network injection FIFO, the iME informs the network injection FIFO that the packet is ready in the FIFO, as indicated at 65212. The message control SRAM 65140 updates several fields in the packet header each time it is read by an iME. It updates the byte count of the message (how many bytes from that message are left to be sent) and the new data offset for the next packet.
Thus, as further shown in
Each time an iME 65110 starts injecting a new packet, the message descriptor information at the message control SRAM is updated. Once all packets from a message have been sent, the iME removes its entry from the message control SRAM (MCSRAM) and advances its head pointer in the injection control SRAM 65130. Particularly, once the whole message is sent, as indicated at 65219, the iME accesses the injection control SRAM 65130 to increment the head pointer, which then triggers a recalculation of the free space in the imFIFO 65099. That is, because the pointers to injection memory FIFOs work from the head address, when the message is finished the head pointer is updated to the next slot in the FIFO. When the FIFO end address is reached, the head pointer wraps around to the FIFO start address. If the updated head address pointer is not equal to the tail of the injection memory FIFO, then there is a further message descriptor in that FIFO that could be processed, i.e., the imFIFO is not empty and one or more message descriptors remain to be fetched. Then, the ICSRAM will request the next descriptor read via the Xbar interface master, and the process returns to 65204. Otherwise, if the head pointer is equal to the tail, the FIFO is empty.
As mentioned, the injection side 65100A of the Messaging Unit supports any byte alignment for data reads. The correct data alignment is performed when data are read out of the network injection FIFOs, i.e., alignment logic for the injection MU is located in the network device. The packet size will be the value specified in the descriptor, except for the last packet of a message. The MU adjusts the size of the last packet of a message to the smallest size that holds the remaining part of the message data. For example, when a user injects a descriptor for a 1025 B message whose packet size is 16 chunks=512 B, the MU will send this message using two 512 B packets and one 32 B packet. The 32 B packet is the last packet and only 1 B in the 32 B payload is valid.
As additional examples: for a 10 B message with a specified packet size=16 (512 B), the MU will send one 32 B packet, only 10 B in the 32 B data is valid. For a 0 B message with a specified packet size=anything, the MU will send one 0 B packet. For a 260 B message with a specified packet size=8 (256 B), the MU will send one 256 B packet and one 32 B packet. Only 4 B in the last 32 B packet data are valid.
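The packet sizing in these examples can be reproduced with a short sketch. The function name `packet_sizes` is hypothetical, and the sketch assumes that the last packet is rounded up to a whole number of 32 B chunks, which is what the examples above imply.

```python
CHUNK = 32  # packet payloads are sized in 32 B chunks


def packet_sizes(message_bytes, packet_chunks):
    """Return the payload size in bytes of each packet sent for one message."""
    packet_bytes = packet_chunks * CHUNK
    if message_bytes == 0:
        return [0]                              # a 0 B message is one 0 B packet
    full, rest = divmod(message_bytes, packet_bytes)
    sizes = [packet_bytes] * full               # full-size packets
    if rest:
        # Last packet: smallest whole number of chunks holding the remainder.
        sizes.append(-(-rest // CHUNK) * CHUNK)
    return sizes


print(packet_sizes(1025, 16))
print(packet_sizes(260, 8))
```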
In operation, the iMEs/rMEs further decide priority for payload read/write from/to the memory system based on the virtual channel (VC) of the message. Certain system VCs (e.g., “system” and “system collective”) will receive the highest priority. Other VCs (e.g., high priority and usercommworld) will receive the next highest priority. The remaining VCs will receive the lowest priority. Software executing at the processors sets the VC appropriately to obtain the desired priority.
It is further understood that each iME can be selectively enabled or disabled using a DCR register. An iME 65110 is enabled when the corresponding DCR (control signal), e.g., bit, is set to 1, and disabled when the DCR bit is set to 0, for example. If this DCR bit is 0, the iME will stay in the idle state until the bit is changed to 1. If this bit is cleared while the corresponding iME is processing a packet, the iME will continue to operate until it finishes processing the current packet. Then it will return to the idle state until the enable bit is set again. When an iME is disabled, messages are not processed by it. Therefore, if a message specifies only this iME in the FIFO map, this message will not be processed and the imFIFO will be blocked until the iME is enabled again.
Reception
As shown in
In one embodiment, data is stored to the Xbar interface master in 16-byte units and must be 16-byte aligned. The requestor rME can mask some bytes, i.e., it can specify which bytes in the 16-byte data are actually stored. The role of the alignment logic is to place received data in the appropriate position in a 16-byte data line. For example, suppose an rME needs to write 20 bytes of received data to memory system addresses 35 to 54. In this case 2 write requests are necessary. First, the alignment logic builds the first 16-byte write data. The 1st to 13th received bytes are placed in bytes 3 to 15 of the first 16-byte data. Then the rME tells the Xbar interface master to store the 16-byte data to address 32, but not to store bytes 0, 1, and 2 of the 16-byte data. As a result, bytes 3 to 15 of the 16-byte data (i.e., the 1st to 13th received bytes) will be written to addresses 35 to 47 correctly. Then the alignment logic builds the second 16-byte write data. The 14th to 20th received bytes are placed in bytes 0 to 6 of the second 16-byte data. Then the rME tells the Xbar interface master to store the 16-byte data to address 48, but not to store bytes 7 to 15 of the 16-byte data. As a result, the 14th to 20th received bytes will be written to addresses 48 to 54 correctly.
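The masked, 16-byte-aligned stores in this example may be sketched as follows. The function name `aligned_writes` and the representation of byte enables as a list of booleans are assumptions for illustration only.

```python
LINE = 16  # stores are issued as 16-byte, 16-byte-aligned lines


def aligned_writes(address, data):
    """Split data into (line_address, line_bytes, byte_enables) store requests."""
    writes = []
    pos = 0
    while pos < len(data):
        target = address + pos
        line_addr = target // LINE * LINE       # 16 B-aligned store address
        start = target - line_addr              # first enabled byte in the line
        count = min(LINE - start, len(data) - pos)
        buf = bytearray(LINE)
        buf[start:start + count] = data[pos:pos + count]
        enables = [start <= i < start + count for i in range(LINE)]
        writes.append((line_addr, bytes(buf), enables))
        pos += count
    return writes


received = bytes(range(1, 21))                  # 20 received bytes, valued 1..20
stores = aligned_writes(35, received)
```

With a destination of addresses 35 to 54, the sketch produces one store to address 32 with bytes 0 to 2 masked and one store to address 48 with bytes 7 to 15 masked.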
Although not shown, control registers and SRAMs are provided that store part of control information when needed for packet reception. These status registers and SRAMs may include, but are not limited to, the following registers and SRAMs: Reception control SRAM (Memory mapped); Status registers (Memory mapped); and remote put control SRAM (Memory mapped).
In operation, when one of the network reception FIFOs receives a packet, the network device generates a signal 65159 for receipt at the paired rME 65120 to inform the paired rME that a packet is available. In one aspect, the rME reads the packet header from the network reception FIFO, and parses the header to identify the type of the packet received. There are three different types of packets: memory FIFO packets, direct put packets, and remote get packets. The type of packet is specified by bits in the packet header, as described below, and determines how the packets are processed.
In one aspect, for direct put packets, data from direct put packets processed by the reception side MU device 65100B are put in specified locations in memory system. Information is provided in the packet to inform the rME of where in memory system the packet data is to be written. Upon receiving a remote get packet, the MU device 65100B initiates sending of data from the receiving node to some other node.
Other elements of the reception side MU device 65100B include the Xbar interface slave 65176 for management. It accepts write and read requests from a processor and updates SRAM values such as reception control SRAM (RCSRAM) 65160 or remote put control SRAM (R-put SRAM) 65170 values. Further, the Xbar interface slave 65176 reads SRAM and returns read data to the Xbar. In one embodiment, Xbar interface slave 65176 processes all requests in order of arrival. More particularly, the Xbar interface master 65125 generates a connection to the Xbar 60 to write data to the memory system. Xbar interface master 65125 also includes an arbiter unit 65157 for arbitrating between multiple rMEs (reception messaging engine units) 65120a, 65120b, . . . 65120n to access the Xbar interface master. In one aspect, as multiple rMEs compete for an Xbar interface master to store data, the Xbar interface master decides which rME to select. Various algorithms can be used for selecting an rME. In one embodiment, the Xbar interface master selects an rME based on priority. The priority is decided based on the virtual channel of the packet the rME is receiving (e.g., “system” and “system collective” have the highest priority, “high priority” and “usercommworld” have the next highest priority, and the others have the lowest priority). If there are multiple rMEs that have the same priority, one of them may be selected randomly.
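One possible form of the priority-based selection described above is sketched below. The table `VC_PRIORITY`, the function `select_rme`, and the string names used for the virtual channels are hypothetical; only the three priority levels and the random tie-break come from the description.

```python
import random

# Lower number means higher priority; unlisted virtual channels are lowest.
VC_PRIORITY = {
    "system": 0,
    "system collective": 0,
    "high priority": 1,
    "usercommworld": 1,
}
LOWEST_PRIORITY = 2


def select_rme(requests, rng=random):
    """Pick one rME from a list of (rme_id, virtual_channel) store requests."""
    if not requests:
        return None
    best = min(VC_PRIORITY.get(vc, LOWEST_PRIORITY) for _, vc in requests)
    candidates = [rme for rme, vc in requests
                  if VC_PRIORITY.get(vc, LOWEST_PRIORITY) == best]
    return rng.choice(candidates)       # equal priority: choose randomly


winner = select_rme([(0, "user"), (1, "system"), (2, "high priority")])
```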
As in the MU injection side of
The reception control SRAM 65160 is written to include pointers (start, size, head and tail) for rmFIFOs, and further, is mapped in the processor's memory address space. The start pointer points to the FIFO start address. The size defines the FIFO end address (i.e. FIFO end=start+size). The head pointer points to the first valid data in the FIFO, and the tail pointer points to the location just after the last valid data in the FIFO. The tail pointer is incremented as new data is appended to the FIFO, and the head pointer is incremented as new data is consumed from the FIFO. The head and tail pointers need to be wrapped around to the FIFO start address when they reach the FIFO end address. A reception control state machine 65163 arbitrates access to reception control SRAM (RCSRAM) between multiple rMEs and processor requests, and it updates reception memory FIFO pointers stored at the RCSRAM. As will be described in further detail below, R-Put SRAM 65170 includes control information for put packets (base address for data, or for a counter). This R-Put SRAM is mapped in the memory address space. R-Put control FSM 65175 arbitrates access to R-put SRAM between multiple rMEs and processor requests. In one embodiment, the arbiter mechanism employed alternately grants an rME and the processor an access to the R-put SRAM. If there are multiple rMEs requesting for access, the arbiter selects one of them randomly. There is no priority difference among rMEs for this arbitration.
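The start, size, head and tail bookkeeping described above can be illustrated with a small model. The class `MemoryFifo` and its method names are hypothetical. The sketch also assumes that head equal to tail means empty, so it never lets the tail catch up with the head; the description does not state how a completely full FIFO is distinguished.

```python
class MemoryFifo:
    """Toy model of the start/size/head/tail pointers of a memory FIFO."""

    def __init__(self, start, size):
        self.start = start
        self.size = size
        self.head = start               # first valid data
        self.tail = start               # just after the last valid data

    @property
    def end(self):
        return self.start + self.size   # FIFO end = start + size

    def _advance(self, pointer, nbytes):
        pointer += nbytes
        if pointer >= self.end:         # wrap around to the FIFO start address
            pointer = self.start + (pointer - self.end)
        return pointer

    def used(self):
        return (self.tail - self.head) % self.size

    def is_empty(self):
        return self.head == self.tail

    def append(self, nbytes):
        # Assumption: keep tail from reaching head so that equal means empty.
        if nbytes >= self.size - self.used():
            raise BufferError("FIFO full")
        self.tail = self._advance(self.tail, nbytes)

    def consume(self, nbytes):
        if nbytes > self.used():
            raise BufferError("not enough data")
        self.head = self._advance(self.head, nbytes)


fifo = MemoryFifo(start=0x1000, size=128)
fifo.append(96)
fifo.consume(96)
fifo.append(64)                         # tail wraps past the FIFO end address
```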
In the case of memory FIFO packet processing, in one embodiment, memory FIFO packets include a reception memory FIFO ID field in the packet header that specifies the destination rmFIFO in memory system. The rME of the MU device 65100B parses the received packet header to obtain the location of the destination rmFIFO. As shown in
In one embodiment, as described in greater detail herein, to allow simultaneous usage of the same rmFIFO by multiple rMEs, each rmFIFO has advance tail, committed tail, and two counters for advance tail ID and committed tail ID. The rME copies packets to the memory system location starting at the advance tail, and gets advance tail ID. After the packet is copied to the memory system, the rME checks the committed tail ID to determine if all previously received data for that rmFIFO are copied. If this is the case, the rME updates committed tail, and committed tail ID, otherwise it waits. An rME implements logic to ensure that all store requests for header and payload have been accepted by the Xbar before updating committed tail (and optionally issuing interrupt).
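The advance tail and committed tail scheme may be sketched as a ticket mechanism. The class `SharedRmFifo`, the tuple used as a ticket, and the omission of pointer wrap-around are simplifications assumed for this illustration.

```python
class SharedRmFifo:
    """Toy model of one rmFIFO shared by several rMEs (wrap-around omitted)."""

    def __init__(self):
        self.advance_tail = 0       # where the next packet will be copied
        self.advance_id = 0         # counter for advance tail ID
        self.committed_tail = 0     # end of data visible to the processor
        self.committed_id = 0       # counter for committed tail ID

    def reserve(self, nbytes):
        """An rME takes the advance tail and its ID, then copies its packet there."""
        ticket = (self.advance_id, self.advance_tail, nbytes)
        self.advance_tail += nbytes
        self.advance_id += 1
        return ticket

    def try_commit(self, ticket):
        """Commit only if all previously received data has already been committed."""
        ticket_id, offset, nbytes = ticket
        if self.committed_id != ticket_id:
            return False            # earlier packets still pending: wait
        self.committed_tail = offset + nbytes
        self.committed_id += 1
        return True


fifo = SharedRmFifo()
a = fifo.reserve(64)                # first rME
b = fifo.reserve(32)                # second rME, copying in parallel
early = fifo.try_commit(b)          # second finishes first and must wait
ok_a = fifo.try_commit(a)
ok_b = fifo.try_commit(b)
```

The design choice illustrated here is that packets may be copied out of order, while the committed tail only ever advances in arrival order.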
In the case of direct put packet processing, in one embodiment, the MU device 65100B further initiates putting data in specified location in the memory system. Direct put packets include in their headers a data ID field and a counter ID field-both used to index the R-put SRAM 65170; however, the header includes other information such as, for example, a number of valid bytes, a data offset value, and counter offset value. The rME of the MU device 65100B parses the header of the received direct put packet to obtain the data ID field and a counter ID field values. Particularly, as shown in
Base address+data offset=address for the packet
In one embodiment, the data offset is stored in the packet header field “Put Offset” 65541 as shown in
Likewise, a counter base address is read from the R-put SRAM 65170, in one embodiment, and the rME calculates another address in the memory system where a counter is located. The value of the counter is to be updated by the rME. In one embodiment, the address for counter storage is calculated according to the following:
Base address+counter offset=address for the counter
In one embodiment, the counter offset value is stored in header field “Counter Offset” 65542,
In one embodiment, the rME moves the packet payload from a network reception FIFO 65190 into the memory system location calculated for the packet. For example, as shown at 65323, the rME reads the packet payload and, via the Xbar interface master, writes the payload contents to the memory system specified at the calculated address, e.g., in 16 B chunks or other byte sizes. Additionally, as shown at 65325, the rME atomically updates a byte counter in the memory system.
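The two address calculations and the counter update for a direct put packet can be sketched together. Everything in this block is a hypothetical model: the dictionary-based header and R-put SRAM, the field names, and in particular the assumption that the byte counter is incremented by the number of valid bytes, since the description only states that the counter is updated atomically.

```python
def process_direct_put(memory, counters, rput_sram, header, payload):
    """Write a direct put payload and update its byte counter (toy model)."""
    # Base address + data offset = address for the packet
    data_addr = rput_sram[header["data_id"]] + header["put_offset"]
    # Base address + counter offset = address for the counter
    counter_addr = rput_sram[header["counter_id"]] + header["counter_offset"]

    valid = header["valid_bytes"]
    memory[data_addr:data_addr + valid] = payload[:valid]
    # Assumption: the counter accumulates received bytes; real hardware does
    # this atomically, which a single-threaded sketch does not need to model.
    counters[counter_addr] = counters.get(counter_addr, 0) + valid
    return data_addr, counter_addr


memory = bytearray(256)
counters = {}
rput_sram = {3: 64, 7: 1000}            # hypothetical base addresses by ID
header = {"data_id": 3, "counter_id": 7, "put_offset": 16,
          "counter_offset": 8, "valid_bytes": 4}
addresses = process_direct_put(memory, counters, rput_sram, header, b"abcdXXXX")
```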
The alignment logic implemented at each rME supports any alignment of data for direct put packets.
As shown in
Utilizing notation in
Then, the rME requests the Xbar interface master to store BUF to address A-R=16 (16 B-aligned), resulting in byte enable (BE)=0000000000000011. As a result, D0 and D1 are stored to the correct addresses 30 and 31, and the variables are re-calculated as: A=A-R+16=32, N=N+R−16=18. Then, a further check is performed to determine if the next 16 B line is the last (N≤16); in this example, the determination would be that the next line is not the last line. Thus, the next line is stored, e.g., by copying the next 16 bytes (D2, . . . , D17) to BUF(0 to 15) and letting BE(0 to 15)=1 as depicted in
Furthermore, an error correcting code (ECC) capability is provided and an ECC is calculated for each 16 B data sent to the Xbar interface master and on byte enables.
In a further aspect of direct put packets, multiple rMEs can receive and process packets belonging to the same message in parallel. Multiple rMEs can also receive and process packets belonging to different messages in parallel.
Further, it is understood that a processor core at the compute node has previously performed operations including: the writing of data into the remote put control SRAM 65170; and, a polling of the specified byte counter in the memory system until it is updated to a value that indicates message completion.
In the case of remote get packet processing, in one embodiment, the MU device 65100B receives remote get packets that include, in their headers, an injection memory FIFO ID. The imFIFO ID is used to index the ICSRAM 65130. As shown in the MU reception side 65100B-3 of
Further, at 65333, via the Xbar interface master, the rME writes descriptors from the packet payload to the memory system location in the imFIFO pointed to by the corresponding tail pointer read from the ICSRAM. In one example, payload data at the network reception FIFO 65190 is written in 16 B chunks or other byte denominations. Then, at 65335, the rME updates the imFIFO tail pointer in the injection control SRAM 65130 so that the imFIFO includes the stored descriptors. The Byte alignment logic 65122 implemented at the rME ensures that the data to be written to the memory system are aligned, in one embodiment, on a 32 B boundary for memory FIFO packets. Further in one embodiment, error correction code is calculated for each 16 B data sent to the Xbar and on byte enables.
Each rME can be selectively enabled or disabled using a DCR register. For example, an rME is enabled when the corresponding DCR bit is 1 at the DCR register, and disabled when it is 0. If this DCR bit is 0, the rME will stay in the idle state or another wait state until the bit is changed to 1. The software executing on a processor at the node sets a DCR bit. The DCR bits are physically connected to the rMEs via a “backdoor” access mechanism (not shown). Thus, the register value propagates to rME immediately when it is updated.
If this DCR bit is cleared while the corresponding rME is processing a packet, the rME will continue to operate until it reaches either the idle state or a wait state. Then it will stay in the idle or wait state until the enable bit is set again. When an rME is disabled, even if there are some available packets in the network reception FIFO, the rME will not receive packets from the network reception FIFO. Therefore, all messages received by the network reception FIFO will be blocked until the corresponding rME is enabled again.
When an rME cannot store a received packet because the target imFIFO or rmFIFO is full, the rME will poll the FIFO until it has enough free space. More particularly, the rME accesses the ICSRAM and, when the imFIFO is full, the ICSRAM communicates to the rME that it is full and cannot accept the request. The rME then waits for a period of time before accessing the ICSRAM again. This process is repeated until the imFIFO becomes not-full and the rME's request is accepted by the ICSRAM. The process is similar when the rME accesses the reception control SRAM but the rmFIFO is full.
In one aspect, a DCR interrupt will be issued to report the FIFO full condition to the processors on the chip. Upon receiving this interrupt, the software takes action to make free space for the imFIFO/rmFIFO (e.g., increasing size, draining packets from rmFIFO, etc.). Software running on the processor on the chip manages the FIFO and makes enough space so that the rME can store the pending packet. Software can freeze rMEs by writing DCR bits to enable/disable rMEs so that it can safely update FIFO pointers.
Packet Header and Routing
In one embodiment, a packet size may range from 32 to 544 bytes, in increments of 32 bytes. In one example, the first 32 bytes constitute a packet header for an example network packet. As shown in
The first network header portion 65501 as shown in
A further field 65513 includes class route bits, which must be defined so that the packet can travel along appropriate links. For example, bits indicated in packet header field 65513 may include: virtual channel bit (e.g., which bit may have a value to indicate one of the following classes: dynamic; deterministic (escape); high priority; system; user commworld; subcommunicator; or system collective); zone routing id bit(s); and a “stay on bubble” bit.
A further field 65514 includes destination addresses associated with the particular dimension A-E, for example. A further field 65515 includes a value indicating the number (e.g., 0 to 16) of 32 byte data payload chunks added to header, i.e., payload sizes, for each of the memory FIFO packets, put, get or paced-get packets. Other packet header fields indicated as header field 65516 include data bits to indicate the packet alignment (set by MU), a number of valid bytes in payload (e.g., the MU informs the network which is the valid data of those bytes, as set by MU), and, a number of 4 B words, for example, that indicate amount of words to skip for injection checksum (set by software). That is, while message payload requests can be issued for 32 B, 64 B and 128 B chunks, data comes back as 32 B units via the Xbar interface master, and a message may start at a middle of one of those 32 B units. The iME keeps track of this and writes, in the packet header, the alignment that is off-set within the first 32 B chunk at which the message starts. Thus, this offset will indicate the portion of the chunk that is to be ignored, and the network device will only parse out the useful portion of the chunk for processing. In this manner, the logic implemented at the network logic can figure out which bytes out of the 32 B are the correct ones for the new message. The MU knows how long the packet is (message size or length), and from the alignment and the valid bytes, instructs the Network Interface Unit where to start and end the data injection, i.e., from the 32 Byte payload chunk being transferred to network device for injection. For data reads, the alignment logic located in the network device supports any byte alignment.
As shown in
The payload size field specifies the number of 32-byte chunks. Thus, the payload size ranges from 0 B to 512 B (32 B*16).
Remaining bytes of each network packet or collective packet header of
For the case of direct put packets, the direct put packet header 65540 includes bits specifying: a Rec. Payload Base Address ID, Put Offset and a reception Counter ID (e.g., set by software), a number of Valid Bytes in Packet Payload (specifying how many bytes in the payload are actually valid—for example, when the packet has 2 chunks (=32 B*2=64 B) payload but the number of valid bytes is 35, the first 35 bytes out of 64 bytes payload data is valid; thus, MU reception logic will store only first 35 bytes to the memory system.); and Counter Offset value (e.g., set by software), each such as processed by MU 65100B-2 as described herein in connection with
For the case of remote get packets, the remote get packet header 550 includes the Remote Get Injection FIFO ID such as processed by the MU 65100B-3 as described herein in connection with
Interrupt Control
Interrupts and, in one embodiment, interrupt masking for the MU 65100 provide additional functional flexibility. In one embodiment, interrupts may be grouped to target a particular processor on the chip, so that each processor can handle its own interrupt. Alternately, all interrupts can be configured to be directed to a single processor which acts as a “monitor” of the processors on the chip. The exact configuration can be programmed by software at the node in the way that it writes values into the configuration registers.
In one example, there are multiple interrupt signals 65802 that can be generated from the MU for receipt at the 17 processor cores shown in the compute node embodiment depicted in
For example, MU generated interrupts include: packet arrival interrupts that are raised by MU reception logic when a packet has been received. Using this interrupt, the software being run at the node can know when a message has been received. This interrupt is raised when the interrupt bit in the packet header is set to 1. The application software on the sender node can set this bit as follows: if the interrupt bit in the header in a message descriptor is 1, the MU will set the interrupt bit of the last packet of the message. As a result, this interrupt will be raised when the last packet of the message has been received.
MU generated interrupts further include: imFIFO threshold crossed interrupt that is raised when the free space of an imFIFO exceeds a threshold. The threshold can be specified by a control register in DCR. Using this interrupt, application software can know that an MU has processed descriptors in an imFIFO and there is space to inject new descriptors. This interrupt is not used for an imFIFO that is configured to receive remote get packets.
MU generated interrupts further include: remote get imFIFO threshold crossed interrupt. This interrupt may be raised when the free space of an imFIFO falls below the threshold (specified in DCR). Using this interrupt, the software can notice that MU is running out of free space in the FIFO. Software at the node might take some action to avoid FIFO full (e.g. increasing FIFO size). This interrupt is used only for an imFIFO that is configured to receive remote get packets.
MU generated interrupts further include an rmFIFO threshold crossed interrupt, which is similar to the remote get FIFO threshold crossed interrupt; this interrupt is raised when the free space of an rmFIFO falls below the threshold.
MU generated interrupts further include a remote get imFIFO insufficient space interrupt that is raised when the MU receives a remote get packet but there is no more room in the target imFIFO to store this packet. Software responds by taking some action to clear the FIFO.
MU generated interrupts further include an rmFIFO insufficient space interrupt which may be raised when the MU receives a memory FIFO packet but there is no room in the target rmFIFO to store this packet. Software running at the node may respond by taking some action to make free space. MU generated interrupts further include error interrupts that report various errors and are not raised under normal operations.
In one example embodiment shown in
In addition to these 68 direct interrupts 65802, there may be provided 5 more interrupt lines 65805, with the interrupt groups connected as follows: groups 0 to 3 are connected to the first interrupt line, groups 4 to 7 to the second line, groups 8 to 11 to the third line, groups 12 to 15 to the fourth line, and group 16 to the fifth interrupt line. These five interrupts 65805 are sent to a global event aggregator (GEA) 65900 where they can then be forwarded to any thread on any core.
The MU may additionally include three DCR mask registers to control which of these 68 direct interrupts participate in raising the five interrupt lines connected to the GEA unit. The three (3) DCR registers, in one embodiment, may have 68 mask bits, and are organized as follows: 32 bits in the first mask register for cores 0 to 7, 32 bits in the second mask register for cores 8 to 15, and 4 mask bits for the 17th core in the third mask register.
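As a hedged illustration of the grouping just described, the following sketch maps an interrupt group to its GEA line and locates the mask bit for one direct interrupt. It assumes that each of the 17 groups corresponds to one core and holds four direct interrupts (17*4=68), and that mask bits are packed four per core in ascending order within each register; the bit ordering and the function names are assumptions made for illustration only.

```python
NUM_GROUPS = 17        # one group per core (assumption)
IRQS_PER_GROUP = 4     # 17 groups * 4 = 68 direct interrupts
GROUPS_PER_LINE = 4    # groups 0-3 share a line, 4-7 the next, and so on
CORES_PER_MASK_REG = 8 # 32 mask bits per register / 4 bits per core


def gea_line(group):
    """Zero-based GEA line for a group: groups 0-3 -> 0, ..., group 16 -> 4."""
    if not 0 <= group < NUM_GROUPS:
        raise ValueError("group out of range")
    return group // GROUPS_PER_LINE


def mask_location(group, irq):
    """Return (mask register index, bit position) for one direct interrupt.

    The bit position within the register is a hypothetical packing.
    """
    if not 0 <= group < NUM_GROUPS or not 0 <= irq < IRQS_PER_GROUP:
        raise ValueError("group or irq out of range")
    reg = group // CORES_PER_MASK_REG
    bit = (group % CORES_PER_MASK_REG) * IRQS_PER_GROUP + irq
    return reg, bit


print(gea_line(16))          # 4, i.e. the fifth interrupt line
print(mask_location(16, 3))  # (2, 3): last of the 4 bits in the third register
```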
In addition to these interrupts, there are additional interrupt lines 65806 for fatal and nonfatal interrupts signaling more serious errors such as a reception memory FIFO becoming full, fatal errors (e.g., an ECC uncorrectable error), correctable error counts exceeding a threshold, or protection errors. All interrupts are level-based and are not pulsed.
Additionally, software can “mask” interrupts, i.e., program mask registers to raise an interrupt only for particular events, and to ignore other events. Thus, each interrupt can be masked in MU, i.e., software can control whether MU propagates a given interrupt to the processor core, or not. The MU can remember that an interrupt happened even when it is masked. Therefore, if the interrupt is unmasked afterward, the processor core will receive the interrupt.
As for packet arrival and threshold crossed interrupts, they can be masked on a per-FIFO basis. For example, software can mask a threshold crossed interrupt for imFIFO 0,1,2, but enable this interrupt for imFIFO 3, et seq.
In one embodiment, direct interrupts 65802 and shared interrupt lines 65810 are available for propagating interrupts from the MU to the processor core. Using direct interrupts 65802, each processor core can directly receive packet arrival and threshold crossed interrupts generated at a subset of imFIFOs/rmFIFOs. For this purpose, there are logic paths that directly connect the MU and the cores.
For example, a processor core 0 can receive interrupts that happened on imFIFO 0-31 and rmFIFO 0-15. Similarly, core 1 can receive interrupts that happened on imFIFO 32-63 and rmFIFO 16-31. In this example scheme, a processor core N (N=0, . . . , 16) can receive interrupts that happened on imFIFO 32*N to 32*N+31 and rmFIFO 16*N to 16*N+15. Using this mechanism each core can monitor its own subset of imFIFOs/rmFIFOs which is useful when software manages imFIFOs/rmFIFOs using 17 cores in parallel. Since no central interrupt control mechanism is involved, direct interrupts are faster than GEA aggregated interrupts as these interrupt lines are dedicated for MU.
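The FIFO-to-core assignment in this example scheme reduces to integer division, as the following minimal sketch shows. The function names are illustrative and not part of the described hardware; the constants are taken from the example above (32 imFIFOs and 16 rmFIFOs per core, 17 cores).

```python
NUM_CORES = 17
IMFIFOS_PER_CORE = 32
RMFIFOS_PER_CORE = 16


def core_for_imfifo(fifo_id):
    """Core that receives direct interrupts for the given imFIFO."""
    if not 0 <= fifo_id < NUM_CORES * IMFIFOS_PER_CORE:
        raise ValueError("imFIFO id out of range")
    return fifo_id // IMFIFOS_PER_CORE


def core_for_rmfifo(fifo_id):
    """Core that receives direct interrupts for the given rmFIFO."""
    if not 0 <= fifo_id < NUM_CORES * RMFIFOS_PER_CORE:
        raise ValueError("rmFIFO id out of range")
    return fifo_id // RMFIFOS_PER_CORE


def fifos_for_core(n):
    """Return (imFIFO id range, rmFIFO id range) monitored by core n."""
    im = range(IMFIFOS_PER_CORE * n, IMFIFOS_PER_CORE * (n + 1))
    rm = range(RMFIFOS_PER_CORE * n, RMFIFOS_PER_CORE * (n + 1))
    return im, rm


print(core_for_imfifo(63), core_for_rmfifo(16))  # 1 1
```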
Software can identify the source of the interrupt quickly, speeding up interrupt handling. A processor core can ignore interrupts reported via this direct path, i.e., a direct interrupt can be masked using a control register.
As shown in
Using this controller, a processor core can receive arbitrary interrupts issued by the MU. For example, a core can listen to threshold crossed interrupts on all the imFIFOs and rmFIFOs. It is understood that a core can ignore interrupts coming from this interrupt controller.
As shown in
In one embodiment, the control logic device 66165 processing may be external to both the L2 cache and MU 65100. Further, in one embodiment, the Reception control SRAM includes associated status and control registers that maintain and atomically update these advance tail ID counter, advance tail, committed tail ID counter, committed tail pointer values in addition to fields maintaining packet “start” address, “size minus one” and “head” fields.
When a MU wants to read from or write to main memory, it accesses L2 memory controller via the xbar master ports. If the access hits L2, the transaction completes within the L2 and hence no actual memory access is necessary. On the other hand, if it doesn't hit, L2 has to request the memory controller (e.g., DDR-3 Controller 78,
When a DMA engine implemented in an rME wants to store a packet, it obtains from the RCSRAM 65160 the advance tail 66197, which points to the next memory area in that reception memory FIFO 66199 in which to store a packet (advance tail address). The advance tail is then moved (incremented) for the next packet. The read of the advance tail and the increment of the advance tail occur at the same time and cannot be interrupted, i.e., they happen atomically. After the DMA at the rME has stored the packet, it requests an atomic update of the commit tail pointer to indicate the last address up to which packets have been completely stored. The commit tail may be referred to by software to know up to where there are completely stored packets in the memory area (e.g., software checks the commit tail and the processor may read packets in the main memory up to the commit tail for further processing). DMAs write the commit tail in the same order as they get the advance tail. Thus, the commit tail will hold the correct last address. To manage and guarantee this ordering between DMAs, the advance ID and commit ID are used.
As exemplified in
As exemplified in
Continuing to
It should be understood that the foregoing described algorithm holds for multiple DMA engine writes in any multiprocessing architecture. It holds even when all DMAs (e.g., DMA0 . . . 15) in respective rMEs are configured to operate in parallel. In one embodiment, the commit ID and advance ID are 5 bit counters that roll over to zero when they overflow. Further, in one embodiment, memory FIFOs are implemented as circular buffers with pointers (e.g., head and tail) that, when updated, must account for circular wrap conditions by using modular arithmetic, for example, to calculate the wrapped pointer address.
Once a packet of a particular byte length has arrived at a particular DMA engine (e.g., at an rME), then in 66215, the globally maintained advance tail and advance ID are locally recorded by the DMA engine. Then, as indicated at 66220, the advance tail is set equal to the advance tail+size of the packet being stored in memory, and, at the same time (atomically) advance ID is incremented, i.e., advance ID=advance ID+1, in the embodiment described. The packet is then stored to the memory area pointed to by the locally recorded advance tail in the manner as described herein at 66224. At this point, an attempt is made to update the commit tail and commit tail ID at 66229. Proceeding next to 66231,
Thus, in a multiprocessing system comprising parallel operating distributed messaging units (MUs), each with multiple DMA engines (messaging elements, MEs), packets destined for the same rmFIFO, or packets targeted to the same processor in a multiprocessor system, could be received at different DMAs. To achieve high throughput, the packets can be processed in parallel on different DMAs.
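The advance tail and commit tail bookkeeping described above may be modeled, under stated assumptions, by the following sketch. Each method stands for one atomic hardware operation and is modeled here single-threaded. The rule that a DMA may commit only when the commit ID equals its locally recorded advance ID is an assumption consistent with the ordering requirement stated above; the class and method names are hypothetical, and free-space checking is omitted.

```python
class ReceptionFifoModel:
    """Toy model of one rmFIFO's tail pointers (not the hardware design)."""

    ID_MODULUS = 1 << 5  # 5-bit ID counters roll over to zero

    def __init__(self, start, size):
        self.start = start
        self.size = size
        self.advance_tail = start
        self.advance_id = 0
        self.commit_tail = start
        self.commit_id = 0

    def _wrap(self, addr):
        # Circular buffer: wrap the pointer with modular arithmetic.
        return self.start + (addr - self.start) % self.size

    def reserve(self, packet_bytes):
        """Atomically read the advance tail/ID and move them forward."""
        local_tail, local_id = self.advance_tail, self.advance_id
        self.advance_tail = self._wrap(self.advance_tail + packet_bytes)
        self.advance_id = (self.advance_id + 1) % self.ID_MODULUS
        return local_tail, local_id

    def try_commit(self, local_tail, local_id, packet_bytes):
        """Commit only when it is this DMA's turn (assumed rule)."""
        if self.commit_id != local_id:
            return False  # an earlier packet is still being stored; retry later
        self.commit_tail = self._wrap(local_tail + packet_bytes)
        self.commit_id = (self.commit_id + 1) % self.ID_MODULUS
        return True


fifo = ReceptionFifoModel(start=0x1000, size=1024)
first = fifo.reserve(64)     # DMA 0 reserves space for a 64 B packet
second = fifo.reserve(128)   # DMA 1 reserves space for a 128 B packet
print(fifo.try_commit(*second, 128))  # False: DMA 0 has not committed yet
print(fifo.try_commit(*first, 64))    # True
print(fifo.try_commit(*second, 128))  # True
print(hex(fifo.commit_tail))          # 0x10c0
```

In this model a DMA that finishes storing its packet early simply retries its commit until all earlier reservations have committed, so software reading up to the commit tail never sees a partially stored packet.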
The asymmetrical torus comprises nodes 671021 to 67102n. These nodes are also known as ‘compute nodes’. Each node 67102 occupies a particular point within the torus and is interconnected, directly or indirectly, by a physical wire to every other node within the torus. For example, node 671021 is directly connected to node 671022 and indirectly connected to node 671023. Multiple connecting paths between nodes 67102 are often possible. A feature of the present invention is a system and method for selecting the ‘best’ or most efficient path between nodes 67102. In one embodiment, the best path is the path that reduces communication bottlenecks along the links between nodes 67102. A communication bottleneck occurs when a reception FIFO at a receiving node is full and unable to receive a data packet from a sending node. In another embodiment, the best path is the quickest path between nodes 67102 in terms of computational time. Often, the quickest path is also the same path that reduces communication bottlenecks along the links between nodes 67102.
As an example, assume node 671021 is a sending node and node 671026 is a receiving node. Nodes 671021 and 671026 are indirectly connected. There exists between these nodes a ‘best’ path for communicating data packets. In an asymmetrical torus, experiments conducted on the IBM BLUEGENE™ parallel computer system have revealed that the ‘best’ path is generally found by routing the data packets along the longest dimension first, then continually routing the data across the next longest path, until the data is finally routed across the shortest path to the destination node. In this example, the longest path between node 671021 and node 671026 is along the y-axis and the shortest path is along the x-axis. Therefore, in this example the ‘best’ path is found by communicating data along the y-axis from node 671021 to node 671022 to node 671023 to node 671024 and then along the x-axis from node 671024 to node 671025 and finally to receiving node 671026. Traversing the torus in this manner, i.e., by moving along the longest available path first, has been shown in experiments to increase the efficiency of communication between nodes in an asymmetrical torus by as much as 40%. These experiments are further discussed in “Optimization of All-to-all Communication on the Blue Gene/L Supercomputer” 37th International Conference on Parallel Processing, IEEE 2008, the contents of which are incorporated by reference in their entirety. In those experiments, packets were first injected into the network and sent to an intermediate node along the longest dimension, where they were received into the memory of the intermediate node. They were then re-injected into the network to the final destination. This requires additional software overhead and additional memory bandwidth on the intermediate nodes. The present invention is much more general than this, and requires no receiving and re-injecting of packets at intermediate nodes.
As shown in
The MU 65100 (
The handover between network device 65150 and MU 65100 is performed via 2-port SRAMs for network injection/reception FIFOs. The MU 65100 reads/writes one port using, for example, an 800 MHz clock, and the network reads/writes the second port with a 500 MHz clock. The only handovers are through the FIFOs and FIFOs' pointers (which are implemented using latches).
The size of the data packet 67384 may range from 32 to 544 bytes, in increments of 32 bytes. The first 32 bytes of the data packet 67384 form the packet header. The first 12 bytes of the packet header form a network header (bytes 0 to 11); the next 20 bytes form a message unit header (bytes 12 to 31). The remaining bytes (bytes 32 to 543) in the data packet 67384 are the payload ‘chunks’. In one embodiment, there are up to 16 payload ‘chunks’, each chunk containing 32 bytes.
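The packet size arithmetic above can be checked with a short sketch. The constant and function names are illustrative; the values (12-byte network header, 20-byte message unit header, up to 16 chunks of 32 bytes) are taken from the description.

```python
NETWORK_HEADER_BYTES = 12   # bytes 0 to 11
MU_HEADER_BYTES = 20        # bytes 12 to 31
CHUNK_BYTES = 32
MAX_CHUNKS = 16


def packet_size(chunks):
    """Total packet size in bytes for a given number of payload chunks."""
    if not 0 <= chunks <= MAX_CHUNKS:
        raise ValueError("a packet carries 0 to 16 payload chunks")
    return NETWORK_HEADER_BYTES + MU_HEADER_BYTES + chunks * CHUNK_BYTES


def chunks_needed(valid_bytes):
    """Smallest number of 32 B chunks that holds valid_bytes of payload."""
    if not 0 <= valid_bytes <= MAX_CHUNKS * CHUNK_BYTES:
        raise ValueError("payload must be 0 to 512 bytes")
    return -(-valid_bytes // CHUNK_BYTES)  # ceiling division


print(packet_size(0), packet_size(16))  # 32 544
print(chunks_needed(35))                # 2
```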
Several bytes within the data packet 67384, i.e., byte 67402, byte 67404 and byte 67406 are shown in further detail in
Referring now to
A point-to-point packet flows along the directions specified by the hint bits at each node until reaching its final destination. As described in U.S. Pat. No. 7,305,487, the hint bits get modified as the packet flows through the network. When a packet reaches its destination in a dimension, the network logic device 67381 changes the hint bits for that dimension to 0, indicating that the packet has reached its destination in that dimension. When all the hint bits are 0, the packet has reached its final destination. An optimization of this permits the hint bit for a dimension to be set to 0 on the node just before the packet reaches its destination in that dimension. This is accomplished by having a DCR register containing the node's neighbor coordinate in each direction. As the packet is leaving the node on a link, if the data packet's destination in that direction's dimension equals the neighbor coordinate in that direction, the hint bit for that direction is set to 0.
The Injection FIFO 65180 stores data packets that are to be injected into the network interface by the network logic device 67381. The network logic device 67381 parses the data packet to determine in which direction the data packet should move towards its destination, i.e., in a five-dimensional torus the network logic device 67381 determines if the data packet should move along links in the ‘a’ ‘b’ ‘c’ ‘d’ or ‘e’ dimensions first by using the hint bits. With dynamic routing, a packet can move in any direction provided the hint bit for that direction is set, the usual flow control tokens are available, and the link is not otherwise busy. For example, if the ‘+a’ and ‘+b’ hint bits are set, then a packet could move in either the ‘+a’ or ‘+b’ directions provided tokens and links are available.
Dynamic routing, where the proper routing path is determined at every node, is enabled by setting the ‘dynamic routing’ bit in the data packet header 67514 to 1. To improve performance on asymmetric tori, ‘zone’ routing can be used to force dynamic packets down certain dimensions before others. In one embodiment, the data packet 67384 contains 2 zone identifier bits 67520 and 67521, which point to registers in the network DCR unit 65182 (
In one embodiment, the mask also breaks down the torus into ‘zones’. A zone includes all the allowable directions in which the data packet may move. For example, in a five dimensional torus, if the mask reveals that the data packet is only allowed to move along in the ‘+a’ and ‘+e’ dimensions, then the zone includes only the ‘+a’ and ‘+e’ dimensions and excludes all the other dimensions.
For selecting a direction or a dimension, the packet's hint bits are AND-ed with the appropriate zone mask to restrict the set of directions that may be chosen. For a given set of zone masks, the first mask is used until the destination in the first dimension is reached. For example, in a 2N×N×N×N×2 torus, where N is an integer such as 16, the masks may be selected in a manner that routes the packets along the ‘a’ dimension first, then either the ‘b’ ‘c’ or ‘d’ dimensions, and then the ‘e’ dimension. For random traffic patterns this tends to have packets moving from more busy links onto less busy links. If all the mask bits are set to 1, there is no ordering of dynamic directions. Regardless of the zone bits, a dynamic packet may move to the ‘bubble’ VC to prevent deadlocks between nodes. In addition, a ‘stay on bubble’ bit 67522 may be set; if a dynamic packet enters the bubble VC, this bit causes the packet to stay on the bubble VC until reaching its destination.
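One way to derive such an ordered set of zone masks from the torus dimension sizes is sketched below. This is an illustration under assumptions, not the described programming of the DCR registers: it assumes a 10-bit mask with two bits per dimension in the order ‘a’ to ‘e’ (the layout used by the example masks in this description), and the function name is hypothetical.

```python
DIMENSIONS = "abcde"


def zone_masks(sizes):
    """Build zone masks that order dimensions from longest to shortest.

    sizes maps each dimension name to its length. Dimensions of equal
    length share one mask, so there is no ordering among them.
    """
    masks = []
    for length in sorted(set(sizes.values()), reverse=True):
        bits = "".join("11" if sizes[d] == length else "00" for d in DIMENSIONS)
        masks.append(bits)
    return masks


n = 16
print(zone_masks({"a": 2 * n, "b": n, "c": n, "d": n, "e": 2}))
# ['1100000000', '0011111100', '0000000011']
```

For the 2N×N×N×N×2 torus this yields ‘a’ first, then any of ‘b’, ‘c’ or ‘d’, and then ‘e’, as in the example above.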
As an example, in a five-dimensional torus, there are two zone identifier bits and ten hint bits stored in a data packet. The zone identifier bits are used to select a mask from the network DCR 65182. As an example, assume the zone identifier bits 67520 and 67521 are set to ‘00’. In one embodiment, there are up to five masks associated with the zone identifier bits set to ‘00’. A mask is selected by identifying an ‘operative zone’, i.e., the smallest zone for which both the hint bits and the zone mask are non-zero. The operative zone can be found using equation 1 where in this example m=‘00’, the set of zone masks corresponding to zone identifier bits ‘00’;
zone k = min{ j : h & ze_m(j) != 0 }   (1)
where j is a variable indexing the zone masks, i.e., in a five-dimensional torus j varies between 0 and 4 (and k=0 to 4); h represents the hint bits; ze_m(j) represents the mask bits of zone j; and ‘&’ represents a bitwise ‘AND’ operation.
The following example illustrates how a network logic device 67381 implements equation 1 to select an appropriate mask from the network DCR registers. As an example, assume the hint bits are set as ‘h’=1000100000, corresponding to moves along the ‘−a’ and the ‘−c’ dimensions. Assume that three possible masks associated with the zone identifier bits 67520 and 67521 are stored in the network DCR unit as follows: ze_m(0)=0011001111 (b, d or e moves allowed); ze_m(1)=1100000000 (a moves allowed); and ze_m(2)=0000110000 (c moves allowed).
Network logic device 67381 applies equation 1 to the hint bits and each individual zone mask, i.e., ze_m(0), ze_m(1), ze_m(2), which reveals that the operative zone is k=1, because h & ze_m(0)=0 but h & ze_m(1) !=0, i.e., j=1 is the minimum index for which the hint bits ‘AND’ed with the mask give a result that does not equal zero. When j=0, h & ze_m(0)=0, i.e., 1000100000 & 0011001111=0. When j=1, h & ze_m(1)=1000100000 & 1100000000=1000000000. Thus in equation 1, the min j such that h & ze_m(j) !=0 is 1 and so k=1.
After all the moves along the links interconnecting nodes in the ‘a’ dimension are made, at the last node of the ‘a’ dimension, as described earlier, the logic sets the hint bits for the ‘a’ dimension to ‘00’, so that the hint bits become ‘h’=0000100000, corresponding to moves along the ‘c’ dimension in the example described. The operative zone is then found according to equation 1 to be k=2 because ‘h & ze_m(0)=0’, and ‘h & ze_m(1)=0’, and ‘h & ze_m(2) !=0’.
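The selection rule of equation 1 and the worked example above can be reproduced with the following minimal sketch. The function name is illustrative; the hint bits and masks are the values from the example, written as binary integers with the leftmost bit first.

```python
def operative_zone(h, masks):
    """Return the smallest j with h & masks[j] != 0, or None if none exists."""
    for j, mask in enumerate(masks):
        if h & mask:
            return j
    return None


ze_m = [0b0011001111,   # ze_m(0): b, d or e moves allowed
        0b1100000000,   # ze_m(1): a moves allowed
        0b0000110000]   # ze_m(2): c moves allowed

h = 0b1000100000                         # '-a' and '-c' hints set
k = operative_zone(h, ze_m)
print(k, format(h & ze_m[k], "010b"))    # 1 1000000000

h = 0b0000100000                         # 'a' hints cleared, '-c' remains
k = operative_zone(h, ze_m)
print(k, format(h & ze_m[k], "010b"))    # 2 0000100000
```

The value h & ze_m(k) gives the directions the packet may take while zone k is operative.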
The network logic device 67381 then applies the selected mask to the hint bits to determine which direction to forward the data packet. In one embodiment, the mask bits are ‘AND’ed with the hint bits to determine the direction of the data packet. Consider the example where the mask bits are 1, 0, 1, 0, 0, indicating that moves in the dimensions ‘a’ or ‘c’ are allowed. Assume the hint bits are set as follows: hint bit 67501 is set to 1, hint bit 67502 is set to 0, hint bit 67503 is set to 0, hint bit 67504 is set to 0, hint bit 67505 is set to 1, hint bit 67506 is set to 0, hint bit 67507 is set to 0, hint bit 67508 is set to 0, hint bit 67509 is set to 0, and hint bit 67510 is set to 0. The first hint bit 67501, a 1, is ‘AND’ed with the corresponding mask bit, also a 1, and the output is a 1. The second hint bit 67502, a 0, is ‘AND’ed with the corresponding mask bit, a 1, and the output is a 0. Application of the mask bits to the hint bits reveals that movement is enabled along ‘−a’. The remaining hint bits are ‘AND’ed together with their corresponding mask bits to reveal that movement is enabled along the ‘−c’ dimension. In this example, the data packet will move along either the ‘−a’ dimension or the ‘−c’ dimension towards its final destination. If the data packet first reaches a destination along the ‘−a’ dimension, then the data packet will continue along the ‘−c’ dimension towards its destination on the ‘−c’ dimension. Likewise, if the data packet reaches a destination along the ‘−c’ dimension then the data packet will continue along the ‘−a’ dimension towards its destination on the ‘−a’ dimension.
As a data packet 67384 moves along towards its destination, the hint bits may change. A hint bit is set to 0 when there are no more moves left along a particular dimension. For example, if hint bit 67501 is set to 1, indicating the data packet is allowed to move along the ‘−a’ direction, then hint bit 67501 is set to 0 once the data packet moves the maximum amount along the ‘−a’ direction. During the process of routing, it is understood that the data packet may move from a sending node to one or more intermediate nodes before arriving at the destination node. Each intermediate node that forwards the data packet towards the destination node also functions as a sending node.
In some embodiments, there are multiple longest dimensions and a node chooses between the multiple longest dimensions to select a routing direction for the data packet 67384. For example, in a five dimensional torus, dimensions ‘+a’ and ‘+e’ may be equally long. Initially, the sending node chooses between routing the data packet 67384 in a direction along the ‘+a’ dimension or the ‘+e’ dimension. A redetermination of which direction the data packet 67384 should travel is made at each intermediate node. At an intermediate node, if ‘+a’ and ‘+e’ are still the longest dimensions, then the intermediate node will decide whether to route the data packet 67384 in the direction of the ‘+a’ or ‘+e’ dimensions. The data packet 67384 may continue in the direction of the dimension initially chosen, or in the direction of any of the other longest dimensions. Once the data packet 67384 has exhausted travel along all of the longest dimensions, a network logic device at an intermediate node sends the data packet in the direction of the next longest dimension.
The hint bits are adjusted at each compute node 65100 (
In an alternative embodiment, the hint bits need not be explicitly stored in the packet, but the logical equivalent of the hint bits, or “implied” hint bits, can be calculated by the network logic on each node as the packet moves through the network. For example, suppose the packet header contains not the hint bits and destination, but rather the number of remaining hops to make in each dimension and whether the plus or minus direction should be used in each dimension (a direction indicator). Then, when a packet reaches a node, the implied hint for a direction is 1 if the number of remaining hops in that dimension is non-zero and the direction indicator for that dimension is set. Each time the packet makes a move in a dimension, the remaining hop count is decremented by the network logic device 67381. When the remaining hop count is zero, the packet has reached its destination in that dimension, at which point the implied hint bit is zero.
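A minimal sketch of the implied hint calculation follows. It assumes the header carries, per dimension, a remaining hop count and a boolean direction indicator that is true for the plus direction; this encoding and the function name are illustrative assumptions.

```python
def implied_hints(remaining_hops, use_plus):
    """Return (minus_hint, plus_hint) for one dimension."""
    if remaining_hops == 0:
        return (0, 0)          # destination reached in this dimension
    return (0, 1) if use_plus else (1, 0)


# A packet with 2 hops left in the plus direction of one dimension.
hops = 2
while hops > 0:
    print(hops, implied_hints(hops, use_plus=True))
    hops -= 1                  # network logic decrements the count per move
print(hops, implied_hints(hops, use_plus=True))
# 2 (0, 1)
# 1 (0, 1)
# 0 (0, 0)
```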
Referring now to
Where d is a selected dimension, e.g., ‘+/−x’, ‘+/−y’, ‘+/−z’ or ‘+/−a’, ‘+/−b’, ‘+/−c’, ‘+/−d’, ‘+/−e’; and cutoff_plus[d] and cutoff_minus[d] are software controlled programmable cutoff registers that store values that represent the endpoints of the selected dimension. The hint bits are recalculated and rewritten to the data packet 67384 by the network logic device 67381 as the data packet 67384 moves towards its destination. Once the data packet 67384 reaches the receiving node, i.e., the final destination address, all the hint bits are set to 0, indicating that the data packet 67384 should not be forwarded.
The method starts at block 67602. At block 67602, if a node along the source dimension is equal to a node along the destination dimension, then the data packet has already reached its destination on that particular dimension and the data packet does not need to be forwarded any further along that one dimension. If this situation is true, then at block 67604 all of the hint bits for that dimension are set to zero by the hint bit calculator and the method ends. If the node along the source dimension is not equal to the node along the destination dimension, then the method proceeds to step 67606. At step 67606, if the node along the destination dimension is greater than the node along the source dimension, e.g., the destination node is in a positive direction from the source node, then the method moves to block 67612. If the node along the destination dimension is not greater than the source node, e.g., the destination node is in a negative direction from the source node, then the method proceeds to block 67608.
At block 67608, a determination is made as to whether the destination dimension is greater than or equal to a value stored in the cutoff_minus register. The plus and minus cutoff registers are programmed in such a way that a packet will take the smallest number of hops in each dimension If the destination dimension is greater than or equal to the value stored in the cutoff_minus register, then the method proceeds to block 67609 and the hint bits are set so that the data packet 67384 is routed in a negative direction for that particular dimension. If the destination dimension is not greater than or equal to the value stored in the cutoff_plus register, then the method proceeds to block 67610 and the hint bits are set so the data packet 67384 is routed in a positive dimension for that particular dimension.
At block 67612, a determination is made as to whether the destination dimension is less than or equal to a value stored in the cutoff_plus register. If the destination dimension is less than or equal to the value stored in the cutoff_plus register, then the method proceeds to block 67616 and the hint bits are set so that the data packet is routed in a positive direction for that particular dimension. If the destination dimension is not less than or equal to the value stored in the cutoff_plus register, then the method proceeds to block 67614 and the hint are set so that the data packet 67384 is routed in a negative direction for that particular dimension.
The above method is repeated for each dimension to set the hint bits for that particular dimension, i.e., in a five-dimensional torus the method is implemented once for each of the ‘a’, ‘b’, ‘c’, ‘d’, and ‘e’ dimensions.
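The per-dimension decision described above can be sketched in Python as follows. This is an illustrative model only: the function names, the `(hint_plus, hint_minus)` tuple encoding, and the example coordinates and cutoff values (a ring of size 8 in each of two dimensions) are assumptions made for the sketch and are not taken from the description.

```python
from typing import Dict, Tuple


def hint_bits(src: int, dest: int, cutoff_plus: int, cutoff_minus: int) -> Tuple[int, int]:
    """Return (hint_plus, hint_minus) for one dimension."""
    if dest == src:
        return (0, 0)  # already at the destination in this dimension
    if dest > src:
        # Destination lies in the positive direction unless it is past the plus cutoff.
        return (1, 0) if dest <= cutoff_plus else (0, 1)
    # Destination lies in the negative direction unless it is below the minus cutoff.
    return (0, 1) if dest >= cutoff_minus else (1, 0)


def all_hint_bits(src: Dict[str, int], dest: Dict[str, int],
                  cutoff_plus: Dict[str, int],
                  cutoff_minus: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Repeat the calculation once per dimension."""
    return {d: hint_bits(src[d], dest[d], cutoff_plus[d], cutoff_minus[d]) for d in src}


# Illustrative values only: ring size 8 per dimension, source node at a=1, b=6.
src = {"a": 1, "b": 6}
cutoff_plus = {"a": 5, "b": 7}
cutoff_minus = {"a": 0, "b": 2}
dest = {"a": 7, "b": 0}

hints = all_hint_bits(src, dest, cutoff_plus, cutoff_minus)
print(hints)
```

With these assumed cutoffs, both dimensions take the wrap-around link: in ‘a’ the destination is past cutoff_plus, so the minus hint is set, and in ‘b’ the destination is below cutoff_minus, so the plus hint is set.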
Network Support for System Initiated Checkpoint
In a parallel computing system, such as BlueGene® (a trademark of International Business Machines Corporation, Armonk N.Y.), system messages are initiated by the operating system of a compute node. They could be messages communicated between the Operating System (OS) kernels on two different compute nodes, or they could be file I/O messages, e.g., such as when a compute node performs a “printf” function, which gets translated into one or more messages between the OS on a compute node and the OS on one or more I/O nodes of the parallel computing system. In highly parallel computing systems, a plurality of processing nodes may be interconnected to form a network, such as a Torus; or, alternately, may interface with an external communications network for transmitting or receiving messages, e.g., in the form of packets.
As known, a checkpoint refers to a designated place in a program at which normal processing is interrupted specifically to preserve the status information, e.g., to allow resumption of processing at a later time. Checkpointing is the process of saving the status information. While checkpointing in high performance parallel computing systems is available, generally, in such parallel computing systems, checkpoints are initiated by a user application or program running on a compute node that implements an explicit start checkpointing command, typically when there is no on-going user messaging activity. That is, in prior art user-initiated checkpointing, user code is engineered to take checkpoints at proper times, e.g., when the network is empty, no user packets are in transit, or an MPI call is finished.
In one aspect, it is desirable to have the computing system initiate checkpoints, even in the presence of on-going messaging activity. Further, it must be ensured that all incomplete user messages at the time of the checkpoint be delivered in the correct order after the checkpoint. To further complicate matters, the system may need to use the same network as is used for transferring system messages.
In one aspect, a system and method for checkpointing in parallel, or distributed or multiprocessor-based computer systems is provided that enables system initiation of checkpointing, even in the presence of messaging, at arbitrary times and in a manner invisible to any running user program.
In this aspect, it is ensured that all incomplete user messages at the time of the checkpoint be delivered in the correct order after the checkpoint. Moreover, in some instances, the system may need to use the same network as is used for transferring system messages.
The system, method and computer program product support checkpointing in a parallel computing system having multiple nodes configured as a network, and, in particular, obtain system initiated checkpoints, even in the presence of on-going user message activity in a network.
As there is provided a separation of network resources and DMA hardware resources used for sending the system messages and user messages, in one embodiment, all user and system messaging is stopped just prior to the start of the checkpoint. In another embodiment, only user messaging is stopped prior to the start of the checkpoint.
Thus, there is provided a system for checkpointing data in a parallel computing system having a plurality of computing nodes, each node having one or more processors and network interface devices for communicating over a network, the checkpointing system comprising: one or more network elements interconnecting the network interface devices of computing nodes via links to form a network; a control device to communicate control signals to each computing node of the network for stopping receiving and sending message packets at a node, and to communicate further control signals to each of the one or more network elements for stopping flow of message packets within the formed network; and, a control unit, at each computing node and at one or more of the network elements, responsive to a first control signal to stop each of the network interface devices involved with processing of packets in the formed network, and, to stop a flow of packets communicated on links between nodes of the network; and, the control unit, at each node and the one or more network elements, responsive to a second control signal to obtain, from each of the plurality of network interface devices, data included in the packets currently being processed, and to obtain from the one or more network elements, current network state information; and, a memory storage device adapted to temporarily store the obtained packet data and the obtained network state information.
As described herein with respect to
One function of the messaging unit 65100 is to ensure optimal data movement to, and from, the network into the local memory system for the node by supporting injection and reception of message packets. As shown in
The MU 65100 further supports data prefetching into the memory, and on-chip memory copy. On the injection side, the MU splits and packages messages into network packets, and sends packets to the network respecting the network protocol. On packet injection, the messaging unit distinguishes between packet injection, and memory prefetching packets based on certain control bits in its memory descriptor, e.g., such as a least significant bit of a byte of a descriptor 65102 shown in
With respect to on-chip local memory copy operation, the MU copies content of an area in the local memory to another area in the memory. For memory-to-memory on chip data transfer, a dedicated SRAM buffer, located in the network device, is used.
As shown in
In one embodiment of a multiprocessor system node, such as described herein, there may be a clean separation of network and Messaging Unit (DMA) hardware resources used by system and user messages. In one example, user messages and system messages are assigned different virtual channels and different messaging sub-units such as network and MU injection memory FIFOs, reception FIFOs, and internal network FIFOs.
Thus, for example, at each node(s), the DCR control unit for the MU 65100 and network device 65150 is configured to issue respective stop/start signals 5221a, . . . 5221N over respective conductor lines, for initiating starting or stopping of corresponding particular subunit(s), e.g., subunit 5300a, . . . , 5300N. In an embodiment described herein, for checkpointing, the sub-units to be stopped may include all injection and reception sub-units of the MU (DMA) and network device. For example, in one example embodiment, there is a Start/stop DCR control signal, e.g., a set bit, associated with each of the iMEs 65110, rMEs 65120 (
For example, each iME and rME can be selectively enabled or disabled using a DCR register. An iME/rME is enabled when the corresponding DCR bit is 1 at the DCR register, and disabled when it is 0. If this DCR bit is 0, the rME will stay in the idle state or another wait state until the bit is changed to 1. The software executing on a processor at the node sets a DCR bit. The DCR bits are physically connected to the iME/rMEs via a “backdoor” access mechanism including separate read/write access ports to buffer arrays, registers, and state machines, etc. within the MU and Network Device. Thus, the register value propagates to iME/rME registers immediately when it is updated.
The control or DCR unit may thus be programmed to set a Start/stop DCR control bit provided as a respective stop/start signal 5221a, . . . , 5221N corresponding to the network injection FIFOs to enable stop of all network injection FIFOs. As there is a DCR control bit for each subunit, these bits get fed to the appropriate iME FSM logic which will, in one embodiment, complete any packet in progress and then prevent work on subsequent packets. Once stopped, new packets will not be injected into the network. Each network injection FIFO can be started/stopped independently.
As shown in
Further, the control or DCR unit sets a Start/stop DCR control bit provided as a respective stop/start signal 5221a, . . . 5221N corresponding to network reception FIFOs to enable stop of all network reception FIFOs. Once stopped, new packets cannot be removed from the network reception FIFOs. Each FIFO can be started/stopped independently. That is, as there is a DCR control bit for each subunit, these bits get fed to the appropriate FSM logic which will, in one embodiment, complete any packet in progress and then prevent work on subsequent packets. It is understood that a network DCR register 5182 shown in
In an example embodiment, for the case of packet reception, if this DCR stop bit is set to logic 1, for example, while the corresponding rME is processing a packet, the rME will continue to operate until it reaches either the idle state or a wait state. Then it will stay in that state until the stop bit is removed, or set to logic 0, for example. When an rME is disabled (e.g., stop bit set to 1), even if there are some available packets in the network device's reception FIFO, the rME will not receive packets from the network FIFO. Therefore, all messages received by the network FIFO will be blocked until the corresponding rME is enabled again.
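The stop behavior described for packet reception can be modeled with a small Python sketch. The class `ReceptionEngine`, its two states, and its `step` method are hypothetical simplifications introduced here for illustration; the actual rME state machine and its DCR interface are not specified at this level in the description.

```python
from collections import deque


class ReceptionEngine:
    """Toy model of an rME gated by a stop bit (assumed semantics)."""

    def __init__(self):
        self.stop = False      # stop bit: True (logic 1) disables the engine
        self.state = "idle"    # "idle" or "busy"
        self.fifo = deque()    # stands in for the network reception FIFO
        self.delivered = []    # packets written to memory
        self._current = None

    def step(self):
        """Advance the engine by one step."""
        if self.state == "busy":
            # A packet in progress is always completed, even if stop is set.
            self.delivered.append(self._current)
            self._current = None
            self.state = "idle"
        elif not self.stop and self.fifo:
            # Only an enabled, idle engine takes a new packet from the FIFO.
            self._current = self.fifo.popleft()
            self.state = "busy"


rme = ReceptionEngine()
rme.fifo.extend(["p0", "p1", "p2"])

rme.step()          # starts p0
rme.stop = True     # stop bit set while p0 is in progress
for _ in range(3):
    rme.step()      # p0 completes, then the engine waits in idle
stopped_view = (list(rme.delivered), list(rme.fifo), rme.state)

rme.stop = False    # stop bit cleared
for _ in range(4):
    rme.step()
print(stopped_view, rme.delivered)
```

While the stop bit is set, the packet already in progress completes and the remaining packets stay blocked in the FIFO; they are received in their original order once the bit is cleared.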
Further, the control or DCR unit sets a Start/stop DCR control bit provided as a respective stop/start signal 5221a, . . . 5221N corresponding to all network sender and receiver units such as sender units 651850-65185N and receiver units 651950-65195N shown in
In the system shown in
That is, the system of the invention may have a separate control network, wherein each compute node signals a “barrier entered” message to the control network, and it waits until receiving a “barrier completed” message from the control system. The control system implemented may send such messages after receiving respective barrier entered messages from all participating nodes.
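The barrier exchange over a separate control network can be sketched as follows. The class `ControlSystem` and its method name are hypothetical; the sketch only models the rule that the “barrier completed” message is sent after “barrier entered” messages have arrived from all participating nodes.

```python
class ControlSystem:
    """Toy model of the control system side of a barrier (assumed interface)."""

    def __init__(self, participants):
        self.participants = set(participants)
        self.entered = set()

    def barrier_entered(self, node_id):
        """Record one node's "barrier entered" message.

        Returns the list of nodes to which "barrier completed" is sent,
        which is empty until every participating node has entered.
        """
        if node_id not in self.participants:
            raise ValueError("node %r is not a participant" % (node_id,))
        self.entered.add(node_id)
        if self.entered == self.participants:
            self.entered = set()  # ready for the next barrier
            return sorted(self.participants)
        return []


control = ControlSystem([0, 1, 2])
notifications = [control.barrier_entered(n) for n in (2, 0, 1)]
print(notifications)
```

Each compute node would wait after sending its message until it appears in a returned notification list; in this sketch only the last arrival releases the barrier.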
Thus, continuing in
As mentioned, each node includes “state machine” registers (not shown) at the network and MU devices. These state machine registers include unit status information such as, but not limited to, FIFO active, FIFO currently in use (e.g., for remote get operation), and whether a message is being processed or not. These status registers can further be read (and written to) by system software at the host or controller node.
Thus, when it has been determined at the computer nodes forming a network (e.g., a Torus or collective) to be checkpointed that all user programs have been halted, and all packets have stopped moving according to the embodiment described herein, then, as shown at step S420,
In one embodiment, these registers may include packet ECC or parity data, as well as network link level sequence numbers, VC tokens, state machine states (e.g., status of packets in network), etc., that can be read and written. In one embodiment, the checkpoint reads/writes are performed by operating system software running on each node. Access to devices is performed over a DCR bus that permits access to internal SRAM or state machine registers and register arrays, and state machine logic, in the MU and network device, etc. as shown in
Returning to
Proceeding to step S450,
In another implementation of the network sender 5185′ illustrated in
For restarting, there is performed setting the unit stop DCR bits to logic “0”, for example, bits in DCR control register 5501 (e.g.,
Returning to
Thus, if selective re-start cannot be performed, then the entire network is reset, which effectively rids the network of all packets (e.g., user and system packets) in the network. After the network reset, only system packets will be utilized by the OS running on the compute node. Subsequently, the system using the network sends out information about the user code and program and MU/network status and writes that to disk, i.e., the necessary network, MU and user information is checkpointed (written out to external memory storage, e.g., disk) using the freshly reset network.
Then, all other user state, such as the user program, main memory used by the user program, processor register contents and program control information, and other checkpointing items defining the state of the user program, are checkpointed. For example, the memory checkpointed is the content of all user program memory, i.e., all the variables, stacks, and heap. Registers include, for example, the core's fixed and floating point registers and program counter. The checkpoint data is written to stable storage such as disk or a flash memory, possibly by sending system packets to other compute or I/O nodes. This is so that the user application can later be restarted in exactly the same state it was in.
In one aspect, these contents and other checkpointing data are written to a checkpoint file, for example, at a memory buffer on the node, and subsequently written out in system packets to, for example, additional I/O nodes or a control host computer, where they could be written to disk, attached hard-drive, optical, magnetic, volatile or non-volatile memory storage devices, for example. In one embodiment the checkpointing may be performed in a non-volatile memory (e.g., flash memory, phase-change memory, etc.) based system, i.e., with checkpoint data and internal node state data expediently stored in a non-volatile memory implemented on the computer node, e.g., before and/or in addition to being written out to I/O. The checkpointing data at a node could further be written to other nodes, where it is stored in local memory/flash memory.
Continuing, after user data is checkpointed, at 5470,
After restoring the network state at each node, a call is made to a third barrier. The system thus ensures that all nodes have entered the barrier after each node's state has been restored from a checkpoint (i.e., each node has read from stable storage and restored user application and network data and state). The system will wait until each node has entered the third data barrier such as shown at steps 5472, 5475 before resuming processing.
From the foregoing, the system and methodology can re-start the user application in exactly the same state in which it was at the time of entering the checkpoint. With the addition of system checkpoints, in the manner as described herein, checkpointing can be performed anytime while a user application is still running.
In an alternate embodiment, two external barriers could be implemented, for example, in a scenario where system checkpoint is taken and the hardware logic is engineered so as not to have to perform a network reset, i.e., system is unaffected while checkpointing user. That is, after first global barrier is entered upon halting all activity, the nodes may perform checkpoint read step using backdoor access feature, and write checkpoint data to storage array or remote disk via the hardware channel. Then, these nodes will not need to enter or call the second barrier after taking checkpoint due to the use of separate built in communication channel (such as a Virtual Channel). These nodes will then enter a next barrier (the third barrier as shown in
The present invention can be embodied in a system in which there are compute nodes and separate networking hardware (switches or routers) that may be on different physical chips. For example, network configuration shown in
In the further embodiment of a network configuration 5018″ shown in
Further, the entire machine may be partitioned into subpartitions each running different user applications. If such subpartitions share network hardware resources in such a way that each subpartition has different, independent network input (receiver) and output (sender) ports, then the present invention can be embodied in a system in which the checkpointing of one subpartition only involves the physical ports corresponding to that subpartition. If such subpartitions do share network input and output ports, then the present invention may be embodied in a system in which the network can be stopped, checkpointed and restored, but only the user application running in the subpartition to be checkpointed is checkpointed while the applications in the other subpartitions continue to run.
Programs running on large parallel computer systems often save the state of long running calculations at predetermined intervals. This saved data is called a checkpoint. This process enables restarting the calculation from a saved checkpoint after a program interruption due to soft errors, hardware or software failures, machine maintenance or reconfiguration. Large parallel computers are often reconfigured, for example to allow multiple jobs on smaller partitions for software development, or larger partitions for extended production runs.
A typical checkpoint requires saving data from a relatively large fraction of the memory available on each processor. Writing these checkpoints can be a slow process for a highly parallel machine with limited I/O bandwidth to file servers. The optimum checkpoint interval for reliability and utilization depends on the problem data size, expected failure rate, and the time required to write the checkpoint to storage. Reducing the time required to write a checkpoint improves system performance and availability.
Thus, it is desired to provide a system and method for increasing the speed and efficiency of a checkpoint process performed at a computing node of a computing system, such as a massively parallel computing system.
In one aspect, there is provided a system and method for increasing the speed and efficiency of a checkpoint process performed at a computing node of a computing system by integrating a non-volatile memory device, e.g., flash memory cards, with a direct interface to the processor and memory that make up each parallel computing node.
This flash memory provides a local storage for checkpoints thus relieving the bottleneck due to I/O bandwidth limitations. Simple available interfaces from the processor such as ATA or UDMA that are supported by commodity flash cards provide sufficient bandwidth to the flash memory for writing checkpoints. For example, a multiple GB checkpoint can be written to local flash at 20 MB/s to 40 MB/s in a few minutes. All processors writing the same data through normal I/O channels could take more than 10× as long. An example implementation is shown in
The flash memory size associated with each processor is ideally 2× to 4× the required checkpoint memory size to allow for multiple backups so that recovery is possible from any failures that occur during the checkpoint write itself. Also, the system is tolerant of a limited number of hard failures in the local flash storage, since checkpoint data from those few nodes can simply be written to the file system through the normal I/O channels using only a fraction of the total I/O bandwidth.
In one embodiment, there is no cabling used in these interfaces. Network interfaces are wired through the compute card connectors to the node board, and some of these, including the I/O network connections are carried from the node board to other parts of the system, e.g., via optical fiber cables.
In one aspect, checkpointing data are written to a checkpoint file, for example, at a compact non-volatile memory buffer on the node, and subsequently written out in system packets to the I/O nodes where they could be written to disk, attached hard-drive, optical, magnetic, volatile or non-volatile memory storage devices, for example.
As shown in
Data transferred to/from the flash memory may be further effected by interfaces to a processor such as ATA or UDMA (“Ultra DMA”) that are supported by commodity flash cards that provide sufficient bandwidth to the flash memory for writing checkpoints. For example, the ATA/ATAPI-4 transfer modes support speeds at least from 16 MByte/s to 33 MByte/s. In the faster Ultra DMA modes and Parallel ATA, a transfer rate of up to 133 MByte/s is supported.
In one example embodiment, a large parallel supercomputer system, that provides 5 gigabyte/s I/O bandwidth from a rack, where a rack includes 1024 compute nodes in an example embodiment, each with 16 gigabyte of memory, would require about 43 minutes to checkpoint 80% of memory. If this checkpoint instead were written locally at 40 megabyte/s to a non-volatile memory such as flash memory 5020 shown in
Thus, for a 200 hour compute job the system without flash memory might use 12-16 checkpoints, depending on expected failure rate, adding a total time of 8.5 to 11.5 hours for backup. Using the same assumptions, the system with local flash memory could perform 35-47 checkpoints, adding only 3.1 to 4.2 hours. With no fails or restarts during the job, the improvement in throughput is modest, about 3%. However, for one or two fails and restarts, the throughput improvement increases to over 10%.
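The arithmetic behind these figures can be checked with a short Python calculation. The constant names are chosen here for illustration, and the sketch assumes decimal units (1 gigabyte = 1000 megabyte) and that every node writes its own 80% of memory to local flash in parallel; under those assumptions the results land close to the quoted values.

```python
NODES_PER_RACK = 1024
MEM_PER_NODE_GB = 16
CHECKPOINT_FRACTION = 0.80
RACK_IO_GB_PER_S = 5      # shared I/O bandwidth from one rack
FLASH_MB_PER_S = 40       # assumed local write rate per node

data_per_node_gb = MEM_PER_NODE_GB * CHECKPOINT_FRACTION   # 12.8 GB per node
rack_data_gb = data_per_node_gb * NODES_PER_RACK           # data for the whole rack

# Through the shared I/O network, the whole rack's data shares 5 GB/s.
io_minutes = rack_data_gb / RACK_IO_GB_PER_S / 60

# With local flash, each node writes only its own data, all nodes in parallel.
flash_minutes = data_per_node_gb * 1000 / FLASH_MB_PER_S / 60


def overhead_hours(num_checkpoints, minutes_per_checkpoint):
    """Total time spent writing checkpoints over the job, in hours."""
    return num_checkpoints * minutes_per_checkpoint / 60


print(round(io_minutes, 1), round(flash_minutes, 1))
print([round(overhead_hours(n, io_minutes), 1) for n in (12, 16)])
print([round(overhead_hours(n, flash_minutes), 1) for n in (35, 47)])
```

Under these assumptions a checkpoint through the I/O network takes a little under 44 minutes and a local flash checkpoint a little over 5 minutes, so 35 to 47 local checkpoints cost roughly 3.1 to 4.2 hours, while 12 to 16 checkpoints through the I/O network cost roughly 8.7 to 11.7 hours, near the 8.5 to 11.5 hours stated above.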
As mentioned, in one embodiment, the size of the flash memory associated with each processor core is two times (or greater) the required checkpoint memory size to allow for multiple backups so that recovery is possible from any failures that occur during the checkpoint write itself. Larger flash memory size is preferred to allow additional space for wear leveling and redundancy. Also, the system design is tolerant of a limited number of hard failures in the local flash storage, since checkpoint data from those few nodes can simply be written to the file system through the normal I/O network using only a small fraction of the total available I/O bandwidth. In addition, redundancy through data striping techniques similar to those used in RAID storage can be used to spread checkpoint data across multiple flash memory devices on nearby processor nodes via the internal networks, or on disk via the I/O network, to enable recovery from data loss on individual flash memory cards.
Thus a checkpoint storage medium provided with only modest reliability can be employed to improve the reliability and availability of a large parallel computing system. Furthermore, adding flash memory cards is a more cost effective way of increasing system availability and throughput than increasing I/O bandwidth.
In sum, the incorporation of the flash memory device 5020 at the multiprocessor node provides a local storage for checkpoints thus relieving the bottleneck due to I/O bandwidth limitations associated with some memory access operations. Simple available interfaces to the processor such as ATA or UDMA (“Ultra DMA”) that are supported by commodity flash cards provide sufficient bandwidth to the flash memory for writing checkpoints. For example, the ATA/ATAPI-4 transfer modes support speeds at least from 16 MByte/s to 33 MByte/s. In the faster Ultra DMA modes and Parallel ATA, a transfer rate of up to 133 MByte/s is supported.
For example, a multiple gigabyte checkpoint can be written to the local flash card 5020 at 20 megabyte/s to 40 megabyte/s in only a few minutes. Writing the same data to disk storage from all processors using the normal I/O network could take more than ten (10) times as long.
Highly parallel computing systems, with tens to hundreds of thousands of nodes, are potentially subject to a reduced mean-time-to-failure (MTTF) due to a soft error on one of the nodes. This is particularly true in HPC (High Performance Computing) environments running scientific jobs. Such jobs are typically written in such a way that they query how many nodes (or processes) N are available at the beginning of the job and the job then assumes that there are N nodes available for the duration of the run. A failure on one node causes the job to crash. To improve availability such jobs typically perform periodic checkpoints by writing out the state of each node to a stable storage medium such as a disk drive. The state may include the memory contents of the job (or a subset thereof from which the entire memory image may be reconstructed) as well as program counters. If a failure occurs, the application can be rolled-back (restarted) from the previous checkpoint on a potentially different set of hardware with N nodes.
However, on machines with a large number of nodes and a large amount of memory per node, the time to perform such a checkpoint to disk may be large, due to limited I/O bandwidth from the HPC machine to disk drives. Furthermore, the soft error rate is expected to increase due to the large number of transistors on a chip and the shrinking size of such transistors as technology advances.
To cope with such soft errors, processor cores and systems increasingly rely on mechanisms such as Error Correcting Codes (ECC) and instruction retry to turn otherwise non-recoverable soft errors into recoverable soft errors. However, not all soft errors can be recovered in such a manner, especially on very small, simple cores that are increasingly being used in large HPC systems such as BlueGene/Q (BG/Q).
Thus, in one aspect, there is provided an approach to recover from a large fraction of soft errors without resorting to complete checkpoints. If this can be accomplished effectively, the frequency of checkpoints can be reduced without sacrificing availability.
There is thus provided a technique for performing “local rollbacks” by utilizing a multi-versioned memory system such as that on BlueGene/Q. On BG/Q, the level 2 cache memory (L2) is multi-versioned to support speculative running, a transactional memory model, and a rollback mode. Data in the L2 may thus be speculative. On BG/Q, the L2 is partitioned into multiple L2 slices, each of which acts independently. In speculative or transactional mode, data in the main memory is always valid, “committed” data and speculative data is not written back to the main memory. In rollback mode, speculative data may be written back to the main memory, at which point it cannot be distinguished from committed data. In this invention, we focus on the hardware capabilities of the L2 to support local rollbacks. That capability is somewhat different from the capability to support speculative running and transactional memory. This multi-versioned cache is used to improve reliability. Briefly, in addition to supporting common caching functionality, the L2 on BG/Q includes the following features for running in rollback mode. The same line (128 bytes) of data may exist multiple times in the cache. Each such line has a generation id tag and there is an ordering mechanism such that tags can be ordered from oldest to newest. There is a mechanism for requesting and managing new tags, and for “scrubbing” the L2 to clean it of old tags.
Local Rollback—the Case when there is No I/O
There is first described an embodiment in which there is no I/O into and out of the node, including messaging between nodes. Checkpoints to disk or stable storage are still taken periodically, but at a reduced frequency. There is a local rollback interval. If the end of the interval is reached without a soft error, the interval is successful and a new interval can be started. Under certain conditions to be described, if a soft error occurs during the local rollback interval, the application can be restarted from the beginning of the local interval and re-executed. This can be done without restoring the data from the previous complete checkpoint, which typically reads in data from disk. If the end of the interval is then reached, the interval is successful and the next interval can be started. If such conditions are met, we term the interval “rollbackable”. If the conditions are not met, a restart from the previous complete checkpoint is performed. The efficiency of the method thus depends upon the overhead to set up the local rollback intervals, the soft error rate, and the fraction of intervals that are rollbackable.
In this approach, certain types of soft errors cannot be recovered via local rollback under any conditions. Examples of such errors are an uncorrectable ECC error in the main memory, as this error corrupts state that is not backed up by multi-versioning, or an unrecoverable soft error in the network logic, as this corrupts state that cannot be reinstated by rerunning. If such a soft error occurs, the interval is not rollbackable. We categorize soft errors into two classes: potentially rollbackable, and unconditionally not rollbackable. In the description that follows, we assume the soft error is potentially rollbackable. Examples of such errors include a detected parity error on a register inside the processor core.
At the start of each interval, each thread on each core saves its register state (including the program counter). Certain memory mapped registers outside the core, that do not support speculation and need to be restored on checkpoint restore, are also saved. A new speculation generation id tag T is allocated and associated with all memory requests run by the cores from here on. The L2 cache recognizes this ID and gives all data written with it precedence, i.e., it maintains the semantics of these accesses overwriting all previously written data. At the start of the interval, the L2 does not contain any data with tag T and all the data in the L2 has tags less than T, or has no tag associated (T0) and is considered nonspeculative. Reads and writes to the L2 by threads contain a tag, which will be T for this next interval.
When a thread reads a line that is not in the L2, that line is brought into the L2 and given the non-speculative tag T0. Data from this version is returned to the thread. If the line is in the L2, the data returned to the thread is the version with the newest tag.
When a line is written to the L2, if a version of that line with tag T does not exist in the L2, a version with tag T is established. If some version of the line exists in the L2, this is done by copying the newest version of that line into a version with tag T. If a version does not exist in the L2, it is brought in from memory and given tag T. The write from the thread includes byte enables that indicate which bytes in the current write command are to be written. Those bytes with the byte enable high are then written to the version with tag T. If a version of the line with tag T already exists in the L2, that line is changed according to the byte enables.
At the end of an interval, if no soft error occurred, the data associated with the current tag T is committed by changing the state of the tag from speculative to committed. The L2 runs a continuous background scrub process that converts all occurrences of lines written with a tag that has committed status to non-speculative. It merges all committed versions of the same address into a single version based on tag ordering and removes the versions it merged.
The L2 is managed as a set-associative cache with a certain number of lines per set. All versions of a line belong to the same set. When a new line, or new version of a line, is established in the L2, some line in that set may have to be written back to memory. In speculative mode, non-committed, or speculative, versions are never allowed to be written to the memory. In rollback mode, non-committed versions can be written to the memory, but an “overflow” bit in a control register in the L2 is set to 1 indicating that such a write has been done. At the start of an interval all the overflow bits are set to 0.
Now consider running during a local rollback interval. If a detected soft error occurs, this will trigger an interrupt that is delivered to at least one thread on the node. Upon receiving such an interrupt, the thread issues a core-to-core interrupt to all the other threads in the system which instructs them to stop running the current interval. If at this time, all the L2 overflow bits are 0, then the main memory contents have not been corrupted by data generated during this interval and the interval is rollbackable. If one of the overflow bits is 1, then main memory has been corrupted by data in this interval, the interval is not rollbackable and running is restarted from the most recent complete checkpoint.
If the interval is rollbackable, the cores are properly re-initialized, all the lines in the L2 associated with tag T are invalidated, all of the memory mapped registers and thread registers are restored to their values at the start of the interval, and the running of the interval restarts. The L2 invalidates the lines associated with tag T by changing the state of the tag to invalid. The L2 background invalidation process removes occurrences of lines with invalid tags from the cache.
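For the case without I/O, the decision taken when an interval ends or a soft error is detected can be summarized as a small function. This is a minimal sketch; the names Outcome and end_of_interval and the representation of the overflow bits as a list of booleans are assumptions made for illustration.

```python
from enum import Enum


class Outcome(Enum):
    COMMIT = "commit tag T and start the next interval"
    LOCAL_ROLLBACK = "invalidate tag T, restore registers, rerun the interval"
    CHECKPOINT_RESTART = "restart from the previous complete checkpoint"


def end_of_interval(soft_error, potentially_rollbackable, overflow_bits):
    """Decide what happens to the current interval (no-I/O case only)."""
    if not soft_error:
        # No soft error: the speculative data is committed, even if an
        # overflow occurred during the interval.
        return Outcome.COMMIT
    if not potentially_rollbackable:
        # E.g. an uncorrectable ECC error in main memory.
        return Outcome.CHECKPOINT_RESTART
    if any(overflow_bits):
        # Speculative data reached main memory, so memory cannot be trusted.
        return Outcome.CHECKPOINT_RESTART
    return Outcome.LOCAL_ROLLBACK
```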
This can be done in such a way that is completely transparent to the application being run. In particular, at the beginning of the interval, the kernel running on the threads can, in coordinated fashion, set a timer interrupt to fire indicating the end of the next interval. Since interrupt handlers are run in kernel, not user mode, this is invisible to the application. When this interrupt fires, and no detectable soft-error has occurred during the interval, preparations for the next interval are made, and the interval timer is reset. Note that this can be done even if an interval contained an overflow event (since there was no soft error). The length of the interval should be set so that an L2 overflow is unlikely to occur during the interval. This depends on the size of the L2 and the characteristics of the application workload being run.
Local Rollback—the Case with I/O
An embodiment is now described in the more complicated case of when there is I/O, specifically messaging traffic between nodes. If all nodes participate in a barrier synchronization at the start of an interval, and if there is no messaging activity at all during the interval (either data injected into the network or received from the network) on every node, then if a rollbackable soft error occurs during the interval on one or more nodes, those nodes can re-run the interval and, if successful, enter the barrier for the next interval. In such a case, the other nodes in the system are unaware that a rollback is being done somewhere else. If one such node has a soft error that is non-rollbackable, then all nodes may begin running from the previous full checkpoint. There are three problems with this approach:
We therefore seek alternative conditions that do not require barriers and relax the assumption that no messaging activity occurs during the interval. This will reduce the overhead and increase the fraction of rollbackable intervals. In particular, an interval will be rollbackable if no data that was generated during the current interval is injected into the network (in addition to some other conditions to be described later). Thus an interval is rollbackable if the data injected into the network in the current interval were generated during previous intervals. Thus packets arriving during an interval can be considered valid. Furthermore, if a node does do a local rollback, it will never inject the same messages (packets) twice, (once during the failed interval and again during the re-running). In addition note that the local rollback intervals can proceed independently on each node, without coordination from other nodes, unless there is a non rollbackable interval, in which case the entire application may be restarted from the previous checkpoint.
We assume that network traffic is handled by a hardware Message Unit (MU). Specifically, the MU is responsible for putting messages, which are packetized, into the network and for receiving packets from the network and placing them in memory. Dong Chen, et al., “DISTRIBUTED PARALLEL MESSAGING UNIT FOR MULTIPROCESSOR SYSTEMS”, U.S. Pat. No. 8,458,267, wholly incorporated by reference as if set forth herein, describes the MU in detail. Dong Chen, et al., “SUPPORT FOR NON-LOCKING PARALLEL RECEPTION OF PACKETS BELONGING TO THE SAME FIFO”, U.S. Pat. No. 8,086,766, wholly incorporated by reference as if set forth herein, also describes the MU in detail. Specifically, there are message descriptors that are placed in injection FIFOs. An injection FIFO is a circular buffer in main memory. The MU maintains memory mapped registers that, among other things, contain pointers to the start, head, tail and end of the FIFO. Cores inject messages by placing the descriptor in the memory location pointed to by the tail, and then updating the tail to the next slot in the FIFO. The MU recognizes non-empty FIFOs, pulls the descriptor at the head of the FIFO, and injects packets into the network as indicated in the descriptor, which includes the length of the message, its starting address, its destination and other information having to do with what should be done with the message's packets upon reception at the destination. When all the packets from a message have been injected, the MU advances the head of the FIFO. Upon reception, if the message is a “direct put”, the payload bytes of the packet are placed into memory starting at an address indicated in the packet. If the packet belongs to a “memory FIFO” message, the packet is placed at the tail of a reception FIFO and then the MU updates the tail. Reception FIFOs are also circular buffers in memory and the MU again has memory mapped registers pointing to the start, head, tail and end of the FIFO.
Threads read packets at the head of the FIFO (if non-empty) and then advance the head appropriately. The MU may also support “remote get” messages. The payload of such a message consists of message descriptors that are put into an injection FIFO. In such a way, one node can instruct another node to send data back to it, or to another node.
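The head and tail handling of an injection FIFO described above can be sketched as a circular buffer. This is an illustrative model only: the class and method names are hypothetical, slots are list indices rather than memory addresses, and the explicit occupancy counter is an assumption added so the example can tell a full FIFO from an empty one.

```python
class InjectionFifo:
    """Toy circular buffer of message descriptors with head and tail indices."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots
        self.head = 0    # next descriptor the MU will process
        self.tail = 0    # next free slot for a core to fill
        self.count = 0   # assumption: occupancy tracked explicitly

    def inject(self, descriptor):
        """Core side: place a descriptor at the tail, then advance the tail."""
        if self.count == len(self.slots):
            return False   # FIFO is full
        self.slots[self.tail] = descriptor
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return True

    def mu_process_one(self, network):
        """MU side: send the message at the head, then advance the head."""
        if self.count == 0:
            return False   # FIFO is empty
        descriptor = self.slots[self.head]
        network.append(descriptor)   # stand-in for packetizing and injecting
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return True
```

The head advances only after the whole message has been handed to the network, mirroring the rule that the MU advances the head once all packets of a message have been injected.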
When the MU issues a read to an L2, it tags the read with a non-speculative tag. In rollback mode, the L2 still returns the most recent version of the data read. However, if that version was generated in the current interval, as determined by the tag, then a “rollback read conflict” bit is set in the L2. (These bits are initialized to 0 at the start of an interval.) If subsections (sublines) of an L2 line can be read, and if the L2 tracks writes on a subline basis, then the rollback read conflict bit is set when the MU reads the subline that a thread wrote in the current interval. For example, if the line is 128 bytes, there may be 8 subsections (sublines) each of length 16 bytes. When a line is written speculatively, it notes in the L2 directory for that line which sublines are changed. If a soft error occurs during the interval, if any rollback read conflict bit is set, then the interval cannot be rolled back.
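The subline tracking in the 128-byte example can be expressed with an 8-bit mask per line. The sketch below is illustrative; the names subline_mask and LineDirectoryEntry are hypothetical, and the directory entry is reduced to the one field needed to show the conflict test.

```python
LINE_BYTES = 128
SUBLINE_BYTES = 16   # 128 / 16 = 8 sublines per line, as in the example above


def subline_mask(offset, length):
    """Bit s is set if bytes [offset, offset + length) touch subline s."""
    first = offset // SUBLINE_BYTES
    last = (offset + length - 1) // SUBLINE_BYTES
    mask = 0
    for s in range(first, last + 1):
        mask |= 1 << s
    return mask


class LineDirectoryEntry:
    """Per-line record of sublines written speculatively in this interval."""

    def __init__(self):
        self.spec_written = 0

    def thread_write(self, offset, length):
        # A speculative write by a thread marks every subline it touches.
        self.spec_written |= subline_mask(offset, length)

    def mu_read_conflicts(self, offset, length):
        # A non-speculative MU read conflicts if it touches a marked subline.
        return (self.spec_written & subline_mask(offset, length)) != 0
```

With line-granularity tracking, any MU read of a line written in the interval would set the conflict bit; subline tracking narrows this to reads that actually overlap the written bytes' sublines.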
When the MU issues a write to the L2, it tags the write with a non-speculative ID. In rollback mode, a non-speculative version of the line is written and, if there are any speculative versions of the line, all such speculative versions are also updated. During this update, the L2 has the ability to track which subsections of the line were speculatively modified. When a line is written speculatively, the L2 notes which sublines are changed. If the non-speculative write modifies a subline that has been speculatively written, a “write conflict” bit in the L2 is set, and that interval is not rollbackable. This permits threads to see the latest MU effects on the memory system, so that if no soft error occurs in the interval, the speculative data can be promoted to non-speculative for the next interval. In addition, if a soft error occurs, it permits rollback to non-speculative state.
On BG/Q, the MU may issue atomic read-modify-write commands. For example, message byte counters, that are initialized by software, are kept in memory. After the payload of a direct put packet is written to memory, the MU issues an atomic read-modify-write command to the byte counter address to decrement the byte counter by the number of payload bytes in the packet. The L2 treats this as both a read and a write command, checking for both read and write conflicts, and updating versions.
In order for the interval to be rollbackable, certain other conditions should be satisfied. The MU cannot have started processing any descriptors that were injected into an injection FIFO during the interval. Violations of this “new descriptor injected” condition are easy to check in software by comparing the current MU injection FIFO head pointers with those at the beginning of the interval, and by tracking how many descriptors are injected during the interval. (On BG/Q, for each injection FIFO the MU maintains a count of the number of descriptors injected, which can assist in this calculation.)
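One way the “new descriptor injected” check might be written in software is sketched below. The function names and parameters are hypothetical. The sketch assumes slot indices rather than byte addresses, assumes software can obtain the number of descriptors the MU has fully processed since the start of the interval, and is deliberately conservative: because the head only advances after a message is completely injected, it reports a violation as soon as the MU could have begun a descriptor added during the interval.

```python
def pending_descriptors(head, tail, num_slots):
    """Descriptors waiting between head and tail in a circular FIFO."""
    return (tail - head) % num_slots


def new_descriptor_violation(head_at_start, tail_at_start, num_slots,
                             processed_since_start, injected_since_start):
    """Return True if the MU may have started a descriptor from this interval."""
    if injected_since_start == 0:
        # Nothing new was injected, so everything the MU touched is old.
        return False
    old = pending_descriptors(head_at_start, tail_at_start, num_slots)
    # Once all old descriptors are done, the MU may already be working on
    # a new one even though the head pointer has not advanced past it yet.
    return processed_since_start >= old
```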
In addition, during the interval, a thread may have received packets from a memory reception FIFO and advanced the FIFO's head pointer. Those packets will not be resent by another node, so in order for the rollback to be successful, the thread should be able to reset the FIFO's head pointer to what it was at the beginning of the interval so that packets in the FIFO can be “re-played”. Since the FIFO is a circular buffer, and since the head may have been advanced during the interval, it is possible that a newly arrived packet has overwritten a packet in the FIFO that should be re-played during the local rollback. In such a case, the interval is not rollbackable. It is easy to design messaging software that identifies when such an over-write occurs. For example, if the head is changed by an “advance_head” macro/inline or function, advance_head can increment a counter representing the number of bytes in the FIFO between the old head and the new head. If that counter exceeds a “safe” value that was determined at the start of the interval, then a write is made to an appropriate memory location that notes that the FIFO overwrite condition occurred. Such a write may be invoked via a system call. The safe value could be calculated by reading the FIFO's head and tail pointers at the beginning of the interval and, knowing the size of the FIFO, determining how many bytes of packets can be processed before reaching the head.
On BG/Q, barriers or global interrupts may be initiated not by injecting descriptors into FIFOs, but via writing a memory mapped register that triggers barrier/interrupt logic inside the network. If during an interval, a thread initiates a barrier and a soft error occurs on that node, then the interval is not rollbackable. Software can easily track such new barrier/interrupt initiated occurrences, in a manner similar to the FIFO overwrite condition. Or, the hardware (with software cooperation) can set a special bit in the memory mapped barrier register whenever a write occurs; if that bit is initialized to 0 at the beginning of the interval, then if the bit is high, the interval cannot be rolled back.
We assume that the application uses a messaging software library that is consistent with local rollbacks. Specifically, hooks in the messaging software support monitoring the reception FIFO overwrite condition, the injection FIFO new descriptor injected condition, and the new global interrupt/barrier initiated condition. In addition, if certain memory mapped I/O registers are written during an interval, such as when a FIFO is reconfigured by moving it, or resizing it, an interval cannot be rolled back. Software can be instrumented to track writes to such memory mapped I/O registers and to record appropriate change bits if the conditions to rollback an interval are violated. These have to be cleared at the start of an interval, and checked when soft errors occur.
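The advance_head bookkeeping can be sketched as follows. Only advance_head is named in the text; the class ReceptionFifoTracker, its fields, and rollback_head are hypothetical. The safe value is taken as an input here and is not derived, since how it is computed is left to the messaging software.

```python
class ReceptionFifoTracker:
    """Software-side tracking of how far the reception FIFO head has moved."""

    def __init__(self, size_bytes, head, safe_bytes):
        self.size = size_bytes
        self.head = head
        self.head_at_start = head      # value to restore on a local rollback
        self.safe_bytes = safe_bytes   # threshold chosen at interval start
        self.consumed = 0              # bytes between old head and new head
        self.overwrite_flag = False    # stands in for the noted memory location

    def advance_head(self, nbytes):
        """Advance the head past nbytes of packets and update the counter."""
        self.head = (self.head + nbytes) % self.size
        self.consumed += nbytes
        if self.consumed > self.safe_bytes:
            # Record that replayed packets may have been overwritten.
            self.overwrite_flag = True

    def rollback_head(self):
        """Reset the head so packets can be re-played; valid only if no flag."""
        if self.overwrite_flag:
            raise RuntimeError("reception FIFO overwrite: interval not rollbackable")
        self.head = self.head_at_start
        self.consumed = 0
```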
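The conditions collected so far can be gathered into one record that is cleared at the start of an interval and checked when a soft error occurs. This is a sketch only; the class IntervalFlags and its field names are hypothetical labels for the bits and software flags described in the text.

```python
from dataclasses import dataclass, fields


@dataclass
class IntervalFlags:
    """One flag per condition that makes an interval non-rollbackable."""
    l2_overflow: bool = False
    rollback_read_conflict: bool = False
    write_conflict: bool = False
    new_descriptor_injected: bool = False
    reception_fifo_overwrite: bool = False
    barrier_or_interrupt_initiated: bool = False
    mmio_register_written: bool = False

    def clear(self):
        """Reset every flag; done at the start of each interval."""
        for f in fields(self):
            setattr(self, f.name, False)

    def violations(self):
        """Names of the conditions that were violated in this interval."""
        return [f.name for f in fields(self) if getattr(self, f.name)]


def interval_is_rollbackable(flags):
    """A potentially rollbackable soft error can be rolled back only if
    none of the tracked conditions was violated."""
    return not flags.violations()
```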
Putting this together, at the beginning of an interval:
If there is no detected soft error at the end of the interval, running of the next interval is initiated. If an unconditionally not rollbackable soft error occurs during the interval, running is re-started from the previous complete checkpoint. If a potentially rollbackable soft error occurs:
The above discussion assumes that no real-time interrupts such as messages from the control system, or MU interrupts occur. On BG/Q, a MU interrupt may occur if a packet with an interrupt bit set is placed in a memory FIFO, the amount of free space in a reception FIFO decreases below a threshold, or the amount of free space in an injection FIFO crosses a threshold. For normal injection FIFOs, the interrupt occurs if the amount of free space in the FIFO increases above a threshold, but for remote get injection FIFOs the interrupt occurs if the amount of free space in the FIFO decreases below a threshold.
A conservative approach would be to classify an interval as non rollbackable if any of these interrupts occurs, but we seek to increase the fraction of rollbackable intervals by appropriately handling these interrupts. First, external control system interrupts or remote get threshold interrupts are rare and may trigger very complicated software that is not easily rolled back. So if such an interrupt occurs, the interval will be marked not rollbackable.
For the other interrupts, we assume that the interrupt causes the messaging software to run some routine, e.g., called “advance”, that handles the condition.
For the reception FIFO interrupts, advance may pull packets from the FIFO and for an injection FIFO interrupt, advance may inject new descriptors into a previously full injection FIFO. Note that advance can also be called when such interrupts do not occur, e.g., it may be called when an MPI application calls MPI_Wait. Since the messaging software may correctly deal with asynchronous arrival of messages, it may be capable of processing messages whenever they arrive. In particular, suppose such an interrupt occurs during an interval, and software notes that it has occurred, and an otherwise rollbackable soft error occurs during the interval. Note that when the interval is restarted, there are at least as many packets in the reception FIFO as when the interrupt originally fired. If when the interval is restarted, the software sets the hardware interrupt registers to re-trigger the interrupt, this will cause advance to be called on one or more threads at, or near the beginning of the interval (if the interrupt is masked at the time). In either case, the packets in the reception FIFO will be processed and the condition causing the interrupt will eventually be cleared. If when the interval starts, advance is already in progress, having the interrupt bit high may simply cause advance to be run a second time.
Mode Changes
As alluded to above, the L2 can be configured to run in different modes, including speculative, transactional, rollback and normal. If there is a mode change during an interval, the interval is not rollbackable.
Multiple Tag Domains
The above description assumes that there is a single “domain” of tags. Local rollback can be extended to the case when the L2 supports multiple tag domains. For example, suppose there are 128 tags that can be divided into up to 8 tag domains with 16 tags/domain. Reads and writes in different tag domains do not affect one another. For example, suppose there are 16 (application) cores per node with 4 different processes each running on a set of 4 cores. Each set of cores could comprise a different tag domain. If there is a shared memory region between the 4 processes, that could comprise a fifth tag domain. Reads and writes by the MU are non-speculative and may be seen by every domain. The checks for local rollback should be satisfied by each tag domain. In particular, if the overflow, read and write conflict bits are on a per domain basis, then an interval cannot be rolled back if any of the domains indicate a violation.
The L2 cache 70100 is multi-versioned to support a speculative running mode, a transactional memory mode, and a rollback mode. A speculative running mode computes instruction calculations ahead of their time as defined in a sequential program order. In such a speculative mode, data in the L2 cache 70100 may be speculative (i.e., assumed ahead or computed ahead and may subsequently be validated (approved), updated or invalidated). A transactional memory mode controls a concurrency or sharing of the L2 cache 70100, e.g., by enabling read and write operations to occur simultaneously, and by ensuring that the intermediate state of the read and write operations is not visible to other threads or processes. A rollback mode refers to performing a local rollback.
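The 128-tag example can be made concrete with a short sketch. The contiguous assignment of tags to domains (tags 0 to 15 in domain 0, and so on) is an assumption made for illustration, as are the function names and the dictionary layout of the per-domain bits.

```python
NUM_TAGS = 128
TAGS_PER_DOMAIN = 16   # 128 / 16 = 8 domains, as in the example above


def tag_domain(tag):
    """Map a tag to its domain, assuming contiguous blocks of 16 tags."""
    if not 0 <= tag < NUM_TAGS:
        raise ValueError("tag out of range")
    return tag // TAGS_PER_DOMAIN


def node_rollbackable(domain_bits):
    """domain_bits maps a domain number to its overflow, read conflict and
    write conflict bits. The interval can be rolled back only if no domain
    reports a violation."""
    return not any(any(bits.values()) for bits in domain_bits.values())
```

For the five-domain example (four processes plus one shared region), a violation in the shared domain alone is enough to make the interval non-rollbackable for the node.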
In one embodiment, the L2 cache 70100 is partitioned into multiple slices, each of which acts independently. In the speculative or transactional mode, data in a main memory (not shown) is always valid. Speculative data held in the L2 cache 70100 are not written back to the main memory. In the rollback mode, speculative data may be written back to the main memory, at which point the speculative data cannot be distinguished from committed data and the interval can not be rolled back if an error occurs. In addition to supporting a common caching functionality, the L2 cache 70100 is operatively controlled or programmed for running in the rollback mode. In one embodiment, operating features include, but are not limited to: an ability to store a same cache line (e.g., 128 bytes) of data multiple times in the cache (i.e., multi-versioned); Each such cache line having or provided with a generation ID tag (e.g., tag 1 (70105) and a tag T (70110) in
In one embodiment, the software or hardware sets a length of the current interval so that an overflow of the L2 cache 70100 is unlikely to occur during the current interval. The length of the current interval depends on a size of the L2 cache 70100 and/or characteristics of an application workload being run.
In one embodiment, the control logic device 70120 communicates with the cache memory, e.g., the L2 cache. In a further embodiment, the control logic device 70120 is a memory management unit of the cache memory. In a further embodiment, the control logic device 70120 is implemented in a processor core. In an alternative embodiment, the control logic device 70120 is implemented as a separate hardware or software unit.
The following describes situations in which there is no I/O operation into and out of a node, including no exchange of messages between nodes. Checkpoints to disk or a stable storage device are still taken periodically, but at a reduced frequency. If the end of a current local rollback interval (e.g., an interval 1 (70200) in
In one embodiment, certain types of soft errors cannot be recovered via local rollback under any conditions (i.e., are not rollbackable). Examples of such errors include one or more of: an uncorrectable ECC error in a main memory, as this uncorrectable ECC error may corrupt a state that is not backed up by the multi-versioning scheme; an unrecoverable soft error in a network, as this unrecoverable error may corrupt a state that cannot be reinstated by rerunning. If such a non-rollbackable soft error occurs, the interval is not rollbackable. Therefore, according to one embodiment of the present invention, there are two classes of soft errors: potentially rollbackable and unconditionally not rollbackable. For purposes of the description that follows, it is assumed that a soft error is potentially rollbackable.
At the start of each local rollback interval, each thread on each processor core stores its register state (including its program counter), e.g., in a buffer. Certain memory mapped registers (i.e., registers that have their specific addresses stored in known memory locations) outside the core that do not support the speculation (i.e., computing ahead or assuming future values) and need to be restored on a checkpoint are also saved, e.g., in a buffer. A new (speculation) generation ID tag “T” (e.g., a tag “T” bit or flag 70110 in
When a cache line is written to the L2 cache, if a version of that line with the tag “T” (70110) does not exist in the L2 cache, a version with the tag “T” (70110) is created. If some version of the line exists in the L2 cache, the control logic device 70120 copies the newest version of that line into a version with the tag “T” (70110). If a version of the line does not exist in the L2 cache, the line is brought in from a main memory and given the tag “T” (70110). A write from a thread includes, without limitation, byte enables that indicate which bytes in a current write command are to be written. Those bytes with the byte enable set to a predetermined logic level (e.g., high or logic ‘1’) are then written to a version with the tag “T” (70110). If a version of the line with the tag “T” (70110) already exists in the L2 cache 70100, that line is changed according to the byte enables.
At the end of a local rollback interval, if no soft error occurred, data associated with a current tag “T” (70110) is committed by changing a state of the tag from speculative to committed (i.e., finalized, approved and/or determined by a processor core). The L2 cache 70100 runs a continuous background scrub process that converts all occurrences of cache lines written with a tag that has committed status to non-speculative. The scrub process merges all or some of a committed version of a same cache memory address into a single version based on tag ordering and removes the versions it merged.
In one embodiment, the L2 cache 70100 is a set-associative cache with a certain number of cache lines per set. All versions of a cache line belong to a same set. When a new cache line, or new version of a cache line, is created in the L2 cache, some line(s) in that set may have to be written back to a main memory. In the speculative mode, non-committed, or speculative, versions are not allowed to be written to the main memory. In the rollback mode, non-committed versions can be written to the main memory, but an “overflow” bit in a control register in the L2 cache is set to 1 indicating that such a write has been done. At the start of a local rollback interval, all the overflow bits are set to 0.
In another embodiment, the overflow condition may cause a state change of a speculation generation ID (i.e., an ID identifying the speculation in which a cache line was changed in the speculative mode) into a committed state in addition to or as an alternative to setting an overflow flag.
If a soft error occurs during a local rollback interval, this soft error triggers an interrupt that is delivered to at least one thread running on a node associated with the L2 cache 70100. Upon receiving such an interrupt, the thread issues a core-to-core interrupt (i.e., an interrupt that allows threads on arbitrary processor cores of an arbitrary computing node to be notified within a deterministic low latency (e.g., 10 clock cycles)) to all the other threads which instructs them to stop running the current interval. If at this time, all the overflow bits of the L2 cache are 0, then contents in the main memory have not been corrupted by data generated during this interval and the interval is rollbackable. If one of the overflow bits is 1, then the main memory has been corrupted by data in this interval, the interval is not rollbackable and running is restarted from the most recent checkpoint.
If the interval is rollbackable, processor cores are re-initialized, all or some of the cache lines in the L2 associated with the tag “T” (70110) are invalidated, all or some of the memory mapped registers and thread registers are restored to their values at the start of the interval, and a running of the interval restarts. The control logic device 70120 invalidates cache lines associated with the tag “T” (70110) by changing a state of the tag “T” (70110) to invalid. The L2 cache background invalidation process initiates removal of occurrences of lines with invalid tags from the L2 cache 70100 in the rollbackable interval.
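The difference between the two modes on eviction can be sketched with a toy victim-selection routine. The function name choose_victim, the mode strings, and the policy of preferring non-speculative victims are all assumptions made for the example; the actual replacement policy is not specified here.

```python
def choose_victim(cache_set, mode):
    """Pick a line to write back when a set is full.

    cache_set is a list of dicts with a boolean 'speculative' field.
    Returns (index, overflow), where overflow is True when a speculative
    version had to be written to main memory.
    """
    # Prefer a non-speculative line: writing it back is always safe.
    for i, line in enumerate(cache_set):
        if not line["speculative"]:
            return i, False
    if mode == "rollback":
        # Rollback mode may write a speculative version back, but the
        # overflow bit records that the interval can no longer be rolled back.
        return 0, True
    # Speculative mode: speculative versions are not written to memory,
    # so no victim is available in this toy model.
    return None, False
```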
Recovering rollbackable soft errors can be performed in a way that is transparent to an application being run. At the beginning of a current interval, a kernel running on a thread can, in a coordinated fashion (i.e., synchronized with the control logic device 70120), set a timer interrupt (i.e., an interrupt associated with a particular timing) to occur at the end of the current interval. Since interrupt handlers are run in kernel, this timer interrupt is invisible to the application. When this interrupt occurs and no detectable soft error has occurred during the interval, preparations for the next interval are made, and the timer interrupt is reset. These preparations can be done even if a local rollback interval included an overflow event (since there was no soft error).
The following describes a situation in which there is at least one I/O operation, for example, messaging traffic between nodes. If all nodes participate in a barrier synchronization at the start of a current interval, if there is no messaging activity at all during the interval (no data injected into a network or received from the network) on every node, and if a rollbackable soft error occurs during the interval on one or more nodes, then those nodes can rerun the interval and, if successful, enter the barrier (synchronization) for a next interval.
In one embodiment, nodes are unaware that a local rollback is being performed on another node somewhere else. If a node has a soft error that is non-rollbackable, then all other nodes may begin an operation from the previous checkpoint.
In another embodiment, software or the control logic device 70120 checks the at least one condition or state, which does not require barriers and that relaxes an assumption that no messaging activity occurs during a current interval. This checking of the at least one condition reduces an overhead and increases a fraction of rollbackable intervals. For example, a current interval will be rollbackable if no data that was generated during the current interval is injected into the network. Thus the current interval is rollbackable if the data injected into the network in the current interval were generated during previous intervals. Thus, packets arriving during a local rollback interval can be considered valid. Furthermore, if a node performs a local rollback within the L2 cache 70100, it will not inject the same messages (packets) twice, (i.e., once during a failed interval and again during a rerunning). Local rollback intervals can proceed independently on each node, without coordination from other nodes, unless there is a non-rollbackable interval, in which case an entire application may be restarted from a previous checkpoint.
In one embodiment, network traffic is handled by a hardware Message Unit (MU). The MU is responsible for putting messages, which are packetized, into the network and for receiving packets from the network and placing them in a main memory device. In one embodiment, the MU is similar to a DMA engine on IBM® Blue Gene®/P supercomputer described in detail in “Overview of the IBM Blue Gene/P project”, IBM® Blue Gene® team, IBM J. RES. & DEV., Vol. 52, No. 1/2 January/March 2008, wholly incorporated by reference as if set forth herein. There may be message descriptors that are placed in an injection FIFO (i.e., a buffer or queue storing messages to be sent by the MU). In one embodiment, an injection FIFO is implemented as a circular buffer in a main memory.
The MU maintains memory mapped registers that include, without limitation, pointers to a start, head, tail and end of the injection FIFO. Processor cores inject messages by placing the descriptor in a main memory location pointed to by the tail, and then updating the tail to a next slot in the injection FIFO. The MU recognizes non-empty slots in the injection FIFO, pulls the descriptor at the head of the injection FIFO, and injects a packet or message into the network as indicated in the descriptor, which includes a length of the message, its starting address, its destination and other information indicating what further processing is to be performed with the message's packets upon a reception at a destination node. When all or some of the packets from a message have been injected, the MU advances the head pointer of the injection FIFO. Upon a reception, if the message is a “direct put”, payload bytes of the packet are placed into a receiving node's main memory starting at an address indicated in the packet. (A “direct put” is a packet type that goes through the network and writes payload data into a receiving node's main memory.) If a packet belongs to a “memory FIFO” message (i.e., a message associated with a queue or circular buffer in a main memory of a receiving node), the packet is placed at the tail of a reception FIFO and then the MU updates the tail. In one embodiment, a reception FIFO is also implemented as a circular buffer in a main memory and the MU again has memory mapped registers pointing to the start, head, tail and end of the reception FIFO. Threads read packets at the head of the reception FIFO (if non-empty) and then advance the head pointer of the reception FIFO appropriately. The MU may also support “remote get” messages. (A “remote get” is a packet type that goes through the network and is deposited into the injection FIFO on a node A. Then, the MU causes the “remote get” message to be sent from the node A to some other node.) 
The payload of such a “remote get” message consists of message descriptors that are put into the injection FIFO. Through the “remote get” message, one node can instruct another node to send data back to it, or to another node.
When the MU issues a read to the L2 cache 70100, it tags the read with a non-speculative tag (e.g., a tag “T0” (70115) in
In another embodiment, the conflict condition may cause a state change of the speculation ID to the committed state in addition to or as an alternative to setting a read conflict bit.
When the MU issues a write to the L2 cache 70100, it tags the write with a non-speculative ID (e.g., a tag “T0” (70115) in
In another embodiment, the write conflict condition may cause a state change of the speculation ID to the committed state in addition to or as an alternative to setting a write conflict bit.
In one embodiment, the MU issues an atomic read-modify-write command. When a processor core accesses a main memory location with the read-modify-write command, the L2 cache 70100 is read and then modified and the modified contents are stored in the L2 cache. For example, message byte counters (i.e., counters that store the number of bytes in messages in a FIFO), which are initialized by an application, are stored in a main memory. After a payload of a “direct put” packet is written to the main memory, the MU issues the atomic read-modify-write command to an address of the byte counter to decrement the byte counter by the number of payload bytes in the packet. The L2 cache 70100 treats this command as both a read and a write command, checking for both read and write conflicts and updating versions.
In one embodiment, in order for the current interval to be rollbackable, certain conditions should be satisfied. One condition is that the MU cannot have started processing any descriptors that were injected into an injection FIFO during the interval. Violations of this “new descriptor injected” condition (i.e., a condition that a new message descriptor was injected into the injection FIFO during the current interval) can be checked by comparing current injection FIFO head pointers with those at the beginning of the interval and/or by tracking how many descriptors are injected during the interval. In a further embodiment, for each injection FIFO, the MU may count the number of descriptors injected.
In a further embodiment, during the current interval, a thread may have received packets from the reception FIFO and advanced the reception FIFO head pointer. Those packets will not be resent by another node, so in order for a local rollback to be successful, the thread should be able to reset the reception FIFO head pointer to what it was at the beginning of the interval so that packets in the reception FIFO can be “re-played”. Since the reception FIFO is a circular buffer, and since the head pointer may have been advanced during the interval, it is possible that a newly arrived packet has overwritten a packet in the reception FIFO that should be re-played during the local rollback. In such a situation where an overwriting occurred during a current interval, the interval is not rollbackable. In one embodiment, there is provided messaging software that identifies when such an overwriting occurs. For example, if the head pointer is changed by an “advance_head” macro/inline or function (i.e., a function or code for advancing the head pointer), the “advance_head” function can increment a counter representing the number of bytes in the reception FIFO between an old head pointer (i.e., a head pointer at the beginning of the current interval) and a new head pointer (i.e., a head pointer at the present time). If that counter exceeds a “safe” value (i.e., a threshold value) that was determined at the start of the interval, then a write to a main memory location that invokes the reception FIFO overwriting condition occurs. Such a write may also be invoked via a system call (e.g., a call to a function handled by an Operating System (e.g., Linux™) of a computing node). The safe value can be calculated by reading the reception FIFO head and tail pointers at the beginning of the interval, by knowing a size of the FIFO, and/or by determining how many bytes of packets can be processed before reaching the reception FIFO head pointer.
The barrier(s) or interrupt(s) may be initiated by writing a memory mapped register (not shown) that triggers the barrier or interrupt handler inside a network (i.e., a network connecting processing cores, a main memory, and/or cache memory(s), etc.). If during a local rollback interval, a thread initiates a barrier and a soft error occurs on a node, then the interval is not rollbackable. In one embodiment, there is provided a mechanism that can track such a barrier or interrupt, e.g., in a manner similar to the reception FIFO overwriting condition. In an alternative embodiment, hardware (with software cooperation) can set a flag bit in a memory mapped barrier register 70140 whenever a write occurs. This flag bit is initialized to 0 at the beginning of the interval. If the flag bit is high, the interval cannot be rolled back. A memory mapped barrier register 70140 is a register outside a processor core but accessible by the processor core. When values in the memory mapped barrier register change, the control logic device 70120 may cause a barrier or interrupt packet (i.e., a packet indicating a barrier or interrupt occurrence) to be injected into the network. There may also be control registers that define how this barrier or interrupt packet is routed and what inputs trigger or create this packet.
In one embodiment, an application being run uses a messaging software library (i.e., library functions described in the messaging software) that is consistent with local rollbacks. The messaging software may monitor the reception FIFO overwriting condition (i.e., a state or condition indicating that an overwriting occurred in the reception FIFO during the current interval), the injection FIFO new descriptor injected condition (i.e., a state or condition that a new message descriptor was injected into the injection FIFO during the current interval), and the initiated interrupt/barrier condition (i.e., a state or condition that the barrier or interrupt is initiated by writing a memory mapped register). In addition, if a memory mapped I/O register 135 (i.e., a register describing status of I/O device(s) or being used to control such device(s)) is written during a local rollback interval, for example, when a FIFO is reconfigured by moving or resizing that FIFO, the interval cannot be rolled back. In a further embodiment, there is provided a mechanism that tracks a write to such memory mapped I/O register(s) and records change bits if condition(s) for local rollback is(are) violated. These change bits have to be cleared at the start of a local rollback interval and checked when soft errors occur.
Thus, at the beginning of a local rollback interval:
1. Threads, run by processing cores of a computing node, set the read and write conflict and overflow bits to 0.
2. Threads store the injection FIFO tail pointers and reception FIFO head pointers, compute and store the safe value and set the reception FIFO overwrite bit (i.e., a bit indicating an overwrite occurred in the reception FIFO during the interval) to 0, set the barrier/interrupt bit (i.e., a bit indicating a barrier or interrupt is initiated, e.g., by writing a memory mapped register, during the interval) to 0, and set the change bits (i.e., bits indicating something has been changed during the interval) to 0.
3. Threads initiate storing of states of their internal and/or external registers.
4. A new speculative ID tag (e.g., a tag “T” (70110) in
5. Threads begin running code in the interval.
If there is no detected soft error at the end of a current interval, the control logic device 70120 runs a next interval. If an unconditionally not rollbackable soft error (i.e., non-rollbackable soft error) occurs during the interval, the control logic device 70120 or a processor core restarts an operation from a previous checkpoint. If a potentially rollbackable soft error occurs:
1. If the MU is not already stopped, the MU is stopped, thereby preventing new packets from entering a network (i.e., a network to which the MU is connected) or being received from the network. (Typically, when the MU is stopped, it continues processing any packets currently in progress and then stops.)
2. Rollbackable conditions are checked: the rollback read and write conflict bits, or if the speculation ID is already in committed state, the injection FIFO new descriptor injected condition, the reception FIFO overwrite bits, the barrier/interrupt bit, and the change bits. If the interval is not rollbackable, the control logic device 70120 or a processor core restarts an operation from a previous checkpoint. If the interval is rollbackable, the procedure continues with step 3.
3. Processor cores are reinitialized, all or some of the cache lines in the L2 cache 70100 are invalidated (without writing back speculative data in the L2 cache 70100 to a main memory), and all or some of the memory mapped registers and thread registers are restored to their values at the start of the current interval. The injection FIFO tail pointers are restored to their original values at the start of the current interval. The reception FIFO head pointers are restored to their original values at the start of the current interval. If the MU was already stopped, the MU is restarted; and,
4. Running of the current interval restarts.
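The decision made when a soft error is detected can be sketched as follows. This is a simplified software model under stated assumptions: the field names in `IntervalState` and the returned action strings are hypothetical labels for the condition bits and recovery paths listed above, and the sketch leaves out stopping the MU and restoring registers and pointers.

```python
from dataclasses import dataclass


@dataclass
class IntervalState:
    """Condition bits gathered for one local rollback interval (illustrative names)."""
    read_conflict: bool = False
    write_conflict: bool = False
    overflow: bool = False
    id_committed: bool = False             # speculation ID already in committed state
    new_descriptor_injected: bool = False  # MU started on a descriptor from this interval
    reception_fifo_overwrite: bool = False
    barrier_or_interrupt: bool = False
    change_bits: bool = False              # a memory mapped I/O register was written


def is_rollbackable(state):
    """The interval can be rolled back only if no tracked condition is set."""
    return not any(vars(state).values())


def on_soft_error(state, unconditionally_fatal=False):
    """Return the recovery action for a soft error detected during the interval."""
    if unconditionally_fatal or not is_rollbackable(state):
        return "restart_from_checkpoint"
    return "restart_interval"


clean = on_soft_error(IntervalState())
after_barrier = on_soft_error(IntervalState(barrier_or_interrupt=True))
```

Here `clean` selects a restart of the current interval, while `after_barrier` falls back to the previous checkpoint because one of the conditions was violated.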
In one embodiment, real-time interrupts such as messages from a control system (e.g., a unit controlling the HPC system), or interrupts initiated by the MU (“MU interrupt”) occur. An MU interrupt may occur if a packet with an interrupt bit set high is placed in an injection or reception FIFO, if an amount of free space in a reception FIFO decreases below a threshold, or if an amount of free space in an injection FIFO increases above a threshold. For a (normal) injection FIFO, an interrupt occurs if the amount of free space in the injection FIFO increases above a threshold. For a remote get injection FIFO (i.e., a buffer or queue storing “remote get” message placed by the MU), an interrupt occurs if an amount of free space in the reception FIFO decreases below a threshold.
In one embodiment, the control logic device 70120 classifies an interval as non-rollbackable if any of these interrupts occurs. In an alternative embodiment, the control logic device 70120 increases a fraction of rollbackable intervals by appropriately handling these interrupts as described below. Control system interrupts or remote get threshold interrupts (i.e., interrupts initiated by the remote get injection FIFO due to an amount of free space lower than a threshold) may trigger software that is not easily rolled back. So if such an interrupt (e.g., control system interrupts and/or remote get threshold interrupt) occurs, the interval is not rollbackable.
All the other interrupts cause the messaging software to run a software routine, e.g., called “advance”, that handles all the other interrupts. For example, for the reception FIFO interrupts (i.e., interrupts initiated by the reception FIFO because an amount of free space is below a threshold), the advance may pull packets from the reception FIFO. For the injection FIFO interrupt (i.e., an interrupt that occurred because an amount of free space is above a threshold), the advance may inject new message descriptors into a previously full injection FIFO (i.e., a FIFO which was full at some earlier point in time; when the injection FIFO interrupt occurred, the FIFO was no longer full and a message descriptor may be injected). The advance can also be called when such interrupts do not occur, e.g., the advance may be called when an MPI (Message Passing Interface) application calls MPI_Wait. MPI refers to a language-independent communication protocol used to program parallel computers and is described in detail in http://www.mpi-forum.org/ or http://www.mcs.anl.gov/research/projects/mpi/. MPI_Wait refers to a function that waits for an MPI send or receive request to complete.
Since the messaging software can correctly deal with asynchronous arrival of messages, the messaging software can process messages whenever they arrive. In a non-limiting example, suppose that an interrupt occurs during a local rollback interval and that the control logic device 70120 detects that the interrupt has occurred, e.g., by checking whether the barrier or interrupt bit is set to high (“1”), and that a rollbackable soft error occurs during the interval. In this example, when the interval is restarted, there may be at least as many packets in the reception FIFO as when the interrupt originally occurred. If the control logic device 70120 sets hardware interrupt registers (i.e., registers indicating interrupt occurrences) to re-trigger the interrupt, when the interval is restarted, this re-triggering will cause the advance to be called on one or more threads at, or near the beginning of the interval (if the interrupt is masked at the time). In either case, the packets in the reception FIFO will be processed and a condition causing the interrupt will eventually be cleared. If the advance is already in progress, when the interval starts, having interrupt bits set high (i.e., setting the hardware interrupt registers to a logic “1” for example) may cause the advance to be run a second time.
The L2 cache 70100 can be configured to run in different modes, including, without limitation, speculative, transactional, rollback and normal (i.e., normal caching function). If there is a mode change during an interval, the interval is not rollbackable.
In one embodiment, there is a single “domain” of tags in the L2 cache 70100. In this embodiment, a domain refers to a set of tags. In one embodiment, the software (e.g., Operating System, etc.) or the hardware (e.g., the control logic device, processors, etc.) performs the local rollback when the L2 cache supports a single domain of tags or multiple domains of tags. With multiple domains of tags, tags are partitioned into different domains. For example, suppose that there are 128 tags that can be divided into up to 8 tag domains with 16 tags per domain. Reads and writes in different tag domains do not affect one another. For example, suppose that there are 16 (application) processor cores per node with 4 different processes each running on a set of 4 processor cores. Each set of cores could comprise a different tag domain. If there is a shared memory region between the 4 processes, that region could comprise a fifth tag domain. Reads and writes by the MU are non-speculative (i.e., normal) and may be seen by every domain. The conditions for a local rollback should be satisfied by each tag domain. In particular, the interval cannot be rolled back if any of the domains indicates a non-rollbackable situation (e.g., its overflow, read conflict or write conflict bits are set high during the local rollback interval).
If, at step 70310, an unrecoverable condition occurs during the current interval, at step 70312, the control logic device 70120 commits changes made before the occurrence of the unrecoverable condition. At step 70315, the control logic device 70120 evaluates whether a minimum interval length is reached. The minimum interval length refers to the least number of instructions or the least amount of time that the control logic device 70120 spends to run a local rollback interval. If the minimum interval length is reached, at step 70330, the software or hardware ends the running of the current interval and instructs the control logic device 70120 to commit changes (in states of the processor) that occurred during the minimum interval length. Then, the control returns to the step 70300 to run a next local rollback interval in the L2 cache 70100. Otherwise, if the minimum interval length is not reached, at step 70335, the software or hardware continues the running of the current interval until the minimum interval length is reached.
Continuing to step 70340, while running the current interval before reaching the minimum interval length, whether an error occurred or not can be detected. The error that can be detected in step 70340 may be a non-recoverable soft error because an unrecoverable condition has occurred during the current interval. If a non-recoverable error (i.e., an error that cannot be recovered by restarting the current interval) has not occurred until the minimum interval length is reached, at step 70330, the software or hardware ends the running of the current interval upon reaching the minimum interval length and commits changes that occurred during the minimum interval length. Then, the control returns to the step 70300 to run a next local rollback interval. Otherwise, if a non-recoverable error occurs before reaching the minimum interval length, at step 70345, the software or hardware stops running the current interval even though the minimum interval length is not reached and the control is aborted 70345.
In one embodiment, at least one processor core performs method steps described in
IEEE 754 describes floating point number arithmetic. Kahan, “IEEE Standard 754 for Binary Floating-Point Arithmetic,” May 31, 1996, UC Berkeley Lecture Notes on the Status of IEEE 754, wholly incorporated by reference as if set forth herein, describes IEEE Standard 754 in detail.
According to IEEE Standard 754, to perform floating point number arithmetic, some or all floating point numbers are converted to binary numbers. However, the floating point number arithmetic does not need to follow IEEE or any particular standard. Table 1 illustrates IEEE single precision floating point format.
“Signed” bit indicates whether a floating point number is a positive (S=0) or negative (S=1) floating point number. For example, if the signed bit is 0, the floating point number is a positive floating point number. “Exponent” field (E) is represented by a power of two. For example, if a binary number is $10001.001001_2 = 1.0001001001_2 \times 2^4$, then E becomes $127 + 4 = 131_{10} = 1000\_0011_2$. “Mantissa” field (M) represents the fractional part of a floating point number.
For example, to add $2.5_{10}$ and $4.75_{10}$, $2.5_{10}$ is converted to 0x40200000 (in hexadecimal format) as follows:
Convert $2_{10}$ to a binary number $10_2$, e.g., by using the binary division method.
Convert $0.5_{10}$ to a binary number $0.1_2$, e.g., by using the multiplication method.
Calculate the exponent and mantissa fields: $10.1_2$ is normalized to $1.01_2 \times 2^1$. Then, the exponent field becomes $128_{10}$, i.e., 127+1, which is equal to $1000\_0000_2$. The mantissa field becomes $010\_0000\_0000\_0000\_0000\_0000_2$. By combining the signed bit, the exponent field and the mantissa field, a user can obtain $0100\_0000\_0010\_0000\_0000\_0000\_0000\_0000_2$ = 0x40200000.
Similarly, the user converts $4.75_{10}$ to 0x40980000.
Add 0x40200000 and 0x40980000 as follows:
Determine values of the fields.
Adjust a number with a smaller exponent to have a maximum exponent (i.e., largest exponent value among numbers; in this example, $1000\_0001_2$). In this example, $2.5_{10}$ is adjusted to have $1000\_0001_2$ in the exponent field. Then, the mantissa field of $2.5_{10}$ becomes $0.101_2$.
Add the mantissa fields of the numbers. In this example, add $0.101_2$ and $1.0011_2$. Then, append the exponent field. Then, in this example, a result becomes $0100\_0000\_1110\_1000\_0000\_0000\_0000\_0000_2$.
Convert the result to a decimal number. In this example, the exponent field of the result is $1000\_0001_2 = 129_{10}$. By subtracting $127_{10}$ from $129_{10}$, the user obtains $2_{10}$. Thus, the result is represented by $1.1101_2 \times 2^2 = 111.01_2$. $111_2$ is equal to $7_{10}$. $0.01_2$ is equal to $0.25_{10}$. Thus, the user obtains $7.25_{10}$.
Although this example is based on single precision floating point numbers, the mechanism used in this example can be extended to double precision floating point numbers. A double precision floating number is represented by 64 bits, i.e., 1 bit for the signed bit, 11 bits for the exponent field and 52 bits for the mantissa field.
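The single precision example above can be checked with a short script. This is a minimal sketch that relies on Python's standard `struct` module to produce IEEE 754 single precision bit patterns; the helper names `float32_bits` and `fields` are introduced here for illustration only.

```python
import struct


def float32_bits(x):
    """Return the IEEE 754 single precision bit pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def fields(bits):
    """Split a 32-bit pattern into (sign, biased exponent, 23-bit mantissa)."""
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


a = float32_bits(2.5)
b = float32_bits(4.75)
total = float32_bits(2.5 + 4.75)

print(hex(a), fields(a))          # 0x40200000 (0, 128, 2097152)
print(hex(b), fields(b))          # 0x40980000 (0, 129, 1572864)
print(hex(total), fields(total))  # 0x40e80000 (0, 129, 6815744)
```

The exponent fields 128 and 129 correspond to the biased values 127+1 and 127+2 from the worked example, and the sum 0x40e80000 decodes to 7.25.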
Traditionally, in a parallel computing system, floating point number additions in multiple computing node operations, e.g., via messaging, are done in part, e.g., by software. At each network hop, the additions require a processor to first receive multiple network packets associated with multiple messages involved in a reduction operation. Then, the processor adds up floating point numbers included in the packets, and finally puts the results back into the network for processing at the next network hop. An example of the reduction operations is to find a summation of a plurality of floating point numbers contributed (i.e., provided) from a plurality of computing nodes. This software approach has a large overhead and cannot utilize a high network bandwidth (e.g., 2 GB/s) of the parallel computing system.
Therefore, it is desirable to perform the floating point number additions in a collective logic device to reduce the overhead and/or to fully utilize the network bandwidth.
In one embodiment, the present disclosure illustrates performing floating point number additions in hardware, for example, to reduce the overhead and/or to fully utilize the network bandwidth.
In one embodiment, the back-end floating point logic device 75240 includes, without limitation, at least one shift register for performing normalization and/or shifting operation (e.g., a left shift, a right shift, etc.). In one embodiment, the collective logic device 75260 further includes an arbiter device 75250. The arbiter device is described in detail below in conjunction with
In a further embodiment, the collective logic device 75260 is embedded and/or implemented in a 5-Dimensional torus network.
At step 75120, the ALU tree 75230 adds the integer numbers and generates a summation of the integer values. Then, the ALU tree 75230 provides the summation to the back-end floating point logic device 75240. At step 75130, the back-end logic device 75240 converts the summation to a floating point number (“second floating point number”), e.g., by performing left shifting and/or right shifting according to the maximum exponent and/or the summation. The second floating point number is an output of adding the inputs 75200. This second floating point number is reproducible. In other words, upon receiving the same inputs, the collective logic device 75260 produces the same output(s). The outputs do not depend on an order of the inputs. Since an addition of integer numbers (converted from the floating point numbers) does not generate a different output based on an order of the addition, the collective logic device 75260 generates the same output(s) upon receiving the same inputs regardless of an order of the received inputs.
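The order independence described above can be illustrated in software. The following is a simplified sketch, not the hardware data path: it aligns every input to the maximum exponent, truncates to an integer, adds the integers, and converts the sum back. The constant `FRACTION_BITS` and the function names are assumptions made for illustration, and Python's unbounded integers stand in for the fixed-width ALU tree.

```python
import math
from itertools import permutations

FRACTION_BITS = 58  # assumed width of the aligned integers, chosen for illustration


def reproducible_sum(values):
    """Add floats by aligning them to the largest exponent and summing integers."""
    max_exp = max(math.frexp(v)[1] for v in values)
    total = 0
    for v in values:
        # Scale relative to the maximum exponent, then truncate to an integer.
        total += int(math.ldexp(v, FRACTION_BITS - max_exp))
    return math.ldexp(total, max_exp - FRACTION_BITS)


def naive_sum(values):
    """Add floats one after another, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


inputs = [1e16, 1.0, -1e16, 1.0]
reproducible = {reproducible_sum(p) for p in permutations(inputs)}
naive = {naive_sum(p) for p in permutations(inputs)}
```

For these inputs every ordering of the aligned integer addition yields the same value, whereas the sequential floating point addition yields different values for different orderings because it rounds after each step.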
In one embodiment, the collective logic device 75260 performs the method steps 75100-75130 in one pass. One pass means that the computing nodes send the inputs 75200 only once to the collective logic device 75260 and/or receive the output(s) only once from the collective logic device 75260.
In a further embodiment, in each computing node, besides at least 10 bidirectional links for the 5D torus network 75400, there is also at least one dedicated I/O link that is connected to at least one I/O node. Both the I/O link and the bidirectional links are inputs to the collective logic device 75260. In one embodiment, the collective logic device 75260 has at least 12 inputs. One or more of the inputs may come from a local computing node(s). In another embodiment, the collective logic device 75260 has at most 12 inputs. One or more of the inputs may come from a local computing node(s).
In a further embodiment, at least one computing node defines a plurality of collective class maps to select a set of inputs for a class. A class map defines a set of input and output links for a class. A class represents an index into the class map on at least one computing node and is specified, e.g., by at least one packet.
In another embodiment, the collective logic device 75260 performs the method steps 75100-75130 in at least two passes, i.e., the computing nodes send (intermediate) inputs at least twice to the collective logic device 75260 and/or receive (intermediate) outputs at least twice from the collective logic device 75260. For example, in the first pass, the collective logic device 75260 obtains the maximum exponent of the first floating point numbers. Then, the collective logic device normalizes the first floating point numbers and converts them to integer numbers. In the second pass, the collective logic device 75260 adds the integer numbers and generates a summation of the integer numbers. Then, the collective logic device 75260 converts the summation to a floating point number called the second floating point number. When the collective logic device 75260 operates based on at least two passes, its latency may be at least twice as large as the latency of the one-pass operation described above.
In one embodiment, the collective logic device 75260 performing method steps in
The following describes an exemplary floating point number addition according to one exemplary embodiment. Suppose that the collective logic device 75260 receives two floating point numbers $A = 2^1 \times 1.5_{10} = 3_{10}$ and $B = 2^3 \times 1.25_{10} = 10_{10}$ as inputs. The collective logic device 75260 adds the number A and the number B as follows:
I. (corresponding to Step 75105 in
II. (corresponding to Step 75110 in
Thus, when the number A is converted to an integer number, it becomes 0x0180000000000000. When the number B is converted, it becomes 0x0500000000000000. Note that the integer numbers comprise only the mantissa field. Also note that the most significant bit of the number B is two binary digits to the left (larger) than the most significant bit of the number A. This is exactly the difference between the two exponents (1 and 3).
III. (corresponding to Step 75120 in
IV. (corresponding to Step 75130 in
In this example, after steps 1-3, 0x0680000000000000 is converted to 0x003a000000000000 = $2^3 \times 1.625_{10} = 13_{10}$, which is expected by adding $10_{10}$ and $3_{10}$.
In one embodiment, the collective logic device 75260 performs logical operations including, without limitation, logical AND, logical OR, logical XOR, etc. The collective logic device 75260 also performs integer operations including, without limitation, an unsigned and signed integer addition, min and max with an operand size from 32 bits to 4096 bits in units of $(32 \times 2^n)$ bits, where n is a positive integer number. The collective logic device 75260 further performs floating point operations including, without limitation, a 64-bit floating point addition, min (i.e., finding a minimum floating point number among inputs) and max (finding a maximum floating point number among inputs). In one embodiment, the collective logic device 75260 performs floating point operations at a peak network link bandwidth of the network.
In one embodiment, the collective logic device 75260 performs a floating point addition as follows: First, some or all inputs are compared and the maximum exponent is obtained. Then, the mantissa field of each input is shifted according to the difference of its exponent and the maximum exponent. This shifting of each input results in a 64-bit integer number which is then passed through the integer ALU tree 75230 for doing an integer addition. A result of this integer addition is then converted back to a floating point number, e.g., by the back-end logic device 75240.
Once input requests have been chosen by an arbiter, those input requests are sent to appropriate senders (and/or the reception FIFO) 75330 and/or 75350. Once some or all of the senders grant permission, the main arbiter 75325 relays this grant to a particular sub-arbiter which has won and to each receiver (e.g., an injection FIFO 75300 and/or 75305). The main arbiter 75325 also drives correct configuration bits to the collective logic device 75260. The receivers will then provide their input data through the collective logic device 75260 and an output of the collective logic device 75260 is forwarded to appropriate sender(s).
Integer Operations
In one embodiment, the ALU tree 75230 is built with multiple levels of combining blocks. A combining block performs, at least, an unsigned 32-bit addition and/or 32-bit comparison. In a further embodiment, the ALU tree 75230 receives control signals for a sign (i.e., plus or minus), an overflow, and/or a floating point operation control. In one embodiment, the ALU tree 75230 receives at least two 32-bit integer inputs and at least one carry-in bit, and generates a 32-bit output and a carry-out bit. A block performing a comparison and/or selection receives at least two 32-bit integer inputs, and then selects one input depending on the control signals. In another embodiment, the ALU tree 75230 operates with 64-bit integer inputs/outputs, 128-bit integer inputs/outputs, 256-bit integer inputs/outputs, etc.
Floating Point Operations
In one embodiment, the collective logic device 75260 performs 64-bit double precision floating point operations. In one embodiment, at most 12 (e.g., 10 network links+1 I/O link+1 local computing node) floating point numbers can be combined, i.e., added. In an alternative embodiment, at least 12 floating point numbers are added.
A 64-bit floating point number format is illustrated in Table 2.
In IEEE double precision floating point number format, there is a signed bit indicating whether a floating point number is a positive or negative number. The exponent field is 11 bits. The mantissa field is 52 bits.
In one embodiment, Table 3 illustrates a numerical value of a floating point number according to an exponent field value and a mantissa field value:
If the exponent field is 2047 and the mantissa field is 0, a corresponding floating point number is plus or minus Infinity. If the exponent field is 2047 and the mantissa field is not 0, a corresponding floating point number is NaN (Not a Number). If the exponent field is between 1 and $2046_{10}$, a corresponding floating point number is $(-1)^S \times 0.M \times 2^E$. If the exponent field is 0 and the mantissa field is 0, a corresponding floating point number is 0. If the exponent field is 0 and the mantissa field is not 0, a corresponding floating point number is $(-1)^S \times 0.M \times 2^{-1022}$. In one embodiment, the collective logic device 75260 normalizes a floating point number according to Table 3. For example, if S is 0, E is $2_{10} = 10_2$ and M is $1000\_0000\_0000\_0000\_0000\_0000\_0000\_0000\_0000\_0000\_0000\_0000\_0000_2$, a corresponding floating point number is normalized to $1.1000 \ldots 0000 \times 2^2$.
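The case analysis on the exponent and mantissa fields can be sketched in software. This is a minimal illustration that uses Python's standard `struct` module to obtain the 64-bit pattern of a double; the function name `classify_double` and the category strings are hypothetical. The sketch only classifies a value by its fields and does not reproduce the value formulas.

```python
import struct


def classify_double(x):
    """Return (sign, exponent field, mantissa field, category) for a double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF      # 11-bit exponent field
    mantissa = bits & ((1 << 52) - 1)    # 52-bit mantissa field
    if exponent == 2047:
        category = "infinity" if mantissa == 0 else "nan"
    elif exponent == 0:
        category = "zero" if mantissa == 0 else "denormalized"
    else:
        category = "normalized"          # exponent field between 1 and 2046
    return sign, exponent, mantissa, category


examples = [classify_double(v) for v in (float("inf"), float("nan"), 0.0, 5e-324, -1.5)]
```

For instance, -1.5 has a set sign bit, an exponent field of 1023 and only the most significant mantissa bit set, so it falls into the normalized category.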
In one embodiment, an addition of (+) infinity and (+) infinity generates (+) infinity, i.e., (+) Infinity+(+) Infinity=(+) Infinity. An addition of (−) infinity and (−) infinity generates (−) infinity, i.e., (−) Infinity+(−) Infinity=(−) Infinity. An addition of (+) infinity and (−) infinity generates NaN, i.e., (+) Infinity+(−) Infinity=NaN. Min or Max operation for (+) infinity and (+) infinity generates (+) infinity, i.e., MIN/MAX (+Infinity, +Infinity)=(+) infinity. Min or Max operation for (−) infinity and (−) infinity generates (−) infinity, i.e., MIN/MAX (−Infinity, −Infinity)=(−) infinity.
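The infinity rules listed above match ordinary IEEE 754 arithmetic, which can be checked on any host that follows the standard. The following short sketch uses Python floats for that check; it says nothing about how the collective logic device implements these cases, and the variable names are illustrative.

```python
import math

inf = float("inf")

pos_plus_pos = inf + inf        # (+) Infinity + (+) Infinity
neg_plus_neg = (-inf) + (-inf)  # (-) Infinity + (-) Infinity
pos_plus_neg = inf + (-inf)     # (+) Infinity + (-) Infinity gives NaN

min_pos = min(inf, inf)
max_neg = max(-inf, -inf)
is_nan = math.isnan(pos_plus_neg)
```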
In one embodiment, the collective logic device 75260 does not distinguish between different NaNs. A NaN newly generated from the collective logic device 75260 may have the most significant fraction bit (the most significant mantissa bit) set, to indicate NaN.
Floating Point (FP) Min and Max
In one embodiment, an operand size in FP Min and Max operations is 64 bits. In another embodiment, an operand size in FP Min and Max operations is larger than 64 bits. The operand passes through the collective logic device 75260 without any shifting and/or normalization and thus reduces an overhead (e.g., the number of clock cycles to perform the FP Min and/or Max operations). The following describes the FP Min and Max operations according to one embodiment. Let “I” be an integer representation (i.e., integer number) of bit patterns for 63 bits other than the sign bit. Given two floating point numbers A and B,
if (Sign(A)=0 and Sign(B)=0, or both positive) then
In one embodiment, operands are 64-bit double precision floating point numbers. In another embodiment, the operands are 32-bit floating point numbers, 128-bit floating point numbers, 256-bit floating point numbers, etc. There is no reordering on injection FIFOs 75300-75305 and/or reception FIFOs 75330-75335.
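The FP Min and Max comparison on the integer representation “I” can be sketched as follows. Because the selection rule in the text is only partly stated, this sketch fills in the remaining cases with a standard property of IEEE 754 encodings as an assumption: among positive numbers a larger I means a larger value, among negative numbers a larger I means a smaller value, and with mixed signs the positive operand is larger. The function names are hypothetical and NaN operands are not handled.

```python
import struct

I_MASK = 0x7FFFFFFFFFFFFFFF  # the 63 bits other than the sign bit


def bits_of(x):
    """Return the 64-bit pattern of a double as an integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def fp_max(a, b):
    """Pick the larger double using only sign bits and integer comparisons."""
    bits_a, bits_b = bits_of(a), bits_of(b)
    sign_a, sign_b = bits_a >> 63, bits_b >> 63
    i_a, i_b = bits_a & I_MASK, bits_b & I_MASK
    if sign_a == 0 and sign_b == 0:   # both positive: larger I wins
        return a if i_a >= i_b else b
    if sign_a == 1 and sign_b == 1:   # both negative: smaller I wins
        return a if i_a <= i_b else b
    return a if sign_a == 0 else b    # mixed signs: the positive operand wins


def fp_min(a, b):
    """Pick the smaller double as the operand that fp_max does not select."""
    return b if fp_max(a, b) is a else a


largest = fp_max(-1.5, -2.25)
smallest = fp_min(-3.0, 0.5)
```

No shifting or normalization is needed in this comparison, which is consistent with the operands passing through unchanged.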
In one embodiment, when a first half of the 64-bit floating point number is received, the exponent field of the floating point number is sent to the FP exponent max unit 75220 to get the maximum exponent for some or all the floating point numbers contributing to an addition of these floating point numbers. The maximum exponent is then used to convert each 64-bit floating point number to a 64-bit integer number. The mantissa field of each floating point number has a precision of 53 bits, in the form of 1.x for regular numbers, and 0.x for denormalized numbers. The converted integer numbers reserve 5 most significant bits, i.e., 1 bit for a sign bit and 4 bits for guarding against overflow with up to 12 numbers being added together. The 53-bit mantissa is converted into a 64-bit number in the following way. The left most 5 bits are zeros. The next bit is one if the floating point number is normalized and it is zero if the floating point number is denormalized. Next, the 52 stored bits of the mantissa field are appended and then 6 zeroes are appended. Finally, the 64-bit number is right-shifted by Emax-E, where Emax is the maximum exponent and E is a current exponent value of the 59-bit number. E is never greater than Emax, and so Emax-E is zero or positive. After this conversion, if the sign bit retained from the 64-bit floating point number is set, then the shifted number (“N”) is converted to 2's complement format (“N_new”), e.g., by N_new=(not N)+1, where “not N” may be implemented by a bitwise inverter. A resulting number (e.g., N_new or N) is then sent to the ALU tree 75230 with a least significant 32-bit word first. In a further embodiment, there are additional extra control bits to identify special conditions. In one embodiment, each control bit is binary. For example, if the NaN bit is 0, then it is not a NaN, and if it is 1, then it is a NaN. There are control bits for +Infinity and −Infinity as well.
The resulting numbers are added as signed integers with operand sizes of 64 bits, taking into account the control bits for Infinity and NaN. A result of the addition is renormalized to a regular floating point format as follows: (1) if a sign bit is set (i.e., negative sum), convert the result back from 2's complement format using, e.g., K_new=not (K−1), where K_new is the converted result and K is the result before the converting; (2) then, right or left shift K or K_new until the left-most ‘1’ bit of the final integer sum (i.e., an integer output of the ALU 75230) is in the 12th bit position from the left of the integer sum. This ‘1’ will be a “hidden” bit in the second floating point number (i.e., a final output of adding of floating point numbers). If the second floating point number is a denormalized number, shift right the second floating point number until the left-most ‘1’ is in the 13th position, and then shift to the right again, e.g., by the value of the maximum exponent. The resultant exponent is calculated as Emax+the amount it was right-shifted−6, for normalized floating point results. For denormalized floating point results, the exponent is set to the value according to the IEEE specification. A result of this renormalization is then sent on with most significant 64-bit word to computing nodes as a final result of the floating point addition.
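The conversion, integer addition and renormalization described above can be sketched in software. The following Python is a minimal illustration under stated assumptions, not the hardware implementation: the function names are hypothetical, Python's unbounded integers stand in for the 64-bit ALU words (so the 2's complement step is implicit), the NaN and Infinity control bits are omitted, and the final renormalization is delegated to `math.ldexp` rather than the explicit shift procedure. The sketch places the hidden bit and the 52 stored fraction bits of an IEEE 754 double between the 5 reserved bits and the 6 trailing zeroes.

```python
import math
import struct

FRACTION_BITS = 52   # stored fraction bits of an IEEE 754 double
GUARD_ZEROS = 6      # zeroes appended below the mantissa
BIAS = 1023          # exponent bias of an IEEE 754 double


def fields(x):
    """Return (sign, biased exponent, fraction) of a finite double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> FRACTION_BITS) & 0x7FF, bits & ((1 << FRACTION_BITS) - 1)


def effective_exponent(x):
    """Denormalized numbers share the exponent of the smallest normal number."""
    _, exp, _ = fields(x)
    return exp if exp != 0 else 1


def to_aligned_int(x, emax):
    """Convert x to a signed integer aligned to the common exponent emax."""
    sign, exp, frac = fields(x)
    hidden = 1 if exp != 0 else 0              # 1.x if normalized, 0.x if denormalized
    n = ((hidden << FRACTION_BITS) | frac) << GUARD_ZEROS
    n >>= emax - effective_exponent(x)         # right shift by Emax - E
    return -n if sign else n                   # negation plays the role of 2's complement


def reproducible_sum(values):
    """Add finite doubles as aligned integers, then renormalize once."""
    emax = max(effective_exponent(v) for v in values)
    total = sum(to_aligned_int(v, emax) for v in values)   # integer addition is order independent
    return math.ldexp(total, emax - BIAS - FRACTION_BITS - GUARD_ZEROS)
```

Because the additions are performed on integers, `reproducible_sum([1e16, 1.0, -1e16])` gives the same value for every ordering of the inputs, whereas ordinary floating point addition of these three values depends on the order in which they are added.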
Global Clock
There are a wide variety of inter-chip and intra-chip clock frequencies required for BG/Q. The processor frequency is 1.6 GHz and portions of the chip run at fractions of this speed, e.g., /2, /4, /8, or /16 of this clock. The high speed communication in BG/Q is accomplished by sending and receiving data between ASICs at 4 Gb/s, or 2.5 times the target processor frequency of 1.6 GHz. All signaling between BG/Q ASICs is based on IBM Micro Electronic Division (IMD) High Speed I/O which accepts an input clock at ⅛ the datarate, or 500 MHz. The optical communication is at 8 Gb/s but due to the need for DC balancing of the currents, this interface is 8b-10b encoded and runs at 10 Gb/s with an interface of 1 GB/s. The memory system is based on SDRAM-DDR3 at 1.333 Gb/s (667 MHz address frequency).
These frequencies are generated on the BQC chip through Phase Locked Loops. The PLLs are driven from a single global 100 MHz clock.
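The relationships among the link rates quoted above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the variable names are hypothetical, and all values are taken in MHz or Mb/s so that the arithmetic stays in integers.

```python
PROCESSOR_MHZ = 1600                      # 1.6 GHz processor clock

# Electrical links run at 2.5 times the processor frequency.
link_mbps = PROCESSOR_MHZ * 5 // 2        # 4000 Mb/s = 4 Gb/s

# The high speed I/O accepts an input clock at 1/8 the data rate.
hss_input_clock_mhz = link_mbps // 8      # 500 MHz

# 8b-10b encoding sends 10 line bits for every 8 payload bits.
optical_payload_mbps = 8000               # 8 Gb/s of payload
optical_line_mbps = optical_payload_mbps * 10 // 8   # 10 Gb/s on the fiber

# 8 Gb/s of payload corresponds to 1 GB/s at 8 bits per byte.
optical_payload_mbytes_per_s = optical_payload_mbps // 8

print(link_mbps, hss_input_clock_mhz, optical_line_mbps, optical_payload_mbytes_per_s)
```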
The BG/P clock network uses over 10,000 1→10 PECL clock redrive buffers to distribute the signal derived from a single source to up to 36 racks or beyond. There are 7 layers to the clock tree. The first 3 layers exist on the 1→10 clock fanout cards on each rack, connected with differential cables of at most 5 m. The next 4 layers exist on the service and node or I/O boards themselves. For a 96-rack BG/Q system, IBM has designed an 8-layer LVPECL clock redrive tree with slightly longer rack-to-rack cables. The service card contains circuitry to drop a clock pulse, with the number of clocks to be dropped and the spacing between dropped clocks variable. Glitch detection circuitry in BQC detects these clock glitches and uses them for tight synchronization.
Modern processing systems have clock frequencies in a multi-GHz range, so communication paths between processors necessarily involve multiple clock cycles. Additionally, the clock frequencies in modern multiprocessor systems are not all exactly equal, as they are typically derived from multiple local oscillators that are each directly used by only a small fraction of the processors in the multiprocessor systems. Having all processors utilize the same clock may require that all modules in the system receive a single global clock signal, thereby requiring a global clock network. Both the lack of a global clock signal and the complexity of synchronizing chips when communication distances between chips span many cycles may leave modern systems unable to synchronize exactly.
Thus, in a further aspect, there is provided a system, method and computer program product for synchronizing a plurality of processors in a parallel computing system.
That is, in one aspect, there is a method, a system and a computer program product by which a global clock network can be enhanced along with innovative circuits inside receiving devices to enable global clock synchronization. By achieving the global clock synchronization, the multiprocessor system may enable exact reproducibility of processing of instructions. Thus, this global clock synchronization may assist to accurately reproduce processing results in a system-wide debugging mechanism.
This disclosure describes a method, system and a computer program product to generate and/or detect a global clock signal having a pulse width modification in one or more selected clock period(s). In the present disclosure, a global clock signal can be used as an absolute phase reference signal (i.e., a reference signal for a phase correction of a clock signal) as well as a clock signal to synchronize processors in the parallel computing system. A global clock signal can be used for a synchronized system with a resetting capability, network synchronization, pacing of parallel calculations and power management in a parallel computing system. This disclosure describes a clock signal with modulated clock pulse width used for a global synchronization signal. This disclosure also describes a method, system and a computer program product for generating a global synchronization signal (e.g., a signal 9545 in
At step 9620 in
At step 9630 in
In one embodiment, a user configures the hardware module, e.g., through a hardware console (e.g., JTAG) by loading code written in a hardware description language (e.g., VHDL, Verilog, etc.). The hardware module 9120 may include, but is not limited to: a logical exclusive OR gate for narrowing a pulse width within a clock period in the third clock signal, a logical OR gate for widening a pulse width within a clock period in the third clock signal, and/or another logical exclusive OR gate for removing a pulse within a clock period within the second clock signal. The hardware module 9120 may also include a counter device to divide clock signal frequency and to determine a specific clock cycle to perform a pulse width modification.
a illustrates an example of removing a pulse within a clock period in a clock signal. In this example, the clock divider and splitter 9115 receives a 200 MHz first clock signal (9200) from the clock synthesizer 9110 and outputs a 100 MHz second clock signal (9205) to the hardware module 9120. The hardware module 9120 generates a pulse (9210), e.g., by counting the number of rising edges in the 100 MHz second clock signal (9205) and generating a pulse when the counting reaches a certain number (e.g., a determined number two). The pulse shown at 9210, also referred to as a gate pulse is used to determine which clock period in the 100 MHz second clock signal (9205) is going to be modified. In this example, there is a pulse (9210) at a location (9280) corresponding to the second pulse (9275) in the 100 MHz second clock signal (9205). The location (9280) of this pulse (9210) corresponds to the second pulse (9275) in the 100 MHz second clock signal (9205). Thus, it is determined that the second pulse (9275) is to be modified as shown at
b illustrates an example of narrowing a pulse width within a clock period in the third clock signal. In this example, the clock divider and splitter 9115 receives a 400 MHz first clock signal (9220) from the clock synthesizer 9110 and outputs a 200 MHz second clock signal (9225) to the hardware module 9120. The hardware module 9120 generates a pulse (9230), e.g., by counting the number of rising edges in the 200 MHz second clock signal (9225) and generating a pulse when the counting reaches a certain number (e.g., a determined number 2). The hardware module 9120 also divides the clock frequency of the 200 MHz second clock signal (9225) to generate a 100 MHz third clock signal (9240). The pulse shown at 9230, also referred to as a gate pulse, is used to determine which clock period in the 100 MHz third clock signal (9240) is going to be modified. In this example, the pulse (9230) is at a location (9285) corresponding to the second pulse (9290) in the 100 MHz third clock signal (9240). Thus, it is determined that the second pulse (9290) is to be modified as shown at
To widen a clock pulse in a clock signal, after generating the pulse (9230), the hardware module 9120 may shift the pulse (9230), e.g., shift left or right the pulse (9230) by a fraction of a clock cycle such as a quarter or half cycle of the 100 MHz third clock signal (9240) and perform a logical OR operation between the shifted pulse and the 100 MHz third clock signal (9240) to generate a pulse width modified clock signal.
c illustrates an example of widening a pulse width within a clock period in the third clock signal. In this example, the clock divider and splitter 9115 receives a 400 MHz first clock signal (9250) from the clock synthesizer 9110 and outputs a 200 MHz second clock signal (9255) to the hardware module 9120. The hardware module 9120 generates a pulse (9260), e.g., by counting the number of rising edges in the 200 MHz second clock signal (9255) and generating a pulse when the counting reaches a certain number (e.g., a determined number 2). The hardware module 9120 also divides the clock frequency of the 200 MHz second clock signal (9255) to generate a 100 MHz third clock signal (9265). The pulse shown at 9260, also referred to as a gate pulse, is used to determine which clock period in the 100 MHz third clock signal (9265) is going to be modified. In this example, the pulse (9260) is at a location (9292) corresponding to the second pulse (9294) in the 100 MHz third clock signal (9265). Thus, it is determined that the second pulse (9294) is to be modified as shown at
Referring again to
There may be diverse methods to modify clock pulse width. In one embodiment, a clock generation circuit (e.g., the circuit 9100 shown in
For example, if the hardware module 9120 includes a decrementing counter device and a logical OR gate, by decrementing a value of the counter device from 3 to 0 on every falling edge of the first clock signal 9250 (e.g., 400 MHz clock signal), the hardware module 9120 generates a second clock signal 9255 (e.g., 200 MHz clock signal) and a third clock signal 9265 (e.g., 100 MHz clock signal) as shown in
Referring to
A choice of which edge to preserve (i.e., rising edge sensitive or falling edge sensitive) is independent of a choice of narrowing, removing or widening a clock pulse within a clock period in a clock signal.
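The gate-pulse technique for removing, narrowing and widening a pulse can be illustrated with a small waveform model. The following Python is a sketch under stated assumptions, not the circuit itself: each clock period is represented by 8 samples, the function names are hypothetical, all three operations are applied to a single clock waveform for brevity, and the gate pulse positions are chosen so that narrowing and widening leave every rising edge in place.

```python
SAMPLES = 8  # samples per clock period (illustrative resolution)


def clock(periods):
    """50% duty cycle clock: high for the first half of each period."""
    half = SAMPLES // 2
    return ([1] * half + [0] * half) * periods


def gate(periods, target, start, length):
    """Gate pulse inside period `target`, `length` samples wide from offset `start`."""
    g = [0] * (periods * SAMPLES)
    base = target * SAMPLES + start
    g[base:base + length] = [1] * length
    return g


def widen(clk, g):
    """Logical OR with a gate pulse that follows the high phase."""
    return [c | p for c, p in zip(clk, g)]


def narrow_or_remove(clk, g):
    """Logical XOR with a gate pulse that overlaps part or all of the high phase."""
    return [c ^ p for c, p in zip(clk, g)]


def rising_edges(sig):
    """Sample indices at which the signal goes from 0 to 1."""
    return [i for i in range(1, len(sig)) if sig[i - 1] == 0 and sig[i] == 1]


clk = clock(4)
widened = widen(clk, gate(4, 1, 4, 2))              # second pulse: 4 -> 6 samples high
narrowed = narrow_or_remove(clk, gate(4, 1, 2, 2))  # second pulse: 4 -> 2 samples high
removed = narrow_or_remove(clk, gate(4, 1, 0, 4))   # second pulse removed entirely
```

In this model the widened and narrowed signals have the same rising edges as the unmodified clock, which is the property a rising edge sensitive receiver relies on; only the removed-pulse variant loses an edge.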
Upon receiving the pulse width modified clock signal 9145, the input buffer 9500 (e.g., a plurality of inverters) strengthens the pulse width modified clock signal, e.g., by increasing magnitude of the pulse width modified clock signal 9145. The input buffer 9500 provides the strengthened clock signal to the PLL or DLL or the like 9505 and to the latches 9555. The PLL or DLL 9505 filters the strengthened clock signal and increases a clock frequency of the filtered clock signal (e.g., generates a clock signal which is 8 times or 16 times faster than the pulse width modified clock signal 9145). The PLL and/or DLL and/or the latches 9555 may be used for oversampling according to any other sampling rate. The PLL or DLL or the like 9505 provides the filtered clock signal having the increased clock frequency to the latches 9555 and the flip flop 9510 for their clocking signals. The latches 9555 also receive the strengthened clock signal from the input buffer 9500, detect a clock pulse having a modification in the strengthened clock signal, and generate a global synchronization signal as shown in
The latches 9555 perform this oversampling along with an oversampling frequency obtained from the PLL or DLL or the like 9505. The latches 9555 increase a sampling rate, e.g., by increasing the number of flip flops in it. The latches 9555 decrease a sampling rate, e.g., by decreasing the number of flip flops in it. For example, as shown in
In one embodiment, the detection circuit 9410 detects a widened clock pulse, e.g., as the latches 9555 receive “1”s which are extended to, for example, an extra quarter clock cycle. In other words, if the latches 9555 receive more “1”s than “0”s within a clock period, the detection circuit 9410 detects a widened clock pulse. In one embodiment, the detection circuit 9410 detects a narrowed clock pulse, e.g., as the latches 9555 receive “0”s which are extended to, for example, an extra quarter clock cycle. In other words, if the latches 9555 receive more “0”s than “1”s within a clock period, the detection circuit 9410 detects a narrowed clock pulse.
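The majority rule described above can be sketched as follows. This is an illustrative model only: it assumes 8 times oversampling, a sample stream that is already aligned to clock period boundaries, and hypothetical function names.

```python
def classify_period(samples):
    """Classify one oversampled clock period by comparing the counts of 1s and 0s."""
    ones = sum(samples)
    zeros = len(samples) - ones
    if ones > zeros:
        return "widened"
    if zeros > ones:
        return "narrowed"
    return "unmodified"


def detect(stream, oversample=8):
    """Split the sample stream into clock periods and classify each one."""
    return [classify_period(stream[i:i + oversample])
            for i in range(0, len(stream), oversample)]


def gsync(stream, oversample=8):
    """One flag per clock period: True where a modified pulse was detected."""
    return [kind != "unmodified" for kind in detect(stream, oversample)]


normal = [1, 1, 1, 1, 0, 0, 0, 0]
wide = [1, 1, 1, 1, 1, 1, 0, 0]     # high phase extended by a quarter cycle
narrow = [1, 1, 0, 0, 0, 0, 0, 0]   # low phase extended by a quarter cycle
print(detect(normal + wide + normal + narrow))
```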
In one embodiment, a parallel computing system is implemented in a semiconductor chip (not shown) that includes a plurality of processors. There is at least one clock generation circuit 9100 and at least one detection circuit 9410 in the chip. These processors detect a pulse width modified clock signal, e.g., via the detection circuit 9410.
Returning to
The counter 9420 delays a response to the aligned global synchronization signal, e.g., by forwarding the aligned global synchronization signal to processors when a value of the counter becomes a zero or a threshold value. In one embodiment, the counter 9420 can be programmed in a different or same way across semiconductor chips implementing parallel computing systems. The processor(s) controls the logic 9415 and/or the counter 9420. In one embodiment, a pulse width modification occurs repetitively. The global synchronization signal 9545 comes into the counter 9420 at a regular rate. By programming the counter 9420 that decrements or increments on every pulse on the global synchronization signal 9545, issuing an interrupt signal 9425 or the like to processors can be delayed until a value of the counter 9420 reaches zero or a threshold value. In other words, an action (e.g., interrupt 9425) to processors can be delayed for a predetermined time period, e.g., by configuring the value of the counter 9420.
In one embodiment, if a control (e.g., an instruction) from a processor writes a number “N” into the counter 9420, the counter 9420 may start decrementing on a receipt of every subsequent global synchronization signal. Once the counter 9420 expires (i.e. has decremented to 0), the counter 9420 generates a counter expiration signal 9435, that a subsequent logic can use for whatever purpose. For example, a purpose of the counter expiration signal is to trigger for a series of subsequent counters that provide a sequence for waking up the chip (i.e., a semiconductor chip having a plurality of processors) from a reset state.
The following describes an exemplary protocol that can be applied in
0. All semiconductor chips in a partition start with having a gsync interrupt masked (i.e. incoming gsync signals are ignored).
1. A single semiconductor chip in the partition (which can span from a single chip to all chips in a machine, e.g., IBM® Blue Gene L/P/Q) takes a lead role. This single semiconductor chip is referred to herein as a “director” chip.
2. Software on the director chip clears any pending gsync interrupt state (i.e., a state caused by the gsync interrupt) and then unmasks the gsync interrupt.
3. A next incoming gsync signal may thus trigger a gsync interrupt.
4. After taking this interrupt, the director chip waits for an appropriate delay and then communicates to all semiconductor chips in the partition to take the next gsync interrupt.
5. All semiconductor chips (including the director chip) clear any pending gsync interrupt and then unmask the gsync interrupt.
6. A next incoming gsync signal may thus trigger a gsync interrupt on all the chips.
7. All the chips wait an appropriate delay and then write the counter 9420 with a suitable number “N.”
8. All the chips quiesce and go into reset in order to achieve a reproducible state.
9. If necessary, an external control system can even step in and take a step to achieve the reproducible state.
10. Upon an expiration of the counter 9420, i.e., when a value of the counter 9420 becomes zero, all the chips start a deterministic wake-up sequence that is run synchronously.
All the chips may therefore be in a deterministic phase relationship with each other.
The “appropriate delay” in step 4 is intended to overcome jitter that is incurred between semiconductor chips in the machine. This delay represents an uncertainty in timing due to a chip-to-chip communication having a different distribution path from a (global) oscillating signal distribution path to each semiconductor chip.
If a gsync signal occurs with a period, for example, on a millisecond scale, and a corresponding jitter band across the machine (e.g., the worst uncertainty case in a gsync signal distribution+the worst latency case of a chip-to-chip communication) is, for example, tens of microseconds, then it is sufficient for the director chip to wait, e.g., 100 microseconds after its gsync signal from step 3 to ensure that all chips in the partition will safely ignore an initial noise signal, and will be ready for the chip-to-chip communication of step 4 and for step 5 before the next gsync signal (of step 6) arrives. This next gsync signal is indeed the same gsync signal for all the chips.
The “appropriate delay” in step 7 is to ensure that the counter 9420 is programmed once a current gsync signal (of step 6) is detected, so that decrementing a value of the counter 9420 starts only on a subsequent gsync signal. However, depending on an implementation of the machine, this delay in step 7 may not be necessary, i.e., can be zero.
The “suitable number N” of step 7 may safely cover the reset state of steps 8 and 9, including any time span that may need to be incurred to give the external control system an opportunity to step in.
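The role of the counter 9420 in steps 7 through 10 can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the hardware behavior: time is measured in arbitrary units, the gsync period of 1000 units and the write times are made up for illustration, and the class and function names are hypothetical. It shows that chips which program the counter at different moments within the same gsync interval still start their wake-up sequence on the same gsync signal.

```python
class GsyncCounter:
    """Down-counter that decrements on every gsync signal after it is written."""

    def __init__(self):
        self.value = None          # None means the counter is not armed

    def write(self, n):
        self.value = n

    def on_gsync(self):
        """Return True when this gsync signal makes the counter expire."""
        if self.value is None:
            return False
        self.value -= 1
        if self.value == 0:
            self.value = None
            return True
        return False


def wake_times(write_times, n, period, horizon):
    """Time at which each chip's counter expires, given when each chip wrote N."""
    counters = [GsyncCounter() for _ in write_times]
    woke = [None] * len(write_times)
    for t in range(horizon):
        for i, w in enumerate(write_times):
            if t == w:
                counters[i].write(n)          # step 7: program the counter
        if t % period == 0:                   # a gsync signal reaches every chip
            for i, c in enumerate(counters):
                if c.on_gsync():
                    woke[i] = t               # step 10: start the wake-up sequence
    return woke


# Three chips see the gsync signal of step 6 at t=1000 and write N=3 after different delays.
print(wake_times([1010, 1150, 1400], n=3, period=1000, horizon=6000))
```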
In one embodiment, the clock generation circuit 9100 preserves rising edges of the oscillating signal so that on-chip PLLs (e.g., PLL 9505 in
An embodiment as now described herein arose in the context of the multiprocessor system that is described in more detail in the co-pending applications incorporated by reference herein.
Using Reproducibility to Debug a Multiprocessor System
If a multiprocessor system offers reproducibility, then a test case can be run multiple times and exactly the same behavior will occur in each run. This also holds true when there is a bug in the hardware logic design. In other words, a test case failing due to a bug will fail in the same fashion in every run of the test case. With reproducibility, in each run it is possible to precisely stop the execution of the program and examine the state of the system. Across multiple runs, by stopping at subsequent clock cycles and extracting the state information from the multiprocessor system, chronologically exact hardware behavior can be recorded. Such a so-called event trace usually greatly aids identifying the bug in the hardware logic design which causes the test case to fail. It can also be used to debug software.
Debugging the hardware logic may require analyzing the hardware behavior over many clock cycles. In this case, many runs of the test case are required to create the desired event trace. It is thus desirable if the time and effort overhead between runs is kept to a minimum. This includes the overhead before a run and the overhead to scan out the state after the run.
Aspects Allowing a Multiprocessor System to Offer Reproducibility
Below are described a set of aspects allowing a multiprocessor system to offer reproducibility.
Deterministic System Start State
Advantageously, the multiprocessor system is configured such that reproducibility-relevant initial states are set to a fixed value. The initial state of a state machine is an example of reproducibility-relevant initial state. If the initial state of a state machine differs across two runs, then the state machine will likely act differently across the two runs. The state of a state machine is typically recorded in a register array.
Various techniques are used to minimize the amount of state data to be set between runs and thus to reduce the overhead between reproducible runs. For example, each unit on a chip can use reset to reproducibly initialize much of its state, e.g. to set its state machines. This minimizes the number of unit states that have to be set by an external host or other external agent before or after reset.
Another example is to retain the test case program code and other initially-read contents of DRAM memory between runs. In other words, the DRAM memory unit need not be reset between runs and thus only some of the contents may need to be set before each run.
The remaining state data within the multiprocessor system should be explicitly set between runs. This state can be set by an external host computer as described below. The external host computer controls the operation of the multiprocessor system. For example, in
As illustrated in
A Single System Clock
To achieve system wide reproducibility, a single system clock drives the entire multiprocessor system. Such a single system clock and its distribution to chips in the system is described on page 227 section ‘Clock Distribution’ of Coteus et al.
The single system clock has little to no negative repercussions and thus also is used to drive the system in regular operation when reproducibility is not required. In
Within the clock distribution hardware of the preferred embodiment, the drift across processor chips across runs has been found to be too small to endanger reproducibility. In
In the alternative, multiple clocks would drive different processing elements and would likely result in frequency drift that would break reproducibility. In the time of a realistic test case run, the frequencies of multiple clocks can drift over many cycles. For example, for a 1 GHz clock signal, the drift across multiple clocks must be well under 1 in a billion to not drift a cycle in a one second run.
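The drift bound in the example above follows from simple arithmetic, sketched below. The function names are hypothetical, and the 50 parts-per-million mismatch used in the second call is an illustrative assumption rather than a figure from this disclosure.

```python
def max_relative_drift(frequency_hz, run_seconds):
    """Fractional frequency mismatch that accumulates exactly one cycle over the run."""
    return 1.0 / (frequency_hz * run_seconds)


def drift_cycles(frequency_hz, relative_error, run_seconds):
    """Cycles of drift accumulated between two clocks with the given mismatch."""
    return frequency_hz * relative_error * run_seconds


# A 1 GHz clock completes 1e9 cycles in one second, so the mismatch
# must stay well under 1 part in a billion to avoid drifting a cycle.
print(max_relative_drift(1e9, 1.0))

# With an assumed 50 ppm mismatch, the clocks drift apart by roughly 50,000 cycles.
print(drift_cycles(1e9, 50e-6, 1.0))
```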
System-Wide Phase Alignment
The single system clock described above allows for a system-wide phase alignment of all reproducibility-relevant clock signals within the multiprocessor system. Each processor chip uses the single system clock to drive its phase-lock-loop units and other units creating other clock frequencies used by the processor chip. An example of such a processor chip and other units is the IBM® BlueGene® node chip with its peripheral chips, such as DRAM memory chips.
In
The clock generator 92230 can be designed such that the phases of the system clock and the derived clock frequencies are all aligned. For a similar clock generator with aligned phases, see A. A. Bright, “Creating the Blue Gene/L Supercomputer from Low Power System-on-a-Chip ASICs,” Digest of Technical Papers, 2005 IEEE International Solid-State Circuits Conference, or see FIG. 5 and associated text in http://www.research.ibm.com/journal/rd49-23.html “Blue Gene/L compute chip: Synthesis, timing, and physical design” A. A. Bright, R. A. Haring, M. B. Dombrowa, M. Ohmacht, D. Hoenicke, S. Singh, J. A. Marcella, R. F. Lembach, S. M. Douskey, M. R. Ellavsky, C. G. Zoellin, and A. Gara. The contents and disclosure of both articles are incorporated by reference as if fully set forth herein.
This alignment ensures that across runs there is the same phase relationship across clocks. This alignment across clocks thus enables reproducibility in a multiprocessor system.
With such a fixed phase relationship across runs, an action of a subunit running on its clock occurs at a fixed time across runs as seen by any other clock. Thus with such a fixed phase relationship across runs, the interaction of subunits under different clocks is the same across runs. For example, assume that clock generator 92230 drives subunit 92263 with 100 MHz and subunit 92264 with 200 MHz. Since clock generator 92230 aligns the 100 MHz and 200 MHz clocks, the interaction of subunit 92263 with subunit 92264 is the same across runs. If the interaction of the two subunits is the same across runs, the actions of each subunit can be the same across runs.
A more detailed system-wide phase alignment is described below in section ‘1.2.4 System-wide synchronization events.’
System-Wide Synchronization Events
The single system clock described above can carry synchronization events. In
The external host computer 92180 controls the operation of the multiprocessor system 92100. The external host computer 92180 uses a synchronization event to initiate the reset phase of the processor chips 92201, 92202, 92203, 92204 in the multiprocessor system 92100.
As described above, within a processor chip, the phases of the clocks are aligned. Thus like any other event on the system clock, the synchronization event occurs at a fixed time across runs with respect to any other clock. The synchronization event thus synchronizes all units in the multiprocessor system, whether they are driven by the system clock or by clocks derived from clock generator 92230.
The benefit of the above method can be understood by examining a less desirable alternative method. In the alternative, there is a separate network fanning out the reset to all chips in the system. If the clock and reset are on separate networks, then across runs the reset arrival times can be skewed and thus destroy reproducibility. For example, on a first run, reset might arrive 23 cycles earlier on one node than another. In a rerun, the difference might be 22 cycles.
The method of this disclosure as used in BG/Q is described below. Particular frequency values are stated, but the technique is not limited to those and can be generalized to other frequency values and other ratios between frequencies as a matter of design choice.
The single system clock source 92110 provides a 100 MHz signal, which is passed on by the synchronization event generator 92120. On the processor chip 92201, 33 MHz is the greatest common divisor of all on-chip clock frequencies, including the incoming 100 MHz system clock, the 1600 MHz processor cores and the 1633 MHz external DRAM chips. In
Per the above-mentioned ‘GLOBAL SYNC . . . ’ co-pending application on the synchronization event generator 92120, the incoming 100 MHz system clock is internally divided-by-3 to 33 MHz and a fixed 33 MHz rising edge is selected from among 3 possible 100 MHz clock edges. The synchronization event generator 92120 generates synchronization events at a period that is a (large) multiple of the 33 MHz period. The large period between synchronization events ensures that at any moment there is at most one synchronization event in the entire system. Each synchronization event is a pulse width modulation of the outgoing 100 MHz system clock from the synchronization event generator 92120.
On the processor chip 92201, the incoming 100 MHz system clock is divided-by-3 to an on-chip 33 MHz clock signal. This on-chip 33 MHz signal is aligned to the incoming synchronization events which are at a period that is a (large) multiple of the 33 MHz period. Thus there is a system wide phase alignment across all chips for the 33 MHz clock on each chip. On the processor chip 92201, all clocks are aligned to the on-chip 33 MHz rising edge. Thus there is a system wide phase alignment across all chips for all clocks on each chip.
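The effect of aligning the divide-by-3 clock to a synchronization event can be sketched with a toy model. The model below is an assumption made for illustration: each chip's divider is represented as a modulo-3 phase counter advanced once per 100 MHz cycle, and a synchronization event is modeled as forcing the phase to zero. The function name and the tick numbers are hypothetical.

```python
def divider_phase_after(initial_phase, ticks, sync_tick=None):
    """Phase (0, 1 or 2) of a divide-by-3 counter after `ticks` system clock cycles.

    The counter advances once per 100 MHz cycle. On the cycle carrying the
    synchronization event, it is forced to phase 0 instead of advancing.
    """
    phase = initial_phase
    for t in range(ticks):
        if t == sync_tick:
            phase = 0
        else:
            phase = (phase + 1) % 3
    return phase


# Three chips power up with arbitrary, different divider phases.
initial_phases = [0, 1, 2]

# Without a synchronization event, the three 33 MHz clocks stay out of phase.
print([divider_phase_after(p, 10) for p in initial_phases])

# With the same event seen by all chips on cycle 4, the phases coincide afterwards.
print([divider_phase_after(p, 10, sync_tick=4) for p in initial_phases])
```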
An application run involves a number of configuration steps. A reproducible application run may require one or more system-wide synchronization events for some of these steps. For example, on the processor chip 92201, the configuration steps: e.g. clock start, reset, and thread start, can each occur synchronized to an incoming synchronization event. Each step is thus synchronized and thus reproducible across all processor chips 92201-92204. On each processor chip, there is an option to delay a configuration step by a programmable number of synchronization events. This allows a configuration step to complete on different processor chips at different times. The delay is chosen to be longer than the longest time required on any of the chips for that configuration step. After the configuration step, due to the delay, between any pair of chips, there is the same fixed phase difference across runs. The exact phase difference value is typically not of much interest and typically differs across different pairs of chips.
Reproducibility of Component Execution
On each chip, each component or unit or subunit has a reproducible execution. As known to anyone skilled in the art, this reproducibility depends upon various aspects. Examples of such aspects include:
Advantageously, to achieve reproducibility, within the multiprocessor system the interfaces across chips will be deterministic. In the multiprocessor system 92100 of
These include the following alternatives. A given interface uses one of these or another alternative to achieve a deterministic interface. On a chip with multiple interfaces, each interface could use a different alternative:
Communication with the multiprocessor system is designed to not break reproducibility. For example, all program input is stored within the multiprocessor system before the run. Such input is part of the deterministic start state described above. For example, output from a processor chip, such as printf( ) uses a message queue, such as described in http://en.wikipedia.org/wiki/Message_queue, also known as a “mailbox,” which can be read by an outside system without impacting the processor chip operation in any way. In
Precise Stopping of System State
One enabler of reproducible execution is the ability to precisely stop selected clocks. The precise stopping of the clocks may be designed into the chips and the multiprocessor system to accomplish this. As illustrated in
Selected clocks are not stopped. For example, as described in section ‘1.2.6 Deterministic Chip Interfaces’, some subunits continue to run and are not reset across runs. As described in section ‘1.2.9 Scanning of system state’, a unit is stopped in order to scan out its state. The clocks chosen to not be stopped are clocks that do not disturb the state of the units to be scanned. For example, the clocks to a DRAM peripheral chip do not change the values stored in the DRAM memory.
This technique of using a clock stop timer 93240 may be empirical. For example, when a run initially fails on some node, the timer can be examined for the current value C. If the failing condition is assumed to have happened within the last N cycles, then the desired event trace is from cycle C−N to cycle C. So on the first re-run, the clock stop timer is set to the value C−N, and the state at cycle C−N is captured. On the next re-run, the clock stop timer is set to the value C−N+1, and the state at cycle C−N+1 can be captured. And so on, until the state is captured from cycle C−N to cycle C.
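The re-run procedure above can be sketched as a loop over stop times. The following Python is a minimal illustration, not the actual tooling: a 16-bit shift register with a fixed start state stands in for the reproducible multiprocessor system, and the function names are hypothetical.

```python
def step(state):
    """Advance the stand-in system by one clock cycle (a 16-bit LFSR)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def run_until(stop_cycle, start_state=1):
    """One reproducible run: start from the fixed state and stop the clocks at stop_cycle."""
    state = start_state
    for _ in range(stop_cycle):
        state = step(state)
    return state                      # the state that would be scanned out


def event_trace(c, n):
    """Re-run once per stop time from C-N to C and append each scanned state."""
    trace = []
    for stop in range(c - n, c + 1):  # the clock stop timer is incremented per re-run
        trace.append((stop, run_until(stop)))
    return trace


for cycle, state in event_trace(c=100, n=5):
    print(cycle, hex(state))
```

Because every run starts from the same state and behaves identically, the states captured in separate runs line up as if they had been recorded cycle by cycle in a single run.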
Scanning of System State
After the clocks are stopped, as described above, the state of interest in the chip is advantageously extractable. An external host computer can scan out the state of latches, arrays and other storage elements in the multiprocessor system.
This is done using the same machinery described in section 1.2.1 which allows an external host computer to set the deterministic system start state before the beginning of the run. As illustrated in
Recording the Chronologically Exact Hardware Behavior
If a multiprocessor system offers reproducibility then a test case can be run multiple times and exactly the same behavior will occur in each run. This also holds true when there is a bug in the hardware logic design. In other words, a test case failing due to a bug will fail in the same fashion in every run of the test case. With reproducibility, in each run it is possible to precisely stop the execution of the program and examine the state of the system. Across multiple runs, by stopping at subsequent clock cycles and extracting the state information, the chronologically exact hardware behavior can be recorded. Such a so-called event trace typically makes it easy to identify the bug in the hardware logic design which is causing the test case to fail.
At 92901, a stop timer is set. At 92902, a reproducible application is started (using infrastructure from the “Global Sync” application cited above). At 92903, each segment of the reproducible application, which may include code on a plurality of processors, is run until it reaches the pre-set stop time. At 92904, the chip state is extracted responsive to a scan of many parallel components. At 92905, a list of stored values of stop times is checked. If there are unused stop times in the list, then the stop timer should be incremented at 92906 in components of the system and control returns to 92902.
When there are no more stored stop times, extracted system states are reviewable at 92907.
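The flow of steps 92901 to 92907 may be sketched as follows. This is an illustrative model only: the function run_until is a hypothetical stand-in for a reproducible run followed by a scan, and the toy state machine inside it is an assumption, not the actual chip state.

```python
def run_until(stop_cycle, seed=1):
    """Hypothetical stand-in for one reproducible run that is stopped and scanned.

    A deterministic toy state machine is advanced for stop_cycle cycles and
    its state is returned as the 'scanned out' snapshot.
    """
    state = {"counter": 0, "lfsr": seed}
    for _ in range(stop_cycle):
        state["counter"] += 1
        # Any deterministic update rule would do; this one is arbitrary.
        state["lfsr"] = (state["lfsr"] * 5 + 3) % 256
    return dict(state)


def record_event_trace(stop_times):
    """Re-run once per stored stop time and append each snapshot to the trace."""
    trace = []
    for stop in stop_times:             # 92905/92906: take the next stop time
        snapshot = run_until(stop)      # 92902-92904: start, run, stop, scan
        trace.append((stop, snapshot))
    return trace                        # 92907: extracted states are reviewable


trace = record_event_trace(range(3, 6))
```

Because every run starts from the same state and is deterministic, repeating the procedure yields an identical trace, which is the property the technique relies on.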
Roughly speaking, the multiprocessor system is composed of many thousands of state machines. A snapshot of these state machines can be MBytes or GBytes in size. Each bit in the snapshot basically says whether a transistor is 0 or 1 in that cycle. Some of the state machines may have bits that do not matter for the rest of the system. At least in a particular run, such bits might not be reproduced. Nevertheless, the snapshot can be considered “exact” for the purpose of reproducibility of the system.
The above technique may be pragmatic. For example, a MByte or GByte event trace may be conveniently stored on a disk or other mass storage of the external host computer 92180. The use of mass storage allows the event trace to include many cycles, and the external host computer can be programmed to record only a selected subset of the states of the multiprocessor system 92100.
The above technique can be used in a flexible fashion, responsive to the particular error situation. For instance, the technique need not require the multiprocessor system 92100 to continue execution after it has been stopped and scanned. Such continuation of execution might present implementation difficulties.
When the clock stop timer 93240 stops the clocks, all registers are stopped at the same time. This means a scan of the latch state is consistent with a single point in time, similar to the consistency in a VHDL simulation of the system. In the next run, with the clock stop timer 93240 set to the next cycle, the scanned out state of some registers will not have changed. For example, a register in a slower clock domain will not have changed value unless the slow clock happens to cross over a rising edge. The tool creating the event traces from the extracted state of each run thus simply appends the extracted state from each run into the event trace.
Referring to
Thus, the present invention increases application performance by reducing the performance cost of software blocked in a spin loop or similar blocking polling loop. In one embodiment of the invention, a processor core has four threads, but performs at most one integer instruction and one floating point instruction per processor cycle. Thus, a thread blocked in a polling loop is taking cycles from the other three threads in the core. The performance cost is especially high if the polled variable is L1-cached, since the frequency of the loop is highest. Similarly, the performance cost is high if a large number of L1-cached addresses are polled and thus take L1 space from other threads.
In the present invention, the WakeUp-assisted loop has a lower performance cost than the software polling loop. In one embodiment of the invention, the external unit is embodied as a WakeUp unit. The thread 80040 writes the base and enable mask of the address range to the WakeUp address compare (WAC) registers of the WakeUp unit. The thread then puts itself into a paused state. The WakeUp unit wakes up the thread when any of the addresses are written to. The awoken thread then reads the data value(s) of the address(es). If the exit condition is reached, the thread exits the polling loop. Otherwise, a software program again configures the WakeUp unit and the thread again goes into a paused state, continuing the process as described above. In addition to address comparisons, the WakeUp unit can wake a thread on signals provided by the message unit (MU) or by the core-to-core (c2c) signals provided by the BIC.
Polling may be accomplished by the external unit or WakeUp unit when, for example, messaging software places one or more communication threads on a memory device. The communication thread learns of new work, i.e., a detected condition or event, by polling an address, which is accomplished by the WakeUp unit. If the memory device is only running the communication thread, then the WakeUp unit will wake the paused communication thread when the condition is detected. If the memory device is running an application thread, then the WakeUp unit, via a bus interface card (BIC), will interrupt the thread and the interrupt handler will start the communication thread. A thread can be woken by any specified event or a specified time interval.
The system of the present invention thereby reduces the performance cost of a polling loop on a thread within a core having multiple threads. In addition, the system of the present invention includes the advantage of waking a thread only when a detected event or signal has occurred, so that a thread is not woken when no signal has occurred. For example, a thread may be woken up if a specified address or addresses have been written to by any of a number of threads on the chip. Thus, the exit condition of a polling loop will not be missed.
In another embodiment of the invention, the awakened thread checks whether the exit condition of the polling loop has actually occurred, because a thread may be woken even if the specified address(es) has not been written to. Reasons for such a wake-up include, for example, false sharing of the same L1 cache line, or an L2 castout due to resource pressure.
Referring to
Referring to
In one embodiment of the invention, the WakeUp unit 80210 drives the signals wake_result0-3 80212, which are negated to produce an_ac_sleep_en0-3 80214. A processor 80220 thread 80040 (
Referring to
The 1-bits written to the wake_statusX_clear MMIO address clear individual bits in wake_statusX. Similarly, the 1-bits written to the wake_statusX_set MMIO address set individual bits in wake_statusX. One use of setting status bits is verification of the software. This setting and clearing of individual status bits avoids “lost” incoming wake_source transitions across software read-modify-writes.
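The set and clear semantics described above may be illustrated by the following toy model. The class name, the 32-bit width, and the wake_source_edge method are assumptions made for illustration; only the behavior of writes to the set and clear addresses is taken from the description.

```python
class WakeStatusModel:
    """Toy model of one wake_statusX register; the 32-bit width is an assumption."""

    WIDTH_MASK = 0xFFFFFFFF

    def __init__(self):
        self.value = 0

    def write_set(self, bits):
        # A 1-bit sets the matching status bit; 0-bits leave bits unchanged.
        self.value |= bits & self.WIDTH_MASK

    def write_clear(self, bits):
        # A 1-bit clears the matching status bit; 0-bits leave bits unchanged.
        self.value &= ~bits & self.WIDTH_MASK

    def wake_source_edge(self, bit):
        # Hypothetical hardware-side update: an incoming wake_source sets its bit.
        self.value |= (1 << bit) & self.WIDTH_MASK


status = WakeStatusModel()
status.write_set(0b0101)      # software sets bits 0 and 2
status.wake_source_edge(3)    # a wake_source arrives on bit 3
status.write_clear(0b0001)    # software clears only bit 0
final_value = status.value
```

Because software never writes back a whole register value that it read earlier, the bit set by the incoming wake_source between the two software writes is preserved.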
Referring to
The DAC1 or DAC2 event occurs only if the data address matches the value in the DAC1 register, as masked by the value in the DAC2 register. That is, the DAC1 register specifies an address value, and the DAC2 register specifies an address bit mask which determines which bits of the data address participate in the comparison to the DAC1 value. For every bit set to 1 in the DAC2 register, the corresponding data address bit must match the value of the same bit position in the DAC1 register. For every bit set to 0 in the DAC2 register, the corresponding address bit comparison does not affect the result of the DAC event determination.
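The masked comparison described above may be expressed compactly as follows. The function name and the example address values are assumptions for illustration.

```python
def dac_match(data_address, dac1, dac2):
    """Return True when data_address equals dac1 on every bit where dac2 is 1."""
    # XOR marks the differing bits; the mask keeps only the bits that participate.
    return ((data_address ^ dac1) & dac2) == 0


# Hypothetical 16-bit example: compare only the upper byte, so the whole
# range 0x1000-0x10FF matches.
in_range = dac_match(0x1040, dac1=0x1000, dac2=0xFF00)
out_of_range = dac_match(0x1100, dac1=0x1000, dac2=0xFF00)

# Leaving a hole in the middle of the mask matches a block-strided set of
# addresses: 0x1000, 0x1010, 0x1020, ...
strided_hit = dac_match(0x1020, dac1=0x1000, dac2=0xFF0F)
strided_miss = dac_match(0x1021, dac1=0x1000, dac2=0xFF0F)
```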
Of the 12 WAC units, the hardware functionality for unit wac3 is illustrated in
In an example, the level-2 cache (L2) may keep, for each L2 line, a 17-bit record of which processor cores have performed a cached read on the line. On a store to the line, the L2 then sends an invalidate to each subscribed core 80222. The WakeUp unit snoops the stores by the local processor core and snoops the incoming invalidates.
The previous paragraph describes normal cached loads and stores. For the atomic L2 loads and stores, such as fetch-and-increment or store-add, the L2 sends invalidates for the corresponding normal address to the subscribed cores. The L2 also sends an invalidate to the core issuing the atomic operation, if that core was subscribed. In other words, if that core had a previous normal cached load on the address.
Thus each WakeUp WAC snoops all addresses stored to by the local processor. The unit also snoops all invalidate addresses given by the crossbar to the local processor. These invalidates and local stores are physical addresses. Thus software must translate the desired virtual address to a physical address to configure the WakeUp unit. The number of instructions taken for such address translation is typically much lower than the alternative of having the thread in a polling loop.
The WAC supports the full BGQ memory map. This allows a WAC to observe local processor loads or stores to MMIO. The local address snooped by WAC is exactly that output by the processor, which in turn is the physical address resolved by TLB within the processor. For example, WAC could implement a guard page on MMIO. In contrast to local processor stores, the incoming invalidates from L2 inherently only cover the 64 GB architected memory.
In an embodiment of the invention, the processor core allows a thread to put itself or another thread into a paused state. A thread in kernel mode puts itself into a paused state using a wait instruction or an equivalent instruction. A paused thread can be woken by a falling edge on an input signal into the processor 80220 core 80222. Each thread 0-3 has its own corresponding input signal. In order to ensure that a falling edge is not “lost”, a thread can only be put into a paused state if its input is high. A thread can only be paused by instruction execution on the core or presumably by low-level configuration ring access. The WakeUp unit wakes a thread. The processor 80220 cores 80222 wake up a paused thread to handle enabled interrupts. After interrupt handling completes, the thread will go back into a paused state, unless the subsequent paused state is overridden by the handler. Thus, interrupts are transparently handled. The WakeUp unit allows a thread to wake any other thread, which can be kernel configured such that a user thread can or cannot wake a kernel thread.
The WakeUp unit may drive the signals such that a thread of the processor 80220 will wake on a rising edge. Thus, throughout the WakeUp unit, a rising edge or value 1 indicates wake-up. The WakeUp unit may support 32 wake sources. The wake sources may comprise 12 WakeUp address compare (WAC) units, 4 wake signals from the message unit (MU), 8 wake signals from the BIC's core-to-core (c2c) signaling, 4 wake signals from GEA outputs 12-15, and 4 so-called convenience bits. These 4 bits are for software convenience and have no incoming signal. The other 28 sources can wake one or more threads. Software determines which sources wake corresponding threads.
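The wake source counts given above may be tabulated as in the following sketch. The ordering of the sources within the 32 bits and the combination of wake_status with a per-thread wake_enable mask by a bitwise AND are assumptions made for illustration and are not stated in the description.

```python
# Source names and counts follow the text; the bit ordering is an assumption.
WAKE_SOURCES = [
    ("wac", 12),
    ("mu", 4),
    ("c2c", 8),
    ("gea", 4),
    ("convenience", 4),
]


def source_bit_ranges(sources):
    """Assign consecutive bit positions to each wake source class."""
    ranges, next_bit = {}, 0
    for name, count in sources:
        ranges[name] = range(next_bit, next_bit + count)
        next_bit += count
    return ranges, next_bit


def wake_result(wake_status, wake_enable):
    """Assumed rule: a thread wakes when any enabled status bit is 1."""
    return (wake_status & wake_enable) != 0


ranges, total_sources = source_bit_ranges(WAKE_SOURCES)
signal_driven_sources = total_sources - len(ranges["convenience"])
```

The totals reproduce the figures in the text: 32 sources in all, of which 28 have an incoming signal.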
In one embodiment of the invention, a WakeUp unit includes 12 address compare (WAC) units, allowing WakeUp on any of 12 address ranges. This provides 3 WAC units per A2 hardware thread, though software is free to use the 12 WAC units differently across the 4 A2 threads. For example, one A2 thread could use all 12 WAC units. Each WAC unit has its own two registers accessible via memory mapped I/O (MMIO). One register is set by software to an address of interest. The other register is set by software to the address bits of interest and thus allows a block-strided range of addresses to be matched.
In another embodiment of the invention, data address compare (DAC) Debug Event Fields may include a DAC1 or DAC2 event that occurs only if the data address matches the value in the DAC1 register, as masked by the value in the DAC2 register. That is, the DAC1 register specifies an address value, and the DAC2 register specifies an address bit mask which determines which bits of the data address participate in the comparison to the DAC1 value. For every bit set to 1 in the DAC2 register, the corresponding data address bit must match the value of the same bit position in the DAC1 register. For every bit set to 0 in the DAC2 register, the corresponding address bit comparison does not affect the result of the DAC event determination.
In another embodiment of the invention, in contrast to an address compare, on a wake signal the WakeUp unit does not ensure that the thread wakes up after any and all corresponding memory has been invalidated in level-1 cache (L1). For example, if a packet header includes a wake bit driving a wake source, the WakeUp unit does not ensure that the thread wakes up after the corresponding packet reception area has been invalidated in cache L1. In an example solution, the woken thread performs a data-cache-block-flush (dcbf) on the relevant addresses before reading them.
In another embodiment of the invention, a message unit (MU) provides 4 signals. The MU may be a direct memory access engine, such as MU 80100, with each MU including a DMA engine and network card interface in communication with a crossbar (XBAR) switch, and chip I/O functionality. MU resources are divided into 17 groups. Each group is divided into 4 subgroups. The 4 signals into the WakeUp unit correspond to one fixed group. An A2 core must observe the other 16 network groups via the BIC. A signal is an OR of specified conditions. Each condition can be individually enabled. An OR of all subgroups is fed into the BIC, so a core serving a group other than its own must go via the BIC. The BIC provides core-to-core (c2c) signals across the 17*4=68 threads. The BIC provides 8 signals as 4 signal pairs. Any of the 68 threads can signal any other thread. Within each pair, one signal is the OR of signals from threads on core 16; if the source is needed, software interrogates the BIC to identify which thread on core 16. The other signal is the OR of signals from threads on cores 0-15; if the source is needed, software interrogates the BIC to identify which thread on which core.
In another embodiment of the invention, the WakeUp unit is used through software, for example, through library routines. Handling multiple wake sources may be managed similarly to interrupt handling and requires avoiding problems like livelock. In addition to simplifying user software, the use of library routines also has other advantages. For example, the library can provide an implementation which does not use the WakeUp unit, and thus allows the application performance gained by the WakeUp unit to be measured.
In one embodiment of the invention using interrupt handlers, assuming a user thread is paused waiting to be woken up by the WakeUp unit, the thread enters an interrupt handler which uses the WakeUp unit. In a possible software implementation, the handler at exit sets a convenience bit to subsequently wake the user thread, indicating that the WakeUp unit has been used by the system and that the user thread should poll all potential user events of interest. The software can be programmed to have either the handler or the user thread reconfigure the WakeUp unit for subsequent user use.
In another embodiment of the invention, a thread can wake another thread. Across A2 cores, techniques for one thread to wake another include core-to-core (c2c) interrupts and the use of a polled address. A write by the user thread to an address can wake a kernel thread. The address must be in user space. Across the 4 threads within an A2 core, there are at least 4 alternative techniques. Since software can write bit=1 to wake_status, the WakeUp unit allows a thread to wake one or more other threads. For this purpose, any wake_status bit can be used whose wake_source can be turned off. Alternatively, software can set the wake_status bit=1 and toggle wake_enable. This allows any bit to be used, regardless of whether its wake_source can be turned off. For the above techniques, if the wake_status bit is for kernel use only, a user thread cannot use the above method to wake the kernel thread.
Thereby, the present invention provides a wait instruction (initiating the pause state of the thread) in the processor, together with the external unit that initiates the thread to be woken (active state) upon detection of the specified condition. This prevents the thread from consuming resources needed by other threads in the processor until the pin is asserted. Thereby the present invention offloads the monitoring of computing resources, for example memory resources, from the processor to the external unit. Instead of having to poll a computing resource, a thread configures the external unit (or WakeUp unit) with the information that it is waiting for, i.e., the occurrence of a specified condition, and initiates a pause state. The thread no longer consumes processor resources while it is in the pause state. Subsequently, the external unit wakes the thread when the appropriate condition is detected. A variety of conditions can be monitored according to the present invention, including writing to memory locations, the occurrence of interrupt conditions, reception of data from I/O devices, and expiration of timers.
In another embodiment of the invention, the system 80010 and method 80100 of the present invention may be used in a supercomputer system. The supercomputer system may be expandable to a specified number of compute racks, each with predetermined compute nodes containing, for example, multiple processor cores. For example, each core may be associated with a quad-wide fused multiply-add SIMD floating point unit, producing 8 double precision operations per cycle, for a total of 128 floating point operations per cycle per compute chip. Cabled as a single system, the multiple racks can be partitioned into smaller systems by programming switch chips, which source and terminate the optical cables between midplanes.
Further, for example, each compute rack may consist of 2 sets of 512 compute nodes. Each set may be packaged around a double-sided backplane, or midplane, which supports a five-dimensional torus of size 4×4×4×4×2, which is the communication network for the compute nodes, which are packaged on 16 node boards. The torus network can be extended in 4 dimensions through link chips on the node boards, which redrive the signals optically with an architecture limit of 64 to any torus dimension. The signaling rate may be 10 Gb/s (8/10 encoded), over about 20 meter multi-mode optical cables at 850 nm. As an example, a 96-rack system is connected as a 16×16×16×12×2 torus, with the last ×2 dimension contained wholly on the midplane. For reliability reasons, small torus dimensions of 8 or less may be run as a mesh rather than a torus with minor impact to the aggregate messaging rate. One embodiment of a supercomputer platform contains four kinds of nodes: compute nodes (CN), I/O nodes (ION), login nodes (LN), and service nodes (SN).
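The node counts implied by the torus dimensions quoted above can be checked with a short calculation. The variable names are assumptions for illustration; the dimensions and the figure of two midplanes per rack are taken from the description.

```python
from math import prod

# Torus dimensions quoted in the description.
MIDPLANE_TORUS = (4, 4, 4, 4, 2)
FULL_SYSTEM_TORUS = (16, 16, 16, 12, 2)
MIDPLANES_PER_RACK = 2

nodes_per_midplane = prod(MIDPLANE_TORUS)
nodes_per_rack = MIDPLANES_PER_RACK * nodes_per_midplane
total_nodes = prod(FULL_SYSTEM_TORUS)
racks = total_nodes // nodes_per_rack
```

The midplane torus multiplies out to 512 nodes, and the 16×16×16×12×2 torus multiplies out to 98,304 nodes, which at 1,024 nodes per rack corresponds to the 96-rack example.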
The method of the present invention is generally implemented by a computer executing a sequence of program instructions for carrying out the steps of the method and may be embodied in a computer program product comprising media storing the program instructions. Although not required, the invention can be implemented via an application-programming interface (API), for use by a developer, and/or included within the network browsing software, which will be described in the general context of computer-executable instructions, such as program modules, being executed by one or more computers, such as client workstations, servers, or other devices. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations.
Other well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers (PCs), server computers, hand-held or laptop devices, multi-processor systems, microprocessor-based systems, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like, as well as a supercomputing environment. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network or other data transmission medium. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
An exemplary system for implementing the invention includes a computer with components of the computer which may include, but are not limited to, a processing unit, a system memory, and a system bus that couples various system components including the system memory to the processing unit. The system bus may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus (also known as Mezzanine bus).
The computer may include a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CDROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer.
System memory may include computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) and random access memory (RAM). A basic input/output system (BIOS), containing the basic routines that help to transfer information between elements within computer, such as during start-up, is typically stored in ROM. RAM typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit. The computer may also include other removable/non-removable, volatile/nonvolatile computer storage media.
A computer may also operate in a networked environment using logical connections to one or more remote computers, such as a remote computer. The remote computer may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer. The present invention may apply to any computer system having any number of memory or storage units, and any number of applications and processes occurring across any number of storage units or volumes. The present invention may apply to an environment with server computers and client computers deployed in a network environment, having remote or local storage. The present invention may also apply to a standalone computing device, having programming language functionality, interpretation and execution capabilities.
The present invention, or aspects of the invention, can also be embodied in a computer program product, which comprises all the respective features enabling the implementation of the methods described herein, and which—when loaded in a computer system—is able to carry out these methods. Computer program, software program, program, or software, in the present context mean any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.
In another embodiment of the invention, to avoid race conditions when using a WAC to reduce the performance cost of polling, software ensures that two conditions are met so that no invalidates are missed for any of the addresses of interest. In particular, the processor, and thus the WakeUp unit, is subscribed with the L2 slice to receive invalidates. The following pseudo-code meets the above conditions:
loop:
In alternative embodiments the present invention may be implemented in a multi-processor core SMP, like BGQ, wherein each core may be single or multi-threaded. Also, an implementation may include a single-thread node polling an IO device, wherein the polling thread can consume resources, e.g., a crossbar, used by the IO device.
In an additional aspect according to the invention, a pause unit may only know whether a desired memory location was written to. The pause unit may not know whether a desired value was written. When a false resume is possible, software has to check the condition itself. The pause unit must not miss a resume condition. For example, with the correct software discipline, the WakeUp unit guarantees that a thread will be woken up if the specified address(es) has been written to by any of the other 67 hardware threads on the chip. Such writing includes the L2 atomic operations. In other words, the exit condition of a polling loop will never be missed. For a variety of reasons, a thread may be woken even if the specified address(es) has not been written to. An example is false sharing of the same L1 cache line. Another example is an L2 castout due to resource pressure. Thus the software of an awakened thread must check whether the exit condition of the polling loop has indeed been reached.
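The discipline described above, in which a woken thread re-checks the exit condition and no wake-up is lost, may be illustrated by a software analogy. The following sketch is not the pseudo-code of the invention: it uses Python's threading.Condition as a stand-in for the hardware pause and wake mechanism, and the class name, method names, addresses and values are all assumptions.

```python
import threading


class WakeUnitModel:
    """Software stand-in for a pause unit watching one base/mask address range."""

    def __init__(self, base, mask):
        self.base, self.mask = base, mask
        self.cond = threading.Condition()
        self.memory = {}
        self.pending = False  # latched wake, so a wake before the pause is not lost

    def store(self, address, value):
        with self.cond:
            self.memory[address] = value
            # The unit only knows that a matching address was written, not the value.
            if ((address ^ self.base) & self.mask) == 0:
                self.pending = True
                self.cond.notify_all()

    def load(self, address):
        with self.cond:
            return self.memory.get(address)

    def pause_until_woken(self):
        with self.cond:
            while not self.pending:
                self.cond.wait()
            self.pending = False


def wait_for_value(unit, address, wanted):
    """Pause instead of spinning, and re-check the exit condition after every wake."""
    while True:
        if unit.load(address) == wanted:
            return wanted
        unit.pause_until_woken()


def demo():
    unit = WakeUnitModel(base=0x1000, mask=0xFF00)

    def writer():
        unit.store(0x1008, 7)    # same range, other address: a false wake
        unit.store(0x1000, 42)   # the write that satisfies the exit condition

    thread = threading.Thread(target=writer)
    thread.start()
    value = wait_for_value(unit, 0x1000, 42)
    thread.join()
    return value


result = demo()
```

The first store in the demo falls within the watched range but does not satisfy the exit condition, so the waiting thread may be woken falsely and must pause again; the latched pending flag plays the role of ensuring that a wake arriving just before the pause is not missed.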
In an alternative embodiment of the invention, a pause unit can serve multiple threads. The multiple threads may or may not be within a single processor core. This allows address-compare units and other resume condition hardware to be shared by multiple threads. Further, the threads in the present invention may include barrier and ticket-lock threads.
Also, in an embodiment of the invention, a transaction coming from the processor may be restricted to particular ttypes (memory operation types), for example, MESI shared memory protocol.
In analyzing and enhancing performance of a data processing system and the applications executing within the data processing system, it is helpful to know which software modules within a data processing system are using system resources. Effective management and enhancement of data processing systems requires knowing how and when various system resources are being used. Performance tools are used to monitor and examine a data processing system to determine resource consumption as various software applications are executing within the data processing system. For example, a performance tool may identify the most frequently executed modules and instructions in a data processing system, or may identify those modules which allocate the largest amount of memory or perform the most I/O requests. Hardware performance tools may be built into the system or added at a later point in time.
Currently, processors have minimal support for counting various instruction types executed by a program. Typically, only a single group of instructions may be counted by a processor by using the internal hardware of the processor. This is not adequate for some applications, where users want to count many different instruction types simultaneously. In addition, there are certain metrics that are used to determine application performance (counting floating point instructions, for example) that are not easily measured with current hardware. Using the floating point example, a user may need to count a variety of instructions, each having a different weight, to determine the number of floating point operations performed by the program. A scalar floating point multiply would count as one FLOP, whereas a floating point multiply-add instruction would count as 2 FLOPS. Similarly, a quad-vector floating point add would count as 4 FLOPS, while a quad-vector floating point multiply-add would count as 8 FLOPS.
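The weighting in the floating point example above may be sketched as follows. The instruction mnemonics are hypothetical labels chosen for illustration; only the weights of 1, 2, 4 and 8 come from the description.

```python
# Hypothetical mnemonics mapped to the FLOP weights given in the text.
FLOP_WEIGHTS = {
    "fmul": 1,      # scalar floating point multiply
    "fmadd": 2,     # scalar floating point multiply-add
    "qvfadd": 4,    # quad-vector floating point add
    "qvfmadd": 8,   # quad-vector floating point multiply-add
}


def count_flops(instructions):
    """Sum the weights of the executed instructions; other instructions count 0."""
    return sum(FLOP_WEIGHTS.get(op, 0) for op in instructions)


executed = ["fmul", "fmadd", "qvfadd", "qvfmadd", "add"]
total_flops = count_flops(executed)
```

A plain instruction count of this stream would report 5, whereas the weighted count reports 15 floating point operations, which is why a single unweighted counter does not suffice for this metric.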
Thus, in a further aspect of the invention, there is provided methods, systems and computer program products for measuring a performance of a program running on a processing unit of a processing system. In one embodiment, the method comprises informing a logic unit of each instruction in the program that is executed by the processing unit, assigning a weight to said each instruction, assigning the instructions to a plurality of groups, and analyzing said plurality of groups to measure one or more metrics of the program.
In one embodiment, each instruction includes an operating code portion, and the assigning includes assigning the instructions to said groups based on the operating code portions of the instructions. In an embodiment, each instruction is one type of a given number of types, and the assigning includes assigning each type of instruction to a respective one of said plurality of groups. In an embodiment, these groups may be combined into a plurality of sets of the groups.
In an embodiment of the invention, to facilitate the counting of instructions, the processor informs an external logic unit of each instruction that is executed by the processor. The external unit then assigns a weight to each instruction, and assigns it to an opcode group. The user can combine opcode groups into a larger group for accumulation into a performance counter. This assignment of instructions to opcode groups makes measurement of key program metrics transparent to the user.
As shown and described herein with respect to
As described above, each processor includes four independent hardware threads sharing a single L1 cache with sixty-four byte line size. Each memory line is stored in a particular L2 cache slice, depending on the address mapping. The sixteen L2 slices effectively comprise a single L2 cache. Those skilled in the art will recognize that the invention may be embodied in different processor configurations.
The L1P 8230 provides two prefetching schemes: a sequential prefetcher, as well as a list prefetcher. The list prefetcher tracks and records memory requests sent out by the core, and writes the sequence as a list to a predefined memory region. It can replay this list to initiate prefetches for repeated sequences of similar access patterns. The sequences do not have to be identical, as the list processing is tolerant to a limited number of additional or missing accesses. This automated learning mechanism allows a near perfect prefetch behavior for a set of important codes that show the required access behavior, as well as perfect prefetch behavior for codes that allow precomputation of the access list.
Each PU 8200 connects to a central low latency, high bandwidth crossbar switch 8240 via a master port. The central crossbar routes requests and write data from the master ports to the slave ports and read return data back to the masters. The write data path of each master and slave port is 16 B wide. The read data return port is 32 B wide.
As mentioned above, currently, processors have minimal support for counting various instruction types executed by a program. Typically, only a single group of instructions may be counted by a processor by using the internal hardware of the processor. This is not adequate for some applications, where users want to count many different instruction types simultaneously. In addition, there are certain metrics that are used to determine application performance (counting floating point instructions for example) that are not easily measured with current hardware.
Embodiments of the invention provide methods, systems and computer program products for measuring a performance of a program running on a processing unit of a processing system. In one embodiment, the method comprises informing a logic unit of each instruction in the program that is executed by the processing unit, assigning a weight to said each instruction, assigning the instructions to a plurality of groups, and analyzing said plurality of groups to measure one or more metrics of the program.
With reference to
As one specific example of the present invention,
The implementation, in an embodiment, is hardware dependent. The processor runs at two times the speed of the counter, and because of this, the counter has to process two cycles of A2 data in one counter cycle. Hence, the two OSP0/1 and the two FLOPS0/1 are used in the embodiment of
In one embodiment, the highest count that the A2 can produce is 9. This is because the maximum weight assigned to one FLOP is 8 (the highest possible weight in this embodiment), and, in this implementation, all integer instructions have a weight of 1. This totals 9 (8 flop and 1 op) per A2 cycle. When this maximum count is multiplied by two clock cycles per counting cycle, the result is a maximum count of 18 per count cycle, and as a result, the counter has to be able to add from 0-18 every counting cycle. Also, because all integer instructions have a weight of 1, a reduce (logical OR) is done in the OP path, instead of weighting logic like on the FLOP path.
Boxes 8402/8404 perform the set selection logic. They pick which groups go into the counter for adding. The weighting of the incoming groups happens in the FLOP_CNT boxes 8412/8414. In an implementation, certain groups are hard coded to certain weights (e.g. FMA gets 2, quad fma gets 8). Other group weights are user programmable (DIV/SQRT), and some groups are hard coded to a weight of 1. The reduce block on the op path functions as an OR gate because, in this implementation, all integer instructions are counted as 1, and the groups are mutually exclusive since each instruction only goes into one group. In other embodiments, this reduce box can be as simple as an OR gate, or complex, where, for example, each input group has a programmable weight.
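By way of a non-limiting illustration, the counting arithmetic described above may be modeled in software as follows. The group names, the function names, and the programmable DIV weight are assumptions made for this sketch; only the FMA weight of 2, the quad FMA weight of 8, the integer weight of 1, and the two A2 cycles per counter cycle are taken from the description.

```python
# Hard coded weights. "fadd" is a hypothetical group with weight 1;
# the FMA and quad FMA weights are the ones given in the description.
FIXED_FLOP_WEIGHTS = {"fadd": 1, "fma": 2, "quad_fma": 8}


def flop_weight(group, programmable_weights):
    """Weight of one floating point group (FLOP path).

    Groups such as DIV/SQRT are user programmable, so they are looked up
    in programmable_weights first.
    """
    if group in programmable_weights:
        return programmable_weights[group]
    return FIXED_FLOP_WEIGHTS[group]


def op_reduce(op_group_bits):
    """OP path: every integer group weighs 1 and the groups are mutually
    exclusive, so a logical OR of the selected groups is sufficient."""
    return 1 if any(op_group_bits) else 0


def a2_cycle_count(flop_group, op_group_bits, programmable_weights):
    """Count contributed by one A2 cycle (at most 8 flop + 1 op = 9)."""
    flops = flop_weight(flop_group, programmable_weights) if flop_group else 0
    return flops + op_reduce(op_group_bits)


def counter_cycle_increment(cycle0, cycle1, programmable_weights):
    """The counter runs at half the processor speed, so each counter
    cycle adds the counts of two A2 cycles (at most 18)."""
    return sum(
        a2_cycle_count(flop_group, op_bits, programmable_weights)
        for flop_group, op_bits in (cycle0, cycle1)
    )
```

In this sketch, two consecutive A2 cycles that each carry a quad FMA and one integer instruction yield the maximum increment of 18 discussed above.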
The Thread Compare boxes are gating boxes. With each instruction that is input to these boxes, the thread that is executing the instruction is recorded. A 4 bit mask vector is input to this block to select which threads to count. Incrementers 8436 and 8440 are used, in the embodiment shown in
The outputs of thread compares 8422, 8424 are applied to and counted by incrementer 8436, and the outputs of thread compares 8432, 8434 are applied to and counted by incrementer 8440. The outputs of incrementers 8436, 8440 are passed to multiplexers 8442, 8444, and the outputs of the multiplexers are applied to six bit adder 8446. The output of six bit adder 8446 is transmitted to fourteen bit adder 8450, and the output of the fourteen bit adder is transmitted to counter register 8452.
There is further provided a method and system for enhancing barrier collective synchronization in message passing interface (MPI) applications with multiple processes running on a compute node for use in a massively parallel supercomputer, wherein the compute nodes may be connected by a fast interconnection network.
In known computer systems, a message passing interface barrier (MPI barrier) is an important collective synchronization operation used in parallel applications or parallel computing. Generally, MPI is a specification for an application programming interface which enables communications between multiple computers. In a blocking barrier, the progress of the process or a thread calling the operation is blocked until all the participating processes invoke the operation. Thus, the barrier ensures that a group of threads or processes, for example in the source code, stop progress until all of the concurrently running threads (or processes) progress to reach the barrier.
A non-blocking barrier can split a blocking barrier into two phases: an initiation phase, and a waiting phase, for waiting for the barrier completion. A process can do other work in-between the phases while the barrier progresses in the background.
The collection of the processes invoking the barrier operation is embodied in MPI using a communicator. The communicator stores the necessary state information for a barrier algorithm. An application can create as many communicators as needed depending on the availability of the resources. For a given number of processes, there could be an exponential number of communicators, resulting in exponential space requirements to store the state. In this context, it is important to have an efficient space bounded algorithm to ensure scalable implementations.
For example, on an exemplary supercomputer system, a barrier operation within a node can be designed via the fetch-and-increment atomic operations. To support an arbitrary communicator, an atomic data entity needs to be associated with the communicator. As discussed above, making every communicator contain this data item leads to storage space waste. In one approach to this problem, a single global data structure element is used for all the communicators. However, as discussed in further detail below, this is inefficient as concurrent operations are serialized when a single resource is available.
In one embodiment of a supercomputer, a node can have several processes and each process can have up to four hardware threads per core. MPI allows for concurrent operations initiated by different threads. However, each of these operations needs to use a different communicator. The operations are serialized because there is only a single resource. For all the operations to progress concurrently, separate resources must be allocated to each of the communicators. This results in undesirable use of storage space.
One way of allocating counters is to allocate one counter for each communicator, as different threads can only call collectives on different communicators as per the MPI standard. Then, the counter can be immediately located based on a communicator ID. However, a drawback of this approach is inferior utilization of memory space.
There is therefore a need for a method and system to allocate counters for communicators while enhancing efficiency of utilization of memory space. Further, there is a need for a method and system to use less memory space when allocating counters. It would also be desirable for a method and system to allocate counters for each communicator using the MPI standard, while reducing memory allocation usage.
Generally, in a blocking barrier, the progress of the process or a thread calling the operation will be blocked until all the participating processes have invoked the operation. The collection of the processes invoking the barrier operation is embodied in message passing interface (MPI) using a communicator. The communicator stores the necessary state information for the barrier algorithm. The Barrier operation may use multiple processes/threads on a node. An MPI process may consist of more than one thread. In the text, the terms software driven processes and threads are used interchangeably where appropriate to explain the mechanisms referred to herein.
Fast synchronization primitives on a supercomputer, for example, IBM® Blue Gene®, via the fetch-and-increment atomic mechanism can be used to optimize the MPI barrier collective call within a node with many processes. This intra-node mechanism needs to be coupled with a network barrier for barrier across all the processes. A node can have several processes and each process can have many threads with a maximum limit, for example, of 64. For simultaneous transfers initiated by different threads, different atomic counters need to be used.
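As an illustrative sketch only, an intra-node barrier built on a fetch-and-increment counter may be modeled as follows. The class AtomicCounter is a hypothetical software stand-in for the hardware atomic mechanism (a lock emulates atomicity), and the round-based completion test is one possible design, not necessarily the one used in the embodiment.

```python
import threading
import time


class AtomicCounter:
    """Software stand-in for a hardware fetch-and-increment counter."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_and_increment(self):
        """Atomically return the old value and add one."""
        with self._lock:
            old = self._value
            self._value += 1
            return old

    def load(self):
        with self._lock:
            return self._value


def barrier(counter, num_participants):
    """Block the caller until all participants of the current round arrive.

    The counter only ever increases. The arrivals of round k hold tickets
    k*n .. (k+1)*n - 1, so round k is complete once the counter reaches
    (k + 1) * n, where n is num_participants.
    """
    ticket = counter.fetch_and_increment()
    target = (ticket // num_participants + 1) * num_participants
    while counter.load() < target:
        time.sleep(0)  # yield; hardware threads would spin or pause here
```

Because each concurrent collective needs its own counter in this scheme, the question of how counters are assigned to communicators, addressed below, follows directly.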
Referring to
Similarly, in another embodiment of the invention, the system above used for blocking communications can be extended to non-blocking communications. Instead of using a per thread resource allocation, a central pool of resources can be allocated. A master process or thread per communicator is responsible for claiming the resources from the pool and freeing the resources after their usage. The resources are allocated and freed in a safe manner as multiple concurrent communications can occur simultaneously. More specifically, as the resources are mapped to the different communications, care must be taken that no two communications get the same resource, otherwise, the operation is error prone. The process or thread participating in the resource allocation/de-allocation should use mechanisms such as locking to prevent such scenarios.
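A minimal sketch of such a lock-protected central pool is given below for illustration. The class name CounterPool, its methods, and the use of integer indices to identify counters are assumptions of this sketch and are not taken from the embodiment.

```python
import threading


class CounterPool:
    """Central pool of counter indices shared by all communicators on a node."""

    def __init__(self, size):
        self._free = list(range(size))
        self._owner = {}  # counter index -> communicator ID holding it
        self._lock = threading.Lock()

    def claim(self, communicator_id):
        """Called by the master of a communicator.

        Returns a counter index, or None when the pool is exhausted.
        The lock guarantees that no two communications get the same resource.
        """
        with self._lock:
            if not self._free:
                return None
            index = self._free.pop()
            self._owner[index] = communicator_id
            return index

    def release(self, index, communicator_id):
        """Return a counter to the pool after the communication completes."""
        with self._lock:
            if self._owner.get(index) != communicator_id:
                raise ValueError("counter is not held by this communicator")
            del self._owner[index]
            self._free.append(index)
```

Here a master that receives None would have to retry or wait; how exhaustion is handled is left open in this sketch.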
For a very large number of communicators, allocating one counter per communicator will pose severe scalability issues. Using such a large number of counters wastes memory space, especially in a computer system that has limited memory per thread.
For blocking communications, one counter per thread is needed in a process, as that is the maximum number of active collective operations via MPI. In the present invention, the system 81010 includes a mechanism where each communicator 81050 designates a master core 81026 in the multi-processor environment. In the system 81010, there is one counter 81060 for each thread 81030, and each counter has a table 70 with a number of entries equal to the maximum number of threads. When a process thread 81030 initiates a collective of processors 81026, if it is the master core it sets the table 70 entry 81078 with the ID 81074 of the communicator 81050. Threads 81030 of non-master processes just poll the entries 81078 of the master process to discover the counter 81060 to use for the collective. Table 1 below further illustrates the basic mechanism of the system 81010.
In Table 1: #counters=#threads=64 on a supercomputer system; Processes or threads Ids={0, 1, 2, 3}; Running on cores={0, 1, 2, 3}; Communicator 1={0, 1, 2}; Master core=0; Communicator 2={1, 2, 3}; and Master core=1. Table entries are as below:
In Table 1 above, the counter is discovered by searching entries in the table; however, the space overhead is considerably reduced. The search overhead is small, as typically only a small number of communicators are active at a time, and these occupy the first few slots in the table.
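The table-based discovery described above may be sketched as follows, using the communicators and master cores of the Table 1 example. The class Node, its method names, and the identification of a counter by a (master core, slot) pair are assumptions made for illustration only.

```python
EMPTY = None


class Node:
    """Per-node tables through which masters publish counters."""

    def __init__(self, num_cores, slots_per_core):
        # tables[core][slot] holds the ID of the communicator currently
        # using the counter associated with that slot, or EMPTY.
        self.tables = [[EMPTY] * slots_per_core for _ in range(num_cores)]

    def master_publish(self, master_core, communicator_id):
        """Master: write the communicator ID into the first free entry."""
        table = self.tables[master_core]
        for slot, entry in enumerate(table):
            if entry is EMPTY:
                table[slot] = communicator_id
                return (master_core, slot)
        raise RuntimeError("no free counter slot on the master core")

    def discover(self, master_core, communicator_id):
        """Non-master: search the master's entries for the communicator ID.

        Returns the counter to use, or None if the master has not yet
        published it (the caller is expected to poll again).
        """
        for slot, entry in enumerate(self.tables[master_core]):
            if entry == communicator_id:
                return (master_core, slot)
        return None

    def master_clear(self, master_core, slot):
        """Master: free the entry once the collective has completed."""
        self.tables[master_core][slot] = EMPTY


# Table 1 example: communicator 1 = {0, 1, 2} with master core 0,
# communicator 2 = {1, 2, 3} with master core 1.
node = Node(num_cores=4, slots_per_core=4)
counter_for_comm1 = node.master_publish(master_core=0, communicator_id=1)
counter_for_comm2 = node.master_publish(master_core=1, communicator_id=2)
```

The search in discover is linear, which is consistent with the observation above that the active communicators tend to occupy the first few slots.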
In another embodiment of the invention, for non-blocking communications, instead of using a per thread resource allocation, a central pool of resources is allocated. A master process or thread per communicator is responsible for claiming the resources from this pool and freeing the resources after their usage. However, it is important that the resources are allocated/freed in a safe manner as multiple concurrent communications can happen simultaneously.
Additionally, the mechanism/system 81010 according to the present invention may be applied to other collective operations needing a finite amount of resources for their operation. The mechanisms applied in the present invention can also be applied to other collective operations such as an MPI operation, for example, MPI_Allreduce. Such an operation as MPI_Allreduce performs a global reduce operation on the data provided by the application.
Similar to the Barrier operation with multiple processes/threads on a node, it also requires a shared pool of resources, in this context, a shared pool of memory buffers where the data can be reduced. The algorithm described in this application for resource sharing can be applied to share the pool of memory buffers for MPI_Allreduce among different communicators.
Thereby, in the present invention, the system 81010 provides a mechanism where each communicator designates a master core in the multi-processor environment. One counter for each thread is allocated and has a table with number of entries equal to the maximum number of threads. When a process thread initiates a collective, if it is the master core, it sets the table entry with the ID of the communicator. Threads of non-master processes just poll the entries of the master process to discover the counter to use for the collective.
Referring to
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing. Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the
Modern processors typically include multiple hardware threads, allowing for the concurrent execution of multiple software threads on a single processor. Due to silicon area and power constraints, it is not possible to have each hardware thread be completely independent from other threads. Each hardware thread shares resources with the other threads. For example, execution units (internal to the processor), and memory and IO subsystems (external to the processor), are resources typically shared by each hardware thread. In many programs, at times a thread must wait for an action to occur external to the processor before continuing its program flow. For example, a thread may need to wait for a memory location to be updated by another processor, as in a barrier operation. Typically, for highest speed, the waiting thread would poll the address residing in memory, waiting for the other processor to update it. This polling action takes resources away from other competing threads on the processor. In this example, the load/store unit of the processor would be utilized by the polling thread, at the expense of the other threads that share it.
The performance cost is especially high if the polled variable is L1-cached (primary cache), since the frequency of the loop is highest. Similarly, the performance cost is high if, for example, a large number of L1-cached addresses are polled, and thus take L1 space from other threads.
Multiple hardware threads in processors may also apply to high performance computing (HPC) or supercomputer systems and architectures such as IBM® BLUE GENE® parallel computer system, and to a novel massively parallel supercomputer scalable, for example, to 100 petaflops. Massively parallel computing structures (also referred to as “supercomputers”) interconnect large numbers of compute nodes, generally, in the form of very regular structures, such as mesh, torus, and tree configurations. The conventional approach for the most cost/effective scalable computers has been to use standard processors configured in uni-processors or symmetric multiprocessor (SMP) configurations, wherein the SMPs are interconnected with a network to support message passing communications. Currently, these supercomputing machines exhibit computing performance achieving 1-3 petaflops.
There is therefore a need to increase application performance by reducing the performance loss incurred when software is blocked in a spin loop or a similar blocking polling loop. Further, there is a need to reduce the performance loss caused by polling and the like, i.e., the consumption of processor resources, to increase overall performance. It would also be desirable to provide a system and method for polling external conditions while minimizing the consumption of processor resources, thus increasing overall performance.
Referring to
Thereby, the present invention executes the wait instruction 82034 (
Referring to
Referring to
Thereby, the present invention offloads the monitoring of computing resources, for example memory resources, from the processor to the pin and logic circuit. Instead of having to poll a computing resource, a thread configures the logic circuit with the information that it is waiting for, i.e., the occurrence of a specified condition, and initiates a pause state. The thread in pause state no longer consumes processor resources while it is waiting for the external condition. Subsequently, the pin wakes the thread when the appropriate condition is detected by the logic circuit. A variety of conditions can be monitored according to the present invention, including, but not limited to, writing to memory locations, the occurrence of interrupt conditions, reception of data from I/O devices, and expiration of timers.
The method of the present invention is generally implemented by a computer executing a sequence of program instructions for carrying out the steps of the method and may be embodied in a computer program product comprising media storing the program instructions. Although not required, the invention can be implemented via an application-programming interface (API), for use by a developer, and/or included within the network browsing software, which will be described in the general context of computer-executable instructions, such as program modules, being executed by one or more computers, such as client workstations, servers, or other devices. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations.
Other well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers (PCs), server computers, hand-held or laptop devices, multi-processor systems, microprocessor-based systems, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like, as well as a supercomputing environment. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network or other data transmission medium. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In another embodiment of the invention, the system 82010 and method 82100 of the present invention may be used in a supercomputer system. The supercomputer system may be expandable to a specified number of compute racks, each with predetermined compute nodes containing, for example, multiple A2 processor cores. For example, each core may be associated with a quad-wide fused multiply-add SIMD floating point unit, producing 8 double precision operations per cycle, for a total of 128 floating point operations per cycle per compute chip. Cabled as a single system, the multiple racks can be partitioned into smaller systems by programming switch chips, which source and terminate the optical cables between midplanes.
Further, for example, each compute rack may consist of 2 sets of 512 compute nodes. Each set may be packaged around a double-sided backplane, or midplane, which supports a five-dimensional torus of size 4×4×4×4×2, which is the communication network for the compute nodes, which are packaged on 16 node boards. The torus network can be extended in 4 dimensions through link chips on the node boards, which redrive the signals optically with an architecture limit of 64 to any torus dimension. The signaling rate may be 10 Gb/s (8/10 encoded), over about 20 meter multi-mode optical cables at 850 nm. As an example, a 96-rack system is connected as a 16×16×16×12×2 torus, with the last ×2 dimension contained wholly on the midplane. For reliability reasons, small torus dimensions of 8 or less may be run as a mesh rather than a torus with minor impact to the aggregate messaging rate. One embodiment of a supercomputer platform contains four kinds of nodes: compute nodes (CN), I/O nodes (ION), login nodes (LN), and service nodes (SN).
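The torus dimensions quoted above can be checked against the rack count with a short calculation, shown here purely as a worked example. The variable names are chosen for this sketch; the dimensions and the two sets (midplanes) per rack are taken from the description.

```python
from math import prod

MIDPLANE_TORUS = (4, 4, 4, 4, 2)          # torus supported by one midplane
SYSTEM_TORUS_96_RACKS = (16, 16, 16, 12, 2)  # torus of the 96-rack example
MIDPLANES_PER_RACK = 2                    # each rack holds 2 sets of nodes

# Number of compute nodes is the product of the torus dimensions.
nodes_per_midplane = prod(MIDPLANE_TORUS)
nodes_per_rack = MIDPLANES_PER_RACK * nodes_per_midplane

# Racks needed to populate the full system torus.
racks = prod(SYSTEM_TORUS_96_RACKS) // nodes_per_rack
```

The midplane torus thus implies 512 nodes per set and 1024 nodes per rack, and the 16×16×16×12×2 torus holds exactly the nodes of 96 such racks.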
An exemplary system for implementing the invention includes a computer with components of the computer which may include, but are not limited to, a processing unit, a system memory, and a system bus that couples various system components including the system memory to the processing unit. The system bus may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus (also known as Mezzanine bus).
The computer may include a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CDROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer.
System memory may include computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) and random access memory (RAM). A basic input/output system (BIOS), containing the basic routines that help to transfer information between elements within computer, such as during start-up, is typically stored in ROM. RAM typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit. The computer may also include other removable/non-removable, volatile/nonvolatile computer storage media.
A computer may also operate in a networked environment using logical connections to one or more remote computers, such as a remote computer. The remote computer may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer. The present invention may apply to any computer system having any number of memory or storage units, and any number of applications and processes occurring across any number of storage units or volumes. The present invention may apply to an environment with server computers and client computers deployed in a network environment, having remote or local storage. The present invention may also apply to a standalone computing device, having programming language functionality, interpretation and execution capabilities.
The present invention, or aspects of the invention, can also be embodied in a computer program product, which comprises all the respective features enabling the implementation of the methods described herein, and which—when loaded in a computer system—is able to carry out these methods. Computer program, software program, program, or software, in the present context mean any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.
In an embodiment of the invention, the processor core allows a thread to put itself or another thread into a pause state. A thread in kernel mode puts itself into a pause state using a wait instruction or an equivalent instruction. A paused thread can be woken by a falling edge on an input signal into the processor 82220 core 82222. Each thread 0-3 has its own corresponding input signal. In order to ensure that a falling edge is not “lost”, a thread can only be put into a pause state if its input is high. A thread can only be put into a paused state by instruction execution on the core or presumably by low-level configuration ring access. The logic circuit wakes a thread. The processor 82220 cores 82222 wake up a paused thread to handle enabled interrupts. After interrupt handling completes, the thread will go back into a paused state, unless the subsequent pause state is overridden by the handler. Thus, interrupts are transparently handled. The logic circuit allows a thread to wake any other thread, which can be kernel configured such that a user thread can or cannot wake a kernel thread.
The logic circuit may drive the signals such that a thread of the processor 82220 will wake on a rising edge. Thus, throughout the logic circuit, a rising edge or value 1 indicates wake-up. The logic circuit may support 32 wake sources. The wake sources may comprise 12 WakeUp address compare (WAC) units, 4 wake signals from the message unit (MU), 8 wake signals from the BIC's core-to-core (c2c) signaling, 4 wake signals from GEA outputs 12-15, and 4 so-called convenience bits. These 4 bits are for software convenience and have no incoming signal. The other 28 sources can wake one or more threads. Software determines which sources wake corresponding threads.
In an embodiment of the invention, the thread pausing instruction sequence includes:
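For illustration, the 32 wake sources and a per-thread enable mask may be modeled as bit fields as sketched below. The source counts come from the description; the ordering of the sources within the 32-bit word, the names, and the helper functions are assumptions of this sketch.

```python
# (name, number of sources) in an assumed bit order, lowest bits first.
WAKE_SOURCES = [
    ("WAC", 12),         # WakeUp address compare units
    ("MU", 4),           # message unit wake signals
    ("C2C", 8),          # BIC core-to-core signaling
    ("GEA", 4),          # GEA outputs 12-15
    ("CONVENIENCE", 4),  # software convenience bits, no incoming signal
]


def build_layout(sources):
    """Assign consecutive bit positions to each group of wake sources."""
    layout, bit = {}, 0
    for name, count in sources:
        layout[name] = range(bit, bit + count)
        bit += count
    return layout, bit


LAYOUT, TOTAL_SOURCES = build_layout(WAKE_SOURCES)


def source_mask(name, index):
    """Single-bit mask for source number `index` of group `name`."""
    return 1 << LAYOUT[name][index]


def should_wake(status_bits, thread_enable_mask):
    """A thread wakes when any source it has enabled is asserted (value 1)."""
    return (status_bits & thread_enable_mask) != 0
```

Software would configure one enable mask per thread, which corresponds to the statement above that software determines which sources wake which threads.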
1. Software setting bits to enable the allowed wakeup options for a thread. Enabling specific exceptions to interrupt the paused thread and resume execution. Each thread has a set of Wake Control bits which determine how the corresponding thread can be started after a pause state has been entered.
In an alternative embodiment of the invention, a pause unit can serve multiple threads. The multiple threads may or may not be within a single processor core. This allows address-compare units and other resume condition hardware to be shared by multiple threads. Further, the threads in the present invention may include barrier and ticket lock threads.
Traditional operating systems rely on an MMU (memory management unit) to create mappings for applications. However, it is often desirable to create a hole between application heap and application stacks. The hole catches applications that may be using too much stack space, or buffer overruns.
Thus, there is further provided a system and a method for an operating system to create mappings for applications when the operating system cannot create a hole between application heap and application stacks.
A system and method is also provided for an operating system to create mappings as above when the operating system creates a static memory mapping at application startup, such as in a supercomputer. It would also be desirable to provide a system and method for an alternative to using a processor or debugger application or facility to perform a memory access check.
Referring to
In one embodiment of the invention, the wakeup unit 85110 drives a hardware connection 85112 to the bus interface card (BIC) 85130 designated by the code OR (enabled WAC0-11). A processor 85120 thread 85440 (
Referring to
Referring to
The core 85214 of the system 85200 includes a main hardware (hw) thread 85220 having a used stack 85222, a growable stack 85224, and a guard page 85226. A first heap region 85230 includes a first stack hwthread 85232 and guard page 85234, and a third stack hwthread 85236 and a guard page 85238. A second heap region 85240 includes a stack pthread 85242 and a guard page 85244, and a second stack hwthread 85246 and a guard page 85248. The core 85214 further includes a read-write data segment 85250, and an application text and read-only data segment 85252.
Using the wakeup unit's 85110 registers 85452 (
The guard pages have attributes which typically include the following features:
Thereby, instead of using the processor or debugger facilities to perform the memory access check, the system 85100 of the present invention uses the wakeup unit 85110. The wakeup unit 85110 detects memory accesses between the level-1 cache (L1p) and the level-2 cache (L2). If the L1p is fetching or storing data into the guard page region, the wakeup unit will send an interrupt to the wakeup unit's core.
Referring to
The following steps are used to create/reposition/resize a guard page for an embodiment of the invention:
Referring to
According to the present invention, the WAC registers may be implemented as a base address and a bit mask. An alternative implementation could be a base address and length, or base starting address and base ending address. In step 85332, the operating system moves the guard page whenever the top of the heap changes size. Thus, in one embodiment of the invention, when a guard page is violated, the wakeup unit detects the memory access from L1p→L2 and generates an interrupt to the core 85120. The operating system 85424 takes control when the interrupt occurs and queries the wakeup unit 85110 to determine the source of the interrupt. Upon detecting the WAC registers 85452 assigned to the guard page that have been activated or tripped, the operating system 85424 then initiates a response, for example, delivering a signal, or terminating the application.
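As a non-limiting sketch, a base address and bit mask comparison of the kind described above may be modeled as follows. The 4096-byte guard size, the 64-bit address width, and the function names are assumptions for illustration; the actual register semantics of the wakeup unit are not specified here.

```python
PAGE_SIZE = 4096  # assumed guard page size; the description does not state one


def wac_matches(addr, base, mask):
    """True when `addr` falls in the range selected by `base` and `mask`.

    Bits set in `mask` must equal the corresponding bits of `base`;
    cleared bits are ignored, so the range is a power-of-two sized block.
    """
    return (addr & mask) == (base & mask)


def guard_mask(size, addr_bits=64):
    """Mask selecting a naturally aligned power-of-two region of `size` bytes."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of two")
    return ((1 << addr_bits) - 1) & ~(size - 1)


def place_guard(heap_top, size=PAGE_SIZE):
    """Return (base, mask) for the first aligned guard block at or above heap_top.

    The operating system would recompute this whenever the heap top moves.
    """
    base = (heap_top + size - 1) & ~(size - 1)
    return base, guard_mask(size)


base, mask = place_guard(0x1000_0123)
# An access inside the guard block trips the comparison; neighbours do not.
print(hex(base), wac_matches(base + 8, base, mask), wac_matches(base - 1, base, mask))
```

A base and mask pair can only describe aligned power-of-two regions, which is one reason the alternative base and length form mentioned above may be preferred for arbitrary guard sizes.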
When a hardware thread changes the guard page of the main thread, it sends an interprocessor interrupt (IPI) to the main hwthread only if the main hwthread resides on a different processor 85120. Otherwise, the thread that caused the heap to change size can directly update the wakeup unit WAC registers. Alternatively, the operating system could ignore this optimization and always interrupt.
Unlike other supercomputer solutions, the data address compare (DAC) registers of the processor of the present invention are still available for debuggers to use and set. This enables the wakeup solution to be used in combination with the debugger.
Referring to
In an alternative embodiment of the invention the memory device includes cache memory. The cache memory is positioned adjacent to and nearest the wakeup unit and between the processor and the wakeup unit. When the cache memory fetches data from a guard page or stores data into the guard page, the wakeup unit sends an interrupt to a core of the wakeup unit. Thus, the wakeup unit can be connected between selected levels of cache.
Referring to
Referring to
Referring to
Referring to
IBM BLUEGENE™/L and P parallel computer systems use a separate collective network, such as the logical tree network disclosed in commonly assigned U.S. Pat. No. 7,650,434, for performing collective communication operations. The uplinks and downlinks between nodes in such a collective network needed to be carefully constructed to avoid deadlocks between nodes when communicating data. In a deadlock, packets cannot move due to the existence of a cycle in the resources required to move the packets. In networks these resources are typically buffer spaces in which to store packets.
If logical tree networks are constructed carelessly, then packets may not be able to move between nodes due to a lack of storage space in a buffer. For example, a packet (packet 1) stored in a downlink buffer for one logical tree may be waiting on another packet (packet 2) stored in an uplink buffer of another logical tree to vacate the buffer space. Furthermore, packet 2 may be waiting on a packet (packet 3) in a different downlink buffer to vacate its buffer space and packet 3 may be waiting for packet 1 to vacate its buffer space. Thus, none of the packets can move into an empty buffer space and a deadlock ensues. While there is prior art for constructing deadlock free routes in a torus for point-to-point packets (Dally “Deadlock-Free Message Routing in Multiprocessor Interconnection Networks” IEEE TRANSACTIONS ON COMPUTERS, VOL. C-36, NO. 5, MAY 1987 and Duato “A General Theory for Deadlock-Free Adaptive Routing Using a Mixed Set of Resources” IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 12, NO. 12, DECEMBER 2001), there are no specific rules for constructing deadlock free collective class routes in a torus network, nor is it obvious how to apply Duato's general rules in such a way to avoid deadlocks when constructing multiple virtual tree networks that are overlayed onto a torus network. If different collective operations are always separated by barrier operations (that do not use common buffer spaces with the collectives nor block on common hardware resources as the collectives), then the issue of deadlocks does not arise and class routes can be constructed in an arbitrary manner. However, this increases the time of the collective operations and therefore reduces performance.
Thus, there is a need in the art for a method and system for performing collective communication operations within a parallel computing network without the use of a separate collective network and in which multiple logical trees can be embedded (or overlayed) within a multiple dimension torus network in such a way as to avoid the possibility of deadlocks. Virtual channels (VCs) are often used to represent the buffer spaces used to store packets. It is further desirable to have several different logical trees using the same VC and thus sharing the same buffer spaces.
The torus comprises a plurality of interconnected compute nodes 861021 to 861021. The structure of a compute node 86102 is shown in further detail in
The compute nodes 86102 are interconnected to each other by one or more physical wires or links. To prevent deadlocks, a physical wire that functions as an uplink for a logical tree on a VC can never function as a downlink in any other virtual tree (or class route) on that same VC. Similarly, a physical wire that functions as a downlink for a particular class route on a VC can never function as an uplink in any other virtual tree on that same VC. Each class route is associated with its own unique tree network. In one embodiment of the IBM BlueGene parallel computing system, there are 16 class routes, and thus at least 16 different tree networks embedded within the multi-dimensional torus network that form the parallel computing system.
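The uplink/downlink rule stated above can be checked mechanically over a set of class routes. The following is a minimal sketch under stated assumptions: each physical wire is modeled as a directed link `(source, destination)`, each class route is given as its virtual channel plus its uptree links, and the downtree links are taken to be the reverse of the uptree links. The function name and data layout are hypothetical.

```python
from collections import defaultdict


def find_conflicts(class_routes):
    """Find links used uptree by one class route and downtree by another.

    class_routes: {route_id: (vc, iterable of (child, parent) uptree links)}
    Returns a sorted list of (vc, link, uptree_route, downtree_route).
    """
    up_users = defaultdict(set)    # (vc, link) -> routes using the link uptree
    down_users = defaultdict(set)  # (vc, link) -> routes using the link downtree
    for route_id, (vc, uplinks) in class_routes.items():
        for child, parent in uplinks:
            up_users[(vc, (child, parent))].add(route_id)
            # The broadcast travels the same tree in the opposite direction.
            down_users[(vc, (parent, child))].add(route_id)

    conflicts = []
    for (vc, link), up_routes in up_users.items():
        for up_route in up_routes:
            for down_route in down_users.get((vc, link), ()):
                if up_route != down_route:
                    conflicts.append((vc, link, up_route, down_route))
    return sorted(conflicts)


# Three nodes in a line, 0 - 1 - 2, with both routes on virtual channel 0.
bad = {0: (0, [(0, 1), (1, 2)]),   # rooted at node 2
       1: (0, [(2, 1), (1, 0)])}   # rooted at node 0: reverses every link
good = {0: (0, [(0, 1), (2, 1)]),  # both rooted at node 1
        1: (0, [(0, 1), (2, 1)])}
print(len(find_conflicts(bad)), len(find_conflicts(good)))
```

Moving one of the conflicting routes in `bad` to a different virtual channel also clears the conflicts in this model, since the rule applies per VC.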
While
As in BlueGene/L, the logical trees (class routes) can be defined by DCR registers programmed at each node. Each class route has a DCR containing a bit vector of uptree link inputs and one or more local contribution bits and a bit vector of uptree link outputs. If bit i is set in the input link DCR, then that means that an input is required on link i (or the local contribution). If bit i is set in the output link DCR, then uptree packets are sent out link i. At most one output link may be specified at each node. A leaf node has no input links, but does have a local input contribution. An intermediate node has both input links and an output link and may have a local contribution. A root node has only input links, and may have a local contribution. In one embodiment of the invention, all nodes in the tree have a local contribution bit set and the tree defines one or more sub-rectangles. Bits in the packet may specify which class route to use (class route id). As packets flow through the network, the network logic inspects the class route ids in the packets, reads the DCR registers for that class route id and determines the appropriate inputs and outputs for the packets. These DCRs may be programmed by the operating system so as to set routes in a predetermined manner. Note that the example trees in
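For illustration, the role of a node within a class route can be derived from its input and output bit vectors as described above. The sketch below assumes the vectors are held as plain integers and that the local contribution is a single flag; `classify_node` is a hypothetical helper, not a register interface.

```python
def classify_node(input_links, output_links, local_contribution):
    """Classify a node as 'leaf', 'intermediate' or 'root' for one class route.

    input_links / output_links: integers used as bit vectors, bit i = link i.
    local_contribution: True when the node's local contribution bit is set.
    """
    if bin(output_links).count("1") > 1:
        raise ValueError("at most one uptree output link may be set per node")
    has_inputs = input_links != 0
    has_output = output_links != 0
    if has_output and not has_inputs:
        # A leaf has no input links, so its only input is the local one.
        if not local_contribution:
            raise ValueError("a leaf node must have a local contribution")
        return "leaf"
    if has_output and has_inputs:
        return "intermediate"
    if has_inputs:
        return "root"  # only input links, no uptree output
    raise ValueError("node does not participate in this class route")


print(classify_node(0b000, 0b100, True),
      classify_node(0b011, 0b100, False),
      classify_node(0b011, 0b000, True))
```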
In one embodiment, the predetermined manner is routing the data packet in direction of an ‘e’ dimension, and if routing the data packet in direction of the ‘e’ dimension is not possible (either because there are no hops to make in the e dimension, or if the predefined coordinate in the e dimension has been reached or if the edge of the subrectangle in the e-dimension has been reached), then routing the data packet in direction of an ‘a’ dimension, and if routing the data packet in direction of the ‘a’ dimension is not possible, then routing the data packet in direction of a ‘b’ dimension, and if routing the data packet in direction of the ‘b’ dimension is not possible, then routing the data packet in direction of a ‘c’ dimension, and if routing the data packet in direction of the ‘c’ dimension is not possible, then routing the data packet in direction of the ‘d’ dimension.
In one embodiment, routing between nodes occurs in an ‘outside-in’ manner with compute nodes communicating data packets along a subrectangle from the leaf nodes towards a predefined coordinate in each dimension (which may be the middle coordinate in that dimension) and changing dimension when a node is reached that has the predefined coordinate in that dimension or when the end of the subrectangle is reached in that dimension, whichever comes first. Routing data from the ‘outside’ to the ‘inside’ until the root of the virtual tree is reached, and then broadcasting the packets down the virtual tree in the opposite direction in such a predetermined manner prevents communication deadlocks between the compute nodes.
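The e, a, b, c, d ordering and the ‘outside-in’ rule may be sketched as a next-hop function. This sketch assumes coordinates are held in dictionaries keyed by dimension name, treats the subrectangle as a mesh (torus wrap-around links are ignored), and models "end of the subrectangle, whichever comes first" by clamping the predefined coordinate to the subrectangle bounds. All names are hypothetical.

```python
DIM_ORDER = ("e", "a", "b", "c", "d")  # dimension priority taken from the text


def next_uptree_hop(coord, target, lo, hi):
    """Return (dimension, step) for the next uptree hop, or None at the root.

    coord, target, lo, hi are dicts keyed by dimension name; lo/hi bound the
    subrectangle and target holds the predefined coordinate per dimension.
    """
    for dim in DIM_ORDER:
        # Stop at the subrectangle edge if the predefined coordinate lies outside.
        goal = min(max(target[dim], lo[dim]), hi[dim])
        if coord[dim] != goal:
            return dim, (1 if goal > coord[dim] else -1)
    return None  # no dimension left to route in: this node is the root


def route_to_root(start, target, lo, hi):
    """Follow next_uptree_hop until the root; return (root_coord, hops)."""
    coord, hops = dict(start), []
    while (hop := next_uptree_hop(coord, target, lo, hi)) is not None:
        dim, step = hop
        coord[dim] += step
        hops.append(hop)
    return coord, hops


lo = dict(a=0, b=0, c=0, d=0, e=0)
hi = dict(a=3, b=3, c=3, d=3, e=1)
target = dict(a=2, b=0, c=0, d=0, e=1)
root, hops = route_to_root(dict(a=0, b=0, c=0, d=0, e=0), target, lo, hi)
print(root, hops)
```

Because every node applies the same dimension order toward the same coordinates, all uptree paths in one class route converge on a single root in this model.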
In one embodiment, compute nodes arranged in a logical tree overlayed on to a multidimensional network are used to evaluate collective operations. Examples of collective operations include logical bitwise AND, OR and XOR operations, unsigned and signed integer ADD, MIN and MAX operations, and 64 bit floating point ADD, MIN and MAX operations. In one embodiment, the operation to be performed is specified by one or more OP code (operation code) bits specified in the packet header. In one embodiment, collective operations are performed in one of several modes, e.g., single node broadcast mode or “broadcast” mode, global reduce to a single node or “reduce” mode, and global all-reduce to a root node, then broadcast to all nodes or “all reduce” mode. These three modes are described in further detail below.
In the mode known as “ALL REDUCE”, each compute node in the logical tree makes a local contribution to the data packet, i.e., each node contributes a data packet of its own data and performs a logic operation on the data stored in that data packet and data packets from all input links in the logical tree at that node before the “reduced” data packet is transmitted to the next node within the tree. This occurs until the data packet finally reaches the root node, e.g., 1026. Movement from a leaf node or intermediate node towards a root node is known as moving ‘uptree’ or ‘uplink’. The root node makes another local contribution (performs a logic operation on the data stored in the data packet) and then rebroadcasts the data packet down the tree to all the leaf and intermediate nodes within the tree network. Movement from a root node towards a leaf or intermediate node is known as moving ‘downtree’ or ‘downlink’. The data packet broadcast from the root node to the leaf nodes contains final reduced data values, i.e., local contributions from all the nodes in the tree which are combined according to the prescribed OP code. As the data packet is broadcast downlink the leaf nodes do not make further local contributions to the data packet. Packets are also received at the nodes as they are broadcast down the tree, and every node receives exactly the same final reduced data values.
The mode known as “REDUCE” is exactly the same as “ALL REDUCE”, except that the packets broadcast down the tree are not received at any compute node except for one which is specified as a destination node in the packet headers.
In the mode known as “BROADCAST”, a node in the tree makes a local contribution to a data packet and communicates the data packet up the tree toward a root node, e.g., node 861026. The data packet may pass through one or more intermediate nodes to reach the root node, but the intermediate nodes do not make any local contributions or logical operations on the data packet. The root node receives the data packet and the root node also does not perform any logic operations on the data packet. The root node rebroadcasts the received data packet downlink to all of the nodes within the tree network.
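The three modes described above differ only in which nodes contribute on the way up the tree and which nodes receive on the way down. A minimal software simulation, offered as a sketch rather than a description of the network logic, is given below; the tree is supplied as a hypothetical `parent` map and the reduction operator is any two-argument function.

```python
import operator


def collective(parent, contributions, op, mode, dest=None, source=None):
    """Simulate one collective on a logical tree; return {node: value received}.

    parent: {node: parent node, or None for the root}
    contributions: {node: local value}
    mode: 'allreduce', 'reduce' (needs dest) or 'broadcast' (needs source)
    """
    if mode == "broadcast":
        # Only the source contributes; no node combines anything uptree.
        return {node: contributions[source] for node in parent}

    children = {node: [] for node in parent}
    root = None
    for node, up in parent.items():
        if up is None:
            root = node
        else:
            children[up].append(node)

    def reduce_uptree(node):
        # Combine the local contribution with the result from every input link.
        acc = contributions[node]
        for child in children[node]:
            acc = op(acc, reduce_uptree(child))
        return acc

    result = reduce_uptree(root)
    if mode == "allreduce":
        return {node: result for node in parent}  # every node receives
    if mode == "reduce":
        return {dest: result}  # only the destination node receives
    raise ValueError(f"unknown mode: {mode}")


parent = {0: 1, 1: None, 2: 1, 3: 2}  # node 1 is the root
values = {0: 1, 1: 2, 2: 3, 3: 4}
print(collective(parent, values, operator.add, "allreduce"))
print(collective(parent, values, max, "reduce", dest=3))
print(collective(parent, values, operator.add, "broadcast", source=0))
```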
In one embodiment, packet type bits in the header are used to specify ALL REDUCE, REDUCE or BROADCAST operation. In one embodiment, the topology of the tree network is determined by a collective logic device as shown in
In one embodiment, the back-end floating point logic device 86440 includes, without limitation, at least one shift register for performing normalization and/or shifting operations (e.g., a left shift, a right shift, etc.). In one embodiment, the collective logic device 86460 further includes an arbiter device 86450. The arbiter device is described in detail below in conjunction with
Once input requests have been chosen by an arbiter, those input requests are sent to appropriate senders (and/or the reception FIFO) 86530 and/or 86550. Once some or all of the senders grant permission, the main arbiter 86525 relays this grant to a particular sub-arbiter which has won and to each receiver (e.g., an injection FIFO 86500 and/or 86505). The main arbiter 86525 also drives correct configuration bits to the collective logic device 86460. The receivers will then provide their input data through the collective logic device 86460 and an output of the collective logic device 86460 is forwarded to appropriate sender(s).
Byte 86604 comprises collective class route bits. In one embodiment, there are four collective class route bits that provide 16 possible class routes (i.e., 2^4=16 class routes). Byte 86606 comprises bits that enable collective operations and determine the collective operations mode, i.e., “broadcast”, “reduce” and “all reduce” modes. In one embodiment, setting the first three bits (bits 0 to 2) of byte 86604 to ‘110’ indicates a system collective operation is to be carried out on the data packet. In one embodiment, setting bits 3 and 4 of byte 86606 indicates the collective mode. For example, setting bits 3 and 4 to ‘00’ indicates broadcast mode, ‘11’ indicates reduce, and ‘10’ indicates all-reduce mode.
Bytes 86608, 86610, 86612 and 86614 comprise destination address bits for each dimension, a through e, within a 5-dimensional torus. In one embodiment, these address bits are only used when operating in “reduce” mode to address a destination node. In one embodiment, there are 6 address bits per dimension. Byte 86608 comprises 6 address bits for the ‘a’ dimension, byte 86610 comprises 6 address bits for the ‘b’ dimension and 2 address bits for the ‘c’ dimension, byte 86612 comprises 4 address bits for the ‘c’ dimension and 4 address bits for the ‘d’ dimension, and byte 86614 comprises 2 address bits for the ‘d’ dimension and 6 address bits for the ‘e’ dimension.
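The split of five 6-bit coordinates across four bytes follows from treating the four bytes as one 32-bit field. The sketch below assumes big-endian byte order with the two unused bits at the top of the first byte; that placement is an assumption, since the description only states that the first byte carries the 6 bits of the ‘a’ dimension.

```python
DIMS = ("a", "b", "c", "d", "e")
BITS_PER_DIM = 6


def pack_destination(coords):
    """Pack five 6-bit coordinates into four bytes (30 bits, 2 bits unused)."""
    word = 0
    for dim in DIMS:
        value = coords[dim]
        if not 0 <= value < (1 << BITS_PER_DIM):
            raise ValueError(f"coordinate {dim}={value} does not fit in 6 bits")
        word = (word << BITS_PER_DIM) | value
    return word.to_bytes(4, "big")


def unpack_destination(data):
    """Inverse of pack_destination."""
    word = int.from_bytes(data, "big")
    coords = {}
    for dim in reversed(DIMS):
        coords[dim] = word & ((1 << BITS_PER_DIM) - 1)
        word >>= BITS_PER_DIM
    return coords


packed = pack_destination(dict(a=1, b=2, c=3, d=4, e=5))
# Byte 1 holds b plus the top 2 bits of c; byte 2 holds the low 4 bits of c
# and the top 4 bits of d; byte 3 holds the low 2 bits of d and all of e.
print(packed.hex(), unpack_destination(packed))
```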
Parallel computer applications often use message passing to communicate between processors. Message passing utilities such as the Message Passing Interface (MPI) support two types of communication: point-to-point and collective. In point-to-point messaging, a processor sends a message to another processor that is ready to receive it. In a collective communication operation, however, many processors participate together in the communication operation.
Collective communication operations play a very important role in high performance computing. In collective communication, data are redistributed cooperatively among a group of processes. Sometimes the redistribution is accompanied by various types of computation on the data and it is the results of the computation that are redistributed. MPI, which is the de facto message passing programming model standard, defines a set of collective communication interfaces, including MPI_BARRIER, MPI_BCAST, MPI_REDUCE, MPI_ALLREDUCE, MPI_ALLGATHER, MPI_ALLTOALL etc. These are application level interfaces and are more generally referred to as APIs. In MPI, collective communications are carried out on communicators which define the participating processes and a unique communication context.
Functionally, each collective communication is equivalent to a sequence of point-to-point communications, for which MPI defines MPI_SEND, MPI_RECV and MPI_WAIT interfaces (and variants). MPI collective communication operations are implemented with a layered approach in which the collective communication routines handle semantic requirements and translate the collective communication function call into a sequence of SEND/RECV/WAIT operations according to the algorithms used. The point-to-point communication protocol layer guarantees reliable communication.
Collective communication operations can be synchronous or asynchronous. In a synchronous collective operation all processors have to reach the collective before any data movement happens on the network. For example, all processors need to make the collective API or function call before any data movement happens on the network. Synchronous collectives also ensure that all processors are participating in one or more collective operations that can be determined locally. In an asynchronous collective operation, there are no such restrictions and processors can start sending data as soon as the processors reach the collective operation. With asynchronous collective operations, several collectives can be happening simultaneously.
Asynchronous one-sided collectives that do not involve participation of the intermediate and destination processors are critical for achieving good performance in a number of programming paradigms. For example, in an async one-sided broadcast, the root initiates the broadcast and all destination processors receive the broadcast message without any intermediate nodes forwarding the broadcast message to other nodes.
The torus network supports both point-to-point operations and collective communication operations. The collective communication operations supported are barrier, broadcast, reduce and allreduce. For example, a broadcast put descriptor will place the broadcast payload on all the nodes in the class route (a predetermined route set up for a group of nodes in the MPI communicator). Similarly there are collective put reduce and broadcast operations. A remote get (with a reduce put payload) can be broadcast to all the nodes from where data will be reduced via the put descriptor.
Each application or programming language may implement a collective API 88302 to invoke or call collective operation functions. A user application for example implemented in that application programming language then may make the appropriate function calls for the collective operations. Collective operations may be then performed via the API adaptor 88304 using its internal components such as an MPI communicator 88312, in addition to the other components in the collective framework, such as scheduler 88314, executor 88306, and multisend interface 88310.
Language adaptor 88304 interfaces the collective framework to a programming language. For example, a language adaptor such as for a message passing interface (MPI) has a communicator component 88312. Briefly, an MPI communicator is an object with a number of attributes and rules that govern its creation, use, and destruction. The communicator 88312 determines the scope and the “communication universe” in which a point-to-point or collective operation is to operate. Each communicator 88312 contains a group of valid participants and the source and destination of a message is identified by process rank within that group.
Executor 88306 may handle functionalities for specific optimizations such as pipelining, phase independence and multi-color routes. An executor may query a schedule on the list of tasks and execute the list of tasks returned by the scheduler 88314. Typically, each collective operation is assigned one executor.
The scheduler 88314 handles a functionality of collective operations and algorithms, and includes a set of steps in the collective algorithm that execute a collective operation. Scheduler 88314 may split a collective operation into phases. For example, a broadcast can be done through a spanning tree schedule where in each phase, a message is sent from one node to the next level of nodes in the spanning tree. In each phase, scheduler 88314 lists sources that will send a message to a processor and a list of tasks that need to be performed in that phase.
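As one hedged illustration of a spanning tree schedule, the phases of a broadcast over a k-ary tree can be enumerated as lists of (source, destination) sends. The k-ary heap layout, the `fanout` parameter and the function name are choices made for this sketch and are not taken from the scheduler 88314 itself.

```python
def spanning_tree_schedule(num_nodes, fanout=2, root=0):
    """Return broadcast phases; each phase is a list of (src, dst) sends.

    Nodes are laid out as a k-ary heap with `root` at position 0, so in each
    phase every node of the current tree level sends to the next level.
    """
    # Relabel so that the chosen root occupies heap position 0.
    order = [root] + [n for n in range(num_nodes) if n != root]
    phases, level = [], [0]
    while level:
        sends, next_level = [], []
        for pos in level:
            for k in range(1, fanout + 1):
                child = pos * fanout + k
                if child < num_nodes:
                    sends.append((order[pos], order[child]))
                    next_level.append(child)
        if sends:
            phases.append(sends)
        level = next_level
    return phases


def sources_for(phase, node):
    """List the nodes that send a message to `node` in the given phase."""
    return [src for src, dst in phase if dst == node]


phases = spanning_tree_schedule(7)
for number, phase in enumerate(phases):
    print(number, phase)
```

An executor in this model would walk the phases in order, asking `sources_for` which messages to expect before forwarding to its own children in the following phase.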
Multisend interface 88310 provides an interface to multisend 88316, which is a message passing backbone of a collective framework. Multisend functionality allows sending many messages at the same time, each message or a group of messages identified by a connection identifier. Multisend functionality also allows an application to multiplex data on this connection identifier.
As mentioned above, asynchronous one-sided collectives that do not involve participation of the intermediate and destination processors are critical for achieving good performance in a number of programming paradigms. For example, in an async one-sided broadcast, the root initiates the broadcast and all destination processors receive the broadcast message without any intermediate nodes forwarding the broadcast message to other nodes.
Embodiments of the present invention provide a method and system for one-sided asynchronous reduce operation. Embodiments of the invention use the remote get collective to implement one-sided operations. The compute node kernel (CNK) operating system allows each MPI task to map the virtual to physical addresses of all the other tasks in the booted partition. Moreover the remote-get and direct put descriptors take physical address of the input buffers.
Two specific example embodiments are described below. One embodiment, represented in
With reference to
In the procedure illustrated in
The prior art Blue Gene/L computer system structure can be described as a compute node core with an I/O node surface, where communication to the compute nodes is handled by the I/O nodes. In the compute node core, the compute nodes are arranged into both a logical tree structure and a multi-dimensional torus network. The logical tree network connects the compute nodes in a tree structure so that each node communicates with a parent and one or two children. The torus network logically connects the compute nodes in a three-dimensional lattice like structure that allows each compute node to communicate with its closest 6 neighbors in a section of the computer.
In the Blue Gene/Q system, the compute nodes comprise a multidimensional torus or mesh with N dimensions and the I/O nodes also comprise a multidimensional torus or mesh with M dimensions. N and M may be different, e.g., for scientific computers, typically N>M. Compute nodes do not typically have I/O devices such as disks attached to them, while I/O nodes may be attached directly to disks, or to a storage area network.
Each node in a D dimensional torus has 2D links going out from it. For example, the BlueGene/L computer system (BG/L) and the BlueGene/P computer system (BG/P) have D=3. The I/O nodes in BG/L and BG/P do not communicate with one another over a torus network. Also, in BG/L and BG/P, compute nodes communicate with I/O nodes via a separate collective network. To reduce costs, it is desirable to have a single network that supports point-to-point, collective, and I/O communications. Also, the compute and I/O nodes may be built using the same type of chips. Thus, for I/O nodes, when M<N, this means simply that some dimensions are not used, or wired, within the I/O torus. To provide connectivity between compute and I/O nodes, each chip has circuitry to support an extra bidirectional I/O link. Generally this I/O link is only used on a subset of the compute nodes. Each I/O node generally has its I/O link attached to a compute node. Optionally, each I/O node may also connect its unused I/O torus links to a compute node.
In BG/L, point-to-point packets are routed by placing both the destination coordinates and “hint” bits in the packet header. There are two hint bits per dimension indicating whether the packet should be routed in the plus or minus direction; at most one hint bit per dimension may be set. As the packet routes through the network, the hint bit is set to zero as the packet exits a node whose next (neighbor) coordinate in that direction is the destination coordinate. Packets can only move in a direction if its hint bit is set in that direction. Upon reaching its destination, all hint bits are 0. On BG/L, BG/P and BG/Q, there is hardware support, called a hint bit calculator, to compute the best hint bit settings for when packets are injected into the network.
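The hint bit mechanism may be illustrated with a small software model. The sketch assumes that the calculator picks the shorter way around each torus dimension (ties go to the plus direction) and that the packet is routed one dimension at a time in index order; both are simplifying assumptions, and the function names are hypothetical rather than those of the hardware hint bit calculator.

```python
def compute_hint_bits(src, dst, sizes):
    """Return one (plus, minus) hint pair per dimension.

    At most one bit of each pair is set; (0, 0) means no hops are needed.
    """
    hints = []
    for s, d, n in zip(src, dst, sizes):
        forward = (d - s) % n  # hops needed in the plus direction
        if forward == 0:
            hints.append((0, 0))
        elif forward <= n - forward:
            hints.append((1, 0))
        else:
            hints.append((0, 1))
    return hints


def route(src, dst, sizes):
    """Route a packet by its hint bits; return (path, final hint bits)."""
    coord = list(src)
    hints = [list(pair) for pair in compute_hint_bits(src, dst, sizes)]
    path = [tuple(coord)]
    for dim, n in enumerate(sizes):
        while hints[dim][0] or hints[dim][1]:
            step = 1 if hints[dim][0] else -1
            neighbor = (coord[dim] + step) % n
            if neighbor == dst[dim]:
                # Cleared as the packet exits the node next to the destination
                # coordinate, so it arrives with no hint set in this dimension.
                hints[dim] = [0, 0]
            coord[dim] = neighbor
            path.append(tuple(coord))
    return path, hints


path, final_hints = route((0, 0, 0), (7, 1, 0), (8, 8, 8))
print(path, final_hints)
```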
Thus, in a further aspect, a system and method for routing I/O packets between compute nodes and I/O nodes in a parallel computing system is provided. The invention may be implemented, in an embodiment, in a massively parallel computer architecture, referred to as a supercomputer, e.g., such as shown in
The Blue Gene/Q platform contains four kinds of nodes: compute nodes (CN), I/O nodes (ION), login nodes (LN), and service nodes (SN). The CN and ION share the same compute ASIC.
In addition, associated with a prescribed plurality of processing nodes is a dedicated node that comprises a quad-processor with external memory, for handling of I/O communications to and from the compute nodes. Each I/O node has an operating system that can handle basic tasks and all the functions necessary for high performance real time code. The I/O nodes contain a software layer above the layer on the compute nodes for handling host communications. The choice of host will depend on the class of applications and their bandwidth and performance requirements.
In an embodiment, each compute node of the massively parallel computer architecture is connected to six neighboring nodes via six bi-directional torus links, as depicted in the three-dimensional torus sub-cube portion shown in
The ASIC that powers the nodes is based on system-on-a-chip (s-o-c) technology and incorporates all of the functionality needed by the system. The nodes themselves are physically small allowing for a very high density of processing and optimizing cost/performance.
In the overall architecture of the multiprocessor computing node 50 implemented in a parallel computing system shown in
A mechanism is provided whereby certain of the torus links on the I/O nodes can be configured in such a way that they are used as additional I/O links into and out of that I/O node; thus each I/O node may be attached to more than one compute node.
In one embodiment of the invention, in order to route I/O packets, there is a separate virtual channel (VC) and separate network injection and reception FIFOs for I/O traffic. Each VC has its own internal network buffers; thus system packets use different internal buffers than user packets. All I/O packets use the system VC. The VC may also be used for kernel-to-kernel communication on the compute nodes, but this VC may not be used for user packets.
In addition, with reference to
The packet header also has additional ioreturn bits. When a packet is injected on an I/O node, if the ioreturn bits are not set, the packet is routed to another I/O node on the I/O torus using the hint bits and destination. If the ioreturn bits are set, they indicate which link the packet should be sent out on first. This may be the I/O link, or one of the other torus links that are not used for intra-I/O node routing.
When a packet with the ioreturn bits set arrives at a compute node (the I/O entrance node), the network logic has an I/O link hint bit calculator. If the hint bits in the header are 0, this hint bit calculator inspects the destination coordinates, and sets the hint bits appropriately. Then, if any hint bits are set, those hint bits are used to route the packet to its final compute node destination. If hint bits are already set in the packet when it arrives at the entrance node, those hint bits are used to route the packet to its final compute node destination. In an embodiment, on the entrance node, packets for different compute nodes are not placed into the memory of the entrance node and need not be re-injected into the network. This reduces memory and processor utilization on the entrance nodes.
On the I/O VC, within the compute or I/O torus packets are routed deterministically following rules referred to as the “bubble” rules. When a packet enters the I/O link from a compute node, the bubble rules are modified so that only one token is required to go on the I/O link (rather than two as in strict bubble rules). Similarly, when a packet with the ioreturn bits set is injected into the network, the packet only requires one, rather than the usual two tokens.
If the compute nodes are a mesh in a dimension, then the ioreturn bits can be used to increase bandwidth between compute and I/O nodes. At the end of the mesh in a dimension, instead of wrapping a link back to another compute node, a link in that dimension may be connected instead to an I/O node. Such a compute node can inject packets with ioreturn bits set that indicate which link to use (connected to an I/O node). If a link hint bit calculator is attached to the node on the other end of the link, the packet can route to a different I/O node. However, with the mechanism described above, this extra link to the I/O nodes can only be used for packets injected at that compute node. This restriction could be avoided by having multiple toio bits in the packet, where the bit indicates which outgoing link to the I/O node should be used.
Further, in one aspect, a system and method are provided that relates to embedding global barrier and collective networks in a parallel computing system organized as a torus network, such as the BGQ platform shown in
In an embodiment, each compute node of the massively parallel computer architecture is connected to six neighboring nodes via six bi-directional torus links, as depicted in the three-dimensional torus sub-cube portion shown at 90010 in
The ASIC that powers the nodes is based on system-on-a-chip (s-o-c) technology and incorporates all of the functionality needed by the system. The nodes themselves are physically small allowing for a very high density of processing and optimizing cost/performance.
The BG/Q network is a 5-dimensional (5-D) torus for the compute nodes. In a compute chip, besides the 10 bidirectional links to support the 5-D torus, there is also a dedicated I/O link running at the same speed as the 10 torus links that can be connected to an I/O node.
The BG/Q torus network originally supports 3 kinds of packets: (1) point-to-point DATA packets from 32 bytes to 544 bytes, including a 32 byte header and a 0 to 512 byte payload in multiples of 32 bytes, as shown in
The receiver logic diagram is shown in
The sender logic block diagram is shown in
To embed a collective network over the 5-D torus, a new collective DATA packet type is supported by the network logic. The collective DATA packet format shown in
The collective word length indicates the operand size in units of 2^n*4 bytes for signed and unsigned integer operations, while the floating point operand size is fixed at 8 bytes (64 bit double precision floating point numbers). The collective class route identifies one of 16 class routes that are supported on the BG/Q machine. On a single node, the 16 classes are defined in Device Control Ring (DCR) control registers. Each class has 12 input bits identifying input ports, for the 11 receivers as well as the local input; and 12 output bits identifying output ports, for the 11 senders as well as the local output. In addition, each class definition also has 2 bits indicating whether the particular class is used as user Comm_World (e.g., all compute nodes in this class), user sub-communicators (e.g., a subset of compute nodes), or system Comm_World (e.g., all compute nodes, possibly with I/O nodes serving the compute partition also).
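As a minimal sketch of the two encodings just described, the following Python helpers compute the integer operand size from the word-length field and list the ports enabled in a 12 bit class-map field. The sketch reads the operand-size formula as 2^n*4 bytes and assumes that bit i, counted from the least significant bit, corresponds to port i. The helper names and that bit ordering are illustrative assumptions, not part of the described hardware.

```python
FLOAT_OPERAND_BYTES = 8  # floating point operands are fixed at 64 bits


def integer_operand_bytes(n):
    """Integer operand size for word-length exponent n, read as 2**n * 4 bytes."""
    if n < 0:
        raise ValueError("word-length exponent must be non-negative")
    return (2 ** n) * 4


def enabled_ports(mask, width=12):
    """Return indices of set bits in a class-map field (assumed LSB = port 0)."""
    if not 0 <= mask < (1 << width):
        raise ValueError("mask does not fit in the class-map field")
    return [i for i in range(width) if (mask >> i) & 1]
```

For example, an exponent of 0 yields a 4 byte operand and an exponent of 3 yields a 32 byte operand.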
The algorithm for setting up dead-lock free collective classes is described in co-pending patent application YOR920090598US1. An example of a collective network embedded in a 2-D torus network is shown in
In byte 3 of the collective DATA packet header, bits 3 to 4 define a collective operation type, which can be (1) broadcast, (2) all-reduce or (3) reduce. Broadcast means one node broadcasts a message to all the nodes; there is no combining of data. In an all-reduce operation, each contributing node in a class contributes a message of the same length, the input message data in the data packet payload from all contributing nodes are combined according to the collective OP code, and the combined result is broadcast back to all contributing nodes. The reduce operation is similar to all-reduce, but in a reduce operation the combined result is received only by the target node; all other nodes discard the broadcast they receive.
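The difference between the three operation types can be illustrated with a small software model. This is only a sketch: the function name `run_collective`, the string labels, and the use of `None` to represent a discarded broadcast are assumptions made for illustration, and the model ignores packet formats and routing entirely.

```python
from functools import reduce as fold
import operator


def run_collective(kind, contributions, op=operator.add, root=0, target=0):
    """Return what each node ends up with after one collective operation.

    contributions[i] is the payload contributed by node i.
    """
    n = len(contributions)
    if kind == "broadcast":
        # One node's message is delivered to every node; nothing is combined.
        return [contributions[root]] * n
    combined = fold(op, contributions)  # combine according to the OP code
    if kind == "allreduce":
        # Every contributing node receives the combined result.
        return [combined] * n
    if kind == "reduce":
        # Only the target keeps the result; the others discard the broadcast.
        return [combined if i == target else None for i in range(n)]
    raise ValueError("unknown collective type: %r" % kind)
```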
In the Blue Gene/Q compute chip (BQC) network logic, two additional collective injection fifos (one user+one system) and two collective reception fifos (one user+one system) are added for the collective network, as shown in
A diagram of the central collective logic block 306 is shown in
When the torus network is routing point-to-point packets, priority is given to system packets. For example, when both user and system requests (either from receivers or from injection fifos) are presented to a sender, the network will give the grant to one of the system requests. However, when the collective network is embedded into the torus network, there is a possibility of livelock because, at each node, both system and user collective operations share the up-tree and down-tree logic paths, and each collective operation involves more than one node. For example, a continued stream of system packets going over a sender could block a down-tree user collective on the same node from progressing. This down-tree user collective class may include other nodes that happen to belong to another system collective class. Because the user down-tree collective already occupies the down-tree collective logic on those other nodes, the system collective on the same nodes then cannot make progress. To avoid the potential livelock between the collective network traffic and the regular torus network traffic, the arbitration logic in both the central collective logic and the senders is modified.
In the central collective arbiter, shown in
In addition, the down-tree arbitration logic in the central collective block also implements a DCR programmable timeout, where if the request to a given sender does not make progress for a certain time, all requests to different senders and/or local reception fifo involved in the broadcast are cancelled and a new request/grant arbitration cycle will follow.
In the network sender, the arbitration logic priority is further modified as follows, in order of descending priority;
On BlueGene/L and BlueGene/P, the global barrier network is a separate and independent network. The same network can be used for (1) global AND (global barrier) operations, or (2) global OR (global notification or global interrupt) operations. For each programmable global barrier bit on each local node, a global wired logical “OR” of all input bits from all nodes in a partition is implemented in hardware. The global AND operation is achieved by first “arming” the wire, in which case each node programs its own bit to ‘1’. After each node participating in the global AND (global barrier) operation has finished arming its bit, a node then lowers its bit to ‘0’ when the global barrier function is called. The global barrier bit will stay at ‘1’ until all nodes have lowered their bits, therefore achieving a logical global AND operation. After a global barrier, the bit then needs to be re-armed. On the other hand, to do a global OR (for global notification or global interrupt operation), each node would initially lower its bit, and then any one node could raise a global attention by programming its own bit to ‘1’.
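The way a wired OR yields both a global AND and a global OR can be modeled in a few lines of Python. The class and method names below are hypothetical and the model covers a single barrier bit only; it is a sketch of the logic, not of the hardware.

```python
class WiredOrBit:
    """Software model of one global barrier bit shared by n_nodes nodes."""

    def __init__(self, n_nodes):
        self.bits = [0] * n_nodes

    def wire(self):
        # The hardware wire is the logical OR of every node's bit.
        return int(any(self.bits))

    # Global AND (barrier) usage
    def arm(self):
        self.bits = [1] * len(self.bits)  # every node raises its bit

    def arrive(self, node):
        self.bits[node] = 0  # node lowers its bit on entering the barrier

    def barrier_complete(self):
        return self.wire() == 0  # wire drops only after all have arrived

    # Global OR (notification / interrupt) usage
    def lower_all(self):
        self.bits = [0] * len(self.bits)

    def raise_attention(self, node):
        self.bits[node] = 1  # any single node can pull the wire high
```

In barrier use the wire stays at 1 until the last node arrives, which is what makes the OR behave as an AND of "has arrived" conditions.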
To embed the global barrier and global interrupt network over the existing torus network, in one embodiment, a new GLOBAL_BARRIER packet type is used. This packet type, an example of which is shown in
The logic addition includes each receiver's packet decoder (shown at 90416 in
Each class map (collective or global barrier) has 12 input bits and 12 output bits. When the bit is high or set to ‘1’, the corresponding port is enabled. A typical class map will have multiple inputs bits set, but only one output bit set, indicating the up tree link. On the root node of a class, all output bits are set to zero, and the logic recognizes this and uses the input bits for outputs. Both collective and global barrier have separated up-tree logic and down-tree logic. When a class map is defined, except for the root node, all nodes will combine all enabled inputs and send to the one output port in an up-tree combine, then take the one up-tree port (defined by the output class bits) as the input of the down-tree broadcast, and broadcast the results to all other senders/local reception defined by the input class bits, i.e., the class map is defined for up-tree operation, and in the down-tree logic, the actual input and output ports (receivers and senders) are reversed. At the root of the tree, all output class bits are set to zero, the logic combines data (packet data for collective, global barrier state for global barrier) from all enabled input ports (receivers), reduces the combined logic to a single result, and then broadcast the result back to all the enabled outputs (senders) using the same input class bits, i.e., the result is turned around and broadcast back to all the input links.
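The port reversal between up-tree and down-tree operation, and the special handling of the root, can be sketched as follows. The function name `tree_ports`, the dictionary layout, and the assumption that bit i (from the least significant bit) denotes port i are illustrative choices, not taken from the description.

```python
def _ports(mask, width=12):
    """Indices of the set bits in a 12 bit class-map field (assumed LSB = port 0)."""
    return [i for i in range(width) if (mask >> i) & 1]


def tree_ports(input_bits, output_bits):
    """Derive up-tree and down-tree port roles from one class map."""
    ins = _ports(input_bits)
    outs = _ports(output_bits)
    if output_bits == 0:
        # Root: combine all enabled inputs, then turn the result around and
        # broadcast it back over the same links named by the input bits.
        return {"is_root": True, "uptree_in": ins, "uptree_out": [],
                "downtree_in": [], "downtree_out": ins}
    # Non-root: combine inputs toward the up-tree port; on the way down the
    # up-tree port becomes the source and the input ports become the outputs.
    return {"is_root": False, "uptree_in": ins, "uptree_out": outs,
            "downtree_in": outs, "downtree_out": ins}
```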
On BlueGene/L and BlueGene/P, each global barrier is implemented by a single wire per node, and the effective global barrier logic is a global OR of all input signals from all nodes. Because there is a physical limit on the size of the largest machine, there is an upper bound on the signal propagation time, i.e., the round trip latency of a barrier, from the furthest node going up-tree to the root and back down-tree to the end of the barrier tree, is limited, typically to within about one microsecond. Thus a simple timer tick is implemented for each barrier: one will not enter the next barrier until a preprogrammed time has passed. This allows each signal wire on a node to be used as an independent barrier. However, on BlueGene/Q, when the global barrier is embedded in the torus network, because of the possibility of link errors on the high speed links, and the associated retransmission of packets in the presence of link errors, it is, in an embodiment, impossible to come up with a reliable timeout without making the barrier latency unnecessarily long. Therefore, one has to use multiple bits for a single barrier. In fact, each global barrier requires 3 status bits; the 3 byte barrier state in Blue Gene/Q therefore supports 8 barriers per physical link.
To initialize a barrier of a global barrier class, each node first programs its 3 bit barrier control register to “100” and then waits for its own barrier state to become “100”, after which a different global barrier is called to ensure that all contributing nodes in this barrier class have reached the same initialized state. This global barrier can be either a control system software barrier when the first global barrier is being set up, or an existing global barrier in a different class that has already been initialized. Once the barrier of a class is set up, the software can then go through the following steps without any other barrier classes being involved. (1) From “100”, the local global barrier control for this class is set to “010”, and when the first bit of the 3 status bits reaches ‘0’, the global barrier for this class is achieved. Because of the nature of the global OR operations, the 2nd bit of the global barrier status will reach ‘1’ either before or at the same time as the first bit goes to ‘0’, i.e., when the 1st bit is ‘0’, the global barrier status bits will be “010”, but they might have gone through an intermediate “110” state first. (2) For the second barrier, the global barrier control for this class is set from “010” to “001”, i.e., the second bit is lowered and the 3rd bit is raised, and the software waits for the 2nd bit of the status to change from ‘1’ to ‘0’. (3) Similarly, the third barrier is done by setting the control state from “001” to “100”, and then waiting for the third bit to go low. After the 3rd barrier, the whole sequence repeats.
An embedded global barrier requires 3 bits, but if configured as a global interrupt (global notification), each of the 3 bits can be used separately; every 3 notification bits, however, share the same class map.
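The three-step rotation can be checked with a short simulation in which the global status is simply the bitwise OR of every node's 3 bit control value. The function names and the use of strings for the bit patterns are illustrative assumptions; the sketch starts from the already initialized “100” state and does not model the initialization barrier.

```python
CONTROL_SEQUENCE = ["100", "010", "001"]  # control value before barriers 1, 2, 3


def global_status(controls):
    """Bitwise OR of all nodes' 3 bit control strings."""
    return "".join(
        "1" if any(c[i] == "1" for c in controls) else "0" for i in range(3)
    )


def enter_barrier(controls, node, phase):
    """Node enters barrier number phase (0, 1 or 2) by advancing its control."""
    controls[node] = CONTROL_SEQUENCE[(phase + 1) % 3]


def barrier_reached(controls, phase):
    """Barrier phase is complete once status bit number phase has dropped to 0."""
    return global_status(controls)[phase] == "0"


# Walk three nodes through the first barrier of an initialized class.
controls = ["100", "100", "100"]
enter_barrier(controls, 0, 0)
intermediate = global_status(controls)   # some nodes have not arrived yet
enter_barrier(controls, 1, 0)
enter_barrier(controls, 2, 0)
final = global_status(controls)          # all nodes have arrived
```

While only some nodes have arrived the status passes through “110”, which is the intermediate state mentioned above; the barrier is observed only when the first bit reaches ‘0’.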
While the BG/Q network design supports all 5 dimensions labeled A, B, C, D, E symmetrically, in practice, the fifth E dimension, in one embodiment, is kept at 2 for BG/Q. This allows the doubling of the number of barriers by keeping one group of 8 barriers in the E=0 4-D torus plane, and the other group of 8 barriers in the E=1 plane. The barrier network processor memory interface therefore supports 16 barriers. Each node can set a 48 bit global barrier control register, and read another 48 bit barrier state register. There is a total of 16 class maps that can be programmed, one for each of 16 barriers. Each receiver carries a 24 bit barrier state, so does each sender. The central barrier logic takes all receiver inputs plus local contribution, divides them into 16 classes, then combines them into an OR of all inputs in each class, and the result is then sent to the torus senders. Whenever a sender detects that its local barrier state has changed the sender sends the new barrier state to the next receiver using the GLOBAL_BARRIER packet. This results in an effective OR of all inputs from all compute and I/O nodes within a given class map. Global barrier class maps can also go over the I/O link to create a global barrier among all compute nodes within a partition.
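The barrier counts above follow from the register widths: 24 link bits at 3 bits per barrier give 8 barriers per link, and 48 node register bits give 16. The Python sketch below makes that arithmetic explicit and shows one possible way to address a barrier's 3 bit field. The field placement (barrier k in bits 3k to 3k+2 from the least significant end) and the helper names are assumptions for illustration; the actual register layout is not given here.

```python
BITS_PER_BARRIER = 3
LINK_STATE_BITS = 24      # barrier state carried by each receiver and sender
NODE_REGISTER_BITS = 48   # global barrier control / state register per node

BARRIERS_PER_LINK = LINK_STATE_BITS // BITS_PER_BARRIER
BARRIERS_PER_NODE = NODE_REGISTER_BITS // BITS_PER_BARRIER


def barrier_field(register, k):
    """Read the 3 bit field of barrier k (assumed layout, LSB first)."""
    if not 0 <= k < BARRIERS_PER_NODE:
        raise ValueError("barrier index out of range")
    return (register >> (BITS_PER_BARRIER * k)) & 0b111


def set_barrier_field(register, k, value):
    """Return register with barrier k's 3 bit field replaced by value."""
    if not 0 <= k < BARRIERS_PER_NODE:
        raise ValueError("barrier index out of range")
    shift = BITS_PER_BARRIER * k
    return (register & ~(0b111 << shift)) | ((value & 0b111) << shift)


def group_of(k):
    """Which 24 bit group (0 or 1) barrier k belongs to."""
    return k // BARRIERS_PER_LINK
```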
The above feature of doubling the class map is also used by the embedded collective logic. Normally, to support three collective types, i.e., user Comm_World, user sub_comm, and system, three virtual channels would be needed in each receiver. However, because the fifth dimension is a by 2 dimension on BG/Q, user COMM_WORLD can be mapped to one 4-D plane (e=0) and the system can be mapped to another 4-D plane (e=1). Because there are no physical links being shared, the user COMM_WORLD and system can share a virtual channel in the receiver, shown in
In one embodiment of the invention, because the 5th dimension is 2, the class map is doubled from 8 to 16. For global barriers, class 0 and 8 will use the same receiver input bits, but different groups of the local inputs (48 bit local input is divided into 2 groups of 24 bits). Class i (0 to 7) and class i+8 (8 to 15) cannot share any physical links; these class configuration control bits are under system control. With this doubling, each logic block in
The local state has separate wires for each group (48 bit state, 2 groups of 24 bits) and is unchanged.
The 48 global barrier status bits also feed into an interrupt control block. Each of the 48 bits can be separately enabled or masked off for generating interrupts to the processors. When one bit in a 3 bit class is configured as a global interrupt, the corresponding global barrier control bit is first initialized to zero on all nodes, then the interrupt control block is programmed to enable interrupt when that particular global barrier status bit goes to high (‘1’). After this initial setup, any one of the nodes within the class could raise the bit by writing a ‘1’ into its global barrier control register at the specific bit position. Because the global barrier logic functions as a global OR of the control signal on all nodes, the ‘1’ will be propagated to all nodes in the same class, and trigger a global interrupt on all nodes. Optionally, one can also mask off the global interrupt and have a processor poll the global interrupt status instead.
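A minimal model of the interrupt path, assuming the status register is the OR of all nodes' control registers and that an interrupt is pending for every status bit that is both set and enabled. The function names are hypothetical.

```python
MASK48 = (1 << 48) - 1  # the status and control registers are 48 bits wide


def global_or(control_registers):
    """Global barrier status: OR of the control registers of all nodes."""
    status = 0
    for reg in control_registers:
        status |= reg
    return status & MASK48


def pending_interrupts(status, enable_mask):
    """Status bits that are high and enabled for interrupt generation."""
    return status & enable_mask & MASK48


# Four nodes start with the chosen bit at zero; node 2 then raises bit 7.
controls = [0, 0, 0, 0]
controls[2] |= 1 << 7
status = global_or(controls)
```

With the enable mask cleared, `pending_interrupts` returns zero and a processor would instead poll `status`, matching the masked-off option described above.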
On BlueGene/Q, while the global barrier and global interrupt network is implemented as a global OR of all global barrier state bits from all nodes (logic 91220 and 91240), it provides both global AND and global OR operations. Global AND is achieved by utilizing a ‘1’ to ‘0’ transition on a specific global barrier state bit, and global OR is achieved by utilizing a ‘0’ to ‘1’ transition. In practice, one can also implement the logic blocks 91220 and 91240 as AND reduces, in which case global AND is achieved with a ‘0’ to ‘1’ state transition and global OR with a ‘1’ to ‘0’ transition. Any logically equivalent implementations to achieve the same global AND and global OR operations should be covered by this invention.
Cooling
Blue Gene/Q racks are indirect water cooled. The reasons for water cooling are (1) to maintain the junction temperatures of the optical modules below their maximum operating temperature of 55 C, and (2) to reduce infrastructure costs. The preferred embodiment is to use a serpentine water pipe which lies above the node card. Separable metal heat-spreaders lie between this pipe and the major heat producing devices. Compute cards are cooled with a heat-spreader on one side only, with backside DRAMs cooled by a combination of conduction and modest airflow which is required for the low power components.
Optical modules have a failure rate which is a strong function of temperature. The operating range is 20 C to 55 C, but highest reliability and lowest error rate is achieved if an even temperature at the low end of this range can be maintained. This favors indirect water cooling.
Using indirect water cooling in this manner requires control of the water temperature above dew point, to avoid condensation on the exposed water pipes. This indirect water cooling can result in dramatically reduced operating costs as the power to run larger chillers can be largely avoided. Where a 7.5 MW power and cooling upgrade is to be provided for a 96-rack system, this would be an ideal time to also save dramatically on infrastructure costs by providing water not at the usual 6 C for air conditioning, but rather at the 18 C minimum temperature for indirect water cooling.
In a further aspect a system and method is provided to accurately predict a processor's operational lifetime by assessing the aging characteristics at the architecture level in an environment where process variation exists.
In light of the above, a method and a system of accurately estimating and adjusting for system-level aging are disclosed.
Even though the discussion below is relevant to a single-core, dual-core or a multi-core processor, for clarity purposes, the discussion below will generally refer to a multi-core processor (referred to hereinafter as processor).
Moreover, the term “core,” as used in the discussion below, generally refers to any computing block or a processing unit, with data storing and data processing/computing capability, or any combination of the two.
Furthermore, the term “memory,” as used in the discussion below, generally refers to any computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, application specific integrated circuits (ASICs), flash memory, solid state memory, firmware or any type of media suitable for storing electronic instructions.
Additionally, the term “effective aging profile” as used in the discussion below, may be interchangeably used with the term “predicted operational lifetime.”
Also, it should be noted that at the design stage, a certain clock-frequency target, a thermal design point and a voltage are provided. However, at the manufacturing stage, due to process variation, the processor and its components may have threshold voltages that differ from those assumed earlier at the design stage. Consequently, the processor and its components may require different supply voltages in order to run at the targeted frequency. Moreover, in the context of process variation, existing aging analysis and prediction techniques often do not provide accurate results. As a result, processor aging is not predicted or prevented properly, causing longer down-time and less reliable processors.
Finally, all contents of U.S. Pat. Nos. 7,472,038 and 7,486,107 are hereby expressly incorporated by reference herein as if fully set forth herein.
The design stage data that may be relevant for this analysis may include, for example, architecture redundancy, circuit characteristics, target frequency and assumed switch factors. The manufacturing stage data that may be relevant for this analysis may include, for example, threshold voltages, as measured by aging sensors, and supply voltages, as determined by manufacturing tests. The design and manufacturing stage data may form the inputs for calculating effective aging for each core of the processor using an aging model, such as a Diffusion-Reaction (hereinafter DR) model or any of its derivative models or any other aging models for estimating the operating lifetime of a processor. Furthermore, different aging models can be used for different components/parts/structure/steps in the method or the system.
The calculation of effective aging may occur at the manufacturing facility after the processor has been manufactured. The data that is output from the calculation of effective aging for each core of the processor may be stored in a data structure, such as a history table, which may be stored in memory internal or external to the processor. In one embodiment, the history table is a table in which various kinds of information related to the calculation of the effective aging profile are registered, stored, organized and capable of being retrieved for later use by the processor or logic device.
A list and description of some exemplary known aging models may be found at http://www.iue.tuwien.ac.at/phd/wittmant/node10.html#SECTION001020000000000000000. Other exemplary known aging models are described in ‘M. A. Alam and S. Mahapatra, “A Comprehensive Model of PMOS NBTI Degradation,” Microelectronics Reliability, vol. 45, no. 2005, pp. 71-81, 2004’ and ‘S. Ogawa and N. Shiono, “Generalized Diffusion-Reaction Model for the Low-Field Charge-Buildup Instability at the Si—SiO2 Interface,” Physical Review B, vol. 51, no. 7, pp. 4218-4230, 1995’ and ‘M. A. Alam, “A Critical Examination of the Mechanics of Dynamic NBTI for PMOSFETs,” in Proc. Int. Electron Devices Meeting (IEDM), pp. 14.4.1-14.4.4, 2003.’ All contents of all documents cited in this paragraph are hereby expressly incorporated by reference herein as if fully set forth herein.
At step 94102, a review of current selection of operating cores of the processor, their frequency and their voltages is done. This review is done in order to be later used for effective aging profile calculation.
At step 94103, a determination is made if the aging has exceeded the threshold for a redo analysis. This determination is made in order to determine if it is necessary to reconfigure the processor's current operating settings. It should be noted that different types of aging may have different indicators to trigger this determination. For example, while timing, a measurement of signal propagation speed in transistors, may be an adequate indicator for NBTI-induced aging, for other types of aging, such as EM, timing may not be the proper indicator.
Furthermore, it should be noted that this determination is architecture and technology dependent. For example, the redo analysis timing for a 45 nm processor architecture may be different from that for a 22 nm processor architecture. Regardless, if a determination is made that aging has not exceeded the threshold for a redo analysis, i.e., none of the preexisting criteria that trigger the redo analysis are met, then the process loops to step 94102. Otherwise, the process continues to step 94104.
At step 94104, a reading of data stored in the history table occurs. This reading, the execution of which may be triggered by the core, may also include data from other sources such as hardware counters, thermal sensors and aging sensors. The data from this reading is received by the processor or like logic device.
At step 94105, an update is made to the history table, wherein the cells in the history table are populated with new data received from hardware counters, thermal sensors and aging sensors. The execution of this update may be triggered by the core.
At step 94106, an effective aging profile is calculated and stored in the history table with a corresponding time stamp. The execution of calculation of the effective aging profile may be triggered by a core to measure its own or other cores' effective aging profile. It is possible that after this calculation, the hardware counters and thermal sensors may be reset and the corresponding entries in the history table may be cleared in order to allow for subsequent storing of new information for the time interval beginning from after the current calculation until the next time when effective aging profile needs to be recalculated.
Moreover, at the time of recalculation of the effective aging profile, the history table may receive the data from the aging sensors from each core of the processor. These readings may provide an accurate estimate of how much aging has occurred to the aging sensor itself when it was exposed to the switching factors of 1.0. Accordingly, by using the temperature, variation, voltage and frequency information gathered from step 94101, and assuming switching factors of 1.0, the estimated aging rate of the aging sensor may be calculated. By comparing the estimated aging rate and the actual aging rate from measuring the aging sensor, coefficients in the aging model may be recalibrated in order to specifically account for process variation at that core. The effective aging profile calculation then may use, in one embodiment, the aging model with the calibrated coefficients, to recalculate the predicted operational lifetime for the core. The calculation may use information from history table that may include switching factors as measured by the hardware counters, the temperature as measured by thermal sensors, frequency and voltage and the previous predicted operational lifetime (and VT-shift) of the cores. The effective aging profile may also account for architecture redundancy.
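The recalibration idea, comparing the sensor's estimated aging with its measured aging and adjusting the model coefficients accordingly, can be sketched with a deliberately simple placeholder model. Everything specific below is an assumption made for illustration: the power-law form `coeff * hours ** exponent`, the exponent value, the single multiplicative coefficient, and all function names. The actual aging model (for example a DR model) and its coefficients are not specified in this description.

```python
def predicted_vt_shift(coeff, hours, exponent=0.25):
    """Placeholder aging model: threshold-voltage shift after `hours` of stress."""
    return coeff * hours ** exponent


def recalibrate(coeff, estimated_shift, measured_shift):
    """Scale the coefficient so the model reproduces the aging sensor reading."""
    if estimated_shift <= 0:
        raise ValueError("estimated shift must be positive")
    return coeff * measured_shift / estimated_shift


def remaining_lifetime_hours(coeff, vt_limit, hours_used, exponent=0.25):
    """Hours left until the modeled shift reaches vt_limit (never negative)."""
    total_hours = (vt_limit / coeff) ** (1.0 / exponent)
    return max(0.0, total_hours - hours_used)


# Hypothetical numbers: the sensor aged faster than the design-stage estimate.
design_coeff = 0.01
estimate = predicted_vt_shift(design_coeff, 10000.0)   # model prediction
core_coeff = recalibrate(design_coeff, estimate, measured_shift=0.12)
remaining = remaining_lifetime_hours(core_coeff, vt_limit=0.24, hours_used=10000.0)
```

A per-core coefficient of this kind is one way the process variation at that core could be folded into the lifetime prediction; a fuller version would also weight the stress time by the measured switching factors, temperature, voltage and frequency stored in the history table.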
Additionally, on a system that supports Dynamic Voltage and Frequency Scaling (DVFS), where frequencies and supply voltages of each core could change when going into less demanding tasks or an idle state to save power, the effective aging profile may be recalculated in response to the occurrence of these events, or the voltage/frequency states can be recorded and used later for recalculating the effective aging profile.
Effective aging profile is calculated at pre-determined periods appropriate for the corresponding aging process. For example, effective aging profile may be calculated and updated once in a few days or any time period that is relevant for the operating/server and workload conditions. It is also possible to customize the update frequency interval.
The steps shown in the
Furthermore, the times at which the effective aging profile is calculated may relate to a change in the voltage, frequency or workload as detected by hardware counters or by thermal sensors, or may be as requested by a user, or may coincide with a system-level event such as rebooting, a change of workload, an Operating System (OS) context switch, an OS-driven idle period, periodic maintenance, or a change of frequency/voltage by the OS to conserve energy.
Current literature on transistor level aging models provides detailed dependencies for voltage, temperature and other parameters. For example, aging simulations are run on a processor core using a voltage of 1.0V, a frequency of 2 GHz and a fixed temperature of 85° C., assuming a switching factor of 1.0. The circuit characteristics, such as cycle-time constraint, threshold voltage, circuit types and circuit criticality, are known in advance since they are designed in advance.
During processor operation, the processor uses hardware counters, and aging and temperature sensors to capture data relating to the actual operating conditions and supply voltage. Next, the processor may supply this data to a software module or a logic circuit which calculates the aging profile. In microprocessor architecture, the aging profile is often a vector that covers different types of components with different aging characteristics. Yet, for the sake of simplicity, a single value is used for the rest of the text, and its use should not be construed as limiting. Suppose, for example, that the chip was actually running at 0.8V and a frequency of 1.4 GHz, with temperatures varying between 60 and 85° C., and that the hardware counters measure the switching factor to be 0.21. Because these conditions differ from the simulated ones, and the processor has also been used for a while, thus already using up some of its lifetime, the aging profile metric has to be recalculated.
At step 94107, a determination is made if the processor's predicted operational lifetime meets a predetermined aging requirement. If a determination is made that the processor's predicted operational lifetime meets the predetermined aging requirement, then the process loops to step 94102. Otherwise, the process continues to step 94108.
At step 94108, a corrective reaction to prolong processor operational lifetime occurs and then the process loops to step 94106. An example of the corrective reaction may include, but not be limited to, any of the following: 1) an adjustment in the supply voltage while maintaining the same frequency, 2) an adjustment in the frequency with the same or lower supply voltage, 3) a reduction in the workload, such as an increase in the amount of idle time of the processor or a reduction in the number of operating cores, 4) a selective shut-down of cores that have short operational lifetimes and a performance of workload scheduling by using cores that have sufficiently long operational lifetimes, 5) a determination of whether task migration of application processing activity at one core in favor of another core is possible and, if the workload requires fewer cores than the total number of cores on the processor, whether one can schedule the cores to run the workload such that each core has sufficient time to rest and 6) a matching of the busiest or hottest tasks in the workload to the cores that have higher operational lifetime. The reactions above may be used individually or in combination in order to meet the processor's operational lifetime requirement. The determination of which corrective action to take may be pre-programmed in advance by a predetermined heuristic.
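The flow through steps 94103 to 94108 can be summarized as a small control loop. The sketch below is only one possible reading: the function `aging_management_pass`, its callback parameters, the toy lifetime formula, and the fixed ordering of corrective actions (standing in for the predetermined heuristic) are all assumptions made for illustration.

```python
def aging_management_pass(redo_needed, predict_lifetime, required_lifetime,
                          corrective_actions):
    """Run one pass of the aging check; return what was decided."""
    if not redo_needed():                      # step 94103: threshold exceeded?
        return {"redo": False, "lifetime": None, "applied": []}
    lifetime = predict_lifetime()              # steps 94104-94106
    applied = []
    for action in corrective_actions:          # tried in heuristic order
        if lifetime >= required_lifetime:      # step 94107: requirement met
            break
        action()                               # step 94108: corrective reaction
        applied.append(action.__name__)
        lifetime = predict_lifetime()          # loop back to step 94106
    return {"redo": True, "lifetime": lifetime, "applied": applied}


# Toy processor state and a made-up lifetime formula, for demonstration only.
state = {"voltage": 1.0, "active_cores": 4}


def lower_voltage():
    state["voltage"] = 0.5


def reduce_cores():
    state["active_cores"] = 2


def toy_lifetime_years():
    return 4.0 / state["voltage"]


result = aging_management_pass(lambda: True, toy_lifetime_years, 7.0,
                               [lower_voltage, reduce_cores])
```

In this toy run the first action already satisfies the requirement, so the second action is never applied.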
Block 94201 performs step 94101 depicted in
In one example, block 94201, which may be a logic circuit programmed to perform its function, receives data from inputs 94201a-d which relate to circuit characteristics, architecture redundancy, assumed workload data and assumed operating conditions, respectively. Data from inputs 94201a-d may be used for determining the aging profile (ideal processor operational lifetime) by calculating the effective aging for each core of the processor using an aging model. The formula and coefficients are stored in the history table for later use in calculating an effective aging profile when actual workload and operating conditions are available. Alternatively, the formula and coefficients may be stored in memory, internal or external to the processor, where the core or processor controller can have access to when they calculate the operational lifetime.
Data received from input 94201a is related to circuit and device characteristics such as the connectivity of logic/SRAM design, target cycle time, gate oxide thickness and capacitance and VT.
In one embodiment, data received from input 94201b is related to architecture characteristics and redundancy such as the duplication of critical components of a system with the intention of increasing reliability of the system as often done in the case of a backup or fail-safe. In a different embodiment, the architecture data and redundancy information are taken into account at the aging analyzer stage.
Data received from input 94201c is related to workload data such as assumed clock-gating factors and switching factors.
Data received from input 94201d is related to assumed operating conditions such as voltage, frequency and temperature.
The output of block 94201 (aging profile) is then input into block 94202 where it is compared to process variation data (expected core lifetime based on actual physical measuring of the core at the post-manufacturing stage). Process variation measurements may be done by determining VT using aging sensors or by applying different voltages to the processor and measuring the propagation speed of each component. Block 94202 may be a logic circuit programmed to perform its function. The output of block 94202 may then be passed to the processor's controller, which may then optimize the global chip lifetime based on core values.
The output of block 94202 is then fed into block 94203 where tuning for effective aging occurs. Since process variation and processor aging profile characteristics are not deterministic and have wafer and chip-level (or even finer-grain) characteristics, process variation data and the aging profile characteristics are fed into the effective aging profile unit to tune it for the specific processor. The design and manufacturing stage data may be used for calculating effective aging for each core of the processor using an aging model, for example, the DR model or other model. The calculation of effective aging may occur at the manufacturing facility after the processor has been manufactured. The data that is output from the calculation of effective aging for each core of the processor may be stored in a history table, which may be stored in memory internal or external to the processor. The history table is a table in which various kinds of information related to the calculation of the effective aging profile are registered, stored, organized and capable of being retrieved for later use. Block 94202 may be a logic circuit programmed to perform its function and be configured to store the calibrated formula and coefficients mentioned above.
Block 94204a performs steps 94102-94105 depicted in
Block 94204 performs step 94106 depicted in
When it is time to recalculate effective aging profile, the history table reads the data output from aging sensors from each core of the processor. The aging sensor readings provide an accurate estimate of how much aging has occurred to the aging sensor when exposed to the switching factors of 1.0.
By using the temperature, voltage, frequency and process variation information from blocks 94201-94203, and assuming switching factors of 1.0, the estimated aging rate of the aging sensor may be calculated. By comparing the estimated aging rate and the actual aging rate from measuring of the aging sensor, recalibration of coefficients in the aging model, to tailor specifically to the processor to account for process variation, may be possible.
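The recalibration step described above can be sketched as follows. This is a minimal illustration only: it assumes a simple power-law aging model (VT shift = coefficient × time^exponent) as a stand-in for whatever aging model is actually used, and the function names, the exponent and all numeric values are hypothetical.

```python
def estimated_vt_shift(coeff, hours, exponent=0.25):
    """Hypothetical power-law aging model: VT shift (volts) after `hours`."""
    return coeff * hours ** exponent


def recalibrate(coeff, hours, measured_vt_shift, exponent=0.25):
    """Scale the coefficient so the model reproduces the sensor measurement."""
    estimate = estimated_vt_shift(coeff, hours, exponent)
    return coeff * measured_vt_shift / estimate


# Assumed values, for illustration only.
initial_coeff = 0.002      # coefficient from design/manufacturing-stage data
hours_elapsed = 10_000.0   # time the aging sensor has switched at factor 1.0
measured_shift = 0.025     # VT shift actually read from the aging sensor (V)

calibrated = recalibrate(initial_coeff, hours_elapsed, measured_shift)
```

With the calibrated coefficient, the model reproduces the measured shift of the sensor, and the same coefficient may then be applied to the core using its own measured switching factors.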
The effective aging profile calculation may then use the aging model with the calibrated coefficients, to recalculate the predicted operational lifetime for the core. The calculation may use the information from the history table that may include switching factor as measured by the hardware counters, the temperature as measured by thermal sensors, frequency and voltage and the previous predicted operational lifetime (and VT-shift) of the cores.
It is possible that after this calculation, the hardware counters and thermal sensors may be reset and the corresponding entries in the history table may be cleared in order to allow for new information storing for the time interval beginning from after the current calculation until the next time when effective aging profile needs to be recalculated. Also, if effective aging profile needs to be recalculated, data from aging sensors may be read and stored in the history table. The calculated effective aging profile may also be stored in history table for future use. A time stamp detailing when the reading is made may also be stored in the table in order to associate with each aging sensor reading.
Because aging is a slow process, the effective aging profile does not need to be calculated and updated frequently. For example, effective aging profile may be calculated and updated once in a few days. It is also possible to customize the update frequency interval.
Also, recalculation of the effective aging profile may be triggered by a sudden change in the voltage, frequency or workload as detected by hardware counters or by thermal sensors, or may be requested by a user upon a system-level event such as rebooting, a change of workload, an Operating System (OS) context switch, an OS-driven idle period, periodic maintenance, or a change of frequency/voltage made by the OS to conserve energy.
Upon calculation of the effective aging profile, block 94204 feeds block 94205 a data signal in the format of a number, a metric, a symbol or a variable. The execution of calculation of the aging requirement may be triggered by a core to measure its own or other cores' results. Block 94205 may be a logic circuit programmed to perform its function. Block 94205 performs step 94107 depicted in
The aging requirement comprises a performance target and a lifetime target, where the performance target may be a clock frequency or a sustained number of operations per second, such as a number of Floating Point Operations per Second (FLOPS). The lifetime target may be the number of cores that can sustain the performance target for at least the period of time desired for the workload until the first failure. Block 94206 performs step 94108 depicted in
Furthermore, the term “workload-induced conditions” as discussed in reference to
Additionally, because the number of bits in the core or within any of its components could be substantial, the hardware counters can be programmed to sample switching factors of only a subset of bits of the critical components or of components that are more prone to switching, or to compress the bits using functions, such as XOR, before computing their switching factors.
In one embodiment, block 94304b corresponds to steps performed by an age-analyzer which is constructed such that it mimics the operation of the core it is trying to predict the aging of. The age-analyzer captures critical information in terms of the architectural characteristics of the core, types of logic and such. Because the age-analyzer closely mimics the operation of the core, the age-analyzer provides a more direct prediction of aging from its reading and reduces the need for further computations of complicated models. The age-analyzer may include or make use of aging sensors.
In different embodiments, architectural characteristics and redundancy information can be taken into account at different stages. In one embodiment, the architectural characteristics and redundancy information are factored into the calculation of the effective aging profile. In another embodiment, the architectural characteristics and redundancy information are factored in at the age-analyzer stage, but not in the effective aging profile. Specifically, if a core has several pipeline stages and its critical path is likely to be limited by some of the stages that have a combination of VT devices, SRAM and wire capacitance, then the age-analyzer will have a component that mimics each of the critical paths. For example, if a core has two critical paths, one consisting of 40% high-VT transistors and 60% SRAM, and the other consisting of 40% high-VT transistors and 60% wire, then the age-analyzer will have two structures, one consisting of 40% high-VT transistors and 60% SRAM, and the other consisting of 40% high-VT transistors and 60% wire.
The structure of the age-analyzer can also be designed to reflect redundancy present in the core wherein each of the core structures (main and spares) has a mimic in the age-analyzer. To closely mimic the workload conditions of the core, block 94304b is not receiving data from block 94304a. Rather, block 94304b actually mimics the workload switching activities that are output from block 94304a. For example, if block 94304a outputs a signal with a switch factor of 0.4, then block 94304b is also forced to switch with a factor of 0.4 (switching 40% of the time). By measuring the timing of each of the sensor structures in the age-analyzer, as exemplarily shown in
Although only one aging sensor and/or age-analyzer 94405a-d is shown in each core, aging sensors and/or age-analyzers 94405a-d may include multiple instances and various implementations of aging sensors and age-analyzers, internal or external to the core, customized for the circuit characteristic. In one embodiment, aging sensors and/or age-analyzers 94405a-d may be placed in multiple locations that are critical in timing and thus most likely to run out of lifetime early. In another embodiment, aging sensors and/or age-analyzers 94405a-d may comprise multiple implementations of circuit blocks, such as inverter chains, SRAM, combinational logic chains, accumulators, MUXes, latches of different types, and multiple transistor types, such as high-VT transistors and low-VT transistors, stacked and non-stacked transistors. Additionally, even though only one thermal sensor and one hardware counter are shown within each core 94403a-d, thermal sensors 94404a-d and hardware counters 94406a-d could include multiple instances, customized for the component of interest within any or all cores 94403a-d.
Thermal sensors 94404a-d are a type of hardware that may be implemented, for example, as a diode or a ring oscillator. Thermal sensors 94404a-d collect temperatures for core components units or cores 94403a-d that are more likely to have shorter operational lifetimes.
Aging sensors and/or age-analyzers 94405a-d are a type of hardware that may be implemented, for example, as a ring oscillator. Aging sensors and/or age-analyzers 94405a-d are exposed to the workload switching factor of 1.0 (switching every clock cycle) or another fixed value. An initial reading of aging sensors and/or age-analyzers 94405a-d, taken at the manufacturing stage, provides the process variation profile, while subsequent readings help calculate the VT shifting rate. Thus, by comparing the initial readings done at the design stage and manufacturing stage (or any other previous readings) to the subsequent readings, aging can be predicted based on how much threshold-voltage-shift (VT-shift) has occurred over time.
Hardware counters 94406a-d are a type of hardware register that keeps count of events of interest within processor 94400. For example, types of hardware counters 94406a-d that may be used include instruction and processor cycle counters, counters that count the number of cycles a certain unit is used, or counters that count how many bits are switched for a set of states in a certain unit over a period of time. Hardware counters 94406a-d are used to collect information on switching factors of cores 94403a-d or core component units. In the interest of filtering information, hardware counters 94406a-d may be customized and thus designed to collect only switching factors that represent the critical paths of cores 94403a-d that are more likely to have shorter operational lifetimes.
Furthermore, because the number of bits in the core or in any of its components could be substantial, the hardware counters can be programmed to sample switching factors of only a subset of bits of the critical components or of components that are more prone to switching, or to compress the bits using functions, such as XOR, before computing their switching factors.
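The two reduction options mentioned above, sampling a subset of bits and compressing bits with XOR, can be illustrated in software. This is a sketch under stated assumptions: the function names, the 8-bit width, the chosen bit positions and the state sequence are all hypothetical, and a real hardware counter would perform the equivalent operations in logic.

```python
def switching_factor(samples, width):
    """Fraction of bit positions that toggle between consecutive samples."""
    if len(samples) < 2:
        return 0.0
    toggles = sum(bin(a ^ b).count("1") for a, b in zip(samples, samples[1:]))
    return toggles / (width * (len(samples) - 1))


def xor_fold(value, width, folded_width):
    """Compress a `width`-bit value to `folded_width` bits by XOR-ing chunks."""
    mask = (1 << folded_width) - 1
    folded = 0
    for shift in range(0, width, folded_width):
        folded ^= (value >> shift) & mask
    return folded


def sample_bits(value, positions):
    """Keep only the listed bit positions (a subset of the monitored bits)."""
    return sum(((value >> p) & 1) << i for i, p in enumerate(positions))


# Made-up sequence of 8-bit states observed in a monitored unit.
states = [0b00000000, 0b11111111, 0b11110000, 0b11110000]

full = switching_factor(states, 8)
folded = switching_factor([xor_fold(s, 8, 4) for s in states], 4)
subset = switching_factor([sample_bits(s, [0, 7]) for s in states], 2)
```

In this made-up sequence the full and subset measurements agree, while the XOR-folded measurement is lower because toggles in both halves of a value cancel out. Any such compression is therefore an approximation whose fidelity depends on the data.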
In this exemplary embodiment, at the design stage of processor 94400, a certain clock-frequency target, a thermal design point and a voltage are assumed. However, at the manufacturing stage, due to process variation, multi-core processor 94400 and its components will have threshold voltages that differ from those assumed at the design stage. As a result, multi-core processor 94400 and its components will require different supply voltages among cores 94403a-d and within cores 94403a-d in order to run at the targeted frequency. The information from the design stage, such as architecture redundancy, circuit characteristics, target frequency and assumed switch factors, and information from the manufacturing stage, such as threshold voltages as measured by aging sensors and supply voltages as determined by manufacturing tests, form the inputs for calculating effective aging for each core 94403a-d using the aging model. The calculation of effective aging may occur at the manufacturing facility after the processor has been manufactured. The data that is output from the calculation of effective aging for each core of the processor may be stored in a history table, which may be stored in memory internal or external to the processor.
In one embodiment, during operation of multi-core processor 94400, readings from thermal sensors 94404a-d, aging sensors and/or age-analyzers 94405a-d and hardware counters 94406a-d are automatically, frequently, routinely and continuously read and stored in history table 94401. In order to more efficiently store these readings, history table 94401 can store thermal sensors 94404a-d readings in the form of average temperatures taken over a certain period of time, aging sensors and/or age-analyzers 94405a-d readings in the form of VT and hardware counters 94406a-d readings in the form of switch probability over time.
Since a computer system using multi-core processor 94400 may be shut down or restarted, history table 94401 is adapted and configured to retain its values by being implemented in persistent storage such as memory. Because of the possibility of data failure, a copy of history table 94401 may also be backed up in persistent storage such as memory.
When effective aging profile needs to be recalculated, data from aging sensors and/or age-analyzers 94405a-d is read and stored in history table 94401 with a corresponding time stamp. The execution of calculation of the effective aging profile may be triggered by any of cores 94403a-d to measure its own or other cores' effective aging profile.
Additionally, since aging is a slow process, the effective aging profile does not need to be calculated and updated frequently. For example, effective aging profile may be calculated and updated once in a few days. It is also possible to customize the update frequency interval.
Moreover, on a system that supports Dynamic Voltage and Frequency Scaling (DVFS), where the frequencies and supply voltages of each core could change when going into less demanding tasks or an idle state to save power, the effective aging profile calculation can be redone when these changes happen, or the voltage/frequency states can be recorded and used later for recalculating the effective aging profile.
Also, recalculation of the effective aging profile may be triggered by a sudden change in the voltage, frequency or workload as detected by hardware counters or by thermal sensors, or may be requested by a user upon a system-level event such as rebooting, a change of workload, an Operating System (OS) context switch, an OS-driven idle period, periodic maintenance, or a change of frequency/voltage made by the OS to conserve energy.
When it is time to recalculate the effective aging profile, history table 94401 again reads data from aging sensors and/or age-analyzers 94405a-d for each core 94403a-d. These readings provide an accurate estimate of how much aging has occurred to aging sensors and/or age-analyzers 94405a-d when they were exposed to the switching factors of 1.0.
By using the temperature, variation, voltage and frequency information from effective aging, and assuming switching factors of 1.0, one can calculate the estimated aging rate of aging sensors and/or age-analyzers 94405a-d. By comparing the estimated aging rate and the actual aging rate from the measuring output from aging sensors and/or age-analyzers 94405a-d, one can recalibrate the coefficients in the aging model to tailor specifically to the chip to account for process variation.
The effective aging profile calculation then uses the aging model with the calibrated coefficients to recalculate the predicted operational lifetime for the core. The calculation uses information from history table 94401 that includes the switching factor as measured by hardware counters 94406a-d, the temperature as measured by thermal sensors 94404a-d, the frequency and voltage, and the previous predicted operational lifetime (and VT-shift) of cores 94403a-d. The effective aging profile may also account for architecture redundancy.
It is possible that after this calculation, hardware counters 94406a-d and thermal sensors 94404a-d may be reset and the corresponding entries in history table 94401 may be cleared in order to allow for new information storing for the time interval beginning from after the current calculation until the next time when effective aging profile needs to be recalculated.
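The bookkeeping described above (accumulating readings over an interval, storing a time-stamped aging reading, then clearing the interval data) can be sketched in software. This is a hedged illustration only: the class name `CoreHistory`, its fields and the sample values are hypothetical and do not represent the layout of history table 94401.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Optional


@dataclass
class CoreHistory:
    """Hypothetical per-core history-table entry."""
    temperatures: list = field(default_factory=list)    # degrees C, this interval
    switch_factors: list = field(default_factory=list)  # 0.0 .. 1.0, this interval
    aging_readings: list = field(default_factory=list)  # (time stamp, VT) pairs
    predicted_lifetime: Optional[float] = None

    def record(self, temperature, switch_factor):
        """Append one thermal-sensor and one hardware-counter reading."""
        self.temperatures.append(temperature)
        self.switch_factors.append(switch_factor)

    def summary(self):
        """Average temperature and switch factor for the current interval."""
        return mean(self.temperatures), mean(self.switch_factors)

    def close_interval(self, timestamp, vt_reading, predicted_lifetime):
        """Store the time-stamped aging reading and result, then clear the
        interval data so the next interval starts empty."""
        self.aging_readings.append((timestamp, vt_reading))
        self.predicted_lifetime = predicted_lifetime
        self.temperatures.clear()
        self.switch_factors.clear()


# Illustrative use with made-up values.
entry = CoreHistory()
entry.record(70.0, 0.4)
entry.record(74.0, 0.6)
avg_temp, avg_switch = entry.summary()
entry.close_interval("2009-06-01T00:00", 0.31, 8.5)
```

Storing only interval averages rather than every raw reading follows the storage-efficiency suggestion made for the history table; whether averages are sufficient depends on the aging model used.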
Each column within history table 94500 represents a type of data collected from a core or a logic block that is being monitored. Within history table 94500, the ‘Block name’ column stores the identification data related to the monitored item of interest such as a core, a circuit or a logic block. The ‘Voltage’ and ‘Frequency’ columns store values collected at runtime that describe the supplied voltage (VDD) and clock frequency of the measured item, respectively. The ‘Time stamp’ column stores the time and date at which the corresponding values were measured. The ‘Switch factors’ column stores probability values, measured from corresponding hardware counters, of how often the bits switch in the measured item. The ‘Aging sensor reading’ column stores values obtained from aging sensors and/or age-analyzers (see
Additionally, an aging sensor 94600 may be implemented using a number of different kinds of logic such as SRAM, ring oscillators, inverter chains, with different aging characteristics that sufficiently mimic the critical components of the processor cores, individually or using the aforementioned combinations. The process of tuning with a given aging profile number implies finding these representative combinations and generating the conditions that represent the aging profile number.
As mentioned, the term “core” generally refers to a digital and/or analog structure having a data storing and/or data processing capability, or any combination of the two. For example, a core may be embodied as a purely storage structure or a purely computing structure or a structure having some extent of both capabilities.
Also, the concept of turning off a core or “selective core turn-off” may be implemented by putting the core in a low-power mode, assigning the core with extremely low-power tasks, or cutting off the supply voltage or clock signal(s) to the core such that it is not usable.
Additionally, a “break-even” condition is a state of being at a particular time that facilitates the evaluation of the ability of a core to tolerate performance variation from its intended original design, i.e. as a result of administering tests that determine how much process variation it takes to change the static (non-time varying) decision of which core or set of cores to turn off.
Moreover, the term “variation,” as used in the discussion below, generally refers to process variation, packaging, cooling, power delivery, power distribution and other similar types of variation.
The disclosed technology achieves higher performance and energy efficiency by intelligently selecting which cores to shut down (i.e. turn off or disable) in a multi-core architecture setting. The decision process for core shut down can be done randomly or through a fixed decision (such as always turning off core 1) without any basis for the decision beyond selecting a fixed core for all chips. In this disclosure, we disclose a technique that optimizes system efficiency through the core shut down decisions, especially in the presence of on-chip variation among processing units.
The disclosed technique can be adjusted for different optimization criteria for different chips, though, for simplicity, we focus on exemplary embodiments for energy efficiency and temperature characteristics. The technique of picking the optimal set of cores to turn off is applicable to multiple objective functions such as Temperature and Energy Efficiency (leakage reduction), the latter being more related to average temperature than to peak temperature. In the case that the scheme is targeting thermal optimization, the technique focuses on a function f(Tpeak, #neighbors), where the static peak temperature among the processing units can be reduced while reducing the peak temperatures of the maximum number of neighbors of the core turn-off candidate under consideration. However, in the case that the scheme is targeting energy reduction, the same function is multiplied by a factor (Tavg* # neighbors component), which tracks the average temperature reduction in the maximum number of neighbor cores, so that the static power dissipation is reduced significantly. By modifying the function f(Tpeak, #neighbors) by (Tavg*Area), we optimize for energy efficiency with the same technique.
Processor 96101 includes three cores 96102a-c, of which two, for example, are needed to process a certain workload.
Processor 96103 includes cores 96104a-c, of which two, for example, are needed to process a certain workload. Due to core scheduling, cores 96104a and 96104b are turned on and core 96104c is turned off. Since cores 96104a and 96104b are in close physical proximity to each other in the chip, due to their static power dissipation, cores 96104a and 96104b spatially heat up each other. Consequently, during operation, cores 96104a and 96104b in sum, consume more static power.
Processor 96105 includes cores 96106a-c, of which only two are needed to process a certain workload, for example. Due to a core scheduling, for example, cores 96106a and 96106c are turned on and core 96106b is turned off at a given point in time. Since core 96106a and 96106c are considered not in close physical proximity to each other, they do not spatially heat up each other as much. Consequently, during operation, cores 96106a and 96106c consume less static power.
It should be noted that although cores 96104c and 96106b are turned off in their respective scenarios, core 96106b, due to its position between the turned-on cores 96106a and 96106c, may be heated at a higher rate than core 96104c. Consequently, during operation, core 96106b may consume more static power than core 96104c in this exemplary scenario.
Exemplary scenarios, as illustrated in
One way to determine the optimal set of cores to turn off is by performing exhaustive tests on each processor after the processor is manufactured. By operating each core, measuring the static power and trying all the combinations of cores to turn on/off, the combination of which cores to turn on/off that exhibit the lowest power consumption may be found. However, this brute force method is overly time consuming and costly due to increased testing time in manufacturing and the costs associated with testing equipment and testing time. Furthermore, the costs become even more prohibitive when the number of cores increases to tens or even beyond hundreds and the number of cores to shut down is more than one.
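The cost of the brute-force approach can be made concrete with a short sketch. The search itself is generic; the power function `toy_power` is a made-up stand-in for a tester measurement (per-core leakage values and a fixed penalty for adjacent active cores are assumptions for illustration), and all names are hypothetical.

```python
from itertools import combinations
from math import comb


def best_shutdown_set(num_cores, num_off, measure_static_power):
    """Try every set of `num_off` cores and return the lowest-power one."""
    best = None
    for off in combinations(range(num_cores), num_off):
        power = measure_static_power(off)
        if best is None or power < best[1]:
            best = (off, power)
    return best


def toy_power(off):
    """Made-up stand-in for a tester measurement on three cores in a row."""
    leakage = [1.0, 1.2, 1.1]  # hypothetical per-core static power
    on = [c for c in range(3) if c not in off]
    # Assumed penalty: adjacent active cores heat each other and leak more.
    coupling = sum(0.3 for a in on for b in on if b - a == 1)
    return sum(leakage[c] for c in on) + coupling


choice, power = best_shutdown_set(3, 1, toy_power)

# Number of measurements needed to pick 4 cores to turn off out of 64.
trials_64_cores_4_off = comb(64, 4)
```

For three cores the search is trivial, but choosing 4 cores out of 64 already requires 635,376 separate measurements, which illustrates why exhaustive testing at manufacturing time does not scale.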
Ring oscillator 94600 may be adapted to measure variation for a respective core by counting how many times the output signal Q in ring oscillator 94600 changes from 0 to 1 and 1 to 0, in a fixed period of time such as within a clock cycle. Since faster transistors typically exhibit a higher rate of outflow of static power, higher counts in ring oscillator 94600 imply that the core consumes more static power.
Additionally, ring oscillator 94600 may be positioned within or outside of a core e.g., may be built as components on the SOC in proximity to the respective cores.
Moreover, ring oscillator 94600 may be configured as a Phase-Shift Ring Oscillator (PSRO). Alternative designs of ring oscillator 94600 or other devices performing a similar function can also be incorporated in coordination with a PSRO or other variation sensing devices/structures.
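The counting of output transitions described above can be illustrated in software. This is a sketch only: in hardware the count would be taken by a counter circuit, and the sampled waveforms below are made-up values for two hypothetical cores.

```python
def count_transitions(samples):
    """Count 0->1 and 1->0 changes in a sampled ring-oscillator output Q."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a != b)


# Made-up samples of Q taken over the same fixed time window on two cores.
window_core_a = [0, 1, 0, 1, 0, 1, 0, 1]   # faster transistors
window_core_b = [0, 0, 1, 1, 0, 0, 1, 1]   # slower transistors

counts = {
    "core_a": count_transitions(window_core_a),
    "core_b": count_transitions(window_core_b),
}
# Higher count implies faster transistors and, typically, more static power.
leakiest = max(counts, key=counts.get)
```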
Also, in one embodiment, one or all steps within Stage A may be performed on a computer at a chip design facility where the processor chip is being designed.
Additionally, in one embodiment, one or all steps within Stage B may be performed by the processor itself or a computer attached to the processor at the manufacturing facility where the processor chip is being manufactured.
In step 96302, a static processor analysis is conducted and its analysis results may be output via a signal. This analysis is conducted by simulating on a computer the operation of the processor running a particular workload. Using the results of the simulation, the computer determines the optimal core or set of cores to turn off given the particular workload. Since this analysis may, in one embodiment, take into consideration some static thermal (e.g. detailed temperature values for individual processing units, macros, cores, temperature maps and such), power (e.g. static and dynamic power dissipation for macros, units or cores) and performance characteristics (e.g. data measured by performance counters, clock frequency, instructions per cycle and bytes per second and such) of the processor (by utilizing known thermal, power and performance models), the resulting processor configurations may be ranked, individually or in combination, by optimal thermal, power and/or performance characteristics. This data may be output as one or more signals for later use in subsequent steps such as step 96303. The signal(s) may include data corresponding to a static list of processor cores to turn off.
Also, throughout execution of step 96302, the absence of variation is assumed.
Additionally, the simulation in step 96302 includes scenarios where the processor has various power modes to reduce power and/or to implement shut-down. Processor power modes are a range of operating modes that selectively shut down and/or reduce the voltage/frequency of parts or all of the processor in order to improve the power-energy efficiency. It is possible that power modes may include full shut down and/or drowsy modes of processing cores and cache structures.
In step 96303, at least one break-even condition is determined by utilizing data from step 96302 and data from a preexisting library of various variation patterns. This determination is done by simulating on a computer the occurrence of a particular variation pattern on the optimal core or set of cores to turn off given the particular workload employed in the analysis at step 96302. Consequently, a list of break-even conditions providing for a switch from one decision of the optimal core or set of cores to turn off (without the effects of variation) to another different set (with the effects of variation) is determined and output via a signal. This signal may be used by subsequent steps, such as step 96304.
Also, the simulation of the occurrence of a particular variation pattern on the optimal core or set of cores to turn off given the particular workload employed in the analysis at step 96302 may be conducted via a computational algorithm that relies on repeated injection of variation patterns. The variation patterns may be taken from preexisting library of variation patterns for a specific manufacturing site, manufacturing technology and relevant processor assumptions. In one embodiment, the injection algorithm also stores information from earlier runs of the chip under investigation to converge on most frequent variation patterns. While the variation can be largely due to process variation, the injection technique does not discriminate the source of variation and thus can effectively be used with other sources of variation such as packaging, cooling, power delivery, power distribution and such. In an embodiment where the same design is manufactured in a different technology node, or a different site, the preexisting libraries may be customized for these assumptions and thus, the static analysis in this stage will be targeted towards the specific manufacturing technology and site.
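The injection of variation patterns to find break-even conditions can be sketched as follows. This is an illustration under strong assumptions: the power model (`static_power`, with a fixed `COUPLING` penalty for adjacent active cores), the variation library and all names are hypothetical, and a real analysis would rely on detailed thermal, power and performance models rather than this toy.

```python
from itertools import combinations

COUPLING = 0.3  # assumed extra static power per pair of adjacent active cores


def static_power(leakage, off):
    """Toy model: sum of active-core leakage plus a neighbor-heating penalty."""
    on = [c for c in range(len(leakage)) if c not in off]
    adjacent_pairs = sum(1 for a in on for b in on if b - a == 1)
    return sum(leakage[c] for c in on) + COUPLING * adjacent_pairs


def optimal_off(leakage, num_off=1):
    """Set of cores whose shut-down minimizes the toy static power."""
    candidates = combinations(range(len(leakage)), num_off)
    return min(candidates, key=lambda off: static_power(leakage, off))


def break_even_patterns(nominal, library, num_off=1):
    """Inject each variation pattern; report those that change the decision."""
    baseline = optimal_off(nominal, num_off)
    changed = {}
    for name, factors in library.items():
        varied = [n * f for n, f in zip(nominal, factors)]
        choice = optimal_off(varied, num_off)
        if choice != baseline:
            changed[name] = choice
    return baseline, changed


nominal = [1.0, 1.0, 1.0]  # variation-free leakage of three cores in a row
library = {                # hypothetical per-core leakage multipliers
    "none": [1.0, 1.0, 1.0],
    "core0_leaky": [1.5, 1.0, 1.0],
    "core2_mild": [1.0, 1.0, 1.2],
}
baseline, changed = break_even_patterns(nominal, library)
```

In this toy setting the variation-free decision is to turn off the middle core. A mild variation on core 2 does not change that decision, whereas a large variation on core 0 does, so the break-even point lies between the two patterns.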
In step 96304, the output list of break-even conditions of step 96303 is used to create a data structure, such as a look-up table, where upon the input of the values of a variation of the core, the data structure will output an ordered list of cores to turn off in order to reduce power or to reduce temperature. For example, when using the ordered list, if the objective function is to reduce power and at most three cores could be turned off to still meet a certain performance target, the ordered list is sorted such that turning off the first three cores in the list will provide the optimal power configuration for the same performance.
The data structure, such as a look-up table, may be stored in memory internal or external to the processor. The content of the data structure may be registered, stored, organized and capable of being retrieved from for later use by the processor, a logic device, a resource manager, an initial configuration controller and/or a tester during the performance of step 96306.
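A look-up structure of this kind can be sketched as a list of rows, each pairing a condition on the measured variation with an ordered list of cores to turn off. The table contents, the three-core layout and the function name below are hypothetical and serve only to illustrate the mechanism.

```python
# Hypothetical table for three cores numbered 1..3. Each row pairs a condition
# on per-core ring oscillator counts with an ordered list of cores to turn off.
LOOKUP = [
    (lambda c: c[1] > c[3], [2, 1, 3]),
    (lambda c: c[1] <= c[3], [2, 3, 1]),
]


def cores_to_turn_off(counts, max_off, table=LOOKUP):
    """Return the first `max_off` cores from the first row tested TRUE."""
    for condition, ordered in table:
        if condition(counts):
            return ordered[:max_off]
    raise LookupError("no row matched; the table must cover every pattern")


# Made-up counts measured for one chip.
first_chip = cores_to_turn_off({1: 120, 2: 100, 3: 110}, max_off=2)
second_chip = cores_to_turn_off({1: 100, 2: 130, 3: 110}, max_off=2)
```

Because each list is ordered, the same row serves any number of cores to turn off: the caller simply takes a shorter or longer prefix.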
In step 96306, during Wafer Final Test (WFT) and/or Module Final Test (MFT), the variation of each core is assessed using tester infrastructure, on-chip ring oscillator and/or a temperature sensor and stored in a memory (or a combination of any of these). In one embodiment, the measuring involves applying different supply voltages and clock frequencies to a core or all the cores in the processor and determining the signal counts output by the ring oscillator. Consequently, the measuring may provide values that represent variation for each core measured in ring oscillator counts. These values may be output as a signal used by subsequent steps, such as step 96307.
In step 96307, the process variation values obtained from step 96306 are used with look-up table data listing of cores to turn off obtained from step 96304 in order to automatically decide which core or set of cores to turn off in the processor. Since the on-chip variation patterns are different for different chips, the turn-off decisions that are unique to a certain processor may be stored within the processor or stored externally with reference to the processor's identification information. The actual decision of which core or set of cores to turn off may be implemented at the manufacturing stage by cutting off the frequency and/or voltage of the selected cores to turn off, or be made available to the systems for applying one of the aforementioned turn-off actions.
In step 96308, a list including a core or set of cores to turn off in the processor is finalized and may be output. In one embodiment, the content of the list may be ordered by corresponding core weights/ranks (i.e. cores may be ordered according to the energy or thermal benefit obtained from turning the selected cores off). Thus, a number of cores represented by a variable N and included in this list may be selected and subsequently turned off. Since the content of the list is ordered, a maximum benefit from the core shut down selection may be obtained. The variable N is a parameter which may be defined by a processor manufacturer based on a predetermined performance requirement and can be changed according to a desired number of cores to turn off. For example, the processor manufacturer may set variable N to 6 cores operating at 2 GHz below 65 W power.
In one embodiment, the first column of look-up table 96400 must cover all the possible combinations of process variations of the corresponding processor such that at least one row will be tested TRUE for every manufactured processor. For example, multiple rows within the first column may be tested TRUE when the processor layout is symmetric, such that turning off a core on one end has the same effect as turning off a core on the other end. If more than one row is tested TRUE, then any of the rows that are tested TRUE may be selected, i.e. the list of cores to turn off specified in any of the rows tested TRUE may be used.
In some cases where some of the cores are non-functional (i.e. not able to operate according to the standards set by the manufacturer) and thus must be turned off, there are fewer choices of remaining functional cores to turn off, since the non-functional cores must be turned off and their turn-off will affect the power and the choices for the remaining functional cores to turn off. Consequently, to make use of table 96400 when some of the cores must be mandatorily turned off due to their non-functionality, the disclosed technique changes the preexisting content of some cells within table 96400 to content that reflects the non-functional cores as already turned off. This occurs by allowing only the rows of table 96400 that have the non-functional cores turned off in the second column (Cores to turn off) to be used for look-up. Also, in one embodiment, conditions listed in the first column that involve the non-functional cores must be removed. For example, in table 96400, if two cores should be turned off and if core 3 has to be turned off due to its non-functionality in a particular processor, then only rows 2, 3, 4 and 6 (those rows that already have core 3 as one of the first two cores to be turned off) will be used for this processor. Thus, in order to determine which of the remaining cores should be turned off, the conditions that involve core 3, such as count[core 1]>count[core 3] and count[core 1]<=count[core 3], are removed from column 1, without using the actual counts or actual evaluation of core 3.
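The row filtering and condition removal described above can be sketched as follows. The table below is a hypothetical three-core table constructed for illustration and is not table 96400; the row numbering, the condition encoding and the function names are assumptions.

```python
import operator

OPS = {">": operator.gt, "<=": operator.le}

# Hypothetical table: each row is (conditions, ordered cores to turn off),
# where a condition (a, op, b) means count[core a] <op> count[core b].
TABLE = [
    ([(1, ">", 2), (2, ">", 3), (1, ">", 3)], [1, 2, 3]),
    ([(1, ">", 3), (2, "<=", 3), (1, ">", 2)], [1, 3, 2]),
    ([(1, "<=", 2), (1, ">", 3), (2, ">", 3)], [2, 1, 3]),
    ([(2, ">", 3), (1, "<=", 3), (1, "<=", 2)], [2, 3, 1]),
    ([(1, "<=", 3), (1, ">", 2), (2, "<=", 3)], [3, 1, 2]),
    ([(1, "<=", 2), (2, "<=", 3), (1, "<=", 3)], [3, 2, 1]),
]


def restrict(table, broken, n):
    """Keep rows whose first n cores include every non-functional core, and
    drop the conditions that mention a non-functional core."""
    kept = []
    for conditions, ordered in table:
        if set(broken) <= set(ordered[:n]):
            usable = [c for c in conditions
                      if c[0] not in broken and c[2] not in broken]
            kept.append((usable, ordered))
    return kept


def lookup(table, counts, n):
    """Return the first n cores of the first row whose conditions all hold."""
    for conditions, ordered in table:
        if all(OPS[op](counts[a], counts[b]) for a, op, b in conditions):
            return ordered[:n]
    raise LookupError("no matching row")


# Core 3 is non-functional and two cores are to be turned off.
restricted = restrict(TABLE, broken={3}, n=2)
# No count is needed for core 3 because its conditions were removed.
pick_a = lookup(restricted, {1: 120, 2: 100}, n=2)
pick_b = lookup(restricted, {1: 90, 2: 100}, n=2)
```

In this hypothetical table four of the six rows survive the restriction, and several of them may test TRUE at once; as with a symmetric layout, any matching row yields the same set of cores to turn off.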
Also, look-up table 96400 may be stored in memory internal or external to the processor. The content of the data structure may be registered, stored, organized and capable of being retrieved from for later use by the processor, a logic device, a resource manager, an initial configuration controller and/or a tester during the performance of step 96306.
Processor 96500 also includes other units such as caches, interconnect, memory controller and Input/Output, collectively marked as Block 96503 that are typically found on a multiprocessor and SOC devices. Because Block 96503 may consume active and static power and may be affected by temperatures of the cores, as well as possibly heating up the cores due to their close proximity with one or more cores close-by, circuitry of Block 96503 may be used in the analysis referred to in
Block 96504 is the logic circuit corresponding to the look-up table by referred to
Block 96505 is the logic circuit corresponding to a variation table, storing values of ring oscillator readings referred to in
In step 96601, a static analysis of the processor's thermal profile is conducted. The static analysis is conducted in order to minimize the overhead associated with the static analysis without compromising accuracy. The static analysis includes a determination of the processor's thermally critical regions R where the average temperature of a region is higher than a predetermined threshold temperature, which is based on the analysis of the processor architecture and determined after extensive analysis at the design stage. The determination of the processor's thermally critical regions R occurs by computer simulation, whereby the processor's map-like physical layout is recursively separated into multiple sections. Next, the average temperature corresponding to a variable Taverage is calculated for each processor section and compared with the other processor sections as well as the whole processor's average temperature over a certain period of time. Next, a list of thermally critical regions Ri: {R1-RN} is provided. All the thermally critical regions R1-RN are evaluated in steps 96602-96607. Furthermore, each region Ri is defined by a number of cores (C1-CN) as well as mapping coordinates (x1, x2, y1, y2) on the layout of the chip. Upon determination of the thermally critical regions, the subsequently performed steps focus on regions Ri without doing the analysis exhaustively for every single core on the chip. Also, architectural criticality may be factored in this step where if, for example, Region 1 has operational significance for a particular processor architecture, then Region 1 can still be in the list or may be overwritten.
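The recursive separation of the layout in step 96601 may be pictured with the following minimal sketch. It assumes the simulated thermal profile is available as a rectangular grid of temperatures, that each section is split into quadrants, and that a section is reported as soon as its average exceeds the threshold; the function name `critical_regions`, the `min_size` stopping rule, and the half-open (x1, x2, y1, y2) bounds are illustrative assumptions.

```python
def critical_regions(temp, threshold, min_size=1):
    """Recursively quarter a 2D temperature grid and return the
    (x1, x2, y1, y2) bounds (half-open) of sections whose average
    temperature exceeds the threshold."""
    def average(x1, x2, y1, y2):
        cells = [temp[y][x] for y in range(y1, y2) for x in range(x1, x2)]
        return sum(cells) / len(cells)

    def visit(x1, x2, y1, y2):
        if average(x1, x2, y1, y2) > threshold:
            return [(x1, x2, y1, y2)]          # thermally critical section
        if x2 - x1 <= min_size and y2 - y1 <= min_size:
            return []                          # smallest section, not critical
        xm, ym = (x1 + x2) // 2, (y1 + y2) // 2
        quadrants = [(x1, xm, y1, ym), (xm, x2, y1, ym),
                     (x1, xm, ym, y2), (xm, x2, ym, y2)]
        found = []
        for qx1, qx2, qy1, qy2 in quadrants:
            if qx1 < qx2 and qy1 < qy2:        # skip empty quadrants
                found.extend(visit(qx1, qx2, qy1, qy2))
        return found

    return visit(0, len(temp[0]), 0, len(temp))

# Toy 4x4 profile with one hot corner (values are made up).
GRID = [
    [50, 50, 50, 50],
    [50, 50, 50, 50],
    [50, 50, 90, 90],
    [50, 50, 90, 90],
]
regions = critical_regions(GRID, threshold=70)
```

Only the returned regions would then be carried into steps 96602-96607, which is what avoids an exhaustive per-core analysis.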
In step 96602, core turn-off is simulated for all cores in region R. Turn-off simulation may occur by selecting an Ith core among M cores (e.g. 2nd core out of 10 cores) where M is the total number of cores on the processor and I is a predetermined constant for the given number of cores/chip area such that I/M cores are neighboring cores from a region R (x1, x2, y1, y2) in the thermally critical regions. Consequently, for example, if N cores out of M should be turned off, then all the combinations of turning off N cores out of M cores are exhaustively simulated for the occurrence of various power and thermal scenarios on each combination until all the combinations are tried and the optimal combination is chosen.
In step 96603, a determination is made whether the peak temperature of a selected core I, which is turned off during simulation, is less than its original peak temperature. If not, then the process loops back to step 96602. Otherwise, step 96604 is executed.
In step 96604, a determination is made whether the difference between the current average temperature and original average temperature is less than the threshold temperature. If not, then the process loops back to step 96602. Otherwise, step 96605 is executed.
In step 96605, information identifying the simulated core is placed in a static turn-off list. Static turn-off list is an ordered list wherein the listed cores are weighted/ranked according to the amount of energy efficiency and temperature improvement achievable through turning the listed cores off. In one embodiment, the weights may be based on ΔT where average ΔT would also indicate leakage and corresponding energy efficiency improvement i.e. the amount of temperature reduction (in terms of peak and/or average temperature) if a certain core is turned off. In one embodiment, the step of deciding how much power/temperature savings could be achieved by turning off a particular core can be extended to include the amount of static power reduction that translates to the level of temperature reduction. Consequently, if variation is lacking, then data from the performance of step 96605 can be subsequently used to assist in turn-off of any number of cores by selecting N cores out of this ordered list in order. While the static turn-off list may be subsequently partially overwritten by breakeven conditions (see for example
In step 96606, a determination is made as to whether all the cores in region R have been analyzed. If not, then the process loops back to step 96602. Otherwise, step 96607 is executed.
In step 96607, the content of static turn-off list is finalized. The static turn-off list may be output for use by step 96303 shown in
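The loop formed by steps 96602-96607 may be sketched as follows. The simulator is represented by a caller-supplied function `simulate_off`, and the result layout (a per-core `peak` dictionary and a chip-wide `avg` value) is an assumption for illustration. The test in step 96604 is implemented literally as stated, i.e. (current average minus original average) less than the threshold, and ranking uses the average temperature reduction ΔT as the weight.

```python
def build_static_turn_off_list(region_cores, baseline, simulate_off, threshold):
    """Return cores ordered by the average-temperature reduction (delta T)
    obtained when each one is turned off in simulation.

    baseline / simulate_off(core) -> {"peak": {core: temp}, "avg": temp}
    """
    entries = []
    for core in region_cores:                           # step 96602
        result = simulate_off(core)
        if not result["peak"][core] < baseline["peak"][core]:
            continue                                    # step 96603 failed
        if not (result["avg"] - baseline["avg"]) < threshold:
            continue                                    # step 96604 failed
        delta_t = baseline["avg"] - result["avg"]
        entries.append((delta_t, core))                 # step 96605
    entries.sort(key=lambda e: e[0], reverse=True)      # rank by delta T
    return [core for _, core in entries]                # step 96607

# Toy simulator with made-up temperatures.
BASELINE = {"peak": {1: 90.0, 2: 85.0, 3: 80.0}, "avg": 75.0}
SIMULATED = {
    1: {"peak": {1: 60.0, 2: 84.0, 3: 79.0}, "avg": 70.0},
    2: {"peak": {1: 90.0, 2: 85.0, 3: 80.0}, "avg": 74.0},
    3: {"peak": {1: 89.0, 2: 84.0, 3: 55.0}, "avg": 72.0},
}
static_list = build_static_turn_off_list([1, 2, 3], BASELINE, SIMULATED.get, 1.0)
```

In the absence of variation, the first N entries of the returned list would be the N cores to turn off.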
In step 96701, a core represented by a variable J from a listing of all cores listed in a static turn-off list is selected. The static turn-off list is provided from the performance of all steps symbolically shown in
In step 96702, a process variation pattern is selected from a preexisting library of various variation patterns. The variation pattern is represented by variable Vi. The variation patterns may be taken from preexisting library of variation patterns for a specific manufacturing site, manufacturing technology and relevant processor assumptions. In one embodiment, the injection algorithm also stores information from earlier runs of the chip under investigation to converge on most frequent variation patterns. While the variation can be largely due to process variation, the injection technique does not discriminate the source of variation and thus can effectively be used with other sources of variation such as packaging, cooling, power delivery, power distribution and such. In an embodiment where the same design is manufactured in a different technology node, or a different site, the preexisting libraries may be customized for these assumptions and thus, the static analysis in this stage will be targeted towards the specific manufacturing technology and site. In one embodiment, the variation pattern may be selected from Block 96505 exemplarily shown in
In step 96703, a variation pattern Vi is injected into core J via a computational algorithm during a power and/or temperature simulation.
In step 96704, a simulation of the occurrence of variation pattern Vi on core J takes place. This simulation may take into account various performance scenarios, workloads, power schemes and temperatures. Specifically, variation data may include lot/wafer/chip/core/unit level variation data that is relevant for the core under consideration. Given the core architecture characteristics/specifications, an injection of the variation pattern Vi into the corresponding operating specs of the processor occurs. As previously mentioned, the operating specifications can include certain workload characteristics, power modes, temperatures and other scenarios into account in order to do a realistic assessment of the impact of the variation on the processor.
In step 96705, a determination is made as to whether the performance results of step 96704 on core J are different from those performance results corresponding to core J as determined by step 96607 shown in
In step 96706, a determination is made as to whether the power and temperature values for core J result in greater energy efficiency (static power reduction) and/or thermal improvement when executing a workload than those corresponding to core J when executing the same workload in step 96607 shown in
In step 96707, process variation pattern Vi is placed in break-even pattern list, which may be stored in a data structure such as a look-up table 96400 shown in
In Step 96709, the content of break-even pattern list is finalized. Thus, break-even pattern list per core for all variation patterns from the library of various variation patterns is provided resulting in a listing of break-even points per core such that if a core is above the specific variation level it gets assigned to the break-even pattern list. The break-even pattern list may be output via a signal for subsequent use.
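The variation injection loop of steps 96701-96709 may be sketched as below. The simulator is abstracted as a function returning a single energy-efficiency figure per core and pattern, and a pattern is recorded for a core when the simulated figure differs from and exceeds the variation-free figure; both the scalar metric and this reading of the two determinations are assumptions made for illustration.

```python
def build_break_even_list(static_list, patterns, baseline_gain, simulate):
    """Return {core: [pattern names]} for the variation patterns that change
    and improve on the variation-free result for that core.

    simulate(core, pattern) -> efficiency gain of turning the core off
    """
    break_even = {}
    for core in static_list:                          # step 96701
        for name, pattern in patterns.items():        # step 96702
            gain = simulate(core, pattern)            # steps 96703-96704
            if gain == baseline_gain[core]:           # step 96705: no change
                continue
            if gain > baseline_gain[core]:            # step 96706: improvement
                break_even.setdefault(core, []).append(name)   # step 96707
    return break_even                                 # step 96709

# Toy model: a pattern is a leakage multiplier (made-up values).
PATTERNS = {"V1": 1.0, "V2": 1.3, "V3": 0.8}
BASELINE_GAIN = {1: 5.0, 2: 3.0}

def toy_simulate(core, leakage_factor):
    return BASELINE_GAIN[core] * leakage_factor

pattern_list = build_break_even_list([1, 2], PATTERNS, BASELINE_GAIN, toy_simulate)
```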
Furthermore, as discussed above in reference to step 96308 in
There are several methods to execute process in
Power Distribution
Each midplane is individually powered from a bulk power supply formed of N+1 redundant, hot pluggable 440V (380V-480V) 3 phase AC power modules, with a single line cord with a plug. The rack contains an on-off switch. The 48V power and return are filtered to reduce electromagnetic emissions (EMI) and are isolated from low voltage ground to reduce noise, and are then distributed through a cable harness to the midplanes.
Following the bulk power are local, redundant DC-DC converters. The DC-DC converter is formed of two components. The first component, a high current, compact front-end module, will be direct soldered in N+1, or N+2, fashion at the point of load on each node and I/O board. Here N+2 redundancy is used for the highest current applications, and allows a fail without replacement strategy. The higher voltage, more complex, less reliable back-end power regulation modules will be on hot pluggable circuit cards (DCA for direct current assembly), 1+1 redundant, on each node and I/O board.
The 48V power is always on. To service a failed DCA board, the board is commanded off (to draw no power), its “hot” 48V cable is removed, and the DCA is then removed and replaced into a still running node or I/O board. There are thermal overrides to shut down power as a “failsafe”; otherwise, local DC-DC power supplies on the node, link, and service cards are powered on by the service card under host control. Generally node cards are powered on at startup and powered down only for service. As a service card is required to run a rack, it is not necessary to hot plug a service card, and so this card is replaced by manually powering off the bulk supplies using the circuit breaker built into the bulk power supply chassis.
The service port, clocks, link chips, fans, and temperature and voltage monitors are always active.
Power Management
A robust power management scheme based on clock gating is provided to lower power usage. Processor chip internal clock gating is triggered in response to at least 3 inputs: (a) total midplane power; (b) local DC-DC power on any of several voltage domains; and (c) critical device temperatures. The BG/Q control network senses this information and conveys it to the compute and I/O processors. The bulk power supplies provide (a), the FPGA power supply controllers in the DCAs provide (b), and local temperature sensors, read either by the compute nodes or by external A-D converters on each compute and I/O card, provide (c). As in BG/P, the local FPGA is heavily invested in this process through a direct, 2 wire link between BQC and Palimino.
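The three-input trigger can be expressed as a small decision function. The limit values, the domain names, and the simple "any input over its limit" policy are assumptions for illustration; the text states only which inputs feed the decision.

```python
from dataclasses import dataclass

@dataclass
class Limits:
    midplane_watts: float   # limit for input (a)
    domain_watts: dict      # per-voltage-domain limits for input (b)
    device_celsius: float   # limit for input (c)

def should_gate_clocks(midplane_watts, domain_watts, device_temps, limits):
    """Return True when any monitored input exceeds its (assumed) limit."""
    over_midplane = midplane_watts > limits.midplane_watts            # (a)
    over_domain = any(watts > limits.domain_watts[name]               # (b)
                      for name, watts in domain_watts.items())
    over_temp = any(t > limits.device_celsius for t in device_temps)  # (c)
    return over_midplane or over_domain or over_temp

# Made-up limits and readings.
LIMITS = Limits(midplane_watts=40000.0,
                domain_watts={"core": 60.0, "memory": 20.0},
                device_celsius=85.0)
gate = should_gate_clocks(38000.0, {"core": 55.0, "memory": 22.0}, [70.0, 72.5], LIMITS)
```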
System Software
As software is a critical component in any computer and is especially important in computers with new architectures, there is implemented a robust layered system of software that at the lowest level is very simple and efficient, yet sufficient to run most parallel applications.
For example, a control system is provided for controlling the following node types: compute nodes, which are dedicated to running user applications under a simple compute node kernel (CNK); I/O nodes (ION), which run Linux and provide a more complete range of OS services (files, sockets, process launch, signaling, debugging, and termination); and a service node, which performs system management services (e.g., heart beating, monitoring errors) transparently to application software.
The Compute Node Kernel (CNK) is adapted to perform and/or is provided with the following:
Binary Compatible with Linux System Calls; Leverage Linux runtime environments and tools;
Up to 64 Processes (MPI Tasks) per Node; SPMD and MIMD Support; Multi-Threading: optimized runtimes; Native POSIX Threading Library (NPTL); OpenMP via XL and Gnu Compilers; Thread-Level Speculation (TLS); System Programming Interfaces (SPI); Networks and DMA, Global Interrupts; Synchronization, Locking, Sleep/Wake; Performance Counters (UPC); MPI and OpenMP (XL, Gnu); Transactional Memory (TM); Speculative Multi-Threading (TLS); Shared and Persistent Memory; Scripting Environments (Python); Dynamic Linking, Demand Loading.
The firmware is adapted to perform and/or is provided with the following:
Boot, Configuration, Kernel Load; Control System Interface; Common RAS Event Handling for CNK & Linux.
Systems Software Overview
There are 7 major software components: (1) CNK (Compute Node Kernel); (2) ION (I/O) node Linux; (3) run-time firmware; (4) control system; (5) messaging layer; (6) compilers; and (7) GNU compilers and toolchain.
1. The Compute Node Kernel (CNK) is a lightweight kernel running on each of the compute nodes focused on performance. Its primary characteristics are low noise and support of most glibc/Linux system calls with function shipping to I/O nodes. It supports processes and pthreads, allows user-mode access to hardware for high performance, and has a mode where applications incur no TLB misses.
2. I/O Node (ION) Linux provides the compatibility environment for CNK function shipping. An I/O proxy daemon (IOPROXY) performs the backend function shipped system calls on behalf of each compute node. A Control and I/O Daemon (CIOD) is provided that interacts with the control system to manage jobs. CIOD also provides a tools interface to allow debuggers such as TotalView to control and query the compute nodes.
3. The runtime firmware (RTF) is the layer below a kernel. That kernel could be the above described CNK or ION Linux, or other customer implemented kernel. RTF's primary characteristics are providing a common set of non-performance-critical services, isolating the kernel from the underlying hardware and control system, and providing a uniform RAS delivery mechanism. As with CNK, it introduces little noise and is well suited to HPC application needs.
4. The control system consists of two components: the high-level control system, or MMCS (Midplane Monitoring Control System), and the low-level control system, or mcServer (machine controller). The control system is the software that boots and partitions the machine, interacts with a scheduler to run jobs, tracks and analyzes RAS events, and provides a unified graphical view of the machine state, RAS, and jobs. Enhancements include high availability failover and, in an alternative embodiment, a distributed componentized control system. The mcServer portion handles power supplies, interactions with the FPGAs on the compute cards, and in general is responsible for controlling the hardware. MMCS handles interactions with the database for maintaining persistent job information and machine state. MMCS is the component responsible for partitioning and interfacing to schedulers such as Load Leveler or SLURM. The control system relies on interactions with the kernel for RAS messages, but for the most part, other software components rely on the control system. The Blue Gene control system presents a simple, efficient, and unified interface to control a world-leading number of compute nodes. In a single glance it provides the state of the machine and the status of running jobs. It provides a searchable database for analyzing previous job runs, failures, hardware replacement, RAS events, and more. The Blue Gene control and diagnostics allow concurrent maintenance on one part of the machine while running jobs on another part.
5. The Blue Gene messaging stack is designed to allow the user access to the full power of the hardware, while providing a robust and optimized environment for standard programs. The messaging stack exposes two levels of APIs. A lower level one called SPI (System Programmer Interface) is a minimalistic layer of software that allows hardware (message queues, counters, etc) manipulation from user space. Starting on BGP the SPI is a fully supported and documented layer for achieving maximum performance from the hardware. Built on the SPI layer, DCMF (Deep Computing Messaging Framework) supports high performance message passing and shared memory programming models, such as OpenMP, Global Arrays (GA), Charm++, UPC, and others. MPICH is built on top of DCMF.
6. XL compilers are among the industry's leaders in performance and standards compliance. These compilers perform optimizations specific to each embodiment. The compilers implement standards for C, C++, and FORTRAN. The compilers support auto parallelization with OpenMP and include high performance MASS/MASSV libraries and ESSL. They have additional performance enhancements for HPC features. The compilers support SIMD instruction generation with detailed compiler listing support for tuning optimizations. One compiler, in an alternative embodiment, includes support for transactional memory and speculative execution.
7. The GNU compiler libraries and GNU toolchain are implemented. An automated patch and build process is provided for the toolchain that makes installation easy and provides the customer with a complete source base for any modifications or patches desired. The patch enables C, C++, FORTRAN and GNU OpenMP (GOMP). The toolchain implements ANSI, POSIX, IEEE and ISO standards for C, C++, FORTRAN, and OpenMP. The C library supports numerous ANSI, IEEE and POSIX standards including IEEE POSIX 1003.1c-1995 pthreads interfaces. The GNU linker, assembler, and related utilities have become de facto standards on Linux platforms.
Other application and system libraries beyond standard Linux, runtimes, math libraries, and messaging libraries are provided. A user-level application checkpoint restore library facilitates the transformation of applications into ones that can recover from system failures. The multi-valued L2 cache provides an opportunity for hardware and software support for fine-grained (sub millisecond) transparent system rollback to increase MTBF contributions from soft-errors. Link checksum interfaces are provided that applications can use to find faulty network links. Other system programming interfaces (SPI) and tool interfaces are provided.
Light Weight-Kernel
Compute Node Kernel (CNK) is written from scratch and is open source under the Common Public License (CPL). The primary goal of the kernel is to launch applications, map hardware features into user space, and provide an infrastructure requiring little additional user-kernel interaction. Application compatibility with Linux is also provided. The approach emulates Linux system calls by function shipping the majority of the work to an I/O node running Linux. Some job control system calls are implemented locally by CNK including mmap( ) and clone( ). This strategy allows access to shared memory, creation of threads, and dynamic linking in a manner that does not require restructuring glibc. For example, this allows python and other applications with dynamic linking requirements to work without modification.
Unlike Linux, memory is mapped with a set of static translation lookaside buffers (TLBs). This eliminates the cost of TLB misses and allows the translation between virtual and physical addresses to be performed in user space. The DMA torus interfaces are made available to user space, allowing communication libraries to send messages directly from the application without involving the kernel. The kernel, in conjunction with the hardware, implements secure limit registers that prevent the DMA from targeting memory outside the application. These constraints, along with the electrical partitioning of the torus, provide security between applications. Blue Gene hardware provides multiple communication FIFO (First-In First-Out) data structures implemented by hardware for efficient messaging. The FIFOs are assigned to MPI tasks and threads, providing dedicated resources per task.
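Because the mapping is static, the address translation reduces to a fixed table walk that user code can perform itself, and the limit registers reduce to a range check. The following minimal sketch illustrates both; the map entries, addresses, and function names are hypothetical and do not describe the actual CNK memory layout or register interface.

```python
# Hypothetical static map: (virtual base, physical base, size in bytes).
STATIC_MAP = [
    (0x0000_0000, 0x4000_0000, 0x1000_0000),
    (0x1000_0000, 0x8000_0000, 0x1000_0000),
]

def virt_to_phys(vaddr, static_map=STATIC_MAP):
    """Translate in user space: the map never changes, so no kernel call
    (and no TLB miss handling) is involved."""
    for vbase, pbase, size in static_map:
        if vbase <= vaddr < vbase + size:
            return pbase + (vaddr - vbase)
    raise ValueError(f"address {vaddr:#x} is not mapped")

def dma_allowed(paddr, length, lower, upper):
    """Model of a limit-register check: the whole transfer must lie inside
    the [lower, upper) physical range assigned to the application."""
    return lower <= paddr and paddr + length <= upper

phys = virt_to_phys(0x1000_0010)
ok = dma_allowed(0x4000_0000, 0x100, 0x4000_0000, 0x5000_0000)
```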
CNK provides both a pure MPI programming model and a hybrid approach that allows MPI to be mixed with different shared memory programming models such as OpenMP, UPC (Unified Parallel C), or pthreads.
CNK provides support for SIMD execution, Transactional Memory (TM), and Speculative Execution (SE). CNK leverages BGQ's unique hardware support for TM and coarse-grained thread-level speculative execution. Subcontractor will provide significant compiler support and optimization for each of these execution environments. CNK works in unison with the compilers. In particular for SIMD execution, CNK saves and restores the requisite registers. A transaction can be initiated from user space. The hardware can be configured so that upon completion of a transaction, either CNK receives an interrupt and calls a signal in user code, or user code can check a status register to determine the success or failure of the transaction. For speculation, CNK provides a software thread context per hardware thread. When the runtime wishes to initiate speculation, a kernel call activates the speculative thread, sets the appropriate TLB bits, and returns control to the speculative thread. If during the speculation a conflict occurs, CNK will handle the interrupt and logically terminate the speculative thread. Upon successful completion of the speculative code, the speculative state is saved, and CNK will return control to the thread that was running prior to the activation of the speculative thread.
BGQ further allows detailed fine-grained simultaneous monitoring of numerous performance metrics. CNK will provide a user-space mapping of control registers for managing 1024 performance counters in one embodiment. The counters can be configured in three modes: a distributed count mode, a detailed count mode, and a trace mode. The distributed count mode allows some counters from all of the cores to be monitored; in detailed mode, a large number of counters from a single core may be monitored. In trace mode, every instruction is recorded. Approximately 1500 cycles of instruction information can be traced in this mode. The distributed and detailed modes also apply to the L2.
For performance and scalability, CNK implements function shipping for I/O requests. The I/O function shipping mechanism is implemented in a manner similar to a remote procedure call. When an I/O request is made to CNK, CNK sends a message to a CIOD daemon running on the ION Linux, where a proxy performs the operation. Linux compatibility is enabled on the I/O node by careful management of the context in which the system call is performed. Rather than emulate Linux behavior, Subcontractor's approach is to mirror the compute node environment on the I/O node with a process and corresponding threads. This allows CIOD to provide Linux semantics for the CNK process context including current working directory, file handles, locks, and user and group id security. The I/O function shipping also addresses scalability of the I/O subsystem. An I/O node further manages a number of compute nodes, reducing the filesystem clients and administration by two orders of magnitude.
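The remote-procedure-call character of function shipping can be illustrated with a minimal in-process sketch. The JSON message format, the `cnk_ship` stub, and the `IoProxy` class are assumptions for illustration; the actual wire format and daemon interfaces are not given in the text. The proxy object holds the per-process context (working directory and open files) on the I/O node side.

```python
import json
import os
import tempfile

def cnk_ship(syscall, *args):
    """Compute-node side: marshal a system call into a message."""
    return json.dumps({"call": syscall, "args": list(args)}).encode()

class IoProxy:
    """I/O-node side: one proxy per compute process, mirroring its context."""
    def __init__(self, cwd):
        self.cwd = cwd        # current working directory of the compute process
        self.files = {}       # descriptor -> open file object
        self.next_fd = 3

    def handle(self, message):
        request = json.loads(message.decode())
        call, args = request["call"], request["args"]
        if call == "open":
            path, mode = args
            handle = open(os.path.join(self.cwd, path), mode)
            fd, self.next_fd = self.next_fd, self.next_fd + 1
            self.files[fd] = handle
            return fd
        if call == "write":
            fd, data = args
            return self.files[fd].write(data)
        if call == "close":
            self.files.pop(args[0]).close()
            return 0
        raise ValueError(f"unsupported call: {call}")

# Usage: the "network" is a direct function call in this sketch.
with tempfile.TemporaryDirectory() as workdir:
    proxy = IoProxy(workdir)
    fd = proxy.handle(cnk_ship("open", "out.txt", "w"))
    proxy.handle(cnk_ship("write", fd, "hello"))
    proxy.handle(cnk_ship("close", fd))
```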
Both CNK and the Linux kernel on the I/O node utilize a common runtime firmware (RTF) service layer for non-performance critical events. RAS events are emitted via this firmware layer to the control system over the secure control network. For space efficiency, RAS events are logged from CNK as encoded binary and decoded within the control system allowing the lightweight kernel a smaller memory footprint. RAS events are recorded in a database on the service node and are associated with specific hardware, a partition, and a job. The control system monitors the nodes, node boards, and service cards by externally polling the system without interacting with CNK or other software running on the node thereby providing monitoring with zero interference. Failing hardware can be detected even if a node becomes so unresponsive that even CNK and its firmware cannot act. In these situations the control system will produce RAS events on behalf of the nodes. This provides additional information over what a standard cluster can provide. By using the JTAG interface, the control system can obtain the state of the failing node.
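The space saving from logging RAS events as encoded binary can be illustrated as follows. The 16-byte record layout, the field names, and the message catalog entry are hypothetical; the point of the sketch is only that the kernel emits a compact fixed-size record while the descriptive text lives in the control system.

```python
import struct

# Assumed layout: message id (u32), severity (u16), component (u16),
# detail word (u64), big-endian, no padding.
RAS_FORMAT = ">IHHQ"

# Hypothetical catalog held by the control system, not by the kernel.
MESSAGE_TEXT = {0x00010001: "DDR correctable error threshold exceeded"}

def encode_ras(msg_id, severity, component, detail):
    """Kernel side: emit a compact binary record."""
    return struct.pack(RAS_FORMAT, msg_id, severity, component, detail)

def decode_ras(blob):
    """Control-system side: expand the record into a readable event."""
    msg_id, severity, component, detail = struct.unpack(RAS_FORMAT, blob)
    return {
        "msg_id": msg_id,
        "severity": severity,
        "component": component,
        "detail": detail,
        "text": MESSAGE_TEXT.get(msg_id, "unknown message id"),
    }

record = encode_ras(0x00010001, 2, 7, 0xDEADBEEF)
event = decode_ras(record)
```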
In one embodiment, system software boots the I/O nodes as part of the initial boot of the partition. Once a partition is booted, the system allows individual or groups of I/O nodes to be rebooted as desired. For simplification, compute nodes associated with the I/O node(s) are also rebooted. As this process happens in parallel, it does not add to the ION reboot time. In normal operation, nodes are booted once to start the partition and then multiple jobs are run without further reboots. In a further embodiment, the I/O nodes are collected into racks and decoupled from compute nodes; however, enhancements enable support of reconfiguring partitions without rebooting the I/O nodes.
LN, ION and SN Linux OS: The I/O Node (ION) Linux is an embedded Linux based on a standard enterprise Linux distribution. ION Linux, in one embodiment, may leverage the same runtime firmware used by CNK. This firmware layer is designed to provide consistent RAS from any kernel including CNK, Subcontractor's provided ION Linux, any customer built Linux, or other customer supplied operating system. In addition to RAS, the runtime firmware provides a common interface to the control system for configuration of networks and console output.
Job control may be provided through a Control and I/O Daemon (CIOD). CIOD accepts connections over the functional network from the control system on the service node. The control system may start, signal, debug, or end a job over this connection. The control system achieves scalability by a division of labor where the service node interacts in parallel with a relatively small set of IONs, which in turn interact in parallel with the set of associated compute nodes.
Using this technique Blue Gene may efficiently perform job launch and control on 100,000s of nodes. Standard input (stdin), stdout, and stderr are multiplexed over the high-speed functional network. Debugging and related tools scale by running the debugger in parallel across the I/O nodes. The debugger and tools interface is documented. Tools may leverage the high-speed functional network and the compute capacity of the I/O nodes to perform and coordinate work.
Function shipping is provided through an I/O Proxy Daemon (IOPROXY) running on the ION. An IOPROXY daemon is responsible for each compute task. This IOPROXY shares the network connection to the compute nodes with CIOD and responds to requests from the compute nodes to perform system calls on behalf of the compute task. The IOPROXY creates threads to mirror compute processes. Each IOPROXY process corresponds to a compute process and leverages Linux to track current working directory, file locks, user and group id, and any special context required by specific filesystems.
The IOPROXY avoids data copying by driving the network connection directly from user space. In one embodiment, this connection is over a collective network. Alternatively, hardware provides DMA support from user space alleviating the computational requirement for driving this network.
In one embodiment, the integrated 10 Gbps Ethernet is driven by a kernel network device driver. The Ethernet supports scatter-gather DMA with IPv4 checksum offload for TCP and UDP payloads. In an alternate embodiment, the external I/O is provided by a PCIe 2.0 adapter that is expected to provide similar or better offload capabilities.
Boot control of the I/O nodes is performed remotely from the service node using low-level Joint Test Action Group (JTAG) protocol. As with compute nodes, the I/O nodes are started remotely. Consistent with Blue Gene's design for reliability there is no local resident firmware or local storage; the booter and kernel are loaded over the network.
In one embodiment, the I/O nodes are integrated into the compute racks and are booted when a partition is configured. These I/O nodes may be rebooted either individually or in arbitrary subsets as desired. The I/O node reboot procedure may be performed between jobs. For simplification, compute nodes associated with the I/O node(s) are also rebooted. As this process happens in parallel it does not add to the ION reboot time. This discards any persistent data stored on the compute node.
In an alternative embodiment, reboot is similar, but the I/O nodes are in racks and are interconnected by an I/O torus. These I/O nodes will be booted independently of the compute racks, and will normally remain in operation until a maintenance window. It will be possible to reboot individual or sets of I/O nodes as allowed by the hardware. If an I/O node fails in a manner where the torus remains intact an administrator may choose to leave it down. Neither embodiment needs power cycling to reset nodes. The control system can send signals to the node via JTAG causing a reset.
System Administration:
System administration features include a centralized database that contains machine information such as hardware state, jobs, partitions, service actions, diagnostics, environmental readings, and RAS events. From the central database, an administrator can monitor machine activity. System administration is provided as a centralized service scalable to large (100,000s) number of nodes. The service provides the ability to debug jobs, initiate service actions, run diagnostics, view diagnostics results, view hardware status, kill jobs, free partitions, and other system administration tasks. All administrative tasks may be performed either by using the browser-based Navigator or from the command-line. The Navigator is customizable, in that it supports plug-in features whereby the administrator can provide site-specific graphs, reports, and notifications.
Most administrative tasks, such as service actions, running performance tests, or performing diagnostics, are parallel and can be run concurrently (at the same time on different partitions of the machine). For example, diagnostics could be run on one partition of the machine, while another partition is having a service action performed, while yet another partition of the machine is running a user application.
The database is used as a backing repository. The control system is designed so that the database does not become a bottleneck. Operations like system shutdown or reboot are not database-intensive operations. Once an operation is initiated only a few state transitions are logged in the database. RAS “storms” can cause significant database activity.
Petascale System Services:
The control system is designed to give a high degree of flexibility for creating and booting partitions, and launching and debugging jobs. The control system allows each partition to be booted with a partition-specific kernel. This customization, combined with partitioning features of the machine, allows different kernels to be used on different partitions at the same time. The choice of kernels is easily configured with commands and APIs provided by the control system. There is also support for different methods of job submission. Commonly, a single binary is run on all compute nodes of a partition. The control system also allows multiple binaries to run within a single partition. This is known as Multiple Program Multiple Data (MPMD). Another job launch paradigm is known as High-Throughput Computing (HTC) in which all the nodes of a partition can be running a different binary, and these binaries are each launched independently.
In one embodiment, security is based on access to the service node controlled by Linux accounts. Users who are given accounts on the service node can issue any command to the control system. In alternative embodiment, security and authentication in the control system are designed based on capabilities. A capability (known in some systems as a key) is a communicable, unforgettable token of authority. Users without access to the service node have the ability to launch and debug jobs from login nodes. More advanced tasks, such as running diagnostic suites or performing service actions, can be performed by system administrators on the service node. The security model provides a subset of service node commands to aid in debugging and collecting information about user jobs. One sample scenario might allow a user access to the service node, but only give them enough commands to view or change information about their partition and job.
Remote job launch is secured by the use of a challenge-response authorization protocol on login nodes, service nodes, and I/O nodes. Initiating a job from a login node may require a shared secret to authenticate with the service node. The secret is stored in a file on both the login node's and service node's local file system and can be of arbitrary length. A similar process occurs when initiating the job launch from the service node to I/O nodes. In this case a shared secret is randomly generated by the control system when the partition is booted. As part of the boot process, the secret is sent to each I/O node over the private service network. The I/O node software only allows remote connections that possess this shared secret and pass a challenge-response check.
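By way of a non-limiting illustration, a challenge-response exchange over a shared secret can be sketched as follows. The description does not name a particular algorithm; the use of HMAC-SHA256, the nonce length, and the function names `make_challenge`, `respond`, and `verify` are assumptions made for this sketch only.

```python
import hashlib
import hmac
import secrets


def make_challenge(num_bytes: int = 32) -> bytes:
    """Verifier side: generate a fresh random challenge (nonce)."""
    return secrets.token_bytes(num_bytes)


def respond(shared_secret: bytes, challenge: bytes) -> bytes:
    """Client side: prove knowledge of the secret without transmitting it."""
    return hmac.new(shared_secret, challenge, hashlib.sha256).digest()


def verify(shared_secret: bytes, challenge: bytes, response: bytes) -> bool:
    """Verifier side: recompute the expected response and compare in constant time."""
    expected = hmac.new(shared_secret, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)


# Hypothetical flow: a per-partition secret is generated at boot (assumed 64 bytes here).
partition_secret = secrets.token_bytes(64)
challenge = make_challenge()
ok = verify(partition_secret, challenge, respond(partition_secret, challenge))
bad = verify(partition_secret, challenge, respond(b"wrong secret", challenge))
```

In this sketch the secret itself never crosses the connection; only the random challenge and the keyed digest do, so a connection that lacks the secret cannot produce a valid response.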
Within the framework of a scheduler, interactive job launch can be prevented by the use of a control system plug-in. This plug-in is flexible enough to make portions of the machine available for interactive use, while denying requests that overlap with scheduler-controlled hardware resources.
The control system provides a comprehensive solution for resource management. An integral part is a database that stores four categories of data. There is a configuration database that is a representation of the hardware on the system, an operational database that is a representation of partitions, jobs, and history, an environmental database, and a RAS database.
The configuration database has a complete and detailed layout of the racks, the node cards within those racks, and the cables that connect the racks. This physical layout of the machine is used as a base for performing resource management. For example, a request for a partition of 1,024 compute nodes in a fully connected torus requires referencing the physical layout stored in the configuration database. The configuration database also records the current status of the hardware. Even though hardware is present, it may currently be undergoing a service action. The configuration database is kept consistent with the state of the machine when hardware errors are detected (e.g., bulk power supply, fan, etc.) or service actions are in progress. This hardware is unavailable during the course of the service action and therefore is unavailable to a resource manager and is marked as such in the database. Additionally, certain RAS events may also indicate a hardware fail.
The operational database tracks the current use of the hardware. A resource manager uses the operational database to determine if a partition is available to boot. The same database also tracks where current jobs are running and can be used to ensure multiple jobs are not launched to the same partition at the same time.
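By way of a non-limiting illustration, the two checks described above (whether a partition may boot, and whether a job may be launched on it) can be sketched with in-memory dictionaries standing in for the configuration and operational databases. The class name `ResourceTracker`, the status strings, and the location labels are hypothetical and are not part of the control system's actual schema.

```python
from typing import Dict, Set


class ResourceTracker:
    """Toy stand-in for the configuration and operational databases."""

    def __init__(self, hardware_status: Dict[str, str]):
        # Configuration data: hardware location -> "available" or "service".
        self.hardware_status = dict(hardware_status)
        # Operational data: booted partition -> hardware it occupies.
        self.booted: Dict[str, Set[str]] = {}
        # Operational data: partition -> id of the job running on it.
        self.jobs: Dict[str, str] = {}

    def can_boot(self, hardware: Set[str]) -> bool:
        """Hardware must be available and not used by another booted partition."""
        if any(self.hardware_status.get(h) != "available" for h in hardware):
            return False
        return all(hardware.isdisjoint(used) for used in self.booted.values())

    def boot(self, partition: str, hardware: Set[str]) -> bool:
        if partition in self.booted or not self.can_boot(hardware):
            return False
        self.booted[partition] = set(hardware)
        return True

    def launch(self, partition: str, job_id: str) -> bool:
        """Refuse a second job on a partition that already has one running."""
        if partition not in self.booted or partition in self.jobs:
            return False
        self.jobs[partition] = job_id
        return True


tracker = ResourceTracker({"R00-M0": "available", "R00-M1": "service"})
results = [
    tracker.boot("P1", {"R00-M0"}),  # hardware is available
    tracker.boot("P2", {"R00-M1"}),  # hardware is under service
    tracker.launch("P1", "job-1"),   # partition is booted and idle
    tracker.launch("P1", "job-2"),   # partition already runs job-1
]
```

The sketch models the single-job-per-partition case only; launch paradigms in which several binaries share one partition would need a finer-grained record than one job id per partition.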
The control system provides several mechanisms for users to allocate resources and run jobs. Users have access to mpirun, a command-line program that supports creating partitions, booting partitions, and running jobs. It can be used to run a job on a booted partition, boot a pre-created partition, or create a partition, or combinations of the above. Schedulers can use APIs to perform the above three actions, or can call mpirun at any stage in the management. Note that mpirun does not take into consideration the multi-user nature of the machine. For this reason, users may choose to use a centralized resource manager (or scheduler) to ensure that user requests are processed fairly, taking into consideration such factors as priority, advanced reservations, and job duration.
The scheduler APIs are a set of functions that can be used to extract the machine topology and status. Using these APIs, a scheduler can gather physical layout, hardware status, and operational state. Schedulers use this information to create partitions dynamically and run user jobs on those partitions. The control system provides polling and event-based categories of APIs. The event-based ones allow a “real-time” notification model, in which the scheduler gets the starting snapshot of the machine, and then registers to be notified about any changes to hardware status or operational state. This notification model eliminates the need for the scheduler to poll for machine status changes.
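By way of a non-limiting illustration, the snapshot-then-subscribe pattern of the event-based APIs can be sketched as follows. The class names `ControlSystem` and `Scheduler`, the method names, and the callback signature are hypothetical; the actual scheduler APIs are not reproduced here.

```python
from typing import Callable, Dict, List, Optional

# Callback signature assumed for this sketch: (location, old_status, new_status).
Listener = Callable[[str, Optional[str], str], None]


class ControlSystem:
    def __init__(self, status: Dict[str, str]):
        self._status = dict(status)
        self._listeners: List[Listener] = []

    def snapshot(self) -> Dict[str, str]:
        """Return a copy of the current hardware status."""
        return dict(self._status)

    def register(self, listener: Listener) -> None:
        """Subscribe a callback to be invoked on every status change."""
        self._listeners.append(listener)

    def set_status(self, location: str, new: str) -> None:
        old = self._status.get(location)
        if old == new:
            return  # no change, no event
        self._status[location] = new
        for listener in self._listeners:
            listener(location, old, new)


class Scheduler:
    def __init__(self, control: ControlSystem):
        self.view = control.snapshot()     # starting snapshot of the machine
        control.register(self.on_change)   # then receive changes as events

    def on_change(self, location: str, old: Optional[str], new: str) -> None:
        self.view[location] = new


control = ControlSystem({"R00-M0-N00": "available", "R00-M0-N01": "available"})
scheduler = Scheduler(control)
control.set_status("R00-M0-N01", "service")
```

After the status change, the scheduler's local view matches the control system's state without the scheduler having issued any further query. A concurrent implementation would additionally need to order the snapshot and the registration so that no change is missed between the two.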
Some classes of user requests may be satisfied by a simple scheduler that creates a set of static partitions and allocates sets of those predefined partitions to users. For more complex job loads, a dynamic allocator is available. It provides schedulers with topology-aware allocation strategies for finding requested resources. In the default strategy, the dynamic allocator finds the first available hardware that meets the requested size and shape, while minimizing the fragmentation of the hardware. The system also provides a plug-in architecture in which additional algorithms for resource allocation can be supplied. The allocator plug-ins provide a fertile ground for collaboration in an open source community. Even with the dynamic allocator, it is important to have a mechanism to avoid resource request collisions, which can be provided by a central resource manager.
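By way of a non-limiting illustration, a first-fit search for a free block of a requested shape, together with a simple registry through which alternative strategies could be plugged in, can be sketched as follows. The three-dimensional grid, the names `first_fit`, `ALLOCATORS`, and `allocate`, and the absence of torus wrap-around and of any fragmentation heuristic are simplifying assumptions of this sketch.

```python
from itertools import product
from typing import Callable, Dict, Iterator, Optional, Set, Tuple

Coord = Tuple[int, int, int]
Shape = Tuple[int, int, int]


def box_cells(origin: Coord, shape: Shape) -> Iterator[Coord]:
    """Yield every cell of the box with the given origin and shape."""
    return product(*(range(o, o + s) for o, s in zip(origin, shape)))


def first_fit(dims: Shape, busy: Set[Coord], shape: Shape) -> Optional[Coord]:
    """Return the first origin (in scan order) where a free box of `shape` fits."""
    ranges = [range(d - s + 1) for d, s in zip(dims, shape)]
    for origin in product(*ranges):
        if all(cell not in busy for cell in box_cells(origin, shape)):
            return origin
    return None


# Plug-in registry: additional strategies could be added under new names.
ALLOCATORS: Dict[str, Callable[[Shape, Set[Coord], Shape], Optional[Coord]]] = {
    "first_fit": first_fit,
}


def allocate(dims: Shape, busy: Set[Coord], shape: Shape,
             strategy: str = "first_fit") -> Optional[Coord]:
    """Find a block with the chosen strategy and mark its cells busy."""
    origin = ALLOCATORS[strategy](dims, busy, shape)
    if origin is not None:
        busy.update(box_cells(origin, shape))
    return origin


busy: Set[Coord] = set()
first = allocate((2, 2, 4), busy, (2, 2, 2))
second = allocate((2, 2, 4), busy, (2, 2, 2))
third = allocate((2, 2, 4), busy, (2, 2, 2))
```

On a 2x2x4 grid, two 2x2x2 requests fit and a third is refused because no free block of that shape remains.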
RAS Software:
The software RAS strategy for Blue Gene is to limit the impact of failures, report RAS events in a consistent manner, persist events in a database, enable analysis of events, and alert administrators of conditions that require action.
The impact of hardware failures is limited through multiple techniques. One technique is redundant components, e.g., providing N+1 power modules. When a redundant component fails, an event is logged to indicate service should be scheduled. Another technique to limit the extent of a failure is to partition the system. The Blue Gene system allows flexibility in logically partitioning the machine so that multiple smaller jobs can be run simultaneously. These jobs are electrically isolated, and users cannot access or interfere with data flow on another partition. Failures are also isolated; a node failure only impacts its partition. Compute nodes are rebootable to recover from soft failures without rebooting the partition.
In one embodiment, there is provided the ability to reboot only a subset of the I/O nodes in a partition. This is an improvement on previous offerings because it allows a booted partition running ION Linux with existing Ethernet connections and mounted file systems to remain unaffected by a reboot of the compute nodes. This leads to improved stability of the I/O node complex, while providing the flexibility of either leaving the compute nodes booted across multiple jobs, or doing a reboot before each job starts.
The RAS architecture according to that embodiment defines the format of RAS Event descriptions, the APIs for reporting events, and the RAS handling framework. Events include a unique message id, location, severity, message, detailed description, and recommended service action(s). RAS Events are passed through a set of handlers in the Control System prior to being logged to expand the message from the compact binary format logged by CNK. This design reduces the kernel memory needed to log RAS messages.
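By way of a non-limiting illustration, the expansion of a compact binary record into a full RAS event by a chain of handlers can be sketched as follows. The record layout (three little-endian 32-bit words), the message catalog contents, and the handler `downgrade_during_diagnostics` are hypothetical and do not reflect the actual CNK record format.

```python
import struct
from typing import Callable, Dict, List

# Assumed compact layout: message id, location code, one detail word.
RECORD = struct.Struct("<III")

# Assumed catalog held by the control system, keyed by message id.
CATALOG = {
    0x0001: {
        "severity": "WARN",
        "message": "Correctable error count {detail} exceeded threshold",
        "service_action": "Schedule replacement of the affected part",
    },
}

Event = Dict[str, object]
Handler = Callable[[Event], Event]


def expand(raw: bytes) -> Event:
    """Turn the compact record into a full event using the catalog."""
    msg_id, location, detail = RECORD.unpack(raw)
    entry = CATALOG[msg_id]
    return {
        "message_id": msg_id,
        "location": f"node-{location}",
        "severity": entry["severity"],
        "message": entry["message"].format(detail=detail),
        "service_action": entry["service_action"],
    }


def downgrade_during_diagnostics(event: Event) -> Event:
    """Example context-dependent handler: lower severity while diagnostics run."""
    return {**event, "severity": "INFO"}


def process(raw: bytes, handlers: List[Handler]) -> Event:
    """Expand the record, then pass the event through each handler in order."""
    event = expand(raw)
    for handler in handlers:
        event = handler(event)
    return event


raw_record = RECORD.pack(0x0001, 17, 42)  # what the kernel would log
event = process(raw_record, [downgrade_during_diagnostics])
```

Only the 12-byte record needs to be produced on the node; the message text, severity, and recommended service action are attached afterwards, which is the reason the design reduces kernel memory.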
The Environmental Monitor in MMCS generates events for anomalous environmental conditions such as over temperature, over current, etc. The low-level Control System generates events for errors with power supplies, temperature monitors, fan speeds, network configuration, chip initialization, etc. Concentrating the RAS handling in the Control System has resulted in a scalable and flexible RAS architecture. Message text, severity, codes, and recommended service actions, can be adapted based on the operational context (running jobs, diagnostics, service action) of the machine. This provides system operators, in each context, accurate and meaningful information upon an error event.
A diagnostic package is provided to check the hardware and isolate problems. The diagnostics harness supports the execution of individual tests and test suites. A hardware checkup suite is provided to rapidly verify system health. To facilitate hardware replacement, a set of Service Action utilities is provided. A service-action-prepare step marks the hardware as under service in the database, gathers additional information for failure analysis, and powers off the necessary hardware. At this point a designated engineer can replace the hardware. The service-action-end step restores power to the hardware, runs diagnostics, and makes the hardware available by marking it active in the database. The diagnostics and service actions are executable from the command line or from the Navigator.
The Navigator RAS Event Log can be used to query, sort, and filter RAS events. The Navigator Health Center indicates to system administrators failure conditions needing attention. Software fixes are provided via efixes and are applied using the efix tool.
In an alternative embodiment, RAS is more extensible to enable new system components to contribute RAS information and handlers without requiring a change to the RAS library. In addition, an Error Log Analysis plug-in framework will be added to improve problem isolation. The RAS components leverage the system capability-based security model. Separate capabilities are associated with the execution of Diagnostics and with Service Actions.
Apps Development Environment:
Subcontractor's delivered XL FORTRAN and XL C/C++ compilers are standards-based, highly optimized compilers. These compilers provide advanced optimization and utilize specific hardware features of any embodiment. The compilers are proprietary and fully supported by Subcontractor. The XL FORTRAN compiler provides implementation of FORTRAN 2003 (ISO/IEC 1539-1:2004, ISO/IEC TR 15580:2001(E), ISO/IEC TR 15581:2001(E)).
For example, in one embodiment, the majority of the FORTRAN 2003 standard is supported, excepting parameterized derived types, but including object-oriented programming. In the alternative embodiment, FORTRAN 2003 is fully implemented. The XL C/C++ compiler provides full implementation for C (ANSI/ISO/IEC 9899:1999; ISO/IEC 9899:1999 Cor. 1:2001(E), ISO/IEC 9899:1999 Cor. 2:2004(E), ISO/IEC 9899:1999 Cor. 3:2007(E)) and C++ (ANSI/ISO/IEC 14882:2003, ISO/IEC 9945-1:1990/IEEE POSIX 1003.1-1990; ANSI/ISO-IEC 9899-1990 C standard, with support for Amendment 1:1994). Both XL FORTRAN and XL C/C++ compilers also provide full implementation of OpenMP (OpenMP V2.5 in one embodiment, and OpenMP V3.0 for an alternate embodiment). These compilers are an evolution of Subcontractor's XL compiler products for Linux on POWER, and benefit from functional, performance, and quality enhancements generated by the Linux on POWER user base.
The XL compilers provide industry-leading optimization technology. Through compiler options and directives, programmers may select from a range of optimization levels (-O2, -O3, -O4, and -O5). These levels allow the user to select comprehensive low-level optimization up through more extensive whole-program optimization.
In one embodiment, optimization and tuning for the BGP architecture includes -qarch=450, which generates code for the single floating point unit (FPU), while -qarch=450d generates parallel instructions for the 450d Double Hummer dual FPU. The -qtune=450 option optimizes code for the 450 family of processors. The XL compiler family includes a set of built-in functions that are optimized for the POWER architecture. In addition, on the BGP, the XL compilers provide a set of built-in functions that are specifically optimized for the 450d's Double Hummer dual FPU.
In the alternate embodiment, the XL compiler provides automatic SIMD vectorization to exploit the QPX unit, and automatic speculative parallelization to exploit the new hardware for speculative execution. The compiler also provides support for a variety of intrinsics and pragmas (SIMD intrinsics, Transactional Memory (TM) directives, and prefetching pragmas), which allow the user to directly exploit new hardware features.
Mathematical Acceleration Subsystem (MASS and MASSV) and ESSL libraries may additionally be provided. These libraries provide high performance scalar and vector functions that perform common mathematical computations. The libraries are tuned specifically to yield improved performance over standard mathematical library routines. Under higher levels of optimization, the XL compilers can identify patterns in code that can be replaced by calls to MASS subroutines. There is also provided the Basic Linear Algebra Subroutines (BLAS) set of high-performance linear algebraic functions. The compilers may be dependent on the GNU toolchain for linker, loader, and GNU C library. The GNU toolchain includes GNU OpenMP (GOMP).
As described with respect to the CNK, Blue Gene provides a rich program counting interface, i.e., BGQ allows detailed fine-grained simultaneous monitoring of numerous performance metrics. CNK will provide a user-space mapping of control registers for managing the 1024 performance counters. The counters can be configured in three modes. There is a distributed count mode, a detailed count mode, and a trace mode. The distributed count mode allows some counters from all of the cores to be monitored; in detailed mode, a large number of counters from a single core may be monitored. In trace mode, every instruction is recorded. Approximately 1500 cycles of instruction information can be traced in this mode. The distributed and detailed modes also apply to the L2.
The GNU autoconf tool is a popular configuration tool for software projects that must compile and cross-compile on multiple hardware and software platforms. Autoconf provides an open source, portable and flexible configuration infrastructure that is well understood in the software development community. For autoconf to be effective developers must understand and correctly utilize its function. While cross-compilation is straightforward, the build infrastructure for large software code bases can become complex. Often, a build has external dependencies beyond the control of the developer. To ameliorate situations where modifying the complex build infrastructure is not palatable, there is provided a solution to allow remote execution of binaries as required by autoconf.
There is further provided a comprehensive solution allowing the binaries to be run on a High Throughput Cluster (HTC) partition of an alternate embodiment, e.g., Sequoia, transparently to the autoconf environment. This solution provides an identically-matched environment on a CN rather than a closely-matched one on an ION.
Two performance toolkits may be supplied to support application tuning and enablement. The first toolkit, known as the High Performance Computing Toolkit (HPCT), is a suite of tools that focus on performance analysis, as opposed to tuning. These tools are designed for performance data collection in both their organization and presentation. The user is provided various views of the performance data. These views are correlated to the application's source code for improved user understanding. The toolkit is organized around five basic “dimensions” of performance relative to HPC applications: (1) CPU, (2) Memory, (3) Message-Passing with MPI, (4) Threading with OpenMP, and (5) File I/O. This five-dimensional framework was developed over years of working with scientists and engineers to provide a natural and intuitive means to manage the potentially large sets of performance data that are collected with large-scale applications.
The tool may use a visual abstraction of the application that allows the user to interact with it at the source level, but all instrumentation is performed on the binary executable. For example, the user can create instrumentation points based on either the specific type of information desired (e.g., all MPI_Wait calls involving array foobar in function foo), or else can visually select portions of the source code to be instrumented. The framework collects these high-level specifications for instrumentation from the user, creates the appropriate binary coding of them, and inserts them into the existing binary executable. No recompilation of the application is performed. This preserves the integrity of the user's source code, which does not get altered in the HPCT framework.
In addition, the infrastructure for collecting the performance data is inherently scalable, since the specifics of the data collection are contained in the modified binary executable. In other words, this instrumented binary carries with it the “DNA” of the HPCT data collection framework wherever it executes, regardless of how many processors it runs on. The performance data is persistent and remains in a distributed filesystem for post-mortem analysis by the remainder of the HPCT.
The second toolkit, known as the High Productivity Computing Systems Toolkit (HPCST), is a framework dedicated to application tuning, as opposed to analysis. It is complementary to the HPCT in that it can be used in conjunction with it, and that it employs the same means of abstraction for its instrumentation needs. In particular, the HPCST consists of two main components: a Bottleneck Detection Engine (BDE) and a Solution Determination Engine (SDE). The BDE is a rule-based knowledge system that provides an automated means of finding performance bottlenecks. It can be used in two modes. In the first, an application can be tested for the presence of known bottleneck signatures as stored in a BDE-repository. These a-priori signatures are developed by expert users with a simple conditional grammar. The bottleneck signatures can be persistent and even community developed because the repository and grammar are open. The second mode of use for the BDE is by means of dynamic interrogation. The signature grammar is of sufficient power so as to allow users to ask very specific “questions” about the behavior of an application. This mode is an extremely powerful means for being able to understand large volumes of performance data, typically unsuitable for traditional methods of display (tables and charts). It provides a method of inserting human intelligence into the tuning effort in an automated and programmable manner. It is analogous to extracting information patterns from large scale databases.
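By way of a non-limiting illustration, testing collected metrics against a repository of bottleneck signatures written in a simple conditional grammar can be sketched as follows. The grammar used here (`metric operator threshold`), the metric names, the threshold values, and the function names are hypothetical; the actual BDE grammar is not reproduced.

```python
import operator
from typing import Callable, Dict, List, Tuple

# Comparison operators accepted by the assumed one-line grammar.
OPS: Dict[str, Callable[[float, float], bool]] = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def parse(rule: str) -> Tuple[str, Callable[[float, float], bool], float]:
    """Parse a rule of the form '<metric> <op> <threshold>'."""
    metric, op, threshold = rule.split()
    return metric, OPS[op], float(threshold)


def detect(repository: Dict[str, str], metrics: Dict[str, float]) -> List[str]:
    """Return the names of all signatures whose condition holds for the metrics."""
    found = []
    for name, rule in repository.items():
        metric, compare, threshold = parse(rule)
        if metric in metrics and compare(metrics[metric], threshold):
            found.append(name)
    return found


# Hypothetical signature repository and measured metrics.
repository = {
    "load_imbalance": "mpi_wait_fraction > 0.30",
    "poor_cache_locality": "l1_miss_rate > 0.10",
}
metrics = {"mpi_wait_fraction": 0.42, "l1_miss_rate": 0.02}
bottlenecks = detect(repository, metrics)
```

Because signatures are plain data rather than code, a repository of this kind can be stored, shared, and extended by a community, and a user can pose an ad hoc question by evaluating a single new rule against the same metrics.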
The SDE component of the HPCST mines the results of the BDE and searches for underlying causes for any of the bottlenecks found by it. The overall process for the HPCST is to automatically determine the presence of bottlenecks via the BDE, and then further analyze those bottlenecks to find the underlying causes via the SDE. The user will then be presented with various results and logs that include specific measures of how to mitigate the bottlenecks that were found. The process can be iterated to further understand the application's performance behavior, and modified appropriately by the user.
Message Passing System:
The Blue Gene messaging stack exposes two levels of APIs. A lower level one called System Programmer Interface (SPI) is a minimalistic layer of software that allows hardware (message queues, counters, etc) manipulation from user space. Starting on BGP the SPI is a fully supported and documented layer for achieving maximum performance from the hardware. Built on the SPI layer, Deep Computing Messaging Framework (DCMF) supports high performance message passing and shared memory programming models, such as OpenMP, ARMCI, Charm++, UPC, and others. MPICH is built on top of DCMF.
Consistent with the high performance focus of Blue Gene, DCMF is available in user space and directly interacts with the messaging unit hardware. Kernel system calls are minimized. There is a single torus network on Sequoia and the messaging stack is designed to drive it at its maximum rate. DCMF is designed to take advantage of all of the links of a node as well as to choose the optimal network, torus or collective, for performing a given operation.
The messaging stack has been co-designed with CNK. As above, CNK has been designed with high performance applications in mind, and obviates the need for pinning memory for DMA. Short unexpected messages are handled by using temporary buffers. The messaging stack minimizes the amount of memory needed that grows with the number of MPI tasks. Most of this type of memory is required by the MPI specification, not by the messaging stack.
If the MPI specification is strictly followed, the amount of memory used for MPI vector collective operations will be large.
There are four potential areas that can affect the memory used for buffering. Eager connection list memory could be controlled by switching between the array, which is faster, and the hash table, which is smaller. The memory used for rank map allocation cannot be controlled by the user. The size of the shared memory FIFOs may be set by the user with an environment variable. User applications that use many ranks should be “well behaved” and either not issue MPI vector collective operations or not expect strict MPI specification compliance when issuing them.
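By way of a non-limiting illustration, the storage trade-off between an array-based and a hash-table-based eager connection list can be sketched as follows. The class names and the notion of a per-peer "state" object are hypothetical; the sketch shows only how the number of stored entries scales, not the relative lookup speed on the actual hardware.

```python
from typing import Dict, List, Optional


class ArrayConnectionList:
    """One slot per rank: direct indexing, storage grows with the job size."""

    def __init__(self, num_ranks: int):
        self._slots: List[Optional[str]] = [None] * num_ranks

    def set(self, rank: int, state: str) -> None:
        self._slots[rank] = state

    def get(self, rank: int) -> Optional[str]:
        return self._slots[rank]

    def entries(self) -> int:
        return len(self._slots)


class HashConnectionList:
    """One entry per peer actually contacted: storage grows with the peer count."""

    def __init__(self, num_ranks: int):
        self._table: Dict[int, str] = {}

    def set(self, rank: int, state: str) -> None:
        self._table[rank] = state

    def get(self, rank: int) -> Optional[str]:
        return self._table.get(rank)

    def entries(self) -> int:
        return len(self._table)


num_ranks = 100_000
peers = [1, 512, 99_999]  # ranks this task actually exchanges eager messages with

array_list = ArrayConnectionList(num_ranks)
hash_list = HashConnectionList(num_ranks)
for peer in peers:
    array_list.set(peer, "connected")
    hash_list.set(peer, "connected")
```

Both structures answer the same queries, but the array holds one slot per rank in the job while the hash table holds one entry per peer in use, which is why the choice matters as the number of MPI tasks grows.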
The messaging stack takes advantage of shared memory to improve performance. By making each core's memory visible to the other cores on the node, the point-to-point shmem FIFOs would be smaller, as the bulk of the data transfer is accomplished by a direct memcpy( ) by the receiver out of the sender's memory. “Non SMP” mode collectives require a local collective before, and sometimes after, the network collective. Also, the cores would synchronize with shared memory, and then the cores would directly access the input data to perform the operation and pipeline the result to the network collective phase.
Advantageously, the novel packaging and system management methods and apparatuses of the present invention support the aggregation of the computing nodes to unprecedented levels of scalability, supporting the computation of “Grand Challenge” problems in parallel computing, and addressing a large class of problems including those where the high performance computational kernel involves finite difference equations, dense or sparse equation solution or transforms, and that can be naturally mapped onto a multidimensional grid. Classes of problems for which the present invention is particularly well-suited are encountered in the field of molecular dynamics (classical and quantum) for life sciences and material sciences, computational fluid dynamics, astrophysics, Quantum Chromodynamics, pointer chasing, and others.
Although the embodiments of the present invention have been described in detail, it should be understood that various changes and substitutions can be made therein without departing from the spirit and scope of the inventions as defined by the appended claims. Variations described for the present invention can be realized in any combination desirable for each particular application. Thus particular limitations, and/or embodiment enhancements described herein, which may have particular advantages to a particular application, need not be used for all applications. Also, not all limitations need be implemented in methods, systems and/or apparatus including one or more concepts of the present invention.
The present invention can be realized in hardware, software, or a combination of hardware and software. A typical combination of hardware and software could be a general purpose computer system with a computer program that, when being loaded and run, controls the computer system such that it carries out the methods described herein. The present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which—when loaded in a computer system—is able to carry out these methods.
Computer program means or computer program in the present context include any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after conversion to another language, code or notation, and/or reproduction in a different material form.
Thus the invention includes an article of manufacture which comprises a computer usable medium having computer readable program code means embodied therein for causing a function described above. The computer readable program code means in the article of manufacture comprises computer readable program code means for causing a computer to effect the steps of a method of this invention. Similarly, the present invention may be implemented as a computer program product comprising a computer usable medium having computer readable program code means embodied therein for causing a function described above. The computer readable program code means in the computer program product comprises computer readable program code means for causing a computer to effect one or more functions of this invention. Furthermore, the present invention may be implemented as a program storage device readable by machine, tangibly embodying a program of instructions runnable by the machine to perform method steps for causing one or more functions of this invention.
The present invention may be implemented as a computer readable medium (e.g., a compact disc, a magnetic disk, a hard disk, an optical disk, solid state drive, digital versatile disc) embodying program computer instructions (e.g., C, C++, Java, Assembly languages, .Net, Binary code) run by a processor (e.g., Intel® Core™, IBM® PowerPC®) for causing a computer to perform method steps of this invention. The present invention may include a method of deploying a computer program product including a program of instructions in a computer readable medium for one or more functions of this invention, wherein, when the program of instructions is run by a processor, the computer program product performs the one or more functions of this invention.
It is noted that the foregoing has outlined some of the more pertinent objects and embodiments of the present invention. This invention may be used for many applications. Thus, although the description is made for particular arrangements and methods, the intent and concept of the invention is suitable and applicable to other arrangements and applications. It will be clear to those skilled in the art that modifications to the disclosed embodiments can be effected without departing from the spirit and scope of the invention. The described embodiments ought to be construed to be merely illustrative of some of the more prominent features and applications of the invention. Other beneficial results can be realized by applying the disclosed invention in a different manner or modifying the invention in ways known to those familiar with the art.
While the invention has been particularly shown and described with respect to illustrative and preferred embodiments thereof, it will be understood by those skilled in the art that the foregoing and other changes in form and details may be made therein without departing from the spirit and scope of the invention that should be limited only by the scope of the appended claims.
The present disclosure is a continuation application of U.S. patent application Ser. No. 13/004,007 filed Jan. 10, 2011, which claims benefit of priority from commonly-owned U.S. Provisional Patent Application Ser. No. 61/293,611, filed on Jan. 8, 2010, and additionally claims priority from U.S. Provisional Application Ser. No. 61/295,669, filed Jan. 15, 2010, and additionally claims priority from U.S. Provisional Application Ser. No. 61/299,911, filed Jan. 29, 2010, the entire contents and disclosure of each of which is expressly incorporated by reference herein as if fully set forth herein. The present invention further relates to the following commonly-owned, co-pending United States Patent Applications, the entire contents and disclosure of each of which is expressly incorporated by reference herein as if fully set forth herein: U.S. Pat. No. 8,275,954, for “USING DMA FOR COPYING PERFORMANCE COUNTER DATA TO MEMORY”; U.S. Pat. No. 8,275,964 for “HARDWARE SUPPORT FOR COLLECTING PERFORMANCE COUNTERS DIRECTLY TO MEMORY”; U.S. patent application Ser. No. 12/684,190 for “HARDWARE ENABLED PERFORMANCE COUNTERS WITH SUPPORT FOR OPERATING SYSTEM CONTEXT SWITCHING”; U.S. Pat. No. 8,468,275, for “HARDWARE SUPPORT FOR SOFTWARE CONTROLLED FAST RECONFIGURATION OF PERFORMANCE COUNTERS”; U.S. Pat. No. 8,347,001, for “HARDWARE SUPPORT FOR SOFTWARE CONTROLLED FAST MULTIPLEXING OF PERFORMANCE COUNTERS”; U.S. patent application Ser. No. 12/697,799, for “CONDITIONAL LOAD AND STORE IN A SHARED CACHE”; U.S. Pat. No. 8,595,389, for “DISTRIBUTED PERFORMANCE COUNTERS”; U.S. Pat. No. 8,103,910, for “LOCAL ROLLBACK FOR FAULT-TOLERANCE IN PARALLEL COMPUTING SYSTEMS”; U.S. Pat. No. 8,447,960, for “PROCESSOR WAKE ON PIN”; U.S. Pat. No. 8,268,389, for “PRECAST THERMAL INTERFACE ADHESIVE FOR EASY AND REPEATED, SEPARATION AND REMATING”; U.S. Pat. No. 8,359,404, for “ZONE ROUTING IN A TORUS NETWORK”; U.S. patent application Ser. No. 12/684,852, for “PROCESSOR WAKEUP UNIT”; U.S. Pat. No.
8,429,377, for “TLB EXCLUSION RANGE”; U.S. Pat. No. 8,356,122, for “DISTRIBUTED TRACE USING CENTRAL PERFORMANCE COUNTER MEMORY”; U.S. patent application Ser. No. 13/008,602, for “PARTIAL CACHE LINE SPECULATION SUPPORT”; U.S. Pat. No. 8,473,683, for “ORDERING OF GUARDED AND UNGUARDED STORES FOR NO-SYNC I/O”; U.S. Pat. No. 8,458,267, for “DISTRIBUTED PARALLEL MESSAGING FOR MULTIPROCESSOR SYSTEMS”; U.S. Pat. No. 8,086,766, for “SUPPORT FOR NON-LOCKING PARALLEL RECEPTION OF PACKETS BELONGING TO THE SAME MESSAGE”; U.S. Pat. No. 8,571,834, for “OPCODE COUNTING FOR PERFORMANCE MEASUREMENT”; U.S. patent application Ser. No. 12/684,776, for “MULTI-INPUT AND BINARY REPRODUCIBLE, HIGH BANDWIDTH FLOATING POINT ADDER IN A COLLECTIVE NETWORK”; U.S. Pat. No. 8,533,399, for “CACHE DIRECTORY LOOK-UP REUSE”; U.S. Pat. No. 8,621,478, for “MEMORY SPECULATION IN A MULTI LEVEL CACHE SYSTEM”; U.S. patent application Ser. No. 13/008,583, for “METHOD AND APPARATUS FOR CONTROLLING MEMORY SPECULATION BY LOWER LEVEL CACHE”; U.S. patent application Ser. No. 12/984,308, for “MINIMAL FIRST LEVEL CACHE SUPPORT FOR MEMORY SPECULATION MANAGED BY LOWER LEVEL CACHE”; U.S. patent application Ser. No. 12/984,329, for “PHYSICAL ADDRESS ALIASING TO SUPPORT MULTI-VERSIONING IN A SPECULATION-UNAWARE CACHE”; U.S. Pat. No. 8,255,633, for “LIST BASED PREFETCH”; U.S. Pat. No. 8,347,039, for “PROGRAMMABLE STREAM PREFETCH WITH RESOURCE OPTIMIZATION”; U.S. patent application Ser. No. 13/004,005, for “FLASH MEMORY FOR CHECKPOINT STORAGE”; U.S. Pat. No. 8,359,367, for “NETWORK SUPPORT FOR SYSTEM INITIATED CHECKPOINTS”; U.S. Pat. No. 8,327,077, for “TWO DIFFERENT PREFETCH COMPLEMENTARY ENGINES OPERATING SIMULTANEOUSLY”; U.S. Pat. No. 8,364,844, for “DEADLOCK-FREE CLASS ROUTES FOR COLLECTIVE COMMUNICATIONS EMBEDDED IN A MULTI-DIMENSIONAL TORUS NETWORK”; U.S. Pat. No. 8,549,363, for “IMPROVING RELIABILITY AND PERFORMANCE OF A SYSTEM-ON-A-CHIP BY PREDICTIVE WEAR-OUT BASED ACTIVATION OF FUNCTIONAL COMPONENTS”; U.S. Pat. 
No. 8,571,847, for “A SYSTEM AND METHOD FOR IMPROVING THE EFFICIENCY OF STATIC CORE TURN OFF IN SYSTEM ON CHIP (SoC) WITH VARIATION”; U.S. patent application Ser. No. 12/697,043, for “IMPLEMENTING ASYNCHRONOUS COLLECTIVE OPERATIONS IN A MULTI-NODE PROCESSING SYSTEM”; U.S. patent application Ser. No. 13/008,546, for “MULTIFUNCTIONING CACHE”; U.S. patent application Ser. No. 12/697,175 for “I/O ROUTING IN A MULTIDIMENSIONAL TORUS NETWORK”; U.S. Pat. No. 8,370,551 for ARBITRATION IN CROSSBAR FOR LOW LATENCY; U.S. Pat. No. 8,312,193 for EAGER PROTOCOL ON A CACHE PIPELINE DATAFLOW; U.S. Pat. No. 8,521,990 for EMBEDDED GLOBAL BARRIER AND COLLECTIVE IN A TORUS NETWORK; U.S. Pat. No. 8,412,974 for GLOBAL SYNCHRONIZATION OF PARALLEL PROCESSORS USING CLOCK PULSE WIDTH MODULATION; U.S. patent application Ser. No. 12/796,411 for IMPLEMENTATION OF MSYNC; U.S. patent application Ser. No. 12/796,389 for NON-STANDARD FLAVORS OF MSYNC; U.S. patent application Ser. No. 12/696,817 for HEAP/STACK GUARD PAGES USING A WAKEUP UNIT; U.S. Pat. No. 8,527,740 for MECHANISM OF SUPPORTING SUB-COMMUNICATOR COLLECTIVES WITH O(64) COUNTERS AS OPPOSED TO ONE COUNTER FOR EACH SUB-COMMUNICATOR; and U.S. Pat. No. 8,595,554 for REPRODUCIBILITY IN BGQ.
This invention was made with Government support under subcontract number B554331 awarded by the Department of Energy. The Government has certain rights in this invention.
Number | Name | Date | Kind |
---|---|---|---|
5941981 | Tran | Aug 1999 | A |
5958040 | Jouppi | Sep 1999 | A |
6047363 | Lewchuk | Apr 2000 | A |
6134643 | Kedem | Oct 2000 | A |
6230252 | Passint | May 2001 | B1 |
7350029 | Fluhr | Mar 2008 | B2 |
8103832 | Gara | Jan 2012 | B2 |
8255633 | Boyle | Aug 2012 | B2 |
8327077 | Boyle | Dec 2012 | B2 |
8347039 | Boyle | Jan 2013 | B2 |
20020046324 | Barroso | Apr 2002 | A1 |
20040078482 | Blumrich | Apr 2004 | A1 |
20050132148 | Arimilli | Jun 2005 | A1 |
20070143550 | Rajwar | Jun 2007 | A1 |
20070226462 | Scott | Sep 2007 | A1 |
20080126750 | Sistla | May 2008 | A1 |
20090006808 | Blumrich | Jan 2009 | A1 |
Entry |
---|
Cintra et al, “Architectural Support for Scalable Speculative Parallelization in Shared-Memory Multiprocessors,” May 2000, Proceedings of the 27th annual international symposium on Computer architecture, pp. 13-24. |
Gendler et al., “A PAB-Based Multi-Prefetcher Mechanism,” Apr. 2006, International Journal of Parallel Programming, vol. 34, No. 2, pp. 171-188. |
Number | Date | Country | |
---|---|---|---|
20160011996 A1 | Jan 2016 | US |
Number | Date | Country | |
---|---|---|---|
61293611 | Jan 2010 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 13004007 | Jan 2011 | US |
Child | 14701371 | | US |