This technology generally relates to improving processing efficiency. More particularly, the technology herein relates to specialized circuitry for handling data synchronization.
Users want deep learning and high performance computing (HPC) compute programs to continue to scale as graphics processing unit (GPU) technology improves and the number of processing core units increases per chip with each generation. What is desired is a faster time to solution for a single application, not scaling only by running N independent applications.
Due to the potentially massive number of computations deep learning requires, faster is usually the goal. And it makes intuitive sense that performing many computations in parallel will speed up processing as compared to performing all those computations serially. In fact, the amount of performance benefit an application will realize by running on a given GPU implementation typically depends entirely on the extent to which it can be parallelized. But there are different approaches to parallelism.
Conceptually, to speed up a process, one might have each parallel processor perform more work, or one might instead keep the amount of work on each parallel processor constant and add more parallel processors (see
Computer scientists refer to the first approach as “weak scaling” and the second approach as “strong scaling.”
Users of such applications thus typically want strong scaling, which means a single application can achieve higher performance without having to change its workload -- for instance, by increasing its batch size to create more inherent parallelism. Users also expect increased speed performance when running existing (e.g., recompiled) applications on new, more capable GPU platforms offering more parallel processors. GPU development has met or even exceeded the expectations of the marketplace in terms of more parallel processors and more coordination/cooperation between increased numbers of parallel execution threads running on those parallel processors - but further performance improvements to achieve strong scaling are still needed.
Parallel processing also creates the need for communication and coordination between parallel execution threads or blocks. Synchronization primitives are an essential building block of parallel programming. Besides the functional correctness such synchronization primitives guarantee, they also contribute to improved performance and scalability.
One way for different execution processes to coordinate their states with one another is by using barrier synchronization. Barrier synchronization typically involves each process in a collection of parallel-executing processes waiting at a barrier until all other processes in the collection catch up. No process can proceed beyond the barrier until all processes reach the barrier.
In modern GPU architectures, many execution threads execute concurrently, and many warps each comprising many threads also execute concurrently. When threads in a warp need to perform more complicated communications or collective operations, the developer can use for example NVIDIA’s CUDA “__syncwarp” primitive to synchronize threads. The __syncwarp primitive initializes hardware mechanisms that cause an executing thread to wait before resuming execution until all threads specified in a mask have called the primitive with the same mask. For more details see for example U.S. Pat. Nos. 8,381,203; 9,158,595; 9,442,755; 9,448,803; 10,002,031; and 10,013,290; and see also https://devblogs.nvidia.com/using-cuda-warp-level-primitives/; and https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions.
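By way of illustration only, the following minimal CUDA sketch shows the warp-level synchronization pattern described above; the kernel name, the staging buffer, and the assumption of a single 32-thread warp per block are illustrative and not taken from this disclosure.

// Illustrative warp-level synchronization using __syncwarp() (assumes blockDim.x == 32).
__global__ void warpExchange(int *out)
{
    __shared__ int staging[32];
    unsigned lane = threadIdx.x % 32;

    staging[lane] = (int)lane;                 // each lane publishes a value
    __syncwarp();                              // wait until every lane in the mask has arrived

    int neighbor = staging[(lane + 1) % 32];   // safe: peer writes are now visible
    __syncwarp();                              // protect staging[] before any later reuse

    out[threadIdx.x] = neighbor;
}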
Before NVIDIA’s Cooperative Groups API, both execution control (i.e., thread synchronization) and inter-thread communication were generally limited to the level of a thread block (also called a “cooperative thread array” or “CTA”) executing on one SM. The Cooperative Groups API extended the CUDA programming model to describe synchronization patterns both within and across a grid or across multiple grids, and thus potentially (depending on hardware platform) spanning a single device or multiple devices. The Cooperative Groups API provides CUDA device code APIs for defining, partitioning, and synchronizing groups of threads - where “groups” are programmable and can extend across thread blocks. The Cooperative Groups API also provides host-side APIs to launch grids whose threads are all scheduled by software-based scheduling to be launched concurrently. These Cooperative Groups API primitives enable additional patterns of cooperative parallelism within CUDA, including producer-consumer parallelism and global synchronization across an entire thread grid or even across multiple GPUs, without requiring hardware changes to the underlying GPU platforms.
For example, the Cooperative Groups API provides a grid-wide (and thus often device-wide) synchronization barrier (“grid.sync()”) that can be used to prevent threads within the grid group from proceeding beyond the barrier until all threads in the defined grid group have reached that barrier. Such device-wide synchronization is based on the concept of a grid group (“grid_group”) defining a set of threads within the same grid, scheduled by software to be resident on the device and schedulable on that device in such a way that each thread in the grid group can make forward progress. Thread groups could range in size from a few threads (smaller than a warp) to a whole thread block, to all thread blocks in a grid launch, to grids spanning multiple GPUs. Newer GPU platforms such as NVIDIA Pascal and Volta GPUs enable grid-wide and multi-GPU synchronizing groups, and Volta’s independent thread scheduling enables significantly more flexible selection and partitioning of thread groups at arbitrary cross-warp and sub-warp granularities.
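As a hedged illustration of the grid-wide barrier (not code from this disclosure), the sketch below uses the Cooperative Groups grid_group; it assumes the kernel is launched with cudaLaunchCooperativeKernel so that all thread blocks are co-resident, and the array names are arbitrary.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two-phase kernel: no thread may begin phase 2 until every thread in the
// grid group has finished phase 1. Requires a cooperative launch.
__global__ void twoPhase(const float *in, float *tmp, float *out, int n)
{
    cg::grid_group grid = cg::this_grid();
    int tid    = (int)grid.thread_rank();
    int stride = (int)grid.size();

    for (int i = tid; i < n; i += stride)
        tmp[i] = in[i] * 2.0f;                            // phase 1

    grid.sync();                                          // grid-wide barrier

    for (int i = tid; i < n; i += stride)
        out[i] = tmp[i] + (i > 0 ? tmp[i - 1] : 0.0f);    // phase 2 reads phase-1 results
}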
There is still a need for faster synchronization that can improve the performance of a group of processes executing on multiple processors.
Embodiments of this disclosure are directed to a new synchronization primitive and to related methods and systems. Example embodiments provide for producer processes and consumer processes, even if executing on respectively different processors, to synchronize with low latency, such as, for example, a latency of approximately half a roundtrip time incurred in memory access. This fast synchronization is referred to herein as “speed of light” (SOL) synchronization.
Strong scaling was described above in relation to
Since wires are expensive and do not scale as well as processing bandwidth, brute-force adding wires for extra bandwidth is no longer a feasible option. Instead, embracing locality is viewed as a more promising design choice. The goal of the embodiments described herein is to enable efficient data sharing and localized communication at a level greater than one SM while minimizing synchronization latency. Besides new cross-SM cooperation mechanisms, proper synchronization primitives play a critical role in such design.
Conventionally, CUDA provides the hardware named barrier as its core synchronization primitive, which mostly follows the BSP (Bulk Synchronous Parallel) model. The arrive-wait barrier described in U.S. Pat. Application No. 16/712,236 filed Dec. 12, 2019 was introduced to better serve producer-consumer style synchronization. However, named barriers and arrive-wait barriers, while highly useful, each have their own weaknesses. For example, a hardware named barrier is a dedicated processor-local resource that provides a limited number of barriers and which is difficult to expose to software, may be incompatible with the thread programming model, may provide inefficient support for producer-consumer communication, and may be hard to extend to cross-processor synchronization. The arrive-wait barrier does not suffer from many of these disadvantages but is often implemented as a shared-memory backed resource that provides a software-polling based wait operation. Such an arrive-wait barrier may incur a latency exposure and a substantial bandwidth cost in shared memory traffic. For example, given the extensive additional cross-processor guaranteed concurrency the CGA programming model provides, more efficient cross-processor asynchronous data exchange and associated synchronization could result in substantial performance improvements by reducing bandwidth requirements across long data paths between parallel processors.
The illustrated GPU shows how some GPU implementations may enable plural partitions that operate as micro GPUs such as the shown micro GPU0 and micro GPU1, where each micro GPU includes a portion of the processing resources of the overall GPU. When the GPU is partitioned into two or more separate smaller micro GPUs for access by different clients, resources -- including the physical memory devices such as local L2 cache memories -- are also typically partitioned. For example, in one design, a first half of the physical memory devices coupled to micro GPU0 may correspond to a first set of memory partition locations and a second half of the physical memory devices coupled to micro GPU1 may correspond to a second set of memory partition locations. Performance resources within the GPU are also partitioned according to the two or more separate smaller processor partitions. The resources may include level two cache (L2) resources and processing resources. One embodiment of such a Multi-Instance GPU (“MIG”) feature allows the GPU to be securely partitioned into many separate GPU Instances for CUDA (“Compute Unified Device Architecture”) applications, providing multiple users with separate GPU resources to accelerate their respective applications. More particularly, each micro GPU includes a plurality of Graphics Processing Clusters (GPCs) each with a plurality of SMs. Each GPC connects to the L2 cache via a crossbar interconnect.
Each GPC includes a plurality of streaming multiprocessors (SM) that are each a massively parallel processor including a plurality of processor cores, register files, and specialized units such as load/store units, texture units, etc. A memory management unit (MMU) in each GPC interconnects the SMs on the same GPC, and also provides each SM with access to the memory including L2 cache and other memory. The GPCs in the same micro GPU are interconnected by a crossbar switch, and the micro-GPUs are interconnected by the respective crossbar switches. The GPU may additionally have copy engines and other IO units and links for external connections. For more information on prior GPU hardware and how it has advanced, see for example USP8,112,614; USP7,506,134; USP7,836,118; USP7,788,468; US10909033; US20140122809; Lindholm et al, “NVIDIA Tesla: A Unified Graphics and Computing Architecture,” IEEE Micro (2008); https://docs.nvidia.com/cuda/parallel-thread-execution/index.html (retrieved 2021); Choquette et al, “Volta: Performance and Programmability”, IEEE Micro (Volume: 38, Issue: 2, Mar./April 2018), DOI: 10.1109/MM.2018.022071134.
In one embodiment, a CGA is a collection of CTAs where hardware guarantees that all CTAs of the CGA are launched to the same hardware organization level the CGA specifies or is associated with. The hardware is configured to make sure there are enough processing resources in the target hardware level to launch all CTAs of the CGA before launching any.
As
For example, CGAs let an application take advantage of the hierarchical nature of the interconnect and caching subsystem in modern GPUs and make it easier to scale as chips grow in the future. By exploiting spatial locality, CGAs allow more efficient communication and lower latency data movement. GPU hardware improvements guarantee that the threads of the plural CTAs defined by the new CGA hierarchical level(s) will run concurrently with the desired spatial locality, by allowing CGAs to control where on the machine the concurrent CTA threads will run relative to one another.
In one embodiment, CGAs are composed of clusters of CTAs that are guaranteed by hardware to launch and execute simultaneously/concurrently. The CTAs in a CGA cluster may -- and in the general case will -- execute on different SMs within the GPU. Even though the CTAs execute on different SMs, the GPU hardware/system nevertheless provides a cross-SM guarantee that the CTAs in a CGA cluster will be scheduled to execute concurrently. The GPU hardware/system also provides efficient mechanisms by which the concurrently-executing CTAs can communicate with one another. This allows an application to explicitly share data between the CTAs in a CGA cluster and also enables synchronization between the various threads of the CTAs in the CGA cluster.
In example embodiments, the various threads within the CGA cluster can read/write from common shared memory -- enabling any thread in the CGA cluster to share data with any other thread in the cluster. Sharing data between CTAs in the CGA cluster saves interconnect and memory bandwidth which is often the performance limiter for an application.
Now, using the concurrent execution and additional shared memory supported by hardware, it is possible to directly share data between threads of one CTA and threads of another CTA - enabling dependencies across CTAs that can bridge hardware (e.g., cross-SM) partitions.
Because CGAs guarantee all their CTAs execute concurrently with a known spatial relationship, other hardware optimizations are possible such as: Multicasting data returned from memory to multiple SMs (CTAs) to save interconnect bandwidth as in embodiments of this disclosure; Direct SM2SM communication for lower latency data sharing and improved synchronization between producer and consumer threads in the CGA; Hardware barriers for synchronizing execution across all (or any) threads in a CGA; and more (see copending commonly-assigned patent applications listed above).
The additional cluster overlay provided by the CGA defines where and when the CTAs will run, and in particular, guarantees that all CTAs of a CGA will run concurrently within a common hardware domain that provides dynamic sharing of data, messaging and synchronization between the CTAs.
In example embodiments, all CTA threads within a CGA may reference various types of commonly-accessible shared memory. Hardware support in the GPU allows the different CTAs in a CGA cluster to read and write each other’s shared memory. Thus, load, store and atomic memory accesses by a first CTA can target shared memory of a second CTA, where the first and second CTAs are within the same CGA cluster. In some embodiments, the source multicast SM writes to the receiver multicast SMs in their respective memories using a distributed memory mechanism. An example distributed memory that may be used in embodiments is described in U.S. Application No. 17/691,690, which is incorporated herein by reference in its entirety. In some example embodiments, the distributed shared memory of the respective SMs is mapped into generic address space.
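For illustration, and assuming a CUDA 12-style cluster API (cooperative_groups::this_cluster() and map_shared_rank()) rather than any API defined by this disclosure, the sketch below shows one CTA reading directly from a peer CTA’s shared memory within the same cluster; the kernel and variable names are arbitrary.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each CTA publishes a value in its own shared memory; a peer CTA in the same
// cluster reads it through a pointer mapped into the generic address space.
__global__ void __cluster_dims__(2, 1, 1) peerSmemRead(int *out)
{
    __shared__ int token;
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0)
        token = (int)cluster.block_rank();          // publish this CTA's rank

    cluster.sync();                                 // all CTAs' shared-memory writes visible

    unsigned peer = (cluster.block_rank() + 1) % cluster.num_blocks();
    int *peerToken = cluster.map_shared_rank(&token, peer);   // peer CTA's shared memory

    if (threadIdx.x == 0)
        out[cluster.block_rank()] = *peerToken;     // direct load from the peer's SMEM

    cluster.sync();                                 // keep token alive until all reads complete
}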
In a parallel processing multiprocessor system, communication between processors using memory located data “objects” may be slowed significantly by the latency involved in using memory barriers for synchronizing processes running on the different processors. An example of a memory barrier is a flag variable or other memory data structure that can be used to control the order of certain operations. Typically, a “producer” thread executing on one processor writes (also interchangeably referred to as “stores”) a buffer into memory, then writes a flag variable in memory to indicate the data is ready to be read (also interchangeably referred to as “load”) by a “consumer” thread executing on another processor. In many systems, between the last data write and the flag write, a memory fence or barrier operation is used to ensure the data writes are ordered before the flag write. This is done in order to prevent the consumer thread from seeing incomplete, non-updated or corrupted data.
Generally this fence or barrier is expensive. That is, it has long latency. For example,
The illustrated scenario involves SM1 as the consumer for data stored by producer SM0 in the L2 slice of SM0 (“L2 slice 0”) and in the L2 slice of SM1 (“L2 slice 1”). After the storing of data D0 and D1 to L2 slice 0 and L2 slice 1 respectively, producer SM0 waits for “ack” messages from the L2 caches for the memory barriers (“membar”) that were issued for the stored data, whereupon it updates a “flag F” to indicate that the data storing is complete. The membar instructions instruct the hardware to ensure visibility of the stored data to subsequent instructions such as those executed by consumer SM1. SM1 can acquire the flag F, as is necessary to load the stored data, only after SM0 updates the flag F. Upon querying the L2 slices and successfully acquiring flag F, SM1 proceeds to load the data D0 and D1.
While the above sequence of operations successfully ensures synchronization of data between the producer and the consumer, three or four roundtrips of latency to/from the L2 cache may be consumed in the sequence from the time SM0 issues the store instructions to the time SM1 obtains the data in response to a load instruction. Thus, according to this conventional scheme, the exchange of data incurs a latency cost of 3 to 4 roundtrips through the L2 cache.
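For context only, a simplified CUDA sketch of this conventional flag-based handshake follows; it models the membar with __threadfence() and global-memory data/flag variables, assumes the producer and consumer kernels actually run concurrently (e.g., on different streams), and all names and values are illustrative.

// Conventional producer/consumer exchange through the memory hierarchy (e.g., L2):
// data stores, a fence ("membar"), then a flag store; the consumer polls the flag
// and only then loads the data. Each step is a round trip toward the L2 cache.
__device__ int d0, d1;
__device__ volatile int flagF = 0;

__global__ void producerSM0()
{
    d0 = 123;                       // store D0
    d1 = 456;                       // store D1
    __threadfence();                // order the data stores ahead of the flag store
    flagF = 1;                      // publish "data is ready"
}

__global__ void consumerSM1(int *out)
{
    while (flagF == 0) { }          // poll the flag
    __threadfence();                // order the flag load ahead of the data loads
    out[0] = d0;                    // load D0
    out[1] = d1;                    // load D1
}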
A new generation of NVIDIA GPU implements various schemes for efficient producer/consumer parallelism for data orchestration, including, for example, inter-SM communication supporting SMs accessing memory of other SMs (described in U.S. Application No. 17/691,690); multicast over distributed shared memory (described in U.S. Application No. 17/691,288); and a tensor memory access unit (described in U.S. Application No. 17/691,276 and in U.S. Application No. 17/691,422). In systems in which there exists shared memory with remote store capabilities, in which SMs can access memory of other SMs without going through the memory hierarchy such as the L2 cache, the latency can be significantly reduced, provided that the data and the flag are co-located in the memory of one of the SMs such as the consumer SM. To exchange data in such systems taking advantage of the remote CGA store capabilities, the latency cost of data exchange may be decreased from the 3-4 of
Upon receiving the acknowledgments for the stored data, the producer SM0 updates a flag F in the shared memory (SMEM) of SM1 to indicate that the stored data is available (like ringing your neighbor’s doorbell after dropping off the groceries, except that the flag remains set for whenever the consumer SM1 cares to read it). The consumer thread on SM1 can wait on the flag F, which is local in the consumer SM1’s own shared memory (and if necessary, can “acquire” the flag when it is updated by SM0), before issuing the load for the newly written data D0 and D1 out of its own local memory. This data exchange, assuming that the SM-to-SM communication through shared memory has latency that is roughly equal to the latency of SM-to-L2, incurs a latency cost of only approximately 1.5 roundtrips because the amount of time it takes for SM1 to read from its own local memory is very small.
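A hedged sketch of this reduced-latency pattern follows, again assuming a CUDA 12-style cluster API; the producer CTA writes the data and then the flag directly into the consumer CTA’s shared memory, and the consumer spins on its own local flag. A device-scope __threadfence() stands in here for whatever ordering the hardware actually requires between the remote data stores and the remote flag store, which is exactly the ordering question addressed below; the names and values are illustrative.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// CTA 0 is the producer, CTA 1 the consumer; both are in the same cluster (CGA).
__global__ void __cluster_dims__(2, 1, 1) remoteStoreThenFlag(int *out)
{
    __shared__ int buf[2];          // consumer-side receive buffer (in CTA 1's SMEM)
    __shared__ int flagF;           // consumer-side flag F (in CTA 1's SMEM)
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0) flagF = 0;
    cluster.sync();                                            // flag initialized everywhere

    if (cluster.block_rank() == 0 && threadIdx.x == 0) {       // producer
        int *rbuf  = cluster.map_shared_rank(buf, 1);          // consumer's receive buffer
        int *rflag = cluster.map_shared_rank(&flagF, 1);       // consumer's flag F
        rbuf[0] = 123;                                         // store D0 remotely
        rbuf[1] = 456;                                         // store D1 remotely
        __threadfence();                                       // modeled ordering point
        *rflag = 1;                                            // "ring the doorbell"
    }

    if (cluster.block_rank() == 1 && threadIdx.x == 0) {       // consumer
        while (*(volatile int *)&flagF == 0) { }               // local wait in own SMEM
        out[0] = buf[0];                                       // local loads of D0, D1
        out[1] = buf[1];
    }
    cluster.sync();                                            // keep SMEM alive until done
}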
Each scheme for efficient producer/consumer parallelism for data orchestration, such as the schemes for efficient producer/consumer parallelism noted above, addresses the same fundamental two synchronization challenges:
(A) When is the consumer ready to receive new data (e.g., a tile has become dead)? and
(B) When is the consumer ready to begin processing filled data (e.g., a tile has become alive)?
The embodiments described in this disclosure solve these two problems using the same unified mechanism, as they are fundamentally two sides of the same coin. The embodiments leverage some aspects of the arrive-wait barriers, introduced in a previous generation of NVIDIA GPU, as the basis for producer/consumer communication. Although the embodiments are described in this disclosure primarily in relation to shared memory, some embodiments may be applied to global memory or combinations of shared memory and global memory.
Consider a producer and consumer communicating through shared memory.
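To make the arrive-wait pattern concrete, here is a minimal block-scope sketch using the libcu++ cuda::barrier (the arrive-wait barrier exposed in CUDA); the split arrive()/wait() pair is the point of interest, and the buffer size and kernel name are assumptions (blockDim.x is taken to be at most 256).

#include <cuda/barrier>
#include <cuda/std/utility>   // cuda::std::move

// Each thread "produces" into its own slot of a shared buffer, arrives on the
// barrier, and later waits before "consuming" a slot written by a peer thread.
__global__ void arriveWaitDemo(const float *in, float *out)
{
    __shared__ float buf[256];                               // assumes blockDim.x <= 256
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    if (threadIdx.x == 0)
        init(&bar, blockDim.x);            // expected arrival count = all block threads
    __syncthreads();                       // make the initialized barrier visible

    buf[threadIdx.x] = in[threadIdx.x] * 2.0f;       // produce into this thread's slot
    auto token = bar.arrive();                       // split arrive: my stores are done

    /* ... independent work could be overlapped here ... */

    bar.wait(cuda::std::move(token));                // split wait: block until all arrivals
    out[threadIdx.x] = buf[(threadIdx.x + 1) % blockDim.x];  // consume a peer's slot
}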
Example embodiments further address a practical issue relating to reliance on the memory ordering between the store operation and the arrive operation. In massively parallel multithreaded machines, operations are sometimes or even often opportunistically performed out of order in order to reduce execution latency. For example, the store operation and the arrive operation may often be reordered in the GPU, e.g., due to hit under miss, or differing transfer distances from L2 slices. In situations in which both the producer and consumer are on the same SM, such reordering does not occur frequently, or is very unlikely to. For example, the “ARRIVES.LDGSTSBAR” instruction, introduced in a previous generation of NVIDIA GPUs, depends on the producer performing the data store instruction “LDGSTS” and the consumer issuing the load “LDS” being on the same SM. However, with the introduction of CGAs, multiple CTAs running on different SMs may now be synchronizing across SM boundaries. While the CTAs are all guaranteed to be running concurrently, the concurrency guarantee in one embodiment does not extend to ordering execution on one SM relative to execution on another SM(s). Without solving this problem of maintaining ordering in the face of possible re-ordering of store operations and arrive operations when producer and consumer are on different SMs, the arrive/wait technique of synchronization may not be very useful for distributed producer/consumer synchronization. Accordingly, some embodiments include solutions for maintaining such ordering across SM boundaries.
When threads running on multiple processors in a non-uniform memory access (NUMA)-organized system (or subsystem) are aware of the location of the thread they are communicating with (for example, which processor it is on) and are able to target the communication data writes and flag write to the memory physically associated with that destination processor, then there is an opportunity to achieve optimally fast data synchronization. The optimally fast data synchronization is achieved by updating the flag mentioned above immediately (or without significant delay) after writing the data.
In order to solve the problem of maintaining ordering in the face of possible re-ordering of stores and arrives, example embodiments provide a new combined (remote) store and arrive (“STS+Arrive”) instruction. In an implementation, upon receipt of the combined store and arrive operation in the destination processor, the data is written to a receive buffer, and a barrier, on which another process (a consumer thread) may perform a wait() operation to obtain access to the receive buffer in order to read the data, is updated to indicate that the data has been written. This instruction specifies two addresses in the same SM shared memory: (A) the address to write the data to (e.g., an address associated with the receive buffer), and (B) the address of a barrier to update when the store is completed. According to one implementation, the count required for the wait() operation to clear is set equal to the number of lines (T) in the tile. Thus the consumer thread passing the barrier (i.e., being able to process instructions beyond it) indicates that the tile has been filled. Note that T is a free variable (as is the number of resident tiles), and so software can adjust the space-latency tradeoff in order to hide average fill latency for tiles.
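The sketch below is purely hypothetical and is intended only to visualize the two addresses the combined instruction carries; st_shared_remote_arrive() is not a real CUDA or PTX API, and the tile-fill loop is an invented example.

// HYPOTHETICAL: st_shared_remote_arrive() is not a real CUDA/PTX API. It models a
// combined "STS+Arrive" that (a) stores a value to a shared-memory address in the
// destination SM and (b) updates a barrier at a second address in that same SM's
// shared memory once the store has completed.
__device__ void st_shared_remote_arrive(float *dataAddr,    // (A) receive-buffer slot
                                        unsigned *barAddr,  // (B) barrier to update
                                        float value);       // assumed hardware operation

__device__ void producerFillTile(float *rxBuf, unsigned *cBarrier,
                                 const float *tile, int T)  // T = lines per tile
{
    for (int line = 0; line < T; ++line)
        st_shared_remote_arrive(&rxBuf[line], cBarrier, tile[line]);
    // The consumer's wait() clears after T arrivals, i.e., once the tile is full;
    // software can trade T (and the number of resident tiles) for latency hiding.
}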
The scenario of
However, this scheme shown in
In one embodiment, the remote wait() is avoided by splitting the synchronization functionality across two barriers: one co-located with the consumer, and one co-located with the producer. This imposes a relatively minor space overhead, yet removes all need for remote wait() functionality. In this scheme, the “c-barrier” (a barrier co-located with the consumer) is waited on (e.g. using wait() instruction) by the consumer and arrived on (e.g. using arrive()) by the producer. Conversely the “p-barrier” (a barrier co-located with the producer) is arrived on by the consumer and waited on by the producer.
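The following pseudocode sketch (wait_local() and arrive_remote() are illustrative placeholders, not APIs defined by this disclosure or by CUDA) is intended only to show which side touches which barrier in the split-barrier scheme.

// Neither side ever waits on a barrier outside its own SM's shared memory.
__device__ void wait_local(unsigned *bar, unsigned expected);   // placeholder
__device__ void arrive_remote(unsigned *bar, unsigned count);   // placeholder

__device__ void producerLoop(unsigned *pBar /* in producer SMEM */,
                             unsigned *cBar /* in consumer SMEM */, int T)
{
    for (;;) {
        wait_local(pBar, 1);          // wait until the consumer says the tile is "dead"
        /* ... fill the T-line tile in the consumer's shared memory ... */
        arrive_remote(cBar, T);       // arrive on the consumer's c-barrier
    }
}

__device__ void consumerLoop(unsigned *pBar /* in producer SMEM */,
                             unsigned *cBar /* in consumer SMEM */, int T)
{
    for (;;) {
        wait_local(cBar, T);          // wait until the producer has filled all T lines
        /* ... process the tile out of local shared memory ... */
        arrive_remote(pBar, 1);       // tell the producer the tile is free again
    }
}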
The embodiments according to
In some example embodiments, the barrier is atomically updated by adding the number of bytes of data that were written. To utilize this, in one embodiment the software knows the number of bytes expected to be written into the buffer in order to synchronize. The receiving thread (also referred to as the destination thread) in principle waits on the barrier reaching a certain expected value, at which point the barrier reaches a “clear” condition. The expected value, for example, in some embodiments, may correspond to the number of bytes or other measure of data that will be written by sending threads (also referred to as source threads).
In one embodiment, a barrier support unit, such as the hardware-implemented synchronization unit described in U.S. Application No. 17/691,296, the entire description of which is incorporated herein by reference, may be used to accelerate the barrier implementations by enabling the receiving thread to simply wait on the barrier “clearing”. The hardware-implemented synchronization unit in U.S. Application No. 17/691,296 handles the details of waiting for the count to be reached. However, the receiving thread software does have to know the expected byte count and supply it to the barrier support unit. An example barrier support unit is shown in
Since the barrier is managed atomically (all accesses are atomic), some example embodiments allow multiple other threads on the same processor or on multiple other processors to cooperate in sending data to one receiving thread. To use this, one embodiment assumes that the software writes each element of the data buffer exactly once (possibly from different source threads) and that the receiving thread knows the expected byte count. In one embodiment, the byte count is not required to be static, and the sending threads can use this barrier support unit to update the expected byte count before or in parallel with performing this disclosure’s special stores with barrier update. Example embodiments (optionally, with the assistance of the barrier support unit) support dynamically sized data buffers. The result is that data can be written and the barrier updated with no significant delay after the data write.
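As a software analogue only (it stands in for the hardware-managed barrier and barrier support unit, and all names are invented), the sketch below shows several producers each atomically adding the byte count they wrote, with the receiver waiting until the expected total is reached.

// Multiple producer threads write disjoint parts of a receive buffer, then add the
// number of bytes written to a shared count; the receiver waits for the expected total.
__device__ unsigned char rxBuffer[4096];
__device__ unsigned bytesArrived = 0;

__device__ void producerChunk(const unsigned char *src, int offset, int nbytes)
{
    for (int i = 0; i < nbytes; ++i)
        rxBuffer[offset + i] = src[i];          // each element written exactly once
    __threadfence();                            // order the data ahead of the count update
    atomicAdd(&bytesArrived, (unsigned)nbytes); // "arrive" with a byte count
}

__device__ void receiverWait(unsigned expectedBytes)
{
    while (atomicAdd(&bytesArrived, 0u) < expectedBytes) { }  // poll until "clear"
    __threadfence();                            // then it is safe to read rxBuffer
}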
Therefore no SM, more specifically neither a producer thread nor a consumer thread, is waiting on a barrier that is not in its own local shared memory. Since a remote wait() is not required, this new technique can utilize a wait() call similar to that used in a previous generation of NVIDIA GPU without modification - that is, the wait() operation for either of the consumer thread or the producer thread is on a local barrier in a memory of its own SM.
Another new aspect in the introduced synchronization is the remote arrive operation from the consumer to the p-barrier (indicating tile has become dead). In some example embodiments, this can be implemented with the same combined store and arrive operation functionality described above (e.g., with a dummy store address). This could be optimized further by adding a remote arrive operation that does not need a store (STS) component. Note that in the multi-producer scenario (e.g., where several producer threads, possibly in different processors, store to the same buffer) described above, a separate combined store and arrive operation is sent to each producer in one embodiment. However, if a multicast scheme (e.g., programmatic multicast described in US 17/691,288) is implemented, then this combined store and arrive can be multicast to those producers.
At the beginning of time (i.e. beginning of the illustrated synchronization sequence), there may be a (safe) race between loading the program for the producer thread on the producer SM and the consumer thread on the consumer SM. Assuming that the producer wins, time then proceeds as follows (logical time steps 0-4):
As noted above, software may, as necessary, dynamically or as configured, trade T off against the number of total resident tiles in order to achieve optimum workload-specific throughput.
This type of fast synchronization achieved with less than a roundtrip latency is, as mentioned above, referred to in this disclosure as “speed of light” or “SOL” synchronization. In some embodiments, SOL synchronization covers the cases of an object (e.g., a set of one or more memory locations) located in a DSMEM (distributed shared memory) location in a particular destination CTA with one or more other CTAs (in the CGA) writing to the object and then in an SOL manner alerting the destination that the object is ready for use (e.g., that writes are completed and visible).
For two concurrently executing threads (or processes) on processor 1 and processor 2, an expected usage of the synchronization of example embodiments may be as follows: processor 1 executes the sequence ST D0, ST D1, REL; and processor 2 executes the sequence ACQ, LD D0, LD D1. The ST D0 and ST D1 operations are combined store and arrive operations.
According to example embodiments, the combined store and arrive instructions that store to the same buffer are not required to take the same path from SM0 to SM1. For example,
An example transaction barrier structure that includes an arrive count, an expected arrive count, and a transaction count is shown in
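Purely for illustration, one way such a structure might be mirrored in software is sketched below; the field widths and packing are assumptions, not the hardware-defined layout.

#include <cstdint>

// Hypothetical packing of the transaction barrier fields into one 64-bit word.
struct TransactionBarrier {
    uint32_t arrive_count    : 20;   // arrivals observed in the current phase
    uint32_t expected_count  : 20;   // arrivals required for the phase to complete
    int32_t  transaction_cnt : 24;   // outstanding data transactions (may go negative
                                     // transiently if arrives race ahead of stores)
};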
Returning to
In the embodiments described in relation to
Two new instructions are implemented to support the above described SOL synchronization in some example embodiments: a “Store with Synch” instruction and a “Reduction with Synch” instruction.
The Store with Synch instruction and the Reduction with Synch instruction may be exposed to the programmer through a library interface. In some embodiments, the store may be regarded as invalid unless addressed to shared memory located in another SM belonging to a CTA that is part of the same CGA as the source CTA. The instructions may support one or more operand sizes, such as, for example, 32 bytes, 64 bytes, 128 bytes, etc. A barrier address and a data address are provided as input address parameters. Specifically, the barrier address is represented by CTA_ID in CGA of barrier (which in some embodiments must be the same as the CTA_ID in CGA of the data) and the shared memory offset (address) of the barrier at the target. The data address is represented by CTA_ID in CGA of the data location and the shared memory offset (address) of the data at the target.
The Store with Synch instruction provides a SOL CGA memory data exchange using a synchronization that travels with a store. The instruction may be of the form:
This instruction stores data to the target/destination CTA(s) SMEM[DataAddr]. After the data store is guaranteed visible, it decrements the transaction count field of the barrier at target CTA(s) SMEM[BarAddr] by the amount of data (e.g., number of bytes) being stored.
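The pseudocode below only models the described effect in software (memcpy and __threadfence() stand in for the hardware store and its visibility guarantee, and the function name is invented); it is not the instruction’s actual encoding or microarchitectural behavior.

// Modeled effect of Store-with-Synch: store into the target CTA's shared memory,
// then debit that CTA's barrier transaction count by the number of bytes stored.
__device__ void store_with_synch_model(void *dstSmem,       /* SMEM[DataAddr] */
                                       unsigned *barTxCnt,  /* transaction count at SMEM[BarAddr] */
                                       const void *src, unsigned nbytes)
{
    memcpy(dstSmem, src, nbytes);     // the data store
    __threadfence();                  // "guaranteed visible" point (modeled)
    atomicSub(barTxCnt, nbytes);      // decrement transaction count by bytes stored
}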
The Reduction with Synch instruction does shared memory atomic reductions instead of stores. The instruction may have a format such as:
The reduction instruction performs an arrive reduction operation for the number of threads executing the REDS. The arrive reduction operation may be the same as for arrive atomic. The instruction may increment the transaction count in the destination barrier by the number specified in URb. The reduce instruction noted above may be used in some embodiments to sum up all the counts for all the individual threads and store it into URb.
In some embodiments, the instruction format may also be built to track the number of stored bytes explicitly. In this scheme, the Store-with-Synch instruction also maintains a running count of the number of bytes, as below.
The operation of the instruction may be as follows:
The corresponding Reduction-with-Synch instruction may be as below:
This instruction performs an arrive reduction to barrier in CTA(s) SMEM[BarAddr]. The instruction reduces the TCount across all the threads, and then increments the ByteTransactionCnt field of barrier at target CTA(s) SMEM[BarAddr] by the total TCount.
In another embodiment, there is also a barrier located in the same (processor affiliated) memory unit as the destination data buffer, in a manner similar to an above described embodiment. However, instead of updating the barrier with the number of bytes written (or updating the barrier with some other measure of the amount of data written to the data buffer) as in the above described embodiment, in this other embodiment the sending processor(s) write as much data as desired (and/or allowed) to the destination thread’s memory (NUMA associated with the destination thread), and then send a memory write fence operation. To help clarify, note that the memory system network (e.g., NOC, network on chip) will typically have multiple paths between any two processors in a multiprocessor, and a subset of these paths can be used for transmitting write operations from a source thread on a source processor to a destination thread on a destination processor. The total number of possible paths may be implementation dependent, but may be known at startup time and may remain fixed after startup. In this alternative embodiment, the sending processor replicates the fence to be sent on all paths (the fence is sent after the write on the same network paths that the write could have taken, but the fence is replicated to all paths on which writes might have travelled). Each arriving fence increments the barrier at the destination one time. The destination thread waits on a known count (the expected number of fence arrivals), which is the known number of possible paths from the source processor to the destination processor. When all the fence messages for the store operation by the source SM have arrived at the barrier, the counter in the barrier indicates that all the expected fence messages have been received and therefore all stores have already been received at the destination processor. When the barrier clears, the consumer process, which may be waiting at the barrier, begins to load the received data.
In a particular implementation, after initial runtime configuration, the number of paths between any two processors is considered to be a static, unchanging number. An advantage of this other embodiment, in comparison to the embodiment in which the barrier is updated with each store operation, is that the communicating threads are not required to know and communicate the number of bytes written beforehand. Thus, this embodiment may be useful for the many applications in which it is difficult for software to determine the number of bytes to be written. For example, if a subroutine is called to do some of the data writes, it may be difficult for the calling code to know how many bytes the subroutine writes. This alternative implementation avoids the problem of the destination having to know the amount of data it is expecting.
This alternative embodiment yields lower latency than the conventional synchronization scheme described in relation to
In CGA environments, an example embodiment provides for CGA memory object-scoped SOL synchronization of SM2SM communication in the context of CGA memory. This relates to inter-SM stores to CGA shared memory hosted within SMs in their respective shared memory storage. The feature may be utilized between multiple SMs on the same GPC, or on multiple GPCs. In some embodiments, the inter-SM stores may utilize L2-hosted CGA shared memory.
A key architectural aspect and hardware support for this embodiment is a fence operation that is directed at one other CTA in the CGA running on a different SM (it may also apply to CTAs running on the same SM). The fence instruction specifies a target CTA (which, for example, hardware mapping tables can map to a target SM). It also specifies a barrier address. In some embodiments, the specified barrier address may be located in the CTA’s global memory region (e.g., in the partitioned global address space (PGAS) of the CTA). The fence travels on all paths that stores from the source SM to the destination SM could travel in the interconnect (crossbar or other switch) interconnecting the SMs. Doing so sweeps all prior stores from the source SM to the destination SM ahead to the destination. That is, the fence cannot arrive at the destination SM on all paths until all stores prior to the fence have arrived at the destination SM.
The fence arrives at the destination SM once per path. Each fence arrival event causes the destination SM to do a fence arrive operation on the barrier in the destination SM. Although a fence arrive may be more complicated than simply an atomic decrement (or increment, depending on implementation) of the barrier location, for purposes of this description one may simply consider it as that.
In the simplest use case, the program on the destination SM knows how many fence arrives to expect. The number of paths through the memory system for these inter-SM stores may be a fixed hardware value available to the program. The program could in this case initialize the barrier to N, the number of paths, and then poll the barrier until it is zero. A value of zero would mean the fence had arrived on all paths and on each arrival the fence count of the barrier was decremented.
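A minimal software sketch of this simple use case follows (the path count N=4 and all names are assumptions, and the per-path decrement is performed by hardware on each fence arrival in the described embodiment).

#define NUM_PATHS 4                            // assumed fixed, hardware-reported value

__device__ int fenceBarrier = NUM_PATHS;       // initialize the barrier to N, one per path

__device__ void onFenceArrival()               // modeled: invoked once per arriving fence copy
{
    atomicSub(&fenceBarrier, 1);
}

__device__ void destinationWait()
{
    while (atomicAdd(&fenceBarrier, 0) != 0) { }   // poll until zero: fences arrived on all
    __threadfence();                               // paths, so all prior stores have landed
}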
A more complicated use case involves multiple sending CTAs, which are required, in effect, to initialize the barrier before the expected fence arrival count is known. The fence arrivals may be regarded as “transactions”. The software layer operates in terms of messages, and sending and receiving CTAs agree on how many messages will arrive. Those messages are “arrivals”. With each “arrival” the software, optionally with help from hardware, adjusts the barrier’s expected transaction count (i.e., number of paths * number of messages).
An example programming model for this embodiment may be as described next. The destination CTA has an object barrier in its local memory, which it may initialize before the synchronization operation. The destination CTA communicates “space available” to one or more source CTAs when a designated buffer in the CGA shared memory of the destination CTA becomes available.
The destination CTA executes a wait on the object barrier. One or more source CTA(s) write bulk data to the designated buffer in the destination CTA’s CGA shared memory, and then execute a fence and arrive targeting the object barrier in the destination CTA’s memory. When the destination CTA is released from the wait on the object barrier, it may then read the bulk data from shared memory.
An example implementation may be as described next. The memory object barrier in the destination CTA local shared memory may be a word with several fields. The fields may include an expected arrive count, actual arrive count, and a fence transaction count. In an example implementation, the arrive counts may be made visible to software, but the fence transaction count may be hardware managed and opaque to software. The fence transaction count, and/or any of the other fields of the memory object barrier can be positive or negative. The barrier can be initialized with a target value for the expected arrive count, and the fence transaction count can be set to 0.
The fence and arrive instruction may be executed by the source CTA(s) after writing arbitrary bulk data to the destination CTA. Alternatively, in some embodiments, the fence and the arrive can be split into two instructions. The source CTA(s) can each send their fences down all N possible address paths to the destination SM (e.g., N=4 paths). Each fence includes the address of the memory object barrier. The arrive may be sent down any path to the destination SM, and the arrive may be unordered with respect to the fence transactions.
Fence instructions follow and flush previous memory instructions, if any, on each path from the source SM to the destination SM. At the destination SM, each fence instruction increments the transaction count in the memory object barrier. Also, at the destination, each arrive increments the arrive count in the memory object barrier. Further, at the destination, each arrive decrements the fence transaction count in the memory object barrier by N. Fences and arrives do not need to be ordered with respect to each other in the case of multiple source CTAs synchronizing on one destination CTA’s memory object barrier.
A barrier wait instruction executed at the destination SM may take the barrier address as a parameter and may return “clear” status when (arrive count == expected arrive count) && (fence transaction count == 0). Note that the fence transaction count is incremented by fence instructions and decremented by arrives, and the destination CTA may wait until it resolves to 0. Note also the assumption that N is statically known in hardware (e.g., 4). If N is not known, the arrive may carry the N.
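Written out as a plain predicate (illustrative only, with invented names), the clear test evaluated by the barrier wait is:

// Clear when all expected arrivals have occurred and the fence transaction count,
// incremented by fences and decremented by N per arrive, has resolved to zero.
__device__ bool barrierIsClear(int arriveCount, int expectedArriveCount,
                               int fenceTransactionCount)
{
    return (arriveCount == expectedArriveCount) && (fenceTransactionCount == 0);
}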
The fence instructions, because of their replication to all paths between the source and destination, may generate substantial additional traffic on the interconnect. For example, for each fence packet generated by the source thread, the memory system may replicate N fence packets to transmit over each of the N available paths to the destination SM. In some embodiments the interconnect bandwidth is configured to limit the bandwidth available for SM2SM traffic such as the fence messages, so that the reduction of the interconnect bandwidth available to L2 data and L2-related messaging is minimized. Thus, example embodiments may restrict SM2SM traffic on the interconnect (e.g., crossbar) to a subset of the links available on the interconnect, and may distribute the SM2SM traffic over the subset so that the reduction of bandwidth available for L2 on the interconnect is controlled.
In the model of
An entity in an L2 slice, which may be referred to as an agent and which may be well-known to processors in the system, is defined to mediate the exchange between producer and consumer processes. A respective agent may be defined for each specific type of communication, and may include a queue that coordinates the communication and permits some level of pipelining. Instead of a data queue that a producer thread pushes into, the agent can include a consumer queue upon which the consumer waits for data. Hardware support can provide for such a queue to reside in a single L2 slice.
The consumer thread sets up a local shared memory buffer as the receiving buffer, a transaction barrier, and a data receiving remap table. The consumer pushes its receiving-buffer information into the queue. A push may fail due to the queue being full and may require retry. When the queue is non-empty, the producer thread can push its data through the L2 slices, where the agent operates to reflect the data into the consumer’s receive buffer. The data push may target a specific L2 slice and be bounced back to the consumer like load data.
The L2 slice of choice may be determined through a hash or the like aiming to spread out to different slices more evenly. Data packets are tagged with the consumer information posted in the queue, together with buffer internal offset managed by producer. The time chart in
Comparing the mediated model of
The data communication in the embodiment of
When the transaction barrier in the consumer clears, the data updates from the producers to the associated shared memory buffer are guaranteed visible. The minimal latency between consumer push and data arrival is around two L2 roundtrips. The first roundtrip to communicate the “buffer ready information” (“P.B” in
Hardware support may include hardware-supported mediating queues in L2. Each queue may carry the following information: the block size of the transfer, and the number of blocks and routing information (e.g., consumer SM ID or crossbar node / port ID) in each entry. In an example implementation, a queue may fit on a single L2 slice and may be memory backed. In an example implementation, a single 256B queue can fit on a single slice, can support up to 63 entries with 4B entry size, or 126 entries with 2B entry size. In case a larger queue is desired, special address hash / space and global memory carve out for a backing-store may be used.
The hardware may support atomic queue operations: PushBuf and PopWait. PushBuf can be used to advertise a data receiving buffer by pushing it onto the tail of the queue, returning immediately whether the push was successful. PopWait can be used to wait if the pending block counter indicates there is not enough buffer space to produce into, or to reserve from the pending block counter otherwise. A separate try-wait buffer to hold waiting producers may also be provided.
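The sketch below is only a single-producer/single-consumer software analogue of these hardware queue operations (the struct layout, 63-entry capacity, and function bodies are all assumptions); the real PushBuf and PopWait are atomic operations performed within an L2 slice.

// Software analogue of the mediating queue resident in one L2 slice.
struct MediatorQueue {
    unsigned head;            // advanced by the producer side (PopWait)
    unsigned tail;            // advanced by the consumer side (PushBuf)
    unsigned entries[63];     // routing info + receive-buffer descriptor per entry
};

__device__ bool PushBuf(MediatorQueue *q, unsigned bufDescriptor)   // consumer side
{
    if (q->tail - *(volatile unsigned *)&q->head >= 63)
        return false;                              // queue full: caller may retry
    q->entries[q->tail % 63] = bufDescriptor;      // advertise a receive buffer
    __threadfence();                               // entry visible before the tail advances
    q->tail += 1;
    return true;
}

__device__ unsigned PopWait(MediatorQueue *q)                        // producer side
{
    while (q->head == *(volatile unsigned *)&q->tail) { }   // wait until a buffer is posted
    __threadfence();                                        // order entry read after tail load
    unsigned desc = q->entries[q->head % 63];               // routing info for the data push
    q->head += 1;
    return desc;
}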
The hardware may further provide split SM2SM support. Consumer-side data receiving buffer setup may include allocating the shared memory buffer, initializing the transaction barrier, setting up the buffer base / size / barrier etc. in a remap table, and performing PushBuf (cancelling the buffer setup on failure). Producer-side data pushing may include a special store flavor that will bounce to the destination SM like a load, a special tex2gnic packet type with routing info returned from PopWait, and optionally a multicast store for queue expansion. Since the barrier is on the consumer side and is not made visible to the producer, the consumer may have the responsibility to set up the expected transaction count: either set the precise number if the size is well-known between the producer and consumer, or, when the data packet size can vary, set the maximum number that matches the buffer size, in which case the producer has the responsibility to close the exchange when the data is actually smaller.
The producer, through a persistent agent in global memory, launches the consumer as needed by causing the persistent agent to issue a launch of a consumer thread to receive the data that the producer is yet to produce. By providing for the persistent agent to control the startup of the consumer process, the producer retires and releases register file resources for the consumer as it dumps data into a local queue. The persistent agent (“CWD Workload Dispatch” in
A conceptual system diagram for the operations in
An example illustrative architecture in which the fast data synchronization disclosed in this application is incorporated will now be described. The following information is set forth for illustrative purposes and should not be construed as limiting in any manner. Any of the following features may be optionally incorporated with or without the exclusion of other features described.
One or more PPUs 1000 may be configured to accelerate thousands of High Performance Computing (HPC), data center, and machine learning applications. The PPU 1000 may be configured to accelerate numerous deep learning systems and applications including autonomous vehicle platforms, deep learning, high-accuracy speech, image, and text recognition systems, intelligent video analytics, molecular simulations, drug discovery, disease diagnosis, weather forecasting, big data analytics, astronomy, molecular dynamics simulation, financial modeling, robotics, factory automation, real-time language translation, online search optimizations, and personalized user recommendations, and the like.
As shown in
The NVLink 1010 interconnect enables systems to scale and include one or more PPUs 1000 combined with one or more CPUs, supports cache coherence between the PPUs 1000 and CPUs, and CPU mastering. Data and/or commands may be transmitted by the NVLink 1010 through the hub 1030 to/from other units of the PPU 1000 such as one or more copy engines, a video encoder, a video decoder, a power management unit, etc. (not explicitly shown). The NVLink 1010 is described in more detail in conjunction with
The I/O unit 1005 is configured to transmit and receive communications (e.g., commands, data, etc.) from a host processor (not shown) over the interconnect 1002. The I/O unit 1005 may communicate with the host processor directly via the interconnect 1002 or through one or more intermediate devices such as a memory bridge. In an embodiment, the I/O unit 1005 may communicate with one or more other processors, such as one or more of the PPUs 1000 via the interconnect 1002. In an embodiment, the I/O unit 1005 implements a Peripheral Component Interconnect Express (PCIe) interface for communications over a PCIe bus and the interconnect 1002 is a PCIe bus. In alternative embodiments, the I/O unit 1005 may implement other types of well-known interfaces for communicating with external devices.
The I/O unit 1005 decodes packets received via the interconnect 1002. In an embodiment, the packets represent commands configured to cause the PPU 1000 to perform various operations. The I/O unit 1005 transmits the decoded commands to various other units of the PPU 1000 as the commands may specify. For example, some commands may be transmitted to the front end unit 1015. Other commands may be transmitted to the hub 1030 or other units of the PPU 1000 such as one or more copy engines, a video encoder, a video decoder, a power management unit, etc. (not explicitly shown). In other words, the I/O unit 1005 is configured to route communications between and among the various logical units of the PPU 1000.
In an embodiment, a program executed by the host processor encodes a command stream in a buffer that provides workloads to the PPU 1000 for processing. A workload may comprise several instructions and data to be processed by those instructions. The buffer is a region in a memory that is accessible (e.g., read/write) by both the host processor and the PPU 1000. For example, the I/O unit 1005 may be configured to access the buffer in a system memory connected to the interconnect 1002 via memory requests transmitted over the interconnect 1002. In an embodiment, the host processor writes the command stream to the buffer and then transmits a pointer to the start of the command stream to the PPU 1000. The front end unit 1015 receives pointers to one or more command streams. The front end unit 1015 manages the one or more streams, reading commands from the streams and forwarding commands to the various units of the PPU 1000.
The front end unit 1015 is coupled to a scheduler unit 1020 that configures the various GPCs 1050 to process tasks defined by the one or more streams. The scheduler unit 1020 is configured to track state information related to the various tasks managed by the scheduler unit 1020. The state may indicate which GPC 1050 a task is assigned to, whether the task is active or inactive, a priority level associated with the task, and so forth. The scheduler unit 1020 manages the execution of a plurality of tasks on the one or more GPCs 1050.
The scheduler unit 1020 is coupled to a work distribution unit 1025 that is configured to dispatch tasks for execution on the GPCs 1050. The work distribution unit 1025 may track a number of scheduled tasks received from the scheduler unit 1020. In an embodiment, the work distribution unit 1025 manages a pending task pool and an active task pool for each of the GPCs 1050. The pending task pool may comprise a number of slots (e.g., 32 slots) that contain tasks assigned to be processed by a particular GPC 1050. The active task pool may comprise a number of slots (e.g., 4 slots) for tasks that are actively being processed by the GPCs 1050. As a GPC 1050 finishes the execution of a task, that task is evicted from the active task pool for the GPC 1050 and one of the other tasks from the pending task pool is selected and scheduled for execution on the GPC 1050. If an active task has been idle on the GPC 1050, such as while waiting for a data dependency to be resolved, then the active task may be evicted from the GPC 1050 and returned to the pending task pool while another task in the pending task pool is selected and scheduled for execution on the GPC 1050.
The work distribution unit 1025 communicates with the one or more GPCs 1050 via XBar 1070. The XBar 1070 is an interconnect network that couples many of the units of the PPU 1000 to other units of the PPU 1000. For example, the XBar 1070 may be configured to couple the work distribution unit 1025 to a particular GPC 1050. Although not shown explicitly, one or more other units of the PPU 1000 may also be connected to the XBar 1070 via the hub 1030.
The tasks are managed by the scheduler unit 1020 and dispatched to a GPC 1050 by the work distribution unit 1025. The GPC 1050 is configured to process the task and generate results. The results may be consumed by other tasks within the GPC 1050, routed to a different GPC 1050 via the XBar 1070, or stored in the memory 1004. The results can be written to the memory 1004 via the partition units 1080, which implement a memory interface for reading and writing data to/from the memory 1004. The results can be transmitted to another PPU 1000 or CPU via the NVLink 1010. In an embodiment, the PPU 1000 includes a number U of partition units 1080 that is equal to the number of separate and distinct memory devices 1004 coupled to the PPU 1000. A partition unit 1080 will be described in more detail below in conjunction with
In an embodiment, a host processor executes a driver kernel that implements an application programming interface (API) that enables one or more applications executing on the host processor to schedule operations for execution on the PPU 1000. In an embodiment, multiple compute applications are simultaneously executed by the PPU 1000 and the PPU 1000 provides isolation, quality of service (QoS), and independent address spaces for the multiple compute applications. An application may generate instructions (e.g., API calls) that cause the driver kernel to generate one or more tasks for execution by the PPU 1000. The driver kernel outputs tasks to one or more streams being processed by the PPU 1000. Each task may comprise one or more groups of related threads, referred to herein as a warp. In an embodiment, a warp comprises 32 related threads that may be executed in parallel. Cooperating threads may refer to a plurality of threads including instructions to perform the task and that may exchange data through shared memory. Threads, cooperating threads and a hierarchical grouping of threads such as cooperating thread arrays (CTA) and cooperating group arrays (CGA) according to some embodiments are described in more detail in U.S. Application No. 17/691,621, the entire content of which is hereby incorporated by reference in its entirety.
In an embodiment, the operation of the GPC 1050 is controlled by the pipeline manager 1110. The pipeline manager 1110 manages the configuration of the one or more DPCs 1120 for processing tasks allocated to the GPC 1050. In an embodiment, the pipeline manager 1110 may configure at least one of the one or more DPCs 1120 to implement at least a portion of a graphics rendering pipeline, a neural network, and/or a compute pipeline. For example, with respect to a graphics rendering pipeline, a DPC 1120 may be configured to execute a vertex shader program on the programmable streaming multiprocessor (SM) 1140. The pipeline manager 1110 may also be configured to route packets received from the work distribution unit 1025 to the appropriate logical units within the GPC 1050. For example, some packets may be routed to fixed function hardware units in the PROP 1115 and/or raster engine 1125 while other packets may be routed to the DPCs 1120 for processing by the primitive engine 1135 or the SM 1140.
The PROP unit 1115 is configured to route data generated by the raster engine 1125 and the DPCs 1120 to a Raster Operations (ROP) unit, described in more detail in conjunction with
Each DPC 1120 included in the GPC 1050 includes an M-Pipe Controller (MPC) 1130, a primitive engine 1135, and one or more SMs 1140. The MPC 1130 controls the operation of the DPC 1120, routing packets received from the pipeline manager 1110 to the appropriate units in the DPC 1120. For example, packets associated with a vertex may be routed to the primitive engine 1135, which is configured to fetch vertex attributes associated with the vertex from the memory 1004. In contrast, packets associated with a shader program may be transmitted to the SM 1140.
The SM 1140 comprises a programmable streaming processor that is configured to process tasks represented by a number of threads. Each SM 1140 is multi-threaded and configured to execute a plurality of threads (e.g., 32 threads) from a particular group of threads concurrently. In an embodiment, the SM 1140 implements a SIMD (Single-Instruction, Multiple-Data) architecture where each thread in a group of threads (e.g., a warp) is configured to process a different set of data based on the same set of instructions. All threads in the group of threads execute the same instructions. In another embodiment, the SM 1140 implements a SIMT (Single-Instruction, Multiple Thread) architecture where each thread in a group of threads is configured to process a different set of data based on the same set of instructions, but where individual threads in the group of threads are allowed to diverge during execution. In an embodiment, a program counter, call stack, and execution state is maintained for each warp, enabling concurrency between warps and serial execution within warps when threads within the warp diverge. In another embodiment, a program counter, call stack, and execution state is maintained for each individual thread, enabling equal concurrency between all threads, within and between warps. When execution state is maintained for each individual thread, threads executing the same instructions may be converged and executed in parallel for maximum efficiency. The SM 1140 is described in more detail below in conjunction with
The MMU 1190 provides an interface between the GPC 1050 and the partition unit 1080. The MMU 1190 may provide translation of virtual addresses into physical addresses, memory protection, and arbitration of memory requests. In an embodiment, the MMU 1190 provides one or more translation lookaside buffers (TLBs) for performing translation of virtual addresses into physical addresses in the memory 1004.
In an embodiment, the memory interface 1170 implements an HBM2 memory interface and Y equals half U. In an embodiment, the HBM2 memory stacks are located on the same physical package as the PPU 1000, providing substantial power and area savings compared with conventional GDDR5 SDRAM systems. In an embodiment, each HBM2 stack includes four memory dies and Y equals 4, with each HBM2 stack including two 128-bit channels per die for a total of 8 channels and a data bus width of 1024 bits.
In an embodiment, the memory 1004 supports Single-Error Correcting Double-Error Detecting (SECDED) Error Correction Code (ECC) to protect data. ECC provides higher reliability for compute applications that are sensitive to data corruption. Reliability is especially important in large-scale cluster computing environments where PPUs 1000 process very large datasets and/or run applications for extended periods.
In an embodiment, the PPU 1000 implements a multi-level memory hierarchy. In an embodiment, the memory partition unit 1080 supports a unified memory to provide a single unified virtual address space for CPU and PPU 1000 memory, enabling data sharing between virtual memory systems. In an embodiment, the frequency of accesses by a PPU 1000 to memory located on other processors is traced to ensure that memory pages are moved to the physical memory of the PPU 1000 that is accessing the pages more frequently. In an embodiment, the NVLink 1010 supports address translation services allowing the PPU 1000 to directly access a CPU’s page tables and providing full access to CPU memory by the PPU 1000.
In an embodiment, copy engines transfer data between multiple PPUs 1000 or between PPUs 1000 and CPUs. The copy engines can generate page faults for addresses that are not mapped into the page tables. The memory partition unit 1080 can then service the page faults, mapping the addresses into the page table, after which the copy engine can perform the transfer. In a conventional system, memory is pinned (e.g., non-pageable) for multiple copy engine operations between multiple processors, substantially reducing the available memory. With hardware page faulting, addresses can be passed to the copy engines without worrying if the memory pages are resident, and the copy process is transparent.
Data from the memory 1004 or other system memory may be fetched by the memory partition unit 1080 and stored in the L2 cache 1160, which is located on-chip and is shared between the various GPCs 1050. As shown, each memory partition unit 1080 includes a portion of the L2 cache 1160 associated with a corresponding memory device 1004. Lower level caches may then be implemented in various units within the GPCs 1050. For example, each of the SMs 1140 may implement a level one (L1) cache. The L1 cache is private memory that is dedicated to a particular SM 1140. Data from the L2 cache 1160 may be fetched and stored in each of the L1 caches for processing in the functional units of the SMs 1140. The L2 cache 1160 is coupled to the memory interface 1170 and the XBar 1070.
The ROP unit 1150 performs graphics raster operations related to pixel color, such as color compression, pixel blending, and the like. The ROP unit 1150 also implements depth testing in conjunction with the raster engine 1125, receiving a depth for a sample location associated with a pixel fragment from the culling engine of the raster engine 1125. The depth is tested against a corresponding depth in a depth buffer for a sample location associated with the fragment. If the fragment passes the depth test for the sample location, then the ROP unit 1150 updates the depth buffer and transmits a result of the depth test to the raster engine 1125. It will be appreciated that the number of partition units 1080 may be different than the number of GPCs 1050 and, therefore, each ROP unit 1150 may be coupled to each of the GPCs 1050. The ROP unit 1150 tracks packets received from the different GPCs 1050 and determines the GPC 1050 to which a result generated by the ROP unit 1150 is routed through the XBar 1070. Although the ROP unit 1150 is included within the memory partition unit 1080 in
As described above, the work distribution unit 1025 dispatches tasks for execution on the GPCs 1050 of the PPU 1000. The tasks are allocated to a particular DPC 1120 within a GPC 1050 and, if the task is associated with a shader program, the task may be allocated to an SM 1140. The scheduler unit 1210 receives the tasks from the work distribution unit 1025 and manages instruction scheduling for one or more thread blocks assigned to the SM 1140. The scheduler unit 1210 schedules thread blocks for execution as warps of parallel threads, where each thread block is allocated at least one warp. In an embodiment, each warp executes 32 threads. The scheduler unit 1210 may manage a plurality of different thread blocks, allocating the warps to the different thread blocks and then dispatching instructions from the plurality of different cooperative groups to the various functional units (e.g., cores 1250, SFUs 1252, and LSUs 1254) during each clock cycle.
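The following host-side sketch (the kernel and launch parameters are hypothetical) illustrates the mapping described above, in which a thread block of 256 threads is scheduled by the SM as eight warps of 32 parallel threads:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noopKernel() { }                  // placeholder kernel; the launch geometry is the point

int main()
{
    dim3 block(256);                              // 256 threads per thread block
    dim3 grid(64);                                // 64 thread blocks distributed across the SMs
    int warpsPerBlock = (block.x + 31) / 32;      // = 8; each block is allocated at least one warp

    std::printf("warps per block: %d\n", warpsPerBlock);
    noopKernel<<<grid, block>>>();                // the scheduler issues each block as warps of 32 threads
    cudaDeviceSynchronize();
    return 0;
}
```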
Cooperative Groups is a programming model for organizing groups of communicating threads that allows developers to express the granularity at which threads are communicating, enabling the expression of richer, more efficient parallel decompositions. Cooperative launch APIs support synchronization amongst thread blocks for the execution of parallel algorithms. Conventional programming models provide a single, simple construct for synchronizing cooperating threads: a barrier across all threads of a thread block (e.g., the __syncthreads() function). However, programmers would often like to define groups of threads at smaller than thread block granularities and synchronize within the defined groups to enable greater performance, design flexibility, and software reuse in the form of collective group-wide function interfaces.
Cooperative Groups enables programmers to define groups of threads explicitly at sub-block (e.g., as small as a single thread) and multi-block granularities, and to perform collective operations such as synchronization on the threads in a cooperative group. The programming model supports clean composition across software boundaries, so that libraries and utility functions can synchronize safely within their local context without having to make assumptions about convergence. Cooperative Groups primitives enable new patterns of cooperative parallelism, including producer-consumer parallelism, opportunistic parallelism, and global synchronization across an entire grid of thread blocks. Hierarchical groupings of threads such as cooperating thread arrays (CTA) and cooperating group arrays (CGA) according to some embodiments are described in more detail in U.S. Application No. 17/691,621, the entire content of which is incorporated herein by reference.
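A minimal sketch using the CUDA Cooperative Groups API (the kernel name and reduction pattern are illustrative assumptions) shows how a thread block can be partitioned into 32-thread tiles that synchronize and communicate within their own group rather than across the whole block:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical kernel: partition the thread block into 32-thread tiles and perform a
// tile-local reduction, synchronizing only the threads of each tile.
__global__ void tileSum(const int* in, int* out)
{
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int v = in[block.group_index().x * block.size() + block.thread_rank()];

    // Tile-wide reduction using register shuffles; only the 32 threads of this tile participate.
    for (int offset = tile.size() / 2; offset > 0; offset /= 2) {
        v += tile.shfl_down(v, offset);
    }
    tile.sync();                     // synchronize just this 32-thread cooperative group

    if (tile.thread_rank() == 0) {
        atomicAdd(out, v);           // one partial sum contributed per tile
    }
}
```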
A dispatch unit 1215 is configured to transmit instructions to one or more of the functional units. In an embodiment, the scheduler unit 1210 includes two dispatch units 1215 that enable two different instructions from the same warp to be dispatched during each clock cycle. In alternative embodiments, each scheduler unit 1210 may include a single dispatch unit 1215 or additional dispatch units 1215.
Each SM 1140 includes a register file 1220 that provides a set of registers for the functional units of the SM 1140. In an embodiment, the register file 1220 is divided between each of the functional units such that each functional unit is allocated a dedicated portion of the register file 1220. In another embodiment, the register file 1220 is divided between the different warps being executed by the SM 1140. The register file 1220 provides temporary storage for operands connected to the data paths of the functional units.
Each SM 1140 comprises multiple processing cores 1250. In an embodiment, the SM 1140 includes a large number (e.g., 128, etc.) of distinct processing cores 1250. Each core 1250 may include a fully-pipelined, single-precision, double-precision, and/or mixed precision processing unit that includes a floating point arithmetic logic unit and an integer arithmetic logic unit. In an embodiment, the floating point arithmetic logic units implement the IEEE 754-2008 standard for floating point arithmetic. In an embodiment, the cores 1250 include 64 single-precision (32-bit) floating point cores, 64 integer cores, 32 double-precision (64-bit) floating point cores, and 8 tensor cores.
Tensor cores are configured to perform matrix operations, and, in an embodiment, one or more tensor cores are included in the cores 1250. In particular, the tensor cores are configured to perform deep learning matrix arithmetic, such as convolution operations for neural network training and inferencing. In an embodiment, each tensor core operates on a 4x4 matrix and performs a matrix multiply and accumulate operation D=AxB+C, where A, B, C, and D are 4x4 matrices.
In an embodiment, the matrix multiply inputs A and B are 16-bit floating point matrices, while the accumulation matrices C and D may be 16-bit floating point or 32-bit floating point matrices. Tensor cores operate on 16-bit floating point input data with 32-bit floating point accumulation. The 16-bit floating point multiply requires 64 operations and results in a full precision product that is then accumulated using 32-bit floating point addition with the other intermediate products for a 4x4x4 matrix multiply. In practice, Tensor cores are used to perform much larger two-dimensional or higher dimensional matrix operations, built up from these smaller elements. An API, such as the CUDA C++ API, exposes specialized matrix load, matrix multiply and accumulate, and matrix store operations to efficiently use Tensor cores from a CUDA-C++ program. At the CUDA level, the warp-level interface assumes 16x16 size matrices spanning all 32 threads of the warp.
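For illustration, the following sketch of the warp-level interface (the kernel name and row-major tile layout are assumptions) uses the CUDA WMMA API to compute D = AxB+C for a single 16x16x16 tile, with fp16 inputs and fp32 accumulation as described above:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Hypothetical kernel: all 32 threads of one warp cooperate to compute D = A x B + C
// for one 16x16 tile on the Tensor Cores. Pointers reference 16x16 row-major tiles.
__global__ void wmmaTile(const half* a, const half* b, const float* c, float* d)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;

    wmma::load_matrix_sync(aFrag, a, 16);                      // 16 = leading dimension
    wmma::load_matrix_sync(bFrag, b, 16);
    wmma::load_matrix_sync(accFrag, c, 16, wmma::mem_row_major);

    wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);            // fp16 multiply, fp32 accumulate

    wmma::store_matrix_sync(d, accFrag, 16, wmma::mem_row_major);
}
```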
In some embodiments, transposition hardware is included in the processing cores 1250 or another functional unit (e.g., SFUs 1252 or LSUs 1254) and is configured to generate matrix data stored by diagonals and/or generate the original matrix and/or transposed matrix from the matrix data stored by diagonals. The transposition hardware may be provided in the load path from the shared memory 1270 to the register file 1220 of the SM 1140.
In one example, the matrix data stored by diagonals may be fetched from DRAM and stored in the shared memory 1270. As the instruction to perform processing using the matrix data stored by diagonals is processed, transposition hardware disposed in the path between the shared memory 1270 and the register file 1220 may provide the original matrix, transposed matrix, compacted original matrix, and/or compacted transposed matrix. Up until the last storage prior to the instruction, the single form of matrix data stored by diagonals may be maintained, and the matrix type designated by the instruction is generated as needed in the register file 1220.
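As a purely conceptual, host-side illustration (the wrapped-diagonal layout below is an assumption chosen for exposition, not the hardware's actual storage format), either the original or the transposed element can be regenerated on demand from a single copy of matrix data stored by diagonals:

```cuda
#include <array>

// Assume a 4x4 matrix M stored by wrapped diagonals: diag[d][i] = M[i][(i + d) % 4].
// The same diagonal-ordered data can yield either the original or the transposed matrix.
constexpr int N = 4;
using DiagMatrix = std::array<std::array<float, N>, N>;

float originalAt(const DiagMatrix& diag, int r, int c)
{
    return diag[(c - r + N) % N][r];      // M[r][c]
}

float transposedAt(const DiagMatrix& diag, int r, int c)
{
    return diag[(r - c + N) % N][c];      // M[c][r]
}
```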
Each SM 1140 also comprises multiple SFUs 1252 that perform special functions (e.g., attribute evaluation, reciprocal square root, and the like). In an embodiment, the SFUs 1252 may include a tree traversal unit (e.g., TTU 1143) configured to traverse a hierarchical tree data structure. In an embodiment, the SFUs 1252 may include a texture unit (e.g., Texture Unit 1142) configured to perform texture map filtering operations. In an embodiment, the texture units are configured to load texture maps (e.g., a 2D array of texels) from the memory 1004 and sample the texture maps to produce sampled texture values for use in shader programs executed by the SM 1140. In an embodiment, the texture maps are stored in the shared memory/L1 cache 1270. The texture units implement texture operations such as filtering operations using mip-maps (e.g., texture maps of varying levels of detail). In an embodiment, each SM 1140 includes two texture units.
Each SM 1140 also comprises multiple LSUs 1254 that implement load and store operations between the shared memory/L1 cache 1270 and the register file 1220. Each SM 1140 includes an interconnect network 1280 that connects each of the functional units to the register file 1220 and connects the LSUs 1254 to the register file 1220 and the shared memory/L1 cache 1270. In an embodiment, the interconnect network 1280 is a crossbar that can be configured to connect any of the functional units to any of the registers in the register file 1220 and to connect the LSUs 1254 to the register file 1220 and memory locations in the shared memory/L1 cache 1270.
The shared memory/L1 cache 1270 is an array of on-chip memory that allows for data storage and communication between the SM 1140 and the primitive engine 1135 and between threads in the SM 1140. In an embodiment, the shared memory/L1 cache 1270 comprises 128KB of storage capacity and is in the path from the SM 1140 to the partition unit 1080. The shared memory/L1 cache 1270 can be used to cache reads and writes. One or more of the shared memory/L1 cache 1270, L2 cache 1160, and memory 1004 are backing stores.
Combining data cache and shared memory functionality into a single memory block provides the best overall performance for both types of memory accesses. The capacity is usable as a cache by programs that do not use shared memory. For example, if shared memory is configured to use half of the capacity, texture and load/store operations can use the remaining capacity. Integration within the shared memory/L1 cache 1270 enables the shared memory/L1 cache 1270 to function as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth and low-latency access to frequently reused data.
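A minimal sketch (the kernel name is a placeholder) of how software can suggest such a split uses the CUDA runtime's preferred shared memory carveout attribute, here requesting roughly half of the unified capacity as shared memory:

```cuda
#include <cuda_runtime.h>

__global__ void myKernel() { /* kernel assumed to use shared memory */ }

int main()
{
    // Hint that about half of the unified shared memory/L1 capacity be carved out as
    // shared memory for this kernel; the remainder stays available to the L1 data cache.
    cudaFuncSetAttribute(myKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         50 /* percent of the maximum shared memory capacity */);

    myKernel<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```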
In the context of this disclosure, an SM or “streaming multiprocessor” means a processor architected as described in U.S. Patent No. 7,447,873 to Nordquist, including improvements thereto and advancements thereof, and as implemented for example in many generations of NVIDIA GPUs. For example, an SM may comprise a plurality of processing engines or cores configured to concurrently execute a plurality of threads arranged in a plurality of single-instruction, multiple-data (SIMD) groups (e.g., warps), wherein each of the threads in a same one of the SIMD groups executes a same data processing program comprising a sequence of instructions on a different input object, and different threads in the same one of the SIMD groups are executed using different ones of the processing engines or cores. An SM may typically also provide (a) a local register file having plural lanes, wherein each processing engine or core is configured to access a different subset of the lanes; and (b) instruction issue logic configured to select one of the SIMD groups and to issue one of the instructions of the same data processing program to each of the plurality of processing engines in parallel, wherein each processing engine executes the same instruction in parallel with each other processing engine using the subset of the local register file lanes accessible thereto. An SM typically further includes core interface logic configured to initiate execution of one or more SIMD groups. As shown in the figures, such SMs have been constructed to provide fast local shared memory enabling data sharing/reuse and synchronization between all threads of a CTA executing on the SM.
When the PPU 1000 is configured for general purpose parallel computation, a simpler configuration can be used than when it is configured for graphics processing. Specifically, the fixed function graphics processing units shown in
The PPU 1000 may be included in a desktop computer, a laptop computer, a tablet computer, servers, supercomputers, a smart-phone (e.g., a wireless, hand-held device), personal digital assistant (PDA), a digital camera, a vehicle, a head mounted display, a hand-held electronic device, and the like. In an embodiment, the PPU 1000 is embodied on a single semiconductor substrate. In another embodiment, the PPU 1000 is included in a system-on-a-chip (SoC) along with one or more other devices such as additional PPUs 1000, the memory 1004, a reduced instruction set computer (RISC) CPU, a memory management unit (MMU), a digital-to-analog converter (DAC), and the like.
In an embodiment, the PPU 1000 may be included on a graphics card that includes one or more memory devices 1004. The graphics card may be configured to interface with a PCIe slot on a motherboard of a desktop computer. In yet another embodiment, the PPU 1000 may be an integrated graphics processing unit (iGPU) or parallel processor included in the chipset of the motherboard.
Systems with multiple GPUs and CPUs are used in a variety of industries as developers expose and leverage more parallelism in applications such as artificial intelligence computing. High-performance GPU-accelerated systems with tens to many thousands of compute nodes are deployed in data centers, research facilities, and supercomputers to solve ever larger problems. As the number of processing devices within the high-performance systems increases, the communication and data transfer mechanisms need to scale to support the increased bandwidth.
In another embodiment (not shown), the NVLink 1010 provides one or more high-speed communication links between each of the PPUs 1000 and the CPU 1330 and the switch 1355 interfaces between the interconnect 1002 and each of the PPUs 1000. The PPUs 1000, memories 1004, and interconnect 1002 may be situated on a single semiconductor platform to form a parallel processing module 1325. In yet another embodiment (not shown), the interconnect 1002 provides one or more communication links between each of the PPUs 1000 and the CPU 1330 and the switch 1355 interfaces between each of the PPUs 1000 using the NVLink 1010 to provide one or more high-speed communication links between the PPUs 1000. In another embodiment (not shown), the NVLink 1010 provides one or more high-speed communication links between the PPUs 1000 and the CPU 1330 through the switch 1355. In yet another embodiment (not shown), the interconnect 1002 provides one or more communication links between each of the PPUs 1000 directly. One or more of the NVLink 1010 high-speed communication links may be implemented as a physical NVLink interconnect or as an on-chip or on-die interconnect using the same protocol as the NVLink 1010.
In the context of the present description, a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit fabricated on a die or chip. It should be noted that the term single semiconductor platform may also refer to multichip modules with increased connectivity which simulate on-chip operation and make substantial improvements over utilizing a conventional bus implementation. Of course, the various circuits or devices may also be situated separately or in various combinations of semiconductor platforms per the desires of the user. Alternately, the parallel processing module 1325 may be implemented as a circuit board substrate and each of the PPUs 1000 and/or memories 1004 may be packaged devices. In an embodiment, the CPU 1330, switch 1355, and the parallel processing module 1325 are situated on a single semiconductor platform.
In an embodiment, the signaling rate of each NVLink 1010 is 20 to 25 Gigabits/second and each PPU 1000 includes six NVLink 1010 interfaces (as shown in
In an embodiment, the NVLink 1010 allows direct load/store/atomic access from the CPU 1330 to each PPU’s 1000 memory 1004. In an embodiment, the NVLink 1010 supports coherency operations, allowing data read from the memories 1004 to be stored in the cache hierarchy of the CPU 1330, reducing cache access latency for the CPU 1330. In an embodiment, the NVLink 1010 includes support for Address Translation Services (ATS), allowing the PPU 1000 to directly access page tables within the CPU 1330. One or more of the NVLinks 1010 may also be configured to operate in a low-power mode.
As shown, a system 1365 is provided including at least one central processing unit 1330 that is connected to a communication bus 1375. The communication bus 1375 may be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect), PCI-Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s). The system 1365 also includes a main memory 1340. Control logic (software) and data are stored in the main memory 1340 which may take the form of random access memory (RAM).
The system 1365 also includes input devices 1360, the parallel processing system 1325, and display devices 1345, e.g. a conventional CRT (cathode ray tube), LCD (liquid crystal display), LED (light emitting diode), plasma display or the like. User input may be received from the input devices 1360, e.g., keyboard, mouse, touchpad, microphone, and the like. Each of the foregoing modules and/or devices may even be situated on a single semiconductor platform to form the system 1365. Alternately, the various modules may also be situated separately or in various combinations of semiconductor platforms per the desires of the user.
Further, the system 1365 may be coupled to a network (e.g., a telecommunications network, local area network (LAN), wireless network, wide area network (WAN) such as the Internet, peer-to-peer network, cable network, or the like) through a network interface 1335 for communication purposes.
The system 1365 may also include a secondary storage (not shown). The secondary storage includes, for example, a hard disk drive and/or a removable storage drive representing a floppy disk drive, a magnetic tape drive, a compact disk drive, a digital versatile disk (DVD) drive, a recording device, or universal serial bus (USB) flash memory. The removable storage drive reads from and/or writes to a removable storage unit in a well-known manner.
Computer programs, or computer control logic algorithms, may be stored in the main memory 1340 and/or the secondary storage. Such computer programs, when executed, enable the system 1365 to perform various functions. The memory 1340, the storage, and/or any other storage are possible examples of computer-readable media.
The architecture and/or functionality of the various previous figures may be implemented in the context of a general computer system, a circuit board system, a game console system dedicated for entertainment purposes, an application-specific system, and/or any other desired system. For example, the system 1365 may take the form of a desktop computer, a laptop computer, a tablet computer, servers, supercomputers, a smart-phone (e.g., a wireless, hand-held device), personal digital assistant (PDA), a digital camera, a vehicle, a head mounted display, a hand-held electronic device, a mobile phone device, a television, workstation, game consoles, embedded system, and/or any other type of logic.
An application program may be executed by a host processor, such as a CPU. In an embodiment, a device driver may implement an application programming interface (API) that defines various functions that can be utilized by the application program in order to generate graphical data for display. The device driver is a software program that includes a plurality of instructions that control the operation of the PPU 1000. The API provides an abstraction for a programmer that lets the programmer utilize specialized graphics hardware, such as the PPU 1000, to generate the graphical data without requiring the programmer to utilize the specific instruction set for the PPU 1000. The application may include an API call that is routed to the device driver for the PPU 1000. The device driver interprets the API call and performs various operations to respond to the API call. In some instances, the device driver may perform operations by executing instructions on the CPU. In other instances, the device driver may perform operations, at least in part, by launching operations on the PPU 1000 utilizing an input/output interface between the CPU and the PPU 1000. In an embodiment, the device driver is configured to implement the graphics processing pipeline 1400 utilizing the hardware of the PPU 1000.
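By way of a hedged example (the kernel and buffer names below are hypothetical), the CUDA runtime calls in the following sketch are the kind of API calls an application makes; each call is routed to the device driver, which responds by executing instructions on the CPU or by launching work on the PPU 1000:

```cuda
#include <cuda_runtime.h>

__global__ void scaleKernel(float* data, float s) { data[threadIdx.x] *= s; }

int main()
{
    float* d_data = nullptr;

    cudaMalloc(&d_data, 32 * sizeof(float));     // API call handled by the driver: device allocation
    cudaMemset(d_data, 0, 32 * sizeof(float));   // API call handled by the driver: device memset

    scaleKernel<<<1, 32>>>(d_data, 2.0f);        // the driver translates this launch into work
                                                 // dispatched to the GPU's streaming multiprocessors
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```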
Various programs may be executed within the PPU 1000 in order to implement the various stages of the processing for the application program. For example, the device driver may launch a kernel on the PPU 1000 to perform one stage of processing on one SM 1140 (or multiple SMs 1140). The device driver (or the initial kernel executed by the PPU 1000) may also launch other kernels on the PPU 1000 to perform other stages of the processing. If the application program processing includes a graphics processing pipeline, then some of the stages of the graphics processing pipeline may be implemented on fixed unit hardware such as a rasterizer or a data assembler implemented within the PPU 1000. It will be appreciated that results from one kernel may be processed by one or more intervening fixed function hardware units before being processed by a subsequent kernel on an SM 1140.
All patents, patent applications and publications cited herein are incorporated by reference for all purposes as if expressly set forth.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
This application is related to the following commonly-assigned copending US patent applications, the entire contents of each of which are incorporated by reference: U.S. Application No. 17/691,276 filed Mar. 10, 2022, titled “Method And Apparatus For Efficient Access To Multidimensional Data Structures And/Or Other Large Data Blocks”; U.S. Application No. 17/691,621 filed Mar. 10, 2022, titled “Cooperative Group Arrays”; U.S. Application No. 17/691,690 filed Mar. 10, 2022, titled “Distributed Shared Memory”; U.S. Application No. 17/691,759 filed Mar. 10, 2022, titled “Virtualizing Hardware Processing Resources in a Processor”; U.S. Application No. 17/691,288 filed Mar. 10, 2022, titled “Programmatically Controlled Data Multicasting Across Multiple Compute Engines”; U.S. Application No. 17/691,296 filed Mar. 10, 2022, titled “Hardware Accelerated Synchronization With Asynchronous Transaction Support”; U.S. Application No. 17/691,406 filed Mar. 10, 2022, titled “Efficient Matrix Multiply and Add with a Group of Warps”; U.S. Application No. 17/691,872 filed Mar. 10, 2022, titled “Techniques for Scalable Load Balancing of Thread Groups in a Processor”; U.S. Application No. 17/691,808 filed Mar. 10, 2022, titled “Flexible Migration of Executing Software Between Processing Components Without Need For Hardware Reset”; and U.S. Application No. 17/691,422 filed Mar. 10, 2022, titled “Method And Apparatus For Efficient Access To Multidimensional Data Structures And/Or Other Large Data Blocks”.