Highly parallel data applications are used in a variety of fields such as science, entertainment, finance, medicine, engineering, social media, and so on. Machine learning data models, shader programs, and similar highly parallel data applications process large amounts of data by performing complex calculations at substantially high speeds. With an increased number of processing circuits in computing systems, the latency of delivering data to the processing circuits becomes more pronounced. The performance, such as throughput, of the processing circuits depends on quick access to stored data. To support high performance, the memory hierarchy includes storage elements that transition from relatively fast, volatile memory, such as registers on a processor die, to caches either located on the processor die or connected to the processor die, and to off-chip storage with longer access times.
The benefit of the memory hierarchy is reduced when the relatively fast, volatile memory is idle for a significant amount of time. One example is when a relatively high number of vector registers of a vector general-purpose register file are used as destination registers of long-latency vector memory instructions. These vector registers are idle from the time the vector memory instruction issues until the time the targeted data is returned. This latency can stretch to several hundred clock cycles, and the vector registers remain idle during this latency. These idle vector registers cannot be actively used by other instructions, and they do not store useful data.
In view of the above, methods and mechanisms for efficiently processing vector memory accesses on an integrated circuit are desired.
While the invention is susceptible to various modifications and alternative forms, specific implementations are shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the scope of the present invention as defined by the appended claims.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. However, one having ordinary skill in the art should recognize that the invention might be practiced without these specific details. In some instances, well-known circuits, structures, and techniques have not been shown in detail to avoid obscuring the present invention. Further, it will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements.
Apparatuses and methods for efficiently scheduling wavefronts for execution on an integrated circuit are contemplated. In various implementations, a computing system includes a processing circuit. In various implementations, the processing circuit is a parallel data processing circuit with a highly parallel data microarchitecture. The parallel data processing circuit includes multiple, replicated compute circuits, each with the circuitry of multiple lanes of execution. Each compute circuit executes one or more wavefronts. The parallel data processing circuit includes a memory used to store temporary data. An example of the temporary data is stack data of a function call used by each wavefront that is generated based on the function call. In some implementations, the memory is a local memory of the parallel data processing circuit, such as dedicated memory that is not shared with another processing circuit. In an implementation, the local memory is a portion of video memory used to store video frame data. In one implementation, the parallel data processing circuit does not maintain cache coherency information in the local memory. In other implementations, the local memory is a local cache used to store the temporary data.
Each compute circuit includes a dispatch circuit that includes a queue for storing multiple wavefronts before the wavefronts are dispatched to the execution circuits of the compute circuits for execution. Each of the execution circuits is a single instruction multiple data (SIMD) circuit that includes multiple lanes of execution for executing a wavefront. Each compute circuit includes one or more SIMD circuits, and therefore, each compute circuit is able to execute one or more wavefronts. As used herein, the term “dispatch” refers to wavefronts being selected and sent from the dispatch circuit of the compute circuit to the SIMD circuits of the compute circuit. As used herein, the term “issue” refers to instructions of a wavefront in a compute circuit being selected and sent to the multiple lanes of execution of one of the multiple SIMD circuits of the compute circuit. The compute circuit also includes a control circuit that identifies an instruction of the in-flight wavefronts as a long-latency vector memory access instruction with an indication specifying that a destination targets data storage in the cache, rather than a vector register file. In various implementations, the long-latency vector memory access instruction is a long-latency vector load instruction. The control circuit assigns (allocates) cache storage space (e.g., a cache line) for the vector memory access instruction while foregoing vector register assignment. It is noted that the description herein uses the term cache line when referring to the cache storage space. However, the amount of storage assigned need not correspond directly to the size of a cache line. The storage space may be more than or less than a cache line. The control circuit sends a data retrieval request corresponding to the vector memory access instruction to one or more higher levels of a cache memory subsystem.
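For illustration only, the following C++ sketch models this decode-and-assign step: a hypothetical instruction word carries an opcode and an indication bit specifying that the destination is cache storage, and the control circuit reserves a cache line instead of a vector register. The structure layout, field names, opcode value, and line count are assumptions made for the sketch and do not correspond to any particular instruction encoding described herein.

```cpp
#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

// Hypothetical encoding of a vector memory access instruction; the field
// names and the opcode value are illustrative assumptions only.
struct VectorMemInstr {
    uint32_t opcode;        // identifies the instruction type
    bool     dest_in_cache; // indication: destination targets cache storage, not a vector register
    uint64_t address;       // address of the requested data
};

constexpr uint32_t kVectorLoadOpcode = 0x20;   // assumed long-latency vector load opcode

class ControlCircuit {
public:
    explicit ControlCircuit(size_t num_lines) : line_in_use_(num_lines, false) {}

    // Returns the assigned cache line, or nullopt when the instruction takes
    // the ordinary vector-register path or no line is free.
    std::optional<size_t> handle(const VectorMemInstr& instr) {
        if (instr.opcode != kVectorLoadOpcode || !instr.dest_in_cache)
            return std::nullopt;                 // normal vector-register-destination path
        for (size_t line = 0; line < line_in_use_.size(); ++line) {
            if (!line_in_use_[line]) {
                line_in_use_[line] = true;       // forego vector register assignment; reserve a cache line
                sendDataRetrievalRequest(instr.address, line);
                return line;
            }
        }
        return std::nullopt;                     // cache full; caller falls back
    }

private:
    void sendDataRetrievalRequest(uint64_t addr, size_t line) {
        std::cout << "request addr 0x" << std::hex << addr
                  << " -> cache line " << std::dec << line << '\n';
    }
    std::vector<bool> line_in_use_;
};

int main() {
    ControlCircuit ctrl(4);
    VectorMemInstr load{kVectorLoadOpcode, /*dest_in_cache=*/true, 0x1000};
    if (auto line = ctrl.handle(load))
        std::cout << "assigned cache line " << *line << " (no vector register allocated)\n";
}
```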
The SIMD circuits execute other instructions of the in-flight wavefronts using the vector register file until the requested data returns. When the requested data has arrived from the memory subsystem, the control circuit stores the requested data in the assigned cache line of cache. The control circuit sends a notification to one of the SIMD circuits that generated the vector memory access instruction. The notification indicates that the requested data has been returned. When the control circuit receives an indication specifying that the corresponding one of the SIMD circuits is ready to receive the requested data, the control circuit completes the long-latency vector memory access instruction by sending the requested data from the cache to the corresponding one of the SIMD circuits. In some implementations, the compute circuit also includes a buffer for data storage when the cache is full. Further details of these techniques to efficiently schedule wavefronts for execution on an integrated circuit are provided in the following description of
Turning now to
In an implementation, the processing circuit 110 is a general-purpose central processing unit (CPU). In one implementation, processing circuit 102 is a parallel data processing circuit with a highly parallel data microarchitecture, such as a graphics processing unit (GPU). The processing circuit 102 can be a discrete device, such as a dedicated GPU (dGPU), or the processing circuit 102 can be integrated (an iGPU) in the same package as another processing circuit. Other parallel data processing circuits that can be included in computing system 100 include digital signal processing circuits (DSPs), field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), and so forth.
In various implementations, processing circuit 102 includes multiple compute circuits 104A-104N, each including similar circuitry and components such as the multiple, parallel computational lanes 106. In some implementations, the processing circuit 102 includes one or more single instruction multiple data (SIMD) circuits 108A-108B, each including the multiple, parallel computational lanes 106. Additionally, each of the compute circuits 104A-104N includes a cache 103 and an optional buffer 107. In some implementations, the parallel computational lanes 106 (or parallel execution lanes 106 or lanes 106) operate in lockstep. In various implementations, the data flow within each of the lanes 106 is pipelined. Pipeline registers store intermediate results, and circuitry of arithmetic logic units (ALUs) performs integer arithmetic, floating-point arithmetic, Boolean logic operations, branch condition comparisons, and so forth. These components are not shown for ease of illustration. Each of the ALUs within a given row across lanes 106 includes the same circuitry and functionality, and operates on the same instruction but on different data associated with a different thread. A particular combination of the same instruction and a particular data item of multiple data items is referred to as a “work item.” A work item is also referred to as a thread.
The multiple work items (or multiple threads) are grouped into thread groups, where a “thread group” is a partition of work executed in an atomic manner. In some implementations, a thread group includes instructions of a function call that operates on multiple data items concurrently. Each data item is processed independently of other data items, but the same sequence of operations of the subroutine is used. As used herein, a “thread group” is also referred to as a “work block” or a “wavefront.” Tasks performed by the parallel data processing circuit 102 can be grouped into a “workgroup” that includes multiple thread groups (or multiple wavefronts).
In an implementation, processing circuit 102 performs video rendering on an 8-pixel by 8-pixel block of a video frame. The corresponding 64 threads are grouped into two wavefronts with each wavefront including 32 threads or 32 work items. The hardware of a SIMD circuit includes 32 parallel lanes of execution. The hardware of a scheduler (not shown) assigns workgroups to the compute circuits 104A-104N. Each of the compute circuits 104A-104N includes a dispatch circuit that includes a queue for storing multiple wavefronts before the wavefronts are dispatched to the SIMD circuits that include the parallel lanes of execution 106. Scheduling circuitry of the assigned compute circuit of the compute circuits 104A-104N divides the received workgroup into two separate wavefronts, stores the two wavefronts in a dispatch circuit, and assigns each wavefront to a respective SIMD circuit. In other implementations, another number of threads and wavefronts are used based on the hardware configuration of the parallel data processing circuit 102.
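As a minimal numeric illustration of this division, the short program below computes the wavefront count from the example tile size and a 32-lane SIMD circuit; the values are the ones given in this paragraph and are not a requirement of the hardware.

```cpp
#include <iostream>

int main() {
    const int tile_width  = 8;     // 8-pixel by 8-pixel block of a video frame
    const int tile_height = 8;
    const int simd_lanes  = 32;    // lanes of execution per SIMD circuit

    const int work_items = tile_width * tile_height;                   // 64 threads
    const int wavefronts = (work_items + simd_lanes - 1) / simd_lanes; // round up

    std::cout << work_items << " work items -> " << wavefronts
              << " wavefronts of " << simd_lanes << " threads each\n"; // 64 -> 2
}
```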
In some implementations, cache 103 is a local cache used to store data that corresponds to long-latency vector memory access instructions. In other implementations, instead of cache 103, compute circuits 104A-104N include a local memory such as dedicated memory shared by SIMD circuits 108A-108B, but that is not shared with another processing circuit such as processing circuit 110. In yet another implementation, processing circuit 102 includes the local memory that is shared by compute circuits 104A-104N, but that is not shared with another processing circuit such as processing circuit 110. In an implementation, the local memory is a portion of video memory used to store video frame data. In one implementation, processing circuit 102 does not maintain cache coherency information in the local memory. In other implementations, cache 103 is a last level cache structure that maintains cache coherency information and this last level cache structure is shared by SIMD circuits 108A-108B. A compute circuit of the compute circuits 104A-104N dispatches wavefronts from a local dispatch circuit for execution on one of multiple SIMD circuits 108A-108B. The compute circuit identifies an instruction of the in-flight wavefronts as a long-latency vector memory access instruction with an indication specifying that a destination targets data storage in cache 103, rather than a vector register file. The compute circuit assigns a cache line for the vector memory access instruction while foregoing vector register assignment. The compute circuit sends a data retrieval request corresponding to the vector memory access instruction to one or more higher levels of a cache memory subsystem and the memory controllers 130.
The SIMD circuits 108A-108B execute other instructions of the in-flight wavefronts using the vector register file until the requested data returns. When the requested data has arrived from the memory subsystem, the circuitry of the compute circuit stores the requested data in the assigned cache line of cache 103 in place of storing the requested data in one of the vector registers of a vector register file. The circuitry of the compute circuit sends a notification to one of the SIMD circuits 108A-108B that generated the vector memory access instruction. In various implementations, the long-latency vector memory access instruction is a long-latency vector load instruction. The notification indicates that the requested data has been returned. When the circuitry receives an indication specifying that the corresponding one of the SIMD circuits 108A-108B is ready to receive the requested data, the circuitry completes the long-latency vector memory access instruction by sending the requested data from the cache 103 to the corresponding one of the SIMD circuits 108A-108B. In some implementations, the circuitry uses buffer 107 for data storage when cache 103 is full. Buffer 107 is separate from the cache memory subsystem and the memory subsystem; rather than being part of either, buffer 107 provides local data storage for cases when cache 103 is full.
Although an example of a single instruction multiple data (SIMD) microarchitecture is shown for the compute circuits 104A-104N, other types of highly parallel data micro-architectures are possible and contemplated. The high parallelism offered by the hardware of the compute circuits 104A-104N is used for real-time data processing. Examples of real-time data processing are rendering multiple pixels, image blending, pixel shading, vertex shading, and geometry shading. In such cases, each of the data items of a wavefront is a pixel of an image. The compute circuits 104A-104N can also be used to execute other threads that require operating simultaneously with a relatively high number of different data elements (or data items). Examples of these threads are threads for scientific, medical, finance and encryption/decryption computations.
In one implementation, the processing circuit 110 is a general-purpose processing circuit, such as a CPU, with any number of processing circuit cores that include circuitry for executing program instructions. Memory 112 represents a local hierarchical cache memory subsystem. Memory 112 stores source data, intermediate results data, results data, and copies of data and instructions stored in memory devices 140. For example, memory 112 stores the application 116, which is a copy of the application 144 stored in the memory devices 140. Processing circuit 110 is coupled to bus 125 via interface 113. Processing circuit 110 receives, via interface 113, copies of various data and instructions, such as the operating system 142, one or more device drivers, one or more applications such as application 144, and/or other data and instructions.
The processing circuit 110 retrieves a copy of the application 144 from the memory devices 140, and the processing circuit 110 stores this copy as application 116 in memory 112. One example of application 116 is a highly parallel data application such as a shader program. When the instructions of compiler 114 are executed by circuitry 118, the circuitry 118 compiles the application 116. As part of the compilation, circuitry 118 translates instructions of the application 116 into commands executable by the SIMD circuits 108A-108B of the compute circuits 104A-104N of the processing circuit 102. For example, when the instructions of the compiler 114 are executed by the circuitry 118, the circuitry 118 uses a graphics library with its own application program interface (API) to translate function calls of the application 116 into commands particular to the compute circuits 104A-104N of the processing circuit 102.
To move the scheduling of threads from the processing circuit 110 to the processing circuit 102, software development kits (SDKs) and application programming interfaces (APIs) were developed for use with widely available high-level languages to provide supported function calls. The function calls provide an abstraction layer over the parallel implementation details of processing circuit 102, such as the lanes 106 of the compute circuits 104A-104N. These details are hardware specific to the parallel data processing circuit 102 but hidden from the developer to allow for more flexible writing of software applications. When circuitry 118 executes the instructions of the compiler 114, the circuitry 118 compiles the translated sequence of commands into machine executable code for execution by the SIMD circuits of the compute circuits 104A-104N. An example of such processing circuitry is a GPU, such as the compute circuits 104A-104N of the processing circuit 102. The function calls in high-level languages, such as C, C++, FORTRAN, and Java, are translated to commands which are later processed by the hardware in processing circuit 102.
Platforms such as OpenCL (Open Computing Language), OpenGL (Open Graphics Library), OpenGL for Embedded Systems (OpenGL ES), and Vulkan provide a variety of APIs for running programs on GPUs from AMD, Inc. Developers use OpenCL for simultaneously processing the multiple data elements of the scientific, medical, finance, encryption/decryption and other computations while using OpenGL and OpenGL ES for simultaneously rendering multiple pixels for video graphics computations. Vulkan is a low-overhead, cross-platform API, open standard for three-dimensional (3-D or 3D) graphics applications. Further, DirectX is a platform for running programs on GPUs in systems using one of a variety of Microsoft operating systems.
In an implementation, when executing instructions of a kernel mode driver (KMD), circuitry 118 assigns state information for a command group generated by compiling the application 116. Examples of the state information are a process identifier (ID), a name of the application or an ID of the application, a version of the application, a compute/graphics type of work, and so on. When executing instructions of the kernel mode driver, circuitry 118 sends the command group and state information to a ring buffer in the memory devices 140. The processing circuit 102 accesses, via the memory controllers 130, the command group and state information stored in the ring buffer. For the wavefronts based on function calls to be executed, the number of variables to allocate, how the variables are addressed, and the number of vector registers used to hold the variables must be determined. At least the stack data and the heap data determine this data allocation.
Static data can be used to allocate statically declared objects, such as global variables and constants. A majority of these objects can be arrays. The stack can also be used to allocate scalar variables rather than arrays, such as local variables and parameters in the functions currently being invoked. Stack data can be grown and shrunk on procedure call or return, respectively. The heap data can be used to allocate dynamic objects that are accessed with pointers and are typically not scalar variables. The heap data can be used to reduce the frequency of copying the contents of strings and lists by storing the contents of temporary strings or lists during the string/list operations. The heap data is not affected by the return of the function call. The processing circuit 102 schedules the retrieved commands to the compute circuits 104A-104N based on at least the state information. Other examples of scheduling information used to schedule the retrieved commands are the age of the commands, priority levels of the commands, an indication of real-time data processing of the commands, and so forth.
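The difference between the stack data and the heap data described above can be seen in a short, generic C++ fragment: the local array below lives in the function's stack frame and is released on return, while the heap-allocated object survives the return. This is ordinary language behavior shown only for context; it is not a mechanism of the circuits described herein.

```cpp
#include <memory>
#include <vector>

std::shared_ptr<std::vector<int>> make_results() {
    int scratch[16] = {};                                  // stack data: freed when the call returns
    auto results = std::make_shared<std::vector<int>>();   // heap data: survives the return
    for (int i = 0; i < 16; ++i) {
        scratch[i] = i * i;                                 // temporary, per-invocation storage
        results->push_back(scratch[i]);
    }
    return results;                                         // heap object outlives the stack frame
}

int main() { auto r = make_results(); return r->back() == 225 ? 0 : 1; }
```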
In some implementations, computing system 100 utilizes a communication fabric (“fabric”), rather than the bus 125, for transferring requests, responses, and messages between the processing circuits 102 and 110, the I/O interfaces 120, the memory controllers 130, the network interface 135, and the display controller 150. When messages include requests for obtaining targeted data, the circuitry of interfaces within the components of computing system 100 translates target addresses of requested data. In some implementations, the bus 125, or a fabric, includes circuitry for supporting communication, data transmission, network protocols, address formats, interface signals and synchronous/asynchronous clock domain usage for routing data.
Memory controllers 130 are representative of any number and type of memory controllers accessible by processing circuits 102 and 110. While memory controllers 130 are shown as being separate from processing circuits 102 and 110, it should be understood that this merely represents one possible implementation. In other implementations, one of memory controllers 130 is embedded within one or more of processing circuits 102 and 110 or it is located on the same semiconductor die as one or more of processing circuits 102 and 110. Memory controllers 130 are coupled to any number and type of memory devices 140.
Memory devices 140 are representative of any number and type of memory devices. For example, the type of memory in memory devices 140 includes Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), NAND Flash memory, NOR flash memory, Ferroelectric Random Access Memory (FeRAM), or otherwise. Memory devices 140 store at least instructions of an operating system 142, one or more device drivers, and application 144. In some implementations, application 144 is a highly parallel data application such as a video graphics application, a shader application, or other. Copies of these instructions can be stored in a memory or cache device local to processing circuit 110 and/or processing circuit 102.
I/O interfaces 120 are representative of any number and type of I/O interfaces (e.g., peripheral component interconnect (PCI) bus, PCI-Extended (PCI-X), PCIE (PCI Express) bus, gigabit Ethernet (GBE) bus, universal serial bus (USB)). Various types of peripheral devices (not shown) are coupled to I/O interfaces 120. Such peripheral devices include (but are not limited to) displays, keyboards, mice, printers, scanners, joysticks or other types of game controllers, media recording devices, external storage devices, and so forth. Network interface 135 receives and sends network messages across a network.
Turning now to
In other implementations, the parallel data processing circuit 205 includes other components, omits one or more of the illustrated components, has multiple instances of a component even if only one instance is shown in the apparatus 200, and/or is organized in other suitable manners. Also, each connection shown in apparatus 200 is representative of any number of connections between components. Additionally, other connections can exist between components even if these connections are not explicitly shown in apparatus 200. In various implementations, the apparatus 200 executes any of various types of highly parallel data applications. As part of executing an application, a host CPU (not shown) launches kernels to be executed by the parallel data processing circuit 205. The command processing circuit 235 receives kernels from the host CPU and uses dispatch circuit 240 to dispatch wavefronts of these kernels to compute circuits 255A-255N.
In various implementations, the circuitry of partition 250B is a replicated instantiation of the circuitry of partition 250A. In some implementations, each of the partitions 250A-250B is a chiplet. In an implementation, the local cache 258 represents a last level shared cache structure such as a local level-two (L2) cache within partition 250A. In other implementations, instead of local cache 258, compute circuits 255A-255N include a local memory such as dedicated memory that is not shared with another processing circuit. In yet another implementation, parallel data processing circuit 205 includes the local memory that is shared by compute circuits 255A-255N, but that is not shared with another processing circuit. In an implementation, the local memory is a portion of video memory used to store video frame data. In one implementation, compute circuits 255A-255N do not maintain cache coherency information in the local memory. Each of the compute circuits 255A-255N receives wavefronts from the dispatch circuit 240 and stores the received wavefronts in a corresponding local dispatch circuit (not shown). Control circuit 256 schedules these wavefronts to be dispatched from the local dispatch circuits to SIMD circuits 230A-230N of the compute circuits 255A-255N. Control circuit 256 includes circuitry for dynamically assigning and allocating vector registers of vector general-purpose register files, or vector register files (VRFs 232A-232N), to wavefronts at call boundaries.
Threads within wavefronts executing on compute circuits 255A-255N read and write data to corresponding cache 258, vector registers 232A-232N, global data share 270, shared L1 cache 265, and L2 cache 260. It is noted that L1 cache 265 can include separate structures for data and instruction caches. It is also noted that global data share 270, shared L1 cache 265, L2 cache 260, memory controller 220, system memory, and caches 230A-230N can collectively be referred to herein as a “memory subsystem”. In some implementations, the wavefronts executing on the SIMD circuits 230A-230N generate long-latency vector memory access instructions targeting temporary data. An example of the temporary data is stack data of a function call used by each wavefront that is generated based on the function call. Static data can be used to allocate statically declared objects, such as global variables and constants. A majority of these objects can be arrays. The stack data can also be used to allocate scalar variables rather than arrays, such as local variables and parameters in the functions currently being invoked. Stack data can be grown and shrunk on procedure call or return, respectively.
In an implementation, a long-latency vector memory access instruction can have a latency of 400-500 clock cycles between issuing of the vector memory access instruction and arrival of the requested data from the memory subsystem at the requesting one of the compute circuits 255A-255N. Afterward, the wavefront uses the retrieved data for only 10-20 clock cycles. Rather than assign and allocate a vector register of the VRFs 232A-232N that would sit idle for 500 or more clock cycles and then store data used for only 10-20 clock cycles, the control circuit 256 assigns and allocates a cache line of cache 258 based on an indication stored in the vector memory access instruction specifying the destination. In various implementations, the long-latency vector memory access instruction is a long-latency vector load instruction. In an implementation, control circuit 256 selects the cache line based on a value of a base register stored in a programmable configuration register and an offset provided by the vector memory access instruction. In some implementations, control circuit 256 detects a type of long-latency vector memory access instruction based on one or more fields of the vector memory access instruction, such as at least an opcode field.
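For illustration, the fragment below works through a base-plus-offset selection and the occupancy comparison implied by the figures above; the base address, offset, 128-byte line size, and cycle counts are assumed example values rather than properties of any particular implementation.

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // Assumed values: a base address from a programmable configuration register
    // and an offset carried by the vector memory access instruction.
    const uint64_t base_register = 0x80000000;
    const uint64_t instr_offset  = 0x340;
    const uint64_t line_bytes    = 128;          // assumed cache line size

    const uint64_t target   = base_register + instr_offset;
    const uint64_t line_idx = target / line_bytes;
    std::cout << "destination cache line index: " << line_idx << '\n';

    // Occupancy comparison from the paragraph above: a vector register held for
    // the full memory latency is idle for most of that time.
    const int latency_cycles = 450;              // issue to data return (example)
    const int use_cycles     = 15;               // cycles the data is actually used
    std::cout << "register would be useful for " << use_cycles << " of "
              << latency_cycles + use_cycles << " cycles ("
              << (100 * use_cycles) / (latency_cycles + use_cycles) << "%)\n";
}
```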
Control circuit 256 sends a data retrieval request corresponding to the vector memory access instruction to one or more of the global data share 270, shared L1 cache 265, the L2 cache 260, and the memory controller 220. Compute circuit 255A executes other instructions of the in-flight wavefronts using the VRFs 232A-232N until the requested data returns. When the requested data has arrived from the memory subsystem, control circuit 256 directs storage of the requested data to the assigned cache line of cache 258 in place of storing the requested data in one of the vector registers of VRFs 232A-232N. Control circuit 256 sends a notification to one of the SIMD circuits 230A-230N that generated the vector memory access instruction. The notification indicates that the requested data has been returned. When control circuit 256 receives an indication specifying that the corresponding one of the SIMD circuits 230A-230N is ready to receive the requested data, control circuit 256 completes the long-latency vector memory access instruction by sending the requested data from the cache 258 to the corresponding one of the SIMD circuits 230A-230N.
In some implementations, control circuit 256 uses buffer 257 for data storage when cache 258 is full. Buffer 257 is separate from the cache memory subsystem and the memory subsystem; instead, buffer 257 provides local data storage for cases when cache 258 is full. In other implementations, when cache 258 is full and has no available cache lines, control circuit 256 sends the requested data to one or more of the global data share 270, shared L1 cache 265, and the L2 cache 260. Control circuit 256 retrieves the requested data for data storage in cache 258 when data storage space becomes available again in cache 258.
Referring now to
In one implementation, dispatch circuit 305 is a local dispatch circuit that receives wavefronts from an external dispatch circuit or external scheduler controlled by a command processing circuit (e.g., command processing circuit 235 of
The reservation station 320 can also be referred to as ordered list 320 where the wavefronts are ordered according to their relative priority level. A corresponding priority level of a wavefront is based on one or more of an age of the wavefront, a quality of service (QOS) parameter of the wavefront, a corresponding reservation data size of the wavefront, a ratio of the corresponding reservation data size to the available data storage space in the cache, an application identifier or type, such as a real-time application, and so forth. In one implementation, when dispatch circuit 305 is getting ready to launch a next wavefront, dispatch circuit 305 queries control circuit 310 to determine an initial number of vector registers to allocate to the next wavefront. In this implementation, control circuit 310 queries register assignment circuit 315 when determining how to dynamically allocate vector registers of vector register files (VRFs) 355A-355N at the granularity of the functions of wavefronts.
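One way to picture how these factors can be folded into a single ordering is sketched below; because the description only enumerates the inputs, the weights and the scoring formula are assumptions made for illustration, not a disclosed priority function.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Wavefront {
    uint32_t age;               // cycles since the wavefront entered the reservation station
    uint32_t qos;               // quality-of-service parameter (higher = more urgent)
    uint32_t reserve_bytes;     // reservation data size in the cache
    bool     realtime;          // real-time application indication
};

// Illustrative scoring: favors older, higher-QoS, real-time wavefronts with
// smaller cache reservations. The weights are arbitrary assumptions.
static uint64_t priority(const Wavefront& w, uint32_t cache_free_bytes) {
    uint64_t score = uint64_t{w.age} + 8u * w.qos + (w.realtime ? 1000u : 0u);
    if (cache_free_bytes > 0)
        score += 100u * cache_free_bytes / (w.reserve_bytes + cache_free_bytes);
    return score;
}

// Orders the reservation station so the highest-priority wavefront is first.
void order(std::vector<Wavefront>& station, uint32_t cache_free_bytes) {
    std::stable_sort(station.begin(), station.end(),
        [&](const Wavefront& a, const Wavefront& b) {
            return priority(a, cache_free_bytes) > priority(b, cache_free_bytes);
        });
}

int main() {
    std::vector<Wavefront> station = {{100, 1, 4096, false}, {20, 3, 1024, true}};
    order(station, 8192);
    return station.front().realtime ? 0 : 1;   // the real-time wavefront moves to the front
}
```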
In one implementation, register assignment circuit 315 includes free register list 317 which includes identifiers (IDs) of the vector registers of VRFs 355A-355N that are currently available for allocation. If there are enough vector registers available in free register list 317 for the next function of a wavefront, then control circuit 310 assigns these vector registers to the next wavefront. Otherwise, if there are insufficient vector registers available in free register list 317 for the next wavefront, it is possible for the control circuit 310 to stall the launch of the next wavefront. One or more of the register assignment circuit 315, the control circuit 310, and the reservation station 320 maintains a count of vector registers of VRFs 355A-355N allocated to the wavefronts 322-325. Cache 360 is a local cache used to store data such as temporary data. An example of the temporary data is stack data of a function call used by each wavefront that is generated based on the function call. The control circuit 310 also monitors the available data storage space in the cache 360.
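A minimal model of the free register list behavior described above is given below, assuming registers are identified by integer IDs and an allocation request that cannot be satisfied stalls the launch; the register count and the interface are illustrative assumptions.

```cpp
#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

// Minimal model of a free register list: vector register IDs currently
// available for allocation. Register counts and IDs are illustrative.
class FreeRegisterList {
public:
    explicit FreeRegisterList(uint32_t total) {
        for (uint32_t id = 0; id < total; ++id) free_.push_back(id);
    }

    // Returns the assigned register IDs for the next function of a wavefront,
    // or nullopt to signal that the launch must stall.
    std::optional<std::vector<uint32_t>> allocate(uint32_t needed) {
        if (free_.size() < needed) return std::nullopt;   // insufficient registers: stall
        std::vector<uint32_t> assigned(free_.begin(), free_.begin() + needed);
        free_.erase(free_.begin(), free_.begin() + needed);
        return assigned;
    }

    // Registers return to the list at the call boundary (function return).
    void release(const std::vector<uint32_t>& regs) {
        free_.insert(free_.end(), regs.begin(), regs.end());
    }

private:
    std::deque<uint32_t> free_;
};

int main() {
    FreeRegisterList vrf(16);
    auto a = vrf.allocate(12);   // first wavefront's function fits
    auto b = vrf.allocate(8);    // second request stalls: only 4 registers remain
    return (a && !b) ? 0 : 1;
}
```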
In various implementations, the control circuit 310 has the functionality of control circuit 256 (of
It is noted that the arrangement of components such as dispatch circuit 305, control circuit 310, and register assignment circuit 315 shown in
It is possible and contemplated that one or more of the processing circuits, the compute circuits, and apparatuses illustrated in
A second silicon wafer (or second wafer) is fabricated with multiple instantiated copies of integrated circuitry of a second chiplet, and this second wafer is diced using laser cutting techniques to separate the multiple copies of the second chiplet. The first chiplet provides functionality different from the functionality of the second chiplet. One or more copies of the first chiplet are placed in an integrated circuit, and one or more copies of the second chiplet is placed in the integrated circuit. The first chiplet and the second chiplet are interconnected to one another within a corresponding MCM. Such a process replaces a process that fabricates a third silicon wafer (or third wafer) with multiple copies of a single, monolithic semiconductor die that includes the functionality of the first chiplet and the second chiplet as integrated functional blocks within the single, monolithic semiconductor die.
Process yield of single, monolithic dies on a silicon wafer is lower than process yield of smaller chiplets on a separate silicon wafer. In addition, a semiconductor process can be adapted for the particular type of chiplet being fabricated. With single, monolithic dies, each die on the wafer is formed with the same fabrication process. However, it is possible that an interface functional block does not require process parameters of a semiconductor manufacturer's expensive process that provides the fastest devices and smallest geometric dimensions. With separate chiplets, designers can add or remove chiplets for particular integrated circuits to readily create products for a variety of performance categories. In contrast, an entirely new silicon wafer must be fabricated for a different product when single, monolithic dies are used. It is possible and contemplated that one or more of the processing circuits, the compute circuits, and apparatuses illustrated in
In some implementations, the hardware of the processing circuits and the apparatuses illustrated in
Regarding the methods 400-600 (of
A control circuit dispatches wavefronts from a local dispatch circuit for execution on one of multiple execution circuits of a compute circuit (block 402). The control circuit identifies an instruction of the in-flight wavefronts as a vector memory access instruction with an indication specifying that a destination targets data storage in a cache, rather than a vector register file (block 404). The control circuit assigns a cache line for the vector memory access instruction while foregoing vector register assignment (block 406). The control circuit sends, to a memory subsystem, a data retrieval request corresponding to the vector memory access instruction (block 408).
SIMD circuits of compute circuits execute other instructions of the in-flight wavefronts using the vector register file (block 410). If the requested data has not yet arrived from the memory subsystem (“no” branch of the conditional block 412), then control flow of method 400 returns to block 410 where the SIMD circuits execute other instructions of the in-flight wavefronts using the vector register file. Otherwise, if the requested data has arrived from the memory subsystem (“yes” branch of the conditional block 412), then the control circuit stores the requested data in the assigned cache line (block 414). The control circuit sends a notification to a source, such as one of the SIMD circuits that generated the vector memory access instruction, that the requested data has returned (block 416). At a later point in time, the control circuit receives an indication specifying that the source is ready to receive the requested data (block 418). The control circuit completes the vector memory access instruction by sending the requested data to the SIMD circuit for use during execution of a wavefront (block 420).
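Read as control flow, blocks 410 through 420 amount to a polling loop followed by a two-step handoff. The sketch below mirrors that flow in C++ for illustration; the cycle count and the assumption that the source is immediately ready to receive the data are placeholders, not characteristics of the method.

```cpp
#include <iostream>

// Illustrative stand-ins for the hardware interfaces named in method 400.
struct MemorySubsystem {
    int cycles_left = 5;                                   // assumed outstanding latency
    bool data_ready() { return --cycles_left <= 0; }
};
struct SimdCircuit {
    void execute_other_instructions() { std::cout << "execute other in-flight work\n"; }
    bool ready_to_receive() { return true; }               // assumed: source accepts data immediately
};

int main() {
    MemorySubsystem mem;   // data retrieval request already sent (block 408)
    SimdCircuit simd;

    // Blocks 410-412: keep the SIMD circuit busy until the requested data arrives.
    while (!mem.data_ready())
        simd.execute_other_instructions();

    std::cout << "store data in assigned cache line\n";        // block 414
    std::cout << "notify source that data returned\n";         // block 416
    if (simd.ready_to_receive())                                // block 418
        std::cout << "send data from cache to SIMD circuit\n"; // block 420
}
```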
Turning now to
If there are available cache lines of a first cache for data storage (“yes” branch of the conditional block 510), then the control circuit stores the requested data in the first cache (block 512). The control circuit completes the vector memory access instruction using the corresponding one of the allocated cache lines of the first cache (block 514). If there are no available cache lines of the first cache for data storage (“no” branch of the conditional block 510), then the control circuit stores the requested data in an available cache line of a second cache that is at a higher level than the first cache (block 516). At a later point in time, if there are available cache lines of the first cache for data storage (“yes” branch of the conditional block 518), then the control circuit moves the requested data from the second cache and stores it in the first cache (block 520). If there are no available cache lines of the first cache for data storage (“no” branch of the conditional block 518), then the control circuit maintains the requested data in the second cache (block 522). Afterward, control flow of method 500 returns to conditional block 518.
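As a toy illustration of this placement-and-migration behavior, the sketch below prefers a small first cache and spills to a larger second cache only when the first is full, migrating data back when space reappears; the capacities and string payloads are arbitrary stand-ins for cache lines of requested data.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy two-level placement mirroring method 500: the first cache is preferred,
// the second (higher-level) cache absorbs the data only while the first is full.
struct Cache {
    std::size_t capacity;
    std::vector<std::string> lines;
    bool full() const { return lines.size() >= capacity; }
};

// Blocks 510/512/516: place returned data in the first cache if possible.
void place(Cache& first, Cache& second, const std::string& data) {
    if (!first.full()) first.lines.push_back(data);   // block 512
    else               second.lines.push_back(data);  // block 516
}

// Blocks 518-522: when space opens up in the first cache, migrate data back.
void migrate(Cache& first, Cache& second) {
    while (!second.lines.empty() && !first.full()) {
        first.lines.push_back(second.lines.back());   // block 520
        second.lines.pop_back();
    }                                                 // otherwise the data stays put (block 522)
}

int main() {
    Cache first{1, {}}, second{8, {}};
    place(first, second, "A");      // lands in the first cache
    place(first, second, "B");      // first cache full -> second cache
    first.lines.clear();            // space becomes available again
    migrate(first, second);         // "B" moves back to the first cache
    return (first.lines.size() == 1 && second.lines.empty()) ? 0 : 1;
}
```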
Turning now to
If the remainder of the data for the collective operation is not stored in the cache (“no” branch of the conditional block 610), then the control circuit moves the requested data from the first partition to a second partition of the cache, if not already moved (block 612). The SIMD circuits execute other instructions of the in-flight wavefronts while waiting for the remainder of the data (block 614). Afterward, control flow of method 600 returns to conditional block 610. If the remainder of the data for the collective operation is stored in the cache (“yes” branch of the conditional block 610), then the control circuit moves all requested data from the first partition to the second partition of the cache, if not already moved (block 616). The control circuit notifies a source that generated the collective operation that the requested data has returned (block 618).
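The gating condition of block 610 can be pictured as tracking which pieces of the collective operation have reached the second cache partition, as in the toy model below; the piece identifiers and the set-based bookkeeping are illustrative assumptions rather than a disclosed mechanism.

```cpp
#include <initializer_list>
#include <iostream>
#include <set>

// Toy model of method 600: pieces of a collective operation arrive in a first
// cache partition, are moved to a second partition, and the source is notified
// only when every expected piece has arrived.
struct CollectiveTracker {
    std::set<int> expected;          // piece IDs the collective operation needs
    std::set<int> second_partition;  // pieces already moved to the second partition

    // Called when a piece lands in the first partition and is moved onward (blocks 612/616).
    void arrive(int piece) { second_partition.insert(piece); }

    bool complete() const { return second_partition == expected; }   // conditional block 610
};

int main() {
    CollectiveTracker op{{0, 1, 2, 3}, {}};
    for (int piece : {0, 2, 1}) op.arrive(piece);
    std::cout << (op.complete() ? "notify source\n"
                                : "execute other in-flight work and wait\n");   // block 614
    op.arrive(3);
    if (op.complete()) std::cout << "notify source that requested data returned\n"; // block 618
}
```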
It is noted that one or more of the above-described implementations include software. In such implementations, the program instructions that implement the methods and/or mechanisms are conveyed or stored on a computer readable medium. Numerous types of media which are configured to store program instructions are available and include hard disks, floppy disks, CD-ROM, DVD, flash memory, Programmable ROMs (PROM), random access memory (RAM), and various other forms of volatile or non-volatile storage. Generally speaking, a computer accessible storage medium includes any storage media accessible by a computer during use to provide instructions and/or data to the computer. For example, a computer accessible storage medium includes storage media such as magnetic or optical media, e.g., disk (fixed or removable), tape, CD-ROM, or DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, or Blu-Ray. Storage media further includes volatile or non-volatile memory media such as RAM (e.g., synchronous dynamic RAM (SDRAM), double data rate (DDR, DDR2, DDR3, etc.) SDRAM, low-power DDR (LPDDR2, etc.) SDRAM, Rambus DRAM (RDRAM), static RAM (SRAM), etc.), ROM, Flash memory, non-volatile memory (e.g., Flash memory) accessible via a peripheral interface such as the Universal Serial Bus (USB) interface, etc. Storage media includes microelectromechanical systems (MEMS), as well as storage media accessible via a communication medium such as a network and/or a wireless link.
Additionally, in various implementations, program instructions include behavioral-level descriptions or register-transfer level (RTL) descriptions of the hardware functionality in a high-level programming language such as C, or a design language (HDL) such as Verilog, VHDL, or database format such as GDS II stream format (GDSII). In some cases, the description is read by a synthesis tool, which synthesizes the description to produce a netlist including a list of gates from a synthesis library. The netlist includes a set of gates, which also represent the functionality of the hardware including the system. The netlist is then placed and routed to produce a data set describing geometric shapes to be applied to masks. The masks are then used in various semiconductor fabrication steps to produce a semiconductor circuit or circuits corresponding to the system. Alternatively, the instructions on the computer accessible storage medium are the netlist (with or without the synthesis library) or the data set, as desired. Additionally, the instructions are utilized for purposes of emulation by a hardware-based type emulator from such vendors as Cadence®, EVE®, and Mentor Graphics®.
Although the implementations above have been described in considerable detail, numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.