Modern parallel processors (e.g., graphics processing units (GPUs)) include structures for executing multiple threads in parallel. A thread can also be referred to herein as a “work-item”. A group of work-items is also referred to herein as a “warp” or “wavefront”. Wavefronts often stall at wait count instructions while waiting for outstanding memory requests to complete. Thus, the longest-latency memory requests associated with a given wait count will be on the wavefront's critical path. This is often referred to as a memory divergence problem, with “memory divergence” referring to the difference in arrival times of the requests pending at a wait count instruction. For example, the memory requests for some threads within a wavefront hit in the cache while other threads from the same wavefront miss in the cache. When a wavefront executes a waitcnt() instruction, that wavefront is blocked until the number of memory instructions specified by that waitcnt() instruction has retired. Application performance can be significantly affected by memory divergence.
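To make the divergence effect concrete, the following sketch (with purely hypothetical latency numbers) shows that a wavefront blocked at a waitcnt() instruction cannot proceed until its slowest lane's request returns, so a single long-latency lane sets the stall time for the entire wavefront.

```python
# Hypothetical per-lane memory latencies (cycles) for one wavefront's load.
# Lanes that hit in the cache return quickly; lanes that miss go to DRAM.
lane_latencies = [20, 20, 20, 300, 20, 20, 250, 20]

# A waitcnt() that waits for all of these requests blocks the wavefront
# until the slowest one completes, so the stall is set by the maximum.
stall_cycles = max(lane_latencies)
wasted_cycles = sum(stall_cycles - lat for lat in lane_latencies)

print(f"wavefront stalls for {stall_cycles} cycles")
print(f"cycles spent waiting on the slowest lanes: {wasted_cycles}")
```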
The advantages of the methods and mechanisms described herein may be better understood by referring to the following description in conjunction with the accompanying drawings.
In the following description, numerous specific details are set forth to provide a thorough understanding of the methods and mechanisms presented herein. However, one having ordinary skill in the art should recognize that the various implementations may be practiced without these specific details. In some instances, well-known structures, components, signals, computer program instructions, and techniques have not been shown in detail to avoid obscuring the approaches described herein. It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements.
Various systems, apparatuses, and methods for implementing memory request priority assignment techniques for parallel processors are disclosed herein. In one implementation, a system includes at least a processor coupled to a memory subsystem, where the processor includes at least a plurality of compute units for executing wavefronts in lock-step. The processor assigns priorities to memory requests of wavefronts on a per-lane basis by indexing into a first priority vector, with the index generated based on lane-specific information. If a given event is detected, a second priority vector is generated by applying a given priority promotion vector to the first priority vector. Then, for subsequent wavefronts, memory requests are assigned priorities by indexing into the second priority vector with lane-specific information. The memory subsystem reorders and/or allocates shared resources to requests generated by the wavefronts according to priorities assigned to the requests. The use of priority vectors to assign priorities to memory requests helps to reduce the memory divergence problem experienced by different work-items of a wavefront.
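As an illustration of the per-lane assignment described above, the following sketch assumes a 64-lane wavefront, an eight-entry priority vector, and an indexing function that simply uses the low-order bits of the lane ID; the vector contents and indexing function shown here are placeholder assumptions rather than tuned values.

```python
# Minimal sketch: assign a priority to each lane's memory request by
# indexing into a priority vector with lane-specific information.
# The vector length, its contents, and the indexing function are
# illustrative assumptions, not values from this description.

PRIORITY_VECTOR = [3, 1, 4, 2, 5, 1, 2, 6]  # one entry per index value

def priority_index(lane_id: int) -> int:
    # Example indexing function: use the low-order bits of the lane ID.
    # Another implementation could combine other lane-specific state.
    return lane_id % len(PRIORITY_VECTOR)

def assign_priority(lane_id: int, vector=PRIORITY_VECTOR) -> int:
    # Retrieve the priority stored at the entry selected by the index.
    return vector[priority_index(lane_id)]

# Assign priorities for all 64 lanes of a wavefront.
priorities = {lane: assign_priority(lane) for lane in range(64)}
```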
Referring now to FIG. 1, a block diagram of one implementation of a computing system 100 is shown. In one implementation, computing system 100 includes at least processors 105A-N, input/output (I/O) interfaces 120, memory controller(s) 130, network interface 135, and memory device(s) 140.
In one implementation, processor 105A is a general purpose processor, such as a central processing unit (CPU). In this implementation, processor 105N is a data parallel processor with a highly parallel architecture. Data parallel processors include graphics processing units (GPUs), digital signal processors (DSPs), field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), and so forth. In some implementations, processors 105A-N include multiple data parallel processors.
Memory controller(s) 130 are representative of any number and type of memory controllers accessible by processors 105A-N and I/O devices (not shown) coupled to I/O interfaces 120. Memory controller(s) 130 are coupled to any number and type of memory device(s) 140. Memory device(s) 140 are representative of any number and type of memory devices. For example, the type of memory in memory device(s) 140 includes Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), NAND Flash memory, NOR flash memory, Ferroelectric Random Access Memory (FeRAM), or others. It is noted that memory controller(s) 130 and memory device(s) 140 can collectively be referred to herein as a “memory subsystem”.
I/O interfaces 120 are representative of any number and type of I/O interfaces (e.g., peripheral component interconnect (PCI) bus, PCI-Extended (PCI-X), PCIE (PCI Express) bus, gigabit Ethernet (GBE) bus, universal serial bus (USB)). Various types of peripheral devices (not shown) are coupled to I/O interfaces 120. Such peripheral devices include (but are not limited to) displays, keyboards, mice, printers, scanners, joysticks or other types of game controllers, media recording devices, external storage devices, network interface cards, and so forth. Network interface 135 is used to receive and send network messages across a network.
In various implementations, computing system 100 is a computer, laptop, mobile device, game console, server, streaming device, wearable device, or any of various other types of computing systems or devices. It is noted that the number of components of computing system 100 varies from implementation to implementation. For example, in other implementations, there are more or fewer of each component than the number shown in FIG. 1.
Turning now to FIG. 2, a block diagram of another implementation of a computing system 200 is shown. In one implementation, system 200 includes GPU 205, memory controller 220, system memory 225, and local memory 230.
In various implementations, computing system 200 executes any of various types of software applications. As part of executing a given software application, a host CPU (not shown) of computing system 200 launches kernels to be performed on GPU 205. Command processor(s) 235 receive kernels from the host CPU and use dispatch unit 250 to dispatch wavefronts of these kernels to compute units 255A-N. Control logic 240 includes logic for determining the priority that should be assigned to memory requests of threads of the dispatched wavefronts. Control logic 240 also includes logic for updating the priority determination mechanism in response to detecting one or more events. Threads within kernels executing on compute units 255A-N read and write data to corresponding local L0 caches 257A-N, global data share 270, shared L1 cache 265, and L2 cache(s) 260 within GPU 205. It is noted that each local L0 cache 257A-N and/or shared L1 cache 265 can include separate structures for data and instruction caches. While the implementation shown in system 200 has a 3-level cache hierarchy, it should be understood that this is merely one example of a multi-level cache hierarchy that can be used. In other implementations, other types of cache hierarchies with other numbers of cache levels can be employed.
It is noted that L0 caches 257A-N, global data share 270, shared L1 cache 265, L2 cache(s) 260, memory controller 220, system memory 225, and local memory 230 can collectively be referred to herein as a “memory subsystem”. This memory subsystem is able to receive memory requests and associated priorities. The memory requests are processed by the memory subsystem according to the priorities, with higher priority memory access requests serviced prior to lower priority memory access requests. Prioritizing memory access requests can lead to higher priority requests having a decreased latency and to an overall improvement in the performance of system 200. In some implementations, one or more components within the memory subsystem include multiple memory request queues with associated priority levels. These components of the memory subsystem place each memory request in the appropriate input queue based on the request's assigned priority level. Other mechanisms for differentiating memory requests based on their assigned priorities are possible and are contemplated.
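One possible software model of such priority-aware queuing is sketched below; the number of queue levels and the strict highest-priority-first arbitration are assumptions made for illustration only.

```python
from collections import deque

NUM_PRIORITY_LEVELS = 4  # assumed number of queues per component

class PriorityAwareQueue:
    """Sketch of a memory-subsystem component with one input queue per
    priority level; higher-priority requests are serviced first."""

    def __init__(self, levels: int = NUM_PRIORITY_LEVELS):
        self.queues = [deque() for _ in range(levels)]

    def enqueue(self, request, priority: int):
        # Clamp the priority to a valid queue index and store the request.
        level = min(max(priority, 0), len(self.queues) - 1)
        self.queues[level].append(request)

    def next_request(self):
        # Service the non-empty queue with the highest priority level.
        for queue in reversed(self.queues):
            if queue:
                return queue.popleft()
        return None
```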
Referring now to FIG. 3, examples of per-lane memory latency behavior for two benchmarks are shown.
For the XSBench benchmark 300, every lane tends to have a different latency value, with lanes 0-31 typically incurring more latency. For the double-precision general matrix multiplication (DGEMM) benchmark 305, work-items executing on hardware lanes 0-5, 30-37, and 62-63 typically observe longer latencies compared to other lanes. Such differences across applications occur for many reasons, including imbalanced load assignment, coalescing behavior (e.g., XSBench has poor coalescing behavior, leading to each lane incurring a different latency value), cache hits/misses, memory address mapping, and so on. The methods and mechanisms presented herein aim to prioritize memory requests that are on the critical path by exploiting the unique divergent behavior of a particular workload.
Turning now to FIG. 4, one implementation of a priority vector 400 for assigning priorities to memory requests is shown.
As shown in FIG. 4, a priority for each memory request of a wavefront is determined by indexing into priority vector 400 with lane-specific information.
In one implementation, the process is started by assigning random values to each entry of priority vector 400. The random values can range from a LOW_THRESHOLD to a HIGH_THRESHOLD, where HIGH_THRESHOLD minus LOW_THRESHOLD gives the number of prioritization levels that the algorithm can assign. The number of priority levels is configurable by changing the LOW_THRESHOLD and HIGH_THRESHOLD values. The different prioritization vector samples are then generated by changing the priority values at random lane locations or by adjusting the indexing function to use different bits of information (e.g., lane ID) or the same bits in different ways. In other words, adjusting the indexing function changes how each request maps into priority vector 400. Then, a test suite is run and the algorithm converges on preferred values for priority vector 400 and the indexing function that are predicted to yield a desired performance level based on a fitness function. In one implementation, the test suite includes a predetermined set of applications that represents the diverse class of applications being targeted for optimization.
The fitness function in this case is the function that maps the priority vector values to parallel processor performance, and the optimization is to choose the priority vector values that result in the best performance. After being fed the prioritization vector samples, the algorithm comes up with a priority vector and indexing function optimized for performance. The same algorithm can be used to optimize kernels for power or for reducing memory bandwidth consumption by changing the fitness function. For example, when optimizing for power, the fitness function maps the priority vector values to power consumption, and the algorithm comes up with the priority vector that best reduces the average power consumption across the entire test suite.
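The search described in the preceding two paragraphs can be summarized with the following sketch of a simple genetic algorithm. The population size, mutation rate, and the simulate_test_suite() stand-in are assumptions; the key point is that the fitness function is pluggable, so the same loop can target performance, power, or memory bandwidth. A fuller version would also mutate the indexing function, not just the vector entries.

```python
import random

LOW_THRESHOLD, HIGH_THRESHOLD = 1, 8   # assumed range of priority values
VECTOR_LEN = 8                         # assumed number of priority vector entries

def random_vector():
    return [random.randint(LOW_THRESHOLD, HIGH_THRESHOLD) for _ in range(VECTOR_LEN)]

def mutate(vector, rate=0.2):
    # Change the priority value at random lane locations; a fuller version
    # would also mutate the indexing function (which bits of lane ID to use).
    return [random.randint(LOW_THRESHOLD, HIGH_THRESHOLD) if random.random() < rate else v
            for v in vector]

def simulate_test_suite(vector):
    # Stand-in for running the predetermined test suite with this vector and
    # measuring the chosen metric (performance, negative power, etc.).
    # Here it is a toy score so the sketch runs on its own.
    return len(set(vector))

def fitness(vector):
    # Pluggable objective: swap simulate_test_suite() for a power or
    # bandwidth model to optimize for those targets instead.
    return simulate_test_suite(vector)

def evolve(generations=50, population_size=16):
    population = [random_vector() for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        population = survivors + [mutate(random.choice(survivors)) for _ in survivors]
    return max(population, key=fitness)

best_vector = evolve()
```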
Once the preferred priority vector and indexing function are generated by the algorithm, this vector is used to assign priorities to memory requests for actual applications. It is noted that the preferred priority vector and indexing function can be generated using any suitable combination of software and/or hardware. In one implementation, the preferred priority vector and indexing function are generated using one or more test kernels, and then the preferred priority vector and indexing function are used with actual kernels in real-world use cases. In another implementation, multiple priority vectors, each optimized for a different constraint (e.g., power, performance, energy), are generated so that the user (or a system management unit) can dynamically select one for a desired optimization. Alternatively, multiple priority vectors can be used for a single program to address phase behaviors. In one implementation, each application (or a class of applications with similar characteristics) has its own priority vector that is predicted to yield the best performance. Before starting an application, the hardware uses the application's priority vector for assigning priorities to its memory requests. In one implementation, a specific priority vector is generated by tuning the genetic, or other, algorithm with a test suite that shares the same characteristics as the targeted application. In another implementation, a specific priority vector is generated by tuning the algorithm with a real-world application. In this implementation, the specific priority vector is continually updated as different applications are executed in real time.
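The per-application selection just described might be modeled as a simple table lookup, as in the following sketch; the application classes, optimization targets, and vector values are illustrative assumptions.

```python
# Illustrative table of pre-tuned priority vectors, keyed by application
# class and optimization target.  The keys and values are assumptions.
PRIORITY_VECTOR_TABLE = {
    ("xsbench_like", "performance"): [6, 6, 5, 4, 3, 2, 2, 1],
    ("dgemm_like", "performance"):   [5, 3, 1, 1, 2, 4, 6, 6],
    ("dgemm_like", "power"):         [2, 2, 1, 1, 1, 2, 3, 3],
}
DEFAULT_VECTOR = [1] * 8

def select_priority_vector(app_class: str, target: str = "performance"):
    # Before launching an application, the hardware (or a system
    # management unit) loads the vector tuned for that class and target.
    return PRIORITY_VECTOR_TABLE.get((app_class, target), DEFAULT_VECTOR)

vector = select_priority_vector("dgemm_like", "power")
```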
Referring now to FIG. 5, one example of updating priorities in response to an event is shown, using priority transition diagrams 502 and 505, priority vector 510, and priority promotion vector 515.
Without loss of generality, it is assumed for the purposes of this discussion that priority transition diagram 502 shows the priority changes for a “last memory response from a wavefront instruction” event. Accordingly, if the last response was for a request which mapped to lane 2 (priority 2), then that lane with priority 2 will transition to priority 4. The other lanes will be shifted down in priority, with lane 3 being shifted down from priority 4 to priority 3, lane 1 being shifted down from priority 3 to priority 2, and lane 4 staying at priority 1 since lane 4 was already at the lowest priority level. After this transition, priority transition diagram 505 shows the updated priority assignment based on the last memory response being for a request which mapped to lane 2. After the event has happened and the priority transition is completed according to priority transition diagram 505, the priority vector is updated to reflect the new priority assignment, as shown in priority vector 510. The next set of memory requests from a wavefront will use this updated priority assignment. Each time a new event is detected, the priority transition diagram and priority vector will be updated based on the lane in which the new event was detected. While the example of the new event being the last response for a request was used for this particular discussion, it should be understood that other events can be used to trigger a priority vector update in other implementations.
It is noted that a priority transition diagram can be represented with a priority promotion vector. For example, the priority promotion vector for priority transition diagram 502 is shown as priority promotion vector 515. Priority promotion vector 515 shows the priorities to which the current priorities should be promoted in response to an event. Such a priority promotion vector can be generated by the same priority vector generation mechanism that was previously described in the discussion associated with FIG. 4.
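The transition described above can be written out directly. The helper below is a sketch in which the promotion is expressed as a function over the current per-lane priorities rather than as a stored vector; the worked example reproduces the lane priorities from the discussion of diagrams 502 and 505.

```python
MAX_PRIORITY = 4  # number of priority levels in this example

def apply_promotion(priority_vector: dict, event_lane: int) -> dict:
    """Sketch of the transition described for diagram 502: the lane on
    which the event occurred is promoted to the highest priority, lanes
    that were above its old priority shift down one level, and lanes
    below it keep their current priority."""
    old = priority_vector[event_lane]
    updated = {}
    for lane, prio in priority_vector.items():
        if lane == event_lane:
            updated[lane] = MAX_PRIORITY
        elif prio > old:
            updated[lane] = prio - 1
        else:
            updated[lane] = prio
    return updated

# Worked example from the discussion above: lanes 1-4 start at
# priorities 3, 2, 4, 1 and the event occurs on lane 2.
first_vector = {1: 3, 2: 2, 3: 4, 4: 1}
second_vector = apply_promotion(first_vector, event_lane=2)
assert second_vector == {1: 2, 2: 4, 3: 3, 4: 1}
```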
Turning now to FIG. 6, one implementation of a trained predictor for assigning priorities to memory requests is shown.
In one implementation, a training phase is used to determine the optimal predictor state or indexing function to be preloaded for a given kernel or class of kernels. Similar to the fitness function of a genetic algorithm, this optimization function can target power, performance, or any other optimization objective. The trained model is deployed when executing actual end-user applications, and the model predicts the priority of each memory request based on the dynamic state of the system. The model receives the dynamic state and outputs the prioritization number for each memory request.
Referring now to FIG. 7, one implementation of a method 700 for dynamically updating memory request priorities is shown.
A processor assigns a priority to a memory request of each lane of a first wavefront based on a first priority vector (block 705). For example, in one implementation, the processor indexes into the first priority vector using information (e.g., lane ID) associated with the lane. The processor then retrieves a priority from the first priority vector at an entry determined by the index. The processor monitors whether any events have been detected during execution of the first wavefront (block 710). The types of events which the processor is trying to detect can vary according to the implementation. If none of the various events have been detected (conditional block 715, “no” leg), then the processor waits until execution has terminated for all lanes of the first wavefront (block 720). Next, for subsequent work-items, the processor assigns a priority to a memory request of each lane based on the first priority vector (block 725). It is noted that the memory requests referred to in block 725 can be generated by work-items of the first wavefront or of a second wavefront. After block 725, method 700 ends.
If a given event has been detected (conditional block 715, “yes” leg), then the processor generates a priority promotion vector for the given event (block 730). In one implementation, each different type of event has a separate priority promotion vector. It is noted that each priority promotion vector can be generated from a different corresponding priority transition diagram. Next, the processor generates a second priority vector by applying the priority promotion vector to the first priority vector (block 735). Then, the processor waits until execution has terminated for all lanes of the first wavefront (block 740). Next, for subsequent work-items, the processor assigns a priority to a memory request of each lane based on the second priority vector (block 745). It is noted that the subsequent work-items can come from the first wavefront or from a second wavefront. After block 745, method 700 ends. By implementing method 700 for dynamically updating priorities, priorities are assigned to memory requests in a more intelligent fashion, which helps to reduce the amount of time spent waiting at a waitcnt() instruction.
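Tying the blocks of method 700 together, a software model of the flow might look like the following sketch; the detect_event() and promote() hooks are placeholders standing in for hardware event detection and for the promotion-vector application illustrated earlier.

```python
def lane_priority(priority_vector, lane_id):
    # Index into the priority vector with lane-specific information.
    return priority_vector[lane_id % len(priority_vector)]

def run_wavefront(wavefront_lanes, priority_vector, detect_event, promote):
    """Software model of method 700: assign per-lane request priorities from
    the current priority vector, and if an event is detected while the
    wavefront executes, produce the updated vector used for subsequent
    work-items (of this wavefront or a later one)."""
    # Block 705: assign a priority to each lane's memory request.
    request_priorities = {lane: lane_priority(priority_vector, lane)
                          for lane in wavefront_lanes}

    # Blocks 710-715: monitor for events during execution (placeholder hook).
    event = detect_event()

    # Blocks 730-735: on an event, apply its promotion to obtain the second
    # priority vector; otherwise keep the first one (blocks 720-725).
    next_vector = promote(priority_vector, event) if event else priority_vector
    return request_priorities, next_vector
```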
Turning now to FIG. 8, one implementation of a method 800 for assigning priorities to memory requests with a trained priority predictor is shown. A priority predictor is trained, with its optimization function tuned for a desired target such as performance or power (block 805).
Next, the trained priority predictor with the tuned optimization function is deployed in a production environment (block 810). During execution of a given wavefront, static information (e.g., program counter, wavefront ID, lane ID) and dynamic system state information (e.g., average request latency, cache hit rate, bandwidth utilization, stall counter values) are provided to the trained priority predictor (block 815). Next, the trained priority predictor performs a lookup for each memory request based on the static information and dynamic system state information to retrieve a priority (block 820). Then, for each memory request, the trained priority predictor assigns the retrieved priority to the memory request (block 825). After block 825, method 800 ends.
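One way to model the lookup performed in blocks 815-825 is sketched below; the table size, the feature hashing, and the example feature values are assumptions chosen to match the static and dynamic inputs listed above.

```python
TABLE_SIZE = 256          # assumed number of predictor entries
NUM_PRIORITY_LEVELS = 8   # assumed number of priority levels

# Predictor table preloaded by the training phase; initialized here to a
# neutral mid-level priority purely so the sketch runs on its own.
predictor_table = [NUM_PRIORITY_LEVELS // 2] * TABLE_SIZE

def predictor_index(pc, wavefront_id, lane_id,
                    avg_latency, cache_hit_rate, bw_util, stall_count):
    # Fold the static and dynamic inputs into a table index.  A hardware
    # implementation would likely use a cheaper bit-selection or XOR hash.
    key = (pc, wavefront_id, lane_id,
           int(avg_latency) // 64, int(cache_hit_rate * 10),
           int(bw_util * 10), stall_count // 16)
    return hash(key) % TABLE_SIZE

def predict_priority(*features) -> int:
    # Blocks 820-825: look up and assign the priority for this request.
    return predictor_table[predictor_index(*features)]

# Example lookup for one memory request (all feature values hypothetical).
prio = predict_priority(0x4010, 7, 33, 180.0, 0.42, 0.73, 95)
```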
Referring now to FIG. 9, one implementation of a method 900 for adjusting a mapping of priorities to lanes in response to an event is shown. In this method, control logic of a processor assigns priorities to lanes of a first wavefront based on a first mapping of priorities to lanes and, in response to detecting a first event, generates a second mapping of priorities to lanes.
Subsequent to the first event, the control logic assigns a priority to each lane based on the second mapping of priorities to lanes (block 925). Also, for subsequent work-items, the control logic assigns a priority to a memory request of each work-item based on a priority of a corresponding lane (block 930). It is noted that the work-items referred to in block 930 can come from the first wavefront or from a second wavefront different from the first wavefront. A memory subsystem processes each memory request according to a corresponding priority (block 935). After block 935, method 900 ends.
In various implementations, program instructions of a software application are used to implement the methods and/or mechanisms described herein. For example, program instructions executable by a general or special purpose processor are contemplated. In various implementations, such program instructions are represented by a high-level programming language. In other implementations, the program instructions are compiled from a high-level programming language to a binary, intermediate, or other form. Alternatively, program instructions are written that describe the behavior or design of hardware. Such program instructions are represented by a high-level programming language, such as C. Alternatively, a hardware design language (HDL) such as Verilog is used. In various implementations, the program instructions are stored on any of a variety of non-transitory computer readable storage mediums. The storage medium is accessible by a computing system during use to provide the program instructions to the computing system for program execution. Generally speaking, such a computing system includes at least one or more memories and one or more processors configured to execute program instructions.
It should be emphasized that the above-described implementations are only non-limiting examples of implementations. Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.