The invention relates to a technology in the field of graph data processing, and in particular to a method and system that, under a Unified-Memory-based architecture, adaptively configures the memory-read strategy for large-scale graph data whose size exceeds the memory capacity of the Graphics Processing Unit (GPU).
Unified Memory adds a unified memory space to the existing GPU memory management scheme, allowing a program to use a single pointer to directly access either the main memory of the Central Processing Unit (CPU) or the GPU memory. This technology enlarges the address space available to the GPU, so that the GPU can process large-scale graph data exceeding its memory capacity. However, using the technique directly often incurs a significant performance penalty.
Targeting the above-mentioned deficiencies of the prior art, the present invention proposes an adaptive unified-memory management method and system for large graphs. It adopts a different management strategy for each graph data structure according to the characteristics of the graph data and the size of the available GPU memory. It can significantly improve the performance of processing large graphs whose size exceeds the GPU memory capacity under the unified memory architecture: it improves GPU bandwidth utilization, reduces the number of page faults and the overhead of handling them, and shortens the running time of graph computing programs.
The present invention is achieved by the following technical solutions:
The present invention relates to an adaptive unified-memory management method tailored to large graphs. Given the different types of graph data structures used in graph computing applications, it assigns them a priority order. Following that order, it successively performs GPU memory checking, which determines whether the current GPU memory is full, and data overflow checking, which determines whether the size of the current data structure exceeds the available GPU memory capacity, and then performs the policy configuration for unified memory management.
The described different types of graph data structures include Vertex Offset, Vertex Property, Edge, and Frontier, where Vertex Offset, Vertex Property, and Edge are the three arrays of the Compressed Sparse Row (CSR) representation.
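For concreteness, the following is a minimal sketch of how these four structures might be laid out and allocated in Unified Memory; the struct name, field names, and element types are illustrative assumptions, not mandated by the invention.

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// Illustrative CSR layout for a graph with V vertices and E edges.
// All four arrays are allocated in Unified Memory (cudaMallocManaged),
// so they can be touched by either the CPU or the GPU.
struct GraphCSR {
    int64_t *vertexOffset;   // V+1 entries: start of each vertex's edge list
    int32_t *edge;           // E entries: destination vertex of each edge
    float   *vertexProperty; // V entries: per-vertex value (distance, rank, ...)
    uint8_t *frontier;       // V entries: active-vertex flags per iteration
};

inline void allocGraph(GraphCSR &g, size_t V, size_t E) {
    cudaMallocManaged(&g.vertexOffset,   (V + 1) * sizeof(int64_t));
    cudaMallocManaged(&g.edge,           E * sizeof(int32_t));
    cudaMallocManaged(&g.vertexProperty, V * sizeof(float));
    cudaMallocManaged(&g.frontier,       V * sizeof(uint8_t));
}
```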
The described priority order means that the graph data structures are ranked in descending order of how often they are accessed during execution of the graph algorithm; specifically, the order is: Vertex Property, Vertex Offset, Frontier, and Edge.
The described graph algorithms fall into traversal algorithms and computational algorithms, including but not limited to the single-source shortest path algorithm (SSSP), the breadth-first search algorithm (BFS), the web page ranking algorithm (PageRank, PR), and the connected components algorithm (CC).
The described GPU memory checking calls the cudaMemGetInfo API to determine the remaining GPU memory capacity. The described data overflow checking determines whether the size of the data structure exceeds the available GPU memory.
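The following is a minimal sketch of the two checks, assuming a single current device; the helper name and the way the result is returned are illustrative.

```cpp
#include <cuda_runtime.h>

// GPU memory checking plus data overflow checking for one data structure
// of `bytes` bytes. cudaMemGetInfo reports the free and total memory of
// the current device; the structure overflows when it is larger than the
// free capacity.
bool overflowsGPU(size_t bytes, size_t &availGPUMemSize) {
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    availGPUMemSize = freeBytes;
    return bytes > freeBytes;
}
```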
The described policy configuration for unified memory management sets the management strategy of the current graph data structure by calling APIs including but not limited to cudaMemPrefetchAsync and cudaMemAdvise, wherein cudaMemPrefetchAsync moves part of the data into GPU memory in advance, and cudaMemAdvise attaches to the specified data a memory-usage hint that guides the GPU driver to control data movement appropriately and thereby improve final performance. The optional usage hints include AccessedBy and ReadMostly. These instructions apply to NVIDIA's various series of GPUs. Specifically (a code sketch of these four rules follows the list):
(1) For Vertex Property data: when the GPU memory is full, set the hint of Vertex Property to AccessedBy; otherwise, when the GPU memory is not full and Vertex Property does not exceed the available GPU memory capacity, set the prefetching size of Vertex Property to the size of Vertex Property; when Vertex Property exceeds the available GPU memory capacity, set the hint of Vertex Property to AccessedBy and set its prefetching amount to the prefetch rate multiplied by the available GPU memory capacity, in bytes.
(2) For Vertex Offset data: when the GPU memory is full, set the hint of Vertex Offset to AccessedBy; otherwise, when the GPU memory is not full and Vertex Offset does not exceed the available GPU memory capacity, set the prefetching size of Vertex Offset to the size of Vertex Offset; when Vertex Offset exceeds the available GPU memory capacity, set the hint of Vertex Offset to AccessedBy and set its prefetching amount to the prefetch rate multiplied by the available GPU memory capacity, in bytes.
(3) For Frontier data: when the GPU memory is full, set the hint of Frontier to AccessedBy; otherwise, when the GPU memory is not full and Frontier does not exceed the available GPU memory capacity, set the prefetching size of Frontier to the size of Frontier; when Frontier exceeds the available GPU memory capacity, set the hint of Frontier to AccessedBy and set its prefetching amount to the prefetch rate multiplied by the available GPU memory capacity, in bytes.
(4) For Edge data: when the GPU memory is full, set the hint of Edge to AccessedBy; otherwise, when the GPU memory is not full and Edge does not exceed the available GPU memory capacity, set the prefetching size of Edge to the size of Edge; when Edge exceeds the available GPU memory capacity, set the hint of Edge to AccessedBy and set its prefetching amount to the prefetch rate multiplied by the available GPU memory capacity, in bytes.
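Since rules (1) through (4) treat every data structure the same way, a single helper suffices. The following is a minimal sketch assuming the current device and the default stream; the function and parameter names are illustrative.

```cpp
#include <cuda_runtime.h>

// Apply rules (1)-(4) to one data structure. `gpuIsFull` is the result of
// GPU memory checking, `availBytes` is the available GPU memory capacity,
// and `tau` is the prefetch rate.
void configurePolicy(const void *ptr, size_t bytes, bool gpuIsFull,
                     size_t availBytes, double tau, cudaStream_t stream = 0) {
    int device = 0;
    cudaGetDevice(&device);
    if (gpuIsFull) {
        // GPU memory full: only attach the AccessedBy hint.
        cudaMemAdvise(ptr, bytes, cudaMemAdviseSetAccessedBy, device);
    } else if (bytes <= availBytes) {
        // Structure fits: prefetch it entirely into GPU memory.
        cudaMemPrefetchAsync(ptr, bytes, device, stream);
    } else {
        // Structure overflows: hint AccessedBy, prefetch tau * availBytes.
        cudaMemAdvise(ptr, bytes, cudaMemAdviseSetAccessedBy, device);
        cudaMemPrefetchAsync(ptr, (size_t)(tau * availBytes), device, stream);
    }
}
```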
As a whole, the present invention solves the technical problem that existing GPUs cannot efficiently process large-scale graphs that exceed the GPU memory capacity.
Compared with the prior art, the present invention uses unified memory technology to manage graph data and, following the specific priority order, adopts a targeted management strategy for each graph data structure according to the size of the graph data and the available GPU memory. Adapting the management strategy to the size of the graph data and the type of graph algorithm significantly improves the running efficiency of the graph algorithm.
As shown in the flowchart, the method first initializes a set of operating parameters.
The operating parameters include: whether the GPU memory is full (GPUIsFull), the currently available GPU memory capacity (AvailGPUMemSize), and the prefetching rate τ.
The initialization refers to: setting GPUIsFull to false; obtaining AvailGPUMemSize through cudaMemGetInfo.
The prefetching rate τ is set to 0.5 for traversal graph algorithms (such as BFS) and to 0.8 for computational graph algorithms (such as CC).
The described CUDA APIs include, but are not limited to, those that provide the same functionality as the explicit memory-copy and memory-pinning APIs without reverting to the constraints of explicit GPU memory allocation: explicit prefetching (cudaMemPrefetchAsync) and memory hints (cudaMemAdvise).
As shown in the flowchart, the method specifically comprises the following steps:
Step 1 (B0 in the flowchart): initialize the operating parameters, i.e., set GPUIsFull to false, obtain AvailGPUMemSize through cudaMemGetInfo, and set the prefetching rate τ according to the type of graph algorithm.
Step 2 (B1, C0 in the flowchart): traverse the graph data structures in the priority order Vertex Property, Vertex Offset, Frontier, Edge, and process each structure as follows.
Step 2.1 (C1 in the flowchart): perform GPU memory checking, i.e., determine whether GPUIsFull is true.
Step 2.1.1 (C2 in the flowchart): when the GPU memory is not full, perform data overflow checking, i.e., determine whether the size of the current data structure exceeds AvailGPUMemSize.
Step 2.1.1.1 (B3–B4 in the flowchart): when the current data structure does not overflow, set its prefetching size to the size of the structure, so that it is prefetched entirely into GPU memory.
Step 2.1.1.2 (B5–B7 in the flowchart): when the current data structure overflows, set its hint to AccessedBy and set its prefetching amount to τ multiplied by AvailGPUMemSize, in bytes.
Step 2.1.2 (B8 in the flowchart): when the GPU memory is full, set the hint of the current data structure to AccessedBy.
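Putting the steps together, the following sketch drives the helpers shown earlier over the four structures in priority order. The GraphCSR layout, the helper names, and the way GPUIsFull and AvailGPUMemSize are refreshed between structures are illustrative assumptions; the specification does not spell out this bookkeeping.

```cpp
#include <cuda_runtime.h>
#include <cstdint>

enum class AlgoKind { Traversal, Computational };

// Steps 1-2: initialize the operating parameters, then visit each data
// structure in priority order and configure its unified-memory policy.
void adaptiveConfigure(const GraphCSR &g, size_t V, size_t E, AlgoKind kind) {
    // Step 1 (B0): GPUIsFull = false, query AvailGPUMemSize, pick tau.
    bool gpuIsFull = false;
    size_t availGPUMemSize = 0, totalBytes = 0;
    cudaMemGetInfo(&availGPUMemSize, &totalBytes);
    double tau = (kind == AlgoKind::Traversal) ? 0.5 : 0.8;

    // Step 2 (B1, C0): priority order = Vertex Property, Vertex Offset,
    // Frontier, Edge (descending access frequency).
    struct { const void *ptr; size_t bytes; } order[4] = {
        { g.vertexProperty, V * sizeof(float) },
        { g.vertexOffset,   (V + 1) * sizeof(int64_t) },
        { g.frontier,       V * sizeof(uint8_t) },
        { g.edge,           E * sizeof(int32_t) },
    };

    for (const auto &d : order) {
        // Steps 2.1-2.1.2 (C1, C2, B3-B8).
        configurePolicy(d.ptr, d.bytes, gpuIsFull, availGPUMemSize, tau);
        // Refresh the memory state before the next structure. Synchronizing
        // first makes the asynchronous prefetch visible to cudaMemGetInfo;
        // the "full" criterion below is an assumption for illustration.
        cudaDeviceSynchronize();
        cudaMemGetInfo(&availGPUMemSize, &totalBytes);
        gpuIsFull = (availGPUMemSize == 0);
    }
}
```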
In detailed experiments on a server equipped with an Intel Xeon E5-2620 CPU, 128 GB of main memory, and an NVIDIA GTX 1080Ti GPU, the graph algorithms were executed on different datasets using this method, and the execution time of each algorithm was measured, that is, the total running time on the GPU from start to finish, excluding time for preprocessing and data transfer. Each algorithm was run 5 times and the average of the 5 execution times was taken.
The aforementioned datasets are a plurality of graph datasets of different sizes, specifically comprising social network graph datasets (LiveJournal, Orkut, Twitter, Friendster) and Internet snapshot graph datasets (UK-2005, SK-2005, UK-union), where the LiveJournal dataset contains 5 million vertices and 69 million edges with a volume of 1.4 GB, and UK-union contains 133 million vertices and 5.5 billion edges with a volume of 110 GB.
The aforementioned graph algorithms are the four algorithms SSSP, BFS, PR, and CC, where SSSP and BFS are traversal algorithms and PR and CC are computational algorithms. For BFS and SSSP, the first vertex of each graph dataset is taken as the source vertex; for PR, the damping factor is set to 0.85 and the error tolerance to 0.01. An algorithm stops running when it converges or when the number of iterations reaches 100.
Experimental results show that the method can speed up the total execution time of graph computations by a factor of 1.1 to 6.6. Among the algorithms, SSSP shows the highest performance improvement and PR the lowest, indicating that the method benefits memory-intensive programs most. τ=0.5 and τ=0.8 are the optimal prefetching rates for traversal and computational algorithms, respectively, achieving the highest average speedups of 1.43× and 1.48× compared with τ=0.
In summary, the present invention can significantly shorten the running time of the graph processing program for large-scale graphs.
The aforementioned specific implementation can be partially adjusted in different ways by those skilled in the art without deviating from the principle and purpose of the present invention. The protection scope of the present invention is defined by the claims and is not limited by the specific implementation above; every implementation scheme within that scope is bound by the present invention.