1. Technical Field
The present invention relates to sorting data, and more specifically, to scalable parallel sorting on manycore-based computing systems.
2. Description of the Related Art
Sorting data is a fundamental problem in the field of computer science, and as computing systems become more parallel, sorting methods that scale with hardware parallelism will become indispensable for a variety of applications. Sorting is generally performed using well-established methods (e.g., quicksort, merge sort, radix sort, etc.). Several efficient, parallel implementations of these methods exist, but these existing parallel methods require synchronization between parallel threads. Such synchronization is detrimental to performance scalability as the parallelism (i.e., the number of threads) increases.
In addition, these parallel algorithms do not carefully chunk data to match processor cache sizes and increase data locality (thereby avoiding slow external memory accesses), which can degrade performance. As such, there is a need for an efficient and scalable sorting system and method which overcomes the above-mentioned issues.
A method for sorting data includes chunking unsorted data using a processor, such that each chunk is of a size that fits within a last level cache of the system; instantiating one or more threads in each physical core of the system, and distributing chunks assigned to the physical cores evenly across the one or more threads on the physical cores; and sorting subchunks in the physical cores using vector intrinsics, the subchunks being data assigned to the one or more threads in the physical cores. The subchunks are merged to generate sorted large chunks, and a binary tree, which includes one or more leaf nodes that correspond to each of the sorted large chunks, is built. The one or more leaf nodes are assigned to the one or more threads, and each of one or more tree nodes is assigned a circular buffer, wherein the circular buffer is lock- and synchronization-free. The sorted large chunks are merged to generate sorted data as output.
A manycore-based system for sorting data includes a chunking module configured to chunk unsorted data, such that each chunk is of a size that fits within a last level cache of the system; an instantiation module configured to instantiate one or more threads in each physical core of the system, and to distribute chunks assigned to the physical cores evenly across the one or more threads on the physical cores; and a sorting module configured to sort subchunks in the physical cores using vector intrinsics, the subchunks being data assigned to the one or more threads in the physical cores. A merging module is configured to merge the subchunks to generate sorted large chunks, and to build a binary tree which includes one or more leaf nodes that correspond to each of the sorted large chunks; and an assignment module is configured to assign the one or more leaf nodes to the one or more threads, and to assign a circular buffer to each of one or more tree nodes, wherein the circular buffer is lock- and synchronization-free. A large chunk merging module is configured to merge the sorted large chunks to generate sorted data as output.
These and other features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
The disclosure will provide details in the following description of preferred embodiments with reference to the accompanying figures.
In accordance with the present principles, systems and methods for sorting data are provided. In one embodiment, systems and methods for scalable parallel sorting on manycore-based computing systems (e.g., multi-socket systems including several commodity multi-core processors, systems including manycore processors, etc.) are illustratively depicted in accordance with the present principles. The present principles may provide a parallel implementation of sorting methods (e.g., mergesort) tailored to manycore processing systems.
The system and method according to the present principles may include lock-free buffers and a method to ensure that threads generally remain busy without using locks. It is noted that although the system and method are generally lock-free, locks may be employed at certain times (e.g., between major stages). The present principles also may be applied to chunk data in a manner in which most data is cached, thereby minimizing off-chip memory accesses. Thus, the present principles may be employed to achieve significant improvements in operation speed for applications that use sorting, when compared to currently available sorting systems and methods.
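By way of a non-limiting illustration of such stage-boundary synchronization, worker threads may run lock-free within a major stage and meet only at a barrier between stages; the use of C++20's std::barrier below is an assumption of this sketch, not part of the disclosure:

```cpp
#include <barrier>
#include <thread>
#include <vector>

// Threads work lock-free within each major stage (e.g., sort, then merge)
// and synchronize exactly once, at the stage boundary.
void run_stages(unsigned nthreads) {
    std::barrier sync(nthreads);
    std::vector<std::jthread> workers;
    for (unsigned i = 0; i < nthreads; ++i)
        workers.emplace_back([&sync] {
            // ... stage 1: sort assigned chunks, no locks taken ...
            sync.arrive_and_wait();  // the only synchronization point
            // ... stage 2: merge, no locks taken ...
        });
}   // std::jthread joins all workers on scope exit
```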
Embodiments described herein may be entirely hardware, entirely software, or may include both hardware and software elements. In a preferred embodiment, the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Embodiments may include a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. A computer-usable or computer-readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be a magnetic, optical, electronic, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. The medium may include a computer-readable storage medium such as a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk, an optical disk, etc.
A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.
Referring now to the drawings, in which like numerals represent the same or similar elements, and initially to FIG. 1, a method for sorting data on a manycore-based system is illustratively depicted in accordance with one embodiment of the present principles. In one embodiment, unsorted input data (e.g., M bytes) may be received and chunked in block 104.
The chunks may be a plurality of sizes. For example, in one embodiment, for a chunk size C, the cache size may also be C. In another embodiment, if the cache size is C, then M/C sorted chunks may be generated in block 104, where M is the total size of the input data. In yet another embodiment, the chunk size C may be equal to the last level cache size multiplied by an integer (e.g., the number of physical processing cores p), or may be of a size set by an end user when chunking the input data in block 104. Each chunk may be sorted by all p processing cores in parallel using a vectorized sorting method according to the present principles (hereinafter "VectorChunkSort") in block 106, and the sorted chunks may be stored in memory.
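By way of a non-limiting illustration, the chunk-count arithmetic above may be sketched as follows (the function name and the 32 MiB last level cache size are assumptions made for this sketch):

```cpp
#include <cstddef>

// Chunk-count arithmetic from the embodiment above: with input size M and
// chunk size C equal to the last level cache size, M/C chunks result.
constexpr std::size_t num_chunks(std::size_t M, std::size_t C) {
    return (M + C - 1) / C;  // round up so a short final chunk is counted
}

// For example, M = 1 GiB of input and an assumed 32 MiB last level cache
// yield 32 cache-sized chunks.
static_assert(num_chunks(std::size_t{1} << 30, std::size_t{32} << 20) == 32);
```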
In one embodiment, the sorted chunks may be assigned and distributed evenly across the p physical cores of a manycore system in block 108. Each physical core may merge its assigned sorted chunks locally using a merging method according to the present principles (hereinafter "TreeChunkMerge") in block 110. After the TreeChunkMerge, there may be exactly P sorted larger chunks (e.g., larger than the non-merged chunks) in memory, where P is the number of physical cores, and the P larger chunks may be merged using a parallel chunk merging method according to the present principles (hereinafter "ParallelChunkMerge") in block 112. Sorted data (e.g., M bytes of sorted data) may be output in block 114. It is noted that the methods according to the present principles for VectorChunkSort, TreeChunkMerge, and ParallelChunkMerge will be discussed in further detail hereinbelow.
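As a non-limiting, runnable miniature of blocks 104 through 114, the staged structure may be sketched as follows, with std::sort standing in for the vectorized kernel of block 106 and pairwise in-place merges standing in for TreeChunkMerge and ParallelChunkMerge (the function name and chunk size are assumptions of this sketch):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

std::vector<int> manycore_sort(std::vector<int> data,
                               std::size_t chunk_elems = std::size_t{1} << 20) {
    // Blocks 104-106: cut the input into cache-sized chunks and sort each one
    // in parallel (the disclosure sorts each chunk with all p cores and vector
    // code; one thread per chunk and std::sort stand in here for brevity).
    std::vector<std::thread> workers;
    for (std::size_t off = 0; off < data.size(); off += chunk_elems) {
        auto first = data.begin() + off;
        auto last  = data.begin() + std::min(off + chunk_elems, data.size());
        workers.emplace_back([first, last] { std::sort(first, last); });
    }
    for (auto& w : workers) w.join();
    // Blocks 108-112: merge sorted runs pairwise until one run remains (a
    // flat stand-in for TreeChunkMerge followed by ParallelChunkMerge).
    for (std::size_t run = chunk_elems; run < data.size(); run *= 2)
        for (std::size_t off = 0; off + run < data.size(); off += 2 * run)
            std::inplace_merge(data.begin() + off,
                               data.begin() + off + run,
                               data.begin() + std::min(off + 2 * run,
                                                       data.size()));
    return data;  // block 114: the full, sorted output
}
```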
Referring now to FIG. 2, a vectorized chunk sorting method (VectorChunkSort) is illustratively depicted in accordance with one embodiment of the present principles. In one embodiment, the data in each chunk may be divided into subchunks, and the subchunks may be assigned to the one or more threads instantiated on the physical cores.
In one embodiment, each thread may sort and merge its subchunks using, for example, vector intrinsics, to produce as many larger subchunks as there are threads in the system. For example, each thread may vector-sort each of its subchunks in block 208, and each thread may vector-merge its sorted subchunks to produce a sorted large subchunk in block 210. Next, all threads may parallel merge the subchunks to produce the sorted chunk in block 212 (e.g., P*T threads may parallel merge P*T large sorted subchunks, where P is the number of physical cores, and T is the number of threads per physical core). Sorted data (e.g., a sorted chunk) may be output in block 214, and the sorted chunk may be of size, for example, P multiplied by the last level cache size, where P is the number of physical cores.
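The disclosure does not recite particular vector instructions; purely as an illustrative sketch, an in-register sorting network of the kind a vector-sort kernel may employ can be written for four 32-bit integers with SSE4.1 intrinsics as follows (the choice of instruction set and of this three-stage network is an assumption of the sketch):

```cpp
#include <smmintrin.h>  // SSE4.1: _mm_min_epi32, _mm_max_epi32, _mm_blend_ps

// Sort the four 32-bit ints held in v in ascending order with the three-stage
// sorting network (0,1)(2,3) -> (0,2)(1,3) -> (1,2): at each stage, shuffle
// the compare partners into place, take lane-wise min/max, and blend the
// results back into the correct lanes.
static inline __m128i sort4(__m128i v) {
    // Stage 1: compare-exchange lanes (0,1) and (2,3).
    __m128i p  = _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1));  // [1,0,3,2]
    __m128i mn = _mm_min_epi32(v, p), mx = _mm_max_epi32(v, p);
    v = _mm_castps_si128(_mm_blend_ps(_mm_castsi128_ps(mn),
                                      _mm_castsi128_ps(mx), 0b1010));
    // Stage 2: compare-exchange lanes (0,2) and (1,3).
    p  = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));          // [2,3,0,1]
    mn = _mm_min_epi32(v, p); mx = _mm_max_epi32(v, p);
    v = _mm_castps_si128(_mm_blend_ps(_mm_castsi128_ps(mn),
                                      _mm_castsi128_ps(mx), 0b1100));
    // Stage 3: compare-exchange lanes (1,2).
    p  = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 1, 2, 0));          // [0,2,1,3]
    mn = _mm_min_epi32(v, p); mx = _mm_max_epi32(v, p);
    return _mm_castps_si128(_mm_blend_ps(_mm_castsi128_ps(mn),
                                         _mm_castsi128_ps(mx), 0b0100));
}
```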
Referring now to FIG. 3, a tree-based chunk merging method (TreeChunkMerge) is illustratively depicted in accordance with one embodiment of the present principles. In one embodiment, the sorted chunks assigned to a physical core may be merged through a tree of nodes, each node having a circular buffer, and the nodes may be partitioned among the one or more threads of the physical core.
In one embodiment, a data quantum size Q1 may be set for each thread, and a node may be assigned from the partition that contains the most data in block 310. Then, for each node, if both child nodes have at least Q1 bytes of data and their parent has Q1 bytes of space, the children's data may be merged and stored in the circular buffer in block 312, and a sorted large chunk may be output in block 314.
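By way of a non-limiting model of the block 310-312 policy, the admission test may be sketched as follows (the Node type and the fill-level bookkeeping are assumptions; a real implementation would move merged bytes through the circular buffers rather than update counters):

```cpp
#include <cstddef>

// Hypothetical model of a merge-tree node: each node's circular buffer is
// represented only by its fill level and capacity.
struct Node {
    std::size_t avail = 0;   // bytes of merged data ready in this node's buffer
    std::size_t cap   = 0;   // capacity of this node's circular buffer
    Node* left  = nullptr;   // children; null at the leaves
    Node* right = nullptr;
};

// One scheduling step for a thread assigned node n: merge only when both
// children hold at least Q1 bytes and the node's own buffer has Q1 bytes of
// free space; otherwise report no progress so the thread can pick a node
// from the fullest partition instead.
bool try_merge_step(Node& n, std::size_t Q1) {
    if (n.left == nullptr || n.right == nullptr)
        return false;  // leaves are fed by the input runs, not by merging
    if (n.left->avail >= Q1 && n.right->avail >= Q1 && n.cap - n.avail >= Q1) {
        // Produce Q1 merged bytes; how much is consumed from each child is
        // data-dependent in a real 2-way merge (split evenly in this model).
        n.left->avail  -= Q1 / 2;
        n.right->avail -= Q1 / 2;
        n.avail        += Q1;
        return true;
    }
    return false;
}
```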
Referring now to FIG. 4, a parallel chunk merging method (ParallelChunkMerge) is illustratively depicted in accordance with one embodiment of the present principles.
In one embodiment, the sorted large chunks may be received as input in block 402. A binary tree with leaf nodes may be built in block 404, and each node may be assigned (e.g., statically) to a physical core in block 406. It is noted that the number of leaf nodes may be equal to the number of sorted large chunks to be merged, which may also equal the number of physical cores. Each node may be assigned a circular buffer in block 408, and the total size of the buffers may be the number of processing cores p multiplied by the last level cache size. For each node, if both children have, for example, at least Q2 bytes of data, and there are Q2 bytes of space in its circular buffer, the children's data may be merged in block 410, and the result of the child data merge may be stored in the circular buffer (e.g., shared circular buffer) in block 412. The sorted data (e.g., M bytes) may be output in block 414. It is noted that although one thread and one chunk per physical core are illustratively depicted, it is contemplated that other sorts of configurations may be employed according to the present principles.
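By way of a non-limiting sketch of blocks 404 through 408, the merge tree and its buffer budget may be set up as follows (the heap-style node layout, the power-of-two leaf count, and the even split of the p-times-cache buffer budget across nodes are assumptions of this sketch):

```cpp
#include <cstddef>
#include <vector>

// A complete binary tree over P leaves (one leaf per sorted large chunk,
// which may equal the number of physical cores), each node owning a circular
// buffer, with the buffers' total size equal to p multiplied by the last
// level cache size. P is assumed to be a power of two so the tree is complete.
struct TreeNode {
    std::size_t buf_bytes;  // capacity of this node's circular buffer
    int parent;             // index of the parent node; -1 at the root
};

std::vector<TreeNode> build_merge_tree(std::size_t P, std::size_t llc_bytes) {
    const std::size_t n_nodes  = 2 * P - 1;                // P leaves, P-1 inner
    const std::size_t per_node = P * llc_bytes / n_nodes;  // even budget split
    std::vector<TreeNode> tree(n_nodes);
    // Heap-style layout: node i has children 2i+1 and 2i+2; the leaves
    // (indices P-1 .. 2P-2) receive the P sorted large chunks.
    for (std::size_t i = 0; i < n_nodes; ++i)
        tree[i] = {per_node, i == 0 ? -1 : static_cast<int>((i - 1) / 2)};
    return tree;
}
```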
Referring now to FIG. 5, an exemplary manycore-based system 501 for sorting data is illustratively depicted in accordance with one embodiment of the present principles.
The system 501 may receive input data 503 which may be employed as input to a plurality of modules, including a chunk module 502, a VectorChunkSort module 504, a TreeChunkMerge module 506, and a ParallelChunkMerge module 508, which may be configured to perform a plurality of tasks, including, but not limited to, receiving data, chunking data, instantiating threads, sorting and merging chunks and subchunks, caching data, and buffering. The system 501 may produce output data 507, which in one embodiment may be displayed on one or more display devices 510. It should be noted that while the above configuration is illustratively depicted, it is contemplated that other sorts of configurations may also be employed according to the present principles.
Referring now to FIG. 6, a tree-based parallel merging system and method employing synchronization-free data structures is illustratively depicted in accordance with one embodiment of the present principles.
In one embodiment, the present principles employ a tree-based parallel merge with synchronization-free data structures, and tree nodes may be allocated to threads during merging. The tree-based parallel merging system and method may employ shared data structures, and may manage the size of the shared data structures by considering the caches of the manycore system. It is noted that the synchronization-free parallel merging according to the present principles may be highly scalable for sorting as the system becomes more parallel, and the merging may be performed while avoiding off-chip memory access when employing a circular buffer 612.
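The disclosure does not spell out the internals of the circular buffer; one common lock-free realization consistent with the description is a single-producer/single-consumer ring with atomic read and write indices, sketched below (the class name and the use of C++11 atomics are assumptions of the sketch):

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// A minimal single-producer/single-consumer ring buffer of the kind that can
// back each tree node: the child-merging thread produces, the parent-merging
// thread consumes, and neither ever takes a lock.
template <typename T>
class SpscRing {
public:
    explicit SpscRing(std::size_t capacity) : buf_(capacity + 1) {}

    bool push(const T& v) {                       // producer thread only
        std::size_t w = write_.load(std::memory_order_relaxed);
        std::size_t next = (w + 1) % buf_.size();
        if (next == read_.load(std::memory_order_acquire))
            return false;                         // full: producer retries later
        buf_[w] = v;
        write_.store(next, std::memory_order_release);  // publish the element
        return true;
    }

    bool pop(T& out) {                            // consumer thread only
        std::size_t r = read_.load(std::memory_order_relaxed);
        if (r == write_.load(std::memory_order_acquire))
            return false;                         // empty: consumer retries later
        out = buf_[r];
        read_.store((r + 1) % buf_.size(), std::memory_order_release);
        return true;
    }

private:
    std::vector<T> buf_;                          // one slot is left unused
    std::atomic<std::size_t> read_{0}, write_{0};
};
```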
The foregoing is to be understood as being in every respect illustrative and exemplary, but not restrictive, and the scope of the invention disclosed herein is not to be determined from the Detailed Description, but rather from the claims as interpreted according to the full breadth permitted by the patent laws. It is to be understood that the embodiments shown and described herein are only illustrative of the principles of the present invention and that those skilled in the art may implement various modifications without departing from the scope and spirit of the invention. Those skilled in the art could implement various other feature combinations without departing from the scope and spirit of the invention. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.
This application claims priority to provisional application Ser. No. 61/871,960, filed on Aug. 30, 2013, which is incorporated herein by reference in its entirety.