The current document is directed to methods and systems for sorting data and, in particular, to data sorting within cloud-computing and other distributed-computing environments.
Many of the classical computational methods related to data processing, data storage, and information retrieval were developed during an era in which even large computer systems were generally based on a single processor and directly connected data-storage devices and other peripheral devices. Data processing in such systems is sequential in nature, as a result of which many of the classical data-processing methods are inherently sequential and fail to take advantage of parallel processing. As computer networking and distributed computer systems evolved during the past 30 years, new types of computational methods have emerged to take advantage of the enormous computational bandwidths that are possible when a computational task is partitioned and distributed among a large number of concurrently executing processors and individual computational systems. More recently, the emergence of cloud computing has yet again changed the underlying constraints, capabilities, and dynamics associated with computational resources. As a result, new opportunities are emerging for the development of new types of computational methods and systems implemented within cloud-computing environments and other types of distributed-computing environments.
The current document is directed to a method and system for data processing in cloud-computing environments and other distributed-computing environments. Implementations of a merge sort suitable for the sorting of data within cloud-computing environments and other distributed-computing environments are disclosed. These implementations take advantage of the massive parallelism available in cloud-computing environments and take into consideration numerous constraints regarding data-storage and data-retrieval operations in a cloud-computing environment. The implementations provide a type of data-sorting method and system that iteratively carries out highly parallel merge-sort operations that can be effectively applied over a range of data-set sizes up to extremely large data sets.
The current document is directed to data-sorting methods and systems suitable for execution within cloud-computing environments and other distributed-computing environments. Various implementations of these methods and systems are discussed below using detailed illustrations, control-flow diagrams, and an example C++ implementation.
It should be noted, at the outset, that the currently disclosed methods and systems are directed to real, tangible, physical systems and methods carried out within physical systems, including client computers and server computers. Those familiar with modern science and technology well appreciate that, in modern computer systems and other processor-controlled devices and systems, the control components are often fully or partially implemented as sequences of computer instructions that are stored in one or more electronic memories and, in many cases, also in one or more mass-storage devices, and which are executed by one or more processors. As a result of their execution, a processor-controlled device or system carries out various operations, generally at many different levels within the device or system, according to control logic implemented in the stored and executed computer instructions. Computer-instruction-implemented control components of modern processor-controlled devices and systems are as tangible and physical as any other component of the system, including power supplies, cooling fans, electronic memories and processors, and other such physical components.
In the following discussion, the phrases “cloud computing” and “cloud-computing environment” are used to describe, in general terms, the large number of relatively new, computing-as-a-utility distributed-computing facilities that allow users to configure remote, virtual computing systems and data centers and execute various types of computational tasks within these remote computer systems and data centers. In general, cloud-computing facilities provide users with virtual systems and data centers mapped to actual physical server computers, data-storage subsystems, and other remote physical data-center components. Users may dynamically add computational bandwidth and data-storage capabilities and dynamically return unused computational bandwidth and data-storage capacity in order to respond to dynamically changing computational loads in a cost-effective manner. Users of cloud-computing facilities essentially rent underlying physical facilities, allowing the users to concentrate on developing and deploying service applications and other programs without worrying about assembling and maintaining physical data centers and without needing to purchase and maintain large computational facilities to handle peak loads that, during non-peak periods, lie idle while incurring power, maintenance, and physical-housing costs.
Although there are many different types of cloud-computing facilities and environments, many of the cloud-computing environments have certain commonly shared characteristics. For example, because a physical location of a user's virtual system or data center is dynamic within a cloud-computing facility, cloud-computing facilities generally provide virtual data-storage subsystems for long-term data storage. Thus, long-term data storage is generally decoupled from computation in many cloud-computing environments.
In many cloud-computing environments, data is stored within relatively large objects similar to files in a traditional computer system. These objects are associated with unique identifiers that allow the objects to be reliably stored in a data-storage subsystem and subsequently retrieved. The objects are generally written sequentially and can only be updated by rewriting the entire object, and are read into the memory of a physical or virtual server or other computer system for random access. In general, any cloud-computing-facility server or other computer system may be authorized to access any data object stored within the data-storage subsystem of a cloud-computing facility. Cloud-computing facilities provide interfaces that allow users to allocate, start, and stop virtual servers and systems within the cloud and to launch particular computational tasks on the allocated servers and other virtual systems.
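The object model described above can be illustrated with a minimal in-memory sketch. The class name `object_store` and its members `put`, `rewrite`, and `get` are hypothetical, chosen only for illustration; they do not correspond to any particular cloud provider's interface. The sketch assumes that an object is an opaque byte string keyed by a unique identifier, that an update always replaces the whole object, and that a read copies the whole object into memory, where it can then be accessed randomly.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a cloud data-storage subsystem (illustration only).
class object_store {
public:
    // Stores a complete object and returns its newly assigned unique identifier.
    std::string put(const std::string& contents) {
        std::string id = "obj-" + std::to_string(next_id_++);
        objects_[id] = contents;
        return id;
    }

    // Replaces the entire object; no partial, in-place update is offered.
    // Returns false when no object with the given identifier exists.
    bool rewrite(const std::string& id, const std::string& contents) {
        auto it = objects_.find(id);
        if (it == objects_.end()) return false;
        it->second = contents;
        return true;
    }

    // Copies the whole object into `out` so the caller can access it randomly.
    // Returns false when no object with the given identifier exists.
    bool get(const std::string& id, std::string& out) const {
        auto it = objects_.find(id);
        if (it == objects_.end()) return false;
        out = it->second;
        return true;
    }

private:
    std::map<std::string, std::string> objects_;  // identifier -> object contents
    std::size_t next_id_ = 0;                     // source of unique identifiers
};
```

Because any holder of an identifier can call `get`, the sketch also mirrors the property that any authorized server may read any stored object.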
Records may have significantly more complex internal structures. As one example,
The basic computational flow illustrated in
In a first step, shown in
As shown in
To recapitulate, as discussed above with reference to
The above-described cloud merge sort is designed to be reliable under cloud-computing conditions. For example, each of the distributed tasks generated in each fan-out operation of each cloud-merge-sort cycle is idempotent so that, whenever a task fails, it can simply be restarted on the same or another computational resource. Distribution of the tasks generated in each fan-out operation may be carried out according to any of many different types of task-distribution schemes. Each task may be distributed to a different virtual computer system within a cloud-computing environment, or sets of tasks may be distributed to each of a number of different virtual computer systems within a cloud-computing environment. Many other types of task-distribution schemes may be followed. As discussed above with reference to
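The restart behavior that idempotent tasks make possible can be sketched as a small retry helper. The function name `run_with_restart` and its `max_attempts` parameter are hypothetical and do not appear in the implementation discussed in this document. The sketch assumes that a task signals failure by throwing an exception and that rerunning a task is safe because it produces the same output regardless of how many times it runs.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>

// Hypothetical helper: runs an idempotent task, restarting it after a failure
// up to max_attempts times. Returns the number of attempts that were needed,
// or 0 when every attempt failed.
inline int run_with_restart(const std::function<void()>& task, int max_attempts) {
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        try {
            task();
            return attempt;  // the task completed; no restart is needed
        } catch (const std::exception&) {
            // The task is assumed idempotent, so simply trying again is safe.
        }
    }
    return 0;
}
```

A scheduler built along these lines could equally hand the restarted task to a different computational resource, since nothing in the helper depends on where the task runs.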
Next, a C++ implementation of the cloud merge sort is provided. This implementation employs a number of standard C++ libraries and data types, as can be seen in the following include and typedef statements:
First, classes for the block_info objects that describe blocks within manifests and the class for manifest objects are declared, as follows:
The class block_info includes two constructors, declared on lines 4-8, and data members, on lines 9-13, that store the identifier for a block, the number of records in the block, an indication of whether or not the block is sorted, and the key values of the first and last records of the block, as discussed above with reference to
Next, a relational function for comparing two blocks is provided:
The two add function members of the manifest class are next provided without detailed additional comments:
The blocks of one manifest are appended to the end of another manifest, in the first add operation, and a single block is added to a manifest in the second add function.
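As a rough illustration of the descriptor fields and the two add operations described above, the following sketch shows one possible shape. It is not the declaration referred to in the text: the member names (`id`, `num_records`, `sorted`, `first_key`, `last_key`), the accessors `size`, `total_records`, and `blocks`, and the use of `std::string` keys are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Describes one stored block of records (field names are assumptions).
struct block_info {
    std::string id;           // identifier of the block in the storage subsystem
    std::size_t num_records;  // number of records held in the block
    bool sorted;              // whether the records within the block are sorted
    std::string first_key;    // key of the first record (meaningful when sorted)
    std::string last_key;     // key of the last record (meaningful when sorted)
};

// A manifest: an ordered list of block descriptors with a running record count.
class manifest {
public:
    // Adds a single block to the end of this manifest.
    void add(const block_info& b) {
        blocks_.push_back(b);
        total_records_ += b.num_records;
    }

    // Appends all blocks of another manifest to the end of this one.
    void add(const manifest& other) {
        // Copy first so that appending a manifest to itself is well defined.
        std::vector<block_info> copy = other.blocks_;
        std::size_t count = other.total_records_;
        blocks_.insert(blocks_.end(), copy.begin(), copy.end());
        total_records_ += count;
    }

    std::size_t size() const { return blocks_.size(); }
    std::size_t total_records() const { return total_records_; }
    const std::vector<block_info>& blocks() const { return blocks_; }

private:
    std::vector<block_info> blocks_;
    std::size_t total_records_ = 0;
};
```

Keeping the record count alongside the block list is a design choice of the sketch; it lets a caller size later work without rescanning every descriptor.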
Next, the manifest-class member function split_memory_sort is provided:
In the for-loop of lines 3-8, each block in the manifest is incorporated into a single sort task that is added to a vector of single-sort tasks. The split_memory_sort operation is discussed above with reference to
Next, the manifest-class member function split_merge_sort is provided:
On line 4, the local variable final discussed above with reference to
Next, an implementation of the manifest-class member function finalize_sort is provided:
The merge-sort task, discussed above with reference to
The merge_sort_task class includes a vector of block_info objects representing a set of blocks to merge sort, declared on line 5, and the begin_on and end_before data members, on lines 6-7, discussed above with reference to
The merge_sort_task function member “run,” which executes a merge sort on a data subset, as discussed above with reference to
A result manifest out is declared on line 3. A vector of readers is declared on line 9 for reading records from the blocks that are being merge sorted. The options data structures, discussed above with respect to
The single sort task used for the in-memory sorting of each block in the first cycle of a cloud merge sort is next provided, without detailed explanation:
The in-memory sort is carried out on line 16 and the block_info object that describes the sorted block is updated on lines 17-21. Finally, simulation code for the cloud-computing environment is provided without further discussion:
The following classes block_reader, block_writer, and cloud simulate a cloud-computing environment. The member functions of these classes allow a block of records to be written to the simulated data-storage subsystem and read from the simulated data-storage subsystem of the simulated cloud-computing environment. This code is provided for the sake of completeness, but is not directed to implementation of the cloud merge sort:
Finally, the main routine for cloud merge sort is provided:
An instance of the class cloud is declared on line 6. On lines 7-36, a large number of records is constructed to generate a simulated data set. On lines 37-38, the manifest-class member function split_memory_sort is used to carry out the first fan-out operation. The sorting of individual blocks is carried out on lines 41-42. The do-while loop of lines 44-52 carries out the remaining cycles of the cloud merge sort, with the fan-out for each of the subsequent cycles carried out by lines 47-48, using the manifest-class member function split_merge_sort, and the merge sorts carried out in the for-loop of lines 49-50. The final sort, in which all of the non-overlapping blocks are sorted by first key value, is carried out on line 53. The remaining code is used to verify that the cloud merge sort produced a correct sorted data set.
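The overall flow of the main routine, an independent in-memory sort of each block followed by merge sorts over disjoint key ranges, can be approximated by a much simplified, single-machine sketch. The sketch is not the implementation described in this document: it collapses the iterative cycles into a single merge cycle, omits manifests and storage, assumes integer keys, and takes the range boundaries as a caller-supplied `splits` argument. The names `block`, `sort_each_block`, `merge_range`, and `two_phase_sort` are hypothetical; only `begin_on` and `end_before` echo names used in the text.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

using block = std::vector<int>;  // a block of integer keys (assumed record type)

// First cycle: sort every block independently; each sort could be its own task.
void sort_each_block(std::vector<block>& blocks) {
    for (block& b : blocks) std::sort(b.begin(), b.end());
}

// One merge task: k-way merge of the keys in [begin_on, end_before) drawn from
// already sorted blocks. Tasks over disjoint ranges do not depend on each other.
block merge_range(const std::vector<block>& sorted_blocks,
                  long long begin_on, long long end_before) {
    using cursor = std::pair<int, std::size_t>;  // (current key, block index)
    std::priority_queue<cursor, std::vector<cursor>, std::greater<cursor>> heap;
    std::vector<std::size_t> pos(sorted_blocks.size());
    for (std::size_t i = 0; i < sorted_blocks.size(); ++i) {
        const block& b = sorted_blocks[i];
        // Skip the keys that fall below this task's range.
        pos[i] = static_cast<std::size_t>(
            std::lower_bound(b.begin(), b.end(), begin_on) - b.begin());
        if (pos[i] < b.size() && b[pos[i]] < end_before) heap.emplace(b[pos[i]], i);
    }
    block out;
    while (!heap.empty()) {
        cursor c = heap.top();  // smallest key among the block cursors
        heap.pop();
        out.push_back(c.first);
        const block& b = sorted_blocks[c.second];
        std::size_t next = ++pos[c.second];
        if (next < b.size() && b[next] < end_before) heap.emplace(b[next], c.second);
    }
    return out;
}

// Driver: sorts the blocks, merges each key range, and concatenates the
// non-overlapping results in range order. `splits` must be ascending.
block two_phase_sort(std::vector<block> blocks, const std::vector<int>& splits) {
    sort_each_block(blocks);
    const long long lowest = std::numeric_limits<long long>::min();
    const long long highest = std::numeric_limits<long long>::max();
    block result;
    for (std::size_t i = 0; i <= splits.size(); ++i) {
        long long begin_on = (i == 0) ? lowest : splits[i - 1];
        long long end_before = (i == splits.size()) ? highest : splits[i];
        block part = merge_range(blocks, begin_on, end_before);
        result.insert(result.end(), part.begin(), part.end());
    }
    return result;
}
```

For example, calling `two_phase_sort` on the blocks `{5, 1, 9}`, `{4, 8, 2}`, and `{7, 3, 6}` with splits `{4, 7}` runs three merge tasks over the ranges below 4, from 4 up to 7, and from 7 upward, and concatenates their outputs into the keys 1 through 9 in ascending order.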
Although the present invention has been described in terms of particular embodiments, it is not intended that the invention be limited to these embodiments. Modifications within the spirit of the invention will be apparent to those skilled in the art. For example, any of many different implementations of the cloud merge sort can be obtained by varying any of many different design and implementation parameters, including programming language, underlying operating system, data structures, control structures, modular organization, cloud-computing-environment interface, and other such parameters. As mentioned above, the cloud merge sort may sort a data set on one or multiple dimensions depending on implementation of the relational operator used to compare key values of records with one another. In addition, the cloud merge sort may sort data records in ascending order, descending order, or in more complex ways depending on the implementation of the relational operator. Cloud merge sort can be tailored for execution in any of many different types of distributed-computing environments, including many different types of cloud-computing environments.
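The dependence of sort order on the relational operator can be illustrated with a short sketch. The `record` type, its `name` and `year` fields, and the three comparison functions are hypothetical and serve only as an example; the sketch uses `std::sort` on an in-memory vector rather than the cloud merge sort itself. Each comparison function is assumed to define a strict weak ordering, as `std::sort` requires.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical record with a two-part key.
struct record {
    std::string name;
    int year;
};

// Ascending order on a single dimension.
bool by_name_ascending(const record& a, const record& b) {
    return a.name < b.name;
}

// Descending order on a single dimension.
bool by_year_descending(const record& a, const record& b) {
    return a.year > b.year;
}

// Two dimensions: ascending by year, with ties broken by ascending name.
bool by_year_then_name(const record& a, const record& b) {
    return std::tie(a.year, a.name) < std::tie(b.year, b.name);
}

// Returns a copy of the records ordered by the supplied relational function.
template <typename Less>
std::vector<record> sorted_copy(std::vector<record> records, Less less) {
    std::sort(records.begin(), records.end(), less);
    return records;
}
```

Swapping the comparison function changes the resulting order without touching the sorting code, which is the property the paragraph above relies on.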
It is appreciated that the previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
This application claims the benefit of Provisional Application No. 61/656,426, filed Jun. 6, 2012.
| Number | Date | Country |
|---|---|---|
| 20130346425 A1 | Dec 2013 | US |

| Number | Date | Country |
|---|---|---|
| 61656426 | Jun 2012 | US |