The present invention relates generally to a system and method databases, and, in particular embodiments, to a system and method for parallelizing hash-based operators in symmetric multiprocessing (SMP) databases.
With the booming of Internet applications more and more data is generated and stored into database. The database queries can be very complicated. Parallel processing database management systems are designed for managing and processing huge amount of data. A symmetric multiprocessing (SMP) computer system contains multiple central processing unit (CPU) cores which shares large amount of memory and is ideal for running a parallel processing database system.
For database operations, hash-based operators include hash join and hash aggregation. Hash join is a method to find, for each distinct value of the join attribute, the set of tuples in each database table having that value. Hash join consists of two phases: build phase and probe phase. When joining two tables, first the build phase creates a hash table on top of the smaller table (inner table). A hash table entry contains the join attribute and the data row. The probe phase uses the same hash function as the build phase. It scans the rows of the larger table (outer table), hashes the join attribute and finds the matching rows in the inner table by referring to the hash table. Hash aggregation is a way to implement the database aggregate operations such as group by and distinct. Similar as the hash join operator, it also creates a hash table on top to the relation data with the target aggregate attribute as the hash key. From the hash table, the tuples can be distributed into groups or the unique ones can be extracted.
In accordance with an embodiment of the present invention, a method in a device for performing hash based database operations includes receiving at the device a database query; creating a plurality of execution workers to process the query; and building by the execution workers a hash table from a database table, the database table comprising one of a plurality of partitions and a plurality of scan units, the hash table shared by the execution workers, each execution worker scanning a corresponding partition and adding entries to the hash table if the database table is partitioned, each execution worker scanning an unprocessed scan unit and adding entries to the hash table according to the scan unit if the database table comprises scan units, and the workers performing the scanning and the adding in a parallel manner.
In accordance with another embodiment, a device configured for performing hash based database operations includes a processor and a non-transitory computer readable storage medium storing programming for execution by the processor, the programming including instructions to: receive a database query; create a plurality of execution workers to process the query; and build, by the execution workers, a hash table from a database table, the database table comprising one of a plurality of partitions and a plurality of scan units, the hash table shared by the execution workers, each execution worker scanning a corresponding partition and adding entries to the hash table if the database table is partitioned, each execution worker scanning an unprocessed scan unit and adding entries to the hash table according to the scan unit if the database table comprises scan units, and the workers performing the scanning and the adding in a parallel manner.
For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
The making and using of the presently preferred embodiments are discussed in detail below. It should be appreciated, however, that the present invention provides many applicable inventive concepts that can be embodied in a wide variety of specific contexts. The specific embodiments discussed are merely illustrative of specific ways to make and use the invention, and do not limit the scope of the invention.
Disclosed herein are systems and methods to perform hash-based database operations in a parallel manner. When an SMP database receives a query, embodiment systems and methods create multiple execution workers to process the query. A worker includes multiple threads for data processing and exchanging. Each execution worker is assigned a data partition. The execution workers process their own data partition. All the hash-based operations first build a hash table and then perform the operations, such as join and aggregation. In a disclosed embodiment, a hash table is created and shared among all the execution workers. Each worker scans its own partition and adds entries to the shared hash table. For hash join, after the shared hash table is created, each worker probes its partition of the outer join table using the shared hash table. In an embodiment, the hash aggregation is also done with the shared hash table.
Disclosed herein are at least three methods of handling contention between execution workers: synchronization, lock-free algorithms, and shuffling. A producer thread is started by each worker to exchange data tuples among the workers for shuffling. For a hash join, the execution workers scan in parallel and probe the outer table against the shared hash table. For hash aggregation, the execution workers aggregate the results in parallel using the shared hash table. A gather worker collects the results from all the execution workers and sends the results to clients or upper query operators.
If the tables involved in the database operation exceed the memory size of the system, spilling may occur. In an embodiment, the query execution workers scan the tables and create hash partitions on disk in parallel. In spilling situation, for hash join the query execution workers create partition pairs of the inner and outer table in parallel. The workers perform the partition wise hash join in parallel either using the same approach as the non-spilling case, or each worker is assigned a quantity of partition pairs to process. After that, the workers process the partitions on the disk and perform partition wise join or hash aggregation. For hash aggregation, the query execution workers build partitions of the hash table. The workers process the partition one by one using the non-spilling approach or the workers can be allotted a number of partitions to aggregate the results in parallel.
Other solutions for parallelizing hash-based operators in SMP databases include classical hybrid hash join algorithms and hybrid caching algorithms designed for traditional relational database management systems. However, system resources, such as CPUs and memory may not be fully utilized in these solutions.
Hash Table Building
A SMP system has multiple CPU cores. When receiving a query, the SMP system launches several query execution workers to process the query in parallel. For hash based operations, the first step is to build a hash table. In a disclosed approach, the hash table is shared among all the query execution workers. The relation data may be organized in two ways: (1) Partition the tables according to the degree of parallelism (DOP) of the SMP system. (2) The table data is not partitioned or the number of partitions is different from DOP. In the first case, for example, if the DOP is four, four workers are started to execute the query. Each query execution worker is assigned a partition. Each worker scans its own data partition. In the second case, a parallel scan is utilized. The scan unit is a number of data pages. Each worker processes a scan unit. After it is done, the worker scans another scan unit until the whole table is scanned.
Each worker 204 is assigned some SUs 210. For example SU1, SU2 and SU3 are assigned to the first worker 204; SU4, SU5, SU6 are assigned to the second worker 204. This is just for illustration purpose. In other embodiments, the assignment of SUs 210 to workers 204 may be different. For example, in another embodiment, it is also possible that SU1, SU4 and SU5 are assigned to the first worker 204 and SU2, SU3 and SU6 are assigned to the second worker 204. However, each SU 210 is only processed by one worker 204. In an embodiment, the iterator 208 iterates to the next SU 210 after one SU 210 is processed to ensure that all of the SUs 210 are processed.
When building the shared hash table (e.g., shared hash table 102 in
The join key is hashed and the system 300 mods the result by the DOP to decide which worker 306 the data should be sent to. In
Parallel Probing of Hash Join
In an embodiment, the number of partitions is the same as the DOP. Each worker 404 scans its own partition 410 of the outer table 408 and probes the entry against the shared hash table 406.
Parallel Hash Aggregation
Spilling Handling for Parallel Hash Join and Hash Aggregation
If the tables are too big and the memory is not enough to hold the whole table when performing a hash join or a hash aggregation, the data, in an embodiment, should to be spilled to the disk. Spilling is determined during the building of shared hash table.
Shared Hash Table Creation for Spilling
In an embodiment, during building the shared hash table, the buckets of the hash table 610 are split into groups. Each group is like an expandable partition 612 of the hash table 610. When the memory is not sufficient to hold the hash table 610, one partition 612 (e.g., Partition1) is kept in memory (e.g., hash table partition 606 kept in memory) and the other partitions 612 (e.g., Partition2 and Parition3) of the hash table are spilled onto disk (e.g., hash table partitions 602, 604 kept on disk). In an embodiment, the partition whose size is closest to the memory used for the hash table is selected to be kept in memory. Below is an algorithm to choose the partition to be kept in memory.
Method for choosing the partition to be kept in memory
for all the partitions of the hash table
choose a partition whose size is closes to the size of the memory used for the hash table
The query execution workers 608 continue adding entries to the shared hash table: for the one kept in memory, add entries to the hash table partition in memory 606; for the rest of the hash table partitions 612, add entries to the partition files on disk 602, 604.
Parallel Hash Join with Spilling
For hash join after the shared hash table is built, the outer table is read and then the same hash function and number of buckets as the shared hash table are used to create the partitions of the outer table: one partition is kept in memory matching the in memory one of the shared hash table; and all the other partitions are created on disk. Similar as the creation of partitions of the shared hash table, all the query execution workers create the partitions of the outer table in parallel. Since multiple workers may write to the same partition simultaneously, synchronization or shuffling can be used to resolve contention similar to that described above with reference to
Partition wise join is used for the hash join. First, the pair of partitions kept in memory are joined. Next, other pairs of partitions created on disk are loaded and are joined. To parallelize the pair wise hash join, there are at least two alternatives.
Option 1: use the method described above with reference to
Option 2: split the pairs of partitions among the execution workers. Each worker joins a number of pairs of partitions. The join results are gathered from all the workers and sent to a client or upper layer operator. For example, if there are 1024 pairs of partitions created from the inner and outer join tables and the DOP is 16, each worker joins 64 pairs of partitions.
Parallel Hash Aggregation with Spilling
In a spilling situation, after the partitions of the hash table are created, the query execution workers, acting in a parallel manner, aggregate the results in one of at least two manners.
Option 1: use the method as described above with reference to system 500 in
Option 2: allocate the partitions of the hash table to the workers. Each worker process a number of partitions.
In some embodiments, the processing system 900 is included in a network device that is accessing, or part otherwise of, a telecommunications network. In one example, the processing system 900 is in a network-side device in a wireless or wireline telecommunications network, such as a base station, a relay station, a scheduler, a controller, a gateway, a router, an applications server, or any other device in the telecommunications network. In other embodiments, the processing system 900 is in a user-side device accessing a wireless or wireline telecommunications network, such as a mobile station, a user equipment (UE), a personal computer (PC), a tablet, a wearable communications device (e.g., a smartwatch, etc.), or any other device adapted to access a telecommunications network. In other embodiments, the processing system 900 is a stand alone data processing system without the capability to communicate with other devices.
A disclosed embodiment of a method in a device for performing hash based database operations includes receiving at the device a database query; creating a plurality of execution workers to process the query; and building by the execution workers a hash table from a database table, the database table comprising one of a plurality of partitions and a plurality of scan units, the hash table shared by the execution workers, each execution worker scanning a corresponding partition and adding entries to the hash table if the database table is partitioned, each execution worker scanning an unprocessed scan unit and adding entries to the hash table according to the scan unit if the database table comprises scan units, and the workers performing the scanning and the adding in a parallel manner. The method may also include synchronizing writing to a hash bucket by the execution workers to minimize contention between the execution workers for the hash bucket or utilizing a lock-free algorithm to minimize contention between the execution workers for a hash bucket. In an embodiment, the method may include partitioning the hash table into a plurality of hast table partitions according to a degree of parallelism (DOP) of the SMP database; starting with each execution worker a corresponding producer thread for data exchanging, the producer thread scanning corresponding partition data from the database table, hashing the corresponding partition data, and sending hashed corresponding partition data to the corresponding execution worker; and hashing with the execution worker the hashed corresponding partition data and adding data entries to a respective partition of the partitioned hash table. The method may include partitioning an outer table; and assigning each partition of the outer table to a respective one of the execution workers, each execution worker scanning a corresponding partition of the outer table and probing an entry from the outer table against the hash table in parallel with the scanning and probing by other execution workers. In an embodiment, the method may include performing an aggregation operation wherein the aggregation operation comprises: partitioning the hash table into partitions; assigning a first partition to a first one of the execution workers; and assigning a second partition to the first one of the execution workers when the first one of the execution workers completes processing the first partition, wherein the second partition has not been processed by one of the execution workers prior to the assigning to the first one of the execution workers. In an embodiment, the method may include splitting buckets of the hash table into groups; keeping one of the groups in memory; and spilling other ones of the groups onto a storage disk. The method may also include determining a percentage of data in a partition of the hash table as compared to all the data in the hash table; and selecting a group size according to the percentage and according to the size of the memory. In an embodiment, the method may include creating a pair of partitions, each pair of partitions comprising a partition of an inner table and a partition of an outer table; loading a partition of the hash table into memory, the partition of the hash table shared by the execution workers; and allotting each of a plurality of portions of the partition of the outer table to a respective one of the execution workers, each execution worker performing join operations on the allotted portion of the partition in parallel with the other execution workers. The method may include creating a pair of partitions, each pair of partitions comprising a partition of an inner table and a partition of an outer table; splitting the pairs of partitions among the execution workers, each worker joining one or more pairs of partitions; and gathering join results from the execution workers. In an embodiment, the method may include performing a parallel hash aggregation operation with spilling, wherein the parallel hash aggregation operation comprises assigning an unprocessed hash partition to one of the execution workers when the one of the execution workers becomes free. The method may include performing a parallel hash aggregation operation with spilling, wherein the parallel hash aggregation operation comprises allocating each of the partitions of the hash table to a respective execution worker, each execution worker processing its assigned partitions.
A disclosed embodiment of a device configured for performing hash based database operations includes a processor and a non-transitory computer readable storage medium storing programming for execution by the processor, the programming including instructions to: receive an SMP database query; create a plurality of execution workers to process the query; and build, by the execution workers, a hash table from a database table, the database table comprising one of a plurality of partitions and a plurality of scan units, the hash table shared by the execution workers, each execution worker scanning a corresponding partition and adding entries to the hash table if the database table is partitioned, each execution worker scanning an unprocessed scan unit and adding entries to the hash table according to the scan unit if the database table comprises scan units, and the workers performing the scanning and the adding in a parallel manner.
While this invention has been described with reference to illustrative embodiments, this description is not intended to be construed in a limiting sense. Various modifications and combinations of the illustrative embodiments, as well as other embodiments of the invention, will be apparent to persons skilled in the art upon reference to the description. It is therefore intended that the appended claims encompass any such modifications or embodiments.