Some embodiments relate to a data structure. More specifically, some embodiments provide a method and system for a data structure and use of same in parallel computing environments.
A number of presently developed and developing computer systems include multiple processors in an attempt to provide increased computing performance. Parallel computing systems and devices may provide advances in computing performance, including, for example, processing speed and throughput, as compared to single-processor systems that process programs and instructions sequentially.
For parallel shared-memory aggregation processes, a number of approaches have been proposed. However, the previous approaches each include sequential operations and/or synchronization operations, such as locking, to avoid inconsistencies or lapses in data coherency. Thus, prior proposed solutions for parallel aggregation in parallel computation environments with shared memory either contain a sequential step or require some form of synchronization on the data structures.
Accordingly, a method and mechanism for efficiently processing data in parallel computation environments and the use of same in parallel aggregation processes are provided by some embodiments herein.
In an effort to more fully and efficiently use the resources of a particular computing environment, a data structure and techniques of using that data structure may be developed to fully exploit the design characteristics and capabilities of that particular computing environment. In some embodiments herein, a data structure and techniques for using that data structure (i.e., algorithms) are provided for efficiently using the data structure disclosed herein in a parallel computing environment with shared memory.
As used herein, the term parallel computation environment with shared memory refers to a system or device having more than one processing unit. The multiple processing units may be processors, processor cores, multi-core processors, etc. All of the processing units can access a main memory (i.e., a shared memory architecture). All of the processing units can run or execute the same program(s). As used herein, a running program may be referred to as a thread. Memory may be organized in a hierarchy of multiple levels, where faster but smaller memory units are located closer to the processing units. The smaller and faster memory units located nearer the processing units as compared to the main memory are referred to as cache.
Processing units 105, 110, and 115 communicate with a shared memory 135 via a system bus 175. System bus 175 also provides a mechanism for the processing units to communicate with a storage device 140. Storage device 140 may include any appropriate information storage device, including combinations of magnetic storage devices (e.g., a hard disk drive), optical storage devices, and/or semiconductor memory devices for storing data and programs.
Storage device 140 stores a program 145 for controlling the processing units 105, 110, and 115 and a query engine application 150 for executing queries. Processing units 105, 110, and 115 may perform instructions of the program 145 and thereby operate in accordance with any of the embodiments described herein. For example, the processing units may concurrently execute a plurality of execution threads to build the index hash table data structures disclosed herein. Furthermore, query engine 150 may operate to execute a parallel aggregation operation in accordance with aspects herein, in cooperation with the processing units and by accessing database 155. Program 145 and other instructions may be stored in a compressed, uncompiled, and/or encrypted format. Program 145 may also include other program elements, such as an operating system, a database management system, and/or device drivers used by the processing units 105, 110, and 115 to interface with peripheral devices.
In some embodiments, storage device 140 includes a database 155 to facilitate the execution of queries based on input table data. The database may include data structures (e.g., index hash tables), rules, and conditions for executing a query in a parallel computation environment such as that of
In some embodiments, the data structure disclosed herein as being developed for use in parallel computing environments with shared memory is referred to as a parallel hash table. In some instances, the parallel hash table may also be referred to as a parallel hash map. In general, a hash table may be provided and used as an index structure for data storage to enable fast data retrieval. The parallel hash table disclosed herein may be used in a parallel computation environment where multiple concurrently executing (i.e., running) threads insert and retrieve data in tables. Furthermore, an aggregation algorithm that uses the parallel hash tables herein is provided for computing an aggregate in a parallel computation environment.
As shown in
In general, a user may submit a query from client 205 in the form of a SQL query statement to DBS 210. DBMS 215 may execute the query by evaluating the parameters of the query statement and accessing database 225 as needed to produce a result 230. The result 230 may be provided to client 205 for storage and/or presentation to the user.
One type of query is an aggregation query. As will be explained in greater detail below, a parallel aggregation algorithm, process, or operation may be used to compute SQL aggregates. In general with reference to
As an extension of
The computation environment of
A hash table is a fundamental data structure in computer science that is used for mapping "keys" (e.g., the names of people) to the keys' associated values (e.g., the people's phone numbers) for fast data look-up. A conventional hash table stores key-value pairs. Conventional hash tables are designed for sequential processing.
However, for parallel computation environments there exists a need for data structures particularly suitable for use in such environments. In some embodiments herein, the data structure of an index hash map is provided. In some aspects, the index hash map provides a lock-free, cache-efficient hash data structure developed for parallel computation environments with shared memory. In some embodiments, the index hash map may be adapted to column stores.
In a departure from conventional hash tables, the index hash map herein does not store key-value pairs. Instead, the index hash map generates key-index pairs by mapping each distinct key to a unique integer. In some embodiments, each time a new distinct key is inserted in the index hash map, the index hash map increments an internal counter and assigns the value of the counter to the key to produce a key-index pair. The counter thus provides, at any time, the cardinality of the set of distinct keys inserted in the hash map so far. In some respects, the key-index mapping may be used to share a single hash map among different columns (or value arrays): for a plurality of values distributed among different columns, the associated index for a key has to be calculated just once. The use of key-index pairs may also facilitate bulk insertion in columnar storages. Inserting a set of key-index pairs may entail inserting the keys in a hash map to obtain a mapping vector containing indexes; this mapping vector may then be used to build a value array per value column.
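By way of example, the following minimal, single-threaded C++ sketch illustrates the key-index mechanism and bulk insertion described above; the class and method names (IndexHashMap, indexOf, insertAll) are illustrative assumptions rather than names from any particular embodiment, and a standard std::unordered_map stands in for the actual lock-free structure.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch: each distinct key is mapped to the value of an internal counter,
// yielding key-index pairs instead of key-value pairs.
class IndexHashMap {
 public:
  // Returns the index for the key, assigning the next counter value if the
  // key has not been seen before.
  std::uint32_t indexOf(const std::string& key) {
    auto it = map_.find(key);
    if (it != map_.end()) return it->second;
    std::uint32_t index = counter_++;
    map_.emplace(key, index);
    return index;
  }

  // Bulk insertion: maps a column of keys to a mapping vector of indexes.
  std::vector<std::uint32_t> insertAll(const std::vector<std::string>& keys) {
    std::vector<std::uint32_t> mapping;
    mapping.reserve(keys.size());
    for (const auto& k : keys) mapping.push_back(indexOf(k));
    return mapping;
  }

  // The counter equals the number of distinct keys inserted so far.
  std::uint32_t cardinality() const { return counter_; }

 private:
  std::unordered_map<std::string, std::uint32_t> map_;
  std::uint32_t counter_ = 0;
};

int main() {
  IndexHashMap m;
  std::vector<std::string> column = {"red", "blue", "red", "green", "blue"};
  std::vector<std::uint32_t> mapping = m.insertAll(column);
  for (std::uint32_t i : mapping) std::cout << i << ' ';   // 0 1 0 2 1
  std::cout << "| distinct: " << m.cardinality() << '\n';  // distinct: 3
}
```

Because the mapping vector depends only on the keys, it may be computed once and then reused to build a value array for each of several value columns.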
Referring to
To achieve maximum parallel processor utilization, the index hash maps herein may be designed to avoid locking when operated on by concurrently executing threads, by providing a high degree of data independence. In some embodiments, index hash maps herein may be described by a framework defining a two-step process. In a first step, input data is split into equal-sized blocks and the blocks are assigned to worker execution threads. These worker execution threads produce intermediate results by building relatively small local hash tables or hash maps. Each local hash map is private to the thread that produces it; accordingly, other threads may not see or access the local hash map produced by a given thread.
In a second step, the local hash maps containing the intermediate results may be merged by concurrently executing merger threads to obtain a global result. When accessing and processing the local hash maps, each of the merger threads considers only a dedicated range of hash values. The merger threads thus process hash-disjoint partitions of the local hash maps and produce disjoint result hash tables that may be concatenated to build an overall result.
The second step of the data structure framework herein is depicted in
In some embodiments, when accessing and processing the local hash maps, each of the merger threads considers only a dedicated range of hash values. From a logical perspective, the local hash maps may be considered as being partitioned by their hash values. One implementation may use, for example, the first bits of the hash value to define a range. The same ranges are used for all local hash maps, so the "partitions" of the local hash maps are disjoint. As an example, if a value "a" falls in range 5 of one local hash map, then that value will fall in the same range of every other local hash map. In this manner, all identical values of all local hash maps may be merged into a single result hash map. Since the "partitions" are disjoint, the merged result hash maps may be created without any need for locks. Additionally, further processing on the merged result hash maps may be performed without locks, since any execution threads will be operating on disjoint data.
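The following short C++ sketch shows one way such hash ranges might be derived from the leading bits of a hash value; the choice of a 64-bit hash and four range bits (i.e., 16 ranges) is an assumption made for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Sketch: partition hash codes into 2^kRangeBits disjoint ranges by their
// most significant bits. Because every local hash map applies the same
// function, equal values fall into the same range in every local map, so
// merger threads that each own one range never touch the same data.
constexpr unsigned kRangeBits = 4;   // 16 ranges (assumed)
constexpr unsigned kHashBits = 64;   // 64-bit hash codes (assumed)

inline std::uint32_t rangeOf(std::uint64_t hash) {
  return static_cast<std::uint32_t>(hash >> (kHashBits - kRangeBits));
}

int main() {
  // The top four bits select the range, regardless of the remaining bits.
  assert(rangeOf(0xF123456789ABCDEFULL) == 15);
  assert(rangeOf(0x0123456789ABCDEFULL) == 0);
}
```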
In some embodiments, the local (index) hash maps providing the intermediate results may be of a fixed size. Instead of resizing a local hash map, the corresponding worker execution thread may, when a certain load factor is reached, place the current local hash map into a buffer of hash maps that are ready to be merged and replace it with a new hash map. In some embodiments, the local hash maps may be sized such that they fit in a cache (e.g., L2 or L3); the specific size may depend on the sizes of the caches in a given CPU architecture.
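A minimal sketch of this replace-rather-than-resize behavior is given below; the LocalMap placeholder, the 75% load-factor threshold, and the mutex-guarded handoff into the buffer are all illustrative assumptions (only the buffer handoff is synchronized here; the hash maps themselves are built without locks).

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Sketch: a worker keeps a fixed-size local map and, once a load-factor
// threshold is reached, moves it into a shared buffer of maps that are
// ready to merge and continues with a fresh map.
struct LocalMap {
  std::size_t size = 0;                           // entries currently stored
  static constexpr std::size_t kCapacity = 1024;  // fixed, cache-sized
  bool full() const { return size * 4 >= kCapacity * 3; }  // 75% (assumed)
};

class MergeBuffer {
 public:
  void push(std::unique_ptr<LocalMap> m) {
    std::lock_guard<std::mutex> g(mutex_);  // guards the handoff only
    ready_.push_back(std::move(m));
  }
 private:
  std::mutex mutex_;
  std::vector<std::unique_ptr<LocalMap>> ready_;
};

// Called by a worker after each insertion into its private local map.
void maybeHandOff(std::unique_ptr<LocalMap>& local, MergeBuffer& buffer) {
  if (local->full()) {
    buffer.push(std::move(local));         // hand off instead of resizing
    local = std::make_unique<LocalMap>();  // continue with a fresh map
  }
}

int main() {
  MergeBuffer buffer;
  auto local = std::make_unique<LocalMap>();
  local->size = 768;            // 75% of 1024: declared full
  maybeHandOff(local, buffer);  // old map buffered, fresh map installed
}
```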
In some aspects, insertions and lookups of keys may therefore largely take place in cache. In some embodiments, over-crowded areas within a local hash map may be avoided by maintaining statistical data regarding the local hash maps; the statistical data may indicate when a local hash map should be declared full (independent of its actual load factor). In some embodiments, the size of the buffer holding local hash maps ready to be merged is a tuning parameter: a smaller buffer may induce more merge operations, while a larger buffer requires more memory.
In some embodiments, a global result may be organized into bucketed index hash maps, where each result hash map includes multiple fixed-size physical memory blocks. In this configuration, cache-efficient merging may be realized, and memory allocation becomes more efficient and sustainable since allocated blocks may be shared between queries. In some aspects, when a certain load factor within a global result hash map is reached during a merge operation, the hash map may be resized. Resizing a hash map may be accomplished by increasing its number of memory blocks. Resizing a bucketed index hash map may entail knowing which entries are to be repositioned. In some embodiments, the map's hash function may be chosen such that its codomain increases by adding further least significant bits as needed during a resize operation. In an effort to avoid too many resize operations, an estimate of the final target size may be determined before an actual resizing of the hash map.
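One way such a resize-by-extra-bits scheme might look is sketched below; the initial block count and the use of the low-order hash bits as the bucket index are assumptions for illustration.

```cpp
#include <cstdint>
#include <iostream>

// Sketch: a bucketed result map addressed by the low-order bits of the
// hash. Doubling the number of fixed-size blocks adds one more least
// significant bit to the bucket index, so each entry either stays in its
// block or moves to that block's new "split image"; a full rehash of all
// entries is not required.
struct BucketedMap {
  unsigned blockBits = 3;  // 8 blocks initially (assumed)
  std::uint64_t bucketOf(std::uint64_t hash) const {
    return hash & ((1ULL << blockBits) - 1);  // low blockBits bits
  }
  void grow() { ++blockBits; }  // doubles the number of blocks
};

int main() {
  BucketedMap m;
  std::uint64_t h = 0x2D;              // binary 101101
  std::cout << m.bucketOf(h) << '\n';  // 3 bits -> bucket 5  (101)
  m.grow();
  std::cout << m.bucketOf(h) << '\n';  // 4 bits -> bucket 13 (1101)
}
```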
In some embodiments, the index hash map framework discussed above may provide an infrastructure to implement parallelized query processing algorithms or operations. One embodiment of a parallelized query processing algorithm includes a hash-based aggregation, as will be discussed in greater detail below.
In some embodiments, a parallelized aggregation refers to a relational aggregation that groups and condenses relational data stored in tables. An example of a table that may form an input of a parallel aggregation operation herein is depicted in
Aggregation result tables determined by the four different groupings are illustrated in
In some embodiments, such as the examples of
In an effort to fully utilize the resources of parallel computing environments with shared memory, an aggregation operation should be computed in parallel. If the aggregation were not computed in parallel, the processing performance of the aggregation would be bound by the speed of a single processing unit rather than benefiting from the multiple processing units available in the parallel computing environment.
In some embodiments, a first plurality of execution threads, the aggregator threads, are initially running, while a second plurality of execution threads are not initially running or are in a sleep state. The concurrently operating aggregator threads each fetch an exclusive part of table 605. Partition 610 is fetched by aggregator thread 620 and partition 615 is fetched by aggregator thread 625.
Each of the aggregator threads may read its partition and aggregate the values of the partition into a private local hash table or hash map. Aggregator thread 620 produces private hash map 630 and aggregator thread 625 produces local hash map 635. Since each aggregator thread processes its own separate portion of input table 605 and has its own private hash map, the parallel processing of the partitions may be accomplished lock-free.
In some embodiments, the local hash tables may be the same size as the cache associated with the processing unit executing an aggregator thread. Sizing the local hash tables in this manner may avoid cache misses. In some aspects, input data may be read from table 605 and aggregated into the local hash tables row-wise or column-wise.
When a partition is consumed by an aggregator thread, the aggregator thread may fetch another, unprocessed partition of input table 605. In some embodiments, the aggregator threads move their associated local hash maps into a buffer 640 when the local hash table reaches a threshold size, initiate a new local hash table, and proceed.
In some embodiments, when the number of hash tables in buffer 640 reaches a threshold, the aggregator threads may wake up a second plurality of execution threads, referred to in the present example as merger threads, and the aggregator threads may enter a sleep state. In some embodiments, the local hash maps may be retained in buffer 640 until the entire input table 605 is consumed by the aggregator threads 620 and 625. When the entire input table 605 is consumed by the aggregator threads 620 and 625, the second plurality of execution threads, the merger threads, are awakened and the aggregator threads enter a sleep state.
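The aggregator phase described above might be sketched in C++ as follows, here computing a SUM per group key; the Row/Partition types, the MergeBuffer, the threshold value, and the use of an atomic counter to claim partitions are illustrative assumptions, and the signaling that wakes the merger threads is omitted.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

struct Row { std::string key; std::int64_t value; };
using Partition = std::vector<Row>;
using HashMap = std::unordered_map<std::string, std::int64_t>;
constexpr std::size_t kLocalThreshold = 1024;  // assumed tuning value

struct MergeBuffer {  // stands in for buffer 640
  std::mutex mutex;
  std::vector<HashMap> ready;
  void push(HashMap m) {
    std::lock_guard<std::mutex> g(mutex);  // guards the handoff only; the
    ready.push_back(std::move(m));         // maps are built without locks
  }
};

void aggregatorThread(const std::vector<Partition>& input,
                      std::atomic<std::size_t>& next, MergeBuffer& buffer) {
  HashMap local;                           // private map: no locks needed
  for (;;) {
    std::size_t i = next.fetch_add(1);     // claim an exclusive partition
    if (i >= input.size()) break;          // input table fully consumed
    for (const Row& r : input[i]) local[r.key] += r.value;  // SUM aggregate
    if (local.size() >= kLocalThreshold) { // local map "full": hand it off
      buffer.push(std::move(local));
      local.clear();
    }
  }
  if (!local.empty()) buffer.push(std::move(local));
  // A full implementation would wake the merger threads here.
}

int main() {
  std::vector<Partition> input = {
      {{"a", 1}, {"b", 2}}, {{"a", 3}}, {{"c", 4}, {"b", 5}}};
  std::atomic<std::size_t> next{0};
  MergeBuffer buffer;
  std::thread t1(aggregatorThread, std::cref(input), std::ref(next),
                 std::ref(buffer));
  std::thread t2(aggregatorThread, std::cref(input), std::ref(next),
                 std::ref(buffer));
  t1.join();
  t2.join();
}
```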
Each of the merger threads is responsible for a certain partition of all of the private hash maps in buffer 640. The particular data partition each merger thread is responsible for may be determined by assigning distinct, designated key values of the local hash maps to each of the merger threads. That is, the partition of the data for which each merger thread is responsible may be determined by "key splitting" in the local hash maps. As illustrated in
As further illustrated in
In some embodiments, in the instance a merger thread has processed its data partition and there are additional data partitions in need of being processed, the executing merger threads may acquire responsibility for a new data partition and proceed to process the new data partition as discussed above. In the instance all data partitions are processed, the merger threads may enter a sleep state and the aggregator threads may return to an active, running state. Upon returning to the active, running state, the processes discussed above may repeat.
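A merger thread's work might be sketched as follows, continuing the SUM example above; the range computation from the leading hash bits, the kRangeBits constant, and the use of std::hash are illustrative assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

using LocalMap = std::unordered_map<std::string, std::int64_t>;
constexpr unsigned kRangeBits = 4;  // 16 disjoint ranges (assumed)

// Sketch: the range of a key, derived from the leading bits of its hash.
inline std::uint32_t rangeOf(const std::string& key) {
  std::size_t h = std::hash<std::string>{}(key);
  return static_cast<std::uint32_t>(h >> (sizeof(h) * 8 - kRangeBits));
}

// Sketch of one merger thread: it owns a single range and folds the entries
// of that range from every buffered local map into its own part map. Since
// the ranges are disjoint, no two merger threads ever write the same key,
// and no locking is needed.
LocalMap mergeRange(const std::vector<LocalMap>& buffered,
                    std::uint32_t myRange) {
  LocalMap part;  // thread-local part hash map
  for (const LocalMap& local : buffered)
    for (const auto& [key, sum] : local)
      if (rangeOf(key) == myRange) part[key] += sum;  // merge SUM aggregates
  return part;
}

int main() {
  std::vector<LocalMap> buffered = {{{"a", 1}, {"b", 2}}, {{"a", 3}}};
  // In practice one such call runs per merger thread, each with its own
  // range; together the part maps cover every key exactly once.
  for (std::uint32_t r = 0; r < (1u << kRangeBits); ++r) {
    LocalMap part = mergeRange(buffered, r);
    (void)part;  // each part map would feed into the concatenated result
  }
}
```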
In the instance there is no more data to be processed by the aggregator threads and the merger threads, the parallel aggregation operation herein may terminate. The results of the aggregation process will be contained in the set of part hash maps (e.g., 675 and 680). In some respects, the part hash maps may be seen as forming a parallel result since the part hash maps are disjoint.
In some embodiments, the part hash maps may be processed in parallel. As an example, a HAVING clause may be evaluated and applied to every group, or parallel sorting and merging may be performed thereon.
An overall result may be obtained from the disjoint part hash maps by concatenating them, as depicted in
At S715, a determination is made whether the aggregating of the input table partitions is complete or whether the buffer is full. In the instance additional partitions remain to be aggregated and buffer 640 is not full, whether at the end of aggregating a current partition and/or for other considerations, process 700 returns to further aggregate partitions of the input data and store the aggregated values as key-index pairs in local hash tables. In the instance aggregating of the partitions is complete or the buffer is full, process 700 proceeds to assign designated parts of the local hash tables or hash maps to a second plurality of execution threads at S720. The second plurality of execution threads work to merge the designated parts of the local hash maps into thread-local part hash maps at S725 and to produce result tables.
At S730, a determination is made whether the aggregating is complete. In the instance the aggregating is not complete, process 700 returns to further aggregate partitions of the input data. In the instance aggregating is complete, process 700 proceeds to S735.
At S735, process 700 operates to generate a global result by assembling the results obtained at S725 into a composite result table. In some embodiments, the overall result may be produced by concatenating the part hash maps of S725 to each other.
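Since the part hash maps of S725 hold disjoint key sets, the concatenation step can be sketched very simply; the PartMap alias and the flat result representation are assumptions for illustration.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using PartMap = std::unordered_map<std::string, std::int64_t>;

// Sketch: the part hash maps are disjoint, so the overall result is
// obtained by appending their entries one after another; no duplicate-key
// handling or further merging is required.
std::vector<std::pair<std::string, std::int64_t>> concatenate(
    const std::vector<PartMap>& parts) {
  std::vector<std::pair<std::string, std::int64_t>> result;
  for (const PartMap& part : parts)
    result.insert(result.end(), part.begin(), part.end());
  return result;
}

int main() {
  std::vector<PartMap> parts = {{{"a", 4}}, {{"b", 7}, {"c", 4}}};
  auto result = concatenate(parts);  // three groups in the composite result
}
```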
Each system described herein may be implemented by any number of devices in communication via any number of other public and/or private networks. Two or more of the devices herein may be co-located, may be a single device, or may be located remote from one another and may communicate with one another via any known manner of network(s) and/or a dedicated connection. Moreover, each device may comprise any number of hardware and/or software elements suitable to provide the functions described herein as well as any other functions. Other topologies may be used in conjunction with other embodiments.
All systems and processes discussed herein may be embodied in program code stored on one or more computer-readable media. Such media may include, for example, a floppy disk, a CD-ROM, a DVD-ROM, magnetic tape, and solid-state Random Access Memory (RAM) or Read Only Memory (ROM) storage units. According to some embodiments, a memory storage unit may be associated with access patterns and may be independent from the device type (e.g., magnetic, optoelectronic, semiconductor/solid-state, etc.). Moreover, in-memory technologies may be used such that databases, etc. may be operated entirely in RAM at a processor. Embodiments are therefore not limited to any specific combination of hardware and software.
Embodiments have been described herein solely for the purpose of illustration. Persons skilled in the art will recognize from this description that embodiments are not limited to those described, but may be practiced with modifications and alterations limited only by the spirit and scope of the appended claims.