Array processing has wide application in many areas including machine learning, graph analysis and image processing. The importance of such arrays has led to new storage and analysis systems, such as array-oriented databases (AODBs). An AODB is organized based on a multi-dimensional array data model and supports structured query language (SQL)-type queries with mathematical operators to be performed on arrays, such as operations to join arrays, operations to filter an array, and so forth. AODBs have been applied to a wide range of applications, including seismic analysis, genome sequencing, algorithmic trading and insurance coverage analysis.
An array-oriented database (AODB) may be relatively more efficient than a traditional database for complex multi-dimensional analyses, such as analyses that involve dense matrix multiplication, K-means clustering, sparse matrix computation and image processing, just to name a few. The AODB may, however, become overwhelmed by the complexity of the algorithms and the dataset size. Systems and techniques are disclosed herein for purposes of efficiently processing queries to an AODB-based system by distributing the processing of the queries among central processing units (CPUs) and co-processors.
A co-processor, in general, is supervised by a CPU, as the co-processor may be limited in its ability to perform some CPU-like functions (such as retrieving instructions from system memory). However, including one or multiple co-processors in the processing of queries to an AODB-based system takes advantage of the co-processor's ability to perform array-based computations. In this manner, a co-processor may have a relatively large number of processing cores, as compared to a CPU. For example, a co-processor such as the NVIDIA Tesla M2090 graphics processing unit (GPU) may have 16 multi-processors, each having 32 processing cores, for a total of 512 processing cores. This is in comparison to a given CPU, which may have, for example, 8 or 16 processing cores. Although a given CPU processing core may possess significantly more processing power than a given co-processor processing core, the relatively large number of co-processor processing cores, combined with their ability to process data in parallel, makes the co-processor well suited for array computations, which often involve performing the same operation on a large number of array entries.
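As a concrete illustration of this kind of data-parallel array computation, the following CUDA C++ sketch (not taken from the disclosure; the kernel name and the scale-and-add operation are illustrative) shows how every GPU thread applies the same operation to a different array entry.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: each GPU thread applies the same operation
// (here, a scale-and-add) to exactly one array entry.
__global__ void scaleAndAdd(const float* in, float* out,
                            float scale, float offset, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * scale + offset;
    }
}

// Host-side launch: one thread per array entry, 256 threads per block.
void launchScaleAndAdd(const float* d_in, float* d_out,
                       float scale, float offset, int n)
{
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleAndAdd<<<blocks, threadsPerBlock>>>(d_in, d_out, scale, offset, n);
}
```

Each 256-thread block covers 256 entries, so the block count is simply the entry count rounded up to the next multiple of the block size.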
For example implementations disclosed herein, the co-processor is a graphics processing unit (GPU), although other types of co-processors (digital signal processing (DSP) co-processors, floating-point arithmetic co-processors, and so forth) may be used, in accordance with further implementations.
In accordance with example implementations, the GPU(s) and CPU(s) of an AODB system may be disposed on at least one computer (a server, a client, an ultrabook computer, a desktop computer, and so forth). More specifically, the GPU may be disposed on an expansion card of the computer and may communicate with components of the computer over an expansion bus, such as a Peripheral Component Interconnect Express (PCIe) bus, for example. The expansion card may contain a local memory, which is separate from the main system memory of the computer; and a CPU of the computer may use the PCIe bus for purposes of transferring data and instructions to the GPU's local memory so that the GPU may access the instructions and data for processing. Moreover, when the GPU produces data as a result of this processing, the data is stored in the GPU's local memory; and a CPU may likewise use PCIe bus communications to instruct the transfer of data from the GPU's local memory to the system memory.
The GPU may be located on a bus other than a PCIe bus in further implementations. Moreover, in further implementations, the GPU may be a chip or chip set that is integrated into the computer, and as such, the GPU may not be disposed on an expansion card.
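The move-in/move-out pattern described above can be sketched with the standard CUDA runtime API; the buffer names and the one-million-entry chunk size below are illustrative assumptions, and only minimal error handling is shown.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Moves a data chunk from system memory into the GPU's local memory,
// leaves a placeholder where GPU processing would occur, then copies
// the result back to system memory.
int main()
{
    const size_t n = 1 << 20;                   // illustrative chunk of 1M floats
    std::vector<float> hostChunk(n, 1.0f);      // chunk staged in system memory
    std::vector<float> hostResult(n, 0.0f);

    float* deviceChunk = nullptr;
    if (cudaMalloc(&deviceChunk, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "GPU allocation failed\n");
        return 1;
    }

    // "Move in": system memory -> GPU local memory (over PCIe on a discrete GPU).
    cudaMemcpy(deviceChunk, hostChunk.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);

    // ... GPU kernels would process deviceChunk here ...

    // "Move out": GPU local memory -> system memory.
    cudaMemcpy(hostResult.data(), deviceChunk, n * sizeof(float),
               cudaMemcpyDeviceToHost);

    cudaFree(deviceChunk);
    return 0;
}
```

On a discrete GPU the two cudaMemcpy calls correspond to the PCIe transfers between system memory and the GPU's local memory; on an integrated GPU the same calls may resolve to copies within shared system memory.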
In general, the user input 150 may be a query or a user-defined function. Regardless of its particular form, the user input 150 defines an operation to be performed by the database system 100. In this manner, a query, in general, may use operators that are part of the set of operators defined by the AODB, whereas the user-defined function allows the user to specify custom algorithms and/or operations on array data.
A given user input 150 may be associated with one or multiple units of data called “data chunks” herein. As an example, a given array operation that is described by a user input 150 may be associated with partitions of one or multiple arrays, and each chunk corresponds to one of the partitions. The system 100 distributes the compute tasks for the data chunks among one or multiple CPUs 112 and one or multiple GPUs 114 of the system 100. In this context, a “compute task” may be viewed as the compute kernel for a given data chunk. Each CPU 112 may have one or multiple processing cores (8 or 16 processing cores, as an example); and each CPU processing core is a potential candidate for executing a thread to perform a given compute task. Each GPU 114 may also contain one or multiple processing cores (512 processing cores, as an example); and the processing cores of the GPU 114 may perform a given compute task assigned to the GPU 114 in parallel.
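The disclosure does not specify data structures, but one hypothetical way to express the partition-to-chunk mapping is sketched below; the DataChunk fields and the partitionArray helper are illustrative names, not elements of the system 100.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical data chunk: one rectangular partition of a 2-D array.
struct DataChunk {
    std::size_t        rowStart, rowCount;   // where the partition sits
    std::size_t        colStart, colCount;
    std::vector<float> values;               // rowCount * colCount entries
};

// Partition a row-major 2-D array into chunks of up to chunkRows x chunkCols
// cells; each chunk becomes the input of one compute task.
std::vector<DataChunk> partitionArray(const std::vector<float>& array,
                                      std::size_t rows, std::size_t cols,
                                      std::size_t chunkRows, std::size_t chunkCols)
{
    std::vector<DataChunk> chunks;
    for (std::size_t r = 0; r < rows; r += chunkRows) {
        for (std::size_t c = 0; c < cols; c += chunkCols) {
            DataChunk chunk{r, std::min(chunkRows, rows - r),
                            c, std::min(chunkCols, cols - c), {}};
            chunk.values.reserve(chunk.rowCount * chunk.colCount);
            for (std::size_t i = 0; i < chunk.rowCount; ++i)
                for (std::size_t j = 0; j < chunk.colCount; ++j)
                    chunk.values.push_back(array[(r + i) * cols + (c + j)]);
            chunks.push_back(std::move(chunk));
        }
    }
    return chunks;
}
```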
For the foregoing example, it is assumed that the AODB system 100 is formed from one or multiple physical machines 110, such as example physical machine 110-1. In general, the physical machines 110 are actual machines that are made up of actual hardware and actual machine executable instructions, or “software.”
Based on the schedule indicated by the data in the queue 127, the executor 126 retrieves corresponding data chunks 118 from the storage 117 and stores the chunks 118 in the system memory 130. For a CPU-executed compute task, the executor 126 initiates execution of the compute task by the CPU(s) 112; and the CPU(s) 112 access the data chunks from the system memory 130 for purposes of performing the associated compute tasks. For a GPU-executed task, the executor 126 may transfer the appropriate data chunks from the system memory 130 into the GPU's local memory 115 (via a PCIe bus transfer, for example).
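A minimal sketch of this dispatch step is shown below, assuming a queue of tagged tasks: CPU-tagged tasks are handed to host threads, while GPU-tagged tasks are first copied into the GPU's local memory with the CUDA runtime API. The type and function names are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

enum class ProcessorTag { CPU, GPU };

// A scheduled compute task: the chunk (staged in system memory) plus its tag.
struct ScheduledTask {
    std::vector<float> chunk;
    ProcessorTag       tag;
};

// Illustrative CPU-side kernel for a chunk.
void cpuKernel(std::vector<float>& chunk)
{
    for (float& v : chunk) v *= 2.0f;
}

// Executor loop: CPU-tagged tasks run on host threads; GPU-tagged tasks are
// first moved into the GPU's local memory, where a device kernel would run.
void runExecutor(std::queue<ScheduledTask>& queue)
{
    std::vector<std::thread> cpuWorkers;
    while (!queue.empty()) {
        ScheduledTask task = std::move(queue.front());
        queue.pop();
        if (task.tag == ProcessorTag::CPU) {
            cpuWorkers.emplace_back(
                [t = std::move(task)]() mutable { cpuKernel(t.chunk); });
        } else {
            std::size_t bytes = task.chunk.size() * sizeof(float);
            float* deviceChunk = nullptr;
            cudaMalloc(&deviceChunk, bytes);
            cudaMemcpy(deviceChunk, task.chunk.data(), bytes,
                       cudaMemcpyHostToDevice);
            // ... launch a device kernel on deviceChunk, copy results back ...
            cudaFree(deviceChunk);
        }
    }
    for (std::thread& w : cpuWorkers) w.join();
}
```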
The AODB database 120 further includes a size regulator, or size optimizer 124, that regulates the data chunk sizes for compute task processing. In this manner, although the data chunks 118 may be sized for efficient transfer of the chunks 118 from the storage 117 (and for efficient transfer of processed data chunks to the storage 117), the size of the data chunk 118 may not be optimal for processing by a CPU 112 or a GPU 114. Moreover, the optimal size of the data chunk for CPU processing may be different than the optimal size of the data chunk for GPU processing.
In accordance with some implementations, the AODB database 120 recognizes that the chunk size influences the performance of the compute task processing. In this manner, for efficient GPU processing, relatively large chunks may be beneficial because (as examples) relatively larger chunks are more efficiently transferred into and out of the GPU's local memory 115 (via PCIe bus transfers, for example), which reduces data transfer overhead; and relatively larger chunks enhance GPU processing efficiency, as the GPU's processing cores have a relatively large amount of data to process in parallel. This is to be contrasted with the chunk size for CPU processing, as a smaller chunk size may enhance data locality and reduce the overhead of allocating the data to be processed among CPU 112 threads.
The size optimizer 124 regulates the data chunk size based on the processing entity that performs the related compute task on that chunk. For example, the size optimizer 124 may load relatively large data chunks 118 from the storage 117 and store relatively large data chunks in the storage 117 for purposes of expediting communication of this data to and from the storage 117. The size optimizer 124 selectively merges and partitions the data chunks 118 to produce modified-size data chunks based on the processing entity that processes these chunks. In this manner, in accordance with an example implementation, the size optimizer 124 partitions the data chunks 118 into multiple smaller data chunks when these chunks correspond to compute tasks that are performed by a CPU 112 and stores these partitioned chunks along with the corresponding CPU tags in the queue 127. In contrast, the size optimizer 124 may merge two or multiple data chunks 118 together to produce a relatively larger data chunk for GPU-based processing; and the size optimizer 124 may store this merged chunk in the queue 127 along with the appropriate GPU tag.
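A simple sketch of this size regulation is given below; the target partition size and the helper names are illustrative assumptions rather than parameters defined by the size optimizer 124.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Chunk = std::vector<float>;

// Partition one storage-sized chunk into smaller chunks for CPU threads.
// targetSize is an illustrative tuning knob, not a value from the disclosure.
std::vector<Chunk> partitionForCpu(const Chunk& chunk, std::size_t targetSize)
{
    std::vector<Chunk> parts;
    for (std::size_t i = 0; i < chunk.size(); i += targetSize) {
        std::size_t end = std::min(i + targetSize, chunk.size());
        parts.emplace_back(chunk.begin() + i, chunk.begin() + end);
    }
    return parts;
}

// Merge several chunks into one larger chunk for GPU processing, so that a
// single move-in transfer carries more data and the GPU cores stay busy.
Chunk mergeForGpu(const std::vector<Chunk>& chunks)
{
    Chunk merged;
    for (const Chunk& c : chunks) {
        merged.insert(merged.end(), c.begin(), c.end());
    }
    return merged;
}
```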
In accordance with example implementations, the executor 126 may further decode, or convert, the data chunk into a format that is suitable for the processing entity that performs the related compute task. For example, the data chunks 118 may be stored in the storage 117 in a triplet format.
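The disclosure names a triplet format but does not define it here; the sketch below assumes the common (row index, column index, value) layout and decodes such a chunk into a dense row-major buffer, which is one format that suits the GPU's parallel processing cores.

```cpp
#include <cstddef>
#include <vector>

// Assumed triplet entry: (row index, column index, value).
struct Triplet {
    std::size_t row;
    std::size_t col;
    float       value;
};

// Decode a triplet-format chunk into a dense row-major buffer, a layout the
// GPU's cores can process with regular, coalesced accesses.
std::vector<float> decodeToDense(const std::vector<Triplet>& triplets,
                                 std::size_t rows, std::size_t cols)
{
    std::vector<float> dense(rows * cols, 0.0f);
    for (const Triplet& t : triplets) {
        dense[t.row * cols + t.col] = t.value;
    }
    return dense;
}
```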
In accordance with further implementations, the scheduler 134 may employ a dynamic assignment policy based on metrics that are provided by a monitor 128 of the AODB database 120. In this manner, the monitor 128 may monitor such metrics as CPU utilization, CPU compute task processing time, GPU utilization, GPU compute task processing time, the number of concurrent GPU tasks and so forth; and based on these monitored metrics, the scheduler 134 dynamically assigns the compute tasks, which provides the scheduler 134 with the flexibility to tune performance at runtime. In accordance with example implementations, the scheduler 134 may make the assignment decisions based on the metrics provided by the monitor 128 and static policies. For example, the scheduler 134 may assign a certain percentage of compute tasks to the GPU(s) 114 until a fixed limit on the number of concurrent GPU tasks is reached or until the GPU compute task processing time decreases below a certain threshold. Thus, in accordance with some implementations, the scheduler 134 may exhibit a bias toward assigning compute tasks to the GPU(s) 114. This bias, in turn, takes advantage of a potentially faster compute task processing time by the GPU 114.
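One plausible reading of such a policy is sketched below: the GPU is preferred while the number of concurrent GPU tasks stays under a fixed limit and the recent GPU task time stays under a threshold, and otherwise the task falls back to a CPU. The metric fields, limits and thresholds are illustrative assumptions, not values from the disclosure.

```cpp
#include <cstddef>

enum class ProcessorTag { CPU, GPU };

// Metrics of the kind a runtime monitor might report.
struct Metrics {
    std::size_t concurrentGpuTasks;   // GPU tasks currently in flight
    double      gpuTaskTimeMs;        // recent GPU compute-task processing time
    double      cpuUtilization;       // 0.0 .. 1.0
};

// Illustrative dynamic policy biased toward the GPU: assign to the GPU until
// either the concurrent-task limit is reached or GPU task time crosses a
// threshold, then fall back to the CPU.
ProcessorTag assignTask(const Metrics& m,
                        std::size_t maxConcurrentGpuTasks,
                        double gpuTimeThresholdMs)
{
    if (m.concurrentGpuTasks < maxConcurrentGpuTasks &&
        m.gpuTaskTimeMs < gpuTimeThresholdMs) {
        return ProcessorTag::GPU;
    }
    return ProcessorTag::CPU;
}
```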
In an example work flow 200, the CPU(s) 112 process the data chunks 210 to form corresponding chunks 212 that are communicated back to the storage 117. The data chunks 218 for the GPU job may be further decoded, or reformatted (as indicated by reference numeral 220), to produce corresponding reformatted data chunks 221 that are moved in (as illustrated by reference numeral 222) to the GPU's memory 115 (via a PCIe bus transfer, for example) to form local blocks 223 to be processed by the GPU(s) 114. After GPU processing 224 produces data blocks 225, the work flow 200 includes moving the blocks 225 out of the GPU local memory 115 (as indicated at reference numeral 226), such as by a PCIe bus transfer, to produce blocks 227; and encoding (as indicated by reference numeral 228) the blocks 227 (using the CPU, for example) to produce reformatted blocks 230 that are then transferred to the storage 117.
While a limited number of examples have been disclosed herein, those skilled in the art, having the benefit of this disclosure, will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations.