Large amounts of multidimensional data are generated by large-scale scientific experiments in fields such as, but not limited to, astronomy, physics, remote sensing, oceanography and biology. The volume of data in these fields is approximately doubling each year. These large volumes of scientific data are often stored in databases and need to be analyzed for decision making. The core of analysis in scientific databases is the management of multidimensional arrays. A typical approach is to break the arrays into sub-arrays. These sub-arrays are constructed using different strategies, which include but are not limited to defining the size of the sub-arrays. The sub-array size impacts the performance of I/O access and operator execution. Existing strategies use predefined, fixed-size sub-arrays, which makes it difficult to satisfy the different input parameters of different analysis applications.
a is a diagram illustrating an example of matrix multiplication using a two-level chunking schema.
Scientific activities generate data at unprecedented scale and rate. Massive-scale, multidimensional array management is an important topic to the database community.
Structured Query Language (SQL) is a programming language designed for managing data in database management systems (DBMS). But SQL is awkward at expressing complex operations for query processing, such as BLAS (Basic Linear Algebra Subprograms) operations, which are widely used in statistical computing. Some scientific databases support a declarative query language extending SQL-92 with operations on arrays and provide a C++/JAVA® programming interface. Others define new languages.
Some applications extract data from the database into desktop software packages, such as the statistical package Matlab®, or use custom code (e.g., programmed in the JAVA® or C programming languages). But these approaches cause copy-out overhead and out-of-core problems. For example, the applications run slowly or even crash when the size of the data exceeds the size of physical main memory.
The query language (e.g., SQL) remains the programming language of choice, because the database engine enables users to push computations closer to the physical data by creating user-defined functions (UDFs), reducing the overhead caused by high-volume data movement. The query language typically handles database processing of large amounts of data as arrays.
Arrays are commonly represented in relational database management systems (DBMS) as tables. For example, an array A may be represented as a table A(I, J, . . . K, Value), where I, J, . . . K are attributes of the array A referred to as indices or dimensions. This approach works well in practice for very sparse arrays (i.e., arrays containing many empty values), because the elements with empty values are typically not stored. But for dense arrays (i.e., arrays containing more data and fewer empty values), the indices occupy “expensive” space in terms of processing. For massive-scale datasets, the query processing of such tables is inefficient.
Using a pipeline execution model, the database engine calls get-next( ) to get an element from the array, and then determines a result. With object-relational applications, many database engines use simple array data types or provide APIs for the user to define custom data types. One approach is to break an array into several sub-arrays. This can significantly improve the overall performance of the database engine. But the size of the sub-array impacts the performance of data access (e.g., incurring input/output (I/O) overhead), and operator execution (e.g., incurring processor overhead).
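For purposes of illustration, the pipeline execution model described above may be sketched as follows. The class and function names are hypothetical and are used only to show how an engine pulls one element at a time via get-next( ) calls and accumulates a result.

```python
class ArrayScan:
    """Hypothetical operator that yields array elements one at a time."""

    def __init__(self, elements):
        self._it = iter(elements)

    def get_next(self):
        # Returns the next element, or None when the array is exhausted.
        return next(self._it, None)


def sum_operator(scan):
    """Consumes a scan via repeated get_next() calls, as a pipelined
    database operator would, and determines a result (here, a sum)."""
    total = 0
    while (elem := scan.get_next()) is not None:
        total += elem
    return total
```

Each get-next( ) call retrieves a single element, which is why per-element pipelining becomes expensive for large dense arrays and motivates breaking the array into sub-arrays.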
Two-level chunking for data analytics is described herein. In a two-level chunking approach, an array is divided into a series of basic chunks. These basic chunks can be stored in physical blocks of memory. The chunks can be dynamically combined into a bigger super-chunk. The super-chunk can then be used in various operations.
Before continuing, it is noted that as used herein, the terms “includes” and “including” mean, but are not limited to, “includes” or “including” and “includes at least” or “including at least.” The term “based on” means “based on” and “based at least in part on.”
For each chunk, many data storage layout strategies can be leveraged to convert an n-dimensional (n-D) array into a single-dimensional (1-D) array, such as row-major, column-major, s-order, and z-order. In databases, a chunk can be constructed and stored in two tables which record raw data 116 and metadata 117 separately. For example, for a single disk block, a chunk can be packed into a space that is several kilobytes (KBs) to several hundred megabytes (MBs) in size. The metadata table 117 records the structure information, such as the number of dimensions and the number of chunks in each dimension.
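The row-major layout mentioned above may be sketched as a simple index mapping (a minimal illustration; the function name is not from the original text). It maps an n-D index to the offset of that element in the packed 1-D chunk.

```python
def row_major_offset(index, shape):
    """Map an n-D index to its 1-D offset under row-major layout.

    `index` and `shape` are sequences of equal length; the last
    dimension varies fastest, as in C-style array storage.
    """
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset
```

Column-major layout is the same computation with the dimension order reversed; s-order and z-order interleave the dimension bits instead, trading sequential locality for spatial locality.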
Two types of chunking strategies may be used, including regular (REG) and irregular (IREG) chunking. Using REG chunking, an array is broken into uniform chunks of the same size and shape. For example, an array may be constructed as a Matrix Am,n where m=[1:12] and n=[1:12], as shown in
A similar approach may be used to define super-chunks. An example two-level chunk is illustrated in
Each chunk in the matrix 200 is “packed” with the same shape (represented by the dots in
Two-level chunking may be implemented using single-level storage for n-dimensional (n-D) array management. In general, small chunks are more efficient for simple operations, such as selection queries and dicing queries. To fit the size of one physical block, the chunk may be constructed as 16K or 32K, meaning that only one I/O operation is executed to access each chunk. Larger chunks are generally more efficient for complex operations, such as matrix multiplication.
In first-level chunking, an array is divided into regular and fixed-size chunks (e.g., to form the underlying structure 200 having a height (m) and a width (n)). In second-level chunking, a dynamic schema is implemented on top of the basic chunks in the underlying structure 200. The location of chunk 210 in the underlying structure 200 is a=1 and b=3. The location of super-chunk 220 is at s_a=2 and s_b=0.
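The two-level addressing described above may be sketched as follows. This is an illustrative assumption, not a definition from the original text: it assumes each super-chunk spans an aligned grid of h×w basic chunks, so the super-chunk coordinate containing a given basic chunk follows by integer division.

```python
def super_chunk_of(a, b, h, w):
    """Return the (s_a, s_b) coordinate of the super-chunk containing
    the basic chunk at (a, b), assuming each super-chunk spans an
    aligned h x w grid of basic chunks (an assumption for illustration).
    """
    return (a // h, b // w)
```

Because the mapping is pure arithmetic over the chunk grid, the second level requires no additional storage structure; only the metadata describing h and w is needed at run time.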
The super-chunk 220 is used as the basic computing unit for database operations. The size and/or shape of the super-chunk 220 can be defined (e.g., by the user) according to the complexity of the operator. For example, the height (h) and width (w) of the super-chunk 220 may be defined based on the specific operator.
A range-selection query may be used to construct the super-chunk 320 by dynamically combining fixed-size chunks into a larger assembly. At run time, the operator combines the super-chunk 220 from different matrices. The basic chunks can be combined into a super-chunk 220 at runtime without changing the underlying storage structure. This chunking strategy can be used to achieve an optimum balance between I/O overhead and processor overhead.
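The run-time assembly described above may be sketched as follows (a minimal illustration; the function name and the dict-of-chunks representation are assumptions, not from the original text). A range over the chunk grid selects the fixed-size basic chunks, which are stitched into one larger super-chunk without touching the underlying storage.

```python
def combine_chunks(chunks, a0, b0, h, w):
    """Assemble a super-chunk from a grid of basic chunks.

    `chunks` maps a chunk coordinate (a, b) to a 2-D list holding one
    square basic chunk; (a0, b0) is the top-left chunk coordinate of
    the super-chunk, and h, w give its extent measured in chunks.
    """
    s = len(next(iter(chunks.values())))  # basic chunk side length
    rows = []
    for a in range(a0, a0 + h):           # chunk rows in the range
        for r in range(s):                # element rows within a chunk
            row = []
            for b in range(b0, b0 + w):   # chunk columns in the range
                row.extend(chunks[(a, b)][r])
            rows.append(row)
    return rows
```

Because the basic chunks are never rewritten, different operators can request differently shaped super-chunks over the same stored data.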
For purposes of illustration, the two-level chunking strategy can be better understood as it may be applied to matrix multiplication (although the two-level chunking described herein is not limited to such an example). Matrix multiplication is widely used in statistical computing. For purposes of this illustration, matrix C is the product of matrix A and matrix B. That is, C[m,l] = A[m,n] × B[n,l], where the parameters m and n of matrix A are illustrated in
The height and width of the super-chunk used in Matrix A are given by (h) and (w), respectively. The height and width of the super-chunk used in Matrix B are given by (w) and (h), respectively. The size of each dimension of the basic chunk is given by (s).
Pseudo code for implementing two-level chunks for matrix multiplication may be expressed by the following Algorithm 1.
Algorithm 1 may be better understood with reference to
In the first loop, all super-chunks in matrix A are sequentially scanned from the row coordinate. In the second loop, the corresponding super-chunks in matrix B are sequentially scanned from the column coordinate. Because the size of the super-chunk (e.g., 311a) is typically less than the size of the matrix (e.g., 310), these operations iterate multiple times for each coordinate.
In the example shown in
for (int i = 0; i < 12/(3*2); i++)
    for (int j = 0; j < 6/(3*2); j++)
        for (int k = 0; k < 12/(3*1); k++)
To compute super-chunk 301, i=0, and the loop iterates for j, k. To compute super-chunk 302, i=1 and the loop iterates for j, k.
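The three nested loops above may be sketched as a runnable block-multiplication routine. This is a simplified illustration, not the algorithm as claimed: the function name is hypothetical, plain nested lists stand in for stored chunks, and the matrix dimensions are assumed divisible by the tile sizes.

```python
def block_matmul(A, B, sh, sw):
    """Multiply A (m x n) by B (n x l) super-chunk by super-chunk.

    The outer loops iterate over sh x sw tiles of the result (and the
    matching tiles of A and B along the shared dimension), accumulating
    partial products into C, mirroring the i, j, k loops above.
    """
    m, n, l = len(A), len(A[0]), len(B[0])
    C = [[0] * l for _ in range(m)]
    for i in range(0, m, sh):          # super-chunk rows of A / C
        for j in range(0, l, sw):      # super-chunk columns of B / C
            for k in range(0, n, sh):  # shared dimension, tile by tile
                for r in range(i, i + sh):
                    for c in range(j, j + sw):
                        C[r][c] += sum(A[r][t] * B[t][c]
                                       for t in range(k, k + sh))
    return C
```

The tile sizes sh and sw play the role of the super-chunk height and width: the k loop performs the repeated aggregation over partial products that the super-chunk iteration count implies.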
The following examples illustrate how chunk size, super-chunk size and super-chunk shape enhance I/O and processor performance. The first example shows the results of matrix multiplication using two-level chunking. The operations were executed using a Hewlett-Packard xw8600 workstation with a 4-core, 2.00 GHz CPU and an entry-level NVIDIA GPU Quadro FX 570.
For matrix multiplication, the two input matrices (e.g., Matrix A and Matrix B) were square in shape, and the size of each dimension was selected as 2048. Matrix A was divided into different sizes of square chunks (e.g., 64×64, 128×128, 256×256, 512×512, 1024×1024 and 2048×2048), as shown across the top row in Tables 1 and 2, below. The chunks from Matrix A were combined with different size super-chunks (e.g., 1024×512) from Matrix B, as shown down the first column in Tables 1 and 2, below. Actual performance data for matrix multiplication operations is shown in Table 1 and in Table 2, below.
The results shown in Table 1 indicate that pairing the same size super-chunks in each Matrix (e.g., 64×64 in Matrix A with 64×64 in Matrix B) tended to increase performance. In addition, for the same super-chunk (e.g., reading across a row), the size of the chunk generally had little negative effect on operator performance.
The results shown in Table 2 indicate that even very small tiling does not offer better I/O performance for frequent I/O access. The super-chunk is the basic computing unit in this system, and thus may be involved multiple times for aggregation (see, e.g., 350 in
The second example shows the results of QR factorization using two-level chunking. In linear algebra, QR factorization of a matrix means decomposing the matrix into an orthogonal matrix Q and an upper triangular matrix R. QR factorization may be used, for example, to solve a linear least squares problem. Again, the operations were executed using a Hewlett-Packard xw8600 workstation with a 4-core, 2.00 GHz CPU and an entry-level NVIDIA GPU Quadro FX 570. In this example, a column-oriented super-chunk was used. Different column widths were selected, and the corresponding I/O performance was measured as a function of processing time. The results are shown in
It is recognized that not all multidimensional arrays can be divided by a concrete value. For example, if matrix A includes thirteen items in each row, the matrix is not divisible by four. To address this issue, the data is still stored in one chunk, and empty values are used to fill the outer areas. If the size of one chunk is small (e.g., only 16K or 32K), these empty values do not consume much storage, and thus this is an acceptable solution.
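The padding described above may be sketched as follows (the function name and the use of None as the empty value are illustrative assumptions). A row of thirteen items padded to chunks of four yields sixteen slots, three of which hold empty values.

```python
def pad_to_chunks(row, s, fill=None):
    """Pad a row out to a multiple of the chunk size s with empty
    values, so it divides evenly into fixed-size chunks."""
    remainder = len(row) % s
    if remainder:
        row = row + [fill] * (s - remainder)
    return row
```

Because the metadata table records the true dimension sizes, the padded cells are ignored during computation and do not affect results.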
It is also recognized that not all arrays can be divided by the size of the super-chunk. Again, the same strategy may be adopted. The metadata table records all dimension information, and so this approach does not impact the final results or cause errors.
Before continuing, it should be noted that two-level chunking for data analytics may be implemented in a database environment. The database(s) may include any content. There is no limit to the type or amount of content that may be used. In addition, the content may include unprocessed or “raw” data, or the content may undergo at least some level of processing.
The operations described herein may be implemented in a computer system configured to execute database program code. In an example, the program code may be implemented in machine-readable instructions (such as but not limited to, software or firmware). The machine-readable instructions may be stored on a non-transient computer readable medium and are executable by one or more processors to perform the operations described herein. The program code executes the function of the architecture of machine-readable instructions as self-contained modules. These modules can be integrated within a self-standing tool, or may be implemented as agents that run on top of existing program code. However, the operations described herein are not limited to any specific implementation with any particular type of program code.
The examples described above are provided for purposes of illustration, and are not intended to be limiting. Other devices and/or device configurations may be utilized to carry out the operations described herein.
Operation 510 includes dividing an array into fixed-size chunks. Operation 520 includes dynamically combining the fixed-size chunks into a super-chunk. A size of the super-chunk may be based on parameters of a subsequent operation. The size of the super-chunk may be determined at run time. For example, the chunk size may be selected to be between about 16K and 32K.
For purposes of illustration, the subsequent operation may be matrix multiplication. Matrix multiplication may include iterating over chunks to join matrix A and matrix B and outputting result matrix C, and using range-selection queries for super-chunk A, super-chunk B, and super-chunk C. Matrix multiplication may also include breaking super-chunk C into a set of chunks, and returning matrix C having a format of the set of chunks.
It is noted that two-level chunking for data analytics is not limited to use with matrix multiplication. Two-level chunking for data analytics may be implemented with other statistical computing and execution workflows.
The operations shown and described herein are provided to illustrate example implementations. It is noted that the operations are not limited to the ordering shown. Still other operations may also be implemented.
Still further operations may include using range-selection queries for dynamically combining the fixed-size chunks into the super-chunk. Operations may include accessing each chunk with only one input/output (I/O) operation. Operations may also include dynamically combining fixed-size chunks into a super-chunk.
The operations may be implemented at least in part using an end-user interface (e.g., a web-based interface). In an example, the end-user is able to make predetermined selections, and the operations described above are implemented on a back-end device to present results to the user. The user can then make further selections. It is also noted that various of the operations described herein may be automated or partially automated.
It is noted that the examples shown and described are provided for purposes of illustration and are not intended to be limiting. Still other examples are also contemplated.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US12/29275 | 3/15/2012 | WO | 00 | 9/11/2014 |