1. Description Regarding Related Application
The present invention is based on and claims priority from Japanese Patent Application No. 2013-221305 (filed on Oct. 24, 2013), the entire description of which is incorporated herein by reference.
The present invention relates to an information processing device, an information processing method, and a program and relates particularly to an information processing device, an information processing method, and a program for storing tuples in a column-oriented database.
2. Background Art
Recently, there is a demand for a technique for analyzing in real-time a large volume of data which changes every moment, such as position information. As such, high data insertion performance is desired in addition to high speed reference performance regarding a database.
When high speed reference performance is desired, a column-oriented database is used. The column-oriented database stores data segmented by attribute (column), which enables high Input/Output (IO) efficiency and allows high speed execution of reference queries (NPL 1).
As a related technique, PTL 1 describes a shared data processing system which prevents, in accesses from a plurality of systems to shared data in a shared storage device, a situation where only one of the systems is exclusively allowed to access the data and which does not require any exclusive control, such as lock application. PTL 2 describes a processing system including a plurality of memory sharing processors configured to execute jobs in parallel and a means for ensuring data consistency.
The entire contents disclosed in PTLs and NPL listed above are incorporated herein by reference. The following analysis was made by the inventors of the present invention.
For real-time analysis of data being generated in a large volume, the data must be stored at high speed. What is needed, therefore, is a technique for reducing processing time by carrying out data storing processes in parallel using computational resources, such as a multicore Central Processing Unit (CPU) or a plurality of computers. However, even when data storing processes are carried out in parallel, each instance of data needs to be stored in the database so that it can be retrieved in a complete form. Among the ACID (Atomicity, Consistency, Isolation, Durability) properties of a database transaction, this property is called “Isolation (I)”.
Description is given below of a method of managing data in the column-oriented database based on a specific example. First, description is given of tabular data with reference to
In the column-oriented database, the tuples, each formed by N columns (N attributes), are segmented and managed in units of M (≦N) columns.
Description is given of a problem which may occur when two new instances of tuple data, i.e., (Tuple 1)={MS-05, 1981, 3000} and (Tuple 2)={MS-09, 1982, 2000} are to be stored in the column-oriented database configured to manage data as described above with reference to
A first conceivable method is to perform exclusive control on processes among the tuples. An example is a method of storing the data of Tuple 2 after the completion of storing the data of Tuple 1. The storing process for a single tuple is equivalent to the storing process of three columns.
When the processes for the respective columns are carried out in successive order in the first method, the only process that can be carried out at any one time is the storing process for a single column, and hence it becomes difficult to improve the performance by the use of computational resources, such as the multicore CPU or the plurality of computers.
On the other hand, when the instances of data of the columns are processed in parallel in the first method, the following problem occurs. When exclusive control is performed on the processes among the respective tuples and the processes for the columns within each tuple are carried out in parallel, the following procedure is carried out: (1) acquire a lock; (2) carry out the processes for the respective columns in parallel; (3) wait for the completion of the processes for all the columns; and (4) release the lock. In step (3) of the above procedure, the processes are synchronized, which increases the calculation cost and makes it difficult to achieve high efficiency of parallel execution. Especially when the storing processes for the columns are performed by different processes or by different computers, the cost of synchronizing the processes further increases.
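The four-step procedure above may be sketched as follows. This is only a minimal illustration under stated assumptions: the in-memory column lists, the thread pool, and the names `columns`, `tuple_lock`, and `store_tuple_with_lock` are hypothetical and do not appear in the related art. The tuple values are those of Tuple 1 and Tuple 2 used in this description.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical in-memory data areas, one list per column (illustrative only).
columns = {"ColA": [], "ColB": [], "ColC": []}
tuple_lock = threading.Lock()
pool = ThreadPoolExecutor(max_workers=3)


def store_tuple_with_lock(values):
    """Store one tuple under exclusive control, with its columns stored in parallel."""
    with tuple_lock:  # (1) acquire the lock
        # (2) carry out the storing process for each column in parallel
        futures = [pool.submit(columns[name].append, value)
                   for name, value in values.items()]
        # (3) wait for all columns: this is the synchronization point
        wait(futures)
    # (4) the lock is released on leaving the with-block


store_tuple_with_lock({"ColA": "MS-05", "ColB": 1981, "ColC": 3000})
store_tuple_with_lock({"ColA": "MS-09", "ColB": 1982, "ColC": 2000})
pool.shutdown()
```

In this sketch the second tuple cannot start until every column of the first tuple has finished, so the degree of parallelism never exceeds the number of columns of a single tuple, and every tuple pays the cost of the wait in step (3).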
As described above, the first method, in which exclusive control is performed on the processes among the tuples, has the problem of not being able to improve the performance by the use of adequate computation resources, such as the multicore CPU or the plurality of computers.
A second conceivable method is to execute the processes for the columns in parallel without performing exclusive control among the tuples. However, with the second method, the processing sequence of the instances of tuple data may become inconsistent among the columns. For example, when the instances of data of Tuple 1 and Tuple 2 are stored in this order for ColA while the instances of data of Tuple 2 and Tuple 1 are stored in this order for ColB, the values of Tuple 1 and Tuple 2 are mixed and stored as mixed tuples, and therefore it is difficult to ensure isolation of the data processes.
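The mixing of values described above may be reproduced deterministically with the following sketch. The storing order for ColA and ColB follows the example in the text; the order for ColC and the names `arrival`, `columns`, and `rows` are assumptions introduced only for illustration.

```python
tuple1 = {"ColA": "MS-05", "ColB": 1981, "ColC": 3000}
tuple2 = {"ColA": "MS-09", "ColB": 1982, "ColC": 2000}

# Order in which each column happens to store the tuples.
# ColA: Tuple 1 then Tuple 2; ColB: Tuple 2 then Tuple 1 (as in the text).
# The order for ColC is an assumption.
arrival = {
    "ColA": [tuple1, tuple2],
    "ColB": [tuple2, tuple1],
    "ColC": [tuple1, tuple2],
}

# Each column appends only its own attribute value, in its own arrival order.
columns = {name: [t[name] for t in order] for name, order in arrival.items()}

# Reading position i of every column reconstructs the i-th stored tuple.
rows = list(zip(columns["ColA"], columns["ColB"], columns["ColC"]))
print(rows)
```

The first reconstructed row combines ColA and ColC of Tuple 1 with ColB of Tuple 2, so neither stored row matches an input tuple.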
Note that the above-described problems are not solved even with the techniques described in PTLs 1 and 2.
To address these problems, there is a demand for accelerating the processes of storing a plurality of instances of tuple data in tables, each instance of tuple data including complex attributes, while ensuring isolation. The present invention aims to provide an information processing device, an information processing method, and a program that contribute to meeting this demand.
An information processing device according to a first aspect of the present invention includes:
a storage unit which stores a plurality of instances of attribute data included in a tuple as a plurality of tables differing for each attribute;
a sequence determination unit which segments a first process of inserting a plurality of tuples into the plurality of tables, into a plurality of second processes in a unit of attribute, and determines a processing sequence of the plurality of second processes; and
a pipeline processing unit which carries out the plurality of second processes in pipelining according to the processing sequence.
An information processing method according to a second aspect of the present invention, performed by an information processing device, includes:
a step of storing, in a storage unit, a plurality of instances of attribute data included in a tuple as a plurality of tables differing for each attribute;
a step of segmenting a first process of inserting a plurality of tuples into the plurality of tables into a plurality of second processes in a unit of attribute;
a step of determining a processing sequence of the plurality of second processes; and
a step of carrying out the plurality of second processes in pipelining according to the processing sequence.
A program according to a third aspect of the present invention causes a computer of an information processing device to implement processes of:
storing, in a storage unit, a plurality of instances of attribute data included in a tuple as a plurality of tables differing for each attribute;
segmenting a first process of inserting a plurality of tuples into the plurality of tables into a plurality of second processes in a unit of attribute;
determining a processing sequence of the plurality of second processes; and
carrying out the plurality of second processes in pipelining according to the processing sequence.
Note that the program may be provided as a program product being a non-transitory computer-readable storage medium in which the program is stored.
With the information processing device, the information processing method, and the program according to the present invention, it is possible to accelerate the processes of storing a plurality of instances of tuple data in tables, each instance of tuple data including complex attributes, while ensuring isolation.
First, an outline of exemplary embodiments is described. Note that the reference signs from the drawings included in this outline are provided solely for illustrative purpose to aid the understanding and are not intended to limit the present invention to any mode illustrated in the drawings.
In the example presented in
Here, the pipeline processing unit 20 may include a plurality of stage execution units 22P, 22Q, . . . , and 22X configured to execute the plurality of second processes in pipelining, and the sequence determination unit 10 may assign the plurality of second processes to the plurality of stage execution units 22P, 22Q, . . . , and 22X according to the determined processing sequence. In this case, each of the plurality of stage execution units 22P, 22Q, . . . , and 22X executes the process assigned to it among the plurality of second processes, in the same sequence for the plurality of tuples.
In the example in
With the information processing device, it is possible to accelerate the process of storing a plurality of instances of tuple data in tables, each instance of tuple data including complex attributes, while ensuring isolation.
Next, an information processing device according to a first exemplary embodiment is described in detail with reference to the drawings. In this exemplary embodiment, the information processing device stores tuples including a plurality of attributes by each attribute in bulk.
The pipeline processing unit 20 includes a plurality of stage execution units 22P, 22Q, and 22R. The stage execution units 22P, 22Q, and 22R include first-in-first-out (FIFO) queues 24P, 24Q, and 24R, each of which is configured to store processes, and data processing units 26P, 26Q, and 26R, respectively.
The data processing unit 26P of the stage execution unit 22P carries out the process extracted (dequeued) from the queue 24P and adds (enqueues) the process to the queue 24Q of the subsequent stage execution unit 22Q. Similarly, the data processing unit 26Q of the stage execution unit 22Q carries out the process extracted from the queue 24Q and adds the process to the queue 24R of the subsequent stage execution unit 22R.
The storage unit 30 stores the instances of data for each column (attribute) in bulk.
Note that although the storage unit 30 is configured to manage the instances of data for each column in bulk in this exemplary embodiment, the present invention is not limited to this. For example, the storage unit 30 may be configured to manage the instances of data for each group of a plurality of columns. The number of columns may differ among the tables stored in the storage unit 30. Furthermore, although the number of stage execution units is three in this exemplary embodiment, i.e., the stage execution units 22P, 22Q, and 22R, this is merely an example and the present invention is not limited to this.
[Operation]
<Preparation for Pipeline Process>
Preparation for a pipeline process is described with reference to FIG. 3. First, the sequence determination unit 10 segments the tuple data storing process into a plurality of stages (Step A1). Here, as an example, assume a case where the storing of the tuples including three columns is segmented into three stages by column. The process in each stage corresponds to the process of storing instances of data of a single column in a corresponding one of data areas for respective columns in the storage unit 30.
Next, the sequence determination unit 10 determines a sequence in which the stages are to be executed (Step A2). Here, as an example, the processing sequence of the stages is assumed to be ColA, ColB, and then ColC.
Next, the sequence determination unit 10 sets the processes of the stages in the pipeline processing unit 20 (Step A3). Here, the three stage execution units 22P, 22Q, and 22R are provided for the three respective stages. The stage execution units 22P, 22Q, and 22R execute the processes of storing ColA, ColB, and ColC, respectively. Information on the queue of the subsequent stage is set in each preceding data processing unit so that the subsequent process is carried out after the completion of the process by each stage execution unit.
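Steps A1 to A3 may be pictured with the following minimal sketch. The `Stage` class and the `prepare_pipeline` function are hypothetical names introduced for illustration, and the sketch assumes that the processing sequence is simply the order in which the column names are given.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Stage:
    """One stage execution unit: a FIFO queue plus a link to the next stage."""
    column: str
    queue: deque = field(default_factory=deque)
    next_stage: Optional["Stage"] = None


def prepare_pipeline(column_names):
    # Step A1: segment the tuple storing process into one stage per column.
    stages = [Stage(name) for name in column_names]
    # Step A2: the processing sequence is the given column order (assumption).
    # Step A3: give each preceding stage the information on the subsequent queue.
    for preceding, subsequent in zip(stages, stages[1:]):
        preceding.next_stage = subsequent
    return stages


stages = prepare_pipeline(["ColA", "ColB", "ColC"])
```

The last stage keeps `next_stage` as `None`, which corresponds to there being no subsequent queue after the stage for ColC.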
<Tuple Storing Process>
Next, the operation of actually storing data is described with reference to
Each of the stage execution units 22P, 22Q, and 22R operates according to the flowchart in
Note that the execution sequence of Step B3 and Step B4 in
Then, the data processing unit 26P of the stage execution unit 22P starts the storing process for the instance of tuple data of TID=2. In parallel with the start of the process for the instance of the tuple data corresponding to TID=2 of the stage execution unit 22P, the data processing unit 26Q of the stage execution unit 22Q extracts TID=1 from the queue 24Q (Step B2) and stores TID=1 in the queue 24R of the subsequent stage execution unit 22R (Step B3). Then, the data processing unit 26Q stores the data “2010” corresponding to ColB of the tuple of TID=1 in an area 32Q of ColB in the storage unit 30 (Step B4).
A similar process is carried out also in the stage execution unit 22R, and the storing processes for the respective columns are carried out simultaneously in parallel.
As the processes for the respective columns retain the first insertion sequence in the queue 24P, isolation of the processes is ensured.
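The behavior of the stage execution units in this exemplary embodiment may be sketched as below, with one thread per stage and one FIFO queue per stage. This is an illustrative sketch only: the `run_stage` function, the `STOP` marker used to end the threads, and the tuples for TID=2 and TID=3 are assumptions. The values for TID=1 follow the example in this description.

```python
import queue
import threading

# TID -> (ColA, ColB, ColC). Only the TID=1 values come from the description.
tuples = {
    1: ("MX-30", 2010, 3000),
    2: ("MS-05", 1981, 3000),
    3: ("MS-09", 1982, 2000),
}
STOP = None  # hypothetical end-of-input marker, not part of the description


def run_stage(col_index, area, in_q, out_q):
    while True:
        tid = in_q.get()              # Step B2: extract a TID from the own queue
        if out_q is not None:
            out_q.put(tid)            # Step B3: pass it to the subsequent queue
        if tid is STOP:
            break
        area.append((tid, tuples[tid][col_index]))  # Step B4: store the column value


areas = [[], [], []]                  # data areas for ColA, ColB, ColC
queues = [queue.Queue() for _ in range(3)]
threads = []
for i in range(3):
    out_q = queues[i + 1] if i + 1 < 3 else None
    t = threading.Thread(target=run_stage, args=(i, areas[i], queues[i], out_q))
    t.start()
    threads.append(t)

for tid in (1, 2, 3):                 # insertion sequence into the first queue
    queues[0].put(tid)
queues[0].put(STOP)
for t in threads:
    t.join()
```

Because each stage has a single consumer and each queue is FIFO, every column area receives the TIDs in the order in which they were placed in the first queue, without any lock spanning the columns.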
As described above, with the information processing device 110 of this exemplary embodiment, it is possible to execute processes in parallel without losing data integrity, and thereby to accelerate the data storing process, when the data including the plurality of attributes is segmented and stored for each of one or more attributes.
Next, an information processing device according to a second exemplary embodiment is described with reference to the drawings. In this exemplary embodiment, as in the first exemplary embodiment, the information processing device stores tuples including the plurality of attributes by each attribute in bulk.
[Operation]
<Preparation of Pipeline Process>
Preparation of a pipeline process is similar to that of the information processing device 110 according to the first exemplary embodiment, and hence the description thereof is omitted.
<Tuple Storing Process>
The operation of storing actual data is described with reference to
Each of the stage execution units 22P, 22Q, and 22R operates according to the flowchart in
Then, since this is not the last stage (No in Step C4), the data processing unit 26P stores TID=1 in the queue 24Q of the subsequent stage execution unit 22Q (Step C5). The data processing unit 26P of the stage execution unit 22P then starts the storing process for the instance of the data of the tuple of TID=2.
In parallel with the start of the tuple data process of TID=2 of the stage execution unit 22P, the data processing unit 26Q of the stage execution unit 22Q extracts TID=1 from the queue 24Q (Step C2) and stores the data “2010” corresponding to ColB of the tuple of TID=1, in an area 32Q of ColB in the storage unit 30 (Step C3).
Then, since this is not the last stage (No in Step C4), the data processing unit 26Q stores TID=1 in the queue 24R of the subsequent stage execution unit 22R (Step C5).
Similarly, in parallel with the start of the process for the instance of tuple data corresponding to TID=2 by the stage execution unit 22Q, the data processing unit 26R of the stage execution unit 22R extracts TID=1 from the queue 24R (Step C2) and stores the data “3000” corresponding to ColC of the tuple of TID=1, in an area 32R of ColC in the storage unit 30 (Step C3).
Then, since this is the last stage for processing the instances of tuple data (Yes in Step C4), the data processing unit 26R updates (e.g., increments) a value MaxTID of an area 34 which stores MaxTID in the storage unit 30 (Step C6).
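Steps C2 to C6 may be traced with the following single-threaded sketch, in which each call to `step` performs one iteration of one stage. The `step` function, the dictionary-based `storage`, the initial MaxTID value of 0, and the tuple for TID=2 are assumptions made for illustration.

```python
from collections import deque

# TID -> (ColA, ColB, ColC). Only the TID=1 values come from the description.
tuples = {1: ("MX-30", 2010, 3000), 2: ("MS-05", 1981, 3000)}
storage = {"ColA": [], "ColB": [], "ColC": [], "MaxTID": 0}  # MaxTID starts at 0 (assumption)
stage_columns = ["ColA", "ColB", "ColC"]
queues = [deque() for _ in stage_columns]


def step(stage):
    """Run one iteration of the given stage; return False if its queue is empty."""
    if not queues[stage]:
        return False
    tid = queues[stage].popleft()                              # Step C2
    storage[stage_columns[stage]].append(tuples[tid][stage])   # Step C3
    if stage < len(stage_columns) - 1:                         # Step C4: last stage?
        queues[stage + 1].append(tid)                          # Step C5
    else:
        storage["MaxTID"] += 1                                 # Step C6
    return True


queues[0].extend([1, 2])
step(0)                       # ColA of TID=1
step(0); step(1)              # ColA of TID=2, ColB of TID=1
max_before = storage["MaxTID"]
step(1); step(2)              # ColB of TID=2, ColC of TID=1
max_after = storage["MaxTID"]
```

In this trace MaxTID stays at its initial value while TID=1 is still in the intermediate stages and advances only once the last stage has stored ColC of TID=1, even though ColA and ColB of TID=2 are already stored at that point.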
According to the information processing device 120 of this exemplary embodiment, as with the information processing device 110 of the first exemplary embodiment, it is possible to execute the processes of storing the tuples in parallel while ensuring isolation of the tuple processes. In addition, according to this exemplary embodiment, it is possible to keep track of the TID up to which the tuple insertion process has been completed by referring to the value MaxTID in the storage unit 30.
In this exemplary embodiment, description is given of the case where the TIDs assigned to the instances of input data in
<Tuple Reference Process>
Next, a process of referring to data in the state in
First, the data reference unit 40 refers to the area 34 which stores the value MaxTID in the storage unit 30 and acquires the value stored in the area (Step D1). Here, the data reference unit 40 acquires MaxTID=1.
The data reference unit 40 then searches for the tuple having a value of ColB which is smaller than or equal to 2013 in the range of TID≦1 (Step D2). Here, as a result of this search, the data reference unit 40 acquires TID={1}. The data reference unit 40 returns the value “MX-30” of ColA of TID={1} as the result.
With the information processing device 120 of this exemplary embodiment, which carries out the reference process using MaxTID as described above, it is possible to execute the reference process only for the tuple(s) for which the storing process has been completed at the time of starting the reference process.
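Steps D1 and D2 may be sketched as follows for a state in which TID=2 has been stored only partially. The function name `select_col_a_where_col_b_at_most`, the dictionary-based `storage`, the values for TID=2, and the assumption that TIDs start at 1 and correspond to positions in each column area are all introduced for illustration.

```python
# State in which ColA and ColB of TID=2 are stored but ColC of TID=2 is not yet.
# The values for TID=2 are illustrative assumptions.
storage = {
    "ColA": ["MX-30", "MS-05"],
    "ColB": [2010, 1981],
    "ColC": [3000],
    "MaxTID": 1,
}


def select_col_a_where_col_b_at_most(storage, limit):
    """Return ColA of the completed tuples whose ColB is at most `limit`."""
    max_tid = storage["MaxTID"]                      # Step D1: read MaxTID first
    matching = [tid for tid in range(1, max_tid + 1)  # Step D2: search only TID <= MaxTID
                if storage["ColB"][tid - 1] <= limit]
    return [storage["ColA"][tid - 1] for tid in matching]


result = select_col_a_where_col_b_at_most(storage, 2013)
print(result)
```

Although ColB of TID=2 also satisfies the condition in this sketch, TID=2 lies outside the range bounded by MaxTID and is therefore not returned; only the value of ColA for TID=1 is.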
Next, an information processing device according to a third exemplary embodiment is described with reference to the drawings.
The information processing device of this exemplary embodiment further includes a user interface 50 illustrated in
According to
Operation of the user interface 50 in
The user then inputs the number of stages in the area 54, to which the number of stages is input. The sequence determination unit 10 acquires the number of stages input in the area 54 (Step E2).
The user interface 50 then displays the column selection areas 56P, 56Q, and 56R corresponding to the number of stages input in the area 54 (Step E3). The example in
In each of the areas 58P, 58Q, and 58R in which the column(s) to be processed at the corresponding stage is selected, the user marks the column(s) to be processed at the stage.
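How the input received through the user interface might be turned into a stage configuration is sketched below. The function `build_stage_plan` and its validation rule (every column is assigned to exactly one stage) are assumptions for illustration; the description itself does not specify how the input is checked.

```python
def build_stage_plan(num_stages, selections, all_columns):
    """Build a per-stage column list from the number of stages and the marked columns.

    selections[i] holds the columns marked for the i-th stage.
    """
    if len(selections) != num_stages:
        raise ValueError("one column selection is needed per stage")
    assigned = [col for cols in selections for col in cols]
    # Assumed rule: each column is processed at exactly one stage.
    if sorted(assigned) != sorted(all_columns):
        raise ValueError("each column must be assigned to exactly one stage")
    return [list(cols) for cols in selections]


# Example: two stages, the first storing ColA and the second storing ColB and ColC.
plan = build_stage_plan(2, [["ColA"], ["ColB", "ColC"]], ["ColA", "ColB", "ColC"])
```

A plan of this form allows the number of stages to differ from the number of columns, with one stage storing a plurality of columns.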
According to the information processing device of this exemplary embodiment, by including the user interface 50 illustrated in
Next, an information processing device according to a fourth exemplary embodiment is described with reference to the drawings.
Specifically, the information processing device 140 of this exemplary embodiment has a set configuration in which the stage execution units 22P, 22Q, and 22R included in the pipeline processing unit 20 of the information processing device 110 (
The detailed configuration of the stage execution units 22P, 22Q, and 22R and the operation of the sequence determination unit 10 and the stage execution units 22P, 22Q, and 22R of this exemplary embodiment are similar to those of the information processing device (
According to the information processing device 140 of this exemplary embodiment, it is possible to accelerate the processes of storing in a database the plurality of instances of tuple data based on complex columns (attributes) by the use of the plurality of computers and the plurality of storage nodes, while ensuring isolation.
The invention of the present application is described above with reference to the exemplary embodiments; however, the invention of the present application is not limited to the above-described exemplary embodiments. Various changes which may be understood by those skilled in the art can be made to the configuration and details of the invention of the present application within the scope of the invention of the present application. For example, the stage execution units of the pipeline processing unit and the storage unit do not need to be provided in a single computer and may be virtually or physically distributed to the plurality of computers. In the second exemplary embodiment, the value MaxTID is equal to the processed TID of the last column in the sequence of the column storing processes determined by the sequence determination unit 10. Accordingly, the data reference unit 40 may refer directly to the value of the TID of the last column, instead of providing the area 34 for MaxTID in the storage unit 30.
Note that in the present invention, the following modes are possible.
The information processing device according to the above-described first aspect.
In the information processing device according to Mode 1,
the pipeline processing unit includes a plurality of stage execution units which execute the plurality of second processes in pipelining; and
the sequence determination unit assigns the plurality of second processes to the plurality of stage execution units according to the processing sequence.
In the information processing device according to Mode 2, the plurality of stage execution units execute the assigned process from the plurality of second processes in same sequence for the plurality of tuples.
In the information processing device according to Mode 3, the plurality of stage execution units includes
a queue retaining an identifier identifying the tuple and
a data processing unit inserting an instance of attribute data included in the tuple indicated by the identifier dequeued from the queue, into the corresponding one of the plurality of tables.
In the information processing device according to Mode 4, when dequeuing the identifier from the queue, the data processing unit enqueues the dequeued identifier to the queue included in the subsequent stage execution unit.
In the information processing device according to any one of Modes 2 to 5, the storage unit stores a count value indicating the number of tuples, among the plurality of tuples, that the last stage execution unit has processed.
In the information processing device according to Mode 6, when dequeuing the identifier from the queue, the data processing unit included in the last stage execution unit inserts an instance of attribute data included in the tuple indicated by the dequeued identifier, into the corresponding one of the plurality of tables and updates the count value stored in the storage unit.
In the information processing device according to any one of Modes 1 to 7, upon receipt of the number of segments into which the first process is to be segmented, the sequence determination unit segments the first process into the plurality of second processes according to the received number of segments.
In the information processing device according to Mode 8, the sequence determination unit receives the assignment of the plurality of attributes included in the plurality of tuples to the plurality of second processes and assigns the plurality of attributes to the plurality of second processes according to the received assignment.
The information processing method according to the above-described second aspect.
The information processing method according to Mode 10 includes a step of assigning the plurality of second processes to a plurality of stage execution units which process the plurality of second processes in pipelining, according to the processing sequence.
In the information processing method according to Mode 11, each of the plurality of stage execution units executes the assigned process from the plurality of second processes in the same sequence for the plurality of tuples.
The information processing method according to Mode 12 includes, by each of the stage execution units:
a step of storing an identifier identifying the tuple in a queue; and
a step of inserting an instance of attribute data included in the tuple indicated by the identifier dequeued from the queue into the corresponding one of the plurality of tables.
In the information processing method according to Mode 13, when dequeuing the identifier from the queue, each of the plurality of stage execution units enqueues the dequeued identifier to the queue included in a subsequent stage execution unit.
The information processing method according to any one of Modes 11 to 14 includes a step of storing, in the storage unit, a count value indicating the number of tuples, among the plurality of tuples, that the last stage execution unit has processed.
In the information processing method according to Mode 15, when dequeuing the identifier from the queue, the last stage execution unit inserts an instance of attribute data included in the tuple indicated by the dequeued identifier, into the corresponding one of the plurality of tables and updates the count value stored in the storage unit.
The program according to the above-described third aspect.
The program according to Mode 17, causing the computer to implement a process of assigning the plurality of second processes according to the processing sequence to a plurality of stage execution units which execute the plurality of second processes in pipelining.
The program according to Mode 18, causing the plurality of stage execution units to implement a process of carrying out the assigned one of the plurality of second processes in the same sequence for the plurality of tuples.
The program according to Mode 19, causing the plurality of stage execution units to implement processes of:
storing an identifier identifying the tuple in a queue; and
inserting an instance of attribute data included in the tuple indicated by the identifier dequeued from the queue, into the corresponding one of the plurality of tables.
The program according to Mode 20, causing the plurality of stage execution units to implement a process of enqueuing, when dequeuing the identifier from the queue, the dequeued identifier to the queue included in the subsequent stage execution unit.
Note that the contents of the entire disclosures of PTLs and NPL listed above are incorporated in this description by reference. Changes and adjustments of the exemplary embodiments are further made possible within the entire disclosure of the present invention (including the scope of claims) based on the basic technical spirit. Various combinations of and selections from various disclosed elements (including the elements in the claims, the elements in the exemplary embodiments, the elements in the drawings and the like) are possible within the scope of the claims of the present invention. In other words, the present invention naturally includes various alterations and modifications which may be made by those skilled in the art according to the entire disclosure including the scope of claims and the technical spirit. In particular, each numeric range described in this description should be understood so that any numeric value or smaller range included in the range is specifically described even without being particularly mentioned.
Number | Date | Country | Kind |
---|---|---|---|
2013-221305 | Oct 2013 | JP | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2014/065117 | 6/6/2014 | WO | 00 |