This application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. 2023-194835, filed Nov. 16, 2023, the entire contents of which are incorporated herein by reference.
The present disclosure relates to an index array conversion device, an index array conversion method, and an index array conversion program.
Data cleansing refers to various preprocessing tasks that cleanse various types of data in a data lake into training data.
The analysis system shown in
The data cleansing unit cleanses the data stored in the data lake into training data. In addition to simple preprocessing, the data cleansing unit may cleanse the data according to complex algorithms utilizing Artificial Intelligence (AI) or machine learning.
The data cleansing unit stores the cleansed training data in the training data storage unit. Next, the learning unit learns a model using the training data stored in the training data storage unit. The learning unit stores the learned model in the model storage unit. The analysis system utilizes the model stored in the model storage unit.
As the scale of the data to be handled increases, data cleansing tends to become a bottleneck in data analysis. For example, the time required for data cleansing may account for 80% of the total time required for data analysis. Therefore, there is a demand for speeding up data cleansing.
Among the processes targeted for speeding up in data cleansing is the process of extracting only data of rows specified by an index from the target sequence information (hereinafter simply referred to as the “target array”).
As shown on the left side of
As shown on the right side of
The process shown in
According to the data structure specified by the framework applied to the process shown in
The reason for storing the target array in a fragmented state is explained with reference to
The rectangles shown in
In the process shown in
Similarly, thread 2 cannot write its processing result C to the consecutive area until threads 0 and 1 complete writing their processing results. In other words, even if threads attempt to write their processing results directly to the consecutive area without copying, it is impossible to execute parallel computation because it is unclear to what extent other threads have written their results to the consecutive area.
In the process shown in
Since the results are output to separate areas, each thread can execute parallel computation. However, as shown in
As described above, to execute parallel computation for the process of extracting only the data of rows specified by an index, it is often necessary to store the target array in a fragmented state to avoid inconveniences.
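As a non-limiting illustration of why parallel computation tends to leave the target array fragmented, the following Python sketch lets each worker write its result to a separate area. The filter `keep_even`, the function `filter_in_parallel`, and the sample data are hypothetical and are not taken from the disclosure. Because no worker needs to know how many elements the others produce, no coordination is required, and the output naturally remains a list of separate chunks of differing lengths.

```python
from concurrent.futures import ThreadPoolExecutor


def keep_even(values):
    """Hypothetical per-thread processing: the result length is unknown in advance."""
    return [v for v in values if v % 2 == 0]


def filter_in_parallel(partitions):
    """Process each partition in its own task; each task returns its own separate list."""
    with ThreadPoolExecutor() as pool:
        # map() returns results in the order of the input partitions.
        return list(pool.map(keep_even, partitions))


# The result stays fragmented: one chunk per worker, not one consecutive array.
chunks = filter_in_parallel([[1, 2, 3, 4], [5, 6], [7, 8, 10]])
```

Combining `chunks` into one consecutive array would require an additional copy, which is the cost that storing the target array in a fragmented state avoids.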
Furthermore, Patent Literature 1 describes a data collection processing method for collecting dispersed data in large amounts of data used in the field of information processing.
Hereinafter, issues present in the process of extracting only the data of rows specified by an index from a fragmented target array will be explained.
An actual data structure shown on the left side of
As shown on the left side of
To eliminate the need for the above verification process, the right side of
Furthermore, Patent Literature 1 does not describe a method to solve the above issues.
Therefore, the purpose of the present disclosure is to provide an index array conversion device, an index array conversion method, and an index array conversion program that can extract only the data of rows specified by an index from a fragmented target array.
A preferred aspect of the index array conversion device includes a generation unit that executes, for i=1 to N, a generation process to generate partial index array information Ki, where indices in a range Σ_{t=0}^{i-1} d_t to Σ_{t=0}^{i} d_t (d0 = 0) are stored in ascending order, using index array information where a plurality of indices are stored in ascending order and partial target array information D1, D2, . . . , DN, each indicating lengths d1, d2, . . . , dN (where d1 to dN are integers of 1 or more) generated by dividing target array information into N parts (N is an integer of 2 or more), containing data each assigned the index, and a subtraction unit that executes, for i=1 to N, a subtraction process to subtract Σ_{t=0}^{i-1} d_t from the indices stored in the generated partial index array information Ki.
A preferred aspect of the index array conversion method includes executing, for i=1 to N, a generation process to generate partial index array information Ki, where indices in a range Σ_{t=0}^{i-1} d_t to Σ_{t=0}^{i} d_t (d0 = 0) are stored in ascending order, using index array information where a plurality of indices are stored in ascending order and partial target array information D1, D2, . . . , DN, each indicating lengths d1, d2, . . . , dN (where d1 to dN are integers of 1 or more) generated by dividing target array information into N parts (N is an integer of 2 or more), containing data each assigned the index, and executing, for i=1 to N, a subtraction process to subtract Σ_{t=0}^{i-1} d_t from the indices stored in the generated partial index array information Ki.
A preferred aspect of the index array conversion program causes a computer to execute, for i=1 to N, a generation process to generate partial index array information Ki, where indices in a range Σ_{t=0}^{i-1} d_t to Σ_{t=0}^{i} d_t (d0 = 0) are stored in ascending order, using index array information where a plurality of indices are stored in ascending order and partial target array information D1, D2, . . . , DN, each indicating lengths d1, d2, . . . , dN (where d1 to dN are integers of 1 or more) generated by dividing target array information into N parts (N is an integer of 2 or more), containing data each assigned the index, and to execute, for i=1 to N, a subtraction process to subtract Σ_{t=0}^{i-1} d_t from the indices stored in the generated partial index array information Ki.
According to the present disclosure, it is possible to extract only the data of rows specified by the index from the fragmented target array.
Hereinafter, an example embodiment of the present disclosure will be explained with reference to the drawings. It should be noted that the drawings are associated with one or more example embodiments in this disclosure.
The global index alignment partial division unit 110 receives the aforementioned index array as input. In this example embodiment, an index stored in the index array is referred to as a global index.
The global index alignment partial division unit 110 has a function of dividing the index array into portions where the global index is aligned.
As shown in
For example, as shown at the top of
Since there is no element before the 0th element, the global index alignment partial division unit 110 then checks the 1st element “7” of the index array.
The 1st element “7” is greater than the previous element “5”. Therefore, the global index alignment partial division unit 110 then checks the 2nd element “8” of the index array.
The 2nd element “8” is greater than the previous element “7”. Therefore, the global index alignment partial division unit 110 then checks the 3rd element “11” of the index array.
The 3rd element “11” is greater than the previous element “8”. Therefore, the global index alignment partial division unit 110 then checks the 4th element “1” of the index array.
The 4th element “1” is smaller than the previous element “11”. Therefore, the global index alignment partial division unit 110 sets the 4th element “1” as the first element of the second aligned portion.
Next, the global index alignment partial division unit 110 checks the 5th element “2” of the index array. The 5th element “2” is greater than the previous element “1”.
Since there are no elements stored beyond the 5th element in the index array, the global index alignment partial division unit 110 divides the index array immediately before the 4th element, which was set as the first element of the second aligned portion.
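As a non-limiting illustration, the division into aligned portions walked through above may be sketched in Python as follows. The function name `split_into_aligned_portions` is hypothetical. The sketch also assumes that an element equal to its predecessor does not start a new aligned portion, a case the description does not address.

```python
def split_into_aligned_portions(index_array):
    """Split index_array into consecutive runs whose elements never decrease."""
    portions = []
    start = 0  # position of the first element of the current aligned portion
    for pos in range(1, len(index_array)):
        # An element smaller than its predecessor becomes the first element
        # of the next aligned portion.
        if index_array[pos] < index_array[pos - 1]:
            portions.append(index_array[start:pos])
            start = pos
    if index_array:
        portions.append(index_array[start:])
    return portions


# The index array used in the walkthrough: the division falls before the 4th element.
portions = split_into_aligned_portions([5, 7, 8, 11, 1, 2])
```

For the index array of the walkthrough, the sketch yields the two aligned portions [5, 7, 8, 11] and [1, 2].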
As shown at the bottom of
The global index local block division unit 120 has a function of dividing global indices aligned in ascending order according to the shape of a fragmented target array.
The global index local block division unit 120 calculates each offset by sequentially adding the lengths of the fragmented target arrays defined in an actual data structure. For example, the global index local block division unit 120 calculates the 1st offset “4” by adding the length “4” of the fragmented target array of chunk [0] to the 0th offset.
Next, the global index local block division unit 120 calculates the 2nd offset “8” by adding the length “4” of the fragmented target array of chunk [1] to the calculated 1st offset “4”.
Next, the global index local block division unit 120 calculates the 3rd offset “12” by adding the length “4” of the fragmented target array of chunk [2] to the calculated 2nd offset “8”.
That is, for the fragmented target arrays D1, D2, . . . , DN, each indicating lengths d1, d2, . . . , dN (where d1 to dN are integers of 1 or more), the global index local block division unit 120 calculates the offsets for i=1 to N (N is an integer of 2 or more) as Σ_{t=0}^{i} d_t. Note that d0 = 0.
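As a non-limiting illustration, the offset calculation may be sketched in Python as a running sum of the chunk lengths. The function name `compute_offsets` is hypothetical, and the sketch assumes Python 3.8 or later for the `initial` argument of `itertools.accumulate`.

```python
from itertools import accumulate


def compute_offsets(chunk_lengths):
    """Return [0, d1, d1+d2, ...]; entry i is the sum of d_0 through d_i with d_0 = 0."""
    return list(accumulate(chunk_lengths, initial=0))


# Three fragmented target arrays of length 4 each, as in the example above.
offsets = compute_offsets([4, 4, 4])
```

With three chunks of length 4, the sketch yields the 0th offset 0 followed by the 1st to 3rd offsets 4, 8, and 12.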
In the example shown in
Using the generated offsets, the global index local block division unit 120 generates a local index array composed of three blocks for each aligned portion.
Since the 0th element “5” is not less than the 1st offset “4”, the global index local block division unit 120 determines that there are no elements to be stored in the 0th block. Therefore, as shown in
As shown in
Since the 1st element “7” is less than the 2nd offset “8”, the global index local block division unit 120 then compares the 2nd offset “8” with the next 2nd element “8”.
Since the 2nd element “8” is not less than the 2nd offset “8”, the global index local block division unit 120 determines that the elements to be stored in the 1st block are “5” and “7”.
Therefore, as shown in
As shown in
The 3rd element “11” is less than the 3rd offset “12”. Because there are no elements stored beyond the 3rd element in the first aligned portion, the global index local block division unit 120 ends the block generation process.
Therefore, as shown in
Similarly, the global index local block division unit 120 generates the 0th block to the 2nd block constituting the local index array for the second aligned portion.
As shown in
At the stage shown in
The global index local block division unit 120 inputs the two generated local index arrays to the local index conversion unit 130.
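As a non-limiting illustration, the block generation described above may be sketched in Python as a single pass that compares each element with the upper offset of the current block. The function name `split_into_blocks` is hypothetical, and the offsets are assumed to be given as a list that begins with the 0th offset 0.

```python
def split_into_blocks(aligned_portion, offsets):
    """Divide an ascending list of global indices into one block per chunk."""
    blocks = []
    pos = 0  # next element of the aligned portion that has not been placed yet
    for upper in offsets[1:]:
        start = pos
        # Elements less than the upper offset belong to the current block.
        while pos < len(aligned_portion) and aligned_portion[pos] < upper:
            pos += 1
        blocks.append(aligned_portion[start:pos])
    return blocks


# Offsets [0, 4, 8, 12] correspond to three chunks of length 4.
first = split_into_blocks([5, 7, 8, 11], [0, 4, 8, 12])
second = split_into_blocks([1, 2], [0, 4, 8, 12])
```

For the first aligned portion, the 0th block is empty, the 1st block holds 5 and 7, and the 2nd block holds 8 and 11. For the second aligned portion, only the 0th block is non-empty and holds 1 and 2.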
The local index conversion unit 130 has a function of converting the global indices into local indices starting from the beginning of the block for each block constituting the local index array.
As shown in
For example, as shown in
That is, considering the 0th block as the first block, the conversion process by the local index conversion unit 130 corresponds to subtracting Σ_{t=0}^{i-1} d_t from the global indices stored in the ith block.
The local index conversion unit 130 inputs the two local index arrays, in which the global indices have been converted into local indices, to the local array extraction unit 140.
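As a non-limiting illustration, the conversion from global indices to local indices may be sketched in Python as follows. The function name `to_local_indices` is hypothetical. Blocks are numbered from 0 in the sketch, so the offset subtracted from the block at position i is `offsets[i]`, where `offsets` is assumed to begin with the 0th offset 0.

```python
def to_local_indices(blocks, offsets):
    """Subtract each block's starting offset so that indices count from the chunk start."""
    return [
        [global_index - offsets[i] for global_index in block]
        for i, block in enumerate(blocks)
    ]


# Blocks of the first aligned portion with offsets for three chunks of length 4.
local_blocks = to_local_indices([[], [5, 7], [8, 11]], [0, 4, 8, 12])
```

Here the global indices 5 and 7 become the local indices 1 and 3, and the global indices 8 and 11 become the local indices 0 and 3.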
The local array extraction unit 140 has a function of extracting data using a fragmented target array and a block constituting the local index array.
The example shown in
Also, as shown in
Also, as shown in
As shown in
Also, as shown in
That is, the local array extraction unit 140 extracts data as a local array Ri from the fragmented target arrays Di, which stores one or more pieces of data each assigned a new index, using the partial index array information Ki, from which the offset has been subtracted for each element. The local array extraction unit 140 executes the extraction process for local arrays Ri over i=1 to N.
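As a non-limiting illustration, the extraction of the local arrays Ri may be sketched in Python as follows. The function name `extract_local_arrays` and the chunk contents are hypothetical and are not the data shown in the drawings. Each chunk is read only through its own block of local indices, so the chunks are never combined.

```python
def extract_local_arrays(chunks, local_blocks):
    """Pair each chunk with its block of local indices and gather the selected data."""
    return [
        [chunk[j] for j in block]
        for chunk, block in zip(chunks, local_blocks)
    ]


# Hypothetical fragmented target array: three separate chunks of length 4.
chunks = [
    ["a0", "a1", "a2", "a3"],
    ["b0", "b1", "b2", "b3"],
    ["c0", "c1", "c2", "c3"],
]
local_arrays = extract_local_arrays(chunks, [[], [1, 3], [0, 3]])

# Arranging the local arrays in order gives the extraction result.
result = [item for local_array in local_arrays for item in local_array]
```

An empty block simply yields an empty local array, so a chunk that contains none of the requested rows contributes nothing to the result.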
In the actual data structure shown in
Next, the local array extraction unit 140 outputs an array, where three local arrays stored in memory are sequentially arranged vertically (an array labeled “result” shown in
In the example shown in
Hereinafter, an operation of the data extraction device 100 in this example embodiment will be explained with reference to
First, an index array is input to the global index alignment partial division unit 110. The global index alignment partial division unit 110 executes the index array division processing (Step S110).
Next, the global index local block division unit 120 calculates an offset by referring to a fragmented target array (Step S120).
Then, the global index local block division unit 120 generates blocks constituting a local index array, equal in number to the fragmented target arrays, using the calculated offset (Step S130). The global index local block division unit 120 inputs the local index array constituted by the generated blocks to the local index conversion unit 130.
Next, the local index conversion unit 130 subtracts the offset corresponding to the block from the global indices stored in the blocks that constitute the local index array (Step S140). The local index conversion unit 130 inputs the local index array, in which the global indices have been converted to local indices, to the local array extraction unit 140.
Then, the local array extraction unit 140 obtains data using a pair of the fragmented target array and the block that constitutes the local index array. The local array extraction unit 140 generates a local array that stores the obtained data (Step S150).
Next, the local array extraction unit 140 outputs the array, where the generated local arrays are arranged sequentially in a vertical order, as a result (Step S160). After outputting, the data extraction device 100 completes the data extraction processing.
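As a non-limiting illustration, Steps S110 to S160 may be sketched end to end in Python as follows. The function name `extract_rows` and the sample data are hypothetical. The sketch assumes that every index is smaller than the total length of the chunks, that the results for the aligned portions are output in the order of the aligned portions, and that Python 3.8 or later is used for the `initial` argument of `itertools.accumulate`.

```python
from itertools import accumulate


def extract_rows(chunks, index_array):
    """Gather the rows named by global indices without combining the chunks."""
    # Step S110: divide the index array wherever an index is smaller than its predecessor.
    cuts = [p for p in range(1, len(index_array)) if index_array[p] < index_array[p - 1]]
    portions = [index_array[a:b] for a, b in zip([0] + cuts, cuts + [len(index_array)])]

    # Step S120: offsets are running sums of the chunk lengths, starting from 0.
    offsets = list(accumulate((len(c) for c in chunks), initial=0))

    result = []
    for portion in portions:
        pos = 0
        for i, chunk in enumerate(chunks):
            # Step S130: collect the global indices below the upper offset of chunk i.
            start = pos
            while pos < len(portion) and portion[pos] < offsets[i + 1]:
                pos += 1
            # Step S140: subtract the starting offset of chunk i to obtain local indices.
            local = [g - offsets[i] for g in portion[start:pos]]
            # Steps S150 and S160: read from chunk i and append in chunk order.
            result.extend(chunk[j] for j in local)
    return result


# Hypothetical data: three chunks of length 4 and the index array [5, 7, 8, 11, 1, 2].
chunks = [[0, 10, 20, 30], [40, 50, 60, 70], [80, 90, 100, 110]]
rows = extract_rows(chunks, [5, 7, 8, 11, 1, 2])
```

Under these assumptions, the output matches what indexing a combined copy of the chunks would return, while no combined copy is ever created.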
Next, the index array division processing of Step S110, which is a sub-process constituting the data extraction processing shown in
First, the global index alignment partial division unit 110 enters an element loop (Step S111). Then, the global index alignment partial division unit 110 checks a next element stored in an index array (Step S112).
In the first element loop, the global index alignment partial division unit 110 checks the 0th element stored in the index array at Step S112.
Next, the global index alignment partial division unit 110 determines whether the checked element is smaller than the previous element (Step S113). If the checked element is larger than the previous element (No in Step S113), the global index alignment partial division unit 110 proceeds to Step S115.
If the checked element is smaller than the previous element (Yes in Step S113), the global index alignment partial division unit 110 sets the checked element as the first element of the next aligned portion (Step S114). After setting, the global index alignment partial division unit 110 proceeds to Step S115.
Since there is no previous element in the first element loop, the global index alignment partial division unit 110 determines that the checked element is larger than the previous element (No in Step S113).
The global index alignment partial division unit 110 repeats the processes of Steps S112 to S114 as long as there are unchecked elements stored in the index array. When all elements have been checked, the global index alignment partial division unit 110 exits the element loop (Step S115).
Next, the global index alignment partial division unit 110 divides the index array by aligned portions (Step S116). The global index alignment partial division unit 110 inputs the aligned portions generated by dividing the index array to the global index local block division unit 120. After inputting, the global index alignment partial division unit 110 returns to the data extraction processing shown in
As described above, the global index local block division unit 120 in this example embodiment executes the generation process for i=1 to N, to generate partial index array information Ki, where indices in the range Σ_{t=0}^{i-1} d_t to Σ_{t=0}^{i} d_t (d0 = 0) are stored in ascending order, using index array information (aligned portions) where a plurality of indices are stored in ascending order and partial target array information (fragmented target arrays) D1, D2, . . . , DN, each indicating lengths d1, d2, . . . , dN (where d1 to dN are integers of 1 or more), generated by dividing target array information into N parts (N is an integer of 2 or more), containing data each assigned the index.
Moreover, the local index conversion unit 130 in this example embodiment executes the subtraction process for i=1 to N, to subtract Σ_{t=0}^{i-1} d_t from the indices stored in the generated partial index array information Ki.
Furthermore, the local array extraction unit 140 in this example embodiment extracts data as an extraction result (local arrays) Ri from the partial target array information Di, in which one or more pieces of data each assigned a new index are stored, using the partial index array information Ki where the subtraction process has been executed. The local array extraction unit 140 executes the extraction process to extract the extraction result Ri for i=1 to N, and outputs data arranged vertically in R1, R2, . . . , RN order.
Additionally, the global index alignment partial division unit 110 in this example embodiment generates a plurality of pieces of index array information by dividing the index information (index array) composed of the plurality of pieces of index array information.
The partial target array information D1 to DN in this example embodiment may be stored in a distributed state in memory.
The global index alignment partial division unit 110 and the global index local block division unit 120 in this example embodiment divide the index array according to the shape of the fragmented target arrays and generate the local index arrays.
Moreover, the local index conversion unit 130 converts the global indices stored in the local index array into local indices corresponding to the newly assigned indices in the fragmented target arrays.
Assigning new indices to the fragmented target arrays means that the fragmented target arrays are referenced by zero-based local indices. Therefore, the local array extraction unit 140 can extract data without combining the fragmented target arrays.
Below, a specific example of a hardware configuration of the data extraction device 100 in this example embodiment will be explained.
As shown in
The data extraction device 100 is implemented in software by the CPU 11 executing programs that provide the functions of each component shown in
That is, the CPU 11 loads programs stored in the auxiliary memory unit 14 into the main memory unit 12 and executes them, thereby controlling the operation of the data extraction device 100 and implementing each function in software.
The data extraction device 100 may include a DSP (Digital Signal Processor) instead of the CPU 11. Alternatively, the data extraction device 100 may include both the CPU 11 and the DSP.
The main memory unit 12 is used as a workspace for data and a temporary storage area for data. The main memory unit 12 may be, for example, RAM (Random Access Memory). The database 150 is implemented in the main memory unit 12.
The communication unit 13 has a function of inputting and outputting data to and from peripheral devices via a wireless network (information communication network).
The auxiliary memory unit 14 is a non-transitory tangible storage medium. Examples of non-transitory tangible storage media include magnetic disks, magneto-optical disks, CD-ROMs (Compact Disk Read Only Memory), DVD-ROMs (Digital Versatile Disk Read Only Memory), and semiconductor memory.
The input unit 15 has a function of inputting data and processing instructions. The input unit 15 may be an input device such as a keyboard, mouse, or touch panel.
The output unit 16 has a function of outputting data. The output unit 16 may be a display device such as a liquid crystal display device or a touch panel, or a printing device such as a printer.
As shown in
In the data extraction device 100, the auxiliary memory unit 14 stores programs to implement the global index alignment partial division unit 110, the global index local block division unit 120, the local index conversion unit 130, and the local array extraction unit 140.
The data extraction device 100 may also include hardware components such as an LSI (Large Scale Integration) that implements the functions shown in
Moreover, the data extraction device 100 may be implemented by hardware that does not include computer functions using elements such as a CPU. For example, a part of or all of the components may be implemented by general-purpose circuits, dedicated circuits, processors, or combinations thereof. These may be configured on a single chip (such as the aforementioned LSI) or a plurality of chips connected via a bus. A part of or all of the components may also be implemented by a combination of the aforementioned circuits and programs.
Additionally, a part of or all of the components of the data extraction device 100 may be implemented by one or more information processing devices equipped with computing and storage units.
When a part of or all of the components are implemented by multiple information processing devices or circuits, these devices or circuits may be centrally located or distributed. For example, the information processing devices or circuits may be implemented in configurations such as client-server systems or cloud computing systems, each connected via a communication network.
Next, an overview of the present disclosure will be explained.
With such a configuration, the index array conversion device can extract only the data of rows specified by an index from a fragmented target array.
Furthermore, the index array conversion device 20 may include an extraction unit (for example, the local array extraction unit 140) that extracts data as an extraction result Ri from the partial target array information Di, in which one or more pieces of data each assigned a new index are stored, using the partial index array information Ki where the subtraction process has been executed.
With such a configuration, the index array conversion device can extract only the data of rows specified by an index from a fragmented target array.
The extraction unit may execute the extraction process to extract the extraction result Ri for i=1 to N, and output the data arranged vertically in R1, R2, . . . , RN order.
With such a configuration, the index array conversion device can output the same data as would be extracted from the target array before it was fragmented.
Additionally, the index array conversion device 20 may include a division unit (for example, the global index alignment partial division unit 110) that generates a plurality of pieces of index array information by dividing index information composed of the plurality of pieces of index array information.
With such a configuration, the index array conversion device can generate a plurality of pieces of index array information.
The partial target array information D1 to DN may be stored in memory in a distributed state.
Moreover, a part of or all of the components of the aforementioned example embodiments may be described as follows but are not limited to these descriptions.
(Supplementary note 1) An index array conversion device comprising:
(Supplementary note 2) The index array conversion device according to supplementary note 1, further comprising:
(Supplementary note 3) The index array conversion device according to supplementary note 2, wherein
(Supplementary note 4) The index array conversion device according to any one of supplementary notes 1 to 3, further comprising:
(Supplementary note 5) The index array conversion device according to any one of supplementary notes 1 to 4, wherein
(Supplementary note 6) An index array conversion method comprising:
(Supplementary note 7) The index array conversion method according to supplementary note 6, further comprising:
(Supplementary note 8) The index array conversion method according to supplementary note 7, wherein
(Supplementary note 9) The index array conversion method according to any one of supplementary notes 6 to 8, further comprising:
(Supplementary note 10) The index array conversion method according to any one of supplementary notes 6 to 9, wherein
(Supplementary note 11) An index array conversion program causing a computer to execute:
(Supplementary note 12) The index array conversion program according to supplementary note 11, causing a computer to execute
(Supplementary note 13) The index array conversion program according to supplementary note 12, causing a computer to execute
(Supplementary note 14) The index array conversion program according to any one of supplementary notes 11 to 13, causing a computer to execute
(Supplementary note 15) The index array conversion program according to any one of supplementary notes 11 to 14, wherein:
As described above, the present disclosure has been explained with reference to the example embodiments, but the present disclosure is not limited to the above example embodiments. Various changes can be made to the configuration and details of the present disclosure within the scope of understanding of those skilled in the art. Each example embodiment can be appropriately combined with other example embodiments.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2023-194835 | Nov 2023 | JP | national |