This invention relates to a technology for detecting a change to data.
ETL (extract, transform, load) servers are becoming widespread. The ETL servers are configured to collect data from a data source, for example, a core system, an Internet of Things (IoT) device, or a sensor, and to write the data to a database or data warehouse (hereinafter referred to as “DWH”) configured to perform analysis.
When a change or addition, for example, of a data item occurs on the data source side, an operation administrator on the data source side is required to notify an operation administrator of the ETL server of the change. As a technology for changing data, for example, JP 2003-280955 A is known.
In JP 2003-280955 A, there is disclosed a technology in which, when a redefinition of a record is changed on a general-purpose machine side, in order to again reflect the changed definition information of the record in the RDBMS record, an association between the database (extraction source) of the general-purpose machine and the items of the open-system RDBMS (extraction destination) is continued.
In the related-art example described above, in order to reflect a change to the data on the data source side in the data import definition of the ETL server, the administrator on the data source side is required to notify the administrator of the ETL server of the content of the change.
In a case in which a management organization on the data source side and a management organization on the ETL server are different organizations, when the management organization on the data source side updates the data source by itself, it may take a long time until the management organization of the ETL server knows of the change, and there may be a delay in the handling of the update.
Further, in a case in which the data source side is an IoT device or a sensor network, when the IoT device or sensor, for example, is updated by another management organization, there is a problem in that changes to the order of data items or changes to the data content, for example, cannot be grasped by the management organization of the ETL server unless the management organization on the data source side gives notification of those changes.
Therefore, this invention has been made in view of the problem described above, and an object of this invention is to provide a system configured to support handling of additions, updates, and specification changes to a data source.
According to one aspect of the present invention, there is provided a data management method for detecting occurrence of a change to data of a data source, the method being executed by a computer. The method includes first through fifth steps. In the first step, the computer acquires data from the data source. In the second step, the computer analyzes the meaning of the acquired data on a column-by-column basis and stores the result of this analysis in a meaning storage module. In the third step, the computer obtains the analysis result of the last time for each column from the meaning storage module. In the fourth step, the computer compares the analysis result of this time with the analysis result of the last time, and determines that a change has occurred in the data when there is a difference between the two. In the fifth step, when the computer determines that a change has occurred in the data, the computer outputs the occurrence of the change and the content of the difference.
According to at least one embodiment of this invention, changes to the specification of the data source can be detected on a column-by-column basis by analyzing and accumulating the meaning of data acquired from the data source, and comparing the analysis result of the last time with the analysis result of this time.
The details of at least one embodiment of a subject matter disclosed herein are set forth in the accompanying drawings and the following description. Other features, aspects, and effects of the disclosed subject matter become apparent from the following disclosure, drawings, and claims.
Embodiments of this invention are described below with reference to the accompanying drawings.
In the following description, when individual user-operated PCs 7-1 to 7-n are not specified, the part of the reference symbol from “-” onwards is omitted and the user-operated PCs are denoted by reference symbol “7”. The same applies to the reference symbols of other parts.
The computer system also includes a network 4 configured to couple the data source 1, the ETL server 2, and the data check server 5, and a network 6 configured to couple the data check server 5 and the user-operated PCs 7-1 to 7-n.
The data source 1 includes, for example, a database server of a core system, an Internet of Things (IoT) device, or a sensor network, and is configured to provide data to be analyzed by the DWH 3.
The ETL server 2 is configured to acquire data from the database server of the core system, the IoT device, or the sensor network, to format the data of the data source 1 based on predetermined mapping information (or aggregated information), and to output the formatted data in a format which can be used by the DWH 3.
The DWH 3 is configured to perform predetermined processing, for example, statistical processing and analysis processing, by using the data formatted by the ETL server 2. In the first embodiment, there is described an example in which various types of processing are performed in the DWH 3, but the various types of processing may be performed in the database server.
Further, in the illustrated example, the ETL server 2 and the DWH 3 are directly coupled, but in actual practice, the DWH 3 is coupled to the ETL server 2 and the data check server 5 via the network 4. The DWH 3 includes one or more computers.
The data check server 5 is configured to detect occurrence of a specification change by analyzing the data of the data source 1, and to transmit a notification of the data specification change to the user-operated PC 7 when a change to the data specification has occurred. The data specification change notification is composed of a data specification change screen like that described later, and includes the content of the data specification change, a data model definition, and an ETL (aggregated information) definition.
The user-operated PC 7 can instruct, via the data specification change screen, the data check server 5 to determine or edit the data model and to reflect a change in the ETL definition. The data check server 5 transmits to the ETL server 2 and the DWH 3, and reflects in the ETL server 2 and the DWH 3, the ETL definition and the data model definition determined by the user-operated PC 7 in response to the change to the data specification.
Although not shown, the user-operated PC 7 is a computer including a processor, a memory, a network interface, an input apparatus, and an output apparatus (or display apparatus).
The network interface 12 is coupled to each of the network 4 and the network 6. The input/output apparatus 15 includes an input apparatus, for example, a mouse, a keyboard, and a touch panel, and an output apparatus, for example, a display.
A data acquisition module 21, a data meaning analysis module 22, a data specification change detection module 23, a data specification change notification module 24, a mapping information modification module 25, and a data model modification module 26 are loaded onto the memory 14 as programs and are executed by the processor 11.
In the storage apparatus 13, a temporary storage area 31, a data meaning storage module 32, a mapping information storage module 33, and a data model storage module 34 are held.
The data acquisition module 21 is configured to acquire data from the data source 1 at a predetermined timing, and to store the acquired data in the temporary storage area 31. The data acquired by the data acquisition module 21 is the same as the data acquired by the ETL server 2, and the data to be acquired is set in advance. The predetermined timing is, for example, a timing when a command to acquire the data is received from the input/output apparatus 15 or the user-operated PC 7, or a timing when a preset period (for example, 24 hours) is reached.
The data meaning analysis module 22 is configured to read the data from the temporary storage area 31, to calculate feature information on a column-by-column basis, and to analyze the meaning of the data (column) based on the feature information. The analyzed meaning of the data (data meaning) is stored in the data meaning storage module 32 as the analysis result of this time. The stored data meanings are accumulated together with the past analysis results, and are used when the data specification change detection module 23 detects a specification change.
The data specification change detection module 23 is configured to detect a change to the data specification based on the analysis result (feature amount) of this time analyzed by the data meaning analysis module 22 and the analysis result of the last time. The data specification change notification module 24 is configured to output a specification change notification to the user-operated PC 7 when a change to the specification of the data of the data source 1 has occurred.
The mapping information modification module 25 is configured to modify, in accordance with an instruction from the data specification change notification module 24, mapping (mapping information) between the input data defined in the ETL server 2 and the items of the output destination data model.
The data model modification module 26 is configured to modify, in accordance with an instruction from the data specification change notification module 24, the item information on the output destination data model defined in the ETL server 2. In the first embodiment, the data model of the output destination of the ETL server 2 is the data model of the DWH 3.
The processor 11 is configured to operate as a functional unit configured to provide a predetermined function by performing processing in accordance with the program of each functional unit. For example, the processor 11 functions as the data meaning analysis module 22 by performing processing in accordance with a data meaning analysis program. The same applies to the other programs as well. Further, the processor 11 also operates as a functional unit configured to provide each function of a plurality of processes executed by each program. The computer and the computer system are an apparatus and a system including those functional units.
The data of the data source 1 collected by the data acquisition module 21 is stored in the temporary storage area 31 of the storage apparatus 13. The meaning (feature amount) of the data calculated by the data meaning analysis module 22 is stored in the data meaning storage module 32.
In the mapping information storage module 33, a correspondence relationship between the data and columns output by the data source 1 and the data and columns used by the DWH 3 is stored as mapping information. In the data model storage module 34, a format of the data used by the DWH 3 is stored as a data model.
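As a non-limiting illustration, the mapping information and the data model described above could be held in structures like the following Python sketch. The class names, field names, and sample items are assumptions introduced here for illustration only and are not part of the embodiment.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceColumn:
    """A column on the data source 1 side (hypothetical structure)."""
    data_id: str   # identifier of the data, for example "A"
    column: str    # column name or number within that data


@dataclass
class MappingInformation:
    """Correspondence: output item of the data model -> input column."""
    entries: dict = field(default_factory=dict)

    def source_of(self, output_item):
        """Return the source column that feeds the given output item."""
        return self.entries[output_item]


@dataclass
class DataModel:
    """Format of the data used by the DWH 3: item name -> data type."""
    items: dict = field(default_factory=dict)


# Illustrative values only; real item names depend on the data source.
mapping = MappingInformation({
    "recorded_at": SourceColumn("A", "1"),
    "temperature": SourceColumn("A", "4"),
})
model = DataModel({"recorded_at": "datetime", "temperature": "numeric"})
```

Keeping the mapping and the data model as separate structures mirrors the separation between the mapping information storage module 33 and the data model storage module 34.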
In the data check server 5, the data acquisition module 21 acquires predetermined data from the data source 1 at a predetermined timing, adds identification information on the data source 1, and stores the acquired predetermined data including the added identification information in the temporary storage area 31 (Step S101).
The data acquired by the data acquisition module 21 from the data source 1 is the same as the data used by the ETL server 2. Further, the data stored in the temporary storage area 31 of the storage apparatus 13 is stored for a period of time determined in advance and used in the data meaning analysis module 22.
Next, the data meaning analysis module 22 reads the data (for example, past N records or past N days) required for analysis from the temporary storage area 31 (Step S102), and executes meaning analysis of the data in the manner described later. The analysis result of this time is stored in the data meaning storage module 32 (Step S103). The analysis of the meaning of the data is performed on a column-by-column basis, as described later. Further, information indicating, for example, the date and time or generation is added to the data to be stored in the data meaning storage module 32.
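The storing of each analysis result together with a date and time and a generation can be sketched as follows. This is a minimal in-memory stand-in, assuming a single process; the class name `MeaningStore` and its methods are hypothetical, and the generation number is simply counted within this store.

```python
from datetime import datetime, timezone


class MeaningStore:
    """In-memory stand-in for the data meaning storage module 32 (assumed API)."""

    def __init__(self):
        self._records = []  # oldest first

    def save(self, source_id, features_by_column, analyzed_at=None):
        """Store the analysis result of this time and return its generation."""
        generation = len(self._records) + 1
        self._records.append({
            "source_id": source_id,
            "generation": generation,
            "analyzed_at": analyzed_at or datetime.now(timezone.utc),
            "features": dict(features_by_column),
        })
        return generation

    def latest(self, source_id):
        """Return the most recently stored record for the source, or None."""
        for record in reversed(self._records):
            if record["source_id"] == source_id:
                return record
        return None
```

Calling `latest()` before `save()` in a given run yields the analysis result of the last time, which is what the comparison in Step S104 requires.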
Next, the data specification change detection module 23 reads the analysis result of the last time from the data meaning storage module 32, compares the analysis result of this time with the analysis result of the last time on a column-by-column basis of the data, and detects changes to the data specification. When there is a difference between the analysis result of the last time and the analysis result of this time, the data specification change detection module 23 determines that a change to the data specification has occurred, and transmits the meaning analysis result of the data and the detection result to the data specification change notification module 24 (Step S104).
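The column-by-column comparison of Step S104 can be illustrated by the sketch below. The function names and the dictionary layout are assumptions, and exact equality of feature values is used as the comparison criterion purely for brevity.

```python
def detect_spec_change(last, current):
    """Compare per-column analysis results of the last time and this time.

    Both arguments map a column identifier to a dict of feature values.
    Returns the added, deleted, and changed columns; for each changed
    column the differing features are given as (last, current) pairs.
    """
    added = sorted(set(current) - set(last))
    deleted = sorted(set(last) - set(current))
    changed = {}
    for column in sorted(set(last) & set(current)):
        keys = set(last[column]) | set(current[column])
        diff = {
            key: (last[column].get(key), current[column].get(key))
            for key in keys
            if last[column].get(key) != current[column].get(key)
        }
        if diff:
            changed[column] = diff
    return {"added": added, "deleted": deleted, "changed": changed}


def has_change(result):
    """True when any column was added, deleted, or changed."""
    return bool(result["added"] or result["deleted"] or result["changed"])
```

The returned dictionary corresponds to the detection result that is passed on to the data specification change notification module 24.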
The data specification change notification module 24 adds, based on the received meaning analysis result and the detection result of the data specification change, the change to the existing data model of the data model storage module 34 and to the existing mapping information of the mapping information storage module 33, and notifies the user-operated PC 7-n of the data model after the specification change and the mapping information after the specification change (Step S105).
The data model after the specification change is a data model in which an item definition of the data model of the output destination (DWH 3) has been changed in accordance with an addition, deletion, or change of the data column. Further, the mapping information after the specification change is mapping information in which the mapping definition of an item of the data model of the output destination (DWH 3) and a column of the input data has been changed in accordance with an addition, deletion, or switch in order of a data column.
The data specification change notification module 24 may generate update information on the processing content corresponding to the change in the mapping information and notify the user-operated PC 7-n of the update information. The data specification change notification module 24 may handle information including the mapping information and the processing content as aggregated information.
Through the processing described above, the data check server 5 acquires the data used by the ETL server 2 from the data source 1, analyzes the meaning of the data on a column-by-column basis, and detects the occurrence of a change to the data specification based on the analysis result of the meaning of the last-time data and the this-time data. When a change to the data specification has occurred, the data check server 5 can present a notification of the occurrence of the specification change, the data model after the specification change, and the mapping information to the user-operated PC 7.
On the user-operated PC 7 which has received the data specification change notification, a user checks the content of the specification change. When the user determines that the content of the specification change (data model after the specification change and mapping information or aggregated information) is correct, the user instructs the data check server 5 to reflect the content of the change (Step S201).
The data specification change notification module 24 of the data check server 5 instructed by the user-operated PC 7 to reflect the content of the change updates the information by writing the data model after the specification change in the data model storage module 34 and writing the mapping information (or aggregated information) after the specification change in the mapping information storage module 33 (Step S202).
Next, the data specification change notification module 24 transmits the mapping information after the specification change to the mapping information modification module 25, and transmits the data model after the specification change to the data model modification module 26 (Step S203).
The mapping information modification module 25 transmits new mapping information (or aggregated information) to the ETL server 2 and updates the mapping information used by the ETL server 2 to the new mapping information (or aggregated information) (Step S204).
The data model modification module 26 transmits a new data model to the DWH 3 and updates the data model used by the DWH 3 to the new data model (Step S205).
Through the processing described above, when the data specification change notification module 24 receives a response from the user-operated PC 7, the data model and mapping information (aggregated information) after the specification change can be reflected in the DWH 3 and the ETL server 2.
In Step S1, the data acquisition module 21 of the data check server 5 acquires predetermined data from the data source 1 at a predetermined timing, adds the identification information on the data source 1, and stores the acquired predetermined data including the added identification information in the temporary storage area 31.
Next, in Step S2, the data meaning analysis module 22 reads the data to be analyzed from the temporary storage area 31, executes meaning analysis of the data on a column-by-column basis, and stores the analysis result in the data meaning storage module 32. This processing is described in detail with reference to
In Step S3, the data specification change detection module 23 acquires, for each column, the result of the meaning analysis of this time and the result of the meaning analysis of the last time from the data meaning storage module 32, and determines presence or absence of a difference. This processing is described in detail with reference to
In Step S5, the data specification change notification module 24 adds, based on the detection result of the change to the data specification, the change to the existing data model and the mapping information acquired from the data model storage module 34 and the mapping information storage module 33, and notifies the user-operated PC 7 of the data model after the specification change and the mapping information after the specification change.
In Step S11, the data specification change notification module 24 receives from the user-operated PC 7 an instruction to reflect the change to the data specification. In Step S12, it is determined from the reflection instruction whether to perform processing by the data model modification module 26 or the mapping information modification module 25. In Step S13, when modification of only the mapping information is to be performed, the processing advances to Step S14, and when modification of the mapping information and the data model is to be performed, the processing advances to Step S15.
In Step S14, only the order of the data items has been switched, and therefore the mapping information modification module 25 is activated.
The mapping information modification module 25 modifies the mapping between the items of the data source 1 in which the order of the data items has been switched and the items of the output destination data model (input data of the DWH 3), and updates the mapping information by writing the modified mapping in the mapping information storage module 33. Further, the mapping information modification module 25 notifies the ETL server 2 of the updated mapping information, and the ETL server 2 updates the mapping information to the latest mapping information.
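One possible way to modify the mapping when only the order of the data items has been switched is sketched below. It assumes that a column can be recognized in its new position because its feature values are identical to those of the last time; this matching rule and the function name are assumptions made for illustration.

```python
def remap_switched_columns(mapping, last, current):
    """Return a new mapping (output item -> column number) after an order switch.

    mapping: output item -> column number valid for the last-time layout.
    last, current: column number -> dict of feature values.
    A column is treated as the same column when its features are identical.
    """
    new_mapping = {}
    for item, old_column in mapping.items():
        candidates = [
            column for column, features in current.items()
            if features == last[old_column]
        ]
        if len(candidates) != 1:
            # Ambiguous or missing match: leave the decision to the user.
            raise ValueError(
                f"cannot uniquely relocate column {old_column} for {item!r}"
            )
        new_mapping[item] = candidates[0]
    return new_mapping
```

Raising an error on an ambiguous match keeps the sketch conservative; in the embodiment the user can check and edit the mapping on the data specification change screen 100.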
In Step S15, the order of the data items has been switched and the content of the columns has also changed, and therefore the processing is executed by the mapping information modification module 25 and the data model modification module 26.
First, the data model modification module 26 updates the data model of the data model storage module 34 by adding items to, changing items of, and deleting items from the output destination data model (input data of DWH 3). Further, the data model modification module 26 transmits the updated data model to the DWH 3, and the DWH 3 updates its data model to the new data model.
Next, in the same manner as in Step S14, the mapping information modification module 25 updates the mapping information of the mapping information storage module 33 and notifies the ETL server 2 of the updated mapping information.
The data specification change screen 100 includes a data specification change content display section 110 configured to display the content of the data specification change, a data model definition display section 120 configured to display a definition of the changed data model, and a new ETL definition display section 130 configured to display the changed ETL definition.
The data specification change content display section 110 displays difference information 1100 on the analysis result of this time and the analysis result of the last time of the data meaning storage module 32. The difference information 1100 includes, for each data identifier (in
In the example illustrated in
In the data model definition display section 120, the definition of the data model of the output destination of the ETL server 2 is displayed as a data-association diagram 1210 and as a data sample 1220 which is combined together with the data-association diagram 1210. Further, a model determination button 1230 for the user to allow the changed data model is displayed in the data model definition display section 120.
In the data sample 1220, the data model after the specification change is displayed, and as the data model of the output destination, an example of a data model including columns “a” to “d” of the data A and column “a1” of the data B is displayed.
The data-association diagram 1210 displays a model showing associations among pieces of the data as a diagram. In the data-association diagram 1210, places at which there have been changes to the data associations and the changed content can be recognized, for example, by displaying added relationships in a highlighted manner and displaying deleted relationships as broken lines. The user can edit the data-association diagram 1210 to define a data model (data structure of data to be output destination of input data).
The new ETL definition display section 130 displays a processing and input/output model 1310 in the ETL server 2 after the specification change. Further, the new ETL definition display section 130 displays a reflect change button 1320 and a cancel button 1330 for receiving a new ETL definition.
In the model 1310 illustrated in
The mapping editing dialog 300 includes an output item 301 indicating a column of the specified data model, and an edit content 302 displaying the data and the column on the input data source 1 side to be input.
The user of the user-operated PC 7 can modify the mapping information on the data source 1 and the data model of the output destination by modifying the edit content 302. In the example illustrated in
The data specification change notification 400 is output from the data specification change notification module 24 when a change to the data specification is detected by the data specification change detection module 23. The data specification change screen 100 of
In Step S21, the data specification change notification module 24 acquires the changed, deleted, or added columns from the data specification change detection module 23 based on the determination result obtained from the data specification change detection module 23.
In Step S22, the data specification change notification module 24 determines whether or not there is a column having a changed meaning or a newly added column. When there is a column having a changed meaning or a newly added column, the processing advances to Step S23, and when there is neither a column having a changed meaning nor a newly added column, the processing advances to Step S24.
In Step S23, when it is detected that there is a newly added column or a column having a changed meaning, the data specification change notification module 24 evaluates the associations with the columns of other data, and detects the following addition or deletion of an association.
(1) Addition of a new association between the newly added column or changed column and a column of different data
(2) Deletion of an existing association between the changed column and a column of different data
In Step S24, the data specification change notification module 24 determines whether or not there is a deleted column. When there is a deleted column, the processing advances to Step S25, and when there is not a deleted column, the processing advances to Step S26.
In Step S25, the data specification change notification module 24 extracts the existing associations between the column corresponding to the detection result of the above-mentioned item (2) and the column of different data, and detects the association to be deleted.
In Step S26, the data specification change notification module 24 reflects the association detected in Step S23 and Step S25 in the data model stored in the data model storage module 34. In Step S27, the data specification change notification module 24 outputs, to the data specification change screen 100 (data-association diagram 1210 or data sample 1220) of
In Step S31, the data model modification module 26 acquires the changed, deleted, or added columns from the data specification change detection module 23. In Step S32, the data model modification module 26 acquires the existing mapping information corresponding to the column from the mapping information storage module 33.
In Step S33, the data model modification module 26 determines whether or not there is a new addition by a specification change. When there is a new addition, the processing advances to Step S34, and when there is no new addition, the processing advances to Step S35. In Step S34, the data model modification module 26 newly adds the input data items to the mapping information on the column, adds processing corresponding to the meaning of the newly added column, and adds the data items of a data model corresponding to the newly added column.
In Step S35, the data model modification module 26 determines whether or not there is a column which has been changed (for example, change to meaning or change to listed order) by a specification change. When there is a column which has been changed, the processing advances to Step S36, and when there is no column which has been changed, the processing advances to Step S37.
In Step S36, the data model modification module 26 changes (the order of) the input data items in the mapping information corresponding to the column changed by a specification change, changes the processing to processing corresponding to the meaning of the changed column, and changes the items of the corresponding data model. The processing corresponding to the meaning of the changed column may be to notify the ETL server 2 that the meaning of the column has been changed.
In Step S37, the data model modification module 26 determines whether or not there is a column which has been deleted by a specification change. When there is a column which has been deleted, the processing advances to Step S38, and when there is no column which has been deleted, the processing advances to Step S39.
In Step S38, the data model modification module 26 deletes the relevant column from the input data items in the mapping information corresponding to the column deleted by a specification change, deletes or updates the processing relating to the deleted column, and deletes the data items of the corresponding data model.
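The handling of added, changed, and deleted columns in Steps S33 to S38 can be sketched as follows. The dictionary layouts and the function name are assumptions; the processing content attached to each column is omitted from this sketch.

```python
import copy


def apply_spec_change(mapping, model, added, changed, deleted):
    """Return updated copies of the mapping information and the data model.

    mapping: output item -> input column; model: output item -> data type.
    added, changed: input column -> (output item, data type).
    deleted: iterable of input columns that no longer exist.
    """
    mapping, model = copy.deepcopy(mapping), copy.deepcopy(model)

    # Newly added columns: add an input item and a data model item.
    for column, (item, dtype) in added.items():
        mapping[item] = column
        model[item] = dtype

    # Changed columns: repoint the mapping and update the item definition.
    for column, (item, dtype) in changed.items():
        mapping[item] = column
        model[item] = dtype

    # Deleted columns: remove every output item that was fed by them.
    for column in deleted:
        for item in [i for i, c in mapping.items() if c == column]:
            del mapping[item]
            model.pop(item, None)

    return mapping, model
```

Working on copies leaves the stored mapping information and data model untouched until the user approves the change.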
In Step S39, the data model modification module 26 outputs the mapping information having the specification changed in Step S34, Step S36, and Step S38 to the model 1310 of the new ETL definition display section 130 of the data specification change screen 100 of
In Step S41, the mapping information modification module 25 acquires the latest mapping information (for example, the mapping information checked or modified by the user on the data specification change screen 100 of
Next, in Step S42, the mapping information modification module 25 converts the latest mapping information into a definition of the ETL processing in accordance with the specification of the processing of the update destination ETL server 2. The reflection of the latest mapping information in the ETL processing definition may be performed, for example, by using a data mapping integration tool.
In Step S43, the mapping information modification module 25 transmits the converted ETL definition to the ETL server 2 to update the ETL definition content.
Through the processing described above, the definition of the ETL server 2 can be updated based on mapping information checked by the user.
In Step S51, the data model modification module 26 acquires the latest data model (the data model checked or modified by the user on the data specification change screen 100 of
In Step S52, the data model modification module 26 converts, in accordance with the specification of the input data of the update destination DWH 3, the information on the data model into a definition for updating the schema of DWH 3. The conversion of the schema definition may be performed, for example, by using a schema conversion tool. In Step S53, the data model modification module 26 transmits the converted definition for updating the schema to the DWH 3, and the schema of the DWH 3 is thus updated.
Through the processing described above, the schema of the DWH 3 can be updated based on a data model checked by the user.
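As a rough illustration of Step S52, the difference between the data model before and after the change could be converted into schema update statements as sketched below. The function name, the table name, and the SQL dialect are assumptions; the statement syntax accepted by an actual DWH product may differ, and the identifiers are interpolated without quoting, so the sketch is suitable only for trusted names.

```python
def schema_update_statements(table, old_model, new_model):
    """Build ALTER TABLE statements that turn old_model into new_model.

    old_model, new_model: item (column) name -> SQL data type string.
    """
    statements = []
    for item in sorted(set(new_model) - set(old_model)):
        statements.append(
            f"ALTER TABLE {table} ADD COLUMN {item} {new_model[item]}"
        )
    for item in sorted(set(old_model) - set(new_model)):
        statements.append(f"ALTER TABLE {table} DROP COLUMN {item}")
    for item in sorted(set(old_model) & set(new_model)):
        if old_model[item] != new_model[item]:
            # Type-change syntax varies by product; this form is an assumption.
            statements.append(
                f"ALTER TABLE {table} ALTER COLUMN {item} "
                f"SET DATA TYPE {new_model[item]}"
            )
    return statements
```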
In the data 500, each entry stores a recording date and time in column number 1, a humidity in column number 2, a speed in column number 3, a temperature in column number 4, a voltage in column number 5, and a wind direction in column number 6.
The data meaning analysis module 22 reads the data included in one column of the data 500 from the temporary storage area 31 (Step S61). The data meaning analysis module 22 determines whether the type of the data in the column is a numerical value type, a character type, or a date and time type (Step S62).
In Step S63, the data meaning analysis module 22 advances the processing in accordance with the data type. The data meaning analysis module 22 advances the processing to Step S64 when the data type is a numerical value type, advances the processing to Step S65 when the data type is a character type, and advances the processing to Step S66 when the data type is a date and time type.
In Step S64, the data meaning analysis module 22 calculates the feature amount of the numerical value type column. In Step S65, the data meaning analysis module 22 calculates the feature amount of the character type column. In Step S66, the data meaning analysis module 22 calculates the feature amount of the date and time type column.
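The determination of the data type in Step S62 could, for example, be performed as in the following sketch. The list of accepted date and time formats and the rule that every non-empty value in the column must parse are assumptions made for illustration.

```python
from datetime import datetime

# Assumed date and time formats; a real data source may use others.
DATETIME_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y/%m/%d %H:%M:%S", "%Y-%m-%d")


def _is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def _is_datetime(text):
    for fmt in DATETIME_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


def infer_column_type(values):
    """Classify a column of strings as 'numeric', 'datetime', or 'character'."""
    values = [v.strip() for v in values if v is not None and v.strip() != ""]
    if values and all(_is_number(v) for v in values):
        return "numeric"
    if values and all(_is_datetime(v) for v in values):
        return "datetime"
    return "character"
```

The returned label can then be used to select the feature amount calculation of Step S64, Step S65, or Step S66. Note that `float()` also accepts strings such as "nan" and "inf", which a stricter implementation might want to exclude.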
As the feature amount, for example, a feature amount (feature information) like the following can be used. For the numerical value type, for example, a statistical value such as a maximum value, a minimum value, an average value, and a variance, or a periodicity is calculated as the feature amount. For the character type, for example, a statistical value such as a maximum value, a minimum value, an average value, and a variance of a character string length, or a frequently-appearing character pattern and an appearance ratio of the pattern, is calculated as the feature amount. For the date and time type, for example, a statistical value such as a maximum value, a minimum value, an average value, and a variance relating to an interval is calculated as the feature amount. The examples described above of the feature amount are examples, and this invention is not limited to those examples.
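The statistical feature amounts listed above can be computed, for example, as in the following sketch. The function names, the pattern abstraction (digits replaced by "9" and ASCII letters by "A"), and the default date and time format are assumptions; the periodicity is omitted, and the date and time function requires at least two values so that an interval exists.

```python
import re
import statistics
from collections import Counter
from datetime import datetime


def _summary(numbers):
    """Maximum, minimum, average, and population variance of a sequence."""
    return {
        "max": max(numbers),
        "min": min(numbers),
        "mean": statistics.fmean(numbers),
        "variance": statistics.pvariance(numbers),
    }


def numeric_features(values):
    """Feature amounts of a numerical value type column."""
    return _summary([float(v) for v in values])


def character_features(values):
    """Feature amounts of a character type column (string lengths, pattern)."""
    features = _summary([len(v) for v in values])
    # Abstract each string into a pattern: digits -> 9, letters -> A.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v)) for v in values
    )
    pattern, count = patterns.most_common(1)[0]
    features.update(pattern=pattern, ratio=count / len(values))
    return features


def datetime_features(values, fmt="%Y-%m-%d %H:%M:%S"):
    """Feature amounts of a date and time type column (intervals in seconds)."""
    stamps = sorted(datetime.strptime(v, fmt) for v in values)
    intervals = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    return _summary(intervals)
```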
The data meaning analysis module 22 determines whether or not there is an unprocessed column in the data 500. When there is an unprocessed column, the processing advances to Step S68, and when all of the columns have been processed, the processing advances to Step S69 (Step S67).
When there is a remaining column, the data meaning analysis module 22 advances the processing to the next column (Step S68), returns to Step S61, and executes the processing described above on the applicable column. Meanwhile, when there is no remaining column, the data meaning analysis module 22 stores the feature amount calculated as described above in the data meaning storage module 32 as a feature amount table 700 showing the meaning of the data of this time (Step S69).
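As a non-limiting illustration, the loop of Step S61 to Step S69 can be sketched in Python as follows. The function names summarize, column_features, and feature_table and the dictionary layout are assumptions made for this sketch and do not appear in the embodiment. The sketch also assumes that each column has already been parsed into values of a single type and holds at least two values, and that the population variance is an acceptable variance.

```python
from datetime import datetime
from statistics import mean, pvariance


def summarize(values):
    """Return the maximum, minimum, average, and population variance of numbers."""
    return {
        "max": max(values),
        "min": min(values),
        "avg": mean(values),
        "var": pvariance(values),
    }


def column_features(column):
    """Compute a feature amount for one column, branching on its data type
    (the branch corresponds to Steps S62 to S66)."""
    first = column[0]
    if isinstance(first, datetime):
        # Date and time type: statistics of the intervals between consecutive values.
        intervals = [(b - a).total_seconds() for a, b in zip(column, column[1:])]
        return {"type": "datetime", **summarize(intervals)}
    if isinstance(first, (int, float)):
        # Numerical value type: statistics of the values themselves.
        return {"type": "numeric", **summarize(column)}
    # Character type: statistics of the character string lengths.
    return {"type": "string", **summarize([len(s) for s in column])}


def feature_table(columns):
    """Process every column in turn (Steps S61, S67, S68) and collect one entry
    per column, keyed by a 1-based column number (Step S69)."""
    return {number: column_features(col) for number, col in enumerate(columns, start=1)}
```

In this sketch, a date and time column sampled every 10 minutes yields a maximum, minimum, and average interval of 600 seconds and a variance of 0.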
The column number 740 stores the column number of the data 500 shown in
The maximum value 755 stores the maximum value of numerical type data, the maximum value of the character string lengths of character type data, and the maximum value of the time intervals of date and time type data. The minimum value 760 stores the minimum value of numerical value type data, the minimum value of the character string lengths of character type data, and the minimum value of the time intervals of date and time type data.
The average value 765 stores the average value of the values of numerical type data, the average value of the character string lengths of character type data, and the average value of the time intervals of the date and time type data.
The variance 770 stores the variance of the values of numerical type data, the variance of the character string lengths of character type data, and the variance of the time intervals of date and time type data. The periodicity 775 stores the periodicity of numerical value type data.
The frequently-appearing pattern 780 stores the pattern frequently appearing in character type data or the format of date and time type data. The appearance ratio 785 stores the appearance ratio of data applicable to the format of the frequently-appearing pattern 780.
The feature amount of column number 1 stores the fact that the data type 745 is date and time type data. The frequently-appearing pattern 780 stores the format of the date and the character string, and the corresponding character string is stored in the number of digits 750.
The appearance ratio 785 stores the ratio at which data matching the pattern of the frequently-appearing pattern 780 appears. An appearance ratio 785 of 100% means that all the data is stored in that format. The maximum value 755 to the average value 765 indicate that the data is evenly spaced at intervals of 600 seconds (10 minutes).
Column numbers 2 to 5 show an example in which the feature amount of numerical value data is stored. The integer part of the numerical value stored in the number of digits 750 indicates the number of digits of the integer part of the data value of the column, and the decimal part represents the number of digits of the decimal part of the data value of the column. For example, the number of digits 750 in the column of column number 2 is “2.1”, which indicates that the value of the data stored in column number 2 is two digits in the integer part and one digit in the decimal part. When the number of digits is not constant, a value is not stored in the number of digits 750.
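As a non-limiting illustration, the encoding of the number of digits 750 can be sketched in Python as follows. The function name digits_signature is an assumption made for this sketch, as is the premise that the values are available as text exactly as received from the data source.

```python
def digits_signature(values):
    """Return a string such as "2.1" (integer digits, then decimal digits) when
    every value has the same digit counts, or None when the number of digits
    is not constant."""
    signatures = set()
    for text in values:
        # Ignore a leading minus sign, then split at the decimal point.
        integer_part, _, decimal_part = text.lstrip("-").partition(".")
        signatures.add(f"{len(integer_part)}.{len(decimal_part)}")
    return signatures.pop() if len(signatures) == 1 else None
```

For values such as "23.5" and "18.0" the sketch yields "2.1", and no value is produced when the digit counts differ from row to row.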
In the periodicity 775 of column number 2, “144” is input, and therefore column number 2 indicates that data having a periodicity every 144 pieces of data is input. When there is no periodicity, “0” is input.
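The embodiment does not specify how the periodicity is obtained. As one possible approach, offered only as an assumption, the period can be estimated from the sample autocorrelation of the column, as in the following Python sketch. The function names autocorrelation and estimate_period and the parameter min_corr are hypothetical.

```python
from statistics import mean


def autocorrelation(values, lag):
    """Sample autocorrelation of a sequence at the given lag."""
    mu = mean(values)
    denom = sum((v - mu) ** 2 for v in values)
    if denom == 0:
        return 0.0  # A constant sequence has no meaningful correlation.
    num = sum((values[i] - mu) * (values[i + lag] - mu)
              for i in range(len(values) - lag))
    return num / denom


def estimate_period(values, min_corr=0.5):
    """Return the lag with the strongest autocorrelation found after the first
    negative lag, or 0 when no such lag reaches min_corr (no periodicity)."""
    acf = [autocorrelation(values, lag) for lag in range(1, len(values) // 2 + 1)]
    # Skip the initial run of non-negative correlations caused by slowly varying data.
    start = next((i for i, r in enumerate(acf) if r < 0), None)
    if start is None:
        return 0
    best = max(range(start, len(acf)), key=lambda i: acf[i])
    return best + 1 if acf[best] >= min_corr else 0
```

Under this assumption, a column that repeats every 144 pieces of data would be recorded with a periodicity of 144, and a column with no repeating structure would be recorded with 0.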
Column number 6 shows an example in which the feature amount of character type data is stored. When the character string length is fixed, the character string length is stored in the number of digits 750, but when the character string length is variable, no value is input.
Further, the maximum value 755 to the variance 770 of column number 6 store the maximum, minimum, average, and variance of the character string length. The periodicity is stored in the periodicity 775, but “0” is stored when there is no periodicity. The frequently-appearing pattern 780 stores, in the form of a regular expression, patterns which frequently appear. The appearance ratio 785 stores the ratio of data matching the pattern of the frequently-appearing pattern 780. Column number 6 shows that 98% of the data matches the pattern of frequently-appearing pattern 780.
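As a non-limiting illustration, the frequently-appearing pattern 780 and the appearance ratio 785 of a character type column can be sketched in Python as follows. The way in which a character string is generalized into a regular expression is an assumption made for this sketch, and the names TOKEN, generalize, and frequent_pattern are hypothetical.

```python
import re
from collections import Counter

# Runs of ASCII digits, runs of ASCII letters, or any other single character.
TOKEN = re.compile(r"(?P<digits>[0-9]+)|(?P<letters>[A-Za-z]+)|(?P<other>.)", re.DOTALL)


def generalize(text):
    """Turn a string into a coarse regular expression in which digit runs and
    letter runs keep only their length and other characters stay literal."""
    parts = []
    for match in TOKEN.finditer(text):
        token = match.group()
        if match.lastgroup == "digits":
            parts.append(rf"\d{{{len(token)}}}")
        elif match.lastgroup == "letters":
            parts.append(f"[A-Za-z]{{{len(token)}}}")
        else:
            parts.append(re.escape(token))
    return "".join(parts)


def frequent_pattern(values):
    """Return the most common generalized pattern and the ratio of values
    that fully match it."""
    counts = Counter(generalize(v) for v in values)
    pattern, _ = counts.most_common(1)[0]
    matched = sum(1 for v in values if re.fullmatch(pattern, v))
    return pattern, matched / len(values)
```

An appearance ratio below 1.0 in this sketch corresponds to a case such as column number 6, in which most but not all of the data matches the frequently-appearing pattern.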
The data specification change detection module 23 accesses the data meaning storage module 32 and acquires the meaning (feature amount) of the data of one column from the feature amount table 700 showing the meaning of the this-time data 500 (Step S1305).
Further, the data specification change detection module 23 accesses the data meaning storage module 32 and acquires the feature amount indicating the meaning of the data of the same column of the last-time data 500 (Step S1310).
In Step S1315, the data specification change detection module 23 compares the meaning (feature amount) of the acquired this-time data with the meaning (feature amount) of the data of the same column of the last-time data, and determines whether or not the feature amounts match. The determination regarding whether or not the feature amounts match is described in detail with reference to
When the meanings (feature amounts) of the data match in Step S1315, the data specification change detection module 23 determines in Step S1320 that there is “no change to the column.” Meanwhile, when the meanings of the data do not match in Step S1315, the data specification change detection module 23 shifts the processing to Step S1325.
In Step S1325, the data specification change detection module 23 compares the meaning (feature amount) of the data of every column of the last-time data other than the column acquired in Step S1310 with the meaning of the data of the column of the this-time data acquired in Step S1305. Then, in Step S1330, the data specification change detection module 23 determines whether or not the meaning (feature amount) of the data of any of the other columns matches the meaning (feature amount) of the this-time column data. The determination regarding whether or not the meanings match is performed in the same manner as in Step S1315.
In the determination of Step S1330, when there is another column which matches, the data specification change detection module 23 determines that the column is to be switched (Step S1335), and when there is not another column which matches, the data specification change detection module 23 determines that the column is to be added (Step S1340).
Based on the determination result of no change to the column, switching of the column, or addition of the column in Step S1320, Step S1335, and Step S1340, the data specification change detection module 23 records in a column change table 800 which column of this-time data the column of the last-time data corresponds to, or whether the column of the last-time data is to be newly added (Step S1345). The column change table 800 is stored in the data meaning storage module 32, and is described later with reference to
Next, the data specification change detection module 23 accesses the feature amount table 700 of the data meaning storage module 32 and determines whether or not there is, of the this-time data, a column which has not been processed yet (Step S1350).
When there is a column that has not been processed yet, the data specification change detection module 23 advances the processing to Step S1355 to shift the processing to the next column, returns to the above-mentioned Step S1305, and repeats the processing described above.
Meanwhile, when it is determined in Step S1350 that there is not a remaining column, the data specification change detection module 23 refers to the column change table 800 recorded in Step S1345, and acquires the correspondence relationship of columns of the this-time data and the last-time data (Step S1360).
Next, the data specification change detection module 23 determines whether or not, among the columns of the last-time data, there is a column not associated with the this-time data (Step S1365). When every column of the last-time data is associated with the this-time data (when no column has been deleted), the data specification change detection module 23 ends the processing as it is.
Meanwhile, when there is a column not associated with the this-time data among the columns of the last-time data, the data specification change detection module 23 determines that the non-associated column has been deleted from the this-time data (Step S1370), and adds information indicating the deletion to the column change table 800 recorded in Step S1345 (Step S1375), and ends the processing.
Through the processing described above, by comparing the value of the feature amount table 700 of the this-time data with the value of the feature amount table 700 of each column of the last-time data 500, changes to each column are detected and recorded in the column change table 800.
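As a non-limiting illustration, the flow of Step S1305 to Step S1375 can be sketched in Python as follows. The function name detect_column_changes, the record layout, and the matches parameter (a caller-supplied function standing in for the comparison of Step S1315) are assumptions made for this sketch.

```python
def detect_column_changes(last, this, matches):
    """Compare the feature amounts of the this-time data with the last-time data.

    last, this: dicts mapping a column number to its feature amount.
    matches:    callable(feature_a, feature_b) -> bool, standing in for Step S1315.
    Returns a list of (this_column, status, last_column) records.
    """
    records = []
    associated = set()  # Last-time columns that have a this-time counterpart.
    for number, feature in this.items():
        if number in last and matches(feature, last[number]):
            records.append((number, "match", number))    # Step S1320: no change
            associated.add(number)
            continue
        # Steps S1325 and S1330: look for a match among the other last-time columns.
        other = next((n for n, f in last.items()
                      if n != number and matches(feature, f)), None)
        if other is not None:
            records.append((number, "switch", other))     # Step S1335: switched
            associated.add(other)
        else:
            records.append((number, "add", None))         # Step S1340: added
    # Steps S1365 to S1375: last-time columns without a counterpart were deleted.
    for n in last:
        if n not in associated:
            records.append((None, "delete", n))
    return records
```

The returned records play the role of the column change table 800 in this sketch: each record states which last-time column a this-time column corresponds to, or that a column was added or deleted.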
The data specification change detection module 23 refers to the feature amount tables 700 and determines whether or not the data type 745 of the column of the last-time data and the data type 745 of the column of the this-time data match (Step S1405). When the data types 745 match, the data specification change detection module 23 advances the processing to Step S1415. When the data types 745 do not match, the data specification change detection module 23 determines that the meanings (feature amounts) of the data do not match (Step S1410), and ends the processing.
When it is determined in Step S1405 that the data types 745 match, the data specification change detection module 23 advances the processing in accordance with the data type (Step S1415). The data specification change detection module 23 advances the processing to Step S1420 when the data type 745 is a numerical value type or a date and time type, and advances the processing to Step S1440 when the data type 745 is a character type.
In Step S1420, the data specification change detection module 23 calculates a distance between feature amounts in a feature amount space of the feature amount table 700 of the this-time data and the feature amount table 700 of the last-time data. The distance between the feature amounts may be calculated from the values of the maximum value 755 to the variance 770 by a publicly-known or well-known method, for example, a geometric distance.
In Step S1425, the data specification change detection module 23 determines whether or not the distance is equal to or less than a predetermined threshold value. When the distance is equal to or less than the threshold value, the processing advances to Step S1430, and the data specification change detection module 23 determines that the meanings (feature amounts) of the last-time column data and the this-time column data match. Meanwhile, when the distance exceeds the threshold value, the processing advances to Step S1435, and the data specification change detection module 23 determines that the meanings (feature amounts) of the last-time column data and the this-time column data do not match.
In Step S1440 of the character type, the data specification change detection module 23 determines whether or not the frequently-appearing pattern 780 in the feature amount table 700 of the this-time data matches the frequently-appearing pattern 780 in the feature amount table 700 of the last-time data.
When the frequently-appearing patterns 780 match, the processing advances to Step S1445, and the data specification change detection module 23 determines that the meanings (feature amounts) of the last-time column data and the this-time column data match. Meanwhile, when the frequently-appearing patterns 780 do not match, the processing advances to Step S1450, and the data specification change detection module 23 determines that the meanings (feature amounts) of the last-time column data and the this-time column data do not match.
Through the processing described above, it is determined whether or not the column of the this-time data and the column of the last-time data are the same based on the feature amounts of the columns.
In the column change table 800, the columns of the last-time data are arranged in the horizontal direction (1505 to 1525) and the columns of the this-time data are arranged in the vertical direction (1530 to 1550).
When the data specification change detection module 23 determines in Step S1320 of
When the data specification change detection module 23 determines in Step S1335 of
When the data specification change detection module 23 determines in Step S1340 of
In Step S1360, the data specification change detection module 23 acquires the row 1560 of the column change table 800. When there is a column in which “match” or “switch” is not recorded, in Step S1365 of
Through reference to the column change table 800, the data specification change detection module 23 can determine a difference between the last-time data and the this-time data.
As described above, in the first embodiment, the data check server 5 can detect a specification change in the data source 1, for example, addition, update, or deletion of a column, by analyzing the meaning of the data acquired from the data source 1 as a feature amount.
Further, the data specification change notification module 24 can present a modification proposal for the mapping information and a modification proposal for the data model to the user-operated PC 7 in accordance with the content of the change to the column, and hence maintenance of the ETL server 2 and the DWH 3 can be easily performed.
The data meaning analysis module 22 in the second embodiment includes machine learning modules 905 and 1005 which are illustrated in
The data meaning analysis module 22 advances the processing to Step S64 when the data type is a numerical value type, calculates the feature amount of the numerical value type column, and writes the calculated feature amount in the feature amount table 700. Then, in Step S81, the data meaning analysis module 22 inputs the calculated feature amount into the machine learning module 905 (
In Step S83, the data meaning analysis module 22 inputs the date and time type feature amount calculated in Step S66 into the machine learning module 1005 (
In Step S81 of
The machine learning module 905 includes input sections (910 to 930) and output sections (950 to 975). The input sections (input elements) include a maximum value 910, a minimum value 915, an average value 920, a variance 925, and a periodicity 930. The output sections (output elements) include a temperature 950, a humidity 955, a speed 960, a voltage 965, a current 970, and a pressure 975. The machine learning module 905 uses the feature amount of data having a known data content, and learns such that, when the feature amount is input, the corresponding output section outputs a value of 1 and the other output sections output a value of 0.
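As a non-limiting illustration, the input and output layout of the machine learning module 905 can be sketched in Python as follows. The sketch shows only how a training target and an input vector could be formed and how an output could be read; the model itself is omitted. Selecting the output section with the largest value is an assumption made for this sketch, and the names INPUTS, LABELS, input_vector, target_vector, and estimate_content are hypothetical.

```python
# Input sections 910 to 930 and output sections 950 to 975.
INPUTS = ["max", "min", "avg", "var", "periodicity"]
LABELS = ["temperature", "humidity", "speed", "voltage", "current", "pressure"]


def input_vector(feature):
    """Arrange a feature amount in the order of the input sections."""
    return [float(feature[key]) for key in INPUTS]


def target_vector(label):
    """Training target: 1 for the output section of the known data content,
    0 for every other output section."""
    return [1.0 if name == label else 0.0 for name in LABELS]


def estimate_content(outputs):
    """Read an output vector by choosing the output section with the largest value."""
    best = max(range(len(LABELS)), key=lambda i: outputs[i])
    return LABELS[best]
```

For example, the feature amount of a column known to hold speed data would be paired with a target in which only the third output section is 1.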
In Step S81 of
In Step S83 of
The machine learning module 1005 includes input sections (input elements) 1010 to 1025 and output sections (output elements) 1050 to 1065.
The input sections include a maximum value 1010, a minimum value 1015, an average value 1020, and a variance 1025. The output sections include a start time 1050, an end time 1055, a measurement time 1060, and an event occurrence time 1065. The feature amount calculated in Step S66 of
In Step S83 of
Then, the data meaning analysis module 22 compares the patterns set in the data pattern 1200 with the frequently-appearing pattern 780 obtained as the feature amount of the character string type data in Step S65, and estimates the data content.
The data pattern 2200 of
In Step S65 of
As described above, according to each of the first and second embodiments, there is provided a data management method, in which a computer (data check server 5) including a processor (11) and a memory (14) is configured to detect occurrence of a change to data of a data source (1). The data management method includes: a first step (Step S1) of acquiring, by the computer (5), data from the data source (1); a second step (Step S2) of analyzing, by the computer (5), a meaning of the acquired data on a column-by-column basis and storing, by the computer (5), an analysis result of this time in a meaning storage module (data meaning storage module 32); a third step (Step S3) of acquiring, by the computer (5), an analysis result of a last time of the columns from the meaning storage module (32); a fourth step (Step S4) of comparing (Step S3), by the computer (5), the analysis result of the last time with the analysis result of this time and determining, by the computer (5), that a change to the data has occurred when a difference exists between the analysis results; and a fifth step (Step S5) of outputting, by the computer (5), when it is determined that a change to the data has occurred, the occurrence of the change and a content of the difference.
As a result, the data check server 5 can detect a specification change, for example, an addition, update, or deletion of a column in the data source 1 by analyzing the meaning of the data acquired from the data source 1 as a feature amount.
Further, the second step (Step S2) includes calculating a feature amount of the data on a column-by-column basis as the meaning of the data, and storing the calculated feature amount in the meaning storage module (32) as the analysis result of this time. As a result, the data meaning analysis module 22 calculates a feature amount on a column-by-column basis of the data source 1 and accumulates the calculated feature amount in the data meaning storage module 32, and thus the data specification change detection module 23 can detect a change to the specification of the same column.
Further, the fourth step includes calculating a distance between the feature amount of this time and the feature amount of the last time (Step S1420), and determining that a change to the data has occurred when the distance is larger than a threshold value set in advance (Step S1425, Step S1435).
As a result, the data specification change detection module 23 can detect a change to the data on a column-by-column basis by comparing the feature amount of this time with the feature amount of the last time based on a predetermined threshold value.
Further, the second step includes calculating the feature amount of the data on a column-by-column basis as the meaning of the data, inputting the calculated feature amount to a machine learning module (905, 1005) trained in advance to estimate the content of the data, and storing an estimation result of the machine learning module (905, 1005) in the meaning storage module (32) as the analysis result of this time.
As a result, the occurrence of changes to the columns can be detected by the data meaning analysis module 22 inputting a feature amount on a column-by-column basis into the machine learning modules 905 and 1005 to estimate the content of the data, and by the data specification change detection module 23 determining whether or not the estimation results match.
Further, the data management method further includes a sixth step (Step S204) of updating, by the computer (5), in accordance with the content of the difference, mapping information obtained by aggregating the data of the data source 1 for generating output data, and transmitting, by the computer (5), the updated mapping information to a server configured to execute the aggregation of the data.
As a result, the mapping information modification module 25 of the data check server 5 can notify the ETL server 2 executing the data aggregation of the mapping information updated based on the content of the change to the data.
This invention is not limited to the embodiments described above, and encompasses various modification examples. For instance, the embodiments are described in detail for easier understanding of this invention, and this invention is not limited to modes that have all of the described components. Some components of one embodiment can be replaced with components of another embodiment, and components of one embodiment may be added to components of another embodiment. In each embodiment, other components may be added to, deleted from, or replace some components of the embodiment, and the addition, deletion, and replacement may be applied alone or in combination.
Some or all of the components, functions, processing units, and processing means described above may be implemented by hardware by, for example, designing the components, the functions, and the like as an integrated circuit. The components, functions, and the like described above may also be implemented by software by a processor interpreting and executing programs that implement their respective functions. Programs, tables, files, and other types of information for implementing the functions can be put in a memory, in a storage apparatus such as a hard disk or a solid state drive (SSD), or on a recording medium such as an IC card, an SD card, or a DVD.
The control lines and information lines described are lines that are deemed necessary for the description of this invention, and not all of the control lines and information lines of a product are mentioned. In actuality, it can be considered that almost all components are coupled to one another.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2018/044921 | 12/6/2018 | WO | 00 |