The present application claims priority from Japanese application JP2014-135511 filed on Jul. 1, 2014, the content of which is hereby incorporated by reference into this application.
The present invention relates to a technique for analyzing correlation rules in order to grasp the specifications of a database (DB) used in a target information system, for purposes such as development of that information system.
As a background art of the technical field, JP-A-H11-259567 (patent literature 1) may be referred to. This publication discloses that “the technique capable of extracting a data set of competing events and searching for the data set having the strong relationship even if a rate of occurrence is low is provided” for the purpose of the analysis of the correlation rules (refer to the abstract).
In development and maintenance of the information system, it is important to understand the specifications of the database (DB). The specifications of the database are sometimes described in the specification document explicitly but are sometimes prescribed tacitly. In order to understand the tacit specifications, the technique of extracting the features from data in the database is effective. Concretely, the basket analysis can be used to find out the dependence relation and restriction conditions (one side of the specifications) to be satisfied by the data preserved in the database from the rules (correlation rules) of the simultaneous appearance relation of the data. Further, in the present invention, a relational database (RDB) is supposed as the database specifically. At this time, the data dependence relation and restriction conditions existing among columns can be found out by means of the basket analysis.
For example, if the basket analysis finds, in a table of a certain relational database, the correlation rule that “the value of “deletion date” is never NULL when the value of “deletion flag” is “1””, the existence of the specification that “the value of “deletion date” is indispensable when “deletion flag” is “1”” can be presumed.
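The deletion flag example can be illustrated with a minimal sketch. The column names, the sample rows and the helper `confidence_not_null` are assumptions introduced only for illustration, and `None` stands in for NULL; the sketch measures how often the result-side column is filled in when the cause-side column has a given value.

```python
# Hypothetical rows of one table; None stands in for NULL.
rows = [
    {"deletion_flag": "1", "deletion_date": "2014-06-30"},
    {"deletion_flag": "1", "deletion_date": "2014-07-01"},
    {"deletion_flag": "0", "deletion_date": None},
    {"deletion_flag": "0", "deletion_date": None},
]

def confidence_not_null(rows, cause_col, cause_value, result_col):
    """Share of rows with cause_col == cause_value whose result_col is not NULL.

    Returns None when no row has the cause-side value.
    """
    matching = [r for r in rows if r[cause_col] == cause_value]
    if not matching:
        return None
    filled = sum(1 for r in matching if r[result_col] is not None)
    return filled / len(matching)

conf = confidence_not_null(rows, "deletion_flag", "1", "deletion_date")
print(conf)
```

A value of 1.0 for this sample is what would suggest the tacit specification that the deletion date is mandatory whenever the deletion flag is "1".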
Generally, the basket analysis produces a large number of correlation rules in many cases. Accordingly, some way of reducing the time and effort needed for a human being to confirm them is necessary. Two measures are commonly used: (1) reducing the total number of correlation rules by summarizing the extracted correlation rules, and (2) scoring the correlation rules mechanically so that they can be filtered and ranked (sorted).
For the scoring in (2), the support, confidence and lift values, which are index values of the correlation rules, are used in many cases. Moreover, patent literature 1 describes a method of reflecting a “rule which has a low evaluation in the conventional basket analysis but is useful” in the score by means of indexes such as an “expectation relation index” and a “relation strength index”.
However, the above conventional methods express numerically only the usefulness of individual correlation rules and do not consider indexes indicating usefulness as specifications existing among columns. A specification existing among columns consists of plural correlation rules, and accordingly there is a problem that analysis using only such indexes is insufficient.
The index values expressed numerically by the conventional methods treat all correlation rules uniformly and do not consider the characteristics as the specifications.
Concretely, evaluation values for the correlation rule indicating the correspondence relation of data (for example, when “annual paid holiday flag” is “1”, “substitute day off flag” is “0”) and the correlation rule indicating the magnitude relation of data (for example, when “selling price” is “105”, “material cost” is “30”) are calculated by the same method. Hence, there is a problem that the evaluation values indicating the usefulness of the correlation rules properly cannot be calculated (concretely, description is made in embodiments).
Accordingly, it is an object of the present invention to provide an apparatus for numerically expressing the usefulness as the specifications existing among columns of a relational database table by integrating plural correlation rules and producing evaluation values in the viewpoint of considering the characteristics of data to thereby make scoring of the correlation rules properly from the viewpoint as the specifications of the relational database.
In order to solve the above problems, according to the present invention, the ratio between the appearance rate of conditions for data and the rate at which the restriction is satisfied is used for the above scoring. This result can be used for summarizing the correlation rules. In more detail, the following structure is adopted. The correlation rule analysis apparatus, which extracts at least one of the data dependence relation and the restriction conditions of columns of a database from data stored in the database, comprises correlation rule extraction means to extract information of the simultaneous appearance relation of data among plural columns as correlation rules from data of a database table in which data to be analyzed are stored, correlation rule summarization means to summarize the extracted correlation rules on the basis of a specific commonality, and summarization result appropriateness judgment means to calculate usefulness indexes including at least one of the data dependence relation and the restriction conditions from the appearance frequency and combination in the summarized correlation rules. Here, in the present specification, the “simultaneous appearance relation” means that when one appears, the other also appears; the appearances are not necessarily required to coincide temporally. Further, the present invention includes a method and a computer program for realizing the apparatus.
According to the present invention, the correlation rules extracted from data of a relational database can be scored from the viewpoint as the specifications of the relational database. Thus, for example, when the user of the present invention analyzes the specifications of the relational database, additional information for confirming the correlation rules which is information indicating the specifications while ranking and filtering the correlation rules properly can be provided. Accordingly, the analysis work of the specifications of the relational database can be made more efficient.
An embodiment of the present invention is now described referring to the accompanying drawings.
In the embodiment, an example of a correlation rule analysis apparatus is described.
It is supposed that the processing program 113 is read in the memory 102 at the time of execution and is executed by the CPU 101. The processing contents thereof are described later with reference to the flow charts.
The column characteristic judgment rule memory part 107, the correlation rule summarization rule memory part 109 and the summarized correlation rule evaluation rule memory part 112 are previously provided with column characteristic judgment rules, correlation rule summarization rules and summarized correlation rule evaluation rules, respectively, and details of the column characteristic judgment rules, correlation rule summarization rules and summarized correlation rule evaluation rules are described later.
Data of a relational database table inputted externally by means of the input unit 103 are written in the memory part 106 for storing table data to be analyzed.
The processing part 122 for counting appearances of column values counts appearances of data in respective columns while referring to data of columns read out from the memory part 106 for storing table data to be analyzed and writes the results thereof into the memory part 121 for storing the number of times of appearances of column values.
The column characteristic judgment part 114 prepares column characteristic information using the column characteristic judgment rules read out from the column characteristic judgment rule memory part 107 while referring to the number of times of appearances of column values read out from the memory part 121 for storing the number of times of appearances of column values and writes the column characteristic information into the column characteristic memory part 108.
The correlation rule extraction processing part 116 counts appearances of sets of values in columns while referring to data of columns read out from the memory part 106 for storing table data to be analyzed and writes the result thereof into the correlation rule memory part 110.
The correlation rule summarization rule judgment part 115 selects correlation rule summarization rules using correlation rule summarization rules read out from the correlation rule summarization rule memory part 109 while referring to the column characteristic information read out from the column characteristic memory part 108 and writes the selected correlation rule summarization rules as information of the correlation rules stored in the correlation rule memory part 110. Further, the correlation rule summarization rule judgment part 115 derives correlation rule summarization names for extracted correlation rules using the selected correlation rule summarization rules and writes the derived correlation rule summarization names as information of the correlation rules stored in the correlation rule memory part 110.
The correlation rule pre-summarization processing part 117 rearranges the correlation rules read out from the correlation rule memory part 110 and updates information in the correlation rule memory part 110. Further, the correlation rule pre-summarization processing part 117 reads out the correlation rules from the correlation rule memory part 110 and calculates necessary numerical values while referring to the number of times of appearances of column values read out from the memory part 121 for storing the number of times of appearances of column values and the correlation rule summarization rules read out from the correlation rule summarization rule memory part 109 to thereby complement the information. Thereafter, the correlation rule pre-summarization processing part 117 writes the numerical values as the correlation rules in the correlation rule memory part 110 again. Moreover, the correlation rule pre-summarization processing part 117 calculates index values of the correlation rules using the information of the correlation rules read out from the correlation rule memory part 110 and updates the information of the correlation rules. Thereafter, the correlation rule pre-summarization processing part 117 writes the information as the correlation rules in the correlation rule memory part 110 again.
The correlation rule summarization processing part 118 summarizes the information of the correlation rules read out from the correlation rule memory part 110 on the basis of the commonality of the summarization names of the correlation rules and thereafter writes the information in the correlation rule summarization result memory part 111 as the summarized correlation rules.
The summarization result appropriateness judgment part 119 refers to the information of the summarized correlation rules read out from the correlation rule summarization result memory part 111, complements it using the information of the summarized correlation rule evaluation rules read out from the summarized correlation rule evaluation rule memory part 112 and thereafter writes the complemented correlation rules into the correlation rule summarization result memory part 111 again.
The summarization result visualization processing part 120 reads out the correlation rule summarization result from the correlation rule summarization result memory part 111 in accordance with the user's instruction of the apparatus and converts it into a visually and easily understandable format. Thereafter, the summarization result visualization processing part 120 outputs it onto the output unit 104.
In step 201, data of the relational database table is inputted as input information to the correlation rule analysis apparatus. The input operation is made by the user of the apparatus. In step 201, data corresponding to one table from among data of the relational database inputted from the input unit 103 are written into the memory part 106 for storing table data to be analyzed.
In step 202, a set of columns to be analyzed is selected as the input information to the correlation rule analysis apparatus. The selection operation is made by the user of the apparatus.
The information for the column set includes a set of a “cause-side column” and a “result-side column”. The “cause-side column” and the “result-side column” will be described in step 205 and the subsequent steps of the embodiment. In the embodiment, unless otherwise described below, the description supposes the case where the user of the apparatus selects the “update date” 301 as the “cause-side column” and the “approval date” 302 as the “result-side column”. Moreover, this step may be omitted and combinations of columns may be analyzed instead.
The processing in the following steps 203 to 209 is mechanical processing based on the input information and can be performed by the correlation rule analysis apparatus alone, without manual intervention.
In step 203, the processing part 122 for counting appearances of column values counts appearances of column data while referring to the column data read out from the memory part 106 for storing table data to be analyzed and writes the counted result in the memory part 121 for storing the number of times of appearances of column values.
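The counting in step 203 can be sketched as follows. This is a minimal illustration, assuming the table data are available as a list of dictionaries; the function name `count_column_values` and the sample rows are hypothetical and not taken from the embodiment.

```python
from collections import Counter

def count_column_values(rows, columns):
    """Count, for each column, how many times each value appears."""
    return {col: Counter(row[col] for row in rows) for col in columns}

# Hypothetical table data to be analyzed.
table = [
    {"update_date": "2014-07-01", "approval_date": "2014-07-02"},
    {"update_date": "2014-07-01", "approval_date": "2014-07-03"},
    {"update_date": "2014-07-02", "approval_date": "2014-07-03"},
]

counts = count_column_values(table, ["update_date", "approval_date"])
print(counts["update_date"])
```

The resulting per-column counters play the role of the number of times of appearances of column values that later steps refer to.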
The processing about the “update date” 301 has been described with reference to
Moreover, when the values of columns are given as character strings, as in the embodiment, conversion logics 503 may be provided as functions for converting the values into quantitative values. In the following description of the embodiment, unless noted otherwise, it is supposed that whenever column values are treated in a partial order relation, evaluation and processing are performed after the column values have been converted by such conversion logics.
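One possible conversion logic is sketched below. It assumes, purely for illustration, that the date columns hold ISO-formatted strings such as "2014-07-01" and that NULL is represented by `None`; the function names are hypothetical and the actual conversion logics 503 may differ.

```python
from datetime import date

def date_string_to_ordinal(value):
    """Convert an ISO date string to an integer day number; NULL stays None."""
    if value is None:
        return None
    return date.fromisoformat(value).toordinal()

def is_before_or_equal(cause_value, result_value):
    """Compare two date strings after conversion; False if either is NULL."""
    a = date_string_to_ordinal(cause_value)
    b = date_string_to_ordinal(result_value)
    return a is not None and b is not None and a <= b

print(is_before_or_equal("2014-07-01", "2014-07-02"))
```

Converting to a plain integer keeps the later comparison steps independent of the original string format.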
Further, when the rate is larger than or equal to a fixed value in plural column characteristics, one column characteristic may be decided by selecting the column characteristic having the maximum rate or the like. Alternatively, each of the column characteristics may be adopted as providing plural column characteristics in one column. In the embodiment, the following steps are described as providing one column characteristic in one column, for simplification.
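The selection of a single column characteristic can be sketched as follows. The sketch assumes that a rate has already been computed for each candidate characteristic; the characteristic names, the threshold and the function `choose_characteristic` are hypothetical.

```python
def choose_characteristic(rates, threshold):
    """Pick the characteristic with the maximum rate among those at or above
    the threshold; return None when no characteristic qualifies."""
    candidates = {name: rate for name, rate in rates.items() if rate >= threshold}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# Hypothetical rates for one column.
rates = {"date": 0.95, "number": 0.90, "flag": 0.10}
print(choose_characteristic(rates, threshold=0.8))
```

Returning every entry of `candidates` instead of only the maximum would correspond to the alternative of giving one column plural column characteristics.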
In step 205, the correlation rule extraction processing part 116 counts appearances of sets of values in respective columns while referring to data of columns read out from the memory part 106 for storing table data to be analyzed and writes the result thereof into the correlation rule memory part 110.
The correlation rule extraction processing part 116 registers column names of the “cause-side column” and the “result-side column” selected in step 202 as the cause-side column name 701 and the result-side column name 702, respectively. Furthermore, the correlation rule extraction processing part 116 preserves the “update date” 301 and the “approval date” 302 which are the cause-side column and the result-side column of the input information 300, respectively, as the set of values of the cause-side value 704 and the result-side value 705 after eliminating duplication in combination thereof. Moreover, the correlation rule extraction processing part 116 counts appearances of the sets of values by referring to values of the “update date” 301 and the “approval date” 302 and registers the counted result as information of the number of items 706.
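The extraction in step 205 amounts to counting distinct pairs of cause-side and result-side values. The following is a minimal sketch under the assumption that rows are dictionaries; the dictionary keys and the function name are hypothetical stand-ins for the items 701 to 706.

```python
from collections import Counter

def extract_correlation_rules(rows, cause_col, result_col):
    """Return one record per distinct (cause value, result value) pair,
    together with the number of rows in which the pair appears."""
    pair_counts = Counter((row[cause_col], row[result_col]) for row in rows)
    return [
        {
            "cause_column": cause_col,
            "result_column": result_col,
            "cause_value": cause_value,
            "result_value": result_value,
            "count": count,
        }
        for (cause_value, result_value), count in pair_counts.items()
    ]

# Hypothetical table data; the first two rows form the same pair.
table = [
    {"update_date": "2014-07-01", "approval_date": "2014-07-02"},
    {"update_date": "2014-07-01", "approval_date": "2014-07-02"},
    {"update_date": "2014-07-02", "approval_date": "2014-07-03"},
]
rules = extract_correlation_rules(table, "update_date", "approval_date")
print(rules)
```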
In step 206, the correlation rule summarization rule judgment part 115 selects the correlation rule summarization rules using the correlation rule summarization rules read out from the correlation rule summarization rule memory part 109 while referring to the column characteristic information read out from the column characteristic memory part 108 and writes the selected rules in the correlation rule memory part 110 as information of the correlation rules held therein.
Moreover, the correlation rule summarization rule judgment part 115 derives the correlation rule summarization names for extracted correlation rules using the selected correlation rule summarization rules and writes the derived names in the correlation rule memory part 110 as information of the correlation rules held therein.
The correlation rule summarization rule judgment part 115 selects one of the correlation rules held in the inter-column correlation rule information 700. Thereafter, the functions of the summarization object correlation rule judgment logics 805 found out as above are successively executed using the cause-side value 704 and the result-side value 705 of the selected correlation rule as input parameters. When a function returns true, the corresponding summarization name 804 is registered as the summarization rule 707 of the selected correlation rule. When a function returns false, the next function is executed, and this is repeated until true is obtained. When all functions return false, the summarization rule 707 may be left blank. The same processing is performed for each of the correlation rules 1001 held in the inter-column correlation rule information 700, so that the operation in step 206 is completed.
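The successive execution of the judgment logics can be sketched as below. The summarization names and the three comparison functions are assumptions chosen for illustration; the sketch only shows the control flow of trying each function in order and leaving the summarization rule blank (`None`) when none returns true.

```python
# Hypothetical summarization rules: (summarization name, judgment function).
SUMMARIZATION_RULES = [
    ("cause < result", lambda cause, result: cause < result),
    ("cause = result", lambda cause, result: cause == result),
    ("cause > result", lambda cause, result: cause > result),
]

def assign_summarization_name(cause_value, result_value, rules=SUMMARIZATION_RULES):
    """Run the judgment functions in order and return the first matching name.

    Returns None (blank) when every function returns false.
    """
    for name, judge in rules:
        if judge(cause_value, result_value):
            return name
    return None

# ISO date strings compare correctly as plain strings.
print(assign_summarization_name("2014-07-01", "2014-07-02"))
```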
In step 207, the correlation rule pre-summarization processing part 117 rearranges the correlation rules read out from the correlation rule memory part 110 and updates the information in the correlation rule memory part 110. Furthermore, the correlation rule pre-summarization processing part 117 reads out the correlation rules from the correlation rule memory part 110 and calculates necessary numerical values while referring to the number of times of appearances of column values read out from the memory part 121 for storing the number of times of appearances of column values and the correlation rule summarization rules read out from the correlation rule summarization rule memory part 109, so that the information is complemented. Thereafter, the correlation rule pre-summarization processing part 117 writes the information in the correlation rule memory part 110 as the correlation rule thereof again.
Then, the correlation rule pre-summarization processing part 117 calculates an index value of the correlation rule using the information of the correlation rule read out from the correlation rule memory part 110 and updates the information of the correlation rule. Thereafter, the information is written as the correlation rule of the correlation rule memory part 110 again.
(number of items of relevant correlation rules/number of items on cause side)/(number of items on result side/total of number of items of correlation rules)
The calculated value is written in the inter-column correlation rule information 700 as Lift value 710 of the correlation rules. Information in the correlation rule memory part 110 is updated by the written inter-column correlation rule information 700 to thereby complete the step.
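The Lift expression above can be written directly as a small function. The parameter names are hypothetical labels for the four quantities in the expression, and the sample numbers are invented for illustration.

```python
def lift(rule_count, cause_count, result_count, total_count):
    """Lift = (rule_count / cause_count) / (result_count / total_count).

    rule_count:   number of items of the relevant correlation rule
    cause_count:  number of items having the cause-side value
    result_count: number of items having the result-side value
    total_count:  total number of items of the correlation rules
    """
    return (rule_count / cause_count) / (result_count / total_count)

# The cause-side value fully determines the result-side value.
print(lift(rule_count=2, cause_count=2, result_count=2, total_count=4))
# The cause-side value does not narrow the result-side value at all.
print(lift(rule_count=1, cause_count=2, result_count=2, total_count=4))
```

In the first call the result-side value appears in half of all items but in every item having the cause-side value, so the Lift is 2.0; in the second call the two shares are equal, so the Lift is 1.0.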
Furthermore, although only the Lift value is calculated here as the index value of the correlation rules, other index values such as the Support value and the Confidence value may be calculated together in this processing.
In step 208, the correlation rule summarization processing part 118 summarizes the information of the correlation rules read out from the correlation rule memory part 110 on the basis of the commonality of the summarization names of the correlation rules and then writes the information in the correlation rule summarization result memory part 111 as the summarized correlation rules.
Further, after computation of the number of items 1607, the Lift values 1608 and the Support values 1609 for all groups divided as above is completed, the number of items 1607, the Lift values 1608 and the Support values 1609 as totalized evaluation values 1610 of the summarized correlation rules 1600 may be calculated. In this case, the total value of the number of items 1607 of all groups to be summarized is calculated to be described in the number of items 1607. The harmonic mean of the Lift values 1608 of all groups to be summarized is calculated to be described in the Lift values 1608. The total value of the Support values 1609 of all groups to be summarized is calculated to be described in the Support value 1609.
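The totalized evaluation values described above can be sketched as follows: the counts and Support values are summed, and the Lift values are combined by the harmonic mean. The dictionary keys and sample numbers are hypothetical; `statistics.harmonic_mean` is from the Python standard library.

```python
from statistics import harmonic_mean

def totalize(groups):
    """Combine per-group values into totalized evaluation values."""
    return {
        "count": sum(g["count"] for g in groups),
        "lift": harmonic_mean([g["lift"] for g in groups]),
        "support": sum(g["support"] for g in groups),
    }

# Hypothetical summarized groups.
groups = [
    {"count": 3, "lift": 1.0, "support": 0.75},
    {"count": 1, "lift": 2.0, "support": 0.25},
]
total = totalize(groups)
print(total)
```

For Lift values of 1.0 and 2.0 the harmonic mean is 2 / (1/1.0 + 1/2.0), that is, about 1.33, which is lower than the arithmetic mean of 1.5.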
The Support values 1609 of all the summarized correlation rules are 100% and since the relation of before and after in time always comes into effect, the summarized correlation rules are considered to be effective rules as the specifications when judgment is made from only the viewpoint of the Support values. However, the Lift value 1608 in the correlation rules 1702 is as low as 1.0 and it is shown that usefulness as the specification is low.
Further, the Lift value represents the degree to which the range of the “result-side value” is narrowed by the “cause-side value”, expressed as a multiple of the reference value (1.0) that holds when the “cause-side value” is not specified. When the value is 1.0, the “cause-side value” adds no particular restriction conditions, and accordingly it can be judged that the usefulness as a correlation rule is low.
In case of the relation of before and after in the example of
In step 209, the summarization result appropriateness judgment part 119 refers to information of the summarized correlation rules read out from the correlation rule summarization result memory part 111 and complements the correlation rules using information of the summarized correlation rule evaluation rules read out from the summarized correlation rule evaluation rule memory part 112. Thereafter, the complemented correlation rules are written in the correlation rule summarization result memory part 111 again.
When the summarized correlation rule judgment conditions 1801 are extracted, all sets of the object summarization rules 1803 and the Support value conditions 1804 held by the summarized correlation rule judgment conditions are subjected to judgment about the conditions described later. When the conditions are satisfied for all cases, it is judged that agreement is obtained for all conditions and the summarized correlation rule judgment conditions 1801 are extracted.
In judgment of the conditions represented by the sets of object summarization rules 1803 and Support value conditions 1804, the summarization rules 1604 having the same value as the object summarization rules 1803 are first found out from the summarized correlation rules 1600 to extract the Support values 1609 corresponding to the found-out rules. When the summarization rules 1604 having the same value as the object summarization rules 1803 are not found out, it is regarded that the Support value is 0%. Thereafter, it is judged whether the extracted Support value satisfies the restriction conditions of the Support value conditions 1804.
Moreover, when the summarized correlation rule judgment conditions 1801 corresponding to the summarized correlation rules 1600 cannot be extracted from the summarized correlation rule evaluation rules 1800, the processing in step 209 may be ended while the validity 1603 of the summarized correlation rules 1600 is left blank. The blank state represents that the contents of the summarized correlation rules 1600 are not the rule structure supposed as the specifications and the contents are information having the low useful degree as the specifications.
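The judgment in step 209 can be sketched as follows. The data layout is an assumption: each evaluation rule is represented as a validity label with a list of (summarization name, Support condition) pairs, and Support values are given in percent. A summarization name that is not found is treated as having a Support value of 0%, and `None` represents the blank validity.

```python
def judge_validity(summarized_support, evaluation_rules):
    """Return the label of the first evaluation rule whose conditions all hold.

    summarized_support: summarization name -> Support value in percent.
    evaluation_rules: list of (label, [(summarization name, condition)]),
    where condition is a function of the Support value.
    """
    for label, conditions in evaluation_rules:
        if all(cond(summarized_support.get(name, 0.0)) for name, cond in conditions):
            return label
    return None  # validity is left blank

# Hypothetical evaluation rule: the cause side always precedes the result side.
evaluation_rules = [
    (
        "cause precedes result",
        [
            ("cause < result", lambda support: support >= 95.0),
            ("cause > result", lambda support: support == 0.0),
        ],
    ),
]

print(judge_validity({"cause < result": 100.0}, evaluation_rules))
print(judge_validity({"cause < result": 60.0, "cause > result": 40.0}, evaluation_rules))
```

The first call matches because the missing "cause > result" entry is read as 0%; the second call matches no evaluation rule, so the validity stays blank.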
In step 210, the user of the invention obtains the analysis results of data produced by the correlation rule analysis apparatus 100 through the output unit 104. The summarization result visualization processing part 120 reads out the correlation rule summarization results from the correlation rule summarization result memory part 111 in accordance with the instruction of the user of the apparatus, converts the results into a visually and easily understandable format and outputs them to the output unit 104. Further, the output may be produced as text data or binary data so that the data can be processed by a computer, or may be displayed on a monitor as text or graphics so that a developer can read the output.
It should be further understood by those skilled in the art that although the foregoing description has been made on embodiments of the invention, the invention is not limited thereto and various changes and modifications may be made without departing from the spirit of the invention and the scope of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
2014-135511 | Jul 2014 | JP | national |