This application is based upon and claims the benefit of priority from prior Japanese Patent Application No. 2004-282056, filed Sep. 28, 2004, the entire contents of which are incorporated herein by reference.
1. Field of the Invention
The present invention relates to construction of a schema of a hierarchical database.
2. Description of the Related Art
In a case where a plurality of organizations such as corporations gather to prepare a database having a common schema (classes and properties of the classes), a database or modeling specialist solicits the opinions of a domain specialist belonging to each organization in order to determine the classes and properties, and prepares the database top down.
In recent years, tools have been developed which support schema mapping by Extensible Markup Language (XML). These tools only visually support combining of tag names, and do not newly prepare common classes. Even by the use of these tools, the domain specialist of each organization still has to adjust the tools one by one in order to check association between properties.
In Jpn. Pat. Appln. KOKAI Publication No. 8-249338, it is described that in order to unify the schema, similarity is judged with respect to the property names of a database which have heretofore been used by the corporation to thereby support schema integration.
The terms or management methods which have heretofore been used differ among the respective organizations, or the terms of the domain specialist differ from those of the modeling specialist. Therefore, an adjustment of terms, which is not essential to schema design, is required. Moreover, even after the schema design is completed, problems are sometimes found when data are actually input, in which case the schema design has to be done again.
As to the property names which have been used by the respective organizations, when conceptually similar names such as “heaviness”, “gravity”, and “weight” are used, mapping using the property names is sufficient as described in Jpn. Pat. Appln. KOKAI Publication No. 8-249338. However, the property name alone is sometimes insufficient for performing the mapping, as in a case where a name that does not convey any concept, such as “w1”, is used as a property name.
As described above, there has heretofore been a problem that properties which have different names but are in fact the same cannot be easily detected with high precision from the record data made in each organization.
An object of the present invention is to provide a classification support apparatus and method by which the same property can be easily detected with high precision even when different property names are used among a plurality of record data made in each organization.
An aspect of the invention provides a classification support apparatus comprising: an input device configured to input a plurality of record data for each of a plurality of organizations, the plurality of record data each belonging to a class item of a plurality of class items and having a plurality of property data corresponding to a plurality of properties, respectively; an extraction device configured to extract a characteristic of each of the property data from each of the record data for each of the properties to acquire a plurality of characteristics corresponding to the plurality of property data; a classification device configured to classify the properties into a plurality of unified property items of the class item based on similarity between the characteristics of the property data among the record data to obtain a first classification result; a display device configured to display the first classification result; a correction device configured to correct selectively the displayed first classification result according to correction request of a user to obtain a second classification result; and a memory which stores the first classification result failed to be corrected and the second classification result.
An embodiment of the present invention will be described hereinafter with reference to the drawings.
As shown in
When components and products belonging to a certain class item are represented by a plurality of property data concerning the components and products, in most cases, even the property data of the same property have different property names for each organization such as company and department. When the organization differs, a recording form of property data concerning each component and product belonging to the class item, that is, the form of record data also differs with each class item.
In the classification support system shown in
Therefore, first, as to a certain class item, record data having different forms for the respective organizations are used as sample data. Each property of the record data for each organization is classified into one of a plurality of property items based on characteristics of the property data of the respective properties included in each record data. In this case, even when the property name differs in each record data, a property having a similar characteristic is detected. That is, a property is detected which can be regarded as the same property. Moreover, the same property is classified into the same property item. It is to be noted that when the same property as a property of certain record data is not detected from any other record data, the property is classified into a property item of its own.
Thus, a plurality of property items are obtained which are unified with respect to the class item in all the organizations in order to classify the respective properties of a plurality of record data classified by class items or organizations. Moreover, the respective properties are classified into one of the plurality of property items, and a result is presented to a user.
The preprocessing unit 1 converts an original form of the record data which has been input as the sample data for each organization into a form capable of mutually comparing the property data included in the contents data in each record data.
The preprocessing unit 1 converts an original form of the record data into a comparable form in such a manner that the property data of each record data can be easily compared among three organizations. Here, for example, it is assumed that the form of each record data is converted into a table form.
As shown in
It is to be noted that here the table form is described as an example of the comparable form, but the present invention is not limited to this example, and any form may be used as long as it is possible to compare the characteristics of the property data of the contents data included in each record data.
Moreover, the original form of each record data classified by class items and organizations may be a comma-separated values (CSV) form or a Hypertext Markup Language (HTML) document in addition to the table form or the Extensible Markup Language (XML) document as described above.
In the classification support system of
The instance set comparison unit 3 compares the characteristics of the property data of each property among different record data, obtains a plurality of property items for classifying the respective properties of the plurality of record data based on similarity of the characteristics of the property data, and classifies each property into one of the plurality of property items. In this case, the instance set comparison unit 3 detects the same property among the plurality of record data based on the similarity of the characteristics of the property data classified by properties among the plurality of record data, and the unit classifies the same property into the same property item. Each property item is provided with an identifier (e.g., an identifier such as a BSU) for identifying each item, and correspondence property information is obtained as shown in
As shown in
As shown in
The class/property determination unit 5 receives the “determine” instruction or the correction instruction from the user to update the correspondence property information shown in
The enumeration type data proposal unit 6 detects a property having enumeration type data as the property data based on the correspondence property information updated by the class/property determination unit 5, and a characteristic amount of each property data obtained by the property characteristic extraction unit 2, and displays the property in the display unit 14.
The display unit 14 displays a property item having the enumeration type data as the property data. Thereafter, the user operates the input device 15 to input into the class/property determination unit 5 a correspondence between data used in the same meaning in each record data classified into the property item. The enumeration type data proposal unit 6 gives the identifier (e.g., a BSU) with respect to each value which can be taken by the property item input by the user. Moreover, as shown in
As shown in
The class/property determination unit 5 receives the instruction “determine” or the correction instruction from the user to update the enumeration type data correspondence information shown in
Using the dictionary edition unit 10, the user edits the dictionary data registered in the dictionary data storage unit 131 of the database 13, for example, by correcting or adding entries.
The conversion program production unit 9 produces a conversion program classified by organizations and class items to convert each property data of the record data classified by organizations and class items into the property data classified by property items of the class item, using the correspondence property information or the enumeration type data correspondence information registered in the dictionary data storage unit 131 as shown in
The contents registration unit 11 converts each property data of the record data which belongs to the class item from the organization into the property data classified by property items of the class item using a conversion program 17 classified by organizations and class items, which has been produced by the conversion program production unit 9. Furthermore, the contents registration unit 11 converts the data into data of a common format for registration, and registers the data in a contents data storage unit 132 of the database 13.
The class proposal unit 7 detects a common property item of a plurality of class items based on the characteristic of each property data included in the sample data from each organization. The common property item is owned by each of the plurality of class items, and is required for producing a class item of an upper class of the plurality of class items. The class proposal unit 7 displays in the display unit 14 the detected common property item and the plurality of class items having the common property item. Moreover, the class proposal unit 7 informs the user that it is possible to produce the class item of the upper class of the plurality of class items.
Based on the characteristic of each property data included in the sample data from each organization, the division proposal unit 8 detects another class item that has the same property item as that owned by one of the plurality of class items. The division proposal unit 8 displays in the display unit 14 the two detected class items and the property item common to the two class items.
FIGS. 21 to 23 are flowcharts showing the whole process operation of the class construction support system of
(Preprocessing Unit)
First, the user indicates an arbitrary class item (e.g., “clinical thermometer” is indicated here) to the preprocessing unit 1 (step S101). Moreover, the user inputs into the preprocessing unit 1 the sample data which belongs to the class item as shown in
First, the user selects the comparable form which is a target with respect to the preprocessing unit 1 (step S1). Here, for example, the user selects the table form. The preprocessing unit 1 reads the sample data (step S2), and supplies to the user a GUI for converting the form (source) of each record data read as the sample data into the selected comparable form (table form).
It is to be noted that here the property name of the property data of each contents data included in the record data is written in each cell of a first line of a table of the target. In each of second and subsequent lines, the property data of each contents data included in the record data is written corresponding to each property name of the first line. Each row has a form including the property data having the same property name in each contents data included in the record data.
The user gives an instruction using the GUI in such a manner as to assign the property name of each property data of the record data which is the source to each cell of the first line of the table of the target, and assign the property data (instance) of each contents data included in the record data to the second and subsequent lines of the table of the target.
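The table form described above can be sketched as follows. This is only a minimal illustration in Python, not the preprocessing unit 1 itself: the function name `to_table`, the dictionary representation of the contents data, and the sample values are assumptions introduced here, and the GUI-driven format mapping is replaced by a fixed rule.

```python
from typing import Dict, List


def to_table(contents: List[Dict[str, str]]) -> List[List[str]]:
    """Convert contents data (one dict per item) into the table form.

    The first line holds the property names; each later line holds the
    property data of one contents data, aligned with the first line.
    """
    header: List[str] = []
    for item in contents:
        for name in item:
            if name not in header:
                header.append(name)
    # A missing property is written as an empty cell (an assumption).
    lines = [[item.get(name, "") for name in header] for item in contents]
    return [header] + lines


# Hypothetical sample data for one organization.
sample = [
    {"product No.": "A-001", "weight": "12.5"},
    {"product No.": "A-002", "weight": "13.0"},
]
table = to_table(sample)
```

Once every record data has this shape, the property data sharing one property name can be read off as a single vertical slice of the table.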
For example, the record data of
The format mapping information indicates the part of the source record data, which is to be assigned to each cell in the target table, and the information is stored in the storage unit 12 of
The record data of
The record data of
Next, the unit converts the form of each record data shown in
(Property Characteristic Extraction Unit)
Next, the property characteristic extraction unit 2 obtains characteristic information of the property data classified by properties with respect to (the table of) each record data (step S104).
The property characteristic extraction unit 2 reads each record data of the comparable form shown in
The data type definition information 122 indicates a pattern of a data structure constituting the data type with respect to each of a character type (STRING), an integer type (INTEGER), and a real number type (REAL). The property characteristic extraction unit 2 checks whether or not each property data included in the row agrees with the pattern of the data type with respect to each row to judge the data type of the property data of each row.
When the data type of the property data is a numerical type (integer or real number) (step S13), the process advances to step S14. When the data type is a character type (step S13), the process advances to step S15.
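The data type judgment of this step can be sketched as below. The regular expressions are hypothetical stand-ins for the data type definition information 122, whose actual patterns are not reproduced here, and the function names are likewise assumptions.

```python
import re

# Hypothetical patterns standing in for the data type definition information 122.
INTEGER_PATTERN = re.compile(r"[+-]?\d+")
REAL_PATTERN = re.compile(r"[+-]?(\d+\.\d*|\.\d+)([eE][+-]?\d+)?")


def judge_data_type(values):
    """Return 'INTEGER', 'REAL', or 'STRING' for the data of one property."""
    if values and all(INTEGER_PATTERN.fullmatch(v) for v in values):
        return "INTEGER"
    if values and all(
        INTEGER_PATTERN.fullmatch(v) or REAL_PATTERN.fullmatch(v) for v in values
    ):
        return "REAL"
    # Anything that fits no numerical pattern is treated as the character type.
    return "STRING"


def is_numerical(data_type):
    """Branch condition corresponding to step S13."""
    return data_type in ("INTEGER", "REAL")
```

A property is judged as a numerical type only when every property data agrees with the pattern; a single non-numerical value makes the whole property a character type in this sketch.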
In step S14, characteristic amounts are obtained such as a minimum value, maximum value, average value, and appearance frequency of the property data with respect to the property of the row which is judged to be of a numerical type. Furthermore, the unit compares with each characteristic amount the basic information (stored beforehand in the storage unit 12 of
As shown in
The basic information shown in
In step S15, the property characteristic extraction unit 2 obtains characteristic amounts such as a character string length (maximum and minimum) and character string type with respect to each property data of the row which is judged to be of the character type. Furthermore, as described in step S14, the unit compares the respective characteristic amounts with the basic information shown in
As shown in
Moreover, as shown in
Moreover, as shown in
Furthermore, in the record data of the B company of
As shown in
It is to be noted that the characteristic information obtained from the property data of each row (property) of the table of the record data is not limited to the information shown in
The process operation of the property characteristic extraction unit 2 has been described above.
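The characteristic amounts named in steps S14 and S15 can be computed as in the following sketch. The dictionary keys and function names are assumptions, the appearance frequency is taken to be the number of distinct values divided by the total number of property data, and the comparison with the basic information (which yields the "TYPE" characteristic information) is omitted.

```python
def numeric_characteristics(values):
    """Characteristic amounts for a property judged to be of a numerical type."""
    numbers = [float(v) for v in values]
    return {
        "min": min(numbers),
        "max": max(numbers),
        "average": sum(numbers) / len(numbers),
        # Distinct values divided by total count (assumed definition).
        "appearance_frequency": len(set(values)) / len(values),
    }


def string_characteristics(values):
    """Characteristic amounts for a property judged to be of the character type."""
    lengths = [len(v) for v in values]
    return {
        "max_length": max(lengths),
        "min_length": min(lengths),
        "appearance_frequency": len(set(values)) / len(values),
    }


# Hypothetical property data.
weights = numeric_characteristics(["10", "20", "20", "30"])
states = string_characteristics(["OK", "NG", "OK", "OK"])
```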
(Instance Set Comparison Unit)
Next, the instance set comparison unit 3 compares, between the record data, the characteristic information obtained for each property of each record data. Moreover, the instance set comparison unit 3 obtains a plurality of property items for classifying the respective properties of the plurality of record data, and classifies each property into one of the plurality of property items. In this case, the instance set comparison unit 3 detects the same property among the plurality of record data based on the similarity of the characteristic of the property data classified by properties among the plurality of record data, and classifies the same property into the same property item (step S105).
First, the instance set comparison unit 3 selects standard record data from three record data which are sample data (step S21). Here, it is assumed that record data whose property number is largest is selected from these three record data. Therefore, the record data of company A is selected.
Next, the unit selects one (here, from the record data of companies B and C) of the record data (record data which is a comparison object) to be compared with the standard record data (steps S22, S23).
With regard to an arbitrary property of the record data which is the comparison object selected in step S23, the instance set comparison unit 3 compares the characteristic of the property data with that of each property of the standard record data. Moreover, the instance set comparison unit 3 obtains the property of the standard record data having a characteristic (regarded as the same as that of the arbitrary property) having a highest similarity with respect to the characteristic of the arbitrary property of the record data which is the comparison object. When a plurality of properties are obtained from the standard record data, the instance set comparison unit 3 selects one of them based on the similarity of the property name (steps S24, S25).
When the instance set comparison unit 3 obtains the property of the standard record data having the characteristic (regarded as the same as that of the arbitrary property) having a highest similarity with respect to the characteristic of the arbitrary property of the record data which is the comparison object (step S26), as shown in
In step S25, the similarity of the standard record data to each property is calculated with respect to the characteristics like the data type, the character string type and the like of the arbitrary property of the record data which is the comparison object with reference to the property characteristic information shown in
For example, the “name” property of the record data of company B will be described in a case where the characteristics of the property are compared with those of each property of the record data of company A selected as the standard record data.
As shown in
Then, the instance set comparison unit 3 compares each characteristic information of the “name” property of the record data of company B with that of the arbitrary property of the record data of company A. When there is matched characteristic information, the similarity is set to “1” concerning the characteristic information. Moreover, as to the characteristic information represented by a numerical value, when the values do not agree, a ratio of the difference (the difference between the characteristic information of the “name” property and that of the record data of company A) with respect to the characteristic information of the “name” property is set as the similarity concerning the characteristic information. It is to be noted that when this ratio is not more than the predetermined threshold value, the similarity may be set to “0” concerning the characteristic information. In the case of disagreement of characteristic information indicating a type, like the “DATA_TYPE” or the “character string type”, the similarity is set to “0” concerning the characteristic information. As to a certain property of the record data of company A, after the similarity of each characteristic information to the “name” property of the record data of company B is obtained, a total value is calculated.
When there is not any “TYPE” characteristic information in the “name” property of the record data of company B, the total value of the similarity indicates the similarity between the “name” property of the record data of company B and the arbitrary property of the record data of company A.
When there is the “TYPE” characteristic information in the “name” property of the record data of company B, the weighting of a predetermined value is performed with respect to the total value of the similarity of the property having the “TYPE” characteristic information which agrees with that of the “name” property among the properties of the record data of company A. For example, the total value of the similarity is multiplied with a predetermined weight value (e.g., a positive integer value), and, as a result, an obtained value is set as the similarity between the “name” property of the record data of company B and the property of the record data of company A.
It is to be noted that, among the characteristic information concerning a certain property, a higher similarity than that of the other characteristic information may be assigned to the characteristic information that best represents the characteristic of the property, or weighting may otherwise be performed in accordance with the importance of the characteristic information.
In this manner, the similarity between the properties indicates a high value, when there is more characteristic information (especially the characteristic information which is an important element in representing the characteristic of the property) whose values agree with each other or are close to each other. Additionally, when both “TYPE” characteristic information agrees with each other, any calculation method may be used as long as a higher value is indicated.
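The calculation described above can be sketched as follows. This is one possible reading only: the key names, the weight value of 2, and the sample numbers are assumptions. In particular, for numerical characteristic information that does not agree, the sketch uses one minus the ratio of the difference, floored at zero, so that closer values score higher; this is an assumed substitute and is not confirmed by the description, which permits any calculation method giving a higher value for closer agreement.

```python
# Hypothetical key names for characteristic information that indicates a type.
TYPE_LABELS = ("DATA_TYPE", "STRING_TYPE")


def characteristic_similarity(key, a, b):
    """Similarity concerning one piece of characteristic information."""
    if a == b:
        return 1.0
    is_number = isinstance(a, (int, float)) and isinstance(b, (int, float))
    if key in TYPE_LABELS or not is_number or a == 0:
        return 0.0
    # Assumed rule: 1 - (ratio of the difference), never below zero.
    return max(0.0, 1.0 - abs(a - b) / abs(a))


def property_similarity(target, candidate, type_weight=2):
    """Total similarity, weighted when both "TYPE" values agree."""
    shared = [k for k in target if k != "TYPE" and k in candidate]
    total = sum(
        characteristic_similarity(k, target[k], candidate[k]) for k in shared
    )
    if "TYPE" in target and target["TYPE"] == candidate.get("TYPE"):
        total *= type_weight
    return total


# Hypothetical characteristic information.
location = {"DATA_TYPE": "STRING", "TYPE": "URL", "max_length": 30, "min_length": 20}
hp = {"DATA_TYPE": "STRING", "TYPE": "URL", "max_length": 30, "min_length": 20}
product_no = {"DATA_TYPE": "STRING", "max_length": 8, "min_length": 8}

score_hp = property_similarity(location, hp)
score_product_no = property_similarity(location, product_no)
```

With these sample values the property whose "TYPE" agrees receives the weighted total and therefore ranks above a property that shares only the data type.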
As shown in
Moreover, the “location” property of the record data of company B will be described, when compared with the characteristic of each property of company A record data selected as the standard record data.
As shown in
Among the properties of the record data of company A, as to the “HP” property, the “DATA_TYPE” is “STRING”, and the “TYPE” is “URL”, in the same manner as in the “location” property of the record data of company B. The maximum and minimum character string lengths also indicate values which are equal to those of the “location” property of the record data of company B. Therefore, the similarity of the “HP” property is highest among the properties of the record data of company A.
In this manner, as to the characteristic of the arbitrary property of the record data which is the comparison object, the instance set comparison unit 3 calculates the similarity to each property of the standard record data. As a result, the unit selects properties whose similarities are not less than a predetermined threshold value from the standard record data. The property having a highest similarity is selected from the properties. It is judged that the selected property is the same as the arbitrary property of the record data which is a comparison object.
It is to be noted that in a case where a plurality of properties are obtained whose similarities are not less than the predetermined threshold value and whose values are highest from the standard record data by the instance set comparison unit 3, as to the respective property names of the plurality of properties, the similarity is obtained with respect to the property name of the arbitrary property of the row which is the comparison object. Moreover, the property name is selected whose similarity is highest, and it is judged that the selected property is the same as the arbitrary property of the record data which is the comparison object.
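The selection rule of the two preceding paragraphs can be sketched as below, assuming the similarities to each property of the standard record data have already been calculated. The function name, its arguments, and the sample scores are hypothetical.

```python
def select_same_property(scores, name_similarity, threshold):
    """Pick the property of the standard record data judged to be the same.

    scores: {property name of the standard record data: similarity}
    name_similarity: callable giving, for a property name of the standard
        record data, its similarity to the comparison object's property name
    Returns None when no similarity reaches the threshold.
    """
    eligible = {name: s for name, s in scores.items() if s >= threshold}
    if not eligible:
        return None
    best = max(eligible.values())
    tied = [name for name, s in eligible.items() if s == best]
    if len(tied) == 1:
        return tied[0]
    # Several properties share the highest similarity: use the property names.
    return max(tied, key=name_similarity)


# Hypothetical similarities for one property of the comparison object.
scores = {"weight": 4.0, "height": 4.0, "HP": 1.0}
name_scores = {"weight": 0.9, "height": 0.2}
chosen = select_same_property(scores, lambda n: name_scores.get(n, 0.0), 3.0)
```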
Here, one example will be briefly described as to a method of calculating the similarity between the “property names”. A distance is obtained which corresponds to the similarity between the property names (vocabularies) in ontology, using an ontology dictionary (e.g., it is assumed that the dictionary is stored in the database 13 or the storage unit 12) indicating identity or similarity, lower/upper relation or the like of the meaning or concept between the respective vocabularies which are usable as the property names.
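A distance of this kind can be illustrated with a breadth-first search over a small graph of related vocabularies. The graph contents, the function names, and the conversion of the distance into a similarity are all assumptions made for this sketch; the actual ontology dictionary is not reproduced here.

```python
from collections import deque

# Hypothetical ontology dictionary: each vocabulary lists directly related ones.
ONTOLOGY = {
    "weight": ["heaviness", "mass"],
    "heaviness": ["weight"],
    "mass": ["weight", "gravity"],
    "gravity": ["mass"],
    "height": [],
}


def ontology_distance(a, b):
    """Number of relations on the shortest path from a to b, or None."""
    if a == b:
        return 0
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        word, dist = queue.popleft()
        for nxt in ONTOLOGY.get(word, []):
            if nxt == b:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def name_similarity(a, b):
    """Assumed conversion: shorter distance gives higher similarity."""
    dist = ontology_distance(a, b)
    return 0.0 if dist is None else 1.0 / (1 + dist)
```

A property name that does not appear in the dictionary, such as "w1", has no path to any vocabulary and therefore receives a similarity of zero in this sketch.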
When the same property as the arbitrary property of the record data which is the comparison object is obtained from the standard record data in this manner (step S26), as shown in
After performing the process of steps S25 to S27 with respect to all the properties of the record data which is the comparison object (step S24), the process returns to step S22. When there is record data that has not been selected as the comparison object in the step S22, the process advances to step S23, unselected record data is selected, and the process of steps S24 to S27 is repeated. In step S22, the process of steps S23 to S27 is repeated until all the record data is selected except the standard record data as the comparison object.
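The loop of steps S21 to S27 can be sketched as follows. The data layout, the function names, the toy similarity `matching_count`, and the sample characteristic information are assumptions; the tie-break by property name is omitted, and the sketch does not guard against two properties of one record data matching the same property of the standard record data.

```python
def classify_properties(records, similarity, threshold):
    """records: {organization: {property name: characteristic information}}.

    Returns {property item identifier: {organization: property name}}.
    """
    # Step S21: the record data with the largest number of properties.
    standard = max(records, key=lambda org: len(records[org]))
    items, item_of = {}, {}
    for name in records[standard]:
        item_id = f"P{len(items) + 1}"
        items[item_id] = {standard: name}
        item_of[name] = item_id
    # Steps S22 to S27: compare every other record data with the standard.
    for org, props in records.items():
        if org == standard:
            continue
        for name, traits in props.items():
            scored = [
                (similarity(traits, std_traits), std_name)
                for std_name, std_traits in records[standard].items()
            ]
            eligible = [pair for pair in scored if pair[0] >= threshold]
            if eligible:
                best_name = max(eligible, key=lambda pair: pair[0])[1]
                items[item_of[best_name]][org] = name
            else:
                # No same property detected: a property item of its own.
                items[f"P{len(items) + 1}"] = {org: name}
    return items


def matching_count(a, b):
    """Toy similarity: number of characteristic values that agree."""
    return sum(1 for key in a if key in b and a[key] == b[key])


records = {
    "A": {
        "product No.": {"DATA_TYPE": "STRING", "max_length": 5},
        "HP": {"DATA_TYPE": "STRING", "TYPE": "URL"},
        "weight": {"DATA_TYPE": "REAL", "max": 20.0},
        "height": {"DATA_TYPE": "REAL", "max": 150.0},
    },
    "B": {
        "name": {"DATA_TYPE": "STRING", "max_length": 5},
        "location": {"DATA_TYPE": "STRING", "TYPE": "URL"},
        "memo": {"DATA_TYPE": "INTEGER"},
    },
}
result = classify_properties(records, matching_count, threshold=2)
```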
As a result of the process shown in
The instance set comparison unit 3 applies identifiers (here “P1” to “P6”) to a plurality of property items of the class item as shown in
(Property Candidate Presentation Unit)
In step S106 of
First, a display format (e.g., a table form here) shown in
Next, each record data shown in
(Class/Property Determination Unit)
When property candidates shown in
The class/property determination unit 5 receives the “determine” instruction or the correction instruction from the user to update the correspondence property information shown in
(Enumeration Type Data Proposal Unit)
In the property characteristic information shown in
For example, when the total number of the property data is “250”, and there are two types of values: “male”; and “female”, the “appearance frequency” characteristic information is “2/250=0.008”. In the property characteristic information of
An enumeration type data evaluation measure 20 stored beforehand in the storage unit 12 is a threshold value. When the appearance frequency is not more than (or is less than) the value, the property data is judged as the enumeration type data. It is assumed here that the enumeration type data evaluation measure is set to “0.5”. Therefore, properties are judged as the enumeration type data: a “P5” property including the “company name” property (appearance frequency is “0.25”) of the record data of company A and a “C6” property (appearance frequency is “0.25”) of the record data of the company C; and a “P6” property including the “state” property (appearance frequency is “0.5”) of the record data of the company A and a “C2” property of the record data of the company C.
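The judgment can be sketched as below. The threshold of 0.5 and the 2/250 example follow the text; the function names and the split of the 250 values between the two types are assumptions.

```python
# Enumeration type data evaluation measure (threshold value from the text).
ENUMERATION_MEASURE = 0.5


def appearance_frequency(values):
    """Number of distinct values divided by the total number of property data."""
    return len(set(values)) / len(values)


def is_enumeration_type(values, measure=ENUMERATION_MEASURE):
    """Judge as enumeration type data when the frequency is not more than the measure."""
    return appearance_frequency(values) <= measure


# 250 property data with two types of values (the 150/100 split is hypothetical).
gender = ["male"] * 150 + ["female"] * 100
```

Here `appearance_frequency(gender)` evaluates 2/250, so the property is judged as enumeration type data, whereas a property in which every value is distinct has a frequency of 1 and is not.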
The enumeration type data proposal unit 6 displays in the display unit 14 (the identifier of) the property item judged as the enumeration type data among a plurality of property items together with the property name or the property data of each record data classified into the property item (step S110 of
For example, in the “P6” property, the record data of the company A has two types of property data: “OK”; and “NG”, and the record data of the company C has two types of the property data: “possible”; and “impossible”. In this case, when the user inputs information indicating that the “OK” of the record data of the company A is synonymous with the “possible” of the record data of the company C, the enumeration type data proposal unit 6 gives the identifier “P7” to the information. When the user inputs information indicating that the “NG” of the record data of company A is synonymous with the “impossible” of the record data of company C, the enumeration type data proposal unit 6 gives the identifier “P8” to the information.
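The correspondence input in this example can be held in a simple lookup structure, as in the following sketch. The dictionary layout and the function name are assumptions; the identifiers and values are those of the example above.

```python
# Enumeration type data correspondence for the "P6" property item.
ENUM_CORRESPONDENCE = {
    "P7": {"A": "OK", "C": "possible"},
    "P8": {"A": "NG", "C": "impossible"},
}


def to_identifier(organization, value):
    """Return the identifier given to an organization's value, or None."""
    for identifier, values in ENUM_CORRESPONDENCE.items():
        if values.get(organization) == value:
            return identifier
    return None
```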
It is to be noted that in steps S110 and S111 of
Moreover, as shown in
The user confirms this information. If there is not any correction, the user operates the input device 15 to input the “determine” instruction into the class/property determination unit 5 with respect to the information displayed in the display unit 14 (step S112 of
The class/property determination unit 5 receives the “determine” instruction or the correction instruction from the user to update the enumeration type data correspondence information shown in
(Conversion Program Production Unit)
In step S115 of
It is to be noted that this conversion program may include a program for converting the form of the record data belonging to the class item from the organization into a form which is common to all the organizations.
First, a template of the conversion program is read as shown in
Here, an example will be described in which the conversion program is produced with respect to the class item “clinical thermometer” of company A. Since the record data of company A uses six property names “product No.”, “HP”, “weight”, “height”, “company name”, and “state”, the conversion program production unit 9 substitutes six property names into arguments “source” of six command sentences L1, respectively. Furthermore, the unit substitutes the identifiers “P1” to “P6” of the property items corresponding to the six property names into the arguments “target” of the six command sentences L1, respectively. As a result, the conversion program is produced as shown in
The conversion programs are similarly produced with respect to companies B and C.
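The effect of such a conversion program can be sketched as a mapping from the property names of one organization to the identifiers of the property items. The template and its command sentences are not reproduced; the function names and data layout are assumptions, and the six property names of company A are assumed to correspond to "P1" to "P6" in the order listed above.

```python
def produce_conversion(correspondence, organization):
    """Build a converter for one organization.

    correspondence: {property item identifier: {organization: property name}}
    """
    # Each (source, target) pair plays the role of one command sentence.
    mapping = {
        names[organization]: item_id
        for item_id, names in correspondence.items()
        if organization in names
    }

    def convert(record):
        """Rename the properties of one contents data to property items."""
        return {mapping[n]: v for n, v in record.items() if n in mapping}

    return convert


# Correspondence property information for company A (assumed order).
correspondence = {
    "P1": {"A": "product No."},
    "P2": {"A": "HP"},
    "P3": {"A": "weight"},
    "P4": {"A": "height"},
    "P5": {"A": "company name"},
    "P6": {"A": "state"},
}
convert_a = produce_conversion(correspondence, "A")
converted = convert_a({"product No.": "A-001", "state": "OK"})
```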
The above-described steps S101 to S115 are a series of process operations using the input sample data with respect to one class item. When the process of steps S101 to S115 is repeated with respect to each class item, a plurality of property items can be obtained which are unified in all the organizations with respect to each class item.
(Contents Registration Unit)
As shown in
(Class Proposal Unit)
When the process of steps S101 to S115 is repeated with respect to each class item, it is possible to obtain a plurality of property items, and the results of the classification of the respective properties of the record data classified by organizations into the plurality of property items in accordance with class item.
For example, as to the class item “clinical thermometer” shown in
Moreover, for example, when the process of the above-described steps S101 to S115 is also performed, for example, with respect to a “water thermometer” which is another class item, it is assumed that property items “P11” to “P15” are obtained.
Furthermore, for example, when the process of the above-described steps S101 to S115 is also performed, for example, with respect to a “room thermometer” which is another class item, it is assumed that property items “P21” to “P25” are obtained.
When a plurality of property items owned by each class item are obtained with respect to a plurality of class items in this manner, the class proposal unit 7 extracts the common property item owned by each of the plurality of class items.
A process operation of the class proposal unit 7 will be described with reference to the flowchart shown in
First, step S51 will be described. When the property names “P1” to “P6” are obtained with respect to the class item “clinical thermometer”, the property names “P11” to “P15” are obtained with respect to the class item “water thermometer”, and the property names “P21” to “P25” are obtained with respect to the class item “room thermometer” as described above, the property characteristic extraction unit 2 performs a process similar to that of the instance set comparison unit 3, using the property characteristic information obtained from the sample data of each class item as shown in
For example, it is assumed that the characteristic information of the property data of the respective record data corresponding to the respective property names of “P1”, “P11”, and “P21” agree with or are similar to one another, and they are judged as the same property. It is also assumed that the characteristic information of the property data of the respective record data corresponding to the property names “P2”, “P12”, and “P22” agree with or are similar to one another, and they are judged as the same property. It is also assumed that the characteristic information of the property data of the respective record data corresponding to the property names “P3”, “P13”, and “P23” agree with or are similar to one another, and they are judged as the same property.
Here, for the sake of convenience, the property names “P1”, “P11”, and “P21” judged as the same property are collectively referred to as “P1”, the property names “P2”, “P12”, and “P22” as “P2”, and the property names “P3”, “P13”, and “P23” as “P3”.
In step S51, since the property items “P1” to “P3” exist in all of the three class items, the class proposal unit 7 extracts these common property items “P1” to “P3”.
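The extraction in step S51 can be illustrated with a minimal sketch. It assumes that the judgment of which property names denote the same property has already been made and is available as a lookup table; the names `CANONICAL`, `canonicalize`, and `common_property_items` are hypothetical and do not appear in the embodiment.

```python
# Hypothetical table mapping each property name to the name adopted for the
# group of properties judged to be the same (assumed to be given in advance).
CANONICAL = {
    "P11": "P1", "P21": "P1",
    "P12": "P2", "P22": "P2",
    "P13": "P3", "P23": "P3",
}


def canonicalize(names):
    """Replace each property name by the name adopted for its group."""
    return {CANONICAL.get(name, name) for name in names}


def common_property_items(class_items):
    """Return the property items owned by every class item."""
    sets = [canonicalize(props) for props in class_items.values()]
    if not sets:
        return set()
    return set.intersection(*sets)


class_items = {
    "clinical thermometer": ["P1", "P2", "P3", "P4", "P5", "P6"],
    "water thermometer": ["P11", "P12", "P13", "P14", "P15"],
    "room thermometer": ["P21", "P22", "P23", "P24", "P25"],
}

common = sorted(common_property_items(class_items))
print(common)  # ['P1', 'P2', 'P3']
```

A plain set intersection is used here because a property item qualifies only when it is owned by every one of the class items under consideration.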
Moreover, in step S52, when the property items “P1” to “P3” are common to the above-described three class items, the display unit 14 displays a message informing the user that a class item having these three common properties can serve as an upper class item of the three class items.
The user accepts, rejects, or corrects and then accepts the proposal that the class item having the properties “P1” to “P3” be set as the upper class item of the three class items. For example, after correcting the name or identifier of the upper class item, the properties owned by the upper class item, or the like, the user inputs “acceptance”. Then, as a result of the correction, a class system shown in
The class system (hierarchical structure of the class item) shown in
(Division Proposal Unit)
With respect to a plurality of class items, the division proposal unit 8 detects another class item having the same property item as that of one class item among the plurality of class items based on the characteristic of each property data included in the sample data from each organization (step S61).
That is, the division proposal unit 8 performs a process similar to that of the instance set comparison unit 3 using the property characteristic information to check whether or not both the items have the same property. The property characteristic information includes: property characteristic information shown in
When the same property exists in both sets of property characteristic information, that is, when two class items having a common property item are detected, the division proposal unit 8 displays, on the display unit 14, the two detected class items and the property item common to them (step S62).
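The detection in steps S61 and S62 can be sketched as a pairwise comparison of class items. The sketch below is illustrative only: it assumes that the characteristic of each property is summarized as a simple tuple and that two properties are the same when the tuples are equal, whereas the embodiment uses the comparison of the instance set comparison unit 3. The function names, the tuple layout, the class item “barometer”, and the sample values are all hypothetical.

```python
from itertools import combinations


def shared_properties(props_a, props_b):
    """Return (name_a, name_b) pairs whose characteristics are equal.

    Equality of the tuples stands in for the similarity judgment of the
    instance set comparison unit 3 (an assumption of this sketch).
    """
    return [
        (name_a, name_b)
        for name_a, char_a in props_a.items()
        for name_b, char_b in props_b.items()
        if char_a == char_b
    ]


def detect_shared(class_items):
    """Map each pair of class items to the property items they share."""
    result = {}
    for (item_a, props_a), (item_b, props_b) in combinations(class_items.items(), 2):
        pairs = shared_properties(props_a, props_b)
        if pairs:  # only pairs with a common property are reported (step S62)
            result[(item_a, item_b)] = pairs
    return result


# Illustrative characteristics: (data type, unit). Not taken from the embodiment.
class_items = {
    "clinical thermometer": {"P1": ("float", "degC"), "P4": ("str", "model")},
    "water thermometer": {"P11": ("float", "degC"), "P14": ("int", "depth")},
    "barometer": {"Q1": ("float", "hPa")},
}

detected = detect_shared(class_items)
for (item_a, item_b), pairs in detected.items():
    print(item_a, "/", item_b, "->", pairs)
```

Each reported entry corresponds to what would be shown to the user: the two detected class items and the property item common to both.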
The user can delete the property item which has been judged to be the same as that of the class item shown in
This editing is performed, for example, by the dictionary edition unit 10.
As described above, according to the embodiment, a plurality of property items are obtained for each class item based on the characteristics of the property data, classified by property, of each record data prepared by each organization, and each property of each record data is classified into one of the plurality of property items. Accordingly, even when different property names are used among the plurality of record data of the respective organizations, the same property can be detected easily and with high precision.
Moreover, by displaying the result of classifying each property of each record data by property item, the system supports the user in managing, in a unified manner and in accordance with the unified property items and forms, the record data of the respective organizations whose property names or forms are not unified.
It is to be noted that each constituting unit (preprocessing unit 1, property characteristic extraction unit 2, instance set comparison unit 3, property candidate presentation unit 4, class/property determination unit 5, enumeration type data proposal unit 6, class proposal unit 7, division proposal unit 8, conversion program production unit 9, dictionary edition unit 10, contents registration unit 11 or the like) of the classification support system of
For example, storage means such as a memory of the computer or a hard disc is used as the storage unit 12 or the database 13 of
According to the present invention, even when different property names are used among a plurality of record data made in each organization, the same property can be easily detected with high precision. As a result, the user can identify the record data made in each organization, whose property names or forms are not unified, in accordance with unified property items and forms, and a common class system can be efficiently constructed.
Additional advantages and modifications will readily occur to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details and representative embodiments shown and described herein. Accordingly, various modifications may be made without departing from the spirit or scope of the general inventive concept as defined by the appended claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
2004-282056 | Sep 2004 | JP | national |