This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2020-151011, filed on Sep. 9, 2020, the entire contents of which are incorporated herein by reference.
The embodiment discussed herein is related to data processing.
Table data such as a relational database (RDB) is used in data analysis or machine learning. The table data often includes an attribute value for each of a plurality of attributes. An attribute included in the table data is sometimes called a column or an item.
Japanese Laid-open Patent Publication No. 2020-112919, International Publication Pamphlet No. WO 2016/125277, Japanese Laid-open Patent Publication No. 2012-38066, and F. Nargesian et al., “Table Union Search on Open Data”, Proceedings of the VLDB Endowment, Vol. 11, No. 7, March 2018, pages 813-825 are disclosed as related art.
According to an aspect of the embodiments, a non-transitory computer-readable recording medium stores a data processing program for causing a computer to execute processing including: obtaining a similarity between each of a plurality of attributes included in first table data and each of a plurality of attributes included in second table data; and associating each of the plurality of attributes included in the first table data with any attribute of the plurality of attributes included in the second table data on the basis of the similarity, an order of the plurality of attributes included in the first table data, and an order of the plurality of attributes included in the second table data.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
In combining processing of combining two table data to generate integrated table data, attributes are associated on a one-to-one basis between the two table data.
Table data 102 includes attributes such as “device”, “manufacturing company”, and “construction company”, and includes a plurality of attribute values of each of the attributes. “Air conditioner” and “refrigerator” are the attribute values of the “device”, “XX electric corporation” and “YY electric company” are the attribute values of the “manufacturing company”, and “ZZ factory” and “WW company limited” are the attribute values of the “construction company”.
In this case, integrated table data 103 is generated by associating the “equipment name”, “manufacturer”, and “construction company” of the table data 101 with the “device”, “manufacturing company”, and “construction company” of the table data 102, respectively. The integrated table data 103 includes the same attributes as the table data 101, and the attribute values of each of the attributes include the attribute values of both the table data 101 and the table data 102.
A table combining search on open data is known in relation to the table data combining processing. A data integration support device that supports efficient data integration is also known. A database analysis device that enables efficient and highly accurate analysis of a large-scale and complicated database is also known. A data processing device that improves efficiency of work of extracting columns that correspond to each other between two pieces of two-dimensional data is also known.
In the combining processing of combining two table data, in the case of associating an attribute of one table data with any attribute of the other table data on the basis of the similarity between attributes, there are some cases where the attribute of the one table data is associated with an incorrect attribute of the other table data.
In one aspect, the embodiment aims to accurately associate attributes with each other between two table data.
Hereinafter, embodiments will be described in detail with reference to the drawings.
In the table combining search of F. Nargesian et al., “Table Union Search on Open Data”, Proceedings of the VLDB Endowment, Vol. 11, No. 7, March 2018, pages 813-825, a similarity between attributes is calculated between the two table data, and an attribute of one table data corresponding to each attribute of the other table data is estimated on the basis of the similarity between attributes. As the similarity between attributes, a similarity based on schema matching, a Jaccard similarity, a cosine similarity using a word vector generated by word embedding, a cosine similarity using the vector of Japanese Laid-open Patent Publication No. 2020-112919, or the like can be used in addition to the similarity of F. Nargesian et al., “Table Union Search on Open Data”, Proceedings of the VLDB Endowment, Vol. 11, No. 7, March 2018, pages 813-825.
However, in the case of associating attributes with each other between two table data on the basis of only the similarity between attributes, an attribute B of one table data having the maximum similarity to an attribute A of the other table data is associated with the attribute A. Therefore, an incorrect attribute B may be associated with the attribute A.
The similarity of the “manufacturer” and the “device” is 0.22, the similarity of the “manufacturer” and the “manufacturing company” is 0.75, and the similarity of the “manufacturer” and the “construction company” is 0.84.
In this case, the “manufacturer” and the “construction company” indicated by the maximum similarity of 0.84 are incorrectly associated with each other.
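The incorrect association described above can be reproduced with a minimal Python sketch. The dictionary layout and the function name associate_by_max_similarity are assumptions introduced only for illustration; the similarity values are those given in the example above.

```python
# Similarities between the "manufacturer" of one table data and each
# attribute of the other table data, as given in the example above.
# The dictionary layout is an assumption made for illustration.
similarity_to_manufacturer = {
    "device": 0.22,
    "manufacturing company": 0.75,
    "construction company": 0.84,
}


def associate_by_max_similarity(scores):
    """Return the candidate attribute having the maximum similarity."""
    return max(scores, key=scores.get)


best = associate_by_max_similarity(similarity_to_manufacturer)
print(best)  # the attribute chosen when only the similarity is used
```

With these values, the selected attribute is the "construction company" rather than the "manufacturing company", which illustrates why the similarity alone may not be sufficient.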
In a case where each table data contains a large number of attributes, since the number of combinations of the attributes of one table data and the attributes of the other table data increases, there is a high possibility that an incorrect association is performed.
Next, the association processing unit 313 performs association processing on the basis of the obtained similarity, an order of the plurality of attributes included in the first table data, and an order of the plurality of attributes included in the second table data (step 402). The association processing is processing of associating each of the plurality of attributes included in the first table data with any attribute of the plurality of attributes included in the second table data.
According to the data processing device 301 of
The attributes included in the table data are often arranged in an order that is easy for humans to understand. In particular, a plurality of highly related attributes is often arranged at positions close to each other. Therefore, in combining processing of combining two table data, not only the similarity between two association candidate attributes but also the similarity between attributes existing in the vicinity of the association candidate attributes is used, so that the incorrect association is reduced, and the accuracy of the association processing can be improved.
The storage unit 511 stores table data 521-1 and table data 521-2 to be combined. Table data 521-1 and the table data 521-2 include a plurality of attributes and include a plurality of attribute values of each of the attributes.
The similarity calculation unit 512 calculates the similarity between each attribute of the table data 521-1 and each attribute of the table data 521-2, generates similarity information 522 indicating the calculated similarity, and stores the similarity information 522 in the storage unit 511. As the similarity between attributes, for example, the similarity of F. Nargesian et al., “Table Union Search on Open Data”, Proceedings of the VLDB Endowment, Vol. 11, No. 7, March 2018, pages 813-825, the similarity based on schema matching, the Jaccard similarity, the cosine similarity using a word vector, or the cosine similarity using the vector of Japanese Laid-open Patent Publication No. 2020-112919 can be used.
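As one possible illustration of the similarity calculation, the following Python sketch computes a Jaccard similarity between the sets of attribute values of two attributes. Treating the Jaccard similarity as being taken over the attribute values is an assumption, and the function name jaccard_similarity and the sample values of the first attribute are hypothetical.

```python
def jaccard_similarity(values_a, values_b):
    """Jaccard similarity |A & B| / |A | B| of two collections of values."""
    set_a, set_b = set(values_a), set(values_b)
    union = set_a | set_b
    if not union:
        # Both attributes have no values; 0.0 is an assumed convention.
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical attribute values of two attributes to be compared.
manufacturer_values = ["XX electric corporation", "VV industries"]
manufacturing_company_values = ["XX electric corporation", "YY electric company"]

score = jaccard_similarity(manufacturer_values, manufacturing_company_values)
print(score)
```

Any of the other similarities mentioned above may be substituted, as long as one value is obtained for each pair of attributes.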
The extraction unit 513 extracts an order of the plurality of attributes in the table data 521-i from the table data 521-i (i=1 or 2), generates attribute order information 523-i indicating the extracted order, and stores the attribute order information 523-i in the storage unit 511.
The association processing unit 514 selects each attribute of the table data 521-1 as an association candidate attribute A1, and selects an attribute B1 adjacent to the attribute A1 in the table data 521-1 using the attribute order information 523-1. Then, the association processing unit 514 selects each attribute of the table data 521-2 as an association candidate attribute A2, and selects an attribute B2 adjacent to the attribute A2 in the table data 521-2 using the attribute order information 523-2.
Next, the association processing unit 514 determines whether to associate the attribute A1 and the attribute A2 on the basis of a similarity SA between the attribute A1 and the attribute A2 and a similarity SB between the attribute B1 and the attribute B2. The association processing unit 514 calculates a similarity S by, for example, the following equation.
S=SA+SB (1)
The similarity S represents a sum of the similarity SA and the similarity SB. The association processing unit 514 calculates the similarity S using each attribute of the table data 521-2 as the attribute A2, and associates the attribute A2 having the maximum similarity S with the attribute A1. By repeating similar processing using each attribute of table data 521-1 as the attribute A1, each attribute of the table data 521-1 is associated with any attribute of the table data 521-2.
Next, the association processing unit 514 generates an association result 524 indicating a correspondence between the plurality of attributes of the table data 521-1 and the plurality of attributes of the table data 521-2, and stores the association result 524 in the storage unit 511. The output unit 515 outputs the association result 524.
Table data 702 corresponds to the table data 521-2 of
First, the “manufacturer” of the table data 701 is selected as the attribute A1, and the “equipment name” adjacent to the left side of the “manufacturer” is selected as the attribute B1. Then, the “manufacturing company” in the table data 702 is selected as the attribute A2, and the “device” adjacent to the left side of the “manufacturing company” is selected as the attribute B2. The similarity SA between the “manufacturer” and the “manufacturing company” is 0.75, and the similarity SB between the “equipment name” and the “device” is 0.83; therefore, the similarity S is 1.58.
Next, the “construction company” in the table data 702 is selected as the attribute A2, and the “price” adjacent to the left side of the “construction company” is selected as the attribute B2. The similarity SA between the “manufacturer” and the “construction company” is 0.84, and the similarity SB between the “equipment name” and the “price” is 0.21; therefore, the similarity S is 1.05.
In a case where the similarity S when the “manufacturing company” of the table data 702 is selected as the attribute A2 is the maximum value of the similarity S, the “manufacturer” of the table data 701 is associated with the “manufacturing company”.
Next, the “construction company” in the table data 702 is selected as the attribute A2, and the “work history” adjacent to the right side of the “construction company” is selected as the attribute B2. The similarity SA between the “construction company” in the table data 701 and the “construction company” in the table data 702 is 0.56, and the similarity SB between the “construction log” and the “work history” is 0.71; therefore, the similarity S is 1.27.
In a case where the similarity S when the “construction company” of the table data 702 is selected as the attribute A2 is the maximum value of the similarity S, the “construction company” of the table data 701 is associated with the “construction company” of the table data 702.
According to such association processing, not only the similarity between two association candidate attributes but also the similarity between attributes adjacent to the association candidate attributes is used, so that the accuracy of the association processing can be improved.
Meanwhile, in the case of using the attribute adjacent to the right side of the association candidate attribute, the right end attributes of the table data 521-1 and the table data 521-2 are excluded from the attributes to be processed. In this case, all of the attributes other than the right end attributes are the attributes to be processed.
In a case where the table data 521-1 and the table data 521-2 include a large number of attributes, even if the left end or right end attribute is excluded from the attributes to be processed, association for the majority of remaining attributes can be performed.
First, the similarity calculation unit 512 calculates the similarity between each attribute of the table data 521-1 and each attribute of the table data 521-2, and generates the similarity information 522 indicating the calculated similarity (step 901).
Next, the extraction unit 513 generates attribute order information 523-1 indicating the order of the plurality of attributes in the table data 521-1, and generates attribute order information 523-2 indicating the order of the plurality of attributes in the table data 521-2 (step 902).
Next, the association processing unit 514 performs processing of loop 1 for each attribute of the table data 521-1. In the loop 1, the association processing unit 514 selects one of the attributes to be processed included in the table data 521-1 as the association candidate attribute A1 (step 903). Then, the association processing unit 514 identifies the attribute adjacent to the left side or the right side of the attribute A1 in the table data 521-1 by using the attribute order information 523-1 and selects the attribute as the attribute B1.
Next, the association processing unit 514 sets an initial value “−∞” in a variable Smax indicating the maximum value of the similarity S, and sets an initial value “NULL” in a variable Amax indicating an attribute of the table data 521-2 corresponding to the maximum value of the similarity S (step 904). “−∞” represents negative infinity.
Next, the association processing unit 514 performs processing of loop 2 for each attribute of the table data 521-2. In the loop 2, the association processing unit 514 selects one of the attributes to be processed included in the table data 521-2 as the association candidate attribute A2 (step 905). Then, the association processing unit 514 identifies the attribute adjacent to the left side or the right side of the attribute A2 in the table data 521-2 by using the attribute order information 523-2 and selects the attribute as the attribute B2.
In a case where the attribute B1 is adjacent to the left side of the attribute A1, the attribute adjacent to the left side of the attribute A2 is selected as the attribute B2, and in a case where the attribute B1 is adjacent to the right side of the attribute A1, the attribute adjacent to the right side of the attribute A2 is selected as the attribute B2.
Next, the association processing unit 514 calculates the similarity S by the equation (1) (step 906), and compares the similarity S with Smax (step 907). In a case where the similarity S is larger than Smax (step 907, YES), the association processing unit 514 sets the similarity S in Smax and sets the attribute A2 in Amax (step 908). On the other hand, in a case where the similarity S is Smax or less (step 907, NO), the association processing unit 514 does not change Smax and Amax.
When the processing of loop 2 is completed for all the attributes to be processed included in the table data 521-2, the association processing unit 514 associates Amax with the attribute A1 (step 909). When the processing of loop 1 is completed for all the attributes to be processed included in the table data 521-1, the association processing unit 514 generates the association result 524, and the output unit 515 outputs the association result 524 (step 910).
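The processing of loop 1 and loop 2 described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: each table data is represented only by the ordered list of its attribute names, attributes not mentioned in the text are omitted, the function name associate_attributes is hypothetical, and a similarity of 0.0 is assumed for pairs whose similarity is not given.

```python
import math


def associate_attributes(order1, order2, sim, side="left"):
    """Associate attributes using equation (1), S = SA + SB.

    order1, order2: attribute names in table order (attribute order information).
    sim: function returning the similarity between two attributes.
    side: which adjacent attribute is used as B1 and B2.
    """
    offset = -1 if side == "left" else 1

    def to_be_processed(order):
        # The end attribute that has no neighbor on the chosen side is excluded.
        if side == "left":
            return range(1, len(order))
        return range(0, len(order) - 1)

    result = {}
    for i in to_be_processed(order1):              # loop 1 (step 903)
        a1, b1 = order1[i], order1[i + offset]
        s_max, a_max = -math.inf, None             # step 904
        for j in to_be_processed(order2):          # loop 2 (step 905)
            a2, b2 = order2[j], order2[j + offset]
            s = sim(a1, a2) + sim(b1, b2)          # step 906
            if s > s_max:                          # step 907
                s_max, a_max = s, a2               # step 908
        result[a1] = a_max                         # step 909
    return result


# Similarities taken from the example of the "manufacturer" above.
pairs = {
    ("manufacturer", "manufacturing company"): 0.75,
    ("equipment name", "device"): 0.83,
    ("manufacturer", "construction company"): 0.84,
    ("equipment name", "price"): 0.21,
}
order1 = ["equipment name", "manufacturer"]
order2 = ["device", "manufacturing company", "price", "construction company"]

result = associate_attributes(
    order1, order2, lambda a, b: pairs.get((a, b), 0.0), side="left"
)
print(result)
```

Because Smax is updated only when the similarity S is strictly larger, the attribute A2 that first attains the maximum value is retained when two candidates have the same similarity S.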
The association processing unit 514 can also perform the association processing by using a plurality of attributes included in a predetermined range based on the attributes of the association candidates. As the plurality of attributes included in the predetermined range, for example, an attribute adjacent to the left side of the association candidate attribute and an attribute adjacent to the right side of the association candidate attribute can be used.
In this case, the association processing unit 514 selects each attribute of the table data 521-1 as the association candidate attribute A1. Then, the association processing unit 514 selects the attribute B1 adjacent to the left side of the attribute A1 and an attribute C1 adjacent to the right side of the attribute A1 in the table data 521-1 using the attribute order information 523-1.
Next, the association processing unit 514 selects each attribute of the table data 521-2 as the association candidate attribute A2. Then, the association processing unit 514 selects the attribute B2 adjacent to the left side of the attribute A2 and an attribute C2 adjacent to the right side of the attribute A2 in the table data 521-2 using the attribute order information 523-2.
Next, the association processing unit 514 calculates the similarity S by the following equation, using the similarity SA between the attribute A1 and the attribute A2, the similarity SB between the attribute B1 and the attribute B2, and a similarity SC between the attribute C1 and the attribute C2.
S=SA+SB+SC (2)
The similarity S represents the sum of the similarity SA, the similarity SB, and the similarity SC. The association processing unit 514 performs the association processing using the similarity S of the equation (2) instead of the similarity S of the equation (1).
First, an attribute 1012 of the table data 1001 is selected as the attribute A1, an attribute 1011 adjacent to the left side of the attribute 1012 is selected as the attribute B1, and an attribute 1013 adjacent to the right side of the attribute 1012 is selected as the attribute C1. Then, an attribute 1022 of the table data 1002 is selected as the attribute A2, an attribute 1021 adjacent to the left side of the attribute 1022 is selected as the attribute B2, and an attribute 1023 adjacent to the right side of the attribute 1022 is selected as the attribute C2.
The similarity SA between the attribute 1012 and the attribute 1022 is 0.7, the similarity SB between the attribute 1011 and the attribute 1021 is 0.8, and the similarity SC between the attribute 1013 and the attribute 1023 is 0.8; therefore, the similarity S is 2.3.
Next, an attribute 1032 of the table data 1002 is selected as the attribute A2, an attribute 1031 adjacent to the left side of the attribute 1032 is selected as the attribute B2, and an attribute 1033 adjacent to the right side of the attribute 1032 is selected as the attribute C2. The similarity SA between the attribute 1012 and the attribute 1032 is 0.8, the similarity SB between the attribute 1011 and the attribute 1031 is 0.1, and the similarity SC between the attribute 1013 and the attribute 1033 is 0.2; therefore, the similarity S is 1.1.
In a case where the similarity S when the attribute 1022 of the table data 1002 is selected as the attribute A2 is the maximum value of the similarity S, the attribute 1012 of the table data 1001 is associated with the attribute 1022.
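The calculation of the equation (2) for this example can be checked with the following Python sketch. The reference numerals are used as attribute labels, the function name similarity_with_both_neighbors is hypothetical, and arranging the attributes 1021 to 1023 and 1031 to 1033 in a single contiguous list is an assumption made only to keep the example short.

```python
def similarity_with_both_neighbors(sim, order1, i, order2, j):
    """Equation (2): S = SA + SB + SC for A1 = order1[i] and A2 = order2[j]."""
    sa = sim[(order1[i], order2[j])]          # candidate attributes A1, A2
    sb = sim[(order1[i - 1], order2[j - 1])]  # left neighbors B1, B2
    sc = sim[(order1[i + 1], order2[j + 1])]  # right neighbors C1, C2
    return sa + sb + sc


order1 = [1011, 1012, 1013]
order2 = [1021, 1022, 1023, 1031, 1032, 1033]  # contiguity is an assumption

sim = {
    (1012, 1022): 0.7, (1011, 1021): 0.8, (1013, 1023): 0.8,
    (1012, 1032): 0.8, (1011, 1031): 0.1, (1013, 1033): 0.2,
}

s_1022 = similarity_with_both_neighbors(sim, order1, 1, order2, 1)
s_1032 = similarity_with_both_neighbors(sim, order1, 1, order2, 4)
print(round(s_1022, 2), round(s_1032, 2))
```

Although the attribute 1032 has the larger similarity SA, the attribute 1022 obtains the larger similarity S because its neighboring attributes are also similar.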
The association processing unit 514 may use two or more attributes existing on the left side of the association candidate attribute and two or more attributes existing on the right side of the association candidate attribute as the plurality of attributes included in the predetermined range. In this case, the similarity S is calculated by adding the similarity between each attribute included in the predetermined range of the table data 521-1 and each attribute included in the predetermined range of the table data 521-2 to the similarity between two association candidate attributes.
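One possible reading of this generalization is sketched below in Python. It is assumed here that the attributes in the two predetermined ranges are paired by the same relative position with respect to the association candidate attributes; the function name window_similarity and the parameter radius (the number of attributes used on each side) are hypothetical.

```python
def window_similarity(sim, order1, i, order2, j, radius):
    """Sum of similarities over offsets -radius..radius around A1 and A2.

    The offset 0 term is the similarity between the candidate attributes.
    Pairing attributes by the same offset is an assumption.
    """
    if i - radius < 0 or i + radius >= len(order1):
        raise ValueError("A1 is too close to an end of the first table data")
    if j - radius < 0 or j + radius >= len(order2):
        raise ValueError("A2 is too close to an end of the second table data")
    total = 0.0
    for offset in range(-radius, radius + 1):
        total += sim(order1[i + offset], order2[j + offset])
    return total


# Toy similarity used only for this example: 1.0 for identical names.
def exact_match(a, b):
    return 1.0 if a == b else 0.0


columns = ["a", "b", "c", "d", "e"]
print(window_similarity(exact_match, columns, 2, columns, 2, radius=2))
```

With radius set to 1, the calculation reduces to the equation (2).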
According to such association processing, not only the similarity between two association candidate attributes but also the similarity between attributes included in the predetermined range based on the association candidate attributes is used, so that the accuracy of the association processing can be improved.
In a case where the table data 521-1 and the table data 521-2 include a large number of attributes, even if the left end and right end attributes are excluded from the attributes to be processed, association for the majority of remaining attributes can be performed.
Processing of step 1101, step 1102, step 1104, and steps 1107 to 1110 is similar to the processing of step 901, step 902, step 904, and steps 907 to 910 of
In step 1103, the association processing unit 514 selects one of the attributes to be processed included in the table data 521-1 as the association candidate attribute A1. Then, the association processing unit 514 identifies the attribute adjacent to the left side of the attribute A1 in the table data 521-1 by using the attribute order information 523-1 and selects the attribute as the attribute B1. Furthermore, the association processing unit 514 identifies the attribute adjacent to the right side of the attribute A1 in the table data 521-1 by using the attribute order information 523-1 and selects the attribute as the attribute C1.
In step 1105, the association processing unit 514 selects one of the attributes to be processed included in the table data 521-2 as the association candidate attribute A2. Then, the association processing unit 514 identifies the attribute adjacent to the left side of the attribute A2 in the table data 521-2 by using the attribute order information 523-2 and selects the attribute as the attribute B2. Furthermore, the association processing unit 514 identifies the attribute adjacent to the right side of the attribute A2 in the table data 521-2 by using the attribute order information 523-2 and selects the attribute as the attribute C2.
In step 1106, the association processing unit 514 calculates the similarity S by the equation (2).
For example, in the case of using the similarity SA, the similarity SB, and the similarity SC of the equation (2) as the plurality of similarities, the association processing unit 514 calculates the similarity S by the following equation.
S=WA*SA+WB*SB+WC*SC (3)
WA, WB, and WC respectively represent the weighting coefficients for the similarity SA, similarity SB, and similarity SC, and the similarity S represents the weighted sum of the similarity SA, similarity SB, and similarity SC. The association processing unit 514 performs the association processing using the similarity S of the equation (3) instead of the similarity S of the equation (2).
Furthermore, since the similarity SA between the attribute 1012 and the attribute 1032 is 0.69, the similarity SB between the attribute 1011 and the attribute 1031 is 0.7, and the similarity SC between the attribute 1013 and the attribute 1033 is 0.7, the sum is 2.09. Therefore, in the case of using the sum as the similarity S, the attribute 1012 of the table data 1001 is associated with the attribute 1032.
Meanwhile, by using the similarity S in the equation (3), the weighting in which the weight of the similarity SA is larger than the weights of the similarity SB and the similarity SC can be performed. For example, in the case of performing the weighting with WA=1.0, WB=0.5, and WC=0.5, the similarity S between the attribute 1012 and the attribute 1022 is calculated by the following equation.
S=1.0*0.8+0.5*0.7+0.5*0.5=1.4 (4)
Similarly, the similarity S between the attribute 1012 and the attribute 1032 is calculated by the following equation.
S=1.0*0.69+0.5*0.7+0.5*0.7=1.39 (5)
In this case, since the similarity S of the equation (4) is larger than the similarity S of the equation (5), the attribute 1012 of the table data 1001 is associated with the attribute 1022.
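The effect of the weighting can be confirmed with the following Python sketch, which evaluates the equation (3) with the values used in the equations (4) and (5). The function name weighted_similarity is hypothetical, and the default weighting coefficients are merely those of the example.

```python
def weighted_similarity(sa, sb, sc, wa=1.0, wb=0.5, wc=0.5):
    """Equation (3): S = WA*SA + WB*SB + WC*SC."""
    return wa * sa + wb * sb + wc * sc


# (SA, SB, SC) for each candidate, taken from the equations (4) and (5).
candidate_1022 = (0.8, 0.7, 0.5)
candidate_1032 = (0.69, 0.7, 0.7)

# Simple sum (all weighting coefficients equal to 1.0).
sum_1022 = weighted_similarity(*candidate_1022, wa=1.0, wb=1.0, wc=1.0)
sum_1032 = weighted_similarity(*candidate_1032, wa=1.0, wb=1.0, wc=1.0)

# Weighted sum with WA=1.0, WB=0.5, and WC=0.5.
s_1022 = weighted_similarity(*candidate_1022)
s_1032 = weighted_similarity(*candidate_1032)

print(round(sum_1022, 2), round(sum_1032, 2))
print(round(s_1022, 2), round(s_1032, 2))
```

With the simple sum the attribute 1032 has the larger value, whereas with the weighted sum the attribute 1022 has the larger value, which matches the comparison of the equations (4) and (5).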
According to such association processing, the similarity between two association candidate attributes can be given priority over the similarity between attributes included in the predetermined range based on the two association candidate attributes.
The similarity calculation unit 512 calculates the similarity between each attribute of the table data 521-1 and each attribute of the table data 521-i (i=2 to N), generates similarity information 1511 indicating the calculated similarity, and stores the similarity information 1511 in the storage unit 511.
The extraction unit 513 extracts the order of the plurality of attributes in the table data 521-i from the table data 521-i (i=1 to N), generates the attribute order information 523-i indicating the extracted order, and stores the attribute order information 523-i in the storage unit 511.
The association processing unit 514 associates each attribute of the table data 521-1 with any attribute of the table data 521-i (i=2 to N), using the similarity information 1511 and the attribute order information 523-1 to attribute order information 523-N. Then, the association processing unit 514 generates an association result 1512 indicating a correspondence between the plurality of attributes of the table data 521-1 and the plurality of attributes of the table data 521-i (i=2 to N), and stores the association result 1512 in the storage unit 511. The output unit 515 outputs the association result 1512.
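The association between the table data 521-1 and each of the table data 521-2 to 521-N can be sketched in Python as follows. The sketch assumes that each table data is given as an ordered list of attribute names having at least two attributes, that the left adjacent attribute and the equation (1) are used, and that the function names associate and associate_with_all as well as the sample attribute names are hypothetical.

```python
def associate(order1, order2, sim):
    """Associate attributes of order1 with attributes of order2 by S = SA + SB,
    where B1 and B2 are the attributes adjacent to the left side."""
    result = {}
    for i in range(1, len(order1)):
        best_j = max(
            range(1, len(order2)),
            key=lambda j: sim(order1[i], order2[j])
            + sim(order1[i - 1], order2[j - 1]),
        )
        result[order1[i]] = order2[best_j]
    return result


def associate_with_all(tables, sim):
    """tables[0] plays the role of the table data 521-1; one mapping is
    returned for each of the remaining table data 521-i (i = 2 to N)."""
    base = tables[0]
    return [associate(base, other, sim) for other in tables[1:]]


# Toy similarity used only for this example: 1.0 for identical names.
def exact_match(a, b):
    return 1.0 if a == b else 0.0


tables = [
    ["id", "name", "price"],   # corresponds to the table data 521-1
    ["key", "name", "price"],  # corresponds to the table data 521-2
    ["id", "name", "cost"],    # corresponds to the table data 521-3
]
results = associate_with_all(tables, exact_match)
print(results)
```

In the third table data of this example, the "price" is associated with the "cost" only because the attributes adjacent to the left side of both are identical, which illustrates how the order of attributes contributes to the association.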
According to the data processing device 1501 of
The configurations of the data processing device 301 of
The flowcharts illustrated in
The table data illustrated in
Equations (1) to (5) are merely examples, and the data processing device may calculate the similarity using another calculation equation.
The memory 1602 is a semiconductor memory such as a read only memory (ROM), a random access memory (RAM), or a flash memory, and stores programs and data used for processing. The memory 1602 may operate as the storage unit 311 of
The CPU 1601 (processor) operates as the similarity calculation unit 312 and the association processing unit 313 in
The input device 1603 is, for example, a keyboard, a pointing device, or the like and is used for inputting an instruction or information from an operator or a user. The output device 1604 is, for example, a display device, a printer, or the like and is used for an inquiry or an instruction to the operator or the user, and outputting a processing result. The processing result may be the association result 524 or the association result 1512. The output device 1604 may operate as the output unit 515 of
The auxiliary storage device 1605 is, for example, a magnetic disk device, an optical disk device, a magneto-optical disk device, a tape device, or the like. The auxiliary storage device 1605 may be a hard disk drive or a flash memory. The information processing device can save programs and data in the auxiliary storage device 1605 and load these programs and data into the memory 1602 to use. The auxiliary storage device 1605 may operate as the storage unit 311 of
The medium drive device 1606 drives a portable recording medium 1609 and accesses recorded content of the portable recording medium 1609. The portable recording medium 1609 is a memory device, a flexible disk, an optical disk, a magneto-optical disk, or the like. The portable recording medium 1609 may be a compact disk read only memory (CD-ROM), a digital versatile disk (DVD), a universal serial bus (USB) memory, or the like. The operator or the user can store programs and data in the portable recording medium 1609 and load these programs and data into the memory 1602 for use.
As described above, a computer-readable recording medium in which the programs and data used for processing are stored includes a physical (non-transitory) recording medium such as the memory 1602, the auxiliary storage device 1605, and the portable recording medium 1609.
The network connection device 1607 is a communication interface circuit that is connected to a communication network such as a local area network (LAN) and a wide area network (WAN), and that performs data conversion pertaining to communication. The information processing device can receive programs and data from an external device via the network connection device 1607 and load these programs and data into the memory 1602 to use. The network connection device 1607 may operate as the output unit 515 of
Note that the information processing device does not need to include all the configuration elements in
While the disclosed embodiments and the advantages thereof have been described in detail, those skilled in the art will be able to make various modifications, additions, and omissions without departing from the scope of the embodiments as explicitly set forth in the claims.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2020-151011 | Sep 2020 | JP | national |