This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2016-099876, filed on May 18, 2016, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to an evaluation program, an evaluation method, and an information processing device.
For example, in a business system, various types of information used in business is registered and managed as master data. Also, there are cases where a plurality of business systems is integrated, and due to the integration, name identification of a plurality of pieces of master data is performed. In name identification, for example, between one master data and another master data, columns that have corresponding contents are associated. Japanese Laid-open Patent Publication No. 2012-234343, Japanese Laid-open Patent Publication No. 2008-27072, Japanese Laid-open Patent Publication No. 2012-14684, Japanese Laid-open Patent Publication No. 2004-086782, and Japanese Laid-open Patent Publication No. 2007-188343 discuss related art.
For example, as a method for associating columns between pieces of data for name identification, values of cells which belong to columns are compared to one another between pieces of data and columns including many sets of cells from which similar character strings have been detected are associated with one another. However, for example, there are cases where, although one column of one data and another column of another data do not correspond to one another, the values of cells which belong to the columns are similar to one another. For example, assuming a case where there are a column in which the address of a company is registered and a column in which the address of a person in charge is registered, respective pieces of information of the columns are similar to one another from a point of view of address. Therefore, these columns might have similar values in the columns of the cells and thus there is a probability that the columns are associated with one another, but the address of a company and the address of an individual are associated with one another, and therefore, this association is improper. Also, as another example, there are cases where numeric strings of serial numbers are assigned to records of pieces of data. In such a case, an assigned numeric string might be similar to a numeric string assigned in another data and there is a probability that the columns thereof are associated with one another, but the serial numbers have different meaning for each piece of data and the association of the columns is improper as association of columns. As described above, there are cases where, even when values of cells which belong to columns are similar to one another, the serial numbers have different meaning for each piece of data, thus resulting in improper association of columns. Therefore, for example, it is desired to provide a technology that enables association of columns between a plurality of pieces of data with high accuracy.
In one aspect, it is therefore an object of the present disclosure to provide a technology that enables association of columns between a plurality of pieces of data with high accuracy.
According to an aspect of the invention, an evaluation method includes: comparing values of cells between a plurality of pieces of data each including a plurality of cells divided by a plurality of columns and a plurality of records; storing, in a storage unit, information that indicates a plurality of cell sets that have been detected as sets of cells including similar character strings by the comparing; and setting, with reference to the storage unit, a score of each of a plurality of column sets formed by making each of columns of one of the plurality of pieces of data and each of columns of another one of the plurality of pieces of data as a set, based on a score for a record set of records in which a cell set, among the plurality of cell sets, which is included in the column set is included.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
Some embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. Note that corresponding elements in a plurality of drawings are denoted by the same reference character.
As described above, for data in table form or in matrix form, one method for associating a column (also called an attribute) of one piece of data with a column of another piece of data for name identification is to compare the values of cells which belong to the columns between the pieces of data and to associate with one another the columns that include many sets of cells from which similar character strings have been detected. Note that target data on which column association is performed may be, for example, a database, a table, or the like. Data may be, for example, master data. Also, although a case where column association is performed between two pieces of data will be described as an example below, the present disclosure is not limited thereto, and column association may be executed between three or more pieces of data.
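The baseline method described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiments: the table contents, the function name `count_matching_cells`, and the use of exact equality in place of a similarity test are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy tables: column identifier -> list of cell values, one per record.
data_a = {"A1": ["001", "002", "003"], "A3": ["Tokyo", "Osaka", "Nagoya"]}
data_b = {"B1": ["003", "001", "002"], "B3": ["Osaka", "Tokyo", "Kyoto"]}


def count_matching_cells(table_a, table_b):
    """Count, for every column set (u, v), the cell sets whose values match."""
    counts = Counter()
    for u, cells_a in table_a.items():
        for v, cells_b in table_b.items():
            for value_a in cells_a:
                for value_b in cells_b:
                    # Assumption: exact equality stands in for "similar character strings".
                    if value_a == value_b:
                        counts[(u, v)] += 1
    return counts


counts = count_matching_cells(data_a, data_b)
# Column sets ranked by the number of matching cell sets, highest first.
ranking = counts.most_common()
```

A ranking built only from such counts cannot tell a proper column set from an improper one whose values merely look alike, which is the weakness the scoring of the embodiments is meant to address.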
On the other hand, in the following description, separated rows will be referred to as records. For example, in DATA A, “a1”, “a2”, “a3”, . . . are records. Also, in the following description, areas which are divided by columns and records and store values will be referred to as cells. In the following description, between a plurality of pieces of data, that is, DATA A and DATA B, or the like, a set of single columns will be occasionally referred to as a column set. For example, each of a plurality of columns of DATA A is paired with each of a plurality of columns of DATA B, and thereby, a plurality of column sets is formed. Similarly, between a plurality of pieces of data, a set of single records will be occasionally referred to as a record set. For example, each of a plurality of records of DATA A is paired with each of a plurality of records of DATA B, and thereby, a plurality of record sets is formed.
However, for the column “A3: LOCATION” of DATA A, a plurality of match character strings with both of the column “B3: ADDRESS OF BUSINESS PARTNER” and the column “B4: ADDRESS OF PERSON IN CHARGE” of DATA B have been detected. As described above, in the example of
Also, as another example, when the number of characters of match character strings is counted, between the column A1 and the column B1, the number of characters of match character strings is nine characters, which is the total of three characters of “001”, three characters of “002”, and three characters of “003”. Similarly, between the column A2 and the column B2, the number of characters of match character strings is seven characters, which is the total of three characters of “F” and four characters of “AA”. The number of characters of match character strings between columns of DATA A and DATA B is acquired in the manner described above and column sets are ranked in accordance with the number of characters of match character strings, which has been acquired, so that a result illustrated in
Also, in this case, although the column “A3: LOCATION” of DATA A corresponds to the column “B3: ADDRESS OF BUSINESS PARTNER” of DATA B, in
For example, in many cases, name identification is originally executed on data including many corresponding columns and records. For a record set of proper association, there is a tendency that match character strings are found in a plurality of columns. Therefore, for example, there is a tendency that, assuming a case where a column set in which columns are associated with one another using match character strings is a proper column set, seeing a record set including match character strings included in the column set, match character strings are also found in another column.
For example, in the column set of the column “A3: LOCATION” and the column “B3: ADDRESS OF BUSINESS PARTNER”, which has many matches in
On the other hand, for example, in the column set of the column “A3: LOCATION” and the column “B4: ADDRESS OF PERSON IN CHARGE”, which has many matches in
In embodiments that will be described below, for example, scores of column sets are set such that a higher score is given to a column set in which a set of cells (which will be hereinafter occasionally referred to as a cell set) including match character strings in a record set the score of which is higher appears. Also, scores of record sets are set such that a higher score is given to a record set in which a cell set including match character strings in a column the score of which is higher appears. Thus, considering the above-described tendency that, “in a properly associated record set, match character strings are found in a plurality of columns”, the scores of column sets may be evaluated and, as a result, it is enabled to associate a set of columns with high accuracy using the scores of the column sets. Embodiments will be described further in detail below with reference to
Subsequently, calculations of the score of a column set and the score of a record set according to the embodiment will be described. As described above, for example, values of cells are compared to one another between two pieces of data (for example, DATA A and DATA B) and character string match is executed, thereby enabling detection of match character strings that match between the two pieces of data.
The result M of character string match may be expressed by, for example, M={m1, m2, . . . , mk, . . . , mμ}. In this case, mk (1≦k≦μ) is information related to a match character string detected by character string match. Note that μ may be the total number of match character strings detected by character string match. Also, k may be an index assigned to a match character string. Each element mk may be expressed by mk=(ik, jk, uk, vk, sk). In this case, ik may be information used for identifying a record in DATA A of a cell that includes a match character string of mk and, for example, may be a1, a2, . . . or the like, which is an identifier of a record of DATA A. jk may be information used for identifying a record in DATA B of a cell that includes a match character string of mk and, for example, may be b1, b2, . . . or the like, which is an identifier of a record of DATA B. Also, uk may be information used for identifying a column in DATA A of a cell that includes a match character string of mk and, for example, may be A1, A2, . . . or the like, which is an identifier of a column of DATA A. vk may be information used for identifying a column in DATA B of a cell that includes a match character string of mk and, for example, may be B1, B2, . . . or the like, which is an identifier of a column of DATA B. sk is a score that corresponds to mk and is a value that determines reliability of mk. sk may be determined in advance. For example, when all of the match character strings that have been detected by character string match are treated equivalently, a value (for example, sk=1) that is common to all of sk may be set. As another option, in a case where a match character string is treated as more important the longer its character length is, sk=the match character string length may be employed.
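One possible in-memory representation of the result M is sketched below. The names `Match` and `make_match` and the sample values are hypothetical; only the five fields (ik, jk, uk, vk, sk) and the two ways of choosing sk come from the description above.

```python
from typing import NamedTuple


class Match(NamedTuple):
    """One element mk = (ik, jk, uk, vk, sk) of the result M."""
    i: str    # identifier of the record in DATA A, e.g. "a1"
    j: str    # identifier of the record in DATA B, e.g. "b2"
    u: str    # identifier of the column in DATA A, e.g. "A1"
    v: str    # identifier of the column in DATA B, e.g. "B1"
    s: float  # score sk attached to this match character string


def make_match(i, j, u, v, text, weight_by_length=False):
    """Build a Match, with sk either the common value 1 or the string length."""
    s = float(len(text)) if weight_by_length else 1.0
    return Match(i, j, u, v, s)


# Hypothetical entries: the first is weighted equally, the second by length.
M = [
    make_match("a1", "b2", "A1", "B1", "001"),
    make_match("a1", "b2", "A3", "B3", "Tokyo", weight_by_length=True),
]
```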
Subsequently, a calculation of the score of a column set and a calculation of the score of a record set using the result M of character string match will be described. Note that, in the following description, the score of the column set is occasionally referred to as Pc and the score of the record set is occasionally referred to as Pr.
<Score Calculation>
Assume that the score of a column set (u, v) is expressed by Pc (u, v). Also, assume that the score of a record set (i, j) is expressed by Pr (i, j). In this case, Pc (u, v) of the column set (u, v) may be expressed by Expression 1 below, using the score Pr (ik, jk) of each record set (ik, jk).
$$P_c(u, v) = \sum_{k \ \mathrm{s.t.}\ u_k = u,\ v_k = v} s_k \, P_r(i_k, j_k) \qquad \text{(Expression 1)}$$
Note that, in Expression 1, “s. t.” is, for example, an abbreviation of “subject to”. Then, “k s. t. uk=u, vk=v” indicates, for example, that, among entries registered in the RESULT M of
Also, similarly, the score Pr (i, j) of a record set (i, j) may be expressed by Expression 2 below using the score Pc (uk, vk) of each column set (uk, vk).
$$P_r(i, j) = \sum_{k \ \mathrm{s.t.}\ i_k = i,\ j_k = j} s_k \, P_c(u_k, v_k) \qquad \text{(Expression 2)}$$
Note that, in Expression 2, “k s. t. ik=i, jk=j” indicates, for example, that, among entries registered in the RESULT M of
Subsequently, a calculation of each of respective scores of a plurality of column sets between two pieces of data using Expression 1 and a calculation of each of respective scores of a plurality of record sets using Expression 2 will be described. Note that the plurality of column sets may be achieved by making a single column from one of the two pieces of data and a single column from the other one of the two pieces of data into a set and thus forming a plurality of sets of columns. The plurality of record sets may be achieved by making a single record from one of the two pieces of data and a single record from the other one of the two pieces of data into a set and thus forming a plurality of sets of records.
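The two calculations can be sketched in a few lines, assuming each entry of the result M is held as a tuple (ik, jk, uk, vk, sk) and each score is the sum of sk times the score of the paired set over the entries that fall in the set being scored. The function names and sample entries are hypothetical. Column sets and record sets that do not appear in M receive no entry here, which corresponds to a score of 0.

```python
from collections import defaultdict

# Hypothetical result M of character string match: (ik, jk, uk, vk, sk).
M = [
    ("a1", "b1", "A1", "B1", 1.0),
    ("a1", "b1", "A3", "B3", 1.0),
    ("a2", "b2", "A3", "B3", 1.0),
    ("a2", "b3", "A3", "B4", 1.0),
]


def record_scores(M, pc):
    """Expression 2: Pr(i, j) sums sk * Pc(uk, vk) over entries with ik=i, jk=j."""
    pr = defaultdict(float)
    for i, j, u, v, s in M:
        pr[(i, j)] += s * pc[(u, v)]
    return dict(pr)


def column_scores(M, pr):
    """Expression 1: Pc(u, v) sums sk * Pr(ik, jk) over entries with uk=u, vk=v."""
    pc = defaultdict(float)
    for i, j, u, v, s in M:
        pc[(u, v)] += s * pr[(i, j)]
    return dict(pc)


# Initialize every column set score to the common value 1, then apply
# Expression 2 followed by Expression 1 once.
pc0 = {(u, v): 1.0 for _, _, u, v, _ in M}
pr1 = record_scores(M, pc0)
pc1 = column_scores(M, pr1)
```

In this toy input the column set (A3, B3) ends up above (A3, B4), because one of its matches lies in the record set (a1, b1), which also has a match in another column.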
For the column set score information 501 and the record set score information 502, for example, at least one of the tables thereof may be initialized when a score calculation is performed. In score initialization, for example, the control unit 301 may be configured to initialize all of scores to a common value (for example, “1” as illustrated in
A similar calculation is performed, and thereby, the scores Pr of all of record sets (ik, jk) are calculated.
A similar calculation is performed, and thereby, the scores Pc of all of column sets (uk, vk) are calculated.
For example, scores are calculated in the above-described manner, and thereby, scores of column sets may be set such that a higher score is given to a column set in which a cell set including match character strings in a record set the score of which is higher appears. Similarly, scores of record sets may be set such that a higher score is given to a record set in which a cell set including match character strings in a column set the score of which is higher appears. For example, it is enabled to associate a set of columns between pieces of data with high accuracy using the scores of the column sets which have been acquired.
Note that, according to this embodiment, similarly, it is enabled to associate a set of records with high accuracy by using the scores Pr (ik, jk) of the record set score information 502.
Furthermore, a calculation of the score of a record set using scores of column sets and a calculation of the score of a column set using scores of record sets are alternately repeated, and thereby, accuracy of association of a set of columns and a set of records may be further increased.
Subsequently, calculations of scores of column sets are performed using the result M.
Subsequently, the control unit 301 calculates the score Pr of each record set of the record set score information 502, in accordance with Expression 2, using the column set score information 501 that has been initialized. The left-upper table in
Furthermore, the right-upper table in
As illustrated in
Note that the control unit 301 may be configured to execute alternate repetition of a calculation of the score of a column set and a calculation of the score of a record set, for example, until at least one of the rankings of the column sets or the record sets no longer fluctuates after the calculations are repeated a predetermined number of times.
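The alternate repetition with a ranking-based stop can be sketched as below. This is an illustration under assumptions: the helper names, the parameters `min_rounds` (standing for the predetermined number of times) and `max_rounds` (a safety cap that is not in the description), and the sample entries are all hypothetical, and only the column set ranking is checked. The raw scores grow with every round in this sketch, so only the ranking is compared between rounds.

```python
from collections import defaultdict


def rec(m):
    return (m[0], m[1])  # record set (ik, jk) of an entry


def col(m):
    return (m[2], m[3])  # column set (uk, vk) of an entry


def propagate(M, scores, key_out, key_in):
    """Sum sk * scores[key_in(mk)] into the set identified by key_out(mk)."""
    out = defaultdict(float)
    for m in M:
        out[key_out(m)] += m[4] * scores[key_in(m)]
    return dict(out)


def ranking(scores):
    """Keys ordered by descending score, ties broken by key."""
    return sorted(scores, key=lambda k: (-scores[k], k))


def iterate_scores(M, min_rounds=2, max_rounds=50):
    pc = {col(m): 1.0 for m in M}  # initialize all column set scores to 1
    pr, previous = {}, None
    for round_no in range(1, max_rounds + 1):
        pr = propagate(M, pc, rec, col)  # Expression 2
        pc = propagate(M, pr, col, rec)  # Expression 1
        current = ranking(pc)
        # Stop once the ranking is unchanged, but not before min_rounds.
        if round_no >= min_rounds and current == previous:
            break
        previous = current
    return pc, pr, round_no


# Hypothetical result M: (ik, jk, uk, vk, sk).
M = [
    ("a1", "b1", "A1", "B1", 1.0),
    ("a1", "b1", "A3", "B3", 1.0),
    ("a2", "b2", "A3", "B3", 1.0),
    ("a2", "b3", "A3", "B4", 1.0),
]
pc, pr, rounds = iterate_scores(M)
```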
In Step 1301 (which will be hereinafter referred to as S1301 by describing Step as “S”), the control unit 301 reads a plurality of pieces of data, which are targets on which column association is performed. In S1302, the control unit 301 executes character string match and generates the result M including information related to match character strings that match between the plurality of pieces of data.
In S1303, the control unit 301 determines whether or not the score Pc of each column set, which is registered in the column set score information 501, is to be initialized. Note that whether an initialization target is to be the column set score information 501 or the record set score information 502 may be determined when an input of a user is received, or may be determined with reference to information that has been set in advance from the storage unit 302. In S1303, when the score Pc of each column set is to be initialized (YES in S1303), the flow proceeds to S1304. In S1304, the control unit 301 initializes the scores Pc of all of column sets of the column set score information 501. The control unit 301 may be configured to initialize all of scores to, for example, a common value (for example, “1”). As another option, for example, the control unit 301 may be configured to receive an input made by a user and set a large value to a column set columns of which are expected to be associated in advance.
In S1305, the scores Pr of all of record sets of the record set score information 502 are calculated, using the scores Pc of column sets and the result M of character string match, in accordance with Expression 2. Note that, by a calculation of Expression 2, the scores Pr may be set such that a higher score is given to a record set in which a cell set including match character strings in a column set the score of which is higher appears.
In S1306, the control unit 301 determines whether or not a score calculation has ended. The control unit 301 may be configured to repeat a calculation of the score Pc of a column set and a calculation of the score Pr of a record set, for example, until at least one of rankings of column sets of the column set score information 501 and record sets of the record set score information 502 no longer fluctuates after the calculations have been repeated a predetermined number of times. Then, the control unit 301 may be configured to determine, when at least one of rankings of column sets of the column set score information 501 and record sets of the record set score information 502 no longer fluctuates, YES in S1306. As another option, the control unit 301 normalizes at least the values of the scores of column sets of the column set score information 501 or the values of the scores of record sets of the record set score information 502. Then, the control unit 301 may be configured to determine, if, while repeating calculations, a change in a normalized value is lower than a predetermined threshold, YES in S1306. Note that, for example, for column sets, the normalization may be performed by performing constant multiplication such that the sum of the scores Pc of the column set score information 501 is 1. Similarly, the scores Pr may be normalized.
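The normalization-based end determination can be sketched as follows, assuming the scores are rescaled so that they sum to 1 after every round and that the calculation ends when the largest change in any normalized column set score falls below a threshold. The function names, the threshold value, the cap `max_rounds`, and the sample entries are assumptions for illustration.

```python
def normalize(scores):
    """Constant multiplication so that the scores sum to 1."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()} if total else dict(scores)


def step(M, pc):
    """One round: Expression 2, then Expression 1, then normalization of Pc."""
    pr = {}
    for i, j, u, v, s in M:
        pr[(i, j)] = pr.get((i, j), 0.0) + s * pc[(u, v)]
    new_pc = {}
    for i, j, u, v, s in M:
        new_pc[(u, v)] = new_pc.get((u, v), 0.0) + s * pr[(i, j)]
    return normalize(new_pc)


def run_until_converged(M, threshold=1e-6, max_rounds=1000):
    pc = normalize({(u, v): 1.0 for _, _, u, v, _ in M})
    for rounds in range(1, max_rounds + 1):
        new_pc = step(M, pc)
        # Largest change of any normalized column set score in this round.
        change = max(abs(new_pc[k] - pc[k]) for k in new_pc)
        pc = new_pc
        if change < threshold:
            return pc, rounds
    return pc, max_rounds


# Hypothetical result M: (ik, jk, uk, vk, sk).
M = [
    ("a1", "b1", "A1", "B1", 1.0),
    ("a1", "b1", "A3", "B3", 1.0),
    ("a2", "b2", "A3", "B3", 1.0),
    ("a2", "b3", "A3", "B4", 1.0),
]
pc, rounds = run_until_converged(M)
```

In this toy input the normalized score of (A3, B4) shrinks toward 0 as the rounds proceed, since its only match lies in a record set that has no match in any other column.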
In S1306, if a score calculation has not ended (NO in S1306), the flow proceeds to S1308. In S1308, using the scores Pr of record sets and the result M of character string match, the control unit 301 calculates the scores Pc of all of column sets of the column set score information 501 in accordance with Expression 1. By a calculation of Expression 1, the scores Pc may be set such that a higher score is given to a column set in which a cell set including match character strings in a record set the score of which is higher appears.
In S1309, the control unit 301 determines whether or not a score calculation has ended. For example, the control unit 301 may be configured to perform, in S1309, similar determination to determination performed in S1306. In S1309, if a score calculation has not ended (NO in S1309), the flow returns to S1305.
Also, in S1303, if the scores Pc are not to be initialized (NO in S1303), the flow proceeds to S1307. In S1307, the control unit 301 initializes the scores Pr of all of record sets of the record set score information 502. The control unit 301 may be configured to initialize all of the scores to a common value (for example, “1”). As another option, for example, the control unit 301 may be configured to receive an input made by a user and set a large value to a column set columns of which are expected to be associated in advance.
Also, in S1306 or S1309, if the control unit 301 determines that a score calculation has ended (YES in S1306 or S1309), the flow proceeds to S1310. In S1310, the control unit 301 outputs a column set, based on the scores Pc of column sets registered in the column set score information 501. For example, the control unit 301 may be configured to output only a predetermined number of ones of entries of the column set score information 501, which have high ranking from the top. As another option, the control unit 301 may be configured to output a column set having the highest score to each column of one of a plurality of pieces of data that are targets on which column association is performed.
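The two output options of S1310 can be sketched as below. The function names and the score values are hypothetical; the scores stand in for entries of the column set score information 501.

```python
def top_column_sets(pc, n):
    """Return the n column sets with the highest scores, highest first."""
    return sorted(pc.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def best_partner_per_column(pc):
    """For each column of DATA A, return the DATA B column with the highest score."""
    best = {}
    for (u, v), score in pc.items():
        if u not in best or score > best[u][1]:
            best[u] = (v, score)
    return {u: v for u, (v, _) in best.items()}


# Hypothetical column set scores Pc.
pc = {("A1", "B1"): 5.0, ("A3", "B3"): 8.0, ("A3", "B4"): 1.0}
top_two = top_column_sets(pc, 2)
association = best_partner_per_column(pc)
```

The same two functions could be applied to the scores Pr to output record sets in S1312.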
In S1311, the control unit 301 determines whether or not a record is to be associated. Note that whether or not a record is to be associated may be determined when an input of a user is received, or may be determined with reference to information indicating whether or not a record that has been stored in the storage unit 302 in advance is to be associated.
If a record is not to be associated (NO in S1311), this operation flow ends. On the other hand, if a record is to be associated (YES in S1311), the flow proceeds to S1312.
In S1312, the control unit 301 outputs a record set, based on the scores Pr of record sets registered in the record set score information 502. For example, the control unit 301 may be configured to output a predetermined number of record sets that have high scores in the record set score information 502. As another option, the control unit 301 may be configured to output a record set that has the highest score to each record of one of a plurality of pieces of data. When the control unit 301 outputs association with a record in S1312, this operation flow ends.
Note that, in processing of S1302 of the operation flow of
As described above, according to this embodiment, the control unit 301 performs a calculation of Expression 1, and thereby, is enabled to set the scores Pc such that a higher score is given to a column set in which a cell set including match character strings in a record set the score of which is higher appears. Therefore, column association is performed in accordance with the given scores, and thereby, columns may be associated with one another between pieces of data with high accuracy. Also, according to this embodiment, even without using other information than the value of data, columns may be associated with one another between pieces of data with high accuracy.
Similarly, in the above-described embodiment, the control unit 301 performs a calculation of Expression 2, and thereby, is enabled to set the scores Pr such that a higher score is given to a record set in which a cell set including match character strings in a column set the score of which is higher appears. Therefore, record association is performed in accordance with the given scores, and thereby, records may be associated with one another between pieces of data with high accuracy. Also, according to this embodiment, even without using any other information than the value of data, records may be associated with one another between pieces of data with high accuracy.
Also, as described in the above-described embodiment, a calculation of the score of a record set using scores of column sets and a calculation of the score of a column set using scores of record sets are alternately repeated, and thereby, accuracy of association may be further increased.
Therefore, according to the embodiment, columns may be associated with one another between a plurality of pieces of data with high accuracy.
Note that the control unit 301 may be configured to store the column set score information 501 and the record set score information 502 that have been achieved as a result in the storage unit 302 as they are. As another option, for example, a configuration in which, from all of column sets of the column set score information 501 and all of record sets of the record set score information 502, only a column set and a record set the score of which is not 0 are extracted and stored in the storage unit 302 may be employed.
Also, for example, there may be a case where, when there are DATA A and DATA B that are targets on which column association is performed, a column of DATA A corresponds to a plurality of columns of DATA B. For example, there may be a case where the column “A2: ADDRESS” of DATA A is divided into columns “B7: PREFECTURE/COUNTRY”, “B8: CITY/TOWN”, and “B9: STREET/HOUSE NUMBER” and thus held in DATA B. In such a case, the embodiment may be applied, for example, by combining an arbitrary number of columns together and assigning a new column thereto. For example, it is enabled to associate the column “B10” of DATA B and “A2: ADDRESS” of DATA A by assigning a column “B10” to data obtained by combining pieces of data of the column “B7: PREFECTURE/COUNTRY”+“B8: CITY/TOWN”+“B9: STREET/HOUSE NUMBER”.
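Combining columns into a new column can be sketched as follows. The function name `add_combined_column`, the separator, and the cell values are assumptions for illustration; the column identifiers follow the example above.

```python
def add_combined_column(table, source_columns, new_column, separator=""):
    """Return a copy of table with new_column holding the joined source cells."""
    combined = [
        separator.join(parts)
        for parts in zip(*(table[c] for c in source_columns))
    ]
    result = dict(table)
    result[new_column] = combined
    return result


# Hypothetical DATA B: column identifier -> cell values, one per record.
data_b = {
    "B7": ["Tokyo", "Osaka"],
    "B8": ["Minato", "Kita"],
    "B9": ["1-2-3", "4-5-6"],
}
data_b = add_combined_column(data_b, ["B7", "B8", "B9"], "B10", separator=" ")
```

After this step, character string match and the score calculations can treat “B10” like any other column of DATA B.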
Furthermore, although, in the above-described embodiment, a case where association between two pieces of data is performed has been described as an example, embodiments are not limited thereto. For example, the embodiment may be applied to column or record association between three or more pieces of data. For example, a match character string set between N pieces of data is employed as an input and each of the numbers of arguments of Pc and Pr is set to be N, so that association between N pieces of data is possible. For example, when name identification is performed between three pieces of data, a match result is set to be a set of (ik, jk, lk, uk, vk, wk, and sk) and each of the respective scores is extended to the corresponding one of Pc (uk, vk, wk) and Pr (ik, jk, lk), so that the embodiment may be applied.
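The extension to N pieces of data can be sketched by keying the scores on N-tuples. This is a sketch under assumptions: each match entry is taken to hold N record identifiers, then N column identifiers, then sk, and the scores are assumed to be sums of sk times the paired score, as in the two-data sketches. The function names and sample entries are hypothetical.

```python
from collections import defaultdict


def record_scores_n(M, pc, n):
    """Pr over N-tuples of records, from Pc over N-tuples of columns."""
    pr = defaultdict(float)
    for m in M:
        records, columns, s = m[:n], m[n:2 * n], m[2 * n]
        pr[records] += s * pc[columns]
    return dict(pr)


def column_scores_n(M, pr, n):
    """Pc over N-tuples of columns, from Pr over N-tuples of records."""
    pc = defaultdict(float)
    for m in M:
        records, columns, s = m[:n], m[n:2 * n], m[2 * n]
        pc[columns] += s * pr[records]
    return dict(pc)


# Hypothetical match result between three pieces of data:
# (ik, jk, lk, uk, vk, wk, sk).
M3 = [
    ("a1", "b1", "c1", "A1", "B1", "C1", 1.0),
    ("a1", "b1", "c1", "A2", "B2", "C2", 1.0),
    ("a2", "b2", "c2", "A2", "B2", "C2", 1.0),
]
pc0 = {m[3:6]: 1.0 for m in M3}
pr1 = record_scores_n(M3, pc0, 3)
pc1 = column_scores_n(M3, pr1, 3)
```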
In the description above, an embodiment has been described, but embodiments are not limited thereto. For example, the above-described operation flow is provided merely for illustrative purpose and embodiments are not limited thereto. In a possible case, the operation flow may be also executed in a changed order, and may further include another processing, and a part of processing may be omitted.
Also, for example, in the above-described embodiment, in S1301 to S1302, data that is a target on which column association is performed is read out and then character string match is executed. However, embodiments are not limited thereto. For example, character string match may be executed in another device, the operation flow may be started with S1303, and a result of character string match executed in the another device may be used.
Also, in another embodiment, a result of record association is output, and a result of column association is not output.
The processor 1401 executes, for example, a program in which processes of the above-described operation flow are described using the memory 1402, and thereby, provides some or all of functions of the control unit 301. For example, the processor 1401 executes a program in which, for example, processes of the above-described operation flow are described using the memory 1402, and thereby, operates as the comparison unit 311 and the setting unit 312. Also, the storage unit 302 includes, for example, the memory 1402, the storage device 1403, and a removable storage medium 1405. For example, data that is a target on which column association is performed, the result M of character string match, the column set score information 501, and the record set score information 502 may be stored in the storage device 1403.
The memory 1402 may be, for example, semiconductor memory and include a RAM area and a ROM area. The storage device 1403 is, for example, a hard disk, semiconductor memory, such as flash memory or the like, or an external storage device. Note that RAM is an abbreviation of random access memory. Also, ROM is an abbreviation of read only memory.
The reading device 1404 accesses the removable storage medium 1405 in accordance with an instruction of the processor 1401. The removable storage medium 1405 is realized, for example, by a semiconductor device (USB memory or the like), a medium (a magnetic disk or the like) to and from which information is input and output by magnetic effects, a medium (CD-ROM, DVD, or the like) to and from which information is input and output by optical effects, or the like. Note that USB is an abbreviation of universal serial bus. CD is an abbreviation of compact disc. DVD is an abbreviation of digital versatile disc.
The communication interface 1406 transmits and receives data via a network 1420 in accordance with an instruction of the processor 1401. The input and output interface 1407 may be, for example, an interface between an input device and an output device. The input device is, for example, a device, such as a keyboard, a mouse, or the like, which receives an instruction of a user. The output device is, for example, a display device, such as a display or the like, or an audio device, such as a speaker or the like.
Each program according to the embodiment is provided to the information processing device 300 in any of the following forms.
Note that the hardware configuration of the computer 1400 that realizes the information processing device 300, which has been described with reference to
The processor 1401 of the computer 1400 reads out and executes a program in which, for example, processes of the above-described operation flow are described, and thereby, columns may be associated with one another with high accuracy. As a result, for example, a record set that is not used is not stored in the storage device 1403, and therefore, a storage capacity of the storage device 1403, which may be used, may be increased. Also, processing costs of accessing a record that is not used may be reduced.
In the description above, some embodiments have been described. However, embodiments are not limited to the above-described embodiments and are to be understood to include various modified embodiments and alternative embodiments of the above-described embodiments. For example, it is to be understood that each of various embodiments may be achieved by modifying components to an extent not departing from the spirit and scope of the present disclosure. Also, it is to be understood that a plurality of components disclosed in the above-described embodiments may be combined, as appropriate, so that various embodiments may be executed. Furthermore, it is also to be understood by those skilled in the art that various embodiments may be performed by removing or replacing some of the components from all of the components described in the embodiments, or adding some components to the components described in the embodiments.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2016-099876 | May 2016 | JP | national |