The present invention relates to a string similarity join technique.
A string similarity join is a technique for detecting all pairs of a given element s and a given element r from element sets S and R, respectively, in a manner such that a distance between strings contained in the individual elements of each pair satisfies a condition of a threshold value. For the measurement of distances between strings, there exist various types of distance scales having different characteristics such as Jaccard index, cosine index, and edit distance.
The edit distance represents the minimum number of operations (inserting, deleting, or replacing a character) necessary for converting one string into another string. For example, the edit distance between the two strings “kitten” and “sitting” is the minimum number of insertions, deletions, and replacements needed to convert the word “kitten” into the word “sitting” (or vice versa). In this case, the string “kitten” can be converted into the string “sitting” by replacing “k” with “s,” replacing “e” with “i,” and inserting “g.” Thus, the edit distance between the string “kitten” and the string “sitting” is three (two replacements and one insertion).
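As an illustration only, the edit distance described above can be computed with the standard dynamic-programming formulation. The following is a minimal sketch; the function name edit_distance is chosen here for illustration and does not appear in the cited documents.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca with cb, or keep it
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # prints 3
```

The running time is proportional to the product of the two string lengths, which is why long join keys make each distance calculation expensive.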
Hereinafter, the string similarity join is also simply referred to as a string join or join. Further, a tuple set serving as an input of the string similarity join is also referred to as data or input data. Each tuple set contains at least one tuple. The tuple is formed by plural attribute values. The tuple contained in the input data contains at least one string as an attribute value. Hereinafter, an attribute having a string set thereto as the attribute value is also referred to as a string attribute. The string attribute used as a key in the string similarity join is referred to as a join key attribute, and a value of the join key attribute is referred to as a join key or join key string.
Hereinafter, the edit distance between the join key of the tuple s and the join key of the tuple r is also referred to as an edit distance between the tuples s and r, an edit distance of a tuple pair (s, r), or an edit distance between a tuple s and a tuple r. Further, in the case where an edit distance of a certain tuple pair is less than or equal to a predetermined threshold value τ, the tuples s and r of this pair are referred to as “having similarity.”
In the left table in the lower portion of
The table located right below in
Methods of the string similarity join employing such an edit distance are proposed, for example, in Non-patent Documents 1 to 4 below. These methods employ different approaches according to average string lengths of input data serving as a target. Here, the average string length of the input data means an average of lengths of strings (number of characters) serving as the join key in each input tuple. Thus, when the average of the lengths of strings serving as the join key in each tuple is short, it is indicated that the input data has a short average string length.
In the methods proposed in Non-patent Documents 1 to 3, the target is input data having a relatively long average string length, such as text. In general, calculating the edit distance between long strings takes a long time. Thus, in the case where data having a long average string length is targeted, the time required for the string join process increases. In view of the facts described above, the methods proposed in Non-patent Documents 1 to 3 convert each join key into a short bit stream (a signature), calculate a distance (or degree of similarity) between the signatures, and retain only the pairs of tuples that are highly likely to have similarity (filtering). Then, by calculating edit distances only for the filtered pairs from among all the pairs of input tuples (refining), it is possible to increase the speed of the string similarity join process.
Non-patent Document 4 proposes an approach different from the filter-and-refine approach, and targets data having a relatively short average string length. The method proposed in Non-patent Document 4 first stores all the join keys of the input data S and R in one trie (Trie). The trie represents a data structure that can express plural strings in a compressed manner, and is frequently used as an index for the string. In general, with the trie that stores a set formed by short strings, it is possible to search the tree in a relatively short period of time. The method proposed in Non-patent Document 4 searches the trie that stores all the join keys, and calculates the edit distance between the join keys, thereby performing the join for the data having relatively short average string lengths at a relatively high speed.
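For reference, a trie in general can be sketched as follows. This is a generic, minimal trie written for illustration, not the specific index or search procedure of Non-patent Document 4; the class and method names are assumptions.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps one character to the child node
        self.is_key = False  # True if a stored string ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key: str) -> None:
        node = self.root
        for ch in key:
            # Strings with a common prefix share the nodes of that prefix.
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True

    def contains(self, key: str) -> bool:
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_key


trie = Trie()
for join_key in ["kitten", "kitchen", "sitting"]:
    trie.insert(join_key)
```

Because “kitten” and “kitchen” share the prefix “kit,” the three nodes for that prefix are stored only once, which is the compression referred to above.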
As described above, with the string similarity join, the edit distances are calculated for all the pairs of tuples in the input data S and the input data R, and hence, the time required for the processing increases with an increase in the data volume in the input data S and the input data R. In view of the facts described above, Non-patent Documents 5 and 6 propose a method of processing the string similarity join in parallel to reduce the time required for the entire processing. The method proposed in Non-patent Document 5 employs the filter-and-refine approach in a parallel manner, and is suitable for data having a long average string length. The method proposed in Non-patent Document 6 employs a distance scale different from that for the edit distance, and performs the parallel processing for the string similarity join using characteristics of the distance scale.
However, with the string similarity join methods and the parallel processing methods for the string similarity join described above, it is necessary to apply a certain limitation to the distance scale for the join key or to the strings of the input data in order to achieve appropriate performance. For example, the filter-and-refine approach is not suitable for data having a short average string length. This is because a large number of candidates are likely to remain after filtering, and hence, it takes a long period of time to perform the refining process. Further, the methods proposed in Non-patent Documents 5 and 6 are not directed to the string similarity join employing the edit distance as the distance scale for the string.
Here, the following simple method is conceivable for parallel processing of the string similarity join employing the edit distance. For example, it is assumed that the processing target is set to data S containing m pieces of tuples and data R containing n (m≧n) pieces of tuples, N pieces of processing hosts are used, and the join processes are performed in parallel.
Then, a data host retaining the data R generates N pieces of duplicates of the data R, and distributes one duplicate to each of the processing hosts. Further, a data host retaining the data S divides the data S into N pieces of subsets, and distributes the subsets to the processing hosts. Each of the processing hosts uses the data distributed from each of the data hosts to perform the join on n pieces of tuples and (m/N) pieces of tuples. With this method, it is possible to calculate the edit distances for all the pairs of the tuple s contained in the data S and the tuple r contained in the data R while accurately detecting the pairs that satisfy the threshold value.
The data host described above is also referred to as a data management device. Further, the processing host described above is also referred to as a join processing device.
In this method, (N×n+m) pieces of tuples are to be processed in total. Thus, as the values of m, n, and N increase, the time required for generating N pieces of duplicates of the data R increases greatly, and the cost of communication from the data host to the processing hosts increases. In the parallel processing, the cost of communication occupies a large portion of the entire processing cost, and hence, the increase in the cost of the communication cannot be ignored. In other words, with the method described above, in the case where the volume of input data is large and a large number of processing hosts serve as the distribution destinations, the entire processing time increases.
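The simple method above and its (N×n+m) cost can be sketched as follows. This is only an illustrative model that holds the data in in-memory lists; the function names and the round-robin split of the data S are assumptions made for the sketch.

```python
def naive_distribution(S, R, num_hosts):
    """Give every processing host one subset of S and a full duplicate of R."""
    subsets = [S[i::num_hosts] for i in range(num_hosts)]  # round-robin split of S
    return [(subset, list(R)) for subset in subsets]       # list(R) is one duplicate


def tuples_shipped(m, n, num_hosts):
    """Total number of tuples sent to the processing hosts: N x n + m."""
    return num_hosts * n + m


plan = naive_distribution(list(range(10)), ["r1", "r2", "r3"], num_hosts=4)
shipped = sum(len(s_part) + len(r_copy) for s_part, r_copy in plan)
print(shipped, tuples_shipped(m=10, n=3, num_hosts=4))  # prints 22 22
```

The term N×n grows with the number of hosts, so adding hosts increases the communication volume even though the work per host decreases.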
As described above, in the string similarity join employing the edit distance, the processing time increases with the increase in the volume of the input data, and hence, the processing time can be reduced by performing the join processes in parallel. However, in the case where the join processes are performed in parallel, it is necessary that all the pairs in the data S and the data R are distributed as the join target into the plural processing hosts. In other words, the data S and the data R have to be distributed in a manner such that all the similar pairs that should be detected are processed without fail.
An object of the present invention is to provide a technique for performing the string similarity join employing the edit distance in an appropriate and rapid manner.
In order to solve the problems described above, each aspect of the present invention employs the following configurations.
A first aspect of the present invention relates to a join processing device that performs a similarity join process to plural tuples using an edit distance threshold value τ (positive integer). The join processing device according to the first aspect includes a join processing unit that excludes, from a target of edit distance calculation, a pair of tuples that do not have any common character in an end portion ranging from a head character or a tail character to the (τ+1)th character in a join key string in each of the plural tuples.
A second aspect of the present invention relates to a data management device communicatively connected to plural join processing devices that each perform a similarity join process to plural tuples using an edit distance threshold value τ (positive integer). The data management device according to the second aspect includes: a data storage unit that stores the plural tuples; and a data distributing unit that determines a distribution destination of each of the tuples stored in the data storage unit to be a join processing device that processes each of the tuples from among the plural join processing devices in a manner such that each of the tuples is distributed to the same distribution destination as that of another tuple containing, in an end portion ranging from a head character or tail character to a (τ+1)th character in a join key string thereof, at least one character that each of the tuples contains in the end portion in the join key string thereof, and is not distributed to the same distribution destination as that of another tuple that does not contain any character common to that in the end portion in the join key string of each of the tuples.
A third aspect of the present invention relates to a string similarity join system including at least one data management device and plural join processing devices that each perform a similarity join process to plural tuples stored in the at least one data management device using an edit distance threshold value τ (positive integer). In the string similarity join system according to the third aspect, the at least one data management device includes a key information generating unit that generates, for a join key string of each of the tuples, (τ+1) pieces of key information tuples containing a combination of a tail portion string ranging from a tail character to an i-th character (i is a positive integer less than or equal to (τ+1)) counted from a head character, a string length of the remaining head portion string, and tuple identifying data, or a combination of a head portion string ranging from the head character to an i-th character counted from the tail character, a string length of the remaining tail portion string, and the tuple identifying data; and a data distributing unit that determines a distribution destination of each of the key information tuples on the basis of the head character of the tail portion string or the tail character of the head portion string contained in each of the key information tuples generated by the key information generating unit, and distributes, as data on each of the tuples, each of the key information tuples to each of the join processing devices determined to be the distribution destination. 
Further, the plural join processing devices each include: a receiving unit that receives the plural key information tuples distributed from the at least one data management device; and a join processing unit that performs the similarity join process for each set of key information tuples having the head character of the tail portion string or the tail character of the head portion string common to each other from among the plural key information tuples received by the receiving unit.
It should be noted that another aspect of the present invention may provide a string similarity join method that causes at least one computer to perform each of the processes contained in the first to third aspects described above, or may provide a program that causes at least one computer to realize each of the configurations contained in the first to third aspects, or may provide a computer-readable storage medium that records such a program. This storage medium includes a non-transitory tangible medium.
According to the aspects described above, it is possible to provide a technique of performing the string similarity join employing the edit distance in an appropriate and rapid manner.
Hereinbelow, exemplary embodiments of the present invention will be described. Note that each of the exemplary embodiments described below is merely an example, and the present invention is not limited to the configuration of each of the exemplary embodiments described below.
A join processing device according to a first exemplary embodiment performs a similarity join process to plural tuples using an edit distance threshold value τ (τ is a positive integer, and the edit distance threshold value is also simply referred to as a threshold value τ). This join processing device includes a join processing unit that excludes, from the target of the edit distance calculation, pairs of tuples that do not have any common characters in an end portion ranging from the head character or the tail character to the (τ+1)th character in a join key string of each of the tuples.
Thus, while the edit distances are calculated for all the tuple pairs of data serving as the target of the similarity join process in the conventional technique, the pairs of tuples that do not have any common character in the end portion are excluded in the first exemplary embodiment. This makes it possible to reduce the processing cost and reduce the time required for the similarity join process in the system as a whole, as compared with the conventional technique. Further, according to the first exemplary embodiment, all the pairs of tuples that have to be detected are processed without fail, and hence, it is possible to output the appropriate similarity join process results.
Described below is the reason that the first exemplary embodiment can achieve such an effect. As described above, in the similarity join process, it is necessary to apply the calculation process for the edit distances of the join keys, the comparison process between the edit distance and the edit distance threshold value τ or other processes to all the tuple pairs in the data serving as the processing target. For example, in the string similarity join between the input data S and the input data R, the total number of tuple pairs is a value obtained through multiplication of the number of tuples in the input data S and the number of tuples in the input data R, which results in the vast amount of processing time.
Thus, in the first exemplary embodiment, before the edit distance is actually calculated, the tuple pairs having an edit distance exceeding the edit distance threshold value τ are identified, and the identified tuple pairs are excluded from the target of the similarity join process. With these operations, the number of tuple pairs subjected to the edit distance calculation or other processes can be reduced, whereby it is possible to reduce the time required for the similarity join process as a whole.
In the case where the end portions of the join key strings in the tuple s and the tuple r do not have any common character, the number of characters in each end portion is (τ+1), and hence, the edit distance between the tuple s and the tuple r is necessarily greater than the threshold value τ. This is because converting one join key string into the other with τ or fewer operations would leave at least one of the (τ+1) characters in the end portion unedited, and that character would then have to appear in the end portion of the other join key string. For example, in the case where the join key string in the tuple s is “abcdef,” the join key string in the tuple r is “ghidef,” and the threshold value τ is 2, the string in the end portion of the tuple s is “abc,” and the string in the end portion of the tuple r is “ghi,” so that there is no common character between them. At this point in time, it is already determined that the edit distance between the tuple s and the tuple r exceeds the threshold value τ, and hence, it can be readily understood that the calculation of the edit distance is not necessary for the pair of the tuple s and the tuple r. Note that the actual edit distance is 3.
Thus, as described in the first exemplary embodiment, even if the pairs of tuples that do not have any common character in the end portion are excluded from the target of the edit distance calculation, all the pairs of tuples that should be detected are processed without fail, and it is possible to output the appropriate similarity join process results.
On the other hand, if there exists any character common to the end portions in the tuples, there is a possibility that the edit distance is less than or equal to the threshold value τ. Thus, the join processing device according to the first exemplary embodiment sets the pair of tuples having at least one common character in the end portion of the tuples for the target of the edit distance calculation.
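The exclusion based on the end portion can be sketched as follows. This is a minimal illustration under the assumption that the same side (head or tail) is used for both join key strings; the function names end_portion and is_candidate are hypothetical.

```python
def end_portion(key: str, tau: int, from_head: bool = True) -> str:
    """Return the (tau + 1)-character end portion of a join key string."""
    # Slicing simply returns the whole key when it is shorter than tau + 1.
    return key[:tau + 1] if from_head else key[-(tau + 1):]


def is_candidate(s_key: str, r_key: str, tau: int, from_head: bool = True) -> bool:
    """True if the end portions share a character, so the edit distance must be calculated."""
    s_chars = set(end_portion(s_key, tau, from_head))
    r_chars = set(end_portion(r_key, tau, from_head))
    return bool(s_chars & r_chars)


print(is_candidate("abcdef", "ghidef", tau=2))   # prints False ("abc" vs. "ghi")
print(is_candidate("kitten", "sitting", tau=3))  # prints True ("kitt" vs. "sitt")
```

A pair for which is_candidate returns True is not guaranteed to have similarity; it only remains a target of the edit distance calculation.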
A data management device according to a second exemplary embodiment is connected to plural join processing devices that each perform the similarity join process for the plural tuples using the threshold value τ in a manner that they can communicate with each other. This data management device includes a data storage unit that stores the plural tuples, and a data distributing unit that determines a join processing device that processes each of the tuples stored in the data storage unit to be a distribution destination of each of the tuples, this determination being made in a manner such that one tuple is distributed to the distribution destination same as that of another tuple containing, in an end portion ranging from the head character or the tail character to the (τ+1)th character in the join key string thereof, at least one character that the one tuple contains in the end portion in the join key string, and is not distributed to a distribution destination same as that of another tuple that does not contain any character common to that in the end portion in the join key string of the one tuple.
The data management device according to the second exemplary embodiment distributes each of the tuples stored in the data storage unit and subjected to the similarity join process to at least one of the plural join processing devices, and the plural join processing devices perform the similarity join process in parallel, thereby increasing the speed of the string similarity join process. Here, the plural tuples subjected to the similarity join process may be extracted from one tuple set stored in the data storage unit of one data management device, or may be extracted from plural tuple sets stored in the data storage unit of one data management device, or may be extracted from plural tuple sets stored in the data storage units of plural data management devices.
The similarity join process performed in the plural join processing devices may be performed using known methods as described above. Here, each of the join processing devices represents a unit capable of performing the similarity join process, and may be one computer or may be one central processing unit (CPU). Thus, in the case of a computer including plural circuit boards each provided with a CPU, the plural join processing devices may be realized with one computer.
The data management device determines that each of the tuples stored in the data storage unit is distributed to the distribution destination same as that of another tuple containing, in an end portion ranging from the head character or the tail character to the (τ+1)th character in the join key string thereof, at least one character that each of the tuples contains in the end portion in the join key string. Here, the other tuple described above may be a tuple stored in the data storage unit of the data management device itself, or may be a tuple retained in another data management device.
As a result, one join processing device performs the similarity join process for pairs of tuples having at least one common character in the end portion ranging from the head character to the (τ+1)th character in the join key string, or for pairs of tuples having at least one common character in the end portion ranging from the tail character to the (τ+1)th character in the join key string. On the other hand, the pair of tuples that do not have any common character in the end portion in the join key string is not distributed to the same join processing device, and hence, is excluded from the target of the edit distance calculation.
In other words, in the second exemplary embodiment, whether or not a pair of tuples is subjected to the edit distance calculation is determined depending on whether or not the pair of the tuples is distributed to the same join processing device. Thus, according to the second exemplary embodiment, it is possible to achieve an effect similar to that of the first exemplary embodiment.
The system controlling device 10 receives a request for the similarity join process, and controls the data management device 20 and the join processing device 30 to perform the similarity join process in accordance with the request. The system controlling device 10 receives join results transmitted from each of the join processing devices 30, and outputs the final results of the string similarity join process.
The data management device 20 manages at least one item of data (tuple set) serving as the join process target. In the third exemplary embodiment, the data management devices 20(#1) and 20(#2) each manage the data. As in the second exemplary embodiment, the data management device 20 determines a distribution destination of each of the tuples constituting the data that this data management device 20 manages, and distributes data concerning each of the tuples to the join processing device 30 serving as the determined distribution destination. More specifically, in the third exemplary embodiment, SIP tuples, which will be described later, are distributed as the data concerning each of the tuples.
The join processing device 30 identifies tuple pairs having edit distances that satisfy conditions of an edit distance threshold value τ on the basis of the data distributed from the data management device 20, and transmits the data concerning the identified tuple pairs as the join results to the system controlling device 10.
The system controlling device 10, the data management device 20, and the join processing device 30 are connected to each other through a network 7 in a manner that they can communicate with each other. The network 7 includes, for example, a public network such as the Internet, a wide area network (WAN), a local area network (LAN), and a wireless communication network. Note that, in this exemplary embodiment, the communication protocol between the devices, the form of network and the like are not limited, provided that these devices are connected to each other in a manner that they can communicate with each other.
As illustrated in
Further, this exemplary embodiment does not limit the number of the data management devices 20 and the join processing devices 30. In the case where the data on the join process target are retained in one data management device 20, it is only necessary that one data management device 20 exists. The number of the join processing devices 30 is set to 2 or more, and is less than or equal to the number of types of characters appearing in the input data. The basis for choice of the number of these devices will be described later.
Described below are specific configurations of the devices constituting the system 1 according to the third exemplary embodiment.
The request controlling unit 11 acquires a processing request for the string similarity join, generates an execution instruction on the basis of details of the acquired processing request, and transmits the execution instruction through a communication interface of the input-output I/F 4 to the data management device 20 and the join processing device 30. Here, the processing request includes a data identifier for identifying data serving as the join process target, information on a join key attribute of the target data, and a threshold value τ. This processing request may be acquired from an external device through a communication, or may be inputted through a user interface (not illustrated) of the system controlling device 10.
The execution instruction transmitted to the data management device 20 is a communication message including network address information such as an internet protocol (IP) address and a port of the join processing device 30, and each data and threshold value τ included in the processing request. Further, the execution instruction transmitted to the join processing device 30 is a communication message including, for example, a data identifier, a threshold value τ, network address information such as an IP address of the data management device 20, and network address information such as an IP address of the system controlling device 10. This exemplary embodiment does not limit the format of the communication message.
The join result storage unit 15 stores a local join result of a string similarity join transmitted from each of the join processing devices 30. In the case where the join processing device 30 contains pairs of tuples whose edit distance is estimated to satisfy the condition of the threshold value τ, this local join result includes as many local result tuples as there are such pairs of tuples. Each local result tuple includes an edit distance estimation value and a pair of tuple pointers for identifying the pair of tuples.
The edit distance estimation value is a value calculated by the join processing device 30, and hereinafter, is also referred to as a local edit distance. Details of this local edit distance will be described later. The tuple pointer is identification information for identifying one tuple from among all the tuples treated in the system 1, and is also referred to as tuple identifying data. In this exemplary embodiment, the tuple pointer is formed by a tuple identifier for identifying a certain tuple in a certain tuple set (data), and a data identifier for identifying the tuple set (data). Note that, in the case where the tuple identifier is set in a unique manner for all the tuples treated in the system 1, the tuple pointer may be formed only by the tuple identifier.
The result generating unit 12 acquires the local join result from each of the join processing devices 30 through the communication interface of the input-output I/F 4. When storing the acquired local join result in the join result storage unit 15, the result generating unit 12 detects the local result tuples containing the same pair of tuple pointers, and stores only the local result tuple having the minimum local edit distance of all the detected local result tuples in the join result storage unit 15, as information on the pair of tuples having the edit distance that satisfies the condition of the edit distance threshold value τ.
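The selection of the minimum local edit distance for each pair of tuple pointers can be sketched as follows. This is an illustrative sketch only; representing a tuple pointer as a string such as “S: 101,” the pointer “R: 201,” and the function name merge_local_results are assumptions made here.

```python
from typing import Dict, Iterable, List, Tuple

TuplePointer = str  # assumed representation, e.g. "S: 101"
LocalResult = Tuple[int, TuplePointer, TuplePointer]  # (local edit distance, s_ptr, r_ptr)


def merge_local_results(local_results: Iterable[LocalResult]) -> List[LocalResult]:
    """Keep, for each pair of tuple pointers, only the minimum local edit distance."""
    best: Dict[Tuple[TuplePointer, TuplePointer], int] = {}
    for distance, s_ptr, r_ptr in local_results:
        pair = (s_ptr, r_ptr)
        if pair not in best or distance < best[pair]:
            best[pair] = distance
    return [(distance, s_ptr, r_ptr) for (s_ptr, r_ptr), distance in best.items()]


# Hypothetical local join results received from two join processing devices.
received = [(2, "S: 101", "R: 201"), (1, "S: 101", "R: 201"), (2, "S: 102", "R: 205")]
print(merge_local_results(received))
# prints [(1, 'S: 101', 'R: 201'), (2, 'S: 102', 'R: 205')]
```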
The data storage unit 25 stores data (tuple set) and a data identifier for identifying the data. In the example illustrated in
The SIP tuple generating unit 21 receives an execution instruction from the system controlling device 10 through the communication interface of the input-output I/F 4, and extracts data corresponding to the data identifier contained in this execution instruction from the data storage unit 25. In the example illustrated in
The SIP tuple generating unit 21 generates (τ+1) pieces of SIP tuples concerning each of the tuples contained in the extracted data. The SIP tuple relates to the join key string in the tuple, and is formed by a combination of a tail portion string ranging from the tail character to the i-th character (i is a positive integer less than or equal to (τ+1)) counted from the head character, the string length of the remaining head portion string, and the tuple pointer, or a combination of a head portion string ranging from the head character to the i-th character counted from the tail character, the string length of the remaining tail portion string, and the tuple pointer. The join key in each of the tuples is identified on the basis of the information of the join key attribute contained in the execution instruction transmitted from the system controlling device 10.
Here, the SIP tuple of the tuple s can be expressed as <st_i, |sh_i|, s_ptr>, where, in the join key string in the tuple s, st_i is the tail portion string ranging from the tail character to the i-th character counted from the head character, |sh_i| is the string length of the remaining head portion string, and the s_ptr is the tuple pointer of the tuple s. Further, |sh_i| is equal to (i−1).
Here, in the case where the data storage unit 25 stores the input tuple set S illustrated in
SIP tuple (i=1): <“XWY-RS200”, 0, “S: 101”>
SIP tuple (i=2): <“WY-RS200”, 1, “S: 101”>
SIP tuple (i=3): <“Y-RS200”, 2, “S: 101”>
In the case of another mode of the SIP tuple, more specifically, in the case where the combination of the head portion string ranging from the head character to the i-th character counted from the tail character, the string length of the remaining tail portion string, and the tuple pointer is used, the SIP tuple in the above-described example is generated in the following manners.
SIP tuple (i=1): <“XWY-RS200”, 0, “S: 101”>
SIP tuple (i=2): <“XWY-RS20”, 1, “S: 101”>
SIP tuple (i=3): <“XWY-RS2”, 2, “S: 101”>
It should be noted that either of the two methods may be employed as the mode of the SIP tuple. As described above, the SIP tuple is a tuple containing information on the join key, and hence, can be called a key information tuple. Further, the SIP tuple generating unit 21 can be called a key information generating unit.
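The two modes of SIP tuple generation described above can be sketched as follows. This is a minimal illustration; the function names are hypothetical, and limiting the number of SIP tuples to the key length for join keys shorter than (τ+1) characters is an assumption not stated in the text.

```python
from typing import List, Tuple

SipTuple = Tuple[str, int, str]  # (portion string, length of the remaining string, tuple pointer)


def sip_tuples_tail(key: str, ptr: str, tau: int) -> List[SipTuple]:
    """SIP tuples <st_i, |sh_i|, ptr>: tail portion strings starting at the i-th character."""
    count = min(tau + 1, len(key))  # assumption: never produce an empty portion string
    return [(key[i - 1:], i - 1, ptr) for i in range(1, count + 1)]


def sip_tuples_head(key: str, ptr: str, tau: int) -> List[SipTuple]:
    """SIP tuples of the other mode: head portion strings ending at the i-th character from the tail."""
    count = min(tau + 1, len(key))
    return [(key[:len(key) - (i - 1)], i - 1, ptr) for i in range(1, count + 1)]


for sip in sip_tuples_tail("XWY-RS200", "S: 101", tau=2):
    print(sip)
# prints ('XWY-RS200', 0, 'S: 101'), ('WY-RS200', 1, 'S: 101'), ('Y-RS200', 2, 'S: 101')
```

In both modes the second element equals (i−1), which matches the relation |sh_i| = (i−1) given above.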
The data distributing unit 22 receives the SIP tuple set generated by the SIP tuple generating unit 21, determines the distribution destination of each of the SIP tuples, and distributes (transmits), as the data on each of the tuples, each of the SIP tuples to the join processing device 30 determined to be the distribution destination. The data distributing unit 22 determines the distribution destination of each of the SIP tuples on the basis of the head character of the tail portion string or the tail character of the head portion string contained in each of the SIP tuples. Then, the SIP tuples having the same head character or the same tail character are distributed to the same join processing device 30.
For example, the data distributing unit 22 determines the distribution destination of each of the SIP tuples using a function such as a hash function that, in response to input of one character, outputs one value with which any one of the join processing devices 30(#1), 30(#2), 30(#3), and 30(#4) can be identified. The data distributing unit 22 identifies a network address of the join processing device 30 serving as the determined distribution destination on the basis of the network address information contained in the execution instruction transmitted from the system controlling device 10, and transmits the corresponding SIP tuple to the join processing device 30. Note that the method of determining the distribution destination from a certain character is not limited to the method using the function such as the hash function.
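The determination of the distribution destination from one character can be sketched as follows, for the mode using the tail portion string. The text only requires some function that maps a character to one join processing device; the use of zlib.crc32 here is one possible choice made for illustration (Python's built-in hash for strings can differ between processes), and the function names are hypothetical.

```python
import zlib
from typing import Dict, List, Tuple

SipTuple = Tuple[str, int, str]  # (tail portion string, head length, tuple pointer)


def destination(ch: str, num_devices: int) -> int:
    """Map one character to the index of a join processing device (0 to num_devices - 1)."""
    return zlib.crc32(ch.encode("utf-8")) % num_devices


def route(sip_tuples: List[SipTuple], num_devices: int) -> Dict[int, List[SipTuple]]:
    """Assign each SIP tuple to a device according to the head character of its string."""
    routed: Dict[int, List[SipTuple]] = {d: [] for d in range(num_devices)}
    for sip in sip_tuples:
        head_char = sip[0][0]  # assumes a non-empty tail portion string
        routed[destination(head_char, num_devices)].append(sip)
    return routed


sips = [("XWY-RS200", 0, "S: 101"), ("WY-RS200", 1, "S: 101"), ("Y-RS200", 2, "S: 101")]
routed = route(sips, num_devices=4)
```

Because the destination depends only on the character, SIP tuples generated by different data management devices with the same head character arrive at the same join processing device.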
Here, the head character of the tail portion string is any one of the characters contained in the end portion ranging from the head character to the (τ+1)th character in the join key string, and the tail character of the head portion string is any one of the characters contained in the end portion ranging from the tail character to the (τ+1)th character in the join key string. Thus, for the tuple s and the tuple r having at least one character common to each other in the end portion ranging from the head character or the tail character to the (τ+1)th character in the join key string, one or more pairs out of all the combinations (pairs) of the (τ+1) SIP tuples related to the tuple s and the (τ+1) SIP tuples related to the tuple r are distributed to the same join processing device 30 and are subjected to the similarity join process. Note that, in this exemplary embodiment, the tuple s and the tuple r that do not have any common character in the end portion may be distributed to the same join processing device 30 or to different join processing devices 30. In either case, the tuple s and the tuple r that do not have any common character in the end portion are excluded from the target of the similarity join process in the join processing device 30.
The SIP tuple receiving unit 31 receives the SIP tuple transmitted from the data management device 20, and stores the received SIP tuple in the SIP tuple storage unit 35. The SIP tuple storage unit 35 stores a set of the SIP tuples received by the SIP tuple receiving unit 31. The SIP tuple storage unit 35 stores, as one set, the SIP tuples having the common head character of the tail portion string or common tail character of the head portion string contained in each of the SIP tuples.
The join processing unit 32 receives the execution instruction transmitted from the system controlling device 10, and retains the various kinds of data contained in the execution instruction. As described above, the various kinds of data include, for example, a data identifier, a threshold value τ, network address information such as an IP address of the data management device 20, and network address information such as an IP address of the system controlling device 10.
The join processing unit 32 uses the retained data as described above to perform the similarity join process for the SIP tuples stored in the SIP tuple storage unit 35. More specifically, the join processing unit 32 extracts, from the SIP tuple storage unit 35, a set of given SIP tuples having a common head character of the tail portion string or a common tail character of the head portion string, and causes the estimation value calculating unit 33 to perform a predetermined process for all the combinations (all the pairs) of two SIP tuples in the plural SIP tuples having the common head character of the tail portion string or the common tail character of the head portion string extracted above. Here, in the case where the similarity join process is directed to the join between different tuple sets (data), the join processing unit 32 causes the estimation value calculating unit 33 to perform the predetermined process for all the combinations (all the pairs) of two SIP tuples, the two SIP tuples having data identifiers different from each other, in the plural SIP tuples having the common head character of the tail portion string or the common tail character of the head portion string extracted above. The data identifier is extracted from the tuple pointer contained in the SIP tuple. With these configurations, even if the tuple s and the tuple r that do not have any common character in the end portion ranging from the head character or the tail character to the (τ+1)th character in the join key string are distributed to the same join processing device 30, they are excluded from the target of the similarity join process.
The join processing unit 32 acquires the process results from the estimation value calculating unit 33, and identifies, on the basis of the acquired process results, the pairs of SIP tuples having an edit distance estimated to satisfy the condition of the threshold value τ. As the process results of the estimation value calculating unit 33, the local edit distance of each pair of SIP tuples and information indicating whether or not that edit distance satisfies the condition of the threshold value τ are acquired. The join processing unit 32 generates a local result tuple containing the pair of tuple pointers and the local edit distance for each of the identified pairs, and transmits the local join result containing the generated local result tuples to the system controlling device 10.
Upon receiving the pair of SIP tuples (two SIP tuples) from the join processing unit 32, the estimation value calculating unit 33 calculates the edit distance between tail portion strings or head portion strings contained in each of the SIP tuples. The calculated edit distance is denoted as a partial string edit distance. In the case where the calculated partial string edit distance does not satisfy the condition of the threshold value τ, the estimation value calculating unit 33 sends back, to the join processing unit 32, the process results indicating that the edit distance of the pair of the SIP tuples does not satisfy the condition of the threshold value τ.
On the other hand, in the case where the calculated partial string edit distance satisfies the condition of the threshold value τ, the estimation value calculating unit 33 further calculates a local edit distance of the pair by adding the partial string edit distance to the string length of the larger head portion string or the string length of the larger tail portion string. The estimation value calculating unit 33 compares the calculated local edit distance with the threshold value τ, and returns the comparison result serving as the process results to the join processing unit 32. More specifically, in the case where the local edit distance satisfies the condition of the threshold value τ, the estimation value calculating unit 33 returns, as the process results, the local edit distance and the information indicating that the edit distance of the pair of the SIP tuples satisfies the condition of the threshold value τ. On the other hand, in the case where the local edit distance does not satisfy the condition of the threshold value τ, the estimation value calculating unit 33 returns the process results indicating that the edit distance of the pair of the SIP tuples does not satisfy the condition of the threshold value τ.
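The two-stage judgment performed by the estimation value calculating unit 33 can be sketched as below. This is only an illustrative sketch: the names edit_distance and estimate, the representation of a SIP tuple as a Python tuple (tail portion string, head portion length, tuple pointer), and the reading of "satisfies the condition" as "less than or equal to τ" are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # replace or match
        prev = cur
    return prev[-1]

def estimate(x, y, tau):
    """Return (satisfies, local_edit_distance) for two SIP tuples.

    Each SIP tuple is (tail_string, head_length, tuple_pointer).
    The local edit distance is None when the pair is rejected.
    """
    partial = edit_distance(x[0], y[0])        # partial string edit distance
    if partial > tau:
        return False, None                     # rejected without further work
    local = partial + max(x[1], y[1])          # add the longer head length
    if local > tau:
        return False, None
    return True, local

print(estimate(("X-BB-KC", 2, "S: 103"), ("X-BB-KC", 2, "R: 203"), 2))
```

Checking the partial string edit distance first lets a pair be rejected before the addition and the second comparison are made.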
Next, a description will be made of a relationship between the local edit distance calculated using the partial string edit distance as described above and a normal edit distance (hereinafter, referred to as an actual edit distance) calculated using the entire join key string. A head portion string ranging from the head character to the (i−1)th character in a join key string of a tuple x is denoted as xh_i, a head portion string ranging from the head character to the (j−1)th character in a join key string of a tuple y is denoted as yh_j, and the remaining tail portion strings are denoted as xt_i and yt_j, respectively.
The following relationship is formed among an edit distance ED(x, y) of the tuple x and the tuple y, an edit distance (partial string edit distance) ED(xh_i, yh_j) between the head portion strings, and the edit distance (partial string edit distance) ED(xt_i, yt_j) between the tail portion strings.
ED(x, y)≦ED(xh_i, yh_j)+ED(xt_i, yt_j) Equation 1
Further, the following relationship is formed between ED(x, y) and the string lengths |x| and |y| of the join key. Note that the max( ) is a function that outputs the larger value.
ED(x,y)≦max(|x|,|y|) Equation 2
On the basis of the Equation 1 and Equation 2 described above, the following Equation 3 and Equation 4 can be obtained.
ED(x, y)≦max(|xh_i|, |yh_j|)+ED(xt_i, yt_j) Equation 3
ED(x, y)≦max(|xt_i|, |yt_j|)+ED(xh_i, yh_j) Equation 4
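The step from Equation 1 and Equation 2 to Equation 3 can be spelled out as follows, on the understanding that Equation 2 is applied to the head portion strings rather than to the complete join key strings. Equation 4 follows in the same way with the roles of the head and tail portion strings exchanged.

```latex
\begin{align*}
ED(x, y) &\le ED(xh_i, yh_j) + ED(xt_i, yt_j)
  && \text{(Equation 1)} \\
ED(xh_i, yh_j) &\le \max(|xh_i|, |yh_j|)
  && \text{(Equation 2 applied to } xh_i \text{ and } yh_j\text{)} \\
\Rightarrow\quad ED(x, y) &\le \max(|xh_i|, |yh_j|) + ED(xt_i, yt_j)
  && \text{(Equation 3)}
\end{align*}
```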
The right-hand sides of Equation 3 and Equation 4 correspond to the local edit distance calculated by the estimation value calculating unit 33 using the partial string edit distance. For the tuple x and the tuple y, {2×(τ+1)} pieces of the SIP tuples are generated, and hence, plural local edit distances may be generated for the tuple x and the tuple y. Not all of the generated local edit distances are necessarily equal to the actual edit distance. This is because, as shown in Equation 3 and Equation 4, the local edit distance only indicates an upper limit value of the actual edit distance. Thus, the local edit distance may be called an edit distance estimation value. On the basis of Equation 3 and Equation 4 described above, the following relationships can be formed.
If max(|xh_i|, |yh_j|)+ED(xt_i, yt_j)≦τ is established, ED(x, y)≦τ Equation 5, where i is not less than 1 and not more than τ+1
If max(|xt_i|, |yt_j|)+ED(xh_i, yh_j)≦τ is established, ED(x, y)≦τ Equation 6, where j is not less than 1 and not more than τ+1
On the basis of the relationships Equation 5 and Equation 6 described above, it can be understood that the actual edit distance is always less than or equal to τ if the local edit distance of the tuple x and the tuple y is less than or equal to the threshold value τ. Further, on the basis of the general characteristics of the edit distance, it can be derived that, if the actual edit distance is less than or equal to the threshold value τ, the minimum of the plural local edit distances of the tuple x and the tuple y is always equal to the actual edit distance.
Those described above can be expressed using the following theorem: if the edit distance ED(x, y) between the given strings x and y is less than or equal to the edit distance threshold value τ, then at least one pair <i, j> of positive integers that satisfy the following condition exists.
(1≦i≦τ+1) AND (1≦j≦τ+1) AND (x[i]=y[j]) AND {max(|xh_i|, |yh_j|)+ED(xt_i, yt_j)=ED(x, y)}, or
(0≦|xs|−i+1≦τ+1) AND (0≦|ys|−j+1≦τ+1) AND (x[i−1]=y[j−1]) AND {max(|xt_i|, |yt_j|)+ED(xh_i, yh_j)=ED(x, y)}
Here, xt_i indicates a tail portion string ranging from the tail character to the i-th character counted from the head character in the string x; |xh_i| indicates a string length of the remaining head portion string; yt_j indicates a tail portion string ranging from the tail character to the j-th character counted from the head character in the string y; and |yh_j| indicates a string length of the remaining head portion string. Further, x[i] indicates the i-th character counted from the head character in the string x; y[j] indicates the j-th character counted from the head character in the string y; |xs| indicates the string length of the string x; and |ys| indicates the string length of the string y.
From the theorem described above, it is guaranteed that, if the actual edit distance is less than or equal to the threshold value τ, the minimum of the plural local edit distances of the tuple x and the tuple y generated from the SIP tuples is always equal to the actual edit distance. Since plural local edit distances may be generated for the tuple x and the tuple y, there may exist plural local edit distances having a value less than or equal to the threshold value τ. However, the result generating unit 12 of the system controlling device 10 filters the local result tuples using the local edit distance as described above, so that the actual edit distance can be readily known on the basis of the local edit distance. Thus, as described in this exemplary embodiment, even if the similarity join result is obtained using the local edit distance in place of the actual edit distance, it is possible to obtain the correct similarity join result.
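The relationship between the minimum local edit distance and the actual edit distance can be checked on the “kitten” and “sitting” example with τ=3. The sketch below covers only the first condition of the theorem (tail portion strings); the function names are assumptions, and the check is an illustration for this one pair of strings, not a proof.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def local_edit_distances(x: str, y: str, tau: int):
    """All values max(|xh_i|, |yh_j|) + ED(xt_i, yt_j) for
    1 <= i, j <= tau+1 with x[i] == y[j] (positions counted from 1)."""
    out = []
    for i in range(1, min(tau + 1, len(x)) + 1):
        for j in range(1, min(tau + 1, len(y)) + 1):
            if x[i - 1] == y[j - 1]:          # common head character of the tails
                out.append(max(i - 1, j - 1)
                           + edit_distance(x[i - 1:], y[j - 1:]))
    return out

x, y, tau = "kitten", "sitting", 3
leds = local_edit_distances(x, y, tau)
print(min(leds), edit_distance(x, y))   # prints "3 3"
```

Here the minimum is reached at i=j=2, where one head character is skipped on each side and the tail portion strings “itten” and “itting” are at edit distance 2. Every other local edit distance in the list is larger, as expected of an upper limit value.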
Next, an example of an operation performed by the system 1 according to the third exemplary embodiment will be described.
First, in the system controlling device 10, the request controlling unit 11 acquires a processing request for the string similarity join (S501). This processing request includes data identifiers (for example, S and R) for identifying the data serving as the join process target, information (for example, “product number”) on a join key attribute of the target data, and an edit distance threshold value τ.
The request controlling unit 11 generates an execution instruction on the basis of details of the processing request acquired, and transmits the generated execution instructions to all the data management devices 20 and all the join processing devices 30 (S502).
Each of the data management devices 20 that receives the execution instruction operates in the following manners. The SIP tuple generating unit 21 extracts data corresponding to the data identifier contained in the execution instruction from the data storage unit 25, and generates (τ+1) pieces of SIP tuples for each of the tuples contained in the extracted data (S503).
Next, the data distributing unit 22 receives a SIP tuple set generated by the SIP tuple generating unit 21, identifies the join processing device 30 serving as the distribution destination of each of the SIP tuples, and distributes each of the SIP tuples to the identified join processing device 30 (S504).
Each of the join processing devices 30 that receives each of the SIP tuples operates in the following manners. The SIP tuple receiving unit 31 receives the SIP tuples transmitted from the data management device 20, and sequentially stores the received SIP tuples in the SIP tuple storage unit 35. Once all the data management devices 20 serving as the target complete the distribution, each of the join processing devices 30 entirely acquires all the SIP tuples serving as the join process target.
The join processing unit 32 performs the similarity join process for the SIP tuples stored in the SIP tuple storage unit 35 (S505). In this similarity join process, a partial string edit distance is calculated for the pairs of the SIP tuples, and a local edit distance is calculated on the basis of the partial string edit distance. Then, the pairs of SIP tuples having an edit distance estimated to satisfy the condition of a threshold value τ are identified.
The join processing unit 32 generates a local result tuple containing a pair of tuple pointers and the local edit distance for each of the identified pairs, and transmits local join results containing the generated local result tuples to the system controlling device 10 (S506).
In the system controlling device 10 that receives the local join results from each of the join processing devices 30, the result generating unit 12 excludes all the local result tuples having pairs of the same tuple pointers except for those having the minimum local edit distance, and stores the local result tuples in the join result storage unit 15 (S507). Then, from the local result tuples stored in the join result storage unit 15 of the system controlling device 10, it is possible to obtain information on the pair of tuples that satisfy the condition of the edit distance threshold value τ.
Next, of the steps shown in
First, the SIP tuple generating unit 21 extracts, from the data storage unit 25, data S corresponding to a data identifier S contained in an execution instruction from the system controlling device 10 (S601). In
The SIP tuple generating unit 21 judges whether the data S contains an unprocessed tuple s (S602). If the unprocessed tuple s exists (S602; YES), the SIP tuple generating unit 21 acquires a tuple pointer s_ptr and a join key (string length: |s|) for the tuple s (S603). In the example illustrated in
The SIP tuple generating unit 21 sets the initial value 1 to a variable i (S604).
The SIP tuple generating unit 21 acquires a tail portion string st_i ranging from a tail character to the i-th character counted from a head character in the join key, and a string length |sh_i| of the remaining head portion string (S605). Here, the string length of the tail portion string st_i is (|s|−i+1), and the |sh_i| is indicated as (i−1). The SIP tuple generating unit 21 uses the acquired data to generate a SIP tuple <st_i, |sh_i|, s_ptr> for the variable i (S605).
Then, the SIP tuple generating unit 21 adds the generated SIP tuple to a SIP tuple set sip[s[i]] concerning a character s[i] (S606). The character s[i] corresponds to the i-th character counted from the head character in the join key s[ ] (for example, “XWY-RS200”), in other words, corresponds to the head character of the tail portion string st_i. Thus, the SIP tuple set sip[s[i]] is a set of SIP tuples having the head character of the tail portion string st_i common to each other.
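The generation of the SIP tuples <st_i, |sh_i|, s_ptr> for one join key can be sketched as follows. The function name generate_sip_tuples is hypothetical, and stopping early when the join key is shorter than (τ+1) characters is an assumption, since that case is not addressed above.

```python
def generate_sip_tuples(join_key: str, tuple_pointer: str, tau: int):
    """Return the SIP tuples (st_i, |sh_i|, s_ptr) for i = 1 .. tau+1."""
    sips = []
    # Assumption: a key shorter than tau+1 simply yields fewer SIP tuples.
    for i in range(1, min(tau + 1, len(join_key)) + 1):
        tail = join_key[i - 1:]     # st_i: from the i-th character to the tail
        head_len = i - 1            # |sh_i|: length of the remaining head
        sips.append((tail, head_len, tuple_pointer))
    return sips

for sip in generate_sip_tuples("XWY-RS200", "S: 101", 2):
    print(sip)
```

With τ=2 the join key “XWY-RS200” yields tail portion strings starting with “X”, “W”, and “Y”, so its SIP tuples are added to the sets sip[“X”], sip[“W”], and sip[“Y”], respectively.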
Next, the SIP tuple generating unit 21 judges whether the sip[s[i]] sufficiently accumulates the SIP tuples (for example, 10 pieces of SIP tuples) (S607). If the sufficient number of SIP tuples are not accumulated (S607; NO), the SIP tuple generating unit 21 judges whether or not a value (++i) obtained by adding one to the variable i is less than or equal to (τ+1) (S610). If the (++i) is less than or equal to (τ+1) (S610; YES), the SIP tuple generating unit 21 performs the processes S605 and S606 described above for the variable i having one added thereto. After this, the SIP tuple generating unit 21 repeats the processes S605 and S606 described above until the (++i) exceeds (τ+1).
If sufficient numbers of SIP tuples are accumulated (S607; YES), the SIP tuple generating unit 21 notifies the data distributing unit 22 to that effect. With this operation, the data distributing unit 22 determines the distribution destination of the SIP tuple set sip[s[i]] serving as the target of the notification (S608). For example, the data distributing unit 22 determines the join processing device 30 corresponding to a hash value obtained by applying the character s[i] to a predetermined hash function, to be the distribution destination.
The data distributing unit 22 transmits the SIP tuple set sip[s[i]] to the join processing device 30 serving as the distribution destination determined on the basis of the head character (s[i]) of the tail portion string st_i as described above (S608). If the transmission is successfully made, the data distributing unit 22 initializes (empties) the SIP tuple set sip[s[i]] (S609).
If (++i) exceeds (τ+1) (S610; NO), the SIP tuple generating unit 21 judges again whether the data S contains any unprocessed tuple s (S602). If the unprocessed tuple s exists (S602; YES), the process S603 and thereafter are performed for the unprocessed tuple s in a similar manner described above. In the example illustrated in
If no unprocessed tuple s exists (S602; NO), the SIP tuple generating unit 21 judges whether there exists any unprocessed data identifier (S′) contained in the execution instruction (S611). If there exists an unprocessed data identifier (S′) (S611; YES), the data identifier S′ is set to the data identifier S (S612), and then, the process S601 and thereafter are performed in the manners described above.
If it is determined that there exists no unprocessed other data identifier contained in the execution instruction (S611; NO), the SIP tuple generating unit 21 requests the data distributing unit 22 to transfer the SIP tuple set sip[c] that has not been initialized (not emptied). In response to this request, the data distributing unit 22 determines the distribution destination of the SIP tuple set sip[c] that has not been emptied (S613), and transmits the SIP tuple set sip[c] to the join processing device 30 serving as the determined distribution destination (S613).
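The buffered distribution loop of S602 to S613 can be sketched as below. This is a simplified reading and not the flowchart itself: the function name distribute, the callback send standing in for the data distributing unit 22, the batch_size parameter (corresponding to the example of 10 pieces above), and the second input tuple are all assumptions.

```python
from collections import defaultdict

def distribute(tuples, tau, batch_size, send):
    """tuples: iterable of (tuple_pointer, join_key).
    send(ch, batch) is a hypothetical callback that transmits one batch."""
    sip = defaultdict(list)   # sip[c]: pending SIP tuples whose tail starts with c
    for ptr, key in tuples:                                  # S602, S603
        for i in range(1, min(tau + 1, len(key)) + 1):       # S604, S610
            ch = key[i - 1]
            sip[ch].append((key[i - 1:], i - 1, ptr))        # S605, S606
            if len(sip[ch]) >= batch_size:                   # S607
                send(ch, sip[ch])                            # S608
                sip[ch] = []                                 # S609
    for ch, batch in sip.items():                            # S613: flush the rest
        if batch:
            send(ch, batch)

sent = []
data = [("S: 101", "XWY-RS200"), ("S: 102", "XWY-RS300")]  # second tuple is hypothetical
distribute(data, tau=2, batch_size=2, send=lambda ch, b: sent.append((ch, list(b))))
for ch, batch in sent:
    print(ch, batch)
```

Buffering per character means each transmission carries only SIP tuples bound for a single join processing device 30, and the final flush sends whatever sets never reached the batch size.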
Once the processes performed by each of the data management devices 20 as described above complete, each of the SIP tuples related to the input tuple sets (data) S and R illustrated in
Below, an example of an operation performed by the join processing device 30(#1) will be described as an example using the join processing device 30(#1) having the SIP tuple set in
The join processing unit 32 extracts a set of SIP tuples having the head character of the tail portion string common to each other from the SIP tuple storage unit 35 (S1001).
The join processing unit 32 transmits, to the estimation value calculating unit 33, a processing instruction together with information indicating all the pairs (x, y) of two SIP tuples in the extracted SIP tuple set that have different data identifiers, the data identifiers being determined on the basis of the tuple pointers. The estimation value calculating unit 33 calculates an edit distance ED(x, y) of the tail portion strings for all the pairs (x, y) of the SIP tuples on the basis of the information transmitted from the join processing unit 32 (S1002). It is only necessary that the edit distance ED(x, y) is calculated using a generally known method of calculating the edit distance.
The estimation value calculating unit 33 judges whether or not the calculated partial string edit distances are less than or equal to the threshold value τ (S1003), and calculates the local edit distance for the pairs (x, y) having the partial string edit distance less than or equal to the threshold value τ (S1003; YES, S1004).
Here, by denoting the local edit distance for each of the pairs (x, y) as LED(x, y), the expression for calculating the local edit distance can be given as Equation 7 below.
LED(x, y)=ED(xt_i, yt_j)+max(|xh_i|, |yh_j|) Equation 7
Further, the estimation value calculating unit 33 uses the calculated local edit distance to perform a join process judgment for each of the pairs (x, y) having the partial string edit distance less than or equal to the threshold value τ (S1005). The join process judgment is to judge whether or not the local edit distance is less than or equal to the threshold value τ. In other words, judgment of following Equation 8 is made.
LED(x,y)≦τ Equation 8
In the example illustrated in
In this case, the estimation value calculating unit 33 calculates “0” (zero) for the partial string edit distance of the pair of the tuple (S: 103) and the tuple (R: 203). Since the partial string edit distance (0) is less than or equal to the threshold value τ (2), the estimation value calculating unit 33 adds the partial string edit distance (0) to the string length (2) of the larger head portion string to calculate the local edit distance (2). The expression at this time can be given as Equation 9 below.
LED(S:103,R:203)=ED(“X-BB-KC”,“X-BB-KC”)+max(2,2)=0+2=2 Equation 9
Since the calculated local edit distance (2) is less than or equal to the threshold value τ (2), the estimation value calculating unit 33 sets the results of the join process judgment to “true.” Such an estimation value calculating unit 33 may be realized, for example, as one function (validation function). In this case, the validation function is configured so as to acquire an address for accessing a pair of SIP tuples, and return the local edit distance of the pair, and information indicating the results of the join process judgment.
The join processing unit 32 acquires the local edit distance and the results of the join process judgment as the processing results for each of the pairs of the SIP tuples from the estimation value calculating unit 33. The join processing unit 32 identifies pairs for which the results of the join process judgment are true, and generates a local result tuple containing a pair of tuple pointers and the local edit distance for each of the identified pairs (S1006).
The join processing unit 32 transmits the local join result containing the generated local result tuple to the system controlling device 10 (S1006).
After transmitting the local join result to the system controlling device 10 (S1006), or if it is determined that there exists no pair having the partial string edit distance less than or equal to the threshold value τ (S1003; NO), the join processing unit 32 judges whether any unprocessed other SIP tuple sets exist in the SIP tuple storage unit 35 (S1007). If there exists no unprocessed SIP tuple set (S1007; NO), the join processing unit 32 terminates the process. On the other hand, if there exists the unprocessed SIP tuple set (S1007; YES), the join processing unit 32 sets the head character c′ of the unprocessed SIP tuple to a variable c (S1008), and then, the process S1001 and thereafter are performed in a similar manner described above.
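The processing of one SIP tuple set in S1001 to S1006 can be sketched as follows, using the pair of the tuple (S: 103) and the tuple (R: 203) from Equation 9. The function names and the way the data identifier is read from the text before the colon in the tuple pointer are assumptions made for this illustration.

```python
from itertools import combinations

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def data_id(tuple_pointer: str) -> str:
    # Assumption: a pointer such as "S: 103" carries its data identifier
    # before the colon.
    return tuple_pointer.split(":")[0].strip()

def join_group(group, tau):
    """group: SIP tuples (tail, head_len, ptr) sharing the head character
    of the tail portion string. Returns the local result tuples."""
    results = []
    for x, y in combinations(group, 2):
        if data_id(x[2]) == data_id(y[2]):
            continue                              # same data: not a join target
        partial = edit_distance(x[0], y[0])       # S1002
        if partial > tau:                         # S1003; NO
            continue
        led = partial + max(x[1], y[1])           # S1004, Equation 7
        if led <= tau:                            # S1005, Equation 8
            results.append(((x[2], y[2]), led))   # S1006
    return results

group = [("X-BB-KC", 2, "S: 103"), ("X-BB-KC", 2, "R: 203")]
print(join_group(group, 2))   # [(('S: 103', 'R: 203'), 2)]
```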
The result generating unit 12 receives each of the local join results transmitted from each of the join processing devices 30 (S1201). Each of the local join results contains a local result tuple s.
The result generating unit 12 extracts, from the join result storage unit 15, a local result tuple r having the same pair of tuple pointers as the received local result tuple s (S1202). Then, the result generating unit 12 acquires a local edit distance led_s contained in the received local result tuple s and a local edit distance led_r contained in the local result tuple r extracted from the join result storage unit 15 (S1203).
The result generating unit 12 judges whether the acquired local edit distance led_s is smaller than the local edit distance led_r acquired in a similar manner (S1204). If the local edit distance led_s is smaller than the local edit distance led_r (S1204; YES), the result generating unit 12 deletes the local result tuple r from the join result storage unit 15, and inserts the local result tuple s instead (S1205). If the local edit distance led_s is more than or equal to the local edit distance led_r (S1204; NO), the result generating unit 12 terminates without processing.
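The exclusion of overlapping local result tuples can be sketched as follows. A Python dict stands in for the join result storage unit 15, and storing the received tuple when no tuple with the same pair of tuple pointers exists yet is an assumption, since that case is not spelled out above. The local edit distances in the usage lines are hypothetical values.

```python
def merge_local_result(store: dict, pair, led_s: int) -> None:
    """Keep, for each pair of tuple pointers, only the smallest local edit
    distance seen so far. store maps pair -> local edit distance."""
    led_r = store.get(pair)               # stored tuple r for the same pair, if any
    if led_r is None or led_s < led_r:    # S1204 (first insertion is assumed)
        store[pair] = led_s               # S1205: replace r with s
    # Otherwise the stored tuple is kept and nothing is done.

store = {}
pair = ("S: 103", "R: 203")
for led in (4, 2, 3):                     # hypothetical arrival order
    merge_local_result(store, pair, led)
print(store)   # {('S: 103', 'R: 203'): 2}
```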
Here, in
It should be noted that, in the example of the operation described above, the configuration in which the SIP tuple is formed by the tail portion string, the string length of the remaining head portion string and the tuple pointer is given as an example. However, as described in [Device configuration], it may be possible to employ a configuration in which the SIP tuple is formed by a head portion string, a string length of the remaining tail portion string, and a tuple pointer. In the case of this configuration, the distribution destination of each of the SIP tuples is determined on the basis of the tail character of the head portion string, and the local edit distance is calculated by adding the edit distance between the head portion strings to the string length of the larger tail portion string.
As described above, in the third exemplary embodiment, for the join key string of each of the tuples of data that the data management device 20 has, the data management device 20 generates (τ+1) pieces of SIP tuples each formed by a combination of the tail portion string ranging from the tail character to the i-th character (i is a positive integer less than or equal to (τ+1)) counted from the head character, the string length of the remaining head portion string, and the tuple pointer, or a combination of the head portion string ranging from the head character to the i-th character counted from the tail character, the string length of the remaining tail portion string, and the tuple pointer. Then, the join processing device 30 serving as the distribution destination of each of the SIP tuples is determined on the basis of the head character of the tail portion string of each of the SIP tuples or the tail character of the head portion string of each of the SIP tuples, and each of the SIP tuples is distributed to the determined join processing device 30.
With this configuration, in the third exemplary embodiment, for each of the tuples, (τ+1) pieces of the join processing devices 30 are selected for the distribution destination at the maximum. Thus, according to the third exemplary embodiment, it is possible to reduce the total amount of communication flowing in the network 7. More specifically, in the third exemplary embodiment, the entire processing cost of the system 1 is {τ×(m+n)}, regardless of the number N of the join processing devices 30, where m and n are the numbers of tuples contained in the input data S and the input data R. On the other hand, with the conventional method, the processing cost is (N×m+n). Thus, with the increase in the number N of the join processing devices 30 and the decrease in the threshold value τ, the third exemplary embodiment can further reduce the processing cost as compared with the conventional method.
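The two cost expressions above can be compared numerically as below. The function names and the figures used (one million tuples per input, 16 join processing devices, τ=2) are hypothetical and serve only to illustrate the arithmetic.

```python
def proposed_cost(tau: int, m: int, n: int) -> int:
    """Cost expression given above for the third exemplary embodiment."""
    return tau * (m + n)

def conventional_cost(num_devices: int, m: int, n: int) -> int:
    """Cost expression given above for the conventional method."""
    return num_devices * m + n

m = n = 1_000_000            # hypothetical input sizes
print(proposed_cost(2, m, n))          # 4000000
print(conventional_cost(16, m, n))     # 17000000
```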
Further, as the value of i increases, the data size of the SIP tuple becomes smaller than the data size of the complete join key string. Thus, according to the third exemplary embodiment, it is possible to reduce the total communication amount as compared with the conventional method in which the complete join key string needs to be transferred to each of the join processing devices 30.
Further, with the third exemplary embodiment, the join processing device 30 calculates, for each of the pairs of the plural SIP tuples, the edit distance between the tail portion strings or the head portion strings serving as the partial string edit distance. Then, pairs of SIP tuples having the calculated partial string edit distance less than or equal to the threshold value τ are identified. For the identified pairs of SIP tuples, the partial string edit distance is added to the string length of the larger head portion string or the string length of the larger tail portion string to calculate the edit distance estimation value, and pairs of SIP tuples having the calculated edit distance estimation value less than or equal to the threshold value τ are identified.
Further, in the third exemplary embodiment, for the identified pairs of SIP tuples, the join processing device 30 generates the local result tuple containing the pair of tuple pointers and the local edit distance, and transmits the generated local result tuple to the system controlling device 10. The system controlling device 10 detects overlapping result tuples containing pairs of the same tuple pointers from among the plural local result tuples transmitted from the plural join processing devices 30, and deletes all the detected overlapping result tuples except for the local result tuple having the minimum edit distance estimation value, thereby determining the pairs of tuples having the edit distance less than or equal to the threshold value τ.
Thus, according to the third exemplary embodiment, the partial string edit distance is calculated, whereby it is possible to reduce the processing time as compared with the conventional method that calculates the edit distance of the complete join key string.
Further, when the distribution method as in the third exemplary embodiment is used, one or more join processing devices 30 always calculate the local edit distance LED(s, r) having the value equal to that of the actual edit distance ED(s, r) for the pair <s, r> of tuples, as described in the theorem described above, whereby the processing target data can be distributed in a manner such that the join process is performed for all the processing target without fail. Yet further, with the exclusion process for the overlapping local result tuple by the system controlling device 10, it is possible to obtain the appropriate join result of the string similarity join.
A fourth exemplary embodiment is different from the third exemplary embodiment in the join process method performed by the join processing device 30. Below, a system 1 according to the fourth exemplary embodiment will be described with focus being placed on things different from the third exemplary embodiment, and the details same as the third exemplary embodiment will not be repeated.
The join processing unit 32 according to the fourth exemplary embodiment causes the trie structuring unit 37 to perform a process of structuring a trie of SIP tuples stored in the SIP tuple storage unit 35, and trace the structured trie to generate a local result tuple. As in the third exemplary embodiment, the join processing unit 32 transmits a local join result containing the local result tuple thus generated, to the system controlling device 10.
The trie has a structure similar to a patricia trie, and is formed on the memory of the join processing device 30. The trie structuring unit 37 extracts, from the SIP tuple storage unit 35, SIP tuple sets having the head character of the tail portion string or the tail character of the head portion string common to each other, and structures a trie having the extracted SIP tuple sets mapped therein. The trie structured by the join processing unit 32 has a structure capable of retaining information on the tail portion string (or head portion string), a string length of the head portion string (or string length of the tail portion string), and a tuple pointer, each of which constitutes the SIP tuples.
More specifically, in the trie, the tail portion strings (or head portion strings) of the SIP tuples are mapped to branches (paths) from a root node to edge nodes (leaf nodes), and tuple pointers of the SIP tuples mapped to the branches together with a weight are attached to the edge nodes of the branches. Further, to the root node, the head character of the tail portion string or the tail character of the head portion string is attached as a label, and a list of pointers of child nodes together with a weight is attached. As for the weight used in this trie, the string length of the head portion string or the string length of the tail portion string of the SIP tuple is used.
In the example illustrated in
The string “abaa” is mapped to the root node (0), the node (1), the node (2), and the node (3). The string “acaa” is mapped to the root node (0), the node (4), the node (5), and the node (6). The string “abcaa” is mapped to the root node (0), the node (1), the node (7), the node (8), and the node (9). The string “aca” is mapped to the root node (0), the node (10), and the node (11).
To the edge node (9), which is a branch having the string “abcaa” mapped thereto, the tuple pointer “s3” of the SIP tuple together with the string length “0” of the head portion string is attached. Similarly, to the edge node (6), the tuple pointer “s2” together with the string length “1” of the head portion string is attached, and to the edge node (11), the tuple pointer “s4” together with the string length “2” of the head portion string is attached.
The trie structuring unit 37 maps the tail portion string (or head portion string) of the SIP tuple having the same tuple pointer to one branch to suppress the amount of memory used. For example, since the string “abaa” and the string “aa” in
After the trie structuring unit 37 completes structuring the trie, the join processing unit 32 searches the structured trie to acquire a set of local result tuples having the local edit distance less than or equal to the threshold value τ. The join processing unit 32 sequentially visits each node of the trie, and generates a list (active list) of other nodes analogous to each of the nodes, thereby acquiring the set of the local result tuples. In this exemplary embodiment, the active list stores node tuples containing a node number, a weight and a local edit distance in connection with each of the analogous other nodes. The searching process of the trie and the acquiring process of the local result tuple by the join processing unit 32 will be described later.
Below, an example of an operation performed by the system 1 according to the fourth exemplary embodiment will be described. This description covers the points in which the join process method performed by the join processing device 30 differs from that of the third exemplary embodiment.
The trie structuring unit 37 extracts, from the SIP tuple storage unit 35, a set of SIP tuples having the head character of the tail portion string common to each other (S1601). In the example illustrated in
After the SIP tuple set sip[a] is extracted, the trie structuring unit 37 initializes the trie (S1602).
Then, the trie structuring unit 37 acquires unprocessed SIP tuples from the extracted SIP tuple set sip[a] (S1603). Here, the tail portion string contained in the SIP tuple is denoted as a string s, the string length of the remaining head portion string is denoted as a length plen, and the tuple pointer is denoted as p.
The trie structuring unit 37 judges whether the root node of the trie of the head character “a” has a node nd with a label s[2] as the child node having the weight plen attached thereto (S1604). In other words, the trie structuring unit 37 judges whether, in the root node, the pointer of the child node having the label of the character s[2] is set to the child node pointer having the weight plen (S1604). The character s[2] represents the second character counted from the head of the string s. In the example illustrated in
If it is judged that such a child node exists (S1604; YES), the trie structuring unit 37 performs a process S1606. On the other hand, if it is judged that such a child node does not exist (S1604; NO), the trie structuring unit 37 newly generates a node nd with a label s[2], and sets the newly generated node nd to the child node with a weight plen in the root node (S1605). In the example in
The trie structuring unit 37 sets the variable i to be three, and sets the variable parent to be the node nd of the label s[2] (S1606).
Next, the trie structuring unit 37 repeats the processes from S1608 to S1613 described below until the variable i exceeds the length of the string s (S1607).
In the process S1608, the trie structuring unit 37 judges whether the node nd having the character s[i] set to the label thereof exists in the child nodes of the node having the variable parent set thereto. This judgment is made, for example, using the pointer of the child node contained in each of the nodes.
If the node nd having the label s[i] already exists (S1608; YES), the trie structuring unit 37 sets the variable parent to be the node nd with the label s[i] (S1612), adds one to the variable i (S1613), and makes the judgment of S1607 again.
On the other hand, if the node nd with the label s[i] does not yet exist (S1608; NO), the trie structuring unit 37 newly generates the node nd with the label s[i], and sets the newly generated node nd as the child node of the node having the variable parent set thereto (S1609). In the example illustrated in
Then, the trie structuring unit 37 judges whether there exists a SIP tuple containing the tail portion string that starts from the character s[i−1], which is the character immediately preceding the character s[i], and whether this character s[i−1] is equal to the label of the root node (S1610). More specifically, it is judged whether {plen +(i−2)≦τ} and (s[i−1]=“a”) are true as illustrated in
If the SIP tuple containing the tail portion string that starts from the character s[i−1], which is a character immediately before the character s[i], exists and the character s[i−1] is equal to the label of the root node (S1610; YES), the trie structuring unit 37 adds a node nd as the child node with the weight (plen+(i−2)) of the root node (S1611). In other words, the trie structuring unit 37 sets the pointer of the node nd with the label s[i] together with the weight (plen +(i−2)) for this root node.
In the example illustrated in
If the variable i is greater than the length of the string s (S1607; YES), in other words, if mapping of the string s to the trie is complete, the trie structuring unit 37 sets the tuple pointer p together with the weight plen to the node nd (S1614). In the example illustrated in
Next, the trie structuring unit 37 judges whether there exists any unprocessed SIP tuple t′ in the SIP tuple set sip[a] (S1615). If the unprocessed SIP tuple t′ exists (S1615; YES), the trie structuring unit 37 sets t′ for the variable t indicating the SIP tuple serving as the processing target (S1616), and then, performs the process S1603 and thereafter again. Note that, if no unprocessed SIP tuple t′ exists (S1615; NO), the trie structuring unit 37 terminates its process.
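The mapping of one SIP tuple to the trie (S1603 to S1614) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of the trie structuring unit 37: the class names Node and Root, the function insert_sip_tuple, and the dictionary keyed by (weight, label) that stands in for the weighted child pointers of the root node are all hypothetical. The sketch uses 0-based indices, so s[1] corresponds to the character s[2] in the description, and it assumes that the tail portion string has at least two characters. The tuple pointer "s1" in the usage lines is likewise chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: dict = field(default_factory=dict)  # label -> Node
    pointers: list = field(default_factory=list)  # (tuple pointer, weight)


@dataclass
class Root:
    label: str
    # (weight, label of child) -> Node; models the weighted child pointers.
    weighted_children: dict = field(default_factory=dict)


def insert_sip_tuple(root, s, plen, p, tau):
    """Map the SIP tuple (tail portion string s, head length plen, pointer p)."""
    # Assumption: len(s) >= 2 and s[0] == root.label.
    # S1604/S1605: find or create the root's child for weight plen and label s[2].
    nd = root.weighted_children.setdefault((plen, s[1]), Node(s[1]))
    parent = nd                                            # S1606
    for i in range(2, len(s)):                             # S1607 (0-based i)
        nd = parent.children.get(s[i])                     # S1608
        if nd is None:
            nd = Node(s[i])                                # S1609
            parent.children[s[i]] = nd
            weight = plen + i - 1                          # plen + (i - 2), 1-based
            if weight <= tau and s[i - 1] == root.label:   # S1610
                # S1611: the shorter tail starting one character earlier
                # shares this branch, so the root also points at nd.
                root.weighted_children.setdefault((weight, s[i]), nd)
        parent = nd                                        # S1612
    parent.pointers.append((p, plen))                      # S1614
    return root


root = Root("a")
insert_sip_tuple(root, "abaa", 0, "s1", tau=2)
insert_sip_tuple(root, "aa", 2, "s1", tau=2)
```

In this usage, the second call creates no new node: the tail portion string "aa" with weight 2 reuses the last node of the branch for "abaa", which is the memory saving that mapping plural SIP tuples of the same tuple pointer to one branch is intended to achieve.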
With the processes described above, the trie as illustrated in
After the trie structuring unit 37 completes the process of structuring the above-described trie, the join processing unit 32 searches the structured trie to acquire the set of the local result tuples having the local edit distance less than or equal to the threshold value τ.
The join processing unit 32 acquires the unprocessed weight w from the weights set in the root node (S1801).
Further, the join processing unit 32 acquires a given weight w2 from the weights set in the root node (S1802). The join processing unit 32 calculates a local edit distance between the root node with the weight w (hereinafter, referred to as a root(w)) and the root node with the weight w2 (hereinafter, referred to as a root(w2)) (S1802). The local edit distance is calculated through a method similar to that in the third exemplary embodiment. Since the root(w) and the root(w2) have the same label, the edit distance ED(root(w), root(w2)) is 0 (zero). Thus, the local edit distance led between the root(w) and the root(w2) is the larger weight value, in other words, is the string length of the head portion string (or tail portion string).
The join processing unit 32 adds the root(w2) to the active list of the root(w) (S1802). With this process, the node tuple concerning the root(w2) is set in the active list of the root(w). This node tuple contains the node number (0), the weight (w2), and the local edit distance between the root(w) and the root(w2). As described above, the local edit distance between the root nodes is the larger weight value, and hence, is always less than or equal to the threshold value τ. Thus, in the process S1802, the node tuple concerning the root(w2) is added to the active list without comparing the threshold value τ and the local edit distance.
Next, the join processing unit 32 calculates the local edit distance led between the root(w) and the child node nd2(w2) of the root(w2) (S1803). Here, the local edit distance between the nodes is calculated by adding the larger weight value to the edit distance between the strings formed along the paths up to each of the nodes. In other words, the local edit distance between the root(w) and the node nd2(w2) is a value obtained by adding the larger of the weight w and the weight w2 to the edit distance between the character attached as the label of the root(w) and the string formed by the label of the root(w2) and the label of the node nd2(w2).
If the calculated local edit distance led is less than or equal to the threshold value τ, the join processing unit 32 adds the node nd2(w2) to the active list of the root(w) (S1803). With this process, the node tuple concerning the node nd2(w2) is added to the active list of the root(w). This node tuple includes the node number concerning the node nd2(w2), the weight (w2), and the calculated local edit distance led.
It should be noted that, if plural child nodes nd2(w2) of the root(w2) exist, the join processing unit 32 performs the process S1803 to each of the child nodes. Further, if plural weights are set to the root node, the join processing unit 32 performs the processes S1802 and S1803 described above for each of the weights (w2). Then, the generation of the active list of the root(w) completes.
The join processing unit 32 sequentially generates the active lists of the descendant nodes nd(w) of the root(w) in a recursive manner. First, the join processing unit 32 acquires each of the child nodes nd(w) of the root(w) (S1804).
The join processing unit 32 acquires the active list of the parent node of the acquired child node nd(w) (S1805), and acquires the node an(w3) set in the active list (S1806).
The join processing unit 32 calculates the local edit distance led between the node nd(w) and the node an(w3), and if the calculated local edit distance led is less than or equal to the threshold value τ, adds the node an(w3) to the active list of the node nd(w) (S1807).
Further, the join processing unit 32 calculates the local edit distance led2 between the node nd(w) and each of the child nodes an_child(w3) of the node an(w3), and if the calculated local edit distance led2 is less than or equal to the threshold value τ, adds the node an_child(w3) to the active list of the node nd(w) (S1808).
The join processing unit 32 judges whether there exists any unprocessed node an(w3) in the active list of the parent node of the node nd(w) (S1809). If the unprocessed node an(w3) exists (S1809; YES), the join processing unit 32 performs the process S1806 described above and thereafter for the unprocessed node an(w3).
If the unprocessed node an(w3) does not exist (S1809; NO), the join processing unit 32 judges whether there exists any unprocessed child node nd_child(w) in the node nd(w) (S1810). If the unprocessed child node nd_child(w) exists (S1810; YES), the join processing unit 32 sets the unprocessed child node nd_child(w) to be the node nd(w) serving as the processing target (S1811), and performs the process S1805 described above and thereafter.
If no unprocessed child node nd_child(w) exists (S1810; NO), the join processing unit 32 judges whether there exists any unprocessed weight w′ in the root node (S1812). If the unprocessed weight w′ exists (S1812; YES), the join processing unit 32 performs the process S1801 described above and thereafter for the unprocessed weight. On the other hand, if no unprocessed weight w′ exists (S1812; NO), the join processing unit 32 terminates its process.
With the processes described above, the join processing unit 32 generates the active list for each of the nodes in the trie.
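The calculation of the local edit distance between nodes and one step of deriving an active list from the active list of the parent node (S1805 to S1808) can be sketched as follows. This is a simplified sketch under stated assumptions: each candidate carries the string formed along its path explicitly, the trie is given as a plain dictionary from node numbers to (child number, label) pairs, and the names Cand, local_edit_distance and derive_active_list are hypothetical. An actual implementation would likely compute the distances incrementally instead of recomputing the edit distance for every candidate. The node numbers in the usage lines are illustrative, and all weights are set to 0 to keep the example small.

```python
from typing import NamedTuple


def edit_distance(a, b):
    """Dynamic-programming edit distance (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def local_edit_distance(path_a, w_a, path_b, w_b):
    """Larger weight plus the edit distance between the two path strings."""
    return max(w_a, w_b) + edit_distance(path_a, path_b)


class Cand(NamedTuple):
    node: int    # node number
    weight: int  # weight attached to the candidate (w3 in the description)
    path: str    # string formed along the path to the node (assumed stored)


def derive_active_list(nd_path, w, parent_active, children, tau):
    """Build the active list of a node from the active list of its parent."""
    active = {}
    for an in parent_active:                                   # S1806, S1809
        kids = [Cand(c, an.weight, an.path + label)
                for c, label in children.get(an.node, [])]
        for cand in [an] + kids:                               # S1807, S1808
            led = local_edit_distance(nd_path, w, cand.path, cand.weight)
            if led <= tau:
                active[(cand.node, cand.weight)] = led
    return active


children = {0: [(1, "b"), (4, "c")], 1: [(2, "a")], 4: [(5, "a")]}
root_active = [Cand(0, 0, "a"), Cand(1, 0, "ab"), Cand(4, 0, "ac")]
active_of_1 = derive_active_list("ab", 0, root_active, children, tau=1)
```

With τ = 1, the node spelling "aca" is rejected because its local edit distance from "ab" is 2, while the nodes spelling "a", "ab", "ac" and "aba" are retained together with their distances.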
The join processing unit 32 identifies pairs of nodes including the node (3) and having the local edit distance less than or equal to the threshold value τ on the basis of the active list of the node (3) containing the tuple pointer. In the example illustrated in
In the fourth exemplary embodiment, the join processing device 30 maps the SIP tuples distributed from the data management device 20 to the trie. In the trie, the string length of the head portion string or the tail portion string contained in the SIP tuple is used as the weight, the root node having at least one child node is generated for each of the weights, and the tuple pointer together with the weight is attached to the edge node (leaf node).
The local edit distance between a node and another node selected on the basis of the active list of the parent node is calculated for each of the nodes of the structured trie, and the calculated local edit distance is compared with the threshold value τ. As a result, information on the other node having the local edit distance less than or equal to the threshold value τ is set in the active list of each of the nodes.
As described above, according to the fourth exemplary embodiment, by using the characteristics of the trie to select the target of the join process for each of the nodes on the basis of the active list of the parent node, it is possible to limit the number of targets that require calculation of the local edit distance, so that the amount of calculation of the join process can be reduced. Further, according to the fourth exemplary embodiment, the local edit distance is calculated using the tail portion string or the head portion string, whereby it is possible to reduce the processing cost as compared with the conventional method that requires calculation of the edit distance for all the join key strings.
Further, according to the fourth exemplary embodiment, the active list of each of the nodes contains information on another node having the local edit distance from each of the nodes less than or equal to the threshold value, and also contains its local edit distance. Thus, according to the fourth exemplary embodiment, by referring to the active list of each of the edge nodes, it is possible to rapidly acquire the local result tuples having the local edit distance less than or equal to the threshold value τ.
In the exemplary embodiments described above, (τ+1) pieces of SIP tuples are generated for one tuple. Then, the plural SIP tuples generated from the one tuple have the same head character of the tail portion string (same tail character of the head portion string), and hence, are possibly distributed to the same join processing device 30. For example, in the example illustrated in
Thus, for the SIP tuples having the same head character of the tail portion string, the SIP tuple generating unit 21 of the data management device 20 may generate only the SIP tuple having the minimum string length of the remaining head portion string. In this case, it is only necessary for the join processing device 30 to generate the other required SIP tuples on the basis of the received SIP tuple.
In the example illustrated in
With the configuration described above, it is possible to further reduce the communication cost in the system 1.
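This modification can be sketched as follows, assuming a SIP tuple is represented as (tail portion string, string length of the remaining head portion string, tuple pointer). The function names sip_tuples, minimal_per_head_char and restore are hypothetical, and the join key "abaa" with the tuple pointer "s1" is used only for illustration. The sending side keeps one SIP tuple per head character, and the receiving side regenerates the omitted ones from it.

```python
def sip_tuples(key, pointer, tau):
    """All SIP tuples of a join key: tail from the i-th character, i - 1, pointer."""
    return [(key[i:], i, pointer) for i in range(min(tau + 1, len(key)))]


def minimal_per_head_char(sips):
    """Sending side: keep only the SIP tuple with the minimum head length
    for each head character of the tail portion string."""
    best = {}
    for tail, plen, ptr in sips:
        if tail[0] not in best or plen < best[tail[0]][1]:
            best[tail[0]] = (tail, plen, ptr)
    return list(best.values())


def restore(sip, tau):
    """Receiving side: regenerate the omitted SIP tuples that share the
    head character of the received tail portion string."""
    tail, plen, ptr = sip
    out = [sip]
    for k in range(1, len(tail)):
        if plen + k > tau:          # head length may not exceed tau
            break
        if tail[k] == tail[0]:      # same head character, same destination
            out.append((tail[k:], plen + k, ptr))
    return out


full = sip_tuples("abaa", "s1", tau=2)   # tails "abaa", "baa", "aa"
sent = minimal_per_head_char(full)       # only "abaa" and "baa" are transmitted
restored = [t for s in sent for t in restore(s, tau=2)]
```

In this sketch two SIP tuples are transmitted instead of three, and the receiving side recovers the tail portion string "aa" with head length 2 from the received "abaa".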
Further, in the exemplary embodiments described above, the request controlling unit 11 of the system controlling device 10 acquires a processing request, then, in response to the execution instruction transmitted by the request controlling unit 11, the data management device 20 distributes the SIP tuples, and the join processing device 30 starts the process. As another configuration, it may be possible to employ a configuration in which, before the processing request is received, the data management device 20 distributes the SIP tuple satisfying a predetermined condition to the join processing device 30 in advance.
More specifically, by setting the edit distance threshold value τ to be the upper limit value max_τ, the data management device 20 generates the SIP tuples (1≦i≦max_τ+1) for the case where the threshold value τ is the upper limit value, and distributes the generated SIP tuples in advance. Upon receiving the processing request containing the threshold value τ, the system controlling device 10 issues the execution instruction only to the join processing device 30. Of all the SIP tuples distributed in advance, the join processing device 30 performs the join process using only the SIP tuples having the string length of the head portion string (or tail portion string) less than or equal to the threshold value τ.
With this configuration, the time required for performing the process of transmitting distribution data from the data management device 20 to the join processing device 30 is not contained in the period of time from a time when the processing request is received to a time when the join process result is generated. Thus, it is possible to reduce the time from reception of the processing request to generation of the join process result. This configuration is suitable for an online process in which a large number of processing requests are inputted within a short period of time.
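The advance distribution can be sketched as follows, again assuming a SIP tuple is represented as (tail portion string, string length of the head portion string, tuple pointer). The function names predistribute and select_for_request, the join key and the value of max_τ are assumptions chosen for illustration.

```python
def predistribute(key, pointer, max_tau):
    """SIP tuples generated in advance on the assumption that tau == max_tau."""
    return [(key[i:], i, pointer) for i in range(min(max_tau + 1, len(key)))]


def select_for_request(stored, tau):
    """On a request with threshold tau, use only the SIP tuples whose
    head portion string length is less than or equal to tau."""
    return [sip for sip in stored if sip[1] <= tau]


stored = predistribute("abcaa", "s3", max_tau=3)   # held by the join processing device
usable = select_for_request(stored, tau=1)         # request arrives with tau = 1
```

Here four SIP tuples are stored in advance, and a request with τ = 1 uses only the two whose head portion string length is 0 or 1.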
From among at least one piece of data (SIP tuple set) stored in at least one data management device 20 and specified on the basis of the processing request, the similarity join process performed in the exemplary embodiments described above and the modification example detects pairs of tuples whose edit distance between the strings in the join key attribute specified on the basis of the processing request satisfies the condition of the threshold value τ. However, the similarity join process according to the present invention is not limited to such a configuration, and is a concept that also encompasses a string similarity searching process.
In other words, the similarity join process according to the present invention may be a process of searching at least one item of data stored in at least one data management device 20 for a tuple whose edit distance from a query string, obtained for example through a processing request, satisfies the condition of the threshold value τ.
In this configuration, the data management device 20 retaining the data specified by the processing request generates and distributes the SIP tuple in a similar manner to the exemplary embodiments described above and the modification example. On the other hand, the data management device 20 that acquires the query string from the system controlling device 10 generates the SIP tuple concerning this query string, and distributes it. The join processing device 30 receiving the SIP tuple concerning the query string performs a join process similar to that in the exemplary embodiments described above and the modification example for the pair of the SIP tuple concerning the query string and another SIP tuple.
Further, the similarity join process according to the present invention is a concept that also encompasses a name-consolidating technique of detecting strings having the same meaning but expressed in slightly different ways.
Further, the string similarity join system 1 described in the exemplary embodiments and the modification example is applicable to a stock searching system for plural stores. With such a configuration, even if the name of product slightly differs according to shops, it is possible to detect the desired merchandise. Further, there is a case where the product number differs according to merchandise since color or size thereof is different although the type of the merchandise is the same. In such a case, with this configuration, it is possible to detect the merchandise with the same model but different colors or sizes.
It should be noted that, in the plural flowcharts used in the description above, plural steps (processes) are described in a sequential order. However, the order of the process steps performed in the exemplary embodiments is not limited to the order of the steps described. In the exemplary embodiments, the order of the process steps illustrated in the drawings may be exchanged, provided that the exchange does not impair the details of the processes. The above-described exemplary embodiments and the modification example may be combined, provided that the details thereof do not contradict each other.
Part or all of the exemplary embodiments and the modification example can be described in a manner illustrated in the Supplementary Notes below. However, the exemplary embodiments and the modification example are not limited to the descriptions below.
A join processing device that performs a similarity join process to plural tuples using an edit distance threshold value τ (positive integer), including
a join processing unit that excludes, from a target of edit distance calculation, a pair of tuples that does not have any common character in an end portion ranging from a head character or a tail character to the (τ+1)th character in a join key string in each of the plural tuples.
A data management device communicatively connected to plural join processing devices that each perform a similarity join process to plural tuples using an edit distance threshold value τ (positive integer), including:
a data storage unit that stores the plural tuples; and
a data distributing unit that determines a distribution destination of each of the tuples stored in the data storage unit to be a join processing device that processes each of the tuples from among the plural join processing devices in a manner such that each of the tuples is distributed to the distribution destination the same as that of another tuple containing, in an end portion ranging from a head character or tail character to a (τ+1)th character in a join key string thereof, at least one character that each of the tuples contains in the end portion in the join key string thereof, and is not distributed to a distribution destination the same as that of another tuple that does not contain any character common to that in the end portion in the join key string of each of the tuples.
A string similarity join system including at least one data management device and plural join processing devices that each perform a similarity join process to plural tuples stored in the data management device using an edit distance threshold value τ (positive integer),
the at least one data management device including:
the plural join processing devices each including:
The string similarity join system according to Supplemental Note 3,
the plural join processing devices each further including:
the join processing unit identifies a pair of tuples having the edit distance estimated to satisfy the condition of the edit distance threshold value τ on the basis of a comparison result between the edit distance estimation value calculated by the estimation value calculating unit and the edit distance threshold value τ.
The string similarity join system according to Supplemental Note 3,
the plural join processing devices each further including:
the join processing unit:
The string similarity join system according to Supplemental Note 5, in which
the join processing unit selects another node serving as a calculation target of the edit distance estimation value of each target node on the basis of information on the other node contained in the list of a parent node.
The string similarity join system according to Supplemental Note 5 or 6, in which
the trie structuring unit maps tail portion strings or head portion strings of plural key information tuples having the same tuple identifying data to one branch of the trie and at least one portion of the branch, and sets the weight value together with a node pointer for identifying the at least one portion of the one branch in the root node of the trie.
The string similarity join system according to any one of Supplemental Notes 4 to 7, further including a system controlling device that can communicate to the at least one data management device and the plural join processing devices, in which
the join processing unit of the join processing device generates, for each identified pair, a result tuple containing a pair of tuple identifying data and an edit distance estimation value, and transmits the generated result tuple to the system controlling device, and
the system controlling device includes a result generating unit that detects an overlapping result tuple containing a pair of the same tuple identifying data from among plural result tuples transmitted from the plural join processing devices, and deletes a result tuple other than the result tuple having the minimum edit distance estimation value from the detected overlapping result tuple, thereby determining a pair of tuples having an edit distance satisfying the condition of the edit distance threshold value τ.
The string similarity join system according to any one of Supplemental Notes 3 to 8, in which
the system controlling device further includes a request controlling unit that acquires a processing request containing the edit distance threshold value τ, and then, transmits an execution instruction for processing to each of the plural join processing devices, in which
the key information generating unit of the at least one data management device generates a temporary key information tuple for a join key string of each of the tuples on the assumption that the upper limit value of the edit distance threshold value τ determined in advance is the edit distance threshold value τ, and
the data distributing unit of the at least one data management device distributes the temporary key information tuple to the join processing device serving as a distribution destination before the system controlling device acquires the processing request.
A string similarity join method performed for plural tuples using an edit distance threshold value τ (positive integer), in which
at least one computer excludes, from a target of an edit distance calculation, a pair of tuples that does not have any common character in an end portion ranging from a head character or a tail character to the (τ+1)th character in a join key string of each of the tuples.
A string similarity join method performed for plural tuples using an edit distance threshold value τ (positive integer), the method being performed by at least one computer and including:
generating (τ+1) pieces of key information tuples containing a combination of a tail portion string ranging from a tail character to an i-th character (i is a positive integer less than or equal to (τ+1)) counted from a head character in a join key string of each of the tuples, a string length of the remaining head portion string, and tuple identifying data, or a combination of a head portion string ranging from the head character to an i-th character counted from the tail character, a string length of the remaining tail portion string, and the tuple identifying data;
determining a distribution destination of each of the key information tuples on the basis of the head character of the tail portion string or the tail character of the head portion string contained in each of the generated key information tuples; and
distributing, as data on each of the tuples, each of the key information tuples to each target computer determined to be the distribution destination of each of the key information tuples, in which
the target computer determined to be the distribution destination of each of the key information tuples receives the distributed key information tuples, and performs a similarity join process for each set of key information tuples having the head character of the tail portion string or the tail character of the head portion string common to each other from among the received key information tuples.
The string similarity join method according to Supplemental Note 11, in which
the target computer determined to be the distribution destination of each of the key information tuples further:
The string similarity join method according to Supplemental Note 11, in which
the target computer determined to be the distribution destination of each of the key information tuples further:
The string similarity join method according to Supplemental Note 13, in which
the edit distance estimation value is calculated in a manner such that another node serving as a calculation target for the edit distance estimation value for each target node is selected on the basis of information on another node contained in a list set for a parent node.
The string similarity join method according to Supplemental Note 13 or 14, in which
the trie is structured in a manner such that tail portion strings or head portion strings of plural key information tuples having the same tuple identifying data are mapped to one branch of the trie and at least one portion of the one branch, and the weight value and a node pointer for identifying the at least one portion of the one branch are set in the root node of the trie.
The string similarity join method according to any one of Supplemental Notes 12 to 15, in which
the target computer determined to be the distribution destination of each of the key information tuples further:
the other computer:
A program that causes at least one computer to perform a string similarity join to plural tuples using an edit distance threshold value τ (positive integer), in which
the program causes the at least one computer to realize a join processing unit that excludes, from a target of an edit distance calculation, a pair of tuples that does not have any common character in an end portion ranging from a head character or a tail character to a (τ+1)th character in a join key string of each of the tuples.
A program that causes at least one computer to perform a string similarity join to plural tuples using an edit distance threshold value τ (positive integer), in which
the program causes the at least one computer to realize:
the program causes the target computer determined to be the distribution destination of each of the key information tuples to realize:
The program according to Supplemental Note 18, which causes the target computer determined to be the distribution destination of each of the key information tuples to realize:
The program according to Supplemental Note 18, which causes the target computer determined to be the distribution destination of each of the key information tuples to realize:
a trie structuring unit that structures a trie having the tail portion string or the head portion string mapped to a branch extending from a root node to an edge node, the trie managing a string length of the head portion string or the tail portion string as a weight value on the basis of the plural key information tuples received by the receiving unit;
a join processing unit that:
A computer-readable storage medium that stores a program according to any one of Supplemental Notes 17 to 20.
The present application claims priority based on Japanese Patent Application No. 2011-020374 filed in Japan on Feb. 2, 2011, the disclosures of which are incorporated herein by reference in their entirety.
Number | Date | Country | Kind |
---|---|---|---|
2011020374 | Feb 2011 | JP | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/JP2011/006218 | 11/7/2011 | WO | 00 | 8/2/2013 |