Data records often contain errors. Two records may refer to a particular item in two different ways, for instance. Or two records may look different, but actually refer to one item. These errors can cause problems for people relying on these records. Assume that a company wants to send catalogs to all of its customers. Assume also that the company's database has two records for the same customer, like “Jane Doe, 123 W. American St., 90005” and “Jane T. doe, West 123 American Street, 90005”. If the company does not know that these two records refer to one customer, not two, it may send Jane Doe two catalogs.
Some current software techniques attempt to find these kinds of errors by comparing records using similarity functions. Current techniques might execute one similarity function on two records to determine whether the records are the same once white space and punctuation are removed from both records. Current techniques might then execute another similarity function on the same two records to determine whether the records are the same once both records are rendered in the same case. Current techniques might then execute another similarity function on the same two records to determine whether the records are the same once common word strings are truncated. For the above example, performing each of these similarity functions might result in the first record looking like “janedoe123wamericanst90005” and the second record looking much the same (truncating “West” to “w” and “Street” to “st”). These records may then be recognized as referring to the same entity.
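A minimal sketch of this kind of normalization is shown below. The abbreviation table and function name are illustrative assumptions for this example, not taken from any particular product.

```python
import re

# Hypothetical abbreviation map; real systems would use a much larger dictionary.
ABBREVIATIONS = {"west": "w", "street": "st"}

def normalize(record):
    """Lower-case, abbreviate common words, then strip spaces and punctuation."""
    words = re.split(r"[\s,\.]+", record.lower())
    words = [ABBREVIATIONS.get(w, w) for w in words]
    return "".join(words)

print(normalize("Jane Doe, 123 W. American St., 90005"))
# janedoe123wamericanst90005
```

With the middle initial and word order left aside, the two records above normalize to very similar strings, which is what allows them to be recognized as the same entity.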
System(s) and/or method(s) (“tools”) are described that enable actions to be reused that are common to multiple similarity functions. The tools may do so, in one embodiment, by composing similarity functions into a single, composed function that performs actions once that are common to multiple similarity functions. This composed function may also permit data to be analyzed in one pass and/or render unnecessary a merge operation. The tools may also enable actions to be reused when a similarity function is performed multiple times. The tools may do so, in one embodiment, by retaining a result of performing an action and using that result when performing the similarity function again.
The tools may also enable records to be compared using a flip-window algorithm. This algorithm may be an efficient way in which to compare records in a table to determine which of those records are similar or duplicates.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The same numbers are used throughout the disclosure and figures to reference like components and features.
Overview
The following document describes tools that enable, in some embodiments, actions to be reused that are common to multiple similarity functions or can be performed multiple times by the same similarity function. The tools may, in one embodiment, compose similarity functions into a single, composed function comprising actions of multiple similarity functions. The tools may also, in another embodiment, retain a result of performing an action to use that result when re-performing a same similarity function. The tools may also, in still another embodiment, compare records in a table using a flip-window algorithm.
An environment in which these tools may enable these and other techniques is set forth first below. This is followed by other sections describing various inventive techniques and exemplary embodiments of the tools. One, entitled Composing and/or Executing Actions of Similarity Functions, describes an exemplary process for composing and executing actions of similarity functions, which may permit actions to be reused. Another, entitled Flip-Window Algorithm, describes an exemplary process enabling comparison of records in a table, which may reduce how many record pairs are analyzed.
Exemplary Operating Environment
Before describing the tools in detail, the following discussion of an exemplary operating environment is provided to assist the reader in understanding one way in which various inventive aspects of the tools may be employed. The environment described below constitutes but one example and is not intended to limit application of the tools to any one particular operating environment. Other environments may be used without departing from the spirit and scope of the claimed subject matter.
The platform's processors are capable of accessing and/or executing the computer-readable media. The computer-readable media comprises or has access to a composition module 110, similarity functions 112, constituent actions 114, composed function 116, similarity module 118, dirty records 120, and cache 122.
Each similarity function is capable of determining a similarity between data records or parts of data records (e.g., records of dirty records 120). To do so, the similarity functions may comprise one or more constituent actions 114. These constituent actions may be used, in some embodiments, to build the similarity functions, such as responsive to selection by a user. Some of these actions may also be customized, and thus similarity functions may be made extensible to provide additional functionality. Particular industries, such as the pharmaceutical industry, may have particular needs and peculiarities for data. Most industries may need similarity functions that can determine that two words with different cases are similar if they have the same characters, e.g., that “help” is similar to “Help” and “HELP”. But data may have peculiarities in an industry, such as in the pharmaceutical industry where “20 mg” should be considered similar to “0.02 g”. These actions may therefore enable custom identification of industry-specific data similarities by alteration or selection of a particular action.
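A custom, industry-specific constituent action of this kind might be sketched as follows. The function names and unit table are illustrative assumptions; the point is that converting to a common base unit lets “20 mg” and “0.02 g” compare as similar.

```python
import math

# Illustrative unit table for a hypothetical pharmaceutical action.
UNIT_TO_MG = {"mg": 1.0, "g": 1000.0, "kg": 1000000.0}

def to_milligrams(text):
    """Parse a dosage string such as '20 mg' into a milligram value."""
    value, unit = text.split()
    return float(value) * UNIT_TO_MG[unit.lower()]

def dosage_similar(a, b):
    # Compare in a common base unit; isclose guards against float rounding.
    return math.isclose(to_milligrams(a), to_milligrams(b))

print(dosage_similar("20 mg", "0.02 g"))  # True
```

Swapping in or altering an action like this one is what gives a similarity function its industry-specific behavior.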
These actions may also perform operations useful to multiple similarity functions, such as two similarity functions that require tokenization. Having similarity functions that comprise a same action where that same action is separately executable may enable same actions to be reused (e.g., performed once rather than multiple times) when executing multiple different similarity functions.
Constituent actions 114 are shown with actions comprised by the exemplary similarity functions, here: tokenize action 208; capitalization comparer action 210; transposed character comparer action 212; transposition action 214; text comparer action 216; and white space removal action 218.
Returning to
Similarity module 118 is capable of executing the similarity functions, actions, and/or composed function to determine similarities between data records. The similarity module may do so according to various algorithms, such as a sliding window algorithm or a flip-window algorithm (set forth in greater detail below).
Dirty records 120 comprise data records to be analyzed for similarities. These records may be received from data warehouse 108 in a table or other type of format. Data warehouse 108 may be ERP-dependent or independent. Cache 122 is capable of storing results of various actions, such as tokenized data resulting from tokenize action 208, for later use or storage.
Composing and/or Executing Actions of Similarity Functions
Block 302 receives similarity functions comprising actions. One or more of these similarity functions may comprise a same action or they may all comprise different actions. Each of the similarity functions may also produce results that may be merged with a post-performance merge operation into a single result. These similarity functions may be those selected or altered by a user, such as with an industry-specific similarity function (or constituent action) capable of determining that “20 mg” is similar to “0.02 g”. In so doing, the tools enable fine-grain control of what is and is not deemed similar, here with a logical primitive deeming “20 mg” a duplicate of “0.02 g”. In an exemplary embodiment, composition module 110 receives the three similarity functions 202, 204, and 206 shown in
Block 304 composes similarity functions. Block 304 may produce a single, composed function capable of producing a same result as separate performance of each of the similarity functions and merging of the results from each. Block 304 may compose these similarity functions by determining which actions are comprised by the similarity functions and then ordering those actions into a single function. In some cases one or more of the actions of the similarity functions will be the same. The extra, redundant actions may then be excluded from the composed function. If this is done, the composed function may require fewer resources to perform a same result as performance of each of the similarity functions of which the composed function is a composition. The composed function, in effect, reuses actions that are redundant by performing the redundant action once and retaining the result for future input or output to other actions.
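One way to picture block 304 is sketched below, under the assumption (made for illustration only) that each similarity function is represented as an ordered list of named actions: composition concatenates the lists and drops actions already present, so a shared action such as tokenization appears only once in the composed function.

```python
def compose(*similarity_functions):
    """Compose similarity functions, keeping each distinct action once."""
    composed, seen = [], set()
    for function in similarity_functions:
        for name, action in function:
            if name not in seen:            # exclude redundant actions
                seen.add(name)
                composed.append((name, action))
    return composed

# Two toy similarity functions that share the tokenize action.
tokenize = ("tokenize", str.split)
lowercase = ("lowercase", lambda tokens: [t.lower() for t in tokens])
strip = ("strip", lambda tokens: [t.strip(".,") for t in tokens])

capitalization_fn = [tokenize, lowercase]
punctuation_fn = [tokenize, strip]

composed = compose(capitalization_fn, punctuation_fn)
print([name for name, _ in composed])  # ['tokenize', 'lowercase', 'strip']
```

The composed list performs the shared tokenize action once, with its result feeding both remaining actions.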
Also, the composed function may be performed with one pass over the data. Multiple passes over data may take more resources than one pass, which permits the composed function to require fewer resources (in some cases) than the multiple similarity functions. This composed function may be capable of being performed without need of a merge function to merge results of different similarity functions.
Here composition module 110 determines the actions comprised by similarity functions 202, 204, and 206. The constituent actions of these three functions are shown in
Block 306 executes a composed function of two or more similarity functions. The tools may perform the composed function in one pass, thereby not needing to separately merge results from two or more similarity functions and not having to touch the data multiple times. Here performing each of the three similarity functions received would result in three sets of results that may then be merged in a separate operation. Similarity module 118 may execute the composed function without needing to merge results from multiple similarity functions.
Manners in which the actions of the composed function may be executed are described in greater detail with subblocks shown internal to block 306. These subblocks may be effective to perform block 306 as described above or may instead be an alternative to block 306.
Subblock 306a executes an action. This action may be part of or have been a part of a similarity function. See, for example,
Execution of these similarity functions through their constituent actions is described using exemplary data records shown in
Similarity module 118 executes the tokenize action 208 on the first and second data record. In doing so, it executes the first action of composition function 402 of
Subblock 306b retains the result of executing the action. The similarity module can retain the result of this and other actions for later use as input or output to other actions or that output a final result. Here the similarity module retains 602T and 604T in cache 122.
Subblock 306c retrieves the result. This result is used for at least one other action of the composed function or of one or more similarity functions. The result can be used to enable execution of multiple similarity functions or another use of the same similarity function.
Similarity module 118 next executes capitalization comparer 210 by setting all capitalizations to lower case. The results are shown at 602C and 604C in
Subblock 306d executes actions of another similarity function without having to re-execute a previously-performed action. Thus, performance of the tokenize action once is effective for use in a second (and later a third or other) similarity function.
Next, the similarity module executes transposed character comparer action 212 to find transposed characters. The results are identical to 602C and 604C, as no transpositions are found. Likewise, the results of executing transposition action 214 look like 602C and 604C, as no characters are identified as needing to be transposed. Next it executes white space removal action 218. While difficult to see, this action removes a space in front of tokenized “soft” from the second record. These results are shown at 602S and 604S. Next it executes text comparer action 216. The results indicate that two tokens from each record are the same: here “Pro” matches “Pro” and “XP” matches “XP”. By so doing, the first and second records are shown to be similar. Similarity module 118 caches the results of each action performed at 306a, 306d, and 306e in cache 122.
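A text comparer step like action 216 might be sketched as follows; the threshold and function shape are assumptions made for illustration, and the tokens stand in for the normalized results described above.

```python
def text_comparer(tokens_a, tokens_b, threshold=2):
    """Count tokens shared by two normalized token lists; records are
    deemed similar when enough tokens match."""
    shared = set(tokens_a) & set(tokens_b)
    return len(shared) >= threshold, shared

# Tokens stand in for the normalized first and second records.
similar, shared = text_comparer(["microsoft", "xp", "pro"],
                                ["ms", "xp", "pro", "soft"])
print(similar, sorted(shared))  # True ['pro', 'xp']
```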
The results of a performed action may also be retained and used for the same similarity function (here capitalization function 202) when used on a same set of data.
Subblock 306e executes actions of a same similarity function without having to re-execute the first action on data that the action has already been executed on. The tools enable execution of the same capitalization function over the first record and some other record without executing the tokenize action on the first record again. The similarity module is attempting to determine if the first record is also similar to the third record. The similarity module retrieves the cached 602T (tokenized data of record 602 in row 1), and any other same actions performed on the same data (capitalized data 602C and transposition character comparer and transposition 602TT). Thus, the similarity module does not have to perform the tokenize action again for the first record.
Note also that, if the similarity module is attempting to determine similarities between the second and third record, actions performed above may be reused for both of the records (e.g., tokenized data 604T and 606T).
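The retain-and-retrieve behavior of subblocks 306b, 306c, and 306e can be sketched as simple memoization keyed on the action and the record. This is a hedged illustration; the actual structure of cache 122 is not specified here.

```python
cache = {}

def run_action(name, action, record):
    """Execute an action on a record, reusing any cached result."""
    key = (name, record)
    if key not in cache:
        cache[key] = action(record)
    return cache[key]

calls = []
def tokenize(record):
    calls.append(record)            # track how often the action really runs
    return record.lower().split()

run_action("tokenize", tokenize, "Microsoft Windows XP Pro")
run_action("tokenize", tokenize, "Microsoft Windows XP Pro")  # served from cache
print(len(calls))  # 1
```

Comparing the first record against each later record thus tokenizes the first record only once, however many comparisons follow.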
Each of subblocks 306a, b, c, d, and e may be performed again. Here the similarity module continues through the five records and determines that the records 602, 604, 608, and 610 (in rows 1, 2, 4, and 5) are similar. It may then create a record showing canonicals for each of the similar records (e.g., a better identifier for that software: “Microsoft® Windows™ XY Professional”).
Flip-Window Algorithm
Block 802 receives a table having records. The table has many rows of records, each of which has one or more columns of data, such as dirty records 120 of
Block 804 partitions the table into windows. The number of windows will depend on the size of the windows and the table. If all of the windows (except usually the last window) are the same size, such as 50 records, the number of windows may be set equal to the number of records in the table divided by the number of records in the windows and rounded up to a nearest integer. Thus, if the table has 1005 records and the windows are 50 records (except the last one), then the number of windows is 1005/50=20.1, which is rounded up to 21. Thus, the first 20 windows have 50 records and the last one has five.
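The window count described above is simply a ceiling division, for example:

```python
import math

def window_count(num_records, window_size):
    # 1005 records in 50-record windows -> 20.1, rounded up to 21 windows.
    return math.ceil(num_records / window_size)

print(window_count(1005, 50))  # 21
```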
In an illustrated embodiment shown in
Block 806 compares records within a particular window to determine if any records in that window are similar or duplicates. Block 806 may do so using one or more similarity functions or actions or a composed function. It may also do so as set forth for block 306 or subblocks 306a, 306b, 306c, 306d and/or 306e. Block 806 may also compare records of a particular window with records from another window that were found to be duplicates. These windows may be adjoining in the table or performed in order but not adjoining, or otherwise.
For first window 902, similarity module 118 determines which of the records in the first 10-record window are likely duplicates, here records in rows 1, 2, 4, 8, and 10 are likely duplicates with each other, as are rows 3 and 7 with each other. The similarity module determines which are likely duplicates by comparing the first record with records 2-10, then the second record with records 3-10, then the third record with records 4-10, and so forth. It may also forgo comparing a particular record with the rest of the records if it has already been shown to be a duplicate. Thus, if record 1 and 2 are found to be duplicates, the similarity module may forgo comparing record 2 with records 3-10. In this example, then, similarity module 118 compares 1 with 2 and marks 1 and 2 as duplicates, then 1 with 3, marks 3 as not a duplicate of 1, then 1 with 4, and marks 4 as a duplicate of 1, then 1 with 5-7 and marks each as not a duplicate of 1, then 1 with 8 and marks it as a duplicate of 1, then 1 with 9 and marks it as not a duplicate of 1, and then 1 with 10 and marks it as a duplicate of 1. Because 2, 4, 8, and 10 are marked as potential duplicates of 1, the similarity module may proceed to compare record 3 with just 5, 6, 7, and 9. The similarity module marks 7 as a likely duplicate of 3 and then proceeds to compare 5 with 6 and 9 and then 6 with 9.
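This comparison order can be sketched as follows (a simplified illustration; `similar` stands in for whatever similarity function or composed function is in use, and the letters stand in for record contents, with rows 2, 4, 8, and 10 duplicating row 1 and row 7 duplicating row 3).

```python
def compare_window(records, similar):
    """Compare each record with the records after it, skipping any record
    already marked as a duplicate (0-based indices)."""
    duplicate_of = {}                       # index -> index of matched record
    for i in range(len(records)):
        if i in duplicate_of:
            continue                        # already a known duplicate; skip
        for j in range(i + 1, len(records)):
            if j in duplicate_of:
                continue
            if similar(records[i], records[j]):
                duplicate_of[j] = i
    return duplicate_of

rows = ["a", "a", "b", "a", "c", "d", "b", "a", "e", "a"]
print(compare_window(rows, lambda x, y: x == y))
# {1: 0, 3: 0, 7: 0, 9: 0, 6: 2}
```

The skipping mirrors the walkthrough above: once rows 2, 4, 8, and 10 are marked as duplicates of row 1, row 3 is compared only with rows 5, 6, 7, and 9.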
Block 808 sets or determines a canonical for duplicate records. Here the similarity module sets row 1 as a canonical for rows 1, 2, 4, 8, and 10 and 3 for rows 3 and 7. A canonical may be the best manner in which to describe data or be one of the records that have been analyzed. Determining a canonical may be performed in manners well-known in the art.
Blocks 806 and 808 may be repeated. Block 806, for instance, may be repeated for each window of the table. But block 806 may compare more records than just those of each window. As mentioned above, the similarity module may compare records of a window with other records found to have duplicates, such as a canonical for each set of duplicate records found in an immediately prior window.
For example, assume that the similarity module starts with a window of 10 records, window 904 of
Here comparing the second window and prior duplicates generates the following sets of duplicates: 1, 14, and 18; 3 and 13; and 15 and 17. Thus, the second window produced three sets of duplicates, two of which have a record from the prior window.
This continues, such that canonicals are set as rows 1, 13, and 17, and are then analyzed along with records 21-30 from the third window 906. The result of analyzing this window provides one set of duplicates: 17 and 28. Thus, if another window of records (e.g., rows 31-40, not shown) were to be analyzed, only those rows and the immediately prior duplicate (here either 17 or 28) would be analyzed with rows 31-40.
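The overall flip-window flow, carrying canonicals of each window's duplicate sets into the next window, might be sketched as below. Grouping by exact equality here is a stand-in for real similarity analysis, and the structure is an assumption for illustration.

```python
def flip_window(records, window_size, find_duplicate_sets, pick_canonical):
    """Analyze each window together with canonicals carried from the prior window."""
    carried = []                                  # canonicals from prior window
    all_sets = []
    for start in range(0, len(records), window_size):
        window = records[start:start + window_size] + carried
        duplicate_sets = find_duplicate_sets(window)
        all_sets.extend(duplicate_sets)
        carried = [pick_canonical(s) for s in duplicate_sets]
    return all_sets

def find_duplicate_sets(window):
    # Stand-in for similarity analysis: group exactly-equal records.
    groups = {}
    for record in window:
        groups.setdefault(record, []).append(record)
    return [g for g in groups.values() if len(g) > 1]

# "a" recurs in the second window and is caught via the carried canonical.
sets_found = flip_window(["a", "b", "a", "c", "a", "d"], 3,
                         find_duplicate_sets, lambda s: s[0])
print(sets_found)  # [['a', 'a'], ['a', 'a']]
```

Only the prior window's canonicals travel forward, which is what keeps later windows small compared with re-analyzing the whole table.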
Thus, the total number of times record pairs are analyzed in this embodiment is dependent on the number of duplicates found. Assume, for one case, that all of the records of a first window are duplicates. Block 806 compares the first record of the first window to the second through the last record of the first window. The second and later records do not need to be compared with each other because they are duplicates. Thus, 9 record pairs are analyzed in the first window. The second window has 10 records plus one canonical from the first window, and thus is 11 records long. If all of these are also duplicates with themselves but not the record of the first window, only 10 record pairs are analyzed. For the third flip-window, 10 analyses again would be needed if all of the records are duplicates of themselves but not the record from the prior window. In this case, the similarity module analyzes 29 record pairs (9+10+10).
Assume, in another case, that none of the records in the 30-record table are found to be duplicates. Here the similarity module may then compare each record of each window with each other record. This results, for each window of 10 records, in the following number of analyzed record pairs:
9+8+7+6+5+4+3+2+1=45.
This may also be represented as 9#, where n# denotes the sum n+(n-1)+ . . . +1. For all three iterations, this would result in analysis of 135 record pairs (3*45).
In another case, assume that all of each window's records have a single duplicate. Thus, for a window size of 10, the first window has 5 pairs of duplicates, which can be set to 5 canonicals for each window. The number of analyzed record pairs may be, if 1-5 are duplicates of each of 6-10: 1 with 2-10 for 9 pairs, 2 with 3-10 for 8 pairs, 3 with 4-10 for 7 pairs, 4 with 5-10 for 6 pairs, and 5 with 6-10 for 5 pairs. As 6-10 are duplicates of 1-5, respectively, the similarity module may forgo comparing 6 through 10 with each other. The results of this would be 9#−5#, or 45−15=30. For the next window if we assume the same, we have an initial window of 10 plus 5 canonicals for 15 records. If none of the next window's records are duplicates of the canonicals but are of themselves, then the number of record pairs analyzed would be 14#−5#, or 105−15=90. The third window, if like the second and not matching canonicals from the second window, would also have 90 analyzed pairs. The total for this example is 210 record pairs compared.
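Since n# is the triangular number n(n+1)/2, the window totals above can be checked directly:

```python
def tri(n):
    # n# in the text: n + (n - 1) + ... + 1
    return n * (n + 1) // 2

# No duplicates at all: every pair in each of three 10-record windows.
print(3 * tri(9))                                    # 135
# Five duplicate pairs per window; later windows carry 5 canonicals (15 records).
print((tri(9) - tri(5)) + 2 * (tri(14) - tri(5)))    # 210
```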
A sliding window algorithm, for the above cases, however, may require a number of analyzed record pairs sufficient to compare every record in each window with each other, multiplied by the number of windows. Thus, for a window size of 10 records and 30 total records, the sliding window algorithm may require 290 analyzed record pairs.
Process 800 may be used in conjunction with parts of process 300, such that analyzing a record a second or later time requires fewer resources. If record 1 is compared with record 2, results of certain actions may be reused when analyzing record 1 against records 3-10. Similarly, analyzing record 2 against 3-10 may reuse certain actions performed when record 1 was compared with record 2. This may result in faster analysis and/or fewer resources needed to analyze records for similarities.
The above-described systems and methods may enable actions to be reused that are common to multiple similarity functions or can be performed multiple times by the same similarity function. These systems and methods may also compose similarity functions into a composed function that enables reuse of actions and permits comparison of records in one pass and/or without needing a merge operation. The number of record pairs analyzed may also be reduced using a flip-window algorithm. Any one of these many techniques may enable records to be cleansed in less time and/or with fewer resources. Although the system and method has been described in language specific to structural features and/or methodological acts, it is to be understood that the system and method defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claimed system and method.